Grounding Topic Models with Knowledge Bases

Grounding Topic Models with Knowledge Bases
Zhiting Hu (1, *), Gang Luo (2), Mrinmaya Sachan (1), Eric Xing (1), Zaiqing Nie (3)
(1) Carnegie Mellon University; (2) Microsoft, California, US; (3) Microsoft Research, Beijing, China
(*) This work was done when the first two authors were at Microsoft Research, Beijing.

Background: Topic Modeling
Topic modeling represents latent topics as probability distributions over words, e.g., LDA (latent Dirichlet allocation) [Blei et al., 2003].
Such topics are:
  hard to interpret due to incoherence
  lacking in background context
  without grounded semantics
Previous work incorporates external knowledge:
  improves coherence, but topics are still word distributions
  imposes a one-to-one binding of topics to pre-defined knowledge base (KB) entities, which sacrifices flexibility

Overview: This Work
A structured topic representation based on the entity taxonomy from KBs
  grounded semantics
  improved coherence: captures entity correlations encoded in the taxonomy
  example topic: "Death of Whitney Houston"
A probabilistic model to infer both hidden topics and entities from text corpora

Method: Document Modeling
Augments bag-of-words documents with entity mentions
  mentions carry the salient semantics of a document
  example: words {co-founder, wealthiest, man, ...}, mentions {Gates, Microsoft, ...}
Generative process:
  each mention <- an entity and a topic
  each word <- an index indicating which mention it describes
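The generative process above can be sketched in a few lines of Python. This is only an illustrative toy under stated assumptions, not the authors' implementation: the function `generate_document`, the toy tables `TOPICS`, `ENTITY_MENTIONS` and `ENTITY_WORDS`, and the uniform choice of topic and mention index are all hypothetical simplifications.

```python
import random

# Toy parameters (hypothetical, for illustration only).
TOPICS = {  # topic -> weights over entities
    "tech": {"Bill Gates": 0.6, "Microsoft Inc.": 0.4},
}
ENTITY_MENTIONS = {  # entity -> weights over mention strings
    "Bill Gates": {"Gates": 1.0},
    "Microsoft Inc.": {"Microsoft": 0.7, "MS": 0.3},
}
ENTITY_WORDS = {  # entity -> weights over descriptive words
    "Bill Gates": {"co-founder": 0.5, "wealthiest": 0.3, "man": 0.2},
    "Microsoft Inc.": {"software": 0.6, "company": 0.4},
}


def draw(rng, weights):
    """Draw one key of a {item: weight} dict in proportion to its weight."""
    items, probs = zip(*weights.items())
    return rng.choices(items, weights=probs)[0]


def generate_document(topics, entity_mentions, entity_words,
                      n_mentions, n_words, seed=0):
    rng = random.Random(seed)
    mentions = []
    for _ in range(n_mentions):
        # Assumption: the topic is drawn uniformly; a full model would use
        # a per-document topic distribution instead.
        topic = rng.choice(sorted(topics))
        entity = draw(rng, topics[topic])             # mention <- entity and topic
        mention = draw(rng, entity_mentions[entity])  # surface form of the entity
        mentions.append((topic, entity, mention))
    words = []
    for _ in range(n_words):
        # Each word first picks the index of the mention it describes
        # (assumed uniform here), then a word from that mention's entity.
        idx = rng.randrange(len(mentions))
        entity = mentions[idx][1]
        words.append((idx, draw(rng, entity_words[entity])))
    return mentions, words


mentions, words = generate_document(
    TOPICS, ENTITY_MENTIONS, ENTITY_WORDS, n_mentions=3, n_words=5)
```

Each generated word carries the index of the mention it describes, which is what ties the bag of words to the entities in the document.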

Method: Topic as a Random Walk on the Taxonomy
Entity taxonomy:
  leaves: entities
  internal nodes: categories
Each topic is a root-to-leaf random walk:
  a set of parent-to-child transition probabilities -> entity/category weights
  path-sharing: encourages clustering correlated entities into the same topic
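As a rough illustration of how parent-to-child transition probabilities turn into entity weights, the sketch below multiplies the transitions along each root-to-leaf path. It assumes the taxonomy is a tree (a real category hierarchy may not be), and the names `leaf_probabilities`, `CHILDREN` and `TRANSITIONS` and all the numbers are hypothetical.

```python
def leaf_probabilities(children, transitions, root):
    """Return {leaf: probability that a root-to-leaf walk ends at that leaf}.

    children:    {node: [child, ...]}; nodes without an entry are leaves.
    transitions: {(parent, child): probability}, summing to 1 per parent.
    Assumes the taxonomy is a tree, so each leaf has exactly one path.
    """
    probs = {}
    stack = [(root, 1.0)]
    while stack:
        node, p = stack.pop()
        kids = children.get(node, [])
        if not kids:
            probs[node] = p  # leaf: an entity
            continue
        for kid in kids:
            # Walk one step down: multiply by the transition probability.
            stack.append((kid, p * transitions[(node, kid)]))
    return probs


# Toy taxonomy: two categories under the root, three entities as leaves.
CHILDREN = {
    "root": ["People", "Companies"],
    "People": ["Bill Gates", "Whitney Houston"],
    "Companies": ["Microsoft Inc."],
}
# One topic = one set of parent-to-child transition probabilities.
TRANSITIONS = {
    ("root", "People"): 0.7,
    ("root", "Companies"): 0.3,
    ("People", "Bill Gates"): 0.5,
    ("People", "Whitney Houston"): 0.5,
    ("Companies", "Microsoft Inc."): 1.0,
}

entity_weights = leaf_probabilities(CHILDREN, TRANSITIONS, "root")
```

In this toy, both entities under "People" share the root-to-"People" step, so raising that one transition raises both of their weights together; this is one way to read the path-sharing point above.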

Method: Entity Modeling
Each entity has:
  a distribution over mentions, capturing relatedness between the entity and mentions (e.g., Microsoft Inc.: MS, Gates)
  a distribution over words, characterizing the entity's attributes (e.g., Bill Gates: wealthiest)
  an informative prior from the KB: mention/word frequencies on the entity page
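One plausible way to turn entity-page frequencies into an informative prior is to use scaled relative frequencies as Dirichlet-style pseudo-counts. The slide does not give the exact form, so the sketch below is an assumption throughout: `frequency_prior`, `strength` and `smoothing` are hypothetical names and values, and the token list stands in for the text of an entity page.

```python
from collections import Counter


def frequency_prior(page_tokens, vocabulary, strength=10.0, smoothing=0.01):
    """Build pseudo-counts over `vocabulary` from tokens on an entity page.

    Each term gets `smoothing` plus `strength` times its relative frequency
    on the page, so frequent terms receive more prior mass while unseen
    terms keep a small non-zero value.
    """
    vocab = set(vocabulary)
    counts = Counter(t for t in page_tokens if t in vocab)
    total = sum(counts.values())
    prior = {}
    for term in vocabulary:
        relative = counts[term] / total if total else 0.0
        prior[term] = smoothing + strength * relative
    return prior


# Toy stand-in for the words on the "Bill Gates" entity page.
page = ["wealthiest", "co-founder", "wealthiest", "man"]
vocab = ["wealthiest", "co-founder", "man", "singer"]
prior = frequency_prior(page, vocab)
```

The same function could be applied to mention strings instead of words to obtain a prior for the entity's mention distribution.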

Method: Graphical Model Representation
Latent Grounded Semantic Analysis (LGSA)

Experiments: Setup
Knowledge base: Wikipedia (e.g., https://en.wikipedia.org/wiki/microsoft)
  entities: Wikipedia pages
  taxonomy: entity category hierarchy
Datasets:
  TMZ (tmz.com): celebrity gossip news with celebrity labels, #doc ~= 30K
  New York Times news (LDC), #doc ~= 330K
Baselines

Experiments: Topic Perplexity
[Figure: perplexity on the TMZ dataset; perplexity on the NYT dataset]
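For reference, perplexity is commonly defined as the exponential of the negative average per-token log-likelihood on held-out text; lower is better. The sketch below uses that standard definition. How each model in the comparison assigns token probabilities is not shown in this transcript, so the input list here is a hypothetical placeholder.

```python
import math


def perplexity(token_log_probs):
    """exp(-(1/N) * sum of natural-log probabilities of N held-out tokens)."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# A model that gives every token probability 1/4 is as uncertain as a
# uniform choice among 4 words.
uniform_ppl = perplexity([math.log(0.25)] * 8)
```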

Experiments: Key Entity Identification
Key entity of a document
  e.g., the persons a news article is mainly about
TMZ dataset: ground truth (celebrity label) available
LGSA: θ_d, the distribution over entities for document d
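Given a document's entity distribution θ_d, a natural way to pick key entities is to take the highest-probability ones. The slide only states that LGSA uses θ_d, so ranking by it is an assumption here; `key_entities` and the toy distribution are hypothetical.

```python
def key_entities(theta_d, k=1):
    """Return the k entities with the highest probability under theta_d.

    theta_d: {entity: probability} for one document.
    Ties are broken alphabetically so the result is deterministic.
    """
    ranked = sorted(theta_d.items(), key=lambda item: (-item[1], item[0]))
    return [entity for entity, _ in ranked[:k]]


# Toy entity distribution for one document (made-up numbers).
theta_d = {"Kim Kardashian": 0.55, "Kris Humphries": 0.35, "NBA": 0.10}
top1 = key_entities(theta_d, k=1)
top2 = key_entities(theta_d, k=2)
```

On the TMZ data, predictions of this kind could be compared against the celebrity labels.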

Experiments: Example Topics
  Sports
  Kardashian and Humphries Divorce

Conclusion
  Traditional word-based topic representation lacks interpretability and grounded semantics
  A structured topic representation based on the entity taxonomy from KBs
  A probabilistic model (LGSA) to infer latent grounded topics
  Improved performance on topic perplexity and key entity identification

Thanks.