Topic Model Evaluation: How much does it help?

Topic Model Tutorial at WebSci2016
Topic Model Evaluation: How much does it help?
Laura Dietz, laura.dietz@unh.edu
Laura Dietz, Universität Mannheim. Topic Model Evaluation: How much does it help? @WebSci2016. Slide 1

Why is this important?
Topic models are computationally demanding to train.
  Is the effort worth it?
  Isn't there a simpler or faster method that is as good?
For multi-component systems: how much do the topics add to the total performance?
How to choose K and the hyperparameters?
How to quantify success? Empirical evaluation.

Do you believe in topic models?
Well, we all know that it simply works.
  Compared to placebo?
It provides pretty pictures, therefore it must work.
  I WANT TO BELIEVE
  Different data / task?
A scientific study showed it works, no need to test it ever again.

Outline
What is *not* an evaluation?
Intrinsic evaluation
  Through held-out log likelihood / perplexity
  With human-in-the-loop (word intrusion)
Extrinsic evaluation
  Through classification test data
  Task-specific metric
What to compare to? Baselines

THIS IS *NOT* AN EVALUATION
(Banana picture licensed under Creative Commons by-nc-sa by Viktor Hertz)

Looking at Word Clouds
Which of these topics is better?
Common: some correct topics, but many split or merged topics.
Did the author hand-pick the correct topics?

Looking at Highlighted Documents
Which of these highlights is better?
[The same text shown twice with different topic highlighting: "Soccer ball goal referee soccer foot coach scandal finances news team fire coach president"]
Humans prefer long consecutive segments with the same topic.
Is the segment topically coherent? How about relative clauses?

Looking at Colored Graphs
Which of these graph colorations is better?
Is color correlating with what we think it does?
What if I told you I assigned random colors? (= placebo)

Danger of Human Nature
We want it to work.
We over-interpret the story told by the visualization.
We corroborate a narrative that fits the results.
(Image licensed under Creative Commons by-sa by JussiClone)

Objective Measure
Goal: quantify what quality means!
Issues:
  No gold standard data available.
  Vague definition of topic.
  Multiple correct answers.
  Uncertain inference algorithms.

Designing a User Interface Study
Claim: topic model visualizations help users perform a task better.
Run a randomized trial with human participants.
Compare to participants who get placebo visualizations.
First study: make sure you design the right thing.
Second study: design it in the right way.
Do not assume your assumptions to be true!
Details are out of scope.

Intrinsic Evaluation
HELD-OUT LIKELIHOOD & PERPLEXITY

Held-out Log Likelihood / Perplexity
For the words in the test document, what is their probability under the (pre-trained) topic model?
Train: estimate $\theta, \phi$. Test: compute $\log p(w_{\text{test}} \mid \theta, \phi)$.
How to get $\theta$? It depends on the document!
Lower perplexity (equivalently, higher held-out log likelihood) is better: the lower the perplexity, the better the model captures patterns of natural language.

Variant: Document Completion
For the actual words in the test document, what is their probability under the (pre-trained) topic model?
Train: estimate $\phi$. Test: estimate $\theta$ from the observed part of the test document, $p(\theta \mid w_{\text{observed}})$, then score the remaining words with $\log p(w_{\text{remaining}} \mid \theta, \phi)$.
Lower perplexity (equivalently, higher held-out log likelihood) is better: the lower the perplexity, the better the model captures patterns of natural language.
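The document completion idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the topic-word probabilities `PHI` are made-up toy values, the document is split in the middle, and the topic proportions are estimated with a simple EM fold-in, which is a stand-in for whatever inference method the trained model actually uses (Wallach et al. 2009 discuss better estimators).

```python
import math

def fold_in_theta(words, phi, iters=50):
    """Estimate topic proportions for `words` with phi held fixed (simple EM).

    Assumes `words` is non-empty and every word has non-zero probability
    in at least one topic.
    """
    num_topics = len(phi)
    theta = [1.0 / num_topics] * num_topics
    for _ in range(iters):
        expected = [0.0] * num_topics
        for w in words:
            # Responsibility of each topic for this word occurrence.
            joint = [theta[k] * phi[k][w] for k in range(num_topics)]
            norm = sum(joint)
            for k in range(num_topics):
                expected[k] += joint[k] / norm
        theta = [e / len(words) for e in expected]
    return theta

def completion_log_likelihood(doc, phi):
    """Estimate theta on the first half, score the second half (needs >= 2 words)."""
    half = len(doc) // 2
    observed, remaining = doc[:half], doc[half:]
    theta = fold_in_theta(observed, phi)
    return sum(
        math.log(sum(t * topic[w] for t, topic in zip(theta, phi)))
        for w in remaining
    )

# Hypothetical toy model: two topics over a three-word vocabulary.
PHI = [
    {"goal": 0.5, "coach": 0.4, "stock": 0.1},  # "soccer"-like topic
    {"goal": 0.1, "coach": 0.2, "stock": 0.7},  # "finance"-like topic
]

consistent = completion_log_likelihood(["goal", "goal", "coach", "goal"], PHI)
surprising = completion_log_likelihood(["goal", "goal", "stock", "stock"], PHI)
print(consistent, surprising)
```

A document whose second half continues the topics of its first half should receive the higher score, which is what the two calls at the end compare.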

Variations on Held-out Log Likelihood
Perplexity: $2^{-\log p}$
Per-word measure: $\frac{1}{|\text{words}|} \log p(\text{words})$
Many ways to obtain these estimates, see Wallach et al. 2009.
Wallach, Hanna M., et al. "Evaluation methods for topic models." Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 2009.
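As a rough sketch of how these quantities fit together, the following Python computes a held-out log likelihood, the per-word measure, and a perplexity for one test document. Assumptions: $\theta$ and $\phi$ are already given (the toy values are invented), logarithms are base 2, and perplexity is taken over the per-word log likelihood, which is a common convention rather than something the slide specifies.

```python
import math

def heldout_log_likelihood(doc, theta, phi):
    """log2 p(doc | theta, phi) = sum over words of log2 sum_k theta[k] * phi[k][w]."""
    total = 0.0
    for w in doc:
        p_w = sum(theta[k] * phi[k][w] for k in range(len(theta)))
        total += math.log2(p_w)
    return total

def per_word_log_likelihood(doc, theta, phi):
    """Held-out log likelihood divided by the number of words."""
    return heldout_log_likelihood(doc, theta, phi) / len(doc)

def perplexity(doc, theta, phi):
    """2 to the power of the negative per-word log2 likelihood."""
    return 2.0 ** (-per_word_log_likelihood(doc, theta, phi))

# Hypothetical toy model: two topics over a three-word vocabulary.
phi = [
    {"goal": 0.5, "coach": 0.4, "stock": 0.1},
    {"goal": 0.1, "coach": 0.2, "stock": 0.7},
]
theta = [0.5, 0.5]
doc = ["goal", "coach", "stock"]

print(heldout_log_likelihood(doc, theta, phi))
print(perplexity(doc, theta, phi))
```

A useful sanity check: a model that assigns uniform probability over a vocabulary of V words has perplexity V, so anything the topic model learns should push the value below that.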

Topic Coherence
Select words with high probability in one topic T, e.g. cat, lion, tiger, puma.
For all word pairs: how many documents contain both words?
$\text{TopicCoherence}(T) = \sum_{w_1, w_2} \log \frac{\#\text{docs with both words} + 1}{\#\text{docs with word } w_2}$
Mimno et al. "Optimizing semantic coherence in topic models." EMNLP 2011.
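The coherence score above can be sketched as follows. The corpus and word lists are invented for illustration, and the pair ordering (each word paired with every higher-ranked word, with the higher-ranked word in the denominator) follows my reading of Mimno et al.; treat it as an assumption.

```python
import math

def topic_coherence(top_words, documents):
    """Sum of log((co-document frequency + 1) / document frequency) over word pairs.

    `top_words` is ordered from most to least probable. Every word is
    assumed to occur in at least one document.
    """
    doc_sets = [set(d) for d in documents]

    def doc_freq(*words):
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            w1, w2 = top_words[i], top_words[j]  # w2 is the higher-ranked word
            score += math.log((doc_freq(w1, w2) + 1) / doc_freq(w2))
    return score

# Hypothetical toy corpus of tokenized documents.
documents = [
    ["cat", "lion", "tiger"],
    ["cat", "lion"],
    ["cat", "puma"],
    ["apple", "pie"],
]

print(topic_coherence(["cat", "lion"], documents))   # words that co-occur
print(topic_coherence(["cat", "apple"], documents))  # words that never co-occur
```

Scores are at most slightly above zero and become more negative as the top words of a topic co-occur less often.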

Intrinsic Evaluation
HUMAN-IN-THE-LOOP

Word Intrusion
Let a human guess which word does not fit. The human does not know the topic's word lists.
Example: cat, lion, tiger, puma, apple
  cat, lion, tiger, puma: high-probability words under topic T
  apple: true intruder word = low probability under T, high probability under another topic
Assumption: humans guess right = topic model is good!
$\text{Model precision} = \frac{\#\text{correct guesses of true intruder}}{\#\text{all guesses}}$
Higher is better! Best = 1
Chang, Jonathan, et al. "Reading tea leaves: How humans interpret topic models."
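Model precision is simple enough to show directly. In this sketch the list of guesses is invented, standing in for the answers that several annotators gave for one topic.

```python
def model_precision(guesses, true_intruder):
    """Fraction of human guesses that identify the true intruder word."""
    return sum(1 for g in guesses if g == true_intruder) / len(guesses)

# Hypothetical answers from four annotators for the topic
# "cat lion tiger puma" with the intruder "apple".
guesses = ["apple", "apple", "puma", "apple"]
print(model_precision(guesses, "apple"))  # 0.75
```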

Topic Intrusion
Let a human guess which topic does not fit. The human does not know the topic proportions.
Document: "Soccer ball goal referee coach scandal finances news team fire coach president"
  T1: soccer referee goal ($\theta_{T1} = 0.5$)
  T2: scandal finances coach ($\theta_{T2} = 0.25$), the human's guess
  T5: news paper stock ($\theta_{T5} = 0.01$), the true intruder
Assumption: human guesses right = topic model is good!
$\text{Topic log odds} = \log \frac{\theta(\text{true intruder})}{\theta(\text{guess})} = \log \frac{0.01}{0.25} \approx -3.2$
Higher is better! Best = 0
Chang, Jonathan, et al. "Reading tea leaves: How humans interpret topic models."
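The topic log odds from the example above can be checked with a short sketch. The natural logarithm is assumed, and averaging over several annotators' guesses is my assumption about how multiple answers would be combined.

```python
import math

def topic_log_odds(theta, true_intruder, guesses):
    """Mean of log theta[true intruder] - log theta[guess] over all guesses."""
    return sum(
        math.log(theta[true_intruder]) - math.log(theta[g]) for g in guesses
    ) / len(guesses)

# Topic proportions of the example document.
theta = {"T1": 0.5, "T2": 0.25, "T5": 0.01}

wrong_guess = topic_log_odds(theta, "T5", ["T2"])  # annotator picked T2
right_guess = topic_log_odds(theta, "T5", ["T5"])  # annotator found the intruder
print(wrong_guess, right_guess)
```

A correct guess yields 0, the best possible value. A wrong guess is penalized more heavily the more probable the guessed topic is in the document.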

Sadly: Metrics Do Not Always Agree
[Figure omitted; direction labels: better, better]
Chang, Jonathan, et al. "Reading tea leaves: How humans interpret topic models."

Extrinsic Evaluation
CLASSIFICATION TEST SET

Use Classification Dataset
Classes: sports, politics, gossip
Count how often a topic and a class match!
Counts per topic (non-empty cells of the topic-by-class table):
  Topic 1: 100
  Topic 2: 95, 5
  Topic 3: 10, 10, 80

Classification: Topic-Class Alignment
Want to compute accuracy, precision, recall, F1.
How to align topics and classes?
  Topic distribution $\theta$ versus multiple class labels per document
Solutions:
  Highest agreement by KL divergence (topic <-> sports, topic <-> politics, topic <-> gossip)
  Purity: all documents in one topic vote on a class
Issue: what if one topic aligns to two classes (or vice versa)?
  Split the vote proportionally
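The purity variant can be sketched as below. It assumes each document has been reduced to a single dominant topic and a single class label, which sidesteps the multi-label issue raised above. The data is invented.

```python
from collections import Counter, defaultdict

def purity(topics, classes):
    """Each topic votes for its majority class; return the fraction of matching documents."""
    votes = defaultdict(Counter)
    for topic, label in zip(topics, classes):
        votes[topic][label] += 1
    matched = sum(max(counter.values()) for counter in votes.values())
    return matched / len(topics)

# Hypothetical dominant topic and class label for six documents.
topics = [1, 1, 1, 2, 2, 3]
classes = ["sports", "sports", "politics", "politics", "politics", "gossip"]
print(purity(topics, classes))
```

Note that purity rewards many small topics: with one topic per document it is trivially 1, so it should be compared across models with the same K.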

Pairwise Accuracy: Rand Index / BLANC Measure
For every pair of documents, are they:
  Associated with the same topic? Yes / No
  Associated with the same class? Yes / No
Count cases of agreement in a confusion table.
Example: documents A, B, C are labeled sports, document D is labeled gossip.
  Topic = yes, Class = yes: pair A, B
  Topic = yes, Class = no: pair B, D
  Topic = no, Class = yes: pair B, C
  Topic = no, Class = no: pair C, D
Compute accuracy, precision, recall, F1.
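A minimal sketch of the pairwise scores follows. It assumes hard topic assignments (one topic per document). The four documents loosely follow the A, B, C, D example, but the topic assignment is invented.

```python
from itertools import combinations

def pairwise_scores(topics, classes):
    """Compare 'same topic?' with 'same class?' for every pair of documents."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(topics)), 2):
        same_topic = topics[i] == topics[j]
        same_class = classes[i] == classes[j]
        if same_topic and same_class:
            tp += 1
        elif same_topic:
            fp += 1
        elif same_class:
            fn += 1
        else:
            tn += 1
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / total,  # this is the Rand index
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Documents A, B, C, D: A, B, C are sports, D is gossip.
classes = ["sports", "sports", "sports", "gossip"]
topics = [1, 1, 2, 1]  # hypothetical dominant topic per document
print(pairwise_scores(topics, classes))
```

Because it works on pairs, this evaluation needs no explicit alignment between topics and classes.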

Downsides
Will indicate success only if there is a correspondence between
  unsupervised topics
  supervised classes
But: good/useful topics do not have to align with classes.
Therefore: we might get bad scores, even if the topics are good.

Supervised Classification with Topic Model Features
Supervised classifiers represent each document as a feature vector. Use the topic model as features!
Each document $d$ is represented by its topic proportions: $f_d = (\theta_d(t_1), \theta_d(t_2), \theta_d(t_3))$.
A classifier C is trained on documents with true classes (e.g. sports, politics) and predicts classes for the test documents.
If classification performance improves => topic model is good!
Use k-fold cross validation.
Baseline? E.g. word features.

The Problem with Topic Model Features
Unfortunately, unsupervised topic model features are often outperformed by simple word-based features (e.g., TF-IDF).
Example: predicting scientific disciplines (physics, history, etc.).
Rocchio: centroid vector per class; classify by cosine similarity (words versus LDA).
Nanni, Glavas, Ponzetto, Dietz. Under submission.
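The Rocchio setup mentioned above (one centroid per class, classification by cosine similarity) can be sketched as follows. The feature vectors here are invented topic proportions; the same code would take TF-IDF word vectors for the word-based baseline. This is an illustration, not the setup of the cited study.

```python
import math
from collections import defaultdict

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def train_rocchio(vectors, labels):
    """Return one centroid (mean feature vector) per class."""
    grouped = defaultdict(list)
    for vec, label in zip(vectors, labels):
        grouped[label].append(vec)
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in grouped.items()
    }

def classify(centroids, vec):
    """Predict the class whose centroid is most similar to `vec`."""
    return max(centroids, key=lambda label: cosine(centroids[label], vec))

# Hypothetical topic proportions (theta_d) for four training documents.
train_vectors = [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.7, 0.1]]
train_labels = ["sports", "sports", "politics", "politics"]

centroids = train_rocchio(train_vectors, train_labels)
print(classify(centroids, [0.6, 0.3, 0.1]))
```

Running the same classifier once with topic features and once with word features, on identical folds, gives the comparison the slide describes.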

Extrinsic Evaluation
YOUR TASK METRIC

Example: Prediction of Citation Influences
Given a citation graph with paper abstracts: X = paper, Y = influence strength of citations.
Gold data: ask authors to mark strengths with ++, +, -, --
Dietz, Bickel, Scheffer. Unsupervised Prediction of Citation Influences, ICML 2007.

Your Task Specification Here!
Task: given input X, predict output Y.
Your approach: use topic models to make a prediction.
Baseline approach: use something else to make a prediction.
Claim: with the topic model is better than without the topic model.
  X -> [Topic model inside!] -> Y
  X -> [Baseline] -> Y
Is the topic model's Y better than the baseline's Y?

Example: Biomedical Question Answering
[Results figure omitted; labels: "Topic model inside", "other", "Might be upper bound"]
Atkinson, Montecinos, Curtis. Question-driven topic-based extraction of Protein-Protein Interaction Methods from biomedical literature. Information Sciences. 2016.

Which Baselines to Compare to?
Similarity of documents:
  Kullback-Leibler divergence or cosine similarity of words
Word clusters:
  K-means on words / agglomerative clustering
  Matrix factorization
  Word embeddings (e.g. Word2Vec)
Source of topics:
  Thesaurus / word sense dictionary, e.g. WordNet
  Topic = Wikipedia categories (or words from articles)
  Topic = Twitter hashtags
Topic features in classification:
  Rocchio, k-nearest neighbor, support vector machines
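As an example of the simplest baseline in this list, document similarity can be computed directly on word distributions, with no topic model involved. The sketch below uses KL divergence; the add-one smoothing (`alpha`) is my choice to keep the divergence finite and is not prescribed by the slide, and the documents are invented.

```python
import math
from collections import Counter

def word_distribution(tokens, vocab, alpha=1.0):
    """Smoothed unigram distribution of a document over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return [(counts[w] + alpha) / total for w in vocab]

def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same vocabulary (q must be > 0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical tokenized documents.
doc_a = ["soccer", "goal", "coach"]
doc_b = ["soccer", "goal", "referee"]
doc_c = ["stock", "finances", "news"]
vocab = sorted(set(doc_a) | set(doc_b) | set(doc_c))

p_a, p_b, p_c = (word_distribution(d, vocab) for d in (doc_a, doc_b, doc_c))
print(kl_divergence(p_a, p_b))  # smaller: the documents share words
print(kl_divergence(p_a, p_c))  # larger: no shared words
```

If a topic-model-based similarity cannot beat this kind of word-level baseline on the task metric, the extra training cost is hard to justify.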

CONCLUSION

References
Hold-out likelihood / perplexity:
  Wallach, Hanna M., et al. "Evaluation methods for topic models." Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 2009.
Coherence measure:
  Mimno, David, et al. "Optimizing semantic coherence in topic models." Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2011.
Human-in-the-loop evaluation:
  Chang, Jonathan, et al. "Reading tea leaves: How humans interpret topic models." Advances in Neural Information Processing Systems. 2009.
Classification data, pairwise measures:
  Recasens, Marta, and Eduard Hovy. "BLANC: Implementing the Rand index for coreference evaluation." Natural Language Engineering 17.04 (2011): 485-510.
Variation of information:
  Meilă, Marina. "Comparing clusterings by the variation of information." Learning Theory and Kernel Machines. Springer Berlin Heidelberg, 2003. 173-187.
More metrics (on the related task of word embeddings):
  Schnabel, Tobias, et al. "Evaluation methods for unsupervised word embeddings." Proc. of EMNLP. 2015.

Conclusion
User study on visualizations. Need: experts & many humans
Hold-out likelihood / perplexity. Need: only documents
Human-in-the-loop: word / topic intrusion. Need: humans
Classification performance. Need: documents with class labels
Task metric. Depends on what your model is good for
Please measure!
Thank you.