Topic Model Tutorial at WebSci2016 Topic Model Evaluation: How much does it help? Laura Dietz laura.dietz@unh.edu Laura Dietz, Universität Mannheim Topic Model Evaluation: How much does it help? @WebSci2016. Slide 1
Why is this important? Topic models are computationally demanding to train. Is the effort worth it? Isn't there a simpler/faster method that is as good? For multi-component systems: how much do the topics add to the total performance? How to choose K and hyperparameters? How to quantify success? Empirical evaluation.
Do you believe in topic models? "Well, we all know that it simply works." Compared to placebo? "It provides pretty pictures, therefore it must work." I WANT TO BELIEVE. Different data / task? "A scientific study showed it works, no need to test it ever again."
Outline: What is *not* an evaluation? Intrinsic evaluation: through held-out log likelihood / perplexity; with human-in-the-loop (word intrusion). Extrinsic evaluation: through classification test data; task-specific metric. What to compare to? Baselines.
THIS IS *NOT* AN EVALUATION Banana picture licensed under CreativeCommons by-nc-sa by Viktor Hertz
Looking at Word clouds Which of these topics is better? Common: some correct topics, but many split or merged topics Did the author hand-pick the correct topics?
Looking at Highlighted Documents Which of these highlights is better? Example text, shown with two alternative topic highlightings: "Soccer ball goal referee soccer foot coach scandal finances news team fire coach president" Humans prefer long consecutive segments with the same topic. Is the segment topically coherent? How about relative clauses?
Looking at Colored graphs Which of these graph colorations is better? Is color correlating with what we think it does? What if I told you I assigned random colors? (= placebo)
Danger of Human Nature We want it to work We over-interpret the story told by the visualization We corroborate a narrative that fits the results Licensed under CreativeCommons by-sa by JussiClone
Objective Measure Goal: Quantify what quality means! Issues: No gold standard data available. Vague definition of topic. Multiple correct answers. Uncertain inference algorithms.
Designing a User Interface Study Claim: Topic model visualizations help users perform a task better. Run a randomized trial evaluating humans; compare to humans who get placebo visualizations. First study: make sure you design the right thing. Second study: design it in the right way. Do not assume your assumptions to be true! Details out of scope.
Intrinsic Evaluation HELD-OUT LIKELIHOOD & PERPLEXITY
Held-out Log Likelihood / Perplexity For the words in the test document, what is their probability under the (pre-trained) topic model? Train θ, φ on the training set; on the test document, evaluate log p(w_test | θ, φ). How to get θ? It depends on the document! The higher the held-out likelihood (equivalently, the lower the perplexity), the better the model captures patterns of natural language.
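As a minimal sketch of the computation (assuming θ and φ have already been estimated; the function name, toy topics, and vocabulary here are made up for illustration), the held-out log likelihood marginalizes each word over topics:

```python
import math

def heldout_log_likelihood(test_words, theta, phi):
    """Log likelihood of a held-out document under a trained topic model.

    theta: topic proportions for this document, theta[t]
    phi:   phi[t][w] = probability of word w under topic t
    """
    total = 0.0
    for w in test_words:
        # Marginalize over topics: p(w) = sum_t theta[t] * phi[t][w]
        p_w = sum(theta[t] * phi[t].get(w, 0.0) for t in range(len(theta)))
        total += math.log(p_w)
    return total

# Toy example: 2 topics over a 3-word vocabulary
theta = [0.7, 0.3]
phi = [{"cat": 0.8, "dog": 0.1, "stock": 0.1},
       {"cat": 0.1, "dog": 0.1, "stock": 0.8}]
ll = heldout_log_likelihood(["cat", "cat", "stock"], theta, phi)
```

In practice a smoothing term keeps unseen words from driving the likelihood to log 0; Wallach et al. 2009 discuss principled ways to estimate this quantity.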
Variant: Document Completion For the actual words in the test document, what is their probability under the (pre-trained) topic model? Train φ on the training set; estimate θ from the first part of the test document, p(θ | w_first, φ); then evaluate log p(w_rest | θ, φ) on the remaining words. The higher the held-out likelihood, the better the model captures patterns of natural language.
Variations on Held-out Log Likelihood Perplexity: 2^( −(1/N) log₂ p(words) ). Per-word measure: (1/N) log p(words), where N = number of words. Many ways to obtain these, see Wallach et al. 2009. Wallach, Hanna M., et al. "Evaluation methods for topic models." Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 2009.
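A minimal sketch converting a total log likelihood into perplexity (function name made up; this is the natural-log form of the base-2 formula above, which gives the same value):

```python
import math

def perplexity(log_likelihood, num_words):
    """Perplexity = exp(-per-word log likelihood); lower is better."""
    return math.exp(-log_likelihood / num_words)

# If each of 4 words has probability 0.25, perplexity is 4:
pp = perplexity(4 * math.log(0.25), 4)
```

A perplexity of 4 means the model is, on average, as uncertain as a uniform choice among 4 words.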
Topic Coherence Select the words with highest probability in one topic T: cat, lion, tiger, puma. For all word pairs: how many documents contain both words? Topic Coherence(T) = Σ over word pairs (w1, w2) of log( (#docs containing both w1 and w2 + 1) / #docs containing w2 ). Mimno et al. "Optimizing semantic coherence in topic models." EMNLP 2011.
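A minimal sketch of this coherence score (the toy documents and names are made up; following the Mimno et al. 2011 formulation, where co-document counts are taken over ordered pairs of a topic's top words):

```python
import math

def topic_coherence(top_words, documents):
    """UMass-style coherence: sum over ordered word pairs of
    log((co-document frequency + 1) / document frequency of the earlier word)."""
    doc_sets = [set(doc) for doc in documents]

    def df(*words):
        # Number of documents containing all given words
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            score += math.log((df(top_words[i], top_words[j]) + 1)
                              / df(top_words[j]))
    return score

docs = [["cat", "lion", "tiger"], ["cat", "lion"], ["puma", "stock"]]
c = topic_coherence(["cat", "lion", "tiger", "puma"], docs)
```

Words that rarely co-occur (here "puma" with the cat/lion documents) pull the score down, so higher (closer to 0) means a more coherent topic.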
Intrinsic Evaluation HUMAN-IN-THE-LOOP
Word Intrusion Let a human guess which word does not fit. The human does not know the topic's word lists. Example: cat lion tiger puma apple, where cat, lion, tiger, puma are high-probability words under topic T and apple is the true intruder word (low probability under T, high probability under another topic). Assumption: humans guess right = topic model is good! Model precision = (correct guesses of the true intruder) / (all guesses). Higher is better! Best = 1. Chang, Jonathan, et al. "Reading tea leaves: How humans interpret topic models."
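A minimal sketch of the model precision computation (function name and the toy guesses are made up for illustration):

```python
def model_precision(guesses, true_intruder):
    """Fraction of human guesses that identify the true intruder word."""
    return sum(1 for g in guesses if g == true_intruder) / len(guesses)

# Four annotators guess the intruder in {cat, lion, tiger, puma, apple};
# three of them correctly pick "apple".
mp = model_precision(["apple", "apple", "puma", "apple"], "apple")
```

In the full protocol this is averaged over many topics and annotators, giving one precision value per model.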
Topic Intrusion Let a human guess which topic does not fit the document. Example document: "Soccer ball goal referee coach scandal finances news team fire coach president" with θ_T1 = 0.5, θ_T2 = 0.25, θ_T5 = 0.01. Topics shown: T1: soccer referee goal; T2: scandal finances coach (the human's guess); T5: news paper stock (the true intruder). The human does not know the topic proportions. Assumption: human guesses right = topic model good! Topic log odds = log( θ(true intruder) / θ(guess) ) = log(0.01 / 0.25) ≈ −3.2. Higher is better! Best = 0. Chang, Jonathan, et al. "Reading tea leaves: How humans interpret topic models."
Sadly: Metrics Do Not Always Agree [Figure: two evaluation metrics plotted against each other, each axis annotated "better".] Chang, Jonathan, et al. "Reading tea leaves: How humans interpret topic models."
Extrinsic Evaluation CLASSIFICATION TEST SET
Use Classification Dataset Classes: sports, politics, gossip. Count how often a topic and a class match!

          sports  politics  gossip
Topic 1     100         -       -
Topic 2       -        95       5
Topic 3      10        10      80
Classification: Topic-Class Alignment We want to compute accuracy, precision, recall, F1, but how do we align topics and classes? Each document has a topic distribution θ and possibly multiple class labels. Solutions: align Topic 1 <-> sports, Topic 2 <-> politics, Topic 3 <-> gossip by highest agreement (e.g., KL divergence). Purity: all documents in one topic vote on a class. Issue: what if one topic aligns to two classes (or vice versa)? Split the vote proportionally.
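A minimal sketch of purity (the toy assignments and names are made up): each topic's documents vote, the topic adopts its majority class, and purity is the fraction of documents matching their topic's majority class.

```python
from collections import Counter

def purity(topic_assignments, class_labels):
    """Each topic votes for its majority class; purity = fraction of
    documents whose class matches their topic's majority class."""
    per_topic = {}
    for topic, label in zip(topic_assignments, class_labels):
        per_topic.setdefault(topic, []).append(label)
    correct = sum(Counter(labels).most_common(1)[0][1]
                  for labels in per_topic.values())
    return correct / len(class_labels)

topics = [1, 1, 2, 2, 3, 3, 3]
labels = ["sports", "sports", "politics", "gossip",
          "gossip", "gossip", "politics"]
p = purity(topics, labels)
```

Note that purity lets two topics claim the same class; a one-to-one alignment (e.g., by highest agreement) avoids that but may leave topics unmatched.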
Pairwise Accuracy: Rand Index / BLANC Measure For every pair of documents, ask: are they associated with the same topic? (yes / no) Are they associated with the same class? (yes / no) Count cases of agreement or confusion in a 2×2 table (same topic yes/no vs. same class yes/no), e.g., over documents A, B, C (sports) and D (gossip). Then compute accuracy, precision, recall, F1.
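A minimal sketch of the pairwise counting (function names and toy data made up); the Rand index is then pairwise accuracy over the 2×2 table:

```python
from itertools import combinations

def pairwise_counts(topic_assignments, class_labels):
    """For every document pair, compare topic agreement vs. class agreement."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(topic_assignments)), 2):
        same_topic = topic_assignments[i] == topic_assignments[j]
        same_class = class_labels[i] == class_labels[j]
        if same_topic and same_class:
            tp += 1          # agree: grouped together in both
        elif same_topic:
            fp += 1          # same topic, different class
        elif same_class:
            fn += 1          # same class, different topic
        else:
            tn += 1          # agree: separated in both
    return tp, fp, fn, tn

def rand_index(tp, fp, fn, tn):
    """Pairwise accuracy: fraction of pairs on which topic and class agree."""
    return (tp + tn) / (tp + fp + fn + tn)

topics = [1, 1, 2, 2]
labels = ["sports", "sports", "gossip", "sports"]
tp, fp, fn, tn = pairwise_counts(topics, labels)
ri = rand_index(tp, fp, fn, tn)
```

Pairwise precision (tp / (tp + fp)), recall (tp / (tp + fn)), and F1 follow from the same four counts.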
Downsides Will indicate success only if the unsupervised topics correspond to the supervised classes. But good/useful topics do not have to align with classes. Therefore we might get bad scores, even if the topics are good.
Supervised Classification with Topic Model Features Supervised classifiers represent each document as a feature vector. Use the topic model as features! For each training document d with true class (e.g., sports, politics), build the feature vector f = ( θ_d(t1), θ_d(t2), θ_d(t3) ), train a classifier C, and predict classes for the test documents. If classification performance improves, the topic model is good! Use k-fold cross validation. Baseline? E.g., word features.
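A minimal sketch of the idea (a Rocchio-style nearest-centroid classifier on topic-proportion features; all names and the toy θ values are made up). In practice you would plug the θ vectors into any standard classifier and use k-fold cross validation:

```python
def train_centroids(feature_vectors, labels):
    """Rocchio-style: average the topic-proportion vectors per class."""
    sums, counts = {}, {}
    for vec, label in zip(feature_vectors, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for t, v in enumerate(vec):
            acc[t] += v
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc]
            for label, acc in sums.items()}

def classify(centroids, vec):
    """Assign the class whose centroid is closest (squared Euclidean)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: dist(centroids[label], vec))

# Toy topic proportions theta_d over 3 topics as features
train_theta = [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
train_labels = ["sports", "sports", "politics"]
centroids = train_centroids(train_theta, train_labels)
pred = classify(centroids, [0.75, 0.15, 0.10])
```

The evaluation then compares held-out accuracy of this classifier against the same classifier trained on baseline features such as raw word counts.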
The Problem with Topic Model Features Unfortunately, unsupervised topic model features are often outperformed by simple word-based features (e.g., TF-IDF). Example: predicting scientific disciplines (physics, history, etc.) with a Rocchio classifier (centroid vector per class; classify by cosine similarity), comparing word features versus LDA features. Nanni, Glavas, Ponzetto, Dietz. Under submission.
Extrinsic Evaluation YOUR TASK METRIC
Example: Prediction of Citation Influences Given a citation graph with paper abstracts: X = paper, Y = influence strength of its citations. Gold data: ask authors to mark strengths with ++, +, -, --. Dietz, Bickel, Scheffer. Unsupervised Prediction of Citation Influences, ICML 2007.
Your Task Specification Here! Task: given input X, predict output Y. Your approach: use topic models to make the prediction ("topic model inside!"). Baseline approach: use something else to make the prediction. Claim: with a topic model is better than without. Is the Y from your system better than the Y from the baseline?
Example: Biomedical Question Answering [Chart: the system with a topic model inside compared against other approaches; one of the others might be an upper bound.] Atkinson, Montecinos, Curtis. Question-driven topic-based extraction of Protein-Protein Interaction Methods from biomedical literature. Information Sciences. 2016.
Which Baselines to Compare To? Similarity of documents: Kullback-Leibler divergence or cosine similarity of words. Word clusters: k-means on words / agglomerative clustering. Matrix factorization. Word embeddings (e.g., word2vec). Sources of topics: thesaurus / word-sense dictionary (e.g., WordNet); topic = Wikipedia categories (or words from articles); topic = Twitter hashtags. Topic features in classification: Rocchio, k-nearest neighbor, support vector machines.
CONCLUSION
References
Held-out likelihood / perplexity: Wallach, Hanna M., et al. "Evaluation methods for topic models." Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 2009.
Coherence measure: Mimno, David, et al. "Optimizing semantic coherence in topic models." Proceedings of the Conference on Empirical Methods in Natural Language Processing. ACL, 2011.
Human-in-the-loop evaluation: Chang, Jonathan, et al. "Reading tea leaves: How humans interpret topic models." Advances in Neural Information Processing Systems. 2009.
Classification data, pairwise measures: Recasens, Marta, and Eduard Hovy. "BLANC: Implementing the Rand index for coreference evaluation." Natural Language Engineering 17.04 (2011): 485-510.
Variation of information: Meilă, Marina. "Comparing clusterings by the variation of information." Learning Theory and Kernel Machines. Springer Berlin Heidelberg, 2003. 173-187.
More metrics (on the related task of word embeddings): Schnabel, Tobias, et al. "Evaluation methods for unsupervised word embeddings." Proc. of EMNLP. 2015.
Conclusion User study on visualizations (needs experts & many humans). Held-out likelihood / perplexity (needs only documents). Human-in-the-loop: word / topic intrusion (needs humans). Classification performance (needs documents with class labels). Task metric (depends on what your model is good for). Please measure! Thank you.