Topic Model Evaluation: How much does it help?


1 Topic Model Tutorial at WebSci2016 Topic Model Evaluation: How much does it help? Laura Dietz Laura Dietz, Universität Mannheim Topic Model Evaluation: How much does it Slide 1

2 Why is this important?
Topic models are computationally demanding to train. Is the effort worth it? Isn't there a simpler or faster method that is as good? For multi-component systems: how much do the topics add to the total performance? How to choose K and the hyperparameters? How to quantify success? Empirical evaluation.

3 Do you believe in topic models?
"Well, we all know that it simply works." Compared to placebo? "It provides pretty pictures, therefore it must work." (I WANT TO BELIEVE) Different data / task? "A scientific study showed it works, no need to test it ever again."

4 Outline
What is *not* an evaluation? Intrinsic evaluation: through held-out log likelihood / perplexity; with human-in-the-loop (word intrusion). Extrinsic evaluation: through classification test data; task-specific metric. What to compare to? Baselines.

5 THIS IS *NOT* AN EVALUATION
Banana picture licensed under Creative Commons BY-NC-SA by Viktor Hertz.

6 Looking at Word Clouds
Which of these topics is better? Common: some correct topics, but many split or merged topics. Did the author hand-pick the correct topics?

7 Looking at Highlighted Documents
Which of these highlights is better? [Figure: two differently highlighted copies of the text "Soccer ball goal referee soccer foot coach scandal finances news team fire coach president".] Humans prefer long consecutive segments with the same topic. Is the segment topically coherent? How about relative clauses?

8 Looking at Colored Graphs
Which of these graph colorations is better? Does color correlate with what we think it does? What if I told you I assigned random colors? (= placebo)

9 Danger of Human Nature
We want it to work. We over-interpret the story told by the visualization. We corroborate a narrative that fits the results. (Image licensed under Creative Commons BY-SA by JussiClone.)

10 Objective Measure
Goal: quantify what quality means! Issues: no gold-standard data available; vague definition of "topic"; multiple correct answers; uncertain inference algorithms.

11 Designing a User Interface Study
Claim: topic model visualizations help users perform a task better. Run a randomized trial evaluating humans. Compare to humans who get placebo visualizations. First study: make sure you design the right thing. Second study: design it in the right way. Do not assume your assumptions to be true! Details are out of scope.

12 Intrinsic Evaluation: HELD-OUT LIKELIHOOD & PERPLEXITY

13 Held-out Log Likelihood / Perplexity
For the words in the test document, what is their probability under the (pre-trained) topic model? Train θ and φ on the training documents, then compute $\log p(\text{test words} \mid \theta, \phi)$ on the test document. How to get θ? It depends on the document! Lower perplexity (equivalently, higher held-out log likelihood) is better: the lower the perplexity, the better the model captures patterns of natural language.
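
The quantity on this slide can be sketched in a few lines. This is a minimal illustration, assuming the topic proportions θ of the test document have already been estimated somehow; the function name and the toy numbers in `phi` and `theta` are hypothetical and not taken from the tutorial.

```python
import math

def heldout_log_likelihood(doc, theta, phi):
    """Sum over tokens of log( sum_k theta[k] * phi[k][word] )."""
    total = 0.0
    for word in doc:
        # Probability of the word under the mixture of topics.
        p_word = sum(theta[k] * phi[k].get(word, 0.0) for k in range(len(theta)))
        total += math.log(p_word)
    return total

# Toy model with two topics (hypothetical numbers).
phi = [
    {"goal": 0.5, "coach": 0.4, "scandal": 0.1},  # a "soccer"-like topic
    {"goal": 0.1, "coach": 0.3, "scandal": 0.6},  # a "scandal"-like topic
]
theta = [0.8, 0.2]  # assumed to be given for the test document
test_doc = ["goal", "coach", "goal"]

ll = heldout_log_likelihood(test_doc, theta, phi)
```

The sketch deliberately sidesteps the hard part named on the slide, namely how to obtain θ for an unseen document.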

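14 Variant: Document Completion
For the actual words in the test document, what is their probability under the (pre-trained) topic model? Train φ on the training documents. Writing $w^{(1)}$ for the observed part of the test document and $w^{(2)}$ for the remaining part, estimate θ from $w^{(1)}$ and score $w^{(2)}$: $\log p(w^{(2)} \mid \theta, \phi)\, p(\theta \mid w^{(1)})$. Lower perplexity (equivalently, higher log likelihood) is better: the lower the perplexity, the better the model captures patterns of natural language.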
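
A rough sketch of document completion follows. It makes a simplifying assumption: θ is a single point estimate obtained by EM on the first half of the document with φ held fixed, rather than the full posterior over θ that the slide refers to. All names and numbers are hypothetical.

```python
import math

def estimate_theta(words, phi, iterations=50):
    """Fold-in: EM estimate of topic proportions with phi held fixed."""
    num_topics = len(phi)
    theta = [1.0 / num_topics] * num_topics
    for _ in range(iterations):
        expected = [0.0] * num_topics
        for w in words:
            # Responsibility of each topic for this token (tiny floor avoids zeros).
            resp = [theta[k] * phi[k].get(w, 1e-12) for k in range(num_topics)]
            norm = sum(resp)
            for k in range(num_topics):
                expected[k] += resp[k] / norm
        theta = [e / len(words) for e in expected]
    return theta

def completion_log_likelihood(doc, phi):
    """Estimate theta on the first half, score the second half."""
    half = len(doc) // 2
    observed, remaining = doc[:half], doc[half:]
    theta = estimate_theta(observed, phi)
    return sum(
        math.log(sum(theta[k] * phi[k].get(w, 1e-12) for k in range(len(phi))))
        for w in remaining
    )

phi = [
    {"goal": 0.5, "coach": 0.4, "scandal": 0.1},
    {"goal": 0.1, "coach": 0.3, "scandal": 0.6},
]
doc = ["goal", "goal", "coach", "goal", "coach", "goal"]
theta_hat = estimate_theta(doc[:3], phi)
score = completion_log_likelihood(doc, phi)
```

Splitting the document in the middle is only one possible choice; the split point is another assumption of this sketch.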

15 Variations on Held-out Log Likelihood
Perplexity: $2^{-\log_2 p(\text{words})}$. Per-word measure: $\frac{1}{|\text{words}|} \log p(\text{words})$. There are many ways to obtain the likelihood, see Wallach et al.
Wallach, Hanna M., et al. "Evaluation methods for topic models." Proceedings of the 26th Annual International Conference on Machine Learning. ACM.
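
As a small worked example, the two variants on this slide are often combined into per-word perplexity. The sketch below assumes the log likelihood is given in natural log, so the base-2 form becomes an exponential; the function name is hypothetical.

```python
import math

def perplexity(log_likelihood, num_words):
    """Per-word perplexity from a natural-log likelihood.

    exp(-ll / N) equals 2 ** (-log2(p) / N), so the base does not matter
    as long as the exponent and the logarithm agree.
    """
    return math.exp(-log_likelihood / num_words)

# Sanity check: a model that assigns probability 1/4 to each of 10 tokens
# is exactly as uncertain as a uniform choice among 4 words.
uniform_ll = 10 * math.log(0.25)
ppl = perplexity(uniform_ll, 10)
```

A perplexity of 4 here reads as "the model is as confused as if it picked uniformly among 4 words at every position".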

16 Topic Coherence
Select words with high probability in one topic T, e.g. cat, lion, tiger, puma. For all word pairs: how many documents contain both words? Topic Coherence(T) = $\sum_{w_1, w_2} \log \frac{\#\text{docs with both words} + 1}{\#\text{docs with word } w_2}$.
Mimno et al. "Optimizing semantic coherence in topic models." EMNLP.
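
The coherence formula can be sketched as follows. This is an illustration under assumptions: documents are treated as sets of words, the top words are given in order of decreasing probability, and the denominator counts documents containing the higher-ranked word of each pair. The function name and toy corpus are hypothetical.

```python
import math
from itertools import combinations

def topic_coherence(top_words, documents):
    """Sum over word pairs of log((co-document count + 1) / document count)."""
    doc_sets = [set(d) for d in documents]

    def doc_freq(*words):
        # Number of documents containing all the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    # combinations() keeps list order, so w2 is ranked above w1 in each pair.
    for w2, w1 in combinations(top_words, 2):
        score += math.log((doc_freq(w1, w2) + 1) / doc_freq(w2))
    return score

corpus = [
    ["cat", "lion"],
    ["cat", "tiger"],
    ["lion", "tiger", "cat"],
    ["apple"],
]
coherent = topic_coherence(["cat", "lion", "tiger"], corpus)
incoherent = topic_coherence(["cat", "apple"], corpus)
```

The sketch assumes every top word occurs in at least one document, which holds when the words come from a topic trained on the same corpus.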

17 Intrinsic Evaluation: HUMAN-IN-THE-LOOP

18 Word Intrusion
Let a human guess which word does not fit. The human does not know the topic's word lists. Example: cat, lion, tiger, puma, apple. Four words have high probability under topic T; the true intruder word has low probability under T and high probability under another topic. Assumption: humans guess right = topic model is good! Model precision = $\frac{\text{correct guesses of true intruder}}{\text{all guesses}}$. Higher is better! Best = 1.
Chang, Jonathan, et al. "Reading tea leaves: How humans interpret topic models."
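
Model precision is simply the fraction of annotators who pick the true intruder. A minimal sketch, with a hypothetical function name and made-up guesses:

```python
def model_precision(guesses, true_intruder):
    """Fraction of human guesses that hit the true intruder word."""
    correct = sum(1 for guess in guesses if guess == true_intruder)
    return correct / len(guesses)

# Four hypothetical annotators looked at: cat, lion, tiger, puma, apple.
guesses = ["apple", "apple", "puma", "apple"]
precision = model_precision(guesses, "apple")
```

In practice this would be averaged over many topics, each with its own intruder and set of annotators.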

19 Topic Intrusion
Let a human guess which topic does not fit. Document: "Soccer ball goal referee coach scandal finances news team fire coach president". Candidate topics: T1: soccer referee goal ($\theta_{T1} = 0.5$); T2: scandal finances coach ($\theta_{T2} = 0.25$, the human's guess); T5: news paper stock ($\theta_{T5} = 0.01$, the true intruder). The human does not know the topic proportions. Assumption: human guesses right = topic model is good! Topic log odds = $\log \frac{\theta(\text{true intruder})}{\theta(\text{guess})}$, which is $-3.4$ in the slide's example. Higher is better! Best = 0.
Chang, Jonathan, et al. "Reading tea leaves: How humans interpret topic models."
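
Topic log odds can be sketched with the topic proportions listed on this slide. The natural logarithm is an assumption here, so the value need not match the slide's own figure; the function name is hypothetical.

```python
import math

def topic_log_odds(theta, true_intruder, guess):
    """log theta(true intruder) - log theta(guess); 0 when the guess is right."""
    return math.log(theta[true_intruder]) - math.log(theta[guess])

theta = {"T1": 0.5, "T2": 0.25, "T5": 0.01}

wrong_guess = topic_log_odds(theta, true_intruder="T5", guess="T2")
right_guess = topic_log_odds(theta, true_intruder="T5", guess="T5")
```

A wrong guess is penalized more heavily the more probable the guessed topic is under the model, which is why the best achievable value is 0.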

20 Sadly: Metrics do not Always Agree
[Figure from Chang et al.; the axes are annotated with "better".]
Chang, Jonathan, et al. "Reading tea leaves: How humans interpret topic models."

21 Extrinsic Evaluation: CLASSIFICATION TEST SET

22 Use Classification Dataset
Count how often a topic and a class match! [Figure: documents labeled sports, politics, and gossip, set against three topics.]

23 Classification: Topic-Class Alignment
We want to compute accuracy, precision, recall, F1. How to align topics and classes, given a topic distribution θ and possibly multiple class labels per document? Solutions: align each topic with a class (topic <-> sports, topic <-> politics, topic <-> gossip) by highest agreement under KL divergence; or purity: all documents in one topic vote on a class. Issue: what if one topic aligns to two classes (or vice versa)? Split the vote proportionally.
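
The purity variant can be sketched as below. It assumes each document has been reduced to a single topic (for instance the most probable one under θ) and a single class label, which sidesteps the multi-label issue raised on the slide. Names and data are hypothetical.

```python
from collections import Counter, defaultdict

def purity(topic_assignments, class_labels):
    """Each topic votes for its majority class; return the share of
    documents that agree with their topic's majority class."""
    votes = defaultdict(Counter)
    for topic, label in zip(topic_assignments, class_labels):
        votes[topic][label] += 1
    agreeing = sum(counter.most_common(1)[0][1] for counter in votes.values())
    return agreeing / len(class_labels)

topics = [0, 0, 0, 1, 1, 2]
classes = ["sports", "sports", "politics", "politics", "politics", "gossip"]
score = purity(topics, classes)
```

Note that purity trivially reaches 1 when every document gets its own topic, so it is best compared across models with the same number of topics.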

24 Pairwise Accuracy: Rand Index / BLANC Measure
For every pair of documents, are they associated with the same topic (yes / no)? Are they associated with the same class (yes / no)? Count the cases of agreement in a 2x2 confusion table (Topic = yes / no against Class = yes / no). [Figure: example with documents A, B, C, D from the classes sports and gossip, and pairs such as (A, B), (B, D), (C, D), (B, C) sorted into the table cells.] Compute accuracy, precision, recall, F1.
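
The pairwise counting can be sketched as follows, again assuming one hard topic and one class per document. The four-document example is hypothetical and is not a reconstruction of the slide's table.

```python
from itertools import combinations

def pairwise_counts(topics, classes):
    """Confusion counts over all document pairs."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for i, j in combinations(range(len(topics)), 2):
        same_topic = topics[i] == topics[j]
        same_class = classes[i] == classes[j]
        if same_topic and same_class:
            counts["tp"] += 1
        elif same_topic:
            counts["fp"] += 1   # grouped by the model, but different classes
        elif same_class:
            counts["fn"] += 1   # same class, but split by the model
        else:
            counts["tn"] += 1
    return counts

def rand_index(counts):
    """Pairwise accuracy: share of pairs on which topics and classes agree."""
    return (counts["tp"] + counts["tn"]) / sum(counts.values())

# Documents A, B, C, D (hypothetical assignments).
topics = [0, 0, 1, 1]
classes = ["sports", "sports", "sports", "gossip"]
counts = pairwise_counts(topics, classes)
accuracy = rand_index(counts)
precision = counts["tp"] / (counts["tp"] + counts["fp"])
recall = counts["tp"] / (counts["tp"] + counts["fn"])
```

Working on pairs avoids the need to align topics with classes at all, at the cost of a quadratic number of comparisons.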

25 Downsides
This will indicate success only if there is a correspondence between unsupervised topics and supervised classes. But: good/useful topics do not have to align with classes. Therefore: we might get bad scores, even if the topics are good.

26 Supervised Classification with Topic Model Features
Supervised classifiers represent each document as a feature vector. Use the topic model as features! For each document d, the feature vector is $f = (\theta_d(t_1), \theta_d(t_2), \theta_d(t_3))$. A classifier C is trained on the training documents (true classes, e.g. sports, with their topic features) and predicts the classes of the test documents (e.g. politics?). If classification performance improves => topic model is good! Use k-fold cross-validation. Baseline? E.g. word features.

27 The Problem with Topic Model Features
Unfortunately, unsupervised topic model features are often outperformed by simple word-based features (e.g., TF-IDF). Example: predicting scientific disciplines (physics, history, etc.) with Rocchio: one centroid vector per class; classify by cosine similarity (words versus LDA).
Nanni, Glavas, Ponzetto, Dietz. Under submission.
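
A minimal Rocchio sketch follows: one centroid per class, classification by cosine similarity. The same code works whether the input vectors hold TF-IDF word weights or topic proportions θ, which is what makes the comparison on this slide possible. The function names and toy vectors are hypothetical and do not reproduce the cited experiment.

```python
import math
from collections import defaultdict

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def train_rocchio(vectors, labels):
    """Return one centroid (mean vector) per class."""
    sums = {}
    counts = defaultdict(int)
    for vec, label in zip(vectors, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        sums[label] = [s + x for s, x in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}

def classify(centroids, vec):
    """Pick the class whose centroid is most similar to the vector."""
    return max(centroids, key=lambda label: cosine(centroids[label], vec))

# Toy feature vectors, e.g. two topic proportions per document.
train_vectors = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
train_labels = ["physics", "physics", "history"]
centroids = train_rocchio(train_vectors, train_labels)
prediction = classify(centroids, [0.7, 0.3])
```

Swapping the feature vectors while keeping the classifier fixed isolates the contribution of the representation.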

28 Extrinsic Evaluation: YOUR TASK METRIC

29 Example: Prediction of Citation Influences
Given a citation graph with paper abstracts: X = paper, Y = influence strength of citations. Gold data: ask authors to mark strengths with ++, +, -, --.
Dietz, Bickel, Scheffer. Unsupervised Prediction of Citation Influences, ICML.

30 Your Task Specification Here!
Task: given input X, predict output Y. Your approach: use topic models to make a prediction. Baseline approach: use something else to make a prediction. Claim: with the topic model is better than without the topic model. [Diagram: X goes into "Topic Model inside!" and yields Y; X goes into the baseline and yields Y; is the first Y better than the baseline's Y?]

31 Example: Biomedical Question Answering
[Figure: result comparison annotated with "other", "might be upper bound", and "topic model inside".]
Atkinson, Montecinos, Curtis. Question-driven topic-based extraction of Protein-Protein Interaction Methods from biomedical literature. Information Sciences.

32 Which Baselines to Compare to?
Similarity of documents: Kullback-Leibler divergence or cosine similarity of words. Word clusters: K-means on words / agglomerative clustering; matrix factorization; word embeddings (e.g. Word2Vec). Source of topics: thesaurus / word sense dictionary, e.g. WordNet; topic = Wikipedia categories (or words from articles); topic = Twitter hashtags. Topic features in classification: Rocchio, K-nearest neighbor, support vector machines.
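
The simplest baseline in this list, comparing documents by their word distributions, can be sketched as follows. Add-one smoothing is an assumption made here so that the KL divergence stays finite; the function names and toy documents are hypothetical.

```python
import math
from collections import Counter

def word_distribution(tokens, vocabulary, smoothing=1.0):
    """Smoothed unigram distribution over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + smoothing * len(vocabulary)
    return [(counts[w] + smoothing) / total for w in vocabulary]

def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q is strictly positive wherever p is."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

vocabulary = ["goal", "coach", "scandal", "finances"]
doc_a = ["goal", "goal", "coach"]
doc_b = ["goal", "coach", "coach"]
doc_c = ["scandal", "finances", "finances"]

p_a = word_distribution(doc_a, vocabulary)
p_b = word_distribution(doc_b, vocabulary)
p_c = word_distribution(doc_c, vocabulary)

near = kl_divergence(p_a, p_b)
far = kl_divergence(p_a, p_c)
```

If a topic model cannot beat a word-level comparison like this on the task at hand, the extra training cost is hard to justify.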

33 CONCLUSION

34 References
Hold-out likelihood / perplexity: Wallach, Hanna M., et al. "Evaluation methods for topic models." Proceedings of the 26th Annual International Conference on Machine Learning. ACM.
Coherence measure: Mimno, David, et al. "Optimizing semantic coherence in topic models." Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.
Human-in-the-loop evaluation: Chang, Jonathan, et al. "Reading tea leaves: How humans interpret topic models." Advances in Neural Information Processing Systems.
Classification data, pairwise measures: Recasens, Marta, and Eduard Hovy. "BLANC: Implementing the Rand index for coreference evaluation." Natural Language Engineering (2011).
Variation of information: Meilă, Marina. "Comparing clusterings by the variation of information." Learning Theory and Kernel Machines. Springer Berlin Heidelberg.
More metrics (on the related task of word embeddings): Schnabel, Tobias, et al. "Evaluation methods for unsupervised word embeddings." Proc. of EMNLP.

35 Conclusion
User study on visualizations. Need: experts & many humans.
Hold-out likelihood / perplexity. Need: only documents.
Human-in-the-loop: word / topic intrusion. Need: humans.
Classification performance. Need: documents with class labels.
Task metric. Depends on what your model is good for.
Please measure! Thank you.
