CS 572: Information Retrieval
Lecture 9: Language Models for IR (cont'd)
Acknowledgments: Some slides in this lecture were adapted from Chris Manning (Stanford) and Jin Kim (UMass).
New: IR based on Language Model (LM)
[Figure: the query-generation view. Each document d_1 ... d_n in the collection has its own language model M_{d_1} ... M_{d_n}; documents are ranked by the probability P(Q | M_d) that the information need's query was generated from the document's model.]
- A common search heuristic is to use words that you expect to find in matching documents as your query ("why, I saw Sergey Brin advocating that strategy on late night TV one night in my hotel room, so it must be good!").
- The LM approach directly exploits that idea!
Probabilistic Language Modeling
- Goal: compute the probability of a document, a sentence, or a sequence of words: $P(W) = P(w_1, w_2, w_3, w_4, w_5, \ldots, w_n)$
- Related task: probability of an upcoming word: $P(w_5 \mid w_1, w_2, w_3, w_4)$
- A model that computes either of these, $P(W)$ or $P(w_n \mid w_1, w_2, \ldots, w_{n-1})$, is called a language model.
- A better name might be "the grammar", but "language model" or "LM" is standard.
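To make the definitions concrete, here is a minimal sketch (not from the slides) of a bigram model estimated from a toy corpus; the corpus and all names are illustrative:

```python
from collections import Counter

corpus = ["<s> I saw a cat </s>", "<s> I saw a dog </s>"]
tokens = [w for sent in corpus for w in sent.split()]

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

def p_bigram(w, prev):
    # Maximum-likelihood estimate: P(w | prev) = count(prev, w) / count(prev)
    return bigram_counts[(prev, w)] / unigram_counts[prev]

def p_sentence(sentence):
    # Chain rule with the bigram (Markov) approximation:
    # P(w_1 ... w_n) ~ prod_i P(w_i | w_{i-1})
    ws = sentence.split()
    p = 1.0
    for prev, w in zip(ws, ws[1:]):
        p *= p_bigram(w, prev)
    return p

print(p_sentence("<s> I saw a cat </s>"))  # 0.5: "a" continues as "cat" or "dog"
```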
Evaluation: How good is our model?
- Does our language model prefer good sentences to bad ones?
  - Does it assign higher probability to real or frequently observed sentences than to ungrammatical or rarely observed ones?
- We train the parameters of our model on a training set.
- We test the model's performance on data we haven't seen.
  - A test set is an unseen dataset, different from our training set and totally unused during training.
  - An evaluation metric tells us how well our model does on the test set.
Training on the test set
- We can't allow test sentences into the training set.
- Otherwise we would assign them an artificially high probability when we see them in the test set.
- Training on the test set = bad science!
Extrinsic evaluation of N-gram models
- Best evaluation for comparing models A and B:
  - Put each model in a task (spelling corrector, speech recognizer, IR system).
  - Run the task and get an accuracy for A and for B: how many misspelled words were corrected properly? how many relevant/non-relevant docs were retrieved?
  - Compare the accuracy of A and B.
- Problematic!
  - Time consuming (re-index docs, re-run search, user study); can take days or weeks.
  - Difficult to pinpoint problems in a complex system/task.
Intrinsic Evaluation: Perplexity
- Perplexity is a bad approximation of extrinsic performance unless the test data looks just like the training data.
- So it is generally only useful in pilot experiments.
- But it is helpful to think about.
Intrinsic Evaluation: Perplexity
- The Shannon Game: how well can we predict the next word?
  - "I always order pizza with cheese and ___" (mushrooms 0.1, pepperoni 0.1, anchovies 0.01, ..., fried rice 0.0001, ..., and 1e-100)
  - "The 33rd President of the US was ___"
  - "I saw a ___"
- Unigrams are terrible at this game. (Why?)
- A better model of a text is one which assigns a higher probability to the word that actually occurs.
Perplexity (formal definition)
- The best language model is one that best predicts an unseen test set: it gives the highest P(sentence).
- Perplexity is the inverse probability of the test set, normalized by the number of words:
  $PP(W) = P(w_1 w_2 \ldots w_N)^{-1/N} = \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}}$
- By the chain rule: $PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 \ldots w_{i-1})}}$
- For bigrams: $PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}}$
- Minimizing perplexity is the same as maximizing probability.
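A minimal sketch of computing perplexity from any conditional-probability function, done in log space to avoid numeric underflow; the uniform toy model and all names are illustrative:

```python
import math

def perplexity(test_words, prob_fn):
    # PP(W) = P(w_1 .. w_N)^(-1/N) = exp(-(1/N) * sum_i log P(w_i | history))
    n = len(test_words)
    log_p = sum(math.log(prob_fn(test_words[:i], w))
                for i, w in enumerate(test_words))
    return math.exp(-log_p / n)

# A model assigning uniform probability 1/V to every word has perplexity V:
V = 10
uniform = lambda history, w: 1.0 / V
print(perplexity(["7"] * 100, uniform))  # 10.0
```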
Perplexity as branching factor
- Suppose a sentence consists of random digits.
- What is the perplexity of this sentence according to a model that assigns P = 1/10 to each digit?
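Working it out from the definition above:

$PP(W) = P(w_1 w_2 \ldots w_N)^{-1/N} = \left(\left(\tfrac{1}{10}\right)^{N}\right)^{-1/N} = \left(\tfrac{1}{10}\right)^{-1} = 10$

So perplexity equals the branching factor: at every position the model faces 10 equally likely choices.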
Lower perplexity = better model
- Training: 38 million words; test: 1.5 million words (WSJ).

  N-gram order | Unigram | Bigram | Trigram
  Perplexity   | 962     | 170    | 109
The perils of overfitting
- N-grams only work well for word prediction if the test corpus looks like the training corpus.
- In real life, it often doesn't.
- We need to train robust models that generalize!
- One kind of generalization: zeros! Things that never occur in the training set but do occur in the test set.
Summary: Discounts for Smoothing
Smoothing: Interpolation
Smoothing: Basic Interpolation Model
- General formulation of the LM for IR:
  $p(Q, d) = p(d) \prod_{t \in Q} \left((1-\lambda)\, p(t) + \lambda\, p(t \mid M_d)\right)$
  where $p(t)$ is the general language model and $p(t \mid M_d)$ is the individual-document model.
- The user has a document in mind, and generates the query from this document.
- The equation represents the probability that the document the user had in mind was in fact this one.
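A minimal sketch of ranking with the interpolation model above, scoring $\prod_{t \in Q}((1-\lambda)p(t) + \lambda p(t \mid M_d))$ and dropping the document prior $p(d)$; the toy collection, λ = 0.5, and all names are illustrative:

```python
from collections import Counter

def score(query, doc_tokens, coll_counts, coll_len, lam=0.5):
    # P(Q|d) = prod over query terms of
    #   (1 - lam) * P(t | collection)  +  lam * P(t | document)
    doc_counts = Counter(doc_tokens)
    p = 1.0
    for t in query.split():
        p_coll = coll_counts[t] / coll_len        # general language model
        p_doc = doc_counts[t] / len(doc_tokens)   # individual-document model
        p *= (1 - lam) * p_coll + lam * p_doc
    return p

docs = [["click", "shears"], ["click", "click", "metal"]]
coll = Counter(t for d in docs for t in d)
for d in docs:
    print(d, score("click metal", d, coll, sum(coll.values())))
```

Note that the first document still gets a nonzero score even though it lacks "metal"; without the collection component, any missing query term would zero out the whole product.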
Jelinek-Mercer Smoothing
Dirichlet Smoothing
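The body of this slide did not survive extraction; for reference, a minimal sketch of Dirichlet-prior smoothing in its standard form (as in Zhai & Lafferty), with μ = 2000 and all names illustrative:

```python
from collections import Counter

def p_dirichlet(term, doc_tokens, coll_counts, coll_len, mu=2000):
    # P(w|d) = (c(w,d) + mu * P(w|C)) / (|d| + mu)
    # Equivalent to interpolation with a document-dependent weight
    # lambda_d = |d| / (|d| + mu): longer documents trust their own counts more.
    c_wd = Counter(doc_tokens)[term]
    p_wc = coll_counts[term] / coll_len
    return (c_wd + mu * p_wc) / (len(doc_tokens) + mu)
```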
How to set the lambdas?
- Use a held-out corpus: Training Data | Held-Out Data | Test Data.
- Choose the λs to maximize the probability of the held-out data:
  - Fix the N-gram probabilities (on the training data).
  - Then search for the λs that give the largest probability to the held-out set (equivalently, the lowest held-out perplexity).
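A minimal sketch of the held-out search, assuming the component probability functions were already fixed on the training data; the simple grid search stands in for the classical EM procedure, and all names are illustrative:

```python
import math

def held_out_log_likelihood(lam, held_out_bigrams, p_bigram, p_unigram):
    # Log-likelihood of held-out data under the interpolated model
    #   P(w | prev) = lam * P_bigram(w | prev) + (1 - lam) * P_unigram(w)
    return sum(math.log(lam * p_bigram(w, prev) + (1 - lam) * p_unigram(w))
               for prev, w in held_out_bigrams)

def best_lambda(held_out_bigrams, p_bigram, p_unigram):
    # Maximizing held-out probability = minimizing held-out perplexity.
    return max((l / 100 for l in range(1, 100)),
               key=lambda lam: held_out_log_likelihood(
                   lam, held_out_bigrams, p_bigram, p_unigram))
```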
Huge web-scale N-grams
- How to deal with, e.g., the Google N-gram corpus?
- Pruning:
  - Only store N-grams with count > threshold; remove singletons of higher-order N-grams.
  - Entropy-based pruning.
- Efficiency:
  - Efficient data structures, like tries.
  - Bloom filters: approximate language models.
  - Store words as indexes, not strings; use Huffman coding to fit large numbers of words into two bytes.
  - Quantize probabilities (4-8 bits instead of an 8-byte float).
Smoothing for Web-scale N-grams
- "Stupid backoff" (Brants et al. 2007): no discounting, just use relative frequencies.
  $S(w_i \mid w_{i-k+1}^{i-1}) = \begin{cases} \dfrac{\mathrm{count}(w_{i-k+1}^{i})}{\mathrm{count}(w_{i-k+1}^{i-1})} & \text{if } \mathrm{count}(w_{i-k+1}^{i}) > 0 \\[1ex] 0.4\, S(w_i \mid w_{i-k+2}^{i-1}) & \text{otherwise} \end{cases}$
  $S(w_i) = \dfrac{\mathrm{count}(w_i)}{N}$
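A minimal sketch of the recursion above; the counts table (mapping word tuples of any length to corpus frequencies) and all names are illustrative. Note that S is a score, not a normalized probability, which is what makes it "stupid" yet cheap at web scale:

```python
def stupid_backoff(counts, total_words, words, i, k, alpha=0.4):
    # S(w_i | w_{i-k} .. w_{i-1}): relative frequency if the (k+1)-gram
    # was seen, otherwise back off to one word less of context, scaled by alpha.
    if k == 0:
        return counts.get((words[i],), 0) / total_words  # S(w_i) = count(w_i) / N
    ngram = tuple(words[i - k:i + 1])
    context = tuple(words[i - k:i])
    if counts.get(ngram, 0) > 0:
        return counts[ngram] / counts[context]
    return alpha * stupid_backoff(counts, total_words, words, i, k - 1, alpha)
```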
N-gram Smoothing Summary
- Add-1 smoothing: OK for text categorization, not for language modeling.
- The most commonly used method in NLP: Extended Interpolated Kneser-Ney (see textbook).
- For very large N-gram collections, like the Web: stupid backoff.
- For IR: variants of interpolation, discriminative models (choose λ to maximize retrieval metrics, not perplexity).
Language Modeling Toolkits
- SRILM: http://www.speech.sri.com/projects/srilm/
- KenLM: https://kheafield.com/code/kenlm/
Google N-Gram Release, August 2006
Google N-Gram Release
  serve as the incoming 92
  serve as the incubator 99
  serve as the independent 794
  serve as the index 223
  serve as the indication 72
  serve as the indicator 120
  serve as the indicators 45
  serve as the indispensable 111
  serve as the indispensible 40
  serve as the individual 234
http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
Google Book N-grams http://ngrams.googlelabs.com/
Higher Order LMs for IR
Models of Text Generation
Ranking with Language Models
Ranking with LMs: Main Components
- Query probability: what is the probability of generating the given query from a language model?
- Document probability: what is the probability of generating the given document from a language model?
- Model comparison: how close are two language models?
Ranking Using LMs: Multinomial
Ranking with LMs: Multi-Bernoulli
Score 1: Query Likelihood
Score 2: Document Likelihood
Score 3: Likelihood Ratio (Odds)
Score 4: Model Comparison
Kullback-Leibler Divergence
- Relative entropy between two distributions: the cost in bits of coding with Q when the true distribution is P.
- Entropy: $H(P) = -\sum_i P(i) \log P(i)$
- $D_{KL}(P \| Q) = -\sum_i P(i) \log Q(i) - \left(-\sum_i P(i) \log P(i)\right)$, i.e., cross-entropy minus entropy.
Kullback-Leibler Divergence
$D_{KL}(P \| Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}$
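A minimal sketch, with distributions represented as plain dicts (all numbers illustrative):

```python
import math

def kl_divergence(p, q):
    # D_KL(P || Q) = sum_i P(i) * log(P(i) / Q(i))
    # Terms with P(i) = 0 contribute 0; Q(i) = 0 where P(i) > 0 blows up,
    # which is one reason smoothed (never-zero) LMs matter in practice.
    return sum(p_i * math.log(p_i / q[i]) for i, p_i in p.items() if p_i > 0)

p = {"a": 0.5, "b": 0.5}
q = {"a": 0.9, "b": 0.1}
print(kl_divergence(p, q))  # ~0.511 nats; D_KL(P || P) would be 0
print(kl_divergence(q, p))  # ~0.368 nats: KL divergence is not symmetric
```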
Two-stage Smoothing [Zhai & Lafferty 02]
$P(w \mid d) = (1-\lambda)\, \dfrac{c(w,d) + \mu\, p(w \mid C)}{|d| + \mu} + \lambda\, p(w \mid U)$
- Stage 1 (Dirichlet prior, Bayesian): explains unseen words, using the collection LM $p(w \mid C)$.
- Stage 2 (two-component mixture): explains noise in the query, using a user background model $p(w \mid U)$, which can be approximated by $p(w \mid C)$.
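A minimal sketch of the two stages combined; μ, λ, and all names are illustrative, and the user background model is approximated by the collection model as the slide suggests:

```python
def p_two_stage(term, doc_counts, doc_len, p_coll, mu=2000, lam=0.5, p_user=None):
    # Stage 1 (Dirichlet prior): explains unseen words in the document.
    # Stage 2 (mixture with a background model): explains noise in the query.
    p_user = p_user or p_coll  # user background model ~ collection LM
    dirichlet = (doc_counts.get(term, 0) + mu * p_coll(term)) / (doc_len + mu)
    return (1 - lam) * dirichlet + lam * p_user(term)
```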
Structured Document Retrieval [Ogilvie & Callan 03]
- Want to combine different parts of a document (title, abstract, body parts $D_1 \ldots D_k$) with appropriate weights; anchor text can be treated as a part of a document; applicable to XML retrieval.
- Generative view: select a part $D_j$ and generate a query word from it.
- For query $Q = q_1 q_2 \ldots q_m$:
  $p(Q \mid D, R{=}1) = \prod_{i=1}^{m} p(q_i \mid D, R{=}1) = \prod_{i=1}^{m} \sum_{j=1}^{k} s(D_j \mid D, R{=}1)\, p(q_i \mid D_j, R{=}1)$
- The part-selection probability $s(D_j \mid D, R{=}1)$ serves as the weight for $D_j$ and can be trained using EM.
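A minimal sketch of the mixture above; the part weights and per-part term probabilities are illustrative inputs (in practice the weights would be trained with EM):

```python
def p_query_structured(query_terms, part_weights, p_term_given_part):
    # p(Q | D, R=1) = prod_i sum_j s(D_j | D) * p(q_i | D_j)
    # part_weights: {part_name: s(D_j | D)}, summing to 1 over parts.
    score = 1.0
    for q in query_terms:
        score *= sum(w * p_term_given_part(q, part)
                     for part, w in part_weights.items())
    return score
```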
LMs for IR: Rules of Thumb
LMs vs. vector space model (1)
- LMs have some things in common with vector space models.
- Term frequency is directly in the model, but it is not scaled in LMs.
- Probabilities are inherently length-normalized; cosine normalization does something similar for vector space.
- Mixing document and collection frequencies has an effect similar to idf: terms that are rare in the general collection but common in some documents will have a greater influence on the ranking.
LMs vs. vector space model (2)
- Commonalities:
  - Term frequency is directly in the model.
  - Probabilities are inherently length-normalized.
  - Mixing document and collection frequencies has an effect similar to idf.
- Differences:
  - LMs are based on probability theory; the vector space model is based on similarity, a geometric/linear-algebra notion.
  - Collection frequency vs. document frequency.
  - Details of term frequency, length normalization, etc.
Vector space (tf-idf) vs. LM
- The language modeling approach always does better in these experiments...
- ... but note that where the approach shows significant gains is at higher levels of recall.
LM vs. Prob. Model for IR
- The main difference is whether relevance figures explicitly in the model or not: the LM approach attempts to do away with modeling relevance.
- The LM approach assumes that documents and expressions of information problems are of the same type.
- Computationally tractable, intuitively appealing.
LM vs. Prob. Model for IR
- Problems of the basic LM approach:
  - The assumption of equivalence between document and information-problem representation is unrealistic.
  - Very simple models of language.
  - Relevance feedback is difficult to integrate, as are user preferences and other general issues of relevance.
  - Can't easily accommodate phrases, passages, Boolean operators.
- Current extensions focus on putting relevance back into the model, etc.
Ambiguity makes queries difficult
- Example: does a query like "AA" mean American Airlines? or Alcoholics Anonymous?
Query Clarity [Cronen-Townsend et al., SIGIR 2002]
- Clarity score ~ low ambiguity.
- Compare a language model over the relevant documents for a query to a language model over all possible documents.
- The more different these are, the clearer the query is: compare "programming perl" vs. "the".
Clarity score
$\mathrm{clarity}(Q) = \sum_{w \in V} P(w \mid Q) \log_2 \frac{P(w \mid Q)}{P_{coll}(w)}$
Predicting Query Difficulty [Cronen-Townsend et al. 02]
- Observations:
  - Discriminative queries tend to be easier.
  - Comparing the query model and the collection model can indicate how discriminative a query is.
- Method:
  - Define query clarity as the KL divergence between an estimated query model (or relevance model) and the collection LM:
    $\mathrm{clarity}(Q) = \sum_w p(w \mid Q) \log \frac{p(w \mid Q)}{p(w \mid \mathrm{Collection})}$
  - An enriched query LM can be estimated by exploiting pseudo-feedback (e.g., a relevance model).
- There is a correlation between clarity scores and retrieval performance.
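A minimal sketch of the clarity computation, with toy query and collection models as plain dicts (all numbers illustrative):

```python
import math

def clarity(p_w_given_q, p_w_coll):
    # clarity(Q) = sum_w P(w|Q) * log2(P(w|Q) / P(w|Collection))
    # = KL divergence between the query LM and the collection LM.
    return sum(p * math.log2(p / p_w_coll[w])
               for w, p in p_w_given_q.items() if p > 0)

coll = {"perl": 0.01, "programming": 0.04, "the": 0.90, "misc": 0.05}
focused = {"perl": 0.5, "programming": 0.5}  # e.g., "programming perl"
diffuse = {"the": 0.92, "misc": 0.08}        # e.g., "the"
print(clarity(focused, coll))  # ~4.6 bits: far from the background, clear query
print(clarity(diffuse, coll))  # ~0.08 bits: close to the background, unclear
```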
Clarity scores on the TREC-7 collection
Can use many more features:
http://www.slideshare.net/davidcarmel/sigir12-tutorial-query-perfromance-prediction-for-ir