CS 572: Information Retrieval


CS 572: Information Retrieval. Spring 2016. Lecture 9: Language Models for IR (cont'd), 2/10/2016. Acknowledgments: some slides in this lecture were adapted from Chris Manning (Stanford) and Jin Kim (UMass).

New: IR based on Language Model (LM)
[Diagram: an information need generates a query Q; each document d_1 ... d_n in the collection has its own language model M_d1 ... M_dn, and documents are scored by the query-generation probability P(Q | M_d).]
A common search heuristic is to use words that you expect to find in matching documents as your query ("why, I saw Sergey Brin advocating that strategy on late night TV one night in my hotel room, so it must be good!"). The LM approach directly exploits that idea.

Probabilistic Language Modeling
Goal: compute the probability of a document, a sentence, or a sequence of words: P(W) = P(w_1, w_2, w_3, w_4, w_5, ..., w_n)
Related task: probability of an upcoming word: P(w_5 | w_1, w_2, w_3, w_4)
A model that computes either of these, P(W) or P(w_n | w_1, w_2, ..., w_{n-1}), is called a language model. A better name would be "the grammar", but "language model" or LM is standard.
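As a concrete illustration (not part of the original slides), here is a minimal sketch of a maximum-likelihood bigram language model in Python; the toy corpus, tokenization, and function name are assumptions made for the example.

from collections import Counter

def train_bigram_lm(sentences):
    """Estimate MLE bigram probabilities P(w_i | w_{i-1}) from tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        unigrams.update(padded[:-1])
        bigrams.update(zip(padded[:-1], padded[1:]))
    # P(w | prev) = count(prev, w) / count(prev)
    return {(prev, w): c / unigrams[prev] for (prev, w), c in bigrams.items()}

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
lm = train_bigram_lm(corpus)
print(lm[("the", "cat")])   # 0.5: "cat" follows "the" in half the observed cases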

Evaluation: How good is our model?
Does our language model prefer good sentences to bad ones? It should assign higher probability to real or frequently observed sentences than to ungrammatical or rarely observed ones.
We train the parameters of our model on a training set and test the model's performance on data we haven't seen. A test set is an unseen dataset, different from the training set and totally unused during training. An evaluation metric tells us how well our model does on the test set.

Training on the test set
We can't allow test sentences into the training set; otherwise we will assign them an artificially high probability when we encounter them in the test set. Training on the test set is bad science!

Extrinsic evaluation of N-gram models
The best evaluation for comparing models A and B: put each model in a task (spelling corrector, speech recognizer, IR system), run the task, and get an accuracy for A and for B (how many misspelled words corrected properly, how many relevant/non-relevant docs retrieved), then compare the accuracies.
Problematic: it is time-consuming (re-index docs, re-run search, user study) and can take days or weeks, and it is difficult to pinpoint problems in a complex system/task.

Intrinsic Evaluation: Perplexity
Perplexity is a bad approximation of extrinsic task performance unless the test data looks just like the training data, so it is generally only useful in pilot experiments. But it is helpful to think about.

Intrinsic Evaluation: Perplexity
The Shannon Game: How well can we predict the next word?
  "I always order pizza with cheese and ___"  (mushrooms 0.1, pepperoni 0.1, anchovies 0.01, ..., fried rice 0.0001, ..., and 1e-100)
  "The 33rd President of the US was ___"
  "I saw a ___"
Unigrams are terrible at this game. (Why?)
A better model of a text is one which assigns a higher probability to the word that actually occurs.

Perplexity (formal definition)
The best language model is one that best predicts an unseen test set, i.e., gives the highest P(sentence).
Perplexity is the inverse probability of the test set, normalized by the number of words:
  PP(W) = P(w_1 w_2 ... w_N)^(-1/N), i.e., the Nth root of 1 / P(w_1 w_2 ... w_N)
By the chain rule: PP(W) = [ ∏_{i=1..N} 1 / P(w_i | w_1 ... w_{i-1}) ]^(1/N)
For bigrams: PP(W) = [ ∏_{i=1..N} 1 / P(w_i | w_{i-1}) ]^(1/N)
Minimizing perplexity is the same as maximizing probability.
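To make the definition concrete, here is a minimal sketch (not from the slides) that computes bigram-model perplexity of a test sequence in Python; the unsmoothed model, the probability floor for unseen bigrams, and the variable names are illustrative assumptions.

import math

def perplexity(test_tokens, bigram_prob):
    """Perplexity of a test sequence under a bigram model.
    bigram_prob: dict mapping (prev, word) -> P(word | prev)."""
    log_prob = 0.0
    n = 0
    padded = ["<s>"] + test_tokens
    for prev, word in zip(padded[:-1], padded[1:]):
        p = bigram_prob.get((prev, word), 1e-12)  # tiny floor for unseen bigrams (illustration only)
        log_prob += math.log(p)
        n += 1
    return math.exp(-log_prob / n)   # PP(W) = P(w_1..w_N)^(-1/N)

Lower perplexity means the model assigned higher probability to the held-out text; the probability floor stands in for the smoothing discussed below.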

Perplexity as branching factor
Let's suppose a sentence consisting of random digits. What is the perplexity of this sentence according to a model that assigns P = 1/10 to each digit?
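Working the example out (the answer is implied by the slide but not written there): for a sentence of N random digits,
  PP(W) = P(w_1 ... w_N)^(-1/N) = ((1/10)^N)^(-1/N) = 10
so the perplexity equals the branching factor of 10 equally likely choices at each position.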

Lower perplexity = better model
Training on 38 million words, testing on 1.5 million words, WSJ:
  N-gram order:  Unigram  Bigram  Trigram
  Perplexity:    962      170     109

The perils of overfitting
N-grams only work well for word prediction if the test corpus looks like the training corpus. In real life, it often doesn't. We need to train robust models that generalize!
One kind of generalization problem: zeros, i.e., things that never occur in the training set but do occur in the test set.

Summary: Discounts for Smoothing

Smoothing: Interpolation

Smoothing: Basic Interpolation Model
General formulation of the LM for IR:
  p(Q, d) = p(d) · ∏_{t ∈ Q} ( (1 − λ) p(t) + λ p(t | M_d) )
where p(t) is the general (collection) language model and p(t | M_d) is the individual-document model.
The user has a document in mind and generates the query from this document. The equation represents the probability that the document the user had in mind was in fact this one.

Jelinek-Mercer Smoothing

Dirichlet Smoothing
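The formulas on the Jelinek-Mercer and Dirichlet slides did not survive extraction, so as a hedged illustration here is a minimal Python sketch of query-likelihood scoring with both smoothing methods, using the standard forms p_JM(t|d) = (1 − λ)·p_ML(t|d) + λ·p(t|C) and p_Dir(t|d) = (c(t,d) + μ·p(t|C)) / (|d| + μ). The data structures, default parameter values, and function name are illustrative assumptions, not the course's reference implementation.

import math
from collections import Counter

def query_likelihood(query_terms, doc_terms, collection_prob,
                     method="dirichlet", lam=0.5, mu=2000):
    """Log query-likelihood score of one document for a query.
    collection_prob: dict term -> p(t | C), the collection language model."""
    tf = Counter(doc_terms)
    dlen = len(doc_terms)
    score = 0.0
    for t in query_terms:
        p_c = collection_prob.get(t, 1e-9)        # background (collection) probability
        if method == "jelinek-mercer":
            p = (1 - lam) * (tf[t] / dlen) + lam * p_c
        else:                                      # Dirichlet prior smoothing
            p = (tf[t] + mu * p_c) / (dlen + mu)
        score += math.log(p)
    return score

Documents would then be ranked by this score; λ and μ are tuned on held-out data, as described on the next slide.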

How to set the lambdas? Use a held-out corpus.
[Data split: Training Data | Held-Out Data | Test Data]
Choose the λs to maximize the probability of the held-out data: fix the N-gram probabilities (on the training data), then search for the λs that give the largest probability to the held-out set (equivalently, the lowest perplexity on the held-out data).
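A minimal sketch of that tuning loop (illustrative only; the grid values and the interpolated-model scoring function are assumptions):

def tune_lambda(heldout_sentences, interp_logprob, grid=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Pick the lambda that maximizes held-out log-likelihood.
    interp_logprob(sentence, lam) returns the interpolated model's log-probability."""
    best_lam, best_ll = None, float("-inf")
    for lam in grid:
        ll = sum(interp_logprob(s, lam) for s in heldout_sentences)
        if ll > best_ll:
            best_lam, best_ll = lam, ll
    return best_lam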

Huge web-scale N-grams
How to deal with, e.g., the Google N-gram corpus?
Pruning: only store N-grams with count > threshold (remove singletons of higher-order N-grams); entropy-based pruning.
Efficiency: efficient data structures like tries; Bloom filters (approximate language models); store words as indexes, not strings; use Huffman coding to fit large numbers of words into two bytes; quantize probabilities (4-8 bits instead of an 8-byte float).

Smoothing for Web-scale N-grams
Stupid backoff (Brants et al. 2007): no discounting, just use relative frequencies.
  S(w_i | w_{i−k+1}^{i−1}) = count(w_{i−k+1}^{i}) / count(w_{i−k+1}^{i−1})   if count(w_{i−k+1}^{i}) > 0
                           = 0.4 · S(w_i | w_{i−k+2}^{i−1})                  otherwise
  Base case: S(w_i) = count(w_i) / N
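A minimal recursive sketch of stupid backoff in Python (the backoff factor 0.4 follows the slide; the flat count dictionary and function name are illustrative assumptions):

def stupid_backoff(context, word, counts, total_words, alpha=0.4):
    """Unnormalized score S(word | context) with stupid backoff.
    counts: dict mapping n-gram tuples (of any order) to their counts."""
    if not context:                       # base case: relative unigram frequency
        return counts.get((word,), 0) / total_words
    ngram = tuple(context) + (word,)
    hist = tuple(context)
    if counts.get(ngram, 0) > 0:
        return counts[ngram] / counts[hist]
    return alpha * stupid_backoff(context[1:], word, counts, total_words, alpha)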

N-gram Smoothing Summary
Add-1 smoothing: OK for text categorization, not for language modeling.
The most commonly used method in NLP: Extended Interpolated Kneser-Ney (see textbook).
For very large N-grams like the Web: stupid backoff.
For IR: variants of interpolation, discriminative models (choose λ to maximize retrieval metrics, not perplexity).

Language Modeling Toolkits SRILM http://www.speech.sri.com/projects/srilm/ KenLM https://kheafield.com/code/kenlm/

Google N-Gram Release, August 2006

Google N-Gram Release (sample 4-gram counts):
  serve as the incoming 92
  serve as the incubator 99
  serve as the independent 794
  serve as the index 223
  serve as the indication 72
  serve as the indicator 120
  serve as the indicators 45
  serve as the indispensable 111
  serve as the indispensible 40
  serve as the individual 234
http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html

Google Book N-grams http://ngrams.googlelabs.com/

Higher Order LMs for IR

Models of Text Generation

Ranking with Language Models

Ranking with LMs: Main Components
Query probability: what is the probability of generating the given query from a language model?
Document probability: what is the probability of generating the given document from a language model?
Model comparison: how close are two language models?

Ranking Using LMs: Multinomial

Ranking with LMs: Multi-Bernoulli

Score: Query Likelihood

Score 2: Document Likelihood

Score: Likelihood ratio (odds)

Score: Model Comparison

Kullback-Leibler Divergence
Relative entropy between two distributions: the cost in bits of coding using Q when the true distribution is P.
  H(P) = − Σ_i P(i) log P(i)
  D_KL(P ‖ Q) = − Σ_i P(i) log Q(i) − ( − Σ_i P(i) log P(i) )

Kullback-Leibler Divergence
  D_KL(P ‖ Q) = Σ_i P(i) log ( P(i) / Q(i) )
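To connect this to the "Score: Model Comparison" slide above, here is a minimal sketch (an illustrative assumption, not the lecture's code) that ranks documents by KL divergence between a query language model and each smoothed document language model:

import math

def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) = sum_i P(i) * log(P(i) / Q(i)); p, q: dicts term -> probability."""
    return sum(pi * math.log(pi / max(q.get(t, 0.0), eps))
               for t, pi in p.items() if pi > 0)

def rank_by_kl(query_model, doc_models):
    """Rank documents by increasing KL(query model || document model)."""
    return sorted(doc_models, key=lambda d: kl_divergence(query_model, doc_models[d]))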

Two-stage Smoothing [Zhai & Lafferty 02]
Stage 1: explain unseen words with a Dirichlet prior (Bayesian) using the collection LM p(w|C).
Stage 2: explain noise in the query with a two-component mixture using a user background model p(w|U), which can be approximated by p(w|C).
  P(w | d) = (1 − λ) · ( c(w,d) + μ·p(w|C) ) / ( |d| + μ ) + λ·p(w|U)
(Slide credit: ChengXiang Zhai, 2008.)
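A one-function sketch of that formula (the default parameter values and the background-model argument are illustrative assumptions):

def two_stage_prob(tf_wd, doc_len, p_wC, p_wU, mu=2000, lam=0.1):
    """Two-stage smoothed P(w | d): Dirichlet prior with the collection LM (stage 1)
    inside a query-noise mixture with the user background model (stage 2)."""
    dirichlet = (tf_wd + mu * p_wC) / (doc_len + mu)
    return (1 - lam) * dirichlet + lam * p_wU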

Structured Document Retrieval [Ogilvie & Callan 03]
[Diagram: a document D with parts D_1 ... D_k, e.g., Title, Abstract, Body-Part1, Body-Part2.]
- Want to combine different parts of a document with appropriate weights.
- Anchor text can be treated as a part of a document.
- Applicable to XML retrieval.
Generation process: select a part D_j and generate a query word from it. For query Q = q_1 q_2 ... q_m:
  p(Q | D, R=1) = ∏_{i=1}^{m} p(q_i | D, R=1) = ∏_{i=1}^{m} Σ_{j=1}^{k} s(D_j | D, R=1) · p(q_i | D_j, R=1)
The part selection probability s(D_j | D, R=1) serves as the weight for D_j and can be trained using EM.
(Slide credit: ChengXiang Zhai, 2008.)

LMs for IR: Rules of Thumb

LMs vs. vector space model (1)
LMs have some things in common with vector space models.
Term frequency is directly in the model, but it is not scaled in LMs.
Probabilities are inherently length-normalized; cosine normalization does something similar for vector space.
Mixing document and collection frequencies has an effect similar to idf: terms rare in the general collection, but common in some documents, will have a greater influence on the ranking.

LMs vs. vector space model (2)
Commonalities: term frequency is directly in the model; probabilities are inherently length-normalized; mixing document and collection frequencies has an effect similar to idf.
Differences: LMs are based on probability theory, while the vector space model is based on similarity, a geometric/linear algebra notion; collection frequency vs. document frequency; details of term frequency, length normalization, etc.

Vector space (tf-idf) vs. LM
The language modeling approach always does better in these experiments, but note that where the approach shows significant gains is at higher levels of recall.

LM vs. Prob. Model for IR
The main difference is whether relevance figures explicitly in the model or not.
The LM approach attempts to do away with modeling relevance, and assumes that documents and expressions of information problems are of the same type.
It is computationally tractable and intuitively appealing.

LM vs. Prob. Model for IR
Problems of the basic LM approach:
- The assumption of equivalence between document and information-problem representation is unrealistic.
- Very simple models of language.
- Relevance feedback is difficult to integrate, as are user preferences and other general issues of relevance.
- Can't easily accommodate phrases, passages, Boolean operators.
Current extensions focus on putting relevance back into the model, etc.

Ambiguity makes queries difficult: American Airlines? or Alcoholics Anonymous? (Both abbreviate to "AA".)

Query Clarity
Clarity score ~ low ambiguity (Cronen-Townsend et al., SIGIR 2002).
Compare a language model over the relevant documents for a query with a language model over all possible documents. The more different these are, the clearer the query is.
Example: "programming perl" vs. "the".

Clarity score
  clarity(Q) = Σ_{w ∈ V} P(w | Q) · log_2 ( P(w | Q) / P_coll(w) )

Predicting Query Difficulty [Cronen-Townsend et al. 02]
Observations: discriminative queries tend to be easier; comparing the query model and the collection model can indicate how discriminative a query is.
Method: define query clarity as the KL divergence between an estimated query model (or relevance model) and the collection LM:
  clarity(Q) = Σ_w p(w | Q) · log ( p(w | Q) / p(w | Collection) )
An enriched query LM can be estimated by exploiting pseudo feedback (e.g., a relevance model). Clarity scores correlate with retrieval performance.
(Slide credit: ChengXiang Zhai, 2008.)
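A minimal sketch of the clarity computation (the query-model estimation from top-ranked documents is an illustrative assumption following the pseudo-feedback idea above; function names are not from the lecture):

import math
from collections import Counter

def clarity_score(query_model, collection_model, eps=1e-12):
    """clarity(Q) = sum_w p(w|Q) * log2( p(w|Q) / p(w|Collection) )."""
    return sum(pq * math.log2(pq / max(collection_model.get(w, 0.0), eps))
               for w, pq in query_model.items() if pq > 0)

def query_model_from_feedback(top_docs):
    """Crude query model: term distribution over the top-ranked (pseudo-relevant) docs."""
    counts = Counter(t for doc in top_docs for t in doc)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}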

Clarity scores on TREC-7 collection

Can use many more features: http://www.slideshare.net/davidcarmel/sigir12-tutorial-query-perfromance-prediction-for-ir