
Machine Learning for Language Modelling
Part 2: N-gram smoothing
Marek Rei

Recap
P(word) = (number of times we see this word in the text) / (total number of words in the text)
P(word | context) = (number of times we see context followed by word) / (number of times we see context)
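As a concrete illustration of these maximum-likelihood estimates, here is a minimal Python sketch (the toy corpus and function names are illustrative, not from the slides):

```python
from collections import Counter

def mle_estimates(tokens):
    """Maximum-likelihood unigram and bigram probabilities from a token list."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)

    def p_word(word):
        # P(word) = count(word) / total number of words in the text
        return unigram_counts[word] / total

    def p_word_given_context(word, context):
        # P(word | context) = count(context word) / count(context)
        if unigram_counts[context] == 0:
            return 0.0
        return bigram_counts[(context, word)] / unigram_counts[context]

    return p_word, p_word_given_context

tokens = "the weather is nice the weather is cold".split()
p_word, p_cond = mle_estimates(tokens)
print(p_word("weather"))        # 2/8
print(p_cond("is", "weather"))  # 2/2
```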

Recap
P(the weather is nice) = ?
Using the chain rule:
P(the weather is nice) = P(the) * P(weather | the) * P(is | the weather) * P(nice | the weather is)

Recap
Using the Markov assumption:
P(the weather is nice) = P(the | <s>) * P(weather | the) * P(is | weather) * P(nice | is)

Data sparsity
The scientists are trying to solve the mystery
If we have not seen "trying to solve" in our training data, then P(solve | trying to) = 0
The system will consider this to be an impossible word sequence
Any sentence containing "trying to solve" will have 0 probability
We cannot compute perplexity on the test set (division by 0)

Data sparsity
Shakespeare's works contain N = 884,647 tokens, with V = 29,066 unique words.
Shakespeare produced around 300,000 unique bigrams.
There are V * V ≈ 844,000,000 possible bigrams.
So 99.96% of the possible bigrams were never seen.

Data sparsity
We cannot expect to see all possible sentences (or word sequences) in the training data.
Solution 1: use more training data. This does help, but usually not enough.
Solution 2: assign non-zero probability to unseen n-grams. This is known as smoothing.

Smoothing: intuition
Take a bit from the ones who have, and distribute to the ones who don't: P(w | trying to)
Make sure there's still a valid probability distribution!

Really simple approach
During training:
- Choose your vocabulary (e.g., all words that occur at least 5 times)
- Replace all other words with a special token <unk>
During testing:
- Replace any word not in the fixed vocabulary with <unk>
But we still have zero counts for longer n-grams.
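A minimal sketch of this vocabulary-truncation step (the threshold of 5 comes from the example above; the function names are illustrative):

```python
from collections import Counter

def build_vocab(train_tokens, min_count=5):
    """Keep words seen at least min_count times in the training data."""
    counts = Counter(train_tokens)
    return {word for word, count in counts.items() if count >= min_count}

def apply_vocab(tokens, vocab):
    """Replace any word outside the fixed vocabulary with the <unk> token."""
    return [word if word in vocab else "<unk>" for word in tokens]
```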

Add-1 smoothing (Laplace)
Add 1 to every n-gram count, as if we had seen every possible n-gram at least once:
PAdd-1(wi | wi-1) = (count(wi-1 wi) + 1) / (count(wi-1) + V)

Add-1 counts
[table comparing the original bigram counts with the Add-1 counts not reproduced in this transcription]

Add-1 probabilities
[table comparing the original bigram probabilities with the Add-1 probabilities not reproduced in this transcription]

Reconstituting counts
Let's calculate the counts that we should have seen in order to get the same probabilities as Add-1 smoothing.

Add-1 reconstituted counts
[table comparing the original counts with the Add-1 reconstituted counts not reproduced in this transcription]

Add-1 smoothing
Advantage: very easy to implement
Disadvantages:
- Takes too much probability mass from real events
- Assigns too much probability to unseen events
- Doesn't take the predicted word into account
- Not really used in practice

Additive smoothing
Add k to each n-gram count: PAdd-k(wi | wi-1) = (count(wi-1 wi) + k) / (count(wi-1) + k * V)
A generalisation of Add-1 smoothing.
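A minimal sketch of additive (add-k) smoothing for a bigram model, which reduces to Add-1 smoothing when k = 1 (the toy corpus and names are illustrative):

```python
from collections import Counter

def addk_bigram_model(tokens, vocab_size, k=1.0):
    """Return a function computing add-k smoothed P(word | context) for bigrams."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))

    def prob(word, context):
        # (count(context word) + k) / (count(context) + k * V)
        return (bigram_counts[(context, word)] + k) / (unigram_counts[context] + k * vocab_size)

    return prob

tokens = "the weather is nice the weather is cold".split()
prob = addk_bigram_model(tokens, vocab_size=5, k=1.0)
print(prob("nice", "is"))  # (1 + 1) / (2 + 5) -- seen bigram
print(prob("warm", "is"))  # (0 + 1) / (2 + 5) -- unseen bigram still gets probability mass
```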

Good-Turing smoothing
Nc = frequency of frequency c: the count of things we've seen c times
Example: hello how are you hello hello you
w      c
hello  3
you    2
how    1
are    1
N3 = 1, N2 = 1, N1 = 2

Good-Turing smoothing
Let's find the probability mass assigned to words that occurred only once, and distribute that probability mass to words that were never seen:
P(unseen) ≈ N1 / N
c - original (real) word count
(c + 1) * N(c+1) / N - the probability mass for words with frequency c + 1
c* = (c + 1) * N(c+1) / Nc - new (adjusted) word count

Good-Turing smoothing
Bigram frequencies of frequencies from 22 million AP bigrams, and Good-Turing re-estimations, after Church and Gale (1991).
[table not reproduced in this transcription]
N0 = V^2 - (number of observed bigrams)

Good-Turing smoothing
c* = (c + 1) * N(c+1) / Nc - the Good-Turing adjusted count for the bigram
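A minimal sketch of the Good-Turing adjusted counts, using the hello/you/how/are example from above; where N(c+1) = 0 this sketch simply keeps the raw count, which is a simplification of the fixes discussed on the later slides:

```python
from collections import Counter

def good_turing_adjusted_counts(tokens):
    """Adjusted counts c* = (c + 1) * N(c+1) / N(c) for every observed word."""
    counts = Counter(tokens)
    # N(c): how many distinct words occur exactly c times
    freq_of_freq = Counter(counts.values())

    adjusted = {}
    for word, c in counts.items():
        if freq_of_freq.get(c + 1, 0) > 0:
            adjusted[word] = (c + 1) * freq_of_freq[c + 1] / freq_of_freq[c]
        else:
            adjusted[word] = float(c)  # simplification: keep the raw count when N(c+1) = 0
    return adjusted

tokens = "hello how are you hello hello you".split()
print(good_turing_adjusted_counts(tokens))
# hello: 3 kept (N4 = 0); you: (2+1)*N3/N2 = 3.0; how, are: (1+1)*N2/N1 = 1.0
```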

Good-Turing smoothing
If there are many words that we have only seen once, then unseen words get a high probability.
If there are only very few words we've seen once, then unseen words get a low probability.
The adjusted counts still sum up to the original value.

Good-Turing smoothing
Problem: what if N(c+1) = 0?
c     Nc
100   1
50    2
49    4
48    5
...   ...
Here N50 = 2 but N51 = 0.

Good-Turing smoothing
Solutions:
- Approximate Nc at high values of c with a smooth curve, e.g. f(c) = a * c^b, choosing a and b so that f(c) approximates Nc at the known values
- Assume that c is reliable at high values, and only use c* for low values
- Either way, we have to make sure that the probabilities are still normalised

Backoff
Perhaps we need to find the next word in the sequence: Next Tuesday I will varnish ...
If we have not seen "varnish the" or "varnish thou" in the training data, both Add-1 and Good-Turing will give P(the | varnish) = P(thou | varnish)
But intuitively P(the | varnish) > P(thou | varnish)
Sometimes it's helpful to use less context

Backoff
Consult the most detailed model first and, if that doesn't work, back off to a lower-order model:
- If the trigram is reliable (has a high count), then use the trigram LM
- Otherwise, back off and use a bigram LM
- Continue backing off until you reach a model that has some counts
We need to make sure we discount the higher-order probabilities, or we won't have a valid probability distribution.

Stupid Backoff
S(wi | wi-1) = count(wi-1 wi) / count(wi-1) if count(wi-1 wi) > 0, otherwise α * S(wi)
S(wi) = count(wi) / N, where N is the number of words in the text
This gives a score, not a valid probability
Works well in practice, on large-scale datasets
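A minimal sketch of the Stupid Backoff score for a trigram model, with α = 0.4 as in the original formulation by Brants et al. (2007) (the toy corpus and names are illustrative):

```python
from collections import Counter

def stupid_backoff(tokens, alpha=0.4):
    """Stupid Backoff score S(word | context), backing off trigram -> bigram -> unigram."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))

    def score(word, context=()):
        context = tuple(context)[-2:]  # use at most two preceding words
        if len(context) == 2 and trigrams[context + (word,)] > 0:
            return trigrams[context + (word,)] / bigrams[context]
        if len(context) >= 1 and bigrams[(context[-1], word)] > 0:
            weight = alpha if len(context) == 2 else 1.0
            return weight * bigrams[(context[-1], word)] / unigrams[context[-1]]
        # backed off all the way to the unigram relative frequency
        return (alpha ** len(context)) * unigrams[word] / n

    return score

tokens = "the weather is nice the weather is cold".split()
score = stupid_backoff(tokens)
print(score("nice", ("weather", "is")))  # trigram seen: 1/2
print(score("cold", ("sunny", "and")))   # unseen context: 0.4^2 * count(cold)/N
```

Note that the scores are not normalised over the vocabulary, which is exactly why this is a score rather than a probability.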

Interpolation
Instead of backing off, we could combine all the models: use evidence from the unigram, bigram, trigram, etc.
Pinterp(wi | wi-2 wi-1) = λ3 * P(wi | wi-2 wi-1) + λ2 * P(wi | wi-1) + λ1 * P(wi), with λ1 + λ2 + λ3 = 1
Usually works better than backoff
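A minimal sketch of linearly interpolating unigram, bigram and trigram estimates; the lambda values here are illustrative placeholders, and in practice they are tuned on development data as described on the next slide:

```python
def interpolated_prob(word, context, p_unigram, p_bigram, p_trigram,
                      lambdas=(0.1, 0.3, 0.6)):
    """Linear interpolation: l1*P(w) + l2*P(w|w2) + l3*P(w|w1 w2), with l1+l2+l3 = 1."""
    l1, l2, l3 = lambdas
    w1, w2 = context  # the two preceding words
    return (l3 * p_trigram(word, (w1, w2))
            + l2 * p_bigram(word, w2)
            + l1 * p_unigram(word))
```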

Interpolation
Split the data into training data, development data and test data:
- Train the different n-gram language models on the training data
- Using these language models, optimise the lambdas to perform best on the development data
- Evaluate the final system on the test data

Jelinek-Mercer interpolation
Lambda values can change based on the n-gram context.
It is usually better to group lambdas together, for example based on n-gram frequency, to reduce the number of parameters.

Absolute discounting
Combines ideas from interpolation and Good-Turing.
Good-Turing subtracts approximately the same amount from each count, so we can use that observation directly.

Absolute discounting
Subtract a constant amount D from each count.
Assign this probability mass to the lower-order language model.

Absolute discounting
P(wi | wi-2 wi-1) = max(count(wi-2 wi-1 wi) - D, 0) / count(wi-2 wi-1) + λ(wi-2 wi-1) * P(wi | wi-1)
- the first term is the discounted trigram probability; the second is the backoff weight times the bigram probability
- λ(wi-2 wi-1) is based on the number of unique words wj that follow the context (wi-2 wi-1), which is also the number of trigrams we subtract D from
- D is a free variable

Interpolation vs absolute discounting
Interpolation: (trigram weight) * (trigram probability) + (bigram weight) * (bigram probability)
Absolute discounting: the trigram weight is replaced by the discounted relative frequency (trigram count minus the discounting parameter D, divided by the context count), and the left-over mass becomes the weight on the bigram probability
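A minimal sketch of interpolated absolute discounting for a bigram model (D = 0.75 and the function names are illustrative; the backoff weight redistributes exactly the mass removed by the discount):

```python
from collections import Counter

def absolute_discounting_bigram(tokens, d=0.75):
    """Interpolated absolute discounting: discounted bigram + weighted unigram backoff."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    # number of unique words following each context: the bigrams we subtract D from
    followers = Counter(prev for (prev, _word) in bigrams)

    def prob(word, prev):
        p_unigram = unigrams[word] / n
        if unigrams[prev] == 0:
            return p_unigram  # unseen context: fall back to the unigram model
        discounted = max(bigrams[(prev, word)] - d, 0) / unigrams[prev]
        backoff_weight = (d / unigrams[prev]) * followers[prev]
        return discounted + backoff_weight * p_unigram

    return prob

tokens = "the weather is nice the weather is cold".split()
prob = absolute_discounting_bigram(tokens)
print(prob("nice", "is"))  # discounted bigram probability + backoff mass * P(nice)
print(prob("warm", "is"))  # unseen bigram: only the backoff term remains
```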

Kneser-Ney smoothing
Heads up: Kneser-Ney is considered the state of the art in n-gram language modelling.
Absolute discounting is good, but it has some problems. For example: if we have not seen a bigram at all, we rely only on the unigram probability.

Kneser-Ney smoothing
I can't see without my reading ___
If we've never seen the bigram "reading glasses", we'll back off to just P(glasses)
"Francisco" is more common than "glasses", therefore P(Francisco) > P(glasses)
But "Francisco" almost always occurs only after "San"

Kneser-Ney smoothing
Instead of P(w) - how likely is w - we want to use Pcontinuation(w) - how likely is w to appear as a novel continuation:
Pcontinuation(w) = (number of unique words that come before w) / (total number of unique bigrams)

Kneser-Ney smoothing
For a bigram language model:
PKN(wi | wi-1) = max(count(wi-1 wi) - D, 0) / count(wi-1) + λ(wi-1) * Pcontinuation(wi)
where λ(wi-1) = (D / count(wi-1)) * (number of unique words that follow wi-1)
General form: the same recursion applies at higher orders, with continuation counts used for the lower-order distributions.

Kneser-Ney smoothing
Corpus:
Paul is running
Mary is running
Nick is cycling
They are running
Pcontinuation(is) = ?
Pcontinuation(Paul) = ?
Pcontinuation(running) = ?
PKN(running | is) = ?

Kneser-Ney smoothing
Pcontinuation(is) = 3/11
Pcontinuation(Paul) = 1/11
Pcontinuation(running) = 2/11
PKN(running | is) = 1/3 + (2/3) * (2/11)
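A minimal sketch of bigram Kneser-Ney matching the worked example above (sentence-start tokens <s> are included so that there are 11 unique bigrams; D = 1 reproduces the answer on this slide, although D ≈ 0.75 is a more common default):

```python
from collections import Counter

def kneser_ney_bigram(sentences, d=1.0):
    """Bigram Kneser-Ney: discounted bigram probability + lambda * continuation probability."""
    bigrams = Counter()
    context_counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split()
        for prev, word in zip(tokens, tokens[1:]):
            bigrams[(prev, word)] += 1
            context_counts[prev] += 1

    unique_bigrams = set(bigrams)
    preceders = Counter(word for (_prev, word) in unique_bigrams)  # unique words before w
    followers = Counter(prev for (prev, _word) in unique_bigrams)  # unique words after prev

    def p_continuation(word):
        return preceders[word] / len(unique_bigrams)

    def p_kn(word, prev):
        # assumes prev has been seen as a context in the training data
        discounted = max(bigrams[(prev, word)] - d, 0) / context_counts[prev]
        lam = (d / context_counts[prev]) * followers[prev]
        return discounted + lam * p_continuation(word)

    return p_kn, p_continuation

sentences = ["Paul is running", "Mary is running", "Nick is cycling", "They are running"]
p_kn, p_cont = kneser_ney_bigram(sentences, d=1.0)
print(p_cont("is"))           # 3/11
print(p_cont("running"))      # 2/11
print(p_kn("running", "is"))  # 1/3 + (2/3) * (2/11)
```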

Recap
Assigning zero probabilities causes problems.
We use smoothing to distribute some probability mass to unseen n-grams.

Recap Add-1 smoothing Good-Turing smoothing

Recap Backoff Interpolation

Recap Absolute discounting Kneser-Ney

References
Speech and Language Processing. Daniel Jurafsky & James H. Martin (2000)
Evaluating language models. Julia Hockenmaier. https://courses.engr.illinois.edu/cs498jh/
Language Models. Nitin Madnani, Jimmy Lin (2010). http://www.umiacs.umd.edu/~jimmylin/cloud-2010-spring/
An Empirical Study of Smoothing Techniques for Language Modeling. Stanley F. Chen, Joshua Goodman (1998). http://www.speech.sri.com/projects/srilm/manpages/pdfs/chen-goodman-tr-10-98.pdf
Natural Language Processing. Dan Jurafsky & Christopher Manning (2012). https://www.coursera.org/course/nlp

Extra materials

Katz Backoff Discount using Good-Turing, then distribute the extra probability mass to lower-order n-grams