TDDE09, 729A27 Natural Language Processing (2017). Language Modelling. Marco Kuhlmann, Department of Computer and Information Science. Partially based on material developed by David Chiang. This work is licensed under a Creative Commons Attribution 4.0 International License.

Language models A language model is a model of what words are more or less likely to be generated in some language. More specifically, it is a model that predicts what the next word will be, given the words so far. Language models can also be defined over characters (or signs, or symbols) instead of words.

Text classification using language models The word probabilities in the Naive Bayes classifier define a simple, class-specific language model.

"When I look at an article in Russian, I say: This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode." Warren Weaver (1894–1978)

The noisy channel A sender produces English E, which passes through a noisy channel and reaches the receiver as Russian R: Russian is noisy English. Decoding uses a language model P(E): argmax_E P(E | R) = argmax_E P(R | E) · P(E)

Autocomplete and autocorrect

Pinyin input method ping guo gong si ping

Shannon's game Shannon's game is like Hangman, except that it's no fun at all: you may only guess one character at a time, and when moving on to the next character, you lose all information about previously guessed characters. Claude Shannon (1916–2001) Image source: Wikipedia

N-gram models

N-gram models An n-gram is a sequence of n words or characters (unigram, bigram, trigram, quadrigram). An n-gram model is a language model where the probability of a word depends only on the n − 1 immediately preceding words.

Unigram model A unigram language model is a bag-of-words model: P(w_1 ⋯ w_N) = P(w_1) · P(w_2) ⋯ P(w_N). Thus the probabilities of all words in the text are mutually independent.

Markov models The probability of each item depends only on the immediately preceding item. The probability of a sequence of items is the product of these conditional probabilities. For a well-defined model, we need to mark the beginning and the end of a sequence. Андрéй Мáрков (1856–1922) Image source: Wikipedia

Probability of a sequence of words Under a Markov model, the probability of a sequence w_1 ⋯ w_N is P(w_1 | BOS) · P(w_2 | w_1) ⋯ P(w_N | w_{N−1}) · P(EOS | w_N), where BOS is the beginning-of-sentence marker and EOS the end-of-sentence marker.

Bigram models A bigram model is a Markov model on sequences of words: P(w_i | w_1 ⋯ w_{i−1}) = P(w_i | w_{i−1}). Thus the probability of a word depends only on the immediately preceding word.

Formal definition of an n-gram model n: the model's order (1 = unigram, 2 = bigram, …). V: a set of possible words (or characters); the vocabulary. P(w | u): a probability that specifies how likely it is to observe the word w after the context (n − 1)-gram u; one value for each combination of a word w and a context u.

Simple uses of n-gram models Prediction To predict the next word, we can choose the word that has the highest probability among all possible words w: predicted word = argmax_w P(w | preceding words) Generation We can generate a random sequence of words where each word w is sampled with probability P(w | preceding words).
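To make the two uses concrete, here is a minimal Python sketch of prediction and generation from a bigram model. The toy probability table, the BOS/EOS token names, and the function names are illustrative assumptions, not something defined on the slides.

```python
import random

# A toy bigram model: probs[u][w] = P(w | u).
# Probabilities and the BOS/EOS markers are made up for illustration.
probs = {
    "<BOS>": {"the": 0.6, "a": 0.4},
    "the":   {"cat": 0.5, "dog": 0.3, "<EOS>": 0.2},
    "a":     {"cat": 0.4, "dog": 0.4, "<EOS>": 0.2},
    "cat":   {"<EOS>": 1.0},
    "dog":   {"<EOS>": 1.0},
}

def predict(context):
    """Prediction: return the most probable next word after `context`."""
    dist = probs[context]
    return max(dist, key=dist.get)

def generate():
    """Generation: sample words until the end-of-sentence marker is drawn."""
    word, sentence = "<BOS>", []
    while True:
        dist = probs[word]
        word = random.choices(list(dist), weights=list(dist.values()))[0]
        if word == "<EOS>":
            return sentence
        sentence.append(word)

print(predict("<BOS>"))   # 'the'
print(generate())         # e.g. ['a', 'dog']
```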

[Figure: a bigram model over a two-word vocabulary {w_1, w_2}, drawn as a state diagram with states BOS, w_1, w_2, EOS and arcs labelled with the transition probabilities P(w_1 | BOS), P(w_2 | BOS), P(w_1 | w_1), P(w_2 | w_1), P(w_1 | w_2), P(w_2 | w_2), P(EOS | w_1), P(EOS | w_2).]

Learning n-gram models

Estimating unigram probabilities P(Sherlock) = c(Sherlock) / N, where c(Sherlock) is the count of the unigram Sherlock and N is the total number of unigrams (tokens).

Estimating bigram probabilities P(Holmes | Sherlock) = c(Sherlock Holmes) / c(Sherlock ·), where c(Sherlock Holmes) is the count of the bigram Sherlock Holmes and c(Sherlock ·) is the count of bigrams starting with Sherlock.

Estimating unigram and bigram probabilities P(w) = c(w) / N: the count of the unigram w, divided by the total number of tokens. P(w | u) = c(uw) / c(u ·): the count of the bigram uw, divided by the count of bigrams starting with u.
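As a sketch of these maximum-likelihood estimators, the following Python snippet counts unigrams and bigrams in a toy corpus and computes the relative frequencies; the corpus and the BOS/EOS padding convention are assumptions made for illustration.

```python
from collections import Counter

# Toy training data; in practice this would be a tokenised corpus.
sentences = [["sherlock", "holmes", "smiled"],
             ["sherlock", "nodded"]]

unigrams, bigrams = Counter(), Counter()
for sent in sentences:
    padded = ["<BOS>"] + sent + ["<EOS>"]
    # N counts word tokens excluding BOS (see the notation slide).
    unigrams.update(padded[1:])
    bigrams.update(zip(padded, padded[1:]))

N = sum(unigrams.values())

def p_unigram(w):
    # P(w) = c(w) / N
    return unigrams[w] / N

def p_bigram(w, u):
    # P(w | u) = c(uw) / c(u ·)
    context_count = sum(c for (a, _), c in bigrams.items() if a == u)
    return bigrams[(u, w)] / context_count if context_count else 0.0

print(p_unigram("sherlock"))           # 2/7 ≈ 0.286
print(p_bigram("holmes", "sherlock"))  # 1/2 = 0.5
```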

A problem with maximum likelihood estimation Shakespeare's collected works contain ca. 31,000 word types. There are 961 million different bigrams with these words. In his texts we only find 300,000 bigrams. This means that 99.97% of all theoretically possible bigrams have count 0. Under a bigram model, each sentence containing one of those bigrams will receive a probability of zero. Zero probabilities destroy information!

Notation N: number of word tokens, excluding BOS. c(w): count of unigram (word) w. c(uw): count of bigram uw. c(u ·): count of bigrams starting with u. V: number of word types, including UNK. V_{k+}: number of word types seen at least k times. V_k: number of word types seen exactly k times. Source: Chen and Goodman (1998)

Additive smoothing We can do add-k smoothing as for the Naive Bayes classifier: P(w) = (c(w) + k) / (N + kV). (Why the extra kV in the denominator? Because the probabilities must still sum to one.)

A problem with additive smoothing Chiang looks at a sample of 55,708,861 words of English. In this sample, the word the appears 3,579,493 times. Add-one smoothing yields P(the) = 0.0641. How many times does the model expect the word the to appear in an equal-sized sample? Why is that a problem?
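Working through the numbers on the slide (with the rounded probability 0.0641), the expected count comes out below the observed count:

```latex
\[
  \underbrace{0.0641}_{P(\text{the})} \times \underbrace{55{,}708{,}861}_{\text{sample size}}
  \approx 3{,}570{,}938 \;<\; 3{,}579{,}493 = c(\text{the})
\]
```

So the smoothed model expects the to occur several thousand times less often than it was actually observed: add-one smoothing has shifted a noticeable amount of probability mass away from a very frequent word.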

A problem with additive smoothing We have only a constant amount of probability mass that we can distribute among the word types. Therefore, although we are adding to the count of every word type, we are not adding to the probability of every word type: the probabilities still need to sum to one. In effect, we take away a certain percentage of the probability mass from each word type and redistribute it equally to all word types.

Additive smoothing for unigram probabilities The formula for add-k smoothing of unigrams, P(w) = (c(w) + k) / (N + kV), can be written as a mixture of the maximum-likelihood estimate and the uniform distribution over word types: P(w) = λ · c(w)/N + (1 − λ) · 1/V, where λ = N / (N + kV).
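For reference, the algebra behind this mixture view is a one-line rewrite using the notation above:

```latex
\[
  P(w) = \frac{c(w) + k}{N + kV}
       = \frac{N}{N + kV}\cdot\frac{c(w)}{N} + \frac{kV}{N + kV}\cdot\frac{1}{V}
       = \lambda\,\frac{c(w)}{N} + (1 - \lambda)\,\frac{1}{V},
  \qquad \lambda = \frac{N}{N + kV}.
\]
```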

Witten-Bell smoothing for unigram probabilities Writing V_{1+} for the number of word types seen at least once, choosing k such that kV = V_{1+} gives us Witten-Bell smoothing. This is another form of additive smoothing and can also be written as the mixture P(w) = λ · c(w)/N + (1 − λ) · 1/V, where λ = N / (N + V_{1+}).

Absolute discounting for unigram probabilities Intuitively, smoothing should not decrease the expected count of a word (relative to its empirical count) by more than about one. In absolute discounting, we subtract a fixed discount d (with 0 < d < 1) from the count of every seen word type and distribute the total gain equally to all types: P(w) = max(c(w) − d, 0)/N + λ · 1/V, where λ = d · V_{1+} / N.
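A minimal Python sketch of absolute discounting for unigrams, following the formula above; the toy counts, the vocabulary, and the discount d = 0.5 are assumptions for illustration.

```python
from collections import Counter

def absolute_discount_unigram(counts, vocab, d=0.5):
    """P(w) = max(c(w) - d, 0) / N + lam / V, with lam = d * V_1plus / N."""
    N = sum(counts.values())
    V = len(vocab)
    V_1plus = sum(1 for w in vocab if counts[w] > 0)
    lam = d * V_1plus / N
    return {w: max(counts[w] - d, 0) / N + lam / V for w in vocab}

counts = Counter({"the": 5, "cat": 2, "sat": 1})
vocab = {"the", "cat", "sat", "dog", "<UNK>"}   # includes unseen types
p = absolute_discount_unigram(counts, vocab, d=0.5)
print(p)
print(sum(p.values()))   # 1.0 (up to floating-point rounding)
```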

Smoothing bigram probabilities When smoothing unigram probabilities, we took probability mass away from the ML estimate and gave it back equally to all word types. For bigram probabilities, a better option is to give the mass back to word types in proportion to their unigram probability.

Smoothing bigram probabilities Witten-Bell smoothing: P(w | u) = λ(u) · c(uw)/c(u ·) + (1 − λ(u)) · P(w), where λ(u) = c(u ·) / (c(u ·) + V_{1+}(u ·)) and V_{1+}(u ·) is the number of word types seen at least once after u. Absolute discounting: P(w | u) = max(c(uw) − d, 0)/c(u ·) + λ(u) · P(w), where λ(u) = d · V_{1+}(u ·) / c(u ·).
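The absolute-discounting variant for bigrams might be sketched as follows; the toy counts, the unigram back-off distribution, and the discount d = 0.75 are assumptions, and the code follows the standard interpolated formulation rather than anything specific to these slides.

```python
from collections import Counter

def smoothed_bigram(bigram_counts, p_unigram, d=0.75):
    """P(w | u) = max(c(uw) - d, 0) / c(u .) + lam(u) * P(w),
       with lam(u) = d * V_1plus(u .) / c(u .)."""
    context_total = Counter()   # c(u .)
    context_types = Counter()   # V_1plus(u .)
    for (u, _), c in bigram_counts.items():
        context_total[u] += c
        context_types[u] += 1
    def prob(w, u):
        if context_total[u] == 0:      # unseen context: back off entirely
            return p_unigram(w)
        lam = d * context_types[u] / context_total[u]
        return max(bigram_counts[(u, w)] - d, 0) / context_total[u] + lam * p_unigram(w)
    return prob

bigram_counts = Counter({("sherlock", "holmes"): 2, ("sherlock", "smiled"): 1})
unigram_probs = {"holmes": 0.3, "smiled": 0.2, "nodded": 0.5}   # toy unigram model
P = smoothed_bigram(bigram_counts, lambda w: unigram_probs.get(w, 0.0))
print(P("holmes", "sherlock"))   # discounted bigram estimate + back-off mass
print(P("nodded", "sherlock"))   # unseen bigram: gets only back-off mass
```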

Unknown words In addition to new bigrams, a new text may even contain completely new words. For these, smoothing over the known vocabulary will not help. One way to deal with this is to introduce a special word type UNK and smooth it like any other word type in the vocabulary. For additive smoothing: hallucinate k occurrences of the unknown word. At test time, we replace every unknown word with UNK.
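One common way to realise the UNK trick is to fix the vocabulary on the training data and map everything outside it to UNK; the frequency threshold used below is an assumption, not something prescribed by the slides.

```python
from collections import Counter

def build_vocab(sentences, min_count=2):
    """Keep words seen at least `min_count` times; everything else becomes UNK."""
    counts = Counter(w for sent in sentences for w in sent)
    return {w for w, c in counts.items() if c >= min_count} | {"<UNK>"}

def replace_unknowns(sentence, vocab):
    return [w if w in vocab else "<UNK>" for w in sentence]

train = [["the", "cat", "sat"], ["the", "dog", "sat"]]
vocab = build_vocab(train, min_count=2)           # {'the', 'sat', '<UNK>'}
print(replace_unknowns(["the", "hamster", "sat"], vocab))
# ['the', '<UNK>', 'sat']
```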

Evaluation of n-gram models

Intrinsic and extrinsic evaluation Intrinsic evaluation: How does the method or model score with respect to a given evaluation measure? (In classification: accuracy, precision, recall.) Extrinsic evaluation: How much does the method or model help the application in which it is embedded? (Examples: predictive input, machine translation, speech recognition.)

Intrinsic evaluation of language models, intuition Learn a language model from a set of training sentences and use it to compute the probability of a set of test sentences. If the language model is good, the probability of the test sentences should be high. This assumes that the training sample and the test sample are similar. In the following, for simplicity we assume that we have only one, potentially very long test sentence.

The problem with different sentence lengths Under a Markov model, the probability of a sentence is the product of the bigram probabilities. Therefore, all other things being equal, the probability of a sentence decreases with the sentence length. This makes it hard to compare the probability of the test data to the probability of the training data. Intuitively, we would like to average over sentence length.

Perplexity and entropy The perplexity of a language model P on a test sample x_1, …, x_N is PP = 2^{−(1/N) Σ_i log_2 P(x_i | x_1 ⋯ x_{i−1})}. This measure is not easy to understand intuitively. We will therefore focus on the term in the exponent: H = −(1/N) Σ_i log_2 P(x_i | x_1 ⋯ x_{i−1}). This measure is known as the entropy of the test sample.

From probabilities to surprisal Instead of computing probabilities for the test sentence, we will compute negative log probabilities: P(w | c) becomes −log P(w | c). Intuitively, this measures how surprised we are about seeing the test sentence, given our language model (high probability = low surprisal). We can then simply divide by the number of words in the sentence to average over sentence length.
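Putting the last few slides together, here is a small Python sketch that averages surprisals over a test sequence and exponentiates to get the perplexity; the toy bigram model and the floor value used for unseen events are assumptions.

```python
import math

def entropy_and_perplexity(test_words, bigram_prob):
    """Average negative log2 probability (entropy) and 2**entropy (perplexity)."""
    padded = ["<BOS>"] + test_words + ["<EOS>"]
    surprisals = [-math.log2(bigram_prob(w, u))
                  for u, w in zip(padded, padded[1:])]
    H = sum(surprisals) / len(surprisals)
    return H, 2 ** H

# A toy bigram model; a probability of exactly 0 would make the entropy infinite,
# so unseen events are given a tiny floor probability here.
probs = {("<BOS>", "the"): 0.5, ("the", "cat"): 0.25, ("cat", "<EOS>"): 0.5}
H, PP = entropy_and_perplexity(["the", "cat"],
                               lambda w, u: probs.get((u, w), 1e-10))
print(H, PP)   # H ≈ 1.33 bits, PP ≈ 2.52
```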

Negative log probabilities [Figure: plot of the negative log probability −log p (vertical axis, 0 to 5) against the probability p (horizontal axis, 0 to 1); the surprisal is 0 at p = 1 and grows without bound as p approaches 0.]

Entropy and smoothing When smoothing a language model, we are redistributing probability mass to observations we have never made. This leaves a smaller amount of the probability mass to the observations that we actually did make during learning. When we evaluate the smoothed model on the training data, its entropy will therefore be higher than without smoothing.

The problem with unknown words The held-out data will in general contain unknown words, i.e. words that we have not seen in the training data. Because we are multiplying probabilities, a single unknown word will bring down the probability of the held-out data to zero. Zero probabilities destroy information! The conclusion is that we should never compare language models with different vocabularies.

Edit distance

Autocomplete and autocorrect

Edit distance Many misspelled words are quite similar to the correctly spelled words; there are typically only a few mistakes (lingvisterma, word prefiction). Given a misspelled word, we want to generate one or several similar words and propose the most probable one. This idea requires a measure of orthographic similarity between two words.

Edit operations We can measure the similarity between two words by the number of operations needed to transform one into the other. Here we assume three types of operations: insertion (add a letter before or after another one), deletion (delete a letter from the word), and substitution (substitute a letter for another one).

Edit operations, example How many edits does it take to go from intention to execution? intention → (delete the letter i) ntention → (substitute e for n) etention → (substitute x for t) exention → (insert the letter c) execntion → (substitute u for n) execution

Letter alignments
i n t e * n t i o n
* e x e c u t i o n

i n t e n * t i o n
e x * e c u t i o n

Levenshtein distance Each edit operation is assigned a cost: The cost for insertion and deletion is 1. The cost for substitution is 0 if the substituted letter is the same as the original one, and 1 in all other cases. The Levenshtein distance between two words is the minimal cost for transforming one word into the other.

Computing the Levenshtein distance We would like to find a sequence of operations which transforms one word into the other and has minimal cost. The search space for this problem is huge; in fact, in theory there are infinitely many sequences of operations. However, if we are only interested in sequences with minimal cost, we can solve the problem using dynamic programming: the Wagner-Fischer algorithm.

Wagner-Fischer algorithm The Wagner-Fischer algorithm is a dynamic programming algorithm (dynamic programming = recursion + memoisation) that computes the Levenshtein distance of two words. Its central data structure is a matrix L. Each cell in L will hold the Levenshtein distance for two prefixes of the input words. The cells are filled for longer and longer prefixes, from prefixes of length zero all the way up to the complete words.
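A compact Python sketch of the Wagner-Fischer matrix fill (the recurrence is the one illustrated on the following slides); the function name is of course arbitrary.

```python
def levenshtein(source, target):
    """Wagner-Fischer: L[i][j] = distance between source[:i] and target[:j]."""
    m, n = len(source), len(target)
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):          # transform a prefix into the empty string
        L[i][0] = i
    for j in range(1, n + 1):          # build a prefix from the empty string
        L[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub_cost = 0 if source[i - 1] == target[j - 1] else 1
            L[i][j] = min(L[i - 1][j] + 1,               # deletion
                          L[i][j - 1] + 1,               # insertion
                          L[i - 1][j - 1] + sub_cost)    # substitution (or match)
    return L[m][n]

print(levenshtein("intention", "execution"))   # 5
```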

[Figure: the empty matrix L, with the characters of intention labelling the rows (from # at the bottom) and the characters of execution labelling the columns (from # on the left).] We want to transform intention into execution.

L(0, 0) [Figure: the matrix with the value 0 entered in cell L(0, 0).] The cost of transforming the empty string into the empty string is zero.

L(i, 0) [Figure: the first column filled in with the values 0, 1, …, 9.] We can transform intention into the empty string by deleting all characters, one after the other.

L(4, 3) [Figure: the matrix with the bottom rows and the first column filled in; cell L(4, 3) is highlighted.] In the general case there are three possibilities. We want to pick the possibility that yields the minimal cost.

L(4, 3) [Figure: the candidate value for cell L(4, 3) is 4.] How can we transform inte into exe? Possibility 1: Remove the last e and transform int into exe.

L(4, 3) [Figure: the candidate value for cell L(4, 3) is 5.] How can we transform inte into exe? Possibility 2: Transform inte into ex and insert the final e.

L(4, 3) [Figure: the candidate value for cell L(4, 3) is 3.] How can we transform inte into exe? Possibility 3: Substitute e for e and transform int into ex.

L(4, 3) [Figure: cell L(4, 3) is filled in with the value 3, together with a back pointer.] Possibility 3 gives the minimal score. We store a back pointer to remember this.

L(9, 9)
n 9 8 8 8 8 8 8 7 6 5
o 8 7 7 7 7 7 7 6 5 6
i 7 6 6 6 6 6 6 5 6 7
t 6 5 5 5 5 5 5 6 7 8
n 5 4 4 4 4 5 6 7 7 7
e 4 3 4 3 4 5 6 6 7 8
t 3 3 3 3 4 5 5 6 7 8
n 2 2 2 3 4 5 6 7 7 7
i 1 1 2 3 4 5 6 6 7 8
# 0 1 2 3 4 5 6 7 8 9
  # e x e c u t i o n
The Levenshtein distance for this pair of words is 5 (the top-right cell).

L(9, 9) [Figure: the same completed matrix, with back pointers.] To find a sequence of operations that witnesses this distance, we follow the back pointers.

Computational complexity Let m, n denote the lengths of the two words. The memory required by the Wagner-Fischer algorithm is in O(mn); this corresponds to the size of the matrix L. (It can be improved to O(max(m, n)).) The runtime required by the Wagner-Fischer algorithm is in O(mn); this is the number of cells that need to be filled.
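The memory improvement mentioned above can be realised by keeping only two rows of the matrix at a time; the sketch below returns the distance but, without the full matrix, there are no back pointers for recovering an alignment.

```python
def levenshtein_two_rows(source, target):
    """Levenshtein distance using O(len(target)) memory: two matrix rows."""
    prev = list(range(len(target) + 1))          # row for the empty source prefix
    for i, s in enumerate(source, start=1):
        curr = [i]                               # cost of deleting source[:i]
        for j, t in enumerate(target, start=1):
            sub_cost = 0 if s == t else 1
            curr.append(min(prev[j] + 1,         # deletion
                            curr[j - 1] + 1,     # insertion
                            prev[j - 1] + sub_cost))
        prev = curr
    return prev[-1]

print(levenshtein_two_rows("intention", "execution"))   # 5
```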

Other measures of edit distance Practical systems for spelling correction typically use more fine-grained weights than the ones that we use here (for example, typing s instead of a is more probable than typing d instead of a). We can still use the same algorithm for computing the distance; we only have to change the weights. An even more realistic measure is the Damerau-Levenshtein distance, which also permits transposition (switching the positions of two adjacent characters), with cost 1.