Language Modelling. Marco Kuhlmann Department of Computer and Information Science Partially based on material developed by David Chiang

TDDE09, 729A27 Natural Language Processing (2017)
This work is licensed under a Creative Commons Attribution 4.0 International License.

Language models
A language model is a model of which words are more or less likely to be generated in some language.
More specifically, it is a model that predicts what the next word will be, given the words so far.
Language models can also be defined on characters (or signs, or symbols) instead of words.

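Text classification using language models
The word probabilities in the Naive Bayes classifier define a simple language model, one per class:
$$\hat{c} = \arg\max_c \; P(c) \prod_{i=1}^{N} P(w_i \mid c)$$
Here the product $\prod_{i} P(w_i \mid c)$ is the class-specific language model.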

"When I look at an article in Russian, I say: 'This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.'"
Warren Weaver (1894–1978)

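The Noisy Channel
The sender emits an English text E, noise corrupts it on the way, and the receiver observes a Russian text R. In other words: Russian is noisy English.
$$\arg\max_E P(E \mid R) = \arg\max_E P(R \mid E)\, P(E)$$
The factor $P(E)$ is the language model.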
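The step from $P(E \mid R)$ to $P(R \mid E)\,P(E)$ is an application of Bayes' rule. Spelled out as a short derivation (the denominator $P(R)$ does not depend on $E$, so it can be dropped inside the argmax):

```latex
\begin{align*}
\arg\max_E P(E \mid R)
  &= \arg\max_E \frac{P(R \mid E)\, P(E)}{P(R)} \\
  &= \arg\max_E P(R \mid E)\, P(E)
\end{align*}
```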

Autocomplete and autocorrect

Pinyin input method
[Figure: pinyin input method, example input "ping guo gong si"]

Shannon's game
Shannon's game is like Hangman, except that:
- It's no fun at all.
- You may only guess one character at a time.
- When moving on to the next character, you lose all information about previously guessed characters.
Claude Shannon (1916–2001)

N-gram models

N-gram models
An n-gram is a sequence of n words or characters: unigram, bigram, trigram, quadrigram, and so on.
An n-gram model is a language model where the probability of a word depends only on the $n - 1$ immediately preceding words.

Unigram model
A unigram language model is a bag-of-words model:
$$P(w_1 \cdots w_N) = \prod_{i=1}^{N} P(w_i)$$
Thus the probabilities of all words in the text are mutually independent.

Markov models
The probability of each item depends only on the immediately preceding item.
The probability of a sequence of items is the product of these conditional probabilities.
For a well-defined model, we need to mark the beginning and the end of a sequence.
Andrey Markov (Андрей Марков, 1856–1922)

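Probability of a sequence of words
$$P(w_1 \cdots w_N) = P(w_1 \mid \text{BOS}) \cdot \prod_{i=2}^{N} P(w_i \mid w_{i-1}) \cdot P(\text{EOS} \mid w_N)$$
BOS marks the beginning of the sentence and EOS marks its end.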

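Bigram models
A bigram model is a Markov model on sequences of words:
$$P(w_1 \cdots w_N) = \prod_{i=1}^{N+1} P(w_i \mid w_{i-1}), \qquad w_0 = \text{BOS},\; w_{N+1} = \text{EOS}$$
Thus the probability of a word depends only on the immediately preceding word.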
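As a minimal sketch of how a bigram model scores a sentence, the following multiplies the conditional probabilities along the sentence, padded with begin- and end-of-sentence markers. The table `probs`, the marker strings `<s>` and `</s>`, and the function name `sentence_probability` are all hypothetical and chosen only for illustration.

```python
BOS, EOS = "<s>", "</s>"

# Hypothetical bigram table: probs[u][w] = P(w | u). Each row sums to one.
probs = {
    BOS: {"sherlock": 0.5, "holmes": 0.5},
    "sherlock": {"holmes": 0.75, EOS: 0.25},
    "holmes": {"sherlock": 0.25, EOS: 0.75},
}


def sentence_probability(probs, words):
    """Multiply P(w_i | w_{i-1}) over the sentence, padded with BOS and EOS."""
    padded = [BOS] + list(words) + [EOS]
    p = 1.0
    for u, w in zip(padded, padded[1:]):
        # A bigram that is missing from the table gets probability zero.
        p *= probs.get(u, {}).get(w, 0.0)
    return p


print(sentence_probability(probs, ["sherlock", "holmes"]))  # 0.5 * 0.75 * 0.75
print(sentence_probability(probs, ["holmes", "holmes"]))    # contains an unseen bigram
```

Note how a single bigram that is absent from the table drives the probability of the whole sentence to zero.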

Formal definition of an n-gram model
$n$: the model's order (1 = unigram, 2 = bigram, ...)
$V$: a set of possible words (or characters); the vocabulary
$P(w \mid u)$: a probability that specifies how likely it is to observe the word $w$ after the context $(n-1)$-gram $u$; one value for each combination of a word $w$ and a context $u$

Simple uses of n-gram models
Prediction: To predict the next word, we can choose the word that has the highest probability among all possible words $w$:
$$\text{predicted word} = \arg\max_w P(w \mid \text{preceding words})$$
Generation: We can generate a random sequence of words where each word $w$ is sampled with probability $P(w \mid \text{preceding words})$.
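Both uses can be sketched in a few lines for the bigram case, where the "preceding words" reduce to a single context word. The table `probs` and the function names `predict_next` and `generate` are hypothetical; the `max_len` cut-off is an added safety measure for this sketch, not part of the model.

```python
import random

BOS, EOS = "<s>", "</s>"

# Hypothetical bigram table: probs[u][w] = P(w | u).
probs = {
    BOS: {"sherlock": 0.5, "holmes": 0.5},
    "sherlock": {"holmes": 0.75, EOS: 0.25},
    "holmes": {"sherlock": 0.25, EOS: 0.75},
}


def predict_next(probs, context):
    """Prediction: return the word w that maximises P(w | context)."""
    dist = probs[context]
    return max(dist, key=dist.get)


def generate(probs, rng, max_len=20):
    """Generation: sample one word at a time until EOS is drawn."""
    words, context = [], BOS
    while len(words) < max_len:
        dist = probs[context]
        w = rng.choices(list(dist), weights=list(dist.values()))[0]
        if w == EOS:
            break
        words.append(w)
        context = w
    return words


print(predict_next(probs, "sherlock"))
print(generate(probs, random.Random(0)))
```

Passing an explicit `random.Random` instance keeps the sampling reproducible; the generated sequence differs from seed to seed.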

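[Figure: state diagram of a bigram model over two words $w_1$ and $w_2$, with states BOS, $w_1$, $w_2$ and EOS. The transitions are labelled $P(w_1 \mid \text{BOS})$, $P(w_2 \mid \text{BOS})$, $P(w_1 \mid w_1)$, $P(w_2 \mid w_1)$, $P(w_1 \mid w_2)$, $P(w_2 \mid w_2)$, $P(\text{EOS} \mid w_1)$ and $P(\text{EOS} \mid w_2)$.]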

Learning n-gram models

Estimating unigram probabilities
$$P(\text{Sherlock}) = \frac{c(\text{Sherlock})}{N}$$
$c(\text{Sherlock})$: count of the unigram Sherlock
$N$: total number of unigrams (tokens)

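Estimating bigram probabilities
$$P(\text{Holmes} \mid \text{Sherlock}) = \frac{c(\text{Sherlock Holmes})}{c(\text{Sherlock}\,\bullet)}$$
$c(\text{Sherlock Holmes})$: count of the bigram Sherlock Holmes
$c(\text{Sherlock}\,\bullet)$: count of bigrams starting with Sherlock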

Estimating unigram and bigram probabilities
$$P(w) = \frac{c(w)}{N}$$
This is the count of the unigram $w$, divided by the total number of tokens.
$$P(w \mid u) = \frac{c(uw)}{c(u\,\bullet)}$$
This is the count of the bigram $uw$, divided by the count of bigrams starting with $u$.
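The two relative-frequency estimates can be sketched as follows. The function name `train_mle`, the marker strings and the toy corpus are hypothetical. As an assumption of this sketch, the end-of-sentence marker is counted as a token while the beginning-of-sentence marker is not.

```python
from collections import Counter

BOS, EOS = "<s>", "</s>"


def train_mle(sentences):
    """Return functions for P(w) = c(w)/N and P(w | u) = c(uw)/c(u .)."""
    unigrams, bigrams, contexts = Counter(), Counter(), Counter()
    for words in sentences:
        padded = [BOS] + list(words) + [EOS]
        unigrams.update(padded[1:])          # tokens, excluding BOS
        for u, w in zip(padded, padded[1:]):
            bigrams[u, w] += 1               # c(uw)
            contexts[u] += 1                 # c(u .): bigrams starting with u
    n_tokens = sum(unigrams.values())

    def p_unigram(w):
        return unigrams[w] / n_tokens

    def p_bigram(w, u):
        # Unseen contexts have no estimate; this sketch returns zero for them.
        return bigrams[u, w] / contexts[u] if contexts[u] else 0.0

    return p_unigram, p_bigram


corpus = [["sherlock", "holmes", "smiled"], ["sherlock", "smiled"]]
p_unigram, p_bigram = train_mle(corpus)
print(p_unigram("sherlock"))            # 2 of 7 tokens
print(p_bigram("holmes", "sherlock"))   # 1 of 2 bigrams starting with "sherlock"
```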

A problem with maximum likelihood estimation
Shakespeare's collected works contain ca. 31,000 word types. There are 961 million different bigrams with these words.
In his texts we only find 300,000 bigrams. This means that 99.97% of all theoretically possible bigrams have count 0.
Under a bigram model, each sentence containing one of those bigrams will receive a probability of zero.
Zero probabilities destroy information!
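The figures above can be checked with a few lines of arithmetic. The variable names are hypothetical; the numbers are the ones quoted in the text.

```python
word_types = 31_000       # word types in Shakespeare's works, as quoted above
seen_bigrams = 300_000    # distinct bigrams that actually occur

possible_bigrams = word_types ** 2
unseen_share = 1 - seen_bigrams / possible_bigrams

print(possible_bigrams)          # 961000000
print(f"{unseen_share:.2%}")     # 99.97%
```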

Notation
$N$: number of word tokens, excluding BOS
$c(w)$: count of unigram (word) $w$
$c(uw)$: count of bigram $uw$
$c(u\,\bullet)$: count of bigrams starting with $u$
$V$: number of word types, including UNK
$V_{k+}$: number of word types seen at least $k$ times
$V_k$: number of word types seen exactly $k$ times
Source: Chen and Goodman (1998)

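Additive smoothing
We can do add-k smoothing as for the Naive Bayes classifier:
$$P(w) = \frac{c(w) + k}{N + kV} \qquad P(w \mid u) = \frac{c(uw) + k}{c(u\,\bullet) + kV}$$
Why is $kV$ added in the denominator?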
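A minimal sketch of add-k smoothing for unigrams, assuming the usual form $(c(w) + k) / (N + kV)$. The function name `add_k_unigram` and the toy data are hypothetical. The example checks that a word type never seen in training receives a non-zero probability and that the distribution still sums to one.

```python
from collections import Counter


def add_k_unigram(counts, vocab, k=1.0):
    """Return a dict w -> (c(w) + k) / (N + k * V) over the whole vocabulary."""
    n_tokens = sum(counts.values())
    denominator = n_tokens + k * len(vocab)
    return {w: (counts[w] + k) / denominator for w in vocab}


counts = Counter("the cat saw the dog".split())
vocab = set(counts) | {"bird"}     # "bird" is in the vocabulary but was never seen
smoothed = add_k_unigram(counts, vocab, k=1.0)

print(smoothed["the"])             # (2 + 1) / (5 + 5)
print(smoothed["bird"])            # (0 + 1) / (5 + 5)
print(sum(smoothed.values()))      # close to 1.0
```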

A problem with additive smoothing
Chiang looks at a sample of 55,708,861 words of English. In this sample, the word "the" appears 3,579,493 times.
Add-one smoothing yields P(the) = 0.0641.
How many times does the model expect the word "the" to appear in an equal-sized sample? Why is that a problem?
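A back-of-the-envelope check of the first question, using only the figures quoted above. Because 0.0641 is a rounded value, the result should be read as an order of magnitude rather than an exact figure; the variable names are hypothetical.

```python
sample_size = 55_708_861
observed_count = 3_579_493
p_the = 0.0641                     # add-one estimate as quoted (rounded)

expected_count = p_the * sample_size
shortfall = observed_count - expected_count

print(round(expected_count))       # roughly 3.57 million
print(round(shortfall))            # several thousand fewer than observed
```

Under these numbers the smoothed model expects "the" several thousand times less often than it was actually observed, which is a far larger reduction than the discount of about one that the intuition behind smoothing would suggest.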

A problem with additive smoothing
We have only a constant amount of probability mass that we can distribute among the word types, because the probabilities still need to sum to one.
Therefore, although we are adding to the count of every word type, we are not adding to the probability of every word type.
We take away a certain percentage of the probability mass from each word type and redistribute it equally to all word types.

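Additive smoothing for unigram probabilities
The formula for add-k smoothing of unigrams,
$$P(w) = \frac{c(w) + k}{N + kV},$$
can be written as a mixture of the maximum-likelihood estimate and the uniform distribution over word types:
$$P(w) = \lambda \frac{c(w)}{N} + (1 - \lambda) \frac{1}{V} \qquad \text{where} \qquad \lambda = \frac{N}{N + kV}$$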

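Witten–Bell smoothing for unigram probabilities
Writing $V_{1+}$ for the number of word types seen at least once,
$$P(w) = \frac{c(w) + k}{N + kV} \qquad \text{where} \qquad k = \frac{V_{1+}}{V}$$
gives us Witten–Bell smoothing. This is another form of additive smoothing and can also be written as
$$P(w) = \lambda \frac{c(w)}{N} + (1 - \lambda) \frac{1}{V} \qquad \text{where} \qquad \lambda = \frac{N}{N + V_{1+}}$$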
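The mixture view can be checked numerically. The sketch below assumes the mixture form $\lambda\, c(w)/N + (1 - \lambda)/V$ with $\lambda = N/(N + kV)$ for add-k smoothing and $\lambda = N/(N + V_{1+})$ for Witten–Bell smoothing. The function names and the toy numbers are hypothetical.

```python
def mixture(count, n_tokens, n_types, lam):
    """lam * c(w)/N + (1 - lam) * 1/V: ML estimate mixed with the uniform distribution."""
    return lam * count / n_tokens + (1 - lam) / n_types


def lambda_add_k(n_tokens, n_types, k):
    return n_tokens / (n_tokens + k * n_types)


def lambda_witten_bell(n_tokens, n_seen_types):
    return n_tokens / (n_tokens + n_seen_types)


# Toy numbers: N = 5 tokens, V = 5 word types, of which V_1+ = 4 were seen.
N, V, V_SEEN = 5, 5, 4
count_the = 2

add_one_direct = (count_the + 1) / (N + 1 * V)
add_one_mixture = mixture(count_the, N, V, lambda_add_k(N, V, k=1))
witten_bell = mixture(count_the, N, V, lambda_witten_bell(N, V_SEEN))
witten_bell_as_add_k = (count_the + V_SEEN / V) / (N + V_SEEN)

print(add_one_direct, add_one_mixture)       # the two forms agree
print(witten_bell, witten_bell_as_add_k)     # Witten-Bell equals add-k with k = V_1+ / V
```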

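Absolute discounting for unigram probabilities
Intuitively, smoothing should not decrease the expected count of a word (relative to its empirical count) by more than about one.
In absolute discounting, we subtract a constant $d$ (with $0 < d < 1$) from the count of every seen word type and distribute the total gain equally to all types:
$$P(w) = \frac{\max(c(w) - d,\, 0)}{N} + (1 - \lambda) \frac{1}{V} \qquad \text{where} \qquad 1 - \lambda = \frac{d\, V_{1+}}{N}$$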
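A minimal sketch of absolute discounting for unigrams, assuming a discount `d` between 0 and 1 that is subtracted from every seen type, with the freed mass `d * V_1+ / N` shared equally among all types. The function name, the symbol `d` and the toy data are assumptions of this sketch.

```python
from collections import Counter


def absolute_discount_unigram(counts, vocab, d=0.5):
    """Subtract d from each seen count; spread the total gain uniformly over vocab."""
    n_tokens = sum(counts.values())
    seen_types = sum(1 for w in vocab if counts[w] > 0)    # V_1+
    gain = d * seen_types / n_tokens                       # total mass taken away
    return {
        w: max(counts[w] - d, 0.0) / n_tokens + gain / len(vocab)
        for w in vocab
    }


counts = Counter("the cat saw the dog".split())
vocab = set(counts) | {"bird"}     # "bird" was never seen in training
discounted = absolute_discount_unigram(counts, vocab, d=0.5)

print(discounted["the"])           # (2 - 0.5)/5 + 0.4/5
print(discounted["bird"])          # 0 + 0.4/5
print(sum(discounted.values()))    # close to 1.0
```

Unlike add-k smoothing, the amount removed from a frequent word such as "the" is the same `d` that is removed from a word seen only once.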

Smoothing bigram probabilities
When smoothing unigram probabilities, we took probability mass away from the ML estimate and gave it back equally to all word types.
For bigram probabilities, a better option is to give it back to word types in proportion to their unigram probability.

Smoothing bigram probabilities
Witten–Bell smoothing
Absolute discounting
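One common way to write absolute discounting for bigrams, consistent with the idea of giving mass back in proportion to the unigram probability, is $P(w \mid u) = \max(c(uw) - d, 0)/c(u\,\bullet) + (d \cdot T(u)/c(u\,\bullet)) \cdot P(w)$, where $T(u)$ is the number of distinct word types seen after $u$. The sketch below implements this form; whether it matches the exact formula intended here is an assumption, and the names `make_discounted_bigram`, `d` and `T(u)` are hypothetical.

```python
from collections import Counter, defaultdict


def make_discounted_bigram(bigram_counts, p_unigram, d=0.5):
    """bigram_counts: Counter over (u, w) pairs; p_unigram: dict w -> P(w)."""
    context_total = Counter()        # c(u .)
    followers = defaultdict(set)     # distinct word types seen after u
    for (u, w), c in bigram_counts.items():
        context_total[u] += c
        followers[u].add(w)

    def p(w, u):
        total = context_total[u]
        if total == 0:               # unseen context: fall back to the unigram model
            return p_unigram[w]
        discounted = max(bigram_counts[u, w] - d, 0.0) / total
        reserved = d * len(followers[u]) / total   # mass taken from seen bigrams
        return discounted + reserved * p_unigram[w]

    return p


bigram_counts = Counter({("sherlock", "holmes"): 3, ("sherlock", "smiled"): 1})
p_unigram = {"sherlock": 0.25, "holmes": 0.5, "smiled": 0.25}
p = make_discounted_bigram(bigram_counts, p_unigram, d=0.5)

print(p("holmes", "sherlock"))     # 2.5/4 + 0.25 * 0.5
print(p("sherlock", "sherlock"))   # unseen bigram, still non-zero
```

As long as the unigram distribution sums to one, the conditional distribution for each context does too.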

Unknown words
In addition to new bigrams, a new text may even contain completely new words. For these, smoothing will not help.
One way to deal with this is to introduce a special word type UNK, and smooth it like any other word type in the vocabulary. For additive smoothing: hallucinate k occurrences of the unknown word.
At test time, we replace every unknown word with UNK.
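The test-time replacement can be sketched as follows. The marker string `<unk>` and the function names are hypothetical. Because UNK is added to the vocabulary, add-k smoothing over that vocabulary automatically gives it the k hallucinated occurrences.

```python
UNK = "<unk>"


def build_vocab(training_sentences):
    """Collect all word types seen in training and add the special type UNK."""
    vocab = {w for words in training_sentences for w in words}
    vocab.add(UNK)
    return vocab


def replace_unknown(words, vocab):
    """Map every word outside the vocabulary to UNK."""
    return [w if w in vocab else UNK for w in words]


vocab = build_vocab([["sherlock", "holmes", "smiled"]])
print(replace_unknown(["sherlock", "moriarty", "smiled"], vocab))
```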

Evaluation of n-gram models

Intrinsic and extrinsic evaluation
Intrinsic evaluation: How does the method or model score with respect to a given evaluation measure? Examples in classification: accuracy, precision, recall.
Extrinsic evaluation: How much does the method or model help the application in which it is embedded? Examples: predictive input, machine translation, speech recognition.

Intrinsic evaluation of language models: intuition
Learn a language model from a set of training sentences and use it to compute the probability of a set of test sentences.
If the language model is good, the probability of the test sentences should be high. This assumes that the training sample and the test sample are similar.
In the following, for simplicity we assume that we have only one, potentially very long test sentence.

The problem with different sentence lengths
Under a Markov model, the probability of a sentence is the product of the bigram probabilities.
Therefore, all other things being equal, the probability of a sentence decreases with the sentence length.
This makes it hard to compare the probability of the test data to the probability of the training data. Intuitively, we would like to average over sentence length.

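Perplexity and entropy
The perplexity of a language model $P$ on a test sample $x_1, \ldots, x_N$ is
$$\mathrm{PP}(x_1 \cdots x_N) = 2^{-\frac{1}{N} \sum_{i=1}^{N} \log_2 P(x_i \mid x_{i-1})}$$
This measure is not easy to understand intuitively. We will therefore focus on the term in the exponent:
$$H(x_1 \cdots x_N) = -\frac{1}{N} \sum_{i=1}^{N} \log_2 P(x_i \mid x_{i-1})$$
This measure is known as the entropy of the test sample.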

From probabilities to surprisal
Instead of computing probabilities for the test sentence, we will compute negative log probabilities: $P(w \mid c)$ becomes $-\log P(w \mid c)$.
Intuitively, this measures how surprised we are about seeing the test sentence, given our language model: high probability = low surprisal.
We can then simply divide by the number of words in the sentence to average over sentence length.
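A minimal sketch of the computation, assuming base-2 logarithms (so that entropy is measured in bits per word) and assuming the model has already assigned a conditional probability to each word of the test sentence. The function names and the example probabilities are hypothetical.

```python
import math


def entropy(word_probs):
    """Average surprisal: the mean of -log2 P(w | c) over the test sentence."""
    return -sum(math.log2(p) for p in word_probs) / len(word_probs)


def perplexity(word_probs):
    """Two to the power of the entropy."""
    return 2 ** entropy(word_probs)


# Probabilities that some model assigned to the four words of a test sentence.
word_probs = [0.5, 0.25, 0.25, 0.5]

print(entropy(word_probs))       # (1 + 2 + 2 + 1) / 4 = 1.5 bits per word
print(perplexity(word_probs))    # 2 ** 1.5
```

Summing logarithms instead of multiplying probabilities also avoids numerical underflow on long test samples.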

Negative log probabilities
[Plot: $-\log p$ (vertical axis, 0 to 5) as a function of $p$ (horizontal axis, 0 to 1)]

Entropy and smoothing
When smoothing a language model, we are redistributing probability mass to observations we have never made.
This leaves a smaller amount of the probability mass to the observations that we actually did make during learning.
When we evaluate the smoothed model on the training data, its entropy will therefore be higher than without smoothing.

The problem with unknown words
The held-out data will in general contain unknown words, that is, words that we have not seen in the training data.
Because we are multiplying probabilities, a single unknown word will bring down the probability of the held-out data to zero. Zero probabilities destroy information!
The conclusion is that we should never compare language models with different vocabularies.

Edit distance

Autocomplete and autocorrect

Edit distance
Many misspelled words are quite similar to the correctly spelled words; there are typically only a few mistakes. Examples: lingvisterma, word prefiction.
Given a misspelled word, we want to find one or several similar words and propose the most probable one.
This idea requires a measure of orthographic similarity between two words.

Edit operations
We can measure the similarity between two words by the number of operations needed to transform one into the other.
Here we assume three types of operations:
- insertion: add a letter before or after another one
- deletion: delete a letter from the word
- substitution: substitute a letter for another one

Edit operations, example
How many edits does it take to go from intention to execution?
intention
ntention (delete the letter i)
etention (substitute e for n)
exention (substitute x for t)
execntion (insert the letter c)
execution (substitute u for n)

Letter alignments
First alignment:
i n t e * n t i o n
* e x e c u t i o n
Second alignment:
i n t e n * t i o n
e x * e c u t i o n

Levenshtein distance
Each edit operation is assigned a cost:
- The cost for insertion and deletion is 1.
- The cost for substitution is 0 if the substituted letter is the same as the original one, and 1 in all other cases.
The Levenshtein distance between two words is the minimal cost for transforming one word into the other.

Computing the Levenshtein distance
We would like to find a sequence of operations which transforms one word into the other and has minimal cost.
The search space for this problem is huge; in fact, in theory there are infinitely many sequences of operations.
However, if we are only interested in sequences with minimal cost, we can solve the problem using dynamic programming (the Wagner–Fischer algorithm).

Wagner–Fischer algorithm
The Wagner–Fischer algorithm is a dynamic programming algorithm that computes the Levenshtein distance of two words (dynamic programming = recursion + memoisation).
Its central data structure is a matrix L. Each cell in L will hold the Levenshtein distance for two prefixes of the input words.
The cells are filled for longer and longer prefixes, from prefixes of length zero all the way up to complete words.
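A minimal sketch of the matrix-filling procedure, using unit costs for insertion and deletion and cost 0 or 1 for substitution. The function names are hypothetical; `L[i][j]` holds the distance between the first `i` letters of `source` and the first `j` letters of `target`.

```python
def levenshtein_matrix(source, target):
    """Fill the matrix L for all pairs of prefixes of source and target."""
    m, n = len(source), len(target)
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        L[i][0] = i                      # delete all i letters
    for j in range(1, n + 1):
        L[0][j] = j                      # insert all j letters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            substitution = 0 if source[i - 1] == target[j - 1] else 1
            L[i][j] = min(
                L[i - 1][j] + 1,                 # deletion
                L[i][j - 1] + 1,                 # insertion
                L[i - 1][j - 1] + substitution,  # substitution (or match)
            )
    return L


def levenshtein(source, target):
    """The distance of the complete words is the last cell of the matrix."""
    return levenshtein_matrix(source, target)[-1][-1]


print(levenshtein("intention", "execution"))   # 5
```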

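[Figure: the empty matrix L. The letters of intention label the rows from bottom to top, the letters of execution label the columns from left to right, and # marks the empty prefix.]
We want to transform intention into execution.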

L(0, 0) = 0
The cost of transforming the empty string into the empty string is zero.

L(i, 0) = i
We can transform intention into the empty string by deleting all characters, one after the other.

L(4, 3)
[Figure: the matrix L, partially filled, with cell L(4, 3) still empty]
In the general case there are three possibilities. We want to pick the possibility that yields the minimal cost.

L(4, 3)
How can we transform inte into exe?
Possibility 1: Remove the last e and transform int into exe. Cost: L(3, 3) + 1 = 4.

L(4, 3)
How can we transform inte into exe?
Possibility 2: Transform inte into ex and insert the final e. Cost: L(4, 2) + 1 = 5.

L(4, 3)
How can we transform inte into exe?
Possibility 3: Substitute e for e and transform int into ex. Cost: L(3, 2) + 0 = 3.

L(4, 3) = 3
Possibility 3 gives the minimal cost. We store a back pointer to remember this.

L(9, 9)
n  9 8 8 8 8 8 8 7 6 5
o  8 7 7 7 7 7 7 6 5 6
i  7 6 6 6 6 6 6 5 6 7
t  6 5 5 5 5 5 5 6 7 8
n  5 4 4 4 4 5 6 7 7 7
e  4 3 4 3 4 5 6 6 7 8
t  3 3 3 3 4 5 5 6 7 8
n  2 2 2 3 4 5 6 7 7 7
i  1 1 2 3 4 5 6 6 7 8
#  0 1 2 3 4 5 6 7 8 9
   # e x e c u t i o n
The Levenshtein distance for this pair of words is 5.

L(9, 9)
[Figure: the completed matrix L, with the path of back pointers from L(9, 9) to L(0, 0) highlighted]
To find a sequence of operations that witnesses this distance, we follow the back pointers.
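Following the back pointers can be sketched without storing them explicitly: starting from the last cell, we check which neighbouring cell could have produced the current value and step there. The function name `edit_script` and the operation labels are hypothetical. When several predecessors are possible, this sketch prefers the diagonal, so it returns one optimal sequence among possibly many.

```python
def edit_script(source, target):
    """Return one minimal-cost list of (operation, source_letter, target_letter)."""
    m, n = len(source), len(target)
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        L[i][0] = i
    for j in range(1, n + 1):
        L[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 1
            L[i][j] = min(L[i - 1][j] + 1, L[i][j - 1] + 1, L[i - 1][j - 1] + sub)

    # Walk back from L[m][n] to L[0][0], recording the operation at each step.
    ops, i, j = [], m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = 0 if source[i - 1] == target[j - 1] else 1
            if L[i][j] == L[i - 1][j - 1] + sub:
                name = "keep" if sub == 0 else "substitute"
                ops.append((name, source[i - 1], target[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and L[i][j] == L[i - 1][j] + 1:
            ops.append(("delete", source[i - 1], None))
            i -= 1
        else:
            ops.append(("insert", None, target[j - 1]))
            j -= 1
    ops.reverse()
    return ops


for op in edit_script("intention", "execution"):
    print(op)
```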

Computational complexity
Let m, n denote the lengths of the two words.
The memory required by the Wagner–Fischer algorithm is in O(mn); this corresponds to the size of the matrix L. This can be improved to O(max(m, n)).
The runtime required by the Wagner–Fischer algorithm is in O(mn); this is the number of cells that need to be filled.
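The memory improvement rests on the observation that each row of the matrix depends only on the row before it. A minimal sketch, with the hypothetical function name `levenshtein_two_rows`, keeps just two rows at a time; swapping the arguments so that the rows run along the shorter word is an extra choice of this sketch, valid because the unit-cost distance is symmetric.

```python
def levenshtein_two_rows(source, target):
    """Levenshtein distance using two rows instead of the full matrix."""
    if len(target) > len(source):
        source, target = target, source     # keep the rows as short as possible
    previous = list(range(len(target) + 1)) # row for the empty prefix of source
    for i, s in enumerate(source, start=1):
        current = [i]                       # cost of deleting the first i letters
        for j, t in enumerate(target, start=1):
            substitution = 0 if s == t else 1
            current.append(min(
                previous[j] + 1,                # deletion
                current[j - 1] + 1,             # insertion
                previous[j - 1] + substitution, # substitution (or match)
            ))
        previous = current
    return previous[-1]


print(levenshtein_two_rows("intention", "execution"))   # 5
```

The price of the saving is that the back pointers are gone: this variant returns the distance but cannot recover the sequence of operations.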

Other measures of edit distance
Practical systems for spelling correction typically use more fine-grained weights than the ones that we use here. For example, typing s instead of a is more probable than typing d instead of a.
We can still use the same algorithm for computing the Levenshtein distance; we only have to change the weights.
An even more realistic measure is the Damerau–Levenshtein distance, which also permits transposition, with cost 1 (transposition = switching the positions of two adjacent characters).
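Transposition can be added to the matrix-filling procedure with one extra case. The sketch below implements the restricted variant often called optimal string alignment, in which a transposed pair of letters is not edited again afterwards; it is a simplification of the full Damerau–Levenshtein distance. The function name `osa_distance` is hypothetical.

```python
def osa_distance(source, target):
    """Edit distance with insertion, deletion, substitution and adjacent transposition."""
    m, n = len(source), len(target)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = i
    for j in range(1, n + 1):
        D[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            a, b = source[i - 1], target[j - 1]
            substitution = 0 if a == b else 1
            D[i][j] = min(
                D[i - 1][j] + 1,
                D[i][j - 1] + 1,
                D[i - 1][j - 1] + substitution,
            )
            # Transposition: the last two letters of both prefixes are swapped.
            if i > 1 and j > 1 and a == target[j - 2] and source[i - 2] == b:
                D[i][j] = min(D[i][j], D[i - 2][j - 2] + 1)
    return D[m][n]


print(osa_distance("teh", "the"))   # 1, where plain Levenshtein distance gives 2
```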