Statistical Language Models. Language Models (LM), Noisy Channel Model, Simple Markov Models, Smoothing. NLP Language Models 1

Two Main Approaches to NLP. Knowledge-based (AI). Statistical models: - inspired by speech recognition: probability of the next word based on the previous ones - other statistical models. NLP Language Models 2

Probability Theory. Let X be the uncertain outcome of some event, called a random variable. V(X) is the finite set of possible outcomes (not necessarily real numbers). P(X = x) is the probability of the particular outcome x, with x ∈ V(X). Example: X is the disease of your patient, and V(X) is the set of all possible diseases. NLP Language Models 3

Probability Theory. Conditional probability: the probability of the outcome of an event given the outcome of a second event. We pick two consecutive words at random from a book; we know the first word is "the" and want the probability that the second word is "dog": P(W_2 = dog | W_1 = the) = C(W_1 = the, W_2 = dog) / C(W_1 = the). Bayes' law: P(x | y) = P(x) P(y | x) / P(y). NLP Language Models 4
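A toy numeric illustration of the formula above (the mini-corpus is hypothetical, not from the slides), in Python:

    from collections import Counter

    # P(dog | the) = C(the, dog) / C(the), estimated from a tiny corpus
    words = "the dog saw the cat and the dog barked".split()
    pairs = Counter(zip(words, words[1:]))

    p_dog_given_the = pairs[("the", "dog")] / words.count("the")
    print(p_dog_given_the)   # 2 occurrences of "the dog" / 3 of "the" = 0.667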

Probability Theory. Bayes' law: P(x | y) = P(x) P(y | x) / P(y). P(disease | symptom) = P(disease) P(symptom | disease) / P(symptom). P(w_{1,n} | speech signal) = P(w_{1,n}) P(speech signal | w_{1,n}) / P(speech signal). Since the denominator does not depend on w_{1,n}, we only need to maximize the numerator; P(speech signal | w_{1,n}) expresses how well the speech signal fits the word sequence w_{1,n}. NLP Language Models 5

Probability Theory. Useful generalizations of Bayes' law: - To find the probability of something happening, compute the probability that it happens given some second event, times the probability of that second event: P(w,x | y,z) = P(w,x) P(y,z | w,x) / P(y,z), where w, x, y, z are separate events (e.g. individual words). - Chain rule: P(w_1, w_2, ..., w_n) = P(w_1) P(w_2 | w_1) P(w_3 | w_1, w_2) ... P(w_n | w_1, ..., w_{n-1}); the same holds when conditioning on some event x: P(w_1, w_2, ..., w_n | x) = P(w_1 | x) P(w_2 | w_1, x) ... P(w_n | w_1, ..., w_{n-1}, x). NLP Language Models 6

Statistical Model of a Language. Statistical models of word sequences (sentences) are called language models. They assign a probability to every possible sequence of words: for a sequence of length n, they assign a value to P(W_{1,n} = w_{1,n}), where w_{1,n} is a sequence of n words. NLP Language Models 7

N-gram Model. A simple but durable statistical model. Useful for identifying words in noisy, ambiguous input: speech recognition (many input speech sounds are similar and confusable), machine translation, spelling correction, handwriting recognition, predictive text input. Other NLP tasks: part-of-speech tagging, NL generation, word similarity. NLP Language Models 8

Corpora. Corpora (singular: corpus) are online collections of text or speech. Brown Corpus: a 1-million-word collection from 500 written texts of different genres (newspaper, novels, academic); punctuation can be treated as words. Switchboard Corpus: 2,430 telephone conversations averaging 6 minutes each, about 240 hours of speech and 3 million words. NLP Language Models 9

Training and Test Sets. The probabilities of an N-gram model come from the corpus it is trained on. The data in the corpus is divided into a training set (or training corpus) and a test set (or test corpus). Perplexity: a measure used to compare statistical models on the test set. NLP Language Models 10
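The slide only names perplexity; for reference, the standard definition (not spelled out on the slide) for a test set W = w_1 w_2 ... w_N is:

    PP(W) = P(w_1 w_2 ... w_N)^(-1/N)

i.e. the inverse probability of the test set, normalized by the number of words; a lower perplexity means a better model of the test data.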

N-gram Model. How can we compute probabilities of entire sequences P(w_1, w_2, ..., w_n)? Decomposition using the chain rule of probability: P(w_1, w_2, ..., w_n) = P(w_1) P(w_2 | w_1) P(w_3 | w_1, w_2) ... P(w_n | w_1, ..., w_{n-1}). This assigns a conditional probability to each possible next word given the history. Markov assumption: we can predict the probability of some future unit without looking too far into the past. Bigrams consider only the previous unit, trigrams the two previous units, and n-grams the n-1 previous units. NLP Language Models 11
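Spelled out, the Markov assumption truncates the history in each factor of the chain rule; for a bigram and a trigram model this gives (standard formulation, consistent with the chain rule above):

    P(w_1 ... w_N) ≈ Π_{i=1..N} P(w_i | w_{i-1})            (bigram)
    P(w_1 ... w_N) ≈ Π_{i=1..N} P(w_i | w_{i-2}, w_{i-1})   (trigram)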

N-gram Model. Assigns a conditional probability to possible next words: only the n-1 previous words affect the probability of the next word. For n = 3 (trigrams): P(w_n | w_1, ..., w_{n-1}) = P(w_n | w_{n-2}, w_{n-1}). How do we estimate these trigram or N-gram probabilities? By maximizing the likelihood of the training set T given the model M, P(T | M). To create the model, use a training text (corpus), taking counts and normalizing them so that they lie between 0 and 1. NLP Language Models 12

N-gram Model. For n = 3 (trigrams): P(w_n | w_1, ..., w_{n-1}) = P(w_n | w_{n-2}, w_{n-1}). To create the model, use the training text and record the pairs and triples of words that appear in the text and how many times: P(w_i | w_{i-2}, w_{i-1}) = C(w_{i-2}, w_{i-1}, w_i) / C(w_{i-2}, w_{i-1}). Example: P(submarine | the, yellow) = C(the, yellow, submarine) / C(the, yellow). Relative frequency: the observed frequency of a particular sequence divided by the observed frequency of its prefix. NLP Language Models 13
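A minimal sketch of this relative-frequency (MLE) estimation in Python, using a toy corpus (the sentence below is made up for illustration):

    from collections import Counter

    corpus = "we all live in the yellow submarine the yellow submarine".split()

    trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
    bigrams = Counter(zip(corpus, corpus[1:]))

    def p_mle(w, u, v):
        # P(w | u, v) = C(u v w) / C(u v)
        return trigrams[(u, v, w)] / bigrams[(u, v)] if bigrams[(u, v)] else 0.0

    print(p_mle("submarine", "the", "yellow"))   # C(the yellow submarine) / C(the yellow) = 2/2 = 1.0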

Language Models. Statistical models of language: language models (LM). Vocabulary V, word w ∈ V. Language L, sentence s ∈ L, with L ⊆ V* (usually infinite). s = w_1 ... w_N. Probability of s: P(s). NLP Language Models 14

Noisy Channel Model. [Diagram] message W → encoder → X (input to channel) → channel p(y | x) → Y (output from channel) → decoder → W*. The decoder attempts to reconstruct the message based on the channel output. NLP Language Models 15

Noisy Channel Model in NLP. In NLP we usually do not act on the encoding. The problem reduces to decoding: finding the most likely input I given the output O. [Diagram] I → noisy channel p(O | I) → O → decoder → Î = argmax_I p(I) p(O | I). NLP Language Models 16
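Expanding the decoder's objective with Bayes' law from slide 5 (a standard derivation; the slide only shows the final product p(I) p(O | I)):

    Î = argmax_I P(I | O) = argmax_I P(O | I) P(I) / P(O) = argmax_I P(O | I) P(I)

P(I) is the language model and P(O | I) is the channel model; P(O) can be dropped because it does not depend on I.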

[Diagram] Real language X → noisy channel → observed language Y. We want to recover X from Y. NLP Language Models 17

[Diagram] Real language X: correct text → noisy channel: introduces errors → observed language Y: text with errors. NLP Language Models 18

[Diagram] Real language X: correct text → noisy channel: removes spaces → observed language Y: text without spaces. NLP Language Models 19

[Diagram] Real language X: text (spelling), described by the language model → noisy channel: acoustic model → observed language Y: speech. NLP Language Models 20

[Diagram] Real language X: source language → noisy channel: translation → observed language Y: target language. NLP Language Models 21

Example: ASR (Automatic Speech Recognizer). [Diagram] Acoustic chain X_1 ... X_T → ASR → word chain w_1 ... w_N, combining a language model and an acoustic model. NLP Language Models 22

Example: Machine Translation. [Diagram] Combines a target-language language model and a translation model. NLP Language Models 23

Naive Implementation. Enumerate s ∈ L and compute P(s); the model has one parameter per sentence of L. But L is usually infinite, so how can the parameters be estimated? Simplification: condition each word on its history h_i = w_1, ..., w_{i-1} and truncate it → Markov models. NLP Language Models 24

Markov Models. An n-gram model is a Markov model of order n-1: P(w_i | h_i) = P(w_i | w_{i-n+1}, ..., w_{i-1}). Order 0 (1-gram): P(w_i | h_i) = P(w_i). Order 1 (2-gram): P(w_i | h_i) = P(w_i | w_{i-1}). Order 2 (3-gram): P(w_i | h_i) = P(w_i | w_{i-2}, w_{i-1}). NLP Language Models 25

n large: more context information (more discriminative power). n small: more cases in the training corpus (more reliable estimates). Selecting n, e.g. for |V| = 20,000: n = 2 (bigrams): 400,000,000 parameters; n = 3 (trigrams): 8,000,000,000,000 parameters; n = 4 (4-grams): 1.6 x 10^17 parameters. NLP Language Models 26
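As the next slide notes, an n-gram model over vocabulary V has |V|^n parameters, which is where these figures come from:

    20,000^2 = 4 x 10^8,  20,000^3 = 8 x 10^12,  20,000^4 = 1.6 x 10^17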

Parameters of an n-gram model: |V|^n. MLE estimation from a training corpus. Problem: data sparseness. NLP Language Models 27

1-gram model: P_MLE(w) = C(w) / Σ_{v ∈ V} C(v). 2-gram model: P_MLE(w_i | w_{i-1}) = C(w_{i-1} w_i) / C(w_{i-1}). 3-gram model: P_MLE(w_i | w_{i-1}, w_{i-2}) = C(w_{i-2} w_{i-1} w_i) / C(w_{i-2} w_{i-1}). NLP Language Models 28


[Figure] True probability distribution. NLP Language Models 31

The seen cases are overestimated, while the unseen ones get zero probability. NLP Language Models 32

SMOOTHING: reserve part of the probability mass of the seen cases and assign it to the unseen ones. NLP Language Models 33

Some methods operate on the counts: Laplace, Lidstone, Jeffreys-Perks. Some methods operate on the probabilities: Held-Out, Good-Turing, discounting. Some methods combine models: linear interpolation, back-off. NLP Language Models 34

Laplace (add 1): P_Laplace(w_1 ... w_n) = (C(w_1 ... w_n) + 1) / (N + B). P = probability of an n-gram; C = count of the n-gram in the training corpus; N = total number of n-grams in the training corpus; B = number of parameters of the model (possible n-grams). NLP Language Models 35

Lidstone (generalization of Laplace): P_Lid(w_1 ... w_n) = (C(w_1 ... w_n) + λ) / (N + Bλ), where λ is a small positive number. MLE: λ = 0; Laplace: λ = 1; Jeffreys-Perks: λ = 1/2. NLP Language Models 36
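A minimal Python sketch of Lidstone smoothing over bigrams, following the slide's formula for the probability of an n-gram (the corpus and vocabulary arguments are placeholders to be supplied by the caller):

    from collections import Counter

    def lidstone_bigram_probs(tokens, vocab, lam=0.5):
        # P_Lid(w1 w2) = (C(w1 w2) + lam) / (N + B*lam)
        # lam = 1 gives Laplace (add one), lam = 0.5 Jeffreys-Perks, lam -> 0 approaches MLE.
        counts = Counter(zip(tokens, tokens[1:]))
        N = sum(counts.values())          # total bigrams in the training corpus
        B = len(vocab) ** 2               # possible bigrams (parameters of the model)

        def prob(w1, w2):
            return (counts[(w1, w2)] + lam) / (N + B * lam)

        return prob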

Held-Out. Compute the fraction of the probability mass that has to be reserved for the n-grams unseen in the training corpus. We set aside part of the training corpus as a held-out corpus and compute how often n-grams unseen in the training corpus occur in the held-out corpus. An alternative to using a separate held-out corpus is cross-validation: held-out interpolation, deleted interpolation. NLP Language Models 37

Held-Out. Let w_1 ... w_n be an n-gram with r = C(w_1 ... w_n). C_1(w_1 ... w_n) = count of the n-gram in the training set; C_2(w_1 ... w_n) = count of the n-gram in the held-out set; N_r = number of n-grams with count r in the training set; T_r = Σ_{w_1 ... w_n : C_1(w_1 ... w_n) = r} C_2(w_1 ... w_n). Then P_ho(w_1 ... w_n) = T_r / (N_r · N). NLP Language Models 38
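A sketch of the held-out estimator in Python, in the slide's notation; the slide leaves N unspecified, so here it is assumed to be the total number of bigrams in the held-out corpus:

    from collections import Counter

    def held_out_estimates(train_bigrams, heldout_bigrams):
        c1 = Counter(train_bigrams)            # C_1: counts in the training set
        c2 = Counter(heldout_bigrams)          # C_2: counts in the held-out set
        N = sum(c2.values())                   # assumption: size of the held-out set

        N_r = Counter(c1.values())             # N_r: number of types with training count r
        T_r = Counter()                        # T_r: held-out mass of those types
        for bigram, r in c1.items():
            T_r[r] += c2[bigram]

        # P_ho for any bigram whose training count is r
        return {r: T_r[r] / (N_r[r] * N) for r in N_r}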

Good-Turing: r* = (r + 1) E(N_{r+1}) / E(N_r), and P_GT = r* / N. r* = adjusted count; N_r = number of n-gram types occurring r times; E(N_r) = expected value, with E(N_{r+1}) < E(N_r) (Zipf's law). NLP Language Models 39
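A minimal sketch of the adjusted counts in Python, using the observed counts-of-counts N_r directly in place of their expectations E(N_r) (a simplification; practical implementations smooth the N_r first):

    from collections import Counter

    def good_turing_adjusted_counts(ngram_counts):
        # r* = (r + 1) * N_{r+1} / N_r ; P_GT of an n-gram seen r times is then r* / N
        N_r = Counter(ngram_counts.values())   # N_r: number of types seen exactly r times
        return {r: (r + 1) * N_r[r + 1] / N_r[r] for r in N_r}

Note that for the largest observed r, N_{r+1} = 0 and the adjusted count collapses to 0; this is one of the reasons the expectations are smoothed in practice.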

Combination of models: linear combination (interpolation). P_li(w_n | w_{n-2}, w_{n-1}) = λ_1 P_1(w_n) + λ_2 P_2(w_n | w_{n-1}) + λ_3 P_3(w_n | w_{n-2}, w_{n-1}). Linear combination of the 1-gram, 2-gram, 3-gram, ... models. The λ weights are estimated using a development corpus. NLP Language Models 40
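A minimal sketch of interpolation in Python; p1, p2 and p3 stand for already-estimated unigram, bigram and trigram distributions, and the lambda values are placeholders (in practice they are tuned on the development corpus and sum to 1):

    def interpolated_prob(w, u, v, p1, p2, p3, lambdas=(0.1, 0.3, 0.6)):
        # P_li(w | u, v) = l1*P1(w) + l2*P2(w | v) + l3*P3(w | u, v)
        l1, l2, l3 = lambdas
        return l1 * p1(w) + l2 * p2(w, v) + l3 * p3(w, u, v)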

Katz's Back-Off. Start with an n-gram model; back off to the (n-1)-gram for null (or low) counts; proceed recursively. NLP Language Models 41
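A much-simplified recursive back-off sketch in Python ("stupid backoff" style with a fixed penalty, not the full Katz formulation, which uses Good-Turing discounting and properly normalized back-off weights); trigrams, bigrams and unigrams are assumed to be Counter objects:

    def backoff_prob(w, u, v, trigrams, bigrams, unigrams, alpha=0.4):
        # Use the trigram estimate if seen, otherwise back off to the bigram,
        # and finally to the unigram, each time applying the penalty alpha.
        if trigrams[(u, v, w)] > 0:
            return trigrams[(u, v, w)] / bigrams[(u, v)]
        if bigrams[(v, w)] > 0:
            return alpha * bigrams[(v, w)] / unigrams[v]
        return alpha * alpha * unigrams[w] / sum(unigrams.values())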

Operating on the history: class-based models. Clustering (or classifying) words into classes: POS, syntactic, semantic. Rosenfeld, 2000: P(w_i | w_{i-2}, w_{i-1}) = P(w_i | C_i) P(C_i | w_{i-2}, w_{i-1}); P(w_i | w_{i-2}, w_{i-1}) = P(w_i | C_i) P(C_i | w_{i-2}, C_{i-1}); P(w_i | w_{i-2}, w_{i-1}) = P(w_i | C_i) P(C_i | C_{i-2}, C_{i-1}); P(w_i | w_{i-2}, w_{i-1}) = P(w_i | C_{i-2}, C_{i-1}). NLP Language Models 42

Structured Language Models (Jelinek & Chelba, 1999). Include the syntactic structure in the history: the T_i are syntactic structures, binarized lexicalized trees. NLP Language Models 43