
n-grams BM1: Advanced Natural Language Processing University of Potsdam Tatjana Scheffler tatjana.scheffler@uni-potsdam.de October 28, 2016

Today: n-grams, Zipf's law, language models

Maximum Likelihood Estimation We want to estimate the parameters of our model from frequency observations. There are many ways to do this; for now, we focus on maximum likelihood estimation (MLE). The likelihood L(O; p) is the probability of our model generating the observations O, given parameter values p. Goal: find the parameter values that maximize the likelihood.

Bernoulli model Let's say we have training data C of size N, with N_H observations of H and N_T observations of T.

Likelihood functions (figure from the Wikipedia page on MLE; licensed from Casp11 under CC BY-SA 3.0)

Logarithm is monotonic Observation: if x_1 > x_2, then ln(x_1) > ln(x_2). Therefore, argmax_p L(C) = argmax_p l(C), where l(C) = ln L(C) is the log-likelihood.

Maximizing the log-likelihood Find the maximum of the function by setting its derivative to zero. Solution: p = N_H / N = f(H), the relative frequency of H.
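As a minimal sketch (not part of the original slides), the following Python snippet checks the closed-form solution numerically on a made-up set of coin flips: a grid search over the log-likelihood lands on the same value as the relative frequency N_H / N.

```python
import math

# Made-up coin-flip training data: N = 10 observations, of which N_H = 7 are H.
observations = ["H", "H", "T", "H", "H", "T", "H", "H", "T", "H"]
N = len(observations)
N_H = observations.count("H")

def log_likelihood(p):
    """Log-likelihood of a Bernoulli model with P(H) = p for the data above."""
    return N_H * math.log(p) + (N - N_H) * math.log(1 - p)

# Closed-form MLE: the relative frequency of H.
p_mle = N_H / N

# Numerical check: evaluate the log-likelihood on a grid of parameter values.
grid = [i / 1000 for i in range(1, 1000)]
p_best = max(grid, key=log_likelihood)

print(p_mle, p_best)  # both are 0.7
```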

Language Modelling

Let's play a game I will write a sentence on the board. Each of you, in turn, gives me a word to continue that sentence, and I will write it down.

Let's play another game You write a word on a piece of paper. You get to see your neighbor's piece of paper, but none of the earlier words. In the end, I will read out the sentence you wrote.

Statistical models for NLP A generative statistical model of language is a probability distribution P(w) over natural-language expressions that we can observe; w may be complete sentences or smaller units (we will later extend this to a distribution P(w, t) with hidden random variables t). Assumption: a corpus of observed sentences w is generated by repeatedly sampling from P(w). We try to estimate the parameters of the probability distribution from the corpus, so we can make predictions about unseen data.


Word-by-word random process A language model LM is a probability distribution P(w). Think of it as a random process that generates sentences word by word: X1, X2, X3, X4, ... For example: Are → Are you → Are you sure → Are you sure that
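A minimal sketch (not from the slides) of such a word-by-word generator; the words and conditional probabilities below are made up purely to make the process concrete.

```python
import random

# Toy conditional distributions P(X_t = w | history so far); purely illustrative.
def next_word_distribution(history):
    if not history:
        return {"Are": 0.6, "You": 0.4}
    if history[-1] == "Are":
        return {"you": 0.8, "we": 0.2}
    if history[-1] == "you":
        return {"sure": 0.5, "there": 0.3, "</s>": 0.2}
    return {"that": 0.5, "</s>": 0.5}

def generate(max_len=10):
    """Sample a sentence word by word: X1, then X2 given X1, and so on."""
    words = []
    for _ in range(max_len):
        dist = next_word_distribution(words)
        word = random.choices(list(dist), weights=list(dist.values()))[0]
        if word == "</s>":          # end-of-sentence marker
            break
        words.append(word)
    return " ".join(words)

print(generate())  # e.g. "Are you sure that"
```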

Our game as a process Each of you = a random variable X_t; the event X_t = w_t means that the word at position t is w_t. When you chose w_t, you could see the outcomes of the previous variables: X_1 = w_1, ..., X_{t-1} = w_{t-1}. Thus, each X_t followed a probability distribution P(X_t = w_t | X_1 = w_1, ..., X_{t-1} = w_{t-1}).

Our game as a process Assume that X_t follows some given probability distribution P(X_t = w_t | X_1 = w_1, ..., X_{t-1} = w_{t-1}). Then the probability of the entire sentence (or corpus) w = w_1 ... w_n is P(w_1 ... w_n) = P(w_1) P(w_2 | w_1) P(w_3 | w_1, w_2) ... P(w_n | w_1, ..., w_{n-1}).
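As a minimal sketch (not from the slides), this chain-rule factorization can be written directly in code; the cond_prob argument is a hypothetical placeholder for whatever model supplies the conditional probabilities.

```python
import math

def sentence_logprob(words, cond_prob):
    """Chain rule: log P(w_1 ... w_n) = sum over t of log P(w_t | w_1, ..., w_{t-1})."""
    total = 0.0
    for t in range(len(words)):
        total += math.log(cond_prob(words[t], words[:t]))
    return total

# cond_prob can be any function returning P(w | history).  Here we plug in a
# made-up uniform placeholder over a 10-word vocabulary, just so the sketch runs.
uniform = lambda w, history: 1 / 10

print(sentence_logprob(["Are", "you", "sure", "that"], uniform))  # 4 * log(0.1)
```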

Parameters of the model Our model has one parameter for P(X_t = w_t | w_1, ..., w_{t-1}) for every t and w_1, ..., w_t. We can use maximum likelihood estimation. Let's say a natural language has 10^5 different words. How many tuples w_1, ..., w_t of length t are there? t = 1: 10^5; t = 2: 10^10 different contexts; t = 3: 10^15; etc.

Sparse data problem Typical corpus sizes: Brown corpus: 10^6 tokens; Gigaword corpus: 10^9 tokens. The problem is exacerbated by Zipf's Law: order all words by their absolute frequency in the corpus (rank 1 = most frequent word). Then rank is inversely proportional to absolute frequency; i.e., most words are really rare. Zipf's Law is very robust across languages and corpora.

Interlude: Corpora

Terminology N = corpus size, the number of (word) tokens; V = vocabulary size, the number of (word) types; hapax legomenon = a word that appears exactly once in the corpus.
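A minimal sketch (not from the slides) of these quantities in Python, using a made-up toy corpus:

```python
from collections import Counter

# A made-up toy corpus (not the example corpus from the slides).
corpus = "the cat sat on the mat and the dog sat on the rug".split()

counts = Counter(corpus)
N = len(corpus)                                     # corpus size: number of tokens
V = len(counts)                                     # vocabulary size: number of types
hapaxes = [w for w, c in counts.items() if c == 1]  # hapax legomena

print(N, V, hapaxes)
```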

An example corpus Tokens: 86, Types: 53

Frequency list

Frequency profile

Plotting corpus frequencies How many different words in the corpus are there with each frequency?

  number of types   rank   frequency
  1                 1      8
  2                 3      5
  4                 7      3
  10                17     2
  36                53     1

(The type counts sum to 53, and the corresponding tokens to 1·8 + 2·5 + 4·3 + 10·2 + 36·1 = 86, matching the example corpus.)
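A minimal sketch (not from the slides) of how such a frequency profile can be computed, again on a made-up toy corpus:

```python
from collections import Counter

# Same kind of made-up toy corpus as above.
corpus = "the cat sat on the mat and the dog sat on the rug".split()
word_freq = Counter(corpus)

# Frequency profile: for each frequency, how many different types have it?
profile = Counter(word_freq.values())
for freq, n_types in sorted(profile.items(), reverse=True):
    print(f"{n_types} type(s) with frequency {freq}")
```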

Plotting corpus frequencies x-axis: rank, y-axis: frequency

Typical frequency patterns Some other corpora, across text types & languages

Zipf's Law Zipf's Law characterizes the relation between frequent and rare words: f(w) = C / r(w), or equivalently f(w) · r(w) = C. The frequency of lexical items (word types) in a large corpus is inversely proportional to their rank. This is an empirical observation in many different corpora; in the Brown corpus, half of all types are hapax legomena.
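A minimal sketch (not from the slides) of how to check this on a corpus; the file name corpus.txt is a hypothetical placeholder for any plain-text corpus you have at hand.

```python
from collections import Counter

# Assumes some plain-text corpus file; "corpus.txt" is a hypothetical placeholder.
tokens = open("corpus.txt", encoding="utf-8").read().split()
freqs = sorted(Counter(tokens).values(), reverse=True)

# Under Zipf's law, f(w) * r(w) should be roughly constant across ranks.
for rank in (1, 10, 100, 1000):
    if rank <= len(freqs):
        f = freqs[rank - 1]
        print(f"rank {rank}: frequency {f}, f * r = {f * rank}")
```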

Effects of Zipf's Law Lexicography: Sinclair (2005): need at least 20 instances; in the BNC (10^8 tokens), <14% of words appear 20 times or more. Speech synthesis: may accept bad output for rare words, but most words are rare! (at least 1 per sentence). Vocabulary growth: the vocabulary growth of corpora is not constant; G = #hapaxes / #tokens.

Back to Language Models

Independence assumptions Let's pretend that the word at position t depends only on the words at positions t-1, t-2, ..., t-k for some fixed k (Markov assumption of degree k). Then we get an n-gram model, with n = k+1: P(X_t | X_1, ..., X_{t-1}) = P(X_t | X_{t-k}, ..., X_{t-1}) for all t. Special names: unigram models (n = 1), bigram models (n = 2), trigram models (n = 3).
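A minimal sketch (not from the slides) of a bigram model (n = 2) whose conditional probabilities are estimated by MLE from counts; the toy sentences and the <s>/</s> boundary markers are made up for illustration.

```python
from collections import Counter

# Made-up toy training corpus; <s> and </s> mark sentence boundaries.
sentences = [
    "are you sure".split(),
    "are you there".split(),
    "you are sure".split(),
]

bigrams = Counter()
unigrams = Counter()
for sent in sentences:
    padded = ["<s>"] + sent + ["</s>"]
    for w1, w2 in zip(padded, padded[1:]):
        bigrams[(w1, w2)] += 1
        unigrams[w1] += 1

def p(w2, w1):
    """MLE bigram probability P(w2 | w1) = count(w1, w2) / count(w1)."""
    if unigrams[w1] == 0:
        return 0.0
    return bigrams[(w1, w2)] / unigrams[w1]

# Probability of a sentence under the Markov assumption of degree 1:
sent = ["<s>", "are", "you", "sure", "</s>"]
prob = 1.0
for w1, w2 in zip(sent, sent[1:]):
    prob *= p(w2, w1)
print(prob)  # 2/3 * 2/3 * 1/3 * 1 = 4/27
```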

Independence assumption We assume independence of X_t from events that are too far in the past, although we know that this assumption is incorrect. Typical tradeoff in statistical NLP: if the model is too shallow, it won't represent important linguistic dependencies; if the model is too complex, its parameters can't be estimated accurately from the available data. Low n: modeling errors; high n: estimation errors.

Tradeoff in practice (Manning/Schütze, ch. 6)

Conclusion Statistical models of natural language; language models using n-grams; data sparseness is a problem.

Next Tuesday: smoothing language models