Language Models (2) CMSC 470 Marine Carpuat. Slides credit: Jurafsky & Martin


Roadmap Language Models Our first example of modeling sequences n-gram language models How to estimate them? How to evaluate them? Neural models

Pros and cons of n-gram models Really easy to build, can train on billions and billions of words Smoothing helps generalize to new data Only work well for word prediction if the test corpus looks like the training corpus Only capture short distance context

Evaluation: How good is our model? Does our language model prefer good sentences to bad ones? Does it assign higher probability to real or frequently observed sentences than to ungrammatical or rarely observed sentences? Extrinsic vs. intrinsic evaluation.

Intrinsic evaluation: intuition
The Shannon Game: How well can we predict the next word?
I always order pizza with cheese and ____  (mushrooms 0.1, pepperoni 0.1, anchovies 0.01, ..., fried rice 0.0001, ..., and 1e-100)
The 33rd President of the US was ____
I saw a ____
Unigrams are terrible at this game. (Why?)
A better model of a text is one that assigns a higher probability to the word that actually occurs.

Intrinsic evaluation metric: perplexity
The best language model is one that best predicts an unseen test set, i.e. gives the highest P(sentence).
Perplexity is the inverse probability of the test set, normalized by the number of words:
$PP(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}}$
By the chain rule: $PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 \ldots w_{i-1})}}$
For bigrams: $PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}}$
Minimizing perplexity is the same as maximizing probability.
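As a concrete illustration, here is a minimal sketch of computing bigram perplexity from counts; the toy corpus and the add-one smoothing are assumptions for illustration, not the lecture's setup.

```python
# Minimal sketch: bigram perplexity of a test sequence from raw counts,
# with add-one (Laplace) smoothing so unseen bigrams get nonzero probability.
import math
from collections import Counter

train = "the cat sat on the mat the cat ate".split()
test = "the cat sat".split()

unigrams = Counter(train)
bigrams = Counter(zip(train, train[1:]))
V = len(unigrams)

def bigram_prob(prev, word):
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

log_prob = 0.0
for prev, word in zip(test, test[1:]):
    log_prob += math.log(bigram_prob(prev, word))

N = len(test) - 1                      # number of predicted words
perplexity = math.exp(-log_prob / N)   # inverse probability, length-normalized
print(perplexity)
```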

Perplexity as branching factor Let's suppose a sentence consisting of random digits. What is the perplexity of this sentence according to a model that assigns P = 1/10 to each digit?
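Working the answer out (a standard derivation; the slide leaves it to the figure): for a sentence of $N$ random digits, each assigned probability $1/10$,

$PP(W) = P(w_1 \ldots w_N)^{-\frac{1}{N}} = \left(\left(\tfrac{1}{10}\right)^{N}\right)^{-\frac{1}{N}} = 10$

so perplexity equals the branching factor: at each position the model is effectively choosing uniformly among 10 options.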

Lower perplexity = better model
Training: 38 million words; test: 1.5 million words (WSJ)
N-gram order:   Unigram   Bigram   Trigram
Perplexity:     962       170      109

The perils of overfitting N-grams only work well for word prediction if the test corpus looks like the training corpus. In real life, it often doesn't! We need to train robust models that generalize: smoothing is important, and choose n carefully.

Roadmap Language Models Our first example of modeling sequences n-gram language models How to estimate them? How to evaluate them? Neural models

Toward a Neural Language Model Figures by Philipp Koehn (JHU)

Representing Words: one-hot vectors
dog = [ 0, 0, 0, 0, 1, 0, 0, 0 ]
cat = [ 0, 0, 0, 0, 0, 0, 1, 0 ]
eat = [ 0, 1, 0, 0, 0, 0, 0, 0 ]
That's a large vector! Practical solutions: limit to the most frequent words (e.g., top 20000), cluster words into classes, break up rare words into subword units.
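A quick sketch of the idea; the vocabulary, indices, and the <unk> fallback are made up for illustration.

```python
import numpy as np

# Toy vocabulary; a real system would keep only the top-k most frequent words
# and map everything else to an <unk> token.
vocab = ["<unk>", "eat", "the", "a", "dog", "runs", "cat", "sleeps"]
word_to_id = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    vec = np.zeros(len(vocab))
    vec[word_to_id.get(word, word_to_id["<unk>"])] = 1.0
    return vec

print(one_hot("dog"))    # 1.0 in position 4, zeros elsewhere
print(one_hot("tiger"))  # unknown word falls back to <unk>
```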

Language Modeling with Feedforward Neural Networks Map each word into a lower-dimensional real-valued space using a shared weight matrix (the embedding layer). Bengio et al. 2003. A sketch of the architecture follows.
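A minimal PyTorch sketch of a Bengio-style feedforward LM; the hyperparameters are illustrative assumptions, not the values from the lecture or from Bengio et al. 2003.

```python
import torch
import torch.nn as nn

class FeedforwardLM(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, context_size=3, hidden_dim=128):
        super().__init__()
        # Shared embedding matrix: every context position uses the same weights.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.hidden = nn.Linear(context_size * embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, context_ids):                # context_ids: (batch, context_size)
        e = self.embed(context_ids)                # (batch, context_size, embed_dim)
        h = torch.tanh(self.hidden(e.flatten(1)))  # concatenate context embeddings
        return self.out(h)                         # logits over the next word

model = FeedforwardLM(vocab_size=10000)
logits = model(torch.randint(0, 10000, (2, 3)))    # predict word 4 from 3-word contexts
probs = torch.softmax(logits, dim=-1)
print(probs.shape)                                 # (2, 10000)
```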

Example: Prediction with a Feedforward LM

Example: Prediction with a Feedforward LM Note: bias omitted in figure

Estimating Model Parameters
Intuition: a model is good if it gives high probability to existing word sequences.
Training examples: sequences of words in the language of interest.
Error/loss: negative log likelihood.
At the corpus level: $\text{error}(\lambda) = -\sum_{e \in \text{corpus}} \log P_\lambda(e)$
At the word level: $\text{error}(\lambda) = -\log P_\lambda(e_t \mid e_1 \ldots e_{t-1})$

Example: Parameter Estimation: the loss function at each position t, and the parameter update rule (shown in the figure; a sketch follows below).
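A hedged sketch of what one gradient-descent step looks like in practice, reusing the FeedforwardLM class from the earlier sketch; the optimizer choice, learning rate, and random data are illustrative assumptions, not the lecture's exact update rule.

```python
import torch
import torch.nn as nn

model = FeedforwardLM(vocab_size=10000)          # class defined in the earlier sketch
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()                  # mean negative log likelihood

contexts = torch.randint(0, 10000, (32, 3))      # batch of 3-word contexts
targets = torch.randint(0, 10000, (32,))         # the words that actually follow

logits = model(contexts)
loss = loss_fn(logits, targets)                  # -log P(target | context), averaged

optimizer.zero_grad()
loss.backward()                                  # gradients of the loss w.r.t. parameters
optimizer.step()                                 # lambda <- lambda - lr * gradient
print(loss.item())
```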

Word Embeddings: a useful by-product of neural LMs Words that occur in similar contexts tend to have similar embeddings. Embeddings capture many usage regularities and are useful features for many NLP tasks.

Word Embeddings

Word Embeddings

Word Embeddings Capture Useful Regularities
Morpho-syntactic: adjectives (base form vs. comparative), nouns (singular vs. plural), verbs (present tense vs. past tense) [Mikolov et al. 2013]
Semantic: word similarity/relatedness, semantic relations
But they tend to fail at distinguishing synonyms vs. antonyms, and the multiple senses of a word.
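A small sketch of the kind of regularity Mikolov et al. report; the 3-dimensional vectors here are made up so the analogy works, not trained embeddings.

```python
import numpy as np

# Made-up "embeddings" that happen to encode a gender direction;
# real embeddings are learned and have hundreds of dimensions.
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.1, 0.8, 0.1]),
    "woman": np.array([0.1, 0.8, 0.9]),
    "queen": np.array([0.9, 0.8, 0.9]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# king - man + woman should land near queen.
query = emb["king"] - emb["man"] + emb["woman"]
best = max(emb, key=lambda w: cosine(emb[w], query))
print(best)  # queen
```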

Language Modeling with Feedforward Neural Networks Bengio et al. 2003

Count-based n-gram models vs. feedforward neural networks
Pros of feedforward neural LMs: word embeddings capture generalizations across word types.
Cons of feedforward neural LMs: closed vocabulary; training/testing is more computationally expensive.
Weaknesses of both types of model: they only work well for word prediction if the test corpus looks like the training corpus, and they only capture short-distance context.

Roadmap Language Models Our first example of modeling sequences n-gram language models How to estimate them? How to evaluate them? Neural models Feedforward neural networks Recurrent neural networks