TTIC 31190: Natural Language Processing Kevin Gimpel Winter 2016 Lecture 15: Introduction to Machine Translation

Announcements: Assignment 3 is due Monday. Email me to sign up for your (10-minute) class presentation on 3/3 or 3/8.

Roadmap: classification; words; lexical semantics; language modeling; sequence labeling; neural network methods in NLP; syntax and syntactic parsing; computational semantics; machine translation; other NLP applications

People rely on machine translation!

Approaches to Machine Translation: The Vauquois Triangle

Interlingua Example

Classification Framework for Machine Translation. Inference: solve the decoding argmax. Modeling: define the score function. Learning: choose the parameters. Modern systems are data-driven, so first we need data!

Data?

Also: news articles, company websites, laws & patents, subtitles

Parallel Data. Parallel data is bilingual data that is naturally aligned at some level, usually at the document level; sentence-level alignments are generated automatically. How might you design an algorithm for this? (One possible design is sketched below.) It can be done well without dictionaries! We can throw out sentences that don't align with anything.
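One possible design for the sentence-alignment algorithm asked about above: a minimal sketch of length-based alignment in the spirit of Gale & Church (1993). The scoring constants and the restriction to 1-1 matches are simplifying assumptions of mine, not part of the lecture.

    import math

    def align(src_sents, tgt_sents):
        """Align two sides of a document by sentence length alone (1-1 matches only)."""
        INF = float("inf")
        n, m = len(src_sents), len(tgt_sents)

        def cost(s, t):
            # Similar-length sentences align cheaply; mismatches are penalized.
            return abs(math.log((len(s) + 1) / (len(t) + 1)))

        SKIP = 3.0  # penalty for leaving a sentence unaligned (assumed value)
        best = [[INF] * (m + 1) for _ in range(n + 1)]
        best[0][0] = 0.0
        back = {}
        for i in range(n + 1):
            for j in range(m + 1):
                if i < n and j < m and best[i][j] + cost(src_sents[i], tgt_sents[j]) < best[i + 1][j + 1]:
                    best[i + 1][j + 1] = best[i][j] + cost(src_sents[i], tgt_sents[j])
                    back[(i + 1, j + 1)] = (i, j)
                if i < n and best[i][j] + SKIP < best[i + 1][j]:
                    best[i + 1][j] = best[i][j] + SKIP
                    back[(i + 1, j)] = (i, j)
                if j < m and best[i][j] + SKIP < best[i][j + 1]:
                    best[i][j + 1] = best[i][j] + SKIP
                    back[(i, j + 1)] = (i, j)

        pairs, ij = [], (n, m)  # trace back the cheapest path
        while ij in back:
            pi, pj = back[ij]
            if ij == (pi + 1, pj + 1):
                pairs.append((src_sents[pi], tgt_sents[pj]))
            ij = (pi, pj)
        return list(reversed(pairs))

Real aligners also allow 1-2, 2-1, and 2-2 matches and model length ratios probabilistically; the point here is that a dynamic program over sentence lengths, with no dictionary at all, already recovers most alignments, and unalignable sentences simply get skipped.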

Learning from Parallel Sentences. Chickasaw: 1. Ofi 'at kowi 'ã lhiyohli 2. Kowi 'at ofi 'ã lhiyohli 3. Ofi 'at shoha. English: 1. The dog chases the cat 2. The cat chases the dog 3. The dog stinks

Machine Translation Evaluation. Human judgments are ideal, but expensive (what other problems are there with human judgments?), so we need automatic evaluation metrics. BLEU (BiLingual Evaluation Understudy; Papineni et al., 2002) compares n-gram overlap between the system output and a human-produced translation; it correlates with human judgments surprisingly well, but only at the document level (not the sentence level!). Other metrics do soft matching based on stemming and synonyms from WordNet. This is not a solved problem!
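For reference, the standard BLEU definition (Papineni et al., 2002), which the slide summarizes: a brevity penalty BP times the geometric mean of modified n-gram precisions p_n, typically with N = 4 and uniform weights w_n = 1/N:

\mathrm{BLEU} = \mathrm{BP} \cdot \exp\Big( \sum_{n=1}^{N} w_n \log p_n \Big), \qquad \mathrm{BP} = \min\big(1,\; e^{\,1 - r/c}\big)

where r is the reference length and c is the candidate length.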

Statistical Machine Translation. "One naturally wonders if the problem of translation could conceivably be treated as a problem in cryptography. When I look at an article in Arabic, I say: This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode." (Warren Weaver, 1947)

Noisy Channel Model

Noisy Channel Model for Translating French (f) to English (e):

\hat{e} = \arg\max_e p(e \mid f) = \arg\max_e \frac{p(f \mid e)\, p(e)}{p(f)} = \arg\max_e p(f \mid e)\, p(e)

[diagram: English e passes through a noisy channel to produce French f]

Modeling for the Noisy Channel. We need to model two probability distributions: P(e) and P(f | e). P(e) should favor fluent translations; P(f | e) should favor accurate/faithful translations. Let's start with P(e): how do we compute the probability of an English sentence? This is an important part of MT (e.g., Google).
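A standard answer, consistent with the language modeling unit earlier in the course: decompose the sentence probability with the chain rule and approximate it with an n-gram assumption (the trigram case below is my choice of illustration):

p(e) = \prod_{i=1}^{|e|} p(e_i \mid e_1, \ldots, e_{i-1}) \approx \prod_{i=1}^{|e|} p(e_i \mid e_{i-2}, e_{i-1})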

Word Alignments. The alignment is a hidden variable (not part of the training data); for each French word, it holds the index of the aligned English word (or NULL).

Remember: our goal was to model p(f | e). Why would we introduce a hidden variable? To make it easier to define the model. We often want to share certain types of information across multiple instances in our data, and latent variables are a natural way to capture this; think of clustering (some of the points come from the same cluster).

Alignments as Hidden Variables. For simplicity, assume that each French word aligns to 1 English word (or to NULL). Analogy to clustering: there, each data point has 1 vote which it can distribute among all the clusters; here, each French word has 1 vote which it can distribute among all the English words or NULL.

Modeling Alignments: IBM Model 1. How do we obtain p(f | e)? Sum over all alignments. The parameters in the model are learned using expectation maximization. (The formulas are reconstructed below.)
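The formulas behind these slides (shown as images in the original) are the standard IBM Model 1 equations of Brown et al. (1993), reproduced here for reference; l and m are the English and French sentence lengths, a_j indexes the English word aligned to French position j, and the t(f | e) are the translation parameters:

p(f \mid e) = \sum_a p(f, a \mid e), \qquad p(f, a \mid e) = \frac{\epsilon}{(l+1)^m} \prod_{j=1}^{m} t\big(f_j \mid e_{a_j}\big)

Because Model 1 factors over French positions, the sum over alignments simplifies to

p(f \mid e) = \frac{\epsilon}{(l+1)^m} \prod_{j=1}^{m} \sum_{i=0}^{l} t\big(f_j \mid e_i\big)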

Aside: are alignments always hidden? Certain small parallel corpora have been hand-aligned. Issues with this? Annotators don't agree; we have lots of parallel text, but very little is hand-aligned; and for some language pairs, we will never have manual alignments. Word alignment has become a fundamental part of MT, and we need unsupervised learning to solve it!

IBM Model 1 Example. Consider a training set of two sentence pairs: green house / casa verde, and the house / la casa. Initial parameter estimates: t(f | e), the probability of translating e into f, initialized uniformly. After 1 iteration of EM, the estimates already begin to concentrate on the consistent word pairs, as the sketch below shows.
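A minimal runnable sketch of these EM updates on this exact toy corpus (the variable names and two-iteration loop are my own; no NULL word, matching the classic classroom example):

    from collections import defaultdict

    pairs = [("green house".split(), "casa verde".split()),
             ("the house".split(), "la casa".split())]

    # t[(f, e)] = t(f | e); any uniform constant works, since the E-step normalizes
    t = defaultdict(lambda: 1.0)

    for iteration in range(2):
        count = defaultdict(float)   # expected counts c(f, e)
        total = defaultdict(float)   # expected counts c(e)
        for e_sent, f_sent in pairs:
            for f in f_sent:
                z = sum(t[(f, e)] for e in e_sent)   # normalizer over alignments
                for e in e_sent:
                    p = t[(f, e)] / z                # posterior that f aligns to e
                    count[(f, e)] += p
                    total[e] += p
        for (f, e) in count:                         # M-step: re-estimate t(f | e)
            t[(f, e)] = count[(f, e)] / total[e]

    print(t[("casa", "house")])   # 0.5 after 1 iteration, 0.6 after 2; grows toward 1

After one iteration, t(casa | house) already rises above the alternatives, because casa co-occurs with house in both sentence pairs.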

IBM Model 1 → IBM Model 2 → IBM Model 3 [the successive models' formulas were shown as figures]

Moving to Phrases (not necessarily syntactic phrases): NULL Auf diese Frage habe ich leider keine Antwort bekommen / I did not unfortunately receive an answer to this question

Phrase-Based Translation relies on a phrase table: a massive bilingual phrase dictionary, with probabilities. To build it: find the best word alignment for each sentence pair; extract all phrase pairs consistent with the word alignment; compute probabilities using relative frequency estimation (written out below).

Example sentence pair: Auf diese Frage habe ich leider keine Antwort bekommen / I did not unfortunately receive an answer to this question

Extracted phrase pairs with counts, and the resulting relative-frequency probabilities:

German            English            Count    P(e | f)
Auf diese Frage   to this question   1        1.0
Antwort           an answer          1        0.5
Antwort           answer             1        0.5
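The relative frequency estimate used in the table, written out (standard phrase-table estimation; \bar{f} and \bar{e} denote phrases):

P(\bar{e} \mid \bar{f}) = \frac{\mathrm{count}(\bar{f}, \bar{e})}{\sum_{\bar{e}'} \mathrm{count}(\bar{f}, \bar{e}')}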

Adding Syntax: Synchronous Context-Free Grammars [figure: a CFG rule alongside the corresponding SCFG rule, rooted in an NN nonterminal]

Noisy Channel [diagram: the source sentence enters the channel and we recover the predicted translation]

The noisy channel decoding rule assumes we have the right model and that we estimate it perfectly. In practice we add extra parameters that we can tune to optimize BLEU; this step is called tuning. (One common parameterization is written below.)
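A common way to write those extra parameters (my reconstruction of the formula the slides showed as an image): scale the two log-probabilities and tune the scaling weights on held-out data:

\hat{e} = \arg\max_e \; \lambda_1 \log p(f \mid e) + \lambda_2 \log p(e)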

Noisy Channel → Linear Model? Since we're not using the idealized decoding rule anymore, why not add more feature functions? For example, a word count feature, or a reverse translation model feature.
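In linear-model form (a sketch; the exact feature list beyond the two slide examples is my illustration, and the first two features recover the noisy channel):

\hat{e} = \arg\max_e \; \theta^\top f(f, e), \qquad f(f, e) = \big( \log p(f \mid e),\; \log p(e),\; |e|,\; \log p(e \mid f),\; \ldots \big)

Here |e| is the word count feature and \log p(e \mid f) is the reverse translation model feature.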

Example: translating 非国大反对制裁津巴布韦, glossed word-by-word as "African National Congress opposition sanction Zimbabwe". Gold standard: African National Congress opposes sanctions against Zimbabwe.

[Plot: candidate translations arranged by model score (x-axis) against BLEU score (y-axis)] Candidates shown include the predicted translation "opposition to sanctions against Zimbabwe African National Congress", as well as "African National Congress opposition sanctions against Zimbabwe" and "African sanctioning to Zimbabwe's opposing".

Learning moves translations left or right in this plot; under an ideal model, model score would increase with BLEU.

Where's the gold standard translation? Issue: the gold standard translation is often unreachable by the model. Why? Limited translation rules, free translations, noisy data.

Free Translations. Machine translation: "Sharon's office said, leader of the main opposition Labor Party has admitted defeat and congratulatory telephone calls to Sharon." Human-generated translation: "According to a representative of Sharon's office, the leader of the main opposition Labor Party has admitted defeat and made the obligatory congratulating telephone call to Sharon."

Even if the gold standard translation were reachable by the model, we might not want to learn from it directly. This applies to other tasks too: summarization, image caption generation.

Loss Functions (the loss formulas themselves appeared as images; see the reconstruction below the table):

name           where used
cost ("0-1")   intractable, but underlies direct error minimization; notably, it doesn't need to compute the model score of the gold standard!
perceptron     perceptron algorithm (Rosenblatt, 1958)
hinge          support vector machines, other large-margin algorithms
log            logistic regression, conditional random fields, maximum entropy models

Issue for the losses that reference the gold standard: the gold standard translation is often unreachable by the model.
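The loss formulas, reconstructed in their textbook forms (not copied from the slides; scores are \theta^\top f(x^{(i)}, y), and y^{(i)} is the gold standard):

cost ("0-1"): \mathrm{cost}\big(y^{(i)}, \hat{y}\big) \text{ where } \hat{y} = \arg\max_y \theta^\top f(x^{(i)}, y)

perceptron: \max_y \theta^\top f(x^{(i)}, y) - \theta^\top f(x^{(i)}, y^{(i)})

hinge: \max_y \big[ \theta^\top f(x^{(i)}, y) + \mathrm{cost}(y^{(i)}, y) \big] - \theta^\top f(x^{(i)}, y^{(i)})

log: -\theta^\top f(x^{(i)}, y^{(i)}) + \log \sum_y \exp\big\{ \theta^\top f(x^{(i)}, y) \big\}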

MERT, Och (2003)

Notation: θ = feature weights, f = feature vector, x = source sentence, y = translation, h = latent derivation.

Minimum Error Rate Training (MERT): minimize the cost of the decoder output, where the cost function measures how bad the translations are (e.g., negative BLEU), given a set of source sentences, their references, and the decoder outputs. This is intractable in general; how can we solve it? Generate k-best lists of translations and approximately minimize cost on the k-best lists, then repeat with the new parameters, pooling k-best lists across iterates. (The objective is written out below.)
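The MERT objective in symbols (my reconstruction, using the notation above; x^{(i)} are the source sentences and y^{(i)} their references):

\hat{\theta} = \arg\min_{\theta} \sum_i \mathrm{cost}\Big( y^{(i)},\; \hat{y}\big(x^{(i)}; \theta\big) \Big), \qquad \hat{y}\big(x^{(i)}; \theta\big) = \arg\max_{\langle y, h \rangle \in \mathcal{T}(x^{(i)})} \theta^\top f\big(x^{(i)}, y, h\big)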

[Plots: BLEU score (y-axis) vs. model score (x-axis); each point is a translation of the same sentence; Arabic-English, phrase-based system]

10,000-best list with default Moses weights: 1-best is 28 BLEU. The same sentence's 10,000-best list after MERT: 1-best is 34 BLEU. Another sentence with default Moses weights: 1-best is 46 BLEU. The same sentence after MERT: 1-best is 62 BLEU.

Why are there horizontal bands? Latent derivations, and different translations with the same BLEU.

What are some issues with this loss function? It is discontinuous and non-convex, so optimization relies on randomized search; there is no regularization, which leads to overfitting. As a result, MERT is only effective for very small models (<40 parameters).

Many researchers tried to improve MERT: Regularization and Search for MERT (Cer et al., 2008); Random Restarts in MERT for MT (Moore & Quirk, 2008); Stabilizing MERT (Foster & Kuhn, 2009). Issues remain: Better Hypothesis Testing for Statistical MT: Controlling for Optimizer Instability (Clark et al., 2011) suggests running MERT 3-5 times due to its instability.

Perceptron Loss [plot: BLEU score vs. model score, marking the reference and the model prediction]

Perceptron Loss for MT? (Collins, 2002)

k-best Perceptron for MT (Liang et al., 2006) [plot: the model prediction and the BLEU oracle on the k-best list, which stands in for the often-unreachable reference]

Ramp Loss Minimization [plots: BLEU score vs. model score, marking the model prediction, the gold standard, and "fear" and "hope" translations]

Fear Ramp Loss (Do et al., 2008): move away from the fear translation (away from bad).

Hope Ramp Loss (McAllester & Keshet, 2011; Liang et al., 2006): move toward the hope translation (toward good).

Hope-Fear Ramp Loss (Chiang et al., 2008; 2009; Cherry & Foster, 2012; Chiang, 2012): move toward the hope translation and away from the fear translation, where

hope translation: \arg\max_{\langle y, h \rangle \in \mathcal{T}(x^{(i)})} \Big[ \theta^\top f\big(x^{(i)}, y, h\big) - \mathrm{cost}\big(y^{(i)}, y\big) \Big]

fear translation: \arg\max_{\langle y, h \rangle \in \mathcal{T}(x^{(i)})} \Big[ \theta^\top f\big(x^{(i)}, y, h\big) + \mathrm{cost}\big(y^{(i)}, y\big) \Big]
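Putting the two argmaxes together, the hope-fear ramp loss being minimized can be written as (standard form, following Chiang, 2012; my transcription):

\mathrm{loss}(\theta) = - \max_{\langle y, h \rangle \in \mathcal{T}(x^{(i)})} \Big[ \theta^\top f\big(x^{(i)}, y, h\big) - \mathrm{cost}\big(y^{(i)}, y\big) \Big] + \max_{\langle y, h \rangle \in \mathcal{T}(x^{(i)})} \Big[ \theta^\top f\big(x^{(i)}, y, h\big) + \mathrm{cost}\big(y^{(i)}, y\big) \Big]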

Experiments (Gimpel, 2012): averages over 8 test sets across 3 language pairs.

method                                          Moses %BLEU    Hiero %BLEU
MERT                                            35.9           37.0
Fear Ramp (away from bad)                       34.9           34.2
Hope Ramp (toward good)                         35.2           36.0
Hope-Fear Ramp (toward good + away from bad)    35.7           37.0

Pairwise Ranking Optimization (Hopkins & May, 2011) [plot: BLEU score vs. model score]