Phrase-Based MT: Decoding. February 19, 2015


Administrative
- Final proposal draft due Tuesday; it needs to be revised
- Bring 3 printed copies again
- HW 2 is due two weeks from today

Phrase-Based MT

$$\hat{e} = \arg\max_e p(e \mid f) = \arg\max_e p(f \mid e)\, p(e) \approx \arg\max_e \max_a p(f, a \mid e)\, p(e)$$

Recipe: Ingredients
- Segmentation / reordering model
- Phrase model
- Language model

Marginal Decoding

$$\hat{e} = \arg\max_e p(e \mid f) = \arg\max_e \sum_a p(f, a \mid e)\, p(e) \approx \arg\max_e \max_a p(f, a \mid e)\, p(e)$$

Does this last approximation (max over alignments instead of the sum) matter?
- Variational & MCMC methods have been explored: slight benefits, depending on training
- Really hard problem (Sima'an, 1997)

Reordering Model

Phrase Tables

f            e            p(f | e)
das Thema    the issue    0.41
             the point    0.72
             the subject  0.47
             the thema    0.99
es gibt      there is     0.96
             there are    0.72
morgen       tomorrow     0.90
fliege ich   will I fly   0.63
             will fly     0.17
             I will fly   0.13
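As a data structure, a phrase table is just a map from source phrases to scored candidate translations. A minimal sketch in Python, using the illustrative values from the table above (the function name is mine, not from the slides):

```python
# Minimal phrase table: source phrase -> list of (translation, p(f|e)) pairs.
phrase_table = {
    ("das", "Thema"): [("the issue", 0.41), ("the point", 0.72),
                       ("the subject", 0.47), ("the thema", 0.99)],
    ("es", "gibt"): [("there is", 0.96), ("there are", 0.72)],
    ("morgen",): [("tomorrow", 0.90)],
    ("fliege", "ich"): [("will I fly", 0.63), ("will fly", 0.17),
                        ("I will fly", 0.13)],
}

def translation_options(src_words):
    """Enumerate (start, end, translation, prob) for every source span
    that has an entry in the phrase table."""
    n = len(src_words)
    for i in range(n):
        for j in range(i + 1, n + 1):
            for e, p in phrase_table.get(tuple(src_words[i:j]), []):
                yield (i, j, e, p)
```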

Recipe: Instructions

Translation Process

Task: translate this sentence from German into English:
    er geht ja nicht nach hause

Pick a phrase in the input and translate it, repeating until the whole input is covered:
    er → he
    er ja nicht → he does not
    er geht ja nicht → he does not go
    er geht ja nicht nach hause → he does not go home

It is allowed to pick words out of sequence (reordering), and phrases may have multiple words (many-to-many translation).

Computing Translation Probability

Probabilistic model for phrase-based translation:

$$e_{\text{best}} = \arg\max_e \prod_{i=1}^{I} \phi(\bar{f}_i \mid \bar{e}_i)\; d(\text{start}_i - \text{end}_{i-1} - 1)\; p_{\text{LM}}(e)$$

The score is computed incrementally for each partial hypothesis.

Components:
- Phrase translation: picking phrase $\bar{f}_i$ to be translated as phrase $\bar{e}_i$ → look up score $\phi(\bar{f}_i \mid \bar{e}_i)$ from the phrase translation table
- Reordering: the previous phrase ended at $\text{end}_{i-1}$, the current phrase starts at $\text{start}_i$ → compute $d(\text{start}_i - \text{end}_{i-1} - 1)$
- Language model: for an n-gram model, keep track of the last $n-1$ words → compute score $p_{\text{LM}}(w_i \mid w_{i-(n-1)}, \dots, w_{i-1})$ for each added word $w_i$
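In log space, the incremental update for extending a hypothesis by one phrase is just a sum of the three component scores. A minimal sketch, under the assumption of an exponential distortion penalty d(x) = alpha^|x| (the parameter names and the ALPHA value are mine, not from the slides):

```python
import math

def extend_score(hyp_logprob, prev_end, start, phrase_prob, lm_probs):
    """Log score of a hypothesis extended by one phrase pair.

    hyp_logprob -- log probability of the partial hypothesis so far
    prev_end    -- end position of the previously translated source phrase
    start       -- start position of the new source phrase
    phrase_prob -- phi(f_i | e_i) from the phrase table
    lm_probs    -- p_LM(w | history) for each output word the phrase adds
    """
    ALPHA = 0.9  # hypothetical distortion decay: d(x) = ALPHA ** |x|
    distortion = abs(start - prev_end - 1)
    score = hyp_logprob
    score += math.log(phrase_prob)                   # phrase translation
    score += distortion * math.log(ALPHA)            # reordering
    score += sum(math.log(p) for p in lm_probs)      # language model
    return score
```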

Translation Options

[Figure: grid of candidate translations for each source word and phrase of "er geht ja nicht nach hause", e.g. er → he / it, geht → goes / is / go, ja nicht → does not / is not, nach hause → home / to the house]

Many translation options to choose from. In the Europarl phrase table there are 2727 matching phrase pairs for this sentence; by pruning to the top 20 per phrase, 202 translation options remain.

Translation Options

[Figure: the same grid of translation options]

The machine translation decoder does not know the right answer:
- picking the right translation options
- arranging them in the right order
→ Search problem, solved by heuristic beam search

Decoding Algorithm

Translation as a search problem. A partial hypothesis keeps track of:
- which source words have been translated (the coverage vector)
- the n-1 most recent words of English (for the LM!)
- a back-pointer list: the previous hypothesis + the (e, f) phrase pair used
- the (partial) translation probability
- the estimated probability of translating the remaining words (precomputed, a function of the coverage vector)

Start state: no translated words, E = <s>, bp = nil
Goal state: all words translated
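The hypothesis record above maps directly onto a small data structure, as shown below. A minimal sketch in Python; all field names are mine, not from the slides:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Hypothesis:
    coverage: Tuple[bool, ...]   # which source words are translated
    lm_state: Tuple[str, ...]    # n-1 most recent English words
    logprob: float               # (partial) translation log-probability
    future: float                # precomputed future cost estimate
    last_end: int = -1           # end position of the most recent source phrase
    backpointer: Optional["Hypothesis"] = None
    phrase: Optional[Tuple[str, str]] = None  # (e, f) pair used to get here

    def total(self) -> float:
        """Score used for pruning: partial score plus future cost estimate."""
        return self.logprob + self.future

def start_state(src_len: int) -> Hypothesis:
    """Start state: nothing covered, LM history is just <s>."""
    return Hypothesis(coverage=(False,) * src_len,
                      lm_state=("<s>",), logprob=0.0, future=0.0)
```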

Decoding: Precompute Translation Options
    er geht ja nicht nach hause
Consult the phrase translation table for all input phrases.

Decoding: Start with Initial Hypothesis
    er geht ja nicht nach hause
Initial hypothesis: no input words covered, no output produced.

Decoding: Hypothesis Expansion
Pick any translation option, create a new hypothesis (here: "are").

Decoding: Hypothesis Expansion
Create hypotheses for all other translation options ("he", "it", ...).

Decoding: Hypothesis Expansion
Also create hypotheses from the created partial hypotheses ("yes", "he goes home", "does not go home", ...).

Decoding: Find Best Path
Backtrack from the highest-scoring complete hypothesis.

Complexity

This is an NP-complete problem. Reduction from TSP (sketch):
- each source word is a city
- a bigram LM encodes the distance between pairs of cities
Knight (1999) has a careful proof.

How do we solve such problems?
- Dynamic programming [risk free]: the state is the current city C and the set of previously visited cities. The order in which the previous cities were visited doesn't matter, as long as we keep the best path to C through that set. How many states are there? (With n cities: on the order of n · 2^n.)
- Approximate search [risky]

Recombination

Two hypothesis paths lead to two matching hypotheses:
- same number of foreign words translated
- same English words in the output
- different scores
(e.g. two paths both producing "it is")

The worse hypothesis is dropped.

Recombination

Two hypothesis paths lead to hypotheses indistinguishable in subsequent search:
- same number of foreign words translated
- same last two English words in the output (assuming a trigram language model)
- same last foreign word translated
- different scores
(e.g. "he does not" and "it does not" share the LM state "does not")

The worse hypothesis is dropped.

Restrictions on Recombination

- Translation model: phrase translations are independent of each other → no restriction on hypothesis recombination
- Language model: the last n-1 words are used as history in the n-gram language model → recombined hypotheses must match in their last n-1 words
- Reordering model: distance-based reordering is based on the distance to the end position of the previous input phrase → recombined hypotheses must have that same end position
- Other feature functions may introduce additional restrictions
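In code, recombination amounts to hashing hypotheses on exactly the state that future feature scores can see. A minimal sketch under the assumptions above (trigram LM, distance-based reordering), reusing the hypothetical Hypothesis class sketched earlier:

```python
def recombination_key(hyp, n=3):
    """State visible to all future feature scores: coverage vector,
    last n-1 English words, and end position of the last source phrase."""
    return (hyp.coverage, hyp.lm_state[-(n - 1):], hyp.last_end)

def add_with_recombination(stack, hyp, n=3):
    """stack: dict mapping recombination key -> best hypothesis so far."""
    key = recombination_key(hyp, n)
    old = stack.get(key)
    if old is None or hyp.logprob > old.logprob:
        stack[key] = hyp  # the worse hypothesis is dropped
```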

Pruning

Recombination reduces the search space, but not enough (we still have an NP-complete problem on our hands).

Pruning: remove bad hypotheses early
- put comparable hypotheses into stacks (hypotheses that have translated the same number of input words)
- limit the number of hypotheses in each stack

Stacks

[Figure: stacks of hypotheses grouped by coverage: no word translated, one word translated, two words translated, three words translated; e.g. "he", "it", "are" in the one-word stack, "goes", "does not", "yes" further down]

Hypothesis expansion in a stack decoder:
- a translation option is applied to a hypothesis
- the new hypothesis is dropped into a stack further down

Stack Decoding Algorithm

1: place empty hypothesis into stack 0
2: for all stacks 0 ... n-1 do
3:   for all hypotheses in stack do
4:     for all translation options do
5:       if applicable then
6:         create new hypothesis
7:         place in stack
8:         recombine with existing hypothesis if possible
9:         prune stack if too big
10:      end if
11:    end for
12:  end for
13: end for

f: Maria no dio una bofetada a la bruja verde

Stacks Q[0], Q[1], Q[2], ... hold hypotheses by number of source words covered; each hypothesis records its LM state (e), coverage vector (cp), and probability (p):

1. Q[0] holds the initial hypothesis (e: <s>, cp: ---------, p: 1.0).
2. Expanding with Maria → Mary puts (e: <s> Mary, cp: *--------, p: 0.9) into Q[1].
3. Expanding with Maria → Maria puts (e: <s> Maria, cp: *--------, p: 0.3) into Q[1].
4. Expanding <s> Mary with no → did not puts (e: did not, cp: **-------, p: 0.3) into Q[2]; the output so far is "Mary did not".
5. A second did not expansion reaches the same LM state and coverage vector; instead of adding a new item, the existing one is recombined: its back-pointer list and probability are updated (p: 0.3 → 0.45).
6. Expanding with no → not puts (e: Mary not, cp: **-------, p: 0.1) into Q[2]; expanding the recombined item with dio una bofetada → slap puts (e: not slap, cp: *****----, p: 0.316) into Q[5].

Pruning

Pruning strategies:
- histogram pruning: keep at most k hypotheses in each stack
- stack (threshold) pruning: keep only hypotheses with score ≥ α · best score in the stack, for some α < 1

Computational time complexity of decoding with histogram pruning:
O(max stack size × translation options × sentence length)
The number of translation options is linear in sentence length, hence quadratic complexity:
O(max stack size × sentence length²)
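Both strategies are a few lines each. A minimal sketch, assuming hypotheses carry a total() score as in the earlier Hypothesis sketch and working in log space, where the multiplicative α threshold becomes an additive log(α) offset:

```python
import math

def histogram_prune(stack, k):
    """Keep at most the k best-scoring hypotheses."""
    return sorted(stack, key=lambda h: h.total(), reverse=True)[:k]

def threshold_prune(stack, alpha):
    """Keep hypotheses within a factor alpha (< 1) of the best score.
    In log space: total >= best_total + log(alpha)."""
    if not stack:
        return stack
    best = max(h.total() for h in stack)
    cutoff = best + math.log(alpha)
    return [h for h in stack if h.total() >= cutoff]
```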

Reordering Limits

- Limit reordering to a maximum reordering distance
- Typical reordering distance: 5-8 words, depending on the language pair; a larger reordering limit hurts translation quality
- Reduces complexity to linear: O(max stack size × sentence length)
- Speed / quality trade-off by setting the maximum stack size
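Enforcing the limit is a single check when enumerating expansion options. A minimal sketch (the helper name and default limit are mine, not from the slides):

```python
def within_reordering_limit(prev_end, start, max_distance=6):
    """Allow an expansion only if the jump from the end of the previous
    source phrase to the start of the next one is within the limit."""
    return abs(start - prev_end - 1) <= max_distance
```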

Translating the Easy Part First?

the tourism initiative addresses this for the first time

Hypothesis path 1 translates "the tourism initiative" word by word:
- the → die (tm: -0.19, lm: -0.4, d: 0, all: -0.65)
- tourism → touristische (tm: -1.16, lm: -2.93, d: 0, all: -4.09)
- initiative → initiative (tm: -1.21, lm: -4.67, d: 0, all: -5.88)

Hypothesis path 2 translates "the first time":
- das erste mal (tm: -0.56, lm: -2.81, d: -0.74, all: -4.11)

Both hypotheses translate 3 words; the worse hypothesis has the better score.

Estimating Future Cost

Future cost estimate: how expensive is the translation of the rest of the sentence?
Optimistic: choose the cheapest translation options.

Cost for each translation option:
- translation model: cost known
- language model: output words known, but not their context → estimate without context
- reordering model: unknown → ignored for future cost estimation

Cost Estimates from Translation Options

word:  the   tourism  initiative  addresses  this  for   the   first  time
cost: -1.0   -2.0     -1.5        -2.4      -1.4  -1.0  -1.0  -1.9   -1.6

[Figure also shows the cheapest options for selected multi-word spans: -4.0, -2.5, -2.2, -1.3, -2.4, -2.7, -2.3, -2.3, -2.3]

Cost of the cheapest translation options for each input span (log-probabilities).

Cost Estimates for All Spans

Compute a cost estimate for all contiguous spans by combining the cheapest options.

Future cost estimate for n words (starting from the given word):

first word     1     2     3     4     5     6     7     8     9
the          -1.0  -3.0  -4.5  -6.9  -8.3  -9.3  -9.6 -10.6 -10.6
tourism      -2.0  -3.5  -5.9  -7.3  -8.3  -8.6  -9.6  -9.6
initiative   -1.5  -3.9  -5.3  -6.3  -6.6  -7.6  -7.6
addresses    -2.4  -3.8  -4.8  -5.1  -6.1  -6.1
this         -1.4  -2.4  -2.7  -3.7  -3.7
for          -1.0  -1.3  -2.3  -2.3
the          -1.0  -2.2  -2.3
first        -1.9  -2.4
time         -1.6

Function words are cheaper (the: -1.0) than content words (tourism: -2.0).
Common phrases are cheaper (for the first time: -2.3) than unusual ones (tourism initiative addresses: -5.9).
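The table is filled by a simple dynamic program: the estimate for a span is either the cheapest single option covering it or the best way of splitting it into two cheaper sub-spans. A minimal sketch, where cheapest_option is a hypothetical lookup of the best phrase-table log-probability for a span:

```python
def future_cost_table(n, cheapest_option):
    """fc[i][j] = optimistic cost estimate for source span [i, j).

    cheapest_option(i, j) returns the best log-probability of any
    translation option covering exactly [i, j), or None if there is none.
    """
    NEG_INF = float("-inf")
    fc = [[NEG_INF] * (n + 1) for _ in range(n + 1)]
    for length in range(1, n + 1):
        for i in range(n - length + 1):
            j = i + length
            best = cheapest_option(i, j)
            if best is None:
                best = NEG_INF
            # Or split the span in two and combine the sub-span estimates.
            for k in range(i + 1, j):
                best = max(best, fc[i][k] + fc[k][j])
            fc[i][j] = best
    return fc
```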

Combining Score and Future Cost

The hypothesis score and the future cost estimate are combined for pruning:
- the left hypothesis starts with the hard part "the tourism initiative" (die touristische initiative): score -5.88, future cost -6.1 → total -11.98
- the middle hypothesis starts with the easiest part "the first time" (das erste mal): score -4.11, future cost -9.3 → total -13.41
- the right hypothesis picks easy parts "this for ... time" (für diese zeit): score -4.86, future cost -9.1 → total -13.96
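With the span table above, the future cost of a partial hypothesis is the sum of the estimates over its maximal uncovered runs of source words, and pruning then compares score plus future cost. A minimal sketch:

```python
def future_cost(coverage, fc):
    """Sum the span estimates fc[i][j] over maximal uncovered runs."""
    total, i, n = 0.0, 0, len(coverage)
    while i < n:
        if coverage[i]:
            i += 1
            continue
        j = i
        while j < n and not coverage[j]:
            j += 1
        total += fc[i][j]  # one maximal uncovered run [i, j)
        i = j
    return total
```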

f: Maria no dio una bofetada a la bruja verde

Q[0]: (e: <s>,       cp: ---------, p: 1.0, fc: 1.5e-9)
Q[1]: (e: <s> Mary,  cp: *--------, p: 0.9, fc: 8.6e-9)
      (e: <s> Maria, cp: *--------, p: 0.3, fc: 8.6e-9)
      (e: <s> Not,   cp: -*-------, p: 0.4, fc: 1.0e-9)

Future costs make these hypotheses comparable.

Other Decoding Algorithms
- A* search
- Greedy hill-climbing
- Using finite state transducers (standard toolkits)

A* Search

[Figure: search space plotted as probability + heuristic estimate vs. number of words covered, showing the cheapest score and a depth-first expansion to a completed path]

- Uses an admissible future cost heuristic: it never overestimates the cost
- Translation agenda: create the hypothesis with the lowest score + heuristic cost
- Done when a complete hypothesis is created

Greedy Hill-Climbing

- Create one complete hypothesis with depth-first search (or other means)
- Search for better hypotheses by applying change operators:
  - change the translation of a word or phrase
  - combine the translation of two words into a phrase
  - split up the translation of a phrase into two smaller phrase translations
  - move parts of the output into a different position
  - swap parts of the output with the output at a different part of the sentence
- Terminates when no operator application produces a better translation

Decoding Algorithm (Pseudocode)

Q[0] ← start state
for i = 0 to |f| - 1:
    keep the b best hypotheses at Q[i]
    for each hypothesis h in Q[i]:
        for each untranslated span in h.c for which there is a translation <e, f> in the phrase table:
            h' ← h extended by <e, f>
            if there is an item in Q[|h'.c|] with the same LM state:
                update that item's back-pointer list and probability
            else:
                add h' to Q[|h'.c|]
Find the best hypothesis in Q[|f|]; reconstruct the translation by following back pointers.
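Putting the pieces together, here is a compact end-to-end sketch of the stack decoder in Python. It is deliberately a toy: a tiny hand-written phrase table, no language model scoring, no future cost, no reordering penalty, and recombination keeps only the better path rather than maintaining a back-pointer list. All names and probabilities are mine, not from the slides:

```python
# Toy phrase table: source span -> [(translation, probability)].
PHRASES = {
    ("Maria",): [("Mary", 0.9), ("Maria", 0.3)],
    ("no",): [("did not", 0.5), ("not", 0.3)],
    ("dio", "una", "bofetada"): [("slap", 0.7)],
    ("a", "la"): [("the", 0.8)],
    ("bruja", "verde"): [("green witch", 0.85)],
}

def options(src, covered):
    """Translation options over spans that are still untranslated."""
    n = len(src)
    for a in range(n):
        for b in range(a + 1, n + 1):
            if any(i in covered for i in range(a, b)):
                continue
            for e, p in PHRASES.get(tuple(src[a:b]), []):
                yield a, b, e, p

def decode(src, beam=10):
    n = len(src)
    # Q[k]: hypotheses covering k source words, keyed on
    # (coverage, last output word) for recombination.
    Q = [dict() for _ in range(n + 1)]
    Q[0][(frozenset(), "<s>")] = (1.0, "")
    for k in range(n):
        # Histogram pruning: expand only the `beam` best hypotheses.
        best = sorted(Q[k].items(), key=lambda kv: -kv[1][0])[:beam]
        for (covered, _), (prob, out) in best:
            for a, b, e, p in options(src, covered):
                covered2 = covered | frozenset(range(a, b))
                key = (covered2, e.split()[-1])
                cand = (prob * p, (out + " " + e).strip())
                stack = Q[len(covered2)]
                if key not in stack or cand > stack[key]:
                    stack[key] = cand  # recombine: keep the better path
    return max(Q[n].values())  # (probability, best translation)

print(decode("Maria no dio una bofetada a la bruja verde".split()))
```

Running it prints the highest-scoring full translation, "Mary did not slap the green witch", together with its probability.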

Reordering

- Languages express words in different orders: bruja verde vs. green witch
- Phrase pairs can memorize some of these
- More general: in decoding, skip ahead
- Problem: won't the easy parts of the sentence be translated first?
- Solution: future cost estimate. For every coverage vector, estimate what it will cost to translate the remaining untranslated words. When pruning, use p × future cost!

Decoding Summary
- Finding the best hypothesis is NP-hard
- Even with no language model, there is an exponential number of states!
- Solution 1: limit reordering
- Solution 2: (lossy) pruning