Machine Translation CMSC 723 / LING 723 / INST 725 MARINE CARPUAT marine@cs.umd.edu

Noisy Channel Model for Machine Translation The noisy channel model decomposes machine translation into two independent subproblems: word alignment (the translation model) and language modeling.
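Spelled out (the standard noisy channel formulation; the slide itself carries no equation): to translate a French sentence F, pick the English sentence E that maximizes the product of the translation model and the language model:

```latex
\hat{E} = \arg\max_{E} P(E \mid F) = \arg\max_{E} P(F \mid E)\, P(E)
```

The translation model P(F | E) is what word alignment lets us estimate; P(E) is the language model.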

Word Alignment with IBM Models 1, 2 Probabilistic models with strong independence assumptions. This results in linguistically naïve models: asymmetric, 1-to-many alignments. But it allows efficient parameter estimation and inference. Alignments are hidden variables (unlike words, which are observed), so they require unsupervised learning (the EM algorithm).

Today: walk through an example of EM; phrase-based models (a slightly more recent translation model); decoding.

EM FOR IBM1

IBM Model 1: generative story. Input: an English sentence of length l and a French sentence length m. For each French position i in 1..m: pick an English source index j, then choose a translation of the English word at position j.
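In symbols, the standard Model 1 form of this story (pick the source index uniformly, then pick the translation) is:

```latex
p(f, a \mid e) = \prod_{i=1}^{m} \frac{1}{l+1}\, t(f_i \mid e_{a_i}),
\qquad a_i \in \{0, 1, \dots, l\}
```

where index 0 is the NULL word (which the worked example below drops).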

EM for IBM Model 1. Expectation (E)-step: compute expected counts for the parameters t by summing over the hidden alignment variable. Maximization (M)-step: compute the maximum likelihood estimate of t from the expected counts.
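Concretely, these two steps use the standard Model 1 formulas: the E-step weights each cooccurring word pair by the posterior probability that they are aligned (the 1/(l+1) factors cancel), and the M-step renormalizes the resulting expected counts:

```latex
p(a_i = j \mid f, e) = \frac{t(f_i \mid e_j)}{\sum_{j'} t(f_i \mid e_{j'})},
\qquad
t(f \mid e) = \frac{\mathbb{E}[\mathrm{count}(e, f)]}{\sum_{f'} \mathbb{E}[\mathrm{count}(e, f')]}
```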

EM example: initialization. Toy parallel corpus of two sentence pairs: "green house" / "casa verde" and "the house" / "la casa"; the translation probabilities t are initialized uniformly. For the rest of this talk, French = Spanish.

EM example: E-step (a) compute the probability of each alignment, p(a | f, e). Note: we're making many simplifying assumptions in this example! No NULL word; we only consider alignments where each French and English word is aligned to something; and we ignore q.

EM example: E-step (b) normalize to get p(a | f, e).

EM example: E-step (c) compute expected counts, weighting each count by p(a | e, f).

EM example: M-step. Compute probability estimates by normalizing the expected counts.
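The arithmetic for this walkthrough lived in the slide figures. Worked out under the stated assumptions (uniform initialization t = 1/3, no NULL word, one-to-one alignments only): each sentence pair has exactly two alignments, each with p(f, a | e) = (1/3)(1/3) = 1/9, which normalizes to p(a | f, e) = 1/2. Since casa cooccurs with house in both pairs, it accumulates expected count 1/2 + 1/2 = 1, while every other cooccurring pair gets 1/2, so the M-step gives:

```latex
t(\text{casa} \mid \text{house}) = \frac{1}{1 + \frac{1}{2} + \frac{1}{2}} = \frac{1}{2},
\qquad
t(\text{verde} \mid \text{house}) = t(\text{la} \mid \text{house}) = \frac{1}{4}
```

Already after one iteration, casa is the most likely translation of house, the only English word it cooccurs with in both pairs.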

EM example: next iteration

EM for IBM 1 in practice. The previous example aims to illustrate the intuition of the EM algorithm, but it is a little naïve: we had to enumerate all possible alignments, which is very inefficient! In practice, we don't need to sum over all possible alignments explicitly for IBM1: http://www.cs.columbia.edu/~mcollins/courses/nlp2011/notes/ibm12.pdf
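Here is a minimal sketch of that efficient version on the toy corpus (my illustration, not code from the lecture): because Model 1's alignment posteriors factorize per French position, each E-step touches only word pairs, never whole alignments. Unlike the worked example, it keeps the standard 1-to-many alignment space, still with no NULL word.

```python
from collections import defaultdict

# Toy corpus from the slides: (French, English) sentence pairs.
corpus = [("casa verde".split(), "green house".split()),
          ("la casa".split(), "the house".split())]

f_vocab = {f for fs, _ in corpus for f in fs}
t = defaultdict(lambda: 1.0 / len(f_vocab))  # t[(f, e)], uniform init

for _ in range(10):
    counts = defaultdict(float)  # expected counts c(e, f)
    totals = defaultdict(float)  # sum of c(e, f) over f, for each e
    for fs, es in corpus:
        for f in fs:
            # Posterior that f aligns to each e; the 1/(l+1) factors
            # cancel, so we never enumerate alignments.
            z = sum(t[(f, e)] for e in es)
            for e in es:
                p = t[(f, e)] / z
                counts[(e, f)] += p
                totals[e] += p
    # M-step: relative frequency of expected counts.
    for (e, f) in counts:
        t[(f, e)] = counts[(e, f)] / totals[e]

print(round(t[("casa", "house")], 3))  # approaches 1.0 over iterations
```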

PHRASE-BASED MODELS

Phrase-based models. The most common way to model P(F | E) nowadays (instead of the IBM models). The sentence pair is segmented into phrases; each phrase pair is scored by a phrase translation probability, and a distortion term, based on the start position of f_i and the end position of f_(i-1), gives the probability of two consecutive English phrases being separated by a particular span in French.
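The formula behind the slide's labels ("start position of f_i", "end position of f_(i-1)") is the standard phrase-based decomposition, e.g. as in Jurafsky & Martin:

```latex
P(F \mid E) = \prod_{i=1}^{I} \phi(\bar{f}_i \mid \bar{e}_i)\;
d\!\left(\mathrm{start}_i - \mathrm{end}_{i-1} - 1\right)
```

where φ is the phrase translation probability and d the distortion model; start_i is the start position of the French phrase translating the i-th English phrase, and end_{i-1} the end position of the French phrase translating the previous one.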

Phrase alignments are derived from word alignments. (Recall that the IBM model represents P(Spanish | English).) Get high-confidence alignment links by intersecting the IBM word alignments trained in both directions.

Phrase alignments are derived from word alignments. Improve recall by adding some links from the union of the alignments.

Phrase alignments are derived from word alignments. Extract all phrase pairs that are consistent with the word alignment.
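A sketch of the consistency check at the heart of phrase extraction (my illustration of the standard algorithm, not the lecture's code, and simplified: it omits the usual extension over unaligned boundary words). A phrase pair is consistent if no word inside it is aligned to a word outside it.

```python
def extract_phrases(n_f, alignment, max_len=4):
    """Enumerate phrase pairs consistent with a word alignment.

    n_f: French sentence length; alignment: set of (f, e) index pairs.
    """
    phrases = []
    for f1 in range(n_f):
        for f2 in range(f1, min(f1 + max_len, n_f)):
            # English positions aligned to the French span [f1, f2].
            es = {e for (f, e) in alignment if f1 <= f <= f2}
            if not es:
                continue  # span has no alignment anchor
            e1, e2 = min(es), max(es)
            if e2 - e1 >= max_len:
                continue
            # Consistent iff no English word in [e1, e2] aligns
            # outside the French span.
            if all(f1 <= f <= f2 for (f, e) in alignment if e1 <= e <= e2):
                phrases.append(((f1, f2), (e1, e2)))
    return phrases

# "Maria no" / "Mary did not" with Maria-Mary, no-did, no-not links:
print(extract_phrases(2, {(0, 0), (1, 1), (1, 2)}))
```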

Phrase Translation Probabilities. Given such phrases, we can get the required statistics for the model from relative frequencies of the extracted phrase pairs.
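The formula itself did not survive transcription; the standard relative-frequency estimate over the extracted phrase pairs is:

```latex
\phi(\bar{f} \mid \bar{e}) =
\frac{\mathrm{count}(\bar{e}, \bar{f})}{\sum_{\bar{f}'} \mathrm{count}(\bar{e}, \bar{f}')}
```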

Phrase-based Machine Translation

DECODING

Decoding for phrase-based MT. Basic idea: search the space of possible English translations in an efficient manner, to find the best translation according to our model, Ê = argmax_E P(E) P(F | E).

Decoding as Search. Starting point: the null state, with no French content covered and no English included. We'll drive the search by choosing French words/phrases to cover and choosing a way to cover them; subsequent choices are appended left-to-right to previous choices. Stop when all input words are covered.

Decoding Maria no dio una bofetada a la bruja verde

Decoding Maria no dio una bofetada a la bruja verde Mary

Decoding Maria no dio una bofetada a la bruja verde Mary did not

Decoding Maria no dio una bofetada a la bruja verde Mary Did not slap

Decoding Maria no dio una bofetada a la bruja verde Mary Did not slap the

Decoding Maria no dio una bofetada a la bruja verde Mary Did not slap the green

Decoding Maria no dio una bofetada a la bruja verde Mary Did not slap the green witch

Decoding Maria no dio una bofetada a la bruja verde Mary did not slap the green witch

Decoding In practice: we need to incrementally pursue a large number of paths. Solution: a heuristic search algorithm called multi-stack beam search.

Stack decoding: a simplified view

Space of possible English translations given phrase-based model

Three stages of stack decoding

multi-stack beam search

Multi-stack beam search. One stack per number of French words covered, so that we make apples-to-apples comparisons when pruning. Beam-search pruning for each stack: prune high-cost states (those outside the beam).
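A minimal sketch of the stack organization (my illustration under simplifying assumptions: a fixed phrase table, costs as negative log-probabilities, and no language model, distortion, or future-cost terms):

```python
import heapq

def stack_decode(src, phrase_table, beam=10):
    """Multi-stack beam search: stacks[k] holds hypotheses covering k words.

    src: list of source words.
    phrase_table: dict from source phrase tuples to [(cost, translation)].
    """
    n = len(src)
    # Each hypothesis: (cost, coverage bitmask, English output so far).
    stacks = [[] for _ in range(n + 1)]
    stacks[0].append((0.0, 0, ""))
    for k in range(n):
        # Beam pruning: expand only the lowest-cost hypotheses here.
        for cost, cov, out in heapq.nsmallest(beam, stacks[k]):
            for i in range(n):
                for j in range(i + 1, n + 1):
                    span = (1 << j) - (1 << i)  # bits i..j-1
                    if cov & span:
                        continue  # overlaps already-covered words
                    for p_cost, e in phrase_table.get(tuple(src[i:j]), []):
                        stacks[k + (j - i)].append(
                            (cost + p_cost, cov | span,
                             (out + " " + e).strip()))
    return min(stacks[n], default=None)

table = {("maria",): [(1.0, "Mary")],
         ("no",): [(1.5, "not")],
         ("no", "dio", "una", "bofetada"): [(2.0, "did not slap")]}
print(stack_decode("maria no dio una bofetada".split(), table))
```

A real decoder would add the language model, distortion, and future-cost terms to the score, and recombine hypotheses with identical coverage and language model state.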

Cost = current cost + future cost. Future cost = cost of translating the remaining words in the French sentence. The exact future cost, i.e. the minimum cost (maximum probability) over all remaining translations, is too expensive to compute! Approximation: find the sequence of English phrases that has the minimum product of language model and translation model costs.
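The usual way to make this approximation cheap is to precompute the cheapest cost for every contiguous source span with a small dynamic program (again my sketch, not the lecture's code; cost(i, j) is assumed to return the best phrase cost for span [i, j), combining translation and language model terms, or infinity if the phrase table has no entry):

```python
def future_cost_table(n, cost):
    """fc[i][j]: cheapest cost to translate source span [i, j)."""
    fc = [[float("inf")] * (n + 1) for _ in range(n + 1)]
    for length in range(1, n + 1):
        for i in range(n - length + 1):
            j = i + length
            fc[i][j] = cost(i, j)  # translate the span as one phrase
            for k in range(i + 1, j):  # or split it into two parts
                fc[i][j] = min(fc[i][j], fc[i][k] + fc[k][j])
    return fc
```

During search, a hypothesis's future cost is the sum of fc[i][j] over the maximal uncovered spans [i, j) of its coverage vector.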

Complexity Analysis. Time complexity of decoding as described so far: O(max stack size x number of ways to expand hypotheses x sentence length). The number of hypothesis expansions is linear in sentence length, because we only consider the top k translation candidates in the phrase table, giving O(max stack size x sentence length^2). In practice: O(max stack size x sentence length), because we limit the reordering distance, so that only a constant number of hypothesis expansions are considered.

RECAP

Phrase-based Machine Translation: the full picture

Phrase-based MT: discussion. What is the advantage of splitting the problem in two? What are the strengths and weaknesses of this approach?