Evaluation in Machine Translation


Emine Sakir, Stefan Petrik

Overview
- Problems of the N-Gram Approach
- Word Error Rate (WER) Based Measures
  - WER, mWER (Word Error Rate, multi-reference Word Error Rate)
  - PER, mPER (Position-independent word Error Rate, multi-reference)
  - GSA (Generation String Accuracy)
  - RED (grader based on Edit Distances)
- Minimum Error Rate Training

Word Error Rate (WER) Based Measures
Problems of the n-gram approach:
- position-dependent score
- intolerance towards small errors, such as those common in conversational speech

Problems of the N-Gram Approach
Position-dependent score
Reference: I brought a small white flower to my girl.
1) I took a small white flower to my girl.
2) I once brought a small white flower to my girl.
3) I a small white flower to my girl.
Reference: I brought a small white flower to my girl.
1) I brought a little white flower to my girl.
2) I brought a very small white flower to my girl.
3) I brought a white flower to my girl.

Problems of the N-Gram Approach
Intolerance for small deviations:
- word swaps
- semantically similar words
- no differentiation between content and function words
Example
Reference: I brought a small white flower to my girl.
1) I brought a white small flower to my girl.
2) I brought a little snow-white flower to my girl.
3) I brought small white flower my girl.
4) I brought a flower to my girl.
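To make the position-sensitivity concrete, here is a small illustration (not from the slides, which give no code): clipped n-gram precision, the ingredient of BLEU, drops on bigrams after a single harmless word swap even though every individual word still matches the reference.

```python
from collections import Counter

def ngram_precision(ref, hyp, n):
    """Clipped n-gram precision of hyp against a single reference."""
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    # each hypothesis n-gram counts at most as often as it occurs in the reference
    hits = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
    return hits / max(sum(hyp_ngrams.values()), 1)

ref = "I brought a small white flower to my girl .".split()
hyp = "I brought a white small flower to my girl .".split()  # one word swap
print(ngram_precision(ref, hyp, 1))  # 1.0: every word occurs in the reference
print(ngram_precision(ref, hyp, 2))  # 6/9: 'a white', 'white small', 'small flower' miss
```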

Word Error Rate (WER) Based Measures
WER (Word Error Rate)
- sum of substitutions (S), insertions (I), and deletions (D) between the machine-translated text and the reference translation, relative to the number of words R in the reference translation
- multiple references: select the minimum WER
  WER = (S + I + D) / R
  mWER = min_i (S_i + I_i + D_i) / R_i
PER (Position-independent word Error Rate)
- treats a sentence as a bag of words (no word positions)
- PER = number of differences between the bags of words of the machine-translated text and the reference translation, relative to the reference length
GSA (Generation String Accuracy)
- counts a move M (= insertion + deletion of the same element) as one edit operation
  GSA = 1 - (M + S + I' + D') / N
  where I' and D' are the insertions and deletions not absorbed into moves, and N is the reference length

Word Error Rate (WER) Based Measures
Examples
Ref = w1 w2 w3, MT = w1 w3 w2 w4
  WER = 2/3 (1 INS, 1 SUB)
  PER = 1/3 (1 SUB)
  GSA = 1/3 (1 MOV, 1 INS)
Ref = w1 w2 w3 w4, MT = w2 w3 w4 w1
  WER = 2/4 (1 INS, 1 DEL)
  PER = 0
  GSA = 3/4 (1 MOV)
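A minimal sketch of WER and PER as defined above (assumed code, not from the slides); GSA is omitted because counting moves needs a more involved alignment. The assertions reproduce the two worked examples.

```python
from collections import Counter

def wer(ref, hyp):
    """Word error rate: (S + I + D) / R, via Levenshtein distance."""
    R, H = len(ref), len(hyp)
    # dp[i][j] = minimum edits turning ref[:i] into hyp[:j]
    dp = [[0] * (H + 1) for _ in range(R + 1)]
    for i in range(R + 1):
        dp[i][0] = i                                  # deletions
    for j in range(H + 1):
        dp[0][j] = j                                  # insertions
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[R][H] / R

def mwer(refs, hyp):
    """Multi-reference WER: minimum over the references."""
    return min(wer(r, hyp) for r in refs)

def per(ref, hyp):
    """Position-independent error rate: bag-of-words mismatches / R."""
    missing = sum((Counter(ref) - Counter(hyp)).values())
    extra = sum((Counter(hyp) - Counter(ref)).values())
    return max(missing, extra) / len(ref)

ref1, mt1 = "w1 w2 w3".split(), "w1 w3 w2 w4".split()
assert abs(wer(ref1, mt1) - 2/3) < 1e-9   # 1 INS + 1 SUB
assert abs(per(ref1, mt1) - 1/3) < 1e-9   # w4 has no counterpart

ref2, mt2 = "w1 w2 w3 w4".split(), "w2 w3 w4 w1".split()
assert abs(wer(ref2, mt2) - 2/4) < 1e-9   # 1 INS + 1 DEL
assert per(ref2, mt2) == 0                # identical bags of words
```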

Word Error Rate (WER) Based Measures
RED (grader based on Edit Distances)
Idea
- learn human judgement from a small set of sample human gradings
- use multiple edit distances as features
- reduce the complexity of the grading task to the grading scale A, B, C, D
Edit distances used
- ED: plain WER (number of INS, DEL, and SUB operations)
- ED_swp: allow a swap operator, i.e. d(ab, ba) = d(ab, ab) = 0
- ED_sem: use semantic instead of morphologic information
- ED_cnt: restrict the comparison to content words, ignoring function words
- ED_key: restrict the comparison to keywords

RED (grader based on Edit Distances)
Algorithm (learning)
1) Human labelling: compute the median score of the human labels
2) Encode into a 17-dimensional vector M = M_1..M_17
   M_1 = ED
   M_2..M_16 = ED under all combinations of ED_swp, ED_sem, ED_cnt, ED_key
   M_17 = human score
3) Learn a decision tree with the C4.5 algorithm
Algorithm (evaluation)
1) Redo step 2 with M_17 = 0 and apply the learned decision tree to obtain M_17
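A sketch of the learning step under stated assumptions: the slides specify C4.5, for which scikit-learn's DecisionTreeClassifier (a CART implementation) stands in here, and the edit-distance function ed(mt, ref, ops) is a hypothetical interface to the distances listed above.

```python
from itertools import combinations
from sklearn.tree import DecisionTreeClassifier

OPERATORS = ["swp", "sem", "cnt", "key"]   # extensions of the plain ED

def red_features(mt, ref, ed):
    """Build M_1..M_16: the plain ED plus the ED under every non-empty
    combination of the four operator extensions (15 combinations)."""
    feats = [ed(mt, ref, ops=())]                       # M_1 = plain ED
    for r in range(1, len(OPERATORS) + 1):
        for ops in combinations(OPERATORS, r):          # M_2..M_16
            feats.append(ed(mt, ref, ops=ops))
    return feats

def train_red(samples, ed):
    """samples: (mt_sentence, reference, median_human_grade) triples,
    where the grade (M_17) is one of 'A', 'B', 'C', 'D'."""
    X = [red_features(mt, ref, ed) for mt, ref, _ in samples]
    y = [grade for _, _, grade in samples]
    return DecisionTreeClassifier().fit(X, y)

# Evaluation step: build the same 16 features for a new MT output and
# let the tree predict M_17:
#   grade = tree.predict([red_features(mt, ref, ed)])[0]
```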

RED (grader based on Edit Distances)
Experiments
- comparison of 9 MT systems on sentence level and system level
- 9 human judges produced manual scores
- 10-fold cross validation
Method
- sentence-level evaluation: discriminant analysis of scores for grades, accuracy measured
- system-level evaluation: statistical multiple comparison test of average sentence grades
Data
- 345 sentence pairs English-Japanese, randomly chosen from the BTEC corpus (topic: travelling, type: dialogues)
- 16 reference translations per sentence

RED (grader based on Edit Distances)
Results (tables in the original slides)

RED (grader based on Edit Distances)
Conclusions
- RED outperforms BLEU on both the sentence-level and the system-level comparison: higher agreement with human scores
However:
- simplified task (only 4 grades possible)
- only shown for one language pair (English --> Japanese)
- small evaluation corpus

Minimum Error Rate Training
State of the art: training of statistical model parameters based on maximum likelihood and similar criteria
Problem: the statistical approach and the automatic evaluation methods classify errors differently
- the decision rule is only optimal for the zero-one loss function
- other loss functions (e.g. BLEU) require different decision functions
Idea: optimize the model parameters directly with respect to the evaluation criterion, e.g. BLEU, NIST, WER
Method: a new training criterion for the log-linear MT model

Minimum Error Rate Training
Statistical MT with log-linear models
- model the posterior Pr(e | f) with M feature functions h_m(e, f) and model parameters λ_m:
  Pr(e | f) = exp(Σ_m λ_m h_m(e, f)) / Σ_e' exp(Σ_m λ_m h_m(e', f))
- maximum mutual information (MMI) criterion for parameter optimization:
  λ* = argmax_λ Σ_s log Pr_λ(e_s | f_s)
Properties
- unique global optimum
- algorithms with guaranteed convergence (e.g. gradient descent)
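Because the normalizer of Pr(e | f) does not depend on e, the decision rule reduces to an unnormalized weighted feature sum. A minimal sketch (assumed code, not from the slides; the feature functions are taken as given):

```python
def decode(candidates, f, feature_fns, lam):
    """Return the candidate e maximizing sum_m lam[m] * h_m(e, f).
    feature_fns: the M feature functions h_m, e.g. language-model and
    translation-model log-probabilities; lam: the weights lambda_m."""
    def score(e):
        return sum(l * h(e, f) for l, h in zip(lam, feature_fns))
    return max(candidates, key=score)
```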

Minimum Error Rate Training
New training criterion
- error-counting function E(e, r) for a candidate sentence e against a reference r
- candidate translations C_s = {e_s,1, ..., e_s,K}
- minimize the corpus error of the decision rule:
  λ* = argmin_λ Σ_s E(ê(f_s; λ), r_s), with ê(f; λ) = argmax_e Σ_m λ_m h_m(e, f)
Problems
- the argmax prevents gradient descent
- many local optima

Minimum Error Rate Training
Solution: smoothing of the error count so that it becomes differentiable (formula in the original slides)
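The smoothing formula did not survive transcription. A common choice in this setting, sketched here on the assumption that it matches the slide, replaces the hard argmax with a posterior-weighted expected error, which is differentiable in λ:

  E_smooth = Σ_k E(e_k, r) · p(e_k | f)^α / Σ_k' p(e_k' | f)^α

where α > 1 sharpens the weighting toward the argmax.

```python
import math

def smoothed_error(candidates, ref, f, score, error, alpha=3.0):
    """candidates: translations e_k; score(e, f) = lambda . h(e, f);
    error(e, ref) = per-sentence error count E(e, r); alpha is the
    sharpness of the posterior weighting (an illustrative default)."""
    # p(e|f)^alpha is proportional to exp(alpha * score); shift by the
    # maximum (log-sum-exp trick) for numerical stability
    logits = [alpha * score(e, f) for e in candidates]
    m = max(logits)
    weights = [math.exp(z - m) for z in logits]
    Z = sum(weights)
    return sum(w / Z * error(e, ref) for w, e in zip(weights, candidates))
```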

Minimum Error Rate Training
Optimization algorithm (exact line search along one direction γ; see the sketch below)
- parameterize the score of each candidate translation in C_s as a line in γ (t, m constant)
- the best score per sentence is then a piecewise linear function of γ
- compute the interval boundaries and the incremental error-count changes for each source sentence f_s
- traverse the sequence of interval boundaries and update the error count to find the minimum E
- update the parameters according to the interval for which the minimum E was found
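A sketch of the line search under stated assumptions (names are illustrative, not from the paper or slides): each candidate's score along the search direction is a line a + γ·b, the per-sentence argmax traces the upper envelope of these lines, and the corpus error changes only at envelope breakpoints, so the one-dimensional minimum can be found exactly.

```python
def upper_envelope(lines):
    """lines: (a, b, err) triples with score(gamma) = a + gamma * b.
    Returns (hull, xs): hull[i] is the argmax on the i-th interval,
    xs[i] is the breakpoint between hull[i] and hull[i + 1]."""
    lines = sorted(lines, key=lambda l: (l[1], l[0]))
    dedup = []                       # for equal slopes only the largest
    for l in lines:                  # intercept can ever win
        if dedup and dedup[-1][1] == l[1]:
            dedup[-1] = l
        else:
            dedup.append(l)
    hull, xs = [], []
    for a, b, err in dedup:
        while hull:
            a0, b0, _ = hull[-1]
            x = (a0 - a) / (b - b0)  # gamma where this line overtakes the top
            if xs and x <= xs[-1]:   # top line never wins: drop it
                hull.pop()
                xs.pop()
            else:
                xs.append(x)
                break
        hull.append((a, b, err))
    return hull, xs

def min_error_gamma(sentences):
    """sentences: one list of (a, b, err) candidate triples per source
    sentence. Returns (gamma, error) minimizing the corpus error count."""
    events, err = [], 0
    for cands in sentences:
        hull, xs = upper_envelope(cands)
        err += hull[0][2]            # winner as gamma -> -infinity
        for x, (lo, hi) in zip(xs, zip(hull, hull[1:])):
            events.append((x, hi[2] - lo[2]))
    events.sort()
    best_err = err
    best_gamma = events[0][0] - 1.0 if events else 0.0
    for i, (x, delta) in enumerate(events):
        err += delta
        if err < best_err:           # pick a gamma inside the new interval
            nxt = events[i + 1][0] if i + 1 < len(events) else x + 1.0
            best_err, best_gamma = err, (x + nxt) / 2
    return best_gamma, best_err
```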

Minimum Error Rate Training
Experiments
- M = 8 feature functions, e.g. language-model log-probability, translation-model log-probability
- dynamic-programming beam search + n-best list from A* search
- pseudo-reference translations for the MMI criterion = the sentences with the minimum word errors from the n-best list
Data
- 2002 TIDES corpus, Chinese --> English

Minimum Error Rate Training
Results (development-set and test-set tables in the original slides)

Minimum Error Rate Training
Conclusions
- best performance when the training error criterion equals the evaluation metric
- MMI is significantly worse, except on the mWER metric
- no difference between smoothed and unsmoothed error counts
- small number of parameters, hence no overfitting

References
Y. Akiba, K. Imamura, E. Sumita, H. Nakaiwa, S. Yamamoto, H. G. Okuno. Using Multiple Edit Distances to Automatically Grade Outputs from Machine Translation Systems. IEEE Transactions on Audio, Speech and Language Processing, Vol. 14, No. 2, pp. 393-402, 2006.
Franz Josef Och. Minimum Error Rate Training in Statistical Machine Translation. In Proc. of the 41st Annual Meeting of the Association for Computational Linguistics (ACL), pp. 160-167, 2003.

Thank you