MISTRAL: A Lattice Translation System for IWSLT 2007


Alexandre Patry (1), Philippe Langlais (1), Frédéric Béchet (2)
(1) Université de Montréal, (2) University of Avignon
International Workshop on Spoken Language Translation, 2007

Overview of Mistral
The main characteristics of mistral are:
- it uses a phrase-based model
- it works directly on lattices
- it scores (and rescores) hypotheses with a log-linear model
- it uses a beam search algorithm to organise the search space

General algorithm
mistral uses the following algorithm to translate a lattice:
1. Push (empty source, empty target, the lattice's start node) on the stack.
2. Extend and prune incomplete hypotheses.
3. Return the best hypothesis that points at the lattice's end node.
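The three-step algorithm above can be sketched as a toy stack decoder. This is a minimal illustration under assumed data structures (the lattice, phrase table, and scores below are invented toy values, not mistral's actual code or models):

```python
# Toy lattice: node -> list of (word, next_node) edges (hypothetical example).
LATTICE = {
    1: [("mio", 2)],
    2: [("problema", 3)],
    3: [],  # end node
}
END_NODE = 3

# Toy phrase table: source phrase -> list of (translation, score); higher is better.
PHRASES = {
    ("mio",): [("my", -1.0)],
    ("mio", "problema"): [("my problem", -0.5), ("my concern", -1.5)],
    ("problema",): [("problem", -1.2)],
}

def paths_from(node, max_len=2):
    """Enumerate word sequences of up to max_len edges starting at node."""
    if max_len == 0:
        return
    for word, nxt in LATTICE[node]:
        yield (word,), nxt
        for words, end in paths_from(nxt, max_len - 1):
            yield (word,) + words, end

def decode(start=1):
    """Push the empty hypothesis, extend with phrase-table matches along the
    lattice, and return the best complete hypothesis at the end node."""
    stack = [((), (), start, 0.0)]  # (source, target, node, score)
    best = None
    while stack:
        src, tgt, node, score = stack.pop()
        if node == END_NODE:
            if best is None or score > best[1]:
                best = (" ".join(tgt), score)
            continue
        for words, nxt in paths_from(node):
            for trans, s in PHRASES.get(words, []):
                stack.append((src + words, tgt + (trans,), nxt, score + s))
    return best

print(decode())  # → ('my problem', -0.5)
```

The multi-word entry "mio problema" wins over translating word by word because its single phrase score beats the sum of the two unigram scores.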

Example of hypothesis expansion
[Figure: an Italian word lattice (nodes 1-8, with edges labelled mio, problema, problemi, ma, mi) alongside a small phrase table (mio → my; mio problema → my problem / my concern; mi problemi → my problems). Hypotheses are marked as in the stack, not explored yet, explored, or pruned. A hypothesis (source, target, node) is expanded step by step:
1. (è, it is, node 1)
2. (è mio, it is my, node 2)
3. (è mio problema, it is my problem, node 3)
4. (è mio problema, it is my concern, node 3)
5. (è mi problemi, it is my problems, node 8)]

Unknown words: when no expansion is possible
[Figure: a lattice fragment with an edge "curiosita" between nodes 1 and 2 and an empty phrase-table entry. When a word has no translation, it is copied verbatim into the output: the hypothesis (ecco per, here's for, node 1) is extended to (ecco per curiosità, here's for curiosità, node 2).]

Beam search
The search space is organised with a beam search:
- One stack for each time slice of 0.1 second.
- Breadth-first search inside each stack (a stack can hold hypotheses of different depths when a word is shorter than 0.1 second). When this happens, pruning is done before the exploration of each depth.

Pruning of a stack
The pruning of a stack is done in two steps:
1. Keep the 50 best hypotheses.
2. Recombine the remaining hypotheses sharing:
   - the same node
   - their last two source words (source LM)
   - their last two target words (target LM)
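The two pruning steps above can be sketched as follows; the hypothesis tuples and scores are illustrative stand-ins, not mistral's internal data structures:

```python
def prune_stack(stack, beam=50):
    """Keep the `beam` best hypotheses, then recombine those sharing the same
    lattice node and the same last two source and target words (the LM states)."""
    # Each hypothesis: (source_words, target_words, node, score); higher is better.
    survivors = sorted(stack, key=lambda h: h[3], reverse=True)[:beam]
    best_by_state = {}
    for src, tgt, node, score in survivors:
        state = (node, src[-2:], tgt[-2:])  # recombination signature
        kept = best_by_state.get(state)
        if kept is None or score > kept[3]:
            best_by_state[state] = (src, tgt, node, score)
    return list(best_by_state.values())

# The first two hypotheses share a node and their last two source/target
# words, so only the better-scoring one survives recombination.
stack = [
    (("a", "b", "c"), ("x", "y"), 3, -1.0),
    (("d", "b", "c"), ("z", "x", "y"), 3, -2.0),
    (("a", "b"), ("x",), 2, -0.5),
]
print(len(prune_stack(stack)))  # → 2
```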

Exponential model
Each hypothesis is scored (and rescored) with an exponential model:

    ê = argmax_e max_f Σ_{r=1}^{R} λ_r h_r(e, f, o)

where f and e are the source and target sentences and o is the lattice returned by the ASR system.

Evaluation protocol
We evaluated mistral on the Italian-English track using the following protocol:
1. Train translation tables on the training corpus and europarl.
2. Tune the first pass on the first 300 sentences of the dev corpus.
3. Tune the rescoring pass on the next 300 sentences of the dev corpus.
4. Test on the remaining 396 sentences of the dev corpus.

Training
We created one language model and one translation table for each of the following corpora:
- iwslt training data (19,722 sentence pairs)
- europarl corpus (> 928,000 sentence pairs)
We manually created a third translation table containing 122 rules for days, months and numbers. Our final translation table is the concatenation of those three.
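Concatenating the tables while tagging each entry with its corpus of origin (the binary features mentioned later) can be sketched as below; the table layout, entry values, and function name are hypothetical, since the slides do not show mistral's actual file format:

```python
def concatenate_tables(tables):
    """Concatenate several phrase tables, appending one binary feature per
    table that marks which corpus each entry came from."""
    n = len(tables)
    merged = []
    for i, (name, table) in enumerate(tables):
        origin = [1.0 if j == i else 0.0 for j in range(n)]
        for (src, tgt), scores in table.items():
            merged.append((src, tgt, scores + origin))
    return merged

# Toy example with two tiny tables (invented entries).
iwslt = {("mio", "my"): [0.8], ("problema", "problem"): [0.7]}
rules = {("lunedì", "Monday"): [1.0]}
merged = concatenate_tables([("iwslt", iwslt), ("rules", rules)])
print(merged[0])  # → ('mio', 'my', [0.8, 1.0, 0.0])
```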

First pass
The following feature functions were used for the first pass:
- Posterior probability of the path in the lattice.
- Two source and two target trigrams.
- Source and target word penalties.
- Translation table scores:
  - relative frequencies
  - lexical probabilities
  - constant penalty
  - three binary features associating an entry with its corpus

First pass tuning
The first pass weights are tuned as follows:
1. Initialise the weight of the posterior probability to 10 and the other weights to 0.1.
2. Extract the 500 best translations from each lattice of the first 200 sentences.
3. Optimize bleu on those N-best lists using the downhill simplex algorithm.
4. If the weights were updated, go to 2.
5. Use the last 100 sentences as a validation corpus to select the weights of the best iteration.
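The inner step that the simplex search repeatedly evaluates is N-best reranking under a candidate weight vector. A minimal sketch, with invented feature vectors and only two hypotheses standing in for a 500-best list (the real objective is corpus bleu over the reranked output):

```python
def rerank(nbest_lists, weights):
    """For each sentence, pick the hypothesis whose weighted feature sum is
    highest; this is the quantity the downhill simplex search tunes."""
    chosen = []
    for nbest in nbest_lists:
        best = max(nbest, key=lambda h: sum(w * f for w, f in zip(weights, h["features"])))
        chosen.append(best["text"])
    return chosen

# Hypothetical features: [lattice posterior, target LM, word penalty].
nbest = [[
    {"text": "it is my problem", "features": [-2.0, -1.0, -4.0]},
    {"text": "it is my concern", "features": [-2.5, -0.5, -4.0]},
]]
print(rerank(nbest, [10.0, 0.1, 0.1]))  # → ['it is my problem']
```

With the posterior weight initialised at 10, as in step 1 above, the posterior feature dominates the choice.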

Rescoring
The following feature functions are added for the rescoring pass:
- Two source and two target 4-grams.
- Lexical probabilities of the complete sentences in both translation directions.

Rescoring tuning
The rescoring weights are tuned on the next 300 sentences of the dev corpus as follows:
1. Initialise the weights of the new feature functions to 0.1.
2. Run the first pass to extract the 500 best translations of each lattice.
3. Optimize bleu on those N-best lists using the downhill simplex algorithm.

Results

System                         1st Pass          Rescoring
                               wer    bleu       wer    bleu
Ref                            0      20.09      0      21.27
1-best                         11.90  17.97      11.90  19.37
Opt. on bleu                   12.04  17.08      12.07  19.24
Opt. on wer and bleu           11.81  17.58      11.87  18.93
Opt. on bleu, pruned lattices  10.96  19.21      11.04  20.28

- We ran mistral on the reference and on the 1-best to get an idea of the performance we should expect.
- The results are disappointing when our system is run on unpruned lattices: worse wer and bleu than the 1-best.
- Optimizing on the harmonic mean of wer and bleu diminishes wer at the expense of bleu.
- Our best results were obtained when we optimized on bleu and pruned the lattices. An edge is pruned if its posterior probability is lower than 1% of the highest posterior probability of all edges starting at the same node.
- The average number of word hypotheses per spoken word drops from 360 to 2.7 after pruning.
- Even translating the reference yielded poor results. Is it our model or our implementation that is at fault?
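The edge-pruning rule (drop an edge whose posterior probability is below 1% of the best posterior leaving the same node) can be sketched as follows; the lattice representation is a toy stand-in for whatever format the ASR system produces:

```python
def prune_lattice(lattice, threshold=0.01):
    """Drop every edge whose posterior probability is below `threshold`
    times the highest posterior among edges leaving the same node."""
    pruned = {}
    for node, edges in lattice.items():  # edges: list of (word, next_node, posterior)
        if not edges:
            pruned[node] = []
            continue
        best = max(p for _, _, p in edges)
        pruned[node] = [e for e in edges if e[2] >= threshold * best]
    return pruned

# Toy lattice: the 0.004 edge is below 1% of 0.9 and gets removed.
lattice = {1: [("mio", 2, 0.9), ("mi", 3, 0.05), ("ma", 4, 0.004)], 2: []}
print(prune_lattice(lattice))  # → {1: [('mio', 2, 0.9), ('mi', 3, 0.05)], 2: []}
```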

Comparison with moses

Input   System                1st Pass bleu  Rescoring bleu
Ref     mistral               20.09          21.27
Ref     moses w/o distortion  20.91          -
1-best  mistral               17.97          19.37
1-best  moses w/o distortion  18.99          -

- moses was systematically better on the 1st pass, but its bleu scores are low as well. The models we trained are probably at fault.
- Later experiments showed us that the results of mistral are similar to those of moses when the translation table is not pruned. This is because, during pruning, we considered only the translation table scores and not the word penalty and the language models.

A note on features
We made the following observations about the features and their weights:
- The features that had the highest weights were the ones related to ASR (posterior probability, Italian trigrams and penalties).
- The europarl translation table helped us gain more than 1 point in bleu.
- Same observation for the binary feature functions associating an entry of the translation table with its origin.
- When rescoring, 4-grams did not help; the lexical probabilities alone did the job.

Post-processing
Capitalisation was restored with the disambig tool from the srilm toolkit: each word was mapped ambiguously to its capitalised and uncapitalised forms, and the tool chose between them. Only final punctuation marks were restored, by a Naïve Bayes classifier taking as input the first word of each sentence. Both models were trained on the training corpus supplied for the shared task.
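A minimal sketch of a Naïve Bayes choice of final punctuation from the first word of a sentence; the training pairs, class names, and add-one smoothing are illustrative assumptions, not the system's actual model:

```python
from collections import Counter, defaultdict
import math

class PunctNB:
    """Naive Bayes: pick the final punctuation mark maximising
    P(punct) * P(first_word | punct), with add-one smoothing."""
    def __init__(self):
        self.punct_counts = Counter()
        self.word_counts = defaultdict(Counter)
        self.vocab = set()

    def train(self, examples):  # examples: (first_word, final_punct) pairs
        for word, punct in examples:
            self.punct_counts[punct] += 1
            self.word_counts[punct][word] += 1
            self.vocab.add(word)

    def predict(self, word):
        total = sum(self.punct_counts.values())
        v = len(self.vocab)
        def score(p):
            prior = math.log(self.punct_counts[p] / total)
            lik = math.log((self.word_counts[p][word] + 1) /
                           (sum(self.word_counts[p].values()) + v))
            return prior + lik
        return max(self.punct_counts, key=score)

nb = PunctNB()
nb.train([("what", "?"), ("where", "?"), ("the", "."), ("i", "."), ("we", ".")])
print(nb.predict("what"))  # → ?
```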

Shared task results

System          bleu
                Before  C      P      C + P
Official run    21.03   18.66  16.12  13.90
Updated system  23.81   20.75  17.60  16.17

Bugs in mistral have been fixed since we submitted our official results, so we repeated the shared task with our updated system.

Future work
mistral is a young system and many things were overlooked due to a lack of time:
- The pruning parameters were not thoroughly examined (stack size, N-best list sizes, duration of a time slice).
- We have always started tuning from the same point.
- No statistical significance tests were run.
- We should test our system on a bigger corpus.

Conclusion
We presented mistral, a phrase-based decoder working directly on lattices. Our results are disappointing in two ways:
- bleu scores are low in general
- it does not clearly surpass the 1-best baseline