SMT TIDES and all that


Aus der Vogel-Perspektive: A Bird's View (human translation)
Stephan Vogel, Language Technologies Institute, Carnegie Mellon University

Machine Translation Approaches
  Interlingua-based
  Transfer-based
  Direct
  Example-based
  Statistical

Statistical versus Grammar-Based
Statistical and grammar-based MT are often seen as opposing approaches. That is wrong!
The actual dichotomies are:
  Use probabilities vs. treat everything as equally likely (in between: heuristics)
  Rich (deep) structure vs. no or only flat structure
Both dimensions are more or less continuous.
Examples:
  EBMT: flat structure and heuristics
  SMT: flat structure and probabilities
  XFER: deep(er) structure and heuristics
Goal: structurally rich probabilistic models

Statistical Approach
Using statistical models:
  Create many alternatives (hypotheses)
  Give a score to each hypothesis
  Select the best -> search
Advantages:
  Avoid hard decisions, avoid early decisions
  Sometimes, optimality can be guaranteed
  Speed can be traded against quality, no all-or-nothing
  It works better! (in many applications)
Disadvantages:
  Difficulties in handling structurally rich models, mathematically and computationally (but that's also true for non-statistical systems)
  Need data to train the model parameters

Statistical Machine Translation
Based on Bayes' decision rule:
$\hat{e} = \arg\max_{e} p(e \mid f) = \arg\max_{e} p(e)\, p(f \mid e)$
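The second equality holds because p(f) does not depend on e and can be dropped from the argmax. As a minimal sketch of the decision rule, the following picks the best of a few candidate translations. The candidates and all probability values are made up for illustration; they are assumptions, not numbers from the slides.

```python
import math

# Hypothetical candidate translations e for one source sentence f, with
# made-up language model p(e) and translation model p(f|e) values.
CANDIDATES = {
    "the house": {"p_e": 0.04, "p_f_given_e": 0.5},
    "house the": {"p_e": 0.001, "p_f_given_e": 0.5},
    "the home": {"p_e": 0.01, "p_f_given_e": 0.3},
}


def score(models):
    """Return log p(e) + log p(f|e); same argmax as the product p(e) p(f|e)."""
    return math.log(models["p_e"]) + math.log(models["p_f_given_e"])


def decode(candidates):
    """Return the candidate e that maximizes p(e) * p(f|e)."""
    return max(candidates, key=lambda e: score(candidates[e]))


best = decode(CANDIDATES)
print(best)
```

Working in log space avoids numerical underflow once sentences get longer and the products get very small.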

Tasks in SMT
Modelling: build statistical models which capture characteristic features of translation equivalences and of the target language
Training: train the translation model on a bilingual corpus, train the language model on a monolingual corpus
Decoding: find the best translation for new sentences according to the models

Alignment Example
Translation models are based on the concept of alignment.
Most general: each source word aligns (partially, with some probability) to each target word.
Additional restrictions make it mathematically and computationally tractable.

Translation Models
The heritage: IBM
  IBM1: lexical probabilities only
  IBM2: lexicon plus absolute position
  IBM3: plus fertilities
  IBM4: inverted relative position alignment
  IBM5: non-deficient version of Model 4
In the same vein:
  HMM: lexicon plus relative position
  BiBr: Bilingual Bracketing, lexical probabilities plus reordering via parallel segmentation
  Syntax-based: align parse trees

Training
Need bilingual corpora:
  Usually, the more the better
  But the data needs to be appropriate (domain-specific) and clean
  No need for manual annotation
Training of word alignment models:
  Iterative training: EM algorithm
  For HMM: Forward-Backward
  For BiBr: Inside-Outside
  Often maximum approximation: Viterbi alignment
GIZA toolkit:
  Partly developed at a JHU workshop
  Chief programmer: Franz Josef Och

How does it work?
First iteration: start with a uniform probability distribution.
Bilingual corpus (source # target):
  A B C # R S T
  E B F G # S U V
  A D B E # R V S
Word pairs (co-occurrence counts):
  A - R : 2
  A - S : 2
  A - T : 1
  B - R : 2
  B - S : 3
Probabilities p(s|t) (count divided by the total number of co-occurrences of t):
  A - R : 2/7
  A - S : 2/11
  A - T : 1/3
  B - R : 2/7
  B - S : 3/11
Next iteration: multiply counts by probabilities, always renormalize.
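The procedure on this slide can be sketched as an EM loop for a Model-1-style lexicon p(s|t). This is a minimal sketch under assumptions: no NULL word, no convergence test, and the function name `em_iteration` is made up for illustration. Starting from uniform weights, the first iteration reduces to normalized co-occurrence counts for the toy corpus, because every target sentence has three words.

```python
from collections import defaultdict
from fractions import Fraction

# Toy corpus from the slide: (source words, target words) per sentence pair.
CORPUS = [
    ("A B C".split(), "R S T".split()),
    ("E B F G".split(), "S U V".split()),
    ("A D B E".split(), "R V S".split()),
]


def em_iteration(corpus, prob=None):
    """One EM iteration for a lexicon p(s|t); prob=None means uniform start."""
    counts = defaultdict(Fraction)  # expected counts c(s, t)
    totals = defaultdict(Fraction)  # expected counts summed over s, per t
    for src, tgt in corpus:
        for s in src:
            # Weight of each target word as the partner of s:
            # "multiply counts by probabilities".
            weights = {t: (prob[(s, t)] if prob else Fraction(1)) for t in tgt}
            z = sum(weights.values())
            for t, w in weights.items():
                counts[(s, t)] += w / z
                totals[t] += w / z
    # "Always renormalize": p(s|t) sums to one over s for every t.
    return {(s, t): c / totals[t] for (s, t), c in counts.items()}


p1 = em_iteration(CORPUS)        # first iteration, uniform start
p2 = em_iteration(CORPUS, p1)    # next iteration, reweighted counts
print(p1[("A", "R")], p1[("A", "S")], p1[("A", "T")], p1[("B", "S")])
```

Exact fractions are used here only so the toy numbers can be compared by eye; a real implementation would use floating point and many more sentence pairs.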

Phrase Translation
Why?
  To capture context
  Local word reordering
How?
  Typically: train a word alignment model and extract phrase-to-phrase translations from the Viterbi path
  But also: integrated segmentation and alignment
  Also: rule-based segmentation
Notes:
  Often better results when training target to source for the extraction of phrase translations, due to the asymmetry of alignment models
  Phrases are not fully integrated into the alignment model; they are extracted only after training is completed
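One common way to extract phrase pairs from a word alignment is to keep every source span whose aligned target span is "consistent", meaning no word inside the target span is aligned to a word outside the source span. The sketch below follows that idea in its simplest form; it is an assumption about the extraction step, not necessarily the exact method behind these slides, and the French-English example sentence and alignment are made up for illustration.

```python
def extract_phrases(src, tgt, alignment, max_len=3):
    """Extract phrase pairs consistent with a word alignment.

    alignment is a set of (source_index, target_index) links, e.g. taken
    from a Viterbi alignment. Unaligned boundary words are not handled.
    """
    pairs = set()
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            # Target positions linked to the source span [i1, i2].
            linked = [j for (i, j) in alignment if i1 <= i <= i2]
            if not linked:
                continue
            j1, j2 = min(linked), max(linked)
            if j2 - j1 + 1 > max_len:
                continue
            # Consistency: nothing in [j1, j2] links outside [i1, i2].
            if any(j1 <= j <= j2 and not (i1 <= i <= i2)
                   for (i, j) in alignment):
                continue
            pairs.add((" ".join(src[i1:i2 + 1]), " ".join(tgt[j1:j2 + 1])))
    return pairs


# Hypothetical example with local reordering (noun-adjective order).
src = "la maison bleue".split()
tgt = "the blue house".split()
alignment = {(0, 0), (1, 2), (2, 1)}
phrases = extract_phrases(src, tgt, alignment)
for pair in sorted(phrases):
    print(pair)
```

Note that "la maison" is not extracted on its own: its target span would cover "blue", which is aligned to a word outside the source span. The reordering is captured inside the longer pair "maison bleue" / "blue house".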

Language Model
Standard n-gram model:
$p(w_1 \dots w_n) = \prod_i p(w_i \mid w_1 \dots w_{i-1})$
$\approx \prod_i p(w_i \mid w_{i-2}\, w_{i-1})$ (trigram)
$\approx \prod_i p(w_i \mid w_{i-1})$ (bigram)
Many events are not seen in training -> smoothing required
Also class-based LMs and syntactic LMs, interpolated with the word-based LM
Use of available toolkits: CMU LM toolkit, SRI LM toolkit
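To make the bigram case and the need for smoothing concrete, here is a minimal sketch of a bigram model with add-one (Laplace) smoothing. Add-one is chosen only because it is the simplest scheme; the toolkits named above offer better smoothing methods. The two training sentences and the function names are assumptions for illustration.

```python
from collections import Counter


def train_bigram(sentences):
    """Collect context counts, bigram counts and the vocabulary."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(words)
        contexts.update(words[:-1])            # every word that has a successor
        bigrams.update(zip(words, words[1:]))
    return contexts, bigrams, vocab


def bigram_prob(prev, word, model):
    """Add-one smoothed p(word | prev); unseen bigrams get a small mass."""
    contexts, bigrams, vocab = model
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab))


def sentence_prob(sentence, model):
    """p(w_1 ... w_n) as a product of bigram probabilities."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(words, words[1:]):
        prob *= bigram_prob(prev, word, model)
    return prob


model = train_bigram(["the house is small", "the house is big"])
print(bigram_prob("the", "house", model))   # seen bigram
print(bigram_prob("the", "big", model))     # unseen bigram, still non-zero
```

Without smoothing, a single unseen bigram would set the probability of the whole sentence to zero.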

Search for the Best Translation
Given a new source sentence:
Brute force search
  The translation model generates many translations
  Each translation has a score, including the language model score
  Pick the one with the highest score
Result
  Best translation according to the model
  Not necessarily the best translation according to the evaluation metric
  Not necessarily the best translation according to human judgment
Realistic search
  Grow many translations in parallel
  Throw away low-scoring candidates (pruning)
  Search errors: the translation found is not the best according to the models
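The effect of pruning can be sketched with a tiny beam search. The decoder below translates word by word without reordering, which is a strong simplification, and all model tables are made-up values for illustration. With a beam of size 1 it commits to the locally best word and ends up with a search error; a beam of size 2 recovers the translation that the models actually prefer.

```python
import math

# Hypothetical toy models (assumptions, not from the slides).
TRANSLATIONS = {                       # p(target word | source word)
    "das": {"the": 0.6, "that": 0.4},
    "ist": {"is": 1.0},
}
BIGRAM = {                             # p(word | previous word)
    ("<s>", "the"): 0.5, ("<s>", "that"): 0.5,
    ("the", "is"): 0.01, ("that", "is"): 0.9,
}
UNSEEN = 1e-4                          # fallback for unseen bigrams


def beam_search(source, beam_size):
    """Grow hypotheses left to right, keeping the beam_size best after each word."""
    beam = [(0.0, ["<s>"])]            # (log score, target words so far)
    for f in source:
        expanded = []
        for score, words in beam:
            for e, tm in TRANSLATIONS[f].items():
                lm = BIGRAM.get((words[-1], e), UNSEEN)
                expanded.append((score + math.log(tm) + math.log(lm), words + [e]))
        expanded.sort(key=lambda hyp: hyp[0], reverse=True)
        beam = expanded[:beam_size]    # pruning: drop low-scoring candidates
    score, words = beam[0]
    return score, words[1:]


source = "das ist".split()
greedy_score, greedy = beam_search(source, beam_size=1)
wide_score, wide = beam_search(source, beam_size=2)
print(greedy, greedy_score)
print(wide, wide_score)
```

The narrow beam is faster but returns a lower-scoring hypothesis, which is the trade-off between speed and quality that pruning introduces.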

MT Evaluation
Human evaluation all along
  Fluency, adequacy, overall score, etc.
  Problems: inter-evaluator agreement, reproducibility, cost
Automatic scoring
  Use one or several reference translations to compare against
  Define a distance measure, then: the closer, the better
Different scoring metrics proposed and used
  Position-independent error rate (how many words are correct)
  Word error rate (are they all in the correct order)
  BLEU n-gram: how many n-grams match
  NIST n-gram: how many n-grams match, and how informative are they
  Precision
  Recall
MT evaluation is a hot topic: more competition in metric development than in MT development
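The following sketch shows how three of these measures can be computed against a single reference. It is illustrative only: the PER formula is one common formulation, the n-gram precision is just the clipped matching that BLEU builds on (no brevity penalty, no combination over n), and the example sentences are made up.

```python
from collections import Counter


def edit_distance(hyp, ref):
    """Levenshtein distance over words (insertions, deletions, substitutions)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        cur = [i]
        for j, r in enumerate(ref, start=1):
            cur.append(min(prev[j] + 1,               # delete h
                           cur[j - 1] + 1,            # insert r
                           prev[j - 1] + (h != r)))   # match or substitute
        prev = cur
    return prev[-1]


def wer(hyp, ref):
    """Word error rate: word order matters."""
    return edit_distance(hyp, ref) / len(ref)


def per(hyp, ref):
    """Position-independent error rate: compares bags of words only."""
    matches = sum((Counter(hyp) & Counter(ref)).values())
    return (max(len(hyp), len(ref)) - matches) / len(ref)


def ngram_precision(hyp, ref, n):
    """Fraction of hypothesis n-grams also found in the reference (clipped)."""
    def ngrams(words):
        return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    hyp_counts = ngrams(hyp)
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    return sum((hyp_counts & ngrams(ref)).values()) / total


ref = "the house is small".split()
scrambled = "small is the house".split()
print(per(scrambled, ref), wer(scrambled, ref))
print(ngram_precision(scrambled, ref, 1), ngram_precision(scrambled, ref, 2))
```

The scrambled hypothesis has a PER of zero because every word is correct, while WER and the bigram precision both penalize the wrong word order.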

TIDES
DARPA-funded NLP project:
  T: Translingual (Translation undercover ;-)
  I: Information
  D: Detection
  E: Extraction
  S: Summarization
Large number of research groups (universities and companies)
See http://www.darpa.mil/iao/tides.htm

Program Objective
Develop advanced language processing technology to enable English speakers to find and interpret critical information in multiple languages without requiring knowledge of those languages.

Program Strategy
Research: Conduct research to develop effective algorithms for detection, extraction, summarization, and translation, where the source data may be large volumes of naturally occurring speech or text in multiple languages.
Evaluation: Measure accuracy in rigorous, objective evaluations. Outside groups are invited to participate in the annual Information Retrieval, Topic Detection and Tracking, Automatic Content Extraction, and Machine Translation evaluations run by NIST.
Application: Integrate core capabilities to form effective text and audio processing (TAP) systems. Experiment with those systems on real data with real users, then refine and iterate.

MT in TIDES
Evaluations every year
  Chinese large data track: > 100 million words of bilingual corpus
  Chinese small data track: 100k words of bilingual corpus, 10k-entry dictionary
  Arabic large data track: 80 million words of bilingual corpus
  Open data track: use whatever you can find before the data collection deadline, but no significant improvement over the large data track results
Many strong teams
  TIDES-funded plus external groups
  Friendly competition: you tell me your trick, I tell you my trick
Exciting improvements over the last two years
Automatic metrics over-score machine translations or under-score human translations

Surprise Language Evaluation
Do learning approaches make it possible to build useful NLP systems for a new language within weeks?
Dry run exercise: Cebuano
  Only data collection
  Most data essentially found within days
  The resulting corpus was very inhomogeneous: from the Bible to party propaganda
Actual evaluation: Hindi
  Enormous problems with different encodings, many of them proprietary
  Amount of data: > 2 million words of bilingual text
  Several dictionaries
  MT systems, but also NE tagging, cross-lingual IR, etc. built within 4 weeks
Nobody liked it: only dealing with encodings, no new NLP research

The Future
Continuous evaluations: Arabic and Chinese, and perhaps new surprises
Possibly other genres, not only news
Constant improvements
  In evaluation approaches ;-)
  But also in translation!
Similar comparative evaluations are underway and will follow in other projects, also for speech-to-speech translation