Exploiting Parallel Treebanks in Phrase-Based Statistical Machine Translation

Exploiting Parallel Treebanks in Phrase-Based Statistical Machine Translation
John Tinsley
National Centre for Language Technology, Dublin City University, Ireland
Collaborators: Mary Hearne and Andy Way
CICLing 2009, 05/03/2009

Overview

- Phrase-based SMT systems contain purely statistically induced translation models.
- We have demonstrated on a small scale that translation accuracy can be improved by supplementing these models with linguistically motivated phrase pairs extracted from parallel treebanks.
- Here we test this hypothesis on a large-scale MT task.
- We also investigate further ways to exploit parallel treebanks in this MT framework.


Data
- 729,891 sentence pairs from the English-Spanish Europarl corpus (v2)
- 1,000-sentence devset and 2,000-sentence testset

Parallel Treebank
- Parse both sides monolingually: Berkeley parser for English, Bikel parser for Spanish
- Align using the DCU subtree alignment tool

MT System
- Baseline PB-SMT system built with Moses
- 5-gram language model (SRILM)
- Minimum error-rate training on the devset
- Automatic evaluation using Bleu, Nist, and Meteor

Experiment I - Direct Combination
We build three translation models:
- SMT phrase pairs only (Baseline)
- Parallel treebank phrase pairs only (Tree only)
- Union of the above two models (Baseline+Tree)

Config.         Bleu    Nist    %Meteor
Baseline        0.3341  7.0765  57.39
+Tree           0.3397  7.0891  57.82
Tree only       0.3153  6.8187  55.98
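The "union" combination above can be sketched in Python. This is a minimal sketch, not the actual DCU implementation: it assumes each phrase table maps (source, target) phrase pairs to extraction counts, pools the counts, and re-estimates p(target | source) by relative frequency. The function name and toy counts are illustrative.

```python
from collections import defaultdict

def combine_phrase_tables(baseline, treebank):
    """Union two phrase tables: keep every distinct (src, tgt) pair,
    pool the extraction counts, and re-estimate relative frequencies."""
    counts = defaultdict(float)
    for table in (baseline, treebank):
        for pair, c in table.items():
            counts[pair] += c
    # total count per source phrase, for p(tgt | src)
    src_totals = defaultdict(float)
    for (src, _tgt), c in counts.items():
        src_totals[src] += c
    return {(src, tgt): c / src_totals[src]
            for (src, tgt), c in counts.items()}
```

For example, if "the house" was extracted 3 times with "la casa" in the baseline model and once each with "la casa" and "la vivienda" in the treebank model, the combined model gives p(la casa | the house) = 4/5.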

Experiment I - Direct Combination

Resource        Baseline      Treebank
Unique Types    23,261,022    4,985,266
Overlap         1,447,505 (shared between the two)
1-to-1          1.54%         15.91%
1-to-n          3.51%         4.43%

Experiment I - Direct Combination
We noticed issues with some treebank word alignments:
- They constitute 20.3% of the total extracted pairs.
- 7.35% were high-frequency alignments between function words and punctuation.
- We filtered these from the model and reran translation with the filtered model (Strict phrases).

Config.         Bleu    Nist    %Meteor
Baseline        0.3341  7.0765  57.39
+Tree           0.3397  7.0891  57.82
Strict phrases  0.3414  7.1283  57.98
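The "Strict phrases" filter can be sketched as follows. This is an assumption-laden illustration: the slides do not specify the filtering criterion beyond "function words and punctuation", so the function-word list here is a tiny illustrative subset and the helper names are hypothetical.

```python
import string

# illustrative subset only -- a real filter would use full stop-word lists
FUNCTION_WORDS = {"the", "a", "of", "to", "and", "el", "la", "de", "que", "y"}

def is_noisy(src_phrase, tgt_phrase):
    """True if every token on both sides is a function word or punctuation,
    i.e. the kind of high-frequency pair filtered out for 'Strict phrases'."""
    def all_junk(phrase):
        return all(tok in FUNCTION_WORDS
                   or all(ch in string.punctuation for ch in tok)
                   for tok in phrase.split())
    return all_junk(src_phrase) and all_junk(tgt_phrase)

def strict_filter(phrase_table):
    """Drop noisy function-word/punctuation-only pairs from a phrase table."""
    return {pair: score for pair, score in phrase_table.items()
            if not is_noisy(*pair)}
```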

Experiment II - Treebank-Driven Phrase Extraction
Phrase pairs are extracted using heuristics over the statistical word alignment. We create new models by running the heuristics over two different word alignments:
- treebank word alignment only (Treebank extr)
- union of SMT and treebank word alignments (Union extr)

Config.         Bleu    Nist    %Meteor
Baseline        0.3341  7.0765  57.39
+Tree           0.3397  7.0891  57.82
Treebank extr   0.3102  6.6990  55.64
+Tree           0.3199  6.8517  56.39
Union extr      0.3277  6.9587  56.79
+Tree           0.3384  7.0508  57.88
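The extraction heuristic referred to above is the standard one for phrase-based SMT: collect every phrase pair that is consistent with the word alignment (no alignment link leaves the pair). A simplified sketch, without the usual extension over unaligned boundary words; alignments are sets of (source index, target index) links, so the "Union extr" setting is just the set union of the two alignments:

```python
def extract_phrases(src, tgt, alignment, max_len=7):
    """Extract all phrase pairs consistent with a word alignment:
    a pair is kept iff no link connects a word inside it to a word outside it."""
    pairs = set()
    for i1 in range(len(src)):
        for i2 in range(i1, min(len(src), i1 + max_len)):
            # target positions linked to the source span [i1, i2]
            tgts = [j for (i, j) in alignment if i1 <= i <= i2]
            if not tgts:
                continue
            j1, j2 = min(tgts), max(tgts)
            # consistency: no link from inside [j1, j2] back to outside [i1, i2]
            if any(j1 <= j <= j2 and not (i1 <= i <= i2) for (i, j) in alignment):
                continue
            if j2 - j1 + 1 <= max_len:
                pairs.add((" ".join(src[i1:i2 + 1]), " ".join(tgt[j1:j2 + 1])))
    return pairs
```

Running it over `smt_alignment | treebank_alignment` then mirrors the Union extr configuration.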

Experiment II - Treebank-Driven Phrase Extraction
An interesting observation: the Union extr+Tree model gives translation performance comparable to the highest-scoring system, while its phrase table is 56% smaller.

Word Alignment  #Phrases  #Phrases+Tree
Baseline        24.7M     29.7M
Treebank        88.5M     92.89M
Union           7.5M      13.1M

Further Experiments
1. Giving additional weight to treebank phrase pairs in the model
2. Filtering longer phrase pairs from the model
3. Using treebank word alignments to calculate the lexical weighting feature in the translation model

Conclusions
- Improving SMT by supplementing models with treebank phrase pairs scales.
- Treebank word alignments lack sufficient recall to have a positive impact within the SMT framework.
- We can use treebanks' lexical alignments to extract smaller translation models with competitive translation quality.

Future Work
- Explore different ways to combine the two phrase resources.
- Investigate extraction of refined phrase tables further.
- Apply treebanks to more syntactically aware MT paradigms, e.g. Stat-XFER.

Thank you
http://computing.dcu.ie/~jtinsley
http://nclt.dcu.ie/mt

References
- Tinsley, J., V. Zhechev, M. Hearne, and A. Way. 2007a. Robust Language Pair-Independent Sub-Tree Alignment. In Machine Translation Summit XI, Copenhagen, Denmark, pp. 467-474.
- Hearne, M., J. Tinsley, V. Zhechev, and A. Way. 2007. Capturing Translational Divergences with a Statistical Tree-to-Tree Aligner. In Proceedings of the 11th Conference on Theoretical and Methodological Issues in Machine Translation, Skövde, Sweden, pp. 83-94.
- Tinsley, J., M. Hearne, and A. Way. 2007b. Exploiting Parallel Treebanks for Use in Statistical Machine Translation. In Proceedings of the Sixth International Workshop on Treebanks and Linguistic Theories (TLT 07), Bergen, Norway, pp. 175-187.
- Hearne, M., S. Ozdowska, and J. Tinsley. 2008. Comparing Constituency and Dependency Representations for SMT Phrase-Extraction. In Actes de la 15ème Conférence Annuelle sur le Traitement Automatique des Langues Naturelles (TALN 08), Avignon, France.

Experiment III - Weighting Treebank Data
We build three new translation models in which we directly combine the two sets of phrases, but count the treebank phrase pairs 2, 3, and 5 times respectively.

Config.         Bleu    Nist    %Meteor
Baseline+Tree   0.3397  7.0891  57.82
+Tree x2        0.3386  7.0813  57.76
+Tree x3        0.3361  7.0584  57.56
+Tree x5        0.3377  7.0829  57.71
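Counting the treebank pairs k times amounts to scaling their extraction counts before the relative-frequency estimation. A minimal sketch (function name and toy data are illustrative):

```python
from collections import Counter

def weighted_counts(baseline_pairs, treebank_pairs, k):
    """Pool phrase-pair counts, counting each treebank pair k times,
    as in the '+Tree xk' configurations."""
    counts = Counter(baseline_pairs)
    for pair in treebank_pairs:
        counts[pair] += k
    return counts
```

Duplicating the treebank pairs k times shifts the subsequent relative-frequency estimates toward the treebank distribution, which is presumably why the x2/x3/x5 models drift away from the unweighted Baseline+Tree scores in the table above.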

Experiment III - Weighting Treebank Data
We use a feature of the MT system which allows us to supply the two phrase tables separately. In this case the decoder selects phrases from either table as deemed appropriate by the model.

Config.         Bleu    Nist    %Meteor
Baseline+Tree   0.3397  7.0891  57.82
Two Tables      0.3365  7.0812  57.50

Exploiting Word Alignments
Given a parallel treebank, we also have a set of word alignments between the sentence pairs, i.e. alignments between pre-terminal nodes. Word alignments are vital to core tasks in SMT. We use treebank-based word alignments in place of statistical word alignments in MT for:
- phrase translation model extraction
- lexical weight scoring

Experiment IV - Treebank-Based Lexical Weights
Lexical weights are calculated bidirectionally for each phrase pair based on the word alignment between the source and target phrases, using the lexical translation probability distribution produced by Giza++. We substitute this with a distribution calculated over the word alignments in the parallel treebank:
- treebank word alignment only (Treebank weights)
- union of SMT and treebank word alignments (Union weights)

Config.           Bleu    Nist    %Meteor
Baseline+Tree     0.3397  7.0891  57.82
Treebank weights  0.3356  7.0355  57.32
Union weights     0.3355  7.0272  57.41
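The lexical weight referred to here is the standard one from phrase-based SMT: for each target word, average the lexical translation probabilities w(t | s) over its aligned source words (falling back to NULL if unaligned), and multiply across target positions. A minimal one-direction sketch; the bidirectional score is obtained by also applying it with source and target swapped and the alignment links inverted. The dictionary `w` stands in for the Giza++ (or treebank-derived) lexical distribution:

```python
def lexical_weight(src, tgt, alignment, w):
    """p_lex(tgt | src, a): per target word, average w(t | s) over its
    aligned source words (NULL if unaligned), multiplied over positions."""
    score = 1.0
    for j, t in enumerate(tgt):
        links = [i for (i, jj) in alignment if jj == j]
        if links:
            score *= sum(w.get((src[i], t), 0.0) for i in links) / len(links)
        else:
            score *= w.get(("NULL", t), 0.0)  # unaligned target word
    return score
```

Swapping in a distribution `w` estimated from the treebank word alignments gives the "Treebank weights" setting; estimating it from the union of both alignments gives "Union weights".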