FACTORED TRANSLATION MODELS. Raj Dabre Raksha Sharma Avishek Dan

Purpose of the talk
- To motivate Factor-Based Machine Translation (FBMT)
- To cover the basic concepts of FBMT
- To highlight the factors that can help in translation
- To illustrate the process of FBMT

Flow of the Presentation
- Motivation
- Introduction to FBMT
- Decomposing the FBMT Process
  - Lemma Translation
  - Morphology Translation
  - Generation
- Statistical Model
  - Training
  - Combining Components
  - Decoding
- Experiments and Analysis
- Conclusion

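Motivation
Consider the example:
- यह मेरी गाड़ी है {yaha meri gaadi hai} {This is my car}
- गाड़ी + Plural: ये मेरी गाड़ियाँ हैं {ye meri gaadiyan hain} {These are my cars}
A model that sees only surface forms treats गाड़ी and गाड़ियाँ as unrelated words, so each form needs its own training examples.
Utilize factors to overcome data sparsity.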

FBMT in Vauquois Triangle

Another motivating example, from Japanese
- わたしのなまえはラジです {Watashi no namae wa Raji desu} {My name is Raj}
- From the raw strings alone, the word mappings are difficult to know.
- Suppose POS tags are given:
  わたしの (PRON) なまえは (NN) ラジ (NNP) です (VM/VCOP)
  My (PRON) name (NN) is (VM/VCOP) Raj (NNP)
- The mappings are now easier to deduce: factors reduce uncertainty.

Introduction to FBMT
Definition: FBMT is an extension of phrase-based statistical machine translation that integrates additional annotation at the word level. Annotations can be linguistic markup or automatically generated word classes.

Factors to Exploit
- Surface form
- Lemma
- Part-of-speech
- Morphological features: gender, number, and case
- Automatic word classes
- Shallow syntactic tags
- Dedicated factors to ensure agreement

Example: गाड़ियाँ, from ये मेरी गाड़ियाँ हैं
- Surface form: गाड़ियाँ
- Lemma: गाड़ी
- Part-of-speech: NN
- Morphological features: Gender: feminine; Number: plural; Case: accusative
- Shallow syntactic tag: NP
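The factor bundle above can be held in a small record type. The sketch below is illustrative only: the class name `FactoredToken`, the morphology string `fem+pl+acc`, and the romanized forms (taken from the transliterations in the Motivation slide) are assumptions. The `|` separator follows the convention Moses uses for factored corpora.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FactoredToken:
    """One word with its factors (hypothetical container for illustration)."""
    surface: str
    lemma: str
    pos: str
    morph: str  # assumed encoding, e.g. "fem+pl+acc"

    def to_moses(self) -> str:
        # Join the factors with "|" in a fixed order.
        return "|".join((self.surface, self.lemma, self.pos, self.morph))

    @classmethod
    def from_moses(cls, text: str) -> "FactoredToken":
        # Inverse of to_moses; expects exactly four "|"-separated factors.
        surface, lemma, pos, morph = text.split("|")
        return cls(surface, lemma, pos, morph)


token = FactoredToken("gaadiyan", "gaadi", "NN", "fem+pl+acc")
encoded = token.to_moses()
print(encoded)
```

Keeping the factors in a fixed order means later components can address them by position, for example aligning on the POS factor only.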

Decomposition of FBMT
To translate "cars" into गाड़ियाँ:
- Translate input lemmas into output lemmas: car to गाड़ी
- Translate morphological and POS factors: Noun to Noun, Plural to Plural, Neuter to Feminine
- Generate surface forms given the lemma and linguistic factors: गाड़ी + Noun + Plural + Feminine = गाड़ियाँ
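The three steps can be sketched as three table lookups chained together. This is a deterministic toy, not the statistical model: the table contents, the function name `translate_word`, and the romanized Hindi forms are assumptions made for illustration, and a real system keeps several scored options at each step.

```python
# Hypothetical toy tables; a trained system would hold many scored entries.
LEMMA_TABLE = {"car": "gaadi"}

FACTOR_TABLE = {
    ("NN", "plural", "neuter"): ("NN", "plural", "feminine"),
    ("NN", "singular", "neuter"): ("NN", "singular", "feminine"),
}

GENERATION_TABLE = {
    ("gaadi", ("NN", "plural", "feminine")): "gaadiyan",
    ("gaadi", ("NN", "singular", "feminine")): "gaadi",
}


def translate_word(lemma, factors):
    """Map a source lemma and its factors to a target surface form."""
    target_lemma = LEMMA_TABLE[lemma]        # step 1: lemma translation
    target_factors = FACTOR_TABLE[factors]   # step 2: factor translation
    # Step 3: generation runs on the target side only.
    return GENERATION_TABLE[(target_lemma, target_factors)]


result = translate_word("car", ("NN", "plural", "neuter"))
print(result)
```

Because generation depends only on target-side information, the plural form can be produced even if the pair "cars" / "gaadiyan" never occurred together in the parallel data.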

Statistical Model - Training
- Automatically annotate the parallel corpus with additional factors (POS, morphology).
- Word alignment using GIZA++
  - The alignment basis can be specified: POS to POS, theta roles to theta roles, etc.
  - Any combination of factors can be used.
- Three types of tables are generated:
  - Lemma translation (source lemma to target lemma)
  - Morphology translation (source morphology to target morphology)
  - Word generation (target lemma + target morphology to target word form)

Annotating the corpus
Use POS taggers, shallow syntactic parsers, UNL and dependency parsers to generate the factors.
Example: These are my cars / ये मेरी गाड़ियाँ हैं

Surface | Lemma | POS | Other factors
These | this | DET | subj
Are | is | VM/VCOP | present
My | me | PRON | possessor
Cars | car | NN | neuter, plural, object

Surface | Lemma | POS | Other factors
ये | यह | DET | subj
मेरी | मेरा | PRON | possessor
गाड़ियाँ | गाड़ी | NN | feminine, plural, object
हैं | होना | VM/VCOP | present

Alignments of Phrases
Hindi: यह मेरी गाड़ी है
English: This Is My Car
Aligned pairs: यह - This; मेरी गाड़ी - My Car; है - Is

Alignments of Factors
Hindi: DET-Subj, PRON-Poss, NN-Fem, VM/VCOP-Pres
English: DET-Subj, VM/VCOP-Pres, PRON-Poss, NN-Neu

Translation Tables

Lemma Translation Table
Sr. | English Phrase | Hindi Phrase
1 | This | यह
2 | My Car | मेरी गाड़ी
3 | Is | है

Factor Translation Table
Sr. | English Factors | Hindi Factors
1 | DET-Subj | DET-Subj
2 | PRON-possessor NN-neuter, plural, object | PRON-possessor NN-feminine, plural, object
3 | VM/VCOP-present | VM/VCOP-present

Generation Table (Target Language)
Sr. | Lemma | Factors | Surface word
1 | यह | DET+subj | ये
2 | मेरा | PRON+possessor | मेरी
3 | गाड़ी | NN+feminine, plural, object | गाड़ियाँ
4 | होना | VM/VCOP+present | हैं
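A generation table of this shape needs only annotated target-language text, so it can be estimated by counting. The sketch below uses relative frequencies; the function name `build_generation_table`, the factor strings, and the toy corpus (including the second plural form `gaadiyon`) are assumptions for illustration, not data from the talk.

```python
from collections import Counter, defaultdict


def build_generation_table(tagged_tokens):
    """Estimate P(surface | lemma, factors) by relative frequency.

    tagged_tokens is an iterable of (surface, lemma, factors) triples.
    """
    counts = defaultdict(Counter)
    for surface, lemma, factors in tagged_tokens:
        counts[(lemma, factors)][surface] += 1

    table = {}
    for key, surface_counts in counts.items():
        total = sum(surface_counts.values())
        table[key] = {s: c / total for s, c in surface_counts.items()}
    return table


# Toy annotated target-side corpus (romanized, hypothetical). Case is left
# out of the factors on purpose, so two plural forms share one key.
corpus = [
    ("gaadiyan", "gaadi", "NN+fem+pl"),
    ("gaadiyan", "gaadi", "NN+fem+pl"),
    ("gaadiyon", "gaadi", "NN+fem+pl"),
    ("gaadi", "gaadi", "NN+fem+sg"),
]
table = build_generation_table(corpus)
print(table[("gaadi", "NN+fem+pl")])
```

Adding case to the factor string would separate the two plural forms, which is one reason richer factors make generation less ambiguous.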

A tougher example
- I am going home: मैं घर जा रहा हूँ {Main ghar jaa raha hoon}
- "home" is neuter, singular and maps to घर, which is masculine, singular. The mapping is non-trivial because the gender differs.
- "am going" maps to जा रहा हूँ. This is non-trivial because a 2-word phrase maps to a 3-word phrase.
- "am going" has factors [(is VM present) (go VAUX gerund, continuous)]
- जा रहा हूँ has factors [(जाना VM) (रहना VAUX continuous) (होना VAUX present)]
- Extracting the factor mappings is therefore also non-trivial.
- The difficulty is greater when short phrases map to long phrases, as happens between morphologically rich and morphologically poor languages.

Alignments of Phrases
Hindi (lemmas): मैं घर जाना रहना होना
English (lemmas): I Is Go Home

Alignments of Factors
English: PRON; VM,PRES; VAUX,CONT; NN,NEUTER
Hindi: PRON; NN,MASCULINE; VM; VAUX,CONT; VAUX,PRES

Components of FBMT

Combining the Components

Decoding
- A beam search decoding algorithm is used.
- Start with an empty hypothesis.
- Generate and add new hypotheses by extending existing ones until the full sentence is covered.
- The highest-scoring complete hypothesis is the best translation.
- Translation options are limited to 50 per phrase to address the combinatorial explosion.
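The decoding loop above can be sketched in a few lines. This is a simplified, monotone variant written for illustration: it has no reordering and no language model, so the output follows source word order. The function name `beam_search`, its parameters, and the toy phrase table with romanized Hindi are assumptions; only the limit of 50 options per phrase comes from the slide.

```python
import math


def beam_search(source, table, beam_size=5, max_options=50, max_phrase_len=3):
    """Monotone beam search over a toy phrase table.

    table maps a tuple of source words to a list of (target, probability).
    Returns (translation, log_score), or None if the sentence cannot be covered.
    """
    # stacks[i] holds hypotheses that cover the first i source words.
    stacks = [[] for _ in range(len(source) + 1)]
    stacks[0].append((0.0, []))  # the empty hypothesis

    for covered in range(len(source)):
        # Keep only the best hypotheses in this stack (the beam).
        stacks[covered].sort(key=lambda hyp: hyp[0], reverse=True)
        for score, output in stacks[covered][:beam_size]:
            for length in range(1, max_phrase_len + 1):
                end = covered + length
                if end > len(source):
                    break
                phrase = tuple(source[covered:end])
                # Cap the number of translation options per phrase.
                options = sorted(table.get(phrase, []),
                                 key=lambda opt: opt[1], reverse=True)
                for target, prob in options[:max_options]:
                    stacks[end].append((score + math.log(prob),
                                        output + [target]))

    complete = stacks[len(source)]
    if not complete:
        return None
    best_score, best_output = max(complete, key=lambda hyp: hyp[0])
    return " ".join(best_output), best_score


# Hypothetical phrase table with made-up probabilities.
toy_table = {
    ("this",): [("yaha", 0.9), ("ye", 0.1)],
    ("is",): [("hai", 1.0)],
    ("my",): [("meri", 0.5)],
    ("car",): [("gaadi", 0.6)],
    ("my", "car"): [("meri gaadi", 0.8)],
}
best = beam_search("this is my car".split(), toy_table)
print(best)
```

In the toy run the two-word phrase "my car" wins over translating "my" and "car" separately because its single probability (0.8) exceeds the product of the two (0.3). A full decoder would also score reordering, which Hindi's verb-final order requires.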

EXPERIMENTS AND RESULTS

Corpus
- English-German
  - Training: Europarl corpus
  - Training: News Commentary corpus
  - Test: WMT 2006 test set
- English-Spanish
  - Training: Europarl corpus
- English-Czech
  - WSJ corpus

Syntactic Enrichment
- Implemented as part of Moses
- Factors, with sequence-model order where given:
  - Surface word (3-gram)
  - POS (7-gram)
  - Morphological
  - Shallow syntactic
- The higher-order sequence models over these factors support syntactic coherence of the output.
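A sequence model over POS tags can be sketched as a plain n-gram model. The code below is a toy, not the setup used in the experiments: add-one smoothing is chosen only for brevity, the order is 3 rather than 7 to suit the tiny training set, and the function names and tag sequences are assumptions.

```python
import math
from collections import Counter


def pad(tags, n):
    """Add start and end markers so every tag has a full context."""
    return ["<s>"] * (n - 1) + list(tags) + ["</s>"]


def train_ngram(sequences, n):
    """Count n-grams and their (n-1)-gram contexts over tag sequences."""
    ngrams, contexts = Counter(), Counter()
    for tags in sequences:
        padded = pad(tags, n)
        for i in range(n - 1, len(padded)):
            context = tuple(padded[i - n + 1:i])
            ngrams[context + (padded[i],)] += 1
            contexts[context] += 1
    return ngrams, contexts


def log_prob(tags, model, n, vocab_size):
    """Log-probability of a tag sequence with add-one smoothing."""
    ngrams, contexts = model
    padded = pad(tags, n)
    total = 0.0
    for i in range(n - 1, len(padded)):
        context = tuple(padded[i - n + 1:i])
        count = ngrams[context + (padded[i],)]
        total += math.log((count + 1) / (contexts[context] + vocab_size))
    return total


# Toy target-side tag sequences (hypothetical training data).
training = [["DET", "PRON", "NN", "VM"], ["PRON", "NN", "VM"]]
model = train_ngram(training, n=3)
vocab_size = 5  # DET, PRON, NN, VM and the end marker

seen_order = log_prob(["DET", "PRON", "NN", "VM"], model, 3, vocab_size)
scrambled = log_prob(["VM", "DET", "NN", "PRON"], model, 3, vocab_size)
print(seen_order > scrambled)
```

Tag vocabularies are small, so long contexts are seen often enough to estimate, which is why a 7-gram model is feasible over POS tags when a 3-gram is used for surface words.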

Syntactic Enrichment Results

Morphological Analysis and Generation
- Translate word lemma and morphology separately.
- A pure lemma/morphology model yields poor results.
- Evidence-based choice of model
- 21% of unknown word forms translated
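One plausible reading of "evidence-based choice of model" is a back-off: use the surface-form translation when the word form was seen in training, and fall back to the lemma and morphology path otherwise. The sketch below illustrates that reading only. The tables, the function name `translate`, and the romanized forms are assumptions, and the talk does not spell out the exact mechanism.

```python
# Hypothetical toy tables. "cars" is deliberately missing from SURFACE_TABLE
# to play the role of an unknown word form.
SURFACE_TABLE = {"car": "gaadi"}
LEMMA_TABLE = {"car": "gaadi"}
GENERATION_TABLE = {
    ("gaadi", "singular"): "gaadi",
    ("gaadi", "plural"): "gaadiyan",
}


def translate(surface, lemma, number):
    """Return (translation, path used), preferring the surface-form path."""
    if surface in SURFACE_TABLE:
        return SURFACE_TABLE[surface], "surface"
    if lemma in LEMMA_TABLE:
        generated = GENERATION_TABLE.get((LEMMA_TABLE[lemma], number))
        if generated is not None:
            return generated, "lemma+morphology"
    # Nothing known about this word: pass it through unchanged.
    return surface, "untranslated"


print(translate("cars", "car", "plural"))
```

Under this reading, unknown word forms are recovered whenever their lemma is known and the target form can be generated, which fits the slide's point that some of them become translatable.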

Conclusion
- Incorporating linguistic tools in the translation model improves translation accuracy.
- Linguistic tools ensure grammatical coherence.
- Separate translation of lemma and morphology leads to better handling of OOV words.
- Complex factor models lead to a larger search space and increased computation time.