The CMU Arabic-to-English Statistical MT System


Alicia Tribble, Stephan Vogel
Language Technologies Institute, Carnegie Mellon University

The Data
- For the translation model:
  - UN corpus: 80 million words
  - Ummah
  - Some smaller news corpora
- For the LM:
  - English side of the bilingual corpus: the language model should have seen the words generated by the translation model
  - Additional data from Xinhua news
- General preprocessing and cleaning:
  - Separate punctuation marks
  - Remove sentence pairs with a large length mismatch
  - Remove sentences that have too many non-words (numbers, special characters)
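
The three cleaning steps listed above can be sketched roughly as follows. This is a minimal illustration, not the authors' actual pipeline: the function names, the punctuation set, and the thresholds `max_ratio` and `max_non_word` are all assumptions chosen for the example.

```python
import re


def separate_punctuation(sentence: str) -> str:
    """Put spaces around punctuation marks so each becomes its own token."""
    spaced = re.sub(r'([.,;:!?()"])', r" \1 ", sentence)
    return " ".join(spaced.split())


def non_word_fraction(tokens: list[str]) -> float:
    """Fraction of tokens that contain no alphabetic character."""
    if not tokens:
        return 1.0
    non_words = sum(1 for t in tokens if not any(c.isalpha() for c in t))
    return non_words / len(tokens)


def keep_pair(src: str, tgt: str,
              max_ratio: float = 3.0,       # hypothetical length-mismatch limit
              max_non_word: float = 0.5) -> bool:  # hypothetical non-word limit
    """Return True if the sentence pair passes both cleaning filters."""
    s, t = src.split(), tgt.split()
    if not s or not t:
        return False
    ratio = max(len(s), len(t)) / min(len(s), len(t))
    if ratio > max_ratio:
        return False
    return (non_word_fraction(s) <= max_non_word
            and non_word_fraction(t) <= max_non_word)
```

Note that this naive punctuation rule also splits numbers such as "1,000"; a real pipeline would need to treat those separately.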

The System
- Alignment models: IBM1 and HMM, trained in both directions
- Phrase extraction
  - From the Viterbi path of the HMM alignment
  - Integrated Segmentation and Alignment
- Decoder
  - Essentially left to right over the source sentence
  - Build a translation lattice with partial translations
  - Find the best path, allowing for local reordering
  - Sentence length model
  - Pruning: remove low-scoring hypotheses
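
To make the "build lattice, find best path" idea concrete, here is a minimal sketch of the monotone case only. It assumes a hypothetical phrase table that maps source-word tuples to (translation, log-probability) pairs, and it leaves out the language model, local reordering, the sentence length model, and pruning, so it is far simpler than the decoder described above.

```python
import math


def monotone_decode(source, phrase_table, max_phrase_len=3):
    """Best-scoring left-to-right segmentation of `source` into known phrases.

    phrase_table: dict mapping a tuple of source words to a list of
    (translation, log_prob) pairs. This format is an assumption.
    Returns the translation string, or None if no full path exists.
    """
    n = len(source)
    # best[i] = (score, previous position, translation of last phrase)
    best = [(-math.inf, None, None)] * (n + 1)
    best[0] = (0.0, None, "")
    for end in range(1, n + 1):
        for start in range(max(0, end - max_phrase_len), end):
            if best[start][0] == -math.inf:
                continue  # position `start` is not reachable
            phrase = tuple(source[start:end])
            for translation, log_prob in phrase_table.get(phrase, []):
                score = best[start][0] + log_prob
                if score > best[end][0]:
                    best[end] = (score, start, translation)
    if best[n][0] == -math.inf:
        return None
    # Follow back pointers from the end of the sentence.
    pieces, pos = [], n
    while pos > 0:
        _, prev, translation = best[pos]
        pieces.append(translation)
        pos = prev
    return " ".join(reversed(pieces))
```

Each position in the source sentence acts as a lattice node, and each phrase-table match is an edge between two nodes; the dynamic program keeps only the best-scoring way to reach each node.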

Some Results
- Two test sets: DevTest (203 sentences) and May 2003
- Baseline: monotone decoding
- RO: word reordering
- SL: sentence length model

              DevTest NIST   DevTest Bleu4   May 2003 NIST
Baseline      8.59           0.385           8.95
RO            9.02           0.441           9.26
RO + SL       9.24           0.455           ?

Questions
- What's specific to Arabic
  - Encoding
  - Named Entities
  - Syntax and Morphology
- What's needed to get further improvements

What's Specific to Arabic
- Right to left: not really an issue, as this is only display
  - Text in the file is left to right
  - Problem in the UN corpus: numbers (Latin characters) are sometimes in the wrong direction, e.g. 1997 -> 7991
- Data not in vocalized form
  - Vocalization not really studied
  - Ambiguity can be handled by statistical systems

Encoding and Vocalization
- Encoding
  - Different encodings: Unicode, UTF-8, CP-1256, romanized forms; not too bad, definitely not as bad as Hindi ;-)
  - Needed to convert, e.g. training and testing data in different encodings
  - Not all conversions are lossless
- Used romanized form for processing
  - Converted all data using Darwish transliteration
  - Several characters (ya, alef, hamza) are collapsed into two classes
  - Conversion not completely reversible
- Effect of normalization
  - Reduction in vocabulary: ~5%
  - Reduction of singletons: >10%
  - Reduction of 3-gram perplexity: ~5%
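
The kind of character collapsing described above, and its effect on vocabulary size and singletons, can be illustrated with a small sketch. The mapping below is an assumption made for illustration (alef variants to bare alef, alef maqsura to ya) and works directly on Unicode text; it is not the actual Darwish transliteration table, which the slides do not spell out.

```python
from collections import Counter

# Assumed character classes, for illustration only.
ALEF_VARIANTS = "\u0622\u0623\u0625"  # alef with madda / hamza above / hamza below
NORMALIZATION_TABLE = str.maketrans({
    **{ch: "\u0627" for ch in ALEF_VARIANTS},  # -> bare alef
    "\u0649": "\u064A",                        # alef maqsura -> ya
})


def normalize(text: str) -> str:
    """Collapse the assumed character variants into one representative each."""
    return text.translate(NORMALIZATION_TABLE)


def vocabulary_stats(sentences: list[str]) -> tuple[int, int]:
    """Return (number of word types, number of singleton types)."""
    counts = Counter(tok for s in sentences for tok in s.split())
    singletons = sum(1 for c in counts.values() if c == 1)
    return len(counts), singletons
```

Comparing `vocabulary_stats(corpus)` with `vocabulary_stats([normalize(s) for s in corpus])` gives the vocabulary and singleton reductions for a given corpus. Because several input characters map to the same output character, the mapping cannot be inverted, which matches the "not completely reversible" remark.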

Named Entities
- NEs resulted in a small but significant improvement in translation quality in the Chinese-English system
- In Chinese: unknown words are split into single characters, which are then translated as individual words
- In Arabic: no segmentation issues -> damage less severe
- NEs not used so far for Arabic, but we have started to work on it

Language-Specific Issues for Arabic MT
- Syntactic issues: error analysis revealed two common syntactic errors
  - Adjective-Noun reordering
  - Subject-Verb reordering
- Morphology issues: problems specific to AR morphology
  - Based on Darwish transliteration
  - Based on Buckwalter transliteration
  - Poor Man's morphology

Syntax Issues: Adjective-Noun reordering
- Adjectives and nouns are frequently reordered between Arabic and English
- Example:
  - EN: big green chair
  - AR: chair green big
- Experiment: identify noun-adjective sequences in AR and reorder them in a preprocessing step
- Problem: often long sequences, e.g. N N Adj Adj N Adj N N
- Result: no improvement
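
A preprocessing step of this kind might look like the sketch below. It assumes the Arabic side is already POS-tagged with the hypothetical tags "N" and "ADJ", and it only handles the simplest pattern: one noun followed directly by a run of adjectives. How the actual experiment treated the long mixed sequences mentioned above is not stated, so that part is deliberately left out.

```python
def reorder_noun_adjective(tagged: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Rewrite each 'N ADJ ADJ ...' run as '... ADJ ADJ N' (adjectives reversed).

    `tagged` is a list of (word, tag) pairs; the tag names are assumptions.
    """
    out = []
    i = 0
    while i < len(tagged):
        word, tag = tagged[i]
        if tag == "N":
            # Collect the adjectives that directly follow this noun.
            j = i + 1
            while j < len(tagged) and tagged[j][1] == "ADJ":
                j += 1
            out.extend(reversed(tagged[i + 1:j]))
            out.append((word, tag))
            i = j
        else:
            out.append((word, tag))
            i += 1
    return out
```

Applied to the gloss "chair green big" tagged as N ADJ ADJ, this yields the English order "big green chair". For a sequence such as N N Adj, only the second noun is moved, which shows how quickly a purely local rule becomes ambiguous.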

Syntax Issues: Subject-Verb reordering
- AR: main verb at the beginning of the sentence, followed by its subject
- EN: word order prefers the subject to precede the verb
- Example:
  - EN: the President visited Egypt
  - AR: Visited Egypt the President
- Experiment: identify verbs at the beginning of the AR sentence and move them to a position following the first noun
  - No full parsing
  - Done as preprocessing on the Arabic side
- Result: no effect
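
The verb-movement rule can be sketched in a few lines. The sketch assumes a POS-tagged input with hypothetical tags "V" and "N" and, like the experiment, uses no parsing: it simply moves a sentence-initial verb to the position after the first noun.

```python
def move_initial_verb(tagged: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """If the sentence starts with a verb, move it to just after the first noun.

    `tagged` is a list of (word, tag) pairs; the tag names are assumptions.
    Sentences without an initial verb or without any noun are returned unchanged.
    """
    if not tagged or tagged[0][1] != "V":
        return list(tagged)
    for idx in range(1, len(tagged)):
        if tagged[idx][1] == "N":
            return tagged[1:idx + 1] + [tagged[0]] + tagged[idx + 1:]
    return list(tagged)
```

For a hypothetical verb-subject-object gloss such as `[("visited", "V"), ("the-President", "N"), ("Egypt", "N")]`, the rule produces subject-verb-object order. Since it relies on the first noun being the subject, it misfires whenever another noun comes first.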

Morphology Issues
- Structural mismatch between English and Arabic
  - Arabic has richer morphology
  - Types Ar-En: ~2.2 : 1
  - Tokens Ar-En: ~0.9 : 1
- Tried two different tools for morphological analysis:
  - Buckwalter analyzer
    http://www.xrce.xerox.com/competencies/content-analysis/arabic/info/buckwalter-about.html
    - 1-1 transliteration scheme for Arabic characters
  - Darwish analyzer
    www.cs.umd.edu/library/trs/cs-tr-4326/cs-tr-4326.pdf
    - Several characters (ya, alef, hamza) are collapsed into two classes with one character representative each
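
The type and token ratios quoted above are straightforward to compute for any whitespace-tokenized parallel corpus. The helper names below are made up for this sketch, and both sides are assumed to be non-empty.

```python
def type_token_counts(sentences: list[str]) -> tuple[int, int]:
    """Return (number of distinct word types, number of tokens)."""
    tokens = [tok for s in sentences for tok in s.split()]
    return len(set(tokens)), len(tokens)


def corpus_ratios(src_sentences: list[str],
                  tgt_sentences: list[str]) -> tuple[float, float]:
    """Return (type ratio, token ratio) of source over target."""
    src_types, src_tokens = type_token_counts(src_sentences)
    tgt_types, tgt_tokens = type_token_counts(tgt_sentences)
    return src_types / tgt_types, src_tokens / tgt_tokens
```

A type ratio well above 1 together with a token ratio near or below 1 is the signature of the mismatch described here: the Arabic side packs more information into fewer, more varied word forms.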

Morphology with Darwish Transliteration
- Addressed the compositional part of AR morphology, since this contributes to the structural mismatch between AR and EN
- Goal was to get better word-level alignment
- Toolkit comes with a stemmer
  - Created a modified version for separating instead of removing affixes
- Experiment 1: trained on stemmed data
  - Arabic types reduced by ~60%, nearly matching the number of English types
  - But losing discriminative power
- Experiment 2: trained on affix-separated data
  - Number of tokens increased
  - Mismatch in tokens much larger
- Result: doing morphology monolingually can even increase structural mismatch

Morphology with Buckwalter Transliteration
- Focused on DET and CONJ prefixes:
  - AR: "the" and "and" are frequently attached to nouns and adjectives
  - EN: always separate
- Different splitting strategies:
  - Loosest: use all prefixes and split even if the remaining word is not a stem
  - More conservative: use only prefixes classified as DET or CONJ
  - Most conservative: full analysis, split only if the word can be analyzed as a DET or CONJ prefix plus a legitimate stem
- Experiments: train on each kind of split data
- Result: all set-ups gave lower scores
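
The three splitting strategies can be contrasted in a small sketch. It does not call the Buckwalter analyzer: the prefix lists, the strategy names, and the `stems` set that stands in for "legitimate stem" are all assumptions, and at most one prefix is split per word. Words are in Buckwalter transliteration, where "Al" is the determiner and "w" the conjunction.

```python
# Assumed prefix inventories, for illustration only.
DET_CONJ_PREFIXES = ("w", "Al")
ALL_PREFIXES = ("w", "f", "b", "l", "k", "Al")


def split_prefix(word: str, strategy: str, stems: set[str]) -> list[str]:
    """Split at most one prefix off `word` according to `strategy`.

    strategy: "loosest", "more_conservative", or "most_conservative".
    stems: known stems, consulted only by the most conservative strategy.
    """
    prefixes = ALL_PREFIXES if strategy == "loosest" else DET_CONJ_PREFIXES
    # Try longer prefixes first so "Al" is preferred over a one-letter match.
    for prefix in sorted(prefixes, key=len, reverse=True):
        if word.startswith(prefix) and len(word) > len(prefix):
            rest = word[len(prefix):]
            if strategy == "most_conservative" and rest not in stems:
                continue  # remainder is not a known stem: do not split
            return [prefix + "+", rest]
    return [word]
```

With `stems = {"ktAb"}`, the word "AlktAb" is split into "Al+" and "ktAb" under every strategy, while "bktAb" is split only under the loosest one.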

Poor Man's Morphology
- List of prefixes and suffixes compiled by a native speaker
- Only for unknown words
  - Remove more and more prefixes and suffixes
  - Stop when the stripped word is in the trained lexicon
- Typically: 1/2 to 2/3 of the unknown words can be mapped to known words
- Translation not always correct, therefore overall improvement limited
- Result: this has so far been (for us) the only morphological processing which gave a small improvement
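
The stripping procedure above can be sketched as a breadth-first search over progressively shorter forms. The order in which affixes were actually tried is not given, so trying all single-affix removals before any double removal is an assumption, as are the function name and the example affix lists.

```python
def strip_to_known(word: str, lexicon: set[str],
                   prefixes: list[str], suffixes: list[str]):
    """Strip affixes from an unknown word until a lexicon entry is reached.

    Returns the first known form found, preferring forms with fewer affixes
    removed, or None if no stripped form is in the lexicon.
    """
    if word in lexicon:
        return word
    frontier, seen = [word], {word}
    while frontier:
        next_frontier = []
        for cand in frontier:
            stripped = [cand[len(p):] for p in prefixes if cand.startswith(p)]
            stripped += [cand[:-len(s)] for s in suffixes if cand.endswith(s)]
            for form in stripped:
                if not form or form in seen:
                    continue
                if form in lexicon:
                    return form  # stop as soon as a known word is reached
                seen.add(form)
                next_frontier.append(form)
        frontier = next_frontier
    return None
```

For example, with lexicon `{"ktAb"}`, prefixes `["w", "Al"]`, and suffixes `["h", "At"]`, the unknown form "wAlktAbh" is mapped to "ktAb" after three removals. The mapped word is then translated in place of the unknown one, which is why a wrong mapping yields a wrong translation.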

Experience with Morphology and Syntax
- Initial experiments with full morphological analysis did not give an improvement
- Most words are seen in a large corpus
  - Unknown words: < 5% of tokens, < 10% of types
  - Simple prefix splitting reduced this to half
- Phrase translation captures some of the agreement information
- Local word reordering in the decoder reduces word order problems
- We still believe that morphology could give an additional improvement

Requirements for Improvements
- Data
  - More specific data: we have a large corpus (UN) but only small news corpora
  - A manual dictionary could help; it helps for Chinese
- Better use of existing resources
  - Lexicon not trained on all data
  - Treebanks not used
- Continuous improvement of models and decoder
  - Recent improvements in the decoder (word reordering, overlapping phrases, sentence length model) helped for Arabic
  - Expect improvement from named entities
  - Integrate morphology and alignment