Valency-Aware Machine Translation Project Proposal

Ondřej Bojar, obo@cuni.cz
August 17, 2006

Overview
JHU Workshop motivation and one of the results.
State-of-the-art MT errors.
Project goal.
Motivation: Why Czech.
Proposed strategy and information sources.
Summary.
Appendices: References, illustrations and further details on Czech and English.

Workshop Motivation
Statistical machine translation (SMT) into morphologically rich languages is more difficult than translation from them; see e.g. Koehn (2005).
One of the workshop goals: examine the utility of factored translation models for translating into morphologically rich languages.
There was room for improvement (English→Czech):
  Regular BLEU: 25%
  BLEU of lemmatized MT output against lemmatized references: 32%
Errors in morphology cause a large BLEU loss.

One of the Workshop Results
Significant improvements gained on small data sets:
  English→Czech, 20k sentences: BLEU 25.82% to 27.62%, or up to 28.12% with additional out-of-domain parallel data.
Still far below the margin of lemmatized BLEU (35%).
However, local agreement is already very good:
  Microstudy, adjective-noun agreement: 74% correct, 2% mismatch, other: missing noun etc.
So where are the morphological errors?

Current English→Czech MT Errors
Microstudy of the current best MT output (BLEU 28.12%), using an intuitive metric: 15 sentences, 77 verb-modifier pairs in the source text examined.

Translation of...   preserves meaning   is disrupted   is missing
Verb                43%                 14%            21%
Modifier            79%                 12%            6%

But: when both verb and modifier are correct, 44% of cases are non-grammatical or meaning-disturbing relations.

Sample Errors

Input:      Keep on investing.
MT output:  Pokračovalo investování. (grammar correct here!)
Gloss:      Continued investing. (Meaning: The investing continued.)
Correct:    Pokračujte v investování.
The language model misled us, so source valency information needs to be included.

Input:             brokerage firms rushed out ads...
MT output:         brokerské firmy vyběhl reklamy
Gloss:             brokerage firms (pl.fem) ran (sg.masc) ads (pl.nom, pl.acc, pl.voc, sg.gen)
Correct option 1:  brokerské firmy vyběhly s reklamami (pl.instr)
Correct option 2:  brokerské firmy vydaly reklamy (pl.acc)
Target-side data may be rich enough to learn: vyběhnout s + instr.
Not rich enough to learn all morphological and lexical variants: vyběhl s reklamou, vyběhla s reklamami, vyběhl s prohlášením, vyběhli s oznámením, ...
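The generalization in the second sample can be illustrated with a small sketch. It is not the project's method, only an illustration under stated assumptions: the input is a hypothetical list of (form, lemma, tag) triples, the tags are hand-written in the PDT positional style (first position part of speech, fifth position case), and the helper name `frame_patterns` is invented here. The four surface variants listed above collapse into a single verb + preposition + case pattern.

```python
from collections import Counter

# Hypothetical, hand-tagged (form, lemma, tag) triples for the four variants
# listed on the slide. Only tag position 1 (part of speech) and position 5
# (case) are used; the remaining positions are illustrative assumptions.
SAMPLES = [
    [("vyběhl", "vyběhnout", "VpYS---XR-AA---"),
     ("s", "s", "RR--7----------"),
     ("reklamou", "reklama", "NNFS7-----A----")],
    [("vyběhla", "vyběhnout", "VpQW---XR-AA---"),
     ("s", "s", "RR--7----------"),
     ("reklamami", "reklama", "NNFP7-----A----")],
    [("vyběhl", "vyběhnout", "VpYS---XR-AA---"),
     ("s", "s", "RR--7----------"),
     ("prohlášením", "prohlášení", "NNNS7-----A----")],
    [("vyběhli", "vyběhnout", "VpMP---XR-AA---"),
     ("s", "s", "RR--7----------"),
     ("oznámením", "oznámení", "NNNS7-----A----")],
]


def frame_patterns(sentence):
    """Yield (verb lemma, preposition lemma, case) for every verb that is
    directly followed by a preposition and a noun agreeing in case."""
    triples = zip(sentence, sentence[1:], sentence[2:])
    for (_, v_lemma, v_tag), (_, p_lemma, p_tag), (_, _, n_tag) in triples:
        is_pattern = v_tag[0] == "V" and p_tag[0] == "R" and n_tag[0] == "N"
        if is_pattern and p_tag[4] == n_tag[4]:
            yield (v_lemma, p_lemma, n_tag[4])


# Surface strings are all distinct, so each is seen only once.
surface = Counter(" ".join(form for form, _, _ in s) for s in SAMPLES)
# The lemma + case abstraction pools the evidence into one pattern.
frames = Counter(p for s in SAMPLES for p in frame_patterns(s))
```

On this toy input `surface` holds four different strings with one occurrence each, while `frames` holds the single pattern `("vyběhnout", "s", "7")` with count 4, which is the kind of evidence pooling the slide argues for.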

Project Goal
Improve MT output quality using valency information.

Motivation: Why Czech
Relevant properties: very rich morphological system and relatively free word order.
Well-established theory of syntax, and of valency in particular: Sgall, Hajičová, and Panevová (1986), Panevová (1994).
Data available:
  monolingual and parallel corpora
  manual surface and deep treebanks (parallel forthcoming!)
  manual valency lexicons

Language  Corpus                                           Annotation up to                         Tokens
Cs        PDT 2.0 (Hajič, 2005)                            manual surface and deep syntax           1.5M surf.
Cs        CNC (Kocek, Kopřivová, and Kučera, 2000)         automatic lemmatization and morphology   114M
Cs        Web corpus                                       automatic surface syntax                 100M
Cs-En     PCEDT 1.0 (Čmejrek, Cuřín, and Havelka, 2003)    automatic surface and deep syntax        500k
Cs-En     CzEng 0.5                                        automatic surface syntax                 15M

Proposed Strategy
Preliminary experiments at the workshop:
  Factored models touching valency explored during the workshop perform badly: no gain or a slight loss.
Future:
  Evaluate the causes. Was it just sparse data?
  Check subcategorization using partially lexicalized language models. A morphological LM with verbs lexicalized should capture subcategorization.
  Experiment with syntax-based language models (Chelba and Jelinek, 1998; Charniak, 2001).
  Map explicit subcategorization information from source to target. Translate lemma+subcat to lemma+subcat and POS to POS, then generate the surface form from this.
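One possible reading of a "morphological LM with verbs lexicalized" is sketched below. This is an assumption-laden illustration, not the workshop setup: the symbol inventory (verb lemma for verbs, full morphological tag for everything else), the function names, and the choice of plain bigram counts are all hypothetical. The example sentence and its tags are taken from the morphological analysis of "Zákony udělejte pro lidi" in the appendix, with one reading picked per word.

```python
from collections import Counter

# (form, lemma, tag) triples; one tag reading per word chosen by hand.
SENTENCE = [
    ("Zákony", "zákon", "NNIP4-----A----"),
    ("udělejte", "udělat", "Vi-P---2--A----"),
    ("pro", "pro-1", "RR--4----------"),
    ("lidi", "člověk", "NNMP4-----A----"),
]


def partially_lexicalize(tokens):
    """Map tokens to LM symbols: verbs (tag starts with 'V') are replaced by
    their lemma, all other words by their morphological tag."""
    return [lemma if tag[0] == "V" else tag for _, lemma, tag in tokens]


def bigram_counts(sentences):
    """Count bigrams over partially lexicalized symbol sequences."""
    counts = Counter()
    for tokens in sentences:
        symbols = ["<s>"] + partially_lexicalize(tokens) + ["</s>"]
        counts.update(zip(symbols, symbols[1:]))
    return counts


counts = bigram_counts([SENTENCE])
```

With this encoding a bigram such as `("udělat", "RR--4----------")` records that the verb was followed by a preposition governing the accusative, independent of which noun filled the slot. Whether such counts are dense enough to help is exactly the sparse-data question raised above.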

Project Will Use These Sources of Information
Available valency/subcategorization dictionaries: VALLEX for Czech (PropBank for English).
Automatically collected subcategorization data (Korhonen, 2002, and previous work; my dissertation, in preparation).
Word-sense-like algorithms to label verb occurrences with frames (Bojar, Semecký, and Benešová, 2005, and all WSD community results).
Compare with simple approaches:
  More monolingual data for plain n-gram language models may help enough.
  Are valency-based generalizations useful in general, on small data, or on out-of-domain data?

Summary
Factored models help fix morphology; local dependencies are already correct.
Significant margin for improving verb-modifier agreement.
The English→Czech pair is a good fit for the experiments.
Improved valency models should improve translation quality: valency theory, data and methods are available.

References
Bojar, Ondřej. 2003. Towards Automatic Extraction of Verb Frames. Prague Bulletin of Mathematical Linguistics, 79-80:101-120.
Bojar, Ondřej, Jiří Semecký, and Václava Benešová. 2005. VALEVAL: Testing VALLEX Consistency and Experimenting with Word-Frame Disambiguation. Prague Bulletin of Mathematical Linguistics, 83:5-17.
Charniak, Eugene. 2001. Immediate-head parsing for language models. In Meeting of the Association for Computational Linguistics, pages 116-123.
Chelba, Ciprian and Frederick Jelinek. 1998. Exploiting syntactic structure for language modeling. In Christian Boitet and Pete Whitelock, editors, Proceedings of the Thirty-Sixth Annual Meeting of the Association for Computational Linguistics and Seventeenth International Conference on Computational Linguistics, pages 225-231, San Francisco, California. Morgan Kaufmann Publishers.
Čmejrek, Martin, Jan Cuřín, and Jiří Havelka. 2003. Czech-English Dependency-based Machine Translation. In EACL 2003 Proceedings of the Conference, pages 83-90. Association for Computational Linguistics, April.
Collins, Michael. 1996. A New Statistical Parser Based on Bigram Lexical Dependencies. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, pages 184-191.
Collins, Michael, Jan Hajič, Eric Brill, Lance Ramshaw, and Christoph Tillmann. 1999. A Statistical Parser of Czech. In Proceedings of 37th ACL Conference, pages 505-512, University of Maryland, College Park, USA.
Hajič, Jan. 2005. Complex Corpus Annotation: The Prague Dependency Treebank. In Mária Šimková, editor, Insight into Slovak and Czech Corpus Linguistics, pages 54-73, Bratislava, Slovakia. Veda, vydavateľstvo SAV.
Holan, Tomáš. 2003. K syntaktické analýze českých(!) vět. In MIS 2003. MATFYZPRESS, January 18-25, 2003.
Kocek, Jan, Marie Kopřivová, and Karel Kučera, editors. 2000. Český národní korpus - úvod a příručka uživatele. FF UK - ÚČNK, Praha.
Koehn, Philipp. 2005. Europarl: A Parallel Corpus for Statistical Machine Translation. In Proceedings of MT Summit X, September.
Korhonen, Anna. 2002. Subcategorization Acquisition. Technical Report UCAM-CL-TR-530, University of Cambridge, Computer Laboratory, Cambridge, UK, February.
Kruijff, Geert-Jan M. 2003. 3-Phase Grammar Learning. In Proceedings of the Workshop on Ideas and Strategies for Multilingual Grammar Development.
Panevová, Jarmila. 1994. Valency Frames and the Meaning of the Sentence. In Ph. L. Luelsdorff, editor, The Prague School of Structural and Functional Linguistics, pages 223-243, Amsterdam-Philadelphia. John Benjamins.
Sgall, Petr, Eva Hajičová, and Jarmila Panevová. 1986. The Meaning of the Sentence and Its Semantic and Pragmatic Aspects. Academia/Reidel Publishing Company, Prague, Czech Republic/Dordrecht, Netherlands.

Analysis of Czech
[Figure: analytic and tectogrammatical dependency trees of sentence #36, "Zákony udělejte pro lidi" (Make laws for people).]
Analytic (surface syntactic) tree: words Zákony (Laws), udělejte (make), pro (for), lidi (people); edge labels PRED, OBJ, AUXP, ADV.
Tectogrammatical (deep syntactic) tree: nodes zákon_Pl (law_Pl), udělat_imp (make_imp), you, člověk_Pl,pro (person_Pl,for); edge labels PRED, PAT, ACT, BEN.

Morphological:
Form       Lemma    Morphological tag
zákony     zákon    NNIP1-----A----
zákony     zákon    NNIP4-----A----
zákony     zákon    NNIP5-----A----
zákony     zákon    NNIP7-----A----
udělejte   udělat   Vi-P---2--A----
udělejte   udělat   Vi-P---3--A---4
pro        pro-1    RR--4----------
lidi       člověk   NNMP1-----A----
lidi       člověk   NNMP4-----A----
lidi       člověk   NNMP5-----A----

Properties of Czech Language

                          Czech                              English
Rich morphology           4,000 tags possible, 2,300 seen    50 used
Word order                free                               rigid

Rigid global word order phenomena: clitics.
Rigid local word order phenomena: coordination, mutual order of clitics.

Nonprojective sentences   16,920   23.3%
Nonprojective edges       23,691   1.9%

Known parsing results     Czech         English
Edge accuracy             69.2-82.5%    91%
Sentence correctness      15.0-30.9%    43%

Data by (Collins et al., 1999), (Holan, 2003), Zeman (http://ckl.mff.cuni.cz/ zeman/ /projekty/neproj/index.html) and (Bojar, 2003). Consult (Kruijff, 2003) for measuring word order freeness.

Detailed Numbers on Czech

Edge length      1      2      5
English [%]      74.2   86.3   95.6
Czech [%]        51.8   72.1   90.2
Data for English by (Collins, 1996). Data for Czech by (Holan, 2003).

Number of gaps   0      1      2
Sentences [%]    76.9   22.7   0.42
Data by (Holan, 2003).

Climbing steps   1      2      3      4      5
Nodes [%]        90.3   8.0    1.3    0.3    0.1
Data by (Holan, 2003).
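Statistics such as the share of nonprojective edges and the edge length distribution can be computed directly from head indices. The sketch below is only an illustration under assumptions: the tree encoding (a list where `heads[i]` is the head of word `i`, with 0 as an artificial root) and the function names are hypothetical, and the attachment chosen for the example sentence "To by se mělo změnit." is an assumed reading (se attached to změnit, all other words to mělo or the root), not a treebank annotation.

```python
def dominates(heads, ancestor, node):
    """True if `ancestor` lies on the path from `node` up to the root (0).
    Assumes a well-formed tree without cycles."""
    while node != 0:
        node = heads[node]
        if node == ancestor:
            return True
    return False


def nonprojective_edges(heads):
    """Return (head, dependent) pairs where some word strictly between the
    two ends is not dominated by the head."""
    edges = []
    for dep in range(1, len(heads)):
        head = heads[dep]
        lo, hi = sorted((head, dep))
        if any(not dominates(heads, head, k) for k in range(lo + 1, hi)):
            edges.append((head, dep))
    return edges


def edge_lengths(heads):
    """Linear distance between each word and its head, skipping edges that
    attach to the artificial root."""
    return [abs(heads[dep] - dep)
            for dep in range(1, len(heads)) if heads[dep] != 0]


# Word positions: 1 To, 2 by, 3 se, 4 mělo, 5 změnit, 6 "."
# Index 0 is a placeholder for the artificial root (assumed attachment).
HEADS = [None, 4, 4, 5, 0, 4, 0]

crossing = nonprojective_edges(HEADS)
lengths = edge_lengths(HEADS)
```

Under the assumed attachment, `crossing` is `[(5, 3)]`: the edge from změnit to the clitic se spans mělo, which změnit does not dominate. Aggregating `lengths` over a treebank would give a distribution comparable in kind to the edge length table above.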

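Analytic vs. Tectogrammatical (2)
[Figure: analytic and tectogrammatical dependency trees of sentence #45, "To by se mělo změnit."]
Analytic tree: words To (It), by (conjunct particle), se (reflexive particle), mělo (should), změnit (change), . (full stop); edge labels PRED, AUXK, SB, AUXV, OBJ, AUXR.
Tectogrammatical tree: nodes to (it), mít (should), změnit_conj (change_conj), Generic Actor; edge labels PRED, PAT, PRED, ACT.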