Valency-Aware Machine Translation Project Proposal

Ondřej Bojar, obo@cuni.cz, August 17, 2006

Overview
- JHU Workshop motivation and one of the results.
- State-of-the-art MT errors.
- Project goal.
- Motivation: Why Czech.
- Proposed strategy and information sources.
- Summary.
- Appendices: References, illustrations and further details on Czech and English.

Workshop Motivation
- Statistical machine translation (SMT) into morphologically rich languages is more difficult than translation from them; see e.g. Koehn (2005).
- One of the workshop goals: examine the utility of factored translation models for translating into morphologically rich languages.
- There was room for improvement (English-to-Czech):
    Regular BLEU: 25%
    BLEU of lemmatized MT against lemmatized references: 32%
- Errors in morphology cause a large BLEU loss.
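The gap between regular and lemmatized scores can be illustrated with a simplified clipped n-gram precision. This is only a sketch: it is not full BLEU (single n-gram order, no brevity penalty), and the tiny lemma table is a hypothetical stand-in for a real lemmatizer. The sentence pair is the MT output and one correct option from the sample errors in this proposal.

```python
from collections import Counter

def ngram_precision(hyp, ref, n=1):
    """Clipped n-gram precision of a hypothesis against one reference."""
    hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    total = sum(hyp_ngrams.values())
    if total == 0:
        return 0.0
    # Each hypothesis n-gram is credited at most as often as it occurs in the reference.
    matched = sum(min(count, ref_ngrams[gram]) for gram, count in hyp_ngrams.items())
    return matched / total

# Hypothetical toy lemma table (assumption, for illustration only).
LEMMAS = {
    "brokerské": "brokerský", "firmy": "firma",
    "vyběhl": "vyběhnout", "vyběhly": "vyběhnout",
    "reklamy": "reklama", "reklamami": "reklama",
}

def lemmatize(tokens):
    """Replace each token by its lemma; unknown tokens are kept unchanged."""
    return [LEMMAS.get(token, token) for token in tokens]

hyp = "brokerské firmy vyběhl reklamy".split()
ref = "brokerské firmy vyběhly s reklamami".split()

surface_precision = ngram_precision(hyp, ref)
lemma_precision = ngram_precision(lemmatize(hyp), lemmatize(ref))
print(surface_precision, lemma_precision)
```

On this pair the lemmatized score is higher than the surface score because the only mismatches are in word endings, which is the kind of loss the lemmatized BLEU figure is meant to expose.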

One of the Workshop Results
- Significant improvements gained on small data sets. English-to-Czech, 20k sentences: BLEU rose from 25.82% to 27.62%, or up to 28.12% with additional out-of-domain parallel data.
- Still far below the margin of lemmatized BLEU (35%).
- However, local agreement is already very good. Microstudy of adjective-noun agreement: 74% correct, 2% mismatch, other: missing noun etc.
- So where are the morphological errors?

Current English-to-Czech MT Errors
Microstudy of the current best MT output (BLEU 28.12%) using an intuitive metric: 15 sentences, 77 verb-modifier pairs in the source text examined.
    Translation of...   ...preserves meaning   ...is disrupted   ...is missing
    Verb                43%                    14%               21%
    Modifier            79%                    12%               6%
But: when both the verb and the modifier are correct, 44% of cases are non-grammatical or meaning-disturbing relations.

Sample Errors
Example 1:
    Input:      Keep on investing.
    MT output:  Pokračovalo investování. (grammar correct here!)
    Gloss:      Continued investing. (Meaning: The investing continued.)
    Correct:    Pokračujte v investování.
The language model misled us: we need to include source valency information.
Example 2:
    Input:             brokerage firms rushed out ads...
    MT output:         brokerské firmy vyběhl reklamy
    Gloss:             brokerage firms (pl.fem) ran (sg.masc) ads (pl.nom, pl.acc, pl.voc, sg.gen)
    Correct option 1:  brokerské firmy vyběhly s reklamami (pl.instr)
    Correct option 2:  brokerské firmy vydaly reklamy (pl.acc)
- Target-side data may be rich enough to learn: vyběhnout s instr
- Not rich enough to learn all morphological and lexical variants: vyběhl s reklamou, vyběhla s reklamami, vyběhl s prohlášením, vyběhli s oznámením,...
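A minimal sketch of how a learned pattern such as "vyběhnout s instr" could flag the second sample error. The lexicon format, the FRAMES table and the function name are hypothetical illustrations, not the format of VALLEX or of any existing tool; the case readings for "reklamy" are taken from the gloss above.

```python
# Hypothetical mini-lexicon: verb lemma -> set of allowed (preposition, case) slots.
# None as the preposition means a bare (prepositionless) modifier.
FRAMES = {
    "vyběhnout": {("s", "instr")},
    "vydat": {(None, "acc")},
}

def modifier_fits(verb_lemma, preposition, possible_cases):
    """Return True if at least one case reading of the modifier fills a slot of the verb.

    Verbs missing from the lexicon are accepted, since there is no evidence against them.
    """
    slots = FRAMES.get(verb_lemma)
    if slots is None:
        return True
    return any((preposition, case) in slots for case in possible_cases)

# "reklamy" is ambiguous between these readings (see the gloss of the MT output).
reklamy_cases = {"nom", "acc", "voc", "gen"}

mt_output_ok = modifier_fits("vyběhnout", None, reklamy_cases)   # vyběhl reklamy
option_1_ok = modifier_fits("vyběhnout", "s", {"instr"})         # vyběhly s reklamami
option_2_ok = modifier_fits("vydat", None, reklamy_cases)        # vydaly reklamy
print(mt_output_ok, option_1_ok, option_2_ok)
```

The check works on lemmas and cases rather than surface forms, so a single entry covers all the variants listed above (vyběhl s reklamou, vyběhla s reklamami, and so on).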

Project Goal
Improve MT output quality using valency information.

Motivation: Why Czech
- Relevant properties: very rich morphological system and relatively free word order.
- Well-established theory of syntax, and of valency in particular: Sgall, Hajičová, and Panevová (1986), Panevová (1994).
- Data available:
    - monolingual and parallel corpora
    - manual surface and deep treebanks (parallel forthcoming!)
    - manual valency lexicons
    Language  Corpus                                              Annotation up to                         Tokens
    Cs        PDT 2.0 (Hajič, 2005)                               manual surface and deep syntax           1.5M surf.
    Cs        CNC (Kocek, Kopřivová, and Kučera, 2000)            automatic lemmatization and morphology   114M
    Cs        Web corpus                                          automatic surface syntax                 100M
    Cs En     PCEDT 1.0 (Čmejrek, Cuřín, and Havelka, 2003)       automatic surface and deep syntax        500k
    Cs En     CzEng 0.5                                           automatic surface syntax                 15M

Proposed Strategy
Preliminary experiments at workshop:
- Factored models touching valency explored during the workshop perform badly: no gain or a slight loss.
Future:
- Evaluate the causes. Was it just sparse data?
- Check subcategorization using partially lexicalized language models. A morphological LM with verbs lexicalized should capture subcategorization.
- Experiment with syntax-based language models (Chelba and Jelinek, 1998; Charniak, 2001).
- Map explicit subcategorization information from source to target. Translate lemma+subcat to lemma+subcat and POS to POS, and generate the surface form from this.
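One possible reading of a "morphological LM with verbs lexicalized" is sketched here: every token is reduced to its morphological tag, except verbs, which also keep their lemma. The "lemma|tag" encoding and the function name are assumptions for illustration; the tags are taken from the morphological analysis of "Zákony udělejte pro lidi" in the appendix, with one reading picked by hand.

```python
from collections import Counter

def partially_lexicalize(tokens):
    """Map (form, lemma, tag) triples to LM symbols.

    Verbs (positional tag starting with 'V') become 'lemma|tag';
    all other tokens are reduced to the bare tag.
    """
    symbols = []
    for _form, lemma, tag in tokens:
        if tag.startswith("V"):
            symbols.append(lemma + "|" + tag)
        else:
            symbols.append(tag)
    return symbols

# One disambiguated reading per token (chosen by hand for this sketch).
sentence = [
    ("zákony", "zákon", "NNIP4-----A----"),
    ("udělejte", "udělat", "Vi-P---2--A----"),
    ("pro", "pro-1", "RR--4----------"),
    ("lidi", "člověk", "NNMP4-----A----"),
]

stream = partially_lexicalize(sentence)

# Bigram counts over such streams are what an n-gram LM would be estimated from.
bigrams = Counter(zip(stream, stream[1:]))
print(stream)
```

Because nouns and prepositions are reduced to tags, counts for a bigram such as "udělat|V..." followed by a preposition tag are shared across all lexical fillers, which is the intended remedy for sparse data.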

Project Will Use These Sources of Information
- Available valency/subcategorization dictionaries: VALLEX for Czech (PropBank for English).
- Automatically collected subcategorization data: Korhonen (2002) and previous work; my dissertation, in preparation.
- Word-sense-like algorithms to label verb occurrences with frames: Bojar, Semecký, and Benešová (2005), and all results of the WSD community.
Compare with simple approaches:
- More monolingual data for plain n-gram language models may help enough.
- Are valency-based generalizations useful in general, on small data, or on out-of-domain data?

Summary
- Factored models help fix morphology; local dependencies are already correct.
- Significant margin for improving verb-modifier agreement.
- The English-Czech pair is a good fit for the experiments.
- Improved valency models should improve translation quality: valency theory, data and methods are available.

References
Bojar, Ondřej. 2003. Towards Automatic Extraction of Verb Frames. Prague Bulletin of Mathematical Linguistics, 79-80:101-120.
Bojar, Ondřej, Jiří Semecký, and Václava Benešová. 2005. VALEVAL: Testing VALLEX Consistency and Experimenting with Word-Frame Disambiguation. Prague Bulletin of Mathematical Linguistics, 83:5-17.
Charniak, Eugene. 2001. Immediate-head parsing for language models. In Meeting of the Association for Computational Linguistics, pages 116-123.
Chelba, Ciprian and Frederick Jelinek. 1998. Exploiting syntactic structure for language modeling. In Christian Boitet and Pete Whitelock, editors, Proceedings of the Thirty-Sixth Annual Meeting of the Association for Computational Linguistics and Seventeenth International Conference on Computational Linguistics, pages 225-231, San Francisco, California. Morgan Kaufmann Publishers.
Čmejrek, Martin, Jan Cuřín, and Jiří Havelka. 2003. Czech-English Dependency-based Machine Translation. In EACL 2003 Proceedings of the Conference, pages 83-90. Association for Computational Linguistics, April.
Collins, Michael. 1996. A New Statistical Parser Based on Bigram Lexical Dependencies. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, pages 184-191.
Collins, Michael, Jan Hajič, Eric Brill, Lance Ramshaw, and Christoph Tillmann. 1999. A Statistical Parser of Czech. In Proceedings of 37th ACL Conference, pages 505-512, University of Maryland, College Park, USA.
Hajič, Jan. 2005. Complex Corpus Annotation: The Prague Dependency Treebank. In Mária Šimková, editor, Insight into Slovak and Czech Corpus Linguistics, pages 54-73, Bratislava, Slovakia. Veda, vydavateľstvo SAV.
Holan, Tomáš. 2003. K syntaktické analýze českých(!) vět. In MIS 2003. MATFYZPRESS, January 18-25, 2003.
Kocek, Jan, Marie Kopřivová, and Karel Kučera, editors. 2000. Český národní korpus - úvod a příručka uživatele. FF UK - ÚČNK, Praha.
Koehn, Philipp. 2005. Europarl: A Parallel Corpus for Statistical Machine Translation. In Proceedings of MT Summit X, September.
Korhonen, Anna. 2002. Subcategorization Acquisition. Technical Report UCAM-CL-TR-530, University of Cambridge, Computer Laboratory, Cambridge, UK, February.
Kruijff, Geert-Jan M. 2003. 3-Phase Grammar Learning. In Proceedings of the Workshop on Ideas and Strategies for Multilingual Grammar Development.
Panevová, Jarmila. 1994. Valency Frames and the Meaning of the Sentence. In Ph. L. Luelsdorff, editor, The Prague School of Structural and Functional Linguistics, pages 223-243, Amsterdam-Philadelphia. John Benjamins.
Sgall, Petr, Eva Hajičová, and Jarmila Panevová. 1986. The Meaning of the Sentence and Its Semantic and Pragmatic Aspects. Academia/Reidel Publishing Company, Prague, Czech Republic/Dordrecht, Netherlands.

Analysis of Czech
Analytic (surface syntactic), tree #36:
    Zákony (Laws)     OBJ
    udělejte (make)   PRED
    pro (for)         AUXP
    lidi (people)     ADV
Tectogrammatical (deep syntactic), tree #36:
    zákon Pl (law Pl)                PAT
    udělat imp (make imp)            PRED
    you                              ACT
    člověk Pl,pro (person Pl,for)    BEN
Morphological:
    Form      Lemma    Morphological tag
    zákony    zákon    NNIP1-----A----
    zákony    zákon    NNIP4-----A----
    zákony    zákon    NNIP5-----A----
    zákony    zákon    NNIP7-----A----
    udělejte  udělat   Vi-P---2--A----
    udělejte  udělat   Vi-P---3--A---4
    pro       pro-1    RR--4----------
    lidi      člověk   NNMP1-----A----
    lidi      člověk   NNMP4-----A----
    lidi      člověk   NNMP5-----A----

Properties of Czech Language
                            Czech                              English
    Rich morphology         4,000 tags possible, 2,300 seen    50 used
    Word order              free                               rigid
- Rigid global word order phenomena: clitics.
- Rigid local word order phenomena: coordination, mutual order of clitics.
- Nonprojective sentences: 16,920 (23.3%).
- Nonprojective edges: 23,691 (1.9%).
Known parsing results:
                            Czech          English
    Edge accuracy           69.2-82.5%     91%
    Sentence correctness    15.0-30.9%     43%
Data by (Collins et al., 1999), (Holan, 2003), Zeman (http://ckl.mff.cuni.cz/ zeman/ /projekty/neproj/index.html) and (Bojar, 2003). Consult (Kruijff, 2003) for measuring word order freeness.
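For readers unfamiliar with the term, the sketch below counts nonprojective edges under the usual definition: an edge is nonprojective if some word lying between the head and the dependent is not dominated by the head. The head-array encoding and the function name are assumptions for illustration; the counts quoted above come from the cited sources and may rest on slightly different conventions.

```python
def nonprojective_edges(heads):
    """Return the nonprojective edges of a dependency tree as (head, dependent) pairs.

    heads[i] is the 1-based position of the head of word i+1; 0 marks the root.
    The input is assumed to be a well-formed tree (no cycles).
    """
    def dominated_by(node, ancestor):
        # Walk up from node towards the root, looking for ancestor.
        while node != 0:
            if node == ancestor:
                return True
            node = heads[node - 1]
        return False

    edges = []
    for dependent in range(1, len(heads) + 1):
        head = heads[dependent - 1]
        if head == 0:
            continue
        left, right = sorted((head, dependent))
        # The edge is nonprojective if any word strictly inside its span
        # is not a descendant of the head.
        if any(not dominated_by(k, head) for k in range(left + 1, right)):
            edges.append((head, dependent))
    return edges

projective_tree = [2, 0, 2]        # word 2 is the root, words 1 and 3 depend on it
crossing_tree = [3, 0, 2, 2]       # word 1 depends on word 3 across the root (word 2)

print(nonprojective_edges(projective_tree))
print(nonprojective_edges(crossing_tree))
```

A sentence counts as nonprojective as soon as this list is non-empty, which is consistent with the share of nonprojective sentences being much larger than the share of nonprojective edges.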

Detailed Numbers on Czech
    Edge length       1      2      5
    English [%]       74.2   86.3   95.6
    Czech [%]         51.8   72.1   90.2
    (Data for English by (Collins, 1996). Data for Czech by (Holan, 2003).)
    Number of gaps    0      1      2
    Sentences [%]     76.9   22.7   0.42
    (Data by (Holan, 2003).)
    Climbing steps    1      2      3      4      5
    Nodes [%]         90.3   8.0    1.3    0.3    0.1
    (Data by (Holan, 2003).)

Analytic vs. Tectogrammatical (2)
Analytic, tree #45:
    To (it)                       SB
    by (conjunct particle)        AUXV
    se (reflexive particle)       AUXR
    mělo (should)                 PRED
    změnit (change)               OBJ
    . (full stop)                 AUXK
Tectogrammatical, tree #45:
    Nodes: to (it), mít (should), změnit conj (change conj), Generic Actor
    Labels: PRED, PAT, PRED, ACT