Translation of Noun Phrase from English to Thai using Phrase-based SMT with CCG Reordering Rules

Peerachet Porkaew, Taneth Ruangrajitpakorn, Kanokorn Trakultaweekoon and Thepchai Supnithi
Human Language Technology Laboratory
National Electronics and Computer Technology Center
112 Thailand Science Park, Klong Nueng, Klong Luang, Pathumthani, Thailand 12120
{peerachet.porkaew,taneth.ruangrajitpakorn,kanokorn.trakultaweekoon,thepchai.supnithi}@nectec.or.th

Abstract

Statistical machine translation (SMT) has become the core research topic in the MT community. Several studies investigate how to connect linguistic knowledge with statistical methods. Our paper applies CCG notation to reorder English noun phrases before training and translating. The experimental results show that our methodology outperforms the baseline SMT system in both automatic and human evaluation. Our system improves the BLEU score from 13.11% to 13.50%. Human evaluation shows that for around 75% (1732/2310) of the sentences, our approach outperforms the baseline system.

1 Introduction

The early SMT system, introduced by IBM (Brown, 1993), is based on word translation probabilities. Several improvements, such as fertility and distortion, have attracted many researchers. The well-known word alignment toolkit GIZA++ was developed based on the IBM models. However, relying only on word translation probabilities leads the system to choose translation options ambiguously in local context. Phrase-based systems, in contrast, focus on groups of contiguous words. This approach is claimed to give better results than the word-based approach. The word translation model has been adapted into a phrase translation model that uses a phrase translation table.

Because translation quality depends significantly on the phrase translation model, much research focuses on constructing it. The effectiveness of a phrase translation table is affected by two main factors: (1) the correctness of the phrase pairs and (2) the phrase scores, such as translation probabilities. Marcu and Wong (2002) extract phrase pairs using a phrase joint probability, while Och et al. (1999) extract phrase pairs from the intersection of word alignments. Other methods, such as overlapping phrases (Tribble, 2003) and phrase extraction using N-best alignments (Xue, 2006), are applied to gain more information in the phrase table. Building the phrase table is a knowledge acquisition process for the system: the more knowledge the system gains, the better the expected quality. Okuma (2007) introduced adding a dictionary with reordering information into a phrase-based system. Apart from surface form, morphological information and part-of-speech are factors applied in the language model (Axelrod, 2006) and the translation model (Koehn, 2007).

In building an English-to-Thai SMT system with the phrase-based approach, the difference in noun phrase word order between the two languages becomes one of the major issues to be solved. In English, adjectives precede the noun they modify, while in Thai adjectives follow the noun. Figure 1 shows the difference in word order between English and Thai. Because noun phrases in the two languages have their own structures, we investigated a number of linguistic-knowledge-based reordering mechanisms.

Figure 1. The difference of word order between English and Thai.

The factored translation model focuses on adding linguistic information, called factors. Linguistic information, for instance part-of-speech, lemma and word classes, improves the translation models, including the reordering or distortion model. Yamada and Knight (2001) presented stochastic tree operations that transform source-language parse trees into target-language parse trees. The parameters of those operations were automatically learned from a linguistically parsed parallel corpus. Chiang (2005) employs rules induced from parallel text for a synchronous context-free grammar, which can solve the long-distance reordering problem.

Our algorithm is similar to the reordering algorithm of Elming (2008). However, we define reordering rules based on combinatory categorial grammar (CCG) (Steedman, 2000) instead of POS.

The rest of the paper is organized as follows. Section 2 describes our CCG parser and illustrates the reordering rules. Section 3 explains the experimental design. Section 4 presents the experimental results. Finally, Section 5 gives the conclusion and future work.

2 English and Thai Noun Phrase Gapping

English and Thai belong to different language types. They differ in several characteristics, such as word order and inflection. In this paper, we focus only on word order.

2.1 Difference between English and Thai Noun Phrase

The main problem when translating English to Thai is the reordering of noun phrases. The linguistic units that can modify a noun are adjectives, determiners and prepositional phrases. Both Thai and English have these units, but they modify the noun within a noun phrase differently. There are three cases of reordering between English and Thai.

1. Switch: This operator is applied when the word order differs between English and Thai. For example, adjectives and determiners are placed before their head noun in English, but they are moved after the head noun in Thai.
2. Drop: This operator is applied when words appear in one language but are dropped in the other. For example, articles (a, an, the) have no corresponding word in Thai, so they are dropped when translating English to Thai.
3. None: This operator is applied when the word order is the same in English and Thai. For example, prepositional phrases have the same structure in Thai and English, so they need no reordering.

2.2 Noun Phrase Extraction

In our work, the C&C tools (Curran, 2007) are applied to the English input to obtain tagged sentences. A CCG tagger can potentially assign more than one lexical category to a word, and its fine-grained lexical categories yield a higher accuracy rate than POS tags (Curran, 2006; Clark, 2007). The sub-categories, however, are removed since they do not show significant effects on reordering accuracy. After obtaining the tagged sentences, we parse them with our LR parser to get N-best CCG trees and then manually select the correct parse tree. Noun phrases are then extracted as the focus of reordering. An example of an extracted noun phrase tree is shown in Figure 2.

2.3 Reordering Rule for English to Thai translation

We aim to reorder noun modifiers before translating the entire sentence. Table 1 shows the reordering rules for translating English into Thai.

English phenomenon     Thai reordering
adjective noun         noun adjective
determiner noun        noun determiner
article noun           (drop) noun

Table 1. Reordering rules

The extracted noun phrases are examined against the rules. If a tree matches, the rule is applied to reorder the matched units. The rules can be applied recursively, since more than one adjective can modify a head noun. An example of a noun phrase to which the reordering rules have been applied is illustrated in Figure 3.
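The three operators of Section 2.1 and the rules of Table 1 can be sketched as a short recursive function. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Leaf`/`Node` classes, the `reorder` function and the simplified category strings (`N`, `N/N`, `NP/N`) are hypothetical, and a noun phrase is assumed to be a binary tree with the modifier on the left and the head on the right. Only the article list (a, an, the) comes from the paper.

```python
from dataclasses import dataclass
from typing import List, Union

ARTICLES = {"a", "an", "the"}  # listed in Section 2.1; dropped in Thai


@dataclass
class Leaf:
    word: str
    cat: str  # simplified CCG category (assumed labels): "N", "N/N", "NP/N"


@dataclass
class Node:
    modifier: "Tree"  # left child: adjective, determiner or article
    head: "Tree"      # right child: the (possibly already modified) noun


Tree = Union[Leaf, Node]


def reorder(tree: Tree) -> List[str]:
    """Return the words of a noun phrase in Thai-like order."""
    if isinstance(tree, Leaf):
        return [tree.word]
    head_words = reorder(tree.head)  # recurse so stacked modifiers are handled
    mod = tree.modifier
    if isinstance(mod, Leaf) and mod.cat == "NP/N" and mod.word.lower() in ARTICLES:
        return head_words                   # Drop: the article is removed
    if isinstance(mod, Leaf) and mod.cat in ("N/N", "NP/N"):
        return head_words + [mod.word]      # Switch: modifier moves after the head
    return reorder(mod) + head_words        # None: keep the original order


np_tree = Node(Leaf("the", "NP/N"),
               Node(Leaf("big", "N/N"),
                    Node(Leaf("red", "N/N"), Leaf("car", "N"))))
print(reorder(np_tree))  # ['car', 'red', 'big']
```

Because the Switch rule is applied at every level of the tree, stacked adjectives end up after the head in reverse order. Whether this matches the order produced by the authors' system is an assumption of this sketch.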

Figure 2. Example of English noun phrase tree with CCG tag (English sentence: She was glad to accept his invitation.)

Figure 3. Example of noun phrase reordering

3 Experiment Design

We designed our framework, shown in Figure 4, to evaluate the proposed method. The training corpus consists of 160K sentence pairs, and the test set consists of 16K sentences. In the experiment, we parse the English training set with the CCG parser. Noun phrases in every sentence are then extracted and reordered following the rules in Section 2.3. After that, the translation model and language model are built. We apply SRILM for language modeling and GIZA++ for word alignment. The phrase table of the reordered data is trained with the phrase extraction algorithm of the Moses toolkit. We also reorder the test sentences, and the reordered test sentences are translated with the reordered translation model.

We evaluate translation quality in terms of BLEU score (Papineni et al., 2001; Doddington, 2002), a popular evaluation method in statistical machine translation. The similarity between the results and the references is computed from n-gram matches. However, the BLEU score may not give an accurate evaluation of our method because it is computed over the whole sentence, not just the noun phrases. Therefore, we also use human evaluation for a more accurate comparison.

For the human evaluation, we randomly selected 2,310 sentences for three linguists to vote on. There are three vote options: "better", "equal" and "worse". "Better" means that the translated sentence is better than the baseline. "Equal" means that both translation results are equivalent. "Worse" means that the translated sentence is worse than the baseline. For each sentence, the option with the most votes is counted. If the votes on a sentence are tied, another expert linguist makes the final decision.

4 Experiment Results

In the experiment, the BLEU score of the baseline system was 13.11% and the BLEU score of the reordering model was 13.50%, an increase of 0.39 percentage points over the baseline. A higher BLEU score means that the output is closer to the reference translation produced by a human. The improvement in BLEU is small because we improved only the translation of noun phrases. Table 2 shows the human evaluation, in which 75% of the selected test sentences were voted "better". Our proposed method thus improves the quality of noun phrase translation.

Vote result   Number of sentences   Percent
Better        1,732                 75%
Equal         273                   12%
Worse         305                   13%
Total         2,310                 100%

Table 2. Experiment result by human evaluation
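The voting procedure of Section 3 and the percentages in Table 2 can be reproduced with a short sketch. The function names `decide` and `summarize` and the `tie_breaker` callback are hypothetical and stand in for the extra expert linguist; the per-sentence votes are not available, so the example rebuilds the final decisions from the counts reported in Table 2.

```python
from collections import Counter
from typing import Callable, Iterable, List

OPTIONS = ("better", "equal", "worse")


def decide(votes: List[str], tie_breaker: Callable[[List[str]], str]) -> str:
    """Majority vote; a tie is passed to an extra judge (assumed callback)."""
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return tie_breaker(votes)
    return counts[0][0]


def summarize(decisions: Iterable[str]) -> dict:
    """Map each option to (number of sentences, rounded percentage)."""
    tally = Counter(decisions)
    total = sum(tally.values())
    return {opt: (tally[opt], round(100 * tally[opt] / total)) for opt in OPTIONS}


# Final decisions rebuilt from the counts reported in Table 2.
decisions = ["better"] * 1732 + ["equal"] * 273 + ["worse"] * 305
print(summarize(decisions))
# {'better': (1732, 75), 'equal': (273, 12), 'worse': (305, 13)}
```

With three voters and three options, a tie can only occur when each linguist picks a different option, which is the case the tie-breaker handles.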

Figure 4. Flow diagram of the experiment (panels: Training Process, Translation Process)

5 Conclusion and Future Work

The system integrates the advantages of syntactic reordering and phrase-based SMT. Our system applies CCG for reordering, which is more accurate for parsing and extracting NPs. We built reordering rules based on linguistic knowledge to transform English noun phrases into Thai-structured noun phrases. The phrase translation model was built on the reordered training set. Therefore, the system obtains better alignments while maintaining the characteristics of phrase-based SMT.

In this paper, we proposed noun phrase reordering using a CCG parser for English-to-Thai SMT. We defined CCG reordering rules and integrated them into phrase-based SMT using a method similar to that of Elming (2008). We achieved a BLEU gain of 0.39 percentage points. Based on the human evaluation, 75% of the 2,310 sentences translated by the improved system were judged "better". The proposed method gives a promising result for noun phrase translation. Several challenges remain. In future work, we plan to address the reordering problem with a classifier. Moreover, we plan to apply a pattern-based approach to overcome the long-distance dependency problem.

References

Andreas Stolcke. 2002. SRILM an Extensible Language Modeling Toolkit. In Proceedings of the International Conference on Spoken Language Processing, Denver, CO, USA.

Amittai Axelrod. 2006. Factored Language Model for Statistical Machine Translation. MRes Thesis. Edinburgh University.

Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, vol. 19, no. 2, pp. 263-311.

Daniel Marcu and William Wong. 2002. A phrase-based, joint probability model for statistical machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP.

Franz J. Och and Hermann Ney. 2003. A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, vol. 29, no. 1, pp. 19-51, March.

Franz J. Och, C. Tillmann, and Hermann Ney. 1999. Improved alignment models for statistical machine translation. In Proceedings of the Joint Conference of Empirical Methods in Natural Language Processing and Very Large Corpora, pp. 20-28.

George Doddington. 2002. Automatic Evaluation of Machine Translation Quality Using N-gram Co-occurrence Statistics. In Proceedings of ARPA Workshop on Human Language Technology, Plainsboro, NJ, USA.

Jakob Elming. 2008. Syntactic Reordering Integrated with Phrase-based SMT. In Proceedings of the ACL Workshop on Syntax and Structure in Statistical Translation (SSST-2), Columbus, OH, USA.

James R. Curran, Stephen Clark, and David Vadas. 2006. Multi-Tagging for Lexicalized-Grammar Parsing. In Proceedings of the Joint Conference of the International Committee on Computational Linguistics and the Association for Computational Linguistics (ACL), Sydney, Australia.

James R. Curran, Stephen Clark, and Johan Bos. 2007. Linguistically Motivated Large-Scale NLP with C&C and Boxer. In Proceedings of the ACL 2007 Demonstrations (ACL demo), Prague, Czech Republic.

Kenji Yamada and Kevin Knight. 2001. A syntax-based statistical translation model. In Proceedings of the ACL.

Alicia Tribble, Stephan Vogel, and Alex Waibel. 2003. Overlapping Phrase-Level Translation Rules in an SMT Engine. In Proceedings of International Conference on Natural Language Processing and Knowledge Engineering (NLP-KE), Beijing, China.

Yong-Zeng Xue, Sheng Li, Tie-Jun Zhao, Mu-Yun Yang, and Jun Li. 2006. Bilingual Phrase Extraction from N-Best Alignments. In Proceedings of the First International Conference on Innovative Computing, Information and Control.

jwordseg: http://www.suparsit.com/nlp-tools - Thai word segmentation toolkit.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2001. BLEU: a method for automatic evaluation of machine translation. Technical Report RC22176(W0109-022), IBM Research Report.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical Phrase-based Translation. In Proceedings of NAACL 2003, Edmonton, Canada.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. In Proceedings of the ACL 2007 Demonstrations (ACL demo), Prague, Czech Republic.

Hideo Okuma, Hirofumi Yamamoto, and Eiichiro Sumita. 2007. Introducing Translation Dictionary Into Phrase-based SMT. In Proceedings of Machine Translation Summit, pp. 361-367.

Stephen Clark and James R. Curran. 2007. Formalism-Independent Parser Evaluation with CCG and DepBank. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL), Prague, Czech Republic.