Translation of Noun Phrase from English to Thai using Phrase-based SMT with CCG Reordering Rules

Peerachet Porkaew, Taneth Ruangrajitpakorn, Kanokorn Trakultaweekoon and Thepchai Supnithi
Human Language Technology Laboratory
National Electronics and Computer Technology
112 Thailand Science Park, Klong Nueng, Klong Luang, Pathumthani, Thailand 12120
{peerachet.porkaew,taneth.ruangrajitpakorn,kanokorn.trakultaweekoon,thepchai.supnithi}@nectec.or.th

Abstract

Statistical machine translation (SMT) has become a core research area in the MT community. Several studies have investigated how to connect linguistic knowledge with statistical methods. This paper applies CCG notation to reorder English noun phrases before training and translation. The experimental results show that our method outperforms the baseline SMT system in both automatic and human evaluation. Our system improves the BLEU score from 13.11% to 13.50%. In the human evaluation, about 75% (1,732/2,310) of the sentences translated with our approach were judged better than those of the baseline system.

1 Introduction

The early SMT system introduced by IBM (Brown et al., 1993) is based on word translation probabilities. Several refinements, such as fertility and distortion, have attracted the attention of many researchers. The well-known word alignment toolkit GIZA++ was developed on the basis of the IBM models. However, relying only on word translation probabilities causes the system to choose translation options ambiguously in local contexts. Phrase-based systems, in contrast, focus on groups of contiguous words, and this approach is reported to give better results than the word-based approach. The word translation model is adapted into a phrase translation model that uses a phrase translation table. Because translation quality depends significantly on the phrase translation model, much research focuses on constructing it. The effectiveness of a phrase translation table is affected by two main factors, i.e.
(1) the correctness of the phrase pairs and (2) the phrase scores, such as translation probabilities. Marcu and Wong (2002) extract phrase pairs using a phrase joint probability, while Och et al. (1999) extract phrase pairs from the intersection of word alignments. Other methods, such as overlapping phrases (Tribble, 2003) and phrase extraction from N-best alignments (Xue, 2006), are applied to gain more information in the phrase table.

Building the phrase table is a knowledge acquisition process for the system: the more knowledge the system gains, the better the expected quality. Okuma (2007) introduced a dictionary into a phrase-based system together with reordering information. Apart from surface forms, morphological information and part-of-speech are factors applied in the language model (Axelrod, 2006) and the translation model (Koehn, 2007).

In building an English-to-Thai SMT system with the phrase-based approach, the difference in word order within noun phrases between the two languages becomes one of the major issues to be solved. In English, adjectives precede the noun they modify, while in Thai adjectives follow the noun. Figure 1 shows the difference in word order between English and Thai.

Because noun phrases in the two languages have their own structures, we investigated a number of linguistic-knowledge-based reordering mechanisms. The factored translation model focuses on adding linguistic information, called factors. Linguistic information, for instance part-of-speech, lemma and word classes, improves the translation models, including the reordering or distortion model. Yamada and Knight (2001) presented stochastic tree operations to transform source-language parse trees into target-language parse trees. The parameters of those operations were automatically learned from a linguistically parsed parallel corpus. Chiang (2005) employs rules induced from parallel text for a synchronous context-free grammar, which can solve the distant reordering problem. Our algorithm is similar to the reordering algorithm of Elming (2008). However, we define reordering rules based on combinatory categorial grammar (CCG) (Steedman, 2000) instead of rules based on POS.

The rest of this paper is organized as follows. Section 2 describes our CCG parser and illustrates the reordering rules. Section 3 explains the experimental design. Section 4 shows the experimental results. Finally, the conclusion and future work are given in Section 5.

Figure 1. The difference of word order between English and Thai.

2 English and Thai Noun Phrase Gapping

English and Thai belong to different types of language. They differ in several characteristics, such as word order and inflection. In this paper, we focus only on word order.

2.1 Difference between English and Thai Noun Phrase

The main problem when translating English to Thai is the reordering of noun phrases. The linguistic units that can modify a noun are adjectives, determiners, and prepositional phrases. Both Thai and English have these units, but they modify the noun within a noun phrase differently. There are three cases of reordering between English and Thai.

1. Switch: This operator is applied when the order in English and Thai is different. For example, adjectives and determiners are placed before their head noun in English, but they are moved after the head noun in Thai.
2. Drop: This operator is applied when words appear in one language but are dropped in the other. For example, the articles (a, an, the) have no corresponding word in Thai, so they are grammatically dropped when translating English to Thai.
3. None: This operator is applied when the order in English and Thai does not differ.
For example, Thai and English have the same word order in prepositional phrases, so prepositional phrases need no reordering.

2.2 Noun Phrase Extraction

In our work, the C&C tools (Curran, 2007) are applied to the English input to obtain tagged sentences. A CCG tagger can assign more than one lexical category to a word, and with its fine-grained lexical categories it achieves a higher accuracy rate than POS tagging (Curran, 2006; Clark, 2007). The sub-categories, however, are removed since they show no significant effect on reordering accuracy. After obtaining the tagged sentences, we parse them with our LR parser to get N-best CCG trees and then manually select the correct parse tree. Noun phrases are then extracted as the targets of reordering. An example of an extracted noun phrase tree is shown in Figure 2.

2.3 Reordering Rule for English to Thai translation

We aim to reorder noun modifiers before translating the entire sentence. Table 1 shows the reordering rules for translating English into Thai.

English phenomenon     Thai reordering
adjective noun         noun adjective
determiner noun        noun determiner
article noun           (drop) noun

Table 1. Reordering rules

The extracted noun phrases are checked against the rules. If a tree matches a rule, the rule is applied to reorder the matched units. The rules can be applied recursively, since more than one adjective may modify a head noun. An example of a noun phrase after applying the reordering rules is illustrated in Figure 3.
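
The three operators and their recursive application can be illustrated with a minimal Python sketch. It is not the implementation used in the experiments. It assumes a binary noun phrase tree whose leaves carry simplified CCG categories with sub-categories removed ("N/N" for adjectives, "NP/N" for determiners and articles); the names Leaf, Node, reorder and words are hypothetical and introduced only for this illustration.

```python
from dataclasses import dataclass
from typing import List, Union

# Articles are dropped rather than switched (rule "Drop" in Table 1).
ARTICLES = {"a", "an", "the"}

# Assumed simplified categories: adjective and determiner/article.
ADJECTIVE = "N/N"
DETERMINER = "NP/N"


@dataclass
class Leaf:
    word: str
    cat: str  # simplified CCG category (an assumption of this sketch)


@dataclass
class Node:
    left: "Tree"
    right: "Tree"


Tree = Union[Leaf, Node]


def reorder(tree: Tree) -> Tree:
    """Apply Switch / Drop / None bottom-up to a binary noun phrase tree."""
    if isinstance(tree, Leaf):
        return tree
    left, right = reorder(tree.left), reorder(tree.right)
    if isinstance(left, Leaf):
        if left.cat == DETERMINER and left.word.lower() in ARTICLES:
            return right                 # Drop: the article has no Thai counterpart
        if left.cat in (ADJECTIVE, DETERMINER):
            return Node(right, left)     # Switch: the modifier follows its head
    return Node(left, right)             # None: the order is kept


def words(tree: Tree) -> List[str]:
    """Read the words of a tree from left to right."""
    if isinstance(tree, Leaf):
        return [tree.word]
    return words(tree.left) + words(tree.right)


# "this big red car": one determiner and two adjectives, applied recursively.
np_tree = Node(Leaf("this", "NP/N"),
               Node(Leaf("big", "N/N"),
                    Node(Leaf("red", "N/N"), Leaf("car", "N"))))
print(" ".join(words(reorder(np_tree))))  # car red big this
```

Because the rules are applied bottom-up, stacked modifiers end up in mirrored order after the head noun, while a constituent whose left child is not a noun modifier (for example a prepositional phrase attached to a noun phrase) is left in place.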

Figure 2. Example of English noun phrase tree with CCG tags. English sentence: She was glad to accept his invitation.

Figure 3. Example of noun phrase reordering.

3 Experiment Design

We design our framework, shown in Figure 4, to evaluate the proposed method. The training corpus consists of 160K sentence pairs, and the test set consists of 16K sentences. In the experiment, we parse the English training set with the CCG parser. Noun phrases in every sentence are then extracted and reordered following the rules in Section 2.3. After that, the translation model and the language model are generated. We apply SRILM for language modeling and GIZA++ for word alignment. The phrase table of the reordered data is trained with the phrase extraction algorithm of the Moses toolkit. We also reorder the test sentences, which are then translated with the reordered translation model.

We evaluate quality in terms of the BLEU score (Papineni et al., 2001; Doddington, 2002), a popular evaluation method in the field of statistical machine translation. The similarity between the results and the references is computed from n-gram matches. However, the BLEU score may not give an accurate evaluation here because it is computed over the whole sentence, not just the noun phrases. Therefore, we also use human evaluation for a more accurate comparison.

For the human evaluation of the proposed method, we randomly selected 2,310 sentences for three linguists to vote on. There are three vote options: "better", "equal", and "worse". "Better" means that the translated sentence is better than the baseline. "Equal" means that both translation results are equivalent. "Worse" means that the translated sentence is worse than the baseline. For each sentence, the option with the most votes was counted. If the votes for a sentence were tied, another expert linguist made the final decision.

4 Experiment Results

In the experiment, the BLEU score of the baseline system was 13.11% and the BLEU score of the reordering model was 13.50%.
This is an increase of 0.39 percentage points over the baseline system. A higher BLEU score means that the output is closer to the reference translations produced by humans. The improvement in BLEU is small because we improved only the translation of noun phrases. Table 2 shows the human evaluation, in which 75% of the selected test sentences were voted as better. Our proposed method therefore increases the quality of noun phrase translation.

Vote result     Number of sentences     Percent
Better          1,732                   75%
Equal           273                     12%
Worse           305                     13%
Total           2,310                   100%

Table 2. Experiment result by human evaluation
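
The voting scheme and the percentages in Table 2 can be sketched as follows. This is only an illustration under stated assumptions: the function names decide and summarize are hypothetical, a tie is taken to mean that no single option has the most votes, and the per-sentence votes themselves are not available, so the check at the end uses only the aggregate counts reported in Table 2.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

OPTIONS = ("better", "equal", "worse")


def decide(votes: List[str], expert_vote: Optional[str] = None) -> str:
    """Return the option with the most votes; on a tie, defer to the extra expert."""
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        if expert_vote is None:
            raise ValueError("tied votes need a decision from another expert")
        return expert_vote
    return ranked[0][0]


def summarize(decisions: List[str]) -> Dict[str, Tuple[int, int]]:
    """Count decisions per option and convert them to rounded percentages."""
    tally = Counter(decisions)
    total = len(decisions)
    return {opt: (tally[opt], round(100 * tally[opt] / total)) for opt in OPTIONS}


# Majority of three linguists, and a three-way tie settled by a fourth expert.
print(decide(["better", "better", "equal"]))                      # better
print(decide(["better", "equal", "worse"], expert_vote="equal"))  # equal

# Aggregate counts taken from Table 2.
decisions = ["better"] * 1732 + ["equal"] * 273 + ["worse"] * 305
print(summarize(decisions))
```

With the counts of Table 2, the rounded shares come out as 75%, 12% and 13%, which matches the reported percentages.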

Figure 4. Flow diagram of the experiment (training process and translation process).

5 Conclusion and Future Work

The system integrates the advantages of syntactic reordering and phrase-based SMT. It applies CCG in reordering, which allows noun phrases to be parsed and extracted more accurately. We built reordering rules based on linguistic knowledge to transform English noun phrases into Thai-structured noun phrases. The phrase translation model was built on the reordered training set. Therefore, the system obtains better alignments while retaining the characteristics of phrase-based SMT.

In this paper, we proposed noun phrase reordering using a CCG parser for English-to-Thai SMT. We defined CCG reordering rules and integrated them into phrase-based SMT using a method similar to that of Elming (2008). We achieved a BLEU gain of 0.39 percentage points. In the human evaluation, 75% of the 2,310 sentences translated by the improved system were judged "better". The proposed method gives promising results for noun phrase translation.

Several challenges remain. In future work, we plan to solve the reordering problem with classifiers. Moreover, we plan to apply a pattern-based approach to overcome the long-distance dependency problem.

References

Andreas Stolcke. 2002. SRILM an Extensible Language Modeling Toolkit. In Proceedings of the International Conference on Spoken Language Processing, Denver, CO, USA.
Amittai Axelrod. 2006. Factored Language Model for Statistical Machine Translation. MRes Thesis. Edinburgh University.
Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, vol. 19, no. 2, pp. 263-311.
Daniel Marcu and William Wong. 2002. A phrase-based, joint probability model for statistical machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).
Franz J. Och and Hermann Ney. 2003. A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, vol. 29, no. 1, pp. 19-51, March.
Franz J. Och, Christoph Tillmann, and Hermann Ney. 1999. Improved alignment models for statistical machine translation. In Proceedings of the Joint Conference of Empirical Methods in Natural Language Processing and Very Large Corpora, pp. 20-28.
George Doddington. 2002. Automatic Evaluation of Machine Translation Quality Using N-gram Co-occurrence Statistics. In Proceedings of the ARPA Workshop on Human Language Technology, Plainsboro, NJ, USA.
Jakob Elming. 2008. Syntactic Reordering Integrated with Phrase-based SMT. In Proceedings of the ACL Workshop on Syntax and Structure in Statistical Translation (SSST-2), Columbus, OH, USA.
James R. Curran, Stephen Clark, and David Vadas. 2006. Multi-Tagging for Lexicalized-Grammar Parsing. In Proceedings of the Joint Conference of the International Committee on Computational Linguistics and the Association for Computational Linguistics (ACL), Sydney, Australia.
James R. Curran, Stephen Clark, and Johan Bos. 2007. Linguistically Motivated Large-Scale NLP with C&C and Boxer. In Proceedings of the ACL 2007 Demonstrations (ACL demo), Prague, Czech Republic.
Kenji Yamada and Kevin Knight. 2001. A syntax-based statistical translation model. In Proceedings of the ACL.
Alicia Tribble, Stephan Vogel, and Alex Waibel. 2003. Overlapping Phrase-Level Translation Rules in an SMT Engine. In Proceedings of the International Conference on Natural Language Processing and Knowledge Engineering (NLP-KE), Beijing, China.
Yong-Zeng Xue, Sheng Li, Tie-Jun Zhao, Mu-Yun Yang, and Jun Li. 2006. Bilingual Phrase Extraction from N-Best Alignments. In Proceedings of the First International Conference on Innovative Computing, Information and Control.
jwordseg: Thai word segmentation toolkit. http://www.suparsit.com/nlp-tools
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2001. BLEU: a method for automatic evaluation of machine translation. Technical Report RC22176(W0109-022), IBM Research Report.
Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical Phrase-based Translation. In Proceedings of NAACL 2003, Edmonton, Canada.
Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. In Proceedings of the ACL 2007 Demonstrations (ACL demo), Prague, Czech Republic.
Hideo Okuma, Hirofumi Yamamoto, and Eiichiro Sumita. 2007. Introducing Translation Dictionary Into Phrase-based SMT. In Proceedings of Machine Translation Summit, pp. 361-367.
Stephen Clark and James R. Curran. 2007. Formalism-Independent Parser Evaluation with CCG and DepBank. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL), Prague, Czech Republic.