A Comparative Study on Applying Hierarchical Phrase-based and Phrase-based on Thai-Chinese Translation


2012 Seventh International Conference on Knowledge, Information and Creativity Support Systems

A Comparative Study on Applying Hierarchical Phrase-based and Phrase-based on Thai-Chinese Translation

Prasert Luekhong 1,2, Rattasit Sukhauta 2, Peerachet Porkaew 3, Taneth Ruangrajitpakorn 3 and Thepchai Supnithi 3

1 College of Integrated Science and Technology, Rajamangala University of Technology Lanna, Chiang Mai, Thailand prasert@rmutl.ac.th
2 Computer Science Department, Faculty of Science, Chiang Mai University, Chiang Mai, Thailand rattasit.s@cmu.ac.th
3 Language and Semantic Technology Laboratory, National Electronics and Computer Technology Center, Thailand {peerachet.porkaew, taneth.rua, thepchai}@nectec.or.th

Abstract - To set an appropriate goal for SMT research on Thai-based translation, a comparative study of the potential and suitability of phrase-based translation (PBT) and hierarchical phrase-based translation (HPBT) becomes the initial question. The Thai-Chinese language pair is chosen as the experimental subject since the two languages share most common syntactic patterns and Chinese resources are plentiful. Under a standard setting, we find that 3-gram HPBT gains a significantly better BLEU score than 3-gram PBT, while 3-gram HPBT is approximately equal to 5-gram PBT. Moreover, the results show that Chinese-to-Thai translation obtains better accuracy than Thai-to-Chinese translation under every approach.

Keywords - hierarchical phrase-based translation; SMT; Thai-Chinese translation

I. INTRODUCTION

In the past decades, much research on statistical machine translation (SMT) has been conducted, resulting in several methods and approaches. The major approaches of SMT can be categorized as word-based, phrase-based and tree-based [1]. With the high demand for SMT development, various software packages have been developed to help implement SMT, such as Moses [2], Phrasal [3], cdec [4], Joshua [5] and Jane [6].
Moses and Phrasal gain our focus since both are open source and can effectively implement all three above-mentioned approaches, while the others cannot. However, Moses receives more public attention than Phrasal in terms of popularity, since it has been applied as a baseline in several venues such as ACL (since 2007), Coling and EMNLP. With a tool such as Moses, an SMT developer requires at least a parallel corpus of a language pair to conveniently implement a baseline statistical translation system. Various language pairs have been tried with SMT in the past, such as English-French, English-Spanish and English-German, and they eventually gained impressive accuracy results [7] since they have sufficient and well-made data for training, for instance from the Linguistic Data Consortium (LDC) [8], the parallel corpus for statistical machine translation (Europarl) [9], the JRC-Acquis [10] and the English-Persian Parallel Corpus [11]. Unfortunately, for a low-resource language such as Thai, researchers suffer from insufficient data to conduct a full-scale experiment on SMT, and thus translation accuracy with any other language is noticeably low; for example, simple phrase-based SMT on English-to-Thai gained a BLEU score of around 13.11% [12]. Furthermore, Thai currently lacks a sufficiently large syntactic treebank to support the tree-based approach, hence SMT research on Thai is limited to the word-based and phrase-based approaches. Since phrase-based SMT has been shown to outperform the word-based approach [1], the development of word-based SMT for Thai is dismissed. With the limited resources for experimenting with a complete Thai tree-based SMT in Moses, hierarchical phrase-based translation (HPBT) becomes more interesting, since its accuracy on other language pairs has repeatedly been reported to be higher than that of the simple phrase-based translation approach (PBT) [13].
Though the high potential of HPBT is renowned, no experiment on HPBT for Thai has yet been reported. On the contrary, there are also some reports of negative results for HPBT; for example, the BLEU score of Arabic-English translation using HPBT is reported to be 0.6 BLEU points lower than that of PBT [14]. This raises the question of which approach is more suitable for the Thai language. From the linguistic point of view, it is clear that SMT works better with language pairs of the same typology, since the impressive BLEU scores are noticeably obtained from European language pairs [7]. Therefore, to test the suitability of different approaches for Thai, Chinese is selected as the paired language in the test because of its resourcefulness and the resemblance of its structures to Thai. In this work, a comparative study of Chinese-Thai translation based on the HPBT and PBT approaches is conducted to serve as a flagship for further research on Thai SMT. Moreover, different context sizes (3-gram and 5-gram) are also studied to compare how they affect the translation result.

The rest of this paper is organized as follows. Section II gives background on past work relevant to HPBT and PBT translation results. Section III explains the methods and set-ups of the HPBT and PBT implementations for the Thai-Chinese pair. Section IV details the experiment setting and shows the experimental results with discussion. Lastly, Section V gives a conclusion and a list of further plans for improvement.

II. BACKGROUND

Statistical machine translation (SMT) is a machine translation paradigm where translations are generated on the basis of statistical models whose parameters are derived from the analysis of bilingual text corpora. The idea behind statistical machine translation comes from information theory. A document is translated according to a probability distribution, defined by Brown [15], that a string in the target language (for example, English) is the translation of a string in the source language (for example, French). SMT can be divided into three categories: word-based, phrase-based and tree-based models.

A. Word-Based Model

The word-based model is based on lexical translation. The translation of words in isolation requires a bilingual dictionary that maps words from one language to another. The main issue with this approach is largely caused by lexical complexity. Generally, in natural language, lexemes with the same surface form do not refer solely to a single concept but to multiple entities. Even though they can all be defined in the dictionary, the entries are still not sufficient to cover the actual meaning in different contexts. For example, the Thai word (koh) can be translated into either island (noun) or to stick (verb) in English. An example of a word-based translation system is the freely available GIZA++ package [16] (GPLed), which includes the training programs for IBM Models 1-5 following the description by Brown [15] and the hidden Markov model [17].
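As a concrete illustration of the lexical-translation idea behind GIZA++'s IBM models, a minimal IBM Model 1 EM loop on a two-sentence toy corpus can be sketched as follows. This is a simplified sketch, not the toolkit's implementation; the romanized tokens are invented placeholders, not real corpus data.

```python
from collections import defaultdict

def train_ibm1(corpus, iterations=10):
    """EM training of IBM Model 1 lexical translation probabilities t(f|e).

    corpus: list of (source_tokens, target_tokens) sentence pairs.
    """
    src_vocab = {w for s, _ in corpus for w in s}
    # uniform initialization of t(f|e)
    t = defaultdict(lambda: 1.0 / len(src_vocab))
    for _ in range(iterations):
        count = defaultdict(float)   # expected counts c(f, e)
        total = defaultdict(float)   # marginal counts c(e)
        for src, tgt in corpus:
            for f in src:
                z = sum(t[(f, e)] for e in tgt)   # normalization over tgt words
                for e in tgt:
                    p = t[(f, e)] / z             # posterior that e generated f
                    count[(f, e)] += p
                    total[e] += p
        for (f, e), c in count.items():           # M-step: renormalize
            t[(f, e)] = c / total[e]
    return t

# toy romanized corpus: "koh" should end up aligned strongly to "island"
corpus = [(["koh", "yai"], ["big", "island"]),
          (["koh"], ["island"])]
t = train_ibm1(corpus)
```

The second sentence pair disambiguates "koh", and EM propagates that evidence back into the ambiguous first pair, which is exactly the effect the lexical-translation model relies on.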
Recently, attention to the word-based approach has been fading, since it has been shown to give unreliably low results and there are several methods that surpass its capability.

B. Phrase-Based Model

As its name suggests, the phrase-based model performs translation based on phrasal units. It gains an advantage over the simple word-based model in terms of appropriateness in selecting a translation from the surrounding context. Koehn states that the currently best performing statistical machine translation systems are based on phrase-based models [1]. The capability to translate small word sequences at a time is arguably the advantage of phrase-based translation. Though many SMT systems [18][19][20][21][22] have been developed following this approach and show adequate outcomes, the remaining limitation is that it is a purely statistical method without linguistic knowledge and can return unexpected errors caused by sparseness and an insufficient amount of training data. Nevertheless, this model remains favored since it is simple to implement with plain parallel corpora.

C. Tree-Based Model

The tree-based model can be defined as the use of syntactic trees for assisting in mapping different linguistic structures and contextual word translation by using a synchronous grammar [23][24][25]. Nevertheless, it requires a treebank [26] as a resource for the full translation process. Therefore, less informative models, such as tree-to-string [27][28] and string-to-tree [29], or a model without linguistic information, such as the hierarchical phrase-based model, were proposed. For rich-resource languages with an adequate treebank, implementing a tree-based model with full linguistic information can be planned. Otherwise, less informative models or models without linguistic information are the only options for a low-resource language.

III. DEVELOPMENT OF THAI-CHINESE SMT

This work aims to study the compatibility with the Thai language of two well-known approaches to SMT, i.e.
phrase-based translation (PBT) and hierarchical phrase-based translation (HPBT). We design the system architecture for the experiment as shown in Figure 1. The machine translation process starts with the training process. From a parallel corpus, rules for HPBT and phrases for PBT are separately extracted into tables, while the data in the parallel corpus are also used for generating a language model. In summary, the training process returns three mandatory outputs for the testing process: a rule table for HPBT, a phrase table for PBT, and a language model for both.

Figure 1. System architecture

For the testing process, an input sentence for translation is needed. As the system manages input one sentence at a time, the input is designed as one sentence per line. To translate based on HPBT and PBT, each decoder is executed separately and returns a translation result. For more detail, each process is described in the following sections.

A. Phrase-based Translation (PBT)

Statistical phrase-based MT is an improvement over statistical word-based MT. The word-based approach uses word-to-word translation probabilities to translate a source sentence. The phrase-based approach allows the system to divide the source sentence into segments before translating those segments. Because segmented translation pairs (so-called phrase translation pairs) can capture local reordering and can reduce translation alternatives, the quality of output from the phrase-based approach is generally higher than that of the word-based approach. It should be noted that phrase pairs are automatically extracted from the corpus and are not defined in the same way as traditional linguistic phrases. As a baseline for comparison with HPBT, PBT is developed with both 3-gram and 5-gram language models. In this work, the phrase-based translation model proposed in [1] is implemented as follows.

1) Phrase Extraction Algorithm

The process of producing the phrase translation model starts with the phrase extraction algorithm. Below is an overview of the phrase extraction algorithm.

1) Collect word translation probabilities for source-to-target (forward model) and target-to-source (backward model) by using IBM Model 4.
2) Use the forward model and backward model from step 1) to align words in the source-to-target and target-to-source directions respectively. Only the highest probability is chosen for each word.
3) Intersect the forward and backward word alignment points to get highly accurate alignment points.
4) Fill in additional alignment points using the heuristic growing procedure.
5) Collect consistent phrase pairs from step 4).

2) Phrase-based Model

Let $e$ range over possible translation results and let $f$ be the source sentence. Finding the best translation $\hat{e}$ can be done by maximizing $p(e|f)$ using Bayes's rule:

$\hat{e} = \arg\max_e p(e|f) = \arg\max_e p(f|e)\, p_{LM}(e)$  (1)

where $p(f|e)$ is the translation model and $p_{LM}(e)$ is the target language model. The target language model can be trained from a monolingual corpus of the target language. (1) can be written in the form of a log-linear model to add customized features. For each phrase pair, five features are introduced, i.e. forward and backward phrase translation probability distributions, forward and backward lexical weights, and a phrase penalty. According to these five features, (1) can be summarized as follows:

$\hat{e} = \arg\max_e \prod_{i=1}^{I} \phi(\bar{f}_i|\bar{e}_i)^{\lambda_1}\, \phi(\bar{e}_i|\bar{f}_i)^{\lambda_2}\, \mathrm{lex}(\bar{f}_i|\bar{e}_i)^{\lambda_3}\, \mathrm{lex}(\bar{e}_i|\bar{f}_i)^{\lambda_4}\, \omega^{\lambda_5}\, p_{LM}(e)^{\lambda_{LM}}$  (2)

In (2), $\bar{f}_1 \ldots \bar{f}_I$ is a phrase segmentation of $f$. The terms $\phi(\bar{f}_i|\bar{e}_i)$ and $\phi(\bar{e}_i|\bar{f}_i)$ are the phrase-level conditional probabilities for the forward and backward probability distributions with feature weights $\lambda_1$ and $\lambda_2$ respectively. $\mathrm{lex}(\bar{f}_i|\bar{e}_i)$ and $\mathrm{lex}(\bar{e}_i|\bar{f}_i)$ are lexical weight scores for the phrase pair with weights $\lambda_3$ and $\lambda_4$. These lexical weights for each pair are calculated from the forward and backward word alignment probabilities. The term $\omega$ is the phrase penalty with feature weight $\lambda_5$; the phrase penalty score encourages fewer and longer phrase pairs to be selected. $p_{LM}(e)$ is the language model with weight $\lambda_{LM}$. The phrase-level conditional probabilities, or phrase translation probabilities, can be obtained from the phrase extraction process:

$\phi(\bar{f}|\bar{e}) = \dfrac{\mathrm{count}(\bar{f},\bar{e})}{\sum_{\bar{f}'} \mathrm{count}(\bar{f}',\bar{e})}$  (3)

The lexical weight is applied to check the quality of an extracted phrase pair. For a given phrase pair $(\bar{f},\bar{e})$ with an alignment $a$, the lexical weight is the joint probability of every word alignment. For a source word that aligns to more than one target word, the average probability is used:

$\mathrm{lex}(\bar{f}|\bar{e},a) = \prod_{i=1}^{n} \dfrac{1}{|\{j : (i,j) \in a\}|} \sum_{(i,j) \in a} w(f_i|e_j)$  (4)

where $w(f_i|e_j)$ is the lexical translation probability of the word pair $(f_i, e_j)$ and $n$ is the number of words in the phrase $\bar{f}$.

3) Decoding

The decoder is used to search for the most likely translation according to the source sentence, the phrase translation model and the target language model. The search can be performed by beam search [30].
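Before decoding, the consistency criterion behind steps 1)-5) above (no alignment point may link a word inside the candidate phrase pair to a word outside it) can be sketched as follows. This is a simplification that ignores unaligned boundary words, and the three-word alignment is a hypothetical toy example, not taken from the paper's corpus.

```python
def extract_phrases(n_src, n_tgt, alignment, max_len=4):
    """Enumerate phrase pairs consistent with a word alignment.

    alignment: set of (i, j) pairs linking source position i to target
    position j. Returns spans ((i1, i2), (j1, j2)) over [i1, i2) x [j1, j2).
    """
    pairs = []
    for i1 in range(n_src):
        for i2 in range(i1 + 1, min(i1 + max_len, n_src) + 1):
            # target positions linked to the source span
            js = [j for (i, j) in alignment if i1 <= i < i2]
            if not js:
                continue
            j1, j2 = min(js), max(js) + 1
            # consistency: no target word in [j1, j2) aligns outside [i1, i2)
            if all(i1 <= i < i2 for (i, j) in alignment if j1 <= j < j2):
                pairs.append(((i1, i2), (j1, j2)))
    return pairs

# toy 3-word sentence pair in which the first two words swap order
align = {(0, 1), (1, 0), (2, 2)}
phrases = extract_phrases(3, 3, align)
```

Note that the swapped span ((1, 3), (0, 3)) is rejected because source word 0 aligns into its target side from outside, while the whole swap ((0, 2), (0, 2)) is kept; this is how phrase pairs capture local reordering.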
The main algorithm of beam search starts from an initial hypothesis. The next hypothesis can be expanded from the current one, and it need not cover the next phrase segment of the source sentence. Words on the path of hypothesis expansion are marked. The system produces a translation alternative when a path covers all words. The score of each alternative is calculated and the sentence with the highest score is selected. Techniques such as hypothesis recombination and heuristic pruning can be applied to overcome the exponential size of the search space.

B. Hierarchical Phrase-based Translation (HPBT)

Chiang [13] proposed hierarchical phrase-based translation (HPBT), a statistical machine translation model that uses hierarchical phrases. Hierarchical phrases are defined as phrases consisting of two or more sub-phrases that hierarchically link to each other. To create the hierarchical phrase model, a synchronous context-free grammar (a.k.a. a syntax-directed transduction grammar [31]) is learned from a parallel text without any syntactic annotations. A synchronous CFG derivation begins with a pair of linked start symbols. At each step, two linked non-terminals are rewritten using the two components of a single rule. When links are denoted with boxed indices, the newly introduced symbols are re-indexed apart from the symbols already present. In this work, we follow the implementation described by Chiang [13]. The methodology can be summarized as follows. The elementary structures of a synchronous CFG are rewrite rules with aligned pairs of right-hand sides:

$X \rightarrow \langle \gamma, \alpha, \sim \rangle$  (5)

where $X$ is a non-terminal, $\gamma$ and $\alpha$ are both strings of terminals and non-terminals, and $\sim$ is a one-to-one correspondence between non-terminal occurrences in $\gamma$ and non-terminal occurrences in $\alpha$.

1) Rule Extraction Algorithm

The extraction process begins with a word-aligned corpus: a set of triples $(f, e, A)$, where $f$ is a source sentence, $e$ is a target sentence, and $A$ is a (many-to-many) binary relation between positions of $f$ and positions of $e$. The word alignments are obtained by running GIZA++ [16] on the corpus in both directions and forming the union of the two sets of word alignments. From each word-aligned sentence pair, a set of rules consistent with the word alignments is extracted. This can be listed in two main steps.

1) Identify initial phrase pairs using the same criterion as most phrase-based systems [22], namely, there must be at least one word inside one phrase aligned to a word inside the other, but no word inside one phrase can be aligned to a word outside the other phrase.
2) In order to obtain rules from the phrases, look for phrases that contain other phrases and replace the sub-phrases with non-terminal symbols.
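Step 2), replacing an extracted sub-phrase pair with a linked non-terminal slot, can be sketched on token lists as below. This is a simplification of Chiang's span-based extraction, and the Chinese/Thai-like tokens are invented placeholders.

```python
def subtract_subphrase(src, tgt, sub_src, sub_tgt, index=1):
    """Replace an aligned sub-phrase pair with a linked non-terminal [X].

    src/tgt: token lists of an initial phrase pair; sub_src/sub_tgt: an
    extracted sub-phrase pair contained in them. Returns the resulting
    hierarchical rule as a pair of token lists, or None if not contained.
    """
    def replace(tokens, sub):
        n = len(sub)
        for k in range(len(tokens) - n + 1):
            if tokens[k:k + n] == sub:
                # same boxed index on both sides links the two slots
                return tokens[:k] + [f"[X,{index}]"] + tokens[k + n:]
        return None

    new_src = replace(src, sub_src)
    new_tgt = replace(tgt, sub_tgt)
    if new_src is None or new_tgt is None:
        return None
    return new_src, new_tgt

# hypothetical phrase pair "chi yao" <-> "kin ya" with sub-phrase "yao" <-> "ya"
rule = subtract_subphrase(["chi", "yao"], ["kin", "ya"], ["yao"], ["ya"])
# rule == (["chi", "[X,1]"], ["kin", "[X,1]"])
```

Repeating this subtraction over all nested phrase pairs is what turns flat phrase pairs into the hierarchical rules with [X] slots shown later in Figure 5.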
2) Hierarchical Phrase-based Model

Chiang [13] explains the hierarchical phrase-based model as follows: given a source sentence $f$, a synchronous CFG will have many derivations that yield $f$ on the source side, and therefore many possible target translations. A model over derivations $D$ is therefore defined to predict which translations are more likely than others. Following the log-linear model [32] over derivations $D$, the calculation is:

$P(D) \propto \prod_i \phi_i(D)^{\lambda_i}$  (6)

where the $\phi_i$ are features defined on derivations and the $\lambda_i$ are feature weights. One of the features is an $n$-gram language model $P_{LM}(e)$; the remaining features are defined as products of functions on the rules used in a derivation:

$\phi_i(D) = \prod_{(X \rightarrow \langle \gamma, \alpha \rangle) \in D} \phi_i(X \rightarrow \langle \gamma, \alpha \rangle)$  (7)

Thus we can rewrite $P(D)$ as

$P(D) \propto P_{LM}(e)^{\lambda_{LM}} \times \prod_{i \neq LM} \prod_{(X \rightarrow \langle \gamma, \alpha \rangle) \in D} \phi_i(X \rightarrow \langle \gamma, \alpha \rangle)^{\lambda_i}$  (8)

The factors other than the language model factor can be put into a particularly convenient form. A weighted synchronous CFG is a synchronous CFG together with a function $w$ that assigns weights to rules. This function induces a weight function over derivations:

$w(D) = \prod_{(X \rightarrow \langle \gamma, \alpha \rangle) \in D} w(X \rightarrow \langle \gamma, \alpha \rangle)$  (9)

If we define

$w(X \rightarrow \langle \gamma, \alpha \rangle) = \prod_{i \neq LM} \phi_i(X \rightarrow \langle \gamma, \alpha \rangle)^{\lambda_i}$  (10)

then the probability model becomes

$P(D) \propto P_{LM}(e)^{\lambda_{LM}} \times w(D)$  (11)

3) Training

To estimate the parameters of the phrase translation and lexical-weighting features, frequencies of phrases are needed for the extracted rules. For each sentence pair in the training data, more than one derivation of the sentence pair may use the rules extracted from it. Following Och and others, heuristics are used to hypothesize a distribution of possible rules as though they had been observed in the training data, a distribution that does not necessarily maximize the likelihood of the training data. Och's method [22] gives a count of one to each extracted phrase-pair occurrence. Chiang gives a count of one to each initial phrase-pair occurrence and then distributes its weight equally among the rules obtained by subtracting sub-phrases from it. Treating this distribution as observed data, relative-frequency estimation is used to obtain $P(\gamma|\alpha)$ and $P(\alpha|\gamma)$.
Finally, the parameters of the log-linear model (6) are learned by minimum-error-rate training [33], which tries to set the parameters so as to maximize the BLEU score [34] on a development set. This gives a weighted synchronous CFG according to (6) that is ready to be used by the decoder.

4) Decoding

We applied a CKY parser as the decoder. We also exploited beam search in the post-process for mapping source and target derivations. Given a source sentence $f$, the decoder finds the target yield of the single best derivation whose source yield is $f$:

$\hat{e} = e\Big(\arg\max_{D :\, f(D) = f} w(D)\Big)$  (12)

The decoder finds not only the best derivation for a source sentence but also a list of the k-best derivations. These k-best derivations are utilized in minimum-error-rate training and for rescoring with the language model, and cube pruning [35] is used to reduce the search space.
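The factorization in (9)-(11), where a derivation's score is the product of its rule weights times the weighted language-model factor, can be sketched as follows; the feature values and weights here are illustrative only, not tuned MERT parameters.

```python
import math

def rule_weight(features, lambdas):
    """w(X -> <gamma, alpha>) = product of phi_i(rule) ** lambda_i (cf. eq. 10)."""
    return math.prod(phi ** lam for phi, lam in zip(features, lambdas))

def derivation_prob(rules, lm_prob, lambda_lm):
    """P(D), up to a normalization constant:
    P_LM(e) ** lambda_LM times the product of rule weights (cf. eqs. 9, 11)."""
    return lm_prob ** lambda_lm * math.prod(rule_weight(f, l) for f, l in rules)

# two hypothetical derivations of the same source sentence; each rule carries
# (forward prob, backward prob) features with illustrative weights
lams = (1.0, 0.5)
d1 = derivation_prob([((0.4, 0.6), lams), ((0.5, 0.8), lams)],
                     lm_prob=0.1, lambda_lm=0.5)
d2 = derivation_prob([((0.3, 0.6), lams), ((0.3, 0.8), lams)],
                     lm_prob=0.2, lambda_lm=0.5)
# d1 wins despite its lower language-model probability, because its rule
# weights dominate under these feature weights
```

In a real decoder these products are accumulated in log space inside the CKY chart, so partial scores compose additively as sub-derivations combine.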

Figure 2. A sentence example of Chinese-to-Thai language with the alignment of words

5) Example of the Hierarchical Translation Process

In order to explain the process of hierarchical translation, the translation steps are demonstrated using Thai and Chinese as an example. Figure 2 shows a pair of Chinese and Thai sentences with the word alignment for the reader's understanding. From Figure 2, a synchronous CFG extracted from the parallel corpus is selected according to the given words. With the highest probability for each word, a list of rules is obtained as shown in Figure 3.

Figure 3. The hierarchical rules extracted from our example sentence

In Figure 3, $X$ on the left-hand side is a non-terminal and $S$ is the start symbol. The right-hand side contains two sets of CFG rules separated by a comma. The left set contains the terminal and non-terminal rules of the source language, Chinese in this example, while the other side contains the Thai rules. These hierarchical rules are utilized in derivation within the decoding process. Following the example in Figure 3, the derivation of the synchronous CFG shown in Figure 4 proceeds in a top-down manner by expanding from the top non-terminal node through the immediate child non-terminal nodes until reaching a terminal node at the leftmost position. A number above an arrow in Figure 4 refers to a rule number in Figure 3. From the actual data, examples of the rule table of HPBT and the phrase table of PBT obtained from training on the Thai-Chinese parallel corpus are shown in Figure 5 and Figure 6, respectively. The difference between Figure 5 and Figure 6 is the [X] notation in Figure 5, which indicates a slot for another word or phrase to derive as a tree.

IV. EXPERIMENT

A.
Data Preparation

To experiment with Thai-Chinese translation, a parallel corpus was gathered from two sources: BTEC (Basic Travel Expression Corpus) [36] and the HIT London Olympic Corpus [37]. The former and the latter consist of 26,544 and 62,733 English-Chinese sentence pairs, respectively. All English sentences in both corpora were carefully translated into Thai by professional linguists and translators. In total, we obtain a Thai-Chinese parallel corpus of 89,277 sentence pairs. In preprocessing, Chinese sentences were word-segmented using the Stanford Chinese Word Segmentation Tool [38], while Thai sentences were segmented using SWATH [39], since neither language has a reliable explicit word boundary. We manually selected 877 sentence pairs as a development set and randomly chose 1,000 sentence pairs as a test set. The remaining sentence pairs were used as the training data set.

B. Experiment Setting

This work aims to compare the quality of Thai-Chinese SMT between the phrase-based translation (PBT) approach and the hierarchical phrase-based translation (HPBT) approach. The language modeling tool SRILM [40] was used to generate 3-gram and 5-gram language models of Chinese and Thai. Moses [2] was chosen to perform phrase extraction, rule-table generation and decoding, and its minimum-error-rate training (MERT) function was applied for tuning the weights of both models. The outputs of Moses for HPBT and PBT are a rule table and a phrase table, respectively.

Figure 4. A derivational process of translation of Figure 2 with rules from Figure 3

Figure 5. An example of rule table from HPBT of Thai-to-Chinese

Figure 6. An example of phrase table from PBT of Thai-to-Chinese

Examples of both tables are given in Figure 5 and Figure 6. The difference between the tables is that the HPBT rule table includes translations of terminal and non-terminal nodes to clarify the hierarchy, while the PBT phrase table lists translation pairs of phrases with word order.

C. Results and Discussion

We evaluate the system in both directions, Chinese-to-Thai and Thai-to-Chinese. Table I shows the experimental results on translation accuracy in terms of BLEU score [34]. The evaluation covers 3-gram PBT, 5-gram PBT and 3-gram HPBT. From the BLEU results shown in Table I, the best result is obtained with 3-gram HPBT in every test.

TABLE I. THAI-CHINESE TRANSLATION EXPERIMENT RESULT (BLEU SCORE)

Source-to-target language | PBT 3-gram | PBT 5-gram | HPBT 3-gram
Chinese-to-Thai           |            |            |
Thai-to-Chinese           |            |            |

For Chinese-to-Thai translation, 3-gram HPBT outperforms 3-gram PBT by about 3.9 BLEU points, and 3-gram HPBT and 5-gram PBT are approximately equal.
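BLEU [34], used for all results in Table I, combines clipped n-gram precisions with a brevity penalty. A minimal single-reference sketch follows (with add-one smoothing, a simplification of the corpus-level metric reported by standard tools):

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """Sentence-level, single-reference BLEU: geometric mean of modified
    n-gram precisions times a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(tuple(candidate[i:i + n])
                       for i in range(len(candidate) - n + 1))
        ref = Counter(tuple(reference[i:i + n])
                      for i in range(len(reference) - n + 1))
        overlap = sum(min(c, ref[g]) for g, c in cand.items())  # clipped counts
        total = sum(cand.values())
        precisions.append((overlap + 1) / (total + 1))  # add-one smoothing
    # brevity penalty punishes candidates shorter than the reference
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

reference = "the cat sat on the mat".split()
perfect = bleu(reference, reference)              # identical output scores 1.0
close = bleu("the cat sat on a mat".split(), reference)
```

A single substituted word costs every n-gram order that crosses it, which is why the "BLEU point" differences quoted below correspond to quite visible changes in output quality.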
In the case of Thai-to-Chinese, 3-gram PBT returns the lowest result, while 3-gram HPBT gains the best BLEU score, beating 3-gram PBT and 5-gram PBT by 2.2 and 1.3 BLEU points, respectively. From the viewpoint of translation direction, Chinese-to-Thai translation is obviously better than Thai-to-Chinese translation. Since the Chinese-to-Thai results of the 3-gram HPBT model and the 5-gram PBT model are only slightly different, it is better to focus on the 3-gram HPBT model: with a smaller n-gram order, the generated rules are much smaller and the required corpus size does not have to cover the sparseness of the surrounding words.

V. CONCLUSION AND FUTURE WORK

In this work, we studied applying the 3-gram PBT model, the 5-gram PBT model and the 3-gram HPBT model to translate Thai-to-Chinese and Chinese-to-Thai. By comparing the results, we found that 3-gram HPBT shows potential for translating in both directions, since its BLEU scores are the best for Thai-to-Chinese translation. In the case of Chinese-to-Thai, the 3-gram HPBT model and the 5-gram PBT model return approximately equal BLEU results, which are greater than the 3-gram PBT model by about 4 BLEU points. From the experiment, results on Chinese-to-Thai are obviously better than Thai-to-Chinese results. To improve this work, we plan to add a little linguistic information to the training data to reduce the currently large number of synchronous CFG rules. Moreover, we plan to experiment with the 3-gram HPBT model on different Thai sentence lengths to separately study the accuracy ratio based on sentence length, since Thai sentences are naturally long. To cover all available SMT approaches, a tree-to-string model will be tested on Chinese-to-Thai. Lastly, the English-Thai language pair will be tested with HPBT.

ACKNOWLEDGEMENT

The authors would like to thank the Office of the Higher Education Commission, Thailand, for funding support under the program Strategic Scholarships for Frontier Research Network for the Ph.D. Program. Prasert Luekhong also thanks the Graduate School, Chiang Mai University, Thailand and Rajamangala University of Technology Lanna, Thailand for their funding. Prasert Luekhong is grateful to Dr.
Liu Qun for the opportunity to be a visiting researcher at the ICT Natural Language Processing Research Group, Chinese Academy of Sciences, Beijing, China.

REFERENCES

[1] P. Koehn, Statistical Machine Translation. Cambridge University Press, 2010.
[2] P. Koehn et al., Moses: Open source toolkit for statistical machine translation, in Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, 2007.
[3] D. Cer, M. Galley, and D. Jurafsky, Phrasal: A toolkit for statistical machine translation with facilities for extraction and incorporation of arbitrary model features, in Proceedings of the NAACL, pp. 9-12.
[4] C. Dyer, J. Weese, H. Setiawan, and A. Lopez, cdec: A decoder, alignment, and learning framework for finite-state and context-free translation models, in Proceedings of the ACL, 2010.
[5] L. Schwartz, W. Thornton, and J. Weese, Joshua: An open source toolkit for parsing-based machine translation, Machine Translation.
[6] D. Vilar, D. Stein, and M. Huck, Jane: Open source hierarchical translation, extended with reordering and lexicon models, in Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR (WMT 2010), 2010.
[7] Matrix Euro. [Online]. Available: [Accessed: 29-May-2012].
[8] M. Liberman and C. Cieri, The creation, distribution and use of linguistic data: the case of the Linguistic Data Consortium, in Proceedings of the 1st International Conference on Language Resources and Evaluation (LREC).
[9] P. Koehn, Europarl: A parallel corpus for statistical machine translation, in MT Summit, 2005, vol. 11.
[10] R. Steinberger, B. Pouliquen, and A. Widiger, The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages, arXiv preprint, vol. 4, no. 1.
[11] M. V. Yazdchi and H.
Faili, Generating English-Persian parallel corpus using an automatic anchor finding sentence aligner, in Natural Language Processing and Knowledge Engineering (NLP-KE), 2010 International Conference on, 2010.
[12] P. Porkaew and T. Ruangrajitpakorn, Translation of noun phrases from English to Thai using phrase-based SMT with CCG reordering rules, in Design,
[13] D. Chiang, Hierarchical phrase-based translation, Computational Linguistics, vol. 33, no. 2, Jun. 2007.
[14] M. Huck, M. Ratajczak, P. Lehnen, and H. Ney, A comparison of various types of extended lexicon models for statistical machine translation, in Conf. of the Assoc. for Machine Translation in the Americas (AMTA), Denver, CO.
[15] P. F. Brown, V. J. D. Pietra, S. A. D. Pietra, and R. L. Mercer, The mathematics of statistical machine translation: parameter estimation, Computational Linguistics, vol. 19, no. 2, 1993.
[16] F. J. Och and H. Ney, GIZA++: Training of statistical translation models. Internal report, RWTH Aachen University.
[17] S. Vogel, H. Ney, and C. Tillmann, HMM-based word alignment in statistical translation, in Proceedings of the 16th Conference on Computational Linguistics, vol. 2, p. 836.
[18] F. J. Och and H. Weber, Improving statistical natural language translation with categories and rules, in Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics.
[19] F. J. Och, C. Tillmann, and H. Ney, Improved alignment models for statistical machine translation, in Proc. of the Joint SIGDAT Conf. on Empirical Methods in Natural Language Processing and Very Large Corpora, 1999.
[20] F. J. Och, Statistical machine translation: from single-word models to alignment templates.
[21] P. Koehn, F. J. Och, and D.
Marcu, Statistical phrase-based translation, in Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, vol. 1, 2003.
[22] F. J. Och and H. Ney, The alignment template approach to statistical machine translation, Computational Linguistics, vol. 30, no. 4, Dec. 2004.
[23] S. M. Shieber and Y. Schabes, Generation and synchronous tree-adjoining grammars, Computational Intelligence, vol. 7, no. 4, Nov.
[24] D. Chiang and K. Knight, An introduction to synchronous grammars, tutorial at ACL-06, pp. 1-16.
[25] P. Blunsom, T. Cohn, and M. Osborne, Bayesian synchronous grammar induction, in Advances in Neural Information Processing Systems, vol. 21, 2009.
[26] M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini, Building a large annotated corpus of English: The Penn Treebank, Computational Linguistics, vol. 19, no. 2, 1993.
[27] Y. Liu, Q. Liu, and S. Lin, Tree-to-string alignment template for statistical machine translation, in Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, 2006.
[28] L. Huang and H. Mi, Efficient incremental decoding for tree-to-string translation, in Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, 2010.

[29] The UOT System: Improve String-to-Tree Translation Using Head-Driven Phrase Structure Grammar and Predicate-Argument Structures, in mt-archive.info, 2009.
[30] P. Koehn, Pharaoh: a beam search decoder for phrase-based statistical machine translation models, in Machine Translation: From Real Users to Research.
[31] P. M. Lewis II and R. E. Stearns, Syntax-directed transduction, Journal of the ACM, vol. 15, no. 3, 1968.
[32] F. J. Och and H. Ney, Discriminative training and maximum entropy models for statistical machine translation, in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002.
[33] F. J. Och, Minimum error rate training in statistical machine translation, in Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, vol. 1, 2003.
[34] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, BLEU: a method for automatic evaluation of machine translation, in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002.
[35] Y. Feng, H. Mi, Y. Liu, and Q. Liu, An efficient shift-reduce decoding algorithm for phrase-based machine translation, in Proceedings of the 23rd International Conference on Computational Linguistics: Posters, 2010.
[36] BTEC Task, International Workshop on Spoken Language Translation. [Online]. Available: [Accessed: 27-May-2012].
[37] M. Yang, H. Jiang, and T. Zhao, Construct trilingual parallel corpus on demand, in Chinese Spoken Language Processing, 2006.
[38] P. Chang, M. Galley, and C. D. Manning, Optimizing Chinese word segmentation for machine translation performance, in Proceedings of the Third Workshop on Statistical Machine Translation, 2008.
[39] P. Charoenpornsawat, SWATH: Smart Word Analysis for Thai.
[40] A. Stolcke, SRILM - an extensible language modeling toolkit, in Seventh International Conference on Spoken Language Processing.


More information

Deep Neural Network Language Models

Deep Neural Network Language Models Deep Neural Network Language Models Ebru Arısoy, Tara N. Sainath, Brian Kingsbury, Bhuvana Ramabhadran IBM T.J. Watson Research Center Yorktown Heights, NY, 10598, USA {earisoy, tsainath, bedk, bhuvana}@us.ibm.com

More information

Regression for Sentence-Level MT Evaluation with Pseudo References

Regression for Sentence-Level MT Evaluation with Pseudo References Regression for Sentence-Level MT Evaluation with Pseudo References Joshua S. Albrecht and Rebecca Hwa Department of Computer Science University of Pittsburgh {jsa8,hwa}@cs.pitt.edu Abstract Many automatic

More information

Context Free Grammars. Many slides from Michael Collins

Context Free Grammars. Many slides from Michael Collins Context Free Grammars Many slides from Michael Collins Overview I An introduction to the parsing problem I Context free grammars I A brief(!) sketch of the syntax of English I Examples of ambiguous structures

More information

Impact of Controlled Language on Translation Quality and Post-editing in a Statistical Machine Translation Environment

Impact of Controlled Language on Translation Quality and Post-editing in a Statistical Machine Translation Environment Impact of Controlled Language on Translation Quality and Post-editing in a Statistical Machine Translation Environment Takako Aikawa, Lee Schwartz, Ronit King Mo Corston-Oliver Carmen Lozano Microsoft

More information

COMPUTATIONAL COMPLEXITY OF LEFT-ASSOCIATIVE GRAMMAR

COMPUTATIONAL COMPLEXITY OF LEFT-ASSOCIATIVE GRAMMAR COMPUTATIONAL COMPLEXITY OF LEFT-ASSOCIATIVE GRAMMAR ROLAND HAUSSER Institut für Deutsche Philologie Ludwig-Maximilians Universität München München, West Germany 1. CHOICE OF A PRIMITIVE OPERATION The

More information

Discriminative Learning of Beam-Search Heuristics for Planning

Discriminative Learning of Beam-Search Heuristics for Planning Discriminative Learning of Beam-Search Heuristics for Planning Yuehua Xu School of EECS Oregon State University Corvallis,OR 97331 xuyu@eecs.oregonstate.edu Alan Fern School of EECS Oregon State University

More information

PREDICTING SPEECH RECOGNITION CONFIDENCE USING DEEP LEARNING WITH WORD IDENTITY AND SCORE FEATURES

PREDICTING SPEECH RECOGNITION CONFIDENCE USING DEEP LEARNING WITH WORD IDENTITY AND SCORE FEATURES PREDICTING SPEECH RECOGNITION CONFIDENCE USING DEEP LEARNING WITH WORD IDENTITY AND SCORE FEATURES Po-Sen Huang, Kshitiz Kumar, Chaojun Liu, Yifan Gong, Li Deng Department of Electrical and Computer Engineering,

More information

Unsupervised Learning of Word Semantic Embedding using the Deep Structured Semantic Model

Unsupervised Learning of Word Semantic Embedding using the Deep Structured Semantic Model Unsupervised Learning of Word Semantic Embedding using the Deep Structured Semantic Model Xinying Song, Xiaodong He, Jianfeng Gao, Li Deng Microsoft Research, One Microsoft Way, Redmond, WA 98052, U.S.A.

More information