System Description of NiCT-ATR SMT for NTCIR-7

Keiji Yasuda, Andrew Finch, Hideo Okuma, Masao Utiyama, Hirofumi Yamamoto, Eiichiro Sumita
National Institute of Information and Communications Technology
2-2-2 Hikaridai, Keihanna Science City, 619-0228, Japan
{keiji.yasuda, andrew.finch, hideo.okuma, mutiyama, eiichiro.sumita}@nict.go.jp

ATR Spoken Language Research Laboratories
2-2-2 Hikaridai, Keihanna Science City, 619-0228, Japan

Kinki University, School of Science and Engineering, Department of Information
3-4-1 Kowakae, Higashiosaka City, Osaka, 577-8502, Japan
yama@info.kindai.ac.jp

Abstract

In this paper we propose a method to improve SMT-based patent translation. The method first employs the International Patent Classification (IPC) to build class-based models; multiple models are then interpolated by a weighting method that employs source-side language models. We carried out experiments using data from the patent translation task of the NTCIR-7 workshop. According to the experimental results, the proposed method improved most of the automatic scores, namely NIST, WER, and PER. The results also show a BLEU score degradation for the proposed method; however, a statistical test by bootstrapping does not show that this degradation is significant.

Keywords: IPC, domain adaptation

1 Introduction

Current machine translation (MT) research shows the effectiveness of corpus-based machine translation frameworks [6]. An MT system using such a framework, e.g., Statistical Machine Translation (SMT) [1], is considered a convenient technology because system building is rapid and largely automated. For SMT research, parallel corpora are among the most important components. Two factors of a parallel corpus mainly contribute to system performance: the first is its quality, and the second is its quantity. A parallel corpus whose statistical characteristics are similar to those of the target domain should yield more effective models, whereas domain-mismatched training data might reduce the models' performance. A sufficiently large parallel corpus also alleviates the data sparseness problem of model training. Meanwhile, from a commercial point of view, it is more important to build the MT systems that consumers demand, considering the language pair and the target domain of MT use, than to follow parallel corpus availability.

Considering all of the points mentioned above, Japanese-English patent translation is one of the most interesting SMT research fields. A large Japanese-English patent parallel corpus has just been released [13]. Commercial demand is also very high for both directions of Japanese-English patent translation. Additionally, an SMT evaluation campaign using the parallel corpus is ongoing [3]; this boosts related technology and facilitates information exchange.

The research presented in this paper deals with Japanese-to-English patent translation. The proposed method uses the International Patent Classification (IPC) to improve the SMT-based patent translation system. The IPC is used to build class-based models; multiple class-based models and a general model are then interpolated by a weighting method employing source-side language models.

Section 2 explains the IPC. Section 3 describes the method of using IPC information for SMT. Section 4 details the experimental setting and results. Section 5 compares the proposed method to related work. Section 6 concludes the paper.

2 International Patent Classification

The International Patent Classification (IPC), established by the Strasbourg Agreement 1971, provides for

[Figure 1. Framework of the proposed method: the corpus is divided by IPC label into subcorpora; a general model and per-section IPC models (A through H) feed the CleopATRa decoder.]

a hierarchical system of language-independent symbols for the classification of patents and utility models according to the different areas of technology to which they pertain. Each section has a title and a symbol. The title consists of one or more words, and the symbol is a capital letter of the Roman alphabet. The sections are as follows:

A Human Necessities
B Performing Operations; Transporting
C Chemistry; Metallurgy
D Textiles; Paper
E Fixed Constructions
F Mechanical Engineering; Lighting; Heating; Weapons; Blasting
G Physics
H Electricity

In this research, we use only the above information from the IPC, i.e., the top layer of the IPC hierarchy.

3 Proposed Method

Figure 1 shows the flow of the proposed method. First, we train general models using all of the available parallel corpus. Three kinds of models are trained: a translation model, a target-side language model, and a source-side language model. Secondly, we divide the original patent parallel corpus into eight subcorpora using the IPC label, and the same three models (IPC-based models) are trained for each subcorpus. The source-side language models are used to calculate weights for the general and IPC-based models; the detailed weighting formulas are given in Section 4.2.

4 Experiments

4.1 Phrase-based SMT

We employed a log-linear model as the phrase-based statistical machine translation framework [5]. This model expresses the probability of a target-language word sequence $e$ given a source-language word sequence $f$ as:

  P(e \mid f) = \frac{\exp\left(\sum_{i=1}^{M} \lambda_i h_i(e, f)\right)}{\sum_{e'} \exp\left(\sum_{i=1}^{M} \lambda_i h_i(e', f)\right)}    (1)

where $h_i(e, f)$ is a feature function, such as the translation model or the language model, $\lambda_i$ is the feature function's weight, and $M$ is the number of features. $\lambda_i$ is tuned using the Minimum Error Rate Training (MERT) algorithm [9] on a development set. We can approximate Eq. 1 by regarding its denominator as constant.
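The log-linear decision rule of Section 4.1 can be sketched in a few lines of Python. This is a minimal illustration, not the paper's decoder: the feature functions below are hypothetical toy stand-ins for the real translation-model and language-model log-probabilities, and the normalizer of Eq. 1 is approximated over a small candidate list.

```python
import math

def score(e, f, features, weights):
    # Unnormalized log-linear score: exp(sum_i lambda_i * h_i(e, f)).
    return math.exp(sum(lam * h(e, f) for lam, h in zip(weights, features)))

def prob(e, f, candidates, features, weights):
    # Eq. 1, with the denominator approximated over an n-best candidate
    # list rather than all possible target sentences.
    z = sum(score(c, f, features, weights) for c in candidates)
    return score(e, f, features, weights) / z

def decode(f, candidates, features, weights):
    # The denominator of Eq. 1 is constant in e, so the best translation
    # simply maximizes the unnormalized score (the arg max of Eq. 2).
    return max(candidates, key=lambda e: score(e, f, features, weights))

# Toy feature functions standing in for the real models (hypothetical,
# for illustration only):
h_len = lambda e, f: -abs(len(e) - len(f))  # length-match proxy for the TM
h_brev = lambda e, f: -0.1 * len(e)         # brevity proxy for the LM
feats, lams = [h_len, h_brev], [0.7, 0.3]

cands = [["a", "b"], ["a", "b", "c"]]
best = decode(["x", "y"], cands, feats, lams)  # picks ["a", "b"]
```

Note that because the normalizer cancels in the arg max, a real decoder never computes `prob` during search; it is shown only to make Eq. 1 concrete.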
The translation result $\hat{e}$ is then obtained by

  \hat{e}(f) = \operatorname*{arg\,max}_{e} \exp\left(\sum_{i=1}^{M} \lambda_i h_i(e, f)\right)    (2)

4.2 Weight Calculation

As mentioned in Section 3, features (models) are trained for each IPC section, and $\lambda_i$ is also tuned for each IPC section. In addition to the feature weights $\lambda_i$ in Eq. 2, we also need to compute a weight $\mu_j$ for each IPC-based model for each given input sentence. Using $\mu$, Eq. 2 is reformulated as follows:

  \hat{e}(f) = \operatorname*{arg\,max}_{e} \exp\left(\sum_{i=1}^{M} \sum_{j \in \mathrm{IPC}} \lambda_{i,j}\, \mu_j\, h_{i,j}(e, f)\right)    (3)

where

  \mathrm{IPC} = \{A, B, C, D, E, F, G, H, \mathrm{General}\}    (4)

and $\mu_j$ is calculated by the following formula:

  \mu_j = \frac{P(j \mid S_{\mathrm{input}})}{\sum_{k \in \mathrm{IPC}} P(k \mid S_{\mathrm{input}})}    (5)

Here, $P(j \mid S_{\mathrm{input}})$ is the probability that the input sentence $S_{\mathrm{input}}$ belongs to IPC section $j$. Using the source-side language model of IPC section $j$, $P(j \mid S_{\mathrm{input}})$ is calculated as:

  P(j \mid S_{\mathrm{input}}) = \frac{P(S_{\mathrm{input}} \mid j)\, P(j)}{P(S_{\mathrm{input}})}    (6)

where $P(S_{\mathrm{input}} \mid j)$ is the sentence probability of the input sentence under the source-side language model of IPC section $j$, and $P(j)$ is the category prior, i.e., the ratio of sentences in the training subcorpus of IPC section $j$ to the full-sized corpus. The values of $\mu$ are calculated for each input; our in-house decoder, CleopATRa, can handle multiple models while changing $\mu$ sentence by sentence.

4.3 Experimental Setting

We used the training set of the NTCIR-7 workshop patent translation task [3] for the experiments. A development set and a test set were also provided by the workshop. Details of the data are shown in Table 1.
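The per-sentence weight computation of Section 4.2 (Eqs. 5 and 6) can be sketched as follows. This is a minimal sketch under the assumption that per-section source-side language-model log-probabilities have already been computed elsewhere; the input dictionaries and their values are hypothetical.

```python
import math

def section_weights(loglik, prior):
    # Eqs. 5-6: mu_j is proportional to P(S_input | j) * P(j); the
    # denominator P(S_input) of Eq. 6 cancels in the normalization of
    # Eq. 5, so it never has to be computed explicitly.
    #   loglik[j] -- log P(S_input | j) under section j's source-side LM
    #                (hypothetical precomputed values)
    #   prior[j]  -- P(j), the share of training sentences in section j
    unnorm = {j: math.exp(loglik[j]) * prior[j] for j in loglik}
    z = sum(unnorm.values())
    return {j: u / z for j, u in unnorm.items()}

# Toy example with two of the nine model classes of Eq. 4: the class
# whose LM assigns the input the higher probability gets the larger
# interpolation weight.
mu = section_weights({"A": -10.0, "General": -12.0},
                     {"A": 0.4, "General": 0.6})
```

In practice `loglik` would hold one entry for each of the nine classes in Eq. 4, and the resulting `mu` values would scale the per-class feature functions of Eq. 3 for that one input sentence.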

Table 1. Details of data for the experiments (# of sentences)

For the statistical machine translation experiments, we segmented Japanese words using the Japanese morphological analyzer ChaSen [7]. We then used the preprocessed data to train the phrase-based translation model with GIZA++ [10] and the Moses toolkit [4]. The source-side and target-side language models were trained with the SRI language modeling toolkit [12]; the language model configuration is a modified Kneser-Ney [2] 4-gram.

4.4 Experimental Results

Table 2 shows the evaluation results of the baseline and the proposed method. The evaluation uses several automatic metrics: BLEU [11], NIST [8], WER, and PER. In the table, the better score is underlined. As shown in the table, the BLEU score indicates a degradation of MT performance with the proposed method, while all of the other metrics show an improvement. To test the significance of these score differences, we carried out MT evaluation bootstrapping [14] with 1000 resamples. In the table, where there is a significant difference between the baseline and the proposed method, boldface marks the significantly better score. The BLEU score degradation caused by the proposed method is not significant, whereas the improvements in all of the other scores are significant.

5 Discussion

The Japanese-English patent parallel corpus was only released in 2007 [13], so there is not yet much SMT-related research on English-to-Japanese patent translation. However, [13] carried out experiments using the aforementioned corpus and IPC information. Their experiments simply used the training corpus in the same IPC section as the input sentence. According to their results, this method causes a degradation of the BLEU score compared to a baseline method that is similar to ours.
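The bootstrap significance test of Section 4.4 can be sketched as below. This is a simplified illustration, not the exact procedure of [14]: the paper resamples the test set and recomputes corpus-level metrics such as BLEU, whereas this sketch assumes precomputed per-sentence scores and bootstraps their mean difference.

```python
import random

def bootstrap_diff_interval(baseline, proposed, n_resamples=1000,
                            alpha=0.05, seed=0):
    # Paired bootstrap over per-sentence scores: resample the test set
    # with replacement, record the mean score difference each time, and
    # read a (1 - alpha) confidence interval off the sorted differences.
    rng = random.Random(seed)
    n = len(baseline)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(proposed[i] - baseline[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int(n_resamples * alpha / 2)]
    hi = diffs[int(n_resamples * (1 - alpha / 2)) - 1]
    # The difference is significant at level alpha if the interval
    # excludes zero.
    return lo, hi
```

With 1000 resamples, as in the experiments, a 95% interval that straddles zero (as for the BLEU difference here) means the observed degradation cannot be distinguished from sampling noise.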
They used only the BLEU score for their evaluation, and their score degradation is around 0.78% to 2.02%, which is larger than ours (0.33%). The baseline performance of their experiments is also lower than that of ours, so the relative score degradation of their method is even higher. Considering this point, our method appears to work better than theirs.

6 Conclusions

We proposed a method of using IPC information for SMT-based patent translation. The method uses IPC section information to train class-based translation and language models; multiple class-based models and a general model are then interpolated by a weighting method employing source-side language models. We carried out experiments using data from the NTCIR-7 workshop patent translation task. The experimental results indicate that our method improved most of the automatic scores, namely NIST, WER, and PER. Although the proposed method caused a BLEU score degradation, a statistical test using a bootstrapping method does not show a significant difference.

References

[1] P. F. Brown, S. A. D. Pietra, V. J. D. Pietra, and R. L. Mercer. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263-311, 1993.
[2] S. F. Chen and J. Goodman. An empirical study of smoothing techniques for language modeling. Technical Report TR-10-98, Center for Research in Computing Technology, Harvard University, 1998.
[3] A. Fujii, M. Utiyama, M. Yamamoto, and T. Utsuro. Overview of the Patent Translation Task at the NTCIR-7 Workshop. In Proc. of NTCIR-7, 2008.
[4] P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst. Moses: Open Source Toolkit for Statistical Machine Translation. In Proc. of the Annual Meeting of the Association for Computational Linguistics (ACL), demonstration session, 2007.
[5] P. Koehn, F. J. Och, and D. Marcu. Statistical Phrase-Based Translation. In Proc. of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL), pages 127-133, 2003.

Table 2. Evaluation results by automatic metrics

[6] M. Nagao. A framework of a mechanical translation between Japanese and English by analogy principle. In Proc. of the International NATO Symposium on Artificial and Human Intelligence, 1981.
[7] NAIST. ChaSen, 2008. http://chasen-legacy.sourceforge.jp/.
[8] NIST. Automatic Evaluation of Machine Translation Quality Using N-gram Co-Occurrence Statistics, 2002. http://www.nist.gov/speech/tests/mt/mt2001/resource/.
[9] F. J. Och. Minimum Error Rate Training in Statistical Machine Translation. In Proc. of the 41st Annual Meeting of the Association for Computational Linguistics, pages 160-167, 2003.
[10] F. J. Och and H. Ney. A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29(1):19-51, 2003.
[11] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. BLEU: a Method for Automatic Evaluation of Machine Translation. In Proc. of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pages 311-318, 2002.
[12] A. Stolcke. SRILM - An Extensible Language Modeling Toolkit. In Proc. of the International Conference on Spoken Language Processing, 2002.
[13] M. Utiyama and H. Isahara. A Japanese-English Patent Parallel Corpus. In Proc. of MT Summit XI, 2007.
[14] Y. Zhang and S. Vogel. Measuring Confidence Intervals for the Machine Translation Evaluation Metrics. In Proc. of the 10th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI), 2004.