Word Alignment Annotation in a Japanese-Chinese Parallel Corpus

Yujie Zhang, Zhulong Wang, Kiyotaka Uchimoto, Qing Ma, Hitoshi Isahara
National Institute of Information and Communications Technology
3-5 Hikaridai, Seika-cho, Soraku-gun, Kyoto, Japan, 619-0289
E-mail: {yujie.zhang, uchimoto, isahara}@nict.go.jp, wangzhulong@cn.fujitsu.com, qma@math.ryukoku.ac.jp

Abstract

Parallel corpora are critical resources for machine translation research and development because they contain translation equivalences of various granularities. Manually annotated word alignments provide a gold standard for developing and evaluating both example-based and statistical machine translation models. This paper presents the word alignment annotation of the NICT Japanese-Chinese parallel corpus, which is being constructed at the National Institute of Information and Communications Technology (NICT). We describe the specification of the word alignment annotation and the tools developed specifically for the manual annotation. Manual annotation of 17,000 sentence pairs has been completed. We examined the manually annotated word alignment data and extracted translation knowledge from the word-aligned corpus.

1. Introduction

Parallel corpora contain translation equivalences of various granularities and are therefore critical resources for machine translation research and development. Manually annotated word alignments provide a gold standard for developing and evaluating both example-based and statistical machine translation models. Because parallel corpora for Asian languages are less developed, the National Institute of Information and Communications Technology (NICT) started a multilingual corpus construction project in 2002 that focuses on Asian language pairs (Uchimoto et al., 2004).
The project emphasizes detailed annotation, including syntactic structure and alignment at the word and phrase levels. We call the resulting corpora the NICT Multilingual Corpora. This paper presents the word alignment annotation of the Japanese-Chinese parallel corpus, one of the NICT Multilingual Corpora. We describe the specification for manual annotation and the tools developed specifically for it. We also report the experience we gained and the points to which we paid special attention, in order to share them with other researchers engaged in parallel corpus construction. To the best of our knowledge, this is the first Japanese-Chinese parallel corpus annotated with such detailed information. It will provide material for investigating the characteristics of translation from Japanese to Chinese.

2. NICT Japanese-Chinese Parallel Corpus

The NICT Japanese-Chinese parallel corpus consists of original Japanese text and its Chinese translation (Zhang et al., 2005-b). The original data come from newspaper articles and journals, such as the Mainichi Newspaper. The original articles were translated by skilled translators. All articles of one domain were assigned to the same translator to keep the Chinese terminology consistent. The Chinese translations were then revised by other translators and finally by native speakers of Chinese. Each article was translated sentence by sentence, so the resulting parallel corpus is already sentence aligned. On the Japanese side, morphological and syntactic structure information has been annotated following the specification of the Corpus of Spontaneous Japanese (Maekawa et al., 2000). On the Chinese side, word segmentation and parts of speech have been annotated following the specification of Peking University (Yu, 1997). The details of the corpus are listed in Table 1.
                           Japanese     Chinese
Sentences                    38,383      38,383
Words                       947,066     877,859
Vocabulary                   36,657      33,425
Singletons                   15,036      13,238
Average sentence length        24.7        22.9

Table 1. Characteristics of the NICT Japanese-Chinese Parallel Corpus.

3. Tool for Word Alignment Annotation

We designed and developed a tool for manual word alignment annotation after surveying earlier work on word alignment annotation (Melamed, 2001; LDC, 2006). Our design goals were as follows.

(a) Because automatic alignment technologies are available, annotation here means manually revising automatically obtained alignments. We use the multi-aligner developed by Zhang et al. (2005-a), which consists of a lexical-knowledge-based aligner and GIZA++ (Och & Ney, 2000) applied in both the Chinese-to-Japanese and the Japanese-to-Chinese directions. In evaluation, the multi-aligner achieved 63% recall and 79% precision. To make use of these results, the tool should be able to display the output of the multi-aligner.
(b) The tool should provide a visual interface that lets annotators easily revise the automatically obtained results.
(c) The tool should be able to display larger syntactic units in addition to words when syntactic structure information is available. Annotators can then select larger syntactic units and align them effectively and conveniently, so that translation equivalences between larger units or between syntactic structures can also be annotated with the tool.

Figure 1. Visualized interface of the tool for manual alignment annotation (word level).

Figure 2. Visualized interface of the tool for manual alignment annotation (phrase level).
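The recall and precision figures quoted for the multi-aligner can be made concrete with a small sketch. The paper does not state how the two measures were computed; the code below assumes the common link-level definition, in which an alignment is a set of (Chinese index, Japanese index) links and a predicted link counts as correct only if it also appears in the gold standard. The function names and the toy link sets are hypothetical. The F-measure is not reported in the paper; it is derived here from the reported figures only for illustration.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def alignment_scores(predicted, gold):
    """Return (precision, recall, F1) for two collections of alignment links.

    Each link is assumed to be a (source_index, target_index) pair; this
    link-level definition is an assumption, not taken from the paper.
    """
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall, f_measure(precision, recall)


# Toy data (hypothetical): four gold links, three predicted, two correct.
gold_links = {(0, 0), (1, 2), (2, 1), (3, 3)}
auto_links = {(0, 0), (1, 2), (2, 3)}
print(alignment_scores(auto_links, gold_links))

# F-measure implied by the reported 79% precision and 63% recall (roughly 0.70).
print(round(f_measure(0.79, 0.63), 3))
```

Under these assumptions, an annotator who starts from the automatic output mainly has to add the missing links (lower recall) rather than delete wrong ones (higher precision).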

3.1 Visualized Interface

The visualized interface of our tool is shown in Figure 1 and Figure 2. In Figure 1, the left area displays the ID list of the input sentence pairs. On the right, the upper and lower areas display the syntactic structures of the Chinese sentence and the Japanese sentence, respectively. The middle area is the operation area, where alignments are displayed and revised. Each word of the Chinese sentence and of the Japanese sentence is displayed on its own quadrilateral button. Annotators click the buttons to select words for annotation. Lines connecting Chinese words and Japanese words represent the alignments between them, and different alignments are drawn in different colours. The buttons of unaligned words are displayed in yellow and those of aligned words in grey; once a word is selected and aligned, its button changes from yellow to grey.

To label an alignment, the annotator selects words on the Chinese (Japanese) side and then the corresponding words on the Japanese (Chinese) side by clicking the left mouse button. The selected Chinese and Japanese words are then linked by lines. If the alignment is one-to-one, a single line connects the two words. If the alignment is many-to-many, i.e. multiple Chinese words are aligned to multiple Japanese words, the lines from every word on both sides meet at a single point in the middle area. See the 2-to-5 alignment in Figure 1, 不仅 ... 也 to ばかりではなく ... も (not only ... but also). This presentation differs from the tool proposed by Melamed (2001), in which each word of the group on one side is linked to each word of the group on the other side, producing many redundant lines.
Strictly speaking, a correspondence between two groups of words does not mean that each word of one group corresponds to every word of the other group. In our tool, the lines from the words of one group first converge at one point and then radiate from that point to the words of the other group. One-to-many and many-to-one alignments are special cases of many-to-many alignments.

To add a word to an alignment, click the word and then click any word or line of the alignment. To delete a word from an alignment, click the word; the corresponding line disappears.

If phrase alignment mode is selected, quadrilateral buttons for phrases are displayed in the middle area. At present, only the Japanese sentences have syntactic structure information. As shown in Figure 2, the words of each phrase are contained in a larger quadrilateral button. Annotators click this button to select a phrase for annotation. In this way, larger units can be considered and annotated more effectively.

3.2 Data Structure

We store the alignment data in an XML file encoded in Unicode. The following tags are defined for the different types of data.

<srctext>          Chinese sentence
<srcword>          Chinese word sequence with parts of speech
<tgttext>          Japanese sentence
<tgtword>          Japanese word sequence with parts of speech
<wordalignment_1>  automatically aligned result
<srctree>          syntactic structure of the Chinese sentence
<tgttree>          syntactic structure of the Japanese sentence
<wordalignment_2>  manually annotated word alignment
<structalignment>  manually annotated phrase alignment

The alignment data for the example displayed in Figure 1 are shown in Figure 3.

Figure 3. Illustration of the data structure.
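Since Figure 3 is not reproduced here, the following sketch illustrates how one sentence pair might be stored with the tags listed in Section 3.2. Only the tag names come from the paper. The wrapper element `sentencepair`, the omission of part-of-speech labels, the word segmentation, and the link encoding (comma-separated Chinese indices, a hyphen, comma-separated Japanese indices) are assumptions made for illustration, not the actual file format. The example uses the sentence fragment from criteria (6) and (7) of Section 4, where two discontinuous Chinese conjunctions are grouped and aligned to one Japanese conjunction.

```python
import xml.etree.ElementTree as ET


def encode_link(src_idx, tgt_idx):
    """Encode one many-to-many link, e.g. ((0, 4), (1,)) -> '0,4-1' (hypothetical format)."""
    return ",".join(map(str, src_idx)) + "-" + ",".join(map(str, tgt_idx))


def parse_links(text):
    """Decode the hypothetical link format back into index tuples."""
    links = []
    for item in text.split():
        src, tgt = item.split("-")
        links.append((tuple(int(i) for i in src.split(",")),
                      tuple(int(j) for j in tgt.split(","))))
    return links


def build_record(pair_id, src_words, tgt_words, manual_links):
    """Build one record using the tag names from Section 3.2.

    The 'sentencepair' wrapper is hypothetical, and part-of-speech labels
    are left out to keep the sketch short.
    """
    pair = ET.Element("sentencepair", id=pair_id)
    ET.SubElement(pair, "srctext").text = "".join(src_words)
    ET.SubElement(pair, "srcword").text = " ".join(src_words)
    ET.SubElement(pair, "tgttext").text = "".join(tgt_words)
    ET.SubElement(pair, "tgtword").text = " ".join(tgt_words)
    ET.SubElement(pair, "wordalignment_2").text = " ".join(
        encode_link(s, t) for s, t in manual_links)
    return pair


# Assumed segmentation of the example from Section 4.
chinese = ["虽然", "晚", "了", "，", "但是"]
japanese = ["遅い", "が"]
links = [((0, 4), (1,)),   # 虽然 + 但是 grouped, aligned to が
         ((1, 2), (0,))]   # 晚 + 了 grouped, aligned to 遅い

record = build_record("example-1", chinese, japanese, links)
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)

# Round trip: read the manual alignment back from the serialized record.
restored = parse_links(ET.fromstring(xml_text).find("wordalignment_2").text)
print(restored)
```

Storing each link as a pair of index groups mirrors the converging-lines display of Section 3.1: a group-to-group link is recorded once, rather than as every pairwise combination of its words.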

4. Specification for Manual Word Alignment Annotation

In alignment annotation, semantic equivalences between Chinese words and Japanese words are identified and aligned. The annotation follows the criteria below.

(1) Content words are considered first. After all content words are aligned, the remaining words are processed.
(2) Semantic equivalences should be verified against a few pre-specified Japanese-Chinese translation dictionaries. The dictionaries constrain the annotation so that free translations, whose equivalences hold only in particular sentence pairs and do not carry over to other contexts, are not annotated at the word level. In word alignment annotation, we consider only semantic equivalences that are applicable in general. Free translations are handled in phrase-level alignment.
(3) The alignment unit should have the minimum granularity, i.e. the smallest number of words. The words within one unit should be semantically bound to one another so that the unit cannot be divided further. This criterion aims at increasing the coverage of the translation knowledge that will be extracted from the aligned corpus.
(4) For idioms and frozen expressions, larger units should be preferred on both sides to ensure that the two groups are equivalent semantically and grammatically.
(5) For some Japanese postpositions, the Chinese counterpart usually consists of one preposition and one suffix that appear discontinuously. For instance, in the Japanese postpositional phrase 机の上で (on the table) and its Chinese translation 在桌子上 (on the table), the Japanese postposition で (on) corresponds to the Chinese preposition 在 (on) and the suffix 上 (on). The Chinese preposition 在 (on) and the suffix 上 (on) should be glued together first and then aligned to the Japanese postposition で (on).
(6) For some Japanese conjunctions, the Chinese counterpart usually consists of two conjunctions that appear discontinuously. For instance, in the Japanese clause 遅いが (it is late, but) and its Chinese translation 虽然晚了, 但是 (it is late, but), the Japanese conjunction が (but) corresponds to the Chinese conjunctions 虽然 (although) and 但是 (but). The two separated Chinese conjunctions 虽然 (although) and 但是 (but) should be glued together first and then aligned to the Japanese conjunction が (but).
(7) One major difference between Japanese and Chinese is that the former has inflectional morphology while the latter does not. In Chinese, words such as 了 (past tense particle), 过 (perfect aspect particle) and 着 (progressive aspect particle) are used to express tense and aspect, 被 (by) is used to express the passive voice, and 使 (causative morpheme) is used to express the causative. If the Japanese inflectional suffix is segmented from its root, i.e. it appears as an independent morpheme, the Chinese particle should be aligned to the suffix. Otherwise, the Chinese particle should be glued to its main verb first and then aligned to the Japanese verb, which consists of the root and the inflectional suffix. When the subject of the active sentence appears in the passive sentence, it is expressed in Japanese as SUBJECT に (by SUBJECT) and in Chinese as 被 SUBJECT (by SUBJECT). In this case, the Chinese word 被 (by) is aligned to the Japanese word に (by).

5. What Is Extracted from the Annotated Corpus

Manual annotation of word and phrase alignment on 17,000 sentence pairs has been completed. We examined the manually annotated data and extracted translation knowledge from them. From the word alignment data we obtain a translation dictionary, which may be used without restriction on context. From the phrase alignment data we obtain translation templates, which contain context restrictions.
The former kind of knowledge aims at increasing the coverage of the translation knowledge, while the latter aims at increasing the accuracy with which it is applied. Some examples of the extracted knowledge are shown in Figure 4 and Figure 5. Figure 4 shows examples of the semantic equivalences obtained at the phrase level. Figure 5 shows examples of the translation templates, which are obtained by replacing specified aligned words, displayed in parentheses, with variables such as X1.

Figure 4. Examples of the obtained semantic equivalences at phrase level.

Figure 5. Examples of the obtained translation templates.
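The template construction described above, in which specified aligned words are replaced with variables such as X1, can be sketched as follows. This is a minimal illustration and not the extraction procedure actually used for the corpus: the function name, the restriction to single-word slots, and the segmentation of the example phrase (borrowed from criterion (5) in Section 4) are all assumptions.

```python
def make_template(src_words, tgt_words, slots):
    """Replace aligned word pairs with shared variables X1, X2, ...

    src_words, tgt_words: token lists of an aligned phrase pair.
    slots: list of (src_index, tgt_index) pairs naming the aligned words
           to abstract away. Only single-word slots are handled here
           (a simplifying assumption).
    """
    src, tgt = list(src_words), list(tgt_words)
    for n, (i, j) in enumerate(slots, start=1):
        variable = f"X{n}"
        src[i] = variable
        tgt[j] = variable
    return src, tgt


# Assumed segmentation of the phrase pair from criterion (5):
# Chinese 在桌子上 and Japanese 机の上で (on the table).
chinese = ["在", "桌子", "上"]
japanese = ["机", "の", "上", "で"]

# Abstract the aligned content words 桌子 and 机 into one variable.
zh_template, ja_template = make_template(chinese, japanese, [(1, 0)])
print(" ".join(zh_template))
print(" ".join(ja_template))
```

In this sketch the words left in place (在 ... 上 and の上で) play the role of the context restriction carried by a template, while the variable marks the position where an entry from the word-level translation dictionary could be substituted.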

6. Conclusion

This paper presented the word and phrase alignment annotation of the NICT Japanese-Chinese parallel corpus. A visual tool was developed to assist manual annotation. We described the data structure and the general annotation guidelines, and reported the translation knowledge extracted from the aligned corpus. At present, each sentence pair is annotated by only one annotator. We plan to select a small subset of sentence pairs and ask different annotators to align them, in order to examine the divergence among annotators.

References

Linguistic Data Consortium. (2006). Guidelines for Chinese Word Alignment Annotation.
Maekawa, K., Koiso, H., Furui, F., Isahara, H. (2000). Spontaneous Speech Corpus of Japanese. In Proc. of LREC2000, pp. 947-952.
Melamed, I. Dan. (2001). Empirical Methods for Exploiting Parallel Texts. The MIT Press.
Och, Franz J., Ney, H. (2000). Giza++: Training of statistical translation models. Available at http://www-i6.informatik.rwth-aachen.de/~och/software/giza++.html.
Uchimoto, K., Zhang, Y., Sudo, K., Murata, M., Sekine, S., Isahara, H. (2004). Multilingual Aligned Parallel Treebank Corpus Reflecting Contextual Information and Its Applications. In Proc. of the MLR2004: PostCOLING Workshop on Multilingual Linguistic Resources, pp. 63-70.
Yu, S. (1997). Grammatical Knowledge Base of Contemporary Chinese. Tsinghua Publishing Company.
Zhang, Y., Liu, Q., Ma, Q., Isahara, H. (2005-a). A Multi-aligner for Japanese-Chinese Parallel Corpora. In the Tenth Machine Translation Summit Proceedings, pp. 133-140.
Zhang, Y., Uchimoto, K., Ma, Q., Isahara, H. (2005-b). Building an Annotated Japanese-Chinese Parallel Corpus: A Part of NICT Multilingual Corpora. In the Tenth Machine Translation Summit Proceedings, pp. 71-78.