Indonesian-English Transitive Translation for Cross-Language Information Retrieval

Mirna Adriani, Herika Hayurani, and Syandra Sari
Faculty of Computer Science, University of Indonesia, Depok 16424, Indonesia
mirna@cs.ui.ac.id, {heha51,sysa51}@ui.edu

C. Peters et al. (Eds.): CLEF 2007, LNCS 5152, pp. 127-133, 2008. Springer-Verlag Berlin Heidelberg 2008

Abstract. This report evaluates several language resources for the Indonesian-English bilingual task of the 2007 Cross-Language Evaluation Forum (CLEF). We translated an Indonesian query set into English using machine translation, transitive translation, and parallel corpus-based techniques. We also attempted to improve retrieval effectiveness using a query expansion technique. The results show that the best retrieval performance was achieved by combining the machine translation technique with the query expansion technique.

1 Introduction

To participate in the bilingual task of the 2007 Cross-Language Evaluation Forum (CLEF), i.e., Indonesian-English CLIR, we needed language resources to translate Indonesian queries into English. However, few such resources were freely available on the Internet, so we searched for resources that could be used for the translation process. We learned from our previous work [1, 2] that freely available dictionaries on the Internet could not correctly translate many Indonesian terms because their vocabulary was very limited. This led us to explore other approaches, such as machine translation techniques [3], parallel corpus-based techniques, and transitive translation techniques. Previous work has demonstrated that a parallel corpus can be used to find word pairs in different languages [4, 5, 6]. These word pairs can then be used to translate queries from one language in order to retrieve documents in another language.
If such a resource is not available, another possibility is to translate through some other language, known as a pivot language, that has more language resources [3, 7, 8].

2 The Query Translation Process

As a first step, we manually translated the original CLEF query set from English into Indonesian. We then translated the resulting Indonesian queries back into English using a machine translation technique, a transitive translation technique, and a parallel corpus.

For the machine translation technique, we translated the Indonesian queries into English using a machine translation service available on the Internet. The transitive technique uses German and French as the pivot languages: the Indonesian queries are translated into French and German using bilingual dictionaries, and the German and French queries are then translated into English using other dictionaries. The third technique uses a parallel corpus to translate the Indonesian queries. We created the parallel corpus by translating all the English documents in the CLEF collection into Indonesian using a commercial machine translation package called Transtool¹. We then created the English queries by taking a certain number of terms from a certain number of documents at the top of the ranked document list.

2.1 Query Expansion Technique

Adding relevant terms to the translated queries (known as query expansion) has been shown to improve CLIR effectiveness [1, 3]. One query expansion technique is pseudo relevance feedback [5]. It is based on the assumption that the top few documents initially retrieved are indeed relevant to the query, and so they must contain other terms that are also relevant to the query. The query expansion technique adds such terms to the previous query. We applied this technique in this work. To choose the relevant terms from the top-ranked documents, we used the tf*idf term weighting formula [9]. We added a certain number of terms that have the highest weight scores.

3 Experiment

We participated in the bilingual task with English topics. The English document collection contains 190,604 documents from two English newspapers, the Glasgow Herald and the Los Angeles Times. We opted to use the query title and the query description provided with the query topics. The query translation process was performed fully automatically using a machine translation technique, a transitive technique, and the parallel corpus (Figure 1).
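As an illustration of the pseudo relevance feedback step described in Section 2.1, the following Python sketch selects expansion terms from the top-ranked documents by tf*idf weight. It is a minimal sketch under stated assumptions, not the configuration used in the experiments: the weighting `tf * log(N / df)` is one common tf*idf variant (the text does not give the exact formula), and the function name `expand_query`, its parameters, and the toy documents are hypothetical.

```python
import math
from collections import Counter


def expand_query(query_terms, ranked_docs, collection, top_docs=10, top_terms=10):
    """Pseudo relevance feedback: append the highest tf*idf terms
    found in the top-ranked documents to the query.

    ranked_docs: token lists from an initial retrieval run, best first.
    collection:  token lists for the whole collection (for document frequency).
    The defaults for top_docs and top_terms are illustrative values only.
    """
    n_docs = len(collection)

    # Document frequency: number of documents that contain each term.
    df = Counter()
    for doc in collection:
        df.update(set(doc))

    # Term frequency summed over the assumed-relevant top documents.
    tf = Counter()
    for doc in ranked_docs[:top_docs]:
        tf.update(doc)

    # Assumed weighting: tf * log(N / df). Terms already in the query are skipped.
    scores = {
        term: freq * math.log(n_docs / df[term])
        for term, freq in tf.items()
        if term not in query_terms and df[term] > 0
    }

    # Highest score first; ties are broken alphabetically for determinism.
    best = sorted(scores, key=lambda t: (-scores[t], t))[:top_terms]
    return list(query_terms) + best


# Toy collection (hypothetical data).
d1 = ["oil", "price", "opec", "oil"]
d2 = ["oil", "price", "market"]
d3 = ["football", "match", "goal"]
d4 = ["football", "league", "price"]
collection = [d1, d2, d3, d4]

expanded = expand_query(["oil"], [d1, d2], collection, top_docs=2, top_terms=2)
print(expanded)  # ['oil', 'market', 'opec']
```

In this toy run, "price" occurs in both feedback documents but also in three of the four collection documents, so its idf is low and the rarer terms "market" and "opec" are added instead.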
The machine translation technique translates the Indonesian queries into English using Toggletext², a machine translation service that is available on the Internet.

The transitive technique translates the Indonesian queries into English through German and French as the pivot languages. The translation is done using dictionaries. Each Indonesian word is translated into German or French if it is found in the bilingual dictionaries; otherwise it is left in the original language. In our experiments we took several approaches to handling transitive translation: using the English senses found through either the German or the French dictionary (Union), and using only the English senses that appear through both the German and the French dictionaries (Intersection).

For the parallel corpus-based technique, we used pseudo translation to obtain English words from the Indonesian queries. First, an Indonesian query is used to retrieve the top N Indonesian documents through an IR system. Next, we identify the English documents that are parallel (paired) to these top N Indonesian documents. From these top N English documents, we create the equivalent English query from the top T terms that have the highest tf-idf scores [9].

¹ See http://www.geocities.com/cdpenerjemah/
² See http://www.toggletext.com/
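The parallel corpus-based pseudo translation described above can be sketched in Python as follows. This is only an illustrative sketch: the retrieval step is replaced by a toy ranking that counts query-term occurrences (the experiments used a full IR system), the tf-idf variant `tf * log(N / df)` is an assumption, and the names `pseudo_translate`, `tfidf_top_terms`, and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_top_terms(docs, collection, top_t):
    """Return the top_t terms of `docs` ranked by tf * log(N / df) (assumed variant)."""
    n = len(collection)
    df = Counter(term for doc in collection for term in set(doc))
    tf = Counter(term for doc in docs for term in doc)
    scores = {term: freq * math.log(n / df[term]) for term, freq in tf.items()}
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_t]


def pseudo_translate(query, indonesian_docs, english_docs, top_n=2, top_t=3):
    """Build an English query from the English documents paired with the
    top-N Indonesian documents retrieved for an Indonesian query.

    indonesian_docs / english_docs: dicts keyed by the same document id,
    so that each id identifies one parallel (paired) document.
    """
    # Toy retrieval step (a stand-in for a real IR system):
    # rank documents by the number of query-term occurrences.
    q = set(query)
    ranked = sorted(
        indonesian_docs,
        key=lambda d: (-sum(tok in q for tok in indonesian_docs[d]), d),
    )
    top_ids = ranked[:top_n]

    # Switch to the English side of the parallel corpus.
    paired = [english_docs[d] for d in top_ids]
    return tfidf_top_terms(paired, list(english_docs.values()), top_t)


# Hypothetical three-document parallel corpus.
indonesian_docs = {
    "d1": ["harga", "minyak", "naik"],
    "d2": ["minyak", "opec", "harga"],
    "d3": ["sepak", "bola", "liga"],
}
english_docs = {
    "d1": ["oil", "price", "rises"],
    "d2": ["oil", "opec", "price"],
    "d3": ["football", "league", "price"],
}

english_query = pseudo_translate(["harga", "minyak"], indonesian_docs, english_docs)
print(english_query)  # ['opec', 'rises', 'oil']
```

The sketch also shows why the approach depends on corpus quality: the English query is built only from documents reached through the Indonesian side, so a poor Indonesian version of the corpus leads to poor top-N documents and therefore poor query terms.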

1. Indonesian Query → Machine Translation → English Query
2. Indonesian Query → Parallel Corpus → English Query
3. Indonesian Query → German → English Query (using dictionary)
4. Indonesian Query → French → English Query (using dictionary)
5. English queries containing 3 & 4
6. Indonesian Query → {French → English Query; German → English Query} → English Query

Fig. 1. The translation techniques that are used in the experiments

We then applied a pseudo relevance feedback query expansion technique to the queries that were translated using the three techniques above. In these experiments, we used the Lemur³ information retrieval system, which is based on a language model, to index and retrieve the documents. We also used the synonym operator to handle the translation words found in the dictionaries; this operator gives the same weight to all the words inside it.

4 Results

Our work focused on the bilingual task, using Indonesian queries to retrieve documents in the English collections. Our experiments contain official runs, which have identification labels, and non-official runs, which do not. Tables 1-6 show the results of our experiments.

The retrieval performance of the title-based translated queries dropped 15.59% below that of the equivalent monolingual retrieval (see Table 1). The retrieval performance of the combination of query title and description dropped 15.72% below that of the equivalent monolingual queries.

Table 1. Mean Average Precision (MAP) of the monolingual runs of the title and combination of title and description topics and their translation queries using the machine translation

Query                                      Monolingual   Machine Translation (MT)
Title (depok.uiqttoggle)                   0.3835        0.3237 (-15.59%)
Title + Description (depok.uiqtdtoggle)    0.4056        0.3418 (-15.72%)

³ See http://www.lemurproject.org/
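The percentages in parentheses in the tables appear to be relative changes in MAP with respect to the monolingual baseline. A short Python sketch of that arithmetic, using the title-only row of Table 1, is given here; the function name `relative_change` is a hypothetical helper, not something named in the paper.

```python
def relative_change(baseline_map, run_map):
    """Relative change of a run's MAP against the monolingual baseline, in percent."""
    return (run_map - baseline_map) / baseline_map * 100.0


# Title-only row of Table 1: monolingual 0.3835, machine translation 0.3237.
title_drop = relative_change(0.3835, 0.3237)
print(f"{title_drop:.2f}%")  # -15.59%
```

Because every cross-language run is compared with the same monolingual baseline, differences between two such percentages are differences in percentage points rather than relative gains of one run over another.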

After applying the query expansion technique to the translated queries, the retrieval performance of the title-based translated queries was only 1.64% below that of the equivalent monolingual retrieval (see Table 2). This is an improvement of 13.95 percentage points over machine translation alone. Applying query expansion to the combination of the query title and description yields performance 4.38% below that of the equivalent monolingual queries, an improvement of 11.34 percentage points over machine translation alone.

Table 2. Mean Average Precision (MAP) of the monolingual runs of the title and combination of title and description topics and their translation queries using the machine translation and query expansion techniques

Query                                              Monolingual   MT + QE
Title (depok.uiqttogglefb10d10t)                   0.3835        0.3772 (-1.64%)
Title + Description (depok.uiqtdtogglefb10d10t)    0.4056        0.3878 (-4.38%)

The result of using the transitive translation technique for the combination of the title and description queries is shown in Table 3. Translating the queries into English using French as the pivot language decreased the mean average precision by 33.50% compared to the monolingual queries. Translating the Indonesian queries into English using German as the pivot language decreased the mean average precision by 29.04% compared to the monolingual queries. Translating the Indonesian queries into English using both pivot languages decreased the mean average precision by 33.18% compared to the monolingual queries.

Table 3. Mean Average Precision (MAP) of the monolingual runs of the title and combination of title and description topics and their translation queries using transitive translation (Indonesian queries are translated to English queries via German only and via French only)

Query                                                      Monolingual   Transitive Translation
Title + Description (via French only, depok.uiqtdfrsyn)    0.4056        0.2697 (-33.50%)
Title + Description (via German only, depok.uiqtddesyn)    0.4056        0.2878 (-29.04%)
Title + Description (via German and French)                0.4056        0.2710 (-33.18%)

Table 4. Mean Average Precision (MAP) of the monolingual runs of the title and combination of title and description topics and their translation queries using transitive translation (Indonesian queries are translated to English queries via German)

Query                                                      Monolingual   Transitive Translation
Title + Description                                        0.4056        0.2878 (-29.04%)
Title + Description + QE (depok.uiqtddesynfb10d10t)        0.4056        0.3342 (-17.60%)
Title + Description + QE (depok.uiqtddesynfb10d10t)        0.4056        0.3460 (-14.69%)
Title + Description + QE (depok.uiqtddesynfb5d10t)         0.4056        0.3432 (-15.38%)

Table 5. Mean Average Precision (MAP) of the monolingual runs of the title and combination of title and description topics, their translation queries using transitive translation (Indonesian queries are translated to English queries via German and French), and applying the query expansion

Query                                                                  Monolingual   Transitive Translation
Title + Description (uiqtintersectionunionsyn)                         0.4056        0.2831 (-30.20%)
Title + Description + QE (depok.uiqtdintersectionunionsynfb5d10t)      0.4056        0.3437 (-15.26%)
Title + Description + QE (depok.uiqtdintersectionunionsynfb10d10t)     0.4056        0.3297 (-18.71%)
Title + Description (Union)                                            0.4056        0.2710 (-33.18%)
Title + Description (Intersection & add untranslated Ind terms)        0.4056        0.2654 (-34.56%)

The transitive translation technique was also applied to translate the Indonesian queries into English via both German and French. The English terms derived from the German and French words were combined by taking either the union or the intersection of the two sets. Adding the Indonesian words that could not be translated into English resulted in a drop in average precision of 34.56% compared to the equivalent monolingual queries. Applying the query expansion technique (see Table 5) to the resulting English queries resulted in retrieval performance that is 15-33% below that of the equivalent monolingual queries. The best result of using query expansion on the translated queries was obtained with the intersection approach, which resulted in retrieval performance 15.26% lower than that of the equivalent monolingual queries.

When the query expansion technique was applied to the translated queries resulting from using German as the pivot language, the average retrieval performance dropped by 14-17% compared to the equivalent monolingual queries (see Table 4).
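The union and intersection strategies for combining the two pivot languages can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the dictionaries are tiny hypothetical stand-ins for the bilingual dictionaries used in the experiments, the function names are invented for illustration, and grouping the senses of each source term into one list mirrors the idea of placing them in a synonym operator without reproducing any particular query syntax.

```python
def translate_via_pivot(term, source_to_pivot, pivot_to_english):
    """Collect every English sense reachable from `term` through one pivot language."""
    english = set()
    for pivot_word in source_to_pivot.get(term, []):
        english.update(pivot_to_english.get(pivot_word, []))
    return english


def transitive_translate(query, id_de, de_en, id_fr, fr_en,
                         mode="union", keep_untranslated=False):
    """Translate each Indonesian term via German and French and merge the senses.

    mode="union":        keep senses found through either pivot language.
    mode="intersection": keep only senses found through both pivot languages.
    Returns one sorted list of English senses per translated source term.
    """
    if mode not in ("union", "intersection"):
        raise ValueError("mode must be 'union' or 'intersection'")

    translated = []
    for term in query:
        via_german = translate_via_pivot(term, id_de, de_en)
        via_french = translate_via_pivot(term, id_fr, fr_en)
        senses = via_german | via_french if mode == "union" else via_german & via_french
        if senses:
            translated.append(sorted(senses))
        elif keep_untranslated:
            # Leave the term in the original language when no sense survives.
            translated.append([term])
    return translated


# Hypothetical toy dictionaries (not the resources used in the experiments).
id_de = {"harga": ["Preis"], "minyak": ["Öl"]}
de_en = {"Preis": ["price", "prize"], "Öl": ["oil"]}
id_fr = {"harga": ["prix"], "minyak": ["huile", "pétrole"]}
fr_en = {"prix": ["price", "cost"], "huile": ["oil"], "pétrole": ["oil", "petroleum"]}

query = ["harga", "minyak", "Depok"]
union_query = transitive_translate(query, id_de, de_en, id_fr, fr_en, mode="union")
inter_query = transitive_translate(query, id_de, de_en, id_fr, fr_en,
                                   mode="intersection", keep_untranslated=True)
print(union_query)  # [['cost', 'price', 'prize'], ['oil', 'petroleum']]
print(inter_query)  # [['price'], ['oil'], ['Depok']]
```

In the toy data the union keeps the spurious sense "prize" introduced by the German pivot, while the intersection retains only senses that both pivots agree on, which illustrates how the choice affects translation ambiguity.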

Next, we obtained the English translation of the Indonesian queries using the parallel corpus-based technique. The pseudo translation that we applied to the Indonesian queries was done by taking the English documents that are parallel with the Indonesian documents marked as relevant to the Indonesian queries by the information retrieval system. We then took, as the English query, the top T English terms that had the highest weights within the top N documents.

Table 6. Mean Average Precision (MAP) of the monolingual runs of the title and combination of title and description topics and their translation queries using parallel corpus and query expansion

Query                                                Monolingual   Parallel Corpus
Title + Description                                  0.4056        0.0374 (-90.77%)
Title + Description + QE (5 terms from 5 terms)      0.4056        0.0462 (-88.60%)

The result (see Table 6) shows that the mean average precision dropped 90.77% below that of the equivalent monolingual queries. Applying the query expansion technique to the English queries improved the mean average precision by only 2.17 percentage points. The result of the parallel corpus-based technique was very poor because the Indonesian version of the English documents in the corpus was of poor quality in terms of translation accuracy.

The retrieval performance of transitive translation using one pivot language, i.e., German, is better than that of using two pivot languages, i.e., German and French. Translating Indonesian queries through German resulted in fewer definitions or senses than translating through French, meaning that translating through Indonesian-German-English is less ambiguous than translating through Indonesian-French-English.

5 Summary

Our results demonstrate that queries translated using a machine translation technique for Bahasa Indonesia achieved the best retrieval performance compared to the transitive technique and the parallel corpus technique.
However, the two machine translation systems for Indonesian and English produced very different results. The best result was achieved by translating Indonesian queries into English using one machine translation system, whereas the other machine translation system, which was used to create the parallel corpus, produced poor results. The results of the transitive translation technique showed that using only one pivot language gave better retrieval performance for the translated queries than using two pivot languages. The query expansion applied to the translated queries improved their retrieval performance. Even though the transitive technique did not perform as well as the machine translation technique, it can be considered a viable alternative for the translation process, especially for languages that do not have many available language resources, such as Bahasa Indonesia.

References

1. Adriani, M., van Rijsbergen, C.J.: Term Similarity Based Query Expansion for Cross Language Information Retrieval. In: Abiteboul, S., Vercoustre, A.-M. (eds.) ECDL 1999. LNCS, vol. 1696, pp. 311-322. Springer, Heidelberg (1999)
2. Adriani, M.: Ambiguity Problem in Multilingual Information Retrieval. In: CLEF 2000 Working Note Workshop, Portugal (2000)
3. Ballesteros, L.A.: Cross Language Retrieval via transitive translation. In: Croft, W.B. (ed.) Advances in Information Retrieval: Recent Research from the CIIR, pp. 203-234. Kluwer Academic Publishers, Dordrecht (2000)
4. Chen, J., Nie, J.: Automatic Construction of Parallel English-Chinese Corpus for Cross-Language Information Retrieval. In: Proceedings of the 6th Conference on Applied Natural Language Processing, pp. 90-95. ACM Press, New York (2000)
5. Lavrenko, V., Choquette, M., Croft, W.B.: Cross-Lingual Relevance Models. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 175-182. ACM Press, New York (2002)
6. Nie, J., Simard, M., Isabelle, P., Durand, R.: Cross-Language Information Retrieval Based on Parallel Text and Automatic Mining of Parallel Text from the Web. In: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York (1999)
7. Gollins, T., Sanderson, M.: Improving Cross Language Retrieval with Triangulated Retrieval. In: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 90-95. ACM Press, New York (2004)
8. Lehtokangas, R., Airio, E., Järvelin, K.: Transitive Dictionary Translation Challenges Direct Dictionary Translation in CLIR. Information Processing and Management: An International Journal 40(6), 973-988 (2004)
9. Salton, G., McGill, M.: Introduction to Modern Information Retrieval. McGraw-Hill, New York (1983)