Indonesian-English Transitive Translation for Cross-Language Information Retrieval


Mirna Adriani, Herika Hayurani, and Syandra Sari

Faculty of Computer Science, University of Indonesia, Depok 16424, Indonesia

Abstract. This is a report on our evaluation of several language resources for the Indonesian-English bilingual task of the 2007 Cross-Language Evaluation Forum (CLEF). We translated an Indonesian query set into English using machine translation, transitive translation, and parallel corpus-based techniques. We also attempted to improve retrieval effectiveness using a query expansion technique. The results show that the best retrieval performance was achieved by combining the machine translation technique with the query expansion technique.

1 Introduction

To participate in the bilingual task of the 2007 Cross-Language Evaluation Forum (CLEF), i.e., Indonesian-English CLIR, we needed language resources to translate Indonesian queries into English. However, few such resources were freely available on the Internet, so we sought out resources that could be used for the translation process. We learned from our previous work [1, 2] that freely available dictionaries on the Internet could not correctly translate many Indonesian terms, as their vocabulary was very limited. This led us to explore other possible approaches, such as machine translation techniques [3], parallel corpus-based techniques, and transitive translation techniques. Previous work has demonstrated that a parallel corpus can be used to find word pairs in different languages [4, 5, 6]. The word pairs can then be used to translate queries from one language so that they can retrieve documents in another language. If such a resource is not available, another possibility is to translate through some other language, known as a pivot language, that has more language resources [3, 7, 8].
C. Peters et al. (Eds.): CLEF 2007, LNCS 5152, pp , Springer-Verlag Berlin Heidelberg 2008

2 The Query Translation Process

As a first step, we manually translated the original CLEF query set from English into Indonesian. We then translated the resulting Indonesian queries back into English using a machine translation technique, a transitive translation technique, and a parallel corpus.

For the machine translation technique, we translated the Indonesian queries into English using a machine translation system available on the Internet. The transitive technique uses German and French as the pivot languages: the Indonesian queries are translated into French and German using bilingual dictionaries, and the German and French queries are then translated into English using other dictionaries. The third technique uses a parallel corpus to translate the Indonesian queries. We created the parallel corpus by translating all the English documents in the CLEF collection into Indonesian using a commercial machine translation software package called Transtool. We then created the English queries by taking a certain number of terms from a certain number of documents that appear at the top of the retrieved document list.

2.1 Query Expansion Technique

Adding relevant terms to the translated queries (known as query expansion) has been shown to improve CLIR effectiveness [1, 3]. One query expansion technique is pseudo relevance feedback [5]. This technique is based on the assumption that the top few documents initially retrieved are indeed relevant to the query, and so they must contain other terms that are also relevant to the query. The query expansion technique adds such terms to the previous query. We applied this technique in this work. To choose the relevant terms from the top-ranked documents, we used the tf*idf term weighting formula [9]. We added a certain number of terms that have the highest weight scores.

3 Experiment

We participated in the bilingual task with English topics. The English document collection contains 190,604 documents from two English newspapers, the Glasgow Herald and the Los Angeles Times. We opted to use the query title and the query description provided with the query topics. The query translation process was performed fully automatically using a machine translation technique, a transitive technique, and the parallel corpus (Figure 1).
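The pseudo relevance feedback step of Section 2.1 can be sketched as follows. This is only an illustrative sketch, not the code used in the experiments: the function name `expand_query`, the tokenized-list document representation, the logarithmic idf variant, and the default values of `top_docs` and `top_terms` are all assumptions made for the example.

```python
import math
from collections import Counter

def expand_query(query_terms, ranked_docs, collection, top_docs=10, top_terms=10):
    """Add the highest tf*idf terms of the top-ranked documents to the query.

    query_terms: list of terms in the translated query.
    ranked_docs: retrieved documents (token lists), best first (hypothetical input).
    collection:  all documents (token lists), used for document frequencies.
    """
    n_docs = len(collection)

    # Document frequency of each term over the whole collection.
    df = Counter()
    for doc in collection:
        df.update(set(doc))

    # Term frequency accumulated over the assumed-relevant top documents.
    tf = Counter()
    for doc in ranked_docs[:top_docs]:
        tf.update(doc)

    # tf*idf score for every candidate term that is not already in the query.
    scores = {
        term: tf[term] * math.log(n_docs / df[term])
        for term in tf
        if term not in query_terms and df[term] > 0
    }

    # Keep the best-scoring terms; ties are broken alphabetically.
    best = sorted(scores, key=lambda t: (-scores[t], t))[:top_terms]
    return list(query_terms) + best
```

The expanded query is then submitted to the retrieval system in place of the original translated query.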
The machine translation technique translates the Indonesian queries into English using Toggletext, a machine translation system that is available on the Internet.

The transitive technique translates the Indonesian queries into English through German and French as the pivot languages. The translation is done using dictionaries. Each Indonesian word is translated into German or French if it is found in the bilingual dictionaries; otherwise it is left in the original language. In our experiments we took several approaches to transitive translation: using the English sense words found through either the German or the French dictionary (Union), and using only the English sense words that appear through both the German and the French dictionaries (Intersection).

For the parallel corpus-based technique, we used pseudo translation to obtain English words from the Indonesian queries. First, an Indonesian query is used to retrieve the top N Indonesian documents through an IR system. Next, we identify the English documents that are parallel (paired) to these top N Indonesian documents. From these top N English documents, we create the equivalent English query from the top T terms that have the highest tf-idf scores [9].
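The pseudo translation procedure just described can be sketched as below. It is a minimal sketch under stated assumptions: the Indonesian ranking is taken as given (a list of document IDs produced by some IR system), the parallel corpus is a hypothetical dictionary `parallel_en` mapping each document ID to the tokens of its paired English document, and the idf variant is an assumption.

```python
import math
from collections import Counter

def pseudo_translate(ranked_ind_ids, parallel_en, top_n=5, top_t=5):
    """Build an English query from the English documents paired with the
    top N Indonesian documents retrieved for an Indonesian query.

    ranked_ind_ids: IDs of the retrieved Indonesian documents, best first.
    parallel_en:    dict mapping document ID -> token list of the paired
                    English document (hypothetical corpus representation).
    """
    n_docs = len(parallel_en)

    # Document frequency of each English term over the English side.
    df = Counter()
    for tokens in parallel_en.values():
        df.update(set(tokens))

    # Term frequency over the English documents paired with the top N hits.
    tf = Counter()
    for doc_id in ranked_ind_ids[:top_n]:
        tf.update(parallel_en[doc_id])

    # Rank candidate terms by tf-idf and keep the top T as the English query.
    scores = {term: tf[term] * math.log(n_docs / df[term]) for term in tf}
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_t]
```

Because the English query is drawn entirely from the paired documents, its quality depends directly on how well the two sides of the corpus correspond.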

1. Indonesian Query -> Machine Translation -> English Query
2. Indonesian Query -> Parallel Corpus -> English Query
3. Indonesian Query -> German -> English Query (using dictionary)
4. Indonesian Query -> French -> English Query (using dictionary)
5. English queries containing 3 & 4
6. Indonesian Query -> {French -> English Query, German -> English Query} -> English Query

Fig. 1. The translation techniques that are used in the experiments

We then applied a pseudo relevance feedback query expansion technique to the queries that were translated using the three techniques above. In these experiments, we used the Lemur information retrieval system, which is based on a language model, to index and retrieve the documents. We also used synonym operators to handle the translation words found in the dictionaries. The synonym operator gives the same weight to all the words inside it.

4 Results

Our work focused on the bilingual task, using Indonesian queries to retrieve documents in the English collections. Our experiments comprise official runs, which have identification labels, and non-official runs, which do not. Tables 1-6 show the results of our experiments. The retrieval performance of the title-based translated queries dropped 15.59% below that of the equivalent monolingual retrieval (see Table 1). The retrieval performance of the combined query title and description dropped 15.72% below that of the equivalent monolingual queries.

Table 1. Mean Average Precision (MAP) of the monolingual runs of the title and combination of title and description topics and their translation queries using the machine translation

  Query                                      Monolingual   Machine Translation (MT)
  Title (depok.uiqttoggle)                                 (-15.59%)
  Title + Description (depok.uiqtdtoggle)                  (-15.72%)

After applying the query expansion technique to the translated queries, the retrieval performance of the title-based translated queries was 1.64% below that of the equivalent monolingual retrieval (see Table 2). Query expansion increased the mean average precision by 13.95% compared to machine translation alone. Applying query expansion to the combination of the query title and description, however, achieved 4.38% below that of the equivalent monolingual queries, increasing the mean average precision of the machine translation technique by 11.34%.

Table 2. Mean Average Precision (MAP) of the monolingual runs of the title and combination of title and description topics and their translation queries using the machine translation and query expansion techniques

  Query                                            Monolingual   MT + QE
  Title (depok.uiqttogglefb10d10t)                               (-1.64%)
  Title + Description (depok.uiqtdtogglefb10d10t)                (-4.38%)

Table 3. Mean Average Precision (MAP) of the monolingual runs of the title and combination of title and description topics and their translation queries using transitive translation (Indonesian queries are translated to English queries via German only and via French only)

  Query                                                      Monolingual   Transitive Translation
  Title + Description (via French only - depok.uiqtdfrsyn)                 (-33.50%)
  Title + Description (via German only - depok.uiqtddesyn)                 (-29.04%)
  Title + Description (via German and French)                              (-33.18%)

The result of using the transitive translation technique for the combination of the title and description queries is shown in Table 3. Translating the queries into English using French as the pivot language decreased the mean average precision by 33.50% compared to the monolingual queries. Translating the Indonesian queries into English using German as the pivot language decreased the mean average precision by 29.04% compared to the monolingual queries.
Translating the Indonesian queries into English using two pivot languages decreased the mean average precision by 33.18% compared to the monolingual queries. The transitive translation technique was applied to translate the Indonesian queries into English via German and French. The English terms derived from the German and French words were combined by taking either the union or the intersection of the two sets. Adding the Indonesian words that could not be translated into English resulted in a drop in mean average precision of 34.56% compared to the equivalent monolingual queries.

Table 4. Mean Average Precision (MAP) of the monolingual runs of the title and combination of title and description topics and their translation queries using transitive translation (Indonesian queries are translated to English queries via German)

  Query                                                  Monolingual   Transitive Translation
  Title + Description                                                  (-29.04%)
  Title + Description + QE (depok.uiqtddesynfb10d10t)                  (-17.60%)
  Title + Description + QE (depok.uiqtddesynfb10d10t)                  (-14.69%)
  Title + Description + QE (depok.uiqtddesynfb5d10t)                   (-15.38%)

Table 5. Mean Average Precision (MAP) of the monolingual runs of the title and combination of title and description topics, their translation queries using transitive translation (Indonesian queries are translated to English queries via German and French), and applying the query expansion

  Query                                                                Monolingual   Transitive Translation
  Title + Description (uiqtintersectionunionsyn)                                     (-30.20%)
  Title + Description + QE (depok.uiqtdintersectionunionsynfb5d10t)                  (-15.26%)
  Title + Description + QE (depok.uiqtdintersectionunionsynfb10d10t)                 (-18.71%)
  Title + Description (Union)                                                        (-33.18%)
  Title + Description (Intersection & add untranslated Ind terms)                    (-34.56%)

Applying the query expansion technique (see Table 5) to the resulting English queries resulted in retrieval performance that is 15-33% below that of the equivalent monolingual queries. The best result of using query expansion for the translated queries was obtained with the intersection approach, which resulted in retrieval performance 15.26% lower than that of the equivalent monolingual queries. When the query expansion technique was applied to the translated queries resulting from using German as the pivot language, the retrieval performance was 14-17% below that of the equivalent monolingual queries (see Table 4).
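The Union and Intersection ways of combining the two pivot translations can be illustrated with a short sketch. Everything in it is an assumption made for illustration: the dictionaries are plain Python mappings from a word to a list of translations, the function names are hypothetical, and keeping an untranslated term in its original form is one possible reading of how unknown words were handled.

```python
def via_pivot(term, ind_to_pivot, pivot_to_en):
    """Return the set of English senses of an Indonesian term through one pivot."""
    senses = set()
    for pivot_word in ind_to_pivot.get(term, []):
        senses.update(pivot_to_en.get(pivot_word, []))
    return senses

def transitive_translate(query, ind_de, de_en, ind_fr, fr_en, mode="union"):
    """Translate an Indonesian query via German and French pivot dictionaries.

    mode="union":        keep senses found through either pivot language.
    mode="intersection": keep only senses found through both pivot languages.
    Returns one list of alternative English senses per query term.
    """
    translated = []
    for term in query:
        german_senses = via_pivot(term, ind_de, de_en)
        french_senses = via_pivot(term, ind_fr, fr_en)
        if mode == "union":
            senses = german_senses | french_senses
        else:
            senses = german_senses & french_senses
        # Assumption: a term with no surviving sense stays in the original language.
        translated.append(sorted(senses) if senses else [term])
    return translated
```

Each inner list groups the alternative senses of one source term, which is the kind of group that a synonym operator would wrap so that all alternatives receive the same weight. The intersection keeps fewer, less ambiguous senses, at the risk of discarding a correct sense that only one pivot dictionary provides.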

Table 6. Mean Average Precision (MAP) of the monolingual runs of the title and combination of title and description topics and their translation queries using parallel corpus and query expansion

  Query                                               Monolingual   Parallel Corpus
  Title + Description                                               (-90.77%)
  Title + Description + QE (5 terms from 5 terms)                   (-88.60%)

Next, we obtained the English translation of the Indonesian queries using the parallel corpus-based technique. The pseudo translation that we applied to the Indonesian queries was done by taking the English documents that are parallel to the Indonesian documents marked as relevant to the Indonesian queries by the information retrieval system. We then took, as the English queries, the top T English terms that had the highest weights within the top N documents. The result (see Table 6) shows that the mean average precision dropped 90.77% below that of the equivalent monolingual queries. The query expansion technique that was applied to the English queries increased the mean average precision by only 2.17%. The result of the parallel corpus-based technique was very poor because the Indonesian version of the English documents in the corpus was of poor quality in terms of translation accuracy.

The retrieval performance of the transitive translation using one pivot language, i.e., German, is better than that using two pivot languages, i.e., German and French. Translating Indonesian queries through German resulted in fewer definitions or senses than through French, meaning that translating through Indonesian-German-English is less ambiguous than translating through Indonesian-French-English.

5 Summary

Our results demonstrate that queries translated using a machine translation technique for Bahasa Indonesia achieved the best retrieval performance, compared to the transitive technique and the parallel corpus technique. However, the two machine translation systems for Indonesian and English produced different results. Even though the best result was achieved by translating Indonesian queries into English using one machine translation system, the other machine translation system, which was used for creating the parallel corpus, produced poor results. The result of using the transitive translation technique showed that using only one pivot language gave better retrieval performance for the translated queries than using two pivot languages. The query expansion applied to the translated queries improved their retrieval performance. Even though the transitive technique did not perform as well as the machine translation technique, it can be considered a viable alternative method for the translation process, especially for languages that do not have many available language resources, such as Bahasa Indonesia.

References

1. Adriani, M., van Rijsbergen, C.J.: Term Similarity Based Query Expansion for Cross Language Information Retrieval. In: Abiteboul, S., Vercoustre, A.-M. (eds.) ECDL LNCS, vol. 1696, pp Springer, Heidelberg (1999)
2. Adriani, M.: Ambiguity Problem in Multilingual Information Retrieval. In: CLEF 2000 Working Note Workshop, Portugal (2000)
3. Ballesteros, L.A.: Cross Language Retrieval via transitive translation. In: Croft, W.B. (ed.) Advances in Information Retrieval: Recent Research from the CIIR, pp Kluwer Academic Publishers, Dordrecht (2000)
4. Chen, J., Nie, J.: Automatic Construction of Parallel English-Chinese Corpus for Cross-Language Information Retrieval. In: Proceedings of the 6th Conference on Applied Natural Language Processing, pp ACM Press, New York (2000)
5. Lavrenko, V., Choquette, M., Croft, W.B.: Cross-Lingual Relevance Models. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp ACM Press, New York (2002)
6. Nie, J., Simard, M., Isabelle, P., Durand, R.: Cross-Language Information Retrieval Based on Parallel Text and Automatic Mining of Parallel Text from the Web. In: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York (1999)
7. Gollins, T., Sanderson, M.: Improving Cross Language Retrieval with Triangulated Retrieval. In: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp ACM Press, New York (2004)
8. Lehtokangas, R., Airio, E., Järvelin, K.: Transitive Dictionary Translation Challenges Direct Dictionary Translation in CLIR. Information Processing and Management: An International Journal 40(6), (2004)
9. Salton, G., McGill, M.: Introduction to Modern Information Retrieval. McGraw-Hill, New York (1983)
