Dictionary based Amharic - English Information Retrieval


Atelach Alemu Argaw 1, Lars Asker 1, Rickard Cöster 2 and Jussi Karlgren 2
1 Department of Computer and Systems Sciences, Stockholm University/KTH, Sweden
2 Swedish Institute of Computer Science, Sweden

Abstract. We present two approaches to the Amharic - English bilingual track in CLEF 2004. Both experiments use a dictionary based approach to translate the Amharic queries into English bags-of-words, but while one approach removes non-content bearing words from the Amharic queries based on their IDF value, the other uses a list of English stop words to perform the same task. The resulting translated (English) terms are then submitted to a retrieval engine that supports the Boolean and vector-space models. In our experiments, the second approach (based on a list of English stop words) performs slightly better than the one based on IDF values for the Amharic terms.

1 Background

Amharic is an Afro-Asiatic language belonging to the Southwest Semitic group. It uses its own unique alphabet (see Figure 1) and is spoken mainly in Ethiopia but also to a limited extent in Egypt and Israel [1]. Amharic is the official government language of Ethiopia and is spoken by a substantial segment of the population. In the 1998 census, 17.4 million people claimed Amharic as their first language and 5.1 million as their second language. Ethiopia is a multilingual country with over 80 distinct languages [2], and with a population of more than 59.9 million, as authorities estimated on the basis of the 1998 census. Owing to political and social conditions and the multiplicity of the languages, Amharic has gained ground throughout the country. Amharic is used in business, government, and education. Newspapers are printed in Amharic, as are numerous books on all subjects [3].

2 Introduction

In this paper we describe our experiments at the CLEF 2004 Amharic - English bilingual track. The experiments consist of two approaches that are variants of the same basic dictionary based approach. At a general level, both approaches consist of a first step that transforms the Amharic topics into English queries, followed by a second step that takes the English queries as input to a retrieval system. In both approaches the translation was done through a simple dictionary lookup that takes each stemmed Amharic word in the topic set and tries to find a match and the corresponding translation in a machine readable dictionary (MRD)³ [6].

The first approach (AmEnI) reduces the number of Amharic words by removing those that have an IDF value below a certain threshold (in this case 3.00) and then looks up the remaining words in the MRD. An overview of this approach is presented in Figure 2 below. The second approach (AmEnA) uses the MRD to translate all Amharic words into English, and then reduces the number of English words by removing those that occur in a list of English stop words. An overview of this approach is given in Figure 3 below. The results from the two approaches differ somewhat, with AmEnA performing slightly better, but both perform reasonably well considering the simplicity of the approaches.

Fig. 1. The Amharic alphabet (Fidel), from http://www.omniglot.com/

3 Method

3.1 Translation and Transliteration

The English topic sets were translated into Amharic by human translators. Amharic uses its own unique alphabet (Fidel) and there exist a number of fonts for it, but to date there is no standard for the language. The Amharic topics were originally represented using a Unicode compliant Ethiopic font called Visual Geez. For ease of use and compatibility reasons we transliterated them into an ASCII representation using SERA⁴.

The title and description fields of the original 50 Amharic topics contained 781 terms (493 unique) distributed over 808 words (because a few Amharic terms consisted of more than one word). Out of these 493 unique terms, 397 were found in the original Amharic - English Machine Readable Dictionary. This dictionary consists of a little more than 14,600 entries. The remaining 96 terms were included in a manually constructed dictionary consisting of these terms and their translations in the relevant sense.
Almost all of the 96 terms in this dictionary were proper names.

3.2 Stemming

Amharic is a Semitic language which is morphologically complex [2]. Words are inflected with prefixes, suffixes and infixes. Once the topic set was transliterated, a semi automatic crude stemming step stripped off the prefixes and suffixes from each word. The MRD used in the experiments contains an entry for words and their derivational variants. The infixed words are represented separately in the dictionary.

³ The electronic version of the MRD is made available through the courtesy of Dr. Daniel Yacob of the Ge'ez Frontier Foundation.
⁴ SERA stands for System for Ethiopic Representation in ASCII, http://www.abyssiniacybergateway.net/fidel/sera-faq.html

1. Amharic topic set
   1a. Transliteration
2. Transliterated Amharic topic set
   2a. Semi automatic crude stemming (only prefixes and suffixes)
3. Stemmed Amharic topic set
   3a. IDF-based stop word removal
4. Reduced Amharic topic set
   4a. Dictionary lookup
5. Topic set (in English) including all possible translations
   5a. Manual disambiguation
6. English terms (bag of words)
   6a. Retrieval (Indexing, keyword search, ranking)
7. Retrieved Documents

Fig. 2. Flow chart for AmEnI

3.3 Dictionary Lookup and Disambiguation

A machine readable dictionary consisting of about 14,600 words was used in the experiments to perform the lexical lookup when translating the Amharic queries to English. The dictionary consisted of entries for words and their derivational variants. The stemmed words in the Amharic query were automatically looked up for possible translations in the MRD. When there was a match with only one sense of the word, the corresponding English word or phrase in the dictionary was taken as the translation. When the term had more than one sense, all possible translations were picked out and a manual disambiguation was performed. For most of the proper names there was no entry in the MRD, so these terms were added manually.

The Amharic query set contained 493 unique terms. Of these, 285 occurred in the dictionary with only one possible translation, 112 occurred in the dictionary with more than one sense (the average number of senses for this group was 2.55), and 96 terms (mostly proper names) did not occur at all. The 96 terms that did not occur in the MRD were manually added in a separate dictionary.
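The lookup procedure described above can be illustrated with a minimal sketch. It assumes that each dictionary maps a stemmed, transliterated term to a list of candidate English translations. The function names, the toy entries, and the `first_sense` callback (a stand-in for the manual disambiguation step) are all hypothetical and are not taken from the MRD or from the system used in the experiments.

```python
from typing import Callable, Dict, List

# Toy entries invented for illustration; they are NOT taken from the MRD.
TOY_MRD: Dict[str, List[str]] = {
    "bet": ["house"],                   # one sense: used directly
    "hager": ["country", "homeland"],   # several senses: needs disambiguation
    "mekina": ["motor car"],            # phrasal translation: adds two words
}
# Stand-in for the separate, manually built dictionary of missing terms.
TOY_MANUAL: Dict[str, List[str]] = {"nairobi": ["Nairobi"]}


def translate_query(
    stems: List[str],
    mrd: Dict[str, List[str]],
    manual: Dict[str, List[str]],
    choose_sense: Callable[[str, List[str]], str],
) -> List[str]:
    """Return an English bag of words for a list of stemmed Amharic terms."""
    english: List[str] = []
    for stem in stems:
        # Try the MRD first, then fall back to the manual dictionary.
        senses = mrd.get(stem) or manual.get(stem) or []
        if not senses:
            continue  # no translation in either dictionary: term is dropped
        if len(senses) == 1:
            translation = senses[0]
        else:
            # Several senses: defer to the disambiguation callback.
            translation = choose_sense(stem, senses)
        # A phrasal translation contributes more than one query word.
        english.extend(translation.split())
    return english


def first_sense(stem: str, senses: List[str]) -> str:
    """Hypothetical placeholder for the manual disambiguation step."""
    return senses[0]
```

Passing the disambiguation step in as a callback keeps the manual decision separate from the automatic lookup, so it could later be replaced by an automatic method without changing the rest of the function.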

1. Amharic topic set
   1a. Transliteration
2. Transliterated Amharic topic set
   2a. Semi automatic crude stemming (only prefixes and suffixes)
3. Stemmed Amharic topic set
   3a. Dictionary lookup
4. Topic set (in English) including all possible translations
   4a. Manual disambiguation
5. Translated English terms and phrases
   5a. Stop word removal
6. English terms (bag of words)
   6a. Retrieval (Indexing, keyword search, ranking)
7. Retrieved Documents

Fig. 3. Flow chart for AmEnA

In the MRD some of the translations were phrasal, and taking these phrases introduced more words into the query. Some of the Amharic entries were also phrasal (22 total/14 unique), which in turn reduced the number of words in the query.

3.4 Stop Word Removal

The main difference between the two approaches is in the way words that are likely to be less informative are identified and removed from the queries. For the first approach (AmEnI) the number of Amharic words was reduced by removing those that have an Inverse Document Frequency (IDF) value below a threshold of 3.00. The IDF values were calculated from an Amharic news corpus consisting of approximately 2 million words of text. With a threshold of 3.00, 123 of the 493 unique Amharic words were removed (25%). The second approach (AmEnA) removed from the translated queries those words that occurred in a list of 517 English stop words. With this approach, 118 unique terms were removed, and the total number of remaining words in the resulting English query set was 559, compared to 547 for the AmEnI approach. Thus the two approaches left approximately the same number of words.
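The two filtering strategies can be contrasted in a short sketch. The paper does not state the exact IDF formula or the base of the logarithm, so the code below assumes the common definition log(N / df) with the natural logarithm. The function names are hypothetical, and keeping terms that never occur in the corpus is a further assumption.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List, Set


def inverse_document_frequency(documents: Iterable[List[str]]) -> Dict[str, float]:
    """Compute IDF as log(N / df); the formula and log base are assumptions."""
    docs = list(documents)
    n_docs = len(docs)
    doc_freq: Counter = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term at most once per document
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def remove_low_idf(terms: List[str], idf: Dict[str, float],
                   threshold: float = 3.00) -> List[str]:
    """AmEnI-style filter: drop source-language terms with IDF below threshold.

    Terms absent from the corpus are kept (assumption), since there is no
    evidence that they are frequent.
    """
    return [t for t in terms if idf.get(t, float("inf")) >= threshold]


def remove_stop_words(words: List[str], stop_words: Set[str]) -> List[str]:
    """AmEnA-style filter: drop translated words found in an English stop list."""
    return [w for w in words if w.lower() not in stop_words]
```

In the AmEnI pipeline a filter like `remove_low_idf` would run on the stemmed Amharic terms before dictionary lookup, while in AmEnA a filter like `remove_stop_words` would run on the English words after translation and disambiguation.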

3.5 Retrieval Engine

The underlying retrieval engine is an experimental system developed at SICS. For retrieval, we use Pivoted Unique Normalization [4], where the score for a document d given a query with m query terms is defined as

\[
\sum_{i=1}^{m} \frac{\dfrac{1 + \log(tf_{i,d})}{1 + \log(\text{average } tf_{d})}}{(1 - \text{slope}) \times \text{pivot} + \text{slope} \times \#\text{ of unique terms}}
\]

where tf_{i,d} is the term frequency of query term i in document d, and average tf_d is the average term frequency in document d. The slope was set to 0.3, and the pivot to the average number of unique terms in a document, as suggested in [4].

4 Results

We participated in the cross language Amharic to English run. Two runs were performed on the data set using two sets of queries. In the first run, stop word removal using IDF weights was done before the translation of terms; in the second, stop word removal was done only after the terms were translated into English. Table 1 lists the precision at various levels of recall for the two runs.

Recall  AmEnI   AmEnA
0.00    0.4799  0.5150
0.10    0.4597  0.4961
0.20    0.4535  0.4896
0.30    0.4074  0.4392
0.40    0.3863  0.4181
0.50    0.3724  0.4043
0.60    0.3458  0.3964
0.70    0.3356  0.3732
0.80    0.3273  0.3664
0.90    0.3109  0.3460
1.00    0.2961  0.3276

Table 1. Recall-Precision tables for AmEnI and AmEnA

A summary of the results obtained in both runs is reported in Table 2. The number of relevant documents, the retrieved relevant documents, the non-interpolated average precision, as well as the precision after R (= num rel) documents retrieved (R-Precision) are summarized in the table.

Run    Relevant-tot  Relevant-retrieved  Avg Precision  R-Precision
AmEnI  375           297                 0.3615         0.3251
AmEnA  375           307                 0.4009         0.3663

Table 2. Summary of results from both runs

5 Conclusions

We have described our experiments at the CLEF 2004 Amharic-English cross language track. The approach we followed is a dictionary based one to translate the Amharic queries into English bags-of-words. One of the experiments reported removes non-content bearing words from the Amharic queries based on their IDF value, while the other uses a list of English stop words to perform the same task. The resulting translated (English) terms are then submitted to the retrieval engine.

As can be seen from the results, the second approach (based on a list of English stop words) has an average precision of 0.4009, while the first approach (based on IDF values for the Amharic terms) reports 0.3615. This could be attributed to the fact that, although non-content bearing words were removed from the Amharic queries in the first approach, a lot of stop words were introduced while performing the dictionary lookup, hence introducing noise. A combination of the two approaches may result in better performance in terms of precision, while query expansion as a means of increasing recall remains open for investigation.

In future experiments we plan to investigate the possibility of automating some of the tasks that were done manually in these experiments (sense disambiguation, addition of proper names to the MRD), using techniques such as statistical co-occurrence for disambiguation and cognate matching for proper names. Experimenting with different retrieval techniques, comparing the performance of the algorithms, and studying the effects of various levels of stemming (root, stem, word) are also issues that we plan to address.

References

1. http://www.ethnologue.org/
2. Bender, M. L., Head, S. W., and Cowley, R.: The Ethiopian Writing System. In Bender et al. (Eds.) Language in Ethiopia. London: Oxford University Press (1976).
3. Leslau, W.: Amharic Textbook. California, Berkeley University (1968).
4. Singhal, A., Buckley, C., and Mitra, M.: Pivoted Document Length Normalization. Proceedings of the 19th International Conference on Research and Development in Information Retrieval (1996) 21-29.
5. Fissaha, S., and Haller, J.: Amharic verb lexicon in the context of Machine Translation. Proceedings of TALN 2003, Workshop on Natural Language Processing of Minority Languages and Small Languages, Batz-sur-Mer, France (2003).
6. Amsalu Aklilu: Amharic - English Dictionary. Kuraz Printing Enterprise (1987).
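As a supplementary illustration, the pivoted unique normalization score from Section 3.5 can be sketched in a few lines. This is not the SICS engine. It is a minimal reading of the formula that assumes natural logarithms, takes the number of unique terms to be counted within the scored document as in [4], and lets query terms absent from the document contribute nothing. The function name and the token-list document representation are hypothetical.

```python
import math
from collections import Counter
from typing import List


def pivoted_unique_score(query_terms: List[str], doc_tokens: List[str],
                         pivot: float, slope: float = 0.3) -> float:
    """Score one document against a query with pivoted unique normalization.

    pivot is the average number of unique terms per document in the
    collection; slope defaults to the 0.3 used in the experiments.
    """
    tf = Counter(doc_tokens)
    if not tf:
        return 0.0  # empty document: nothing to score
    n_unique = len(tf)
    average_tf = len(doc_tokens) / n_unique
    # Denominator of the formula: the pivoted length normalization factor.
    normalization = (1.0 - slope) * pivot + slope * n_unique
    score = 0.0
    for term in query_terms:
        if tf[term] > 0:  # log(0) is undefined, so absent terms are skipped
            weight = (1.0 + math.log(tf[term])) / (1.0 + math.log(average_tf))
            score += weight / normalization
    return score
```

To rank a collection under these assumptions, one would compute `pivot` once as the mean number of unique terms over all documents and then sort the documents by their score for the translated bag-of-words query.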