Amharic-English Information Retrieval

Atelach Alemu Argaw and Lars Asker
Department of Computer and Systems Sciences, Stockholm University/KTH
[atelach,asker]@dsv.su.se

Abstract

We describe Amharic-English cross-lingual information retrieval experiments in the ad hoc bilingual track of CLEF 2006. The query analysis is supported by morphological analysis and part-of-speech tagging, while different machine-readable dictionaries were used for term lookup in the translation process. Out-of-dictionary terms were handled using fuzzy matching, and Lucene [4] was used for indexing and searching. Four experiments were conducted that differed in terms of the topic fields used, fuzzy matching, and term weighting. The results obtained are reported and discussed.

Categories and Subject Descriptors

H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval; H.3.4 Systems and Software; H.3.7 Digital Libraries; H.2.3 [Database Management]: Languages - Query Languages

General Terms

Languages, Measurement, Performance, Experimentation

Keywords

Amharic, Amharic-to-English, Cross-Language Information Retrieval

1 Introduction

Amharic is the official government language of Ethiopia. It is a Semitic language of the Afro-Asiatic language group, related to Hebrew, Arabic, and Syriac. Amharic uses a syllabic script that originated from the alphabet of Ge'ez, the liturgical language of the Ethiopian Orthodox Church. The script has 33 basic characters, each with 7 forms corresponding to the different consonant-vowel combinations, plus extra characters that represent consonant-vowel-vowel combinations for some of the basic consonants and vowels. It also has its own set of punctuation marks and digits. Unlike Arabic, Hebrew, or Syriac, the language is written from left to right. The Amharic script is unique to Ethiopia. Manuscripts in Amharic are known from the 14th century, and the language has been used as a general medium for literature, journalism, education, national business, and cross-communication. A wide variety of literature, including religious writings, fiction, poetry, plays, and magazines, is available in the language (Arthur Lynn's World Languages).

The Amharic topic set for CLEF 2006 was constructed by manually translating the English topics. This was done by professional translators in Addis Ababa. The Amharic topic set, written in fidel, the writing system for Amharic, was then transliterated to an ASCII representation using SERA, the System for Ethiopic Representation in ASCII (http://www.abyssiniacybergateway.net/fidel/serafaq.html). The transliteration was done with g2, a file conversion utility made available to us by Daniel Yacob of the Ge'ez Frontier Foundation (http://www.ethiopic.org/) and included in the LibEth package, a library for Ethiopic text processing written in ANSI C (http://libeth.sourceforge.net/).

We designed four experiments that differ from one another in terms of query expansion, fuzzy matching, and the usage of the title and description fields in the topic set; details are given in Section 4. Lucene [4], an open source search toolbox, was used as the search engine for all experiments.

The paper is organized as follows: Section 1 gives an introduction to the language under consideration and the overall experimental setup. Section 2 deals with the query analysis, which consists of morphological analysis, part-of-speech tagging, and filtering, as well as dictionary lookup. Section 3 reports how out-of-dictionary terms were handled. Section 4 describes the setup of the four retrieval experiments. Section 5 presents the results, and Section 6 discusses them and gives concluding remarks.

2 Query Analysis and Dictionary Lookup

The dictionary lookup requires that the (transliterated) Amharic terms are first morphologically analyzed and represented by their lemmatized citation forms. Amharic, like other Semitic languages, has a very rich morphology; a verb can, for example, have well over 150 different forms. This means that successful translation of the query terms using a machine-readable dictionary depends crucially on a correct morphological analysis of the Amharic terms. For our experiments, we developed a morphological analyzer and a part-of-speech tagger for Amharic, which were used as the first pre-processing step in the retrieval process. We used the morphological analyzer to lemmatize the Amharic terms and the POS tagger to filter out less content-bearing words. On the 50 queries in the Amharic topic set, the morphological analyzer had an accuracy of 86.66% and the POS tagger 97.45%.

After the terms in the queries were POS tagged, filtering was done by keeping nouns and noun phrases in the keyword list being constructed while discarding all words with other POS tags. Starting with tri-grams, then bi-grams, and finally at the word level, each remaining term was looked up in an Amharic-English dictionary [2]. If a term could not be found there, a triangulation method was used whereby the term was looked up in an Amharic-French dictionary [1] and then translated from French to English using the online dictionary WordReference (http://www.wordreference.com/). We also used an online English-Amharic dictionary (http://www.amharicdictionary.com/) to translate the remaining terms that were not found in any of the above dictionaries. For the terms that were found in the dictionaries, we used all senses and all synonyms, which means that a single Amharic term could in our case give rise to as many as eight alternative or complementary English terms. At the query level, each query was thus initially maximally expanded.
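
As an illustration of the lookup procedure described above, the sketch below matches query terms against the dictionaries with tri-gram/bi-gram/word back-off and French triangulation. The dictionary contents, helper names, and example terms are all invented placeholders; the actual system used the machine-readable dictionaries cited as [1] and [2] and the online resources named in the text.

```python
# Minimal sketch of the dictionary lookup with n-gram back-off and
# French triangulation. All dictionary entries below are illustrative.

AM_EN = {"selam": ["peace", "hello"]}       # Amharic -> English [2]
AM_FR = {"mrca": ["election"]}              # Amharic -> French [1]
FR_EN = {"election": ["election", "poll"]}  # French -> English (WordReference)

def lookup(term):
    """Return the list of English translations for one Amharic term."""
    if term in AM_EN:
        return AM_EN[term]
    # Triangulation: Amharic -> French -> English
    english = []
    for fr in AM_FR.get(term, []):
        english.extend(FR_EN.get(fr, []))
    return english

def translate_query(tokens):
    """Look terms up starting with tri-grams, then bi-grams, then single words."""
    translations = {}   # n-gram -> list of English alternatives (all senses/synonyms)
    covered = set()
    for n in (3, 2, 1):
        for i in range(len(tokens) - n + 1):
            span = set(range(i, i + n))
            if covered & span:
                continue                  # already translated as part of a longer n-gram
            ngram = " ".join(tokens[i:i + n])
            hits = lookup(ngram)
            if hits:
                translations[ngram] = hits
                covered |= span
    return translations

print(translate_query(["selam", "mrca"]))
# {'selam': ['peace', 'hello'], 'mrca': ['election', 'poll']}
```
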
3 Out-of-Dictionary Terms

Terms that were POS tagged as nouns but not found in any of the dictionaries were selected as candidates for fuzzy matching using edit distance. The assumption here is that such words are most likely cognates, named entities, or borrowed words. The candidates were first filtered by counting the number of times they occurred in a large (3.5 million words) Amharic news corpus. If a candidate occurred in the news corpus (in either its lemmatized or original form) more frequently than a predefined threshold value of 10 (an empirically set value that depends on the type and size of the corpus under consideration), it was considered likely to be a non-cognate and was removed from the fuzzy matching, unless it was labeled as a cognate by an algorithm specifically designed to find (English) cognates in Amharic text [3]. The set of fuzzy matching candidates was further reduced by removing terms that occurred in 9 or more of the original 50 queries, on the assumption that they would be remnants of non-informative sentence fragments of the type "Find documents that describe ...". Once the list of fuzzy matching candidates had been decided, some of the terms in the list were slightly modified in order to allow for a more English-like spelling than the one provided by the transliteration system [5]. For example, all occurrences of "x", which represents the sound "sh", were replaced by "sh" ("jorj bux" becomes "George bush").
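
A compact sketch of this candidate selection is given below. It assumes that corpus frequencies, per-query frequencies, and a cognate detector are available as inputs; the thresholds (10 and 9) are the ones stated above, while the function names and example term are illustrative rather than taken from the actual implementation.

```python
# Thresholds taken from the text above; data structures are illustrative.
CORPUS_FREQ_THRESHOLD = 10   # occurrences in the 3.5M-word Amharic news corpus
QUERY_FREQ_THRESHOLD = 9     # term appears in 9+ of the 50 queries -> fragment noise

def select_fuzzy_candidates(oov_nouns, corpus_freq, query_freq, is_cognate):
    """Keep out-of-dictionary nouns that are plausible cognates, names, or borrowings."""
    candidates = []
    for term in oov_nouns:
        if corpus_freq.get(term, 0) > CORPUS_FREQ_THRESHOLD and not is_cognate(term):
            continue   # frequent native word, probably not a cognate
        if query_freq.get(term, 0) >= QUERY_FREQ_THRESHOLD:
            continue   # leftover of "Find documents that describe ..." fragments
        candidates.append(respell(term))
    return candidates

def respell(term):
    """Move the SERA transliteration towards a more English-like spelling."""
    return term.replace("x", "sh")   # e.g. "bux" -> "bush"

print(respell("bux"))   # bush
```
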

4 Retrieval

The retrieval was done using Apache Lucene, an open source, high-performance, full-featured text search engine library written in Java [4]. It is a technology suitable for applications that require full-text search, especially cross-platform ones. Four experiments were designed and run using Lucene.

4.1 Fully Expanded Queries Using Title and Description

The translated and maximally expanded query terms from the title and description fields of the Amharic topic set were used in this experiment. To compensate for the varying number of synonyms given as possible translations of the query terms, the synonym set corresponding to each Amharic term was down-weighted: each synonym received a weight of 1 divided by the size of its set, so that the fractional weights within a set add up to 1. Edit distance based fuzzy matching was used in this experiment to handle cognates, named entities, and borrowed words.

4.2 Fully Expanded Queries Using Title

This experiment repeats the previous one, except that only the title field of the topic set was used. It is an attempt to investigate how much retrieval performance is affected by the presence or absence of the description field.

4.3 Up-Weighted Fuzzy Matching

In this experiment, both the title and description fields were used. The setup is the same as in the first experiment, except that fuzzy matching terms were given much higher importance in the query by boosting their weight by a factor of 10.

4.4 Fully Expanded Queries Without Fuzzy Matching

This experiment is designed as a comparative measure of how much fuzzy matching affects the performance of the retrieval system. The setup of the first experiment is adopted, except that fuzzy matching is not used. Cognates, named entities, and borrowed words, which were otherwise handled by fuzzy matching, were instead treated manually: they were picked out and looked up separately, and all translations of such entries are manual.
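
The weighting described in Sections 4.1 and 4.3 can be expressed in Lucene's classic query parser syntax, where a trailing `^w` sets a term boost and `~` requests fuzzy (edit distance based) matching. The sketch below builds such a query string for one topic; the term lists are illustrative and the exact query construction used in the experiments may have differed.

```python
def build_lucene_query(term_synonyms, fuzzy_terms, fuzzy_boost=1.0):
    """Build a Lucene query string.

    term_synonyms : dict mapping each Amharic source term to its list of
                    English translations (all senses and synonyms).
    fuzzy_terms   : out-of-dictionary terms to be matched fuzzily.
    fuzzy_boost   : 1.0 for the baseline run, 10.0 for the up-weighted run (4.3).
    """
    clauses = []
    for source, synonyms in term_synonyms.items():
        weight = 1.0 / len(synonyms)          # fractional weights summing to 1
        clauses.extend(f"{syn}^{weight:.3f}" for syn in synonyms)
    for term in fuzzy_terms:
        clause = f"{term}~"                   # edit distance based fuzzy match
        if fuzzy_boost != 1.0:
            clause += f"^{fuzzy_boost:g}"     # boost fuzzy terms (experiment 4.3)
        clauses.append(clause)
    return " ".join(clauses)

# Illustrative terms only (not taken from the actual topic set):
query = build_lucene_query(
    {"mrca": ["election", "poll"], "hagr": ["country", "nation", "state"]},
    fuzzy_terms=["bush"],
    fuzzy_boost=10.0,
)
print(query)
# election^0.500 poll^0.500 country^0.333 nation^0.333 state^0.333 bush~^10
```
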

5 Results

Table 1 lists the precision at various levels of recall for the four runs. A summary of the results from all runs is given in Table 2: the total number of relevant documents, the number of relevant documents retrieved, the non-interpolated average precision, and the precision after R (= number of relevant) documents retrieved (R-Precision).

Recall   full or   title or   plus full or   nofuzz full or
0.00     40.90     31.24      38.50          47.19
0.10     33.10     25.46      28.35          39.26
0.20     27.55     21.44      23.73          31.85
0.30     24.80     18.87      21.01          28.61
0.40     20.85     16.92      16.85          25.19
0.50     17.98     15.06      15.40          23.47
0.60     15.18     13.25      13.24          20.60
0.70     13.05     11.73      10.77          17.28
0.80     10.86      8.49       8.50          14.71
0.90      8.93      6.85       6.90          11.61
1.00      7.23      5.73       6.05           8.27

Table 1: Recall-precision table for the four runs

Run              Relevant-tot   Relevant-retrieved   Avg Precision   R-Precision
full or          1,258          751                  18.43           19.17
title or         1,258          643                  14.40           16.47
plus full or     1,258          685                  15.70           16.60
nofuzz full or   1,258          835                  22.78           22.83

Table 2: Summary of results for the four runs
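
For reference, the two summary measures in Table 2 can be computed from a ranked result list as in the sketch below. These are the standard definitions of the measures, not code from the experiments, and the example documents are invented.

```python
def average_precision(ranked_docs, relevant):
    """Non-interpolated average precision for one topic."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank          # precision at each relevant document
    return total / len(relevant) if relevant else 0.0

def r_precision(ranked_docs, relevant):
    """Precision after R documents retrieved, R = number of relevant documents."""
    r = len(relevant)
    if r == 0:
        return 0.0
    return sum(1 for doc in ranked_docs[:r] if doc in relevant) / r

# Tiny illustrative example (not from the actual runs):
ranked = ["d3", "d7", "d1", "d9"]
relevant = {"d3", "d1"}
print(average_precision(ranked, relevant))   # (1/1 + 2/3) / 2 = 0.8333...
print(r_precision(ranked, relevant))         # 1/2 = 0.5
```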

6 Discussion and Directives

We have been able to obtain better retrieval performance for Amharic than in the runs of the previous two years. Linguistically motivated approaches were added in the query analysis: the topic set was morphologically analyzed and POS tagged. Both the analyzer and the POS tagger were trained on a large Amharic news corpus and performed very well when used to analyze the Amharic topic set, although it should be noted that these tools have not been tested on other domains. The POS tags were used to remove non-content-bearing words, while the morphological analyzer was used to derive the citation forms of words. The morphological analysis ensured that various forms of a word were properly reduced to the citation form and looked up in the dictionary, rather than being missed and labeled as out-of-dictionary entries. Nevertheless, in the few cases where the analyzer segments a word wrongly, the results are very bad, since the translation of a completely unrelated word ends up in the keyword list. Especially for shorter queries, this can have a great effect. For example, in query C346, containing the phrase grand slam, the named entity slam was analyzed as s-lam, and during dictionary lookup cow was put in the keyword list, since that is the translation given for the Amharic word lam. We had below-median performance on such queries. On the other hand, stop word removal based on POS tags, keeping only nouns and noun phrases, worked well; manual investigation showed that the removed words are mainly non-content-bearing.

The experiment with no fuzzy matching, in which all cognates, names, and borrowed words were added manually, gave the highest result. Of the fully automatic experiments, the best result was obtained with the fully expanded, down-weighted queries using both the title and description fields, while the worst was obtained when only the title field was used. The experiment in which fuzzy matching words were boosted by a factor of 10 gave slightly worse results than the non-boosted experiment. The assumption there was that such words, being mostly names and borrowed words, tend to carry much more information than the rest of the words in the query. Although this may be intuitively appealing, there is room for boosting the wrong words: in such a large document collection, it is likely that unrelated words match fuzzily with those named entities. The decrease in performance in this experiment compared to the one without fuzzy match boosting could be due to the up-weighting of such words. Further experiments with different weighting schemes, as well as different levels of natural language processing, will be conducted in order to investigate the effects such factors have on retrieval performance.

References

[1] Berhanou Abebe. Dictionnaire Amharique-Français.
[2] Amsalu Aklilu. Amharic-English Dictionary.
[3] Jerker Hagman. Mining for Cognates. MSc thesis (forthcoming), Dept. of Computer and Systems Sciences, Stockholm University, 2006.
[4] Apache Lucene. http://lucene.apache.org/java/docs/index.html, 2005.
[5] D. Yacob. System for Ethiopic Representation in ASCII (SERA). http://www.abyssiniacybergateway.net/fidel/, 1996.