The effect of using a thesaurus in Arabic information retrieval system


ISSN (Online) 1694-0814, www.ijcsi.org

Mohammad Wedyan, Basim Alhadidi and Adnan Alrabea
Computer Science Department, Al-Balqa Applied University, Al-Salt, Jordan

Abstract

Automatic query expansion methods for text retrieval in English and other languages have been studied for a long time. In this research we study the retrieval effectiveness achieved when a successful automatic query expansion method, based on an automatic thesaurus, is applied to Arabic text retrieval. Our experiments show that the automatic query expansion method resulted in a notable improvement in Arabic text retrieval on a sample of abstracts of Arabic documents. The study showed that the use of a thesaurus improved the information retrieval system by 10%-20%. The study also shows that the greater the number of documents used to build the thesaurus, the more accurate the thesaurus.

Keywords: Arabic retrieval, thesaurus, stop words, indexing, information retrieval system.

1. Introduction

Arabic is the language of the Holy Quran, and it met all the requirements of Arabic and Islamic civilization at the peak of its flourishing. Arabic books in medicine and science were the main reference books for the West and in most of its important universities. [1] Internationally, it gained full acceptance and recognition and became an accredited language in UN institutions alongside the other five languages previously used. [1]

Arabic has many properties. First, the Arabic language consists of 28 letters, 16 of which have one, two or three dots. Second, writing is from right to left. Third, there are varying ways of writing: for example completely mashkool (all signs of tashkeel are used), partially mashkool, or not mashkool. Fourth, letters change their shape according to the place in which they occur. Fifth, the language is dual: formal and informal. Sixth, it has grammatical flexibility: words may be arranged in many different ways. [2]

Experimental results show that spelling normalization and stemming can significantly improve Arabic monolingual retrieval.
Character tri-grams from stems improved retrieval modestly on the test corpus, but the improvement is not statistically significant. [3] This study therefore examines the effect of using a thesaurus on the information retrieval system (IRS) and compares the improvement obtained with an automatic thesaurus against the traditional system.

2. Evaluating information retrieval systems

Any retrieval system is usually evaluated according to its efficiency and effectiveness. There are two aspects of efficiency: time and space. Time is the speed of matching the in-use queries with the document descriptions. Space is the disk space that the system needs. Effectiveness is determined by the ability of the system to return documents relevant to the user query. The ideal system returns all the files that are relevant to the query and never returns any irrelevant files. The difficulty lies in the determination of relevance, because determining the relevance of documents is a subjective process. [4] A person's decision depends on many factors, experience for example. A professional in a certain field may see the general information retrieved from a system as irrelevant, while an amateur (beginner) sees it as fully relevant. This may lead to increasing variation in the determination of relevance. In research, researchers usually treat the determination of relevance as an objective process. [4] We assume here that the evaluation process is objective and previously agreed on. The criteria most commonly used to evaluate the retrieval performance of a system are precision and recall. [4]
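As a rough illustration of these two measures: precision is the share of retrieved documents that are relevant, and recall is the share of relevant documents that are retrieved. The sketch below assumes that relevance judgments are available as a set of document identifiers per query; the function name and the document identifiers are hypothetical and do not come from the paper.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: document ids returned by the system
    relevant:  document ids judged relevant to the query
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 documents retrieved, 3 of them relevant,
# out of 6 relevant documents in the collection.
p, r = precision_recall(
    ["d1", "d2", "d3", "d7"],
    ["d1", "d2", "d3", "d4", "d5", "d6"],
)
print(p, r)
```

Returning 0.0 when nothing is retrieved, or when no document is relevant, is a convention chosen here to avoid division by zero.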
Precision measures the ability of the system to retrieve only the documents that are relevant to a query: [4]

\[ \text{Precision} = \frac{\text{number of relevant documents retrieved}}{\text{number of documents retrieved}} \]

Recall measures the ability of the system to retrieve all documents that are relevant to a query: [4]

\[ \text{Recall} = \frac{\text{number of relevant documents retrieved}}{\text{number of relevant documents in the collection}} \]

3. Indexing

Indexing is defined as the process of choosing a term or a number of terms that can represent what the document contains. These terms are called index terms. [3]

Indexing can be performed either manually (manual indexing) or by computer software (automatic indexing) [4]. Manual indexing has some weaknesses. The person who performs the indexing must have complete knowledge of what the document contains and what it is about. The result may vary because indexers differ in experience. This leads to increasing cost. [5] This research uses automatic indexing, so it will be our focus.

3-2 Automatic Indexing

The first step in indexing is lexical analysis: the process of converting the text into a group of separate words. Each word is called a token; a token is a group of letters. Lexical analysis is also the first step in query analysis [6]. Lexical analysis may also yield idioms that can be used as index terms, so that the suitable index term is assigned to reach the suitable document. [6] Then comes the process of separating out unnecessary words, called "stop words", such as (قد) and (هذا); they are repeated in all documents and texts. The importance of this step is discussed later in this study.

3-3 Eliminating Stop Words

Stop words are words that are repeated in every document, so they are weak discriminators: the content of a text cannot be distinguished by them. [5] Eliminating them has other benefits, such as "shortening the indexing structure" [7]; it makes processing faster and does not affect information retrieval or the degree of efficiency of the retrieval system. [6] It also avoids burdening the system with unnecessary information. [8]

It is not clear which words can be considered stop words and which cannot. Traditional methods consider words that are repeated many times to be stop words, but some words are repeated in a certain document and are nevertheless important "indexing terms". When the subject is more specialized, for example a collection specialized in databases, frequently repeated words such as "computer", "language" and "engines" are useless as index terms. [6] The other way is to save stop words in a list and then look up each token produced by lexical analysis in that list; if the token is in the list, it is ignored and not processed later. [6] Arabic is very rich in lexical tokens, which means stop words are available in large quantities. [8]

Swaie lists several characteristics of stop words in his book. First, they have no meaning when used separately. Second, they appear many times in a text. Third, they are necessary for the construction of the language. Fourth, they are mostly adjectives. Fifth, they are general words, not particular to a certain field. Sixth, no researcher asks about such words. Seventh, they never form a full sentence when used alone.

4. Thesaurus

A thesaurus is an efficient tool in IRS, especially in modern systems, both in indexing and in searching, where it helps in extending queries by using more suitable tokens. [4] Constructing thesauruses has a great benefit in IRS: it strengthens precision and the control of terminology in document processing, indexing and retrieval, promotes the use of the best terms, and helps the user to reformulate queries if necessary [6]. Put simply, a thesaurus consists of a list of the important words of a certain subject, where each word is connected with other words in the list. [7] Most thesauruses in use have been built manually, relying on experts in certain fields or on experts in document description.
Building thesauruses manually is a waste of time and money, and the result may be subjective, because the person who builds it may make personal choices that affect the construction of the thesaurus. We therefore need automatic construction of thesauruses, which saves time, effort and cost and makes the results more objective and easy to modify in the future [4]. Taking this into consideration, we build an automatic thesaurus, which has many benefits over the manual one [7]. It supports a standard vocabulary in indexing and searching, it helps the user to find suitable expressions for queries, and it supports different hierarchies, as it allows broadening or narrowing the query according to the user's needs.

4.1 Automatic Thesaurus Construction

In vector space models, documents are represented by vectors as below:

\[ D_j = (w_{1,j}, w_{2,j}, w_{3,j}, \ldots, w_{t,j}) \]

where \(t\) is the total number of index terms, \(w\) is a weight and \(D_j\) is the vector for document \(j\). The weights can be computed by the following equation [7]:

\[ w_{i,j} = f_{i,j} \times \log\frac{N}{n_i} \]

where \(w_{i,j}\) is the weight of term \(i\) in document \(j\), \(N\) is the number of documents in the system, and \(n_i\) is the number of documents in which term \(i\) appears. \(f_{i,j}\) is the normalized frequency, computed by [7]:

\[ f_{i,j} = \frac{freq_{i,j}}{\max_l freq_{l,j}} \]

where \(freq_{i,j}\) is the number of times term \(i\) appears in the text of document \(j\), and the maximum is computed over all terms mentioned in the text of document \(d_j\).

The vectors of a group of documents can be represented as the matrix in Figure (1), with one row per document \(D_1, \ldots, D_m\) holding that document's weights.

Figure (1) Document vectors

Then comes the step of calculating the similarity between index terms, using any of the similarity equations. For this step the weights are arranged as in Figure (2), with one row per index term and one column per document \(D_1, \ldots, D_m\), and the cosine similarity is computed between rows:

\[ S_{j,k} = \frac{\sum_i w_{i,j}\, w_{i,k}}{\sqrt{\sum_i w_{i,j}^2}\,\sqrt{\sum_i w_{i,k}^2}} \]

Here \(j\) and \(k\) are the two index terms being compared and the sum runs over the documents \(i\), so the subscripts are read in the document-by-term order of Figure (1).

Figure (2) Computing the term-term similarity

Applying this equation to each pair of index terms yields the matrix in Figure (3), where \(S_{m,n}\) represents the similarity between term \(n\) and term \(m\).

Figure (3) The term-term similarity

The similarity matrix is symmetric, because the similarity between (Tx) and (Ty) equals the similarity between (Ty) and (Tx).

5. Related Work

Despite the very limited Arabic efforts in developing thesauruses, theoretical efforts have supported and opened new paths for building Arabic thesauruses. The first trials in this field were translations of foreign thesauruses; examples are the list of Arabic idioms prepared by the Industrial Development Center for the Arab World in 1970, and the Islamic thesaurus, which was built manually [9].

There are some studies in IRS and in building thesauruses. Abu Salem (1992), for example, studied IR in the Arabic language. His study was based on 120 documents he received from the Saudi Arabian National Computer Conference and on 32 queries. In his research, he studied indexing using full words and using roots only. He found that using the roots is superior to other ways.
He also built a manual thesaurus using the relations between expressions to test the possibility of supporting an IRS through this thesaurus. He found that the thesaurus makes IR much better.

The General Thesaurus, presented by the UN Aid Program (the Program of Authorization in the Arabic World, 2003), initially uses synonyms that help the researcher to choose the expressions to search for. This thesaurus also includes hierarchical relations (origin and branches) and contextual relations between expressions. This helps in broadening the search: if the search has no matches when using a certain expression, the researcher can use either broader terms or narrower ones. Synonyms are the first step in this thesaurus.

Kanaan and Wedyan (2006) based their study on 242 documents they received from the Saudi Arabian National Computer Conference and on 24 queries. They studied indexing using full words and using roots, and found that using the roots is superior to other ways. They also built an automatic thesaurus using the relations between expressions to test the possibility of supporting an IRS through a thesaurus. They found that the thesaurus improves IR by between 1% and 10%.

6. Conclusions

This study aims at reinforcing IRS for Arabic. It was based on 500 documents and 35 queries. The documents were given to a group of students with some connection to the subjects, who determined the documents relevant to each query. Based on the students' judgments, the results were analyzed using the criteria of precision and recall and the smoothing algorithm used by Abu Salem (1992) and by Kanaan (1997). Average recall precision was calculated.

Recall    Average recall precision    Average recall precision    Improvement (%)
          without thesaurus           with thesaurus
0.1       0.706                       0.884                       17.8
0.2       0.7                         0.872                       17.2
0.3       0.63                        0.81                        18
0.4       0.45                        0.607                       15.7
0.5       0.352                       0.498                       14.6
0.6       0.25                        0.38                        13
0.7       0.18                        0.305                       12.5
0.8       0.09                        0.198                       10.8
0.9       0.05                        0.151                       10.1
1         0.021                       0.121                       10

Table (1) Improvement in the results when the thesaurus is used.

Figure (4) Comparison of the values of average recall precision when full words are used with and without the thesaurus.

The previous chart shows the effect of using the thesaurus on the effectiveness of the system based on whole words, measured by average recall precision. When the thesaurus was used, the results were better.
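The Improvement (%) row of Table (1) can be checked with a few lines of arithmetic. The sketch below reads that row as the absolute difference between the two average recall precision values, expressed in percentage points; this reading is an assumption, since the paper does not state the formula, but it reproduces every value listed in the table. The variable names are hypothetical.

```python
# Values copied from Table (1), recall levels 0.1 through 1.0.
without_thesaurus = [0.706, 0.7, 0.63, 0.45, 0.352, 0.25, 0.18, 0.09, 0.05, 0.021]
with_thesaurus = [0.884, 0.872, 0.81, 0.607, 0.498, 0.38, 0.305, 0.198, 0.151, 0.121]

# Assumed reading: improvement = (with - without) * 100, in percentage points.
improvement = [
    round((w - wo) * 100, 1)
    for w, wo in zip(with_thesaurus, without_thesaurus)
]
mean_improvement = round(sum(improvement) / len(improvement), 2)

print(improvement)
print(mean_improvement)
```

Under this reading the improvement ranges from 10 to 18 percentage points across the recall levels, which is in line with the 10%-20% range reported in the abstract.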
This agrees with what Hani Abu Salem (1992) and Kanaan (2006) found, namely that the use of a thesaurus in Arabic improves the effectiveness of the Arabic IRS when full words are used. Increasing the number of documents used to build the thesaurus also improves the result: Kanaan and Wedyan (2006) used 242 documents to build their thesaurus, while in this study we used the same equations but 500 documents. This study may be extended to other similarity equations, such as Jaccard and Dice, or applied to a much larger number of documents. Users can also be involved in feeding the system in order to obtain a high-precision thesaurus.

References

[1] Khatib, Ahmed Shafiq, 1997, Terminological specifications and applications in the Arabic language, Fifteenth Cultural Season of the Arabic Language Academy of Jordan, Amman, Jordan, pp. 177-213. (Arabic)
[2] Ali, Nabil, 1988, Arabic and computer, localization, Cairo. (Arabic)
[3] Xu, J., Fraser, A., and Weischedel, R., 2002, Empirical studies in strategies for Arabic retrieval, Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Tampere, Finland, ACM, pp. 269-274.
[4] Lassi, M., 2002, Automatic Thesaurus Construction, University College of Borås.
[5] Salton, G., and McGill, M., 1983, Introduction to Modern Information Retrieval, McGraw-Hill, New York.
[6] Frakes, W., and Baeza-Yates, R., 1992, Information Retrieval: Data Structures & Algorithms, PTR Prentice Hall, New Jersey.

[7] Baeza-Yates, R., and Ribeiro-Neto, B., 1999, Modern Information Retrieval, Addison-Wesley, New York.
[8] Soaa, Ali Suleiman, 1994, Information retrieval in the Arabic language, King Fahd National Library. (Arabic)
[9] Abdul-Jabbar, Abdul Rahman, 1993, The use of a system consultant in building thesauruses, Scientific Record of the Symposium on the Use of Arabic in Information Technology, organized by the King Abdul Aziz Public Library, Riyadh, Saudi Arabia. (Arabic)
[10] Abu Salem, H., 1992, A Microcomputer Based Arabic Bibliographic Information Retrieval System with Relational Thesauri, Ph.D. Thesis, University of Illinois, Chicago, USA.
[11] Kanaan, G., 1997, Comparing Automatic Statistical and Syntactic Phrase Indexing for Arabic Information Retrieval, Ph.D. Thesis, University of Illinois, Chicago, USA.
[12] Kanaan, G., and Wedyan, M., 2006, Constructing an Automatic Thesaurus to Enhance Arabic Information Retrieval System, The 2nd Jordanian International Conference on Computer Science and Engineering, JICCSE, Salt, Jordan, pp. 89-97.
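The thesaurus construction of Section 4.1 and its use for query expansion can be sketched in a few functions. This is a minimal illustration under several assumptions that the paper does not state: tokens are split on whitespace with no stemming, the logarithm is natural, two terms count as related when their cosine similarity reaches a chosen threshold, and a query is expanded by appending the related terms. All function names, the threshold value and the toy English documents are hypothetical; only the two stop words are taken from the text.

```python
import math
from collections import Counter

# The two stop words quoted in Section 3-2; a real list would be far longer.
STOP_WORDS = {"قد", "هذا"}


def tokenize(text):
    """Lexical analysis: split on whitespace, then drop stop words."""
    return [tok for tok in text.split() if tok not in STOP_WORDS]


def term_vectors(docs):
    """Map each term i to its weights w_ij = f_ij * log(N / n_i) over all documents."""
    counts = [Counter(tokenize(d)) for d in docs]
    n_docs = len(docs)
    terms = sorted({t for c in counts for t in c})
    vectors = {}
    for t in terms:
        n_i = sum(1 for c in counts if t in c)  # documents containing term t
        idf = math.log(n_docs / n_i)
        vectors[t] = [
            (c[t] / max(c.values())) * idf if c else 0.0  # normalized frequency * idf
            for c in counts
        ]
    return vectors


def cosine(u, v):
    """Cosine similarity of two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_thesaurus(docs, threshold):
    """Link each term to the other terms whose similarity reaches the threshold."""
    vectors = term_vectors(docs)
    return {
        t: [u for u in vectors if u != t and cosine(vectors[t], vectors[u]) >= threshold]
        for t in vectors
    }


def expand_query(query, thesaurus):
    """Append the thesaurus entries of each query token, without duplicates."""
    tokens = tokenize(query)
    expanded = list(tokens)
    for tok in tokens:
        for related in thesaurus.get(tok, []):
            if related not in expanded:
                expanded.append(related)
    return expanded


docs = [
    "thesaurus query expansion",
    "thesaurus query retrieval",
    "stemming root arabic",
    "stemming root retrieval",
]
thesaurus = build_thesaurus(docs, threshold=0.6)
print(expand_query("thesaurus", thesaurus))
```

In this toy collection "thesaurus" and "query" always occur together, so they are linked with similarity 1, while "retrieval" is shared between the two topics and falls below the threshold. A term that appears in every document receives weight zero from the logarithm and is therefore never linked to anything.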