ISSN (Online) 1694-0814 www.ijcsi.org 431

The effect of using a thesaurus in Arabic information retrieval system

Mohammad Wedyan, Basim Alhadidi and Adnan Alrabea
Computer Science Department, Al-Balqa Applied University, Al-Salt, Jordan

Abstract

Automatic query expansion methods for text retrieval in English and other languages have been studied for a long time. In this research we study the retrieval effectiveness achieved when a successful automatic query expansion method, based on an automatic thesaurus, is applied to Arabic text retrieval. Our experiments, run on a sample of abstracts of Arabic documents, show that the automatic query expansion method resulted in a notable improvement in Arabic text retrieval. The use of a thesaurus improved the information retrieval system by 10% to 20%. The study also shows that the greater the number of documents used in building the thesaurus, the more accurate the thesaurus.

Keywords: Arabic retrieval, thesaurus, stop words, indexing, information retrieval system.

1. Introduction

Arabic is the language that carries the miracle of the Holy Quran, and it met all the requirements of Arab and Islamic civilization at the peak of its flourishing. Arabic books in medicine and science were the main reference books in the West and in most of its important universities [1]. Internationally, Arabic has gained full acceptance and recognition and has become an accredited language in UN institutions, alongside the five languages previously used [1].

Arabic has many properties [2]. First, the Arabic alphabet consists of 28 letters, 16 of which carry one, two or three dots. Second, writing runs from right to left. Third, there are varying ways of writing: a text may be completely mashkool (all signs of tashkeel are used), partially mashkool, or not mashkool. Fourth, letters change their shape according to their position in the word. Fifth, the language is dual, with a formal and an informal form. Sixth, its grammar is flexible: words may be arranged in many different ways.

Experimental results show that spelling normalization and stemming can significantly improve Arabic monolingual retrieval.
Character tri-grams from stems improved retrieval modestly on the test corpus, but the improvement is not statistically significant [3]. This study therefore examines the effect of using a thesaurus on the information retrieval system (IRS), and compares the results obtained with an automatic thesaurus against those of the traditional system.

2. Evaluating information retrieval systems

Any retrieval system is usually evaluated according to its efficiency and its effectiveness. Efficiency has two aspects: time and space. Time is the speed of matching the queries in use against the document descriptions. Space is the disk space that the system needs. Effectiveness is determined by the ability of the system to return documents relevant to the user query. The ideal system returns all the documents that are relevant to the query and never returns an irrelevant one. The difficulty lies in determining relevance, because judging the relevance of documents is a subjective process [4]. A person's decision depends on many factors, experience for example. A professional in a certain field may see the general information retrieved from a system as irrelevant, while an amateur (beginner) sees it as fully relevant. This adds to the difficulty of determining relevance. In research, however, the determination of relevance is usually treated as an objective process [4]. We assume here that the evaluation process is objective and agreed on in advance.

The criteria used to evaluate the performance of a system are precision and recall, the most commonly used measurements of retrieval performance. Precision means the ability of the system to return documents that are relevant to the query [4].
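As a minimal sketch of how these two criteria are computed for a single query, assume that the system output and the relevance judgments are both available as collections of document identifiers. Precision is then the share of retrieved documents that are relevant, and recall is the share of relevant documents that were retrieved. The function name and the document ids below are illustrative assumptions, not part of the study.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: iterable of document ids returned by the system
    relevant:  iterable of document ids judged relevant to the query
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant  # relevant documents that were retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical judgments for a single query (illustrative ids only).
retrieved_docs = ["d1", "d2", "d3", "d4"]
relevant_docs = ["d2", "d4", "d7", "d9", "d10"]
p, r = precision_recall(retrieved_docs, relevant_docs)
print(f"precision={p:.2f} recall={r:.2f}")  # prints precision=0.50 recall=0.40
```

Here two of the four retrieved documents are relevant, and two of the five relevant documents were found.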
Precision measures the ability of the system to retrieve only the documents that are relevant to a query [4]:

\[ \text{Precision} = \frac{\text{number of relevant documents retrieved}}{\text{number of documents retrieved}} \]

Recall measures the ability of the system to retrieve all documents that are relevant to a query [4]:

\[ \text{Recall} = \frac{\text{number of relevant documents retrieved}}{\text{number of relevant documents in the collection}} \]

3. Indexing

Indexing is defined as the process of choosing a term or a number of terms that can represent what the document contains. These terms are called index terms [3].
Indexing can be performed either manually (manual indexing) or by computer software (automatic indexing) [4]. Manual indexing has some weaknesses. The person who performs the indexing must have complete knowledge of what the document contains and what it talks about. The result may vary with the different experience of the indexers. All of this increases cost [5]. This research uses automatic indexing, so it will be our focus.

3.2 Automatic Indexing

The first step in indexing is lexical analysis: the process of turning the text into a group of separate words. Each word is called a token, and a token is a group of letters. Lexical analysis is also the first step in query analysis [6]. Lexical analysis may also yield expressions that can be used as index terms, so that a suitable index term is assigned to reach the suitable document [6]. Next comes the removal of unnecessary words, called "stop words", such as (قد) and (هذا), which are repeated in all documents and texts. The importance of this step is discussed later in this study.

3.3 Eliminating Stop Words

Stop words are words that are repeated in every document, so they are too weak to discriminate: we cannot distinguish the content of a text by relying on them [5]. Eliminating them has other benefits: it shortens the indexing structure [7], and it makes processing faster without harming retrieval or the recall of the system [6]. It also avoids burdening the system with unnecessary information [8].

It is not clear which words can be considered stop words and which cannot. Traditional methods treat words that are repeated many times as stop words, but some words are repeated in a certain document and are nevertheless important index terms. Matters are different when the subject matter is more specialized, for example a collection devoted to databases.
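The lexical analysis step and the frequency-based view of stop words described above can be sketched as follows. This is only an illustration under stated assumptions: the regular-expression tokenizer, the `min_doc_ratio` threshold and the three toy documents are introduced here and are not taken from the study, and the text is assumed to be written without tashkeel.

```python
import re
from collections import Counter


def tokenize(text):
    """Lexical analysis: split text into tokens (runs of word characters)."""
    return re.findall(r"\w+", text)


def stop_word_candidates(documents, min_doc_ratio=1.0):
    """Return tokens that occur in at least min_doc_ratio of the documents."""
    doc_freq = Counter()
    for text in documents:
        doc_freq.update(set(tokenize(text)))  # count each token once per document
    threshold = min_doc_ratio * len(documents)
    return {tok for tok, df in doc_freq.items() if df >= threshold}


# Toy collection (hypothetical); the word "هذا" occurs in every document.
docs = [
    "هذا نظام استرجاع المعلومات",
    "هذا المكنز يحسن استرجاع النصوص",
    "هذا بحث عن الفهرسة",
]
candidates = stop_word_candidates(docs)
print(candidates)
```

With the default threshold only the word that appears in every document is flagged. Lowering `min_doc_ratio` also flags "استرجاع", which is frequent but is a meaningful index term, and this is exactly the ambiguity noted in the text.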
In a specialized collection, words that are repeated throughout, even simple ones such as "computer", "language" or "engines", are useless as index terms [6]. The other way is to save the stop words in a list and then look up each token that results from lexical analysis in that list; if the token is in the list, it is ignored and not processed further [6]. Arabic is very rich in lexical tokens, which means that stop words are available in large quantities [8].

Swaine lists several characteristics of stop words in his book. First, they have no meaning when used separately. Second, they appear many times in a text. Third, they are necessary for the construction of the language. Fourth, they are mostly adjectives. Fifth, they are general words, not particular to a certain field. Sixth, no researcher asks about such words. Seventh, they never form a full sentence when used alone.

4. Thesaurus

A thesaurus is an efficient tool in IRS, especially in modern systems, both in indexing and in searching, where it helps to extend queries with more suitable tokens [4]. Constructing thesauruses is of great benefit to IRS: it strengthens precision and the control of terms in document indexing and retrieval, promotes the use of the best terms, and helps the user to reformulate queries if necessary [6]. Put simply, a thesaurus consists of a list of the important words of a certain subject, in which each word is connected with other words in the list [7]. Most thesauruses in use have been built manually, relying on experts in certain fields or on experts in document description.
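To make the role of a thesaurus in extending queries concrete, here is a minimal sketch. It assumes the thesaurus is stored as a mapping from a term to its related terms, ordered from most to least related; the entries and the `expand_query` helper are hypothetical and only illustrate the idea.

```python
def expand_query(query_terms, thesaurus, max_related=2):
    """Append up to max_related thesaurus terms for every query term."""
    expanded = list(query_terms)
    for term in query_terms:
        for related in thesaurus.get(term, [])[:max_related]:
            if related not in expanded:  # avoid duplicate query terms
                expanded.append(related)
    return expanded


# Hypothetical thesaurus entries (term -> related terms, best first).
thesaurus = {
    "retrieval": ["search", "indexing", "query"],
    "thesaurus": ["synonym", "vocabulary"],
}
print(expand_query(["arabic", "retrieval"], thesaurus))
# prints ['arabic', 'retrieval', 'search', 'indexing']
```

Terms without an entry, such as "arabic" here, are passed through unchanged, so the expanded query always contains the original one.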
Building thesauruses manually costs a great deal of time and money, and the result may also be subjective, because the person who builds the thesaurus may follow his own choices, which affects its construction. We therefore need automatic thesaurus construction, which saves time, effort and cost and makes the results more objective and easier to modify in the future [4].

Taking the above into consideration, we build an automatic thesaurus, which has many benefits over the manual one [7]. It supports a standard vocabulary in indexing and in searching. It helps the user to put down suitable expressions in queries. It supports different hierarchies, as it allows broadening or narrowing the query according to the user's needs.

4.1 Automatic Thesaurus Construction

In vector space models, documents are represented by vectors as below:

\[ D_j = (W_{1,j}, W_{2,j}, W_{3,j}, \ldots, W_{t,j}) \]

where $t$ is the total number of index terms, $W$ is a weight and $D_j$ is the vector for document $j$. The weights can be computed by the following equation [7]:

\[ W_{i,j} = f_{i,j} \times \log \frac{N}{n_i} \]

where $W_{i,j}$ is the weight of term $i$ in document $j$, $N$ is the number of documents in the system, and $n_i$ is the number of documents in which term $i$ appears. $f_{i,j}$ is the normalized frequency, computed by
\[ f_{i,j} = \frac{freq_{i,j}}{\max_{l} freq_{l,j}} \]

[7] where $freq_{i,j}$ is the number of times term $i$ appears in the text of document $j$, and the maximum is computed over all terms $l$ that are mentioned in the text of document $d_j$. The vectors of a group of documents can be represented as follows:

Figure (1) Documents Vectors (rows: the documents D_1, D_2, D_3, ..., D_m; entries: the weights W of the index terms in each document)

Then comes the step of calculating the similarity between index terms, using any of the similarity equations. Here the cosine similarity is applied to the weights of each pair of terms across the documents:

Figure (2) compute the term-term similarity (rows: index terms; columns: the documents D_1, D_2, D_3, ..., D_m; the cosine similarity S_{j,k} is computed between pairs of rows)

\[ S_{j,k} = \frac{\sum_{d=1}^{N} W_{j,d}\, W_{k,d}}{\sqrt{\sum_{d=1}^{N} W_{j,d}^{2}}\; \sqrt{\sum_{d=1}^{N} W_{k,d}^{2}}} \]

where $j$ and $k$ are index terms, the sums run over the documents $d$, and $W_{j,d}$ is the weight of term $j$ in document $d$ as defined above. Calculating the similarity between every pair of index terms produces a matrix like the following:

\[ \begin{array}{c|cccc} & T_1 & T_2 & T_3 & \cdots \\ \hline T_1 & S_{11} & S_{12} & S_{13} & \cdots \\ T_2 & S_{21} & S_{22} & S_{23} & \cdots \\ T_3 & S_{31} & S_{32} & S_{33} & \cdots \\ \vdots & \vdots & \vdots & \vdots & \\ T_m & S_{m1} & S_{m2} & S_{m3} & \cdots \end{array} \]

Figure (3) The term-term similarity

$S_{n,m}$ represents the similarity between term $n$ and term $m$. The similarity matrix is symmetric, because the similarity between $T_x$ and $T_y$ equals the similarity between $T_y$ and $T_x$.

5. RELATED WORK

Although Arabic efforts in developing thesauruses have been very few, the theoretical efforts, limited as they are, supported and opened new paths for building Arabic thesauruses. The first trials in this field were translations of foreign thesauruses; examples are the list of Arabic idioms prepared by the Industrial Development Center for the Arab World in 1970, and the Islamic thesaurus, which was built manually [9].

Several studies have addressed IRS and thesaurus building. Abu Salem (1992), for example, studied IR in the Arabic language. His study was based on 120 documents that he received from the Saudi Arabia National Computer Conference and on 32 queries. In his research he studied indexing using full words and using roots only. He found that using roots is superior to using full words.
He also built a manual thesaurus based on the relations between expressions, to test the possibility of supporting an IRS through this thesaurus. He found that the thesaurus makes IR much better.

The General Thesaurus was presented by the UN Aid Program, the Program of Authorization in the Arabic World (2003). It starts from synonyms, which help the researcher to choose the expressions to search for. This thesaurus also includes the relations of origin and branches (broader and narrower terms) and the contextual relations between expressions. This helps in broadening the search: if the search has no
matches when using a certain expression, the researcher can use either broader terms or narrower ones. Synonyms are the first step in this thesaurus.

Kanaan and Wedyan (2006) based their study on 242 documents that they received from the Saudi Arabia National Computer Conference and on 24 queries. In their research they studied indexing using full words and using roots. They found that using roots is superior to using full words. They also built an automatic thesaurus based on the relations between expressions, to test the possibility of supporting an IRS through a thesaurus. They found that the thesaurus improves IR by between 1% and 10%.

6. Conclusions

This study aims at reinforcing Arabic IRS. It was based on 500 documents and 35 queries. The documents were given to a group of students familiar with the subjects concerned, who determined the documents relevant to each query. Based on the students' judgments, the results were analyzed using the criteria of precision and recall and the smoothing algorithm that was used by Abu Salem (1992) and by Kanaan (1997). The average recall precision was then calculated.

Recall    Average recall precision    Average recall precision    Improvement (%)
          without thesaurus           with thesaurus
0.1       0.706                       0.884                       17.8
0.2       0.700                       0.872                       17.2
0.3       0.630                       0.810                       18.0
0.4       0.450                       0.607                       15.7
0.5       0.352                       0.498                       14.6
0.6       0.250                       0.380                       13.0
0.7       0.180                       0.305                       12.5
0.8       0.090                       0.198                       10.8
0.9       0.050                       0.151                       10.1
1.0       0.021                       0.121                       10.0

Table (1) Average recall precision with and without the thesaurus, showing how much better the results were when the thesaurus was used. The improvement is the difference between the two values, expressed in percentage points.

Figure (4) A comparison between the values of average recall precision when full words were used with and without the thesaurus, showing how much better the results were when full words were used with the thesaurus.

The previous chart shows the effect of using the thesaurus on the performance of the system based on whole words, as measured by the criterion of average recall precision. When the thesaurus was used, the results were better.
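The thesaurus behind these results is built from the term weights and the cosine similarity of Section 4.1. A minimal sketch of that construction on toy data follows. It is not the authors' code: the three tokenized documents and the function names are assumptions made here for illustration, the natural logarithm is assumed because the source does not state the base, and every document is assumed to be non-empty.

```python
import math
from collections import Counter


def term_weights(docs):
    """Return {term: [W_{i,j} for each document j]} with W = f_{i,j} * log(N / n_i)."""
    n_docs = len(docs)
    counts = [Counter(doc) for doc in docs]  # freq_{i,j} for every document j
    terms = sorted(set().union(*docs))
    doc_freq = {t: sum(1 for c in counts if t in c) for t in terms}  # n_i
    weights = {}
    for t in terms:
        idf = math.log(n_docs / doc_freq[t])
        # f_{i,j} is the count divided by the largest count in the same document
        weights[t] = [(c[t] / max(c.values())) * idf for c in counts]
    return weights


def cosine(u, v):
    """Cosine similarity of two weight vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def term_term_similarity(weights):
    """Return {(term_j, term_k): S_{j,k}} for every pair of terms."""
    return {(j, k): cosine(weights[j], weights[k]) for j in weights for k in weights}


# Toy, already tokenized documents (hypothetical data).
docs = [
    ["thesaurus", "query", "expansion", "query"],
    ["thesaurus", "query", "indexing"],
    ["stemming", "indexing", "roots"],
]
weights = term_weights(docs)
sim = term_term_similarity(weights)
print(round(sim[("query", "thesaurus")], 3))  # prints 0.949
print(round(sim[("query", "stemming")], 3))   # prints 0.0
```

Terms that occur in the same documents, such as "query" and "thesaurus", receive a high similarity, while terms that never co-occur receive zero. The most similar terms of each term would then form its thesaurus entry.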
This agrees with what Hani Abu Salem (1992) and Kanaan and Wedyan (2006) reported, namely that the use of a thesaurus improves the performance of the Arabic IRS when full words are used. Moreover, when the number of documents used to build the thesaurus increases, the results improve. Kanaan and Wedyan (2006) used 242 documents to build their thesaurus; in this study we used the same equations but built our thesaurus from 500 documents. This study may be repeated with other similarity equations, such as Jaccard and Dice, or applied to a much larger number of documents. The user can also be involved in feeding the system in order to obtain a high-precision thesaurus.

References

[1] Khatib, Ahmed Shafiq, 1997, Terminological specifications and applications in the Arabic language, Fifteenth Cultural Season of the Arabic Language Academy of Jordan, Amman, Jordan, pp. 177-213. (Arabic)
[2] Ali, Nabil, 1988, Arabic and computer, localization, Cairo. (Arabic)
[3] Xu, J., Fraser, A., and Weischedel, R., 2002, Empirical studies in strategies for Arabic retrieval, Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval, Tampere, Finland, ACM, pp. 269-274.
[4] Lassi, M., 2002, Automatic Thesaurus Construction, University College of Borås.
[5] Salton, G., and McGill, M., 1983, Introduction to Modern Information Retrieval, McGraw-Hill, New York.
[6] Frakes, W., and Baeza-Yates, R., 1992, Information Retrieval: Data Structures & Algorithms, P T R Prentice Hall, New Jersey.
[7] Baeza-Yates, R., and Ribeiro-Neto, B., 1999, Modern Information Retrieval, Addison-Wesley, New York.
[8] Soaa, Ali Suleiman, 1994, Information retrieval in the Arabic language, King Fahd National Library. (Arabic)
[9] Abdul-Jabbar, Abdul Rahman, 1993, The use of a system consultant in building thesauruses, scientific record of the Symposium on the use of Arabic in Information Technology organized by the King Abdul Aziz Public Library, Riyadh, Saudi Arabia. (Arabic)
[10] Abu Salem, H., 1992, A Microcomputer Based Arabic Bibliographic Information Retrieval System with Relational Thesauri, Ph.D. Thesis, University of Illinois, Chicago, USA.
[11] Kanaan, G., 1997, Comparing Automatic Statistical and Syntactic Phrase Indexing for Arabic Information Retrieval, Ph.D. Thesis, University of Illinois, Chicago, USA.
[12] Kanaan, G., and Wedyan, M., 2006, Constructing an Automatic Thesaurus to Enhance Arabic Information Retrieval System, The 2nd Jordanian International Conference on Computer Science and Engineering, JICCSE, Salt, Jordan, pp. 89-97.