PROBABILISTIC ARABIC PART OF SPEECH TAGGER WITH UNKNOWN WORDS HANDLING

Journal of Theoretical and Applied Information Technology, 31st August 2016. Vol.90. No.2. 2005-2016 JATIT & LLS. All rights reserved. ISSN: 1992-8645 www.jatit.org E-ISSN: 1817-3195
1 Mohammed Albared, 2 Tareq Al-Moslmi, 3 Nazlia Omar, 4 Adel Al-Shabi, 5 Fadl Mutaher Ba-Alwi
2,3,4 Centre for Artificial Intelligence Technology, Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, 43600 Bangi, Selangor, Malaysia
1,5 Faculty of Computer and Information Technology, Sana'a University, Yemen
E-mail: 1 mohammed_albared@yahoo.com, 2 tareq.almoslmi@gmail.com, 3 nazlia@ukm.edu.my, 4 adel.alshabi@gmail.com, 5 dr.fadlbaalwi@gmail.com
ABSTRACT
Part of Speech (POS) tagging is an essential preprocessing step in many natural language applications. In this paper, we investigate the best configuration of a trigram Hidden Markov Model (HMM) Arabic POS tagger when only a small tagged corpus is available. With small training data, unknown word POS guessing is the main problem. This problem becomes more serious in languages that have a huge vocabulary and a rich and complex morphology, like Arabic. To handle this problem in an Arabic POS tagger, we have studied the effect of integrating a lexicon-based morphological analyzer to improve the performance of the tagger. Moreover, in this work, several lexical models have been empirically defined, implemented and evaluated. These models are based essentially on the internal structure and the formation process of Arabic words. Furthermore, several combinations of these models have been presented. The POS tagger has been trained with a training corpus of 29300 words and it uses a tagset of 24 different POS tags. Our system achieves state-of-the-art overall accuracy in Arabic part of speech tagging and outperforms other Arabic taggers in unknown word POS tagging accuracy.
Keywords: Part of Speech Tagger, Arabic Language, Unknown Word Guessing.
1. INTRODUCTION
Part of speech disambiguation is the ability to computationally determine which part of speech of a word is activated by its use in a particular context. Automatic text tagging is an important preprocessing step in many NLP applications such as information extraction, question answering and machine translation. POS tagging is a nontrivial problem. It cannot rely exclusively on a lexicon, because of morphosyntactic ambiguity and the existence of unknown words, that is, words that have not been previously seen in the annotated training set. Unknown words are a major problem in any tagging system and always decrease its performance. The accuracy of part-of-speech (POS) tagging for unknown words is significantly lower than that for known words. The processing of unknown words is important for several reasons. First, unknown words play a more important role in the meaning of a sentence than known words; unknown words are specialized words and hold more semantic information than known words [1]. This is because most unknown words belong to open POS classes such as nouns and verbs and are unlikely to be in the closed classes such as particles. Second, the performance of a POS tagger on unknown words is a measure of its robustness and reliability, that is, its ability to tag documents from different domains or language varieties without a substantial decrease in performance. Finally, improving unknown word tagging contributes to the overall accuracy of the POS tagger. For these reasons, properly tagging unknown words is important, so that the information they carry can be used correctly in later steps of an NLP system. Unknown word POS tagging is a substantial problem in Arabic POS tagging for several reasons. First, there is a lack of large, free, publicly available annotated corpora. Second, Arabic is one of the richest languages in terms of vocabulary [2]. In the DIINAR.1 resource, the effective number of simple word forms is 7,774,938 [3].
As a result, to design a reliable and robust statistical POS tagger, we need an extremely large annotated corpus. Third, Arabic is an inflected language with a rich and complex morphology. Finally, there is orthographic ambiguity; the form of certain letters in Arabic script allows

suboptimal orthographic variants of the same word to coexist in the same text. In this work, we employ the well-known trigram HMM POS tagging architecture for tagging Arabic text; our baseline tagger implementation is influenced by Brants [4]. During the implementation of our baseline tagger, we observed that suffix guessing does not perform well on Arabic unknown words. This is due to the limited size of the training data and to the characteristics of the language. So, to cope with the unknown word problem, we first study how information supplied by a lexicon-based Morphological Analyzer (MA) can be used to improve the accuracy of the system. Then, we define, implement and evaluate several lexical models based on the internal structure of the Arabic word, i.e. the word formation process. Experimental results show that the proposed approaches achieve very encouraging results, although the training is performed on a very small corpus. The rest of the paper is organized as follows. Sec. 2 discusses related work. Sec. 3 and Sec. 4 describe our tagset and corpus. Sec. 5 gives necessary details about Arabic word formation. Sec. 6 describes our baseline HMM tagger. In Sec. 7 we discuss the modifications that better handle unknown word POS tagging in Arabic text. Sec. 8 gives experimental results. Finally, conclusions and future work appear in Sec. 9.
2. RELATED WORK
Research on POS tagging has a long history. Numerous approaches have been successfully applied to POS tagging. The POS tagging techniques in the literature can be classified into the following: Rule-based POS tagging, which is based on a lexicon and a set of disambiguation rules.
Supervised POS tagging: these approaches use machine-learning techniques to learn a classifier from labeled training sets, such as the Maximum Entropy Model [5], Hidden Markov Model [4], [6], Conditional Random Field [7], Cyclic Dependency Networks [8] and Support Vector Machine [9], [10]. Unsupervised POS tagging: these approaches do not require pre-tagged training data, but rely on dictionary information. Previous work on POS tagging has utilized different kinds of features to tackle unknown word POS tagging. These features are mainly based on word substring information, word context information and/or global information. Weischedel et al. [11] create a probability distribution for an unknown word based on certain features: word endings, hyphenation, and capitalization. Brill [12] uses suffix information with transformation rules. Ratnaparkhi [5] uses character n-gram prefixes and suffixes, and spelling cues such as capitalization, hyphens, and numbers. Brants [4] uses the linear interpolation of a fixed-length suffix model for unknown word handling. Nakagawa [13] uses global information and local information: the probability distribution of the POS of all the occurrences of unknown words with the same lexical form in a document is modeled, and the parameters are estimated using Gibbs sampling. Agić et al. [14] showed that the performance of a POS tagger for a highly inflected language can be improved significantly by integrating the output of a morphological analyzer. Recently, several works on Arabic POS tagging have been proposed, such as [15]-[21]; for more details about Arabic work in POS tagging see Albared et al. [22]. Among all these works, AlGahtani et al. [16] and Marsi et al. [18] reported their taggers' performance on unknown word POS tagging, which is 67.0% and (80%-85%) respectively. However, the reported results are still lower than the results achieved in other languages like English. Marsi et al. [18] used the prefix, the suffix, the tags of the two previous words and the tag of the next word to handle unknown words. In addition, Al Shamsi et al. [17] and El Hadj et al. [15] used HMM for Arabic POS tagging.
Both of them used 1000 words as a test set, but they worked under a closed vocabulary assumption.
3. THE TAGSET FOR ARABIC POS TAGGING
Our tagset has been inspired by the Arabic TreeBank (ATB) POS Guidelines [23]. The tagset consists of 24 tags (see Table 1). This tagset is a refinement of the Arabic TreeBank tagset, which consists of 23 tags, used by Mansour et al. [19], Diab et al. [20] and Habash et al. [21]. We only add some modifications to handle some linguistic limitations of previous Arabic taggers. First, we introduce a tag for the Broken Plural (BP) to distinguish it from the singular noun. Unlike the English irregular plural, which is uncommon, the Arabic broken plural is very common. BPs form 40% of the plurals and the remaining

60% consists of the other types of plurals: sound masculine and feminine plurals [24]. In our annotated corpus, BPs form 55% of the plurals. Moreover, BPs constitute 10% of any Arabic text [25]. Several works in Arabic NLP have been proposed to identify BPs in Arabic text [25]-[27]. However, previous Arabic taggers, such as those of Mansour et al. [19], Diab et al. [20] and Habash et al. [21], do not identify BP as an independent tag. Most of the time, BPs are tagged as singular nouns, which loses a lot of information. The main word formation process in Arabic is inherently non-concatenative; the BP is the best example of this non-concatenative morphology [27]. We can measure the performance of our algorithms on non-concatenative unknown words by measuring their performance on unknown words which are BPs. Second, our tagset does not include the NO_FUNC (no solution chosen) tag, which is used in the above-mentioned Arabic TreeBank tagset for any Arabic word with no selected solution [28]. Finally, we distinguish between inflected and non-inflected verbs.
4. THE TRAINING CORPUS
Our corpus consists of 29340 manually annotated word forms from two types of Arabic texts. Over 17000 word forms come from old Arabic text, also called Traditional Arabic text, and another 12000 come from Modern Standard Arabic. The main difference between the two types of text lies only in out-of-vocabulary words: a few old Arabic words are rarely used in today's writing, while some new technical terms and new words have entered common usage. We use this corpus to train and test our tagger. We split the corpus into a training set of 22800 words and a test set of 6540 words.
5. ARABIC WORD STRUCTURE
An Arabic word form is either simple or complex (see Figure 1). The simple form of an Arabic word consists of a prefix, a stem and a suffix.
The complex form consists of proclitics, the simple form and enclitics. Clitics (proclitics and enclitics) have their own POS tags. Tagging at the complex word form level increases the data sparseness problem (and hence the unknown word problem) and increases the complexity of the tagset [28], [29]. Furthermore, Bar-Haim et al. [29] showed that POS tagging using the simple word form outperforms tagging using the complex word form in Semitic languages. Throughout this research, the simple word form will be termed a word. We assume segmentation as a preprocessing step of the POS tagger. Arabic words are quite different from English words, and the word formation process for Arabic words is quite complex. The main formation process of English words is concatenative. In contrast, the main word formation process in Arabic is inherently non-concatenative [30]. A word in Arabic can be described as a combination of two morphemes: a root and a pattern. The root is a sequence of three (rarely two or four) characters, which are called radicals. The pattern is a combination of augmented characters (احرف الزیادة, which can be vowel characters or consonants) with generic (or variable) characters into which the Root Radical Characters (RRC) are inserted (throughout this work, we use the English letter X to represent the pattern's generic characters). The augmented characters (sometimes called fixed characters) are fixed in each pattern. Words are derived by interdigitating roots into patterns: the first radical is inserted into the first generic character, the second radical fills the second generic character and the third fills the last generic character, as shown in Table 2. Arabic has a small number of patterns, a few hundred, and a few thousand roots. The Arabic alphabet has 28 basic letters. The letters of an Arabic word are divided into two sets. The first one is the root radical characters. Any Arabic letter can be a root radical character. Root radical characters in an Arabic word do not play any role in determining the word's possible POS tags. For example, in the Arabic word
متخاصمون, the three characters خ, ص and م are root radical characters. However, we can replace them by three other Arabic characters, such as (ص, د, ق), to produce another Arabic word, متصادقون, which has a different meaning, yet both words are SPN. The second set is the augmented characters. Each augmented character can only be one of these ten Arabic characters: {ا,ه,ي,ن,و,م,ت,ل,أ,س}. However, the augmented characters, together with their positions in the word, may play a critical role in determining the possible POS tags of the word. For example, the Arabic words یتصاعد, یتفاھم, یتصالح and یتعامل all have the three augmented characters ي, ت and ا in the first, second and fourth positions, and all these words are either PSV or VBP.
Table 1. Word Derivation Process of Some Arabic Words From the Root كتب With Different Patterns

Pattern | Root | Resulting word
XXX | كتب | كتب
XاXX | كتب | كاتب
مXXوX | كتب | مكتوب
مXXXة | كتب | مكتبة
XXاX | كتب | كتاب

6. OUR BASELINE MODEL: THE HMM POS TAGGER
The Hidden Markov Model (HMM) is a well-known probabilistic model, which can predict the tag of the current word given the tag of the previous word (bigram) or the tags of the two previous words (trigram). The HMM tagger assigns a probability value to each pair $\langle w, t \rangle$, where $w = w_1 \ldots w_n$ is the input sentence and $t = t_1 \ldots t_n$ is the POS tag sequence. In an HMM, the POS tagging problem can be defined as finding the best tag sequence $\hat{t}$ given the word sequence $w$. The label sequence generated by the model is the one which has the highest probability among all the possible label sequences for the input word sequence. This can be formally expressed as:

$$\hat{t} = \arg\max_{t} \prod_{i=1}^{n} p(t_i \mid t_1 \ldots t_{i-1}) \, p(w_i \mid t_1 \ldots t_i) \qquad (1)$$

Under the trigram assumption, and assuming that each word depends only on its own tag, this becomes:

$$\hat{t} = \arg\max_{t} \prod_{i=1}^{n} p(t_i \mid t_{i-2}, t_{i-1}) \, p(w_i \mid t_i) \qquad (2)$$

The first parameter, $p(w_i \mid t_i)$, is known as the emission probability and the second, $p(t_i \mid t_{i-2}, t_{i-1})$, is known as the transition probability. These two model parameters are estimated from the annotated corpus by Maximum Likelihood Estimation (MLE), which is derived from the relative frequencies. Given these two probabilities, we can find the most likely tag sequence for a given word sequence. Using the Viterbi algorithm [31], we select the path whose overall probability is the highest, and then take the tag predictions from that path. However, MLE is a poor estimator for statistical inference, especially in NLP applications, because data tends to be sparse. This is true even for corpora with a large number of words. Sparseness means that various words are either infrequent or unseen. This leads to zero probabilities being assigned to unseen events, causing the probability of the whole sequence to be set to zero when multiplying probabilities.
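The decoding step described above can be illustrated with a minimal Python sketch of Viterbi search over a trigram HMM. It is only an illustration under stated assumptions, not the authors' implementation: the function name `viterbi_trigram`, the start symbol `<s>`, the transliterated toy tokens and all probability values are invented for the example, and missing table entries are treated as probability zero to mimic unsmoothed MLE.

```python
import math


def viterbi_trigram(words, tags, trans, emit, start="<s>"):
    """Most probable tag sequence under a trigram HMM (log space).

    trans[(t_prev2, t_prev1, t)] = p(t | t_prev2, t_prev1)
    emit[(w, t)]                 = p(w | t)
    Missing entries count as probability zero, as with raw MLE.
    """
    def logp(table, key):
        p = table.get(key, 0.0)
        return math.log(p) if p > 0.0 else float("-inf")

    # best[(t_prev, t_cur)] = best log-probability of a path ending in that pair
    best = {(start, start): 0.0}
    backpointers = []
    for w in words:
        new_best, pointers = {}, {}
        for (t2, t1), score in best.items():
            for t in tags:
                s = score + logp(trans, (t2, t1, t)) + logp(emit, (w, t))
                if s > new_best.get((t1, t), float("-inf")):
                    new_best[(t1, t)] = s
                    pointers[(t1, t)] = t2
        if not new_best:
            # Every path hit a zero probability (e.g. an unseen word).
            return None, float("-inf")
        best = new_best
        backpointers.append(pointers)

    # Follow the back-pointers from the best final tag pair.
    state = max(best, key=best.get)
    score = best[state]
    path = []
    for pointers in reversed(backpointers):
        path.append(state[1])
        state = (pointers[state], state[0])
    path.reverse()
    return path, score


# Toy model; all numbers are invented for illustration only.
TAGS = ["PV", "SN"]
TRANS = {
    ("<s>", "<s>", "PV"): 0.6, ("<s>", "<s>", "SN"): 0.4,
    ("<s>", "PV", "SN"): 0.9, ("<s>", "PV", "PV"): 0.1,
    ("<s>", "SN", "SN"): 0.5, ("<s>", "SN", "PV"): 0.5,
}
EMIT = {("kataba", "PV"): 0.5, ("kataba", "SN"): 0.01, ("kitab", "SN"): 0.5}

print(viterbi_trigram(["kataba", "kitab"], TAGS, TRANS, EMIT))
print(viterbi_trigram(["kataba", "qalam"], TAGS, TRANS, EMIT))
```

In the second call the token `qalam` has no emission entry, so every path receives probability zero and no tag sequence is returned, which is the sparseness problem that motivates smoothing and unknown word handling.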
There are many different smoothing algorithms in the literature to handle the sparseness problem [32], all of them consisting of decreasing the probability assigned to the known events and distributing the remaining mass among the unknown events. In our work, we use linear interpolation of unigram, bigram and trigram maximum likelihood estimates in order to estimate the trigram transition probability:

$$P(t_3 \mid t_1, t_2) = \lambda_1 \hat{P}(t_3) + \lambda_2 \hat{P}(t_3 \mid t_2) + \lambda_3 \hat{P}(t_3 \mid t_1, t_2) \qquad (3)$$

where $\hat{P}$ denotes a maximum likelihood estimate and $\lambda_1 + \lambda_2 + \lambda_3 = 1$, so $P$ represents a valid probability distribution. The $\lambda$s are estimated by deleted interpolation. For unknown words, we use the linear interpolation of a fixed-length suffix model. The probability distribution for an unknown word suffix is generated from all words in the training set that have the same suffix up to some predefined maximum length. Probabilities are smoothed by successive abstraction. This method was proposed by Samuelsson [33] and implemented for English and German [4].

$$P(t \mid c_{n-i+1}, \ldots, c_n) = \frac{\hat{P}(t \mid c_{n-i+1}, \ldots, c_n) + \theta \, P(t \mid c_{n-i+2}, \ldots, c_n)}{1 + \theta}, \qquad \theta = \frac{1}{s-1} \sum_{j=1}^{s} \left( \hat{P}(t_j) - \bar{P} \right)^2, \qquad \bar{P} = \frac{1}{s} \sum_{j=1}^{s} \hat{P}(t_j) \qquad (4)$$

where $c_{n-i+1}, \ldots, c_n$ represent the last $i$ characters of the word and $s$ is the number of tags in the tagset. In addition to the word suffix, the experiments utilize the following features: the presence of non-alphabetic characters and the existence of foreign characters. In addition to the suffix guessing model, we define another basic model based on both the unknown word's prefix and suffix. The main linguistic motivation behind combining affix information is that in an Arabic word an affix sometimes requires or forbids the existence of another affix [34]. Prefix and suffix here denote substrings that come at the beginning and end of a word respectively, and are not necessarily morphologically meaningful. In this model, the lexical probabilities are estimated as follows:

Given an unknown word w, the lexical probabilities P(suffix(w) | t) are estimated using the suffix tries as in Equation 4. Then, the lexical probabilities P(prefix(w) | t) are estimated using the prefix tries as in Equation 4. Here, the probability distribution for an unknown word prefix is generated from all words in the training set that have the same prefix up to some predefined maximum length. Finally, we use the linear interpolation of the lexical probabilities obtained from the word's suffix and prefix to calculate the lexical probability of the word w, as in the following equation:

$$P(w \mid t) = \lambda \, P(\mathrm{suffix}(w) \mid t) + (1 - \lambda) \, P(\mathrm{prefix}(w) \mid t) \qquad (5)$$

where $\lambda$ is an interpolation factor, experimentally set to X, and prefix(w) and suffix(w) are the first m and the last n characters of w, respectively. Table 3 summarizes the results of the experiments with the prefix, suffix and prefix + suffix basic models. The first model (LM1) is the TnT suffix guessing algorithm. The second model (LM2) is the prefix guessing algorithm. The third model (LM3) is the linear interpolation of the prefix and suffix guessing algorithms for unknown words. LM3, which combines information from both suffix and prefix, gives a considerable rise in accuracy compared to the suffix guessing method. However, the performance of LM1, LM2 and LM3 on unknown words is still far from what is achieved in other languages. The results also show that some techniques that proved effective for some languages do not work well for Arabic, such as LM1 (the suffix guessing algorithm), which proved to be a good indicator for unknown word POS guessing in English and German [4]. In the next section, we discuss our effort to improve the accuracy of the unknown word predictor. We combine the weighted output of the MA with word suffix and prefix information and with word pattern suffix and prefix information.
Table 2.
The average POS tagging accuracy using the HMM tagger with the basic lexical models.
Model | % of unknown words | Unknown acc. | Overall acc.
LM1 (TnT) suffix guessing algorithm | 10.7 | 66.3 | 94.7
LM2 prefix guessing algorithm | 10.7 | 56.4 | 93.6
LM3 prefix + suffix guessing algorithm | 10.7 | 69.5 | 95.0

7. SYSTEM IMPROVEMENT
7.1 Integration of Morphological Information
In order to further improve the tagging accuracy, we integrate morphological information with the lexical models. Our choice of an external MA is based on the fact that suffix tries and the successive abstraction algorithm do not work well with Arabic. In our opinion, the main reasons that make this algorithm unsuitable for Arabic are: 1) data sparseness, 2) suffix ambiguity and 3) the non-concatenative nature of Arabic words. An MA is a function that takes a word w as input and outputs the set of all its possible POS tags. Note that the number of tags produced by the MA is much smaller than the size of the tagset. Thus, we have a restricted choice of tags as well as of tag sequences for a given sentence. Since the correct tag t for w is always among the MA output tags (assuming here that the MA is complete), it is always possible to find the correct tag sequence for a sentence even after applying the morphological restriction. Since the MA does not assign probabilities to the tags, we address this problem by assuming a uniform distribution over the tags proposed by the MA given the word. In our system, we utilize the LDC-distributed Buckwalter Morphological Analyzer for Arabic (BAMA). The BAMA system is based on three tables: a prefix table, a stem table and a suffix table. The stem table of BAMA has very high coverage. Due to the differences between the BAMA tagset and our tagset, we implement a mapping function that maps each tag produced by BAMA to one or more tags in our tagset. The new combined models (LM4 and LM5) follow a simple method for using the information from the MA:
1) If the word is in the training corpus, the lexical probabilities are estimated using MLE just as in the basic models; otherwise,

2) If the word is not in the training corpus and it is known to the MA, the MA outputs the set of possible tags. The lexical probabilities are then estimated as follows: a. The tag probability distribution is calculated using an appropriate weighting function, such as assuming a uniform distribution over the tags proposed by the MA. b. Then, lexical probabilities are calculated using Bayesian inversion. c. Finally, we combine these lexical probabilities with the lexical probabilities provided by LM1 or LM3.
3) If the word is also unknown to the MA, the lexical probabilities provided by LM1 or LM3 are used.
For a fixed input text, the number of unknown words can be reduced by enriching the lexicon of the MA. This is suitable if the tagger is domain specific. But as a general solution for multi-domain texts, the tagger must be equipped with models that handle unknown words efficiently without extending the MA lexicon. Moreover, BAMA has many weaknesses in its coverage and in its analysis [28], [34], [35]. Mansour et al. [19] stated that more than 5% of the words in ATB2 and ATB3 cannot be tagged correctly using BAMA unless further data are added to those provided by the morphological analyzer. In addition, BAMA analyzes Arabic words in a concatenative manner. It has problems analyzing words with non-concatenative morphology such as broken plurals [36]. This means it is unable to handle unknown words which are non-concatenative. Our objective is to provide a solution to the unknown word POS guessing problem that overcomes the limitations of the MA and also removes the need for a huge amount of annotated data. In the next section, we define lexical models which depend on some specific features of Arabic words. These models are able to extract useful information from Arabic words, whether they are formed using concatenative or non-concatenative morphology.
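The three-case procedure for words missing from the training corpus (steps 2 and 3 above) can be sketched in Python as follows. This is a hedged illustration rather than the authors' code: the function names, the interpolation weight `mu`, the use of linear interpolation as the combination rule and the restriction of the result to the MA tags are all assumptions made for the example, since the text does not specify how the two sources are weighted. The tag priors and affix scores are invented numbers.

```python
import math


def ma_lexical_scores(ma_tags, tag_prior):
    """Steps a and b: uniform P(t | w) over the MA tags, then Bayesian inversion.

    P(w | t) = P(t | w) * P(w) / P(t). P(w) is constant for a fixed word,
    so it is dropped and the result is only proportional to P(w | t).
    """
    if not ma_tags:
        return {}
    uniform = 1.0 / len(ma_tags)
    return {t: uniform / tag_prior[t] for t in ma_tags}


def combined_lexical_scores(ma_tags, tag_prior, affix_scores, mu=0.5):
    """Step c and step 3 (mu and the interpolation rule are assumptions).

    affix_scores[t] stands for the lexical probability from LM1 or LM3.
    """
    ma_scores = ma_lexical_scores(ma_tags, tag_prior)
    if not ma_scores:
        # Word unknown to the MA: fall back to the affix-based model alone.
        return dict(affix_scores)
    # Word known to the MA: keep only the tags the MA proposes.
    return {
        t: mu * s + (1.0 - mu) * affix_scores.get(t, 0.0)
        for t, s in ma_scores.items()
    }


# Invented numbers, for illustration only.
prior = {"SN": 0.4, "BP": 0.1, "PV": 0.5}
affix = {"SN": 0.02, "BP": 0.01, "PV": 0.03}
print(combined_lexical_scores(["SN", "BP"], prior, affix))
print(combined_lexical_scores([], prior, affix))
```

One consequence of the inversion is visible in the toy output: with a uniform P(t | w), tags that are rare in the corpus (here BP) receive a larger lexical score than frequent ones, because the score is divided by the tag prior.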
These lexical models then use this information to predict the appropriate POS tag of the word.
7.2 Using Words' Internal Structure
Arabic words are quite different from English words. The word formation process for Arabic words is quite complex. The main formation process of English words is concatenative, i.e. affixes are simply attached to the beginning and the end of the stem. Hence, word suffixes are strong indicators of the word's POS class. Brants [4], for example, showed that an English word ending in the suffix -able is very likely to be an adjective. In contrast, the main word formation process in Arabic is inherently non-concatenative [30]. Thus, the suffixes of Arabic words (minimal word forms) are ambiguous, short and sparse. For example, Arabic words derived from the same root most often share the same suffix even if they have different POS. Moreover, words which belong to the same POS class often have different suffixes (see Table 4). As we state in Sec. 5, Arabic words are derived by inserting root radical characters into a pattern's generic characters. The characters of Arabic words are divided into root radical characters and augmented characters. While the ten characters {ا,ه,ي,ن,و,م,ت,ل,أ,س} of the 28 Arabic letters, which are called Augmented Characters (AC), can be used as root radical or augmented characters, any of the remaining 18 characters can be used only as a root radical character. The augmented characters appear both in Arabic words and in their patterns, so they are sometimes called fixed characters [2]. In contrast, root radical characters appear only in the Arabic words and are replaced with generic characters (or variables) in the pattern. The reverse process to word derivation is pattern identification (or root extraction): the process that identifies the root radical characters in an Arabic word and replaces them with generic characters.
Table 3. List of Some Arabic Words Derived From the Roots "عمل" and "صنع" and Their POS.
The Table Shows the Ambiguity and the Sparseness of Arabic Word Suffixes
Arabic words | Pattern | Arabic pattern | POS
عمل صنع | XXX | فعل | PV
سنعمل سنصنع | سنXXX | سنفعل | VBS
عمل صنع | XXX | فعل | SN
معمل مصنع | مXXX | مفعل | SN
معامل مصانع | مXاXX | مفاعل | BP
عمال صناع | XXاX | فعال | BP
یعملون یصنعون | يXXXون | یفعلون | VBS
متعاملون مصنعون | متXاXXون مXXXون | متفاعلون مفعلون | SNP

The pattern of an Arabic word is a good indicator of its possible POS tags. In addition, patterns can be used to overcome the need for huge annotated data to cover the vocabulary of the language. All Arabic words which belong to open classes can be mapped to a few thousand patterns. Furthermore, by

removing root radical characters from Arabic words, suffixes become less ambiguous, less sparse and longer (see Table 5). However, it is not easy to fully utilize pattern information. Pattern identification (or root extraction) is in itself a complicated task in Arabic NLP. In our current work, we try to balance the benefit that we can get from the pattern against the complexity of pattern identification. We propose a light pattern identification algorithm to map an Arabic word to its pattern. The algorithm works as follows. Given an Arabic word which belongs to an open class: first, if the word contains one or more characters from the radical-only character set, replace them with the generic character X. Second, for the remaining characters, we use some position rules, proposed by (Sonbol et al., 2008), to detect whether they are root radicals or augmented characters. We call the pattern produced by this algorithm the Augmented Character Form (ACF). We use this algorithm to map each non-functional word in the dictionary obtained from the training corpus to its ACF. Then, we estimate the emission probability (the lexical model) for each unique ACF. We use an augmented letter tree to represent the lexical model. Finally, for each unknown word in the test set, we estimate the probability of its ACF's suffix using suffix tries and successive abstraction. Then, we combine this probability with the output of the MA. The algorithm is described more formally as follows (steps 1 and 2 are performed in the training phase, whereas step 3 is performed in the test phase):
1) First, for each word W in the dictionary obtained from the training data: if one of its possible tags belongs to the open classes, then convert it to its augmented character form (ACF) as follows. For each character C in W: if C ∉ AC, we replace it with the generic character X;
otherwise (C ∈ AC), we check whether C is an augmented or a root radical character using the position rules, and if it is a root radical character, we replace it with the generic character X.
2) If two or more words have the same ACF, we represent them as one entry (ACF_j). The possible tags of the resulting ACF are the possible tags of all of its words. The probability distribution of ACF_j given a tag t is calculated as in the following equation:

$$P(ACF_j \mid t) = \sum_{i=1}^{n} P(w_i \mid t) \qquad (6)$$

where $w_1, \ldots, w_n$ are the words that have the same ACF_j.
3) Finally, for each unknown word in the test set, we do the following: a) The word is converted to its ACF. b) The lexical probabilities P(ACF_suffix | t) and P(ACF_prefix | t) are estimated using the suffix tries and prefix tries as in Sec. 6. The only difference is that we replace the word by its ACF. c) We combine this information with the MA output, if the word is known to the MA.
8. EXPERIMENTS AND EVALUATION
The main purpose of this work is to study the behavior of different lexical models for the HMM POS tagger, in order to determine the best way to handle unknown word POS guessing for Arabic, especially when only a small amount of data is available. We evaluate these lexical models on the test set. We have a total of six models. The same training data has been used to estimate the parameters of all the models. Moreover, the same test set has been used to evaluate all the models. The size of the test set is 6540 words, of which 700 words are unknown. We define the tagging accuracy as the ratio of correctly tagged words to the total number of words. Results are summarized in Table 6.
Table 4. The Average Tagging Accuracy of Arabic Text Using the Improved Lexical Models
Model | % of unknown words | Unknown acc. | Overall acc.
(LM3) word prefix + suffix guessing | 10.7 | 69.5 | 95.0
(LM4) MA + suffix | 10.7 | 83.7 | 96.6
(LM5) MA + word suffix + word prefix | 10.7 | 83.5 | 96.6
(LM6) MA + ACF suffix + ACF prefix | 10.7 | 88.3 | 97.1

We find that in both HMM-based models (LM4 and LM5), the use of a morphological analyzer with word affixes improves the accuracy with respect to the basic models (see Table 3 and Table 6). A significant increase in the unknown word POS tagging accuracy, and consequently in the overall accuracy, is clearly noticeable. As we have noted already, the use of the MA and word affixes

improves the accuracy of the POS tagger. But what is significant to note is that the improvement is higher when we use the MA and ACF affixes. The results of the experiments using LM6, which combines word morphological information with ACF suffix and prefix information, show a considerable increase over all other approaches. In addition, Table 7 compares our work on Arabic unknown word POS guessing with all related Arabic works. Our combined model (LM6) outperforms all related works on Arabic POS tagging that tackle the unknown word problem, although our training data is small.
Table 5. Comparison of Our Results in Unknown Word POS Tagging With Other Related Arabic Taggers
Tagger | Main technique | % of unknown words in the test set | Number of unknown words | Unknown acc.
Marsi 2005 | MB | 6.6% | 947 | 73%
AlGahtani 2009 | TBL | 5.3% | 790 | 85%
LM6 | HMM | 10.6% | 700 | 88.3%

9. CONCLUSION AND FUTURE WORK
Unknown word tagging is a serious problem in POS tagging, especially when only a small amount of annotated data is available. The impact of this problem increases in languages which have a huge vocabulary and a rich morphological system, like Arabic. In this paper, we have investigated the best configuration of a second-order HMM POS tagger for Arabic when the training corpus is small. We have proposed several lexical models based on specific internal features of Arabic words. In addition, an external morphological analyzer has been integrated with the POS tagger to improve the tagger's results. Furthermore, we have presented several combinations of these lexical models. The best result is achieved by the combined lexical model, which combines the weighted output of the morphological analyzer with affix tries over the augmented character form (pattern form) of the word. Our tagger achieves the state of the art in Arabic text tagging and outperforms other Arabic taggers in unknown word tagging.
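To make the augmented character form concrete, the following Python sketch covers only the first step of the light pattern identification algorithm (replacing letters that can never be augmented characters with X) and the grouping of Equation 6. It is an illustration under stated assumptions, not the authors' implementation: the position rules that decide whether an augmented-capable letter is actually a radical are not implemented, so the output is only a partial ACF; the function names and the toy probabilities are invented.

```python
from collections import defaultdict

# The ten augmented characters (AC) listed in Sec. 5; every other letter
# can only be a root radical.
AUGMENTED = {"ا", "ه", "ي", "ن", "و", "م", "ت", "ل", "أ", "س"}
GENERIC = "X"


def partial_acf(word):
    """Replace radical-only letters with X and leave AC letters untouched.

    Simplification: the position rules for AC letters are NOT applied,
    so an AC letter that is really a radical stays in the output.
    """
    return "".join(ch if ch in AUGMENTED else GENERIC for ch in word)


def acf_emissions(word_emissions):
    """Equation (6): P(ACF_j | t) is the sum of P(w_i | t) over all words
    w_i that share the same ACF_j."""
    table = defaultdict(float)
    for (word, tag), p in word_emissions.items():
        table[(partial_acf(word), tag)] += p
    return dict(table)


# Toy emission probabilities, invented for illustration only.
toy = {("متصادقون", "SPN"): 0.001, ("متضاربون", "SPN"): 0.002}
print(partial_acf("متصادقون"))
print(acf_emissions(toy))
```

In this simplified form, the word متخاصمون keeps its radical م, because that letter also belongs to the AC set; resolving such cases is the role of the position rules that the sketch leaves out.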
Our future direction is to improve the pattern-based unknown word predictor. This improvement can be done in several steps. First, we intend to increase the size of the training corpus from small to medium, to cover most Arabic word patterns. The second step is to improve the pattern identification algorithm so that each unknown word can be mapped to the pattern of a known word. Another future direction is to develop a new test set to re-evaluate the performance of the tagger. The new test set will include annotated data from multiple domains.
APPENDIX
The Arabic POS tagset used in annotating our corpus is given in Table 1. Figure 1 explains the simple and complex forms of an Arabic word, with one of the possible tag sequences (composite tag) of the word فلمعتقداتھم. Table 5 shows a list of some patterns with their possible POS tags.
REFERENCES:
[1] D. Vadas and J. R. Curran, Tagging unknown words with raw text features, in Proceedings of the Australasian Language Technology Workshop, 2005, pp. 32-39.
[2] M. A. M. E. Ahmed, A Large-Scale Computational Processor of the Arabic Morphology, and Applications, Faculty of Engineering, Cairo University, Giza, Egypt, 2000.
[3] R. Abbès, J. Dichy, and M. Hassoun, The Architecture of a Standard Arabic lexical database: some figures, ratios and categories from the DIINAR.1 source program, in Proceedings of the Workshop on Computational Approaches to Arabic Script-based Languages, 2004, pp. 15-22.
[4] T. Brants, TnT: a statistical part-of-speech tagger, in Proceedings of the sixth conference on Applied natural language processing, 2000, pp. 224-231.
[5] A. Ratnaparkhi and others, A maximum entropy model for part-of-speech tagging, in Proceedings of the conference on empirical methods in natural language processing, 1996, vol. 1, pp. 133-142.
[6] S. M. Thede and M. P. Harper, A second-order hidden Markov model for part-of-speech tagging, in Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics, 1999, pp. 175-182.
[7] J. Lafferty, A. McCallum, and F. C. N. Pereira, Conditional random fields: Probabilistic models for segmenting and labeling sequence data, 2001.

[8] K. Toutanova, D. Klein, C. D. Manning, and Y. Singer, Feature-rich part-of-speech tagging with a cyclic dependency network, in Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology-Volume 1, 2003, pp. 173-180.
[9] L. Màrquez and J. Giménez, A general pos tagger generator based on support vector machines, J. Mach. Learn. Res., 2004.
[10] M. Poel, L. Stegeman, and R. op den Akker, A support vector machine approach to Dutch part-of-speech tagging, in Advances in intelligent data analysis VII, Springer, 2007, pp. 274-283.
[11] R. Weischedel, R. Schwartz, J. Palmucci, M. Meteer, and L. Ramshaw, Coping with ambiguity and unknown words through probabilistic models, Comput. Linguist., vol. 19, no. 2, pp. 361-382, 1993.
[12] E. Brill, Some advances in transformation-based part of speech tagging, arXiv Prepr. C., 1994.
[13] T. Nakagawa, Multilingual word segmentation and part-of-speech tagging: a machine learning approach incorporating diverse features, Nara Institute of Science and Technology, Japan, 2006.
[14] Ž. Agic and Z. Dovedan, Improving part-of-speech tagging accuracy for Croatian by morphological analysis, Informatica, vol. 32, no. 4, 2008.
[15] Y. El Hadj, I. Al-Sughayeir, and A. Al-Ansari, Arabic part-of-speech tagging using the sentence structure, in Proceedings of the Second International Conference on Arabic Language Resources and Tools, Cairo, Egypt, 2009.
[16] S. AlGahtani, W. Black, and J. McNaught, Arabic part-of-speech tagging using transformation-based learning, in Proceedings of the Second International Conference on Arabic Language Resources and Tools, Cairo, Egypt, 2009.
[17] F. Al Shamsi and A. Guessoum, A hidden Markov model-based POS tagger for Arabic, in Proceeding of the 8th International Conference on the Statistical Analysis of Textual Data, France, 2006, pp. 31-42.
[18] E. Marsi, A. Van Den Bosch, and A. Soudi, Memory-based morphological analysis generation and part-of-speech tagging of Arabic, in Proceedings of the ACL workshop on computational approaches to Semitic languages, 2005, pp. 1-8.
[19] S. Mansour, K. Sima'an, and Y. Winter, Smoothing a lexicon-based POS tagger for Arabic and Hebrew, in Proceedings of the 2007 Workshop on Computational Approaches to Semitic Languages: Common Issues and Resources, 2007, pp. 97-103.
[20] M. Diab, K. Hacioglu, and D. Jurafsky, Automatic tagging of Arabic text: From raw text to base phrase chunks, in Proceedings of HLT-NAACL 2004: Short Papers, 2004, pp. 149-152.
[21] N. Habash and O. Rambow, Arabic tokenization, part-of-speech tagging and morphological disambiguation in one fell swoop, in Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, 2005, pp. 573-580.
[22] M. Albared, N. Omar, and M. J. Ab Aziz, Arabic part of speech disambiguation: A survey, Int. Rev. Comput. Softw., pp. 517-532, 2009.
[23] M. Maamouri, A. Bies, and S. Kulick, Enhancing the Arabic Treebank: a Collaborative Effort toward New Annotation Guidelines, in LREC, 2008.
[24] S. Boudelaa and M. G. Gaskell, A re-examination of the default system for Arabic plurals, Lang. Cogn. Process., vol. 17, no. 3, pp. 321-343, 2002.
[25] A. Goweder and A. De Roeck, Assessment of a significant Arabic corpus, in Arabic NLP Workshop at ACL/EACL, 2001.
[26] N. K. A. Alajmi, S. Bin Deris, and S. Alnajem, Computational Approach to Arabic Broken Derived Nouns Morphology, in Advanced Computer Theory and Engineering, 2008. ICACTE '08. International Conference on, 2008, pp. 704-708.
[27] A. Clark, Supervised and unsupervised learning of Arabic morphology, in Arabic Computational Morphology, Springer, 2007, pp. 181-200.
[28] S. Mansour, Combining character and morpheme based models for part-of-speech tagging of Semitic languages, Technion-Israel Institute of Technology, Faculty of Computer Science, 2008.
[29] R. Bar-Haim, K. Sima'an, and Y. Winter, Part-of-speech tagging of Modern Hebrew text, Nat. Lang. Eng., vol. 14, no. 02, pp. 223-251, 2008.
[30] Y. Cohen-Sygal, Computational implementation of non-concatenative morphology, University of Haifa, 2004.
[31] A. J. Viterbi, Error bounds for convolutional codes and an asymptotically optimum decoding algorithm, Inf. Theory, IEEE Trans., vol. 13, no. 2, pp. 260-269, 1967.
[32] S. F. Chen and J. Goodman, An empirical study of smoothing techniques for language modeling, Comput. Speech Lang., vol. 13, no. 4, pp. 359-393, 1999.
[33] C. Samuelsson, Handling sparse data by successive abstraction, in Proceedings of the 16th conference on Computational linguistics-Volume 2, 1996, pp. 895-900.
[34] M. A. Attia, Handling Arabic morphological and syntactic ambiguity within the LFG framework with a view to machine translation, University of Manchester, 2008.
[35] M. Sawalha and E. S. Atwell, Comparative evaluation of Arabic language morphological analysers and stemmers, in Proceedings of COLING 2008 22nd International Conference on Computational Linguistics (Poster Volume), 2008, pp. 107-110.
[36] S. Alansary, M. Nagi, and N. Adly, Towards analyzing the international corpus of Arabic (ICA): Progress of morphological stage, in 8th International Conference on Language Engineering, Egypt, 2008.

APPENDIX

Table 6. The Arabic POS Tagset Used In Annotating Our Corpus

POS Tag | Label | POS Tag | Label
Conjunction | CC | Broken Plural Noun | BPN
Number | CD | Possessive Pronoun | POSS_PRON
Adverb | ADV | Imperfective Verb | VBP
Particle | PART | Non Inflected Verb | NIV
Imperative Verb | IV | Relative Pronoun | REL_PRON
Foreign Word | FOREIGN | Interjection | INTERJ
Perfect Verb | PV | Interrogative Particle | INTER_PART
Passive Verb | PSSV | Interrogative Adverb | INTER_ADV
Preposition | PREP | Demonstrative Pronoun | DEM_PROP
Adjective | ADJ | Punctuation | PUNC
Singular Noun | SN | Proper Noun | NOUN_PROP
Sound Plural Noun | SPN | Personal Pronoun | PRON

Figure. The Simple And Complex Forms Of The Arabic Word فلمعتقداتھم With One Of Its Possible Tag Sequences (Composite Tag): CC+PREP+SPN+POSS_PRON

Table 7. List Of Some Patterns With Their Possible POS Tags

Pattern | Arabic Pattern | Examples of Pattern's Words | Pattern Possible Tags
XXوX | فعول | نفوس/صدور/علوم/دخول/غفور | BPN,SN,ADJ
اXتXاX | افتعال | اجتماع/اقتصاد/افتتاح | SN
یستXXX | یستفعل | یستخرج/یستعمل/یستھلك | VBS,PSSV
تXXX | تفعل | تخرج/تعلم/تحدث/تدرب | IV,PSSV,PV,SN
يXXXون | یفعلون | یعملون/یصنعون/یسمعون | VBS
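The pattern table above can be read as a lookup from a word's shape to its candidate tags. The following is a minimal sketch of that idea, not the paper's pattern identification algorithm: it assumes that each "X" slot accepts any single character, which is looser than a real root-consonant check, and it uses only three rows of the table. The names `PATTERN_TAGS`, `pattern_to_regex` and `candidate_tags` are hypothetical.

```python
import re

# Three rows copied from the pattern table; "X" marks a root-letter slot,
# every other character is a fixed letter of the pattern.
PATTERN_TAGS = {
    "XXوX": ["BPN", "SN", "ADJ"],
    "اXتXاX": ["SN"],
    "تXXX": ["IV", "PSSV", "PV", "SN"],
}


def pattern_to_regex(pattern):
    """Turn a template into a regex: 'X' -> any one character (assumption)."""
    parts = ["." if ch == "X" else re.escape(ch) for ch in pattern]
    return re.compile("".join(parts))


COMPILED = {pattern: pattern_to_regex(pattern) for pattern in PATTERN_TAGS}


def candidate_tags(word):
    """Return (pattern, tags) pairs for every template the whole word fits."""
    return [
        (pattern, PATTERN_TAGS[pattern])
        for pattern, regex in COMPILED.items()
        if regex.fullmatch(word)
    ]


matches = candidate_tags("علوم")
```

A word that fits no template, such as "كتاب" here, yields an empty list, and a guesser built this way would then have to fall back on another lexical model.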