PROBABILISTIC ARABIC PART OF SPEECH TAGGER WITH UNKNOWN WORDS HANDLING


Journal of Theoretical and Applied Information Technology, 31st August 2016. Vol. 90. JATIT & LLS. All rights reserved.

1 Mohammed Albared, 2 Tareq Al-Moslmi, 3 Nazlia Omar, 4 Adel Al-Shabi, 5 Fadl Mutaher Ba-Alwi
2,3,4 Centre for Artificial Intelligence Technology, Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, Bangi, Selangor, Malaysia
1,5 Faculty of Computer and Information Technology, Sana'a University, Yemen
1 mohammed_albared@yahoo.com, 2 tareq.almoslmi@gmail.com, 3 nazlia@ukm.edu.my, 4 adel.alshabi@gmail.com, 5 dr.fadlbaalwi@gmail.com

ABSTRACT
A Part Of Speech (POS) tagger is an essential preprocessing step in many natural language applications. In this paper, we investigate the best configuration of a trigram Hidden Markov Model (HMM) Arabic POS tagger when only a small tagged corpus is available. With small training data, unknown word POS guessing is the main problem. This problem becomes more serious in languages which have a huge vocabulary and a rich and complex morphology, like Arabic. To handle this problem in an Arabic POS tagger, we study the effect of integrating a lexicon-based morphological analyzer to improve the performance of the tagger. Moreover, several lexical models are empirically defined, implemented and evaluated. These models are based essentially on the internal structure and the formation process of Arabic words. Furthermore, several combinations of these models are presented. The POS tagger is trained on a small training corpus and uses a tagset of 24 different POS tags. Our system achieves state-of-the-art overall accuracy in Arabic part of speech tagging and outperforms other Arabic taggers in unknown word POS tagging accuracy.

Keywords: Part of Speech Tagger, Arabic Language, Unknown Word Guessing.

1. INTRODUCTION
Part of speech disambiguation is the ability to computationally determine which part of speech of a word is activated by its use in a particular context.
Automatic text tagging is an important preprocessing step in many NLP applications such as information extraction, question answering and machine translation. POS tagging is a nontrivial problem. It cannot rely exclusively on a lexicon, because of morphosyntactic ambiguity and the existence of unknown words, that is, words that have not been previously seen in the annotated training set. Unknown words are a major problem in any tagging system and always decrease its performance. The accuracy of POS tagging for unknown words is significantly lower than that for known words.

The processing of unknown words is important for several reasons. First, unknown words play a more important role in the meaning of a sentence than known words; unknown words are specialized words and hold more semantic information than known words [1]. This is because most unknown words belong to open POS classes such as nouns and verbs and are unlikely to be in closed classes such as particles. Second, the performance of a POS tagger on unknown words is a measure of its robustness and reliability, that is, its ability to tag documents from different domains or language varieties without a substantial decrease in performance. Finally, improving unknown word tagging contributes to the overall accuracy of the POS tagger. For these reasons, tagging unknown words properly is important, so that the information they carry can be used correctly in later steps of an NLP system.

Unknown word POS tagging is a substantial problem in Arabic POS tagging for several reasons. First, there is a lack of large, freely and publicly available annotated corpora. Second, Arabic is one of the richest languages in terms of vocabulary [2]. In the DIINAR.1 resource, the effective number of simple word forms is 7,774,938 [3]. As a result, designing a reliable and robust statistical POS tagger requires an extremely large annotated corpus. Third, Arabic is an inflected language with a rich and complex morphology.
Finally, there is orthographic ambiguity: the form of certain letters in Arabic script allows suboptimal orthographic variants of the same word to coexist in the same text.

In this work, we employ the well-known trigram HMM POS tagging architecture for tagging Arabic text. Our baseline tagger implementation is influenced by Brants et al. [4]. During the implementation of our baseline tagger, we observed that suffix guessing does not perform well on Arabic unknown words. This is due to the limited size of the training data and to language characteristics. So, to cope with the unknown word problem, we first study how information supplied by a lexicon-based Morphological Analyzer (MA) can be used to improve the accuracy of the system. Then, we define, implement and evaluate several lexical models based on the internal structure of the Arabic word, i.e. the word formation process. Experimental results show that the proposed approaches achieve very encouraging results, although training is performed on a very small corpus.

The rest of the paper is organized as follows. Sec. 2 discusses related work. Sec. 3 and Sec. 4 describe our tagset and corpora. Sec. 5 gives necessary details about Arabic word formation. Sec. 6 describes our baseline HMM tagger. In Sec. 7 we discuss the modifications made to better handle unknown word POS tagging in Arabic text. Sec. 8 gives experimental results. Finally, conclusions and future work appear in Sec. 9.

2. RELATED WORK
Research on POS tagging has a long history. Numerous approaches have been successfully applied to POS tagging. The POS tagging techniques in the literature can be classified into the following:
- Rule-based POS tagging, which is based on a lexicon and a set of disambiguation rules.
- Supervised POS tagging: these approaches use machine-learning techniques to learn a classifier from labeled training sets, such as the Maximum Entropy Model [5], Hidden Markov Model [4], [6], Conditional Random Field [7], Cyclic Dependency Networks [8] and Support Vector Machine [9], [10].
- Unsupervised POS tagging: these approaches do not require pre-tagged training data, but rely on dictionary information.

Previous work on POS tagging has utilized different kinds of features to tackle unknown word POS tagging. These features are mainly based on word substring information, word context information and/or global information. Weischedel et al. [11] create a probability distribution for an unknown word based on certain features: word endings, hyphenation, and capitalization. Brill et al. [12] use suffix information with transformation rules. Ratnaparkhi et al. [5] use character n-gram prefixes and suffixes, and spelling cues such as capitalization, hyphens, and numbers. Brants et al. [4] use the linear interpolation of fixed-length suffix models for unknown word handling. Nakagawa et al. [13] use global and local information: they model the probability distribution of the POS of all occurrences of unknown words with the same lexical form in a document, and estimate the parameters using Gibbs sampling. Agic et al. [14] showed that the performance of a POS tagger for a highly inflected language can be improved significantly by integrating the output of a morphological analyzer.

Recently, several works have addressed Arabic POS tagging, such as [15]-[21]; for more details about Arabic work on POS tagging see Albared et al. [22]. Among all these works, AlGahtani et al. [16] and Marsi et al. [18] reported their taggers' performance on unknown word POS tagging, which is 67.0% and (80%-85%) respectively. However, the reported results are still lower than the results achieved in other languages like English. Marsi et al. [18] used the prefix, the suffix, the tags of the two previous words and the tag of the next word to handle unknown words. In addition, Al Shamsi et al. [17] and El Hadj et al. [15] used HMMs for Arabic POS tagging.
Both of them used 1000 words as a test set, but they worked under a closed vocabulary assumption.

3. THE TAGSET FOR ARABIC POS TAGGING
Our tagset is inspired by the Arabic TreeBank (ATB) POS Guidelines [23]. The tagset consists of 24 tags (see Table 1). It is a refinement of the Arabic TreeBank tagset of 23 tags used by Mansour et al. [19], Diab et al. [20] and Habash et al. [21]. We only add some modifications to handle linguistic limitations of previous Arabic taggers.

First, we introduce a tag for the Broken Plural (BP) to distinguish it from the singular noun. Unlike the English irregular plural, which is uncommon, the Arabic broken plural is very common. BPs form 40% of the plurals, and the remaining 60% are the other types of plurals: sound masculine and feminine plurals [24]. In our annotated corpus, BPs form 55% of the plurals. Moreover, BPs constitute 10% of any Arabic text [25]. Several works in Arabic NLP have been proposed to identify BPs in Arabic text [25]-[27]. However, previous Arabic taggers, such as Mansour et al. [19], Diab et al. [20] and Habash et al. [21], do not identify BP as an independent tag. Most of the time, BPs are tagged as singular nouns, which loses a lot of information. The main word formation process in Arabic is inherently non-concatenative, and the BP is the best example of this non-concatenative morphology [27]. We can therefore measure the performance of our algorithms on non-concatenative unknown words by measuring their performance on unknown words which are BPs.

Second, our tagset does not include the NO_FUNC (no solution chosen) tag, which is used in the above-mentioned Arabic TreeBank tagset for any Arabic word with no selected solution [28]. Finally, we distinguish between inflected and non-inflected verbs.

4. THE TRAINING CORPUS
Our corpus consists of manually annotated word forms from two types of Arabic texts. Over 7000 word forms come from old Arabic text, or what is called Traditional Arabic text, and another 2000 come from Modern Standard Arabic. The main difference between the two types of text lies only in out-of-vocabulary words: a few old Arabic words are rarely used in present-day writing, while some new technical terms and new words have entered common usage. We use this corpus to train and test our tagger. We split the corpus into a training set and a test set of 6540 words.

5. ARABIC WORD STRUCTURE
An Arabic word form is either simple or complex (see Figure 1). The simple form of an Arabic word consists of a prefix, a stem and a suffix. The complex form consists of proclitics, the simple form and enclitics.
Clitics (proclitics and enclitics) have their own POS tags. Tagging at the complex word form level increases the data sparseness problem (and hence the unknown word problem) and increases the complexity of the tagset [28], [29]. Furthermore, Bar-Haim et al. [29] showed that POS tagging using the simple word form outperforms tagging using the complex word form in Semitic languages. Throughout this paper, the simple word form is termed word. We assume segmentation as a preprocessing step of the POS tagger.

Arabic words are quite different from English words, and the word formation process for Arabic words is quite complex. The main formation of English words is concatenative. In contrast, the main word formation process in Arabic is inherently non-concatenative [30]. A word in Arabic can be described as a combination of two morphemes: a root and a pattern. The root is a sequence of three (rarely two or four) characters, which are called radicals. The pattern is a combination of augmented characters (احرف الزیادة, which can be vowel characters or consonants) with generic (or variable) characters into which the Root Radical Characters (RRC) are inserted (throughout this work, we use the English letter X to represent the pattern generic characters). The augmented characters (sometimes called fixed characters) are fixed in each pattern. Words are derived by interdigitating roots into patterns: the first radical is inserted into the first generic character, the second radical fills the second generic character and the third fills the last generic character, as shown in Table 2. Arabic has a small number of patterns, a few hundred, and a few thousand roots.

The Arabic alphabet has 28 basic letters. Arabic word letters are divided into two sets. The first is the root radical characters. Any Arabic letter can be a root radical character. Root radical characters in an Arabic word do not play any role in determining the word's possible POS tags. For example, in the Arabic word متخاصمون, the three characters خ, ص and م are root radical characters.
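To make the root-and-pattern derivation concrete, the following is a minimal Python sketch of interdigitation. The function name `interdigitate` and its signature are illustrative assumptions, not part of the authors' system; the sketch simply fills each generic character X of a pattern with the next radical of the root, as described above.

```python
def interdigitate(root: str, pattern: str, generic: str = "X") -> str:
    """Derive a word by inserting the radicals of `root` into `pattern`.

    Hypothetical helper for illustration: every occurrence of `generic`
    in the pattern is a slot for the next root radical; all other
    characters are augmented (fixed) characters and are copied as-is.
    """
    if pattern.count(generic) != len(root):
        raise ValueError("pattern must have exactly one generic slot per radical")
    radicals = iter(root)
    return "".join(next(radicals) if ch == generic else ch for ch in pattern)


if __name__ == "__main__":
    # Patterns applied to the root k-t-b (the root used in the paper's derivation table).
    for pattern in ["XXX", "XاXX", "مXXوX", "مXXXة", "XXاX"]:
        print(pattern, interdigitate("كتب", pattern))
```

Because only the radicals change between words of the same pattern, two different roots inserted into one pattern yield words with identical augmented characters in identical positions, which is the property the later lexical models rely on.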
The three root radicals of متخاصمون (خ, ص and م) can be replaced by three other Arabic characters, such as (ص, د, ق), to produce another Arabic word, متصادقون, which has a different meaning; both words are nevertheless SPN. The second set is the augmented characters. Each augmented character can only be one of these ten Arabic characters: {ا,ه,ي,ن,و,م,ت,ل,أ,س}. However, the augmented characters, together with their positions in the word, may play a critical role in determining the possible POS tags of the word. For example, the Arabic words یتصاعد, یتفاھم, یتصالح and یتعامل all have the three augmented characters ي, ت and ا in the first, second and fourth positions, and all these words are either PSV or VBP.

Table 2. Word derivation process of some Arabic words from the root كتب with different patterns

Pattern    Root    Resulting word
XXX        كتب     كتب
XاXX       كتب     كاتب
مXXوX      كتب     مكتوب
مXXXة      كتب     مكتبة
XXاX       كتب     كتاب

6. OUR BASELINE MODEL: THE HMM POS TAGGER
The Hidden Markov Model (HMM) is a well-known probabilistic model which can predict the tag of the current word given the tag of the previous word (bigram) or the tags of the two previous words (trigram). The HMM tagger assigns a probability value to each pair <W, T>, where W = w_1 ... w_n is the input sentence and T = t_1 ... t_n is the POS tag sequence. In an HMM, the POS tagging problem can be defined as finding the best tag sequence T given the word sequence W. The tag sequence generated by the model is the one with the highest probability among all possible tag sequences for the input word sequence. This can be formally expressed as:

$$\hat{T} = \arg\max_{T} \prod_{i=1}^{n} p(t_i \mid t_1 \ldots t_{i-1})\, p(w_i \mid t_1 \ldots t_i) \qquad (1)$$

$$\hat{T} = \arg\max_{T} \prod_{i=1}^{n} p(t_i \mid t_{i-2}, t_{i-1})\, p(w_i \mid t_i) \qquad (2)$$

The first parameter, $p(w_i \mid t_i)$, is known as the emission probability and the second, $p(t_i \mid t_{i-2}, t_{i-1})$, is known as the transition probability. These two model parameters are estimated from an annotated corpus by Maximum Likelihood Estimation (MLE), which is derived from relative frequencies. Given these two probabilities, we can find the most likely tag sequence for a given word sequence. Using the Viterbi algorithm [31], we select the path whose overall probability is the highest and take the tag predictions from that path.

However, MLE is a poor estimator for statistical inference, especially in NLP applications, because data tends to be sparse. This is true even for corpora with a large number of words. Sparseness means that various words are either infrequent or unseen. This leads to zero probabilities being assigned to unseen events, causing the probability of the whole sequence to be zero when the probabilities are multiplied.
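The trigram decoding step described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function `viterbi_trigram`, the start symbol `<s>`, the two-tag tagset and all probability values are hypothetical, and the transition and emission functions are assumed to be already smoothed so that they never return zero (which sidesteps the sparseness problem just mentioned).

```python
import math

START = "<s>"  # hypothetical padding tag for the two positions before the sentence


def viterbi_trigram(words, tags, trans_p, emit_p):
    """Most probable tag sequence under a trigram HMM.

    trans_p(t_prev2, t_prev1, t) -> P(t | t_prev2, t_prev1)
    emit_p(word, t)              -> P(word | t)
    Both are assumed to return strictly positive (smoothed) probabilities.
    """
    if not words:
        return []
    # best[(u, v)]: best log-probability of any path whose last two tags are (u, v)
    best = {(START, START): 0.0}
    backpointers = []
    for word in words:
        new_best, back = {}, {}
        for (u, v), score in best.items():
            for t in tags:
                s = score + math.log(trans_p(u, v, t)) + math.log(emit_p(word, t))
                if s > new_best.get((v, t), -math.inf):
                    new_best[(v, t)] = s
                    back[(v, t)] = u  # remember the tag two positions back
        best = new_best
        backpointers.append(back)
    # Recover the path by following back-pointers from the best final state.
    u, v = max(best, key=best.get)
    path = [v]
    for back in reversed(backpointers[1:]):
        prev = back[(u, v)]
        path.append(u)
        u, v = prev, u
    path.reverse()
    return path


# Toy model with made-up numbers, for illustration only.
TAGS = ["N", "V"]
EMIT = {("dog", "N"): 0.9, ("dog", "V"): 0.1, ("barks", "N"): 0.2, ("barks", "V"): 0.8}


def trans_p(t_prev2, t_prev1, t):
    if t_prev1 == "N":
        return 0.7 if t == "V" else 0.3
    return 0.5


def emit_p(word, t):
    return EMIT.get((word, t), 1e-6)


if __name__ == "__main__":
    print(viterbi_trigram(["dog", "barks"], TAGS, trans_p, emit_p))  # ['N', 'V']
```

Working in log space avoids numerical underflow on long sentences; keeping the state as a pair of tags is what makes the search second order.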
There are many different smoothing algorithms in the literature to handle the sparseness problem [32], all of them consisting of decreasing the probability assigned to known events and distributing the remaining mass among unknown events. In our work, we use linear interpolation of unigram, bigram and trigram maximum likelihood estimates to estimate the trigram transition probability:

$$P(t_3 \mid t_1, t_2) = \lambda_1 \hat{P}(t_3) + \lambda_2 \hat{P}(t_3 \mid t_2) + \lambda_3 \hat{P}(t_3 \mid t_1, t_2) \qquad (3)$$

where $\lambda_1 + \lambda_2 + \lambda_3 = 1$, so P represents a valid probability distribution. The λs are estimated by deleted interpolation.

For unknown words, we use the linear interpolation of fixed-length suffix models. The probability distribution for an unknown word suffix is generated from all words in the training set that have the same suffix up to some predefined maximum length. Probabilities are smoothed by successive abstraction. This method was proposed by Samuelsson et al. [33] and implemented for English and German [4]:

$$P(t \mid c_{n-i+1}, \ldots, c_n) = \frac{\hat{P}(t \mid c_{n-i+1}, \ldots, c_n) + \theta\, P(t \mid c_{n-i+2}, \ldots, c_n)}{1 + \theta}, \qquad \theta = \frac{1}{s-1} \sum_{j=1}^{s} \left(\hat{P}(t_j) - \bar{P}\right)^2, \qquad \bar{P} = \frac{1}{s} \sum_{j=1}^{s} \hat{P}(t_j) \qquad (4)$$

where $c_{n-i+1}, \ldots, c_n$ are the last $i$ characters of the word. In addition to the word suffix, the experiments utilize the following features: the presence of non-alphabetic characters and the existence of foreign characters.

In addition to the suffix guessing model, we define another basic model based on both the prefix and the suffix of the unknown word. The main linguistic motivation behind combining affix information is that in an Arabic word an affix sometimes requires or forbids the existence of another affix [34]. Prefix and suffix here denote substrings that come at the beginning and end of a word respectively, and are not necessarily morphologically meaningful. In this model, the lexical probabilities are estimated as follows:

Given an unknown word w, the lexical probabilities P(suffix(w) | t) are estimated using the suffix tries as in Equation 4. Then, the lexical probabilities P(prefix(w) | t) are estimated using the prefix tries, again as in Equation 4. Here, the probability distribution for an unknown word prefix is generated from all words in the training set that have the same prefix up to some predefined maximum length. Finally, we use the linear interpolation of the lexical probabilities obtained from the word suffix and prefix to calculate the lexical probability of the word w:

$$P(w \mid t) = \lambda\, P(\mathrm{suffix}(w) \mid t) + (1 - \lambda)\, P(\mathrm{prefix}(w) \mid t) \qquad (5)$$

where λ is an interpolation factor set experimentally, and prefix(w) and suffix(w) are the first m and the last n characters of w, respectively.

Table 3 summarizes the results of experiments with the prefix, suffix and prefix + suffix basic models. The first model (LM1) is the TnT suffix guessing algorithm. The second model (LM2) is the prefix guessing algorithm. The third model (LM3) is the linear interpolation of the prefix and suffix guessing algorithms for unknown words. LM3, which combines information from both suffix and prefix, gives a considerable rise in accuracy compared to the suffix guessing method. However, the performance of LM1, LM2 and LM3 on unknown words is still far from what is achieved in other languages. The results also show that some techniques which have proved effective for some languages do not work well for Arabic, such as LM1 (the suffix guessing algorithm), which proved to be a good indicator for unknown word POS guessing in English and German [4]. In the next section, we discuss our effort to improve the accuracy of the unknown word predictor. We combine the weighted output of the MA with word suffix and prefix information and with word pattern suffix and prefix information.

Table 3. The average POS tagging accuracy using the HMM tagger with the basic lexical models

Model                                     % of unknown words    Unknown acc.    Overall acc.
LM1 (TnT) Suffix guessing algorithm
LM2 Prefix guessing algorithm
LM3 Prefix + suffix guessing algorithm

7. SYSTEM IMPROVEMENT
7.1 Integration of Morphological Information
In order to further improve the tagging accuracy, we integrate morphological information with the lexical models. The main reason for our choice of an external MA is that suffix tries and the successive abstraction algorithm do not work well with Arabic. In our opinion, the main reasons that make this algorithm unsuitable for Arabic are: 1) data sparseness, 2) suffix ambiguity, and 3) the non-concatenative nature of Arabic words.

An MA is a function that takes a word w as input and outputs the set of all its possible POS tags. Note that the number of tags produced by the MA is much smaller than the size of the tagset. Thus, we have a restricted choice of tags, as well as of tag sequences, for a given sentence. Since the correct tag t for w is always among the MA output tags (assuming here that the MA is complete), it is always possible to find the correct tag sequence for a sentence even after applying the morphological restriction. Since the MA does not assign probabilities to the tags, we address this problem by assuming a uniform distribution over the tags proposed by the MA given the word.

In our system, we utilize the LDC-distributed Buckwalter Morphological Analyzer for Arabic (BAMA). The BAMA system is based on three tables: a prefix table, a stem table and a suffix table. The stem table of BAMA has very high coverage. Due to the differences between the BAMA tagset and our tagset, we implement a mapping function that maps each tag produced by BAMA to one or more tags in our tagset.

The new combined models (LM4 and LM5) follow a simple method for using the information from the MA:
1) If the word is in the training corpus, the lexical probabilities are estimated using MLE just as in the basic models.

2) If the word is not in the training corpus and it is known to the MA, the MA outputs the set of its possible tags. The lexical probabilities are then estimated as follows:
   a. The tag probability distribution is calculated using an appropriate weighting function, such as assuming a uniform distribution over the tags proposed by the MA.
   b. Then, the lexical probabilities are calculated using Bayesian inversion.
   c. Finally, we combine these lexical probabilities with the lexical probabilities provided by LM1 or LM3.
3) If the word is also unknown to the MA, the lexical probabilities provided by LM1 or LM3 are used.

For a fixed input text, the number of unknown words can be reduced by enriching the lexicon of the MA. This is suitable if the tagger is domain specific. But as a general solution for multi-domain texts, the tagger must be equipped with models that handle unknown words efficiently without extending the MA lexicon. Moreover, BAMA has many weaknesses in its coverage and in its analysis [28], [34], [35]. Mansour et al. [19] stated that more than 5% of the words in ATB2 and ATB3 cannot be tagged correctly using BAMA unless further data are added to those provided by the morphological analyzer. In addition, BAMA analyzes Arabic words in a concatenative manner. It has problems analyzing words with non-concatenative morphology, such as broken plurals [36]. This means it is unable to handle unknown words which are non-concatenative. Our objective is to provide a solution to the unknown word POS guessing problem which overcomes the limitations of the MA and also the need for a huge amount of annotated data. In the next section, we define lexical models which depend on some specific features of Arabic words. These models are able to extract useful information from Arabic words, whether they are formed using concatenative or non-concatenative morphology.
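The fallback scheme for words missing from the training corpus (cases 2 and 3 above) can be sketched in Python as follows. This is a hedged illustration only: the source does not say how the MA-based and affix-based lexical probabilities are combined, so the linear interpolation with weight `mu`, the renormalisation over the MA tags, and every name and number below (`unknown_word_lexical`, `toy_analyzer`, the tag priors) are assumptions made for the example, not the authors' method or the BAMA API.

```python
from typing import Callable, Dict, Set


def ma_scores(ma_tags: Set[str], tag_prior: Dict[str, float]) -> Dict[str, float]:
    """Uniform P(t|w) over the MA tags, inverted to a score proportional to P(w|t).

    Bayesian inversion: P(w|t) = P(t|w) * P(w) / P(t). P(w) is the same for
    every tag of a fixed word, so it is dropped here.
    """
    p_tag_given_word = 1.0 / len(ma_tags)
    return {t: p_tag_given_word / tag_prior[t] for t in ma_tags}


def normalise(scores: Dict[str, float]) -> Dict[str, float]:
    total = sum(scores.values())
    return {t: s / total for t, s in scores.items()}


def unknown_word_lexical(
    word: str,
    analyzer: Callable[[str], Set[str]],
    affix_model: Callable[[str], Dict[str, float]],
    tag_prior: Dict[str, float],
    mu: float = 0.5,  # assumed interpolation weight, not given in the source
) -> Dict[str, float]:
    """Lexical scores for a word that is absent from the training corpus."""
    affix = affix_model(word)      # stands in for LM1 or LM3
    ma_tags = analyzer(word)
    if not ma_tags:                # case 3: unknown to the MA as well
        return affix
    # Case 2: restrict to the MA tags and mix both sources (assumed combination).
    ma = normalise(ma_scores(ma_tags, tag_prior))
    affix_restricted = normalise({t: affix.get(t, 1e-12) for t in ma_tags})
    return {t: mu * ma[t] + (1.0 - mu) * affix_restricted[t] for t in ma_tags}


# Toy stand-ins with made-up values, for illustration only.
TAG_PRIOR = {"SN": 0.5, "BP": 0.25, "PV": 0.25}


def toy_analyzer(word: str) -> Set[str]:
    return {"SN", "BP"} if word == "kutub" else set()


def toy_affix_model(word: str) -> Dict[str, float]:
    return {"SN": 0.2, "BP": 0.6, "PV": 0.2}


if __name__ == "__main__":
    print(unknown_word_lexical("kutub", toy_analyzer, toy_affix_model, TAG_PRIOR))
```

In this toy run the MA rules out PV entirely, and the rarer tag BP receives a higher inverted score than SN because dividing by the tag prior favours tags that are less frequent overall.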
These lexical models then use the extracted information to predict the appropriate POS tag of the word.

7.2 Using Words' Internal Structure
Arabic words are quite different from English words. The word formation process for Arabic words is quite complex. The main formation of English words is concatenative, i.e. simply attaching affixes to the beginning and the end of the stem. Hence, word suffixes are strong indicators of the word's POS class. Brants et al. [4], for example, showed that an English word ending in the suffix -able is very likely to be an adjective. In contrast, the main word formation process in Arabic is inherently non-concatenative [30]. Thus, the suffixes of Arabic words (minimal word forms) are ambiguous, short and sparse. For example, most of the time Arabic words which are derived from the same root share the same suffix even if they have different POS. Moreover, words which belong to the same POS class often have different suffixes (see Table 4).

As stated in Sec. 5, Arabic words are derived by inserting root radical characters into a pattern's generic characters. The characters of Arabic words are divided into root radical characters and augmented characters. While the ten characters {ا,ه,ي,ن,و,م,ت,ل,أ,س}, called Augmented Characters (AC), of the 28 Arabic letters can be used as either root radical or augmented characters, any of the remaining 18 characters can be used only as a root radical character. The augmented characters appear in both Arabic words and their patterns, so they are sometimes called fixed characters [2]. In contrast, root radical characters appear only in the Arabic words and are replaced with generic characters (or variables) in the pattern. The reverse process to word derivation is pattern identification (or root extraction): the process that identifies the root radical characters in an Arabic word and replaces them with generic characters.

Table 4. List of some Arabic words derived from the roots "عمل" and "صنع" and their POS. The table shows the ambiguity and the sparseness of Arabic word suffixes

Arabic words         Pattern              Arabic pattern       POS
عمل صنع              XXX                  فعل                  PV
سنعمل سنصنع          سنXXX                سنفعل                VBS
عمل صنع              XXX                  فعل                  SN
معمل مصنع            مXXX                 مفعل                 SN
معامل مصانع          مXاXX                مفاعل                BP
عمال صناع            XXاX                 فعال                 BP
یعملون یصنعون        يXXXون               یفعلون               VBS
متعاملون مصنعون      متXاXXون مXXXون      متفاعلون مفعلون      SNP

The pattern of an Arabic word is a good indicator of its possible POS tags. In addition, patterns can be used to overcome the need for huge annotated data to cover the language vocabulary. All Arabic words which belong to open classes can be mapped to a few thousand patterns. Furthermore, by removing root radical characters from Arabic words, suffixes become less ambiguous, less sparse and longer (see Table 5). However, it is not easy to fully utilize pattern information. Pattern identification (or root extraction) is in itself a complicated task in Arabic NLP. In our current work, we try to balance the benefit that we can get from the pattern against the complexity of pattern identification.

We propose a light pattern identification algorithm to map an Arabic word to its pattern. The algorithm works as follows. Given an Arabic word which belongs to an open class: first, if the word contains one or more characters from the radical-only character set, replace them with the generic character X. Second, for the remaining characters, we use some position rules, proposed by Sonbol et al. (2008), to detect whether they are root radicals or augmented characters. We call the pattern produced by this algorithm the Augmented Character Form (ACF). We use this algorithm to map each non-functional word in the dictionary obtained from the training corpus to its ACF. Then, we estimate the emission probability (the lexical model) for each unique ACF. We use an augmented letter tree to represent the lexical model. Finally, for each unknown word in the test set, we estimate the probability of its ACF's suffix using suffix tries and successive abstraction. Then, we combine this probability with the output of the MA.

The algorithm is described more formally as follows (steps 1 and 2 are performed in the training phase, whereas step 3 is performed in the test phase):
1) For each word W in the dictionary obtained from the training data: if one of its possible tags belongs to the open classes, convert it to its augmented character form (ACF) as follows. For each character C in W:
   a) If C ∉ AC, replace it with the generic character X.
   b) Else, if C ∈ AC, check whether C is an augmented or a root radical character using the position rules; if it is a root radical character, replace it with the generic character X.
2) If two or more words have the same ACF, we represent them as one entry (ACF_j). The possible tags of the resulting ACF are the possible tags of all of its words. The probability distribution of ACF_j given a tag t is calculated as:

$$P(ACF_j \mid t) = \sum_{i=1}^{n} P(w_i \mid t) \qquad (6)$$

where $w_1, \ldots, w_n$ are the words that have the same ACF_j.
3) Finally, for each unknown word in the test set, we do the following:
   a) The word is converted to its ACF.
   b) The lexical probabilities P(ACF_suffix | t) and P(ACF_prefix | t) are estimated using the suffix tries and prefix tries as in Sec. 6. The only difference is that we replace the word by its ACF.
   c) We combine this information with the MA output, if the word is known to the MA.

8. EXPERIMENTS AND EVALUATION
The main purpose of this work is to study the behavior of different lexical models for an HMM POS tagger, in order to determine the best way to handle unknown word POS guessing for Arabic, especially when only a small amount of data is available. We evaluate these lexical models on the test set. We have a total of six models. The same training data is used to estimate the parameters of all the models, and the same test set is used to evaluate all the models. The size of the test set is 6540 words, of which 700 words are unknown. We define the tagging accuracy as the ratio of correctly tagged words to the total number of words. Results are summarized in Table 6.

Table 6. The average tagging accuracy of Arabic text using the improved lexical models

Model                                     % of unknown words    Unknown acc.    Overall acc.
(LM3) Word prefix + suffix guessing
(LM4) MA + suffix
(LM5) MA + word suffix + word prefix
(LM6) MA + ACF suffix + ACF prefix

We find that in both HMM-based models (LM4 and LM5), the use of a morphological analyzer with word affixes improves the accuracy with respect to the basic models (see Table 3 and Table 6). A significant increase in the unknown word POS tagging accuracy, and consequently in the overall accuracy, is clearly noticeable. As already noted, the use of the MA and word affixes improves the accuracy of the POS tagger. What is significant to note is that the improvement is higher when we use the MA and ACF affixes. The results of the experiments using LM6, which combines word morphological information with the ACF suffix and prefix, show a considerable increase over all other approaches. In addition, Table 7 compares our work on Arabic unknown word POS guessing with all related Arabic work. Our combined model (LM6) outperforms all related work on Arabic POS tagging that tackles the unknown word problem, although our training data is small.

Table 7. Comparison of our results in unknown word POS tagging with other related Arabic taggers

Tagger            Main technique    % of unknown words in the test set    Size of unknown words    Unknown acc.
Marsi 2005        MB                                                                               %73
AlGahtani 2009    TBL                                                                              %85
LM6               HMM

9. CONCLUSION AND FUTURE WORK
Unknown word tagging is a serious problem in POS tagging, especially when only a small amount of annotated data is available. The impact of this problem increases in languages which have a huge vocabulary and a rich morphological system, like Arabic. In this paper, we have investigated the best configuration of a second-order HMM POS tagger for Arabic when the training corpus is small. We have proposed several lexical models based on specific internal features of Arabic words. In addition, an external morphological analyzer has been integrated with the POS tagger to improve the tagger's results. Furthermore, we have presented several combinations of these lexical models. The best result is achieved by the combined lexical model, which combines the weighted output of the morphological analyzer with affix tries of the word's augmented character form (pattern form). Our tagger achieves the state of the art in Arabic text tagging and outperforms other Arabic taggers in unknown word tagging.

Our future direction is to improve the pattern-based unknown word predictor. This improvement can be done through several steps.
First, we intend to increase the size of the training corpus from small to medium size, to cover most Arabic word patterns. The second step is to improve the pattern identification algorithm so that each unknown word can be mapped to the pattern of a known word. Another future direction is to develop a new test set to re-evaluate the performance of the tagger. The new test set will include annotated data from multiple domains.

APPENDIX
The Arabic POS tagset used in annotating our corpus is given in Table 1. Figure 1 explains the simple and complex forms of the Arabic word فلمعتقداتھم with one of its possible tag sequences (composite tag). Table 5 shows a list of some patterns with their possible POS tags.

REFERENCES:
[1] D. Vadas and J. R. Curran, Tagging unknown words with raw text features, in Proceedings of the Australasian Language Technology Workshop, 2005.
[2] M. A. M. E. Ahmed, A Large-Scale Computational Processor of the Arabic Morphology, and Applications, Faculty of Engineering, Cairo University, Giza, Egypt.
[3] R. Abbès, J. Dichy, and M. Hassoun, The Architecture of a Standard Arabic lexical database: some figures, ratios and categories from the DIINAR.1 source program, in Proceedings of the Workshop on Computational Approaches to Arabic Script-based Languages, 2004.
[4] T. Brants, TnT: a statistical part-of-speech tagger, in Proceedings of the sixth conference on Applied natural language processing, 2000.
[5] A. Ratnaparkhi and others, A maximum entropy model for part-of-speech tagging, in Proceedings of the conference on empirical methods in natural language processing, 1996, vol. 1.
[6] S. M. Thede and M. P. Harper, A second-order hidden Markov model for part-of-speech tagging, in Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics, 1999.
[7] J. Lafferty, A. McCallum, and F. C. N. Pereira, Conditional random fields: Probabilistic models for segmenting and labeling sequence data,

9 Joural of Theoretical ad Applied Iformatio Techology 3 st August 206. Vol.90. No JATIT & LLS. All rights reserved. ISSN: E-ISSN: [8] K. Toutaova, D. Klei, C. D. Maig, ad Y. Siger, Feature-rich part-of-speech taggig with a cyclic depedecy etwork, i Proceedigs of the 2003 Coferece of the North America Chapter of the Associatio for Computatioal Liguistics o Huma Laguage Techology-Volume, 2003, pp [9] L. Màrquez ad J. Giméez, A geeral pos tagger geerator based o support vector machies, J. Mach. Lear. Res., [0] M. Poel, L. Stegema, ad R. op De Akker, A support vector machie approach to dutch part-of-speech taggig, i Advaces i itelliget data aalysis VII, Spriger, 2007, pp [] R. Weischedel, R. Schwartz, J. Palmucci, M. Meteer, ad L. Ramshaw, Copig with ambiguity ad ukow words through probabilistic models, Comput. Liguist., vol. 9, o. 2, pp , 993. [2] E. Brill, Some advaces i trasformatiobased part of speech taggig, arxiv Prepr. C., 994. [3] T. Nakagawa, Multiligual word segmetatio ad part-of-speech taggig: a machie learig approach icorporatig diverse features, Nara Istitute of Sciece ad Techology, Japa, [4] Ž. Agic ad Z. Doveda, Improvig partof-speech taggig accuracy for Croatia by morphological aalysis, Iformatica, vol. 32, o. 4, [5] Y. El Hadj, I. Al-Sughayeir, ad A. Al- Asari, Arabic part-of-speech taggig usig the setece structure, i Proceedigs of the Secod Iteratioal Coferece o Arabic Laguage Resources ad Tools, Cairo, Egypt, [6] S. AlGahtai, W. Black, ad J. McNaught, Arabic part-of-speech taggig usig trasformatio-based learig, i Proceedigs of the Secod Iteratioal Coferece o Arabic Laguage Resources ad Tools, Cairo, Egypt, [7] F. Al Shamsi ad A. Guessoum, A hidde Markov model-based POS tagger for Arabic, i Proceedig of the 8th Iteratioal Coferece o the Statistical Aalysis of Textual Data, Frace, 2006, pp [8] E. Marsi, A. Va De Bosch, ad A. 
Soudi, Memory-based morphological aalysis geeratio ad part-of-speech taggig of Arabic, i Proceedigs of the ACL workshop o computatioal approaches to semitic laguages, 2005, pp. 8. [9] S. Masour, K. Sima a, ad Y. Witer, Smoothig a lexico-based POS tagger for Arabic ad Hebrew, i Proceedigs of the 2007 Workshop o Computatioal Approaches to Semitic Laguages: Commo Issues ad Resources, 2007, pp [20] M. Diab, K. Hacioglu, ad D. Jurafsky, Automatic taggig of Arabic text: From raw text to base phrase chuks, i Proceedigs of HLT-NAACL 2004: Short Papers, 2004, pp [2] N. Habash ad O. Rambow, Arabic tokeizatio, part-of-speech taggig ad morphological disambiguatio i oe fell swoop, i Proceedigs of the 43rd Aual Meetig o Associatio for Computatioal Liguistics, 2005, pp [22] M. Albared, N. Omar, ad M. J. Ab Aziz, Arabic part of speech disambiguatio: A survey, It. Rev. Comput. Softw., pp , [23] M. Maamouri, A. Bies, ad S. Kulick, Ehacig the Arabic Treebak: a Collaborative Effort toward New Aotatio Guidelies., i LREC, [24] S. Boudelaa ad M. G. Gaskell, A reexamiatio of the default system for Arabic plurals, Lag. Cog. Process., vol. 7, o. 3, pp , [25] A. Goweder ad A. De Roeck, Assessmet of a sigificat Arabic corpus, i Arabic NLP Workshop at ACL/EACL, 200. [26] N. K. A. Alajmi, S. Bi Deris, ad S. Alajem, Computatioal Approach to Arabic Broke Derived Nous Morphology, i Advaced Computer Theory ad Egieerig, ICACTE 08. Iteratioal Coferece o, 2008, pp [27] A. Clark, Supervised ad usupervised learig of Arabic morphology, i Arabic Computatioal Morphology, Spriger, 2007, pp [28] S. Masour, Combiig character ad morpheme based models for part-of-speech taggig of Semitic laguages, Techio- Israel Istitute of Techology, Faculty of Computer Sciece, [29] R. Bar-Haim, K. Sima A, ad Y. Witer, Part-of-speech taggig of Moder Hebrew 244

10 Joural of Theoretical ad Applied Iformatio Techology 3 st August 206. Vol.90. No JATIT & LLS. All rights reserved. ISSN: E-ISSN: text, Nat. Lag. Eg., vol. 4, o. 02, pp , [30] Y. Cohe-Sygal, Computatioal implemetatio of o-cocateative morphology, Uiversity of Haifa, [3] A. J. Viterbi, Error bouds for covolutioal codes ad a asymptotically optimum decodig algorithm, If. Theory, IEEE Tras., vol. 3, o. 2, pp , 967. [32] S. F. Che ad J. Goodma, A empirical study of smoothig techiques for laguage modelig, Comput. Speech Lag., vol. 3, o. 4, pp , 999. [33] C. Samuelsso, Hadlig sparse data by successive abstractio, i Proceedigs of the 6th coferece o Computatioal liguistics-volume 2, 996, pp [34] M. A. Attia, Hadlig Arabic morphological ad sytactic ambiguity withi the LFG framework with a view to machie traslatio, Uiversity of Machester, [35] M. Sawalha ad E. S. Atwell, Comparative evaluatio of arabic laguage morphological aalysers ad stemmers, i Proceedigs of COLING d Iteratioal Coferece o Comptatioal Liguistics (Poster Volume)), 2008, pp [36] S. Alasary, M. Nagi, ad N. Adly, Towards aalyzig the iteratioal corpus of Arabic (ICA): Progress of morphological stage, i 8th Iteratioal Coferece o Laguage Egieerig, Egypt,

APPENDIX

Table 6. The Arabic POS Tagset Used In Annotating Our Corpus

Pos Tag | Label | Pos Tag | Label
Conjunction | CC | Broken Plural Noun | BPN
Number | CD | Possessive Pronoun | POSS_PRON
Adverb | ADV | Imperfective Verb | VBP
Particle | PART | Non Inflected Verb | NIV
Imperative Verb | IV | Relative Pronoun | REL_PRON
Foreign Word | FOREIGN | Interjection | INTERJ
Perfect Verb | PV | Interrogative Particle | INTER_PART
Passive Verb | PSSV | Interrogative Adverb | INTER_ADV
Preposition | PREP | Demonstrative Pronoun | DEM_PROP
Adjective | ADJ | Punctuation | PUNC
Singular Noun | SN | Proper Noun | NOUN_PROP
Sound Plural Noun | SPN | Personal Pronoun | PRON

Figure. The Simple And Complex Forms Of Arabic Word فلمعتقداتھم With One Of Its Possible Tags Sequence (Composite Tag): CC+PREP+SPN+POSS_PRON

Table 7. List Of Some Patterns With Their Possible POS Tags

Pattern | Arabic Pattern | Examples of Pattern's Words | Possible Tags
XXوX | فعول | نفوس / صدور / علوم / دخول / غفور | BPN, SN, ADJ
اXتXاX | افتعال | اجتماع / اقتصاد / افتتاح | SN
یستXXX | یستفعل | یستخرج / یستعمل / یستھلك | VBS, PSSV
تXXX | تفعل | تخرج / تعلم / تحدث / تدرب | IV, PSSV, PV, SN
يXXXون | یفعلون | یعملون / یصنعون / یسمعون | VBS
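As a rough illustration of how a pattern table like Table 7 might be consulted for an unknown word, the sketch below reads each X as a placeholder for any single Arabic letter and every other character as a literal. This reading is an assumption made for illustration, not the authors' pattern identification algorithm. The names ARABIC_LETTER, PATTERN_TAGS, compile_pattern and candidate_tags are hypothetical, and only the first two rows of Table 7 are encoded.

```python
import re

# Assumption: "X" in a pattern stands for exactly one Arabic letter.
# The range U+0621..U+064A covers the basic Arabic letters.
ARABIC_LETTER = "[\u0621-\u064A]"

# Two rows of Table 7, written with escapes to avoid look-alike characters.
PATTERN_TAGS = {
    "XX\u0648X": ["BPN", "SN", "ADJ"],   # XXwX, Arabic pattern f-'-w-l
    "\u0627X\u062AX\u0627X": ["SN"],     # aXtXaX, Arabic pattern a-f-t-'-a-l
}


def compile_pattern(pattern):
    """Turn a pattern string into a regex: X -> any letter, rest literal."""
    parts = [ARABIC_LETTER if ch == "X" else re.escape(ch) for ch in pattern]
    return re.compile("".join(parts))


COMPILED = [(compile_pattern(p), tags) for p, tags in PATTERN_TAGS.items()]


def candidate_tags(word):
    """Collect the possible tags of every pattern the whole word matches."""
    tags = []
    for regex, pattern_tags in COMPILED:
        if regex.fullmatch(word):
            for tag in pattern_tags:
                if tag not in tags:
                    tags.append(tag)
    return tags


# The example words "uluum" (XXwX row) and "ijtimaa" (aXtXaX row) from Table 7.
print(candidate_tags("\u0639\u0644\u0648\u0645"))
print(candidate_tags("\u0627\u062C\u062A\u0645\u0627\u0639"))
```

A lookup like this only narrows the candidate tag set. Choosing among the candidates would still be left to the tagger's probabilistic model, and a word that matches no pattern returns an empty list.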


More information

Twitter Sentiment Classification on Sanders Data using Hybrid Approach

Twitter Sentiment Classification on Sanders Data using Hybrid Approach IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 17, Issue 4, Ver. I (July Aug. 2015), PP 118-123 www.iosrjournals.org Twitter Sentiment Classification on Sanders

More information

Development of the First LRs for Macedonian: Current Projects

Development of the First LRs for Macedonian: Current Projects Development of the First LRs for Macedonian: Current Projects Ruska Ivanovska-Naskova Faculty of Philology- University St. Cyril and Methodius Bul. Krste Petkov Misirkov bb, 1000 Skopje, Macedonia rivanovska@flf.ukim.edu.mk

More information

CLASSIFICATION OF PROGRAM Critical Elements Analysis 1. High Priority Items Phonemic Awareness Instruction

CLASSIFICATION OF PROGRAM Critical Elements Analysis 1. High Priority Items Phonemic Awareness Instruction CLASSIFICATION OF PROGRAM Critical Elements Analysis 1 Program Name: Macmillan/McGraw Hill Reading 2003 Date of Publication: 2003 Publisher: Macmillan/McGraw Hill Reviewer Code: 1. X The program meets

More information

Natural Language Processing. George Konidaris

Natural Language Processing. George Konidaris Natural Language Processing George Konidaris gdk@cs.brown.edu Fall 2017 Natural Language Processing Understanding spoken/written sentences in a natural language. Major area of research in AI. Why? Humans

More information

Assignment 1: Predicting Amazon Review Ratings

Assignment 1: Predicting Amazon Review Ratings Assignment 1: Predicting Amazon Review Ratings 1 Dataset Analysis Richard Park r2park@acsmail.ucsd.edu February 23, 2015 The dataset selected for this assignment comes from the set of Amazon reviews for

More information

A Ruled-Based Part of Speech (RPOS) Tagger for Malay Text Articles

A Ruled-Based Part of Speech (RPOS) Tagger for Malay Text Articles A Ruled-Based Part of Speech (RPOS) Tagger for Malay Text Articles Rayner Alfred 1, Adam Mujat 1, and Joe Henry Obit 2 1 School of Engineering and Information Technology, Universiti Malaysia Sabah, Jalan

More information

Opportunities for Writing Title Key Stage 1 Key Stage 2 Narrative

Opportunities for Writing Title Key Stage 1 Key Stage 2 Narrative English Teaching Cycle The English curriculum at Wardley CE Primary is based upon the National Curriculum. Our English is taught through a text based curriculum as we believe this is the best way to develop

More information

An Online Handwriting Recognition System For Turkish

An Online Handwriting Recognition System For Turkish An Online Handwriting Recognition System For Turkish Esra Vural, Hakan Erdogan, Kemal Oflazer, Berrin Yanikoglu Sabanci University, Tuzla, Istanbul, Turkey 34956 ABSTRACT Despite recent developments in

More information

Books Effective Literacy Y5-8 Learning Through Talk Y4-8 Switch onto Spelling Spelling Under Scrutiny

Books Effective Literacy Y5-8 Learning Through Talk Y4-8 Switch onto Spelling Spelling Under Scrutiny By the End of Year 8 All Essential words lists 1-7 290 words Commonly Misspelt Words-55 working out more complex, irregular, and/or ambiguous words by using strategies such as inferring the unknown from

More information

Disambiguation of Thai Personal Name from Online News Articles

Disambiguation of Thai Personal Name from Online News Articles Disambiguation of Thai Personal Name from Online News Articles Phaisarn Sutheebanjard Graduate School of Information Technology Siam University Bangkok, Thailand mr.phaisarn@gmail.com Abstract Since online

More information

Intra-talker Variation: Audience Design Factors Affecting Lexical Selections

Intra-talker Variation: Audience Design Factors Affecting Lexical Selections Tyler Perrachione LING 451-0 Proseminar in Sound Structure Prof. A. Bradlow 17 March 2006 Intra-talker Variation: Audience Design Factors Affecting Lexical Selections Abstract Although the acoustic and

More information

11/29/2010. Statistical Parsing. Statistical Parsing. Simple PCFG for ATIS English. Syntactic Disambiguation

11/29/2010. Statistical Parsing. Statistical Parsing. Simple PCFG for ATIS English. Syntactic Disambiguation tatistical Parsing (Following slides are modified from Prof. Raymond Mooney s slides.) tatistical Parsing tatistical parsing uses a probabilistic model of syntax in order to assign probabilities to each

More information

Words come in categories

Words come in categories Nouns Words come in categories D: A grammatical category is a class of expressions which share a common set of grammatical properties (a.k.a. word class or part of speech). Words come in categories Open

More information

Cross Language Information Retrieval

Cross Language Information Retrieval Cross Language Information Retrieval RAFFAELLA BERNARDI UNIVERSITÀ DEGLI STUDI DI TRENTO P.ZZA VENEZIA, ROOM: 2.05, E-MAIL: BERNARDI@DISI.UNITN.IT Contents 1 Acknowledgment.............................................

More information

Analysis of Emotion Recognition System through Speech Signal Using KNN & GMM Classifier

Analysis of Emotion Recognition System through Speech Signal Using KNN & GMM Classifier IOSR Journal of Electronics and Communication Engineering (IOSR-JECE) e-issn: 2278-2834,p- ISSN: 2278-8735.Volume 10, Issue 2, Ver.1 (Mar - Apr.2015), PP 55-61 www.iosrjournals.org Analysis of Emotion

More information

Modeling function word errors in DNN-HMM based LVCSR systems

Modeling function word errors in DNN-HMM based LVCSR systems Modeling function word errors in DNN-HMM based LVCSR systems Melvin Jose Johnson Premkumar, Ankur Bapna and Sree Avinash Parchuri Department of Computer Science Department of Electrical Engineering Stanford

More information

Using Semantic Relations to Refine Coreference Decisions

Using Semantic Relations to Refine Coreference Decisions Using Semantic Relations to Refine Coreference Decisions Heng Ji David Westbrook Ralph Grishman Department of Computer Science New York University New York, NY, 10003, USA hengji@cs.nyu.edu westbroo@cs.nyu.edu

More information

Grade 4. Common Core Adoption Process. (Unpacked Standards)

Grade 4. Common Core Adoption Process. (Unpacked Standards) Grade 4 Common Core Adoption Process (Unpacked Standards) Grade 4 Reading: Literature RL.4.1 Refer to details and examples in a text when explaining what the text says explicitly and when drawing inferences

More information

Search right and thou shalt find... Using Web Queries for Learner Error Detection

Search right and thou shalt find... Using Web Queries for Learner Error Detection Search right and thou shalt find... Using Web Queries for Learner Error Detection Michael Gamon Claudia Leacock Microsoft Research Butler Hill Group One Microsoft Way P.O. Box 935 Redmond, WA 981052, USA

More information

Language Independent Passage Retrieval for Question Answering

Language Independent Passage Retrieval for Question Answering Language Independent Passage Retrieval for Question Answering José Manuel Gómez-Soriano 1, Manuel Montes-y-Gómez 2, Emilio Sanchis-Arnal 1, Luis Villaseñor-Pineda 2, Paolo Rosso 1 1 Polytechnic University

More information

The 9 th International Scientific Conference elearning and software for Education Bucharest, April 25-26, / X

The 9 th International Scientific Conference elearning and software for Education Bucharest, April 25-26, / X The 9 th International Scientific Conference elearning and software for Education Bucharest, April 25-26, 2013 10.12753/2066-026X-13-154 DATA MINING SOLUTIONS FOR DETERMINING STUDENT'S PROFILE Adela BÂRA,

More information

Named Entity Recognition: A Survey for the Indian Languages

Named Entity Recognition: A Survey for the Indian Languages Named Entity Recognition: A Survey for the Indian Languages Padmaja Sharma Dept. of CSE Tezpur University Assam, India 784028 psharma@tezu.ernet.in Utpal Sharma Dept.of CSE Tezpur University Assam, India

More information

Bridging Lexical Gaps between Queries and Questions on Large Online Q&A Collections with Compact Translation Models

Bridging Lexical Gaps between Queries and Questions on Large Online Q&A Collections with Compact Translation Models Bridging Lexical Gaps between Queries and Questions on Large Online Q&A Collections with Compact Translation Models Jung-Tae Lee and Sang-Bum Kim and Young-In Song and Hae-Chang Rim Dept. of Computer &

More information