Choosing the Optimal Segmentation Level for POS Tagging of the Quranic Arabic

British Journal of Applied Science & Technology, 2017
SCIENCEDOMAIN international

Fadl Mutaher Ba-Alwi (1*), Mohammed Albared (1) and Tareq Al-Moslmi (2)

(1) Faculty of Computer and Information Technology, Sana'a University, P.O. Box 247, Yemen.
(2) Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, Malaysia.

Authors' contributions

This work was carried out in collaboration between all authors. Author FMBA designed the study, performed the statistical analysis, wrote the protocol, wrote the first draft of the manuscript and managed the literature searches. Authors MA and TAM managed the analyses of the study and literature searches. All authors read and approved the final manuscript.

Article Information

DOI: /BJAST/2017/29754
Editor(s): (1) Kleopatra Nikolopoulou, School of Education, University of Athens, Athens, Greece.
Reviewers: (1) Azman Bin Che Mat, UiTM Terengganu, Malaysia. (2) Sazelin Arif, Universiti Teknikal Malaysia Melaka (UTeM), Malaysia.

Original Research Article
Received 27th September 2016; Accepted 11th November 2016; Published 13th February 2017

ABSTRACT

As a morphologically rich language, Arabic poses special challenges to Part-of-Speech (POS) tagging. Words in Arabic texts often contain several segments, each with its own POS category. The choice of the segmentation level or input unit, word-based or morpheme-based, is a major issue in designing any Arabic natural language processing system. In word-based approaches, words are used as the atomic units of the language, and composite POS tags are assigned to words. Large amounts of training data are therefore required to ensure statistical significance, and these approaches suffer from data sparseness and unknown words. In morpheme-based approaches, the morpheme components of words are used as the atomic units.
This, however, results in a high ambiguity rate and a small context for resolving that ambiguity, because the span of the n-gram might be limited to a single word. This paper compares and contrasts the morpheme-based and word-based statistical POS tagging strategies. It evaluates the tagging performance of three statistical models, namely the Arabic HMM POS tagger with the prefix guessing model, the Arabic HMM POS tagger with the linear interpolation guessing model and the TnT tagger, given training data from both the morpheme-based and word-based tokenization levels. It also studies the influence of each choice on the tagging performance of the Arabic POS tagging models, in terms of tagging accuracy and time complexity. Results show that the morpheme-based POS tagging strategy is better suited to training statistical POS tagging models, as it provides better overall tagging accuracy and much faster training and tagging times.

Keywords: Arabic natural language processing; POS tagging; segmentation levels.

*Corresponding author: dr.fadlbaalwi@gmail.com

1. INTRODUCTION

Part-of-Speech (POS) disambiguation is the ability to computationally determine which POS of a word is activated by its use in a particular context [1]. Automatic text tagging is an important pre-processing step in many NLP applications. Arabic is a morphologically rich language which poses challenges to Natural Language Processing (NLP) systems because of the many forms a word can take, which leads to data sparseness (the insufficiency of data).

Most current NLP research is based on supervised machine learning techniques in which the classifier learns from training sets that contain a fair number of words and their associated annotation. These classifiers need a huge amount of training data to reach reasonable accuracy, even for morphologically simpler languages such as English. In morphologically rich languages the problem is more severe, as the classifier is faced with many forms of the same word that do not repeat often enough for the tagger to learn the pattern (the data sparseness problem). These languages have a high vocabulary growth rate, which results in a large number of unknown words [2].

In Arabic and in other Semitic languages, a word, that is, a single orthographic space-delimited string, often consists of a concatenation of up to four sub-tokens [3], which function as free morpho-syntactic units, each with its own POS category. An Arabic word consists of proclitics, a stem with affixes (prefixes and suffixes) and enclitics. The clitics (proclitics and enclitics) have their own POS tags.
Following previous works, the term morpheme-level tagging pertains to morphemes as the word segments which are assigned POS tags from a given tag set. Accordingly, the Arabic word فبوعودكم (and + by/with + promises + your) consists of four morphemes (sub-tokens), ف-ب-وعود-كم. The POS of this word is a composite POS tag (Conj+Prep+Noun+Poss.Pron).

Consequently, when designing POS taggers or any NLP application for Arabic and other Semitic languages, a major architectural decision concerns whether to analyze a word as a sequence of morphological units (morpheme-based) or to treat space-delimited words as the primitive units of analysis (word-based) [2,4]. From a theoretical point of view, both methods have advantages and disadvantages. The morpheme-based approach increases the level of ambiguity, but it increases coverage and decreases the number of unknown words. The word-based approach, on the other hand, suffers from data sparseness, a large number of unknown words and a large tag set with composite tags, but it involves less ambiguity.

In addition, the word formation process for Arabic words is quite complex. While the main word formation process in English is concatenative, the main word formation process in Arabic is non-concatenative [2,5]. As in other Semitic languages, a word in Arabic can be described as a combination of two morphemes: a root and a pattern. A root is a set of consonants (also called radicals) which has a basic lexical meaning. A pattern consists of a set of vowels which are inserted among the consonants of a root to form a stem. In addition to this non-concatenative morphological feature, Arabic uses different affixes to create inflectional and derivational word forms. Thus, directly adopting NLP methods developed for western languages is not an appropriate choice for Arabic, given the specific features of the language [6].
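To make the two input units concrete, the following minimal sketch converts between a word-level token with a composite tag and its morpheme-level tokens. The function names `to_morpheme_level` and `to_word_level` are hypothetical helpers introduced for illustration only, and the sketch assumes the segmentation of the word is already given; it does not perform segmentation itself.

```python
from typing import List, Tuple


def to_morpheme_level(segments: List[str], composite_tag: str) -> List[Tuple[str, str]]:
    """Pair each segment with its simple tag (hypothetical helper)."""
    tags = composite_tag.split("+")
    if len(tags) != len(segments):
        raise ValueError("composite tag must have one simple tag per segment")
    return list(zip(segments, tags))


def to_word_level(morphemes: List[Tuple[str, str]]) -> Tuple[str, str]:
    """Concatenate segments into one word carrying a composite tag."""
    word = "".join(seg for seg, _ in morphemes)
    tag = "+".join(tag for _, tag in morphemes)
    return word, tag


# The example word from the text: and + by/with + promises + your
segments = ["ف", "ب", "وعود", "كم"]
morphemes = to_morpheme_level(segments, "Conj+Prep+Noun+Poss.Pron")
word, composite = to_word_level(morphemes)
```

A word-based tagger would see one token with one of many composite tags, whereas a morpheme-based tagger would see four tokens, each with a simple tag.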
The purpose of this paper is, therefore, to explore the influence of the different segmentation levels on the tagging performance of Arabic POS tagging models, in terms of accuracy and time complexity, in order to determine the best segmentation level for POS tagging when only a small amount of training data is available and the test data contain many unknown words. In addition, this paper evaluates the tagging performance of three fully supervised statistical models, namely the Arabic HMM POS tagger with the prefix guessing model, the Arabic HMM POS tagger with the linear interpolation guessing model and the TnT tagger (Arabic version), given training data from both tokenization levels.

The rest of the paper is organized as follows. Section 2 discusses related work, describes the HMM tagging approach and the modifications for better handling of unknown words in Arabic text, and describes the corpus used. Section 3 gives the experimental results and discusses them. Finally, conclusions and future work appear in Section 4.

2. MATERIALS AND METHODS

2.1 Related Work

Research on POS tagging has a long history, and numerous approaches have been applied successfully. The POS tagging techniques in the literature can be classified as follows:

Rule-based POS tagging: this approach is based on a lexicon and a set of disambiguation rules [7,8].

Supervised POS tagging: these approaches use machine-learning techniques to learn a classifier from labeled training sets, such as the maximum entropy model [9], the Hidden Markov model [10], conditional random fields [11], cyclic dependency networks [12] and support vector machines [13].

Unsupervised POS tagging: these approaches do not require pre-tagged training data, but rely on dictionary information.

POS tagging for Arabic has been an active topic of research in recent years. AlGahtani et al. [14], Yousif and Sembok [15], Al-Taani and Abu Al-Rub [16], Zribi et al. [17] and Alqrainy [18] are some examples of this line of work on Arabic.

The problem addressed in this work, selecting the best segmentation level (morphemes or words as input units) in Semitic language NLP, has been studied before by [2,4,19,20]. Bar-Haim et al. [4,19] study the choice of the optimal architecture for POS tagging of Hebrew and other Semitic languages. They show that a model whose terminal symbols are word segments (morphemes) is advantageous over a word-level model for the task of POS tagging.
Tachbelie [2] explored different ways of language modelling for Amharic, a morphologically rich Semitic language, using morphemes as units. The study showed that using morphemes in modelling morphologically rich languages is advantageous, especially in reducing the OOV rate. In contrast with these results, Mohamed and Kübler [21] and Kübler and Mohamed [20] reach a different conclusion. They state that the word-based POS tagging approach is more appropriate than the morpheme-based approach for POS tagging of Modern Standard Arabic. Unlike Mohamed and Kübler [21], this work evaluates the influence of the segmentation level on the performance of tagging models given data from Quranic Arabic (Classical Arabic).

Ali and Jarray [22] used a genetic algorithm to develop an Arabic part-of-speech tagger with a reduced tag set. Hadni et al. [23] propose a Hidden Markov Model (HMM) integrated with an Arabic rule-based method. Their POS tagger generates a set of three POS tags: Noun, Verb and Particle. Albared and Hazaa [24] present an approach based on the combination of several N-attributes probabilistic classifiers. First, the POS disambiguation problem is decoupled into several N-attributes tagging sub-problems. Then, several classifiers are used to solve each sub-problem. Finally, the outcomes of all N-attributes classifiers are combined. Several problem decomposition methods and classifier combination algorithms are investigated. Kadim and Lazrek [25] present bidirectional HMM-based Arabic POS tagging in which they combine direct and reverse taggers to tag the same sequence of words in both directions.

This work also evaluates the influence of the segmentation level on the tagging performance, not only in terms of tagging accuracy but also in terms of tagging time complexity. Moreover, this work evaluates the tagging performance of several fully supervised statistical tagging models developed especially for Arabic text.
2.2 Methodology

The probabilistic tagging models used in this work are based on the trigram Hidden Markov Model (HMM). The HMM tagger assigns a probability value to each pair $\langle w, t \rangle$, where $w = w_1, \ldots, w_n$ is the input sentence and $t = t_1, \ldots, t_n$ is the POS tag sequence. In an HMM, the POS tagging problem can be defined as finding the best tag sequence $t$ given the word sequence $w$. The label sequence generated by the model is the one which has the highest probability among all possible label sequences for the input word sequence. This can be formally expressed as:

$$\hat{t} = \arg\max_{t} \prod_{i=1}^{n} p(t_i \mid t_{i-1}, \ldots, t_1)\, p(w_i \mid t_i)$$

The first parameter, $p(t_i \mid t_{i-1}, \ldots, t_1)$, is known as the transition probability and the second parameter, $p(w_i \mid t_i)$, is known as the emission probability. These two model parameters are estimated from an annotated corpus by Maximum Likelihood Estimation (MLE), which is derived from the relative frequencies. Given these two probabilities, we can find the most likely tag sequence for a given word sequence using the Viterbi algorithm.

However, MLE is a poor estimator for statistical inference because data tend to be sparse. To handle the sparseness problem in this work, we use linear interpolation of unigram, bigram and trigram maximum likelihood estimates (written $\hat{p}$) to estimate the trigram transition probability:

$$p(t_3 \mid t_2, t_1) = \lambda_1\, \hat{p}(t_3) + \lambda_2\, \hat{p}(t_3 \mid t_2) + \lambda_3\, \hat{p}(t_3 \mid t_2, t_1)$$

where $\lambda_1 + \lambda_2 + \lambda_3 = 1$, so $p$ represents a valid probability distribution. The $\lambda$s are estimated by deleted interpolation.

To create an HMM POS tagger that can accurately tag unknown words, it is necessary to determine an estimate of the probability $p(w_i \mid t_j)$ for use in the tagger. As is known, if a word does not occur in the training data, then the lexical probability $p(w_i \mid t_j)$ for that word is 0 for all $t_j$. This requires adding an algorithm to the HMM to approximate the probability that the current tag will emit the given unknown word [10]. To handle unknown words, we have used the suffix probability algorithm [26], the prefix probability algorithm and the linear interpolation guessing algorithm [27].

2.3 Dataset

The data used in this work is the Quranic Arabic Corpus [28].
The Quranic Arabic Corpus is an annotated linguistic resource which shows the Arabic grammar, syntax and morphology for each word in the Holy Quran, the religious book of Islam, which is written in classical Quranic Arabic (c. 600 CE). The research project is organized at the University of Leeds, and is part of the Arabic language computing research group within the School of Computing. The Quranic Arabic Corpus consists of 77,430 words of Quranic Arabic. For the purpose of this work, we have used two versions of the Quranic Arabic Corpus:

The word-based version: An example from this version is shown in Table 1. The composite tag consists of multiple tags separated by +, one tag for each word segment. The composite tag set consists of 375 tags.

The morpheme-based version: An example from this version is shown in Table 1. The tag set of this version consists of 45 simple tags.

A brief statistical summary (the total number of words, the total number of unique words and the tag set) of the two versions is shown in Table 2.

Table 1. Examples from the word-based version and the morpheme-based version of the Quranic corpus

| Word-based: Word | Word-based: POS | Morpheme-based: Word | Morpheme-based: POS |
|---|---|---|---|
| <V> | <V> | <V> | <V> |
| الذين | REL | الذين | REL |
| يؤمنون | V+PRON | يؤمن | V |
| | | ون | PRON |
| بالغيب | P+DET+N | ب | P |
| | | ال | DET |
| | | غيب | N |

Table 2. Statistical summary of the two versions [columns: number of words, unique words, size of tag set; rows: word-based version, morpheme-based version; cell values not preserved]
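Returning to the transition model of Section 2.2, the interpolated trigram estimate can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `count_ngrams` and `interpolated_transition` are hypothetical, the toy tag sequences are made up, and the λ weights are fixed by hand here rather than estimated by deleted interpolation as in the paper.

```python
from collections import Counter
from typing import List, Tuple


def count_ngrams(tag_sequences: List[List[str]]) -> Tuple[Counter, Counter, Counter]:
    """Collect unigram, bigram and trigram tag counts from tagged sentences."""
    uni, bi, tri = Counter(), Counter(), Counter()
    for tags in tag_sequences:
        for i, t in enumerate(tags):
            uni[t] += 1
            if i >= 1:
                bi[(tags[i - 1], t)] += 1
            if i >= 2:
                tri[(tags[i - 2], tags[i - 1], t)] += 1
    return uni, bi, tri


def interpolated_transition(t1: str, t2: str, t3: str,
                            uni: Counter, bi: Counter, tri: Counter,
                            lambdas: Tuple[float, float, float]) -> float:
    """Estimate p(t3 | t2, t1) as a weighted sum of relative frequencies."""
    l1, l2, l3 = lambdas
    total = sum(uni.values())
    # Each relative frequency falls back to 0.0 when its history is unseen.
    p_uni = uni[t3] / total if total else 0.0
    p_bi = bi[(t2, t3)] / uni[t2] if uni[t2] else 0.0
    p_tri = tri[(t1, t2, t3)] / bi[(t1, t2)] if bi[(t1, t2)] else 0.0
    return l1 * p_uni + l2 * p_bi + l3 * p_tri


# Toy morpheme-level tag sequences (illustrative only).
toy = [["P", "DET", "N"], ["V", "PRON"], ["P", "DET", "N"]]
uni, bi, tri = count_ngrams(toy)
weights = (0.2, 0.3, 0.5)  # assumed weights; must sum to 1
p_seen = interpolated_transition("P", "DET", "N", uni, bi, tri, weights)
p_unseen = interpolated_transition("V", "DET", "N", uni, bi, tri, weights)
```

The point of the interpolation is visible in `p_unseen`: the trigram (V, DET, N) never occurs in the toy data, yet the estimate stays above zero because the bigram and unigram terms still contribute.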

3. RESULTS AND DISCUSSION

In this section, we report an empirical comparison between the two segmentation levels presented in the previous sections, and study their influence on the tagging performance of Arabic POS tagging models when only a small amount of training data is available.

3.1 Experimental Setting

Each of the two corpus versions is split into a training set and a testing set. We divided the word-based version randomly into 90.25% (69980 words, 5700 sentences) for training and 9.75% (7550 words, 536 sentences) for testing. The test data are chosen independently from the training data. The morpheme-based version is then divided using the same split, see Table 3. As the table shows, the number of tokens is larger in the morpheme-based version than in the word-based version, even though the training and testing sets contain the same sentences in both versions.

Furthermore, in order to study the effect of the size of the training data, we randomly partitioned the training data from the two versions to construct seven training sets. Table 4 shows the sizes of the training sets and the percentages of unknown words with respect to the test set. The same test set is used in all experiments. Although each training set from the morpheme-based version contains the same sentences as its equivalent in the word-based version, the number of words and the percentages of unknown words differ. It is interesting to note that the training sets from the morpheme-based version contain more words and yield lower percentages of unknown words than their word-based counterparts.

3.2 Results and Discussion

First, several experiments are conducted using the TnT model. Table 5 presents the results (known-word accuracy, unknown-word accuracy and overall accuracy) obtained for each training set from the two versions: the word-based version and the morpheme-based version.
We can note that the unknown-word accuracy of the TnT tagger over training sets from the word-based version is very low and shows no sensitivity to the increase in data size. Nevertheless, an overall accuracy of 88.% (96.2% on known words and 37.7% on unknown words) is obtained when the whole training data are used (training set 7).

Table 3. Statistical summary of the training and testing data from the two versions of the Quranic corpus

| | Word-based: Training | Word-based: Testing | Morpheme-based: Training | Morpheme-based: Testing |
|---|---|---|---|---|
| Percentage | 90.25% | 9.75% | 89.1% | 10.9% |

[Rows for # of sentences (verses), # of words and # of unique words: cell values not preserved]

Table 4. The sizes of the training sets from the two versions of the Arabic Quranic corpus, and the percentage of unknown words in each set with respect to the test set [columns: training set; training size and % of unknown words for the word-based version; training size and % of unknown words for the morpheme-based version; cell values not preserved]
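The quantities reported in these tables, the percentage of unknown words in the test set and the known, unknown and overall accuracies, can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names `unknown_rate` and `accuracy_breakdown` are hypothetical, and the transliterated tokens and tags are made-up toy data, not corpus figures.

```python
from typing import Dict, Iterable, Optional, Sequence, Set


def unknown_rate(train_tokens: Iterable[str], test_tokens: Sequence[str]) -> float:
    """Percentage of test tokens that never occur in the training tokens."""
    vocab = set(train_tokens)
    if not test_tokens:
        return 0.0
    unknown = sum(1 for tok in test_tokens if tok not in vocab)
    return 100.0 * unknown / len(test_tokens)


def accuracy_breakdown(tokens: Sequence[str], gold: Sequence[str],
                       pred: Sequence[str],
                       train_vocab: Set[str]) -> Dict[str, Optional[float]]:
    """Accuracy (%) on known tokens, unknown tokens and all tokens."""
    counts = {"known": [0, 0], "unknown": [0, 0], "overall": [0, 0]}
    for tok, g, p in zip(tokens, gold, pred):
        group = "known" if tok in train_vocab else "unknown"
        for key in (group, "overall"):
            counts[key][0] += int(g == p)  # correct predictions
            counts[key][1] += 1            # tokens seen in this group
    return {k: (100.0 * c / n if n else None) for k, (c, n) in counts.items()}


# Toy data: the same text at word level and at morpheme level.
train_words = ["bAlgyb", "yWmnwn"]
test_words = ["wAlgyb"]
train_morphs = ["b", "Al", "gyb", "yWmn", "wn"]
test_morphs = ["w", "Al", "gyb"]

word_rate = unknown_rate(train_words, test_words)
morph_rate = unknown_rate(train_morphs, test_morphs)
scores = accuracy_breakdown(test_morphs, ["CONJ", "DET", "N"],
                            ["P", "DET", "N"], set(train_morphs))
```

In this toy case the unseen word is entirely unknown at the word level, while at the morpheme level only its proclitic is unknown, which mirrors the lower unknown-word percentages reported for the morpheme-based training sets.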

Using training sets from the morpheme-based version, the unknown-word tagging results of the TnT tagger are much better than its results over those from the word-based version. An overall accuracy of 93.8% (95.6% on known words and 73.4% on unknown words) is obtained when the whole training data are used. In general, given TnT as the tagging model, morpheme-based POS tagging yields much better results than full word-based tagging (93.8% vs. 88.4%).

Secondly, several experiments are conducted using the Arabic HMM POS tagger with the prefix guessing model. Table 6 presents the results obtained for each training set from the two versions. Tables 5 and 6 show that the Arabic HMM POS tagger with the prefix guessing model always performs significantly better than the TnT tagger with the suffix guessing model, regardless of the segmentation level and of the training set size. The overall tagging results in Tables 5 and 6 also show that morpheme-based POS tagging always yields much better results than word-based tagging, regardless of the tagging model and the size of the training set.

It is interesting to note that word-based POS tagging produces slightly better known-word accuracy than morpheme-based POS tagging. This is because the morpheme-based approach increases the level of ambiguity. On the other hand, morpheme-based POS tagging produces much better unknown-word accuracy than word-based POS tagging. These results show that treating segmentation as a separate pre-processing step (using segmented text) is better for handling unknown words and for POS tagging in general, especially when training data are small.

In addition, we compare the computational time cost (training and testing) of two POS tagging models (the TnT tagger and the Arabic HMM POS tagger with the prefix guessing model) when they are trained using different sized training sets from the two versions: the word-based version and the morpheme-based version.
First, we have found that the TnT POS tagger and the Arabic HMM POS tagger with the prefix guessing model have approximately the same computational time (training and testing) when they are trained and tested using the same training and test data. This means that both taggers are equally efficient with respect to execution time. For this reason, we only study here the computational time cost of the Arabic HMM POS tagger with the prefix guessing model when it is trained using different sized training sets (and therefore different percentages of unknown words) from the two segmentation levels (and therefore different sizes of tag sets): the word-based version and the morpheme-based version. Figs. 1 and 2 show the curves of the average training and testing time taken by the Arabic HMM POS tagger with the prefix guessing model when it is trained using different sized training sets from the two tokenization levels.

Table 5. Tagging accuracies of the TnT tagger with varying sizes of the training data from the two Quranic versions [columns: training set; unknown, known and overall accuracy for the word-based version; unknown, known and overall accuracy for the morpheme-based version; cell values not preserved]

Table 6. Tagging accuracies of the Arabic HMM tagger with the prefix guessing model with varying sizes of the training data from the two Quranic versions [columns: training set; unknown, known and overall accuracy for the word-based version; unknown, known and overall accuracy for the morpheme-based version; cell values not preserved]

Fig. 1. The training time taken by the Arabic HMM POS tagger trained using different sized training data sets from both tokenization levels [y-axis: training time (M); series: word-based, morpheme-based]

Fig. 2. The testing time taken by the Arabic HMM POS tagger trained using different sized training data sets from both tokenization levels [y-axis: testing time (M); series: word-based, morpheme-based]

Table 7. The tagging performance (time and accuracy) of the Arabic HMM POS tagger with the linear interpolation guessing model for each of the two tokenization levels [columns: corpus; % of unknown; best λ; training and testing time in minutes; unknown, known and overall accuracy; rows: word-based, morpheme-based; cell values not preserved]

From Figs. 1 and 2, we can draw several important observations. First, the training time is much lower than the testing time, regardless of the training set and the corpus version used. Second, the training time for a training set from the morpheme-based version is lower than the training time for its counterpart from the word-based version. Third, the training time increases as the training data increase (see Fig. 1), and the testing time decreases as the training data increase (see Fig. 2). The explanation is that as the training data increase, the number of unknown words in the test data decreases substantially (see Table 4), so less exceptional processing and less tagging time are needed. In fact, there is a strong positive correlation of 0.99 between the testing time and the percentage of unknown words in the test sets regardless of the tokenization level used, which indicates that tagging time and the percentage of unknown words move in the same direction. Fourth, and most importantly, the testing time of word-based POS tagging (on the order of hours) is much larger than the testing time of morpheme-based POS tagging (a few seconds). From Figs. 1 and 2, we can readily observe that morpheme-based POS tagging is the better choice, as its tagging time is much smaller than the tagging time of word-based POS tagging.

Finally, several experiments are conducted using our HMM tagger with the linear interpolation guessing model, trained using the whole training data (training set 7) from the two corpus versions. The λ value is varied from 0.0 to 1 in fixed increments. Table 7 summarizes the tagging results, the computational time needed and the best λ, that is, the value at which the model gives the best result, for each of the two segmentation approaches. The results again show that morpheme-based POS tagging always yields better results than word-based tagging.
In addition, as with the previous models (TnT and the Arabic trigram HMM tagger with the prefix guessing model), the tagging time of word-based POS tagging (5 minutes) is much larger than the tagging time of morpheme-based POS tagging (a few seconds). Moreover, the linear interpolation guessing model performs better than the two previous models (TnT and the Arabic HMM POS tagger with the prefix guessing model) for both tokenization levels.

4. CONCLUSION

Designing a POS tagger for Arabic with small training data is a challenging task due to the specific features of the Arabic language and its high degree of ambiguity. In this paper, we compare and contrast morpheme-based and word-based POS tagging strategies and study the influence of each on the tagging performance of Arabic POS tagging models, in terms of tagging accuracy and time complexity. In addition, we evaluate and compare several stochastic tagging models. We conducted a series of experiments using two versions of the Quranic Arabic Corpus: a morpheme-based version and a word-based version. Results show that tagging models perform significantly better when their terminal symbols are word segments (morpheme-based) than when their terminal symbols are words (word-based). In addition, the results show that the Arabic trigram HMM POS tagger with the linear interpolation guessing algorithm substantially improves the tagging results over the TnT tagger regardless of the tokenization level used. Our future direction is to study the influence of the segmentation level on other Arabic NLP processes. Moreover, we plan to design a joint segmentation and POS tagging model which performs both tasks simultaneously.

COMPETING INTERESTS

Authors have declared that no competing interests exist.

REFERENCES

1. Albared M, Omar N, Ab Aziz MJ. Developing a competitive HMM Arabic POS tagger using small training corpora. In Proceedings of the Third International Conference on Intelligent Information and Database Systems - Volume Part I, Daegu, Korea. 2011.
2. Tachbelie MY. Morphology-based language modeling for Amharic. Ph.D., Department of Informatics, University of Hamburg.
3. Attia MA. Arabic tokenization system. Presented at the Proceedings of the 2007 Workshop on Computational Approaches to Semitic Languages: Common Issues and Resources, Prague, Czech Republic.
4. Bar-Haim R, Sima'an K, Winter Y. Part-of-speech tagging of modern Hebrew text. Natural Language Engineering. 2008;14.
5. Beesley KR, Karttunen L. Finite-state non-concatenative morphotactics. Presented at the Proceedings of the 38th Annual Meeting on Association for Computational Linguistics, Hong Kong.
6. Farghaly A, Shaalan K. Arabic natural language processing: Challenges and solutions. ACM Transactions on Asian Language Information Processing (TALIP). 2009;8.
7. Loftsson H. Tagging Icelandic text: A linguistic rule-based approach. Nordic Journal of Linguistics. 2008;31.
8. Brill E. A simple rule-based part of speech tagger. Presented at the Proceedings of the Third Conference on Applied Natural Language Processing, Trento, Italy.
9. Ratnaparkhi A. Maximum entropy models for natural language ambiguity resolution. Ph.D., Computer and Information Science, University of Pennsylvania.
10. Thede SM, Harper MP. A second-order Hidden Markov Model for part-of-speech tagging. Presented at the Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics on Computational Linguistics, College Park, Maryland.
11. Lafferty JD, McCallum A, Pereira FCN. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. Presented at the Proceedings of the Eighteenth International Conference on Machine Learning.
12. Toutanova K, Klein D, Manning CD, Singer Y. Feature-rich part-of-speech tagging with a cyclic dependency network. Presented at the Proceedings of NAACL '03, Edmonton, Canada.
13. Giménez J, Màrquez L. SVMTool: A general POS tagger generator based on support vector machines. In Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC), Lisbon, Portugal. 2004.
14. AlGahtani S, Black W, McNaught J. Arabic part-of-speech tagging using transformation-based learning. Presented at the Proceedings of the Second International Conference on Arabic Language Resources and Tools, Cairo, Egypt.
15. Yousif JH, Sembok T. Arabic part-of-speech tagger based support vectors machines. In Information Technology, ITSim International Symposium. 2008.
16. Al-Taani A, Abu Al-Rub S. A rule-based approach for tagging non-vocalized Arabic words. The International Arab Journal of Information Technology. 2009;9.
17. Zribi C, Torjmen A, Ahmed M. A multi-agent system for POS-tagging vocalized Arabic text. The International Arab Journal of Information Technology.
18. Alqrainy S. A morphological-syntactical analysis approach for Arabic textual tagging. Ph.D., De Montfort University, Leicester, UK.
19. Bar-Haim R, Sima'an K, Winter Y. Choosing an optimal architecture for segmentation and POS-tagging of modern Hebrew. Presented at the Proceedings of the ACL Workshop on Computational Approaches to Semitic Languages, Ann Arbor, Michigan.
20. Kübler S, Mohamed E. Part of speech tagging for Arabic. Natural Language Engineering. First View. 2011.
21. Mohamed E, Kübler S. Is Arabic part of speech tagging feasible without word segmentation? Presented at the 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Los Angeles, California, USA.
22. Ali BB, Jarray F. Genetic approach for Arabic part of speech tagging. International Journal on Natural Language Computing. 2013;2.
23. Hadni M, Ouatik S, Lachkar A, Meknassi M. Hybrid part-of-speech tagger for non-vocalized Arabic text. International Journal on Natural Language Computing (IJNLC). 2013.
24. Albared M, Hazaa M. N-attributes stochastic classifier combination for Arabic morphological disambiguation. Saba Journal of Information Technology and Networking (SJITN). 2015.
25. Kadim A, Lazrek A. Bidirectional HMM-based Arabic POS tagging. International Journal of Speech Technology. 2016;19.
26. Brants T. TnT: A statistical part-of-speech tagger. Presented at the Proceedings of the Sixth Conference on Applied Natural Language Processing, Seattle, Washington.
27. Albared M, Omar N, Ab Aziz MJ, Nazri MZA. Automatic part of speech tagging for Arabic: An experiment using Bigram hidden Markov model. Presented at the Proceedings of the 5th International Conference on Rough Set and Knowledge Technology, Beijing, China.
28. Dukes K, Atwell E, Sharaf ABM. Syntactic annotation guidelines for the Quranic Arabic Dependency Treebank. Presented at the Language Resources and Evaluation Conference (LREC 2010), Valletta, Malta; 2010.

© 2017 Ba-Alwi et al.; This is an Open Access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


More information

Online Updating of Word Representations for Part-of-Speech Tagging

Online Updating of Word Representations for Part-of-Speech Tagging Online Updating of Word Representations for Part-of-Speech Tagging Wenpeng Yin LMU Munich wenpeng@cis.lmu.de Tobias Schnabel Cornell University tbs49@cornell.edu Hinrich Schütze LMU Munich inquiries@cislmu.org

More information

ESSLLI 2010: Resource-light Morpho-syntactic Analysis of Highly

ESSLLI 2010: Resource-light Morpho-syntactic Analysis of Highly ESSLLI 2010: Resource-light Morpho-syntactic Analysis of Highly Inflected Languages Classical Approaches to Tagging The slides are posted on the web. The url is http://chss.montclair.edu/~feldmana/esslli10/.

More information

EdIt: A Broad-Coverage Grammar Checker Using Pattern Grammar

EdIt: A Broad-Coverage Grammar Checker Using Pattern Grammar EdIt: A Broad-Coverage Grammar Checker Using Pattern Grammar Chung-Chi Huang Mei-Hua Chen Shih-Ting Huang Jason S. Chang Institute of Information Systems and Applications, National Tsing Hua University,

More information

On March 15, 2016, Governor Rick Snyder. Continuing Medical Education Becomes Mandatory in Michigan. in this issue... 3 Great Lakes Veterinary

On March 15, 2016, Governor Rick Snyder. Continuing Medical Education Becomes Mandatory in Michigan. in this issue... 3 Great Lakes Veterinary michiga veteriary medical associatio i this issue... 3 Great Lakes Veteriary Coferece 4 What You Need to Kow Whe Issuig a Iterstate Certificate of Ispectio 6 Low Pathogeic Avia Iflueza H5 Virus Detectios

More information

SINGLE DOCUMENT AUTOMATIC TEXT SUMMARIZATION USING TERM FREQUENCY-INVERSE DOCUMENT FREQUENCY (TF-IDF)

SINGLE DOCUMENT AUTOMATIC TEXT SUMMARIZATION USING TERM FREQUENCY-INVERSE DOCUMENT FREQUENCY (TF-IDF) SINGLE DOCUMENT AUTOMATIC TEXT SUMMARIZATION USING TERM FREQUENCY-INVERSE DOCUMENT FREQUENCY (TF-IDF) Hans Christian 1 ; Mikhael Pramodana Agus 2 ; Derwin Suhartono 3 1,2,3 Computer Science Department,

More information

Linking Task: Identifying authors and book titles in verbose queries

Linking Task: Identifying authors and book titles in verbose queries Linking Task: Identifying authors and book titles in verbose queries Anaïs Ollagnier, Sébastien Fournier, and Patrice Bellot Aix-Marseille University, CNRS, ENSAM, University of Toulon, LSIS UMR 7296,

More information

Distant Supervised Relation Extraction with Wikipedia and Freebase

Distant Supervised Relation Extraction with Wikipedia and Freebase Distant Supervised Relation Extraction with Wikipedia and Freebase Marcel Ackermann TU Darmstadt ackermann@tk.informatik.tu-darmstadt.de Abstract In this paper we discuss a new approach to extract relational

More information

2/15/13. POS Tagging Problem. Part-of-Speech Tagging. Example English Part-of-Speech Tagsets. More Details of the Problem. Typical Problem Cases

2/15/13. POS Tagging Problem. Part-of-Speech Tagging. Example English Part-of-Speech Tagsets. More Details of the Problem. Typical Problem Cases POS Tagging Problem Part-of-Speech Tagging L545 Spring 203 Given a sentence W Wn and a tagset of lexical categories, find the most likely tag T..Tn for each word in the sentence Example Secretariat/P is/vbz

More information

Learning Methods in Multilingual Speech Recognition

Learning Methods in Multilingual Speech Recognition Learning Methods in Multilingual Speech Recognition Hui Lin Department of Electrical Engineering University of Washington Seattle, WA 98125 linhui@u.washington.edu Li Deng, Jasha Droppo, Dong Yu, and Alex

More information

The Internet as a Normative Corpus: Grammar Checking with a Search Engine

The Internet as a Normative Corpus: Grammar Checking with a Search Engine The Internet as a Normative Corpus: Grammar Checking with a Search Engine Jonas Sjöbergh KTH Nada SE-100 44 Stockholm, Sweden jsh@nada.kth.se Abstract In this paper some methods using the Internet as a

More information

Learning Structural Correspondences Across Different Linguistic Domains with Synchronous Neural Language Models

Learning Structural Correspondences Across Different Linguistic Domains with Synchronous Neural Language Models Learning Structural Correspondences Across Different Linguistic Domains with Synchronous Neural Language Models Stephan Gouws and GJ van Rooyen MIH Medialab, Stellenbosch University SOUTH AFRICA {stephan,gvrooyen}@ml.sun.ac.za

More information

A Syllable Based Word Recognition Model for Korean Noun Extraction

A Syllable Based Word Recognition Model for Korean Noun Extraction are used as the most important terms (features) that express the document in NLP applications such as information retrieval, document categorization, text summarization, information extraction, and etc.

More information

HybridTechniqueforArabicTextCompression

HybridTechniqueforArabicTextCompression Global Journal of Computer Science and Technology: C Software & Data Engineering Volume 15 Issue 1 Version 1.0 Type: Double Blind Peer Reviewed International Research Journal Publisher: Global Journals

More information

Predicting Student Attrition in MOOCs using Sentiment Analysis and Neural Networks

Predicting Student Attrition in MOOCs using Sentiment Analysis and Neural Networks Predicting Student Attrition in MOOCs using Sentiment Analysis and Neural Networks Devendra Singh Chaplot, Eunhee Rhim, and Jihie Kim Samsung Electronics Co., Ltd. Seoul, South Korea {dev.chaplot,eunhee.rhim,jihie.kim}@samsung.com

More information

POS tagging of Chinese Buddhist texts using Recurrent Neural Networks

POS tagging of Chinese Buddhist texts using Recurrent Neural Networks POS tagging of Chinese Buddhist texts using Recurrent Neural Networks Longlu Qin Department of East Asian Languages and Cultures longlu@stanford.edu Abstract Chinese POS tagging, as one of the most important

More information

ARNE - A tool for Namend Entity Recognition from Arabic Text

ARNE - A tool for Namend Entity Recognition from Arabic Text 24 ARNE - A tool for Namend Entity Recognition from Arabic Text Carolin Shihadeh DFKI Stuhlsatzenhausweg 3 66123 Saarbrücken, Germany carolin.shihadeh@dfki.de Günter Neumann DFKI Stuhlsatzenhausweg 3 66123

More information

2014 Gold Award Winner SpecialParent

2014 Gold Award Winner SpecialParent Award Wier SpecialParet Dedicated to all families of childre with special eeds 6 th Editio/Fall/Witer 2014 Desig ad Editorial Awards Competitio MISSION Our goal is to provide parets of childre with special

More information

Natural Language Processing. George Konidaris

Natural Language Processing. George Konidaris Natural Language Processing George Konidaris gdk@cs.brown.edu Fall 2017 Natural Language Processing Understanding spoken/written sentences in a natural language. Major area of research in AI. Why? Humans

More information

Chinese Language Parsing with Maximum-Entropy-Inspired Parser

Chinese Language Parsing with Maximum-Entropy-Inspired Parser Chinese Language Parsing with Maximum-Entropy-Inspired Parser Heng Lian Brown University Abstract The Chinese language has many special characteristics that make parsing difficult. The performance of state-of-the-art

More information

System Implementation for SemEval-2017 Task 4 Subtask A Based on Interpolated Deep Neural Networks

System Implementation for SemEval-2017 Task 4 Subtask A Based on Interpolated Deep Neural Networks System Implementation for SemEval-2017 Task 4 Subtask A Based on Interpolated Deep Neural Networks 1 Tzu-Hsuan Yang, 2 Tzu-Hsuan Tseng, and 3 Chia-Ping Chen Department of Computer Science and Engineering

More information

also inside Continuing Education Alumni Authors College Events

also inside Continuing Education Alumni Authors College Events SUMMER 2016 JAMESTOWN COMMUNITY COLLEGE ALUMNI MAGAZINE create a etrepreeur creatig a busiess a artist creatig beauty a citize creatig the future also iside Cotiuig Educatio Alumi Authors College Evets

More information

Detecting English-French Cognates Using Orthographic Edit Distance

Detecting English-French Cognates Using Orthographic Edit Distance Detecting English-French Cognates Using Orthographic Edit Distance Qiongkai Xu 1,2, Albert Chen 1, Chang i 1 1 The Australian National University, College of Engineering and Computer Science 2 National

More information

Target Language Preposition Selection an Experiment with Transformation-Based Learning and Aligned Bilingual Data

Target Language Preposition Selection an Experiment with Transformation-Based Learning and Aligned Bilingual Data Target Language Preposition Selection an Experiment with Transformation-Based Learning and Aligned Bilingual Data Ebba Gustavii Department of Linguistics and Philology, Uppsala University, Sweden ebbag@stp.ling.uu.se

More information

Procedia - Social and Behavioral Sciences 141 ( 2014 ) WCLTA Using Corpus Linguistics in the Development of Writing

Procedia - Social and Behavioral Sciences 141 ( 2014 ) WCLTA Using Corpus Linguistics in the Development of Writing Available online at www.sciencedirect.com ScienceDirect Procedia - Social and Behavioral Sciences 141 ( 2014 ) 124 128 WCLTA 2013 Using Corpus Linguistics in the Development of Writing Blanka Frydrychova

More information

Switchboard Language Model Improvement with Conversational Data from Gigaword

Switchboard Language Model Improvement with Conversational Data from Gigaword Katholieke Universiteit Leuven Faculty of Engineering Master in Artificial Intelligence (MAI) Speech and Language Technology (SLT) Switchboard Language Model Improvement with Conversational Data from Gigaword

More information

ScienceDirect. Malayalam question answering system

ScienceDirect. Malayalam question answering system Available online at www.sciencedirect.com ScienceDirect Procedia Technology 24 (2016 ) 1388 1392 International Conference on Emerging Trends in Engineering, Science and Technology (ICETEST - 2015) Malayalam

More information

Training and evaluation of POS taggers on the French MULTITAG corpus

Training and evaluation of POS taggers on the French MULTITAG corpus Training and evaluation of POS taggers on the French MULTITAG corpus A. Allauzen, H. Bonneau-Maynard LIMSI/CNRS; Univ Paris-Sud, Orsay, F-91405 {allauzen,maynard}@limsi.fr Abstract The explicit introduction

More information

THE ROLE OF DECISION TREES IN NATURAL LANGUAGE PROCESSING

THE ROLE OF DECISION TREES IN NATURAL LANGUAGE PROCESSING SISOM & ACOUSTICS 2015, Bucharest 21-22 May THE ROLE OF DECISION TREES IN NATURAL LANGUAGE PROCESSING MarilenaăLAZ R 1, Diana MILITARU 2 1 Military Equipment and Technologies Research Agency, Bucharest,

More information

Semi-supervised Training for the Averaged Perceptron POS Tagger

Semi-supervised Training for the Averaged Perceptron POS Tagger Semi-supervised Training for the Averaged Perceptron POS Tagger Drahomíra johanka Spoustová Jan Hajič Jan Raab Miroslav Spousta Institute of Formal and Applied Linguistics Faculty of Mathematics and Physics,

More information

Twitter Sentiment Classification on Sanders Data using Hybrid Approach

Twitter Sentiment Classification on Sanders Data using Hybrid Approach IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 17, Issue 4, Ver. I (July Aug. 2015), PP 118-123 www.iosrjournals.org Twitter Sentiment Classification on Sanders

More information

Specification and Evaluation of Machine Translation Toy Systems - Criteria for laboratory assignments

Specification and Evaluation of Machine Translation Toy Systems - Criteria for laboratory assignments Specification and Evaluation of Machine Translation Toy Systems - Criteria for laboratory assignments Cristina Vertan, Walther v. Hahn University of Hamburg, Natural Language Systems Division Hamburg,

More information

Cross-Lingual Text Categorization

Cross-Lingual Text Categorization Cross-Lingual Text Categorization Nuria Bel 1, Cornelis H.A. Koster 2, and Marta Villegas 1 1 Grup d Investigació en Lingüística Computacional Universitat de Barcelona, 028 - Barcelona, Spain. {nuria,tona}@gilc.ub.es

More information

A Case Study: News Classification Based on Term Frequency

A Case Study: News Classification Based on Term Frequency A Case Study: News Classification Based on Term Frequency Petr Kroha Faculty of Computer Science University of Technology 09107 Chemnitz Germany kroha@informatik.tu-chemnitz.de Ricardo Baeza-Yates Center

More information

An Evaluation of POS Taggers for the CHILDES Corpus

An Evaluation of POS Taggers for the CHILDES Corpus City University of New York (CUNY) CUNY Academic Works Dissertations, Theses, and Capstone Projects Graduate Center 9-30-2016 An Evaluation of POS Taggers for the CHILDES Corpus Rui Huang The Graduate

More information

Large vocabulary off-line handwriting recognition: A survey

Large vocabulary off-line handwriting recognition: A survey Pattern Anal Applic (2003) 6: 97 121 DOI 10.1007/s10044-002-0169-3 ORIGINAL ARTICLE A. L. Koerich, R. Sabourin, C. Y. Suen Large vocabulary off-line handwriting recognition: A survey Received: 24/09/01

More information

Parsing of part-of-speech tagged Assamese Texts

Parsing of part-of-speech tagged Assamese Texts IJCSI International Journal of Computer Science Issues, Vol. 6, No. 1, 2009 ISSN (Online): 1694-0784 ISSN (Print): 1694-0814 28 Parsing of part-of-speech tagged Assamese Texts Mirzanur Rahman 1, Sufal

More information

CS Machine Learning

CS Machine Learning CS 478 - Machine Learning Projects Data Representation Basic testing and evaluation schemes CS 478 Data and Testing 1 Programming Issues l Program in any platform you want l Realize that you will be doing

More information

Semi-supervised methods of text processing, and an application to medical concept extraction. Yacine Jernite Text-as-Data series September 17.

Semi-supervised methods of text processing, and an application to medical concept extraction. Yacine Jernite Text-as-Data series September 17. Semi-supervised methods of text processing, and an application to medical concept extraction Yacine Jernite Text-as-Data series September 17. 2015 What do we want from text? 1. Extract information 2. Link

More information

CS 598 Natural Language Processing

CS 598 Natural Language Processing CS 598 Natural Language Processing Natural language is everywhere Natural language is everywhere Natural language is everywhere Natural language is everywhere!"#$%&'&()*+,-./012 34*5665756638/9:;< =>?@ABCDEFGHIJ5KL@

More information

Prentice Hall Literature: Timeless Voices, Timeless Themes, Platinum 2000 Correlated to Nebraska Reading/Writing Standards (Grade 10)

Prentice Hall Literature: Timeless Voices, Timeless Themes, Platinum 2000 Correlated to Nebraska Reading/Writing Standards (Grade 10) Prentice Hall Literature: Timeless Voices, Timeless Themes, Platinum 2000 Nebraska Reading/Writing Standards (Grade 10) 12.1 Reading The standards for grade 1 presume that basic skills in reading have

More information

Abdul Rahman Chik a*, Tg. Ainul Farha Tg. Abdul Rahman b

Abdul Rahman Chik a*, Tg. Ainul Farha Tg. Abdul Rahman b Available online at www.sciencedirect.com Procedia - Social and Behavioral Sciences 66 ( 2012 ) 223 231 The 8th International Language for Specific Purposes (LSP) Seminar - Aligning Theoretical Knowledge

More information

Named Entity Recognition: A Survey for the Indian Languages

Named Entity Recognition: A Survey for the Indian Languages Named Entity Recognition: A Survey for the Indian Languages Padmaja Sharma Dept. of CSE Tezpur University Assam, India 784028 psharma@tezu.ernet.in Utpal Sharma Dept.of CSE Tezpur University Assam, India

More information

Parallel Evaluation in Stratal OT * Adam Baker University of Arizona

Parallel Evaluation in Stratal OT * Adam Baker University of Arizona Parallel Evaluation in Stratal OT * Adam Baker University of Arizona tabaker@u.arizona.edu 1.0. Introduction The model of Stratal OT presented by Kiparsky (forthcoming), has not and will not prove uncontroversial

More information

Assignment 1: Predicting Amazon Review Ratings

Assignment 1: Predicting Amazon Review Ratings Assignment 1: Predicting Amazon Review Ratings 1 Dataset Analysis Richard Park r2park@acsmail.ucsd.edu February 23, 2015 The dataset selected for this assignment comes from the set of Amazon reviews for

More information

Disambiguation of Thai Personal Name from Online News Articles

Disambiguation of Thai Personal Name from Online News Articles Disambiguation of Thai Personal Name from Online News Articles Phaisarn Sutheebanjard Graduate School of Information Technology Siam University Bangkok, Thailand mr.phaisarn@gmail.com Abstract Since online

More information

Performance Analysis of Optimized Content Extraction for Cyrillic Mongolian Learning Text Materials in the Database

Performance Analysis of Optimized Content Extraction for Cyrillic Mongolian Learning Text Materials in the Database Journal of Computer and Communications, 2016, 4, 79-89 Published Online August 2016 in SciRes. http://www.scirp.org/journal/jcc http://dx.doi.org/10.4236/jcc.2016.410009 Performance Analysis of Optimized

More information

South Carolina English Language Arts

South Carolina English Language Arts South Carolina English Language Arts A S O F J U N E 2 0, 2 0 1 0, T H I S S TAT E H A D A D O P T E D T H E CO M M O N CO R E S TAT E S TA N DA R D S. DOCUMENTS REVIEWED South Carolina Academic Content

More information

11/29/2010. Statistical Parsing. Statistical Parsing. Simple PCFG for ATIS English. Syntactic Disambiguation

11/29/2010. Statistical Parsing. Statistical Parsing. Simple PCFG for ATIS English. Syntactic Disambiguation tatistical Parsing (Following slides are modified from Prof. Raymond Mooney s slides.) tatistical Parsing tatistical parsing uses a probabilistic model of syntax in order to assign probabilities to each

More information

Lecture 1: Machine Learning Basics

Lecture 1: Machine Learning Basics 1/69 Lecture 1: Machine Learning Basics Ali Harakeh University of Waterloo WAVE Lab ali.harakeh@uwaterloo.ca May 1, 2017 2/69 Overview 1 Learning Algorithms 2 Capacity, Overfitting, and Underfitting 3

More information

1. Introduction. 2. The OMBI database editor

1. Introduction. 2. The OMBI database editor OMBI bilingual lexical resources: Arabic-Dutch / Dutch-Arabic Carole Tiberius, Anna Aalstein, Instituut voor Nederlandse Lexicologie Jan Hoogland, Nederlands Instituut in Marokko (NIMAR) In this paper

More information

Language Independent Passage Retrieval for Question Answering

Language Independent Passage Retrieval for Question Answering Language Independent Passage Retrieval for Question Answering José Manuel Gómez-Soriano 1, Manuel Montes-y-Gómez 2, Emilio Sanchis-Arnal 1, Luis Villaseñor-Pineda 2, Paolo Rosso 1 1 Polytechnic University

More information

Segmental Conditional Random Fields with Deep Neural Networks as Acoustic Models for First-Pass Word Recognition

Segmental Conditional Random Fields with Deep Neural Networks as Acoustic Models for First-Pass Word Recognition Segmental Conditional Random Fields with Deep Neural Networks as Acoustic Models for First-Pass Word Recognition Yanzhang He, Eric Fosler-Lussier Department of Computer Science and Engineering The hio

More information

arxiv:cmp-lg/ v1 7 Jun 1997 Abstract

arxiv:cmp-lg/ v1 7 Jun 1997 Abstract Comparing a Linguistic and a Stochastic Tagger Christer Samuelsson Lucent Technologies Bell Laboratories 600 Mountain Ave, Room 2D-339 Murray Hill, NJ 07974, USA christer@research.bell-labs.com Atro Voutilainen

More information

Language Acquisition Fall 2010/Winter Lexical Categories. Afra Alishahi, Heiner Drenhaus

Language Acquisition Fall 2010/Winter Lexical Categories. Afra Alishahi, Heiner Drenhaus Language Acquisition Fall 2010/Winter 2011 Lexical Categories Afra Alishahi, Heiner Drenhaus Computational Linguistics and Phonetics Saarland University Children s Sensitivity to Lexical Categories Look,

More information

Grade 4. Common Core Adoption Process. (Unpacked Standards)

Grade 4. Common Core Adoption Process. (Unpacked Standards) Grade 4 Common Core Adoption Process (Unpacked Standards) Grade 4 Reading: Literature RL.4.1 Refer to details and examples in a text when explaining what the text says explicitly and when drawing inferences

More information

Prentice Hall Literature: Timeless Voices, Timeless Themes Gold 2000 Correlated to Nebraska Reading/Writing Standards, (Grade 9)

Prentice Hall Literature: Timeless Voices, Timeless Themes Gold 2000 Correlated to Nebraska Reading/Writing Standards, (Grade 9) Nebraska Reading/Writing Standards, (Grade 9) 12.1 Reading The standards for grade 1 presume that basic skills in reading have been taught before grade 4 and that students are independent readers. For

More information

DEVELOPMENT OF A MULTILINGUAL PARALLEL CORPUS AND A PART-OF-SPEECH TAGGER FOR AFRIKAANS

DEVELOPMENT OF A MULTILINGUAL PARALLEL CORPUS AND A PART-OF-SPEECH TAGGER FOR AFRIKAANS DEVELOPMENT OF A MULTILINGUAL PARALLEL CORPUS AND A PART-OF-SPEECH TAGGER FOR AFRIKAANS Julia Tmshkina Centre for Text Techitology, North-West University, 253 Potchefstroom, South Africa 2025770@puk.ac.za

More information

A Ruled-Based Part of Speech (RPOS) Tagger for Malay Text Articles

A Ruled-Based Part of Speech (RPOS) Tagger for Malay Text Articles A Ruled-Based Part of Speech (RPOS) Tagger for Malay Text Articles Rayner Alfred 1, Adam Mujat 1, and Joe Henry Obit 2 1 School of Engineering and Information Technology, Universiti Malaysia Sabah, Jalan

More information

OPTIMIZATINON OF TRAINING SETS FOR HEBBIAN-LEARNING- BASED CLASSIFIERS

OPTIMIZATINON OF TRAINING SETS FOR HEBBIAN-LEARNING- BASED CLASSIFIERS OPTIMIZATINON OF TRAINING SETS FOR HEBBIAN-LEARNING- BASED CLASSIFIERS Václav Kocian, Eva Volná, Michal Janošek, Martin Kotyrba University of Ostrava Department of Informatics and Computers Dvořákova 7,

More information

Memory-based grammatical error correction

Memory-based grammatical error correction Memory-based grammatical error correction Antal van den Bosch Peter Berck Radboud University Nijmegen Tilburg University P.O. Box 9103 P.O. Box 90153 NL-6500 HD Nijmegen, The Netherlands NL-5000 LE Tilburg,

More information

Arabic Orthography vs. Arabic OCR

Arabic Orthography vs. Arabic OCR Arabic Orthography vs. Arabic OCR Rich Heritage Challenging A Much Needed Technology Mohamed Attia Having consistently been spoken since more than 2000 years and on, Arabic is doubtlessly the oldest among

More information

BAUM-WELCH TRAINING FOR SEGMENT-BASED SPEECH RECOGNITION. Han Shu, I. Lee Hetherington, and James Glass

BAUM-WELCH TRAINING FOR SEGMENT-BASED SPEECH RECOGNITION. Han Shu, I. Lee Hetherington, and James Glass BAUM-WELCH TRAINING FOR SEGMENT-BASED SPEECH RECOGNITION Han Shu, I. Lee Hetherington, and James Glass Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology Cambridge,

More information

Prediction of Maximal Projection for Semantic Role Labeling

Prediction of Maximal Projection for Semantic Role Labeling Prediction of Maximal Projection for Semantic Role Labeling Weiwei Sun, Zhifang Sui Institute of Computational Linguistics Peking University Beijing, 100871, China {ws, szf}@pku.edu.cn Haifeng Wang Toshiba

More information

Clickthrough-Based Translation Models for Web Search: from Word Models to Phrase Models

Clickthrough-Based Translation Models for Web Search: from Word Models to Phrase Models Clickthrough-Based Translation Models for Web Search: from Word Models to Phrase Models Jianfeng Gao Microsoft Research One Microsoft Way Redmond, WA 98052 USA jfgao@microsoft.com Xiaodong He Microsoft

More information

A New Perspective on Combining GMM and DNN Frameworks for Speaker Adaptation

A New Perspective on Combining GMM and DNN Frameworks for Speaker Adaptation A New Perspective on Combining GMM and DNN Frameworks for Speaker Adaptation SLSP-2016 October 11-12 Natalia Tomashenko 1,2,3 natalia.tomashenko@univ-lemans.fr Yuri Khokhlov 3 khokhlov@speechpro.com Yannick

More information

Indian Institute of Technology, Kanpur

Indian Institute of Technology, Kanpur Indian Institute of Technology, Kanpur Course Project - CS671A POS Tagging of Code Mixed Text Ayushman Sisodiya (12188) {ayushmn@iitk.ac.in} Donthu Vamsi Krishna (15111016) {vamsi@iitk.ac.in} Sandeep Kumar

More information

Accurate Unlexicalized Parsing for Modern Hebrew

Accurate Unlexicalized Parsing for Modern Hebrew Accurate Unlexicalized Parsing for Modern Hebrew Reut Tsarfaty and Khalil Sima an Institute for Logic, Language and Computation, University of Amsterdam Plantage Muidergracht 24, 1018TV Amsterdam, The

More information

Constructing Parallel Corpus from Movie Subtitles

Constructing Parallel Corpus from Movie Subtitles Constructing Parallel Corpus from Movie Subtitles Han Xiao 1 and Xiaojie Wang 2 1 School of Information Engineering, Beijing University of Post and Telecommunications artex.xh@gmail.com 2 CISTR, Beijing

More information

OCR for Arabic using SIFT Descriptors With Online Failure Prediction

OCR for Arabic using SIFT Descriptors With Online Failure Prediction OCR for Arabic using SIFT Descriptors With Online Failure Prediction Andrey Stolyarenko, Nachum Dershowitz The Blavatnik School of Computer Science Tel Aviv University Tel Aviv, Israel Email: stloyare@tau.ac.il,

More information

Universiteit Leiden ICT in Business

Universiteit Leiden ICT in Business Universiteit Leiden ICT in Business Ranking of Multi-Word Terms Name: Ricardo R.M. Blikman Student-no: s1184164 Internal report number: 2012-11 Date: 07/03/2013 1st supervisor: Prof. Dr. J.N. Kok 2nd supervisor:

More information

Word Segmentation of Off-line Handwritten Documents

Word Segmentation of Off-line Handwritten Documents Word Segmentation of Off-line Handwritten Documents Chen Huang and Sargur N. Srihari {chuang5, srihari}@cedar.buffalo.edu Center of Excellence for Document Analysis and Recognition (CEDAR), Department

More information

Modeling full form lexica for Arabic

Modeling full form lexica for Arabic Modeling full form lexica for Arabic Susanne Alt Amine Akrout Atilf-CNRS Laurent Romary Loria-CNRS Objectives Presentation of the current standardization activity in the domain of lexical data modeling

More information

Web as Corpus. Corpus Linguistics. Web as Corpus 1 / 1. Corpus Linguistics. Web as Corpus. web.pl 3 / 1. Sketch Engine. Corpus Linguistics

Web as Corpus. Corpus Linguistics. Web as Corpus 1 / 1. Corpus Linguistics. Web as Corpus. web.pl 3 / 1. Sketch Engine. Corpus Linguistics (L615) Markus Dickinson Department of Linguistics, Indiana University Spring 2013 The web provides new opportunities for gathering data Viable source of disposable corpora, built ad hoc for specific purposes

More information

Derivational and Inflectional Morphemes in Pak-Pak Language

Derivational and Inflectional Morphemes in Pak-Pak Language Derivational and Inflectional Morphemes in Pak-Pak Language Agustina Situmorang and Tima Mariany Arifin ABSTRACT The objectives of this study are to find out the derivational and inflectional morphemes

More information

The Karlsruhe Institute of Technology Translation Systems for the WMT 2011

The Karlsruhe Institute of Technology Translation Systems for the WMT 2011 The Karlsruhe Institute of Technology Translation Systems for the WMT 2011 Teresa Herrmann, Mohammed Mediani, Jan Niehues and Alex Waibel Karlsruhe Institute of Technology Karlsruhe, Germany firstname.lastname@kit.edu

More information

A Comparison of Two Text Representations for Sentiment Analysis

A Comparison of Two Text Representations for Sentiment Analysis 010 International Conference on Computer Application and System Modeling (ICCASM 010) A Comparison of Two Text Representations for Sentiment Analysis Jianxiong Wang School of Computer Science & Educational

More information

Experiments with a Higher-Order Projective Dependency Parser

Experiments with a Higher-Order Projective Dependency Parser Experiments with a Higher-Order Projective Dependency Parser Xavier Carreras Massachusetts Institute of Technology (MIT) Computer Science and Artificial Intelligence Laboratory (CSAIL) 32 Vassar St., Cambridge,

More information

Heuristic Sample Selection to Minimize Reference Standard Training Set for a Part-Of-Speech Tagger

Heuristic Sample Selection to Minimize Reference Standard Training Set for a Part-Of-Speech Tagger Page 1 of 35 Heuristic Sample Selection to Minimize Reference Standard Training Set for a Part-Of-Speech Tagger Kaihong Liu, MD, MS, Wendy Chapman, PhD, Rebecca Hwa, PhD, and Rebecca S. Crowley, MD, MS

More information

BUILDING CONTEXT-DEPENDENT DNN ACOUSTIC MODELS USING KULLBACK-LEIBLER DIVERGENCE-BASED STATE TYING

BUILDING CONTEXT-DEPENDENT DNN ACOUSTIC MODELS USING KULLBACK-LEIBLER DIVERGENCE-BASED STATE TYING BUILDING CONTEXT-DEPENDENT DNN ACOUSTIC MODELS USING KULLBACK-LEIBLER DIVERGENCE-BASED STATE TYING Gábor Gosztolya 1, Tamás Grósz 1, László Tóth 1, David Imseng 2 1 MTA-SZTE Research Group on Artificial

More information

Speech Recognition at ICSI: Broadcast News and beyond

Speech Recognition at ICSI: Broadcast News and beyond Speech Recognition at ICSI: Broadcast News and beyond Dan Ellis International Computer Science Institute, Berkeley CA Outline 1 2 3 The DARPA Broadcast News task Aspects of ICSI

More information

Towards a MWE-driven A* parsing with LTAGs [WG2,WG3]

Towards a MWE-driven A* parsing with LTAGs [WG2,WG3] Towards a MWE-driven A* parsing with LTAGs [WG2,WG3] Jakub Waszczuk, Agata Savary To cite this version: Jakub Waszczuk, Agata Savary. Towards a MWE-driven A* parsing with LTAGs [WG2,WG3]. PARSEME 6th general

More information

Cross-Lingual Dependency Parsing with Universal Dependencies and Predicted PoS Labels

Cross-Lingual Dependency Parsing with Universal Dependencies and Predicted PoS Labels Cross-Lingual Dependency Parsing with Universal Dependencies and Predicted PoS Labels Jörg Tiedemann Uppsala University Department of Linguistics and Philology firstname.lastname@lingfil.uu.se Abstract

More information

Literature and the Language Arts Experiencing Literature

Literature and the Language Arts Experiencing Literature Correlation of Literature and the Language Arts Experiencing Literature Grade 9 2 nd edition to the Nebraska Reading/Writing Standards EMC/Paradigm Publishing 875 Montreal Way St. Paul, Minnesota 55102

More information

QuickStroke: An Incremental On-line Chinese Handwriting Recognition System

QuickStroke: An Incremental On-line Chinese Handwriting Recognition System QuickStroke: An Incremental On-line Chinese Handwriting Recognition System Nada P. Matić John C. Platt Λ Tony Wang y Synaptics, Inc. 2381 Bering Drive San Jose, CA 95131, USA Abstract This paper presents

More information

Ensemble Technique Utilization for Indonesian Dependency Parser

Ensemble Technique Utilization for Indonesian Dependency Parser Ensemble Technique Utilization for Indonesian Dependency Parser Arief Rahman Institut Teknologi Bandung Indonesia 23516008@std.stei.itb.ac.id Ayu Purwarianti Institut Teknologi Bandung Indonesia ayu@stei.itb.ac.id

More information

Rule Learning With Negation: Issues Regarding Effectiveness

Rule Learning With Negation: Issues Regarding Effectiveness Rule Learning With Negation: Issues Regarding Effectiveness S. Chua, F. Coenen, G. Malcolm University of Liverpool Department of Computer Science, Ashton Building, Ashton Street, L69 3BX Liverpool, United

More information

have to be modeled) or isolated words. Output of the system is a grapheme-tophoneme conversion system which takes as its input the spelling of words,

have to be modeled) or isolated words. Output of the system is a grapheme-tophoneme conversion system which takes as its input the spelling of words, A Language-Independent, Data-Oriented Architecture for Grapheme-to-Phoneme Conversion Walter Daelemans and Antal van den Bosch Proceedings ESCA-IEEE speech synthesis conference, New York, September 1994

More information