CBAS: CONTEXT BASED ARABIC STEMMER


Mahmoud El-Defrawy, Yasser El-Sonbaty and Nahla A. Belal
College of Computing and Information Technology, AAST, Alexandria, Egypt

ABSTRACT

Arabic morphology encapsulates many valuable features, such as a word's root. Arabic roots are utilized for many tasks; the process of extracting a word's root is referred to as stemming. Stemming is an essential part of most Natural Language Processing tasks, especially for derivative languages such as Arabic. However, stemming is faced with the problem of ambiguity, where two or more roots could be extracted from the same word. On the other hand, distributional semantics is a powerful co-occurrence model. It captures the meaning of a word based on its context. In this paper, a distributional semantics model utilizing Smoothed Pointwise Mutual Information (SPMI) is constructed to investigate its effectiveness on the stemming analysis task. It showed an accuracy of 81.5%, an improvement of at least 9.4 percentage points over other stemmers.

KEYWORDS

Natural Language Processing, Computational Linguistics, Text Analysis, Stemming

1. INTRODUCTION

Natural Languages (NLs) are the communication channels between humans. They allow conveying information, exchanging knowledge, and sharing ideas. For many years scientists have studied Natural Languages and developed theories and rules that govern their use, such as Grammar and Morphology. Natural Language Processing (NLP) is the intersection between linguistics and Computational Science (CS) [1]. NLP allows utilizing linguistics to use Natural Languages as a way of communicating with computational devices[1].

The association between linguistics and computational sciences has evolved over time. Machine Translation (MT) was one of the first NLP tasks in the 1950s; it began as translation from Russian to English [2]. The progress of MT was limited by the complexity of linguistic rules and the low computational power available at the time[1]. However, Chomsky's theory[3] of natural language grammar formed the basis for the formation of Backus-Naur Form (BNF). BNF[4] notations are commonly used to represent Context Free Grammar (CFG). CFGs are used to systematically describe and validate artificial languages, such as programming languages. Using CFGs to describe some aspects of Natural Languages requires a non-trivial set of rules, which results in some ambiguity due to unexpected rule interactions. The introduction of statistical methods gave some insights for reducing NLP ambiguity[1]. For example, Probabilistic CFGs extend traditional CFGs by deducing linguistic rules and assigning weights[5]. Rules and weights are statistically deduced from a large annotated corpus. The noticeable improvement in MT sparked research in NLP [1].

DOI : /ijlc

Table 1. Context Matrix Sample. Words shown in the sample: القاهرة (Cairo), جامعة (University), دول (Countries), عجائب (Wonders), نظم (Systems), سائح (Tourist), حكم (Judgement).

NLP tends to work with large sets of data from various languages. This raises the need for defining a concise representation of the data while preserving as many of its features as possible. A concise representation is required for almost any NLP task (at the word, sentence, or document level). Stemming is a primary NLP task, and it contributes to many other NLP tasks [6]. Stemming reduces a word to its basic form[7] while preserving its main characteristics. Many languages define linguistic rules for stemming, but not to the same degree[8]. Derivative languages are highly systematic and highly supportive of stemming analysis. Most derivative languages share the property that complex forms are derived from basic ones. Arabic is one of the derivative languages that linguistically supports stemming. Arabic is a widely used language[9], and it exists in different formats. For example, Arabic words can be given as separate text or extracted from images[10]. Arabic defines an accurate set of rules known as morphological rules, or morphology. Morphology accurately describes the formation of an Arabic word from its basic form. The basic forms are commonly called roots.

However, stemming is faced with ambiguity, as are most NLP tasks. Various techniques have been used to resolve ambiguity. Among them is semantic analysis, which aims to capture the intended meaning of a word [11]. Semantic analysis is a very powerful tool for tackling the ambiguity problem, but it is very challenging to model. Distributional Semantics (DS)[11] is a type of semantic analysis based on co-occurrence analysis. It represents a word's meaning by the distribution of its context (surrounding words), as shown in Table 1. For example, the first row of Table 1 shows that the words عجائب (means "Wonders") and دول (means "Countries") appeared in the context of the word نظم (means "Systems") with frequencies 0 and 5, respectively. Different measures can be computed, such as Pointwise Mutual Information (PMI), Positive PMI (PPMI), and Smoothed PMI (SPMI). They measure the correlation between a word and its context[12,13].

This paper introduces a context based Arabic stemmer for extracting a word's root. The proposed stemmer (CBAS) explores all possible roots, then selects the appropriate root using Distributional Semantics (DS). The impact of utilizing DS is assessed through a series of comparisons with other stemmers using a manually annotated set of articles. The paper is organized as follows: section 2 is an introduction to Arabic morphology. Section 3 explores related work and techniques used for constructing stemmers. The proposed stemmer (CBAS) is described in section 4. In section 5, a detailed analysis and evaluation of the proposed stemmer (CBAS) is presented. Finally, a conclusion is presented in section 6.

2. BACKGROUND

Regardless of the approach used for developing morphological analyzers, a basic understanding of morphological rules is needed to set expectations, evaluate the results, and design improvements. This section introduces Arabic morphology and common challenges.

Morphology is the study of word formation. Arabic morphology is based on the derivation principle, whereby words are derived from roots. Roots usually have three, four, or five characters.

Roots are the seeds for Arabic word generation. A new word is obtained by modifying its root. For example, the word كاتب (kātb, means "Writer") is derived from the root ك ت ب (kāf tāʾ bāʾ, means "Wrote") by adding ا (ʾlf) in the middle. Not every addition is considered valid. Arabic introduces a set of templates to define valid combinations and additions. Templates are referred to as patterns. A pattern is an ordered sequence of letters. Since patterns work with all roots, a set of letters generically represents the root's letters and their order, while augmented letters are represented by themselves in their correct positions. For example, the pattern فاعل (fāʿl, means "Actor") was used to derive the previous word كاتب (kātb, means "Writer") by substituting ف (fāʾ) with ك (kāf), ع (ʿyn) with ت (tāʾ), and ل (lām) with ب (bāʾ), respectively. As noted, roots are commonly written as separated characters to indicate possible insertions[14].

Augmented letters are reflected in the pattern, and finally in the word itself. However, an augmented unit (one or more augmented letters) that is added at the front or at the end of a word is called a prefix or a suffix, respectively. In most cases, prefixes and suffixes are not part of a word's meaning; rather, they are additional features[14]. For example, الكاتب (ʾlkātb, means "The writer") is formed by adding ال (ʾl, means "The") in front of the word. This point of view substantially reduces the number of enumerated patterns.

As described above, the root-pattern system is simple, elegant, and straightforward. However, the system is faced with morphological challenges, namely vocalization, mutation, and the absence of diacritics (annotations above or below a word's letter that capture additional morphological and grammatical features). For example, a letter can change its form due to grammatical or phonological rules. Another challenge of the root-pattern system is stopwords, such as connection words, which do not obey derivational rules.
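The substitution of root letters into a pattern, and the reverse alignment of a word against a pattern, can be illustrated with a minimal sketch. The function names `apply_pattern` and `extract_root` are hypothetical, and the sketch assumes trilateral roots, unvocalized text, and patterns whose augmented letters never coincide with the placeholder letters ف, ع, ل. It ignores the vocalization and mutation cases mentioned above.

```python
# Generic placeholders for the first, second, and third root letters.
SLOTS = ("ف", "ع", "ل")


def apply_pattern(root, pattern):
    """Derive a word by substituting the root letters into the pattern slots."""
    mapping = dict(zip(SLOTS, root))
    # Augmented letters (anything that is not a slot) are copied unchanged.
    return "".join(mapping.get(ch, ch) for ch in pattern)


def extract_root(word, pattern):
    """Align a word with a pattern and collect the letters under the slots.

    Returns a tuple of root letters, or None when the word does not fit.
    """
    if len(word) != len(pattern):
        return None
    found = {}
    for w, p in zip(word, pattern):
        if p in SLOTS:
            found[p] = w          # letter sitting in a root position
        elif w != p:
            return None           # augmented letter does not match
    if len(found) != len(SLOTS):
        return None
    return tuple(found[s] for s in SLOTS)


if __name__ == "__main__":
    print(apply_pattern(("ك", "ت", "ب"), "فاعل"))
    print(extract_root("كاتب", "فاعل"))
```

Because the same word may align with several patterns once prefixes and suffixes are considered, a sketch like this can return different roots for different pattern choices, which is the ambiguity the paper addresses.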
The process of deriving a word back to its root (stemming) looks like a straightforward operation: simply align a word to its pattern and collect the letters corresponding to the ف (fāʾ), ع (ʿyn), and ل (lām) letters. However, due to the challenges described above, a word may be derived back to multiple roots.

3. RELATED WORK

Information Retrieval (IR) is one of the early tasks that utilized stemming analysis[7]. However, stemming analysis is not limited to IR. It has improved many tasks, such as Machine Translation (MT)[15], Sentiment Analysis [16], and many more. This section reviews common stemming analysis algorithms for different Natural Languages (NLs).

The nature of a language has a great impact on the development of related stemming algorithms. For example, the nature of the English language makes English stemmers concerned with removing a word's suffixes only, since removing prefixes may imply a different meaning, as with "sufficient" and "insufficient". Various stemmers were developed for English[8,17]. However, the underlying nature of the language limited their extension to other languages, among which are Arabic[18] and Urdu[19]. Moreover, stemming is not effective to the same degree for all languages[20,21].

Arabic is a morphologically rich language, which has enriched Natural Language Processing (NLP)[22-24]. Arabic roots have rich linguistic features; they are semantically representative, derivable, and finite in number. Stemmers were developed over the years to take advantage of such features. This section introduces common Arabic stemmers and their root extraction processes, which encapsulate semantic decisions. Finally, it introduces semantic analysis techniques independently of stemming analysis.

Khoja stemmer[18] is one of the earliest and most powerful approaches developed for Arabic stemming[7]. Khoja simulates the linguistic process as closely as possible. It removes prefixes and suffixes from a word, often after a normalization process, then matches the resulting word to a pattern, and finally extracts the root. The extracted root is validated against a list of correct Arabic roots to ensure linguistic correctness. Khoja stemmer[18] resolves ambiguity by defining a set of linguistic paths, or decisions, based on various features such as a word's first character and the lengths of prefixes or suffixes. Additionally, decisions are implicitly ordered, so the result is the first correct root. For example, Khoja stemmer handles roots with duplicate letters first. Root extraction is a highly complex process due to the existence of overlapping rules, which requires more information.

Another type of stemming analysis is light stemming. Light stemming is another way of acquiring a reduced representation of Arabic words. It is not as complex as root extraction: it only removes prefixes and suffixes from a word. For example, the word الكاتبون (ʾlkātbūn, means "The writers") would be stemmed to the word كاتب (kātb, means "Writer") instead of ك ت ب (kāf tāʾ bāʾ, means "Writing"). Light stemming is widely used for Information Retrieval (IR)[25]. Light stemming has shown competitive results in IR against root extraction based stemmers [6,26]. Light stemmers are relatively fast and efficient, and they preserve more specific features of the word; for example, كاتب (kātb, means "Writer") is more closely related to the word الكاتبون (ʾlkātbūn, means "The writers") than ك ت ب (kāf tāʾ bāʾ, means "Writing") is. However, the number of Arabic words without prefixes and suffixes is far greater than the number of roots listed in Arabic dictionaries[14], and there is no explicit evidence that lightly stemmed words are more efficient than roots[26].
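Light stemming as described above amounts to stripping a known prefix and a known suffix. The sketch below is only an illustration: the name `light_stem`, the two tiny affix lists, and the minimum-length guard are assumptions made for the example, not the lists or rules of any cited stemmer.

```python
# Illustrative affix lists only; real light stemmers use much larger lists.
PREFIXES = ("ال",)        # definite article
SUFFIXES = ("ون", "ات")   # two common plural endings


def light_stem(word, min_len=3):
    """Strip at most one prefix and one suffix, keeping at least min_len letters."""
    # Try longer affixes first so compound affixes win over their parts.
    for prefix in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[:-len(suffix)]
            break
    return word


if __name__ == "__main__":
    print(light_stem("الكاتبون"))
```

The minimum-length guard is one simple way to avoid stripping letters that belong to a short word itself; the result is a stem such as كاتب rather than a root.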
ISRI [27] is another linguistic based Arabic stemmer that uses roughly the same sequence of steps as Khoja's stemmer [18]. The main differences are that ISRI does not linguistically validate the extracted root (no dictionary is used) and that it organizes the defined pattern set into sub-groups, where each sub-group has common features. It also changes the normalization process and the handling of prefixes and suffixes. The main goal of ISRI is to get the minimum representation of a given word. Due to these changes and its prioritization, ISRI resolves ambiguity differently, by changing the order in which morphological rules are applied. However, ISRI still makes static decisions. Most likely, ISRI would be used for information retrieval rather than linguistic based tasks.

Tashaphyne [28] is another Arabic stemmer. It mainly supports light stemming (removing prefixes and suffixes). It follows the same approach used by the Khoja [18] and ISRI [27] stemmers. It can be used for root extraction as well.

Darwish[29] utilizes the existing word-root pairs used to construct Finite State Transducer (FST) like stemmers. It uses a learning based technique not only to infer Arabic patterns, but also to rank extracted roots. This methodology enumerates possible roots like FST [30] approaches, but additionally gives a preference to the extracted roots. It is another way to handle ambiguity, other than static rules. However, part of the inferred patterns would be inadequate due to cases such as vocalization and mutation. Moreover, the ranking of roots is based on the frequency of inferred patterns, neglecting the word's features. Later, this approach was modified to handle light stemming (prefix and suffix removal) [7].

Arabic grammar has a strong influence on morphological analysis[14]. ElixirFM [31] employs syntactic features to enhance morphological results. It takes advantage of the Prague Arabic Dependency Treebank (PADT)[32] to acquire syntactic features, and of other morphological features from the BuckWalter [33] stem dictionary. ElixirFM[31] defines a set of morphological rules to extract possible stems, while ranking is used to disambiguate the extracted roots or stems using the underlying data.

The MADAMIRA[15] morphological analyzer consists of two tools, MADA [34] and AMIRA [35]. MADAMIRA [15] takes advantage of a large annotated corpus by using machine learning techniques such as Support Vector Machines[35]. It combines several tools such as word segmentation, Part of Speech Tagging (POST), and light stemming. It differs from the previous approaches, since it does not explicitly define morphological rules.

Stemmers are part of many NLP tasks. For example, sentiment analysis systems employ classifiers such as Naive Bayes or Support Vector Machines (SVMs) to perform classification on sets of test samples, given a tagged training set and rich feature sets, for various tasks[16,36,37]; other examples include question answering systems [38,39], and many more. However, stemming is not limited to NLP. It has been used for Information Retrieval (IR), and most stemmers are evaluated indirectly using IR benchmarks [7]. Many IR experiments [6,26,27] showed that Arabic roots improved Arabic IR.

4. PROPOSED STEMMER (CBAS)

The proposed stemmer, Context-Based Arabic Stemmer (CBAS), utilizes the distributional similarity of a word's context to gain additional information about its semantics. This information assists with the selection of the correct root by excluding candidate roots that are semantically irrelevant within the context. This section introduces the main phases of the proposed stemmer (CBAS): context matrix construction, roots generation, and root selection, where each phase consists of a set of steps. The proposed algorithm is shown in Fig. 1 and Fig. 2.

Fig. 1: Context Matrix Construction Algorithm
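As a rough illustration of the context matrix construction and root selection phases named above, the following sketch counts co-occurrences in a sliding window and scores candidate roots with a smoothed PMI. Several points are assumptions rather than details taken from the paper: the window is expressed as a `radius` of neighbours on each side, "smoothed PMI" is read as add-k (Laplace) smoothing of the counts, and all function names and the `candidates` structure are hypothetical.

```python
import math
from collections import Counter, defaultdict


def build_context_matrix(tokens, radius=1):
    """Map each target word to a Counter of the words seen within `radius` positions."""
    matrix = defaultdict(Counter)
    for i, target in enumerate(tokens):
        lo = max(0, i - radius)
        hi = min(len(tokens), i + radius + 1)
        for j in range(lo, hi):
            if j != i:
                matrix[target][tokens[j]] += 1
    return matrix


def spmi(matrix, word, context, k=1.0):
    """Add-k smoothed PMI (assumed variant): log2 of P(w, c) / (P(w) * P(c)).

    Assumes a non-empty matrix and k > 0.
    """
    contexts = {c for row in matrix.values() for c in row}
    n_rows, n_cols = len(matrix), len(contexts)
    # Every cell of the matrix is incremented by k before normalizing.
    total = sum(sum(row.values()) for row in matrix.values()) + k * n_rows * n_cols
    row = matrix.get(word, {})
    joint = row.get(context, 0) + k
    word_total = sum(row.values()) + k * n_cols
    context_total = sum(r.get(context, 0) for r in matrix.values()) + k * n_rows
    return math.log2(joint * total / (word_total * context_total))


def select_root(matrix, previous_word, candidates, k=1.0):
    """Pick the candidate root whose forms correlate best with the previous word.

    `candidates` maps each candidate root to a list of words derived from it;
    the root itself is always scored together with its derived words.
    """
    def average_score(root):
        forms = [root, *candidates[root]]
        return sum(spmi(matrix, f, previous_word, k) for f in forms) / len(forms)

    return max(candidates, key=average_score)
```

Smoothing keeps unseen word-context pairs from producing an undefined logarithm and dampens the inflated scores that plain PMI gives to rare pairs; other smoothing schemes would serve the same purpose.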

Fig. 2: Stemmer's Algorithm

4.1. Data Resources

Predefined linguistic data is an essential part of the proposed stemmer (CBAS). This section introduces the data defined by CBAS. An Arabic word consists of three parts: prefix, infix, and suffix[14]. Prefixes and suffixes are sets of features that can be added to a word, such as the definite article ال (ʾl, means "The") or connected pronouns such as هم (hm, means "Them"). The prefixes and suffixes lists contain individual and compound letters that could appear at the front or the end of an Arabic word. There is also a list of Arabic patterns, which is used to extract possible Arabic roots. The final list is the roots dictionary, which has been extracted from the Khoja stemmer [18] and is used to validate the extracted root. The prefixes, suffixes, patterns, and dictionary lists are used in the roots generation phase.

The previous lists are commonly defined for Arabic stemmers. In addition, CBAS uses a raw dataset consisting of a set of articles extracted from Omani newspapers[40]. The dataset contains articles on various topics, for example culture and sport[40], which represent a wide range of current Arabic usage. The raw dataset plays a central role in selecting semantically correct roots; this differs from other stemmers, which do not commonly employ context in their algorithms.

4.2. Context Matrix Construction

The context matrix is a powerful and flexible tool for acquiring some semantic properties [11]. It defines a window of words, where the target word is at position i and the surrounding words are its context [11]. The window slides over the corpus, associating the target word with its context distribution as shown in Table 1. Various measures can be computed from the context matrix and employed in different tasks.

4.3. Root Generation

Root generation automates root extraction. However, unlike the manual process, it extracts all possible roots. It consists of three major sub-processes: word segmentation, pattern matching, and root validation. Word segmentation breaks a word in all possible ways into three parts (prefix, infix, and suffix) using the predefined prefixes and suffixes lists. Pattern matching matches the infix obtained from word segmentation with one or more patterns according to its length. For each matched pattern, the root's characters are collected and passed to dictionary validation. Pattern matching also handles weak letters, stopwords, and some other linguistic cases. Dictionary validation ensures that the extracted roots are linguistically correct by checking them against a list of correct Arabic roots. The dictionary is not sufficient to guarantee a single correct root; due to various linguistic cases, roots generation can extract one or more correct roots.

4.4. Roots Selection

This phase utilizes the context matrix to select an appropriate root from two or more candidate roots. Pointwise Mutual Information (PMI) [12,13] measures the correlation between two or more words. Since some words can produce two or more root candidates, the proposed algorithm (CBAS) uses a variation of PMI, Smoothed PMI (SPMI) [12], to handle sparse matrices. As shown in Table 2, SPMI achieved the highest accuracy, 81.5%, compared with 78.84% for PMI and 79.49% for PPMI. This is because SPMI overcomes the bias towards rare co-occurrence events, which is a side effect of PMI [12]. SPMI is utilized to measure the correlation between the generated roots and their previous context. To take advantage of the underlying matrix, a set of words is derived for each candidate root, in addition to the root itself. For each derived word, the average SPMI is computed with the previous word (as context). The root with the highest average correlation to its context is then selected.

5. RESULTS AND EVALUATION

IR is the common methodology for evaluating a new stemmer because of the lack of stemmed benchmarks[21]. This section introduces the validation dataset, evaluation measures, and finally the experimental results.

5.1. Validation Dataset

Direct evaluation is important to show the stemming accuracy and potential improvements. A manually annotated dataset was used to measure the stemmer's accuracy and compare it with other stemmers. The dataset is part of the International Corpus of Arabic (ICA) [41]. Various Arabic resources contributed to the ICA, such as newspapers, books, and magazines. It was constructed to provide an appropriate representation of the Arabic language in Modern Standard Arabic (MSA) [41]. The dataset consists of tokens associated with various features. There are 3629 unique word-root pairs, while other words have no associated roots because they are stopwords or non-Arabic words. The dataset contains 8941 words after stopword removal. This is shown in Fig. 3.

Fig. 3: Validation Dataset

5.2. Evaluation Criteria

Stemming is beneficial for many tasks, and every task uses roots in a different way. For example, IR uses roots as cluster representatives to group related words, while sentiment analysis is more concerned with the linguistic accuracy of a root. A set of metrics was used to measure the different usages and to compare CBAS with other stemmers.

Stemming accuracy is one of the basic measures of a stemmer's effectiveness. It is defined as the ratio between the number of correctly stemmed words and the number of words in the complete dataset. Collecting related words under the same group is important for tasks such as IR. There are two variations of grouping related words. In the first, words are grouped correctly under a semantically correct root; this is referred to as classification. In the second, related words are grouped together, not necessarily under a correct root; this is referred to as clustering. Standard metrics for classification and clustering are accuracy, precision, recall, and F1 measure, defined as follows [42, 43]:

$$\text{accuracy} = \frac{1}{n} \sum_{i=1}^{n} \frac{|X_i \cap Y_i|}{|X_i \cup Y_i|}$$

$$\text{precision} = \frac{1}{n} \sum_{i=1}^{n} \frac{|X_i \cap Y_i|}{|Y_i|}$$

$$\text{recall} = \frac{1}{n} \sum_{i=1}^{n} \frac{|X_i \cap Y_i|}{|X_i|}$$

$$F_1\ \text{measure} = \frac{1}{n} \sum_{i=1}^{n} \frac{2\,|X_i \cap Y_i|}{|X_i| + |Y_i|}$$

where n is the number of extracted roots, X is the set of extracted roots, X_i is an individual extracted root, Y is the set of valid roots, and Y_i is an individual valid root.

5.3. Results

The complete set of 8941 words was used to test the proposed stemmer (CBAS) with a window size of 3; the set was then reduced to the unique word-root pairs for comparison with other stemmers. Table 2 shows the comparison between the proposed stemmer (CBAS) and other stemmers. It shows that the proposed stemmer (CBAS) achieved an accuracy of 81.5%, an improvement of 9.4, 67.3, and 51.2 percentage points over the Khoja, ISRI, and Tashaphyne stemmers, respectively. The accuracy enhancement is due to exploring various possible roots. Such exploration would not be possible without distributional semantics, which provides a dynamic and robust way of selecting an appropriate root.

Table 3 and Table 4 show the performance of the proposed stemmer (CBAS) when it is used as a grouping mechanism. Table 3 shows that the proposed stemmer (CBAS) has a higher potential to linguistically group Arabic words than other stemmers. CBAS outperformed other stemmers in the classification task, with an accuracy of 65.45%. Table 4 shows that the proposed stemmer (CBAS) has potential improvements in non-linguistic based tasks, achieving an accuracy of 73.83% in clustering. Comparing the linguistic (classification) and non-linguistic (clustering) grouping measures, there is an increase in all corresponding measures. This is because some clusters were correctly formed irrespective of the clusters' seeds. Classification and clustering measures show the superiority of CBAS over other stemmers on every measure except clustering recall, where Khoja is marginally higher (75.74% vs. 75.46%). This indicates the beneficial features of CBAS for the IR task.

Table 2. Stemmers Linguistic Accuracy

Stemmer      Linguistic Accuracy
Khoja        72.1%
ISRI         14.2%
Tashaphyne   30.3%
CBAS-PMI     78.84%
CBAS-PPMI    79.49%
CBAS         81.5%
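The grouping measures of Section 5.2 can be sketched in code under the assumption that they are set-overlap ratios averaged over n items, with X_i the extracted set and Y_i the valid set for item i. The function name `grouping_measures` is hypothetical, and the sketch follows the section's own convention of dividing by |Y_i| for precision and by |X_i| for recall. It assumes every set is non-empty.

```python
def grouping_measures(extracted, valid):
    """Average set-overlap measures over paired lists of sets (X_i, Y_i)."""
    if len(extracted) != len(valid) or not extracted:
        raise ValueError("need two non-empty lists of equal length")
    n = len(extracted)
    accuracy = precision = recall = f1 = 0.0
    for x, y in zip(extracted, valid):
        overlap = len(x & y)
        accuracy += overlap / len(x | y)
        precision += overlap / len(y)   # denominator as written in Section 5.2
        recall += overlap / len(x)      # denominator as written in Section 5.2
        f1 += 2 * overlap / (len(x) + len(y))
    return {
        "accuracy": accuracy / n,
        "precision": precision / n,
        "recall": recall / n,
        "f1": f1 / n,
    }


if __name__ == "__main__":
    scores = grouping_measures([{"a"}, {"b", "c"}], [{"a"}, {"b"}])
    print(scores)
```

When each item has exactly one extracted and one valid root, every per-item ratio is either 0 or 1, which is one situation in which accuracy and precision coincide.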

Table 3. Stemmers Classification Measures

Stemmer      Accuracy   Precision   Recall   F1 measure
Khoja        57.53%     57.53%      59.59%   58.55%
ISRI         10.43%     10.43%      10.49%   10.46%
Tashaphyne   25.07%     25.07%      25.15%   25.11%
CBAS         65.45%     65.45%      68.23%   66.51%

Table 4. Stemmers Clustering Measures

Stemmer      Accuracy   Precision   Recall   F1 measure
Khoja        71.71%     93.09%      75.74%   83.52%
ISRI         12.59%     69.40%      13.34%   22.27%
Tashaphyne   32.25%     72.54%      37.03%   49.03%
CBAS         73.83%     93.71%      75.46%   84.50%

6. CONCLUSION

Many stemmers were developed to exploit the rich linguistic features provided by roots. Most stemmers make explicit decisions, statistical or linguistic, to select only one root. Other stemmers use ranking to express their selection preference rather than selecting a single root. However, in the end, a single root is chosen. Static decisions are very appropriate for common and frequent cases. However, adding other features such as syntactic and manual annotations would also be valuable.

The introduced stemmer employs distributional similarity to handle incorrect root selection, which is a side effect of the root generation phase. The existence of robust filtering mechanisms, such as distributional analysis, allows exploring various roots. Distributional analysis has several advantages. It can be computed for any corpus and any language, and it is relatively fast and inexpensive to construct compared to a manually annotated corpus. Distributional semantics covers many relations between words, and it is robust against any preferences or missing information. It is also very adaptive to context changes, which makes it suitable for many topics. However, distributional analysis is not as accurate as manually annotated data; hence, the word generation process was added to the roots selection phase to tolerate possible errors.

The previous techniques were compared with the results of the proposed stemmer (CBAS). CBAS shows an accuracy of 81.5%, an improvement of 9.4, 67.3, and 51.2 percentage points over the Khoja, ISRI, and Tashaphyne stemmers, respectively. CBAS also shows an improvement in classification and clustering, with accuracies of 65.45% and 73.83%, respectively. The results indicate that the proposed stemmer (CBAS) enhances stemming and other related tasks.

CBAS represents a methodology for capturing a word's context and making decisions based on it. CBAS can change its behaviour based on the underlying data, which could be specialized in a sub-domain of the Arabic language. The statistical model used by CBAS is relatively simple. It incorporates important information about a word (its context), which would be complex to include in a rule based stemmer. The statistical model reduces the linguistic complexity of representing various linguistic cases. It also avoids unexpected rule interactions and the need for prioritization schemes for ordering the rules.

REFERENCES

[1] P. M. Nadkarni, L. Ohno-Machado, and W. W. Chapman, "Natural language processing: an introduction," Journal of the American Medical Informatics Association, vol. 18.
[2] J. Hutchins, "The first public demonstration of machine translation: the Georgetown-IBM system, 7th January 1954," noviembre de.
[3] N. Chomsky, "Three models for the description of language," Information Theory, IRE Transactions on, vol. 2.
[4] A. Aho, R. Sethi, and J. D. Ullman, Compilers: Principles, Techniques, and Tools.
[5] D. Klein and C. D. Manning, "Accurate unlexicalized parsing," in Proceedings of the 41st Annual Meeting on Association for Computational Linguistics-Volume 1, 2003.
[6] M. Aljlayl and O. Frieder, "On Arabic search: improving the retrieval effectiveness via a light stemming approach," in Proceedings of the eleventh international conference on Information and knowledge management, 2002.
[7] I. A. Al Sughaiyer and I. A. Al Kharashi, "Arabic morphological analysis techniques: A comprehensive survey," Journal of the American Society for Information Science and Technology, vol. 55.
[8] M. F. Porter, "Snowball: A language for stemming algorithms."
[9] J. Xu, A. Fraser, and R. Weischedel, "Empirical studies in strategies for Arabic retrieval," in Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval, 2002.
[10] R. Fathalla, Y. El Sonbaty, and M. A. Ismail, "Extraction of Arabic Words from Complex Color Images," in 9th IEEE International Conference on Document Analysis and Recognition (ICDAR 2007), Brazil.
[11] C. Akkaya, J. Wiebe, and R. Mihalcea, "Utilizing semantic composition in distributional semantic models for word sense discrimination and word sense disambiguation," in Semantic Computing (ICSC), 2012 IEEE Sixth International Conference on, 2012.
[12] D. Jurafsky. Word Senses and Word Relations.
[13] G. Bouma, "Normalized (pointwise) mutual information in collocation extraction," Proceedings of GSCL.
[14] K. C. Ryding, A reference grammar of modern standard Arabic: Cambridge university press.
[15] A. Pasha, M. Al-Badrashiny, M. Diab, A. El Kholy, R. Eskander, N. Habash, et al., "Madamira: A fast, comprehensive tool for morphological analysis and disambiguation of arabic," in Proceedings of the Language Resources and Evaluation Conference (LREC), Reykjavik, Iceland.
[16] S. M. Oraby, Y. El-Sonbaty, and M. A. El-Nasr, "Exploring the Effects of Word Roots for Arabic Sentiment Analysis," in International Joint Conference on Natural Language Processing, Nagoya, Japan, 2013.
[17] J. B. Lovins, Development of a stemming algorithm: MIT Information Processing Group, Electronic Systems Laboratory.
[18] S. Khoja and R. Garside, "Stemming arabic text," Lancaster, UK, Computing Department, Lancaster University.
[19] M. S. Husain, "An unsupervised approach to develop stemmer," International Journal on Natural Language Computing, vol. 1.
[20] D. Harman, "How effective is suffixing?," JASIS, vol. 42, pp. 7-15.
[21] I. Smirnov, "Overview of stemming algorithms," Mechanical Translation, vol. 52.
[22] Y. Benajiba, M. Diab, and P. Rosso, "Arabic named entity recognition using optimized feature sets," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2008.
[23] K. Darwish and D. W. Oard, "CLIR Experiments at Maryland for TREC-2002: Evidence combination for Arabic-English retrieval," DTIC Document, 2003.
[24] L. S. Larkey and M. E. Connell, "Arabic information retrieval at UMass in TREC-10," DTIC Document, 2006.
[25] L. S. Larkey, L. Ballesteros, and M. E. Connell, "Light stemming for Arabic information retrieval," in Arabic computational morphology, ed: Springer, 2007.
[26] L. S. Larkey, L. Ballesteros, and M. E. Connell, "Improving stemming for Arabic information retrieval: light stemming and co-occurrence analysis," in Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval, 2002.
[27] K. Taghva, R. Elkhoury, and J. Coombs, "Arabic stemming without a root dictionary," 2005.
[28] T. Zerrouki. (2010). Tashaphyne, Arabic light stemmer/segment.
[29] K. Darwish, "Building a shallow Arabic morphological analyzer in one day," in Proceedings of the ACL-02 workshop on Computational approaches to semitic languages, 2002.
[30] K. R. Beesley, "Arabic morphological analysis on the Internet," in Proceedings of the 6th International Conference and Exhibition on Multi-lingual Computing.
[31] O. Smrž, "Elixirfm: implementation of functional arabic morphology," in Proceedings of the 2007 Workshop on Computational Approaches to Semitic Languages: Common Issues and Resources, 2007.
[32] O. PetrZemánek, "Prague Arabic Dependency Treebank: A Word on the Million Words."
[33] T. Buckwalter, "Buckwalter Arabic Morphological Analyzer Version 1.0."
[34] N. Habash, O. Rambow, and R. Roth, "MADA+TOKAN: A toolkit for Arabic tokenization, diacritization, morphological disambiguation, POS tagging, stemming and lemmatization," in Proceedings of the 2nd International Conference on Arabic Language Resources and Tools (MEDAR), Cairo, Egypt, 2009.
[35] M. Diab, K. Hacioglu, and D. Jurafsky, "Automated methods for processing arabic text: From tokenization to base phrase chunking," Arabic Computational Morphology: Knowledge-based and Empirical Methods. Kluwer/Springer.
[36] S. N. Saleh and Y. El-Sonbaty, "A feature selection algorithm with redundancy reduction for text classification," in Computer and information sciences, iscis d international symposium on, 2007.
[37] S. Oraby, Y. El-Sonbaty, and M. A. El-Nasr, "Finding Opinion Strength Using Rule-Based Parsing for Arabic Sentiment Analysis," in Advances in Soft Computing and Its Applications, ed: Springer, 2013.
[38] A. M. Ezzeldin, M. H. Kholief, and Y. El-Sonbaty, "ALQASIM: Arabic language question answer selection in machines," in Information Access Evaluation. Multilinguality, Multimodality, and Visualization, ed: Springer, 2013.
[39] A. M. Ezzeldin, Y. El-Sonbaty, and M. H. Kholief, "Exploring the Effects of Root Expansion, Sentence Splitting and Ontology on Arabic Answer Selection," Natural Language Processing and Cognitive Science: Proceedings 2014, p. 273.
[40] M. Abbas, K. Smaïli, and D. Berkani, "Evaluation of Topic Identification Methods on Arabic Corpora," JDIM, vol. 9.
[41] S. Alansary, M. Nagi, and N. Adly, "Building an International Corpus of Arabic (ICA): progress of compilation stage," in 7th international conference on language engineering, Cairo, Egypt, 2007.
[42] S. Godbole and S. Sarawagi, "Discriminative methods for multi-labeled classification," in Advances in Knowledge Discovery and Data Mining, ed: Springer, 2004.
[43] M. Hillenmeyer. Machine Learning.

Natural language processing implementation on Romanian ChatBot

Natural language processing implementation on Romanian ChatBot Proceedigs of the 9th WSEAS Iteratioal Coferece o SIMULATION, MODELLING AND OPTIMIZATION Natural laguage processig implemetatio o Romaia ChatBot RALF FABIAN, MARCU ALEXANDRU-NICOLAE Departmet for Iformatics

More information

arxiv: v1 [cs.dl] 22 Dec 2016

arxiv: v1 [cs.dl] 22 Dec 2016 ScieceWISE: Topic Modelig over Scietific Literature Networks arxiv:1612.07636v1 [cs.dl] 22 Dec 2016 A. Magalich, V. Gemmetto, D. Garlaschelli, A. Boyarsky Uiversity of Leide, The Netherlads {magalich,

More information

E-LEARNING USABILITY: A LEARNER-ADAPTED APPROACH BASED ON THE EVALUATION OF LEANER S PREFERENCES. Valentina Terzieva, Yuri Pavlov, Rumen Andreev

E-LEARNING USABILITY: A LEARNER-ADAPTED APPROACH BASED ON THE EVALUATION OF LEANER S PREFERENCES. Valentina Terzieva, Yuri Pavlov, Rumen Andreev Titre du documet / Documet title E-learig usability : A learer-adapted approach based o the evaluatio of leaer's prefereces Auteur(s) / Author(s) TERZIEVA Valetia ; PAVLOV Yuri (1) ; ANDREEV Rume (2) ;

More information

Management Science Letters

Management Science Letters Maagemet Sciece Letters 4 (24) 2 26 Cotets lists available at GrowigSciece Maagemet Sciece Letters homepage: www.growigsciece.com/msl A applicatio of data evelopmet aalysis for measurig the relative efficiecy

More information

'Norwegian University of Science and Technology, Department of Computer and Information Science

'Norwegian University of Science and Technology, Department of Computer and Information Science The helpful Patiet Record System: Problem Orieted Ad Kowledge Based Elisabeth Bayega, MS' ad Samso Tu, MS2 'Norwegia Uiversity of Sciece ad Techology, Departmet of Computer ad Iformatio Sciece ad Departmet

More information

Fuzzy Reference Gain-Scheduling Approach as Intelligent Agents: FRGS Agent

Fuzzy Reference Gain-Scheduling Approach as Intelligent Agents: FRGS Agent Fuzzy Referece Gai-Schedulig Approach as Itelliget Agets: FRGS Aget J. E. ARAUJO * eresto@lit.ipe.br K. H. KIENITZ # kieitz@ita.br S. A. SANDRI sadra@lac.ipe.br J. D. S. da SILVA demisio@lac.ipe.br * Itegratio

More information

Consortium: North Carolina Community Colleges

Consortium: North Carolina Community Colleges Associatio of Research Libraries / Texas A&M Uiversity www.libqual.org Cotributors Collee Cook Texas A&M Uiversity Fred Heath Uiversity of Texas BruceThompso Texas A&M Uiversity Martha Kyrillidou Associatio

More information

CONSTITUENT VOICE TECHNICAL NOTE 1 INTRODUCING Version 1.1, September 2014

CONSTITUENT VOICE TECHNICAL NOTE 1 INTRODUCING  Version 1.1, September 2014 preview begis oct 2014 lauches ja 2015 INTRODUCING WWW.FEEDBACKCOMMONS.ORG A serviced cloud platform to share ad compare feedback data ad collaboratively develop feedback ad learig practice CONSTITUENT

More information

part2 Participatory Processes

part2 Participatory Processes part part2 Participatory Processes Participatory Learig Approaches Whose Learig? Participatory learig is based o the priciple of ope expressio where all sectios of the commuity ad exteral stakeholders

More information

Application for Admission

Application for Admission Applicatio for Admissio Admissio Office PO Box 2900 Illiois Wesleya Uiversity Bloomig, Illiois 61702-2900 Apply o-lie at: www.iwu.edu Applicatio Iformatio I am applyig: Early Actio Regular Decisio Early

More information

HANDBOOK. Career Center Handbook. Tools & Tips for Career Search Success CALIFORNIA STATE UNIVERSITY, SACR AMENTO

HANDBOOK. Career Center Handbook. Tools & Tips for Career Search Success CALIFORNIA STATE UNIVERSITY, SACR AMENTO HANDBOOK Career Ceter Hadbook CALIFORNIA STATE UNIVERSITY, SACR AMENTO Tools & Tips for Career Search Success Academic Advisig ad Career Ceter 6000 J Street Lasse Hall 1013 Sacrameto, CA 95819-6064 916-278-6231

More information

VISION, MISSION, VALUES, AND GOALS

VISION, MISSION, VALUES, AND GOALS 6 VISION, MISSION, VALUES, AND GOALS 2010-2015 VISION STATEMENT Ohloe College will be kow throughout Califoria for our iclusiveess, iovatio, ad superior rates of studet success. MISSION STATEMENT The Missio

More information

Cross Language Information Retrieval

Cross Language Information Retrieval Cross Language Information Retrieval RAFFAELLA BERNARDI UNIVERSITÀ DEGLI STUDI DI TRENTO P.ZZA VENEZIA, ROOM: 2.05, E-MAIL: BERNARDI@DISI.UNITN.IT Contents 1 Acknowledgment.............................................

More information

A Case Study: News Classification Based on Term Frequency

A Case Study: News Classification Based on Term Frequency A Case Study: News Classification Based on Term Frequency Petr Kroha Faculty of Computer Science University of Technology 09107 Chemnitz Germany kroha@informatik.tu-chemnitz.de Ricardo Baeza-Yates Center

More information

A Comparative Survey on Arabic Stemming: Approaches and Challenges

A Comparative Survey on Arabic Stemming: Approaches and Challenges Intelligent Information Management, 2017, 9, 39-67 http://www.scirp.org/journal/iim ISSN Online: 2160-5920 ISSN Print: 2160-5912 A Comparative Survey on Arabic Stemming: Approaches and Challenges Mohammad

More information

2014 Gold Award Winner SpecialParent

2014 Gold Award Winner SpecialParent Award Wier SpecialParet Dedicated to all families of childre with special eeds 6 th Editio/Fall/Witer 2014 Desig ad Editorial Awards Competitio MISSION Our goal is to provide parets of childre with special

More information

also inside Continuing Education Alumni Authors College Events

also inside Continuing Education Alumni Authors College Events SUMMER 2016 JAMESTOWN COMMUNITY COLLEGE ALUMNI MAGAZINE create a etrepreeur creatig a busiess a artist creatig beauty a citize creatig the future also iside Cotiuig Educatio Alumi Authors College Evets

More information

SINGLE DOCUMENT AUTOMATIC TEXT SUMMARIZATION USING TERM FREQUENCY-INVERSE DOCUMENT FREQUENCY (TF-IDF)

SINGLE DOCUMENT AUTOMATIC TEXT SUMMARIZATION USING TERM FREQUENCY-INVERSE DOCUMENT FREQUENCY (TF-IDF) SINGLE DOCUMENT AUTOMATIC TEXT SUMMARIZATION USING TERM FREQUENCY-INVERSE DOCUMENT FREQUENCY (TF-IDF) Hans Christian 1 ; Mikhael Pramodana Agus 2 ; Derwin Suhartono 3 1,2,3 Computer Science Department,

More information

Rule Learning With Negation: Issues Regarding Effectiveness

Rule Learning With Negation: Issues Regarding Effectiveness Rule Learning With Negation: Issues Regarding Effectiveness S. Chua, F. Coenen, G. Malcolm University of Liverpool Department of Computer Science, Ashton Building, Ashton Street, L69 3BX Liverpool, United

More information

South Carolina English Language Arts

South Carolina English Language Arts South Carolina English Language Arts A S O F J U N E 2 0, 2 0 1 0, T H I S S TAT E H A D A D O P T E D T H E CO M M O N CO R E S TAT E S TA N DA R D S. DOCUMENTS REVIEWED South Carolina Academic Content

More information

Specification and Evaluation of Machine Translation Toy Systems - Criteria for laboratory assignments

Specification and Evaluation of Machine Translation Toy Systems - Criteria for laboratory assignments Specification and Evaluation of Machine Translation Toy Systems - Criteria for laboratory assignments Cristina Vertan, Walther v. Hahn University of Hamburg, Natural Language Systems Division Hamburg,

More information

Language Independent Passage Retrieval for Question Answering

Language Independent Passage Retrieval for Question Answering Language Independent Passage Retrieval for Question Answering José Manuel Gómez-Soriano 1, Manuel Montes-y-Gómez 2, Emilio Sanchis-Arnal 1, Luis Villaseñor-Pineda 2, Paolo Rosso 1 1 Polytechnic University

More information

Linking Task: Identifying authors and book titles in verbose queries

Linking Task: Identifying authors and book titles in verbose queries Linking Task: Identifying authors and book titles in verbose queries Anaïs Ollagnier, Sébastien Fournier, and Patrice Bellot Aix-Marseille University, CNRS, ENSAM, University of Toulon, LSIS UMR 7296,

More information

ARNE - A tool for Namend Entity Recognition from Arabic Text

ARNE - A tool for Namend Entity Recognition from Arabic Text 24 ARNE - A tool for Namend Entity Recognition from Arabic Text Carolin Shihadeh DFKI Stuhlsatzenhausweg 3 66123 Saarbrücken, Germany carolin.shihadeh@dfki.de Günter Neumann DFKI Stuhlsatzenhausweg 3 66123

More information

CROSS-LANGUAGE INFORMATION RETRIEVAL USING PARAFAC2

CROSS-LANGUAGE INFORMATION RETRIEVAL USING PARAFAC2 1 CROSS-LANGUAGE INFORMATION RETRIEVAL USING PARAFAC2 Peter A. Chew, Brett W. Bader, Ahmed Abdelali Proceedings of the 13 th SIGKDD, 2007 Tiago Luís Outline 2 Cross-Language IR (CLIR) Latent Semantic Analysis

More information

MULTILINGUAL INFORMATION ACCESS IN DIGITAL LIBRARY

MULTILINGUAL INFORMATION ACCESS IN DIGITAL LIBRARY MULTILINGUAL INFORMATION ACCESS IN DIGITAL LIBRARY Chen, Hsin-Hsi Department of Computer Science and Information Engineering National Taiwan University Taipei, Taiwan E-mail: hh_chen@csie.ntu.edu.tw Abstract

More information

Twitter Sentiment Classification on Sanders Data using Hybrid Approach

Twitter Sentiment Classification on Sanders Data using Hybrid Approach IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 17, Issue 4, Ver. I (July Aug. 2015), PP 118-123 www.iosrjournals.org Twitter Sentiment Classification on Sanders

More information

HybridTechniqueforArabicTextCompression

HybridTechniqueforArabicTextCompression Global Journal of Computer Science and Technology: C Software & Data Engineering Volume 15 Issue 1 Version 1.0 Type: Double Blind Peer Reviewed International Research Journal Publisher: Global Journals

More information

Rule Learning with Negation: Issues Regarding Effectiveness

Rule Learning with Negation: Issues Regarding Effectiveness Rule Learning with Negation: Issues Regarding Effectiveness Stephanie Chua, Frans Coenen, and Grant Malcolm University of Liverpool Department of Computer Science, Ashton Building, Ashton Street, L69 3BX

More information

Word Segmentation of Off-line Handwritten Documents

Word Segmentation of Off-line Handwritten Documents Word Segmentation of Off-line Handwritten Documents Chen Huang and Sargur N. Srihari {chuang5, srihari}@cedar.buffalo.edu Center of Excellence for Document Analysis and Recognition (CEDAR), Department

More information

On March 15, 2016, Governor Rick Snyder. Continuing Medical Education Becomes Mandatory in Michigan. in this issue... 3 Great Lakes Veterinary

On March 15, 2016, Governor Rick Snyder. Continuing Medical Education Becomes Mandatory in Michigan. in this issue... 3 Great Lakes Veterinary michiga veteriary medical associatio i this issue... 3 Great Lakes Veteriary Coferece 4 What You Need to Kow Whe Issuig a Iterstate Certificate of Ispectio 6 Low Pathogeic Avia Iflueza H5 Virus Detectios

More information

Matching Similarity for Keyword-Based Clustering

Matching Similarity for Keyword-Based Clustering Matching Similarity for Keyword-Based Clustering Mohammad Rezaei and Pasi Fränti University of Eastern Finland {rezaei,franti}@cs.uef.fi Abstract. Semantic clustering of objects such as documents, web

More information

Cross-lingual Short-Text Document Classification for Facebook Comments

Cross-lingual Short-Text Document Classification for Facebook Comments 2014 International Conference on Future Internet of Things and Cloud Cross-lingual Short-Text Document Classification for Facebook Comments Mosab Faqeeh, Nawaf Abdulla, Mahmoud Al-Ayyoub, Yaser Jararweh

More information

Cross-Lingual Text Categorization

Cross-Lingual Text Categorization Cross-Lingual Text Categorization Nuria Bel 1, Cornelis H.A. Koster 2, and Marta Villegas 1 1 Grup d Investigació en Lingüística Computacional Universitat de Barcelona, 028 - Barcelona, Spain. {nuria,tona}@gilc.ub.es

More information

ScienceDirect. Malayalam question answering system

ScienceDirect. Malayalam question answering system Available online at www.sciencedirect.com ScienceDirect Procedia Technology 24 (2016 ) 1388 1392 International Conference on Emerging Trends in Engineering, Science and Technology (ICETEST - 2015) Malayalam

More information

CS Machine Learning

CS Machine Learning CS 478 - Machine Learning Projects Data Representation Basic testing and evaluation schemes CS 478 Data and Testing 1 Programming Issues l Program in any platform you want l Realize that you will be doing

More information

Disambiguation of Thai Personal Name from Online News Articles

Disambiguation of Thai Personal Name from Online News Articles Disambiguation of Thai Personal Name from Online News Articles Phaisarn Sutheebanjard Graduate School of Information Technology Siam University Bangkok, Thailand mr.phaisarn@gmail.com Abstract Since online

More information

Reducing Features to Improve Bug Prediction

Reducing Features to Improve Bug Prediction Reducing Features to Improve Bug Prediction Shivkumar Shivaji, E. James Whitehead, Jr., Ram Akella University of California Santa Cruz {shiv,ejw,ram}@soe.ucsc.edu Sunghun Kim Hong Kong University of Science

More information

THE ROLE OF DECISION TREES IN NATURAL LANGUAGE PROCESSING

THE ROLE OF DECISION TREES IN NATURAL LANGUAGE PROCESSING SISOM & ACOUSTICS 2015, Bucharest 21-22 May THE ROLE OF DECISION TREES IN NATURAL LANGUAGE PROCESSING MarilenaăLAZ R 1, Diana MILITARU 2 1 Military Equipment and Technologies Research Agency, Bucharest,

More information

arxiv: v1 [cs.cl] 2 Apr 2017

arxiv: v1 [cs.cl] 2 Apr 2017 Word-Alignment-Based Segment-Level Machine Translation Evaluation using Word Embeddings Junki Matsuo and Mamoru Komachi Graduate School of System Design, Tokyo Metropolitan University, Japan matsuo-junki@ed.tmu.ac.jp,

More information

A Comparison of Two Text Representations for Sentiment Analysis

A Comparison of Two Text Representations for Sentiment Analysis 010 International Conference on Computer Application and System Modeling (ICCASM 010) A Comparison of Two Text Representations for Sentiment Analysis Jianxiong Wang School of Computer Science & Educational

More information

Parsing of part-of-speech tagged Assamese Texts

Parsing of part-of-speech tagged Assamese Texts IJCSI International Journal of Computer Science Issues, Vol. 6, No. 1, 2009 ISSN (Online): 1694-0784 ISSN (Print): 1694-0814 28 Parsing of part-of-speech tagged Assamese Texts Mirzanur Rahman 1, Sufal

More information

Constructing Parallel Corpus from Movie Subtitles

Constructing Parallel Corpus from Movie Subtitles Constructing Parallel Corpus from Movie Subtitles Han Xiao 1 and Xiaojie Wang 2 1 School of Information Engineering, Beijing University of Post and Telecommunications artex.xh@gmail.com 2 CISTR, Beijing

More information

Comparing different approaches to treat Translation Ambiguity in CLIR: Structured Queries vs. Target Co occurrence Based Selection

Comparing different approaches to treat Translation Ambiguity in CLIR: Structured Queries vs. Target Co occurrence Based Selection 1 Comparing different approaches to treat Translation Ambiguity in CLIR: Structured Queries vs. Target Co occurrence Based Selection X. Saralegi, M. Lopez de Lacalle Elhuyar R&D Zelai Haundi kalea, 3.

More information

Assignment 1: Predicting Amazon Review Ratings

Assignment 1: Predicting Amazon Review Ratings Assignment 1: Predicting Amazon Review Ratings 1 Dataset Analysis Richard Park r2park@acsmail.ucsd.edu February 23, 2015 The dataset selected for this assignment comes from the set of Amazon reviews for

More information

Semi-supervised methods of text processing, and an application to medical concept extraction. Yacine Jernite Text-as-Data series September 17.

Semi-supervised methods of text processing, and an application to medical concept extraction. Yacine Jernite Text-as-Data series September 17. Semi-supervised methods of text processing, and an application to medical concept extraction Yacine Jernite Text-as-Data series September 17. 2015 What do we want from text? 1. Extract information 2. Link

More information

Probabilistic Latent Semantic Analysis

Probabilistic Latent Semantic Analysis Probabilistic Latent Semantic Analysis Thomas Hofmann Presentation by Ioannis Pavlopoulos & Andreas Damianou for the course of Data Mining & Exploration 1 Outline Latent Semantic Analysis o Need o Overview

More information

Combining Bidirectional Translation and Synonymy for Cross-Language Information Retrieval

Combining Bidirectional Translation and Synonymy for Cross-Language Information Retrieval Combining Bidirectional Translation and Synonymy for Cross-Language Information Retrieval Jianqiang Wang and Douglas W. Oard College of Information Studies and UMIACS University of Maryland, College Park,

More information

Australian Journal of Basic and Applied Sciences

Australian Journal of Basic and Applied Sciences AENSI Journals Australian Journal of Basic and Applied Sciences ISSN:1991-8178 Journal home page: www.ajbasweb.com Feature Selection Technique Using Principal Component Analysis For Improving Fuzzy C-Mean

More information

Learning Methods in Multilingual Speech Recognition

Learning Methods in Multilingual Speech Recognition Learning Methods in Multilingual Speech Recognition Hui Lin Department of Electrical Engineering University of Washington Seattle, WA 98125 linhui@u.washington.edu Li Deng, Jasha Droppo, Dong Yu, and Alex

More information

OPTIMIZATINON OF TRAINING SETS FOR HEBBIAN-LEARNING- BASED CLASSIFIERS

OPTIMIZATINON OF TRAINING SETS FOR HEBBIAN-LEARNING- BASED CLASSIFIERS OPTIMIZATINON OF TRAINING SETS FOR HEBBIAN-LEARNING- BASED CLASSIFIERS Václav Kocian, Eva Volná, Michal Janošek, Martin Kotyrba University of Ostrava Department of Informatics and Computers Dvořákova 7,

More information

Modeling function word errors in DNN-HMM based LVCSR systems

Modeling function word errors in DNN-HMM based LVCSR systems Modeling function word errors in DNN-HMM based LVCSR systems Melvin Jose Johnson Premkumar, Ankur Bapna and Sree Avinash Parchuri Department of Computer Science Department of Electrical Engineering Stanford

More information

Module 12. Machine Learning. Version 2 CSE IIT, Kharagpur

Module 12. Machine Learning. Version 2 CSE IIT, Kharagpur Module 12 Machine Learning 12.1 Instructional Objective The students should understand the concept of learning systems Students should learn about different aspects of a learning system Students should

More information

Python Machine Learning

Python Machine Learning Python Machine Learning Unlock deeper insights into machine learning with this vital guide to cuttingedge predictive analytics Sebastian Raschka [ PUBLISHING 1 open source I community experience distilled

More information

Noisy SMS Machine Translation in Low-Density Languages

Noisy SMS Machine Translation in Low-Density Languages Noisy SMS Machine Translation in Low-Density Languages Vladimir Eidelman, Kristy Hollingshead, and Philip Resnik UMIACS Laboratory for Computational Linguistics and Information Processing Department of

More information

Syntax Parsing 1. Grammars and parsing 2. Top-down and bottom-up parsing 3. Chart parsers 4. Bottom-up chart parsing 5. The Earley Algorithm

Syntax Parsing 1. Grammars and parsing 2. Top-down and bottom-up parsing 3. Chart parsers 4. Bottom-up chart parsing 5. The Earley Algorithm Syntax Parsing 1. Grammars and parsing 2. Top-down and bottom-up parsing 3. Chart parsers 4. Bottom-up chart parsing 5. The Earley Algorithm syntax: from the Greek syntaxis, meaning setting out together

More information

1. Introduction. 2. The OMBI database editor

1. Introduction. 2. The OMBI database editor OMBI bilingual lexical resources: Arabic-Dutch / Dutch-Arabic Carole Tiberius, Anna Aalstein, Instituut voor Nederlandse Lexicologie Jan Hoogland, Nederlands Instituut in Marokko (NIMAR) In this paper

More information

Iterative Cross-Training: An Algorithm for Learning from Unlabeled Web Pages

Iterative Cross-Training: An Algorithm for Learning from Unlabeled Web Pages Iterative Cross-Training: An Algorithm for Learning from Unlabeled Web Pages Nuanwan Soonthornphisaj 1 and Boonserm Kijsirikul 2 Machine Intelligence and Knowledge Discovery Laboratory Department of Computer

More information

Predicting Student Attrition in MOOCs using Sentiment Analysis and Neural Networks

Predicting Student Attrition in MOOCs using Sentiment Analysis and Neural Networks Predicting Student Attrition in MOOCs using Sentiment Analysis and Neural Networks Devendra Singh Chaplot, Eunhee Rhim, and Jihie Kim Samsung Electronics Co., Ltd. Seoul, South Korea {dev.chaplot,eunhee.rhim,jihie.kim}@samsung.com

More information

What the National Curriculum requires in reading at Y5 and Y6

What the National Curriculum requires in reading at Y5 and Y6 What the National Curriculum requires in reading at Y5 and Y6 Word reading apply their growing knowledge of root words, prefixes and suffixes (morphology and etymology), as listed in Appendix 1 of the

More information

Dictionary-based techniques for cross-language information retrieval q

Dictionary-based techniques for cross-language information retrieval q Information Processing and Management 41 (2005) 523 547 www.elsevier.com/locate/infoproman Dictionary-based techniques for cross-language information retrieval q Gina-Anne Levow a, *, Douglas W. Oard b,

More information

Grade 4. Common Core Adoption Process. (Unpacked Standards)

Grade 4. Common Core Adoption Process. (Unpacked Standards) Grade 4 Common Core Adoption Process (Unpacked Standards) Grade 4 Reading: Literature RL.4.1 Refer to details and examples in a text when explaining what the text says explicitly and when drawing inferences

More information

On document relevance and lexical cohesion between query terms

On document relevance and lexical cohesion between query terms Information Processing and Management 42 (2006) 1230 1247 www.elsevier.com/locate/infoproman On document relevance and lexical cohesion between query terms Olga Vechtomova a, *, Murat Karamuftuoglu b,

More information

Prediction of Maximal Projection for Semantic Role Labeling

Prediction of Maximal Projection for Semantic Role Labeling Prediction of Maximal Projection for Semantic Role Labeling Weiwei Sun, Zhifang Sui Institute of Computational Linguistics Peking University Beijing, 100871, China {ws, szf}@pku.edu.cn Haifeng Wang Toshiba

More information

OCR for Arabic using SIFT Descriptors With Online Failure Prediction

OCR for Arabic using SIFT Descriptors With Online Failure Prediction OCR for Arabic using SIFT Descriptors With Online Failure Prediction Andrey Stolyarenko, Nachum Dershowitz The Blavatnik School of Computer Science Tel Aviv University Tel Aviv, Israel Email: stloyare@tau.ac.il,

More information

Bridging Lexical Gaps between Queries and Questions on Large Online Q&A Collections with Compact Translation Models

Bridging Lexical Gaps between Queries and Questions on Large Online Q&A Collections with Compact Translation Models Bridging Lexical Gaps between Queries and Questions on Large Online Q&A Collections with Compact Translation Models Jung-Tae Lee and Sang-Bum Kim and Young-In Song and Hae-Chang Rim Dept. of Computer &

More information

CEFR Overall Illustrative English Proficiency Scales

CEFR Overall Illustrative English Proficiency Scales CEFR Overall Illustrative English Proficiency s CEFR CEFR OVERALL ORAL PRODUCTION Has a good command of idiomatic expressions and colloquialisms with awareness of connotative levels of meaning. Can convey

More information

Evolution of Symbolisation in Chimpanzees and Neural Nets

Evolution of Symbolisation in Chimpanzees and Neural Nets Evolution of Symbolisation in Chimpanzees and Neural Nets Angelo Cangelosi Centre for Neural and Adaptive Systems University of Plymouth (UK) a.cangelosi@plymouth.ac.uk Introduction Animal communication

More information

Ensemble Technique Utilization for Indonesian Dependency Parser

Ensemble Technique Utilization for Indonesian Dependency Parser Ensemble Technique Utilization for Indonesian Dependency Parser Arief Rahman Institut Teknologi Bandung Indonesia 23516008@std.stei.itb.ac.id Ayu Purwarianti Institut Teknologi Bandung Indonesia ayu@stei.itb.ac.id

More information

Speech Recognition at ICSI: Broadcast News and beyond

Speech Recognition at ICSI: Broadcast News and beyond Speech Recognition at ICSI: Broadcast News and beyond Dan Ellis International Computer Science Institute, Berkeley CA Outline 1 2 3 The DARPA Broadcast News task Aspects of ICSI

More information

have to be modeled) or isolated words. Output of the system is a grapheme-tophoneme conversion system which takes as its input the spelling of words,

have to be modeled) or isolated words. Output of the system is a grapheme-tophoneme conversion system which takes as its input the spelling of words, A Language-Independent, Data-Oriented Architecture for Grapheme-to-Phoneme Conversion Walter Daelemans and Antal van den Bosch Proceedings ESCA-IEEE speech synthesis conference, New York, September 1994

More information

Language Acquisition Fall 2010/Winter Lexical Categories. Afra Alishahi, Heiner Drenhaus

Language Acquisition Fall 2010/Winter Lexical Categories. Afra Alishahi, Heiner Drenhaus Language Acquisition Fall 2010/Winter 2011 Lexical Categories Afra Alishahi, Heiner Drenhaus Computational Linguistics and Phonetics Saarland University Children s Sensitivity to Lexical Categories Look,

More information

Arabic Orthography vs. Arabic OCR

Arabic Orthography vs. Arabic OCR Arabic Orthography vs. Arabic OCR Rich Heritage Challenging A Much Needed Technology Mohamed Attia Having consistently been spoken since more than 2000 years and on, Arabic is doubtlessly the oldest among

More information

A Coding System for Dynamic Topic Analysis: A Computer-Mediated Discourse Analysis Technique

A Coding System for Dynamic Topic Analysis: A Computer-Mediated Discourse Analysis Technique A Coding System for Dynamic Topic Analysis: A Computer-Mediated Discourse Analysis Technique Hiromi Ishizaki 1, Susan C. Herring 2, Yasuhiro Takishima 1 1 KDDI R&D Laboratories, Inc. 2 Indiana University

More information

Switchboard Language Model Improvement with Conversational Data from Gigaword

Switchboard Language Model Improvement with Conversational Data from Gigaword Katholieke Universiteit Leuven Faculty of Engineering Master in Artificial Intelligence (MAI) Speech and Language Technology (SLT) Switchboard Language Model Improvement with Conversational Data from Gigaword

More information

Problems of the Arabic OCR: New Attitudes

Problems of the Arabic OCR: New Attitudes Problems of the Arabic OCR: New Attitudes Prof. O.Redkin, Dr. O.Bernikova Department of Asian and African Studies, St. Petersburg State University, St Petersburg, Russia Abstract - This paper reviews existing

More information

ScienceDirect. A Framework for Clustering Cardiac Patient s Records Using Unsupervised Learning Techniques

ScienceDirect. A Framework for Clustering Cardiac Patient s Records Using Unsupervised Learning Techniques Available online at www.sciencedirect.com ScienceDirect Procedia Computer Science 98 (2016 ) 368 373 The 6th International Conference on Current and Future Trends of Information and Communication Technologies

More information

Detecting English-French Cognates Using Orthographic Edit Distance

Detecting English-French Cognates Using Orthographic Edit Distance Detecting English-French Cognates Using Orthographic Edit Distance Qiongkai Xu 1,2, Albert Chen 1, Chang i 1 1 The Australian National University, College of Engineering and Computer Science 2 National

More information

Criterion Met? Primary Supporting Y N Reading Street Comprehensive. Publisher Citations

Criterion Met? Primary Supporting Y N Reading Street Comprehensive. Publisher Citations Program 2: / Arts English Development Basic Program, K-8 Grade Level(s): K 3 SECTIO 1: PROGRAM DESCRIPTIO All instructional material submissions must meet the requirements of this program description section,

More information

Abstractions and the Brain

Abstractions and the Brain Abstractions and the Brain Brian D. Josephson Department of Physics, University of Cambridge Cavendish Lab. Madingley Road Cambridge, UK. CB3 OHE bdj10@cam.ac.uk http://www.tcm.phy.cam.ac.uk/~bdj10 ABSTRACT

More information

Derivational and Inflectional Morphemes in Pak-Pak Language

Derivational and Inflectional Morphemes in Pak-Pak Language Derivational and Inflectional Morphemes in Pak-Pak Language Agustina Situmorang and Tima Mariany Arifin ABSTRACT The objectives of this study are to find out the derivational and inflectional morphemes

More information

Using dialogue context to improve parsing performance in dialogue systems

Using dialogue context to improve parsing performance in dialogue systems Using dialogue context to improve parsing performance in dialogue systems Ivan Meza-Ruiz and Oliver Lemon School of Informatics, Edinburgh University 2 Buccleuch Place, Edinburgh I.V.Meza-Ruiz@sms.ed.ac.uk,

More information

Lip reading: Japanese vowel recognition by tracking temporal changes of lip shape

Lip reading: Japanese vowel recognition by tracking temporal changes of lip shape Lip reading: Japanese vowel recognition by tracking temporal changes of lip shape Koshi Odagiri 1, and Yoichi Muraoka 1 1 Graduate School of Fundamental/Computer Science and Engineering, Waseda University,

More information

As a high-quality international conference in the field

As a high-quality international conference in the field The New Automated IEEE INFOCOM Review Assignment System Baochun Li and Y. Thomas Hou Abstract In academic conferences, the structure of the review process has always been considered a critical aspect of

More information

Enhancing Unlexicalized Parsing Performance using a Wide Coverage Lexicon, Fuzzy Tag-set Mapping, and EM-HMM-based Lexical Probabilities

Enhancing Unlexicalized Parsing Performance using a Wide Coverage Lexicon, Fuzzy Tag-set Mapping, and EM-HMM-based Lexical Probabilities Enhancing Unlexicalized Parsing Performance using a Wide Coverage Lexicon, Fuzzy Tag-set Mapping, and EM-HMM-based Lexical Probabilities Yoav Goldberg Reut Tsarfaty Meni Adler Michael Elhadad Ben Gurion

More information

The 9 th International Scientific Conference elearning and software for Education Bucharest, April 25-26, / X

The 9 th International Scientific Conference elearning and software for Education Bucharest, April 25-26, / X The 9 th International Scientific Conference elearning and software for Education Bucharest, April 25-26, 2013 10.12753/2066-026X-13-154 DATA MINING SOLUTIONS FOR DETERMINING STUDENT'S PROFILE Adela BÂRA,

More information

UMass at TDT Similarity functions 1. BASIC SYSTEM Detection algorithms. set globally and apply to all clusters.

UMass at TDT Similarity functions 1. BASIC SYSTEM Detection algorithms. set globally and apply to all clusters. UMass at TDT James Allan, Victor Lavrenko, David Frey, and Vikas Khandelwal Center for Intelligent Information Retrieval Department of Computer Science University of Massachusetts Amherst, MA 3 We spent

More information

Product Feature-based Ratings foropinionsummarization of E-Commerce Feedback Comments

Product Feature-based Ratings foropinionsummarization of E-Commerce Feedback Comments Product Feature-based Ratings foropinionsummarization of E-Commerce Feedback Comments Vijayshri Ramkrishna Ingale PG Student, Department of Computer Engineering JSPM s Imperial College of Engineering &

More information
