CBAS: CONTEXT BASED ARABIC STEMMER
Mahmoud El-Defrawy, Yasser El-Sonbaty and Nahla A. Belal
College of Computing and Information Technology, AAST, Alexandria, Egypt

ABSTRACT

Arabic morphology encapsulates many valuable features, such as a word's root. Arabic roots are utilized for many tasks; the process of extracting a word's root is referred to as stemming. Stemming is an essential part of most Natural Language Processing tasks, especially for derivative languages such as Arabic. However, stemming is faced with the problem of ambiguity, where two or more roots could be extracted from the same word. On the other hand, distributional semantics is a powerful co-occurrence model. It captures the meaning of a word based on its context. In this paper, a distributional semantics model utilizing Smoothed Pointwise Mutual Information (SPMI) is constructed to investigate its effectiveness on the stemming analysis task. It showed an accuracy of 81.5%, an improvement of at least 9.4 percentage points over other stemmers.

KEYWORDS

Natural Language Processing, Computational Linguistics, Text Analysis, Stemming

1. INTRODUCTION

Natural Languages (NLs) are the communication channels between humans. They allow conveying information, exchanging knowledge, and sharing ideas. For many years scientists studied Natural Languages and developed theories and rules that govern their use, such as Grammar and Morphology. Natural Language Processing (NLP) is the intersection between linguistics and Computational Science (CS) [1]. NLP allows utilizing linguistics to use Natural Languages as a way of communication with computational devices [1]. The association between linguistics and computational sciences has evolved over time. Machine Translation (MT) was one of the first NLP tasks in the 1950s; it began as translation from Russian to English [2]. The progress of MT was limited by the complexity of linguistic rules and the low computation power at the time [1]. However, Chomsky's theory [3] of natural language grammar formed the basis for the formation of Backus-Naur Form (BNF).
BNF [4] notation is commonly used to represent Context Free Grammars (CFGs). CFGs are used to systematically describe and validate artificial languages, such as programming languages. Using CFGs to describe some aspects of Natural Languages requires a non-trivial set of rules, which results in some ambiguity due to unexpected rule interactions. The introduction of statistical methods gave some insights for reducing NLP ambiguity [1]. For example, Probabilistic CFGs extend traditional CFGs by deducing linguistic rules and assigning weights [5]. Rules and weights are statistically deduced from a large annotated corpus. The noticeable improvement in MT sparked further research in NLP [1].

DOI : /ijlc
Table 1. Context Matrix Sample. Word labels: القاهرة (Cairo), جامعة (University), دول (Countries), عجائب (Wonders), نظم (Systems), سائح (Tourist), حكم (Judgement); co-occurrence counts not recoverable.

NLP tends to work with large sets of data from various languages. This raises the need for a concise representation of the data that preserves as many of its features as possible. A concise representation is required for almost any NLP task (at the word, sentence, or document level). Stemming is a primary NLP task, and it contributes to many other NLP tasks [6]. Stemming reduces a word to its basic form [7] while preserving its main characteristics. Many languages define linguistic rules for stemming, but not to the same degree [8]. Derivative languages are highly systematic and highly supportive of stemming analysis. Most derivative languages share the property that complex forms are derived from basic ones. Arabic is one of the derivative languages that linguistically support stemming. Arabic is a widely used language [9], and it exists in different formats. For example, Arabic words can be given as separate text or extracted from images [10]. The Arabic language defines an accurate set of rules known as morphological rules, or morphology. Morphology accurately describes the formation of an Arabic word from its basic form. The basic forms are commonly called roots.

However, stemming is faced with ambiguity, as are most NLP tasks. Various techniques have been used to resolve ambiguity, among them semantic analysis, which aims to capture the intended meaning of a word [11]. Semantic analysis is a very powerful tool for tackling the ambiguity problem, but it is very challenging to model. Distributional Semantics (DS) [11] is a type of semantic analysis based on co-occurrence analysis. It represents a word's meaning by the distribution of its context (surrounding words), as shown in Table 1.
For example, the first row of Table 1 shows that the words عجائب (means "Wonders") and دول (means "Countries") appeared in the context of the word نظم (means "Systems") with frequencies 0 and 5, respectively. Different measures can be computed from these counts, such as Pointwise Mutual Information (PMI), Positive PMI (PPMI), and Smoothed PMI (SPMI). They measure the correlation between a word and its context [12,13].

This paper introduces a context based Arabic stemmer for extracting a word's root. The proposed stemmer (CBAS) explores all possible roots, then selects the appropriate root using Distributional Semantics (DS). The impact of utilizing DS is assessed through a series of comparisons with other stemmers on a manually annotated set of articles.

The paper is organized as follows: Section 2 is an introduction to Arabic morphology. Section 3 explores related work and techniques used for constructing stemmers. The proposed stemmer (CBAS) is described in Section 4. Section 5 presents a detailed analysis and evaluation of the proposed stemmer (CBAS). Finally, a conclusion is presented in Section 6.

2. BACKGROUND

Regardless of the approach used for developing morphological analyzers, a basic understanding of morphological rules is needed to set expectations, evaluate the results, and design improvements. This section introduces Arabic morphology and its common challenges. Morphology is the study of word formation. Arabic morphology is based on the derivation principle, whereby words are derived from roots. Roots usually have three, four, or five characters.
Roots are the seeds for generating Arabic words. A new word is obtained by modifying its root. For example, the word كاتب (kātb, means "Writer") is derived from the root ك ت ب (kāf tāʾ bāʾ, means "Wrote") by adding ا (ʾlf) in the middle. Not every addition is considered valid. The Arabic language introduces a set of templates to define valid combinations and additions. Templates are referred to as patterns. A pattern is an ordered sequence of letters. Since patterns work with all roots, a set of letters generically represents the root's letters and their order, while augmented letters are represented by themselves in their correct positions. For example, the pattern فاعل (fāʿl, means "Actor") was used to derive the previous word كاتب (kātb, means "Writer") by substituting ف (fāʾ) with ك (kāf), ع (ʿyn) with ت (tāʾ), and ل (lām) with ب (bāʾ), respectively. As noted, roots are commonly written as separated characters to indicate possible insertions [14]. Augmented letters are reflected in the pattern, and finally in the word itself. However, an augmented unit (one or more augmented letters) that is added at the front or at the end of a word is called a prefix or a suffix, respectively. In most cases, prefixes and suffixes are not part of a word's meaning; rather, they are additional features [14]. For example, الكاتب (ʾlkātb, means "The writer") is formed by adding ال (ʾl, means "The") in front of the word. Treating prefixes and suffixes separately substantially reduces the number of enumerated patterns.

As described above, the root-pattern system is simple, elegant, and straightforward. However, the system is faced with morphological challenges, namely vocalization, mutation, and the absence of diacritics (annotations above or below a word's letter that capture additional morphological and grammatical features). For example, a letter can change its form due to grammatical or phonological rules. Another challenge of the root-pattern system is stopwords, such as connection words that do not obey derivational rules.
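The substitution described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: it handles three-letter roots, ignores vocalization, mutation and diacritics, and the function names `apply_pattern` and `extract_root` are hypothetical rather than taken from any stemmer discussed here. The second function performs the reverse alignment, collecting the letters that sit under ف, ع and ل.

```python
# Illustrative sketch only. The convention that ف, ع, ل stand for the first,
# second and third root letters comes from the text; the function names are
# hypothetical.

PLACEHOLDERS = "فعل"  # generic root letters used inside patterns


def apply_pattern(root: str, pattern: str) -> str:
    """Build a word by substituting the root letters into the pattern."""
    if len(root) != len(PLACEHOLDERS):
        raise ValueError("this sketch handles three-letter roots only")
    mapping = dict(zip(PLACEHOLDERS, root))
    # Augmented letters (anything that is not a placeholder) are kept as-is.
    return "".join(mapping.get(ch, ch) for ch in pattern)


def extract_root(word: str, pattern: str):
    """Align a word to a pattern and collect the letters under ف, ع, ل.

    Returns None when the word does not fit the pattern.
    """
    if len(word) != len(pattern):
        return None
    slots = {}
    for w_ch, p_ch in zip(word, pattern):
        if p_ch in PLACEHOLDERS:
            slots[p_ch] = w_ch
        elif w_ch != p_ch:  # augmented letters must match exactly
            return None
    if len(slots) != len(PLACEHOLDERS):
        return None
    return "".join(slots[p] for p in PLACEHOLDERS)


word = apply_pattern("كتب", "فاعل")
print(word)                        # كاتب
print(extract_root(word, "فاعل"))  # كتب
```

A real pattern list would contain many patterns of the same length, which is one reason a single word can align to more than one candidate root.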
The process of deriving a word back to its root (stemming) looks like a straightforward operation: simply align the word to its pattern and collect the letters corresponding to the ف (fāʾ), ع (ʿyn), and ل (lām) letters. However, due to the challenges described above, a word may be derived back to multiple roots.

3. RELATED WORK

Information Retrieval (IR) is one of the early tasks that utilized stemming analysis [7]. However, stemming analysis is not limited to IR; it has improved many tasks, such as Machine Translation (MT) [15] and Sentiment Analysis [16]. This section reviews common stemming analysis algorithms for different Natural Languages (NLs).

The nature of a language has a great impact on the development of related stemming algorithms. For example, the nature of the English language makes English stemmers concerned with removing a word's suffixes only, since removing prefixes may imply a different meaning, as in "sufficient" and "insufficient". Various stemmers were developed for English [8,17]. However, the underlying nature of the language limited their extension to other languages, among which are Arabic [18] and Urdu [19]. Moreover, stemming is not effective to the same degree for all languages [20,21].

Arabic is a morphologically rich language, which has enriched Natural Language Processing (NLP) [22-24]. Arabic roots have rich linguistic features; they are semantically representative, derivable, and finite in number. Stemmers were developed over the years to take advantage of such features. This section introduces common Arabic stemmers and their root extraction process, which
encapsulates semantic decisions. Finally, it introduces semantic analysis techniques independently from stemming analysis.

Khoja stemmer [18] is one of the early and most powerful approaches developed for Arabic stemming [7]. Khoja simulates the linguistic process as much as possible. It removes prefixes and suffixes from a word, usually after a normalization process, then matches the resulting word to a pattern, and finally extracts the root. The extracted root is validated against a list of correct Arabic roots to ensure linguistic correctness. Khoja stemmer [18] resolves ambiguity by defining a set of linguistic paths, or decisions, based on various features such as the word's first character or the length of its prefixes or suffixes. Additionally, decisions are implicitly ordered, so the result is the first correct root. For example, Khoja stemmer handles roots with duplicate letters first. Root extraction is a highly complex process due to the existence of overlapping rules, which requires more information.

Another type of stemming analysis is light stemming. Light stemming is another way of acquiring a reduced representation of Arabic words. It is not as complex as root extraction: it only removes prefixes and suffixes from a word. For example, the word الكاتبون (ʾlkātbūn, means "The writers") would be stemmed to the word كاتب (kātb, means "Writer") instead of ك ت ب (kāf tāʾ bāʾ, means "Writing"). Light stemming is widely used for Information Retrieval (IR) [25]. Light stemming has shown competitive results in IR against root extraction based stemmers [6,26]. Light stemmers are relatively fast and efficient, and they preserve more specific features of the word; for example, كاتب (kātb, means "Writer") is more related to the word الكاتبون (ʾlkātbūn, means "The writers") than ك ت ب (kāf tāʾ bāʾ, means "Writing"). However, the number of Arabic words without prefixes and suffixes is far greater than the number of roots listed in Arabic dictionaries [14], and there is no explicit evidence that lightly stemmed words are more efficient than roots [26].
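To make the contrast with root extraction concrete, light stemming can be sketched as plain affix stripping. The affix lists below are hypothetical and deliberately tiny, and the rule of stripping at most one prefix and one suffix while keeping at least three letters is an assumption made for this illustration, not the behaviour of any particular light stemmer cited above.

```python
# Hypothetical, deliberately tiny affix lists; real light stemmers use
# longer lists and extra conditions.
PREFIXES = ["ال", "و", "ب"]
SUFFIXES = ["ون", "ين", "ات", "ة"]


def light_stem(word: str, min_len: int = 3) -> str:
    """Strip at most one prefix and one suffix, trying longer affixes first."""
    for prefix in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[:-len(suffix)]
            break
    return word


print(light_stem("الكاتبون"))  # كاتب
```

The output keeps the augmented ا of the pattern, which is exactly the information a root extractor would discard.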
ISRI [27] is another linguistic based Arabic stemmer that roughly uses the same sequence as Khoja's stemmer [18]. The main differences are that ISRI does not linguistically validate the extracted root (no dictionary is used) and that it organizes the defined pattern set into sub-groups, where each sub-group has common features. It also changes the normalization process and the handling of prefixes and suffixes. The main goal of ISRI is to get the minimum representation of a given word. Due to these changes and its prioritization, ISRI resolves ambiguity differently, by changing the order in which morphological rules are applied. However, ISRI still makes static decisions. Most likely, ISRI would be used for information retrieval rather than linguistic based tasks.

Tashaphyne [28] is another Arabic stemmer. It mainly supports light stemming (removing prefixes and suffixes). It follows the same approach used by the Khoja [18] and ISRI [27] stemmers. It can be used for root extraction as well.

Darwish [29] utilizes the existing word-root pairs used to construct Finite State Transducer (FST)-like stemmers. It uses a learning based technique not only to infer Arabic patterns, but also to rank extracted roots. This methodology enumerates possible roots like FST approaches [30], but additionally gives a preference to the extracted roots. It is another way to handle ambiguity other than static rules. However, part of the inferred patterns would be inadequate due to cases such as vocalization and mutation. In addition, the ranking of roots is based on the frequency of the inferred patterns, neglecting the word's features. Later, this approach was modified to handle light stemming (prefix and suffix removal) [7].

Arabic grammar has a high influence on morphological analysis [14]. ElixirFM [31] employs syntactic features to enhance morphological results. It takes advantage of the Prague Arabic Dependency Treebank (PADT) [32] to acquire syntactic features, and other morphological features
from the Buckwalter [33] stem dictionary. ElixirFM [31] defines a set of morphological rules to extract possible stems, while ranking is used to disambiguate the extracted roots or stems using the underlying data.

The MADAMIRA [15] morphological analyzer consists of two tools, MADA [34] and AMIRA [35]. MADAMIRA [15] takes advantage of a large annotated corpus by using machine learning techniques such as Support Vector Machines [35]. It combines several tools such as word segmentation, Part of Speech Tagging (POST), and light stemming. It differs from the previous approaches in that it does not explicitly define morphological rules.

Stemmers are part of many NLP tasks. For example, sentiment analysis systems employ classifiers such as Naive Bayes or Support Vector Machines (SVMs) to classify sets of test samples given a tagged training set and rich feature sets [16,36,37]; stemmers are also used in question answering systems [38,39], and many more. However, stemming is not limited to NLP. It has been used for Information Retrieval (IR), and most stemmers are evaluated indirectly using IR benchmarks [7]. Many IR experiments [6,26,27] showed that Arabic roots improved Arabic IR.

4. PROPOSED STEMMER (CBAS)

The proposed stemmer, Context-Based Arabic Stemmer (CBAS), utilizes the distributional similarity of a word's context to gain additional information about its semantics. This information assists with the selection of the correct root by excluding candidate roots that are semantically irrelevant within the context. This section introduces the main phases of the proposed stemmer (CBAS): context matrix construction, roots generation, and root selection, where each phase consists of a set of steps. The proposed algorithm is shown in Fig. 1 and Fig. 2.

Fig.1: Context Matrix Construction Algorithm
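As a rough illustration of the context matrix and root selection phases named above, the following Python sketch counts co-occurrences in a sliding window, scores word/context pairs with a smoothed PMI, and picks the candidate root whose derived words correlate best with the previous word. Several points are assumptions rather than details confirmed by the paper: a window of size 3 is read as the target plus one word on each side, the smoothing is plain add-k on the counts, the function names are hypothetical, and the transliterated toy corpus and candidate lists are invented for the example.

```python
import math
from collections import Counter, defaultdict


def build_context_matrix(tokens, window=3):
    """Slide a window over the corpus and count target/context co-occurrences.

    Assumption: a window of size 3 means the target plus one word on each side.
    """
    half = window // 2
    counts = defaultdict(Counter)
    for i, target in enumerate(tokens):
        for j in range(max(0, i - half), min(len(tokens), i + half + 1)):
            if j != i:
                counts[target][tokens[j]] += 1
    return counts


def make_spmi(counts, k=1.0):
    """Return a scorer computing PMI over add-k smoothed counts.

    Add-k smoothing is an assumed choice; other smoothing schemes exist.
    """
    vocab = set(counts) | {c for row in counts.values() for c in row}
    v = len(vocab)
    total = sum(sum(row.values()) for row in counts.values()) + k * v * v
    column = Counter()
    for row in counts.values():
        column.update(row)

    def score(word, context):
        row = counts.get(word, {})  # unseen words get an empty row
        joint = (row.get(context, 0) + k) / total
        p_word = (sum(row.values()) + k * v) / total
        p_context = (column[context] + k * v) / total
        return math.log2(joint / (p_word * p_context))

    return score


def select_root(candidates, previous_word, score):
    """Pick the root whose derived words have the highest average score."""
    def average(words):
        return sum(score(w, previous_word) for w in words) / len(words)
    return max(candidates, key=lambda root: average(candidates[root]))


# Invented toy corpus (transliterated) and hypothetical candidate lists.
tokens = "qara kitab qara maktub qara kitab shariba maa shariba halib".split()
counts = build_context_matrix(tokens, window=3)
score = make_spmi(counts, k=1.0)
candidates = {"ktb": ["ktb", "kitab", "maktub"], "kbt": ["kbt", "kabt"]}
print(select_root(candidates, "qara", score))  # ktb
```

With unsmoothed PMI the unseen pairs would have a zero joint count and an undefined logarithm, which is one practical reason to smooth a sparse matrix before scoring.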
Fig.2: Stemmer's Algorithm

4.1. Data Resources

Predefined linguistic data is an essential part of the proposed stemmer (CBAS). This section introduces the data defined by CBAS. An Arabic word consists of three parts: prefix, infix, and suffix [14]. Prefixes and suffixes are a set of features that can be added to a word, such as the definite article ال (ʾl, means "The") or connected pronouns such as هم (hm, means "Them"). The prefix and suffix lists contain individual and compound letters that could appear at the front or the end of an Arabic word. There is also a list of Arabic patterns, which is used to extract possible Arabic roots. The final list is the roots dictionary, which has been extracted from the Khoja stemmer [18] and is used to validate the extracted root. The prefix, suffix, pattern, and dictionary lists are used in the roots generation phase.

The previous lists are commonly defined for Arabic stemmers. In addition, CBAS uses a raw dataset consisting of a set of articles extracted from Omani newspapers [40]. The dataset contains articles on various topics, for example culture and sport [40], and so represents a wide range of current Arabic language usage. The raw dataset plays a central role in selecting semantically correct roots, which distinguishes CBAS from other stemmers, which do not commonly employ context in their algorithms.

4.2. Context Matrix Construction

The context matrix is a powerful and flexible tool for acquiring some semantic properties [11]. It defines a window of words, where the target word is at position i and the rest of the surrounding words are its context [11]. The window slides over the corpus, associating the target word with its context distribution, as shown in Table 1. Various measures can be computed from the context matrix and employed in different tasks.

4.3. Root Generation
Root generation automates root extraction. However, unlike the manual process, it extracts all possible roots. It consists of three major sub-processes: word segmentation, pattern matching, and root validation. Word segmentation breaks a word into all possible combinations of three parts (prefix, infix, and suffix) using the predefined prefix and suffix lists. Pattern matching matches the infix obtained in word segmentation with one or more patterns with respect to its length. For each matched pattern, the root's characters are collected and passed to dictionary validation. Pattern matching also handles weak letters, stopwords, and some other linguistic cases. Dictionary validation ensures that the extracted roots are linguistically correct by validating them against a list of correct Arabic roots. The dictionary is not sufficient to yield only one correct root; due to various linguistic cases, roots generation may extract one or more correct roots.

4.4. Roots Selection

This phase utilizes the context matrix to select an appropriate root from two or more candidate roots. Pointwise Mutual Information (PMI) [12,13] measures the correlation between two or more words. Since some words can produce two or more root candidates, a measure is needed to rank them. The proposed algorithm (CBAS) uses a variation of PMI, Smoothed PMI (SPMI) [12], to handle sparse matrices. As shown in Table 2, SPMI achieved the highest accuracy, 81.5%, compared with 78.84% for PMI and 79.49% for PPMI. This is because SPMI overcomes the bias towards rare co-occurrence events, which is a side effect of PMI [12]. SPMI is used to measure the correlation between the generated roots and their previous context. To take advantage of the underlying matrix, a set of words is derived for each candidate root, in addition to the root itself. For each candidate root, the average SPMI is computed over its derived words, each paired with the previous word (as context). The root with the highest average correlation to its context is then selected.

5. RESULTS AND EVALUATION

IR is the common methodology for evaluating a new stemmer because of the lack of stemmed benchmarks [21]. This section introduces the validation dataset, the evaluation measures, and finally the experimental results.

5.1. Validation Dataset

Direct evaluation is important to show the stemming accuracy and potential improvements. A manually annotated dataset was used to measure the stemmer's accuracy and to compare it with other stemmers. The dataset is part of the International Corpus of Arabic (ICA) [41]. Various Arabic resources, such as newspapers, books, and magazines, contributed to the ICA. It was constructed to provide an appropriate representation of the Arabic language in Modern Standard Arabic (MSA) [41]. The dataset consists of tokens associated with various features. There are 3629 unique word-root pairs, while other words do not have associated roots because they are stopwords or non-Arabic words. The dataset contains 8941 words after stopword removal. This is shown in Fig. 3.
Fig.3: Validation Dataset

5.2. Evaluation Criteria

Stemming is beneficial for many tasks, where every task uses the roots in a different way. For example, IR uses roots as cluster representatives to group related words, while sentiment analysis is more concerned with the linguistic accuracy of a root. A set of metrics was used to measure these different usages and to compare CBAS with other stemmers.

Stemming accuracy is one of the basic measures of a stemmer's effectiveness. It is defined as the ratio between the number of correctly stemmed words and the number of words in the complete dataset.

Collecting related words under the same group is important for tasks such as IR. There are two variations of grouping related words. In the first, words are grouped correctly under a semantically correct root; this is referred to as classification. In the second, related words are grouped together, not necessarily under a correct root; this is referred to as clustering. Standard metrics for classification and clustering are accuracy, precision, recall, and F1 measure, defined as follows [42, 43]:

$$\text{accuracy} = \frac{1}{n}\sum_{i=1}^{n}\frac{|X_i \cap Y_i|}{|X_i \cup Y_i|}$$

$$\text{precision} = \frac{1}{n}\sum_{i=1}^{n}\frac{|X_i \cap Y_i|}{|Y_i|}$$

$$\text{recall} = \frac{1}{n}\sum_{i=1}^{n}\frac{|X_i \cap Y_i|}{|X_i|}$$

$$F_1\ \text{measure} = \frac{1}{n}\sum_{i=1}^{n}\frac{2\,|X_i \cap Y_i|}{|X_i| + |Y_i|}$$

where n is the number of extracted roots, X is the set of extracted roots, X_i is an individual extracted root, Y is the set of valid roots, and Y_i is an individual valid root.

5.3. Results

The complete set of 8941 words was used to test the proposed stemmer (CBAS) with a window size of 3; the set was then reduced to the unique word-root pairs for comparison with other stemmers. Table 2 shows the comparison between the proposed stemmer (CBAS) and other stemmers. It shows that the proposed stemmer (CBAS) achieved an accuracy of 81.5%, an improvement of 9.4, 67.3, and 51.2 percentage points over the Khoja, ISRI, and Tashaphyne stemmers, respectively. The accuracy enhancement is due to exploring various possible roots. Such exploration would not be possible without distributional semantics, which provides a dynamic and robust way of selecting an appropriate root.

Table 3 and Table 4 show the performance of the proposed stemmer (CBAS) when it is used as a grouping mechanism. Table 3 shows that the proposed stemmer (CBAS) has a higher potential to linguistically group Arabic words than other stemmers. CBAS outperformed the other stemmers in the classification task, with an accuracy of 65.45%. Table 4 shows that the proposed stemmer (CBAS) has potential improvements in non-linguistic based tasks, achieving an accuracy of 73.83% in clustering. Comparing the linguistic (classification) and non-linguistic (clustering) grouping measures, there is an increase in all corresponding measures. This is because some clusters were correctly formed irrespective of the clusters' seeds. The classification and clustering measures show that CBAS outperforms the other stemmers on every measure except clustering recall, where Khoja is marginally higher (75.74% against 75.46%). This indicates the beneficial features of CBAS for the IR task.

Table 2. Stemmers Linguistic Accuracy

Stemmer       Linguistic Accuracy
Khoja         72.1%
ISRI          14.2%
Tashaphyne    30.3%
CBAS-PMI      78.84%
CBAS-PPMI     79.49%
CBAS          81.5%
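The grouping measures of Section 5.2, which Tables 3 and 4 report, are averages of set-overlap ratios, so they can be computed with ordinary set operations. The sketch below is one reading of those definitions: it assumes each X_i is the group of words collected under the i-th extracted root and each Y_i is the group that belongs under the corresponding valid root, and it keeps the denominators as they appear in the text (precision over |Y_i|, recall over |X_i|). The function name and the toy groups are hypothetical.

```python
def grouping_scores(extracted, valid):
    """Average the per-group overlap ratios over all n groups.

    extracted, valid: equal-length lists of non-empty sets, paired by index
    (an assumed representation of the X_i and Y_i groups).
    """
    n = len(extracted)
    totals = {"accuracy": 0.0, "precision": 0.0, "recall": 0.0, "f1": 0.0}
    for x, y in zip(extracted, valid):
        common = len(x & y)
        totals["accuracy"] += common / len(x | y)
        totals["precision"] += common / len(y)   # denominator as given in the text
        totals["recall"] += common / len(x)      # denominator as given in the text
        totals["f1"] += 2 * common / (len(x) + len(y))
    return {name: value / n for name, value in totals.items()}


# Toy groups, invented for illustration.
extracted = [{"a", "b", "c"}, {"d"}]
valid = [{"a", "b"}, {"d", "e"}]
print(grouping_scores(extracted, valid))
```

Because each measure is averaged per group, the reported F1 need not equal the harmonic mean of the averaged precision and recall.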
Table 3. Stemmers Classification Measures

Stemmer       Accuracy   Precision   Recall    F1 measure
Khoja         57.53%     57.53%      59.59%    58.55%
ISRI          10.43%     10.43%      10.49%    10.46%
Tashaphyne    25.07%     25.07%      25.15%    25.11%
CBAS          65.45%     65.45%      68.23%    66.51%

Table 4. Stemmers Clustering Measures

Stemmer       Accuracy   Precision   Recall    F1 measure
Khoja         71.71%     93.09%      75.74%    83.52%
ISRI          12.59%     69.40%      13.34%    22.27%
Tashaphyne    32.25%     72.54%      37.03%    49.03%
CBAS          73.83%     93.71%      75.46%    84.50%

6. CONCLUSION

Many stemmers were developed to gain the rich linguistic features provided by the roots. Most stemmers make explicit decisions, statistical-based or linguistic-based, to select only one root. Other stemmers use ranking to express their selection preference rather than selecting a single root. However, at the very end, a single root is chosen. Static decisions are very appropriate for common and frequent cases. However, adding other features such as syntactic and manual annotations would also be valuable.

The introduced stemmer employs distributional similarity to handle incorrect root selection, which is a side effect of the root generation phase. The existence of robust filtering mechanisms, such as distributional analysis, allows exploring various roots. Distributional analysis has several advantages. It can be computed for any corpus and any language, and it is relatively fast and inexpensive to construct compared to a manually annotated corpus. Distributional semantics covers many relations between words, and it is robust against preferences or missing information. It is also very adaptive to context changes, which makes it suitable for many topics. However, distributional analysis is not as accurate as manually annotated data; hence, the word generation process was added to the roots selection phase to tolerate possible errors.

The previous techniques were compared to the results of the proposed stemmer (CBAS). CBAS shows an accuracy of 81.5%, an improvement of 9.4, 67.3, and 51.2 percentage points over the Khoja, ISRI, and Tashaphyne stemmers, respectively.
CBAS also shows an improvement in classification and clustering, with accuracies of 65.45% and 73.83%, respectively. The results indicate that the proposed stemmer (CBAS) enhances stemming and other related tasks.

CBAS represents a methodology for capturing a word's context and making decisions based on it. CBAS could change its behaviour based on the underlying data, which could be specialized in a sub-domain of the Arabic language. The statistical model used by CBAS is relatively simple. It incorporates important information (the context of a word) that would be complex to include in a rule based stemmer. The statistical model reduces the linguistic complexity of representing various linguistic cases. It also avoids unexpected interactions between rules and the prioritization schemes needed for ordering them.
REFERENCES

[1] P. M. Nadkarni, L. Ohno-Machado, and W. W. Chapman, "Natural language processing: an introduction," Journal of the American Medical Informatics Association, vol. 18.
[2] J. Hutchins, "The first public demonstration of machine translation: the Georgetown-IBM system, 7th January 1954," November.
[3] N. Chomsky, "Three models for the description of language," Information Theory, IRE Transactions on, vol. 2.
[4] A. Aho, R. Sethi, and J. D. Ullman, Compilers: Principles, Techniques, and Tools.
[5] D. Klein and C. D. Manning, "Accurate unlexicalized parsing," in Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1, 2003.
[6] M. Aljlayl and O. Frieder, "On Arabic search: improving the retrieval effectiveness via a light stemming approach," in Proceedings of the eleventh international conference on Information and knowledge management, 2002.
[7] I. A. Al-Sughaiyer and I. A. Al-Kharashi, "Arabic morphological analysis techniques: A comprehensive survey," Journal of the American Society for Information Science and Technology, vol. 55.
[8] M. F. Porter, "Snowball: A language for stemming algorithms."
[9] J. Xu, A. Fraser, and R. Weischedel, "Empirical studies in strategies for Arabic retrieval," in Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval, 2002.
[10] R. Fathalla, Y. El-Sonbaty, and M. A. Ismail, "Extraction of Arabic Words from Complex Color Images," in 9th IEEE International Conference on Document Analysis and Recognition (ICDAR 2007), Brazil.
[11] C. Akkaya, J. Wiebe, and R. Mihalcea, "Utilizing semantic composition in distributional semantic models for word sense discrimination and word sense disambiguation," in Semantic Computing (ICSC), 2012 IEEE Sixth International Conference on, 2012.
[12] D. Jurafsky. Word Senses and Word Relations.
[13] G. Bouma, "Normalized (pointwise) mutual information in collocation extraction," Proceedings of GSCL.
[14] K. C. Ryding, A reference grammar of modern standard Arabic: Cambridge University Press.
[15] A. Pasha, M. Al-Badrashiny, M. Diab, A. El Kholy, R. Eskander, N. Habash, et al., "Madamira: A fast, comprehensive tool for morphological analysis and disambiguation of Arabic," in Proceedings of the Language Resources and Evaluation Conference (LREC), Reykjavik, Iceland.
[16] S. M. Oraby, Y. El-Sonbaty, and M. A. El-Nasr, "Exploring the Effects of Word Roots for Arabic Sentiment Analysis," in International Joint Conference on Natural Language Processing, Nagoya, Japan, 2013.
[17] J. B. Lovins, Development of a stemming algorithm: MIT Information Processing Group, Electronic Systems Laboratory.
[18] S. Khoja and R. Garside, "Stemming Arabic text," Lancaster, UK, Computing Department, Lancaster University.
[19] M. S. Husain, "An unsupervised approach to develop stemmer," International Journal on Natural Language Computing, vol. 1.
[20] D. Harman, "How effective is suffixing?," JASIS, vol. 42, pp. 7-15.
[21] I. Smirnov, "Overview of stemming algorithms," Mechanical Translation, vol. 52.
[22] Y. Benajiba, M. Diab, and P. Rosso, "Arabic named entity recognition using optimized feature sets," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2008.
[23] K. Darwish and D. W. Oard, "CLIR Experiments at Maryland for TREC-2002: Evidence combination for Arabic-English retrieval," DTIC Document, 2003.
[24] L. S. Larkey and M. E. Connell, "Arabic information retrieval at UMass in TREC-10," DTIC Document, 2006.
[25] L. S. Larkey, L. Ballesteros, and M. E. Connell, "Light stemming for Arabic information retrieval," in Arabic computational morphology, Springer, 2007.
[26] L. S. Larkey, L. Ballesteros, and M. E. Connell, "Improving stemming for Arabic information retrieval: light stemming and co-occurrence analysis," in Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval, 2002.
[27] K. Taghva, R. Elkhoury, and J. Coombs, "Arabic stemming without a root dictionary," 2005.
[28] T. Zerrouki. (2010). Tashaphyne, Arabic light stemmer/segment.
[29] K. Darwish, "Building a shallow Arabic morphological analyzer in one day," in Proceedings of the ACL-02 workshop on Computational approaches to semitic languages, 2002.
[30] K. R. Beesley, "Arabic morphological analysis on the Internet," in Proceedings of the 6th International Conference and Exhibition on Multi-lingual Computing.
[31] O. Smrž, "Elixirfm: implementation of functional Arabic morphology," in Proceedings of the 2007 Workshop on Computational Approaches to Semitic Languages: Common Issues and Resources, 2007.
[32] O. PetrZemánek, "Prague Arabic Dependency Treebank: A Word on the Million Words."
[33] T. Buckwalter, "Buckwalter Arabic Morphological Analyzer Version 1.0."
[34] N. Habash, O. Rambow, and R. Roth, "MADA+TOKAN: A toolkit for Arabic tokenization, diacritization, morphological disambiguation, POS tagging, stemming and lemmatization," in Proceedings of the 2nd International Conference on Arabic Language Resources and Tools (MEDAR), Cairo, Egypt, 2009.
[35] M. Diab, K. Hacioglu, and D. Jurafsky, "Automated methods for processing Arabic text: From tokenization to base phrase chunking," Arabic Computational Morphology: Knowledge-based and Empirical Methods. Kluwer/Springer.
[36] S. N. Saleh and Y. El-Sonbaty, "A feature selection algorithm with redundancy reduction for text classification," in Computer and information sciences (ISCIS), international symposium on, 2007.
[37] S. Oraby, Y. El-Sonbaty, and M. A. El-Nasr, "Finding Opinion Strength Using Rule-Based Parsing for Arabic Sentiment Analysis," in Advances in Soft Computing and Its Applications, Springer, 2013.
[38] A. M. Ezzeldin, M. H. Kholief, and Y. El-Sonbaty, "ALQASIM: Arabic language question answer selection in machines," in Information Access Evaluation. Multilinguality, Multimodality, and Visualization, Springer, 2013.
[39] A. M. Ezzeldin, Y. El-Sonbaty, and M. H. Kholief, "Exploring the Effects of Root Expansion, Sentence Splitting and Ontology on Arabic Answer Selection," Natural Language Processing and Cognitive Science: Proceedings 2014, p. 273.
[40] M. Abbas, K. Smaïli, and D. Berkani, "Evaluation of Topic Identification Methods on Arabic Corpora," JDIM, vol. 9.
[41] S. Alansary, M. Nagi, and N. Adly, "Building an International Corpus of Arabic (ICA): progress of compilation stage," in 7th international conference on language engineering, Cairo, Egypt, 2007.
[42] S. Godbole and S. Sarawagi, "Discriminative methods for multi-labeled classification," in Advances in Knowledge Discovery and Data Mining, Springer, 2004.
[43] M. Hillenmeyer. Machine Learning.