Instant Diacritics Restoration System for Sindhi Accent Prediction using N-Gram and Memory-Based Learning Approaches


Hidayatullah Shaikh, Javed Ahmed Mahar, Mumtaz Hussain Mahar
Department of Computer Science, Shah Abdul Latif University, Khairpur Mir's, Sindh, Pakistan

Abstract---The script of the Sindhi language is highly complex, in part because of its abundance of homographic words. A homographic word can carry several meanings, so text is hard to interpret unless diacritics indicate the intended pronunciation. Diacritics help readers comprehend the text easily, yet in everyday writing people rarely bother to write them. Besides creating difficulties for human reading, the absence of diacritics also makes text ambiguous for machine reading: like humans, machines run into semantic and syntactic ambiguities during computational processing of the language. Instant diacritics restoration is an approach that emerged from text prediction systems. This type of diacritics restoration is unprecedented in natural language processing, particularly for Indo-Aryan languages. This work proposes a framework that uses N-Grams and a Memory-Based Learning approach. Its main strength is an accuracy of 99.03% on the Sindhi corpus during the experiments. The comparative advantage of instant diacritics restoration is that it can speed up other natural language and speech processing applications. The approach is promising for future development because Sindhi orthography is highly similar to that of Arabic, Urdu, Persian and other languages based on this type of script.

Keywords--Sindhi Language; Instant Diacritics Restoration; Text Prediction; N-Grams; Memory-Based Learning

I. INTRODUCTION

Sindhi orthography abounds in words that have different meanings but identical morphological structure.
These words are called homographs in linguistics. The solution to this problem is the assignment of diacritic marks to the homographs. Sindhi orthography has two types of diacritic signs used for the correct pronunciation of words [1]: superscript signs placed over the letters and subscript signs placed beneath them. Everyday Sindhi texts such as newspapers, magazines and books are written without diacritics. This absence poses critical challenges for computational processing of the language [2]. More specifically, when diacritics are absent, homographic words can be interpreted interchangeably and may also be pronounced erroneously. Without disambiguation, it is difficult to determine the intended meaning and pronunciation of words in linguistic and speech processing applications. The automatic assignment of diacritics in Sindhi script is therefore essential for natural language and speech applications [3] [4]. The literature contains many works on diacritic restoration, particularly using statistical approaches [5] [2]. However, the results of previous research are not at an acceptable level, and instant diacritics restoration is considered here for the first time for Sindhi. The objective of this study is to develop an automatic system that converts un-diacritized words into diacritized ones by assigning the diacritic signs instantly during typing. To meet this objective, an investigative study combining N-Grams and Letter-Level approaches is carried out.

The rest of the paper is organized as follows: some research contributions on diacritics restoration for Arabic script-based languages are presented in Section II.
The overview of corpus preparation is given in Section III. The proposed model for instant diacritics restoration is described and depicted in Section IV. Section V explains the execution process of the developed software application, Section VI gives the implementation of the proposed model and a detailed evaluation of the results, and Section VII concludes the paper with the core results.

II. RELATED WORK

The literature shows that diacritics restoration is performed at the letter level and at the word level, using various techniques such as N-Grams [6] [7], Neural Networks [8], Maximum Entropy [9], Memory-Based Learning [10] [11], and Weighted Finite-State Transducers [12]. The majority of researchers have obtained encouraging results at the word level using N-Gram language models [6] [7] [2], whereas the Memory-Based Learning approach [13] also yields good results at the letter level for the same task on Arabic script-based languages including Sindhi [14]. Automatic Sindhi diacritics restoration has mainly been addressed using statistical approaches such as maximum entropy [1], N-grams [5] and memory-based learning [14]. Acceptable results are achieved with memory-based learning and N-gram based language modeling. Hence, the proposed instant diacritics restoration mechanism is also based on the N-Grams and Memory-Based Learning approaches. With this mechanism, high accuracy is attained in less time.

III. CORPUS PREPARATION

Two types of data sets are required for experimentation with diacritics restoration systems [1]. Therefore, two types of corpora are designed and developed: the first contains fully diacritized text and the second un-diacritized text. In addition, a lexicon is also built. The experiments with the proposed method were performed using both types of data: corpora and lexicon. A corpus of 2,65,257 words is built in the Sindhi language for training and testing the system. The organized information of the developed corpus is given in Table I. The corpus is classified into three segments: antique books written completely with diacritics, such as Shah Jo Risalo [15]; poetry books with partially diacritized text; and recently published texts of different genres that are entirely void of diacritics, such as newspapers, magazines and text books.

TABLE I. WORDS INFORMATION OF DEVELOPED SINDHI CORPUS

Type of Corpus           No. of Sentences    No. of Words
Fully Diacritized                            ,462
Partially Diacritized                        ,188
Not-Diacritized                              ,22,607
Total                                        2,65,257

A. Developed Lexicon

In addition to the development of the Sindhi corpus, a lexicon of Sindhi text has been created, as it is an essential component of the proposed method of instant diacritization.
The mechanism of instant diacritics restoration is based on the memory-based learning approach, aided by letter-level learning. Accordingly, a table is developed that holds the letters in their different diacritized and un-diacritized forms. A specimen of this table is given in Fig. 1. Each letter is assigned a unique number for identification, which is required to process the letters in the system.

IV. PROPOSED MODEL

Nine components work together as the constituents of the proposed mechanism: calculation of word probabilities, specimens of letters, pattern matching and comparative function of homographic structures, K-NN classifier and class labels, calculation of distance between instances using the overlap metric, calculation of feature weights, nested hash and tokenization. The proposed model in Fig. 2 shows the execution process of the complete system.

The probabilities depend on the corpus; hence, the design of the training corpus is a delicate matter. A more specific training corpus leads to more accurate probabilities, which makes the task easier to achieve. N-grams are probabilistic models that guide the assignment of probabilities to words. Unigram, bigram, trigram and higher-order models are used for the calculation of probabilities: a unigram is an N-gram with N=1, a bigram with N=2, a trigram with N=3, and so on [16]. A text is a sequence of words and can be represented as:

$$P(w_1, w_2, \ldots, w_{n-1}, w_n) \tag{1}$$

For a bigram grammar:

$$P(w_1^n) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1}) \tag{2}$$

The trigram is the same as the bigram except that it conditions on the two previous words:

$$P(w_1^n) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-2}\, w_{i-1}) \tag{3}$$

The end product of the system offers the user the option to choose the suitable or correct word as required. Therefore, language modeling is used to compute N-Grams up to quadgrams (N=4). The probabilities of all the words in the corpus are calculated individually and stored in a dedicated table in the designed lexicon. This supports the subsequent steps of the mechanism.
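The N-gram probabilities in (2) and (3) can be illustrated with a minimal sketch. It assumes plain maximum-likelihood estimates from raw counts with no smoothing, and a toy Latin-letter token list instead of the Sindhi corpus; the function names `ngram_counts`, `conditional_probability` and `sentence_probability` are hypothetical and do not come from the paper.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every n-gram (as a tuple) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def conditional_probability(tokens, history, word):
    """Maximum-likelihood estimate of P(word | history) from raw counts.

    Assumption: no smoothing, so unseen n-grams get probability 0.
    """
    history = tuple(history)
    n = len(history) + 1
    numerator = ngram_counts(tokens, n)[history + (word,)]
    if history:
        denominator = ngram_counts(tokens, n - 1)[history]
    else:
        denominator = len(tokens)  # unigram case: relative frequency
    return numerator / denominator if denominator else 0.0


def sentence_probability(tokens, sentence, order=2):
    """Product of conditional probabilities, as in (2) for order=2 and (3) for order=3.

    Assumption: the first words back off to shorter histories because no
    sentence-start marker is modelled here.
    """
    prob = 1.0
    for i, word in enumerate(sentence):
        history = tuple(sentence[max(0, i - (order - 1)):i])
        prob *= conditional_probability(tokens, history, word)
    return prob


if __name__ == "__main__":
    toy_corpus = "a b a c a b".split()  # hypothetical stand-in for corpus words
    print(conditional_probability(toy_corpus, ("a",), "b"))
    print(sentence_probability(toy_corpus, ["a", "b", "a"], order=2))
```

Recounting the corpus on every call keeps the sketch short; a real system would presumably compute the counts once and store them, in the spirit of the probability table described above.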

Fig. 1. Sample Database Table for Instant Diacritics Restoration

After the word probabilities are calculated, the system computes the available instances of each diacritized letter. For this, almost all possible instances of every letter in the corpora are collected with every diacritic mark (e.g., ب, ب, ب), together with the surrounding letters (N letters) on both the left and right sides. The collected instances are saved in a multidimensional array in ascending order. A minimum number of instances is taken from the available corpus, with particular notations given to white spaces (SP), commas (CO) and dots (DO) [11] [13]. A vector-based multidimensional array is used for the storage of these examples. A corpus sample is taken from [1], and the related sample of feature vectors extracted from the same source is presented in Table II.

Fig. 2. Proposed Model for Sindhi Instant Diacritics Restoration
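The instance extraction described above (a focus letter plus N neighbouring letters on each side, with SP, CO and DO standing for spaces, commas and dots) might be sketched as follows. This is only an illustration under stated assumptions: the function name `feature_vectors` is hypothetical, text edges are padded with SP, and only the ASCII comma and dot are mapped, whereas the paper's system works on Sindhi text and its own punctuation.

```python
# Notation for special symbols, following the SP / CO / DO convention in the text.
SPECIAL = {" ": "SP", ",": "CO", ".": "DO"}


def feature_vectors(text, n):
    """Return (focus_letter, neighbours) pairs with n letters of context per side.

    Assumption: positions beyond the start or end of the text are padded with SP.
    """
    symbols = [SPECIAL.get(ch, ch) for ch in text]
    padded = ["SP"] * n + symbols + ["SP"] * n
    vectors = []
    for i, focus in enumerate(symbols):
        if focus in SPECIAL.values():
            continue  # spaces and punctuation are context only, never a focus letter
        left = padded[i:i + n]                   # n symbols before the focus
        right = padded[i + n + 1:i + 2 * n + 1]  # n symbols after the focus
        vectors.append((focus, left + right))
    return vectors


if __name__ == "__main__":
    for focus, context in feature_vectors("ab, c", 1):
        print(focus, context)
```

With n=5 each vector holds ten neighbouring symbols, which corresponds to the ten-letter window used later in the experiments.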

TABLE II. SAMPLE LETTERS AND FEATURE VECTORS

Letters: ڪ ڪ ڪ هه هه هه

Feature Vectors: ا,ن,ت,ي SP,,ڏ,ڇ SP,,ي,ڍ : پ,ا,س,و SP, CO, SP,,ن,و,ي : SP, SP, SP, SP, SP,ب SP,,ن,هه,ن : ي SP,,ج,و SP,,ٿ,ڪ SP,,ي,ٿ : ٿ,ي,ن,هه SP,,ٿ,هه SP,,ي,ٿ : : SP,,و,ڻ,ا,م,هه SP,,ر,هه SP ن SP,,س,ڀ SP,,ا,ک SP,,ڙ,و : SP,ڪ,ٿ,ي SP,,و,ر,ض SP,,ٿ : ن,د,و SP,,ا,ا,س,ا DO,,ي : SP,م,ا,ن SP, SP,,ڪ,هه SP,,ر : ر,ي SP,,پ,ن,پ SP,,و,ج,ن : SP, SP, SP,ڪ,ن,ن,هه,ن SP,,ن : ن SP,,هه,ر SP,,هه,ڻ,م SP,,ڪ : چ SP,,پ,ا,ڻ,م,ا,س SP,,ي : ض,ر,و,ر SP, SP,,و,د,ن,و : ڪ SP,,م,ا,ڻ,ڻ,ا,پ SP,,و : ي SP,,س,ا,م,چ,ا SP,,ن,و :

The absence of diacritical marks leads to many ambiguities in the text regarding the possible vowel sounds in a word [11]. Take the word سکن as an example. The system compares the pattern of the un-diacritized word with the diacritized ones available in the corpus and receives two candidate words, س ک ن and س ک ن. Pattern matching is carried out using regular expressions. The system then matches the pattern of the un-diacritized input word with the diacritized ones, and the most suitable word, chosen by the highest probability, is placed at the same location. A sample regular expression is given in graphical representation.

The complete group of examples is extracted from the corpus for each complex letter structure. Each letter from the set is taken one by one, together with its surrounding neighbors on both sides, and the system compares it with the available instances in the corpus. The KNN classifier is used for this comparison. The value of each feature vector is calculated and stored in the built-in metric. All values of each feature are weighted and tagged with labels indicating matched or mismatched structures, and the instances are divided according to the assigned labels. The instance-based learning algorithm is used to compare new problem examples with the instances already stored in memory.
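The comparison of a new example against stored instances can be illustrated with a small memory-based classifier. The sketch assumes an unweighted distance that simply counts mismatching feature positions and a plain majority vote among the k nearest stored instances; the names `overlap_distance` and `knn_classify` and the class labels are hypothetical, and the feature weighting mentioned in the text is left out.

```python
from collections import Counter


def overlap_distance(x, y):
    """Number of feature positions where two equally long vectors differ."""
    return sum(1 for a, b in zip(x, y) if a != b)


def knn_classify(memory, query, k=3):
    """Label a query vector by majority vote among its k nearest stored instances.

    memory is a list of (feature_vector, label) pairs kept as-is, which is
    what makes the approach memory-based: no model is fitted in advance.
    """
    ranked = sorted(memory, key=lambda item: overlap_distance(item[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical toy instances: context symbols -> diacritic class of the focus letter.
    memory = [
        (("SP", "b"), "zabar"),
        (("SP", "c"), "zabar"),
        (("d", "b"), "zair"),
        (("d", "e"), "zair"),
        (("SP", "e"), "zabar"),
    ]
    print(knn_classify(memory, ("SP", "b"), k=3))
```

Sorting the whole memory for each query is the simplest possible design choice; it is adequate for a toy example but would be slow for a corpus-sized instance base.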
The K-nearest neighbor algorithm is the simplest proven instance-based learning method; it categorizes objects based on the nearest training examples in the feature space. The core model is given below [17]:

$$f(x_q) = \frac{\sum_{i=1}^{k} f(x_i)}{k} \tag{4}$$

Each input instance is compared individually with all its closest neighbors using the KNN classifier, and the system finally accepts the most frequent label. A multidimensional array in the system stores the training examples containing the feature vectors, and a label specifies the class of each example. The entity is assigned the label that receives the highest number of votes among its neighbors. During classification, a test instance is fed to the system and the distance $\Delta(X, Y)$ measures the similarity between the new example and every example in memory. The overlap metric is used for this task; it defines the distance between instances described by n features as the sum of the distances per feature [13] [14]:

$$\Delta(X, Y) = \sum_{i=1}^{n} \delta(x_i, y_i) \tag{5}$$

The metric simply counts the number of matching or mismatching feature values in both patterns; domain knowledge bias can then be added through feature weights. For the feature weights, statistical information is computed by examining which features are better predictors of the class labels. Information Gain (IG) examines each feature individually and measures how much information it contributes to knowledge of the correct class label.

Immediately after this process, a hash table stores the data in an associative manner. The table stores the data in array format, and each data value receives a unique index, so the data can be accessed quickly once the index is known. Hashing is a widely known technique for converting a range of key values into a range of array indexes.

Tokenization of Sindhi script is also a challenging task because of the complexities in the text, particularly those of homographic structures. A compound word needs to be treated as a single token, but the embedded space required inside it creates ambiguity for the tokenization process. The embedded space is required because of the cursive nature of Sindhi script and its connecting and non-connecting letters. Tokenization therefore needs particular attention. Mahar's [1] tokenization model is adopted in this research.

Sindhi script abounds in homographic words, so ambiguity is often observed when the text is un-diacritized. For example, the simple root word قسم consists of letters that can be read in two ways: ق س م (an oath, noun) and ق س م (kind, noun). Without diacritics the words are exactly identical and thus create ambiguity for NLP applications. The Viterbi algorithm is one of the efficient approaches for finding the most likely path of transitions in such cases. It produces the most likely word on the basis of the highest probability value calculated using N-grams [16].

V. EXECUTION PROCESS OF APPLICATION

Text prediction is the basic idea that gave rise to Instant Diacritics Restoration. Text prediction was proposed to save time and effort by suggesting the likely upcoming letters after the beginning letters of a word have been typed. With each succeeding letter, the user receives suggestions in a popup and can accept one with a single click rather than typing all the remaining letters of the word. For example, suppose the user wants to type the word انسان. After typing the first letter, ا, a popup shows some of the most probable and frequently used words beginning with ا. After the next letter, ن, the popup again shows the most probable and frequently used words beginning with these two letters.
If the user finds the intended word in the popup, a single click enters it, instead of five keystrokes for the five letters of the word. This function of text prediction gave birth to the idea of instant diacritics restoration. The predictive approach of instant diacritization lets the user type words with their exact pronunciations, which in turn helps in reading them correctly. The editor works actively alongside the user and assigns the diacritics automatically and immediately; the user only has to type the words. For example, suppose the user wants to type the word ا ن س ان. When the user types the first letter, ا, the editor initially assigns it the superscript diacritic sign, because the system does this for every first letter. When the user types the next letter, ن, the system immediately calculates the probability of the possible diacritics for this pair of letters and assigns one to ن; at the same time, the diacritic on ا changes accordingly. When the user then types س, the system again calculates the probability of the possible diacritics for this combination of letters and assigns diacritics to all three letters according to the highest-probability match found in the corpus. As the user continues typing, the system keeps working on the letters and their diacritics, calculating the probabilities of the letters and diacritic signs from the given corpus. Once the user has finished typing ا ن س ان, the system finalizes its diacritics with the same procedure. The same process takes place for each letter typed in the editor.

VI. IMPLEMENTATION AND RESULTS

The design of the training and testing sets is the foundation of the final results, so both are given primary attention until the results are derived. Different measures such as Word Error Rate, Diacritic Error Rate, Precision, Recall and F-measure have been used previously.
We have also used Precision, one of these measures, because it is observed to perform better with the letter-level approach [1]. Moreover, the complex letters supply the target features for training; hence, the task is performed at the lowest basic level of letters. The three mainly used diacritics in Sindhi, i.e., Zabar, Zair and Pesho, are considered in the experiments. The Letter Level Learning method processes every letter taken from the corpus and creates a ten-letter vector. Each vector is put into an array, so each letter is preprocessed with its calculated probability. After receiving the testing data set, the system compares all the un-diacritized letters of the testing data set with the preprocessed data available in the arrays and then replaces each letter with the diacritized one. From the total sets of instances taken from the developed corpus, instances from each set are tested experimentally. The testing examples are approximately 15% of the whole set of examples. Tables III, IV and V show the results attained with N=1, 3 and 5. The tables show the ambiguous letters extracted from the developed corpus and the precision obtained by applying instance-based learning at the letter level.

TABLE III. AMBIGUOUS SET OF LETTERS, EXAMPLES AND ACHIEVED PRECISION WITH N=1

Ambiguous Set    Total Examples    Tested Examples    Precision Achieved
ا ا ا    99,    %
ب ب ب    15,    %
ٻ ٻ ٻ    6,    %
ڀ ڀ ڀ    14,    %
ت ت ت    34,    %
ٿ ٿ ٿ    11,    %
ٽ ٽ ٽ    10,    %
ٺ ٺ ٺ    4,    %
ث ث ث    %
پ پ پ    12,    %
ج ج ج    41,    %
ج هه ج هه ج هه    5,    %
ڄ ڄ ڄ    %
ڃ ڃ ڃ    %
چ چ چ    18,    %
ڇ ڇ ڇ    10,    %
ح ح ح    20,    %
خ خ خ    8,    %
د د د    30,    %
ڌ ڌ ڌ    %
ڊ ڊ ڊ    %
ڏ ڏ ڏ    25,    %
ڍ ڍ ڍ    %
ذ ذ ذ    %
ر ر ر    48,    %
ڙ ڙ ڙ    1,    %
ز ز ز    %
س س س    24,    %
ش ش ش    %
ص ص ص    %
ض ض ض    %
ط ط ط    %
ظ ظ ظ    %
ع ع ع    11,    %
غ غ غ    %
ف ف ف    12,    %
ڦ ڦ ڦ    %
ق ق ق    %
ڪ ڪ ڪ    54,    %
ک ک ک    28,    %
گ گ گ    14,    %
گه گه گه    2,    %
ڳ ڳ ڳ    %
ڱ ڱ ڱ    %
ل ل ل    55,    %
م م م    60,    %
ن ن ن    101,    %
ڻ ڻ ڻ    %
و و و    55,    %
هه هه هه    84,    %
ء ء ء    %
ي ي ي    126,    %

TABLE IV. AMBIGUOUS SET OF LETTERS, EXAMPLES AND ACHIEVED PRECISION WITH N=3

Ambiguous Set    Total Examples    Tested Examples    Precision Achieved
ا ا ا    99,    %
ب ب ب    15,    %
ٻ ٻ ٻ    6,    %
ڀ ڀ ڀ    14,    %
ت ت ت    34,    %
ٿ ٿ ٿ    11,    %
ٽ ٽ ٽ    10,    %
ٺ ٺ ٺ    4,    %
ث ث ث    %
پ پ پ    12,    %
ج ج ج    41,    %
ج هه ج هه ج هه    5,    %
ڄ ڄ ڄ    %
ڃ ڃ ڃ    %
چ چ چ    18,    %
ڇ ڇ ڇ    10,    %
ح ح ح    20,    %
خ خ خ    8,    %
د د د    30,    %
ڌ ڌ ڌ    %
ڊ ڊ ڊ    %
ڏ ڏ ڏ    25,    %
ڍ ڍ ڍ    %
ذ ذ ذ    %
ر ر ر    48,    %
ڙ ڙ ڙ    1,    %
ز ز ز    %
س س س    24,    %
ش ش ش    %
ص ص ص    %
ض ض ض    %
ط ط ط    %
ظ ظ ظ    %
ع ع ع    11,    %
غ غ غ    %
ف ف ف    12,    %
ڦ ڦ ڦ    %
ق ق ق    %
ڪ ڪ ڪ    54,    %
ک ک ک    28,    %
گ گ گ    14,    %
گه گه گه    2,    %
ڳ ڳ ڳ    %
ڱ ڱ ڱ    %
ل ل ل    55,    %
م م م    60,    %
ن ن ن    101,    %
ڻ ڻ ڻ    %
و و و    55,    %
هه هه هه    84,    %
ء ء ء    %
ي ي ي    126,    %

TABLE V. AMBIGUOUS SET OF LETTERS, EXAMPLES AND ACHIEVED PRECISION WITH N=5

Ambiguous Set    Total Examples    Tested Examples    Precision Achieved
ا ا ا    99,    %
ب ب ب    15,    %
ٻ ٻ ٻ    6,    %
ڀ ڀ ڀ    14,    %
ت ت ت    34,    %
ٿ ٿ ٿ    11,    %
ٽ ٽ ٽ    10,    %
ٺ ٺ ٺ    4,    %
ث ث ث    %
پ پ پ    12,    %
ج ج ج    41,    %
ج هه ج هه ج هه    5,    %
ڄ ڄ ڄ    %
ڃ ڃ ڃ    %
چ چ چ    18,    %
ڇ ڇ ڇ    10,    %
ح ح ح    20,    %
خ خ خ    8,    %
د د د    30,    %
ڌ ڌ ڌ    %
ڊ ڊ ڊ    %
ڏ ڏ ڏ    25,    %
ڍ ڍ ڍ    %
ذ ذ ذ    %
ر ر ر    48,    %
ڙ ڙ ڙ    1,    %
ز ز ز    %
س س س    24,    %
ش ش ش    %
ص ص ص    %
ض ض ض    %
ط ط ط    %
ظ ظ ظ    %
ع ع ع    11,    %
غ غ غ    %
ف ف ف    12,    %
ڦ ڦ ڦ    %
ق ق ق    %
ڪ ڪ ڪ    54,    %
ک ک ک    28,    %
گ گ گ    14,    %
گه گه گه    2,    %
ڳ ڳ ڳ    %
ڱ ڱ ڱ    %
ل ل ل    55,    %
م م م    60,    %
ن ن ن    101,    %
ڻ ڻ ڻ    %
و و و    55,    %
هه هه هه    84,    %
ء ء ء    %
ي ي ي    126,    %

Three different window sizes were tested to find the best one. Among window sizes of two, six and ten letters (i.e., N=1, 3, 5), the calculated accuracy is 92.52% with N=1, 95.12% with N=3 and 99.03% with N=5. The highest accuracy was thus observed with the ten nearest accompanying letters (N=5), where N stands for the number of letters on each side of the letter under process. The cumulative precision calculated with the different window sizes is shown in Fig. 3.

Fig. 3. Calculated Cumulative Precision with Different Window Sizes

The figures in the tables differ considerably from one another, and the results reveal that the window size is decisive for the increase or decrease in precision. N=5 therefore proves to be the most suitable and reliable window of those tested.

VII. CONCLUSION

Automatic instant diacritic restoration is an essential component for many NLP applications. The restoration is attempted by combining two approaches: N-gram based and Letter Level Learning based. Each method has its own characteristics and limitations. The proposed mechanism is tested on our developed corpus of the Sindhi language. The window N=5 is found to be the best after testing different sizes, achieving a precision of 99.03%. With slight modifications, the proposed method can also be applied to instant diacritics restoration for Arabic, Urdu and Persian.

REFERENCES

[1] J. A. Mahar, Statistical Approaches to Diacritics Restoration in Sindhi Text to Speech Synthesis System, PhD Thesis, Hamdard University, Karachi, Pakistan.
[2] S. A. Mahar, Comparative Analysis of Vowel Restoration for Arabic Script Based Languages Using N-Gram Models, MS Thesis, Shah Abdul Latif University, Khairpur, Pakistan.
[3] A. Al-Wabil, H. Al-Khalifa, W. Al-Saleh, Arabic Text-To-Speech Synthesis: A Preliminary Evaluation, In Proceedings of the 2007 World Conference on Educational Multimedia, Hypermedia and Telecommunications, Vancouver, Canada.
[4] A. A. Shah, A. W. Ansari, L. Das, Bi-Lingual Text to Speech Synthesis System for Urdu and Sindhi, National Conference on Emerging Technology.
[5] J. A. Mahar, G. Q. Memon, Automatic Diacritics Restoration for Sindhi, Sindh University Research Journal (Science Series), Vol. 43, No. 1, June.
[6] Y. Gal, An HMM Approach to Vowel Restoration in Arabic and Hebrew, ACL-02 Workshop on Computational Approaches to Semitic Languages, Association for Computational Linguistics, Philadelphia, Pennsylvania, Pp. 1-7.
[7] A. A. Harby, M. A. Shehawey, R. S. Barogy, A Statistical Approach for Quran Vowel Restoration, ICGST International Journal on Artificial Intelligence and Machine Learning, Vol. 8, No. 3, Pp. 9-16.
[8] H. Sultan, Automatic Arabic Diacritization using Neural Network, Scientific Bulletin of Faculty of Engineering Ain-Shams University: Electrical Engineering, Vol. 36, No. 4.
[9] I. Zitouni, R. Sarikaya, Arabic Diacritic Restoration Based on Maximum Entropy Models, Computer Speech and Language, Vol. 23.
[10] R. Mihalcea, V. Nastase, Letter Level Learning for Language Independent Diacritics Restoration, Proceedings of 6th Workshop on Computational Language Learning, Vol. 20, Pp. 1-7.
[11] S. Kubler, E. Mohamed, Memory-based vocalization of Arabic, In Proceedings of the LREC Workshop on HLT and NLP within the Arabic World, Morocco.
[12] R. Nelken, S. M. Shieber, Arabic Diacritization using Weighted Finite-State Transducers, ACL Workshop on Computational Approaches to Semitic Languages, Association for Computational Linguistics, Pp. 79-86, Michigan.
[13] R. F. Mihalcea, Diacritic Restoration: Learning from Letters Versus Learning from Words, Lecture Notes in Computer Science, Vol. 2276.
[14] J. A. Mahar, G. Q. Memon, H. Shaikh, Sindhi Diacritics Restoration By Letter Level Learning Approach, Sindh University Research Journal (Science Series), Vol. 43, No. 2, December.
[15] K. Aadvani, Shah Jo Risalo, 2nd Edition, Sindhica Academy, Karachi, Pakistan.
[16] D. Jurafsky, J. H. Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition, Prentice-Hall.
[17] Y. Hifny, Restoration of Arabic Diacritics Using Dynamic Programming, COLING.
[18] C. Lee, G. G. Lee, Information Gain and Divergence-Based Feature Selection for Machine Learning-Based Text Categorization, An International Journal of Information Processing and Management, Special Issue: Formal Methods for Information Retrieval, Vol. 42, Issue 1, January.


More information

Arabic Orthography vs. Arabic OCR

Arabic Orthography vs. Arabic OCR Arabic Orthography vs. Arabic OCR Rich Heritage Challenging A Much Needed Technology Mohamed Attia Having consistently been spoken since more than 2000 years and on, Arabic is doubtlessly the oldest among

More information

HybridTechniqueforArabicTextCompression

HybridTechniqueforArabicTextCompression Global Journal of Computer Science and Technology: C Software & Data Engineering Volume 15 Issue 1 Version 1.0 Type: Double Blind Peer Reviewed International Research Journal Publisher: Global Journals

More information

also inside Continuing Education Alumni Authors College Events

also inside Continuing Education Alumni Authors College Events SUMMER 2016 JAMESTOWN COMMUNITY COLLEGE ALUMNI MAGAZINE create a etrepreeur creatig a busiess a artist creatig beauty a citize creatig the future also iside Cotiuig Educatio Alumi Authors College Evets

More information

On March 15, 2016, Governor Rick Snyder. Continuing Medical Education Becomes Mandatory in Michigan. in this issue... 3 Great Lakes Veterinary

On March 15, 2016, Governor Rick Snyder. Continuing Medical Education Becomes Mandatory in Michigan. in this issue... 3 Great Lakes Veterinary michiga veteriary medical associatio i this issue... 3 Great Lakes Veteriary Coferece 4 What You Need to Kow Whe Issuig a Iterstate Certificate of Ispectio 6 Low Pathogeic Avia Iflueza H5 Virus Detectios

More information

Linking Task: Identifying authors and book titles in verbose queries

Linking Task: Identifying authors and book titles in verbose queries Linking Task: Identifying authors and book titles in verbose queries Anaïs Ollagnier, Sébastien Fournier, and Patrice Bellot Aix-Marseille University, CNRS, ENSAM, University of Toulon, LSIS UMR 7296,

More information
