Instant Diacritics Restoration System for Sindhi Accent Prediction using N-Gram and Memory-Based Learning Approaches

Hidayatullah Shaikh, Javed Ahmed Mahar, Mumtaz Hussain Mahar
Department of Computer Science, Shah Abdul Latif University, Khairpur Mir's, Sindh, Pakistan

Abstract---The script of the Sindhi language is highly complex, not least because it abounds in homographic words. A homographic word can carry several meanings, so the text is hard to interpret unless the intended pronunciation is specified with diacritics. Diacritics help readers comprehend the text easily, yet people rarely write them in everyday applications. Besides making the text harder for humans to read, the absence of diacritics also makes it opaque to machines, which face semantic and syntactic ambiguity during computational processing of the language. Instant diacritics restoration is an approach that emerged from text prediction systems. This type of diacritics restoration is unprecedented in natural language processing, particularly for Indo-Aryan languages. This work proposes a framework based on N-Grams and Memory-Based Learning. Its main strength is an accuracy of 99.03% on a Sindhi corpus in our experiments. A further advantage of instant diacritics restoration is that it can speed up other natural language and speech processing applications. The approach appears readily extensible, because Sindhi orthography is highly similar to that of Arabic, Urdu, Persian and other languages based on this type of script.

Keywords--Sindhi Language; Instant Diacritics Restoration; Text Prediction; N-Grams; Memory-Based Learning

I. INTRODUCTION

Sindhi orthography abounds in words that have different meanings but an identical morphological structure. These words are called homographs in linguistics. The solution to this problem is to assign diacritic marks to the homographs. Sindhi orthography has two types of diacritic signs used to indicate the correct pronunciation of words [1]: superscript signs placed over the letters and subscript signs placed beneath them. Everyday Sindhi texts such as newspapers, magazines and books are written without diacritics. This absence poses critical challenges for computational processing of the language [2]. More specifically, when diacritics are absent, homographic words can be confused with one another and may be interpreted and pronounced erroneously. Without disambiguation, it is difficult to determine the intended meaning and pronunciation of words in linguistic and speech processing applications. Automatic assignment of diacritics to Sindhi script is therefore essential for natural language and speech applications [3] [4], and the literature contains many studies on diacritics restoration, particularly using statistical approaches [5] [2]. This study is motivated by two observations: first, the results of previous work are not at a satisfactory or acceptable level; second, instant diacritics restoration is considered here for the first time for Sindhi.

The objective of this study is to develop an automatic system that converts undiacritized words into diacritized ones by assigning diacritic signs instantly during typing. To meet this objective, an investigative study combining N-Grams and Letter-Level approaches is carried out.

The rest of the paper is organized as follows. Some research contributions on diacritics restoration for Arabic script-based languages are presented in Section II.
An overview of corpus preparation is given in Section III. The proposed model for instant diacritics restoration is described and depicted in Section IV. Section V explains the execution process of the developed software application, Section VI gives the implementation of the proposed model and a detailed evaluation of the results, and Section VII concludes the paper with the core results.

II. RELATED WORK

The literature shows that diacritics restoration is performed at both the letter level and the word level, using techniques such as N-Grams [6] [7], Neural Networks [8], Maximum Entropy [9], Memory-Based Learning [10] [11], and Weighted Finite-State Transducers [12]. The majority of researchers have obtained encouraging results at the word level using N-Gram language models [6] [7] [2], whereas the Memory-Based Learning approach [13] also yields good results at the letter level for the same task on Arabic script-based languages, including Sindhi [14].

Automatic Sindhi diacritics restoration has mainly been addressed with statistical approaches such as maximum entropy [1], N-grams [5] and memory-based learning [14]. Acceptable results are achieved with the memory-based learning and N-gram based language modeling approaches. Hence, the proposed instant diacritics restoration mechanism is also based on N-Grams and Memory-Based Learning. With this mechanism, high accuracy is attained in less time.

III. CORPUS PREPARATION

Two types of data sets are always required for experiments with diacritics restoration systems [1]. Therefore, two types of corpora are designed and developed: the first consists of fully diacritized text and the second of undiacritized text. In addition, a lexicon is built. The experiments with the proposed method use both kinds of resource, the corpora and the lexicon. A Sindhi corpus of 265,257 words is built for training and testing the system. The composition of the developed corpus is given in Table I. The corpus is divided into three segments: antique books that are written entirely with diacritics, such as Shah Jo Risalo [15]; poetry books that contain partially diacritized text; and recently published texts of different genres that are entirely without diacritics, such as newspapers, magazines and text books.

TABLE I. WORDS INFORMATION OF DEVELOPED SINDHI CORPUS

Type of Corpus | No. of Sentences | No. of Words
Fully Diacritized | 8,326 | 49,462
Partially Diacritized | 10,190 | 93,188
Not Diacritized | 14,869 | 122,607
Total | 33,385 | 265,257

A. Developed Lexicon

In addition to the Sindhi corpus, a lexicon of Sindhi text has been created, as it is an essential component of the proposed method of instant diacritization.
The instant diacritics restoration mechanism is based on the memory-based learning approach, supported by letter-level learning. Accordingly, a table containing the letters in their different diacritized and undiacritized forms is developed. A specimen of this table is given in Fig. 1. Each letter is assigned a unique number for identification, which is required when the letters are processed by the system.

IV. PROPOSED MODEL

Nine components work together as the constituents of the proposed mechanism: calculation of word probabilities, specimens of letters, pattern matching and comparative function of homographic structures, K-NN classifier and class labels, calculation of distance between instances using the overlap metric, calculation of feature weights, nested hash, and tokenization. The proposed model in Fig. 2 shows the execution process of the complete system.

The probabilities depend on the corpus; hence, the design of the training corpus is a delicate matter. A more specific training corpus yields more accurate probabilities, which make the task easier to achieve. N-grams are probabilistic models that guide the assignment of probabilities to words. Unigram, bigram, trigram and higher-order models are used to calculate the probabilities: a unigram is an N-gram with N = 1, a bigram with N = 2, a trigram with N = 3, and so on [16]. A text is a sequence of words whose probability can be represented as

$$P(w_1, w_2, \ldots, w_{n-1}, w_n) \tag{1}$$

For a bigram grammar,

$$P(w_1^n) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1}) \tag{2}$$

The trigram model is the same as the bigram model except that it conditions on the two previous words:

$$P(w_1^n) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-2}\, w_{i-1}) \tag{3}$$

The end product of the system offers the user a choice of suitable or correct words as required. Therefore, language modeling is used to compute N-Grams up to order four. The probabilities of all the words in the corpus are calculated individually and stored in a dedicated table in the designed lexicon, where they support the subsequent steps of the mechanism.
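The N-gram probabilities in Eqs. (2) and (3) can be estimated from counts. The following is a minimal sketch, assuming maximum-likelihood estimates without smoothing and a hypothetical start-of-sentence symbol `<s>`; the function names are illustrative and are not taken from the described system.

```python
from collections import Counter

BOS = "<s>"  # hypothetical start-of-sentence padding symbol


def count_ngrams(sentences, n):
    """Return (ngram_counts, context_counts) for order-n N-grams."""
    ngrams, contexts = Counter(), Counter()
    for tokens in sentences:
        padded = [BOS] * (n - 1) + list(tokens)
        for i in range(n - 1, len(padded)):
            context = tuple(padded[i - n + 1:i])  # the n-1 previous words
            ngrams[context + (padded[i],)] += 1
            contexts[context] += 1
    return ngrams, contexts


def cond_prob(word, context, ngrams, contexts):
    """Maximum-likelihood estimate of P(word | context)."""
    context = tuple(context)
    if contexts[context] == 0:
        return 0.0  # unseen context; a real system would smooth here
    return ngrams[context + (word,)] / contexts[context]


def sentence_prob(tokens, n, ngrams, contexts):
    """Product of the conditional probabilities, as in Eqs. (2) and (3)."""
    padded = [BOS] * (n - 1) + list(tokens)
    prob = 1.0
    for i in range(n - 1, len(padded)):
        prob *= cond_prob(padded[i], padded[i - n + 1:i], ngrams, contexts)
    return prob
```

Calling `count_ngrams(sentences, 2)` gives the bigram model of Eq. (2) and `count_ngrams(sentences, 3)` the trigram model of Eq. (3); the same functions cover the order-four case by passing `n=4`.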

Fig. 1. Sample Database Table for Instant Diacritics Restoration

After the word probabilities are calculated, the system computes the available instances of each diacritized letter. Almost all possible instances of every letter in the corpora are collected with every diacritic mark (e.g., ب, ب, ب), together with the N surrounding letters on both the left and the right side. The collected instances are saved in ascending order in a multidimensional array. At least 1,224,688 instances are taken from the available corpus, with special notations for white spaces (SP), commas (CO) and dots (DO) [11] [13]. A vector-based multidimensional array is used to store these examples. A sample of feature vectors extracted from a corpus sample taken from [1] is presented in Table II.

Fig. 2. Proposed Model for Sindhi Instant Diacritics Restoration
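The instance extraction described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the three diacritics are taken to be the Unicode combining marks U+064E, U+0650 and U+064F, positions beyond the text boundary are padded with SP, the focus letter is included in the feature vector, and the label names are hypothetical.

```python
# Assumed mapping of the three diacritics to Unicode combining marks.
DIACRITICS = {"\u064E": "ZABAR", "\u0650": "ZAIR", "\u064F": "PESH"}
# Special notations for white space, commas and dots, as in the text.
SYMBOLS = {" ": "SP", ",": "CO", "\u060C": "CO", ".": "DO", "\u06D4": "DO"}


def split_letters(text):
    """Return a list of (base_symbol, diacritic_label) pairs."""
    units = []
    for ch in text:
        if ch in DIACRITICS and units:
            base, _ = units[-1]
            units[-1] = (base, DIACRITICS[ch])  # attach mark to previous letter
        else:
            units.append((SYMBOLS.get(ch, ch), "NONE"))
    return units


def extract_instances(text, n=5):
    """Build one (feature_vector, label) pair per letter.

    The feature vector holds n undiacritized symbols to the left, the
    focus letter, and n symbols to the right; the label is the diacritic
    carried by the focus letter in the diacritized training text.
    """
    units = split_letters(text)
    bases = [base for base, _ in units]
    instances = []
    for i, (base, label) in enumerate(units):
        if base in ("SP", "CO", "DO"):
            continue  # separators are context only, never focus letters
        left = [bases[j] if j >= 0 else "SP" for j in range(i - n, i)]
        right = [bases[j] if j < len(bases) else "SP"
                 for j in range(i + 1, i + n + 1)]
        instances.append((left + [base] + right, label))
    return instances
```

With `n=5` each vector has eleven entries, the focus letter plus the ten-letter window used for the best results reported later in the paper.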

TABLE II. SAMPLE LETTERS AND FEATURE VECTORS

Letters: ڪ ڪ ڪ هه هه هه
Feature Vectors: ا,ن,ت,ي SP,,ڏ,ڇ SP,,ي,ڍ : پ,ا,س,و SP, CO, SP,,ن,و,ي : SP, SP, SP, SP, SP,ب SP,,ن,هه,ن : ي SP,,ج,و SP,,ٿ,ڪ SP,,ي,ٿ : ٿ,ي,ن,هه SP,,ٿ,هه SP,,ي,ٿ : : SP,,و,ڻ,ا,م,هه SP,,ر,هه SP ن SP,,س,ڀ SP,,ا,ک SP,,ڙ,و : SP,ڪ,ٿ,ي SP,,و,ر,ض SP,,ٿ : ن,د,و SP,,ا,ا,س,ا DO,,ي : SP,م,ا,ن SP, SP,,ڪ,هه SP,,ر : ر,ي SP,,پ,ن,پ SP,,و,ج,ن : SP, SP, SP,ڪ,ن,ن,هه,ن SP,,ن : ن SP,,هه,ر SP,,هه,ڻ,م SP,,ڪ : چ SP,,پ,ا,ڻ,م,ا,س SP,,ي : ض,ر,و,ر SP, SP,,و,د,ن,و : ڪ SP,,م,ا,ڻ,ڻ,ا,پ SP,,و : ي SP,,س,ا,م,چ,ا SP,,ن,و

The absence of diacritical marks leads to many ambiguities in the text regarding the possible vowel sounds used in a word [11]. Take the word سکن as an example. The system compares the pattern of the undiacritized word with the diacritized words available in the corpus and finds two candidates, س ک ن and س ک ن. Pattern matching is carried out using regular expressions. The system then matches the pattern of the undiacritized input word against the diacritized forms, and the candidate with the highest probability is placed at the same location.

[Figure: graphical representation of a sample regular expression]

The complete group of examples is extracted from the corpus for each complex letter structure. Each letter of the set is taken one by one together with its surrounding neighbors on both sides, and the system then compares it with the instances available in the corpus. The KNN classifier is used for this comparison. The value of each feature vector is calculated and stored in the built-in metric. All feature values are weighted and tagged with labels indicating matched or mismatched structures, and the instances are divided according to the assigned labels. The instance-based learning algorithm compares new problem examples with the instances already stored in memory.
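The regular-expression matching of an undiacritized word against diacritized corpus forms, described earlier in this section, can be sketched as below. This is a hedged illustration: the pattern allows any run of the Arabic combining marks U+064B to U+0652 after each letter, and `word_probs` is a hypothetical mapping from diacritized words to the probabilities stored in the lexicon.

```python
import re

# Arabic combining marks from fathatan (U+064B) to sukun (U+0652).
DIACRITIC_CLASS = "[\u064B-\u0652]"


def build_pattern(bare_word):
    """Regex that matches the bare word with optional marks after each letter."""
    parts = [re.escape(ch) + DIACRITIC_CLASS + "*" for ch in bare_word]
    return re.compile("".join(parts))


def best_diacritized(bare_word, word_probs):
    """Return the matching diacritized form with the highest probability.

    Falls back to the undiacritized input when no corpus form matches.
    """
    pattern = build_pattern(bare_word)
    candidates = {w: p for w, p in word_probs.items() if pattern.fullmatch(w)}
    if not candidates:
        return bare_word
    return max(candidates, key=candidates.get)
```

Using `fullmatch` rather than `search` ensures that a candidate is accepted only when its whole letter skeleton equals the typed word, so longer words that merely contain the same letters are not returned.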
The k-nearest neighbor (K-NN) algorithm is the simplest instance-based learning method: it categorizes objects according to the nearest training examples in the feature space. The core model is given below [17]:

$$\hat{f}(x_q) = \frac{\sum_{i=1}^{k} f(x_i)}{k} \tag{4}$$

Each input instance is compared individually with all of its closest neighbors using the K-NN classifier, and the system finally accepts the most frequent class among them. A multidimensional array in the system stores the training examples as feature vectors, and a label specifies the class of each example. The label that receives the highest number of votes among the neighbors is assigned to the instance. During classification, a test instance is fed to the system and the distance $\Delta(X, Y)$ is used to compute its similarity to all the other examples in memory. The overlap metric is used for this task; it gives the distance between instances represented by n features as the sum of the distances per feature [13] [14]:

$$\Delta(X, Y) = \sum_{i=1}^{n} \delta(x_i, y_i) \tag{5}$$

where $\delta(x_i, y_i)$ is the distance per feature. The metric counts the matching and mismatching feature values in the two patterns; a domain-knowledge bias can then be added by weighting the features. To weight the features, statistical information is calculated to identify the better predictors of the class tags. Information Gain (IG) examines each feature individually and measures how much information it contributes to knowledge of the correct class label.

Immediately after the above process, a hash table stores the data in an associative manner. The table stores the data in an array format in which each data value receives a unique index, so the data can be accessed quickly once the index of the required item is known. Hashing is a widely known technique for converting a range of key values into a range of array indexes.
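The classification step above can be sketched as follows, assuming the usual per-feature distance for symbolic features (0 when two values are equal, 1 otherwise), majority voting among the k nearest stored instances, and Information Gain computed as the reduction in class entropy. All names are illustrative, and the sketch does not reproduce the described system's storage layout.

```python
import math
from collections import Counter


def overlap_distance(x, y, weights=None):
    """Eq. (5): sum of per-feature distances, optionally weighted."""
    if len(x) != len(y):
        raise ValueError("feature vectors must have the same length")
    if weights is None:
        weights = [1.0] * len(x)
    return sum(w for xi, yi, w in zip(x, y, weights) if xi != yi)


def knn_classify(query, memory, k=3, weights=None):
    """Return the majority label among the k nearest stored instances."""
    ranked = sorted(memory,
                    key=lambda inst: overlap_distance(query, inst[0], weights))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(memory, feature_index):
    """Class entropy minus the expected entropy after splitting on a feature."""
    labels = [label for _, label in memory]
    by_value = {}
    for features, label in memory:
        by_value.setdefault(features[feature_index], []).append(label)
    remainder = sum(len(subset) / len(labels) * entropy(subset)
                    for subset in by_value.values())
    return entropy(labels) - remainder
```

One way to combine the pieces is to compute `information_gain` for every feature position once over the training memory and pass the resulting list as `weights`, so that mismatches on informative positions cost more than mismatches on uninformative ones.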
Tokenization of Sindhi script is also a challenging task owing to the complexities of the text, particularly those of homographic structures. A compound word needs to be treated as a single token, but the embedded space it requires creates ambiguity for the tokenization process. The embedded space is required because of the cursive nature of Sindhi script and its connecting and non-connecting letters. Tokenization therefore demands particular care. Mahar's [1] tokenization model is adopted in this research.

Sindhi script abounds in homographic words, so ambiguity is often observed when the text is undiacritized. For example, the simple root word قسم consists of letters that can be read in two ways: ق س م (an oath, noun) and ق س م (kind, noun). Without diacritics the two words are exactly identical, which creates ambiguity for NLP applications. The Viterbi algorithm is an efficient approach for finding the most likely path of transitions in such cases. It produces the most likely word on the basis of the highest probability value calculated using N-grams [16].

V. EXECUTION PROCESS OF APPLICATION

Text prediction is the basic idea that gave rise to instant diacritics restoration. It was proposed to save time and effort by suggesting likely continuations after the first letters of a word have been typed. With each further letter, the user receives suggestions in a popup and can accept one with a single click instead of typing the remaining letters. For example, suppose the user wants to type the word انسان. After the first letter is typed, a popup shows the most probable and frequently used words beginning with ا. When the next letter, ن, is typed, the popup again shows the most probable and frequently used continuations of these two beginning letters.
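The popup behaviour described above amounts to ranking the words that share the typed prefix. A minimal sketch, assuming a hypothetical dictionary `word_freqs` that maps each word to its corpus frequency, and ranking by frequency with alphabetical order as a tie-breaker:

```python
def suggest(prefix, word_freqs, limit=5):
    """Return up to `limit` words starting with `prefix`, most frequent first."""
    matches = [word for word in word_freqs if word.startswith(prefix)]
    # Sort by descending frequency; break ties alphabetically for stability.
    matches.sort(key=lambda word: (-word_freqs[word], word))
    return matches[:limit]
```

The function would be called again after every keystroke with the longer prefix, which narrows the candidate list in the way the example with ا and ن illustrates. A linear scan is used here for clarity; a trie or sorted index would be a natural choice for a large lexicon.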
If the user finds the intended word in the popup, a single click types the word, instead of five keystrokes for its five letters. This text prediction function gave rise to the idea of instant diacritics restoration.

The predictive approach of instant diacritization lets the user type words with their exact pronunciations, which in turn helps in reading them correctly. The editor works alongside the user and assigns the diacritics automatically and immediately; the user only has to type the words. For example, suppose the user wants to type the word ا ن س ان. When the first letter ا is typed, the editor initially assigns it the superscript diacritic sign, since the system does this for every first letter. When the user then types ن, the system immediately calculates the probability of the possible diacritics for this pair of letters and assigns one to ن; at the same time, the diacritic on ا may change. When the user types س, the system again calculates the probability of the possible diacritics for this combination of letters and assigns diacritics to all three letters according to the best match found in the corpus. As the user goes on to type ا and then ن, the system continues to work on the letters and their diacritics, calculating the probabilities of the letters and diacritic signs from the given corpus. Once the user has finished typing ا ن س ان, the system finalizes its diacritics by the same procedure. The same process takes place for each letter typed in the editor.

VI. IMPLEMENTATION AND RESULTS

The design of the training and testing sets is the foundation of the final results, so both receive close attention until the results are derived. Different measures such as Word Error Rate, Diacritic Error Rate, Precision, Recall and F-measure have been used previously.
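As an illustration of a letter-level precision measure, the sketch below computes the percentage of predicted diacritic labels that agree with the gold labels, and a pooled figure over several letters weighted by the number of tested examples. The weighting scheme is an assumption made for the example; the source does not state how its cumulative precision is aggregated.

```python
def precision(predicted, gold):
    """Percentage of predicted diacritic labels that match the gold labels."""
    if len(predicted) != len(gold):
        raise ValueError("label sequences must have the same length")
    if not predicted:
        return 0.0
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return 100.0 * correct / len(predicted)


def weighted_precision(rows):
    """Pool per-letter precisions, weighting each by its tested examples.

    `rows` is a list of (tested_examples, precision_percent) pairs.
    """
    total = sum(tested for tested, _ in rows)
    return sum(tested * pct for tested, pct in rows) / total
```

Because the system assigns a label to every tested letter, this precision coincides with the fraction of letters diacritized correctly.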
We have adopted Precision from among them because it is observed to perform better with the letter-level approach [1]. Moreover, the target features used for training are defined on the complex letters; hence, the task is performed at the most basic level, that of letters. The three main diacritics of Sindhi, i.e., Zabar, Zair and Pesho, are considered in the experiments. The Letter Level Learning method processes every letter taken from the corpus and creates a ten-letter vector. Each vector is put into an array, so that each letter is preprocessed with its calculated probability. After receiving the testing data set, the system compares all the undiacritized letters of the testing data set with the preprocessed data available in the arrays and then replaces each letter with its diacritized form. Of the instances taken from the developed corpus, 159,330 are tested experimentally. The testing examples are approximately 15% of the whole set of examples. Tables III, IV and V show the results attained with N = 1, 3 and 5. The tables list the ambiguous letters extracted from the developed corpus and the precision obtained by applying instance-based learning at the letter level.

TABLE III. AMBIGUOUS SET OF LETTERS, EXAMPLES AND ACHIEVED PRECISION WITH N=1

Ambiguous Set | Total Examples | Tested Examples | Precision Achieved
ا ا ا | 99,262 | 14,889 | 91.22%
ب ب ب | 15,881 | 2,383 | 93.51%
ٻ ٻ ٻ | 6,447 | 967 | 92.71%
ڀ ڀ ڀ | 14,752 | 2,212 | 90.84%
ت ت ت | 34,169 | 5,126 | 91.36%
ٿ ٿ ٿ | 11,223 | 1,684 | 90.33%
ٽ ٽ ٽ | 10,227 | 1,534 | 92.42%
ٺ ٺ ٺ | 4,673 | 701 | 90.01%
ث ث ث | 850 | 127 | 89.19%
پ پ پ | 12,273 | 1,841 | 92.62%
ج ج ج | 41,688 | 6,253 | 88.24%
ج هه ج هه ج هه | 5,486 | 823 | 83.61%
ڄ ڄ ڄ | 782 | 117 | 94.56%
ڃ ڃ ڃ | 238 | 36 | 94.62%
چ چ چ | 18,852 | 2,828 | 90.41%
ڇ ڇ ڇ | 10,293 | 1,544 | 92.55%
ح ح ح | 20,790 | 3,118 | 93.77%
خ خ خ | 8,039 | 1,206 | 95.71%
د د د | 30,477 | 4,572 | 97.09%
ڌ ڌ ڌ | 993 | 149 | 94.22%
ڊ ڊ ڊ | 274 | 41 | 95.11%
ڏ ڏ ڏ | 25,622 | 3,843 | 94.63%
ڍ ڍ ڍ | 691 | 104 | 96.12%
ذ ذ ذ | 532 | 80 | 90.81%
ر ر ر | 48,033 | 7,205 | 90.01%
ڙ ڙ ڙ | 1,943 | 291 | 93.32%
ز ز ز | 849 | 127 | 90.54%
س س س | 24,237 | 3,635 | 94.90%
ش ش ش | 994 | 149 | 94.32%
ص ص ص | 592 | 89 | 94.88%
ض ض ض | 231 | 35 | 95.62%
ط ط ط | 838 | 126 | 89.21%
ظ ظ ظ | 201 | 30 | 90.01%
ع ع ع | 11,421 | 1,713 | 93.79%
غ غ غ | 841 | 126 | 93.88%
ف ف ف | 12,840 | 1,926 | 94.56%
ڦ ڦ ڦ | 556 | 83 | 93.76%
ق ق ق | 605 | 91 | 94.55%
ڪ ڪ ڪ | 54,837 | 8,226 | 95.64%
ک ک ک | 28,444 | 4,267 | 95.99%
گ گ گ | 14,766 | 2,215 | 94.06%
گه گه گه | 2,495 | 374 | 81.58%
ڳ ڳ ڳ | 348 | 52 | 94.93%
ڱ ڱ ڱ | 173 | 26 | 92.27%
ل ل ل | 55,121 | 8,268 | 92.77%
م م م | 60,270 | 9,041 | 95.74%
ن ن ن | 101,126 | 15,169 | 90.31%
ڻ ڻ ڻ | 126 | 19 | 90.91%
و و و | 55,664 | 8,350 | 95.05%
هه هه هه | 84,033 | 12,605 | 88.64%
ء ء ء | 76 | 11 | 93.03%
ي ي ي | 126,023 | 18,904 | 90.88%

TABLE IV. AMBIGUOUS SET OF LETTERS, EXAMPLES AND ACHIEVED PRECISION WITH N=3

Ambiguous Set | Total Examples | Tested Examples | Precision Achieved
ا ا ا | 99,262 | 14,889 | 94.55%
ب ب ب | 15,881 | 2,383 | 96.86%
ٻ ٻ ٻ | 6,447 | 967 | 94.66%
ڀ ڀ ڀ | 14,752 | 2,212 | 95.14%
ت ت ت | 34,169 | 5,126 | 96.31%
ٿ ٿ ٿ | 11,223 | 1,684 | 92.23%
ٽ ٽ ٽ | 10,227 | 1,534 | 93.76%
ٺ ٺ ٺ | 4,673 | 701 | 95.85%
ث ث ث | 850 | 127 | 94.63%
پ پ پ | 12,273 | 1,841 | 92.62%
ج ج ج | 41,688 | 6,253 | 92.54%
ج هه ج هه ج هه | 5,486 | 823 | 87.41%
ڄ ڄ ڄ | 782 | 117 | 95.33%
ڃ ڃ ڃ | 238 | 36 | 97.02%
چ چ چ | 18,852 | 2,828 | 94.48%
ڇ ڇ ڇ | 10,293 | 1,544 | 95.88%
ح ح ح | 20,790 | 3,118 | 96.77%
خ خ خ | 8,039 | 1,206 | 96.07%
د د د | 30,477 | 4,572 | 98.21%
ڌ ڌ ڌ | 993 | 149 | 95.99%
ڊ ڊ ڊ | 274 | 41 | 96.79%
ڏ ڏ ڏ | 25,622 | 3,843 | 97.13%
ڍ ڍ ڍ | 691 | 104 | 96.88%
ذ ذ ذ | 532 | 80 | 93.22%
ر ر ر | 48,033 | 7,205 | 93.66%
ڙ ڙ ڙ | 1,943 | 291 | 96.22%
ز ز ز | 849 | 127 | 94.34%
س س س | 24,237 | 3,635 | 95.42%
ش ش ش | 994 | 149 | 97.32%
ص ص ص | 592 | 89 | 95.07%
ض ض ض | 231 | 35 | 97.65%
ط ط ط | 838 | 126 | 93.44%
ظ ظ ظ | 201 | 30 | 93.71%
ع ع ع | 11,421 | 1,713 | 95.17%
غ غ غ | 841 | 126 | 95.48%
ف ف ف | 12,840 | 1,926 | 95.06%
ڦ ڦ ڦ | 556 | 83 | 96.72%
ق ق ق | 605 | 91 | 95.15%
ڪ ڪ ڪ | 54,837 | 8,226 | 96.99%
ک ک ک | 28,444 | 4,267 | 97.01%
گ گ گ | 14,766 | 2,215 | 95.06%
گه گه گه | 2,495 | 374 | 87.25%
ڳ ڳ ڳ | 348 | 52 | 95.91%
ڱ ڱ ڱ | 173 | 26 | 94.87%
ل ل ل | 55,121 | 8,268 | 96.44%
م م م | 60,270 | 9,041 | 97.14%
ن ن ن | 101,126 | 15,169 | 96.53%
ڻ ڻ ڻ | 126 | 19 | 95.11%
و و و | 55,664 | 8,350 | 96.57%
هه هه هه | 84,033 | 12,605 | 91.84%
ء ء ء | 76 | 11 | 93.78%
ي ي ي | 126,023 | 18,904 | 96.77%

TABLE V. AMBIGUOUS SET OF LETTERS, EXAMPLES AND ACHIEVED PRECISION WITH N=5

Ambiguous Set | Total Examples | Tested Examples | Precision Achieved
ا ا ا | 99,262 | 14,889 | 98.26%
ب ب ب | 15,881 | 2,383 | 99.17%
ٻ ٻ ٻ | 6,447 | 967 | 99.09%
ڀ ڀ ڀ | 14,752 | 2,212 | 99.74%
ت ت ت | 34,169 | 5,126 | 99.22%
ٿ ٿ ٿ | 11,223 | 1,684 | 99.04%
ٽ ٽ ٽ | 10,227 | 1,534 | 98.51%
ٺ ٺ ٺ | 4,673 | 701 | 99.64%
ث ث ث | 850 | 127 | 99.61%
پ پ پ | 12,273 | 1,841 | 99.55%
ج ج ج | 41,688 | 6,253 | 98.14%
ج هه ج هه ج هه | 5,486 | 823 | 94.38%
ڄ ڄ ڄ | 782 | 117 | 99.23%
ڃ ڃ ڃ | 238 | 36 | 99.88%
چ چ چ | 18,852 | 2,828 | 99.66%
ڇ ڇ ڇ | 10,293 | 1,544 | 99.17%
ح ح ح | 20,790 | 3,118 | 99.47%
خ خ خ | 8,039 | 1,206 | 99.47%
د د د | 30,477 | 4,572 | 99.91%
ڌ ڌ ڌ | 993 | 149 | 99.87%
ڊ ڊ ڊ | 274 | 41 | 99.73%
ڏ ڏ ڏ | 25,622 | 3,843 | 99.44%
ڍ ڍ ڍ | 691 | 104 | 99.81%
ذ ذ ذ | 532 | 80 | 99.88%
ر ر ر | 48,033 | 7,205 | 99.22%
ڙ ڙ ڙ | 1,943 | 291 | 99.11%
ز ز ز | 849 | 127 | 99.14%
س س س | 24,237 | 3,635 | 98.66%
ش ش ش | 994 | 149 | 98.93%
ص ص ص | 592 | 89 | 99.28%
ض ض ض | 231 | 35 | 99.33%
ط ط ط | 838 | 126 | 99.17%
ظ ظ ظ | 201 | 30 | 99.32%
ع ع ع | 11,421 | 1,713 | 99.37%
غ غ غ | 841 | 126 | 99.57%
ف ف ف | 12,840 | 1,926 | 99.22%
ڦ ڦ ڦ | 556 | 83 | 99.13%
ق ق ق | 605 | 91 | 97.55%
ڪ ڪ ڪ | 54,837 | 8,226 | 99.18%
ک ک ک | 28,444 | 4,267 | 99.63%
گ گ گ | 14,766 | 2,215 | 99.26%
گه گه گه | 2,495 | 374 | 94.52%
ڳ ڳ ڳ | 348 | 52 | 99.01%
ڱ ڱ ڱ | 173 | 26 | 99.61%
ل ل ل | 55,121 | 8,268 | 99.14%
م م م | 60,270 | 9,041 | 99.93%
ن ن ن | 101,126 | 15,169 | 99.44%
ڻ ڻ ڻ | 126 | 19 | 98.66%
و و و | 55,664 | 8,350 | 99.51%
هه هه هه | 84,033 | 12,605 | 97.35%
ء ء ء | 76 | 11 | 98.17%
ي ي ي | 126,023 | 18,904 | 99.26%

Three window sizes were tested to find the best one. For window sizes of two, six and ten letters (i.e., N = 1, 3, 5), the calculated accuracy is 92.52% with N = 1, 95.12% with N = 3 and 99.03% with N = 5. The highest accuracy was thus observed with the ten nearest accompanying letters (N = 5), where N stands for the number of letters on each side of the letter being processed. The cumulative precisions calculated with the different window sizes are shown in Fig. 3.

Fig. 3. Calculated Cumulative Precision with Different Window Sizes

The figures in the tables differ considerably from one another; the results also show that the window size is decisive for the increase or decrease of precision. Therefore, N = 5 proves to be the most suitable and reliable window of those compared.

VII. CONCLUSION

Automatic instant diacritics restoration is an essential component for many NLP applications. The restoration is attempted through the combined use of two approaches, N-gram based and Letter-Level Learning based. Each method has its own characteristics and limitations. The proposed mechanism is tested on our developed corpus of the Sindhi language. The window N = 5 is found to be the best after testing different sizes. The precision achieved with this window is 99.03%.
The proposed method can also be applied to instant diacritics restoration for Arabic, Urdu and Persian after slight modifications.

REFERENCES

[1] J. A. Mahar, Statistical Approaches to Diacritics Restoration in Sindhi Text to Speech Synthesis System, PhD Thesis, Hamdard University, Karachi, Pakistan, 2012.
[2] S. A. Mahar, Comparative Analysis of Vowel Restoration for Arabic Script Based Languages Using N-Gram Models, MS Thesis, Shah Abdul Latif University, Khairpur, Pakistan, 2014.
[3] A. Al-Wabil, H. Al-Khalifa, W. Al-Saleh, Arabic Text-To-Speech Synthesis: A Preliminary Evaluation, In Proceedings of the 2007 World Conference on Educational Multimedia, Hypermedia and Telecommunications, Vancouver, Canada, Pp. 4423-4430, 2007.
[4] A. A. Shah, A. W. Ansari, L. Das, Bi-Lingual Text to Speech Synthesis System for Urdu and Sindhi, National Conference on Emerging Technology, Pp. 126-130, 2004.
[5] J. A. Mahar, G. Q. Memon, Automatic Diacritics Restoration for Sindhi, Sindh University Research Journal (Science Series), Vol. 43, No. 1, Pp. 43-50, June 2011.
[6] Y. Gal, An HMM Approach to Vowel Restoration in Arabic and Hebrew, ACL-02 Workshop on Computational Approaches to Semitic Languages, Association for Computational Linguistics, Philadelphia, Pennsylvania, Pp. 1-7, 2002.
[7] A. A. Harby, M. A. Shehawey, R. S. Barogy, A Statistical Approach for Quran Vowel Restoration, ICGST International Journal on Artificial Intelligence and Machine Learning, Vol. 8, No. 3, Pp. 9-16, 2008.
[8] H. Sultan, Automatic Arabic Diacritization using Neural Network, Scientific Bulletin of Faculty of Engineering Ain-Shams University: Electrical Engineering, Vol. 36, No. 4, Pp. 501-510, 2001.
[9] I. Zitouni, R. Sarikaya, Arabic Diacritic Restoration Based on Maximum Entropy Models, Computer Speech and Language, Vol. 23, Pp. 257-276, 2008.
[10] R. Mihalcea, V. Nastase, Letter Level Learning for Language Independent Diacritics Restoration, Proceedings of 6th Workshop on Computational Language Learning, Vol. 20, Pp. 1-7, 2002.
[11] S. Kubler, E. Mohamed, Memory-based vocalization of Arabic, In Proceedings of the LREC Workshop on HLT and NLP within the Arabic World, Pp. 58-62, Morocco, 2008.
[12] R. Nelken, S. M. Shieber, Arabic Diacritization using Weighted Finite-State Transducers, ACL Workshop on Computational Approaches to Semitic Languages, Association for Computational Linguistics, Pp. 79-86, Michigan, 2005.
[13] R. F. Mihalcea, Diacritic Restoration: Learning from Letters Versus Learning from Words, Lecture Notes in Computer Science, Vol. 2276, Pp. 96-113, 2002.
[14] J. A. Mahar, G. Q. Memon, H. Shaikh, Sindhi Diacritics Restoration By Letter Level Learning Approach, Sindh University Research Journal (Science Series), Vol. 43, No. 2, Pp. 119-126, December 2011.
[15] K. Aadvani, Shah Jo Risalo, 2nd Edition, Sindhica Academy, Karachi, Pakistan, 2009.
[16] D. Jurafsky, J. H. Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition, Prentice-Hall, Pp. 300-307, 2000.
[17] Y. Hifny, Restoration of Arabic Diacritics Using Dynamic Programming, COLING, 2012.
[18] C. Lee, G. G. Lee, Information Gain and Divergence-Based Feature Selection for Machine Learning-Based Text Categorization, An International Journal of Information Processing and Management, Special Issue: Formal Methods for Information Retrieval, Vol. 42, Issue 1, Pp. 155-165, January 2006.