Choosing the Optimal Segmentation Level for POS Tagging of the Quranic Arabic

British Journal of Applied Science & Technology 19(1): 1-10, 2017; Article no.BJAST.29754
ISSN: 2231-0843, NLM ID: 101664541
SCIENCEDOMAIN international, www.sciencedomain.org

Fadl Mutaher Ba-Alwi 1*, Mohammed Albared 1 and Tareq Al-Moslmi 2

1 Faculty of Computer and Information Technology, Sana'a University, P.O.Box 247, Yemen.
2 Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, Malaysia.

Authors' contributions

This work was carried out in collaboration between all authors. Author FMBA designed the study, performed the statistical analysis, wrote the protocol, wrote the first draft of the manuscript and managed the literature searches. Authors MA and TAM managed the analyses of the study and literature searches. All authors read and approved the final manuscript.

Article Information

DOI: 10.9734/BJAST/2017/29754
Editor(s): (1) Kleopatra Nikolopoulou, School of Education, University of Athens, Athens, Greece.
Reviewers: (1) Azman Bin Che Mat, UiTM Terengganu, Malaysia. (2) Sazelin Arif, Universiti Teknikal Malaysia Melaka (UTeM), Malaysia.
Complete Peer review History: http://www.sciencedomain.org/review-history/787

Original Research Article

Received 27th September 2016
Accepted 11th November 2016
Published 13th February 2017

ABSTRACT

As a morphologically rich language, Arabic poses special challenges to Part-of-Speech (POS) tagging. Words in Arabic texts often contain several segments, each with its own POS category. The choice of the segmentation level, or input unit (word-based or morpheme-based), is a major issue in designing any Arabic natural language processing system. In word-based approaches, words are used as the atomic units of the language. In this case, composite POS tags are assigned to words. Therefore, large amounts of training data are required in order to ensure statistical significance, and these approaches suffer from the problems of data sparseness and unknown words. In morpheme-based approaches, the morpheme components of words are used as the atomic units.
This, however, results in a high ambiguity rate and also a small context for resolving such ambiguity, because the span of the n-gram might be limited to a single word. This paper compares and contrasts the morpheme-based and word-based statistical POS tagging strategies. It evaluates the tagging performance of three statistical models, namely the Arabic HMM POS tagger with the prefix guessing model, the Arabic HMM POS tagger with the linear interpolation guessing model and the TnT tagger, given training data from both the morpheme-based and the word-based tokenization levels. It also studies the influence of each choice on the tagging performance of the Arabic POS tagging models, in terms of tagging accuracy and time complexity. In addition, this paper evaluates the tagging performance of several stochastic models, given training data from both segmentation levels. Results show that the morpheme-based POS tagging strategy is more adequate for training statistical POS tagging models, as it provides better overall tagging accuracy and much faster training and tagging times.

Keywords: Arabic natural language processing; POS tagging; segmentation levels.

*Corresponding author: E-mail: dr.fadlbaalwi@gmail.com;

1. INTRODUCTION

Part of Speech (POS) disambiguation is the ability to computationally determine which POS of a word is activated by its use in a particular context [1]. Automatic text tagging is an important pre-processing step in many NLP applications. Arabic is a morphologically rich language that poses challenges to Natural Language Processing (NLP) systems because of the many forms a word can take, which leads to data sparseness (insufficiency of data).

Most current research in NLP is based on supervised machine learning techniques, in which a classifier learns from training sets that contain a fair amount of words and their associated annotation. These classifiers need a huge amount of training data to reach reasonable accuracy, even for morphologically simpler languages such as English. The problem is more severe in morphologically rich languages, because the classifier is faced with many forms of the same word that do not repeat often enough for the tagger to learn the pattern (the data sparseness problem). These languages have a high vocabulary growth rate, which results in a large number of unknown words [2].

In Arabic, as in other Semitic languages, a word (a single orthographic, space-delimited string) often consists of a concatenation of up to four sub-tokens [3], which function as free morpho-syntactic units, each with its own POS category. In fact, an Arabic word consists of proclitics, a stem with affixes (prefixes and suffixes) and enclitics. The clitics (proclitics and enclitics) have their own POS tags.
Following previous work, the term morpheme-level tagging refers to treating morphemes as the word segments that are assigned POS tags from a given tag set. Accordingly, the Arabic word فبوعودكم (and + by/with + promises + your) consists of four morphemes (sub-tokens), ف-ب-وعود-كم. The POS of this word is a composite POS tag (Conj+Prep+Noun+Poss.Pron).

Consequently, when designing POS taggers or any NLP application for Arabic and other Semitic languages, a major architectural decision concerns whether to analyze a word as a sequence of morphological units (morpheme-based) or to treat space-delimited words as the primitive units of analysis (word-based) [2,4]. From a theoretical point of view, both methods have advantages and disadvantages. The morpheme-based approach increases the level of ambiguity, but it increases coverage and decreases the number of unknown words. The word-based approach, on the other hand, suffers from data sparseness, a large number of unknown words and a large tag set of composite tags, but it introduces less ambiguity.

In addition, the word formation process for Arabic words is quite complex. While the main word formation process in English is concatenative, the main word formation process in Arabic is non-concatenative [2,5]. As in other Semitic languages, a word in Arabic can be described as a combination of two morphemes: a root and a pattern. A root is a set of consonants (also called radicals) which has a basic lexical meaning. A pattern consists of a set of vowels which are inserted among the consonants of a root to form a stem. In addition to this non-concatenative morphological feature, Arabic uses different affixes to create inflectional and derivational word forms. Thus, directly adopting NLP methods developed for Western languages is not an appropriate choice for Arabic, owing to the specific features of the language [6].
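To make the two input units concrete, the following minimal Python sketch converts between a word-level composite tag and morpheme-level tags, using the four-morpheme example word from the text. The function names `split_composite` and `join_morphemes` are hypothetical helpers introduced only for illustration, and the sketch assumes the word has already been segmented; it does not perform segmentation itself.

```python
from typing import List, Tuple


def split_composite(segments: List[str], composite_tag: str) -> List[Tuple[str, str]]:
    """Pair each morpheme with its simple tag, given a '+'-separated composite tag."""
    tags = composite_tag.split("+")
    if len(tags) != len(segments):
        raise ValueError("number of segments and number of tags differ")
    return list(zip(segments, tags))


def join_morphemes(pairs: List[Tuple[str, str]]) -> Tuple[str, str]:
    """Rebuild the word-level token and its composite tag from tagged morphemes."""
    word = "".join(segment for segment, _ in pairs)
    composite_tag = "+".join(tag for _, tag in pairs)
    return word, composite_tag


# The example word from the text: and + by/with + promises + your.
segments = ["ف", "ب", "وعود", "كم"]
pairs = split_composite(segments, "Conj+Prep+Noun+Poss.Pron")
word, tag = join_morphemes(pairs)
print(len(pairs), tag)
```

At the word level the tagger must predict one label out of a large set of composite tags, whereas at the morpheme level it predicts a longer sequence of labels drawn from a small set of simple tags.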
The purpose of this paper is, therefore, to explore the influence of the different segmentation levels on the tagging performance of Arabic POS tagging models, in terms of accuracy and time complexity, in order to determine the best segmentation level for POS tagging when only a small amount of training data is available and a large number of unknown words exist in the test data. In addition, this paper evaluates the tagging performance of three fully supervised statistical models, namely the Arabic HMM POS tagger with the prefix guessing model, the Arabic HMM POS tagger with the linear interpolation guessing model and the TnT tagger (Arabic version), given training data from both tokenization levels.

The rest of the paper is organized as follows. Section 2 discusses related work, describes the HMM tagging approaches together with the modifications for better handling of unknown words in Arabic text, and describes the corpora used. Section 3 gives experimental results and discusses them. Finally, conclusions and future work appear in Section 4.

2. MATERIALS AND METHODS

2.1 Related Work

Research on POS tagging has a long history, and numerous approaches have been successfully applied to it. The POS tagging techniques in the literature can be classified as follows:

- Rule-based POS tagging: this approach is based on a lexicon and a set of disambiguation rules [7,8].
- Supervised POS tagging: these approaches use machine-learning techniques to learn a classifier from labeled training sets, such as the maximum entropy model [9], the Hidden Markov model [10], conditional random fields [11], cyclic dependency networks [12] and support vector machines [13].
- Unsupervised POS tagging: these approaches do not require pre-tagged training data, but rely on dictionary information.

POS tagging for Arabic has been an active topic of research in recent years. AlGahtani et al. [14], Yousif and Sembok [15], Al-Taani and Abu Al-Rub [16], Zribi et al. [17] and Alqrainy [18] are some examples of this line of work on Arabic.

The problem addressed in this work, selecting the best segmentation level (morphemes or words as input units) in Semitic language NLP, has been studied before [2,4,19,20]. Bar-Haim et al. [4,19] study the choice of the optimal architecture for POS tagging of Hebrew and other Semitic languages. They show that a model whose terminal symbols are word segments (morphemes) is advantageous over a word-level model for the task of POS tagging.
Tachbelie [2] explored different ways of language modelling for Amharic, a morphologically rich Semitic language, using morphemes as units. The study showed that using morphemes in modelling morphologically rich languages is advantageous, especially in reducing the OOV rate. In contrast with these results, Mohamed and Kübler [21] and Kübler and Mohamed [20] reach a different conclusion. They state that the word-based POS tagging approach is more appropriate than the morpheme-based approach for Modern Standard Arabic POS tagging. Unlike Mohamed and Kübler [21], this work evaluates the influence of the segmentation level on the performance of tagging models given data from Quranic Arabic (Classical Arabic).

Ali and Jarray [22] used a genetic algorithm to develop an Arabic part-of-speech tagger with a reduced tagset. Hadni et al. [23] propose a Hidden Markov Model (HMM) integrated with an Arabic rule-based method. Their POS tagger generates a set of three POS tags: Noun, Verb and Particle. Albared et al. [24] present an approach based on the combination of several N-attributes probabilistic classifiers. First, the POS disambiguation problem is decoupled into several N-attributes tagging sub-problems. Then, several classifiers are used to solve each sub-problem. Finally, the outcomes of all N-attributes classifiers are combined. Several problem decomposition methods and classifier combination algorithms are investigated. Kadim and Lazrek [25] present bidirectional HMM-based Arabic POS tagging, in which they combine direct and reverse taggers to tag the same sequence of words in both directions.

This work evaluates the influence of the segmentation level on tagging performance not only in terms of tagging accuracy but also in terms of tagging time complexity. Moreover, it evaluates the tagging performance of several fully supervised statistical tagging models developed especially for Arabic text.
2.2 Methodology

The probabilistic tagging models used in this work are based on the trigram Hidden Markov Model (HMM). The HMM tagger assigns a probability value to each pair $\langle w, t \rangle$, where $w = w_1, \ldots, w_n$ is the input sentence and $t = t_1, \ldots, t_n$ is the POS tag sequence. In an HMM, the POS tagging problem can be defined as finding the best tag sequence $t$ given the word sequence $w$. The label sequence generated by the model is the one that has the highest probability among all possible label sequences for the input word sequence. This can be formally expressed as:

$$t^{*} = \arg\max_{t} \prod_{i=1}^{n} p(t_i \mid t_{i-1}, \ldots, t_1)\, p(w_i \mid t_i)$$

The first parameter, $p(t_i \mid t_{i-1}, \ldots, t_1)$, is known as the transition probability; a trigram model conditions it on the two preceding tags only. The second parameter, $p(w_i \mid t_i)$, is known as the emission probability. These two model parameters are estimated from an annotated corpus by Maximum Likelihood Estimation (MLE), which is derived from relative frequencies. Given these two probabilities, the most likely tag sequence for a given word sequence can be found using the Viterbi algorithm.

However, MLE is a bad estimator for statistical inference because data tends to be sparse. To handle the sparseness problem in this work, we use linear interpolation of unigram, bigram and trigram maximum likelihood estimates in order to estimate the trigram transition probability:

$$p(t_3 \mid t_2, t_1) = \lambda_1\, \hat{p}(t_3) + \lambda_2\, \hat{p}(t_3 \mid t_2) + \lambda_3\, \hat{p}(t_3 \mid t_2, t_1)$$

where $\hat{p}$ denotes a maximum likelihood estimate and $\lambda_1 + \lambda_2 + \lambda_3 = 1$, so $p$ represents a valid probability distribution. The $\lambda$s are estimated by deleted interpolation.

To create an HMM POS tagger that can accurately tag unknown words, it is necessary to determine an estimate of the probability $p(w_i \mid t_j)$ for use in the tagger. If a word does not occur in the training data, then the lexical probability $p(w_i \mid t_j)$ for that word is 0 for all $t_j$. This requires adding an algorithm to the HMM to approximate the probability that the current tag will emit the given unknown word [10]. To handle unknown words, we have used the suffix probability algorithm [26], the prefix probability algorithm and the linear interpolation guessing algorithm [27].

2.3 Dataset

The data used in this work is the Quranic Arabic Corpus [28].
The Quranic Arabic Corpus is an annotated linguistic resource which shows the Arabic grammar, syntax and morphology for each word in the Holy Quran, the religious book of Islam, which is written in classical Quranic Arabic (c. 600 CE). The research project is organized at the University of Leeds, and is part of the Arabic language computing research group within the School of Computing. The Quranic Arabic Corpus consists of 77,430 words of Quranic Arabic. For the purpose of this work, we have used two versions of the Quranic Arabic Corpus:

- The word-based version: an example from this version is shown in Table 1. The composite tag consists of multiple tags separated by +, one tag for each word segment. The composite tag set consists of 375 tags.
- The morpheme-based version: an example from this version is shown in Table 1. The tag set of this version consists of 45 simple tags.

A brief statistical summary (the total number of words, the total number of unique words and the tag set size) of the two versions is shown in Table 2.

Table 1. Examples from the word-based version and the morpheme-based version of the Quranic corpus

| Word-based version: Word | POS | Morpheme-based version: Word | POS |
|---|---|---|---|
| <V> | | <V> | |
| الذين | REL | الذين | REL |
| يؤمنون | V+PRON | يؤمن | V |
| | | ون | PRON |
| بالغيب | P+DET+N | ب | P |
| | | ال | DET |
| | | غيب | N |

Table 2. Statistical summary of the two versions

| | Number of words | Unique words | Size of tag set |
|---|---|---|---|
| Word-based version | 7740 | 4835 | 375 |
| Morpheme-based version | 2829 | 725 | 45 |
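The interpolated transition estimate of Section 2.2 can be sketched in a few lines of Python. The paper states only that the weights are estimated by deleted interpolation, so the weight-fitting routine below follows one common formulation of that idea (the one used by the TnT tagger) and is an assumption rather than the authors' exact procedure. All function names are hypothetical, and the three tag sequences are toy data, not taken from the corpus.

```python
from collections import Counter


def count_ngrams(tag_sequences):
    """Collect unigram, bigram and trigram tag counts from tagged sentences."""
    uni, bi, tri = Counter(), Counter(), Counter()
    for tags in tag_sequences:
        for i, tag in enumerate(tags):
            uni[tag] += 1
            if i >= 1:
                bi[(tags[i - 1], tag)] += 1
            if i >= 2:
                tri[(tags[i - 2], tags[i - 1], tag)] += 1
    return uni, bi, tri


def _ratio(numerator, denominator):
    return numerator / denominator if denominator > 0 else 0.0


def deleted_interpolation(uni, bi, tri):
    """Fit (lambda1, lambda2, lambda3) by leaving each trigram out in turn."""
    n = sum(uni.values())
    weights = [0.0, 0.0, 0.0]
    for (t1, t2, t3), count in tri.items():
        candidates = [
            _ratio(uni[t3] - 1, n - 1),              # unigram estimate
            _ratio(bi[(t2, t3)] - 1, uni[t2] - 1),   # bigram estimate
            _ratio(count - 1, bi[(t1, t2)] - 1),     # trigram estimate
        ]
        # Credit the trigram's count to the order that predicts it best
        # (ties go to the lower order in this sketch).
        best = max(range(3), key=lambda k: candidates[k])
        weights[best] += count
    total = sum(weights)
    return [w / total for w in weights] if total else [1 / 3, 1 / 3, 1 / 3]


def transition_prob(t1, t2, t3, uni, bi, tri, lambdas):
    """Interpolated p(t3 | t2, t1) from unigram, bigram and trigram MLEs."""
    n = sum(uni.values())
    p_uni = _ratio(uni[t3], n)
    p_bi = _ratio(bi[(t2, t3)], uni[t2])
    p_tri = _ratio(tri[(t1, t2, t3)], bi[(t1, t2)])
    return lambdas[0] * p_uni + lambdas[1] * p_bi + lambdas[2] * p_tri


# Toy tag sequences (hypothetical, for illustration only).
toy = [["DET", "N", "V"], ["DET", "N", "V", "PRON"], ["P", "DET", "N"]]
uni, bi, tri = count_ngrams(toy)
lambdas = deleted_interpolation(uni, bi, tri)
print(lambdas, transition_prob("DET", "N", "V", uni, bi, tri, lambdas))
```

Because the weights sum to one, an unseen trigram still receives probability mass from the bigram and unigram terms, which is the property that matters when the training set is small.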

3. RESULTS AND DISCUSSION

In this section, we report an empirical comparison between the two segmentation levels presented in the previous sections, and study the influence of the two segmentation levels on the tagging performance of Arabic POS tagging models when only a small amount of training data is available.

3.1 Experimental Setting

Each of the two corpus versions is split into two sets, a training set and a testing set. We divided the word-based version randomly into 90.25% (69980 words, 5700 sentences) for training and 9.75% (7550 words, 536 sentences) for testing. The test data are chosen independently from the training data. The morpheme-based version is then divided using the same setting, see Table 3. As the table shows, the number of words is larger in the morpheme-based version than in the word-based version, even though the training and testing sets contain the same sentences in both versions.

Furthermore, in order to study the effect of the size of the training data, we randomly partitioned the training data from the two versions to construct seven training sets. Table 4 shows the sizes of the training sets and the percentages of unknown words with respect to the test set. The same test set is used in all experiments. Although each training set from the morpheme-based version contains the same data as its equivalent in the word-based version, the number of words and the percentage of unknown words differ. It is interesting to note that the training sets from the morpheme-based version contain more words and yield lower percentages of unknown words than their word-based counterparts, which contain the same sentences.

3.2 Results and Discussion

First of all, several experiments are conducted using the TnT model. Table 5 presents the results (known word accuracy, unknown word accuracy and overall accuracy) obtained for each training set from the two versions: the word-based version and the morpheme-based version.
We can note that the unknown word accuracy of the TnT tagger over training sets from the word-based version is very low and shows no sensitivity to the increase in data size. Nevertheless, an overall accuracy of 88.1% (96.2% on known words and 37.7% on unknown words) is obtained when the whole training data are used (training set 7).

Table 3. Statistical summary of the training and testing data from the two versions of the Quranic corpus

| | Word-based version: Training | Word-based version: Testing | Morpheme-based version: Training | Morpheme-based version: Testing |
|---|---|---|---|---|
| Percentage | 90.25% | 9.75% | 89.1% | 10.9% |
| # of sentences (verses) | 5700 | 536 | 5700 | 536 |
| # of words | 96850 | 7750 | 5690 | 2529 |
| # of unique words | 3920 | 2855 | 6924 | 820 |

Table 4. The sizes of the training sets from the two versions of the Arabic Quranic corpus, and the percentage of unknown words in each set with respect to the test set

| Training set | Word-based version: Training size | Word-based version: % of unknown words | Morpheme-based version: Training size | Morpheme-based version: % of unknown words |
|---|---|---|---|---|
| 1 | 10000 | 33.5% | 6673 | 2.08% |
| 2 | 19997 | 26.27% | 338 | 8.28% |
| 3 | 29990 | 2.82% | 4988 | 6.44% |
| 4 | 40002 | 8.88% | 66427 | 5.07% |
| 5 | 49997 | 6.79% | 82958 | 4.2% |
| 6 | 60000 | 4.4% | 995 | 3.62% |
| 7 | 6985 | 3.55% | 5690 | 3.28% |
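The unknown-word percentages reported in Table 4 measure how many test tokens never occur in a given training set. A minimal sketch of that computation is shown below; the function name `unknown_word_percentage` and the transliterated toy tokens are hypothetical, chosen only to show why segmenting the same text into morphemes tends to lower the rate.

```python
def unknown_word_percentage(train_tokens, test_tokens):
    """Percentage of test tokens that do not occur in the training data."""
    vocabulary = set(train_tokens)
    if not test_tokens:
        return 0.0
    unseen = sum(1 for token in test_tokens if token not in vocabulary)
    return 100.0 * unseen / len(test_tokens)


# The same toy text at the two segmentation levels (hypothetical tokens).
word_train = ["bi+kitab", "wa+qalam"]
word_test = ["wa+kitab"]
morph_train = ["bi", "kitab", "wa", "qalam"]
morph_test = ["wa", "kitab"]

print(unknown_word_percentage(word_train, word_test))
print(unknown_word_percentage(morph_train, morph_test))
```

In this toy case the test word is a new combination of morphemes that were each seen in training, so it is unknown at the word level but fully covered at the morpheme level.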

Using training sets from the morpheme-based version, the unknown word tagging results of the TnT tagger are much better than its results over those from the word-based version. An overall accuracy of 93.8% (95.6% on known words and 73.4% on unknown words) is obtained when the whole training data are used. In general, given TnT as the tagging model, morpheme-based POS tagging yields much better results than full word-based tagging (93.8% vs. 88.4%).

Secondly, several experiments are conducted using the Arabic HMM POS tagger with the prefix guessing model. Table 6 presents the results obtained for each training set from the two versions. Tables 5 and 6 show that the Arabic HMM POS tagger with the prefix guessing model always performs significantly better than the TnT tagger with the suffix guessing model, regardless of the segmentation level and the training set size.

The overall tagging results in Tables 5 and 6 also show that morpheme-based POS tagging always yields much better results than word-based tagging, regardless of the tagging model and the size of the training set. It is very interesting to note that word-based POS tagging produces slightly better known word accuracy than morpheme-based POS tagging. This is because the morpheme-based approach increases the level of ambiguity. On the other hand, morpheme-based POS tagging produces much better unknown word accuracy than word-based POS tagging. In fact, these results show that treating segmentation as a separate pre-processing step (using segmented text) is better for handling unknown words, and for POS tagging in general, especially when the training data is small.

In addition, we compare the computational time cost (training and testing) of two POS tagging models (the TnT tagger and the Arabic HMM POS tagger with the prefix guessing model) when they are trained using different sized training sets from the two versions: the word-based version and the morpheme-based version.
First, we have found that the TnT POS tagger and the Arabic HMM POS tagger with the prefix guessing model have approximately the same computational time (training and testing) when they are trained and tested using the same training and test data. This means that both taggers are equally efficient with respect to execution time. For this reason, we only study here the computational time cost of the Arabic HMM POS tagger with the prefix guessing model when it is trained using different sized training sets (and therefore different percentages of unknown words) from the two segmentation levels (and therefore different tag set sizes): the word-based version and the morpheme-based version. Figs. 1 and 2 show the curves of the average training and testing time taken by the Arabic HMM POS tagger with the prefix guessing model when it is trained using different sized training sets from the two tokenization levels.

Table 5. Tagging accuracies of the TnT tagger with varying sizes of training data from the two versions of the Quranic corpus

| Training set | Word-based: Unknown | Word-based: Known | Word-based: Overall | Morpheme-based: Unknown | Morpheme-based: Known | Morpheme-based: Overall |
|---|---|---|---|---|---|---|
| 1 | 37.5 | 91.9 | 73.2 | 68.1 | 92.3 | 86.7 |
| 2 | 39.2 | 94.2 | 79.4 | 72.4 | 94.2 | 90.5 |
| 3 | 38.5 | 94.8 | 82.2 | 72.1 | 94.5 | 91.4 |
| 4 | 36.6 | 94.9 | 83.7 | 71.7 | 94.9 | 92.3 |
| 5 | 37.3 | 95.5 | 85.5 | 71.9 | 95.3 | 92.9 |
| 6 | 38.2 | 96.0 | 87.6 | 72.0 | 95.5 | 93.5 |
| 7 | 37.7 | 96.2 | 88.1 | 73.9 | 95.6 | 93.8 |

Table 6. Tagging accuracies of the Arabic HMM tagger with the prefix guessing model with varying sizes of training data from the two versions of the Quranic corpus

| Training set | Word-based: Unknown | Word-based: Known | Word-based: Overall | Morpheme-based: Unknown | Morpheme-based: Known | Morpheme-based: Overall |
|---|---|---|---|---|---|---|
| 1 | 69.8 | 92.2 | 84.5 | 77.5 | 92.7 | 89.2 |
| 2 | 70.5 | 94.5 | 88.1 | 78.5 | 94.3 | 91.6 |
| 3 | 71.6 | 95.1 | 89.9 | 81.4 | 94.7 | 92.8 |
| 4 | 71.9 | 95.2 | 90.7 | 83.1 | 95.0 | 93.7 |
| 5 | 72.5 | 95.7 | 91.7 | 85.7 | 95.4 | 94.4 |
| 6 | 74.7 | 96.1 | 93.0 | 85.6 | 95.6 | 94.8 |
| 7 | 75.0 | 96.2 | 93.2 | 87.0 | 95.6 | 95 |

Fig. 1. The training time taken by the Arabic HMM POS tagger trained using different sized training data sets from both tokenization levels (y-axis: Training Time (M); x-axis: training sets 1 to 7; series: Word-Based, Morpheme-Based)

Fig. 2. The testing time taken by the Arabic HMM POS tagger trained using different sized training data sets from both tokenization levels (y-axis: Testing Time (M); x-axis: training sets 1 to 7; series: Word-Based, Morpheme-Based)

Table 7. The tagging performance (time and accuracy) of the Arabic HMM POS tagger with the linear interpolation guessing model for each of the two tokenization levels

| Corpus | % of unknown | Best λ | Training time (minutes) | Testing time (minutes) | Unknown accuracy | Known accuracy | Overall accuracy |
|---|---|---|---|---|---|---|---|
| Word-based | 3.5 | 0.9 | 0.40 | 5. | 75.6 | 96.2 | 93.4 |
| Morpheme-based | 8.02 | 0.7 | 0.03 | 0.58 | 87.4 | 95.6 | 95.0 |
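The three accuracy columns reported in these tables are linked: if accuracy is counted per token, the overall figure is the average of the known and unknown accuracies weighted by the share of unknown tokens in the test set. The sketch below shows both the breakdown and that identity. It assumes per-token scoring, which the paper does not state explicitly, and the function names are hypothetical.

```python
def accuracy_breakdown(tokens, gold_tags, predicted_tags, train_vocab):
    """Return known, unknown and overall tagging accuracy in percent."""
    counts = {"known": [0, 0], "unknown": [0, 0]}  # [correct, total]
    for token, gold, predicted in zip(tokens, gold_tags, predicted_tags):
        bucket = "known" if token in train_vocab else "unknown"
        counts[bucket][0] += int(gold == predicted)
        counts[bucket][1] += 1

    def pct(correct, total):
        return 100.0 * correct / total if total else 0.0

    correct = counts["known"][0] + counts["unknown"][0]
    total = counts["known"][1] + counts["unknown"][1]
    return {
        "known": pct(*counts["known"]),
        "unknown": pct(*counts["unknown"]),
        "overall": pct(correct, total),
    }


def overall_from_parts(known_acc, unknown_acc, unknown_pct):
    """Overall accuracy as a weighted average of its two components."""
    share = unknown_pct / 100.0
    return (1.0 - share) * known_acc + share * unknown_acc


# Morpheme-based row of Table 7: 8.02% unknown, 87.4 unknown, 95.6 known.
print(round(overall_from_parts(95.6, 87.4, 8.02), 1))
```

For the morpheme-based row of Table 7 the weighted average comes to roughly 94.9, close to the reported overall accuracy of 95.0; a small gap of this size is what rounding of the published inputs would produce.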

From Figs. 1 and 2, we can draw several important observations. First, the training time is much lower than the testing time, regardless of the training set and the corpus version used. Second, the training time for a training set from the morpheme-based version is lower than the training time for its counterpart from the word-based version. Third, the training time increased as the training data increased (see Fig. 1), and the testing time decreased as the training data increased (see Fig. 2). The explanation is that as the training data increased, the number of unknown words in the test data decreased substantially (see Table 4), so less exceptional processing and less tagging time were needed. In fact, there is a strong positive correlation of 0.99 between the testing time and the percentage of unknown words in the test sets regardless of the tokenization level used, which indicates that tagging time and the percentage of unknown words move in the same direction. Fourth, and most importantly, the testing time of word-based POS tagging (1 to 2 hours) is much larger than the testing time of morpheme-based POS tagging (a few seconds). From Figs. 1 and 2, we can readily observe that morpheme-based POS tagging is the better choice, as its tagging time is much smaller than the tagging time of word-based POS tagging.

Finally, several experiments are conducted using our HMM tagger with the linear interpolation guessing model, trained using the whole training data (training set 7) from the two corpus versions. The λ value is varied from 0.0 to 1, incremented by 0.1 each time. Table 7 summarizes the tagging results, the computational time needed and the best λ at which the model gives the best result, for each of the two segmentation approaches. The results again show that morpheme-based POS tagging yields better results than word-based tagging.
In addition, as with the previous models (TnT and the Arabic trigram HMM tagger with the prefix guessing model), the tagging time of word-based POS tagging (5 minutes) is much larger than the tagging time of morpheme-based POS tagging (a few seconds). Moreover, the linear interpolation guessing model performs better than the two previous models (TnT and the Arabic HMM POS tagger with the prefix guessing model) for both tokenization levels.

4. CONCLUSION

Designing a POS tagger for Arabic with small training data is a challenging task due to the specific features of the Arabic language and the high degree of ambiguity in Arabic. In this paper, we compare and contrast morpheme-based and word-based POS tagging strategies and study the influence of each on the tagging performance of Arabic POS tagging models, in terms of tagging accuracy and time complexity. In addition, we evaluate and compare several stochastic tagging models. We conducted a series of experiments using two versions of the Quranic Arabic corpus: a morpheme-based version and a word-based version. Results show that tagging models perform significantly better when their terminal symbols are word segments (morpheme-based) than when their terminal symbols are words (word-based). In addition, the results show that the Arabic trigram HMM POS tagger with the linear interpolation guessing algorithm substantially improves the tagging results over the TnT tagger regardless of the tokenization level used. Our future direction is to study the influence of the segmentation level on other Arabic NLP processes. Moreover, we plan to design a joint segmentation and POS tagging model which performs both tasks simultaneously.

COMPETING INTERESTS

Authors have declared that no competing interests exist.

REFERENCES

1. Albared M, Omar N, Ab Aziz MJ. Developing a competitive HMM Arabic POS tagger using small training corpora. In Proceedings of the Third International Conference on Intelligent Information and Database Systems - Volume Part I, Daegu, Korea. 2011;288-296.
2. Tachbelie MY. Morphology-based language modeling for Amharic. Ph.D., Department of Informatics, University of Hamburg; 2010.
3. Attia MA. Arabic tokenization system. Presented at the Proceedings of the 2007 Workshop on Computational Approaches to Semitic Languages: Common Issues and Resources, Prague, Czech Republic; 2007.
4. Bar-Haim R, Sima'an K, Winter Y. Part-of-speech tagging of modern Hebrew text. Natural Language Engineering. 2008;14:223-251.
5. Beesley KR, Karttunen L. Finite-state non-concatenative morphotactics. Presented at the Proceedings of the 38th Annual Meeting on Association for Computational Linguistics, Hong Kong; 2000.
6. Farghaly A, Shaalan K. Arabic natural language processing: Challenges and solutions. ACM Transactions on Asian Language Information Processing (TALIP). 2009;8:1-22.
7. Loftsson H. Tagging Icelandic text: A linguistic rule-based approach. Nordic Journal of Linguistics. 2008;31:47-72.
8. Brill E. A simple rule-based part of speech tagger. Presented at the Proceedings of the Third Conference on Applied Natural Language Processing, Trento, Italy; 1992.
9. Ratnaparkhi A. Maximum entropy models for natural language ambiguity resolution. Ph.D., Computer and Information Science, University of Pennsylvania; 1998.
10. Thede SM, Harper MP. A second-order Hidden Markov Model for part-of-speech tagging. Presented at the Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics on Computational Linguistics, College Park, Maryland; 1999.
11. Lafferty JD, McCallum A, Pereira FCN. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. Presented at the Proceedings of the Eighteenth International Conference on Machine Learning; 2001.
12. Toutanova K, Klein D, Manning CD, Singer Y. Feature-rich part-of-speech tagging with a cyclic dependency network. Presented at the Proceedings of NAACL '03, Edmonton, Canada; 2003.
13. Giménez J, Màrquez L. SVMTool: A general POS tagger generator based on support vector machines. In Proceedings of 4th International Conference on Language Resources and Evaluation (LREC), Lisbon, Portugal. 2004;43-46.
14. AlGahtani S, Black W, McNaught J. Arabic part-of-speech tagging using transformation-based learning. Presented at the Proceedings of the Second International Conference on Arabic Language Resources and Tools, Cairo, Egypt; 2009.
15. Yousif JH, Sembok T.
Arabic part-ofspeech tagger based support vectors machies. I Iformatio Techology, 2008. ITSim 2008. Iteratioal Symposium. 2008;-7. 6. Al-Taai A, Abu Al-Rub S. A rule-based approach for taggig o-vocalized Arabic words. The Iteratioal Arab Joural of Iformatio Techology. 2009;9: 320-328. 7. Zribi C, Torjme A, Ahmed M. A multiaget system for POS-taggig vocalized Arabic text. The Iteratioal Arab Joural of Iformatio Techology; 2007. 8. Alqraiy S. A morphological-sytactical aalysis approach For Arabic textual taggig. Ph.D., De Motfort Uiversity, Leicester, UK; 2008. 9. Bar-Haim R, Sima'a K, Witer Y. Choosig a optimal architecture for segmetatio ad POS-taggig of moder Hebrew. Preseted at the Proceedigs of the ACL Workshop o Computatioal Approaches to Semitic Laguages, A Arbor, Michiga; 2005. 20. Kübler S, Mohamed E. Part of speech taggig for Arabic. Natural Laguage Egieerig. First View. 20;-28. 2. Mohamed E, Kübler S. Is Arabic part of speech taggig feasible without word segmetatio? Preseted at the The 200 Aual Coferece of the North America Chapter of the Associatio for Computatioal Liguistics, Los Ageles, Califoria, USA; 200. 22. Ali BB, Jarray F. Geetic approach for Arabic part of speech taggig. Iteratioal Joural o Natural Laguage Computig. 203;2:-2. 23. Hadi M, Ouatik S, Lachkar A, Mekassi M. Hybrid Part-of-speech tagger for ovocalized Arabic text. Iteratioal Joural o Natural Laguage Computig (IJNLC). 203;2. 24. Albared M, Hazaa M. N-attributes stochastic classifier combiatio for Arabic morphological disambiguatio. Saba Joural of iformatio Techology Ad Networkig (SJITN). 205;3. 25. Kadim A, Lazrek A. Bidirectioal HMMbased Arabic POS taggig. Iteratioal Joural of Speech Techology. 206;9: 303-32. 26. Brats T. TT: A statistical part-of-speech tagger. Preseted at the Proceedigs of 9

the sixth coferece o Applied atural laguage processig. Seattle, Washigto; 2000. 27. Albared M, Omar N, Ab Aziz MJ, Nazri MZA. Automatic part of speech taggig for Arabic: A experimet usig Bigram hidde Markov model. Preseted at Coferece o Rough Set ad Kowledge Techology, Beijig, Chia; 200. 28. Dukes K, Atwell E, Sharaf ABM. Sytactic aotatio guidelies for the quraic Arabic Depedecy Treebak. Preseted at the Laguage Resources ad Evaluatio Coferece (LREC 200), Valletta, Malta; 200. the Proceedigs of the 5 th Iteratioal 207 Ba-Alwi et al.; This is a Ope Access article distributed uder the terms of the Creative Commos Attributio Licese (http://creativecommos.org/liceses/by/4.0), which permits urestricted use, distributio, ad reproductio i ay medium, provided the origial work is properly cited. Peer-review history: The peer review history for this paper ca be accessed here: http://sciecedomai.org/review-history/787 0