CBAS: CONTEXT BASED ARABIC STEMMER


Mahmoud El-Defrawy, Yasser El-Sonbaty and Nahla A. Belal
College of Computing and Information Technology, AAST, Alexandria, Egypt

ABSTRACT

Arabic morphology encapsulates many valuable features such as a word's root. Arabic roots are utilized for many tasks; the process of extracting a word's root is referred to as stemming. Stemming is an essential part of most Natural Language Processing tasks, especially for derivative languages such as Arabic. However, stemming is faced with the problem of ambiguity, where two or more roots could be extracted from the same word. On the other hand, distributional semantics is a powerful co-occurrence model. It captures the meaning of a word based on its context. In this paper, a distributional semantics model utilizing Smoothed Pointwise Mutual Information (SPMI) is constructed to investigate its effectiveness on the stemming analysis task. It showed an accuracy of 81.5%, an improvement of at least 9.4 percentage points over other stemmers.

KEYWORDS

Natural Language Processing, Computational Linguistics, Text Analysis, Stemming

1. INTRODUCTION

Natural Languages (NLs) are the communication channels between humans. They allow conveying information, exchanging knowledge, and sharing ideas. For many years scientists studied Natural Languages and developed theories and rules that govern their use, such as Grammar and Morphology. Natural Language Processing (NLP) is the intersection between linguistics and Computational Science (CS) [1]. NLP allows utilizing linguistics to use Natural Languages as a way of communicating with computational devices [1]. The association between linguistics and computational sciences has evolved over time. Machine Translation (MT) was one of the first NLP tasks in the 1950s; it began as translation from Russian to English [2]. The progress of MT was limited due to the complexity of linguistic rules and the low computation power at the time [1]. However, Chomsky's theory [3] of natural language grammar formed the basis for the formation of Backus-Naur Form (BNF).
BNF [4] notations are commonly used to represent Context Free Grammars (CFGs). CFGs are used to systematically describe and validate artificial languages, such as programming languages. Using CFGs to describe some aspects of Natural Languages requires a non-trivial set of rules, which results in some ambiguity due to unexpected rule interactions. The introduction of statistical methods gave some insights for reducing NLP ambiguity [1]. For example, Probabilistic CFGs extend traditional CFGs by deducing linguistic rules and assigning weights [5]. Rules and weights are statistically deduced from a large annotated corpus. The noticeable improvement in MT sparked research in NLP [1].

DOI: 10.5121/ijlc.2015.4301

NLP tends to work with large sets of data from various languages. This raises the need for defining a concise representation for the data while preserving as many of its features as possible. A concise representation is required for almost any NLP task (at the word, sentence, or document level). Stemming is a primary NLP task, and it contributes to many other NLP tasks [6]. Stemming is reducing a word to its basic form [7], while preserving its main characteristics. Many languages define linguistic rules for stemming, but not to the same degree [8]. Derivative languages are highly systematic, and highly supportive of stemming analysis. Most derivative languages share the property that complex forms are derived from basic ones. Arabic is one of the derivative languages that linguistically supports stemming.

The Arabic language is widely used [9] and it exists in different formats. For example, Arabic words can be given as separate text or they can be extracted from images [10]. The Arabic language defines an accurate set of rules known as morphological rules, or morphology. Morphology accurately describes the formulation of an Arabic word from its basic form. The basic forms are commonly called roots. However, stemming is faced with ambiguity, as are most NLP tasks. Various techniques have been used to resolve ambiguity, among which is semantic analysis, that is, capturing the intended meaning of a word [11]. Semantic analysis is a very powerful tool for tackling the ambiguity problem, but it is very challenging to model. Distributional Semantics (DS) [11] is a type of semantic analysis based on co-occurrence analysis. It represents a word's meaning by the distribution of its context (surrounding words), as shown in Table 1.

Table 1. Context Matrix Sample.
Target word | عجائب (Wonders) | دول (Countries) | جامعة (University) | القاھرة (Cairo)
نظم (Systems) | 0 | 5 | 10 | 3
سائح (Tourist) | 8 | 5 | 0 | 14
حكم (Judgement) | 0 | 12 | 0 | 9
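A context matrix of the kind sampled in Table 1 can be produced by counting which words appear near each target word. The following is a minimal Python sketch under stated assumptions: the function name `build_context_matrix`, the symmetric window, and the toy corpus of English glosses are all illustrative and are not taken from the paper.

```python
from collections import Counter, defaultdict


def build_context_matrix(tokens, window=3):
    """Count, for every target word, the words that co-occur with it.

    `window` is the total window length including the target word, so
    window=3 looks one word to the left and one word to the right.
    The symmetric window is an assumption made for this sketch.
    """
    half = window // 2
    matrix = defaultdict(Counter)
    for i, target in enumerate(tokens):
        lo = max(0, i - half)
        hi = min(len(tokens), i + half + 1)
        for j in range(lo, hi):
            if j != i:  # the target itself is not part of its own context
                matrix[target][tokens[j]] += 1
    return matrix


# Hypothetical toy corpus; English glosses stand in for Arabic tokens.
corpus = "systems countries systems university tourist cairo tourist wonders".split()
matrix = build_context_matrix(corpus, window=3)
```

Each row of `matrix` plays the role of one row of Table 1: `matrix["systems"]` holds the context counts of the target word "systems". A `Counter` returns 0 for unseen pairs, which matches the zero cells of the table.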
For example, the first row of Table 1 shows that the words عجائب (meaning "Wonders") and دول (meaning "Countries") appeared in the context of the word نظم (meaning "Systems") with frequencies 0 and 5, respectively. Different measures can be computed, such as Pointwise Mutual Information (PMI), Positive PMI (PPMI), and Smoothed PMI (SPMI). They measure the correlation between a word and its context [12,13].

This paper introduces a context based Arabic stemmer for extracting a word's root. The proposed stemmer (CBAS) explores all possible roots, then selects the appropriate root using Distributional Semantics (DS). The impact of utilizing DS is assessed through a series of comparisons with other stemmers using a manually annotated set of articles. The paper is organized as follows: Section 2 is an introduction to Arabic morphology. Section 3 explores related work and techniques used for constructing stemmers. The proposed stemmer (CBAS) is described in Section 4. In Section 5, a detailed analysis and evaluation of the proposed stemmer (CBAS) is presented. Finally, a conclusion is presented in Section 6.

2. BACKGROUND

Regardless of the approach used for developing morphological analyzers, a basic understanding of morphological rules is needed to set expectations, evaluate the results, and design improvements. This section introduces Arabic morphology and common challenges. Morphology is the study of word formulation. Arabic morphology is based on the derivation principle, whereby words are acquired from roots. Roots are usually three, four, or five characters.

Roots are the seeds for Arabic word generation. A new word is acquired by modifying its root. For example, the word كاتب (kātb, meaning "Writer") is derived from the root ك ت ب (kāf tāʾ bāʾ, meaning "Wrote") by adding ا (ʾlf) in the middle. Not every addition is considered valid. The Arabic language introduces a set of templates to define valid combinations and additions. Templates are referred to as patterns. A pattern is an ordered sequence of letters. Since patterns work with all roots, a set of letters generically represents the root's letters and their order, while augmented letters are represented by themselves in the correct positions. For example, the pattern فاعل (fāʿl, meaning "Actor") was used to derive the previous word كاتب (kātb, meaning "Writer") by substituting ف (fāʾ) with ك (kāf), ع (ʿyn) with ت (tāʾ), and ل (lām) with ب (bāʾ), respectively. As noted, roots are commonly written as separated characters to indicate possible insertions [14].

Augmented letters are reflected in the pattern, and finally in the word itself. However, an augmented unit (one or more augmented letters) which is added at the front or at the end of a word is called a prefix or a suffix, respectively. In most cases, prefixes and suffixes are not part of a word's meaning; rather, they are additional features [14]. For example, الكاتب (ʾlkātb, meaning "The writer") is formed by adding ال (ʾl, meaning "The") in front of the word. This point of view substantially reduces the number of enumerated patterns.

As described above, the root-pattern system is simple, elegant, and straightforward. However, the system is faced with morphological challenges, namely vocalization, mutation, and the absence of diacritics (annotations above or below a word's letter that capture additional morphological and grammatical features). For example, a letter can change its form due to grammatical or phonological rules. Another challenge of the root-pattern system is stopwords, such as connection words, which do not obey derivational rules.
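The root-pattern substitution described in this section can be illustrated with a short Python sketch. The function name `apply_pattern` and the `slots` parameter are hypothetical; the sketch covers only the plain substitution case and ignores the vocalization, mutation, and diacritics challenges mentioned above.

```python
def apply_pattern(root, pattern, slots="فعل"):
    """Fill the root letters into the slot letters of a pattern.

    The slot letters (fāʾ, ʿyn, lām) are replaced in order by the root
    letters; every other letter in the pattern is an augmented letter
    and is copied unchanged.
    """
    if len(root) != len(slots):
        raise ValueError("root length must match the number of slot letters")
    mapping = dict(zip(slots, root))
    return "".join(mapping.get(letter, letter) for letter in pattern)


# Root k-t-b placed in the pattern fāʿl, as in the example above.
word = apply_pattern("كتب", "فاعل")

# A prefix is simply concatenated in front of the derived word.
definite = "ال" + word
```

Here `word` is the derived form كاتب and `definite` is الكاتب, matching the two examples given in the text.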
The process of deriving a word back to its root (stemming) looks like a straightforward operation: simply align a word to its pattern and collect the letters corresponding to the ف (fāʾ), ع (ʿyn), and ل (lām) letters. However, due to the challenges described above, a word may be derived back to multiple roots.

3. RELATED WORK

Information Retrieval (IR) is one of the early tasks that utilized stemming analysis [7]. But stemming analysis is not limited to IR. Stemming analysis has improved many tasks, such as Machine Translation (MT) [15], Sentiment Analysis [16], and many more. This section reviews common stemming analysis algorithms for different Natural Languages (NLs).

The nature of a language has a great impact on the development of related stemming algorithms. For example, the nature of the English language makes English stemmers concerned with removing a word's suffixes only, since removing prefixes may imply a different meaning, as with "sufficient" and "insufficient". Various stemmers were developed for English [8,17]. However, the underlying nature of the language limited their extension to other languages, among which are Arabic [18] and Urdu [19]. Moreover, stemming is not equally effective for all languages [20,21].

Arabic is a morphologically rich language which has enriched Natural Language Processing (NLP) [22-24]. Arabic roots have rich linguistic features; they are semantically representative, derivable, and finite in number. Stemmers were developed over the years to take advantage of such features. This section introduces common Arabic stemmers, and their root extraction process which

encapsulates semantic decisions. Finally, it introduces semantic analysis techniques independently from stemming analysis.

The Khoja stemmer [18] is one of the early and most powerful approaches developed for Arabic stemming [7]. Khoja simulates the linguistic process as much as possible. It removes prefixes and suffixes from a word, often after the normalization process, then matches the resulting word to a pattern, and finally extracts the root. The extracted root is validated against a list of correct Arabic roots to ensure linguistic correctness. The Khoja stemmer [18] resolves ambiguity by defining a set of linguistic paths, or decisions, based on various features such as the word's first character and the length of prefixes or suffixes. Additionally, decisions are implicitly ordered, so that the result is the first correct root. For example, the Khoja stemmer handles roots with duplicate letters first. Root extraction is a highly complex process due to the existence of overlapping rules, which requires more information.

Another type of stemming analysis is light stemming. Light stemming is another way of acquiring a reduced representation of Arabic words. Light stemming is not as complex as root extraction. It only removes prefixes and suffixes from a word. For example, the word الكاتبون (ʾlkātbūn, meaning "The writers") would be stemmed to the word كاتب (kātb, meaning "Writer") instead of ك ت ب (kāf tāʾ bāʾ, meaning "Writing"). Light stemming is widely used for Information Retrieval (IR) [25]. Light stemming has shown competitive results in IR against root extraction based stemmers [6,26]. Light stemmers are relatively fast and efficient, and they preserve more specific features of the word; for example, كاتب (kātb, meaning "Writer") is more closely related to the word الكاتبون (ʾlkātbūn, meaning "The writers") than ك ت ب (kāf tāʾ bāʾ, meaning "Writing") is. But the number of words in Arabic without prefixes and suffixes is far larger than the number of roots listed in Arabic dictionaries [14]. Moreover, there is no explicit evidence that lightly stemmed words are more efficient than roots [26].
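Light stemming as described above amounts to stripping known affixes and stopping there. The Python sketch below is an illustration only: the affix lists are deliberately tiny and hypothetical, the `min_len` guard is an assumption, and none of it reproduces the rules of any stemmer cited in this section.

```python
# Hypothetical, deliberately small affix lists; real light stemmers use longer ones.
PREFIXES = ["ال", "و"]
SUFFIXES = ["ون", "ات", "ها"]


def light_stem(word, min_len=3):
    """Strip at most one known prefix and one known suffix from `word`.

    An affix is removed only if at least `min_len` letters remain, so
    that short words are not reduced to nothing.
    """
    for prefix in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[:-len(suffix)]
            break
    return word


stem = light_stem("الكاتبون")
```

For the example from the text, `stem` is كاتب: the definite article and the plural suffix are removed, while the augmented ا inside the word is kept, which is exactly what separates a light stem from the root ك ت ب.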
ISRI [27] is another linguistic based Arabic stemmer that roughly uses the same sequence used by Khoja's stemmer [18]. The main differences are that ISRI does not linguistically validate the extracted root (no dictionary is used), and that it organizes the defined pattern set into sub-groups, where each sub-group has common features. There are also changes to the normalization process and to the handling of prefixes and suffixes. The main goal of ISRI is to get the minimum representation of a given word. Due to various changes and prioritization, ISRI resolves ambiguity differently, by changing the order of applying morphological rules. But ISRI still makes static decisions. Most likely, ISRI would be used for information retrieval rather than linguistic based tasks.

Tashaphyne [28] is another Arabic stemmer. It mainly supports light stemming (removing prefixes and suffixes). It follows the same approach used by the Khoja [18] and ISRI [27] stemmers. It can be used for root extraction as well.

Darwish [29] utilizes the existing word-root pairs used to construct Finite State Transducer (FST) like stemmers. It uses a learning based technique not only to infer Arabic patterns, but also to rank extracted roots. This methodology enumerates possible roots like FST [30] approaches, but additionally gives a preference to the extracted roots. It is another way to handle ambiguity other than static rules. However, part of the inferred patterns would be inadequate due to cases such as vocalization and mutation. In addition, the ranking of roots is based on the frequency of the inferred patterns, neglecting the word's features. Later, this approach was modified to handle light stemming (prefix and suffix removal) [7].

Arabic grammar has a high influence on morphological analysis [14]. ElixirFM [31] employs syntactic features to enhance morphological results. It takes advantage of the Prague Arabic Dependency Treebank (PADT) [32] to acquire syntactic features and other morphological features

from the BuckWalter [33] stem dictionary. ElixirFM [31] defines a set of morphological rules to extract possible stems, while ranking is used to disambiguate the extracted roots or stems using the underlying data.

The MADAMIRA [15] morphological analyzer consists of two tools, MADA [34] and AMIRA [35]. MADAMIRA [15] takes advantage of a large annotated corpus by using machine learning techniques such as Support Vector Machines [35]. It combines several tools such as word segmentation, Part of Speech Tagging (POST), and light stemming. It is different from the previous approaches, since it does not explicitly define morphological rules.

Stemmers are part of many NLP tasks. For example, sentiment analysis systems use classifiers such as Naive Bayes or Support Vector Machines (SVMs) to classify sets of test samples given a tagged training set and rich feature sets [16,36,37]; other examples include question answering systems [38,39], and many more. However, stemming is not limited to NLP. It has been used for Information Retrieval (IR), and most stemmers are evaluated indirectly using IR benchmarks [7]. Many IR experiments [6,26,27] showed that Arabic roots improved Arabic IR.

4. PROPOSED STEMMER (CBAS)

The proposed stemmer, Context-Based Arabic Stemmer (CBAS), utilizes the distributional similarity of a word's context to gain additional information about its semantics. This information assists with the selection of the correct root by excluding candidate roots that are semantically irrelevant within the context. This section introduces the main phases of the proposed stemmer (CBAS): context matrix construction, root generation, and root selection, where each phase consists of a set of steps. The proposed algorithm is shown in Fig. 1 and Fig. 2.

Fig. 1: Context Matrix Construction Algorithm
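The idea behind the root selection phase, keeping the candidate that is most strongly associated with the surrounding context, can be sketched in Python as follows. Several points are assumptions rather than the paper's method: add-k smoothing is used as one common way to smooth PMI, since the exact SPMI formula is not spelled out in this text; the names `smoothed_pmi` and `select_candidate` are hypothetical; and the counts of Table 1 are reused as stand-in data, with English glosses as keys and the table's target words playing the role of candidates.

```python
import math

# Counts copied from Table 1; English glosses stand in for the Arabic words.
COUNTS = {
    "systems":   {"wonders": 0, "countries": 5,  "university": 10, "cairo": 3},
    "tourist":   {"wonders": 8, "countries": 5,  "university": 0,  "cairo": 14},
    "judgement": {"wonders": 0, "countries": 12, "university": 0,  "cairo": 9},
}


def smoothed_pmi(counts, word, context, k=1.0):
    """PMI of (word, context) after adding k to every cell.

    Add-k smoothing is an assumed choice for this sketch. The matrix is
    assumed dense: every row lists every context word.
    """
    n_rows = len(counts)
    n_cols = len(next(iter(counts.values())))
    total = sum(sum(row.values()) for row in counts.values()) + k * n_rows * n_cols
    joint = counts[word][context] + k
    word_total = sum(counts[word].values()) + k * n_cols
    context_total = sum(row[context] for row in counts.values()) + k * n_rows
    # log2( p(word, context) / (p(word) * p(context)) ), with the totals cancelled
    return math.log2(joint * total / (word_total * context_total))


def select_candidate(candidates, context_word, counts, k=1.0):
    """Pick the candidate with the highest smoothed PMI to the context word."""
    return max(candidates, key=lambda c: smoothed_pmi(counts, c, context_word, k))


best = select_candidate(["systems", "tourist", "judgement"], "cairo", COUNTS)
```

With these counts, "tourist" is selected for the context "cairo". Smoothing keeps zero cells from producing an undefined logarithm, which is the practical reason a smoothed variant is convenient on sparse matrices.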

Fig. 2: Stemmer's Algorithm

4.1. Data Resources

Predefined linguistic data is an essential part of the proposed stemmer (CBAS). This section introduces the data defined by CBAS. An Arabic word consists of three parts: prefix, infix, and suffix [14]. Prefixes and suffixes are a set of features that can be added to a word, such as the definite article ال (ʾl, meaning "The") or connected pronouns such as ھم (hm, meaning "Them"). The prefix and suffix lists contain individual and compound letters that could appear at the front or the end of an Arabic word. There is also a list of Arabic patterns which is used to extract possible Arabic roots. The final list is the roots dictionary, which has been extracted from the Khoja stemmer [18] to validate the extracted root. The prefix, suffix, pattern, and dictionary lists are used in the root generation phase.

The previous lists are commonly defined for Arabic stemmers. However, CBAS also uses a raw dataset which consists of a set of articles extracted from Omani newspapers [40]. The dataset contains 20291 articles on various topics, for example culture and sport [40], which represent a wide range of current Arabic language usage. The raw dataset plays a central role in selecting semantically correct roots, which distinguishes CBAS from other stemmers, which do not commonly employ context in their algorithms.

4.2. Context Matrix Construction

The context matrix is a powerful and flexible tool for acquiring some semantic properties [11]. It defines a window of words, where the target word is at position i, and the rest of the surrounding words are its context [11]. The window slides over the corpus, associating the target word with its context distribution as shown in Table 1. Various measures can be computed from the context matrix and employed in different tasks.

4.3. Root Generation

Root generation automates root extraction. However, unlike the manual process, it extracts all possible roots. It consists of three major sub-processes: word segmentation, pattern matching, and root validation. Word segmentation breaks a word in all possible ways into three parts, prefix, infix, and suffix, using the predefined prefix and suffix lists. Pattern matching matches the infix obtained in word segmentation with one or more patterns with respect to its length. For each matched pattern, the root's characters are collected and passed to dictionary validation. Pattern matching also handles weak letters, stopwords, and some other linguistic cases. Dictionary validation ensures that the extracted roots are linguistically correct. It validates extracted roots against a list of correct Arabic roots. The dictionary is not sufficient to produce only one correct root; due to various linguistic cases, root generation has the potential of extracting one or more correct roots.

4.4. Roots Selection

This phase utilizes the context matrix to select an appropriate root from two or more candidate roots. Pointwise Mutual Information (PMI) [12,13] measures the correlation between two or more words. Since some words can produce two or more root candidates, the proposed algorithm (CBAS) uses a variation of PMI, Smoothed PMI (SPMI) [12], which handles sparse matrices. As shown in Table 2, SPMI achieved the highest accuracy of 81.5% when compared to PMI and PPMI, where the achieved accuracy was 78.84% and 79.49%, respectively. This is due to the fact that SPMI overcomes the bias towards rare co-occurrence events, which is a side effect of PMI [12]. SPMI is utilized to measure the correlation between each generated root and its previous context. To take advantage of the underlying matrix, a set of words is derived for each candidate root, in addition to the root itself. For each derived word, the SPMI with the previous word (as context) is computed, and the values are averaged. The root with the highest average correlation to its context is then selected.

5. RESULTS AND EVALUATION

IR is the common methodology for evaluating a new stemmer because of the lack of stemmed benchmarks [21]. This section introduces the validation dataset, evaluation measures, and finally the experimental results.

5.1. Validation Dataset

Direct evaluation is important to show the stemming accuracy and potential improvements. A manually annotated dataset has been used to measure the stemmer's accuracy and compare it with other stemmers. The dataset is part of the International Corpus of Arabic (ICA) [41]. Various Arabic resources have contributed to the ICA, such as newspapers, books, and magazines. It has been constructed to provide an appropriate representation of the Arabic language in Modern Standard Arabic (MSA) [41]. The dataset consists of 10302 tokens associated with various features. There exist 3629 unique word-root pairs, while other words do not have roots associated due to the existence of stopwords and non-Arabic words. The dataset contains 8941 words after stopword removal. This is shown in Fig. 3.

Fig. 3: Validation Dataset

5.2. Evaluation Criteria

Stemming is beneficial for many tasks, where every task uses the roots in a different way. For example, IR uses roots as cluster representatives to group related words, while sentiment analysis is more concerned with the linguistic accuracy of a root. A set of metrics was used to measure different usages and compare them with other stemmers. Stemming accuracy is one of the basic measures of the effectiveness of a stemmer. It is defined as the ratio between the number of correctly stemmed words and the number of words in the complete dataset. Collecting related words under the same group is important for tasks such as IR. There are two variations of grouping related words. First, words can be grouped correctly under a semantically correct root; this is referred to as classification. The second is to group related words together, not necessarily under a correct root; this is referred to as clustering. Standard metrics for classification and clustering are accuracy, precision, recall, and F1 measure, defined as follows [42,43]:

$$\text{accuracy} = \frac{1}{n} \sum_{i=1}^{n} \frac{|X_i \cap Y_i|}{|X_i \cup Y_i|}$$

$$\text{precision} = \frac{1}{n} \sum_{i=1}^{n} \frac{|X_i \cap Y_i|}{|Y_i|}$$

$$\text{recall} = \frac{1}{n} \sum_{i=1}^{n} \frac{|X_i \cap Y_i|}{|X_i|}$$

$$F_1\ \text{measure} = \frac{1}{n} \sum_{i=1}^{n} \frac{2\,|X_i \cap Y_i|}{|X_i| + |Y_i|}$$

where n is the number of extracted roots, X is the set of extracted roots, X_i is an individual extracted root, Y is the set of valid roots, and Y_i is an individual valid root.

5.3. Results

The complete 8941 words were used to test the proposed stemmer (CBAS) with a window size of 3; the set was then reduced to the set of unique word-root pairs to be compared with other stemmers. Table 2 shows the comparison between the proposed stemmer (CBAS) and other stemmers. It shows that the proposed stemmer (CBAS) achieved an accuracy of 81.5%, an improvement of 9.4, 67.3, and 51.2 percentage points over the Khoja, ISRI, and Tashaphyne stemmers, respectively. The accuracy enhancement is due to exploring various possible roots. Such exploration would not be possible without distributional semantics, which provides a dynamic and robust way of selecting an appropriate root.

Table 3 and Table 4 show the performance of the proposed stemmer (CBAS) when it is used as a grouping mechanism. Table 3 shows that the proposed stemmer (CBAS) has a higher potential to linguistically group Arabic words than other stemmers. CBAS outperformed the other stemmers in the classification task, with an accuracy of 65.45%. Table 4 shows that the proposed stemmer (CBAS) has potential improvements in non-linguistic based tasks, achieving an accuracy of 73.83% in clustering. Comparing the linguistic (classification) and non-linguistic (clustering) grouping measures, there is an increase in all corresponding measures. This is because some clusters were correctly formed irrespective of the clusters' seeds. The classification and clustering measures show the superiority of CBAS over other stemmers. This indicates the beneficial features of CBAS for the IR task.

Table 2. Stemmers Linguistic Accuracy
Stemmer | Linguistic Accuracy
Khoja | 72.1%
ISRI | 14.2%
Tashaphyne | 30.3%
CBAS-PMI | 78.84%
CBAS-PPMI | 79.49%
CBAS | 81.5%
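The set-based measures of Section 5.2 can be computed with a few lines of Python. This is a minimal sketch under assumptions: each X_i and Y_i is treated as a non-empty set, the function name `set_metrics` and the toy groups are hypothetical, and the factor 2 in the F1 term follows the usual definition of that measure.

```python
def set_metrics(extracted, valid):
    """Average set-overlap measures over paired lists of sets.

    extracted[i] plays the role of X_i and valid[i] the role of Y_i.
    Every set is assumed to be non-empty.
    """
    n = len(extracted)
    accuracy = precision = recall = f1 = 0.0
    for x, y in zip(extracted, valid):
        common = len(x & y)
        accuracy += common / len(x | y)
        precision += common / len(y)
        recall += common / len(x)
        f1 += 2 * common / (len(x) + len(y))
    return {
        "accuracy": accuracy / n,
        "precision": precision / n,
        "recall": recall / n,
        "f1": f1 / n,
    }


# Hypothetical toy groups, used only to exercise the formulas.
extracted_groups = [{"a", "b"}, {"c"}]
valid_groups = [{"a"}, {"c", "d"}]
scores = set_metrics(extracted_groups, valid_groups)
```

For these toy groups the averages are 0.5 for accuracy, 0.75 for precision and recall, and 2/3 for the F1 measure, which can be checked by hand against the four formulas.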

Table 3. Stemmers Classification Measures
Stemmer | Accuracy | Precision | Recall | F1 measure
Khoja | 57.53% | 57.53% | 59.59% | 58.55%
ISRI | 10.43% | 10.43% | 10.49% | 10.46%
Tashaphyne | 25.07% | 25.07% | 25.15% | 25.11%
CBAS | 65.45% | 65.45% | 68.23% | 66.51%

Table 4. Stemmers Clustering Measures
Stemmer | Accuracy | Precision | Recall | F1 measure
Khoja | 71.71% | 93.09% | 75.74% | 83.52%
ISRI | 12.59% | 69.40% | 13.34% | 22.27%
Tashaphyne | 32.25% | 72.54% | 37.03% | 49.03%
CBAS | 73.83% | 93.71% | 75.46% | 84.50%

6. CONCLUSION

Many stemmers were developed to exploit the rich linguistic features provided by roots. Most stemmers make explicit decisions, statistical-based or linguistic-based, to select only one root. Other stemmers use ranking to express their selection preference rather than selecting a single root. However, in the end, a single root is chosen. Static decisions are very appropriate for common and frequent cases. However, adding other features such as syntactic and manual annotations would also be valuable. The introduced stemmer employs distributional similarity to handle incorrect root selection, which is a side effect of the root generation phase. The existence of robust filtering mechanisms, such as distributional analysis, allows exploring various roots.

Distributional analysis has several advantages. It can be computed for any corpus and any language, and it is relatively fast and inexpensive to construct compared to a manually annotated corpus. Distributional semantics covers many relations between words, and it is robust against any preferences or missing information. It is also very adaptive to context changes, which makes it suitable for many topics. However, distributional analysis is not as accurate as manually annotated data; hence, the word generation process was added to the root selection phase to tolerate possible errors. The previous techniques were compared to the results of the proposed stemmer (CBAS). CBAS shows an accuracy of 81.5%, an improvement of 9.4, 67.3, and 51.2 percentage points over the Khoja, ISRI, and Tashaphyne stemmers, respectively.
CBAS also shows an improvement in classification and clustering, with accuracies of 65.45% and 73.83%, respectively. The results indicate that the proposed stemmer (CBAS) enhances stemming and other related tasks. CBAS represents a methodology for capturing a word's context and making decisions based on it. CBAS could change its behaviour based on the underlying data, which could be specialized in a sub-domain of the Arabic language. The statistical model used by CBAS is relatively simple. It incorporates important information (context) about a word, which would be a complex process to include in a rule based stemmer. The statistical model reduces the linguistic complexity of representing various linguistic cases. It also avoids unexpected interactions and prioritization schemes for ordering the rules.
