An efficient stemming for Arabic Text Classification


Attia Nehar, Département d'informatique, A.T. University, BP. 37G, 03000 Laghouat, Algeria. Email: a.nehar@mail.lagh-univ.dz
Djelloul Ziadi, Université de Rouen, Rouen, France. Email: Djelloul.Ziadi@univ-rouen.fr
Hadda Cherroun, Département d'informatique, Laghouat, Algeria. Email: hadda cherroun@mail.lagh-univ.dz

Abstract: Using the N-gram technique without stemming is not appropriate in the context of Arabic Text Classification. We therefore introduce a new stemming technique, which we call approximate stemming, based on the use of Arabic patterns. Patterns are modeled with transducers, and stemming is performed without relying on any dictionary. This stemmer is then used in the context of Arabic Text Classification.

Index Terms: Arabic, classification, kernels, transducers, Arabic patterns

0.1. INTRODUCTION

Text Classification (TC) is the task of automatically sorting a set of documents into one or more categories from a predefined set [1]. Text classification techniques are used in many domains, including mail spam filtering, article indexing, Web searching, automated population of hierarchical catalogues of Web resources, and even automated essay grading.

Arabic is spoken by more than 422 million people, which makes it the 5th most widely used language in the world [2]. Arabic has three forms: Classical Arabic (CA), Modern Standard Arabic (MSA), and Dialectal Arabic (DA). CA includes classical historical, liturgical, and old literature texts. MSA includes news media and formal speech. DA covers the predominantly spoken vernaculars, which have no written standard. The Arabic alphabet consists of 28 letters and the Hamza. Three of them are vowels and the rest are consonants. There is no upper or lower case for Arabic letters, and the writing orientation, as in all Semitic languages, is from right to left. Arabic differs from other languages syntactically, morphologically and semantically.
It is a Semitic language whose main characteristic is that most words are built up from roots by following certain known patterns and adding prefixes and suffixes. Due to the complexity of the Arabic language, text classification is a challenging task, and many algorithms have been developed to improve the performance of Arabic TC systems [3], [4], [5], [6], [7], [8]. In general, an Arabic text classification system can be divided into three steps:

1. Preprocessing: punctuation marks, diacritics, stop words and non-letters are removed.
2. Feature extraction: a set of features is extracted from the text, which will represent the text in the next step. For instance, Khreisat [6] used the N-gram technique to extract features from documents, while Syiam et al. [8] used stemming.
3. Learning: many supervised algorithms have been used to train systems to classify Arabic text documents, including Support Vector Machines [5], [7], K-Nearest Neighbors [8] and many others. Most algorithms rely on distance measures over the extracted features to decide how similar two documents are.

In the second step, a feature vector is constructed that represents the document in the third step. Many stemming approaches have been developed [9]. S. Khoja and R. Garside [10] developed a dictionary-based stemmer; it performs well, but the dictionary needs to be maintained and updated. The stemming algorithm of Al-Serhan et al. [11] finds the three-letter roots of Arabic words without depending on any root dictionary or pattern files. Many Arabic words share the same stem but not the same meaning, and stemming two semantically different words to the same root can induce classification errors. To prevent this, light stemming is used in TC algorithms [12]. The main idea behind this technique is that many words generated from the same root have different meanings.
Light-stemming algorithms make several passes over the text, attempting to locate and remove the most frequent prefixes and suffixes from each word. This strategy leads to a large number of features. In the third step, many distance measures can be used to compute the distance between documents from these feature vectors.

In this paper, we introduce a new stemming technique that does not rely on any dictionary. It is based on transducers, which we also use to measure the distance between documents in our framework.

This paper is organized as follows. Section 0.2 presents, in more detail, the feature selection techniques, namely brute stemming and light stemming. Our new stemming approach, called approximate stemming, is described in Section 0.3. Next, in Section 0.4 we summarize the framework in which our new feature selection method is used and explain the kernel similarity measure. Finally, in Section 0.5 we highlight some

perspectives.

0.2. STEMMING TECHNIQUES

In the context of TC, stemming is used to reduce the dimensionality of the feature vector. It consists of transforming each Arabic word in the text into its root.

A. Brute Stemming

There are many stemming techniques used in the context of TC. They can be classified into two classes:
- Stemming using a dictionary: a dictionary of Arabic word stems is needed. Khoja's stemmer [10] is an example of this class.
- Stemming without a dictionary: stems are extracted without depending on any root or pattern files. Al-Serhan et al. [11] give an example of this class.

Khoja's stemmer removes the longest suffix and the longest prefix. It then matches the remaining word against verbal and noun patterns, using a dictionary, to extract the root. The stemmer makes use of several linguistic data files, such as lists of diacritic characters, punctuation characters, definite articles and stop words. This stemmer performs well but relies on a dictionary that needs to be maintained and updated.

The second technique, due to Al-Serhan et al. [11], finds the three-letter roots of Arabic words without depending on any root or pattern files. Word roots are extracted by assigning weights and ranks to the letters that constitute a word. Weights are real numbers in the range 0 to 5, assigned to groups of letters on the basis of a statistical study of Arabic documents; Table 1 gives the weight assignments. The rank of a letter in a word depends on both the length of the word and on whether the word contains an odd or even number of letters. Table 2 shows the assignment of ranks to letters, where N is the number of letters in the word. Once the rank and weight of every letter are determined, each letter's weight is multiplied by its rank; the three letters with the smallest products constitute the root. Table 3 gives an example of using this algorithm.
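To make the weight-times-rank scheme concrete, here is a small Python sketch. The letter weights are passed in as a list (the letter groups of Table 1 did not survive extraction), and the rank rule follows Table 2 as reconstructed, which reproduces the products of Table 3 exactly; the Latin letters below are placeholders for the lost Arabic ones.

```python
def letter_rank(pos: int, n: int) -> float:
    """Rank of the letter at 1-based position `pos`, counted from the
    right, in a word of `n` letters (rule of Table 2)."""
    if pos <= (n + 1) // 2:          # first half: N, N-1, N-2, ...
        return n - pos + 1
    # second half: the position itself, offset by 0.5 (even N) or 1.5 (odd N)
    return pos - (0.5 if n % 2 == 0 else 1.5)

def extract_root(letters, weights):
    """Return the three letters with the smallest weight*rank products,
    kept in word order (Al-Serhan et al. [11])."""
    n = len(letters)
    scored = []
    for i, (ch, w) in enumerate(zip(letters, weights)):
        pos = n - i                  # position from the right
        scored.append((w * letter_rank(pos, n), i, ch))
    smallest = sorted(scored)[:3]    # three smallest products
    return "".join(ch for _, _, ch in sorted(smallest, key=lambda t: t[1]))

# The 8-letter example of Table 3 (weights 5, 0, 0, 5, 0, 2, 1, 5):
# the products come out as 37.5, 0, 0, 22.5, 0, 12, 7, 40, so the root
# is formed by the letters at the three zero-product positions.
```

With placeholder letters "abcdefgh" and the weights of Table 3, `extract_root` returns "bce", the three letters whose products are zero, matching the example.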
This algorithm, like any other brute stemming algorithm, gives the same stem for two semantically different words, which can decrease the performance of the classification system.

Arabic Letters                 Weight
( )                            5
( )                            3.5
( )                            3
( )                            2
( )                            5
Rest of the Arabic Alphabet    5
Table 1: Assignment of weights to letters.

Letter Position    Rank if Word       Rank if Word
from Right         Length is Even     Length is Odd
1                  N                  N
2                  N-1                N-1
3                  N-2                N-2
...                ...                ...
[N/2]              N/2 + 1            [N/2] + 1
[N/2] + 1          N/2 + 1 - 0.5      [N/2] + 1 - 1.5
[N/2] + 2          N/2 + 2 - 0.5      [N/2] + 2 - 1.5
[N/2] + 3          N/2 + 3 - 0.5      [N/2] + 3 - 1.5
...                ...                ...
N                  N - 0.5            N - 1.5
Table 2: Ranks of letters (N is the number of letters in the word).

Letters    ( )    ( )    ( )    ( )    ( )    ( )    ( )    ( )
Weight     5      0      0      5      0      2      1      5
Rank       7.5    6.5    5.5    4.5    5      6      7      8
Product    37.5   0      0      22.5   0      12     7      40
Root: ( )
Table 3: An example of using the Al-Serhan et al. algorithm; the three letters with the smallest products form the root.

B. Light Stemming

In Arabic, many word variants do not have similar meanings or semantics (like the two words meaning library and writer, respectively), yet these variants yield the same root under brute stemming. Brute stemming thus conflates the meanings of words. Light stemming [12] aims to enhance Text Classification performance while preserving word meanings.

0.3. STEMMING USING TRANSDUCERS

In this section, we explain our new stemmer. First, we introduce the notion of weighted transducers. Then, we explain how to build a stemming model from them.

Notation: Σ denotes a finite alphabet. The length of a string x over Σ is denoted |x|, and the complement of a subset L of Σ* is L̄ = Σ* \ L. |x|_a denotes the number of occurrences of the symbol a in x. K denotes either the set of real numbers R, the rational numbers Q, or the integers Z.

A. Weighted Transducers

Transducers and weighted transducers are finite automata in which each transition is augmented with an output label in addition to the familiar input label. Output labels are concatenated along a path to form an output sequence, just as input labels are.
Weighted transducers are finite-state transducers in which each transition carries a weight in addition to the input and output labels. The weight of a pair of input and output strings (x, y) is obtained by summing the weights of the paths labeled with (x, y). The following definition formalizes weighted transducers [13].

Definition 1: A weighted finite-state transducer T over a semiring (K, ⊕, ⊗, 0̄, 1̄) is an 8-tuple T = (Σ, Δ, Q, I, F, E, λ, ρ) where Σ is the finite input alphabet of the transducer, Δ is the finite output alphabet, Q is a finite set of states, I ⊆ Q the set of initial states, F ⊆ Q the set of final states, E ⊆ Q × (Σ ∪ {ε}) × (Δ ∪ {ε}) × K × Q a finite set of transitions, λ : I → K the initial weight function, and ρ : F → K the final weight function mapping F to K.

For a path π in a transducer, p[π] denotes the origin state of the path and n[π] its destination state. The set of all paths from the initial states I to the final states F, labeled with input string x and output string y, is denoted P(I, x, y, F). A transducer T is regulated if the output weight associated by T with any pair of input-output strings (x, y), given by

    T(x, y) = ⊕_{π ∈ P(I, x, y, F)} λ(p[π]) ⊗ w[π] ⊗ ρ(n[π])    (1)

is well-defined and in K, where w[π] is the ⊗-product of the weights of the transitions along π. T(x, y) = 0̄ if P(I, x, y, F) = ∅. Fig. 0.1 shows a simple transducer with an input string x and an output string y. The only possible path in this transducer is the singleton set P({0}, x, y, {4}), and T(x, y) = 0.0625.

Figure 0.1: Transducer corresponding to a measure.

Regulated weighted transducers are closed under the following rational operations:
- The sum (or union) of two weighted transducers T1 and T2 is defined, for all (x, y) ∈ Σ* × Δ*, by

    (T1 ⊕ T2)(x, y) = T1(x, y) ⊕ T2(x, y)    (2)

- The product (or concatenation) of T1 and T2 is defined by

    (T1 ⊗ T2)(x, y) = ⊕_{x = x1 x2, y = y1 y2} T1(x1, y1) ⊗ T2(x2, y2)    (3)

- The composition of T1 and T2, with matching intermediate alphabets, is the weighted transducer T1 ∘ T2 defined by

    (T1 ∘ T2)(x, y) = ⊕_{z} T1(x, z) ⊗ T2(z, y)    (4)

  whenever this sum is well-defined and in K for all x, y.
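As a sketch of Definition 1 and equation (1) over the real semiring (+, ×), the minimal Python class below sums, over every accepting path labeled (x, y), the product of the initial weight, the transition weights and the final weight. The state names, labels and weights are purely illustrative (the empty string plays the role of ε), and the search assumes the machine has no ε:ε cycles.

```python
class WFST:
    """Minimal weighted finite-state transducer over the (+, *) semiring."""
    def __init__(self, trans, init, final):
        self.trans = trans    # state -> [(in_label, out_label, weight, next_state)]
        self.init = init      # state -> initial weight (lambda)
        self.final = final    # state -> final weight (rho)

    def weight(self, x: str, y: str) -> float:
        """T(x, y): sum over all paths labeled (x, y), as in equation (1)."""
        total = 0.0
        def walk(q, i, j, w):
            nonlocal total
            if i == len(x) and j == len(y) and q in self.final:
                total += w * self.final[q]
            for a, b, tw, r in self.trans.get(q, ()):
                # follow the transition if its labels match the remaining input/output
                if (a or b) and x[i:i + len(a)] == a and y[j:j + len(b)] == b:
                    walk(r, i + len(a), j + len(b), w * tw)
        for q0, lam in self.init.items():
            walk(q0, 0, 0, lam)
        return total

# A linear 5-state machine in the spirit of Fig. 0.1: four transitions of
# weight 0.5 (one emitting epsilon), so the single accepting path has
# weight 0.5**4 = 0.0625, the value computed in the text.
T = WFST(
    trans={0: [("k", "k", 0.5, 1)],
           1: [("t", "t", 0.5, 2)],
           2: [("a", "", 0.5, 3)],
           3: [("b", "b", 0.5, 4)]},
    init={0: 1.0},
    final={4: 1.0},
)
```

Composition per equation (4) behaves the same way path by path; this sketch only implements the path sum of equation (1).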
B. Stemming by Transducers

Arabic differs from other languages syntactically, morphologically and semantically. One of its main characteristics is that most words are built up from roots by following certain fixed patterns¹ and adding prefixes and suffixes. For instance, the Arabic word for school is built from the three-letter root² meaning learn using one of these measures (see Table 4); the suffix denoting female gender is then added.

Table 4: Measures for a three-letter root and the words built from them.

We use measures to construct a transducer that performs stemming. Fig. 0.1 shows the transducer for one measure. This transducer (denoted T_measure) can be used to extract the three-letter root of any Arabic word matching the measure, by applying the composition operation (4). Let T_word be the transducer that maps a given string to itself, i.e., whose only possible path is the singleton set P({s}, word, word, {q}). Fig. 0.2 shows the transducer associated with an Arabic word: Fig. 0.2a gives a weighted version and Fig. 0.2b an unweighted one.

Figure 0.2: Transducer corresponding to a word. (a) Weighted transducer version; (b) unweighted transducer version.

The composition of two transducers is again a transducer:

    (T_word ∘ T_measure)(word, y) = ⊕_z T_word(word, z) ⊗ T_measure(z, y)

Since the only string matching z is z = word, we conclude that

    (T_word ∘ T_measure)(word, y) = T_word(word, word) ⊗ T_measure(word, y)

and since T_word(word, word) = 1̄,

    (T_word ∘ T_measure)(word, y) = T_measure(word, y)

If the word matches the measure, the output projection yields the root (or stem) y associated with it.

In Arabic, there are 4 verb prefixes, 12 noun prefixes and 28 suffixes. When diacritics are taken into account, there are more than 3000 patterns. Since our approach does not consider diacritics, fewer than 200 patterns remain, many of which are not used in Modern Standard Arabic. For example, several distinct diacritized patterns collapse into a single pattern after removing diacritics.

To construct a transducer covering all possible measures, we adopt the following process:
1. Building the transducer of all noun prefixes;
2. Building the transducer of all noun patterns;
3. Building the transducer of all noun suffixes;
4. Concatenating the transducers obtained in steps 1, 2 and 3;
5.

¹ Also called measures or binyan.
² In a pattern, placeholder letters stand for the first, second and third letters of the three-letter root.
Building the transducer of all verb prefixes;
6. Building the transducer of all verb patterns;
7. Building the transducer of all verb suffixes;
8. Concatenating the transducers obtained in steps 5, 6 and 7;
9. Summing the two transducers obtained in steps 4 and 8.

The first and third steps are straightforward: we construct a transducer for each prefix (resp. suffix) and take the union of these transducers; the result is the prefix (resp. suffix) transducer (see Fig. 0.3 and Fig. 0.4). For the second step, we build all possible noun pattern transducers; the sum of these transducers is the transducer of all noun patterns. We do the same

to build the transducer of all verb patterns.

Figure 0.3: Noun and verb prefixes. (a) Noun prefixes; (b) verb prefixes.

The final transducer is obtained by summing (taking the union of) the transducers built in steps 4 and 8. Tables 5 and 6 show some examples of noun and verb patterns. The resulting transducer cannot be represented graphically because of its size (about 400 states); Fig. 0.5 shows its verb-measure part. This transducer can stem any well-formed Arabic word, i.e., any word matching some Arabic measure. In addition, it yields semantic information about the stemmed word, which can be used to improve the quality of the classification system.

Table 5: Examples of noun patterns (3-letter, 4-letter, 5-letter, 6-letter and 7-letter).

Table 6: Examples of verb patterns (3-letter, 4-letter, 3-letter with 1 to 3 added letters, and 4-letter with 1 added letter).

0.4. FRAMEWORK FOR ARABIC TEXT CLASSIFICATION

In the following, we explain how our transducer is used to measure the distance between documents. As mentioned above, our classification system is divided into three components:
1. Preprocessing.
2. Feature extraction: our transducer is applied to each word of the document resulting from step 1, producing a document that is the concatenation of the word stems. A transducer is then built from this document and used in the next step.

Figure 0.4: Noun and verb suffixes.
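Operationally, the concatenation of prefix, pattern and suffix transducers, and the union over all of them, amounts to trying every prefix/suffix split of a word and matching the remainder against measure templates. The sketch below mimics this with plain strings: digits 1, 2, 3 mark the root-letter slots of a measure, and all letters, affixes and templates are transliterated placeholders, not the actual Arabic inventories used by the stemmer.

```python
def match_measure(core: str, template: str):
    """Match `core` against a measure template whose digits 1-3 mark the
    root-letter slots; return the extracted root, or None on mismatch."""
    if len(core) != len(template):
        return None
    slots = {}
    for c, t in zip(core, template):
        if t in "123":
            slots[t] = c          # capture a root letter
        elif c != t:
            return None           # fixed pattern letter must match exactly
    return slots["1"] + slots["2"] + slots["3"] if len(slots) == 3 else None

def stem(word, prefixes, suffixes, templates):
    """Union over all prefix * template * suffix concatenations:
    return the root of the first combination that matches `word`."""
    for p in [""] + list(prefixes):
        for s in [""] + list(suffixes):
            if word.startswith(p) and word.endswith(s) \
               and len(word) >= len(p) + len(s):
                core = word[len(p):len(word) - len(s)] if s else word[len(p):]
                for t in templates:
                    root = match_measure(core, t)
                    if root is not None:
                        return root
    return None
```

For instance, with a made-up school-style template "ma123a" and a feminine suffix "h" (both transliterated assumptions), `stem("madrsah", [], ["h"], ["ma123a"])` extracts the root "drs".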

3. Learning: many algorithms can be used to classify these documents, which are represented by transducers; we use a rational kernel to measure the distance between documents [14], [15].

Figure 0.5: Verb patterns.

0.5. CONCLUSION AND FUTURE DIRECTIONS

This paper presents a new stemming approach for Arabic text classification. It is based on the use of transducers both for stemming words and for measuring the distance between documents. First, the stemming transducer is built by means of Arabic patterns. Second, transducers are also used to calculate distances. Thorough experiments and analysis of this stemmer in the context of Arabic Text Classification are the subject of future work.

REFERENCES

[1] F. Sebastiani and C. N. D. Ricerche, "Machine learning in automated text categorization," ACM Computing Surveys, vol. 34, pp. 1-47, 2002.
[2] "Arabic language," Wikipedia, the free encyclopedia. [Online]. Available: http://ar.wikipedia.org/wiki/
[3] S. Al-Harbi, A. Almuhareb, A. Al-Thubaity, M. S. Khorsheed, and A. Al-Rajeh, "Automatic Arabic text classification," in Proceedings of the 9th International Conference on the Statistical Analysis of Textual Data, March 2008.
[4] R. M. Duwairi, "Arabic text categorization," Int. Arab J. Inf. Technol., vol. 4, no. 2, pp. 125-132, 2007.
[5] T. F. Gharib, M. B. Habib, and Z. T. Fayed, "Arabic text classification using support vector machines," International Journal of Computers and Their Applications, vol. 16, no. 4, pp. 192-199, December 2009.
[6] L. Khreisat, "A machine learning approach for Arabic text classification using N-gram frequency statistics," Journal of Informetrics, vol. 3, no. 1, pp. 72-77, Jan. 2009.
[7] A. M. Mesleh, "Support vector machines based Arabic language text classification system: Feature selection comparative study," in Advances in Computer and Information Sciences and Engineering, T. Sobh, Ed. Dordrecht: Springer Netherlands, 2008, pp. 11-16.
[8] M. M. Syiam, Z. T. Fayed, and M. B. Habib, International Journal of Intelligent Computing and Information Sciences, no. 1, January.
[9] M. Y. Al-Nashashibi, D. Neagu, and A. A. Yaghi, "Stemming techniques for Arabic words: A comparative study," in 2010 2nd International Conference on Computer Technology and Development (ICCTD). IEEE, Nov. 2010, pp. 270-276.
[10] S. Khoja and R. Garside, "Stemming Arabic text," 1999.
[11] H. Al-Serhan, R. A. Shalabi, and G. Kannan, "New approach for extracting Arabic roots," in Proceedings of the 2003 Arab Conference on Information Technology (ACIT 2003), Alexandria, Egypt, Dec. 2003, pp. 42-59.
[12] M. Aljlayl and O. Frieder, "On Arabic search: Improving the retrieval effectiveness via light stemming approach," in ACM Eleventh Conference on Information and Knowledge Management, 2002, pp. 340-347.
[13] J. Berstel, Transductions and Context-Free Languages. Teubner Studienbücher, 1979.
[14] C. Cortes, P. Haffner, and M. Mohri, "Rational kernels: Theory and algorithms," J. Mach. Learn. Res., vol. 5, pp. 1035-1062, December 2004.
[15] C. Cortes, L. Kontorovich, and M. Mohri, "Learning languages with rational kernels," in Proceedings of the 20th Annual Conference on Learning Theory, ser. COLT '07, 2007, pp. 349-364.