An efficient stemming for Arabic Text Classification
Attia Nehar
Département d'informatique, A.T. University, BP. 37G, Laghouat, Algeria
a.nehar@mail.lagh-univ.dz

Djelloul Ziadi
Université de Rouen, Rouen, France
Djelloul.Ziadi@univ-rouen.fr

Hadda Cherroun
Département d'informatique, Laghouat, Algeria
hadda cherroun@mail.lagh-univ.dz

Abstract

Using the N-gram technique without stemming is not appropriate in the context of Arabic Text Classification. We therefore introduce a new stemming technique, which we call approximate-stemming, based on Arabic patterns. The patterns are modeled using transducers, and stemming is done without depending on any dictionary. This stemmer will be used in the context of Arabic Text Classification.

Index Terms: Arabic, classification, kernels, transducers, Arabic patterns

0.1. INTRODUCTION

Text Classification (TC) is the task of automatically sorting a set of documents into one or more categories from a predefined set [1]. Text classification techniques are used in many domains, including mail spam filtering, article indexing, Web searching, automated population of hierarchical catalogues of Web resources, and even automated essay grading.

Arabic is spoken by more than 422 million people, which makes it the fifth most widely used language in the world [2]. Arabic has three forms: Classical Arabic (CA), Modern Standard Arabic (MSA), and Dialectal Arabic (DA). CA includes classical historical liturgical texts and old literature. MSA includes news media and formal speech. DA includes predominantly spoken vernaculars and has no written standards. The Arabic alphabet consists of 28 letters and the Hamza. Three of the letters are vowels and the rest are consonants. Arabic letters have no upper or lower case. Like other Semitic languages, Arabic is written from right to left. Arabic differs from other languages syntactically, morphologically and semantically.
It is a Semitic language whose main characteristic feature is that most words are built up from roots by following certain known patterns and adding prefixes and suffixes. Due to the complexity of the Arabic language, text classification is a challenging task. Many algorithms have been developed to improve the performance of Arabic TC systems [3], [4], [5], [6], [7], [8]. In general, an Arabic text classification system can be divided into three steps:

1. Preprocessing: punctuation marks, diacritics, stop words and non-letters are removed.
2. Feature extraction: a set of features is extracted from the text to represent it in the next step. For instance, Khreisat [6] used the N-gram technique to extract features from documents, while Syiam et al. [8] used stemming.
3. Learning: many supervised algorithms have been used to train systems to classify Arabic text documents, including Support Vector Machines [5], [7], K-Nearest Neighbors [8] and many others. Most algorithms rely on distance measures over the extracted features to decide how similar two documents are.

In the second step, a feature vector is constructed to represent the document in the third step. Many stemming approaches have been developed [9]. Khoja and Garside [10] developed a dictionary-based stemmer. It performs well, but the dictionary needs to be maintained and updated. The stemmer of Al-Serhan et al. [11] finds the three-letter roots of Arabic words without depending on any root dictionary or pattern files. Many Arabic words have the same root but not the same meaning, so stemming two semantically different words to the same root can induce classification errors. To prevent this, light stemming is used in TC algorithms [12]. The main idea behind this technique is that many words generated from the same root have different meanings.
Light-stemming algorithms make several passes over the text, attempting to locate and remove the most frequent prefixes and suffixes from each word. Because words are only lightly reduced, this strategy yields a large number of features. In the third step, many distance measures can be used to calculate the distance between documents from these feature vectors.

In this paper, we introduce a new stemming technique which does not rely on any dictionary. It is based on transducers, which we will also use to measure the distance between documents in our framework.

This paper is organized as follows. Section 0.2 presents in more detail the feature selection techniques, namely brute stemming and light stemming. Our new stemming approach, called approximate-stemming, is described in Section 0.3. Next, in Section 0.4 we summarize the framework in which our new feature selection method is used and explain the kernel similarity measure. Finally, in Section 0.5 we highlight some perspectives.

0.2. STEMMING TECHNIQUES

In the context of TC, stemming is used to reduce the dimensionality of the feature vector. It consists of transforming each Arabic word in the text into its root.

A. Brute Stemming

There are many stemming techniques used in the context of TC. They can be classified into two classes:

Stemming using a dictionary: a dictionary of Arabic word stems is needed. Khoja's stemmer [10] is an example of this class.
Stemming without a dictionary: stems are extracted without depending on any root or pattern files. Al-Serhan et al. [11] give an example of this class.

Khoja's stemmer removes the longest suffix and the longest prefix. It then matches the remaining word against verbal and noun patterns, using a dictionary, to extract the root. The stemmer makes use of many linguistic data files, such as lists of all diacritic characters, punctuation characters, definite articles and stop words. This stemmer performs well but relies on a dictionary which needs to be maintained and updated.

The second technique, due to Al-Serhan et al. [11], finds the three-letter roots of Arabic words without depending on any root or pattern files. Roots are extracted by assigning weights and ranks to the letters that constitute a word. Weights are real numbers in the range 0 to 5. The assignment of weights to letters was determined by a statistical study of Arabic documents. Table 1 gives the letter groups and their assigned weights. The rank of a letter in a word depends on both the length of the word and whether the word contains an odd or even number of letters. Table 2 shows the assignment of ranks to letters, where N is the number of letters in a word. After the rank and weight of every letter in a word have been determined, letter weights are multiplied by letter ranks. The three letters with the smallest product values constitute the root. Table 3 gives an example of using this algorithm.
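The selection rule just described (multiply each letter's weight by its rank, keep the three letters with the smallest products) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `extract_root`, the toy weight table and the toy rank function are hypothetical and do not reproduce the actual values of Tables 1 and 2; the tie-breaking rule and the convention that position 1 is the first letter in logical order are also assumptions.

```python
from typing import Callable


def extract_root(word: str,
                 weight_of: Callable[[str], float],
                 rank_of: Callable[[int, int], float],
                 root_size: int = 3) -> str:
    """Return the `root_size` letters of `word` with the smallest weight * rank."""
    n = len(word)
    if n < root_size:
        raise ValueError("word is shorter than the requested root")
    # Position 1 is the first letter in logical order, i.e. the rightmost
    # letter when Arabic text is displayed (assumed indexing convention).
    scored = [(weight_of(ch) * rank_of(pos, n), pos)
              for pos, ch in enumerate(word, start=1)]
    # Keep the smallest products; ties go to the earlier letter (assumption).
    chosen = sorted(pos for _, pos in sorted(scored)[:root_size])
    # Emit the selected letters in their original word order.
    return "".join(word[pos - 1] for pos in chosen)


# Hypothetical toy data on Latin letters, NOT the values of Tables 1 and 2.
TOY_WEIGHTS = {"a": 5.0, "b": 1.0, "c": 5.0, "d": 0.5, "e": 2.0}


def toy_rank(position: int, length: int) -> float:
    """Toy rank that simply decreases with the position in the word."""
    return float(length - position + 1)


toy_root = extract_root("abcde", TOY_WEIGHTS.__getitem__, toy_rank)
```

With these toy values the products are 25, 4, 15, 1 and 2, so the three smallest belong to `b`, `d` and `e`. Passing the weight and rank tables as functions keeps the sketch independent of the concrete table entries.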
This algorithm, like any other brute stemming algorithm, gives the same stem for two semantically different words. This can decrease the performance of the classification system.

Table 1: Assignment of weights to letters.
Arabic letters | Weight
Rest of the Arabic alphabet | 5

Table 2: Ranks of letters.
Letter position from right | Rank if word length is even | Rank if word length is odd
1 | N | N
2 | N-1 | N-1
... | ... | ...

Table 3: An example of using the Al-Serhan et al. algorithm.
Word | Letters | Weights | Rank | Product | Root

B. Light stemming

In Arabic, many word variants do not have similar meanings or semantics (for example, the word meaning library and the word meaning writer). However, these word variants give the same root if brute stemming is used. Thus, brute stemming affects the meanings of words. Light stemming [12] aims to enhance Text Classification performance while retaining the meanings of words.

0.3. STEMMING USING TRANSDUCERS

In this section, we explain our new stemmer. First, we introduce the notion of weighted transducers. Then, we explain how to build a model for stemming using transducers.

Notation: $\Sigma$ represents a finite alphabet. The length of a string $x \in \Sigma^*$ over that alphabet is denoted by $|x|$ and the complement of a subset $L \subseteq \Sigma^*$ by $\overline{L} = \Sigma^* \setminus L$. $|x|_a$ denotes the number of occurrences of the symbol $a$ in $x$. $\mathbb{K}$ denotes either the set of real numbers $\mathbb{R}$, rational numbers $\mathbb{Q}$ or integers $\mathbb{Z}$.

A. Weighted Transducers

Transducers are finite automata in which each transition is augmented with an output label in addition to the familiar input label. Output labels are concatenated along a path to form an output sequence, just as input labels are. Weighted transducers are finite-state transducers in which each transition carries some weight in addition to the input and output labels.
The weight of a pair of input and output strings $(x, y)$ is obtained by summing the weights of the paths labeled with $(x, y)$. The following definition formally defines weighted transducers [13].

Definition 1: A weighted finite-state transducer $T$ over the semiring $(\mathbb{K}, +, \cdot, 0, 1)$ is an 8-tuple $T = (\Sigma, \Delta, Q, I, F, E, \lambda, \rho)$ where $\Sigma$ is the finite input alphabet of the transducer, $\Delta$ is the finite output alphabet, $Q$ is a finite set of states, $I \subseteq Q$ the set of initial states, $F \subseteq Q$ the set of final states, $E \subseteq Q \times (\Sigma \cup \{\epsilon\}) \times (\Delta \cup \{\epsilon\}) \times \mathbb{K} \times Q$ a finite set of transitions, $\lambda : I \to \mathbb{K}$ the initial weight function, and $\rho : F \to \mathbb{K}$ the final weight function mapping $F$ to $\mathbb{K}$.
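To make Definition 1 concrete, the following minimal Python sketch stores the components of the 8-tuple and computes the weight of a pair (x, y) by summing over all accepting paths labeled with (x, y), using ordinary addition and multiplication on real numbers. The names `Transition`, `WeightedTransducer` and `weight` are illustrative choices, not part of any library or of the original paper. The sketch also assumes single-character symbols and forbids transitions whose input and output labels are both empty, so that the recursion always terminates.

```python
from dataclasses import dataclass
from functools import lru_cache

EPS = ""  # epsilon, the empty label


@dataclass(frozen=True)
class Transition:
    src: int       # origin state
    inp: str       # one input symbol, or EPS
    out: str       # one output symbol, or EPS
    weight: float  # transition weight
    dst: int       # destination state


@dataclass
class WeightedTransducer:
    initial: dict      # initial state -> initial weight (lambda)
    final: dict        # final state -> final weight (rho)
    transitions: list  # list of Transition objects (E)

    def __post_init__(self):
        # Assumption of this sketch: no epsilon:epsilon transitions.
        for t in self.transitions:
            if t.inp == EPS and t.out == EPS:
                raise ValueError("epsilon:epsilon transitions are not supported")

    def weight(self, x: str, y: str) -> float:
        """Sum over accepting paths labeled (x, y) of lambda * path weight * rho."""

        @lru_cache(maxsize=None)
        def go(state: int, i: int, j: int) -> float:
            # Weight of all ways to finish x[i:] / y[j:] starting from `state`.
            total = 0.0
            if i == len(x) and j == len(y):
                total += self.final.get(state, 0.0)
            for t in self.transitions:
                if t.src != state:
                    continue
                if t.inp != EPS and (i >= len(x) or x[i] != t.inp):
                    continue
                if t.out != EPS and (j >= len(y) or y[j] != t.out):
                    continue
                ni = i + (t.inp != EPS)  # advance only if a symbol was read
                nj = j + (t.out != EPS)  # advance only if a symbol was written
                total += t.weight * go(t.dst, ni, nj)
            return total

        return sum(lam * go(q, 0, 0) for q, lam in self.initial.items())


# Toy transducer: deletes a leading "a" (two parallel paths) and copies "b".
toy = WeightedTransducer(
    initial={0: 1.0},
    final={2: 1.0},
    transitions=[
        Transition(0, "a", EPS, 0.5, 1),
        Transition(0, "a", EPS, 0.25, 1),
        Transition(1, "b", "b", 2.0, 2),
    ],
)
toy_weight = toy.weight("ab", "b")
```

In the toy transducer, two paths carry the pair ("ab", "b"), with weights 0.5 · 2.0 and 0.25 · 2.0, so their sum is the weight of the pair. A pair with no accepting path gets weight 0.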
For a path $\pi$ in a transducer, $p[\pi]$ denotes the origin state of that path and $n[\pi]$ its destination state. The set of all paths from the initial states $I$ to the final states $F$ labeled with input string $x$ and output string $y$ is denoted by $P(I, x, y, F)$. A transducer $T$ is regulated if the output weight associated by $T$ to any pair of input-output strings $(x, y)$, given by

$$T(x, y) = \sum_{\pi \in P(I, x, y, F)} \lambda(p[\pi]) \cdot w[\pi] \cdot \rho(n[\pi]) \qquad (1)$$

is well-defined and in $\mathbb{K}$. $T(x, y) = 0$ if $P(I, x, y, F) = \emptyset$. Fig. 0.1 shows an example of a simple transducer, with an input string $x$ and an output string $y$. The set $P(\{0\}, x, y, \{4\})$ contains a single path $\pi$, so by (1) $T(x, y) = \lambda(0) \cdot w[\pi] \cdot \rho(4)$.

Figure 0.1: Transducer corresponding to the measure.

Regulated weighted transducers are closed under the following operations, called rational operations:

The sum (or union) of two weighted transducers $T_1$ and $T_2$ is defined by:

$$\forall (x, y) \in \Sigma^* \times \Sigma^*, \quad (T_1 + T_2)(x, y) = T_1(x, y) + T_2(x, y) \qquad (2)$$

The product (or concatenation) of two weighted transducers $T_1$ and $T_2$ is defined by:

$$\forall (x, y) \in \Sigma^* \times \Sigma^*, \quad (T_1 \cdot T_2)(x, y) = \sum_{x = x_1 x_2,\; y = y_1 y_2} T_1(x_1, y_1) \cdot T_2(x_2, y_2) \qquad (3)$$

The composition of two weighted transducers $T_1$ and $T_2$ with matching input and output alphabets $\Sigma$ is a weighted transducer denoted by $T_1 \circ T_2$ when the sum

$$(T_1 \circ T_2)(x, y) = \sum_{z \in \Sigma^*} T_1(x, z) \cdot T_2(z, y) \qquad (4)$$

is well-defined and in $\mathbb{K}$ for all $x, y \in \Sigma^*$.

B. Stemming by transducers

Arabic differs from other languages syntactically, morphologically and semantically. One of its main characteristic features is that most words are built up from roots by following certain fixed patterns (1) and adding prefixes and suffixes. For instance, the Arabic word for school is built from the three-letter root (2) meaning learn, using one of the measures of Table 4, and then adding the suffix which denotes the feminine gender.

(1) Also called measures or binyan.
(2) In a measure, three designated letters stand for the first, second and third letters of the three-letter root.

Table 4: Measures for the three-letter root and the words built from them.
Measures | Words

We will use measures to construct a transducer which performs stemming. Fig. 0.1 shows the example of one measure. This transducer (denoted by $T_{measure}$) can be used to extract the three-letter root of any Arabic word matching this measure. This is achieved by applying the composition operation (4). We consider $T_{word}$, the transducer which maps the word to itself, i.e., the set $P(\{s\}, word, word, \{q\})$ contains its only path (Fig. 0.2 shows the transducer associated with an Arabic word: Fig. 0.2a gives a weighted version of the transducer and Fig. 0.2b an unweighted one).

Figure 0.2: Transducer corresponding to the word. (a) Weighted transducer version. (b) Unweighted transducer version.

The composition of two transducers is also a transducer:

$$(T_{word} \circ T_{measure})(word, y) = \sum_{z \in \Sigma^*} T_{word}(word, z) \cdot T_{measure}(z, y)$$

Since $T_{word}(word, z)$ is nonzero only for $z = word$, we conclude that:

$$(T_{word} \circ T_{measure})(word, y) = T_{word}(word, word) \cdot T_{measure}(word, y)$$

We have $T_{word}(word, word) = 1$, so:

$$(T_{word} \circ T_{measure})(word, y) = T_{measure}(word, y)$$

If the word matches the measure, the output projection extracts the root (or stem) $y$ associated with the word.

In Arabic, there are 4 verb prefixes, 12 noun prefixes and 28 suffixes. When diacritics are considered, there are more than 3000 patterns. Since we do not consider diacritics in our approach, there are far fewer than 200 patterns, many of which are not used in Modern Standard Arabic. For example, several diacritized patterns collapse into a single pattern once diacritics are removed.

We adopt the following process to construct the transducer, which enables us to include all possible measures:

1. Build the transducer of all noun prefixes;
2. Build the transducer of all noun patterns;
3. Build the transducer of all noun suffixes;
4. Concatenate the transducers obtained in steps 1, 2 and 3;
5. Build the transducer of all verb prefixes;
6. Build the transducer of all verb patterns;
7. Build the transducer of all verb suffixes;
8. Concatenate the transducers obtained in steps 5, 6 and 7;
9. Sum the two transducers obtained in steps 4 and 8.

The first and third steps are very simple. We construct a transducer for each prefix (resp. suffix), then take the union of these transducers. The resulting transducer represents the prefix (resp. suffix) transducer (see Fig. 0.3 and Fig. 0.4). For the second step, we build all possible noun pattern transducers; the sum of these transducers is the transducer of all noun patterns. We do the same to build the transducer of all verb patterns. The final transducer is obtained by taking the sum (union) of the transducers built in steps 4 and 8. Tables 5 and 6 show some examples of noun and verb patterns. The resulting transducer cannot be represented graphically because of its number of states (about 400); Fig. 0.5 shows the verb measures part of this transducer. This transducer can stem any well-formed Arabic word, i.e., a word which matches some Arabic measure. In addition, it can give us semantic information about the stemmed word, which can be used to improve the quality of the classification system.

Figure 0.3: Noun and verb prefixes. (a) Noun prefixes. (b) Verb prefixes.

Table 5: Examples of noun patterns.
Noun patterns: 3-letters | 4-letters | 5-letters | 6-letters | 7-letters

Table 6: Examples of verb patterns.
Verb patterns: 3-letters | 4-letters | 3-letters +1 | 3-letters +2 | 3-letters +3 | 4-letters +1

0.4. FRAMEWORK FOR ARABIC TEXT CLASSIFICATION

In the following, we explain how we will use our transducer to measure the distance between documents. As mentioned above, our classification system is divided into three components:

1. preprocessing step
2. feature extraction: our transducer is applied to each word of the document resulting from step 1, which gives a document consisting of the concatenation of the word stems. Then, a transducer is built from this document and used in the next step.

Figure 0.4: Noun and verb suffixes
Figure 0.5: Verb patterns

3. learning task: many algorithms could be used to classify documents (or transducers). Since the documents are represented by transducers, we will use a rational kernel to measure the distance between documents [14], [15].

0.5. CONCLUSION AND FUTURE DIRECTIONS

This paper presents a new stemming approach for Arabic text classification. It is based on the use of transducers both for word stemming and for measuring the distance between documents. First, the stemming transducer is built by means of Arabic patterns. Second, transducers will also be used to calculate distances. Thorough experiments and analysis of this stemmer in the context of Arabic Text Classification are the object of future work.

REFERENCES

[1] F. Sebastiani and C. N. D. Ricerche, Machine learning in automated text categorization, ACM Computing Surveys, vol. 34, pp. 1-47.
[2] Arabic language - Wikipedia, the free encyclopedia. [Online].
[3] S. Al-Harbi, A. Almuhareb, A. Al-Thubaity, M. S. Khorsheed, and A. Al-Rajeh, Automatic Arabic text classification, in Proceedings of The 9th International Conference on the Statistical Analysis of Textual Data, March.
[4] R. M. Duwairi, Arabic text categorization, Int. Arab J. Inf. Technol., vol. 4, no. 2.
[5] T. F. Gharib, M. B. Habib, and Z. T. Fayed, Arabic text classification using support vector machines, International Journal of Computers and Their Applications, vol. 16, no. 4, December.
[6] L. Khreisat, A machine learning approach for Arabic text classification using n-gram frequency statistics, Journal of Informetrics, vol. 3, no. 1, Jan.
[7] A. M. Mesleh, Support vector machines based Arabic language text classification system: Feature selection comparative study, in Advances in Computer and Information Sciences and Engineering, T. Sobh, Ed. Dordrecht: Springer Netherlands, 2008.
[8] M. M. Syiam, Z. T. Fayed, and M. B. Habib, International Journal of Intelligent Computing and Information Sciences, no. 1, January.
[9] M. Y. Al-Nashashibi, D. Neagu, and A. A. Yaghi, Stemming techniques for Arabic words: A comparative study, in 2nd International Conference on Computer Technology and Development (ICCTD). IEEE, Nov. 2010.
[10] S. Khoja and R. Garside, Stemming Arabic text.
[11] H. AlSerhan, R. A. Shalabi, and G. Kannan, New approach for extracting Arabic roots, in Proceedings of The 2003 Arab Conference on Information Technology (ACIT 2003), Alexandria, Egypt, Dec. 2003.
[12] M. Aljlayl and O. Frieder, On Arabic search: Improving the retrieval effectiveness via light stemming approach, in ACM Eleventh Conference on Information and Knowledge Management, 2002.
[13] J. Berstel, Transductions and Context-Free Languages. Teubner Studienbücher.
[14] C. Cortes, P. Haffner, and M. Mohri, Rational kernels: Theory and algorithms, J. Mach. Learn. Res., vol. 5, December.
[15] C. Cortes, L. Kontorovich, and M. Mohri, Learning languages with rational kernels, in Proceedings of the 20th Annual Conference on Learning Theory, ser. COLT 07, 2007.
More informationLanguage Acquisition Fall 2010/Winter Lexical Categories. Afra Alishahi, Heiner Drenhaus
Language Acquisition Fall 2010/Winter 2011 Lexical Categories Afra Alishahi, Heiner Drenhaus Computational Linguistics and Phonetics Saarland University Children s Sensitivity to Lexical Categories Look,
More informationClass-Discriminative Weighted Distortion Measure for VQ-Based Speaker Identification
Class-Discriminative Weighted Distortion Measure for VQ-Based Speaker Identification Tomi Kinnunen and Ismo Kärkkäinen University of Joensuu, Department of Computer Science, P.O. Box 111, 80101 JOENSUU,
More informationAn Online Handwriting Recognition System For Turkish
An Online Handwriting Recognition System For Turkish Esra Vural, Hakan Erdogan, Kemal Oflazer, Berrin Yanikoglu Sabanci University, Tuzla, Istanbul, Turkey 34956 ABSTRACT Despite recent developments in
More informationEvolutive Neural Net Fuzzy Filtering: Basic Description
Journal of Intelligent Learning Systems and Applications, 2010, 2: 12-18 doi:10.4236/jilsa.2010.21002 Published Online February 2010 (http://www.scirp.org/journal/jilsa) Evolutive Neural Net Fuzzy Filtering:
More informationTABE 9&10. Revised 8/2013- with reference to College and Career Readiness Standards
TABE 9&10 Revised 8/2013- with reference to College and Career Readiness Standards LEVEL E Test 1: Reading Name Class E01- INTERPRET GRAPHIC INFORMATION Signs Maps Graphs Consumer Materials Forms Dictionary
More informationWeb as Corpus. Corpus Linguistics. Web as Corpus 1 / 1. Corpus Linguistics. Web as Corpus. web.pl 3 / 1. Sketch Engine. Corpus Linguistics
(L615) Markus Dickinson Department of Linguistics, Indiana University Spring 2013 The web provides new opportunities for gathering data Viable source of disposable corpora, built ad hoc for specific purposes
More informationGrade 6: Correlated to AGS Basic Math Skills
Grade 6: Correlated to AGS Basic Math Skills Grade 6: Standard 1 Number Sense Students compare and order positive and negative integers, decimals, fractions, and mixed numbers. They find multiples and
More informationSpeech Emotion Recognition Using Support Vector Machine
Speech Emotion Recognition Using Support Vector Machine Yixiong Pan, Peipei Shen and Liping Shen Department of Computer Technology Shanghai JiaoTong University, Shanghai, China panyixiong@sjtu.edu.cn,
More informationNCU IISR English-Korean and English-Chinese Named Entity Transliteration Using Different Grapheme Segmentation Approaches
NCU IISR English-Korean and English-Chinese Named Entity Transliteration Using Different Grapheme Segmentation Approaches Yu-Chun Wang Chun-Kai Wu Richard Tzong-Han Tsai Department of Computer Science
More informationUsing focal point learning to improve human machine tacit coordination
DOI 10.1007/s10458-010-9126-5 Using focal point learning to improve human machine tacit coordination InonZuckerman SaritKraus Jeffrey S. Rosenschein The Author(s) 2010 Abstract We consider an automated
More informationUsing dialogue context to improve parsing performance in dialogue systems
Using dialogue context to improve parsing performance in dialogue systems Ivan Meza-Ruiz and Oliver Lemon School of Informatics, Edinburgh University 2 Buccleuch Place, Edinburgh I.V.Meza-Ruiz@sms.ed.ac.uk,
More informationSARDNET: A Self-Organizing Feature Map for Sequences
SARDNET: A Self-Organizing Feature Map for Sequences Daniel L. James and Risto Miikkulainen Department of Computer Sciences The University of Texas at Austin Austin, TX 78712 dljames,risto~cs.utexas.edu
More informationDeveloping True/False Test Sheet Generating System with Diagnosing Basic Cognitive Ability
Developing True/False Test Sheet Generating System with Diagnosing Basic Cognitive Ability Shih-Bin Chen Dept. of Information and Computer Engineering, Chung-Yuan Christian University Chung-Li, Taiwan
More informationImproved Effects of Word-Retrieval Treatments Subsequent to Addition of the Orthographic Form
Orthographic Form 1 Improved Effects of Word-Retrieval Treatments Subsequent to Addition of the Orthographic Form The development and testing of word-retrieval treatments for aphasia has generally focused
More informationUniversiteit Leiden ICT in Business
Universiteit Leiden ICT in Business Ranking of Multi-Word Terms Name: Ricardo R.M. Blikman Student-no: s1184164 Internal report number: 2012-11 Date: 07/03/2013 1st supervisor: Prof. Dr. J.N. Kok 2nd supervisor:
More informationLearning Methods in Multilingual Speech Recognition
Learning Methods in Multilingual Speech Recognition Hui Lin Department of Electrical Engineering University of Washington Seattle, WA 98125 linhui@u.washington.edu Li Deng, Jasha Droppo, Dong Yu, and Alex
More informationDisambiguation of Thai Personal Name from Online News Articles
Disambiguation of Thai Personal Name from Online News Articles Phaisarn Sutheebanjard Graduate School of Information Technology Siam University Bangkok, Thailand mr.phaisarn@gmail.com Abstract Since online
More informationCross-Lingual Text Categorization
Cross-Lingual Text Categorization Nuria Bel 1, Cornelis H.A. Koster 2, and Marta Villegas 1 1 Grup d Investigació en Lingüística Computacional Universitat de Barcelona, 028 - Barcelona, Spain. {nuria,tona}@gilc.ub.es
More informationNatural Language Processing. George Konidaris
Natural Language Processing George Konidaris gdk@cs.brown.edu Fall 2017 Natural Language Processing Understanding spoken/written sentences in a natural language. Major area of research in AI. Why? Humans
More informationUse of Online Information Resources for Knowledge Organisation in Library and Information Centres: A Case Study of CUSAT
DESIDOC Journal of Library & Information Technology, Vol. 31, No. 1, January 2011, pp. 19-24 2011, DESIDOC Use of Online Information Resources for Knowledge Organisation in Library and Information Centres:
More information1. Introduction. 2. The OMBI database editor
OMBI bilingual lexical resources: Arabic-Dutch / Dutch-Arabic Carole Tiberius, Anna Aalstein, Instituut voor Nederlandse Lexicologie Jan Hoogland, Nederlands Instituut in Marokko (NIMAR) In this paper
More informationExtending Place Value with Whole Numbers to 1,000,000
Grade 4 Mathematics, Quarter 1, Unit 1.1 Extending Place Value with Whole Numbers to 1,000,000 Overview Number of Instructional Days: 10 (1 day = 45 minutes) Content to Be Learned Recognize that a digit
More informationON BEHAVIORAL PROCESS MODEL SIMILARITY MATCHING A CENTROID-BASED APPROACH
MICHAELA BAUMANN, M.SC. ON BEHAVIORAL PROCESS MODEL SIMILARITY MATCHING A CENTROID-BASED APPROACH MICHAELA BAUMANN, MICHAEL HEINRICH BAUMANN, STEFAN JABLONSKI THE TENTH INTERNATIONAL MULTI-CONFERENCE ON
More informationLecture 1: Basic Concepts of Machine Learning
Lecture 1: Basic Concepts of Machine Learning Cognitive Systems - Machine Learning Ute Schmid (lecture) Johannes Rabold (practice) Based on slides prepared March 2005 by Maximilian Röglinger, updated 2010
More informationTest Blueprint. Grade 3 Reading English Standards of Learning
Test Blueprint Grade 3 Reading 2010 English Standards of Learning This revised test blueprint will be effective beginning with the spring 2017 test administration. Notice to Reader In accordance with the
More informationDublin City Schools Mathematics Graded Course of Study GRADE 4
I. Content Standard: Number, Number Sense and Operations Standard Students demonstrate number sense, including an understanding of number systems and reasonable estimates using paper and pencil, technology-supported
More informationProject in the framework of the AIM-WEST project Annotation of MWEs for translation
Project in the framework of the AIM-WEST project Annotation of MWEs for translation 1 Agnès Tutin LIDILEM/LIG Université Grenoble Alpes 30 october 2014 Outline 2 Why annotate MWEs in corpora? A first experiment
More informationSystem Implementation for SemEval-2017 Task 4 Subtask A Based on Interpolated Deep Neural Networks
System Implementation for SemEval-2017 Task 4 Subtask A Based on Interpolated Deep Neural Networks 1 Tzu-Hsuan Yang, 2 Tzu-Hsuan Tseng, and 3 Chia-Ping Chen Department of Computer Science and Engineering
More informationPhonological Processing for Urdu Text to Speech System
Phonological Processing for Urdu Text to Speech System Sarmad Hussain Center for Research in Urdu Language Processing, National University of Computer and Emerging Sciences, B Block, Faisal Town, Lahore,
More informationPython Machine Learning
Python Machine Learning Unlock deeper insights into machine learning with this vital guide to cuttingedge predictive analytics Sebastian Raschka [ PUBLISHING 1 open source I community experience distilled
More informationStatewide Framework Document for:
Statewide Framework Document for: 270301 Standards may be added to this document prior to submission, but may not be removed from the framework to meet state credit equivalency requirements. Performance
More informationSpeech Segmentation Using Probabilistic Phonetic Feature Hierarchy and Support Vector Machines
Speech Segmentation Using Probabilistic Phonetic Feature Hierarchy and Support Vector Machines Amit Juneja and Carol Espy-Wilson Department of Electrical and Computer Engineering University of Maryland,
More informationParsing of part-of-speech tagged Assamese Texts
IJCSI International Journal of Computer Science Issues, Vol. 6, No. 1, 2009 ISSN (Online): 1694-0784 ISSN (Print): 1694-0814 28 Parsing of part-of-speech tagged Assamese Texts Mirzanur Rahman 1, Sufal
More informationLEXICAL COHESION ANALYSIS OF THE ARTICLE WHAT IS A GOOD RESEARCH PROJECT? BY BRIAN PALTRIDGE A JOURNAL ARTICLE
LEXICAL COHESION ANALYSIS OF THE ARTICLE WHAT IS A GOOD RESEARCH PROJECT? BY BRIAN PALTRIDGE A JOURNAL ARTICLE Submitted in partial fulfillment of the requirements for the degree of Sarjana Sastra (S.S.)
More informationPrentice Hall Literature: Timeless Voices, Timeless Themes, Platinum 2000 Correlated to Nebraska Reading/Writing Standards (Grade 10)
Prentice Hall Literature: Timeless Voices, Timeless Themes, Platinum 2000 Nebraska Reading/Writing Standards (Grade 10) 12.1 Reading The standards for grade 1 presume that basic skills in reading have
More informationCROSS-LANGUAGE INFORMATION RETRIEVAL USING PARAFAC2
1 CROSS-LANGUAGE INFORMATION RETRIEVAL USING PARAFAC2 Peter A. Chew, Brett W. Bader, Ahmed Abdelali Proceedings of the 13 th SIGKDD, 2007 Tiago Luís Outline 2 Cross-Language IR (CLIR) Latent Semantic Analysis
More informationELA/ELD Standards Correlation Matrix for ELD Materials Grade 1 Reading
ELA/ELD Correlation Matrix for ELD Materials Grade 1 Reading The English Language Arts (ELA) required for the one hour of English-Language Development (ELD) Materials are listed in Appendix 9-A, Matrix
More informationTarget Language Preposition Selection an Experiment with Transformation-Based Learning and Aligned Bilingual Data
Target Language Preposition Selection an Experiment with Transformation-Based Learning and Aligned Bilingual Data Ebba Gustavii Department of Linguistics and Philology, Uppsala University, Sweden ebbag@stp.ling.uu.se
More informationarxiv: v2 [cs.cv] 30 Mar 2017
Domain Adaptation for Visual Applications: A Comprehensive Survey Gabriela Csurka arxiv:1702.05374v2 [cs.cv] 30 Mar 2017 Abstract The aim of this paper 1 is to give an overview of domain adaptation and
More informationARNE - A tool for Namend Entity Recognition from Arabic Text
24 ARNE - A tool for Namend Entity Recognition from Arabic Text Carolin Shihadeh DFKI Stuhlsatzenhausweg 3 66123 Saarbrücken, Germany carolin.shihadeh@dfki.de Günter Neumann DFKI Stuhlsatzenhausweg 3 66123
More informationOn-Line Data Analytics
International Journal of Computer Applications in Engineering Sciences [VOL I, ISSUE III, SEPTEMBER 2011] [ISSN: 2231-4946] On-Line Data Analytics Yugandhar Vemulapalli #, Devarapalli Raghu *, Raja Jacob
More informationAutomatic document classification of biological literature
BMC Bioinformatics This Provisional PDF corresponds to the article as it appeared upon acceptance. Copyedited and fully formatted PDF and full text (HTML) versions will be made available soon. Automatic
More informationDerivational and Inflectional Morphemes in Pak-Pak Language
Derivational and Inflectional Morphemes in Pak-Pak Language Agustina Situmorang and Tima Mariany Arifin ABSTRACT The objectives of this study are to find out the derivational and inflectional morphemes
More informationCLASSIFICATION OF PROGRAM Critical Elements Analysis 1. High Priority Items Phonemic Awareness Instruction
CLASSIFICATION OF PROGRAM Critical Elements Analysis 1 Program Name: Macmillan/McGraw Hill Reading 2003 Date of Publication: 2003 Publisher: Macmillan/McGraw Hill Reviewer Code: 1. X The program meets
More informationThe Smart/Empire TIPSTER IR System
The Smart/Empire TIPSTER IR System Chris Buckley, Janet Walz Sabir Research, Gaithersburg, MD chrisb,walz@sabir.com Claire Cardie, Scott Mardis, Mandar Mitra, David Pierce, Kiri Wagstaff Department of
More informationDeep search. Enhancing a search bar using machine learning. Ilgün Ilgün & Cedric Reichenbach
#BaselOne7 Deep search Enhancing a search bar using machine learning Ilgün Ilgün & Cedric Reichenbach We are not researchers Outline I. Periscope: A search tool II. Goals III. Deep learning IV. Applying
More information