Ajees A P, Department of Computer Science, Cochin University of Science and Technology, Kochi, India


A POS Tagger for Malayalam using Conditional Random Fields

Ajees A P
Department of Computer Science, Cochin University of Science and Technology, Kochi, India

Sumam Mary Idicula
Department of Computer Science, Cochin University of Science and Technology, Kochi, India

Abstract: POS tagging is one of the preliminary steps in Natural Language Processing (NLP). It is the process of assigning a lexical category to every word in a document based on its definition and context. POS taggers assign syntactic categories to words based on their role in the sentence. Developing a POS tagger requires a standard dataset. Unfortunately, there is no publicly available corpus for Malayalam. In this work, we have prepared a publicly available tagged dataset as well as a CRF-based POS tagger for Malayalam. The proposed system provides better results than the existing methodologies.

Index Terms: POS tagging, Malayalam, CRF, tagged corpus

I. INTRODUCTION

POS tagging is one of the main building blocks in many NLP applications. Resolving the ambiguities involved in the tagging process is a challenging task [1]. There are various POS taggers, which differ in their internal model and in the amount of training they require. All of them can be broadly classified into three groups, namely rule-based, statistical, and machine learning approaches. Among stochastic models, the HMM is the most popular one.

Rule-based approaches use grammar rules to improve the accuracy of tagging. They consist of a two-stage architecture. The first stage assigns all possible POS tags to every word using dictionaries. The second stage uses handcrafted disambiguation rules to find the most suitable tag for each word in the document. The main bottleneck of this approach is building the handcrafted rules, which is a labor-intensive and time-consuming task.

Statistical taggers, on the other hand, make use of a tagged corpus to solve the tagging problem.
They compute the probability of a given word having a specified tag in a particular context. Tag n-gram statistics and word-tag frequencies are used to find the most probable tag sequence for a sentence. POS tagging finds applications in information retrieval, text-to-speech, information extraction, and higher-level NLP tasks such as parsing, semantics, and machine translation.

Annotated corpora are one of the main requirements for many NLP tasks such as information retrieval, machine translation, and question answering [2]. They are a building block for constructing statistical models for natural language processing. Many such corpora are available for different languages across the world, and POS tagging has been studied in those languages because the required resources are available. Malayalam is far behind such languages since it does not have these resources.

II. ABOUT MALAYALAM AND THE CORPUS

Malayalam is one of the four major Dravidian languages and has a rich literary tradition. It is the official language of Kerala and Lakshadweep. Malayalam has inherited a lot from Sanskrit, the language of the Vedas. The influence of Sanskrit is evident in the alphabet, vocabulary, and phonology of Malayalam. Malayalam is a highly agglutinative language [3]. Its highly productive morphology results in the creation of many complex and ambiguous words. Spoken forms of Malayalam differ across different parts of Kerala, even though the literary dialect is uniform throughout the state.

One of the major challenges in processing highly inflected languages like Malayalam is dealing with the frequent morphological variations of root words appearing in the text. The possible transformations through which surface words are formed from stems therefore have to be studied in detail. In Malayalam, about 40% of words are inflected, and a particular root word can form many morphological variants [4].
The high level of inflection on root words causes several problems in developing Natural Language Processing tools.

There is no benchmark corpus available for Malayalam. Earlier works are based on self-developed training and test datasets, and these datasets are not publicly available for experimentation. Hence it is not possible for researchers in this area to compare and validate their results against existing systems. A tagged corpus is also required in other fields of Natural Language Processing such as text-to-speech, information retrieval, and machine translation. These facts motivated us to develop a publicly available tagged corpus for Malayalam, so that other researchers in this area can work on it and report their results.

As part of the dataset preparation, we downloaded a large amount of Malayalam text from available literature, science, and online newspaper sources. The text was converted to the standard UTF-8 encoding, and extraneous symbols and irregularities were removed. A total of 28.7k words were prepared and tagged with the BIS tagset.

TABLE I: Tags and descriptions

Tag          Description               Tag          Description
N_NN         Common noun               RB           Adverb
N_NNP        Proper noun               PSP          Postposition
N_NST        Locative noun             CC_CCD       Co-ordinator
PR_PRP       Personal pronoun          CC_CCS       Subordinator
PR_PRF       Reflexive pronoun         CC_CCS_UT    Quotative
PR_PRL       Relative pronoun          RP_RPD       Default particle
PR_PRC       Reciprocal pronoun        RP_CL        Classifier particle
PR_PRQ       Wh-word                   RP_INJ       Interjection particle
DM_DMD       Deictic demonstrative     RP_INTF      Intensifier particle
DM_DMR       Relative demonstrative    RP_NEG       Negation particle
DM_DMQ       Wh-word                   QT_QTF       General quantifier
V_VM         Main verb                 QT_QTC       Cardinals
V_VM_VF      Finite verb               QT_QTO       Ordinals
V_VM_VNF     Non-finite verb           RD_RDF       Foreign words
V_VM_VINF    Infinitive verb           RD_SYM       Symbol
V_VN         Verbal noun               RD_PUNC      Punctuation
V_VAUX       Auxiliary verb            RD_UNK       Unknown
JJ           Adjective                 RD_ECH       Echo words

Fig. 1: The percentage of most common tags extracted from training corpus

Fig. 2: Graphical representation of CRF

The tagset used for the corpus contains 36 tags from the Bureau of Indian Standards (BIS) tag set. The BIS tagset was developed by the POS tag standardization committee of the Department of Information Technology (DIT), New Delhi, India. It was constructed with the guidance of experts in Natural Language Processing and language technology for Indian languages. The tags and their descriptions are shown in Table I. The tagged corpus is made publicly available at www.cs.cusat.ac.in. The percentage of the most common tags extracted from the tagged corpus is shown in Figure 1.

III. RELATED WORK

POS tagging in Malayalam is still in its infancy compared with other Dravidian languages. Only a few works have been reported so far. The first is a stochastic HMM-based methodology reported by Manju K and Soumya S in 2009 [5]. They used a very small training corpus: only 1400 words were used for training, which was the main bottleneck of their work. The second work was reported from Amrita University, Coimbatore in 2010 [6]. A tagged corpus of more than one lakh (100,000) words was used for training, with the authors' own tagset of 29 tags and an SVM as the learning algorithm. One more work was reported in 2010 from IIITM-K, Trivandrum using TnT and SVMTool [7]. It compared the performance of TnT and SVMTool on Malayalam POS tagging using the IIIT-Hyderabad tagset. Another work was reported by Robert Jesuraj in 2013 [8]. It used a Memory-Based Language Processing (MBLP) approach, which relies on efficient storage of solved examples and similarity-based reasoning. In 2015, a hybrid method of POS tagging was reported by Sunitha C [9], using handcrafted rules as well as bigrams. Apart from these studies, there is hardly any work in the literature that directly addresses the problem of POS tagging in Malayalam, which indicates a big gap in this area of Natural Language Processing.

IV. PROPOSED METHOD

Our objective is to build a POS tagger that tags all the words in a text document with BIS tags. A Conditional Random Field (CRF) is used for tagging. CRFs are probabilistic graphical models for labeling sequential data, and they are capable of predicting multiple variables that depend on each other. They define a conditional probability distribution over tag sequences given a particular word sequence, taking the context into account. The graphical representation of a CRF is shown in Figure 2. The main advantage of CRFs over HMMs is their conditional nature, which relaxes the independence assumptions that HMMs require to ensure tractable inference. CRFs are also free from the label bias problem, a weakness exhibited by Maximum Entropy Markov Models.

The architecture of the system is shown in Figure 3. It contains four main modules. The first is a preprocessing module that takes the tagged text as input and converts it into sequences of sentences and sequences of tags. Figure 4 shows this procedure.

The CRF finds the best tag sequence corresponding to a word sequence as shown in equation 1.

$$\hat{y} = \arg\max_{\bar{y}} P(\bar{y} \mid x; w) \tag{1}$$

$$P(\bar{y} \mid x; w) = \frac{\exp\big(w \cdot F(x, \bar{y})\big)}{\sum_{\bar{y}' \in Y} \exp\big(w \cdot F(x, \bar{y}')\big)} \tag{2}$$

$$F(x, \bar{y}) = \sum_{i} f(y_{i-1}, y_i, x, i) \tag{3}$$

Here $x$ is the observable word sequence and $\bar{y}$ is the corresponding hidden tag sequence. The probability of a tag sequence $\bar{y}$ for a given word sequence $x$ is calculated as shown in equation 2, where $w$ denotes the weight vector, $F$ is the global feature vector, and $Y$ is the set of all candidate tag sequences. The global feature vector is defined by a set of local feature functions, as equation 3 shows. Each local feature function can inspect the entire observation sequence $x$, the current tag $y_i$ and the previous tag $y_{i-1}$ in the tag sequence, and the current position $i$ in the observation sequence. Each component $F_k$ of the global feature vector is computed by summing the local feature function $f_k$ over all $n$ state transitions $(y_{i-1}, y_i)$. Finally, the best tag sequence is decoded using the Viterbi algorithm.
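The scoring and decoding steps in equations 1 to 3 can be illustrated on a toy example. The sketch below is not the authors' implementation: the two feature templates (`trans`, `word`), the weights, the `START` symbol, and the English placeholder tokens are all assumptions chosen for illustration. It computes the sequence score $w \cdot F(x, \bar{y})$, normalises it by brute-force enumeration as in equation 2, and recovers the best sequence with the Viterbi algorithm.

```python
import itertools
import math

START = "<s>"                      # assumed start-of-sentence symbol
TAGSET = ["N_NN", "V_VM_VF"]       # two BIS tags, enough for a toy example
TOY_WORDS = ["dog", "runs"]        # placeholder tokens, not corpus data

# Hypothetical weights w for the active local features.
TOY_WEIGHTS = {
    ("trans", START, "N_NN"): 0.5,
    ("trans", "N_NN", "V_VM_VF"): 1.0,
    ("word", "dog", "N_NN"): 2.0,
    ("word", "runs", "V_VM_VF"): 2.0,
}


def local_score(prev_tag, tag, words, i, weights):
    """w . f(y_{i-1}, y_i, x, i): sum the weights of the active features."""
    active = [("trans", prev_tag, tag), ("word", words[i], tag)]
    return sum(weights.get(feature, 0.0) for feature in active)


def sequence_score(words, tags, weights):
    """w . F(x, y): local scores summed over all positions (equation 3)."""
    total, prev = 0.0, START
    for i, tag in enumerate(tags):
        total += local_score(prev, tag, words, i, weights)
        prev = tag
    return total


def probability(words, tags, weights, tagset):
    """Equation 2, with the normaliser computed by enumerating all sequences."""
    z = sum(
        math.exp(sequence_score(words, list(candidate), weights))
        for candidate in itertools.product(tagset, repeat=len(words))
    )
    return math.exp(sequence_score(words, tags, weights)) / z


def viterbi(words, weights, tagset):
    """Equation 1 by dynamic programming instead of enumeration."""
    if not words:
        return []
    # best[i][t] = (score of the best path ending in tag t at i, previous tag)
    best = [{t: (local_score(START, t, words, 0, weights), None) for t in tagset}]
    for i in range(1, len(words)):
        column = {}
        for t in tagset:
            prev = max(
                tagset,
                key=lambda p: best[i - 1][p][0] + local_score(p, t, words, i, weights),
            )
            score = best[i - 1][prev][0] + local_score(prev, t, words, i, weights)
            column[t] = (score, prev)
        best.append(column)
    # Follow the back-pointers from the best final tag.
    path = [max(tagset, key=lambda t: best[-1][t][0])]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return path[::-1]


best_tags = viterbi(TOY_WORDS, TOY_WEIGHTS, TAGSET)
print(best_tags, probability(TOY_WORDS, best_tags, TOY_WEIGHTS, TAGSET))
```

Brute-force normalisation is only feasible for toy inputs, since the number of candidate sequences grows exponentially with sentence length. Practical CRF toolkits compute the normaliser with the forward algorithm.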
The second module in the architecture is the feature preparation module. Each word of the sentence is sent to this module, which replaces the word with a set of features corresponding to it, as shown in Figure 5. The features we consider for each word are its suffixes of length 2, 3 and 5, the word itself, the previous word, the word before the previous word, the next word, and the word after the next word.

The third module is the training module, where the model parameters are learned. Pycrfsuite [10], a Python binding for CRFsuite, is used for training. After training, the model is saved for testing.

The last module is the testing module, where the saved model is used for testing. The words in the test data are converted into feature vectors in the same way as in the feature preparation stage, so that sequences of words are replaced with sequences of feature vectors. These sequences are then provided to the saved CRF model, which predicts the output tags.

V. EXPERIMENTS AND RESULTS

We used the CUSAT Malayalam corpus for our experiments. Most of the words in the corpus are unique and ambiguous, which makes the tagging problem a difficult one. The most common tag in the training corpus is noun. The preprocessed tagged text is used for training and testing. The dataset is divided into 80% for training and 20% for validation and testing, so the system is trained on 23K words and tested on 5.7K words. Setting the model parameters is an important task for CRF training. We set the coefficient of the L1 penalty to 1.0 and that of the L2 penalty to 1e-3. The model is trained for 50 epochs. The system achieves an accuracy of 91.2% on the test data. An example of text tagged by the CRF tagger is shown in Figure 6. The performance of different tagging algorithms on the CUSAT corpus is shown in Figure 7.
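The feature preparation step and the 80/20 split described above can be sketched as follows. This is a minimal illustration and not the authors' code: the function names, the feature keys, the `<BOS>`/`<EOS>` padding tokens, and the placeholder tokens in the demo are assumptions. Only the feature set itself (suffixes of length 2, 3 and 5, the word, and two words of context on each side) comes from the text.

```python
BOS, EOS = "<BOS>", "<EOS>"  # assumed padding tokens for sentence boundaries


def word_features(sentence, i):
    """Build the feature dictionary for the word at position i."""
    word = sentence[i]
    last = len(sentence) - 1
    return {
        "word": word,
        # Slicing returns the whole word when it is shorter than the suffix.
        "suffix2": word[-2:],
        "suffix3": word[-3:],
        "suffix5": word[-5:],
        "prev_word": sentence[i - 1] if i >= 1 else BOS,
        "prev2_word": sentence[i - 2] if i >= 2 else BOS,
        "next_word": sentence[i + 1] if i + 1 <= last else EOS,
        "next2_word": sentence[i + 2] if i + 2 <= last else EOS,
    }


def sentence_features(sentence):
    """Replace a sequence of words with a sequence of feature dictionaries."""
    return [word_features(sentence, i) for i in range(len(sentence))]


def split_train_test(sentences, train_fraction=0.8):
    """Split a list of sentences into a training part and a held-out part."""
    cut = int(len(sentences) * train_fraction)
    return sentences[:cut], sentences[cut:]


# Placeholder tokens stand in for Malayalam words.
demo = ["alpha", "beta", "gamma"]
for features in sentence_features(demo):
    print(features)
```

With python-crfsuite, sequences of such dictionaries could presumably be passed to `Trainer.append` together with the matching tag sequences, and the reported settings would likely correspond to the `c1` (L1), `c2` (L2) and `max_iterations` training parameters. That mapping is an assumption, since the paper does not list the exact calls.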

Fig. 3: The architecture of the proposed system

Fig. 4: Preprocessing

Fig. 5: Feature set for a single word

Fig. 6: An example of input and tagged output

Fig. 7: The performance of different tagging algorithms on CUSAT corpus

TABLE II: Performance of the tagger on test data (most common tags)

Tag               Precision   Recall   F1-score
Common noun       0.89        0.96     0.92
Nonfinite verb    0.89        0.85     0.87
Punctuation       1.00        1.00     1.00
Adjective         0.91        0.94     0.93
Finite verb       0.94        0.94     0.94
Proper noun       0.81        0.58     0.67
Cardinals         0.94        0.95     0.95
Auxiliary         0.90        0.87     0.88
Adverb            0.91        0.64     0.75
Demonstrative     0.94        0.90     0.92
Postposition      0.95        0.89     0.92

VI. CONCLUSION

In this paper, we have discussed a CRF-based POS tagger for Malayalam, a morphologically rich language. The distinguishing feature of this tagger is its tagging accuracy. Even though different taggers are available for Malayalam, the CRF-based tagger outperforms all the existing methodologies. Incorporating morphological features in training helped to improve the accuracy of tagging. Out-of-vocabulary words are also tagged with the help of the morphological and contextual features of the words. The proposed method can also be applied to other language processing tasks such as Named Entity Recognition, Speech Recognition, and Phrase Chunking.

REFERENCES

[1] Smriti Singh, Kuhoo Gupta, Manish Shrivastava, and Pushpak Bhattacharyya. Morphological richness offsets resource demand: experiences in constructing a POS tagger for Hindi. In Proceedings of the COLING/ACL on Main Conference Poster Sessions, pages 779-786. Association for Computational Linguistics, 2006.
[2] Dan Jurafsky. Speech & language processing. Pearson Education India, 2000.
[3] Wikipedia. Malayalam. 2010 (accessed December 7, 2015).
[4] Ravi Sankar S Nair. A grammar of Malayalam. Language in India, http://www.languageinindia.com/nov2012/ravisankarmalayalamgrammar.pdf, 2012.
[5] K Manju, S Soumya, and Sumam Mary Idicula. Development of a POS tagger for Malayalam: an experience. In Advances in Recent Technologies in Communication and Computing, 2009. ARTCom '09. International Conference on, pages 709-713. IEEE, 2009.
[6] PJ Antony, Santhanu P Mohan, and KP Soman. SVM based part of speech tagger for Malayalam. In Recent Trends in Information, Telecommunication and Computing (ITC), 2010 International Conference on, pages 339-341. IEEE, 2010.
[7] RR Rajeev, Jisha P Jayan, and Elizabeth Serly. Tagging Malayalam text with parts of speech: TnT and SVM tagger comparison. 2010.
[8] Robert Jesuraj and PC Reghu Raj. MBLP approach applied to POS tagging in Malayalam language. Proceedings of NCILC, pages 5-8, 2013.
[9] C Sunitha et al. A hybrid parts of speech tagger for Malayalam language. In Advances in Computing, Communications and Informatics (ICACCI), 2015 International Conference on, pages 1502-1507. IEEE, 2015.
[10] A python binding for crfsuite. https://github.com/scrapinghub/python-crfsuite. Accessed: 2017-09-30.