Part-of-Speech Tagging for Code-mixed Indian Social Media Text at ICON 2015

Kamal Sarkar
Computer Science & Engineering Dept.
Jadavpur University
Kolkata-700032, India
jukamal200@yahoo.com

ABSTRACT

This paper discusses the experiments carried out by us at Jadavpur University as part of our participation in the ICON 2015 task: POS Tagging for Code-mixed Indian Social Media Text. The tool that we have developed for the task is based on a Trigram Hidden Markov Model that utilizes information from a dictionary as well as some other word-level features to enhance the observation probabilities of known tokens as well as unknown tokens. We submitted runs for the Bengali-English, Hindi-English and Tamil-English language pairs. Our system has been trained and tested on the datasets released for the ICON 2015 shared task: POS Tagging For Code-mixed Indian Social Media Text. In constrained mode, our system obtains an average overall accuracy (averaged over all three language pairs) of 75.60%, which is very close to the two participating systems (76.79% for IIITH and 75.79% for AMRITA_CEN) ranked higher than ours. In unconstrained mode, our system obtains an average overall accuracy of 70.65%, which is also close to that of the system (72.85% for AMRITA_CEN) that obtains the highest average overall accuracy.

Keywords
Part-of-Speech Tagging, Code Mixed, Social Media, HMM.

1. INTRODUCTION

Part-of-Speech (POS) tagging is the task of assigning grammatical categories (noun, verb, adjective etc.) to the words in a natural language sentence [1]. POS tagging can be used in various NLP (Natural Language Processing) applications. Interest in applying NLP methods to the analysis of non-standardized texts, such as social media texts, is growing rapidly [2], because the automatic analysis of social media texts is one of the essential requirements for the task of sentiment analysis [3]. Since social media texts consist of blog comments or chat messages, they differ from standardized texts not only in word usage but also in grammatical structure. This creates the need for adapting NLP methods to the analysis of social media text and, in particular, for the adaptation of POS tagging methods to such text types. Most state-of-the-art taggers have been developed for standardized texts.

This paper presents a description of an HMM (Hidden Markov Model) based system for POS tagging of social media text in Indian languages. The ICON 2015 shared task: POS Tagging For Code-mixed Indian Social Media Text was defined this year to build POS tagger systems for code-mixed Indian social media text for the Bengali-English, Hindi-English and Tamil-English language pairs, for which training data and test data were provided. The data set for a language pair contains social media text written in the languages of the concerned pair. For example, for the Bengali-English language pair, the data set contains social media text written in English and Bengali. We have participated in all three language pairs.

A POS tagger can be developed using both linguistic models and stochastic models. The earliest works on POS tagging [4][5][6] use supervised learning methods. Some research work has already been done on developing POS taggers for standard texts in Indian languages [7]. Dandapat et al. [8] present HMM and Maximum Entropy (ME) based approaches for Bengali POS tagging. Ekbal et al. [9] presented a POS tagger for the Bengali language using Conditional Random Fields (CRF). They also discussed another machine learning based POS tagger using the SVM algorithm in [10]. An unsupervised parts-of-speech tagger for the Bangla language was proposed by Ali in [11]. Chakrabarti [12] has proposed a layered parts-of-speech tagging for Bangla.
A detailed survey on POS tagging for other Indian languages has been presented in [13][14]. A few attempts have also been made at developing POS taggers for code-mixed Indian social media text. A POS tagging system for English-Hindi code-mixed social media content has been presented in [15]. A POS tagging system for Indian social media text on Twitter has been presented in [16].

2. PREPARATION OF TRAINING DATA

The training data released for the ICON 2015 shared task contains three files: one file for the Bengali-English language pair, one for the Hindi-English language pair and one for the Tamil-English language pair. Each line in a file contains a token in the languages of the concerned pair, its language tag and its Part-of-Speech tag. The participants are instructed to produce the output in the same format after testing their systems on the test data, where the test data contains one tab-separated token and the corresponding language tag per line. Our system uses the training file for a language pair and converts each sentence into a sequence of (token, tag) pairs, where each token in this new format is formed by combining the source token with some other information such as the language tag. The details of this format are discussed in the later sections.

3. HMM MODEL FOR POS TAGGING

A POS tagger based on a Hidden Markov Model (HMM) finds the sequence of POS tags t that is optimal for a given observation sequence o. By the application of Bayes' law, the tagging problem becomes equivalent to searching for

\hat{t} = \arg\max_{t} P(o \mid t) P(t)    (1)

where t is a tag sequence, o is an observation sequence, P(t) is the prior probability of the tag sequence and P(o \mid t) is the likelihood of the word sequence.

In general, HMM based POS tagging uses the words in a sentence as the observation sequence [1][7]. But we use some additional information, such as the language tag, for disambiguating each token in the text. We also use some other information, such as whether the token contains a hash tag or not. We use this information in the form of a meta tag (details are presented in the subsequent sections). We also use a small dictionary which contains words with their broad POS categories. If a token is found in the dictionary, we use the broad POS tag as additional information which we combine with the observation token (details are presented in the subsequent sections). Unlike the traditional HMM based POS tagging system, to use this additional information for the POS tagging task we consider a triplet as an observation symbol: <word, meta-tag, language tag>. This is a pseudo token used as an observed symbol; that is, for a sentence of n words, the corresponding observation sequence is (<word_1, meta-tag_1, L-tag_1>, <word_2, meta-tag_2, L-tag_2>, <word_3, meta-tag_3, L-tag_3>, ..., <word_n, meta-tag_n, L-tag_n>). Here an observation symbol o_i corresponds to <word_i, meta-tag_i, L-tag_i>, where L-tag is the language tag and the meta-tag is decided based on the additional information (e.g. the hash tag).

Since Equation (1) is too hard to compute directly, HMM taggers follow the Markov assumption, according to which the probability of a tag depends only on a short history (a small, fixed number of previous tags). For example, a bigram tagger assumes that the probability of a tag depends only on the previous tag. For our proposed trigram model, the probability of a tag depends on the two previous tags, and thus P(t) is computed as

P(t) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1}, t_{i-2})    (2)

Under the assumption that the probability of a word appearing depends only on its own tag, P(o \mid t) can be simplified to

P(o \mid t) \approx \prod_{i=1}^{n} P(o_i \mid t_i)    (3)

Plugging equations (2) and (3) into (1) results in the following equation, by which a bigram tagger estimates the most probable tag sequence:

\hat{t} = \arg\max_{t} P(t \mid o) \approx \arg\max_{t} \prod_{i=1}^{n} P(o_i \mid t_i) P(t_i \mid t_{i-1})    (4)

where the tag transition probability P(t_i \mid t_{i-1}) represents the probability of a tag given the previous tag, and P(o_i \mid t_i) represents the probability of an observed symbol given a tag.
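As an illustration of these observation symbols, the following sketch (Python, a simplified illustration rather than our actual Visual Basic implementation; the field layout and the example tokens and tag values shown are only indicative) converts a language-tagged sentence into the triplet observation sequence, with the meta-tag assignment deferred to the rules of Section 4:

    def make_observation_sequence(sentence, get_meta_tag):
        """sentence: list of (word, language_tag) pairs;
        get_meta_tag: a function implementing the meta-tag rules of Section 4."""
        return [(word, get_meta_tag(word), l_tag) for word, l_tag in sentence]

    # e.g. a Bengali-English sentence might yield
    # [("khub", "YYYY", "BN"), ("bhalo", "YYYY", "BN"), ("#friday", "HB", "EN")]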
Considering a special tag t_{n+1} to indicate the end-of-sentence boundary, two special tags t_{-1} and t_0 at the starting boundary of the sentence, and adding these three special tags to the tag set [4], gives the following equation for POS tagging:

\hat{t} = \arg\max_{t} P(t \mid o) \approx \arg\max_{t} \Big[ \prod_{i=1}^{n} P(o_i \mid t_i) P(t_i \mid t_{i-1}, t_{i-2}) \Big] P(t_{n+1} \mid t_n)    (5)

Equation (5) is still computationally expensive because we need to consider all possible tag sequences of length n, so a dynamic programming approach is used to compute it. In the training phase of HMM based POS tagging, the observation probability matrix and the tag transition probability matrix are created. The general architecture of our developed POS tagger is shown in Figure 1. As we can see from equation (4), to find the most likely tag sequence for an observation sequence, we need to compute two kinds of probabilities: tag transition probabilities and word likelihoods or observation probabilities.

Figure 1. Architecture of our developed HMM based POS tagging system. (Training phase: the language-tagged and POS-tagged training corpus is passed through special-tag assignment (meta tags and broad POS tags from the dictionary) to produce tagged sequences of observation symbols, from which the HMM model is trained. Testing phase: a language-tagged social media sentence is passed through the same special-tag assignment and decoded with the trained HMM to produce a POS tagged sentence.)

Our developed trigram HMM tagger requires computing the tag trigram probability P(t_i \mid t_{i-1}, t_{i-2}), which is estimated by maximum likelihood from tag trigram counts. To overcome the data sparseness problem, the tag trigram probability is smoothed using the deleted interpolation technique [7][4], which combines the maximum likelihood estimates obtained from tag trigram, tag bigram and tag unigram counts. The observation probability of an observed triplet <word, meta-tag, L-tag>, which is the observed symbol in our case, is computed using the following equation [1][7]:

P(o \mid t) = C(o, t) / C(o)    (7)
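These two estimates can be sketched as follows (a simplified Python illustration under assumed data structures, not our actual implementation; the interpolation weights l1, l2 and l3 are assumed to come from the deleted interpolation procedure, which is not reproduced here):

    from collections import Counter

    def train_counts(tagged_corpus):
        """tagged_corpus: list of sentences, each a list of (observation_triplet, pos_tag) pairs."""
        emit, obs, uni, bi, tri = Counter(), Counter(), Counter(), Counter(), Counter()
        for sentence in tagged_corpus:
            tags = ["<s-1>", "<s0>"] + [t for _, t in sentence] + ["</s>"]
            for o, t in sentence:
                emit[(o, t)] += 1        # C(o, t): triplet o observed with tag t
                obs[o] += 1              # C(o): total count of triplet o
            for i in range(1, len(tags)):
                bi[(tags[i - 1], tags[i])] += 1
                if i >= 2:
                    uni[tags[i]] += 1
                    tri[(tags[i - 2], tags[i - 1], tags[i])] += 1
        return emit, obs, uni, bi, tri

    def transition_prob(t2, t1, t, counts, lambdas):
        """Smoothed P(t | t1, t2): interpolation of trigram, bigram and unigram ML estimates."""
        _, _, uni, bi, tri = counts
        l1, l2, l3 = lambdas
        total = sum(uni.values())
        p_tri = tri[(t2, t1, t)] / bi[(t2, t1)] if bi[(t2, t1)] else 0.0
        p_bi = bi[(t1, t)] / uni[t1] if uni[t1] else 0.0
        p_uni = uni[t] / total if total else 0.0
        return l1 * p_tri + l2 * p_bi + l3 * p_uni

    def emission_prob(o, t, counts):
        """Observation probability of triplet o given tag t, as in equation (7)."""
        emit, obs, *_ = counts
        return emit[(o, t)] / obs[o] if obs[o] else 0.0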

3.1 Viterbi Decoding

We have used the Viterbi algorithm to find the best hidden state sequence given an input HMM and a sequence of observation symbols. The Viterbi algorithm is a standard application of the classic dynamic programming algorithm [18]. Given the tag transition probability matrix and the observation probability matrix, Viterbi decoding (used in the testing phase) accepts a sentence from code-mixed social media text and finds the most likely tag sequence for the test sentence, which is also L-tagged and meta-tagged. Here a sentence is submitted to the Viterbi decoder as an observation sequence of triplets: (<word_1, meta-tag_1, L-tag_1>, <word_2, meta-tag_2, L-tag_2>, <word_3, meta-tag_3, L-tag_3>, ..., <word_n, meta-tag_n, L-tag_n>). Here an observation symbol o_i corresponds to <word_i, meta-tag_i, L-tag_i>, where L-tag is the language tag and the meta tag is determined based on the dictionary information and the hash tag feature. After assigning the tag sequence to the observation sequence as mentioned above, the L-tag and meta-tag information are removed from the output, and thus the output for an input sentence is converted to a POS-tagged sentence.

One of the important problems in applying the Viterbi decoding algorithm is how to handle unknown triplets in the input. Unknown triplets are triplets which are not present in the training set and hence whose observation probabilities are not known. To handle this problem, we estimate the observation probability of an unknown triplet by analyzing the L-tag, the meta-tag and the suffix of the word associated with the corresponding triplet. The observation probability of an unknown triplet <word, meta-tag, L-tag> corresponding to a word in the input sentence is decided according to the suffix of a pseudo word formed by adding the L-tag and the meta-tag to the end of the word. We find the observation probabilities of such unknown pseudo words using suffix analysis [7][4] of all rare pseudo words (frequency <= 2) in the training corpus for the concerned language pair.
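A minimal sketch of this decoding step is given below (Python, for illustration only; trans_prob, emis_prob and the suffix-based unknown_prob are assumed to be functions wrapping the trained tables from the previous sketch, and the end-of-sentence transition of equation (5) is omitted for brevity):

    import math

    def viterbi(observations, tagset, trans_prob, emis_prob, unknown_prob):
        """observations: list of <word, meta-tag, L-tag> triplets; returns the best POS tag sequence."""
        best = [{("<s-1>", "<s0>"): (0.0, None)}]        # state = (previous tag, current tag)
        for o in observations:
            column = {}
            for (t2, t1), (score, _) in best[-1].items():
                for t in tagset:
                    e = emis_prob(o, t) or unknown_prob(o, t)   # suffix-based estimate if unseen
                    tp = trans_prob(t2, t1, t)
                    if e <= 0.0 or tp <= 0.0:
                        continue
                    s = score + math.log(tp) + math.log(e)
                    if (t1, t) not in column or s > column[(t1, t)][0]:
                        column[(t1, t)] = (s, t2)               # back-pointer: tag two steps back
            best.append(column)
        # pick the best final state and follow the back-pointers
        (prev_tag, last_tag), _ = max(best[-1].items(), key=lambda kv: kv[1][0])
        tags = [last_tag, prev_tag]
        for i in range(len(observations), 1, -1):
            _, back = best[i][(tags[-1], tags[-2])]
            tags.append(back)
        return list(reversed(tags[:len(observations)]))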

4. SPECIAL TAGS

4.1 Meta Tag

Each token has some properties by which one token differs from another. For example, a token may contain a hash tag, which is frequent in social media text. The meta-tag is assigned as follows:

Meta-tag = "YYYY" (default)
If the first character of the token is a hash symbol (#) then
    meta-tag = "HB"
else if the hash symbol is present in any other position of the token
    meta-tag = "HE"
End If

4.2 Dictionary

In the earlier sections, we have mentioned that we also use some dictionary information as the meta-tag. The meta-tag is set to the value of the broad POS tag for a token after matching it with the dictionary words and retrieving the corresponding broad POS tag found in the dictionary. The description of the dictionaries is shown in Table 1.

Table 1: Description of Dictionaries

Language pair: Broad POS categories; Number of entries in the dictionary (tokens are not normalized)
Bengali-English: Pronoun, verb, conjunction; 92, 79, 60
Hindi-English: Pronoun, verb, conjunction; 274, 85, 56
Tamil-English: Pronoun, verb, conjunction; 203, 633, 56

We follow the following rule for assigning this type of broad POS tag, extracted from the dictionary, to a token:

If the raw token is found in the dictionary and the broad POS tag of the concerned token is XXXX then
    meta-tag = "XXXX"
End if

Since we have used only verbs, pronouns and conjunctions in the dictionaries, XXXX can take one of three values: VERB, PNON and CONJ.
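Taken together, the rules of Sections 4.1 and 4.2 amount to the following small function (an illustrative Python sketch, not our Visual Basic implementation; the precedence of the hash-tag rules over the dictionary lookup and the dictionary being a plain word-to-tag mapping are assumptions, and in constrained mode no dictionary is supplied):

    def get_meta_tag(token, dictionary=None):
        if token.startswith("#"):
            return "HB"                  # hash symbol at the beginning of the token
        if "#" in token:
            return "HE"                  # hash symbol at any other position
        if dictionary and token in dictionary:
            return dictionary[token]     # broad POS tag from the dictionary: "VERB", "PNON" or "CONJ"
        return "YYYY"                    # default meta-tag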
5. EVALUATION AND RESULTS

We train our developed POS tagger separately on the training data and tune the parameters of our system on the training data for the respective language pair. After learning the tuning parameters, we test our system on the test data for the concerned language pair. The description of the data for the three language pairs is shown in Table 2. Our developed POS tagging system has been evaluated using the traditional accuracy measure. For training, tuning and testing our system, we have used the datasets for three different language pairs, Bengali-English, Hindi-English and Tamil-English, released by the organizers of the ICON 2015 shared task: POS Tagging For Code-mixed Indian Social Media Text.

Table 2. The description of the data for the various language pairs

Language pair: Total number of sentences (training data); Total number of sentences (test data)
Bengali-English: 2837; 459
Hindi-English: 729; 377
Tamil-English: 639; 279

The organizers of the shared task released the data in two phases: in the first phase, the training data was released, which was language tagged and POS tagged. In the second phase, the test data was released, which was only language tagged. The contestants were instructed to assign POS tags to the sentences in the test files using their developed systems. The tagged test files were finally sent to the organizers for evaluation. The organizers evaluated the different runs submitted by the various teams and sent the official results to the participating teams. A total of 10 teams submitted their runs for this contest.

For each language pair the contest was run in two different modes: constrained mode and unconstrained mode. In constrained mode, a participating team is only allowed to use the training corpus; no external resource is allowed. In unconstrained mode, a participating team is allowed to use any external resources (POS taggers, NER systems, parsers and additional data) to train its system. In constrained mode, we have not used any dictionary and only the hash tag has been used as the meta-tag. In unconstrained mode, we have used the small dictionary described in Table 1, and the hash tag has also been used as the meta-tag.

The results obtained by our system (team code: KS_JU) are shown in Tables 3 to 8, along with the results obtained by the other participating systems. The second row of each table shows the overall accuracy obtained by the various systems that participated in the contest. We have also evaluated the system based on its consistency across the languages in constrained and unconstrained mode. The average overall accuracy is computed by taking the average of the overall accuracies obtained by the system for all three language pairs in a particular mode. In constrained mode, our system obtains an average overall accuracy (averaged over all three language pairs) of 75.60%, which is very close to the two participating systems (76.79% for IIITH and 75.79% for AMRITA_CEN) ranked higher than ours. In unconstrained mode, our system obtains an average overall accuracy of 70.65%, which is also close to that of the system (72.85% for AMRITA_CEN) that obtains the highest average overall accuracy.
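For example, taking our (KS_JU) overall accuracies from Tables 3, 5 and 7 for constrained mode and from Tables 4, 6 and 8 for unconstrained mode, the averages work out as follows:

constrained: (78.42% + 77.74% + 70.64%) / 3 = 75.60%
unconstrained: (78.29% + 77.60% + 56.05%) / 3 = 70.65%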

Table 3. Official results (Bengali-Constrained mode) obtained by the various systems that participated in the ICON 2015 shared task: POS Tagging For Code-mixed Indian Social Media Text

POS/Categorical: IIITH, AMRITA_CEN, KS_JU, CDACMUMBAI, DD_JU, SN_JU, Amrita
Overall: 79.84%, 78.50%, 78.42%, 75.46%, 75.22%, 72.64%, 0.3%
E: 97.%, 94.22%, 97.%, 97.%, 95.95%, 97.%, 0.00%
@: 00.00%, 93.33%, 93.33%, 93.33%, 86.67%, 86.67%, 0.00%
JJ: 65.25%, 6.2%, 6.92%, 62.72%, 58.9%, 52.46%, 20.5%
N_NST: 80.00%, 80.00%, 80.00%, 80.00%, 0.00%, 80.00%, 0.00%
DT: 95.90%, 96.29%, 94.92%, 95.5%, 93.75%, 94.73%, 0.00%
RD_SYM: 0.00%, 0.00%, 0.00%, 0.00%, 0.00%, 0.00%, 0.00%
RB_AMN: 8.48%, 77.78%, 80.79%, 77.3%, 76.6%, 66.20%, 0.00%
N_NN: 8.26%, 79.80%, 78.8%, 8.73%, 83.56%, 67.66%, .20%
U: 00.00%, 3.64%, 8.82%, 00.00%, 0.00%, 00.00%, 0.00%
RD_RDF: 47.96%, 40.52%, 42.94%, 39.22%, 36.06%, 33.64%, 0.00%
QT_QTF: 48.75%, 55.63%, 57.50%, 56.25%, 53.3%, 50.00%, 0.00%
RP_RPD: 7.24%, 74.5%, 76.47%, 69.28%, 49.02%, 77.78%, 0.00%
N_NNV: 59.68%, 62.90%, 56.45%, 35.48%, 66.3%, 56.45%, 0.00%
V_VM: 79.76%, 8.87%, 78.49%, 80.66%, 74.76%, 7.8%, 0.54%
PR_PRQ: 83.93%, 87.50%, 75.00%, 87.50%, 9.07%, 82.4%, 0.00%
#: 95.35%, 97.67%, 97.67%, 88.37%, 74.42%, 74.42%, 0.00%
PR_PRP: 87.48%, 90.9%, 88.77%, 89.29%, 87.0%, 87.6%, 0.00%
N_NNP: 65.46%, 55.47%, 59.52%, 59.8%, 43.08%, 6.55%, 60.68%
V_VAUX: 39.08%, 3.03%, 35.06%, 27.59%, 20.69%, 30.46%, 0.00%
$: 64.7%, 69.85%, 6.76%, 6.76%, 4.9%, 44.85%, 0.00%
RP_INJ: 53.6%, 50.52%, 60.82%, 54.64%, 26.80%, 49.48%, 0.00%
RB_ALC: 54.4%, 70.59%, 58.82%, 63.24%, 75.00%, 54.4%, 0.00%
DM_DMD: 7.34%, 72.6%, 74.52%, 70.70%, 78.98%, 76.43%, 0.00%
PR_PRF: 55.56%, 77.78%, 44.44%, 55.56%, 77.78%, 66.67%, 0.00%
CC: 82.76%, 85.7%, 85.52%, 83.79%, 83.0%, 8.38%, 0.34%
DM_DMQ: 50.00%, 50.00%, 50.00%, 50.00%, 0.00%, 50.00%, 0.00%
PSP: 87.69%, 89.38%, 92.36%, 90.54%, 87.56%, 89.25%, 3.89%
DM_DMR: 0.00%, 0.00%, 0.00%, 0.00%, 0.00%, 0.00%, 0.00%
RD_PUNC: 98.79%, 99.%, 98.46%, 76.74%, 97.67%, 93.57%, 0.5%
PR_PRL: 60.00%, 80.00%, 80.00%, 80.00%, 60.00%, 40.00%, 0.00%

Table 4. Official results (Bengali-Unconstrained) obtained by the various systems that participated in the ICON 2015 shared task: POS Tagging For Code-mixed Indian Social Media Text

POS/Categorical: KS_JU, AMRITA_CEN, DD_JU
Overall: 78.29%, 76.73%, 47.08%
E: 58.96%, 94.80%, 95.95%
@: 66.67%, 93.33%, 86.67%
JJ: 45.94%, 6.38%, 56.32%
N_NST: 50.00%, 80.00%, 0.00%
DT: 59.96%, 96.29%, 6.72%
RD_SYM: 0.00%, 0.00%, 0.00%
RB_AMN: 53.70%, 80.56%, 0.23%
N_NN: 57.68%, 76.8%, 44.86%
U: 3.82%, 9.09%, 0.00%
RD_RDF: 29.74%, 36.62%, 36.06%
QT_QTF: 4.25%, 54.37%, 53.3%
RP_RPD: 52.94%, 66.67%, 33.33%
N_NNV: 33.87%, 59.68%, 48.39%
V_VM: 48.37%, 79.46%, 5.54%
PR_PRQ: 48.2%, 89.29%, 9.07%
#: 74.42%, 95.35%, 74.42%
PR_PRP: 60.52%, 89.03%, 8.32%
N_NNP: 38.23%, 50.8%, 42.22%
V_VAUX: 6.09%, 35.63%, 20.69%
$: 38.24%, 72.79%, 28.68%
RP_INJ: 25.77%, 60.82%, 4.43%
RB_ALC: 45.59%, 66.8%, 75.00%
DM_DMD: 54.4%, 74.52%, 78.98%
PR_PRF: 33.33%, 00.00%, 77.78%
CC: 59.66%, 83.79%, 73.79%
DM_DMQ: 25.00%, 25.00%, 0.00%
PSP: 59.33%, 88.86%, 6.97%
DM_DMR: 0.00%, 0.00%, 0.00%
RD_PUNC: 64.34%, 98.93%, 97.67%
PR_PRL: 40.00%, 80.00%, 60.00%

Table 5. Official results (Hindi-Constrained) obtained by the various systems that participated in the ICON 2015 shared task: POS Tagging For Code-mixed Indian Social Media Text

POS/Categorical: KS_JU, AMRITA_CEN, IIITH, DD_JU, CDACMUMBAI, SN_JU, Anuj_IITB, Amrita
Overall: 77.74%, 75.58%, 75.04%, 73.6%, 7.%, 68.85%, 64.52%, 3.45%
E: 7.94%, 94.44%, 94.44%, 92.06%, 94.44%, 9.27%, 96.03%, .59%
@: 6.67%, 83.33%, 50.00%, 33.33%, 83.33%, 33.33%, 83.33%, 0.00%
JJ: 9.93%, 52.23%, 56.40%, 54.0%, 56.55%, 55.68%, 64.60%, 0.86%
DT: 5.74%, 93.77%, 92.07%, 90.26%, 90.49%, 9.39%, 86.98%, 0.00%
N_NST: 0.00%, 0.00%, 0.00%, 0.00%, 0.00%, 0.00%, 0.00%, 0.00%
RB_AMN: 5.58%, 75.88%, 76.42%, 78.32%, 77.78%, 65.8%, 69.65%, 0.00%
RD_SYM: 0.00%, 9.67%, 9.67%, 9.67%, 9.67%, 9.67%, 9.67%, 0.00%
N_NN: 3.93%, 79.83%, 82.77%, 8.75%, 82.77%, 7.97%, 48.38%, 20.89%
U: 0.00%, 2.50%, 62.50%, 0.00%, 62.50%, 93.75%, 93.75%, 0.00%
RD_RDF: 0.76%, 4.55%, 3.03%, 3.79%, 3.79%, 4.55%, 3.79%, 0.00%
QT_QTF: 0.00%, 0.00%, 0.00%, 0.00%, 0.00%, 0.00%, 20.00%, 0.00%
RP_RPD: 0.00%, 0.00%, 0.00%, 27.78%, 0.00%, 5.56%, 5.56%, 0.00%
N_NNV: 4.76%, 9.52%, 9.52%, 4.76%, 9.52%, 9.52%, 9.52%, 0.00%
RP_INTF: 0.00%, 0.00%, 0.00%, 0.00%, 0.00%, 0.00%, 0.00%, 0.00%
V_VM: 6.68%, 83.32%, 8.42%, 84.49%, 82.46%, 74.62%, 52.30%, 56.84%
PR_PRQ: 0.00%, 88.89%, 66.67%, 22.22%, 33.33%, 33.33%, 44.44%, 0.00%
#: 20.97%, 00.00%, 00.00%, 00.00%, 00.00%, 80.65%, 00.00%, 0.00%
RD_UNK: 0.00%, 0.00%, 0.00%, 0.00%, 0.00%, 0.00%, 0.00%, 0.00%
PR_PRP: 8.68%, 87.8%, 82.67%, 88.54%, 79.69%, 88.09%, 73.0%, 0.45%
N_NNP: .99%, 67.54%, 69.30%, 67.84%, 53.22%, 35.38%, 69.88%, 2.63%
V_VAUX: 8.98%, 34.04%, 4.3%, 6.38%, 36.4%, 43.26%, 50.35%, .65%
$: 9.8%, 69.6%, 65.89%, 36.45%, 57.94%, 37.38%, 57.0%, 0.00%
RP_INJ: 4.76%, 6.90%, 55.24%, 43.8%, 54.29%, 43.8%, 47.62%, 0.95%
RB_ALC: 0.00%, 6.67%, 6.67%, 0.00%, 6.67%, 0.00%, 6.67%, 0.00%
PR_PRF: 0.00%, 0.00%, 0.00%, 0.00%, 0.00%, 0.00%, 0.00%, 0.00%
CC: .92%, 34.89%, 45.94%, 7.60%, 4.45%, 44.2%, 53.54%, 0.00%
PSP: 9.07%, 75.67%, 62.37%, 82.99%, 69.8%, 62.78%, 58.66%, 0.82%
~: 0.00%, 0.00%, 0.00%, 0.00%, 0.00%, 0.00%, 0.00%, 0.00%
RD_PUNC: 8.44%, 98.30%, 97.85%, 95.85%, 70.52%, 85.33%, 96.22%, 4.44%
PR_PRL: .45%, 0.00%, .45%, 0.00%, .45%, 0.00%, 5.80%, 0.00%

Table 6. Official results (Hindi-Unconstrained) obtained by the various systems that participated in the ICON 2015 shared task: POS Tagging For Code-mixed Indian Social Media Text

POS/Categorical: IIITH, KS_JU, AMRITA_CEN, Rudra_IITB, DD_JU, CDACMUMBAI
Overall: 80.68%, 77.60%, 73.66%, 68.94%, 27.60%, 6.84%
E: 98.4%, 7.94%, 93.65%, 96.03%, 92.06%, 5.56%
@: 83.33%, 6.67%, 66.67%, 50.00%, 33.33%, 6.67%
JJ: 82.88%, 0.36%, 6.73%, 52.37%, 54.82%, 2.45%
DT: 93.54%, 5.52%, 94.%, 87.32%, 76.90%, 2.49%
N_NST: 0.00%, 0.00%, 0.00%, 0.00%, 0.00%, 0.00%
RB_AMN: 89.70%, 5.58%, 79.27%, 53.66%, 0.27%, 5.5%
RD_SYM: 9.67%, 0.00%, 50.00%, 75.00%, 9.67%, 0.00%
N_NN: 88.48%, 4.47%, 8.57%, 7.9%, 4.44%, 26.83%
U: 62.50%, 0.00%, 37.50%, 93.75%, 0.00%, 0.00%
RD_RDF: 3.03%, 0.76%, 8.33%, 2.27%, 3.03%, 0.76%
QT_QTF: 0.00%, 0.00%, 0.00%, 0.00%, 0.00%, 0.00%
RP_RPD: 6.67%, 0.00%, 44.44%, .%, 0.00%, 0.00%
N_NNV: 9.52%, 4.76%, 9.52%, 0.00%, 4.76%, 4.76%
RP_INTF: 0.00%, 0.00%, 0.00%, 0.00%, 0.00%, 0.00%
V_VM: 86.82%, 5.57%, 88.78%, 75.7%, 3.49%, 5.70%
PR_PRQ: 66.67%, 0.00%, .%, 22.22%, 22.22%, 0.00%
#: 00.00%, 20.97%, 00.00%, 90.32%, 00.00%, 0.00%
RD_UNK: 0.00%, 0.00%, 0.00%, 0.00%, 0.00%, 0.00%
PR_PRP: 87.00%, 8.59%, 87.45%, 90.52%, 2.26%, .08%
N_NNP: 7.64%, .99%, 59.94%, 68.7%, 67.84%, 0.29%
V_VAUX: 43.03%, 8.04%, 6.62%, 48.46%, 4.02%, 6.62%
$: 68.22%, 0.28%, 66.36%, 48.60%, 23.36%, 0.00%
RP_INJ: 74.29%, 4.76%, 63.8%, 54.29%, 30.48%, 9.52%
RB_ALC: 0.00%, 0.00%, 0.00%, 0.00%, 0.00%, 0.00%
PR_PRF: 0.00%, 0.00%, 0.00%, 0.00%, 0.00%, 0.00%
CC: 50.26%, .74%, 8.64%, 89.2%, 3.%, 0.52%
PSP: 65.5%, 0.2%, 60.62%, 3.7%, 3.6%, .24%
~: 0.00%, 0.00%, 0.00%, 0.00%, 0.00%, 0.00%
RD_PUNC: 98.5%, 8.37%, 99.%, 97.04%, 95.85%, 5.48%
PR_PRL: .45%, .45%, .45%, 0.00%, 0.00%, 0.00%

Table 7. Official results (Tamil-Constrained) obtained by the various systems that participated in the ICON 2015 shared task: POS Tagging For Code-mixed Indian Social Media Text

POS/Categorical: IIITH, AMRITA_CEN, CDACMUMBAI, KS_JU, DD_JU, SN_JU, Amrita
Overall: 75.48%, 73.30%, 7.04%, 70.64%, 64.83%, 62.44%, 7.07%
N_NNP: 00.00%, 99.09%, 80.9%, 99.09%, 98.64%, 69.55%, 8.64%
PR_PRP: 80.92%, 69.08%, 77.0%, 7.37%, 8.30%, 66.4%, 3.44%
QT_QTO: 55.56%, 00.00%, 62.96%, 8.48%, 96.30%, 70.37%, 0.00%
V_VAUX: 0.00%, 0.00%, 0.00%, 0.00%, 0.00%, 27.27%, 0.00%
JJ: 69.70%, 52.02%, 64.65%, 64.4%, 6.%, 56.57%, 3.54%
RP_INJ: 0.00%, 0.00%, 0.00%, 0.00%, 25.00%, 25.00%, 0.00%
DT: 79.59%, 65.3%, 7.43%, 73.47%, 9.84%, 6.22%, 0.00%
RB_AMN: 59.57%, 46.0%, 59.57%, 53.90%, 43.26%, 43.97%, 7.09%
N_NN: 76.52%, 77.64%, 75.72%, 72.52%, 60.70%, 64.70%, 6.6%
CC: 73.46%, 79.0%, 77.78%, 76.54%, 62.96%, 78.40%, 0.62%
PSP: 66.67%, 52.38%, 49.2%, 50.79%, 58.73%, 60.32%, 0.00%
V_VM: 76.8%, 84.54%, 7.98%, 69.8%, 57.49%, 6.59%, 56.76%
X: 58.06%, 48.39%, 46.77%, 45.6%, 33.87%, 46.77%, 0.00%
RD_PUNC: 0.00%, 0.00%, 0.00%, 0.00%, 0.00%, 0.00%, 0.00%

Table 8. Official results (Tamil-Unconstrained) obtained by the various systems that participated in the ICON 2015 shared task: POS Tagging For Code-mixed Indian Social Media Text

POS/Categorical: AMRITA_CEN, KS_JU, CDACMUMBAI, DD_JU
Overall: 68.6%, 56.05%, 48.03%, 44.2%
N_NNP: 80.9%, 99.09%, 7.73%, 98.64%
PR_PRP: 72.52%, 27.48%, 39.3%, 54.20%
QT_QTO: 74.07%, 8.48%, 5.85%, 96.30%
V_VAUX: 0.00%, 0.00%, 9.09%, 0.00%
JJ: 59.09%, 66.6%, 38.38%, 54.04%
RP_INJ: 50.00%, 0.00%, 0.00%, 0.00%
DT: 77.55%, 63.27%, 38.78%, 83.67%
RB_AMN: 5.77%, 56.03%, 37.59%, 23.40%
N_NN: 68.85%, 7.88%, 90.58%, 6.29%
CC: 80.86%, 32.0%, 75.3%, 60.49%
PSP: 53.97%, 9.05%, 22.22%, 57.4%
V_VM: 69.8%, 4.06%, 7.39%, 42.5%
X: 40.32%, 43.55%, 40.32%, 30.65%
RD_PUNC: 56.25%, 0.00%, 0.00%, 0.00%

6. CONCLUSION

This paper describes a POS tagging system for code-mixed social media text in Indian languages. Features such as dictionary based information and some other word-level features have been introduced into the HMM model. The experimental results show that the performance of our system is comparable with that of the best performing systems that participated in the ICON 2015 task: POS Tagging for Code-mixed Indian Social Media Text. The POS tagging system has been developed on the Visual Basic platform so that a suitable user interface can be designed for novice users. The system has been designed in such a way that changing only the training corpus in a file can make the system portable to other Indian languages.

References

[1] Sarkar, K. and Gayen, V., 2012, November. A practical part-of-speech tagger for Bengali. In Emerging Applications of Information Technology (EAIT), 2012 Third International Conference on (pp. 36-40). IEEE.
[2] Neunerdt, M., Trevisan, B., Reyer, M. and Mathar, R., 2013. Part-of-speech tagging for social media texts. In Language Processing and Knowledge in the Web (pp. 139-150). Springer Berlin Heidelberg.
[3] Trevisan, B., Neunerdt, M. and Jakobs, E.M., 2012. A multi-level annotation model for fine-grained opinion detection in German blog comments. In Proceedings of KONVENS 2012 (pp. 79-88).
[4] Brants, T., 2000. TnT - a statistical part-of-speech tagger. In Proceedings of the 6th Applied Natural Language Processing Conference, pp. 224-231.
[5] Dandapat, S., Sarkar, S. and Basu, A., 2007. Automatic part-of-speech tagging for Bengali: an approach for morphologically rich languages in a poor scenario. In Proceedings of the Association for Computational Linguistics, pp. 221-224.
[6] Ekbal, A., et al., 2007. Bengali part of speech tagging using conditional random field. In Proceedings of the 7th International Symposium on Natural Language Processing (SNLP-2007), Pattaya, Thailand, 13-15 December, pp. 3-36.
[7] Sarkar, K. and Gayen, V., 2013. A trigram HMM-based POS tagger for Indian languages. In Proceedings of the International Conference on Frontiers of Intelligent Computing: Theory and Applications (FICTA) (pp. 205-212). Springer Berlin Heidelberg.
[8] Dandapat, S., Sarkar, S. and Basu, A., 2007. Automatic part-of-speech tagging for Bengali: an approach for morphologically rich languages in a poor scenario. In Proceedings of the Association for Computational Linguistics, pp. 221-224.
[9] Ekbal, A., et al., 2007. Bengali part of speech tagging using conditional random field. In Proceedings of the 7th International Symposium on Natural Language Processing (SNLP-2007), Pattaya, Thailand, 13-15 December, pp. 3-36.
[10] Ekbal, A. and Bandyopadhyay, S., 2008. Part of speech tagging in Bengali using support vector machine. In ICIT-08, IEEE International Conference on Information Technology, pp. 106-111.
[11] Ali, H., 2010. An unsupervised parts-of-speech tagger for the Bangla language. Department of Computer Science, University of British Columbia.
[12] Chakrabarti, D., 2011. Layered parts of speech tagging for Bangla. Language in India (www.languageinindia.com), Special Volume: Problems of Parsing in Indian Languages.
[13] Antony, P. J. and Soman, K. P., 2011. Parts of speech tagging for Indian languages: a literature survey. International Journal of Computer Applications (0975-8887), Volume 34, No. 8.
[14] Kumar, D. and Singh Josan, G., 2010. Part of speech taggers for morphologically rich Indian languages: a survey. International Journal of Computer Applications (0975-8887), Volume 6, No. 5.
[15] Vyas, Y., Gella, S., Sharma, J., Bali, K. and Choudhury, M., 2014, October. POS tagging of English-Hindi code-mixed social media content. In Proceedings of the First Workshop on Codeswitching, EMNLP.
[16] Jamatia, A., Gambäck, B. and Das, A., 2015. Part-of-speech tagging for code-mixed English-Hindi Twitter and Facebook chat messages. In Proceedings of the 10th Recent Advances in Natural Language Processing (RANLP), September, pages 239-248, Bulgaria.
[17] Gayen, V. and Sarkar, K., 2014. A HMM based named entity recognition system for Indian languages: the JU system at ICON 2013. arXiv preprint arXiv:1405.7397.
[18] Jurafsky, D. and Martin, J. H., 2002. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Pearson Education Series.