Part-of-Speech Tagging for Code-mixed Indian Social Media Text at ICON 2015

Kamal Sarkar
Computer Science & Engineering Dept., Jadavpur University, Kolkata-700032, India
jukamal2001@yahoo.com

ABSTRACT
This paper discusses the experiments carried out at Jadavpur University as part of our participation in the ICON 2015 task: POS Tagging for Code-mixed Indian Social Media Text. The tool that we have developed for the task is based on a trigram Hidden Markov Model that utilizes information from a dictionary as well as some other word-level features to enhance the observation probabilities of both known and unknown tokens. We submitted runs for the Bengali-English, Hindi-English and Tamil-English language pairs. Our system has been trained and tested on the datasets released for the ICON 2015 shared task. In constrained mode, our system obtains an average overall accuracy (averaged over all three language pairs) of 75.60%, which is very close to the two participating systems ranked higher than ours (76.79% for IIITH and 75.79% for AMRITA_CEN). In unconstrained mode, our system obtains an average overall accuracy of 70.65%, which is also close to that of the system with the highest average overall accuracy (72.85% for AMRITA_CEN).

Keywords
Part-of-Speech Tagging, Code Mixed, Social Media, HMM.

1. INTRODUCTION
Part-of-Speech (POS) tagging is the task of assigning grammatical categories (noun, verb, adjective etc.) to words in a natural language sentence [1]. POS tagging can be used in various NLP (Natural Language Processing) applications. Interest in applying NLP methods to non-standardized texts, such as social media texts, is growing rapidly [2], because the automatic analysis of social media texts is one of the essential requirements for the task of sentiment analysis [3]. Since social media texts consist of blog comments or chat messages, they differ from standardized texts not only in word usage but also in grammatical structure.
This creates the need for adapting NLP methods to social media text and, in particular, for adapting POS tagging methods to such text types. Most state-of-the-art taggers have been developed for standardized texts. This paper presents a description of an HMM (Hidden Markov Model) based system for POS tagging of social media text in Indian languages. The ICON 2015 shared task, POS Tagging For Code-mixed Indian Social Media Text, was defined to build POS tagger systems for code-mixed Indian social media text in the Bengali-English, Hindi-English and Tamil-English language pairs, for which training data and test data were provided. The data set for a language pair contains social media text written in the languages of the concerned pair. For example, for the Bengali-English language pair, the data set contains social media text written in English and Bengali. We have participated for all three language pairs.

A POS tagger can be developed using both linguistic models and stochastic models. The earliest works on POS tagging [4][5][6] use supervised learning methods. Some research work has already been done on developing POS taggers for standard texts in Indian languages [7]. Dandapat et al. [8] present HMM and Maximum Entropy (ME) based approaches for Bengali POS tagging. Ekbal et al. [9] presented a POS tagger for the Bengali language using Conditional Random Fields (CRF). They also discussed another machine learning based POS tagger using the SVM algorithm in [10]. An unsupervised Parts-of-Speech tagger for the Bangla language was proposed by Ali et al. in [11]. Chakrabarti et al. [12] proposed a layered Parts-of-Speech tagging approach for Bangla. A detailed survey on POS tagging for other Indian languages has been presented in [13][14].

A few attempts have also been made at developing POS taggers for code-mixed Indian social media text. A POS tagging system for English-Hindi code-mixed social media content has been presented in [15]. A POS tagging system for Indian social media text on Twitter has been presented in [16].
2. PREPARATION OF TRAINING DATA
The training data released for the ICON 2015 shared task contains three files: one each for the Bengali-English, Hindi-English and Tamil-English language pairs. Each line in a file contains a token in one of the languages of the concerned pair, its language tag and its Part-of-Speech tag. The participants are instructed to produce the output in the same format after running their system on the test data, where each line of the test data contains a tab-separated token and the corresponding language tag. Our system takes the training file for a language pair and converts each sentence into a sequence of token-tag pairs, where each token in this new format is formed by combining the source token with some other information such as the language tag. The details of this format are discussed in the later sections.

3. HMM MODEL FOR POS TAGGING
A POS tagger based on a Hidden Markov Model (HMM) finds the sequence of POS tags t that is optimal for a given observation sequence o. By the application of Bayes' law, the tagging problem becomes equivalent to searching for $\arg\max_{t} P(o \mid t)\, P(t)$, that is, we need to compute:

$$\hat{t} = \arg\max_{t} P(o \mid t)\, P(t) \qquad (1)$$

where t is a tag sequence, o is an observation sequence, $P(t)$ is the prior probability of the tag sequence and $P(o \mid t)$ is the likelihood of the word sequence.

In general, HMM based POS tagging uses the words in a sentence as the observation sequence [1][7]. We, however, use some additional information, such as the language tag, for disambiguating each token in the text. We also use other information such as whether or not the token contains a hash tag. We use this information in the form of a meta tag (details are presented in the subsequent sections). We also use a small dictionary that lists words with their broad POS categories. If a token is found in the dictionary, we use its broad POS tag as additional information, which we combine with the observation token (details are presented in the subsequent sections).
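The combination of a token with its language tag and meta tag described above can be sketched as follows. This is a minimal illustration, not the system's actual code: the function names, the toy dictionary and the language tag values ("bn", "en", "univ") are hypothetical, and the order in which the hash-tag rule and the dictionary rule are tried is an assumption, since the paper does not state a precedence. The tag values "HB", "HE" and "YYYY" are the ones given in Section 4.1.

```python
from typing import Dict, List, Optional, Tuple

DEFAULT_META = "YYYY"  # default meta tag (Section 4.1)


def meta_tag(token: str, dictionary: Optional[Dict[str, str]] = None) -> str:
    """Return the meta tag of a raw token.

    Assumed precedence: hash-tag rules first, then dictionary lookup.
    """
    if token.startswith("#"):
        return "HB"  # hash symbol is the first character
    if "#" in token:
        return "HE"  # hash symbol occurs at some other position
    if dictionary and token in dictionary:
        return dictionary[token]  # broad POS tag, e.g. VERB, PNON, CONJ
    return DEFAULT_META


def to_observations(
    sentence: List[Tuple[str, str]],
    dictionary: Optional[Dict[str, str]] = None,
) -> List[Tuple[str, str, str]]:
    """Map (token, language tag) pairs to (word, meta tag, language tag)."""
    return [(tok, meta_tag(tok, dictionary), lang) for tok, lang in sentence]


# Hypothetical toy input: tokens with made-up language tags.
toy_dictionary = {"ami": "PNON"}
toy_sentence = [("#kolkata", "univ"), ("ami", "bn"), ("love", "en"), ("you#2", "en")]
observations = to_observations(toy_sentence, toy_dictionary)
```

Passing `dictionary=None` mimics the constrained setting described in Section 5, where only the hash-tag feature contributes to the meta tag.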
Unlike the traditional HMM based POS tagging system, to use this additional information for the POS tagging task, we consider a triplet as an observation symbol: <word, meta-tag, L-tag>. This is a pseudo token used as an observed symbol, that is, for a sentence of n words, the corresponding observation sequence will be as follows:

(<word_1, meta-tag_1, L-tag_1>, <word_2, meta-tag_2, L-tag_2>, <word_3, meta-tag_3, L-tag_3>, ..., <word_n, meta-tag_n, L-tag_n>)

Here an observation symbol $o_i$ corresponds to <word_i, meta-tag_i, L-tag_i>, where L-tag is the language tag and meta-tag is decided based on the additional information (e.g. hash tag).

Since Equation (1) is too hard to compute directly, HMM taggers follow the Markov assumption, according to which the probability of a tag depends only on a short memory (a small, fixed number of previous tags). For example, a bigram tagger considers that the probability of a tag depends only on the previous tag. For our proposed trigram model, the probability of a tag depends on the two previous tags, and thus $P(t)$ is computed as:

$$P(t) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1}, t_{i-2}) \qquad (2)$$

Under the assumption that the probability of a word appearing depends only on its own tag, $P(o \mid t)$ can be simplified to:

$$P(o \mid t) \approx \prod_{i=1}^{n} P(o_i \mid t_i) \qquad (3)$$

Plugging equations (2) and (3) into (1) results in the following equation, by which a trigram tagger estimates the most probable tag sequence:

$$\hat{t} = \arg\max_{t} P(o \mid t)\, P(t) \approx \arg\max_{t} \prod_{i=1}^{n} P(o_i \mid t_i)\, P(t_i \mid t_{i-1}, t_{i-2}) \qquad (4)$$

where the tag transition probabilities $P(t_i \mid t_{i-1}, t_{i-2})$ represent the probability of a tag given the two previous tags, and $P(o_i \mid t_i)$ represents the probability of an observed symbol given a tag.
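To make the factorization concrete, the score that a trigram tagger assigns to one candidate tag sequence can be computed as a sum of log probabilities. The sketch below is only an illustration under stated assumptions: the function name, the dictionary-based probability tables, the boundary symbols "<s0>" and "<s1>" and all probability values are hypothetical toy choices, not the trained model from this paper.

```python
import math
from typing import Dict, Hashable, Sequence, Tuple


def sequence_log_prob(
    observations: Sequence[Hashable],
    tags: Sequence[str],
    emission: Dict[Tuple[Hashable, str], float],
    transition: Dict[Tuple[str, str, str], float],
    start: Tuple[str, str] = ("<s0>", "<s1>"),
) -> float:
    """Return log of prod_i P(o_i | t_i) * P(t_i | t_{i-1}, t_{i-2})."""
    t2, t1 = start  # the two special tags at the sentence start
    total = 0.0
    for o, t in zip(observations, tags):
        total += math.log(emission[(o, t)])        # P(o_i | t_i)
        total += math.log(transition[(t2, t1, t)])  # P(t_i | t_{i-1}, t_{i-2})
        t2, t1 = t1, t
    return total


# Toy observation triplets <word, meta-tag, L-tag> and made-up probabilities.
obs = [("ami", "YYYY", "bn"), ("run", "YYYY", "en")]
toy_emission = {(obs[0], "PR_PRP"): 0.5, (obs[1], "V_VM"): 0.25}
toy_transition = {
    ("<s0>", "<s1>", "PR_PRP"): 0.5,
    ("<s1>", "PR_PRP", "V_VM"): 0.5,
}
log_p = sequence_log_prob(obs, ["PR_PRP", "V_VM"], toy_emission, toy_transition)
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, which is the usual reason for summing logarithms instead of taking the product directly.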
Considering a special tag $t_{n+1}$ to indicate the end-of-sentence boundary and two special tags $t_{-1}$ and $t_0$ at the starting boundary of the sentence, and adding these three special tags to the tag set [4], gives the following equation for POS tagging:

$$\hat{t} = \arg\max_{t} P(o \mid t)\, P(t) \approx \arg\max_{t_1 \ldots t_n} \left[ \prod_{i=1}^{n} P(o_i \mid t_i)\, P(t_i \mid t_{i-1}, t_{i-2}) \right] P(t_{n+1} \mid t_n) \qquad (5)$$

Equation (5) is still computationally expensive because we need to consider all possible tag sequences of length n. So, a dynamic programming approach is used to compute equation (5). In the training phase of HMM based POS tagging, the observation probability matrix and the tag transition probability matrix are created. A general architecture of our developed POS tagger is shown in Figure 1. As we can see from equation (4), to find the most likely tag sequence for an observation sequence, we need to compute two kinds of probabilities: tag transition probabilities and word likelihoods or observation probabilities.

[Figure 1. Architecture of our developed HMM based POS tagging system. Training phase: training corpus (language tagged and POS tagged); assignment of special tags to tokens (meta tag and broad POS tags from dictionary); tagged sequences of observation symbols; training of the HMM based POS tagger; HMM model. Testing phase: social media sentence (language tagged); assignment of special tags to tokens (meta tag and broad POS tags from dictionary); POS tagged sentence.]

Our trigram HMM tagger needs the tag trigram probability $P(t_i \mid t_{i-1}, t_{i-2})$, which is computed as the maximum likelihood estimate from tag trigram counts. To overcome the data sparseness problem, the tag trigram probability is smoothed using the deleted interpolation technique [7][4], which uses the maximum likelihood estimates from the counts of tag trigrams, tag bigrams and tag unigrams. The observation probability of an observed triplet <word, meta-tag, L-tag>, which is the observed symbol in our case, is computed using the following equation [1][7]:

$$P(o \mid t) = \frac{C(o, t)}{C(t)} \qquad (7)$$

where $C(o, t)$ is the number of times the observed symbol o occurs with tag t in the training corpus and $C(t)$ is the number of occurrences of tag t.

3.1 Viterbi Decoding
We have used the Viterbi algorithm to find the best hidden state sequence given an input HMM and a sequence of observation symbols. The Viterbi algorithm is a standard application of the classic dynamic programming approach [18]. Given a tag transition probability matrix and the observation probability matrix, Viterbi decoding (used in the testing phase) accepts a sentence from code-mixed social media text, which is also L-tagged and meta-tagged, and finds the most likely tag sequence for it. Here a sentence is submitted to the Viterbi decoder as an observation sequence of triplets:

(<word_1, meta-tag_1, L-tag_1>, <word_2, meta-tag_2, L-tag_2>, <word_3, meta-tag_3, L-tag_3>, ..., <word_n, meta-tag_n, L-tag_n>)

Here an observation symbol $o_i$ corresponds to <word_i, meta-tag_i, L-tag_i>, where L-tag is a language tag and the meta tag is determined based on the dictionary information and the hash tag feature.
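The decoding step can be illustrated with a compact trigram Viterbi sketch. It is a simplified stand-in rather than the system's implementation: the function names, the callable interfaces `trans` and `emit`, the boundary symbols and the toy probabilities are all hypothetical, smoothing and unknown-word handling are assumed to live inside those callables, and the end-of-sentence term is scored with the same trigram function for brevity, whereas equation (5) conditions it on the last tag only.

```python
import math
from typing import Callable, Dict, Hashable, List, Sequence, Tuple


def viterbi_trigram(
    obs: Sequence[Hashable],
    tags: Sequence[str],
    trans: Callable[[str, str, str], float],  # trans(t2, t1, t) = P(t | t1, t2)
    emit: Callable[[Hashable, str], float],   # emit(o, t) = P(o | t)
    start: str = "<s>",
    stop: str = "</s>",
) -> List[str]:
    """Return the most probable tag sequence under a trigram HMM."""
    def lg(p: float) -> float:
        return math.log(p) if p > 0 else float("-inf")

    if not obs:
        return []
    # best[(u, v)]: best log score of a prefix whose last two tags are u, v.
    best: Dict[Tuple[str, str], float] = {(start, start): 0.0}
    back: List[Dict[Tuple[str, str], str]] = []
    for o in obs:
        new_best: Dict[Tuple[str, str], float] = {}
        pointers: Dict[Tuple[str, str], str] = {}
        for (w, u), score in best.items():
            for v in tags:
                s = score + lg(trans(w, u, v)) + lg(emit(o, v))
                if (u, v) not in new_best or s > new_best[(u, v)]:
                    new_best[(u, v)] = s
                    pointers[(u, v)] = w  # tag that preceded u
        best = new_best
        back.append(pointers)
    # Pick the best final state, including the end-of-sentence transition.
    last = max(best, key=lambda k: best[k] + lg(trans(k[0], k[1], stop)))
    path = list(last)
    for i in range(len(obs) - 1, 1, -1):
        path.insert(0, back[i][(path[0], path[1])])
    return path[-len(obs):]


# Toy model with made-up probabilities over two tags.
def toy_trans(w: str, u: str, v: str) -> float:
    if v == "</s>":
        return 1.0
    if u == "<s>":
        return {"N": 0.6, "V": 0.4}[v]
    if u == "N":
        return {"N": 0.3, "V": 0.7}[v]
    return 0.5


def toy_emit(o: Hashable, t: str) -> float:
    table = {("dogs", "N"): 0.9, ("dogs", "V"): 0.1,
             ("run", "N"): 0.2, ("run", "V"): 0.8}
    return table.get((o, t), 0.0)


best_tags = viterbi_trigram(["dogs", "run", "dogs"], ["N", "V"], toy_trans, toy_emit)
```

The state of the dynamic program is a pair of adjacent tags, so each step costs on the order of the cube of the tag-set size, instead of enumerating every tag sequence of length n.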
After assigning the tag sequence to the observation sequence as mentioned above, the L-tag and meta-tag information are removed from the output, and thus the output for an input sentence is converted to a POS-tagged sentence.

One of the important problems in applying the Viterbi decoding algorithm is how to handle unknown triplets in the input. Unknown triplets are triplets that are not present in the training set, and hence their observation probabilities are not known. To handle this problem, we estimate the observation probability of an unknown triplet by analyzing the L-tag, the meta-tag and the suffix of the word associated with the triplet. The observation probabilities of an unknown triplet <word, meta-tag, L-tag> corresponding to a word in the input sentence are decided according to the suffix of a pseudo word formed by adding the L-tag and the meta-tag to the end of the word. We find the observation probabilities of such unknown pseudo words using suffix analysis [7][4] of all rare pseudo words (frequency <= 2) in the training corpus for the concerned language pair.

4. SPECIAL TAGS
4.1 Meta Tag
Each token has some properties by which one token differs from another. For example, a token may contain a hash tag, which is frequent in social media text. The meta tag is assigned as follows:

    meta-tag = "YYYY" (default)
    if the first character of the token is a hash symbol (#) then
        meta-tag = "HB"
    else if the hash symbol is present in any other position of the token
        meta-tag = "HE"
    end if

4.2 Dictionary
In earlier sections, we mentioned that we also use some dictionary information as the meta-tag. The meta-tag of a token is set to a broad POS tag after matching the token against the dictionary words and retrieving the corresponding broad POS tag found in the dictionary. The description of the dictionaries is shown in Table 1.

Table 1: Description of dictionaries
Language pair    Broad POS categories           Number of entries in the dictionary (tokens are not normalized)
Bengali-English  Pronoun, verb and conjunction  92, 79, 60
Hindi-English    Pronoun, verb, conjunction     274, 85, 56
Tamil-English    Pronoun, verb, conjunction     203, 633, 56

We follow this rule for assigning to a token the broad POS tag extracted from the dictionary:

    if the raw token is found in the dictionary and the broad POS tag of the concerned token is XXXX then
        meta-tag = "XXXX"
    end if

Since we have used only verbs, pronouns and conjunctions in the dictionaries, XXXX can take one of three values: VERB, PNON and CONJ.

5. EVALUATION AND RESULTS
We train our POS tagger separately for each language pair and tune the parameters of our system on the training data for the respective language pair. After learning the tuning parameters, we test our system on the test data for the concerned language pair. The description of the data for the three language pairs is shown in Table 2.

Our POS tagging system has been evaluated using the traditional accuracy measure. For training, tuning and testing our system, we have used the datasets for three different language pairs, Bengali-English, Hindi-English and Tamil-English, released by the organizers of the ICON 2015 shared task: POS Tagging For Code-mixed Indian Social Media Text.

Table 2. The description of the data for various language pairs
Language pair    Total number of sentences
                 Training data  Test data
Bengali-English  2837           459
Hindi-English    729            377
Tamil-English    639            279

The organizers of the shared task released the data in two phases. In the first phase, the training data was released, which was language tagged and POS tagged. In the second phase, the test data was released, which was only language tagged. The contestants were instructed to assign POS tags to the sentences in the test file using their developed systems. The tagged test files were finally sent to the organizers for evaluation. The organizers evaluated the different runs submitted by the various teams and sent the official results to the participating teams. A total of 10 teams submitted their runs for this contest.

For each language pair, the contest was run in two different modes: constrained mode and unconstrained mode. In constrained mode, the participating team is only allowed to use the training corpus; no external resource is allowed. In unconstrained mode, the participating team is allowed to use any external resources (POS tagger, NER, parser, and additional data) to train their system. In constrained mode, we have not used any dictionary, and only the hash tag has been used as the meta-tag. In unconstrained mode, we have used a small dictionary as described in Table 1, and the hash tag has also been used as the meta-tag.

The results obtained by our system (team code: KS_JU) are shown in Tables 3 to 8. The results obtained by the other participating systems are also shown in the tables. The second row of each table shows the overall accuracy obtained by the various systems that participated in the contest.

We have also evaluated the system based on its consistency across the languages in constrained and unconstrained mode. The average overall accuracy is computed by taking the average of the overall accuracy that the system obtained for all three language pairs in a particular mode.
In constrained mode, our system obtains an average overall accuracy (averaged over all three language pairs) of 75.60%, which is very close to the two participating systems ranked higher than ours (76.79% for IIITH and 75.79% for AMRITA_CEN). In unconstrained mode, our system obtains an average overall accuracy of 70.65%, which is also close to that of the system with the highest average overall accuracy (72.85% for AMRITA_CEN).
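The averages quoted above can be checked directly from the "Overall" rows of the official result tables. The snippet below is a small worked check, with a hypothetical helper name; the per-pair accuracies are copied from the Overall rows of Tables 3, 5 and 7 (constrained) and, for KS_JU, Tables 4, 6 and 8 (unconstrained).

```python
from typing import Dict


def average_overall_accuracy(per_pair: Dict[str, float]) -> float:
    """Average the overall accuracy over the language pairs of one mode."""
    return sum(per_pair.values()) / len(per_pair)


# Overall accuracy (%) per language pair, constrained mode.
constrained = {
    "IIITH": {"Bengali-English": 79.84, "Hindi-English": 75.04, "Tamil-English": 75.48},
    "AMRITA_CEN": {"Bengali-English": 78.50, "Hindi-English": 75.58, "Tamil-English": 73.30},
    "KS_JU": {"Bengali-English": 78.42, "Hindi-English": 77.74, "Tamil-English": 70.64},
}
# Overall accuracy (%) per language pair for KS_JU, unconstrained mode.
ks_ju_unconstrained = {
    "Bengali-English": 78.29, "Hindi-English": 77.60, "Tamil-English": 56.05,
}

constrained_avg = {
    team: round(average_overall_accuracy(scores), 2)
    for team, scores in constrained.items()
}
unconstrained_avg = round(average_overall_accuracy(ks_ju_unconstrained), 2)
```

Rounded to two decimals, these reproduce the constrained figures for IIITH, AMRITA_CEN and KS_JU and the unconstrained figure for KS_JU stated in the text.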
Table 3. Official results (Bengali, constrained mode) obtained by the various systems that participated in the ICON 2015 shared task: POS Tagging For Code-mixed Indian Social Media Text
POS/Categorical  IIITH  AMRITA_CEN  KS_JU  CDACMUMBAI  DD_JU  SN_JU  Amrita
Overall  79.84%  78.50%  78.42%  75.46%  75.22%  72.64%  0.3%
E  97.11%  94.22%  97.11%  97.11%  95.95%  97.11%  0.00%
@  100.00%  93.33%  93.33%  93.33%  86.67%  86.67%  0.00%
JJ  65.25%  6.2%  6.92%  62.72%  58.9%  52.46%  20.5%
N_NST  80.00%  80.00%  80.00%  80.00%  0.00%  80.00%  0.00%
DT  95.90%  96.29%  94.92%  95.5%  93.75%  94.73%  0.00%
RD_SYM  0.00%  0.00%  0.00%  0.00%  0.00%  0.00%  0.00%
RB_AMN  8.48%  77.78%  80.79%  77.3%  76.6%  66.20%  0.00%
N_NN  8.26%  79.80%  78.8%  8.73%  83.56%  67.66%  .20%
U  100.00%  3.64%  8.82%  100.00%  0.00%  100.00%  0.00%
RD_RDF  47.96%  40.52%  42.94%  39.22%  36.06%  33.64%  0.00%
QT_QTF  48.75%  55.63%  57.50%  56.25%  53.3%  50.00%  0.00%
RP_RPD  7.24%  74.5%  76.47%  69.28%  49.02%  77.78%  0.00%
N_NNV  59.68%  62.90%  56.45%  35.48%  66.3%  56.45%  0.00%
V_VM  79.76%  8.87%  78.49%  80.66%  74.76%  7.8%  0.54%
PR_PRQ  83.93%  87.50%  75.00%  87.50%  9.07%  82.4%  0.00%
#  95.35%  97.67%  97.67%  88.37%  74.42%  74.42%  0.00%
PR_PRP  87.48%  90.9%  88.77%  89.29%  87.0%  87.6%  0.00%
N_NNP  65.46%  55.47%  59.52%  59.8%  43.08%  6.55%  60.68%
V_VAUX  39.08%  3.03%  35.06%  27.59%  20.69%  30.46%  0.00%
$  64.7%  69.85%  6.76%  6.76%  4.9%  44.85%  0.00%
RP_INJ  53.6%  50.52%  60.82%  54.64%  26.80%  49.48%  0.00%
RB_ALC  54.4%  70.59%  58.82%  63.24%  75.00%  54.4%  0.00%
DM_DMD  7.34%  72.6%  74.52%  70.70%  78.98%  76.43%  0.00%
PR_PRF  55.56%  77.78%  44.44%  55.56%  77.78%  66.67%  0.00%
CC  82.76%  85.7%  85.52%  83.79%  83.0%  8.38%  0.34%
DM_DMQ  50.00%  50.00%  50.00%  50.00%  0.00%  50.00%  0.00%
PSP  87.69%  89.38%  92.36%  90.54%  87.56%  89.25%  3.89%
DM_DMR  0.00%  0.00%  0.00%  0.00%  0.00%  0.00%  0.00%
RD_PUNC  98.79%  99.11%  98.46%  76.74%  97.67%  93.57%  0.5%
PR_PRL  60.00%  80.00%  80.00%  80.00%  60.00%  40.00%  0.00%
Table 4. Official results (Bengali, unconstrained mode) obtained by the various systems that participated in the ICON 2015 shared task: POS Tagging For Code-mixed Indian Social Media Text
POS/Categorical  KS_JU  AMRITA_CEN  DD_JU
Overall  78.29%  76.73%  47.08%
E  58.96%  94.80%  95.95%
@  66.67%  93.33%  86.67%
JJ  45.94%  6.38%  56.32%
N_NST  50.00%  80.00%  0.00%
DT  59.96%  96.29%  6.72%
RD_SYM  0.00%  0.00%  0.00%
RB_AMN  53.70%  80.56%  0.23%
N_NN  57.68%  76.8%  44.86%
U  3.82%  9.09%  0.00%
RD_RDF  29.74%  36.62%  36.06%
QT_QTF  4.25%  54.37%  53.3%
RP_RPD  52.94%  66.67%  33.33%
N_NNV  33.87%  59.68%  48.39%
V_VM  48.37%  79.46%  5.54%
PR_PRQ  48.2%  89.29%  9.07%
#  74.42%  95.35%  74.42%
PR_PRP  60.52%  89.03%  8.32%
N_NNP  38.23%  50.8%  42.22%
V_VAUX  6.09%  35.63%  20.69%
$  38.24%  72.79%  28.68%
RP_INJ  25.77%  60.82%  4.43%
RB_ALC  45.59%  66.8%  75.00%
DM_DMD  54.4%  74.52%  78.98%
PR_PRF  33.33%  100.00%  77.78%
CC  59.66%  83.79%  73.79%
DM_DMQ  25.00%  25.00%  0.00%
PSP  59.33%  88.86%  6.97%
DM_DMR  0.00%  0.00%  0.00%
RD_PUNC  64.34%  98.93%  97.67%
PR_PRL  40.00%  80.00%  60.00%
Table 5. Official results (Hindi, constrained mode) obtained by the various systems that participated in the ICON 2015 shared task: POS Tagging For Code-mixed Indian Social Media Text
POS/Categorical  KS_JU  AMRITA_CEN  IIITH  DD_JU  CDACMUMBAI  SN_JU  Anuj_IITB  Amrita
Overall  77.74%  75.58%  75.04%  73.6%  7.%  68.85%  64.52%  3.45%
E  7.94%  94.44%  94.44%  92.06%  94.44%  9.27%  96.03%  .59%
@  6.67%  83.33%  50.00%  33.33%  83.33%  33.33%  83.33%  0.00%
JJ  9.93%  52.23%  56.40%  54.0%  56.55%  55.68%  64.60%  0.86%
DT  5.74%  93.77%  92.07%  90.26%  90.49%  9.39%  86.98%  0.00%
N_NST  0.00%  0.00%  0.00%  0.00%  0.00%  0.00%  0.00%  0.00%
RB_AMN  5.58%  75.88%  76.42%  78.32%  77.78%  65.8%  69.65%  0.00%
RD_SYM  0.00%  9.67%  9.67%  9.67%  9.67%  9.67%  9.67%  0.00%
N_NN  3.93%  79.83%  82.77%  8.75%  82.77%  7.97%  48.38%  20.89%
U  0.00%  2.50%  62.50%  0.00%  62.50%  93.75%  93.75%  0.00%
RD_RDF  0.76%  4.55%  3.03%  3.79%  3.79%  4.55%  3.79%  0.00%
QT_QTF  0.00%  0.00%  0.00%  0.00%  0.00%  0.00%  20.00%  0.00%
RP_RPD  0.00%  0.00%  0.00%  27.78%  0.00%  5.56%  5.56%  0.00%
N_NNV  4.76%  9.52%  9.52%  4.76%  9.52%  9.52%  9.52%  0.00%
RP_INTF  0.00%  0.00%  0.00%  0.00%  0.00%  0.00%  0.00%  0.00%
V_VM  6.68%  83.32%  8.42%  84.49%  82.46%  74.62%  52.30%  56.84%
PR_PRQ  0.00%  88.89%  66.67%  22.22%  33.33%  33.33%  44.44%  0.00%
#  20.97%  100.00%  100.00%  100.00%  100.00%  80.65%  100.00%  0.00%
RD_UNK  0.00%  0.00%  0.00%  0.00%  0.00%  0.00%  0.00%  0.00%
PR_PRP  8.68%  87.8%  82.67%  88.54%  79.69%  88.09%  73.0%  0.45%
N_NNP  .99%  67.54%  69.30%  67.84%  53.22%  35.38%  69.88%  2.63%
V_VAUX  8.98%  34.04%  4.3%  6.38%  36.4%  43.26%  50.35%  .65%
$  9.8%  69.6%  65.89%  36.45%  57.94%  37.38%  57.0%  0.00%
RP_INJ  4.76%  6.90%  55.24%  43.8%  54.29%  43.8%  47.62%  0.95%
RB_ALC  0.00%  6.67%  6.67%  0.00%  6.67%  0.00%  6.67%  0.00%
PR_PRF  0.00%  0.00%  0.00%  0.00%  0.00%  0.00%  0.00%  0.00%
CC  .92%  34.89%  45.94%  7.60%  4.45%  44.2%  53.54%  0.00%
PSP  9.07%  75.67%  62.37%  82.99%  69.8%  62.78%  58.66%  0.82%
~  0.00%  0.00%  0.00%  0.00%  0.00%  0.00%  0.00%  0.00%
RD_PUNC  8.44%  98.30%  97.85%  95.85%  70.52%  85.33%  96.22%  4.44%
PR_PRL  .45%  0.00%  .45%  0.00%  .45%  0.00%  5.80%  0.00%
Table 6. Official results (Hindi, unconstrained mode) obtained by the various systems that participated in the ICON 2015 shared task: POS Tagging For Code-mixed Indian Social Media Text
POS/Categorical  IIITH  KS_JU  AMRITA_CEN  Rudra_IITB  DD_JU  CDACMUMBAI
Overall  80.68%  77.60%  73.66%  68.94%  27.60%  6.84%
E  98.4%  7.94%  93.65%  96.03%  92.06%  5.56%
@  83.33%  6.67%  66.67%  50.00%  33.33%  6.67%
JJ  82.88%  0.36%  6.73%  52.37%  54.82%  2.45%
DT  93.54%  5.52%  94.11%  87.32%  76.90%  2.49%
N_NST  0.00%  0.00%  0.00%  0.00%  0.00%  0.00%
RB_AMN  89.70%  5.58%  79.27%  53.66%  0.27%  5.5%
RD_SYM  9.67%  0.00%  50.00%  75.00%  9.67%  0.00%
N_NN  88.48%  4.47%  8.57%  7.9%  4.44%  26.83%
U  62.50%  0.00%  37.50%  93.75%  0.00%  0.00%
RD_RDF  3.03%  0.76%  8.33%  2.27%  3.03%  0.76%
QT_QTF  0.00%  0.00%  0.00%  0.00%  0.00%  0.00%
RP_RPD  6.67%  0.00%  44.44%  .%  0.00%  0.00%
N_NNV  9.52%  4.76%  9.52%  0.00%  4.76%  4.76%
RP_INTF  0.00%  0.00%  0.00%  0.00%  0.00%  0.00%
V_VM  86.82%  5.57%  88.78%  75.7%  3.49%  5.70%
PR_PRQ  66.67%  0.00%  .%  22.22%  22.22%  0.00%
#  100.00%  20.97%  100.00%  90.32%  100.00%  0.00%
RD_UNK  0.00%  0.00%  0.00%  0.00%  0.00%  0.00%
PR_PRP  87.00%  8.59%  87.45%  90.52%  2.26%  .08%
N_NNP  7.64%  .99%  59.94%  68.7%  67.84%  0.29%
V_VAUX  43.03%  8.04%  6.62%  48.46%  4.02%  6.62%
$  68.22%  0.28%  66.36%  48.60%  23.36%  0.00%
RP_INJ  74.29%  4.76%  63.8%  54.29%  30.48%  9.52%
RB_ALC  0.00%  0.00%  0.00%  0.00%  0.00%  0.00%
PR_PRF  0.00%  0.00%  0.00%  0.00%  0.00%  0.00%
CC  50.26%  .74%  8.64%  89.2%  3.%  0.52%
PSP  65.5%  0.2%  60.62%  3.7%  3.6%  .24%
~  0.00%  0.00%  0.00%  0.00%  0.00%  0.00%
RD_PUNC  98.5%  8.37%  99.11%  97.04%  95.85%  5.48%
PR_PRL  .45%  .45%  .45%  0.00%  0.00%  0.00%
Table 7. Official results (Tamil, constrained mode) obtained by the various systems that participated in the ICON 2015 shared task: POS Tagging For Code-mixed Indian Social Media Text
POS/Categorical  IIITH  AMRITA_CEN  CDACMUMBAI  KS_JU  DD_JU  SN_JU  Amrita
Overall  75.48%  73.30%  7.04%  70.64%  64.83%  62.44%  7.07%
N_NNP  100.00%  99.09%  80.9%  99.09%  98.64%  69.55%  8.64%
PR_PRP  80.92%  69.08%  77.0%  7.37%  8.30%  66.4%  3.44%
QT_QTO  55.56%  100.00%  62.96%  8.48%  96.30%  70.37%  0.00%
V_VAUX  0.00%  0.00%  0.00%  0.00%  0.00%  27.27%  0.00%
JJ  69.70%  52.02%  64.65%  64.4%  6.%  56.57%  3.54%
RP_INJ  0.00%  0.00%  0.00%  0.00%  25.00%  25.00%  0.00%
DT  79.59%  65.3%  7.43%  73.47%  9.84%  6.22%  0.00%
RB_AMN  59.57%  46.0%  59.57%  53.90%  43.26%  43.97%  7.09%
N_NN  76.52%  77.64%  75.72%  72.52%  60.70%  64.70%  6.6%
CC  73.46%  79.0%  77.78%  76.54%  62.96%  78.40%  0.62%
PSP  66.67%  52.38%  49.2%  50.79%  58.73%  60.32%  0.00%
V_VM  76.8%  84.54%  7.98%  69.8%  57.49%  6.59%  56.76%
X  58.06%  48.39%  46.77%  45.6%  33.87%  46.77%  0.00%
RD_PUNC  0.00%  0.00%  0.00%  0.00%  0.00%  0.00%  0.00%

Table 8. Official results (Tamil, unconstrained mode) obtained by the various systems that participated in the ICON 2015 shared task: POS Tagging For Code-mixed Indian Social Media Text
POS/Categorical  AMRITA_CEN  KS_JU  CDACMUMBAI  DD_JU
Overall  68.6%  56.05%  48.03%  44.2%
N_NNP  80.9%  99.09%  7.73%  98.64%
PR_PRP  72.52%  27.48%  39.3%  54.20%
QT_QTO  74.07%  8.48%  5.85%  96.30%
V_VAUX  0.00%  0.00%  9.09%  0.00%
JJ  59.09%  66.6%  38.38%  54.04%
RP_INJ  50.00%  0.00%  0.00%  0.00%
DT  77.55%  63.27%  38.78%  83.67%
RB_AMN  5.77%  56.03%  37.59%  23.40%
N_NN  68.85%  7.88%  90.58%  6.29%
CC  80.86%  32.0%  75.3%  60.49%
PSP  53.97%  9.05%  22.22%  57.4%
V_VM  69.8%  4.06%  7.39%  42.5%
X  40.32%  43.55%  40.32%  30.65%
RD_PUNC  56.25%  0.00%  0.00%  0.00%
6. CONCLUSION
This paper describes a POS tagging system for code-mixed social media text in Indian languages. Features such as dictionary-based information and some other word-level features have been introduced into the HMM model. The experimental results show that the performance of our system is comparable with that of the best performing systems that participated in the ICON 2015 task: POS Tagging for Code-mixed Indian Social Media Text.

The POS tagging system has been developed on the Visual Basic platform so that a suitable user interface can be designed for novice users. The system has been designed in such a way that only changing the training corpus in a file can make the system portable to other Indian languages.

References
[1] Sarkar, K. and Gayen, V., 2012, November. A practical part-of-speech tagger for Bengali. In Emerging Applications of Information Technology (EAIT), 2012 Third International Conference on (pp. 36-40). IEEE.
[2] Neunerdt, M., Trevisan, B., Reyer, M. and Mathar, R., 2013. Part-of-speech tagging for social media texts. In Language Processing and Knowledge in the Web (pp. 139-150). Springer Berlin Heidelberg.
[3] Trevisan, B., Neunerdt, M. and Jakobs, E.M., 2012. A multi-level annotation model for fine-grained opinion detection in German blog comments. In Proceedings of KONVENS (Vol. 2012, pp. 179-188).
[4] Brants, T., 2000. TnT: A statistical part-of-speech tagger. In Proc. of the 6th Applied NLP Conference, pp. 224-231.
[5] Dandapat, S., Sarkar, S., Basu, A., 2007. Automatic part-of-speech tagging for Bengali: an approach for morphologically rich languages in a poor scenario. Proceedings of the Association for Computational Linguistic, pp. 221-224.
[6] Ekbal, et al., 2007. Bengali part of speech tagging using conditional random field. In Proceedings of the 7th International Symposium of Natural Language Processing (SNLP-2007), Pattaya, Thailand, 13-15 December, pp. 131-136.
[7] Sarkar, K. and Gayen, V., 2013. A Trigram HMM-Based POS Tagger for Indian Languages. In Proceedings of the International Conference on Frontiers of Intelligent Computing: Theory and Applications (FICTA) (pp. 205-212). Springer Berlin Heidelberg.
[8] Dandapat, S., Sarkar, S., Basu, A., 2007. Automatic part-of-speech tagging for Bengali: an approach for morphologically rich languages in a poor scenario. Proceedings of the Association for Computational Linguistic, pp. 221-224.
[9] Ekbal, A., et al., 2007. Bengali part of speech tagging using conditional random field. In Proceedings of the 7th International Symposium of Natural Language Processing (SNLP-2007), Pattaya, Thailand, 13-15 December, pp. 131-136.
[10] Ekbal, A., Bandyopadhyay, S., 2008. Part of speech tagging in Bengali using support vector machine. ICIT-08, IEEE International Conference on Information Technology, pp. 106-111.
[11] Ali, H., 2010. An unsupervised parts-of-speech tagger for the Bangla language. Department of Computer Science, University of British Columbia.
[12] Chakrabarti, D., 2010. Layered parts of speech tagging for Bangla. Language in India, www.languageinindia.com, Special Volume: Problems of Parsing in Indian Languages.
[13] Antony, P. J., Soman, K. P., 2011. Parts of speech tagging for Indian languages: a literature survey. International Journal of Computer Applications (0975-8887), Volume 34, No. 8.
[14] Kumar, D., Singh Josan, G., 2010. Part of speech taggers for morphologically rich Indian languages: a survey. International Journal of Computer Applications (0975-8887), Volume 6, No. 5.
[15] Vyas, Y., Gella, S., Sharma, J., Bali, K. and Choudhury, M., 2014, October. POS tagging of English-Hindi code-mixed social media content. In Proceedings of the First Workshop on Codeswitching, EMNLP.
[16] Jamatia, A., Gambäck, B. and Das, A., 2015. Part-of-Speech Tagging for Code-Mixed English-Hindi Twitter and Facebook Chat Messages. In the Proceeding of 10th Recent Advances of Natural Language Processing (RANLP), September, pages 239-248, Bulgaria.
[17] Gayen, V. and Sarkar, K., 2014. An HMM based named entity recognition system for Indian languages: the JU system at ICON 2013. arXiv preprint arXiv:1405.7397.
[18] Jurafsky, D. and Martin, J. H., 2002. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Pearson Education Series.