
Indian Institute of Technology, Kanpur
Course Project - CS671A

POS Tagging of Code Mixed Text

Ayushman Sisodiya (12188) {ayushmn@iitk.ac.in}
Donthu Vamsi Krishna (15111016) {vamsi@iitk.ac.in}
Sandeep Kumar Begad (12612) {sandeekb@iitk.ac.in}

Mentor - Prof. Amitabha Mukerjee

Abstract

Social media in today's world holds an enormous amount of data. Companies use this data for targeted advertising, announcing promotions, and similar purposes. The problem starts in bilingual or multilingual populations, where many people tend to use multiple languages in the same sentence. Analysis of such text opens up a whole new field of study.

Contents

1 Introduction
2 Previous work
3 Dataset
4 Theory
  4.1 Support Vector Machine
  4.2 Naive Bayes Classifier
  4.3 Decision Trees
  4.4 Logistic Classifier
  4.5 Conditional Random Field
5 Methodology
  5.1 Language Identification
  5.2 Back-transliteration
  5.3 POS Tagging
6 Results
7 Error Analysis and Conclusions
8 Acknowledgement

1 Introduction

Mixing of languages is called code mixing. Code mixing occurs for various reasons. According to a work by Hidayat (2012)[4], An Analysis of Code Switching Used by Facebookers: a case study in a social network site, the major reasons for code switching are:

45%: Real lexical needs. For instance, if someone is thinking of some object but cannot recall the word in the language he or she is already using, he or she will tend to switch to a language that has the appropriate word.

40%: Talking about a particular topic. People tend to talk about some topics in their mother tongue (like food), and while discussing science they generally tend to switch to English.

5%: Content clarification. While explaining something, code switching is used to make the topic clearer to the audience.

An older work by Dewaele (2010)[2] found that strong emotional arousal also increases code mixing frequency.

Social media contains valuable information, but the presence of the above kinds of code mixed text increases the complexity of analysing the data. Even today there are no proper tools that deal with this type of data, primarily because no proper corpus has been acquired. This project proposes a model that POS tags code mixed text, which can then be used for various tasks in Natural Language Processing.

2 Previous work

Not much work has been done on POS tagging of code mixed text. We came across only one related paper, POS Tagging of English-Hindi Code-Mixed Social Media Content by Vyas et al.[7]. They used word-level language identification with a logistic classifier, and to take context into account they calculated context switching probabilities. We take a similar approach for language identification, but to include context we employ a conditional random field.

3 Dataset

For the language identification part using 1-5 character n-gram vectors, we extracted the top 5,000 Hindi words and the top 5,000 English words from the FIRE 2013 shared task dataset for training, and 1,000 Hindi and 1,000 English words from the same dataset for testing. For the remaining part, we manually extracted 100 sentences from the Facebook pages of Bollywood actors, namely Shahrukh Khan, Amir Khan, etc. While extracting the sentences, we ensured that each sentence has at least 5 words and contains code mixed text. Considering the context of every word in the sentence and the base language of the sentence, we then manually tagged each word with its language and its POS tag. After tagging, the structure looks like this:

word / Language (E/H) / POS tag

Example: kolkata/h/noun kaa/h/adp charm/e/noun ur/h/conj busy/e/adj life/e/noun mujhe/h/pron behad/h/adj pasand/h/verb hai/h/verb

The tags are therefore "/" separated, the words are space separated and the sentences are line separated.

4 Theory

This section covers the theoretical concepts involved in the project.

4.1 Support Vector Machine

A support vector machine, popularly known as an SVM, is a supervised learning technique that is trained with a learning algorithm such as gradient descent and is used for tasks like classification, pattern recognition and regression. A linear SVM creates a hyperplane which separates the n-dimensional data. Figure 1 shows a trained SVM.

Figure 1: Illustration of an SVM. Source: Mathieu's log, http://www.mblondel.org/journal/2010/09/19/support-vector-machines-in-python/

4.2 Naive Bayes Classifier

Naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features. Bayes' theorem:

P(A|B) = P(A) P(B|A) / P(B)    (1)

4.3 Decision Trees

In these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels.
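Returning briefly to the annotated corpus format above (Section 3): it can be read with a few lines of Python. The sketch below is ours, not part of the project pipeline, and the file name is only a placeholder.

    def read_corpus(path):
        # One sentence per line; words are space separated; within each word the
        # surface form, language tag (e/h) and POS tag are "/" separated.
        sentences = []
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                tokens = []
                for item in line.split():
                    word, lang, pos = item.rsplit("/", 2)
                    tokens.append((word, lang, pos))
                if tokens:
                    sentences.append(tokens)
        return sentences

    # read_corpus("codemixed_sentences.txt")[0] would start with
    # [("kolkata", "h", "noun"), ("kaa", "h", "adp"), ("charm", "e", "noun"), ...]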

Figure 2: A tree showing survival of passengers on the Titanic. Source: Wikipedia

4.4 Logistic Classifier

A logistic classifier, for the simple case where the output can take only two values, 0 or 1 (True or False), can be modelled as below:

P(Y = 1 | X) = 1 / (1 + exp(w_0 + sum_{i=1}^{n} w_i x_i))    (2)

P(Y = 0 | X) = exp(w_0 + sum_{i=1}^{n} w_i x_i) / (1 + exp(w_0 + sum_{i=1}^{n} w_i x_i))    (3)

4.5 Conditional Random Field

A conditional random field (CRF) is similar to a Hidden Markov Model (HMM), except that where an HMM predicts an output from the current feature alone, a CRF is also given the previous features along with the present one, so that context is included.

Figure 3: Similarity between CRF and HMM. Source: An Introduction to CRFs for Relational Learning by Charles Sutton and Andrew McCallum, Univ. of Massachusetts
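To make the classifiers above concrete, the following is a minimal sketch of word-level language identification with 1-5 character n-gram features and a logistic classifier. It is our illustration, not the project code: it assumes scikit-learn is available, and the toy word lists stand in for the FIRE 2013 lists of Section 3.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Toy stand-ins for the 5,000-word Hindi and English lists of Section 3.
    hindi_words = ["mujhe", "behad", "pasand", "hai", "kaa"]
    english_words = ["charm", "busy", "life", "which", "project"]
    words = hindi_words + english_words
    labels = ["H"] * len(hindi_words) + ["E"] * len(english_words)

    # 1-5 character n-grams as features, fed to a logistic classifier.
    identifier = make_pipeline(
        CountVectorizer(analyzer="char", ngram_range=(1, 5)),
        LogisticRegression(max_iter=1000),
    )
    identifier.fit(words, labels)
    print(identifier.predict(["zindagi", "beautiful"]))  # per-word H/E guesses

Swapping LogisticRegression for LinearSVC, MultinomialNB or DecisionTreeClassifier gives the other three classifiers compared in Section 6.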

5 Methodology

We have used a pipeline approach in this project[3]. It consists of the following three phases.

5.1 Language Identification

This is the first problem we need to address, and we want high accuracy here because all subsequent phases depend heavily on it. We tackle the problem in two scenarios.

Without context: The state-of-the-art technique for word-level language identification was used. We took a combination of 1-5 character n-grams as features and fed them to four different classifiers: SVM, logistic, decision tree and Naive Bayes. We chose the best performing classifier, which was the logistic classifier.

With context: Although the only prior work in this field uses context switching probabilities to include context, we used a conditional random field.

5.2 Back-transliteration

After language identification we group consecutive English words and consecutive Hindi words. On the Hindi chunks we used the Google API for back-transliteration, which gave us the Hindi text in Devanagari.

5.3 POS Tagging

Now we are ready for our main task. We took each sentence and split it into contiguous fragments of words called chunks, so that all the words in a chunk have the same language, either English (E) or Hindi (H), but not a combination. On the Hindi chunks we applied the CRF++ based Hindi POS tagger developed by IIT Kharagpur, freely available from http://nltr.org/snltr-software/. Similarly, on the English chunks we applied the Twitter POS tagger (Owoputi et al., 2013)[5]. The reason for using the Twitter POS tagger is that it has an inbuilt tokenizer and can be used directly on unnormalized text. As we are using two different taggers, they have different tagsets: the Twitter POS tagger has its own tagset, while the CRF++ based Hindi POS tagger uses the ILPOST tagset[1]. These tags are therefore not consistent across languages, so to ensure uniformity we mapped both tagsets to the Universal POS tagset[6], which has 12 POS tags. For the Twitter POS tags an existing mapping converts them to Universal POS tags (Table 2), and for the Hindi POS tagger we defined the mapping ourselves (Table 1).

Common Noun (NC), Proper Noun (NP), Verbal Noun (NV), SpatioTemporal Noun (NST) -> Noun (NOUN)
Main Verb (VM), Auxiliary Verb (VA) -> Verb (VERB)
Pronominal Pronoun (PPR), Reflexive Pronoun (PRF), Reciprocal Pronoun (PRC), Relative Pronoun (PRL), Wh Pronoun (PWH) -> Pronoun (PRON)
Adjective Nominal Modifier (JJ), Quantifier Nominal Modifier (JQ), Absolute Demonstrative (DAB), Relative Demonstrative (DRL), Wh Demonstrative (DWH) -> Adjective (ADJ)
Manner Adverb (AMN), Location Adverb (ALC), Adjectival Participle (LRL), Adverbial Participle (LV), Nominal Participle (LN), Conditional Participle (LC) -> Adverb (ADV)
Postposition (PP) -> Adposition (ADP)
Coordinating Particles (CCD), Subordinating Particles (CSB) -> Conjunction (CONJ)
Punctuation (PU) -> Punctuation (.)
Classifier Particles (CCL), Interjection Particles (CIN), Other Particles (CX), Foreign Word Residual (RDF), Symbol Residual (RDS), Other Residual (RDX)

Table 1: Map from ILPOSTS to Universal POSTS

Numeral ($) -> Cardinal Number (NUM)
Coordinating Conjunction (&) -> Conjunction (CONJ)
Punctuation (,) -> Punctuation (.)
Adjective (A) -> Adjective (ADJ)
Determiner (D) -> Determiner (DET)
Common Noun (N) -> Noun (NOUN)
Proper Noun (^) -> Noun (NOUN)
Pronoun (personal/WH; not possessive) (O) -> Pronoun (PRON)
Pre-Post Position (P) -> Adposition (ADP)
Adverb (R) -> Adverb (ADV)
Verb Auxiliaries (V) -> Verb (VERB)
Interjection (!), Topic Category (#), At-Mention (@), Emoticon (E), Other Abbreviation (G), Nominal + Verb (L), Proper Noun + Verb (M), Nominal + Possessive (S), Verb Particle (T), URL or Email Address (U), Predeterminers (X), Verbal Predeterminers (Y), Proper + Possessive Noun (Z), Discourse Marker (~)

Table 2: Map from Twitter POSTS to Universal POSTS
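To illustrate the mapping step of Section 5.3, here is a small sketch of ours. It covers only a handful of the tags from Tables 1 and 2, and the fallback tag X for anything unlisted is our assumption, not part of the tables.

    # Partial maps from the two tagger-specific tagsets to the Universal POS tagset[6].
    ILPOST_TO_UNIVERSAL = {
        "NC": "NOUN", "NP": "NOUN",   # common and proper nouns
        "VM": "VERB", "VA": "VERB",   # main and auxiliary verbs
        "PPR": "PRON", "JJ": "ADJ", "AMN": "ADV", "PP": "ADP", "PU": ".",
    }
    TWITTER_TO_UNIVERSAL = {
        "N": "NOUN", "V": "VERB", "A": "ADJ", "R": "ADV", "O": "PRON",
        "D": "DET", "P": "ADP", "&": "CONJ", "$": "NUM", ",": ".",
    }

    def to_universal(tag, lang):
        # Pick the map by the word's language label; X is our fallback for unmapped tags.
        table = ILPOST_TO_UNIVERSAL if lang == "h" else TWITTER_TO_UNIVERSAL
        return table.get(tag, "X")

    print(to_universal("VM", "h"), to_universal("&", "e"))  # VERB CONJ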

6 Results

1. Language Identification:

(i) We tested for the best feature by trying different combinations of 1-5 character n-grams and found that the combination of all of them gave the best results. In the figure below, the value 6 on the x-axis represents the combination of all 1-5 character n-grams.

Figure 4: Comparison among the features

(ii) We also tested for the best performing classifier among Logistic, Naive Bayes, SVM and Decision Tree. The logistic classifier outperformed all the others.

Figure 5: Comparison among the classifiers

(iii) The confusion matrices obtained after language identification were as follows:

Figure 6: Confusion matrix from the n-gram model

Figure 7: Confusion matrix from the CRF model

(iv) The POS tagging of a code mixed corpus depends greatly on the accuracy of language identification, so we performed POS tagging in the following three scenarios:

Figure 8: Case A: Confusion matrix when language identification was done using n-gram based features

Figure 9: Case B: Confusion matrix when the conditional random field was used

Figure 10: Case C: Confusion matrix when the language of each word is known precisely

Some characteristics of our data in all three of the above cases are presented below.

Figure 11: Characteristics of the data
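The per-word accuracies and confusion matrices reported above can be computed as in the sketch below, assuming scikit-learn; the label sequences are toy values, not our actual outputs.

    from sklearn.metrics import accuracy_score, confusion_matrix

    # Gold vs. predicted language labels for a handful of words (toy values only).
    gold = ["h", "h", "e", "h", "e", "h"]
    pred = ["h", "e", "e", "h", "e", "h"]

    print("accuracy:", accuracy_score(gold, pred))
    # Rows are gold labels, columns are predictions, in the order given by labels=.
    print(confusion_matrix(gold, pred, labels=["e", "h"]))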

7 Error Analysis and Conclusions

The best feature was the combination of 1-5 character n-grams, which is also an established result. The best performing classifier was the logistic (maximum entropy) classifier.

The accuracy of the classifier with character n-grams as features is 90.95%. It is lowered by the fact that the test set contained features that were not present in training, a consequence of the small number of training examples we could use given the computational limits of our system.

The accuracy of the classifier using the CRF model is 84.48%. We believe that taking context into account should improve accuracy; the skewed result is due to the very small amount of training and testing data. Also, the number of English words in our training data was much smaller than the number of Hindi words, which is why the CRF classified only 3 English words correctly and labelled the remaining 27 as Hindi. Classification as a whole is limited by the availability of a corpus and by the use of non-standard spellings.

We do not have a measure to evaluate the back-transliteration part, although we are assured that the Google API uses the state of the art for back-transliteration.

When using character n-grams for identification, the number of correctly POS-tagged sentences was 6 out of 90. This goes up to 15 out of 93 when the CRF is used. The small number of fully correctly POS-tagged sentences is due to the limited corpus and to the fact that the proposed approach does not take the complexity of the underlying grammar into account. The language identification accuracy and the back-transliteration also play a major role in the final POS tagging, as shown by the increase in correctly tagged sentences when the CRF is used.

8 Acknowledgement

We sincerely thank Prof. Amitabha Mukerjee for his able guidance throughout the course of this project. He has helped us at each and every stage with his valuable suggestions and ideas.

References

[1] Baskaran, S., Bali, K., Bhattacharya, T., Bhattacharyya, P., Jha, G. N., et al. A common parts-of-speech tagset framework for Indian languages. In Proc. of LREC 2008 (2008).

[2] Dewaele, J.-M. Emotions in Multiple Languages (2010).

[3] Gella, S., Sharma, J., and Bali, K. Query word labeling and back transliteration for Indian languages: Shared task system description. FIRE Working Notes (2013).

[4] Hidayat, T. An Analysis of Code Switching Used by Facebookers (a case study in a social network site). BA thesis, English Education Study Program, College of Teaching and Education (STKIP), Bandung, Indonesia, October 2012.

[5] Owoputi, O., O'Connor, B., Dyer, C., Gimpel, K., Schneider, N., and Smith, N. A. Improved part-of-speech tagging for online conversational text with word clusters. Association for Computational Linguistics (2013).

[6] Petrov, S., Das, D., and McDonald, R. A universal part-of-speech tagset. arXiv preprint arXiv:1104.2086 (2011).

[7] Vyas, Y., Gella, S., Sharma, J., Bali, K., and Choudhury, M. POS tagging of English-Hindi code-mixed social media content. In Proceedings of the First Workshop on Codeswitching, EMNLP (2014).