Code-Mixing: A Challenge for Language Identification in the Language of Social Media

Utsab Barman, Amitava Das, Joachim Wagner & Jennifer Foster
Dublin City University, Dublin, Ireland
University of North Texas, Denton, USA
Date: 25.10.2014

Language Identification in Social Media is a Challenging Task
- Twitter Language Map: http://www.fastcodesign.com/1665366/infographic-of-the-day-the-many-languages-of-twitter
  - Plenty of languages
  - Only half of the tweets are in English
- Informal writing: Great -> gr8
- Code-mixing

Code-Mixing
- Mixing multiple languages
  - Inter-sentential
  - Intra-sentential
  - Word-level
- Phonetic typing
  - Writing in Roman script instead of native language script
  - Ad-hoc Romanisation

Example: Phonetically Typed Code-Mixed Content
Achha ei prosno ta ageo keu korechhe kina jani na, tobe ei page-e Cr Arindam Sarkar er reign of terror dekhe amar akta prosno mathaye ghurchhe. Tumi ki 1st year er Class Representative howa ta beshi seriously niye felechhile naki Cr er onyo ortho achhe?
[Legend: Bengali, English; the colour coding of the original slide is not preserved]

Goal of our Work
Word-Level Language Identification with Phonetically Typed Code-Mixed Content

Corpus
- English-Hindi-Bengali phonetically typed code-mixed content
- Facebook posts and comments
- Indian student community
- Reasons:
  - Code-mixing is frequent among speakers who are multilingual and younger in age.
  - India is a country with 30 spoken languages, among which 22 are official.
  - 65% of the Indian population is 35 or under. **
- Currently our corpus contains 2335 posts and 9813 comments.
** http://www.theguardian.com/commentisfree/2014/apr/08/india-leaders-young-people-change-2014-elections

Annotation (1)
- Annotation Type: Human Annotation
- Number of Annotators: 4
  - 3 students from a Computer Science background, all from the same university
  - 1 author of this paper
- Target: Capture
  - inter-sentential code-mixing
  - intra-sentential code-mixing
  - word-level code-mixing

Annotation (2)
Tags: <T attribute="L"> </T>
- T: Type of code-mixing
  - sentence (sent)
  - fragment (frag)
  - inclusion (incl)
  - word-level code-mixing (wlcm)
- L: Language(s) of code-mixing
  - English (en)
  - Hindi (hi)
  - Bengali (bn)
  - Mixed (mixd)
  - Universals (univ)
  - Undefined (undef)

Annotation (3)
Sentence: <sent lang="language"> ... </sent>
- Identifies sentence boundary
- Identifies inter-sentential code-mixing

Annotation (4)
English Sentence: what a...6 hrs long...but really nice tennis...
  <sent lang="en"> what a...6 hrs long...but really nice tennis... </sent>
Bengali Sentence: shubho nabo borsho.. :)
  <sent lang="bn"> shubho nabo borsho.. :) </sent>
Hindi Sentence: karwa sachh... :(
  <sent lang="hi"> karwa sachh... :( </sent>

Annotation (5)
Univ-Sentence: hahahahahahah...!!!!!
  <sent lang="univ"> hahahahahahah...!!!!! </sent>
Mixed-Sentence: oye hoye... angreji me kahte hai ke I love u..!!!
  <sent lang="mixd">
    <frag lang="hi"> oye hoye... angreji me kahte hai ke </frag>
    <frag lang="en"> I love u..!!! </frag>
  </sent>

Annotation (6)
Fragment: <frag lang="language"> ... </frag>
- Identifies groups of grammatically related words in a sentence
- Identifies intra-sentential code-mixing

Annotation (7)
Mixed-Sentence: oye hoye... angreji me kahte hai ke I love u..!!!
  <sent lang="mixd">
    <frag lang="hi"> oye hoye... angreji me kahte hai ke </frag>
    <frag lang="en"> I love u..!!! </frag>
  </sent>

Annotation (8)
Inclusion: <incl lang="language"> ... </incl>
- Identifies foreign word or phrase
  - Within sentence or fragment
  - Assimilated in native language
- Identifies intra-sentential code-mixing

Annotation (9)
Sentence with inclusion: Na re seriously ami khub kharap achi.
  <sent lang="bn"> Na re <incl lang="en"> seriously </incl> ami khub kharap achi. </sent>

Annotation (10)
Word-Level Code-Mixing: <wlcm type="languages"> ... </wlcm>
- Captures intra-word code-mixing
- Smallest unit of code-mixing

Annotation (11)
Word-level code-mixing (EN-BN): chapless
- Root word: chap (Bengali)
- Appended suffix: less (English)
  <wlcm type="bn-and-en"> chapless </wlcm>

Token-Level Statistics
Language   Count
EN         66,298
BN         79,899
HI          3,440
WLCM          633
UNIV       39,291
UNDEF          61
5,233 tokens are identified as NE and 715 tokens are identified as Acronym (e.g. JU).
Total: 195,570

Tag-Level Statistics
Tags   EN      BN      HI    Mixd   Univ    Undef
sent   5,370   5,523   354   204    746     15
frag   288     213     40    -      6       0
incl   7,377   262     94    -      1,032   1
wlcm: 477

Ambiguous Words
Labels           Count    Percentage
EN               9,109    34.40
BN               14,345   54.18
HI               1,039    3.92
EN or BN         1,479    5.58
EN or HI         61       0.23
BN or HI         277      1.04
EN or BN or HI   165      0.62
- Some types are annotated in multiple languages, e.g. 'to', 'clg', 'baba'
  - Common vocabulary between languages
  - Effect of phonetic typing

IAA (1)
Token-Level Kappa = 0.884
[Calculated on 100 randomly selected comments between 2 annotators]
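The slides do not say which kappa statistic was used. With two annotators, Cohen's kappa is the usual choice, so the sketch below assumes it. The function name and the toy label sequences are illustrative only and are not taken from the corpus.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of token labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of tokens with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[lab] * freq_b[lab] for lab in freq_a) / (n * n)
    # Undefined (division by zero) if both annotators use a single label.
    return (observed - expected) / (1 - expected)


# Toy example (hypothetical labels for six tokens).
annotator_1 = ["en", "bn", "bn", "en", "univ", "bn"]
annotator_2 = ["en", "bn", "en", "en", "univ", "bn"]
print(round(cohens_kappa(annotator_1, annotator_2), 4))
```

Kappa corrects raw agreement for the agreement expected by chance, which matters here because BN and EN dominate the label distribution.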

IAA (2)
Tag-level Kappa = 0.6683
Tag        Kappa
sent       0.6825
frag       0.5171
incl       0.5507
wlcm       0.6223
ne         0.6172
acro       0.6144
All tags   0.6683
Annotation:
  <sent lang="bn">ki <incl lang="en"> cntrl </incl> korte parli na </sent>
Word-level representation:
  B-SENT-bn             ki
  B-INCL-en/I-SENT-bn   cntrl
  I-SENT-bn             krte
  I-SENT-bn             parli
  I-SENT-bn             na
[Calculated on 100 randomly selected comments between 2 annotators]
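The conversion from nested annotation to one label per word can be sketched as follows. This is a minimal illustration and not the authors' tool. It assumes the annotation is well-formed XML with quoted attributes, that nested tags are joined with "/" innermost first (as in the example on the slide), and that `wlcm` elements carry their languages in a `type` attribute. The function name `word_level_labels` is hypothetical.

```python
import xml.etree.ElementTree as ET


def word_level_labels(xml_text):
    """Return (label, token) pairs for one annotated sentence."""
    root = ET.fromstring(xml_text)
    rows = []
    started = set()  # ids of elements that have already emitted a token

    def emit(text, stack):
        for token in (text or "").split():
            parts = []
            for elem in reversed(stack):  # innermost tag first
                prefix = "I" if id(elem) in started else "B"
                started.add(id(elem))
                lang = elem.get("lang") or elem.get("type")
                parts.append(f"{prefix}-{elem.tag.upper()}-{lang}")
            rows.append(("/".join(parts), token))

    def visit(elem, ancestors):
        stack = ancestors + [elem]
        emit(elem.text, stack)        # words before the first child
        for child in elem:
            visit(child, stack)
            emit(child.tail, stack)   # words after the child, still in elem

    visit(root, [])
    return rows


example = '<sent lang="bn">ki <incl lang="en"> cntrl </incl> korte parli na </sent>'
for label, token in word_level_labels(example):
    print(label, token)
```

Each tag contributes a `B-` prefix on its first word and `I-` afterwards, so agreement on tag boundaries and tag languages can be scored token by token.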

Experiments (1)
- Approaches
  - Dictionary-based
  - SVM without contextual information
  - SVM and CRF with contextual information
- 5-fold cross-validation
- 4-way classification (en, bn, hi and univ)

Experiments (2)
- To avoid unrealistic context, NEs and WLCMs are included for context features
  - With label 'other' in training (5-way system)
- Two special cases:
  - Gold NEs and WLCMs do not count for evaluation
  - Back-off to 4-way system (en, bn, hi and univ) when 'other' is predicted
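One way to read the two special cases is sketched below. It assumes that both a 5-way and a 4-way system produce one prediction per token and that the 4-way prediction replaces any 'other' output. The slides do not spell out the mechanics, so the function names and the gold labels `ne` and `wlcm` are hypothetical.

```python
def back_off(pred_5way, pred_4way):
    """Use the 4-way prediction wherever the 5-way system says 'other'."""
    return [p4 if p5 == "other" else p5 for p5, p4 in zip(pred_5way, pred_4way)]


def accuracy(gold, predicted, ignored=("ne", "wlcm")):
    """Token accuracy that skips tokens whose gold label is NE or WLCM."""
    scored = [(g, p) for g, p in zip(gold, predicted) if g not in ignored]
    return sum(g == p for g, p in scored) / len(scored)


# Toy example with made-up predictions.
gold = ["bn", "ne", "en", "univ"]
pred_5way = ["bn", "other", "other", "univ"]
pred_4way = ["bn", "en", "bn", "univ"]
final = back_off(pred_5way, pred_4way)
print(final, accuracy(gold, final))
```

Under this reading, NEs and WLCMs still appear as context for neighbouring tokens but never affect the reported accuracy.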

Dictionary Approach (1)
- Full-form dictionaries extracted from
  - British National Corpus
  - SEMEVAL 2013 Twitter data
  - Lexical normalisation list (Han and Baldwin, 2011)
  - Training data
- No transliterated Bengali or Hindi dictionary available

Dictionary Approach (2)
- Language prediction by presence in dictionaries
- Use normalised word frequencies
- For OOVs or ties, the majority language is predicted
- UNIV identified with hand-crafted regular expressions
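The prediction rule above can be sketched in a few lines. This is an illustration under assumptions, not the authors' implementation. The regular expression is a stand-in for the hand-crafted ones, which the slides do not show. Frequencies are normalised per language by dividing by that language's token count, and all names are hypothetical.

```python
import re
from collections import Counter

# Hypothetical stand-in for the hand-crafted UNIV patterns:
# tokens made up only of punctuation, symbols and digits.
UNIV_RE = re.compile(r"[\W\d_]+")


def build_dictionaries(tagged_tokens):
    """Map each language to {word: relative frequency} from (word, lang) pairs."""
    counts = {}
    for word, lang in tagged_tokens:
        counts.setdefault(lang, Counter())[word.lower()] += 1
    dictionaries = {}
    for lang, counter in counts.items():
        total = sum(counter.values())
        dictionaries[lang] = {w: c / total for w, c in counter.items()}
    return dictionaries


def predict(word, dictionaries, majority_lang):
    if UNIV_RE.fullmatch(word):
        return "univ"
    scores = {lang: freqs.get(word.lower(), 0.0) for lang, freqs in dictionaries.items()}
    best = max(scores.values(), default=0.0)
    winners = [lang for lang, score in scores.items() if score == best]
    if best == 0.0 or len(winners) > 1:  # OOV or tie
        return majority_lang
    return winners[0]


training = [("ami", "bn"), ("khub", "bn"), ("to", "bn"), ("na", "bn"),
            ("to", "en"), ("the", "en"), ("movie", "en")]
dicts = build_dictionaries(training)
print([predict(w, dicts, "bn") for w in ["ami", "to", "xyz", ":)"]])
```

A word such as 'to' that occurs in both languages is resolved by whichever dictionary gives it the higher relative frequency.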

Dictionary Approach (3)
Dictionary                  Accuracy (%)
BNC                         80.09
SEMEVAL Twitter             77.61
LexNormList                 79.86
Training Data               90.21
LexNormList+Training Data   93.12
All combinations were tried.

SVM without Context (1)
- Features
  - Character n-grams (G)
  - Presence in dictionary (D)
  - Binary indicators of word length (L)
    - Split points determined by a decision tree (J48) trained with the length of a word as its single feature
  - Capitalization (C)
- SVM with linear kernel and optimised 'C' parameter
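Two of the listed feature groups, character n-grams (G) and capitalization (C), can be illustrated as follows. The slides do not give the n-gram range, the boundary handling or the exact capitalization indicators, so the range 1 to 5, the `^`/`$` padding, the lowercasing and the three indicator names are all assumptions made for this sketch.

```python
def char_ngrams(word, n_min=1, n_max=5):
    """All character n-grams of the lowercased word, padded with ^ and $."""
    padded = f"^{word.lower()}$"
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


def capitalisation_features(word):
    """Three illustrative binary indicators computed on the original casing."""
    return {
        "first_upper": word[:1].isupper(),
        "all_upper": word.isupper(),
        "any_upper": any(ch.isupper() for ch in word),
    }


print(char_ngrams("ki", 2, 3))
print(capitalisation_features("Tumi"))
```

Character n-grams let the classifier pick up spelling patterns typical of romanised Bengali or Hindi even for words it has never seen.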

SVM without Context (2)
Binary indicators for length feature

J48 Pruned Tree
length <= 3
|   length <= 1: en
|   length > 1: bn
length > 3
|   length <= 6: bn
|   length > 6
|   |   length <= 8: bn
|   |   length > 8
|   |   |   length <= 13: en
|   |   |   length > 13: bn

Extracted Length Features
- Is greater than 3
- Is greater than 1
- Is greater than 6
- Is greater than 8
- Is greater than 13

Encoding 6 ranges: 0-1, 2-3, 4-6, 7-8, 9-13 and 14-inf
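The five split points translate into binary indicators as in the sketch below. The thresholds come from the tree on the slide, while the function names are illustrative. Because the indicators are cumulative, their sum gives the index of the length range.

```python
LENGTH_SPLITS = (1, 3, 6, 8, 13)  # split points from the J48 tree
LENGTH_RANGES = ("0-1", "2-3", "4-6", "7-8", "9-13", "14-inf")


def length_features(word):
    """One 0/1 indicator per split point: is the word longer than it?"""
    return [int(len(word) > split) for split in LENGTH_SPLITS]


def length_range(word):
    """Name of the range the word length falls into."""
    return LENGTH_RANGES[sum(length_features(word))]


for w in ["a", "na", "cntrl", "seriously"]:
    print(w, length_features(w), length_range(w))
```

For example, the five-letter token "cntrl" exceeds the first two split points only, which places it in the 4-6 range.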

SVM without Context (3)
[Figure: content not recoverable from the transcript]

SVM with Context (1)
- Features
  - Character n-grams (G)
  - Presence in dictionary (D)
  - Binary indicators of word length (L)
  - Capitalization (C)
  - Previous words (Pi)
  - Next words (Ni)
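The context features can be sketched as a window over neighbouring tokens. The sketch assumes that Pi and Ni stand for the i previous and i next words (so P2 includes both preceding words), that words are lowercased, and that sentence edges are filled with the hypothetical markers `<s>` and `</s>`. None of these details are confirmed by the slides.

```python
def context_features(tokens, index, n_prev=1, n_next=1):
    """Previous and next words around tokens[index] as named features."""
    features = {}
    for k in range(1, n_prev + 1):
        j = index - k
        features[f"P{k}"] = tokens[j].lower() if j >= 0 else "<s>"
    for k in range(1, n_next + 1):
        j = index + k
        features[f"N{k}"] = tokens[j].lower() if j < len(tokens) else "</s>"
    return features


sentence = "the movie for which i can die for".split()
print(context_features(sentence, sentence.index("die")))
```

With a P1N1 window, the token "die" is seen together with "can" and "for", which is the kind of evidence a context-free classifier lacks.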

SVM with Context (2)
Context             Accuracy (%)
GDLC (no context)   94.75
GDLC+P2             94.66
GDLC+P1             94.55
GDLC+N1             94.53
GDLC+N2             94.37
GDLC+P1N1           95.14
GDLC+P2N2           94.55

CRF (1)
- Linear-chain Conditional Random Field (CRF) with increasing order (0, 1, 2)
- Features
  - Character n-grams (G)
  - Presence in dictionary (D)
  - Word length (L)
  - Capitalisation (C)

CRF (2)
Features   Order-0   Order-1   Order-2
G          92.80     95.16     95.36
GD         93.42     95.59     95.98
GL         92.82     95.14     95.41
GDL        93.47     95.60     95.94
GC         92.07     94.60     95.05
GDC        93.47     95.62     95.98
GLC        92.36     94.53     95.02
GDLC       93.47     95.58     95.98

Test Set Results
System                Accuracy
Dictionary            93.64%
SVM without context   95.21%
SVM with context      95.52%
CRF                   95.76%

Conclusion (1)
Contextual clues are helpful: The following example is wrongly classified by all our systems that do not use context information. All context-based systems classify it correctly.
Gold data:
  /univ the/en movie/en for/en which/en i/en can/en die/en for/en ../univ
SVM without context:
  /univ the/en movie/en for/en which/en i/en can/en die/bn for/en ../univ

Conclusion (2)
- Character n-grams are helpful features for language identification experiments.
- Adding dictionary-based predictions as features gives a small boost to accuracy.

Another CRF Tool
- We re-ran our CRF experiments with Wapiti (Lavergne et al., 2010) instead of Mallet
- 96.37% accuracy (+0.39 percentage points)

THANK YOU

SVM without Context (4)
[Figure residue; labels not recoverable from the transcript. Values shown: +0.11 +0.02 +0.03 +0.11 +0.11 +0.02 +0.00 +0.06 +0.05 +0.00 +0.02 +0.05]