Indian Institute of Technology, Kanpur
|
|
- Melvyn Morrison
- 6 years ago
- Views:
Transcription
1 Indian Institute of Technology, Kanpur Course Project - CS671A POS Tagging of Code Mixed Text Ayushman Sisodiya (12188) {ayushmn@iitk.ac.in} Donthu Vamsi Krishna ( ) {vamsi@iitk.ac.in} Sandeep Kumar Begad (12612) {sandeekb@iitk.ac.in} Mentor - Prof. Amitabha Mukerjee Abstract Social media in today s world possess enormous amount of data. This data is used by various companies for advertisement to suitable group, declaring promotions, etc. But the problem starts in bilingual or multilingual populations where a lot of people tend to use multiple languages in the same sentence. Now analysis of such a text unravels a whole new field of study.
2 Contents 1 Introduction 2 2 Previous work 2 3 Dataset 2 4 Theory Support Vector machine Naive Bayes Classifier Decision Trees Logistic classifier Conditional Random Field Methodology Language Identification Backtransliteration POS Tagging Results 8 7 Error Analysis and Conclusions 11 8 Acknowledgement 11 1
3 1 Introduction Mixing of languages is called code mixing. Code mixing occurs due to various reasons. According to a work by Hidayat et. al 2012[4], An analysis of code switching used by facebookers: a case study in a social network site, there are the following major reasons for code switching: 45% : Real lexical needs For instance someone is thinking of some object but is not able to recall the word in the language he is using already, then he/she will tend to switch to a language where he knows the appropriate word. 40% : Talking about a particular topic People tend to talk about some topics in their mother tongue (like food) and generally while discussing science people tend to switch to English. 5% : For content clarification While explaining something, for better clarification of the audience, to make the audience more clear about the topic, code switching is used. A bit older work by Dewaele (2010)[2] said that strong emotional arousal also increases code mixing frequency. As social media contains valuable information, due to the presence of above mentioned type of code mixed text, the complexity in analyzing the data increases. Even today there are no proper tools that deals with this type of data. The primary reason behind this limitation is due to proper corpus acquisition and there have not been any. This project proposes a model that POS tags the code mixed text which can be used for various tasks in Natural Language Processing 2 Previous work There has not been much work done in terms of POS tagging of code mixed text. We came across only one related paper which was POS Tagging of English-Hindi Code-Mixed Social Media Content by Vyas et. al[7]. They used word level language identification using a logistic classifier and to take into account the context they calculate context switching probabilities. We will have a similar approach for language identification, but for including context we employed a conditional random field. 3 Dataset For the language identification using 1-5 n-gram character vectors part we extracted the top 5000 Hindi words and top 5000 English words from Fire 2013 shared Task Dataset LINK and 1000 English and 1000 test from the same. For the remaining part, we manually extracted 100 sentences from facebook pages of Bollywood actors namely Shahrukh Khan, Amir Khan etc. While extracting the sentences, we ensured that the number of words per sentence is atleast 5 and the sentence contains code mixed text. By considering the context of every word in the sentence and the base language of the sentence, We then manually tagged them by their language and by their POS tags. So after tagging, the structure looks 2
4 like this. word / Language (E/H) / POS tag Example: kolkata/h/noun kaa/h/adp charm/e/noun ur/h/conj busy/e/adj life/e/noun mujhe/h/pron behad/h/adj pasand/h/verb hai/h/verb Therefore, the tags are / separated, the words are space separated and the sentences are line separated. 4 Theory This section deals with all the theoretical concepts that are involved in the project. 4.1 Support Vector machine A support vector machine or popularly know as SVM is a supervised learning technique which incorporates a learning algorithm for like gradient descent and is used for tasks like classification, pattern recognition and regression. Figure 1: Illustration of SVM Source: Mathieu s log, A linear SVM creates a hyperplane which separates the n dimensional data. Figure 1 shows a trained SVM. 4.2 Naive Bayes Classifier Naive Bayes Classifier are a family of simple probabilistic classifiers based on applying Baye s theorem with strong (naive) independence assumptions between the features. Baye s theorem: P(A B) = P(A)P(B A) (1) P(B) 4.3 Decision Trees In these tree structures, leaves represent class labels and branches represent conjunctions of features of those features leading to those class labels. 3
5 Figure 2: A tree showing survival of passengers on Titatnic Source: Wikipedia 4.4 Logistic classifier A logistic classifier for a simple case where the output can take any two values 0 or 1 (True or False), can be modelled as below. P(Y = 1 X) = exp(w 0 + n i=1 w ix i ) P(Y = 0 X) = exp(w 0 + n i=1 w ix i ) 1 + exp(w 0 + n i=1 w ix i ) (2) (3) 4.5 Conditional Random Field A conditional random field is similar to a a Hidden Markov Model (HMM) apart from that the fact that in a HMM we give a feature and get an output, but in a CRF with the present feature, previous features are also passed to include context. Figure 3: Similarity between CRF and HMM Source: An Introduction to CRFs for Relational Learning by Charles Sutton, Andre McCallum, Univ. of Massachusetts 4
6 5 Methodology We have used a pipeline approach in this project[3]. It consists of the following three phases. 5.1 Language Identification This is the first problem that we need to come up. Also we desire a high accuracy in this part as all subsequent parts depend heavily on this. We tackle this problem in two scenarios When we do not include context into picture The state of the art technique for word level language identification was used here. We took a combination 1-5 character n grams as features and fed them to four different classifiers which are SVM, Logistic, Decision tree based and Naive Bayes classifiers. We chose the best performing classifier which was logistic based classifier. When we incorporate context too Although the only work in this field uses context switching probabilities to include context, we used a conditional random field. 5.2 Backtransliteration After the language identification we take consecutive English and Hindi words and group them. On the Hindi chunks we used the Google API for backtransliteration. This gave us the Hindi text in Devanagari. 5.3 POS Tagging Now we are ready to do our main task. We took each sentances and splited them into contiguous fragments of words called as chunks. Therefore all the words that corresponds to a chunk have same language either English (E) or Hindi (H) but not the combination. Then we applied CRF++ based Hindi POS tagger developed by IIT Kharagpur which is freely available from on the Hindi chunks. Similarly we applied the Twitter POS tagger (Owoputi et al., 2013)[5] on the English chunks. The reason for using Twitter POS tagger is that it has a inbuilt tokenizer and it can be used directly on unnormalized text. As we are using two different taggers, they have different tagsets. The Twitter POS tagset has its own POS tagset. The CRF++ based Hindi POS tagger has ILPOST tagset[1]. Therefore these POS tags remain conserved accross languages and hence to ensure uniformity, we mapped these POS tagsets to the Universal POS tagset[6] which has 12 POS tags. For Twitter POS tags a mapping exists which convert them to universal POStags 2 and for the Hindi POS tagger we ourselves defined the mapping 1. 5
7 ILPOSTS Common Noun(NC) Proper Noun(NP) Verbal Noun(NV) SpatioTemporal Noun(NST) Main Verb(VM) Auxiliary Verb(VA) Pronomial Pronoun(PPR) Reflexive Pronoun(PRF) Reciprocal Pronoun(PRC) Relative Pronoun(PRL) Wh Pronoun(PWH) Adjective Nominal Modifier(JJ) Quantifiers Nominal Modifier(JQ) Absolute Demonstrative(DAB) Relative Demonstrative(DRL) Wh Demonstrative(DWH) Manner Adverb(AMN) Location Adverb(ALC) Adjectival Participle(LRL) Adverbial Participle(LV) Nominal Participle(LN) Conditional Participle(LC) Postposition(PP) Coordinating Particles(CCD) Subordinating Particles(CSB) Classifier Particles(CCL) Interjection Particles(CIN) Other Particles(CX) Punctuation(PU) Foreign Word Residual(RDF) Symbol Residual(RDS) Other Residual(RDX) Universal POSTS Verb(VERB) Verb(VERB) Pronoun(PRON) Pronoun(PRON) Pronoun(PRON) Pronoun(PRON) Pronoun(PRON) Adjectives(ADJ) Adjectives(ADJ) Adjectives(ADJ) Adjectives(ADJ) Adjectives(ADJ) Adverbs(ADV) Adverbs(ADV) Adverbs(ADV) Adverbs(ADV) Adverbs(ADV) Adverbs(ADV) Adpositions(ADP) Punctuation(.) Table 1: Map from ILPOSTS to Universal POSTS 6
8 Twitter POSTS Interjection(!) Topic Category(#) Numeral($) Coordinating Conjunction(&) Punctuation(,) Adjective(A) Determiner(D) Emoticon(E) Other Abbreviation(G) Nominal + Verb(L) Proper Noun + Verb(M) Common Noun(N) Pronoun(personal/WH; not possessive)(o) Pre-Post Position(P) Adverb(R) Nominal + Possessive(S) Verb Particle(T) Url or address(u) Verb Auxiliaries(V) Predeterminers(X) Verbal Predeterminers(Y) Proper + possesive Noun(Z) Proper Noun(ˆ) Discourse Marker( ) Universal POSTS Cardinal Numbers(NUM) Conjunction(CONJ) Punctuation(.) Adjective(ADJ) Determiners(DET) Pronoun(PRON) Adposition(ADP) Adverb(ADV Verb(VERB) Table 2: Map from Twitter POSTS to Universal POSTS 7
9 6 Results 1. Language Identification: (i) We tested for the best feature by trying different combinations of 1-5 character n grams and found that a combination of all of them gave best results. In the figure below 6 on x-axis represents the combination of all 1-5 character grams. Figure 4: Comparison among the features (ii) We then also tested for the best performing classifier among Logistic, Naive Bayes, SVM and Decision Tree. Logistic classifier outperformed all other classifiers. Figure 5: Comparison among the classifiers 8
10 (iii) The confusion matrix obtained after the language identification were as follows: Figure 6: Confusion matrix from n-gram model Figure 7: Confusion matrix from CRF model (iv) The POS tagging of a code mixed corpus depend greatly on the accuracy of language identification. So we performed POS tagging in the following three different scenarios: Figure 8: Case A: Confusion matrix when language identification was done using n-gram based features 9
11 Figure 9: Case B: Confusion matrix when conditional random field was used Figure 10: Case C: Confusion matrix when language of each word is known precisely Some of the features of our data in all of the above three cases is presented below. Figure 11: Characteristics of the data 10
12 7 Error Analysis and Conclusions The best feature was the combination of 1-5 gram character vectors which was also established result. The best performing classifier was Logistic classifier or the maximum entropy classifier The accuracy of the classifier with char n-grams as features is 90.95% which is lower due to the fact that there were features in the test set which were not there in the training. This was due to the fact that there were less training examples were used which was due to the computational limits of our system. The accuracy of classifier using the CRF model 84.48%. We believe that taking context into account must improve the accuracy. The skewed result is due to the fact that we have very less training and testing data. Also the number of English words in out training data were very less compared to that of Hindi words. That is why CRF is classifying only 3 correct English words, rest 27 it classified as Hindi. Classification as a whole is limited by the availability of corpus and use of nonstandard spellings. We do not have a measure to evaluate the back-transliteration part. Although we know are assured that Google API uses the state of the art for back-transliteratiion. When using the char n-gram for identification the number of correctly POS-tagged sentences were 6 out of 90. This goes up to 15 out of 93 when CRF is used. The very less number of full correctly POS tagged sentence is due to the fact that we were limited by corpus, the approach proposed does not take into account the complexity and underlying grammar.also the identification accuracy and the back transliteration plays a major role in the final POS-tagging, which is established by the increase of correctly POS-tag from 6 to Acknowledgement We sincerely thank Prof. Amitabha Mukherjee for his able guidance throughout the course of this project. He has helped us at each and every stage with his valuable suggestions and ideas. 11
13 References [1] BASKARAN, S., BALI, K., BHATTACHARYA, T., BHATTACHARYYA, P., JHA, G. N., ET AL. A common parts-of-speech tagset framework for indian languages. In In Proc. of LREC 2008 (2008), Citeseer. [2] DEWAELE, J.-M. Emotions in multiple languages. [3] GELLA, S., SHARMA, J., AND BALI, K. Query word labeling and back transliteration for indian languages: Shared task system description. FIRE Working Notes (2013). [4] HIDAYAT, T. An Analysis of Code Switching Used by Facebookers (a case study in a social network site). PhD thesis, BA Thesis, English Education Study Program, College of Teaching and Education (STKIP), Bandung, Indonesia, October, [5] OWOPUTI, O., O CONNOR, B., DYER, C., GIMPEL, K., SCHNEIDER, N., AND SMITH, N. A. Improved part-of-speech tagging for online conversational text with word clusters. Association for Computational Linguistics. [6] PETROV, S., DAS, D., AND MCDONALD, R. A universal part-of-speech tagset. arxiv preprint arxiv: (2011). [7] VYAS, Y., GELLA, S., SHARMA, J., BALI, K., AND CHOUDHURY, M. Pos tagging of english-hindi code-mixed social media content. In Proceedings of the First Workshop on Codeswitching, EMNLP (2014). 12
Twitter Sentiment Classification on Sanders Data using Hybrid Approach
IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 17, Issue 4, Ver. I (July Aug. 2015), PP 118-123 www.iosrjournals.org Twitter Sentiment Classification on Sanders
More informationPython Machine Learning
Python Machine Learning Unlock deeper insights into machine learning with this vital guide to cuttingedge predictive analytics Sebastian Raschka [ PUBLISHING 1 open source I community experience distilled
More informationSINGLE DOCUMENT AUTOMATIC TEXT SUMMARIZATION USING TERM FREQUENCY-INVERSE DOCUMENT FREQUENCY (TF-IDF)
SINGLE DOCUMENT AUTOMATIC TEXT SUMMARIZATION USING TERM FREQUENCY-INVERSE DOCUMENT FREQUENCY (TF-IDF) Hans Christian 1 ; Mikhael Pramodana Agus 2 ; Derwin Suhartono 3 1,2,3 Computer Science Department,
More informationExperiments with Cross-lingual Systems for Synthesis of Code-Mixed Text
Experiments with Cross-lingual Systems for Synthesis of Code-Mixed Text Sunayana Sitaram 1, Sai Krishna Rallabandi 1, Shruti Rijhwani 1 Alan W Black 2 1 Microsoft Research India 2 Carnegie Mellon University
More informationSystem Implementation for SemEval-2017 Task 4 Subtask A Based on Interpolated Deep Neural Networks
System Implementation for SemEval-2017 Task 4 Subtask A Based on Interpolated Deep Neural Networks 1 Tzu-Hsuan Yang, 2 Tzu-Hsuan Tseng, and 3 Chia-Ping Chen Department of Computer Science and Engineering
More informationLinking Task: Identifying authors and book titles in verbose queries
Linking Task: Identifying authors and book titles in verbose queries Anaïs Ollagnier, Sébastien Fournier, and Patrice Bellot Aix-Marseille University, CNRS, ENSAM, University of Toulon, LSIS UMR 7296,
More informationLecture 1: Machine Learning Basics
1/69 Lecture 1: Machine Learning Basics Ali Harakeh University of Waterloo WAVE Lab ali.harakeh@uwaterloo.ca May 1, 2017 2/69 Overview 1 Learning Algorithms 2 Capacity, Overfitting, and Underfitting 3
More informationProduct Feature-based Ratings foropinionsummarization of E-Commerce Feedback Comments
Product Feature-based Ratings foropinionsummarization of E-Commerce Feedback Comments Vijayshri Ramkrishna Ingale PG Student, Department of Computer Engineering JSPM s Imperial College of Engineering &
More informationNamed Entity Recognition: A Survey for the Indian Languages
Named Entity Recognition: A Survey for the Indian Languages Padmaja Sharma Dept. of CSE Tezpur University Assam, India 784028 psharma@tezu.ernet.in Utpal Sharma Dept.of CSE Tezpur University Assam, India
More informationCS Machine Learning
CS 478 - Machine Learning Projects Data Representation Basic testing and evaluation schemes CS 478 Data and Testing 1 Programming Issues l Program in any platform you want l Realize that you will be doing
More informationSwitchboard Language Model Improvement with Conversational Data from Gigaword
Katholieke Universiteit Leuven Faculty of Engineering Master in Artificial Intelligence (MAI) Speech and Language Technology (SLT) Switchboard Language Model Improvement with Conversational Data from Gigaword
More informationSemi-supervised methods of text processing, and an application to medical concept extraction. Yacine Jernite Text-as-Data series September 17.
Semi-supervised methods of text processing, and an application to medical concept extraction Yacine Jernite Text-as-Data series September 17. 2015 What do we want from text? 1. Extract information 2. Link
More informationModeling function word errors in DNN-HMM based LVCSR systems
Modeling function word errors in DNN-HMM based LVCSR systems Melvin Jose Johnson Premkumar, Ankur Bapna and Sree Avinash Parchuri Department of Computer Science Department of Electrical Engineering Stanford
More informationPredicting Student Attrition in MOOCs using Sentiment Analysis and Neural Networks
Predicting Student Attrition in MOOCs using Sentiment Analysis and Neural Networks Devendra Singh Chaplot, Eunhee Rhim, and Jihie Kim Samsung Electronics Co., Ltd. Seoul, South Korea {dev.chaplot,eunhee.rhim,jihie.kim}@samsung.com
More informationPOS tagging of Chinese Buddhist texts using Recurrent Neural Networks
POS tagging of Chinese Buddhist texts using Recurrent Neural Networks Longlu Qin Department of East Asian Languages and Cultures longlu@stanford.edu Abstract Chinese POS tagging, as one of the most important
More informationUsing dialogue context to improve parsing performance in dialogue systems
Using dialogue context to improve parsing performance in dialogue systems Ivan Meza-Ruiz and Oliver Lemon School of Informatics, Edinburgh University 2 Buccleuch Place, Edinburgh I.V.Meza-Ruiz@sms.ed.ac.uk,
More informationAssignment 1: Predicting Amazon Review Ratings
Assignment 1: Predicting Amazon Review Ratings 1 Dataset Analysis Richard Park r2park@acsmail.ucsd.edu February 23, 2015 The dataset selected for this assignment comes from the set of Amazon reviews for
More informationNetpix: A Method of Feature Selection Leading. to Accurate Sentiment-Based Classification Models
Netpix: A Method of Feature Selection Leading to Accurate Sentiment-Based Classification Models 1 Netpix: A Method of Feature Selection Leading to Accurate Sentiment-Based Classification Models James B.
More informationProbabilistic Latent Semantic Analysis
Probabilistic Latent Semantic Analysis Thomas Hofmann Presentation by Ioannis Pavlopoulos & Andreas Damianou for the course of Data Mining & Exploration 1 Outline Latent Semantic Analysis o Need o Overview
More informationMulti-Lingual Text Leveling
Multi-Lingual Text Leveling Salim Roukos, Jerome Quin, and Todd Ward IBM T. J. Watson Research Center, Yorktown Heights, NY 10598 {roukos,jlquinn,tward}@us.ibm.com Abstract. Determining the language proficiency
More informationModeling function word errors in DNN-HMM based LVCSR systems
Modeling function word errors in DNN-HMM based LVCSR systems Melvin Jose Johnson Premkumar, Ankur Bapna and Sree Avinash Parchuri Department of Computer Science Department of Electrical Engineering Stanford
More informationCS 446: Machine Learning
CS 446: Machine Learning Introduction to LBJava: a Learning Based Programming Language Writing classifiers Christos Christodoulopoulos Parisa Kordjamshidi Motivation 2 Motivation You still have not learnt
More informationExperiments with SMS Translation and Stochastic Gradient Descent in Spanish Text Author Profiling
Experiments with SMS Translation and Stochastic Gradient Descent in Spanish Text Author Profiling Notebook for PAN at CLEF 2013 Andrés Alfonso Caurcel Díaz 1 and José María Gómez Hidalgo 2 1 Universidad
More informationCSL465/603 - Machine Learning
CSL465/603 - Machine Learning Fall 2016 Narayanan C Krishnan ckn@iitrpr.ac.in Introduction CSL465/603 - Machine Learning 1 Administrative Trivia Course Structure 3-0-2 Lecture Timings Monday 9.55-10.45am
More informationThe Internet as a Normative Corpus: Grammar Checking with a Search Engine
The Internet as a Normative Corpus: Grammar Checking with a Search Engine Jonas Sjöbergh KTH Nada SE-100 44 Stockholm, Sweden jsh@nada.kth.se Abstract In this paper some methods using the Internet as a
More informationChunk Parsing for Base Noun Phrases using Regular Expressions. Let s first let the variable s0 be the sentence tree of the first sentence.
NLP Lab Session Week 8 October 15, 2014 Noun Phrase Chunking and WordNet in NLTK Getting Started In this lab session, we will work together through a series of small examples using the IDLE window and
More informationModule 12. Machine Learning. Version 2 CSE IIT, Kharagpur
Module 12 Machine Learning 12.1 Instructional Objective The students should understand the concept of learning systems Students should learn about different aspects of a learning system Students should
More informationReducing Features to Improve Bug Prediction
Reducing Features to Improve Bug Prediction Shivkumar Shivaji, E. James Whitehead, Jr., Ram Akella University of California Santa Cruz {shiv,ejw,ram}@soe.ucsc.edu Sunghun Kim Hong Kong University of Science
More informationTraining a Neural Network to Answer 8th Grade Science Questions Steven Hewitt, An Ju, Katherine Stasaski
Training a Neural Network to Answer 8th Grade Science Questions Steven Hewitt, An Ju, Katherine Stasaski Problem Statement and Background Given a collection of 8th grade science questions, possible answer
More informationDistant Supervised Relation Extraction with Wikipedia and Freebase
Distant Supervised Relation Extraction with Wikipedia and Freebase Marcel Ackermann TU Darmstadt ackermann@tk.informatik.tu-darmstadt.de Abstract In this paper we discuss a new approach to extract relational
More informationA Case Study: News Classification Based on Term Frequency
A Case Study: News Classification Based on Term Frequency Petr Kroha Faculty of Computer Science University of Technology 09107 Chemnitz Germany kroha@informatik.tu-chemnitz.de Ricardo Baeza-Yates Center
More informationCorrective Feedback and Persistent Learning for Information Extraction
Corrective Feedback and Persistent Learning for Information Extraction Aron Culotta a, Trausti Kristjansson b, Andrew McCallum a, Paul Viola c a Dept. of Computer Science, University of Massachusetts,
More informationEnsemble Technique Utilization for Indonesian Dependency Parser
Ensemble Technique Utilization for Indonesian Dependency Parser Arief Rahman Institut Teknologi Bandung Indonesia 23516008@std.stei.itb.ac.id Ayu Purwarianti Institut Teknologi Bandung Indonesia ayu@stei.itb.ac.id
More informationOnline Updating of Word Representations for Part-of-Speech Tagging
Online Updating of Word Representations for Part-of-Speech Tagging Wenpeng Yin LMU Munich wenpeng@cis.lmu.de Tobias Schnabel Cornell University tbs49@cornell.edu Hinrich Schütze LMU Munich inquiries@cislmu.org
More informationLinguistic Variation across Sports Category of Press Reportage from British Newspapers: a Diachronic Multidimensional Analysis
International Journal of Arts Humanities and Social Sciences (IJAHSS) Volume 1 Issue 1 ǁ August 216. www.ijahss.com Linguistic Variation across Sports Category of Press Reportage from British Newspapers:
More informationThe 9 th International Scientific Conference elearning and software for Education Bucharest, April 25-26, / X
The 9 th International Scientific Conference elearning and software for Education Bucharest, April 25-26, 2013 10.12753/2066-026X-13-154 DATA MINING SOLUTIONS FOR DETERMINING STUDENT'S PROFILE Adela BÂRA,
More informationChinese Language Parsing with Maximum-Entropy-Inspired Parser
Chinese Language Parsing with Maximum-Entropy-Inspired Parser Heng Lian Brown University Abstract The Chinese language has many special characteristics that make parsing difficult. The performance of state-of-the-art
More informationGrammars & Parsing, Part 1:
Grammars & Parsing, Part 1: Rules, representations, and transformations- oh my! Sentence VP The teacher Verb gave the lecture 2015-02-12 CS 562/662: Natural Language Processing Game plan for today: Review
More informationCLASSIFICATION OF TEXT DOCUMENTS USING INTEGER REPRESENTATION AND REGRESSION: AN INTEGRATED APPROACH
ISSN: 0976-3104 Danti and Bhushan. ARTICLE OPEN ACCESS CLASSIFICATION OF TEXT DOCUMENTS USING INTEGER REPRESENTATION AND REGRESSION: AN INTEGRATED APPROACH Ajit Danti 1 and SN Bharath Bhushan 2* 1 Department
More informationESSLLI 2010: Resource-light Morpho-syntactic Analysis of Highly
ESSLLI 2010: Resource-light Morpho-syntactic Analysis of Highly Inflected Languages Classical Approaches to Tagging The slides are posted on the web. The url is http://chss.montclair.edu/~feldmana/esslli10/.
More informationLoughton School s curriculum evening. 28 th February 2017
Loughton School s curriculum evening 28 th February 2017 Aims of this session Share our approach to teaching writing, reading, SPaG and maths. Share resources, ideas and strategies to support children's
More informationA Comparison of Two Text Representations for Sentiment Analysis
010 International Conference on Computer Application and System Modeling (ICCASM 010) A Comparison of Two Text Representations for Sentiment Analysis Jianxiong Wang School of Computer Science & Educational
More information2/15/13. POS Tagging Problem. Part-of-Speech Tagging. Example English Part-of-Speech Tagsets. More Details of the Problem. Typical Problem Cases
POS Tagging Problem Part-of-Speech Tagging L545 Spring 203 Given a sentence W Wn and a tagset of lexical categories, find the most likely tag T..Tn for each word in the sentence Example Secretariat/P is/vbz
More informationLearning Structural Correspondences Across Different Linguistic Domains with Synchronous Neural Language Models
Learning Structural Correspondences Across Different Linguistic Domains with Synchronous Neural Language Models Stephan Gouws and GJ van Rooyen MIH Medialab, Stellenbosch University SOUTH AFRICA {stephan,gvrooyen}@ml.sun.ac.za
More informationWeb as Corpus. Corpus Linguistics. Web as Corpus 1 / 1. Corpus Linguistics. Web as Corpus. web.pl 3 / 1. Sketch Engine. Corpus Linguistics
(L615) Markus Dickinson Department of Linguistics, Indiana University Spring 2013 The web provides new opportunities for gathering data Viable source of disposable corpora, built ad hoc for specific purposes
More informationTraining and evaluation of POS taggers on the French MULTITAG corpus
Training and evaluation of POS taggers on the French MULTITAG corpus A. Allauzen, H. Bonneau-Maynard LIMSI/CNRS; Univ Paris-Sud, Orsay, F-91405 {allauzen,maynard}@limsi.fr Abstract The explicit introduction
More informationMultilingual Sentiment and Subjectivity Analysis
Multilingual Sentiment and Subjectivity Analysis Carmen Banea and Rada Mihalcea Department of Computer Science University of North Texas rada@cs.unt.edu, carmen.banea@gmail.com Janyce Wiebe Department
More informationHuman Emotion Recognition From Speech
RESEARCH ARTICLE OPEN ACCESS Human Emotion Recognition From Speech Miss. Aparna P. Wanare*, Prof. Shankar N. Dandare *(Department of Electronics & Telecommunication Engineering, Sant Gadge Baba Amravati
More informationThe Karlsruhe Institute of Technology Translation Systems for the WMT 2011
The Karlsruhe Institute of Technology Translation Systems for the WMT 2011 Teresa Herrmann, Mohammed Mediani, Jan Niehues and Alex Waibel Karlsruhe Institute of Technology Karlsruhe, Germany firstname.lastname@kit.edu
More informationTHE ROLE OF DECISION TREES IN NATURAL LANGUAGE PROCESSING
SISOM & ACOUSTICS 2015, Bucharest 21-22 May THE ROLE OF DECISION TREES IN NATURAL LANGUAGE PROCESSING MarilenaăLAZ R 1, Diana MILITARU 2 1 Military Equipment and Technologies Research Agency, Bucharest,
More informationUsing Web Searches on Important Words to Create Background Sets for LSI Classification
Using Web Searches on Important Words to Create Background Sets for LSI Classification Sarah Zelikovitz and Marina Kogan College of Staten Island of CUNY 2800 Victory Blvd Staten Island, NY 11314 Abstract
More informationWord Segmentation of Off-line Handwritten Documents
Word Segmentation of Off-line Handwritten Documents Chen Huang and Sargur N. Srihari {chuang5, srihari}@cedar.buffalo.edu Center of Excellence for Document Analysis and Recognition (CEDAR), Department
More informationTarget Language Preposition Selection an Experiment with Transformation-Based Learning and Aligned Bilingual Data
Target Language Preposition Selection an Experiment with Transformation-Based Learning and Aligned Bilingual Data Ebba Gustavii Department of Linguistics and Philology, Uppsala University, Sweden ebbag@stp.ling.uu.se
More informationknarrator: A Model For Authors To Simplify Authoring Process Using Natural Language Processing To Portuguese
knarrator: A Model For Authors To Simplify Authoring Process Using Natural Language Processing To Portuguese Adriano Kerber Daniel Camozzato Rossana Queiroz Vinícius Cassol Universidade do Vale do Rio
More informationMyths, Legends, Fairytales and Novels (Writing a Letter)
Assessment Focus This task focuses on Communication through the mode of Writing at Levels 3, 4 and 5. Two linked tasks (Hot Seating and Character Study) that use the same context are available to assess
More informationRule Learning With Negation: Issues Regarding Effectiveness
Rule Learning With Negation: Issues Regarding Effectiveness S. Chua, F. Coenen, G. Malcolm University of Liverpool Department of Computer Science, Ashton Building, Ashton Street, L69 3BX Liverpool, United
More informationLearning Methods in Multilingual Speech Recognition
Learning Methods in Multilingual Speech Recognition Hui Lin Department of Electrical Engineering University of Washington Seattle, WA 98125 linhui@u.washington.edu Li Deng, Jasha Droppo, Dong Yu, and Alex
More informationMemory-based grammatical error correction
Memory-based grammatical error correction Antal van den Bosch Peter Berck Radboud University Nijmegen Tilburg University P.O. Box 9103 P.O. Box 90153 NL-6500 HD Nijmegen, The Netherlands NL-5000 LE Tilburg,
More informationBeyond the Pipeline: Discrete Optimization in NLP
Beyond the Pipeline: Discrete Optimization in NLP Tomasz Marciniak and Michael Strube EML Research ggmbh Schloss-Wolfsbrunnenweg 33 69118 Heidelberg, Germany http://www.eml-research.de/nlp Abstract We
More informationTHE VERB ARGUMENT BROWSER
THE VERB ARGUMENT BROWSER Bálint Sass sass.balint@itk.ppke.hu Péter Pázmány Catholic University, Budapest, Hungary 11 th International Conference on Text, Speech and Dialog 8-12 September 2008, Brno PREVIEW
More informationSpeech Emotion Recognition Using Support Vector Machine
Speech Emotion Recognition Using Support Vector Machine Yixiong Pan, Peipei Shen and Liping Shen Department of Computer Technology Shanghai JiaoTong University, Shanghai, China panyixiong@sjtu.edu.cn,
More informationFinding Translations in Scanned Book Collections
Finding Translations in Scanned Book Collections Ismet Zeki Yalniz Dept. of Computer Science University of Massachusetts Amherst, MA, 01003 zeki@cs.umass.edu R. Manmatha Dept. of Computer Science University
More informationDetecting English-French Cognates Using Orthographic Edit Distance
Detecting English-French Cognates Using Orthographic Edit Distance Qiongkai Xu 1,2, Albert Chen 1, Chang i 1 1 The Australian National University, College of Engineering and Computer Science 2 National
More informationLanguage Acquisition Fall 2010/Winter Lexical Categories. Afra Alishahi, Heiner Drenhaus
Language Acquisition Fall 2010/Winter 2011 Lexical Categories Afra Alishahi, Heiner Drenhaus Computational Linguistics and Phonetics Saarland University Children s Sensitivity to Lexical Categories Look,
More informationEnhancing Unlexicalized Parsing Performance using a Wide Coverage Lexicon, Fuzzy Tag-set Mapping, and EM-HMM-based Lexical Probabilities
Enhancing Unlexicalized Parsing Performance using a Wide Coverage Lexicon, Fuzzy Tag-set Mapping, and EM-HMM-based Lexical Probabilities Yoav Goldberg Reut Tsarfaty Meni Adler Michael Elhadad Ben Gurion
More informationExposé for a Master s Thesis
Exposé for a Master s Thesis Stefan Selent January 21, 2017 Working Title: TF Relation Mining: An Active Learning Approach Introduction The amount of scientific literature is ever increasing. Especially
More informationCalibration of Confidence Measures in Speech Recognition
Submitted to IEEE Trans on Audio, Speech, and Language, July 2010 1 Calibration of Confidence Measures in Speech Recognition Dong Yu, Senior Member, IEEE, Jinyu Li, Member, IEEE, Li Deng, Fellow, IEEE
More informationCross Language Information Retrieval
Cross Language Information Retrieval RAFFAELLA BERNARDI UNIVERSITÀ DEGLI STUDI DI TRENTO P.ZZA VENEZIA, ROOM: 2.05, E-MAIL: BERNARDI@DISI.UNITN.IT Contents 1 Acknowledgment.............................................
More informationHeuristic Sample Selection to Minimize Reference Standard Training Set for a Part-Of-Speech Tagger
Page 1 of 35 Heuristic Sample Selection to Minimize Reference Standard Training Set for a Part-Of-Speech Tagger Kaihong Liu, MD, MS, Wendy Chapman, PhD, Rebecca Hwa, PhD, and Rebecca S. Crowley, MD, MS
More informationA Vector Space Approach for Aspect-Based Sentiment Analysis
A Vector Space Approach for Aspect-Based Sentiment Analysis by Abdulaziz Alghunaim B.S., Massachusetts Institute of Technology (2015) Submitted to the Department of Electrical Engineering and Computer
More informationLearning From the Past with Experiment Databases
Learning From the Past with Experiment Databases Joaquin Vanschoren 1, Bernhard Pfahringer 2, and Geoff Holmes 2 1 Computer Science Dept., K.U.Leuven, Leuven, Belgium 2 Computer Science Dept., University
More informationThe stages of event extraction
The stages of event extraction David Ahn Intelligent Systems Lab Amsterdam University of Amsterdam ahn@science.uva.nl Abstract Event detection and recognition is a complex task consisting of multiple sub-tasks
More information(Sub)Gradient Descent
(Sub)Gradient Descent CMSC 422 MARINE CARPUAT marine@cs.umd.edu Figures credit: Piyush Rai Logistics Midterm is on Thursday 3/24 during class time closed book/internet/etc, one page of notes. will include
More informationDialog Act Classification Using N-Gram Algorithms
Dialog Act Classification Using N-Gram Algorithms Max Louwerse and Scott Crossley Institute for Intelligent Systems University of Memphis {max, scrossley } @ mail.psyc.memphis.edu Abstract Speech act classification
More informationAnalyzing sentiments in tweets for Tesla Model 3 using SAS Enterprise Miner and SAS Sentiment Analysis Studio
SCSUG Student Symposium 2016 Analyzing sentiments in tweets for Tesla Model 3 using SAS Enterprise Miner and SAS Sentiment Analysis Studio Praneth Guggilla, Tejaswi Jha, Goutam Chakraborty, Oklahoma State
More informationA Bayesian Learning Approach to Concept-Based Document Classification
Databases and Information Systems Group (AG5) Max-Planck-Institute for Computer Science Saarbrücken, Germany A Bayesian Learning Approach to Concept-Based Document Classification by Georgiana Ifrim Supervisors
More informationWE GAVE A LAWYER BASIC MATH SKILLS, AND YOU WON T BELIEVE WHAT HAPPENED NEXT
WE GAVE A LAWYER BASIC MATH SKILLS, AND YOU WON T BELIEVE WHAT HAPPENED NEXT PRACTICAL APPLICATIONS OF RANDOM SAMPLING IN ediscovery By Matthew Verga, J.D. INTRODUCTION Anyone who spends ample time working
More informationDefragmenting Textual Data by Leveraging the Syntactic Structure of the English Language
Defragmenting Textual Data by Leveraging the Syntactic Structure of the English Language Nathaniel Hayes Department of Computer Science Simpson College 701 N. C. St. Indianola, IA, 50125 nate.hayes@my.simpson.edu
More informationLeveraging Sentiment to Compute Word Similarity
Leveraging Sentiment to Compute Word Similarity Balamurali A.R., Subhabrata Mukherjee, Akshat Malu and Pushpak Bhattacharyya Dept. of Computer Science and Engineering, IIT Bombay 6th International Global
More informationRole of Pausing in Text-to-Speech Synthesis for Simultaneous Interpretation
Role of Pausing in Text-to-Speech Synthesis for Simultaneous Interpretation Vivek Kumar Rangarajan Sridhar, John Chen, Srinivas Bangalore, Alistair Conkie AT&T abs - Research 180 Park Avenue, Florham Park,
More informationRule Learning with Negation: Issues Regarding Effectiveness
Rule Learning with Negation: Issues Regarding Effectiveness Stephanie Chua, Frans Coenen, and Grant Malcolm University of Liverpool Department of Computer Science, Ashton Building, Ashton Street, L69 3BX
More informationUnsupervised Learning of Word Semantic Embedding using the Deep Structured Semantic Model
Unsupervised Learning of Word Semantic Embedding using the Deep Structured Semantic Model Xinying Song, Xiaodong He, Jianfeng Gao, Li Deng Microsoft Research, One Microsoft Way, Redmond, WA 98052, U.S.A.
More informationArtificial Neural Networks written examination
1 (8) Institutionen för informationsteknologi Olle Gällmo Universitetsadjunkt Adress: Lägerhyddsvägen 2 Box 337 751 05 Uppsala Artificial Neural Networks written examination Monday, May 15, 2006 9 00-14
More informationFirst Grade Curriculum Highlights: In alignment with the Common Core Standards
First Grade Curriculum Highlights: In alignment with the Common Core Standards ENGLISH LANGUAGE ARTS Foundational Skills Print Concepts Demonstrate understanding of the organization and basic features
More informationParsing of part-of-speech tagged Assamese Texts
IJCSI International Journal of Computer Science Issues, Vol. 6, No. 1, 2009 ISSN (Online): 1694-0784 ISSN (Print): 1694-0814 28 Parsing of part-of-speech tagged Assamese Texts Mirzanur Rahman 1, Sufal
More informationIntra-talker Variation: Audience Design Factors Affecting Lexical Selections
Tyler Perrachione LING 451-0 Proseminar in Sound Structure Prof. A. Bradlow 17 March 2006 Intra-talker Variation: Audience Design Factors Affecting Lexical Selections Abstract Although the acoustic and
More informationAnalysis of Emotion Recognition System through Speech Signal Using KNN & GMM Classifier
IOSR Journal of Electronics and Communication Engineering (IOSR-JECE) e-issn: 2278-2834,p- ISSN: 2278-8735.Volume 10, Issue 2, Ver.1 (Mar - Apr.2015), PP 55-61 www.iosrjournals.org Analysis of Emotion
More information1 st Quarter (September, October, November) August/September Strand Topic Standard Notes Reading for Literature
1 st Grade Curriculum Map Common Core Standards Language Arts 2013 2014 1 st Quarter (September, October, November) August/September Strand Topic Standard Notes Reading for Literature Key Ideas and Details
More informationApplications of data mining algorithms to analysis of medical data
Master Thesis Software Engineering Thesis no: MSE-2007:20 August 2007 Applications of data mining algorithms to analysis of medical data Dariusz Matyja School of Engineering Blekinge Institute of Technology
More informationARNE - A tool for Namend Entity Recognition from Arabic Text
24 ARNE - A tool for Namend Entity Recognition from Arabic Text Carolin Shihadeh DFKI Stuhlsatzenhausweg 3 66123 Saarbrücken, Germany carolin.shihadeh@dfki.de Günter Neumann DFKI Stuhlsatzenhausweg 3 66123
More informationBYLINE [Heng Ji, Computer Science Department, New York University,
INFORMATION EXTRACTION BYLINE [Heng Ji, Computer Science Department, New York University, hengji@cs.nyu.edu] SYNONYMS NONE DEFINITION Information Extraction (IE) is a task of extracting pre-specified types
More informationAn Evaluation of POS Taggers for the CHILDES Corpus
City University of New York (CUNY) CUNY Academic Works Dissertations, Theses, and Capstone Projects Graduate Center 9-30-2016 An Evaluation of POS Taggers for the CHILDES Corpus Rui Huang The Graduate
More informationExtracting Opinion Expressions and Their Polarities Exploration of Pipelines and Joint Models
Extracting Opinion Expressions and Their Polarities Exploration of Pipelines and Joint Models Richard Johansson and Alessandro Moschitti DISI, University of Trento Via Sommarive 14, 38123 Trento (TN),
More informationScienceDirect. Malayalam question answering system
Available online at www.sciencedirect.com ScienceDirect Procedia Technology 24 (2016 ) 1388 1392 International Conference on Emerging Trends in Engineering, Science and Technology (ICETEST - 2015) Malayalam
More informationDisambiguation of Thai Personal Name from Online News Articles
Disambiguation of Thai Personal Name from Online News Articles Phaisarn Sutheebanjard Graduate School of Information Technology Siam University Bangkok, Thailand mr.phaisarn@gmail.com Abstract Since online
More informationSemi-Supervised Face Detection
Semi-Supervised Face Detection Nicu Sebe, Ira Cohen 2, Thomas S. Huang 3, Theo Gevers Faculty of Science, University of Amsterdam, The Netherlands 2 HP Research Labs, USA 3 Beckman Institute, University
More informationArticle A Novel, Gradient Boosting Framework for Sentiment Analysis in Languages where NLP Resources Are Not Plentiful: A Case Study for Modern Greek
Article A Novel, Gradient Boosting Framework for Sentiment Analysis in Languages where NLP Resources Are Not Plentiful: A Case Study for Modern Greek Vasileios Athanasiou and Manolis Maragoudakis * Artificial
More informationExploiting Wikipedia as External Knowledge for Named Entity Recognition
Exploiting Wikipedia as External Knowledge for Named Entity Recognition Jun ichi Kazama and Kentaro Torisawa Japan Advanced Institute of Science and Technology (JAIST) Asahidai 1-1, Nomi, Ishikawa, 923-1292
More informationAdvanced Grammar in Use
Advanced Grammar in Use A self-study reference and practice book for advanced learners of English Third Edition with answers and CD-ROM cambridge university press cambridge, new york, melbourne, madrid,
More informationIterative Cross-Training: An Algorithm for Learning from Unlabeled Web Pages
Iterative Cross-Training: An Algorithm for Learning from Unlabeled Web Pages Nuanwan Soonthornphisaj 1 and Boonserm Kijsirikul 2 Machine Intelligence and Knowledge Discovery Laboratory Department of Computer
More information