IberEval 2017, COSET task: a basic approach

Carlos Diez Alba, Jesús Vieco Pérez
Departament de Sistemes Informàtics i Computació
Universitat Politècnica de València
{cardieal,jeviepe}@dsic.upv.es

Abstract. This paper discusses the IberEval 2017 shared task on Classification Of Spanish Election Tweets (COSET) [5]. The goal of the task is to analyze tweets about the 2015 Spanish General Election and classify each of them into one of five categories: political issues, policy issues, personal issues, campaign issues and other issues.

Keywords: UPV, bag-of-words, embeddings, neural networks, support vector machine, random forest

1 Introduction

Election dates increase the volume of political conversation on Twitter. If we are able to segment these conversations into different categories, we can extract a great deal of useful information from them. COSET [5] aims to classify a limited corpus of Spanish tweets related to the 2015 Spanish General Election into one of the following categories:

- Political issues: related to the electoral confrontation
- Policy issues: related to the political sectors
- Personal issues: related to the life and activities of the candidates
- Campaign issues: related to the hustings
- Other issues: the remaining tweets that do not fit in the previous categories

The aim of this paper is to study different feature extraction methods and machine learning algorithms for classifying this set of tweets into their categories.

2 Machine learning algorithms

Before explaining the feature extraction, we need algorithms that can classify the tweets into the different categories. To solve this problem we tried several machine learning algorithms:

- Support vector machine (SVM): probably the most widely used algorithm for tweet classification, the SVM is one of the state-of-the-art algorithms for text-related tasks.
- Random forest (RF): this algorithm is not usually employed for text tasks, but we decided to use it because the size of the corpus makes overfitting likely, and random forests help to prevent it.
- Neural network (NN): currently the most popular machine learning algorithm due to its flexibility. We used both multilayer perceptrons and long short-term memory (LSTM) [6] networks.

We used the support vector machine and random forest implementations from scikit-learn [14]. The neural networks were programmed with Keras [3] and Theano [20].
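As a rough illustration of the scikit-learn side, the following minimal sketch shows how such classifiers can be instantiated; the hyperparameters shown are illustrative assumptions, not the exact experimental settings reported below.

```python
# Minimal sketch of the scikit-learn classifiers used in this work;
# hyperparameters are illustrative assumptions, not the reported settings.
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

# SVC trains one binary SVM per pair of classes (one-vs-one) by default.
svm = SVC(kernel='linear')

# A small forest of ten trees; on a small corpus an ensemble of trees is
# less prone to overfitting than a single deep tree.
rf = RandomForestClassifier(n_estimators=10)

# Both expose the same interface:
# clf.fit(X_train, y_train); y_pred = clf.predict(X_test)
```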

3 Feature extraction

For this task we have, for every tweet, a list of characters, but we cannot feed a machine learning algorithm with a list of characters: we first have to extract features from every tweet. We tried three different feature extraction methods. All of them require a previous preprocessing step:

- Tokenization: we split every tweet into a list of tokens such as words, URL links, punctuation marks, etc.
- Cleaning: we remove all URL links, emoticons and some other special characters.

3.1 N-gram

The n-gram idea is based on grouping the tokens into groups of n. We build a model based on the bag-of-words idea, but over the n-grams themselves [13]: we obtain a vector whose size is the number of distinct n-grams, where each feature counts how many times an n-gram appears in the tweet.

3.2 Tf-idf

Another approach is to use the Term Frequency-Inverse Document Frequency (tf-idf) to generate a vector for every tokenized and cleaned tweet, instead of the n-gram model. It is similar to the previous method, but instead of a raw count, every feature holds a weight that reflects the importance of a word in a tweet. This method has been used in experiments such as [10] and [7].
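The following is a brief sketch of how both representations can be obtained with scikit-learn; the toy corpus and parameter values are assumptions for illustration only.

```python
# Sketch of the two count-based representations with scikit-learn.
# `tweets` stands in for the preprocessed corpus; the values are illustrative.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

tweets = ["ejemplo de tweet limpio", "otro tweet sobre la campaña"]

# Bag of n-grams: here unigrams and bigrams; each feature is a raw count.
ngram_vec = CountVectorizer(ngram_range=(1, 2))
X_ngrams = ngram_vec.fit_transform(tweets)

# Tf-idf: the same vocabulary idea, but counts are reweighted by how
# informative each term is across the corpus.
tfidf_vec = TfidfVectorizer()
X_tfidf = tfidf_vec.fit_transform(tweets)
```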

3.3 Vector representations of words: Embeddings

The third feature extraction method is the one proposed in [12]. The features extracted by this method are called embeddings, and several recent papers combine embeddings with neural networks to obtain state-of-the-art results [19]. The main idea is to learn a vector representation for every word in an unsupervised way from a set of sentences. To train these embeddings we used the Gensim [15] software. Since this model needs many tweets to train and we have fewer than 3,000, we complemented our data with a larger set of tweets provided by the UPV researcher Javier Palanca Cámara and used in other work [21] [22]. These tweets also have a political theme: they are linked to the 2016 demonstration for the independence of Catalonia known as the Diada. This feature extraction method produces one vector per word, so we obtain a different number of vectors for every tweet. To train a neural network on such variable-length sequences we need a recurrent layer. In our case we used long short-term memory (LSTM) layers [6], and we compared them with the bidirectional [17] long short-term memory (BLSTM), which has shown improvements in this field [11] [2].
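A minimal sketch of this step follows, assuming the pre-4.0 Gensim API (the `size` parameter was later renamed `vector_size`) and a toy tokenized corpus.

```python
# Minimal sketch of unsupervised embedding training with Gensim.
from gensim.models import Word2Vec

tokenized_tweets = [["ejemplo", "de", "tweet", "limpio"],
                    ["otro", "tweet", "sobre", "la", "campaña"]]

# Learn one 100-dimensional vector per word from the (unlabeled) tweets.
w2v = Word2Vec(tokenized_tweets, size=100, window=5, min_count=1)

# Every tweet becomes a variable-length sequence of word vectors, which is
# why a recurrent layer (LSTM/BLSTM) is needed downstream.
sequences = [[w2v.wv[token] for token in tweet if token in w2v.wv]
             for tweet in tokenized_tweets]
```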

3.4 Combining methods

We combined the two feature extraction methods, tf-idf and word embeddings, in order to improve on our previous models. This requires a neural network that can join both inputs. Figure 1 shows the base network we worked with.

[Fig. 1: Network schema]

We tried several variants of this model: replacing the LSTM with a BLSTM, and adding dropout, batch normalization and more fully connected layers. To improve what the LSTM learned, we added another output layer just before the LSTM.
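Since the exact wiring of Fig. 1 is not reproduced here, the following hypothetical Keras sketch only illustrates the general pattern of joining the two inputs: one dense branch over the tf-idf vector, one LSTM branch over the embedding sequence, concatenated before the output. VOCAB, MAXLEN, EMB_DIM and all layer sizes are assumptions.

```python
# Hypothetical two-branch model in the spirit of Fig. 1; all sizes assumed.
from keras.models import Model
from keras.layers import Input, Dense, LSTM, concatenate

VOCAB = 5000      # tf-idf dimensionality (assumption)
MAXLEN = 30       # padded tweet length (assumption)
EMB_DIM = 100     # word-vector size (assumption)

tfidf_in = Input(shape=(VOCAB,))
tfidf_branch = Dense(256, activation='relu')(tfidf_in)

seq_in = Input(shape=(MAXLEN, EMB_DIM))
lstm_branch = LSTM(128)(seq_in)

# Join both feature types and classify into the five COSET categories.
merged = concatenate([tfidf_branch, lstm_branch])
out = Dense(5, activation='softmax')(merged)

model = Model(inputs=[tfidf_in, seq_in], outputs=out)
model.compile(optimizer='adam', loss='categorical_crossentropy')
```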

4 Results and Discussion

We now present the experiments, divided into three parts: first the experiments with the n-gram and tf-idf feature extraction methods, next the experiments with the embeddings method, and finally the combination of both. For every model and feature extraction method we report the best macro f1 score, since the task is evaluated with this metric on account of the unbalanced corpus. In addition, we report the accuracy for every result. The f1 score considers both precision and recall:

$f_1 = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$  (1)

To compute the macro f1 score we calculate the f1 score for every class and take their unweighted mean.

4.1 N-grams and tf-idf

Table 1 shows, for every machine learning algorithm, our best result in terms of macro f1 score together with its corresponding accuracy. The first column pair corresponds to the support vector machine, the second to the random forest, and the third to the neural networks.

Table 1: N-gram and tf-idf experiments using SVM, RF and NN

              SVM                  RF                   NN
          accuracy  f1 score   accuracy  f1 score   accuracy  f1 score
  1-gram   60.80     56.70      51.20     43.77      64.00     58.74
  2-gram   60.40     55.92      50.80     41.19      62.80     59.09
  3-gram   62.80     56.73      50.08     41.77      62.40     59.05
  tf-idf   63.60     57.24      52.40     47.63      64.80     59.15

As the table shows, the random forest results are quite far from the others. As we said before, the random forest is not as common as the SVM for text-related tasks, but we wanted to include it in the comparison because some studies report very good results with it [4] [1]. For the random forest, the best results are obtained with the tf-idf feature extraction and a forest of ten decision trees in which every leaf is forced to contain only one sample. In the case of the support vector machine, we get the best result with the same feature extraction and a group of linear-kernel SVMs, one for every pair of classes, whose votes decide the predicted label. The best result in this table was obtained with a multilayer perceptron with three hidden layers of sizes 1024, 512 and 256. Given the size of the corpus we need to avoid overfitting; in this case we used dropout [18] along with batch normalization [8]. A curious property of this model is that skipping the cleaning step (removing URL links, emoticons and some other special characters) improved its results by 3 points, an improvement we did not observe with the other algorithms.
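A minimal Keras sketch of a perceptron of this shape follows, together with the macro f1 evaluation of Equation (1); the dropout rate and optimizer are illustrative assumptions.

```python
# Sketch of the multilayer perceptron described above: three hidden layers of
# 1024, 512 and 256 units, each followed by batch normalization and dropout.
# Dropout rate and optimizer are assumptions.
from keras.models import Sequential
from keras.layers import Dense, Dropout, BatchNormalization
from sklearn.metrics import f1_score

def build_mlp(input_dim, n_classes=5):
    model = Sequential()
    model.add(Dense(1024, activation='relu', input_dim=input_dim))
    model.add(BatchNormalization())
    model.add(Dropout(0.5))
    model.add(Dense(512, activation='relu'))
    model.add(BatchNormalization())
    model.add(Dropout(0.5))
    model.add(Dense(256, activation='relu'))
    model.add(BatchNormalization())
    model.add(Dropout(0.5))
    model.add(Dense(n_classes, activation='softmax'))
    model.compile(optimizer='adam', loss='categorical_crossentropy')
    return model

# Macro f1: the per-class f1 of Equation (1), averaged with equal weight:
# f1_macro = f1_score(y_true, y_pred, average='macro')
```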

4.2 Embeddings

In this section we trained two embedding models. The first was trained only on the training tweets provided for the competition, and the second on these tweets together with the tweets provided by Javier Palanca Cámara [22] [21]. We obtained better results with the second model in all the experiments with embeddings; therefore, to avoid redundant results, we do not report those of the first model.

Table 2: Embeddings experiments

           accuracy  f1 score
  LSTM      46.80     42.12
  BLSTM     50.00     47.92
  SVM       63.60     57.24
  RF        52.40     47.63
  NN        64.80     59.15

Table 2 shows the best results obtained with the LSTM and the BLSTM, together with the best results obtained in the previous section with the SVM, RF and NN. As we said before, the BLSTM improves on the LSTM, but both obtain worse results than the n-gram and tf-idf approaches. We conclude that, for this task, embeddings and recurrent neural networks need many more samples than the other approaches to be competitive.
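The following sketch shows how the two recurrent variants differ in Keras; layer sizes and the padded sequence length are assumptions.

```python
# Sketch of the LSTM vs. bidirectional LSTM comparison over embedding
# sequences; layer sizes and padded length are illustrative assumptions.
from keras.models import Sequential
from keras.layers import LSTM, Bidirectional, Dense

MAXLEN, EMB_DIM, N_CLASSES = 30, 100, 5

def build_recurrent(bidirectional=False):
    model = Sequential()
    if bidirectional:
        # The Bidirectional wrapper runs the sequence in both directions
        # and concatenates the two final hidden states.
        model.add(Bidirectional(LSTM(128), input_shape=(MAXLEN, EMB_DIM)))
    else:
        model.add(LSTM(128, input_shape=(MAXLEN, EMB_DIM)))
    model.add(Dense(N_CLASSES, activation='softmax'))
    model.compile(optimizer='adam', loss='categorical_crossentropy')
    return model
```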

4.3 Combining methods

Our best results with the combined model are shown in Table 3. As we said before, we obtained a better result with the BLSTM than with the LSTM. The combination improves on the results of the LSTM and the BLSTM with embeddings alone, but it cannot improve on the results of the neural network with tf-idf. One of the biggest problems of this method is the number of parameters that must be estimated; more data would be needed to train the model properly.

Table 3: Combining experiments

           accuracy  f1 score
  LSTM      54.22     50.95
  BLSTM     61.85     56.34

5 Conclusions and future work

We conclude that traditional models such as support vector machines and multilayer perceptrons have worked better than fashionable approaches such as the LSTM and the BLSTM. We observed the same when comparing n-grams and tf-idf against embeddings. These results are due to the size of the training set, so we might obtain better results using data augmentation techniques. To improve on our current models, we could try new approaches such as fastText [9]. We could also add other features, such as part-of-speech tags, as in [16].

References

1. Aramaki, E., Maskawa, S., Morita, M.: Twitter catches the flu: detecting influenza epidemics using Twitter. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1568-1576. Association for Computational Linguistics (2011)
2. Augenstein, I., Rocktäschel, T., Vlachos, A., Bontcheva, K.: Stance detection with bidirectional conditional encoding. arXiv preprint arXiv:1606.05464 (2016)
3. Chollet, F., et al.: Keras. https://github.com/fchollet/keras (2015)
4. Fernández Anta, A., Núñez Chiroque, L., Morere, P., Santos Méndez, A.: Sentiment analysis and topic detection of Spanish tweets: a comparative study of NLP techniques (2013)
5. Giménez, M., Baviera, T., Llorca, G., Gámir, J., Calvo, D., Rosso, P., Rangel, F.: Overview of the 1st Classification of Spanish Election Tweets task at IberEval 2017. In: Proceedings of the Second Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2017), Murcia, Spain, September 19, 2017. CEUR Workshop Proceedings, CEUR-WS.org (2017)
6. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735-1780 (1997)
7. Hong, L., Dan, O., Davison, B.D.: Predicting popular messages in Twitter. In: Proceedings of the 20th International Conference Companion on World Wide Web, pp. 57-58. ACM (2011)
8. Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015)
9. Joulin, A., Grave, E., Bojanowski, P., Mikolov, T.: Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759 (2016)
10. Lee, K., Palsetia, D., Narayanan, R., Patwary, M.M.A., Agrawal, A., Choudhary, A.: Twitter trending topic classification. In: Data Mining Workshops (ICDMW), 2011 IEEE 11th International Conference on, pp. 251-258. IEEE (2011)
11. Limsopatham, N., Collier, N.: Bidirectional LSTM for named entity recognition in Twitter messages. In: WNUT 2016, p. 145 (2016)
12. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)
13. Pak, A., Paroubek, P.: Twitter as a corpus for sentiment analysis and opinion mining. In: LREC, vol. 10 (2010)
14. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, 2825-2830 (2011)
15. Řehůřek, R., Sojka, P.: Software framework for topic modelling with large corpora. In: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pp. 45-50. ELRA, Valletta, Malta (May 2010), http://is.muni.cz/publication/884893/en
16. Robinson, T.: Disaster tweet classification using parts-of-speech tags: a domain adaptation approach. Ph.D. thesis, Kansas State University (2016)
17. Schuster, M., Paliwal, K.K.: Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45(11), 2673-2681 (1997)
18. Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929-1958 (2014)
19. Tang, D., Wei, F., Yang, N., Zhou, M., Liu, T., Qin, B.: Learning sentiment-specific word embedding for Twitter sentiment classification. In: ACL (1), pp. 1555-1565 (2014)
20. Theano Development Team: Theano: a Python framework for fast computation of mathematical expressions. arXiv e-prints abs/1605.02688 (May 2016), http://arxiv.org/abs/1605.02688
21. Val, E.D., Palanca, J., Rebollo, M.: U-Tool: A Urban-Toolkit for enhancing city maps through citizens activity. In: 14th International Conference on Practical Applications of Agents and Multi-Agent Systems, pp. 243-246 (2016)
22. Vivanco, E., Palanca, J., Val, E.D., Rebollo, M., Botti, V.: Using geo-tagged sentiment to better understand social interactions. In: 15th International Conference on Practical Applications of Agents and Multi-Agent Systems (2017)