Sentiment Analysis and Visualization of Social Media Data



The #BostonMarathon #Bombings test case

Amir Salarpour, Department of Computer Engineering, Bu-Ali Sina University, Hamedan, Iran
Mohammad Hossein Bamneshin, Department of Computer Engineering, Bu-Ali Sina University, Hamedan, Iran
Dimitris Proios, Department of Information and Telematics, Harakopio University of Athens, Athens, Greece

Abstract: This work aims a) to perform sentiment analysis on social media data using machine learning methods and b) to propose a user-friendly visualization of these data.

Keywords: sentiment analysis, data visualization, machine learning

I. INTRODUCTION

The target of this project is a) to perform sentiment analysis (SA) on social media text data (e.g. product reviews, tweets) using machine learning algorithms and b) to create a visualization summary of these data that also takes the SA output into account.

Fig. 1. Overall system's workflow

To perform sentiment analysis we pre-processed the text data using GATE (General Architecture for Text Engineering). The output was used to construct feature vectors (feature extraction), which in turn were used to train several machine learning models. Finally, the learnt models were evaluated in terms of accuracy, which is the proportion of test instances (reviews) that were classified in the correct category. For the visualization task we used D3 (Data-Driven Documents), a JavaScript library that provides functionality to display data in graphical charts.

Fig. 2. Sentiment analysis flow

II. DATASET AND ANNOTATION

A. Dataset Description

The following two datasets were used for the experiments:

Product Reviews Corpus (PRC). The PRC is part of the Wishful Expressions Corpora [1]. It contains 1235 sentences with customer product reviews from Amazon.com and cnet.com, collected by Bing Liu¹ and his colleagues and used in several publications.
Two examples of such reviews are the following:

i will never buy their product again at this rate and neither should you
the product has worked perfectly for me on my xp

This corpus was annotated by ILSP² with respect to specific sentiment categories (see Section II.B).

Boston Corpus (BC). The BC was collected and annotated by ILSP³. It contains 5000 tweets related to the Marathon event and the bombings that took place in Boston on 15/4/2013. The Marathon started at 09:00 and the explosions occurred at 14:59. The tweets were collected between 14/4/2013 at 21:00 and 15/4/2013 at 19:46. Some examples of these tweets are the following:

Excited for the #bostonmarathon tomorrow #makeitcount #findgreatness
Best of luck to all those running #bostonmarathon today! Have FUN and enjoy!!
#bostonmarathon explosions! Horrifying site :( let's pray for the may affected
#PrayersforBoston our prayers go out to Boston this afternoon.

We randomly selected and annotated a subset of 1027 tweets. This sub-corpus was initially used to test the best classifier trained on the product reviews corpus. In a second phase, it was combined with the product reviews corpus in order to train the final sentiment model.

¹ The original data and publications can be found at
² Institute for Language and Speech Processing, Athena R.C.
³ This corpus cannot be distributed without the permission of ILSP. Any questions on this data should be directed to Haris Papageorgiou (xaris@ilsp.gr)

B. Annotation and Inter-Annotator Agreement (IAA)

Product Reviews Corpus: Each sentence of the PRC was judged⁴ by two annotators with respect to the following categories:

Subjective: any text about private states (sentiments, opinions, emotions, feelings, thoughts, beliefs etc.) expressed by an author.
Objective: text that contains only factual information.
Positive: any text containing positive opinions, emotions, feelings etc. or facts/events that may trigger positive sentiment.
Negative: any text containing negative opinions, emotions, feelings etc. or facts/events that may trigger negative sentiment.
Praise: positive evaluations and opinions about specific entities or topics and their aspects, explicitly or implicitly expressed by an author.
Criticism: negative evaluations and opinions about specific entities or topics and their aspects, explicitly or implicitly expressed by an author.

To assess the agreement of the two annotators we used Cohen's kappa coefficient and PSA (Proportion of Specific Agreement). The results are shown in Table I.

TABLE I. INTER-ANNOTATOR AGREEMENT
Method             Sub    Obj    Pos    Neg    Praise    Crit
Kappa                                          0.6785    0.5690
PSA for 0-label                                0.8749    0.9071
PSA for 1-label                                0.8029    0.6616

When the classes contain different numbers of instances, the kappa coefficient is unreliable.
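The two agreement measures named above can be sketched as follows. This is a minimal illustration using the standard textbook definitions of Cohen's kappa and of positive/negative specific agreement for two annotators and binary labels; the function name `agreement_scores` and the toy label lists are assumptions made for this example, not the tooling or data used by the authors.

```python
from collections import Counter


def agreement_scores(labels_a, labels_b):
    """Return (kappa, psa_0, psa_1) for two equally long lists of 0/1 labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    pairs = Counter(zip(labels_a, labels_b))
    both_1 = pairs[(1, 1)]
    both_0 = pairs[(0, 0)]
    disagree = pairs[(0, 1)] + pairs[(1, 0)]

    # Cohen's kappa: observed agreement corrected for chance agreement.
    observed = (both_1 + both_0) / n
    p1_a = sum(labels_a) / n
    p1_b = sum(labels_b) / n
    expected = p1_a * p1_b + (1 - p1_a) * (1 - p1_b)
    # Degenerate case (both annotators use a single identical label):
    # treated as perfect agreement here, which is a convention of this sketch.
    kappa = (observed - expected) / (1 - expected) if expected < 1 else 1.0

    # Proportion of specific agreement, computed separately per label.
    psa_1 = 2 * both_1 / (2 * both_1 + disagree) if both_1 + disagree else 1.0
    psa_0 = 2 * both_0 / (2 * both_0 + disagree) if both_0 + disagree else 1.0
    return kappa, psa_0, psa_1


# Invented toy annotations with a rare 1-label (not taken from the corpus).
annotator_a = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
annotator_b = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
kappa, psa_0, psa_1 = agreement_scores(annotator_a, annotator_b)
```

On this toy input the annotators agree on 9 of 10 items, yet kappa is only about 0.62, while the per-label view gives about 0.94 for the 0-label and about 0.67 for the 1-label. This is the kind of imbalance effect that motivates reporting the agreement per label.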
We therefore also used the PSA coefficient, which reports the agreement for each label separately. For the subjective category the agreement for the 1-label class is high but not for the 0-label class, and vice versa for the objective category. In contrast, the results show high agreement in the positive and negative categories. The IAA is also high for the praise category. In the criticism category the agreement is high for the 0-label class (0.9071) but lower for the 1-label class (0.6616), and this asymmetry between the two criticism classes results in a low kappa. We decided to focus on the positive and negative categories, where the IAA was substantial. In total, the two annotators disagreed on 300 sentences. These conflicts were resolved by an expert annotator in order to create a reliable corpus.

Boston Corpus: The randomly selected tweets subset⁵ was annotated similarly to the product reviews corpus by a domain expert from ILSP⁶. Again, we focus on the positive and negative categories.

III. FEATURES AND DATA PRE-PROCESSING

A. Features

ILSP provided two types of features for the sentiment experiments:

Lexical features, assigned to each token of a text after it has been pre-processed by a custom Natural Language Processing pipeline in GATE (see Section III.B). Examples of such features are part-of-speech tags, punctuation, orthography etc.

Sentiment lexicon-based features, resulting from the following lexica:

Opinion Lexicon [2, 3]: It contains 4783 negative and 2006 positive opinion words, namely words used by writers and speakers to express their opinions toward some target.
ANEW (Affective Norms of English Words) [4]: It contains 1034 words with valence, arousal and dominance scores. These words had previously been identified as bearing meaningful emotional content.
Lexicon [5]: It contains 4075 attitude words classified in particular categories using specific syntactic and semantic criteria. The attitude words are examined within the scope of Appraisal Theory [6] and are considered a linguistic device for expressing evaluations (criticism or praise) toward some target.
Intention Lexicon⁷: It contains 355 words and expressions used to express future intentions such as commitments, promises, desires etc. Each entry is classified in particular categories using specific syntactic and semantic criteria.

B. Data Pre-processing

The dataset was pre-processed using GATE (General Architecture for Text Engineering) by applying the following pipeline:

⁴ The annotated corpus cannot be distributed without the permission of ILSP.
⁵ After removing the duplicates (re-tweets) we ended up with a corpus of 776 distinct tweets.
⁶ The annotated corpus cannot be distributed without the permission of ILSP.
⁷ This lexicon is being developed by ILSP (Pontiki Maria, Thanasis Kalogeropoulos and Haris Papageorgiou) and is not published yet.

Fig. 3. NLP pipeline

The final output of the GATE pre-processing for each text is stored in an XML file.

IV. SENTIMENT ANALYSIS USING MACHINE LEARNING

A. Experiments on product reviews dataset

We split the 1235 reviews of the PRC into two parts: 70% was used for training and 30% for testing.

1) Feature extraction

We parsed each GATE XML file (using MATLAB) and calculated the following 32 features for each review:

1- Number of different categories in each sentence.
2- Number of NN tags.
3- Number of JJ tags.
4- Number of VB tags.
5- Number of RB tags.
6- Number of W tags.
7- Number of negation words.
8- Number of words detected by Opinion Lexicon.
9- Number of negative words detected by Opinion Lexicon.
10- Number of positive words detected by Opinion Lexicon.
11- Number of words detected by Lexicon.
12- Number of negative words detected by Lexicon.
13- Number of positive words detected by Lexicon.
14- Number of both words detected by Lexicon.
15- Number of JJ words detected by Lexicon.
16- Number of NN words detected by Lexicon.
17- Number of RB words detected by Lexicon.
18- Number of negative and JJ words detected by Lexicon.
19- Number of negative and NN words detected by Lexicon.
20- Number of negative and RB words detected by Lexicon.
21- Number of positive and JJ words detected by Lexicon.
22- Number of positive and NN words detected by Lexicon.
23- Number of positive and RB words detected by Lexicon.
24- Number of both and JJ words detected by Lexicon.
25- Number of both and NN words detected by Lexicon.
26- Number of both and RB words detected by Lexicon.
27- Average of Valence Mean for words covered by ANEW Lexicon.
28- Average of Dominance Mean for words covered by ANEW Lexicon.
29- Average of Arousal Mean for words covered by ANEW Lexicon.
30- Number of Desire words detected using Intention Lexicon.
31- Number of Commitment words detected using Intention Lexicon.
32- Number of Purpose words detected using Intention Lexicon.

In Table II we present a statistical analysis of the extracted features:

TABLE II. FEATURES PROPERTIES
Feature                       Min   Max   Mean      Mode   Variance   Range
Number of Category            2     83    19.5789   13     11.6469    81
Number of NN                  0     28    5.43157   3      3.51322    28
Number of JJ                  0     9     1.40323   1      1.
Number of VB
Number of RB
Number of W
Number of Negations
Number of Opinion
Number of Positive Opinion
Number of Negative Opinion
Number of
Number of Negative
Number of Positive
Number of Both
Number of JJ
Number of NN
Number of RB
Number of JJ Negative
Number of NN Negative
Number of RB Negative
Number of JJ Positive
Number of NN Positive
Number of RB Positive
Number of JJ Both
Number of NN Both
Number of RB Both
Valence Average
Dominance Average
Arousal Average
Number of Desire
Number of Commitment
Number of Purpose
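A few of the count features listed above can be illustrated with a short sketch. The paper's extraction was done in MATLAB over GATE XML output; the version below is in Python and works on a hypothetical in-memory token format (a list of dicts with `pos`, `negation` and `opinion` keys), which is an assumption made for this example. Counting a tag by prefix (so that, for instance, NNS counts as NN) is also an assumption; the paper does not say how tag variants were handled.

```python
POS_PREFIXES = ("NN", "JJ", "VB", "RB", "W")


def count_features(tokens):
    """Count part-of-speech, negation and opinion-word features for one text.

    `tokens` is a list of dicts such as
    {"pos": "RB", "negation": True, "opinion": None}; this layout is
    hypothetical and stands in for the annotations of the GATE pipeline.
    """
    features = {f"num_{prefix}": 0 for prefix in POS_PREFIXES}
    features.update(num_negations=0, num_opinion=0,
                    num_opinion_neg=0, num_opinion_pos=0)
    for token in tokens:
        tag = token.get("pos", "")
        for prefix in POS_PREFIXES:
            if tag.startswith(prefix):  # assumption: NNS, NNP count as NN
                features[f"num_{prefix}"] += 1
                break
        if token.get("negation"):
            features["num_negations"] += 1
        polarity = token.get("opinion")  # "positive", "negative" or None
        if polarity is not None:
            features["num_opinion"] += 1
            if polarity == "negative":
                features["num_opinion_neg"] += 1
            elif polarity == "positive":
                features["num_opinion_pos"] += 1
    return features


# Hand-annotated tokens for "the product has worked perfectly";
# the tags and the lexicon hit are illustrative, not GATE output.
example_tokens = [
    {"pos": "DT"},
    {"pos": "NN"},
    {"pos": "VBZ"},
    {"pos": "VBN"},
    {"pos": "RB", "opinion": "positive"},
]
example_features = count_features(example_tokens)
```

The remaining features in the list follow the same pattern: each is a count (or, for ANEW, an average) over tokens that satisfy a combination of tag and lexicon conditions.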

2) Experiments

For the sentiment analysis task several experiments were conducted using different machine learning (ML) algorithms. To evaluate the learnt models we used accuracy, defined as the number of correctly classified instances divided by the total number of instances. In our models we kept all the features listed in the previous section, since the feature selection experiments we ran using various methods (Information Gain, Mutual Information and Ranking Algorithm) did not show any improvement. As a baseline system we used a majority classifier, which always chooses the category that is most frequent in the training data. The accuracy of this baseline also indicates the level of difficulty of the task. Below we present the experiments on the product reviews dataset using various ML methods:

K-Nearest Neighbors (k-NN): k-NN is a simple ML algorithm that assigns a test instance to the majority category of its k nearest training examples. We tried a wide range of k values (k = 1, ..., 100) and found, using cross-validation on the training set, that the optimal value is 21. As a distance measure between feature vectors we used Euclidean distance. We also used Principal Component Analysis (PCA) to remove correlations between features. PCA improves accuracy for both the positive and the negative category (see Tables III and IV).

Naïve Bayes: Another well-known ML algorithm is Naïve Bayes (NB). NB assumes that the feature variables (x_1, ..., x_n) are independent given the class c (category). The distributions P(x_i | c) of these variables are estimated from the training data. Given a test instance defined by its feature vector x, NB assigns it to the class c that has the highest P(c | x). The latter probability is computed using Bayes' theorem and the learnt P(x_i | c) probabilities.
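In symbols, the Naïve Bayes decision rule described above takes the standard form below. The class prior P(c) and the evidence term P(x) are not named in the text; they enter through Bayes' theorem, and P(x) can be dropped because it does not depend on c.

```latex
\hat{c} \;=\; \arg\max_{c} P(c \mid x),
\qquad
P(c \mid x) \;=\; \frac{P(c)\,\prod_{i=1}^{n} P(x_i \mid c)}{P(x)}
\;\propto\; P(c)\,\prod_{i=1}^{n} P(x_i \mid c)
```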
SVM with MLP or RBF kernel: We also tried Support Vector Machines (SVM), which attempt to learn a separating hyperplane for the given classes (categories) from the training data. We experimented with a Multilayer Perceptron (MLP) kernel and a Radial Basis Function (RBF) kernel. The best accuracy was obtained using the RBF kernel. We tuned the model parameters on the training set using Genetic Algorithms (GA), which significantly improved accuracy for both kernels (see Tables III and IV). As before, we used PCA to remove correlations between features. The model with the best results is SVM with RBF, which we use in our experiments with Twitter data.

TABLE III. ACCURACY OF DIFFERENT METHODS FOR NEGATIVE CATEGORY
Columns: Method | Train (cross-validation in training set) | Train (full training set) | Test
Baseline
k-NN (k=21)
k-NN + PCA (k=21)
SVM MLP + PCA
SVM MLP + PCA
SVM RBF + PCA
SVM RBF + PCA
Naïve Bayes
Naïve Bayes + PCA
Naïve Bayes + tuning
Naïve Bayes + tuning + PCA

TABLE IV. ACCURACY OF DIFFERENT METHODS FOR POSITIVE CATEGORY
Columns: Method | Train (cross-validation in training set) | Train (full training set) | Test
Baseline
k-NN (k=21)
k-NN + PCA (k=21)
SVM MLP + PCA
SVM MLP + PCA
SVM RBF + PCA
SVM RBF + PCA
Naïve Bayes
Naïve Bayes + PCA
Naïve Bayes + tuning
Naïve Bayes + tuning + PCA

We also wanted to assess how much better our models predict the target (category) as we increase the number of training instances. We therefore built models using 10%, 20%, ..., 100% of the training set and evaluated them on the test set. The results are shown in Figures 4, 5, 6 and 7 for the negative category and in Figures 8, 9, 10 and 11 for the positive category. k-NN with PCA, SVM with RBF or MLP kernel using parameter tuning and PCA, and Naïve Bayes using parameter tuning and PCA improve their accuracy as we add more training data.

Fig. 4. KNN learning curve on the negative class: comparison with and without PCA
Fig. 5. Naive Bayes learning curves on the negative class: comparison with and without PCA and parameter tuning
Fig. 6. SVM-RBF learning curve on the negative class using PCA, compared with the same method with tuned parameters
Fig. 7. SVM-MLP learning curve on the negative class using PCA, compared with the same method with tuned parameters
Fig. 8. KNN learning curve on the positive class: comparison with and without PCA
Fig. 9. Naive Bayes learning curves on the positive class: comparison with and without PCA and parameter tuning
Fig. 10. SVM-RBF learning curve on the positive class using PCA, compared with the same method with tuned parameters
Fig. 11. SVM-MLP learning curve on the positive class using PCA, compared with the same method with tuned parameters
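The learning-curve procedure described above (train on growing fractions of the training set, evaluate each model on the fixed test set) can be sketched as follows. The sketch uses a plain k-NN classifier with Euclidean distance as a stand-in for the paper's models and also shows the majority baseline and the accuracy measure. All function names and the one-dimensional toy data are assumptions made for this example; PCA and parameter tuning are left out.

```python
import math
from collections import Counter


def majority_label(labels):
    """Most frequent label; this is what the majority baseline always predicts."""
    return Counter(labels).most_common(1)[0][0]


def knn_predict(train_x, train_y, x, k):
    """Label x with the majority label of its k nearest training vectors."""
    nearest = sorted(range(len(train_x)),
                     key=lambda i: math.dist(train_x[i], x))[:k]
    return majority_label([train_y[i] for i in nearest])


def accuracy(predicted, gold):
    """Correctly classified instances divided by the total number of instances."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def learning_curve(train_x, train_y, test_x, test_y, k=3, steps=10):
    """Return (training size, test accuracy) pairs for growing training prefixes."""
    curve = []
    for step in range(1, steps + 1):
        size = max(1, len(train_x) * step // steps)
        sub_x, sub_y = train_x[:size], train_y[:size]
        predictions = [knn_predict(sub_x, sub_y, x, min(k, size)) for x in test_x]
        curve.append((size, accuracy(predictions, test_y)))
    return curve


# Invented 1-D toy data: label 0 clusters near 0.0, label 1 near 1.0.
demo_train_x = [(0.0,), (1.0,), (0.2,), (0.9,), (0.1,),
                (1.1,), (0.3,), (0.8,), (0.0,), (1.2,)]
demo_train_y = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
demo_test_x = [(0.05,), (1.05,), (0.25,), (0.95,)]
demo_test_y = [0, 1, 0, 1]

demo_curve = learning_curve(demo_train_x, demo_train_y,
                            demo_test_x, demo_test_y, k=3, steps=5)
```

With `steps=5` the toy training set of ten instances is evaluated at sizes 2, 4, 6, 8 and 10. On real data the training instances would normally be shuffled before taking prefixes, so that each fraction is a representative sample.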


B. Experiments on BC

We used our best models (SVM RBF + PCA + tuning), trained on 70% of the product reviews, and evaluated them on the 776 tweets. The models achieve 64.1% and 77.7% accuracy for the positive and negative class, respectively. Both outperform the corresponding majority baselines.

TABLE V. SVM RBF + PCA + TUNING
Columns: Method | Positive class | Negative class
Majority baseline
SVM RBF + PCA

We also created a training set using 70% of the product reviews and 70% of the labeled Twitter data. Similarly, we created a test set by combining the remaining 30% of the two datasets. We then trained models by progressively adding more training data, as in the previous section. As shown in Fig. 12, our models achieve better accuracy as more training instances are added.

Fig. 12. Learning curve of SVM-RBF using GA on the combined dataset

The following table shows the accuracy of our classifier trained on 100% of the training set. This model was used to classify the remaining 3963 tweets from the BC, and its output was fed to the data visualization algorithm.

TABLE VI. SVM RBF + PCA + TUNING
Columns: Method | Positive class | Negative class
Majority
SVM RBF + PCA

V. DATA VISUALIZATION

Two types of data graph visualizations are presented:

A. Hashtag graph

This graph presents the most important (most frequently discussed) topics of the 3963 tweets using a D3 bubble chart. Bigger bubbles correspond to more frequently discussed topics.

Fig. 13. General Hashtags Bubble Chart
Fig. 14. General Hashtags Bubble Chart using logarithmic scale for radial size

As seen in the above figures, each topic corresponds to a set of Twitter hashtags whose names derive from one another through minor lexical or stylistic transformations (e.g. BostonMarathon, bostonmarathon). These hashtags are detected using simple heuristics and/or Levenshtein (edit) distance. Figures 15 and 16 show the graph with the consolidated hashtags.
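The hashtag consolidation step can be sketched as below. The paper only states that simple heuristics and/or Levenshtein distance were used; the lowercasing heuristic, the greedy merge order, the `max_distance` threshold and the hashtag counts in the example are all assumptions made for this illustration.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1,        # delete char_a
                               current[j - 1] + 1,     # insert char_b
                               previous[j - 1] + (char_a != char_b)))
        previous = current
    return previous[-1]


def consolidate(hashtag_counts, max_distance=1):
    """Greedily merge hashtags into topics and sum their frequencies.

    Hashtags are visited from most to least frequent. Each one is lowercased
    and merged into the first existing topic whose name is within
    `max_distance` edits; otherwise it starts a new topic.
    """
    topics = {}
    ordered = sorted(hashtag_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for tag, count in ordered:
        name = tag.lower()
        for topic in topics:
            if levenshtein(name, topic) <= max_distance:
                topics[topic] += count
                break
        else:
            topics[name] = count
    return topics


# Invented frequencies, for illustration only.
demo_counts = {
    "bostonmarathon": 120,
    "BostonMarathon": 50,
    "bostonmarathons": 5,
    "PrayForBoston": 40,
    "prayforboston": 30,
}
demo_topics = consolidate(demo_counts)
```

For the invented counts above, the five hashtags collapse into two topics, `bostonmarathon` with 175 occurrences and `prayforboston` with 70. The consolidated totals are what would drive the bubble sizes in a chart of this kind.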


Fig. 15. Bubble Chart after the hashtag consolidation
Fig. 16. Bubble Chart after the hashtag consolidation using logarithmic scale for radial size

B. Sentiment Graph

This graph presents the frequency of positive and negative tweets over time. As shown in Figures 17 and 18, the number of tweets about the Boston Marathon was relatively small in the beginning; after the bomb explosions, however, it increased rapidly. The figures also show that negative tweets came to dominate positive ones as time passed, since more people expressed their sadness or anger about the event. The fact that many positive tweets are detected (as shown in the graph) is mainly because many people expressed hopes and wishes (e.g. "Best wishes to those at #BostonMarathon", "I hope everyone is ok").

Fig. 17. The distribution of positive and negative tweets on a four hour time frame
Fig. 18. The distribution of positive and negative tweets per hour

VI. CONCLUSIONS AND FUTURE WORK

We have experimented with a variety of well-known machine learning algorithms to predict the expression of positive or negative sentiment in social media data. We have shown that a Support Vector Machine with RBF kernel obtains the best results for both categories on a dataset of product reviews. We have also shown that the same classifier has competitive results on a different domain (Twitter dataset). Using the D3 JavaScript library, we also created a concise visualization summary of the data. This visualization presents in a user-friendly way a) the most important topics discussed and b) the dominant sentiment expressed in the data over time. In future work we plan to assess the effectiveness of each lexicon and to test different feature sets and machine learning algorithms (e.g. Logistic Regression). In addition, we would like to perform an error analysis to detect the cases in which our classifier fails to predict the correct sentiment.
Furthermore, a more sophisticated visualization is planned in which we will present the dominant topics per time unit separately for each sentiment category.

REFERENCES

[1] Andrew B. Goldberg, Nathanael Fillmore, David Andrzejewski, Zhiting Xu, Bryan Gibson and Xiaojin Zhu. "May All Your Wishes Come True: A Study of Wishes and How to Recognize Them." Annual Conference of the North American Chapter of the Association for Computational Linguistics - Human Language Technologies (NAACL HLT 2009).
[2] Minqing Hu and Bing Liu. "Mining and Summarizing Customer Reviews." Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2004), Aug 22-25, 2004, Seattle, Washington, USA.
[3] Bing Liu, Minqing Hu and Junsheng Cheng. "Opinion Observer: Analyzing and Comparing Opinions on the Web." Proceedings of the 14th International World Wide Web conference (WWW-2005), May 10-14, 2005, Chiba, Japan.
[4] Bradley, M., & Lang, P. (1999). Affective norms for English words (ANEW): Stimuli, instruction manual and affective ratings. Technical report C-1, Gainesville, FL: University of Florida.
[5] Pontiki Maria, Aggelou Zoe, Maltezou Sofia & Papageorgiou Haris (2013). Sentiment Analysis: Building Bilingual Lexical Resources. To be published in the Proceedings of the 13th International Conference on Greek Linguistics, September 26-29, 2013.
[6] Martin, J.R. and White, P.R.R. (2005). The Language of Evaluation: Appraisal in English. Palgrave Macmillan, London & New York.
