Sentiment Analysis on Social Media Text. Siddhartha Banerjee (sub253) Eric Obeysekare (ero5004) IST 557: Data Mining Project

Agenda
- What is sentiment analysis?
- Basic concepts
- Literature overview
  - General approaches
  - Approaches on social media datasets
- Summary & Discussions

Sentiment Analysis
- Identify and extract subjectivity in text
- Also known as opinion mining
- Textual information contains both facts and opinions
  - Fact: "I bought an iphone a few days ago."
  - Opinion: "It is such a nice phone."
- We concentrate only on opinions, but some sentences mix the two: "My iphone broke in just two days."

Sentiment Analysis
- What are the classes?
  - Generally: positive, negative and neutral
  - However, this is not always the case
- It might make sense to also capture how positive or how negative a text is
  - "My iphone broke in just two days." is partly fact, partly negative
- Why not use a scale (1-5)? This becomes multi-class classification.
  - Pang and Lee (2005) exploited knowledge from star ratings on websites
- How to create the dataset?
  - Inter-annotator agreement
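
The slide does not name an agreement measure. As one possibility, the sketch below computes Cohen's kappa for two annotators who labeled the same items; the labels are made-up illustration data, and kappa is an assumed choice rather than the measure used in any cited paper.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    n = len(labels_a)
    # Fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label distribution.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    all_labels = set(labels_a) | set(labels_b)
    expected = sum(counts_a[l] * counts_b[l] for l in all_labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical annotations of six tweets.
annotator_a = ["pos", "pos", "neg", "neg", "neu", "pos"]
annotator_b = ["pos", "neg", "neg", "neg", "neu", "pos"]
kappa = cohens_kappa(annotator_a, annotator_b)
print(round(kappa, 3))
```

A value near 1 indicates agreement well beyond chance; a value near 0 suggests the class definitions may need to be revisited before the dataset is used for training.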

Sentiment Analysis on Social Media
- Sentiment analysis has mostly been applied to customer reviews of products, movies, etc. (Turney, 2002)
- Limited work has been done on social media text
- Tweets are just 140 characters, yet users express opinions in them (Kouloumpis et al., 2011)
  - Opinions on movies, elections, other sensitive issues, etc.
- Real-time assessment of users' sentiment on specific topics could be helpful

Literature overview
- Several types of methods implemented on review datasets
  - Dictionary-based approaches
  - Supervised-learning-based approaches
  - Aspect-extraction-based approaches
- Approaches specifically on social media text

Dictionary-based approaches
- Paper: Lexicon-based methods for sentiment analysis (Taboada et al., 2011)
- Example: "The performances were all really fantastic."
- Find the polarities of individual words, if they are in a lexicon
- Lexicons can be manually compiled or generated automatically (Pang and Park, 2005)
- Dictionary-based approaches work well, but...
- The issue of domain specificity: positive words in one domain might be negative in another
  - "The hotel room is really huge."
  - "The USB stick is really huge."
- Build domain-specific lexicons!
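
A minimal sketch of lexicon lookup with a domain override, to make the "huge" example concrete. The word lists, polarity values and domain names are hypothetical, and the scoring is a plain sum; it omits refinements such as negation and intensifier handling that a full lexicon-based system would need.

```python
import re

# Hypothetical general-purpose lexicon: word -> polarity.
GENERAL_LEXICON = {"fantastic": 1.0, "nice": 1.0, "clear": 0.5, "broke": -1.0}

# Hypothetical domain-specific entries; "huge" flips polarity by domain.
DOMAIN_LEXICONS = {
    "hotel": {"huge": 1.0},
    "usb": {"huge": -1.0},
}

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def lexicon_score(text, domain=None):
    """Sum the polarities of all words found in the (domain-adjusted) lexicon."""
    lexicon = dict(GENERAL_LEXICON)
    if domain is not None:
        lexicon.update(DOMAIN_LEXICONS.get(domain, {}))
    return sum(lexicon.get(token, 0.0) for token in tokenize(text))

def label(score):
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(label(lexicon_score("The performances were all really fantastic.")))
print(label(lexicon_score("The hotel room is really huge.", domain="hotel")))
print(label(lexicon_score("The USB stick is really huge.", domain="usb")))
```

Without a domain, "huge" is absent from the general lexicon and both sentences score as neutral, which is the gap that domain-specific lexicons are meant to close.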

Supervised Learning based approaches
- Sentiment analysis is a text classification problem
- Generate a training set and a test set
- How to generate feature vectors from text data?
- Example: "John likes to watch movies. Mary likes movies too."
- Create a vocabulary: {"John": 1, "likes": 2, "to": 3, "watch": 4, "movies": 5, "also": 6, "football": 7, "games": 8, "Mary": 9, "too": 10}
- Represent the text using elements from the vocabulary:
  - [1, 2, 1, 1, 2, 0, 0, 0, 1, 1]
- The values are word frequencies; binary (presence/absence) values can be used too
- Single words are unigrams
- Two consecutive words are bigrams (John likes, likes to, to watch, ...)
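
The vector above can be reproduced with a few lines of code. This is a small sketch using only the standard library; the function names are illustrative, and the vocabulary order follows the slide.

```python
import re
from collections import Counter

# Vocabulary in the order given on the slide (positions 1 to 10).
VOCABULARY = ["John", "likes", "to", "watch", "movies",
              "also", "football", "games", "Mary", "too"]

def tokenize(text):
    return re.findall(r"[A-Za-z]+", text)

def bag_of_words(text, vocabulary, binary=False):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(tokenize(text))
    vector = [counts[word] for word in vocabulary]
    if binary:
        vector = [1 if count > 0 else 0 for count in vector]
    return vector

def bigrams(tokens):
    """Pairs of consecutive tokens."""
    return list(zip(tokens, tokens[1:]))

text = "John likes to watch movies. Mary likes movies too."
freq_vector = bag_of_words(text, VOCABULARY)
binary_vector = bag_of_words(text, VOCABULARY, binary=True)
first_sentence_bigrams = bigrams(tokenize("John likes to watch movies."))

print(freq_vector)    # [1, 2, 1, 1, 2, 0, 0, 0, 1, 1]
print(binary_vector)  # [1, 1, 1, 1, 1, 0, 0, 0, 1, 1]
print(first_sentence_bigrams[:3])
```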

Classification problem
- Thumbs up? Sentiment Classification using Machine Learning Techniques (Pang et al., 2002)
- Movie review dataset: 1400 reviews (balanced positive and negative)
- Reasonably good performance on this dataset with a simple set of features!
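
To illustrate classification with simple unigram features, here is a sketch of a multinomial Naive Bayes classifier with add-one smoothing. The four training sentences are invented toy data, not the movie review dataset, and the code is not a reproduction of the experimental setup in the paper.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

class NaiveBayes:
    """Multinomial Naive Bayes over unigram counts with add-one smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, lab in zip(docs, labels):
            self.word_counts[lab].update(tokenize(doc))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        best_label, best_logp = None, -math.inf
        for lab, n_docs in self.class_counts.items():
            logp = math.log(n_docs / total_docs)  # class prior
            total_words = sum(self.word_counts[lab].values())
            for token in tokenize(doc):
                if token not in self.vocab:
                    continue  # ignore words never seen in training
                count = self.word_counts[lab][token]
                logp += math.log((count + 1) / (total_words + len(self.vocab)))
            if logp > best_logp:
                best_label, best_logp = lab, logp
        return best_label

# Hypothetical toy training set.
train_docs = [
    "a wonderful and moving film",
    "great acting and a great story",
    "a boring and predictable plot",
    "terrible acting and a dull story",
]
train_labels = ["pos", "pos", "neg", "neg"]

model = NaiveBayes().fit(train_docs, train_labels)
print(model.predict("a great and moving story"))
print(model.predict("a dull and boring plot"))
```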

Aspect extraction
- Fine-grained sentiment analysis
- Not just overall sentiment, but sentiment focused on aspects
- "I bought an iphone a few days ago. It is such a nice phone. The touch screen is really cool. The voice quality is clear too."
- A frequency-based approach (Hu and Liu, 2004):
  - Nouns (NN) that are frequently talked about are likely to be true aspects (called frequent aspects)
  - Find the adjectives that modify such nouns
  - Nearest adjective rule
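
The frequency idea and the nearest adjective rule can be sketched on a handful of sentences. The sentences and their part-of-speech tags below are hand-written assumptions (a real pipeline would run a POS tagger), and raw noun counting stands in for whatever frequency mining the original method uses.

```python
from collections import Counter

# Hypothetical, hand-tagged review sentences: (word, Penn Treebank tag).
TAGGED_REVIEWS = [
    [("The", "DT"), ("screen", "NN"), ("is", "VBZ"), ("really", "RB"), ("cool", "JJ")],
    [("The", "DT"), ("voice", "NN"), ("is", "VBZ"), ("clear", "JJ"), ("too", "RB")],
    [("I", "PRP"), ("love", "VBP"), ("the", "DT"), ("bright", "JJ"), ("screen", "NN")],
    [("The", "DT"), ("voice", "NN"), ("sounds", "VBZ"), ("muffled", "JJ"), ("outdoors", "RB")],
    [("My", "PRP$"), ("friend", "NN"), ("has", "VBZ"), ("one", "CD")],
]

def frequent_aspects(reviews, min_count=2):
    """Nouns mentioned at least min_count times are treated as aspects."""
    counts = Counter(word.lower() for sentence in reviews
                     for word, tag in sentence if tag.startswith("NN"))
    return {word for word, count in counts.items() if count >= min_count}

def nearest_adjective(sentence, noun_index):
    """Return the adjective closest (in tokens) to the noun, if any."""
    best = None
    for i, (word, tag) in enumerate(sentence):
        if tag.startswith("JJ"):
            distance = abs(i - noun_index)
            if best is None or distance < best[0]:
                best = (distance, word.lower())
    return best[1] if best else None

def aspect_opinions(reviews, min_count=2):
    aspects = frequent_aspects(reviews, min_count)
    opinions = {}
    for sentence in reviews:
        for i, (word, tag) in enumerate(sentence):
            if tag.startswith("NN") and word.lower() in aspects:
                adjective = nearest_adjective(sentence, i)
                if adjective:
                    opinions.setdefault(word.lower(), []).append(adjective)
    return opinions

print(aspect_opinions(TAGGED_REVIEWS))
```

The opinion words collected per aspect could then be scored with a polarity lexicon to obtain aspect-level sentiment.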

Social Media

Predicting Political Opinion
- Using Twitter to predict election results (Tumasjan et al., 2010)
  - Counting tweets
  - Sentiment analysis: LIWC
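
"Counting tweets" can be read as comparing each party's share of mentions. The sketch below shows that computation under simple assumptions: the party names "alpha" and "beta" and the tweets are hypothetical, and a tweet counts once for every party it names.

```python
import re
from collections import Counter

def mention_shares(tweets, parties):
    """Share of party mentions per party; a tweet may mention several parties."""
    counts = Counter()
    for tweet in tweets:
        tokens = set(re.findall(r"[a-z]+", tweet.lower()))
        for party in parties:
            if party in tokens:
                counts[party] += 1
    total = sum(counts.values())
    return {party: (counts[party] / total if total else 0.0) for party in parties}

# Hypothetical tweets about two hypothetical parties.
tweets = [
    "alpha rally today",
    "voting alpha",
    "beta has my vote",
    "alpha vs beta debate",
    "nice weather",
]
shares = mention_shares(tweets, ["alpha", "beta"])
print(shares)
```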

An LIWC Example
LIWC Dimension               Jessie's emails  Jessie's site  Personal Texts  Formal Texts
Self-references (I, me, my)             4.53           2.87            11.4           4.2
Social words                           14.34           3.69             9.5           8.0
Positive emotions                       1.89           1.91             2.7           2.6
Negative emotions                       0.00           0.41             2.6           1.6
Overall cognitive words                 6.42           6.01             7.8           5.4
Articles (a, an, the)                   7.55           5.60             5.0           7.2
Big words (> 6 letters)                22.26          37.98            13.1          19.6
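
LIWC-style scores are percentages of the words in a text that fall into each category. The sketch below computes only the three dimensions whose definitions appear in the table itself (self-references, articles, big words); it does not use the actual LIWC dictionary, and the function name is illustrative.

```python
import re

# Category definitions taken from the table labels above.
SELF_REFERENCES = {"i", "me", "my"}
ARTICLES = {"a", "an", "the"}

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def liwc_style_profile(text):
    """Percentage of words in each category (0.0 for empty text)."""
    tokens = tokenize(text)
    total = len(tokens)

    def pct(count):
        return 100.0 * count / total if total else 0.0

    return {
        "self_references": pct(sum(t in SELF_REFERENCES for t in tokens)),
        "articles": pct(sum(t in ARTICLES for t in tokens)),
        "big_words": pct(sum(len(t) > 6 for t in tokens)),
    }

profile = liwc_style_profile(
    "I bought an iphone a few days ago. My iphone broke in just two days."
)
print(profile)
```

With only 15 words in the sample, a single word moves a category by almost 7 percentage points, so scores computed on individual short messages are noisy.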

Discussion
- What are some issues with the use of LIWC in the analysis of Twitter data?
- How could LIWC be used to analyze tweets?

Predicting Political Opinion (cont.)
- Comparing Twitter sentiment to political polls (O'Connor et al., 2010)
  - Sentiment analysis: OpinionFinder
    - Daily ratio of positive to negative words
  - Message retrieval: keywords, hashtags
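
A rough sketch of the daily ratio, assuming a message counts as positive (or negative) if it contains the topic keyword and at least one positive (or negative) word. The word lists, dates and messages are hypothetical placeholders, not the OpinionFinder lexicon or real data, and no smoothing over days is applied.

```python
import re
from collections import defaultdict

# Hypothetical stand-ins for a subjectivity lexicon.
POSITIVE = {"good", "great", "hope"}
NEGATIVE = {"bad", "worse", "fear"}

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def daily_sentiment_ratio(messages, keyword):
    """messages: iterable of (day, text). Returns day -> positive/negative ratio."""
    positive = defaultdict(int)
    negative = defaultdict(int)
    for day, text in messages:
        tokens = set(tokenize(text))
        if keyword not in tokens:
            continue  # message retrieval: keep only on-topic messages
        if tokens & POSITIVE:
            positive[day] += 1
        if tokens & NEGATIVE:
            negative[day] += 1
    days = sorted(set(positive) | set(negative))
    # The ratio is undefined (None) on days without negative messages.
    return {d: (positive[d] / negative[d] if negative[d] else None) for d in days}

# Hypothetical messages.
messages = [
    ("2010-01-01", "good #jobs report today"),
    ("2010-01-01", "jobs outlook looks bad"),
    ("2010-01-01", "great news on jobs"),
    ("2010-01-02", "fear of losing jobs"),
    ("2010-01-02", "good weather today"),
]
ratios = daily_sentiment_ratio(messages, "jobs")
print(ratios)
```

The resulting time series is what would be compared against poll numbers, typically after averaging over a window of days to reduce noise.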

Tweet selection
- Twitter APIs
  - Search: max 3,200 tweets, 180 searches/15 minutes
  - Streaming: real-time, returns a subset
  - Firehose: real-time, premium service
- Automatic Topic-focused Monitor (ATM) (Li et al., 2013)
  - Sample tweets for some keywords
  - Select most relevant keywords from that sample
  - Query Twitter with new keywords
  - Iterate
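
The sample, select, query, iterate loop can be sketched against an in-memory list of tweets. This is a deliberately simplified illustration: "most relevant" is approximated by raw co-occurrence frequency, the tweets and stopword list are hypothetical, and no Twitter API is called. It is not the keyword selection algorithm of the ATM paper.

```python
import re
from collections import Counter

STOPWORDS = {"the", "in", "is", "my", "a", "of"}

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def matching(tweets, keywords):
    """Stand-in for querying Twitter: tweets containing any keyword."""
    return [t for t in tweets if keywords & set(tokenize(t))]

def expand_keywords(tweets, seed_keywords, rounds=2, per_round=1):
    keywords = set(seed_keywords)
    for _ in range(rounds):
        sample = matching(tweets, keywords)           # sample tweets
        counts = Counter(
            token for tweet in sample for token in set(tokenize(tweet))
            if token not in keywords and token not in STOPWORDS
        )
        if not counts:
            break
        # Assumed relevance proxy: most frequent co-occurring terms,
        # ties broken alphabetically.
        ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
        keywords.update(word for word, _ in ranked[:per_round])  # new query terms
    return keywords

# Hypothetical tweet collection.
TWEETS = [
    "election debate tonight",
    "heated election debate",
    "candidates clashed in the debate",
    "debate candidates talk taxes",
    "candidates promise tax cuts",
    "my cat is cute",
]
expanded = expand_keywords(TWEETS, {"election"})
print(sorted(expanded))
print(len(matching(TWEETS, expanded)))
```

Starting from the single seed "election", which matches two tweets, the expanded keyword set retrieves five of the six tweets while still excluding the off-topic one.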

Predicting Political Opinion (cont.)
- Real-time debate performance (Diakopoulos & Shamma, 2010)
  - Sentiment determined manually: Mechanical Turk

Predicting Real-world Outcomes with Twitter
- Box office sales (Asur & Huberman, 2010)
  - Tweet rates
  - Sentiment analysis: supervised learning model
    - Training data: Mechanical Turk
    - LingPipe: computational linguistics package
    - DynamicLMClassifier

Predicting Real-world Outcomes with Twitter (cont.)
- Subjectivity measure
- Multiple samples

Predicting Real-world Outcomes with Twitter (cont.)
- Positive/negative ratio
- Multiple samples
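
A minimal sketch of the two measures, assuming subjectivity is the number of opinionated (positive plus negative) tweets divided by the number of neutral tweets, and the positive/negative ratio is positive tweets divided by negative tweets. The label counts are invented for illustration.

```python
from collections import Counter

def subjectivity(labels):
    """(positive + negative) / neutral; None when there are no neutral tweets."""
    counts = Counter(labels)
    if counts["neutral"] == 0:
        return None
    return (counts["positive"] + counts["negative"]) / counts["neutral"]

def pn_ratio(labels):
    """positive / negative; None when there are no negative tweets."""
    counts = Counter(labels)
    if counts["negative"] == 0:
        return None
    return counts["positive"] / counts["negative"]

# Hypothetical classifier output for twelve tweets about one movie.
labels = ["positive"] * 6 + ["negative"] * 2 + ["neutral"] * 4
print(subjectivity(labels))
print(pn_ratio(labels))
```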

Discussion
- What is the best method to use for sentiment analysis of tweets?
  - Dictionary-based?
  - Supervised learning model?
  - Something else?
- How can we select tweets?

Conclusion
- Twitter is a good approximation of real-world opinions
- Multiple approaches with different benefits and drawbacks
  - Dictionary-based: plug and play
  - Supervised learning: more customizable
- Tweet selection is hard!

Questions?

References
Turney, P. D. (2002). Thumbs up or thumbs down?: Semantic orientation applied to unsupervised classification of reviews. Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics.
Kouloumpis, E., Wilson, T., & Moore, J. (2011). Twitter sentiment analysis: The good the bad and the OMG! ICWSM, 11, 538-541.
Tumasjan, A., Sprenger, T., Sandner, P., & Welpe, I. (2010). Predicting elections with Twitter: What 140 characters reveal about political sentiment. ICWSM, 178-185. Retrieved from http://www.aaai.org/ocs/index.php/icwsm/icwsm10/paper/viewfile/1441/1852
Diakopoulos, N. A., & Shamma, D. A. (2010). Characterizing debate performance via aggregated Twitter sentiment. Proceedings of the 28th International Conference on Human Factors in Computing Systems - CHI '10, 1195-1198. doi:10.1145/1753326.1753504

References (cont.)
O'Connor, B., & Balasubramanyan, R. (2010). From tweets to polls: Linking text sentiment to public opinion time series. ICWSM, 11, 122-129. Retrieved from http://www.aaai.org/ocs/index.php/icwsm/icwsm10/paper/viewpdfinterstitial/1536/1842
Asur, S., & Huberman, B. (2010). Predicting the future with social media. Web Intelligence and Intelligent Agent Technology (WI-IAT), 2010 IEEE/WIC/ACM International Conference on, 1, 492-499. Retrieved from http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5616710
Li, R., Wang, S., & Chang, K. C. C. (2013). Towards social data platform: Automatic topic-focused monitor for Twitter stream. Proceedings of the VLDB Endowment, 6(14), 1966-1977.