Multiclass Sentiment Analysis on Movie Reviews


Shahzad Bhatti
Department of Industrial and Enterprise Systems Engineering
University of Illinois at Urbana-Champaign
Urbana, IL 61801
bhatti2@illinois.edu

1 Introduction

The Internet has changed our lives in more ways than we can imagine. It has fundamentally changed the way we conduct our daily lives, the way we interact with people, and the way we do business. We decide which products to buy based on online product reviews, pick a movie based on its reviews on IMDb.com, and choose a restaurant for a special dinner after reading reviews online. In fact, we can read online reviews of almost anything we want to buy, anywhere we want to visit, or any new gadget we want to enjoy. Product manufacturers are likewise interested in feedback about their products: how they can improve them, and which aspects fascinate consumers the most. Reviews are a useful tool for gauging consumer satisfaction with a product and can help manufacturers improve it. Manufacturers are typically interested in the general sentiment of reviewers toward a product. The goal of this project is to classify the sentiments of movie reviews into meaningful groups using machine learning.

1.1 Related Work

Learning sentiments from text using machine learning techniques was introduced by Pang et al. [1] in their seminal work on sentiment analysis. Several papers followed with different approaches to classifying movie reviews [2-4]. More recently, Maas et al. [5] improved upon the accuracy achieved by Pang et al. by learning word vectors that capture semantic information in a review. Many other articles have extended this work to learn the sentiment of reviews of other products and services, such as electronics and hotels. Most work on movie reviews classifies sentiment into only two categories, positive and negative, thus capturing only the polarity of a review. The goal of this project is to classify the sentiments of movie reviews in two ways: traditional binary classification and multiclass classification. In multiclass classification, the sentiment of a review is classified into four categories capturing whether the reviewer thinks the movie is poor, bad, good, or excellent. In this way, one can also gauge the degree of positivity or negativity of the sentiment expressed about a movie.

2 Data Set

The data set of movie reviews is taken from [5]. It consists of 25,000 labeled and polarized movie reviews. The examples have been labeled into four classes, poor, bad, good, and excellent, with 7384, 5116, 5505, and 6995 examples respectively. To perform binary classification, poor and bad examples are combined to form the negative label, and good and excellent examples are combined to form the positive label; in this case the data set has a 50-50 split of positive and negative reviews. In the whole collection, no more than 30 reviews are taken from any given movie, since reviews of the same movie tend to have correlated ratings. A review labeled poor has a score of 1-2 on IMDb, bad reviews have a score of 3-4, a review with a score of 7-8 is labeled good, and a score of 9-10 earns a review the excellent label.
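
For concreteness, the label construction can be sketched as follows. This is a minimal illustration of the scheme above; the function names are hypothetical, not from the project code.

```python
# Minimal sketch of the labeling scheme; `score` is the IMDb rating (1-10)
# attached to a review. Function names are hypothetical.

def four_class_label(score: int) -> str:
    # The collection contains only polarized reviews: scores 5-6 are absent.
    if score <= 2:
        return "poor"        # scores 1-2
    if score <= 4:
        return "bad"         # scores 3-4
    if score <= 8:
        return "good"        # scores 7-8
    return "excellent"       # scores 9-10

def binary_label(score: int) -> str:
    # Collapse poor/bad into negative and good/excellent into positive.
    return "negative" if score <= 4 else "positive"
```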

3 Preprocessing

In order to extract useful features, the raw text needs to be preprocessed. The text is tokenized using NLTK's regular-expression tokenizer, which extracts alphabetic sequences, exclamation marks, and question marks. Words ending in n't are also kept as tokens, because words such as can't, shouldn't, and ain't capture useful information about the sentiment of a review. Stop words, which do not provide any information related to the sentiment of a review, are removed from the data; the set of stop words is taken from NLTK. The text is also stemmed using the Porter2 stemmer, which is fast and strikes a balance between the original Porter stemmer, which is very gentle on tokens, and the Lancaster stemmer, which is known to be very aggressive.

4 Features

Feature selection is by far the most important task in machine learning. Features represent the useful information extracted from the examples and used to train the classifier, and the accuracy of any classifier depends heavily on how well the features capture the underlying learning task. The features used in this project are described below.

4.1 n-gram Features

Unigram features are selected based on the number of occurrences of a word in the whole data set. To classify the reviews, the 3000, 2000, and 1000 most frequent words are selected as unigram features. Similarly, bigram features consist of the 3000, 2000, and 1000 most frequent bigrams appearing in the whole text. n-grams with n equal to 3 and 4 are also considered.

4.2 Other Features

Elongated words are very important for sentiment analysis, since they express the extremity of sentiments. To normalize them consistently, elongated words are trimmed to a slightly longer version of the original word; for example, soooo and sooo are both trimmed to soo rather than so, to keep the expression of an elongated word distinct from the usual word. Cross features are also used in the experiments: unigram features that do not appear in any example with a different label. The idea is to capture label-specific features.
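
A minimal sketch of this preprocessing pipeline, assuming NLTK and its stopword corpus are available. The exact tokenizer pattern is not given in the text, so the one below is an approximation, and lowercasing is likewise an assumption.

```python
import re
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer  # "english" is the Porter2 stemmer

# Approximate pattern: alphabetic sequences, optionally ending in n't,
# plus exclamation and question marks as standalone tokens.
tokenizer = RegexpTokenizer(r"[a-zA-Z]+n't|[a-zA-Z]+|[!?]")
stop_words = set(stopwords.words("english"))
stemmer = SnowballStemmer("english")

def preprocess(text: str) -> list[str]:
    tokens = tokenizer.tokenize(text.lower())
    # Trim elongated words: 3+ repeats of a character become exactly 2,
    # so "soooo" -> "soo" (distinct from the plain word "so").
    tokens = [re.sub(r"(.)\1{2,}", r"\1\1", t) for t in tokens]
    # Remove stop words, but keep n't forms even when the stop list
    # contains them, since they carry sentiment information.
    return [stemmer.stem(t) for t in tokens
            if t.endswith("n't") or t not in stop_words]

print(preprocess("This movie wasn't good, it was soooo boring!"))
```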

5 Classifiers

Two classifiers are used for both multiclass and binary classification: Naive Bayes and the support vector machine. The scikit-learn [6] libraries for Python are used to implement the classifiers, because they are known to be faster than NLTK's implementations.

5.1 Naive Bayes

Naive Bayes methods are a set of simple probabilistic supervised learning algorithms based on applying Bayes' theorem with a naive but strong independence assumption between the features given the label: the value of any particular feature is unrelated to, and does not depend on, the value or presence of any other feature, given the label. Given a class variable $y$ and a dependent feature vector $X = (x_1, x_2, \ldots, x_n)$, Bayes' theorem states the following relationship:

\[
\Pr(y \mid x_1, x_2, \ldots, x_n) = \frac{\Pr(x_1, x_2, \ldots, x_n \mid y)\,\Pr(y)}{\Pr(x_1, x_2, \ldots, x_n)}.
\]

Using the naive assumption, which states that the value of any feature $x_i$ is independent of the values of all the other features given the label, i.e.

\[
\Pr(x_i \mid x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n, y) = \Pr(x_i \mid y)
\]

for all values of $i$, we can then simply write

\[
\Pr(y \mid x_1, x_2, \ldots, x_n) = \frac{\Pr(y) \prod_{i=1}^{n} \Pr(x_i \mid y)}{\Pr(x_1, x_2, \ldots, x_n)}.
\]

Given a feature vector $X = (x_1, x_2, \ldots, x_n)$ corresponding to an example, we want to find the most likely label $\hat{y}$:

\[
\hat{y} = \operatorname*{argmax}_{y} \frac{\Pr(y) \prod_{i=1}^{n} \Pr(x_i \mid y)}{\Pr(x_1, x_2, \ldots, x_n)}.
\]

Since the denominator does not depend on $y$, it does not affect the maximizer, so

\[
\hat{y} = \operatorname*{argmax}_{y} \Pr(y) \prod_{i=1}^{n} \Pr(x_i \mid y).
\]

We can estimate $\Pr(y)$ and $\Pr(x_i \mid y)$ from the training set using maximum a posteriori (MAP) estimation; $\Pr(y)$ for any label $y$ is the fraction of training examples with label $y$. Different Naive Bayes classifiers arise from assuming different probability distributions for $\Pr(x_i \mid y)$. In this project, a Bernoulli distribution is used, which models the presence or absence of a word in a document rather than its frequency, as Multinomial Naive Bayes does. Bernoulli Naive Bayes is preferred over Multinomial Naive Bayes here because movie reviews usually consist of around a dozen sentences, so critical words are not present in abundance, and their mere presence or absence is enough to categorize a review.

5.2 Support Vector Machines

Support vector machines (SVMs) are supervised learning models that classify data into two labels by learning a hyperplane from the training set that separates the two classes. The original SVM algorithm was invented by Vladimir N. Vapnik, and the current standard incarnation (soft margin) was proposed by Corinna Cortes and Vapnik in 1993 and published in 1995 [7]. Intuitively, a good separation is achieved by the learned hyperplane that has the largest distance to the nearest training data points of any class (the so-called functional margin), since in general the larger the margin, the lower the generalization error of the classifier. Given a set of training vectors $x_i \in \mathbb{R}^p$, $i = 1, \ldots, n$, in two classes, and a vector $y \in \mathbb{R}^n$ such that $y_i \in \{1, -1\}$, the SVM solves the following optimization problem:

\[
\min_{w, b, \xi} \; \frac{1}{2} w^T w + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad y_i (w^T x_i + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0, \;\; i = 1, \ldots, n.
\]

This is the primal optimization problem; its dual is

\[
\min_{\alpha} \; \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j - \sum_{i=1}^{n} \alpha_i
\quad \text{subject to} \quad y^T \alpha = 0, \;\; 0 \le \alpha_i \le C, \;\; i = 1, \ldots, n.
\]

Here $C > 0$ is the upper bound on the $\alpha_i$ in the dual problem and acts as a trade-off parameter between a small margin that overfits and a large margin that generalizes better; $C = 1$ is used in this project. In some applications the data is not linearly separable but can be separated by a hyperplane if the data is represented in a higher-dimensional space. This is done using kernels, but no kernels are used in this project.

An SVM is a binary classifier, but it can be used for multiclass classification by two methods: one-vs-one and one-vs-all. In one-vs-one, for $k$ classes, $k(k-1)/2$ binary classifiers are trained; each is trained on a pair of classes from the original training set and learns a hyperplane between them, and predictions are made via a voting scheme. In the one-vs-all strategy, a single classifier is trained per class by labeling the examples from that class as positive and all other examples as negative. This strategy requires the base classifiers to produce a real-valued confidence score for their decisions, rather than just a class label, and predictions are made using these scores. In this project the one-vs-one strategy is used: it generally performs better than one-vs-all, and in our four-class case it requires only two additional classifiers, each trained on a smaller set of examples.
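
For concreteness, here is a sketch of how the two classifiers can be instantiated with scikit-learn under these settings (Bernoulli event model; linear kernel with $C = 1$). The toy feature matrix is purely illustrative.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import SVC

# Toy binary feature matrix: rows are reviews, columns are presence/absence
# indicators (0/1, not counts) for the selected n-gram features.
X = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
y = np.array(["good", "bad", "good", "bad"])

# Bernoulli Naive Bayes models Pr(x_i | y) as a Bernoulli distribution.
nb = BernoulliNB().fit(X, y)

# Soft-margin SVM with a linear kernel (no kernel trick) and C = 1; for
# multiclass problems scikit-learn's SVC uses one-vs-one voting internally.
svm = SVC(kernel="linear", C=1.0).fit(X, y)

print(nb.predict(X), svm.predict(X))
```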

Model   Features            Binary Classification     Multiclass Classification
                            Naive Bayes    SVM        Naive Bayes    SVM
  1     Unigrams 3000           84.7       85.4           55.1       56.0
  2     Unigrams 2000           84.2       85.7           54.9       57.7
  3     Unigrams 1000           83.1       85.4           54.4       58.8
  4     M1 + Bigrams 3000       85.6       84.5           57.2       52.8
  5     M2 + Bigrams 2000       85.2       85.1           56.6       54.8
  6     M3 + Bigrams 1000       83.5       85.2           55.6       56.7
  7     Cross Features          86.2       85.7           56.9       59.2
  8     Model 3 + ME            84.7       86.7           55.8       59.5
  9     Model 6 + ME            85.2       86.3           57.1       58.0

Table 1: Average five-fold cross-validation accuracies, in percent ("ME" here appears to denote the marks and elongated-word features of Sections 6.4 and 6.5).

6 Analysis

Given a set of movie reviews, each labeled with its sentiment, a classifier is trained on 80% of the reviews using the features extracted from them. Both classifiers are trained with different feature sets, for example the top 1000 most frequent unigrams, or unigrams plus bigrams. For each feature setting, the reviews are shuffled and five-fold cross-validation is performed. The accuracy of a classifier is the percentage of labels it predicts correctly on the test data. The average accuracy over the five folds is given in Table 1 for each setting.
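
The evaluation protocol can be sketched as follows, assuming X is the binary feature matrix and y the label vector for the full review collection (illustrative names, not the project's actual code).

```python
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import SVC

# Shuffle the reviews, then evaluate each classifier with five-fold
# cross-validation; each fold trains on 80% of the data.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

for name, clf in [("Naive Bayes", BernoulliNB()),
                  ("SVM", SVC(kernel="linear", C=1.0))]:
    scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
    print(f"{name}: mean accuracy {100 * scores.mean():.1f}%")
```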

6.1 Unigrams

The top 3000, 2000, and 1000 most frequently occurring unigrams are used for both binary and multiclass classification. Before picking the unigrams, all stop words are removed from the data and the words are stemmed using the Porter2 stemmer. For both binary and multiclass classification, SVM performs better than Naive Bayes. The accuracy of Naive Bayes decreases as the number of features decreases, in both binary and multiclass classification. On the other hand, the accuracy of SVM increases as the number of features decreases when classifying the reviews into multiple classes; this may be an indication that SVM was overfitting when many features were used.

6.2 Unigrams and Bigrams

Often keywords alone are not enough to detect the sentiment in a piece of text; sentiments are instead expressed by combinations of words like "nothing new", "does little", or "not entertaining". So the unigrams are supplemented with bigrams: again, the 3000, 2000, and 1000 most frequently occurring bigrams are used along with the same number of most frequent unigrams. The results show that adding bigrams increases the accuracy of Naive Bayes in both binary and multiclass classification. However, the addition does not help SVM; in fact the accuracy of SVM decreases with the introduction of bigrams. This is again an indication that SVM tends to overfit as the number of features increases.

6.3 Cross Unigrams with n-grams

Sometimes the same words are present in reviews with both positive and negative sentiments. So only the unigrams found exclusively in reviews of a single label are used in these experiments. Along with these unigrams, the top 500 most frequent bigrams, top 400 most frequent trigrams, and top 200 most frequent 4-grams are also added. Only the unigrams are extracted as cross features; the rest of the features are selected from the whole data set. With this set of features, the accuracy of Naive Bayes improves for both binary and multiclass classification, while SVM's accuracy improves only in multiclass classification.

6.4 Unigrams, Marks and Elongated Words

Exclamation marks are used to convey strong feelings, so they can be very useful for detecting sentiment in text; similarly, question marks can help in detecting emotion. Elongated words are widely used in informal communication to express the intensity of an emotion and appear frequently in online reviews. Trimmed versions of elongated words, along with exclamation and question marks, are therefore also considered as features. In this setting, elongated words, marks, and the top 1000 unigram features are used to classify reviews into both binary and multiple classes. These features improve the accuracy of SVM for both binary and multiclass classification, and the accuracy of Naive Bayes also increases compared to using unigrams alone.

6.5 Unigrams, Bigrams, Marks and Elongated Words

The top 1000 bigrams are added to the setting discussed in the previous section. As seen earlier, Naive Bayes performs better with a combination of unigrams and bigrams; likewise in this setting, Naive Bayes improves when bigrams are added as features.
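
A sketch of the cross-feature extraction described in Sections 4.2 and 6.3, assuming preprocessed token lists and their labels (the helper name is hypothetical):

```python
from collections import defaultdict

def cross_unigrams(token_lists, labels):
    """Return the unigrams that occur under exactly one label.

    token_lists: preprocessed token lists, one per review.
    labels: the corresponding review labels.
    """
    labels_seen = defaultdict(set)  # word -> set of labels it occurs with
    for tokens, label in zip(token_lists, labels):
        for tok in set(tokens):
            labels_seen[tok].add(label)
    return {w for w, ls in labels_seen.items() if len(ls) == 1}
```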

7 Conclusions

In this project, machine learning techniques are used to detect the sentiments of movie reviews. Both the polarity of the sentiment, whether a review is positive or negative, and its intensity are considered. Traditionally, movie review classification focuses only on polarity and thus cannot distinguish between a good movie and a masterpiece. Both Naive Bayes and SVM prove to be good methods for learning sentiments from reviews, but SVM is better than Naive Bayes at achieving considerable accuracy with a very basic feature set.

Gaining further improvements with these supervised machine learning models is studied in natural language processing and requires substantial linguistic knowledge: more analysis of sentence structure, and an understanding of how words used as different parts of speech can change the sentiment of a text. Extracting features based on the semantic structure of the text should improve the accuracy of these classifiers. In the settings of this project, it is very difficult to detect sentiment when the text involves sarcasm or negation, for example "This movie is worth watching over and over and over... and over again." Incorporating positive and negative sentiment scores of unigram features using SentiWordNet [8] could prove very helpful in increasing accuracy; these scores range from 0 to 1 for each word and express its positivity and negativity. The feature generation process in this case takes on the order of hours, so it was not suitable for this project. Using only the positive and negative scores of the first and last few words might be enough for further improvement, since most of the time a reviewer expresses his sentiment about a movie at the beginning or end of the review, and in the middle merely explains why he holds a certain opinion.
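
A sketch of the SentiWordNet lookup suggested above, assuming NLTK's sentiwordnet and wordnet corpora are installed; averaging over all senses of a word is an illustrative choice, not a prescribed one:

```python
# Requires: nltk.download("sentiwordnet"); nltk.download("wordnet")
from nltk.corpus import sentiwordnet as swn

def word_polarity(word: str) -> tuple[float, float]:
    """Average positive/negative SentiWordNet scores over a word's senses."""
    synsets = list(swn.senti_synsets(word))
    if not synsets:
        return 0.0, 0.0
    pos = sum(s.pos_score() for s in synsets) / len(synsets)
    neg = sum(s.neg_score() for s in synsets) / len(synsets)
    return pos, neg

print(word_polarity("masterpiece"), word_polarity("boring"))
```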

References

[1] Pang, B., Lee, L. & Vaithyanathan, S. (2002) Thumbs up? Sentiment classification using machine learning techniques. Proceedings of EMNLP, 79-86.

[2] Zhuang, L., Jing, F. & Zhu, X. (2006) Movie review mining and summarization. Proceedings of the 15th ACM International Conference on Information and Knowledge Management, 43-50. Arlington, VA, USA.

[3] Kennedy, A. & Inkpen, D. (2006) Sentiment classification of movie reviews using contextual valence shifters. Computational Intelligence (22): 110-125.

[4] Thet, T. (2010) Aspect-based sentiment analysis of movie reviews on discussion boards. Journal of Information Science (6): 823-848.

[5] Maas, A.L., Daly, R.E., Pham, P.T., Huang, D., Ng, A.Y. & Potts, C. (2011) Learning word vectors for sentiment analysis. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, (1): 142-150. Stroudsburg, PA, USA.

[6] scikit-learn: Machine Learning in Python. http://scikit-learn.org/

[7] Cortes, C. & Vapnik, V. (1995) Support-vector networks. Machine Learning (20): 273-297.

[8] SentiWordNet 3.0. http://sentiwordnet.isti.cnr.it/