White Paper: Using Sentiment Analysis for Gaining Actionable Insights


Sentiment analysis is a growing business trend that allows companies to better understand their brand, products, and services by analyzing the attitudes, opinions, and emotions expressed by an online audience.

Author: Olena Domanska, Data Science Engineer, CoreValue

Using sentiment analysis for gaining actionable insights

Consumer opinions undoubtedly affect a company's reputation and should be of high interest to businesses, as they can prove to be extremely valuable assets. Actionable insights give businesses an advantage over their competition and help them maintain a competitive edge in the market. Today it is easy for consumers to loudly express their satisfaction and their frustration with a company or a product through social media, forums, blogs, and review platforms, which can greatly impact public opinion. Sentiment analysis allows businesses to analyze public opinion about a product or service in order to unlock the hidden value contained within. This information, when used correctly, enables them to make better informed business decisions.

The notion of sentiment analysis

Sentiment analysis (also known as opinion mining) refers to the use of natural language processing, text analysis, and computational linguistics to identify and extract subjective information found in source materials. By harnessing the power of sentiment analysis and wrangling all the opinion-related information it contains, businesses can extract tremendous value and use it to their advantage. This data mining requires significant effort, however, as it involves comparing products and services, defining subjectivity and probability, classifying emotional components, reasoning about opinions, and summarizing. In layman's terms, a sentiment analysis engine lurks in social platforms, processes tons of unrestricted data, and derives actionable insights that are directly related to business results.

Core techniques of sentiment analysis

There are two fundamental ways to approach sentiment analysis: supervised and unsupervised (or lexicon-based).

The lexicon-based approach rests on the assumption that the contextual sentiment orientation of a text can be calculated by summing up the sentiment scores of its separate words or phrases. Essentially, this technique relies on external lexical resources that map words to a categorical class (positive, negative, neutral) or to a numerical sentiment score. As a result, its effectiveness strongly depends on the quality and adequacy of the chosen resource. While the obvious advantage of the approach is avoiding the arduous step of labeling training data, one must also be aware of its limitations. For example, a word associated with a positive or negative sentiment may have the opposite orientation in a different application domain, and a sentence containing sentiment words may not express any sentiment at all (as in interrogative and conditional sentences). Sentences with a sarcastic tone often warp the polarity of sentiment words, and many sentences without sentiment words can still imply opinions.

Supervised techniques, on the other hand, work with the notion of training data. Specifically, training samples and the corresponding output values are fed into the algorithm before it is applied to the actual data set. This enables the algorithm to handle new, unknown data in the future and to provide more accurate sentiment classification in the specific domains for which it has been trained. The most common supervised learning methods are Naive Bayes classification and Support Vector Machines (SVM), although researchers apply many others as well, including maximum entropy, random forest, neural networks, and regression trees.
Recent work in the area suggests that supervised approaches tend to outperform unsupervised ones. But is this really true? In this article, we try to verify this assumption on real data.

Sentiment analysis in action

Theory aside, the real questions seem to be: How effective is sentiment analysis in practice? And which approach is more accurate, supervised or unsupervised? To figure this out, we decided to analyze all the available reviews of HubSpot, one of the popular marketing automation platforms, from a natural language processing perspective. The script for the following analysis can be found on GitHub.

To perform our analysis, we began by closely examining the data we collected in order to discover the most frequently used words, and we built associations between them to identify clusters and themes within the reviews. We then examined how review topics changed over time. Finally, we identified the sentiments of consumer opinions by applying unsupervised and supervised methods in turn, comparing how each performed on our real data.

Exploratory Phase

Here is a sample of the data we gathered:

"HubSpot is our main marketing platform. It's currently used to automate our marketing programs, including email marketing, landing pages, social media and our blog. We also use the tool to score leads, and automate our lead nurturing process. It's easy to measure the success of our programs through the reporting. HubSpot is great for automating workflow emails, creating new campaigns, landing page creation and compiling lists. They have a great training program for gearing up with HubSpot..."

Each review was scored by its reviewer on a scale from 1 to 10. As the distribution of these scores shows, it is strongly left-skewed, with a distinct peak at the highest value, so it is interesting to see how algorithms perform on such unbalanced real data.

To start off, we determined the most frequently used words by building a word cloud with the help of Wordle; the screenshot below illustrates the results.
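The word cloud itself was produced with the Wordle web tool, but the term frequencies that feed it can be computed directly in R with the tm and wordcloud packages. The following is only an illustrative sketch, not the code from the GitHub script; the data frame reviews and its text column are assumptions about how the scraped reviews are stored.

library(tm)         # text cleaning and document-term matrices
library(SnowballC)  # stemming support used by tm
library(wordcloud)  # local stand-in for the Wordle web tool

# Assumed input: a data frame `reviews` with one review per row in `reviews$text`
corpus <- Corpus(VectorSource(reviews$text))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stemDocument)
corpus <- tm_map(corpus, stripWhitespace)

# Total occurrences of every term across all reviews
dtm  <- DocumentTermMatrix(corpus)
freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)

# Word cloud of the 100 most frequent terms
wordcloud(names(freq), freq, max.words = 100, random.order = FALSE)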

After examining the word cloud, we concluded that people mainly discuss the features of HubSpot's inbound marketing platform and describe them with words such as easy, great, and amazing. Even though word clouds give us an understanding of which words are most popular in reviews, they do not tell us the numerical proportions of their occurrence frequencies. To do this, we built a so-called document-term matrix, which shows which terms each review contains and how often they appear. Summing the occurrences of each term across all reviews, we get the following histogram:
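In code, such a matrix and the term totals behind this histogram can be obtained roughly as follows; this sketch reuses the corpus built in the previous block, so the same assumptions apply.

# One row per review, one column per term; entries are raw term counts
dtm <- DocumentTermMatrix(corpus)

# Summing each column over all reviews gives the per-term totals
# (e.g. hubspot, market, use, tool) discussed in the text
term_totals <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
head(term_totals, 10)
barplot(head(term_totals, 20), las = 2, main = "Most frequent terms")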

The most frequently used words in the review texts are hubspot (2531 occurrences), market (1418), use (1393), and tool (690). But how are these words connected to each other? To answer this, we built the following graph, illustrating the associations between these words; the thicker the line connecting two words, the higher the probability of their co-occurrence in a review. We see that the word groups hubspot, lead, tool; hubspot, can, help; and email, content, page are usually present together. At this point, the content of the reviews starts to become clearer.
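Such an association graph can be approximated with tm's correlation utilities. A minimal sketch, reusing the dtm from above: the correlation thresholds are illustrative assumptions, and drawing the graph requires the Rgraphviz package.

# Terms whose per-review counts correlate with the most frequent words
findAssocs(dtm, terms = c("hubspot", "market", "use", "tool"), corlimit = 0.2)

# Optional co-occurrence graph of frequent terms, similar to the figure above
plot(dtm, terms = findFreqTerms(dtm, lowfreq = 200), corThreshold = 0.15)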

Next, we thought it would be interesting to find out which directions of discussion are hidden in the review texts.

Topics Identification

To identify topics, we grouped the reviews into clusters using hierarchical clustering: we first performed an agglomerative (bottom-up) clustering of the reviews and then constructed the tree, keeping only a few of the uppermost clusters. It was necessary to pick a threshold level to form the groups, so we went with the simplest and most popular solution, which is to inspect the dendrogram.

Hierarchical Clustering: 30 uppermost clusters of reviews

In our case, a threshold at the 0.4 level seemed to be a reasonable choice. It revealed 3 clusters, depicted by black, green, and red boxes containing 73, 12, and 15 percent of the reviews, respectively, meaning that 73 out of every 100 reviewers discussed the same topic. But what was the topic they discussed? We determined the topic of each cluster based on probabilistic modeling of term frequency. Below are a few of the most frequently used words for each topic:

Topic 1: "marketing, hubspot, tool, customer, lead, inbound, sale"
Topic 2: "hubspot, email, social, page, blog, content, website, manage"
Topic 3: "hubspot, time, can, make, help, need, get"

On the basis of these word lists, we inferred the theme of the reviews within each cluster. For instance, the first group of reviews concentrates on the idea that HubSpot is a leading marketing tool for increasing customer sales, while the second group is devoted to discussing HubSpot tools such as the social media and blog post publisher, the landing page creator, content management, and website visitor tracking.
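For reference, the clustering step could look roughly like the sketch below. The paper does not state the exact distance measure, so cosine distance is assumed here, and the sparsity and cut levels are illustrative.

library(proxy)   # supplies a cosine distance method for dist()

# Reviews as term-count vectors, with very rare terms dropped
dtm_small <- removeSparseTerms(dtm, sparse = 0.95)
m <- as.matrix(dtm_small)

d  <- dist(m, method = "cosine")          # pairwise distances between reviews
hc <- hclust(d, method = "ward.D2")       # agglomerative (bottom-up) clustering
plot(hc, labels = FALSE)                  # dendrogram used to pick the threshold

clusters <- cutree(hc, h = 0.4)           # cut at the 0.4 level, as in the text
round(100 * prop.table(table(clusters)))  # share of reviews per cluster

# Most frequent terms per cluster, as a rough label for its topic
for (k in sort(unique(clusters))) {
  counts <- colSums(m[clusters == k, , drop = FALSE])
  cat("Topic", k, ":", names(sort(counts, decreasing = TRUE))[1:7], "\n")
}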

Let's now look at the trending topics across the three-year time frame. According to the plot, there were just a few reviews of HubSpot until the middle of 2013, with topic 1 remaining prevalent for yet another year. The highest concentration of reviews occurs at the beginning of 2015, followed by a slowdown that still continues.

We then proceeded to the sentiment analysis of the reviews, applying unsupervised and supervised approaches and comparing their accuracy.

Sentiment Analysis

We began with the unsupervised (lexicon-based) approach, which estimates a review's sentiment by counting the occurrences of "positive" and "negative" words using Hu and Liu's opinion lexicon. It categorizes around 6,800 words as positive or negative and is available for download. Other useful resources for lexicon-based sentiment analysis include the MPQA Subjectivity Lexicon, SentiWordNet, and SenticNet. To assign a numeric score to each review, we simply subtract the number of negative words from the number of positive words that occur.

A new question arose: should we take the length of the review into account? Consider two reviews. The first is simply "fantastic" (one word long, so its sentiment score is equal to one), while the second is several pages long, expressing both positive and negative thoughts about the product, yet its total score is also equal to one. Obviously, we should rank these reviews differently. One way to take this peculiarity into account is to normalize the score by the length of the review. To settle the question, we calculated scores for the HubSpot reviews both with and without normalization and evaluated the sum of squared errors (SSE) in each case. As it turns out, normalization cut the SSE in half (see the details on GitHub).

Based on these arguments, we carried out our analysis in three steps: count the sentiment score for each review, normalize the score by the review length, and map the obtained scores to the interval [1, 10], rounding them, since every review has a rating assigned by its reviewer in the range of 1 to 10. As a result, we obtained a 10-class classification problem (NORMAL formulation). However, sentiment classification is usually formulated as a two-class problem, positive versus negative (BINARY formulation), where a review with a rating from 1 to 4 is considered negative and a review with a rating from 5 to 10 is considered positive. It is also possible to introduce a neutral class and consider a three-class problem, in which a review rated 1 to 3 is negative, 4 to 6 neutral, and 7 to 10 positive (BASIC formulation). We obtained the corresponding distributions of review classes for each of these formulations.

After comparing the distribution of review scores assigned by reviewers with the distributions of scores obtained with the lexicon-based approach, we concluded that the unsupervised (lexicon-based) technique performs well only in the case of binary classification, where its accuracy reaches 0.69; the normal and basic formulations achieve accuracies of only 0.0065 and 0.19, respectively. This reconfirms the findings of other researchers that the accuracy of classification algorithms depends significantly on the number of classes considered. Taking into account that, according to a series of experiments on Mechanical Turk, human annotators only agree 79% of the time, the algorithm gives a competitive result in the two-class case. From a quantitative perspective, it found that 68.8% of people were positive about the HubSpot product.
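A condensed sketch of the lexicon-based scoring described above is given here. The file names are those of Hu and Liu's opinion lexicon; the linear rescaling to [1, 10] and the column reviews$rating are assumptions, so the exact mapping in the original GitHub script may differ.

# Hu & Liu opinion lexicon: plain word lists, header lines start with ';'
pos_words <- scan("positive-words.txt", what = "character", comment.char = ";")
neg_words <- scan("negative-words.txt", what = "character", comment.char = ";")

# Sentiment score of one review: (#positive - #negative) words,
# optionally normalized by review length as discussed above
score_review <- function(text, normalize = TRUE) {
  words <- unlist(strsplit(gsub("[^a-z ]", " ", tolower(text)), "\\s+"))
  words <- words[nchar(words) > 0]
  s <- sum(words %in% pos_words) - sum(words %in% neg_words)
  if (normalize && length(words) > 0) s <- s / length(words)
  s
}

scores <- sapply(reviews$text, score_review)

# Map raw scores onto the reviewers' 1..10 scale (NORMAL formulation)
predicted <- round(1 + 9 * (scores - min(scores)) / (max(scores) - min(scores)))

# BINARY formulation: ratings 1-4 negative, 5-10 positive
pred_bin   <- ifelse(predicted >= 5, "positive", "negative")
actual_bin <- ifelse(reviews$rating >= 5, "positive", "negative")
mean(pred_bin == actual_bin)   # accuracy of the lexicon-based classifier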

Now let's consider the supervised approach. We start with a Naive Bayes classifier, which applies Bayes' theorem to predict the class of a given text using a number of previously classified samples of the same type. We divide our reviews into two groups, train and test, in order to evaluate the accuracy of the method on the test set. As with the lexicon-based approach, we consider the NORMAL, BASIC, and BINARY formulations. The accuracy of the method turns out to be 0.63, 0.99, and 1 for the normal, basic, and binary formulations, respectively. In other words, compared to the unsupervised method, one of the simplest supervised models, the Naive Bayes classifier, was able to reach up to 100% accuracy on our skewed data.

In the R landscape, the excellent RTextTools package was developed by Timothy P. Jurka and colleagues for automatic text classification. The package includes nine algorithms for ensemble classification and is designed to conduct supervised learning in fewer than 10 steps. For sentiment classification of the HubSpot reviews, we chose five of the available algorithms: support vector machine (SVM), maximum entropy (MAXENT), random forest (RF), classification/regression tree (TREE), and neural networks (NNET), and ran them for every formulation: NORMAL, BASIC, and BINARY. All of these methods showed almost 100% accuracy in the case of two or three classes and accuracies in the interval [0.59, 0.72] in the case of 10 classes. The comparison of the obtained results can be seen in the plot below; a rough sketch of the supervised pipeline follows.
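The sketch below shows one plausible way to wire this up for the BINARY formulation, combining e1071's Naive Bayes with the five RTextTools algorithms named above. The 80/20 split, the label encoding, and the reviews data frame are assumptions rather than the original script; RTextTools has at times been archived on CRAN, so it may need to be installed from the archive.

library(RTextTools)  # create_matrix, create_container, train_models, ...
library(e1071)       # naiveBayes

labels <- ifelse(reviews$rating >= 5, 2, 1)   # BINARY: 1 = negative, 2 = positive
n      <- length(labels)
split  <- round(0.8 * n)                      # assumed 80/20 train/test split

doc_matrix <- create_matrix(reviews$text, language = "english",
                            removeStopwords = TRUE, stemWords = TRUE)
container  <- create_container(doc_matrix, labels,
                               trainSize = 1:split, testSize = (split + 1):n,
                               virgin = FALSE)

# Naive Bayes baseline on the same document-term matrix
m  <- as.matrix(doc_matrix)
nb <- naiveBayes(m[1:split, ], as.factor(labels[1:split]))
nb_pred <- predict(nb, m[(split + 1):n, ])
mean(as.character(nb_pred) == as.character(labels[(split + 1):n]))  # NB accuracy

# The five RTextTools algorithms used in the paper
models    <- train_models(container, algorithms = c("SVM", "MAXENT", "RF", "TREE", "NNET"))
results   <- classify_models(container, models)
analytics <- create_analytics(container, results)
summary(analytics)                            # per-algorithm precision, recall, accuracy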

Conclusions

In this article, we provided a thorough comparison of unsupervised and supervised approaches to sentiment analysis, using HubSpot platform reviews as an example. Specifically, we used and evaluated the results of seven different models: the lexicon-based approach, the Naive Bayes classifier, support vector machines, maximum entropy, random forest, classification trees, and neural networks. Our examination shows that when the data is skewed, both the lexicon-based (unsupervised) and the machine learning (supervised) techniques perform very well in terms of accuracy in the case of binary classification. As expected, the machine learning methods outperformed the lexicon-based method.

Overall, sentiment analysis proves to be a relatively simple and effective tool for extracting valuable opinion-based information from source data. This creates the potential for further growth of sentiment analysis by expanding its usage into new areas, much to the benefit of the businesses that harness the power of this method.

About CoreValue

CoreValue, a software and technology services firm headquartered in New Jersey with development labs in Eastern Europe, provides mobility and traditional cloud-based CRM implementation services and mobile applications in the Pharmaceutical, Medical, Media, and Telecommunication verticals. Customers trust CoreValue to provide infrastructure services utilizing premier staff in Data Science, Data Management, Database Services, Quality Assurance, and traditional development.

CoreValue Services
18 Overlook Ave, Suite 9
Rochelle Park, NJ 07662
908-312-4070
info@corevalue.net