Hidden Topic Sentiment Model


Md Mustafizur Rahman, Hongning Wang
Department of Computer Science, University of Virginia, Charlottesville VA, USA
{mr4xb, hw5x}@virginia.edu

ABSTRACT

Various topic models have been developed for sentiment analysis tasks. But the simple topic-sentiment mixture assumption prohibits them from finding fine-grained dependency between topical aspects and sentiments. In this paper, we build a Hidden Topic Sentiment Model (HTSM) to explicitly capture topic coherence and sentiment consistency in an opinionated text document, so as to accurately extract latent aspects and corresponding sentiment polarities. In HTSM, 1) topic coherence is achieved by enforcing words in the same sentence to share the same topic assignment and modeling topic transitions between successive sentences; 2) sentiment consistency is imposed by constraining topic transitions via tracking sentiment changes; and 3) both topic transitions and sentiment transitions are guided by a parameterized logistic function based on the linguistic signals directly observable in a document. Extensive experiments on four categories of product reviews from both Amazon and NewEgg validate the effectiveness of the proposed model.

General Terms: Algorithms, Experimentation

Keywords: Topic modeling, aspect detection, sentiment analysis

1. INTRODUCTION

Topic models have become an important building block in sentiment analysis [18, 12, 11, 15, 31, 28]. They naturally decompose unstructured text content into topical aspects and sentiment polarities via generative modeling. The automatically identified topics and corresponding opinions provide a fine-grained understanding of opinionated text data and enable a wide range of important applications, including public opinion tracking in social media [18, 15, 12], automated recommendation in e-commerce [16], contrastive opinion summarization in political science [6], and many more.

One fundamental assumption in topic models is exchangeability, i.e., topics are infinitely exchangeable within a given document while the joint probability is invariant [3]. As a result, a common practice is to model a document as a mixture over a set of latent topics; and given the topic mixing proportion, the topic assignments over words in a document are considered independent of each other. This overly simplified assumption fails to capture the rich structure embedded in a text document: in reality, natural language text rarely consists of isolated, unrelated sentences, but rather of collocated, structured and coherent groups of sentences [10]. The existence of sentiment in an opinionated text document further complicates the topic and sentiment mixture. For example, most topic models for sentiment analysis assume the selection of topics is independent given sentiment labels over words [18, 15, 12]. However, it is very unlikely for a user to express contradictory sentiment, i.e., both positive and negative, about the same topical aspect in a document; and thus when sentiment switches, the topic should also change. Relaxing this independence assumption is expected to yield better models in terms of latent aspect identification and sentiment classification. Figure 1 illustrates this interdependency between topic assignments and sentiment polarities in a typical product review, which motivates our research in this paper.
By: Kindle Customer. Date: June 25, 2014.

"I own an ultrabook and I like it for a number of specific tasks. I especially like its portability (3 pounds with a small footprint) {portability,+} and the speed of its solid state drive {hard drive,+}. When it comes to looks you have to give it to the Inspiron {appearance,+}. It definitely has the sleek look of an ultrabook {appearance,+}. The combination of brushed aluminum with black trim, keys and bezel make for a very classy, corporate presence {appearance,+}. The fit and finish are first rate {appearance,+}. However, the sound sucks {sound,-}. I have owned 10 notebook and laptop computers over the past two decades and this Inspiron has the worst sound of any before it {sound,-}. It is weak, tinny and what low end it has is muddy and indistinct {sound,-}. While we've all come to expect pretty lousy sound from notebooks, this is subpar even considering those low standards {sound,-}."

Figure 1: A review of a laptop from Amazon (review RQ4YYC5BXD). Topical aspects and sentiment polarities are manually labeled in superscripts with different colors on each sentence.

Three important observations can be made from the sample review document annotated in Figure 1. First, topic assignments over words in a document are not a simple mixture; instead, words in close proximity tend to share the same topic, i.e., topic coherence. Second, sentiment polarities expressed toward the same topical aspect tend to be consistent, i.e., sentiment consistency. We should note that this observation does not contradict the fact that a user might have mixed judgments about an item within a review document, e.g., appreciating the appearance but disliking the sound quality of the ultrabook in our motivating example. Sentiment consistency suggests that a user tends to give the same opinion about a particular topical aspect, rather than expressing contradictory assessments over it. This adds another dimension of regularity to topic assignments over words in an opinionated text document: when sentiment switches, the topic assignment should also switch. Last but not least, there are clear linguistic cues indicating the transition of sentiment and topics between successive sentences. For example, conjunctions like "however" and "nonetheless" imply a switch of sentiment in the current sentence, while an increased overlap of content words suggests unaltered topic and sentiment assignments between two adjacent sentences.

Some solutions have been developed to realize topic coherence, i.e., assigning words in a sentence to the same topic [12] and modeling topic transitions among successive sentences [8, 29]. Linguistic cues, e.g., POS tagging [31] and metadata [19], have also been exploited to guide topic generation. But the exchangeability assumption is still being made when modeling the compound of topic and sentiment in a document [18, 15, 12]: topics are modeled as simple mixtures under sentiment labels. This renders erroneous posterior inference results that assign opposite sentiment labels to the same topical aspects in a document, and inevitably leads to suboptimal performance in downstream sentiment analysis tasks.

In this work, we propose to explicitly model topic coherence and sentiment consistency in an opinionated text document so that we can accurately extract latent aspects and corresponding sentiment polarities. Specifically, we introduce a hidden Markov model into topic modeling and name our solution the Hidden Topic Sentiment Model (HTSM). In HTSM, topics are modeled as a compound of latent aspects and sentiment polarities. Topic coherence is achieved by enforcing words in the same sentence to share the same topic assignment and modeling topic transitions between successive sentences [8]. Sentiment consistency is imposed by constraining topic transitions via tracking sentiment changes: once the sentiment assignment changes, a new topic has to be sampled for the current sentence. Both topic transitions and sentiment transitions are guided by a parameterized logistic function based on linguistic signals directly observable in a document, e.g., cosine similarity and POS tag overlap between adjacent sentences. A customized forward-backward algorithm is developed to perform efficient posterior inference for HTSM. The model configuration, including both the word distributions under topics and the topic/sentiment transitions, is learned in a fully unsupervised manner via expectation maximization. The formalization of HTSM is so flexible that partially annotated documents, e.g., user-provided pros and cons, can be easily incorporated for more accurate model estimation.

Extensive experiments are performed on four categories of product reviews crawled from both Amazon and NewEgg to validate the effectiveness of the proposed model. A set of state-of-the-art topic models for sentiment analysis are employed as baselines to compare the quality of learned topics, the accuracy of sentiment classification, and the utility of aspect-based contrastive summarization from our HTSM model. In summary, our contributions in this paper are as follows:

- We develop a unified topic model to explicitly capture topic coherence and sentiment consistency in opinionated text documents. It provides more accurate extraction of latent topics and sentiment polarities.
- Our flexible modeling assumption enables both unsupervised and semi-supervised estimation of model parameters.
- We performed extensive experimental comparisons on different data sets under various application scenarios. The promising performance confirms the value of modeling the dependence between sentiment and topic in sentiment analysis.

2. RELATED WORK

The wide coverage of topics and abundance of opinions in social media make it a gold mine for discovering public opinions on all sorts of topics [22]. Significant research effort has been devoted to building statistical topic models to mine user-generated opinion data. Following the notion proposed in Mimno and McCallum's work [19], we can categorize most existing topic models for sentiment analysis as upstream models and downstream models. Upstream models assume that in order to generate a word in a document, one needs to first decide the sentiment polarity of this word and then sample the topic assignment for this word accordingly. In contrast, downstream models assume the sentiment label is determined by the topic assignment in parallel to the text content. Our proposed solution falls into the category of upstream models.

One typical upstream model is the Topic-Sentiment Model (TSM) proposed in [18]. TSM is constructed based on the pLSA model [9]: in addition to assuming a corpus consists of a set of latent topics with neutral sentiment, TSM introduces two additional sentiment models, one for positive and one for negative sentiment. A new concept called theme is introduced in TSM for document modeling, and a theme is modeled as a compound of three components (neutral topic words, positive words and negative words) in each document. However, this kind of division cannot capture the interrelation between topic and sentiment, given that a document is still modeled as an unordered bag of words; and TSM also suffers from the same problems as pLSA, e.g., overfitting, and can hardly generalize to unseen documents.

Several follow-up works try to address the limitations of TSM from different perspectives. Based on the LDA model [3], Lin and He proposed a joint sentiment/topic model (JST) for sentiment analysis [15]. In JST, the combination of topic and sentiment is modeled as a Cartesian product between a set of topic models and sentiment models. Accordingly, each document exhibits distinct topic mixtures under different sentiment categories in JST. To improve topic coherence, Jo and Oh extended JST by enforcing words in a single sentence to share the same topic and sentiment label in their Aspect and Sentiment Unification Model (ASUM) [12]. Zhao et al. introduced the Maximum Entropy LDA model (MaxEnt-LDA) to control the sampling of words from a background topic, aspect-specific topics and opinion-specific topics in [31]. Both JST and ASUM strongly depend on sentiment seed words to differentiate sentiment categories. MaxEnt-LDA depends on a set of manually labeled training sentences with background, aspect and opinion words to estimate the maximum entropy model beforehand. Moreover, the simple sentiment-topic mixture assumption prevents all the aforementioned models from recognizing sentiment consistency, i.e., they may sample the same aspect assignment under different sentiment categories in a document.

Downstream models reverse the generation assumption between sentiment labels and topic assignments, and provide some flexibility in modeling sentiment, e.g., continuous opinion ratings can also be modeled [17, 28, 25]. However, downstream models usually assume the sentiment labels are observable, which thus limits their applications in sentiment analysis.
Another line of related work introduces Markov models into topic modeling. The Aspect-HMM model [2] combines pLSA with a hidden Markov model [23] to perform document segmentation over text streams. However, Aspect-HMM separately estimates topics on the training set and depends on heuristics to infer the transitional relations between topics. HMM-LDA [7] distinguishes short-range syntactic dependencies from long-range semantic dependencies among the words in each document. But in HMM-LDA, only the latent variables for the syntactic classes are treated as a locally dependent sequence, while latent topics are treated the same as in other topic models. The Hidden Topic Markov Model (HTMM) [8] is the model most similar to ours. HTMM captures topic coherence by assuming words in one sentence share the same topic assignment and modeling topic transitions between successive sentences. However, HTMM loosely models the transition between topics as a binary relation: keep the same assignment as the previous sentence, or draw a new one with a certain probability. It ignores sentiment consistency in a document: when sentiment switches, the topic assignments should also switch. Our HTSM constrains topic transitions via tracking sentiment changes; and linguistic cues directly observable from adjacent sentences are leveraged to guide topic and sentiment transitions.

3. METHODOLOGY

In this section, we describe the proposed Hidden Topic Sentiment Model and discuss how it captures topic coherence and sentiment consistency simultaneously within an opinionated text document. Efficient posterior inference is performed via a customized forward-backward algorithm, and an Expectation-Maximization algorithm is utilized to estimate the model parameters in both unsupervised and semi-supervised settings.

3.1 Definition of Terminologies

We first specify the notations and definitions of aspect, sentiment and topic used in this paper. Denote a set of review text documents about a particular type of entities, e.g., product reviews, as $D = \{d_1, d_2, \ldots, d_{|D|}\}$, where each document $d$ consists of $m_d$ sentences. We assume there is a shared set of aspects that attract reviewers' interest; they can be defined as follows:

Definition (Aspect) An aspect of a particular entity is characterized by a set of words, which present a semantically coherent theme of discussion. An aspect can be indexed by a discrete random variable taking values from $A = \{a_1, a_2, \ldots, a_{|A|}\}$. For example, words such as "price," "value," and "worth" describe the price aspect of a product.

Besides describing the aspects, users also express their personal attitudes toward those aspects in their review documents, e.g., favoring the price aspect or criticizing the customer service aspect in product reviews. The expressed attitude is defined as sentiment.

Definition (Sentiment) Sentiment represents a user's emotional feelings about a particular entity. It can be denoted by a discrete random variable taking values from $S = \{s_1, s_2, \ldots, s_{|S|}\}$, e.g., positive or negative. In text documents, sentiment can be determined from content words. For example, "love" and "wonderful" indicate positive sentiment, and "terrible" and "regret" indicate negative sentiment.

In this paper, topic is defined as a compound of latent aspect and sentiment polarity. For example, in tablet reviews, potential topics could include a positive aspect about battery life and a negative aspect about customer service. Formally, topic is defined as follows:

Definition (Topic) A topic is a compound of latent aspect and sentiment polarity in a given document collection. It can be represented as a discrete distribution over words in a fixed vocabulary. Words with high probabilities under a topic depict the corresponding aspect and sentiment.

Based on the above definitions, we strive to develop a probabilistic generative model to automatically identify topics, i.e., aspects and sentiments, from a collection of opinionated text documents. The model takes an unstructured text document as input and returns decomposed latent aspects and sentiment polarities as output.
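To make the terminology concrete, the compound nature of a topic can be captured in a tiny data structure. The following sketch is purely illustrative and uses names of our own choosing; the paper does not prescribe an implementation.

```python
from dataclasses import dataclass

import numpy as np

@dataclass(frozen=True)
class Topic:
    """A topic z_k = (a_k, s_k): a latent aspect paired with a sentiment polarity."""
    aspect: int      # index into the shared aspect set A
    sentiment: int   # 0 = negative, 1 = positive (binary polarities)

# Each topic also carries a multinomial word distribution beta_k over a fixed
# vocabulary V; the highest-probability words depict the aspect and sentiment.
vocab_size = 1400                                  # roughly the size reported in Section 4.2.2
rng = np.random.default_rng(0)
beta_k = rng.dirichlet(np.full(vocab_size, 0.01))  # beta_k ~ Dir(eta), a sparse draw
top_words = np.argsort(beta_k)[::-1][:5]           # indices of the topic's top-5 words
```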
In the following sections, we discuss the detailed model assumptions and specifications.

3.2 Hidden Topic Sentiment Model

From a linguistic analysis perspective, a document exhibits internal structure, where structural segments encapsulate semantic units that are closely related [13]. As a result, in the proposed Hidden Topic Sentiment Model (HTSM), we treat the sentence as the basic structural unit and assume all the words in a sentence share the same topic (as illustrated in our motivating example in Figure 1). Based on this, HTSM drops the simple mixture assumption employed in conventional topic models [3, 9], and explicitly models topic transitions in successive sentences via a first-order hidden Markov model.

Because in HTSM a topic is modeled as a compound of latent aspect and sentiment polarity, two factors control the transition of topics. First, once the sentiment labels switch between two consecutive sentences, a topic has to be generated for the subsequent sentence with a different aspect assignment. This enforces sentiment consistency. Second, when the sentiment labels stay intact, two adjacent sentences are assumed to be highly related: the subsequent sentence will inherit the topic assignment from the previous sentence, or select a distinct one from a document-specific topic mixture with a certain probability. This imposes topic coherence.

Formally, we assume there are $K$ topics embedded in a given collection of review documents. A topic indexed by $z_k$ has two components: $a_k$ indicates the aspect label and $s_k$ indicates the sentiment label, i.e., $z_k = (a_k, s_k)$. Topic $z_k$ is specified as a multinomial distribution over a fixed vocabulary $V$, i.e., $\{p(w \mid \beta_k)\}_{w \in V}$, where $\beta_k$ is the corresponding model parameter. To avoid overfitting, we impose a shared Dirichlet prior over $\beta_k$, i.e., $\beta_k \sim Dir(\eta)$. In this paper, to simplify our discussion, we only consider binary sentiment polarities in HTSM, i.e., $s_k \in \{0, 1\}$. But HTSM is flexible enough to model multi-class sentiment polarities, e.g., five-star rating scales [21].

In a given document $d$, the document-level topic proportion $\theta_d$ is assumed to be drawn from a shared Dirichlet distribution [3], i.e., $\theta_d \sim Dir(\alpha)$. Among the $m_d$ sentences in $d$, each sentence $t_i$ has $N_i$ words and is associated with a topic $z_i$, which is sequentially drawn from a document-specific Markov chain. Because the aspect label and sentiment polarity of sentences are unobservable, we introduce two latent variables $\tau$ and $\psi$ on each sentence to control the sampling of topics with respect to the topic coherence and sentiment consistency requirements. Specifically, $\tau_i$ and $\psi_i$ are binary random variables indicating whether there is a sentiment switch and an aspect change on sentence $t_i$ accordingly. Their combination determines the topic transition: 1) when $\tau_i = 0$ and $\psi_i = 0$, $t_i$ inherits the previous sentence's topic assignment; 2) when $\tau_i = 0$ and $\psi_i = 1$, a new topic $z_i$ is drawn from $\theta_d$, with the constraint that $s_i = s_{i-1}$ and $a_i \neq a_{i-1}$; 3) when $\tau_i = 1$ and $\psi_i = 1$, a new topic $z_i$ is sampled from $\theta_d$ with the constraint that $s_i \neq s_{i-1}$ and $a_i \neq a_{i-1}$. The combination of $\tau_i = 1$ and $\psi_i = 0$ is not allowed in HTSM, because the sentiment consistency constraint enforces an aspect change when sentiment is switched.
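The three admissible $(\tau, \psi)$ configurations translate directly into a constrained sampling rule. The following is a minimal Python sketch of that rule under the assumptions above; the function and variable names are ours, not the authors' implementation.

```python
import numpy as np

def sample_topic(prev_topic, tau, psi, topics, theta, rng):
    """Draw z_i given the switch indicators, following HTSM's transition rules.

    topics:     list of (aspect, sentiment) pairs, aligned with theta
    prev_topic: (a_{i-1}, s_{i-1}), or None for the first sentence (no constraint)
    theta:      document-level topic proportion, a length-K numpy array
    """
    if tau == 1 and psi == 0:
        raise ValueError("forbidden: a sentiment switch forces an aspect change")
    if tau == 0 and psi == 0:
        return prev_topic                      # inherit the previous sentence's topic

    a_prev, s_prev = prev_topic if prev_topic is not None else (None, None)
    if tau == 0:   # psi == 1: keep the sentiment, change the aspect
        ok = [k for k, (a, s) in enumerate(topics)
              if a != a_prev and (s_prev is None or s == s_prev)]
    else:          # tau == 1, psi == 1: change both sentiment and aspect
        ok = [k for k, (a, s) in enumerate(topics)
              if a != a_prev and (s_prev is None or s != s_prev)]

    p = theta[ok] / theta[ok].sum()            # renormalize theta over admissible topics
    return topics[rng.choice(ok, p=p)]
```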
To capitalize on the linguistic features directly observable in document content, e.g., overlapping sentence content indicating intact topic assignments, we use parameterized logistic functions to define the generation probabilities of $\tau$ and $\psi$ in each sentence. The aspect transition feature function $f_a(d, i)$ takes document $d$ and sentence $t_i$ as input, and outputs an $l$-dimensional feature vector describing aspect change. Accordingly, $f_s(d, i)$ generates a $p$-dimensional feature vector describing sentiment switch. Hence, we define

$$p(\tau_i = 1 \mid d, \sigma) = \frac{1}{1 + \exp\left(-\sigma^T f_s(d, i)\right)} \quad (1)$$

$$p(\psi_i = 1 \mid d, \epsilon) = \frac{1}{1 + \exp\left(-\epsilon^T f_a(d, i)\right)} \quad (2)$$

where $\sigma$ and $\epsilon$ are the corresponding feature weights for modeling sentiment switch and aspect change. The detailed specifications of $f_a(d, i)$ and $f_s(d, i)$ and the feature weight estimation procedures will be discussed in Section 3.4.

Putting the above assumptions together, the generative process of a document postulated in HTSM can be described as follows:

1. For every topic $z$, draw $\beta_z \sim Dir(\eta)$.
2. For each review document $d \in D$:
   (a) Draw topic proportion $\theta_d \sim Dir(\alpha)$.
   (b) For each sentence $t_i$, $i = 1, 2, \ldots, m_d$:
      i. Sample $\tau_i \sim p(\tau_i \mid d, \sigma)$; set $\tau_i = 1$ when $i = 1$.
      ii. Sample $\psi_i \sim p(\psi_i \mid d, \epsilon)$; set $\psi_i = 1$ when $\tau_i = 1$.
      iii. Sample $z_i$ by:
         - $z_i = z_{i-1}$ if $\tau_i = 0, \psi_i = 0$;
         - $z_i \sim Mult(\theta_d)$ s.t. $a_i \neq a_{i-1}, s_i = s_{i-1}$ if $\tau_i = 0, \psi_i = 1$;
         - $z_i \sim Mult(\theta_d)$ s.t. $a_i \neq a_{i-1}, s_i \neq s_{i-1}$ if $\tau_i = 1, \psi_i = 1$.
      iv. Sample each word $w_n$ in $t_i$: $w_n \sim Mult(\beta_{z_i})$.

To make the above generative process consistent at every sentence in a document, we define $a_0 = \emptyset$ and $s_0 = \emptyset$, so that there is no constraint when sampling the topic for the first sentence in a document. Using the language of graphical models, this generative process can be visualized in Figure 2.

[Figure 2: Graphical model representation of the Hidden Topic Sentiment Model. Dark and light circles represent observable and latent random variables, and plates denote repetitions. Solid arrows encode dependency relations and dashed arrows denote the generation of transition features.]

Conditioned on the model parameters $(\alpha, \beta, \epsilon, \sigma)$, the joint probability of sentences and latent topics in document $d$ is thus given by

$$p(z, \theta_d, \psi, \tau, w_1, \ldots, w_{N_i} \mid \alpha, \beta, \epsilon, \sigma) = p(\theta_d \mid \alpha) \prod_{i=1}^{m_d} \Big[ p(\tau_i \mid d, \sigma)\, p(\psi_i \mid d, \epsilon)\, p(z_i \mid z_{i-1}, \tau_i, \psi_i, \theta_d) \prod_{n=1}^{N_i} p(w_n \mid \beta_{z_i}) \Big] \quad (3)$$

The above joint distribution differentiates HTSM from conventional topic models for sentiment analysis, which are built on simple topic mixture assumptions. Due to the sequential generation of topic assignments for sentences from a Markov chain, HTSM is no longer invariant to permutations of words or sentences in a document. Documents in which successive sentences share coherent topics are more likely than any random shuffling of the same sentences. This leads to linearly coherent topic inference in a document: successive sentences tend to share similar topics, rather than fluctuating assignments. More importantly, sentiment consistency is especially emphasized in HTSM: at every sentence of a document, one needs to first determine whether to keep the sentiment polarity of the previous sentence; if not, a new topic with a different aspect label and sentiment polarity needs to be sampled. This avoids assigning contradictory sentiment polarities to the same aspect in a document. To the best of our knowledge, no existing topic models achieve such regularity over topic assignments.
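To make the generative story concrete, here is a minimal simulation sketch of step 2, reusing the `sample_topic` helper sketched earlier and treating the feature functions as black boxes; all names are illustrative rather than the authors' code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def generate_document(m, topics, beta, alpha, sigma, eps, features_s, features_a,
                      sent_lens, rng):
    """Generate m sentences: switch indicators, topics, then words (Eqs 1-3)."""
    theta = rng.dirichlet(alpha)               # step 2(a): topic proportion
    z_prev, sentences = None, []
    for i in range(m):
        # steps 2(b)i-ii: tau is forced to 1 for the first sentence, psi to 1 when tau = 1
        tau = True if i == 0 else bool(rng.random() < sigmoid(sigma @ features_s(i)))
        psi = True if tau else bool(rng.random() < sigmoid(eps @ features_a(i)))
        # step 2(b)iii: constrained topic draw
        z = sample_topic(z_prev, int(tau), int(psi), topics, theta, rng)
        # step 2(b)iv: emit the sentence's words from the topic's word distribution
        k = topics.index(z)
        words = rng.choice(len(beta[k]), size=sent_lens[i], p=beta[k])
        sentences.append((z, words))
        z_prev = z
    return sentences
```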
3.3 Posterior Inference

The latent variables of interest in HTSM are the sentence-level topic assignments $z$ and the document-level topic proportion $\theta_d$. The aspect switch indicators $\psi$ and sentiment switch indicators $\tau$ can be easily decoded from the topic assignment sequence $z$. However, due to the coupling between the continuous random variable $\theta_d$ and the discrete random variables $z$, exact inference in HTSM is computationally infeasible. In this paper, we develop a coordinate-ascent-based solution to perform approximate posterior inference. In a given document, $\theta_d$ can first be randomly initialized from its prior distribution $Dir(\alpha)$. With known $\theta_d$, exact inference for $(z, \psi, \tau)$ can be efficiently performed via the forward-backward algorithm [23].

Because of the special design of our Markov chain, the generic forward-backward algorithm can be customized to greatly reduce its computational complexity in HTSM. In particular, we treat the combination of $(z_i, \psi_i, \tau_i)$ at sentence $t_i$ as the latent state in our Markov chain for document $d$, and derive the corresponding transition function as

$$p(z_i, \psi_i, \tau_i \mid z_{i-1}, \theta_d, d, \epsilon, \sigma) = p(z_i \mid z_{i-1}, \theta_d, \psi_i, \tau_i)\, p(\psi_i \mid d, \epsilon)\, p(\tau_i \mid d, \sigma) \quad (4)$$

in which $p(\psi_i \mid d, \epsilon)$ and $p(\tau_i \mid d, \sigma)$ can be pre-computed beforehand, since they are invariant during inference. And the first term on the right-hand side of Eq (4) has a simple linear structure:

$$p(z_i \mid z_{i-1}, \theta_d, \psi_i, \tau_i) = \begin{cases} 1 & \text{if } \tau_i = 0, \psi_i = 0, z_i = z_{i-1} \\ \theta_{z_i} \text{ s.t. } a_i \neq a_{i-1}, s_i = s_{i-1} & \text{if } \tau_i = 0, \psi_i = 1 \\ \theta_{z_i} \text{ s.t. } a_i \neq a_{i-1}, s_i \neq s_{i-1} & \text{if } \tau_i = 1, \psi_i = 1 \\ 0 & \text{otherwise} \end{cases} \quad (5)$$

This enables us to maintain a blockwise transition matrix and reduce the quadratic computational complexity of standard forward and backward computations to linear in HTSM. After one round of forward-backward computation, the posterior of $\theta_d$ can be computed from the expected frequency of words assigned to a topic that is drawn from the document-specific topic proportion, rather than inherited from a previous sentence. More specifically,

$$\theta_{d,z} \propto \sum_{i=1}^{m_d} \sum_{n=1}^{N_i} p(z_i = z, \psi_i = 1 \mid d) + \alpha_z - 1 \quad (6)$$

The inference of $\theta_d$ and $(z, \psi, \tau)$ can be alternately performed in a given document. It can be proved that this coordinate ascent method converges to a local maximum of the data likelihood function in $d$, because the forward-backward algorithm gives us the exact posterior of $(z, \psi, \tau)$ (refer to the EM algorithm proof [5]).
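The blockwise structure of Eq (5) is what makes the customized forward pass linear rather than quadratic: for fixed $(\tau_i, \psi_i)$, the probability of moving into $z_i$ depends on $z_{i-1}$ only through the aspect and sentiment constraints. A minimal sketch of this transition term, with illustrative names and topics indexed as in the previous sketches:

```python
def transition_prob(z, z_prev, tau, psi, theta, topics):
    """p(z_i | z_{i-1}, theta, psi_i, tau_i) as specified in Eq (5).

    z, z_prev: integer topic indices; topics[k] = (aspect, sentiment)
    """
    a, s = topics[z]
    a_prev, s_prev = topics[z_prev]
    if tau == 0 and psi == 0:
        return 1.0 if z == z_prev else 0.0     # deterministic inheritance
    if tau == 0 and psi == 1:                  # same sentiment, new aspect
        return theta[z] if (a != a_prev and s == s_prev) else 0.0
    if tau == 1 and psi == 1:                  # new sentiment, new aspect
        return theta[z] if (a != a_prev and s != s_prev) else 0.0
    return 0.0                                 # (tau = 1, psi = 0) is disallowed
```

Because every nonzero entry is either an indicator or a masked copy of $\theta_d$, the forward messages can be updated with a few cumulative sums per aspect and sentiment instead of a full $K \times K$ matrix product.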

3.4 Parameter Estimation

Motivated by the insights gained from the annotated example shown in Figure 1, in HTSM we leverage content features directly observable in documents to define the probabilities of aspect change and sentiment switch. In order to differentiate aspect-driven transitions from sentiment-driven transitions, two sets of transition features are constructed. The aspect transition features $f_a(d, i)$ include: 1) content-based cosine similarity between $t_i$ and $t_{i-1}$; 2) sentence length ratio between $t_i$ and $t_{i-1}$; 3) relative position of $t_i$ in $d$, i.e., $i/m_d$; and 4) an indicator of whether $t_i$ is more similar to $t_{i-1}$ or $t_{i+1}$. The sentiment transition features $f_s(d, i)$ include: 1) content-based cosine similarity between $t_i$ and $t_{i-1}$; 2) SentiWordNet [1] score difference between $t_i$ and $t_{i-1}$; 3) sentiment word count difference between $t_i$ and $t_{i-1}$; 4) Jaccard coefficient between POS tags in $t_i$ and $t_{i-1}$; and 5) adversative conjunction count in $t_i$. We also add bias terms in $f_a(d, i)$ and $f_s(d, i)$ to capture unconditioned aspect and sentiment transitions in documents. Detailed descriptions of these transition features can be found in Table 3.

The feature weights $\epsilon$ and $\sigma$ in the transition functions defined in Eq (1) and Eq (2) can be efficiently estimated together with the other model parameters in HTSM by an EM algorithm. In this work, we treat $\alpha$ and $\eta$ as hyper-parameters of the model and manually tune their settings, given that they have considerably less influence on model fitting [24] compared to the other parameters, i.e., $(\beta, \epsilon, \sigma)$. We should note that optimizing $\alpha$ and $\eta$ with respect to the data likelihood [3] is also feasible in HTSM.

The EM algorithm iterates between an E-step (for posterior inference) and an M-step (for expectation maximization). In the E-step at iteration $T$, the approximate inference procedure developed in Section 3.3 is executed on each document with the current model parameters $(\beta^T, \epsilon^T, \sigma^T)$. The following sufficient statistics are collected over documents after inference:

$$E[c(z, w, d)] = \sum_{i=1}^{m_d} \sum_{n=1}^{N_i} \delta(w_n = w)\, p(z_i = z \mid d) \quad (7)$$

$$E[\psi_i] = p(\psi_i = 1 \mid d), \ \text{s.t.}\ i > 1 \quad (8)$$

$$E[\tau_i] = p(\tau_i = 1 \mid d), \ \text{s.t.}\ i > 1 \quad (9)$$

In the M-step, the maximum likelihood estimator is used to compute $(\beta^{T+1}, \epsilon^{T+1}, \sigma^{T+1})$ as follows:

$$\beta^{T+1}_{z,w} \propto \sum_{d \in D} E[c(z, w, d)] + \eta_w - 1 \quad (10)$$

$$\epsilon^{T+1} = \arg\max_\epsilon \sum_{d \in D} \sum_{i=1}^{m_d} E[\psi_i] \log p(\psi_i = 1 \mid d, \epsilon) \quad (11)$$

$$\sigma^{T+1} = \arg\max_\sigma \sum_{d \in D} \sum_{i=1}^{m_d} E[\tau_i] \log p(\tau_i = 1 \mid d, \sigma) \quad (12)$$

where the optimization of $\epsilon$ and $\sigma$ can be effectively solved via a gradient-based optimizer. The E-step and M-step are alternately executed until the data likelihood over the whole collection $D$ converges.

In some review data sets, external signals about sentiment polarities are directly available. For example, some reviewers explicitly organize their reviews into pros and cons sections (e.g., Amazon review R12HYQYZX5TNT9); and on NewEgg (www.newegg.com), reviewers are required to do so. Such signals can be easily incorporated into HTSM to refine model estimation. In documents with identified pros/cons sections, sentences in the pros section are considered as having sentiment label $s = 1$, and sentences in the cons section have $s = 0$. During posterior inference, the sentiment switch indicator $\tau$ can then be directly computed from the sentiment labels in such documents, while all the other inference steps stay the same. Hence, model parameter estimation in the M-step is affected by such direct observations. As a result, HTSM effectively exploits such side information in document content and estimates the model parameters in a semi-supervised manner. In our quantitative evaluation, such semi-supervised model training greatly improves HTSM's sentiment classification performance.
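As a concrete illustration of the M-step, the sketch below fits the switch weights of Eq (12) by gradient-based optimization and normalizes the expected counts of Eq (10). It assumes the usual expected Bernoulli log-likelihood as the objective and uses scipy's L-BFGS-B as a stand-in for the unspecified gradient-based optimizer; all names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def m_step_switch_weights(F, E_switch, w0):
    """Fit the sentiment-switch weights sigma (Eq 12); the aspect-change weights
    epsilon (Eq 11) are fit identically from f_a features and E[psi].

    F:        (n, p) matrix, one row per sentence with i > 1, holding f_s(d, i)
    E_switch: (n,) posterior switch probabilities E[tau_i] from the E-step
    """
    def neg_expected_ll(w):
        logits = F @ w
        # expected Bernoulli log-likelihood under the E-step posteriors:
        # log p(tau=1) = -log(1 + exp(-logit)), log p(tau=0) = -log(1 + exp(logit))
        ll = -E_switch * np.logaddexp(0.0, -logits) \
             - (1 - E_switch) * np.logaddexp(0.0, logits)
        return -ll.sum()

    return minimize(neg_expected_ll, w0, method="L-BFGS-B").x

def m_step_beta(expected_counts, eta):
    """Eq (10): smoothed normalization of expected word counts per topic.
    expected_counts: (K, V) matrix of E[c(z, w, d)] summed over documents;
    assumes eta >= 1 so the MAP estimate stays non-negative."""
    num = expected_counts + eta - 1.0
    return num / num.sum(axis=1, keepdims=True)
```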
4. EXPERIMENT

In this section, we perform experimental evaluations of the proposed HTSM model from both quantitative and qualitative perspectives. We compare HTSM with several state-of-the-art topic models for sentiment analysis on four different collections of product reviews from both Amazon and NewEgg.

4.1 Data Sets & Preprocessing

We collected four categories of product reviews, i.e., i) camera, ii) tablet, iii) tv and iv) phone, from Amazon (http://www.amazon.com) and NewEgg (http://www.newegg.com). The reviews from NewEgg are segmented into pros and cons sections by their original authors, since this is required by the website. The complete data set can be found at www.cs.virginia.edu/~hw5x/dataset.html.

Standard pre-processing is performed before the subsequent experiments. First, punctuation, numbers and other non-alphabetic characters are removed. Stopwords are also removed based on a standard stopword list [14]. Second, all words are converted to lower case and stemming is performed on the remaining words in a document using Porter's stemmer [30]. Finally, all reviews with fewer than five words are removed. Besides, since we are modeling topic transitions between successive sentences, reviews containing fewer than two sentences are also removed. Table 1 summarizes the resulting review data sets.

Table 1: Statistics of evaluation data sets.

  Data set | Amazon | NewEgg | Vocabulary size | Positive ratio
  camera   |   …    |   …    |        …        |       …
  tv       |   …    |   …    |        …        |       …
  tablet   |   …    |   …    |        …        |       …
  phone    |   …    |   …    |        …        |       …

For comparison purposes, we include Latent Dirichlet Allocation (LDA) [3], the Hidden Topic Markov Model (HTMM) [8], the Aspect and Sentiment Unification Model (ASUM) [12], and the Joint Sentiment/Topic model (JST) [15] as baselines. Among these baseline models, ASUM and JST are specialized for sentiment analysis, and HTMM and ASUM explicitly model sentences in a document. As unsupervised topic models, both ASUM and JST require sentiment seed words as input. Following the settings in their original papers, two sets of sentiment seed words are used in our experiments. The first, from Turney's PARADIGM [26], contains seven positive words and seven negative words; the second, PARADIGM+, contains all of Turney's paradigm words plus other sentiment words. To conduct a fair comparison, we also include those sentiment seed words in our HTSM model, i.e., adding positive seed words to topics with sentiment label $s = 1$ and negative words to topics with sentiment label $s = 0$ as priors.

We should note that, unless otherwise specified, we used 26 topics for camera and phone, 30 topics for tablet and 16 topics for tv for all the models. In addition, we fixed the hyper-parameters $\alpha$ and $\eta$ of the Dirichlet priors to 1.01 and … respectively for all the topic models.

[Figure 3: Perplexity with increasing training size on four different review document sets. Panels: camera, tablet, phone and tv; curves: LDA, HTMM, ASUM, JST and HTSM.]

4.2 Topic Modeling Evaluation

We first compare the quality of the topics learned by all the topic models. Perplexity and word intrusion experiments are performed to quantitatively evaluate this aspect, and we also demonstrate the topic transition diagram learned by HTSM.

4.2.1 Perplexity comparisons

Perplexity, used by convention in language modeling, is monotonically decreasing with respect to the likelihood of test data, and is algebraically equivalent to the inverse of the geometric mean of the per-word likelihood. A lower perplexity indicates better generalization performance. More specifically, the perplexity of a test document set $D_{test}$ can be computed as:

$$\text{perplexity}(D_{test}) = \exp\left\{ -\frac{\sum_{d=1}^{M} \log p(w_d)}{\sum_{d=1}^{M} N_d} \right\} \quad (13)$$

where $M$ is the total number of documents in the test corpus and $N_d$ is the total number of words in a test document $d \in D_{test}$.

We trained all the topic models (HTSM, HTMM, LDA, JST and ASUM) on the described corpora to compare their generalization performance in modeling text documents, measured by perplexity on a held-out test set. Since our goal is to evaluate density estimation quality, all documents in the corpora are treated as unlabeled (e.g., ignoring the pros/cons segmentation in NewEgg reviews). The detailed experiment setup for the perplexity comparison is as follows: we start with a training set containing only the reviews from NewEgg, referred to as the origin in the plots of Figure 3, and gradually add more training reviews from Amazon (training size 1000, 2000, etc.). This experiment setting makes the results aligned with the later sentiment classification experiments. Figure 3 demonstrates the average perplexity from five-fold cross validation (test sets are selected from both Amazon and NewEgg reviews accordingly).

It is clear from Figure 3 that HTSM outperformed all the other topic models on all four data sets, except HTMM. There are two possible explanations. First, HTMM models topic transitions loosely as a Bernoulli distribution: keep the same assignment as the previous sentence, or draw a new topic with a certain probability. HTSM instead models this topical transition with a more complicated logistic function, and this parametric model might cause some overfitting. Second, HTMM does not consider sentiment in a document, i.e., it places fewer constraints on modeling a document. But in HTSM, once the sentiment label switches, a different topic has to be sampled for the subsequent sentence. As a result, HTMM has more freedom in allocating words under one topic, which results in a lower perplexity on unseen documents. We should note that the perplexity metric only measures the quality of the estimated word distributions on unseen documents. It cannot assess sentiment prediction quality, which HTMM is unable to provide. In later experiments we found that the increased complexity of HTSM benefits sentiment classification greatly. Finally, we find that the simple sentiment-topic mixture assumptions in both JST and ASUM fail to capture the topic-word distribution in the test set and lead to much worse perplexity than HTSM.
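Eq (13) translates directly into code. A minimal sketch, assuming the per-document log-likelihoods $\log p(w_d)$ have already been produced by a model's inference routine:

```python
import numpy as np

def perplexity(log_likelihoods, doc_lengths):
    """Eq (13): exponentiated negative per-word log-likelihood on the test set.

    log_likelihoods: array of log p(w_d) for each test document d
    doc_lengths:     array of N_d, the word count of each test document
    """
    return np.exp(-np.sum(log_likelihoods) / np.sum(doc_lengths))
```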
It is also important to investigate how a topic model's generalization capability varies with the number of topics. In particular, we test the models' perplexity at the last testing point in Figure 3, i.e., five-fold cross validation on 5000 Amazon reviews with NewEgg reviews for training. Due to space limits, we only demonstrate the perplexity results of our HTSM model on all four categories of reviews in Figure 4; the baseline models exhibit similar patterns.

[Figure 4: Perplexity of HTSM under different numbers of topics across all four categories of reviews.]

From the results, it is clear that within a reasonable range of topic sizes, the perplexity of HTSM increases only moderately. With more than 40 topics, the perplexity increases dramatically on all data sets, i.e., an indication of overfitting. These results justify our setting of the number of topics in HTSM and all baseline topic models, and we fix this setting in all the following experiments.

4.2.2 Word intrusion comparisons

Perplexity only measures the quality of topic modeling from a density estimation perspective; it is also necessary to evaluate whether the topics identified by those statistical models are human-interpretable.

More specifically, we prefer a model that generates more semantically coherent and meaningful topics. In this experiment, we employ the word intrusion protocol discussed in [4] to evaluate four different topic models, namely LDA, HTMM, ASUM and HTSM (because ASUM and JST are quite similar in their model assumptions, we do not include JST in this experiment).

In the first phase of the evaluation, the setup is as follows: we first select the top five words from each topic $z_k$ under every model as topical words. Then we select two intruding words. The first, referred to as the intra-topic intrusion word, has a very low probability in topic $z_k$ of the corresponding model. The second, referred to as the inter-topic intrusion word, is selected from a different topic $z_l \neq z_k$, and has a high probability in topic $z_l$ but a very low probability in topic $z_k$. To select a word considered as having a very low generation probability, we rank all the words under a topic in descending order of $p(w \mid z)$ and then randomly select a word ranked between 90 and 100 (given that our vocabulary size on all collections is around 1400). Hence, in total we have seven words for each topic $z_k$ from every topic model: five regular topical words, one intra-topic intruding word and one inter-topic intrusion word.

In the second phase of the evaluation, we randomly shuffle the topical words with the intruding words under each topic from every model and present the shuffled words to three annotators. The annotators have no knowledge of which topics or words were generated by which model; they are only informed of the category of the product. The task of the annotators is to identify at least one and at most two intruding words under each topic presented to them. In order to reduce annotation bias, we evenly separate the learned topics from each model into two parts and present them to different annotators. We ensure that each topic is annotated by three different annotators. Since we have four different categories and four different topic models, we take feedback from twenty-four annotators in total. The agreement among annotators was calculated by pairwise Kappa statistics [27], and these kappa values were averaged across all pairs of annotators. For example, on the tablet data set, the average kappa value for the original topical words is 0.885, which indicates that annotators agree with each other most of the time. However, for the intra-topic and inter-topic intrusion words the average kappa values are … and … respectively, which implies that annotators might have different ways of interpreting the inferred topics.

To quantitatively measure the quality of the inferred topics from these four models, we define a metric named model word-intrusion recall (MR) as follows:

$$MR_m = \frac{\sum_{k=1}^{K} \sum_{s=1}^{S} \mathbb{1}(i^m_{z_k,s} = w^m_{z_k})}{K \cdot S} \quad (14)$$

where $w^m_{z_k}$ is the vocabulary index of the intruding word among the words generated from the $z_k$-th topic inferred by topic model $m$, and $i^m_{z_k,s}$ is the corresponding index of the intruding word selected by annotator $s$; $S$ denotes the number of annotators, and $K$ denotes the total number of topics.
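Eq (14) simply counts how often annotators catch the planted intruder. A minimal sketch, where `picked` and `intruder` are illustrative containers (annotators may flag up to two words, so a membership test is used):

```python
def model_word_intrusion_recall(picked, intruder):
    """Eq (14): fraction of (topic, annotator) pairs in which the planted
    intruding word was among the words the annotator flagged.

    picked:   picked[k][s] is the set of word indices annotator s flagged for topic k
    intruder: intruder[k] is the planted intruding word's index for topic k
    """
    K = len(intruder)
    S = len(picked[0])
    hits = sum(intruder[k] in picked[k][s] for k in range(K) for s in range(S))
    return hits / (K * S)
```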
Table 2: Word intrusion measurement across different topic models on four categories of product reviews.

  Inter-topic MR
  Category | LDA | HTMM | ASUM | HTSM
  camera   |  …  |  …   |  …   |  …
  tablet   |  …  |  …   |  …   |  …
  phone    |  …  |  …   |  …   |  …
  tv       |  …  |  …   |  …   |  …

  Intra-topic MR
  Category | LDA | HTMM | ASUM | HTSM
  camera   |  …  |  …   |  …   |  …
  tablet   |  …  |  …   |  …   |  …
  phone    |  …  |  …   |  …   |  …
  tv       |  …  |  …   |  …   |  …

From Table 2, it is evident that annotators can interpret the topics inferred by HTSM more effectively than those from the other models in terms of the inter-topic intrusion words. For example, out of 90 actual inter-topic intrusion words in the tablet category, 35 were picked out by annotators from HTSM's topics. This empirical evidence implies that our HTSM model infers more human-interpretable topics than the other topic models. However, in terms of intra-topic intrusion, the performance of HTSM is not as competitive as the other models. The procedure for selecting low-probability intra-topic intrusion words and the concentration of the word distributions learned under HTSM's topics might be contributing factors to this relatively inferior performance.

4.2.3 Topic transitions

Given that HTSM explicitly models topic transitions in an opinionated review document, we visualize the learned transitions in a transition diagram to qualitatively demonstrate the topical coherence obtained by HTSM. Due to space limits, we only report results extracted from the tablet data set. First, we train an HTSM with 30 topics on all the reviews in the tablet category. To automatically differentiate domain-specific sentiment polarity, we train HTSM in a semi-supervised mode: the pros/cons sections of NewEgg reviews are used to specify sentiment labels on sentences, while Amazon reviews are used in fully unsupervised training. Then, for each sentence $t_i$ in a review document from the training set, we infer its most probable topic $z_k$ from HTSM via the Viterbi algorithm. As a result, for two consecutive sentences $t_{i-1}$ and $t_i$, we obtain the corresponding pairwise topic transition $z_j \to z_k$. We accumulate the transition counts over all consecutive sentences in the training corpus, and normalize the resulting transition matrix to construct the diagram; a short sketch of this procedure follows.
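This sketch assumes a `viterbi_topics(doc)` decoder is available from the trained model; the names and the pruning threshold mirror the description above but are otherwise illustrative.

```python
import numpy as np

def transition_diagram(docs, viterbi_topics, K, min_prob=0.01):
    """Accumulate z_{i-1} -> z_i counts over consecutive sentences, then
    row-normalize; transitions below min_prob are dropped, as in Figure 5."""
    counts = np.zeros((K, K))
    for doc in docs:
        z = viterbi_topics(doc)                # most probable topic per sentence
        for prev, cur in zip(z[:-1], z[1:]):
            counts[prev, cur] += 1
    probs = counts / counts.sum(axis=1, keepdims=True).clip(min=1)
    probs[probs < min_prob] = 0.0              # prune rare transitions for readability
    return probs
```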

Figure 5 illustrates the learned topic transition diagram for the tablet category. Note that in order to obtain a more perceivable view, we ignored transitions with probability less than 0.01 and removed less popular topics. In this figure, each topic is denoted as an Aspect_Sentiment pair; for example, screen_P represents positive sentiment about the screen aspect. The diagram also contains a special node named start, a dummy topic which generates the initial topic for the first sentence of every document. Besides, we also highlight the top six words under some selected topics (the selection of annotated topics is purely based on space constraints).

[Figure 5: Estimated topic transitions and top words under selected topics on the tablet data set. The highlighted top words include: "screen hd read display color resolut"; "buy money price bought purchas wast"; "batteri life hour charg time long"; "charg batteri drain power time die"; "product amazon ship return box day"; "camera pictur good great front video"; "app android download problem play slow".]

From Figure 5, we can clearly identify some interesting topical transitions in tablet reviews. For example, when reviewers hold positive feelings about the tablets they purchased, they usually start with positive sentiment about price, followed by positive sentiment about battery life, service and so on. However, if a reviewer plans to criticize a tablet, he or she usually starts with negative sentiment about price and then transits to negative sentiment about battery life, screen, apps, etc. These learned transitions are of particular importance in opinion summarization: they help organize the generated sentences in a coherent order.

4.3 Sentiment classification

In this section, we evaluate HTSM in terms of sentiment classification. We use the already segmented NewEgg reviews as ground-truth sentence-level sentiment annotations: we treat all sentences in the pros section as positive and all sentences in the cons section as negative. We should note that such annotations are different from the overall ratings of reviews. The overall ratings are of low resolution for sentiment annotation: a review with a high overall rating might still contain some negative sentences, and vice versa. In contrast, the self-annotated pros/cons sections provide finer-grained sentiment annotations. Therefore, in this experiment we did not use the overall ratings in model training or testing.

During the training phase of HTSM, we use a mixture of review data obtained from NewEgg and Amazon. Since we have sentiment labels on sentences from the NewEgg data set, the sentiment transition indicator $\tau$ can be directly inferred there; hence we train our HTSM model in a semi-supervised manner. Specifically, during training, if the input document is from NewEgg, $\tau$ is fixed based on the sentiment labels of its sentences; otherwise, HTSM has to infer $\tau$ according to Eq (1). To make a fair comparison across all the models, ASUM and JST were also modified to utilize the annotated pros/cons sections in the NewEgg data set during the training phase. In addition, we include EM-NaiveBayes [20], a semi-supervised algorithm, as a baseline in this experiment; it also exploits the sentiment annotations in the NewEgg data during training. We use only the NewEgg data to construct the test set, since we do not have such fine-grained annotations for the Amazon data (so we refer to the Amazon data as unlabeled data). Besides, we start with a training set containing only the reviews from NewEgg (training size 0 in Figure 6) and then keep adding more unlabeled data from Amazon (training size 1000, 2000, etc.) into the training set, i.e., the exact setting used in the perplexity evaluation in Section 4.2.1. We report the average F1 score from five-fold cross-validation as the performance metric in this experiment.

[Figure 6: Sentiment classification performance with increasing training size on four different review document sets. Panels: camera, tablet, phone and tv; curves: ASUM, JST, EM-Naive Bayes and HTSM.]

Figure 6 illustrates the sentiment classification performance of HTSM on all four categories against the ASUM, JST and EM-NaiveBayes baselines. We can clearly notice that, with the same amount of training data, HTSM outperformed all the other models, which treat sentences as independent within a document.
Sentiment consistency enforced by HTSM helps to better capture the dependence between consecutive sentences and therefore predicts their sentiment polarities more accurately. The only exception is the tv category, where the performance of HTSM degenerated beyond training size 3000 and became worse than EM-NaiveBayes. This degenerate result is caused by the divergent products reviewed in the Amazon and NewEgg data sets. We manually checked the products in the tv category from these two data sets and found fewer common products than in the other categories. As a result, adding more Amazon reviews increases the discrepancy between the learned model and the test set, which comes only from NewEgg reviews.

The improved classification performance of HTSM results from its unique capability of modeling sentiment consistency inside a review document, i.e., when sentiment switches, topic assignments have to change in successive sentences. The transitions are controlled by the parameterized logistic functions over the observable linguistic features described in Section 3.4. Table 3 shows the learned feature weights for topic switch $\epsilon$ and sentiment switch $\sigma$ on the camera data set (we obtained very similar results on the other three categories as well, but due to space limits we cannot list them in the table). For example, the bias term controlling sentiment switch is more negative than that for topic transition. This implies that the sentiment of two consecutive sentences is less likely to change than their topics. The learned weights for the content-based cosine similarity are negative for both transitions.

Table 3: Learned feature weights in HTSM for sentiment and topic transition on the camera data set.

  Sentiment transition feature                                       | Weight
  bias term of f_s(d, i)                                             |   …
  content-based cosine similarity between t_i and t_{i-1}            |   …
  SentiWordNet [1] score difference between t_i and t_{i-1}          |   …
  sentiment word count difference between t_i and t_{i-1}            |   …
  indicator of whether t_i is more similar to t_{i-1} or t_{i+1}     |   …
  Jaccard coefficient between POS tags in t_i and t_{i-1}            |   …
  negation word count in t_i                                         |   …

  Topic transition feature                                           | Weight
  bias term of f_a(d, i)                                             |   …
  content-based cosine similarity between t_i and t_{i-1}            |   …
  length ratio of the two consecutive sentences t_i and t_{i-1}      |   …
  relative position of t_i in d, i.e., i/m_d                         |   …
  indicator of whether t_i is more similar to t_{i-1} or t_{i+1}     |   …

It follows our expectation that the more similar two consecutive sentences are, the less likely we are to observe a sentiment or topic switch. These observations support our decision to use observable linguistic features to guide topic transition modeling, which ultimately helps HTSM achieve improved topic coherence and sentiment consistency in modeling opinionated documents.

To provide a thorough evaluation of sentiment classification, we also tested all the topic models with varying numbers of topics. Following the same settings as in Figure 4, we report the F1 measure of HTSM on all four categories of reviews. Due to space limits, we do not include the results of the baselines in Figure 7. A similar conclusion as in the perplexity evaluation can be reached: with a moderate number of topics, HTSM's classification performance is satisfactory and stable; but with an increased number of topics, the classification results vary and even degenerate on some data sets (e.g., the tablet data set).

[Figure 7: Sentiment classification performance of HTSM under different numbers of topics across all four categories of reviews.]

4.4 Aspect-Based Contrastive Summarization

In order to evaluate the utility of the aspects and sentiments identified by our model, we study aspect-based review summarization, which aims at finding the most representative sentences for each topic (a combination of aspect and sentiment) from a collection of reviews. In Table 4, we demonstrate a sample aspect-based contrastive summarization result for two comparable tablet products. We selected the Samsung Galaxy Note 10.1 and the Amazon Kindle Fire HDX based on their popularity in the Amazon tablet data set. The practical value of this type of contrastive review summary is to help customers easily digest a vast amount of opinionated data and make informed decisions.

Table 4 shows the side-by-side comparison on six aspects ("+" indicates positive aspects and "-" indicates negative aspects) of these two tablets identified by HTSM. Imagine a user making a choice between these two tablets. If the user cares about the battery aspect the most, he or she can easily find out from the summary that the Samsung Galaxy Note 10.1 is a better choice than the Amazon Kindle Fire HDX by consulting this aspect-based contrastive review summarization. This saves the user a considerable amount of time in reading the detailed reviews.

We performed user studies to understand whether these kinds of summaries are meaningful for actual users. In this experiment,


More information

OCR for Arabic using SIFT Descriptors With Online Failure Prediction

OCR for Arabic using SIFT Descriptors With Online Failure Prediction OCR for Arabic using SIFT Descriptors With Online Failure Prediction Andrey Stolyarenko, Nachum Dershowitz The Blavatnik School of Computer Science Tel Aviv University Tel Aviv, Israel Email: stloyare@tau.ac.il,

More information

Learning From the Past with Experiment Databases

Learning From the Past with Experiment Databases Learning From the Past with Experiment Databases Joaquin Vanschoren 1, Bernhard Pfahringer 2, and Geoff Holmes 2 1 Computer Science Dept., K.U.Leuven, Leuven, Belgium 2 Computer Science Dept., University

More information

Toward Probabilistic Natural Logic for Syllogistic Reasoning

Toward Probabilistic Natural Logic for Syllogistic Reasoning Toward Probabilistic Natural Logic for Syllogistic Reasoning Fangzhou Zhai, Jakub Szymanik and Ivan Titov Institute for Logic, Language and Computation, University of Amsterdam Abstract Natural language

More information

Discriminative Learning of Beam-Search Heuristics for Planning

Discriminative Learning of Beam-Search Heuristics for Planning Discriminative Learning of Beam-Search Heuristics for Planning Yuehua Xu School of EECS Oregon State University Corvallis,OR 97331 xuyu@eecs.oregonstate.edu Alan Fern School of EECS Oregon State University

More information

An Online Handwriting Recognition System For Turkish

An Online Handwriting Recognition System For Turkish An Online Handwriting Recognition System For Turkish Esra Vural, Hakan Erdogan, Kemal Oflazer, Berrin Yanikoglu Sabanci University, Tuzla, Istanbul, Turkey 34956 ABSTRACT Despite recent developments in

More information

Calibration of Confidence Measures in Speech Recognition

Calibration of Confidence Measures in Speech Recognition Submitted to IEEE Trans on Audio, Speech, and Language, July 2010 1 Calibration of Confidence Measures in Speech Recognition Dong Yu, Senior Member, IEEE, Jinyu Li, Member, IEEE, Li Deng, Fellow, IEEE

More information

Learning Methods in Multilingual Speech Recognition

Learning Methods in Multilingual Speech Recognition Learning Methods in Multilingual Speech Recognition Hui Lin Department of Electrical Engineering University of Washington Seattle, WA 98125 linhui@u.washington.edu Li Deng, Jasha Droppo, Dong Yu, and Alex

More information

CSL465/603 - Machine Learning

CSL465/603 - Machine Learning CSL465/603 - Machine Learning Fall 2016 Narayanan C Krishnan ckn@iitrpr.ac.in Introduction CSL465/603 - Machine Learning 1 Administrative Trivia Course Structure 3-0-2 Lecture Timings Monday 9.55-10.45am

More information

Modeling function word errors in DNN-HMM based LVCSR systems

Modeling function word errors in DNN-HMM based LVCSR systems Modeling function word errors in DNN-HMM based LVCSR systems Melvin Jose Johnson Premkumar, Ankur Bapna and Sree Avinash Parchuri Department of Computer Science Department of Electrical Engineering Stanford

More information

Clickthrough-Based Translation Models for Web Search: from Word Models to Phrase Models

Clickthrough-Based Translation Models for Web Search: from Word Models to Phrase Models Clickthrough-Based Translation Models for Web Search: from Word Models to Phrase Models Jianfeng Gao Microsoft Research One Microsoft Way Redmond, WA 98052 USA jfgao@microsoft.com Xiaodong He Microsoft

More information

Word Segmentation of Off-line Handwritten Documents

Word Segmentation of Off-line Handwritten Documents Word Segmentation of Off-line Handwritten Documents Chen Huang and Sargur N. Srihari {chuang5, srihari}@cedar.buffalo.edu Center of Excellence for Document Analysis and Recognition (CEDAR), Department

More information

Speech Recognition at ICSI: Broadcast News and beyond

Speech Recognition at ICSI: Broadcast News and beyond Speech Recognition at ICSI: Broadcast News and beyond Dan Ellis International Computer Science Institute, Berkeley CA Outline 1 2 3 The DARPA Broadcast News task Aspects of ICSI

More information

On the Combined Behavior of Autonomous Resource Management Agents

On the Combined Behavior of Autonomous Resource Management Agents On the Combined Behavior of Autonomous Resource Management Agents Siri Fagernes 1 and Alva L. Couch 2 1 Faculty of Engineering Oslo University College Oslo, Norway siri.fagernes@iu.hio.no 2 Computer Science

More information

Quantitative analysis with statistics (and ponies) (Some slides, pony-based examples from Blase Ur)

Quantitative analysis with statistics (and ponies) (Some slides, pony-based examples from Blase Ur) Quantitative analysis with statistics (and ponies) (Some slides, pony-based examples from Blase Ur) 1 Interviews, diary studies Start stats Thursday: Ethics/IRB Tuesday: More stats New homework is available

More information

AQUA: An Ontology-Driven Question Answering System

AQUA: An Ontology-Driven Question Answering System AQUA: An Ontology-Driven Question Answering System Maria Vargas-Vera, Enrico Motta and John Domingue Knowledge Media Institute (KMI) The Open University, Walton Hall, Milton Keynes, MK7 6AA, United Kingdom.

More information

NCU IISR English-Korean and English-Chinese Named Entity Transliteration Using Different Grapheme Segmentation Approaches

NCU IISR English-Korean and English-Chinese Named Entity Transliteration Using Different Grapheme Segmentation Approaches NCU IISR English-Korean and English-Chinese Named Entity Transliteration Using Different Grapheme Segmentation Approaches Yu-Chun Wang Chun-Kai Wu Richard Tzong-Han Tsai Department of Computer Science

More information

On-the-Fly Customization of Automated Essay Scoring

On-the-Fly Customization of Automated Essay Scoring Research Report On-the-Fly Customization of Automated Essay Scoring Yigal Attali Research & Development December 2007 RR-07-42 On-the-Fly Customization of Automated Essay Scoring Yigal Attali ETS, Princeton,

More information

FUZZY EXPERT. Dr. Kasim M. Al-Aubidy. Philadelphia University. Computer Eng. Dept February 2002 University of Damascus-Syria

FUZZY EXPERT. Dr. Kasim M. Al-Aubidy. Philadelphia University. Computer Eng. Dept February 2002 University of Damascus-Syria FUZZY EXPERT SYSTEMS 16-18 18 February 2002 University of Damascus-Syria Dr. Kasim M. Al-Aubidy Computer Eng. Dept. Philadelphia University What is Expert Systems? ES are computer programs that emulate

More information

Netpix: A Method of Feature Selection Leading. to Accurate Sentiment-Based Classification Models

Netpix: A Method of Feature Selection Leading. to Accurate Sentiment-Based Classification Models Netpix: A Method of Feature Selection Leading to Accurate Sentiment-Based Classification Models 1 Netpix: A Method of Feature Selection Leading to Accurate Sentiment-Based Classification Models James B.

More information

Designing a Rubric to Assess the Modelling Phase of Student Design Projects in Upper Year Engineering Courses

Designing a Rubric to Assess the Modelling Phase of Student Design Projects in Upper Year Engineering Courses Designing a Rubric to Assess the Modelling Phase of Student Design Projects in Upper Year Engineering Courses Thomas F.C. Woodhall Masters Candidate in Civil Engineering Queen s University at Kingston,

More information

(Sub)Gradient Descent

(Sub)Gradient Descent (Sub)Gradient Descent CMSC 422 MARINE CARPUAT marine@cs.umd.edu Figures credit: Piyush Rai Logistics Midterm is on Thursday 3/24 during class time closed book/internet/etc, one page of notes. will include

More information

Enhancing Unlexicalized Parsing Performance using a Wide Coverage Lexicon, Fuzzy Tag-set Mapping, and EM-HMM-based Lexical Probabilities

Enhancing Unlexicalized Parsing Performance using a Wide Coverage Lexicon, Fuzzy Tag-set Mapping, and EM-HMM-based Lexical Probabilities Enhancing Unlexicalized Parsing Performance using a Wide Coverage Lexicon, Fuzzy Tag-set Mapping, and EM-HMM-based Lexical Probabilities Yoav Goldberg Reut Tsarfaty Meni Adler Michael Elhadad Ben Gurion

More information

Artificial Neural Networks written examination

Artificial Neural Networks written examination 1 (8) Institutionen för informationsteknologi Olle Gällmo Universitetsadjunkt Adress: Lägerhyddsvägen 2 Box 337 751 05 Uppsala Artificial Neural Networks written examination Monday, May 15, 2006 9 00-14

More information

Semi-Supervised Face Detection

Semi-Supervised Face Detection Semi-Supervised Face Detection Nicu Sebe, Ira Cohen 2, Thomas S. Huang 3, Theo Gevers Faculty of Science, University of Amsterdam, The Netherlands 2 HP Research Labs, USA 3 Beckman Institute, University

More information

Detecting English-French Cognates Using Orthographic Edit Distance

Detecting English-French Cognates Using Orthographic Edit Distance Detecting English-French Cognates Using Orthographic Edit Distance Qiongkai Xu 1,2, Albert Chen 1, Chang i 1 1 The Australian National University, College of Engineering and Computer Science 2 National

More information

AP Statistics Summer Assignment 17-18

AP Statistics Summer Assignment 17-18 AP Statistics Summer Assignment 17-18 Welcome to AP Statistics. This course will be unlike any other math class you have ever taken before! Before taking this course you will need to be competent in basic

More information

Introduction to Causal Inference. Problem Set 1. Required Problems

Introduction to Causal Inference. Problem Set 1. Required Problems Introduction to Causal Inference Problem Set 1 Professor: Teppei Yamamoto Due Friday, July 15 (at beginning of class) Only the required problems are due on the above date. The optional problems will not

More information

Summarizing Contrastive Themes via Hierarchical Non-Parametric Processes

Summarizing Contrastive Themes via Hierarchical Non-Parametric Processes Summarizing Contrastive Themes via Hierarchical Non-Parametric Processes Zhaochun Ren z.ren@uva.nl Maarten de Rijke derijke@uva.nl University of Amsterdam, Amsterdam, The Netherlands ABSTRACT Given a topic

More information

As a high-quality international conference in the field

As a high-quality international conference in the field The New Automated IEEE INFOCOM Review Assignment System Baochun Li and Y. Thomas Hou Abstract In academic conferences, the structure of the review process has always been considered a critical aspect of

More information

Software Maintenance

Software Maintenance 1 What is Software Maintenance? Software Maintenance is a very broad activity that includes error corrections, enhancements of capabilities, deletion of obsolete capabilities, and optimization. 2 Categories

More information

Online Updating of Word Representations for Part-of-Speech Tagging

Online Updating of Word Representations for Part-of-Speech Tagging Online Updating of Word Representations for Part-of-Speech Tagging Wenpeng Yin LMU Munich wenpeng@cis.lmu.de Tobias Schnabel Cornell University tbs49@cornell.edu Hinrich Schütze LMU Munich inquiries@cislmu.org

More information

Stacks Teacher notes. Activity description. Suitability. Time. AMP resources. Equipment. Key mathematical language. Key processes

Stacks Teacher notes. Activity description. Suitability. Time. AMP resources. Equipment. Key mathematical language. Key processes Stacks Teacher notes Activity description (Interactive not shown on this sheet.) Pupils start by exploring the patterns generated by moving counters between two stacks according to a fixed rule, doubling

More information

Experts Retrieval with Multiword-Enhanced Author Topic Model

Experts Retrieval with Multiword-Enhanced Author Topic Model NAACL 10 Workshop on Semantic Search Experts Retrieval with Multiword-Enhanced Author Topic Model Nikhil Johri Dan Roth Yuancheng Tu Dept. of Computer Science Dept. of Linguistics University of Illinois

More information

Language Acquisition Fall 2010/Winter Lexical Categories. Afra Alishahi, Heiner Drenhaus

Language Acquisition Fall 2010/Winter Lexical Categories. Afra Alishahi, Heiner Drenhaus Language Acquisition Fall 2010/Winter 2011 Lexical Categories Afra Alishahi, Heiner Drenhaus Computational Linguistics and Phonetics Saarland University Children s Sensitivity to Lexical Categories Look,

More information

Beyond the Pipeline: Discrete Optimization in NLP

Beyond the Pipeline: Discrete Optimization in NLP Beyond the Pipeline: Discrete Optimization in NLP Tomasz Marciniak and Michael Strube EML Research ggmbh Schloss-Wolfsbrunnenweg 33 69118 Heidelberg, Germany http://www.eml-research.de/nlp Abstract We

More information

A Bayesian Learning Approach to Concept-Based Document Classification

A Bayesian Learning Approach to Concept-Based Document Classification Databases and Information Systems Group (AG5) Max-Planck-Institute for Computer Science Saarbrücken, Germany A Bayesian Learning Approach to Concept-Based Document Classification by Georgiana Ifrim Supervisors

More information

arxiv: v1 [math.at] 10 Jan 2016

arxiv: v1 [math.at] 10 Jan 2016 THE ALGEBRAIC ATIYAH-HIRZEBRUCH SPECTRAL SEQUENCE OF REAL PROJECTIVE SPECTRA arxiv:1601.02185v1 [math.at] 10 Jan 2016 GUOZHEN WANG AND ZHOULI XU Abstract. In this note, we use Curtis s algorithm and the

More information

LEGO MINDSTORMS Education EV3 Coding Activities

LEGO MINDSTORMS Education EV3 Coding Activities LEGO MINDSTORMS Education EV3 Coding Activities s t e e h s k r o W t n e d Stu LEGOeducation.com/MINDSTORMS Contents ACTIVITY 1 Performing a Three Point Turn 3-6 ACTIVITY 2 Written Instructions for a

More information

Speech Segmentation Using Probabilistic Phonetic Feature Hierarchy and Support Vector Machines

Speech Segmentation Using Probabilistic Phonetic Feature Hierarchy and Support Vector Machines Speech Segmentation Using Probabilistic Phonetic Feature Hierarchy and Support Vector Machines Amit Juneja and Carol Espy-Wilson Department of Electrical and Computer Engineering University of Maryland,

More information

Product Feature-based Ratings foropinionsummarization of E-Commerce Feedback Comments

Product Feature-based Ratings foropinionsummarization of E-Commerce Feedback Comments Product Feature-based Ratings foropinionsummarization of E-Commerce Feedback Comments Vijayshri Ramkrishna Ingale PG Student, Department of Computer Engineering JSPM s Imperial College of Engineering &

More information

A Coding System for Dynamic Topic Analysis: A Computer-Mediated Discourse Analysis Technique

A Coding System for Dynamic Topic Analysis: A Computer-Mediated Discourse Analysis Technique A Coding System for Dynamic Topic Analysis: A Computer-Mediated Discourse Analysis Technique Hiromi Ishizaki 1, Susan C. Herring 2, Yasuhiro Takishima 1 1 KDDI R&D Laboratories, Inc. 2 Indiana University

More information

A Survey on Unsupervised Machine Learning Algorithms for Automation, Classification and Maintenance

A Survey on Unsupervised Machine Learning Algorithms for Automation, Classification and Maintenance A Survey on Unsupervised Machine Learning Algorithms for Automation, Classification and Maintenance a Assistant Professor a epartment of Computer Science Memoona Khanum a Tahira Mahboob b b Assistant Professor

More information

A study of speaker adaptation for DNN-based speech synthesis

A study of speaker adaptation for DNN-based speech synthesis A study of speaker adaptation for DNN-based speech synthesis Zhizheng Wu, Pawel Swietojanski, Christophe Veaux, Steve Renals, Simon King The Centre for Speech Technology Research (CSTR) University of Edinburgh,

More information

Term Weighting based on Document Revision History

Term Weighting based on Document Revision History Term Weighting based on Document Revision History Sérgio Nunes, Cristina Ribeiro, and Gabriel David INESC Porto, DEI, Faculdade de Engenharia, Universidade do Porto. Rua Dr. Roberto Frias, s/n. 4200-465

More information

Statewide Framework Document for:

Statewide Framework Document for: Statewide Framework Document for: 270301 Standards may be added to this document prior to submission, but may not be removed from the framework to meet state credit equivalency requirements. Performance

More information

Using Web Searches on Important Words to Create Background Sets for LSI Classification

Using Web Searches on Important Words to Create Background Sets for LSI Classification Using Web Searches on Important Words to Create Background Sets for LSI Classification Sarah Zelikovitz and Marina Kogan College of Staten Island of CUNY 2800 Victory Blvd Staten Island, NY 11314 Abstract

More information

Activities, Exercises, Assignments Copyright 2009 Cem Kaner 1

Activities, Exercises, Assignments Copyright 2009 Cem Kaner 1 Patterns of activities, iti exercises and assignments Workshop on Teaching Software Testing January 31, 2009 Cem Kaner, J.D., Ph.D. kaner@kaner.com Professor of Software Engineering Florida Institute of

More information

Honors Mathematics. Introduction and Definition of Honors Mathematics

Honors Mathematics. Introduction and Definition of Honors Mathematics Honors Mathematics Introduction and Definition of Honors Mathematics Honors Mathematics courses are intended to be more challenging than standard courses and provide multiple opportunities for students

More information

Rule discovery in Web-based educational systems using Grammar-Based Genetic Programming

Rule discovery in Web-based educational systems using Grammar-Based Genetic Programming Data Mining VI 205 Rule discovery in Web-based educational systems using Grammar-Based Genetic Programming C. Romero, S. Ventura, C. Hervás & P. González Universidad de Córdoba, Campus Universitario de

More information

POS tagging of Chinese Buddhist texts using Recurrent Neural Networks

POS tagging of Chinese Buddhist texts using Recurrent Neural Networks POS tagging of Chinese Buddhist texts using Recurrent Neural Networks Longlu Qin Department of East Asian Languages and Cultures longlu@stanford.edu Abstract Chinese POS tagging, as one of the most important

More information

Generating Test Cases From Use Cases

Generating Test Cases From Use Cases 1 of 13 1/10/2007 10:41 AM Generating Test Cases From Use Cases by Jim Heumann Requirements Management Evangelist Rational Software pdf (155 K) In many organizations, software testing accounts for 30 to

More information

12- A whirlwind tour of statistics

12- A whirlwind tour of statistics CyLab HT 05-436 / 05-836 / 08-534 / 08-734 / 19-534 / 19-734 Usable Privacy and Security TP :// C DU February 22, 2016 y & Secu rivac rity P le ratory bo La Lujo Bauer, Nicolas Christin, and Abby Marsh

More information

University of Groningen. Systemen, planning, netwerken Bosman, Aart

University of Groningen. Systemen, planning, netwerken Bosman, Aart University of Groningen Systemen, planning, netwerken Bosman, Aart IMPORTANT NOTE: You are advised to consult the publisher's version (publisher's PDF) if you wish to cite from it. Please check the document

More information

The Strong Minimalist Thesis and Bounded Optimality

The Strong Minimalist Thesis and Bounded Optimality The Strong Minimalist Thesis and Bounded Optimality DRAFT-IN-PROGRESS; SEND COMMENTS TO RICKL@UMICH.EDU Richard L. Lewis Department of Psychology University of Michigan 27 March 2010 1 Purpose of this

More information

Class-Discriminative Weighted Distortion Measure for VQ-Based Speaker Identification

Class-Discriminative Weighted Distortion Measure for VQ-Based Speaker Identification Class-Discriminative Weighted Distortion Measure for VQ-Based Speaker Identification Tomi Kinnunen and Ismo Kärkkäinen University of Joensuu, Department of Computer Science, P.O. Box 111, 80101 JOENSUU,

More information

Australian Journal of Basic and Applied Sciences

Australian Journal of Basic and Applied Sciences AENSI Journals Australian Journal of Basic and Applied Sciences ISSN:1991-8178 Journal home page: www.ajbasweb.com Feature Selection Technique Using Principal Component Analysis For Improving Fuzzy C-Mean

More information

The Karlsruhe Institute of Technology Translation Systems for the WMT 2011

The Karlsruhe Institute of Technology Translation Systems for the WMT 2011 The Karlsruhe Institute of Technology Translation Systems for the WMT 2011 Teresa Herrmann, Mohammed Mediani, Jan Niehues and Alex Waibel Karlsruhe Institute of Technology Karlsruhe, Germany firstname.lastname@kit.edu

More information

On document relevance and lexical cohesion between query terms

On document relevance and lexical cohesion between query terms Information Processing and Management 42 (2006) 1230 1247 www.elsevier.com/locate/infoproman On document relevance and lexical cohesion between query terms Olga Vechtomova a, *, Murat Karamuftuoglu b,

More information

Predicting Student Attrition in MOOCs using Sentiment Analysis and Neural Networks

Predicting Student Attrition in MOOCs using Sentiment Analysis and Neural Networks Predicting Student Attrition in MOOCs using Sentiment Analysis and Neural Networks Devendra Singh Chaplot, Eunhee Rhim, and Jihie Kim Samsung Electronics Co., Ltd. Seoul, South Korea {dev.chaplot,eunhee.rhim,jihie.kim}@samsung.com

More information

CROSS-LANGUAGE INFORMATION RETRIEVAL USING PARAFAC2

CROSS-LANGUAGE INFORMATION RETRIEVAL USING PARAFAC2 1 CROSS-LANGUAGE INFORMATION RETRIEVAL USING PARAFAC2 Peter A. Chew, Brett W. Bader, Ahmed Abdelali Proceedings of the 13 th SIGKDD, 2007 Tiago Luís Outline 2 Cross-Language IR (CLIR) Latent Semantic Analysis

More information

Attributed Social Network Embedding

Attributed Social Network Embedding JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, MAY 2017 1 Attributed Social Network Embedding arxiv:1705.04969v1 [cs.si] 14 May 2017 Lizi Liao, Xiangnan He, Hanwang Zhang, and Tat-Seng Chua Abstract Embedding

More information

Speech Emotion Recognition Using Support Vector Machine

Speech Emotion Recognition Using Support Vector Machine Speech Emotion Recognition Using Support Vector Machine Yixiong Pan, Peipei Shen and Liping Shen Department of Computer Technology Shanghai JiaoTong University, Shanghai, China panyixiong@sjtu.edu.cn,

More information

How to Judge the Quality of an Objective Classroom Test

How to Judge the Quality of an Objective Classroom Test How to Judge the Quality of an Objective Classroom Test Technical Bulletin #6 Evaluation and Examination Service The University of Iowa (319) 335-0356 HOW TO JUDGE THE QUALITY OF AN OBJECTIVE CLASSROOM

More information

The stages of event extraction

The stages of event extraction The stages of event extraction David Ahn Intelligent Systems Lab Amsterdam University of Amsterdam ahn@science.uva.nl Abstract Event detection and recognition is a complex task consisting of multiple sub-tasks

More information

AGS THE GREAT REVIEW GAME FOR PRE-ALGEBRA (CD) CORRELATED TO CALIFORNIA CONTENT STANDARDS

AGS THE GREAT REVIEW GAME FOR PRE-ALGEBRA (CD) CORRELATED TO CALIFORNIA CONTENT STANDARDS AGS THE GREAT REVIEW GAME FOR PRE-ALGEBRA (CD) CORRELATED TO CALIFORNIA CONTENT STANDARDS 1 CALIFORNIA CONTENT STANDARDS: Chapter 1 ALGEBRA AND WHOLE NUMBERS Algebra and Functions 1.4 Students use algebraic

More information

South Carolina English Language Arts

South Carolina English Language Arts South Carolina English Language Arts A S O F J U N E 2 0, 2 0 1 0, T H I S S TAT E H A D A D O P T E D T H E CO M M O N CO R E S TAT E S TA N DA R D S. DOCUMENTS REVIEWED South Carolina Academic Content

More information

Syntactic Patterns versus Word Alignment: Extracting Opinion Targets from Online Reviews

Syntactic Patterns versus Word Alignment: Extracting Opinion Targets from Online Reviews Syntactic Patterns versus Word Alignment: Extracting Opinion Targets from Online Reviews Kang Liu, Liheng Xu and Jun Zhao National Laboratory of Pattern Recognition Institute of Automation, Chinese Academy

More information

Julia Smith. Effective Classroom Approaches to.

Julia Smith. Effective Classroom Approaches to. Julia Smith @tessmaths Effective Classroom Approaches to GCSE Maths resits julia.smith@writtle.ac.uk Agenda The context of GCSE resit in a post-16 setting An overview of the new GCSE Key features of a

More information

Feature-oriented vs. Needs-oriented Product Access for Non-Expert Online Shoppers

Feature-oriented vs. Needs-oriented Product Access for Non-Expert Online Shoppers Feature-oriented vs. Needs-oriented Product Access for Non-Expert Online Shoppers Daniel Felix 1, Christoph Niederberger 1, Patrick Steiger 2 & Markus Stolze 3 1 ETH Zurich, Technoparkstrasse 1, CH-8005

More information

Extracting Opinion Expressions and Their Polarities Exploration of Pipelines and Joint Models

Extracting Opinion Expressions and Their Polarities Exploration of Pipelines and Joint Models Extracting Opinion Expressions and Their Polarities Exploration of Pipelines and Joint Models Richard Johansson and Alessandro Moschitti DISI, University of Trento Via Sommarive 14, 38123 Trento (TN),

More information