2010 International Conference on Computer Application and System Modeling (ICCASM 2010) A Comparison of Two Text Representations for Sentiment Analysis Jianxiong Wang School of Computer Science & Educational Software Guangzhou University Guangzhou, China xiong.ong@gmail.com Andy Dong Design Lab University of Sydney Sydney, Australia a.dong@arch.usyd.edu.au Abstract This paper compares two representations of text within the same experimental setting for sentiment orientation analysis, and in particular focuses on the sensitivity of the analysis to text length. The two representations compared in this paper are bag-of-words (BoW) and the nine-dimensional vector (9Dim). The former represents text with a high-dimensional feature vector, which ignores grammatical structure and is lexicon-dependent. In contrast, the 9Dim representation encodes grammatical knowledge of clauses in sentences into a compact nine-dimensional vector, which is lexicon-independent. Because the grammatical structure of a single sentence or clause may not provide sufficient information for sentiment orientation classification, text is composed of multiple sentences. A convenient way to enrich the grammatical knowledge in a text is to compose the text from multiple sentences, thereby lengthening the sample. We consider the length of text to be an important factor in text classification. The aim of this paper is to demonstrate how the performance of text sentiment orientation classifiers improves when the length of the text comprising a training vector is varied. The experimental results indicate that the accuracy of the classifiers benefits from increasing text length, and also illustrate that the 9Dim method can provide results comparable to BoW under the same sentiment classification algorithm, support vector machines (SVM). Keywords-sentiment analysis; text representations; bag-of-words; 9Dim I. 
INTRODUCTION Sentiment analysis concentrates on classifying documents according to the opinions and emotions expressed by their authors. Judging a document's orientation as positive or negative is a common two-class problem in sentiment analysis [1-3], which is also known as sentiment orientation analysis in text classification. With the expansion of the Internet, text classification has been found to be helpful in many respects. On some on-line e-commerce companies' websites, visitors or customers are encouraged to leave their comments or feedback. Summarising these comments or feedback with sentiment orientation analysis technology would help the websites to stimulate suppliers and interest potential suppliers and customers, or even appeal to value-adding [3]. Text classification is also helpful in information retrieval [4] and could be applied to recognise and filter spam. Furthermore, by analysing on-line communication such as on-line forums, chat rooms, newsgroups, and the Dark Web [5], sentiment orientation analysis could provide support in tracking extremist groups, terrorists and hate groups [2]. In this paper, we investigate the effectiveness of adopting the support vector machines classification algorithm for the sentiment orientation analysis problem. An interesting issue in this problem is the relationship between the performance of the text classifier and the length of the training text. For example, what is the optimal length for a training text example? With that known, a trained text classifier for sentiment orientation analysis would deliver the best performance with less memory. Thus, apart from showing the results from the two text representation methods, we also analyse this issue to acquire a deeper understanding of the results we obtained. The remainder of this paper is organised as follows. Section II presents a review of related work in sentiment orientation analysis. Section III identifies research gaps and questions. 
Data collection and processing of design text is described in Section IV. Section V presents the experiments used to compare the BoW and 9Dim methods. Section VI concludes with closing remarks. II. RELATED WORK In recent studies, the BoW text representation method has enjoyed much attention and achieved outstanding performance in sentiment analysis of semantic orientation in natural language [2]. The BoW representation expresses text without considering word order or word usage. BoW represents a document as follows: initially, a feature list (wordlist) is composed from all the words in the corpus the document belongs to. A numerical vector is created to represent the given document; each entry of the vector corresponds to a word contained in the wordlist. The vector is initialised as a zero-vector. When a word is contained in both the document and the wordlist, the corresponding entry of the vector is set to 1 to mark the appearance, or to a higher positive integer to indicate the frequency of the word in the document. The BoW method has been successfully adopted in both natural language processing [6, 7] and computer vision [8]. Let us take the following example of a document with two clauses: (A) It is a great masterpiece. and (B) Martin is a good designer. The composed wordlist is {Martin, It, is, a,
good, great, designer, masterpiece}. The representation vector for clause A is [0 1 1 1 0 1 0 1], and [1 0 1 1 1 0 1 0] for clause B. Pang et al. [3] demonstrated a system with the BoW representation and the common machine learning classification algorithm support vector machines (SVM), treating the task as a two-class classification problem. They compared Naïve Bayes, maximum entropy classification and SVM classification techniques, machine learning methods known to be successful at topic classification tasks, on the semantic orientation of movie reviews. Within the theory of systemic-functional linguistics, Martin and White [9] provide a rigorous, network-based model for sentiment, which linguists characterise as the construal of emotions and interpersonal relations in language. The model has been partially implemented [10] and its performance improved by about 7% compared to results reported by other researchers on the same data set, movie reviews. Given the superior performance of support vector machines in sentiment orientation analysis, we have opted to use support vector machines as the machine learning formalism for sentiment orientation analysis. The issue becomes one of ascertaining the best representation of the text for the machine learning algorithm. In our previous research [11, 12], we showed how to represent a document by a nine-dimensional numerical vector (9Dim), which encodes grammatical knowledge and taxonomical semantic information about being about activities (Process), about objects (Product) or about agents (People), but is otherwise lexicon-independent. The structure of the 9Dim vector is as follows: [PC_N PC_V PC_A PD_N PD_V PD_A PP_N PP_V PP_A], where PC = Process, PD = Product, PP = People, N = noun, V = verb, and A = adjective/adverb. 
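Returning to the BoW running example of clauses A and B, the construction can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation; the tokenisation (a simple letters-only split) is an assumption, since the paper does not specify one.

```python
import re

def bow_vector(text, wordlist):
    """Map a clause to a frequency vector over a fixed wordlist."""
    tokens = re.findall(r"[A-Za-z]+", text)
    vector = [0] * len(wordlist)                # initialise as a zero-vector
    for token in tokens:
        if token in wordlist:
            vector[wordlist.index(token)] += 1  # count each appearance
    return vector

wordlist = ["Martin", "It", "is", "a", "good", "great", "designer", "masterpiece"]
print(bow_vector("It is a great masterpiece.", wordlist))  # [0, 1, 1, 1, 0, 1, 0, 1]
print(bow_vector("Martin is a good designer.", wordlist))  # [1, 0, 1, 1, 1, 0, 1, 0]
```

The printed vectors match the representation vectors for clauses A and B given above.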
According to the noun's pertinence ratio with Process, Product or People in a design context [11, 13], a weight is distributed into each of these three categories. This pertinence ratio is a 3Dim numerical vector named K. For each rated sentence, part-of-speech tagging [14] provides the phrase structure trees and typed dependencies used to obtain the grammatical relationships. A noun-based clustering algorithm is then applied. The basic idea is to identify every noun in a sentence and group all verbs and modifiers (adjectives and adverbs) connected to the noun together with it. Each noun is looked up (queried) in the WordNet [15] lexicographer database to ascertain the logical grouping that might indicate the appropriate category (Product, Process, People) for the word. The WordNet lexicographer database and its syntactic categories and logical groupings were used to categorise words (nouns) as being about Product, Process or People. Verbs, adjectives and adverbs are categorised according to the category(ies) of the noun they relate to grammatically. These clusters of syntactically related words are called word groups. For the noun in each word group, rules were applied to identify which of the WordNet logical groupings would contain nouns in the categories [17]. Two correction factors are applied by multiplying them with the count of the frequency of occurrence of a word in the target clause: K1, which is inversely proportional to the number of possible Process-Product-People categories a WordNet logical grouping can belong to; and K, mentioned above. Since the correction factor K for a word may have up to three values, it is normally expressed as a vector of the form K(word) = [K_PC, K_PD, K_PP]. The semantic orientation (SO) of the words in each word group is calculated using the SO-PMI measure, which is in turn based on their pointwise mutual information (PMI) [14]. 
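The distribution of a word group's counts over the nine entries can be sketched as follows. This is a hypothetical illustration only: the function name, the K values and the POS counts are invented, and the full method also applies the K1 correction factor and the SO-PMI weighting described in the text.

```python
# 9Dim layout: [PC_N PC_V PC_A PD_N PD_V PD_A PP_N PP_V PP_A]
POS_OFFSET = {"N": 0, "V": 1, "A": 2}  # noun, verb, adjective/adverb

def add_word_group(vec9, k_word, pos_counts):
    """Distribute a word group's POS counts over the Process/Product/People
    slots, weighted by the noun's pertinence ratio K = [K_PC, K_PD, K_PP]."""
    for pos, count in pos_counts.items():
        for cat in range(3):               # 0 = Process, 1 = Product, 2 = People
            vec9[cat * 3 + POS_OFFSET[pos]] += k_word[cat] * count
    return vec9

vec = [0.0] * 9
# invented word group around one noun: 1 noun, 1 verb, 2 modifiers,
# with an invented pertinence ratio leaning towards People
vec = add_word_group(vec, k_word=[0.1, 0.2, 0.7], pos_counts={"N": 1, "V": 1, "A": 2})
print(vec)
```

Each word group thus contributes to all three category blocks in proportion to its noun's pertinence ratio, rather than being assigned to a single category.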
The strategy for calculating the SO-PMI is to calculate the log-odds (1) of a canonical basket of positive (Pwords) or negative (Nwords) words appearing with the target word, on the assumption that if a canonical good or bad word appears frequently with the target word then the target word has a similar semantic orientation. The log odds that two words co-occur:

PMI-IR(word1, word2) = log[ p(word1 & word2) / (p(word1) p(word2)) ]   (1)

In this study, we used a Google query with the NEAR operator to look up the co-occurrence of the target word with the canonical basket of positive and negative words. The SO-PMI based on the NEAR operator is described by (2). The semantic orientation of a word based on mutual co-occurrence with a canonical basket of positive and negative words:

SO-PMI(word) = log[ ( Π_{pword ∈ Pwords} hits(word NEAR pword) · Π_{nword ∈ Nwords} hits(nword) ) / ( Π_{pword ∈ Pwords} hits(pword) · Π_{nword ∈ Nwords} hits(word NEAR nword) ) ]   (2)

We selected a basket of 12 canonical positive and negative words. Adjectives and adverbs were selected based on most frequent occurrence in written and spoken English according to the British National Corpus [11]. Because these lists are published separately, we joined both lists and ordered them by frequency per million words. We selected only those adjectives and adverbs which were judged positive or negative modifiers according to the General Inquirer corpus [http://www.wjh.harvard.edu/~inquirer/]. The basis for the selection of these frequently occurring words as the canonical words is the increased likelihood of finding documents which contain both the canonical word and the word for which the PMI-IR is being calculated. This increases the accuracy of the SO-PMI measurement. Table I lists the canonical Pwords and Nwords and their frequency per million words. The SO-PMI of all unigrams (nouns, verbs, modifiers) in the target lexicon are pre-calculated and saved in a database to speed up the analysis. 
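Equation (2) can be computed directly once hit counts are available. In the sketch below, the web query hits(word NEAR pword) is stubbed out with an invented co-occurrence table (the paper used Google queries), and the word baskets are small subsets of the canonical words, so only the structure of the formula is illustrated.

```python
import math

PWORDS = ["good", "great"]        # subset of canonical positive words
NWORDS = ["bad", "dark"]          # subset of canonical negative words

# invented hit counts standing in for web-query results
TOY_HITS = {
    "good": 100, "great": 80, "bad": 90, "dark": 60,
    ("masterpiece", "good"): 40, ("masterpiece", "great"): 35,
    ("masterpiece", "bad"): 2, ("masterpiece", "dark"): 1,
}

def hits(query):
    """Stand-in for hits(word NEAR cword) web queries; smooths zeros."""
    return TOY_HITS.get(query, 1)

def so_pmi(word):
    """SO-PMI per equation (2): log of the ratio of co-occurrence products."""
    num = math.prod(hits((word, p)) for p in PWORDS) * math.prod(hits(n) for n in NWORDS)
    den = math.prod(hits(p) for p in PWORDS) * math.prod(hits((word, n)) for n in NWORDS)
    return math.log(num / den)

print(so_pmi("masterpiece") > 0)  # True: positively oriented in the toy table
```

A positive SO-PMI indicates the word co-occurs more with the positive basket than the negative one, matching the intuition behind the canonical-word assumption above.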
A rated sentence is processed with both grammatical and semantic analyses. When all word groups in a sentence have been processed, a complete 9-dimensional vector is generated. For a detailed description and implementation of the complete 9-dimensional vector, please refer to one of our previous papers [11].
TABLE I. CANONICAL POSITIVE AND NEGATIVE WORDS

Positive Words    Negative Words
good (176)        bad (64)
well (1119)       difficult (0)
great (635)       dark (104)
important (39)    cold (103)
able (304)        cheap (68)
clear (39)        dangerous (58)

III. RESEARCH GAPS AND QUESTIONS Based on the literature review in previous research, we have identified several important research gaps. The BoW method ignores semantic knowledge about words and grammar. It relies on a very high-dimensional representation that hinges on training a system on a text domain which contains high coverage of the words that are likely to appear in the target corpus. On the other hand, the 9Dim representation embeds grammatical knowledge in a lower-dimension vector, and is a lexicon-independent representation. It abstracts lexical knowledge toward potential sentiment content and does not need a connecting lexicon between the training corpora and the target corpora. The differences between the BoW representation and 9Dim raise the following research questions. First, does the lower-dimensional representation method cost less memory in implementation? If so, we can obtain a text classifier with lower memory cost and ability comparable to its higher-dimension counterpart. Secondly, 9Dim is a sentence-based representation for text, which means a single 9Dim vector comprised of data from multiple sentences can be treated as a training example. Intuitively, the information contained in a simple sentence is less than that embedded in a complex sentence. Text length is known to affect text classification. Researchers have indicated that the length of a sentence is an important factor in text classification when each vector represents one sentence [16]. If so, is the length of the text (a paragraph or multiple sentences) represented in a single training vector an important factor in text classification as well? IV. 
DATA COLLECTION AND PROCESSING This section presents the data collection process and the processing methods used to produce the tagged data for the training and validation of the computational system. To conduct this research, it was necessary to create labelled design text. In this research, we studied text from creative industries engaged in design. That is, we needed to create a new data set consisting of text about design works, the process of designing, and designers, labelled for semantic orientation and category. Following a popular approach in computational linguistics [3] for creating data sets, a cohort of three native English speakers with a background in a design-related discipline (e.g., engineering, architecture, and computer science) was tasked with reading and categorising various design texts. The texts included formal and informal design text from various on-line sources and across various design-related disciplines. All design texts were collected by the author. Each coder was paid to classify the texts. The raters were trained to identify the proper category and its semantic orientation according to the context. Training lasted for one hour. During coding, two of the three coders had to agree on the semantic meaning (category), the semantic orientation (orientation), and the value of the orientation, that is, positive or negative. Working in two-hour time blocks, the coders read various design texts, including formal design reports, reviews of designed works, reviews of designers, and transcripts of conversations of designers working together. After the rated text data was collected, spell-checked and grammar-checked, the data was saved in a sentence pool for composing training data and testing data. Differences between data sets generated in this way could be statistically significant, but were small enough to be practically unimportant. V. 
ANALYSIS AND EXPERIMENTAL RESULTS This section analyses the space complexity of the BoW and the 9Dim methods, then presents the results of the experiments on the appraisal system with different data sets. To compare the two representations' memory cost, the standard method is to compare the space complexity of both. Space complexity is the limiting behaviour of the memory usage of an algorithm as the size of the problem goes to infinity [17]. As discussed before, the implementation of BoW consists of two steps: 1) compose a vector for each paragraph with the wordlist; and 2) train and validate an SVM classifier with the represented vectors. For the first step, the space complexity depends on the length of the wordlist; if that length is n_feature, the space complexity is O(n_feature). For the second step, the space complexity depends on the implementation of SVM. SVMlight is the implementation adopted in this study; the space complexity of SVMlight is O(n^2) [18], where n is the number of training examples. The total space complexity for the BoW implementation is O_BoW = O(n_feature) + O(n^2). For the 9Dim method, there are three processing steps: 1) part-of-speech tagging; 2) look up K to get the pertinence ratio for each selected word to compose the 9Dim vector; and 3) train and validate an SVM classifier with the represented vectors. For the first step, the space complexity is O(m^2) [14], where m is the number of rated sentences. For the second step, the space complexity depends on the length of K, n_K, so it is O(n_K). For the third step, it is the same as the second step of BoW, O(n^2). The total space complexity for the 9Dim implementation is O_9Dim = O(m^2) + O(n_K) + O(n^2). Because n is the same in
both O_BoW and O_9Dim, O(m^2), O(n_K) and O(n_feature) are the lower-order terms in their formulas and can be ignored. Therefore, O_9Dim = O(n^2) and O_BoW = O(n^2). The 9Dim and the BoW representation methods have the same space complexity. However, in practice, due to the lower feature dimension of the 9Dim representation when compared to the BoW representation, 9Dim has a lower memory cost in implementation. The feature dimension of the 9Dim representation is a constant, 9, whereas for BoW it is the length of the wordlist. We implemented Pang's unigram experiment [3] and applied the same experimental setting to design text semantic orientation classification. Each sentence in the rated sentence pool is represented by a BoW vector over a 1111-word wordlist. The represented rated sentence pool is split into two parts, one for composing training examples and another for validation examples. The size of the training or validation example set is set to 500. Each training or validation example is composed of one or more represented rated sentences chosen randomly from the training or validation sentence pool. The number of represented rated sentences in a training or validation example is adjusted gradually from one to 20. For each number, five iterations of training and validation are run; the accuracy bars of these experiments are shown in Figure 1. For the 9Dim method experiment, the rated sentence pool is represented by the 9Dim representation. Similar experimental conditions were adopted from the experiment based on the BoW representation. The only difference was that the number of represented rated sentences in a training or validation example was adjusted gradually from one to 80 and the training-validation iteration number was set to 50. 
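Composing a training example from several randomly chosen sentence vectors can be sketched as follows. Note an assumption: the paper does not state how multiple sentence vectors are merged into one example vector, so element-wise summation is used here for illustration, and the sentence pool is synthetic.

```python
import random

random.seed(0)  # reproducible illustration

def compose_example(sentence_vectors, length):
    """Randomly pick `length` sentence vectors from a pool and merge them
    element-wise (summation is an assumed merging strategy)."""
    chosen = random.sample(sentence_vectors, length)
    return [sum(col) for col in zip(*chosen)]

# synthetic pool of 9Dim sentence vectors standing in for the rated pool
pool = [[random.random() for _ in range(9)] for _ in range(100)]
example = compose_example(pool, length=5)
print(len(example))  # 9: the example keeps the 9Dim shape regardless of length
```

Varying the `length` argument from one upward mirrors the experimental procedure of gradually increasing the number of represented rated sentences per example.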
The accuracy bars of these 9Dim experiments are shown in Figure 2. Figure 1 and Figure 2 show that the accuracy of sentiment orientation classification improves when the number of represented rated sentences in each example is increased. Figure 2. 9Dim-based design text sentiment orientation analysis VI. CONCLUSIONS In this paper, we compared two text representation methods for design text in two respects: the space complexity of their implementation, and sentiment orientation classification. The results show that it is possible to encode semantic information and grammatical knowledge into a lower-dimension vector to represent text for the purposes of sentiment classification. They also show that a grammatical knowledge-embedding representation method can provide extra information for the classification algorithm to identify sentiment orientation and thereby reduce the space complexity of the implementation. The complexity analysis indicates that the 9Dim representation method is superior to BoW in space complexity in practice, and provides comparable accuracy in classification. The results from this study also point out that text length is an important factor in text classification. The reason may be that longer text contains more information about semantic orientation features. However, if the text is "too" long, it may include sentences of conflicting sentiment. So, while more sentences per vector may be desirable for training, a smaller number of sentences may be better for the classification stage. ACKNOWLEDGMENT This research was supported under the Australian Research Council's Discovery Projects funding scheme (project number DP0557346). The first author would like to thank an early career grant from Guangzhou University. This research was carried out while the first author was studying at the University of Sydney as a PhD student. Figure 1. BoW-based design text sentiment orientation analysis. REFERENCES [1] P.D. 
Turney, Thumbs up or thumbs down?: semantic orientation applied to unsupervised classification of reviews, Proc. of the 40th Annual Meeting of the Association for Computational Linguistics (ACL '02), 2002, pp. 417-424. [2] A. Abbasi, H. Chen, and A. Salem, Sentiment analysis in multiple languages: Feature selection for opinion classification
in Web forums, ACM Trans. Inf. Syst. 26(3) (2008), pp. 1-34. [3] B. Pang, L. Lee, and S. Vaithyanathan, Thumbs up? Sentiment classification using machine learning techniques, Proc. of the Conference on Empirical Methods in Natural Language Processing (EMNLP '02), 2002, pp. 79-86. [4] W. Lam, M.E. Ruiz, and P. Srinivasan, Automatic text categorization and its application to text retrieval, IEEE Transactions on Knowledge and Data Engineering, 11(6), 1999, pp. 865-879. [5] H. Chen, Intelligence and Security Informatics for International Security: Information Sharing and Data Mining. New York: Springer-Verlag Inc., 2006. [6] T.K. Landauer, P.W. Foltz, and D. Laham, An introduction to latent semantic analysis, Discourse Processes, 25(2), 1998, pp. 259-284. [7] D.M. Blei, A.Y. Ng, and M.I. Jordan, Latent Dirichlet allocation, Journal of Machine Learning Research 3 (2003), pp. 993-1022. [8] G. Wang, Y. Zhang, and L. Fei-Fei, Using dependent regions for object categorization in a generative framework, Proc. of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, IEEE Computer Society, 2006, pp. 1597-1604. [9] J.R. Martin and P.R.R. White, The Language of Evaluation: Appraisal in English, New York: Palgrave Macmillan, 2005. [10] C. Whitelaw, N. Garg, and S. Argamon, Using appraisal groups for sentiment analysis, Proc. of the 14th ACM International Conference on Information and Knowledge Management, New York: ACM, 2005, pp. 625-631. [11] J. Wang and A. Dong, A case study of computing appraisals in design text, in J.S. Gero (Ed.), Design Computing and Cognition '08 (DCC'08), Springer Netherlands, 2008, pp. 573-592. [12] J. Wang and A. Dong, How am I doing: computing the language of appraisal in design, Proc. of the 16th International Conference on Engineering Design (ICED'07), 2007, pp. ICED'07/14. [13] A. Dong, The Language of Design: Theory and Computation. London: Springer, 2009. 
[14] M. de Marneffe, B. MacCartney, and C.D. Manning, Generating typed dependency parses from phrase structure parses, Proc. of the IEEE/ACL 2006 Workshop on Spoken Language Technology, 2006. [15] C. Fellbaum, WordNet: An Electronic Lexical Database, Cambridge: MIT Press, 1998. [16] E. Kelih, P. Grzybek, G. Antić, and E. Stadlober, Quantitative text typology: the impact of sentence length, Proc. of the 29th Annual Conference of the Gesellschaft für Klassifikation e.V., Berlin Heidelberg: Springer, 2006, pp. 382-389. [17] U.S. National Institute of Standards and Technology, Algorithms and Theory of Computation Handbook, CRC Press LLC, 1999. [18] T. Joachims, Making large-scale SVM learning practical, in Advances in Kernel Methods - Support Vector Learning, MIT Press, 1999, pp. 169-184.