A Hybrid Approach for Automated Document-level Sentiment Classification (Proposal)


Presented by: Sara A. Morsy
Supervisor: Dr. Ahmed Rafea

2 Overview
- Introduction & Background
- Approaches
- Literature Review
- Problem Statement & Motivation
- Proposed Approach
- Experimentation & Evaluation

4 What is Sentiment Classification?
- Also known as opinion mining, sentiment extraction/analysis, or review mining.
- It is the area of research that attempts to identify the opinion or sentiment that a person holds towards an object.
- It is a broad area spanning computational linguistics, natural language processing, text mining and machine learning.

5 Sentiment Classification Tasks
- At the document level: classify a whole document as positive, negative or neutral.
- At the sentence level: classify a sentence as subjective or objective, then identify its sentiment as positive, negative or neutral.
- At the feature level: classify the opinions on specific features in a single review as positive, negative or neutral.

6 Applications
What other people think is an important piece of information for most of us:
- Individual interest: people are interested in others' opinions when purchasing a product or service, or when looking for opinions on political topics.
- Market intelligence: companies are interested in categorizing positive and negative reviews of their products.
- Political interest: government intelligence systems seek political opinions expressed online.
- Opinion search: not supported by current search engines.
- Recommendation and summarization systems.

7 Resources and Tools for Sentiment Analysis
- Lexicons
- Annotated Corpora
- Tools

8 What is a Sentiment Lexicon?
- Find relevant words, phrases, and patterns that can be used to express subjectivity:
  - Words: adjectives, verbs, adverbs and nouns
  - Phrases containing adjectives and adverbs
  - Lexico-syntactic patterns, e.g. "expense of <np>": "at the expense of the world's security and stability"
- Determine the polarity of subjective expressions

9 Annotated Corpora
An annotated corpus is needed to:
- Understand the problem.
- Create training data and gold standards.
Annotation is done manually: annotators analyze the corpus documents and individual sentences and label each with its sentiment (positive, negative or neutral).

10 Tools
- An online dictionary to search for synonyms and antonyms, e.g. WordNet
- Machine learning classifiers that use text-classification algorithms:
  - Support Vector Machines (SVM)
  - Naïve Bayes (NB)
- Part-Of-Speech (POS) tagger
- Stemmer/Lemmatizer

11 Challenges
Subjectivity detection and polarity classification cannot be done with just a set of subjective keywords.
Context-sensitive:
- "This camera is great." (+ve)
- "A great amount of money was spent for promoting this camera." (neutral)
- "If you think this is a great camera, well think again, because" (-ve)
- "This film should be brilliant. It sounds like a great plot, the actors are first grade, and the supporting cast is good as well, and Stallone is attempting to deliver a good performance. However, it can't hold up." (-ve)

12 Challenges (cont'd)
Domain-dependent:
- "Go read the book" can indicate +ve sentiment in book reviews but -ve sentiment in movie reviews.
- "Unpredictable": +ve for movie reviews, -ve for a car's steering.
Feature-dependent, e.g. "long":
- "This camera has a long battery life" (+ve), vs
- "The lens of this camera takes a long time to focus" (-ve).
Topic-sentiment interaction: "Walmart reports that the profits rose"
- would be a +ve sentiment if the document is talking about Walmart.
- would be a -ve sentiment if the document is talking about Target.

14 Document-level Sentiment Classification
- Machine Learning
- Semantic Orientation
  - Manually
  - Corpus-based
  - Dictionary-based
- Hybrid Approaches

15 1- Machine Learning (ML) Approach
A classifier is trained using annotated corpora.
Features used:
- Syntactic features: e.g. POS tags, n-grams, punctuation
- Stylistic features:
  - Lexical features: e.g. character- or word-based statistical measures of word variation
  - Structural features: e.g. number of paragraphs
It uses text-classification algorithms, such as:
- Support Vector Machines (best performance)
- Naïve Bayes
- AdaBoost
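
The feature families listed above can be illustrated with a minimal sketch. The function name `extract_features` and the particular measures (type-token ratio, average word length) are assumptions chosen for illustration, not features named in the slides; POS tags are left out because they would require an external tagger.

```python
import re
from collections import Counter

def extract_features(text, n=2):
    """Return a feature dictionary mixing syntactic and stylistic cues (illustrative only)."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    features = Counter()
    # Syntactic: word n-grams, from unigrams up to n-grams.
    for size in range(1, n + 1):
        for i in range(len(tokens) - size + 1):
            features["ngram=" + " ".join(tokens[i:i + size])] += 1
    # Syntactic: punctuation counts.
    for mark in "!?":
        features["punct=" + mark] = text.count(mark)
    # Stylistic (lexical): word-based measures of word variation.
    if tokens:
        features["type_token_ratio"] = len(set(tokens)) / len(tokens)
        features["avg_word_len"] = sum(map(len, tokens)) / len(tokens)
    # Stylistic (structural): number of paragraphs.
    features["num_paragraphs"] = len(paragraphs)
    return dict(features)

feats = extract_features("This camera is great!\n\nGreat lens, great price.")
```

The resulting dictionary could then be vectorized and passed to any of the classifiers named on the slide.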

16 2- Semantic Orientation (SO) Approach
A sentiment lexicon is built by assigning polarity to sentiment-bearing words and phrases.
- Manually:
  - Labor-intensive task.
  - Done in combination with the other techniques.
- Corpus-based:
  - It finds co-occurrence patterns of words to determine their sentiment polarity.
  - It requires a large corpus.
  - It can find words with domain-specific orientation.
- Dictionary-based:
  - It uses a small seed list in a bootstrapping process that searches a dictionary for synonyms and antonyms to determine their sentiment polarity.
  - It can find a lot of words.
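
The dictionary-based bootstrapping idea can be sketched as a breadth-first expansion from a seed list. The toy `SYNONYMS` and `ANTONYMS` tables below are hypothetical stand-ins for a real dictionary such as WordNet, and the function name is an assumption made for this example.

```python
from collections import deque

# Hypothetical toy dictionary; a real system would query a resource such as WordNet.
SYNONYMS = {
    "good": ["great", "fine"],
    "great": ["excellent"],
    "bad": ["poor", "awful"],
    "poor": ["inferior"],
}
ANTONYMS = {
    "good": ["bad"],
    "excellent": ["terrible"],
}

def bootstrap_lexicon(seeds, synonyms, antonyms):
    """Grow a {word: polarity} lexicon from seed words (+1 positive, -1 negative)."""
    lexicon = dict(seeds)
    queue = deque(seeds)
    while queue:
        word = queue.popleft()
        polarity = lexicon[word]
        # Synonyms inherit the polarity of the word that led to them.
        for syn in synonyms.get(word, []):
            if syn not in lexicon:
                lexicon[syn] = polarity
                queue.append(syn)
        # Antonyms receive the opposite polarity.
        for ant in antonyms.get(word, []):
            if ant not in lexicon:
                lexicon[ant] = -polarity
                queue.append(ant)
    return lexicon

lexicon = bootstrap_lexicon({"good": 1}, SYNONYMS, ANTONYMS)
```

Starting from the single seed "good", the expansion reaches nine words here, which reflects why this route can find a lot of words but still needs a manual check of the assigned polarities.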

17 Pros and Cons of Approaches
ML Approach
- Advantages:
  - Performs better on a single domain
  - Contextual and domain-specific polarity
- Disadvantages:
  - Requires a large annotated corpus
  - Does not take linguistic context into account, e.g. negation & intensification
  - Negative classification bias
SO Approach
- Advantages:
  - Performs better across different domains
  - Does not require labeled data
- Disadvantages:
  - Prior polarity (no domain-specific polarity)
  - Needs a manual check for the words and their corresponding polarity, and inter-annotator agreement
  - Positive classification bias

18 Classification Bias of ML and SO Approaches
- Negative feelings are usually expressed using both positive and negative words, e.g. "This car is not good."
- Positive feelings are usually expressed using positive words only.
- The SO approach has a positive classification bias: the polarities of words are known in advance, and positive words sometimes predominate even in negative documents.
- The ML approach has a negative classification bias: the polarities of words are learnt automatically, so it is easier for the classifier to learn negative expressions.
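
The positive bias of a lexicon with fixed prior polarities can be seen in a minimal sketch. The word lists and the function `naive_polarity` are hypothetical; the scorer deliberately ignores negation to show the failure mode.

```python
# Hypothetical prior-polarity word lists.
POSITIVE = {"good", "great", "nice"}
NEGATIVE = {"bad", "poor", "awful"}

def naive_polarity(text):
    """Count positive and negative words, ignoring negation and context."""
    tokens = text.lower().replace(".", " ").replace(",", " ").split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

# The negated sentence from the slide is misread as positive,
# because "good" is the only lexicon word it contains.
print(naive_polarity("This car is not good."))
```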

19 3) Hybrid Approaches
Many researchers have tried combining the ML and SO approaches to benefit from both. Examples include:
- Adding sentiment-bearing words with their SO as features for the classifier.
- Classifying documents with a sentiment lexicon and using the high-confidence classified set as training data for a classifier.

21 1) Sentiment Analysis in Multiple Languages
- The authors used both syntactic and stylistic features with an SVM classifier on movie reviews as well as English and Arabic web forums of extremist/hate groups.
- They developed a feature selection algorithm called the Entropy Weighted Genetic Algorithm.
- Using 10-fold cross-validation, they achieved the highest accuracy so far: 91.7% for movie reviews, and 92.8% and 93.6% for English and Arabic web forums, respectively.
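
For readers unfamiliar with 10-fold cross-validation, the splitting step can be sketched as follows. The helper `k_fold_splits` is an assumption made for illustration; it assigns items to folds round-robin and does not shuffle or stratify, which a real experiment would normally do.

```python
def k_fold_splits(n_items, k=10):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_items))
    # Round-robin assignment: item i goes to fold i % k.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_splits(20, k=10))
```

Each item is used for testing exactly once, and the reported accuracy is the average over the k held-out folds.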

22 2) Lexicon-based Methods for Sentiment Analysis
The authors developed a sentiment lexicon manually, with separate dictionaries for:
- Adjectives
- Nouns
- Verbs
- Adverbs
- Intensifiers
They also developed a list of:
- Negators
- Irrealis markers (modals, conditional markers (e.g. "if"), negative polarity items (e.g. "any" and "anything"), questions, and words enclosed in quotes)
In addition, they implemented:
- Text-level features (e.g. frequency of unigrams)
- Weighting techniques
- Multiple cut-offs
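
How negators, intensifiers and irrealis markers might interact with a lexicon can be sketched as below. This is a simplified illustration, not the authors' implementation: the word lists and weights are hypothetical, negation is modeled as a plain sign flip, and any sentence containing an irrealis marker is simply ignored.

```python
import re

# Hypothetical lexicon entries and weights, chosen only for illustration.
LEXICON = {"good": 3, "great": 4, "bad": -3, "terrible": -5}
INTENSIFIERS = {"very": 1.5, "slightly": 0.5}
NEGATORS = {"not", "never", "no"}
IRREALIS = {"if", "should", "would", "could"}

def score_sentence(sentence):
    """Sum word polarities, applying a preceding intensifier and negator."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    # Assumption: a sentence with an irrealis marker contributes nothing.
    if any(t in IRREALIS for t in tokens):
        return 0.0
    total = 0.0
    for i, tok in enumerate(tokens):
        if tok not in LEXICON:
            continue
        value = float(LEXICON[tok])
        j = i - 1
        if j >= 0 and tokens[j] in INTENSIFIERS:
            value *= INTENSIFIERS[tokens[j]]  # scale by the intensifier weight
            j -= 1
        if j >= 0 and tokens[j] in NEGATORS:
            value = -value  # simplified negation: flip the sign
        total += value
    return total

def score_document(text):
    """Add up sentence scores; the sign of the total gives the document polarity."""
    return sum(score_sentence(s) for s in re.split(r"[.!?]+", text))
```

With these toy values, "This car is not very good" scores -4.5, and "This film should be brilliant" scores 0.0 because of the modal.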

23 2) Lexicon-based Methods for Sentiment Analysis (cont'd)
- They achieved high accuracy across different review domains, with an overall accuracy of 78.74%.
- They achieved accuracy between 62.17% and 88.98% across different domains, e.g. MPQA, news, blogs and headlines.
- They also outperformed other sentiment lexicons, such as:
  - The Maryland dictionary
  - The General Inquirer
  - The Subjectivity Dictionary
  - SentiWordNet

24 3) A Lexicon-Enhanced Method for Sentiment Classification
- The authors used a hybrid approach, adding sentiment-bearing words from SentiWordNet 3.0 as features to an SVM classifier alongside syntactic and stylistic features.
- They applied the Information Gain heuristic as a feature selection method.
- Using 10-fold cross-validation, they achieved between 78.85% and 84.15% accuracy across different review domains.
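
Information Gain as a feature selection heuristic can be sketched for binary term presence. The helper names and the four toy documents are assumptions for illustration; the slide does not give the exact formulation the authors used.

```python
import math

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    result = 0.0
    for label in set(labels):
        p = labels.count(label) / total
        result -= p * math.log2(p)
    return result

def information_gain(documents, labels, term):
    """Reduction in label entropy from knowing whether `term` occurs in a document."""
    with_term = [y for doc, y in zip(documents, labels) if term in doc]
    without_term = [y for doc, y in zip(documents, labels) if term not in doc]
    gain = entropy(labels)
    for subset in (with_term, without_term):
        if subset:
            gain -= len(subset) / len(labels) * entropy(subset)
    return gain

# Toy data: each document is represented as a set of its terms.
docs = [{"great", "camera"}, {"great", "lens"}, {"poor", "camera"}, {"poor", "lens"}]
labels = ["pos", "pos", "neg", "neg"]
ranked = sorted({"great", "poor", "camera", "lens"},
                key=lambda t: information_gain(docs, labels, t), reverse=True)
```

Terms such as "great" and "poor" separate the classes perfectly here (gain of 1 bit), while "camera" and "lens" carry no information about the label and would be dropped by a cut-off.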

25 4) SELC: A Self-Supervised Model for Sentiment Classification
The authors developed a 2-phase model:
- The 1st phase is lexicon-based: they used a sentiment lexicon and a negation word list to classify a set of documents.
- The 2nd phase is corpus-based: they used the high-confidence classified set as training data for an SVM classifier and classified the uncertain set.
- Results from both phases were then integrated to remove the classification bias of both approaches.
They achieved an overall F1-score of 89.35% across different Chinese review domains.
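
The two-phase idea can be sketched in a self-contained way. This is not the SELC implementation: the lexicon, threshold and function names are hypothetical, a small Naïve Bayes classifier stands in for the SVM so the example needs no external libraries, and the integration of results from both phases is reduced to keeping the phase-1 label for confident documents.

```python
import math
from collections import Counter

LEXICON = {"great": 1, "excellent": 1, "poor": -1, "awful": -1}  # hypothetical
NEGATORS = {"not", "never"}

def lexicon_score(tokens):
    """Phase 1: sum word polarities, flipping a word preceded by a negator."""
    score = 0
    for i, tok in enumerate(tokens):
        polarity = LEXICON.get(tok, 0)
        if i > 0 and tokens[i - 1] in NEGATORS:
            polarity = -polarity
        score += polarity
    return score

def train_naive_bayes(labeled):
    """Phase 2 training: collect per-class word counts (stand-in for the SVM)."""
    counts = {1: Counter(), -1: Counter()}
    doc_counts = Counter()
    for tokens, label in labeled:
        counts[label].update(tokens)
        doc_counts[label] += 1
    vocab = set(counts[1]) | set(counts[-1])
    return counts, doc_counts, vocab

def predict(model, tokens):
    """Return the class with the higher add-one-smoothed log probability."""
    counts, doc_counts, vocab = model
    best_label, best_logp = None, -math.inf
    for label in (1, -1):
        total = sum(counts[label].values())
        logp = math.log((doc_counts[label] + 1) / (sum(doc_counts.values()) + 2))
        for tok in tokens:
            if tok in vocab:
                logp += math.log((counts[label][tok] + 1) / (total + len(vocab)))
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label

def two_phase_classify(documents, threshold=2):
    tokenized = [doc.lower().split() for doc in documents]
    scores = [lexicon_score(t) for t in tokenized]
    # High-confidence lexicon decisions become training data for phase 2.
    confident = [(t, 1 if s > 0 else -1)
                 for t, s in zip(tokenized, scores) if abs(s) >= threshold]
    model = train_naive_bayes(confident)
    results = []
    for tokens, score in zip(tokenized, scores):
        if abs(score) >= threshold:
            results.append(1 if score > 0 else -1)
        else:
            results.append(predict(model, tokens))  # uncertain set
    return results

reviews = [
    "great battery and excellent lens",
    "awful battery life and poor flash",
    "excellent screen great zoom",
    "poor screen awful zoom",
    "sharp lens",
    "weak flash",
]
labels = two_phase_classify(reviews)
```

The last two reviews contain no lexicon words, so they are labeled by the classifier trained on the first four; it picks up "lens" and "flash" as domain-specific cues that the lexicon alone does not cover.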

26 Flow Chart of the SELC Model

27 Problem Statement & Motivation
- There is currently no automated, domain-independent sentiment classification tool with high accuracy that does not need a manually annotated corpus.
- Such a tool is needed for opinion search, recommendation, summarization and mining of the growing body of opinionated web content.

28 Proposed Approach
- Use an efficient sentiment lexicon (verbs, adverbs, adjectives, nouns, and intensifiers), a negation word list and irrealis markers to classify the documents, and use the high-confidence ones as training data for an SVM classifier.
- Build an SVM classifier with selected syntactic, stylistic and sentiment features.
- Integrate the results of both the sentiment lexicon and the classifier.
- Compare these results with the baseline results.

29 Experimentation & Evaluation
Datasets that will be used:
- The Polarity Dataset (2,000 movie review texts, 1,000 positive and 1,000 negative) provided by Pang and Lee 2004.
- 400 review texts on books, cars, computers, cookware, hotels, movies, music, and phones (25 positive and 25 negative reviews in each domain), obtained from Epinions.com (Taboada et al. 2011).
- A dataset of blogs collected about the Jan 25 Revolution.
Different sentiment, syntactic and stylistic features will be compared under different feature selection algorithms, and the most efficient ones will be applied to the classifier.

30 Experimentation & Evaluation (cont'd)
Tools used:
- Weka Software (SVM classifier)
- Stanford POS tagger
- MorphAdorner English Stemmer/Lemmatizer
- A sentiment lexicon
Evaluation metrics that will be used:
- Precision
- Recall
- Accuracy
- F1-score
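
The four evaluation metrics follow from the counts of true and false positives and negatives. The sketch below uses their standard definitions for a binary task; the function name `evaluate` and the toy labels are assumptions for illustration.

```python
def evaluate(gold, predicted, positive_label="positive"):
    """Compute precision, recall, accuracy and F1 with respect to one class."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive_label and p == positive_label for g, p in pairs)
    fp = sum(g != positive_label and p == positive_label for g, p in pairs)
    fn = sum(g == positive_label and p != positive_label for g, p in pairs)
    tn = sum(g != positive_label and p != positive_label for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "accuracy": accuracy, "f1": f1}

gold = ["positive", "positive", "negative", "negative", "positive"]
pred = ["positive", "negative", "negative", "positive", "positive"]
metrics = evaluate(gold, pred)
```

Reporting precision and recall per class alongside accuracy helps expose the classification biases discussed earlier, since a classifier biased towards one class can still reach a reasonable overall accuracy.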

31 References
- A. Abbasi, H. Chen, and A. Salem, Sentiment Analysis in Multiple Languages: Feature Selection for Opinion Classification in Web Forums, ACM Trans. Information Systems, 2008, vol. 26, no. 3, pp. 1-34.
- A. Abbasi and H. Chen, Applying Authorship Analysis to Extremist-Group Web Forum Messages, IEEE Intelligent Systems, 2005, vol. 20, no. 5, pp. 67-75.
- B. Liu, Sentiment Analysis and Subjectivity, Handbook of Natural Language Processing, Second Edition (editors: N. Indurkhya and F. J. Damerau), 2010.
- B. Pang and L. Lee, A Sentimental Education: Sentiment Analysis using Subjectivity Summarization based on Minimum Cuts, Proceedings of the 42nd Meeting of the Association for Computational Linguistics, 2004, pp. 271-278, Barcelona, Spain.

32 References (cont.)
- B. Pang and L. Lee, Opinion Mining and Sentiment Analysis, Foundations and Trends in Information Retrieval, 2008, vol. 2, nos. 1-2, pp. 1-135, ebook from http://www.cs.cornell.edu/home/llee/omsa/omsa.pdf
- L. Qiu, W. Zhang, C. Hu, and K. Zhao, SELC: A Self-Supervised Model for Sentiment Classification, Proceedings of the 18th ACM Conference on Information and Knowledge Management, November 02-06, 2009, Hong Kong, China.
- M. Taboada, J. Brooke, M. Tofiloski, K. Voll, and M. Stede, Lexicon-based Methods for Sentiment Analysis, Association for Computational Linguistics, 2011, vol. 1, no. 1, pp. 1-42.
- Y. Dang, Y. Zhang, and H. Chen, A Lexicon-Enhanced Method for Sentiment Classification: An Experiment on Online Product Reviews, IEEE Intelligent Systems, 2010, vol. 25, no. 4, pp. 46-53.
