Assignment 1: Predicting Amazon Review Ratings

Richard Park
r2park@acsmail.ucsd.edu
February 23, 2015

1 Dataset Analysis

The dataset selected for this assignment comes from the set of Amazon reviews for automotive products available at the Stanford University SNAP website [1]. The dataset is formatted as a JSON document in which each example is a user review for a product. The fields present in each example are described in Table 1.

Table 1: Fields in the dataset

product/productId: a unique identifier assigned to each product.
product/title: the title of the product as it appears to users on Amazon.
product/price: the product's price if it is known; otherwise the entry reads "unknown".
review/userId: a unique identifier for the user who created the review.
review/profileName: the Amazon user name of the review author, or "anonymous" if the author chose not to divulge their user name.
review/helpfulness: a rating assigned to the review by other users reflecting its helpfulness, expressed as the ratio of users who found the review helpful to the total number of users who entered a helpfulness response.
review/score: the rating given by the user for this product, ranging from 1.0 to 5.0.
review/time: the time and date the review was submitted, in Unix time format.
review/summary: a user-written summary of the review.
review/text: the user-written review of the product.

The dataset consists of 188,728 user reviews of automotive products sold by Amazon, written by 133,256 unique users for 47,577 unique products. To facilitate training a predictive model for user ratings, the dataset was partitioned into training, validation, and testing subsets using a 60/20/20 split. An analysis of the training subset revealed a mean rating of 4.14, a variance of 1.7747, and a median of 5.0, which is the maximum possible rating a user can give a product. Together, the mean, variance, and median show an optimistic bias towards higher ratings in this dataset. Figure 1 shows the proportion of each rating within the training data. Because the variance of user ratings in this dataset is so low, I believe that developing a model with significantly better performance than a trivial baseline will be a challenge.

Figure 1: Proportion of each rating in training data

2 Prediction Task

A least squares linear regression model was trained to predict the user rating from a set of features derived from the fields of the user reviews. The predictor takes the following form:

    Xθ = y

where X is the feature matrix, θ is a column vector of learned parameters, and y is a column vector of predicted labels. User ratings are represented as integers between 1 and 5. A predictor based on a linear regression model was selected because the weighted sum of features, including the pseudo-feature, outputs a real number that represents a predicted user rating.
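To make the setup concrete, the following is a minimal sketch in Python of the 60/20/20 split and the least-squares fit. It is not the author's code: the synthetic X and y are placeholders standing in for the features and ratings parsed from the real reviews.

    import numpy as np

    # Placeholders; in the assignment X and y come from the parsed reviews.
    rng = np.random.default_rng(0)
    n, d = 1000, 4
    X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])  # leading pseudo-feature
    y = rng.integers(1, 6, size=n).astype(float)               # ratings 1..5

    # 60/20/20 split into training, validation, and testing subsets.
    idx = rng.permutation(n)
    n_train, n_val = int(0.6 * n), int(0.2 * n)
    train, val, test = np.split(idx, [n_train, n_train + n_val])

    # Training-label statistics (the real dataset gives mean 4.14,
    # variance 1.7747, median 5.0).
    print(y[train].mean(), y[train].var(), np.median(y[train]))

    # Ordinary least squares: theta minimizes ||X @ theta - y||^2.
    theta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
    val_pred = X[val] @ theta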

The real-valued output can be mapped to an actual user rating by rounding to the nearest integer value. Given the metric chosen to evaluate the models, the discrepancy between integer labels and real-valued output does not affect the efficacy of the experimental method or results.

The metric used to evaluate the models is the mean squared error (MSE) between actual and predicted labels:

    MSE = (1/N) Σ_{i=1}^{N} (y_i - X_i θ)^2

The baseline against which we evaluated a model's performance is the variance of the ratings in the training data. This baseline was selected because a trivial predictor that always outputs the mean rating of the training data achieves an MSE equal to the variance. It follows that a model which returns a lower MSE than the variance outperforms the trivial predictor. For each set of features, we trained a standard linear regression model and a linear regression model with regularization. The regularized model was included because our experiments showed overfitting of the training data when a large number of features were used.
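Continuing the sketch above, the MSE metric and the variance baseline might be checked as follows; this is illustrative and reuses the X, y, theta, and index arrays from the earlier sketch.

    import numpy as np

    def mse(y_true, y_pred):
        # Mean squared error, matching the formula above.
        return float(np.mean((y_true - y_pred) ** 2))

    # The trivial predictor always outputs the training mean, so its training
    # MSE equals the training variance (1.7747 on the real dataset); a useful
    # model must score below that number.
    baseline = mse(y[train], np.full(train.shape, y[train].mean()))
    model_err = mse(y[val], X[val] @ theta)
    print(f"baseline={baseline:.4f}  model={model_err:.4f}")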

3 Relevant Literature

The automotive reviews dataset used for this assignment is a subset of a larger dataset comprising Amazon reviews from many different product categories (e.g. books, movies, pet supplies) [1]. The dataset was compiled by McAuley and Leskovec to support their research into product recommendation models that combine latent review topics and latent rating dimensions [2]. The Amazon review corpus has been used in a wide range of research fields due to its accessibility, size, and quality. Business and management researchers have used the corpus to analyze what factors determine the helpfulness of reviews [4] and the effect of online reviews on sales of physical products [5]. Natural language parsing techniques have been applied to extract useful information from textual reviews by identifying the product features being reviewed and summarizing them with sentiment analysis [6]. Current state-of-the-art methods for predicting ratings from review data incorporate latent feature dimensions and textual analysis of the reviews in order to discover the sentiments, topics, and opinions expressed within them [2][3][7].

This research informed the selection of features for the rating prediction model. Given the scope of this project, implementing a model that incorporates abstract information extracted from the review text was not a realistic goal. However, as explained in the following section on feature identification, the review text was still used in an attempt to improve the performance of the models.

4 Feature Identification

The dataset provides 10 raw fields, including the rating of the product. This leaves 9 data fields that can be transformed into one or more features compatible with a linear regression model. Preliminary analysis eliminated some of these fields as useful features: the userId and productId fields were not considered for the model because of the relatively large proportion of unique IDs in the training data. Including those fields in the linear regression model would have vastly increased the dimensionality of the feature set, resulting in severe overfitting and a heavy load on the limited computing resources used to fit the models.

The following features were created from the review data fields using feature functions that map the data into a format usable by the linear regression module (a small sketch of these feature functions follows the list):

1. anonymity - the user name was used to create a binary feature indicating whether the review was submitted anonymously.
2. price - the value was used directly as a feature, unless the price was marked "unknown", in which case the price was set to 0.
3. month - previous lecture material on beer review analysis showed that the month of the year can be a useful feature. The Unix time field was converted to a 12-element vector in which each element corresponds to a month.
4. review length - the number of characters in the user's full review of the product.
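A minimal sketch of these feature functions, with field names taken from Table 1; the helper, the "anonymous" sentinel check, and the example record are illustrative rather than the author's implementation.

    import time

    def review_features(review):
        # Map one parsed review (a dict keyed by the Table 1 field names)
        # to a feature vector with a leading pseudo-feature.
        anon = 1.0 if review["review/profileName"] == "anonymous" else 0.0
        price = review["product/price"]
        price = 0.0 if price == "unknown" else float(price)
        month = time.gmtime(int(review["review/time"])).tm_mon  # 1..12
        month_onehot = [1.0 if m == month else 0.0 for m in range(1, 13)]
        length = float(len(review["review/text"]))
        return [1.0, anon, price, length] + month_onehot

    example = {
        "review/profileName": "anonymous",
        "product/price": "unknown",
        "review/time": "1296432000",
        "review/text": "Fits my 2008 Civic perfectly and was easy to install.",
    }
    print(review_features(example))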

4.1 Bag of Words

The belief that information extracted from the actual user review text would provide the largest performance gains motivated the inclusion of a bag-of-words feature. A bag-of-words model was created by parsing all of the review text in the training data and identifying the N most commonly used words. Each review was then mapped to a binary vector of length N indicating which of those words it contains. The value of N became a parameter during the optimization of the model. The feature function for bag-of-words extracts the feature from an example by parsing the review text and determining which words are also in the set of N most common words.

To eliminate sequences of characters that are very common across all types of ratings, an intermediate dataset of review text was created in which punctuation characters and stop words were removed. This intermediate dataset was used to fit the bag-of-words model and to create the feature for every example in the dataset.

5 Model

As explained in the preceding section on the prediction task, a linear regression model was selected because a rating predicted as a real value can be mapped closely onto the actual ratings of 1 to 5 stars. Initially, a standard linear regression model without the bag-of-words feature was used to explore performance in predicting the labels of the validation set. This approach was adequate given the small number of features in the model, and no overfitting was observed. Optimization decisions were based on the MSE between the training/validation labels and the predicted labels for those respective sets.

Once the bag-of-words feature was incorporated into the model, problems with scalability and overfitting appeared. We parameterized N, the number of words in the bag-of-words model, to guide the optimization process. As N grew, the time to fit a model increased rapidly, as did the degree of overfitting. At this point it became clear that a regularization term would be required to create a model that performed well on the validation set. A new round of models was fit and validated with L2 regularization, with the regularization strength λ as an additional parameter. The optimal values of N and λ were found via grid search, assessing each model's performance against the baseline variance of the training set.

6 Experimental Results

Table 2 shows the training and validation performance, expressed as mean squared error, of predictors based on a single feature. Every feature except the anonymous submission feature performs better on the training set than the baseline predictor, which has an MSE of 1.7747. An interesting result is that the validation error is lower than the training error for all of the features; this runs counter to the expected result, where validation error is equal to or greater than training error. The performance gained over the baseline predictor is quite low. The review length in characters achieved the best performance, but its coefficient of determination, R^2 = 1 - 1.7386/1.7747 = 0.0203, shows that it is only slightly better than the baseline.

Table 2: MSE of models incorporating a single feature

Feature                Training   Validation
Price                  1.7728     1.7445
Month                  1.7713     1.7447
Review Length          1.7658     1.7386
Anonymous Submission   1.7747     1.7478
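Before turning to the combined-feature experiments, the sketch below illustrates the bag-of-words construction and the grid search over N and λ described in Sections 4.1 and 5, using closed-form ridge regression. The tiny text lists are placeholders for the cleaned review texts, and the whole block is a reconstruction under stated assumptions, not the author's implementation.

    import numpy as np
    from collections import Counter

    # Placeholder data; in the assignment these are the punctuation- and
    # stop-word-stripped review texts and their ratings.
    train_texts = ["great filter fits well", "terrible product broke fast",
                   "works great for the price", "waste of money"]
    val_texts = ["great value", "broke after a week"]
    y_tr = np.array([5.0, 1.0, 5.0, 1.0])
    y_va = np.array([4.0, 2.0])

    def top_n_vocab(texts, n):
        # The N most common words across the training reviews.
        counts = Counter(w for t in texts for w in t.split())
        return [w for w, _ in counts.most_common(n)]

    def binary_bow(texts, vocab):
        # Binary indicator vector per review, plus a leading pseudo-feature.
        index = {w: j for j, w in enumerate(vocab)}
        X = np.zeros((len(texts), 1 + len(vocab)))
        X[:, 0] = 1.0
        for i, t in enumerate(texts):
            for w in set(t.split()):
                if w in index:
                    X[i, 1 + index[w]] = 1.0
        return X

    def ridge(X, y, lam):
        # Closed-form L2-regularized least squares; pinv tolerates lam = 0.
        d = X.shape[1]
        return np.linalg.pinv(X.T @ X + lam * np.eye(d)) @ (X.T @ y)

    # Grid search over vocabulary size N and regularization strength lam,
    # scored by validation MSE.
    for n in (2, 4, 8):
        vocab = top_n_vocab(train_texts, n)
        X_tr, X_va = binary_bow(train_texts, vocab), binary_bow(val_texts, vocab)
        for lam in (0.0, 0.1, 1.0, 10.0):
            theta = ridge(X_tr, y_tr, lam)
            print(n, lam, round(float(np.mean((y_va - X_va @ theta) ** 2)), 3))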

Our next set of experiments trained models that combine the single features, as shown in Table 3. The table shows that the anonymous submission feature, the poorest performer in the previous experiment, does not increase the performance of the predictor.

Table 3: MSE of models with multiple features

Features                   Training   Validation
Price, Month               1.7693     1.7412
Review Length              1.7603     1.7318
Review Length, Anonymous   1.7603     1.7318

The bag-of-words models were fit and tested without additional features in order to facilitate a search for the optimal set of parameters. Table 4 shows that regularization has a significant effect in preventing overfitting of the model to the training set, but that different strengths of regularization have no discernible effect. The table also suggests that the optimal N is 4.

Table 4: MSE on the validation set for different combinations of λ and N

λ \ N   2       4       8       16      32
0.0     3.35    3.034   2.596   2.249   1.995
0.1     1.756   1.748   1.753   1.757   1.759
1.0     1.756   1.748   1.753   1.757   1.759
10.0    1.756   1.748   1.753   1.757   1.759

Once the optimal values of λ and N were determined through experiment, three models were created and evaluated against the test set. Table 5 presents the results, which show that the model using only the bag-of-words feature performed best.

Table 5: MSE of different models on the test set (λ = 1.0, N = 4)

Features                      Test
Bag-of-Words                  1.7482
Review Length                 1.7632
Review Length, Bag-of-Words   1.7542

7 Conclusion

The coefficient of determination for the top-performing bag-of-words model is R^2 = 1 - 1.7482/1.7747 = 0.015. This is far less than what I would have liked to achieve. I believe the main challenge in predicting ratings for the automotive dataset is that the variance of the ratings is already low and the median rating is 5.0. This resembles an unbalanced binary classification problem in which one label is overabundant. A multiclass SVM that treats each rating as a separate class would most likely outperform the linear regression model, because it could be tuned to compensate for the class imbalance by assigning different weights to each class. This is something I would pursue given another round of experiments.

Another promising direction is to extract more useful information from the review text with more powerful natural language processing techniques. Surprisingly, the bag-of-words model outperformed the model that incorporated features from the review metadata, yet it is still a very naive way of modeling text, since much of the information about structure and semantics is lost. An approach that incorporates n-grams, sentiment analysis, or topic modeling would be interesting and, judging by the performance of even the naive bag-of-words model, would likely perform better.
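As a pointer for that follow-up experiment, a class-weighted multiclass SVM along the lines suggested above could look like the sketch below, using scikit-learn's LinearSVC with balanced class weights; the synthetic data is a placeholder mimicking the skew toward 5-star ratings described in Section 1.

    import numpy as np
    from sklearn.svm import LinearSVC

    # Placeholder features and ratings skewed toward 5 stars; in practice
    # X and y come from the review features and labels.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 10))
    y = rng.choice([1, 2, 3, 4, 5], size=500, p=[0.05, 0.05, 0.10, 0.20, 0.60])

    # class_weight="balanced" reweights each class inversely to its frequency,
    # compensating for the overabundance of 5-star labels.
    clf = LinearSVC(class_weight="balanced", max_iter=10000)
    clf.fit(X, y)
    print("training accuracy:", (clf.predict(X) == y).mean())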

References

[1] Amazon product review data. http://snap.stanford.edu/data/web-Amazon-links.html, 2015.
[2] J. McAuley and J. Leskovec. Hidden factors and hidden topics: understanding rating dimensions with review text. In RecSys, 2013.
[3] G. Ganu, N. Elhadad, and A. Marian. Beyond the stars: Improving rating predictions using review text content. In WebDB, 2009.
[4] S. M. Mudambi and D. Schuff. What makes a helpful review? A study of customer reviews on Amazon.com. MIS Quarterly, 2010.
[5] J. A. Chevalier and D. Mayzlin. The effect of word of mouth on sales: Online book reviews. Journal of Marketing Research, 2006.
[6] M. Hu and B. Liu. Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2004.
[7] H. Wang, Y. Lu, and C. Zhai. Latent aspect rating analysis on review text data: a rating regression approach. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 783-792, 2010.