CS 224D Final Project: Neural Network Ensembles for Sentiment Classification

Tri Dao
Department of Computer Science
Stanford University
trid@stanford.edu

Abstract

We investigate the effect of ensembling on two simple models: an LSTM and a bidirectional LSTM. These models are used for fine-grained sentiment classification on the Stanford Sentiment Treebank dataset. We observe that ensembling improves the classification accuracy by about 3% over single models. Moreover, the more complex model, the bidirectional LSTM, benefits more from ensembling.

1 Introduction

In language modeling and machine translation, predictions are often made by averaging around 5 models that have been trained independently. Ensembling can reduce variance (in the case of bagging) or both bias and variance (in the case of boosting). These techniques have been very popular in the context of decision trees. We hope that ensembles of a larger number of models (around 50) can improve the performance of neural network models on tasks with smaller datasets. In particular, we investigate the effectiveness of ensembling a large number of simple long short-term memory (LSTM) models for sentiment classification.

2 Problem statement

For the task of sentence sentiment analysis, we use the Stanford Sentiment Treebank dataset [5], which contains 11,855 single sentences extracted from movie reviews. The sentences in the treebank are split into train (8,544), dev (1,101), and test (2,210) sets. Given a sentence from a movie review, we aim to predict the sentiment of the review. In the binary classification task, we predict whether the review is positive or negative. In this project, however, we focus on the fine-grained classification task: we predict whether the review is very negative, negative, neutral, positive, or very positive. We evaluate the models by prediction accuracy on the test set. We compare the accuracy of the ensemble (bagging) with that of a single model to assess the effectiveness of these techniques.

3 Related Work

Socher et al. [5] introduced the Recursive Neural Tensor Network for the task of sentence sentiment analysis. The model pushed the state of the art in binary sentiment classification from 80% to 85.4%. They also introduced and analyzed fine-grained sentiment classification (5 classes), achieving 45.7% accuracy. Making use of sentence parses and of the classification of each phrase improves the accuracy of the overall classification task.

Tai et al. [6] developed the tree-structured LSTM for the same task. This is a generalization of LSTMs to tree-structured network topologies, and in particular to sentence parses. This again improves the prediction accuracy of both the binary and the fine-grained sentiment classification, to 88.0% and 51.0% respectively.

Both of these works use the sentiment labels of the phrases in the sentences as training data. There are 215,154 unique phrases in the dataset, compared to just 11,855 sentences. Hence the actual number of training examples is much larger than the number of sentences. In this project, we do not make use of the phrases because we want to investigate the effectiveness of ensembling in the context of a small dataset. That is, our dataset consists of only 11,855 sentences, 8,544 of which are used as training data.

4 Approaches and models

4.1 LSTM

Traditional recurrent neural networks often run into the problem of vanishing gradients, since the gradient is multiplied by the weight matrix a large number of times. For long sequences (such as long sentences), this makes the network unable to learn long-term dependencies. This motivates the LSTM model introduced by Hochreiter and Schmidhuber [2], which contains a new structure called a memory cell (see Figure 1 below). A memory cell consists of an input gate, a neuron with a connection to itself, a forget gate, and an output gate. The memory cells can keep information intact unless the inputs make them forget it or overwrite it with new input.

Figure 1: The repeating module in an LSTM contains four interacting layers, from [3].

4.2 Bidirectional LSTM

Graves et al. [1] introduced the bidirectional LSTM model. Figure 2 shows the structure of the bidirectional LSTM: there is one forward LSTM and one backward LSTM running in reverse time. Their features are concatenated at the output layer, enabling information from both the past and the future to come together and make a more accurate prediction.

4.3 LSTM model for sentiment classification

We use a simple LSTM model as described in Figure 1. For each word in the input sentence, we look up its corresponding word vector, pre-trained with GloVe [4]. We then pass these vectors into a layer of LSTM cells and apply a softmax layer to the output of the last LSTM cell. This outputs a probability for each class, which can be used to predict the sentiment of the sentence. In the case of the bidirectional LSTM, we follow the same procedure, but we concatenate the outputs of the forward and the backward LSTM layers before applying the softmax.
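To make the architecture of Section 4.3 concrete, the following is a minimal sketch in PyTorch. It is not the project's actual implementation (which builds on a separate LSTM codebase); the hyperparameters and the pre-trained tensor glove_vectors are illustrative assumptions.

# Minimal sketch of the (bi)LSTM sentiment classifier of Section 4.3.
# Assumptions: `glove_vectors` is a (vocab_size, embed_dim) tensor of pre-trained
# GloVe embeddings; hyperparameters are illustrative, not the project's settings.
import torch
import torch.nn as nn

class LSTMSentimentClassifier(nn.Module):
    def __init__(self, glove_vectors, hidden_dim=128, num_classes=5, bidirectional=False):
        super().__init__()
        # Word vectors are looked up from the frozen GloVe embedding table.
        self.embedding = nn.Embedding.from_pretrained(glove_vectors, freeze=True)
        self.lstm = nn.LSTM(glove_vectors.size(1), hidden_dim,
                            batch_first=True, bidirectional=bidirectional)
        out_dim = hidden_dim * (2 if bidirectional else 1)
        self.classifier = nn.Linear(out_dim, num_classes)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) indices into the GloVe vocabulary.
        embedded = self.embedding(token_ids)
        _, (h_n, _) = self.lstm(embedded)
        if self.lstm.bidirectional:
            # Concatenate the final forward and backward hidden states.
            final = torch.cat([h_n[0], h_n[1]], dim=1)
        else:
            final = h_n[0]
        # Softmax over the five sentiment classes (log probabilities).
        return torch.log_softmax(self.classifier(final), dim=1)

In practice, sentences of different lengths would also need padding and packing (e.g. with nn.utils.rnn.pack_padded_sequence) before the LSTM; that detail is omitted from the sketch.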

Figure 2: A bidirectional LSTM contains one forward and one backward LSTM, from [7].

4.4 Bagging

We train multiple neural network models (around 50) on bootstrap samples from the Stanford Sentiment Treebank dataset. That is, for each single model, we obtain a sample with replacement of size n from the training dataset, where n is the number of training examples. This means that some training examples may appear more than once in the bootstrap training set. To make a prediction, we either take the majority vote of all the models or average the probabilities output by each LSTM model to classify a new sentence into one of the five sentiment classes (see the sketch following Section 5.2). This has the effect of reducing the variance of the ensemble model. Therefore we do not apply as much regularization to the individual LSTM models. In particular, we do not use early stopping but allow each individual LSTM model to train to completion.

We train each single model on bootstrap samples from the training dataset instead of the whole dataset to reduce the correlation between the models. Since we take the average of the predictions, the variance of the ensemble depends strongly on the correlation between the models. The less correlated the models are, the larger the variance reduction bagging will bring.

5 Experimental results

5.1 Single LSTM model

We plot the train and validation accuracy of a single LSTM model against the number of training epochs in Figure 3. Note that this model is trained on the original training set, not a bootstrap sample. We see that the training accuracy keeps increasing while the validation accuracy increases and then decreases. This suggests that the model overfits the data. However, since we will combine many such single models in an ensemble to reduce the variance, the overfitting of each model is not a problem. This is similar to bagging many decision trees, where we grow the trees fully and do not prune them (a form of regularization).

5.2 Single bidirectional LSTM model

Next we plot the train and validation accuracy of a single bidirectional LSTM model against the number of training epochs in Figure 4. Again, this model is trained on the original training set, not a bootstrap sample. We see that the performance is not unlike that of the single LSTM model without the backward flow, even though we are using a more powerful model. This suggests that the prediction error comes from the variance rather than the bias. That is, with such a small training set (8,544 examples), the variance dominates the error in both cases, leading to lower accuracy.
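As a concrete illustration of the bagging procedure of Section 4.4, the sketch below shows bootstrap resampling of the training set and the two ways of combining the trained models' predictions (probability averaging and majority vote). It is a schematic sketch, not the project's code; train_model and predict_proba are assumed helpers that train one (bi)LSTM and return per-class probabilities.

# Schematic sketch of the bagging procedure in Section 4.4 (not the project's code).
# Assumed helpers: `train_model(sentences, labels)` trains one (bi)LSTM classifier;
# `model.predict_proba(sentences)` returns an array of shape (num_sentences, 5)
# with per-class probabilities.
import numpy as np

def train_bagged_ensemble(train_sentences, train_labels, num_models=50, seed=0):
    rng = np.random.default_rng(seed)
    n = len(train_sentences)
    models = []
    for _ in range(num_models):
        # Bootstrap sample of size n, drawn with replacement from the training set.
        idx = rng.integers(0, n, size=n)
        boot_sentences = [train_sentences[i] for i in idx]
        boot_labels = [train_labels[i] for i in idx]
        models.append(train_model(boot_sentences, boot_labels))
    return models

def predict_average(models, sentences):
    # Average the per-class probabilities over all models, then take the argmax.
    probs = np.mean([m.predict_proba(sentences) for m in models], axis=0)
    return probs.argmax(axis=1)

def predict_majority_vote(models, sentences):
    # Each model votes for its most likely class; ties go to the lowest class index.
    votes = np.stack([m.predict_proba(sentences).argmax(axis=1) for m in models])
    return np.array([np.bincount(v, minlength=5).argmax() for v in votes.T])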

Figure 3: Train and validation accuracy of a single LSTM model against the number of training epochs.

Figure 4: Train and validation accuracy of a single bidirectional LSTM model against the number of training epochs.

5.3 Ensemble

We plot the test accuracy of the ensemble model against the ensemble size in Figure 5 and Figure 6. We see that in both cases, the ensemble's performance increases by about 3% compared to the performance of a single model. In the case of the ensemble consisting of LSTM models, the benefit of ensembling is only evident for ensemble sizes of up to 10. After that, there is virtually no benefit to a larger ensemble. In the case of the ensemble consisting of bidirectional LSTM models, the performance of the ensemble keeps increasing even as the ensemble size gets to 50. This might be because the bidirectional LSTM model is more powerful, with lower bias and higher variance. Thus ensembling helps reduce this higher variance while maintaining a lower bias. Therefore complex models benefit more from ensembling.

Figure 5: Test accuracy of the ensemble consisting of LSTM models, against the ensemble size.

Figure 6: Test accuracy of the ensemble consisting of bidirectional LSTM models, against the ensemble size.

6 Conclusion

We investigate the effect of ensembling on two simple models: an LSTM and a bidirectional LSTM. We observe that ensembling improves the fine-grained sentiment classification accuracy by about 3%. Moreover, the more complex model, the bidirectional LSTM, benefits more from ensembling. We see that the ensemble's performance keeps increasing even when the ensemble size gets up to 50.

The improved performance on a small dataset (8,544 training sentences) is encouraging. However, we are not able to match the performance of other models ([5], [6]) that are trained on a larger dataset (215,154 phrases). It would be interesting to see the performance of ensemble models trained on the phrases, i.e., on a larger effective training corpus.

In the future, we will also explore other ensemble methods. For example, boosting can be used to sequentially train a larger number of neural networks. Another way to combine multiple models is to concatenate the intermediate representations learned by each model (i.e., the output of the last LSTM cell) and apply a softmax layer on top of that to make the prediction. The weight matrix for this softmax layer would be learned from the data.

Acknowledgments

We thank Kevin Clark for allowing us to build off of his LSTM codebase and for giving valuable advice.

References

[1] Alex Graves, Navdeep Jaitly, and Abdel-rahman Mohamed. Hybrid speech recognition with deep bidirectional LSTM. In Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on, pages 273–278. IEEE, 2013.

[2] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, November 1997. ISSN 0899-7667. doi: 10.1162/neco.1997.9.8.1735. URL http://dx.doi.org/10.1162/neco.1997.9.8.1735.

[3] Christopher Olah. Understanding LSTM networks, 2015. URL http://colah.github.io/posts/2015-08-understanding-lstms/img/lstm3-chain.png.

[4] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In EMNLP, volume 14, pages 1532–1543, 2014.

[5] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, Stroudsburg, PA, October 2013. Association for Computational Linguistics.

[6] Kai Sheng Tai, Richard Socher, and Christopher D. Manning. Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1556–1566, Beijing, China, July 2015. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/p15-1150.

[7] Wu Zhen Zhou. Bidirectional RNN, 2016. URL https://github.com/hycis/bidirectional_rnn/blob/master/item_lstm.png.