Improving Paragraph2Vec

Seokho Hong
seokho@stanford.edu

Abstract

Paragraph vectors were proposed as a powerful unsupervised method of learning representations of arbitrary lengths of text. Although paragraph vectors have the advantage of being versatile, being both unsupervised and unconstrained by the length of the text, the concept has not been developed further since its first publication. We propose two extensions of the initial formulation of the paragraph vector and test their performance on two separate semantics-based tasks. Although the results are limited by the fact that our attempt to reproduce the original paragraph vectors was not successful, we can still show that the extended models outperform our implementation of the original paragraph vectors.

1 Introduction

The Paragraph Vector (Le & Mikolov, 2014) was proposed with the objective of unsupervised semantic learning from arbitrary lengths of text. While the proposed PV-DM and PV-DBOW models perform very well, their formulations leave room for improvement. In particular, the performance differences between PV-DM and PV-DBOW suggest that improvements to PV-DM could be significant. The paper also finds that the concatenation of PV-DM and PV-DBOW vectors achieves the best performance. Even if different training methods do not necessarily produce better vectors, concatenating them with the original paragraph vectors may offer better results than the concatenation of PV-DM and PV-DBOW alone.

The main mathematical limitation of the PV-DM and PV-DBOW models is that they do not allow the paragraph vector to interact in complex, non-linear ways with the word vectors. While this design is not entirely unjustified, since different sections of text do not radically change the distribution and sequence of English words, it nonetheless limits the extent to which a paragraph vector can assist in the word-prediction task used to train it. We propose two different formulations for training unsupervised paragraph vectors, which we will call the hidden layer model and the tensor model. While both models offer improvements on the original paragraph vector models, we cannot draw definitive conclusions because we were unable to replicate the level of performance reported in (Le & Mikolov, 2014).

2 Background

Recent works have proposed various deep methods for learning the semantics of sentences. Recursive neural networks with various substructures (Socher et al., 2013; Tai et al., 2015) and convolutional neural networks (Kalchbrenner et al., 2014) offer excellent performance that approximately matches or exceeds that of paragraph vectors across different tasks. These models, however, tend to be far more complex and deeper than the paragraph vector method and therefore take significantly longer to train. They also have other limitations that could restrict their range of applications. Recursive models work only for sentences and need a parser, which is not only an extra requirement but also a source of errors, given that parsers are imperfect.

Convolutional neural networks as suggested in (Kalchbrenner et al., 2014) do not need a parser, but remain untested on longer pieces of text. The max-pooling layer that paper proposes appears likely to become less effective as the text gets much longer.

The paragraph vector has the advantage of simplicity and versatility; perhaps it is a bit too simple. The models suggested here were inspired by the recent gains from increasingly complex models. It does not seem unreasonable to train modestly more complex models along the lines of the paragraph vector framework if doing so results in performance gains.

3 Approach

Paragraph Vector Framework

The paragraph vector framework is the general approach to training unsupervised paragraph vectors, and is common to the models discussed here as well as the original PV-DM. The framework is a word-prediction task. Given a set of words w_1, w_2, ..., w_n, the model trains by predicting one word's vector, v_{w_j}, given the vectors of the other n - 1 words. The model also takes a paragraph vector p_i, where i identifies the body of text that w_1, w_2, ..., w_n come from. The original authors proposed both hierarchical softmax and negative sampling as replacements for the traditional but expensive softmax in the objective function; we use only negative sampling here. The cost function is therefore:

    -\log \sigma(r^\top v_{w_j}) - \sum_{i=1}^{k} \log \sigma(-r^\top v_{w_i})        (1)

where r is the predicted vector and v_{w_i} is the vector of one of k randomly sampled words. For the models trained here, k is fixed at 10. Training is done via backpropagation through structure using Adagrad. Both the word vectors and the paragraph vectors are initialized randomly with values in the range of -0.01 to 0.01.

Hidden Layer Model

The original PV-DM model has the following equation for r:

    r = \sigma(W [c; p_i])        (2)

where c is the concatenation of the input word vectors for the prediction task. The hidden layer model we propose simply adds another layer between r and the two input vectors:

    z = \sigma(W [c; p_i])        (3)
    r = \sigma(U z)               (4)

While simple, this allows the components of p_i and c to interact in more complex ways. The original equation for r allows only a single non-linear transformation of a linear function of p_i and c.
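To make the formulation concrete, the following is a minimal NumPy sketch of one forward pass of the hidden layer model and the negative-sampling cost of Equation (1). It is an illustration under stated assumptions, not the code used in the experiments: the function names, the hidden layer size (200), and the example dimensions are all assumed.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    # Illustrative dimensions (the experiments use 100-400 dimensional vectors);
    # the hidden size is not specified in the paper and is assumed here.
    d_word, d_para, n_context = 100, 400, 7
    d_in = n_context * d_word + d_para            # size of [c; p_i]
    d_hidden = 200                                # assumed hidden layer size

    rng = np.random.default_rng(0)
    W = rng.uniform(-0.01, 0.01, (d_hidden, d_in))    # first layer, eq. (3)
    U = rng.uniform(-0.01, 0.01, (d_word, d_hidden))  # second layer, eq. (4)

    def forward_hidden(context_vecs, p_i):
        """Hidden layer model: r = sigma(U sigma(W [c; p_i]))."""
        x = np.concatenate(list(context_vecs) + [p_i])   # [c; p_i]
        z = sigmoid(W @ x)                               # eq. (3)
        r = sigmoid(U @ z)                               # eq. (4)
        return r

    def ns_cost(r, v_target, v_negatives):
        """Negative-sampling cost of eq. (1) with k negative samples."""
        cost = -np.log(sigmoid(r @ v_target))
        for v_neg in v_negatives:                        # k = 10 in the experiments
            cost -= np.log(sigmoid(-(r @ v_neg)))
        return cost

Dropout on W and the Adagrad updates described later are omitted here for brevity.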

Tensor Model

To allow even more interaction between the components of p_i and c, we propose the following:

    r = \sigma(c^\top T p_i + W [c; p_i])        (5)

where T is a tensor.

3.1 Training

Given a collection of text that can be divided into n documents, or paragraphs, each paragraph is assigned a vector. Training involves sliding a window of context words across each word of each paragraph, for every paragraph. At each data point, the input is the window of context words together with the paragraph vector corresponding to the origin of the words, and the target output is a particular word within or adjacent to the context words (depending on the hyperparameters). The target output is, of course, not given as input. At either end of the paragraph, a special NULL word is used as needed to fill the window. The NULL word is trained like every other word. Backpropagation is used to train the paragraph vectors and the word vectors simultaneously.

3.2 Testing

Testing involves running gradient descent to train a new paragraph vector for each new paragraph. At test time, all parameters of the model are frozen, including the word vectors, and backpropagation is applied only to the paragraph vectors.
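As an illustration of the data preparation just described, the sketch below builds (context, target) pairs for one paragraph with NULL padding, using the "predict the word following the context" placement employed in the experiments (e.g. 7 context words, predict the 8th). The function name and the exact padding convention are illustrative assumptions, not the experiments' code.

    from typing import List, Tuple

    NULL = "<NULL>"  # special padding word; it is trained like every other word

    def context_windows(words: List[str], window: int = 7) -> List[Tuple[List[str], str]]:
        """Slide a context window across one paragraph: each word is a prediction
        target, and the `window` words preceding it are the input context, with
        NULL filling positions that fall outside the paragraph."""
        padded = [NULL] * window + words
        return [(padded[t:t + window], words[t]) for t in range(len(words))]

    # Example: the first target word has an all-NULL context.
    pairs = context_windows("the cat sat on the mat".split())
    print(pairs[0])

    # During training, backpropagation updates the word vectors, the paragraph
    # vectors, and the model weights (W, U, T) jointly. At test time the word
    # vectors and model weights are frozen, and gradient descent is run only on
    # a freshly initialised vector for each unseen paragraph.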

4 Experiments

We tested the two new models on two fairly standard tasks on which other models have been benchmarked: sentiment analysis and semantic textual similarity.

4.1 Sentiment Analysis - Stanford Sentiment Treebank

Each phrase in the treebank was treated as a separate paragraph, and training and testing were done according to the dataset's specifications. In an attempt to reproduce the results from (Le and Mikolov, 2014), we tried to match their hyperparameters as closely as possible.

4.1.1 Hyperparameters for PV-DM

Word vector dimensions: 100; paragraph vector dimensions: 400; context window size: 7 words, predict the 8th; trained using Adagrad with a 0.01 learning rate, minibatches of size 300, and L2 regularization of 1e-4. Trained for approximately 10 hours.

4.1.2 Hyperparameters for Hidden Layer Model

Word vector dimensions: 100; paragraph vector dimensions: 400; context window size: 7 words, predict the 8th; trained using Adagrad with a 0.01 learning rate, minibatches of size 300, and L2 regularization of 1e-4. Trained for approximately 16 hours.

4.1.3 Hyperparameters for Tensor Model

Word vector dimensions: 100; paragraph vector dimensions: 200; context window size: 7 words, predict the 8th; trained using Adagrad with a 0.005 learning rate, minibatches of size 300, and L2 regularization of 1e-3. Trained for approximately 48 hours. The tensor model becomes very expensive to train with large word or paragraph vectors, so the paragraph vector dimensionality was reduced.

Final Classification

After pre-training the paragraph vectors (both for training and testing), a random forest classifier was used to predict the final sentiment label for each paragraph. On the fine-grained task, each paragraph was classified into categories 1 to 5, obtained by scaling up the original sentiment scores from the 0.0 to 1.0 range. On the binary task, only paragraphs rated in categories 2 and 4 were used. We used the random forest classifier from the scikit-learn package, with 100 trees as the only change from the default hyperparameters.

Table 1: SST Results

    Method                            Fine-Grained    Binary
    PV-DM                             41.6            81.2
    PV-Hidden                         42.7            82.2
    PV-Tensor                         44.1            84.3
    PV-DM (Le and Mikolov, 2014)      48.7            87.8

5 Attempted Optimization

We made many attempts to fully reproduce the results from (Le and Mikolov, 2014). The only major difference is that we reduced the word vector dimensionality from 400 to 100, but this reduction was necessary for the model to train in a reasonable amount of time. Training a model with 200 dimensions did improve the results, but not enough to suggest that the reduction from 400 is responsible for the disparity in results.

Training duration was determined by convergence on the development set (10% randomly sampled from the training set). If there was no improvement over 10 epochs, or passes through the entire training set, training was halted. Dropout was used for the Hidden Layer model (p = 0.5 on W) and the Tensor model (p = 0.2 on T).
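As a concrete illustration of the final classification step described above, here is a minimal scikit-learn sketch. The array names and sizes are placeholders standing in for the frozen paragraph vectors and their labels; only the use of a 100-tree random forest with otherwise default hyperparameters reflects the setup actually used.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)

    # Placeholders for the paragraph vectors produced by unsupervised
    # pre-training (train) and test-time inference (test), and for the
    # fine-grained sentiment labels (classes 1-5).
    train_vectors = rng.normal(size=(1000, 400))   # placeholder 400-dim vectors
    train_labels = rng.integers(1, 6, size=1000)   # placeholder labels
    test_vectors = rng.normal(size=(200, 400))

    # 100 trees is the only change from scikit-learn's default hyperparameters.
    clf = RandomForestClassifier(n_estimators=100)
    clf.fit(train_vectors, train_labels)
    predictions = clf.predict(test_vectors)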

5.1 Semantic Textual Similarity - SemEval 2014 Task 1

While (Le and Mikolov, 2014) did not benchmark PV-DM on this task, many other papers benchmark their neural networks on this dataset, suggesting that it is a reliable one. The dataset here is the SICK dataset: 10,000 pairs of English sentences, each labeled with a semantic similarity score from 1 to 5. There is a Train, Test, and Trial division of the dataset, which were used for training, testing, and validating the models, respectively.

5.1.1 Hyperparameters for PV-DM

Word vector dimensions: 200; paragraph vector dimensions: 400; context window size: 9 words, predict the 10th; trained using Adagrad with a 0.01 learning rate, minibatches of size 300, and L2 regularization of 1e-4. Trained for approximately 6 hours.

5.1.2 Hyperparameters for Hidden Layer Model

Word vector dimensions: 200; paragraph vector dimensions: 400; context window size: 9 words, predict the 10th; trained using Adagrad with a 0.01 learning rate, minibatches of size 300, and L2 regularization of 1e-4. Trained for approximately 10 hours.

5.1.3 Hyperparameters for Tensor Model

Word vector dimensions: 200; paragraph vector dimensions: 200; context window size: 9 words, predict the 10th; trained using Adagrad with a 0.005 learning rate, minibatches of size 300, and L2 regularization of 1e-3. Trained for approximately 26 hours.

Final Classification

After pre-training the paragraph vectors (both for training and testing), a random forest regressor was used to predict the final similarity score for each sentence pair. We used the random forest regressor from the scikit-learn package, with 100 trees as the only change from the default hyperparameters.

Table 2: Semantic Textual Similarity Results

    Method                            MSE
    Mean Vectors (Tai et al., 2015)   0.455
    LSTM (Tai et al., 2015)           0.281
    PV-DM                             0.392
    PV-Hidden                         0.388
    PV-Tensor                         0.365

The Mean Vectors baseline in (Tai et al., 2015) computes a semantic relatedness score from the average of the word vectors of the words in the sentence. While the LSTM is not the focus of that paper, it is one of the most effective models for the task and puts the paragraph vector models, at least our implementations of them, in perspective.

6 Attempted Optimization

Since this experiment used the same implementation as the one for the Stanford Sentiment Treebank, the flaws holding back the scores there no doubt affect the scores here as well. The SICK dataset is smaller than the previous one, which allowed slightly larger models to be trained in a reasonable time.

Training duration was determined by convergence on the development set, as specified in the dataset. If there was no improvement over 10 epochs, or passes through the entire training set, training was halted. Dropout was used for the Hidden Layer model (p = 0.5 on W) and the Tensor model (p = 0.2 on T).
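The convergence criterion used in both sets of experiments (halt once the development set shows no improvement for 10 consecutive epochs) can be sketched as follows. The callables and their names are placeholders standing in for the actual Adagrad training pass and development-set evaluation routines.

    from typing import Callable

    def train_with_early_stopping(train_one_epoch: Callable[[], None],
                                  evaluate_dev: Callable[[], float],
                                  patience: int = 10,
                                  max_epochs: int = 1000) -> float:
        """Run training epochs until the development-set score has not improved
        for `patience` consecutive passes through the training set."""
        best_score = float("-inf")
        epochs_without_improvement = 0
        for _ in range(max_epochs):
            train_one_epoch()              # placeholder: one full Adagrad pass
            score = evaluate_dev()         # placeholder: dev metric (e.g. accuracy or -MSE)
            if score > best_score:
                best_score = score
                epochs_without_improvement = 0
            else:
                epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
        return best_score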

7 Conclusions

Despite the subpar results of our PV-DM implementation, it is safe to say that the hidden layer and tensor models improve upon the original PV-DM formulation. The hidden layer model is probably not worth the additional training time required for the slight performance gains. The tensor model is much better, but it takes even longer to converge.

The performance of the original PV-DM showed that modeling the distribution and ordering of a paragraph's words is an effective method of modeling the semantics of a sentence or any piece of text. The performance of the hidden layer and tensor models shows that there is room for improvement by allowing the paragraph vector to learn a more complex function for how the paragraph influences the distribution of words.

Regarding direct improvement of the model, a deeper, more complex model would perhaps further improve the performance of paragraph vectors, but at that point it would no longer be the fairly simple model that trains quickly compared to other deep models. For improving performance on particular tasks such as sentiment analysis, models that train directly supervised representations of text outperform the unsupervised paragraph vector. It seems intuitively obvious that deep methods that optimize paragraph representations for a particular task will outperform a shallow classifier applied to an unsupervised paragraph representation. It is also not possible to fine-tune the unsupervised paragraph vectors for a specific task unless all the paragraphs are available at training time, which makes it hard to demonstrate generalization. Thus, improving the paragraph vector is probably not the best way to improve upon state-of-the-art performance on individual tasks.

References

[1] Kalchbrenner, Nal, Edward Grefenstette, and Phil Blunsom. 2014. A Convolutional Neural Network for Modelling Sentences. ACL 2014.

[2] Le, Quoc V. and Tomas Mikolov. 2014. Distributed Representations of Sentences and Documents. arXiv preprint.

[3] Tai, Kai Sheng, Richard Socher, and Christopher D. Manning. 2015. Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks. arXiv preprint.

[4] Socher, Richard, Alex Perelygin, Jean Y. Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. 2013. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. EMNLP 2013.