Convolutional Neural Networks for Sentence Classification


Convolutional Neural Networks for Sentence Classification Yoon Kim New York University 1 / 34

Agenda: Word Embeddings; Classification (Recursive Neural Tensor Networks, Convolutional Neural Networks); Experiments; Conclusion 2 / 34

Word Embeddings Deep learning in Natural Language Processing Deep learning has achieved state-of-the-art results in computer vision (Krizhevsky et al., 2012) and speech (Graves et al., 2013). NLP: fast becoming (already is) a hot area of research. Much of the work involves learning word embeddings and performing composition over the learned embeddings for NLP tasks. 3 / 34

Word Embeddings Word Embeddings (or Word Vectors) Traditional NLP: Words are treated as indices (or one-hot vectors in R^|V|). Every word is orthogonal to one another: w_mother · w_father = 0. Can we embed words in R^D with D ≪ |V| such that semantically close words are likewise close in R^D? (i.e. w_mother · w_father > 0) Yes! Don't (necessarily) need deep learning for this: Latent Semantic Analysis, Latent Dirichlet Allocation, or simple context counts all give dense representations. 4 / 34
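
A tiny numerical illustration of the contrast above (the vectors are made up for this example; only the arithmetic is the point):

```python
import numpy as np

# One-hot vectors in R^|V|: every pair of distinct words is orthogonal.
w_mother_onehot = np.array([1, 0, 0, 0, 0])
w_father_onehot = np.array([0, 1, 0, 0, 0])
print(w_mother_onehot @ w_father_onehot)   # 0 -> no notion of similarity

# Dense embeddings in R^D with D << |V|: related words can have a large dot product.
w_mother = np.array([0.8, 0.1, -0.3])
w_father = np.array([0.7, 0.2, -0.2])
print(w_mother @ w_father)                 # 0.64 > 0
```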

Word Embeddings Neural Language Models (NLM) Another way to obtain word embeddings. Words are projected from R^|V| to R^D via a hidden layer. D is a hyperparameter to be tuned. Various architectures exist; simple ones are popular these days (right). Very fast: can train on billions of tokens in one day on a single machine. Figure 1: Skip-gram architecture of Mikolov et al. (2013) 5 / 34
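
Not part of the talk, but as a concrete reference, a minimal sketch of training skip-gram embeddings with gensim (assuming gensim >= 4.0, where the dimensionality argument is `vector_size`; the toy sentences are placeholders):

```python
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]  # in practice: billions of tokens

model = Word2Vec(
    sentences,
    vector_size=100,   # D, the embedding dimension (a hyperparameter)
    window=5,          # context window size
    sg=1,              # 1 = skip-gram, 0 = CBOW
    min_count=1,
    workers=4,
)

vec = model.wv["cat"]                        # the learned D-dimensional vector for "cat"
print(model.wv.most_similar("cat", topn=3))  # nearest neighbours in the embedding space
```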

Word Embeddings Linguistic regularities in the obtained embeddings The learned embeddings encode semantic and syntactic regularities: w_big − w_bigger ≈ w_slow − w_slower and w_france − w_paris ≈ w_korea − w_seoul. These are cool, but not necessarily unique to neural language models. "[...] the neural embedding process is not discovering novel patterns, but rather is doing a remarkable job at preserving the patterns inherent in the word-context co-occurrence matrix." Levy and Goldberg, Linguistic Regularities in Sparse and Explicit Word Representations, CoNLL 2014 6 / 34

Word Embeddings But the embeddings from NLMs are still good! "We set out to conduct this study [on context-counting vs. context-predicting] because we were annoyed by the triumphalist overtones often surrounding predict models, despite the almost complete lack of a proper comparison to count vectors. Our secret wish was to discover that it is all hype, and count vectors are far superior to their predictive counterparts. [...] Instead we found that the predict models are so good that, while the triumphalist overtones still sound excessive, there are very good reasons to switch to the new architecture." Baroni et al., Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors, ACL 2014 7 / 34

Classification Using word embeddings as features in classification The embeddings can be used as features (along with other traditional NLP features) in a classifier. For multi-word composition (e.g. sentences and phrases), one could (for example) take the average. This is obviously a bit crude... can we do composition in a more sophisticated way? 8 / 34
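
A minimal sketch of this averaging baseline, with a scikit-learn classifier on top; the vocabulary and data below are toy placeholders, not the embeddings or datasets used in the talk:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

D = 300  # embedding dimension

def sentence_vector(tokens, embeddings, dim=D):
    """Average the word vectors of a sentence; skip out-of-vocabulary words."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)

# `embeddings` would normally come from word2vec; here a toy random vocabulary.
rng = np.random.default_rng(0)
embeddings = {w: rng.normal(size=D) for w in ["great", "movie", "boring", "plot"]}

train_sents = [["great", "movie"], ["boring", "plot"]]
train_labels = [1, 0]

X = np.stack([sentence_vector(s, embeddings) for s in train_sents])
clf = LogisticRegression().fit(X, train_labels)
print(clf.predict(sentence_vector(["great", "plot"], embeddings).reshape(1, -1)))
```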

Classification Recursive Neural Tensor Networks Recursive Neural Tensor Networks (RNTN) Figure 2: Socher et al., Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank, EMNLP 2013 9 / 34

Classification Recursive Neural Tensor Networks RNTN Extended the previous state-of-the-art in sentiment analysis by a large margin. Best performing out of a family of recursive networks (Recursive Autoencoders, Socher et al., 2011; Matrix-Vector Recursive Neural Networks, Socher et al., 2012). The composition function is expressed as a tensor; each slice of the tensor encodes a different composition. Can discern negation at different scopes. 10 / 34

Classification Recursive Neural Tensor Networks RNTN Need parse trees to be computed beforehand. Phrase-level classification is expensive to obtain. Hard to adapt to other domains (e.g. Twitter). 11 / 34

Classification Convolutional Neural Networks Convolutional Neural Networks (CNN) Originally invented for computer vision (LeCun et al., 1989). Pretty much all modern vision systems use CNNs. Figure 3: LeCun et al., Gradient-based learning applied to document recognition, IEEE 1998 12 / 34

Classification Convolutional Neural Networks Brief tutorial on CNNs Key idea 1: Weight sharing via convolutional layers Key idea 2: Pooling layers Key idea 3: Multiple feature maps Figure 4: 1-dimensional convolution plus pooling 13 / 34
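
To make the three ideas concrete, a small NumPy sketch of a 1-dimensional convolution over a sentence of stacked word vectors, followed by ReLU and max-over-time pooling (shapes are illustrative, not the talk's code):

```python
import numpy as np

def conv1d_max_pool(X, W, b):
    """X: (sentence_length, D) word vectors stacked as rows.
    W: (num_filters, h, D) filters of width h, shared across positions (weight sharing).
    b: (num_filters,) biases.
    Returns one pooled feature per filter (max over time)."""
    n, D = X.shape
    num_filters, h, _ = W.shape
    feature_maps = np.empty((num_filters, n - h + 1))
    for i in range(n - h + 1):                  # slide the window over the sentence
        window = X[i:i + h].ravel()             # concatenate h word vectors
        feature_maps[:, i] = W.reshape(num_filters, -1) @ window + b
    feature_maps = np.maximum(feature_maps, 0)  # ReLU
    return feature_maps.max(axis=1)             # max-over-time pooling

rng = np.random.default_rng(0)
X = rng.normal(size=(7, 5))      # a 7-word sentence with 5-dimensional embeddings
W = rng.normal(size=(3, 2, 5))   # 3 feature maps, filter width 2
b = np.zeros(3)
print(conv1d_max_pool(X, W, b))  # 3 pooled features, one per feature map
```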

Classification Convolutional Neural Networks CNN: 2-dimensional case Figure 5: 2-dimensional convolution. From http://colah.github.io/ 14 / 34

Classification Convolutional Neural Networks CNN details Shared weights mean fewer parameters (than would be the case if fully connected). Pooling layers allow for local invariance. Multiple feature maps allow different kernels to act as specialized feature extractors. Training is done through backpropagation; errors are backpropagated through pooling modules. 15 / 34

Classification Convolutional Neural Networks CNNs in NLP Collobert and Weston used CNNs to achieve (near) state-of-the-art results on many traditional NLP tasks, such as POS tagging, SRL, etc. CNN at the bottom + CRF on top. Collobert et al., Natural Language Processing (almost) from scratch, JMLR 2011. 16 / 34

Classification Convolutional Neural Networks CNNs in NLP Becoming more popular in NLP Semantic parsing (Yih et al., Semantic Parsing for Single-Relation Question Answering, ACL 2014) Search query retrieval (Shen et al., Learning Semantic Representations Using Convolutional Neural Networks for Web Search, WWW 2014) Sentiment analysis (Kalchbrenner et al., A Convolutional Neural Network for Modelling Sentences, ACL 2014; dos Santos and Gatti, Deep Convolutional Neural Networks for Sentiment Analysis of Short Texts, COLING 2014) Most of these networks are quite complex, with multiple convolutional layers. 17 / 34

Classification Convolutional Neural Networks Dynamic Convolutional Neural Network Figure 6: Kalchbrenner et al., A Convolutional Neural Network for Modelling Sentences, ACL 2014 18 / 34

Classification Convolutional Neural Networks How well can we do with a simple CNN? Collobert-Weston style CNN with pre-trained embeddings from word2vec 19 / 34

Classification Convolutional Neural Networks CNN architecture One layer of convolution with ReLU (f(x) = max(0, x)) non-linearity. Multiple feature maps and multiple filter widths. Filter widths of 3, 4, 5 with 100 feature maps each, so 300 units in the penultimate layer. Words not in word2vec are initialized randomly from U[−a, a], where a is chosen such that the unknown words have the same variance as words already in word2vec. Regularization: Dropout on the penultimate layer with a constraint on the L2-norms of the weight vectors. These hyperparameters were chosen via some light tuning on one of the datasets. 20 / 34
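
A compact sketch of this architecture in PyTorch (the original implementation was in Theano; the class and variable names here are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SentenceCNN(nn.Module):
    """One convolutional layer, filter widths 3/4/5 with 100 feature maps each,
    max-over-time pooling, dropout on the penultimate (300-unit) layer."""
    def __init__(self, vocab_size, embed_dim=300, num_classes=2,
                 filter_widths=(3, 4, 5), maps_per_width=100, dropout=0.5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, maps_per_width, kernel_size=w) for w in filter_widths]
        )
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(maps_per_width * len(filter_widths), num_classes)

    def forward(self, token_ids):                      # token_ids: (batch, seq_len)
        x = self.embedding(token_ids).transpose(1, 2)  # (batch, embed_dim, seq_len)
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        h = self.dropout(torch.cat(pooled, dim=1))     # (batch, 300)
        return self.fc(h)

model = SentenceCNN(vocab_size=20000)
logits = model(torch.randint(0, 20000, (8, 30)))  # batch of 8 sentences, length 30
print(logits.shape)                               # torch.Size([8, 2])
```

Note that the sentence length must be at least the largest filter width; in practice shorter sentences are padded.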

Classification Convolutional Neural Networks Dropout Proposed by Hinton et al. (2012) to prevent co-adaptation of hidden units. During forward propagation, randomly mask (set to zero) each unit with probability p. Backpropagate only through unmasked units. At test time, do not use dropout, but scale the weights by p. Like taking the geometric average of different models. Rescale weights to have L2-norm = s whenever the L2-norm > s after a gradient step. 21 / 34
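
A sketch of the norm-rescaling step in PyTorch, applied to the penultimate layer's weight rows after each gradient step (the constraint value s = 3 follows the paper's reported setup; the helper name is made up):

```python
import torch

def clamp_l2_norm(weight, s=3.0):
    """Rescale each row of `weight` to have L2 norm at most s
    (applied after a gradient step)."""
    with torch.no_grad():
        norms = weight.norm(p=2, dim=1, keepdim=True)
        scale = (s / norms).clamp(max=1.0)   # only shrink rows whose norm exceeds s
        weight.mul_(scale)

# Usage after each optimizer step on the model from the previous sketch:
# optimizer.step()
# clamp_l2_norm(model.fc.weight, s=3.0)
```

One design note: PyTorch's nn.Dropout uses inverted dropout (activations are scaled by 1/(1−p) during training), so no test-time weight scaling is needed there; the scale-by-p rule above describes the original formulation.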

Classification Convolutional Neural Networks Note on SGD: Adagrad vs. Adadelta

Adagrad (Duchi et al., 2011): w_{t+1} = w_t − (η / (ε + √(Σ_{i=1..t} g_i²))) · g_t

Adadelta (Zeiler, 2012): w_{t+1} = w_t − (√(ε + s_t) / √(ε + q_t)) · g_t, where s_t and q_t are recursively defined as s_t = ρ·s_{t−1} + (1 − ρ)·(w_t − w_{t−1})² and q_t = ρ·q_{t−1} + (1 − ρ)·g_t²

Adadelta generally required fewer epochs to reach the (local) minima, even with a higher η on Adagrad. But both eventually give similar results (Adagrad slightly more stable). Use Adadelta to quickly search the hyperparameter space and then build the final model with Adagrad. 22 / 34
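
A small NumPy sketch of the Adadelta recursion written above (ρ, ε and the toy quadratic objective are illustrative; s enters the update as the running average from the previous step, as in Zeiler, 2012):

```python
import numpy as np

def adadelta_step(w, grad, state, rho=0.95, eps=1e-6):
    """One Adadelta update. `state` carries q (running average of squared gradients)
    and s (running average of squared parameter updates)."""
    state["q"] = rho * state["q"] + (1 - rho) * grad ** 2
    update = -np.sqrt(state["s"] + eps) / np.sqrt(state["q"] + eps) * grad
    state["s"] = rho * state["s"] + (1 - rho) * update ** 2
    return w + update

w = np.zeros(3)
state = {"q": np.zeros(3), "s": np.zeros(3)}
for _ in range(5):
    grad = 2 * (w - np.array([1.0, -2.0, 0.5]))  # gradient of a toy quadratic loss
    w = adadelta_step(w, grad, state)
print(w)  # slowly moves toward [1.0, -2.0, 0.5]
```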

Experiments Datasets Sentence/phrase-level classification tasks

Data    c   l    N      |V|    |V_pre|  Prev SotA
MR      2   20   10662  18765  16448    79.5
SST-1   5   18   11855  17836  16262    48.7
SST-2   2   19   9613   16185  14838    87.8
Subj    2   23   10000  21323  17913    93.6
TREC    6   10   5952   9592   9125     95.0
CR      2   19   3775   5340   5046     82.7
MPQA    2   3    10606  6246   6083     87.2

c: number of labels; l: average sentence length; N: number of sentences; |V|: vocab size (|V_pre| is words already in word2vec) 23 / 34

Experiments Baseline: Randomly initialize all words (CNN-rand)

Data    Prev SotA  CNN-rand
MR      79.5       76.1
SST-1   48.7       45.0
SST-2   87.8       82.7
Subj    93.6       89.6
TREC    95.0       91.2
CR      82.7       79.8
MPQA    87.2       83.4

Baseline model doesn't do too well... 24 / 34

Experiments Model 1: Keep the embeddings fixed (CNN-static)

Data    Prev SotA  CNN-rand  CNN-static
MR      79.5       76.1      81.0
SST-1   48.7       45.0      45.5
SST-2   87.8       82.7      86.8
Subj    93.6       89.6      93.0
TREC    95.0       91.2      92.8
CR      82.7       79.8      84.7
MPQA    87.2       83.4      89.6

Even a simple model does very well! word2vec embeddings are universal enough that they can be used for different tasks without having to learn task-specific embeddings. Same hyperparameters for all datasets. 25 / 34

Experiments Model 2: Fine-tune embeddings for each task (CNN-nonstatic)

Data    Prev SotA  CNN-rand  CNN-static  CNN-nonstatic
MR      79.5       76.1      81.0        81.5
SST-1   48.7       45.0      45.5        48.0
SST-2   87.8       82.7      86.8        87.2
Subj    93.6       89.6      93.0        93.4
TREC    95.0       91.2      92.8        93.6
CR      82.7       79.8      84.7        84.3
MPQA    87.2       83.4      89.6        89.5

Fine-tuning vectors helps, though not that much. Perhaps our embeddings are overfitting (given the relatively small training sample)? 26 / 34

Experiments Model 3: Multi-channel CNN Two channels of embeddings (i.e. look-up tables). One is allowed to change, while one is kept fixed. Both initialized with word2vec. 27 / 34
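
A sketch of the two-channel look-up in PyTorch: both tables start from the same pre-trained matrix, one frozen and one fine-tuned (`pretrained` stands in for the word2vec matrix; names are illustrative):

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 20000, 300
pretrained = torch.randn(vocab_size, embed_dim)   # stand-in for the word2vec matrix

embed_static = nn.Embedding.from_pretrained(pretrained.clone(), freeze=True)
embed_nonstatic = nn.Embedding.from_pretrained(pretrained.clone(), freeze=False)

token_ids = torch.randint(0, vocab_size, (8, 30))
# Stack the two channels; the convolution is then applied across both channels,
# as in the multichannel model.
channels = torch.stack([embed_static(token_ids), embed_nonstatic(token_ids)], dim=1)
print(channels.shape)   # torch.Size([8, 2, 30, 300])
```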

Experiments Model 3 performance is mixed

Data    Prev SotA  CNN-nonstatic  CNN-multichannel
MR      79.5       81.5           81.1
SST-1   48.7       48.0           47.4
SST-2   87.8       87.2           88.1
Subj    93.6       93.4           93.2
TREC    95.0       93.6           92.2
CR      82.7       84.3           85.0
MPQA    87.2       89.5           89.4

Performance is not statistically different from CNN-nonstatic. 28 / 34

Experiments Fine-tuned embeddings (on SST) Most similar words for:

Word   Static                           Non-static
bad    good, terrible, horrible, lousy  terrible, horrible, lousy, stupid
good   great, bad, terrific, decent     nice, decent, solid, terrific

good and bad are similar to each other in the original word2vec because interchanging them will still result in a grammatically correct sentence. The model learns to discriminate adjectival scales: sim(good, nice) > sim(good, great). 29 / 34

Experiments Fine-tuned embeddings (on SST) Most similar words for:

Word   Static                               Non-static
n't    os, ca, ireland, wo                  not, never, nothing, neither
!      2,500, entire, jez, changer          2,500, lush, beautiful, terrific
,      decasia, abysmally, demise, valiant  but, dragon, a, and

n't was already in word2vec but had meaningless embeddings. ! and , were not in word2vec. The network learns that ! is associated with effusive words and that , is conjunctive (though not very well). Not sure if the multichannel architecture is the right way to regularize embeddings. 30 / 34

Conclusion Further Observations Width/multiple feature maps are important up to a point.

Width \ Feature Maps   10    25    50    100
2                      75.8  78.4  78.1  78.5
3                      78.9  80.0  79.6  79.2
4                      78.1  81.6  80.1  79.9
5                      80.0  79.6  81.0  80.5
6                      79.0  80.5  82.1  81.9
7                      80.8  81.1  81.1  82.3

Performance on one fold of the MR dataset. 31 / 34

Conclusion Further Observations ReLU, Tanh, and Hard Tanh all gave similar results (contrary to vision). Might be different with deeper architectures (ReLU is robust to gradient saturation). The L2-norm constraint on the penultimate layer is important. When using pre-trained vectors, initializing unknown words to have similar variance as the pre-trained ones helps. Existing software makes it easy to train neural nets (Theano, Torch). Briefly experimented with Collobert-Weston (SENNA) embeddings trained on Wikipedia; word2vec was much better. 32 / 34
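
The variance-matching initialization mentioned above amounts to a one-line calculation: a U[−a, a] variable has variance a²/3, so set a = √(3·Var) where Var is the variance of the pre-trained vectors. A sketch (not the original code):

```python
import numpy as np

def init_unknown_words(pretrained_matrix, num_unknown, rng=None):
    """Draw vectors for out-of-vocabulary words from U[-a, a], with a chosen so their
    variance matches that of the pre-trained vectors
    (Var(U[-a, a]) = a**2 / 3  =>  a = sqrt(3 * Var))."""
    if rng is None:
        rng = np.random.default_rng()
    a = np.sqrt(3.0 * pretrained_matrix.var())
    return rng.uniform(-a, a, size=(num_unknown, pretrained_matrix.shape[1]))

pretrained = np.random.default_rng(0).normal(scale=0.25, size=(1000, 300))
unk = init_unknown_words(pretrained, num_unknown=50)
print(pretrained.var(), unk.var())   # the two variances should be close
```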

Conclusion Future work Regularizing the fine-tuning process: Keep word2vec embeddings fixed, fine-tune only unknown words. Have extra dimensions which are allowed to change. Be smarter about initializing unknown words. Recurrent architectures, though difficult to train, seem promising for sentence composition/classification: Sutskever et al., Sequence to Sequence Learning with Neural Networks, arXiv 2014; Bahdanau et al., Neural Machine Translation by Jointly Learning to Align and Translate, arXiv 2014; Kalchbrenner and Blunsom, Recurrent Convolutional Neural Networks for Discourse Compositionality, ACL Workshop 2013. Document-level classification. 33 / 34

Conclusion Paper/slides/code available at http://www.yoon.io 34 / 34