CS224d: Deep NLP. Lecture 11: Advanced Recursive Neural Networks. Richard Socher

CS224d: Deep NLP Lecture 11: Advanced Recursive Neural Networks Richard Socher richard@metamind.io

PSet2: please read the instructions for submissions. Please follow Piazza for questions and announcements. Because of some ambiguities in PSet2, we will be lenient in grading. TensorFlow (TF) is a super useful skill. If you have a re-grade question or request, please come to office hours or send a message on Piazza. To improve learning and your experience, we will publish solutions to the PSets.

Recursive Neural Networks: focused on compositional representation learning of hierarchical structure, features, and predictions. Different combinations of: 1. Training objective 2. Composition function, e.g. $p = \tanh(W [c_1; c_2] + b)$ with a score $s = V^\top p$ 3. Tree structure
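As a reference point, here is a minimal numpy sketch of that standard composition and scoring step (the names compose, score_node, and the toy dimensions are illustrative, not from the slides):

import numpy as np

d = 4                                        # toy embedding dimension
rng = np.random.default_rng(0)
W = rng.standard_normal((d, 2 * d)) * 0.1    # composition matrix
b = np.zeros(d)
u = rng.standard_normal(d) * 0.1             # scoring vector ("V" on the slide)

def compose(c1, c2):
    """Standard RNN composition: p = tanh(W [c1; c2] + b)."""
    return np.tanh(W @ np.concatenate([c1, c2]) + b)

def score_node(p):
    """Plausibility score of a parent node, used when choosing tree structure."""
    return u @ p

c1, c2 = rng.standard_normal(d), rng.standard_normal(d)
p = compose(c1, c2)
print(p.shape, score_node(p))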

Overview. Last lecture: Recursive Neural Networks. This lecture: different RNN composition functions and NLP tasks: 1. Standard RNNs: Paraphrase detection 2. Matrix-Vector RNNs: Relation classification 3. Recursive Neural Tensor Networks: Sentiment analysis 4. Tree LSTMs: Phrase similarity. Next lecture: review for the midterm, going over common problems/questions from office hours. Please prepare questions.

Applications and Models. Note: all models can be applied to all tasks; more powerful models are needed for harder tasks. The models get increasingly more expressive and powerful: 1. Standard RNNs: Paraphrase detection 2. Matrix-Vector RNNs: Relation classification 3. Recursive Neural Tensor Networks: Sentiment analysis 4. Tree LSTMs: Phrase similarity

Paraphrase Detection. Example pair 1: "Pollack said the plaintiffs failed to show that Merrill and Blodget directly caused their losses" / "Basically, the plaintiffs did not show that omissions in Merrill's research caused the claimed losses". Example pair 2: "The initial report was made to Modesto Police December 28" / "It stems from a Modesto police report".

How to compare the meaning of two sentences?

RNNs for Paraphrase Detection: unsupervised RNNs and a pair-wise sentence comparison of nodes in parsed trees (Socher et al., NIPS 2011)

RNNs for Paraphrase Detection. Experiments on the Microsoft Research Paraphrase Corpus (Dolan et al. 2004). Method / Acc. / F1:
Rus et al. (2008): 70.6 / 80.5
Mihalcea et al. (2006): 70.3 / 81.3
Islam et al. (2007): 72.6 / 81.3
Qiu et al. (2006): 72.0 / 81.6
Fernando et al. (2008): 74.1 / 82.4
Wan et al. (2006): 75.6 / 83.0
Das and Smith (2009): 73.9 / 82.3
Das and Smith (2009) + 18 surface features: 76.1 / 82.7
F. Bu et al. (ACL 2012), String Re-writing Kernel: 76.3 / --
Unfolding Recursive Autoencoder (NIPS 2011): 76.8 / 83.6
Note: this dataset is problematic; a better evaluation is introduced later.

RNNs for Paraphrase Detection

Recursive Deep Learning: 1. Standard RNNs: Paraphrase detection 2. Matrix-Vector RNNs: Relation classification 3. Recursive Neural Tensor Networks: Sentiment analysis 4. Tree LSTMs: Phrase similarity

Compositionality Through Recursive Matrix-Vector Spaces. Standard composition: $p = \tanh(W [c_1; c_2] + b)$. One way to make the composition function more powerful is to untie the weights W. But what if words act mostly as operators, e.g. "very" in "very good"? Proposal: a new composition function.

Compositionality Through Recursive Matrix-Vector Recursive Neural Networks. Standard RNN: $p = \tanh(W [c_1; c_2] + b)$. MV-RNN: $p = \tanh(W [C_2 c_1; C_1 c_2] + b)$, where each word or phrase carries a matrix $C$ in addition to its vector $c$.
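A minimal numpy sketch of this matrix-vector composition, following the MV-RNN formulation above (variable names such as mv_compose and W_M are illustrative): each child contributes a vector and a matrix, the parent vector applies each child's matrix to the other child's vector, and the parent gets its own matrix as well.

import numpy as np

d = 4
rng = np.random.default_rng(1)
W   = rng.standard_normal((d, 2 * d)) * 0.1   # vector composition
W_M = rng.standard_normal((d, 2 * d)) * 0.1   # matrix composition
b = np.zeros(d)

def mv_compose(c1, C1, c2, C2):
    """MV-RNN: p = tanh(W [C2 c1; C1 c2] + b), P = W_M [C1; C2]."""
    p = np.tanh(W @ np.concatenate([C2 @ c1, C1 @ c2]) + b)
    P = W_M @ np.vstack([C1, C2])              # (d x 2d) @ (2d x d) -> (d x d)
    return p, P

c1, C1 = rng.standard_normal(d), np.eye(d) + 0.01 * rng.standard_normal((d, d))
c2, C2 = rng.standard_normal(d), np.eye(d) + 0.01 * rng.standard_normal((d, d))
p, P = mv_compose(c1, C1, c2, C2)
print(p.shape, P.shape)   # (4,) (4, 4)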

Predicting Sentiment Distributions: a good example of non-linearity in language.

MV-RNN for Relationship Classification. Sentences with labeled nouns for which to predict the relationship:
Cause-Effect(e2,e1): Avian [influenza]e1 is an infectious disease caused by type A strains of the influenza [virus]e2.
Entity-Origin(e1,e2): The [mother]e1 left her native [land]e2 about the same time and they were married in that city.
Message-Topic(e2,e1): Roadside [attractions]e1 are frequently advertised with [billboards]e2 to attract tourists.

Sentiment Detection. Sentiment detection is crucial to business intelligence, stock trading, …

Sentiment Detection and Bag-of-Words Models. Most methods start with a bag of words plus linguistic features/processing/lexica. But such methods (including tf-idf) can't distinguish: (+) "white blood cells destroying an infection" from (−) "an infection destroying white blood cells".
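To see concretely why a bag-of-words representation cannot separate those two sentences, a tiny sketch comparing their word counts:

from collections import Counter

s1 = "white blood cells destroying an infection"
s2 = "an infection destroying white blood cells"

bow1, bow2 = Counter(s1.split()), Counter(s2.split())
print(bow1 == bow2)   # True: identical bag of words, opposite sentiment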

Sentiment Detection and Bag-of-Words Models. A common view is that sentiment detection is easy: detection accuracy for longer documents is around 90%, and there are lots of easy cases ("horrible" or "awesome"). But on a dataset of single-sentence movie reviews (Pang and Lee, 2005), accuracy never rose above 80% for more than 7 years. The harder cases require actual understanding of negation and its scope, plus other semantic effects.

Data: Movie Reviews. "Stealing Harvard doesn't care about cleverness, wit or any other kind of intelligent humor." "There are slow and repetitive parts but it has just enough spice to keep it interesting."

Two missing pieces for improving sentiment: 1. Compositional training data 2. A better compositional model

1. New Sentiment Treebank

1. New Sentiment Treebank: parse trees of 11,855 sentences; 215,154 phrases with labels; allows training and evaluating with compositional information.

Better Dataset Helped All Models. [Bar chart: positive/negative full-sentence classification accuracy (75–84%) for BiNB, RNN, and MV-RNN, comparing training with sentence labels vs. training with the treebank.] But hard negation cases are still mostly incorrect; we also need a more powerful model!

Better Dataset Helped. The treebank improved performance for full-sentence positive/negative classification by 2–3%. Yay! But a more in-depth analysis shows that hard negation cases are still mostly incorrect. We also need a more powerful model!

2. New Compositional Model: Recursive Neural Tensor Network. More expressive than previous RNNs. Idea: allow more interactions of vectors.

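A minimal numpy sketch of the tensor composition an RNTN uses (variable names and toy dimensions are illustrative): the two child vectors are stacked into x = [c1; c2], and each slice V[k] of a third-order tensor adds a bilinear interaction term x^T V[k] x on top of the standard W x term.

import numpy as np

d = 4
rng = np.random.default_rng(2)
V = rng.standard_normal((d, 2 * d, 2 * d)) * 0.01   # one (2d x 2d) slice per output unit
W = rng.standard_normal((d, 2 * d)) * 0.1
b = np.zeros(d)

def rntn_compose(c1, c2):
    """RNTN: p_k = tanh( x^T V[k] x + (W x + b)_k ), with x = [c1; c2]."""
    x = np.concatenate([c1, c2])
    bilinear = np.array([x @ V[k] @ x for k in range(d)])
    return np.tanh(bilinear + W @ x + b)

c1, c2 = rng.standard_normal(d), rng.standard_normal(d)
print(rntn_compose(c1, c2))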

Recursive Neural Tensor Network: "Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank", Socher et al. 2013.

Details: Tensor Backpropagation Training. Main new matrix derivative needed for a tensor: $\frac{\partial\, a^\top X a}{\partial X} = \frac{\partial\, a^\top X^\top a}{\partial X} = a a^\top$.
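A quick finite-difference sanity check of that identity (a sketch, not from the slides): for f(X) = a^T X a, the gradient with respect to X is the outer product a a^T.

import numpy as np

rng = np.random.default_rng(3)
n = 5
a = rng.standard_normal(n)
X = rng.standard_normal((n, n))

analytic = np.outer(a, a)                    # claimed gradient a a^T

numeric = np.zeros((n, n))
eps = 1e-6
for i in range(n):
    for j in range(n):
        E = np.zeros((n, n)); E[i, j] = eps
        numeric[i, j] = (a @ (X + E) @ a - a @ (X - E) @ a) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))    # ~1e-9: matches a a^T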

Details: Tensor Backpropagation Training. Training minimizes the cross-entropy error. The pieces: the standard softmax error message at each labeled node; an update for each tensor slice; the main backprop rule to pass the error down from a parent to its children; finally, the error from the parent is added to the error from the current node's softmax.
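The formulas themselves did not survive extraction. As a hedged reconstruction following the RNTN paper (Socher et al. 2013), with $x = [c_1; c_2]$ the stacked children, $p$ the parent vector, $W_s$ the softmax classification matrix, $\hat{y}$ and $y$ the predicted and target distributions, $\delta$ the total error vector arriving at the parent, and $\otimes$ element-wise multiplication (notation is ours), the pieces take roughly this form:

\[
\delta^{\text{softmax}} = W_s^\top(\hat{y} - y) \otimes f'(p), \qquad
\frac{\partial E}{\partial V^{[k]}} = \delta_k \, x x^\top, \qquad
\delta^{\text{down}} = \Big(W^\top \delta + \sum_k \delta_k \big(V^{[k]} + V^{[k]\top}\big)x\Big) \otimes f'(x),
\]

and each child's total error is its share of $\delta^{\text{down}}$ plus that child's own softmax error.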

Positive/Negative Results on Treebank. Classifying sentences: accuracy improves to 85.4. [Bar chart: positive/negative sentence classification accuracy (74–86%) for BiNB, RNN, MV-RNN, and RNTN, comparing training with sentence labels vs. training with the treebank.]

Fine Grained Results on Treebank

Negation Results

Negation Results. Most methods capture that negation often makes things more negative (see Potts, 2010). Analysis on a negation dataset, measured by accuracy.

Results on Negating Negatives. But how about negating negatives, e.g. "not bad"? There is no flip to positive, but the positive activation should increase!

Results on Negating Negatives Evaluation: Positive activation should increase


Visualizing Deep Learning: Word Embeddings

LSTMs. Remember LSTMs? Historically they have only been applied over temporal sequences, using the standard gated update equations.

Tree LSTMs. We can use those ideas in grammatical tree structures! Paper: Tai et al. 2015, "Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks". Idea: sum the child vectors in a tree structure; each child has its own forget gate; same softmax on h.
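A minimal numpy sketch of the Child-Sum Tree-LSTM node update from Tai et al. (2015), with illustrative parameter names and toy dimensions: the children's hidden states are summed into h_tilde, every child gets its own forget gate, and the cell mixes the new candidate with the children's cells.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d, dx = 4, 4
rng = np.random.default_rng(4)
# One (W, U, b) triple per gate: input i, forget f, output o, candidate u.
P = {g: (rng.standard_normal((d, dx)) * 0.1,
         rng.standard_normal((d, d)) * 0.1,
         np.zeros(d)) for g in "ifou"}

def tree_lstm_node(x, child_h, child_c):
    """Child-Sum Tree-LSTM: h_tilde = sum_k h_k, one forget gate per child."""
    h_tilde = np.sum(child_h, axis=0) if child_h else np.zeros(d)
    def gate(g, h, act):
        W, U, b = P[g]
        return act(W @ x + U @ h + b)
    i = gate("i", h_tilde, sigmoid)
    o = gate("o", h_tilde, sigmoid)
    u = gate("u", h_tilde, np.tanh)
    f = [gate("f", h_k, sigmoid) for h_k in child_h]   # per-child forget gates
    c = i * u + sum(f_k * c_k for f_k, c_k in zip(f, child_c))
    h = o * np.tanh(c)
    return h, c

# Leaves have no children; an internal node combines its children's states.
h1, c1 = tree_lstm_node(rng.standard_normal(dx), [], [])
h2, c2 = tree_lstm_node(rng.standard_normal(dx), [], [])
h_root, c_root = tree_lstm_node(rng.standard_normal(dx), [h1, h2], [c1, c2])
print(h_root.shape)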

Results on Stanford Sentiment Treebank. Method / Fine-grained / Binary:
RAE (Socher et al., 2013): 43.2 / 82.4
MV-RNN (Socher et al., 2013): 44.4 / 82.9
RNTN (Socher et al., 2013): 45.7 / 85.4
DCNN (Blunsom et al., 2014): 48.5 / 86.8
Paragraph-Vec (Le and Mikolov, 2014): 48.7 / 87.8
CNN-non-static (Kim, 2014): 48.0 / 87.2
CNN-multichannel (Kim, 2014): 47.4 / 88.1
DRNN (Irsoy and Cardie, 2014): 49.8 / 86.6
LSTM: 45.8 / 86.7
Bidirectional LSTM: 49.1 / 86.8
2-layer LSTM: 47.5 / 85.5
2-layer Bidirectional LSTM: 46.2 / 84.8
Constituency Tree LSTM (no tuning of word vectors): 46.7 / 86.6
Constituency Tree LSTM: 50.6 / 86.9

Semantic Similarity. Better than binary paraphrase classification! Dataset from a competition: SemEval-2014 Task 1, "Evaluation of compositional distributional semantic models on full sentences through semantic relatedness [and textual entailment]". Example pairs with relatedness scores:
1.6: A: A man is jumping into an empty pool. B: There is no biker jumping in the air.
2.9: A: Two children are lying in the snow and are making snow angels. B: Two angels are making snow on the lying children.
3.6: A: The young boys are playing outdoors and the man is smiling nearby. B: There is no boy playing outdoors and there is no man smiling.
4.9: A: A person in a black jacket is doing tricks on a motorbike. B: A man in a black jacket is doing tricks on a motorbike.

Semantic Similarity Results (correlation and MSE). Method / Pearson's r / Spearman's ρ / MSE:
Mean vectors: 0.8046 / 0.7294 / 0.3595
DT-RNN (Socher et al., 2014): 0.7863 / 0.7305 / 0.3983
SDT-RNN (Socher et al., 2014): 0.7886 / 0.7280 / 0.3859
Illinois-LH (Lai and Hockenmaier, 2014): 0.7993 / 0.7538 / 0.3692
UNAL-NLP (Jimenez et al., 2014): 0.8070 / 0.7489 / 0.3550
Meaning Factory (Bjerva et al., 2014): 0.8268 / 0.7721 / 0.3224
ECNU (Zhao et al., 2014): 0.8414 / -- / --
LSTM: 0.8477 / 0.7921 / 0.2949
Bidirectional LSTM: 0.8522 / 0.7952 / 0.2850
2-layer LSTM: 0.8411 / 0.7849 / 0.2980
2-layer Bidirectional LSTM: 0.8488 / 0.7926 / 0.2893
Constituency Tree LSTM: 0.8491 / 0.7873 / 0.2852
Dependency Tree LSTM: 0.8627 / 0.8032 / 0.2635

Semantic Similarity Results, Pearson Correlation r. [Plot: Pearson correlation r (0.78–0.90) vs. mean sentence length (4–20) for DepTree-LSTM, ConstTree-LSTM, Bi-LSTM, and LSTM.]

Next lecture: Midterm review session. Go over materials with different viewpoints. Come with questions!