Con-S2V: A Generic Framework for Incorporating Extra-Sentential Context into Sen2Vec

Con-S2V: A Generic Framework for Incorporating Extra-Sentential Context into Sen2Vec
Tanay Kumar Saha (1), Shafiq Joty (2), Mohammad Al Hasan (1)
(1) Indiana University Purdue University Indianapolis, Indianapolis, IN 46202, USA
(2) Nanyang Technological University, Singapore
September 22, 2017

Outline
1 Introduction and Motivation
2 Con-S2V Model
3 Experimental Settings
4 Experimental Results
5 Conclusion

Outline
1 Introduction and Motivation
   Introduction
   Related Work
2 Con-S2V Model
   Modeling Content
   Modeling Distributional Similarity
   Modeling Proximity
   Training Con-S2V
3 Experimental Settings
   Evaluation Tasks
   Metrics for Evaluation
   Baseline Models for Evaluation
   Optimal Parameter Settings
4 Experimental Results
   Classification and Clustering Performance
   Summarization Performance
5 Conclusion

Sen2Vec (a model for sentence representation)
- Learns distributed representations of sentences from unlabeled data, e.g., v1: "I eat rice" → [0.2 0.3 0.4]; formally, φ : V → R^d
- For many text processing tasks that involve classification, clustering, or ranking of sentences, a vector representation of sentences is a prerequisite
- Distributed representations have been shown to perform better than bag-of-words (BOW) vector representations
- Proposed by Mikolov et al.

Con-S2V (our model)
- A novel approach to learning distributed representations of sentences from unlabeled data by jointly modeling both the content and the context of a sentence
  v1: "I have an NEC multisync 3D monitor for sale"; v2: "Looks new"; v3: "Great Condition"
- In contrast to existing work, we treat context sentences as atomic linguistic units
- We consider two types of context, discourse and similarity, but our model can take any arbitrary type of context
- Our evaluation across multiple datasets shows impressive results: our model outperforms the best existing models by up to 7.7 points in F1-score for classification, 15.1 points in V-score for clustering, and 3.2 points in ROUGE-1 for summarization
- Built on top of Sen2Vec

Context Types of a Sentence
Discourse context of a sentence
- Formed by the previous and the following sentences in the text
- Adjacent sentences in a text are logically connected by certain coherence relations (e.g., elaboration, contrast) to express the meaning
- Example: "Lactose is a milk sugar. The enzyme lactase breaks it down." Here, the second sentence is an elaboration of the first.
Similarity context of a sentence
- Based on more direct measures of similarity
- Considers relations between all possible sentences in a document, and possibly across multiple documents

Related Work
Sen2Vec
- Uses the sentence ID as a special token and learns the representation of the sentence by predicting all the words in the sentence
- For example, for a sentence v1: "I eat rice", it learns a representation for v1 by learning to predict each of the words I, eat, and rice correctly
- Shown to perform better than tf-idf
W2V-avg
- Uses word vector averaging
- A tough-to-beat baseline for most downstream tasks
SDAE
- Employs an encoder-decoder framework, similar to neural machine translation (NMT), to de-noise an original sentence (target) from its corrupted version (source)
- SAE is similar in spirit to SDAE but does not corrupt the source

Related Work
C-Phrase
- An extension of CBOW (Continuous Bag-of-Words model) in which the context of a word is extracted from a syntactic parse of the sentence
- The syntax tree for the sentence "A sad dog is howling in the park" is: (S (NP A sad dog) (VP is (VP howling (PP in (NP the park)))))
- C-Phrase optimizes context prediction for dog, sad dog, a sad dog, a sad dog is howling, etc., but not, for example, for howling in, as these two words do not form a syntactic constituent by themselves
- Uses word vector addition for representing sentences

Related Work
Skip-Thought (context sensitive)
- Uses the NMT framework to predict adjacent sentences (target) given a sentence (source)
FastSent (context sensitive)
- An additive model that learns sentence representations from word vectors
- Predicts the words of its adjacent sentences in addition to its own words

Con-S2V
- A novel model for learning distributed representations of sentences that considers the content as well as the context of a sentence
- Treats context sentences as atomic units
- Efficient to train compared to compositional methods such as encoder-decoder models (e.g., SDAE, Skip-Thought) that compose a sentence vector from word vectors

Con-S2V Model
The model for learning the vector representation of a sentence comprises three components:
- The first component models the content by asking the sentence vector to predict its constituent words (modeling content)
- The second component models the distributional hypothesis of a context (modeling context)
- The third component models the proximity hypothesis of a context, which suggests that sentences that are proximal should have similar representations (modeling context)

Con-S2V Model
[Figure] Two instances (b) and (c) of our model for learning the representation of sentence v2 within a context of two other sentences, v1 and v3 (shown in (a)). Directed and undirected edges indicate prediction loss and regularization loss, respectively; dashed edges indicate that the node being predicted is randomly sampled. Example sentences (collected from 20news-bydate-train/misc.forsale/74732; the central topic is "forsale"): v1: "I have an NEC multisync 3D monitor for sale", v2: "Great Condition", v3: "Looks New".

Con-S2V Model
We minimize the following loss function for learning the representation of sentences:

J(\phi) = \sum_{v_i \in V} \sum_{v \in v_i} \Big[ L_c(v_i, v) + L_g(v_i, v_j) + L_r(v_i, \mathcal{N}(v_i)) \Big], \quad v_j \sim U(1, C_i)   (1)

- L_c: modeling content (first component)
- L_g: modeling context with the distributional hypothesis (second component); the distributional hypothesis says that sentences occurring in similar contexts should have similar representations
- L_r: modeling context with the proximity hypothesis (third component); the proximity hypothesis says that sentences that are proximal should have similar representations

Modeling Content
- Our approach to modeling the content of a sentence is similar to the distributed bag-of-words (DBOW) model of Sen2Vec
- Given an input sentence v_i, we first map it to a unique vector φ(v_i) by looking up the corresponding vector in the sentence embedding matrix φ
- We then use φ(v_i) to predict each word v sampled from a window of words in v_i. Formally, the loss for modeling content using negative sampling is:

L_c(v_i, v) = -\log \sigma\big(w_v^T \phi(v_i)\big) - \sum_{s=1}^{S} \mathbb{E}_{v^s \sim \psi_c} \log \sigma\big(-w_{v^s}^T \phi(v_i)\big)   (2)
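
The same negative-sampling form is reused for the neighbor-prediction loss L_g in Eq. (3), so a single sketch covers both. Below is a minimal numpy rendering of Eq. (2); the embedding vector `phi_vi`, the output vectors `W`, and the uniform noise distribution standing in for ψ_c are illustrative placeholders, not the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(phi_vi, W, target_id, noise_ids):
    """Loss of Eq. (2): one positive target plus S noise samples.

    phi_vi    : (d,) vector phi(v_i) of sentence v_i
    W         : (num_targets, d) output vectors (words for L_c, nodes for L_g)
    target_id : index of the observed word v (or neighbor v_j)
    noise_ids : indices of S samples drawn from the noise distribution psi_c (or psi_g)
    """
    pos = -np.log(sigmoid(W[target_id] @ phi_vi))
    neg = -np.sum(np.log(sigmoid(-W[noise_ids] @ phi_vi)))
    return pos + neg

# Illustrative usage with toy dimensions and a uniform noise distribution.
rng = np.random.default_rng(0)
d, vocab, S = 600, 10_000, 5
phi_vi = rng.uniform(-0.5 / d, 0.5 / d, size=d)
W = np.zeros((vocab, d))
psi_c = np.full(vocab, 1.0 / vocab)          # placeholder for the real noise distribution
noise_ids = rng.choice(vocab, size=S, p=psi_c)
print(negative_sampling_loss(phi_vi, W, target_id=42, noise_ids=noise_ids))
```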

Modeling Distributional Similarity
- Our sentence-level distributional hypothesis is that if two sentences share many neighbors in the graph, their representations should be similar
- We formulate this in our model by asking the sentence vector to predict its neighboring nodes
- Formally, the loss for predicting a neighboring node v_j ∈ N(v_i) using the sentence vector φ(v_i) is:

L_g(v_i, v_j) = -\log \sigma\big(w_j^T \phi(v_i)\big) - \sum_{s=1}^{S} \mathbb{E}_{j^s \sim \psi_g} \log \sigma\big(-w_{j^s}^T \phi(v_i)\big)   (3)

Modeling Proximity
- According to our proximity hypothesis, sentences that are proximal in their contexts should have similar representations
- We use a Laplacian regularizer to model this
- The regularization loss for modeling proximity for a sentence v_i in its context N(v_i) is:

L_r(v_i, \mathcal{N}(v_i)) = \frac{\lambda}{C_i} \sum_{v_k \in \mathcal{N}(v_i)} \big\| \phi(v_i) - \phi(v_k) \big\|^2   (4)
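
A small numpy sketch of Eq. (4) and its gradient with respect to φ(v_i), under the assumption that the λ/C_i normalization above is the intended form; `phi`, `neighbors`, and the λ value are illustrative names, not the authors' code.

```python
import numpy as np

def proximity_loss_and_grad(phi, i, neighbors, lam=1.0):
    """Laplacian regularizer of Eq. (4) for sentence v_i and its context N(v_i).

    Returns (lam / C_i) * sum_k ||phi_i - phi_k||^2 and its gradient w.r.t. phi_i.
    """
    C_i = len(neighbors)
    diffs = phi[i] - phi[neighbors]                      # (C_i, d)
    loss = (lam / C_i) * np.sum(diffs ** 2)
    grad_i = (2.0 * lam / C_i) * diffs.sum(axis=0)       # pulls phi_i toward its neighbors
    return loss, grad_i

# Toy example: one SGD step that moves phi[0] toward its two context sentences.
phi = np.random.uniform(-0.5, 0.5, size=(3, 4))
loss, grad = proximity_loss_and_grad(phi, 0, [1, 2])
phi[0] -= 0.1 * grad
```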

Training Con-S2V

Algorithm 1: Training Con-S2V with SGD
Input : set of sentences V, graph G = (V, E)
Output: learned sentence vectors φ
1. Initialize model parameters: φ and the w's
2. Compute noise distributions: ψ_c and ψ_g
3. repeat
     for each sentence v_i ∈ V do
       for each content word v ∈ v_i do
         a) Generate a positive pair (v_i, v) and S negative pairs {(v_i, v^s)}, s = 1..S, using ψ_c
         b) Take a gradient step for L_c(v_i, v)
         c) Sample a neighboring node v_j from N(v_i)
         d) Generate a positive pair (v_i, v_j) and S negative pairs {(v_i, v_j^s)}, s = 1..S, using ψ_g
         e) Take a gradient step for L_g(v_i, v_j)
         f) Take a gradient step for L_r(v_i, N(v_i))
       end
     end
   until convergence
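
The loop below is a self-contained Python sketch of Algorithm 1, folding in the negative-sampling gradients for L_c and L_g and the Laplacian step for L_r. The uniform noise distributions, learning rate, epoch count, and toy corpus are assumptions for illustration; it is a structural sketch, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_con_s2v(sentences, neighbors, vocab_size, d=600, S=5, lam=1.0,
                  lr=0.025, epochs=5):
    """Schematic SGD training of Con-S2V following Algorithm 1.

    sentences : list of word-id lists, one per sentence v_i
    neighbors : neighbors[i] = indices of N(v_i) in the context graph G
    """
    n = len(sentences)
    phi = rng.uniform(-0.5 / d, 0.5 / d, size=(n, d))   # sentence vectors
    Wc = np.zeros((vocab_size, d))                       # word output vectors
    Wg = np.zeros((n, d))                                # node output vectors
    psi_c = np.full(vocab_size, 1.0 / vocab_size)        # placeholder noise distributions
    psi_g = np.full(n, 1.0 / n)

    def neg_sampling_step(i, target, W, psi):
        """One positive pair plus S negative pairs; update phi[i] and rows of W."""
        ids = np.concatenate(([target], rng.choice(len(psi), size=S, p=psi)))
        labels = np.zeros(S + 1)
        labels[0] = 1.0
        err = sigmoid(W[ids] @ phi[i]) - labels          # gradient of the NS loss
        grad_phi = err @ W[ids]
        W[ids] -= lr * np.outer(err, phi[i])
        phi[i] -= lr * grad_phi

    for _ in range(epochs):                              # "repeat ... until convergence"
        for i, words in enumerate(sentences):
            for w in words:
                neg_sampling_step(i, w, Wc, psi_c)       # steps (a)-(b): L_c
                if neighbors[i]:
                    j = int(rng.choice(neighbors[i]))    # step (c)
                    neg_sampling_step(i, j, Wg, psi_g)   # steps (d)-(e): L_g
                    diffs = phi[i] - phi[neighbors[i]]   # step (f): L_r
                    phi[i] -= lr * (2.0 * lam / len(diffs)) * diffs.sum(axis=0)
    return phi

# Toy corpus: three sentences over a 6-word vocabulary, chain-shaped context graph.
phi = train_con_s2v([[0, 1, 2], [3, 4], [5, 2]], [[1], [0, 2], [1]], vocab_size=6, d=8)
```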

Training Details
- Con-S2V is trained with stochastic gradient descent (SGD), where the gradient is obtained via backpropagation
- The number of noise samples (S) in negative sampling was 5
- In all our models, the embedding vectors (φ, ψ) were of 600 dimensions and were initialized with random numbers sampled from a small uniform distribution, U(−0.5/d, 0.5/d)
- The weight vectors ω were initialized to zero

Evaluation Tasks and Datasets
- We evaluate Con-S2V on summarization, classification, and clustering tasks
- Con-S2V learns the representation of a sentence by exploiting contextual information in addition to its content
- For this reason, we did not evaluate our models on tasks (e.g., sentiment classification) previously used to evaluate sentence representation models
- Classification and clustering evaluation requires a corpus of annotated sentences with ordering and document boundaries preserved, i.e., documents with sentence-level annotations

Evaluation Tasks (Summarization)
- The goal is to select the most important sentences to form an abridged version of the source document(s)
- We use the popular graph-based algorithm LexRank
- The input to LexRank is a graph whose nodes represent sentences and whose edges represent the cosine similarity between the vector representations (learned by the models) of the two corresponding sentences
- We use the benchmark DUC 2001 and DUC 2002 datasets for evaluation

Dataset     #Doc.   #Avg. Sen.   #Avg. Sum.
DUC 2001    486     40           2.17
DUC 2002    471     28           2.04

Table: Basic statistics about the DUC datasets
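
A hedged sketch of this pipeline: build the sentence graph from cosine similarities of the learned vectors and rank nodes by centrality. Weighted PageRank is used below as a stand-in for LexRank's power-method ranking, and the similarity threshold is an illustrative choice, not a value from the paper.

```python
import numpy as np
import networkx as nx
from sklearn.metrics.pairwise import cosine_similarity

def rank_sentences(sent_vectors, threshold=0.1):
    """Nodes are sentences; edge weights are cosine similarities of their vectors."""
    sims = cosine_similarity(sent_vectors)
    np.fill_diagonal(sims, 0.0)              # no self-loops
    sims[sims < threshold] = 0.0             # drop weak edges (illustrative threshold)
    graph = nx.from_numpy_array(sims)        # undirected weighted graph
    scores = nx.pagerank(graph, weight="weight")
    return sorted(scores, key=scores.get, reverse=True)  # most central first

# Toy usage: random vectors stand in for Con-S2V sentence representations.
vecs = np.random.rand(10, 600)
print(rank_sentences(vecs)[:3])              # indices of the top-3 summary sentences
```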

Evaluation Tasks (Classification and Clustering)
- We evaluate our models by measuring how effective the learned vectors are when used as features for classifying or clustering sentences into topics
- We use a MaxEnt classifier for classification and the k-means++ clustering algorithm for clustering
- We use the standard text categorization corpora Reuters-21578 and 20-Newsgroups. Reuters-21578 (henceforth Reuters) is a collection of 21,578 news documents covering 672 topics; 20-Newsgroups is a collection of about 20,000 news articles organized into 20 different topics
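
A minimal scikit-learn sketch of this evaluation setup, with logistic regression standing in for the MaxEnt classifier (the two are equivalent for this purpose) and KMeans with its default k-means++ initialization for clustering; the random feature matrices and label counts are placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans
from sklearn.metrics import f1_score

# Placeholder data: learned sentence vectors and sentence-level topic labels.
rng = np.random.default_rng(0)
X_train, y_train = rng.random((200, 600)), rng.integers(0, 8, 200)
X_test, y_test = rng.random((80, 600)), rng.integers(0, 8, 80)

# MaxEnt classification on the learned vectors.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("macro-F1:", f1_score(y_test, clf.predict(X_test), average="macro"))

# k-means++ clustering into 8 topics.
km = KMeans(n_clusters=8, init="k-means++", n_init=10, random_state=0).fit(X_test)
cluster_ids = km.labels_
```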

Classification and Clustering (Generating Sentence-level Topic Annotations)
- One option is to assume that all the sentences of a document share the document's topic label
- This naive assumption induces a lot of noise: although sentences in a document collectively address a common topic, not all sentences are directly linked to that topic; many play supporting roles
- To minimize this noise, we employ our extractive summarizer to select the top 20% of sentences of each document as representatives of the document, and assign them the same topic label as the document
- Note that the sentence vectors are learned from the entire dataset, independently of these annotations
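
A small sketch of this annotation step, assuming a `summarizer` callable that returns sentence indices ranked by importance (e.g., the LexRank ranking sketched earlier); the function name and the length-based stand-in summarizer are hypothetical.

```python
def annotate_sentences(doc_sentences, doc_label, summarizer, keep_frac=0.20):
    """Assign the document's topic label only to its top keep_frac sentences."""
    k = max(1, round(keep_frac * len(doc_sentences)))
    top = summarizer(doc_sentences)[:k]
    return [(doc_sentences[i], doc_label) for i in top]

# Hypothetical usage with sentence length as a stand-in importance ranking.
rank_by_length = lambda sents: sorted(range(len(sents)), key=lambda i: -len(sents[i]))
print(annotate_sentences(
    ["Short.", "A much longer, more central sentence about the item.", "Mid-size one."],
    "forsale", rank_by_length))
```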

Dataset Statistics for Classification and Clustering

Dataset      #Doc.   Total #sen.   Annot. #sen.   Train #sen.   Test #sen.   #Class
Reuters      9,001   42,192        13,305         7,738         3,618        8
Newsgroups   7,781   95,809        22,374         10,594        9,075        8

Table: Statistics about Reuters and Newsgroups

Metrics for Evaluation
- For summarization, we use the widely used automatic evaluation metric ROUGE to evaluate the system-generated summaries; ROUGE computes n-gram recall between a system-generated summary and a set of human-authored reference summaries
- For classification, we report raw accuracy, macro-averaged F1-score, and Cohen's κ
- For clustering, we report V-measure and adjusted mutual information (AMI)
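
The classification and clustering metrics are all available in scikit-learn, as in the toy sketch below with made-up labels; ROUGE requires a separate toolkit (e.g., the original ROUGE script or a reimplementation) and is not shown.

```python
from sklearn.metrics import (accuracy_score, adjusted_mutual_info_score,
                             cohen_kappa_score, f1_score, v_measure_score)

y_true, y_pred = [0, 1, 2, 2, 1], [0, 2, 2, 2, 1]            # toy classification labels
print("Accuracy :", accuracy_score(y_true, y_pred))
print("macro-F1 :", f1_score(y_true, y_pred, average="macro"))
print("Cohen's k:", cohen_kappa_score(y_true, y_pred))

topics, clusters = [0, 0, 1, 1, 2], [1, 1, 0, 0, 0]          # toy clustering assignment
print("V-measure:", v_measure_score(topics, clusters))
print("AMI      :", adjusted_mutual_info_score(topics, clusters))
```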

Models Compared
- Existing distributed models: Sen2Vec, W2V-avg, C-Phrase, FastSent, and Skip-Thought
- Non-distributed model: Tf-Idf
- Retrofitted models: Ret-sim, Ret-dis
- Regularized models: Reg-sim, Reg-dis — a variant of our model in which the loss capturing distributional similarity, L_g(v_i, v_j), is turned off
- Our model: Con-S2V-sim, Con-S2V-dis

Similarity Network Construction
- Our similarity context allows any other sentence in the corpus to be in the context of a sentence, depending on how similar they are
- We first represent the sentences with vectors learned by Sen2Vec, then measure the cosine similarity between the vectors
- We restrict the context size of a sentence for computational efficiency
- First, we set thresholds for intra- and across-document connections: sentences within a document are connected only if their similarity is above a pre-specified threshold δ, and sentences across documents are connected only if their similarity is above another pre-specified threshold γ
- Second, we allow up to 20 most-similar neighbors per sentence
- We call the resulting network the similarity network
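
A sketch of this construction, assuming a matrix of Sen2Vec vectors and a document id per sentence; δ = 0.5 and γ = 0.8 follow the settings reported on the next slide, while the function and variable names are illustrative.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def build_similarity_network(vectors, doc_ids, delta=0.5, gamma=0.8, max_nbrs=20):
    """Connect sentence pairs whose cosine similarity exceeds delta (same document)
    or gamma (different documents), keeping at most max_nbrs neighbors per sentence."""
    sims = cosine_similarity(vectors)
    neighbors = []
    for i in range(len(vectors)):
        cands = []
        for j in np.argsort(-sims[i]):               # most similar candidates first
            if j == i:
                continue
            thr = delta if doc_ids[i] == doc_ids[j] else gamma
            if sims[i, j] > thr:
                cands.append(int(j))
            if len(cands) == max_nbrs:
                break
        neighbors.append(cands)
    return neighbors

# Toy usage: six sentences from two documents, random vectors as placeholders.
vecs = np.random.rand(6, 600)
print(build_similarity_network(vecs, doc_ids=[0, 0, 0, 1, 1, 1]))
```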

Optimal Parameter Settings
- For each dataset described earlier, we randomly selected 20% of the documents from the training set to form a held-out validation set on which we tune the hyper-parameters
- We optimized F1 for classification, AMI for clustering, and ROUGE-1 for summarization
- For Ret-sim and Ret-dis, the number of iterations was set to 20
- For the similarity context, the intra- and across-document thresholds δ and γ were set to 0.5 and 0.8
- Optimal parameter values are given in the following table:

Dataset     Task   Sen2Vec      FastSent     W2V-avg      Reg-sim            Reg-dis            Con-S2V-sim        Con-S2V-dis
                   (win. size)  (win. size)  (win. size)  (win., reg. str.)  (win., reg. str.)  (win., reg. str.)  (win., reg. str.)
Reuters     clas.  8            10           10           (8, 1.0)           (8, 1.0)           (8, 0.8)           (8, 1.0)
Reuters     clus.  12           8            12           (12, 0.3)          (12, 1.0)          (12, 0.8)          (12, 0.8)
Newsgroups  clas.  10           8            10           (10, 1.0)          (10, 1.0)          (10, 1.0)          (10, 1.0)
Newsgroups  clus.  12           12           12           (12, 1.0)          (12, 1.0)          (12, 0.8)          (10, 1.0)
DUC 2001    sum.   10           12           12           (10, 0.8)          (10, 0.5)          (10, 0.3)          (10, 0.3)
DUC 2002    sum.   8            8            10           (8, 0.8)           (8, 0.3)           (8, 0.3)           (8, 0.3)

Table: Optimal values of the hyper-parameters for different models on different tasks
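
A schematic grid search matching this protocol: hold out 20% of the training documents, sweep window size and regularization strength, and keep the pair that maximizes the task metric. The candidate grids and the stub training/scoring callables are assumptions for illustration.

```python
import itertools
import random

def tune_hyperparameters(train_docs, train_fn, score_fn,
                         window_sizes=(8, 10, 12),
                         reg_strengths=(0.3, 0.5, 0.8, 1.0)):
    """Return the (window size, regularization strength) pair with the best
    validation score; the metric is F1, AMI, or ROUGE-1 depending on the task."""
    docs = list(train_docs)
    random.shuffle(docs)
    split = int(0.8 * len(docs))                      # 80% train / 20% validation
    train, valid = docs[:split], docs[split:]
    return max(itertools.product(window_sizes, reg_strengths),
               key=lambda p: score_fn(train_fn(train, *p), valid))

# Hypothetical usage with stub training and scoring functions.
toy_train = lambda docs, win, reg: (win, reg)
toy_score = lambda model, docs: -abs(model[0] - 10) - abs(model[1] - 0.8)
print(tune_hyperparameters(range(100), toy_train, toy_score))   # -> (10, 0.8)
```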

Classification and Clustering Performance

Topic classification results (F1 / Acc / κ), reported as differences from Sen2Vec:

Model          Reuters                    Newsgroups
Sen2Vec        83.25 / 83.91 / 79.37      79.38 / 79.47 / 76.16
W2V-avg        +2.06 / +1.91 / +2.51      −0.42 / −0.44 / −0.50
C-Phrase       −2.33 / −2.01 / −2.78      −2.49 / −2.38 / −2.86
FastSent       −0.37 / −0.29 / −0.41      −12.23 / −12.17 / −14.21
Skip-Thought   −19.13 / −15.61 / −21.8    −13.79 / −13.47 / −15.76
Tf-Idf         −3.51 / −2.68 / −3.85      −9.95 / −9.72 / −11.55
Ret-sim        +0.92 / +1.28 / +1.65      +2.00 / +1.97 / +2.27
Ret-dis        +1.66 / +1.79 / +2.30      +5.00 / +4.91 / +5.71
Reg-sim        +2.53 / +2.53 / +3.28      +3.31 / +3.29 / +3.81
Reg-dis        +2.52 / +2.43 / +3.17      +5.41 / +5.34 / +6.20
Con-S2V-sim    +3.83 / +3.55 / +4.62      +4.52 / +4.50 / +5.21
Con-S2V-dis    +4.29 / +4.04 / +5.22      +7.68 / +7.56 / +8.80

Topic clustering results (V / AMI), reported as differences from Sen2Vec:

Model          Reuters            Newsgroups
Sen2Vec        42.74 / 40.00      35.30 / 34.74
W2V-avg        −11.96 / −10.18    −17.90 / −18.50
C-Phrase       −11.94 / −10.80    −1.70 / −1.44
FastSent       −15.54 / −13.06    −34.40 / −34.16
Skip-Thought   −29.94 / −28.00    −27.50 / −27.04
Tf-Idf         −21.34 / −20.14    −29.20 / −30.60
Ret-sim        +3.72 / +3.34      +5.22 / +5.70
Ret-dis        +4.56 / +4.12      +6.28 / +6.76
Reg-sim        +4.76 / +4.40      +12.78 / +12.18
Reg-dis        +7.40 / +6.82      +12.54 / +12.44
Con-S2V-sim    +14.98 / +14.38    +13.68 / +13.56
Con-S2V-dis    +9.30 / +8.36      +15.10 / +15.20

Table: Performance of our models on topic classification and clustering tasks in comparison to Sen2Vec (+/− denote improvement/degradation over the Sen2Vec scores in the first row).

Summarization Performance

ROUGE-1 scores, reported as differences from Sen2Vec:

Model          DUC 2001   DUC 2002
Sen2Vec        43.88      54.01
W2V-avg        −0.62      +1.44
C-Phrase       +2.52      +1.68
FastSent       −4.15      −7.53
Skip-Thought   +0.88      −2.65
Tf-Idf         +4.83      +1.51
Ret-sim        −0.62      +0.42
Ret-dis        +0.45      −0.37
Reg-sim        +2.90      +2.02
Reg-dis        −1.92      −8.77
Con-S2V-sim    +3.16      +2.71
Con-S2V-dis    +1.15      −4.46

Table: ROUGE-1 scores of the models on the DUC datasets in comparison with Sen2Vec.

Conclusion and Future Work
- We have presented a novel model for learning distributed representations of sentences that considers the content as well as the context of a sentence
- One important property of our model is that it encodes a sentence directly and treats neighboring sentences as atomic units
- Apart from the improvements we achieve on various tasks, this property makes our model quite efficient to train compared to compositional methods such as encoder-decoder models (e.g., SDAE, Skip-Thought) that compose a sentence vector from word vectors

Conclusion and Future Work
- It would be interesting to see how our model compares with compositional models on a sentiment classification task
- However, this would require a new dataset of comments with sentence-level sentiment annotations
- We intend to create such datasets and evaluate the models in the future