Abstractive Text Summarization


Abstractive Text Summarization Using Seq2Seq Attention Models
Soumye Singhal
Prof. Arnab Bhattacharya
Department of Computer Science and Engineering, Indian Institute of Technology, Kanpur
22nd November, 2017

Outline
- The Problem: Why Text Summarization, Extractive vs Abstractive
- Baseline Model: Vanilla Encoder-Decoder, Attention is all you need!, Metrics and Datasets
- Improvements: Hierarchical Attention, Pointer-Generator Network, Coverage Mechanism, Intra-Attention, Reinforcement-Based Training
- Challenges and Way Forward


Why Text Summarization? In the modern Internet age, textual data is ever increasing, so we need some way to condense it while preserving its information and meaning. Text summarization is a fundamental problem that we need to solve, and solving it would enable easy and fast retrieval of information.


Extractive vs Abstractive
Extractive summarization: copy parts or sentences of the source text and combine them to render a summary. The importance of a sentence is judged from linguistic and statistical features.
Abstractive summarization: these methods try to first understand the text and then rephrase it more briefly, possibly using different words. For a perfect abstractive summary, the model has to truly understand the document and then express that understanding concisely, possibly with new words and phrases. This is much harder than extraction, since it requires complex capabilities such as generalization, paraphrasing and incorporating real-world knowledge.

Deep Learning. The majority of prior work has focused on extractive approaches, because it is easier to define hard-coded rules for selecting important sentences than to generate new ones. But extractive systems are very restrictive and often do not summarize long and complex texts well, and traditional rule-based AI does poorly on abstractive text summarization. Inspired by the performance of neural attention models on the closely related task of machine translation, Rush et al. 2015 and Chopra et al. 2016 applied the neural attention model to abstractive text summarization and found that it already performed very well, beating the previous non-deep-learning approaches.


Recurrent Neural Network
Figure: an unrolled RNN. w_i are the input tokens of the source article and h_i the encoder hidden states. P_vocab = softmax(V h_i + b) is the distribution over the vocabulary from which we sample the output out_i.
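As a rough illustration (not the implementation used in the talk), a minimal numpy sketch of that output projection; the shapes of V and b are assumptions chosen for clarity.

```python
import numpy as np

def vocab_distribution(h_i, V, b):
    """P_vocab = softmax(V h_i + b): project a hidden state onto the vocabulary.
    Illustrative sketch; V is assumed to be (vocab_size, hidden), b (vocab_size,)."""
    logits = V @ h_i + b
    exp = np.exp(logits - logits.max())  # subtract the max for numerical stability
    return exp / exp.sum()
```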

Long Short-Term Memory. If the context of a word is far away, RNNs struggle to learn it (the vanishing gradient problem). LSTMs selectively pass on and forget information. (Image taken from colah.github.io)

Long Short-Term Memory
Forget gate layer: f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
Input gate layer: i_t = σ(W_i · [h_{t-1}, x_t] + b_i), C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)
Cell state update: C_t = f_t * C_{t-1} + i_t * C̃_t
Output gate layer: o_t = σ(W_o · [h_{t-1}, x_t] + b_o), h_t = o_t * tanh(C_t)
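A minimal numpy sketch of a single LSTM step following the gate equations above; the parameter names and shapes (weights applied to the concatenation [h_{t-1}, x_t]) are illustrative assumptions, not the exact implementation behind the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, params):
    """One LSTM step. params holds W_f, W_i, W_C, W_o of shape
    (hidden, hidden + input) and biases b_f, b_i, b_C, b_o of shape (hidden,)."""
    concat = np.concatenate([h_prev, x_t])                      # [h_{t-1}, x_t]
    f_t = sigmoid(params["W_f"] @ concat + params["b_f"])       # forget gate
    i_t = sigmoid(params["W_i"] @ concat + params["b_i"])       # input gate
    C_tilde = np.tanh(params["W_C"] @ concat + params["b_C"])   # candidate cell state
    C_t = f_t * C_prev + i_t * C_tilde                          # new cell state
    o_t = sigmoid(params["W_o"] @ concat + params["b_o"])       # output gate
    h_t = o_t * np.tanh(C_t)                                    # new hidden state
    return h_t, C_t
```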

Bi-Directional RNN
Figure: a bidirectional RNN, with a forward RNN and a backward RNN running over the input embeddings and their outputs combined at each position.
Two passes over the source compute hidden states h_t^fwd and h_t^bwd; h_t = [h_t^fwd, h_t^bwd] now encodes both past and future information.

Vanilla Encoder-Decoder. It consists of an encoder (a bidirectional LSTM) and a decoder LSTM network. The final hidden state from the encoder (the thought vector) is passed into the decoder. (Image taken from colah.github.io)


Why do we need Attention? The basic encoder-decoder model fails to scale up. The main bottleneck is the fixed-size thought vector, which cannot capture all the relevant information of the input sequence as inputs grow longer. At each generation step, only a part of the input is relevant. This is where attention comes in: it helps the model decide which part of the input encoding to focus on at each generation step in order to generate novel words. At each step, the decoder outputs a hidden state h_t, from which we generate the output.

Attention is all you need!
importance_{it} = V tanh(e_i W_1 + h_t W_2 + b_attn)
Attention distribution: a^t = softmax(importance_t)
Context vector: h*_t = Σ_i a^t_i e_i
(Image stylized from https://talbaumel.github.io/attention/)
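A minimal numpy sketch of this attention computation; the shapes of W_1, W_2, b_attn and the attention vector V (called v_attn below) are assumptions for illustration.

```python
import numpy as np

def attention(enc_states, dec_state, W1, W2, v_attn, b_attn):
    """Attention over encoder states, following the slide's formulas.
    enc_states: (n_src, d) encoder hidden states e_i
    dec_state:  (d,)       decoder hidden state h_t
    W1, W2: (d, d_attn), v_attn: (d_attn,), b_attn: (d_attn,)"""
    # importance_{it} = V tanh(e_i W1 + h_t W2 + b_attn) for every source position i
    scores = np.tanh(enc_states @ W1 + dec_state @ W2 + b_attn) @ v_attn  # (n_src,)
    # attention distribution a^t = softmax(scores)
    a_t = np.exp(scores - scores.max())
    a_t /= a_t.sum()
    # context vector h*_t = sum_i a^t_i e_i
    context = a_t @ enc_states                                            # (d,)
    return a_t, context
```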

Training. The context vector is then fed through two linear layers to generate the distribution over the vocabulary from which we sample:
P_vocab(w) = softmax(V'(V [h_t, h*_t] + b) + b')
The loss at time step t is loss_t = -log P(w*_t), where w*_t is the target summary word, and the overall loss is LOSS = (1/T) Σ_{t=0}^{T} loss_t. We then use the backpropagation algorithm to compute the gradient and learn the parameters.
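For concreteness, a tiny numpy sketch of this loss, assuming we already have P_vocab at each decoder step and the target word ids (both hypothetical inputs).

```python
import numpy as np

def sequence_nll(step_distributions, target_ids):
    """Average negative log-likelihood of the target summary.
    step_distributions: list of (vocab,) arrays, P_vocab at each decoder step
    target_ids:         list of target word indices w*_t"""
    losses = [-np.log(dist[w] + 1e-12)            # loss_t = -log P(w*_t)
              for dist, w in zip(step_distributions, target_ids)]
    return float(np.mean(losses))                 # LOSS = (1/T) * sum_t loss_t
```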

Generating the Summaries. At each step, the decoder outputs a probability distribution over the target vocabulary. To get the output word at this step we can do one of the following:
Greedy sampling, i.e. choose the mode of the distribution.
Sample from the distribution.
Beam search: keep the top k most likely target words and feed them all into the next decoder input. At each time step t the decoder therefore gets k different possible inputs; it computes the top k most likely target words for each of these inputs, keeps only the top k of the resulting k^2 candidates, and rejects the rest. This process continues, so promising candidate words get a fair shot at forming the summary instead of being discarded by a single greedy choice.
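A minimal beam-search sketch in Python; `step_fn`, `bos_id` and `eos_id` are hypothetical placeholders for the decoder's next-word log-probability function and the start/end tokens.

```python
import numpy as np

def beam_search(step_fn, bos_id, eos_id, beam_size=4, max_len=30):
    """Minimal beam search. `step_fn(prefix)` is assumed to return a (vocab,)
    array of log-probabilities for the next word given the decoded prefix."""
    beams = [([bos_id], 0.0)]                        # (prefix, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix[-1] == eos_id:                 # finished hypotheses are kept as-is
                candidates.append((prefix, score))
                continue
            log_probs = step_fn(prefix)
            top_k = np.argsort(log_probs)[-beam_size:]
            for w in top_k:                          # expand each beam by k words
                candidates.append((prefix + [int(w)], score + float(log_probs[w])))
        # keep only the top k of the (at most) k^2 candidates
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams[0][0]                               # best-scoring summary
```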

Metrics
If the target summary is not given: we need a similarity measure between the summary and the source document. In a good summary the topics covered are similar, so we can use topic models like Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA).
If the target summary is given: use metrics like ROUGE (Lin 2004) and METEOR, which are essentially string-matching metrics.
ROUGE-N measures the overlap of N-grams between the system and reference summaries.
ROUGE-L is based on the longest common subsequence and takes sentence-level similarity into account.
ROUGE-S is the skip-bigram variant.
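A simplified sketch of ROUGE-N as plain n-gram recall against a single reference (the official ROUGE package also handles stemming, multiple references and F-measures):

```python
from collections import Counter

def rouge_n_recall(system_tokens, reference_tokens, n=2):
    """ROUGE-N as n-gram recall of the system summary against one reference."""
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    sys_counts = ngrams(system_tokens, n)
    ref_counts = ngrams(reference_tokens, n)
    overlap = sum(min(count, sys_counts[gram]) for gram, count in ref_counts.items())
    total = sum(ref_counts.values())
    return overlap / total if total else 0.0

# rouge_n_recall("the cat sat on the mat".split(), "the cat was on the mat".split(), n=2)
```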

Datasets
Sentence-level datasets: DUC-2004, Gigaword.
Large-scale dataset by Nallapati et al. 2016: the CNN/Daily Mail dataset adapted for summarization.

Problems with the Baseline. Though the baseline gives decent results, it is clearly plagued by several problems:
It sometimes reproduces factually incorrect details.
It struggles with out-of-vocabulary (OOV) words.
It is somewhat repetitive, focusing on the same word or phrase multiple times.
It focuses mainly on single-sentence summary tasks such as headline generation.


Feature-rich Encoder. Introduced by Nallapati et al. 2016. The aim is to feed more information about the source text into the encoder: apart from word embeddings such as word2vec or GloVe, also incorporate linguistic features like POS (part-of-speech) tags, named-entity tags, and TF-IDF statistics. Though this speeds up training, it hurts the abstractive capabilities of the model.

Hierarchical Attention. Introduced by Nallapati et al. 2016. For larger source documents, they also try to identify key sentences for the summary. Two bidirectional RNNs run over the source text: one at the word level and another at the sentence level. The word-level attention is then re-weighted by the corresponding sentence-level attention:
P^a(j) = P^a_w(j) P^a_s(s(j)) / Σ_{k=1}^{N_d} P^a_w(k) P^a_s(s(k))
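A small numpy sketch of this re-weighting, assuming we already have the word-level and sentence-level attention distributions and a word-to-sentence index map (all hypothetical inputs):

```python
import numpy as np

def hierarchical_attention(word_attn, sent_attn, sent_of_word):
    """Re-weight word-level attention by sentence-level attention.
    word_attn:    (N_d,) word-level attention P^a_w(j) over all source words
    sent_attn:    (S,)   sentence-level attention P^a_s(s)
    sent_of_word: (N_d,) integer index s(j) of the sentence containing word j"""
    combined = word_attn * sent_attn[sent_of_word]   # P^a_w(j) * P^a_s(s(j))
    return combined / combined.sum()                 # renormalize over all source words
```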


Pointer-Generator Network. Introduced by See et al. 2017. It helps address the challenges of OOV words and factual errors, and works better for multi-sentence summaries. The idea is to choose, at each step of generation, between generating a word from the fixed vocabulary and copying one from the source document. This brings in the power of extractive methods by pointing (Vinyals et al. 2015). So for OOV words, where simple generation would emit UNK, this network can copy the OOV word from the source text.

Pointer-Generator Network (image taken from the blog www.abigailsee.com)

Pointer-Generator Network. At each step we calculate a generation probability p_gen:
p_gen = σ(w_{h*}^T h*_t + w_s^T h_t + w_x^T x_t + b_ptr)
where x_t is the decoder input and the parameters w_{h*}, w_s, w_x, b_ptr are learnable. This p_gen is then used as a soft switch:
P(w) = p_gen P_vocab(w) + (1 - p_gen) Σ_{i: w_i = w} a^t_i
Note that for an OOV word P_vocab(w) = 0, so we end up pointing.
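A small numpy sketch of the final mixing step, assuming source words have already been mapped to extended-vocabulary ids as in See et al. 2017 (array names and shapes are illustrative):

```python
import numpy as np

def final_distribution(p_vocab, attn, src_ext_ids, p_gen, n_src_oov):
    """P(w) = p_gen * P_vocab(w) + (1 - p_gen) * sum_{i: w_i = w} a^t_i.
    p_vocab:     (V,)     generator distribution over the fixed vocabulary
    attn:        (n_src,) attention a^t over source positions
    src_ext_ids: (n_src,) extended-vocabulary id of each source word
    n_src_oov:   number of source words outside the fixed vocabulary"""
    # extend the vocabulary distribution with zeros for in-article OOV words
    p_final = np.concatenate([p_gen * p_vocab, np.zeros(n_src_oov)])
    # scatter-add the copy probabilities onto the ids of the source words
    np.add.at(p_final, src_ext_ids, (1.0 - p_gen) * attn)
    return p_final   # for an OOV word P_vocab is 0, so only the copy term remains
```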


Coverage Mechanism. The repetitiveness of the model can be attributed to increased and continuous attention to a particular word, so we can use the coverage model of Tu et al. 2016.
Coverage vector: c^t = Σ_{t'=0}^{t-1} a^{t'}
Intuitively, by summing the attention over all previous steps we keep track of how much coverage each encoding e_i has received. This coverage is then given as an extra input to the attention mechanism:
importance_{it} = V tanh(e_i W_1 + h_t W_2 + W_c c^t_i + b_attn)
We also penalize attending to things that have already been covered:
covloss_t = Σ_i min(a^t_i, c^t_i)
which penalizes overlap between the attention at this step and the coverage so far. The loss becomes
loss_t = -log P(w*_t) + λ covloss_t
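A short numpy sketch of the coverage vector and total coverage loss over a decoded sequence (the attention matrix is a hypothetical input):

```python
import numpy as np

def coverage_and_loss(attentions):
    """Coverage vector and total coverage loss for one decoded sequence.
    attentions: (T, n_src) array of attention distributions a^t, t = 0..T-1."""
    attentions = np.asarray(attentions)
    coverage = np.zeros(attentions.shape[1])              # c^0 = 0, nothing covered yet
    total_covloss = 0.0
    for a_t in attentions:
        total_covloss += np.minimum(a_t, coverage).sum()  # covloss_t = sum_i min(a^t_i, c^t_i)
        coverage += a_t                                    # c^{t+1} = c^t + a^t
    return coverage, total_covloss
```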


Intra-Attention. Traditional approaches attend over the encoder states, but the current word being generated also depends on which words were generated previously. So Paulus et al. 2017 used intra-attention over the decoder outputs, which also helps avoid repetition. A decoder context vector c^d_t is generated in a way similar to the encoder attention and is passed on to generate P_vocab(w).


How to correct my mistakes? During training we always feed the correct inputs to the decoder, no matter what the output was at the previous step, so the model never learns to recover from its own mistakes: it assumes it will be given the gold token at each decoding step. During testing, if the model produces even one wrong word, recovery is hard. A naive way to rectify this problem is, during training, to toss a coin with P[heads] = p at each step to decide between using the generated output from the previous step and using the gold token.
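A one-line illustration of that coin toss in Python (the probability p is a free choice for illustration, not a value from the talk):

```python
import random

def next_decoder_input(gold_token, generated_token, p=0.25):
    """With probability p feed the model's own previous output,
    otherwise feed the gold token (scheduled-sampling-style sketch)."""
    return generated_token if random.random() < p else gold_token
```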

Training using Reinforcement Learning. There are various ways in which a document can be effectively summarized; the reference summary is just one of them, so there should be some scope for variation in the summary. This is the idea behind the reinforcement-based learning introduced by Paulus et al. 2017, which gave a significant improvement over the baseline and is the current state of the art. During training, we first let the model generate a summary using its own decoder outputs as inputs. After the model produces its own summary, we evaluate it against the reference summary using the ROUGE metric and define a loss based on this score: if the score is high, the summary is good and the loss should be low, and vice versa.

Training using Reinforcement Learning generates Summary compares Model Scorer Golden summary updates Reward returns

Policy Learning. We use self-critical policy gradient training. We generate two output sequences: y^s, obtained by sampling y^s_t ~ P(y^s_t | y^s_1, ..., y^s_{t-1}, x), and ŷ, obtained by greedy search. y* is the ground truth, and r(y) is the reward for a sequence y compared with y*. The loss is
L_rl = (r(ŷ) - r(y^s)) Σ_t log P(y^s_t | y^s_1, ..., y^s_{t-1}, x)
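A minimal sketch of this self-critical loss, assuming the sampled sequence's per-step log-probabilities and the two ROUGE rewards are already available (inputs are hypothetical):

```python
import numpy as np

def self_critical_loss(sample_log_probs, reward_sampled, reward_greedy):
    """L_rl = (r(y_hat) - r(y^s)) * sum_t log P(y^s_t | y^s_<t, x).
    sample_log_probs: (T,) log-probabilities of the sampled summary y^s
    reward_sampled:   r(y^s), e.g. ROUGE of the sampled summary vs. the reference
    reward_greedy:    r(y_hat), ROUGE of the greedily decoded baseline summary"""
    return (reward_greedy - reward_sampled) * float(np.sum(sample_log_probs))
```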

Problems in Training using Reinforcement Learning. It is possible to achieve a very high ROUGE score without the summary being human-readable, which reflects that ROUGE does not exactly capture how humans evaluate summaries. Since the above method optimizes for ROUGE, it may produce summaries with very high ROUGE scores that are barely readable. To curb this problem, we train the model in a mixed fashion using both reinforcement learning and supervised training. We can interpret this as RL training giving a global, sentence/summary-level supervision and supervised training giving a local, word-level supervision:
L_mixed = γ L_rl + (1 - γ) L_ml

Challenges. As pointed out by Paulus et al. 2017, ROUGE is a deficient metric. Dataset issues: the majority of the available data consists of news articles, for which a good summary can be produced by looking only at the first few sentences; all the models discussed above assume this and look at only the top 5-6 sentences of the source article. We need a richer dataset for multi-sentence text summarization. Scalability issues: the multi-sentence problem remains largely unsolved and requires a lot of data and computational power.

Future Work. To address the problem with the ROUGE metric in the reinforcement-learning-based training method, we could instead first learn a separate discriminator which, given a document and a corresponding summary, tells how good the summary is. Long-document summarization suffers from the vanishing gradient problem: LSTMs help information pass forward over longer spans, but errors do not propagate far back in time well, roughly 20-25 steps at most. This motivates logarithmic residual LSTMs.

Logarithmic Residual LSTMs
Figure: schematic of the proposed logarithmic residual LSTM, showing hidden states at steps t-2, t-1, t over inputs x_1, ..., x_t with residual skip connections across time steps.

References I
Chopra, Sumit et al. (2016). Abstractive sentence summarization with attentive recurrent neural networks. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 93-98.
Lin, Chin-Yew (2004). ROUGE: A package for automatic evaluation of summaries. In: Text Summarization Branches Out: Proceedings of the ACL-04 Workshop. Vol. 8. Barcelona, Spain.
Nallapati, Ramesh et al. (2016). Abstractive text summarization using sequence-to-sequence RNNs and beyond. In: arXiv preprint arXiv:1602.06023.
Paulus, Romain et al. (2017). A Deep Reinforced Model for Abstractive Summarization. In: arXiv preprint arXiv:1705.04304.

References II
Rush, Alexander M. et al. (2015). A neural attention model for abstractive sentence summarization. In: arXiv preprint arXiv:1509.00685.
See, Abigail et al. (2017). Get To The Point: Summarization with Pointer-Generator Networks. In: arXiv preprint arXiv:1704.04368.
Tu, Zhaopeng et al. (2016). Modeling coverage for neural machine translation. In: arXiv preprint arXiv:1601.04811.
Vinyals, Oriol et al. (2015). Pointer networks. In: Advances in Neural Information Processing Systems, pp. 2692-2700.