Modern Challenges in Building End-to-End Dialogue Systems

Modern Challenges in Building End-to-End Dialogue Systems. Ryan Lowe, McGill University.

Primary Collaborators: Joelle Pineau (McGill U.), Iulian V. Serban (U. Montreal), Mike Noseworthy (McGill), Chia-Wei Liu (McGill), Nissan Pow (McGill), Laurent Charlin (HEC Montreal).

Dialogue Systems

Modular Dialogue Systems. A traditional system consists of modules, each optimized with a separate objective function. This achieves good performance with small amounts of data. Problem: it does not work well in general domains!

End-to-End Dialogue Systems. A single model trained directly on conversational data, using a single objective function, usually maximum likelihood on the next response. There has been significant recent work using neural networks to predict the next response (Ritter et al., 2011; Sordoni et al., 2015; Shang et al., 2015).
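As a sketch (not taken from the slides), the usual maximum-likelihood objective for next-response prediction can be written as follows, where c is the dialogue context, r the next response, and r_t its t-th token:

    \mathcal{L}(\theta) = \sum_{(c,\, r) \in \mathcal{D}} \sum_{t=1}^{|r|} \log p_\theta\!\left(r_t \mid r_{<t},\, c\right)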

End-to-End Dialogue Systems. Advantages of end-to-end systems: 1) No feature engineering required (only architecture engineering). 2) Can be transferred to different domains. 3) No supervised data required for each module! (Collecting this data does not scale well.)

Challenge #1: Data

Dialogue Datasets. Building general-purpose dialogue systems requires lots of data. The best datasets are proprietary. We need large (>500k dialogues), open-source datasets to make progress.

Ubuntu Dialogue Corpus. Large dataset of ~1 million tech support dialogues, scraped from the Ubuntu IRC channel; 2-person dialogues extracted from the chat stream. Lowe*, Pow*, Serban, Pineau. The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems. SIGDIAL, 2015.
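To illustrate the kind of extraction involved, here is a minimal, hypothetical Python sketch (not the actual corpus-building code) that segments 2-person dialogues out of a chat stream by assuming a message written as "name: text" belongs to a conversation between the sender and that name:

    import re
    from collections import defaultdict

    def extract_two_person_dialogues(messages, min_turns=3):
        """Toy heuristic: a message that starts with another user's name
        ('recipient: text') is treated as a turn in a dialogue between the
        sender and that recipient; pairs with enough turns are kept."""
        dialogues = defaultdict(list)              # unordered user pair -> turns
        for sender, text in messages:              # messages: iterable of (user, utterance)
            match = re.match(r"^(\S+)[:,]\s+(.*)", text)
            if not match:
                continue                           # this sketch skips unaddressed chatter
            recipient, body = match.groups()
            pair = tuple(sorted((sender, recipient)))
            dialogues[pair].append((sender, body))
        return [turns for turns in dialogues.values() if len(turns) >= min_turns]

    # Example (made-up log lines):
    # logs = [("alice", "bob: my wifi broke after the update"),
    #         ("bob", "alice: which kernel version are you on?")]
    # extract_two_person_dialogues(logs, min_turns=2)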

Other Datasets. Twitter Corpus: 850k Twitter dialogues (Ritter et al., 2011). Movie Dialog Dataset: 1 million Reddit dialogues (Dodge et al., 2016). Our survey paper covering existing datasets: Serban, Lowe, Charlin, Pineau. A Survey of Available Corpora for Building Data-Driven Dialogue Systems. arXiv:1512.05742, 2015. Needs more work!

Challenge #2: Generic Responses

The Problem of Generic Responses. Most models are trained to predict the most likely next utterance given the context, but some utterances are likely given any context! Neural models often generate "I don't know" or "I'm not sure" in response to most contexts (Li et al., 2016).

Encoder-Decoder. Use an RNN to encode text into a fixed-length vector representation, and another RNN to decode that representation back into text. Can make this hierarchical. Cho et al. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. EMNLP, 2014. Serban, Sordoni, Bengio, Courville, Pineau. Building End-to-End Dialogue Systems using Generative Hierarchical Neural Network Models. AAAI, 2015.
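A minimal sketch of the encoder-decoder idea, assuming PyTorch and made-up layer sizes (this is illustrative, not the models on the slides):

    import torch
    import torch.nn as nn

    class EncoderDecoder(nn.Module):
        """Minimal RNN encoder-decoder sketch: one GRU encodes the context into
        a fixed-length vector, a second GRU decodes that vector into the
        response, token by token."""
        def __init__(self, vocab_size, emb_dim=256, hidden_dim=512):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)
            self.encoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
            self.decoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
            self.out = nn.Linear(hidden_dim, vocab_size)

        def forward(self, context_ids, response_ids):
            # Encode the context; keep only the final hidden state (fixed-length vector).
            _, h = self.encoder(self.embed(context_ids))
            # Decode the response conditioned on that vector (teacher forcing).
            dec_out, _ = self.decoder(self.embed(response_ids), h)
            return self.out(dec_out)    # logits over the vocabulary at each position

    # Training maximizes the likelihood of the next response, e.g.:
    # logits = model(context_ids, response_ids[:, :-1])
    # loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size),
    #                                    response_ids[:, 1:].reshape(-1))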

Variational Encoder-Decoder (VHRED). Augment the encoder-decoder with a Gaussian latent variable, inspired by the VAE (Kingma & Welling, 2014). When generating, first sample the latent variable, then use it to condition generation. Serban, Sordoni, Lowe, Charlin, Pineau, Courville, Bengio. A Hierarchical Latent Variable Encoder-Decoder Model for Generating Dialogues. arXiv:1605.06069, 2016.
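A hedged sketch of that idea (not the authors' implementation; names and sizes are hypothetical), showing a reparameterized Gaussian sample being folded into the decoder's conditioning vector:

    import torch
    import torch.nn as nn

    class LatentVariableConditioning(nn.Module):
        """Sketch of the VHRED-style latent variable: predict a Gaussian
        (mu, log_var) from the encoded context, sample z with the
        reparameterization trick, and combine z with the context vector
        to initialize the decoder."""
        def __init__(self, hidden_dim=512, latent_dim=100):
            super().__init__()
            self.prior_net = nn.Linear(hidden_dim, 2 * latent_dim)     # -> (mu, log_var)
            self.to_decoder_init = nn.Linear(hidden_dim + latent_dim, hidden_dim)

        def forward(self, context_vec):
            mu, log_var = self.prior_net(context_vec).chunk(2, dim=-1)
            z = mu + torch.randn_like(mu) * torch.exp(0.5 * log_var)   # reparameterization
            # Decoder initial state now depends on both the context and the sample z.
            return torch.tanh(self.to_decoder_init(torch.cat([context_vec, z], dim=-1)))

At training time a posterior network that also sees the target response is used for sampling, and a KL term between the posterior and the prior is added to the loss, as in a standard VAE.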

Variational Encoder-Decoder (VHRED). VHRED generates longer responses with higher entropy, and outperforms baselines in most experiments.

Variational Encoder-Decoder (VHRED)

Diversity-Promoting Objective. Uses a new objective: maximize the mutual information between the source sentence S and the target T. Can be considered a penalty on generic responses. Gives slightly better results. Li, Galley, Brockett, Gao, Dolan. A Diversity-Promoting Objective Function for Neural Conversational Models. arXiv:1510.03055, 2016.
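In sketch form, the maximum mutual information criterion replaces standard likelihood decoding with the following, where λ weights the language-model penalty (the exact variants, MMI-antiLM and MMI-bidi, are in Li et al., 2016):

    \hat{T} = \arg\max_{T}\ \bigl\{ \log p(T \mid S) - \lambda \log p(T) \bigr\},
    \qquad
    \log \frac{p(S, T)}{p(S)\, p(T)} = \log p(T \mid S) - \log p(T)

With λ = 1 the score is exactly the pointwise mutual information between S and T, so responses that are likely under any context (high p(T)) are penalized.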

Challenge #3: Evaluation

Automatic Dialogue Evaluation. Want a fully automatic way of evaluating the quality of a dialogue system. If there is no notion of task completion, this is very hard. Current methods compare the generated system response to the ground-truth next response.

Comparison to the ground-truth utterance. Context: Hey, want to go to the movies tonight? Generated response: Yeah, let's go see that movie about Turing! Ground-truth response: Nah, I'd rather stay at home, thanks. → SCORE

Comparison to the ground-truth utterance. 1) Word-overlap metrics: BLEU, METEOR, ROUGE. 2) Word embedding-based metrics: vector extrema, greedy matching, embedding average. Generated response: Yes, let's go see that movie about Turing! Ground-truth response: Nah, I'd rather stay at home, thanks. → SCORE
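As a concrete illustration of the embedding-based family, here is a small Python sketch of the "embedding average" metric (the word-vector table, e.g. pretrained GloVe or word2vec, is assumed to be provided; greedy matching and vector extrema aggregate the word vectors differently):

    import numpy as np

    def embedding_average_score(generated, reference, word_vectors):
        """Average the word vectors of each sentence and return the cosine
        similarity of the two mean vectors. `word_vectors` is any dict-like
        map from token to numpy vector; tokens without a vector are skipped."""
        def mean_vec(tokens):
            vecs = [word_vectors[t] for t in tokens if t in word_vectors]
            return np.mean(vecs, axis=0) if vecs else None

        g = mean_vec(generated.lower().split())
        r = mean_vec(reference.lower().split())
        if g is None or r is None:
            return 0.0
        return float(np.dot(g, r) / (np.linalg.norm(g) * np.linalg.norm(r)))

    # score = embedding_average_score("yes let's go see that movie about turing",
    #                                 "nah i'd rather stay at home thanks", glove)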

Human study. Created 100 questions each for the Twitter and Ubuntu datasets (20 contexts, with responses from 5 diverse models). 25 volunteers from the CS department at McGill were asked to judge response quality on a scale from 1 to 5. Compared human ratings with ratings from automatic evaluation metrics. Liu*, Lowe*, Serban*, Noseworthy*, Charlin, Pineau. How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Systems. EMNLP, 2016.
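The comparison itself boils down to correlating the two sets of scores; a minimal sketch with entirely made-up numbers, assuming SciPy:

    from scipy.stats import pearsonr, spearmanr

    # Made-up illustrative values, one entry per evaluated response:
    human_scores  = [4.2, 2.0, 4.8, 1.5, 3.1, 2.7]        # mean 1-5 human ratings
    metric_scores = [0.31, 0.05, 0.42, 0.02, 0.18, 0.11]  # e.g. BLEU-2 per response

    print("Pearson:", pearsonr(human_scores, metric_scores))
    print("Spearman:", spearmanr(human_scores, metric_scores))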

Goal (inter-annotator) Liu*, Lowe*, Serban*, Noseworthy*, Charlin, Pineau. How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Systems. EMNLP, 2016.

Reality (BLEU) Liu*, Lowe*, Serban*, Noseworthy*, Charlin, Pineau. How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Systems. EMNLP, 2016.

Reality (vector-based) Liu*, Lowe*, Serban*, Noseworthy*, Charlin, Pineau. How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Systems. EMNLP, 2016.

Reality (ROUGE & METEOR) Liu*, Lowe*, Serban*, Noseworthy*, Charlin, Pineau. How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Systems. EMNLP, 2016.

Correlation Results Liu*, Lowe*, Serban*, Noseworthy*, Charlin, Pineau. How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Systems. EMNLP, 2016.

Next Utterance Classification. Instead of evaluating model responses, can use an auxiliary task: have models predict the next utterance in the conversation from a list (multiple-choice style). Mitigates the problem with response diversity (and has many other advantages!). Lowe, Serban, Noseworthy, Charlin, Pineau. On the Evaluation of Dialogue Systems with Next Utterance Classification. SIGDIAL, 2016.
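This is typically scored with Recall@k over a candidate list; a small sketch (the function name and the "1 true response + N distractors" layout are illustrative assumptions):

    def recall_at_k(candidate_scores, k=1):
        """Each test example is a list of model scores where index 0 is the true
        next utterance and the remaining entries are distractors. The model is
        credited whenever the true utterance is ranked in the top k."""
        hits = 0
        for scores in candidate_scores:
            ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
            if 0 in ranked[:k]:
                hits += 1
        return hits / len(candidate_scores)

    # e.g. 1 true response + 9 distractors per example gives "Recall@1 out of 10".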

Summary. End-to-end systems are promising, but we have a long way to go. Work on collecting larger, better datasets! This is the most useful thing for the community! Don't rely only on word-overlap metrics like BLEU! Use human evaluations (for now).

Thank you!

References
Dodge, J., Gane, A., Zhang, X., Bordes, A., Chopra, S., Miller, A., ... & Weston, J. (2016). Evaluating prerequisite qualities for learning end-to-end dialog systems. In ICLR.
Kingma, D. P., & Welling, M. (2014). Auto-encoding variational Bayes. In ICLR.
Li, J., Galley, M., Brockett, C., Gao, J., & Dolan, B. (2015). A diversity-promoting objective function for neural conversation models. arXiv preprint arXiv:1510.03055.
Ritter, A., Cherry, C., & Dolan, W. B. (2011). Data-driven response generation in social media. In EMNLP.
Shang, L., Lu, Z., & Li, H. (2015). Neural responding machine for short-text conversation. arXiv preprint arXiv:1503.02364.
Sordoni, A., Galley, M., Auli, M., Brockett, C., Ji, Y., Mitchell, M., & Dolan, B. (2015). A neural network approach to context-sensitive generation of conversational responses. In NAACL-HLT.

Other curiosities: Evaluation is hard when the proposed response has a different length than the ground-truth response.

Other curiosities: Removing stop words from BLEU evaluation actually makes things worse.