Bootstrapping Dialog Systems with Word Embeddings

Gabriel Forgues, Joelle Pineau
School of Computer Science, McGill University
{gforgu, jpineau}@cs.mcgill.ca

Jean-Marie Larchevêque, Réal Tremblay
Nuance Communications, Inc.
{jean-marie.larcheveque, real.tremblay}@nuance.com

Abstract

One of the main tasks of a dialog system is to assign intents to user utterances, which is a form of text classification. Since intent labels are application-specific, bootstrapping a new dialog system requires collecting and annotating in-domain data. To minimize the need for a long and expensive data collection process, we explore ways to improve the performance of dialog systems with very small amounts of training data. In recent years, word embeddings have been shown to provide valuable features for many different language tasks. We investigate the use of word embeddings in a text classification task with little training data. We find that count and vector features complement each other and their combination yields better results than either type of feature alone. We propose a simple alternative, vector extrema, to replace the usual averaging of a sentence's vectors. We show how taking vector extrema is well suited for text classification and compare it against standard vector baselines in three different applications.

1 Introduction

Dialog systems are gaining popularity and can now be found in cars, televisions, phones and other devices. Two essential components of such systems are a speech recognition engine to transcribe speech into text and a language model to understand the text's intent. While speech recognition can now often boast impressive accuracy, language understanding is still very difficult in comparison. Here we focus on the text classification task, which aims to identify the intent of some short piece of text such as a single utterance.

Although dialog systems deployed with large amounts of training data can train complex models to reach high accuracy on this task, their performance hinges on a large domain-specific data collection effort. The most representative in-domain data is collected from real user speech after deploying the dialog system, but data collection can be a lengthy and costly process. In practice, new dialog systems can be seeded with a few artificial samples and then deployed to interact with users. However, the small amount of training data can limit users to shallow dialogs. The conversational data can then be annotated and used to train a better model. Many such iterations might occur before the system reaches high accuracy. A strong start out of the gate can be expected not only to improve early user experience, but also to positively impact dialog quality over the long term. We accordingly focus on improving dialog systems in their early stages, when very little training data is available. We consider tasks with very short texts (e.g. single sentences) and many classes, and where texts are semantically rich despite their short length.

2 Word embeddings

In recent years, several algorithms have been proposed to learn word embeddings, also known as word vectors or distributed representations of words [1,2,3]. These embeddings encode words as vectors such that words with similar meanings have similar vector representations. Although large amounts of data are needed to learn effective representations, the data need not be domain-specific or labelled, which makes word vectors well suited as additional features to bootstrap text classification.

2.1 Sentence-level vectors using extrema

Since our goal is not to learn new word embeddings, we assume we have access to vectors which have been pre-trained on some large unlabelled text [4,5]. We must then encode sentences with a variable number of vectors into a fixed-length feature representation. The standard approach sums or averages the sentence's vectors. A simple improvement uses a weighted average where important words carry higher weight. This approach requires some method of computing word weights, such as the commonly-used inverse document frequency (IDF) [6]. While more complex approaches have been proposed [6,7], they typically require additional data in the form of context or extra labels.

Although there are many algorithms to induce word embeddings, the training objectives frequently involve maximizing the similarity of vectors of words occurring in similar contexts while minimizing the similarity with other words. Since the vectors can contain negative as well as positive values, this effectively pulls the vectors of common words towards the zero vector, so that they are not too dissimilar to the entire vocabulary. On the other hand, context-specific words are pushed away from zero, either in the negative or positive direction. Since these rarer words tend to strongly convey intent, we can emphasize them in the sentence's vector by taking the maximum (or minimum) of each dimension $d_i \in D$ from the sentence's set of $D$-dimensional word vectors. However, since we have no reason to favour either the maximum or minimum, we instead take whichever of the two values is further from zero. We refer to this operation as the vector extrema:

$$\mathrm{extrema}(d_i) = \begin{cases} \max d_i & \text{if } \max d_i \geq |\min d_i| \\ \min d_i & \text{otherwise} \end{cases} \qquad (1)$$

2.2 Combining word counts with embeddings

Baroni et al. [8] recently compared context word counts against distributed word representations on tasks such as synonym detection and semantic relatedness between pairs of words, and found that word vectors were overwhelmingly superior. However, their evaluation considered word-level tasks where word embeddings can be directly used as features. The results do not apply to text which must be transformed into a fixed-length representation.

For this task, intuition might suggest that word embeddings are especially helpful when there is little training data. In this setting, a vector representation of words gives a basis to relate unseen test words to similar words seen during training. However, as the amount of training data increases, we might expect word counts to be favourable over the averaging of a sentence's word vectors, since the first approach exactly represents each word while the second obscures the presence of individual words. We evaluate whether the two sets of features can be complementary by representing a sentence of words $w_1, \ldots, w_k$ as a combination of its bag of word counts alongside real-valued features from its word vectors $v_1, \ldots, v_k$.
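To make the three sentence encodings concrete, here is a minimal NumPy sketch of the plain average, the IDF-weighted average, and the extrema operation of Equation (1). This is an illustration rather than the authors' implementation; `word_vectors` (a token-to-vector dictionary of pre-trained embeddings) and `idf` (a token-to-weight dictionary) are assumed inputs.

```python
import numpy as np

def stack_vectors(tokens, word_vectors):
    """Collect the pre-trained vectors of the in-vocabulary tokens of a sentence."""
    return np.array([word_vectors[t] for t in tokens if t in word_vectors])

def vec_avg(tokens, word_vectors):
    """Plain average of the sentence's word vectors."""
    return stack_vectors(tokens, word_vectors).mean(axis=0)

def vec_weighted_avg(tokens, word_vectors, idf):
    """IDF-weighted average: rarer, more informative words carry higher weight."""
    known = [t for t in tokens if t in word_vectors]
    weights = np.array([idf.get(t, 1.0) for t in known])
    vecs = np.array([word_vectors[t] for t in known])
    return (weights[:, None] * vecs).sum(axis=0) / weights.sum()

def vec_extrema(tokens, word_vectors):
    """Per dimension, keep whichever of max/min is further from zero (Equation 1)."""
    vecs = stack_vectors(tokens, word_vectors)
    maxima, minima = vecs.max(axis=0), vecs.min(axis=0)
    return np.where(maxima >= np.abs(minima), maxima, minima)
```

All three functions assume at least one in-vocabulary token per sentence; a fuller implementation would need a fallback (e.g. a zero vector) for sentences containing only out-of-vocabulary words.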
3 Experiments

The following feature representations are compared:

BoW: bag of word counts
Vec(Avg): $\frac{1}{k}\sum_{i=1}^{k} v_i$
BoW + Vec(Avg): bag of word counts, $\frac{1}{k}\sum_{i=1}^{k} v_i$
BoW + Vec(W.Avg): bag of word counts, $\frac{\sum_{i=1}^{k} \mathrm{idf}(w_i)\, v_i}{\sum_{i=1}^{k} \mathrm{idf}(w_i)}$
BoW + Vec(Ext): bag of word counts, $\mathrm{extrema}(d_i),\ \forall d_i \in D$

We evaluate the combination of word count features and word embeddings and compare the vector average and extrema operations. Two dialog datasets from Nuance Communications are used for these experiments: a banking dataset which contains 2,961 unique utterances classified into one of 172 intent labels (deposits, stock quotes, etc.), and a travel dataset of 2,494 utterances and 62 intent labels (book flight, change seats, etc.). We also evaluate the methods on a question classification dataset [9] with 15,452 short questions labelled into 50 classes (person, country, money, etc.).
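The combined representations above can be assembled by concatenating unigram counts with one of the sentence vectors from Section 2.1. The sketch below uses scikit-learn's CountVectorizer and LinearSVC as stand-ins for the paper's bag-of-words features and linear SVM; `sentence_vec_fn` is a placeholder for any of the encoders sketched earlier (e.g. `vec_avg` or `vec_extrema`), not part of the original paper.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

def build_features(train_texts, test_texts, sentence_vec_fn):
    """Concatenate unigram counts (BoW) with a real-valued sentence vector."""
    bow = CountVectorizer()
    X_train_bow = bow.fit_transform(train_texts).toarray()  # dense, for concatenation
    X_test_bow = bow.transform(test_texts).toarray()
    X_train_vec = np.array([sentence_vec_fn(t.split()) for t in train_texts])
    X_test_vec = np.array([sentence_vec_fn(t.split()) for t in test_texts])
    return (np.hstack([X_train_bow, X_train_vec]),
            np.hstack([X_test_bow, X_test_vec]))

# Hypothetical usage: train a linear SVM on combined BoW + Vec(Ext) features.
# X_train, X_test = build_features(train_texts, test_texts,
#                                  lambda toks: vec_extrema(toks, word_vectors))
# clf = LinearSVC().fit(X_train, train_labels)
# print(clf.score(X_test, test_labels))
```

Keeping the BoW block dense here is only practical for small datasets; as discussed in Section 5, a sparse representation would be preferable as the corpus grows.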
[Figure 1: Combination of word count (BoW) and vector average features.]

[Figure 2: Distribution of extrema words. The x-axis represents the number of extremum values (as a percentage of vector dimension) which each word contributed to the sentence's extrema vector.]

3.1 Setup

In preliminary experiments, we considered different sets of features to use as baselines, such as unigrams, bigrams, part-of-speech tags and stemming. For the dialog data we evaluated, none of these features provided a significant improvement over unigram word counts alone. We therefore use unigram counts as the baseline feature representation (BoW). We also compared different classifiers, including naive Bayes and logistic regression, but ultimately selected a linear SVM as the best classifier for this task.

We train the classifier in a stratified way to ensure that all classes are seen in training. For some parameter k, we select k samples from each intent label, thereby downsampling the annotated corpus. We run 200 trials for each value of k, where each trial is trained on k random samples from each intent and evaluated on all remaining non-training samples. We report accuracy averaged over all trials, with a special focus on k < 10, the smallest number of samples of each class that one might reasonably expect when bootstrapping a new application.

3.2 Preprocessing

We first convert all text to lower case and split each sentence into tokens on spaces and punctuation. We then remove all single-character words, since preliminary experiments showed they were uninformative and their removal led to an increase in accuracy. We also add all label words to the training vocabulary by creating a new artificial sample for each label, where the sample's text consists only of the label's words (e.g. "book flight" for the label BOOK FLIGHT). This ensures the baseline classifier can accurately label the simplest sentences regardless of its training data.

The experiments used the pre-trained word vectors distributed with the word2vec toolkit [4]. We normalize all vectors by taking the minimum and maximum value of each dimension from the training word vectors, such that the vectors of words seen during training have values bounded in [-1, 1].
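A rough sketch of the preprocessing and normalization steps of Section 3.2 follows. The tokenization regex, the (text, label) sample structure, and the exact scaling formula are assumptions that merely satisfy the constraints stated above (split on spaces and punctuation, drop single-character words, bound training-word vectors in [-1, 1]), not the authors' exact recipe.

```python
import re
import numpy as np

def tokenize(text):
    """Lower-case, split on spaces/punctuation, and drop single-character tokens."""
    return [t for t in re.findall(r"\w+", text.lower()) if len(t) > 1]

def add_label_samples(samples, intent_labels):
    """Add one artificial (text, label) sample per intent whose text is the label's words."""
    return samples + [(label.lower().replace("_", " "), label) for label in intent_labels]

def normalize_vectors(word_vectors, training_vocab):
    """Scale each dimension so that vectors of words seen in training lie in [-1, 1]."""
    seen = np.array([word_vectors[w] for w in training_vocab if w in word_vectors])
    scale = np.maximum(np.abs(seen.min(axis=0)), np.abs(seen.max(axis=0)))
    scale[scale == 0] = 1.0  # avoid dividing by zero on constant dimensions
    return {w: v / scale for w, v in word_vectors.items()}
```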

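The evaluation protocol of Section 3.1 (k training samples per intent, 200 random trials, testing on all remaining samples) might look roughly like the sketch below; `featurize` is assumed to behave like the `build_features` helper sketched earlier, and the per-trial classifier is a linear SVM as in the paper.

```python
import random
from collections import defaultdict
import numpy as np
from sklearn.svm import LinearSVC

def stratified_accuracy(texts, labels, featurize, k=5, n_trials=200, seed=0):
    """Mean accuracy over n_trials, each trained on k random samples per intent."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for i, y in enumerate(labels):
        by_label[y].append(i)
    accuracies = []
    for _ in range(n_trials):
        # Assumes every intent has at least k annotated samples.
        train_idx = [i for idx in by_label.values() for i in rng.sample(idx, k)]
        train_set = set(train_idx)
        test_idx = [i for i in range(len(texts)) if i not in train_set]
        X_train, X_test = featurize([texts[i] for i in train_idx],
                                    [texts[i] for i in test_idx])
        clf = LinearSVC().fit(X_train, [labels[i] for i in train_idx])
        accuracies.append(clf.score(X_test, [labels[i] for i in test_idx]))
    return float(np.mean(accuracies))
```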
[Figure 3: Results comparing vector extrema with IDF-weighted and non-weighted vector averages.]

4 Results

In Figure 1, we can see a trade-off between word counts (BoW) and vectors. While word vectors are initially superior, this advantage shifts as the amount of training data increases. However, the combination of both features outperforms each set of features individually. While the figure only shows one example domain, the joint use of word counts and vectors was superior to vectors alone for all applications considered; we therefore only report results combining both features hereafter.

The vector extrema operation produced the highest accuracy on both dialog datasets by a small margin, and the second highest on the question dataset (Figure 3). The same figure also shows the effect of increasing the amount of training data, with all three domains nearly doubling or tripling in accuracy with fewer than 10 samples per class.

We also analysed the composition of sentence extrema vectors to determine which words contributed the most extreme components. As shown in Figure 2, words which occur in specific contexts (e.g. "napalm") tend to have the greatest extremum values, while stopwords (e.g. "the") contribute the least. The results suggest that taking vector extrema effectively induces a global weighting scheme on a sentence's words, where words that are indicative of highly specific contexts in the comprehensive corpus (which was used to learn the word embeddings) have higher weight. This is to be contrasted with the local weights, derived from frequencies in the training set, that are used to compute weighted averages.

5 Discussion

While word counts and vectors have often been pitted against each other, our empirical results suggest that they provide features which are complementary, at least to the extent that their combination was advantageous for the three domains we evaluated. However, the combination of real and discrete features is not without issues. The concatenation of word counts with real-valued features prevents the use of a sparse dictionary representation, which is typical for large text corpora. While this is not problematic in settings with small amounts of data, it would introduce memory issues as the dataset grows in size. One possible solution would be to discretize the real-valued vectors as new entries in the sparse matrix, but this might significantly reduce the benefit of the word embeddings.

The vector extrema operation appears to be a good alternative to a weighted vector average, especially because of its implicit word weights, but its effectiveness seems domain-dependent. It scored lower than a vector average on the question dataset, especially with few training samples, likely because interrogative words were under-represented in sentence extrema (e.g. "who" is a general word, yet it clearly suggests the question relates to a person). The extrema representation will also grow ineffective as the input text's size increases beyond single sentences, since the vector must hold components from many words in order to preserve information about the entire sentence.

Acknowledgments

Funding for this work was provided by FRQNT, NSERC and Nuance Communications, Inc.

References

[1] Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137-1155, 2003.
[2] R. Collobert and J. Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, 2008.
[3] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. In Proceedings of the Workshop at the International Conference on Learning Representations, 2013.
[4] Word2Vec. [Online]. Available: https://code.google.com/p/word2vec/. [Accessed October 8, 2014]
[5] E. Huang. Word Representations. [Online]. Available: https://ai.stanford.edu/~ehhuang. [Accessed October 8, 2014]
[6] E. Huang, R. Socher, C. Manning, and A. Ng. Improving word representations via global context and multiple word prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, 2012.
[7] Q. Le and T. Mikolov. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning, 2014.
[8] M. Baroni, G. Dinu, and G. Kruszewski. Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, 2014.
[9] X. Li and D. Roth. Learning question classifiers: The role of semantic information. In Proceedings of the 19th International Conference on Computational Linguistics, 2002.