
Semantic decoding in dialogue systems
Milica Gašić
Dialogue Systems Group, Cambridge University Engineering Department

In this lecture...
- Dialogue acts
- Semantic decoding as a classification task
- Input features to the semantic decoder
- Semantic decoding as a sequence to sequence learning task

Architecture of a statistical dialogue system
[Figure: pipeline from the waveform through speech recognition (distribution over text hypotheses) and semantic decoding (distribution over dialogue acts) to dialogue management with an ontology, then back through natural language generation and speech synthesis.]

Problem
Decoding meaning in utterances:
Do they serve Korean food
Can you repeat that please
Hi I want to find a restaurant that serves Italian food
How about a restaurant that serves Lebanese food
I want a different restaurant
Is it near Union Square
May I have the address
No, I want an expensive restaurant

Reminder: Dialogue acts
Semantic concepts:
- dialogue act type - encodes the system or the user intention in (part of) a dialogue turn
- semantic slots and values - further describe entities from the ontology that a dialogue turn refers to
Example:
Is there um maybe a cheap place in the centre of town please?
inform(price=cheap, area=centre)
Here "inform" is the dialogue act type and "price=cheap, area=centre" are the semantic slots and values.

Semantic decoding
Do they serve Korean food                          ->  confirm(food=korean)
Can you repeat that please                         ->  repeat()
Hi I want to find an Italian restaurant            ->  hello(type=restaurant, food=italian)
I want a different restaurant                      ->  reqalts()
Is it near Union Square                            ->  confirm(near=Union Square)
May I have the address                             ->  request(addr)
No, I want an expensive restaurant                 ->  negate(type=restaurant, pricerange=expensive)
How about a restaurant that serves Lebanese food   ->  reqalts(type=restaurant, food=lebanese)

Semantic decoding
Data:        dialogue utterances labelled with semantic concepts
Model:
Predictions: the set of semantic concepts

Semantic decoding as a classification task
Data:        dialogue utterances labelled with semantic concepts
Model:       support vector machines
Predictions: the set of semantic concepts

Semantic decoding as a classification task
Is there um maybe a cheap place in the centre of town please?
Classes:
- Dialogue act types: negate, deny, inform, select
- Slot-value pairs: food=italian, food=chinese, area=centre, area=north, price=cheap

Theory: support vector machines
A support vector machine is a maximum margin classifier. Support vectors are the input data points that lie on the margin.
Input data points are mapped into a high dimensional feature space where the data is linearly separable: x -> φ(x).
The kernel function is the dot product of the feature functions:
k(x, x') = φ(x)ᵀ φ(x')

Theory: support vector machines
The decision surface is given by
f(x) = Σ_{i=1}^{n} y_i α_i k(x, x_i) + β
where
- x is the test data point
- x_i are the support vectors
- y_i are the labels, y_i ∈ {-1, 1}
- α_i is the weight of support vector i in the feature space
- β is the bias
- k(·,·) is the kernel function

Theory: support vector machines
This is extended to a multiclass SVM using the one-versus-rest approach.
The output of an SVM is transformed into a probability by fitting a sigmoid
p(y = 1 | x) = 1 / (1 + exp(a f(x) + b))
and estimating a, b by maximum likelihood on a validation set.
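
As a concrete illustration (not the system used in the cited experiments), scikit-learn provides both pieces: a one-versus-rest wrapper around an SVM, and Platt-style sigmoid calibration via probability=True. The features and labels below are toy placeholders.

```python
# Illustrative only: a one-versus-rest SVM whose scores are mapped to
# probabilities by a fitted sigmoid (Platt scaling).
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.integers(0, 3, size=(30, 8)).astype(float)        # toy n-gram counts
y_train = np.array(["inform", "request", "negate"] * 10)        # toy act type labels

# probability=True fits p(y=1|x) = 1 / (1 + exp(a f(x) + b)) on held-out
# folds, estimating a and b by maximum likelihood.
clf = OneVsRestClassifier(SVC(kernel="linear", probability=True))
clf.fit(X_train, y_train)

x_test = rng.integers(0, 3, size=(1, 8)).astype(float)
print(dict(zip(clf.classes_, clf.predict_proba(x_test)[0])))    # one probability per act type
```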

Input to semantic decoder: top ASR hypothesis
Features are extracted directly from the top hypothesis and classification is performed into the relevant semantic classes.

Ontology and delexicalisation
Ontology:
name(carluccios)   food(italian)   pricerange(moderate)   area(centre)
name(seven Days)   food(chinese)   pricerange(cheap)      area(centre)
name(cocum)        food(indian)    pricerange(cheap)      area(north)

Delexicalise:
I'm looking for an Italian restaurant.
I'm looking for an Indian restaurant.
I'm looking for a Chinese restaurant.
all become: I'm looking for a <tagged-food-value> restaurant.
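
A toy delexicaliser, assuming a small hand-written ontology dictionary (this is not the lecture's actual code; a real delexicaliser would also handle multi-word values and articles):

```python
# Replace ontology values found in the utterance with generic slot tags so
# that one classifier generalises across all values of a slot.
ONTOLOGY = {
    "food": ["italian", "chinese", "indian"],
    "area": ["centre", "north"],
    "pricerange": ["cheap", "moderate"],
}

def delexicalise(utterance: str) -> str:
    out = []
    for tok in utterance.lower().split():
        tag = next((f"<tagged-{slot}-value>"
                    for slot, values in ONTOLOGY.items() if tok in values), tok)
        out.append(tag)
    return " ".join(out)

print(delexicalise("I'm looking for an Italian restaurant"))
# -> "i'm looking for an <tagged-food-value> restaurant"
```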

SVMs in semantic decoding: the semantic tuple classifier
1. Delexicalise the utterance using the ontology: "Italian restaurant please" -> "<tagged-food-value> restaurant please".
2. Count n-grams, e.g. "please": 1, "restaurant please": 1, "<tagged-food-value> restaurant": 1.
3. Query the SVMs: a multi-class SVM for the dialogue act type and one SVM per slot (food=<tagged-food-value>, area=<tagged-area-value>, price=<tagged-price-value>), giving e.g. inform() 0.9, request() 0.1 and food 0.9, area 0.1, price 0.2.
4. Produce valid dialogue acts and renormalise the distribution, e.g. inform(food=italian) 0.85, request() 0.15.
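
A rough sketch of the final combination step. The helper name and the combination rule are simplifications of the semantic tuple classifier, so the numbers will not exactly match those on the slide.

```python
def combine_outputs(act_type_probs, slot_probs, slot_values):
    """act_type_probs: e.g. {'inform': 0.9, 'request': 0.1}
    slot_probs:     e.g. {'food': 0.9, 'area': 0.1, 'price': 0.2}  (P(slot present))
    slot_values:    e.g. {'food': 'italian'}  (values recovered from delexicalisation)
    """
    hypotheses = {}
    for act, p_act in act_type_probs.items():
        # Keep only slots the per-slot SVMs are confident about and that were tagged.
        slots = {s: v for s, v in slot_values.items() if slot_probs.get(s, 0) > 0.5}
        if act == "inform" and slots:
            label = "inform(" + ", ".join(f"{s}={v}" for s, v in slots.items()) + ")"
            hypotheses[label] = p_act * min(slot_probs[s] for s in slots)
        else:
            hypotheses[f"{act}()"] = p_act
    total = sum(hypotheses.values())            # renormalise over valid dialogue acts
    return {h: round(p / total, 3) for h, p in hypotheses.items()}

print(combine_outputs({"inform": 0.9, "request": 0.1},
                      {"food": 0.9, "area": 0.1, "price": 0.2},
                      {"food": "italian"}))
# -> {'inform(food=italian)': 0.89, 'request()': 0.11}
```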

Input to semantic decoder: N-best list of ASR hypotheses
In real conversational systems the error rate of the top hypothesis is typically 20-30%. To achieve robustness, alternative hypotheses are needed.

Taking alternative ASR hypotheses into account
Is there an expensive restaurant?    0.35  ->  inform(type=restaurant, price=expensive)    0.35
Is there an inexpensive restaurant?  0.30  ->  inform(type=restaurant, price=inexpensive)  0.30
Inexpensive restaurant?              0.20  ->  inform(type=restaurant, price=inexpensive)  0.20
In expensive restaurant?             0.05  ->  inform(type=restaurant, price=expensive)    0.05

Combining the outputs gives:
inform(type=restaurant, price=inexpensive)  0.50
inform(type=restaurant, price=expensive)    0.40
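
Combining the outputs amounts to summing the ASR posteriors of hypotheses that decode to the same dialogue act. A minimal sketch, using the numbers from the example above:

```python
from collections import defaultdict

def combine_semantic_hypotheses(nbest):
    """nbest: list of (asr_probability, dialogue_act) pairs, one per ASR hypothesis."""
    combined = defaultdict(float)
    for p, act in nbest:
        combined[act] += p                      # sum evidence for identical dialogue acts
    return {act: round(p, 3)
            for act, p in sorted(combined.items(), key=lambda kv: -kv[1])}

nbest = [
    (0.35, "inform(type=restaurant, price=expensive)"),
    (0.30, "inform(type=restaurant, price=inexpensive)"),
    (0.20, "inform(type=restaurant, price=inexpensive)"),
    (0.05, "inform(type=restaurant, price=expensive)"),
]
print(combine_semantic_hypotheses(nbest))
# {'inform(type=restaurant, price=inexpensive)': 0.5,
#  'inform(type=restaurant, price=expensive)': 0.4}
```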

Input to semantic decoder: word confusion network
A word confusion network summarises the posterior distribution of the ASR better, without pruning low probability words. Each arc in the word confusion network has a posterior probability for that word: the sum of the probabilities of all paths which contain that word at around that approximate time.
[Figure: word confusion network for "I am looking for an inexpensive place", with alternative arcs such as "I'm", "a" and "expensive".]
Context features can be extracted from the last system action, since the user response may be dependent on the system question. Caution: we still want to understand utterances where the user is not following the system.

Evaluating the quality of a semantic decoder
F-score: with C_ref the semantic concepts in the reference and C_hyp0 the semantic concepts in the top hypothesis,
F = 2 |C_ref ∩ C_hyp0| / (|C_ref| + |C_hyp0|)
Item-level cross entropy (ICE) [Thomson et al., 2008] measures the quality of the output distribution p_i for every concept:
ICE = -(1 / (1 + |C_ref|)) Σ_{c ∈ C} log( p(c) p*(c) + (1 - p(c)) (1 - p*(c)) )
where
p(c) = Σ_i p_i(c) over the hypotheses i with c ∈ C_hyp_i, and
p*(c) = 1 if c ∈ C_ref, 0 otherwise.
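
Both metrics can be computed directly from the formulas above. In this sketch the concept set C is taken to be the union of reference and hypothesised concepts, and the log argument is clipped to avoid log(0); these details may differ from the exact definition in Thomson et al. (2008).

```python
import math

def f_score(c_ref, c_hyp0):
    """Concept-level F-score between the reference and top-hypothesis concept sets."""
    if not c_ref and not c_hyp0:
        return 1.0
    return 2 * len(c_ref & c_hyp0) / (len(c_ref) + len(c_hyp0))

def ice(c_ref, hyp_concepts):
    """Item-level cross entropy; hyp_concepts is a list of
    (asr_probability, set_of_concepts) pairs, one per hypothesis."""
    all_concepts = set(c_ref)
    for _, concepts in hyp_concepts:
        all_concepts |= set(concepts)
    total = 0.0
    for c in all_concepts:
        p = sum(prob for prob, concepts in hyp_concepts if c in concepts)
        p_star = 1.0 if c in c_ref else 0.0
        joint = p * p_star + (1 - p) * (1 - p_star)
        total += math.log(max(joint, 1e-10))       # clip to avoid log(0)
    return -total / (1 + len(c_ref))

ref = {("food", "italian"), ("type", "restaurant")}
hyps = [(0.7, {("food", "italian"), ("type", "restaurant")}),
        (0.3, {("food", "indian")})]
print(f_score(ref, hyps[0][1]), ice(ref, hyps))
```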

Results [Henderson et al., 2012]
Cambridge Restaurant Information Domain
Semantic concepts: area, price-range, food-type, phone number, post-code, signature dish, address of restaurant, and dialogue act types: inform, request, confirm etc.
Data collected in car, WER 37.0%

Input                         F-Score          ICE
Top ASR hypothesis            0.692 ± 0.012    1.790 ± 0.065
N-best ASR hypotheses         0.708 ± 0.012    1.760 ± 0.074
Confusion network             0.730 ± 0.011    1.680 ± 0.063
Confusion network + context   0.767 ± 0.011    0.880 ± 0.063

Semantic decoding as a sequence to sequence learning task
The model reads the input word by word (or window of words) and outputs a sequence of concepts using BIO labelling (begin, inside, other). These are then heuristically mapped into slot-value pairs.

Is there um maybe a cheap place in the centre of town serving Chinese food please?
o o o o o b_price o o o b_area i_area i_area b_food i_food i_food o
Heuristics: price=cheap, area=centre, food=chinese
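
A toy version of the heuristic mapping from BIO tags to slot-value pairs (the function name is illustrative; a real mapper would additionally normalise phrases such as "centre of town" to their ontology values):

```python
def bio_to_slot_values(tokens, tags):
    """Collect the tokens spanned by each b_/i_ segment into (slot, value) pairs."""
    slots, current_slot, current_tokens = [], None, []
    for tok, tag in zip(tokens, tags):
        if tag.startswith("b_"):
            if current_slot:
                slots.append((current_slot, " ".join(current_tokens)))
            current_slot, current_tokens = tag[2:], [tok]
        elif tag.startswith("i_") and current_slot == tag[2:]:
            current_tokens.append(tok)
        else:
            if current_slot:
                slots.append((current_slot, " ".join(current_tokens)))
            current_slot, current_tokens = None, []
    if current_slot:
        slots.append((current_slot, " ".join(current_tokens)))
    return slots

tokens = "is there um maybe a cheap place in the centre of town".split()
tags = ["o", "o", "o", "o", "o", "b_price", "o", "o", "o", "b_area", "i_area", "i_area"]
print(bio_to_slot_values(tokens, tags))
# [('price', 'cheap'), ('area', 'centre of town')]
```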

Semantic decoding as a sequence to sequence learning task
Data:        dialogue utterances with semantic concepts
Model:       conditional random fields
Predictions: the sequence of semantic concepts

Theory: Conditional random fields
p(y | x) = (1 / Z(x)) exp( Σ_i λ_i f_i(x, y) )
This is a fully connected undirected graph where:
- x is the input sequence (x_0, ..., x_n)
- y is the output sequence (y_0, ..., y_n)
- f_i are given feature functions
- λ_i are parameters to be estimated

Theory: Linear chain conditional random field
In this case the graph is no longer fully connected: the label at time step t depends on the label at the previous time step t-1.
[Figure: chain of labels y_{t-1}, y_t connected to inputs x_0, ..., x_{t-1}, x_t, ..., x_n]
p(y | x) = (1 / Z(x)) exp( Σ_t Σ_i λ_i f_i(x, y_t, y_{t-1}) )    (1)
         = (1 / Z(x)) exp( λᵀ F(x, y) )                          (2)

Training a linear chain conditional random field
Maximise the log probability log p(y | x) with respect to the parameters λ. It can be shown that the gradient of the log probability is the difference between the feature function values and the expected feature function values:
∇_λ L = F(x, y) - Σ_{y'} p(y' | x) F(x, y').
Since the label at each time step only depends on the label at the previous time step, message passing can be used to find the expectation.
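
In practice one rarely implements this optimisation by hand. A sketch using the third-party sklearn-crfsuite package (an assumption on my part, not the toolkit used in the cited work) shows the ingredients: hand-written per-token feature functions and L-BFGS optimisation of the regularised log likelihood. The feature set and data are toy placeholders.

```python
import sklearn_crfsuite

def token_features(tokens, t):
    feats = {"word": tokens[t].lower(), "is_first": t == 0}
    if t > 0:
        feats["prev_word"] = tokens[t - 1].lower()   # simple context feature
    return feats

X_train = [[token_features(s, t) for t in range(len(s))]
           for s in [["a", "cheap", "place"], ["in", "the", "centre"]]]
y_train = [["o", "b_price", "o"], ["o", "o", "b_area"]]

# L-BFGS maximises the L1/L2-regularised log probability log p(y|x) w.r.t. lambda.
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
crf.fit(X_train, y_train)
print(crf.predict([[token_features(["a", "cheap", "restaurant"], t) for t in range(3)]]))
```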

Linear chain CRFs in semantic decoding [Tur et al., 2013]
Input data: word confusion networks where each bin is annotated with a semantic concept.
Features: for each bin in the confusion network, extract n-grams of the neighbouring bins and weight them by their confidence scores.
Task: a conversational understanding system with real users about movies (22 concepts).

Input                F-Score
Top ASR hypothesis   0.77
Confusion network    0.83

Semantic decoding as a sequence to sequence learning task
Data:        dialogue utterances with semantic concepts
Model:       recurrent neural networks
Predictions: the sequence of semantic concepts

Theory: Neural networks
A neural network transforms an input vector x into an output categorical probability distribution y:
h_0 = g_0(W_0 xᵀ + b_0)
h_i = g_i(W_i h_{i-1}ᵀ + b_i),  0 < i < m
y = softmax(W_m h_{m-1}ᵀ + b_m)
softmax(h)_i = exp(h_i) / Σ_j exp(h_j)
where
- g_i are (differentiable) activation functions, e.g. the hyperbolic tangent tanh or the sigmoid σ
- W_i, b_i are parameters to be estimated
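
The forward pass above translates almost line by line into NumPy. In this toy sketch x is treated as a column vector (so the transposes disappear) and the weights are random, untrained placeholders.

```python
import numpy as np

def softmax(h):
    e = np.exp(h - h.max())          # subtract the max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
x = rng.normal(size=10)                          # input feature vector
W0, b0 = rng.normal(size=(8, 10)), np.zeros(8)   # first hidden layer
W1, b1 = rng.normal(size=(4, 8)), np.zeros(4)    # output layer, 4 classes

h0 = np.tanh(W0 @ x + b0)                        # h_0 = g_0(W_0 x + b_0)
y = softmax(W1 @ h0 + b1)                        # y = softmax(W_1 h_0 + b_1)
print(y, y.sum())                                # a categorical distribution summing to 1
```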

Theory: Neural networks
[Figure: neural network structure with input x, weights w, hidden nodes h and output y.]

Theory: Training neural networks
The cost function is the negative log probability of the true label:
-Σ_j y*_{ij} log y_{ij}
where
- y*_i is the delta distribution (zero everywhere except for the correct category)
- y_i is the probability distribution estimated by the neural network.
The cost function can be minimised by stochastic gradient descent.
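
A standalone toy example of the cost function and a single stochastic gradient descent step on the output layer; the probabilities and hidden activations below are made up for illustration.

```python
import numpy as np

y_hat = np.array([0.1, 0.2, 0.6, 0.1])    # network output (softmax probabilities)
y_star = np.array([0.0, 0.0, 1.0, 0.0])   # delta distribution: class 2 is correct

loss = -np.sum(y_star * np.log(y_hat))    # -sum_j y*_j log y_j = -log(0.6) ~ 0.51
print(loss)

# For a softmax output layer the gradient w.r.t. the pre-softmax activations is
# (y_hat - y_star); backpropagation turns this into gradients for W_m and b_m.
h_prev = np.ones(5)                       # activations of the last hidden layer (toy)
grad_W = np.outer(y_hat - y_star, h_prev)
grad_b = y_hat - y_star
learning_rate = 0.1
# SGD update: W_m -= learning_rate * grad_W;  b_m -= learning_rate * grad_b
```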

Neural networks for semantic decoding
An example network which does not take context into account:
[Figure: the delexicalised utterance "I'm looking for a <tagged-price-value> restaurant" (from "I'm looking for a cheap restaurant") is encoded as a 1-hot input feature vector, passed through a hidden layer with a non-linear transformation, and a softmax layer outputs the probability for the price slot.]

Recurrent neural networks: Elman type
Recurrent neural networks are deep neural networks unrolled through time. An Elman-type neural network has recurrent connections between the hidden layers of consecutive time steps:
[Figure: at each time step t0 ... tn an input feature vector feeds a hidden layer; the hidden layers are connected across time steps and each produces an output.]

Recurrent neural networks: Jordan type
A Jordan-type neural network feeds the output of the previous time step into the next time step:
[Figure: at each time step t0 ... tn an input feature vector feeds a hidden layer; the output of each step is fed into the hidden layer of the next step.]
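
To make the difference concrete, here is a toy NumPy sketch of the two recurrences (dimensions and weights are arbitrary placeholders): the Elman network feeds the previous hidden state back into the hidden layer, while the Jordan network feeds back the previous output.

```python
import numpy as np

def softmax(h):
    e = np.exp(h - h.max())
    return e / e.sum()

rng = np.random.default_rng(1)
d_in, d_h, d_out = 6, 4, 3
W_x = rng.normal(size=(d_h, d_in))   # input -> hidden
W_h = rng.normal(size=(d_h, d_h))    # Elman: previous hidden state -> hidden
W_y = rng.normal(size=(d_h, d_out))  # Jordan: previous output -> hidden
W_o = rng.normal(size=(d_out, d_h))  # hidden -> output
xs = rng.normal(size=(5, d_in))      # a sequence of 5 input feature vectors

h = np.zeros(d_h)
for x in xs:                         # Elman: h_t = g(W_x x_t + W_h h_{t-1})
    h = np.tanh(W_x @ x + W_h @ h)
    y = softmax(W_o @ h)

h, y = np.zeros(d_h), np.zeros(d_out)
for x in xs:                         # Jordan: h_t = g(W_x x_t + W_y y_{t-1})
    h = np.tanh(W_x @ x + W_y @ y)
    y = softmax(W_o @ h)
```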

RNNs in semantic decoding [Mesnil et al., 2015]
ATIS dataset: flight booking information, e.g. "I want to fly from Boston to New York."
Input features: 1-hot representation or a context window.

F-score   Elman    Jordan   CRF
1-hot     0.932    0.652    0.67
window    0.950    0.942    0.929

F-score on the entertainment dataset:
CRF     RNN
0.906   0.881

Long short-term memory neural networks
RNNs automatically learn context information but suffer from the vanishing gradient problem. Long short-term memory (LSTM) neural networks are an alternative model which to some extent avoids this problem, and they have been successfully used in semantic decoding [Yao et al., 2014].
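
As an illustration only (not the architecture of Yao et al., 2014), a minimal LSTM slot tagger in PyTorch could look as follows; the vocabulary size, label set and dimensions are placeholders.

```python
import torch
import torch.nn as nn

class LSTMTagger(nn.Module):
    def __init__(self, vocab_size, num_labels, emb_dim=50, hidden_dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, num_labels)

    def forward(self, word_ids):                 # word_ids: (batch, seq_len)
        h, _ = self.lstm(self.emb(word_ids))     # (batch, seq_len, hidden_dim)
        return self.out(h)                       # per-token logits over BIO labels

model = LSTMTagger(vocab_size=1000, num_labels=7)
word_ids = torch.randint(0, 1000, (1, 12))       # one utterance of 12 tokens
labels = torch.randint(0, 7, (1, 12))            # toy BIO label ids
loss = nn.CrossEntropyLoss()(model(word_ids).transpose(1, 2), labels)
loss.backward()                                  # gradients flow back through the gates
```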

Summary
Input
- The input can be the 1-best or N-best list from the ASR, or a confusion network.
- Taking alternative recognition results into account improves robustness.
Model
- Semantic decoding can be defined as a classification task; in this case a collection of SVMs can be used.
- Semantic decoding can be more naturally defined as a sequence to sequence learning task.
- CRFs are one sequence-to-sequence model; they require predefined context feature functions.
- RNNs provide context automatically but suffer from the vanishing gradient problem.

References I
Henderson, M., Gasic, M., Thomson, B., Tsiakoulis, P., Yu, K., and Young, S. (2012). Discriminative spoken language understanding using word confusion networks. In Spoken Language Technology Workshop (SLT), 2012 IEEE, pages 176-181.
Mesnil, G., Dauphin, Y., Yao, K., Bengio, Y., Deng, L., Hakkani-Tur, D., He, X., Heck, L., Tur, G., Yu, D., and Zweig, G. (2015). Using recurrent neural networks for slot filling in spoken language understanding. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(3):530-539.

References II
Thomson, B., Yu, K., Gasic, M., Keizer, S., Mairesse, F., Schatzmann, J., and Young, S. (2008). Evaluating semantic-level confidence scores with multiple hypotheses. In INTERSPEECH, pages 1153-1156.
Tur, G., Deoras, A., and Hakkani-Tur, D. (2013). Semantic parsing using word confusion networks with conditional random fields. In Annual Conference of the International Speech Communication Association (Interspeech).

References III
Yao, K., Peng, B., Zhang, Y., Yu, D., Zweig, G., and Shi, Y. (2014). Spoken language understanding using long short-term memory neural networks. In Spoken Language Technology Workshop (SLT), 2014 IEEE, pages 189-194.