Using Wizard-of-Oz simulations to bootstrap Reinforcement-Learning-based dialog management systems


Using Wizard-of-Oz simulations to bootstrap Reinforcement-Learning-based dialog management systems

Jason D. Williams, Steve Young
Department of Engineering, University of Cambridge, Cambridge, CB2 1PZ, United Kingdom
{jdw3,sjy}@eng.cam.ac.uk

Abstract

This paper describes a method for bootstrapping a Reinforcement-Learning-based dialog manager using a Wizard-of-Oz trial. The state space and action set are discovered through annotation, and an initial policy is generated using a Supervised Learning algorithm. The method is tested and shown to create, with less effort, an initial policy which performs significantly better than a handcrafted policy and which can be generated using a small number of dialogs.

1 Introduction and motivation

Recent work has successfully applied Reinforcement Learning (RL) to learning dialog strategy from experience, typically formulating the problem as a Markov Decision Process (MDP) (Walker et al., 1998; Singh et al., 2002; Levin et al., 2000). Despite these successes, several open questions remain, especially the issue of how to create (or "bootstrap") the initial system prior to data becoming available from on-line operation.

This paper proceeds as follows. Section 2 outlines the core elements of an MDP and issues related to applying an MDP to dialog management. Sections 3 and 4 detail a method for addressing these issues and the procedure used to test the method, respectively. Sections 5-7 present the results, a discussion, and conclusions, respectively.

2 Background

An MDP is composed of a state space, an action set, and a policy which maps each state to one action. Introducing a reward function allows us to create or refine the policy using RL (Sutton and Barto, 1998). When the MDP framework is applied to dialog management, the state space is usually constructed from vector components including information state, dialog history, recognition confidence, database status, etc.

In most of the work to date, both the state space and action set are hand selected, in part to ensure a limited state space and to ensure training can proceed using a tractable number of dialogs. However, hand selection becomes impractical as system size increases, and automatic generation/selection of these elements is currently an open problem, closely related to the problem of exponential state space size.

3 A method for bootstrapping RL-based systems

Here we propose a method for bootstrapping an MDP-based system; specifically, we address the choice of the state representation and action set, and the creation of an initial policy.

3.1 Step 1: Conduct Wizard-of-Oz dialogs

The method commences with "talking wizard" interactions in which either the wizard's voice is disguised or a text-to-speech engine is used. We choose human/wizard rather than human/human dialogs because people behave differently toward (what they perceive to be) machines than toward other people, as discussed in Jönsson and Dahlbäck (1988) and also validated in Moore and Browning (1992). The dialog, including the wizard's interaction with back-end data sources, is recorded and transcribed.

3.2 Step 2: Exclude out-of-domain turns

The wizard will likely handle a broader set of requests than the system will ultimately be able to cover; thus some turns must be excluded. Step 2 begins by formulating a list of tasks which are to be included in the transcript; the remainder is labeled out-of-domain (OOD) and excluded.
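As an illustration only (the paper gives no implementation), a minimal Python sketch of this kind of whole-turn filtering, assuming a hypothetical annotation format in which each transcribed turn carries an annotator-assigned task label:

```python
from dataclasses import dataclass

@dataclass
class Turn:
    speaker: str   # "wizard" or "caller"
    text: str      # transcribed utterance
    task: str      # annotator-assigned task label (hypothetical)

# Hypothetical in-domain task labels; the real list is drawn up in Step 2.
IN_DOMAIN_TASKS = {"gather-route-details", "summarize-route", "disambiguate-place-name"}

def split_turns(turns):
    """Keep whole in-domain turns; everything else is labeled out-of-domain (OOD)."""
    in_domain = [t for t in turns if t.task in IN_DOMAIN_TASKS]
    ood = [t for t in turns if t.task not in IN_DOMAIN_TASKS]
    return in_domain, ood
```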

Step 2 takes an approach which is analogous to, but more simplistic than, dialogue distilling (Larsson et al., 2000), which changes, adds, and removes portions of turns or whole turns. Here, rules simply stipulate whether to keep a whole turn.

3.3 Step 3: Enumerate action set and state space

Next, the in-domain turns are annotated with dialog acts. Based on these, an action set is enumerated, and a set of state parameters and their possible values is determined, forming a vector which describes the state space, including:

- Information state (e.g., departure-city, arrival-city) from the user and database.
- The confidence/confirmation status of information state variables.
- Expressed user goal and/or system goal.
- Low-level turn information (e.g., yes/no responses, backchannel, thank you, etc.).
- Status of database interactions (e.g., when a form can be submitted or has been returned).

A variety of dialog-act tagging taxonomies exist in the literature. Here we avoid a tagging system that relies on a stack or other recursive structure (for example, a goal stack), as it is not immediately clear how to represent a recursive structure in a state space.

In practice, many information state components are much less important than their corresponding confirmation status, and can be removed. Even with this reduction, the state space will be massive, probably too large to ever visit all states. We propose using a parameterized value function, i.e., a value function that shares parameters across states (including states previously unobserved). One special case of this is state tying, in which a group of states share the same value function; an alternative is to use a Supervised Learning algorithm to estimate a value function.

3.4 Step 4: Form an initial policy

For each turn in the corpus, a vector is created representing the current dialog state plus the subsequent wizard action. Taking the action as the class variable, Supervised Learning (SL) is used to build a classifier which functions as the initial policy. Depending on the type of SL algorithm used, it may be possible to produce a prioritized list of actions rather than a single classification; in this case, this list can form an initial list of actions permitted in a given state.

As noted by Levin et al. (2000), supervised learning is not appropriate for optimizing dialog strategy because of the temporal/environmental nature of dialog. Here we do not assert that the SL-learned policy will be optimal, simply that it can be easily created, that it will be significantly better than random guessing, and that it will be better and cheaper to produce than a cursory handcrafted strategy.

3.5 Limitations of the method

This method has several obvious limitations:

- Because a talking, perfect-hearing wizard is used, little or no account is taken of the recognition errors to be expected with automated speech recognition (ASR).
- Excluding too much in Step 2 may exclude actions or state parameters which would have produced a superior deployed system.

4 Experimental design

The proposed approach has been tested using the Autoroute corpus of 66 dialogs, in which a talking wizard answered questions about driving directions in the UK (Moore and Browning, 1992). A small set of in-domain tasks was enumerated (e.g., gathering route details, outputting summary information about a route, disambiguation of place names, etc.), and turns which did not deal with these tasks were labeled OOD and excluded.
The latter included gathering the caller's name and location ("UserID"), the most common OOD type.

The corpus was annotated using an XML schema to provide the following:

- 15 information components were created (e.g., from, to, time, car-type). Each information component was given a status: C (Confirmed), U (Unconfirmed), or NULL (Not known).
- Up to 5 routes may be under discussion at once; the state tracked the route under discussion (RUD), the total number of routes (TR), and all information and status components for each route.
- A component called "flow" tracked single-turn dialog flow information from the caller (e.g., yes, no, thank-you, silence).
- A component called "goal" tracked the (most recent) goal expressed by the user (e.g., plan-route, how-far). Goal is empty unless explicitly set by the caller, and only one goal is tracked at a time. No attempt is made to indicate if/when a goal has been satisfied.
- 33 action types were identified, some of which take information slots as parameters (e.g., wh-question, implicit-confirmation).

The corpus gave no indication of database interactions other than what can be inferred from the dialog transcripts. One common wizard action asked the caller to "please wait" when the wizard was waiting for a database response. To account for this, we provided an additional state component, called db-request, which indicated whether the database was working; it was set to true whenever the action taken was please-wait and false otherwise. Other, less common database interactions occurred when town names were ambiguous or not found, and no attempt was made to incorporate this information into the state representation.

The state space was constructed using only the status of the information slots (not their values); of the 15, 4 were occasionally expressed (e.g., day of the week) but not used to complete the transaction and were therefore excluded from the state space. Two turns of wizard action history were also incorporated. This formulation of the state space leads to approximately 10^33 distinct states.

For evaluation of the method, a hand-crafted policy of 3 rules mapping state to action was created by inspecting the dialogs. (It was not clear in what situations some of the actions should be used, so some rare actions were not covered by the rules.)

5 Results

Table 1 shows in-domain vs. out-of-domain wizard and caller turns. Figures 1 through 4 show counts of flow values, goal values, action values, and state components, respectively. The most common action type was please-wait (4.6% of actions).

Turn type   Total         In domain      OOD: User ID   OOD: Other
Wizard      3155 (100%)   2410 (76.4%)   594 (18.8%)    151 (4.8%)
Caller      2466 (100%)   1713 (69.5%)   561 (22.7%)    192 (7.8%)

Table 1: In-domain and out-of-domain (OOD) turns

In some cases, the wizard took different actions in the same state; we labeled this situation a conflict. Table 2 shows the number of distinct states that were encountered and, for states visited more than once, whether conflicting actions were selected. For states with conflicts, Table 3 shows action probabilities estimated from the corpus.

Criteria                                    States         Visits
Visited only once                           1182 (85.7%)   1182 (45.9%)
Visited more than once without a conflict     96 (7.0%)     353 (13.7%)
Visited more than once with a conflict       101 (7.3%)    1041 (40.4%)
TOTAL                                       1379 (100%)    2576 (100%)

Table 2: Conflicts by state and visits

Estimated action probabilities                                       Visits
p(action taken | state) > p(any other action | state)                 774 (74.3%)
p(action taken | state) = p(one or more other actions | state)
  > p(all remaining actions | state)                                  119 (11.4%)
p(action taken | state) < p(another action | state)                   148 (14.2%)
TOTAL                                                                 1041 (100%)

Table 3: Estimated probabilities in conflict states
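The conflict analysis behind Tables 2 and 3 amounts to estimating p(action | state) from corpus counts. As an illustrative sketch only (the paper does not describe its implementation), with hypothetical state and action encodings:

```python
from collections import Counter, defaultdict

def action_distributions(state_action_pairs):
    """Estimate p(action | state) by counting, and flag 'conflict' states in which
    the wizard took more than one distinct action."""
    counts = defaultdict(Counter)
    for state, action in state_action_pairs:
        counts[state][action] += 1
    distributions, conflict_states = {}, set()
    for state, action_counts in counts.items():
        total = sum(action_counts.values())
        distributions[state] = {a: n / total for a, n in action_counts.items()}
        if len(action_counts) > 1:
            conflict_states.add(state)
    return distributions, conflict_states

# Hypothetical usage: a state is a tuple of slot-status components, an action a dialog-act label.
pairs = [(("from=C", "to=U", "db=False"), "implicit-confirmation"),
         (("from=C", "to=U", "db=False"), "wh-question"),
         (("from=C", "to=U", "db=False"), "implicit-confirmation")]
dists, conflicts = action_distributions(pairs)
print(dists, conflicts)
```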
The interaction data was then submitted to two SL pattern classifiers: c4.5, which uses decision trees (Quinlan, 1992), and jbnc, which uses Naïve Bayes classifiers (Sacha, 2003). Table 4 shows 10-fold cross-validation results classifying (1) the action type and (2) the action type with parameters, as well as the corresponding results for the hand-crafted policy. Figure 5 shows the 10-fold cross-validation classification error rates for varying amounts of training data, for the two pattern classifiers and for action-type vs. action-type and parameters.

Engine        Class                      Precision
jbnc          Action-type only           72.7%
jbnc          Action-type & parameters   66.7%
C4.5          Action-type only           79.%
C4.5          Action-type & parameters   72.9%
Handcrafted   Action-type only           58.4%
Handcrafted   Action-type & parameters   53.9%

Table 4: Results from SL training and evaluation

Figure 5: Classification errors (%) vs. training examples (dialog turns) for action-type & parameters, comparing c4.5 and Naive Bayes

6 Discussion

The majority of the data collected was usable: although 26.7% of turns were excluded, 20.5% of all turns were due to a well-defined task not under study here (user identification), and only 6.1% fell outside of the designated tasks. That said, it may be desirable to impose a minimum threshold on how many times a flow, goal, or action must be observed before adding it to the state space or action set, given the long tails of these elements.

Figure 1: Dialogs containing flow components
Figure 2: Dialogs containing goal components
Figure 3: Dialogs containing action types
Figure 4: Dialogs containing information components

About half of the turns took place in states which were visited only once. This confirms that massive amounts of data would be needed to observe all states which occur within dialogs, and suggests that dialog does not primarily visit familiar states. Within a given state, the wizard's behavior is stochastic, occasionally deviating from an otherwise static policy. Some of this behavior results from database information not included in the corpus and state space; in other cases, the wizard is occasionally making random choices with no apparent basis.

Figure 5 implies that a relatively small number of dialogs (several hundred turns, or about 30-40 dialogs) contain the vast majority of the information relevant to the SL algorithms, which is less than expected. Correctly predicting the wizard's action in 72.9% of turns is significantly better than the 58.4% correct prediction rate from the handcrafted policy. When a caller allows the system to retain initiative, the policy learned by the c4.5 algorithm handled enquiries about single trips perfectly. Policy errors start to occur as the user takes more initiative, entering less well observed states.
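As a minimal sketch of this kind of supervised policy learning and 10-fold cross-validation (cf. Table 4 and Figure 5), using scikit-learn's decision tree as a stand-in for the original C4.5 and jbnc tools; the state components and values below are hypothetical:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Each example pairs a dialog state (dict of state components) with the wizard's
# action in that state. Component names and values here are hypothetical.
states = [{"from": "C", "to": "U", "db-request": False, "flow": "none"},
          {"from": "C", "to": "C", "db-request": True,  "flow": "yes"},
          {"from": "U", "to": "U", "db-request": False, "flow": "none"}]
actions = ["implicit-confirmation", "please-wait", "wh-question"]

vec = DictVectorizer()
X = vec.fit_transform(states)          # one-hot encoding of the state vector
policy = DecisionTreeClassifier()      # stand-in for C4.5

# On a corpus of real (state, action) pairs, 10-fold cross-validation estimates how
# often the learned policy reproduces the wizard's action:
#   scores = cross_val_score(policy, X, actions, cv=10)
policy.fit(X, actions)                 # the fitted classifier serves as the initial policy
print(policy.predict(vec.transform([{"from": "C", "to": "C", "db-request": True, "flow": "yes"}])))
```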

Hand examination of a small number of misclassified actions indicates that about half of the actions were reasonable, e.g., including an extra item in a confirmation. Hand examination also confirmed that the wizard's non-deterministic behavior and the lack of database information resulted in misclassifications. Other sources of misclassification derived primarily from under-accounting of the user's goal and other deficiencies in the expressiveness of the state space.

7 Conclusion & future work

This work has proposed a method for determining many of the basic elements of an RL-based spoken dialog system with minimal input from dialog designers, using a talking wizard. The viability of the model has been tested with an existing corpus and shown to perform significantly better than a hand-crafted policy, with less effort to create. Future research will explore refining this approach vis-à-vis the user goal, applying the method to actual RL-based systems, and finding suitable methods for parameterized value functions.

References

A. Jönsson and N. Dahlbäck. 1988. Talking to a Computer is Not Like Talking to Your Best Friend. Proceedings of the Scandinavian Conference on Artificial Intelligence '88, pp. 53-68.

Staffan Larsson, Arne Jönsson and Lena Santamarta. 2000. Using the process of distilling dialogues to understand dialogue systems. ICSLP 2000, Beijing.

Esther Levin, Roberto Pieraccini and Wieland Eckert. 2000. A Stochastic Model of Human-Machine Interaction for Learning Dialogue Structures. IEEE Transactions on Speech and Audio Processing 8(1):11-23.

R. K. Moore and S. R. Browning. 1992. Results of an exercise to collect genuine spoken enquiries using Wizard of Oz techniques. Proc. of the Institute of Acoustics.

Ross Quinlan. 1992. C4.5 Release 8. (Software package.) http://www.cse.unsw.edu.au/~quinlan/

Jarek P. Sacha. 2003. jbnc. (Software package.) http://sourceforge.net/projects/jbnc/

Satinder Singh, Diane Litman, Michael Kearns and Marilyn Walker. 2002. Optimizing Dialogue Management with Reinforcement Learning: Experiments with the NJFun System. Journal of Artificial Intelligence Research, vol. 16, 105-133.

Richard S. Sutton and Andrew G. Barto. 1998. Reinforcement Learning: An Introduction. The MIT Press, Cambridge, Massachusetts, USA.

Marilyn A. Walker, Jeanne C. Fromer and Shrikanth Narayanan. 1998. Learning Optimal Dialogue Strategies: A Case Study of a Spoken Dialogue Agent for Email. Proc. 36th Annual Meeting of the ACL and 17th International Conference on Computational Linguistics, 1345-1352.