Research perspective: Reinforcement learning and dialogue management

Research perspective: Reinforcement learning and dialogue management
Reasoning and Learning Lab / Center for Intelligent Machines
School of Computer Science, McGill University
Samsung Research Forum, November 10, 2014

Reinforcement learning
1. The learning agent tries a sequence of actions (a_t).
2. It observes the outcomes (state s_{t+1}, reward r_t) of those actions.
3. It statistically estimates the relationship between action choices and outcomes.
After some time, it learns an action-selection policy that optimizes the selected outcomes. (Bellman, 1957; Sutton, 1988; Sutton & Barto, 1998.)
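To make this loop concrete, here is a minimal sketch of tabular Q-learning, one standard way to carry out step 3; the environment object `env`, its `actions` list, and its gym-style `reset()`/`step()` interface are hypothetical stand-ins, not anything from this talk.

```python
# Minimal sketch of the try / observe / estimate loop described above,
# using tabular Q-learning against a hypothetical environment.
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = defaultdict(float)                      # Q[(state, action)] -> value estimate

    def greedy(s):
        return max(env.actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # 1. Try an action (epsilon-greedy exploration).
            a = random.choice(env.actions) if random.random() < epsilon else greedy(s)
            # 2. Observe the outcome: next state and reward.
            s_next, r, done = env.step(a)
            # 3. Statistically update the estimated value of (s, a).
            target = r + (0.0 if done else gamma * Q[(s_next, greedy(s_next))])
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next

    # The learned action-selection policy.
    return lambda s: greedy(s)
```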

RL vs supervised learning
In supervised learning, the training signal is the desired target output (e.g., a class label), and the learner maps inputs to outputs from i.i.d. samples.
In reinforcement learning, the training signal is a reward provided by the environment, the outputs are actions, and the agent is jointly learning AND planning from correlated samples.

Reinforcement learning: Definitions
Model the problem as a Markov Decision Process:
S: set of states
A: set of actions
Pr(s_t | s_{t-1}, a_t): probabilistic effects
r(s_t, a_t): reward function
[Diagram: the MDP graphical model, with actions a_{t-1}, a_t, a_{t+1}, states s_{t-1}, s_t, s_{t+1}, and rewards r_{t-1}, r_t, r_{t+1}.]

The policy
A policy is a mapping from states to actions.
Deterministic policy: in each state, the agent chooses a unique action. π: S → A, π(s) = a
Stochastic policy: in each state, the agent samples an action from a distribution. π: S × A → [0, 1], π(s, a) = P(a_t = a | s_t = s)
Goal: find the policy that maximizes the expected total reward (but there are many policies!):
argmax_π E_π[ r_0 + r_1 + ... + r_T ]
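A small sketch of how the two policy types might be represented in code; the dialogue-flavored state and action names are purely illustrative.

```python
# Sketch of the two policy types above; all names are placeholders.
import random

# Deterministic policy  pi: S -> A
pi_det = {"greet": "ask_destination", "confirm": "book_ticket"}

def act_deterministic(state):
    return pi_det[state]

# Stochastic policy  pi: S x A -> [0, 1],  pi(s, a) = P(a_t = a | s_t = s)
pi_stoch = {
    "greet":   {"ask_destination": 0.9, "book_ticket": 0.1},
    "confirm": {"ask_destination": 0.2, "book_ticket": 0.8},
}

def act_stochastic(state):
    dist = pi_stoch[state]
    actions, probs = zip(*dist.items())
    return random.choices(actions, weights=probs, k=1)[0]
```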

Learning problem
Learn the function defining the state-action value:
Q(s, a) = r(s, a) + Σ_{s'} P(s' | s, a) max_{a'} Q(s', a')
where the first term is the immediate reward and the second is the expected sum of future rewards.
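A hedged sketch of solving for Q by repeatedly applying this backup on a small, fully specified model; the tables `P` and `R` are hypothetical, and a discount factor gamma is added so the iteration converges (the slide's formula omits it).

```python
# Sketch: iterate the Bellman backup over a small, known model.
def q_value_iteration(states, actions, P, R, gamma=0.95, n_iters=100):
    """P[(s, a)] is a dict {s_next: probability}; R[(s, a)] is the reward."""
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(n_iters):
        Q_new = {}
        for s in states:
            for a in actions:
                # Expected value of the best next action, under the model.
                future = sum(p * max(Q[(s2, a2)] for a2 in actions)
                             for s2, p in P[(s, a)].items())
                Q_new[(s, a)] = R[(s, a)] + gamma * future
        Q = Q_new
    return Q
```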

In large state spaces: need approximation.

Batch reinforcement learning
Use regression analysis to estimate the long-term cost of different actions from the training data.
Regression with linear functions, kernel functions, random forests, neural networks, ...
[Figure: regression trees with split nodes of the form f_i < t_i.]
Important: the target function is the sum of future expected rewards.
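A minimal fitted-Q-iteration sketch of this batch procedure, assuming a hypothetical `batch` of (state features, action index, reward, next-state features) tuples; scikit-learn's RandomForestRegressor stands in for "linear function, kernel function, random forests, neural networks, ...", and none of the specifics come from the talk.

```python
# Sketch of batch (fitted) Q iteration: repeated regression on a fixed
# batch of transitions, where the target is reward + bootstrapped future value.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fitted_q_iteration(batch, n_actions, gamma=0.95, n_iters=20):
    X = np.array([np.append(s, a) for s, a, r, s2 in batch])   # (state, action) inputs
    rewards = np.array([r for _, _, r, _ in batch])
    next_states = np.array([s2 for _, _, _, s2 in batch])

    model = RandomForestRegressor(n_estimators=50)
    targets = rewards                                          # first pass: immediate reward only
    for _ in range(n_iters):
        model.fit(X, targets)
        # Target = immediate reward + expected sum of future rewards (bootstrapped
        # by querying the current model at the next state, for every action).
        next_q = np.column_stack([
            model.predict(np.column_stack([next_states, np.full(len(batch), a)]))
            for a in range(n_actions)
        ])
        targets = rewards + gamma * next_q.max(axis=1)
    return model
```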

Scientific objective #1: Conditional computation
Training and evaluation of deep learning architectures can be expensive. => Adaptive training and evaluation of deep learning architectures.
Use reinforcement learning (RL) for:
Training phase: which weights to update, in what order, and with what parameters (e.g., learning rate).
Evaluation phase: which subset of nodes to compute to get sufficient information to predict the output.
Possible case study: deep recurrent networks, e.g., for speech recognition.

Scientific objective #1: Conditional computation
Technical challenges:
1. Defining the deep net state.
2. Scaling RL to thousands of dimensions.
3. Finding a low-dimensional representation of the deep net state.
4. Handling large (continuous?) action spaces.
5. Delayed rewards: the effect of adaptive configuration (especially for training) will only be visible after many steps in the dynamic system.

Scientific objective #1: Conditional computation
Expected impact, deep learning: faster, more efficient training of deep learning architectures (e.g., train a large model with 10^9 connections 10 times faster within 2 years), and faster, more efficient use of deep learning architectures.
Expected impact, reinforcement learning: novel, more scalable algorithms for other complex applications.

Scientific objective #2: Dialogue management
http://mi.eng.cam.ac.uk/research/dialogue/epsrc/

Scientific objective #2: Dialogue management
Text-to-text interactions with multiple turn-taking. Potential uses: cell phone apps, call centers, phone-in information systems.
Current approach: significant human effort to hand-design rules for an expert system.
Goal: directly learn a good dialogue strategy from data using a deep learning architecture.

Speech-based control of a smart wheelchair
Recent advances in reinforcement learning

RL in partially observable domains
Partially Observable Markov Decision Processes (POMDPs) are formally defined by:
State space (user intent, task status), S
Action space (robot commands), A
Observation space (sensor readings), Z
State-to-state transition probabilities, P(s' | s, a)
State-emitted observation probabilities, P(z | s, a)
Reward function, R(s, a) ∈ ℝ
[Diagram: POMDP graphical model with states s_t, s_{t+1}, s_{t+2}, actions a_t, a_{t+1}, observations z_{t+1}, z_{t+2}, and beliefs b_t, b_{t+1}, b_{t+2}.]
When the state is not observable, track the information state:
b_t(s) := Pr(s_t = s | a_0, z_0, ..., a_t)
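A short sketch of the information-state (belief) update implied here, assuming discrete sets and hypothetical model dictionaries `T[(s, a)] = {s': prob}` and `O[(s', a)] = {z: prob}`.

```python
# Sketch of the Bayesian belief update:
#   b'(s') ∝ O(z | s', a) * Σ_s T(s' | s, a) * b(s)
def belief_update(b, a, z, states, T, O):
    b_new = {}
    for s2 in states:
        pred = sum(T[(s, a)].get(s2, 0.0) * b[s] for s in states)   # prediction step
        b_new[s2] = O[(s2, a)].get(z, 0.0) * pred                   # correction step
    norm = sum(b_new.values())
    if norm == 0.0:
        raise ValueError("Observation has zero probability under the model.")
    return {s2: p / norm for s2, p in b_new.items()}
```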

Learning an interaction model [Png & Pineau, ICASSP 11]
The key challenge is to estimate the observation model, P(z | s, a).
1. Supervised learning: collect human subject data, label it, and directly estimate the model.
2. Bayesian learning: specify a prior, observe data, and apply a gradient method to update the posterior.
Empirical returns show good learning. Using domain knowledge to constrain the structure is more useful than having accurate priors.

The Wheelchair Skills Test (WST)
The test covers 32 skills (Kirby et al., Arch. Phys. Med. Rehabil., 2004). Each task is graded for performance and safety on a pass/fail scale by a human rater.

Wheelchair skills included in the robotic test

User experiments
Phase 1: In-lab evaluation of the user interface (full WST, no robot). 8 university students not involved in the project. Data used for training the POMDP model parameters.
Phase 2: WST with healthy subjects. 8 individuals working in the rehabilitation field. Data used for validating system integration and baseline evaluation.
Phase 3: WST with subjects with mobility disorders. 9 individuals, 31 to 85 years old, with an average of 6.8 years of wheelchair use.

Dialogue management results
Voice interaction with the control test subjects:

Dialogue management results
[Figure: example interactions for control subjects and for wheelchair users.]

[Figure: WST performance score (0-100) per subject (IDs 1-9), comparing the standard and intelligent wheelchairs.]

Dialogue management: Proposed activities
1. Identify large datasets.
2. Use a deep network to track user intentions during the interaction.
3. Apply deep RL to learn optimal response strategies.
4. Explore pre-training with relevant non-dialogue corpora: learn domain knowledge (e.g., travel vocabulary) and learn the structure of dialogue interactions.
5. Recurrent vs. non-recurrent deep nets for multiple (5+) turn-taking.
6. Online parameter estimation (user/task-specific?).
7. Quick prototyping of new topic-specific dialogue systems, using transfer learning to generalize between topics.

Research team @ McGill

Questions?

Three inference problems in POMDPs
Belief tracking: when the state is not observable, track the information state. Generally tractable with a standard Bayesian filter, b_t(s) := Pr(s_t = s | b_0, a_0, z_0, ..., a_t). Easy!
Planning: the objective is to select actions so as to maximize the expected sum of rewards, V(b_t) := E[ Σ_{i=t}^{T} r_i | b_t ]. Approximately tractable with approximate dynamic programming. Hard!
Learning: the model is usually assumed to be known a priori. Learning the model from data is a major challenge. Harder!

Bayesian learning: General idea
Let the model be a random variable, M.
Choose a (conjugate) prior over the model, P(M).
Generate observable measurements, Y.
Assume a generative process, P(Y | M).
Compute the posterior, P(M | Y) = P(Y | M) P(M) / P(Y).
NOTE: This is a model-based Bayesian approach. You can also consider a model-free approach with a posterior over the value function [Ghavamzadeh & Engel, ICML 07].
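For a discrete model, this recipe reduces to conjugate counting; the sketch below shows a Dirichlet-style posterior update, with all names and numbers purely illustrative.

```python
# Sketch of a conjugate (Dirichlet/multinomial) posterior update:
# the prior is a table of pseudo-counts, and observing data just adds counts.
from collections import Counter

def dirichlet_posterior(prior_counts, observations):
    """prior_counts: {outcome: pseudo-count} (the prior P(M));
    observations: iterable of outcomes (the data Y).
    Returns posterior pseudo-counts and the posterior mean of each outcome."""
    posterior = Counter(prior_counts)
    posterior.update(observations)                  # conjugacy: just add counts
    total = sum(posterior.values())
    posterior_mean = {o: c / total for o, c in posterior.items()}
    return dict(posterior), posterior_mean

# Example: prior belief that the speech recognizer is mostly correct.
prior = {"correct": 8, "wrong": 2}                  # roughly a 0.8 prior
post_counts, post_mean = dirichlet_posterior(prior, ["correct", "wrong", "correct"])
```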

Bayesian learning: POMDPs
Estimate the POMDP model parameters using Bayesian inference:
T: estimate a posterior φ^a_{ss'} on the incidence of transitions s --a--> s'.
O: estimate a posterior ψ^a_{sz} on the incidence of observations s --a--> z.
R: assume for now that this is known (a straightforward extension).
Goal: maximize the expected return under partial observability of (s, φ, ψ). This is also a POMDP problem:
S': physical state (s ∈ S) + information state (φ, ψ)
T': describes the probability of the update (s, φ, ψ) --a--> (s', φ', ψ')
O': describes the probability of observing a count increment.
A solution to this problem is an optimal plan to act and learn!

Bayes-Adaptive POMDPs
Basic extended POMDP model [Ross et al., JMLR 11]. In this model: learning = tracking the hyper-state.
Issues: representing φ, ψ; tracking the hyper-state; planning over the hyper-belief.

Bayes-Adaptive POMDPs: Belief tracking
Assume S, A, Z are discrete. Model φ, ψ using Dirichlet distributions.
Initial hyper-belief: b_0(s, φ, ψ) = b_0(s) I(φ = φ_0) I(ψ = ψ_0), where b_0(s) is the initial belief over the original state space, I(·) is the indicator function, and (φ_0, ψ_0) are the initial counts (the prior on T, O).
Updating b_t defines a mixture of Dirichlets with O(|S|^{t+1}) components. In practice, approximate it with a particle filter.
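A rough sketch of such a particle-filter approximation of the hyper-belief, where each particle carries a state plus Dirichlet counts (φ for transitions, ψ for observations); the data structures and interface are assumptions, not the paper's implementation.

```python
# Sketch: approximate the hyper-belief b_t(s, phi, psi) with weighted particles.
# Each particle proposes a next state from its own transition counts, is
# re-weighted by the likelihood of the observation z, updates its counts,
# and the set is then resampled.
import random

def sample_from_counts(counts):
    keys, weights = zip(*counts.items())
    return random.choices(keys, weights=weights, k=1)[0]

def bapomdp_particle_update(particles, a, z, n_particles=100):
    """particles: list of dicts {'s': state, 'phi': {(s, a): {s2: count}},
    'psi': {(s2, a): {z: count}}, 'w': weight}."""
    new_particles = []
    for p in particles:
        s2 = sample_from_counts(p['phi'][(p['s'], a)])           # propose next state
        psi_counts = p['psi'][(s2, a)]
        w = p['w'] * psi_counts.get(z, 0.0) / sum(psi_counts.values())  # P(z | s2, a)
        q = {'s': s2,
             'phi': {k: dict(v) for k, v in p['phi'].items()},
             'psi': {k: dict(v) for k, v in p['psi'].items()},
             'w': w}
        # Count increments: one more s -a-> s2 transition, one more z from (s2, a).
        q['phi'][(p['s'], a)][s2] = q['phi'][(p['s'], a)].get(s2, 0) + 1
        q['psi'][(s2, a)][z] = q['psi'][(s2, a)].get(z, 0) + 1
        new_particles.append(q)
    total = sum(p['w'] for p in new_particles)
    if total == 0.0:
        raise ValueError("Observation has zero probability under all particles.")
    weights = [p['w'] / total for p in new_particles]
    resampled = random.choices(new_particles, weights=weights, k=n_particles)
    return [{'s': p['s'], 'phi': p['phi'], 'psi': p['psi'], 'w': 1.0 / n_particles}
            for p in resampled]
```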

Bayes-Adaptive POMDPs: Belief tracking
Different ways of approximating b_t(s, φ, ψ) via particle filtering:
1. Monte-Carlo sampling (MC)
2. K most probable hyper-states (MP)
3. Risk-sensitive filtering with a weighted distance metric

Bayes-Adaptive POMDPs: Planning
Receding-horizon control to estimate the value of each action at the current belief, b_t. Usually consider a short horizon of reachable beliefs; use pruning and heuristics to reach longer planning horizons.
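A possible sketch of depth-limited forward search from the current belief, reusing `belief_update()` from the POMDP sketch above; the model tables `T`, `O`, `R` are hypothetical, and no pruning or heuristics are included.

```python
# Sketch of receding-horizon (depth-limited) forward search over the belief:
# expand actions and observations to depth `depth` and back up expected rewards.
def forward_search(b, depth, states, actions, observations, T, O, R, gamma=0.95):
    """Return (best_value, best_action) for belief b."""
    if depth == 0:
        return 0.0, None
    best_value, best_action = float("-inf"), None
    for a in actions:
        value = sum(b[s] * R[(s, a)] for s in states)          # expected immediate reward
        for z in observations:                                 # expectation over observations
            p_z = sum(b[s] * T[(s, a)].get(s2, 0.0) * O[(s2, a)].get(z, 0.0)
                      for s in states for s2 in states)
            if p_z > 0.0:
                b_next = belief_update(b, a, z, states, T, O)
                future, _ = forward_search(b_next, depth - 1, states, actions,
                                           observations, T, O, R, gamma)
                value += gamma * p_z * future
        if value > best_value:
            best_value, best_action = value, a
    return best_value, best_action
```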

Case study: Dialogue management [Png & Pineau, ICASSP 11]
Estimate the observation noise using the Bayesian method. Reduce the number of parameters to learn via hand-coded symmetry.
Consider both a good prior (ψ = 0.8) and a weak prior (ψ = 0.6).
Empirical returns show good learning. Using domain knowledge to constrain the structure is more useful than having accurate priors.

Case study: Dialogue management
Vary the depth of the forward search. Does it improve the return? (Very noisy estimate; lots of variance.) In general, the return seems to improve up to planning depth d = 2, but not beyond.