Research perspective: Reinforcement learning and dialogue management. Reasoning and Learning Lab / Center for Intelligent Machines, School of Computer Science, McGill University. Samsung Research Forum, November 10, 2014.
Reinforcement learning 1. The learning agent tries a sequence of actions (a_t). 2. It observes the outcomes (state s_{t+1}, reward r_t) of those actions. 3. It statistically estimates the relationship between action choice and outcomes. After some time, it learns an action selection policy that optimizes the selected outcomes. (Bellman, 1957; Sutton, 1988; Sutton & Barto, 1998.)
RL vs supervised learning. Supervised learning: inputs → outputs; training signal = desired target outputs (e.g., a class label); learning from i.i.d. samples. Reinforcement learning: inputs from the environment → outputs (actions); training signal = rewards; jointly learning AND planning from correlated samples.
Reinforcement learning: Definitions. Model the problem as a Markov Decision Process (MDP): S: set of states. A: set of actions. Pr(s_t | s_{t-1}, a_t): probabilistic effects of actions. r(s_t, a_t): reward function. [Figure: the MDP chain of states s_{t-1}, s_t, s_{t+1}, actions a_{t-1}, a_t, a_{t+1}, and rewards r_{t-1}, r_t, r_{t+1}]
The policy. A policy is a mapping from states to actions. Deterministic policy: in each state, the agent chooses a unique action: π: S → A, π(s) = a. Stochastic policy: in each state, the agent samples an action from a distribution: π: S × A → [0, 1], π(s, a) = P(a_t = a | s_t = s). Goal: find the policy that maximizes the expected total reward (but there are many policies!): argmax_π E_π[r_0 + r_1 + … + r_T]
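The two policy types can be sketched directly in code; a minimal illustration (the toy states "s0"/"s1" and the action names are assumptions, not from the talk):

```python
import random

# Deterministic policy: a dictionary mapping each state to one action.
det_policy = {"s0": "left", "s1": "right"}

# Stochastic policy: each state maps to a distribution over actions.
stoch_policy = {
    "s0": {"left": 0.9, "right": 0.1},
    "s1": {"left": 0.2, "right": 0.8},
}

def act(state, policy, stochastic=False):
    """Select an action for `state` under the given policy."""
    if not stochastic:
        return policy[state]          # pi(s) = a
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]  # a ~ pi(s, .)
```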
Learning problem. Learn the function defining the state-action value: Q(s, a) = r(s, a) + Σ_{s'} P(s' | s, a) max_{a'} Q(s', a'), where r(s, a) is the immediate reward and the second term is the future expected sum of rewards.
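This fixed-point equation can be solved by repeatedly applying the Bellman backup. A sketch of Q-value iteration on a toy two-state MDP (the chain, its rewards, and the added discount factor gamma are illustrative assumptions):

```python
# Q-value iteration on a toy 2-state MDP (all numbers are illustrative).
# Implements: Q(s,a) = r(s,a) + gamma * sum_s' P(s'|s,a) * max_a' Q(s',a')
S = ["s0", "s1"]
A = ["stay", "go"]
P = {  # P[(s, a)] -> {s': probability}
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "go"):   {"s1": 1.0},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"):   {"s0": 1.0},
}
R = {("s0", "stay"): 0.0, ("s0", "go"): 0.0,
     ("s1", "stay"): 1.0, ("s1", "go"): 0.0}
gamma = 0.9  # discount factor, added here for convergence

Q = {(s, a): 0.0 for s in S for a in A}
for _ in range(200):  # iterate the Bellman backup to (near) convergence
    Q = {(s, a): R[(s, a)] + gamma * sum(p * max(Q[(s2, a2)] for a2 in A)
                                         for s2, p in P[(s, a)].items())
         for s in S for a in A}

# Greedy policy: in s0 it pays to "go" toward the rewarding state s1.
print(max(A, key=lambda a: Q[("s0", a)]))  # -> go
```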
In large state spaces: Need approximation
Batch reinforcement learning. Use regression to estimate the long-term value of different actions from the training data: regression with linear functions, kernel functions, random forests, neural networks, ... Important! The target function is the sum of future expected rewards.
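A minimal fitted Q-iteration sketch of this idea, regressing the bootstrapped target on hand-picked features (the toy 1-D chain task, the features, and all constants are assumptions for illustration, not the talk's setup):

```python
import numpy as np

# Fitted Q-iteration: repeatedly regress the target
#   y = r + gamma * max_a' Q_k(s', a')
# on (s, a) features, from a fixed batch of transitions.
rng = np.random.default_rng(0)
gamma = 0.9
actions = [0, 1]  # 0 = left, 1 = right

# Toy batch: states in [0,1]; action 1 moves right; reward near s = 1.
batch = []
for _ in range(500):
    s = rng.random()
    a = int(rng.random() < 0.5)
    s2 = min(1.0, s + 0.1) if a == 1 else max(0.0, s - 0.1)
    r = 1.0 if s2 >= 0.95 else 0.0
    batch.append((s, a, r, s2))

def features(s, a):
    # simple per-action polynomial features (a modeling assumption)
    base = np.array([1.0, s, s * s])
    return np.concatenate([base * (a == 0), base * (a == 1)])

w = np.zeros(6)  # linear-in-features Q estimate
for _ in range(50):  # fitted Q-iteration loop
    X = np.array([features(s, a) for s, a, r, s2 in batch])
    y = np.array([r + gamma * max(features(s2, a2) @ w for a2 in actions)
                  for s, a, r, s2 in batch])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)

# Near the goal, moving right should look better than moving left.
print(features(0.9, 1) @ w > features(0.9, 0) @ w)
```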
Scientific objective #1: Conditional computation. Training and evaluation of deep learning architectures can be expensive. => Adaptive training and evaluation of deep learning architectures. Use reinforcement learning (RL) for: Training phase: which weights to update, in what order, with what parameters (e.g., learning rate). Evaluation phase: which subset of nodes to compute to get sufficient information to predict the output. Possible case study: deep recurrent networks, e.g., for speech recognition.
Scientific objective #1: Conditional computation. Technical challenges: 1. Defining the deep net state. 2. Scaling RL to thousands of dimensions. 3. Finding a low-dimensional representation of the deep net state. 4. Handling large (continuous?) action spaces. 5. Delayed rewards: the effect of an adaptive configuration (especially for training) will only be visible after many steps in the dynamic system.
Scientific objective #1: Conditional computation. Expected impact on deep learning: faster, more efficient training of deep learning architectures (e.g., train a large model with 10^9 connections 10 times faster within 2 years); faster, more efficient use of deep learning architectures. Expected impact on reinforcement learning: novel, more scalable algorithms for other complex applications.
Scientific objective #2: Dialogue management http://mi.eng.cam.ac.uk/research/dialogue/epsrc/
Scientific objective #2: Dialogue management. Text-to-text interactions with multiple turn-taking. Potential uses: cell phone apps, call centers, phone-in information systems. Current approach: significant human effort to hand-design rules for an expert system. Goal: directly learn a good dialogue strategy from data using a deep learning architecture.
Speech-based control of a smart wheelchair / Recent advances in reinforcement learning
RL in partially observable domains. Partially Observable Markov Decision Processes (POMDPs) are formally defined by: State space (user intent, task status), S. Action space (robot commands), A. Observation space (sensor readings), Z. State-to-state transition probabilities, P(s' | s, a). State-emitted observation probabilities, P(z | s, a). Reward function, R(s, a) ∈ R. When the state is not observable, track the information state: b_t(s) := Pr(s_t = s | a_0, z_0, ..., a_t). [Figure: POMDP influence diagram over states s_t, actions a_t, observations z_t, and beliefs b_t]
Learning an interaction model [Png & Pineau, ICASSP 11]. The key challenge is to estimate the observation model, P(z | s, a). 1. Supervised learning: collect human subject data, label it, and directly estimate the model. 2. Bayesian learning: specify a prior, observe data, and apply a gradient method to update the posterior. Empirical returns show good learning. Using domain knowledge to constrain the structure is more useful than having accurate priors.
The Wheelchair Skills Test (WST). The test covers 32 skills (Kirby et al., Arch. Phys. Med. Rehabil., 2004). Each task is graded for performance and safety on a pass/fail scale by a human rater.
Wheelchair skills included in robotic test
User experiments. Phase 1: in-lab evaluation of the user interface (full WST, no robot); 8 university students not involved in the project; data used for training POMDP model parameters. Phase 2: WST with healthy subjects; 8 individuals working in the rehabilitation field; data used for validating system integration and baseline evaluation. Phase 3: WST with subjects with mobility disorders; 9 individuals, 31 to 85 years old, avg. 6.8 years of wheelchair use.
Dialogue management results: voice interaction with the control test subjects.
Dialogue management results: control subjects vs. wheelchair users. [Figure]
[Figure: WST performance scores (0-100) for subjects 1-9, standard vs. intelligent wheelchair]
Dialogue management: Proposed activities. 1. Identify large datasets. 2. Use a deep network to track user intentions during interaction. 3. Apply deep RL to learn optimal response strategies. 4. Explore pre-training with relevant non-dialogue corpora: learn domain knowledge (e.g., travel vocabulary); learn the structure of dialogue interactions. 5. Recurrent vs. non-recurrent deep nets for multiple (5+) turn-taking. 6. Online parameter estimation (user/task-specific?). 7. Quick prototyping of new topic-specific dialogue systems: use transfer learning to generalize between topics.
Research team @ McGill
Questions?
Three inference problems in POMDPs. Belief tracking: when the state is not observable, track the information state; generally tractable with a standard Bayesian filter, b_t(s) := Pr(s_t = s | b_0, a_0, z_0, ..., a_t). Easy! Planning: the objective is to select actions so as to maximize the expected sum of rewards, V(b_t) := E[Σ_{i=t}^{T} r_i | b_t]; approximately tractable with approximate dynamic programming. Hard! Learning: we usually assume the model is known a priori; learning the model from data is a major challenge. Harder!
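The belief-tracking step can be sketched as a discrete Bayesian filter (the two-state transition and observation matrices below are illustrative assumptions):

```python
import numpy as np

# Exact belief tracking in a toy 2-state POMDP.
# Implements: b'(s') ∝ O(z | s', a) * sum_s T(s' | s, a) * b(s)
T = {  # T[a][s, s'] transition matrix per action (illustrative numbers)
    "ask": np.array([[0.9, 0.1], [0.1, 0.9]]),
}
O = {  # O[a][s', z] observation matrix per action
    "ask": np.array([[0.8, 0.2], [0.3, 0.7]]),
}

def belief_update(b, a, z):
    """One step of the standard Bayesian filter."""
    predicted = T[a].T @ b            # sum_s T(s'|s,a) b(s)
    unnorm = O[a][:, z] * predicted   # weight by observation likelihood
    return unnorm / unnorm.sum()      # normalize

b = np.array([0.5, 0.5])
b = belief_update(b, "ask", z=0)
print(b)  # mass shifts toward state 0, which emits z=0 more often
```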
Bayesian learning: General idea. Let the model be a random variable, M. Choose a (conjugate) prior over the model, P(M). Generate observable measurements, Y. Assume a generative process, P(Y | M). Compute the posterior, P(M | Y) = P(Y | M) P(M) / P(Y). NOTE: This is a model-based Bayesian approach. You can also consider a model-free approach with a posterior over the value function [Ghavamzadeh & Engel, ICML 07].
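A minimal worked example of this recipe, using a conjugate Beta-Bernoulli pair (the data and prior are assumptions for illustration):

```python
# Conjugate Bayesian update: Beta prior over a Bernoulli model M,
# posterior P(M|Y) ∝ P(Y|M) P(M). Numbers are illustrative.
alpha, beta = 1.0, 1.0          # uniform Beta(1,1) prior on success prob.
Y = [1, 1, 0, 1, 1]             # observed measurements
alpha += sum(Y)                 # conjugacy: posterior is
beta += len(Y) - sum(Y)         #   Beta(alpha + successes, beta + failures)
posterior_mean = alpha / (alpha + beta)
print(posterior_mean)  # -> 5/7 ≈ 0.714
```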
Bayesian learning: POMDPs. Estimate POMDP model parameters using Bayesian inference: T: estimate a posterior ϕ^a_{ss'} on the incidence of transitions s →a s'. O: estimate a posterior ψ^a_{sz} on the incidence of observations s →a z. R: assume for now this is known (a straightforward extension). Goal: maximize the expected return under partial observability of (s, ϕ, ψ). This is also a POMDP problem: S': physical state (s ∈ S) + information state (ϕ, ψ). T': describes the probability of the update (s, ϕ, ψ) →a (s', ϕ', ψ'). O': describes the probability of observing a count increment. A solution to this problem is an optimal plan to act and learn!
Bayes-Adaptive POMDPs. Basic extended POMDP model [Ross et al., JMLR 11]. In this model: learning = tracking the hyper-state. Issues: representing ϕ, ψ; tracking the hyper-state; planning over the hyper-belief.
Bayes-Adaptive POMDPs: Belief tracking. Assume S, A, Z are discrete. Model ϕ, ψ using Dirichlet distributions. Initial hyper-belief: b_0(s, ϕ, ψ) = b_0(s) I(ϕ = ϕ_0) I(ψ = ψ_0), where b_0(s) is the initial belief over the original state space, I(·) is the indicator function, and (ϕ_0, ψ_0) are the initial counts (prior on T, O). Updating b_t defines a mixture of Dirichlets with O(|S|^{t+1}) components. In practice, approximate with a particle filter.
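A particle-filter approximation of the hyper-belief can be sketched as follows. Here the transition model is assumed known and actions are dropped to keep it short; each particle carries a state plus Dirichlet counts for the observation model ψ. All structures and numbers are illustrative, not the actual algorithm from [Ross et al. JMLR 11]:

```python
import random

# Particle-filter sketch for the hyper-belief b_t(s, phi, psi).
S, Z = [0, 1], [0, 1]
T = {0: [0.8, 0.2], 1: [0.2, 0.8]}  # known T(s'|s), action dropped

def step(particles, z):
    """Propagate, weight by Dirichlet-mean obs. likelihood, resample, learn."""
    weighted = []
    for s, psi in particles:
        s2 = random.choices(S, weights=T[s])[0]   # sample s' ~ T(.|s)
        w = psi[s2][z] / sum(psi[s2])             # E[O(z|s')] under Dirichlet
        weighted.append(((s2, psi), w))
    items, ws = zip(*weighted)
    resampled = random.choices(items, weights=ws, k=len(particles))
    out = []
    for s2, psi in resampled:
        psi = {k: list(v) for k, v in psi.items()}  # copy counts
        psi[s2][z] += 1                             # increment observed count
        out.append((s2, psi))
    return out

prior = {0: [1.0, 1.0], 1: [1.0, 1.0]}              # psi_0: uniform counts
particles = [(random.choice(S), {k: list(v) for k, v in prior.items()})
             for _ in range(100)]
for z in [0, 0, 0, 1, 0]:                           # observation stream
    particles = step(particles, z)
```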
Bayes-Adaptive POMDPs: Belief tracking. Different ways of approximating b_t(s, ϕ, ψ) via particle filtering: 1. Monte-Carlo sampling (MC). 2. K most probable hyper-states (MP). 3. Risk-sensitive filtering with a weighted distance metric.
Bayes-Adaptive POMDPs: Planning. Use receding-horizon control to estimate the value of each action at the current belief, b_t. Usually consider a short horizon of reachable beliefs. Use pruning and heuristics to reach longer planning horizons.
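Receding-horizon planning over beliefs can be sketched as an exhaustive short-depth forward search (toy two-state model; all matrices, rewards, and the discount are illustrative assumptions, not the planner from the talk):

```python
import numpy as np

# Forward search: estimate Q(b, a) by expanding reachable beliefs to depth d.
T = np.array([[0.9, 0.1], [0.1, 0.9]])   # T[s, s'] (action-independent here)
O = np.array([[0.8, 0.2], [0.3, 0.7]])   # O[s', z]
R = {"wait": np.array([0.0, 0.0]), "act": np.array([1.0, -1.0])}  # R[a][s]
gamma = 0.95

def update(b, z):
    """Belief update; also returns P(z | b) for weighting the branch."""
    unnorm = O[:, z] * (T.T @ b)
    return unnorm / unnorm.sum(), unnorm.sum()

def plan(b, depth):
    """Return (best action, value) by exhaustive depth-limited lookahead."""
    if depth == 0:
        return None, 0.0
    best = (None, -np.inf)
    for a, r in R.items():
        q = float(r @ b)                  # expected immediate reward
        for z in (0, 1):                  # branch on each observation
            b2, pz = update(b, z)
            q += gamma * pz * plan(b2, depth - 1)[1]
        if q > best[1]:
            best = (a, q)
    return best

b = np.array([0.5, 0.5])
print(plan(b, depth=2))
```

Deeper search looks further into the future but the tree grows exponentially in the depth, which is why pruning and heuristics are needed for longer horizons.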
Case study: Dialogue management [Png & Pineau, ICASSP 11]. Estimate the observation noise using a Bayesian method. Reduce the number of parameters to learn via hand-coded symmetry. Consider both a good prior (ψ = 0.8) and a weak prior (ψ = 0.6). Empirical returns show good learning. Using domain knowledge to constrain the structure is more useful than having accurate priors.
Case study: Dialogue management. Vary the depth of the forward search: does it improve the return? (Very noisy estimate; lots of variance.) In general, the return seems to improve up to planning depth d = 2, but not beyond.