Reinforcement Learning based Dialog Manager
Katri Leino
Speech Group, Department of Signal Processing and Acoustics
User Interface Group, Department of Communications and Networking
Aalto University, School of Electrical Engineering
Outline
- Spoken Dialog System
- Reinforcement Learning
- POMDP
- Belief Tracking
- Policy Model
- User Simulation
- Fast Learning and User Adaptation
- Evaluation
Spoken Dialog System
- Interface between a user and a database where speech is the primary communication medium
- Interactive conversational agents: support services, teaching tutors, entertainment, ...
- Examples: Apple's Siri, call centers, Alexa
Dialog System - Structure
Dialogue Manager
- Traditionally a handcrafted flow chart
  - Time-consuming and expensive to build
  - Fragile with unreliable input
  - Needs error checking and recovery
Reinforcement Learning
- An AI agent iterates, receives rewards, and learns
POMDP - Partially Observable Markov Decision Process
- Mathematical framework for the dialog manager
- Dialog as a Markov process: a stochastic process satisfying the Markov property (future states depend only on the present state, not on the sequence of events that preceded it)
- Defining parameters:
  - State $s_t$ and action $a_t$ at time step $t$
  - Transition probability $p(s_t \mid s_{t-1}, a_{t-1})$
  - Observation probability $p(o_t \mid s_t)$, where $o_t$ is the noisy observation of the user input
- Dialog model: models e.g. the state transition and observation probability functions
- Policy model: decides which action to take at each turn
- Reward function: expected reward $r(s_t, a_t)$; typically maximize success and minimize dialog length
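Because the true state is never observed directly, the system maintains a belief state $b_t$, a distribution over states updated every turn. A standard form of the POMDP belief update, consistent with the definitions above (Young et al., 2013), is:

```latex
b_t(s_t) = \eta \, p(o_t \mid s_t) \sum_{s_{t-1}} p(s_t \mid s_{t-1}, a_{t-1}) \, b_{t-1}(s_{t-1})
```

where $\eta$ is a normalizing constant that makes the belief sum to one.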
POMDP with Reinforcement Learning
Belief Tracking
- Part of the dialog model; determines the state probabilities
- Updated throughout the dialog; models user behavior (history), current intention, and goal
- Maintains probabilities over all states
- Important for training the policy and for error situations/recovery
- Challenge: huge belief space - the number of states, actions, and observations easily exceeds 10^10
  - N-best approach: pruning and recombination (see the sketch below)
  - Factored Bayesian network approach
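A minimal sketch of the N-best pruning idea, assuming the belief is stored as a dictionary from state hypotheses to probabilities (the data format is an illustrative assumption):

```python
def prune_belief(belief, n_best=10):
    """Keep the n_best most probable state hypotheses and renormalize."""
    top = sorted(belief.items(), key=lambda kv: kv[1], reverse=True)[:n_best]
    total = sum(p for _, p in top)
    return {state: p / total for state, p in top}

# Example belief over four partial dialog-state hypotheses
b = {"goal=hotel,price=cheap": 0.5, "goal=hotel": 0.3,
     "goal=restaurant": 0.15, "goal=unknown": 0.05}
print(prune_belief(b, n_best=2))  # {'goal=hotel,price=cheap': 0.625, 'goal=hotel': 0.375}
```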
Policy Model
- Maps a belief state b to an appropriate system action a
- Objective: find an optimal policy that maximizes the reward function
- An exact representation of the policy is possible in theory but does not work in practice
- Policy optimization need not be fully automatic: the designer can handcraft rules that keep the system from taking actions that are illogical from a human perspective
  - Constraining actions results in faster convergence
  - But this requires human work, and overly strict rules can rule out the optimal policy; prefer heuristics over strict rules (assign low probability to bad actions)
- A compact representation of the policy is essential
Summary Space
- Only part of the belief space is actually visited during a dialog
- States and actions are restricted depending on the location in the space
- Belief tracking is performed in the master space; policy optimization takes place in a subspace, the summary space
- The belief tracker provides features to the summary space (see the sketch below)
  - Usually 5-20 features, selected by hand: user's top goal, state frequencies, dialog history
- Alternatively, a probability distribution can model the belief state directly instead of handcrafted features
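A hypothetical sketch of such a hand-picked feature mapping; the four features (top-goal probability, margin to the second hypothesis, belief entropy, dialog length) are illustrative choices, not the feature set of any particular system:

```python
import math

def summary_features(belief, history):
    """Map a master-space belief (dict: state -> prob) to a small summary vector."""
    probs = sorted(belief.values(), reverse=True)
    top1 = probs[0]
    top2 = probs[1] if len(probs) > 1 else 0.0
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return [top1, top1 - top2, entropy, len(history)]
```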
Model-based Optimization
- Model parameters (e.g., transition probabilities) estimated from a corpus by frequency counts
- No user responses, no interaction with the user
- Requires a large corpus, which has to be gathered beforehand
- Dialogs and policy are fixed to the corpus; action and state spaces are fixed
- Optimized with value iteration (sketched below)
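Value iteration is easiest to illustrate on a small fully observable MDP (in a POMDP it runs over belief points rather than discrete states); the toy transition and reward tables below are invented for illustration:

```python
import numpy as np

# Toy MDP: 3 states, 2 actions. P[a][s, s'] transition probs, R[s, a] rewards.
P = np.array([[[0.8, 0.2, 0.0],   # action 0
               [0.1, 0.8, 0.1],
               [0.0, 0.2, 0.8]],
              [[0.5, 0.5, 0.0],   # action 1
               [0.0, 0.5, 0.5],
               [0.0, 0.0, 1.0]]])
R = np.array([[0.0, -1.0],
              [0.0, -1.0],
              [1.0,  5.0]])
gamma = 0.95

V = np.zeros(3)
for _ in range(1000):
    # Q[s, a] = R[s, a] + gamma * sum_s' P[a][s, s'] * V[s']
    Q = R + gamma * np.einsum("ast,t->sa", P, V)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

policy = Q.argmax(axis=1)  # greedy policy read off the converged values
```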
Monte Carlo Optimization
- Policy is optimized online; requires users to use the system
- The current estimate of the policy is used to select actions
  - ε-greedy exploration: less probable actions get less exploration time
- The policy is updated after each dialog according to the sequence of states, actions, and rewards (see the sketch below)
- Cannot be trained from a corpus, hence user simulation
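A minimal sketch of ε-greedy Monte Carlo policy optimization over a discrete summary space, assuming a single terminal reward per dialog (success bonus minus length penalty); the state/action encoding is an illustrative assumption:

```python
import random
from collections import defaultdict

EPSILON = 0.1
Q = defaultdict(float)   # Q[(state, action)]: estimated return
N = defaultdict(int)     # visit counts for running-average updates

def select_action(state, actions):
    """Epsilon-greedy: explore with probability EPSILON, otherwise act greedily."""
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def update_policy(episode, reward):
    """Every-visit Monte Carlo update from one dialog.

    episode: list of (state, action) pairs; reward: final dialog reward.
    """
    for state, action in episode:
        N[(state, action)] += 1
        alpha = 1.0 / N[(state, action)]
        Q[(state, action)] += alpha * (reward - Q[(state, action)])
```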
User Simulation
- A user simulator interacts directly with the dialog system
- Used for development, parameter training, and evaluation
- Error model for simulating ASR-related errors
  - A set of confusion networks, not just binary errors, to simulate the real behavior of the ASR system (see the toy sketch below)
- Simulators are biased towards certain behavior, so the system may not work well in real life
  - Train and test the policy on different simulators
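A toy sketch of the confusion idea: real systems use full confusion networks with statistics derived from the ASR system, while the words and probabilities below are invented for illustration:

```python
import random

# Toy confusion model: p(heard | said) for a few slot values.
# These probabilities are illustrative assumptions, not measured ASR statistics.
CONFUSIONS = {
    "cheap":   [("cheap", 0.80), ("chinese", 0.15), ("", 0.05)],
    "chinese": [("chinese", 0.85), ("cheap", 0.10), ("", 0.05)],
    "centre":  [("centre", 0.90), ("", 0.10)],
}

def corrupt(user_act):
    """Replace each word of the simulated user act by a confusable hypothesis."""
    heard = []
    for word in user_act:
        hyps, weights = zip(*CONFUSIONS.get(word, [(word, 1.0)]))
        heard.append(random.choices(hyps, weights=weights)[0])
    return [w for w in heard if w]  # drop deletions

print(corrupt(["cheap", "chinese", "centre"]))
```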
User Simulation - Methods
- N-gram models
  - To model context and obtain consistent behavior, N would have to be impractically large
- Dynamic Bayesian networks and HMMs
  - Trained on data; can model conditional dependencies
  - Data sparsity problem with the joint probabilities
- Inverse reinforcement learning
  - Learns the user's reward function from real human-human conversations
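A minimal sketch of the N-gram idea for N=2: the simulated user's next dialog act depends only on the last system act, with probabilities estimated from corpus frequencies (the data format and the `null()` back-off act are illustrative assumptions):

```python
import random
from collections import Counter, defaultdict

class BigramUserSimulator:
    """p(user_act | previous system_act), estimated by corpus frequencies."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def train(self, dialogs):
        # dialogs: list of [(system_act, user_act), ...] turn pairs
        for dialog in dialogs:
            for system_act, user_act in dialog:
                self.counts[system_act][user_act] += 1

    def respond(self, system_act):
        acts = self.counts.get(system_act)
        if not acts:
            return "null()"  # unseen context: back off to a null act
        hyps, weights = zip(*acts.items())
        return random.choices(hyps, weights=weights)[0]
```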
Fast Learning and User Adaptation
- Speeds up optimization so that the policy can be trained on or adapted to real users
- Learns via interaction with the user
- Gaussian process based reinforcement learning
  - Non-parametric Bayesian inference model, specified by a mean and a kernel function
  - The variance measures uncertainty: when the system is uncertain, an off-policy action can be selected to explore for the optimal policy
  - Works in both the master and the summary space; no need for handcrafted features
  - Can convey the system's confidence level to the user
- GP-SARSA algorithm
  - Optimizes faster than standard RL algorithms
  - Suitable for real-world problems with real people
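Not the full GP-SARSA algorithm, but a minimal sketch of the underlying idea: Gaussian process regression yields both a Q-value estimate (posterior mean) and an uncertainty (posterior variance) at any belief-action point; the RBF kernel and toy data are illustrative assumptions:

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    # Squared-exponential kernel between rows of X1 and X2
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def gp_posterior(X_train, y_train, X_test, noise=0.1):
    """Posterior mean and variance of a zero-mean GP at X_test."""
    K = rbf_kernel(X_train, X_train) + noise**2 * np.eye(len(X_train))
    K_s = rbf_kernel(X_train, X_test)
    K_ss = rbf_kernel(X_test, X_test)
    alpha = np.linalg.solve(K, y_train)
    mean = K_s.T @ alpha
    cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)
    return mean, np.diag(cov)

# Points are (belief-feature, action-id) pairs with observed returns.
X = np.array([[0.9, 0.0], [0.2, 1.0], [0.5, 0.0]])
y = np.array([1.0, -1.0, 0.2])
mean, var = gp_posterior(X, y, np.array([[0.8, 0.0]]))
# High variance at unexplored points can trigger exploratory actions.
```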
Evaluation
- Goal: user satisfaction - difficult to measure because it requires interaction
- Modules can be evaluated separately
- Real users with real needs: testing with artificial goals causes bias
- PARADISE framework: performance function as a weighted sum of task success and dialog length
- User simulation: simple, efficient, wide coverage of dialogs and scenarios, includes an ASR model
  - Compare model vs. real users with the PARADISE framework
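The PARADISE performance function from Walker et al. (1997) makes the "weighted sum" precise:

```latex
\text{performance} = \alpha \cdot \mathcal{N}(\kappa) - \sum_{i=1}^{n} w_i \cdot \mathcal{N}(c_i)
```

where $\kappa$ is the task-success measure (the kappa coefficient), the $c_i$ are cost measures such as the number of turns, $\mathcal{N}$ is a Z-score normalization, and the weights $\alpha, w_i$ are fit by regression against user-satisfaction ratings.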
Summary
- Statistical dialog managers are faster to create than handcrafted ones
- The resulting systems are equally good compared to handcrafted ones
- Many methods have been researched
- User simulation vs. fast learning
- Evaluation remains challenging
Homework
Select a component of the system and a method you find interesting, and write a short summary about it.
References
- Young, Steve, et al. "POMDP-based statistical spoken dialog systems: A review." Proceedings of the IEEE 101.5 (2013): 1160-1179.
- Schatzmann, Jost, et al. "A survey of statistical user simulation techniques for reinforcement-learning of dialogue management strategies." The Knowledge Engineering Review 21.2 (2006): 97-126.
- Walker, Marilyn A., et al. "PARADISE: A framework for evaluating spoken dialogue agents." Proceedings of the Eighth Conference on European Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, 1997.
- Gasic, Milica, and Steve Young. "Gaussian processes for POMDP-based dialogue manager optimization." IEEE/ACM Transactions on Audio, Speech, and Language Processing 22.1 (2014): 28-40.