
Reinforcement Learning based Dialog Manager

Katri Leino
Speech Group, Department of Signal Processing and Acoustics
User Interface Group, Department of Communications and Networking
Aalto University, School of Electrical Engineering

Outline
- Spoken Dialog System
- Reinforcement Learning
- POMDP
- Belief Tracking
- Policy Model
- User Simulation
- Fast Learning and User Adaptation
- Evaluation

Spoken Dialog System
- Interface between a user and a database, with speech as the primary communication medium
- Interactive conversational agents
- Support services, teaching tutors, entertainment: Apple's Siri, call centers, Alexa

Dialog System - Structure

Dialogue Manager
- Traditionally a handcrafted flow chart
- Time-consuming and expensive to build
- Fragile: input is unreliable, so error checking and recovery are needed

Reinforcement Learning
- The agent acts, receives a reward from its environment, and iterates to learn
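
A minimal sketch of this act-reward-learn loop in code, assuming a Gym-style environment with reset() and step() methods and a tabular value function; all names here are illustrative, not part of the slides:

```python
import random
from collections import defaultdict

def run_episode(env, q, actions, epsilon=0.1, alpha=0.1, gamma=0.99):
    """One act-reward-learn loop (a sketch).

    env is assumed to expose reset() -> state and
    step(action) -> (next_state, reward, done);
    q is a tabular action-value estimate.
    """
    state, done, total = env.reset(), False, 0.0
    while not done:
        # epsilon-greedy: mostly exploit the current estimate, sometimes explore
        if random.random() < epsilon:
            action = random.choice(actions)
        else:
            action = max(actions, key=lambda a: q[(state, a)])
        next_state, reward, done = env.step(action)
        # one-step temporal-difference update toward the observed reward
        best_next = max(q[(next_state, a)] for a in actions)
        q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
        state, total = next_state, total + reward
    return total

q = defaultdict(float)  # values default to 0.0 for unseen (state, action) pairs
```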

POMDP - Partially Observable Markov Decision Process
- Mathematical framework for the dialog manager
- Dialog as a Markov process: a stochastic process that satisfies the Markov property, i.e. future states depend only on the present state, not on the sequence of events that preceded it
- Defining parameters:
  - State s_t and action a_t at time step t
  - Transition probability p(s_t | s_{t-1}, a_{t-1})
  - User input probability p(o_t | s_t), where o_t is a noisy observation
- Dialog model: models e.g. the state transition and observation probability functions
- Policy model: decides which action to take during each turn
- Reward function: expected reward r(s_t, a_t); typically maximize success and minimize dialog length
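
Because the state is only partially observed, the manager maintains a belief b(s) over states. The standard POMDP belief update, written with the quantities defined above (it is implicit in the slides rather than stated), is:

```latex
b_t(s_t) \propto p(o_t \mid s_t) \sum_{s_{t-1}} p(s_t \mid s_{t-1}, a_{t-1}) \, b_{t-1}(s_{t-1})
```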

POMDP with Reinforcement Learning

Belief Tracking
- Part of the dialog model; maintains the state probabilities over all states
- Updated at every turn; models user behavior (history), current intention and goal
- Important for training the policy and in error situations and recovery
- Challenge: huge belief space; the number of states, actions and observations is easily over 10^10
  - N-best approach: pruning and recombination (see the sketch below)
  - Factored Bayesian network approach
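
A sketch of one belief-tracking turn over a discrete state set, implementing the update above together with the N-best pruning idea; the dictionary-based transition and observation models are illustrative stand-ins for a real dialog model:

```python
def update_belief(belief, action, observation, p_trans, p_obs, n_best=10):
    """One belief-tracking turn: predict, correct, then prune to the N best states.

    belief  : dict state -> probability (the current belief b)
    p_trans : dict (prev_state, action) -> dict next_state -> probability
    p_obs   : dict state -> dict observation -> probability
    """
    new_belief = {}
    for prev_state, prob in belief.items():
        for next_state, p in p_trans.get((prev_state, action), {}).items():
            # predict with the transition model, correct with the observation model
            weight = p_obs.get(next_state, {}).get(observation, 0.0)
            new_belief[next_state] = new_belief.get(next_state, 0.0) + prob * p * weight
    # N-best pruning keeps the belief tractable in a huge state space
    top = sorted(new_belief.items(), key=lambda kv: kv[1], reverse=True)[:n_best]
    total = sum(p for _, p in top) or 1.0
    return {s: p / total for s, p in top}
```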

Policy Model
- Maps a belief state b to an appropriate system action a
- The objective is to find an optimal policy that maximizes the reward function
- An exact representation of the policy is possible in theory but does not work in real life
- Policy optimization need not be fully automatic: the designer can handcraft rules and restrict the system from taking actions that are illogical from a human perspective
  - Constraining actions results in faster convergence
  - But this requires human work, and too-strict rules can rule out the optimal policy; use heuristics instead of strict rules (assign low probability to bad actions), as sketched below
- A compact representation of the policy is essential
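
A sketch of the soft-constraint idea: instead of forbidding actions outright, a designer-supplied penalty lowers their probability under a softmax. Both q_value and penalty are illustrative callables, not part of the slides:

```python
import math
import random

def constrained_policy(belief_features, actions, q_value, penalty):
    """Sample an action with soft heuristic constraints (a sketch).

    q_value(features, a) scores an action under the current policy estimate;
    penalty(a) is a handcrafted cost for implausible actions. The softmax
    keeps bad actions at a low but nonzero probability, so a heuristic never
    excludes the optimal policy the way a hard rule could.
    """
    scores = [q_value(belief_features, a) - penalty(a) for a in actions]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]  # numerically stable softmax
    return random.choices(actions, weights=weights, k=1)[0]
```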

Summary Space
- Only part of the belief space is actually visited during a dialog
- States and actions are restricted depending on the location in the space
- Belief tracking is performed in the master space; policy optimization takes place in a subspace, the summary space
- The belief tracker provides features for the summary space
  - Usually 5-20 features selected by hand: the user's top goal, state frequencies, dialog history
  - Alternatively, a probability distribution models the belief state instead of hand-picked features
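
A sketch of the master-to-summary mapping; the three features below are illustrative examples in the spirit of the hand-picked sets the slides mention, not a standard choice:

```python
import math

def summary_features(belief, history):
    """Map a master-space belief (dict state -> prob) to a small feature vector."""
    probs = sorted(belief.values(), reverse=True)
    top_goal_prob = probs[0] if probs else 0.0               # confidence in the top goal
    entropy = -sum(p * math.log(p) for p in probs if p > 0)  # spread of the belief
    return [top_goal_prob, entropy, len(history)]            # plus dialog length so far
```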

Model-based Optimization
- Model parameters are estimated from a corpus: transition probabilities from frequencies
- No user responses and no interaction with the user
- Requires a large corpus, which has to be gathered beforehand
- Dialogs and policy are fixed to the corpus; the action and state space are fixed
- Solved with value iteration, sketched below
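
A sketch of value iteration over corpus-estimated dynamics; p_trans and reward are illustrative stand-ins for the quantities estimated from corpus frequencies:

```python
def value_iteration(states, actions, p_trans, reward, gamma=0.95, tol=1e-6):
    """Value iteration on corpus-estimated dynamics (a sketch).

    p_trans[(s, a)] maps next_state -> probability, estimated from corpus
    frequencies; reward(s, a) is the expected reward. Returns a greedy
    policy as a dict state -> action.
    """
    def backup(s, a, v):
        return reward(s, a) + gamma * sum(
            p * v[s2] for s2, p in p_trans.get((s, a), {}).items())

    v = {s: 0.0 for s in states}
    delta = tol + 1
    while delta > tol:
        delta = 0.0
        for s in states:
            best = max(backup(s, a, v) for a in actions)
            delta = max(delta, abs(best - v[s]))
            v[s] = best
    # read off the greedy policy with respect to the converged values
    return {s: max(actions, key=lambda a: backup(s, a, v)) for s in states}
```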

Monte Carlo Optimization
- The policy is optimized online, which requires users to interact with the system
- The current estimate of the policy is used to select actions
- ε-greedy exploration: less probable actions get less exploration time
- The policy is updated after each dialog according to the sequence of states, actions and rewards
- Cannot be trained from a corpus, so user simulation is used
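
A sketch of the after-each-dialog update: an every-visit Monte Carlo estimate of action values from one finished episode. The table names are illustrative; q and counts would be defaultdict(float) and defaultdict(int):

```python
def monte_carlo_update(q, counts, episode, gamma=1.0):
    """Update action values from one finished dialog (a sketch).

    episode is the list of (state, action, reward) tuples collected while
    acting epsilon-greedily; q and counts are running tables of mean returns.
    """
    g = 0.0
    # walk the dialog backwards, accumulating the return G
    for state, action, reward in reversed(episode):
        g = reward + gamma * g
        counts[(state, action)] += 1
        # incremental mean of the returns observed for this pair
        q[(state, action)] += (g - q[(state, action)]) / counts[(state, action)]
```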

User Simulation
- A user simulator interacts directly with the dialog system
- Used for development, training parameters and evaluation
- Error model for simulating ASR-related errors
  - A set of confusion networks rather than just binary errors, to simulate the real behavior of an ASR system
- Simulators are biased towards certain behavior, so the system may not work well in real life
  - Train and test the policy on different simulators
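
A toy version of the error-model idea: instead of a binary right/wrong flip, sample a confusable hypothesis from a weighted confusion table (a crude stand-in for a confusion network); all names are illustrative:

```python
import random

def corrupt(user_act, confusion, p_error=0.2):
    """Corrupt a simulated user act the way an ASR front end might (a sketch).

    confusion maps an act to a list of (alternative, weight) pairs, so errors
    are drawn from a distribution over plausible confusions rather than
    flipped at random.
    """
    if random.random() >= p_error or user_act not in confusion:
        return user_act
    alts, weights = zip(*confusion[user_act])
    return random.choices(alts, weights=weights, k=1)[0]
```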

User Simulation - Methods
- N-gram models
  - To model context and obtain consistent behavior, N has to be too large
- Dynamic Bayesian Networks and HMMs
  - Trained on data; can model conditional dependencies
  - Data sparsity problem with the joint probabilities
- Inverse Reinforcement Learning
  - Learns the user's reward function from real human-human conversations
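
A minimal N-gram simulator with N = 2, conditioning the user act on the last system act only; it shows concretely why consistent behavior would need a much larger N. The corpus format is illustrative:

```python
from collections import defaultdict
import random

class BigramUserSimulator:
    """N-gram (N=2) user simulator (a sketch of the first method above).

    Estimates p(user_act | system_act) from a corpus of (system_act, user_act)
    pairs and samples responses from it. With N=2 the context is a single
    system act, so longer-range consistency is lost.
    """
    def __init__(self, corpus_pairs):
        self.counts = defaultdict(lambda: defaultdict(int))
        for sys_act, user_act in corpus_pairs:
            self.counts[sys_act][user_act] += 1

    def respond(self, sys_act):
        options = self.counts.get(sys_act)
        if not options:
            return "null"  # unseen context; a real simulator would back off
        acts, weights = zip(*options.items())
        return random.choices(acts, weights=weights, k=1)[0]
```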

Fast Learning and User Adaptation
- Speeds up optimization so that the policy can be trained on, or adapted to, real users
- Learns via interaction with the user
- Gaussian Process based reinforcement learning
  - A non-parametric model of Bayesian inference with a specified mean and kernel function
  - The variance measures uncertainty: when the system is uncertain, a non-policy action can be selected to explore for the optimal policy
  - Can work in the master as well as the summary space, with no need for handcrafted features
  - Can convey the system's confidence level to the user
- GP-SARSA algorithm
  - Optimizes faster than standard RL algorithms
  - Suitable for real-world problems with real people
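
A bare-bones illustration of the GP machinery involved (not GP-SARSA itself, which is described in Gasic and Young 2014): exact GP regression of Q over (belief, action) feature vectors with a squared-exponential kernel, returning a mean and a variance that can drive exploration. All hyperparameters are illustrative:

```python
import numpy as np

def gp_q_posterior(X_train, y_train, x_star, lengthscale=1.0, noise=0.1):
    """Posterior mean and variance of Q at a new point (a sketch, not GP-SARSA).

    X_train: (n, d) array of observed (belief, action) feature vectors;
    y_train: (n,) array of observed returns; x_star: (d,) query point.
    """
    def kernel(a, b):
        # squared-exponential kernel with unit signal variance
        d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
        return np.exp(-0.5 * (d / lengthscale) ** 2)

    K = kernel(X_train, X_train) + noise ** 2 * np.eye(len(X_train))
    k_star = kernel(X_train, x_star[None, :])[:, 0]
    alpha = np.linalg.solve(K, y_train)
    mean = k_star @ alpha
    var = 1.0 - k_star @ np.linalg.solve(K, k_star)
    # a high variance signals uncertainty: a good moment to explore
    return mean, max(var, 0.0)
```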

Evaluation
- Goal: user satisfaction; difficult to measure because it requires interaction
- Modules can be evaluated separately
- Real users with real needs: testing with artificial goals causes bias
- PARADISE framework: a weighted sum of task success and dialog length (a performance function)
- User simulation: simple, efficient, wide coverage of dialogs and scenarios, includes an ASR model
- Model vs. real user comparison within the PARADISE framework
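
A sketch of a PARADISE-style performance function; in Walker et al. (1997) the terms are normalized and the weights come from a regression against user satisfaction, both of which are omitted here:

```python
def performance(success, costs, alpha=1.0, weights=None):
    """Weighted task success minus weighted dialog costs (a sketch).

    costs maps cost names (e.g. number of turns) to values; the default
    weights and the example call below are illustrative, not from the paper.
    """
    weights = weights or {}
    return alpha * success - sum(
        weights.get(name, 1.0) * value for name, value in costs.items())

# e.g. performance(success=1.0, costs={"turns": 12, "asr_errors": 2},
#                  weights={"turns": 0.05, "asr_errors": 0.3})
```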

Summary
- Statistical dialog managers are faster to create, and the resulting systems are as good as handcrafted ones
- Many methods have been researched
- User simulation vs. fast learning with real users
- Evaluation remains challenging

Homework
Select a component of the system and a method you find interesting, and write a short summary about it.

References
Young, Steve, et al. "POMDP-based statistical spoken dialog systems: A review." Proceedings of the IEEE 101.5 (2013): 1160-1179.
Schatzmann, Jost, et al. "A survey of statistical user simulation techniques for reinforcement-learning of dialogue management strategies." The Knowledge Engineering Review 21.2 (2006): 97-126.
Walker, Marilyn A., et al. "PARADISE: A framework for evaluating spoken dialogue agents." Proceedings of the Eighth Conference of the European Chapter of the Association for Computational Linguistics (1997).
Gasic, Milica, and Steve Young. "Gaussian processes for POMDP-based dialogue manager optimization." IEEE/ACM Transactions on Audio, Speech, and Language Processing 22.1 (2014): 28-40.