Partial Observability: Partially Observable MDPs (POMDPs)


Based on Cassandra, Kaelbling, & Littman, 12th AAAI, 1994.

Objectives of this lecture:
• Introduction to POMDPs
• Solving POMDPs
• RL and POMDPs

Start with an MDP <S, A, T, R>, where
• S is a finite state set
• A is a finite action set
• T is the state transition function: T(s, a, s') is the probability that the next state is s', given that action a is taken in state s
• R is the reward function: R(s, a) is the immediate reward for taking action a in state s

Add partial observability:
• O, a finite set of possible observations
• O, an observation function: O(a, s', o) is the probability of observing o after taking action a and arriving in state s'

Complexity: the finite-horizon problem is PSPACE-complete; the infinite-horizon problem is undecidable.

A Little Example
• Two actions: left and right, both deterministic
• If the agent moves into a wall, it stays in its current state
• If it reaches the goal state (the star), it moves randomly to state 0, 1, or 3 and receives reward 1
• The agent can only observe whether or not it is in the goal state

Belief State
• b, the belief state, is a discrete probability distribution over the state set S: b(s) is the probability that the agent is in state s
• After leaving the goal: (1/3, 1/3, 0, 1/3)
• After action right and not observing the goal: (0, 1/2, 0, 1/2)
• After moving right again and still not observing the goal: (0, 0, 0, 1) (this update is sketched in code below)
• In general, some actions in some situations can increase uncertainty while others decrease it; an optimal policy will sometimes take actions purely to gain information
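A minimal Python sketch of these belief updates (not from the slides), assuming the corridor has states 0–3 with the star/goal at state 2 and walls at both ends, and that leaving the goal teleports the agent uniformly to states 0, 1, or 3; the function names are illustrative only:

```python
import numpy as np

# Belief update for the little corridor example (a sketch under the
# assumptions stated above: states 0..3, goal = state 2, walls at both ends).

N_STATES = 4
GOAL = 2

def transition(s, a):
    """Next-state distribution for action a ('left' or 'right') in state s."""
    if s == GOAL:                                   # leaving the goal: teleport
        return {0: 1/3, 1: 1/3, 3: 1/3}
    s_next = s - 1 if a == 'left' else s + 1
    s_next = min(max(s_next, 0), N_STATES - 1)      # bumping into a wall: stay put
    return {s_next: 1.0}

def observation_prob(s_next, saw_goal):
    """P(o | s'): the agent observes the goal iff it is in the goal state."""
    return 1.0 if (s_next == GOAL) == saw_goal else 0.0

def belief_update(b, a, saw_goal):
    """b'(s') is proportional to O(s', o) * sum_s T(s, a, s') b(s)."""
    b_next = np.zeros(N_STATES)
    for s, p in enumerate(b):
        for s_next, p_t in transition(s, a).items():
            b_next[s_next] += p * p_t * observation_prob(s_next, saw_goal)
    return b_next / b_next.sum()

b = np.array([1/3, 1/3, 0.0, 1/3])                  # just after leaving the goal
b = belief_update(b, 'right', saw_goal=False)       # -> (0, 1/2, 0, 1/2)
b = belief_update(b, 'right', saw_goal=False)       # -> (0, 0, 0, 1)
print(b)
```

Run as written, this reproduces the two belief vectors quoted on the slide.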

The Belief MDP
• Belief state estimator (figure; a generic sketch follows below)
• Cassandra et al. say: (quotation shown as a figure)

The Belief MDP, cont.
(figure)

Value Iteration for the Belief MDP
(figure)

Value Function over Belief Space
• From Tony Cassandra's "POMDPs for Dummies": http://www.cs.brown.edu/research/ai/pomdp/tutorial
• 1D belief space for a 2-state POMDP (figure)
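As a hedged sketch of what these slides summarize (not code from the slides): the belief MDP has beliefs as its states, expected immediate reward ρ(b, a) = Σ_s b(s) R(s, a), and the belief state estimator as its transition, which also yields P(o | a, b). The arrays T, O, and R below are placeholders with shapes [S, A, S], [A, S, O], and [S, A].

```python
import numpy as np

# Sketch of the belief-MDP ingredients; T, O, R are placeholder model arrays.

def belief_reward(b, a, R):
    """rho(b, a) = sum_s b(s) R(s, a): expected immediate reward of a belief."""
    return float(b @ R[:, a])

def state_estimator(b, a, o, T, O):
    """Return the next belief b' and P(o | a, b).

    b'(s') is proportional to O(a, s', o) * sum_s b(s) T(s, a, s').
    """
    unnormalized = O[a, :, o] * (b @ T[:, a, :])
    p_obs = unnormalized.sum()
    return unnormalized / p_obs, p_obs
```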

Sample PWLC Value Function
(figure)

Sample PWLC Function and Its Partition of Belief Space
(figure)

Immediate Rewards for Belief States
• a1 has reward 1 in s1 and 0 in s2
• a2 has reward 0 in s1 and 1.5 in s2
• This is, in fact, the horizon-1 value function (see the sketch below)

Value of a Fixed Action and Observation
• Summing these for the best action from b gives the optimal horizon-2 value of taking a1 in b and observing z1
• Note: here T is the earlier …
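Using the reward numbers above, a short sketch (not from the tutorial; it assumes the 2-state belief is written as (b(s1), 1 − b(s1))) shows that the horizon-1 value function is just the upper surface of the two reward lines, i.e. piecewise linear and convex (PWLC):

```python
import numpy as np

# Horizon-1 value for the 2-state example: each action's immediate reward is a
# linear function (an alpha vector) over the belief b = (b(s1), b(s2)).
alpha_a1 = np.array([1.0, 0.0])   # a1: reward 1 in s1, 0 in s2
alpha_a2 = np.array([0.0, 1.5])   # a2: reward 0 in s1, 1.5 in s2

def V1(p_s1):
    """PWLC horizon-1 value: max over actions of b . alpha_a."""
    b = np.array([p_s1, 1.0 - p_s1])
    return max(b @ alpha_a1, b @ alpha_a2)

# Sweeping the 1D belief space: the max of the two lines is piecewise linear
# and convex, with one break point where they cross, at b(s1) = 0.6.
for p in np.linspace(0.0, 1.0, 6):
    print(f"b(s1) = {p:.1f}  V1 = {V1(p):.2f}")
```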

Transformed Value Function
• Immediate reward + S(a1, z1) is the whole value function for action a1 and observation z1 [times P(z1 | a1, b)]
• Do this for each observation given a1
• Doing this for all belief states: (figure; a generic sketch of the transform follows below)

Transformed Value Function for All Observations
(figure)

Partitions for All Observations
• If we start at b and do a1, then the next best action is:
  • a1 if we observe z2 or z3
  • a2 if we observe z1
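A generic sketch of the S(a, z) transform (the standard POMDP value-iteration step, not the tutorial's actual numbers); T, O, and the previous-horizon alpha vectors are placeholders:

```python
import numpy as np

# Each previous-horizon alpha vector is mapped through the model for a fixed
# action a and observation z:
#     g(s) = sum_{s'} T[s, a, s'] * O[a, s', z] * alpha(s')
# so that, for the maximizing alpha, b . g equals P(z | a, b) times the
# previous-horizon value of the updated belief.

def transform_alpha(alpha, a, z, T, O):
    """Transform one alpha vector for the pair (a, z)."""
    n = len(alpha)
    return np.array([
        sum(T[s, a, sp] * O[a, sp, z] * alpha[sp] for sp in range(n))
        for s in range(n)
    ])

def S_az(prev_alphas, a, z, T, O):
    """All transformed vectors for (a, z); their upper surface is S(a, z)."""
    return [transform_alpha(alpha, a, z, T, O) for alpha in prev_alphas]
```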

Partition for Action a1
(figure)

Value Function and Partition for Action a1
• Produced by summing the appropriate S(a1, ·) lines (figure; a generic sketch follows below)

Value Function and Partition for Action a2
(figure)

Combined a1 and a2 Value Functions
(figure)
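"Summing the appropriate S(a1, ·) lines" amounts to a cross-sum over observations: pick one transformed vector per observation and add them to the immediate-reward vector. A hedged sketch, where R_a, S_az_per_obs, and gamma are placeholders rather than the tutorial's data:

```python
from itertools import product
import numpy as np

def action_value_vectors(R_a, S_az_per_obs, gamma=1.0):
    """All candidate alpha vectors for one action a.

    R_a          : immediate-reward vector, R_a[s] = R(s, a)
    S_az_per_obs : list over observations z of the transformed vectors for (a, z)
    """
    vectors = []
    for choice in product(*S_az_per_obs):        # one transformed vector per z
        vectors.append(R_a + gamma * np.sum(choice, axis=0))
    return vectors
```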

Value Function for Horizon 2
(figure)

Value Function for Action a1 and Horizon 3
(figure)

Value Function for Action a2 and Horizon 3
(figure)

Value Functions for Both Actions at Horizon 3
(figure)

Value Function for Horizon 3
(figure)

General Form of POMDP Solution
• Transformed V for a and z (figure)

Adjacent Belief Partitions for the Transformed Value Function
(figure)

Making a New Partition from the S(a, z) Partitions
• How do you do this in general? Not so easy.

Policy Graphs
• When all belief states in one partition are transformed into belief states in the same partition, given an optimal action and resulting observation, the policy can be represented as a finite state machine
• Only the goal state is distinguishable

More Policy Graphs: The Tiger Problem
• Two doors: behind one is a tiger, behind the other a big reward
• You can choose to listen (for a small cost)
• If the tiger is on the left, you hear it on the left with probability 0.85 and on the right with probability 0.15, and symmetrically if the tiger is on the right
• Iterated version: the problem restarts with the tiger and the reward randomly repositioned

RL for POMDPs
• Memoryless policies: treat observations as if they were Markov states
  • Use a non-bootstrapping algorithm to estimate Q(o, a) for observations o, then do policy improvement
  • The resulting policies can be bad; stochastic policies can be better
• QMDP method (sketched in code below):
  • Ignore the observation model and find optimal Q-values for the underlying MDP
  • Extend to belief states like this: Q_a(b) = Σ_s b(s) Q_MDP(s, a)
  • Assumes all uncertainty disappears in one step, so it cannot produce policies that act to gain information
  • But it can work surprisingly well in many cases
• Replicated Q-learning (also sketched below):
  • Use a single vector q_a to approximate the Q-function for each action: Q_a(b) = q_a · b
  • At each step, for every state s:
    Δq_a(s) = α b(s) [ r + γ max_{a'} Q_{a'}(b') − q_a(s) ]
  • Reduces to normal Q-learning if the belief state collapses to the deterministic case
  • Certainly suboptimal, but sometimes works well
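A minimal sketch of the two approximations above; the arrays Q_mdp and q, the step size alpha, and the discount gamma are assumed placeholders rather than values from the slides:

```python
import numpy as np

def qmdp_action(b, Q_mdp):
    """QMDP: Q_a(b) = sum_s b(s) Q_MDP(s, a); act greedily on the belief.

    Q_mdp has shape [S, A] and holds the optimal Q-values of the underlying MDP.
    """
    return int(np.argmax(b @ Q_mdp))

def replicated_q_update(q, b, a, r, b_next, alpha=0.1, gamma=0.95):
    """Replicated Q-learning: one vector per action, Q_a(b) = q[a] . b.

    For every state s:
        delta q_a(s) = alpha * b(s) * (r + gamma * max_a' q[a'] . b' - q_a(s))
    """
    target = r + gamma * max(q_a2 @ b_next for q_a2 in q)
    q[a] += alpha * b * (target - q[a])     # elementwise over states s
    return q
```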

RL for POMDPs
• Smooth Partially Observable Value Approximation (SPOVA), Parr and Russell
  • SPOVA and SPOVA-RL (figures)

RL for POMDPs
• McCallum's U-Tree algorithm, 1996 (figure)

RL for POMDPs
• Linear Q-learning: almost the same as replicated Q-learning
  • Replicated: Δq_a(s) = α b(s) [ r + γ max_{a'} Q_{a'}(b') − q_a(s) ]
  • Linear:     Δq_a(s) = α b(s) [ r + γ max_{a'} Q_{a'}(b') − q_a · b ]
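Continuing the sketch above (same assumed placeholders), the only change for linear Q-learning is the error term, which uses the full estimate q_a · b rather than the single component q_a(s):

```python
def linear_q_update(q, b, a, r, b_next, alpha=0.1, gamma=0.95):
    """Linear Q-learning update; q, b, b_next are numpy arrays as above.

    For every state s:
        delta q_a(s) = alpha * b(s) * (r + gamma * max_a' q[a'] . b' - q[a] . b)
    """
    target = r + gamma * max(q_a2 @ b_next for q_a2 in q)
    q[a] += alpha * b * (target - q[a] @ b)
    return q
```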