Partial Observability


Partial Observability
Objectives of this lecture:
Introduction to POMDPs
Solving POMDPs
RL and POMDPs
R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction

Partially Observable MDPs (POMDPs)
Based on Cassandra, Kaelbling, & Littman, 12th AAAI, 1994.
Start with an MDP <S, A, T, R>, where
S is a finite state set
A is a finite action set
T is the state-transition function: T(s, a, s') is the probability that the next state is s', given that action a is taken in state s
R is the reward function: R(s, a) is the immediate reward for taking a in state s
Add partial observability:
a finite set of possible observations, and
an observation function O: O(a, s', o) is the probability of observing o after taking action a and ending up in state s'
Complexity: finite horizon: PSPACE-complete; infinite horizon: undecidable.
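
In probabilistic notation (a restatement of the definitions above; the time indices are mine):

T(s, a, s') = \Pr(s_{t+1} = s' \mid s_t = s, a_t = a)
O(a, s', o) = \Pr(o_{t+1} = o \mid a_t = a, s_{t+1} = s')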

A Little Example
Two actions, left and right; deterministic transitions.
If the agent moves into a wall, it stays in its current state.
If it reaches the goal state (the star), it moves randomly to state 0, 1, or 3 and receives reward 1.
The agent can only observe whether or not it is in the goal state.
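
One plausible encoding of this example as arrays (a sketch only: the 0-1-2-3 layout with the star at state 2 is inferred from the belief vectors on the next slide, and the placement of the reward and the reset-on-goal dynamics are my reading of the slide text):

import numpy as np

# Four states in a row: 0, 1, 2 (the goal/star), 3.  Two actions: 0 = left, 1 = right.
N_STATES, N_ACTIONS = 4, 2
LEFT, RIGHT = 0, 1
GOAL = 2

# T[s, a, s'] = probability of moving to s' after taking a in s
T = np.zeros((N_STATES, N_ACTIONS, N_STATES))
for s in range(N_STATES):
    for a, step in ((LEFT, -1), (RIGHT, +1)):
        if s == GOAL:
            # leaving the goal relocates the agent uniformly to 0, 1, or 3
            T[s, a, [0, 1, 3]] = 1.0 / 3.0
        else:
            s_next = min(max(s + step, 0), N_STATES - 1)   # walls at both ends
            T[s, a, s_next] = 1.0

# R[s, a] = immediate reward; here, reward 1 for the move that reaches the goal
R = np.zeros((N_STATES, N_ACTIONS))
R[1, RIGHT] = 1.0
R[3, LEFT] = 1.0

# Observations: 0 = "not goal", 1 = "goal".
# Z[a, s', o] = probability of observing o after taking a and landing in s'
Z = np.zeros((N_ACTIONS, N_STATES, 2))
Z[:, :, 0] = 1.0          # every non-goal state looks the same ("not goal")
Z[:, GOAL, 0] = 0.0
Z[:, GOAL, 1] = 1.0       # the goal state is observed with certainty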

Belief State
b: belief state, a discrete probability distribution over the state set S; b(s) is the probability that the agent is in state s.
After the goal: (1/3, 1/3, 0, 1/3)
After action right and not observing the goal: (0, 1/2, 0, 1/2)
After moving right again and still not observing the goal: (0, 0, 0, 1)
But in general, some actions in some situations can increase uncertainty, while others can decrease it. An optimal policy will, in general, sometimes take actions purely to gain information.
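
A minimal sketch of the belief update that produces these numbers (this is the "belief state estimator" of the next slide); T and Z are the arrays from the previous sketch:

import numpy as np

def belief_update(b, a, o, T, Z):
    # Belief state estimator: b'(s') is proportional to
    # Z[a, s', o] * sum_s T[s, a, s'] * b[s]
    b_pred = b @ T[:, a, :]            # predicted next-state distribution
    b_new = Z[a, :, o] * b_pred        # reweight by the observation likelihood
    return b_new / b_new.sum()         # normalise by Pr(o | a, b)

# With T and Z from the sketch above (RIGHT = 1, "not goal" = 0):
#   b0 = np.array([1/3, 1/3, 0, 1/3])
#   b1 = belief_update(b0, 1, 0, T, Z)   # -> [0, 0.5, 0, 0.5]
#   b2 = belief_update(b1, 1, 0, T, Z)   # -> [0, 0, 0, 1]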

The Belief MDP
Belief state estimator
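
In equation form, the state estimator SE(b, a, o) is the standard Bayes update (not shown in the transcription):

b'(s') = SE(b, a, o)(s') = \frac{ O(a, s', o) \sum_{s} T(s, a, s')\, b(s) }{ \Pr(o \mid a, b) },
\qquad
\Pr(o \mid a, b) = \sum_{s'} O(a, s', o) \sum_{s} T(s, a, s')\, b(s)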

Belief MDP, continued
Cassandra et al. say:
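
The quoted passage did not survive the transcription. The standard construction, which I take to be the point being quoted, turns the POMDP into a continuous-state, fully observable MDP over beliefs, with transition and reward functions

\tau(b, a, b') = \sum_{o \,:\, SE(b, a, o) = b'} \Pr(o \mid a, b),
\qquad
\rho(b, a) = \sum_{s} b(s)\, R(s, a).

An optimal policy for this belief MDP, executed against the belief maintained by the state estimator, is optimal for the original POMDP.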

Value Iteration for the Belief MDP
From Tony Cassandra's POMDPs for Dummies, http://www.cs.brown.edu/research/ai/pomdp/tutorial
1D belief space for a 2-state POMDP
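
The figures that follow rest on the fact that every finite-horizon value function of the belief MDP is piecewise linear and convex (PWLC), so it can be represented by a finite set \Gamma_t of "alpha vectors":

V_t(b) = \max_{\alpha \in \Gamma_t} \sum_{s} \alpha(s)\, b(s)

For a two-state POMDP the belief is determined by b(s_1) alone, so V_t can be drawn over the unit interval; each alpha vector is a line, and the partition of belief space records which line attains the maximum where.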

Value function over belief space

Sample PWLC value function

Sample PWLC function and its partition of belief space

Immediate rewards for belief states
a1 has reward 1 in s1 and 0 in s2
a2 has reward 0 in s1 and 1.5 in s2
This is, in fact, the horizon-1 value function.
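
Written out with the numbers above, over beliefs b = (b(s_1), b(s_2)):

V_1(b) = \max\{\, 1 \cdot b(s_1) + 0 \cdot b(s_2),\;\; 0 \cdot b(s_1) + 1.5 \cdot b(s_2) \,\}

so a1 is the better immediate action when b(s_1) > 0.6 and a2 otherwise.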

Value of a fixed action and observation
Summing these for the best action from b gives the optimal horizon-2 value of taking a1 in b and observing z1.
(Note: here T is the state-transition function defined earlier.)
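
A sketch of the backup that these slides build up graphically, using S_{a,z}(b) for the value of the transformed belief (the slides' S(a, z), up to the probability weighting noted on the next slide); the discount \gamma can be taken as 1 for the finite-horizon case:

S_{a,z}(b) = V_1\big( SE(b, a, z) \big)

V_2(b) = \max_{a} \Big[ \sum_{s} b(s)\, R(s, a) \;+\; \gamma \sum_{z} \Pr(z \mid a, b)\, S_{a,z}(b) \Big]

Each term \Pr(z \mid a, b)\, S_{a,z}(b) is itself piecewise linear in b, which is why the whole construction stays within the alpha-vector representation.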

Transformed value function
Doing this for all belief states: immediate reward + S(a1, z1) is the whole value function for action a1 and observation z1 [times P(z1 | a1, b)].

Do this for each observation given a1

Transformed value function for all observations

Partitions for all observations
If we start at b and do a1, then the next best action is:
a1 if we observe z2 or z3
a2 if we observe z1

Partition for action a1

Value function and partition for action a1
Produced by summing the appropriate S(a1, ·) lines

Value function and partition for action a2

Combined a1 and a2 value functions

Value function for horizon 2

Value function for action a1 and horizon 3

Value function for action a2 and horizon 3

Value functions for both actions a1 and a2, horizon 3

Value function for horizon 3

General Form of POMDP Solution
Transformed V for a and z

Adjacent belief partitions for transformed value function

Making a new partition from S(a, z) partitions
How do you do this in general? Not so easy.

Policy Graphs
When all belief states in one partition are transformed, given the optimal action and the resulting observation, into belief states that lie in the same partition, the policy can be represented as a finite-state machine.
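
A policy graph is simply a finite-state controller: each node is labelled with an action, and outgoing edges are labelled with observations. A generic sketch (the class and names are illustrative, not from the slides):

from dataclasses import dataclass
from typing import Dict, Hashable, List

@dataclass
class PolicyGraph:
    # actions[i]: the action executed while the controller is in node i
    actions: List[Hashable]
    # edges[i][o]: the node to move to after observing o in node i
    edges: List[Dict[Hashable, int]]
    node: int = 0

    def act(self) -> Hashable:
        return self.actions[self.node]

    def observe(self, o: Hashable) -> None:
        self.node = self.edges[self.node][o]

# Typical control loop (env is a hypothetical POMDP environment):
#   a = pg.act(); o, r = env.step(a); pg.observe(o)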

More policy graphs
Only the goal state is distinguishable (as in the earlier example).
Tiger Problem:
Two doors: a tiger behind one, a big reward behind the other.
You can choose to listen (for a small cost).
If the tiger is on the left, you hear it on the left with probability 0.85 and on the right with probability 0.15, and symmetrically if the tiger is on the right.
Iterated: the problem restarts with the tiger and the reward randomly repositioned.
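
One way the tiger problem is usually parameterised; only the 0.85/0.15 listening accuracy and the "small cost" of listening come from the slide, while the specific payoffs below (-1 to listen, -100 for the tiger, +10 for the reward) are the values commonly used in the literature:

import numpy as np

# States: 0 = tiger-left, 1 = tiger-right.
# Actions: 0 = listen, 1 = open-left, 2 = open-right.
# Observations: 0 = hear-left, 1 = hear-right.

T = np.zeros((2, 3, 2))
T[:, 0, :] = np.eye(2)            # listening does not move the tiger
T[:, 1, :] = 0.5                  # opening a door restarts the problem
T[:, 2, :] = 0.5                  #   with the tiger repositioned at random

R = np.array([[-1.0, -100.0,   10.0],    # tiger-left:  opening left is bad, right is good
              [-1.0,   10.0, -100.0]])   # tiger-right: the reverse

Z = np.full((3, 2, 2), 0.5)       # after opening a door the observation is uninformative
Z[0, 0] = [0.85, 0.15]            # listening with tiger-left:  hear-left with prob 0.85
Z[0, 1] = [0.15, 0.85]            # listening with tiger-right: hear-right with prob 0.85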

RL for POMDPs
Memoryless policies: treat observations as if they were Markov states.
Use a non-bootstrapping algorithm to estimate Q(o, a) for observations o; then do policy improvement.
Such policies can be bad; stochastic policies can be better.
QMDP method: ignore the observation model and find optimal Q-values for the underlying MDP.
Extend to belief states like this: Q_a(b) = \sum_s b(s)\, Q_{MDP}(s, a)
Assumes all uncertainty disappears in one step, so it cannot produce policies that act to gain information, but it can work surprisingly well in many cases.
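
A sketch of the QMDP rule, assuming Q_mdp is a (num_states x num_actions) array of optimal Q-values for the underlying MDP, computed e.g. by value iteration:

import numpy as np

def qmdp_action(belief, Q_mdp):
    # Q_a(b) = sum_s b(s) * Q_MDP(s, a); act greedily with respect to it.
    # Because Q_MDP assumes full observability from the next step onward,
    # the greedy action never pays a cost purely to gather information.
    q_b = belief @ Q_mdp              # one belief-weighted value per action
    return int(np.argmax(q_b))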

RL for POMDPs
Replicated Q-learning
Use a single vector q_a to approximate the Q-function for each action: Q_a(b) = q_a \cdot b
At each step, for every state s:
\Delta q_a(s) = \alpha \, b(s) \left[ r + \gamma \max_{a'} Q_{a'}(b') - q_a(s) \right]
Reduces to normal Q-learning if the belief state collapses to the deterministic case.
Certainly suboptimal, but sometimes works well.

RL for POMDPs
Smooth Partially Observable Value Approximation (SPOVA) and SPOVA-RL (Parr and Russell)

RL for POMDPs
McCallum's U-Tree algorithm (1996)

RL for POMDPs
Linear Q-learning
Almost the same as replicated Q-learning:
Replicated: \Delta q_a(s) = \alpha \, b(s) \left[ r + \gamma \max_{a'} Q_{a'}(b') - q_a(s) \right]
Linear: \Delta q_a(s) = \alpha \, b(s) \left[ r + \gamma \max_{a'} Q_{a'}(b') - q_a \cdot b \right]
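
A sketch of the two update rules side by side; q is a (num_actions x num_states) array of weight vectors with Q_a(b) = q[a] . b, and alpha, gamma are the step size and discount (my notation):

import numpy as np

def td_target(r, b_next, q, gamma):
    # r + gamma * max_a' Q_a'(b'), where Q_a'(b') = q[a'] . b'
    return r + gamma * (q @ b_next).max()

def replicated_q_update(q, a, b, r, b_next, alpha, gamma):
    # Replicated Q-learning: each component q_a(s) is pulled toward the same scalar target.
    target = td_target(r, b_next, q, gamma)
    q[a] += alpha * b * (target - q[a])

def linear_q_update(q, a, b, r, b_next, alpha, gamma):
    # Linear Q-learning: ordinary gradient descent, so the error uses the full prediction q_a . b.
    target = td_target(r, b_next, q, gamma)
    q[a] += alpha * b * (target - q[a] @ b)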