Partially observable Markov decision processes


Partially observable Markov decision processes. Matthijs Spaan, Institute for Systems and Robotics, Instituto Superior Técnico, Lisbon, Portugal. Reading group meeting, February 12, 2007. 1/22

Overview Partially observable Markov decision processes: Model. Belief states. MDP-based algorithms. Other sub-optimal algorithms. Optimal algorithms. Application to robotics. 2/22

A planning problem. Task: start at a random position, pick up mail at P, and deliver the mail at D. Characteristics: motion noise, perceptual aliasing. 3/22

Planning under uncertainty. Uncertainty is abundant in real-world planning domains. Bayesian approach: use probabilistic models. Common approach in robotics, e.g., robot localization. 4/22

Partially observable Markov decision processes (POMDPs) (Kaelbling et al., 1998): Framework for agent planning under uncertainty. Typically assumes discrete sets of states S, actions A and observations O. Transition model p(s'|s,a): models the effect of actions. Observation model p(o|s,a): relates observations to states. Task is defined by a reward model r(s,a). Goal is to compute a plan, or policy π, that maximizes long-term reward. 5/22
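To make the model concrete, here is a minimal sketch (not from the slides) of a discrete POMDP held in NumPy arrays; the class name, the array layout, and the tiny two-state numbers are illustrative assumptions.

```python
import numpy as np

# A minimal container for a discrete POMDP (S, A, O, T, Z, R, gamma).
# All names and the two-state example numbers are illustrative only.
class POMDP:
    def __init__(self, T, Z, R, gamma):
        self.T = T          # T[a, s, s'] = p(s' | s, a)
        self.Z = Z          # Z[a, s', o] = p(o | s', a)
        self.R = R          # R[s, a]     = immediate reward r(s, a)
        self.gamma = gamma  # discount factor in [0, 1)
        self.nA, self.nS, _ = T.shape
        self.nO = Z.shape[-1]

# Example: 2 states, 2 actions, 2 observations with noisy sensing.
T = np.array([[[0.9, 0.1], [0.1, 0.9]],      # action 0
              [[0.5, 0.5], [0.5, 0.5]]])     # action 1
Z = np.array([[[0.8, 0.2], [0.2, 0.8]],      # action 0
              [[0.5, 0.5], [0.5, 0.5]]])     # action 1
R = np.array([[1.0, 0.0],
              [0.0, 1.0]])                   # r(s, a)
model = POMDP(T, Z, R, gamma=0.95)
```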

POMDP applications Robot navigation (Simmons and Koenig, 1995; Theocharous and Mahadevan, 2002). Visual tracking (Darrell and Pentland, 1996). Dialogue management (Roy et al., 2000). Robot-assisted health care (Pineau et al., 2003b; Boger et al., 2005). Machine maintenance (Smallwood and Sondik, 1973), structural inspection (Ellis et al., 1995). Inventory control (Treharne and Sox, 2002), dynamic pricing strategies (Aviv and Pazgal, 2005), marketing campaigns (Rusmevichientong and Van Roy, 2001). Medical applications (Hauskrecht and Fraser, 2000; Hu et al., 1996). 6/22

Transition model. For instance, robot motion is inaccurate: transitions between states are stochastic. p(s'|s,a) is the probability of jumping from state s to state s' after taking action a. 7/22

Imperfect sensors. Partially observable environment: sensors are noisy, and sensors have a limited view. Observation model p(o|s,a) is the probability that the agent receives observation o in state s after taking action a. 8/22

Memory. A POMDP example that requires memory (Singh et al., 1994). (Figure: two states s_1 and s_2 that produce the same observation; in each state one action yields reward +r and the other -r.) Value obtained by each class of policy: MDP policy: V_max = r + γ r/(1-γ). Memoryless deterministic POMDP policy: V = -r/(1-γ). Memoryless stochastic POMDP policy: V = 0. Memory-based POMDP policy: V_min = -r + γ r/(1-γ). 9/22
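As a quick numeric check of the values above, a sketch with illustrative r = 1 and γ = 0.95 (the numbers are not from the slides):

```python
r, gamma = 1.0, 0.95  # illustrative values

v_mdp        = r + gamma * r / (1 - gamma)   # fully observable: +r every step
v_det        = -r / (1 - gamma)              # memoryless deterministic, worst case
v_stochastic = 0.0                           # uniform random action: rewards cancel
v_memory     = -r + gamma * r / (1 - gamma)  # at most one wrong step, then optimal

print(v_mdp, v_det, v_stochastic, v_memory)  # 20.0 -20.0 0.0 18.0
```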

Beliefs: The agent maintains a belief b(s) of being at state s. After action a ∈ A and observation o ∈ O the belief b(s) can be updated using Bayes' rule: b'(s') ∝ p(o|s') Σ_s p(s'|s,a) b(s). The belief vector is a Markov signal for the planning task. 10/22
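A minimal sketch of this Bayes update, assuming the array layout of the POMDP container sketched above (T[a, s, s'] and Z[a, s', o] are illustrative names):

```python
import numpy as np

def belief_update(b, a, o, T, Z):
    """Bayes update: b'(s') is proportional to p(o|s',a) * sum_s p(s'|s,a) b(s)."""
    predicted = T[a].T @ b              # sum_s p(s'|s,a) b(s), shape (|S|,)
    unnormalized = Z[a, :, o] * predicted
    norm = unnormalized.sum()           # p(o | b, a); zero means o is impossible
    return unnormalized / norm if norm > 0 else unnormalized

# Usage with the two-state model sketched earlier:
# b = belief_update(np.array([0.5, 0.5]), a=0, o=1, T=T, Z=Z)
```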

Belief update example. (Figure: a corridor environment showing the robot's true position and its belief over states after successive actions and observations.) Observations: door or corridor, 10% noise. Action: moves 3 (20%), 4 (60%), or 5 (20%) states. 11/22


Solving POMDPs. A solution to a POMDP is a policy, i.e., a mapping a = π(b) from beliefs to actions. An optimal policy is characterized by a value function that maximizes V^π(b_0) = E[ Σ_{t=0}^∞ γ^t r(b_t, π(b_t)) ]. Computing the optimal value function is a hard problem (PSPACE-complete for finite horizon). In robotics: a policy is often computed using simple MDP-based approximations. 12/22
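For intuition, V^π(b_0) can be estimated by simulation. A hedged sketch, not from the slides, reusing the POMDP container and belief_update above; the horizon, episode count, and the convention of rewarding the true state are illustrative:

```python
import numpy as np

def simulate_return(model, policy, b0, horizon=200, episodes=1000, rng=None):
    """Monte Carlo estimate of V^pi(b0) = E[sum_t gamma^t r(s_t, pi(b_t))].

    `policy` maps a belief vector to an action index (illustrative interface)."""
    if rng is None:
        rng = np.random.default_rng(0)
    total = 0.0
    for _ in range(episodes):
        s = rng.choice(model.nS, p=b0)          # sample a true start state
        b = b0.copy()
        ret, discount = 0.0, 1.0
        for _ in range(horizon):
            a = policy(b)
            ret += discount * model.R[s, a]
            s_next = rng.choice(model.nS, p=model.T[a, s])
            o = rng.choice(model.nO, p=model.Z[a, s_next])
            b = belief_update(b, a, o, model.T, model.Z)
            s, discount = s_next, discount * model.gamma
        total += ret
    return total / episodes
```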

MDP-based algorithms. Use the solution to the MDP as a heuristic. Most likely state (Cassandra et al., 1996): π_MLS(b) = π*(arg max_s b(s)). Q_MDP (Littman et al., 1995): π_QMDP(b) = arg max_a Σ_s b(s) Q*(s,a). (Figure: example domain from Parr and Russell, 1995.) 13/22
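Both heuristics only need the Q-values of the underlying fully observable MDP. A small sketch, assuming the array layout used above (function names are illustrative):

```python
import numpy as np

def mdp_q_values(T, R, gamma, iters=500):
    """Value iteration on the underlying MDP: returns Q*(s, a)."""
    nA, nS, _ = T.shape
    V = np.zeros(nS)
    for _ in range(iters):
        Q = R + gamma * np.einsum('ast,t->sa', T, V)   # Q[s, a]
        V = Q.max(axis=1)
    return Q

def pi_mls(b, Q):
    """Most likely state heuristic: act as the MDP policy in argmax_s b(s)."""
    return int(np.argmax(Q[np.argmax(b)]))

def pi_qmdp(b, Q):
    """Q_MDP heuristic: argmax_a sum_s b(s) Q*(s, a)."""
    return int(np.argmax(b @ Q))
```

These plug directly into the Monte Carlo evaluation sketched earlier, e.g. policy = lambda b: pi_qmdp(b, Q).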

Other sub-optimal techniques Grid-based approximations (Drake, 1962; Lovejoy, 1991; Brafman, 1997; Zhou and Hansen, 2001; Bonet, 2002). Optimizing finite-state controllers (Platzman, 1981; Hansen, 1998b; Poupart and Boutilier, 2004). Gradient ascent (Ng and Jordan, 2000; Aberdeen and Baxter, 2002). Heuristic search in the belief tree (Satia and Lave, 1973; Hansen, 1998a; Smith and Simmons, 2004). Compressing the POMDP (Roy et al., 2005; Poupart and Boutilier, 2003). Point-based techniques (Pineau et al., 2003a; Spaan and Vlassis, 2005). 14/22

Optimal value functions. The optimal value function of a (finite horizon) POMDP is piecewise linear and convex: V(b) = max_α α · b. (Figure: a value function over the belief simplex between (1,0) and (0,1), drawn as the upper surface of the vectors α_1, α_2, α_3, α_4.) 15/22
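Evaluating such a piecewise-linear convex value function from a set of α-vectors is a one-liner; a sketch (names illustrative):

```python
import numpy as np

def value(b, alphas):
    """V(b) = max_alpha alpha . b for an (n, |S|) array of alpha-vectors.

    Returns the value and the index of the maximizing vector (which also
    identifies the action associated with that vector)."""
    products = alphas @ b
    best = int(np.argmax(products))
    return products[best], best
```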

Exact value iteration. Value iteration computes a sequence of value function estimates: V_1, V_2, ..., V_n. (Figure: successive estimates V_1, V_2, V_3 over the belief simplex between (1,0) and (0,1).) 16/22

Optimal POMDP methods. Enumerate and prune: Most straightforward: Monahan's (1982) enumeration algorithm. Generates a maximum of |A| |V_n|^|O| vectors at each iteration, hence requires pruning. Incremental pruning (Zhang and Liu, 1996; Cassandra et al., 1997). Search for witness points: One Pass (Sondik, 1971; Smallwood and Sondik, 1973). Relaxed Region, Linear Support (Cheng, 1988). Witness (Cassandra et al., 1994). 17/22
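A sketch of one enumeration-style exact backup in the spirit of Monahan's algorithm, generating all |A| |V_n|^|O| candidate vectors before pruning. It assumes the array layout of the earlier sketches and is not an optimized implementation:

```python
import numpy as np
from itertools import product

def exact_backup(alphas, model):
    """One exact DP backup: enumerate all |A| * |V_n|^|O| candidate vectors.

    alphas: (n, |S|) array representing V_n. Returns the unpruned V_{n+1}."""
    new_vectors = []
    for a in range(model.nA):
        # g[o][i][s] = gamma * sum_{s'} p(o|s',a) p(s'|s,a) alpha_i(s')
        g = [[model.gamma * (model.T[a] * model.Z[a, :, o]) @ alpha
              for alpha in alphas] for o in range(model.nO)]
        # Cross-sum: pick one back-projected vector per observation, add reward.
        for choice in product(range(len(alphas)), repeat=model.nO):
            vec = model.R[:, a] + sum(g[o][choice[o]] for o in range(model.nO))
            new_vectors.append(vec)
    return np.array(new_vectors)  # dominated vectors should be pruned afterwards
```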

Vector pruning. (Figure: vectors α_1, ..., α_5 over the belief simplex between (1,0) and (0,1), with beliefs b_1 and b_2 marked.) Linear program for pruning a vector α: variables: b(s) for all s ∈ S, and x; maximize: x; subject to: b · (α - α') ≥ x for all α' ∈ V, α' ≠ α, and b ∈ Δ(S). 18/22
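A sketch of this pruning LP using scipy.optimize.linprog; the variable ordering and the function name are illustrative. A non-positive margin means the vector is dominated and can be pruned:

```python
import numpy as np
from scipy.optimize import linprog

def witness_belief(alpha, others):
    """Find a belief at which `alpha` beats every vector in `others` (if any).

    Variables are b(s) for s in S plus the margin x; maximize x subject to
    b . (alpha - alpha') >= x for all alpha' != alpha and b in the simplex.
    Returns (belief, margin)."""
    nS = len(alpha)
    others = np.asarray(others)                      # must be non-empty, (m, |S|)
    c = np.zeros(nS + 1)
    c[-1] = -1.0                                     # maximize x == minimize -x
    A_ub = np.hstack([-(alpha - others),             # -(alpha - alpha') . b + x <= 0
                      np.ones((len(others), 1))])
    b_ub = np.zeros(len(others))
    A_eq = np.hstack([np.ones((1, nS)), np.zeros((1, 1))])   # sum_s b(s) = 1
    b_eq = np.array([1.0])
    bounds = [(0, 1)] * nS + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:nS], -res.fun
```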

High-dimensional sensor readings. Omnidirectional camera images. (Figure: example images.) Dimension reduction: Collect a database of images and record their location. Apply Principal Component Analysis to the image data. Project each image onto the first 3 eigenvectors, resulting in a 3D feature vector for each image. 19/22
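A sketch of this dimension-reduction step with scikit-learn; the image database, its shape, and the variable names are placeholders:

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder database: one flattened omnidirectional image per row.
images = np.random.rand(2000, 64 * 256)     # e.g. 2000 images of 64x256 pixels

pca = PCA(n_components=3)                   # keep the first 3 eigenvectors
features = pca.fit_transform(images)        # one 3-D feature vector per image
print(features.shape)                       # (2000, 3)
```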

Observation model p(o|s). We cluster the feature vectors into 10 prototype observations. We compute a discrete observation model p(o|s,a) by a histogram operation. 20/22
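A sketch of the clustering and histogram steps with scikit-learn, reusing the PCA features from the previous sketch; the per-image locations below are placeholders for the recorded positions in the image database:

```python
import numpy as np
from sklearn.cluster import KMeans

n_obs, n_states = 10, 500
kmeans = KMeans(n_clusters=n_obs, n_init=10).fit(features)
obs_labels = kmeans.labels_                      # prototype observation per image

# Placeholder for the (discretized) location recorded with each image.
locations = np.random.randint(n_states, size=len(features))

counts = np.zeros((n_states, n_obs))
np.add.at(counts, (locations, obs_labels), 1)    # histogram of (s, o) pairs
p_o_given_s = (counts + 1e-9) / (counts + 1e-9).sum(axis=1, keepdims=True)
```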

States, actions and rewards. (Figure: grid map with pickup location P and delivery location D.) State: s = (x, j) with x the robot's location and j the mail bit. Grid X into 500 locations. Actions: four movement actions, pickup, and deliver. Positive reward: only upon successful mail delivery. 21/22
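A small sketch of how this state and reward structure could be encoded; the pickup and delivery locations and the reward magnitude are invented for illustration:

```python
# 500 grid locations x a binary "carrying mail" bit gives 1000 states;
# reward is given only on a successful delivery. Constants are assumptions.
N_LOCATIONS, PICKUP_LOC, DELIVERY_LOC = 500, 17, 342

def state_index(x, has_mail):
    """Flatten s = (x, j) into a single integer index."""
    return x * 2 + int(has_mail)

def reward(x, has_mail, action):
    """Positive reward only when delivering mail at the delivery location."""
    return 10.0 if (action == "deliver" and has_mail and x == DELIVERY_LOC) else 0.0
```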

References
D. Aberdeen and J. Baxter. Scaling internal-state policy-gradient methods for POMDPs. In International Conference on Machine Learning, 2002.
Y. Aviv and A. Pazgal. A partially observed Markov decision process for dynamic pricing. Management Science, 51(9):1400-1416, 2005.
J. Boger, P. Poupart, J. Hoey, C. Boutilier, G. Fernie, and A. Mihailidis. A decision-theoretic approach to task assistance for persons with dementia. In Proc. Int. Joint Conf. on Artificial Intelligence, 2005.
B. Bonet. An epsilon-optimal grid-based algorithm for partially observable Markov decision processes. In International Conference on Machine Learning, 2002.
R. I. Brafman. A heuristic variable grid solution method for POMDPs. In Proc. of the National Conference on Artificial Intelligence, 1997.
A. R. Cassandra, L. P. Kaelbling, and M. L. Littman. Acting optimally in partially observable stochastic domains. In Proc. of the National Conference on Artificial Intelligence, 1994.
A. R. Cassandra, L. P. Kaelbling, and J. A. Kurien. Acting under uncertainty: Discrete Bayesian models for mobile robot navigation. In Proc. of International Conference on Intelligent Robots and Systems, 1996.
A. R. Cassandra, M. L. Littman, and N. L. Zhang. Incremental pruning: A simple, fast, exact method for partially observable Markov decision processes. In Proc. of Uncertainty in Artificial Intelligence, 1997.
H. T. Cheng. Algorithms for partially observable Markov decision processes. PhD thesis, University of British Columbia, 1988.
T. Darrell and A. Pentland. Active gesture recognition using partially observable Markov decision processes. In Proc. of the 13th Int. Conf. on Pattern Recognition, 1996.
A. W. Drake. Observation of a Markov process through a noisy channel. Sc.D. thesis, Massachusetts Institute of Technology, 1962.
J. H. Ellis, M. Jiang, and R. Corotis. Inspection, maintenance, and repair with partial observability. Journal of Infrastructure Systems, 1(2):92-99, 1995.
E. A. Hansen. Finite-memory control of partially observable systems. PhD thesis, University of Massachusetts, Amherst, 1998a.
E. A. Hansen. Solving POMDPs by searching in policy space. In Proc. of Uncertainty in Artificial Intelligence, 1998b.
M. Hauskrecht and H. Fraser. Planning treatment of ischemic heart disease with partially observable Markov decision processes. Artificial Intelligence in Medicine, 18:221-244, 2000.
C. Hu, W. S. Lovejoy, and S. L. Shafer. Comparison of some suboptimal control policies in medical drug therapy. Operations Research, 44(5):696-709, 1996.
L. P. Kaelbling, M. L. Littman, and A. R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101:99-134, 1998.
M. L. Littman, A. R. Cassandra, and L. P. Kaelbling. Learning policies for partially observable environments: Scaling up. In International Conference on Machine Learning, 1995.
W. S. Lovejoy. Computationally feasible bounds for partially observed Markov decision processes. Operations Research, 39(1):162-175, 1991.
G. E. Monahan. A survey of partially observable Markov decision processes: theory, models and algorithms. Management Science, 28(1), Jan. 1982.
A. Y. Ng and M. Jordan. PEGASUS: A policy search method for large MDPs and POMDPs. In Proc. of Uncertainty in Artificial Intelligence, 2000.
R. Parr and S. Russell. Approximating optimal policies for partially observable stochastic domains. In Proc. Int. Joint Conf. on Artificial Intelligence, 1995.
J. Pineau, G. Gordon, and S. Thrun. Point-based value iteration: An anytime algorithm for POMDPs. In Proc. Int. Joint Conf. on Artificial Intelligence, 2003a.
J. Pineau, M. Montemerlo, M. Pollack, N. Roy, and S. Thrun. Towards robotic assistants in nursing homes: Challenges and results. Robotics and Autonomous Systems, 42(3-4):271-281, 2003b.
L. K. Platzman. A feasible computational approach to infinite-horizon partially-observed Markov decision problems. Technical Report J-81-2, School of Industrial and Systems Engineering, Georgia Institute of Technology, 1981. Reprinted in working notes of the AAAI 1998 Fall Symposium on Planning with POMDPs.
P. Poupart and C. Boutilier. Value-directed compression of POMDPs. In Advances in Neural Information Processing Systems 15. MIT Press, 2003.
P. Poupart and C. Boutilier. Bounded finite state controllers. In Advances in Neural Information Processing Systems 16. MIT Press, 2004.
N. Roy, J. Pineau, and S. Thrun. Spoken dialog management for robots. In Proc. of the Association for Computational Linguistics, 2000.
N. Roy, G. Gordon, and S. Thrun. Finding approximate POMDP solutions through belief compression. Journal of Artificial Intelligence Research, 23:1-40, 2005.
P. Rusmevichientong and B. Van Roy. A tractable POMDP for a class of sequencing problems. In Proc. of Uncertainty in Artificial Intelligence, 2001.
J. K. Satia and R. E. Lave. Markovian decision processes with probabilistic observation of states. Management Science, 20(1), 1973.
R. Simmons and S. Koenig. Probabilistic robot navigation in partially observable environments. In Proc. Int. Joint Conf. on Artificial Intelligence, 1995.
S. Singh, T. Jaakkola, and M. Jordan. Learning without state-estimation in partially observable Markovian decision processes. In International Conference on Machine Learning, 1994.
R. D. Smallwood and E. J. Sondik. The optimal control of partially observable Markov decision processes over a finite horizon. Operations Research, 21:1071-1088, 1973.
T. Smith and R. Simmons. Heuristic search value iteration for POMDPs. In Proc. of Uncertainty in Artificial Intelligence, 2004.
E. J. Sondik. The optimal control of partially observable Markov processes. PhD thesis, Stanford University, 1971.
M. T. J. Spaan and N. Vlassis. Perseus: Randomized point-based value iteration for POMDPs. Journal of Artificial Intelligence Research, 24:195-220, 2005.
G. Theocharous and S. Mahadevan. Approximate planning with hierarchical partially observable Markov decision processes for robot navigation. In Proceedings of the IEEE International Conference on Robotics and Automation, 2002.
J. T. Treharne and C. R. Sox. Adaptive inventory control for nonstationary demand and partial information. Management Science, 48(5):607-624, 2002.
N. L. Zhang and W. Liu. Planning in stochastic domains: problem characteristics and approximations. Technical Report HKUST-CS96-31, Department of Computer Science, The Hong Kong University of Science and Technology, 1996.
R. Zhou and E. A. Hansen. An improved grid-based approximation algorithm for POMDPs. In Proc. Int. Joint Conf. on Artificial Intelligence, 2001.
22/22