Multiagent models for partially observable environments

Multiagent models for partially observable environments. Matthijs Spaan, Institute for Systems and Robotics, Instituto Superior Técnico, Lisbon, Portugal. Reading group meeting, March 26, 2007.

Overview. Multiagent models for partially observable environments: non-communicative models, communicative models, game-theoretic models, and some algorithms. Talk based on the survey by Frans Oliehoek (2006).

The Dec-Tiger problem. A toy problem: decentralized tiger (Nair et al., 2003). Two agents, two doors. Opening the correct door: both receive treasure. Opening the wrong door: both get attacked by a tiger. Agents can open a door, or listen. Two noisy observations: hear the tiger left or right. Agents don't know the other agent's actions or observations.
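To make the model concrete, below is a minimal Python sketch of the Dec-Tiger dynamics described above. The structure (two states, listen/open actions, noisy observations, shared reward) follows the slide; the reward numbers and the 0.85 hearing accuracy are illustrative placeholders, not the exact figures from Nair et al. (2003).

```python
# Minimal Dec-Tiger sketch (two agents, two doors). Numeric values are illustrative.
import random

STATES = ["tiger-left", "tiger-right"]
ACTIONS = ["listen", "open-left", "open-right"]
OBSERVATIONS = ["hear-left", "hear-right"]

def transition(state, joint_action):
    """If any agent opens a door the problem resets uniformly; listening keeps the state."""
    if any(a != "listen" for a in joint_action):
        return random.choice(STATES)
    return state

def observe(state, joint_action, accuracy=0.85):
    """Each agent independently hears the tiger's true side with the given accuracy."""
    def single_obs():
        correct = "hear-left" if state == "tiger-left" else "hear-right"
        wrong = "hear-right" if correct == "hear-left" else "hear-left"
        return correct if random.random() < accuracy else wrong
    return tuple(single_obs() for _ in joint_action)

def reward(state, joint_action):
    """Shared reward: treasure is behind the door opposite the tiger (illustrative values)."""
    if all(a == "listen" for a in joint_action):
        return -2.0
    good_door = "open-right" if state == "tiger-left" else "open-left"
    opened = [a for a in joint_action if a != "listen"]
    return sum(20.0 if a == good_door else -50.0 for a in opened)
```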

Multiagent planning frameworks. Aspects: communication; on-line vs. off-line; centralized vs. distributed; cooperative vs. self-interested; observability; factored reward.

Partially observable stochastic games. Partially observable stochastic games (POSGs) (Hansen et al., 2004): an extension of stochastic games (Shapley, 1953), hence self-interested. Agents do not observe each other's observations or actions.

POSGs: definition. A set I = {1,...,n} of n agents. A_i is the set of actions for agent i. O_i is the set of observations for agent i. Transition model p(s' | s, ā), where ā ∈ A_1 × ... × A_n. Observation model p(ō | s, ā), where ō ∈ O_1 × ... × O_n. Reward function R_i : S × A_1 × ... × A_n → R. Each agent maximizes E[ Σ_{t=0}^{h} γ^t R_i^t ]. Policy π = {π_1,...,π_n}, with π_i : ⋃_{t≥1} (A_i × O_i)^t → A_i, i.e., each π_i maps the agent's own action-observation history to an action.
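As a reading aid, the tuple above can be collected into a small container. This is a sketch in Python with names of my own choosing, not notation from the slides or the survey.

```python
# Sketch of the POSG tuple <I, S, {A_i}, {O_i}, p, {R_i}> and the objective of agent i.
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple

State = str
JointAction = Tuple[str, ...]
JointObservation = Tuple[str, ...]

@dataclass
class POSG:
    agents: List[int]                        # I = {1, ..., n}
    states: List[State]                      # S
    actions: List[List[str]]                 # A_i for each agent i
    observations: List[List[str]]            # O_i for each agent i
    transition: Callable[[State, JointAction], Dict[State, float]]               # p(s' | s, a)
    observation: Callable[[State, JointAction], Dict[JointObservation, float]]   # p(o | s, a)
    rewards: List[Callable[[State, JointAction], float]]   # R_i : S x A_1 x ... x A_n -> R
    gamma: float = 0.95

def discounted_return(reward_sequence: Sequence[float], gamma: float) -> float:
    """sum_t gamma^t R_i^t for one sampled reward sequence (a Monte-Carlo estimate of E[.])."""
    return sum((gamma ** t) * r for t, r in enumerate(reward_sequence))
```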

Decentralized POMDPs. Decentralized partially observable Markov decision processes (Dec-POMDPs) (Bernstein et al., 2002): the cooperative version of POSGs. Only one reward, i.e., the reward functions are identical for all agents: R : S × A_1 × ... × A_n → R. Dec-MDPs: jointly observable Dec-POMDPs, where the joint observation ō = {o_1,...,o_n} identifies the state, but each agent only observes its own o_i. The MTDP (Pynadath and Tambe, 2002) is essentially identical to the Dec-POMDP.
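A Dec-POMDP is then simply the special case in which all R_i coincide. A tiny sketch, reusing the hypothetical POSG container from the previous block:

```python
# Sketch: a Dec-POMDP as a POSG whose per-agent reward functions coincide.
# `posg_fields` is assumed to hold every POSG field except `rewards`.
def make_dec_pomdp(posg_fields: dict, shared_reward) -> POSG:
    """Give every agent the same reward function R : S x A_1 x ... x A_n -> R."""
    n = len(posg_fields["agents"])
    return POSG(**posg_fields, rewards=[shared_reward] * n)
```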

Interactive POMDPs. Interactive POMDPs (Gmytrasiewicz and Doshi, 2005): for self-interested agents. Each agent keeps a belief over world states and over models of the other agents. An agent's model: local observation history, policy, observation function. This leads to an infinite hierarchy of beliefs.

Communication. Implicit or explicit: implicit communication can be modeled in the non-communicative frameworks. Explicit communication (Goldman and Zilberstein, 2004): informative messages, commitments, rewards/punishments. Semantics: fixed (optimize the joint policy given the semantics) or the general case (optimize the meanings as well). Potential assumptions: instantaneous, noise-free, broadcast communication.

Dec-POMDPs with communication. Dec-POMDP-Com (Goldman and Zilberstein, 2004): a Dec-POMDP plus: Σ is the alphabet of all possible messages. σ_i ∈ Σ is a message sent by agent i. C_Σ : Σ → R is the cost of sending a message. The reward depends on the messages sent: R(s, a_1, σ_1,..., a_n, σ_n, s'). Instantaneous broadcast communication. Fixed semantics. Two policies per agent: one for domain-level actions and one for communicating. Closely related model: Com-MTDP (Pynadath and Tambe, 2002).
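One way to picture the role of C_Σ is to fold message costs into the reward. The additive combination below is my own illustrative assumption, not the definition from Goldman and Zilberstein (2004):

```python
# Sketch of a Dec-POMDP-Com style reward that accounts for message costs.
from typing import Dict, Sequence

def com_reward(domain_reward: float,
               messages: Sequence[str],
               C_sigma: Dict[str, float]) -> float:
    """Domain-level reward combined with the cost C_Sigma of each message sent."""
    return domain_reward - sum(C_sigma.get(m, 0.0) for m in messages)
```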

Extensive form games. 8-card poker example (game-tree figure).

Extensive form games (1). Extensive form games: view a POSG as a game tree. Agents act on information sets. Actions are taken in turns. POSGs are defined over world states; extensive form games over nodes in the game tree.

Dec-POMDP complexity results (rows: communication, columns: observability):

                      fully   jointly   partial   none
none                  P       NEXP      NEXP      NP
general               P       NEXP      NEXP      NP
free, instantaneous   P       P         PSPACE    NP

Dynamic programming for POSGs. Dynamic programming for POSGs (Hansen et al., 2004). Uncertainty over the state and the other agent's future conditional plans. Define a value function V_t over the state and the other agent's depth-t policy trees: a vector of length |S| for each pair of policy trees. Computing the value function at t+1 requires backing up all combinations of all agents' depth-t policy trees. Prune (very weakly) dominated strategies. Optimal for cooperative settings (Dec-POMDPs). Still infeasible for all but the smallest problems.
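The sketch below shows one backup step for a two-agent case in the spirit of Hansen et al. (2004). Policy trees are represented as (root action, {observation: subtree}) pairs; the pruning here uses a simple pointwise-dominance test over all (state, opponent-tree) pairs, a conservative stand-in for the LP-based test for very weakly dominated strategies used in the paper, and `value(tree, s, q)` is an assumed helper.

```python
import itertools

def exhaustive_backup(actions, observations, subtrees):
    """Enumerate all depth-(t+1) policy trees: a root action plus, for every
    observation, one of the existing depth-t subtrees."""
    new_trees = []
    for a in actions:
        for assignment in itertools.product(subtrees, repeat=len(observations)):
            new_trees.append((a, dict(zip(observations, assignment))))
    return new_trees

def prune_dominated(trees, states, opponent_trees, value):
    """Drop a tree if another tree is at least as good for every state and every
    opponent tree, and strictly better somewhere. `value(tree, s, q)` is assumed
    to return the expected value of `tree` against opponent tree `q` in state s."""
    kept = []
    pairs = [(s, q) for s in states for q in opponent_trees]
    for t in trees:
        dominated = any(
            other is not t
            and all(value(other, s, q) >= value(t, s, q) for s, q in pairs)
            and any(value(other, s, q) > value(t, s, q) for s, q in pairs)
            for other in trees
        )
        if not dominated:
            kept.append(t)
    return kept
```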

(Approximate) Dec-POMDP solving. Extra assumptions: e.g., independent observations, factored state representation, local full observability (Dec-MDP), structure in the reward function. Optimize one agent while keeping the others fixed, and iterate. Settle for locally optimal solutions. Free communication turns the problem into one big POMDP. Find a good on-line communication policy: add a synchronization action (Nair et al., 2004), or maintain a tree of possible joint beliefs (Roth et al., 2005).

Some algorithms. Joint Equilibrium-based Search for Policies (JESP) (Nair et al., 2003): uses alternating maximization and converges to a Nash equilibrium, which is a local optimum. The best-responding agent keeps a belief over the state and the other agents' observation histories; this POMDP is transformed into an MDP over belief states and solved using value iteration.
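The alternating-maximization loop itself is simple. The Python sketch below assumes hypothetical `joint_value` and `best_response` callables (the latter standing in for solving the induced single-agent POMDP) and is not code from the paper:

```python
def alternating_maximization(policies, joint_value, best_response, max_sweeps=100):
    """Improve agents one at a time until no single-agent change helps, i.e., a
    Nash equilibrium of the cooperative game (a local optimum)."""
    value = joint_value(policies)
    for _ in range(max_sweeps):
        improved = False
        for i in range(len(policies)):
            candidate = best_response(i, policies)   # solve agent i's POMDP
            trial = list(policies)
            trial[i] = candidate
            trial_value = joint_value(trial)
            if trial_value > value + 1e-9:           # accept strict improvements only
                policies, value = trial, trial_value
                improved = True
        if not improved:
            break
    return policies, value
```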

Some algorithms (1). Coverage Set algorithm (Becker et al., 2004): for transition-independent Dec-MDPs with a particular joint reward structure. Bounded Policy Iteration for Dec-POMDPs (Bernstein et al., 2005): optimize finite-state controllers of bounded size, using alternating maximization.

References

R. Becker, S. Zilberstein, V. Lesser, and C. V. Goldman. Solving transition independent decentralized Markov decision processes. Journal of Artificial Intelligence Research, 22:423–455, 2004.
D. S. Bernstein, R. Givan, N. Immerman, and S. Zilberstein. The complexity of decentralized control of Markov decision processes. Mathematics of Operations Research, 27(4):819–840, 2002.
D. S. Bernstein, E. A. Hansen, and S. Zilberstein. Bounded policy iteration for decentralized POMDPs. In Proc. Int. Joint Conf. on Artificial Intelligence, 2005.
P. J. Gmytrasiewicz and P. Doshi. A framework for sequential planning in multi-agent settings. Journal of Artificial Intelligence Research, 24:49–79, 2005.
C. V. Goldman and S. Zilberstein. Decentralized control of cooperative systems: Categorization and complexity analysis. Journal of Artificial Intelligence Research, 22:143–174, 2004.
E. A. Hansen, D. Bernstein, and S. Zilberstein. Dynamic programming for partially observable stochastic games. In Proc. of the National Conference on Artificial Intelligence, 2004.
R. Nair, M. Tambe, M. Yokoo, D. Pynadath, and S. Marsella. Taming decentralized POMDPs: Towards efficient policy computation for multiagent settings. In Proc. Int. Joint Conf. on Artificial Intelligence, 2003.
R. Nair, M. Tambe, M. Roth, and M. Yokoo. Communication for improving policy computation in distributed POMDPs. In Proc. of Int. Joint Conference on Autonomous Agents and Multi Agent Systems, 2004.
D. V. Pynadath and M. Tambe. The communicative multiagent team decision problem: Analyzing teamwork theories and models. Journal of Artificial Intelligence Research, 16:389–423, 2002.
M. Roth, R. Simmons, and M. Veloso. Decentralized communication strategies for coordinated multi-agent policies. In A. Schultz, L. Parker, and F. Schneider, editors, Multi-Robot Systems: From Swarms to Intelligent Automata, volume IV. Kluwer Academic Publishers, 2005.
L. Shapley. Stochastic games. Proceedings of the National Academy of Sciences, 39:1095–1100, 1953.