Introduction to Multi-Agent Programming

Introduction to Multi-Agent Programming, 11. Learning in Multi-Agent Systems (Part A): SDP, MDPs, Value Iteration, Policy Iteration, RL. Alexander Kleiner, Bernhard Nebel

Contents: Introduction; Sequential decision problems; Markov decision processes; Value Iteration & Policy Iteration; Reinforcement Learning (RL)

Introduction The importance of learning in MAS: agents are typically deployed in complex domains, i.e., dynamic domains with large state spaces and uncertainty in action execution, so it is sometimes impossible to prepare agents for every situation. Learning methods can be used to enable the agent to make rich decisions based on little experience (generalization) and to change its behavior online according to changes in the world (adaptation). However, machine learning suffers from the curse of dimensionality: the state space grows exponentially with the number of state variables, and the action space grows exponentially with the number of action variables (in MAS this is even harder).

Different Types of Learning Feedback The learning feedback indicates the performance level achieved so far. The following kinds of learning feedback are distinguished: supervised learning (teacher), reinforcement learning (critic), and unsupervised learning (observer).

Unsupervised Learning Inputs → Unsupervised Learning System → Outputs. Example: clustering of texts on the Internet according to counted word frequencies.

Supervised Learning Training info = desired (target) outputs. Inputs → Supervised Learning System → Outputs. Error = (target output - actual output). Example: detecting faces in images.

Reinforcement Learning Training info = evaluations ("rewards" / "penalties"). Inputs → RL System → Outputs ("actions"). Objective: get as much reward as possible. Example: robot driving without collisions.

The Agent-Environment Interface

The Credit-Assignment Problem The problem of properly assigning feedback for an overall performance change to each of the system activities that contributed to that change: which actions were irrelevant, and which were important? It can be decomposed into two sub-problems: the inter-agent CAP (assignment of credit for an overall performance change to the external actions of the agents) and the intra-agent CAP (assignment of credit for a particular external action of an agent to its internal modules).

Sequential Decision Problems (1) Beginning in the start state, the agent must choose an action at each time step. The interaction with the environment terminates when the agent reaches one of the goal states (4,3) (reward +1) or (4,2) (reward -1). Every other location has a reward of -0.04. In each location the available actions are Up, Down, Left, Right.

Sequential Decision Problems (2) Deterministic version: all actions always lead to the next square in the selected direction, except that moving into a wall results in no change in position. Stochastic version: each action achieves the intended effect with probability 0.8, but the rest of the time (probability 0.1 each) the agent moves at right angles to the intended direction.

Markov Decision Problem (MDP) Given a set of actions A and a set of states S in an accessible, stochastic environment, an MDP is defined by an initial state s_0, a transition model T(s,a,s'), and a reward function R(s). Transition model: T(s,a,s') is the probability that state s' is reached if action a is executed in state s. Policy: a complete mapping π that specifies for each state s which action π(s) to take. Wanted: the optimal policy π*, i.e., the policy that maximizes the expected utility. A concrete encoding of the 4x3 world as an MDP is sketched below.
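To make the definitions concrete, here is a minimal Python sketch of the 4x3 world as an MDP. The helper names (STATES, TERMINALS, ACTIONS, reward, transition) and the absorbing treatment of the terminal states are illustrative assumptions, not part of the lecture material; the transition probabilities (0.8 intended direction, 0.1 to each side) and the rewards follow the slides.

```python
# Minimal sketch of the 4x3 grid world as an MDP (illustrative names).
# States are (x, y) with x in 1..4, y in 1..3; (2, 2) is a wall.
STATES = [(x, y) for x in range(1, 5) for y in range(1, 4) if (x, y) != (2, 2)]
TERMINALS = {(4, 3): +1.0, (4, 2): -1.0}
ACTIONS = {'Up': (0, 1), 'Down': (0, -1), 'Left': (-1, 0), 'Right': (1, 0)}

def reward(s):
    """R(s): +1 / -1 in the goal states, -0.04 everywhere else."""
    return TERMINALS.get(s, -0.04)

def _move(s, delta):
    """Deterministic move; bumping into the wall or the border leaves s unchanged."""
    nxt = (s[0] + delta[0], s[1] + delta[1])
    return nxt if nxt in STATES else s

def transition(s, a):
    """T(s, a, .): list of (probability, successor) pairs for the stochastic version."""
    if s in TERMINALS:
        return [(1.0, s)]                       # modeling choice: terminals are absorbing
    dx, dy = ACTIONS[a]
    left, right = (-dy, dx), (dy, -dx)          # directions at right angles to a
    return [(0.8, _move(s, (dx, dy))),
            (0.1, _move(s, left)),
            (0.1, _move(s, right))]
```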

Optimal Policies (1) Given the optimal policy, the agent uses its current percept, which tells it its current state, and then executes the action π*(s). We obtain a simple reflex agent that is computed from the information used for a utility-based agent. Optimal policy for our MDP when R(s) = -0.04 for nonterminals:

Optimal Policies (2) The optimal policy changes with R(s); the figure shows the policies for four ranges: R(s) < -1.6248; -0.4278 < R(s) < -0.085; -0.0221 < R(s) < 0; 0 < R(s). How to compute optimal policies?

Finite and Infinite Horizon Problems The performance of the agent is measured by the sum of rewards for the states visited. To determine an optimal policy we will first calculate the utility of each state and then use the state utilities to select the optimal action for each state. The result depends on whether we have a finite or an infinite horizon problem. Utility function for state sequences: U_h([s_0, s_1, ..., s_n]). Finite horizon: U_h([s_0, s_1, ..., s_{N+k}]) = U_h([s_0, s_1, ..., s_N]) for all k > 0. For finite horizon problems the optimal policy depends on the horizon N; in infinite horizon problems the optimal policy only depends on the current state.

Assigning Utilities to State Sequences For finite horizon problems the utility of a state sequence can be computed by summing up the rewards of its states: U_h([s_0, s_1, s_2, ...]) = R(s_0) + R(s_1) + R(s_2) + ... For infinite horizon problems utilities have to be computed by discounting future rewards: U_h([s_0, s_1, s_2, ...]) = R(s_0) + γ R(s_1) + γ² R(s_2) + ... The term γ ∈ [0, 1) is called the discount factor. With discounted rewards the utility of an infinite state sequence is always finite. The discount factor expresses that future rewards have less value than current rewards. A small numeric illustration follows below.
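As a quick sanity check of the discounted-sum definition, the following small Python snippet (illustrative, not from the slides) computes U_h for a short reward sequence and shows the geometric bound R_max / (1 - γ) that makes infinite-horizon utilities finite.

```python
# Discounted utility of a (finite prefix of a) state sequence, illustrative only.
def discounted_utility(rewards, gamma):
    """U_h = R(s_0) + gamma*R(s_1) + gamma^2*R(s_2) + ..."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

rewards = [-0.04, -0.04, -0.04, 1.0]   # three steps at -0.04, then the +1 goal state
gamma = 0.9
print(discounted_utility(rewards, gamma))   # -0.04 - 0.036 - 0.0324 + 0.729 = 0.6206

# With |R(s)| <= R_max, the infinite discounted sum is bounded by R_max / (1 - gamma):
r_max = 1.0
print(r_max / (1 - gamma))                  # 10.0
```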

Utilities of States The utility of a state depends on the utility of the state sequences that follow it. Let U^π(s) be the utility of a state under policy π, and let s_t be the state of the agent after executing π for t steps. Then the utility of s under π is U^π(s) = E[ Σ_{t=0..∞} γ^t R(s_t) | π, s_0 = s ]. The true utility U(s) of a state is U^{π*}(s). R(s) is the short-term reward for being in s, whereas U(s) is the long-term total reward from s onwards.

Choosing Actions Using the Maximum Expected Utility Principle The agent simply chooses the action that maximizes the expected utility of the subsequent state: π*(s) = argmax_a Σ_{s'} T(s,a,s') U(s'). The utility of a state is the immediate reward for that state plus the expected discounted utility of the next state, assuming that the agent chooses the optimal action: U(s) = R(s) + γ max_a Σ_{s'} T(s,a,s') U(s').

Example The utilities of the states in our 4x3 world with γ=1 and R(s)=-0.04 for non-terminal states: Which action would an optimal agent choose here?

Bellman Equation The equation above is also called the Bellman equation. In our 4x3 world the equation for the state (1,1) is
U(1,1) = -0.04 + γ max{ 0.8 U(1,2) + 0.1 U(2,1) + 0.1 U(1,1),   (Up)
                        0.9 U(1,1) + 0.1 U(1,2),                (Left)
                        0.9 U(1,1) + 0.1 U(2,1),                (Down)
                        0.8 U(2,1) + 0.1 U(1,2) + 0.1 U(1,1) }  (Right)
Given the numbers for the optimal policy, Up is the optimal action in (1,1).

Value Iteration (1) An algorithm to calculate an optimal strategy. Basic idea: calculate the utility of each state, then use the state utilities to select an optimal action for each state. How to calculate the utility of each state? The Bellman equation can be used to build a system of n equations for n states. However, because of the max operator required by the stochastic transition model, the system is non-linear, so the solution cannot be computed in closed form (this is only possible for deterministic problems).

Value Iteration (2) Iterative Procedure Solution: we can apply an iterative approach in which we replace the equality of the Bellman equation by an assignment: U_{i+1}(s) ← R(s) + γ max_a Σ_{s'} T(s,a,s') U_i(s').

The Value Iteration Algorithm It can be shown that value iteration converges. A Python sketch of the algorithm is given below.
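Since the slide shows the algorithm only as a figure, here is a minimal Python sketch of value iteration, reusing the illustrative STATES, TERMINALS, ACTIONS, reward and transition helpers defined in the MDP sketch above. The discount factor, the stopping threshold, and the special-casing of terminal states are assumptions of this sketch, not prescriptions from the lecture.

```python
# Value iteration sketch (illustrative), building on the 4x3 MDP sketch above.
def value_iteration(gamma=0.99, epsilon=1e-4):
    U = {s: 0.0 for s in STATES}
    while True:
        U_new, delta = {}, 0.0
        for s in STATES:
            if s in TERMINALS:
                U_new[s] = reward(s)             # utility of a goal state is its reward
            else:
                # Bellman backup: R(s) + gamma * max_a sum_s' T(s,a,s') U(s')
                best = max(sum(p * U[s2] for p, s2 in transition(s, a))
                           for a in ACTIONS)
                U_new[s] = reward(s) + gamma * best
            delta = max(delta, abs(U_new[s] - U[s]))
        U = U_new
        if delta < epsilon * (1 - gamma) / gamma:   # common stopping criterion
            return U

def greedy_policy(U):
    """Extract a policy by one-step lookahead on the converged utilities."""
    return {s: max(ACTIONS, key=lambda a: sum(p * U[s2] for p, s2 in transition(s, a)))
            for s in STATES if s not in TERMINALS}
```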

Application Example In practice the policy often becomes optimal before the utility has converged.

Policy Iteration Value iteration computes the optimal policy even at a stage when the utility function estimate has not yet converged: if one action is better than all others, the exact utilities of the states involved need not be known. Policy iteration alternates the following two steps, beginning with an initial policy π_0. Policy evaluation: given a policy π_t, calculate U_t = U^{π_t}, the utility of each state if π_t were executed. Policy improvement: calculate a new maximum expected utility policy π_{t+1} according to π_{t+1}(s) = argmax_a Σ_{s'} T(s,a,s') U_t(s').

The Policy Iteration Algorithm
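The algorithm itself appears only as a figure in the slides; the following Python sketch (reusing the same illustrative 4x3 MDP helpers) shows one common variant that approximates the policy evaluation step with repeated simplified Bellman backups instead of solving the linear system exactly. The number of evaluation sweeps and the random initial policy are assumptions of this sketch.

```python
# Policy iteration sketch (illustrative), reusing STATES, ACTIONS, TERMINALS,
# reward() and transition() from the 4x3 MDP sketch above.
import random

def expected_utility(a, s, U):
    return sum(p * U[s2] for p, s2 in transition(s, a))

def policy_iteration(gamma=0.99, eval_sweeps=20):
    pi = {s: random.choice(list(ACTIONS)) for s in STATES if s not in TERMINALS}
    U = {s: 0.0 for s in STATES}
    while True:
        # Policy evaluation: approximate U^pi with repeated fixed-policy backups.
        for _ in range(eval_sweeps):
            for s in STATES:
                U[s] = reward(s) if s in TERMINALS else \
                       reward(s) + gamma * expected_utility(pi[s], s, U)
        # Policy improvement: make pi greedy with respect to the current U.
        unchanged = True
        for s in pi:
            best = max(ACTIONS, key=lambda a: expected_utility(a, s, U))
            if expected_utility(best, s, U) > expected_utility(pi[s], s, U):
                pi[s], unchanged = best, False
        if unchanged:
            return pi, U
```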

Reinforcement Learning Learning from interaction with an external environment or other agents Goal-oriented learning Learning and making observations are interleaved Process is modeled as MDP or variants

Key Features of RL The learner is not told which actions to take. Rewards may be delayed (sacrifice short-term gains for greater long-term gains). Model-free: models are learned online, i.e., they do not have to be defined in advance! Trial-and-error search. The need to explore and exploit.

Some Notable RL Applications TD-Gammon (Tesauro): the world's best backgammon program. Elevator control (Crites & Barto): high-performance down-peak elevator controller. Dynamic channel assignment (Singh & Bertsekas; Nie & Haykin): high-performance assignment of radio channels to mobile telephone calls.

Some Notable RL Applications TD-Gammon (Tesauro, 1992-1995) Action selection by 2-3 ply search over a learned value function, trained from the TD error; effective branching factor about 400. Start with a random network, play very many games against self, and learn a value function from this simulated experience. This produces arguably the best player in the world.

Some Notable RL Applications Elevator Dispatching (Crites and Barto, 1996) 10 floors, 4 elevator cars. States: button states; positions, directions, and motion states of the cars; passengers in cars and in halls. Actions: stop at, or go by, the next floor. Rewards: roughly, -1 per time step for each person waiting. Conservatively about 10^22 states.

Some Notable RL Applications Performance Comparison Elevator Dispatching

Q-Learning (1)

Q-Learning (2) At time t the agent performs the following steps: observe the current state s_t; select and perform action a_t; observe the subsequent state s_{t+1}; receive the immediate payoff r_t; adjust the Q-value for state s_t.

Q-Learning (3) Update and Selection Update function: Q_{k+1}(s_t, a_t) = (1 - α) Q_k(s_t, a_t) + α (r_t + γ max_{a'} Q_k(s_{t+1}, a')), where k denotes the version of the Q function and α denotes a learning step-size parameter that should decay over time. Intuitively, actions can be selected by π(s) = argmax_a Q(s, a).

Q-Learning (4) Algorithm
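The full algorithm is shown only as a figure on the slide; the following Python sketch gives a tabular Q-learning agent matching the update rule above. The environment interface (env.reset() returning a state, env.step(a) returning next state, reward, and a done flag) and the per-episode decay of α are assumptions of this sketch, not part of the lecture material.

```python
import random
from collections import defaultdict

# Tabular Q-learning sketch (illustrative); `env` is a hypothetical environment
# exposing reset() -> state and step(action) -> (next_state, reward, done).
def q_learning(env, actions, episodes=1000, gamma=0.95, epsilon=0.1):
    Q = defaultdict(float)                      # Q[(state, action)], initialized to 0
    for episode in range(1, episodes + 1):
        alpha = 1.0 / episode                   # step size decaying over time
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection (see the following slides)
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            s_next, r, done = env.step(a)
            # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
            target = r + gamma * max(Q[(s_next, act)] for act in actions)
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
            s = s_next
    return Q
```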

The Exploration/Exploitation Dilemma Suppose you form action-value estimates Q_t(a). The greedy action at time t is a_t* = argmax_a Q_t(a). You can't exploit all the time; you can't explore all the time. You can never stop exploring, but you should always reduce exploring.

ε-Greedy Action Selection Greedy action selection: a_t = a_t* = argmax_a Q_t(a). ε-greedy: with probability 1 - ε choose the greedy action, with probability ε choose a random action. This is the simplest way to try to balance exploration and exploitation. A continuous decrease of ε during the episodes is necessary! A minimal implementation is sketched below.
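To illustrate the rule, here is a minimal ε-greedy selection function together with a simple decay schedule. The exponential decay rate and the lower bound on ε are assumptions chosen for illustration, not values prescribed by the slides.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon):
    """With probability epsilon explore (random action), otherwise exploit (greedy action)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

# Simple decay schedule: reduce epsilon after every episode, but never stop exploring.
def decayed_epsilon(episode, start=1.0, minimum=0.05, decay=0.99):
    return max(minimum, start * (decay ** episode))
```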