Emergent Communication for Collaborative Reinforcement Learning

Emergent Communication for Collaborative Reinforcement Learning
Yarin Gal and Rowan McAllister
MLG RCC, 8 May 2014

Game Theory | Multi-Agent Reinforcement Learning | Learning Communication

Nash Equilibrium
Nash equilibria are game states s.t. no player would fare better by a unilateral¹ change of their own action.
¹ Performed by or affecting only one person involved in a situation, without the agreement of another.

Prisoner's Dilemma (prison sentence in years):

                        Sideshow Bob
                     Cooperate   Defect
Snake   Cooperate      1,1        3,0
        Defect         0,3        2,2

Pareto Efficiency
Pareto optima are game states s.t. no alternative state exists in which every player fares at least as well and at least one player fares strictly better.
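
A small worked example ties the two definitions together. The sketch below (ours, with the prisoner's dilemma matrix above hard-coded; the function names are purely illustrative) checks every joint action for the Nash and Pareto conditions.

```python
# Prisoner's dilemma costs (prison years, lower is better): COST[(a_snake, a_bob)]
ACTIONS = ["C", "D"]
COST = {  # (snake_years, bob_years)
    ("C", "C"): (1, 1), ("C", "D"): (3, 0),
    ("D", "C"): (0, 3), ("D", "D"): (2, 2),
}

def is_nash(a_snake, a_bob):
    """No player can reduce their own years by a unilateral change of action."""
    snake_ok = all(COST[(a_snake, a_bob)][0] <= COST[(alt, a_bob)][0] for alt in ACTIONS)
    bob_ok = all(COST[(a_snake, a_bob)][1] <= COST[(a_snake, alt)][1] for alt in ACTIONS)
    return snake_ok and bob_ok

def is_pareto(a_snake, a_bob):
    """No alternative outcome is at least as good for both players and better for one."""
    here = COST[(a_snake, a_bob)]
    for alt in COST.values():
        if alt[0] <= here[0] and alt[1] <= here[1] and alt != here:
            return False
    return True

for s in ACTIONS:
    for b in ACTIONS:
        print(s, b, "Nash" if is_nash(s, b) else "", "Pareto" if is_pareto(s, b) else "")
# (D, D) is the unique Nash equilibrium; (C, C), (C, D) and (D, C) are Pareto optimal.
```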

Iterated Prisoner Dilemma Strategies A t = {Cooperate (C), Defect (D)} S t = {CC, CD, DC, DD} (previous game outcome) π : t i=2 S i A t Possible strategies π for Snake: Tit-for-Tat: π(s t ) = { C, if t = 1; a Bob,t 1, if t > 1 Reinforce actions conditioned on game outcomes: π(s t ) = arg min a E T [accumulated prison years s t, a] update transition model T 6 of 40

Game Theory | Multi-Agent Reinforcement Learning | Learning Communication

Multi-Agent Reinforcement Learning
How can we learn mutually beneficial collaboration strategies?
Modelling: multi-agent MDPs, Dec-MDPs
Issues solving joint tasks: decentralised knowledge with no centralised control, credit assignment, communication constraints
Issues affecting individual agents: the state space explodes, O(|S|^#agents); co-adaptation makes the environment dynamic and non-Markov

Markov Decision Process (MDP)
Stochastic environment characterised by the tuple {S, A, R, T, γ}, where:
R : S × A × S → ℝ
T : S × A × S → [0, 1]
γ ∈ [0, 1]

Multi-agent MDP (MMDP)
N-agent stochastic game characterised by the tuple {S, A, R, T, γ}, where:
S = ×_{i=1}^N S_i
A = ×_{i=1}^N A_i
R = {R_i}_{i=1}^N, with R_i : S × A × S → ℝ
T : S × A × S → [0, 1]
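
The O(|S|^#agents) blow-up mentioned two slides back is easy to see once the joint spaces are written as products; the toy numbers below are ours, chosen only to show the growth.

```python
# Joint state/action spaces of an MMDP are Cartesian products over agents,
# so their sizes multiply: |S| = prod_i |S_i|, |A| = prod_i |A_i|.
per_agent_states = 26    # e.g. one hunter's perceptual state in the hunter-prey task below
per_agent_actions = 4

for n_agents in (1, 2, 3, 4):
    joint_states = per_agent_states ** n_agents
    joint_actions = per_agent_actions ** n_agents
    print(f"{n_agents} agents: |S| = {joint_states:>9,}  |A| = {joint_actions}")
# 4 agents already give |S| = 456,976 joint states for a 26-state perceptual space.
```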

Multi-agent Q-learning
Oblivious agents [Sen et al., 1994]:
Q_i(s, a_i) ← (1 − α) Q_i(s, a_i) + α [R_i(s, a_i) + γ V_i(s′)]
V_i(s) = max_{a_i ∈ A_i} Q_i(s, a_i)
Common-payoff games [Claus and Boutilier, 1998]:
Q_i(s, a) ← (1 − α) Q_i(s, a) + α [R_i(s, a, s′) + γ V_i(s′)]
V_i(s) = max_{a_i ∈ A_i} Σ_{a_{−i} ∈ A∖A_i} P_i(s, a_{−i}) Q_i(s, {a_i, a_{−i}})
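
As a concrete (if toy) illustration of the oblivious update, here is a short Python sketch of independent tabular Q-learning for one agent, plus the "share Q-updates" variant from the next slide, in which peers simply apply each other's experience tuples. The environment interface and names are our own illustrative scaffolding, not code from the cited papers.

```python
import random
from collections import defaultdict

class QAgent:
    """Oblivious tabular Q-learner: ignores the other agents entirely."""
    def __init__(self, actions, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.Q = defaultdict(float)   # Q[(s, a)]
        self.actions, self.alpha, self.gamma, self.epsilon = actions, alpha, gamma, epsilon

    def act(self, s):
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.Q[(s, a)])

    def update(self, s, a, r, s_next):
        v_next = max(self.Q[(s_next, b)] for b in self.actions)  # V_i(s') = max_a Q_i(s', a)
        self.Q[(s, a)] += self.alpha * (r + self.gamma * v_next - self.Q[(s, a)])
        return (s, a, r, s_next)      # expose the experience so peers can reuse it

def share_update(peers, experience):
    """Tan-style sharing: peers (excluding the originator) apply the same (s, a, r, s')."""
    s, a, r, s_next = experience
    for agent in peers:
        agent.update(s, a, r, s_next)
```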

Independent vs Cooperative Learning
[Tan, 1993]: Can N communicating agents outperform N non-communicating agents?
Ways of communicating:
Agents share Q-learning updates (thus syncing Q-values): Pro: each agent learns N-fold faster (per timestep); Note: same asymptotic performance as independent agents.
Agents share sensory information: Pro: more information → better policies; Con: more information → larger state space → slower learning.

Hunter-Prey Problem
[Figure: hunter and prey on a 10 × 10 grid world.]
Perceptual state: the prey's relative position (x, y), visual depth 2.
|S| = 5² + 1 = 26
R = { +1.0 if a hunter catches a prey, i.e. (x_i, y_i) = (0, 0); −0.1 otherwise }

Hunter-Prey Experiments
Experiment 1, any hunter catches a prey:
  Baseline: 2 independent hunters, |S_i| = 5² + 1 = 26
  2 hunters, communicating Q-value updates, |S_i| = 26
Experiment 2, both hunters catch the same prey simultaneously:
  Baseline: 2 independent hunters, |S_i| = 26
  2 hunters, communicating own locations, |S_i| = 26 × 19² = 9386
  2 hunters, communicating own + prey locations, |S_i| = (19² + 1) × 19² = 130682
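
The state-space sizes on this slide follow from the sensing ranges: a visual depth of 2 gives a 5 × 5 window for the prey's relative position (plus one "not visible" state), and a location exchanged on the 10 × 10 grid has 19 × 19 possible relative offsets. The exact factorisation is our reading of the slide; a few lines of arithmetic reproduce the numbers.

```python
visual_window = 5 * 5 + 1      # prey's relative position within depth 2, or "not visible"
relative_grid = 19 * 19        # relative offsets on a 10x10 grid: -9..9 in each axis

print(visual_window)                        # 26, independent / shared-Q hunters
print(visual_window * relative_grid)        # 9386, matching the slide's 26 x 19^2
print((relative_grid + 1) * relative_grid)  # 130682, matching the slide's (19^2 + 1) x 19^2
```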

Hunter-Prey Results
[Figure: average training steps vs. number of trials. Experiment 1 (any hunter catches a prey): independent vs. same-policy hunters. Experiment 2 (both hunters catch the same prey simultaneously): independent, passively-observing, and mutual-scouting hunters.]

Decentralised Sparse-Interaction MDP [Melo and Veloso, 2011]
Philosophy: N-agent coordination is hard since the size of the state space grows exponentially in N. Limit the scope of coordination to where it is probably most useful; plan and learn w.r.t. local agent-agent interactions only.
The Dec-SIMDP framework determines when and how agents i and j coordinate vs. act independently.
Decentralised = full joint S-observability, but not full individual S-observability (agent i only observes S_i plus nearby agents).

Dec-SIMDP: Reducing Joint State Space
[Figure: agents' state spaces S1 and S2 under global coupling vs. local coupling only.]

Dec-SIMDP: A Navigation Task
Navigation task: coordination necessary only when crossing the narrow doorway.
S_i = {1, ..., 20, D},  A_i = {N, S, E, W}
R(s, a) = { 2 if s = (20, 9); 1 if s_1 = 20 or s_2 = 9; −20 if s = (D, D); 0 otherwise }
Z_i = S_i ∪ ({6, 15, D} × {6, 15, D})
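
A tiny sketch of this reward as code may make the piecewise definition easier to read. The joint state is a pair (s1, s2) of cell indices (or D for the doorway); the signs, the −20 doorway collision penalty and the function name follow our reading of the slide rather than the original paper's code.

```python
def navigation_reward(s1, s2):
    """Joint reward for the two-agent doorway navigation task (our reading of the slide)."""
    if (s1, s2) == (20, 9):      # both agents have reached their goal cells
        return 2
    if (s1, s2) == ("D", "D"):   # both agents squeezed into the doorway: collision penalty
        return -20
    if s1 == 20 or s2 == 9:      # exactly one agent is at its goal
        return 1
    return 0

assert navigation_reward(20, 9) == 2
assert navigation_reward("D", "D") == -20
assert navigation_reward(20, 3) == 1
```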

Sparse Interaction [video]
Four interconnected modular robots cooperate to change configuration: line → ring.

Teammate Modelling [Mundhe and Sen, 2000]

Credit Assignment
How should individual agents be credited w.r.t. total team performance (or utility)?

Communication
"Shall we both choose to cooperate next round?"  "OK."

                        Sideshow Bob
                     Cooperate   Defect
Snake   Cooperate      1,1        3,0
        Defect         0,3        2,2
(prison sentence in years)

Unknown Languages?
"What?"

                          Alien
                     Cooperate   Defect
Snake   Cooperate      1,1        3,0
        Defect         0,3        2,2
(prison sentence in years)

Game Theory | Multi-Agent Reinforcement Learning | Learning Communication

Learning communication
How learning communication can help in RL collaboration
Approaches to learning communication (ranging from linguistically motivated to a pragmatic view)
What problems exist with learning communication?

Learning communication for collaboration
How can learning communication help in RL collaboration?
Forgoes expensive expert time for protocol planning
Allows for a decentralised system without an external authority to decide on a communication protocol
Life-long learning (adaptive tasks, e.g. future-proofed robots)

Approaches to learning communication, from linguistic motivation to a pragmatic view: emergent languages
Pidgin: a simplified language developed for communication between groups that do not have a common language
Creole: a pidgin language nativised by children as their primary language, e.g. Singlish

Approaches to learning communication, from linguistic motivation to a pragmatic view: computational models
A computational model for emergent languages should account for polysemy (a word might have different meanings), synonymy (a meaning might have different words), ambiguity (two agents might associate different meanings to the same word), and be open (agents may enter or leave the population, new words might emerge to describe meanings).

Approaches to learning communication, from linguistic motivation to a pragmatic view: computational models
[Steels, 1996] constructs a model in which words map to features of an object.

Approaches to learning communication, from linguistic motivation to a pragmatic view: computational models
Agents learn each other's word-feature mappings by selecting an object and describing one of its distinctive features.

Approaches to learning communication, from linguistic motivation to a pragmatic view: computational models
An agent's word-feature mapping is reinforced when both agents use the same word to identify a distinctive feature of the object.
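
The Steels-style dynamic sketched on the last three slides can be caricatured in a few lines of Python: each agent holds word-feature association scores (a structure that naturally allows polysemy and synonymy), a speaker names a distinctive feature, and the association is reinforced only when the hearer would use the same word for it. This is a minimal sketch of the general idea, not Steels' actual model; the word forms, features and update constants are ours.

```python
import random
from collections import defaultdict

WORDS = ["wabo", "tiki", "mele"]        # invented word forms
FEATURES = ["red", "round", "small"]    # object features the agents can perceive

def make_agent():
    # small random initial preferences break ties between words
    return defaultdict(float, {(w, f): random.random() * 0.01 for w in WORDS for f in FEATURES})

def preferred_word(agent, feature):
    return max(WORDS, key=lambda w: agent[(w, feature)])

def naming_game(speaker, hearer, feature):
    """One interaction: reinforce the mapping on agreement, weaken it on disagreement."""
    word = preferred_word(speaker, feature)
    if preferred_word(hearer, feature) == word:   # communicative success
        speaker[(word, feature)] += 1.0
        hearer[(word, feature)] += 1.0
    else:                                         # failure: small negative update
        speaker[(word, feature)] -= 0.1
        hearer[(word, feature)] -= 0.1

a, b = make_agent(), make_agent()
for _ in range(500):
    naming_game(*random.sample([a, b], 2), feature=random.choice(FEATURES))
# With repeated games, agreed word-feature pairs are reinforced and tend to stabilise.
```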

Approaches to learning communication, from linguistic motivation to a pragmatic view: formal framework
Using RL we can formalise the ideas above. For example, [Goldman et al., 2007] establish a formal framework in which agents using different languages learn to coordinate.
In this framework, a state space S describes the world, A_i describes the actions the i-th agent can perform, F_i(s) is the probability that agent i is in state s, Σ_i is the alphabet of messages agent i can communicate, and o_i is an observation of the state for agent i.

Approaches to learning communication, from linguistic motivation to a pragmatic view: formal framework
We define agent i's policy as a mapping from sequences (the history) of state-message pairs to actions, δ_i : (Ω × Σ)* → A_i, and a secondary mapping from sequences of state-message pairs to messages, δ_i^Σ : (Ω × Σ)* → Σ_i.
A translation τ between languages Σ and Σ′ is a distribution over message pairs; each agent holds a distribution P_{τ,i} over translations between its own language and other agents' languages.
And meaning is interpreted as: what belief state would cause me to send the message I just received?

Learning communication: a model
[Figure: overview of the framework]

Approaches to learning communication, from linguistic motivation to a pragmatic view: formal framework
Several experiments were used to assess the framework. For example, two agents work to meet at a point in a gridworld according to a belief over the location of the other. Messages describing an agent's location are exchanged, and their translations are updated depending on whether the agents meet or not. The optimal policies are assumed to be known before the agents try to learn how to communicate.
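
A minimal sketch of that translation-learning loop, under our own simplifying assumptions: a translation is kept per received message as a categorical distribution over the receiver's own messages, and is nudged towards or away from the currently assumed sense depending on whether the agents actually met. This illustrates the idea, not the algorithm of [Goldman et al., 2007]; all names are ours.

```python
from collections import defaultdict

class Translator:
    """Per-agent distribution over how to read the other agent's messages."""
    def __init__(self, own_messages, lr=0.2):
        self.own_messages = own_messages
        self.lr = lr
        # P(own message m | foreign message f), initialised uniformly on first use
        self.p = defaultdict(lambda: {m: 1.0 / len(own_messages) for m in own_messages})

    def interpret(self, foreign_msg):
        """Best current guess: which of my own messages the foreign one corresponds to."""
        dist = self.p[foreign_msg]
        return max(dist, key=dist.get)

    def update(self, foreign_msg, assumed_own_msg, success):
        """Reinforce the assumed sense if the rendezvous succeeded, weaken it otherwise."""
        dist = self.p[foreign_msg]
        delta = self.lr if success else -self.lr
        dist[assumed_own_msg] = max(1e-6, dist[assumed_own_msg] + delta)
        z = sum(dist.values())
        for m in dist:               # renormalise to keep a proper distribution
            dist[m] /= z

# Usage sketch: t = Translator(["north", "south"]); guess = t.interpret("zorp")
# ... act on the guess, observe whether the agents met ...; t.update("zorp", guess, success=True)
```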

Approaches to learning communication, from linguistic motivation to a pragmatic view: a pragmatic view
Use in robotics:
A leader robot controlling a follower robot [Yanco and Stein, 1993]
Small robots pushing a box towards a source of light [Mataric, 1998]
[Figure: leader-follower robots. Figure: box pushing.]

Approaches to learning communication, from linguistic motivation to a pragmatic view: a pragmatic view
Use in robotics: a leader robot controlling a follower robot.
[Figure: communication diagram]

Approaches to learning communication, from linguistic motivation to a pragmatic view: a pragmatic view
Use in robotics: a leader robot controlling a follower robot.
[Figure: reinforcement regime]
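
As an illustration of this pragmatic, robotics-style setting, here is a small sketch in which a follower learns by reinforcement which action each of the leader's (initially meaningless) signals should trigger. This is our own toy rendition of the general leader-follower idea, not the protocol of [Yanco and Stein, 1993]; signals, actions and constants are invented.

```python
import random
from collections import defaultdict

SIGNALS = ["beep", "boop"]             # leader's arbitrary signal set
ACTIONS = ["turn_left", "turn_right"]

class Follower:
    """Learns, from task reward alone, what each leader signal should mean."""
    def __init__(self, alpha=0.3, epsilon=0.1):
        self.q = defaultdict(float)    # value of taking action a on hearing signal s
        self.alpha, self.epsilon = alpha, epsilon

    def act(self, signal):
        if random.random() < self.epsilon:
            return random.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: self.q[(signal, a)])

    def learn(self, signal, action, reward):
        self.q[(signal, action)] += self.alpha * (reward - self.q[(signal, action)])

# Leader's private intention: "beep" means turn left, "boop" means turn right.
INTENDED = {"beep": "turn_left", "boop": "turn_right"}

follower = Follower()
for _ in range(300):
    signal = random.choice(SIGNALS)
    action = follower.act(signal)
    reward = 1.0 if action == INTENDED[signal] else 0.0  # team succeeds only on a match
    follower.learn(signal, action, reward)
# The follower converges to the leader's intended signal-action mapping.
```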

Why is learning communication difficult?
What problems exist with learning communication?
Difficult to specify a framework: many partial frameworks have been proposed with different approaches; state space explosion.
Difficult to use for RL collaboration: no framework has been shown to improve on independent RL.
These problems are not fully answered in current research.

Up, Up and Away: where this might go
Learning communication based on sparse interactions: reduce state space complexity.
Selecting what to listen to in incoming communication: state space selection.
Cyber-warfare, better computer worms? Developing unique communication protocols between cliques of agents.
Online learning of communication: introducing a new agent into a system with existing agents; finding the optimal policy with agents ignorant of one another, and then allowing agents to start communicating to improve collaboration.
Lots to do for future research!

Claus, C. and Boutilier, C. (1998). The dynamics of reinforcement learning in cooperative multiagent systems. In AAAI/IAAI, pages 746-752.
Goldman, C. V., Allen, M., and Zilberstein, S. (2007). Learning to communicate in a decentralized environment. Autonomous Agents and Multi-Agent Systems, 15(1):47-90.
Mataric, M. J. (1998). Using communication to reduce locality in distributed multiagent learning. Journal of Experimental & Theoretical Artificial Intelligence, 10(3):357-369.
Melo, F. S. and Veloso, M. (2011). Decentralized MDPs with sparse interactions. Artificial Intelligence, 175(11):1757-1789.
Mundhe, M. and Sen, S. (2000). Evolving agent societies that avoid social dilemmas. In GECCO, pages 809-816.
Sen, S., Sekaran, M., Hale, J., et al. (1994). Learning to coordinate without sharing information. In AAAI, pages 426-431.
Steels, L. (1996). Emergent adaptive lexicons. From Animals to Animats, 4:562-567.
Tan, M. (1993). Multi-agent reinforcement learning: independent vs. cooperative agents. In Proceedings of the Tenth International Conference on Machine Learning, volume 337, Amherst, MA.
Yanco, H. and Stein, L. A. (1993). An adaptive communication protocol for cooperating mobile robots. In From Animals to Animats 2: Proceedings of the Second International Conference on Simulation of Adaptive Behavior, The MIT Press, Cambridge, MA, pages 478-485.