Thresholded Rewards: Acting Optimally in Timed, Zero-Sum Games

Thresholded Rewards: Acting Optimally in Timed, Zero-Sum Games Colin McMillen and Manuela Veloso Presenter: Man Wang

Overview Zero-sum Games Markov Decision Problems Value Iteration Algorithm Thresholded Rewards MDP TRMDP Conversion Solution Extraction Heuristic Techniques Conclusion References

Zero-sum Games Zero-sum game: one participant's gains in utility are exactly the other participant's losses. Cumulative intermediate reward: the difference between our score and the opponent's score. True reward: win, loss, or tie, determined at the end of the game from the intermediate reward.

Markov Decision Problem Consider an imperfect system: actions succeed with probability less than 1. What is the best action for an agent under this constraint? Example: a mobile robot does not execute the desired action exactly.

Markov Decision Problem A sound means of achieving optimal rewards in uncertain domains. Find a policy that maps states S to actions A and maximizes the cumulative long-term reward.
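As textbook background (not from the slides), "find a policy" and "maximize cumulative reward" can be written as choosing pi : S -> A such that

\[
  \pi^{*} = \arg\max_{\pi} \; \mathbb{E}\!\left[ \sum_{t=0}^{h} \gamma^{t}\, R\big(s_t, \pi(s_t)\big) \right]
\]

where h is the horizon (possibly infinite) and gamma is a discount factor in (0, 1].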

Value Iteration Algorithm What is the best way to reach the +1 cell without moving into the -1 cell? Consider a non-deterministic transition model.

Value Iteration Algorithm Calculate the utility of the center cell:
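In symbols, this is the standard Bellman backup used by value iteration (standard textbook material, not specific to this paper):

\[
  U(s) \leftarrow R(s) + \gamma \, \max_{a \in A} \sum_{s'} T(s, a, s')\, U(s')
\]

For the classic 4x3 grid-world example usually shown with this slide, T is the 0.8-forward / 0.1-left / 0.1-right model; that specific model is an assumption here, since the figure is not reproduced in the transcription.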

Value Iteration Algorithm
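A minimal value-iteration sketch in Python, for illustration only; the data layout (T[s][a] as a list of (probability, next_state) pairs, every action available in every state) and the stopping threshold are assumptions, not taken from the slides:

def value_iteration(states, actions, T, R, gamma=0.9, eps=1e-6):
    # U[s] holds the current utility estimate for state s
    U = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Bellman backup: best expected utility over the available actions
            best = max(sum(p * U[s2] for p, s2 in T[s][a]) for a in actions)
            new_u = R[s] + gamma * best
            delta = max(delta, abs(new_u - U[s]))
            U[s] = new_u
        if delta < eps:  # stop when no utility changes by more than eps
            return U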

Thresholded Rewards MDP A TRMDP is a tuple (M, f, h): M: a base MDP (S, A, T, R, s_0); f: a threshold function mapping intermediate reward to true reward, f(r_intermediate) = r_true; h: a time horizon.
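In the zero-sum setting of the earlier slides, the threshold function maps the final intermediate reward (our score minus the opponent's score) to the true reward of win, loss, or tie; a minimal sketch, assuming the usual +1/0/-1 encoding:

def f(r_intermediate):
    # true reward: +1 for a win, -1 for a loss, 0 for a tie
    if r_intermediate > 0:
        return 1
    if r_intermediate < 0:
        return -1
    return 0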

Thresholded Rewards MDP Example: States: 1. FOR: our team scored (reward +1) 2. AGAINST: opponent scored (reward -1) 3. NONE: no score occurs (reward 0) Actions: 1. Balanced 2. Offensive 3. Defensive

Thresholded Rewards MDP Expected one-step reward: 1. Balanced: 0 = 0.05*1 + 0.05*(-1) + 0.9*0 2. Offensive: -0.25 = 0.25*1 + 0.5*(-1) + 0.25*0 3. Defensive: -0.01 = 0.01*1 + 0.02*(-1) + 0.97*0 Always choosing the balanced action maximizes the expected intermediate reward, but it is a suboptimal policy: its expected true reward is 0.
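The expected one-step rewards above follow directly from each action's scoring probabilities; a quick check in Python, with the probabilities taken from the slide:

probs = {  # (P(FOR), P(AGAINST), P(NONE)) for each action
    "balanced":  (0.05, 0.05, 0.90),
    "offensive": (0.25, 0.50, 0.25),
    "defensive": (0.01, 0.02, 0.97),
}
for action, (p_for, p_against, p_none) in probs.items():
    expected = p_for * 1 + p_against * (-1) + p_none * 0
    print(action, round(expected, 2))  # balanced 0.0, offensive -0.25, defensive -0.01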

TRMDP Conversion The expanded MDP M' produced from the base MDP M with time horizon h = 3
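A sketch of this construction in Python (the names and the base-MDP interface, T[s][a] as (probability, next state) pairs and R keyed by the state reached, are assumptions for illustration): each expanded state records the base state, the steps remaining, and the intermediate reward accumulated so far; only the t = 0 layer pays out, via the threshold function f.

def convert_trmdp(actions, T, R, f, h, s0):
    """Build the expanded MDP M' for a thresholded-rewards MDP (M, f, h).
    States of M' are tuples (s, t, ir): base state, steps remaining, reward so far."""
    start = (s0, h, 0)
    prime_states, transitions, rewards = set(), {}, {}
    frontier = [start]
    while frontier:
        node = frontier.pop()
        if node in prime_states:
            continue
        prime_states.add(node)
        s, t, ir = node
        if t == 0:
            rewards[node] = f(ir)   # true reward paid only at the horizon
            transitions[node] = {}  # terminal layer: no outgoing transitions
            continue
        rewards[node] = 0           # no reward before the horizon
        transitions[node] = {}
        for a in actions:
            succs = []
            for p, s2 in T[s][a]:
                nxt = (s2, t - 1, ir + R[s2])
                succs.append((p, nxt))
                frontier.append(nxt)
            transitions[node][a] = succs
    return start, prime_states, transitions, rewards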

Solution Extraction Two important facts: (1) M' has a layered, feed-forward structure: every layer contains transitions only into the next layer. (2) At iteration k of value iteration, the only values that change are those of the states s' = (s, t, ir) with t = k.
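These two facts mean a single backward pass of value iteration over the layers suffices, starting from t = 0 and sweeping up to t = h; a sketch, reusing the hypothetical data layout from the conversion sketch above:

def solve_layered(prime_states, transitions, rewards, h):
    """One backward pass of value iteration over the time layers of M'."""
    V, policy = {}, {}
    for t in range(h + 1):  # layer t = 0 (horizon reached) first, then t = 1, ..., h
        for node in (n for n in prime_states if n[1] == t):
            if not transitions[node]:   # terminal layer: value is the true reward f(ir)
                V[node] = rewards[node]
                continue
            best_a, best_v = None, float("-inf")
            for a, succs in transitions[node].items():
                v = sum(p * V[nxt] for p, nxt in succs)
                if v > best_v:
                    best_a, best_v = a, v
            V[node], policy[node] = best_v, best_a
    return V, policy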

Solution Extraction Optimal policy for M' with h = 120: expected true reward = 0.1457 (win: 50%, lose: 35%, tie: 15%).

Solution Extraction Effect of changing the opponent's capabilities. Performance of MER vs. TR on 5000 random MDPs.

Heuristic Techniques Uniform-k heuristic Lazy-k heuristic Logarithmic-k-m heuristic Experiments

Uniform-k heuristic Adopt a non-stationary policy that changes only every k time steps, compressing the time horizon uniformly by a factor of k. The resulting solution is suboptimal.
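One computational piece of this compression: if a fixed policy is followed for k consecutive steps, the k-step state dynamics are the k-th power of the policy-induced one-step transition matrix. A hedged numpy sketch (it shows only the state dynamics; the full heuristic also has to track the intermediate reward accumulated within each block):

import numpy as np

def k_step_transition(P_pi, k):
    """P_pi: one-step transition matrix induced by a fixed policy (rows sum to 1).
    Returns the state dynamics of following that policy for k consecutive steps."""
    return np.linalg.matrix_power(P_pi, k)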

Lazy-k heuristic With more than k steps remaining: ignore the reward threshold and act as in the base MDP. With k steps remaining: create a thresholded-rewards MDP with time horizon k, using the current state as the initial state.
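A minimal sketch of the switching logic in Python (function and variable names are mine, not from the slides):

def lazy_k_action(base_policy, trmdp_policy, state, steps_remaining, score_diff, k):
    """Lazy-k control sketch: far from the end of the game, ignore the reward
    threshold and act on the base MDP; once k steps remain, follow the policy of
    a thresholded-rewards MDP with horizon k rooted at the current state."""
    if steps_remaining > k:
        return base_policy(state)
    # trmdp_policy is assumed to have been computed when exactly k steps remained,
    # indexed by expanded states (base state, steps remaining, intermediate reward)
    return trmdp_policy[(state, steps_remaining, score_diff)]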

Logarithmic-k-m heuristic Time resolution becomes finer as the time horizon approaches. k: the number of decisions made before the time resolution is increased. m: the multiple by which the resolution is increased. For instance, k=10, m=2 means that ten actions are taken before each increase, and the time resolution doubles on each increase.
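One plausible reading of this schedule, sketched in Python (the exact construction in the paper may differ; the function name and ordering are assumptions):

def logarithmic_decision_points(h, k, m):
    """Steps-remaining values at which the agent decides: spacing 1 near the end of
    the game, multiplied by m after every k decision points moving away from the end."""
    points, t, spacing, count = [], 1, 1, 0
    while t <= h:
        points.append(t)
        t += spacing
        count += 1
        if count % k == 0:
            spacing *= m  # coarser time resolution farther from the horizon
    return sorted(points, reverse=True)  # ordered from game start toward the end

# e.g. logarithmic_decision_points(120, 10, 2) spaces decisions 1, 2, 4, 8, ... steps
# apart as we move away from the end of the game.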

Experiments 60 MDPs randomly chosen from the 5000 MDPs of the previous experiment. Uniform-k suffers from a large state-space size; logarithmic-k-m depends strongly on its parameters; lazy-k provides high true reward with a low number of states.

Conclusion Introduced the thresholded-rewards problem in finite-horizon environments: intermediate rewards during execution, true reward determined at the end of the horizon, with the goal of maximizing the probability of winning. Presented an algorithm that converts the base MDP into an expanded MDP. Investigated three heuristic techniques for generating approximate solutions.

References 1. Bacchus, F.; Boutilier, C.; and Grove, A. 1996. Rewarding behaviors. In Proc. AAAI-96. 2. Guestrin, C.; Koller, D.; Parr, R.; and Venkataraman, S. 2003. Efficient solution algorithms for factored MDPs. JAIR. 3. Hoey, J.; St-Aubin, R.; Hu, A.; and Boutilier, C. 1999. SPUDD: Stochastic planning using decision diagrams. In Proceedings of Uncertainty in Artificial Intelligence. 4. Kaelbling, L. P.; Littman, M. L.; and Moore, A. W. 1996. Reinforcement learning: A survey. JAIR. 5. Kearns, M. J.; Mansour, Y.; and Ng, A. Y. 2002. A sparse sampling algorithm for near-optimal planning in large Markov decision processes. Machine Learning.

References 6. Li, L.; Walsh, T. J.; and Littman, M. L. 2006. Towards a unified theory of state abstraction for MDPs. In Symposium on Artificial Intelligence and Mathematics. 7. Mahadevan, S. 1996. Average reward reinforcement learning: Foundations, algorithms, and empirical results. Machine Learning 22(1-3):159-195. 8. McMillen, C., and Veloso, M. 2006. Distributed, play-based role assignment for robot teams in dynamic environments. In Proc. Distributed Autonomous Robotic Systems. 9. Puterman, M. L. 1994. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley. 10. Stone, P. 1998. Layered Learning in Multi-Agent Systems. Ph.D. Dissertation, Carnegie Mellon University.