MDP: Motivation. Markovian Decision Processes (MDP). Exploration/Exploitation Conflict. Example


MDP: Motivation
Daniel Polani

Scenario: a sequence of decisions, where
1. each decision may lead randomly to different outcomes
2. each decision is connected with a reward
3. rewards cumulate to a total utility
4. rewards may be delayed

Relevance for Social Intelligence: adaptive models for social interaction. Example.

Exploration/Exploitation Conflict

Given: an n-armed bandit problem. Each trial of some bandit arm has a cost. Payoff distributions with differing mean and variance per arm (finite, and unknown). Seeking: a strategy with maximum payoff. Temporal conditions: limited, unlimited, or weighted time.

Problem: finding a balance between exploration and exploitation leads to a conflict. Greedy strategy: short-term gain (remember the tragedy of the commons).

Note: in the following we will ignore the intricacies of the exploration/exploitation dilemma and use only simple strategies for their control.
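To make the bandit setting concrete, here is a minimal sketch in Python; the Gaussian payoff model and the constants N_ARMS, TRIALS and the ε value are illustrative assumptions, not part of the slides. It compares a purely greedy strategy with an ε-greedy one on a randomly drawn n-armed bandit.

```python
import random

# Minimal n-armed bandit sketch (assumed setup: Gaussian payoff per arm).
# A purely greedy strategy can lock onto a mediocre arm; epsilon-greedy
# keeps exploring and usually identifies the best arm over many trials.

N_ARMS, TRIALS = 5, 10_000
true_means = [random.gauss(0.0, 1.0) for _ in range(N_ARMS)]

def run(epsilon: float) -> float:
    """Play TRIALS rounds with an epsilon-greedy policy; return the total payoff."""
    estimates = [0.0] * N_ARMS   # sample-average payoff estimates Q_t(a)
    counts = [0] * N_ARMS        # how often each arm was pulled
    total = 0.0
    for _ in range(TRIALS):
        if random.random() < epsilon:
            arm = random.randrange(N_ARMS)                        # explore
        else:
            arm = max(range(N_ARMS), key=lambda a: estimates[a])  # exploit
        reward = random.gauss(true_means[arm], 1.0)
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm] # incremental mean
        total += reward
    return total

print("greedy         :", run(0.0))
print("epsilon-greedy :", run(0.1))
```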

Utility Estimation: Simple Strategies

Assumption: the reward for an action $a$ is a random variable with mean $Q^*(a)$. Choose a sequence of actions and estimate $Q^*(a)$ via the sample average

$$Q_t(a) = \frac{r_1 + r_2 + \dots + r_{k_a}}{k_a},$$

where $r_1, \dots, r_{k_a}$ are the rewards obtained at the times between 1 and $t$ where action $a$ is chosen. Then obtain:

Greedy strategy: choose the action $a_t^* = \arg\max_a Q_t(a)$.

Alternative ($\varepsilon$-greedy): with probability $1 - \varepsilon$ choose the greedy action, with probability $\varepsilon$ a random action. Advantage: more exploration.

Initialization: $Q_0(a)$ is initially set to any value, e.g. 0.

Incremental Estimates

Remark (incremental computation of $Q_k$): consider $k$ actions having been made, and let the reward for the action at time $k+1$ be $r_{k+1}$. Then

$$Q_{k+1} = Q_k + \frac{1}{k+1}\bigl[r_{k+1} - Q_k\bigr],$$

i.e. new estimate = old value + step-size-dependent factor $\times$ deviation of the target value from the old value.

Notes:
1. one needs to store only $Q_k$ and $k$
2. the update form NewEstimate $\leftarrow$ OldEstimate + StepSize (Target $-$ OldEstimate) is common in learning systems
3. for large $k$ the step size $\frac{1}{k+1}$ is very small

Nonstationary Environments: give more importance to recent values, e.g.

$$Q_{k+1} = Q_k + \alpha\bigl[r_{k+1} - Q_k\bigr]$$

with constant $\alpha \in (0, 1]$.
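The two update rules differ in how quickly they forget old rewards. A short sketch (the reward sequence and the α value are illustrative assumptions) contrasts the sample-average step size 1/(k+1) with a constant α when the reward mean shifts midway:

```python
# Sample average (step size 1/(k+1), suited to stationary rewards)
# vs. constant alpha (recency-weighted, suited to nonstationary rewards).

def update_sample_average(q: float, k: int, reward: float) -> float:
    """Q_{k+1} = Q_k + 1/(k+1) * (r_{k+1} - Q_k); only Q_k and k are stored."""
    return q + (reward - q) / (k + 1)

def update_constant_alpha(q: float, reward: float, alpha: float = 0.5) -> float:
    """Q_{k+1} = Q_k + alpha * (r_{k+1} - Q_k); recent rewards dominate."""
    return q + alpha * (reward - q)

q_avg, q_exp = 0.0, 0.0
for k, r in enumerate([1.0, 1.0, 1.0, 5.0, 5.0, 5.0]):  # mean shifts midway
    q_avg = update_sample_average(q_avg, k, r)
    q_exp = update_constant_alpha(q_exp, r)
print(q_avg, q_exp)  # 3.0 vs. ~4.48: the constant-alpha estimate tracks the shift faster
```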

Reinforcement Learning: Preliminary Definitions

The Full Reinforcement Learning Problem

Def. (state): a full description of the current situation of agent and world.

Def. (policy): the policy $\pi_t$ at a time $t$ is a conditional probability $\pi_t(a \mid s)$ that an agent in a state $s$ chooses an action $a$.

Agent: at a time step $t$ it has access to: the current state $s_t$, the reward $r_t$ just obtained, and the current policy $\pi_t$. From this it calculates: the current action choice $a_t$ and the following policy $\pi_{t+1}$.

Markovian Decision Process: Example

Note: full access to the current state; the border between agent and environment is given by absolute control, not by limitation of knowledge.

Note: the goals of the agent are specified by the rewards. Goal: long-term maximization of the cumulated rewards.

Examples: reward structures as follows:
1. a robot is supposed to learn to move: reward if it makes a step ahead
2. maze: 0 per step, 1 for a step outside the maze
3. maze: $-1$ per step in the maze
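The policy definition is just a conditional distribution over actions per state. A minimal sketch (the states, actions and probabilities are illustrative assumptions, not from the slides):

```python
import random

# Sketch of Def. (policy): a policy as a conditional probability pi(a | s).

policy = {
    "start":   {"left": 0.2, "right": 0.8},
    "hallway": {"left": 0.5, "right": 0.5},
}

def choose_action(state: str) -> str:
    """Sample an action a with probability pi(a | state)."""
    actions = list(policy[state])
    weights = list(policy[state].values())
    return random.choices(actions, weights=weights, k=1)[0]

print(choose_action("start"))  # "right" with probability 0.8
```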

Value Function

Value function $V^\pi(s)$: a measure of how good it is for an agent to be in a certain state. This depends on future actions (more precisely, on the policy $\pi$):

$$V^\pi(s) = E_\pi\bigl[R_t \mid s_t = s\bigr]$$

$Q$ function $Q^\pi(s,a)$: a measure of how good it is for an agent to be in a state and to pick a certain action (again dependent on the policy in the following states):

$$Q^\pi(s,a) = E_\pi\bigl[R_t \mid s_t = s,\, a_t = a\bigr]$$

Backup Diagrams

Backup diagram for $V^\pi$; backup diagram for $Q^\pi$.

Objective

Question: we have a sequence of rewards $r_{t+1}, r_{t+2}, r_{t+3}, \dots$ What do we want to maximize? In general we want to maximize the total payoff $R_t$.

Cases:
Episodic tasks: tasks with a natural end time $T$: $R_t = r_{t+1} + r_{t+2} + \dots + r_T$.
Unlimited tasks: require discounting: $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$ with $0 \le \gamma < 1$.

Bellman Equation

Theorem: with $P^a_{ss'} = \Pr(s_{t+1} = s' \mid s_t = s, a_t = a)$ and $R^a_{ss'} = E[r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s']$, one has

$$V^\pi(s) = \sum_a \pi(s, a) \sum_{s'} P^a_{ss'} \bigl[R^a_{ss'} + \gamma V^\pi(s')\bigr].$$
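A short sketch of the discounted return for an unlimited task, truncated at the horizon of the observed rewards (the reward sequence and γ below are illustrative assumptions):

```python
# Discounted return R_t = sum_k gamma^k * r_{t+k+1}.

def discounted_return(rewards: list[float], gamma: float = 0.9) -> float:
    """Total payoff of a reward sequence r_{t+1}, r_{t+2}, ... under discounting."""
    total = 0.0
    for k, r in enumerate(rewards):
        total += (gamma ** k) * r
    return total

# A reward delayed by two steps is worth gamma^2 = 0.81 of an immediate one:
print(discounted_return([0.0, 0.0, 1.0]), discounted_return([1.0]))
```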

Optimal Policy: Bellman's Optimality Criterion

Comparison of policies: we say that $\pi$ is at least as good as $\pi'$ if for all states $s$ we have $V^\pi(s) \ge V^{\pi'}(s)$.

Theorem: there always exists an optimal policy $\pi^*$, i.e. one that is at least as good as all policies.

Note: an optimal policy is not necessarily unique! But $V^{\pi^*}$ is the same for all optimal policies. Therefore write $V^*$ for the optimal value function. Analogously define $Q^*(s,a)$. Remark: one has $Q^*(s,a) = E[r_{t+1} + \gamma V^*(s_{t+1}) \mid s_t = s, a_t = a]$.

Bellman's Optimality Equation: one has

$$V^*(s) = \max_a \sum_{s'} P^a_{ss'} \bigl[R^a_{ss'} + \gamma V^*(s')\bigr],$$

and analogously

$$Q^*(s,a) = \sum_{s'} P^a_{ss'} \bigl[R^a_{ss'} + \gamma \max_{a'} Q^*(s',a')\bigr].$$

Backup Diagrams: for $V^*$ (with max); for $Q^*$ (with max).

Learning Methods: dynamic programming, value iteration, Q-learning.
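Value iteration turns the Bellman optimality equation directly into an update. A minimal sketch (the 2-state MDP, its transitions and rewards are illustrative assumptions, not from the slides):

```python
# Value iteration: V(s) <- max_a sum_{s'} P[s'|s,a] * (R(s,a,s') + gamma * V(s')).

GAMMA = 0.9

# transitions[state][action] = list of (probability, next_state, reward)
transitions = {
    "s0": {"stay": [(1.0, "s0", 0.0)],
           "go":   [(0.8, "s1", 1.0), (0.2, "s0", 0.0)]},
    "s1": {"stay": [(1.0, "s1", 0.0)],
           "go":   [(1.0, "s0", 2.0)]},
}

V = {s: 0.0 for s in transitions}
for _ in range(200):  # sweep until (approximately) converged
    V = {
        s: max(
            sum(p * (r + GAMMA * V[s2]) for p, s2, r in outcomes)
            for outcomes in actions.values()
        )
        for s, actions in transitions.items()
    }
print(V)  # approximate optimal state values V*(s)
```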

Q-Learning

Q-Learning Properties

The Q-learning update rule is given by

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \bigl[r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t)\bigr].$$

Theorem (Watkins): if all $Q(s,a)$ are being updated often enough, $Q$ converges towards $Q^*$ independently of the policy.

Advantages: off-policy; does not require explicit averaging (done implicitly as you go); no model required.

Reinforcement Learning uses Q-learning as a central method; many variants and improvements exist.

General Remarks

Reinforcement Learning: learning from delayed rewards. In particular, Q-learning requires no model of the dynamics, only an immediate backup, but the state must have the Markov property: the result of an action must only depend on a current state variable which must be known to the agent. In particular:
1. it must not depend on memory effects unseen by the agent
2. it must not depend on the history of the agent or the world

Gambler's Problem

Scenario: playing red/black, starting and exiting with a given amount.
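A minimal tabular Q-learning sketch implementing the update rule above. The small chain environment (move left/right, reward 1 on reaching the last state) and all constants are illustrative assumptions, not from the slides:

```python
import random

ALPHA, GAMMA, EPSILON, EPISODES = 0.1, 0.9, 0.1, 2000
N_STATES, ACTIONS = 4, ("left", "right")   # state 3 is terminal

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(s: int, a: str) -> tuple[int, float]:
    """Deterministic chain dynamics; reward 1.0 on entering the terminal state."""
    s2 = min(s + 1, N_STATES - 1) if a == "right" else max(s - 1, 0)
    return s2, (1.0 if s2 == N_STATES - 1 else 0.0)

for _ in range(EPISODES):
    s = 0
    while s != N_STATES - 1:
        if random.random() < EPSILON:                  # explore
            a = random.choice(ACTIONS)
        else:                                          # exploit (random tie-break)
            a = max(ACTIONS, key=lambda x: (Q[(s, x)], random.random()))
        s2, r = step(s, a)
        # Q-learning backup: off-policy max over the next state's actions
        target = r + GAMMA * max(Q[(s2, a2)] for a2 in ACTIONS)
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])
        s = s2

print({k: round(v, 2) for k, v in Q.items()})  # Q(s, "right") approaches gamma^(2-s)
```

Note the off-policy property: the backup uses the max over next actions, not the action the ε-greedy behaviour policy actually takes, so the estimates converge towards $Q^*$ regardless of the exploration strategy.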