Lecture 10: Reinforcement Learning


Lecture 10: Reinforcement Learning
Cognitive Systems II - Machine Learning, SS 2005
Part III: Learning Programs and Strategies
Q Learning, Dynamic Programming

Motivation

Addressed problem: How can an autonomous agent that senses and acts in its environment learn to choose optimal actions to achieve its goals?
- consider building a learning robot (i.e., an agent)
- the agent has a set of sensors to observe the state of its environment and a set of actions it can perform to alter its state
- the task is to learn a control strategy, or policy, for choosing actions that achieve its goals
- assumption: the goals can be defined by a reward function that assigns a numerical value to each distinct action the agent may perform from each distinct state

Motivation

Considered settings:
- deterministic or nondeterministic outcomes
- prior background knowledge available or not

Similarity to function approximation:
- approximating the function π : S → A, where S is the set of states and A the set of actions

Differences to function approximation:
- Delayed reward: training information is not available in the form ⟨s, π(s)⟩. Instead, the trainer provides only a sequence of immediate reward values.
- Temporal credit assignment: determining which actions in the sequence are to be credited with producing the eventual reward

Motivation

Differences to function approximation (cont.):
- Exploration: the distribution of training examples is influenced by the chosen action sequence. Which is the most effective exploration strategy? There is a trade-off between exploration of unknown states and exploitation of already known states.
- Partially observable states: sensors only provide partial information about the current state (e.g. a forward-pointing camera, dirty lenses)
- Life-long learning: function approximation is often an isolated task, while robot learning requires learning several related tasks within the same environment

The Learning Task

Based on Markov Decision Processes (MDPs):
- the agent can perceive a set S of distinct states of its environment and has a set A of actions that it can perform
- at each discrete time step t, the agent senses the current state s_t, chooses a current action a_t and performs it
- the environment responds by returning a reward r_t = r(s_t, a_t) and by producing the successor state s_{t+1} = δ(s_t, a_t)
- the functions r and δ are part of the environment and not necessarily known to the agent
- in an MDP, the functions r(s_t, a_t) and δ(s_t, a_t) depend only on the current state and action
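To make the notation concrete, here is a minimal Python sketch of a deterministic MDP given by its two functions r(s, a) and δ(s, a); the class name and the tiny two-state example are hypothetical and only illustrate the interface described above.

class DeterministicMDP:
    """Minimal deterministic MDP: states, actions, reward r(s, a), transition delta(s, a)."""
    def __init__(self, states, actions, reward, delta):
        self.states = states      # set S of distinct states
        self.actions = actions    # set A of actions
        self.reward = reward      # r(s, a) -> numerical reward
        self.delta = delta        # delta(s, a) -> successor state

    def step(self, s, a):
        # environment response: reward r_t and successor state s_{t+1}
        return self.reward(s, a), self.delta(s, a)

# hypothetical two-state example: moving 'right' from 'A' reaches the goal 'B' with reward 100
mdp = DeterministicMDP(
    states={"A", "B"},
    actions={"right", "stay"},
    reward=lambda s, a: 100 if (s == "A" and a == "right") else 0,
    delta=lambda s, a: "B" if (s == "A" and a == "right") else s,
)
print(mdp.step("A", "right"))   # -> (100, 'B')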

The Learning Task

The task is to learn a policy π : S → A.
One approach to specifying which policy π the agent should learn is to require the policy that produces the greatest possible cumulative reward over time (discounted cumulative reward):

V^π(s_t) ≡ r_t + γ r_{t+1} + γ^2 r_{t+2} + ... = Σ_{i=0}^∞ γ^i r_{t+i}

where
- V^π(s_t) is the cumulative value achieved by following an arbitrary policy π from an arbitrary initial state s_t
- r_{t+i} is generated by repeatedly using the policy π
- γ (0 ≤ γ < 1) is a constant that determines the relative value of delayed versus immediate rewards
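As a quick executable check of this definition, the following sketch (assuming a hypothetical finite reward sequence and γ = 0.9) sums the discounted rewards:

def discounted_return(rewards, gamma=0.9):
    """Compute r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ... for a finite reward sequence."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

# hypothetical trajectory: reward 100 on the third step, 0 otherwise
print(discounted_return([0, 0, 100]))   # 0 + 0.9*0 + 0.81*100 = 81.0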

The Learning Task

[Figure: agent-environment interaction loop - the agent observes the state, receives a reward, and sends an action to the environment, producing the sequence s_0 a_0 r_0, s_1 a_1 r_1, s_2 a_2 r_2, ...]

Goal: learn to choose actions that maximize r_0 + γ r_1 + γ^2 r_2 + ..., where 0 ≤ γ < 1.

Hence, the agent's learning task can be formulated as

π* ≡ argmax_π V^π(s), (∀s)

Illustrative Example

[Figure: left diagram - a simple grid-world environment with the immediate rewards r(s, a) annotated on the transitions (100 for entering the goal state G, 0 otherwise); right diagram - the corresponding values V*(s) for each state: 100, 90, 81, ...]

- the left diagram depicts a simple grid-world environment, γ = 0.9
- squares: states (locations)
- arrows: possible transitions (with annotated r(s, a))
- G: goal state (absorbing state)
- once the states, actions and rewards are defined and γ is chosen, the optimal policy π* with its value function V*(s) can be determined
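To make the example reproducible, the following sketch encodes a grid world of this kind in Python; the 3x2 layout, the state names and the explicit transition table are assumptions chosen to match the description (reward 100 for entering G, 0 otherwise, G absorbing).

GAMMA = 0.9

STATES = ["TL", "TC", "G", "BL", "BC", "BR"]   # top/bottom row, left/center/right; G is top-right

# hypothetical explicit transition table delta(s, a); invalid moves are simply omitted
DELTA = {
    ("TL", "right"): "TC", ("TC", "right"): "G",  ("TC", "left"): "TL",
    ("TL", "down"): "BL",  ("TC", "down"): "BC",
    ("BL", "right"): "BC", ("BC", "right"): "BR", ("BC", "left"): "BL",
    ("BR", "left"): "BC",  ("BL", "up"): "TL",    ("BC", "up"): "TC",
    ("BR", "up"): "G",
    # G is absorbing: no outgoing transitions
}

def reward(s, a):
    # 100 for entering the goal state, 0 otherwise
    return 100 if DELTA.get((s, a)) == "G" else 0

def actions(s):
    # actions available in state s
    return [a for (s2, a) in DELTA if s2 == s]

print(actions("BC"), reward("TC", "right"))    # ['right', 'left', 'up'] 100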

Illustrative Example

The right diagram shows the values of V* for each state.
- e.g. consider the bottom-right state: V* = 100, because π* selects the "move up" action that receives an immediate reward of 100; thereafter, the agent stays in G and receives no further rewards: V* = 100 + γ·0 + γ^2·0 + ... = 100
- e.g. consider the bottom-center state: V* = 90, because π* selects the "move right" and then "move up" actions: V* = 0 + γ·100 + γ^2·0 + ... = 90
- recall that V* is defined to be the sum of discounted future rewards over the infinite future
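Both values can be checked with two lines of arithmetic, under the same assumptions (γ = 0.9):

GAMMA = 0.9
# bottom-right state: reward 100 immediately, then only zeros inside the absorbing goal state
print(100 + GAMMA * 0)                  # 100
# bottom-center state: reward 0, then 100 one step later, then zeros
print(0 + GAMMA * 100 + GAMMA**2 * 0)   # 90.0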

Q Learning

- it is easier to learn a numerical evaluation function and then implement the optimal policy in terms of this evaluation function
- question: What evaluation function should the agent attempt to learn?
- one obvious choice is V*: the agent should prefer s_1 to s_2 whenever V*(s_1) > V*(s_2)
- problem: the agent has to choose among actions, not among states

π*(s) = argmax_a [r(s, a) + γ V*(δ(s, a))]

- the optimal action in state s is the action a that maximizes the sum of the immediate reward r(s, a) plus the value V* of the immediate successor state, discounted by γ

Q Learning

- thus, the agent can acquire the optimal policy by learning V*, provided it has perfect knowledge of the immediate reward function r and the state transition function δ
- in many problems, it is impossible to predict in advance the exact outcome of applying an arbitrary action to an arbitrary state
- the Q function provides a solution to this problem: Q(s, a) indicates the maximum discounted cumulative reward that can be achieved starting from state s and applying action a first

Q(s, a) ≡ r(s, a) + γ V*(δ(s, a))

π*(s) = argmax_a Q(s, a)
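A minimal sketch of the second equation: given any table of Q values (the dictionary layout and the sample values are assumptions), the greedy policy simply picks the action with the largest Q(s, a).

def greedy_action(Q, s, actions):
    """pi*(s) = argmax_a Q(s, a) for a tabular Q given as a dict {(state, action): value}."""
    return max(actions, key=lambda a: Q.get((s, a), 0.0))

# hypothetical Q values for the bottom-center state of the grid world
Q = {("BC", "left"): 72.0, ("BC", "right"): 90.0, ("BC", "up"): 81.0}
print(greedy_action(Q, "BC", ["left", "right", "up"]))   # -> 'right'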

Q Learning

- hence, learning the Q function corresponds to learning the optimal policy π*
- if the agent learns Q instead of V*, it will be able to select optimal actions even when it has no knowledge of r and δ
- it only needs to consider each available action a in its current state s and choose the action that maximizes Q(s, a)
- the value Q(s, a) for the current state and action summarizes in a single number all the information needed to determine the discounted cumulative reward that will be gained in the future if a is selected in s

Q Learning

[Figure: left diagram - the grid world with the immediate rewards r(s, a); right diagram - the corresponding Q values for every state-action transition (100, 90, 81, 72, ...)]

- the right diagram shows the corresponding Q values
- the Q value for each state-action transition equals the r value for this transition plus the V* value of the resulting state, discounted by γ

Q Learning Algorithm

- key idea: iterative approximation
- relationship between Q and V*:

V*(s) = max_{a'} Q(s, a')

Q(s, a) = r(s, a) + γ max_{a'} Q(δ(s, a), a')

- this recursive definition is the basis for algorithms that use iterative approximation
- the learner's estimate Q̂(s, a) is represented by a large table with a separate entry for each state-action pair

Q Learning Algorithm

For each s, a initialize the table entry Q̂(s, a) to zero.
Observe the current state s.
Do forever:
- Select an action a and execute it
- Receive the immediate reward r
- Observe the new state s'
- Update the table entry for Q̂(s, a) as follows:
  Q̂(s, a) ← r + γ max_{a'} Q̂(s', a')
- s ← s'

Using this algorithm, the agent's estimate Q̂ converges to the actual Q, provided the system can be modeled as a deterministic Markov decision process, r is bounded, and actions are chosen so that every state-action pair is visited infinitely often.
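The following sketch turns this pseudocode into runnable Python for the deterministic grid world assumed earlier; random action selection and restarting an episode at G are additional assumptions, made only so that every state-action pair keeps being visited.

import random

GAMMA = 0.9
# deterministic grid-world transitions delta(s, a); G is absorbing (assumed layout, as above)
DELTA = {
    ("TL", "right"): "TC", ("TC", "right"): "G",  ("TC", "left"): "TL",
    ("TL", "down"): "BL",  ("TC", "down"): "BC",  ("BL", "up"): "TL",
    ("BL", "right"): "BC", ("BC", "right"): "BR", ("BC", "left"): "BL",
    ("BC", "up"): "TC",    ("BR", "left"): "BC",  ("BR", "up"): "G",
}
def reward(s, a):                       # 100 for entering G, 0 otherwise
    return 100 if DELTA[(s, a)] == "G" else 0

# initialize every table entry Q̂(s, a) to zero
Q = {sa: 0.0 for sa in DELTA}

s = "BL"                                # observe the current state
for _ in range(10000):                  # "do forever", truncated for the sketch
    a = random.choice([a for (s2, a) in DELTA if s2 == s])   # select an action (here: at random)
    r, s_next = reward(s, a), DELTA[(s, a)]                  # execute it, receive r, observe s'
    # update: Q̂(s, a) <- r + gamma * max_a' Q̂(s', a')
    next_qs = [Q[(s_next, a2)] for (s2, a2) in DELTA if s2 == s_next]
    Q[(s, a)] = r + GAMMA * max(next_qs, default=0.0)
    s = s_next if s_next != "G" else "BL"                    # restart an episode at the goal

print(round(Q[("BC", "right")]))        # -> 90 once every pair has been visited often enough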

Illustrative Example

[Figure: the robot R in the initial state s_1 and, after executing the action a_right, in the next state s_2; the arrows are annotated with the current Q̂ estimates (72, 63, 81, 100, ...)]

Q̂(s_1, a_right) ← r + γ max_{a'} Q̂(s_2, a') = 0 + 0.9 · max{63, 81, 100} = 90

Each time the agent moves, Q Learning propagates its Q̂ estimates backwards from the new state to the old one.
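The single update can be checked directly (γ = 0.9, with the three Q̂ estimates of s_2 read off the figure):

GAMMA = 0.9
q_s2 = [63.0, 81.0, 100.0]                 # current estimates Q̂(s_2, a') for the actions available in s_2
q_s1_right = 0 + GAMMA * max(q_s2)         # r = 0 for this move
print(q_s1_right)                          # 90.0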

Experimentation Strategies

- the algorithm does not specify how actions are chosen by the agent
- obvious strategy: select the action a that maximizes Q̂(s, a)
  - risk of overcommitting to actions with high Q̂ values during early training
  - exploration of yet unknown actions is neglected
- alternative: probabilistic selection

P(a_i | s) = k^{Q̂(s, a_i)} / Σ_j k^{Q̂(s, a_j)}

- k indicates how strongly the selection favors actions with high Q̂ values
  - k large: exploitation strategy
  - k small: exploration strategy
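A small sketch of this probabilistic selection rule; the Q̂ values (rescaled so that k^Q̂ stays small) and the choice of k are hypothetical.

import random

def probabilistic_action(Q, s, actions, k=2.0):
    """Select an action with probability proportional to k ** Q̂(s, a)."""
    weights = [k ** Q.get((s, a), 0.0) for a in actions]
    return random.choices(actions, weights=weights)[0]

Q = {("BC", "left"): 0.72, ("BC", "right"): 0.90, ("BC", "up"): 0.81}   # hypothetical rescaled Q̂ values
print(probabilistic_action(Q, "BC", ["left", "right", "up"], k=10.0))   # 'right' is the most likely choice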

Generalizing From Examples

- so far, the target function has been represented as an explicit lookup table
- the algorithm performs a kind of rote learning and makes no attempt to estimate the Q value of yet unseen state-action pairs
- this is an unrealistic assumption in large or infinite state spaces, or when execution costs are very high
- remedy: incorporation of function approximation algorithms such as BACKPROPAGATION
  - the table is replaced by a neural network, using each Q̂(s, a) update as a training example (s and a are the inputs, the updated Q̂ value is the target output)
  - alternatively, one neural network per action a
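As a minimal sketch of this idea, the table can be replaced by a simple linear approximator over hand-crafted features (rather than a full BACKPROPAGATION network); the feature encoding, learning rate and sample transition are assumptions. Each Q̂ update then becomes one gradient step toward the target r + γ max_{a'} Q̂(s', a'):

GAMMA, ALPHA = 0.9, 0.1

def features(s, a):
    # hypothetical feature vector for a state-action pair; a tiny hand-crafted encoding
    return [1.0, float(s), float(s) * (1.0 if a == "right" else -1.0)]

w = [0.0, 0.0, 0.0]                      # weights of the linear approximator Q̂(s, a) = w · features(s, a)

def q_hat(s, a):
    return sum(wi * xi for wi, xi in zip(w, features(s, a)))

def train_step(s, a, r, s_next, actions):
    # one gradient step toward the Q-learning target r + gamma * max_a' Q̂(s', a')
    target = r + GAMMA * max(q_hat(s_next, a2) for a2 in actions)
    error = target - q_hat(s, a)
    for i, xi in enumerate(features(s, a)):
        w[i] += ALPHA * error * xi

# hypothetical transition: from state 1, action 'right' yields reward 100 and leads to state 2
train_step(1, "right", 100, 2, ["left", "right"])
print(q_hat(1, "right"))                 # the estimate has moved from 0 toward the target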

Relationship to Dynamic Programming

Q Learning is closely related to dynamic programming approaches that solve Markov Decision Processes.

Dynamic programming:
- assumes that δ(s, a) and r(s, a) are known
- focuses on how to compute the optimal policy
- a mental model can be explored (no direct interaction with the environment)
- offline system

Q Learning:
- assumes that δ(s, a) and r(s, a) are not known
- direct interaction with the environment is inevitable
- online system

Relationship to Dynamic Programming

The relationship becomes apparent by considering the Bellman equation, which forms the foundation for many dynamic programming approaches to solving Markov Decision Processes:

(∀s ∈ S) V*(s) = E[r(s, π*(s)) + γ V*(δ(s, π*(s)))]
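For the known-model setting, a short value-iteration sketch over the grid world assumed earlier illustrates the dynamic programming side; using the max over actions is one standard variant and not necessarily the exact formulation intended on the slide.

GAMMA = 0.9
DELTA = {   # deterministic grid-world model, as assumed in the earlier sketches
    ("TL", "right"): "TC", ("TC", "right"): "G",  ("TC", "left"): "TL",
    ("TL", "down"): "BL",  ("TC", "down"): "BC",  ("BL", "up"): "TL",
    ("BL", "right"): "BC", ("BC", "right"): "BR", ("BC", "left"): "BL",
    ("BC", "up"): "TC",    ("BR", "left"): "BC",  ("BR", "up"): "G",
}
def reward(s, a):
    return 100 if DELTA[(s, a)] == "G" else 0

V = {s: 0.0 for s in ["TL", "TC", "G", "BL", "BC", "BR"]}
for _ in range(50):                      # repeated sweeps until the values stop changing
    for s in V:
        moves = [a for (s2, a) in DELTA if s2 == s]
        if moves:                        # G is absorbing and keeps V(G) = 0
            V[s] = max(reward(s, a) + GAMMA * V[DELTA[(s, a)]] for a in moves)
print(V)                                 # e.g. 'BC': 90.0 and 'BR': 100.0, matching the example values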

Advanced Topics

- different updating sequences
- proof of convergence
- nondeterministic rewards and actions
- temporal difference learning