Reinforcement Learning

Reinforcement Learning ICS 273A Instructor: Max Welling Source: T. Mitchell, Machine Learning, Chapter 13.

Overview Supervised learning: immediate feedback (labels provided for every input). Unsupervised learning: no feedback (no labels provided). Reinforcement learning: delayed scalar feedback (a number called the reward). RL deals with agents that must sense and act upon their environment. It combines classical AI and machine learning techniques and is the most comprehensive problem setting of the three. Examples: a robot cleaning my room and recharging its battery, robot soccer, how to invest in shares, modeling the economy through rational agents, learning how to fly a helicopter, scheduling planes to their destinations, and so on.

The Big Picture Your action influences the state of the world, which in turn determines the reward you receive.

Complications The outcome of your actions may be uncertain. You may not be able to perfectly sense the state of the world. The reward may be stochastic, and it may be delayed (e.g. finding food at the end of a maze). You may have no clue (model) about how the world responds to your actions, or about how rewards are paid out. The world may change while you try to learn it. How much time do you need to explore uncharted territory before you exploit what you have learned?

The Task To learn an optimal policy that maps states of the world to actions of the agent: if this patch of the room is dirty, I clean it; if my battery is empty, I recharge it. What is it that the agent tries to optimize? Answer: the total future discounted reward V(s_t) = r_t + γ r_{t+1} + γ² r_{t+2} + ... , with discount factor 0 ≤ γ < 1. Note: immediate reward is worth more than future reward. What would happen to a mouse in a maze with γ = 0?
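
As a minimal sketch (not part of the original slides), the discounted return can be computed directly from a reward sequence; the function name and the example rewards below are illustrative:

def discounted_return(rewards, gamma):
    """Total future discounted reward: sum_i gamma**i * r_i."""
    total = 0.0
    for i, r in enumerate(rewards):
        total += (gamma ** i) * r
    return total

# A mouse that finds the cheese after three steps (reward only at the end):
rewards = [0, 0, 0, 1]
print(discounted_return(rewards, gamma=0.9))  # 0.729: delayed reward is worth less
print(discounted_return(rewards, gamma=0.0))  # 0.0: with gamma = 0 only immediate reward counts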

Value Function Let's say we have access to the optimal value function V*(s), which gives the total future discounted reward obtainable from state s. What would be the optimal policy? Answer: we choose the action that maximizes r(s,a) + γ V*(δ(s,a)). Here we assume we know the reward r(s,a) received if we perform action a in state s, and the next state δ(s,a) the world will be in if we perform action a in state s.

Example I Consider some complicated graph in which we would like to find the shortest path from a start node S_i to a goal node G. Traversing an edge costs you "length of edge" dollars. The value function encodes the total remaining distance to the goal node from any node s, e.g. V(s) = 1 / (distance to goal from s). If you know V(s), the problem is trivial: from the current node you simply move to the neighboring node with the highest V(s).
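
A minimal sketch of this idea on a hypothetical toy graph (the node names, edges, and values are made up for illustration): acting greedily with respect to V(s) traces out a shortest path.

graph = {"Si": ["A", "B"], "A": ["G"], "B": ["A", "G"], "G": []}
# V(s) = 1 / (distance to goal), as in the slide; the goal itself is most valuable.
V = {"Si": 0.5, "A": 1.0, "B": 1.0, "G": float("inf")}

def greedy_path(start, goal):
    """Repeatedly move to the neighboring node with the highest value V."""
    path = [start]
    while path[-1] != goal:
        path.append(max(graph[path[-1]], key=V.get))
    return path

print(greedy_path("Si", "G"))  # ['Si', 'A', 'G'], a shortest path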

Example II Find your way to the goal.

Q-Function One approach to RL is then to try to estimate V*(s) using the Bellman equation: V*(s) = max_a [ r(s,a) + γ V*(δ(s,a)) ]. However, this approach requires you to know r(s,a) and δ(s,a), which is unrealistic in many real problems. (What is the reward if a robot exploring Mars decides to take a right turn?) We would first have to learn r and δ. Fortunately, we can circumvent this problem by exploring and experiencing how the world reacts to our actions, and learn a function that directly scores state-action pairs, i.e. tells us what action to take in what state. We call this Q(s,a) = r(s,a) + γ V*(δ(s,a)). Given Q(s,a), it is trivial to execute the optimal policy without knowing r(s,a) and δ(s,a): we have π*(s) = argmax_a Q(s,a) and V*(s) = max_a Q(s,a).

Check that these relations hold for Example II.

Q-Learning The definition of Q still depends on r(s,a) and δ(s,a). However, imagine the robot is exploring its environment, trying new actions as it goes. At every step it receives some reward r and observes the environment change into a new state s' for action a, where s' = s_{t+1}. How can we use these observations (s, a, s', r) to learn a model?

Q-Learning The update rule is Q(s,a) ← r + γ max_{a'} Q(s',a'), where s' = s_{t+1}. This equation continually makes the estimate at state s consistent with the estimate at state s', one step in the future: temporal difference (TD) learning. Note that s' is closer to the goal, and hence more reliable, but still an estimate itself. Updating estimates based on other estimates is called bootstrapping. We do an update after each state-action pair, i.e. we are learning online! We are learning useful things about explored state-action pairs; these are typically the most useful because they are likely to be encountered again. Under suitable conditions, these updates can actually be proved to converge to the real answer.
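
As a minimal sketch, the update from this slide can be run as a tabular Q-learning loop. The environment here is an assumed toy corridor (states 0..4, goal at state 4) with illustrative actions, rewards, and episode count:

import random
from collections import defaultdict

def step(s, a):                        # a is -1 (left) or +1 (right)
    s_next = min(max(s + a, 0), 4)
    r = 1.0 if s_next == 4 else 0.0    # reward only on reaching the goal
    return s_next, r

gamma = 0.9
Q = defaultdict(float)                 # Q[(s, a)], initialized to 0

for episode in range(200):
    s = 0
    while s != 4:
        a = random.choice([-1, +1])    # explore with random actions
        s_next, r = step(s, a)
        # Deterministic-world update: Q(s,a) <- r + gamma * max_a' Q(s', a')
        Q[(s, a)] = r + gamma * max(Q[(s_next, -1)], Q[(s_next, +1)])
        s = s_next

print(Q[(3, +1)])   # 1.0: one step from the goal
print(Q[(0, +1)])   # 0.729 = 0.9**3 once the path has been explored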

Example Q-Learning Q-learning propagates Q-estimates 1-step backwards

Exploration / Exploitation It is very important that the agent does not simply follow the current policy when learning Q (off-policy learning). The reason is that you may get stuck in a suboptimal solution: there may be other solutions out there that you have never seen. Hence it is good to try new things now and then, e.g. choose action a in state s with probability proportional to exp(Q(s,a)/T) (Boltzmann exploration). If T is large there is lots of exploring; if T is small we follow the current policy. One can decrease T over time.
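
A minimal sketch of this temperature-based exploration, assuming the Q table from the sketch above; the function name is illustrative:

import math, random

def boltzmann_action(Q, s, actions, T):
    """Pick an action with probability proportional to exp(Q(s,a)/T)."""
    prefs = [math.exp(Q[(s, a)] / T) for a in actions]
    return random.choices(actions, weights=prefs, k=1)[0]

# Large T: close to uniform (explore). Small T: close to greedy (exploit).
# boltzmann_action(Q, s=0, actions=[-1, +1], T=10.0)
# boltzmann_action(Q, s=0, actions=[-1, +1], T=0.01)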

Improvements One can trade off memory and computation by caching (s, a, s', r) for observed transitions (sketched below). After a while, as Q(s',a') has changed, you can replay the update. One can actively search for state-action pairs for which Q(s,a) is expected to change a lot (prioritized sweeping). One can do updates along the sampled path much further back than just one step (TD(λ) learning).
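
For the first of these improvements, a minimal sketch of caching and replaying transitions; the buffer and function names are illustrative, and the update reuses the tabular rule from the earlier sketch:

import random

replay_buffer = []                     # cached transitions (s, a, s_next, r)

def record(s, a, s_next, r):
    replay_buffer.append((s, a, s_next, r))

def replay_updates(Q, gamma, actions, n=50):
    """Re-apply the Q update to cached transitions, now that Q has changed elsewhere."""
    for s, a, s_next, r in random.sample(replay_buffer, min(n, len(replay_buffer))):
        Q[(s, a)] = r + gamma * max(Q[(s_next, b)] for b in actions)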

Extensions To deal with stochastic environments, we need to maximize the expected future discounted reward: V*(s) = max_a E[ r(s,a) + γ V*(δ(s,a)) ]. Often the state space is too large to deal with all states; in this case we need to learn a function approximation Q(s,a) ≈ f_θ(s,a). Neural networks trained with back-propagation have been quite successful here. For instance, TD-Gammon is a backgammon program that plays at expert level: its state space is very large, it is trained by playing against itself, it uses a neural network to approximate the value function, and it uses TD(λ) for learning.
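
A minimal sketch of function approximation for large state spaces, using a linear approximation Q(s,a) ≈ w·φ(s,a) rather than a full neural network; the feature map, learning rate, and action set are illustrative assumptions:

import numpy as np

def features(s, a):
    """Illustrative feature map phi(s, a); in practice this is problem-specific."""
    return np.array([1.0, s, a, s * a], dtype=float)

w = np.zeros(4)                        # weights of Q(s, a) ~ w . phi(s, a)
alpha, gamma = 0.1, 0.9
actions = [-1, +1]

def q(s, a):
    return w @ features(s, a)

def td_update(s, a, r, s_next):
    """Move Q(s,a;w) toward the one-step target r + gamma * max_a' Q(s',a';w)."""
    global w
    target = r + gamma * max(q(s_next, b) for b in actions)
    w += alpha * (target - q(s, a)) * features(s, a)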

Conclusion Reinforcement learning addresses a very broad and relevant question: how can we learn to survive in our environment? We have looked at Q-learning, which simply learns from experience; no model of the world is needed. We made simplifying assumptions, e.g. that the state of the world only depends on the last state and action. This is the Markov assumption, and the resulting model is called a Markov Decision Process (MDP). We also assumed deterministic dynamics and a deterministic reward function, but the world really is stochastic. There are many extensions to speed up learning, and there have been many successful real-world applications. http://elsy.gdan.pl/index.php?option=com_content&task=view&id=20&itemid=39