Introduction to Reinforcement Learning
Kevin Swingler

Review: Types of Learning

There are three broad types of learning:
- Supervised learning: the learner looks for patterns in the inputs, and a teacher tells it the right or wrong answer.
- Unsupervised learning: the learner looks for patterns in the inputs; there is no right or wrong answer.
- Reinforcement learning: the learner is not told which actions to take, but gets reward or punishment from the environment and learns which action to pick next time.

Biological Learning

Which kind of learning is ecologically useful?
- Supervised learning: how often are animals given a neat stimulus pair (action, result) to learn from? Perhaps in the lab, but rarely otherwise.
- Unsupervised learning: useful for organising what is sensed in the world, but not for choosing actions.

RL in Nature

Many of the actions that humans and animals take do not fall into neat action-reward pairs. Sometimes reward (or punishment) comes some time after an action, or requires a chain of actions. This is where RL is useful.

What is RL?

Reinforcement Learning (RL) is learning to act in order to maximize a future reward. RL is a class of tasks that require trial-and-error learning. We will talk about computer learning, using the term "agent" to refer to the computer / robot / program.

Features of RL:
- Learning from rewards; sometimes rewards are rare or delayed
- Interacting during the task (i.e. sequences of states, actions and rewards)
- An exploitation/exploration trade-off
- The problem of goal-directed learning

Example

Playing poker when you don't know the rules:
- bet, bet, bet: you lose
- bet, fold: you lose
- bet, bet, bet: you win
Feedback is occasional, usually after a sequence of actions. The task is to learn when to bet, and how much, to optimise winnings.

Practical Applications

- Animal learning, e.g. an animal learning to find food and avoid predators
- Robotics, e.g. a robot learning how to dock with its charging station
- Games, e.g. a chess player learning to beat an opponent
- Control systems, e.g. a thermostat keeping a room warm while minimising fuel consumption

State Space

Central to RL is the idea of a state space. An agent occupies a given state at a given time. To keep it simple, we will use a discrete state space: one where the agent moves from state to state in fixed steps, a bit like a chess board. An action moves the agent from one state to another (states might be physical locations, but do not have to be). The state at time t is denoted s_t and the action taken at time t is denoted a_t.
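As a concrete illustration, here is a minimal Python sketch of a discrete state space; the grid size and action names are made-up assumptions, not something specified in the lecture.

# A minimal sketch of a discrete state space: a small grid world.
# States are (row, col) cells; actions move the agent one cell at a time.
GRID_ROWS, GRID_COLS = 4, 4
ACTIONS = ["up", "down", "left", "right"]

def step(state, action):
    """Apply an action to a state and return the new state s_{t+1}."""
    row, col = state
    if action == "up":
        row -= 1
    elif action == "down":
        row += 1
    elif action == "left":
        col -= 1
    elif action == "right":
        col += 1
    # Keep the agent inside the grid.
    row = max(0, min(GRID_ROWS - 1, row))
    col = max(0, min(GRID_COLS - 1, col))
    return (row, col)

s_t = (0, 0)            # state at time t
a_t = "right"           # action taken at time t
s_t1 = step(s_t, a_t)   # s_{t+1} = (0, 1)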

Representing States

States might be physical locations, represented by coordinates. They might instead be defined by the history of steps that took the agent to the current state, e.g. State = Left, Straight, Right. The latter can be less efficient, as many different series of steps can take you to the same location.

State Trees

You can think of all the possible states that result from a series of actions as a tree.

[Tree diagram: a Start node branches into Left, Straight and Right, and each of those branches again into Left, Straight and Right.]

One of the leaves is the state you are in after going Straight, Straight. With an equal number of choices at each step (n) and d steps, there are n^d possible states you could be in; for example, with n = 3 choices and d = 2 steps there are 3^2 = 9 leaves.
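A small sketch of the state-tree idea in Python, using the lecture's Left/Straight/Right choices; the depth is an arbitrary illustration:

from itertools import product

ACTIONS = ["Left", "Straight", "Right"]   # n = 3 choices per step
DEPTH = 2                                 # d = 2 steps

# Each leaf of the tree is one possible action history.
leaves = list(product(ACTIONS, repeat=DEPTH))
print(len(leaves))                          # 9 = 3**2, i.e. n**d
print(("Straight", "Straight") in leaves)   # the leaf reached by going Straight, Straight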

Deterministic or Stochastic State Spaces

An agent in state s_t might perform action a_t and move to state s_{t+1}. A transition model tells the agent the new state, given a current state and an action. Without a transition model, the action must be taken for real, and the new state is a physical state in the environment.

In a deterministic state space there is a function T(s_t, a_t) = s_{t+1} that tells you the new state arrived at. In a stochastic state space there is instead a probability distribution over new states: P(s' | s, a) is the probability of moving to state s' if you perform action a in state s.

Stochastic Reward

In the same way, reward may be deterministic or stochastic. The reward for putting a pound in a slot machine is stochastic; the reward for putting a pound in a chocolate vending machine is (usually!) deterministic.
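The two kinds of transition model might be sketched in Python as below; the states, actions and probabilities are invented for illustration:

import random

# Deterministic transition model: T(s, a) -> s'
T = {
    ("A", "go"): "B",
    ("B", "go"): "C",
}

def deterministic_step(s, a):
    return T[(s, a)]

# Stochastic transition model: P(s' | s, a) as a distribution over next states
P = {
    ("A", "go"): {"B": 0.8, "A": 0.2},   # 80% succeed, 20% stay put
}

def stochastic_step(s, a):
    dist = P[(s, a)]
    states = list(dist.keys())
    probs = list(dist.values())
    return random.choices(states, weights=probs, k=1)[0]

print(deterministic_step("A", "go"))   # always "B"
print(stochastic_step("A", "go"))      # "B" with probability 0.8, "A" with 0.2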

What to Learn?

- Utility-based learning: learn to relate states to utility. The agent looks at all the possible states it might move to and picks the one with the highest utility.
- Q-learning: learn the relationship between actions and utility, so the agent can pick the action with the highest utility.
- Reflex learning: learn to relate a state directly to an action; no utility is used.

Utility Learning

Requires a model of state transitions. For each possible action:
1. Predict the new state that it would take you to
2. Look up the value of that state
3. Choose the best
This is harder if the state transitions are stochastic, as you need to know the probability of each new state as well as its value. "If I do this, I'll be in that state, which will give me the best reward."
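A minimal sketch of this one-step lookahead, assuming an illustrative deterministic transition model T and value table V (both invented here):

def choose_action_utility(state, actions, T, V):
    """One-step lookahead: predict s' = T(s, a) for each action,
    look up V(s'), and pick the best."""
    best_action, best_value = None, float("-inf")
    for a in actions:
        next_state = T[(state, a)]      # 1. predict the new state
        value = V.get(next_state, 0.0)  # 2. look up its value
        if value > best_value:          # 3. keep the best
            best_action, best_value = a, value
    return best_action

# Illustrative model and values:
T = {("s0", "left"): "s1", ("s0", "right"): "s2"}
V = {"s1": 0.3, "s2": 0.9}
print(choose_action_utility("s0", ["left", "right"], T, V))   # "right"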

Q-Learning

Learn the utility of each action in a given state directly. There is no need for a model of the state transitions, just a model of the utilities of actions. "If I do this, I don't know what state I'll be in, but my reward will be the best."

Reflex Learning

Learn actions directly from states. No model of the state transitions is needed, and no idea of utility is needed: just look up your current state, see what the best action is, and do it. "I don't know what this action will lead to, or what reward it will bring, but I know it is the best thing to do."
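The contrast can be sketched as follows: a Q-table lets the agent pick the highest-utility action without a transition model, while a reflex policy is a plain state-to-action lookup. The tables below are invented for illustration:

# Q-learning style: pick the action with the highest learned Q(s, a).
Q = {
    ("s0", "left"): 0.2,
    ("s0", "right"): 0.7,
}

def choose_action_q(state, actions, Q):
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

# Reflex style: a direct state -> action lookup, no utilities at all.
reflex_policy = {"s0": "right", "s1": "left"}

print(choose_action_q("s0", ["left", "right"], Q))   # "right"
print(reflex_policy["s0"])                           # "right"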

RL Policy

However the agent learns, the rules that determine its actions are known as a policy:

π_t(s, a) = P(a_t = a | s_t = s)

Given that the state at time t is s, the policy gives the probability that the agent's action will be a. Reinforcement learning means learning the policy.

Reward and Return

The reward function indicates how good things are at this time instant, but the agent wants to maximise reward in the long run, i.e. over many time steps. We refer to the reward of a whole policy as its return:

R_t = r_{t+1} + r_{t+2} + ... + r_T

where T is the last time step of the world. So the return is just the sum of all the rewards.

Discounted Return

The geometrically discounted model of return is:

R_t = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + ... 

where 0 ≤ γ ≤ 1 is the discount rate (gamma). It is used to:
- bound the infinite sum
- give more weight to earlier rewards (e.g. to give preference to shorter paths)

Value

The value of a state is its expected return under the policy. During learning, an agent tries to bring its own estimate of each state's value as close as possible to the true return that would be gained from that state:

V(s_t) ← V(s_t) + α[R_t − V(s_t)]

where α is the learning rate, 0 < α < 1.
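A small sketch of these formulas in code; the reward sequence is made up, and setting gamma = 1 recovers the plain (undiscounted) sum of rewards:

def discounted_return(rewards, gamma=1.0):
    """Return R_t = r_{t+1} + gamma*r_{t+2} + gamma^2*r_{t+3} + ..."""
    total = 0.0
    for k, r in enumerate(rewards):
        total += (gamma ** k) * r
    return total

rewards = [0, 0, 0, 1]                        # reward only at the end of the episode
print(discounted_return(rewards, gamma=1.0))  # 1.0  (plain sum of rewards)
print(discounted_return(rewards, gamma=0.9))  # 0.729; the delayed reward counts for less

# Value update towards the observed return:
# V(s_t) <- V(s_t) + alpha * (R_t - V(s_t))
V = {"s0": 0.0}
alpha, R_t = 0.1, discounted_return(rewards, gamma=0.9)
V["s0"] += alpha * (R_t - V["s0"])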

Exploration / Exploitation

Starting from zero, every action the agent takes is exploration. After some time, it knows that one action is quite good. Should it just stick to this action (exploitation), or look for a better one (exploration)? The optimal strategy is hard to find, but starting with a lot of exploration and moving towards exploitation over time is sensible.

RL Framework: How Does It Work?

1. The agent in state s_t chooses action a_t
2. The world changes to state s_{t+1}
3. The agent perceives the new situation s_{t+1} and gets reward r_{t+1}
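One common way to manage the trade-off is an epsilon-greedy rule (a standard choice, not one prescribed by the lecture): explore with probability epsilon, exploit otherwise. The sketch below wraps it in the agent-environment loop just described; env is an assumed object with reset() and step() methods:

import random

def epsilon_greedy(state, actions, Q, epsilon):
    """With probability epsilon explore (random action); otherwise exploit."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

def run_episode(env, actions, Q, epsilon):
    # epsilon would typically start high (lots of exploration)
    # and be decayed over time towards exploitation.
    s_t = env.reset()
    done = False
    while not done:
        a_t = epsilon_greedy(s_t, actions, Q, epsilon)   # 1. agent chooses a_t
        s_t1, r_t1, done = env.step(a_t)                 # 2. world changes to s_{t+1}
        s_t = s_t1                                       # 3. agent perceives s_{t+1}, reward r_{t+1}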

The Learning Process

In the current state:
1. Collect the reward for this state
2. Update the value of the state using the learning rule V(s_t) ← V(s_t) + α[r_t − V(s_t)]
3. Keep going until V(s_t) = R_t
A problem occurs when rewards only come at the end of a set of actions, with the steps preceding the goal having zero reward.

Temporal Difference Learning

The solution is to assume that the value of any state is similar to the value of its neighbouring states, as they can easily be visited from the current state. So, to update the current state's value, look at the reward you get and the value of the next state, and update the current state thus:

V(s_t) ← V(s_t) + α[r_t + γV(s_{t+1}) − V(s_t)]

where α is the learning rate and γ is the discount term.
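A minimal sketch of the TD update on a toy chain of states where reward only arrives at the end; the chain itself is an invented example:

# Toy chain: s0 -> s1 -> s2 -> goal, with reward 1 only at the goal.
states = ["s0", "s1", "s2", "goal"]
rewards = {"s0": 0.0, "s1": 0.0, "s2": 0.0, "goal": 1.0}

V = {s: 0.0 for s in states}
alpha, gamma = 0.5, 0.9

for episode in range(50):
    for i in range(len(states) - 1):
        s_t, s_t1 = states[i], states[i + 1]
        r_t = rewards[s_t]
        # TD update: V(s_t) <- V(s_t) + alpha * (r_t + gamma * V(s_{t+1}) - V(s_t))
        V[s_t] += alpha * (r_t + gamma * V[s_t1] - V[s_t])
    # The terminal state's value is just its own reward.
    V["goal"] += alpha * (rewards["goal"] - V["goal"])

print(V)   # value propagates back from the goal: V(s2) > V(s1) > V(s0)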

Long Searches

What if the next state's reward is zero too? And the next, and the next? Well, in the end a reward (perhaps at the goal state) will be found. The state that led to it will then have value, giving two states to look for. Repeat enough and you will have a value for every possible state, or the world will end, whichever happens first.

Representing the Values

Simple RL represents the values in a huge table. In any task of real use, this table would be far too big and would never be completely filled: many states would never be visited, and when they finally were, there would be no action for them in the policy table.

Neural Networks and RL

Instead of using a giant table, the values and states can be learned using a neural network. Now the agent learns a function from state to value (or action), which has two advantages:
- It is smaller and faster than a table
- It can generalise to states it hasn't seen before
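A hedged sketch of replacing the value table with a small neural network, assuming PyTorch is available; the 2-number state encoding is made up, and the training target is the TD target from the previous slides:

import torch
import torch.nn as nn

# Small network mapping a state (encoded here as 2 numbers) to a value estimate.
value_net = nn.Sequential(
    nn.Linear(2, 16),
    nn.ReLU(),
    nn.Linear(16, 1),
)
optimizer = torch.optim.SGD(value_net.parameters(), lr=0.01)
gamma = 0.9

def td_update(state, reward, next_state):
    """One TD-style training step: move V(s_t) towards r_t + gamma * V(s_{t+1})."""
    s = torch.tensor(state, dtype=torch.float32)
    s_next = torch.tensor(next_state, dtype=torch.float32)
    with torch.no_grad():
        target = reward + gamma * value_net(s_next)
    loss = nn.functional.mse_loss(value_net(s), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Example transition; the network can also generalise to nearby, unseen states.
td_update(state=[0.0, 0.0], reward=0.0, next_state=[0.0, 1.0])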