Reinforcement Learning

Reinforcement Learning
Andreas Wichert, DEIC (course page: Fenix)

Reinforcement Learning
- No specific learning methods
- Actions within, and responses from, the environment
- Any learning method that addresses this interaction is reinforcement learning
- Is reinforcement learning the same as supervised learning? No
- There is no designated teacher to give positive and negative examples

- By making random moves, the agent builds a predictive model of its environment
- Without some feedback about what is good and what is bad, the agent has no grounds for deciding which move to make
- It needs to know when something good has happened, or that something bad has happened
- Feedback: reward or reinforcement
- In chess, the reinforcement is only received at the end of the game
- The agent must recognize the reward
- Animals recognize pain and hunger as negative rewards, and pleasure and food as positive rewards

- Examine how an agent can learn from success and failure, from reward and punishment
- RL is learning from interaction

Three agent designs
- Utility-based agent
- Q-learning agent
- Reflex agent

Utility-based agents
- A utility function maps a state (or a sequence of states) onto a real number, which describes the agent's happiness

Utility-based agents
- A utility-based agent must have a model of the environment in order to make decisions
- To use a backgammon evaluation function, a program must know what the legal moves are and how they affect board positions

Q-learning
- A Q-learning agent, on the other hand, can compare the values of its available choices without needing to know their outcomes
- It does not need a model of its environment
- Q-learning agents cannot look ahead

Reflex agent
- Learns a policy that maps directly from states to actions
- Policy: a solution that specifies what an agent should do for any state that the agent might reach
- In state s the agent should do π(s)

Passive reinforcement learning

- In passive learning, the agent's policy (actions) is fixed and the task is to learn the utilities (happiness) of states (or of state-action pairs)

Passive Reinforcement Learning
- Passive learner: a passive learner simply watches the world going by and tries to learn the utility of being in various states
- Another way to think of a passive learner is as an agent with a fixed policy trying to determine its benefits
- An action is recommended by the policy for a certain state

- The policy π is fixed
- In state s it always executes the action π(s)
- Goal: to learn how good the policy is, i.e. to learn the utility function U^π(s)

Passive learning vs. active learning
- Passive learning: the agent simply watches the world going by and tries to learn the utilities of being in various states
- Active learning: the agent does not simply watch, but also acts

- In passive learning, the environment generates state transitions and the agent perceives them
- Consider an agent trying to learn the utilities of the states of the 4x3 grid world shown on the slide
- The agent can move {North, East, South, West}
- An episode terminates on reaching [4,2] or [4,3]

Passive Learning in a Known Environment
The agent is provided with:
- M_ij = a model giving the probability of reaching state j from state i
- Each state transitions to a neighbouring state with equal probability among all neighbouring states

Passive learning scenario
- The agent sees sequences of state transitions and the associated rewards
- The environment generates the state transitions and the agent perceives them, e.g.
  (1,1) → (1,2) → (1,3) → (2,3) → (3,3) → (4,3)[+1]
  (1,1) → (1,2) → (1,3) → (1,2) → (1,3) → (1,2) → (1,1) → (2,1) → (3,1) → (4,1) → (4,2)[-1]
- Key idea: update the utility values using the given training sequences

Passive Learning in a Known Environment
Use this information about rewards to learn the expected utility U(i) associated with each nonterminal state i.
Utilities can be learned using three approaches:
1) LMS (least mean squares)
2) ADP (adaptive dynamic programming)
3) TD (temporal difference learning)

Passive Learning in a Known Environment: LMS (Least Mean Squares)
The agent makes random runs (sequences of random moves) through the environment:
[1,1] -> [1,2] -> [1,3] -> [2,3] -> [3,3] -> [4,3] = +1
[1,1] -> [2,1] -> [3,1] -> [3,2] -> [4,2] = -1

Passive Learning in a Known Environment: LMS
- Collect statistics on the final payoff for each state (e.g. when in [2,3], how often was +1 reached vs. -1?)
- The learner computes the average for each state
- Provably converges to the true expected values (utilities)
- An instance of supervised learning: the state is the input and the observed reward is the output
- It misses an important source of information: the utilities of states are not independent!
- The utility of each state equals its own reward plus the expected utility of its successor
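A minimal sketch of the LMS (direct) utility estimator described above, assuming each training sequence is a list of states ending at a terminal state and that the run's final payoff is credited to every visited state; the function names and data structures are illustrative, not from the original slides.

```python
from collections import defaultdict

returns_sum = defaultdict(float)   # sum of observed final payoffs per state
returns_count = defaultdict(int)   # number of times each state was visited

def lms_update(sequence, final_payoff):
    """Credit the final payoff of this run to every state visited on it."""
    for state in sequence:
        returns_sum[state] += final_payoff
        returns_count[state] += 1

def lms_utility(state):
    """Running average of observed payoffs: the LMS estimate of U(state)."""
    if returns_count[state] == 0:
        return 0.0
    return returns_sum[state] / returns_count[state]

# The two example runs from the slides:
lms_update([(1,1), (1,2), (1,3), (2,3), (3,3), (4,3)], +1)
lms_update([(1,1), (2,1), (3,1), (3,2), (4,2)], -1)
print(lms_utility((1,1)))   # 0.0 after these two runs
```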

Passive Learning in a Known Environment: LMS
Main drawback:
- slow convergence
- it takes the agent well over 1000 training sequences to get close to the correct values

Passive Learning in a Known Environment: ADP (Adaptive Dynamic Programming)
Uses the value iteration or policy iteration algorithm to calculate exact utilities of states given an estimated model.

Passive Learning in a Known Environment: ADP
In general:
- U(i) = R(i) + Σ_j M_ij U(j)
- R(i) is the reward of being in state i (often non-zero for only a few end states)
- M_ij is the probability of a transition from state i to state j

Passive Learning in a Known Environment: ADP
Consider U(3,3):
U(3,3) = 1/3 × U(4,3) + 1/3 × U(2,3) + 1/3 × U(3,2)
       = 1/3 × 1.0 + 1/3 × 0.0886 + 1/3 × (-0.4430)
       ≈ 0.2152
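A sketch of the ADP step under these equations: given an estimated model M and rewards R, solve the fixed-point constraints U(i) = R(i) + Σ_j M_ij U(j) by simple iteration. The state representation, stopping criterion, and function names are assumptions made for illustration.

```python
def adp_utilities(states, R, M, terminals, iterations=1000):
    """Iteratively solve U(i) = R(i) + sum_j M[i][j] * U(j).

    M[i] is a dict mapping successor states j to transition probabilities.
    Terminal states keep their own reward as their utility.
    """
    U = {s: 0.0 for s in states}
    for _ in range(iterations):
        new_U = {}
        for s in states:
            if s in terminals:
                new_U[s] = R[s]
            else:
                new_U[s] = R[s] + sum(p * U[t] for t, p in M[s].items())
        U = new_U
    return U
```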

Passive Learning in a Known Environment: ADP
- Makes optimal use of the local constraints on the utilities of states imposed by the neighbourhood structure of the environment
- Somewhat intractable for large state spaces

Passive Learning in a Known Environment: TD (Temporal Difference Learning)
The key is to use the observed transitions to adjust the values of the observed states so that they agree with the constraint equations.

Passive Learning in a Known Environment: TD Learning
Suppose we observe a transition from state i to state j, with U(i) = -0.5 and U(j) = +0.5.
This suggests that we should increase U(i) to make it agree better with its successor.
This can be achieved using the following updating rule, where R(i) is the reward for being in state i and α is the learning rate:
U(i) ← U(i) + α ( R(i) + U(j) - U(i) )

Passive Learning in an Unknown Environment
- The least mean squares (LMS) approach and the temporal-difference (TD) approach operate unchanged in an initially unknown environment
- The adaptive dynamic programming (ADP) approach adds a step that updates an estimated model of the environment (the transition probabilities)
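A minimal sketch of the TD updating rule above, applied after observing a single transition i → j; the learning rate alpha and the dictionary representation are illustrative assumptions.

```python
def td_update(U, R, i, j, alpha=0.1):
    """Move U[i] a little toward R[i] + U[j] (undiscounted, as on the slide)."""
    U[i] = U[i] + alpha * (R[i] + U[j] - U[i])
    return U

# The slide's example: U(i) = -0.5, U(j) = +0.5, assuming R(i) = 0
U = {'i': -0.5, 'j': +0.5}
R = {'i': 0.0, 'j': 0.0}
td_update(U, R, 'i', 'j')   # U['i'] moves up, toward its successor's value
```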

Passive Learning in an Unknown Environment: ADP Approach
- The environment model is learned by direct observation of transitions
- The environment model M can be updated by keeping track of the percentage of times each state transitions to each of its neighbours

Passive Learning in an Unknown Environment: ADP & TD Approaches
- The ADP approach and the TD approach are closely related
- Both try to make local adjustments to the utility estimates in order to make each state agree with its successors
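A sketch of the model-learning step described above: estimate M by counting how often each observed state transitions to each of its neighbours. The data structures and function names are illustrative assumptions.

```python
from collections import defaultdict

transition_counts = defaultdict(lambda: defaultdict(int))

def observe_transition(i, j):
    """Record one observed transition from state i to state j."""
    transition_counts[i][j] += 1

def estimated_M(i, j):
    """Fraction of transitions out of i that went to j (the estimate of M_ij)."""
    total = sum(transition_counts[i].values())
    return transition_counts[i][j] / total if total else 0.0
```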

Passive Learning in an Unknown Environment
Minor differences:
- TD adjusts a state to agree with its observed successor
- ADP adjusts a state to agree with all of its successors

Passive Learning in an Unknown Environment
Important differences:
- TD makes a single adjustment per observed transition
- ADP makes as many adjustments as it needs to restore consistency between the utility estimates U and the environment model M

Active Learning in an Unknown Environment
An active agent must consider:
- what actions to take
- what their outcomes may be
- how they will affect the rewards received

Active Learning in an Unknown Environment
Minor changes to the passive learning agent:
- the environment model now incorporates the probabilities of transitions to other states given a particular action
- the agent must maximize its expected utility
- the agent needs a performance element to choose an action at each step

- M^a_ij denotes the probability of reaching state j if action a is taken in state i

Active Learning in an Unknown Environment: Active ADP Approach
- need to learn the probability M^a_ij of a transition instead of M_ij
- the input to the function will include the action taken

Active Learning in an Unknown Environment: Active TD Approach
- the model acquisition problem for the TD agent is identical to that for the ADP agent
- the update rule remains unchanged
- the TD algorithm will converge to the same values as ADP as the number of training sequences tends to infinity

Exploration
Learning also involves the exploration of unknown areas.
An agent can benefit from actions in two ways:
- immediate rewards (on the current sequence)
- received percepts (rewards received in future sequences)
There is a trade-off between the agent's immediate good (as reflected in its current utility estimates) and its long-term well-being.

Exploration: Wacky Approach vs. Greedy Approach
- The wacky approach acts randomly, in the hope that it will eventually explore the whole environment: it learns, but never gets better at reaching the positive reward
- The greedy approach acts to maximize its utility using the current estimates: it gets stuck in one solution (and never learns the utilities of other states)
- We need an approach between wackiness and greediness
- An agent should be more wacky when it has little idea of the environment, and more greedy when it has a model that is close to being correct

- This can be implemented using an exploration function that assigns a higher utility estimate to relatively unexplored action-state pairs
- Let U+(i) denote the optimistic utility estimate of state i, and N(a,i) the number of times action a has been tried in state i
- The update becomes:
  U+(i) ← R(i) + max_a f( Σ_j M^a_ij U+(j), N(a,i) )
  where f(u,n) is an exploration function

Exploration: The Exploration Function, a simple example
- u = expected utility (greed)
- n = number of times actions have been tried (wackiness)
- R+ = best possible reward
- Ne = fixed parameter: try each pair at least Ne times
- f(u,n) = R+ if n < Ne, and u otherwise
- Important: U+ rather than U appears on the right-hand side
- As exploration proceeds, the states and actions near the start state may well be tried a large number of times
- Actions that lead toward unexplored regions are weighted more highly, not just actions that are themselves unfamiliar
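A sketch of this simple exploration function: stay optimistic (return the best possible reward R+) until a pair has been tried at least Ne times, then trust the current estimate u. The concrete values of R_PLUS and N_E below are illustrative assumptions.

```python
R_PLUS = 1.0   # best reward obtainable in any state (assumed value)
N_E = 5        # try each action-state pair at least this many times (assumed value)

def exploration_f(u, n):
    """f(u, n): optimistic value for under-explored pairs, greedy value otherwise."""
    return R_PLUS if n < N_E else u
```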

Action-Value Functions and Q-Learning

Learning an Action-Value Function: What Are Q-Values?
An action-value function assigns an expected utility to taking a given action in a given state; these values are called Q-values.

Learning an Action-Value Function: The Q-Values Formula
Q-values are related to utilities by U(i) = max_a Q(a,i).

Learning an Action-Value Function: The Q-Values Formula in Application
Q(a,i) = R(i) + Σ_j M^a_ij max_a' Q(a',j)
- just an adaptation of the active learning equation

Learning an Action-Value Function: The TD Q-Learning Update Equation
Q(a,i) ← Q(a,i) + α ( R(i) + max_a' Q(a',j) - Q(a,i) )
- requires no model
- calculated after each transition from state i to state j

Learning an Action-Value Function: The TD Q-Learning Update Equation in Practice
Program: Neurogammon
- attempted to learn from self-play and an implicit representation
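A minimal sketch of the TD Q-learning update above, performed after each observed transition from state i to state j under action a; the dictionary keyed by (action, state) and the learning rate alpha are illustrative assumptions.

```python
def q_update(Q, R, a, i, j, actions, alpha=0.1):
    """Apply Q(a,i) <- Q(a,i) + alpha * (R(i) + max_a' Q(a',j) - Q(a,i))."""
    best_next = max(Q.get((a_next, j), 0.0) for a_next in actions)
    Q[(a, i)] = Q.get((a, i), 0.0) + alpha * (R[i] + best_next - Q.get((a, i), 0.0))
    return Q
```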

Generalization in Reinforcement Learning: Explicit Representation
- So far we have assumed that all the functions learned by the agents (U, M, R, Q) are represented in tabular form
- An explicit representation involves one output value for each input tuple

Explicit Representation
- Good for small state spaces, but the time to convergence and the time per iteration increase rapidly as the space gets larger
- It may be possible to handle 10,000 states or more
- This suffices for 2-dimensional, maze-like environments

- Problem: more realistic worlds are out of the question
- E.g. chess and backgammon are tiny subsets of the real world, yet their state spaces contain on the order of 10^50 to 10^120 states. It would be absurd to suppose that one must visit all these states in order to learn how to play the game.

Generalization in Reinforcement Learning: Implicit Representation
- Overcomes the problem of the explicit representation
- A form that allows one to calculate the output for any input, but that is much more compact than the tabular form

Weighted linear function
- For example, an estimated utility function for game playing can be represented as a weighted linear function of a set of board features f_1, ..., f_n:
  U(i) = w_1 f_1(i) + w_2 f_2(i) + ... + w_n f_n(i)

Implicit Representation
- The utility function is characterized by the n weights
- A typical chess evaluation function might only have 10 weights, so this is an enormous compression
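A sketch of the weighted linear utility function U(i) = w_1 f_1(i) + ... + w_n f_n(i); the two features and weights below are hypothetical, standing in for whatever board features a real evaluator would use.

```python
def linear_utility(weights, features, state):
    """Dot product of the weights with the feature values of `state`."""
    return sum(w * f(state) for w, f in zip(weights, features))

# Hypothetical example with two features of a game state:
features = [lambda s: s["material"], lambda s: s["mobility"]]
weights = [1.0, 0.1]
print(linear_utility(weights, features, {"material": 3, "mobility": 12}))  # 4.2
```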

Compression
- The enormous compression achieved by an implicit representation allows the learning agent to generalize from states it has visited to states it has not visited
- The most important aspect: it allows for inductive generalization over input states
- Therefore, such methods are said to perform input generalization

Cart pole
- The cart-pole problem: balancing a long pole upright on top of a moving cart

- The cart can be jerked left or right by a controller that observes x, ẋ, θ, and θ̇ (the cart position and velocity, and the pole angle and angular velocity)
- The earliest work on learning for this problem was carried out by Michie and Chambers (1968)
- Their BOXES algorithm was able to balance the pole for over an hour after only about 30 trials

Generalization: Input Generalization
- The algorithm first discretized the 4-dimensional state space into boxes, hence the name
- It then ran trials until the pole fell over or the cart hit the end of the track
- Negative reinforcement was associated with the final action in the final box and then propagated back through the sequence
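A sketch of the "boxes" idea: map the continuous 4-dimensional cart-pole state (x, ẋ, θ, θ̇) to a coarse grid cell index. The bin edges below are illustrative assumptions, not the original BOXES partition.

```python
import bisect

X_BINS     = [-1.0, 0.0, 1.0]   # cart position thresholds (assumed)
XDOT_BINS  = [-0.5, 0.5]        # cart velocity thresholds (assumed)
THETA_BINS = [-0.1, 0.0, 0.1]   # pole angle thresholds, radians (assumed)
TDOT_BINS  = [-0.5, 0.5]        # pole angular velocity thresholds (assumed)

def box_index(x, x_dot, theta, theta_dot):
    """Map a continuous state to a tuple of bin indices, i.e. one 'box'."""
    return (bisect.bisect(X_BINS, x),
            bisect.bisect(XDOT_BINS, x_dot),
            bisect.bisect(THETA_BINS, theta),
            bisect.bisect(TDOT_BINS, theta_dot))
```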

Input Generalization
- The discretization caused some problems when the apparatus was initialized in a different position
- Improvement: use an algorithm that adaptively partitions the state space according to the observed variation in the reward