Reinforcement Learning. Business Analytics Practice, Winter Term 2015/16. Nicolas Pröllochs and Stefan Feuerriegel


Today's Lecture Objectives
1 Grasp an understanding of Markov decision processes
2 Understand the concept of reinforcement learning
3 Apply reinforcement learning in R
4 Distinguish pros/cons of different reinforcement learning algorithms

Outline
1 Reinforcement Learning
2 Markov Decision Process
3 Learning Algorithms
4 Q-Learning in R
5 Wrap-Up


Branches of Machine Learning
Supervised learning: learns from pairs of input and desired outcome (i.e. labels)
Unsupervised learning: tries to find hidden structure in unlabeled data
Reinforcement learning: learns from interacting with the environment; no need for pairs of input and correct outcome; feedback is restricted to a reward signal; mimics human-like learning in actual environments

Example: Backgammon
Reinforcement learning can reach a level similar to the top three human players in backgammon
Learning task: select the best move at arbitrary board states, i.e. the move with the highest probability to win
Training signal: win or loss of the overall game
Training: 300,000 games played by the system against itself
Algorithm: reinforcement learning (plus a neural network)
Tesauro (1995): Temporal Difference Learning and TD-Gammon. In: Communications of the ACM, 38:3, pp. 58-68

Reinforcement Learning
An agent interacts with its environment
The agent takes actions that affect the state of the environment
Feedback is limited to a reward signal that indicates how well the agent is performing
Goal: improve the behavior given only this limited feedback
Examples: defeat the world champions at backgammon or Go; manage an investment portfolio; make a humanoid robot walk

Agent and Environment
[Diagram: the agent sends action a_t to the environment; the environment returns state s_{t+1} and reward r_{t+1}]
At each step t, the agent executes action a_t, receives observation s_t, and receives scalar reward r_t
The environment changes upon action a_t, emits observation s_{t+1}, and emits scalar reward r_{t+1}
The time step t is incremented after each iteration

Agent and Environment Example
1 ENVIRONMENT: You are in state 3 with 4 possible actions
2 AGENT: I'll take action 2
3 ENVIRONMENT: You received a reward of 5 units. You are in state 1 with 2 possible actions.
Formalization: the interaction produces sequences of states ... s_{t-1}, s_t, s_{t+1}, ..., actions ... a_{t-1}, a_t, a_{t+1}, ..., and rewards ... r_{t-1}, r_t, r_{t+1}, ...

Reinforcement Learning Problem: Finding an Optimal Behavior
Learn an optimal behavior π based on past actions
Maximize the expected cumulative reward over time
Challenges: feedback is delayed, not instantaneous; the agent must reason about the long-term consequences of its actions
Illustration: in order to maximize one's future income, one has to study now; however, the immediate monetary reward from this might be negative
How do we learn optimal behavior?

Trial-and-Error Learning
The agent should discover optimal behavior via trial-and-error learning
1 Exploration: try new or non-optimal actions to learn their reward and gain a better understanding of the environment
2 Exploitation: use the current knowledge; this might not be optimal yet, but should deviate only slightly
Examples
1 Restaurant selection. Exploitation: go to your favorite restaurant. Exploration: try a new restaurant.
2 Game playing. Exploitation: play the move you believe is best. Exploration: play an experimental move.

ε-greedy Action Selection
Idea: provide a simple heuristic to choose between exploitation and exploration
Implemented via a random number and a parameter 0 ≤ ε ≤ 1
With probability ε, try a random action a_t
With probability 1 − ε, choose the current best (greedy) action a_t
A typical choice is e.g. ε = 0.1
Other variants decrease this value over time, i.e. the agent gains confidence and thus needs less exploration
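As a minimal sketch (an assumption, not part of the slides), ε-greedy selection takes only a few lines of R; the function name epsilon_greedy and the vector Q_s of estimated action values for the current state are hypothetical:

# Hedged sketch of epsilon-greedy action selection; `Q_s` holds the estimated
# value of each action in the current state (hypothetical, not from the slides)
epsilon_greedy <- function(Q_s, epsilon = 0.1) {
  if (runif(1) <= epsilon) {
    sample(seq_along(Q_s), 1)   # exploration: pick a random action index
  } else {
    which.max(Q_s)              # exploitation: pick the current best action
  }
}

epsilon_greedy(c(2, 5, 1))      # usually returns 2, occasionally a random index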


Markov Decision Process
A Markov decision process (MDP) specifies a setup for reinforcement learning
MDPs allow one to model decision making in situations where outcomes are partly random and partly under the control of a decision maker
Definition
1 A Markov decision process is a 4-tuple (S, A, R, T) with a set of possible world states S, a set of possible actions A, a real-valued reward function R, and transition probabilities T
2 An MDP must fulfill the so-called Markov property: the effects of an action taken in a state depend only on that state and not on the prior history

Markov Decision Process
State: a state s_t is a representation of the environment at time step t; it can be directly observable to the agent or hidden
Actions: at each state, the agent is able to perform an action a_t that affects the subsequent state of the environment s_{t+1}; actions can be any decisions which one wants to learn
Transition probabilities: given a current state s, a possible subsequent state s' and an action a, the transition probability T^a_{ss'} from s to s' is defined by
T^a_{ss'} = P[ s_{t+1} = s' | s_t = s, a_t = a ]

Rewards
A reward r_{t+1} is a scalar feedback signal emitted by the environment
It indicates how well the agent is performing when reaching step t + 1
The expected reward R^a_{ss'} when moving from state s to s' via action a is given by
R^a_{ss'} = E[ r_{t+1} | s_t = s, a_t = a, s_{t+1} = s' ]
Examples
1 Playing backgammon or Go: zero reward after each move; a positive/negative reward for winning/losing a game
2 Managing an investment portfolio: a positive reward for each dollar left in the bank
Goal: maximize the expected cumulative reward over time

Markov Decision Process Example: moving a pawn to a destination on a grid
[Grid diagram with states s_0, ..., s_7; the transition into s_7 is rewarded with +10 and the transition into s_4 with -10; the available actions A(s) depend on the current state s]
States: S = {s_0, s_1, ..., s_7}
Actions: A = {up, down, left, right}
Transition probabilities: e.g. T^up_{s_0,s_3} = 0.9, T^right_{s_0,s_1} = 0.1, ...
Rewards: R^right_{s_6,s_7} = +10, R^up_{s_2,s_4} = -10, otherwise R = 0
Start in s_0; the game is over when reaching s_7

Policy
Learning task of an agent: execute actions in the environment, observe the results (i.e. rewards), and learn a policy π : S → A that works as a selection function for choosing an action given a state
A policy fully defines the behavior of an agent, i.e. its actions
MDP policies depend only on the current state and not on its history
Policies are stationary (i.e. time-independent)
Objective: maximize the expected cumulative reward over time
The expected cumulative reward from an initial state s with policy π is
J^π(s) = E_π[ Σ_t r_t | s_0 = s ], where r_t corresponds to R^{a_t}_{s_t,s_{t+1}} along the trajectory
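To make the objective concrete, here is a tiny sketch (not from the slides) that sums the rewards of one observed episode; the helper name cumulative_reward and the optional discount argument gamma, which the following slides introduce, are assumptions:

# Sketch: cumulative reward of one observed reward sequence, optionally
# discounted by gamma (gamma = 1 reproduces the plain sum used above)
cumulative_reward <- function(rewards, gamma = 1) {
  sum(gamma^(seq_along(rewards) - 1) * rewards)
}

cumulative_reward(c(-1, -1, -1, 10))               # plain sum: 7
cumulative_reward(c(-1, -1, -1, 10), gamma = 0.9)  # discounted sum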

Value Functions
Definition: the state-value function V^π(s) of an MDP is the expected reward when starting from state s and then following policy π
V^π(s) = E_π[ J^π(s_t) | s_t = s ]
It quantifies how good it is to be in a particular state s
Definition: the state-action value function Q^π(s,a) is the expected reward when starting from state s, taking action a, and then following policy π
Q^π(s,a) = E_π[ J^π(s_t) | s_t = s, a_t = a ]
It quantifies how good it is to be in a particular state s, apply action a, and afterwards follow policy π
Now we can formalize the policy definition (with discount factor γ) via
π(s) = argmax_a Σ_{s'} T^a_{ss'} ( R^a_{ss'} + γ V^π(s') )
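As an illustrative sketch only (not the slides' code), this argmax rule can be evaluated directly in R; note that here T and R are assumed to be #states × #states × #actions arrays holding T^a_{ss'} and R^a_{ss'}, which differs from the #states × #actions reward matrix used with MDPtoolbox later on:

# Hedged sketch: greedy policy derived from a given value function V
greedy_policy <- function(T, R, V, gamma = 0.9) {
  n_states  <- dim(T)[1]
  n_actions <- dim(T)[3]
  policy <- integer(n_states)
  for (s in 1:n_states) {
    # expected one-step return of each action a in state s
    action_values <- sapply(1:n_actions, function(a) {
      sum(T[s, , a] * (R[s, , a] + gamma * V))
    })
    policy[s] <- which.max(action_values)   # index of the best action
  }
  policy
}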

Optimal Value Functions
While π can be any policy, π* denotes the optimal one, i.e. the policy with the highest expected cumulative reward
The optimal value functions specify the best possible policy
An MDP is solved when the optimal value functions are known
Definitions
1 The optimal state-value function V^{π*}(s) maximizes the expected reward over all policies: V^{π*}(s) = max_π V^π(s)
2 The optimal action-value function Q^{π*}(s,a) maximizes the action-value function over all policies: Q^{π*}(s,a) = max_π Q^π(s,a)

Markov Decision Processes in R
Load the R library MDPtoolbox
library(MDPtoolbox)
Create a transition matrix for two states and two actions
T <- array(0, c(2, 2, 2))
T[,,1] <- matrix(c(0, 1, 0.8, 0.2), nrow=2, ncol=2, byrow=TRUE)
T[,,2] <- matrix(c(0.5, 0.5, 0.1, 0.9), nrow=2, ncol=2, byrow=TRUE)
Dimensions are #states × #states × #actions
Create a reward matrix (of dimensions #states × #actions)
R <- matrix(c(10, 10, 1, -5), nrow=2, ncol=2, byrow=TRUE)
Check whether the given T and R represent a well-defined MDP
mdp_check(T, R)
## [1] ""
Returns an empty string if the MDP is valid
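As a follow-up sketch (not shown on the slides), the optimal policy and value function of this two-state MDP could be obtained directly with mdp_value_iteration, which the wrap-up slide also lists; the exact numbers are omitted here:

# Sketch: solve the two-state MDP above by value iteration (discount assumed 0.9)
vi <- mdp_value_iteration(P=T, R=R, discount=0.9)
vi$policy   # optimal action per state
vi$V        # optimal state values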


Types of Learning Algorithms
Aim: find the optimal policy and value functions
Model-based learning
A model of the environment is given as an MDP with transition probabilities
Approach: learn the MDP model or an approximation of it
Model-free learning
An explicit model of the environment is not available, i.e. the transition probabilities are unknown
Approach: derive the optimal policy without explicitly formalizing the model

Outline
3 Learning Algorithms: Model-Based Learning, Model-Free Learning

Model-Based Learning: Policy Iteration
Approach via policy iteration, given an initial policy π_0:
Evaluate policy π_i to find the corresponding value function V^{π_i}
Improve the policy over V^{π_i} via greedy exploration
Policy iteration always converges to the optimal policy π*
Illustration (E: policy evaluation, I: policy improvement)
π_0 →E V^{π_0} →I π_1 →E V^{π_1} →I ... →E V^{π*} →I π*

Policy Evaluation
Computes the state-value function V^π for an arbitrary policy π via
V^π(s) = E_π[ r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... | s_t = s ]
       = E_π[ r_{t+1} + γ V^π(s_{t+1}) | s_t = s ]
       = Σ_a π(s,a) Σ_{s'} T^a_{ss'} [ R^a_{ss'} + γ V^π(s') ]
This is a system of |S| linear equations with |S| unknowns
Solvable, but computationally expensive if |S| is large
Advanced methods are available, e.g. iterative policy evaluation
Discount factor γ
If 0 < γ < 1, it makes the cumulative reward finite
Necessary for setups with infinite time horizons
Puts more importance on early learning steps and less on later ones

Iterative Policy Evaluation
Iterative policy evaluation uses dynamic programming to iteratively approximate V^π
Choose V_0 arbitrarily
Then use the Bellman equation as an update rule
V_{k+1}(s) = E_π[ r_{t+1} + γ V_k(s_{t+1}) | s_t = s ]
           = Σ_a π(s,a) Σ_{s'} T^a_{ss'} [ R^a_{ss'} + γ V_k(s') ]
The sequence V_k, V_{k+1}, ... converges to V^π as k → ∞
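A minimal sketch of this update rule in R (an assumption, not the slides' code); as in the earlier sketch, T and R are taken to be #states × #states × #actions arrays, and `policy` gives the action index chosen deterministically in each state:

# Hedged sketch of iterative policy evaluation for a deterministic policy
evaluate_policy <- function(T, R, policy, gamma = 0.9, n_iter = 100) {
  n_states <- dim(T)[1]
  V <- rep(0, n_states)            # V_0 chosen arbitrarily (here: all zeros)
  for (k in 1:n_iter) {
    V <- sapply(1:n_states, function(s) {
      a <- policy[s]
      sum(T[s, , a] * (R[s, , a] + gamma * V))   # Bellman update
    })
  }
  V
}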

Policy Improvement
Policy evaluation determines the value function V^π for a policy π
The improvement step exploits this knowledge to select the optimal action in each state
For that, policy improvement searches a policy π' that is as good as or better than π
The remedy is to use the state-action value function via
π'(s) = argmax_a Q^π(s,a)
      = argmax_a E[ r_{t+1} + γ V^π(s_{t+1}) | s_t = s, a_t = a ]
      = argmax_a Σ_{s'} T^a_{ss'} [ R^a_{ss'} + γ V^π(s') ]
Afterwards, continue with policy evaluation and policy improvement until a desired convergence criterion is reached
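Putting the two steps together yields the policy-iteration loop sketched below (again an assumption, reusing the hypothetical evaluate_policy() and greedy_policy() helpers from the earlier sketches); the slides themselves rely on mdp_policy_iteration() from MDPtoolbox instead, as shown next:

# Sketch: alternate evaluation (E) and greedy improvement (I) until the policy
# no longer changes, i.e. until it has converged to pi*
policy_iteration <- function(T, R, gamma = 0.9, max_iter = 50) {
  policy <- rep(1, dim(T)[1])                            # arbitrary initial policy pi_0
  for (i in 1:max_iter) {
    V          <- evaluate_policy(T, R, policy, gamma)   # E: policy evaluation
    new_policy <- greedy_policy(T, R, V, gamma)          # I: policy improvement
    if (all(new_policy == policy)) break                 # converged
    policy <- new_policy
  }
  list(policy = policy, V = V)
}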

Policy Iteration Example
Learning task: an agent travels through a 2×2 grid (i.e. 4 states)
[Grid diagram: s_0 and s_3 (goal) in the top row, s_1 and s_2 in the bottom row; a wall (red line) prevents direct moves from s_0 to s_3]
The reward favors shorter routes: visiting each square/state gives a reward of -1, reaching the goal gives a reward of 10
Actions: move left, right, up or down
Transition probabilities are < 1, i.e. erroneous moves are allowed

Policy Iteration in R Example
Design an MDP that finds the optimal policy to that problem
Create individual matrices with pre-specified (random) transition probabilities for each action
up <- matrix(c(   1,   0,   0,   0,
                0.7, 0.2, 0.1,   0,
                  0, 0.1, 0.2, 0.7,
                  0,   0,   0,   1),
             nrow=4, ncol=4, byrow=TRUE)
left <- matrix(c(0.9, 0.1,   0,   0,
                 0.1, 0.9,   0,   0,
                   0, 0.7, 0.2, 0.1,
                   0,   0, 0.1, 0.9),
               nrow=4, ncol=4, byrow=TRUE)

Policy Iteration in R
Second chunk of matrices
down <- matrix(c(0.3, 0.7,   0,   0,
                   0, 0.9, 0.1,   0,
                   0, 0.1, 0.9,   0,
                   0,   0, 0.7, 0.3),
               nrow=4, ncol=4, byrow=TRUE)
right <- matrix(c(0.9, 0.1,   0,   0,
                  0.1, 0.2, 0.7,   0,
                    0,   0, 0.9, 0.1,
                    0,   0, 0.1, 0.9),
                nrow=4, ncol=4, byrow=TRUE)
Aggregate the previous matrices to create the transition probabilities in T
T <- list(up=up, left=left, down=down, right=right)

Policy Iteration in R
Create a matrix with rewards
R <- matrix(c(-1, -1, -1, -1,
              -1, -1, -1, -1,
              -1, -1, -1, -1,
              10, 10, 10, 10),
            nrow=4, ncol=4, byrow=TRUE)
Check if this provides a well-defined MDP
mdp_check(T, R)  # empty string => ok
## [1] ""

Policy Iteration in R
Run policy iteration with discount factor γ = 0.9
m <- mdp_policy_iteration(P=T, R=R, discount=0.9)
Display the optimal policy π*
m$policy
## [1] 3 4 1 1
names(T)[m$policy]
## [1] "down" "right" "up" "up"
Display the value function V^{π*}
m$V
## [1] 58.25663 69.09102 83.19292 100.00000


Model-Free Learning
Drawbacks of model-based learning
It requires an MDP, i.e. an explicit model of the dynamics in the environment
Transition probabilities are often not available or difficult to define
Model-based learning is thus often intractable, even in simple cases
Model-free learning
Idea: learn directly from interactions with the environment
Only use experience from sequences of states, actions, and rewards
Common approaches
1 Monte Carlo methods are simple but have slow convergence
2 Q-learning is more efficient due to off-policy learning

Monte Carlo Method
Monte Carlo methods require no knowledge of transition probabilities as in MDPs
They perform reinforcement learning from a sequence of interactions
They mimic policy iteration to find the optimal policy
Estimate the value of each action Q(s,a) instead of V(s)
Store average rewards in a state-action table
Example state-action table
State    a_1    a_2    Optimal policy
s_1       2      1     a_1
s_2       1      3     a_2
s_3       2      4     a_2

Monte Carlo Method Algorithm
1 Start with an arbitrary state-action table (and corresponding policies); often all rewards are initially set to zero
2 Observe the first state
3 Choose an action according to ε-greedy action selection, i.e. with probability ε pick a random action, otherwise take the action with the highest expected reward
4 Update the state-action table with the new reward (averaging)
5 Observe the new state
6 Go to step 3
Disadvantage: high computational time and thus slow convergence, since the method must frequently evaluate a suboptimal policy
(A compact sketch of this loop in R follows below.)
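The following is a hedged sketch only, not the slides' code: a simple every-visit Monte Carlo variant for the 2×2 grid with undiscounted returns and running averages. It reuses the simulateEnvironment() helper and the states/actions vectors that appear on the Q-learning slides further below; the function name monteCarlo and the visit-count matrix N are assumptions.

monteCarlo <- function(n, s_0, s_terminal, epsilon = 0.1) {
  Q <- matrix(0, nrow = length(states), ncol = length(actions),
              dimnames = list(states, actions))
  N <- Q   # visit counts per state-action pair, used for averaging
  for (i in 1:n) {
    # generate one episode following the current epsilon-greedy policy
    state <- s_0
    episode <- list()
    while (state != s_terminal) {
      if (runif(1) <= epsilon) {
        action <- sample(actions, 1)               # exploration
      } else {
        action <- names(which.max(Q[state, ]))     # exploitation
      }
      response <- simulateEnvironment(state, action)
      episode[[length(episode) + 1]] <- list(state = state, action = action,
                                             reward = response$reward)
      state <- response$state
    }
    # update each visited state-action pair with the return that followed it
    G <- 0
    for (step in rev(episode)) {
      G <- G + step$reward
      N[step$state, step$action] <- N[step$state, step$action] + 1
      Q[step$state, step$action] <- Q[step$state, step$action] +
        (G - Q[step$state, step$action]) / N[step$state, step$action]
    }
  }
  Q
}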

Q-Learning
One of the most important breakthroughs in reinforcement learning
Off-policy learning concept: explore the environment and at the same time exploit the current knowledge
In each step, look forward to the next state and observe the maximum possible reward for all available actions in that state
Use this knowledge to update the action-value of the corresponding action in the current state
Apply the update rule with learning rate α (0 < α ≤ 1):
Q(s,a) ← Q(s,a) + α [ r + γ max_{a'} Q(s',a') − Q(s,a) ]
(old value + learning rate × [reward + discount factor × expected optimal value − old value])
Q-learning is repeated for different episodes (e.g. games, trials, etc.)

Q-Learning Algorithm
1 Initialize the table Q(s,a) to zero for all state-action pairs (s,a)
2 Observe the current state s
3 Repeat until convergence:
Select an action a and apply it
Receive the immediate reward r
Observe the new state s'
Update the table entry for Q(s,a) according to
Q(s,a) ← Q(s,a) + α [ r + γ max_{a'} Q(s',a') − Q(s,a) ]
Move to the next state, i.e. s ← s'


Q-Learning in R
Unfortunately, R has no dedicated library for model-free reinforcement learning yet
Alternative implementations are often available in other programming languages
A possible remedy is to write your own implementation; this is not too difficult with the building blocks on the next slides
Example: learning an agent that finds a destination in a 2×2 grid with a wall
Initialize 4 states and 4 actions
actions <- c("up", "left", "down", "right")
states <- c("s0", "s1", "s2", "s3")
Note: real applications (such as in robotics) are prone to disturbances

Q-Learning in R Building Blocks
1 Add a function that mimics the environment
simulateEnvironment <- function(state, action) {
  ...
}
2 Add a Q-learning function that performs a given number n of episodes
Qlearning <- function(n, s_0, s_terminal, epsilon, learning_rate) {
  ...
}
3 Call Q-learning with an initial state s_0, a final state s_terminal and the desired parameters to search for a policy
Qlearning(n, s_0, s_terminal, epsilon, learning_rate)

Q-Learning in R
The function returns a list with two entries, the next state and the corresponding reward, given the current state and an intended action
simulateEnvironment <- function(state, action) {
  # Calculate next state (according to sample grid with wall)
  # Default: remain in a state if action tries to leave grid
  next_state <- state
  if (state == "s0" && action == "down") next_state <- "s1"
  if (state == "s1" && action == "up") next_state <- "s0"
  if (state == "s1" && action == "right") next_state <- "s2"
  if (state == "s2" && action == "left") next_state <- "s1"
  if (state == "s2" && action == "up") next_state <- "s3"
  if (state == "s3" && action == "down") next_state <- "s2"

  # Calculate reward
  if (next_state == "s3") {
    reward <- 10
  } else {
    reward <- -1
  }

  return(list(state=next_state, reward=reward))
}

Q-Learning in R
The function applies Q-learning for a given number n of episodes and returns the state-action function Q
Qlearning <- function(n, s_0, s_terminal, epsilon, learning_rate) {
  # Initialize state-action function Q to zero
  Q <- matrix(0, nrow=length(states), ncol=length(actions),
              dimnames=list(states, actions))

  # Perform n episodes/iterations of Q-learning
  for (i in 1:n) {
    Q <- learnEpisode(s_0, s_terminal, epsilon, learning_rate, Q)
  }

  return(Q)
}

Q-Learning in R
learnEpisode <- function(s_0, s_terminal, epsilon, learning_rate, Q) {
  state <- s_0   # set cursor to initial state

  while (state != s_terminal) {
    # epsilon-greedy action selection
    if (runif(1) <= epsilon) {
      action <- sample(actions, 1)              # pick random action
    } else {
      action <- names(which.max(Q[state, ]))    # pick first best action
    }

    # get next state and reward from environment
    response <- simulateEnvironment(state, action)

    # update rule for Q-learning (discount factor gamma is implicitly 1 here)
    Q[state, action] <- Q[state, action] + learning_rate *
      (response$reward + max(Q[response$state, ]) - Q[state, action])

    state <- response$state   # move to next state
  }

  return(Q)
}

Q-Learning in R
Choose the learning parameters
epsilon <- 0.1
learning_rate <- 0.1
Calculate the state-action function Q after 1000 episodes
set.seed(0)
Q <- Qlearning(1000, "s0", "s3", epsilon, learning_rate)
Q
##            up      left      down     right
## s0 -79.962619 -81.15445 -68.39532 -79.34825
## s1 -73.891963 -52.43183 -52.67565 -47.91828
## s2  -8.784844 -46.32207 -17.97360 -20.29088
## s3   0.000000   0.00000   0.00000   0.00000
Display the optimal policy
actions[max.col(Q)]   # note: problematic for states with ties
## [1] "down" "right" "up" "up"
The agent chooses the optimal action in all states (s_3 is the goal)


Wrap-Up Summary
Reinforcement learning learns through trial-and-error from interactions
The reward indicates the performance of the agent, but without showing how to improve its behavior
Learning is grouped into model-based and model-free strategies
A common and efficient model-free variant is Q-learning
Reinforcement learning is similar to human-like learning in real-world environments
It is commonly used for trade-offs between long-term and short-term benefits
Drawbacks
Can be computationally expensive when the state-action space is large
No R library is yet available for model-free learning

Wrap-Up: Commands inside MDPtoolbox
mdp_example_rand()          Generate a random MDP
mdp_check(T, R)             Check whether the given T and R represent a well-defined MDP
mdp_value_iteration(...)    Run value iteration to find the best policy
mdp_policy_iteration(...)   Run policy iteration to find the best policy
Further readings
Sutton & Barto (1998): Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA. Also available online: https://webdocs.cs.ualberta.ca/~sutton/book/the-book.html
Slides by Watkins: http://webdav.tuebingen.mpg.de/mlss2013/2015/speakers.html
Slides by Littman: http://mlg.eng.cam.ac.uk/mlss09/mlss_slides/littman_1.pdf
Vignette for MDPtoolbox: https://cran.r-project.org/web/packages/MDPtoolbox/MDPtoolbox.pdf