Breakout Group Reinforcement Learning

1 Breakout Group Reinforcement Learning
Fabian Ruehle (University of Oxford)
String_Data 2017, Boston, 12/01/2017

2 Outline
- Theoretical introduction (30 minutes)
- Discussion of code (30 minutes)
  - Solve a version of grid world with SARSA
- Discussion of RL and its applications to String Theory (30 minutes)

3 How to teach a machine
- Supervised Learning (SL):
  - Provide a set of training tuples $[(\text{in}_0, \text{out}_0), (\text{in}_1, \text{out}_1), \ldots, (\text{in}_n, \text{out}_n)]$
  - After training, the machine predicts $\text{out}_i$ from $\text{in}_i$
- Unsupervised Learning (UL):
  - Only provide a set of training inputs $[\text{in}_0, \text{in}_1, \ldots, \text{in}_n]$
  - Give a task to the machine (e.g. cluster the input) without telling it exactly how to do this
  - After training, the machine will perform the self-learned action on $\text{in}_i$
- Reinforcement Learning (RL):
  - In between SL and UL
  - The machine acts autonomously, but its actions are reinforced / punished

4 Theoretical introduction

5 Reinforcement Learning - Vocabulary
Basic textbooks/literature: [Barto, Sutton '98, '17]
- The thing that learns is called the agent or worker
- The thing that is explored is called the environment
- The elements of the environment are called states or observations
- The things that take you from one state to another are called actions
- The thing that tells you how to select the next action is called the policy
- Actions are executed sequentially, in a sequence of (time) steps
- The reinforcement the agent experiences is called the reward
- The accumulated reward is called the return
In RL, an agent performs actions in an environment with the goal of maximizing its long-term return.

6 Reinforcement Learning - Details
We focus on discrete state and action spaces.
- State space: $S = \{\text{states in environment}\}$
- Action space:
  - total: $A = \{\text{actions to transition between states}\}$
  - for $s \in S$: $A(s) = \{\text{possible actions in state } s\}$
- Policy $\pi$: selects the next action for a given state, $\pi(s) = a$ with $a \in A(s)$, i.e. $\pi : S \to A$
- Reward $R(s, a) \in \mathbb{R}$: reward for taking action $a$ in state $s$, i.e. $R : S \times A \to \mathbb{R}$
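The definitions above map directly onto plain lookup tables when the state and action spaces are discrete. The following is a minimal sketch; the two-state environment, the action names, and the reward values are hypothetical and chosen only for illustration.

```python
from typing import Dict, List

State = str
Action = str

# Hypothetical two-state environment (not from the slides).
STATES: List[State] = ["s0", "s1"]

# A(s): the actions that are possible in each state.
AVAILABLE: Dict[State, List[Action]] = {"s0": ["stay", "go"], "s1": ["stay"]}


def reward(state: State, action: Action) -> float:
    """R: S x A -> R. Assumed values: moving costs 1, staying costs nothing."""
    return -1.0 if action == "go" else 0.0


# A deterministic policy pi: S -> A, stored as a lookup table.
policy: Dict[State, Action] = {"s0": "go", "s1": "stay"}

# Every choice of the policy must be an allowed action: pi(s) in A(s).
policy_is_valid = all(policy[s] in AVAILABLE[s] for s in STATES)
```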

7 Reinforcement Learning - Details
- Return: the accumulated reward from the current step, $G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$, with $\gamma \in (0, 1]$
- State value function $v_\pi(s)$: expected return for $s$ with policy $\pi$, $v_\pi(s) = E[G_t \mid s = s_t]$
- Action value function $q_\pi(s, a)$: expected return for performing action $a$ in state $s$ with policy $\pi$, $q_\pi(s, a) = E[G_t \mid s = s_t, a = a_t]$
- Prediction problem: given $\pi$, predict $v_\pi(s)$ or $q_\pi(s, a)$
- Control problem: find the optimal policy $\pi^*$ that maximizes $v_\pi(s)$ or $q_\pi(s, a)$
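For a finite episode, the return can be evaluated by summing the rewards backwards, using $G_t = r_{t+1} + \gamma G_{t+1}$. The sketch below assumes a short, made-up reward sequence; the function name `discounted_return` is not from the slides.

```python
def discounted_return(rewards, gamma):
    """Return G_t = sum_k gamma**k * r_{t+k+1} for a finite list of future rewards."""
    g = 0.0
    # Accumulate from the last reward backwards: G = r + gamma * G_next.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Made-up episode: two mildly punished steps, then a rewarded final step.
rewards = [-1.0, -1.0, 10.0]
g_undiscounted = discounted_return(rewards, gamma=1.0)  # -1 - 1 + 10 = 8.0
g_discounted = discounted_return(rewards, gamma=0.5)    # -1 - 0.5 + 2.5 = 1.0
```

With $\gamma < 1$, later rewards count less, so the same episode yields a smaller return.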

8 Reinforcement Learning - Details
Commonly used policies:
- greedy: choose the action that maximizes the action value function, $\pi'(s) = \operatorname{argmax}_a q(s, a)$
- $\varepsilon$-greedy: explore different possibilities, $\pi'(s) = \begin{cases} \text{greedy action} & \text{in } (1 - \varepsilon) \text{ of the cases} \\ \text{random action} & \text{in } \varepsilon \text{ of the cases} \end{cases}$
We take $\varepsilon$-greedy policy improvement.
- On-policy: update the policy you are following (e.g. always $\varepsilon$-greedy)
- Off-policy: use different policies for choosing the next action $a_{t+1}$ and for updating $q(s_t, a_t)$
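An $\varepsilon$-greedy choice can be sketched in a few lines. This version assumes the action values are stored in a dictionary keyed by `(state, action)` and that unseen pairs count as 0; both are illustrative choices, not something the slides prescribe.

```python
import random


def epsilon_greedy(q, state, actions, epsilon, rng=random):
    """Pick a random action with probability epsilon, otherwise a greedy one."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    # Greedy choice: the action with the largest q(state, action).
    return max(actions, key=lambda a: q.get((state, a), 0.0))


# Made-up action values for a single state.
q = {("s0", "left"): 0.2, ("s0", "right"): 0.7}
actions = ["left", "right"]

# epsilon = 0 never explores, so this is the purely greedy policy.
greedy_action = epsilon_greedy(q, "s0", actions, epsilon=0.0)
```

Setting `epsilon=1.0` gives a uniformly random policy, and values in between trade exploration against exploitation.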

9 Reinforcement Learning - SARSA
Solving the control problem:
- $\Delta v(s_t) = \alpha [G_t - v(s_t)]$
  - $\alpha$: learning rate ($\alpha = 0$ means no update to $v(s_t)$)
- One-step approximation: $G_t = r + \gamma v(s_{t+1})$
- Similar for the action value function: $\Delta q(s_t, a_t) = \alpha [G_t - q(s_t, a_t)] = \alpha [r + \gamma q(s_{t+1}, a_{t+1}) - q(s_t, a_t)]$
- The update depends on the tuple $(s_t, a_t, r, s_{t+1}, a_{t+1})$
- $a_{t+1}$ is the currently best known action for state $s_{t+1}$
Note: SARSA is on-policy.
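A single SARSA update on a tabular action value function can be written as follows. This is a sketch under the assumption that $q$ is stored in a `defaultdict` keyed by `(state, action)`; the state names and numbers are made up.

```python
from collections import defaultdict


def sarsa_update(q, s, a, r, s_next, a_next, alpha, gamma):
    """Apply q(s,a) += alpha * [r + gamma * q(s',a') - q(s,a)] in place."""
    target = r + gamma * q[(s_next, a_next)]
    q[(s, a)] += alpha * (target - q[(s, a)])


# Unseen (state, action) pairs start at 0.
q = defaultdict(float)
q[("B", "right")] = 2.0

# One transition (s_t, a_t, r, s_{t+1}, a_{t+1}) = ("A", "right", -1, "B", "right").
sarsa_update(q, "A", "right", r=-1.0, s_next="B", a_next="right", alpha=0.5, gamma=1.0)
# target = -1 + 1 * 2 = 1, so q("A", "right") moves from 0 halfway to 1.
```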

10 Reinforcement Learning - Q-Learning
Very similar to SARSA. Difference in the update:
- SARSA: $\Delta q(s_t, a_t) = \alpha [r + \gamma q(s_{t+1}, a_{t+1}) - q(s_t, a_t)]$
- Q-Learning: $\Delta q(s_t, a_t) = \alpha [r + \gamma \max_{a'} q(s_{t+1}, a') - q(s_t, a_t)]$
Note:
- This means that Q-Learning is off-policy
- SARSA is found to perform better
- Q-Learning is proven to converge to the solution
- Combine with deep NNs: Deep Q-Learning
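The difference between the two updates shows up when the next action actually taken is not the greedy one. The sketch below applies both rules to the same made-up transition; the dictionary layout and all numbers are assumptions for illustration.

```python
from collections import defaultdict


def sarsa_update(q, s, a, r, s_next, a_next, alpha, gamma):
    """On-policy: bootstrap from the action a_next that will actually be taken."""
    q[(s, a)] += alpha * (r + gamma * q[(s_next, a_next)] - q[(s, a)])


def q_learning_update(q, s, a, r, s_next, actions, alpha, gamma):
    """Off-policy: bootstrap from the best action in s_next, whatever is taken."""
    best_next = max(q[(s_next, a2)] for a2 in actions)
    q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])


actions = ["left", "right"]
start = {("B", "left"): 0.0, ("B", "right"): 2.0}

q_sarsa = defaultdict(float, start)
q_qlearn = defaultdict(float, start)

# Same transition for both; the exploratory next action is "left".
sarsa_update(q_sarsa, "A", "right", -1.0, "B", "left", alpha=0.5, gamma=1.0)
q_learning_update(q_qlearn, "A", "right", -1.0, "B", actions, alpha=0.5, gamma=1.0)
# SARSA target: -1 + q(B, left) = -1; Q-Learning target: -1 + max q(B, .) = 1.
```

SARSA lowers the estimate because it accounts for the exploratory step, while Q-Learning raises it because it assumes greedy behavior from the next state onward.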

11 Example - Gridworld
[Figure: gridworld maze. Legend: Worker (Explorer), Pitfall, Exit, Wall]

12 Example - Gridworld
We will look at a version of grid world:
- Gridworld is a grid-like maze with walls, pitfalls, and an exit
- Each state is a point on the grid of the maze
- The actions are $A = \{\text{up}, \text{down}, \text{left}, \text{right}\}$
- Goal: find the exit (strongly rewarded)
- Each step is punished mildly (solve the maze quickly)
- Pitfalls should be avoided (strongly punished)
- Running into a wall does not change the state
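The rules above are enough to sketch a small gridworld and solve it with tabular SARSA. Everything specific here is an assumption made for illustration and is not the code discussed in the session: the 4x4 layout, the reward values, ending the episode at a pitfall, and the hyperparameters.

```python
import random
from collections import defaultdict

# Hypothetical layout: S start, E exit, P pitfall, # wall, . empty square.
MAZE = ["S..P",
        ".#..",
        "...P",
        "#..E"]
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
# Assumed rewards: mild step penalty, strong pitfall penalty, strong exit reward.
R_STEP, R_PIT, R_EXIT = -1.0, -50.0, 100.0


def step(state, action):
    """Return (next_state, reward, done) for one move in the maze."""
    row, col = state
    d_row, d_col = ACTIONS[action]
    new_row, new_col = row + d_row, col + d_col
    inside = 0 <= new_row < len(MAZE) and 0 <= new_col < len(MAZE[0])
    if not inside or MAZE[new_row][new_col] == "#":
        return state, R_STEP, False  # running into a wall does not change the state
    cell = MAZE[new_row][new_col]
    if cell == "P":
        return (new_row, new_col), R_PIT, True  # assumption: a pitfall ends the episode
    if cell == "E":
        return (new_row, new_col), R_EXIT, True
    return (new_row, new_col), R_STEP, False


def choose(q, state, epsilon, rng):
    """Epsilon-greedy action selection on the tabular q."""
    if rng.random() < epsilon:
        return rng.choice(list(ACTIONS))
    return max(ACTIONS, key=lambda a: q[(state, a)])


def train(episodes=2000, alpha=0.1, gamma=0.95, epsilon=0.1, seed=0, max_steps=200):
    """Run tabular SARSA from the start square and return the learned q table."""
    rng = random.Random(seed)
    q = defaultdict(float)
    for _episode in range(episodes):
        state = (0, 0)
        action = choose(q, state, epsilon, rng)
        for _step in range(max_steps):
            next_state, reward, done = step(state, action)
            next_action = choose(q, next_state, epsilon, rng)
            # Terminal states have no future return, so the target is just r.
            target = reward if done else reward + gamma * q[(next_state, next_action)]
            q[(state, action)] += alpha * (target - q[(state, action)])
            state, action = next_state, next_action
            if done:
                break
    return q


def greedy_path(q, max_steps=50):
    """Follow the greedy policy from the start square and record visited states."""
    state, path = (0, 0), [(0, 0)]
    for _step in range(max_steps):
        action = max(ACTIONS, key=lambda a: q[(state, a)])
        state, _reward, done = step(state, action)
        path.append(state)
        if done:
            break
    return path


q = train()
path = greedy_path(q)
print(path)
```

Because the next action is drawn from the same $\varepsilon$-greedy policy that is being improved, this is an on-policy update; replacing the target with a maximum over next actions would turn it into Q-Learning.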

13 Gridworld vs String Landscape
- Walls = boundaries of the landscape (negative number of branes)
- Empty square = consistent point in the landscape which does not correspond to our Universe
- Pitfalls = mathematically / physically inconsistent states (anomalies, tadpoles, ...)
- Exit = Standard Model of Particle Physics

14 Coding

15 Discussion
