Reinforcement Learning
Lecture 1: Introduction
Vien Ngo
MLR, University of Stuttgart
What is Reinforcement Learning?
Reinforcement Learning is a subfield of Machine Learning. (from David Silver's lecture)
RL: A subfield of Machine Learning
(from Machine Learning course, 2011, Marc Toussaint)
- Supervised learning: learn from labelled data $\{(x_i, y_i)\}_{i=1}^N$
- Unsupervised learning: learn from unlabelled data $\{x_i\}_{i=1}^N$ only
- Semi-supervised learning: many unlabelled data, few labelled data
- Reinforcement learning: learn from data $\{(s_t, a_t, r_t, s_{t+1})\}$
  - learn a predictive model $(s, a) \mapsto s'$
  - learn to predict the reward $(s, a) \mapsto r$
  - learn a behaviour $s \mapsto a$ that maximizes the expected total reward
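To make the transition data concrete, here is a minimal sketch (my own illustration, not from the original slides) of the three learning targets estimated from a handful of $(s, a, r, s')$ tuples; the tiny dataset, the tabular estimates, and the one-step greedy policy are all assumptions made up for illustration.

```python
# Minimal sketch (illustrative only): the three RL learning targets
# estimated from a toy dataset of (s, a, r, s') transitions.
from collections import defaultdict

data = [(0, 'right', 0.0, 1), (1, 'right', 1.0, 2), (1, 'left', 0.0, 0)]

model_counts = defaultdict(lambda: defaultdict(int))   # (s, a) -> counts of s'
reward_sum = defaultdict(float)                        # (s, a) -> summed reward
reward_cnt = defaultdict(int)

for s, a, r, s_next in data:
    model_counts[(s, a)][s_next] += 1
    reward_sum[(s, a)] += r
    reward_cnt[(s, a)] += 1

# 1) Predictive model (s, a) -> s': empirical next-state probabilities.
P_hat = {sa: {s2: c / sum(cnts.values()) for s2, c in cnts.items()}
         for sa, cnts in model_counts.items()}

# 2) Reward prediction (s, a) -> r: empirical mean reward.
R_hat = {sa: reward_sum[sa] / reward_cnt[sa] for sa in reward_sum}

# 3) Behaviour s -> a: here just greedy w.r.t. the one-step reward estimate
#    (real RL maximizes the expected *total* reward, not a single step).
policy = {}
for (s, a), r_hat in R_hat.items():
    if s not in policy or r_hat > R_hat[(s, policy[s])]:
        policy[s] = a

print(P_hat, R_hat, policy)
```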
Success of Reinforcement Learning
Games:
- Backgammon (Tesauro, 1994)
- Deep RL playing Atari games (2014), AlphaGo (2016)
Operations Research:
- Inventory management (Van Roy, Bertsekas, Lee, & Tsitsiklis, 1996)
- Investment portfolios
- Dynamic channel allocation (e.g. Singh & Bertsekas, 1997)
- Online advertisements
Robotics:
- Helicopter control (e.g. Ng, 2003; Abbeel & Ng, 2006)
- Many robots (navigation, bipedal walking, grasping, switching between skills, ...)
TD-Gammon, by Gerald Tesauro
(See Section 11.1 in Sutton & Barto's book; see Tesauro, 1992, 1994, 1995.)
- Only a reward at the end of the game, for winning.
- Self-play: use the current policy to sample moves for both sides!
- After about 300,000 games against itself, it reached near the level of the world's strongest grandmasters.
AlphaGo
AlphaGo by Google DeepMind achieved Go grandmaster rank (as of 4.4.2016).
Reinforcement Learning in Robotics
[Figure captions: learning motor skills, autonomous helicopter flight (2000, by Schaal, Atkeson, Vijayakumar); (2014, playing Atari games by Google DeepMind); (2004, Tedrake et al.); (2007, Andrew Ng et al.)]
Reinforcement learning in neuroscience
(From Yael Niv's ICML 2009 tutorial.)
Reinforcement learning in neuroscience
(Peter Dayan and Yael Niv, Neurobiology 2008.)
The brain employs both model-free and model-based decision-making strategies in parallel, with each dominating in different circumstances.
What is Reinforcement Learning?
- RL is learning from interaction.
- There is no supervisor, only signals of reward / evaluative feedback.
- Decisions made in sequence matter, as they affect the outcome of subsequent decisions.
(from Satinder Singh's Introduction to RL)
What is Reinforcement Learning?
$s_1, a_1, r_2, s_2, a_2, r_3, \ldots, s_i, a_i, r_{i+1}, s_{i+1}, \ldots$
- States can be vectors or other structures, defined as sufficient statistics to predict what happens next.
- Actions/controls can be multi-dimensional.
- Rewards are scalar but can be arbitrarily uninformative and might be delayed; e.g., $r_t$ tells how well the agent does at time $t$ (after taking action $a_t$ at $s_t$).
- Objective: described as the maximization of the expected total reward.
- States are sometimes not directly observable; the agent then receives only observations:
  $o_1, a_1, r_2, o_2, a_2, r_3, \ldots, o_i, a_i, r_{i+1}, o_{i+1}, \ldots$
- The agent has only partial knowledge about the environment, e.g. unknown dynamics, reward, or observation functions.
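The interaction loop above can be made concrete with a small sketch (a hypothetical example, not part of the slides): a placeholder random policy acts in a made-up one-dimensional corridor, the environment returns the next state and a reward, and the agent accumulates the total reward that the objective asks it to maximize in expectation.

```python
# Minimal sketch (illustrative only) of the agent-environment loop
# s_1, a_1, r_2, s_2, a_2, r_3, ... in a toy corridor with states 0..4.
import random

def step(state, action):
    """Toy dynamics: move left/right along 0..4; reward 1 for reaching state 4."""
    next_state = max(0, min(4, state + (1 if action == 'right' else -1)))
    reward = 1.0 if next_state == 4 else 0.0
    done = next_state == 4
    return next_state, reward, done

state, total_reward = 0, 0.0
for t in range(100):
    action = random.choice(['left', 'right'])   # placeholder policy (no learning yet)
    state, reward, done = step(state, action)   # environment responds with s', r
    total_reward += reward                      # objective: maximize this in expectation
    if done:
        break

print(total_reward)
```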
What is Reinforcement Learning?
Examples of rewards:
- +1/−1 for winning/losing a game, e.g. Go, Backgammon, ...
- +/− reward for increasing/decreasing the score, e.g. in deep RL algorithms playing Atari games.
- +/− rewards for earning/losing money when managing an investment portfolio.
- +/− rewards for following the desired trajectory / for crashing when controlling a stunt helicopter.
- etc.
Components of an RL Agent
- Policy: defines the behaviour of the agent, e.g. a mapping $\pi : S \rightarrow A$ or $\pi : S \times A \rightarrow [0, 1]$.
- Value function: the expected return when starting from a given state,
  $V^\pi(s) = \mathbb{E}_\pi\!\left[\sum_t \gamma^t R_t \mid s_0 = s\right]$.
- Model: the agent's internal representation of the environment, e.g. $P(s' \mid s, a)$, $R(s, a, s')$.
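As a concrete (hypothetical) illustration of these three components, the sketch below encodes a two-state policy, model, and reward as plain dictionaries and computes $V^\pi$ by repeatedly applying its defining expectation; the states, probabilities, and rewards are made up, and the fixed-point iteration is a simple form of policy evaluation, which the course covers later.

```python
# Minimal sketch (illustrative only): policy, model, and value function
# for a made-up two-state problem with discount factor gamma.
gamma = 0.9

policy = {                     # pi : S x A -> [0, 1]
    'A': {'stay': 0.5, 'go': 0.5},
    'B': {'stay': 1.0},
}
P = {                          # model: P(s' | s, a)
    ('A', 'stay'): {'A': 1.0},
    ('A', 'go'):   {'B': 1.0},
    ('B', 'stay'): {'B': 1.0},
}
R = {                          # model: R(s, a, s')
    ('A', 'stay', 'A'): 0.0,
    ('A', 'go',   'B'): 1.0,
    ('B', 'stay', 'B'): 0.0,
}

# V^pi(s) = E_pi[ sum_t gamma^t R_t | s_0 = s ], computed by fixed-point iteration.
V = {s: 0.0 for s in policy}
for _ in range(100):
    V = {s: sum(policy[s][a] * P[(s, a)][s2] * (R[(s, a, s2)] + gamma * V[s2])
                for a in policy[s] for s2 in P[(s, a)])
         for s in policy}

print(V)   # roughly {'A': 0.91, 'B': 0.0}
```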
Admin
Schedule of this course
Part 1: The Basics
- Markov Decision Processes (MDPs), Partially Observable MDPs (POMDPs)
- Dynamic programming: value iteration, policy iteration
Part 2: Reinforcement Learning Topics
- Temporal-difference learning, Q-learning
- Reinforcement learning with function approximation
- Policy search
Part 3: Advanced Topics
- Inverse reinforcement learning, imitation learning
- Exploration vs. exploitation: multi-armed bandits, PAC-MDP, Bayesian reinforcement learning
- Hierarchical reinforcement learning: macro actions, skill acquisition
- Deep reinforcement learning
- Reinforcement learning in POMDP environments
Schedule of this course
Missing:
- Relational MDPs
- MDP/POMDP/RL as inference
Literature
- Richard S. Sutton, Andrew Barto: Reinforcement Learning: An Introduction. The MIT Press, Cambridge, Massachusetts / London, England, 1998. http://webdocs.cs.ualberta.ca/~sutton/book/the-book.html
- Csaba Szepesvári: Algorithms for Reinforcement Learning. Morgan & Claypool, July 2010. http://www.ualberta.ca/~szepesva/RLBook.html
Organisation
Course webpage: https://ipvs.informatik.uni-stuttgart.de/mlr/teaching/reinforcement-learning-ss16/
- Slides, exercises
- Links to other resources
Secretary, admin issues: Carola Stahl, Carola.Stahl@ipvs.uni-stuttgart.de, Room 2.217
Lecture: Wed. 14:00-15:30, Room 0.108
Tutorial: Tue. 17:30-19:30, Room 38.03
Rules for the tutorials:
- Doing the exercises is crucial!
- At the beginning of each tutorial:
  - sign into a list
  - mark which exercises you have (successfully) worked on
- Students are randomly selected to present their solutions.
- You need 50% of completed exercises to be admitted to the exam.