Reinforcement Learning: Introduction
Vien Ngo, Marc Toussaint
University of Stuttgart
Problems we face in daily life?

This is a sequential decision problem: optimal decision making means maximizing reward, or minimizing penalty.

Why is it hard?
- Stochasticity and uncertainty.
- Delayed reward or penalty.
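To make "maximize reward" concrete, the standard objective sums up (discounted) rewards over time. A minimal sketch; the reward sequence and discount factor below are illustrative, not from the slides:

```python
def discounted_return(rewards, gamma=0.9):
    """Return sum of gamma^t * r_t over a finite reward sequence."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# A delayed reward: nothing until the final step, as in many games.
print(discounted_return([0, 0, 0, 1], gamma=0.9))  # 0.9^3 = 0.729
```

Discounting makes the same reward worth less the later it arrives, which is exactly what makes delayed rewards hard to learn from.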
What is Reinforcement Learning?

RL is learning from interaction.
(from Satinder Singh's Introduction to RL, videolectures.com)
What is Reinforcement Learning?

s_1, a_1, r_2, s_2, a_2, r_3, ..., s_i, a_i, r_{i+1}, s_{i+1}

- States can be vectors or other structures, defined as sufficient statistics to predict the future.
- Actions can be multi-dimensional.
- Rewards are scalar but can be arbitrarily uninformative.
- States are sometimes not directly observable:

o_1, a_1, r_2, o_2, a_2, r_3, ..., o_i, a_i, r_{i+1}, o_{i+1}

- The agent then has only partial knowledge about the environment.
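The trajectory above comes from a simple interaction loop: the agent picks an action, the environment returns a reward and a next state. A minimal sketch with a hypothetical toy environment (all names and dynamics here are illustrative):

```python
import random

def toy_env_step(state, action):
    """Hypothetical environment: returns (reward, next_state)."""
    next_state = (state + action) % 5
    reward = 1.0 if next_state == 0 else 0.0  # reward only in state 0
    return reward, next_state

random.seed(0)
state = 1
trajectory = []
for t in range(4):
    action = random.choice([0, 1])                    # a_t from a (random) policy
    reward, next_state = toy_env_step(state, action)  # r_{t+1}, s_{t+1}
    trajectory.append((state, action, reward, next_state))
    state = next_state

print(len(trajectory))  # 4 transitions (s_t, a_t, r_{t+1}, s_{t+1})
```

Replacing the direct use of `state` with an observation function would give the partially observable variant sketched in the second trajectory.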
Long history in AI

- Idea of programming a computer to learn by trial and error (Turing, 1954)
- SNARCs (Stochastic Neural-Analog Reinforcement Calculators) (Minsky, 1954)
- Checkers playing program (Samuel, 1959)
- Lots of RL in the 60s (e.g., Waltz & Fu 1965; Mendel 1966; Fu 1970)
- MENACE (Matchbox Educable Noughts and Crosses Engine) (Michie, 1963)
- RL-based Tic-Tac-Toe learner (GLEE) (Michie, 1968)
- Classifier Systems (Holland, 1975)
- Adaptive Critics (Barto & Sutton, 1981)
- Temporal Differences (Sutton, 1988)

(from Satinder Singh's Introduction to RL, videolectures.com)
RL: A subfield of Machine Learning
(from Machine Learning course, 2011, Marc Toussaint)

- Supervised learning: learn from labelled data {(x_i, y_i)}_{i=1}^N
- Unsupervised learning: learn from unlabelled data {x_i}_{i=0}^N only
- Semi-supervised learning: many unlabelled data, few labelled data
- Reinforcement learning: learn from data {(s_t, a_t, r_t, s_{t+1})}
  - learn a predictive model (s, a) -> s'
  - learn to predict reward (s, a) -> r
  - learn a behavior s -> a that maximizes reward
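Learning a behavior from such transition tuples can be sketched with the tabular Q-learning update, Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)), which the course treats in detail later. The transitions, step size, and discount below are illustrative:

```python
from collections import defaultdict

Q = defaultdict(float)   # Q-values, default 0 for unseen (state, action)
actions = [0, 1]
alpha, gamma = 0.5, 0.9  # step size and discount (illustrative values)

# Hypothetical data {(s_t, a_t, r_t, s_{t+1})} gathered from interaction.
transitions = [(0, 1, 0.0, 1), (1, 1, 1.0, 0), (0, 1, 0.0, 1)]
for s, a, r, s_next in transitions:
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

print(Q[(1, 1)])  # moved halfway toward the observed reward: 0.5
```

Note that no labels (x_i, y_i) are ever given; the only supervision signal is the scalar reward inside each transition tuple.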
Success of Reinforcement Learning

- Games: Backgammon (Tesauro, 1994), Solitaire (X. Yan et al., 2005), Chess, Checkers, ...
- Operations Research: Inventory Management (Van Roy, Bertsekas, Lee & Tsitsiklis, 1996), Dynamic Channel Allocation (e.g., Singh & Bertsekas, 1997), Vehicle Routing, etc.
- Economics: Trading, ...
- Robotics: Robocup Soccer (e.g., Stone & Veloso, 1999), Helicopter Control (e.g., Ng, 2003; Abbeel & Ng, 2006), many robots (navigation, bi-pedal walking, grasping, switching between skills, ...)

More at http://umichrl.pbworks.com/w/page/7597597/successes of Reinforcement Learning
TD-Gammon, by Gerald Tesauro
(See Section 11.1 in Sutton & Barto's book; see also Tesauro, 1992, 1994, 1995.)

- The only reward is given at the end of the game, for a win.
- Self-play: use the current policy to sample moves on both sides!
- After about 300,000 games against itself, it played near the level of the world's strongest grandmasters.
Go using UCT, by Gelly
(See Gelly et al., 2012, Communications of the ACM, for a review.)
Reinforcement Learning in Robotics

- Learning motor skills (around 2000, by Schaal, Atkeson, Vijayakumar)
- Autonomous Helicopter Flight (2007, Andrew Ng et al.)
Reinforcement Learning in Robotics

- Planning and exploration in a relational stochastic world (Lang and Toussaint, JMLR 2012)
Reinforcement learning in neuroscience
(from Yael Niv's ICML 2009 tutorial)
Reinforcement learning in neuroscience
(Peter Dayan and Yael Niv, Neurobiology 2008)

The brain employs both model-free and model-based decision-making strategies in parallel, with each dominating in different circumstances.
Schedule of this course

Part 1: The Basics
- Markov Decision Processes
- Dynamic Programming: Value Iteration, Policy Iteration

Part 2: Reinforcement Learning Topics
- TD, Q-Learning
- Reinforcement learning with function approximation: LSPI, regression, ...
- Policy search: policy gradient, covariant policy search, entropy policy search, ...
- Actor-Critic

Part 3: Advanced Topics
- Inverse reinforcement learning, imitation learning
- Exploration vs. exploitation: multi-armed bandits, PAC-MDP, Bayesian reinforcement learning
- Hierarchical reinforcement learning: macro actions, skill acquisition
- Intrinsically motivated reinforcement learning
- Connection to control theory
- Reinforcement learning in POMDP environments
Missing from the schedule:
- Relational MDPs
- MDP/POMDP/RL as Inference
Literature

Richard S. Sutton and Andrew Barto: Reinforcement Learning: An Introduction. The MIT Press, Cambridge, Massachusetts / London, England, 1998.
http://webdocs.cs.ualberta.ca/~sutton/book/the-book.html
Csaba Szepesvári: Algorithms for Reinforcement Learning. Morgan & Claypool, July 2010.
http://www.ualberta.ca/~szepesva/rlbook.html
Organisation

Course webpage: http://ipvs.informatik.uni-stuttgart.de/mlr/reinforcement-learning-ws1314/
- Slides, exercises
- Links to other resources

Secretary, admin issues: Carola Stahl, Carola.Stahl@ipvs.uni-stuttgart.de, Room 2.217

One exercise session: Friday 08:00-09:30

Rules for the tutorials (Prof. Marc Toussaint's rules):
- Doing the exercises is crucial!
- At the beginning of each tutorial: sign into a list and mark which exercises you have (successfully) worked on.
- Students are randomly selected to present their solutions.
- You need 50% of completed exercises to be admitted to the exam.