CS 188: Artificial Intelligence
Reinforcement Learning
Dan Klein, Pieter Abbeel
University of California, Berkeley

Reinforcement Learning
[Diagram: Agent sends actions a to the Environment; the Environment returns states s and rewards r to the Agent]
Basic idea:
- Receive feedback in the form of rewards
- Agent's utility is defined by the reward function
- Must (learn to) act so as to maximize expected rewards
- All learning is based on observed samples of outcomes!
(A generic agent-environment loop is sketched below.)

Example: Learning to Walk
Before Learning -- A Learning Trial -- After Learning [1K Trials]
[Kohl and Stone, ICRA 2004]

The Crawler!

Reinforcement Learning
Still assume a Markov decision process (MDP):
- A set of states s in S
- A set of actions (per state) A
- A model T(s, a, s')
- A reward function R(s, a, s')
Still looking for a policy π(s)

New twist: don't know T or R
- I.e. we don't know which states are good or what the actions do
- Must actually try actions and states out to learn
[You, in Project 3]
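To make the agent-environment loop concrete, here is a minimal Python sketch. The `env` object with `reset`/`step` and the `agent` with `get_action`/`update` are assumed interfaces for illustration, not the course's actual code; any environment with this shape would do.

```python
# A minimal sketch of the RL loop: the agent acts, the environment returns
# a new state and a reward, and the agent learns from the observed sample.
# `env` and `agent` are assumed (hypothetical) interfaces.

def run_episode(env, agent):
    total_reward = 0.0
    state = env.reset()                                   # initial state s
    done = False
    while not done:
        action = agent.get_action(state)                  # choose action a
        next_state, reward, done = env.step(action)       # observe s', r
        agent.update(state, action, next_state, reward)   # learn from (s, a, s', r)
        total_reward += reward
        state = next_state
    return total_reward
```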
Offline (MDPs) vs. Online (RL)
[Two panels: Offline Solution vs. Online Learning]

Passive Reinforcement Learning
Simplified task: policy evaluation
- Input: a fixed policy π(s)
- You don't know the transitions T(s, a, s')
- You don't know the rewards R(s, a, s')
- Goal: learn the state values
In this case:
- Learner is "along for the ride"
- No choice about what actions to take
- Just execute the policy and learn from experience
- This is NOT offline planning! You actually take actions in the world.

Direct Evaluation
Goal: Compute values for each state under π
Idea: Average together observed sample values
- Act according to π
- Every time you visit a state, write down what the sum of discounted rewards turned out to be
- Average those samples
This is called direct evaluation. (A code sketch of this computation follows below.)

Example: Direct Evaluation
Input Policy π (Assume: γ = 1)
Observed Episodes (Training):
- Episode 1: B, east, C, -1; C, east, D, -1; D, exit, x, +10
- Episode 2: B, east, C, -1; C, east, D, -1; D, exit, x, +10
- Episode 3: E, north, C, -1; C, east, D, -1; D, exit, x, +10
- Episode 4: E, north, C, -1; C, east, A, -1; A, exit, x, -10
Output Values: A = -10, B = +8, C = +4, D = +10, E = -2

Problems with Direct Evaluation
What's good about direct evaluation?
- It's easy to understand
- It doesn't require any knowledge of T, R
- It eventually computes the correct average values, using just sample transitions
What's bad about it?
- It wastes information about state connections
- Each state must be learned separately
- So, it takes a long time to learn
If B and E both go to C under this policy, how can their values be different?
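As a concrete illustration, here is a small Python sketch of direct evaluation run on the four episodes above (γ = 1); it reproduces the output values on the slide. The encoding of episodes as (state, action, next state, reward) tuples is our own choice, not the course's data format.

```python
# A minimal sketch of direct evaluation on the episodes above (gamma = 1).
from collections import defaultdict

episodes = [
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("E", "north", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("E", "north", "C", -1), ("C", "east", "A", -1), ("A", "exit", "x", -10)],
]

def direct_evaluation(episodes, gamma=1.0):
    returns = defaultdict(list)
    for episode in episodes:
        g = 0.0
        # Walk backwards so each state's sample is the discounted sum of
        # rewards that followed it in this episode.
        for state, action, next_state, reward in reversed(episode):
            g = reward + gamma * g
            returns[state].append(g)
    # Average the observed sample returns for each state.
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}

print(direct_evaluation(episodes))
# -> {'D': 10.0, 'C': 4.0, 'B': 8.0, 'E': -2.0, 'A': -10.0}
```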
Why Not Use Policy Evaluation?
Simplified Bellman updates calculate V for a fixed policy:
- Each round, replace V with a one-step-look-ahead layer over V:
  $V^{\pi}_0(s) = 0$
  $V^{\pi}_{k+1}(s) \leftarrow \sum_{s'} T(s, \pi(s), s') \, [ R(s, \pi(s), s') + \gamma V^{\pi}_k(s') ]$
- This approach fully exploited the connections between the states
- Unfortunately, we need T and R to do it!
Key question: how can we do this update to V without knowing T and R?
- In other words, how do we take a weighted average without knowing the weights?

Example: Expected Age
Goal: Compute expected age of cs188 students
Known P(A): $E[A] = \sum_a P(a) \cdot a$
Without P(A), instead collect samples $[a_1, a_2, \ldots, a_N]$
Unknown P(A): "Model Based": $\hat{P}(a) = \mathrm{num}(a)/N$, so $E[A] \approx \sum_a \hat{P}(a) \cdot a$
- Why does this work? Because eventually you learn the right model.
Unknown P(A): "Model Free": $E[A] \approx \frac{1}{N} \sum_i a_i$
- Why does this work? Because samples appear with the right frequencies.

Model-Based Learning
Model-based idea:
- Learn an approximate model based on experiences
- Solve for values as if the learned model were correct
Step 1: Learn empirical MDP model
- Count outcomes s' for each s, a
- Normalize to give an estimate of $\hat{T}(s, a, s')$
- Discover each $\hat{R}(s, a, s')$ when we experience (s, a, s')
(A code sketch of this counting step follows below.)
Step 2: Solve the learned MDP
- For example, use policy evaluation

Example: Model-Based Learning
Input Policy π (Assume: γ = 1)
Observed Episodes (Training): the same four episodes as in the direct-evaluation example
Learned Model:
T(s, a, s'):
- T(B, east, C) = 1.00
- T(C, east, D) = 0.75
- T(C, east, A) = 0.25
R(s, a, s'):
- R(B, east, C) = -1
- R(C, east, D) = -1
- R(D, exit, x) = +10
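Here is a minimal sketch of Step 1 (learning the empirical MDP) in Python, assuming the same episode encoding as the direct-evaluation sketch above; it reproduces the learned T values on the slide.

```python
# A minimal sketch of counting outcomes and normalizing into T-hat, R-hat.
from collections import defaultdict

def learn_model(episodes):
    counts = defaultdict(lambda: defaultdict(int))   # (s, a) -> {s': count}
    rewards = {}                                     # (s, a, s') -> observed r
    for episode in episodes:
        for s, a, s2, r in episode:
            counts[(s, a)][s2] += 1
            rewards[(s, a, s2)] = r                  # discovered on experience
    T = {}
    for (s, a), outcomes in counts.items():
        total = sum(outcomes.values())
        for s2, n in outcomes.items():
            T[(s, a, s2)] = n / total                # normalize counts into T-hat
    return T, rewards

T, R = learn_model(episodes)   # `episodes` from the direct-evaluation sketch
print(T[("C", "east", "D")])   # 0.75
print(T[("C", "east", "A")])   # 0.25
```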
Sample-Based Policy Evaluation?
We want to improve our estimate of V by computing these averages:
$V^{\pi}_{k+1}(s) \leftarrow \sum_{s'} T(s, \pi(s), s') \, [ R(s, \pi(s), s') + \gamma V^{\pi}_k(s') ]$
Idea: Take samples of outcomes s' (by doing the action!) and average:
$\mathrm{sample}_i = R(s, \pi(s), s'_i) + \gamma V^{\pi}_k(s'_i)$
$V^{\pi}_{k+1}(s) \leftarrow \frac{1}{n} \sum_i \mathrm{sample}_i$
[Diagram: state s, policy action π(s), and sampled successors s'_1, s'_2, s'_3]
Almost! But we can't rewind time to get sample after sample from state s.

Temporal Difference Learning
Big idea: learn from every experience!
- Update V(s) each time we experience a transition (s, a, s', r)
- Likely outcomes s' will contribute updates more often
Temporal difference learning of values
- Policy still fixed, still doing evaluation!
- Move values toward value of whatever successor occurs: running average
Sample of V(s): $\mathrm{sample} = R(s, \pi(s), s') + \gamma V^{\pi}(s')$
Update to V(s): $V^{\pi}(s) \leftarrow (1 - \alpha) V^{\pi}(s) + \alpha \cdot \mathrm{sample}$
Same update: $V^{\pi}(s) \leftarrow V^{\pi}(s) + \alpha \, (\mathrm{sample} - V^{\pi}(s))$

Exponential Moving Average
Exponential moving average
- The running interpolation update: $\bar{x}_n = (1 - \alpha) \, \bar{x}_{n-1} + \alpha \, x_n$
- Makes recent samples more important:
  $\bar{x}_n = \frac{x_n + (1-\alpha) x_{n-1} + (1-\alpha)^2 x_{n-2} + \cdots}{1 + (1-\alpha) + (1-\alpha)^2 + \cdots}$
- Forgets about the past (distant past values were wrong anyway)
Decreasing learning rate (α) can give converging averages

Example: Temporal Difference Learning
Assume: γ = 1, α = 1/2
Observed Transitions: B, east, C, -2 then C, east, D, -2
[Figure: three snapshots of the state-value grid; starting from V(D) = 8, V(E) = -1, and 0 elsewhere, the first transition moves V(B) to -1 and the second moves V(C) to 3]
(A code sketch of this update follows below.)

Problems with TD Value Learning
- TD value learning is a model-free way to do policy evaluation, mimicking Bellman updates with running sample averages
- However, if we want to turn values into a (new) policy, we're sunk:
  $\pi(s) = \arg\max_a Q(s, a)$
  $Q(s, a) = \sum_{s'} T(s, a, s') \, [ R(s, a, s') + \gamma V(s') ]$
- Idea: learn Q-values, not values
- Makes action selection model-free too!
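A minimal sketch of the TD value update, replaying the two observed transitions from the example (γ = 1, α = 1/2). The starting values (V(D) = 8, V(E) = -1, 0 elsewhere) are read off the slide's grid, an assumption about the figure that the resulting numbers (-1 and 3) confirm.

```python
# A minimal sketch of the TD update on the example above.
V = {"A": 0.0, "B": 0.0, "C": 0.0, "D": 8.0, "E": -1.0}

def td_update(V, s, r, s2, alpha=0.5, gamma=1.0):
    sample = r + gamma * V[s2]                  # value of whatever successor occurred
    V[s] = (1 - alpha) * V[s] + alpha * sample  # running (exponential moving) average

td_update(V, "B", -2, "C")   # V(B): 0 -> -1
td_update(V, "C", -2, "D")   # V(C): 0 -> 3
```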
Active Reinforcement Learning
Full reinforcement learning: optimal policies (like value iteration)
- You don't know the transitions T(s, a, s')
- You don't know the rewards R(s, a, s')
- You choose the actions now
- Goal: learn the optimal policy / values
In this case:
- Learner makes choices!
- Fundamental tradeoff: exploration vs. exploitation
- This is NOT offline planning! You actually take actions in the world and find out what happens

Detour: Q-Value Iteration
Value iteration: find successive (depth-limited) values
- Start with $V_0(s) = 0$, which we know is right
- Given $V_k$, calculate the depth k+1 values for all states:
  $V_{k+1}(s) \leftarrow \max_a \sum_{s'} T(s, a, s') \, [ R(s, a, s') + \gamma V_k(s') ]$
But Q-values are more useful, so compute them instead
- Start with $Q_0(s, a) = 0$, which we know is right
- Given $Q_k$, calculate the depth k+1 q-values for all q-states:
  $Q_{k+1}(s, a) \leftarrow \sum_{s'} T(s, a, s') \, [ R(s, a, s') + \gamma \max_{a'} Q_k(s', a') ]$

Q-Learning
Q-Learning: sample-based Q-value iteration
Learn Q(s, a) values as you go (see the sketch below):
- Receive a sample (s, a, s', r)
- Consider your old estimate: $Q(s, a)$
- Consider your new sample estimate: $\mathrm{sample} = r + \gamma \max_{a'} Q(s', a')$
- Incorporate the new estimate into a running average:
  $Q(s, a) \leftarrow (1 - \alpha) \, Q(s, a) + \alpha \cdot \mathrm{sample}$

Q-Learning Properties
Amazing result: Q-learning converges to optimal policy -- even if you're acting suboptimally!
This is called off-policy learning.
Caveats:
- You have to explore enough
- You have to eventually make the learning rate small enough
- ... but not decrease it too quickly
- Basically, in the limit, it doesn't matter how you select actions (!)
[demo grid, crawler Q]

CS 188: Artificial Intelligence
Reinforcement Learning II
Dan Klein, Pieter Abbeel
University of California, Berkeley

Reinforcement Learning
We still assume an MDP:
- A set of states s in S
- A set of actions (per state) A
- A model T(s, a, s')
- A reward function R(s, a, s')
Still looking for a policy π(s)
New twist: don't know T or R
- I.e. don't know which states are good or what the actions do
- Must actually try actions and states out to learn
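Here is a minimal sketch of the tabular Q-learning update just described. The `actions_in(s)` helper, which returns the legal actions in a state, is an assumed interface for illustration.

```python
# A minimal sketch of the tabular Q-learning update.
from collections import defaultdict

Q = defaultdict(float)   # Q(s, a) table, implicitly initialized to 0

def q_learning_update(Q, s, a, r, s2, actions_in, alpha=0.5, gamma=0.9):
    # New sample estimate: reward plus discounted value of the best next action.
    # `default=0.0` covers terminal states with no legal actions.
    sample = r + gamma * max((Q[(s2, a2)] for a2 in actions_in(s2)), default=0.0)
    # Incorporate the sample into a running average of the old estimate.
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample
```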
The Story So Far: MDPs and RL
Known MDP: Offline Solution
  Goal                        Technique
  Compute V*, Q*, π*          Value / policy iteration
  Evaluate a fixed policy π   Policy evaluation
Unknown MDP: Model-Based
  Goal                        Technique
  Compute V*, Q*, π*          VI/PI on approx. MDP
  Evaluate a fixed policy π   PE on approx. MDP
Unknown MDP: Model-Free
  Goal                        Technique
  Compute V*, Q*, π*          Q-learning
  Evaluate a fixed policy π   Value learning

Model-Free Learning
Model-free (temporal difference) learning
- Experience world through episodes: (s, a, r, s', a', r', s'', ...)
- Update estimates each transition (s, a, r, s')
- Over time, updates will mimic Bellman updates

Q-Learning
- Q-Value Iteration (model-based, requires known MDP):
  $Q_{k+1}(s, a) \leftarrow \sum_{s'} T(s, a, s') \, [ R(s, a, s') + \gamma \max_{a'} Q_k(s', a') ]$
- Q-Learning (model-free, requires only experienced transitions):
  $Q(s, a) \leftarrow (1 - \alpha) \, Q(s, a) + \alpha \, [ r + \gamma \max_{a'} Q(s', a') ]$

Q-Learning
We'd like to do Q-value updates to each Q-state:
$Q_{k+1}(s, a) \leftarrow \sum_{s'} T(s, a, s') \, [ R(s, a, s') + \gamma \max_{a'} Q_k(s', a') ]$
- But can't compute this update without knowing T, R
Instead, compute average as we go
- Receive a sample transition (s, a, r, s')
- This sample suggests $Q(s, a) \approx r + \gamma \max_{a'} Q(s', a')$
- But we want to average over results from (s, a) (Why?)
- So keep a running average:
  $Q(s, a) \leftarrow (1 - \alpha) \, Q(s, a) + \alpha \, [ r + \gamma \max_{a'} Q(s', a') ]$

Q-Learning Properties
Amazing result: Q-learning converges to optimal policy -- even if you're acting suboptimally!
This is called off-policy learning.
Caveats:
- You have to explore enough
- You have to eventually make the learning rate small enough
- ... but not decrease it too quickly
- Basically, in the limit, it doesn't matter how you select actions (!)
[demo off policy]

Exploration vs. Exploitation

How to Explore?
Several schemes for forcing exploration
- Simplest: random actions (ε-greedy; see the sketch below)
  - Every time step, flip a coin
  - With (small) probability ε, act randomly
  - With (large) probability 1 - ε, act on current policy
Problems with random actions?
- You do eventually explore the space, but keep thrashing around once learning is done
- One solution: lower ε over time
- Another solution: exploration functions
[demo crawler]
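A minimal sketch of ε-greedy action selection, assuming the Q table and `actions_in` helper from the Q-learning sketch above.

```python
# A minimal sketch of epsilon-greedy exploration.
import random

def epsilon_greedy(Q, s, actions_in, epsilon=0.1):
    actions = list(actions_in(s))
    if random.random() < epsilon:
        return random.choice(actions)               # explore: random action
    return max(actions, key=lambda a: Q[(s, a)])    # exploit: current greedy policy
```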
Exploration Functions
When to explore?
- Random actions: explore a fixed amount
- Better idea: explore areas whose badness is not (yet) established, eventually stop exploring
Exploration function
- Takes a value estimate u and a visit count n, and returns an optimistic utility, e.g. $f(u, n) = u + k/n$
- Regular Q-update: $Q(s, a) \leftarrow_{\alpha} r + \gamma \max_{a'} Q(s', a')$
- Modified Q-update: $Q(s, a) \leftarrow_{\alpha} r + \gamma \max_{a'} f(Q(s', a'), N(s', a'))$
- Note: this propagates the "bonus" back to states that lead to unknown states as well! (A sketch of the modified update follows below.)
[demo crawler]

Regret
Even if you learn the optimal policy, you still make mistakes along the way!
Regret is a measure of your total mistake cost: the difference between your (expected) rewards, including youthful suboptimality, and optimal (expected) rewards.
Minimizing regret goes beyond learning to be optimal -- it requires optimally learning to be optimal.
Example: random exploration and exploration functions both end up optimal, but random exploration has higher regret.

Approximate Q-Learning

Generalizing Across States
Basic Q-Learning keeps a table of all q-values
In realistic situations, we cannot possibly learn about every single state!
- Too many states to visit them all in training
- Too many states to hold the q-tables in memory
Instead, we want to generalize:
- Learn about some small number of training states from experience
- Generalize that experience to new, similar situations
- This is a fundamental idea in machine learning, and we'll see it over and over again

Example: Pacman
Let's say we discover through experience that this state is bad:
In naive q-learning, we know nothing about this state. Or even this one!
[demo RL pacman]

Feature-Based Representations
Solution: describe a state using a vector of features (properties)
- Features are functions from states to real numbers (often 0/1) that capture important properties of the state
- Example features:
  - Distance to closest ghost
  - Distance to closest dot
  - Number of ghosts
  - 1 / (dist to dot)^2
  - Is Pacman in a tunnel? (0/1)
  - ... etc.
  - Is it the exact state on this slide?
- Can also describe a q-state (s, a) with features (e.g. action moves closer to food)
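A minimal sketch of the modified Q-update with an exploration function. The slide's example is f(u, n) = u + k/n; the sketch uses k/(n + 1) instead so that never-visited q-states stay finite, a small deviation from the slide. `actions_in` is the same assumed helper as in the earlier sketches.

```python
# A minimal sketch of Q-learning with an exploration-function bonus.
from collections import defaultdict

N = defaultdict(int)   # visit counts N(s, a)

def explore_f(u, n, k=2.0):
    # Optimistic utility: a bonus that shrinks as the visit count grows.
    return u + k / (n + 1)

def q_update_with_exploration(Q, N, s, a, r, s2, actions_in,
                              alpha=0.5, gamma=0.9):
    N[(s, a)] += 1
    # Back up optimistic utilities, so the bonus propagates to states
    # that lead to under-explored states as well.
    sample = r + gamma * max(
        (explore_f(Q[(s2, a2)], N[(s2, a2)]) for a2 in actions_in(s2)),
        default=0.0,
    )
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample
```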
Linear Value Functions
Using a feature representation, we can write a q function (or value function) for any state using a few weights:
$V(s) = w_1 f_1(s) + w_2 f_2(s) + \cdots + w_n f_n(s)$
$Q(s, a) = w_1 f_1(s, a) + w_2 f_2(s, a) + \cdots + w_n f_n(s, a)$
Advantage: our experience is summed up in a few powerful numbers
Disadvantage: states may share features but actually be very different in value!

Approximate Q-Learning
Q-learning with linear Q-functions (see the sketch below):
- transition = (s, a, r, s')
- $\mathrm{difference} = [ r + \gamma \max_{a'} Q(s', a') ] - Q(s, a)$
- Exact Q's: $Q(s, a) \leftarrow Q(s, a) + \alpha \cdot \mathrm{difference}$
- Approximate Q's: $w_i \leftarrow w_i + \alpha \cdot \mathrm{difference} \cdot f_i(s, a)$
Intuitive interpretation:
- Adjust weights of active features
- E.g., if something unexpectedly bad happens, blame the features that were on: disprefer all states with that state's features
Formal justification: online least squares

Example: Q-Pacman
[demo RL pacman]

Q-Learning and Least Squares

Linear Approximation: Regression*
[Figure: data points with a fitted line (one feature) and a fitted plane (two features)]
Prediction: $\hat{y} = w_0 + w_1 f_1(x)$
Prediction: $\hat{y}_i = w_0 + w_1 f_1(x_i) + w_2 f_2(x_i)$

Optimization: Least Squares*
$\mathrm{total\ error} = \sum_i (y_i - \hat{y}_i)^2 = \sum_i \Big( y_i - \sum_k w_k f_k(x_i) \Big)^2$
[Figure: observations, the prediction line, and the error (residual) between them]
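A minimal sketch of approximate Q-learning with a linear Q-function. `feature_fn(s, a)` is an assumed feature extractor returning a dict like {"dist-to-closest-dot": 0.5, ...}, and `actions_in` is the same assumed helper as before; neither is the course's actual API.

```python
# A minimal sketch of the approximate (linear) Q-learning weight update.
from collections import defaultdict

weights = defaultdict(float)

def q_value(weights, feats):
    # Q(s, a) = sum_i w_i * f_i(s, a)
    return sum(weights[name] * value for name, value in feats.items())

def approx_q_update(weights, feature_fn, s, a, r, s2, actions_in,
                    alpha=0.05, gamma=0.9):
    feats = feature_fn(s, a)
    next_q = max((q_value(weights, feature_fn(s2, a2)) for a2 in actions_in(s2)),
                 default=0.0)
    # difference = [r + gamma * max_a' Q(s', a')] - Q(s, a)
    difference = (r + gamma * next_q) - q_value(weights, feats)
    for name, value in feats.items():
        # Adjust the weights of the active features by the TD error.
        weights[name] += alpha * difference * value
```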
Minimizing Error*
Imagine we had only one point x, with features f(x), target value y, and weights w:
$\mathrm{error}(w) = \frac{1}{2} \Big( y - \sum_k w_k f_k(x) \Big)^2$
$\frac{\partial \, \mathrm{error}(w)}{\partial w_m} = - \Big( y - \sum_k w_k f_k(x) \Big) f_m(x)$
$w_m \leftarrow w_m + \alpha \Big( y - \sum_k w_k f_k(x) \Big) f_m(x)$
Approximate q update explained:
$w_m \leftarrow w_m + \alpha \big[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \big] f_m(s, a)$
Here $r + \gamma \max_{a'} Q(s', a')$ is the "target" and $Q(s, a)$ is the "prediction".

Overfitting: Why Limiting Capacity Can Help*
[Figure: a degree-15 polynomial fit to a handful of data points, oscillating wildly between them]

Policy Search
Problem: often the feature-based policies that work well (win games, maximize utilities) aren't the ones that approximate V / Q best
- E.g. your value functions from project 2 were probably horrible estimates of future rewards, but they still produced good decisions
- Q-learning's priority: get Q-values close (modeling)
- Action selection priority: get ordering of Q-values right (prediction)
- We'll see this distinction between modeling and prediction again later in the course
Solution: learn policies that maximize rewards, not the values that predict them
Policy search: start with an ok solution (e.g. Q-learning) then fine-tune by hill climbing on feature weights

Policy Search
Simplest policy search (a sketch follows at the end of this section):
- Start with an initial linear value function or Q-function
- Nudge each feature weight up and down and see if your policy is better than before
Problems:
- How do we tell whether the policy got better?
- Need to run many sample episodes!
- If there are a lot of features, this can be impractical
Better methods exploit lookahead structure, sample wisely, change multiple parameters...

Conclusion
We're done with Part I: Search and Planning!
We've seen how AI methods can solve problems in:
- Search
- Constraint Satisfaction Problems
- Games
- Markov Decision Problems
Next up: Part II: Uncertainty and Learning!
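As a closing sketch, here is one minimal way to realize the "simplest policy search" from the slides above: nudge each weight and keep the change only if the policy improves. `evaluate_policy(weights)` is an assumed (and expensive) helper that runs many sample episodes under the greedy policy for the given weights and returns the average reward; nothing here is the course's actual implementation.

```python
# A minimal sketch of hill climbing on feature weights (policy search).
def hill_climb(weights, feature_names, evaluate_policy, step=0.1, iterations=20):
    best = evaluate_policy(weights)
    for _ in range(iterations):
        for name in feature_names:
            for delta in (+step, -step):
                weights[name] += delta
                score = evaluate_policy(weights)   # needs many sample episodes!
                if score > best:
                    best = score                   # keep the improving nudge
                else:
                    weights[name] -= delta         # otherwise revert it
    return weights
```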