CS 188: Artificial Intelligence, Fall 08
Lecture 11: Reinforcement Learning
10/2/08
Dan Klein, UC Berkeley
Many slides over the course adapted from either Stuart Russell or Andrew Moore

Reinforcement Learning
- Reinforcement learning: still have an MDP:
  - A set of states s ∈ S
  - A set of actions (per state) A
  - A model T(s, a, s')
  - A reward function R(s, a, s')
- Still looking for a policy π(s) [DEMO]
- New twist: don't know T or R
  - I.e., don't know which states are good or what the actions do
  - Must actually try actions and states out to learn

Example: Animal Learning
- RL studied experimentally for more than 60 years in psychology
  - Rewards: food, pain, hunger, drugs, etc.
  - Mechanisms and sophistication debated
- Example: foraging
  - Bees learn near-optimal foraging plans in fields of artificial flowers with controlled nectar supplies
  - Bees have a direct neural connection from nectar intake measurement to motor planning areas

Example: Backgammon
- Reward only for win / loss in terminal states, zero otherwise
- TD-Gammon learns a function approximation to V(s) using a neural network
- Combined with depth 3 search, one of the top 3 players in the world
- You could imagine training Pacman this way... but it's tricky!

Passive Learning [DEMO: Optimal Policy]
- Simplified task
  - You don't know the transitions T(s, a, s')
  - You don't know the rewards R(s, a, s')
  - You are given a policy π(s)
  - Goal: learn the state values (and maybe the model)
  - I.e., policy evaluation
- In this case:
  - Learner is "along for the ride"
  - No choice about what actions to take
  - Just execute the policy and learn from experience
  - We'll get to the active case soon
- This is NOT offline planning!

Example: Direct Estimation
- Episodes in the gridworld (exits at (4,3) with +100 and (4,2) with -100; γ = 1, living reward R = -1):
  - ... (4,3) exit +100
  - ... (4,2) exit -100
- V(1,1) ≈ (92 + -106) / 2 = -7
- V(3,3) ≈ (99 + 97 + -102) / 3 = 31.3
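Direct estimation, as described above, just averages the returns actually observed from each state across episodes. The sketch below is illustrative only; the episode format (lists of `(state, reward)` pairs) and the name `direct_estimate` are my own assumptions, not course code.

```python
from collections import defaultdict

def direct_estimate(episodes, gamma=1.0):
    """Estimate V(s) as the average discounted return observed from the
    first visit to s in each episode.

    Each episode is a list of (state, reward) pairs, where reward is the
    reward received on leaving that state (the exit reward for the last
    entry).
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for episode in episodes:
        # Compute the return following each time step, working backwards.
        G = 0.0
        returns = [0.0] * len(episode)
        for t in range(len(episode) - 1, -1, -1):
            _, r = episode[t]
            G = r + gamma * G
            returns[t] = G
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s not in seen:  # first-visit averaging
                seen.add(s)
                totals[s] += returns[t]
                counts[s] += 1
    return {s: totals[s] / counts[s] for s in totals}
```

With two episodes through a state, the estimate is just the mean of the two observed returns, matching the V(1,1) ≈ (92 + -106) / 2 arithmetic on the slide.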
Model-Based Learning
- Idea:
  - Learn the model empirically (rather than values)
  - Solve the MDP as if the learned model were correct
- Empirical model learning
  - Simplest case:
    - Count outcomes for each s, a
    - Normalize to give estimate of T(s, a, s')
    - Discover R(s, a, s') the first time we experience (s, a, s')
  - More complex learners are possible (e.g. if we know that all squares have related action outcomes, e.g. "stationary noise")

Example: Model-Based Learning
- Episodes: ... (4,3) exit +100, (4,2) exit -100 (γ = 1)
- T(<3,3>, right, <4,3>) = 1 / 3
- T(<2,3>, right, <3,3>) = 2 / 2

Recap: Model-Based Policy Evaluation
- Simplified Bellman updates to calculate V for a fixed policy:
  - New V is expected one-step-lookahead using current V:
    V_{i+1}^π(s) ← Σ_{s'} T(s, π(s), s') [R(s, π(s), s') + γ V_i^π(s')]
  - Unfortunately, need T and R

Sample Avg to Replace Expectation?
- Who needs T and R? Approximate the expectation with samples (drawn from T!):
  - sample_i = R(s, π(s), s'_i) + γ V_i^π(s'_i)
  - V_{i+1}^π(s) ← (1/k) Σ_i sample_i

Model-Free Learning
- Big idea: why bother learning T?
  - Update V each time we experience a transition
  - Frequent outcomes will contribute more updates (over time)
- Temporal difference learning (TD)
  - Policy still fixed!
  - Move values toward value of whatever successor occurs: running average!
    sample = R(s, π(s), s') + γ V^π(s')
    V^π(s) ← (1-α) V^π(s) + α · sample

Example: TD Policy Evaluation
- Episodes: ... (4,2) exit -100, (4,3) exit +100
- Take γ = 1, α = 0.5
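The TD running-average update above fits in one function. A minimal sketch, assuming states are dict keys and unseen states default to value 0; the name `td_update` and the `terminal` flag are my own conventions.

```python
def td_update(V, s, r, s_next, alpha=0.5, gamma=1.0, terminal=False):
    """One temporal-difference update after observing transition (s, r, s').

    Moves V(s) toward the sample r + gamma * V(s'); for a transition into
    a terminal state the sample is just r.
    """
    sample = r + (0.0 if terminal else gamma * V.get(s_next, 0.0))
    V[s] = (1 - alpha) * V.get(s, 0.0) + alpha * sample
    return V
```

For example, with α = 0.5, γ = 1, V(B) = 0 and V(C) = 8, observing (B, -2, C) gives a sample of 6 and moves V(B) halfway there, to 3.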
Problems with TD Value Learning
- TD value learning is model-free for policy evaluation
- However, if we want to turn our value estimates into a policy, we're sunk:
  π(s) = argmax_a Σ_{s'} T(s, a, s') [R(s, a, s') + γ V(s')]
  ... which needs T and R again
- Idea: learn Q-values directly
- Makes action selection model-free too!

Active Learning
- Full reinforcement learning
  - You don't know the transitions T(s, a, s')
  - You don't know the rewards R(s, a, s')
  - You can choose any actions you like
  - Goal: learn the optimal policy (maybe values)
- In this case:
  - Learner makes choices!
  - Fundamental tradeoff: exploration vs. exploitation
  - This is NOT offline planning!

Model-Based Learning
- In general, want to learn the optimal policy, not evaluate a fixed policy
- Idea: adaptive dynamic programming
  - Learn an initial model of the environment
  - Solve for the optimal policy for this model (value or policy iteration)
  - Refine model through experience and repeat
- Crucial: we have to make sure we actually learn about all of the model

Example: Greedy ADP
- Imagine we find the lower path to the good exit first
- Some states will never be visited following this policy from (1,1)
- We'll keep re-using this policy because following it never collects the regions of the model we need to learn the optimal policy

What Went Wrong?
- Problem with following optimal policy for current model:
  - Never learn about better regions of the space if current policy neglects them
- Fundamental tradeoff: exploration vs. exploitation
  - Exploration: must take actions with suboptimal estimates to discover new rewards and increase eventual utility
  - Exploitation: once the true optimal policy is learned, exploration reduces utility
  - Systems must explore in the beginning and exploit in the limit

Q-Value Iteration
- Value iteration: find successive approx optimal values
  - Start with V_0*(s) = 0, which we know is right (why?)
  - Given V_i*, calculate the values for all states for depth i+1:
    V_{i+1}*(s) ← max_a Σ_{s'} T(s, a, s') [R(s, a, s') + γ V_i*(s')]
- But Q-values are more useful!
  - Start with Q_0*(s, a) = 0, which we know is right (why?)
  - Given Q_i*, calculate the q-values for all q-states for depth i+1:
    Q_{i+1}*(s, a) ← Σ_{s'} T(s, a, s') [R(s, a, s') + γ max_{a'} Q_i*(s', a')]
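The Q-value iteration update can be sketched directly from that formula. This is an illustrative toy implementation under my own assumptions: `T` and `R` are dicts keyed by `(s, a, s')`, all states share one action set, and terminal behavior is encoded in the model itself.

```python
def q_value_iteration(states, actions, T, R, gamma=0.9, iters=100):
    """Iterate Q_{k+1}(s,a) = sum_{s'} T(s,a,s') * (R(s,a,s') +
    gamma * max_{a'} Q_k(s',a')), starting from Q_0 = 0 everywhere.

    T maps (s, a, s') -> probability; R maps (s, a, s') -> reward.
    """
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(iters):
        newQ = {}
        for s in states:
            for a in actions:
                total = 0.0
                for s2 in states:
                    p = T.get((s, a, s2), 0.0)
                    if p > 0:
                        best_next = max(Q[(s2, a2)] for a2 in actions)
                        total += p * (R.get((s, a, s2), 0.0) + gamma * best_next)
                newQ[(s, a)] = total
        Q = newQ
    return Q
```

On a two-state chain where `s` transitions to an absorbing zero-reward state `t` for reward 1, this converges to Q*(s, go) = 1 and Q*(t, go) = 0.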
Q-Learning [DEMO: Grid Q]
- Learn Q*(s, a) values
  - Receive a sample (s, a, s', r)
  - Consider your old estimate: Q(s, a)
  - Consider your new sample estimate:
    sample = R(s, a, s') + γ max_{a'} Q(s', a')
  - Incorporate the new estimate into a running average:
    Q(s, a) ← (1-α) Q(s, a) + α · sample

Q-Learning Properties [DEMO: Grid Q]
- Will converge to optimal policy
  - If you explore enough
  - If you make the learning rate small enough
  - ... but not decrease it too quickly!
  - Basically doesn't matter how you select actions (!)
- Neat property: learns optimal q-values regardless of action selection noise (some caveats)

Exploration / Exploitation [DEMO: RL Pacman]
- Several schemes for forcing exploration
  - Simplest: random actions (ε-greedy)
    - Every time step, flip a coin
    - With probability ε, act randomly
    - With probability 1-ε, act according to current policy
- Problems with random actions?
  - You do explore the space, but keep thrashing around once learning is done
  - One solution: lower ε over time
  - Another solution: exploration functions

Exploration Functions
- When to explore
  - Random actions: explore a fixed amount
  - Better idea: explore areas whose badness is not (yet) established
- Exploration function
  - Takes a value estimate and a count, and returns an optimistic utility, e.g. f(u, n) = u + k/n (exact form not important)

Q-Learning [DEMO: Crawler Q]
- Q-learning produces tables of q-values:
- In realistic situations, we cannot possibly learn about every single state!
  - Too many states to visit them all in training
  - Too many states to hold the q-tables in memory
- Instead, we want to generalize:
  - Learn about some small number of training states from experience
  - Generalize that experience to new, similar states
  - This is a fundamental idea in machine learning, and we'll see it over and over again
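The q-learning running average and ε-greedy action selection above can be sketched together. A minimal illustration under my own conventions (q-values stored in a dict keyed by `(s, a)`, a `terminal` flag for exit transitions), not course code.

```python
import random

def q_learning_update(Q, s, a, r, s_next, actions,
                      alpha=0.1, gamma=0.9, terminal=False):
    """Blend the old estimate toward the sample
    r + gamma * max_{a'} Q(s', a')."""
    if terminal:
        sample = r
    else:
        sample = r + gamma * max(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * sample
    return Q

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """With probability epsilon act randomly; otherwise act greedily
    with respect to the current q-values."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))
```

Note that the update uses whatever transition actually occurred, regardless of how the action was chosen, which is why q-learning tolerates noisy action selection (off-policy learning).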
Example: Pacman
- Let's say we discover through experience that this state is bad:
- In naïve q-learning, we know nothing about this state or its q-states:
- Or even this one!

Feature-Based Representations
- Solution: describe a state using a vector of features
  - Features are functions from states to real numbers (often 0/1) that capture important properties of the state
  - Example features:
    - Distance to closest ghost
    - Distance to closest dot
    - Number of ghosts
    - 1 / (dist to dot)^2
    - Is Pacman in a tunnel? (0/1)
    - ... etc.
  - Can also describe a q-state (s, a) with features (e.g. action moves closer to food)

Linear Feature Functions
- Using a feature representation, we can write a q function (or value function) for any state using a few weights:
  V(s) = w_1 f_1(s) + w_2 f_2(s) + ... + w_n f_n(s)
  Q(s, a) = w_1 f_1(s, a) + w_2 f_2(s, a) + ... + w_n f_n(s, a)
- Advantage: our experience is summed up in a few powerful numbers
- Disadvantage: states may share features but be very different in value!

Function Approximation
- Q-learning with linear q-functions:
  - transition = (s, a, r, s')
  - correction = [r + γ max_{a'} Q(s', a')] - Q(s, a)
  - Q(s, a) ← Q(s, a) + α · correction (exact q's)
  - w_i ← w_i + α · correction · f_i(s, a) (approximate q's)
- Intuitive interpretation:
  - Adjust weights of active features
  - E.g. if something unexpectedly bad happens, disprefer all states with that state's features
- Formal justification: online least squares

Example: Q-Pacman
- (figure: a worked q-update for a linear q-function over Pacman features)

Linear Regression
- (figure: data points with a fitted line)
- Given examples (x, y)
- Predict y given a new point x
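The linear q-function and its weight update can be written in a few lines. A sketch under my own assumptions: features are a dict of name → activation, and the functions `q_value` / `approx_q_update` are hypothetical names, not the project API.

```python
def q_value(weights, features):
    """Q(s,a) = w . f(s,a), for a feature dict {name: activation}."""
    return sum(weights.get(name, 0.0) * value
               for name, value in features.items())

def approx_q_update(weights, features, target, alpha=0.01):
    """Approximate q-learning step: each weight moves in proportion to
    its feature's activation,
        w_i <- w_i + alpha * (target - Q(s,a)) * f_i(s,a),
    where target is the sample r + gamma * max_{a'} Q(s',a')."""
    correction = target - q_value(weights, features)
    for name, value in features.items():
        weights[name] = weights.get(name, 0.0) + alpha * correction * value
    return weights
```

This makes the "adjust weights of active features" intuition concrete: a feature with activation 0 is untouched, and a surprisingly bad outcome (negative correction) lowers the weight of every feature that was active.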
Ordinary Least Squares (OLS)
- (figure: observations y, predictions ŷ, and the errors or "residuals" between them)

Minimizing Error
- Imagine we had only one point x with features f(x):
  error(w) = 1/2 (y - Σ_k w_k f_k(x))^2
  ∂error/∂w_m = -(y - Σ_k w_k f_k(x)) f_m(x)
  w_m ← w_m + α (y - Σ_k w_k f_k(x)) f_m(x)
- Value update explained: the approximate q-update is exactly this gradient step, with target y = r + γ max_{a'} Q(s', a')

Overfitting
- (figure: a degree 15 polynomial threading exactly through the training points while oscillating wildly between them) [DEMO]

Policy Search
- Problem: often the feature-based policies that work well aren't the ones that approximate V / Q best
  - E.g. your value functions from project 2 were probably horrible estimates of future rewards, but they still produced good decisions
  - We'll see this distinction between modeling and prediction again later in the course
- Solution: learn the policy that maximizes rewards rather than the value that predicts rewards
- This is the idea behind policy search, such as what controlled the upside-down helicopter
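The online least squares justification above amounts to repeating that single-point gradient step over the data. A toy sketch for the one-feature case f(x) = x, with names and the epoch loop chosen by me for illustration.

```python
def online_least_squares(points, alpha=0.1, epochs=100):
    """Fit y ~ w * x by stochastic gradient descent on squared error.

    Each point nudges the weight by alpha * (y - w*x) * x, the same
    shape of update used for approximate q-learning (with the TD
    target playing the role of y).
    """
    w = 0.0
    for _ in range(epochs):
        for x, y in points:
            w += alpha * (y - w * x) * x
    return w
```

With a single example (x=1, y=2) and α = 0.5, the residual halves on every step, so the weight converges geometrically to 2.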
Policy Search
- Simplest policy search:
  - Start with an initial linear value function or q-function
  - Nudge each feature weight up and down and see if your policy is better than before
- Problems:
  - How do we tell the policy got better?
  - Need to run many sample episodes!
  - If there are a lot of features, this can be impractical

Policy Search*
- Advanced policy search:
  - Write a stochastic (soft) policy
  - Turns out you can efficiently approximate the derivative of the returns with respect to the parameters w (details in the book, but you don't have to know them)
  - Take uphill steps, recalculate derivatives, etc.

Take a Deep Breath...
- We're done with search and planning!
- Next, we'll look at how to reason with probabilities
  - Diagnosis
  - Tracking objects
  - Speech recognition
  - Robot mapping
  - Lots more!
- Last part of course: machine learning