Review of basic concepts for the final
The final
- worth 35%, 2 hours, in class, roughly 8 questions
- question types:
- some equations (e.g. write down the equation for such and such)
- word answers (explain some concept)
- numeric questions (e.g. calculate the return, or a prediction using the TD update equation)
- a case study where you formalize something as an MDP
The purpose is to help you see if you understand the major concepts covered in the course
Major topics covered so far
Incremental learning and acting (Ch2)
- n-armed bandits and algorithms
Formalizing the RL problem (Ch3)
- what's the task, what assumptions do we make, how do we define success
Simple solution methods (Ch4, 5, & 6)
- Dynamic programming: what if we had a distribution model and a finite MDP
- Monte Carlo: no model, learn from interaction with the world
- Temporal difference learning: no model, learn and act on each time step
Major topics
Advanced tabular solution methods (Ch7 & 8)
- n-step TD methods: multi-step updates, dealing with delayed or sparse reward
- learning, planning, and acting: learning and using a model to update value functions more efficiently
On-policy prediction with function approximation (Ch9)
- objective functions, semi-gradient methods
- linear function approximation
On-policy control with function approximation (Ch10)
- n-step semi-gradient Sarsa
Major topics
Eligibility traces (Ch12)
- the λ-return, forward and backward views, TD(λ), and different forms of eligibility traces
Linear off-policy gradient TD learning (Ch11)
- issues with TD and off-policy learning (counterexamples)
- basic ways to do off-policy learning (importance sampling, Q-learning, residual gradient, etc.)
Let's go through each in detail
Key Concepts in Ch2
Formalization of bandit problems! What assumptions do we make?
- one state
- we care about the expected reward for each arm
- how is this different from returns? where does gamma fit in?
- trying to find the best single arm
- actions have no consequence on future rewards
- how does this differ from MDPs?
Key Concepts in Ch2
Algorithms maintain estimates of action values online and incrementally
- update action values after every arm pull
- the policy can change with each arm pull
Non-stationary learning tasks
- how do we deal with these?
Fully incremental learning rules:
- Q_{t+1}(a) = Q_t(a) + α[R_t − Q_t(a)]
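A minimal sketch of this incremental rule in Python; the names and the sample rewards are illustrative, not from the course:

```python
def incremental_update(q_old, reward, step_size):
    """One incremental action-value update: Q_{t+1}(a) = Q_t(a) + alpha * (R_t - Q_t(a))."""
    return q_old + step_size * (reward - q_old)

# With step_size = 1/n this is the sample-average method; a constant step_size
# instead gives a recency-weighted average that can track non-stationary rewards.
q, n = 0.0, 0
for reward in [1.0, 0.0, 1.0, 1.0]:
    n += 1
    q = incremental_update(q, reward, 1.0 / n)
print(q)  # 0.75, the sample average of the four rewards
```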
Key Concepts in Ch2
Exploration vs exploitation
- ε-greedy (sketched below)
- optimistic initialization
- softmax
- UCB
- gradient bandits
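A minimal ε-greedy action-selection sketch; the function name and the example action values are illustrative only:

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a uniformly random arm, otherwise a greedy arm."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore
    return int(np.argmax(q_values))               # exploit (ties broken by lowest index)

rng = np.random.default_rng(0)
action = epsilon_greedy(np.array([0.2, 0.5, 0.1]), epsilon=0.1, rng=rng)
```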
Key Concepts in Ch3
Agent-environment interaction
- what are the key components?
This book is about finite MDPs
What is the Markov property, in math or words?
What is the goal of an RL system? Maximize expected return
Returns, episodic and continuing: know their definitions
Key Concepts in Ch3
State-value (v) and action-value (q) functions
- when do we use upper- or lower-case letters?
- can you convert between the expectation notation and the summation notation?
Why are the Bellman equations so important?
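For reference, the Bellman equation for v_π, written in the book's four-argument p(s', r | s, a) notation:

```latex
v_\pi(s) = \mathbb{E}_\pi\bigl[R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s\bigr]
         = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_\pi(s')\bigr]
```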
Key Concepts in Ch3
Given a problem description, can you formalize it as an MDP?
Key Concepts in Ch4
Assume we have the dynamics model and we don't interact with the world
- the planning setting
What is the policy evaluation problem?
Why are DP methods called iterative?
How do we construct DP methods?
- why does the initialization of the value function help?
Why do we not need to worry about exploration in DP?
What if the model is wrong? What value function will we learn?
Key Concepts in Ch4
Describe, in words or math, the policy improvement theorem
Why is the policy improvement theorem important for RL algorithms that learn value functions?
Some basic implications of the policy improvement theorem
Key Concepts in Ch4
Given v* and the one-step dynamics model, how do we select actions optimally?
What are the two components of policy iteration and how do they interact?
How does the policy iteration algorithm differ from value iteration?
Are these methods guaranteed to converge?
- how many steps do they take to converge?
What is a sweep? What is bootstrapping? What is a full backup?
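A minimal sketch of iterative policy evaluation with full backups over a finite MDP, to make the "sweep" and "full backup" ideas concrete; the dictionary-based dynamics format and all names are illustrative assumptions, not the book's pseudocode:

```python
def policy_evaluation(states, actions, p, pi, gamma=0.9, theta=1e-6):
    """Iterative policy evaluation with full backups.

    p[(s, a)] is a list of (prob, next_state, reward) triples (the dynamics model);
    pi[(s, a)] is the probability of taking a in s under the policy being evaluated.
    """
    V = {s: 0.0 for s in states}          # initialization; a good initial guess speeds convergence
    while True:
        delta = 0.0
        for s in states:                  # one sweep over the state space
            v_new = sum(pi[(s, a)] *
                        sum(prob * (r + gamma * V[s2]) for prob, s2, r in p[(s, a)])
                        for a in actions)  # full backup: expectation over actions and outcomes
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new                  # in-place update within the sweep
        if delta < theta:
            return V
```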
Key Concepts in Ch5
Why is it OK to average sample returns to estimate the value function?
Difference between first-visit and every-visit MC
Explain why the maintaining-exploration problem arises when learning optimal policies (policy improvement) but not in policy evaluation
- imagine learning Q(s,a) and learning π*
3 ways to handle the exploration problem:
- exploring starts, learning ϵ-soft policies, off-policy learning
What is an importance sampling ratio and why can it cause high variance?
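For reference, the importance sampling ratio over a trajectory from time t to T−1, with target policy π and behavior policy b; products of many per-step ratios are what drive the variance up:

```latex
\rho_{t:T-1} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}
```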
Key Concepts in Ch6
What is the update target of TD(0)?
- what is the basic update rule of TD?
How is TD(0) similar to MC and how is it similar to DP?
What are some of the advantages special to TD?
- TD methods bootstrap, so they don't need to wait for final outcomes (ends of episodes)
- TD methods can learn from experience without a model
What does it mean to converge to the correct predictions?
Why do TD and MC get different value estimates in the batch setting? Certainty equivalence
Key Concepts in Ch6
Why is policy evaluation, or learning vπ, called prediction?
Explain the differences between Sarsa, Q-learning, and Expected Sarsa
- the main update rules make this very clear (see the sketch below)
Why does Sarsa outperform Q-learning in the cliff-walking problem?
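A side-by-side sketch of the three tabular targets, which makes the differences explicit; Q is assumed to be an array indexed [state, action], and `action_probs` for Expected Sarsa is an assumed helper argument:

```python
import numpy as np

def sarsa_target(Q, r, s_next, a_next, gamma):
    # on-policy: bootstrap off the action actually taken next
    return r + gamma * Q[s_next, a_next]

def q_learning_target(Q, r, s_next, gamma):
    # off-policy: bootstrap off the greedy action, whatever is actually taken
    return r + gamma * np.max(Q[s_next])

def expected_sarsa_target(Q, r, s_next, gamma, action_probs):
    # on-policy in expectation: average over the policy's action probabilities
    return r + gamma * np.dot(action_probs, Q[s_next])

def td_update(Q, s, a, target, alpha):
    Q[s, a] += alpha * (target - Q[s, a])
```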
Key Concepts in Ch7
What is a 2-step, 3-step, …, n-step return? (see below)
How does updating toward n-step returns help over 1-step returns?
What are the main differences between the implementations of TD(0) and n-step TD?
How do n-step TD methods relate to MC and TD(0)?
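For reference, the n-step return that these methods update toward, bootstrapping from the value estimate n steps later:

```latex
G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^{n}\, V_{t+n-1}(S_{t+n})
```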
Key Concepts in Ch8
Difference between a simulation model and a distribution model
How can we use a model to update the policy?
- by updating the value function
What is the difference between real experience and simulated experience?
- interacting with the world vs. planning
How does real experience affect the planning process?
Key Concepts in Ch8
Why can the planning loop of Dyna (sketched below) be implemented without reducing the reactiveness of our agents?
What is the basic idea of Dyna-Q+?
- why does it help, and what is the change?
Why is it harder for the agent to react when the world changes to become easier?
What is the basic idea of prioritized sweeping?
- why does it improve over Dyna-Q so much?
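A minimal sketch of one Dyna-Q step (direct RL update, model learning, then planning updates from the learned model); the dictionary-based Q table and model, and all names, are illustrative assumptions:

```python
import random

def greedy_value(Q, s, num_actions):
    return max(Q.get((s, a), 0.0) for a in range(num_actions))

def dyna_q_step(Q, model, s, a, r, s_next, alpha, gamma, num_actions, n_planning=10):
    """One real step of tabular Dyna-Q followed by n_planning simulated updates."""
    # (a) direct RL: Q-learning update from real experience
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (
        r + gamma * greedy_value(Q, s_next, num_actions) - Q.get((s, a), 0.0))
    # (b) model learning: assume a deterministic world and store the last observed outcome
    model[(s, a)] = (r, s_next)
    # (c) planning: replay randomly chosen, previously visited (state, action) pairs
    for _ in range(n_planning):
        (sp, ap), (rp, sp_next) = random.choice(list(model.items()))
        Q[(sp, ap)] = Q.get((sp, ap), 0.0) + alpha * (
            rp + gamma * greedy_value(Q, sp_next, num_actions) - Q.get((sp, ap), 0.0))
```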
That was the material up to and including the quiz
Key Concepts in Ch9
Learning an approximate value function is a supervised learning problem
- we get samples (S_t, U_t) as labeled training examples, and we want to learn a parametric function v(s,θ) that generalizes well to new, unseen states
What is the equation for the MSVE? Can you explain the terms in it? (see below)
What are the conditions on E[U_t] such that we get a stochastic gradient descent algorithm?
Why is TD(0) with function approximation not a true gradient descent algorithm?
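For reference, the MSVE objective, with μ the state weighting (e.g. the on-policy distribution):

```latex
\mathrm{MSVE}(\theta) = \sum_{s \in \mathcal{S}} \mu(s)\,\bigl[v_\pi(s) - \hat{v}(s, \theta)\bigr]^2
```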
Key Concepts in Ch9
At a high level, what is the basic process for obtaining an update algorithm for v(s,θ), starting from the MSVE?
What is the gradient of v(s,θ) with linear function approximation?
How do we access the prediction v(s_t,θ) with linear function approximation?
What is the basic update rule for semi-gradient TD with linear function approximation? (see the sketch below)
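A minimal sketch of one semi-gradient TD(0) step with linear function approximation, where v(s,θ) = θᵀφ(s) and its gradient is just φ(s); all names are illustrative:

```python
import numpy as np

def semi_gradient_td0_step(theta, phi_s, r, phi_s_next, alpha, gamma, terminal=False):
    """theta <- theta + alpha * delta * phi(S_t), with delta the one-step TD error.

    phi_s and phi_s_next are feature vectors (numpy arrays) for S_t and S_{t+1};
    the bootstrap term is dropped when S_{t+1} is terminal.
    """
    v = theta @ phi_s                                   # prediction v(S_t, theta) = theta . phi(S_t)
    v_next = 0.0 if terminal else theta @ phi_s_next    # v(S_{t+1}, theta)
    delta = r + gamma * v_next - v                      # TD error
    return theta + alpha * delta * phi_s                # "semi": gradient of v(S_t) only, not of the target

theta = np.zeros(4)
theta = semi_gradient_td0_step(theta, np.array([1., 0., 0., 0.]), r=1.0,
                               phi_s_next=np.array([0., 1., 0., 0.]), alpha=0.1, gamma=0.9)
```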
Key Concepts in Ch9
Tile coding, RBFs, polynomial expansion, and the Fourier basis are all ways to construct feature vectors
What are some of the advantages of tile coding?
- binary features (fast implementation of TD; the norm of ɸ is constant; easy rule for setting the step-size)
- fast & robust implementations available
- achieves fast learning with wide tiles and good discrimination with a large number of tilings
- works well in low-dimensional domains
Explain how all these methods suffer from the curse of dimensionality
Key Concepts in Ch10
Extending the ideas of Ch9 to the control setting
Gradient descent rule for learning the parameters of q(s_t,a_t,θ)
Semi-gradient one-step Sarsa
Semi-gradient n-step Sarsa
Linear control: what's the gradient of q(s_t,a_t,θ), and how do we query the state-action value?
Why must the update in the terminal state be treated specially under function approximation? (see the sketch below)
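A minimal sketch of one-step semi-gradient Sarsa with linear function approximation over state-action features x(s,a); it also shows the special handling of the terminal state (no bootstrap term). All names are illustrative:

```python
def semi_gradient_sarsa_step(theta, x_sa, r, x_sa_next, alpha, gamma, terminal=False):
    """One-step semi-gradient Sarsa: theta <- theta + alpha * delta * x(S_t, A_t).

    x_sa and x_sa_next are feature vectors (numpy arrays) for the current and next
    state-action pairs; with linear q(s,a,theta) = theta @ x(s,a), the gradient is x(s,a).
    """
    q = theta @ x_sa
    q_next = 0.0 if terminal else theta @ x_sa_next   # terminal state: no bootstrap term
    delta = r + gamma * q_next - q                    # TD error for the (S, A, R, S', A') transition
    return theta + alpha * delta * x_sa
```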
Key Concepts in Ch12
All about the TD(λ) algorithm
λ-returns
- how λ-returns relate to n-step returns
- averaging all n-step returns with exponential weighting (see below)
How different values of λ relate to one-step TD and Monte Carlo
How TD(λ) is the same as, and different from, n-step TD
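For reference, the λ-return as an exponentially weighted average of n-step returns (λ = 0 recovers the one-step TD target; λ = 1 recovers the Monte Carlo return in the episodic case):

```latex
G_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{\,n-1}\, G_{t:t+n}
```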
Key Concepts in Ch12
What is the forward view and why can't we implement it directly?
The backward view uses eligibility traces and sends the TD error back to approximate the forward view (the λ-return algorithm)
3 types of eligibility traces and how they differ (a good way to see this is to look at their updates)
Linear semi-gradient TD(λ) update equations with accumulating traces (see the sketch below)
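A minimal sketch of one step of linear semi-gradient TD(λ) with accumulating traces; all names are illustrative:

```python
def td_lambda_step(theta, e, phi_s, r, phi_s_next, alpha, gamma, lam, terminal=False):
    """One step of linear semi-gradient TD(lambda) with accumulating traces.

    e is the eligibility trace vector (same shape as theta, a numpy array);
    reset it to zeros at the start of every episode.
    """
    e = gamma * lam * e + phi_s                        # accumulating trace: decay, then add grad v = phi(S_t)
    v = theta @ phi_s
    v_next = 0.0 if terminal else theta @ phi_s_next
    delta = r + gamma * v_next - v                     # TD error
    theta = theta + alpha * delta * e                  # send the TD error back along the trace
    return theta, e
```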
Key Concepts in Ch12
Using TD(λ) for policy evaluation inside generalized policy iteration is how you arrive at semi-gradient Sarsa(λ)
Bootstrapping seems to make a huge performance difference in control with linear function approximation
Key Concepts in Off-policy learning (Ch11)
Mainly focused on the instability of semi-gradient TD with off-policy sampling, and introduced the gradient-TD family of methods, which fixes this instability
Key Concepts in Off-policy learning (Ch11)
Understand how importance sampling can cause instability
- what about Baird's counterexample breaks TD?
The deadly triad: off-policy learning + function approximation + bootstrapping
How the gradient-TD method TDC differs from semi-gradient TD (see the sketch below)
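A minimal sketch of one linear TDC update, contrasted with semi-gradient TD: TDC maintains a second weight vector w that estimates the expected TD error given the features, and uses it to correct the update. This follows the standard form of the algorithm; all names are illustrative:

```python
def tdc_step(theta, w, phi, r, phi_next, rho, alpha, beta, gamma):
    """One linear TDC update (theta: value weights, w: auxiliary weights, numpy arrays).

    rho = pi(A|S) / b(A|S) is the per-step importance sampling ratio;
    semi-gradient TD would use only the first term, alpha * rho * delta * phi.
    """
    delta = r + gamma * (theta @ phi_next) - theta @ phi                        # TD error
    theta = theta + alpha * rho * (delta * phi - gamma * phi_next * (phi @ w))  # corrected update
    w = w + beta * rho * (delta - phi @ w) * phi                                # tracks E[delta | features]
    return theta, w
```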
Will topic X that you did not cover in this review be on the exam? Ask me now!