Review of basic concepts for final

The final: worth 35%, 2 hours, in class, ~8 questions.
Question types:
- some equations (e.g. write down the equation for such and such)
- word answers (explain some concept)
- numeric questions (e.g. calculate the return or a prediction using the TD update equation)
- a case study where you formalize something as an MDP
The purpose is to help you see whether you understand the major concepts covered in the course.

Major topics covered so far
Incremental learning and acting (Ch2)
- n-armed bandits and algorithms
Formalizing the RL problem (Ch3)
- what's the task, what are the assumptions, how do we define success
Simple solution methods (Ch4, 5, & 6)
- Dynamic programming: what if we had a distribution model and a finite MDP
- Monte Carlo: no model, learn from interaction with the world
- Temporal difference learning: no model, learn and act on each time step

Major topics
Advanced tabular solution methods (Ch7 & 8)
- n-step TD methods: multi-step updates, dealing with delayed or sparse reward
- learning, planning, and acting: learning a model and using it to update value functions more efficiently
On-policy prediction with function approximation (Ch9)
- objective functions, semi-gradient methods
- linear function approximation
On-policy control with function approximation (Ch10)
- n-step semi-gradient Sarsa

Major topics
Eligibility traces (Ch12)
- the λ-return, forward and backward views, TD(λ), and different forms of eligibility traces
Linear off-policy gradient TD learning (Ch11)
- issues with TD and off-policy learning (counterexamples)
- basic ways to do off-policy learning (importance sampling, Q-learning, residual gradient, etc.)

Let's go through each in detail.

Key Concepts in Ch2
Formalization of bandit problems! What assumptions do we make?
- one state
- we care about the expected reward for each arm (how is this different from returns? what about gamma?)
- we are trying to find the best single arm
- actions have no consequence on future rewards (how does this differ from MDPs?)

Key Concepts in Ch2
Algorithms maintain estimates of action values online and incrementally
- update action values after every arm pull
- the policy can change with each arm pull
Non-stationary learning tasks
- how do we deal with this?
Fully incremental learning rules:
- Q_{t+1}(a) = Q_t(a) + α[R_t - Q_t(a)]
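A minimal sketch of this incremental rule (the names and the constant step-size option are illustrative assumptions, not from the slides); a constant α tracks non-stationary problems, while α = 1/N(a) recovers the sample-average estimate:

def update_action_value(q, counts, a, reward, alpha=None):
    # q: list of action-value estimates Q(a); counts: number of pulls per arm
    counts[a] += 1
    # constant alpha tracks non-stationary targets; 1/N(a) gives the sample average
    step = alpha if alpha is not None else 1.0 / counts[a]
    q[a] += step * (reward - q[a])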

Key Concepts in Ch2
Exploration vs exploitation (a minimal ε-greedy sketch follows below)
- epsilon-greedy
- optimistic initialization
- softmax
- UCB
- gradient bandits
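A minimal epsilon-greedy selection sketch to go with the list above (the epsilon value and tie-breaking are illustrative assumptions):

import random

def epsilon_greedy(q, epsilon=0.1):
    # Explore uniformly at random with probability epsilon,
    # otherwise exploit the current greedy action.
    if random.random() < epsilon:
        return random.randrange(len(q))
    return max(range(len(q)), key=lambda a: q[a])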

Key Concepts in Ch3
Agent-environment interaction
- what are the key components?
This book is about finite MDPs.
What is the Markov property, in math or in words?
What is the goal of an RL system? Maximize expected return.
Returns, episodic and continuing: know their definitions.

Key Concepts in Ch3
State-value (v) and action-value (q) functions
- when do we use upper- or lower-case letters?
- can you convert between the expectation notation and the summation notation?
Why are Bellman equations so important?
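For reference (standard Sutton & Barto notation), the discounted return and the Bellman equation for v_π in expectation and summation form:

G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}

v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s] = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_\pi(s') \right]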

Key Concepts in Ch3
Given a problem description, can you formalize it as an MDP?

Key Concepts in Ch4
Assume we have the dynamics model and we don't interact with the world
- the planning setting
What is the policy evaluation problem?
Why are DP methods called iterative?
How do we construct DP methods?
- why does initialization of the value function help?
Why do we not need to worry about exploration in DP?
What if the model is wrong? What value function will we learn?

Key Concepts in Ch4
Describe, in words or in math, the policy improvement theorem.
Why is the policy improvement theorem important for RL algorithms that learn value functions?
Know some basic implications of the policy improvement theorem.
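For reference, the policy improvement theorem in one line: if π' dominates π on the action values of π, then π' is at least as good as π,

q_\pi(s, \pi'(s)) \ge v_\pi(s) \ \ \forall s \in \mathcal{S} \quad \Longrightarrow \quad v_{\pi'}(s) \ge v_\pi(s) \ \ \forall s \in \mathcal{S}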

Key Concepts in Ch4
Given v* and the one-step dynamics model, how do we select actions optimally?
What are the two components of policy iteration and how do they interact?
How does policy iteration differ from value iteration?
Are these methods guaranteed to converge?
- how many steps do they take to converge?
What is a sweep? What is bootstrapping? What is a full backup?
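For reference, acting greedily with respect to v_* using the one-step dynamics model, and the value iteration update, which folds one step of improvement into each evaluation sweep:

\pi_*(s) = \arg\max_a \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_*(s') \right]

v_{k+1}(s) = \max_a \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_k(s') \right]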

Key Concepts in Ch5
Why is it OK to average sample returns to estimate the value function?
Difference between first-visit and every-visit MC
Explain why the maintaining-exploration problem arises when learning optimal policies (policy improvement) but not for policy evaluation
- imagine learning Q(s,a) and learning π*
Three ways to handle the exploration problem:
- exploring starts, learning ε-soft policies, off-policy learning
What is an importance sampling ratio and why can it cause high variance?
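For reference, the importance sampling ratio for a trajectory from time t to T under target policy π and behaviour policy b; products of many such terms are what drive the variance up:

\rho_{t:T-1} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}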

Key Concepts in Ch6
What is the update target of TD(0)?
- what is the basic update rule of TD?
How is TD(0) similar to MC and how is it similar to DP?
What are some of the advantages special to TD?
- TD methods bootstrap, so they don't need to wait for final outcomes (end of episodes)
- TD methods can learn from experience without a model
What does it mean to converge to the correct predictions?
Why do TD and MC get different value estimates in the batch setting? Certainty equivalence.
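For reference, the tabular TD(0) update; its target is R_{t+1} + γV(S_{t+1}), which bootstraps like DP but is built from sampled transitions like MC:

V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]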

Key Concepts in Ch6
Why is policy evaluation, or learning v_π, called prediction?
Explain the differences between Sarsa, Q-learning, and Expected Sarsa
- the main update rules make this very clear (see the updates below)
Why does Sarsa outperform Q-learning in the cliff-walking problem?
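The three one-step control updates side by side; the only difference is how the next state is valued:

Sarsa: Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right]
Q-learning: Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) \right]
Expected Sarsa: Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1}) Q(S_{t+1}, a) - Q(S_t, A_t) \right]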

Key Concepts in Ch7
What is a 2-step, 3-step, ..., n-step return?
How does updating toward n-step returns help over 1-step returns?
What are the main differences between the implementations of TD(0) and n-step TD?
How do n-step TD methods relate to MC and TD(0)?
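For reference, the n-step return, which bootstraps from the value estimate n steps ahead:

G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n V_{t+n-1}(S_{t+n})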

Key Concepts in Ch8
Difference between a simulation model and a distribution model
How can we use a model to update the policy?
- by updating the value function
What is the difference between real experience and simulated experience?
- interacting with the world vs. planning
How does real experience affect the planning process?

Key Concepts in Ch8
Why can the planning loop of Dyna be implemented without reducing the reactiveness of our agents?
What is the basic idea of Dyna-Q+?
- why does it help? what is the change?
Why is it harder for the agent to react when the world changes to become easier?
What is the basic idea of prioritized sweeping?
- why does it improve over Dyna-Q so much?
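A minimal sketch of the Dyna-Q planning loop (the data structures and names are illustrative assumptions, not from the slides): each planning step replays a previously observed (s, a) pair from the learned model and applies the same Q-learning update used for real experience.

import random

def dyna_q_planning(Q, model, alpha, gamma, n_planning_steps):
    # Q: dict state -> dict action -> value
    # model: dict (state, action) -> (reward, next_state), built from real experience
    for _ in range(n_planning_steps):
        s, a = random.choice(list(model.keys()))   # previously visited state-action pair
        r, s_next = model[(s, a)]                  # simulated experience from the model
        best_next = max(Q[s_next].values()) if Q.get(s_next) else 0.0
        Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])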

That was the material up to and including the quiz

Key Concepts in Ch9
Learning an approximate value function is a supervised learning problem
- we get training samples (S_t, U_t), where U_t is the update target, and we want to learn a parametric function v̂(s, θ) that generalizes well to new, unseen states
What is the equation for the MSVE? Can you explain the terms in it?
What are the conditions on E[U_t] such that we get a stochastic gradient descent algorithm?
Why is TD(0) with function approximation not a true gradient descent algorithm?
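For reference, the mean squared value error, weighted by the state distribution d(s):

\mathrm{MSVE}(\theta) = \sum_{s \in \mathcal{S}} d(s) \left[ v_\pi(s) - \hat{v}(s, \theta) \right]^2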

Key Concepts in Ch9
At a high level, what is the basic process of obtaining an update algorithm for v̂(s, θ), starting from the MSVE?
What is the gradient of v̂(s, θ) with linear function approximation?
How do we access the prediction v̂(S_t, θ) with linear function approximation?
What is the basic update rule for semi-gradient TD with linear function approximation?
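A minimal numpy sketch of one semi-gradient TD(0) step with linear function approximation (the argument names are illustrative); with v̂(s, θ) = θᵀφ(s), the prediction is a dot product and the gradient is just φ(s):

import numpy as np

def semi_gradient_td0_step(theta, phi_s, phi_s_next, reward, alpha, gamma, terminal=False):
    # theta, phi_s, phi_s_next: 1-D np.ndarray; linear prediction v_hat(s, theta) = theta . phi(s)
    v_s = theta @ phi_s
    v_next = 0.0 if terminal else theta @ phi_s_next
    delta = reward + gamma * v_next - v_s      # TD error
    return theta + alpha * delta * phi_s       # gradient of v_hat w.r.t. theta is phi(s)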

Key Concepts in Ch9
Tile coding, RBFs, polynomial expansion, and the Fourier basis are all ways to construct feature vectors
What are some of the advantages of tile coding?
- binary features (fast implementation of TD; the norm of φ is constant; easy rule for setting the step size)
- fast and robust implementations available
- achieves fast learning with wide tiles and good discrimination with a large number of tilings
- works well in low-dimensional domains
Explain how all these methods suffer from the curse of dimensionality

Key Concepts in Ch10
Extending the ideas of Ch9 to the control setting
Gradient descent rule for learning the parameters of q̂(S_t, A_t, θ)
Semi-gradient one-step Sarsa
Semi-gradient n-step Sarsa
Linear control: what is the gradient of q̂(S_t, A_t, θ), and how do we query the state-action value?
Why must the update at the terminal state be treated specially under function approximation?
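For reference, the episodic semi-gradient one-step Sarsa update; at the terminal state the bootstrap term is dropped (treated as zero):

\theta_{t+1} = \theta_t + \alpha \left[ R_{t+1} + \gamma\, \hat{q}(S_{t+1}, A_{t+1}, \theta_t) - \hat{q}(S_t, A_t, \theta_t) \right] \nabla \hat{q}(S_t, A_t, \theta_t)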

Key Concepts in Ch12
All about the TD(λ) algorithm
λ-returns
- how λ-returns relate to n-step returns
- averaging all n-step returns with exponential weighting
- how different values of λ relate to one-step TD and Monte Carlo
How TD(λ) is the same as, and different from, n-step TD
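For reference, the λ-return as an exponentially weighted average of the n-step returns; λ = 0 recovers the one-step TD target and λ = 1 the Monte Carlo return:

G_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_{t:t+n}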

Key Concepts in Ch12
What is the forward view, and why can we not implement it?
The backward view uses eligibility traces and sends the TD error back to approximate the forward view (the λ-return algorithm)
Three types of eligibility traces and how they differ (a good way to see this is to look at their updates)
Linear semi-gradient TD(λ) update equations with accumulating traces (see below)
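For reference, linear semi-gradient TD(λ) with accumulating traces, with φ_t = φ(S_t) and e_{-1} = 0:

\delta_t = R_{t+1} + \gamma\, \theta_t^\top \phi_{t+1} - \theta_t^\top \phi_t
e_t = \gamma \lambda\, e_{t-1} + \phi_t
\theta_{t+1} = \theta_t + \alpha\, \delta_t\, e_t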

Key Concepts in Ch12
Using TD(λ) for policy evaluation inside generalized policy iteration is how you arrive at semi-gradient Sarsa(λ)
Bootstrapping seems to make a huge performance difference in control with linear function approximation

Key Concepts in Off-policy Learning (Ch11)
Mainly focused on the instability of semi-gradient TD with off-policy sampling, and introduced the gradient TD family of methods, which fixes this instability

Key Concepts in Off-policy Learning (Ch11)
Understand how importance sampling can cause instability
- what is it about Baird's counterexample that breaks TD?
The deadly triad: off-policy learning + function approximation + bootstrapping
How the gradient TD method TDC differs from semi-gradient TD (see below)
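As a hedged reference (the standard linear TDC update with importance sampling; the choice of θ for the main weights, w for the auxiliary weights, and ρ_t for the importance sampling ratio is notation assumed here, not taken from the slides), the extra correction term is what distinguishes TDC from semi-gradient TD, which would stop after α ρ_t δ_t φ_t:

\delta_t = R_{t+1} + \gamma\, \theta_t^\top \phi_{t+1} - \theta_t^\top \phi_t
\theta_{t+1} = \theta_t + \alpha\, \rho_t \left( \delta_t\, \phi_t - \gamma\, \phi_{t+1} (\phi_t^\top w_t) \right)
w_{t+1} = w_t + \beta\, \rho_t \left( \delta_t - \phi_t^\top w_t \right) \phi_t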

Will topic X that you did not cover in this review be on the exam? Ask me now!