Lecture 14: MCTS. Emma Brunskill. CS234 Reinforcement Learning, Winter 2018. With many slides from or derived from David Silver.

Class Structure. Last time: Batch RL. This time: MCTS. Next time: Human in the Loop RL.

Table of Contents: 1 Introduction; 2 Model-Based Reinforcement Learning; 3 Simulation-Based Search; 4 Integrated Architectures.

Model-Based Reinforcement Learning. Previous lectures: learn a value function or policy directly from experience. This lecture: learn a model directly from experience, and use planning to construct a value function or policy. Integrate learning and planning into a single architecture.

Model-Based and Model-Free RL. Model-Free RL: no model; learn a value function (and/or policy) from experience.

Model-Based and Model-Free RL. Model-Free RL: no model; learn a value function (and/or policy) from experience. Model-Based RL: learn a model from experience; plan a value function (and/or policy) from the model.

Model-Free RL

Model-Based RL

Table of Contents: 1 Introduction; 2 Model-Based Reinforcement Learning; 3 Simulation-Based Search; 4 Integrated Architectures.

Model-Based RL

Advantages of Model-Based RL. Advantages: can efficiently learn a model by supervised learning methods; can reason about model uncertainty (as in upper confidence bound methods for exploration/exploitation trade-offs). Disadvantages: first learn a model, then construct a value function, giving two sources of approximation error.

MDP Model Refresher. A model $M$ is a representation of an MDP $\langle S, A, P, R \rangle$, parametrized by $\eta$. We will assume the state space $S$ and action space $A$ are known, so a model $M = \langle P_\eta, R_\eta \rangle$ represents state transitions $P_\eta \approx P$ and rewards $R_\eta \approx R$: $S_{t+1} \sim P_\eta(S_{t+1} \mid S_t, A_t)$ and $R_{t+1} = R_\eta(R_{t+1} \mid S_t, A_t)$. Typically we assume conditional independence between state transitions and rewards: $P[S_{t+1}, R_{t+1} \mid S_t, A_t] = P[S_{t+1} \mid S_t, A_t] \, P[R_{t+1} \mid S_t, A_t]$.

Model Learning. Goal: estimate model $M_\eta$ from experience $\{S_1, A_1, R_2, \ldots, S_T\}$. This is a supervised learning problem: $S_1, A_1 \to R_2, S_2$; $S_2, A_2 \to R_3, S_3$; $\ldots$; $S_{T-1}, A_{T-1} \to R_T, S_T$. Learning $s, a \to r$ is a regression problem. Learning $s, a \to s'$ is a density estimation problem. Pick a loss function, e.g. mean-squared error, KL divergence, ..., and find parameters $\eta$ that minimize the empirical loss.
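To make the supervised-learning view concrete, here is a minimal sketch (my own illustration, not from the slides) that turns a trajectory into $(s, a) \to (r, s')$ training pairs, fits a least-squares reward model (MSE loss), and fits a count-based transition model (the maximum-likelihood estimate under a log-likelihood/KL loss). The one-hot feature encoding and all function names are assumptions.

```python
import numpy as np
from collections import defaultdict

def build_dataset(trajectory):
    """Split a trajectory [(s, a, r, s_next), ...] into model-learning targets."""
    inputs = [(s, a) for (s, a, r, s_next) in trajectory]
    rewards = [r for (s, a, r, s_next) in trajectory]
    nexts = [s_next for (s, a, r, s_next) in trajectory]
    return inputs, rewards, nexts

def fit_reward_model(inputs, rewards, n_states, n_actions):
    """Regression for s, a -> r: one-hot (s, a) features, least squares (MSE loss)."""
    X = np.zeros((len(inputs), n_states * n_actions))
    for i, (s, a) in enumerate(inputs):
        X[i, s * n_actions + a] = 1.0
    w, *_ = np.linalg.lstsq(X, np.array(rewards, dtype=float), rcond=None)
    return lambda s, a: w[s * n_actions + a]

def fit_transition_model(inputs, nexts):
    """Density estimation for s, a -> s': empirical next-state distribution."""
    counts = defaultdict(lambda: defaultdict(int))
    for (s, a), s_next in zip(inputs, nexts):
        counts[(s, a)][s_next] += 1
    def P(s, a):
        c = counts[(s, a)]
        total = sum(c.values())
        return {s2: n / total for s2, n in c.items()}
    return P
```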

Examples of Models: Table Lookup Model, Linear Expectation Model, Linear Gaussian Model, Gaussian Process Model, Deep Belief Network Model, ...

Table Lookup Model. The model is an explicit MDP, $\hat{P}$, $\hat{R}$. Count visits $N(s, a)$ to each state-action pair: $\hat{P}^a_{s,s'} = \frac{1}{N(s,a)} \sum_{t=1}^{T} \mathbf{1}(S_t, A_t, S_{t+1} = s, a, s')$ and $\hat{R}^a_s = \frac{1}{N(s,a)} \sum_{t=1}^{T} \mathbf{1}(S_t, A_t = s, a)\, R_{t+1}$. Alternatively: at each time-step $t$, record the experience tuple $\langle S_t, A_t, R_{t+1}, S_{t+1} \rangle$; to sample the model, randomly pick a tuple matching $\langle s, a, \cdot, \cdot \rangle$.
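A minimal table-lookup model sketch (my own illustration): count visits, estimate $\hat{P}$ and $\hat{R}$, and sample transitions. The class and method names, and the assumption of hashable discrete states and actions, are mine.

```python
import random
from collections import defaultdict

class TableLookupModel:
    """Tabular MLE model: empirical transition probabilities and mean rewards."""
    def __init__(self):
        self.visits = defaultdict(int)                             # N(s, a)
        self.next_counts = defaultdict(lambda: defaultdict(int))   # counts of s' per (s, a)
        self.reward_sum = defaultdict(float)                       # sum of rewards per (s, a)

    def update(self, s, a, r, s_next):
        self.visits[(s, a)] += 1
        self.next_counts[(s, a)][s_next] += 1
        self.reward_sum[(s, a)] += r

    def reward(self, s, a):
        return self.reward_sum[(s, a)] / self.visits[(s, a)]

    def sample(self, s, a):
        """Draw (r, s') from the estimated model, as used in sample-based planning."""
        n = self.visits[(s, a)]
        states = list(self.next_counts[(s, a)].keys())
        probs = [self.next_counts[(s, a)][s2] / n for s2 in states]
        s_next = random.choices(states, weights=probs, k=1)[0]
        return self.reward(s, a), s_next
```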

AB Example. Two states A, B; no discounting; 8 episodes of experience. We have constructed a table lookup model from the experience. Recall: for a particular policy, TD with a tabular representation and infinite experience replay will converge to the same value as would be computed by building the MLE model and planning with it. Check Your Memory: will MC methods converge to the same solution?

Planning with a Model. Given a model $M_\eta = \langle P_\eta, R_\eta \rangle$, solve the MDP $\langle S, A, P_\eta, R_\eta \rangle$ using your favourite planning algorithm: value iteration, policy iteration, tree search, ...
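For instance, planning in the estimated MDP can be plain value iteration. A minimal sketch, assuming dictionary representations of $P_\eta$ and $R_\eta$ (e.g. produced by the table-lookup model above); the function name and convergence threshold are mine.

```python
def value_iteration(P, R, states, actions, gamma=0.95, theta=1e-6):
    """Plan in the estimated MDP <S, A, P_eta, R_eta>.

    P[(s, a)] is assumed to be a dict {s_next: probability} and
    R[(s, a)] the expected immediate reward for that state-action pair.
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = max(
                R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())
                for a in actions
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V
```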

Sample-Based Planning. A simple but powerful approach to planning: use the model only to generate samples. Sample experience from the model: $S_{t+1} \sim P_\eta(S_{t+1} \mid S_t, A_t)$, $R_{t+1} = R_\eta(R_{t+1} \mid S_t, A_t)$. Apply model-free RL to the samples, e.g. Monte-Carlo control, Sarsa, or Q-learning. Sample-based planning methods are often more efficient.
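A sketch of sample-based planning (my own illustration): draw transitions from the learned model and apply ordinary Q-learning to them. The `model.sample(s, a)` interface returning `(r, s_next)` is an assumption, matching the table-lookup sketch above, and the hyperparameters are placeholders.

```python
import random
from collections import defaultdict

def sample_based_planning(model, states, actions, n_updates=10000,
                          gamma=0.95, alpha=0.1, epsilon=0.1):
    """Q-learning applied to experience sampled from a learned model."""
    Q = defaultdict(float)
    s = random.choice(states)
    for _ in range(n_updates):
        # epsilon-greedy action selection in the simulated experience
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda a_: Q[(s, a_)])
        r, s_next = model.sample(s, a)           # assumed model interface
        # standard Q-learning backup on the sampled transition
        target = r + gamma * max(Q[(s_next, a_)] for a_ in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s_next
    return Q
```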

Back to the AB Example. Construct a table-lookup model from real experience; apply model-free RL to sampled experience. Real experience: A,0,B,0; B,1; B,1; B,1; B,1; B,1; B,1; B,0. Sampled experience: B,1; B,0; B,1; A,0,B,1; B,1; A,0,B,1; B,1; B,0. E.g. Monte-Carlo learning gives $V(A) = 1$, $V(B) = 0.75$. Check Your Memory: what would MC on the original experience have converged to?

Planning with an Inaccurate Model. Given an imperfect model $\langle P_\eta, R_\eta \rangle \neq \langle P, R \rangle$, the performance of model-based RL is limited to the optimal policy for the approximate MDP $\langle S, A, P_\eta, R_\eta \rangle$, i.e. model-based RL is only as good as the estimated model. When the model is inaccurate, the planning process will compute a sub-optimal policy. Solution 1: when the model is wrong, use model-free RL. Solution 2: reason explicitly about model uncertainty (see the lectures on exploration/exploitation).

Table of Contents: 1 Introduction; 2 Model-Based Reinforcement Learning; 3 Simulation-Based Search; 4 Integrated Architectures.

Forward Search. Forward search algorithms select the best action by lookahead. They build a search tree with the current state $s_t$ at the root, using a model of the MDP to look ahead. No need to solve the whole MDP, just the sub-MDP starting from now.

Simulation-Based Search. Forward search paradigm using sample-based planning. Simulate episodes of experience from now with the model. Apply model-free RL to the simulated episodes.

Simulation-Based Search (2). Simulate episodes of experience from now with the model: $\{S_t^k, A_t^k, R_{t+1}^k, \ldots, S_T^k\}_{k=1}^K \sim M_\nu$. Apply model-free RL to the simulated episodes: Monte-Carlo control → Monte-Carlo search; Sarsa → TD search.

Simple Monte-Carlo Search. Given a model $M_\nu$ and a simulation policy $\pi$. For each action $a \in A$, simulate $K$ episodes from the current (real) state $s_t$: $\{s_t, a, R_{t+1}^k, \ldots, S_T^k\}_{k=1}^K \sim M_\nu, \pi$. Evaluate actions by mean return (Monte-Carlo evaluation): $Q(s_t, a) = \frac{1}{K} \sum_{k=1}^{K} G_t^k \xrightarrow{P} q_\pi(s_t, a)$ (1). Select the current (real) action with maximum value: $a_t = \arg\max_{a \in A} Q(s_t, a)$.
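A minimal sketch of simple Monte-Carlo search (my own illustration): for each root action, average the returns of $K$ simulated rollouts, then act greedily on those estimates. The `model.sample(s, a)` interface, the uniform-random simulation policy, and the horizon cutoff are assumptions.

```python
import random

def simple_mc_search(model, state, actions, n_episodes=50,
                     gamma=1.0, horizon=100, is_terminal=lambda s: False):
    """Evaluate each root action by mean rollout return, then pick the argmax."""
    def rollout(s, a):
        total, discount = 0.0, 1.0
        for _ in range(horizon):
            r, s = model.sample(s, a)            # assumed model interface
            total += discount * r
            discount *= gamma
            if is_terminal(s):
                break
            a = random.choice(actions)           # simulation policy: uniform random
        return total

    q = {a: sum(rollout(state, a) for _ in range(n_episodes)) / n_episodes
         for a in actions}
    return max(q, key=q.get)                     # a_t = argmax_a Q(s_t, a)
```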

Recall Expectimax Tree. If we have an MDP model $M_\nu$, we can compute optimal $q(s, a)$ values for the current state by constructing an expectimax tree. Limitation: the size of the tree scales exponentially with the lookahead horizon.

Monte-Carlo Tree Search (MCTS). Given a model $M_\nu$: build a search tree rooted at the current state $s_t$, sampling actions and next states. Iteratively construct and update the tree by performing $K$ simulation episodes starting from the root state. After the search is finished, select the current (real) action with maximum value in the search tree: $a_t = \arg\max_{a \in A} Q(s_t, a)$.

Monte-Carlo Tree Search. Simulating an episode involves two phases (in-tree, out-of-tree). Tree policy: pick actions to maximize $Q(S, A)$. Rollout policy: e.g. pick actions randomly, or use another policy. To evaluate the value of a tree node $i$ at state-action pair $(s, a)$, average over all rewards received from that node onwards across the simulated episodes in which this tree node was reached: $Q(i) = \frac{1}{N(i)} \sum_{k=1}^{K} \mathbf{1}(i \in \text{epi. } k)\, G_k(i) \xrightarrow{P} q(s, a)$ (2). Under mild conditions, this converges to the optimal search tree, $Q(S, A) \to q^*(S, A)$.

Upper Confidence Tree (UCT) Search. How to select what action to take during a simulated episode?

Upper Confidence Tree (UCT) Search. How to select what action to take during a simulated episode? UCT: borrow an idea from the bandit literature and treat each node where we can select actions as a multi-armed bandit (MAB) problem. Maintain an upper confidence bound over the reward of each arm.

Upper Confidence Tree (UCT) Search. How to select what action to take during a simulated episode? UCT: borrow an idea from the bandit literature and treat each node where we can select actions as a multi-armed bandit (MAB) problem. Maintain an upper confidence bound over the reward of each arm; for simplicity, we can treat each node as a separate MAB: $Q(s, a, i) = \frac{1}{N(s, a, i)} \sum_{k=1}^{K} \mathbf{1}(i \in \text{epi. } k)\, G_k(s, a, i) + c \sqrt{\frac{\ln n(s)}{n(s, a)}}$ (3). For simulated episode $k$ at node $i$, select the action/arm with the highest upper bound to simulate and expand (or evaluate) in the tree: $a_{ik} = \arg\max_a Q(s, a, i)$ (4). This implies that the policy used to simulate episodes (and to expand/update the tree) can change across episodes.
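A compact sketch combining the UCT selection rule with the MCTS loop (my own illustration; the node bookkeeping, exploration constant `c`, fixed simulation depth, undiscounted returns, and `model.sample` interface are assumptions, not the lecture's code):

```python
import math
from collections import defaultdict

def uct_action(stats, s, actions, c=1.4):
    """UCT rule: Q(s, a) + c * sqrt(ln n(s) / n(s, a)); untried arms go first."""
    n_s = sum(stats["n"][(s, a)] for a in actions)
    def ucb(a):
        n_sa = stats["n"][(s, a)]
        if n_sa == 0:
            return float("inf")
        return stats["w"][(s, a)] / n_sa + c * math.sqrt(math.log(n_s) / n_sa)
    return max(actions, key=ucb)

def mcts_uct(model, root, actions, n_episodes=200, depth=20):
    """Simulate episodes from the root, choosing actions by UCT, and back up
    the return from each visited (s, a) onwards."""
    stats = {"n": defaultdict(int), "w": defaultdict(float)}
    for _ in range(n_episodes):
        s, visited, rewards = root, [], []
        for _ in range(depth):
            a = uct_action(stats, s, actions)
            visited.append((s, a))
            r, s = model.sample(s, a)            # assumed model interface
            rewards.append(r)
        g = 0.0
        for sa, r in zip(reversed(visited), reversed(rewards)):
            g += r                               # return from this node onwards
            stats["n"][sa] += 1
            stats["w"][sa] += g
    # real action: maximize the estimated value at the root
    return max(actions, key=lambda a: stats["w"][(root, a)]
               / max(stats["n"][(root, a)], 1))
```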

Case Study: the Game of Go. Go is 2500 years old. Hardest classic board game. Grand challenge task (John McCarthy). Traditional game-tree search has failed in Go. Check your understanding: does playing Go involve learning to make decisions in a world where the dynamics and reward model are unknown?

Rules of Go. Usually played on a 19x19 board, also 13x13 or 9x9. Simple rules, complex strategy. Black and white place stones alternately. Surrounded stones are captured and removed. The player with more territory wins the game.

Position Evaluation in Go. How good is a position $s$? Reward function (undiscounted): $R_t = 0$ for all non-terminal steps $t < T$; $R_T = 1$ if Black wins, $R_T = 0$ if White wins (5). Policy $\pi = \langle \pi_B, \pi_W \rangle$ selects moves for both players. Value function (how good is position $s$): $v_\pi(s) = \mathbb{E}_\pi[R_T \mid S = s] = P[\text{Black wins} \mid S = s]$, and $v^*(s) = \max_{\pi_B} \min_{\pi_W} v_\pi(s)$.

Monte-Carlo Evaluation in Go

Applying Monte-Carlo Tree Search (1). Go is a 2-player game, so the tree is a minimax tree instead of an expectimax tree: White minimizes future reward and Black maximizes future reward when computing the action to simulate.

Applying Monte-Carlo Tree Search (2)

Applying Monte-Carlo Tree Search (3)

Applying Monte-Carlo Tree Search (4)

Applying Monte-Carlo Tree Search (5)

Advantages of MC Tree Search. Highly selective best-first search. Evaluates states dynamically (unlike e.g. DP). Uses sampling to break the curse of dimensionality. Works for black-box models (only requires samples). Computationally efficient, anytime, parallelisable.

In more depth: Upper Confidence Tree (UCT) Search. UCT: borrow an idea from the bandit literature and treat each tree node where we can select actions as a multi-armed bandit (MAB) problem. Maintain an upper confidence bound over the reward of each arm and select the best arm. Check your understanding: why is this slightly strange? Hint: why were upper confidence bounds a good idea for exploration/exploitation? Is there an exploration/exploitation problem during simulated episodes? This relates to metalevel reasoning (for an example related to Go, see Selecting Computations: Theory and Applications, Hay, Russell, Tolpin and Shimony 2012).

MCTS and Early Go Results

MCTS Variants. UCT and vanilla MCTS are just the beginning. Potential extensions / alterations?

MCTS Variants. UCT and vanilla MCTS are just the beginning. Potential extensions / alterations? Use a better rollout policy (e.g. a policy network, learned from expert data or from data gathered in the real world). Learn a value function (it can be combined with simulated trajectories to get a state-action estimate, used to bias the initial actions considered, or used to avoid having to roll out to the full episode length, ...). Many other possibilities.

MCTS and AlphaGo / AlphaZero. MCTS was a critical advance for defeating Go. Several newer versions, including AlphaGo Zero and AlphaZero, have achieved even more impressive performance. AlphaZero has also been applied to other games, including chess.

Table of Contents: 1 Introduction; 2 Model-Based Reinforcement Learning; 3 Simulation-Based Search; 4 Integrated Architectures.

Real and Simulated Experience. We consider two sources of experience. Real experience, sampled from the environment (true MDP): $S' \sim P^a_{s,s'}$, $R = R^a_s$. Simulated experience, sampled from the model (approximate MDP): $S' \sim P_\eta(S' \mid S, A)$, $R = R_\eta(R \mid S, A)$.

Integrating Learning and Planning. Model-Free RL: no model; learn a value function (and/or policy) from real experience.

Integrating Learning and Planning. Model-Free RL: no model; learn a value function (and/or policy) from real experience. Model-Based RL (using sample-based planning): learn a model from real experience; plan a value function (and/or policy) from simulated experience.

Integrating Learning and Planning. Model-Free RL: no model; learn a value function (and/or policy) from real experience. Model-Based RL (using sample-based planning): learn a model from real experience; plan a value function (and/or policy) from simulated experience. Dyna: learn a model from real experience; learn and plan a value function (and/or policy) from both real and simulated experience.

Dyna Architecture

Dyna-Q Algorithm
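The algorithm box on this slide is a figure; as a hedged stand-in, here is a minimal tabular Dyna-Q sketch in the spirit of Sutton and Barto's version. The `env.reset()`/`env.step(a)` interface, the hyperparameters, the deterministic last-outcome model, and the simplified terminal handling during planning are my assumptions.

```python
import random
from collections import defaultdict

def dyna_q(env, actions, n_episodes=100, n_planning=10,
           gamma=0.95, alpha=0.1, epsilon=0.1):
    """Tabular Dyna-Q: direct Q-learning on real experience, plus extra
    Q-learning updates on transitions replayed from a learned model."""
    Q = defaultdict(float)
    model = {}                                   # (s, a) -> (r, s_next)
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            # act in the real environment (epsilon-greedy)
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a_: Q[(s, a_)])
            s_next, r, done = env.step(a)        # assumed environment interface
            # direct RL update from real experience
            target = r if done else r + gamma * max(Q[(s_next, a_)] for a_ in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            # model learning: remember the last observed outcome of (s, a)
            model[(s, a)] = (r, s_next)
            # planning: n extra updates on previously visited (s, a) pairs
            for _ in range(n_planning):
                (ps, pa), (pr, pnext) = random.choice(list(model.items()))
                ptarget = pr + gamma * max(Q[(pnext, a_)] for a_ in actions)
                Q[(ps, pa)] += alpha * (ptarget - Q[(ps, pa)])
            s = s_next
    return Q
```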

Dyna-Q on a Simple Maze

Dyna-Q with an Inaccurate Model

Dyna-Q with an Inaccurate Model (2)

Class Structure. Last time: Batch RL. This time: MCTS. Next time: Human in the Loop RL.