Breakout Group Reinforcement Learning

Fabian Ruehle (University of Oxford)
String_Data 2017, Boston, 12/01/2017

Outline
- Theoretical introduction (30 minutes)
- Discussion of code (30 minutes): solve a version of grid world with SARSA
- Discussion of RL and its applications to String Theory (30 minutes)

How to teach a machine
- Supervised Learning (SL): provide a set of training tuples [(in_0, out_0), (in_1, out_1), ..., (in_n, out_n)]; after training, the machine predicts out_i from in_i
- Unsupervised Learning (UL): only provide a training input set [in_0, in_1, ..., in_n] and give the machine a task (e.g. cluster the input) without telling it how to do this exactly; after training, the machine will perform the self-learned action on in_i
- Reinforcement Learning (RL): in between SL and UL; the machine acts autonomously, but its actions are reinforced / punished

Theoretical introduction

Reinforcement Learning - Vocabulary
Basic textbooks/literature: [Barto, Sutton '98, '17]
- The thing that learns is called the agent or worker
- The thing that is explored is called the environment
- The elements of the environment are called states or observations
- The things that take you from one state to another are called actions
- The thing that tells you how to select the next action is called the policy
- Actions are executed sequentially in a sequence of (time) steps
- The reinforcement the agent experiences is called the reward
- The accumulated reward is called the return
In RL, an agent performs actions in an environment with the goal of maximizing its long-term return.

Reinforcement Learning - Details
We focus on discrete state and action spaces:
- State space: $S = \{\text{states in environment}\}$
- Action space: in total, $A = \{\text{actions to transition between states}\}$; for $s \in S$, $A(s) = \{\text{possible actions in state } s\}$
- Policy $\pi(s) = a$, $a \in A(s)$: selects the next action for a given state; $\pi: S \mapsto A$
- Reward $R(s, a) \in \mathbb{R}$: reward for taking action $a$ in state $s$; $R: S \times A \mapsto \mathbb{R}$
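As a concrete illustration, here is a minimal Python sketch of how these objects could be represented for the gridworld example discussed below. All names and values (the grid size, GOAL, PIT, the reward magnitudes) are illustrative assumptions, not taken from the talk's code:

```python
# Minimal sketch: discrete state space, action space, reward, and a (trivial)
# policy for a toy 3x3 gridworld. All names and values are illustrative.

GOAL, PIT = (2, 2), (1, 1)   # hypothetical exit and pitfall cells

# State space S: all cells of the grid
S = [(row, col) for row in range(3) for col in range(3)]

# Action space A: actions that transition between states
A = ["up", "down", "left", "right"]

def actions(state):
    """A(s): possible actions in state s (here simply all of A)."""
    return A

def reward(state, action):
    """R(s, a): reward for taking action a in state s."""
    if state == GOAL:
        return 10.0    # reaching the exit is strongly rewarded
    if state == PIT:
        return -10.0   # falling into a pitfall is strongly punished
    return -1.0        # every other step is mildly punished

def policy(state):
    """pi(s) = a with a in A(s): select the next action for a given state."""
    return "right"     # a fixed (and not very clever) policy
```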

Reinforcement Learning - Details
- Return: the accumulated reward from the current step, $G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$, with discount factor $\gamma \in (0, 1]$
- State value function $v_\pi(s)$: expected return for state $s$ under policy $\pi$: $v_\pi(s) = \mathbb{E}_\pi[G_t \mid s = s_t]$
- Action value function $q_\pi(s, a)$: expected return for performing action $a$ in state $s$ under policy $\pi$: $q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid s = s_t, a = a_t]$
- Prediction problem: given $\pi$, predict $v_\pi(s)$ or $q_\pi(s, a)$
- Control problem: find the optimal policy $\pi$ that maximizes $v_\pi(s)$ or $q_\pi(s, a)$
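To make the return and value function concrete, the following sketch computes $G_t$ and a simple Monte Carlo estimate of $v_\pi(s)$; it assumes a hypothetical sample_episode_rewards helper that plays one episode from a given state under the current policy and returns the observed rewards:

```python
# Sketch: discounted return G_t for a reward sequence [r_{t+1}, r_{t+2}, ...],
# and a Monte Carlo estimate of v_pi(s) as the average of sampled returns.
# `sample_episode_rewards` is a hypothetical helper, not from the talk's code.

def discounted_return(rewards, gamma=0.9):
    """G_t = sum_k gamma^k * r_{t+k+1}, with gamma in (0, 1]."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

def estimate_state_value(sample_episode_rewards, state, n_episodes=1000, gamma=0.9):
    """v_pi(s): expected return when starting in s and following policy pi."""
    returns = [discounted_return(sample_episode_rewards(state), gamma)
               for _ in range(n_episodes)]
    return sum(returns) / n_episodes
```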

Reinforcement Learning - Details
Commonly used policies:
- Greedy: choose the action that maximizes the action value function: $\pi'(s) = \arg\max_a q(s, a)$
- $\varepsilon$-greedy: explore different possibilities: $\pi'(s)$ chooses the greedy action in $(1 - \varepsilon)$ of the cases and a random action in $\varepsilon$ of the cases
- We take $\varepsilon$-greedy policy improvement
- On-policy: update the policy you are following (e.g. always $\varepsilon$-greedy)
- Off-policy: use different policies for choosing the next action $a_{t+1}$ and for updating $q(s_t, a_t)$
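For a tabular action value function stored as a dictionary q[(state, action)] (an assumption of this sketch, with missing entries defaulting to 0), greedy and $\varepsilon$-greedy action selection might look like this:

```python
import random

def greedy_action(q, state, actions):
    """Greedy policy: pi'(s) = argmax_a q(s, a)."""
    return max(actions, key=lambda a: q.get((state, a), 0.0))

def epsilon_greedy_action(q, state, actions, epsilon=0.1):
    """Epsilon-greedy: act greedily in (1 - epsilon) of cases, explore otherwise."""
    if random.random() < epsilon:
        return random.choice(actions)          # explore: random action
    return greedy_action(q, state, actions)    # exploit: greedy action
```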

Reinforcement Learning - SARSA
Solving the control problem:
- Value update: $\Delta v(s_t) = \alpha\,[G_t - v(s_t)]$
- $\alpha$: learning rate ($\alpha = 0$ means no update to $v(s_t)$)
- One-step approximation: $G_t = r + \gamma\, v(s_{t+1})$
- Similarly for the action value function: $\Delta q(s_t, a_t) = \alpha\,[G_t - q(s_t, a_t)] = \alpha\,[r + \gamma\, q(s_{t+1}, a_{t+1}) - q(s_t, a_t)]$
- The update depends on the tuple $(s_t, a_t, r, s_{t+1}, a_{t+1})$, hence the name SARSA
- $a_{t+1}$ is the currently best known action for state $s_{t+1}$
- Note: SARSA is on-policy
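A sketch of the resulting tabular SARSA loop, assuming a gridworld-style environment with hypothetical env.reset() -> s and env.step(a) -> (s', r, done) methods and reusing the epsilon_greedy_action helper sketched above:

```python
def sarsa(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular SARSA; q maps (state, action) to the estimated action value."""
    q = {}
    for _ in range(episodes):
        s = env.reset()
        a = epsilon_greedy_action(q, s, actions, epsilon)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = epsilon_greedy_action(q, s_next, actions, epsilon)
            # on-policy update from the tuple (s_t, a_t, r, s_{t+1}, a_{t+1})
            target = r + (0.0 if done else gamma * q.get((s_next, a_next), 0.0))
            q[(s, a)] = q.get((s, a), 0.0) + alpha * (target - q.get((s, a), 0.0))
            s, a = s_next, a_next
    return q
```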

Reinforcement Learning - Q-Learning
- Very similar to SARSA; the difference is in the update:
- SARSA: $\Delta q(s_t, a_t) = \alpha\,[r + \gamma\, q(s_{t+1}, a_{t+1}) - q(s_t, a_t)]$
- Q-Learning: $\Delta q(s_t, a_t) = \alpha\,[r + \gamma \max_{a'} q(s_{t+1}, a') - q(s_t, a_t)]$
- Note: this means that Q-Learning is off-policy
- SARSA is often found to perform better in practice; Q-Learning is proven to converge to the optimal solution
- Combine with deep NNs: Deep Q-Learning
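In code, the only change relative to the SARSA sketch above is the bootstrap term of the update, which maximizes over the next actions instead of using the action the behavior policy actually picks:

```python
def q_learning_update(q, s, a, r, s_next, actions, done, alpha=0.1, gamma=0.9):
    """Off-policy Q-Learning update for a tabular q dictionary (sketch)."""
    best_next = 0.0 if done else max(q.get((s_next, a2), 0.0) for a2 in actions)
    q[(s, a)] = q.get((s, a), 0.0) + alpha * (r + gamma * best_next - q.get((s, a), 0.0))
```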

Example - Gridworld
[Figure: the gridworld maze, showing the worker ("explorer"), a pitfall, the exit, and the walls]

Example - Gridworld
We will look at a version of grid world:
- Gridworld is a grid-like maze with walls, pitfalls, and an exit
- Each state is a point on the grid of the maze
- The actions are A = {up, down, left, right}
- Goal: find the exit (strongly rewarded)
- Each step is punished mildly (solve the maze quickly)
- Pitfalls should be avoided (strongly punished)
- Running into a wall does not change the state
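These rules translate into a small environment class; the sketch below (layout, reward values, and the reset/step interface are illustrative assumptions) is compatible with the SARSA loop sketched earlier:

```python
class GridWorld:
    """Minimal gridworld: walls block movement, pitfalls and the exit end an episode."""

    MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

    def __init__(self, size=4, walls=((2, 1),), pitfalls=((1, 1),), exit_cell=(3, 3)):
        self.size, self.walls = size, set(walls)
        self.pitfalls, self.exit_cell = set(pitfalls), exit_cell

    def reset(self):
        self.state = (0, 0)
        return self.state

    def step(self, action):
        dr, dc = self.MOVES[action]
        r, c = self.state[0] + dr, self.state[1] + dc
        # running into a wall or the boundary does not change the state
        if 0 <= r < self.size and 0 <= c < self.size and (r, c) not in self.walls:
            self.state = (r, c)
        if self.state == self.exit_cell:
            return self.state, 10.0, True     # exit: strongly rewarded
        if self.state in self.pitfalls:
            return self.state, -10.0, True    # pitfall: strongly punished
        return self.state, -1.0, False        # ordinary step: mildly punished
```

With these pieces, something like q = sarsa(GridWorld(), list(GridWorld.MOVES)) would train a tabular SARSA agent on the maze.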

Gridworld vs String Landscape
- Walls = boundaries of the landscape (negative number of branes)
- Empty square = consistent point in the landscape which does not correspond to our Universe
- Pitfalls = mathematically / physically inconsistent states (anomalies, tadpoles, ...)
- Exit = Standard Model of Particle Physics

Coding

Discussion