COMP219: Artificial Intelligence. Lecture 27: Reinforcement Learning


Revision Lecture
Date: Wednesday January 10, 2018, 10:00am
Location: CHAD-CHAD

Class Test 2
Date: 15th December, 15:00
Rooms (again, based on first letter of last name): A-G CHAD-ROTB; H-Z CTH-LTA
What to study? Everything except Prolog. Example questions are at the end of this lecture.

Overview
Last time: regression and classification with linear models; non-parametric models: k-nearest neighbours
Today: reinforcement learning, general overview; the n-armed bandit problem and the Gittins index
Learning outcomes covered today: identify or describe the major approaches to learning in AI and apply these to simple examples

Reinforcement Learning (RL)
A learning task: agents learn what to do without labelled examples; they learn from a series of reinforcements: rewards (and/or punishments)
That is, RL is a problem, not one particular technique, but we can approach a problem by phrasing it as an RL problem
Reinforcement learning has been studied by animal psychologists for over 60 years: animals recognise pain and hunger as negative rewards, and pleasure and food intake as positive rewards (e.g. the foraging behaviour of bees)
Alan Turing proposed the reinforcement learning approach in 1948, but thought that rewards and punishments could at best be a part of the teaching process
Arthur Samuel (1959) did the first successful work on machine learning, which applied most of the modern reinforcement learning ideas

Reinforcement Learning Task
The agent has to learn a policy that maps states to actions leading to maximum reward (Source: Julien Vitay)

Reinforcement Learning Agent
The agent interacts with its environment and learns a policy which maximises the reward obtained from the environment (an optimal policy)
There are no labelled examples to learn from: the agent must discover whether an action is correct or not by observing rewards. Therefore it must try out all possibilities (exploration), and the exploratory space can become huge
Imagine playing a game whose rules you don't know, where all you are told at the end is that you lose (Source: Julien Vitay)
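To make the agent-environment loop concrete, here is a minimal sketch (not from the lecture; the tiny grid environment and its rewards are invented for illustration) of an agent repeatedly choosing an action, observing the new state and collecting reward until a trial ends:

```python
import random


class GridEnvironment:
    """Hypothetical illustration: states 0..4 on a line; state 4 is the goal."""

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action is -1 (move left) or +1 (move right); the state stays within 0..4
        self.state = max(0, min(4, self.state + action))
        reward = 1.0 if self.state == 4 else -0.04  # small step cost, +1 at the goal
        done = self.state == 4
        return self.state, reward, done


def run_episode(env, policy):
    """One trial: follow the policy until a terminal state; return the total reward."""
    state = env.reset()
    total, done = 0.0, False
    while not done:
        action = policy(state)                   # agent chooses an action...
        state, reward, done = env.step(action)   # ...and observes the outcome
        total += reward
    return total


def random_policy(state):
    """An agent that has learned nothing yet simply acts at random."""
    return random.choice([-1, +1])


print(run_episode(GridEnvironment(), random_policy))
```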

RL Agent Interacts with Environment
RL agents need to interact with the environment. For example:
Games: when a master chess player makes a move, the choice is informed both by planning (anticipating possible responses and counter-responses) and by immediate, intuitive judgments of the desirability of particular positions and moves
Adaptive control: an adaptive controller adjusts parameters of a control system in real time. The controller agent optimises the yield/cost/quality trade-off on the basis of specified margin costs without strictly following the set parameters originally suggested by engineers
Mobile robots: a mobile vacuum cleaning robot decides whether it should enter a new room in search of more dirt to clean or start trying to find its way back to its battery recharging station. It makes its decision based on how quickly and easily it has been able to find the recharger in the past

Elements of RL (I)
Policy π defines the behaviour of the agent: which action to take in a given state to maximise the received reward in the long term
  Stimulus-response rules or associations
  Could be a simple lookup table or function, or may need more extensive computation (e.g. search)
  Can be probabilistic
Reward function r defines the goal in a reinforcement learning problem: it maps a state or action to a scalar number, the reward (or reinforcement). The RL agent's sole objective is to maximise the total reward it receives in the long run
  Defines good and bad events
  Cannot be altered by the agent, but may inform a change of policy
  Can be probabilistic (expected reward)

Elements of RL (II)
Value function V defines the total amount of reward an agent can expect to accumulate over the future, starting from a given state
  What is good in the long run (the reward function defines what is good now), considering the states (and rewards) that are likely to follow
  A state may yield a low reward but have a high value (or the opposite), e.g. immediate pain/pleasure vs. long-term happiness
Transition model M defines the transitions in the environment: action a taken in state s1 will lead to state s2
  Can be probabilistic

Elements of RL (II) cont'd: value function example (figure).
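To tie the four elements together, here is a toy illustration (all names and numbers are invented, not from the slides) of a policy, reward function, transition model and value function for a two-state problem:

```python
# Illustrative two-state problem; the states, actions and numbers are invented.
states = ["home", "work"]
actions = ["stay", "move"]

# Policy pi: which action to take in each state (here, a simple lookup table)
policy = {"home": "move", "work": "stay"}

# Reward function r: maps a (state, action) pair to a scalar reward
reward = {("home", "stay"): 0.0, ("home", "move"): -1.0,
          ("work", "stay"): 2.0, ("work", "move"): -1.0}

# Transition model M: (state, action) -> next state (deterministic here,
# but it could just as well map to a probability distribution over states)
model = {("home", "stay"): "home", ("home", "move"): "work",
         ("work", "stay"): "work", ("work", "move"): "home"}

# Value function V: expected long-run reward from each state under the policy;
# initialised to 0 because this is precisely what the agent has to learn.
value = {s: 0.0 for s in states}
```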

Types of Reinforcement Learning
Reinforcement learning can be:
Passive, where the agent's policy is fixed and the task is to learn the utilities of states (or state-action pairs)
Active, where the agent must also learn what to do, i.e. exploration

Passive Reinforcement Learning
The agent's policy π is fixed: in state s it always executes π(s)
The goal is to learn how good the policy is, i.e. to learn the value function V^π(s)
The agent does not know the reward function r or the transition model M
The agent executes a set of trials in the environment using its policy π: it starts in an initial state s_0 and experiences a sequence of states and rewards until it reaches a terminal state s_t
The agent uses information about rewards to learn the expected value V^π(s_i) associated with each non-terminal state s_i
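One standard way a passive agent can learn V^π(s) from such trials is temporal-difference (TD) updating; the sketch below assumes an environment with the reset()/step(action) interface used earlier, and the step size and discount factor are illustrative defaults:

```python
def td0_passive(env, policy, states, episodes=1000, alpha=0.1, gamma=0.9):
    """Estimate V^pi(s) for a fixed policy by running trials and applying TD(0) updates.
    env is assumed to expose reset() and step(action) -> (next_state, reward, done)."""
    V = {s: 0.0 for s in states}
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            next_state, reward, done = env.step(policy(state))
            target = reward + (0.0 if done else gamma * V[next_state])
            V[state] += alpha * (target - V[state])  # nudge V(state) towards the TD target
            state = next_state
    return V
```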

Active Reinforcement Learning
A passive agent has a fixed policy determining its behaviour, but an active agent must decide which actions to take
For instance (model-based RL, or adaptive dynamic programming): learn a complete model M with outcome probabilities for all actions; then learn the value function V(s); then, given the resulting V, decide which actions to take
Issue: what if the learned model is incorrect? The agent might perform suboptimally!
The active agent must trade off exploitation (to maximise its reward) against exploration (to learn whether there are better actions/states it has not found yet)
How to balance the two? It can't exploit all the time, and it can't explore all the time.
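A common rule of thumb for this balance (not prescribed by the slides) is epsilon-greedy action selection: with small probability explore a random action, otherwise exploit the best action found so far. A minimal sketch:

```python
import random


def epsilon_greedy(q_values, actions, epsilon=0.1):
    """Choose an action: explore with probability epsilon, otherwise exploit.
    q_values maps each action to the agent's current estimate of its value."""
    if random.random() < epsilon:
        return random.choice(actions)                # explore: try something at random
    return max(actions, key=lambda a: q_values[a])   # exploit: best action found so far
```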

n-armed Bandit Problem
A model to reason about exploration vs exploitation
A one-armed bandit is a slot machine: a gambler can insert a coin, pull the lever and collect the winnings (if any)
An n-armed bandit has n levers: the gambler must choose which lever to play on each successive time step: the one that has paid off best? Or one that has not been tried?

n-armed Bandit Problem cont'd
The n-armed bandit problem is a formal model for real problems in many domains, e.g. in marketing (which ad to show)
Exploration is risky: payoffs are uncertain. But failure to explore means never discovering worthwhile actions
To formulate an n-armed bandit problem properly, we must define what we mean by optimal
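As an illustration of the trade-off, here is a small simulation (the payout probabilities are invented) of an n-armed bandit played with the epsilon-greedy rule sketched above, keeping a running estimate of each lever's mean payout:

```python
import random


def play_bandits(payout_probs, steps=10000, epsilon=0.1):
    """Epsilon-greedy play of an n-armed bandit.
    payout_probs[i] is the (unknown to the gambler) chance that lever i pays out 1."""
    n = len(payout_probs)
    estimates = [0.0] * n   # estimated mean payout of each lever
    counts = [0] * n        # how often each lever has been pulled
    total = 0.0
    for _ in range(steps):
        if random.random() < epsilon:
            lever = random.randrange(n)                        # explore
        else:
            lever = max(range(n), key=lambda i: estimates[i])  # exploit
        payout = 1.0 if random.random() < payout_probs[lever] else 0.0
        counts[lever] += 1
        estimates[lever] += (payout - estimates[lever]) / counts[lever]  # running mean
        total += payout
    return total, estimates


# Three levers with invented payout probabilities; the third is best
print(play_bandits([0.2, 0.5, 0.7]))
```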

Gittins Index
The Gittins index is a measure of the reward that can be achieved by a sequence of actions from the present state onwards, taking into account the probability that the process will be terminated in the future
For the n-armed bandit problem it is possible to calculate a Gittins index for each lever: the index is a function of the number of times the lever has been played and how much it has paid out, and it indicates how worthwhile it is to invest more
Gittins, J.C. (1989). Multi-armed Bandit Allocation Indices. Wiley-Interscience Series in Systems and Optimization. Chichester: John Wiley & Sons, Ltd. ISBN 0-471-92059-2. https://en.wikipedia.org/wiki/gittins_index
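For reference (in my notation, not the lecture's), the Gittins index of an arm in state s is usually defined as the largest discounted reward per unit of discounted time achievable by playing the arm and then stopping at some stopping time τ, with discount factor β ∈ (0, 1):

```latex
\nu(s) \;=\; \sup_{\tau \ge 1}
  \frac{\mathbb{E}\!\left[\sum_{t=0}^{\tau-1} \beta^{t}\, r(s_t) \,\middle|\, s_0 = s\right]}
       {\mathbb{E}\!\left[\sum_{t=0}^{\tau-1} \beta^{t} \,\middle|\, s_0 = s\right]}
```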

RL Applications: Games
It is very hard for a human to provide accurate and consistent evaluations of a large number of positions to train an evaluation function
1959: Arthur Samuel applied RL to checkers
1992: Gerald Tesauro's TD-Gammon used RL techniques to find the optimal strategy to play backgammon, learning from self-play alone
Recent successes: Atari games (https://www.youtube.com/watch?v=tmpftpjtdgg), Go, Poker
Rewards may be fairly frequent (e.g. in table tennis, each point is a reward) or arrive only at the end of the game (e.g. chess)
The main problem for RL is that the reward (e.g. win or loss) could be delayed too much, e.g. in a game that never ends

RL Applications: Robotics
Motor control
Navigation and exploration
Sequence learning
Decision making
(Source: Julien Vitay)

Reinforcement Learning Possibilities
Because of its potential for eliminating the hand-coding of control strategies, RL is one of the most active areas of machine learning research
Applications in robotics promise to be especially valuable, but will need methods for handling continuous, high-dimensional, partially observable environments in which successful behaviours may consist of millions of primitive actions

We have considered three types of learning
Supervised learning: the agent learns a function from observing example input-output pairs
Unsupervised learning: learn patterns in the input without explicit feedback; the most common task is clustering
Reinforcement learning: learn from a series of reinforcements: rewards or punishments
We note the existence of other approaches to machine learning, but we conclude our study here

Class test 2 example questions

Summary
Reinforcement learning: agent task, elements of RL; passive vs active RL; the n-armed bandit problem and the Gittins index; applications of RL
Further reading on RL:
R. S. Sutton, A. G. Barto: Reinforcement Learning: An Introduction. MIT Press, 1998. http://webdocs.cs.ualberta.ca/~sutton/book/ebook/the-book.html
Reinforcement Learning: State-of-the-Art. Editors: Wiering, Marco and van Otterlo, Martijn. https://link.springer.com/book/10.1007%2f978-3-642-27645-3
Further ML resources:
Russell & Norvig...!
Christopher Bishop: Pattern Recognition and Machine Learning
Goodfellow, Bengio & Courville: Deep Learning
Andrew Ng's Coursera course on Machine Learning
Next time, Jan. 10th, 10am: revision lecture