CS 380: ARTIFICIAL INTELLIGENCE REINFORCEMENT LEARNING. Santiago Ontañón, so367@drexel.edu

Machine Learning: computational methods for computers to exhibit specific forms of learning. For example:
- Learning from Examples: supervised learning, unsupervised learning, reinforcement learning
- Learning from Observation (demonstration/imitation)

Examples. Reinforcement Learning: learning to walk.

Examples. Reinforcement Learning videos: https://www.youtube.com/watch?v=hx_bgotf7bs https://www.youtube.com/watch?v=e27tummkoa0 https://www.youtube.com/watch?v=0jl04jjjocc

Reinforcement Learning: how can an agent learn to take actions in an environment so as to maximize some notion of reward? (Diagram: the Agent sends Actions to the Environment, which returns a State and a Reward.) Assumption: the environment is unknown and may be stochastic.

Basic Concepts.
- State (S): the configuration of the environment, as perceived by the agent.
- Actions (A): the set of different actions the agent can perform. We will assume it is discrete (but this does not need to be so for other RL algorithms).
- Reward (R): each time the agent performs an action, it observes a reward (a real value).
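
To make these three ingredients concrete, here is a minimal sketch of a tiny, made-up grid-world environment (the grid, names, and reward are illustrative assumptions, not from the slides):

    # Hypothetical 2x2 grid world illustrating S, A, and R.
    STATES = [(0, 0), (0, 1), (1, 0), (1, 1)]   # S: the agent's position in the grid
    ACTIONS = ["up", "down", "left", "right"]    # A: a discrete set of moves

    def reward(state, action):
        """R: a real value observed after each action; here +1 only when
        stepping right from (1, 0) into the goal cell (1, 1)."""
        return 1.0 if state == (1, 0) and action == "right" else 0.0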

Policies and Plans.
- Plan: a sequence of actions generated to achieve a certain goal from a given starting state.
- Policy: a mapping from states to actions, i.e., a function that defines which action to perform in every possible state.
For RL, given an initial state, does the agent need to learn a plan or a policy? (The environment is stochastic.) A policy, since plans assume deterministic execution.

Policies. RL algorithms learn policies. How do we represent a policy? Example: as a table, if it is a deterministic policy (a stochastic policy would instead specify the probability of each action in each state):

    State:  s0     s1     s2   ...  sn
    Action: right  right  up   ...  left
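
A minimal Python sketch of such a table-based deterministic policy (the state and action names mirror the table above):

    # A deterministic policy as a lookup table: state -> action.
    policy = {
        "s0": "right",
        "s1": "right",
        "s2": "up",
        "sn": "left",
    }

    def act(state):
        """Return the action the policy prescribes in this state."""
        return policy[state]

    print(act("s2"))  # -> "up"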

Value Function. Imagine we have a policy P. The value of a state S under policy P is the expected reward we would get if we execute P starting from S:

    V^P(S) = E[ \sum_{t=0}^{\infty} R(S_t, P(S_t)) \mid S_0 = S ]

Since that sum might be infinite, we introduce a discount factor \gamma (a number between 0 and 1 that discounts future rewards):

    V^P(S) = E[ \sum_{t=0}^{\infty} \gamma^t R(S_t, P(S_t)) \mid S_0 = S ]

State-Action Value Function (Q value). Imagine we have a policy P. The Q value of a state S and an action A under policy P is the expected reward we would get if we first execute A and then follow policy P starting from S:

    Q^P(S, A) = E[ \sum_{t=0}^{\infty} \gamma^t R(S_t, P(S_t)) \mid S_0 = S, A_0 = A ]
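
A connection that is not stated on the slide but follows directly from these two definitions: if the first action is exactly the one the policy would have chosen, the two quantities coincide:

    V^P(S) = Q^P(S, P(S))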

Q table. A Q table is a matrix with one row per state and one column per action, holding the Q value of each (state, action) pair:

    State   right   up
    s0      0.4     0.1
    s1      0.5     0.1
    s2      0.3     0.05
    ...
    sn      0.1     0.8

Q table. A Q table defines a deterministic policy: in each state, take the action with the maximum Q value.

    State   right   up
    s0      0.4     0.1
    s1      0.5     0.1
    s2      0.3     0.05
    ...
    sn      0.1     0.8

    State:  s0     s1     s2   ...  sn
    Action: right  right  up   ...  left
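
A minimal Python sketch of a Q table and the greedy policy it induces (the numbers come from the table above; the variable names are mine):

    import numpy as np

    states = ["s0", "s1", "s2", "sn"]
    actions = ["right", "up"]

    # Q table: one row per state, one column per action.
    Q = np.array([
        [0.4, 0.1],
        [0.5, 0.1],
        [0.3, 0.05],
        [0.1, 0.8],
    ])

    def greedy_action(state):
        """Deterministic policy induced by the Q table: argmax over actions."""
        row = states.index(state)
        return actions[int(np.argmax(Q[row]))]

    print(greedy_action("s0"))  # -> "right"
    print(greedy_action("sn"))  # -> "up"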

Q learning. A basic reinforcement learning algorithm that learns the Q table: it starts with an initial Q table (e.g., all zeroes) and updates it iteratively over time using Bellman's equation.

Bellman Equations. Imagine that we have a current estimate of the Q table, the agent is in state S, performs action A (which takes it to state S'), and observes reward R. How do we update the Q table with this new piece of information?

    Q_new(S, A) = (1 - \alpha) Q(S, A) + \alpha [ R + \gamma \max_{A'} Q(S', A') ]

Here Q(S, A) is the previous Q value estimate, Q_new(S, A) is the new Q value estimate, \alpha is the learning rate, and \gamma is the discount factor.
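
A worked instance of this update with made-up numbers (not from the slides): suppose \alpha = 0.5, \gamma = 0.9, the current estimate Q(S, A) = 0.4, the observed reward R = 1, and \max_{A'} Q(S', A') = 0.8. Then

    Q_new(S, A) = (1 - 0.5)(0.4) + 0.5 [ 1 + 0.9(0.8) ] = 0.2 + 0.5(1.72) = 1.06

so the estimate moves halfway from its old value (0.4) toward the new target R + \gamma \max_{A'} Q(S', A') = 1.72.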

Q Learning:
1. Initialize the Q table to some uniform value (e.g., all zeroes).
2. S = initial state.
3. A = choose an action based on the Q table and the current state S.
4. Execute action A: observe the new state S' and the reward R.
5. Update the Q table: Q_new(S, A) = (1 - \alpha) Q(S, A) + \alpha [ R + \gamma \max_{A'} Q(S', A') ].
6. S = S'; go to 3.
How do we choose an action in step 3?
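
A compact Python sketch of this loop. It assumes a hypothetical environment object with reset() and step() methods and takes the action-selection rule as a parameter; none of these names come from the course:

    from collections import defaultdict

    ALPHA = 0.1   # learning rate
    GAMMA = 0.9   # discount factor

    def q_learning(env, actions, choose_action, steps=10000):
        """Tabular Q-learning; choose_action(Q, state, actions) is the
        exploration policy (e.g., the epsilon-greedy sketch below)."""
        Q = defaultdict(float)                            # step 1: Q[(state, action)] = 0
        state = env.reset()                               # step 2: S = initial state
        for _ in range(steps):
            action = choose_action(Q, state, actions)     # step 3
            next_state, reward = env.step(state, action)  # step 4: S', R
            # Step 5: Bellman update toward the target R + gamma * max_A' Q(S', A').
            target = reward + GAMMA * max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] = (1 - ALPHA) * Q[(state, action)] + ALPHA * target
            state = next_state                            # step 6: continue from S'
        return Q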

Exploration vs Exploitation. During learning, the agent is in a given state S and has to choose an action using the Q table (the slide highlights the current state and the action that maximizes its Q value):

    State   right   left    forward
    s0      0.4     0.9     0.1
    s1      0.5     0.3     0.1
    s2      0.3     0.1     0.05
    ...
    sn      0.1     0.3     0.8

Exploration vs Exploitation. During learning, instead of always choosing the action that maximizes the Q value, we use a policy that balances exploration and exploitation (remember the tree policy in MCTS? This is the same idea). For example, ε-greedy (sketched below):
- ε = 0.1 (or some other small value between 0 and 1)
- With probability ε, choose an action at random
- With probability (1 - ε), choose the action with the maximum Q value
Why? The action currently believed to be best might only look best by coincidence, so we need to keep exploring in case other actions turn out to be better.
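
A minimal sketch of ε-greedy action selection; it can be passed as the choose_action argument of the Q-learning sketch above:

    import random

    EPSILON = 0.1  # small exploration probability

    def epsilon_greedy(Q, state, actions):
        """With probability epsilon, explore (random action);
        otherwise exploit (action with the maximum Q value)."""
        if random.random() < EPSILON:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])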

Q Learning. Example output of Q Learning (a Q table). (Image borrowed from Hal Daumé's CS421 slides.)

Problems with Q Learning. No generalization: if two states are very similar, Q learning does not exploit this, and will have to learn the Q values for each of them independently. Many techniques address this:
- Function approximation (sketched below)
- Feature-based state representations
- Deep Q-learning: uses a neural network to represent the Q table (implicit generalization)
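
A minimal sketch of the function-approximation idea using a linear model, Q(s, a) ≈ w · φ(s, a), in place of a table; the feature function and dimensions are illustrative assumptions, not from the slides:

    import numpy as np

    ACTIONS = ["left", "right"]
    ALPHA = 0.01   # learning rate
    GAMMA = 0.9    # discount factor

    def features(state, action):
        """phi(s, a): a toy feature vector that concatenates a 2-D state
        with a one-hot encoding of the action."""
        one_hot = [1.0 if action == a else 0.0 for a in ACTIONS]
        return np.array(list(state) + one_hot)

    def q_value(w, state, action):
        """Q(s, a) approximated as the dot product w . phi(s, a)."""
        return float(np.dot(w, features(state, action)))

    def q_update(w, state, action, reward, next_state):
        """Semi-gradient Q-learning step: because similar states share
        features, one update generalizes to all of them."""
        target = reward + GAMMA * max(q_value(w, next_state, a) for a in ACTIONS)
        error = target - q_value(w, state, action)
        return w + ALPHA * error * features(state, action)

    w = np.zeros(2 + len(ACTIONS))  # weights: 2-D state plus the action one-hot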

Examples (Again). Reinforcement Learning videos: https://www.youtube.com/watch?v=hx_bgotf7bs https://www.youtube.com/watch?v=e27tummkoa0 https://www.youtube.com/watch?v=0jl04jjjocc

Machine Learning: computational methods for computers to exhibit specific forms of learning. For example:
- Learning from Examples: supervised learning, unsupervised learning, reinforcement learning
- Learning from Observation (demonstration/imitation)