Reinforcement Learning


CSC 4510/9010: Applied Machine Learning 1 Reinforcement Learning Dr. Paula Matuszek Paula.Matuszek@villanova.edu Paula.Matuszek@gmail.com (610) 647-9789 Some slides based on https://www.csee.umbc.edu/courses/671/fall05/slides/c28_rl.ppt

What Is Machine Learning? 2 "Learning denotes changes in a system that ... enable a system to do the same task more efficiently the next time." (Herbert Simon) In other words, the end result is a changed model or representation of some kind; the focus is on the end product. "Learning is constructing or modifying representations of what is being experienced." (Ryszard Michalski) The experiences perceived must be captured or represented in some way; learning modifies that representation. This definition focuses on the process, rather than the result.

So what is Machine Learning? 3 Here we can consider the "system" to be a computer and its programs, or a statistical model with parameters. Another way of looking at machine learning is as a way to get a computer to do things without having to explicitly describe what steps to take, by giving it examples or feedback. The computer then looks for patterns which can explain or predict what happens; the computer is trained through the examples.

The Architecture of a ML System 4 Every machine learning system has four parts: a representation or model of what is being learned; an actor, the part that uses the representation and actually does something; a critic, the part that provides feedback; and a learner, the part that modifies the representation or model, using the feedback. Based on Russell and Norvig, Artificial Intelligence: A Modern Approach, Third Edition, Prentice-Hall, 2009. http://aima.cs.berkeley.edu/
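
As a rough Python sketch of how these four parts might fit together (the class names, method names, and trial loop here are illustrative assumptions, not code from the course):

    class MLSystem:
        """Toy skeleton of the four-part architecture described above (illustrative only)."""

        def __init__(self, model, actor, critic, learner):
            self.model = model      # representation or model of what is being learned
            self.actor = actor      # uses the model and actually does something
            self.critic = critic    # provides feedback on the result
            self.learner = learner  # modifies the model using that feedback

        def run_trial(self, situation):
            action = self.actor.choose(self.model, situation)        # assumed interface
            feedback = self.critic.evaluate(situation, action)       # assumed interface
            self.learner.update(self.model, situation, action, feedback)
            return action, feedback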

Representation 5 A learning system must have a representation or model of what is being learned. This is the component that changes based on experience. In a machine learning system this may be a mathematical model or formula, a set of rules, a decision tree, or some other form of information. For classification and clustering we have studied a number of models.

General Model of Learning Agent 6 [Diagram: a learning agent embedded in its environment. Sensors feed percepts to the Critic and to the Performer (the performance element with its knowledge base); the Critic compares performance against a Performance Standard and sends feedback to the Learning Element; the Learning Element makes changes to the Performer's knowledge and sets learning goals for the Problem Generator, which proposes exploratory actions; the Performer drives the Effectors that act on the Environment.] Based on Russell and Norvig, Artificial Intelligence: A Modern Approach, http://aima.cs.berkeley.edu/

Animals Behind the Scene 7 Computer: Is it a bird? Human: No. Computer: Is it a frog? Human: No. Computer: What is it? Human: A mouse. Computer: Tell me a question that distinguishes a frog from a mouse. Human: Does it have fur? Computer: What is the answer for a frog? Human: No. [Decision tree after several passes: "Is it a bird?" yes: "Is it a penguin?"; no: "Does it have fur?" (yes: "Is it a mouse?"; no: "Is it a frog?")]

Animals Guessing Game Architecture 8 The Animals Guessing Game has all of the parts of a Machine Learning Architecture: The Representation is a sequence of questions and pairs of yes/no answers (called a binary decision tree). The Actor walks the tree, interacting with a human; at each question it chooses whether to follow the yes branch or the no branch. The Critic is the human player telling the game whether it has guessed correctly. The Learner elicits new questions and adds questions, guesses and branches to the tree.
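
A compact Python sketch of how such a game might store and grow its tree; the Node structure and the ask/play helpers are illustrative assumptions, not the original program.

    class Node:
        """A yes/no question (internal node) or a guess (leaf) in the animals tree."""
        def __init__(self, text, yes=None, no=None):
            self.text, self.yes, self.no = text, yes, no
        def is_leaf(self):
            return self.yes is None and self.no is None

    def ask(prompt):
        return input(prompt).strip().lower().startswith("y")

    def play(node):
        # Actor: walk the tree, following the yes or no branch at each question.
        while not node.is_leaf():
            node = node.yes if ask(node.text + " ") else node.no
        # Critic: the human player says whether the guess was correct.
        if ask("Is it a " + node.text + "? "):
            print("I win!")
            return
        # Learner: elicit a new animal and a distinguishing question, grow the tree.
        animal = input("What is it? ")
        question = input("Tell me a question that distinguishes a "
                         + node.text + " from a " + animal + ": ")
        old_answer_yes = ask("What is the answer for a " + node.text + "? (y/n) ")
        old_leaf, new_leaf = Node(node.text), Node(animal)
        node.text = question
        node.yes, node.no = (old_leaf, new_leaf) if old_answer_yes else (new_leaf, old_leaf)

    # Usage: start with a single guess and play repeated rounds.
    # root = Node("bird")
    # while True: play(root)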

Reinforcement Learning 9 The Animals Game is a simple form of Reinforcement Learning: the feedback is at the end, on a series of actions. This is a very early concept in Artificial Intelligence! Arthur Samuel's checkers program was a simple reinforcement-based learner, initially developed in 1956. In 1962 it beat a human checkers master. www-03.ibm.com/ibm/history/ibm100/us/en/icons/ibm700series/impacts/

Machine Learning So Far 10 Supervised learning is the simplest and most studied type of machine learning, but it requires training cases. Unsupervised learning uses some measure of similarity as a critic. Both are static, in the sense that all of the data from which the system will learn already exist. However, for many real-world situations the problem is more complex: rather than a single action or decision, there is a series of decisions to be made, and feedback is not available at each step.

Reinforcement Learning 11 In many situations, we have an agent which has a task to perform. It takes some actions in the world, and at some later point it gets feedback telling it how well it did on performing the task. The agent performs the same task repeatedly. This problem is called reinforcement learning: the agent gets positive reinforcement for tasks done well and negative reinforcement for tasks done poorly, and it must somehow figure out which actions to take.

Reinforcement Learning (cont.) 12 The goal is to get the agent to act in the world so as to maximize its rewards. The agent has to figure out what it did that made it get the reward or punishment; this is known as the credit assignment problem. Reinforcement learning approaches can be used to train computers to do many tasks: backgammon and chess playing, job shop scheduling, and controlling robot limbs.

Simple Example 13 Learn to play checkers: a two-person game, 8x8 board, 12 checkers per side, with a relatively simple set of rules: http://www.darkfish.com/checkers/rules.html. The goal is to eliminate all your opponent's pieces. (Image: https://pixabay.com/en/checker-board-blackgame-pattern-29911/)

Representing Checkers 14 First we need to represent the game. To completely describe one step in the game you need: a representation of the game board, a representation of the current pieces, a variable which indicates whose turn it is, and a variable which tells you which side is black. There is no history needed; a look at the current board setup gives you a complete picture of the state of the game.
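
A minimal Python sketch of such a state description; the field names and the piece encoding ('b'/'w' for men, 'B'/'W' for kings) are illustrative assumptions.

    from dataclasses import dataclass, field

    @dataclass
    class CheckersState:
        # 8x8 board; each square holds None, 'b'/'w' for men, or 'B'/'W' for kings.
        board: list = field(default_factory=lambda: [[None] * 8 for _ in range(8)])
        turn: str = 'b'          # whose turn it is
        black_side: str = 'top'  # which side of the board black plays from

    # The current state alone fully describes the game; no move history is needed.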

Representing Rules 15 Second, we need to represent the rules. The rules are represented as a set of allowable moves given the state of the board. If a checker is at row x, column y, and row x+1, column y+1 or y-1 is empty, it can move there. If a checker is at (x,y), a checker of the opposite color is at (x+1, y+1), and (x+2, y+2) is empty, the checker must jump to (x+2, y+2), removing the jumped checker from play. There are additional rules, but all can be expressed in terms of the state of the board and the checkers. Each rule includes the outcome of the relevant action in terms of the state.
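
One way the first rule might be written against the CheckersState sketch above; the direction of travel is simplified and the helper is an assumption made for illustration.

    def simple_moves(state, x, y):
        """Yield the plain (non-jump) moves allowed for the checker at (x, y)."""
        piece = state.board[x][y]
        if piece is None or piece.lower() != state.turn:
            return                               # not this player's checker
        for dy in (-1, +1):                      # row x+1, column y-1 or y+1
            nx, ny = x + 1, y + dy               # simplified: always moving "down" the board
            if 0 <= nx < 8 and 0 <= ny < 8 and state.board[nx][ny] is None:
                yield (nx, ny)                   # the outcome: the square moved to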

A More Complex Example 16 Consider a driving agent, which must learn to drive a car. What is the state? The possible actions? The reward value?

Formalization for Agent 17 Given: a state space S; a set of actions a1, ..., ak, including their results; and a reward value at the end of each trial (series of actions), which may be positive or negative. Output: a mapping from states to actions.

Reactive Agent 18 This kind of agent is a reactive agent. The general algorithm for a reactive agent is: observe some state; if it is a terminal state, stop; otherwise choose an action from the actions possible in that state, perform the action, and recur.
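
A sketch of that loop in Python, written recursively to match the "recur" phrasing; the environment interface (terminal, possible_actions, perform) is an assumption made for illustration.

    import random

    def reactive_agent(env, state, choose=random.choice):
        """Observe a state; stop at a terminal state; otherwise act and recur."""
        if env.terminal(state):
            return state
        action = choose(env.possible_actions(state))   # pick one allowed action
        next_state = env.perform(state, action)
        return reactive_agent(env, next_state, choose)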

What Do We Want to Learn 19 Given a description of some state of the game and a list of the moves allowed by the rules, what move should we make? Typically more than one move is possible, so we would like some strategies or heuristics or hints about which move to make. This is what we would like to learn. What we have to learn from is whether the game was won or lost.

Simple Checkers Learning 20 We can represent a number of heuristics or rules of thumb in the same formalism as we have used for the board and the rules. If there is a legal move that will create a king, take it: if there is a checker at (7,y) and (8,y-1) or (8,y+1) is free, move there. If there are two legal moves, choose the one that moves a checker farther toward the top row: if checker(x,y) and checker(p,q) can both move, and x>p, move checker(x,y). Each of these heuristics also needs some kind of priority or weight.
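
One way such weighted heuristics might be encoded; the move attributes (to_row, from_row), the scoring functions, and the starting weights are illustrative assumptions.

    # Each heuristic scores a candidate move; a weight sets its priority.
    def makes_king(state, move):
        return 1.0 if move.to_row == 7 else 0.0      # landing on the far row crowns a king

    def advances_toward_top(state, move):
        return move.from_row / 7.0                   # farther-forward checkers preferred

    heuristics = [
        {"score": makes_king, "weight": 1.0},
        {"score": advances_toward_top, "weight": 0.5},
    ]

    def move_value(state, move):
        return sum(h["weight"] * h["score"](state, move) for h in heuristics)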

Formalization for Agent 21 Given: a state space S; a set of actions a1, ..., ak, including their results; a set of heuristics for resolving conflict among actions; and a reward value at the end of each trial (series of actions), which may be positive or negative. Output: a mapping from states to preferred actions.

Learning Agent 22 This kind of agent is a simple learning agent. The general algorithm for a learning agent is: observe some state; if it is a terminal state, stop (if it is a win, increase the weight on all heuristics used; if a loss, decrease the weight on all heuristics used); otherwise choose an action from the actions possible in that state, using the heuristics to select the preferred action; perform the action; recur.
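
A sketch of that loop, reusing the weighted heuristics from the previous sketch; the environment interface (terminal, is_win, possible_actions, perform) and the fixed step size are assumptions.

    def learning_agent(env, state, heuristics, step=0.1, used=None):
        """Play one trial, then nudge the weights of every heuristic that was used."""
        used = [] if used is None else used
        if env.terminal(state):
            delta = step if env.is_win(state) else -step
            for h in used:
                h["weight"] += delta          # reward or punish the heuristics used
            return state
        # Choose the move the weighted heuristics prefer.
        moves = env.possible_actions(state)
        best = max(moves, key=lambda m: sum(h["weight"] * h["score"](state, m)
                                            for h in heuristics))
        used.extend(h for h in heuristics if h["score"](state, best) > 0)
        return learning_agent(env, env.perform(state, best), heuristics, step, used)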

Policy 23 A policy is a complete mapping from states to actions: there must be an action for each state, and there may be more than one action. A policy is not necessarily optimal. The goal of a learning agent is to tune the policy so that the preferred action is optimal, or at least good; this is analogous to training a classifier. For checkers, the trained policy includes all legal actions, with a weight for preferred actions.

Approaches 24 Learn the policy directly: a function mapping from states to actions. This function could use directly learned values (for example, the value of a state which removes the last opponent checker is +1), or a heuristic function which has itself been trained. Learn utility values for states (a value function): estimate the value for each state. For checkers: how happy am I with this state that turns a man into a king?

Value Function 25 The agent knows what state it is in, and it has a number of actions it can perform in each state. Initially, it doesn't know the value of any of the states. If the outcome of performing an action at a state is deterministic, then the agent can update the utility value U() of states: U(oldstate) = reward + U(newstate). The agent learns the utility values of states as it works its way through the state space.
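
A direct translation of that update into Python, using a table of state utilities; the dictionary-based table with a zero default for unknown states is an assumption.

    from collections import defaultdict

    U = defaultdict(float)   # utility estimate for each state, initially unknown (0.0)

    def update_utility(old_state, reward, new_state):
        # Deterministic case from the slide: U(oldstate) = reward + U(newstate)
        U[old_state] = reward + U[new_state]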

Learning States and Actions 26 A typical approach is: at state S, choose some action A, taking us to new state S1. If S1 has a positive value, increase the value of A at S. If S1 has a negative value, decrease the value of A at S. If S1 is new, its initial value is unknown; leave the value of A unchanged. Repeat until... convergence? One complete learning pass or trial eventually reaches a final state (a win or a loss).
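
A sketch of that rule using a table of action values; the keys, the table of known state values, and the fixed adjustment size are illustrative assumptions.

    from collections import defaultdict

    Q = defaultdict(float)       # value of taking action A at state S
    V = {}                       # known values of states (unseen states are absent)

    def adjust(S, A, S1, step=0.1):
        """After taking A at S and landing in S1, nudge the value of A at S."""
        if S1 not in V:
            return                      # S1 is new: leave the value of A unchanged
        if V[S1] > 0:
            Q[(S, A)] += step           # S1 looks good: increase value of A at S
        elif V[S1] < 0:
            Q[(S, A)] -= step           # S1 looks bad: decrease value of A at S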

Selecting an Action 27 Simply choose the action with the highest (current) expected utility? The problem is that each action has two effects: it yields a reward (or penalty) on the current sequence, and information is received and used in learning for future sequences. There is a trade-off between immediate good and long-term well-being. It is like trying a shortcut: you might get lost, or you might learn a quicker route.

Exploration 28 The agent may occasionally choose to explore suboptimal moves in the hope of finding better outcomes. Only by visiting all the states frequently enough can we guarantee learning the true values of all the states. When the agent is learning, the ideal would be to get accurate values for all states, even though that may mean getting a negative outcome. When the agent is performing, the ideal would be to get the optimal outcome. A learning agent should have an exploration policy.

Exploration Policy 29 Wacky approach (exploration): act randomly, in hopes of eventually exploring the entire environment; choose any legal checkers move. Greedy approach (exploitation): act to maximize utility using the current estimate; choose moves that have in the past led to wins. A reasonable balance: act more wacky (exploratory) when the agent has little idea of the environment, and more greedy when the model is close to correct. Suppose you know no checkers strategy: what's the best way to get better?
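
One common way to strike this balance is an epsilon-greedy rule: act randomly with probability epsilon and greedily otherwise, shrinking epsilon as the model improves. The sketch below is a generic illustration, not something specified in these slides.

    import random

    def choose_move(moves, value, epsilon):
        """With probability epsilon explore (random move); otherwise exploit."""
        if random.random() < epsilon:
            return random.choice(moves)            # wacky: any legal move
        return max(moves, key=value)               # greedy: best move by current estimate

    # Early in learning use a large epsilon (e.g. 0.5); decay it toward a small
    # value (e.g. 0.05) as the utility estimates become more trustworthy.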

N-Armed Bandits 30 Example: n-armed bandits, a row of slot machines with various payouts and percentages of wins. Which to play, and how often? The state space is a set of machines with payout and percentage values. The action is to pull a lever. Actions do not directly change the state space: there are no transitions. Each action has a positive or negative result, which then adjusts the utility of that action (pulling that lever).

N-Armed Bandits Example 31 Each action starts with a standard payout. The result is either some cash (a win) or none (a loss). Initially we don't know anything about which machines pay off. Exploration: try things until we get some estimates for the payouts; try them all. Exploitation: when we have some idea of the values of each action, choose the best. Clearly this is heuristic: we may not find the best lever to pull. The more exploration we can do, the better our model, but the higher the cost over multiple trials.
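
A small Python sketch of this explore-then-exploit idea on a simulated row of machines; the win probabilities, payout, and trial counts are made-up illustration values.

    import random

    probs = [0.2, 0.5, 0.3]                 # hidden win probability of each machine
    payout = 1.0
    estimates = [0.0] * len(probs)          # running average reward per lever
    pulls = [0] * len(probs)

    def pull(i):
        return payout if random.random() < probs[i] else 0.0

    # Exploration: try every lever a few times to get rough estimates.
    for i in range(len(probs)):
        for _ in range(20):
            reward = pull(i)
            pulls[i] += 1
            estimates[i] += (reward - estimates[i]) / pulls[i]

    # Exploitation: from here on, keep pulling the lever that currently looks best.
    best = max(range(len(probs)), key=lambda i: estimates[i])
    print("Estimated values:", estimates, "-> playing machine", best)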

Reinforcement Learning 32 Reinforcement learning systems learn a series of actions or decisions, rather than a single decision, based on feedback given at the end of the series. A reinforcement learner has a goal, and carries out trial-and-error search to find the best paths toward that goal.

Reinforcement Learning 33 A typical reinforcement learning system is an active agent, interacting with its environment. It must balance exploration (trying different actions and sequences of actions to discover which ones work best) against achievement (using sequences which have worked well so far). It must also learn successful sequences of actions in an uncertain environment. Typical current applications are in artificial intelligence and in engineering.

RL Summary 34 Reinforcement learning is an active area of research, with approaches from both OR and AI. There are many more sophisticated algorithms that we have not discussed. It is applicable to game playing, robot controllers, and other domains.