Machine Learning 10-701/15-781, Spring 2008: Reinforcement Learning 2 (Eric Xing)

Machine Learning 10-701/15-781, Spring 2008
Reinforcement Learning 2
Eric Xing, Lecture 28, April 30, 2008
Reading: Chap. 13, T.M. book

Outline
- Defining an RL problem
  - Markov Decision Processes
- Solving an RL problem
  - Dynamic Programming
  - Monte Carlo methods
  - Temporal-Difference learning
- Miscellaneous
  - state representation
  - function approximation
  - rewards

Markov Decision Process (MDP)
- set of states S, set of actions A, initial state s0
- transition model P(s, a, s')
  - e.g. P([1,1], up, [1,2]) = 0.8
- reward function r(s)
  - e.g. r([4,3]) = +1
- goal: maximize cumulative reward in the long run
- policy: mapping from S to A: π(s) or π(s,a)
- reinforcement learning
  - transitions and rewards usually not available
  - how to change the policy based on experience
  - how to explore the environment

Dynamic programming
- Main idea
  - use value functions to structure the search for good policies
  - needs a perfect model of the environment
- Two main components
  - policy evaluation: compute V^π from π
  - policy improvement: improve π based on V^π
- start with an arbitrary policy, repeat evaluation/improvement until convergence (a small sketch of this loop follows below)
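As a concrete illustration of the two DP components, here is a minimal policy-iteration sketch for a toy MDP. The 3-state, 2-action transition table, the rewards, and the discount factor are made-up assumptions for illustration only, not part of the lecture.

    import numpy as np

    # Hypothetical toy MDP: P[s, a, s'] = transition probability, r[s] = reward in state s.
    P = np.array([
        [[0.8, 0.2, 0.0], [0.1, 0.9, 0.0]],   # from state 0
        [[0.0, 0.6, 0.4], [0.0, 0.1, 0.9]],   # from state 1
        [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],   # state 2 is absorbing
    ])
    r = np.array([0.0, 0.0, 1.0])
    gamma, n_states, n_actions = 0.9, 3, 2

    def policy_evaluation(pi, tol=1e-8):
        """Compute V^pi for a deterministic policy pi (array: state -> action)."""
        V = np.zeros(n_states)
        while True:
            V_new = r + gamma * np.array([P[s, pi[s]] @ V for s in range(n_states)])
            if np.max(np.abs(V_new - V)) < tol:
                return V_new
            V = V_new

    def policy_improvement(V):
        """Return the greedy policy with respect to V."""
        Q = r[:, None] + gamma * np.einsum('sat,t->sa', P, V)
        return np.argmax(Q, axis=1)

    pi = np.zeros(n_states, dtype=int)      # start with an arbitrary policy
    while True:                             # repeat evaluation/improvement until convergence
        V = policy_evaluation(pi)
        new_pi = policy_improvement(V)
        if np.array_equal(new_pi, pi):
            break
        pi = new_pi
    print("policy:", pi, "values:", V)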

Policy/Value iteration
[figure]

Using DP
- needs a complete model of the environment and rewards
  - robot in a room: state space, action space, transition model
- can we use DP to solve robot in a room? backgammon? helicopter?
- DP bootstraps: updates estimates on the basis of other estimates
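Value iteration, the other classic DP variant named on the slide above, can be sketched in a few lines. This reuses the hypothetical P, r, gamma, and n_states defined in the policy-iteration sketch earlier; it is an illustrative assumption, not the lecture's own code.

    # Value iteration for the same toy MDP (reuses P, r, gamma, n_states from above).
    V = np.zeros(n_states)
    for _ in range(1000):
        # Bellman optimality backup: V(s) <- r(s) + gamma * max_a sum_s' P(s,a,s') V(s')
        V_new = r + gamma * np.max(np.einsum('sat,t->sa', P, V), axis=1)
        if np.max(np.abs(V_new - V)) < 1e-8:
            break
        V = V_new
    greedy_pi = np.argmax(np.einsum('sat,t->sa', P, V), axis=1)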

Monte Carlo methods
- don't need full knowledge of the environment
  - just experience, or simulated experience
- but similar to DP: policy evaluation, policy improvement
- averaging sample returns
- defined only for episodic tasks
  - episodic (vs. continuing) tasks: game over after N steps
  - optimal policy depends on N; harder to analyze

Monte Carlo policy evaluation
- Want to estimate V^π(s) = expected return starting from s and following π
  - estimate as the average of observed returns in state s
- First-visit MC: average returns following the first visit to state s (sketched in code below)
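A minimal first-visit Monte Carlo evaluation sketch. The episode format (a list of (state, reward) pairs generated by following π) and the function name are assumptions made for illustration.

    from collections import defaultdict

    def first_visit_mc(episodes, gamma=1.0):
        """Estimate V^pi(s) as the average return following the first visit to s.
        Each episode is assumed to be a list of (state, reward) pairs generated by pi."""
        returns_sum = defaultdict(float)
        returns_cnt = defaultdict(int)
        for episode in episodes:
            # returns[t] = reward_t + gamma * reward_{t+1} + ... to the end of the episode
            G, returns = 0.0, [0.0] * len(episode)
            for t in range(len(episode) - 1, -1, -1):
                G = episode[t][1] + gamma * G
                returns[t] = G
            seen = set()
            for t, (s, _) in enumerate(episode):
                if s not in seen:                  # only the first visit to s counts
                    seen.add(s)
                    returns_sum[s] += returns[t]
                    returns_cnt[s] += 1
        return {s: returns_sum[s] / returns_cnt[s] for s in returns_sum}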

Monte Carlo control
- V^π is not enough for policy improvement: that would need an exact model of the environment
- estimate Q^π(s,a) instead
- MC control: update after each episode
  - non-stationary environment
- A problem: a greedy policy won't explore all actions

Maintaining exploration
- A deterministic/greedy policy won't explore all actions
  - we don't know anything about the environment at the beginning
  - need to try all actions to find the optimal one
- Maintain exploration: use soft policies instead: π(s,a) > 0 for all s,a
- ε-greedy policy (sketched in code below)
  - with probability 1-ε perform the optimal/greedy action
  - with probability ε perform a random action
  - will keep exploring the environment
  - slowly move it towards a greedy policy: ε → 0
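A small sketch of ε-greedy action selection over a tabular Q function. The dictionary-of-arrays layout for Q, the function name, and the decay schedule in the comment are assumptions for illustration.

    import random
    import numpy as np

    def epsilon_greedy(Q, state, n_actions, epsilon):
        """Pick a random action with probability epsilon, else the greedy action under Q.
        Q is assumed to map state -> np.array of action values."""
        if random.random() < epsilon:
            return random.randrange(n_actions)     # explore
        return int(np.argmax(Q[state]))            # exploit

    # ε can be decayed over time ("slowly move it towards greedy"), e.g.:
    # epsilon = max(0.01, epsilon * 0.999)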

Simulated experience
- 5-card draw poker
  - s0: A, A, 6, A, 2
  - a0: discard 6, 2
  - s1: A, A, A, A, 9 + dealer takes 4 cards
  - return: +1 (probably)
- DP
  - list all states and actions, compute P(s,a,s')
  - P([A,A,6,A,2], [6,2], [A,9,4]) = 0.00192
- MC
  - all you need are sample episodes
  - let MC play against a random policy, or itself, or another algorithm

Temporal Difference Learning
- Combines ideas from MC and DP
  - like MC: learn directly from experience (don't need a model)
  - like DP: bootstrap
  - works for continuing tasks, usually faster than MC
- Constant-α MC: have to wait until the end of the episode to update
- simplest TD: update after every step, based on the successor

TD in passive learning
- TD(0) key idea: adjust the estimated utility value of the current state based on its immediate reward and the estimated value of the next state.
- The updating rule: V(s) ← V(s) + α [ r(s) + γ V(s') − V(s) ]
  - α is the learning rate parameter
- Only when α is a function that decreases as the number of times a state has been visited increases can V(s) converge to the correct value. (A sketch of this update follows below.)

Algorithm TD(λ) (not in Russell & Norvig's book)
- Idea: update from the whole epoch, not just on one state transition.
- Special cases:
  - λ = 1: LMS
  - λ = 0: TD
- An intermediate choice of λ (between 0 and 1) is best.
- Interplay with α
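A minimal TD(0) update sketch corresponding to the rule above. The 1/visit-count learning-rate schedule is one illustrative choice of an α that decreases with the number of visits, not the lecture's prescription.

    from collections import defaultdict

    V = defaultdict(float)        # estimated state values
    visits = defaultdict(int)     # visit counts, used to decay the learning rate
    gamma = 0.9

    def td0_update(s, r, s_next):
        """Apply V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s))."""
        visits[s] += 1
        alpha = 1.0 / visits[s]   # decreases with the number of visits (assumed schedule)
        V[s] += alpha * (r + gamma * V[s_next] - V[s])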

MC vs. TD
- Observed the following 8 episodes:
  A, 0, B, 0
  B, 1
  B, 1
  B, 1
  B, 1
  B, 1
  B, 1
  B, 0
- MC and TD agree on V(B) = 3/4 (see the numerical check below)
- MC: V(A) = 0
  - converges to values that minimize the error on the training data
- TD: V(A) = 3/4
  - converges to the ML estimate of the Markov process
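A small worked check of this example, assuming the episodes are encoded as (state, reward) pairs. Batch MC averages the observed returns per state, while the TD answer follows from the maximum-likelihood model in which A always transitions to B with reward 0.

    # Eight episodes, each a list of (state, reward) pairs.
    episodes = [[('A', 0), ('B', 0)]] + [[('B', 1)]] * 6 + [[('B', 0)]]

    # Batch Monte Carlo: V(s) = average (undiscounted) return following visits to s.
    def mc_values(episodes):
        totals, counts = {}, {}
        for ep in episodes:
            for t, (s, _) in enumerate(ep):
                G = sum(r for _, r in ep[t:])          # return from t to end of episode
                totals[s] = totals.get(s, 0.0) + G
                counts[s] = counts.get(s, 0) + 1
        return {s: totals[s] / counts[s] for s in totals}

    print(mc_values(episodes))   # {'A': 0.0, 'B': 0.75}  -> MC: V(A) = 0, V(B) = 3/4

    # TD / maximum-likelihood view: A always goes to B with reward 0,
    # so V(A) = 0 + V(B) = 3/4.
    V_A_td = 0 + mc_values(episodes)['B']
    print(V_A_td)                # 0.75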

The TD learning curve
[figure: TD learning curves for states (4,3), (2,3), (2,2), (1,1), (3,1), (4,1), (4,2)]

Another model-free method: TD-Q learning
- Define the Q-value function
- Q-value function updating rule (see subsequent slides)
- Key idea of TD-Q learning: combined with the temporal difference approach
- Rule to choose the action to take

Sarsa
- Again, need Q(s,a), not just V(s)
- Control
  - start with a random policy
  - update Q and π after each step
  - again, need ε-soft policies

Q-learning
- Before: on-policy algorithms
  - start with a random policy, iteratively improve
  - converge to optimal
- Q-learning: off-policy
  - use any policy to estimate Q
  - Q directly approximates Q* (Bellman optimality equation)
  - independent of the policy being followed
  - only requirement: keep updating each (s,a) pair
- compare with Sarsa (both update rules are sketched in code below)
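A side-by-side sketch of the two tabular updates, assuming Q maps each state to a numpy array of action values as in the ε-greedy example; the function names and default step sizes are illustrative assumptions. The on-policy Sarsa target uses the action actually taken next, while the off-policy Q-learning target uses the max.

    import numpy as np

    def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
        # On-policy target: the action a_next actually chosen by the current (ε-soft) policy.
        target = r + gamma * Q[s_next][a_next]
        Q[s][a] += alpha * (target - Q[s][a])

    def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
        # Off-policy target: the greedy value, independent of the behaviour policy.
        target = r + gamma * np.max(Q[s_next])
        Q[s][a] += alpha * (target - Q[s][a])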

TD-Q learning agent algorithm (a runnable sketch follows below)
- For each pair (s, a), initialize Q(s,a)
- Observe the current state s
- Loop forever {
    Select an action a (optionally with ε-exploration) and execute it
    Receive the immediate reward r and observe the new state s'
    Update Q(s,a)
    s = s'
  }

Exploration
- Tradeoff between exploitation (control) and exploration (identification)
- Extremes: greedy vs. random acting (n-armed bandit models)
- Q-learning converges to optimal Q-values if
  - every state is visited infinitely often (due to exploration),
  - the action selection becomes greedy as time approaches infinity, and
  - the learning rate α is decreased fast enough but not too fast (as we discussed in TD learning)
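A runnable version of the agent loop above, under stated assumptions: a hypothetical env object with reset() and step(a) methods (step returning next state, reward, and a done flag), a fixed number of episodes standing in for "loop forever", and fixed ε and α rather than decaying schedules.

    from collections import defaultdict
    import random
    import numpy as np

    def q_learning_agent(env, n_actions, episodes=500, epsilon=0.1, alpha=0.1, gamma=0.9):
        """Tabular Q-learning loop; env.reset()/env.step(a) are assumed interfaces."""
        Q = defaultdict(lambda: np.zeros(n_actions))      # initialize Q(s,a) for each pair
        for _ in range(episodes):                         # episodic stand-in for "loop forever"
            s = env.reset()                               # observe the current state s
            done = False
            while not done:
                # select an action a (with ε-exploration) and execute it
                if random.random() < epsilon:
                    a = random.randrange(n_actions)
                else:
                    a = int(np.argmax(Q[s]))
                s_next, r, done = env.step(a)             # receive reward r, observe new state s'
                # update Q(s,a) with the off-policy TD target (no bootstrap at terminal states)
                target = r if done else r + gamma * np.max(Q[s_next])
                Q[s][a] += alpha * (target - Q[s][a])
                s = s_next                                # s <- s'
        return Q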

Pole-balancing
[figure-only slides]


A Success Story: TD-Gammon (Tesauro, G., 1992)
- A backgammon-playing program.
- Application of temporal difference learning.
- The basic learner is a neural network.
- It trained itself to the world-class level by playing against itself and learning from the outcome. So smart!!
- More information: http://www.research.ibm.com/massive/tdl.html

Summary
- Reinforcement learning: use when you need to make decisions in an uncertain environment
- Solution methods
  - dynamic programming: needs a complete model
  - Monte Carlo
  - temporal-difference learning (Sarsa, Q-learning): most work
- the algorithms are simple
- you need to design features, the state representation, and rewards

Future research in RL
- Function approximation (& convergence results)
- On-line experience vs. simulated experience
- Amount of search in action selection
- Exploration method (safe?)
- Kind of backups
  - full (DP) vs. sample backups (TD)
  - shallow (Monte Carlo) vs. deep (exhaustive)
  - λ controls this in TD(λ)
- Macros
  - Advantages
    - reduce the complexity of learning by learning subgoals (macros) first
    - can be learned by TD(λ)
  - Problems
    - selection of macro actions
    - learning models of macro actions (predicting their outcome)
    - how do you come up with subgoals?

Types of Learning
- Supervised Learning
  - Training data: (X, Y) (features, label)
  - Predict Y, minimizing some loss.
  - Regression, Classification.
- Unsupervised Learning
  - Training data: X (features only)
  - Find similar points in high-dimensional X-space.
  - Clustering.
- Reinforcement Learning
  - Training data: (S, A, R) (State-Action-Reward)
  - Develop an optimal policy (a sequence of decision rules) for the learner so as to maximize its long-term reward.
  - Robotics, board-game-playing programs

Where is Machine Learning being used, or where can it be useful?
- Information retrieval
- Speech recognition
- Computer vision
- Games
- Robotic control
- Pedigree
- Evolution
- Planning