Behavior Clustering Inverse Reinforcement Learning and Approximate Optimal Control with Temporal Logic Tasks

Behavior Clustering Inverse Reinforcement Learning and Approximate Optimal Control with Temporal Logic Tasks By Siddharthan Rajasekaran Committee: Jie Fu (Advisor), Jane Li (Co-advisor), Carlo Pinciroli

Outline Part I Behavior Clustering Inverse Reinforcement Learning Part II Approximate Optimal Control with Temporal Logic Tasks

Outline Part I Behavior Clustering Inverse Reinforcement Learning

Outline: Behavior Clustering Inverse Reinforcement Learning. Learning from demonstrations; related work (behavior cloning); Reinforcement Learning background (feature expectation, Maximum Entropy IRL); motivation; method; results; conclusion; future work.

Related work - Broad Overview. Learning from demonstrations: behavioral cloning, reward shaping towards demonstrations, and Inverse Reinforcement Learning.

Related work - Broad Overview. Learning from demonstrations: behavioral cloning (Bojarski et al. 2016, Ross et al. 2011). Treat demonstrations as labels and perform supervised learning. Simple to use and implement, but it does not generalize well and can crash due to positive feedback on small errors. (Figure: given trajectories, learn the function approximation.)

Related work - Broad Overview. Learning from demonstrations: reward shaping towards demonstrations (Brys et al. 2015, Vasan et al. 2017). Give an auxiliary reward for mimicking the expert. Does not generalize well and requires the definition of distance metrics. (Figure: the mimicking action is steered towards the closest point on the demonstration.)

Related work. Learning from demonstrations: Inverse Reinforcement Learning (Abbeel et al. 2004, Ziebart et al. 2008). Finds the reward the expert is maximizing and generalizes well to unseen situations. This is the topic of interest.

Related work. Why Inverse Reinforcement Learning? Finding the intent is useful for reasoning about the expert's decisions, for prediction (planning ahead of time), and for collaboration (assisting humans to complete a task).

IRL Motivation Motivating example

IRL Motivation - Collaboration. An autonomous agent practicing IRL can recognize intent and take actions completely different from the expert's to serve that intent (video: Warneken & Tomasello 2006).

Outline Preliminaries

Preliminary - Reinforcement Learning (RL). Agent-environment interaction modeling in Reinforcement Learning, and the objective of RL.
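The objective on this slide was an image; the standard discounted-return form it refers to is, in common notation (the exact symbols on the slide may differ),

```latex
J(\pi) \;=\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right],
\qquad
\pi^{*} \;=\; \arg\max_{\pi} J(\pi),
```

where gamma in [0, 1) is the discount factor.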

Preliminary - RL. Reinforcement Learning: given an environment, a set of actions to choose from, and rewards, it finds the optimal behavior that maximizes cumulative reward.

Preliminary - RL vs. IRL. Inverse Reinforcement Learning: given an environment, a set of actions to choose from, and expert demonstrations, it finds the reward function that best explains the expert demonstrations.

Preliminary - RL. We will introduce the linear reward setting, feature expectations, and a graphical interpretation of RL, which is required for the graphical interpretation of IRL.

Preliminary - RL. Linear rewards: linear only in the weights; the reward can be complex and nonlinear in the states by using non-linear features.
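The linear form referenced here appeared as an image on the slide; in the usual notation it is

```latex
r(s) \;=\; w^{\top} \phi(s),
```

where phi(s) is a (possibly nonlinear) feature vector of the state and w is the vector of reward weights; the symbols are assumed to match the standard MaxEnt IRL convention.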

Linear reward - simple example Grid world Each color is a region

Linear reward - simple example Grid world Each color is a region Reward function Red = +5 Yellow = -1

Linear reward - simple example. Grid world; each color is a region. Reward function: Red = +5, Yellow = -1. Each dimension of the feature vector is an indicator of whether we are in that region.
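As a concrete reading of this slide (the symbols are assumptions, since the original equation was an image), the indicator features and the corresponding weights would be

```latex
\phi(s) \;=\; \begin{bmatrix} \mathbb{1}[s \in \text{red}] \\ \mathbb{1}[s \in \text{yellow}] \end{bmatrix},
\qquad
w \;=\; \begin{bmatrix} +5 \\ -1 \end{bmatrix},
\qquad
r(s) \;=\; w^{\top}\phi(s).
```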

RL - Linear Setting RL Objective:

RL - Linear Setting Feature expectation of any behavior is a vector in n-dimensional space
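The feature expectation and the resulting objective appeared as images on these slides; in the usual notation of feature-expectation methods (Abbeel & Ng 2004) they are

```latex
\mu(\pi) \;=\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, \phi(s_t)\right],
\qquad
J(\pi) \;=\; \mathbb{E}_{\pi}\!\left[\sum_{t} \gamma^{t}\, w^{\top}\phi(s_t)\right] \;=\; w^{\top}\mu(\pi),
```

so in the linear setting the RL objective reduces to maximizing the inner product between the weight vector w and the feature-expectation vector mu(pi).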

RL - Linear Setting. Geometrically, the RL objective is to maximize the projection of the feature expectation onto the reward weights, which amounts to minimizing the angle Ψ between the weight vector and the feature-expectation vector.

IRL - Overview IRL algorithms

Outline - Background Maximum Entropy Inverse Reinforcement Learning

MaxEnt IRL Maximum Entropy IRL (Ziebart 2010)

MaxEnt IRL Maximum Entropy IRL (Ziebart 2010) Objective function given demonstrations
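The objective on this slide was an image; the standard MaxEnt IRL model (Ziebart et al. 2008) it refers to assigns trajectories exponentially higher probability the higher their cumulative reward, and maximizes the log-likelihood of the demonstrations (symbols assumed):

```latex
P(\tau \mid w) \;=\; \frac{\exp\!\big(w^{\top} f_{\tau}\big)}{Z(w)},
\qquad
\mathcal{L}(w) \;=\; \sum_{i=1}^{N} \log P(\tau_i \mid w),
```

where f_tau is the feature count of trajectory tau and Z(w) is the partition function.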

IRL - Linear setting Gradient ascent on likelihood
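The gradient shown here was an image; for MaxEnt IRL it has the well-known form "expert feature counts minus expected feature counts under the current reward" (notation assumed):

```latex
\nabla_{w}\,\mathcal{L}(w) \;=\; \tilde{f} \;-\; \sum_{\tau} P(\tau \mid w)\, f_{\tau}
\;=\; \tilde{f} \;-\; \sum_{s} D_{s}\, \phi(s),
```

where \tilde{f} is the average feature count of the demonstrations and D_s is the expected state visitation frequency.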

IRL - Linear setting MaxEnt IRL Algorithm
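The algorithm slide (and the build slides that followed it) were diagrams; below is a minimal tabular sketch of the MaxEnt IRL loop they describe, assuming a transition tensor P[s, a, s'], a feature matrix phi, and demonstrations given as lists of state indices. All names are illustrative assumptions, not the presenter's code.

```python
# Minimal sketch of the MaxEnt IRL gradient-ascent loop (Ziebart et al. 2008).
import numpy as np

def soft_value_iteration(P, reward, gamma=0.95, iters=100):
    """Backward pass: soft (MaxEnt) values and the resulting stochastic policy."""
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    for _ in range(iters):
        Q = reward[:, None] + gamma * (P @ V)            # shape (S, A)
        Q_max = Q.max(axis=1, keepdims=True)
        V = (Q_max + np.log(np.exp(Q - Q_max).sum(axis=1, keepdims=True))).ravel()
    return np.exp(Q - V[:, None])                        # pi(a | s)

def expected_feature_counts(P, policy, phi, start_dist, horizon=50):
    """Forward pass: accumulate state visitation frequencies, project onto features."""
    d = np.asarray(start_dist, dtype=float).copy()
    visitation = np.zeros_like(d)
    for _ in range(horizon):
        visitation += d
        d = np.einsum('s,sa,sat->t', d, policy, P)       # one-step state distribution
    return visitation @ phi                              # shape (n_features,)

def maxent_irl(P, phi, demos, start_dist, lr=0.1, epochs=100):
    """Gradient ascent on the MaxEnt demonstration log-likelihood."""
    w = np.zeros(phi.shape[1])
    f_expert = np.mean([phi[np.asarray(traj)].sum(axis=0) for traj in demos], axis=0)
    for _ in range(epochs):
        policy = soft_value_iteration(P, phi @ w)
        f_model = expected_feature_counts(P, policy, phi, start_dist)
        w += lr * (f_expert - f_model)                   # expert minus expected counts
    return w
```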

IRL - Linear setting. Problems with MaxEnt: it interprets variance in the demonstrations as suboptimality. Consider these demonstrations (figure: several demonstrated trajectories from the start state S).

IRL - Linear setting. Why we should not learn the mean behavior: the prediction is wrong (the agent now predicts the mean behavior), an unintended and possibly unsafe behavior is learned (think of the driving case), and the wrong intent is learned, so the agent cannot collaborate. It is also not practical to obtain perfectly consistent demonstrations.

Behavior Clustering IRL. Parametric: a fixed number of clusters/behaviors; soft clustering learns the probability that a given demonstration belongs to a class, and learns the reward parameters for each behavior. Non-parametric: in addition, learns the number of clusters.

Behavior Clustering IRL. Expectation Maximization: the missing data is the distribution over behaviors, and the given data is the demonstrations. The complete-data objective is easier to optimize than the marginal likelihood.

Behavior Clustering IRL. The new objective function: the previous objective function was for a single behavior; the new objective handles multiple behaviors as a mixture (see below).
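The single- and multi-behavior objectives on this slide were images; a standard way to write the mixture extension (symbols assumed, not taken from the slide) is

```latex
\mathcal{L}(w_{1:K}, \rho_{1:K}) \;=\; \sum_{i=1}^{N} \log \sum_{j=1}^{K} \rho_{j}\, P(\tau_i \mid w_j),
```

where rho_j is the prior probability of behavior j and P(tau | w_j) is the MaxEnt trajectory likelihood under that behavior's reward weights.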

Behavior Clustering IRL. The new objective function: update the reward functions using the responsibility-weighted gradient and update the priors, where the weight is the probability that a demonstration comes from a given behavior; compare with the update in vanilla MaxEnt IRL (see the equations below).
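The update equations were images; in the EM form these slides describe, the responsibilities, the weighted reward update, and the prior update would read (notation assumed)

```latex
\beta_{ij} \;=\; \frac{\rho_j\, P(\tau_i \mid w_j)}{\sum_{k} \rho_k\, P(\tau_i \mid w_k)},
\qquad
\nabla_{w_j}\mathcal{L} \;=\; \sum_{i} \beta_{ij}\big(f_{\tau_i} - \mathbb{E}_{P(\tau \mid w_j)}[f_{\tau}]\big),
\qquad
\rho_j \;=\; \frac{1}{N}\sum_{i}\beta_{ij},
```

which reduces to the vanilla MaxEnt IRL update when there is a single cluster (beta_i1 = 1 for all i).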

Non-parametric Behavior Clustering IRL. Non-parametric BCIRL learns the number of clusters; we should learn the minimum number of clusters. The Chinese Restaurant Process (CRP) is used for non-parametric clustering.

Non-parametric Behavior Clustering IRL. The Chinese Restaurant Process (CRP) is used for non-parametric clustering (illustration source: Stanford CS224n NLP course). Probability of choosing a table:
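The probability on the slide was an image; the standard CRP assigns the n-th customer (demonstration) to a table (cluster) with

```latex
P(\text{existing table } k) \;=\; \frac{n_k}{n - 1 + \alpha},
\qquad
P(\text{new table}) \;=\; \frac{\alpha}{n - 1 + \alpha},
```

where n_k is the number of customers already at table k and alpha > 0 is the concentration parameter.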

Non-parametric Behavior Clustering IRL. Non-parametric BCIRL learns the number of clusters; we should learn the minimum number of clusters. For our problem, the CRP counts are the soft cluster assignments (probability mass).

Algorithm - BCIRL. There is always some non-zero probability of creating a new cluster. For every demonstration-cluster combination, compute the assignment probability; weighted resampling avoids keeping unlikely clusters (just like in particle filters). Clustering happens at this step, using weighted feature expectations, and we need not solve the complete inverse problem at every iteration! A sketch of the loop follows below.
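The algorithm build slides were diagrams; the following is a rough, heavily simplified sketch of one iteration of the loop as I read these slides (CRP-style new-cluster proposal, soft assignment, particle-filter-style resampling, one weighted MaxEnt gradient step per cluster). The callbacks traj_loglik and maxent_grad_step and all parameter names are assumptions, not the thesis implementation.

```python
# One iteration of non-parametric Behavior Clustering IRL (illustrative sketch).
import numpy as np

def bcirl_iteration(demos, weights, priors, traj_loglik, maxent_grad_step,
                    alpha=1.0, lr=0.1, rng=np.random.default_rng(0)):
    n, k = len(demos), len(weights)
    # Propose one extra (empty) cluster: non-zero probability of a new behavior.
    weights = weights + [rng.normal(scale=0.01, size=weights[0].shape)]
    priors = np.append(priors * (n / (n + alpha)), alpha / (n + alpha))

    # E-step: responsibility of each cluster for each demonstration.
    loglik = np.array([[traj_loglik(tau, w) for w in weights] for tau in demos])
    resp = priors * np.exp(loglik - loglik.max(axis=1, keepdims=True))
    resp /= resp.sum(axis=1, keepdims=True)

    # Weighted resampling of clusters (like particle filters): drop unlikely ones.
    mass = resp.sum(axis=0)
    keep = np.unique(rng.choice(len(weights), size=k, p=mass / mass.sum()))
    weights = [weights[j] for j in keep]
    resp = resp[:, keep]
    resp /= resp.sum(axis=1, keepdims=True)

    # M-step: one responsibility-weighted MaxEnt gradient step per surviving
    # cluster -- no need to solve the full inverse problem at every iteration.
    weights = [w + lr * maxent_grad_step(demos, resp[:, j], w)
               for j, w in enumerate(weights)]
    priors = resp.mean(axis=0)
    return weights, priors
```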

Results. On a motivating example. (Figure: states and actions.)

Results. On a motivating example: learned policy and likelihood of the demonstrations (objective) for MaxEnt IRL vs. non-parametric BCIRL.

Results. Highway task: aggressive demonstrations and evasive demonstrations (figure: the agent's path from START to FINISH among other cars).

Results Demonstrations Learned Behaviors

Results. Demonstrations: likelihood of the demonstrations (objective value) at convergence, MaxEnt IRL vs. non-parametric BCIRL.

Results Gazebo simulator

Results Gazebo simulator Aggressive behavior using potential field controller

Results. Gazebo simulator results: discretize the state space based on the size of the car. Likelihood of the demonstrations (objective value) at convergence, MaxEnt IRL vs. non-parametric BCIRL.

Results. Gazebo simulator results: likelihood of the demonstrations (objective value) at convergence, MaxEnt IRL vs. non-parametric BCIRL. The demonstrations are clustered into groups of [21, 19, 5, 4, 1]: cluster 1 is evasive, cluster 2 is aggressive, and clusters 3, 4, and 5 are neither. The method is able to learn the behaviors even though we cannot obtain consistent demonstrations.

Conclusion. Advantages of Behavior Clustering IRL: it can cluster demonstrations and learn a reward function for each behavior, it can predict new samples with high probability, and it can be used to separate consistent demonstrations from the rest. Disadvantages: feature selection is harder, since the features must also explain the differences between behaviors, and it does not scale well (this also holds for MaxEnt), since an IRL problem is solved for each cluster.

Future work. Addressing some of the disadvantages: feature selection, via feature construction for IRL (Levine 2010) and guided cost learning (Finn 2016); scalability (also an issue in MaxEnt), via guided policy search (Levine 2013) and path-integral and Metropolis-Hastings sampling (Kappen 2009).

Outline Part II Approximate Optimal Control with Temporal Logic Tasks

Outline. Background: LTL specifications, reward shaping, policy gradients, actor-critic. Method: relation between reward shaping and actor-critic, heuristic value initialization. Results. Conclusion.

Motivation. Motivating example: Robot Soccer (image source: IEEE Spectrum).

Motivation. A simpler robot-soccer task with no opponents or teammates. There is a sequence of temporally constrained requirements, for example: get the ball (T1), go near the goal (T2), shoot (T3); this is captured by an LTL specification. (Figure labels: Goal, Ball, Agent.)

Motivation - Why RL alone fails. For the simpler task with no opponents or teammates, define the reward function as +1 if a goal is scored. This reward is very hard to explore.

Motivation. How to use LTL to accelerate learning: an LTL specification is either true when satisfied or false otherwise, so there is no signal towards partial completion. We exploit structure in actor-critic to motivate the agent towards completion.

Preliminaries

Reward Shaping. We need to satisfy temporally related requirements; for example: get the ball, R = 0.01 (shaping reward); score a goal, R = +1 (true reward). Result (Andrew Ng 1999): the agent keeps oscillating near the ball instead of scoring.

Reward Shaping and Policy Invariance. Before shaping, the optimal policy was to score a goal; after the naive shaping above, the optimal policy is to oscillate near the ball. Policy invariance (Andrew Ng 1999): shaping rewards of the potential-based form leave the optimal policy unchanged. More generally, the shaping term is defined as follows.
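The general form on the slide was an image; the potential-based shaping result of Ng, Harada & Russell (1999) states that shaping rewards of the form

```latex
F(s, a, s') \;=\; \gamma\,\Phi(s') \;-\; \Phi(s),
\qquad
r'(s, a, s') \;=\; r(s, a, s') + F(s, a, s'),
```

for any potential function Phi, preserve the set of optimal policies of the original MDP.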

Preliminaries - Policy Gradients. Objective of RL: by parametrizing the policy, we define the utility of the parameters and maximize it.
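The parametrized objective on the slide was an image; in standard policy-gradient notation it is

```latex
J(\theta) \;=\; \mathbb{E}_{\tau \sim \pi_{\theta}}\!\left[\sum_{t} \gamma^{t}\, r(s_t, a_t)\right],
\qquad
\theta^{*} \;=\; \arg\max_{\theta} J(\theta).
```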

Preliminaries - Policy Gradients. Gradient of the utility from samples: the policy gradient.
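The estimator on the slide was an image; the standard likelihood-ratio (REINFORCE) policy gradient it refers to is

```latex
\nabla_{\theta} J(\theta)
\;=\; \mathbb{E}_{\tau \sim \pi_{\theta}}\!\left[\sum_{t} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\, R(\tau)\right]
\;\approx\; \frac{1}{N}\sum_{i=1}^{N} \sum_{t} \nabla_{\theta} \log \pi_{\theta}(a^{i}_t \mid s^{i}_t)\, R(\tau^{i}),
```

estimated from N sampled trajectories.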

Background - Actor Critic. The connection between policy gradients, reward shaping, and actor-critic.

Background - Actor Critic. Actor (policy) update: use the empirical estimate of the gradient. Critic (value) update: use any supervised learning method to learn the value targets. Critics are shaping functions (see the equations below).

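The update equations on these slides were images; a standard one-step actor-critic update consistent with the description (the critic acting as a shaping function) is, with assumed notation,

```latex
\delta_t \;=\; r_t + \gamma\, V_{w}(s_{t+1}) - V_{w}(s_t),
\qquad
\theta \leftarrow \theta + \alpha_{\theta}\, \delta_t\, \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t),
\qquad
w \leftarrow w + \alpha_{w}\, \delta_t\, \nabla_{w} V_{w}(s_t),
```

where the TD error delta_t has the same form as a potential-based shaping term with the critic V_w as the potential.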
Method - Accelerating Actor Critic using LTL. Given a specification: +10 for satisfying the specification. (Figure legend: Agent, regions R1, R2, R3, obstacle O.)

Method - Accelerating Actor Critic using LTL. Given the specification, break it down into several reach-avoid tasks (Task 1, Task 2, ...) derived from the automaton of the original specification, and use them for heuristic value initialization of the critic.

Method - Accelerating Actor Critic using LTL. Heuristic value initialization for Task 2. Reward: +10 if the task specification is satisfied, -5 for running into obstacles. (Figure legend: Agent, regions R1, R2, R3.)

Results. Learned values. Reward: +10 if the specification is satisfied, -5 for running into obstacles. (Figure legend: Agent, regions R1, R2, R3.)

Results. Actor-critic with and without heuristic initialization. Reward: +10 if the specification is satisfied, -5 for running into obstacles.

Conclusion and Discussions. Summary: IRL with automated behavior clustering (improve feature selection and scalability), and accelerating actor-critic with temporal logic constraints (automate the decomposition procedure for LTL specifications to scale to larger systems). Possible directions: use LTL specifications to accelerate BCIRL; applications to more general domains such as big data and urban planning.