Behavior Clustering Inverse Reinforcement Learning and Approximate Optimal Control with Temporal Logic Tasks
1 Behavior Clustering Inverse Reinforcement Learning and Approximate Optimal Control with Temporal Logic Tasks By Siddharthan Rajasekaran Committee: Jie Fu (Advisor), Jane Li (Co-advisor), Carlo Pinciroli
2 Outline Part I Behavior Clustering Inverse Reinforcement Learning Part II Approximate Optimal Control with Temporal Logic Tasks
3 Outline Part I Behavior Clustering Inverse Reinforcement Learning
4 Outline Behavior Clustering Inverse Reinforcement Learning Learning from demonstrations Related work Behavior cloning Reinforcement Learning Background Feature Expectation Maximum Entropy IRL Motivation Method Results Conclusion Future work
5 Related work - Broad Overview Learning from demonstrations Behavioral cloning Reward shaping towards demonstrations Inverse Reinforcement Learning
6 Related work - Broad Overview Learning from demonstrations Behavioral cloning - Bojarski et al. 2016, Ross et al. Treat demonstrations as labels and perform supervised learning. Simple to use and implement. Does not generalize well; can crash due to a positive feedback loop of compounding errors. Given trajectories, learn the function approximation.
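Behavioral cloning reduces imitation to supervised learning on state-action pairs. A minimal sketch with hypothetical toy data, using the simplest possible learner (a 1-nearest-neighbor policy over demonstrated states):

```python
import numpy as np

# Hypothetical demonstrations: 2-D states paired with expert actions.
demo_states = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
demo_actions = np.array([0, 1, 1])  # e.g. 0 = "straight", 1 = "turn"

def cloned_policy(state):
    """Behavioral cloning via 1-nearest-neighbor: copy the expert's
    action at the closest demonstrated state."""
    dists = np.linalg.norm(demo_states - state, axis=1)
    return demo_actions[np.argmin(dists)]

print(cloned_policy(np.array([1.9, 0.1])))  # closest demo state is [2, 0] -> 1
```

Because the policy only copies labels, small deviations from the demonstrated states compound, which is the generalization failure noted above.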
7 Related work - Broad Overview Learning from demonstrations Reward shaping towards demonstrations - Brys et al. 2015, Vasan et al. Give an auxiliary reward for mimicking the expert. Does not generalize well and requires defining distance metrics. (Figure: mimicking the expert's action at the closest point of the demonstration.)
8 Related work Learning from demonstrations Inverse Reinforcement Learning - Abbeel et al. 2004, Ziebart et al. Finds the reward the expert is maximizing. Generalizes well to unseen situations. This is the topic of interest.
9 Related work Why Inverse Reinforcement Learning? Finding the intent: useful for reasoning about the expert's decisions. Prediction: plan ahead of time. Collaboration: assist humans in completing a task.
10 IRL Motivation Motivating example
11 IRL Motivation An autonomous agent practicing IRL Recognize intent Take actions completely different from the expert to serve the intent IRL practitioner Warneken & Tomasello 2006
12 IRL Motivation - Collaboration An autonomous agent practicing IRL Recognize intent Take actions completely different from the expert to serve the intent Warneken & Tomasello 2006
13 Outline Preliminaries
14 Preliminary - Reinforcement Learning (RL) Agent interaction modeling in Reinforcement Learning
15 Preliminary - Reinforcement Learning (RL) Agent interaction modeling in Reinforcement Learning Objective of RL:
16 Preliminary - RL Reinforcement Learning Given Environment Set of actions to choose from Rewards Finds The optimal behavior to maximize cumulative reward
18 Preliminary - RL VS IRL Inverse Reinforcement Learning Given Environment Set of actions to choose from Expert demonstrations Finds The best reward function that explains the expert demonstrations
19 Preliminary - RL We will introduce Linear Reward Setting Feature expectation Graphical interpretation of RL Required for graphical interpretation of IRL
20 Preliminary - RL Linear Rewards Linear only in weights: Can be complex and nonlinear in states Using non-linear features
21 Linear reward - simple example Grid world Each color is a region
22 Linear reward - simple example Grid world Each color is a region Reward function Red = +5 Yellow = -1
23 Linear reward - simple example Grid world Each color is a region Reward function Red = +5 Yellow = -1 Each dimension in the feature vector is an indicator if we are in that region
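The indicator-feature construction on the grid world above can be sketched directly; the region layout and weights here are illustrative:

```python
import numpy as np

REGIONS = ["red", "yellow", "white"]      # one feature dimension per region
weights = np.array([5.0, -1.0, 0.0])      # reward weights: red = +5, yellow = -1

def features(region):
    """Indicator feature vector: 1 in the dimension of the current region."""
    phi = np.zeros(len(REGIONS))
    phi[REGIONS.index(region)] = 1.0
    return phi

def reward(region):
    # Linear in the weights, even though the features themselves
    # can be arbitrarily nonlinear in the underlying state.
    return float(weights @ features(region))

print(reward("red"), reward("yellow"))  # 5.0 -1.0
```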
24 RL - Linear Setting RL Objective:
26 RL - Linear Setting Feature expectation of any behavior is a vector in n-dimensional space
28 RL - Linear Setting RL Geometrically, Objective:
29 RL - Linear Setting Geometrically, the RL objective is to minimize the angle ψ between the feature expectation and the reward weight vector.
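The feature expectation of a behavior is the expected discounted sum of feature vectors along its trajectories, and the cumulative reward is then just the inner product of the weights with this vector. A sketch with hypothetical trajectories:

```python
import numpy as np

def feature_expectation(trajectories, gamma=0.9):
    """Empirical feature expectation: average discounted feature sum
    over sampled trajectories (each a list of feature vectors)."""
    mu = np.zeros_like(trajectories[0][0], dtype=float)
    for traj in trajectories:
        for t, phi in enumerate(traj):
            mu += (gamma ** t) * np.asarray(phi, dtype=float)
    return mu / len(trajectories)

# Two toy trajectories over 2-D indicator features.
trajs = [[[1, 0], [0, 1]], [[1, 0], [1, 0]]]
mu = feature_expectation(trajs)
w = np.array([5.0, -1.0])
print(mu, w @ mu)  # cumulative reward is linear in mu: w . mu
```

Since the return is w · μ, maximizing it over behaviors amounts to picking the feature expectation best aligned with w, which is the geometric picture above.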
30 IRL - Overview IRL algorithms
31 Outline - Background Maximum Entropy Inverse Reinforcement Learning
32 MaxEnt IRL Maximum Entropy IRL (Ziebart 2010)
35 MaxEnt IRL Maximum Entropy IRL (Ziebart 2010) Objective function given demonstrations
36 IRL - Linear setting Gradient ascent on likelihood
37 IRL - Linear setting MaxEnt IRL Algorithm
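In the linear setting, the MaxEnt gradient is the difference between the expert's empirical feature expectation and the feature expectation of the soft-optimal policy under the current weights, so the outer loop is plain gradient ascent. A sketch of that loop, with the policy's feature expectation stubbed out as a fixed placeholder (computing it exactly requires soft value iteration, omitted here):

```python
import numpy as np

def maxent_irl_step(w, mu_expert, mu_policy, lr=0.1):
    """One gradient-ascent step on the MaxEnt log-likelihood:
    grad = mu_expert - E_{pi_w}[mu]."""
    grad = mu_expert - mu_policy
    return w + lr * grad

# Hypothetical feature expectations (in the real algorithm, mu_policy
# is recomputed from the soft-optimal policy under the current w).
mu_expert = np.array([1.45, 0.45])
mu_policy = np.array([1.00, 0.80])
w = np.zeros(2)
for _ in range(3):
    w = maxent_irl_step(w, mu_expert, mu_policy)
print(w)  # weights move toward features the expert visits more often
```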
38 IRL - Linear setting
45 IRL - Linear setting Problems with MaxEnt: interprets variance in demonstrations as suboptimality. Consider these demonstrations.
49 IRL - Linear setting Why should we not learn the mean behavior Wrong prediction Agent now predicts the mean behavior Learn the unintended behavior Might learn unsafe behavior (think in case of driving) Wrong intent learned. Cannot collaborate Not practical to get consistent demonstrations
50 Behavior Clustering IRL Parametric: fixed set of clusters/behaviors. Soft clustering learns the probability that a given demonstration belongs to a class, and learns the reward parameters of each behavior. Non-parametric: in addition, learns the number of clusters.
51 Behavior Clustering IRL Expectation Maximization. Missing data: distribution over behaviors. Given data: demonstrations. The expected complete-data objective is easier to optimize than the original likelihood.
52 Behavior Clustering IRL The new objective function Previous Objective function For a single behavior: For multiple behaviors: where,
53 Behavior Clustering IRL The new objective function Update reward functions using where, is the probability that demonstration comes from behavior Update in vanilla MaxEnt IRL:
54 Behavior Clustering IRL The new objective function Update reward functions using Update the priors using where, is the probability that demonstration comes from behavior Update in vanilla MaxEnt IRL:
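The EM updates above can be sketched with numpy: the E-step computes the responsibility of each behavior for each demonstration from the current priors and per-behavior trajectory likelihoods, and the M-step updates the priors (each behavior's reward weights are then updated by a responsibility-weighted MaxEnt step, omitted here). The likelihood values are hypothetical placeholders:

```python
import numpy as np

def e_step(priors, likelihoods):
    """Responsibilities z[i, k] = P(behavior k | demonstration i),
    proportional to prior_k * P(tau_i | w_k)."""
    unnorm = priors[None, :] * likelihoods      # shape (num_demos, num_behaviors)
    return unnorm / unnorm.sum(axis=1, keepdims=True)

def m_step_priors(responsibilities):
    """New prior of each behavior: its average responsibility mass."""
    return responsibilities.mean(axis=0)

# Hypothetical P(tau_i | w_k) for 3 demonstrations and 2 behaviors.
lik = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]])
priors = np.array([0.5, 0.5])
z = e_step(priors, lik)
print(m_step_priors(z))  # prior shifts toward the behavior explaining more demos
```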
55 Non-parametric Behavior Clustering IRL Non-parametric BCIRL Learns the number of clusters We should learn the minimum number of clusters Chinese Restaurant Process (CRP) is used for non-parametric clustering
56 Non-parametric Behavior Clustering IRL Chinese Restaurant Process (CRP) is used for non-parametric clustering Source: Internet (CS224n NLP course, Stanford) Probability of choosing the table:
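Under the CRP, a new customer joins existing table k with probability proportional to the number already seated there, and starts a new table with probability proportional to the concentration parameter α, so there is always non-zero mass on a new cluster. A sketch (α and the counts are illustrative; the counts may be soft assignment mass, as used below):

```python
import numpy as np

def crp_probs(counts, alpha=1.0):
    """CRP seating probabilities:
    P(join table k) = n_k / (n + alpha); P(new table) = alpha / (n + alpha)."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    return np.append(counts, alpha) / (n + alpha)

print(crp_probs([3, 1], alpha=1.0))  # [0.6 0.2 0.2]: last entry = new table
```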
57 Non-parametric Behavior Clustering IRL Non-parametric BCIRL Learns the number of clusters We should learn the minimum number of clusters Chinese Restaurant Process (CRP) For our problem, we count the soft cluster assignment (probability mass)
58 Algorithm - BCIRL
59 Algorithm - BCIRL Always some non-zero probability of creating a new cluster
60 Algorithm - BCIRL Always some non-zero probability of creating a new cluster For every demonstration-cluster combination, compute
61 Algorithm - BCIRL Always some non-zero probability of creating a new cluster Weighted resampling to avoid having unlikely clusters (just like particle filters)
62 Algorithm - BCIRL Always some non-zero probability of creating a new cluster Weighted resampling to avoid having unlikely clusters (just like particle filters) Clustering happens here
63 Algorithm - BCIRL Always some non-zero probability of creating a new cluster Weighted resampling to avoid having unlikely clusters (just like particle filters) Clustering happens here Weighted feature expectations
64 Algorithm - BCIRL Always some non-zero probability of creating a new cluster Weighted resampling to avoid having unlikely clusters (just like particle filters) Clustering happens here Weighted feature expectations We need not solve the complete inverse problem at every iteration!
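The resampling step above works like the one in particle filters: clusters are resampled in proportion to their weights, so unlikely clusters die out while the total count stays bounded. A minimal sketch using systematic resampling (the weights are illustrative):

```python
import numpy as np

def systematic_resample(weights, rng=np.random.default_rng(0)):
    """Return indices resampled in proportion to `weights`
    (particle-filter-style systematic resampling)."""
    weights = np.asarray(weights, dtype=float)
    n = len(weights)
    positions = (rng.random() + np.arange(n)) / n       # evenly spaced probes
    return np.searchsorted(np.cumsum(weights / weights.sum()), positions)

idx = systematic_resample([0.7, 0.2, 0.05, 0.05])
print(idx)  # high-weight clusters are duplicated, low-weight ones are dropped
```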
65 Results On the motivating example (figure: states and actions)
66 Results On the motivating example: learned policy and likelihood of demonstrations (objective) for MaxEnt IRL vs. non-parametric BCIRL
67 Results Highway task. Aggressive demonstrations and evasive demonstrations (figure: the agent's path from START to FINISH among other cars).
68 Results Demonstrations Learned Behaviors
69 Results Demonstrations Likelihood of demonstrations (objective value) at convergence MaxEnt IRL Non-parametric BCIRL
71 Results Gazebo simulator
72 Results Gazebo simulator Aggressive behavior using potential field controller
74 Results Discretize the state space based on the size of the car. Gazebo simulator results: likelihood of demonstrations (objective value) at convergence, MaxEnt IRL vs. non-parametric BCIRL
75 Results Gazebo simulator results Likelihood of demonstrations (objective value) at convergence MaxEnt IRL Non-parametric BCIRL Clustered into - [21, 19, 5, 4, 1]
76 Results Gazebo simulator results Likelihood of demonstrations (objective value) at convergence MaxEnt IRL Non-parametric BCIRL Clustered into - [21, 19, 5, 4, 1] Cluster 1: Evasive. Cluster 2: Aggressive. Clusters 3, 4, and 5: Neither. Able to learn the behaviors even though we cannot get consistent demonstrations.
77 Conclusion Advantages of Behavior Clustering IRL: Can cluster demonstrations and learn a reward function for each behavior. Can predict new samples with high probability. Can be used to separate consistent demonstrations from the rest. Disadvantages: Feature selection is harder, since the features must also explain the differences between behaviors. Does not scale well (true of MaxEnt as well): an IRL problem must be solved for each cluster.
78 Future work Addressing some of the disadvantages. Feature selection: Feature construction for IRL (Levine 2010), Guided cost learning (Finn 2016). Scalability (an issue in MaxEnt as well): Guided policy search (Levine 2013), Path integral and Metropolis-Hastings sampling (Kappen 2009)
79 Outline Part II Approximate Optimal Control with Temporal Logic Tasks
80 Outline Background LTL specifications Reward shaping Policy Gradients Actor critic Method Relation between Reward shaping and Actor critic Heuristic value initialization Results Conclusion
81 Motivation Motivating example: Robot Soccer. Source: IEEE Spectrum
82 Motivation Simpler task: no opponents or teammates. Robot Soccer has a temporally constrained sequence of requirements, for example: Get the ball - T1. Go near the goal - T2. Shoot - T3. This is an LTL specification. (Figure: agent, ball, goal.)
83 Motivation Simpler task No opponents or teammates Define the reward function +1 if Goal is scored (Figure: agent, ball, goal.)
84 Motivation - Why RL alone fails Simpler task No opponents or teammates Define the reward function +1 if Goal is scored. With such a sparse reward it is very hard to explore. (Figure: agent, ball, goal.)
85 Motivation How to use LTL to accelerate learning. An LTL specification is either true when satisfied or false otherwise, so there is no signal towards partial completion. Exploit structure in the actor-critic to motivate the agent towards completion. (Figure: agent, ball, goal.)
86 Preliminaries
87 Reward Shaping Simpler task: no opponents or teammates. Define the reward function: +1 if a goal is scored. We need to satisfy temporally related requirements. Example: Get the ball: R = 0.01 (shaping reward). Score a goal: R = +1 (true reward). (Figure: agent, ball, goal.)
89 Reward Shaping We need to satisfy temporally related requirements. Example: Get the ball: R = 0.01 (shaping reward). Score a goal: R = +1 (true reward). Result (Ng et al. 1999): the agent keeps oscillating near the ball. (Figure: agent, ball, goal.)
90 Reward Shaping and Policy Invariance Before shaping: the optimal policy was to score a goal. After shaping: the optimal policy is to oscillate near the ball. Optimal policy invariance (Ng et al. 1999). (Figure: agent, ball, goal.)
93 Reward Shaping and Policy Invariance Before shaping: the optimal policy was to score a goal. After shaping: the optimal policy is to oscillate near the ball. Optimal policy invariance (Ng et al. 1999). More generally. (Figure: agent, ball, goal.)
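Ng et al. (1999) show that shaping leaves the optimal policy unchanged exactly when the shaping reward is potential-based, F(s, a, s') = γΦ(s') - Φ(s); the naive "+0.01 near the ball" bonus is not of this form, which is why it produces the oscillating policy. The telescoping property behind the invariance can be checked directly (the state potentials here are illustrative):

```python
# Potential-based shaping: F(s, a, s') = gamma * Phi(s') - Phi(s).
# Along any trajectory the shaped bonus telescopes, so it cannot change
# which policy is optimal (Ng et al. 1999).
gamma = 0.9
phi = {"start": 0.0, "near_ball": 1.0, "goal": 5.0}  # illustrative potentials

def shaping(s, s_next):
    return gamma * phi[s_next] - phi[s]

traj = ["start", "near_ball", "goal"]
total_bonus = sum((gamma ** t) * shaping(s, s2)
                  for t, (s, s2) in enumerate(zip(traj, traj[1:])))
# Telescoped closed form: gamma^T * Phi(s_T) - Phi(s_0), policy-independent
# up to the fixed starting state.
closed_form = (gamma ** (len(traj) - 1)) * phi[traj[-1]] - phi[traj[0]]
print(abs(total_bonus - closed_form) < 1e-12)  # True
```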
94 Preliminaries - Policy Gradients Objective of RL
95 Preliminaries - Policy Gradients Objective of RL By parametrizing the policy
96 Preliminaries - Policy Gradients Objective of RL By parametrizing the policy Utility of the parameter Objective:
97 Preliminaries - Policy Gradients Gradient of the utility from samples
99 Preliminaries - Policy Gradients Gradient of the utility from samples Policy gradient
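The sample-based gradient of the utility is the REINFORCE estimator: average grad-log-probability of the chosen actions, weighted by return. A minimal sketch on a hypothetical two-armed bandit with a softmax policy:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)                   # one logit per arm (the policy parameters)
true_rewards = np.array([0.2, 1.0])   # illustrative: arm 1 pays more

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for _ in range(500):
    p = softmax(theta)
    a = rng.choice(2, p=p)
    r = true_rewards[a] + 0.1 * rng.standard_normal()
    # grad log pi(a) for a softmax policy: one_hot(a) - p
    grad_log_pi = -p
    grad_log_pi[a] += 1.0
    theta += 0.1 * r * grad_log_pi    # ascend the sampled utility gradient

print(softmax(theta))  # probability mass shifts toward the better arm
```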
100 Background - Actor Critic Policy gradients
101 Background - Actor Critic Policy gradients Reward shaping
102 Background - Actor Critic Policy gradients Reward shaping Actor Critic
105 Background - Actor Critic Actor Critic Actor (policy) update Use the empirical estimate of the gradient Critic (value) update Use any supervised learning to learn the targets Critics are shaping functions
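One-step actor-critic combines the two updates above: the critic's TD error evaluates each move and serves as the learned shaping signal for the actor. A tabular sketch on an illustrative two-state chain:

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions = 2, 2
V = np.zeros(n_states)                    # critic (state values)
theta = np.zeros((n_states, n_actions))   # actor (policy logits)
gamma, alpha_v, alpha_pi = 0.9, 0.2, 0.1

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def step(s, a):
    """Toy MDP: action 1 moves to state 1; reward +1 only in state 1."""
    s2 = 1 if a == 1 else 0
    return s2, float(s2 == 1)

s = 0
for _ in range(300):
    p = softmax(theta[s])
    a = rng.choice(n_actions, p=p)
    s2, r = step(s, a)
    td_error = r + gamma * V[s2] - V[s]   # critic's evaluation of the move
    V[s] += alpha_v * td_error            # critic update toward the TD target
    grad = -p
    grad[a] += 1.0                        # grad log pi(a|s) for softmax
    theta[s] += alpha_pi * td_error * grad  # actor update: policy gradient
    s = s2

print(softmax(theta[0]))  # action 1 (toward the reward) dominates in state 0
```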
107 Method - Accelerating Actor Critic using LTL Given a specification: +10 for satisfying the specification. (Gridworld: agent, regions R1-R3, obstacle O.)
108 Method - Accelerating Actor Critic using LTL Given a specification, break it down into several reach-avoid tasks for critic initialization. Task 1: Task 2:
109 Method - Accelerating Actor Critic using LTL Break down into several reach-avoid tasks. Task 1: Task 2: Automaton of the original specification
110 Method - Accelerating Actor Critic using LTL Break down into several reach-avoid tasks. Task 1: Task 2: Heuristic value initialization
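The heuristic initialization can be sketched as seeding the critic of each reach-avoid subtask with a simple distance-based estimate: states closer to the subtask's goal start with higher value, obstacle states start low. The grid layout and constants below are illustrative, not the thesis's exact scheme:

```python
import numpy as np

def heuristic_value_init(grid, goal, obstacle_penalty=-5.0, goal_value=10.0):
    """Initialize a critic for a reach-avoid task: value decays with
    Manhattan distance to the goal; obstacle cells get a penalty."""
    h, w = grid.shape
    V = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            if grid[i, j] == 1:                       # obstacle cell
                V[i, j] = obstacle_penalty
            else:
                dist = abs(i - goal[0]) + abs(j - goal[1])
                V[i, j] = goal_value * (0.9 ** dist)  # discounted reach estimate
    return V

grid = np.zeros((3, 3), dtype=int)
grid[1, 1] = 1                                        # obstacle in the middle
V0 = heuristic_value_init(grid, goal=(2, 2))
print(V0)
```

Starting the actor-critic from such a V0 gives the agent a gradient toward the subgoal before any reward is ever observed, which is what accelerates exploration here.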
111 Method - Accelerating Actor Critic using LTL Heuristic value initialization for Task 2. Reward: +10 if the specification is satisfied, -5 for running into obstacles. (Gridworld: agent, regions R1-R3.)
112 Results Learned values. Reward: +10 if the specification is satisfied, -5 for running into obstacles. (Gridworld: agent, regions R1-R3.)
113 Results Actor critic with and without heuristic initialization. Reward: +10 if the specification is satisfied, -5 for running into obstacles.
114 Conclusion and Discussions Summary: IRL with automated behavior clustering; improve feature selection and scalability. Accelerating actor-critic with temporal logic constraints; automate the decomposition procedure for LTL specifications for scalable systems. Possible directions: use LTL specifications to accelerate BCIRL; applications to general domains such as big data and urban planning.
More informationSpeech Emotion Recognition Using Support Vector Machine
Speech Emotion Recognition Using Support Vector Machine Yixiong Pan, Peipei Shen and Liping Shen Department of Computer Technology Shanghai JiaoTong University, Shanghai, China panyixiong@sjtu.edu.cn,
More informationClass-Discriminative Weighted Distortion Measure for VQ-Based Speaker Identification
Class-Discriminative Weighted Distortion Measure for VQ-Based Speaker Identification Tomi Kinnunen and Ismo Kärkkäinen University of Joensuu, Department of Computer Science, P.O. Box 111, 80101 JOENSUU,
More informationINPE São José dos Campos
INPE-5479 PRE/1778 MONLINEAR ASPECTS OF DATA INTEGRATION FOR LAND COVER CLASSIFICATION IN A NEDRAL NETWORK ENVIRONNENT Maria Suelena S. Barros Valter Rodrigues INPE São José dos Campos 1993 SECRETARIA
More informationGo fishing! Responsibility judgments when cooperation breaks down
Go fishing! Responsibility judgments when cooperation breaks down Kelsey Allen (krallen@mit.edu), Julian Jara-Ettinger (jjara@mit.edu), Tobias Gerstenberg (tger@mit.edu), Max Kleiman-Weiner (maxkw@mit.edu)
More informationAssignment 1: Predicting Amazon Review Ratings
Assignment 1: Predicting Amazon Review Ratings 1 Dataset Analysis Richard Park r2park@acsmail.ucsd.edu February 23, 2015 The dataset selected for this assignment comes from the set of Amazon reviews for
More informationSeminar - Organic Computing
Seminar - Organic Computing Self-Organisation of OC-Systems Markus Franke 25.01.2006 Typeset by FoilTEX Timetable 1. Overview 2. Characteristics of SO-Systems 3. Concern with Nature 4. Design-Concepts
More informationSemi-Supervised Face Detection
Semi-Supervised Face Detection Nicu Sebe, Ira Cohen 2, Thomas S. Huang 3, Theo Gevers Faculty of Science, University of Amsterdam, The Netherlands 2 HP Research Labs, USA 3 Beckman Institute, University
More informationProbability and Game Theory Course Syllabus
Probability and Game Theory Course Syllabus DATE ACTIVITY CONCEPT Sunday Learn names; introduction to course, introduce the Battle of the Bismarck Sea as a 2-person zero-sum game. Monday Day 1 Pre-test
More informationHigh-level Reinforcement Learning in Strategy Games
High-level Reinforcement Learning in Strategy Games Christopher Amato Department of Computer Science University of Massachusetts Amherst, MA 01003 USA camato@cs.umass.edu Guy Shani Department of Computer
More informationECE-492 SENIOR ADVANCED DESIGN PROJECT
ECE-492 SENIOR ADVANCED DESIGN PROJECT Meeting #3 1 ECE-492 Meeting#3 Q1: Who is not on a team? Q2: Which students/teams still did not select a topic? 2 ENGINEERING DESIGN You have studied a great deal
More informationDesign Of An Automatic Speaker Recognition System Using MFCC, Vector Quantization And LBG Algorithm
Design Of An Automatic Speaker Recognition System Using MFCC, Vector Quantization And LBG Algorithm Prof. Ch.Srinivasa Kumar Prof. and Head of department. Electronics and communication Nalanda Institute
More informationA Case-Based Approach To Imitation Learning in Robotic Agents
A Case-Based Approach To Imitation Learning in Robotic Agents Tesca Fitzgerald, Ashok Goel School of Interactive Computing Georgia Institute of Technology, Atlanta, GA 30332, USA {tesca.fitzgerald,goel}@cc.gatech.edu
More informationEGRHS Course Fair. Science & Math AP & IB Courses
EGRHS Course Fair Science & Math AP & IB Courses Science Courses: AP Physics IB Physics SL IB Physics HL AP Biology IB Biology HL AP Physics Course Description Course Description AP Physics C (Mechanics)
More informationToward Probabilistic Natural Logic for Syllogistic Reasoning
Toward Probabilistic Natural Logic for Syllogistic Reasoning Fangzhou Zhai, Jakub Szymanik and Ivan Titov Institute for Logic, Language and Computation, University of Amsterdam Abstract Natural language
More informationRule Learning With Negation: Issues Regarding Effectiveness
Rule Learning With Negation: Issues Regarding Effectiveness S. Chua, F. Coenen, G. Malcolm University of Liverpool Department of Computer Science, Ashton Building, Ashton Street, L69 3BX Liverpool, United
More informationSystem Implementation for SemEval-2017 Task 4 Subtask A Based on Interpolated Deep Neural Networks
System Implementation for SemEval-2017 Task 4 Subtask A Based on Interpolated Deep Neural Networks 1 Tzu-Hsuan Yang, 2 Tzu-Hsuan Tseng, and 3 Chia-Ping Chen Department of Computer Science and Engineering
More informationUsing focal point learning to improve human machine tacit coordination
DOI 10.1007/s10458-010-9126-5 Using focal point learning to improve human machine tacit coordination InonZuckerman SaritKraus Jeffrey S. Rosenschein The Author(s) 2010 Abstract We consider an automated
More informationLecture 6: Applications
Lecture 6: Applications Michael L. Littman Rutgers University Department of Computer Science Rutgers Laboratory for Real-Life Reinforcement Learning What is RL? Branch of machine learning concerned with
More informationOn the Formation of Phoneme Categories in DNN Acoustic Models
On the Formation of Phoneme Categories in DNN Acoustic Models Tasha Nagamine Department of Electrical Engineering, Columbia University T. Nagamine Motivation Large performance gap between humans and state-
More informationTruth Inference in Crowdsourcing: Is the Problem Solved?
Truth Inference in Crowdsourcing: Is the Problem Solved? Yudian Zheng, Guoliang Li #, Yuanbing Li #, Caihua Shan, Reynold Cheng # Department of Computer Science, Tsinghua University Department of Computer
More informationLesson plan for Maze Game 1: Using vector representations to move through a maze Time for activity: homework for 20 minutes
Lesson plan for Maze Game 1: Using vector representations to move through a maze Time for activity: homework for 20 minutes Learning Goals: Students will be able to: Maneuver through the maze controlling
More informationDublin City Schools Mathematics Graded Course of Study GRADE 4
I. Content Standard: Number, Number Sense and Operations Standard Students demonstrate number sense, including an understanding of number systems and reasonable estimates using paper and pencil, technology-supported
More informationAnswer Key For The California Mathematics Standards Grade 1
Introduction: Summary of Goals GRADE ONE By the end of grade one, students learn to understand and use the concept of ones and tens in the place value number system. Students add and subtract small numbers
More informationCausal Link Semantics for Narrative Planning Using Numeric Fluents
Proceedings, The Thirteenth AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment (AIIDE-17) Causal Link Semantics for Narrative Planning Using Numeric Fluents Rachelyn Farrell,
More informationAUTOMATIC DETECTION OF PROLONGED FRICATIVE PHONEMES WITH THE HIDDEN MARKOV MODELS APPROACH 1. INTRODUCTION
JOURNAL OF MEDICAL INFORMATICS & TECHNOLOGIES Vol. 11/2007, ISSN 1642-6037 Marek WIŚNIEWSKI *, Wiesława KUNISZYK-JÓŹKOWIAK *, Elżbieta SMOŁKA *, Waldemar SUSZYŃSKI * HMM, recognition, speech, disorders
More informationRover Races Grades: 3-5 Prep Time: ~45 Minutes Lesson Time: ~105 minutes
Rover Races Grades: 3-5 Prep Time: ~45 Minutes Lesson Time: ~105 minutes WHAT STUDENTS DO: Establishing Communication Procedures Following Curiosity on Mars often means roving to places with interesting
More informationMachine Learning from Garden Path Sentences: The Application of Computational Linguistics
Machine Learning from Garden Path Sentences: The Application of Computational Linguistics http://dx.doi.org/10.3991/ijet.v9i6.4109 J.L. Du 1, P.F. Yu 1 and M.L. Li 2 1 Guangdong University of Foreign Studies,
More informationRegret-based Reward Elicitation for Markov Decision Processes
444 REGAN & BOUTILIER UAI 2009 Regret-based Reward Elicitation for Markov Decision Processes Kevin Regan Department of Computer Science University of Toronto Toronto, ON, CANADA kmregan@cs.toronto.edu
More informationLearning Methods in Multilingual Speech Recognition
Learning Methods in Multilingual Speech Recognition Hui Lin Department of Electrical Engineering University of Washington Seattle, WA 98125 linhui@u.washington.edu Li Deng, Jasha Droppo, Dong Yu, and Alex
More informationVisual CP Representation of Knowledge
Visual CP Representation of Knowledge Heather D. Pfeiffer and Roger T. Hartley Department of Computer Science New Mexico State University Las Cruces, NM 88003-8001, USA email: hdp@cs.nmsu.edu and rth@cs.nmsu.edu
More informationSARDNET: A Self-Organizing Feature Map for Sequences
SARDNET: A Self-Organizing Feature Map for Sequences Daniel L. James and Risto Miikkulainen Department of Computer Sciences The University of Texas at Austin Austin, TX 78712 dljames,risto~cs.utexas.edu
More informationFUZZY EXPERT. Dr. Kasim M. Al-Aubidy. Philadelphia University. Computer Eng. Dept February 2002 University of Damascus-Syria
FUZZY EXPERT SYSTEMS 16-18 18 February 2002 University of Damascus-Syria Dr. Kasim M. Al-Aubidy Computer Eng. Dept. Philadelphia University What is Expert Systems? ES are computer programs that emulate
More informationAdaptive Generation in Dialogue Systems Using Dynamic User Modeling
Adaptive Generation in Dialogue Systems Using Dynamic User Modeling Srinivasan Janarthanam Heriot-Watt University Oliver Lemon Heriot-Watt University We address the problem of dynamically modeling and
More informationGenevieve L. Hartman, Ph.D.
Curriculum Development and the Teaching-Learning Process: The Development of Mathematical Thinking for all children Genevieve L. Hartman, Ph.D. Topics for today Part 1: Background and rationale Current
More information