CS 380: ARTIFICIAL INTELLIGENCE REINFORCEMENT LEARNING. Santiago Ontañón

2 Machine Learning
Computational methods for computers to exhibit specific forms of learning. For example:
- Learning from Examples:
  - Supervised learning
  - Unsupervised learning
- Reinforcement Learning
- Learning from Observation (demonstration/imitation)

3 Examples Reinforcement Learning: learning to walk

4 Examples Reinforcement Learning:

5 Reinforcement Learning
How can an agent learn to take actions in an environment to maximize some notion of reward?
[Diagram: the Agent sends Actions to the Environment; the Environment returns State and Reward to the Agent]
Assumption: the environment is unknown and may be stochastic.

6 Basic Concepts
- State (S): The configuration of the environment, as perceived by the agent.
- Actions (A): The set of different actions the agent can perform. We will assume it is discrete (but this does not need to be so for other RL algorithms).
- Reward (R): Each time the agent performs an action, it observes a reward, which is a real value.

8 Policies and Plans
- Plan: A sequence of actions generated to achieve a certain goal from a given starting state.
- Policy: A mapping of states to actions, i.e., a function that defines which action to perform in every possible state.
For RL, given an initial state, does the agent need to learn a plan or a policy? (The environment is stochastic.)
A policy, since plans assume deterministic execution.

10 Policies
RL algorithms learn policies.
How do we represent a policy? Example: as a table (if it's a deterministic policy):
    State   Action
    s_0     right
    s_1     right
    s_2     up
    ...     ...
    s_n     left
A stochastic policy would specify the probability of each action in each state.
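A minimal sketch of both policy representations, assuming states and actions are plain strings. The state names and the probabilities in the stochastic policy are made up for illustration; they are not from the slides.

```python
import random

# Deterministic policy: a table (dict) mapping each state to one action.
# Only the first few states of the slide's example are listed.
deterministic_policy = {"s0": "right", "s1": "right", "s2": "up"}

# Stochastic policy: each state maps to a probability for every action.
# These probabilities are illustrative assumptions.
stochastic_policy = {
    "s0": {"right": 0.8, "up": 0.1, "left": 0.1},
    "s1": {"right": 0.6, "up": 0.2, "left": 0.2},
    "s2": {"right": 0.1, "up": 0.7, "left": 0.2},
}

def act_deterministic(policy, state):
    """Look up the single action the policy prescribes for `state`."""
    return policy[state]

def act_stochastic(policy, state, rng=random):
    """Sample an action according to the policy's probabilities for `state`."""
    actions = list(policy[state])
    weights = [policy[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]
```

Calling `act_deterministic(deterministic_policy, "s2")` always returns the same action, while repeated calls to `act_stochastic(stochastic_policy, "s2")` can return different actions.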

11 Value Function
Imagine we have a policy P. The value of a state S using policy P is the expected total reward we would get if we execute P starting from S:
    $$V^{P}(S) = E\left[\sum_{t=0}^{\infty} R(S_t, P(S_t)) \,\middle|\, S_0 = S\right]$$
Since that sum might be infinite, we assume a discount factor $\gamma$ (a number between 0 and 1 that discounts future rewards):
    $$V^{P}(S) = E\left[\sum_{t=0}^{\infty} \gamma^{t} R(S_t, P(S_t)) \,\middle|\, S_0 = S\right]$$
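To see why discounting keeps the sum finite, here is a small sketch that computes the discounted sum for one fixed reward sequence. The function name and the constant reward of 1 are assumptions for illustration. The value function is the expectation of this quantity over the trajectories generated by the policy; the code below only handles a single trajectory.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * rewards[t] over a finite reward sequence."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total

# A reward of 1 at every step: without discounting the sum grows with the
# horizon; with gamma < 1 it approaches 1 / (1 - gamma).
rewards = [1.0] * 1000
undiscounted = discounted_return(rewards, gamma=1.0)
discounted = discounted_return(rewards, gamma=0.9)
```

With `gamma=0.9` the result stays close to 1 / (1 - 0.9) = 10 no matter how long the reward sequence is.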

12 State-Action Value Function (Q value)
Imagine we have a policy P. The Q value of a state S and an action A using policy P is the expected reward we would get if, starting from S, we first execute A and then follow policy P:
    $$Q^{P}(S, A) = E\left[\sum_{t=0}^{\infty} \gamma^{t} R(S_t, A_t) \,\middle|\, S_0 = S,\ A_0 = A\right]$$
where $\gamma$ is the discount factor and $A_t = P(S_t)$ for $t \geq 1$.
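The two definitions are closely related. As a short worked note, writing $\gamma$ for the discount factor and $S'$ for the (possibly random) state reached after executing A in S:

```latex
V^{P}(S) = Q^{P}\bigl(S, P(S)\bigr)
\qquad
Q^{P}(S, A) = E\bigl[\, R(S, A) + \gamma\, V^{P}(S') \,\bigr]
```

The first identity says that following P from the start is the same as taking P's action first and then following P. The second splits the discounted sum into its first reward and the discounted value of the next state.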

14 Q table
A Q table is a matrix with one row per state and one column per action, holding the Q value of each (state, action) pair.
[Q table: rows s_0, s_1, s_2, ..., s_n; columns right, up, ...]
A Q table defines a deterministic policy: take the action with the maximum Q value in each state.
    State   Action
    s_0     right
    s_1     right
    s_2     up
    ...     ...
    s_n     left
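A minimal sketch of extracting the deterministic policy from a Q table, assuming the table is stored as a dict of dicts. The Q values are hypothetical, since the numbers from the slide are not available.

```python
# Hypothetical Q table: one row (dict) per state, one entry per action.
q_table = {
    "s0": {"right": 0.5, "up": 0.2, "left": 0.1},
    "s1": {"right": 0.9, "up": 0.3, "left": 0.0},
    "s2": {"right": 0.1, "up": 0.7, "left": 0.4},
}

def greedy_policy(q):
    """Return the deterministic policy that picks the max-Q action per state."""
    return {state: max(actions, key=actions.get) for state, actions in q.items()}

policy = greedy_policy(q_table)
```

With these made-up values, `policy` maps s0 and s1 to "right" and s2 to "up".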

15 Q learning
- Basic reinforcement learning algorithm
- Learns the Q table
- Starts with an initial (e.g., all zeroes) Q table
- Updates the Q table iteratively over time using Bellman's Equations

17 Bellman Equations
Imagine that:
- We have a current estimate of the Q table
- An agent is in state S
- It performs action A (which takes it to state S')
- And observes reward R
How do we update the Q table with the new piece of information?
    $$Q_{new}(S, A) = (1 - \alpha)\, Q(S, A) + \alpha \left[ R + \gamma \max_{A'} Q(S', A') \right]$$
where $Q_{new}(S, A)$ is the new Q value estimate, $Q(S, A)$ is the previous Q value estimate, $\alpha$ is the learning rate, and $\gamma$ is the discount factor.
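A worked instance of a single update, as a sketch. It assumes the Q table is a dict of dicts and writes the learning rate as `alpha` and the discount factor as `gamma`; the states, actions, and numbers are made up.

```python
def q_update(q, s, a, r, s_next, alpha, gamma):
    """Apply one Q-learning update to q[s][a] in place and return the new value."""
    best_next = max(q[s_next].values())          # max over A' of Q(S', A')
    q[s][a] = (1 - alpha) * q[s][a] + alpha * (r + gamma * best_next)
    return q[s][a]

q = {
    "s0": {"left": 0.0, "right": 0.0},
    "s1": {"left": 0.0, "right": 2.0},
}
# Old estimate 0, reward 1, best next value 2:
# (1 - 0.5) * 0 + 0.5 * (1 + 0.9 * 2) = 1.4 (up to floating-point rounding)
new_value = q_update(q, "s0", "right", r=1.0, s_next="s1", alpha=0.5, gamma=0.9)
```

A learning rate of 0.5 moves the old estimate halfway toward the new target `r + gamma * best_next`.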

19 Q Learning
1. Initialize the Q table to some uniform value (e.g., all zeroes)
2. S = initial state
3. A = choose action based on the Q table and current state S
4. Execute action A:
   1. S' = new state after executing A
   2. R = observed reward
5. Update the Q table:
       $$Q_{new}(S, A) = (1 - \alpha)\, Q(S, A) + \alpha \left[ R + \gamma \max_{A'} Q(S', A') \right]$$
6. Set S = S' and go to 3
How do we choose an action?
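The loop can be sketched end to end on a toy problem. Everything about the environment is an assumption for illustration: a five-state corridor where reaching the rightmost state gives reward 1 and ends the episode. The action choice uses ε-greedy with random tie-breaking, one common option and not the only one.

```python
import random

N_STATES = 5                      # states 0..4; state 4 is the goal (terminal)
ACTIONS = ["left", "right"]

def step(state, action):
    """Hypothetical corridor environment: returns (next_state, reward, done)."""
    nxt = max(0, state - 1) if action == "left" else min(N_STATES - 1, state + 1)
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done

def choose_action(q, state, epsilon, rng):
    """Epsilon-greedy choice; ties between equal Q values are broken at random."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(q[state].values())
    return rng.choice([a for a in ACTIONS if q[state][a] == best])

def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = {s: {a: 0.0 for a in ACTIONS} for s in range(N_STATES)}   # step 1
    for _ in range(episodes):
        s = 0                                                     # step 2
        done = False
        while not done:
            a = choose_action(q, s, epsilon, rng)                 # step 3
            s_next, r, done = step(s, a)                          # step 4
            # Step 5. The terminal state's Q values are never updated, so they
            # stay at 0 and the target reduces to r when the episode ends.
            target = r + gamma * max(q[s_next].values())
            q[s][a] = (1 - alpha) * q[s][a] + alpha * target
            s = s_next        # S' becomes the current state, then repeat
    return q

q = q_learning()
greedy = {s: max(q[s], key=q[s].get) for s in range(N_STATES - 1)}
```

After training, `greedy` holds the max-Q action for each non-terminal state. In this corridor the learned policy should head right, toward the goal.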

20 Exploration vs Exploitation
During learning, the agent is in a given state S, and has to choose an action using the Q table.
[Q table: rows s_0, ..., s_n; columns right, left, forward; the row of the current state is highlighted, along with the action that maximizes the Q value in that row]

22 Exploration vs Exploitation
During learning, instead of always choosing the action that maximizes the Q value, we use a policy that balances exploration and exploitation. Remember the tree policy in MCTS? This is the same thing!
For example, ε-greedy:
- ε = 0.1 (or some small value between 0 and 1)
- With probability ε, choose an action at random
- With probability (1 - ε), choose the action with maximum Q value
Why? The action currently believed to be the best might appear best only by coincidence. So, we need to keep exploring in case other actions turn out to be better.
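A small sketch of how often each action is selected under ε-greedy. It assumes the random choice is uniform over all actions (including the currently best one) and that the best action is unique. The Q values are made up.

```python
def epsilon_greedy_probabilities(q_values, epsilon):
    """Probability of selecting each action under epsilon-greedy.

    Every action gets epsilon / n from the random branch; the max-Q action
    additionally gets (1 - epsilon) from the greedy branch.
    """
    n = len(q_values)
    best = max(q_values, key=q_values.get)
    return {a: epsilon / n + ((1 - epsilon) if a == best else 0.0) for a in q_values}

probs = epsilon_greedy_probabilities(
    {"right": 0.2, "left": 0.5, "forward": 0.1}, epsilon=0.1
)
```

With three actions and ε = 0.1, the best action ("left" here) is chosen with probability 0.9 + 0.1/3, and each other action with probability 0.1/3, so every action keeps being tried occasionally.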

23 Q Learning
Example output of Q Learning (Q table): (image borrowed from Hal Daumé's CS421 slides)

24 Problems with Q Learning
No generalization: if two states are very similar, Q learning does not exploit this, and has to learn the Q values for each of them independently.
Many techniques address this:
- Function approximation
- Feature-based state representation
- Deep Q-learning: uses a neural network to represent the Q table (implicit generalization)
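One way function approximation can generalize is sketched below, using a linear model over state features in place of the table. This is an illustrative assumption, not necessarily the variant covered in the course. The feature vectors, `alpha`, and `gamma` are hypothetical. The update moves the weights by `alpha * error`, which is the tabular rule rewritten as Q + α (target - Q).

```python
def q_value(weights, features, action):
    """Linear approximation: Q(s, a) = sum_i w[a][i] * phi_i(s)."""
    return sum(w * f for w, f in zip(weights[action], features))

def approx_q_update(weights, features, action, reward, next_features, alpha, gamma):
    """One gradient-style update of the weights for `action`, in place."""
    best_next = max(q_value(weights, next_features, a) for a in weights)
    error = reward + gamma * best_next - q_value(weights, features, action)
    weights[action] = [w + alpha * error * f for w, f in zip(weights[action], features)]
    return error

weights = {"left": [0.0, 0.0], "right": [0.0, 0.0]}
# Two similar states have similar features, so learning about one changes
# the estimate for the other.
state_a, state_b, next_state = [1.0, 0.0], [0.9, 0.1], [0.0, 1.0]
approx_q_update(weights, state_a, "right", 1.0, next_state, alpha=0.5, gamma=0.9)
q_b = q_value(weights, state_b, "right")
```

After a single update from `state_a`, the estimate for the unvisited but similar `state_b` is already nonzero, which a plain Q table cannot do.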

25 Examples (Again) Reinforcement Learning:
