Reinforcement learning is learning what to do--how to map situations to actions--so as to maximize a numerical reward signal (Sutton & Barto, Reinforcement Learning, 1998). The learner is not told which actions to take, but instead must discover which actions yield the most reward by trying them. Actions may affect not only the immediate reward but also the next situation and all subsequent rewards. These two characteristics--trial-and-error search and delayed reward--are the two most important distinguishing features of reinforcement learning.
We examine how an agent can learn from success and failure, from reward and punishment (Russell & Norvig, Artificial Intelligence: A Modern Approach, 2011). RL is the problem faced by an agent that must learn behavior through trial-and-error interactions with a dynamic environment, without being told how the task is to be achieved.
RL vs traditional AI

Techniques that require a predefined model of state transitions and assume determinism:

Search: generate a satisfactory trajectory through a graph of states; states are represented by atomic symbols.

Planning: generate a satisfactory trajectory through a structure with more complexity than a graph; states are represented by compositions of logical expressions.
RL vs traditional AI

In contrast with techniques that require a predefined model of state transitions and assume determinism, tabular RL assumes only that the state space can be enumerated and stored in memory; no model of the environment is necessary. Its distinguishing features:

interact with the environment
delayed reward
exploration
partially observable states
life-long learning
RL vs traditional AI

delayed reward: from the current state s, the agent must learn the optimal action a that maximizes the target function a = π(s). Training samples should be of the form <s, π(s)>, but the information is not available this way! The agent only observes sequences of states, actions and rewards (s, a, r, s, a, r, ...) and must determine, from the cumulative reward, which actions led to the maximum reward.
RL vs traditional AI

exploration vs. exploitation: the agent influences the distribution of training examples by the action sequences it chooses. It must balance exploration of unknown states and actions (to gain new information) against exploitation of states and actions already learned (to maximize cumulative reward).
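The standard way to balance the two is an epsilon-greedy rule. A minimal sketch in Python (the function name, the dictionary encoding of action values, and the example numbers are illustrative assumptions, not part of the slides):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon pick a random action (exploration),
    otherwise pick the action with the highest estimated value (exploitation).

    q_values: dict mapping each available action to its current value estimate.
    """
    if random.random() < epsilon:
        return random.choice(list(q_values))      # explore: gather new information
    return max(q_values, key=q_values.get)        # exploit: act on what is learned

# With epsilon = 0 the agent always exploits its current estimates.
q = {"left": 0.2, "right": 0.9, "up": 0.5}
print(epsilon_greedy(q, epsilon=0.0))   # -> right
```

With epsilon > 0 the agent occasionally tries actions that currently look suboptimal, which is what lets it discover better ones.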
RL vs traditional AI

partially observable states: the agent's sensors may provide only partial information (for example, a camera on a robot). It may be necessary to combine previous observations with the current sensor data in order to choose actions.

life-long learning: a robot must learn several related tasks within the same environment, with the same sensors and the same possible actions, reusing experience across tasks.
Types of machine learning

Supervised learning: a set of (INPUT, OUTPUT) pairs (x1, y1), (x2, y2), ..., (xn, yn); try to produce a function y = f(x) to apply to future data.

Unsupervised learning: only input points x1, x2, x3, ..., xn; try either to find clusters in those data or to estimate a probability function P(X = x) over the random variable X.

Reinforcement learning: a sequence of states and actions s, a, s, a, s, ... where some of the states have associated rewards r; try to learn an optimal policy π(s) that, for every state s, chooses the optimal action to take.
RL key elements

policy (π): defines the learning agent's way of behaving at a given time. It is a mapping from perceived states of the environment to actions to be taken in those states.

reward function (r): defines the goal of a reinforcement learning problem. It maps each perceived state (or state-action pair) of the environment to a single number, a reward, indicating the intrinsic desirability of that state. The reward function defines what the good and bad events are for the agent.

value function (V): specifies what is good in the long run. The value of a state is the total amount of reward an agent can expect to accumulate over the future, starting from that state. It indicates the long-term desirability of states, taking into account the states (and rewards) that are likely to follow.

model of the environment: something that imitates the behavior of the environment.
RL Agent-Environment interaction

RL deals with how an autonomous agent that senses and acts in its environment can learn to choose optimal actions to achieve its goal, by estimating a value function. This is also known as learning with a critic.
RL Agent-Environment interaction

Goals can be defined by a reward function (r) that assigns numerical values to action-state pairs (a, s). This reward function is known by the critic, which can be external or built-in. The task of the agent is to perform sequences of actions, observe their consequences, and learn a control policy π : S → A that chooses actions maximizing the accumulated reward.
But first... a little bit of history
A little bit of history

Two threads converged into modern RL in the 1980s:

Early AI / psychology of animal learning: learning by trial and error; the law of effect; search and memory.

Optimal control: value functions and dynamic programming; Bellman equations (1950s); Howard (1960).
A little bit of history

Early AI / psychology of animal learning:

1954 Minsky's PhD thesis: computational models of reinforcement learning, the SNARCs (Stochastic Neural-Analog Reinforcement Calculators).

1954 Clark and Farley paper: trial-and-error learning applied to generalization and pattern recognition. Reinforcement learning was conflated with supervised learning: confusion!

1961 Minsky paper: the terms "reinforcement" and "reinforcement learning" were used in the engineering literature for the first time.
A little bit of history

Widrow and Hoff (1960) and Rosenblatt (1962) were motivated by reinforcement learning and used the language of rewards and punishments, but the systems they studied were supervised learning systems. Some neural-network books still use the term "trial-and-error" to describe networks that learn from training examples, because they use error information to update weights. More confusion! This usage misses the essential selectional character of trial-and-error learning.

1968 Michie and Chambers: genuine trial-and-error learning.
A little bit of history

1964 Widrow and Smith: used supervised learning methods, assuming instruction from a teacher ("learning with a teacher").

1973 Widrow, Gupta and Maitra: modified LMS to produce an RL rule that could learn from success and failure signals ("learning with a critic").

1975 John Holland: trial and error in evolutionary methods.

1986 Holland: classifier systems, RL systems including association and value functions, driven by a genetic algorithm.
A little bit of history

1980s: much of the early work was directed towards showing that RL and supervised learning were indeed different (Barto, Sutton, and Brouwer, 1981; Barto and Sutton, 1981; Barto and Anandan, 1985). Further studies showed how RL could address important problems in neural-network learning, in particular how it could produce learning algorithms for multilayer networks (Barto, Anderson, and Sutton, 1982; Barto and Anderson, 1985; Barto and Anandan, 1985; Barto, 1985, 1986; Barto and Jordan, 1987).

1989 Chris Watkins: the Q-learning algorithm.
Typical RL example: TIC TAC TOE

How to construct a player that will find imperfections in its opponent's play and learn to maximize its chances of winning?

A simple problem; however, it cannot be solved in a satisfactory way through classical techniques:

Game theory (minimax): assumes a particular way of playing by the opponent.

Dynamic programming: can compute an optimal solution for any opponent, but requires as input a complete specification of that opponent, including the probabilities with which the opponent makes each move in each board state.

Evolutionary approach / search: directly searches the policy space; entire policies are proposed and compared on the basis of scalar evaluations.
Typical RL example: TIC TAC TOE

Value function: a table of numbers, one for each possible state of the game. Each number is the estimate of the probability of winning from that state. For all states with three Xs in a row the probability of winning is 1, because we have already won. For all states with three Os in a row the correct probability is 0, as we cannot win from them. We set the initial values of all the other states to 0.5 (a 50% chance of winning).

We then play many games against the opponent. To select our moves we examine the states that would result from each of our possible moves and look up their current values in the table. Most of the time we move greedily, selecting the move that leads to the state with the greatest value, i.e. the highest estimated probability of winning. Occasionally, we instead select randomly among the other moves: exploratory moves.
Typical RL example: TIC TAC TOE

[Figure: a sequence of tic-tac-toe moves, with labels marking the best move and an exploratory move.]
Typical RL example: TIC TAC TOE

While playing, we change the values of the states we encounter during the game, attempting to make them more accurate estimates of the probabilities of winning. The current value of the earlier state is adjusted to be closer to the value of the later state. This can be done by moving the earlier state's value a fraction α of the way towards the value of the later state:

V(s) ← V(s) + α [ V(s') − V(s) ]

This update rule is an example of a temporal-difference learning method, so called because its changes are based on a difference between estimates at two different times.
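The temporal-difference rule V(s) ← V(s) + α[V(s') − V(s)] can be sketched in a few lines of Python (the function name, the state labels and the numbers are illustrative assumptions):

```python
def td_update(V, s, s_next, alpha=0.1):
    """Move V(s) a fraction alpha of the way towards V(s_next):
    V(s) <- V(s) + alpha * (V(s_next) - V(s))."""
    V[s] += alpha * (V[s_next] - V[s])

# A winning later state (value 1.0) pulls the earlier state's
# estimate up from its initial 0.5.
V = {"earlier": 0.5, "win": 1.0}
td_update(V, "earlier", "win", alpha=0.1)
print(V["earlier"])   # -> approximately 0.55
```

Repeating this update over many games makes the table entries converge towards the true winning probabilities under the policy being followed.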
RL model

[Figure: the standard reinforcement learning model; the agent acts on the environment and receives states and rewards.]

The environment may be fully or partially observable, deterministic or non-deterministic, stationary or non-stationary.
Agent-Environment interaction: examples

[Figure: example agent-environment pairs.]
Agent-Environment interaction

The agent must find a policy, mapping states to actions, that maximizes some measure of reinforcement/reward.
Agent Environment interaction Usual assumptions: Markov model, discrete states, finite actions, discrete time, stochastic transitions, perfect observations, rationality.
Markov decision processes

Assume:
a finite set of states S
a set of actions A
at each discrete time step the agent observes state st ∈ S and chooses at ∈ A
it then receives immediate reward rt, and the state changes to st+1

Markov assumption: st+1 = δ(st, at) and rt = r(st, at)
- rt and st+1 depend only on the current state and action
- the functions δ and r may be nondeterministic
- the functions δ and r are not necessarily known to the agent
Agent learning task

Execute actions in the environment, observe the results, and learn an action policy π : S → A that maximizes the cumulative reward over time,

E [ rt + γ rt+1 + γ² rt+2 + ... ]

from any starting state in S, where 0 ≤ γ < 1 is the discount factor for future rewards:
γ = 0: only the immediate reward is considered
γ close to 1: future rewards are given greater emphasis relative to the immediate reward

Note: the target function is π : S → A, but we have no training examples of the form (s, a)! Training examples are of the form <(s, a), r>.
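The discounted sum above is easy to check numerically. A minimal sketch (the function name and the reward sequence are assumptions chosen to match the grid example used later in these slides):

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a finite reward sequence."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

rewards = [0, 0, 100]                       # reward 100 arrives two steps late
print(discounted_return(rewards, 0.0))      # gamma = 0: immediate reward only
print(discounted_return(rewards, 0.9))      # gamma = 0.9: 0.9**2 * 100 = 81
print(discounted_return(rewards, 1.0))      # gamma = 1: plain (undiscounted) sum
```

Varying gamma shows how the discount factor trades off immediate against delayed reward.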
Policy Search

What is an optimal policy? How should the agent take the future into account in the decisions it makes?

finite-horizon model: at a given moment in time, the agent optimizes its expected reward for the next h steps: E [ Σ_{t=0..h} rt ]

infinite-horizon discounted model: takes the long-run reward of the agent into account, but future rewards are discounted according to a discount factor γ: E [ Σ_{t=0..∞} γ^t rt ]

average-reward model: the agent takes actions that optimize its long-run average reward: lim_{h→∞} E [ (1/h) Σ_{t=0..h} rt ]
Value function

In a deterministic world, for each possible policy π, an evaluation function over states can be defined:

Vπ(st) ≡ rt + γ rt+1 + γ² rt+2 + ... = Σ_{i=0..∞} γ^i rt+i

where rt, rt+1, ... are generated by following policy π starting at state st. The agent's task is then to learn the optimal policy π*:

π* ≡ argmax_π Vπ(s), ∀s

V*(s) ≡ Vπ*(s) is the maximum reward obtainable starting from s.
Example

States: each position in the grid; each arrow is a possible action with reward r(s, a); G is an absorbing state; γ = 0.9.

Immediate reward values:
r((1,1), right) = 0
r((1,1), up) = 0
r((1,2), right) = 0
r((1,2), left) = 0
r((1,2), up) = 0
r((1,3), up) = 100
Example

A policy specifies exactly one action that the agent will select in any given state. The optimal policy follows the shortest path towards G.
Example

Vπ(s) values for some state-action sequences:

<(1,1), up, right, right>   → 81
<(1,1), right, right, up>   → 81
<(1,2), right, up>          → 90
<(1,3), up>                 → 100

Why 90? 90 = 0 + γ·100 + γ²·0 + γ³·0 + ... = 0.9 × 100
Example

Why 81? 81 = 0 + γ·0 + γ²·100 + ... = 0.9² × 100
Value function

The task of the agent is to learn a policy π : S → A that selects the next action at based on the current observed state st: π(st) = at. How? By learning the policy that maximizes the cumulative reward over time, that is, the policy that maximizes

Vπ(st) ≡ rt + γ rt+1 + γ² rt+2 + ... = Σ_{i=0..∞} γ^i rt+i

where the sequence of rewards is generated by following π:

s0 --(a0 = π(s0)) / r0--> s1 --(a1 = π(s1)) / r1--> s2 --(a2 = π(s2)) / r2--> ...
What to learn?

The agent tries to learn the evaluation function Vπ* (written V*). The agent should prefer state s1 over state s2 whenever V*(s1) > V*(s2), because the cumulative future reward will be greater from s1.

But we have a problem! The agent must choose among ACTIONS, not STATES, and the training examples are of the form <(s, a), r>.
What to learn?

The optimal action in state s is the action that maximizes the sum of the immediate reward r(s, a) plus the value V* of the immediate successor state δ(s, a) = s', discounted by γ:

π*(s) = argmax_a [ r(s, a) + γ V*(δ(s, a)) ]
Value function

How do we find an optimal policy for an infinite-horizon discounted model? (Bellman, 1957)

The optimal value of a state,

V*(s) = max_π E [ Σ_{t=0..∞} γ^t rt ],

can be found as the solution of the Bellman equations:

V*(s) = max_a [ r(s, a) + γ Σ_{s'∈S} P(s, a, s') V*(s') ],  ∀s ∈ S
Value function

The optimal value function can be computed iteratively (value iteration algorithm; Bellman, 1957):

V*(s) ← max_a [ r(s, a) + γ Σ_{s'∈S} P(s, a, s') V*(s') ],  ∀s ∈ S

Given the optimal value function, the optimal policy can be specified as:

π*(s) = argmax_a [ r(s, a) + γ Σ_{s'∈S} P(s, a, s') V*(s') ]
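Value iteration can be sketched directly from these equations. This is a minimal Python sketch under stated assumptions: the dictionary encodings of r and P, the function name, and the two-state chain example at the end are all illustrative inventions, not from the slides:

```python
def value_iteration(states, actions, r, P, gamma=0.9, tol=1e-6):
    """Iterate the Bellman backup
        V(s) <- max_a [ r(s,a) + gamma * sum_s' P(s,a,s') V(s') ]
    until the largest change is below tol, then extract the greedy policy.
    r[(s, a)] is the immediate reward; P[(s, a)] maps s' -> probability."""
    V = {s: 0.0 for s in states}

    def q(s, a):
        return r[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())

    while True:
        delta = 0.0
        for s in states:
            best = max(q(s, a) for a in actions)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    policy = {s: max(actions, key=lambda a: q(s, a)) for s in states}
    return V, policy

# Made-up two-state chain: staying in B pays 1 forever, so
# V(B) = 1/(1 - gamma) = 10 and V(A) = 0 + gamma * V(B) = 9.
states, actions = ["A", "B"], ["stay", "go"]
r = {("A", "stay"): 0, ("A", "go"): 0, ("B", "stay"): 1, ("B", "go"): 0}
P = {("A", "stay"): {"A": 1.0}, ("A", "go"): {"B": 1.0},
     ("B", "stay"): {"B": 1.0}, ("B", "go"): {"A": 1.0}}
V, policy = value_iteration(states, actions, r, P)
print(round(V["A"], 2), round(V["B"], 2), policy)
```

Note that the algorithm needs the full model (r and P) as input; the next slides show what to do when the agent does not have it.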
What to learn?

The optimal action in state s is the action that maximizes the sum of the immediate reward r(s, a) plus the value V* of the immediate successor state, discounted by γ:

π*(s) = argmax_a [ r(s, a) + γ V*(δ(s, a)) ]    with δ(st, at) = st+1

Another problem! The value iteration algorithm works well if the agent knows δ : S × A → S and r : S × A → ℝ. But when it doesn't, it can't choose actions this way!
Q function

Define a new function, very similar to V*:

Q(s, a) ≡ r(s, a) + γ V*(δ(s, a))

If the agent learns Q, it can choose the optimal action even without knowing δ:

π*(s) = argmax_a [ r(s, a) + γ V*(δ(s, a)) ] = argmax_a Q(s, a)

Q is the evaluation function the agent will learn.
Q function

Q is the evaluation function the agent will learn. Important facts about Q-learning:

One can choose a globally optimal sequence of actions by reacting to the local values of Q for the current state.
The agent can choose an optimal action without a lookahead search to explicitly consider what state results from the action.
The value of Q for the current state and action summarizes in a single number all the information needed to determine the discounted cumulative reward.
Q function

Q(s, a) values:
Q((1,1), right) = 81
Q((1,1), up) = 81
Q((1,2), right) = 90
Q((1,2), left) = 72
Q((1,3), up) = 100
Training rule to learn Q

Note that Q and V* are closely related:

V*(s) = max_{a'} Q(s, a')

which allows us to write Q recursively as:

Q(st, at) = r(st, at) + γ V*(δ(st, at)) = r(st, at) + γ max_{a'} Q(st+1, a')     (Watkins, 1989)

The training rule for the estimate Q̂ will therefore be:

Q̂(s, a) ← r + γ max_{a'} Q̂(s', a')

where s' = δ(s, a) is the state resulting from applying action a in state s.
Q learning algorithm

1. For each s, a initialize the table entry Q̂(s, a) ← 0
2. Observe the current state s
3. Do forever:
   a) Select an action a and execute it
   b) Receive immediate reward r
   c) Observe the new state s'
   d) Update the table entry for the Q estimate: Q̂(s, a) ← r + γ max_{a'} Q̂(s', a')
   e) s ← s'

This algorithm converges towards the true Q function if the system can be assumed to be a deterministic Markov decision process and the immediate reward values are bounded (Machine Learning, T. Mitchell, chapter 13).
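Steps 1-3 above translate almost line by line into code. A minimal sketch for the deterministic case (the function signature, the epsilon-greedy action selection, and the 3-state corridor example are assumptions added for illustration; the slides' algorithm leaves the action-selection strategy open):

```python
import random

def q_learning(states, actions, delta, r, terminal, start,
               episodes=500, gamma=0.9, epsilon=0.3):
    """Tabular Q-learning for a deterministic MDP (Watkins, 1989).
    delta(s, a) gives the next state and r(s, a) the immediate reward;
    the agent only observes their outcomes, as in steps (a)-(e)."""
    Q = {(s, a): 0.0 for s in states for a in actions}      # step 1
    for _ in range(episodes):
        s = start                                            # step 2
        while s != terminal:                                 # step 3
            # (a) select an action (epsilon-greedy here) and execute it
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            # (b) receive immediate reward, (c) observe new state
            reward, s2 = r(s, a), delta(s, a)
            # (d) update the table entry (terminal states have value 0)
            future = 0.0 if s2 == terminal else max(Q[(s2, a2)] for a2 in actions)
            Q[(s, a)] = reward + gamma * future
            # (e) move to the new state
            s = s2
    return Q

# Made-up 3-state corridor 0-1-2; entering state 2 (the goal) pays 100.
states, actions = [0, 1, 2], ["left", "right"]
delta = lambda s, a: min(s + 1, 2) if a == "right" else max(s - 1, 0)
r = lambda s, a: 100 if s != 2 and delta(s, a) == 2 else 0
Q = q_learning(states, actions, delta, r, terminal=2, start=0)
print(Q[(1, "right")], Q[(0, "right")])   # -> 100.0 90.0
```

Note how the learned values propagate backwards: Q(1, right) = 100 and Q(0, right) = 0 + 0.9 × 100 = 90, exactly as in the grid example of these slides.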
Q learning algorithm: example

Q̂(s1, a_right) ← r + γ max_{a'} Q̂(s2, a')
              = 0 + 0.9 max{63, 81, 100}
              = 90
Another Q-learning example: Tower of Hanoi Initial state Final state Source: http://people.revoledu.com/kardi/tutorial/reinforcementlearning/tower-of-hanoi.htm
Another Q-learning example: Tower of Hanoi

Q(s, a) ← r + γ max_{a'} Q(s', a')

Source: http://people.revoledu.com/kardi/tutorial/reinforcementlearning/tower-of-hanoi.htm
Passive RL

state-based representation
the policy π is fixed: in state s, the agent always executes the action π(s)
the goal is to learn how good the policy is, i.e. to learn the value function Vπ(s)
rewards and actions may be non-deterministic
the agent does not know the transition model P(s, a, s'), which specifies the probability of reaching state s' from state s after doing action a
the agent does not know the reward function r(s)
Passive RL

The agent executes a series of trials in the environment using policy π(s). Example trials (each entry is a state with its reward):

(1,1)-.04 → (1,2)-.04 → (1,3)-.04 → (1,2)-.04 → (1,3)-.04 → (2,3)-.04 → (3,3)-.04 → (4,3)+1

(1,1)-.04 → (2,1)-.04 → (3,1)-.04 → (3,2)-.04 → (4,2)-1
Passive RL

The objective is to use the information about rewards to learn the expected utility Vπ(s):

Vπ(s) = E [ Σ_{t=0..∞} γ^t r(st) ]

where r(s) is the reward for a state, st (a random variable) is the state reached at time t when executing policy π, and s0 = s.
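The simplest passive learner estimates Vπ(s) directly, as the average of the rewards-to-go observed from s in the trials. A minimal first-visit sketch (the function name and the (state, reward) pair encoding of a trial are assumptions; γ = 1 is assumed so that returns are plain sums, matching the example trials above):

```python
def estimate_V(trials, gamma=1.0):
    """Direct utility estimation: approximate V_pi(s) by the average of the
    observed returns from s, counting only the first visit to s in each trial.
    Each trial is a list of (state, reward) pairs."""
    returns = {}
    for trial in trials:
        seen = set()
        for i, (s, _) in enumerate(trial):
            if s in seen:
                continue                      # first-visit only
            seen.add(s)
            g = sum(gamma**k * rew for k, (_, rew) in enumerate(trial[i:]))
            returns.setdefault(s, []).append(g)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}

# The first trial from the slide: seven -0.04 steps, then the +1 terminal.
trial = [((1,1), -.04), ((1,2), -.04), ((1,3), -.04), ((1,2), -.04),
         ((1,3), -.04), ((2,3), -.04), ((3,3), -.04), ((4,3), 1)]
V = estimate_V([trial])
print(round(V[(1,1)], 2))   # -> 0.72
```

With more trials the averages converge to the true Vπ(s), though this direct method ignores the relationships between state values that temporal-difference methods exploit.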
Passive 0.812 0.868 0.762 0.705 0.918 0.660 0.655 0.611 0.388 HOMEWORK: FIND optimal policy for the 4x3 world, with reward r(s) = -0.04 in all the nonterminal states
References & Further reading

Reinforcement Learning: An Introduction. Sutton and Barto. MIT Press, Cambridge (1998)
Artificial Intelligence: A Modern Approach. Russell and Norvig. Prentice Hall (2011) (ch. 21)
Machine Learning. Mitchell. McGraw-Hill (1997) (ch. 13)
Reinforcement Learning: a survey. Kaelbling, Littman and Moore. Journal of Artificial Intelligence Research, vol. 4, 237-285 (1996)
NATURE NEWS (2016): AI algorithm masters ancient game of Go

"A computer has beaten a human professional for the first time at Go, an ancient board game that has long been viewed as one of the greatest challenges for AI."
http://www.nature.com/news/google-ai-algorithm-masters-ancient-game-of-go-1.19234

deep learning + reinforcement learning: "the AlphaGo program applied deep learning. It first studied 30 million positions from expert games, gleaning information on the state of play from board data. Then it played against itself across 50 computers, improving with reinforcement learning."