Motivation (Daniel Polani)

Scenario: a sequence of decisions where
1. each decision may lead randomly to different outcomes
2. each decision is connected with a reward
3. rewards cumulate to a total utility
4. rewards may be delayed

Relevance for Social Intelligence: adaptive models for social interaction.

Exploration/Exploitation Conflict

Given: an n-armed bandit problem. Each trial of a bandit arm has a cost. The payoff distributions have differing mean and variance per arm (finite and unknown). Sought: a strategy with maximum payoff. Temporal conditions: limited, unlimited, or weighted time.

Problem: finding the balance between exploration and exploitation leads to a conflict. A greedy strategy yields only short-term gain (remember the tragedy of the commons).

Note: in the following we will ignore the intricacies of the exploration/exploitation dilemma and use only simple strategies for its control. A minimal bandit environment is sketched below.
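To make the bandit setting concrete, here is a minimal sketch of an n-armed bandit with Gaussian payoff distributions; the class name and the choice of Gaussian payoffs are illustrative assumptions, not part of the original slides:

    import random

    class GaussianBandit:
        """n-armed bandit: each arm pays a noisy reward with its own
        (unknown to the agent) mean and standard deviation."""

        def __init__(self, n_arms, seed=None):
            rng = random.Random(seed)
            # Hidden payoff parameters per arm; a learner never sees these.
            self.means = [rng.gauss(0.0, 1.0) for _ in range(n_arms)]
            self.stddevs = [rng.uniform(0.5, 1.5) for _ in range(n_arms)]
            self.rng = rng
            self.n_arms = n_arms

        def pull(self, arm):
            """One trial of the given arm; returns a random payoff."""
            return self.rng.gauss(self.means[arm], self.stddevs[arm])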
Utility Estimation: Simple Strategies

Assumption: the reward for an action $a$ is a random variable with mean $Q^*(a)$. Choose a sequence of actions; obtain an estimate for $Q^*(a)$ via

    $Q_t(a) = \frac{r_1 + r_2 + \dots + r_{k_a}}{k_a}$

where $r_1, \dots, r_{k_a}$ are the rewards obtained at the $k_a$ times between 1 and $t$ where action $a$ is chosen.

Greedy strategy: choose the action $a_t = \arg\max_a Q_t(a)$.

Alternative ($\varepsilon$-greedy): with probability $1-\varepsilon$ choose the greedy action, with probability $\varepsilon$ a random action. Advantage: more exploration.

Initialization: $Q_0(a)$ is initially set to any value, e.g. 0.

Incremental Estimates

Remark (incremental computation of $Q$): consider $k$ choices of action $a$ having been made, and let the reward obtained at choice $k+1$ be $r_{k+1}$. Then

    $Q_{k+1} = Q_k + \frac{1}{k+1}\left[r_{k+1} - Q_k\right]$

i.e. new estimate = old value + step size $\times$ (deviation of the target value from the old value).

Notes:
1. One needs only to store $Q_k$ and $k$.
2. This update form is common in learning systems.
3. For large $k$, the step size $\frac{1}{k+1}$ is very small.

Nonstationary Environments: give more importance to recent values, e.g.

    $Q_{k+1} = Q_k + \alpha\left[r_{k+1} - Q_k\right]$

with constant $\alpha \in (0, 1]$. A sketch of the incremental $\varepsilon$-greedy estimator follows below.
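A minimal sketch of the $\varepsilon$-greedy strategy with the incremental estimate, run against the GaussianBandit sketched above; all function and parameter names are illustrative:

    import random

    def epsilon_greedy(bandit, n_arms, steps=1000, epsilon=0.1, seed=0):
        """Estimate Q_t(a) incrementally while acting epsilon-greedily."""
        rng = random.Random(seed)
        Q = [0.0] * n_arms          # Q_0(a) initialized to 0
        k = [0] * n_arms            # how often each arm was chosen
        total = 0.0
        for _ in range(steps):
            if rng.random() < epsilon:                      # explore
                a = rng.randrange(n_arms)
            else:                                           # exploit (greedy)
                a = max(range(n_arms), key=lambda i: Q[i])
            r = bandit.pull(a)
            k[a] += 1
            Q[a] += (r - Q[a]) / k[a]   # Q_{k+1} = Q_k + [r_{k+1} - Q_k]/(k+1)
            total += r
        return Q, total

    bandit = GaussianBandit(n_arms=10, seed=42)
    estimates, payoff = epsilon_greedy(bandit, n_arms=10)

For a nonstationary environment, replace the $1/(k+1)$ step size by a constant $\alpha$, which weights recent rewards more heavily.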
Reinforcement Learning: Preliminary Definitions

Def. (state): a full description of the current situation of agent and world.

Def. (policy): the policy $\pi_t$ at time $t$ is a conditional probability $\pi_t(a \mid s)$ that an agent in state $s$ chooses action $a$.

The Full Reinforcement Learning Problem

Agent: at time step $t$ it has access to: the current state $s_t$, the reward $r_t$ just obtained, and the current policy $\pi_t$. From this it calculates: the current action choice $a_t$ and the following policy $\pi_{t+1}$. The underlying formalism is a Markov Decision Process.

Note: the agent has full access to the current state; the border between agent and environment is given by absolute control, not by limitation of knowledge.

Note: the goals of the agent are specified by the rewards. Goal: long-term maximization of the cumulated rewards.

Examples of reward structures:
1. a robot is supposed to learn to move: reward for a step ahead
2. maze: 0 per step, 1 for a step outside the maze
3. maze: -1 per step in the maze

The agent/environment loop is sketched below.
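A minimal sketch of this agent/environment loop, using maze example 3 (reward $-1$ per step in the maze) and a uniformly random policy; the class names and the one-dimensional maze are illustrative assumptions:

    import random

    class CorridorMaze:
        """Tiny maze as in example 3: -1 reward per step until the exit."""
        def __init__(self, length=5):
            self.length = length
        def reset(self):
            self.pos = 0
            return self.pos
        def step(self, action):               # action: -1 (left) or +1 (right)
            self.pos = max(0, self.pos + action)
            done = self.pos >= self.length    # stepping outside ends the episode
            return self.pos, -1.0, done       # -1 per step in the maze

    class RandomAgent:
        """Policy pi(a|s): uniform over both actions, independent of state."""
        def act(self, state, reward):
            return random.choice([-1, +1])

    def run_episode(env, agent, max_steps=100):
        state, reward, ret = env.reset(), 0.0, 0.0
        for t in range(max_steps):
            action = agent.act(state, reward)     # a_t ~ pi_t(a | s_t)
            state, reward, done = env.step(action)
            ret += reward                         # cumulate rewards
            if done:
                break
        return ret

    print(run_episode(CorridorMaze(), RandomAgent()))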
Value Function

Value function: a measure of how good it is for an agent to be in a certain state. This depends on the future actions (more precisely, on the policy $\pi$):

    $V^\pi(s) = E_\pi\!\left[R_t \mid s_t = s\right] = E_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s\right]$

Q function: a measure of how good it is for an agent to be in a state $s$ and to pick a certain action $a$ (again dependent on the policy in the following states):

    $Q^\pi(s, a) = E_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s, a_t = a\right]$

Backup Diagrams

[Figure: backup diagram for $V^\pi$; backup diagram for $Q^\pi$]

Objective

Question: we have a sequence of rewards $r_{t+1}, r_{t+2}, r_{t+3}, \dots$ What do we want to maximize? In general we want to maximize the expected total payoff (return) $R_t$.

Cases:
1. episodic tasks (tasks with a natural end time $T$): $R_t = r_{t+1} + r_{t+2} + \dots + r_T$, no discounting required
2. unlimited tasks: require discounting, $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$ with $0 \le \gamma < 1$

Bellman Equation

Theorem: with the transition probabilities $P^a_{ss'} = \Pr(s_{t+1} = s' \mid s_t = s, a_t = a)$ and the expected rewards $R^a_{ss'} = E\!\left[r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s'\right]$ one has

    $V^\pi(s) = \sum_a \pi(s, a) \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^\pi(s') \right]$
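The Bellman equation can be applied repeatedly as an update rule to compute $V^\pi$ for a fixed policy (iterative policy evaluation, a dynamic programming method not spelled out on the slides); a minimal sketch assuming tabular $P$, $R$, and $\pi$, with an illustrative two-state MDP:

    def policy_evaluation(P, R, pi, gamma=0.9, tol=1e-8):
        """Sweep the Bellman equation over all states until V^pi converges.
        P[s][a][s'] = transition probability, R[s][a][s'] = expected reward,
        pi[s][a] = probability of choosing action a in state s."""
        n_states = len(P)
        V = [0.0] * n_states
        while True:
            delta = 0.0
            for s in range(n_states):
                v = sum(pi[s][a] * sum(P[s][a][s2] * (R[s][a][s2] + gamma * V[s2])
                                       for s2 in range(n_states))
                        for a in range(len(P[s])))
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < tol:
                return V

    # Two-state example: action 0 stays put, action 1 moves to the other state.
    P = [[[1.0, 0.0], [0.0, 1.0]],
         [[0.0, 1.0], [1.0, 0.0]]]
    R = [[[0.0, 0.0], [0.0, 1.0]],   # reward 1 for entering state 1 from state 0
         [[0.0, 0.0], [0.0, 0.0]]]
    pi = [[0.5, 0.5], [0.5, 0.5]]    # uniform random policy
    V = policy_evaluation(P, R, pi)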
Optimal Policy: Bellman's Optimality Criterion

Comparison of policies: we say that $\pi$ is at least as good as $\pi'$ if for all states $s$ we have $V^\pi(s) \ge V^{\pi'}(s)$.

Theorem: there always exists an optimal policy $\pi^*$, i.e. one that is at least as good as all other policies.

Note: an optimal policy is not necessarily unique! But $V^{\pi^*}$ is the same for all optimal policies. Therefore write $V^*$ for the optimal value function. Analogously define $Q^*$.

Remark: one has

    $Q^*(s, a) = E\!\left[r_{t+1} + \gamma V^*(s_{t+1}) \mid s_t = s, a_t = a\right]$

Bellman's Optimality Equation: one has

    $V^*(s) = \max_a \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^*(s') \right]$

Analogously:

    $Q^*(s, a) = \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma \max_{a'} Q^*(s', a') \right]$

Backup Diagrams

[Figure: backup diagrams for $V^*$ and $Q^*$, with a max over actions]

Learning Methods

Methods for $V^*$: dynamic programming (value iteration). Methods for $Q^*$: Q-learning. A value iteration sketch follows below.
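A minimal value iteration sketch, using the Bellman optimality equation directly as an update rule; it reuses the illustrative tabular $P$ and $R$ format from the previous sketch:

    def value_iteration(P, R, gamma=0.9, tol=1e-8):
        """Dynamic programming with the Bellman optimality equation:
        V(s) <- max_a sum_s' P[s][a][s'] * (R[s][a][s'] + gamma * V(s'))."""
        n_states = len(P)
        V = [0.0] * n_states
        while True:
            delta = 0.0
            for s in range(n_states):
                v = max(sum(P[s][a][s2] * (R[s][a][s2] + gamma * V[s2])
                            for s2 in range(n_states))
                        for a in range(len(P[s])))
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < tol:
                # The greedy policy with respect to V* is an optimal policy.
                return V

    V_star = value_iteration(P, R)   # P, R as in the policy evaluation example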
Q-Learning

Q-Learning Properties: the update rule is given by

    $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t) \right]$

Theorem (Watkins): if all $Q(s, a)$ are being updated often enough, $Q$ converges towards $Q^*$, independently of the policy.

Advantages: off-policy; does not require explicit averaging (done implicitly as you go); no model required. A sketch follows at the end of this section.

General Remarks

Reinforcement Learning: learning from delayed rewards. It uses Q-Learning as a central method; many variants and improvements exist.

In particular, Q-Learning requires no model of the dynamics, only an immediate backup. But the state must have the Markov property: the result of an action must depend only on the current state variable, which must be known to the agent. In particular:
1. it must not depend on memory effects unseen by the agent
2. it must not depend on the history of the agent or the world

Gambler's Problem

Scenario: playing red/black, starting and exiting with a given amount.
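A minimal tabular Q-learning sketch with an $\varepsilon$-greedy behaviour policy, shown on the illustrative CorridorMaze from earlier; the step size, discount, and exploration rate are arbitrary choices:

    import random

    def q_learning(env, n_states, actions, episodes=500,
                   alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
        """Tabular Q-learning: update Q from immediate backups only."""
        rng = random.Random(seed)
        Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
        for _ in range(episodes):
            s = env.reset()
            for _ in range(200):            # cap episode length
                if rng.random() < epsilon:  # epsilon-greedy behaviour policy
                    a = rng.choice(actions)
                else:
                    a = max(actions, key=lambda x: Q[(s, x)])
                s2, r, done = env.step(a)
                target = r if done else r + gamma * max(Q[(s2, x)] for x in actions)
                # Q(s,a) <- Q(s,a) + alpha [r + gamma max_a' Q(s',a') - Q(s,a)]
                Q[(s, a)] += alpha * (target - Q[(s, a)])
                s = s2
                if done:
                    break
        return Q

    env = CorridorMaze(length=5)
    Q = q_learning(env, n_states=6, actions=[-1, +1])
    # After training, Q[(s, +1)] exceeds Q[(s, -1)] for the in-maze states,
    # i.e. the greedy policy walks straight to the exit.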