Imitative Policies for Reinforcement Learning


Dana Dahlstrom and Eric Wiewiora
Department of Computer Science and Engineering
University of California, San Diego
La Jolla, CA 92093-0114, USA
{dana,wiewiora}@cs.ucsd.edu

Abstract

We discuss a reinforcement learning framework where learners observe experts interacting with the environment. Our approach is to construct from these observations exploratory policies which favor selection of actions the expert has taken. This imitation strategy can be applied at any stage of learning, and requires neither that information regarding reinforcement be conveyed from the expert to the learner nor that the learner have any explicit knowledge of its reinforcement structure. We show that learning with an imitative policy can be faster than passively observing an expert or learning from direct experience alone. We also show that imitative policies are robust to sub-optimal experts.

1 Introduction

In its standard formulation, reinforcement learning assumes a solipsistic environment where the learner's task is to find a reward-maximizing control policy starting with no prior knowledge and receiving no outside help; the only available information is first-hand experience. In the real world, however, learners are usually part of social networks from which they can cull knowledge pertaining to their tasks. For such social learners, "Learning is more often a transfer than a discovery" [1].

To take full advantage of social knowledge requires sophisticated communication. Besides techniques for accomplishing tasks, it is useful to know potential pitfalls and how to avoid them, or how to recover afterward. Human language can encode such complex information, but it should be possible to transfer some knowledge without the full generality of language.

A relatively simple way to learn from others is by imitating them. If their behavior is expedient, imitation may yield better performance much faster than lone experimentation. Of course, others may not be perfect, or they may have different goals entirely. Thus it is generally unwise merely to imitate; one should pay attention to first-hand experience as well. When imitated behavior is not perfect, it may be possible to improve upon it by experimentation.

The approach we propose is to use observations of others to guide the learner's exploration. We assume the learner can observe the states and actions of others but not necessarily their rewards. We also assume the learner can identify its own states with those of others, and that the actions it observes others taking are available to it. These assumptions are consistent with a model in which the learner and those it observes are of the same species but may have different goals.

2 Background

One way to use an expert is simply as another source of experience. In what Whitehead has termed Learning by Watching (LBW), the learner observes an expert's state-action-reinforcement triples and uses them as if they were its own [2]. Lin integrates this strategy with his experience replay technique, which records experience sequences and replays them in chronologically backward order to speed up reinforcement propagation through the value function [3, 4]. Price and Boutilier present an LBW approach that does not require knowing the expert's actions: a special placeholder action is assumed for every expert state transition [5].

Other approaches use the expert in entirely different ways. For instance, the learning agent of Utgoff and Clouse queries the expert for a recommended action when its confidence in its own decision is too low [6]. Alternatively, the expert might take the initiative, intervening to suggest an action when it sees fit, as suggested later by Clouse and Utgoff [7]. This active role, intentionally guiding the learner's exploration, is appropriately called teaching.

LBW is appealing not only for its simplicity but for its generality. Very little is demanded of the expert: it need not respond to queries or actively make suggestions to the learner, but must merely act normally in the environment. What is more, using LBW the learner can benefit from observations even when they exhibit poor performance, though in practice it is more useful to observe an expert than a novice [8].

An underlying assumption in LBW is that the learner knows the reinforcement it would receive in the expert's position.¹ In our approach the learner uses the observed actions to guide its own exploration, so the approach is still applicable when this assumption does not hold. In that the expert guides the learner's exploration, our approach is related to that of Clouse and Utgoff, but does not require the expert to intentionally teach the learner.

2.1 Markov decision processes

Most reinforcement learning techniques model the learning environment as a Markov decision process (MDP) [9]. An MDP is a quadruple (S, A, T, R), where S is the set of states, A is the set of actions, T(s' | s, a) is the probability of transitioning to state s' when performing action a in state s, and R(s, a, s') is the reinforcement received when action a is performed in state s and there is a transition to state s'. The reinforcement learning task is to find a policy π : S → A that maximizes the total discounted reinforcement

$$ \sum_{t=0}^{\infty} \gamma^{t} r_{t} $$

where r_t is the reinforcement received at time t and γ is the discount rate determining the relative importance of future versus immediate reinforcement.

¹ If the expert's reinforcements are observable and the learner has the same reinforcement structure as the expert, then the learner effectively knows what reinforcement it would receive.
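As a concrete illustration of these definitions, the sketch below represents a tiny MDP as plain Python dictionaries and computes the discounted return of a sample reward sequence. The two-state example and all identifiers are invented for illustration; they do not come from the paper.

```python
# A toy MDP (S, A, T, R), not from the paper: two states, two actions.
S = ["s0", "s1"]
A = ["left", "right"]

# T[(s, a)] maps each next state s' to the probability T(s' | s, a).
T = {
    ("s0", "left"):  {"s0": 0.9, "s1": 0.1},
    ("s0", "right"): {"s0": 0.2, "s1": 0.8},
    ("s1", "left"):  {"s0": 0.8, "s1": 0.2},
    ("s1", "right"): {"s0": 0.1, "s1": 0.9},
}

# R[(s, a, s_next)] is the reinforcement R(s, a, s'): here, +1 for landing in s1.
R = {(s, a, s2): (1.0 if s2 == "s1" else 0.0) for s in S for a in A for s2 in S}

def discounted_return(rewards, gamma=0.95):
    """Total discounted reinforcement: the sum over t of gamma^t * r_t."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Discounted return of an example reward sequence r_0, r_1, r_2, r_3.
print(discounted_return([0.0, 1.0, 1.0, 0.0]))  # 0.95 + 0.95**2 = 1.8525
```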

2.2 Q-Learning

Q-learning is a reinforcement learning algorithm based on estimating the expected total discounted reinforcement Q(s, a) received when taking action a in state s [9]. An experience is a quadruple (s, a, r, s') where action a is taken in state s, resulting in reinforcement r and a transition to next state s'. We consider an implementation of Q-learning which stores Q values in a tabular format; for each experience, a table entry is updated according to the rule

$$ Q(s, a) \leftarrow (1 - \alpha)\, Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') \right) $$

where α is the learning rate. The greedy policy π_g(s) = argmax_a Q(s, a) is optimal when the Q values are accurate. To guarantee that the entries in the table converge to the true Q values, all state-action pairs must be explored infinitely often. This can be ensured by using an exploration strategy such as ε-greedy: with probability ε, choose an action uniformly at random; otherwise choose the greedy action π_g(s).
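As a concrete illustration of the update rule and ε-greedy exploration above, here is a minimal tabular Q-learning sketch. The environment interface (reset() returning a state, step(action) returning the next state, the reinforcement, and an end-of-episode flag) and all identifiers are assumptions made for the example, not part of the paper.

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=1000, alpha=0.5, gamma=0.95, epsilon=0.1):
    """Tabular Q-learning with epsilon-greedy exploration (illustrative sketch)."""
    Q = defaultdict(float)  # Q[(s, a)], initialized to zero

    def greedy(s):
        # pi_g(s) = argmax_a Q(s, a)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy: a uniformly random action with probability epsilon,
            # otherwise the greedy action.
            a = random.choice(actions) if random.random() < epsilon else greedy(s)
            s_next, r, done = env.step(a)
            # Q(s, a) <- (1 - alpha) Q(s, a) + alpha (r + gamma max_a' Q(s', a'));
            # no bootstrapping past the end of an episode.
            target = r if done else r + gamma * max(Q[(s_next, a2)] for a2 in actions)
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
            s = s_next
    return Q
```

The imitative policies of Section 3 change only how the action a is chosen in this loop.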

3 Imitative policies

We present two imitative policies for reinforcement learning and show how they integrate with Q-learning. Action biasing modifies the greedy objective so that observed actions are more likely to be taken; estimated policy exploration replaces ε-greedy's uniform distribution with an estimate of the expert's policy. Both strategies are applicable whenever observation of the expert is available, be it prior to direct experience, in parallel with it, or intermittent. Both mechanisms utilize count_e(s, a), the number of times the expert has been observed taking action a in state s. Action biasing also uses count(s) and count_e(s): the number of times the learner and the expert, respectively, have visited s.

3.1 Action biasing

Action biasing is similar to Whitehead's Biasing Binary Learning with an External Critic (BB-LEC) framework [2]. In action biasing, the policy is defined with respect to the bias function

$$ B(s, a) = \begin{cases} +b & \text{if } \mathrm{count}_e(s, a) > 0 \\ 0 & \text{if } \mathrm{count}_e(s) = 0 \\ -b & \text{otherwise} \end{cases} $$

where the nominal bias b is the magnitude of the reward for taking an action the expert has taken, or the penalty for taking one it has not. The learner uses a biased ε-greedy policy

$$ \pi_b(s) = \begin{cases} a \sim U(A) & \text{with probability } \epsilon \\ \arg\max_a \left[ \delta^{\mathrm{count}(s)} B(s, a) + \left(1 - \delta^{\mathrm{count}(s)}\right) Q(s, a) \right] & \text{otherwise} \end{cases} $$

where U(A) is a uniform distribution over the actions and δ ∈ (0, 1) is the decay rate of the bias influence. As the learner gains experience, the influence of the bias approaches zero and π_b converges to the greedy policy π_g.

3.2 Estimated policy exploration

Another approach is to estimate the expert's policy and choose exploratory actions probabilistically according to this estimate. For discrete MDPs the observed policy can be estimated as a multinomial distribution: given a count of the times the expert has taken each action in each state, Lidstone's law of succession gives the estimated expert policy

$$ \pi_e(a \mid s) = \frac{\mathrm{count}_e(s, a) + \lambda}{\sum_{a'} \left[ \mathrm{count}_e(s, a') + \lambda \right]} $$

where the flattening parameter λ > 0 determines how much weight to assign to the uniform prior distribution versus the observed distribution of the expert's actions. This estimate can replace the usual uniform distribution over actions in ε-greedy:

$$ \pi(s) = \begin{cases} a \sim \pi_e(\cdot \mid s) & \text{with probability } \epsilon \\ \arg\max_a Q(s, a) & \text{otherwise} \end{cases} $$

When the expert has never been observed in state s, the estimated expert policy π_e(a | s) is a uniform distribution over the actions, and in this case the policy is equivalent to standard ε-greedy. There are other ways to estimate multinomial distributions that may be preferable to Lidstone's law in some circumstances [10].

Estimating the expert's policy as a multinomial distribution accounts for nondeterminism, but it cannot capture the dependence upon previous states when the Markov assumption does not hold, as in partially observable Markov decision processes (POMDPs). Sequence prediction techniques are apt for this more complex task.
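Both mechanisms can be written as action-selection routines that plug into the Q-learning loop sketched earlier. The following is a minimal, assumption-laden sketch: the dictionaries count_e_sa, count_e_s, and count stand in for count_e(s, a), count_e(s), and count(s), the mixing weight follows the decayed bias influence described above, and all names are illustrative rather than the authors' code. Any exploration-rate schedule (such as the decaying ε used in the experiments below) is left to the caller.

```python
import random

def bias(s, a, count_e_sa, count_e_s, b=1.0):
    """Bias function B(s, a): +b if the expert took a in s, 0 if s was never
    observed, and -b otherwise (sketch of Section 3.1)."""
    if count_e_sa.get((s, a), 0) > 0:
        return +b
    if count_e_s.get(s, 0) == 0:
        return 0.0
    return -b

def action_biasing(s, Q, actions, count, count_e_sa, count_e_s,
                   epsilon=0.1, delta=0.75, b=1.0):
    """Biased epsilon-greedy policy pi_b (sketch of Section 3.1)."""
    if random.random() < epsilon:
        return random.choice(actions)          # uniform exploration
    w = delta ** count.get(s, 0)               # bias influence decays with visits to s
    return max(actions,
               key=lambda a: w * bias(s, a, count_e_sa, count_e_s, b)
                             + (1.0 - w) * Q.get((s, a), 0.0))

def estimated_policy_exploration(s, Q, actions, count_e_sa, epsilon=0.1, lam=1.0):
    """Epsilon-greedy where exploratory actions are drawn from the Lidstone
    estimate of the expert's policy (sketch of Section 3.2)."""
    if random.random() < epsilon:
        # Unnormalized Lidstone weights count_e(s, a) + lambda; in an unobserved
        # state this reduces to a uniform draw, i.e. standard epsilon-greedy.
        weights = [count_e_sa.get((s, a), 0) + lam for a in actions]
        return random.choices(actions, weights=weights, k=1)[0]
    return max(actions, key=lambda a: Q.get((s, a), 0.0))
```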

4 Experiments with pong

We have experimented with imitative policies in learning a control policy for a pong game. The pong board is a 10 × 12 rectangle in which a point-sized ball bounces. The agent controls a 2-unit-wide paddle which can move left or right 1 unit per time step along the bottom edge of the board. The goal is to position the paddle to intercept the ball whenever it reaches the bottom edge. The ball is launched from one of 12 positions on the top edge with a downward velocity of 1 unit per time step and an integral horizontal velocity ranging from -2 to 2 units per time step. An episode begins when the ball is launched and ends when it contacts the bottom edge. Hitting the ball yields a +1 reward; missing it yields a -1 penalty.

We compare the estimated policy exploration and action biasing methods to LBW and standard Q-learning. In our experiments, the learners interact with the environment in three distinct, interleaved phases, with one episode per phase for each cycle:

observation: The agent observes expert experience.
exploration: The agent learns from direct experience.
evaluation: The agent neither observes nor learns; it merely exploits a greedy policy.

We measure performance during the evaluation phase to isolate the effects of the methods on the learned greedy policy, without the noise introduced by random exploration. So that the standard Q-learner gets the same amount of total experience as the others, it explores on its own during the time the others spend observing the expert's actions.

All the learners use a learning rate α = 0.5 and a discount rate γ = 0.95. The action biasing learner uses a bias decay rate δ = 0.75. The estimated policy exploration learner uses a flattening parameter λ = 1, and its exploration rate ε = 1000/(1000 + k) decays based on the number k of exploration episodes. The rest of the learners use a constant exploration rate ε = 0.1.

We have created two expert pong agents. The perfect expert predicts where the ball will next contact the bottom edge and moves the paddle there directly. The imperfect expert attempts to keep the paddle directly beneath the ball at all times. When the ball's horizontal velocity is faster than the paddle can move, this strategy results in frequent misses.

4.1 Observing the perfect expert

[Figure 1: Observing a perfect expert. Performance on the previous 100 evaluation episodes versus observation and exploration episodes, for policy imitation, biased actions, Learning by Watching, standard Q-learning, and the expert.]

Figure 1 shows the benefit of imitative policies in conjunction with LBW; the imitative learners did Q updates using the expert experiences in addition to using their respective imitative policies. Both imitative learners learned faster than LBW, and all three outperformed standard Q-learning. Imitative exploration is most useful in the initial stage of learning, when reinforcement has not yet propagated far through the Q table; imitative policies can pick up some of the expert's rewarding behavior even before this happens.

4.1.1 Observing the imperfect expert

[Figure 2: Observing an imperfect expert. Performance on the previous 100 evaluation episodes versus observation and exploration episodes, for the same learners.]

Figure 2 demonstrates the effects of imitating the imperfect expert, which misses the ball approximately 20% of the time. Again the imitative learners did updates using the expert experiences. All the learners that observe the expert initially learn faster than standard Q-learning, but the imitative learners do not outperform LBW. By the end of the run, all learners have reached approximately the same level of performance, but the standard Q-learner's performance has a steeper upward slope than the rest. Once the learners surpass the expert's performance, standard Q-learning has an advantage: the others spend their observation phases watching the imperfect expert while the standard Q-learner explores based on its superior policy. In short, the influence of an imperfect expert makes it more difficult to improve beyond its level.

4.2 Unobservable expert reinforcements

Unlike in LBW, the imitative learners do not need to know the expert's reinforcements. Figure 3 shows the performance of imitative policies without LBW updates. The imitative learners are tested against the standard Q-learner, which again receives additional exploration episodes instead of observations.

[Figure 3: Observation without reinforcements. Performance on the previous 100 evaluation episodes versus observation and exploration episodes, for policy imitation and biased actions with the perfect and imperfect experts, and standard Q-learning.]

The results show that even though the imitation learners are making half as many updates to their Q tables, they are not dramatically outperformed by Q-learning. In fact, the action biasing learner that imitates the perfect expert does slightly better. If observation is less costly than direct experience, an imitative policy is likely a good choice.
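For concreteness, the perfect expert's strategy of predicting where the ball will reach the bottom edge (Section 4) could be sketched as below. The paper does not spell out the wall-reflection rules or the exact action set, so predict_intercept, perfect_expert_action, and the "stay" action are hypothetical reconstructions, not the authors' implementation.

```python
def predict_intercept(x, y, vx, board_width=10):
    """Predict the x-position where the ball reaches the bottom edge (y = 0),
    assuming it falls 1 unit per step, moves vx units horizontally per step,
    and reflects elastically off the side walls (assumed rules)."""
    while y > 0:
        x += vx
        y -= 1
        if x < 0:                       # reflect off the left wall
            x, vx = -x, -vx
        elif x > board_width:           # reflect off the right wall
            x, vx = 2 * board_width - x, -vx
    return x

def perfect_expert_action(paddle_x, ball, board_width=10):
    """Move the paddle (assumed to span [paddle_x, paddle_x + 2]) one unit per
    step toward the predicted intercept point."""
    target = predict_intercept(ball["x"], ball["y"], ball["vx"], board_width)
    if target > paddle_x + 2:
        return "right"
    if target < paddle_x:
        return "left"
    return "stay"                        # "stay" is an assumed no-op action
```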

5 Discussion

When they can observe a reasonably good expert, learners can improve much faster than by direct experience alone. If the learner must trade off observation against direct experience, however, this benefit is not without cost. Probably the most problematic situation is imitating a bad expert, in which case the learner is biased away from more rewarding actions. Because our methods decay the influence of observations as the learner gains its own experience, imitating a bad expert will only slow convergence to an optimal policy rather than prevent it altogether.

Future work on imitation in reinforcement learning may include developing a principled way to decay imitation based upon the relative performance of the expert and the learner; our separate exploration and evaluation phases could facilitate this kind of comparison. If there is a cost for observation, another related problem is how to decide when it would be beneficial to observe rather than explore. It may also be productive to incorporate imitative policies into other reinforcement learning frameworks such as learning with eligibility traces, generalizing function approximators, or model-based methods.

References

[1] Steven D. Whitehead. A study of cooperative mechanisms for faster reinforcement learning. Technical Report 365, Department of Computer Science, University of Rochester, Mar 1991.

[2] Steven D. Whitehead. A complexity analysis of cooperative mechanisms in reinforcement learning. In Proceedings of the Ninth National Conference on Artificial Intelligence (AAAI-91), volume 2, pages 607–613. AAAI Press / The MIT Press, Jul 1991.

[3] Long-Ji Lin. Programming robots using reinforcement learning and teaching. In Proceedings of the Ninth National Conference on Artificial Intelligence (AAAI-91), volume 2, pages 781–786. AAAI Press / The MIT Press, Jul 1991.

[4] Long-Ji Lin. Self-improving reactive agents based on reinforcement learning, planning, and teaching. Machine Learning, 8(3/4):293–321, May 1992.

[5] Bob Price and Craig Boutilier. Implicit imitation in multiagent reinforcement learning. In Machine Learning: Proceedings of the Sixteenth International Conference, pages 325–334. Morgan Kaufmann, Jun 1999.

[6] Paul E. Utgoff and Jeffery A. Clouse. Two kinds of training information for evaluation function learning. In Proceedings of the Ninth National Conference on Artificial Intelligence (AAAI-91), volume 2, pages 596–600. AAAI Press / The MIT Press, Jul 1991.

[7] Jeffery A. Clouse and Paul E. Utgoff. A teaching method for reinforcement learning. In Machine Learning: Proceedings of the Ninth International Workshop (ML92), pages 92–101. Morgan Kaufmann, Jul 1992.

[8] Ming Tan. Multi-agent reinforcement learning: Independent versus cooperative agents. In Machine Learning: Proceedings of the Tenth International Conference, pages 330–337, Jun 1993.

[9] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. The MIT Press, 1998.

[10] Eric Sven Ristad. A natural law of succession. Technical Report CS-TR-495-95, Department of Computer Science, Princeton University, May 1995.