Task-Oriented Reinforcement Learning


Md Abdus Samad Kamal
February 2003

Masters Course
Department of Electrical and Electronic System Engineering

A thesis on Task-Oriented Reinforcement Learning

by Md Abdus Samad Kamal

Submitted to the Department of Electrical and Electronic System Engineering in partial fulfillment of the requirements for the degree of Masters in Information Science at Kyushu University, Japan, February 2003.

Guided by Dr. Junichi Murata, Associate Professor
Certified by Prof. Kiyoshi WADA, Thesis Supervisor

Acknowledgement

This thesis is the result of two years of work during which I have been accompanied and supported by many people. I am extremely indebted to Dr. Junichi Murata, Associate Professor, Department of Electrical and Electronic System Engineering, Kyushu University, for his continuous guidance and encouragement throughout the period of work on this thesis. I would like to express my gratitude to Professor Kotaro Hirasawa for recommending me as a research student at Kyushu University, and for his stimulating suggestions and comments on my research. I am also grateful to all the students of the Intelligent Control Laboratory for their friendly help whenever I faced problems in my study and in my daily life as a foreign student. I want to thank the Japanese Government and people for supporting me with a scholarship. In particular, I would like to give my special thanks to my parents for their encouragement and to my wife, whose patient love enabled me to complete this work.

Md Abdus Samad Kamal
February 2003

Contents

1 Introduction
  1.1 Intelligent System
  1.2 Computational Agents
  1.3 Reinforcement Learning
  1.4 Present Study
  1.5 Outline of the Thesis

2 Reinforcement Learning
  2.1 Introduction
  2.2 Historical Background
  2.3 Reinforcement Learning Problem
  2.4 Reinforcement Learning Algorithms
    2.4.1 Dynamic Programming
    2.4.2 Monte Carlo Methods
    2.4.3 Temporal Difference Learning
  2.5 Multi-agent Reinforcement Learning
    2.5.1 Multi-agent Systems
    2.5.2 Multi-agent Reinforcement Learning

3 Task-Oriented Reinforcement Learning
  3.1 Open Problems in Reinforcement Learning
  3.2 Related Work
  3.3 Task-Oriented Reinforcement Learning

4 Examples
  4.1 Introduction
  4.2 Tile World
    4.2.1 Task Description
    4.2.2 Implementation
    4.2.3 Simulation Results
  4.3 Elevator Group Control
    4.3.1 Domain Description
    4.3.2 Complexity in Elevator System
    4.3.3 Implementation
    4.3.4 Simulation Results

5 Conclusions
  5.1 Discussions
  5.2 Conclusions
  5.3 Future Work

Bibliography

Chapter 1
Introduction

1.1 Intelligent System

Artificial Intelligence (AI) is one of the popular sub-fields of computer science, focusing on creating machines that can engage in behaviors that humans consider intelligent. The ability to create intelligent machines has intrigued humans since ancient times, and now, with the advent of the digital computer and a series of research efforts into AI programming techniques, the dream of smart machines is becoming a reality. A machine may act intelligently either through a fixed program taught at the time of design or through self-learning from its past experiences, depending on the construction of its intelligent core. But intelligent machines that are capable of learning complex tasks and behaving intelligently are still far from reality.

One of the basic fields of AI is machine learning, which is concerned with the computational aspects of learning in natural as well as technical systems. Interest in developing capable learning systems is increasing. A truly intelligent system must be able to learn from its environment by acting in it and observing its nature. In human society, and also among animals, learning is an essential component of intelligent behavior. However, each individual agent need not learn everything from scratch by its own discovery. Instead, agents exchange information and knowledge with each other and learn from the skilled ones. When a task is too big for a single agent to handle, or when a task concerns the goal of the society, they may cooperate in order to accomplish the task. Learning enables systems to be more flexible and robust, and it makes them better able to handle uncertainty and changing circumstances. This is also important in multi-agent systems, where the designers of such systems have often faced the extremely difficult task of trying to anticipate all possible contingencies and interactions among the agents ahead of time.

The goal of an agent in a dynamic environment is to make optimal decisions over time. Learning serves this purpose by biasing the agent's action choices through information gathered over time. Learning and intelligence are intimately related to each other. It is usually agreed that a system capable of learning deserves to be called intelligent; and conversely, a system considered intelligent is, among other things, usually expected to be able to learn [1]. Learning means to acquire knowledge and to improve future behavior based on past experiences. Different learning mechanisms are available in the field of machine learning, such as supervised learning and unsupervised learning, which are classified on the basis of the procedure used to guide the system during learning. Our main interest is to develop a system that is capable of learning from the interactions with its working

environment while performing a more realistic and dynamic task, considering its natural aspects, without any artificial constraint.

1.2 Computational Agents

An agent is a computational entity that acts in a machine and is responsible for its intelligent behaviors; it can be viewed as perceiving and acting upon its environment, and it is autonomous in that its behavior at least partially depends on its own experience. Simply put, the controller of any system is the agent. An agent should have the following properties:

(i) Perceptual, cognitive and effectual skills: the ability to interact with the environment in a somewhat intelligent manner.
(ii) Communicative and social abilities: the ability to communicate, cooperate and compete with other agents.
(iii) Autonomy: it should be smart enough to act (self-control) rationally according to its self-interest.

As an intelligent entity, an agent operates flexibly and rationally in a variety of environmental circumstances, given its perceptual and effectual equipment. Intelligent agents are able to perceive their environment, respond in a timely fashion to changes that occur in it, exhibit goal-directed behavior by taking the initiative, and interact with other agents, in order to satisfy their design objective [2].

1.3 Reinforcement Learning

Reinforcement learning [3] is a machine-learning framework in which an agent manipulates its environment through a series of actions and, in response to each action, receives a scalar reward value. The agent stores its knowledge about how to choose reward-maximizing actions in a mapping from agent-internal states to actions. It is distinguished from other computational approaches by its emphasis on learning by the individual from direct interaction with its environment, without relying on exemplary supervision or complete models of the environment. If a combination of actions is followed by a high reward (a positive reinforcement), then the agent tends to exhibit it in the future; on the other hand, if a combination of actions is followed by a punishment (a negative reinforcement), then the agent tends not to exhibit it in the future. The agent tries to trade off using the best currently known combination of actions against exploring a different one to determine whether the latter is better than the former. A particular advantage of these techniques is that they can be used in domains in which agents have little or no pre-existing domain expertise and have little information about the capabilities and goals of other agents.

Reinforcement learning uses a formal framework defining the interaction between agent and environment in terms of states, actions, and rewards. This framework is intended to be

a simple way of representing the essential features of the artificial intelligence problem. These features include a sense of cause and effect, a sense of uncertainty, and the existence of explicit goals. Most relevant is the formalism of Markov decision processes, which provides a precise, and relatively neutral, way of including these key features. The concepts of value and value functions are the key features of reinforcement learning methods. Value functions are essential for efficient search in the space of policies. The use of value functions distinguishes reinforcement learning methods from evolutionary methods, which search directly in policy space guided by scalar evaluations of entire policies.

1.4 Present Study

Despite its conceptual simplicity, the reinforcement learning algorithm often cannot scale well to non-uniform problems with large or infinite state and action spaces. The episodic task is the simplest setting for learning, where the agent has the opportunity to repeat its trial from the same initial conditions in a stationary environment, resulting in faster convergence. But in an episodic problem the environment needs to be reinitialized after a certain interval of time or when certain conditions are met. In this way, the autonomous agent becomes dependent on some external mechanism to repeat the episode, which contradicts the concept of autonomous and unsupervised learning. Real-world problems are continuous in nature, and it is hardly possible to model them in an episodic manner. Most of them have their own typical characteristics with complex dynamics, and there is rarely a known optimal policy for such systems. In such cases the reinforcement learning algorithm needs a modified approach in order to scale well to the complex problem when designing a robust system.

This work presents a new approach to reinforcement learning, which we name task-oriented reinforcement learning [4-6], to solve high-dimensional problems. Task-oriented reinforcement learning methods decompose a complex problem into logical sub-problems according to the types of actions, and learning is carried out from the viewpoint of the task, making the status of the task and its goal clear to the agent. It has been shown that the task-oriented learning method has faster convergence characteristics for a simple episodic task [4]. To examine the effectiveness of task-oriented reinforcement learning, two different types of test domains are considered in this thesis. The first domain is a complex tile-world, where the agent does not have any opportunity to repeat its trial with the same initial conditions. In this task-oriented implementation the agent uses one lookup table for its own movement throughout the environment for different purposes. For the subtask that requires a different type of action rather than only the agent's movement, a separate lookup table is proposed. For all lookup tables, the relative information of the environment topology and the relative direction of the goal state are used as state information. The learning process does not depend on the initial state of an agent, and the experience of one trial can be effectively applied to different types of trials. The effectiveness of the proposed system is also verified for multiple agents. The use of separate lookup tables greatly reduces the dimensionality of the state spaces and ensures faster learning, as it considers only the task-related information as state. The relative information

of the environment topology generalizes the system to conduct the task continuously in a dynamic environment.

The second test domain considered in this thesis is the well-known elevator group supervisory control problem. This domain is different from the first one, as it needs to handle stochastic passenger arrivals in real time while manipulating a large amount of state information, and different types of actions are needed asynchronously for its proper operation. It is inherently a distributed system and should be controlled by multiple agents. In applying the task-oriented method, two groups of agents are proposed, each responsible for a corresponding subtask. According to the types of tasks, the goal and policy of each group of agents are different, but through their combined efforts the ultimate goal of the system is attained. The learning is carried out from the viewpoint of the task, considering its expectation. The task decomposition limits the size of the state space, and hence requires less memory and less computation while converging faster, and thinking from the viewpoint of the task leads the agent to choose its actions more precisely.

1.5 Outline of the Thesis

The next chapter is devoted to establishing the notion of the reinforcement learning algorithm. First, a brief description of its historical background and an idea of reinforcement learning problems are given. This is followed by a detailed review and explanation of reinforcement learning algorithms, including the Bellman optimality equation, dynamic programming, Monte Carlo methods, temporal difference learning, etc. An introduction to multi-agent systems and the scope for applying reinforcement learning in multi-agent systems are also given in that chapter.

Chapter three describes the proposed algorithm, task-oriented reinforcement learning. The chapter starts with a review of critical issues in reinforcement learning algorithms that remain unsolved, followed by an overview of related research aimed at overcoming these drawbacks and limitations. An explanation of the task-oriented approach to reinforcement learning is then given, including a comparison of the task-oriented approach with the conventional approach, its benefits in solving high-dimensional problems, and the scope of its use for multiple-agent systems.

Chapter four considers the implementation of task-oriented reinforcement learning algorithms for two different problems and the corresponding simulation results. The first simulation considers a dynamic and continuous tile-world problem, where the agent has to learn in one situation and use its experience in different situations, since at the end of each trial the environment elements change their positions. The implementation of task-oriented reinforcement learning makes the system robust and easy to convert into a multi-agent system without any modification, which will be shown in this chapter. The second simulation considered in this chapter is elevator group control, which is a real-world problem. A brief description of its dimensional complexity and of the difficulties in facing unknown passenger arrival patterns will be given. The implementation procedure of task-oriented reinforcement learning and its

merits in terms of state space size, convergence rate and system performance will be shown using comparative tables and graphs.

Chapter five presents the concluding remarks of this thesis. An explanatory discussion of the task-oriented method and of the simulation results is given, followed by a brief conclusion of this contribution. The chapter also discusses the scope of task-oriented reinforcement learning for controlling much more complex problems, which remain unsolved, and a brief plan of future research to generalize and enhance the task-oriented reinforcement learning method.

Chapter 2
Reinforcement Learning

2.1 Introduction

Many researchers have focused primarily on supervised learning, where the system learns from examples, in the form of input-output pairs, provided by a knowledgeable external supervisor. This technique is useful in a wide variety of problems involving pattern classification or function approximation. However, there are many situations in which training examples are costly or even impossible to obtain due to the nature of the problem, and in interactive problems it is often impractical to obtain examples of desired behavior that are both correct and representative of all the situations in which the agent has to act. In these difficult situations the reinforcement learning algorithm can be applied successfully, as it needs only a critic that provides a scalar evaluation of the output that was selected, rather than specifying the best output or a direction in which to change the output. Even when training examples are available, a reinforcement learning system may outperform a supervised system because it has an additional feature called exploration, that is, exploring the environment to determine the best output for any given input.

2.2 Historical Background

Reinforcement learning dates back to the early days of cybernetics and to work in statistics, psychology, neuroscience, and computer science [7]. The Bellman optimality equation for the value function is the basis of modern reinforcement learning methods; it was first introduced by Richard Bellman in 1957, who called it the basic functional equation. Bellman also introduced the discrete stochastic version of the optimal control problem known as Markovian decision processes (MDPs). In 1960, the policy iteration method for MDPs was devised by Ron Howard. These are essential elements underlying the theory and algorithms of modern reinforcement learning methods. The term dynamic programming is due to Bellman (1957), who showed how these methods could be applied to a wide range of problems. Dynamic programming is widely considered the only feasible way of solving general stochastic optimal control problems. An early use of Monte Carlo methods to estimate action values in a reinforcement learning context was by Michie and Chambers (1968). The most distinctive method in reinforcement learning is temporal difference learning, proposed by Sutton (1988), which combines the ideas of the previously proposed dynamic programming and Monte Carlo methods. Q-learning was introduced by Watkins (1989), whereas the Sarsa algorithm was first explored by Rummery and Niranjan (1994), who called it modified Q-learning; later Sutton (1994) introduced the name Sarsa.

2.3 Reinforcement Learning Problem

The reinforcement learning problem is a straightforward framework for computational learning agents that use experience from their interaction with an environment to improve performance over time [3]. There is no explicit teacher to guide the learning agent as in supervised learning; instead, the agent communicates directly with the dynamic environment in which it operates. The agent is the learner and decision maker, and the thing it interacts with, comprising everything outside the agent, is called the environment. The agent uses its sensors to perceive the current state of the environment and is able to perform actions that cause the environment to change its state. The agent receives a scalar reward signal from the environment, which is feedback on the agent's immediate performance, and the agent tries to update its policy to maximize the reward over time. Based on this feedback signal, the agent forms a goal concerned with what is good in the long term. This goal incorporates any uncertainty that may be present in the system dynamics and in the future course of events.

Generally, it is considered that in reinforcement learning problems time is discrete, though the time units do not need to correspond to fixed (equal) intervals of real time. Time steps can be determined by events happening within the system, such as the environment's change of state or the moment when a new action has to be taken by the agent. At each time step $t$, the environment is in some state $s_t \in S$, where $S$ is the set of all states. The state space $S$ may be finite or infinite. The action $a_t$ performed by the agent at time step $t$ is selected from the set $A(s_t)$ of actions available from state $s_t$.

The generalized framework of the agent-environment interaction in reinforcement learning is shown in Fig. 2.1. The controller is the agent itself, which senses the environment elements, decides its state, and with a certain policy chooses an action from the lookup table. The action is the equivalent of the control signal in a real control system; it puts the agent in a new state, and the agent receives a scalar reward from the environment.

Figure 2.1: Framework of a reinforcement learning system. The agent (a control system with sensors, a policy/Q-table and an actuator) senses the state s, selects an action a, and receives a scalar reward r from the environment.
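The interaction cycle of Fig. 2.1 can be written down as a short loop. The following is a minimal Python sketch, assuming a hypothetical environment object with `reset()` and `step(action)` methods and a `policy` function; these names are illustrative and are not part of the thesis implementation.

```python
def run_episode(env, policy, max_steps=1000):
    """One pass of the agent-environment interaction: sense the state, act, receive a reward."""
    state = env.reset()                          # environment starts in some state s_0
    total_reward = 0.0
    for t in range(max_steps):
        action = policy(state)                   # the policy maps the sensed state to an action
        state, reward, done = env.step(action)   # environment changes state and emits a scalar reward
        total_reward += reward                   # the agent's goal is to maximize cumulative reward
        if done:                                 # terminal state reached (episodic task)
            break
    return total_reward
```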

For simplicity, it is assumed that time is discrete and that the state and action spaces are finite, even though many of the ideas of reinforcement learning can be extended to the continuous case. As a result of performing action $a_t$ in state $s_t$, on the next time step the agent receives a numerical reward signal $r_{t+1}$ and the environment transitions to a new state $s_{t+1}$. The numerical reward signal that the environment provides is the primary means for the agent to evaluate its performance. In general, this signal is stochastic. It is the means by which the designer of a reinforcement learning system can tell the agent what it is supposed to achieve, but not how. This constitutes a major difference from supervised learning systems, where a teacher always provides examples of the desired response. The goal of the agent is to maximize the cumulative reward, the long-term return, which is an additive function of the reward sequence. Since the environment is stochastic, the agent is supposed to maximize the expected return, taking into account any uncertainty pertaining to the system's elements. The agent strives to find a policy, a way of behaving, which maximizes this return. Mathematically, a policy is described by a probability distribution $\pi: S \times A \to [0,1]$, where $\pi(s,a)$ denotes the probability of taking action $a$ in state $s$.

One can distinguish two main types of reinforcement learning tasks: episodic and continuing tasks. In episodic (finite-horizon) tasks, there is a terminal state where an episode ends. The system can then be reinitialized to some starting conditions and a new episode begins. Continuing tasks, or infinite-horizon problems, on the other hand, consist of just one infinite sequence of state changes, actions and rewards. Reinforcement learning tasks can also be divided naturally into two types: non-sequential and sequential tasks. In non-sequential tasks, agents must learn mappings from situations to actions that maximize the expected immediate payoff. In sequential tasks, agents must learn mappings from situations to actions that maximize the expected long-term payoff. Sequential tasks are more difficult because the actions selected by the agents may influence their future situations and thus their future payoffs. In this case, the agents interact with their environment over an extended period of time, and they need to evaluate their actions on the basis of their long-term consequences.

In general, a return represents a cumulative (additive) function of the reward sequence. For an episodic task, for example, it is the sum of all rewards received from the beginning of an episode until its end:

$$R(s_0) = \sum_{k=0}^{T} r_{k+1},$$

where $T$ is the number of time steps in the episode and $s_0$ is the starting state. In the case of continuing tasks, there are many problems where one would value rewards obtained in the near future more than those received later. In this case, future rewards are discounted by a factor $\gamma$ and the return is defined as

$$R_t(s) = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1},$$

where $s_t = s$. Discounting with $0 < \gamma < 1$ ensures that the returns from all states are finite, which makes it possible to optimize this performance criterion. When $\gamma = 1$ (i.e., in the undiscounted case), the returns need not be finite in general. In this case, one has to ensure that some additional assumptions about the problem are satisfied. In particular, there has to exist a set of absorbing states, which are reached with probability 1 on any trajectory through the state space, and the immediate rewards in these states have to be zero. Undiscounted problems can be considered episodic tasks for which the number of stages in an episode is not known in advance and is stochastic. An episode ends when the system enters an absorbing state.
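As a small numerical illustration of the two return definitions above (the reward sequence is made up), the returns can be computed directly:

```python
def episodic_return(rewards):
    """R(s_0): sum of all rewards r_1 ... r_T received during one episode."""
    return sum(rewards)

def discounted_return(rewards, gamma=0.9):
    """R_t: sum over k of gamma^k * r_{t+k+1}; gamma < 1 keeps the sum finite."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [0, 0, 1, 0, 5]                  # hypothetical reward sequence
print(episodic_return(rewards))            # 6
print(discounted_return(rewards, 0.9))     # 0.81*1 + 0.6561*5 = 4.0905
```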

Generally, the state means whatever information is available to the agent. What kinds of information should and should not be considered in constituting the state is an important issue. The state signal should not be expected to inform the agent of everything about the environment, or of everything that would be useful to it in making decisions. A state signal that succeeds in retaining all relevant information is said to be Markov, or to have the Markov property. Reinforcement learning relies on the assumption that the system dynamics has the Markov property, which can be defined as follows:

$$\Pr\{s_{t+1}=s', r_{t+1}=r \mid s_t, a_t\} = \Pr\{s_{t+1}=s', r_{t+1}=r \mid s_0, a_0, r_0, s_1, a_1, r_1, \ldots, s_t, a_t, r_t\} \quad (2.1)$$

The Markov property means that the next state and immediate reward depend only on the current state and action. Systems that have the Markov property are called Markov decision processes (MDPs) [8]. In reality, most tasks are not strictly Markov but are usually quite close to MDPs. If the state and action spaces are finite, then the process is called a finite Markov decision process (finite MDP). Finite MDPs are particularly important to the theory of reinforcement learning. Theory based on the Markov property helps to understand and analyze the algorithms used to solve reinforcement learning problems. The algorithms are still very useful in practice even when the Markov property does not hold in all states.

Figure 2.2: A Markov decision process. Squares indicate visible variables (state information), and diamonds indicate actions. The state depends on the previous state and action, and the reward depends on the current state and action.

A particular finite MDP is defined by its state and action sets and by the one-step dynamics of the environment. Given any state and action, $s$ and $a$, the probability of each possible next state, $s'$, is

$$P^a_{ss'} = \Pr\{s_{t+1}=s' \mid s_t=s, a_t=a\}, \qquad s, s' \in S,\ a \in A(s). \quad (2.2)$$

These quantities are called transition probabilities. Similarly, given any current state and action, $s$ and $a$, together with any next state, $s'$, the expected value of the next reward is

$$R^a_{ss'} = E\{r_{t+1} \mid s_t=s, a_t=a, s_{t+1}=s'\}, \qquad s, s' \in S,\ a \in A(s). \quad (2.3)$$
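The quantities in Eqs. (2.2) and (2.3) can be stored directly as arrays indexed by (state, action, next state). The sketch below does this for a made-up two-state, two-action problem; the numbers are purely illustrative.

```python
import numpy as np

n_states, n_actions = 2, 2
# P[s, a, s'] = Pr{ s_{t+1} = s' | s_t = s, a_t = a }        (Eq. 2.2)
P = np.zeros((n_states, n_actions, n_states))
# R[s, a, s'] = E{ r_{t+1} | s_t = s, a_t = a, s_{t+1} = s' } (Eq. 2.3)
R = np.zeros((n_states, n_actions, n_states))

P[0, 0] = [0.8, 0.2];  R[0, 0] = [ 0.0, 1.0]   # action 0 in state 0
P[0, 1] = [0.1, 0.9];  R[0, 1] = [ 0.0, 2.0]   # action 1 in state 0
P[1, 0] = [1.0, 0.0];  R[1, 0] = [ 0.5, 0.0]   # action 0 in state 1
P[1, 1] = [0.5, 0.5];  R[1, 1] = [-1.0, 3.0]   # action 1 in state 1

assert np.allclose(P.sum(axis=2), 1.0)         # each (s, a) row is a probability distribution
```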

These quantities, $P^a_{ss'}$ and $R^a_{ss'}$, completely specify the most important aspects of the dynamics of a finite MDP. The class of MDPs is a restricted but important class of problems. By assuming that a problem is Markov, one can ignore the history of the process and thereby prevent an exponential increase in the size of the domain of the policy.

2.4 Reinforcement Learning Algorithms

The objective of a reinforcement learning algorithm is either to evaluate the performance of a given policy (the prediction problem) or to find an optimal policy (the control problem). Almost all reinforcement learning algorithms are based on estimating value functions, functions of states (or of state-action pairs) that estimate how good it is for the agent to be in a given state (or how good it is to perform a given action in a given state); this is a result of the roots of reinforcement learning in optimal control and dynamic programming. The notion of "how good" here is defined in terms of the expected return. Of course, the rewards the agent can expect to receive in the future depend on what actions it will take. Accordingly, value functions are defined with respect to policies. A policy $\pi$ is a mapping from each state $s \in S$ and action $a \in A(s)$ to the probability $\pi(s,a)$ of taking action $a$ when in state $s$. The value of a state $s$ under a policy $\pi$, denoted $V^{\pi}(s)$, is the expected return when starting in $s$ and following $\pi$ thereafter. For MDPs, $V^{\pi}(s)$ can be defined formally as

$$V^{\pi}(s) = E_{\pi}\{R_t \mid s_t = s\} = E_{\pi}\Big\{\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\Big|\, s_t = s\Big\}, \quad s \in S. \quad (2.4)$$

Similarly, the action-value function can be defined as

$$Q^{\pi}(s,a) = E_{\pi}\{R_t \mid s_t = s, a_t = a\} = E_{\pi}\Big\{\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\Big|\, s_t = s, a_t = a\Big\}, \quad s \in S,\ a \in A(s). \quad (2.5)$$

It represents the expected return starting from state $s$, taking action $a$ in $s$ and then following policy $\pi$ forever. There is one fundamental property of value functions that makes them valuable for solving reinforcement learning tasks. The state-value function satisfies a recursive equation that, for the discounted case, has the following form:

$$V^{\pi}(s) = E_{\pi}\{R_t \mid s_t = s\} = \sum_{a} \pi(s,a) \sum_{s'} P^a_{ss'} \big[R^a_{ss'} + \gamma V^{\pi}(s')\big], \quad s \in S, \quad (2.6)$$

where $s_{t+1} = s'$. This is the Bellman equation, and it represents the relationship between the value of a state and the values of its successors. This system of equations has a unique solution, which is the state-value function for policy $\pi$. The Bellman equation averages over all the possibilities, weighting each by its probability of occurring. It states that the

value of the start state must equal the (discounted) value of the expected next state, plus the reward expected along the way. A similar equation is satisfied by the action-value function.

Value functions are useful because they define a partial ordering over policies. A policy $\pi$ is considered better than another policy $\pi'$ if and only if $V^{\pi}(s) \geq V^{\pi'}(s)$ for all $s \in S$. The optimal policy is a policy corresponding to the maximum state-value function $V^*$, which is called the optimal state-value function. The optimal state-value function satisfies the Bellman optimality equation:

$$V^*(s) = \max_{a \in A(s)} Q^{\pi^*}(s,a) = \max_{a \in A(s)} \sum_{s'} P^a_{ss'}\big[R^a_{ss'} + \gamma V^*(s')\big], \quad s \in S. \quad (2.7)$$

For any finite MDP, there exists a unique solution to this system of equations, $V^*$, which is achievable by a deterministic optimal policy $\pi^*$. The optimal policy can be obtained from the optimal state-value function by a one-step look-ahead search. For each state, there will be at least one action at which the maximum is attained in the Bellman optimality equation. A policy that assigns non-zero probability to such an action and zero probability to all others will be an optimal policy. This policy is greedy with respect to $V^*$, as well as optimal in the long run. All algorithms discussed in this section assume that the value functions are represented by lookup tables, in which there is a value entry for each state (or state-action pair).

2.4.1 Dynamic Programming

A theoretical foundation for reinforcement learning algorithms is provided by dynamic programming, which refers to a collection of algorithms that can be used to compute optimal policies given a perfect model of the environment, Eqs. (2.2)-(2.3), as a Markov decision process (MDP). Dynamic programming methods can compute optimal policies by using the value functions and Bellman equations to guide the search. Rather than solving the Bellman equations directly, dynamic programming methods treat them as recursive update rules. Dynamic programming algorithms are bootstrapping: they update the estimates of state values based on the estimates of the values of the successor states. Classical dynamic programming algorithms are of limited utility in reinforcement learning, both because of their assumption of a perfect model and because of their great computational expense. Nevertheless, the key idea of dynamic programming, as of reinforcement learning, is the use of value functions to organize and structure the search for good policies.

The policy evaluation task is concerned with estimating the values of states when the agent acts according to some fixed policy $\pi$. Assume that the agent has adopted a deterministic policy $\pi$, where $\pi(s) = a \in A(s)$, and is interested in computing the state-value function $V^{\pi}$ associated with this policy. The agent starts with some arbitrary initial approximation of

the state-value function, $V^{\pi}_0$, and uses the Bellman equation for the state-value function as a recursive update rule to improve the approximation:

$$V^{\pi}_{k+1}(s) = \sum_{s'} P^a_{ss'}\big[R^a_{ss'} + \gamma V^{\pi}_{k}(s')\big], \quad s \in S,\ a = \pi(s),\ k = 0, 1, 2, \ldots \quad (2.8)$$

This algorithm is called iterative policy evaluation. To produce each successive approximation $V_{k+1}$ from $V_k$, iterative policy evaluation applies the same operation to each state: it replaces the old value of $s$ with a new value obtained from the old values of the successor states of $s$ and the expected immediate rewards, along all the one-step transitions possible under the policy being evaluated. This kind of operation is called a full backup. Each iteration of iterative policy evaluation backs up the value of every state once to produce the new approximate value function $V_{k+1}$. The unique solution to the Bellman equation is a fixed point of this update rule. Policy evaluation can be shown to converge in the limit to the correct $V^{\pi}$, due to the contraction property of the operator (2.8).

Estimating value functions is particularly useful for finding better policies. The policy improvement algorithm uses the action-value function to improve the current policy. The process of making a new policy that improves on an original policy, by making it greedy or nearly greedy with respect to the value function of the original policy, is called policy improvement. If $Q^{\pi}(s,a) > V^{\pi}(s)$ for some $a \neq \pi(s)$, then it is better to select action $a$ in state $s$ than to select $\pi(s)$. This follows from the policy improvement theorem, which states that for any pair of deterministic policies $\pi$ and $\pi'$ such that $Q^{\pi}(s, \pi'(s)) \geq V^{\pi}(s)$ for all $s \in S$, policy $\pi'$ must be as good as or better than $\pi$. In this manner we can construct a new improved policy $\pi'$, which is greedy with respect to $V^{\pi}$:

$$\pi'(s) = \arg\max_{a \in A(s)} Q^{\pi}(s,a) = \arg\max_{a \in A(s)} \sum_{s'} P^a_{ss'}\big[R^a_{ss'} + \gamma V^{\pi}(s')\big]. \quad (2.9)$$

Policy evaluation and policy improvement can be interleaved to construct a sequence of successively improving policies. This algorithm, known as policy iteration, constructs an improving sequence

$$\pi_0 \xrightarrow{E} V^{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} V^{\pi_1} \xrightarrow{I} \cdots \xrightarrow{I} \pi^* \xrightarrow{E} V^{\pi^*},$$

where E denotes a policy evaluation and I denotes a policy improvement. Each policy is guaranteed to be a strict improvement over the previous one (unless it is already optimal). Because a finite MDP has only a finite number of policies, this process must converge to an optimal policy and the optimal value function in a finite number of iterations. This way of finding an optimal policy is called policy iteration.

One drawback of the policy iteration algorithm is the fact that the policy evaluation step converges only in the limit; each of its iterations involves a policy evaluation, which may itself be a protracted iterative computation requiring multiple sweeps through the state set. However, policy evaluation can be stopped before convergence occurs. The value iteration algorithm performs just one policy evaluation iteration, one sweep over the state space, followed by a policy improvement step:

$$V_{k+1}(s) = \max_{a} \sum_{s'} P^a_{ss'}\big[R^a_{ss'} + \gamma V_{k}(s')\big], \quad s \in S. \quad (2.10)$$

Value iteration estimates the value of the optimal policy directly and can be seen as turning the Bellman optimality equation into an update rule, similarly to policy evaluation. Like policy evaluation, value iteration converges in the limit to the optimal value function $V^*$ due to the contraction property of the operator (2.10), and formally requires an infinite number of iterations. Value iteration effectively combines, in each of its sweeps, one sweep of policy evaluation and one sweep of policy improvement. Policy iteration relies on the full convergence of policy evaluation, while value iteration performs just one step of policy evaluation between successive policy improvement steps. Faster convergence is often achieved by interposing multiple policy evaluation sweeps between each policy improvement sweep. An intermediate solution is to perform some fixed number (k > 1) of policy evaluation steps before the policy improvement step. This variation (optimistic policy iteration) also converges, and it can be more efficient than value iteration because policy evaluation iterations are less expensive than value iteration iterations when the number of actions is quite large.

A major drawback of the above algorithms is that they require an update of the value function over the entire state set of the MDP. In problems with large state spaces, one may get stuck on a single iteration for a long time before any improvements in performance are made, and a single sweep can be prohibitively expensive. Asynchronous dynamic programming algorithms are in-place iterative dynamic programming algorithms that allow states to be updated in an arbitrary order. In fact, some of the states may be updated several times before the others get their turn. To converge correctly, these algorithms require that all states continue to be updated infinitely often in the limit. Asynchronous dynamic programming does not guarantee that an optimal policy is reached with less computation, but it enables faster policy improvement. This form of dynamic programming allows great flexibility in selecting the states to which backup operations are applied, and also makes it easier to intermix computation with real-time interaction. Compared with other methods for solving MDPs, dynamic programming methods are actually quite efficient. But they are sometimes thought to be of limited applicability because of the curse of dimensionality, the fact that the number of states often grows exponentially with the number of state variables.
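The update rules of Eqs. (2.8) and (2.10) translate almost line by line into code. The sketch below assumes the tabular model arrays P and R of the earlier MDP example and a deterministic policy given as an array of action indices; the stopping threshold theta is an illustrative choice, not part of the thesis.

```python
import numpy as np

def policy_evaluation(P, R, policy, gamma=0.9, theta=1e-8):
    """Iterative policy evaluation, Eq. (2.8): full backups under a deterministic policy."""
    n_states = P.shape[0]
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            a = policy[s]
            v_new = np.sum(P[s, a] * (R[s, a] + gamma * V))   # expectation over successors s'
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V

def value_iteration(P, R, gamma=0.9, theta=1e-8):
    """Value iteration, Eq. (2.10): one evaluation sweep folded into the improvement step."""
    n_states, n_actions = P.shape[0], P.shape[1]
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            q_sa = [np.sum(P[s, a] * (R[s, a] + gamma * V)) for a in range(n_actions)]
            v_new = max(q_sa)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    greedy_policy = np.array([
        int(np.argmax([np.sum(P[s, a] * (R[s, a] + gamma * V)) for a in range(n_actions)]))
        for s in range(n_states)
    ])
    return V, greedy_policy
```

Stopping the sweeps once the largest change falls below theta corresponds to truncating the formally infinite iteration discussed above.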

2.4.2 Monte Carlo Methods

Dynamic programming methods can only be used when a model of the system in terms of transition probabilities and expected rewards is available. Of course, an agent can learn a model and then use it in dynamic programming methods. However, learning a value function directly from interaction with the environment can be more efficient. Monte Carlo (MC) methods estimate value functions directly based on the experience of the agent, without complete knowledge of the environment. By experience we mean sample sequences of states, actions and rewards from on-line or simulated interaction with the environment. The idea underlying MC methods in general is to use available samples of a random variable to estimate its expected value as the sample mean. Recall that the value functions defined for states and actions are actually expected values of the long-term returns, which are random variables. MC methods estimate the state or action values by averaging sample returns observed during the interaction of the agent with its environment. Since samples of complete returns can be obtained only for finite tasks, MC methods are defined for episodic tasks. For each state (or state-action pair), a sample return is the sum of the rewards received from the occurrence of the state (or state-action pair) until the end of an episode. As more samples are observed, their average converges to the true expected value of the return under the policy used by the agent for generating the sample sequences. MC methods are thus incremental in an episode-by-episode sense, but not in a step-by-step sense. Despite the differences between Monte Carlo and dynamic programming methods, most key ideas, such as policy evaluation, policy improvement and generalized policy iteration, carry over from dynamic programming to Monte Carlo methods.

If a model is not available, then it is particularly useful to estimate action values rather than state values, which is what Monte Carlo methods consider. One can design a policy iteration algorithm where the policy evaluation step estimates the value function using MC methods. There is one complication, however, that did not arise in dynamic programming. If the agent adopts a deterministic policy $\pi$, then the experience generated by its interaction with the environment contains samples only for the actions suggested by policy $\pi$. The values of other actions will not be estimated, and there will be no information on which to base the policy improvement step. Therefore, maintaining sufficient exploration is key to the success of policy iteration using MC methods. One solution is for the agent to adopt a stochastic policy with non-zero probabilities of selecting all actions in all states: a soft stochastic policy, such that $\pi(s,a) > 0$ for all $a \in A(s)$. There are different ways to implement this approach, such as on-policy methods and off-policy methods.

In the case of on-policy methods, the agent uses a soft stochastic policy when it interacts with the environment to generate experience, and it evaluates its performance under this policy. There are many possible variations on on-policy methods. One possibility is to gradually shift the policy toward a deterministic optimal policy. In order to benefit from the currently available knowledge and do sufficient exploration at the same time, the agent gradually biases its policy to take greedy actions more often. For instance, the agent can use an ε-greedy policy, which selects with probability (1 - ε) the action that is greedy with respect to the current estimate of the action-value function and with probability ε any other action (where ε has a small positive value). Another popular choice of soft policy relies on the Boltzmann distribution:

$$\pi(s_t, a) = \frac{e^{Q(s_t,a)/\tau}}{\sum_{b \in A(s_t)} e^{Q(s_t,b)/\tau}},$$

where $\tau$ is a positive temperature parameter that decreases to zero in the limit.
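Both soft policies just described can be sketched as follows; the Q-table is assumed to be a dictionary keyed by (state, action) pairs, and the parameter values are illustrative.

```python
import math
import random

def epsilon_greedy(q, state, actions, epsilon=0.1):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: q.get((state, a), 0.0))

def boltzmann(q, state, actions, tau=1.0):
    """Sample an action with probability proportional to exp(Q(s,a)/tau)."""
    prefs = [math.exp(q.get((state, a), 0.0) / tau) for a in actions]
    total = sum(prefs)
    return random.choices(actions, weights=[p / total for p in prefs], k=1)[0]
```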

The second approach is off-policy learning: the agent uses one policy to interact with the environment and generate experience (the behavior policy), but estimates the value function for a different policy (the estimation policy). An immediate advantage of this approach is that the estimation policy can be deterministic (e.g., greedy) while the behavior policy is stochastic and fixed, ensuring sufficient exploration. In particular, an agent can try to learn the value of the optimal policy while following an arbitrary stochastic policy. Policy iteration with MC-based policy evaluation converges in the limit to the optimal policy (both for on-policy and off-policy learning) as long as every state-action pair is visited infinitely often. But in practice one encounters the same problem as with dynamic programming methods: one cannot wait forever until the policy evaluation step converges. Stopping after some finite number of observations yields an approximate version of the algorithm. Though convergence seems intuitively inevitable, no formal proof exists for this case.

Monte Carlo methods learn value functions and optimal policies from experience in the form of sample episodes. This gives them several advantages over dynamic programming methods. First, they can be used to learn optimal behavior directly from interaction with the environment, with no model of the environment's dynamics. Second, they can be used with simulation or sample models. Third, it is easy and efficient to focus Monte Carlo methods on a small subset of the states: a region of special interest can be accurately evaluated without going to the expense of accurately evaluating the rest of the state set. Another advantage of Monte Carlo methods is that they may be less harmed by violations of the Markov property, as they do not bootstrap.
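An every-visit Monte Carlo estimate of the state values, averaging complete sample returns episode by episode as described above, might be sketched as follows; generate_episode is an assumed helper that returns one episode as a list of (state, action, reward) triples generated under the policy being evaluated.

```python
from collections import defaultdict

def mc_policy_evaluation(generate_episode, num_episodes=1000, gamma=0.9):
    """Every-visit Monte Carlo: V(s) is the average of sample returns observed from s."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for _ in range(num_episodes):
        episode = generate_episode()            # [(s_0, a_0, r_1), (s_1, a_1, r_2), ...]
        G = 0.0
        for state, _action, reward in reversed(episode):
            G = reward + gamma * G              # return following this occurrence of the state
            returns_sum[state] += G
            returns_count[state] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}
```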

2.4.3 Temporal Difference Learning

A combination of ideas from dynamic programming and Monte Carlo methods yields temporal difference (TD) learning, a central idea of reinforcement learning [9]. Similarly to the Monte Carlo method, this approach allows learning directly from on-line or simulated experience without any prior knowledge of the system's model. The feature shared by TD and dynamic programming methods is that they both use bootstrapping for estimating the value functions. TD methods combine the sampling of Monte Carlo with the bootstrapping of dynamic programming.

TD algorithms update the estimated values based on each observed state transition and on the immediate reward received from the environment on this transition. The simplest version of such algorithms, one-step TD, performs the following update on every time step:

$$V(s_t) \leftarrow V(s_t) + \alpha\big[r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\big]. \quad (2.11)$$

This method bootstraps by estimating in part on the basis of other estimates, but it uses sample updates instead of the full updates of dynamic programming. Only one successor state, observed during the interaction with the environment, is used to update $V$, instead of using the values of all possible successors weighted according to their probabilities. For any fixed policy $\pi$, the one-step TD algorithm converges in the limit to $V^{\pi}$: in the mean for a constant step-size parameter $\alpha$, and with probability 1 if the step-size satisfies the stochastic approximation conditions $\alpha_t \geq 0$ for all $t$, $\sum_{t=0}^{\infty}\alpha_t = \infty$ and $\sum_{t=0}^{\infty}\alpha_t^2 < \infty$. The one-step TD method can be used for the policy evaluation step of the policy iteration algorithm. As with MC methods, sufficient exploration in the generated experience must be ensured in order to find the optimal policy. TD methods have an advantage over dynamic programming methods in that they do not require a model of the environment, of its reward, or of its next-state probability distributions. The most obvious advantage of TD methods over Monte Carlo methods is that they are naturally implemented in an on-line, fully incremental fashion. Again, one can use either on-policy or off-policy approaches to ensure adequate exploration.

An on-policy method must estimate $Q^{\pi}(s,a)$ for the current behavior policy $\pi$ and for all states $s$ and actions $a$. This can be done using essentially the same TD method described above for learning $V^{\pi}$. An example of the on-policy approach is the Sarsa algorithm, which performs the following update on every time step:

$$Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha\big[r_{t+1} + \gamma Q(s_{t+1},a_{t+1}) - Q(s_t,a_t)\big]. \quad (2.12)$$

If $s_{t+1}$ is terminal, then $Q(s_{t+1},a_{t+1})$ is defined as zero. This rule uses every element of the quintuple of events $(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})$ that makes up a transition from one state-action pair to the next. This quintuple gives rise to the name Sarsa for the algorithm. Here the action-value function is estimated in order to obtain a corresponding greedy policy without the need for a system model. The algorithm estimates the action-value function for the current behavior policy, which is ε-greedy with respect to the last estimate of $Q$. This algorithm converges in the limit with probability 1 to an optimal ε-greedy policy if all state-action pairs are visited infinitely often. Convergence to the optimal greedy policy occurs if $\varepsilon = \mathrm{Const}/\mathrm{time}$ (the amount of exploration diminishes over time).

The most popular representative of the off-policy approach is the Q-learning algorithm [10]; its simplest form, one-step Q-learning, is defined by

$$Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha\big[r_{t+1} + \gamma \max_{a} Q(s_{t+1},a) - Q(s_t,a_t)\big]. \quad (2.13)$$

This algorithm estimates the optimal action-value function $Q^*$ regardless of what behavior policy is followed, since the action-value estimate in the update is selected according to the greedy policy from each successor state (the maximization at $s_{t+1}$). If all state-action pairs continue to be updated infinitely often and the step-size parameter satisfies the stochastic approximation conditions, the algorithm converges in the limit to $Q^*$ with probability 1.
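A tabular version of one-step Q-learning, Eq. (2.13), with an ε-greedy behavior policy can be sketched as follows; the environment interface (reset, step, available_actions) is an assumption, as before, and the parameter values are illustrative.

```python
import random
from collections import defaultdict

def q_learning(env, num_episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular one-step Q-learning, Eq. (2.13), with an epsilon-greedy behavior policy."""
    Q = defaultdict(float)
    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            actions = env.available_actions(state)
            if random.random() < epsilon:                      # explore
                action = random.choice(actions)
            else:                                              # exploit current estimates
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # off-policy target: maximize over actions in the successor state
            best_next = 0.0 if done else max(
                Q[(next_state, a)] for a in env.available_actions(next_state))
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```

Replacing the maximization over successor actions with the value of the action actually chosen next turns this sketch into the on-policy Sarsa update of Eq. (2.12).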

The major difference between TD and MC methods is that MC methods perform updates based on the entire sequence of observed rewards until the end of an episode, while TD methods use the immediate reward and the sampled next state for the updates. An intermediate approach is to use the n-step truncated return, $R^{(n)}_t$, obtained from a sequence of $n > 1$ transitions and rewards. This method is rarely used in practice, however, because more efficient alternatives exist. To go one step further, one can compute the updates to the value function estimate based on several n-step returns. The family of methods TD(λ), with $0 \leq \lambda \leq 1$, combines n-step returns weighted proportionally to $\lambda^{n-1}$:

$$R^{\lambda}_t = (1-\lambda)\sum_{n=1}^{\infty} \lambda^{n-1} R^{(n)}_t. \quad (2.14)$$

The weights decay by λ with each additional time step. It is easy to see that for episodic tasks the Monte Carlo update is obtained by setting λ = 1, whereas the one-step TD method is obtained by setting λ = 0. The above algorithm is known as the forward view of the TD(λ) algorithm, and the updates to the value function estimates are calculated as

$$V(s_t) \leftarrow V(s_t) + \alpha\big[R^{\lambda}_t - V(s_t)\big]. \quad (2.15)$$

Obviously, to implement this algorithm directly, we would still need to wait indefinitely in order to compute $R^{\lambda}_t$. Thus, the forward view is mainly theoretical and not directly implementable, because it is not causal, using at each step knowledge of what will happen many steps later, but it leads to a practical algorithm. The backward view of the TD(λ) algorithm provides a causal, incremental mechanism for approximating the forward view and, in the off-line case, for achieving it exactly. This variant introduces the use of eligibility traces, an additional memory variable associated with each state, which establishes the eligibility of a particular event for participating in updating the value function. An eligibility trace is a variable $e_t(s)$ associated with each state on every time step. These variables are initialized to 0 at the beginning of learning and are updated at each stage for $s \in S$ as follows:

$$e_t(s) = \begin{cases} \gamma\lambda e_{t-1}(s) & \text{if } s \neq s_t \\ \gamma\lambda e_{t-1}(s) + 1 & \text{if } s = s_t \end{cases} \quad (2.16)$$

where γ is the discount rate and λ is the trace-decay parameter. This kind of eligibility trace is called an accumulating trace because it accumulates each time a state is visited, and

then fades away when the state is not visited. The trace for a state is increased every time the state is visited and decreases exponentially otherwise. We denote by $\delta_t$ the TD error, or temporal difference, at stage $t$:

$$\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t). \quad (2.17)$$

On every time step, all the previously visited states are updated proportionally to their eligibility traces:

$$V(s) \leftarrow V(s) + \alpha\,\delta_t\, e_t(s). \quad (2.18)$$

As a result, states visited earlier are given less credit (or blame) for the current TD error, since their eligibility traces have decreased. There are two ways of performing updates. In on-line updating, changes to the estimate of the value function are made as soon as the appropriate increment is computed. In off-line updating, the updates are accumulated and the estimates are committed to their new values only at the end of an episode.

A slightly modified eligibility trace, known as the replacing trace, shows significantly better performance in some cases. If a state is revisited before the trace due to the first visit has fully decayed to zero, then with accumulating traces the revisit causes a further increment in the trace, driving it greater than 1, whereas with replacing traces the trace is reset to 1. The replacing trace for a discrete state $s$ is defined by

$$e_t(s) = \begin{cases} \gamma\lambda e_{t-1}(s) & \text{if } s \neq s_t \\ 1 & \text{if } s = s_t \end{cases} \quad (2.19)$$

An illustrative diagram, Fig. 2.3, shows the trace record for a particular state under both the accumulating trace and the replacing trace. Although replacing traces are only slightly different from accumulating traces, they can produce a significant improvement in learning rate.

Figure 2.3: Eligibility traces: accumulating and replacing trace for a particular state, plotted against the times of state visits.
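A single backward-view update, following Eqs. (2.16)-(2.19), can be sketched as follows; V and e are assumed to be dictionaries maintained by the caller across time steps, and the parameter values are illustrative.

```python
def td_lambda_step(V, e, s, r, s_next, alpha=0.1, gamma=0.9, lam=0.9, replacing=False):
    """One backward-view TD(lambda) step: decay all traces, bump the trace of s,
    then move every state toward the TD target in proportion to its trace."""
    delta = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)    # TD error, Eq. (2.17)
    for state in list(e):
        e[state] *= gamma * lam                               # trace decay, Eqs. (2.16)/(2.19)
    if replacing:
        e[s] = 1.0                                            # replacing trace, Eq. (2.19)
    else:
        e[s] = e.get(s, 0.0) + 1.0                            # accumulating trace, Eq. (2.16)
    for state, trace in e.items():
        V[state] = V.get(state, 0.0) + alpha * delta * trace  # Eq. (2.18)
    return delta
```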

The TD(λ) forward view converges in the limit with probability 1 under stochastic approximation conditions. The backward and forward views can be shown to be equivalent in the case of off-line updating. TD(1) with off-line updating is the Monte Carlo method; TD(1) with on-line updating is an approximation of MC. This incremental implementation of Monte Carlo methods is much more general: it can be applied to discounted continuing tasks, not just to episodic ones.

The idea of TD(λ) with eligibility traces can be applied to the previously described Sarsa algorithm; the result is called Sarsa(λ), and the original Sarsa is called one-step Sarsa. A trace for each state-action pair, instead of a trace for each state, is needed in this algorithm. If $e_t(s,a)$ denotes the trace for the state-action pair $(s,a)$, the updating equations of Sarsa(λ) can be written by substituting state-action variables for the state variables in Eqs. (2.17)-(2.19):

$$Q(s,a) \leftarrow Q(s,a) + \alpha\,\delta_t\, e_t(s,a), \quad (2.20)$$

$$\delta_t = r_{t+1} + \gamma Q(s_{t+1},a_{t+1}) - Q(s_t,a_t), \quad (2.21)$$

and

$$e_t(s,a) = \begin{cases} 1 & \text{if } s = s_t \text{ and } a = a_t \\ \gamma\lambda e_{t-1}(s,a) & \text{otherwise} \end{cases} \quad \text{for all } s \in S,\ a \in A. \quad (2.22)$$

One-step Sarsa and Sarsa(λ) are on-policy algorithms, meaning that they approximate $Q^{\pi}(s,a)$, the action values for the current policy $\pi$, and then improve the policy gradually based on the approximate values for the current policy. Like Sarsa(λ), the idea of the eligibility trace can be combined with Q-learning; the result is called Q(λ), but the combination is not straightforward because Q-learning is an off-policy method. The simplest form of Q(λ), also known as Watkins's Q(λ), uses eligibility traces just as in Sarsa(λ), except that they are set to zero whenever an exploratory (non-greedy) action is taken. The trace update is best thought of as occurring in two steps. First, the traces for all state-action pairs are either decayed by γλ or, if an exploratory action was taken, set to 0. Second, the trace corresponding to the current state and action is incremented by 1. Cutting off traces every time an exploratory action is taken loses much of the advantage of using eligibility traces. If exploratory actions are frequent, as they often are early in learning, then only rarely will backups of more than one or two steps be done, and learning may be little faster than one-step Q-learning. This is the main drawback of Watkins's Q(λ). Peng's Q(λ) is a modified approach that overcomes this problem; it is a hybrid of Sarsa(λ) and Watkins's Q(λ). Unlike Q-learning, there is no distinction between exploratory and greedy actions. Each component backup is over many steps of actual experience, and all but the last are capped by a final maximization over actions. The component backups are neither on-policy nor off-policy.
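For completeness, the Sarsa(λ) updates of Eqs. (2.20)-(2.22) can be sketched in the same style as the state-value version above, with an optional flag that mimics the trace cutting of Watkins's Q(λ) after an exploratory action; this is only an illustrative sketch, not the thesis implementation.

```python
def sarsa_lambda_step(Q, e, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9, lam=0.9,
                      greedy_action_taken=True, cut_traces=False):
    """One Sarsa(lambda) backward-view step over a (s, a, r, s', a') transition."""
    delta = r + gamma * Q.get((s_next, a_next), 0.0) - Q.get((s, a), 0.0)   # Eq. (2.21)
    for key in list(e):
        # Watkins-style cut: zero all traces after an exploratory (non-greedy) action
        e[key] = 0.0 if (cut_traces and not greedy_action_taken) else e[key] * gamma * lam
    e[(s, a)] = 1.0                                                         # Eq. (2.22)
    for key, trace in e.items():
        Q[key] = Q.get(key, 0.0) + alpha * delta * trace                    # Eq. (2.20)
    return delta
```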

Peng's Q(λ) performs well empirically, significantly better than Watkins's Q(λ) and almost as well as Sarsa(λ). But it is very complex and cannot be implemented as simply as Watkins's Q(λ). The TD(λ) methods, unfortunately, require the Q-values of all state-action pairs to be updated at every step, so they require much more computation than one-step methods. A solution to this problem was proposed by Cichosz [11], called the truncated temporal difference (TTD) method, which stores a fixed number of state-action pairs (a fixed window into the past). At each time step, the value of the least recent entry is updated and the window is shifted. In this way great computational savings are achieved. Although TD(λ) methods require more computation than one-step methods, in return they offer significantly faster learning, particularly when rewards are delayed by many steps. By adjusting λ one can place eligibility trace methods anywhere along a continuum from Monte Carlo to one-step TD methods. Monte Carlo methods have advantages in non-Markov tasks because they do not bootstrap, whereas the TD method can be applied only to Markov tasks. Eligibility traces are the first line of defense against both long-delayed rewards and non-Markov tasks.

2.5 Multi-agent Reinforcement Learning

2.5.1 Multi-agent Systems

A powerful single agent may be capable of controlling or optimizing a system in a centralized way more efficiently, since anything that can be computed in a distributed system can be moved to a single computer. However, distributed computations are sometimes easier to understand and easier to develop, especially when the problem being solved is itself distributed [12]. Distribution can lead to computational algorithms that might not have been discovered with a centralized approach. There are also times when a centralized approach is impossible, because the systems and data belong to independent organizations that want to keep their information private and secure for competitive reasons. To take advantage of a distributed system, intelligent computational agents need to be distributed and embedded throughout the enterprise. The agents would be knowledgeable about the information resources that are local to them, and would cooperate to provide global access to, and better management of, that information. For the practical reason that the systems are too large and dynamic for global solutions to be formulated and implemented, the agents need to execute autonomously and be developed independently.

Multi-agent environments are typically open, have no centralized designer, and contain agents that are autonomous and distributed and may be self-interested or cooperative. Agents communicate for the better achievement of their own goals or of the common goal of the system in which they exist. Communication can enable the agents to coordinate their actions and behavior, resulting in systems that are more coherent. Coordination is a property of a system of agents performing some activity in a shared environment, which

enables them to achieve the goal more easily. Coordination among non-antagonistic agents is called cooperation, whereas coordination among competitive or simply self-interested agents is called negotiation. To cooperate successfully, each agent must maintain a model of the other agents and also develop a model of future interactions. Through negotiation, a joint decision is reached by two or more agents, each trying to reach an individual goal or objective. In cooperative situations, agents can learn complementary policies to solve the problem; this amounts to role specialization rather than the development of identical behavior. Agents can also transfer learning to similar situations, i.e., once agents learn to coordinate on a given problem, they can learn to coordinate quickly on a similar problem.

Multi-agent systems differ from single-agent systems in that there is no global control and no globally consistent knowledge. Multi-agent systems are more flexible and fault-tolerant, as several simple agents are easier to handle and cheaper to build than a single powerful robot that can carry out many different tasks. Distribution also brings the inherent advantages of distributed systems, such as scalability, fault tolerance, and parallelism. Even when a distributed approach is not required, multiple agents may still provide an excellent way of scaling up to approximate solutions for very large problems by streamlining the search through the space of possible policies; indeed, many researchers have proposed multiple agents in lieu of a single agent to make a complex learning task easier and to achieve better performance by combining the outcomes of multiple agents.

2.5.2 Multi-agent Reinforcement Learning

It is natural to apply reinforcement learning to multi-agent systems, since an agent in a multi-agent system may know little about the others because information is distributed. Even when an agent has some prior information about the others, their behavior may change over time because they are learning. Many researchers have focused on top-down approaches to building distributed systems, creating them from a global vantage point. The main drawback of this approach is the extraordinary complexity of designing such agents, since it is extremely difficult to anticipate all possible interactions and contingencies ahead of time in complex systems. Other researchers have taken the opposite approach, combining large numbers of relatively unsophisticated agents in a bottom-up manner and observing what emerges when they are put together into a group. This amounts to a sort of iterative procedure: designing a set of agents, observing their group behavior, and repeatedly adjusting the design and noting its effect on the group behavior. Multi-agent reinforcement learning attempts to combine the advantages of both approaches to the design of multi-agent systems. It achieves the simplicity of the bottom-up approach by allowing the use of relatively unsophisticated agents that learn on the basis of their own experiences. At the same time, reinforcement-learning agents adapt to a top-down global reinforcement signal, which guides their behavior toward the achievement of complex pre-defined goals. As a result, very robust systems for complex problems can be created with a minimum of human effort [13].

There are several key issues to consider in applying reinforcement learning algorithms to a multi-agent system. How does an agent treat other agents working in the same environment? How can reinforcement-learning agents be cooperative? A number of researchers have investigated the application of sequential reinforcement learning algorithms in multi-agent contexts. Although much of the work has been in simplistic domains such as grid worlds, several interesting applications have appeared that point to the promise of sequential multi-agent reinforcement learning. In most cases, single-agent reinforcement learning methods are applied without much modification. Such an approach treats the other agents as a part of the environment [14], i.e., a learning agent does not explicitly consider the other agents in the system. There are two problems with this approach. First, the environment in this treatment is non-stationary, since the other agents are learning and changing their responses, while the convergence of single-agent reinforcement learning is based on the assumption that the environment is stationary. Second, an agent that does not take other agents into account may have worse performance than one that does. However, explicitly considering other agents is difficult in noisy environments where an agent cannot reliably discern the states and actions of the others. A more decentralized approach is to give each agent the capability of independent learning in the environment while allowing the agents to share learning experience when they find it beneficial. Three ways of making the agents in a multi-agent system cooperative have been proposed in [15]. First, agents can communicate instantaneous information such as sensations, actions, or rewards. Second, agents can communicate episodes, i.e., sequences of (sensation, action, reward) triples experienced by the agents. Third, agents can communicate learned decision policies. In a simulation of a simple prey-hunter experiment with multi-agent reinforcement learning, these cooperation mechanisms were found to outperform uncooperative agents that learned for the same number of time steps.
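As a rough illustration of these ideas, the sketch below shows independent tabular learners that each treat the others as part of the environment, together with the third cooperation mechanism (communicating learned decision policies), approximated here by periodically averaging Q-tables. The class, function, and parameter names are hypothetical and are not taken from [15].

from collections import defaultdict

class IndependentLearner:
    # A tabular learner that treats the other agents as part of its environment.
    def __init__(self, alpha=0.1, gamma=0.95):
        self.Q = defaultdict(float)
        self.alpha, self.gamma = alpha, gamma

    def update(self, s, a, r, s_next, actions):
        # One-step Q-learning backup based only on this agent's own experience.
        best_next = max(self.Q[(s_next, b)] for b in actions)
        self.Q[(s, a)] += self.alpha * (r + self.gamma * best_next - self.Q[(s, a)])

def share_policies(agents):
    # Cooperation by communicating learned decision policies, approximated
    # here by averaging the agents' Q-tables entry by entry.
    keys = set().union(*(agent.Q.keys() for agent in agents))
    for key in keys:
        avg = sum(agent.Q[key] for agent in agents) / len(agents)
        for agent in agents:
            agent.Q[key] = avg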

Chapter 3

Task-Oriented Reinforcement Learning

3.1 Open Problems in Reinforcement Learning

There are a variety of reinforcement-learning techniques that work effectively on a variety of small problems, but very few of these techniques scale well to larger problems, because it is very difficult to solve arbitrary problems in the general case. Much of the work has been in simplistic domains such as maze learning, considering stationary environments where the agent has the opportunity to learn its policy in an episodic manner. Learning in a partially observable and non-stationary environment, or in a more realistic, complex, and dynamic environment, is still one of the challenging problems in the area of reinforcement learning. Until now there has been little research on this matter, and it considers only slowly varying non-stationary environments. Besides these, many important problems remain unsolved in reinforcement learning. The first key problem is to develop reinforcement learning methods for hierarchical problem solving. For very large search spaces, where the distance to the goal and the branching factor are large, no search method can work well; often such large search spaces have a hierarchical structure that can be exploited to reduce the cost of search. The second key problem is to develop intelligent exploration methods: weak exploration methods that rely on random or biased random choices of actions cannot be expected to scale well to large, complex spaces. The third problem is that optimizing the cumulative discounted reward is not always appropriate. In problems where the system needs to operate continuously, a better goal is to maximize the average reward per unit time; however, algorithms for this criterion are more complex and not as well behaved. Another difficult problem is that existing reinforcement learning algorithms assume that the entire state of the environment is visible at each time step. This assumption does not hold in many applications, such as robot navigation or factory control, where the available sensors provide only partial information about the environment. The challenge is to find approximate methods that scale well to large hidden-state applications.

3.2 Related Work

To deal with complex problems with large state and action spaces, a number of techniques have been studied for biasing an agent's learning so that it scales well to large problems [7]; these may be useful for reinforcement learning systems too:

Shaping: In this technique a teacher presents very simple problems to solve first, then gradually exposes the learner to more complex problems. Shaping has been used in supervised-learning systems, and can be used to train hierarchical reinforcement-learning

systems from the bottom up [16], and to alleviate problems of delayed reinforcement by decreasing the delay until the problem is well understood [17].

Local reinforcement signals: Whenever possible, agents should be given reinforcement signals that are local. In applications in which it is possible to compute a gradient, rewarding the agent for taking steps up the gradient, rather than just for achieving the final goal, can speed learning significantly [18].

Problem decomposition: Decomposing a huge learning problem into a collection of smaller ones, and providing useful reinforcement signals for the sub-problems, is a very powerful technique for biasing learning. Most interesting examples of robotic reinforcement learning employ this technique to some extent [19]. A modular approach to decomposing the problem space improves learning performance in reinforcement learning [20].

Function approximation: On large problems, reinforcement learning systems should use parameterized function approximators, such as neural networks, in order to generalize between similar situations and actions [21]. Function approximation may be a good solution for a simple problem with a large state space, where a suitable value function can be approximated for the same types of actions. In such cases, however, there are no strong theoretical convergence guarantees, and instability may occur because the weights of the neural network can become unstable. If the system is complex and dynamic, so that an agent needs to take different types of actions to reach the final goal or objective of the system, and if the system is non-episodic, so that the agent has no opportunity to repeat its trial from the same initial environment condition, then the use of function approximation is not feasible.

In this research we consider complex and dynamic problems with high-dimensional state and action spaces. Our interest is to develop a robust system in a more realistic and generalized way, so that the learning process is not hampered by the dynamics of the environment and can work in both continuing and episodic settings, in environments where a single agent or multiple agents may exist. In the field of reinforcement learning it is usually assumed that the system learns its policy in simulation or during a training period, and finally applies the learned policy to control the real problem. Consequently, if there is any change in the environment after the policy has been learned, the system may not work, and the learning process must be repeated from the beginning. Here, one of our interests is to build a system that can be applied to control a process in which learning continues forever, so that if any change occurs the system can adjust its policy without hampering the ongoing process.

3.3 Task-Oriented Reinforcement Learning

One key feature of conventional reinforcement learning is that it explicitly considers the whole problem of a goal-directed agent interacting with an uncertain environment [3]. The usual approach of reinforcement learning is to take into account all kinds of information about

the environment to constitute the state, and only one lookup table is used for the whole task. Since an agent needs to repeat its interaction several times to obtain a better policy, a dynamic system is also impractical to control by reinforcement learning. In the conventional approach, an agent interacts with the environment considering its own state rather than analyzing the nature or state of the task, and by this repeated process it reaches the goal, performing the task without any explicit idea about it. The process goes well if the agent receives the reward properly. But if, for the same task, the agent faces a new situation every time, and if there are critical issues behind the task that cannot be captured in the agent's state information, the system fails to converge.

To overcome these limitations, a modified approach to reinforcement learning is proposed, called task-oriented reinforcement learning [4-6]. This new approach copes with the complexity of the problem by decomposing the whole task into logical subtasks according to the types of actions (for example, searching the environment, moving to a particular location, conducting a job, etc.). For each subtask a separate lookup table is used, in which the state signal represents the job condition with respect to the corresponding agent, and the action space indicates to the agent how the task should be carried out. The goals of the subtasks may appear different, but each helps to attain the global goal of the system. Since this method provides one lookup table for each subtask, the same agent may handle all the lookup tables, or a separate agent can be used for each subtask, depending on the global objective of the system.

The main aspect of task-oriented learning is to describe the task condition clearly to the agent in terms of state. For example, if the actions and the goal of a task depend only on the agent's behavior in the environment, then the state information must consider only the agent's position with respect to its surroundings; if the actions and the goal of a task relate to some other job that should be handled by an agent, then the state information should be a description of the job condition presented to the agent; and if the goal and actions depend on how an agent from a group is to be assigned a task, then the state information should consist of a description of the agents existing in its surroundings. The main objective of this method is to simplify the learning process by considering only the information related to the corresponding task, which reduces the state-space size, hence requires less memory and allows faster convergence.

An intuitive way to understand the relation between an agent and its environment in reinforcement learning, for both conventional and task-oriented learning, is the following example in Table 3.1.

Table 3.1: The learning process in reinforcement learning for the conventional and task-oriented approaches: a comparison

Conventional learning:

Environment: Your state is 351. You have 4 possible actions.

Agent: I am taking action 3.
Environment: You received a reinforcement of -2 units. Now your state is 43. You have 3 possible actions.
Agent: I am taking action 1.
Environment: You received a reinforcement of 6 units. Now your state is 351. You have 4 possible actions.
Agent: I am taking action 2.
Environment: You received a reinforcement of 2 units. Now your state is 431. You have 2 possible actions.
...

Task-oriented learning:

Environment: Task's state is 351. There are 4 possible actions with this task.
Agent: I am executing action 3.
Environment: You received a reinforcement of -2 units. Task's state is 43. There are 3 possible actions with this task.
Agent: I am executing action 1.
Environment: You received a reinforcement of 6 units. Task's state is 433. There are 3 possible actions with this task.
Agent: I am executing action 2.
Environment: You received a reinforcement of 2 units. Task's state is 143. There are 3 possible actions with this task.
...

Task-oriented reinforcement learning has two main features:

(i) Decomposing the whole problem according to the types of actions: This reduces the complexity of a problem significantly, as a small sub-task requires less information to be considered at a time. The size of the state space increases exponentially with the amount of available environment information, so considering less information keeps the state spaces small, frees the system from the curse of dimensionality, and requires less memory. The system also converges faster as it requires less computation.

(ii) The agent learns from the viewpoint of the task: This is the most important feature of the task-oriented system. The state space of a task or sub-task considers only the information related to that task, which describes the task condition clearly to the agent. Generally, it is assumed that the agent has no idea about the nature of its task and its goal, but that by interacting with its environment it gradually acquires the optimal policy for performing its job. However, the agent cannot fulfill the goal in this way if the task is much more complex and its dynamics and nature cannot be captured in the agent's state. This feature of task-oriented learning therefore enables the agent to learn smoothly, and the system becomes robust because the dynamics of the system or the presence of other agents have little effect on its learning (since the state considers only the task condition, which is almost the same in all environment situations). This feature also provides an opportunity for multiple agents to work in the same environment without

much modification. In performing a task, any agent can use the lookup table belonging to that sub-task only, and sharing the learned policy may speed up the learning process. In this kind of cooperation an agent does not need to worry about leaking its own information, since each agent maintains a separate lookup table for its own task, which has no relation to the other agents. In reality, the dynamic characteristics and complexity of all problems are not the same, and the decomposition of a problem may produce different types of sub-task. It is almost impossible to define the state-representation process for all kinds of task in a generalized way, since the demands of each sub-task are not the same. Generally, whatever the goal of a sub-task, if its state represents the task condition only, if the agent finds in its current state that this task has to be done, and if the agent considers the task state to execute its action, then the process can be referred to as task-oriented learning. The notion, features, and advantages of task-oriented reinforcement learning will be better understood through some examples. The next chapter contains examples of task-oriented implementations of reinforcement learning in two different domains.
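To make the decomposition concrete, the sketch below shows one way of maintaining a separate lookup table per sub-task and applying a learning update to the table of whichever sub-task is currently active. The sub-task names, the dictionary-based tables, and the one-step Q-learning backup are illustrative assumptions made here and are not the thesis's own specification.

from collections import defaultdict

# One lookup table per sub-task; the same agent (or different agents) may
# consult any of these tables, as described in Section 3.3. The sub-task
# names below are hypothetical examples of a decomposition by action type.
subtask_tables = {
    "search": defaultdict(float),
    "move": defaultdict(float),
    "conduct_job": defaultdict(float),
}

def update_subtask(subtask, task_state, action, reward, next_task_state,
                   next_actions, alpha=0.1, gamma=0.95):
    # Apply a one-step backup to the lookup table of the active sub-task.
    # The state describes the task condition only, not the full environment.
    Q = subtask_tables[subtask]
    best_next = max((Q[(next_task_state, b)] for b in next_actions), default=0.0)
    Q[(task_state, action)] += alpha * (
        reward + gamma * best_next - Q[(task_state, action)]
    )

Keeping one small table per sub-task, rather than a single table over the full joint state, is what keeps each state space small in the sense described above.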

Chapter 4

Examples

4.1 Introduction

The task-oriented reinforcement learning method solves a high-dimensional problem by decomposing it according to the types of actions needed to fulfill the goal, and learning is then carried out from the viewpoint of the task. In reality the characteristics of all complex problems are not the same; the decomposed tasks in some problems differ from those in others. Some sub-tasks may require a common role among agents, and some sub-tasks require coordination among agents. It is difficult to describe the inherent properties and benefits of a task-oriented system with one example. For this reason we carry out simulations for two different types of problems in this thesis.

4.2 Tile World

4.2.1 Task Description

A pseudo-realistic tile world of 10 by 10 grid cells that evolves in discrete time steps is considered as the first experimental domain. Each cell of the world may contain an agent, an obstacle, a tile, or a hole. In this continuing task, a tile and a hole appear at random locations of the environment stochastically and disappear after a certain time interval. The illustrative diagram in Fig. 4.1 shows the complexity of the task due to the many obstacles. The agent is permitted only to push the tile, not to pull it, and movement in diagonal directions or off the environment is considered illegal. More than one agent, or an agent and the tile, cannot occupy the same cell. The agent's task is to discover the hole and the tile and then fill the hole by putting the tile into it. The tile world is a well-known test domain in the field of mobile-robot learning tasks. Our test domain is much more complex than other popular tile-world problems because the positions of the tile and the hole are not fixed. Also, at the start of each trial the position of the agent differs from one place to another. These dynamics of the environment cannot be handled by the conventional approach to reinforcement learning. The cells just inside the boundary region are restricted to the agent's movement only, and the tile is not allowed to be pushed into them; this restriction is only to avoid permanent deadlock situations. At each time step the agent has four possible actions to choose from: pushing the tile if it is available, or moving to the North, South, East, or West. Before taking any action, the agent searches its field of vision, of limited depth 2, for the tile, hole, other agents, etc. The environment is partially observable, and there are no special marks or coordinates to distinguish one cell from another.
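The movement rules above can be made concrete with a small sketch. The following is a minimal, hypothetical representation of the grid and of the legality check for a move; the data structures, constants, and function names are assumptions made for illustration and are not taken from the thesis implementation, and the exact extent of the restricted boundary region is likewise an assumption.

# Minimal, illustrative tile-world representation: a 10-by-10 grid whose
# cells may hold an agent, an obstacle, a tile, or a hole (or be empty).
EMPTY, AGENT, OBSTACLE, TILE, HOLE = ".", "A", "#", "T", "O"
SIZE = 10
MOVES = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}  # no diagonal moves

def legal_agent_move(grid, row, col, direction):
    # Moving off the environment or into an obstacle or another agent is illegal.
    dr, dc = MOVES[direction]
    r, c = row + dr, col + dc
    if not (0 <= r < SIZE and 0 <= c < SIZE):
        return False
    if grid[r][c] in (OBSTACLE, AGENT):
        return False
    if grid[r][c] == TILE:
        # Pushing: the cell behind the tile must be free or a hole, and the
        # tile may not be pushed into the cells near the boundary, which are
        # assumed here to be the region reserved for agent movement only.
        r2, c2 = r + dr, c + dc
        inside = 1 <= r2 < SIZE - 1 and 1 <= c2 < SIZE - 1
        return inside and grid[r2][c2] in (EMPTY, HOLE)
    return True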

Fig. 4.1: A 10 by 10 grid tile world

4.2.2 Implementation

First, the task-oriented division of the whole task is considered [4]. The agent needs to find the tile and the hole first, so it must move randomly throughout the environment until it has found both. Then the agent needs to move to a desired location in the environment, such as a certain side of the tile in order to push it; after pushing the tile once, the agent may need to move to another side of the tile to push it again, and this process continues until the tile has been pushed into the hole. Here the whole task is decomposed as follows: (a) finding the tile and the hole, (b) moving to a particular location, and (c) moving the tile towards the hole.

The first two subtasks are fully concerned with the movement of the agent itself. To search the environment, the agent chooses a relative location (sub-goal) randomly and looks for the tile and the hole during its trip towards that location. This process is repeated several times until the tile and the hole have been found. The agent remembers the locations of the tile and the hole as relative Cartesian coordinates and updates these values at each state transition. After finding both, or after pushing the tile once, the agent may need to move to a certain cell beside the tile (a sub-goal), and this is determined by the Q-values of the sub-task related to the tile movement. This sub-task ends when the agent reaches the appropriate cell beside the tile to push it. For both of these subtasks only one Q_A-table is proposed. The state of the Q_A-table is constituted by the relative directional information of the sub-goal and the status of the neighboring cells. The actions are to move North, South, East, or West. The action that makes the agent reach the target location is given a reward of 1, and all other actions receive a reward of 0.

The other subtask deals with the transition of the tile, i.e., how the agent should move it towards the hole. For this subtask the proposed Q_T-table contains the information of the tile itself
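As a rough sketch of how the Q_A-table described above might be realized, the code below builds the state from the relative direction of the sub-goal and the status of the neighboring cells, selects one of the four movement actions ε-greedily, and applies the 1/0 reward on reaching the target. All identifiers are hypothetical, and the one-step Q-learning-style backup is used purely for illustration; it is not the thesis's actual implementation.

import random
from collections import defaultdict

Q_A = defaultdict(float)          # single lookup table shared by the two movement subtasks
ACTIONS = ["N", "S", "E", "W"]

def sign(x):
    return (x > 0) - (x < 0)

def qa_state(subgoal_dx, subgoal_dy, neighbor_status):
    # State = relative direction of the sub-goal plus the status of the
    # neighboring cells (e.g. a tuple of free/blocked flags).
    return ((sign(subgoal_dx), sign(subgoal_dy)), tuple(neighbor_status))

def choose_action(state, epsilon=0.1):
    # epsilon-greedy selection over the four movement actions.
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q_A[(state, a)])

def qa_update(state, action, reached_target, next_state, alpha=0.1, gamma=0.9):
    # Reward 1 for the action that brings the agent to the target location of
    # the movement subtask, 0 otherwise.
    reward = 1.0 if reached_target else 0.0
    best_next = max(Q_A[(next_state, a)] for a in ACTIONS)
    Q_A[(state, action)] += alpha * (reward + gamma * best_next - Q_A[(state, action)])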
