Developing Focus of Attention Strategies Using Reinforcement Learning


Department of Computer Science and Engineering
University of Texas at Arlington, Arlington, TX 76019

Developing Focus of Attention Strategies Using Reinforcement Learning

Srividhya Rajendran (rajendra@cse.uta.edu)

Technical Report CSE-2003-32

This report was also submitted as an M.S. thesis.

DEVELOPING FOCUS OF ATTENTION STRATEGIES USING REINFORCEMENT LEARNING

The members of the Committee approve the master's thesis of Srividhya Rajendran:

Manfred Huber, Supervising Professor
Farhad Kamangar
Lawrence B. Holder

Copyright by Srividhya Rajendran 2003. All Rights Reserved.

DEDICATION This thesis is dedicated to my father, T.K. Rajendran and my mother, Bhagyalakshmi

DEVELOPING FOCUS OF ATTENTION STRATEGIES USING REINFORCEMENT LEARNING

by

SRIVIDHYA RAJENDRAN

Presented to the Faculty of the Graduate School of The University of Texas at Arlington in Partial Fulfillment of the Requirements for the Degree of MASTER OF SCIENCE IN COMPUTER SCIENCE AND ENGINEERING

THE UNIVERSITY OF TEXAS AT ARLINGTON
December 2003

ACKNOWLEDGEMENTS

I would like to thank Dr. Huber for introducing me to the exciting field of Artificial Intelligence. I would also like to thank him for his enormous patience, timely advice, and pertinent examples, which helped me immensely toward the completion of my research work. I would like to thank Dr. Kamangar and Dr. Holder for being on my committee. I would like to thank Prashant for his love, support, and the hot meals he made for me when I was too busy to cook. I would like to thank Appa, Amma, Geetha, and Kittu for being there for me through thick and thin, and for always encouraging me to achieve what I believed in. I would also like to thank all my friends for boosting my morale when I needed it most.

November 17, 2003

ABSTRACT

DEVELOPING FOCUS OF ATTENTION STRATEGIES USING REINFORCEMENT LEARNING

Publication No.
Srividhya Rajendran, CSE
The University of Texas at Arlington, 2003
Supervising Professor: Dr. Manfred Huber

Robots and AI agents that can adapt to and handle multiple tasks are needed today. This requires them to have the capability to handle real-world situations. Robots use sensors to interact with the world, but processing the raw data from these sensors becomes computationally intractable in real time. This problem can be addressed by learning strategies of focus of attention. This thesis presents an approach that treats focus of attention as a reinforcement learning problem of selecting controller and feature pairs to be processed at any given point in time. The result is a sensing and control policy that is task specific and can adapt to real-world situations using feedback from the world.

Handling all the information in the world for the successful completion of a task is computationally intractable. In order to resolve this, the current approach is further augmented with a short-term memory. This enables the agent to learn a memory policy, which tells the agent what to remember and when to remember it in order to complete a task successfully. The approach is illustrated using a number of tasks in the blocks world domain.

TABLE OF CONTENTS

ACKNOWLEDGEMENTS
ABSTRACT
LIST OF ILLUSTRATIONS
LIST OF TABLES

Chapter
I. INTRODUCTION
   1.1 Problem Description
   1.2 Related Work
   1.3 Approach Taken
II. MACHINE LEARNING
   2.1 Learning Agent Design
   2.2 Learning Methods
III. REINFORCEMENT LEARNING
   3.1 Reinforcement Learning Model
   3.2 Markov Decision Processes
   3.3 Learning an Optimal Policy: Model Based Methods
   3.4 Learning an Optimal Policy Using Q-Learning: Model Free Method
   3.5 Exploration versus Exploitation Strategies

   3.6 A Sample Example: The Grid World
IV. CONTROL ARCHITECTURE
   4.1 Control Architecture
V. LEARNING FOCUS OF ATTENTION: PERCEPTUAL LEVEL
   5.1 State Space Representation
   5.2 Table Cleaning Task
   5.3 Sorting Task
VI. LEARNING FOCUS OF ATTENTION: MEMORY LEVEL
   6.1 State Space Representation
   6.2 Block Stacking Task
   6.3 Block Copying Task
VII. CONCLUSION AND FUTURE WORK

REFERENCES
BIOGRAPHICAL INFORMATION

LIST OF ILLUSTRATIONS

Figure
2.1 Learning Agent Model
3.1 Reinforcement Learning Model
3.2 R(s, a) Immediate Reward Values
3.3(a) Q(s, a) Values
3.3(b) Optimal Policies
4.1 Control Architecture
5.1 Table Cleaning Task Setup
5.2 Learning Curve (All Objects on the Table Are Movable)
5.3 Partial Learned Policy (All Objects on the Table Are Movable)
5.4 Learning Curve (Not All Objects on the Table Are Movable)
5.5 Learning Curve (None of the Objects on the Table Are Movable)
5.6 Learned Policy (None of the Objects on the Table Are Movable)
5.7 Sorting Task Setup
5.8 One of the Learned Policy for Sorting Task
5.9 Learning Curve for Sorting Task
5.10 Sorting Task with Different Number of Features in State Vector
6.1 Action Top Blue Yellow
6.2 Block Stacking Task Setup with Two Objects

6.3 One of the Learned Policy Block Stacking Task (Two Objects)
6.4 Learning Curve for Block Stacking Task (Two Objects)
6.5 Block Stacking Task Setup with Three Objects
6.6 Learning Curve for Block Stacking Task (Three Objects)
6.7 Block Copy Setup with Two Objects
6.8 One of the Partial Policy for Block Copy (with Two Objects)
6.9 Learning Curve for Block Copying Task (with Two Objects)
6.10 Block Copy Setup with Three Objects
6.11 Learning Curve for Block Copying Task (with Three Objects)
6.12 Stacking Task with Different Temperature Decay Rates
6.13 Copying Task with Different Temperature Decay Rates
6.14 Stacking Task with and without Memory Policy
6.15 Copying Task with and without Memory Policy

LIST OF TABLES

Table
3.1 Q-Learning Algorithm
3.2 State Transitions with Action Up
3.3 State Transitions with Action Down
3.4 State Transitions with Action Left
3.5 State Transitions with Action Right
6.1 Block Copying Strategy used by Humans
6.2 Different World Configurations for Block Copying with Three Objects

CHAPTER I
INTRODUCTION

1.1 Problem Description

The development of science and technology has led humans to explore the field of artificially intelligent (AI) agents and robots. This has resulted in robots and AI agents that are very special purpose or task specific, e.g., vacuuming robots or robots for refueling calandria tubes in an atomic power plant. These robots develop their intelligent behavior from a huge amount of task-specific knowledge. We would like to have AI agents and robots that are more flexible in the range of tasks they can perform, such as assisting us in repetitive and dangerous tasks, or assisting handicapped or elderly people by monitoring and controlling various aspects of the environment. This requires the AI agents to have the capability of interacting with humans and to be able to deal with the uncertainties inherent in the real world. In order to accomplish this, AI agents need to make extensive use of sensors for observing and interpreting the state of the world. The use of diverse sensors results in huge amounts of raw data, and AI agents have to process this raw data to extract knowledge about the world. This poses many serious problems, including (but not limited to):

1. The lack of computational resources to process the huge amount of data in real time.

2. The huge number of data points the AI agents must consider to make real-world decisions in real time.

This requires AI agents to come up with mechanisms for filtering the relevant data out of the raw data, enabling targeted processing of the data critical to decision making. Similar mechanisms are observed in biological systems. Humans display both mechanical and cognitive aspects of this mechanism. For example, the retinal image from an eye has a high-acuity region at the center known as the fovea, beyond which the image resolution drops in the visual periphery. Motivated by their behavioral needs, humans are able to focus their sight (by moving their eyes or turning their heads) onto the object of interest. The cognitive aspect of this mechanism can also be observed in humans: for example, while driving a car, the driver's eyes form an image of everything within the visual boundaries, but he/she only sees the things in front of or in the proximity of the car. This is because the cognitive part of the brain only processes (focuses attention on) the visual information from in front of and around the car, since it is very relevant for driving, while it ignores the rest [Tu 1996]. This mechanism of processing only a small number of features in the sensory input while ignoring the less relevant aspects is known as Focus of Attention [William 1890] [Broadbent 1958].

According to Piaget [Mandler 1992], by 4 months of age human infants have reflexive responses that are organized into motor strategies; all sensory organs start acting in a coordinated manner, and the infants begin developing mechanisms of focus of attention. In the next few months they develop mechanisms for coordinating the developed motor strategies with these focus of attention mechanisms to address incrementally complex tasks [Piaget 1952]. For example, newborns continuously learn sensing and control policies by flailing their hands. This random exploration of the terrain by moving their hands helps them determine properties like traversability from the available resources, thus developing strategies for efficient perception of task-specific properties along with learning control policies to move their hands to a given location. Here the precise nature of successful sensing strategies depends heavily on the available resources and the overall task at hand. For example, while navigating through a jungle we tend to actively look out for the tiger that might kill us. The same task, navigation, in the city does not require us to look out for a tiger, since there are none, but does require us to look out for vehicles moving along the path we are taking. So, though the task is the same, the sensing strategy changes depending on the situation we are in. This suggests that robots and AI agents that can handle a wide range of tasks must also have mechanisms for acquiring task-specific focus of attention by autonomously interacting with the environment based on the resources available to them.

This thesis concentrates on developing the above idea of acquiring strategies for task-specific focus of attention using reinforcement learning algorithms [Sutton & Barto 1998], enabling robots and AI agents to handle a wide range of tasks.

1.2 Related Work

This section gives a brief introduction to various concepts used in this thesis and discusses previous research related to this work.

1.2.1 Reinforcement Learning

Reinforcement learning [Sutton & Barto 1998] is a learning framework that attempts to learn how to map situations to actions so as to maximize a numerical reward signal. The learner is not told which actions to take, but instead must discover which actions yield the most reward by trying them. Reinforcement learning is defined not by characterizing learning algorithms, but by characterizing a learning problem. One of the challenges that arises in reinforcement learning is the tradeoff between exploration and exploitation. To maximize the reward, a reinforcement learning agent must prefer actions that have been tried and found to yield good reward. In order to discover such actions, the agent needs to try actions that have not been selected before. The agent needs to exploit what it already knows in order to obtain reward, but it also needs to explore in order to learn to select better actions in the future. The dilemma is that neither exploitation nor exploration can be pursued exclusively without failing at the

task. For example, consider a mobile robot that has to decide whether it should enter a new room in search of more trash to collect or start finding its way back to its battery recharging station. The robot makes its decision based on how quickly and easily it has been able to find the recharger in the past. Q-learning [Watkins 1989] is a reinforcement learning algorithm that does not need a model of its environment and can be used online. The algorithm works by estimating the values of state-action pairs. The value Q(s, a) is defined to be the expected discounted sum of future payoffs obtained by taking action a from state s and following an optimal policy thereafter. Once these values are learned, the optimal action from any state is the one with the highest Q-value. A more detailed explanation of machine learning, reinforcement learning, and Q-learning is presented in Chapters II and III.

1.2.2 Task-Specific Focus of Attention

Many approaches have been tried for learning selective attention policies for task completion. Andrew McCallum [McCallum 1996] developed mechanisms for learning selective attention for sequential tasks. Using the U-Tree algorithm, McCallum's work dynamically builds a state space representation by pruning the large perceptual state space. His work further augments this state space with a short-term memory containing the crucial features that were missing because of hidden state. The results indicate that the state space representation developed by the U-Tree algorithm does not perform well in continuous state spaces. [Laar, Heskes & Gielen 1997] learned task-dependent focus

of attention using neural networks. That approach used a limited sensor modality, constructed a special computational mechanism, and did not have any real-time feedback, which limits its use in real-world robotic tasks. [Goncalves et al. 1999] presented an approach that uses learning mechanisms similar to those presented in this thesis to identify object identities in an active stereo vision system; it introduced special-purpose visual maps to accomplish the required task. [Rao & Ballard 1995] proposed a general active vision architecture based on efficiently computable iconic representations. Their architecture used visual routines for object identification and localization, and they showed that complex visual behaviors could be obtained by using these two routines with different parameters. Most of the previous work considered very limited sensor resources and constructed special-purpose mechanisms specific to one type of sensor modality and task domain. Moreover, these approaches did not provide real-time feedback and were therefore limited in the context of robotic task execution.

1.3 Approach Taken

The approach proposed in this thesis considers focus of attention as the problem of selecting a set of features to be processed at any given point in time during task execution. This is achieved by tight integration of sensing and control policies. Each action that the robot system executes in the real world is associated with a set of features that form the control objective for that action. For example, the action Reach Blue on a robot arm results in the robot reaching for a blue

object. At each point in time the robot has to decide which features are relevant to process in the context of the chosen action. This reduces the amount of raw data that has to be analyzed to the data required for the related features. The sensing and control policies are learned by an online reinforcement learning algorithm that acquires task-specific sensing strategies optimized with respect to the task, capabilities, and resources of the AI agent. The learning architecture used here is based on a hybrid discrete/continuous system architecture where control strategies are constructed as a discrete event dynamic system supervisor [Ramadge & Wonham 1989] [Ozveren & Willsky 1990] [Thistle & Wonham 1994] [Huber & Grupen 1997]. In this architecture, new sensing and control strategies are learned by direct interaction between the system and its environment. This architecture is used since it allows the simplification of the temporal component of the policy by handling it in the discrete event dynamic system framework, and thus eliminates the requirement of representing continuous time. This allows the system to autonomously derive sequences of concurrent actions and relevant perceptual features in order to optimize the system performance in the given task.

CHAPTER II
MACHINE LEARNING

A computer program is said to be a machine learning algorithm if it can improve its performance at a task over time. A formal definition of machine learning by Tom Mitchell is as follows: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E" [Mitchell 1997]. For example, a computer program that is designed to play chess is measured by its ability to win chess games. Here the task is a series of chess games, the performance measure is the number of games won by the program against its opponents, and the training experience is playing chess games against itself.

2.1 Learning Agent Design

A learning agent is divided into four modules [Russell and Norvig 1995]:

1. Critic
2. Learning element
3. Problem generator
4. Performance element

[Figure 2.1: Learning Agent Model [Russell and Norvig 1995]. The agent consists of a critic, a learning element, a problem generator, and a performance element; it receives a performance standard and percepts through its sensors and acts on the environment through its effectors.]

2.1.1 Critic

The critic tells the learning element how well the agent is learning. The critic has an input known as the performance standard. This is required because the percepts by themselves do not indicate anything about the agent's performance. For example, a chess program may receive a percept indicating that it has been checkmated by its opponent. It will not know whether being checkmated is good or bad unless there is a negative reward (performance standard) indicating that it is bad, or a positive reward (performance standard) indicating that checkmating its opponent is good. The performance standard is a fixed measure that indicates how good or bad a given action

is from a given state. The performance standard is set by the environment of the learning agent.

2.1.2 Problem Generator

The problem generator is responsible for generating actions that lead to new and informative experience; otherwise the agent will only take the best action given what it already knows. Exploration, as opposed to exploitation, leads the agent to learn better actions over the long run through a series of suboptimal actions.

2.1.3 Learning Element

The learning element is responsible for improving the efficiency of the performance element. It takes feedback from the critic in any given state and adjusts the values used by the performance element accordingly, thereby improving learning over time. The design of the learning element is based on the following:

1. The components of the performance element to be improved.
2. The representation used by the components of the performance element.
3. The feedback given by the critic.
4. The amount of prior information available.

2.1.4 Performance Element

The performance element is responsible for taking external actions. The design of the learning element depends on the design of the performance element. The

performance element is mainly responsible for improving the evaluation function by enhancing its accuracy over time.

2.2 Learning Methods

There are two basic ways in which the components of the performance element can be learned:

1. Supervised learning
2. Unsupervised learning

2.2.1 Supervised Learning

Supervised learning is a method by which an evaluation function is learned from training samples. The training samples consist of inputs and desired outputs. The task of the learner is to predict the value of the function for any valid input after having seen a small number of training samples. This is done by generalizing from the presented data to unseen situations in a reasonable way. There are many approaches to implementing supervised learning, such as artificial neural networks, decision tree learning, etc.

2.2.2 Unsupervised Learning

Unsupervised learning is a method where the learner has no information about the desired output. In unsupervised learning, a data set of input objects is gathered and a joint density model is then built. One form of unsupervised learning is clustering.

CHAPTER III
REINFORCEMENT LEARNING

Reinforcement learning is a form of supervised learning mainly used by robots and AI agents to learn a task. The learner explores the environment by perceiving the state and taking actions. The environment in return provides rewards (which may be positive or negative), and the agent attempts to learn a policy that maximizes the cumulative reward over the task. A reinforcement learning algorithm differs from other learning algorithms in the following ways:

1. It does not know a priori what effect its actions have on its environment, and it has no knowledge about which actions are best in its long-term interest.
2. It can receive reward in any state or only in a terminal state. Rewards define the utility the agent is trying to maximize.
3. It determines the distribution of the training examples through the sequence of actions it chooses.
4. It may or may not have a model of the environment.
5. The environment in which the agent is acting may or may not be fully observable. In a fully observable environment states can be identified with

percepts, whereas in a partially observable environment the agent has to maintain some internal state to keep track of the environment.

3.1 Reinforcement Learning Model

[Figure 3.1: Reinforcement Learning Model. The agent observes a state and a reward from the environment and responds with an action, producing the sequence s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, s_3, ...]

Figure 3.1 shows a basic reinforcement learning model [Mitchell 1997]. In this model the agent receives an additional input from the environment representing a measure of how good or bad the last action executed by the agent was. The model of the interaction is that the agent makes an observation, interprets the state of the world as s_t, and selects an action a_t. It then performs this action, resulting in a new state s_{t+1}, and receives a reward r_t. The aim of the agent is to learn to choose actions that tend to increase the expected cumulative reward.
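To make this interaction loop concrete, here is a minimal sketch in Python. The env and agent interfaces (reset, step, select_action, update) are illustrative assumptions and are not specified in the thesis.

```python
# Minimal sketch of the agent-environment loop described above.
# The environment and agent interfaces (reset, step, select_action, update)
# are illustrative assumptions, not part of the thesis.
def run_episode(env, agent, max_steps=100):
    s = env.reset()                      # initial state s_0
    total_reward = 0.0
    for t in range(max_steps):
        a = agent.select_action(s)       # choose a_t from s_t
        s_next, r, done = env.step(a)    # environment returns s_{t+1}, r_t
        agent.update(s, a, r, s_next)    # learn from the observed transition
        total_reward += r
        s = s_next
        if done:                         # e.g. an absorbing/goal state
            break
    return total_reward
```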

There are many algorithms that can be used by the agent to learn to behave optimally. However, before learning to behave optimally we need to decide what the model of optimal behavior will be. Three basic models have been widely considered [Kaelbling, Littman & Moore 1996]:

1. The finite horizon model
2. The infinite horizon model
3. The average reward model

3.1.1 The Finite Horizon Model

In this model, at any given moment in time the agent should optimize its expected reward for the next h steps:

$$E\left(\sum_{t=0}^{h} r_t\right)$$

where r_t represents the scalar reward received t steps into the future.

3.1.2 The Infinite Horizon Model

In this model, the agent takes into account its long-run reward, but rewards received in the future are geometrically discounted according to the discount factor γ (where 0 ≤ γ < 1):

$$E\left(\sum_{t=0}^{\infty} \gamma^t r_t\right)$$

where γ is a constant that represents the relative value of delayed versus immediate rewards.

3.1.3 The Average Reward Model

In this model, the agent is supposed to take actions that optimize its long-run average reward:

$$\lim_{h \to \infty} E\left(\frac{1}{h}\sum_{t=0}^{h} r_t\right)$$

3.2 Markov Decision Processes

Problems with delayed rewards can be modeled as Markov decision processes (MDPs). An MDP consists of the following [Kaelbling, Littman & Moore 1996]:

1. A set of states S,
2. A set of actions A,
3. A reward function R : S × A → ℝ, and
4. A state transition function T : S × A → Π(S), where a member of Π(S) is a probability distribution over S.

The model is Markov if the state transitions are independent of any previous environment states or agent actions.

3.3 Learning an Optimal Policy: Model Based Methods

Let us assume that the agent knows a correct model and is trying to find an optimal policy for the infinite horizon discounted model. Then the optimal value of a state, V*(s), is the expected infinite discounted sum of rewards that the agent will gain if

it starts in that state and executes an optimal policy. If Π denotes a complete decision policy, then [Kaelbling, Littman & Moore 1996] [Mitchell 1997]:

$$V^*(s) = \max_{\Pi} E\left(\sum_{t=0}^{\infty} \gamma^t r_t\right)$$

The optimal value function is:

$$V^*(s) = \max_{a}\left( R(s,a) + \gamma \sum_{s' \in S} T(s,a,s')\, V^*(s') \right), \quad \forall s \in S$$

Given the optimal value function, the optimal policy can be specified as:

$$\Pi^*(s) = \arg\max_{a}\left( R(s,a) + \gamma \sum_{s' \in S} T(s,a,s')\, V^*(s') \right)$$

Here T(s,a,s') is the probability of making a transition from state s to s' using action a, and R(s,a) is the reward for taking action a in state s.
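As an illustration of this model-based setting, the following is a compact value-iteration sketch that assumes the MDP is given explicitly as dictionaries; the data layout and the function name are assumptions for illustration, not part of the thesis.

```python
# Value iteration sketch for a known MDP (model-based method above).
# Rewards R[s][a] and transition probabilities T[s][a][s'] are assumed to be
# given as plain dictionaries; this layout is illustrative.
def value_iteration(states, actions, R, T, gamma=0.9, tol=1e-6):
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Backup: V*(s) = max_a [ R(s,a) + gamma * sum_s' T(s,a,s') V*(s') ]
            best = max(R[s][a] + gamma * sum(p * V[s2] for s2, p in T[s][a].items())
                       for a in actions)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    # Greedy policy extraction: Pi*(s) = argmax_a [ R(s,a) + gamma * sum_s' T V*(s') ]
    policy = {s: max(actions,
                     key=lambda a: R[s][a] + gamma * sum(p * V[s2]
                                                         for s2, p in T[s][a].items()))
              for s in states}
    return V, policy
```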

3.4 Learning an Optimal Policy Using Q-Learning: Model Free Method

Q-learning [Watkins 1989] [Kaelbling, Littman & Moore 1996] [Mitchell 1997] is a learning algorithm that is used to learn an optimal policy when the agent does not have a model. The optimal policy when the model is known is

$$\Pi^*(s) = \arg\max_{a}\left( R(s,a) + \gamma \sum_{s' \in S} T(s,a,s')\, V^*(s') \right)$$

However, since the agent here does not know T(s,a,s') and R(s,a), using this expression would require the agent to predict the immediate reward and the immediate successor state for each state-action transition. Since it cannot do that, V* alone is of no use, and the agent is required to use a more general evaluation function.

Let us define the evaluation function Q(s,a) such that the value of Q is the reward received immediately upon executing action a in state s, plus the value (discounted by γ) of following the optimal policy thereafter:

$$Q(s,a) \equiv R(s,a) + \gamma V^*(\delta(s,a))$$

Here Q(s,a) is the value being maximized and δ(s,a) denotes the state resulting from applying action a in state s. The agent now learns the Q function instead of V*, and by doing so it is able to learn an optimal policy even though it has no model; learning the Q function corresponds to learning an optimal policy. The relationship between Q and V* is

$$V^*(s) = \max_{a'} Q(s,a')$$

and therefore

$$Q(s,a) = R(s,a) + \gamma \max_{a'} Q(\delta(s,a), a')$$

Since the definition of Q is recursive, it is learned iteratively in the Q-learning algorithm. In this algorithm the learner keeps a Q value for each state-action pair in a large table. Before the learning phase begins, the Q value for each state-action pair is initialized to some random value. Each time the learner takes an action a in state s and receives a reward r, the Q value in the table corresponding to that state-action pair is updated using the following update rule:

$$Q(s,a) \leftarrow Q(s,a) + \alpha \left( r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right)$$

where α is the learning rate. This process is repeated until the algorithm converges, i.e., until the old Q(s, a) is the same as the new Q(s, a). The Q-learning algorithm converges toward the true Q function if:

1. The system is a deterministic MDP.
2. Each state-action transition occurs infinitely often.
3. There exists some positive constant c such that for all states s and actions a, |R(s,a)| < c.

Table 3.1 Q-Learning Algorithm [Mitchell 1997]

  For each s, a, initialize the table entry Q(s, a) to zero.
  Observe the current state s.
  Do forever:
    Select an action a and execute it.
    Receive the immediate reward r.
    Observe the new state s'.
    Update the table entry for Q(s, a) as follows:
      Q(s, a) ← Q(s, a) + α (r + γ max_{a'} Q(s', a') − Q(s, a))
    s ← s'
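The following is a minimal, runnable sketch of the tabular algorithm in Table 3.1. The environment interface and the simple epsilon-greedy action choice (used here in place of the Boltzmann selection discussed in Section 3.5.1) are illustrative assumptions.

```python
import random
from collections import defaultdict

# Tabular Q-learning sketch following Table 3.1; the env interface and the
# epsilon-greedy action choice are illustrative assumptions.
def q_learning(env, actions, episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)               # Q[(s, a)], implicitly initialized to zero
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            if random.random() < epsilon:             # explore
                a = random.choice(actions)
            else:                                      # exploit
                a = max(actions, key=lambda x: Q[(s, x)])
            s_next, r, done = env.step(a)
            best_next = max(Q[(s_next, a2)] for a2 in actions)
            # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```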

3.5 Exploration versus Exploitation Strategies

A reinforcement learning algorithm is different from other supervised learning algorithms in the sense that the learner has to explicitly explore the environment while learning a policy. One strategy for an agent in state s is to select an action a that maximizes Q(s, a); this strategy is known as exploitation. But by using this strategy the agent risks overcommitting to actions that were found to have high Q-values during the early phase of training, failing to explore other possible actions that might have the potential to yield even higher Q-values. The convergence theorem above requires that each state-action transition occur infinitely often, and since only the best action is chosen, this approach runs the risk of not achieving convergence of the learning

algorithm. Another strategy for an agent in state s is to select a random action a; this strategy is known as exploration. With this strategy the agent does learn the values of good actions, but this is of little use since the agent never puts to use what it has learned during exploration. Therefore, the best way to train a reinforcement learner is a strategy that does both exploration and exploitation in moderation, i.e., a method that allows the agent to explore when it knows little about the environment and to exploit greedily once it has learned sufficiently much about the environment. The method this thesis uses for this purpose is the Boltzmann soft-max distribution.

3.5.1 Boltzmann Soft-max Distribution

If there are n items and the fitness of each item i is f(i), then the Boltzmann distribution defines the probability of selecting an item i, p(i), as

$$p(i) = \frac{e^{f(i)/T}}{\sum_{j} e^{f(j)/T}}$$

where T is called the temperature. By varying the parameter T we can vary the selection from picking a random item (T infinite), to assigning higher probabilities to items with higher fitness (T small and finite), to strictly picking the item with the best fitness (T tending to 0). This is accomplished by decaying the temperature exponentially using the equation

$$T_t = T_0 \, e^{-\lambda t}$$

where T_t is the temperature at time step t, T_0 is the temperature at time step 0, λ is the decay constant, and t is the time step.

In our case, where there are n actions available from state s, the fitness of an action is given by Q(s, a_i). The probability p(a_i | s) of taking action a_i from s is then:

$$p(a_i \mid s) = \frac{e^{Q(s,a_i)/T}}{\sum_{j=0}^{n} e^{Q(s,a_j)/T}}$$
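A small sketch of Boltzmann soft-max action selection with exponential temperature decay, as described above; the function names and default constants are illustrative assumptions.

```python
import math
import random

# Boltzmann (soft-max) action selection with exponentially decaying temperature.
# Q is a dict {(state, action): value}; names and defaults are illustrative.
def boltzmann_action(Q, state, actions, T):
    # p(a_i | s) = exp(Q(s,a_i)/T) / sum_j exp(Q(s,a_j)/T)
    weights = [math.exp(Q[(state, a)] / T) for a in actions]
    total = sum(weights)
    return random.choices(actions, weights=[w / total for w in weights])[0]

def temperature(t, T0=100.0, decay=1e-4):
    # T_t = T_0 * exp(-lambda * t): high T -> nearly random, low T -> nearly greedy
    return T0 * math.exp(-decay * t)
```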

3.6 A Sample Example: The Grid World

[Figure 3.2: R(s, a) Immediate Reward Values. A 2 × 3 grid with states S4, S5, and the goal cell G in the top row and S1, S2, S3 in the bottom row; every action has an immediate reward of 0 except those leading into G, which have a reward of 100.]

Let us consider a small grid world example [Mitchell 1997] as shown in Figure 3.2. In this world each grid square represents a unique state for the agent. There are four possible actions that an agent can take to move from one square to another. These actions are:

1. Up: When the agent takes this action it moves up in the grid world.

   Table 3.2 State Transitions with Action Up
   Old State   Action   New State
   S1          Up       S4
   S4          Up       S4

2. Down: When the agent takes this action it moves down in the grid world.

   Table 3.3 State Transitions with Action Down
   Old State   Action   New State
   S1          Down     S1
   S4          Down     S1

3. Left: When the agent takes this action it moves left in the grid world.

   Table 3.4 State Transitions with Action Left
   Old State   Action   New State
   S4          Left     S4
   S5          Left     S4

4. Right: When the agent takes this action it moves right in the grid world.

   Table 3.5 State Transitions with Action Right
   Old State   Action   New State
   S3          Right    S3
   S2          Right    S3

However, some of these actions cannot be executed from cells at the boundary of the grid world. The goal of the agent is to learn a path (maximizing the cumulative reward) to reach the grid cell containing the gold (the absorbing state G). Only an action that leads to the gold gets a reward of 100; all other actions get a reward of 0. The agent uses the Q-learning algorithm since it has no model of the world. This is done by creating a table that contains an entry for the Q value of each state-action pair. All Q-value entries are initialized to 0, as shown in the Q-learning algorithm in Table 3.1. Let γ (0 ≤ γ < 1) be equal to 0.9. Since the agent does not have any knowledge of the world, it starts out by exploring the world and updating the Q value of the corresponding state-action pair using the update equation shown in the Q-learning algorithm in Table 3.1. Figure 3.2 shows that if the agent reaches state S5 and takes action Right, it receives a reward of 100 since it reaches the gold. So the new Q(S5, Right) is:

new Q(S5, Right) = 0 + 1.0 · (100 + 0.9 · (0) − 0) = 100

(using the update equation Q(s, a) ← Q(s, a) + α (r + γ max_{a'} Q(s', a') − Q(s, a))). Similarly, if the agent takes action Up in state S3 it receives a reward of 100. If the learning process continues, the state-action pair values are updated as shown in Figure 3.3(a), and finally, when the agent has explored sufficiently, it learns a policy as shown in Figure 3.3(b).
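The following self-contained sketch reconstructs this grid world from Tables 3.2-3.5 and the text (the full transition table is an assumption filled in from the grid geometry) and applies the Q-learning backup with α = 1 over repeated sweeps, which reproduces the values reported in Figure 3.3(a).

```python
# Deterministic 2 x 3 grid world reconstructed from Tables 3.2-3.5 and the text:
# top row S4, S5, G; bottom row S1, S2, S3. Entering G yields reward 100.
T = {
    ('S1', 'Up'): 'S4', ('S1', 'Right'): 'S2', ('S1', 'Down'): 'S1', ('S1', 'Left'): 'S1',
    ('S2', 'Up'): 'S5', ('S2', 'Right'): 'S3', ('S2', 'Down'): 'S2', ('S2', 'Left'): 'S1',
    ('S3', 'Up'): 'G',  ('S3', 'Right'): 'S3', ('S3', 'Down'): 'S3', ('S3', 'Left'): 'S2',
    ('S4', 'Up'): 'S4', ('S4', 'Right'): 'S5', ('S4', 'Down'): 'S1', ('S4', 'Left'): 'S4',
    ('S5', 'Up'): 'S5', ('S5', 'Right'): 'G',  ('S5', 'Down'): 'S2', ('S5', 'Left'): 'S4',
}
R = {sa: (100 if s2 == 'G' else 0) for sa, s2 in T.items()}
actions = ['Up', 'Down', 'Left', 'Right']
gamma = 0.9

Q = {sa: 0.0 for sa in T}
for _ in range(50):  # sweep the updates until the values settle (G is absorbing)
    for (s, a), s2 in T.items():
        best_next = 0.0 if s2 == 'G' else max(Q[(s2, a2)] for a2 in actions)
        Q[(s, a)] = R[(s, a)] + gamma * best_next   # alpha = 1, deterministic world

print(Q[('S5', 'Right')], Q[('S4', 'Right')], Q[('S1', 'Up')])
# approximately 100, 90, and 81, matching Figure 3.3(a)
```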

[Figure 3.3: (a) Q(s, a) Values, (b) Optimal Policies. Panel (a) shows the converged Q values over states S1-S5 and G (e.g., 100 and 90 for actions entering or approaching G, with values such as 81 and 72 farther away); panel (b) shows the corresponding optimal policies, with the action from each state pointing along the shortest path to G.]

CHAPTER IV
CONTROL ARCHITECTURE

A robotic or AI system that deals with new situations and handles a wide range of tasks requires a large degree of generality and the capability to adjust online. This is because it is neither realistic nor easy to have complete prior knowledge of the environments the robot system has to deal with. This calls for highly complex and nonlinear robotic control systems with a control architecture that reduces the overall complexity. Here this is achieved using a control architecture [Huber & Grupen 1997] based on a hybrid discrete event dynamic system [Ramadge & Wonham 1989] [Ozveren & Willsky 1990] [Thistle & Wonham 1994] to control the robots.

4.1 Control Architecture

Figure 4.1 shows the organization of the control architecture. This architecture mainly consists of the following components:

1. Controller/feature pairs
2. Supervisor
3. Learning element

All the components are tightly integrated in order to achieve the required performance characteristics.

[Figure 4.1: Control Architecture. A reinforcement learning element exchanges state information and a control/sensing policy with the supervisor; the supervisor exchanges symbolic events and control activations with the controller/feature pairs, which interact directly with the physical sensors and actuators.]

4.1.1 Controller/Feature Pairs

The controller/feature pairs represent the bottommost module of the control architecture. This module directly interacts with the physical sensors and actuators of the robot system. At each point in time, the control policy in the supervisor determines the controller that has to be activated and the associated features that have to be processed. This set of features forms the control objective of the controller. If there is

an action Reach that controls the robot to reach for objects in the world, then the features associated with it tell the robot which object it is reaching for and where that object is located in the world. For example, Reach Blue results in the robot arm reaching for a blue object in the world. The convergence of a controller represents the completion of a control objective and results in the generation of a discrete symbolic event. This symbolic event triggers a transition of the robot's state and thus results in the deactivation of the control signal. This process of choosing the relevant features to process in the context of the chosen action reduces the amount of raw data that has to be analyzed to that required for the selected features.

4.1.2 Supervisor

The supervisor is the heart of this control architecture. It represents the task-specific control policy that is learned by the robot using the learning component of the architecture. The supervisor is built on top of the symbolic predicate space and abstract event space. The predicate space model represents an abstract description of all possible ways in which the robot can attempt to actively manipulate the state of the system and the environment. This abstract representation provides a compact state space to be used by the learning component. An abstract state consists of a vector of predicates representing the effects of controller/feature pairs on the world. Each state corresponds to a subgoal that is directly attainable by the controller/feature pairs. The action-dependent choice of the predicates thus forms a compact representation on which control policies can be learned

for all tasks that are directly addressable using the underlying closed-loop controllers of the system. The aim of each controller is to reach an equilibrium state where its control objective is satisfied, thus asserting the related predicates of the abstract state. However, the outcomes of these control actions at the supervisor level are nondeterministic due to kinematic limitations, controller interactions, and other nonlinearities in the system. Therefore the same action, with the same set of features to be monitored, taken from state s may lead to a set of different states {s_1, s_2, s_3} at different points in time. Hence the supervisor takes the form of a nondeterministic finite automaton whose transitions are triggered by the convergence of controllers.

4.1.3 Learning Element

The learning component uses Q-learning to learn sensing and control policies. This allows the learning component to learn policies that optimize the reinforcement given by the environment upon completion of functional goals. At the same time, the exploration process allows the system to improve the abstract system model of the supervisor. Each time a new state is reached, the world gives feedback in the form of reinforcement to the learning component. This, in turn, is used to update the model by updating the transition probabilities of the states.
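The following schematic sketch shows one way the three components could fit together in code. Every name in it (Supervisor, execute_controller, update_predicates, the learner interface) is invented for illustration; the thesis describes the architecture only at a conceptual level.

```python
import random

def execute_controller(controller, features):
    """Stub: run the closed-loop controller until convergence; return a
    discrete symbolic event and the reinforcement received from the world."""
    event = random.choice(["converged", "failed"])
    reward = 1.0 if event == "converged" else 0.0
    return event, reward

def update_predicates(state, controller, features, event):
    """Stub: assert/retract predicates of the abstract state based on the event."""
    return (controller, features, event)

class Supervisor:
    def __init__(self, learner, controller_feature_pairs):
        self.learner = learner                 # Q-learning element
        self.pairs = controller_feature_pairs  # e.g. ("Reach", ("Blue",))

    def step(self, abstract_state):
        # The learned control/sensing policy picks one controller/feature pair.
        controller, features = self.learner.select_action(abstract_state, self.pairs)
        # Running the controller to convergence yields a discrete symbolic event.
        event, reward = execute_controller(controller, features)
        next_state = update_predicates(abstract_state, controller, features, event)
        # The reinforcement updates both the policy and the abstract model.
        self.learner.update(abstract_state, (controller, features), reward, next_state)
        return next_state
```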

CHAPTER V
LEARNING FOCUS OF ATTENTION: PERCEPTUAL LEVEL

As described earlier, a robotic system that is constructed to operate successfully in the real world has to have a mechanism to learn task-specific sensing and control policies. In the approach presented here, the robot learns sensing and control policies using the reinforcement learning component of the control architecture. The learned policy provides the robot with an action and a set of features to be processed with this action. The features processed with each action determine the control objective of the action in the given state. To address complex tasks in real time and adapt online, the control architecture used by the robot system reduces the size of the continuous state space by using an abstract representation of the states. To illustrate the proposed approach, let us consider the following examples in the blocks world domain, where the robot interacts with objects on a table top:

1. Table cleaning task
2. Sorting task

The robot configuration consists of a robot arm, a stereo vision system, and feature extraction algorithms that identify visual features of the objects in the world, such as color, shape, size, and texture. The robot learns a task-specific sensing and control

policy in order to optimize the system performance for the given task through interaction with the blocks world. The robot can perform the following actions:

1. Reach: This action is used by the robot arm to reach for an object at any given location within the boundaries of the blocks world.
2. Pick: This action is used to pick up or drop an object at the current location.
3. Stop: This action allows the robot to stop its interaction with the blocks world. It is needed mainly because:
   a. The robot does not know what the task is and learns it from the reinforcements given by the world.
   b. There are no absorbing states in the real world. Therefore, the robot has to learn when to stop while performing a task so as to maximize the expected reward.

Each Reach or Pick action is associated with a set of features that the robot has to process in order to derive the goal of the action and then to complete the given task. For example, Reach Blue will cause the robot arm to reach for a blue object in the blocks world if such an object exists. As the number of features present in the world increases, the complexity of the system also increases. In order to restrict the system complexity, the number of features that can be processed with each action at a given point in time is limited to two in the experiments presented here.

5.1 State Space Representation

The Q-learning algorithm uses an abstract state space to learn the sensing and control policy. Each abstract state can be represented as a vector of predicates. In the blocks world domain, the vector of predicates constituting an abstract state consists of:

1. Action Taken: indicates what action was taken.
2. Action Outcome: indicates whether the action was successful or unsuccessful.
3. Feature 1 and/or Feature 2: the features that were chosen by the robot to determine the control objective of the last action.
4. Feature Successful/Unsuccessful: represents whether the feature combination used by the last action was found.
5. Arm Holding: indicates what the arm is holding.

For example, if the robot is in the start state s_0 {Null, Null, Null, Null, Null, Holding Nothing} and takes action Reach with the features color Blue and shape Square forming the control objective, then the robot tries to reach for an object that is blue and square. If the world contains an object with this feature combination, the action leads to a new state whose vector of predicates has the value {Reach, Successful, Blue, Square, Successful, Holding Nothing}, meaning that the action Reach for an object with features Blue and Square was successful, the feature combination Blue and Square was found, and by the end of this action the arm is Holding Nothing.
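To make the abstract state concrete, here is a minimal sketch of the predicate vector as a Python named tuple; the field names mirror the list above, but the type itself is an illustration rather than code from the thesis.

```python
from collections import namedtuple

# Abstract state as a vector of predicates (field names follow the list above).
AbstractState = namedtuple(
    "AbstractState",
    ["action_taken", "action_outcome", "feature1", "feature2",
     "features_found", "arm_holding"],
)

start = AbstractState("Null", "Null", "Null", "Null", "Null", "Holding Nothing")
# After a successful Reach for a blue, square object:
after_reach = AbstractState("Reach", "Successful", "Blue", "Square",
                            "Successful", "Holding Nothing")
```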

5.2 Table Cleaning Task

In this experiment a number of objects are on the table top, and the robot can move or pick up only one object at a time. The task is to learn a sensing and control policy that allows the robot to reach for and pick up movable objects from the table and then move and drop these objects into a box. While learning a policy for the task, the robot also has to learn when to stop, since there is no explicit information or feedback as to when the task is completed. The robot has a cost associated with each action it takes, and it receives a small positive reward each time it picks up an object from the table and drops it into the box.

[Figure 5.1: Table Cleaning Task Setup. The figure's legend distinguishes sizes (Small, Medium, Big), shapes (Square, Round, Rectangle), and textures (Texture 1 through Texture 4).]

The blocks world has the following objects and features present on the table (as shown in Figure 5.1):

Object 1: Texture 1, Square, Small

Object 2: Texture 1, Round, Medium
Object 3: Texture 2, Square, Medium
Object 4: Texture 3, Round, Small
Object 5: Texture 3, Square, Small

and a box (Texture 4, Rectangle, Big) into which all the objects are to be dropped. Objects with size Big cannot be picked up by the robot arm. All other objects are either movable or unmovable, and whether an object is movable or unmovable can only be determined by trying to pick it up at least once. Once a particular object is dropped in the box it is no longer visible. As a result, if the robot is in a state s_x {Pick, Unsuccessful, Texture 4, Null, Successful, Holding Nothing}, where it has just dropped an object with feature Texture 1 into the box, and again reaches for an object with feature Texture 1, then the Reach action will only be successful if there is another object with Texture 1 on the table. Otherwise, it will be unsuccessful, since the features of the dropped object are no longer accessible to the feature extraction algorithm. Starting from the derived abstract state representation, the system uses the learning component to learn the value function using the reward it receives each time a transition from state s_t to s_{t+1} takes place. The robot starts out exploring the world, taking random actions (100% exploration) and incrementally decreasing its exploration using the Boltzmann soft-max distribution until it reaches a stage where the exploration is about 10%. This amount of exploration is maintained to allow the system to visit different, potentially new parts of the abstract state space even after it achieves a

locally optimal policy. This is done in order to improve its performance by enabling it to learn a more global strategy. For the table cleaning task, three possible cases were considered:

1. All objects on the table are movable.
2. Not all objects on the table are movable: some objects on the table are movable and some are unmovable.
3. None of the objects on the table are movable.

5.2.1 All Objects on the Table Are Movable

Figure 5.2 shows the learning curve for the table cleaning task when all objects on the table are movable. The learning curve is plotted as a running average over 30 steps and depicts the average of 10 learning trials. The intervals indicate one standard deviation.

[Figure 5.2: Learning Curve (All Objects on the Table Are Movable). Average reward versus learning steps.]

Figure 5.3 shows a part of the control policy learned by the robot for case 1 of the table cleaning task. Each arrow in the figure represents a possible result of the related action in terms of a transition from the old state to the new state of the world. The robot starts in state S_0 {Null, Null, Null, Null, Null, Holding Nothing}. From state S_0, the robot takes action Reach Texture 1. This results in the robot reaching for an object with Texture 1, leading to state S_1 {Reach, Successful, Texture 1, Null, Successful, Holding Nothing}. If there is more than one object with Texture 1, then the robot randomly moves to any one of those objects, since objects look alike if they have the same texture. From state S_1, the robot takes action Pick Texture 1, resulting in successfully picking up the object with Texture 1 and leading to state S_2 {Pick, Successful, Texture 1, Null, Successful, Holding Texture 1}. The robot arm then reaches for the box where it needs to drop the object using action Reach Texture 4, leading to state S_3 {Reach, Successful, Texture 4, Null, Successful, Holding Texture 1}. Finally, the robot takes the action Pick Texture 4, which results in the held object being dropped into the box, leading to state S_4 {Pick, Unsuccessful, Texture 4, Null, Successful, Holding Nothing}. From state S_4 the robot can again try to Reach for Texture 1, and if there are more objects with Texture 1 then states S_1, S_2, S_3, S_4 are repeated until all objects of Texture 1 are dropped into the box. Once all the objects of Texture 1 are dropped, taking action Reach for Texture 1 results in a transition to a state S_5 {Reach, Unsuccessful, Texture 1, Null, Unsuccessful, Holding Nothing}. This transition to state S_5 tells the robot

that there are no more objects with Texture 1 on the table, and it thus helps the robot shift its attention from the current feature, Texture 1, to some other feature that can be used as a control objective for the actions that further clean the table. Once the table is clean, all the features except those of the box are unsuccessful. Consequently, the robot learns that taking any more actions in the blocks world domain results in no reward. This makes the robot learn that Stop is the action that maximizes the expected reward once the table is clean.

[Figure 5.3: Partial Learned Policy (All Objects on the Table Are Movable). A state diagram showing the cycle S_0 → S_1 → S_2 → S_3 → S_4 driven by the actions Reach Texture 1, Pick Texture 1, Reach Texture 4, and Pick Texture 4; the transition to S_5 when Reach Texture 1 fails; and further transitions (e.g., Reach Texture 3, Stop) through states S_6 and S_7.]
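One way to picture the learned sensing and control policy is as a lookup from abstract states to action/feature pairs. The sketch below encodes the policy fragment walked through above; the tuple and dictionary encoding is illustrative, not the thesis's data structure.

```python
# Fragment of the learned policy from Figure 5.3, written as a lookup from
# abstract state vectors to (action, feature) pairs. Encoding is illustrative.
S0 = ("Null", "Null", "Null", "Null", "Null", "Holding Nothing")
S1 = ("Reach", "Successful", "Texture 1", "Null", "Successful", "Holding Nothing")
S2 = ("Pick", "Successful", "Texture 1", "Null", "Successful", "Holding Texture 1")
S3 = ("Reach", "Successful", "Texture 4", "Null", "Successful", "Holding Texture 1")
S4 = ("Pick", "Unsuccessful", "Texture 4", "Null", "Successful", "Holding Nothing")

partial_policy = {
    S0: ("Reach", "Texture 1"),   # reach for an object with Texture 1
    S1: ("Pick", "Texture 1"),    # pick it up
    S2: ("Reach", "Texture 4"),   # reach for the box
    S3: ("Pick", "Texture 4"),    # drop the held object into the box
    S4: ("Reach", "Texture 1"),   # look for another Texture 1 object
}
```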

5.2.2 Not All Objects on the Table Are Movable

In this case, the blocks world contains the same objects as in case 1 (all objects movable); however, objects 2 and 3 on the table are not movable. Figure 5.4 shows the learning curve for this task.

[Figure 5.4: Learning Curve (Not All Objects on the Table Are Movable). Average reward versus learning steps.]

The robot learns a policy that results in it picking up and dropping all the movable objects into the box. It learns to stop once all movable objects are dropped in the box, since it receives no reward for its actions of moving to and trying to pick up the unmovable objects. The result is a partial cleaning of the table.

5.2.3 None of the Objects on the Table Are Movable

Figure 5.5 shows the learning curve for this task. In this case the robot starts out by randomly exploring the world and soon learns that none of the objects on the table are movable.

[Figure 5.5: Learning Curve (None of the Objects on the Table Are Movable). Average reward versus learning steps.]

Thus, the robot learns to stop immediately (Figure 5.6) instead of taking any other actions, since it learns that the table can never be cleaned.

[Figure 5.6: Learned Policy (None of the Objects on the Table Are Movable). A single transition from S_0 to S_1 via the action Stop.]

5.3 Sorting Task

The robot has to learn a policy for sorting objects on the table into different boxes based on a given criterion, e.g., color, shape, or color and shape. In the sorting task the blocks world has the following objects on the table (Figure 5.7 shows the blocks world setup for the sorting task):

1. Object 1: Texture 1, Square, Small
2. Object 2: Texture 1, Round, Medium

and a box 1 (Texture 3, Rectangle, Big) and a box 2 (Texture 2, Rectangle, Big) into which the objects are to be sorted.

[Figure 5.7: Sorting Task Setup. The figure's legend distinguishes sizes (Small, Medium, Big), shapes (Square, Round, Rectangle), and textures (Texture 1 through Texture 3).]

The robot gets a reward when it drops:

1. An object having features Texture 1 and Small into box 1.
2. An object having features Texture 1 and Medium into box 2.

[Figure 5.8: One of the Learned Policy for Sorting Task. A state diagram (states S_0 through S_9) in which the robot reaches for and picks up the Texture 1 Small object, drops it into box 1 (Texture 3, Big), then reaches for and picks up the Texture 1 Medium object, drops it into box 2 (Texture 2, Big), and finally stops.]

Figure 5.8 shows one of the policies learned by the robot for the sorting task. The robot starts in state S_0 {Null, Null, Null, Null, Null, Holding Nothing} and learns to reach for and pick up the object with features Texture 1 and Small (states S_1 {Reach, Successful, Texture 1, Small, Successful, Holding Nothing} and S_2 {Pick, Successful, Texture 1, Small, Successful, Holding Texture 1 and Small}, respectively, for the Reach and Pick actions), and