Developing Focus of Attention Strategies Using Reinforcement Learning

Department of Computer Science and Engineering
University of Texas at Arlington, Arlington, TX

Developing Focus of Attention Strategies Using Reinforcement Learning

Srividhya Rajendran

Technical Report CSE

This report was also submitted as an M.S. thesis.

DEVELOPING FOCUS OF ATTENTION STRATEGIES USING REINFORCEMENT LEARNING

The members of the Committee approve the master's thesis of Srividhya Rajendran:

Manfred Huber, Supervising Professor
Farhad Kamangar
Lawrence B. Holder

Copyright by Srividhya Rajendran 2003
All Rights Reserved

DEDICATION

This thesis is dedicated to my father, T.K. Rajendran, and my mother, Bhagyalakshmi.

DEVELOPING FOCUS OF ATTENTION STRATEGIES USING REINFORCEMENT LEARNING

by

SRIVIDHYA RAJENDRAN

Presented to the Faculty of the Graduate School of The University of Texas at Arlington in Partial Fulfillment of the Requirements for the Degree of

MASTER OF SCIENCE IN COMPUTER SCIENCE AND ENGINEERING

THE UNIVERSITY OF TEXAS AT ARLINGTON

December 2003

ACKNOWLEDGEMENTS

I would like to thank Dr. Huber for introducing me to the exciting field of Artificial Intelligence. I would also like to thank him for his enormous patience, timely advice and pertinent examples that helped me immensely towards completion of my research work. I would like to thank Dr. Kamangar and Dr. Holder for being on my committee. I would like to thank Prashant for his love, support and the hot meals he made for me when I was too busy to cook. I would like to thank Appa, Amma, Geetha and Kittu for being there for me through thick and thin, and for always encouraging me to achieve what I believed in. I would also like to thank all my friends for being there to boost my morale when I needed it the most.

November 17, 2003

ABSTRACT

DEVELOPING FOCUS OF ATTENTION STRATEGIES USING REINFORCEMENT LEARNING

Publication No.

Srividhya Rajendran, CSE

The University of Texas at Arlington, 2003

Supervising Professor: Dr. Manfred Huber

Robots and AI agents that can adapt and handle multiple tasks are increasingly needed. This requires them to have the capability to handle real-world situations. Robots use sensors to interact with the world, and processing the raw data from these sensors becomes computationally intractable in real time. This problem can be addressed by learning strategies for focus of attention. This thesis presents an approach that treats focus of attention as a reinforcement learning problem of selecting the controller and feature pairs to be processed at any given point in time. The result is a sensing and control policy that is task specific and can adapt to real-world situations using feedback from the world.

Handling all the information in the world for the successful completion of a task is computationally intractable. To address this, the approach is further augmented with short-term memory, which enables the agent to learn a memory policy. The memory policy tells the agent what to remember and when to remember it in order to successfully complete a task. The approach is illustrated using a number of tasks in the blocks world domain.

TABLE OF CONTENTS

ACKNOWLEDGEMENTS
ABSTRACT
LIST OF ILLUSTRATIONS
LIST OF TABLES

Chapter

I. INTRODUCTION
   1.1 Problem Description
   1.2 Related Work
   1.3 Approach Taken

II. MACHINE LEARNING
   2.1 Learning Agent Design
   2.2 Learning Methods

III. REINFORCEMENT LEARNING
   3.1 Reinforcement Learning Model
   3.2 Markov Decision Processes
   3.3 Learning an Optimal Policy: Model Based Methods
   3.4 Learning an Optimal Policy Using Q-Learning: Model Free Method
   3.5 Exploration versus Exploitation Strategies
   3.6 A Sample Example - The Grid World

IV. CONTROL ARCHITECTURE
   4.1 Control Architecture

V. LEARNING FOCUS OF ATTENTION - PERCEPTUAL LEVEL
   5.1 State Space Representation
   5.2 Table Cleaning Task
   5.3 Sorting Task

VI. LEARNING FOCUS OF ATTENTION - MEMORY LEVEL
   6.1 State Space Representation
   6.2 Block Stacking Task
   6.3 Block Copying Task

VII. CONCLUSION AND FUTURE WORK

REFERENCES

BIOGRAPHICAL INFORMATION

LIST OF ILLUSTRATIONS

Figure

2.1 Learning Agent Model
3.1 Reinforcement Learning Model
3.2 R(s, a) Immediate Reward Values
3.3 (a) Q(s, a) Values (b) Optimal Policies
4.1 Control Architecture
5.1 Table Cleaning Task Setup
5.2 Learning Curve (All Objects on the Table Are Movable)
5.3 Partial Learned Policy (All Objects on the Table Are Movable)
5.4 Learning Curve (Not All Objects on the Table Are Movable)
5.5 Learning Curve (None of the Objects on the Table Are Movable)
5.6 Learned Policy (None of the Objects on the Table Are Movable)
5.7 Sorting Task Setup
5.8 One of the Learned Policies for the Sorting Task
5.9 Learning Curve for Sorting Task
5.10 Sorting Task with Different Number of Features in State Vector
6.1 Action Top Blue Yellow
6.2 Block Stacking Task Setup with Two Objects
6.3 One of the Learned Policies for the Block Stacking Task (Two Objects)
6.4 Learning Curve for Block Stacking Task (Two Objects)
6.5 Block Stacking Task Setup with Three Objects
6.6 Learning Curve for Block Stacking Task (Three Objects)
6.7 Block Copy Setup with Two Objects
6.8 One of the Partial Policies for Block Copy (with Two Objects)
6.9 Learning Curve for Block Copying Task (with Two Objects)
6.10 Block Copy Setup with Three Objects
6.11 Learning Curve for Block Copying Task (with Three Objects)
6.12 Stacking Task with Different Temperature Decay Rates
6.13 Copying Task with Different Temperature Decay Rates
6.14 Stacking Task with and without Memory Policy
6.15 Copying Task with and without Memory Policy

LIST OF TABLES

Table

3.1 Q-Learning Algorithm
3.2 State Transitions with Action Up
3.3 State Transitions with Action Down
3.4 State Transitions with Action Left
3.5 State Transitions with Action Right
6.1 Block Copying Strategy Used by Humans
6.2 Different World Configurations for Block Copying with Three Objects

CHAPTER I

INTRODUCTION

1.1 Problem Description

The development of science and technology has led humans to explore the field of artificially intelligent (AI) agents and robots, and this has resulted in robots and AI agents that are very special purpose or task specific, e.g. a vacuuming robot, robots for refueling calandria tubes in an atomic power plant, etc. These robots derive their intelligent behavior from a huge amount of task-specific knowledge. We would like to have AI agents and robots that are more flexible in the range of tasks they can perform, such as assisting us with repetitive and dangerous tasks, or assisting handicapped or elderly people by monitoring and controlling various aspects of the environment. This requires the AI agents to have the capability of interacting with humans and to be able to deal with the uncertainties inherent in the real world. To accomplish this, AI agents need to make extensive use of sensors for observing and interpreting the state of the world. The use of diverse sensors results in huge amounts of raw data, and AI agents have to process this raw data to extract knowledge about the world. This poses many serious problems, including (but not limited to):

1. The lack of computational resources to process the huge amount of data in real time.

2. The huge number of data points the AI agent must consider to make real-world decisions in real time.

This requires AI agents to develop mechanisms for filtering the relevant data out of the raw data to enable targeted processing of the data critical to decision making. Similar mechanisms are observed in biological systems. Humans display both mechanical and cognitive aspects of this mechanism. For example, the retinal image from an eye has a high-acuity region at the center known as the fovea, beyond which the image resolution drops in the visual periphery. Humans, motivated by their behavioral needs, are able to focus their sight (by moving their eyes or turning their heads) onto the object of interest. The cognitive aspect of this mechanism can also be observed in humans: for example, while driving a car the driver's eyes form an image of everything within the visual boundaries, but he/she only sees the things in front of or in the proximity of the car. This is because the cognitive part of the brain only processes (focuses attention on) the visual information from in front of and around the car, since it is very relevant for driving, while it ignores the rest [Tu 1996]. This mechanism of processing only a small number of features in the sensory input while ignoring the less relevant aspects is known as Focus of Attention [William 1890] [Broadbent 1958].

According to Piaget [Mandler 1992], by 4 months of age human infants have reflexive responses that are organized into motor strategies, all sensor organs start acting in a coordinated manner, and mechanisms of focus of attention start to develop. In the following months they develop mechanisms for coordinating the developed motor strategies with the developed focus of attention mechanisms to address incrementally complex tasks [Piaget 1952]. For example, newborns continuously learn sensing and control policies by flailing their hands. This random exploration of the terrain by moving their hands helps them determine properties like traversability from the available resources, thus developing strategies for efficient perception of task-specific properties along with learning control policies to move their hands to a given location. Here the precise nature of successful sensing strategies depends heavily on the available resources and the overall task at hand. For example, while navigating through a jungle we tend to actively look out for a tiger that might kill us. The same task, navigation, performed in a city does not require us to look out for tigers, since there are none, but does require us to look out for vehicles moving along the path we are taking. So, although the task is the same, the sensing strategy changes depending on the situation we are in. This suggests that robots and AI agents that can handle a wide range of tasks must also have mechanisms for acquiring task-specific focus of attention by autonomously interacting with the environment based on the resources available to them.

This thesis concentrates on developing the above idea of acquiring strategies for task-specific focus of attention using reinforcement learning algorithms [Sutton & Barto 1998] to enable robots and AI agents to handle a wide range of tasks.

1.2 Related Work

This section gives a brief introduction to various concepts used in this thesis and discusses previous research related to this work.

Reinforcement Learning

Reinforcement learning [Sutton & Barto 1998] is a learning framework that attempts to learn how to map situations to actions so as to maximize a numerical reward signal. The learner is not told what actions to take, but instead must discover which actions yield the most reward by trying them. Reinforcement learning is defined not by characterizing learning algorithms, but by characterizing a learning problem. One of the challenges that arises in reinforcement learning is the tradeoff between exploration and exploitation. To maximize the reward, a reinforcement learning agent must prefer actions that have been tried and found to yield good reward. In order to discover such actions, the agent needs to try actions that have not been selected before. The agent needs to exploit what it already knows in order to obtain reward, but it also needs to explore in order to learn to select better actions in the future. The dilemma is that neither exploitation nor exploration can be pursued exclusively without failing at the

task. For example, consider a mobile robot that has to decide whether it should enter a new room in search of more trash to collect or start finding its way back to its battery recharging station. The robot makes its decision based on how quickly and easily it has been able to find the recharger in the past.

Q-Learning [Watkins 1989] is a reinforcement learning algorithm that does not need a model of its environment and can be used on-line. This algorithm works by estimating the values of state-action pairs. The value Q(s,a) is defined to be the expected discounted sum of future payoffs obtained by taking action a from state s and following an optimal policy thereafter. Once these values are learned, the optimal action from any state is the one with the highest Q-value. A more detailed explanation of machine learning, reinforcement learning, and Q-learning is presented in Chapter II and Chapter III.

Task Specific Focus of Attention

Many approaches have been tried to learn selective attention policies for task completion. Andrew McCallum [McCallum 1996] developed mechanisms for learning selective attention for sequential tasks. Using the U-Tree algorithm, McCallum's work dynamically builds a state space representation by pruning the large perceptual state space. His work further augments this state space with short term memory containing the crucial features that were missing because of hidden state. The results indicate that the state space representation developed by the U-Tree algorithm does not perform well in continuous state spaces. [Laar, Heskes & Gielen 1997] learned task dependent focus

of attention using neural networks. This approach used a limited sensor modality, constructed a special computational mechanism, and did not have any real-time feedback, which limits its use in real-world robotic tasks. [Goncalves et al. 1999] presented an approach that uses learning mechanisms similar to those presented in this thesis to identify object identities in an active stereo vision system. This approach introduced special-purpose visual maps to accomplish the required task. [Rao & Ballard 1995] proposed a general active vision architecture based on efficiently computable iconic representations. The architecture employed by them used visual routines for object identification and localization. They showed that complex visual behaviors could be obtained by using these two routines with different parameters. Most of the previous work considered very limited sensor resources and constructed special-purpose mechanisms specific to one type of sensor modality and task domain. Moreover, these approaches did not provide real-time feedback and were therefore limited in the context of robotic task execution.

1.3 Approach Taken

The approach proposed in this thesis considers focus of attention as the problem of selecting a set of features to be processed at any given point in time during task execution. This is done by tight integration of sensing and control policies. In this approach, each action that the robot system executes in the real world is associated with a set of features that form the control objective for that action. For example, the action Reach Blue on a robot arm results in the robot reaching for a blue

object. At each point in time the robot has to decide the relevant features to process in the context of the chosen action. This reduces the amount of raw data that has to be analyzed to the data required for the related features. The sensing and control policies are learned by an online reinforcement learning algorithm that acquires task-specific sensing strategies that are optimized with respect to the task, capabilities, and resources of the AI agent. The learning architecture used here is based on a hybrid discrete/continuous system architecture where control strategies are constructed as a discrete event dynamic system supervisor [Ramadge & Wonham 1989] [Ozveren & Willsky 1990] [Thistle & Wonham 1994] [Huber & Grupen 1997]. In this architecture, new sensing and control strategies are learned by direct interaction between the system and its environment. This architecture is used since it allows the simplification of the temporal component of the policy by handling it in the discrete event dynamic system framework and thus eliminates the requirement of representing continuous time. This allows the system to autonomously derive sequences of concurrent actions and relevant perceptual features in order to optimize the system performance on the given task.

CHAPTER II

MACHINE LEARNING

A computer program is said to be a machine learning algorithm if it can improve its performance at a task over time. A formal definition of machine learning by Tom Mitchell is as follows: A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E [Mitchell 1997]. For example, a computer program that is designed to play chess is measured by its ability to win chess games. Here the task is a series of chess games, the performance measure is the number of games won by the program against its opponents, and the training experience is playing chess games against itself.

2.1 Learning Agent Design

A learning agent is basically divided into four modules [Russell and Norvig 1995]:

1. Critic
2. Learning element
3. Problem generator
4. Performance element

Figure 2.1 Learning Agent Model [Russell and Norvig 1995] (block diagram showing the critic, learning element, problem generator and performance element interacting with the environment through sensors and effectors)

Critic

The critic tells the learning element how well the agent is learning. The critic has an input known as the performance standard. This is required because the percepts by themselves do not indicate anything about the agent's performance. For example, a chess program may receive a percept indicating that it has been checkmated by its opponent. It will not know whether being checkmated is good or bad unless there is a negative reward (performance standard) indicating that it is bad, or a positive reward (performance standard) indicating that checkmating its opponent is good. The performance standard is a fixed measure that indicates how good or bad a given action

is from a given state. The performance standard is set by the environment of the learning agent.

Problem Generator

The problem generator is responsible for generating actions that will lead to new and informative experience; otherwise the agent will only take the best action given what it knows. Exploration leads the agent, over the long run, to learn better actions through a series of suboptimal actions, as opposed to exploitation.

Learning Element

The learning element is responsible for improving the efficiency of the performance element. It takes feedback from the critic in any given state and adjusts the performance element accordingly, thereby improving the learning over time. The learning element design is based on the following:

1. The components of the performance element to be improved.
2. The representation used by the components of the performance element.
3. The feedback given by the critic.
4. The amount of prior information available.

Performance Element

The performance element is responsible for taking external actions. The design of the learning element depends on the design of the performance element. The

performance element is mainly responsible for improving the evaluation function by enhancing its accuracy over time.

2.2 Learning Methods

There are basically two ways in which the components of the performance element are learned:

1. Supervised learning
2. Unsupervised learning

Supervised Learning

Supervised learning is a method by which an evaluation function is learned from training samples. The training samples consist of inputs and desired outputs. The task of the learner is to predict the value of the function for any valid input after having seen a small number of training samples. This is done by generalizing from the presented data to unseen situations in a reasonable way. There are many approaches to implementing supervised learning, such as artificial neural networks, decision tree learning, etc.

Unsupervised Learning

Unsupervised learning is a method where the learner has no idea about the output. In unsupervised learning, a data set of input objects is gathered and then a joint density model is built. A form of unsupervised learning is clustering.
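
To make the two settings concrete, the following is a small illustrative sketch in Python. The data and the particular algorithms (1-nearest-neighbor prediction and a 1-D two-means clustering) are assumptions chosen for brevity; they are not methods used in this thesis.

```python
# Illustrative sketch only: a toy supervised learner and a toy unsupervised
# learner. The data and algorithm choices are assumptions for this example.

def nearest_neighbor_predict(train, x):
    """Supervised: train is a list of (input, label) pairs; return the label
    of the training input closest to x."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]

def two_means(xs, iters=10):
    """Unsupervised: split unlabeled numbers into two clusters (1-D k-means)."""
    c1, c2 = min(xs), max(xs)
    groups = ([], [])
    for _ in range(iters):
        groups = ([x for x in xs if abs(x - c1) <= abs(x - c2)],
                  [x for x in xs if abs(x - c1) > abs(x - c2)])
        c1 = sum(groups[0]) / len(groups[0])
        c2 = sum(groups[1]) / len(groups[1])
    return groups

print(nearest_neighbor_predict([(1.0, "small"), (9.0, "big")], 2.5))  # -> small
print(two_means([1.0, 1.2, 0.8, 9.0, 9.5, 8.7]))                      # two groups
```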

CHAPTER III

REINFORCEMENT LEARNING

Reinforcement learning is a form of supervised learning mainly used for robots and AI agents to learn a task. The learner explores the environment by perceiving the state and taking subsequent actions. The environment in return provides rewards (which might be positive or negative), and the agent attempts to learn a policy that maximizes the cumulative reward over the task. A reinforcement learning algorithm differs from other learning algorithms in the following ways:

1. It does not know a priori what the effect of its actions on its environment is, and it does not have any knowledge about which actions are best in its long-term interest.

2. It can receive reward in any state or only in a terminal state. Rewards define the utility the agent is trying to maximize.

3. It determines the distribution of the training examples by the sequence of actions it chooses.

4. It may or may not have a model of the environment.

5. The environment in which the agent is acting may or may not be fully observable. In an observable environment states can be identified with

percepts, whereas in a partially observable environment the agent has to maintain some internal state to keep track of the environment.

3.1 Reinforcement Learning Model

Figure 3.1 Reinforcement Learning Model (agent-environment loop: the agent observes states s0, s1, s2, ..., takes actions a0, a1, a2, ..., and receives rewards r0, r1, r2, ...)

Figure 3.1 shows a basic reinforcement learning model [Mitchell 1997]. In this model the agent receives an added input from the environment representing a measure of how good or bad the last action executed by the agent was. The model of the interaction is that the agent makes an observation and interprets the state of the world as s_t, selects an action a_t, performs this action, which results in a new state s_{t+1}, and receives a reward r_t. The aim of the agent is to learn to choose actions that tend to increase the expected cumulative reward.
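
The interaction loop of Figure 3.1 can be sketched as follows. The `environment` and `policy` objects are hypothetical stand-ins with assumed reset/step/select_action interfaces; they are not part of the thesis's system.

```python
# Minimal sketch of the agent-environment loop in Figure 3.1 (illustrative
# interfaces only; the environment and policy objects are assumptions).

def run_episode(environment, policy, max_steps=100):
    trajectory = []
    state = environment.reset()                  # agent observes s_0
    for _ in range(max_steps):
        action = policy.select_action(state)     # agent chooses a_t based on s_t
        next_state, reward, done = environment.step(action)  # world returns r_t, s_{t+1}
        trajectory.append((state, action, reward))
        state = next_state
        if done:
            break
    return trajectory   # the cumulative reward is the sum of the recorded rewards
```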

There are many algorithms that can be used by the agent to learn to behave optimally. However, before learning to behave optimally we need to decide what the model of optimal behavior will be. There are three basic models that have been widely considered [Kaelbling, Littman & Moore 1996]:

1. The finite horizon model
2. The infinite horizon model
3. The average reward model

The Finite Horizon Model

In this model, at any given moment in time the agent should optimize its expected reward for the next h steps:

E( Σ_{t=0}^{h} r_t )

where r_t represents the scalar reward received t steps into the future.

The Infinite Horizon Model

In this model, the agent takes into account its long-run reward, but the rewards that are received in the future are geometrically discounted according to the discount factor γ (where 0 ≤ γ < 1):

E( Σ_{t=0}^{∞} γ^t r_t )

where γ is a constant that represents the relative value of delayed versus immediate rewards.

The Average Reward Model

In this model, the agent is supposed to take actions that optimize its long-run average reward:

lim_{h→∞} E( (1/h) Σ_{t=0}^{h} r_t )

3.2 Markov Decision Processes

Problems with delayed rewards can be modeled as Markov decision processes (MDPs). An MDP consists of the following [Kaelbling, Littman & Moore 1996]:

1. A set of states S,
2. A set of actions A,
3. A reward function R : S × A → ℝ, and
4. A state transition function T : S × A → Π(S), where a member of Π(S) is a probability distribution over S.

The model is Markov if the state transitions are independent of any previous environment states or agent actions.

3.3 Learning an Optimal Policy: Model Based Methods

Let us assume that the agent knows a correct model and is trying to find an optimal policy for the infinite horizon discounted model. Then the optimal value of a state, V*(s), is the expected infinite discounted sum of reward that the agent will gain if

it starts in that state and executes an optimal policy [Kaelbling, Littman & Moore 1996] [Mitchell 1997]. If Π is a complete decision policy, then the optimal value function is:

V*(s) = max_Π E( Σ_{t=0}^{∞} γ^t r_t )

which can be written as:

V*(s) = max_a [ R(s,a) + γ Σ_{s'∈S} T(s,a,s') V*(s') ],  for all s ∈ S

Given the optimal value function, the optimal policy can be specified as:

Π*(s) = arg max_a [ R(s,a) + γ Σ_{s'∈S} T(s,a,s') V*(s') ]

where T(s,a,s') is the probability of making a transition from state s to s' using action a and R(s,a) is the reward for taking action a in state s.

3.4 Learning an Optimal Policy Using Q-Learning: Model Free Method

Q-learning [Watkins 1989] [Kaelbling, Littman & Moore 1996] [Mitchell 1997] is a learning algorithm that is used to learn an optimal policy when the agent does not have a model. The optimal policy when the model is known is

Π*(s) = arg max_a [ R(s,a) + γ Σ_{s'∈S} T(s,a,s') V*(s') ]

However, since the agent here does not know T(s,a,s') and R(s,a), using this expression would require the agent to be able to predict the immediate reward and immediate successor state for each state-action transition. Since that is impossible without a model, V* alone cannot be used to select actions. Therefore the agent is required to use a more general evaluation function.

Let us define the evaluation function Q(s,a) such that the value of Q is the reward received immediately upon executing action a in state s, plus the value (discounted by γ) of following the optimal policy thereafter:

Q(s,a) ≡ R(s,a) + γ V*(δ(s,a))

where Q(s,a) is the value being maximized and δ(s,a) denotes the state resulting from applying action a in state s. Now the agent is required to learn the Q function instead of V*, and by doing so it will be able to learn an optimal policy even though it does not have a model. It follows that learning the Q function corresponds to learning an optimal policy. The relationship between Q and V* is:

V*(s) = max_{a'} Q(s,a')

therefore

Q(s,a) = R(s,a) + γ max_{a'} Q(δ(s,a), a')

and since the definition of Q is recursive, it is learned iteratively in the Q-learning algorithm. In this algorithm the learner keeps the Q values for each state-action pair in a large table. Before the learning phase begins, the Q value for each state-action pair is initialized to some random value. Each time the learner takes an action a in state s and receives a reward r, the Q value in the table corresponding to that state-action pair is updated using the following update rule:

Q(s,a) ← Q(s,a) + α ( r + γ max_{a'} Q(s',a') - Q(s,a) )

where α is the learning rate. This process is repeated until the algorithm converges, i.e. the old Q(s,a) is the same as the new Q(s,a). The Q-learning algorithm converges toward the true Q function if:

1. The system is a deterministic MDP.

2. Each state-action transition occurs infinitely often.

3. There exists some positive constant c such that for all states s and actions a, |R(s,a)| < c.

Table 3.1 Q-Learning Algorithm [Mitchell 1997]

For each s, a initialize Q(s, a) to zero.
Observe the current state s.
Do forever:
    Select an action a and execute it.
    Receive immediate reward r.
    Observe the new state s'.
    Update the table entry for Q(s, a) as follows:
        Q(s, a) ← Q(s, a) + α ( r + γ max_{a'} Q(s', a') - Q(s, a) )
    s ← s'

3.5 Exploration versus Exploitation Strategies

A reinforcement learning algorithm is different from other supervised learning algorithms in the sense that the learner has to explicitly explore the environment while learning a policy. One strategy for an agent in state s is to select an action a that maximizes Q(s, a); this strategy is known as exploitation. By using this strategy the agent risks overcommitting to actions that are found to have high Q-values during the early phase of training, failing to explore other possible actions that might have the potential to yield even higher Q-values. The above convergence theorem requires that each state-action transition occurs infinitely often, and since only the best action is chosen, this approach runs the risk of not achieving the convergence of the learning

algorithm. Another strategy for an agent in state s is to select a random action a; this strategy is known as exploration. With this strategy the agent does learn about actions with good values, but this is of little benefit since the agent never puts to use what it has learned during exploration. Therefore, the best way to train a reinforcement learner is a strategy that does both exploration and exploitation in moderation. This means a method that allows the agent to explore when it has no idea of the environment, and to exploit greedily when it has learned sufficiently well about the environment. The method that this thesis uses for this purpose is the Boltzmann soft-max distribution.

Boltzmann Soft-max Distribution

If there are n items and the fitness of each item i is f(i), then the Boltzmann distribution defines the probability of selecting item i, p(i), as

p(i) = e^{f(i)/T} / Σ_j e^{f(j)/T}

where T is called the temperature. By varying the parameter T we can vary the selection from picking a random item (T infinite), to having higher probabilities for items with higher fitness (T small and finite), to strictly picking the item with the best fitness (T tends to 0). This is accomplished by decaying the temperature exponentially using the equation

T_t = T_0 e^{-λt}

where T_t is the temperature at time step t, T_0 is the temperature at time step 0, λ is the decay constant and t is the time step.
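
The following is a minimal sketch of Boltzmann soft-max selection with exponential temperature decay. The fitness values, T_0 and λ used here are illustrative assumptions, not parameters from the thesis's experiments.

```python
import math
import random

# Sketch of Boltzmann soft-max selection over items with fitness f(i),
# together with the exponential temperature decay T_t = T_0 * e^(-lambda * t).
T0, LAM = 10.0, 0.001   # illustrative initial temperature and decay constant

def temperature(t):
    return T0 * math.exp(-LAM * t)

def boltzmann_select(fitness, t):
    """Pick index i with probability exp(f(i)/T_t) / sum_j exp(f(j)/T_t)."""
    T = temperature(t)
    m = max(fitness)   # subtracting the max does not change the distribution
    weights = [math.exp((f - m) / T) for f in fitness]
    return random.choices(range(len(fitness)), weights=weights, k=1)[0]

fitness = [1.0, 2.0, 5.0]
print(boltzmann_select(fitness, t=0))      # large T: selection is nearly uniform
print(boltzmann_select(fitness, t=10000))  # small T: almost always the best item
```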

In our case, where there are n actions available in state s, the fitness of an action is given by Q(s, a_i). The probability p(a_i | s) of taking action a_i in s is given by:

p(a_i | s) = e^{Q(s,a_i)/T} / Σ_j e^{Q(s,a_j)/T}

3.6 A Sample Example - The Grid World

Figure 3.2 R(s, a) Immediate Reward Values (a 2x3 grid with states S4, S5 and G in the top row and S1, S2 and S3 in the bottom row, where G contains the gold)

Let us consider a small grid world example [Mitchell 1997] as shown in Figure 3.2. In this world each grid square represents a unique state for the agent. There are four possible actions that an agent can take to move from one square to another. These actions are:

1. Up: When the agent takes this action it moves up in the grid world.

Table 3.2 State Transitions with Action Up
    Old State    Action    New State
    S1           Up        S4
    S4           Up        S4

2. Down: When the agent takes this action it moves down in the grid world.

Table 3.3 State Transitions with Action Down
    Old State    Action    New State
    S1           Down      S1
    S4           Down      S1

3. Left: When the agent takes this action it moves left in the grid world.

Table 3.4 State Transitions with Action Left
    Old State    Action    New State
    S4           Left      S4
    S5           Left      S4

4. Right: When the agent takes this action it moves right in the grid world.

Table 3.5 State Transitions with Action Right
    Old State    Action    New State
    S3           Right     S3
    S2           Right     S3

However, some of these actions cannot be executed from cells at the boundary of the grid world. The goal of the agent is to learn a path (maximizing the cumulative reward) to reach the grid cell containing the gold (the absorbing state). Only an action that leads to the gold gets a reward of 100; all other actions get a reward of 0. The agent uses the Q-learning algorithm since it does not have any idea about the model of the world. This is done by creating a table that contains an entry for the Q value of each state-action pair. All Q-value entries are initialized to 0 as shown in the Q-learning algorithm in Table 3.1. Let γ (0 ≤ γ < 1) be equal to 0.9. Since the agent does not have any knowledge of the world, it starts out by exploring the world and updating the Q value of the corresponding state-action pair using the update equation shown in the Q-learning algorithm in Table 3.1. Figure 3.2 shows that if the agent reaches state S5 and takes action Right, it receives a reward of 100 since it reaches the gold. So the new Q(S5, Right) is:

new Q(S5, Right) = 0 + α ( 100 + 0.9 · max_{a'} Q(G, a') - 0 ) = 0 + α ( 100 + 0.9 · 0 - 0 ) = 100

(using the update equation Q(s, a) ← Q(s, a) + α ( r + γ max_{a'} Q(s', a') - Q(s, a) ), with α = 1). Similarly, if the agent takes action Up in state S3 it receives a reward of 100, and as the learning process continues, the state-action pair values are updated as shown in Figure 3.3(a); finally, when the agent has explored sufficiently, it learns a policy as shown in Figure 3.3(b).

Figure 3.3 (a) Q(s, a) Values, (b) Optimal Policies
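
As an illustration of how the algorithm in Table 3.1 plays out on this example, the sketch below runs tabular Q-learning on a grid laid out to match the transition tables above (top row S4, S5, G; bottom row S1, S2, S3). The episode count and the purely random exploration are simplifying assumptions, not the thesis's exact setup.

```python
import random

# Sketch: tabular Q-learning on the grid world of Section 3.6. The layout is
# assumed from Figure 3.2 and the transition tables above; G holds the gold.
GAMMA, ALPHA = 0.9, 1.0            # alpha = 1 suffices in a deterministic world
ACTIONS = ["Up", "Down", "Left", "Right"]
GRID = {"S4": (0, 0), "S5": (0, 1), "G": (0, 2),
        "S1": (1, 0), "S2": (1, 1), "S3": (1, 2)}
POS = {cell: name for name, cell in GRID.items()}
MOVES = {"Up": (-1, 0), "Down": (1, 0), "Left": (0, -1), "Right": (0, 1)}

def step(state, action):
    """Deterministic move; stepping off the grid leaves the state unchanged."""
    r, c = GRID[state]
    dr, dc = MOVES[action]
    next_state = POS.get((r + dr, c + dc), state)
    reward = 100 if next_state == "G" else 0   # only moves into the gold pay off
    return next_state, reward

Q = {(s, a): 0.0 for s in GRID for a in ACTIONS}
for _ in range(2000):
    s = random.choice(["S1", "S2", "S3", "S4", "S5"])
    while s != "G":                            # the gold is the absorbing state
        a = random.choice(ACTIONS)             # pure exploration, for simplicity
        s_next, r = step(s, a)
        best_next = max(Q[(s_next, a2)] for a2 in ACTIONS)
        Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
        s = s_next

print(Q[("S5", "Right")])   # converges to 100, as computed in the text
print(Q[("S2", "Right")])   # converges to 90 = 0 + 0.9 * 100
```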

CHAPTER IV

CONTROL ARCHITECTURE

A robotic or AI system that deals with new situations and handles a wide range of tasks requires a large degree of generality and the capability to adjust on-line. This is because it is neither realistic nor easy to have complete prior knowledge of the environments the robot system has to deal with. This calls for highly complex and nonlinear robotic control systems with a control architecture that reduces the overall complexity. This is achieved using a control architecture [Huber & Grupen 1997] based on a hybrid discrete event dynamic system [Ramadge & Wonham 1989] [Ozveren & Willsky 1990] [Thistle & Wonham 1994] to control the robots.

4.1 Control Architecture

Figure 4.1 shows the organization of the control architecture. This architecture mainly consists of the following components:

1. Controller / feature pairs
2. Supervisor
3. Learning element

All the components are tightly integrated in order to achieve the required performance characteristics.

Figure 4.1 Control Architecture (the reinforcement learning element passes a control/sensing policy to the supervisor based on state information; the supervisor exchanges symbolic events and control activations with the controller/feature pairs, which interact with the physical sensors and actuators)

Controller / Feature Pairs

The controller/feature pairs represent the bottommost module of the given control architecture. This module directly interacts with the physical sensors and actuators of the robot system. At each point in time the control policy in the supervisor determines the controller that has to be activated and the associated features that have to be processed. This set of features forms the control objective of the controller. If there is

an action Reach that controls the robot to reach for objects in the world, then the features associated with it tell the robot which object it is reaching for and where it is located in the world. For example, Reach Blue results in the robot arm reaching for a blue object in the world. The convergence of a controller represents the completion of a control objective and results in the generation of a discrete symbolic event. This symbolic event triggers a transition of the robot's state and thus results in the deactivation of the control signal. This process of choosing the relevant features to process in the context of the chosen action reduces the amount of raw data that has to be analyzed to that required for the selected features.

Supervisor

The supervisor is the heart of this control architecture. It represents the task-specific control policy that is learned by the robot using the learning component of the given architecture. The supervisor is built on top of the symbolic predicate space and abstract event space. The predicate space model represents an abstract description of all possible ways in which the robot can attempt to actively manipulate the state of the system and the environment. This abstract representation provides a compact state space to be used by the learning component. An abstract state consists of a vector of predicates representing the effects of controller/feature pairs on the world. Each state corresponds to a subgoal that is directly attainable by the controller/feature pairs. The action-dependent choice of the predicates thus forms a compact representation on which control policies can be learned

for all tasks that are directly addressable using the underlying closed-loop controllers of this system. The aim of the controller is to reach an equilibrium state where the control objective is satisfied, thus asserting the related predicates of the abstract state. However, the outcomes of these control actions at the supervisor level are nondeterministic due to kinematic limitations, controller interactions and other nonlinearities in the system. Therefore it may happen that the same action with the same set of features to be monitored from state s leads to a set of different states {s1, s2, s3} at different points in time. Hence the supervisor takes the form of a nondeterministic finite automaton that triggers transitions with the convergence of controllers.

Learning Element

The learning component uses Q-learning to learn sensing and control policies. This allows the learning component to learn policies that optimize the reinforcement given by the environment upon completion of functional goals. At the same time, the exploration process allows it to improve the abstract system model of the supervisor. Each time a new state is reached, the world gives feedback in the form of reinforcement to the learning component. This, in turn, is used to update the model by updating the transition probabilities of the states.
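
The following hypothetical sketch shows how the pieces described in this chapter could fit together at the supervisor level: the learned policy selects a (controller, feature set) pair, the controller runs until it converges, and the resulting symbolic event (a new abstract state plus a reward) drives a Q-learning update. All names, interfaces and parameter values are assumptions for illustration; none of them reflect the thesis's actual implementation.

```python
# Hypothetical supervisor-level learning loop (illustrative only).
ALPHA, GAMMA = 0.1, 0.9

def supervisor_episode(Q, start_state, pairs, execute, select):
    """Run one episode over (controller, features) choices.

    execute(pair) is assumed to run the low-level controller to convergence
    and return (next_abstract_state, reward, done); select(Q, state, pairs)
    is the exploration strategy (e.g. the Boltzmann soft-max of Section 3.5).
    """
    s = start_state
    done = False
    while not done:
        pair = select(Q, s, pairs)            # e.g. ("Reach", ("Blue", "Square"))
        s_next, reward, done = execute(pair)  # controller convergence -> symbolic event
        best_next = max(Q.get((s_next, p), 0.0) for p in pairs)
        Q[(s, pair)] = Q.get((s, pair), 0.0) + ALPHA * (
            reward + GAMMA * best_next - Q.get((s, pair), 0.0))
        s = s_next
    return Q
```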

CHAPTER V

LEARNING FOCUS OF ATTENTION - PERCEPTUAL LEVEL

As described earlier, a robotic system that is constructed to operate successfully in the real world has to have a mechanism to learn task-specific sensing and control policies. In the approach presented here, the robot learns sensing and control policies using the reinforcement learning component of the control architecture. The learned policy provides the robot with an action and a set of features to be processed with this action. The features processed with each action determine the control objective of the action in the given state. To address complex tasks in real time and adapt online, the control architecture used by the robot system reduces the size of the continuous state space by using an abstract representation of the states. To illustrate the proposed approach, let us consider the following examples in the blocks world domain, where the robot interacts with objects on the table top:

1. Table cleaning task
2. Sorting task

The robot configuration consists of a robot arm, a stereo vision system, and feature extraction algorithms to identify visual features of the objects in the world, such as color, shape, size and texture. The robot learns a task-specific sensing and control

policy in order to optimize the system performance for the given task through interaction with the blocks world. The robot can perform the following actions:

1. Reach: This action is used by the robot arm to reach for an object at any given location within the boundaries of the blocks world.

2. Pick: This action is used to pick up or drop an object at the current location.

3. Stop: This action allows the robot to stop its interaction with the blocks world. This action is needed mainly because:

   a. The robot does not know what the task is and learns it from the reinforcements given by the world.

   b. There are no absorbing states in the real world. Therefore, the robot has to learn when to stop while performing a task so as to maximize the expected reward.

Each Reach or Pick action is associated with a set of features that the robot has to process in order to derive the goal of the action and then to complete the given task. For example, Reach Blue will cause the robot arm to reach for a blue object in the blocks world if such an object exists. As the number of features present in the world increases, the complexity of the system also increases. In order to restrict the system complexity, the number of features that can be processed with each action at a given point in time is limited to two in the experiments presented here.
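
To illustrate why this cap matters, the sketch below enumerates a supervisor-level action set in which each Reach or Pick action is paired with at most two features. The feature list is an illustrative assumption, not the exact set used in the experiments.

```python
from itertools import combinations

# Sketch: enumerating (action, feature combination) pairs with at most two
# features per action. FEATURES is an assumed, illustrative feature set.
BASE_ACTIONS = ["Reach", "Pick"]
FEATURES = ["Texture 1", "Texture 2", "Blue", "Square", "Small"]

def action_space(features=FEATURES, max_features=2):
    actions = [("Stop", ())]                     # Stop takes no features
    for base in BASE_ACTIONS:
        for k in range(1, max_features + 1):     # one or two features per action
            for combo in combinations(features, k):
                actions.append((base, combo))
    return actions

print(len(action_space()))   # the count grows quickly with the number of features,
                             # which is why it is capped at two features per action
```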

5.1 State Space Representation

The Q-learning algorithm uses an abstract state space to learn the sensing and control policy. Each abstract state can be represented as a vector of predicates. In the blocks world domain, the vector of predicates constituting an abstract state consists of:

1. Action Taken: This indicates what action was taken.

2. Action Outcome: This indicates whether the action was successful or unsuccessful.

3. Feature 1 and/or Feature 2: These are the features that were chosen by the robot to determine the control objective of the last action.

4. Feature Successful/Unsuccessful: This indicates whether the feature combination used by the last action was found.

5. Arm Holding: This indicates what the arm is holding.

For example, if the robot is in the start state s0 { Null, Null, Null, Null, Null, Holding Nothing } and takes action Reach, and the features that formed the control objective were color Blue and shape Square, then the robot tries to reach for an object that has color Blue and a Square shape. If the world contains an object with this feature combination, then this action leads to a new state whose vector of predicates has the value { Reach, Successful, Blue, Square, Successful, Holding Nothing }, meaning that the action Reach for an object with features Blue and Square was successful and the feature combination Blue and Square was found. At the end of this action the arm is Holding Nothing.
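
One possible encoding of this predicate vector is sketched below. The field names mirror the list above, but the dataclass representation itself is an assumption made for illustration, not the thesis's data structure.

```python
from dataclasses import dataclass
from typing import Optional

# Sketch of the abstract state vector of Section 5.1 (encoding is assumed).
# frozen=True makes states hashable, so they can serve as keys in a Q table.
@dataclass(frozen=True)
class AbstractState:
    action_taken: Optional[str]       # e.g. "Reach", "Pick", "Stop"
    action_outcome: Optional[str]     # "Successful" or "Unsuccessful"
    feature_1: Optional[str]          # e.g. "Blue"
    feature_2: Optional[str]          # e.g. "Square"
    features_found: Optional[str]     # whether the feature combination was found
    arm_holding: str                  # e.g. "Holding Nothing"

# The start state and the state reached after a successful "Reach Blue Square".
s0 = AbstractState(None, None, None, None, None, "Holding Nothing")
s1 = AbstractState("Reach", "Successful", "Blue", "Square", "Successful",
                   "Holding Nothing")
```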

5.2 Table Cleaning Task

In this experiment a number of objects are on the table top and the robot can move or pick up only one object at a time. The task is to learn a sensing and control policy that will allow the robot to reach for and pick up objects (that are movable) from the table and then move and drop these objects into a box. While learning a policy for the task, the robot also has to learn when to stop, since there is no explicit information or feedback as to when the task is completed. The robot has a cost associated with each action it takes, and receives a small positive reward each time it picks up an object from the table and drops it into the box.

Figure 5.1 Table Cleaning Task Setup (possible feature values: Size = Small, Medium, Big; Shape = Square, Round, Rectangle; Texture = Texture 1, Texture 2, Texture 3, Texture 4)

The blocks world has the following objects and features present on the table (as shown in Figure 5.1):

Object 1: Texture 1, Square, Small

Object 2: Texture 1, Round, Medium
Object 3: Texture 2, Square, Medium
Object 4: Texture 3, Round, Small
Object 5: Texture 3, Square, Small

and a box (Texture 4, Rectangle, Big) into which all the objects are to be dropped. Objects with size Big cannot be picked up by the robot arm. All other objects are either movable or unmovable, and whether an object is movable or unmovable can only be determined by trying to pick it up at least once. Once a particular object is dropped in the box it is no longer visible. As a result, if the robot is in a state sx { Pick, Unsuccessful, Texture 4, Null, Successful, Holding Nothing } where it has just dropped an object with feature Texture 1 into the box and again reaches for an object with feature Texture 1, then the Reach action will only be successful if there is another object with Texture 1 on the table. Otherwise, it will be unsuccessful, since the features of the dropped object are no longer accessible to the feature extraction algorithm.

Starting from the derived abstract state representation, the system uses the learning component to learn the value function using the reward it receives each time a transition from state s_t to s_{t+1} takes place. The robot starts out exploring the world, taking random actions (100% exploration) and incrementally decreasing its exploration using the Boltzmann soft-max distribution until it reaches a stage where the exploration is about 10%. This amount of exploration is maintained to allow the system to visit different, potentially new parts of the abstract state space even after it achieves a

locally optimal policy. This is done in order to improve its performance by enabling it to learn a more global strategy.

For the table cleaning task, 3 possible cases were considered:

1. All objects on the table are movable.
2. Not all objects on the table are movable: a few objects on the table are movable and a few are unmovable.
3. None of the objects on the table are movable.

All Objects on the Table Are Movable

Figure 5.2 shows the learning curve for the table cleaning task when all objects on the table are movable. The learning curve is plotted as a running average over 30 steps and depicts the average of 10 learning trials. The intervals indicate one standard deviation.

Figure 5.2 Learning Curve (All Objects on the Table Are Movable) (average reward versus learning steps)

Figure 5.3 shows a part of the control policy learned by the robot for case 1 of the table cleaning task. Each arrow in the figure represents a possible result of the related action in terms of a transition from the old state to the new state of the world. The robot starts in state S0 { Null, Null, Null, Null, Null, Holding Nothing }. From state S0, the robot takes action Reach Texture 1. This results in the robot reaching for an object with Texture 1, leading to state S1 { Reach, Successful, Texture 1, Null, Successful, Holding Nothing }. If there is more than one object with Texture 1, then the robot randomly moves to any one of those objects, since the objects look identical if they have the same texture. From state S1, the robot takes action Pick Texture 1, resulting in successfully picking up the object with Texture 1 and leading to state S2 { Pick, Successful, Texture 1, Null, Successful, Holding Texture 1 }. The robot arm then reaches for the box where it needs to drop the object using action Reach Texture 4, leading to state S3 { Reach, Successful, Texture 4, Null, Successful, Holding Texture 1 }. Finally, the robot takes the action Pick Texture 4, and this results in dropping the held object into the box, thus leading to state S4 { Pick, Unsuccessful, Texture 4, Null, Successful, Holding Nothing }. From state S4 the robot can again try to Reach for Texture 1, and if there are more objects with Texture 1 then the states S1, S2, S3, S4 are repeated until all objects of Texture 1 are dropped into the box. Once all the objects of Texture 1 are dropped, taking action Reach for Texture 1 results in a transition to a state S5 { Reach, Unsuccessful, Texture 1, Null, Unsuccessful, Holding Nothing }. This transition to state S5 tells the robot

that there are no more objects with Texture 1 on the table and thus helps the robot to shift its attention from the current feature Texture 1 to some other feature that can be used as a control objective for the actions to further clean the table. Once the table is clean, all the features except those in the box are unsuccessful. Consequently, the robot learns that taking any more actions in the blocks world domain results in no reward. This makes the robot learn that Stop is the best action to maximize expected reward once the table is clean.

Figure 5.3 Partial Learned Policy (All Objects on the Table Are Movable)

Not All Objects on the Table Are Movable

In this case, the blocks world contains the same number of objects as in case 1 (all objects are movable). However, in this case objects 2 and 3 on the table are not movable. Figure 5.4 shows the learning curve for this task.

Figure 5.4 Learning Curve (Not All Objects on the Table Are Movable) (average reward versus learning steps)

The robot learns a policy that results in the robot picking up all the movable objects and dropping them into the box. It learns to stop once all movable objects are dropped in the box, since it receives no reward for its actions of moving to and trying to pick up the unmovable objects. The result is a partial cleaning of the table.

None of the Objects on the Table Are Movable

Figure 5.5 shows the learning curve for this task. In this case the robot starts out by randomly exploring the world and soon learns that none of the objects on the table are movable.

Figure 5.5 Learning Curve (None of the Objects on the Table Are Movable) (average reward versus learning steps)

Thus, the robot learns to stop immediately (Figure 5.6) instead of taking any other actions, since it learns that the table can never be cleaned.

Figure 5.6 Learned Policy (None of the Objects on the Table Are Movable)

5.3 Sorting Task

The robot has to learn a policy for sorting objects on the table into different boxes based on a given criterion, e.g. color, shape, or color and shape. Figure 5.7 shows the blocks world setup for the sorting task. The blocks world has the following objects on the table:

1. Object 1: Texture 1, Square, Small
2. Object 2: Texture 1, Round, Medium

and a box 1 (Texture 3, Rectangle, Big) and a box 2 (Texture 2, Rectangle, Big) into which the objects are to be sorted.

Figure 5.7 Sorting Task Setup (possible feature values: Size = Small, Medium, Big; Shape = Square, Round, Rectangle; Texture = Texture 1, Texture 2, Texture 3)

The robot gets a reward when it drops:

1. An object having features Texture 1 and Small into box 1.
2. An object having features Texture 1 and Medium into box 2.

Figure 5.8 One of the Learned Policies for the Sorting Task

Figure 5.8 shows one of the policies learned by the robot for the sorting task. The robot starts in state S0 { Null, Null, Null, Null, Null, Holding Nothing } and learns to reach for and pick up the object with features Texture 1 and Small (states S1 { Reach, Successful, Texture 1, Small, Successful, Holding Nothing } and S2 { Pick, Successful, Texture 1, Small, Successful, Holding Texture 1 and Small } for the Reach and Pick actions respectively), and


More information

Modeling user preferences and norms in context-aware systems

Modeling user preferences and norms in context-aware systems Modeling user preferences and norms in context-aware systems Jonas Nilsson, Cecilia Lindmark Jonas Nilsson, Cecilia Lindmark VT 2016 Bachelor's thesis for Computer Science, 15 hp Supervisor: Juan Carlos

More information

Teachable Robots: Understanding Human Teaching Behavior to Build More Effective Robot Learners

Teachable Robots: Understanding Human Teaching Behavior to Build More Effective Robot Learners Teachable Robots: Understanding Human Teaching Behavior to Build More Effective Robot Learners Andrea L. Thomaz and Cynthia Breazeal Abstract While Reinforcement Learning (RL) is not traditionally designed

More information

Testing A Moving Target: How Do We Test Machine Learning Systems? Peter Varhol Technology Strategy Research, USA

Testing A Moving Target: How Do We Test Machine Learning Systems? Peter Varhol Technology Strategy Research, USA Testing A Moving Target: How Do We Test Machine Learning Systems? Peter Varhol Technology Strategy Research, USA Testing a Moving Target How Do We Test Machine Learning Systems? Peter Varhol, Technology

More information

Regret-based Reward Elicitation for Markov Decision Processes

Regret-based Reward Elicitation for Markov Decision Processes 444 REGAN & BOUTILIER UAI 2009 Regret-based Reward Elicitation for Markov Decision Processes Kevin Regan Department of Computer Science University of Toronto Toronto, ON, CANADA kmregan@cs.toronto.edu

More information

Agents and environments. Intelligent Agents. Reminders. Vacuum-cleaner world. Outline. A vacuum-cleaner agent. Chapter 2 Actuators

Agents and environments. Intelligent Agents. Reminders. Vacuum-cleaner world. Outline. A vacuum-cleaner agent. Chapter 2 Actuators s and environments Percepts Intelligent s? Chapter 2 Actions s include humans, robots, softbots, thermostats, etc. The agent function maps from percept histories to actions: f : P A The agent program runs

More information

Transfer Learning Action Models by Measuring the Similarity of Different Domains

Transfer Learning Action Models by Measuring the Similarity of Different Domains Transfer Learning Action Models by Measuring the Similarity of Different Domains Hankui Zhuo 1, Qiang Yang 2, and Lei Li 1 1 Software Research Institute, Sun Yat-sen University, Guangzhou, China. zhuohank@gmail.com,lnslilei@mail.sysu.edu.cn

More information

Laboratorio di Intelligenza Artificiale e Robotica

Laboratorio di Intelligenza Artificiale e Robotica Laboratorio di Intelligenza Artificiale e Robotica A.A. 2008-2009 Outline 2 Machine Learning Unsupervised Learning Supervised Learning Reinforcement Learning Genetic Algorithms Genetics-Based Machine Learning

More information

Curriculum Design Project with Virtual Manipulatives. Gwenanne Salkind. George Mason University EDCI 856. Dr. Patricia Moyer-Packenham

Curriculum Design Project with Virtual Manipulatives. Gwenanne Salkind. George Mason University EDCI 856. Dr. Patricia Moyer-Packenham Curriculum Design Project with Virtual Manipulatives Gwenanne Salkind George Mason University EDCI 856 Dr. Patricia Moyer-Packenham Spring 2006 Curriculum Design Project with Virtual Manipulatives Table

More information

Evolutive Neural Net Fuzzy Filtering: Basic Description

Evolutive Neural Net Fuzzy Filtering: Basic Description Journal of Intelligent Learning Systems and Applications, 2010, 2: 12-18 doi:10.4236/jilsa.2010.21002 Published Online February 2010 (http://www.scirp.org/journal/jilsa) Evolutive Neural Net Fuzzy Filtering:

More information

Learning Methods in Multilingual Speech Recognition

Learning Methods in Multilingual Speech Recognition Learning Methods in Multilingual Speech Recognition Hui Lin Department of Electrical Engineering University of Washington Seattle, WA 98125 linhui@u.washington.edu Li Deng, Jasha Droppo, Dong Yu, and Alex

More information

Concept Acquisition Without Representation William Dylan Sabo

Concept Acquisition Without Representation William Dylan Sabo Concept Acquisition Without Representation William Dylan Sabo Abstract: Contemporary debates in concept acquisition presuppose that cognizers can only acquire concepts on the basis of concepts they already

More information

Software Maintenance

Software Maintenance 1 What is Software Maintenance? Software Maintenance is a very broad activity that includes error corrections, enhancements of capabilities, deletion of obsolete capabilities, and optimization. 2 Categories

More information

An Introduction to Simio for Beginners

An Introduction to Simio for Beginners An Introduction to Simio for Beginners C. Dennis Pegden, Ph.D. This white paper is intended to introduce Simio to a user new to simulation. It is intended for the manufacturing engineer, hospital quality

More information

Statewide Framework Document for:

Statewide Framework Document for: Statewide Framework Document for: 270301 Standards may be added to this document prior to submission, but may not be removed from the framework to meet state credit equivalency requirements. Performance

More information

An OO Framework for building Intelligence and Learning properties in Software Agents

An OO Framework for building Intelligence and Learning properties in Software Agents An OO Framework for building Intelligence and Learning properties in Software Agents José A. R. P. Sardinha, Ruy L. Milidiú, Carlos J. P. Lucena, Patrick Paranhos Abstract Software agents are defined as

More information

The Evolution of Random Phenomena

The Evolution of Random Phenomena The Evolution of Random Phenomena A Look at Markov Chains Glen Wang glenw@uchicago.edu Splash! Chicago: Winter Cascade 2012 Lecture 1: What is Randomness? What is randomness? Can you think of some examples

More information

A Comparison of Annealing Techniques for Academic Course Scheduling

A Comparison of Annealing Techniques for Academic Course Scheduling A Comparison of Annealing Techniques for Academic Course Scheduling M. A. Saleh Elmohamed 1, Paul Coddington 2, and Geoffrey Fox 1 1 Northeast Parallel Architectures Center Syracuse University, Syracuse,

More information

AGENDA LEARNING THEORIES LEARNING THEORIES. Advanced Learning Theories 2/22/2016

AGENDA LEARNING THEORIES LEARNING THEORIES. Advanced Learning Theories 2/22/2016 AGENDA Advanced Learning Theories Alejandra J. Magana, Ph.D. admagana@purdue.edu Introduction to Learning Theories Role of Learning Theories and Frameworks Learning Design Research Design Dual Coding Theory

More information

On-Line Data Analytics

On-Line Data Analytics International Journal of Computer Applications in Engineering Sciences [VOL I, ISSUE III, SEPTEMBER 2011] [ISSN: 2231-4946] On-Line Data Analytics Yugandhar Vemulapalli #, Devarapalli Raghu *, Raja Jacob

More information

A Context-Driven Use Case Creation Process for Specifying Automotive Driver Assistance Systems

A Context-Driven Use Case Creation Process for Specifying Automotive Driver Assistance Systems A Context-Driven Use Case Creation Process for Specifying Automotive Driver Assistance Systems Hannes Omasreiter, Eduard Metzker DaimlerChrysler AG Research Information and Communication Postfach 23 60

More information

Notes on The Sciences of the Artificial Adapted from a shorter document written for course (Deciding What to Design) 1

Notes on The Sciences of the Artificial Adapted from a shorter document written for course (Deciding What to Design) 1 Notes on The Sciences of the Artificial Adapted from a shorter document written for course 17-652 (Deciding What to Design) 1 Ali Almossawi December 29, 2005 1 Introduction The Sciences of the Artificial

More information

ENME 605 Advanced Control Systems, Fall 2015 Department of Mechanical Engineering

ENME 605 Advanced Control Systems, Fall 2015 Department of Mechanical Engineering ENME 605 Advanced Control Systems, Fall 2015 Department of Mechanical Engineering Lecture Details Instructor Course Objectives Tuesday and Thursday, 4:00 pm to 5:15 pm Information Technology and Engineering

More information

MYCIN. The MYCIN Task

MYCIN. The MYCIN Task MYCIN Developed at Stanford University in 1972 Regarded as the first true expert system Assists physicians in the treatment of blood infections Many revisions and extensions over the years The MYCIN Task

More information

Proposal of Pattern Recognition as a necessary and sufficient principle to Cognitive Science

Proposal of Pattern Recognition as a necessary and sufficient principle to Cognitive Science Proposal of Pattern Recognition as a necessary and sufficient principle to Cognitive Science Gilberto de Paiva Sao Paulo Brazil (May 2011) gilbertodpaiva@gmail.com Abstract. Despite the prevalence of the

More information

Generative models and adversarial training

Generative models and adversarial training Day 4 Lecture 1 Generative models and adversarial training Kevin McGuinness kevin.mcguinness@dcu.ie Research Fellow Insight Centre for Data Analytics Dublin City University What is a generative model?

More information

Speeding Up Reinforcement Learning with Behavior Transfer

Speeding Up Reinforcement Learning with Behavior Transfer Speeding Up Reinforcement Learning with Behavior Transfer Matthew E. Taylor and Peter Stone Department of Computer Sciences The University of Texas at Austin Austin, Texas 78712-1188 {mtaylor, pstone}@cs.utexas.edu

More information

FF+FPG: Guiding a Policy-Gradient Planner

FF+FPG: Guiding a Policy-Gradient Planner FF+FPG: Guiding a Policy-Gradient Planner Olivier Buffet LAAS-CNRS University of Toulouse Toulouse, France firstname.lastname@laas.fr Douglas Aberdeen National ICT australia & The Australian National University

More information

Genevieve L. Hartman, Ph.D.

Genevieve L. Hartman, Ph.D. Curriculum Development and the Teaching-Learning Process: The Development of Mathematical Thinking for all children Genevieve L. Hartman, Ph.D. Topics for today Part 1: Background and rationale Current

More information

Learning From the Past with Experiment Databases

Learning From the Past with Experiment Databases Learning From the Past with Experiment Databases Joaquin Vanschoren 1, Bernhard Pfahringer 2, and Geoff Holmes 2 1 Computer Science Dept., K.U.Leuven, Leuven, Belgium 2 Computer Science Dept., University

More information

Using focal point learning to improve human machine tacit coordination

Using focal point learning to improve human machine tacit coordination DOI 10.1007/s10458-010-9126-5 Using focal point learning to improve human machine tacit coordination InonZuckerman SaritKraus Jeffrey S. Rosenschein The Author(s) 2010 Abstract We consider an automated

More information

Evolution of Collective Commitment during Teamwork

Evolution of Collective Commitment during Teamwork Fundamenta Informaticae 56 (2003) 329 371 329 IOS Press Evolution of Collective Commitment during Teamwork Barbara Dunin-Kȩplicz Institute of Informatics, Warsaw University Banacha 2, 02-097 Warsaw, Poland

More information

Getting Started with TI-Nspire High School Science

Getting Started with TI-Nspire High School Science Getting Started with TI-Nspire High School Science 2012 Texas Instruments Incorporated Materials for Institute Participant * *This material is for the personal use of T3 instructors in delivering a T3

More information

UC Merced Proceedings of the Annual Meeting of the Cognitive Science Society

UC Merced Proceedings of the Annual Meeting of the Cognitive Science Society UC Merced Proceedings of the nnual Meeting of the Cognitive Science Society Title Multi-modal Cognitive rchitectures: Partial Solution to the Frame Problem Permalink https://escholarship.org/uc/item/8j2825mm

More information

SOFTWARE EVALUATION TOOL

SOFTWARE EVALUATION TOOL SOFTWARE EVALUATION TOOL Kyle Higgins Randall Boone University of Nevada Las Vegas rboone@unlv.nevada.edu Higgins@unlv.nevada.edu N.B. This form has not been fully validated and is still in development.

More information

Machine Learning from Garden Path Sentences: The Application of Computational Linguistics

Machine Learning from Garden Path Sentences: The Application of Computational Linguistics Machine Learning from Garden Path Sentences: The Application of Computational Linguistics http://dx.doi.org/10.3991/ijet.v9i6.4109 J.L. Du 1, P.F. Yu 1 and M.L. Li 2 1 Guangdong University of Foreign Studies,

More information

Learning Methods for Fuzzy Systems

Learning Methods for Fuzzy Systems Learning Methods for Fuzzy Systems Rudolf Kruse and Andreas Nürnberger Department of Computer Science, University of Magdeburg Universitätsplatz, D-396 Magdeburg, Germany Phone : +49.39.67.876, Fax : +49.39.67.8

More information

LEGO MINDSTORMS Education EV3 Coding Activities

LEGO MINDSTORMS Education EV3 Coding Activities LEGO MINDSTORMS Education EV3 Coding Activities s t e e h s k r o W t n e d Stu LEGOeducation.com/MINDSTORMS Contents ACTIVITY 1 Performing a Three Point Turn 3-6 ACTIVITY 2 Written Instructions for a

More information

BMBF Project ROBUKOM: Robust Communication Networks

BMBF Project ROBUKOM: Robust Communication Networks BMBF Project ROBUKOM: Robust Communication Networks Arie M.C.A. Koster Christoph Helmberg Andreas Bley Martin Grötschel Thomas Bauschert supported by BMBF grant 03MS616A: ROBUKOM Robust Communication Networks,

More information

Lecture 6: Applications

Lecture 6: Applications Lecture 6: Applications Michael L. Littman Rutgers University Department of Computer Science Rutgers Laboratory for Real-Life Reinforcement Learning What is RL? Branch of machine learning concerned with

More information

A Version Space Approach to Learning Context-free Grammars

A Version Space Approach to Learning Context-free Grammars Machine Learning 2: 39~74, 1987 1987 Kluwer Academic Publishers, Boston - Manufactured in The Netherlands A Version Space Approach to Learning Context-free Grammars KURT VANLEHN (VANLEHN@A.PSY.CMU.EDU)

More information

16.1 Lesson: Putting it into practice - isikhnas

16.1 Lesson: Putting it into practice - isikhnas BAB 16 Module: Using QGIS in animal health The purpose of this module is to show how QGIS can be used to assist in animal health scenarios. In order to do this, you will have needed to study, and be familiar

More information

Major Milestones, Team Activities, and Individual Deliverables

Major Milestones, Team Activities, and Individual Deliverables Major Milestones, Team Activities, and Individual Deliverables Milestone #1: Team Semester Proposal Your team should write a proposal that describes project objectives, existing relevant technology, engineering

More information

Designing a Computer to Play Nim: A Mini-Capstone Project in Digital Design I

Designing a Computer to Play Nim: A Mini-Capstone Project in Digital Design I Session 1793 Designing a Computer to Play Nim: A Mini-Capstone Project in Digital Design I John Greco, Ph.D. Department of Electrical and Computer Engineering Lafayette College Easton, PA 18042 Abstract

More information

SARDNET: A Self-Organizing Feature Map for Sequences

SARDNET: A Self-Organizing Feature Map for Sequences SARDNET: A Self-Organizing Feature Map for Sequences Daniel L. James and Risto Miikkulainen Department of Computer Sciences The University of Texas at Austin Austin, TX 78712 dljames,risto~cs.utexas.edu

More information

On Human Computer Interaction, HCI. Dr. Saif al Zahir Electrical and Computer Engineering Department UBC

On Human Computer Interaction, HCI. Dr. Saif al Zahir Electrical and Computer Engineering Department UBC On Human Computer Interaction, HCI Dr. Saif al Zahir Electrical and Computer Engineering Department UBC Human Computer Interaction HCI HCI is the study of people, computer technology, and the ways these

More information

have to be modeled) or isolated words. Output of the system is a grapheme-tophoneme conversion system which takes as its input the spelling of words,

have to be modeled) or isolated words. Output of the system is a grapheme-tophoneme conversion system which takes as its input the spelling of words, A Language-Independent, Data-Oriented Architecture for Grapheme-to-Phoneme Conversion Walter Daelemans and Antal van den Bosch Proceedings ESCA-IEEE speech synthesis conference, New York, September 1994

More information

Language Acquisition Fall 2010/Winter Lexical Categories. Afra Alishahi, Heiner Drenhaus

Language Acquisition Fall 2010/Winter Lexical Categories. Afra Alishahi, Heiner Drenhaus Language Acquisition Fall 2010/Winter 2011 Lexical Categories Afra Alishahi, Heiner Drenhaus Computational Linguistics and Phonetics Saarland University Children s Sensitivity to Lexical Categories Look,

More information

Rover Races Grades: 3-5 Prep Time: ~45 Minutes Lesson Time: ~105 minutes

Rover Races Grades: 3-5 Prep Time: ~45 Minutes Lesson Time: ~105 minutes Rover Races Grades: 3-5 Prep Time: ~45 Minutes Lesson Time: ~105 minutes WHAT STUDENTS DO: Establishing Communication Procedures Following Curiosity on Mars often means roving to places with interesting

More information

Erkki Mäkinen State change languages as homomorphic images of Szilard languages

Erkki Mäkinen State change languages as homomorphic images of Szilard languages Erkki Mäkinen State change languages as homomorphic images of Szilard languages UNIVERSITY OF TAMPERE SCHOOL OF INFORMATION SCIENCES REPORTS IN INFORMATION SCIENCES 48 TAMPERE 2016 UNIVERSITY OF TAMPERE

More information

CSL465/603 - Machine Learning

CSL465/603 - Machine Learning CSL465/603 - Machine Learning Fall 2016 Narayanan C Krishnan ckn@iitrpr.ac.in Introduction CSL465/603 - Machine Learning 1 Administrative Trivia Course Structure 3-0-2 Lecture Timings Monday 9.55-10.45am

More information

First Grade Standards

First Grade Standards These are the standards for what is taught throughout the year in First Grade. It is the expectation that these skills will be reinforced after they have been taught. Mathematical Practice Standards Taught

More information

Learning and Transferring Relational Instance-Based Policies

Learning and Transferring Relational Instance-Based Policies Learning and Transferring Relational Instance-Based Policies Rocío García-Durán, Fernando Fernández y Daniel Borrajo Universidad Carlos III de Madrid Avda de la Universidad 30, 28911-Leganés (Madrid),

More information

Entrepreneurial Discovery and the Demmert/Klein Experiment: Additional Evidence from Germany

Entrepreneurial Discovery and the Demmert/Klein Experiment: Additional Evidence from Germany Entrepreneurial Discovery and the Demmert/Klein Experiment: Additional Evidence from Germany Jana Kitzmann and Dirk Schiereck, Endowed Chair for Banking and Finance, EUROPEAN BUSINESS SCHOOL, International

More information

Grade 2: Using a Number Line to Order and Compare Numbers Place Value Horizontal Content Strand

Grade 2: Using a Number Line to Order and Compare Numbers Place Value Horizontal Content Strand Grade 2: Using a Number Line to Order and Compare Numbers Place Value Horizontal Content Strand Texas Essential Knowledge and Skills (TEKS): (2.1) Number, operation, and quantitative reasoning. The student

More information

Conversation Starters: Using Spatial Context to Initiate Dialogue in First Person Perspective Games

Conversation Starters: Using Spatial Context to Initiate Dialogue in First Person Perspective Games Conversation Starters: Using Spatial Context to Initiate Dialogue in First Person Perspective Games David B. Christian, Mark O. Riedl and R. Michael Young Liquid Narrative Group Computer Science Department

More information

The 9 th International Scientific Conference elearning and software for Education Bucharest, April 25-26, / X

The 9 th International Scientific Conference elearning and software for Education Bucharest, April 25-26, / X The 9 th International Scientific Conference elearning and software for Education Bucharest, April 25-26, 2013 10.12753/2066-026X-13-154 DATA MINING SOLUTIONS FOR DETERMINING STUDENT'S PROFILE Adela BÂRA,

More information

School of Innovative Technologies and Engineering

School of Innovative Technologies and Engineering School of Innovative Technologies and Engineering Department of Applied Mathematical Sciences Proficiency Course in MATLAB COURSE DOCUMENT VERSION 1.0 PCMv1.0 July 2012 University of Technology, Mauritius

More information

Story Problems with. Missing Parts. s e s s i o n 1. 8 A. Story Problems with. More Story Problems with. Missing Parts

Story Problems with. Missing Parts. s e s s i o n 1. 8 A. Story Problems with. More Story Problems with. Missing Parts s e s s i o n 1. 8 A Math Focus Points Developing strategies for solving problems with unknown change/start Developing strategies for recording solutions to story problems Using numbers and standard notation

More information

4-3 Basic Skills and Concepts

4-3 Basic Skills and Concepts 4-3 Basic Skills and Concepts Identifying Binomial Distributions. In Exercises 1 8, determine whether the given procedure results in a binomial distribution. For those that are not binomial, identify at

More information

A Pipelined Approach for Iterative Software Process Model

A Pipelined Approach for Iterative Software Process Model A Pipelined Approach for Iterative Software Process Model Ms.Prasanthi E R, Ms.Aparna Rathi, Ms.Vardhani J P, Mr.Vivek Krishna Electronics and Radar Development Establishment C V Raman Nagar, Bangalore-560093,

More information

Visual CP Representation of Knowledge

Visual CP Representation of Knowledge Visual CP Representation of Knowledge Heather D. Pfeiffer and Roger T. Hartley Department of Computer Science New Mexico State University Las Cruces, NM 88003-8001, USA email: hdp@cs.nmsu.edu and rth@cs.nmsu.edu

More information

Computers Change the World

Computers Change the World Computers Change the World Computing is Changing the World Activity 1.1.1 Computing Is Changing the World Students pick a grand challenge and consider how mobile computing, the Internet, Big Data, and

More information

A Case-Based Approach To Imitation Learning in Robotic Agents

A Case-Based Approach To Imitation Learning in Robotic Agents A Case-Based Approach To Imitation Learning in Robotic Agents Tesca Fitzgerald, Ashok Goel School of Interactive Computing Georgia Institute of Technology, Atlanta, GA 30332, USA {tesca.fitzgerald,goel}@cc.gatech.edu

More information

OCR for Arabic using SIFT Descriptors With Online Failure Prediction

OCR for Arabic using SIFT Descriptors With Online Failure Prediction OCR for Arabic using SIFT Descriptors With Online Failure Prediction Andrey Stolyarenko, Nachum Dershowitz The Blavatnik School of Computer Science Tel Aviv University Tel Aviv, Israel Email: stloyare@tau.ac.il,

More information

Human Emotion Recognition From Speech

Human Emotion Recognition From Speech RESEARCH ARTICLE OPEN ACCESS Human Emotion Recognition From Speech Miss. Aparna P. Wanare*, Prof. Shankar N. Dandare *(Department of Electronics & Telecommunication Engineering, Sant Gadge Baba Amravati

More information

Rule Learning With Negation: Issues Regarding Effectiveness

Rule Learning With Negation: Issues Regarding Effectiveness Rule Learning With Negation: Issues Regarding Effectiveness S. Chua, F. Coenen, G. Malcolm University of Liverpool Department of Computer Science, Ashton Building, Ashton Street, L69 3BX Liverpool, United

More information

Saliency in Human-Computer Interaction *

Saliency in Human-Computer Interaction * From: AAA Technical Report FS-96-05. Compilation copyright 1996, AAA (www.aaai.org). All rights reserved. Saliency in Human-Computer nteraction * Polly K. Pook MT A Lab 545 Technology Square Cambridge,

More information

Causal Link Semantics for Narrative Planning Using Numeric Fluents

Causal Link Semantics for Narrative Planning Using Numeric Fluents Proceedings, The Thirteenth AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment (AIIDE-17) Causal Link Semantics for Narrative Planning Using Numeric Fluents Rachelyn Farrell,

More information