340 IEEE TRANSACTIONS ON AUTONOMOUS MENTAL DEVELOPMENT, VOL. 2, NO. 4, DECEMBER 2010

Multilevel Darwinist Brain (MDB): Artificial Evolution in a Cognitive Architecture for Real Robots

Francisco Bellas, Member, IEEE, Richard J. Duro, Senior Member, IEEE, Andrés Faiña, and Daniel Souto

Abstract: The multilevel Darwinist brain (MDB) is a cognitive architecture that follows an evolutionary approach to provide autonomous robots with lifelong adaptation. It has been tested in real robot on-line learning scenarios, obtaining successful results that reinforce the evolutionary principles that constitute the main original contribution of the MDB. This preliminary work has led to a series of improvements in the computational implementation of the architecture so as to achieve realistic operation in real time, which was the biggest problem of the approach due to the high computational cost induced by the evolutionary algorithms that make up the MDB core. The current implementation of the architecture is able to provide an autonomous robot with real time learning capabilities and the capability for continuously adapting to changing circumstances in its world, both internal and external, with minimal intervention of the designer. This paper aims at providing an overview of the architecture and its operation and at defining what is required in the path towards a real cognitive robot following a developmental strategy. The design, implementation, and basic operation of the MDB cognitive architecture are presented through some successful real robot learning examples to illustrate the validity of this evolutionary approach.

Index Terms: Adaptive systems, artificial neural networks, autonomous robotics, cognitive architecture, developmental robotics, evolutionary computation.

Manuscript received March 01, 2010; revised August 14, 2010; accepted September 30, 2010. Date of publication October 14, 2010; date of current version December 10, 2010. This work was partially funded by the Xunta de Galicia through Project 09DPI012166PR and the European Regional Development Funds. The authors are with the Integrated Group for Engineering Research, University of Coruña, Coruña 15001, Spain (e-mail: fran@udc.es; richard@udc.es; afaina@udc.es; dsouto@udc.es). Color versions of one or more of the figures in this paper are available online.

I. INTRODUCTION

Some types of cognitive functions, such as anticipation and planning, may be achieved by internally simulating the robot's interaction with the environment through its actions and their consequences [1]. However, this would require the existence of good enough updated models of the world and of the robot itself. Models must adapt to changing circumstances and they must be remembered and generalized in order to be reused. Actions and sequences of actions, on the other hand, must be generated in real time so that the robot can cope with the environment and survive. A control system such as the one that is necessary for autonomy really goes beyond traditional control in terms of its specifications or requirements. These additional requirements imply the ability to learn the control function from scratch, the ability to change or adapt it to new circumstances, the ability to interact with the world in real time while performing the aforementioned processes, and, in some instances, even to change the objectives that guide the control system. Consequently, and to differentiate them from traditional control systems, these structures are usually called cognitive systems or cognitive architectures [2].
This work tries to provide an overview of what would be required as a first step towards a real cognitive robot, that is, one that can learn throughout its lifetime in an adaptive manner. An approach will be described that considers evolution, and in particular neuroevolution [3], to be an intrinsic part of the cognitive system, allowing a robot to learn different tasks and objectives. This approach, called the multilevel Darwinist brain (MDB), has been extensively tested in different real robotics applications [4], [5], from which new requirements and improvements have been obtained, implemented, and tested. The MDB is not intended as a biologically plausible path, but rather as a computationally effective way of providing the required functionality in real-time robotics. Consequently, the crucial aspects of this work are those related to reality: real time, real world, and real robots, mainly considering that evolutionary techniques are highly time consuming and do not seem adequate for real time operation in robots. Notwithstanding these previous comments, given the structure of the MDB, it is tempting to draw some parallels with large-scale neural ensembles. For instance, multiple motor cortical populations are present and compete for access to action selection as carried out by the basal ganglia through internal sensorimotor loops. Concepts such as emotion in action selection (amygdala) could be easily considered.

The paper has been organized as follows. Section II discusses the elements that must be considered in the design of a cognitive architecture for real robots following a developmental approach. Section III introduces the principles, basic elements, operation, and computational implementation of the MDB architecture. Section IV is devoted to the presentation of two illustrative application examples implemented with the MDB in real robots.
Finally, Section V provides some conclusions as well as some indications of the work that still needs to be done.

II. COGNITIVE ARCHITECTURES FOR REAL ROBOTS

In robotics, it has been quite common to study animal behavior as a reference for developing cognitive architectures for real robots, trying to reproduce the learning patterns observed in nature [6]. Animals acquire many competences during their lifetime by interacting with the world. Their secret weapon is an autonomously operating and nonpredetermined open-ended lifelong learning brain-body system. It seems that, in autonomous robotics, instead of designing intelligent robots, for a long time researchers have been designing just controllers. Initially, years were spent on directly programmed classical control schemes [7]. Later, different authors applied deliberative approaches based on symbolic representations of knowledge using preprogrammed models and objectives [8]. Once this approach became too complex, other authors resorted to reactive approaches [9] that, to deal with complex tasks, required too much designer intervention. The main conclusion extracted from these initial approaches is that all preprogrammed control systems are very limited in terms of autonomy and, consequently, inadequate for real robotics. As in the case of nature, open-ended lifelong learning systems present the potential of achieving a more realistic level of autonomy. Here, this concept will be taken as a mandatory objective for the design of a cognitive architecture.

A. Control Versus Cognition

Cognition may be defined as the mental process of knowing, including aspects such as awareness, perception, reasoning, and judgment [11]. From a computational perspective, cognition can be considered as a collection of emerging information technologies inspired by the qualitative nature of biologically based information processing and information extraction found in the nervous system, human reasoning, human decision-making, and natural selection [12]. Therefore, on one hand, we have the mental process of knowing, which in mathematical terms can be considered as extracting models from data.
These models can be employed in the process of decision making so that appropriate actions may be taken as a function of sensing and motivation. On the other hand, we have the decision making process itself, and, in robotics, a decision is always related to an action or sequence of actions. In a sense, the models must be used in order to decide the appropriate actions so as to fulfill the robot's motivations. It is in how the model making and the action determination processes take place that cognitive architectures differ from each other [2].

B. Developmental Robotics

The main objective of the developmental robotics field is to create open-ended, autonomous learning systems that continually adapt to their environment [12], as opposed to constructing robots that carry out particular, predefined tasks. The main inspiration within this field has been taken from complex biological organisms and the developmental process they follow during their lifetime: under the control of a developmental program, they develop mental capabilities through autonomous real-time interactions with their environments by using their own sensors and effectors [13]. The philosophy behind developmental robotics is that learning occurs by taking small steps and building on what is already known [14]. As commented by Lisa Meeden, under a developmental process a system can continually advance what it knows by placing itself into situations where it almost knows something, and then learning it. Applied repeatedly, such a developmental process can potentially lead to much more complex, general-purpose behavior than has been achieved to date. This developmental robotics approach has been followed in the design of the MDB cognitive architecture, but the problem has been addressed by making use of some of the concepts of traditional cognition, introducing ontogenetic evolutionary processes for the on-line adaptation of the knowledge bearing structures.
In addition to the mental aspects commented on above, cognitive developmental robotics (CDR) includes all the topics related to body development [15]. However, fetal sensorimotor development, voluntary movement acquisition, spatial perception, and body/motor representation or understanding are problems beyond the scope of this paper. Here, the focus of attention will be placed on the development of a cognitive architecture to be applied to existing physical robots. Embodiment, adaptive motivation, open-ended lifelong learning, and autonomous knowledge acquisition are some of the typical CDR topics on which this approach will concentrate, dealing with all the implementation problems that arise when using a developmental approach in real time operation. As will be discussed later, some of the practical requirements for an efficient computational implementation may imply modifying or relaxing previous theoretical assumptions.

C. Embodied Cognition

A cognitive system makes no sense without its link to the real (or virtual) world in which it is immersed and which provides the data to be manipulated into information and knowledge, thus requiring the capability of acting and sensing, and of doing so in a timely and unconstrained manner. Consequently, to design a cognitive system for real autonomy that is capable of thinking things out before acting, it seems necessary to start with a typical deliberative structure that includes models as an intrinsic element. Its general structure could be like the one shown in Fig. 1 (top). Obviously, this structure assumes preset goals for which the models and the action selector are designed. Thus, to make it independent from preset goals, a satisfaction model, which is basically a learnable utility function, must be added as shown in Fig. 1 (middle).
However, deliberative structures usually require quite a bit of time in order to decide on an action, and embodied systems must often act very fast with whatever information is available in order to survive. To allow for this, it is necessary to add a reactive part to the deliberative mechanism, seamlessly linked to it. This is what is shown in the bottom part of Fig. 1 through the introduction of the concept of behavior as an independent processing element capable of producing sequences of actions for the agent as a response to some external or internal stimuli. This way, when the system is interacting with the world, it is using the behavior based part. On the other hand, the deliberative part of the mechanism, in its own time frame, adapts the behavior module in order to maximize satisfaction. Thus, Fig. 1 (bottom) can be seen as a behavior based deliberative structure. Its operation implies two different time scales: an execution time scale, where the reactive behaviors are directly applied, and a deliberation time scale, in which the new behaviors are learned through interaction with the environment. This scheme represents a starting point for the design of an embodied cognitive system for real autonomous robots and, as will be later explained in detail, it is the basic structure of the MDB.

Fig. 1. Conceptual evolution from a traditional deliberative architecture (top) towards an architecture that allows for the intrinsic change of goals or motivation by introducing a satisfaction model (middle) and, finally, to an architecture that permits fast reactive real time behavior while preserving the deliberative characteristics by considering the selection of behaviors instead of simple actions (bottom).

D. Cognitive Model

Starting from the general concepts stated in the previous sections, a cognitive model based on a particularization of the standard abstract architectures for agents [16] was used for the design of the MDB architecture. In this case, a utilitarian cognitive model is adopted. It starts from the premise that, to carry out any task, a motivation (defined as the need or desire that makes an agent act) must exist that guides the behavior as a function of its degree of satisfaction.

The external perception e(t) of an agent is made up of the sensory information it is capable of acquiring through its sensors from the environment in which it operates (like distances, shapes, colors, sounds, etc.). The environment can change due to the actions of the agent or to factors uncontrolled by the agent. Consequently, the external perception can be expressed as a function W of the last action a(t-1) performed by the agent, the sensory perception e(t-1) it had of the external world in the previous time instant, and a description Xe(t) of the events occurring in the environment that are not due to its actions:

e(t) = W(e(t-1), a(t-1), Xe(t))

The internal perception i(t) of an agent is made up of the sensory information provided by its internal sensors, its proprioception (like battery level, stress level in terms of CPU load, motivation level, etc.). Internal perception can be written, through a function I, in terms of the last action performed by the agent, the sensory perception it had from the internal sensors in the previous time instant, and other internal events Xi(t) not caused by the actions of the agent:

i(t) = I(i(t-1), a(t-1), Xi(t))

The satisfaction s(t) of the agent may be defined as a magnitude or vector that represents the degree of fulfilment of the motivation or motivations of the agent, and it can be related to its internal and external perceptions through a function S:

s(t) = S(e(t), i(t))

As a first approximation, the social aspects of the robot's development, that is, the events Xe and Xi over which the agent has no control, will be ignored and the problem reduced to the interactions of the agent with the world and itself. Thus, generalizing:

e(t) = W(e(t-1), a(t-1))
i(t) = I(i(t-1), a(t-1))

The main objective of this utilitarian cognitive architecture is the satisfaction of the motivation of the agent, which, without any loss of generality, may be expressed as the maximization of the value of the satisfaction in each instant of time:

max over a(t-1) of s(t) = S(W(e(t-1), a(t-1)), I(i(t-1), a(t-1)))

that is, the satisfaction can be expressed as a function of the functions W and I acting over the external perceptions, the internal perceptions, and the previous actions. According to the previous expression, to solve this maximization problem, the only parameter the agent can modify is the action it performs, as the external and internal perceptions should not be manipulated (doing this would lead to distorted perceptions as in altered mental states). That is, the cognitive architecture must explore the possible action space in order to maximize the resulting satisfaction.

To obtain a system that can be applied in real time, the optimization of the action must be carried out internally (without interaction with the environment). W, I, and S are theoretical functions that must be somehow obtained. These functions correspond to what are traditionally called the following:

world model (W): function that relates the external perception before and after applying an action;

Fig. 2. Functional diagram of the cognitive model.

internal model (I): function that relates the internal perception before and after applying an action;

satisfaction model (S): function that provides a predicted satisfaction from the predicted perceptions provided by the world and internal models.

Fig. 2 displays a functional diagram of this cognitive model indicating the relationships among all the elements involved. This diagram is useful to realize that action evaluation is a sequential process, and that satisfaction prediction requires the prior execution of the perceptual models. As commented before, the main starting point in the design of a developmental cognitive architecture is that the acquisition of knowledge should be automatic and occur during the agent's lifetime. Thus, it is necessary to establish that the three models W, I, and S must be obtained at execution time as the agent interacts with the world. To be able to carry out this modeling process, information must be extracted from the real data the agent has after each interaction with the environment. Hereafter, these data will be called action-perception pairs and are made up of the sensorial data in instant t, the action applied in instant t, the sensorial data in instant t+1, and the satisfaction in t+1. This way, we have all the perceptual information before and after applying an action. The model organization displayed in Fig. 2 allows for an intrinsically adaptive operation. If the sensorial information changes (dynamic environment, hardware failures), the world and internal models can be updated or replaced without any consequence in terms of the architecture. In addition, if the motivation changes, the satisfaction model would change while the action selection method remains unchanged.
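The sequential action evaluation described above can be sketched in code. The following is an illustrative sketch, not the authors' implementation: all names are hypothetical and the learned ANN models are replaced by toy lambda stand-ins, but the chaining order (world and internal models first, then the satisfaction model) follows the text.

```python
# Illustrative sketch of internal action evaluation: candidate actions
# are scored without acting on the environment, by chaining the world,
# internal, and satisfaction models, and the best-scoring one is chosen.

def evaluate_action(action, ext_percept, int_percept,
                    world_model, internal_model, satisfaction_model):
    """Predicted satisfaction of applying 'action' in the current state."""
    predicted_ext = world_model(ext_percept, action)          # e(t) from W
    predicted_int = internal_model(int_percept, action)       # i(t) from I
    return satisfaction_model(predicted_ext, predicted_int)   # s(t) from S

def select_action(candidates, ext_percept, int_percept,
                  world_model, internal_model, satisfaction_model):
    """Choose the candidate action with the highest predicted satisfaction."""
    return max(candidates,
               key=lambda a: evaluate_action(a, ext_percept, int_percept,
                                             world_model, internal_model,
                                             satisfaction_model))

# Toy stand-in models: being at external position 5 is most satisfying,
# and moving consumes an internal resource.
world = lambda e, a: e + a
internal = lambda i, a: i - abs(a)
satisf = lambda e, i: -(e - 5.0) ** 2 + 0.1 * i

best = select_action([-1.0, 0.0, 1.0, 2.0],
                     ext_percept=3.0, int_percept=1.0,
                     world_model=world, internal_model=internal,
                     satisfaction_model=satisf)   # moving +2 reaches 5
```

Note how the evaluation never touches the environment: only the models are queried, which is what makes the optimization usable in real time.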
Summarizing, for every interaction of the agent with its environment, two processes must be solved:

1) the modeling of the functions W, I, and S using the information in the action-perception pairs;
2) the optimization of the action using the models available at that time, trying to maximize the predicted satisfaction provided by the satisfaction model.

To create models is to try to minimize the difference between the reality that is being modeled and the predictions provided by the model. Consequently, it is clear that a cognitive architecture must involve some type of minimization strategy or algorithm. As the search spaces are not really known beforehand and are usually very complex, here we propose using as the algorithm for on-line modeling one of the most powerful stochastic multipoint search techniques: artificial evolution.

Fig. 3. MDB elements and their relations.

III. MULTILEVEL DARWINIST BRAIN (MDB)

The MDB is a general cognitive architecture, first presented in [17], that follows a developmental robotics approach for the automatic acquisition of knowledge in a real robot through the interaction with its environment, so that it can autonomously adapt its behavior to achieve its design objectives. The background idea of the MDB of applying artificial evolution for knowledge acquisition takes inspiration from classical biopsychological theories by Changeux [18], Conrad [19], and Edelman [20] in the field of cognitive science relating the brain and its operation through a Darwinist process. All of these theories lead to the same concept of a cognitive structure based on the brain adapting its neural connections in real time through evolutionary or selectionist processes. Fig. 3 displays a block diagram of the current implementation of the MDB. It follows a generalization of the cognitive model described in the previous section through the addition of two new elements.

1) Behavior structures: as commented before, they generalize the concept of single action used in the cognitive model. A behavior represents a decision element able to provide actions or sequences of actions according to the particular sensorial inputs. That is, the robot could have a behavior for wall-following, another for wandering, etc.

2) Memory elements: a short-term and a long-term memory are required in the learning processes. They will be discussed later in detail.

As presented in Fig. 3, the MDB is structured into two different time scales, one devoted to the execution of the actions in the environment (reactive part) and the other dealing with the learning of the models and behaviors (deliberative part). The operation of the MDB can be described in terms of these two scales.

1) Execution time scale: the following steps are continuously repeated in a sequential manner.

1.1) There is a current behavior, which has been selected in the deliberative process, that chooses, based on its perception, the next action to be applied.

1.2) The selected action is applied to the environment through the actuators of the robot, obtaining a new set of perceptions.

2) Deliberation time scale: these processes also take place in different time scales and, consequently, it must be pointed out that they are not sequential.

2.1) The acting and sensing values obtained after the execution of an action in the environment in the execution time scale provide a new action-perception pair that is stored in a short-term memory (STM).

2.2) The evolutionary model learning processes (for world, internal, and satisfaction models) try to find functions that generalize the real samples stored in the STM. Each evolutionary process has been represented by two blocks in Fig. 3, one related to the evolution itself (labelled evolver) and the other one representing the population of each evolution (labelled Base).
The computation time required for each evolutionary process may be different, depending on the complexity of the models. The practical implementation of such processes will require attention to avoid incoherence during the interplay of the elements involved in real time operation.

2.3) The best models in a given instant of time are taken as the current world model (WM), current internal model (IM), and current satisfaction model (SM) and are used by the behavior evolver to select the best behavior with regard to the predicted satisfaction of the motivation (behavior proposer block in Fig. 3). Therefore, another evolutionary process has been added to the cognitive model presented above that is continuously obtaining new behaviors for the robot, using the best three models it has for the evaluation of the individuals. Consequently, the best behavior in a given instant of time is the one that provides the highest level of satisfaction on average for all the samples stored in the STM. The blocks labelled current WM, current IM, current SM, and behavior proposer are really included in the behavior evolver within the individual evaluation stage.

2.4) The behavior evolver is continuously proposing new behaviors that are better adapted to the STM contents. Upon request, the behavior selector provides the best one to the reactive part of the MDB (the one operating in the execution time scale), the current behavior block in Fig. 3, which replaces the one there and which should be better adapted to the STM and, consequently, to the current reality of the robot.

2.5) The block labelled long-term memory (LTM) in Fig. 3 stores those models and behaviors that have provided successful and stable results in their application to a given task, in order to be reused directly in other problems or as seeds for new evolutionary learning processes.

Each time the robot executes an action during real time operation, a new action-perception pair is obtained.
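The execution time scale described in the steps above can be sketched as a minimal loop. All class and function names here are hypothetical illustrations; in the MDB the behavior would be a neural controller and the deliberative processes would run concurrently, calling something like set_behavior when a better behavior is available.

```python
# Hypothetical sketch of the execution time scale: apply the current
# behavior (step 1.1), act on the environment (step 1.2), and store the
# resulting action-perception pair in the STM (step 2.1).

from collections import deque

class ExecutionLoop:
    def __init__(self, behavior, stm_size=20):
        self.behavior = behavior
        self.stm = deque(maxlen=stm_size)   # short-term memory of pairs

    def set_behavior(self, behavior):
        # Invoked from the deliberative scale by the behavior selector.
        self.behavior = behavior

    def iterate(self, percepts, environment_step):
        action = self.behavior(percepts)                        # step 1.1
        new_percepts, satisfaction = environment_step(action)   # step 1.2
        # Step 2.1: pair = percepts before the action, the action,
        # percepts after it, and the satisfaction obtained.
        self.stm.append((percepts, action, new_percepts, satisfaction))
        return new_percepts

# Toy environment: 1-D position, motivation is to reach position 5.
state = {"pos": 0.0}
def env_step(action):
    state["pos"] += action
    return state["pos"], -abs(state["pos"] - 5.0)

loop = ExecutionLoop(behavior=lambda p: 1.0)   # constant "move right"
p = 0.0
for _ in range(3):                              # three MDB iterations
    p = loop.iterate(p, env_step)
```

The bounded deque stands in for the limited-size STM discussed later; each pass through iterate corresponds to one MDB iteration.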
This real information is the most relevant one in the MDB, as all the learning processes depend on the number and quality of the action-perception pairs. Consequently, each interaction of the robot with the environment has been taken as the basic time unit within the MDB and called an iteration. As more iterations take place, the MDB acquires more information from the real environment; thus the model learning processes should produce better models and, consequently, the behaviors obtained using these models should be more reliable, and the actions provided by them more appropriate to fulfil the motivations. The MDB diagram represented in Fig. 3 follows the structure shown in Fig. 1 (bottom) for an embodied cognitive architecture. It includes concurrent reactive and deliberative processes. The following three subsections will describe the most important aspects of the MDB in practical terms: evolution, memories, and implementation.

A. Lifelong Learning by Evolution

The main difference of the MDB with respect to other cognitive architectures for real robots lies in the way the knowledge is acquired: through evolutionary techniques [2], [15]. The models resulting from this learning are usually complex due to the fact that the real world is dynamic and the robot state, the environment, and the objective may change in time. To achieve the desired neural adaptation through evolution established by the Darwinist theories that are the base for this architecture, it was decided to use artificial neural networks (ANNs) as the representation for the models, mainly due to their suitability for being adapted through evolutionary processes. There is no limitation regarding the type of ANN that can be used in the MDB. Regular feedforward, radial basis function, recurrent or delay based networks, or spiking neural networks may be useful depending on the type of model that needs to be learned.
Consequently, the acquisition of knowledge in the MDB is a neuroevolutionary process, with an evolutionary algorithm devoted to learning the parameters of the ANN. Neuroevolution is a reference learning tool due to its robustness and adaptability to dynamic environments and nonstationary tasks as commented in [3].
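As a rough illustration of what learning the parameters of an ANN by evolution involves, the sketch below decodes a flat genome (the individual manipulated by the evolutionary algorithm) into a tiny one-hidden-layer network. The sizes, names, and linear output are assumptions made for the example, not the MDB's actual network layout.

```python
# Illustrative sketch: an individual in a neuroevolutionary population
# is a flat parameter vector, decoded into a small feedforward ANN
# whenever the model it represents must be evaluated.

import math

def decode(genome, n_in, n_hidden):
    """Split a flat genome into the weights of a one-hidden-layer net."""
    w1 = [genome[i * n_in:(i + 1) * n_in] for i in range(n_hidden)]
    w2 = genome[n_in * n_hidden:n_in * n_hidden + n_hidden]
    return w1, w2

def forward(genome, inputs, n_hidden=3):
    """Run the decoded network on one input vector."""
    w1, w2 = decode(genome, len(inputs), n_hidden)
    hidden = [math.tanh(sum(w * x for w, x in zip(row, inputs)))
              for row in w1]
    return sum(w * h for w, h in zip(w2, hidden))   # linear output unit

genome = [0.1] * (2 * 3 + 3)          # 2 inputs, 3 hidden units
prediction = forward(genome, [0.5, -0.2])
```

Because the network is just a vector of numbers, standard evolutionary operators (mutation, crossover) apply directly to the genome, which is what makes this representation convenient for the evolutionary processes in the MDB.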

Here, the modeling is not an optimization process, but a lifelong learning process, taking into account that the best generalization for all times, or at least for an extended period of time, is sought, which is different from minimizing an error function in a given instant. Consequently, the modeling technique selected must allow for gradual application, as the information becomes known progressively and in real time. Evolutionary techniques permit this gradual learning process by controlling the number of generations of evolution for a given content of the STM. Thus, if evolutions last just a few generations per iteration, gradual learning by all the individuals is achieved. To obtain general modeling properties in the MDB, the population of the evolutionary algorithms must be preserved between iterations (represented in Fig. 3 through the world, internal, and satisfaction model base blocks that are connected to the evolutionary blocks), leading to a sort of learning inertia effect where what is being learned is not the content of the STM in a given instant of time, but that of the sets of STM contents previously seen. In addition, the dynamics of the real environments where the MDB will be applied imply that the architecture must be intrinsically adaptive. This strategy of evolving for a few generations and preserving populations between iterations permits a quick adaptation of the models to the dynamics of the environment, as a collection of possible solutions is present in the populations and they can be easily adapted to the new situation. In the case of behaviors, in the current version of the MDB they are also represented by ANNs and, consequently, they can be viewed as neural behavior controllers that provide the action the robot must apply in the environment according to its sensorial inputs.
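The evolve-a-few-generations-per-iteration strategy can be sketched with a deliberately minimal real-valued evolutionary algorithm. The MDB itself evolves ANNs, and the fitness function, operators, and population size below are illustrative assumptions; the point is that the population is returned and reused across iterations instead of being evolved to convergence on one STM snapshot.

```python
# Sketch of gradual, population-preserving evolution: a few generations
# per MDB iteration, fitness computed against the current STM contents.

import random

def fitness(individual, stm):
    # Negative mean squared prediction error of a toy linear "world
    # model" e(t) = w * a(t-1) + b over STM samples (a, e_next).
    w, b = individual
    return -sum((w * a + b - e_next) ** 2 for a, e_next in stm) / len(stm)

def evolve_step(population, stm, generations=3, sigma=0.1, rng=random):
    """Evolve only a few generations, then return the preserved population."""
    for _ in range(generations):
        population.sort(key=lambda ind: fitness(ind, stm), reverse=True)
        parents = population[: len(population) // 2]    # elitist selection
        children = [(w + rng.gauss(0, sigma), b + rng.gauss(0, sigma))
                    for (w, b) in parents]
        population = parents + children
    return population

# Ten MDB "iterations" over an STM holding samples of e_next = 2a + 1.
rng = random.Random(0)
stm = [(a, 2.0 * a + 1.0) for a in (-1.0, 0.0, 1.0, 2.0)]
population = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(8)]
initial_best = max(fitness(ind, stm) for ind in population)
for _ in range(10):
    population = evolve_step(population, stm, rng=rng)
final_best = max(fitness(ind, stm) for ind in population)
```

Because the parents survive each generation unchanged, the best solution is never lost, and because the population persists between iterations, a change in the STM contents is absorbed gradually rather than restarting the search.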
The learning of behaviors follows the same gradual principles as that for the models in order to avoid the fluctuations in the evolution caused by an exhaustive optimization for each particular content of the STM. Again, a behavior will improve with iterations and not within iterations. In the first versions of the MDB, the evolution of the models was carried out using standard canonical genetic algorithms where the ANNs were represented by simple two-layer perceptron models [17]. These tools provided successful results when dealing with simple environments and tasks, but as soon as real world learning problems were faced, they were insufficient. Standard genetic/evolutionary algorithms when applied to these tasks tend to converge towards homogeneous populations, that is, populations where all of the individuals are basically the same. In static problems this would not be a drawback if this convergence took place after reaching the optimum. Unfortunately, there is no way to guarantee this, and diversity may be severely reduced long before the global optimum is achieved. In dynamic environments, where even the objective may change, this is quite a severe drawback. In addition, the episodic nature of real-world problems implies that whatever perceptual streams the robot receives could contain information corresponding to different learning processes or models that are intermingled (periodically or not), that is, learning samples need not arise in an orderly and appropriate manner. Some of these sequences of samples are related to different sensorial or perceptual modalities and might not overlap in their information content; others correspond to the same modalities but should be assigned to different models. The problem that arises is how to learn all of these different models, the samples of which are perceived as partial sequences that appear randomly intermingled with those of the others. 
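One encoding-level way of coping with intermingled models is to keep knowledge in the genotype even when it is not currently expressed in the phenotype. The toy sketch below illustrates this promoter-gene idea in its simplest possible form; it is a didactic assumption, not an actual neuroevolutionary encoding.

```python
# Toy illustration of promoter-controlled gene expression: each gene is
# (promoter_bit, weight); only expressed genes contribute to the
# phenotype, while silenced genes persist as latent diversity.

def phenotype(genome):
    """Sum of the weights of expressed genes; silenced genes are inert."""
    return sum(weight for active, weight in genome if active)

# Two expressed genes and one silenced gene.
genome = [(True, 0.5), (False, -2.0), (True, 1.0)]
expressed = phenotype(genome)          # 0.5 + 1.0

# A single promoter-bit mutation re-activates the dormant structure
# without having to rediscover the weight -2.0 by weight mutations.
genome[1] = (True, genome[1][1])
reactivated = phenotype(genome)        # 0.5 - 2.0 + 1.0
```

Silenced genes are invisible to selection, so they are not purged when the environment temporarily stops rewarding them, which is exactly the kind of memory a nonstationary learning problem needs.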
In order to deal with the particularities of the learning processes involved in the MDB, it was necessary to develop a new neuroevolutionary algorithm able to deal with general dynamic problems, that is, combining both memory elements and the preservation of diversity. This algorithm was called the promoter-based genetic algorithm (PBGA) [26]. In this algorithm, the chromosome is endowed with an internal or genotypic memory, and diversity is preserved using a genotype-phenotype encoding that prevents the loss of relevant information throughout the generations. One of the main features of the PBGA is that it automatically adjusts the number of neurons required for each layer. The practical operation and details of the algorithm are presented in [22], where a discussion of how this algorithm outperforms others, like NEAT [21], in nonstationary conditions is included. An analysis of how the algorithm can be improved with the incorporation of an external memory, in this case an LTM, is presented in [23]. Summarizing this subsection, the MDB implements four parallel neuroevolutionary processes during operation, using the STM contents as the fitness function for learning the models and behaviors. The cognitive architecture is transparent to the particular type of algorithm or ANN but, taking into account the features of real world learning, a neuroevolutionary algorithm, the PBGA, that outperforms the existing ones in these conditions has been designed.

B. Remembering Facts, Situations, and Behaviors

The management of the real data represented by the action-perception pairs that are stored in the STM ("the facts") is critical in the real time learning processes of the MDB. The quality of the learned models depends on what is stored in this memory and the way it changes.
On the other hand, the particular conditions of the environment and the robot itself throughout its lifetime (the "situations") and the actions taken (the "behaviors") in those cases must be stored in a LTM to learn from experience. 1) STM: STM is a memory element that stores data obtained from the real time interaction of the agent with its environment. The internal models the agent creates during the learning process should predict and generalize all the data stored in the STM. Thus, what is learned and how it is learned depends on the contents of the STM over time. Obviously, it is not realistic to store all the samples acquired throughout an agent's lifetime. The STM must be limited in size and, consequently, a replacement strategy is required in order to store the information the agent considers more relevant in a given instant of time. The replacement strategy should be dynamic and adaptable to the needs of the agent and, therefore, it must be subject to external regulation. For this reason, a replacement strategy has been designed that labels the samples using four basic features related to saliency and temporal relevance. The point in time a sample is stored (t): it favors the elimination of the oldest samples, maximizing the learning of the most current information acquired.

The distance between samples (d): measured as the Euclidean distance between the action perception pair vectors, this parameter favors the storage of samples from all over the feature space in order to achieve a general model. The complexity of a sample to be learned (c): this parameter favors the storage of the samples that are hardest to learn. The error provided by the current models when predicting a sample is used to calculate it. The relevance of a sample (r): this parameter favors the storage of the most particular samples, that is, those that, even though they may be learned by the models very well, initially presented large errors. Thus, each sample stored in the STM has a label (L) that is calculated every iteration as a function of these four basic terms, that is, L = f(t, d, c, r). Different label functions imply different management strategies. The regulation of these four features can be carried out by the cognitive mechanism or by other parts of the memory system so as to improve the learning and generalization properties. 2) LTM: LTM is a higher level memory element, because it stores information obtained after the analysis of the real data stored in the STM. From a psychological point of view, the LTM stores the knowledge acquired by the agent during its lifetime, its experience. This knowledge is represented in the MDB by the models and behaviors. In an initial approach, which has provided successful results, it has been considered that a model must be stored in the LTM if it predicts the contents of the STM with high accuracy during an extended period of time (measured in iterations in the MDB). In the case of a behavior, it should be stored if, during its application, it has led to a relevant increase in the satisfaction obtained. These models are considered relevant for the robot's operation in a given context and should not be forgotten.
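The STM labeling and replacement rule described above can be sketched as follows, assuming a simple weighted sum for the label function and symbols t, d, c, r for the four features; both the functional form and the weights are illustrative assumptions, since in the MDB they are regulated externally by the cognitive mechanism.

```java
// Sketch of the STM replacement rule: every sample gets a label combining
// storage time (t), distance (d), complexity (c), and relevance (r).
// The weighted sum and the weight values are assumptions, not the MDB's
// actual regulated function.
public class StmLabel {
    public static double label(double t, double d, double c, double r,
                               double wt, double wd, double wc, double wr) {
        return wt * t + wd * d + wc * c + wr * r;
    }

    // Replacement: when the STM is full, the sample with the lowest label
    // is discarded; returns its index.
    public static int victim(double[] labels) {
        int worst = 0;
        for (int i = 1; i < labels.length; i++)
            if (labels[i] < labels[worst]) worst = i;
        return worst;
    }

    public static void main(String[] args) {
        double[] labels = {0.9, 0.1, 0.5};
        System.out.println("discard sample " + victim(labels));
    }
}
```

External regulation then amounts to changing the four weights, e.g. raising wt yields a purely temporal (FIFO-like) strategy.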
Each model is stored together with its context, that is, the existing STM where it performed properly. It would not be efficient to store models obtained over equivalent STMs, as it is assumed that they predict the same reality of the robot. Determining whether a model should be stored in the LTM is not trivial, as models are generalizations of situations and, in the present case, where they are implemented as artificial neural networks, it is not easy to see whether a model is the same as or similar to another one. Thus, every time a new model is a candidate for inclusion in the LTM, it must be phenotypically compared to the rest of the models in the LTM. This is achieved by simply performing cross predictions over their associated STMs. That is, to compare two models, each is run over the context of the other to see how similar they are. From a practical point of view, the addition of the LTM to the MDB avoids the need to relearn the models and behaviors every time a real agent in a dynamic situation changes state (different environments or different operation schemas). The models and behaviors stored in the LTM in a given instant of time are introduced in their corresponding evolving populations as seeds so that, if the agent returns to a previously learned situation, the model or behavior will be present in the population and the prediction will be accurate soon. If the new situation is similar to one the agent has learned before, seeding the evolving population with the LTM will allow the evolutionary process to reach a solution very fast. 3) Memory Interplay: A mutual regulation system has been developed to control the interaction between these memories in the MDB. There are two main undesirable effects in the learning process that can be avoided with a correct management system. First of all, as was mentioned before, the replacement strategy of the STM favors the storage of relevant samples.
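The cross-prediction comparison can be sketched as below; the model is reduced to a hypothetical single-input function, the {input, observed output} sample layout and the tolerance threshold are assumptions, and the real MDB compares full ANNs over full action-perception pairs.

```java
import java.util.function.DoubleUnaryOperator;

// Sketch of the cross-prediction test used to decide whether a candidate
// model duplicates one already in the LTM: each model is run over the
// other's stored context (its STM samples), and the models are considered
// equivalent when both cross-prediction errors are low.
public class CrossPrediction {
    static double mse(DoubleUnaryOperator model, double[][] samples) {
        double sum = 0;
        for (double[] s : samples) {          // s = {input, observedOutput}
            double e = model.applyAsDouble(s[0]) - s[1];
            sum += e * e;
        }
        return sum / samples.length;
    }

    public static boolean equivalent(DoubleUnaryOperator a, double[][] ctxA,
                                     DoubleUnaryOperator b, double[][] ctxB,
                                     double tol) {
        return mse(a, ctxB) < tol && mse(b, ctxA) < tol;
    }

    public static void main(String[] args) {
        double[][] ctx = {{1, 2}, {2, 4}};
        System.out.println(equivalent(x -> 2 * x, ctx, x -> 2 * x, ctx, 1e-9));
    }
}
```

Comparing phenotypes through their contexts sidesteps the hard problem of comparing ANN weight vectors directly, where very different weights can encode the same function.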
But what is considered relevant could change over time (change of motivation or environment), and consequently the information stored in the STM should also change so that the new models and behaviors generated correspond to the new situation. If no regulation is introduced, when situations change, the STM will be polluted by information from previous situations (there is a mixture of information) and, consequently, the models and behaviors that are generated do not correspond to any one of them. These intermediate situations can be detected by the replacement strategy of the LTM, as it is continuously testing the models and behaviors to be stored in the LTM. Thus, if it detects a model or behavior that suddenly and repeatedly fails in the predictions of the samples stored in the STM, it is possible to assume that a change of context has occurred. This detection will produce a regulation of the parameters controlling the replacement in the STM so that it will purge the older context. It can even become a completely temporal strategy for a while. This purge will allow new data to fill the STM and thus the models and behaviors can be correctly generated. It is a clear case of LTM monitoring affecting the operation of the STM and thus the learning processes. The other undesirable effect that must be avoided is a continuous storage in the LTM. This happens when the data stored in the STM are not general enough and the models or behaviors seem to be different although they are not. The replacement strategy of the LTM can detect whether the agent's situation has changed and, consequently, after a change of situation it can detect whether the number of attempts to enter the LTM is high. In such a case, the parameters of the replacement strategy of the STM are regulated so that they favor information that is more general, by strengthening parameters such as distance, relevance, or complexity and reducing the influence of time.
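A minimal sketch of the context-change detection follows, assuming a fixed window of recent prediction errors and a failure threshold; both parameters, and the sliding-window formulation itself, are illustrative assumptions.

```java
import java.util.ArrayDeque;

// Illustrative regulation rule: if a model that used to predict the STM
// well suddenly and repeatedly fails, assume a context change and trigger
// a purge of old-context samples from the STM (e.g., by temporarily
// falling back to a purely temporal replacement strategy).
public class ContextChangeDetector {
    private final ArrayDeque<Double> recent = new ArrayDeque<>();
    private final int window;
    private final double failThreshold;

    public ContextChangeDetector(int window, double failThreshold) {
        this.window = window;
        this.failThreshold = failThreshold;
    }

    // Feed the prediction error of a stored LTM model each iteration;
    // returns true when the whole window exceeds the threshold.
    public boolean update(double error) {
        recent.addLast(error);
        if (recent.size() > window) recent.removeFirst();
        if (recent.size() < window) return false;
        for (double e : recent) if (e <= failThreshold) return false;
        return true;  // sustained failure -> purge the older context
    }

    public static void main(String[] args) {
        ContextChangeDetector d = new ContextChangeDetector(3, 0.5);
        d.update(0.1); d.update(0.9); d.update(0.9);
        System.out.println("context change: " + d.update(0.9));
    }
}
```

Requiring sustained failure, rather than a single bad prediction, keeps the detector from purging the STM on ordinary sensor noise.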
Summarizing, a strategy for avoiding model learning in intermediate situations and another for avoiding the overload of the LTM are required. These two strategies are necessary in the interplay between memories, together with the management mechanisms for each one of them individually. Hence, a dynamic memory structure arises that improves the efficiency in the use of memory resources, minimizing the number of models and behaviors stored in the LTM without affecting performance and allowing them to be as general as possible. This last fact is quite important because, as the models and behaviors are used as seeds in the evolution processes, the more general they are the better they will adapt to new situations. A more detailed description of the memory elements within the MDB can be found in [24].
C. Real-Time Operation
The computational implementation of all the elements that make up a complex architecture like the MDB is the key to its

success in real robotic systems. There are several aspects that must be carefully designed and implemented to obtain a tool that can be practical in terms of reliability and computational cost. The current version of the MDB has been developed in JAVA, and in its object-oriented design there are four basic packages that constitute the computational core of the architecture: robot, evolutionary algorithm, model, and memory. 1) Robot: As a first design requirement, already in the initial versions of the MDB, it was imposed that the basic onboard processor of current real robots (usually a microcontroller) should be used just to execute actions and to collect the sensorial information in real time. Regarding the two different time scales shown in Fig. 3, this means that the onboard processor is in charge of the execution time scale elements. Thus, the deliberative part of the MDB is always executed in a separate processor (on or off the robot), and the communications between these two structures are carried out using the standard TCP/IP protocol. The second basic design requirement related to the robot is that the MDB should be as independent from the particular hardware as possible. That is, it is assumed that the MDB receives sensorial information and provides actions to be applied in a robot, but its core cannot include any particularity about the robot. To this end, inspiration was taken from the Player/Stage project [25], which uses a network server for robot control that provides an interface to the robot's sensors and actuators over the IP network. The robot package includes all the classes implementing the previous two requirements. The designer must create a simple configuration file including a description of the robot sensors and actuators and the IP port where it is connected.
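The TCP/IP split between the onboard (reactive) processor and the deliberative MDB core can be sketched with a loopback example; the one-line, comma-separated message format and the sensor names are assumptions, not the MDB's actual protocol.

```java
import java.io.*;
import java.net.*;

// Loopback sketch: the "robot" side only applies received actions and
// serves raw sensor values over TCP/IP; the MDB core connects as a client.
public class RobotLink {
    public static String exchange(String action) throws Exception {
        try (ServerSocket server = new ServerSocket(0)) {     // robot side
            int port = server.getLocalPort();
            Thread robot = new Thread(() -> {
                try (Socket s = server.accept();
                     BufferedReader in = new BufferedReader(
                             new InputStreamReader(s.getInputStream()));
                     PrintWriter out = new PrintWriter(s.getOutputStream(), true)) {
                    String act = in.readLine();               // apply the action...
                    out.println("ir=0.3,0.2;act=" + act);     // ...return raw sensors
                } catch (IOException ignored) { }
            });
            robot.start();
            try (Socket s = new Socket("localhost", port);    // MDB side
                 PrintWriter out = new PrintWriter(s.getOutputStream(), true);
                 BufferedReader in = new BufferedReader(
                         new InputStreamReader(s.getInputStream()))) {
                out.println(action);
                String reply = in.readLine();
                robot.join();
                return reply;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(exchange("forward"));
    }
}
```

Keeping the robot side this thin is what lets the same MDB core drive different robots or simulators through per-platform interface programs.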
On the other hand, for each particular robot or simulator, an interface program must be developed to provide sensorial information and to capture the actions obtained by the MDB using the standard TCP/IP protocol. The onboard processor's computational load is minimized and, consequently, these data are always in raw format. If any kind of processing is required, it will be carried out by a dedicated perception package. 2) Evolutionary Algorithm: This package includes all the classes in charge of executing the evolutionary processes for the models and behaviors, which make up the core of the architecture. One of the main drawbacks of the application of evolutionary algorithms in real robotics is the computational cost they imply, which makes them apparently unsuitable for real time operation. This problem has been addressed through the following design decisions. First, as commented in Section III-A, the evolution of the models and behaviors lasts only a few generations per iteration in order to obtain a smooth learning curve. This obviously reduces the computational cost in between interactions with the world and makes its real time implementation feasible. In addition, the MDB is intrinsically concurrent and each evolutionary process runs on an independent thread that is automatically assigned to a different processor when available. In the case of having a local area network, the processes may be executed in different computers over the network automatically. The behavior evolution module uses the current models to evaluate candidate behaviors. Taking into account the distributed execution of each model evolution, a coherence protocol has been implemented that ensures an updated evaluation of the individuals, always using the most recent models. A similar procedure has been implemented for the update of the current behavior in the reactive part of the MDB, which is replaced every time a better one is obtained, while ensuring a coherent action selection.
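The concurrent execution of the evolutionary processes can be sketched with a standard thread pool; the toy per-process task stands in for a few generations of evolution, and all names here are illustrative, not the MDB's actual classes.

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch of concurrent evolution: each model or behavior population
// evolves on its own thread, so the per-iteration wall-clock cost is
// bounded by the slowest process rather than by their sum.
public class ConcurrentEvolution {
    public static double[] evolveAll(int processes) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(processes);
        List<Future<Double>> futures = new ArrayList<>();
        for (int p = 0; p < processes; p++) {
            final int id = p;
            futures.add(pool.submit(() -> {
                // toy stand-in for a few generations of one process
                double best = 0;
                for (int g = 0; g < 100; g++) best = Math.max(best, id + g * 0.01);
                return best;
            }));
        }
        double[] results = new double[processes];
        for (int p = 0; p < processes; p++) results[p] = futures.get(p).get();
        pool.shutdown();
        return results;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(Arrays.toString(evolveAll(4)));
    }
}
```

Collecting the futures before the next cognitive iteration mirrors the coherence requirement described above: behavior evaluation only proceeds once the most recent models are available.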
On the other hand, the MDB is independent from the particular type of evolutionary algorithm and ANN. This is very easy to support with the object-oriented design of the architecture. In fact, the architecture has been tested with the different algorithms implemented in the JEAF library [26], by simply changing the class in the configuration file. 3) Model: To create a new experiment using the MDB, the first step is to set up the model configuration. This implies formally describing the world, internal, and satisfaction models; that is, the knowledge representation within the architecture must be chosen. This selection is very relevant because the success or failure of learning depends highly on the complexity of the models. What the designer must do is simply indicate the inputs and outputs corresponding to each model. This information corresponds to the external and internal sensorial information and to the action space. Note that the robot configuration file already includes all the real sensors and actuators, but here, virtual sensors and actuators that process the raw data provided by the robot may be used as inputs and outputs to the models, allowing the designer to manipulate the sensorial information freely. A general procedure was developed to automate model configuration based on the division of independent sensorial information into different models. For example, in the case of world models, for a robot with four infrared sensors and two light sensors, the MDB will create two models as a first approach: the first one relating the four infrared inputs at instants t and t+1 and the action applied, and the second one relating the two light inputs with the action in the same way. What is relevant for the computational implementation is that several concurrent evolutionary processes may be required for all the submodels making up the world, internal, and satisfaction models as well as for the behaviors.
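The automatic division into submodels can be sketched as below; the input-counting convention (one modality's sensors at time t plus one action input) is an illustrative assumption about the resulting network layout.

```java
import java.util.*;

// Sketch of the automatic model-configuration step: independent sensor
// modalities are split into separate world models (e.g., four infrared
// sensors -> one model, two light sensors -> another), each relating its
// own inputs plus the action to its inputs at the next instant.
public class ModelSplitter {
    // sensorCounts maps a modality name to its number of sensors.
    // Returns one entry per submodel: modality -> number of ANN inputs
    // (the modality's sensors plus one action input, as an assumption).
    public static Map<String, Integer> worldModels(Map<String, Integer> sensorCounts) {
        Map<String, Integer> models = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : sensorCounts.entrySet())
            models.put(e.getKey(), e.getValue() + 1);
        return models;
    }

    public static void main(String[] args) {
        Map<String, Integer> sensors = new LinkedHashMap<>();
        sensors.put("infrared", 4);
        sensors.put("light", 2);
        System.out.println(worldModels(sensors));
    }
}
```

Each resulting submodel then gets its own evolutionary process, which is what makes the concurrent execution described above necessary.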
As commented above, a mechanism has been implemented that automatically executes these concurrent processes in different, and even remote, threads. 4) Memory: As discussed in Section III-B1, the replacement strategy of the STM determines the type of learning achieved, and it must be adjusted depending on the complexity of the model. Taking into account the previously explained subdivision into models, using a single STM for all the models is not feasible. Hence, in the current implementation of the MDB, each model evolution has its own STM with its particular replacement strategy. All the classes required for this management are included in the memory package. The corresponding configuration file only requires establishing the type of replacement strategy, while the creation and execution of the STM is automatic. Regarding the LTM, the design and implementation follows basically the same principles: there is a different LTM for

each evolutionary process, which is executed concurrently with it.
Fig. 4. Experimental setup with the Hermes II robot, an objective block, and a teacher that guides the learning process.
To summarize this section, the following implementation aspects within the MDB, all related to the operation of the architecture in real robots, should be highlighted: object-oriented design and JAVA implementation; hardware/simulator independence through the use of a TCP/IP middleware approach; time scale independence: reactive elements are executed onboard and deliberative elements in remote or onboard computers through TCP/IP communication; automatic concurrent execution of the evolutionary processes in remote processors; automatic division of STM and LTM memories and execution according to the evolutionary processes; persistent evolutionary processes that never stop, although the fitness function can change; easy integration with evolutionary and ANN libraries. Once the principles and operation of the MDB have been presented, the next section will be devoted to a series of results that were obtained applying it to real robot problems.
IV. APPLICATION RESULTS
This section describes two representative application examples that summarize the behavior of the MDB in real robot learning. The first one is very simple but highly conceptual, and it is focused on the developmental features of the architecture. The second one is more complex in learning terms and it includes all the elements of the architecture working together.
A. Learning Basic Skills
The first experiment was carried out using the Hermes II hexapod robot (see Fig. 4), which has six legs with two degrees of freedom (swing and lift), six infrared sensors, each one placed on top of a leg, two whiskers, inclinometers, and six force sensors. In the first part of the example, we want the Hermes II robot to learn to walk.
The motion of each leg can be described through three parameters (for the swing and lift motions): the initial phase, which establishes the starting point of the leg's motion; the frequency, which increases or decreases the speed of the movement; and the sweep amplitude. In this case, all of the parameters are fixed except the initial phase of the swing motion for each leg. The different combinations of phases lead to different gaits, some useful, some useless, and some even completely impractical. The mechanism must allow the robot to develop an efficient gait so that it can fulfill its motivations. A developmental approach has been followed in the execution of the experiment, where the teacher guides the learning process step by step. The robot is placed in a standing pose at a random point of an empty environment, and an object (a block) is placed one meter away from it (see these elements in Fig. 4). The mechanism selects the gait that must be applied and the robot uses it during a fixed time (24 seconds). A gait is defined by the initial phase of the swing motion in the six legs. Through its infrared sensors, using a time integration virtual sensor presented in [27], the robot always has an indication, in general noisy, of the distance to the block. In the MDB, the designer has to define the motivation of the robot in measurable terms and the particular models required according to the robot's features. In this case, the motivation is very simple and general: minimize the distance to the block. In the sensorial map of the robot, this implies a maximization of the detection in the two front infrared sensors. A single world model was used as there is only one type of sensorial information (infrared data). The world model has seven inputs: the distance to the block provided by the virtual sensor (in a range from 0 to 10) and the six input phases applied to the legs (in a range from -5 to 5, corresponding to the real limits of -45 to 45 degrees).
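The leg parameterization described above can be sketched as a sinusoid per leg; the sinusoidal form of the swing trajectory and the exact tripod phase assignment are assumptions consistent with the text, not taken from the paper's implementation.

```java
// Sketch of the gait parameterization: each leg's swing angle follows a
// periodic trajectory defined by an initial phase, a frequency, and a
// sweep amplitude; in the experiment only the initial phases differ.
public class Gait {
    // Swing angle (degrees) of one leg at time t (seconds).
    public static double swingAngle(double t, double phase,
                                    double freqHz, double amplitudeDeg) {
        return amplitudeDeg * Math.sin(2 * Math.PI * freqHz * t + phase);
    }

    // A tripod gait: legs {0, 2, 4} in phase, legs {1, 3, 5} in
    // counter-phase, which is the combination the robot converged to.
    public static double[] tripodPhases() {
        double[] p = new double[6];
        for (int leg = 0; leg < 6; leg++) p[leg] = (leg % 2 == 0) ? 0.0 : Math.PI;
        return p;
    }

    public static void main(String[] args) {
        System.out.println("leg 0 at t=0.25 s: "
                + swingAngle(0.25, tripodPhases()[0], 1.0, 45.0) + " deg");
    }
}
```

Under this parameterization, the six-phase search space the MDB explores is exactly the phase vector passed to swingAngle, one entry per leg.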
The unique output of this world model is the predicted distance to the block. In this case, the ANN that represents the world model is a multilayer perceptron with two hidden layers of four neurons each. No LTM was considered in this first experiment. With this setup, the experiment was started and, each time the robot falls or loses the block, the teacher is in charge of placing all the elements again in the correct positions (shown in Fig. 4). The MDB was run for 300 iterations until the gait was successful and the robot reached the block consistently. In this first experiment a simple genetic algorithm was used for the evolution of the world models. The algorithm considered 700 individuals, 57 genes (corresponding to the weights and bias of the neural network), 60% crossover and 2% mutation. No internal sensors were used in this experiment. For the sake of simplicity, in this case, the satisfaction is directly the predicted distance to the block. The behaviors are, in this case, simple actions, but they have been obtained using another genetic algorithm with 120 individuals, six genes (direct encoding of the input phases), 60% crossover and 6% mutation. The STM was limited to 40 action perception pairs and it worked with a purely temporal replacement strategy, that is, each sample was labeled with the iteration (t) in which it was acquired and, when the STM is full, the oldest samples are eliminated following a first-in first-out (FIFO) strategy. Fig. 5 displays the variation in time of the standard mean squared error (MSE) of the distance predicted by the best world model as compared to the STM action perception pairs. Specifically, the MSE is calculated as MSE = (1/N) * sum_{i=1..N} (o_i - p_i)^2, where p_i is the output predicted by the model, o_i is the real output of the corresponding action perception pair, and N is the STM size. As shown in the figure, the error decreases clearly

but with a continuous oscillation as a consequence of the constant variation of the STM, which in each iteration replaces one action perception pair, thus modifying the fitness criteria.
Fig. 5. Evolution of the MSE in the STM prediction for the best world model.
Fig. 6. Efficiency of the gaits applied by the robot in each iteration of the MDB with tendency line.
In order to understand the evolutionary learning process that occurs in the MDB and how it affects the actions that are applied, a gait efficiency parameter was defined as the distance covered by the robot in the vertical direction towards the objective in a fixed simulation time (dv), minus the horizontal distance (dh) by which its trajectory deviates from a straight line, normalized by the maximum possible distance (dmax): E = (dv - dh) / dmax. That is, a gait is taken as better if the robot goes straight to the block without any lateral deviation. It is important to point out that this measure is never used in the cognitive mechanism; it is just a way of clarifying the presentation of results. Fig. 6 displays the behavior of this efficiency throughout the 300 iterations of the robot's lifetime. It can be observed that the curve tends to 1, as expected. Initially, the gaits are poor and the robot moves in irregular trajectories. This is reflected in the efficiency graph by the large variations in the efficiency from one instant to the next. Sometimes, by chance, it reaches the block; other times it ends up very far away from it. Note that, whatever the result of the action, it does produce a real action perception pair, which is useful data in order to improve the models. As the interaction progresses, the robot learns to reach the block without any deviation in a consistent manner, and the efficiency tends to one. Comparing Figs. 5 and 6 it can be seen that, although the learning of the models is a noisy process with large oscillations, the resulting actions that are in the background of Fig. 6 improve in a more continuous and natural trend.
Fig. 7. Representation of the gaits obtained through iterations.
In the three graphs of Fig. 7, we represent the temporal occurrence of the end of the swing motion for each leg (considered as a swing angle of -45 degrees, that is, the highest reverse turn) during the 20 s. The top graph corresponds to iteration 6, and we can see that the swings are completely out of phase because the legs reach the end point at different instants of time. The resulting gait is not appropriate for walking in a straight line and the robot turns, leading to a low efficiency value. The middle graph corresponds to iteration 87, where the resulting gait is more efficient than before, in accordance with the level of error for that iteration (see Fig. 5). Finally, the bottom graph shows the combination of phases corresponding to iteration 300. As we can see, the initial phases are equal in groups of three and the resulting gait is quite good. This combination of phases leads to a very common and efficient gait called the tripod gait, where three legs move in phase and the other three legs in counter-phase, resulting in a very fast and stable straight line motion. At this point, the Hermes II robot had learned to walk and thus, in a developmental learning process, it was decided to use the MDB to provide it with the basic skill of turning using the combination of initial phases obtained (tripod gait) in the previous case. In this case, the same block was placed in a semicircumference in front of the robot at a random distance between

50 and 100 cm, and the MDB should provide the robot with the best combination of amplitudes in the swing motion in order to reach it. The rest of the parameters of the gait are fixed. If the robot reaches the block (distance of less than 20 cm) or if it loses it (distance larger than 100 cm), the teacher places it again in a new position within the semicircumference.
Fig. 8. Iterations between two consecutive captures of the object.
Fig. 9. The left image shows the path followed by the Hermes II robot in the first iterations. The right image shows the path when the behavior is successful.
Fig. 10. Interaction between teacher and robot using the Pioneer 2 (left) and the AIBO (right) robot. The top images correspond to the learning stage and the bottom images to the induced behavior.
The world model now has three inputs: the distance and angle of the robot with respect to the block (provided by the virtual sensor applied before) and the amplitude of turn. The outputs are the predicted distance and angle. In this case, an explicit satisfaction model was used, with these two outputs of the world model as inputs and with just one output, the predicted satisfaction. The motivation of the robot was again the maximization of the infrared sensing in the two front sensors. Consequently, the robot had to reach the block (minimizing distance) with low deviation (minimizing angle). The world models had two hidden layers with four neurons each, and the satisfaction models three neurons. The population in the genetic algorithms was 600 individuals for the world models and 300 for the satisfaction models. The STM size was 40 and the management strategy was purely temporal (first in first out), as in the previous case. Fig. 8 provides an alternative view of the learning evolution. We have represented the number of iterations between two consecutive captures of the object.
It can be clearly seen how, in the first stages of the behavior, there is a large delay from one capture to the next because the models are poor and the selected actions are not successful. The tendency changes at about iteration 200 and the number of iterations between two consecutive captures decreases to one, implying that the robot has acquired the turning skill. The left image of Fig. 9 displays the path followed by the real robot with the strategies applied in iterations 53, 54, 55, and 56. As indicated in Fig. 8, these iterations correspond to the first stages of the mechanism, where the number of iterations required to reach the object is large. In fact, the block remains in the same position during the application of these four strategies and the robot never turns towards it. The right image of Fig. 9 displays the path followed in iterations 421, 422, 423, and 424. In this case, as the block is reached by the robot, it is moved by the teacher. To conclude this first experiment, it must be pointed out that the robot was able to autonomously generate a tripod gait and modulate the amplitudes of the legs in order to turn to reach an objective through continuous interaction with the environment, using its own sensors and a very simple motivation. This is very important because the mechanism allows the robot to find the best solution according to the limitations of its environment and its sensorial and actuation apparatus. In fact, the robot is adapting and surviving in this particular world. In addition, the computational implementation of the MDB performed successfully even with a highly limited real robot.
B. Induced Behavior
To show the behavior of the MDB in a more complex task that is guided by a teacher, a typical example of induced behavior has been reproduced.
This experiment was carried out using two different physical agents to demonstrate the robustness of the architecture implementation and its transparency with respect to the particular hardware: a Pioneer 2 wheeled robot and Sony's AIBO. The task the physical agent must carry out is simple: learn to obey the commands of a teacher that, initially, guides the robot towards an object located in its neighborhood. Fig. 10 displays the experimental setup for both agents. In the case of the Pioneer 2 robot (left images of Fig. 10), the target is a black cylinder that must be caught, and in the case of the AIBO robot the target is a pink ball (right images of Fig. 10). The Pioneer 2 is a wheeled robot that has a sonar sensor array around its body and a laptop placed on its top platform. The laptop provides two more sensors, a microphone and the numerical keyboard, and the MDB runs on it as explained in Section III-C. The

AIBO robot is a dog-like robot with a richer set of sensors and actuators. Its digital camera, the microphones, and the speaker were used for this example. In this case, the MDB is executed remotely on a PC and communicates with the robot through a TCP/IP protocol and a wireless connection.
Fig. 11. Representation of the models used in this experiment.
Fig. 11 displays a schematic view of the current world and satisfaction models (with their respective numbers of inputs and outputs) that arise in this experiment in a given instant. The sensory meaning of the inputs and outputs of these models in both physical agents is the following. Command (One Input) for the Pioneer 2 Robot: group of seven possible values according to the seven musical notes; provided by the teacher through a musical keyboard; sensed by the robot using the microphone of the laptop; translated to a discrete numerical range from -9 to 9 (linear relation for the first teacher and a random association for the second one). Command (One Input) for the AIBO Robot: group of seven possible values according to seven spoken words: hard right, medium right, right, straight, left, medium left, and hard left; the teacher speaks directly; sensed using the stereo microphones of the robot; speech recognition using Sphinx software translated into a discrete numerical range from -9 to 9 (linear relation for the first teacher and a random association for the second one). Action (One Input) for the Pioneer 2 Robot: group of seven possible actions: turn hard right, turn medium right, turn right, follow straight, turn left, turn medium left, and turn hard left, encoded with a discrete numerical range from -9 to 9; the selected action is decoded as linear and angular speed.
Action (One Input) for the AIBO Robot: group of seven possible actions: turn hard right, turn medium right, turn right, follow straight, turn left, turn medium left, and turn hard left, encoded with a discrete numerical range from -9 to 9; the selected action is decoded as linear speed, angular speed, and displacement. Human Feedback (One Output/Input) for the Pioneer 2 Robot: discrete numerical range that depends on the degree of fulfillment of a command, from 0 (disobey) to 5 (obey); provided by the teacher directly to the MDB using the numerical keyboard of the laptop. Human Feedback (One Output/Input) for the AIBO Robot: group of five possible values according to five spoken words: well done, good dog, ok, pay attention, and bad dog; the teacher speaks directly; sensed using the stereo microphones of the robot; speech recognition using Sphinx software translated into a discrete numerical range from 0 to 5. Satisfaction (One Output) for the Pioneer 2 Robot: continuous numerical range from 0 to 11 that is automatically calculated after applying an action. It depends on: the degree of fulfillment of a command, from 0 (disobey) to 5 (obey); the distance increase, from 0 (no increase) to 3 (max); the angle with respect to the object, from 0 (back turned) to 3 (robot frontal to the object). Satisfaction (One Output) for the AIBO Robot: continuous numerical range from 0 to 11 that is automatically calculated after applying an action. It depends on: the degree of fulfillment of a command, from 0 (disobey) to 5 (obey); the distance increase, from 0 (no increase) to 3 (max); the angle with respect to the object, from 0 (back turned) to 3 (robot frontal to the object). Distance and Angle (Two Outputs/Inputs) for the Pioneer 2 Robot: sensed by the robot using the sonar array sensor; measured from the robot to the black cylinder, encoded directly in cm and degrees, and transformed to a range [0:10].
Distance and Angle (Two Outputs/Inputs) for the AIBO Robot: sensed by the robot using the images provided by the color camera; color segmentation and area calculation taken from the Tekkotsu software [28]; encoded in cm and degrees and transformed to a range [0:10]; measured from the robot to the pink ball.

In this example, the internal sensors of the robots were not considered and, consequently, internal models were not used. The flow of the learning process is as follows: the teacher observes the relative position of the robot with respect to the object and provides a command that guides it towards the object. Initially, the robot has no idea of what each command means with regard to the actions it applies. After sensing the command, the robot acts and, depending on the degree of obedience, the teacher provides a reward or a punishment as a pleasure or pain signal. The motivation of the physical agent in this experiment is to maximize the reward provided by the teacher. Consequently, to carry out this task, the robot just needs to follow the commands of the teacher, and a world model with that command as sensory input is obtained (top world model of Fig. 11) to select the action. From this point forward, this model will be called the communications model. The satisfaction model (top satisfaction model of Fig. 11) is trivial and it is not used

in the first part of the experiment (it is displayed in Fig. 11 for coherence), as the satisfaction is directly related to the output of the communications model, that is, the reward or punishment.

Regarding the models corresponding to the remaining sensors of the robot, a second world model was simultaneously obtained (bottom world model of Fig. 11) that uses the distance and angle to the object as sensory inputs. Obviously, this model relates information different from the teacher's commands during the performance of the task. If the commands produce any regularities in the information provided by the other sensors with regard to the satisfaction obtained, these models can be applied when operating without a teacher. That is, if at a given instant of time the teacher stops providing commands, the communications model will not have any sensory input and cannot be used to select the actions, leaving this task in the hands of the other models that do have inputs. For this second case, the satisfaction model is more complex, relating the satisfaction value to the distance and angle, which are directly associated with rewards or punishments. The highest satisfaction (value 1) corresponds to the minimum distance and angle.

The four models are represented by multilayer perceptron ANNs. They were adjusted by means of the PBGA genetic algorithm [22], which automatically determined the sizes of the ANNs. Summarizing, in this case, the MDB executes four evolutionary processes over four different model populations every iteration. These processes run concurrently in the current version of the MDB. The STM has a size of 10 action-perception pairs in all the experiments, and the label L explained in Sections III-B1 and III-B3 was calculated differently in a stable context and if a context change is detected.

Fig. 12. Evolution of the mean squared error of the outputs provided by the current models (predicted distance, angle, satisfaction, and human feedback) for the AIBO robot (top) and the Pioneer 2 robot (bottom) experiments.

Fig. 12 displays the evolution of the mean squared error (calculated using the same expression explained above for the experiment shown in Fig. 5) provided by the current models (communications, world, and satisfaction) predicting the STM as iterations of the MDB take place in both physical agents (the top graph corresponds to the AIBO robot experiment and the bottom graph to the Pioneer 2 robot one). The error clearly decreases in all cases and in a very similar way for both agents (except at the beginning, during the first 10 iterations, while the STM is being filled up). This means that the MDB works similarly on two very different real platforms and that it is able to provide a real modeling of the environment, the communications, and the satisfaction of the physical agents. As the error values in Fig. 12 show, both robots learned to follow the teacher's commands accurately in about 20 iterations (from a practical point of view, this means about 10 min of real time) and, what is more relevant, the operation without a teacher was successful using the induced world and satisfaction models. In this kind of real robot example, the main measure that must be considered in order to judge the goodness of an experiment is the time consumed in the learning process to achieve perfect obedience. Fig. 10 displays a real execution of actions in both robots. In the pictures with a teacher, the robot is following commands; otherwise, it is performing the behavior without any commands, just using its induced models. It can clearly be seen that the behavior is basically the same, although a little less efficient without teacher commands (as the robot has learned to decrease its distance to the object, but not the fastest way to do it).

Fig. 13. Evolution of the mean squared error provided by the outputs of the current communications model (predicted human feedback) and satisfaction model (predicted satisfaction) compared to the STM content as iterations of the MDB take place when a dynamic language and reward policy is applied.

With the aim of showing the adaptive capabilities of the MDB in real robot operation, Fig. 13 represents the evolution of the standard MSE provided by the current communications model during 200 iterations for the experiment with the Pioneer 2 robot (human feedback curve in the figure). Focusing our analysis on the communications model, in the first 70 iterations the teacher

provides commands using the same encoding (language) applied in the previous experiment. This encoding is not preestablished, and the teacher can make use of any correspondence it wants as long as it is consistent. From iteration 70 to iteration 160, another teacher appears using a different language (a different and more complex relationship between musical notes) and, finally, from iteration 160 to iteration 200, the original teacher returns. As shown in Fig. 13, in the first 70 iterations the error decreases quickly to a level of 0.17, which results in a very accurate prediction of the rewards. Consequently, the robot successfully follows the commands of the teacher. When the second teacher appears, the error level increases because the STM starts to store samples of the new language and the previous models fail in their predictions. At this point, as commented before, the LTM management system detects this mixed situation (it detects an unstable model) and induces a change in the parameters of the STM replacement strategy, switching to a FIFO strategy. The increase in the value of the error stops in about 10 iterations and, once the STM has been purged of samples from the first teacher's language, the error decreases again (0.13 at iteration 160). The error level between iterations 70 and 160 is not as stable as in the first iterations. This happens because the language used by the second teacher is more complex than the previous one, that is, its relationship to the encoding variable is nonlinear. In addition, it must be pointed out that the evolution graphs obtained from real robots oscillate, in general, much more than those from simulated experiments due to the broad range of noise sources in real environments.
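The STM management behavior just described can be sketched as follows. This is a minimal illustration under assumed names (it is not the MDB implementation): samples are kept in a fixed-size buffer, the current model's mean squared error over the STM content is monitored, and a rising error signals an unstable model, after which plain FIFO replacement purges samples from the obsolete context.

```python
# Hypothetical sketch of STM monitoring and FIFO purging; the class,
# threshold, and function names are assumptions, not the authors' API.
from collections import deque

class ShortTermMemory:
    def __init__(self, size=10, error_threshold=0.3):
        self.error_threshold = error_threshold
        self.samples = deque(maxlen=size)  # maxlen makes replacement FIFO

    def add(self, action_perception_pair):
        # When full, the oldest sample is dropped automatically,
        # which is how samples from a previous teacher get purged.
        self.samples.append(action_perception_pair)

def mse(model, stm):
    """Mean squared error of a model's predictions over the STM content."""
    errors = [(model(inputs) - target) ** 2 for inputs, target in stm.samples]
    return sum(errors) / len(errors)

def unstable(model, stm):
    """Assume a context change when the current model fails on the STM."""
    return mse(model, stm) > stm.error_threshold
```

In this sketch the context-change detector is just a fixed error threshold; the paper's LTM management system uses its own criterion, but the effect is the same: once instability is flagged, FIFO replacement guarantees the buffer converges to samples of the new language within one buffer length, which matches the roughly 10 iterations observed before the error stops growing.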
But the practical result is that by about iteration 160 the robot follows the new teacher's commands successfully again, adapting itself to the teacher's characteristics. When the original teacher returns using the original language (iteration 160 in Fig. 13), the adaptation is very fast because the communications models stored in the LTM during the first iterations are introduced as seeds in the evolutionary processes. Regarding the satisfaction model curve represented in Fig. 13, it corresponds to an equivalent experiment in which a change in the rewards provided by the teacher was carried out. From the initial iteration until iteration 70, the teacher rewards reaching the object and, as shown in the graph, the error level is low (1.4%). From iteration 70 to 160, the teacher changes its behavior and punishes reaching the object, rewarding escaping from it. There is a clear increase in the error level due to the complexity of the new situation (the high ambiguity of possible solutions, that is, there are more directions for escaping than for reaching the object). At iteration 160, the teacher returns to the first behavior and, as expected, the error level quickly decreases to the original levels, obtaining a successful adaptive behavior.

V. CONCLUSION

This paper presents the MDB cognitive architecture for robots. It follows a developmental approach to provide real robots with autonomous lifelong learning capabilities. Knowledge acquisition is carried out by means of neuroevolutionary processes that use the real data obtained during the operation of the robots as the fitness function. The computational implementation of the architecture includes several improvements to maximize the efficiency and reliability of its practical application to real robots. The experiments carried out with the MDB have confirmed its capabilities for real-time learning of basic skills and more complex behaviors in dynamic environments.
These results open a very promising line of research involving evolutionary cognitive structures, in which several aspects may be considered and improved. The current version of the MDB does not take into account the social aspects of autonomous operation, a very important issue that must be studied in depth. Furthermore, the use of internal sensors and, consequently, internal models must be analyzed and considered as an intrinsic part of the robot representation. The possibility of a dynamic change of motivations, adapted to the robot behavior and environmental conditions, is an aspect that should be included in the architecture in order to produce really autonomous and adaptive systems. Research opportunities may also be found in the control of the short and long term memories, especially in terms of deciding what goes in or is forgotten, and in how to produce LTM representations that are not directly a consequence of STM-related models but rather generalizations of knowledge already present in the LTM. Regarding the immediate future work, new experiments with real robots are being carried out involving complex sequences of actions to study online behavior learning and adaptation in depth. In addition, other representations for the models apart from ANNs are being tested and analyzed. Finally, robots with a larger and redundant sensorial and actuation repertoire are being considered in order to determine the suitability of the mechanism in these contexts. We expect the range of behaviors and models to increase exponentially but, at the same time, we expect the mechanism to be able to cope better due to the fact that it will have more information and options to achieve an objective.

REFERENCES

[1] R. Cotterill, Enchanted Looms: Conscious Networks in Brains and Computers. Cambridge, U.K.: Cambridge Univ. Press.
[2] D. Vernon, G. Metta, and G.
Sandini, A survey of artificial cognitive systems: Implications for the autonomous development of mental capabilities in computational agents, IEEE Trans. Evol. Comput., vol. 11, no. 2, Apr.
[3] X. Yao, Evolving artificial neural networks, Proc. IEEE, vol. 87, no. 9, Sep.
[4] F. Bellas, A. Lamas, and R. J. Duro, Adaptive behavior through a Darwinist machine, Lecture Notes Artif. Intell., vol. 2159.
[5] F. Bellas, J. A. Becerra, and R. J. Duro, Induced behavior in a real agent using the multilevel Darwinist brain, Lecture Notes Comput. Sci., vol. 3562.
[6] G. A. Bekey, Autonomous Robots: From Biological Inspiration to Implementation and Control. Cambridge, MA: MIT Press.
[7] J. A. Farrell and M. M. Polycarpou, Adaptive Approximation Based Control: Unifying Neural, Fuzzy and Traditional Adaptive Approximation Approaches. New York: Wiley.
[8] N. Nilsson, Principles of Artificial Intelligence. San Mateo, CA: Morgan Kaufmann.
[9] R. Arkin, Behavior-Based Robotics. Cambridge, MA: MIT Press.
[10] The American Heritage Dictionary of the English Language, 4th ed. Houghton Mifflin Company.
[11] L. M. Brasil, F. M. de Azevedo, J. M. Barreto, and M. Noirhomme-Fraiture, Complexity and cognitive computing, Lecture Notes Comput. Sci., vol. 1415.
[12] D. Blank, J. Marshall, and L. Meeden, What is it like to be a developmental robot?, Newslett. Autonom. Mental Develop. Tech. Committee, vol. 4, no. 1, p. 7.
[13] J. Weng, J. McClelland, A. Pentland, O. Sporns, I. Stockman, M. Sur, and E. Thelen, Autonomous mental development by robots and animals, Science, vol. 291, no. 5504.
[14] L. Meeden and D. Blank, Editorial: Introduction to developmental robotics, Connect. Sci., vol. 18, no. 2, 2006.

[15] M. Asada, K. Hosoda, Y. Kuniyoshi, H. Ishiguro, T. Inui, Y. Yoshikawa, M. Ogino, and C. Yoshida, Cognitive developmental robotics: A survey, IEEE Trans. Autonom. Mental Develop., vol. 1, no. 1, May.
[16] M. R. Genesereth and N. Nilsson, Logical Foundations of Artificial Intelligence. San Mateo, CA: Morgan Kaufmann.
[17] R. J. Duro, J. Santos, F. Bellas, and A. Lamas, On line Darwinist cognitive mechanism for an artificial organism, in Proceedings Supplement Book SAB2000. New York: International Society for Adaptive Behavior, 2000.
[18] J. Changeux, P. Courrege, and A. Danchin, A theory of the epigenesis of neural networks by selective stabilization of synapses, in Proc. Nat. Acad. Sci., 1973, vol. 70.
[19] M. Conrad, Evolutionary learning circuits, J. Theoret. Biol., vol. 46.
[20] G. Edelman, Neural Darwinism: The Theory of Neuronal Group Selection. New York: Basic Books, 1987.
[21] K. O. Stanley and R. Miikkulainen, Evolving neural networks through augmenting topologies, Evol. Comput., vol. 10.
[22] F. Bellas, J. A. Becerra, and R. J. Duro, Using promoters and functional introns in genetic algorithms for neuroevolutionary learning in nonstationary problems, Neurocomputing, vol. 72.
[23] F. Bellas, J. A. Becerra, and R. J. Duro, Internal and external memory in neuroevolution for learning in non-stationary problems, Lecture Notes Artif. Intell., vol. 5040.
[24] F. Bellas and R. J. Duro, Introducing long term memory in an ANN based multilevel Darwinist brain, Lecture Notes Comput. Sci., vol. 2686.
[25] T. H. J. Collett, B. A. MacDonald, and B. P. Gerkey, Player 2.0: Toward a practical robot programming framework, in Proc. Annu. Sci. Meeting Exhibit. (ACRA), Sydney, Australia, Dec.
[26] P. Caamano, R. Tedin, and J. A. Becerra, Java Evolutionary Algorithm Framework [Online]. Available:
[27] F. Bellas, J. A. Becerra, J. Santos, and R. J.
Duro, Applying synaptic delays for virtual sensing and actuation in mobile robots, in Proc. IJCNN 2000, Como, Italy, 2000.
[28] Tekkotsu Homepage, [Online]. Available:

Richard J. Duro (M'94, SM'04) received the B.Sc., M.Sc., and Ph.D. degrees in physics from the University of Santiago de Compostela, Spain, in 1988, 1989, and 1992, respectively. He is currently a Profesor Titular in the Department of Computer Science and head of the Integrated Group for Engineering Research at the University of A Coruña, Coruña, Spain. His research interests include higher order neural network structures, signal processing, and autonomous and evolutionary robotics.

Andrés Faiña received the M.Sc. degree in industrial engineering from the University of A Coruña, Coruña, Spain. He is currently working towards the Ph.D. degree in the Department of Industrial Engineering at the same university and is a Researcher at the Integrated Group for Engineering Research. His interests include modular and self-reconfigurable robotics, mobile robotics, and electronic and mechanical design.

Daniel Souto received the M.Sc. degree in industrial engineering from the University of A Coruña, Coruña, Spain. He is working towards the Ph.D. degree in the Department of Industrial Engineering at the same university and is a Researcher at the Integrated Group for Engineering Research. His research activities are related to automatic design and mechanical design of robots.

Francisco Bellas (M'10) received the B.Sc. and M.Sc. degrees in physics from the University of Santiago de Compostela, Spain, in 1999 and 2001, respectively, and the Ph.D. degree in computer science from the University of A Coruña, Coruña, Spain. He is currently a Profesor Contratado Doctor at the University of A Coruña and a member of the Integrated Group for Engineering Research there.
His current research interests are related to evolutionary algorithms applied to artificial neural networks, multiagent systems, and robotics.


FUZZY EXPERT. Dr. Kasim M. Al-Aubidy. Philadelphia University. Computer Eng. Dept February 2002 University of Damascus-Syria FUZZY EXPERT SYSTEMS 16-18 18 February 2002 University of Damascus-Syria Dr. Kasim M. Al-Aubidy Computer Eng. Dept. Philadelphia University What is Expert Systems? ES are computer programs that emulate

More information

Experiments with SMS Translation and Stochastic Gradient Descent in Spanish Text Author Profiling

Experiments with SMS Translation and Stochastic Gradient Descent in Spanish Text Author Profiling Experiments with SMS Translation and Stochastic Gradient Descent in Spanish Text Author Profiling Notebook for PAN at CLEF 2013 Andrés Alfonso Caurcel Díaz 1 and José María Gómez Hidalgo 2 1 Universidad

More information

GACE Computer Science Assessment Test at a Glance

GACE Computer Science Assessment Test at a Glance GACE Computer Science Assessment Test at a Glance Updated May 2017 See the GACE Computer Science Assessment Study Companion for practice questions and preparation resources. Assessment Name Computer Science

More information

THE WEB 2.0 AS A PLATFORM FOR THE ACQUISITION OF SKILLS, IMPROVE ACADEMIC PERFORMANCE AND DESIGNER CAREER PROMOTION IN THE UNIVERSITY

THE WEB 2.0 AS A PLATFORM FOR THE ACQUISITION OF SKILLS, IMPROVE ACADEMIC PERFORMANCE AND DESIGNER CAREER PROMOTION IN THE UNIVERSITY THE WEB 2.0 AS A PLATFORM FOR THE ACQUISITION OF SKILLS, IMPROVE ACADEMIC PERFORMANCE AND DESIGNER CAREER PROMOTION IN THE UNIVERSITY F. Felip Miralles, S. Martín Martín, Mª L. García Martínez, J.L. Navarro

More information

Concept Acquisition Without Representation William Dylan Sabo

Concept Acquisition Without Representation William Dylan Sabo Concept Acquisition Without Representation William Dylan Sabo Abstract: Contemporary debates in concept acquisition presuppose that cognizers can only acquire concepts on the basis of concepts they already

More information

Computerized Adaptive Psychological Testing A Personalisation Perspective

Computerized Adaptive Psychological Testing A Personalisation Perspective Psychology and the internet: An European Perspective Computerized Adaptive Psychological Testing A Personalisation Perspective Mykola Pechenizkiy mpechen@cc.jyu.fi Introduction Mixed Model of IRT and ES

More information

Introduction to Simulation

Introduction to Simulation Introduction to Simulation Spring 2010 Dr. Louis Luangkesorn University of Pittsburgh January 19, 2010 Dr. Louis Luangkesorn ( University of Pittsburgh ) Introduction to Simulation January 19, 2010 1 /

More information

Implementing a tool to Support KAOS-Beta Process Model Using EPF

Implementing a tool to Support KAOS-Beta Process Model Using EPF Implementing a tool to Support KAOS-Beta Process Model Using EPF Malihe Tabatabaie Malihe.Tabatabaie@cs.york.ac.uk Department of Computer Science The University of York United Kingdom Eclipse Process Framework

More information

Agents and environments. Intelligent Agents. Reminders. Vacuum-cleaner world. Outline. A vacuum-cleaner agent. Chapter 2 Actuators

Agents and environments. Intelligent Agents. Reminders. Vacuum-cleaner world. Outline. A vacuum-cleaner agent. Chapter 2 Actuators s and environments Percepts Intelligent s? Chapter 2 Actions s include humans, robots, softbots, thermostats, etc. The agent function maps from percept histories to actions: f : P A The agent program runs

More information

Human Emotion Recognition From Speech

Human Emotion Recognition From Speech RESEARCH ARTICLE OPEN ACCESS Human Emotion Recognition From Speech Miss. Aparna P. Wanare*, Prof. Shankar N. Dandare *(Department of Electronics & Telecommunication Engineering, Sant Gadge Baba Amravati

More information

Learning Methods in Multilingual Speech Recognition

Learning Methods in Multilingual Speech Recognition Learning Methods in Multilingual Speech Recognition Hui Lin Department of Electrical Engineering University of Washington Seattle, WA 98125 linhui@u.washington.edu Li Deng, Jasha Droppo, Dong Yu, and Alex

More information

Surprise-Based Learning for Autonomous Systems

Surprise-Based Learning for Autonomous Systems Surprise-Based Learning for Autonomous Systems Nadeesha Ranasinghe and Wei-Min Shen ABSTRACT Dealing with unexpected situations is a key challenge faced by autonomous robots. This paper describes a promising

More information

An OO Framework for building Intelligence and Learning properties in Software Agents

An OO Framework for building Intelligence and Learning properties in Software Agents An OO Framework for building Intelligence and Learning properties in Software Agents José A. R. P. Sardinha, Ruy L. Milidiú, Carlos J. P. Lucena, Patrick Paranhos Abstract Software agents are defined as

More information

Evaluation of Usage Patterns for Web-based Educational Systems using Web Mining

Evaluation of Usage Patterns for Web-based Educational Systems using Web Mining Evaluation of Usage Patterns for Web-based Educational Systems using Web Mining Dave Donnellan, School of Computer Applications Dublin City University Dublin 9 Ireland daviddonnellan@eircom.net Claus Pahl

More information

Evaluation of Usage Patterns for Web-based Educational Systems using Web Mining

Evaluation of Usage Patterns for Web-based Educational Systems using Web Mining Evaluation of Usage Patterns for Web-based Educational Systems using Web Mining Dave Donnellan, School of Computer Applications Dublin City University Dublin 9 Ireland daviddonnellan@eircom.net Claus Pahl

More information

CS Machine Learning

CS Machine Learning CS 478 - Machine Learning Projects Data Representation Basic testing and evaluation schemes CS 478 Data and Testing 1 Programming Issues l Program in any platform you want l Realize that you will be doing

More information

SARDNET: A Self-Organizing Feature Map for Sequences

SARDNET: A Self-Organizing Feature Map for Sequences SARDNET: A Self-Organizing Feature Map for Sequences Daniel L. James and Risto Miikkulainen Department of Computer Sciences The University of Texas at Austin Austin, TX 78712 dljames,risto~cs.utexas.edu

More information

Major Milestones, Team Activities, and Individual Deliverables

Major Milestones, Team Activities, and Individual Deliverables Major Milestones, Team Activities, and Individual Deliverables Milestone #1: Team Semester Proposal Your team should write a proposal that describes project objectives, existing relevant technology, engineering

More information

INPE São José dos Campos

INPE São José dos Campos INPE-5479 PRE/1778 MONLINEAR ASPECTS OF DATA INTEGRATION FOR LAND COVER CLASSIFICATION IN A NEDRAL NETWORK ENVIRONNENT Maria Suelena S. Barros Valter Rodrigues INPE São José dos Campos 1993 SECRETARIA

More information

Accelerated Learning Online. Course Outline

Accelerated Learning Online. Course Outline Accelerated Learning Online Course Outline Course Description The purpose of this course is to make the advances in the field of brain research more accessible to educators. The techniques and strategies

More information

An Introduction to Simio for Beginners

An Introduction to Simio for Beginners An Introduction to Simio for Beginners C. Dennis Pegden, Ph.D. This white paper is intended to introduce Simio to a user new to simulation. It is intended for the manufacturing engineer, hospital quality

More information

SAM - Sensors, Actuators and Microcontrollers in Mobile Robots

SAM - Sensors, Actuators and Microcontrollers in Mobile Robots Coordinating unit: Teaching unit: Academic year: Degree: ECTS credits: 2017 230 - ETSETB - Barcelona School of Telecommunications Engineering 710 - EEL - Department of Electronic Engineering BACHELOR'S

More information

Interaction Design Considerations for an Aircraft Carrier Deck Agent-based Simulation

Interaction Design Considerations for an Aircraft Carrier Deck Agent-based Simulation Interaction Design Considerations for an Aircraft Carrier Deck Agent-based Simulation Miles Aubert (919) 619-5078 Miles.Aubert@duke. edu Weston Ross (505) 385-5867 Weston.Ross@duke. edu Steven Mazzari

More information

Learning to Schedule Straight-Line Code

Learning to Schedule Straight-Line Code Learning to Schedule Straight-Line Code Eliot Moss, Paul Utgoff, John Cavazos Doina Precup, Darko Stefanović Dept. of Comp. Sci., Univ. of Mass. Amherst, MA 01003 Carla Brodley, David Scheeff Sch. of Elec.

More information

Robot manipulations and development of spatial imagery

Robot manipulations and development of spatial imagery Robot manipulations and development of spatial imagery Author: Igor M. Verner, Technion Israel Institute of Technology, Haifa, 32000, ISRAEL ttrigor@tx.technion.ac.il Abstract This paper considers spatial

More information

Learning Structural Correspondences Across Different Linguistic Domains with Synchronous Neural Language Models

Learning Structural Correspondences Across Different Linguistic Domains with Synchronous Neural Language Models Learning Structural Correspondences Across Different Linguistic Domains with Synchronous Neural Language Models Stephan Gouws and GJ van Rooyen MIH Medialab, Stellenbosch University SOUTH AFRICA {stephan,gvrooyen}@ml.sun.ac.za

More information

The dilemma of Saussurean communication

The dilemma of Saussurean communication ELSEVIER BioSystems 37 (1996) 31-38 The dilemma of Saussurean communication Michael Oliphant Deparlment of Cognitive Science, University of California, San Diego, CA, USA Abstract A Saussurean communication

More information

Kelli Allen. Vicki Nieter. Jeanna Scheve. Foreword by Gregory J. Kaiser

Kelli Allen. Vicki Nieter. Jeanna Scheve. Foreword by Gregory J. Kaiser Kelli Allen Jeanna Scheve Vicki Nieter Foreword by Gregory J. Kaiser Table of Contents Foreword........................................... 7 Introduction........................................ 9 Learning

More information

ATENEA UPC AND THE NEW "Activity Stream" or "WALL" FEATURE Jesus Alcober 1, Oriol Sánchez 2, Javier Otero 3, Ramon Martí 4

ATENEA UPC AND THE NEW Activity Stream or WALL FEATURE Jesus Alcober 1, Oriol Sánchez 2, Javier Otero 3, Ramon Martí 4 ATENEA UPC AND THE NEW "Activity Stream" or "WALL" FEATURE Jesus Alcober 1, Oriol Sánchez 2, Javier Otero 3, Ramon Martí 4 1 Universitat Politècnica de Catalunya (Spain) 2 UPCnet (Spain) 3 UPCnet (Spain)

More information

Word Segmentation of Off-line Handwritten Documents

Word Segmentation of Off-line Handwritten Documents Word Segmentation of Off-line Handwritten Documents Chen Huang and Sargur N. Srihari {chuang5, srihari}@cedar.buffalo.edu Center of Excellence for Document Analysis and Recognition (CEDAR), Department

More information

Scenario Design for Training Systems in Crisis Management: Training Resilience Capabilities

Scenario Design for Training Systems in Crisis Management: Training Resilience Capabilities Scenario Design for Training Systems in Crisis Management: Training Resilience Capabilities Amy Rankin 1, Joris Field 2, William Wong 3, Henrik Eriksson 4, Jonas Lundberg 5 Chris Rooney 6 1, 4, 5 Department

More information

The Good Judgment Project: A large scale test of different methods of combining expert predictions

The Good Judgment Project: A large scale test of different methods of combining expert predictions The Good Judgment Project: A large scale test of different methods of combining expert predictions Lyle Ungar, Barb Mellors, Jon Baron, Phil Tetlock, Jaime Ramos, Sam Swift The University of Pennsylvania

More information

AMULTIAGENT system [1] can be defined as a group of

AMULTIAGENT system [1] can be defined as a group of 156 IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS PART C: APPLICATIONS AND REVIEWS, VOL. 38, NO. 2, MARCH 2008 A Comprehensive Survey of Multiagent Reinforcement Learning Lucian Buşoniu, Robert Babuška,

More information

What s in a Step? Toward General, Abstract Representations of Tutoring System Log Data

What s in a Step? Toward General, Abstract Representations of Tutoring System Log Data What s in a Step? Toward General, Abstract Representations of Tutoring System Log Data Kurt VanLehn 1, Kenneth R. Koedinger 2, Alida Skogsholm 2, Adaeze Nwaigwe 2, Robert G.M. Hausmann 1, Anders Weinstein

More information

How to Judge the Quality of an Objective Classroom Test

How to Judge the Quality of an Objective Classroom Test How to Judge the Quality of an Objective Classroom Test Technical Bulletin #6 Evaluation and Examination Service The University of Iowa (319) 335-0356 HOW TO JUDGE THE QUALITY OF AN OBJECTIVE CLASSROOM

More information

Accelerated Learning Course Outline

Accelerated Learning Course Outline Accelerated Learning Course Outline Course Description The purpose of this course is to make the advances in the field of brain research more accessible to educators. The techniques and strategies of Accelerated

More information

Notes on The Sciences of the Artificial Adapted from a shorter document written for course (Deciding What to Design) 1

Notes on The Sciences of the Artificial Adapted from a shorter document written for course (Deciding What to Design) 1 Notes on The Sciences of the Artificial Adapted from a shorter document written for course 17-652 (Deciding What to Design) 1 Ali Almossawi December 29, 2005 1 Introduction The Sciences of the Artificial

More information

AUTOMATED TROUBLESHOOTING OF MOBILE NETWORKS USING BAYESIAN NETWORKS

AUTOMATED TROUBLESHOOTING OF MOBILE NETWORKS USING BAYESIAN NETWORKS AUTOMATED TROUBLESHOOTING OF MOBILE NETWORKS USING BAYESIAN NETWORKS R.Barco 1, R.Guerrero 2, G.Hylander 2, L.Nielsen 3, M.Partanen 2, S.Patel 4 1 Dpt. Ingeniería de Comunicaciones. Universidad de Málaga.

More information

BENCHMARK TREND COMPARISON REPORT:

BENCHMARK TREND COMPARISON REPORT: National Survey of Student Engagement (NSSE) BENCHMARK TREND COMPARISON REPORT: CARNEGIE PEER INSTITUTIONS, 2003-2011 PREPARED BY: ANGEL A. SANCHEZ, DIRECTOR KELLI PAYNE, ADMINISTRATIVE ANALYST/ SPECIALIST

More information

Chamilo 2.0: A Second Generation Open Source E-learning and Collaboration Platform

Chamilo 2.0: A Second Generation Open Source E-learning and Collaboration Platform Chamilo 2.0: A Second Generation Open Source E-learning and Collaboration Platform doi:10.3991/ijac.v3i3.1364 Jean-Marie Maes University College Ghent, Ghent, Belgium Abstract Dokeos used to be one of

More information

AUTOMATIC DETECTION OF PROLONGED FRICATIVE PHONEMES WITH THE HIDDEN MARKOV MODELS APPROACH 1. INTRODUCTION

AUTOMATIC DETECTION OF PROLONGED FRICATIVE PHONEMES WITH THE HIDDEN MARKOV MODELS APPROACH 1. INTRODUCTION JOURNAL OF MEDICAL INFORMATICS & TECHNOLOGIES Vol. 11/2007, ISSN 1642-6037 Marek WIŚNIEWSKI *, Wiesława KUNISZYK-JÓŹKOWIAK *, Elżbieta SMOŁKA *, Waldemar SUSZYŃSKI * HMM, recognition, speech, disorders

More information

Utilizing Soft System Methodology to Increase Productivity of Shell Fabrication Sushant Sudheer Takekar 1 Dr. D.N. Raut 2

Utilizing Soft System Methodology to Increase Productivity of Shell Fabrication Sushant Sudheer Takekar 1 Dr. D.N. Raut 2 IJSRD - International Journal for Scientific Research & Development Vol. 2, Issue 04, 2014 ISSN (online): 2321-0613 Utilizing Soft System Methodology to Increase Productivity of Shell Fabrication Sushant

More information

Learning Prospective Robot Behavior

Learning Prospective Robot Behavior Learning Prospective Robot Behavior Shichao Ou and Rod Grupen Laboratory for Perceptual Robotics Computer Science Department University of Massachusetts Amherst {chao,grupen}@cs.umass.edu Abstract This

More information

Final Teach For America Interim Certification Program

Final Teach For America Interim Certification Program Teach For America Interim Certification Program Program Rubric Overview The Teach For America (TFA) Interim Certification Program Rubric was designed to provide formative and summative feedback to TFA

More information

Data Fusion Models in WSNs: Comparison and Analysis

Data Fusion Models in WSNs: Comparison and Analysis Proceedings of 2014 Zone 1 Conference of the American Society for Engineering Education (ASEE Zone 1) Data Fusion s in WSNs: Comparison and Analysis Marwah M Almasri, and Khaled M Elleithy, Senior Member,

More information

Virtual Teams: The Design of Architecture and Coordination for Realistic Performance and Shared Awareness

Virtual Teams: The Design of Architecture and Coordination for Realistic Performance and Shared Awareness Virtual Teams: The Design of Architecture and Coordination for Realistic Performance and Shared Awareness Bryan Moser, Global Project Design John Halpin, Champlain College St. Lawrence Introduction Global

More information

Visit us at:

Visit us at: White Paper Integrating Six Sigma and Software Testing Process for Removal of Wastage & Optimizing Resource Utilization 24 October 2013 With resources working for extended hours and in a pressurized environment,

More information

EECS 571 PRINCIPLES OF REAL-TIME COMPUTING Fall 10. Instructor: Kang G. Shin, 4605 CSE, ;

EECS 571 PRINCIPLES OF REAL-TIME COMPUTING Fall 10. Instructor: Kang G. Shin, 4605 CSE, ; EECS 571 PRINCIPLES OF REAL-TIME COMPUTING Fall 10 Instructor: Kang G. Shin, 4605 CSE, 763-0391; kgshin@umich.edu Number of credit hours: 4 Class meeting time and room: Regular classes: MW 10:30am noon

More information

COMPUTER-AIDED DESIGN TOOLS THAT ADAPT

COMPUTER-AIDED DESIGN TOOLS THAT ADAPT COMPUTER-AIDED DESIGN TOOLS THAT ADAPT WEI PENG CSIRO ICT Centre, Australia and JOHN S GERO Krasnow Institute for Advanced Study, USA 1. Introduction Abstract. This paper describes an approach that enables

More information

Device Independence and Extensibility in Gesture Recognition

Device Independence and Extensibility in Gesture Recognition Device Independence and Extensibility in Gesture Recognition Jacob Eisenstein, Shahram Ghandeharizadeh, Leana Golubchik, Cyrus Shahabi, Donghui Yan, Roger Zimmermann Department of Computer Science University

More information

Inside the mind of a learner

Inside the mind of a learner Inside the mind of a learner - Sampling experiences to enhance learning process INTRODUCTION Optimal experiences feed optimal performance. Research has demonstrated that engaging students in the learning

More information