To appear in: Advances in Neural Information Processing Systems 3, Touretzky, D.S., Lippmann, R. (eds.), San Mateo, CA: Morgan Kaufmann

Planning with an Adaptive World Model

Sebastian B. Thrun
German National Research Center for Computer Science (GMD)
D-55 St. Augustin, FRG

Knut Möller
University of Bonn
Department of Computer Science
D-5 Bonn, FRG

Alexander Linden
German National Research Center for Computer Science (GMD)
D-55 St. Augustin, FRG

Abstract

We present a new connectionist planning method [TML90]. By interaction with an unknown environment, a world model is progressively constructed using gradient descent. For deriving optimal actions with respect to future reinforcement, planning is applied in two steps: an experience network proposes a plan, which is subsequently optimized by gradient descent with a chain of world models, so that an optimal reinforcement may be obtained when it is actually run. The appropriateness of this method is demonstrated by a robotics application and a pole balancing task.

1 INTRODUCTION

Whenever decisions are to be made with respect to events in the future, planning has proved to be an important and powerful concept in problem solving. Planning is applicable if an autonomous agent interacts with a world, and if a reinforcement is available which measures only the over-all performance of the agent. The problem of optimizing actions then yields the temporal credit assignment problem [Sut84], i.e. the problem of assigning particular reinforcements to particular actions in the past. The problem becomes more complicated if no knowledge about the world is available in advance.

Many connectionist approaches so far solve this problem directly, using techniques based on the interaction of an adaptive world model and an adaptive controller [Bar89, Jor89, Mun87]. Although such controllers are very fast after training, training itself is rather complex, mainly for two reasons: a) Since the future is not considered explicitly, future effects must be directly encoded into the world model, which complicates model training. b) Since the controller is trained with the world model, training of the former lags behind that of the latter.

Moreover, if there exist several optimal actions, such controllers will generate at most one of them, regardless of all others, since they represent many-to-one functions. For example, changing the objective function implies the need for expensive retraining.

In order to overcome these problems, we applied a planning technique to reinforcement learning problems. A model network which approximates the behavior of the world is used for looking ahead into the future and optimizing actions by gradient descent with respect to future reinforcement. In addition, an experience network is trained in order to accelerate and improve planning.

2 LOOK-AHEAD PLANNING

2.1 SYSTEM IDENTIFICATION

Planning needs a world model. Training of the world model is adopted from [Bar89, Jor89, Mun87]. Formally, the world maps actions to subsequent states and reinforcements (Fig. 1). The world model used here is a standard non-recurrent or a recurrent connectionist network which is trained by backpropagation or related gradient descent algorithms [WZ88, TS90]. Each time an action is performed on the world, the resulting state and reinforcement are compared with the corresponding predictions of the model network. The difference is used for adapting the internal parameters of the model in small steps, in order to improve its accuracy. The resulting model approximates the world's behavior.

Figure 1: The training of the model network is a system identification task. Internal parameters are estimated by gradient descent, e.g. by backpropagation. (The figure shows the world and the model network both receiving action(t) and state(t); the world returns state(t+1) and reinforcement(t+1), which are compared with the predicted state(t+1) and predicted reinforcement(t+1) to yield error gradients.)

Our planning technique relies mainly on two fundamental steps: firstly, a plan is proposed either by some heuristic or by a so-called experience network; secondly, this plan is optimized progressively by gradient descent in action space. We will consider the second step first.

2.2 PLAN OPTIMIZATION

In this section we show the optimization of plans by means of gradient descent. For that purpose, let us assume an initial plan, i.e. a sequence of N actions, is given. The first action of this plan, together with the current state (and, in the case of a recurrent model network, its current context activations), is fed into the model network (Fig. 2). This gives us a prediction for the subsequent state and reinforcement of the world. If we assume that the state prediction is a good estimate of the next state, we can proceed by predicting the immediate next state and reinforcement from the second action of the plan correspondingly. This procedure is repeated for each of the N stages of the plan. The final output is a sequence of N reinforcement predictions, which represents the quality of the plan.
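The look-ahead step can be pictured in a few lines of code. The following is a minimal NumPy sketch, not taken from the paper: a non-recurrent model network with one logistic hidden layer, a single system-identification training step (Sec. 2.1), and the N-step roll-out through the chain of model copies (Sec. 2.2). All names (WorldModel, train_step, rollout), the network sizes, and the random data are our own illustrative assumptions; in the paper the model network is trained on real interaction data and may be recurrent.

    import numpy as np

    rng = np.random.default_rng(0)

    def logistic(x):
        return 1.0 / (1.0 + np.exp(-x))

    class WorldModel:
        """Hypothetical stand-in for the paper's model network:
        (state, action) -> (predicted next state, predicted reinforcement)."""

        def __init__(self, state_dim, action_dim, hidden=16):
            in_dim = state_dim + action_dim
            self.W1 = rng.normal(0.0, 0.1, (hidden, in_dim))
            self.b1 = np.zeros(hidden)
            self.W2 = rng.normal(0.0, 0.1, (state_dim + 1, hidden))  # last output = reinforcement
            self.b2 = np.zeros(state_dim + 1)
            self.state_dim = state_dim

        def forward(self, state, action):
            x = np.concatenate([state, action])
            h = logistic(self.W1 @ x + self.b1)
            y = self.W2 @ h + self.b2            # linear output units
            return y[:self.state_dim], y[self.state_dim], (x, h, y)

        def train_step(self, state, action, next_state, reinf, lr=0.05):
            """One gradient-descent step on the squared prediction error
            (system identification, cf. Fig. 1)."""
            _, _, (x, h, y) = self.forward(state, action)
            target = np.concatenate([next_state, [reinf]])
            err = y - target                     # d(0.5*||err||^2)/dy
            dW2 = np.outer(err, h)
            db2 = err
            dz1 = (self.W2.T @ err) * h * (1.0 - h)   # logistic derivative
            dW1 = np.outer(dz1, x)
            db1 = dz1
            self.W1 -= lr * dW1; self.b1 -= lr * db1
            self.W2 -= lr * dW2; self.b2 -= lr * db2
            return 0.5 * float(err @ err)

    def rollout(model, state, plan):
        """Chain of model copies (Fig. 2): feed the plan through the model,
        treating each predicted state as the input of the next stage."""
        reinf_predictions = []
        s = state
        for action in plan:                      # N stages
            s, r, _ = model.forward(s, action)
            reinf_predictions.append(r)
        return np.array(reinf_predictions)       # the "quality" of the plan

    # Purely illustrative usage with random data:
    model = WorldModel(state_dim=4, action_dim=2)
    plan = [rng.normal(size=2) for _ in range(5)]      # N = 5 actions
    print(rollout(model, rng.normal(size=4), plan))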

In order to maximize reinforcement, we establish a differentiable reinforcement energy function E_reinf, which measures the deviation of predicted and desired reinforcement. The problem of optimizing plans is thus transformed into the problem of minimizing E_reinf. Since both E_reinf and the chain of model networks are differentiable, the gradients of the plan with respect to E_reinf can be computed. These gradients are used for changing the plan in small steps, which completes the gradient descent optimization. The whole update procedure is repeated either until convergence is observed or, which makes it more convenient for real-time applications, for a predefined number of iterations; note that in the latter case the computational effort is linear in N. From the planning procedure we obtain the optimized plan, the first action of which is then performed on the world. (If an unknown world is to be explored, this action might be disturbed by adding a small random variable.) Now the whole procedure is repeated.

Figure 2: Looking ahead by the chain of model networks. (The figure shows the world together with a chain of model networks 1 to N; the first model network receives the state, context units for recurrent networks only, and the 1st action of the plan, the following model networks receive the 2nd to Nth actions; the predicted reinforcements of all stages feed the reinforcement energy E_reinf, and the optimized plan is the planning result.)

The gradients of the plan with respect to E_reinf can be computed either by backpropagation through the chain of models or by a feed-forward algorithm which is related to [WZ88, TS90]: hand in hand with the activations we also propagate the gradients

    \gamma^j_{is}(\tau) \;\equiv\; \frac{\partial\,\mathrm{activation}_j(\tau)}{\partial\,\mathrm{action}_i(s)}                          (1)

through the chain of models. Here i labels all action input units and j all units of the whole model network, \tau (1 \le \tau \le N) is the time associated with the \tau-th model of the chain, and s (s \le \tau) is the time of the s-th action. Thus, for each action (\forall i, s) its influence on later activations (\forall j, \forall \tau \ge s) of the chain of networks, including all predictions, is measured by \gamma^j_{is}(\tau). It has been shown in an earlier paper that this gradient can easily be propagated forward through the network [TML90]:

    \gamma^j_{is}(\tau) \;=\;
    \begin{cases}
        \delta_{ij}\,\delta_{\tau s} & \text{if } j \text{ action input unit} \\
        0 & \text{if } \tau = s \ \wedge\ j \text{ state/context input unit} \\
        \gamma^{j'}_{is}(\tau-1) & \text{if } \tau > s \ \wedge\ j \text{ state/context input unit} \\
        & \quad (j' \text{ corresponding output unit of the preceding model}) \\
        \mathrm{logistic}'(\mathrm{net}_j(\tau)) \displaystyle\sum_{l \in \mathrm{pred}(j)} \mathrm{weight}_{jl}\,\gamma^l_{is}(\tau) & \text{otherwise}
    \end{cases}          (2)
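To make the forward propagation of Eq. (2) concrete, here is a small, self-contained NumPy sketch (our own illustration, not the paper's code) in matrix form: for a one-hidden-layer logistic model, the derivative of each stage's outputs with respect to its inputs is a Jacobian, and chaining these Jacobians forward along the plan yields the derivative of every predicted reinforcement with respect to every earlier action, which is the role of the gradients \gamma^j_{is}(\tau) above. The function names, network sizes, and random weights are assumptions; the weights stand in for a trained model network.

    import numpy as np

    rng = np.random.default_rng(1)

    def logistic(x):
        return 1.0 / (1.0 + np.exp(-x))

    STATE_DIM, ACTION_DIM, HIDDEN = 4, 2, 16
    # Random weights standing in for a trained model network.
    W1 = rng.normal(0.0, 0.1, (HIDDEN, STATE_DIM + ACTION_DIM)); b1 = np.zeros(HIDDEN)
    W2 = rng.normal(0.0, 0.1, (STATE_DIM + 1, HIDDEN));          b2 = np.zeros(STATE_DIM + 1)

    def model_step(state, action):
        """One model network of the chain: next-state and reinforcement
        predictions plus the Jacobian of the outputs w.r.t. the inputs."""
        x = np.concatenate([state, action])
        h = logistic(W1 @ x + b1)
        y = W2 @ h + b2
        J = W2 @ (W1 * (h * (1.0 - h))[:, None])   # dy/dx = W2 diag(h(1-h)) W1
        return y[:STATE_DIM], y[STATE_DIM], J[:, :STATE_DIM], J[:, STATE_DIM:]

    def reinforcement_gradients(state, plan):
        """Forward-mode analogue of Eq. (2): chain per-stage Jacobians to get
        d reinf(t) / d action(s) for all s <= t, along with the predictions."""
        N = len(plan)
        Js, Ja, reinfs = [], [], []
        s_cur = state
        for a in plan:
            s_cur, r, J_s, J_a = model_step(s_cur, a)
            Js.append(J_s); Ja.append(J_a); reinfs.append(r)
        grads = {}
        for s in range(N):
            grads[(s, s)] = Ja[s][STATE_DIM, :]    # d reinf(s) / d action(s)
            G = Ja[s][:STATE_DIM, :]               # d state(s+1) / d action(s)
            for t in range(s + 1, N):
                grads[(t, s)] = Js[t][STATE_DIM, :] @ G
                G = Js[t][:STATE_DIM, :] @ G       # propagate to the next stage
        return np.array(reinfs), grads

    reinfs, grads = reinforcement_gradients(rng.normal(size=STATE_DIM),
                                            [rng.normal(size=ACTION_DIM) for _ in range(4)])
    print(reinfs, grads[(3, 0)])   # influence of the 1st action on the 4th prediction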

The reinforcement energy to be minimized is defined as

    E_{\mathrm{reinf}} \;=\; \frac{1}{2} \sum_{\tau=1}^{N} \sum_{k} g_k(\tau)\,\bigl(\mathrm{reinf}_k - \mathrm{activation}_k(\tau)\bigr)^2                    (3)

(k numbers the reinforcement output units, reinf_k is the desired reinforcement value, usually \forall k: reinf_k = 1, and g_k(\tau) weights the reinforcement with respect to \tau and k, in the simplest case g_k(\tau) \equiv 1). Since E_reinf is differentiable, we can compute the gradient of E_reinf with respect to each particular reinforcement prediction. From these gradients and the gradients \gamma^k_{is}(\tau) of the reinforcement prediction units, the gradients

    \frac{\partial E_{\mathrm{reinf}}}{\partial\,\mathrm{action}_i(s)} \;=\; -\sum_{\tau=s}^{N} \sum_{k} g_k(\tau)\,\bigl(\mathrm{reinf}_k - \mathrm{activation}_k(\tau)\bigr)\,\gamma^k_{is}(\tau)                    (4)

are derived, which indicate how to change the plan in order to minimize E_reinf.

Variable plan lengths: The feed-forward manner of the propagation makes it possible to vary the number of look-ahead steps according to the current accuracy of the model network. Intuitively, if a model network has a relatively large error, looking far into the future makes little sense. A good heuristic is to avoid further look-ahead if the current linear error (due to the training patterns) of the model network is larger than the effect of the first action of the plan on the current predictions. This effect is exactly the gradient \gamma^k_{i1}(\tau). Using variable plan lengths might overcome the difficulty of finding an appropriate plan length N a priori.

2.3 INITIAL PLANS - THE EXPERIENCE NETWORK

It remains to show how to obtain initial plans. There are several basic strategies which are more or less problem-dependent, e.g. random, average over previous actions, etc. Obviously, if some planning took place before, the problem of finding an initial plan reduces to the problem of finding a single action, since the rest of the previous plan is a good candidate for the next initial plan. A good way of finding this action is the experience network. This network is trained to predict the result of the planning procedure by observing the world's state and, in the case of recurrent networks, the temporal context information from the model network. The target values are the results of the planning procedure. Although the experience network is trained like a controller [Bar89], it is used in a different way, since its output actions are further optimized by the planning procedure. Thus, even if the knowledge of the experience network lags behind the model network's, the derived actions are optimized with respect to the "knowledge" of the model network rather than that of the experience network. On the other hand, as the optimization is gradually shifted into the experience network, planning can be progressively shortened.
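Putting Sections 2.1-2.3 together, one planning/acting cycle might be sketched as follows. This is our own illustrative Python, not the paper's implementation: the world model is stubbed by a fixed random map, the analytic gradients of Eqs. (2)-(4) are replaced by central finite differences for brevity, and all names, sizes, and step sizes are assumptions.

    import numpy as np

    rng = np.random.default_rng(2)
    STATE_DIM, ACTION_DIM, N = 4, 2, 4

    # Fixed random map standing in for a trained model network.
    A = rng.normal(0.0, 0.3, (STATE_DIM, STATE_DIM))
    B = rng.normal(0.0, 0.3, (STATE_DIM, ACTION_DIM))
    w = rng.normal(0.0, 0.3, STATE_DIM)

    def model(state, action):
        """(state, action) -> (predicted next state, predicted reinforcement in (0, 1))."""
        next_state = np.tanh(A @ state + B @ action)
        reinf = 1.0 / (1.0 + np.exp(-(w @ next_state)))
        return next_state, reinf

    def e_reinf(state, plan, g):
        """Eq. (3) with desired reinforcement 1: weighted squared deviation
        between desired and predicted reinforcement along the chain."""
        E, s = 0.0, state
        for tau, a in enumerate(plan):
            s, r = model(s, a)
            E += 0.5 * g[tau] * (1.0 - r) ** 2
        return E

    def plan_gradient(state, plan, g, eps=1e-5):
        """Numerical stand-in for Eqs. (2)-(4): central differences of E_reinf
        with respect to every action component of the plan."""
        grad = [np.zeros_like(a) for a in plan]
        for s_idx in range(len(plan)):
            for i in range(ACTION_DIM):
                plus = [p.copy() for p in plan];  plus[s_idx][i] += eps
                minus = [p.copy() for p in plan]; minus[s_idx][i] -= eps
                grad[s_idx][i] = (e_reinf(state, plus, g) - e_reinf(state, minus, g)) / (2 * eps)
        return grad

    def optimize_plan(state, plan, g, iterations=20, stepsize=0.1):
        """Gradient descent in action space; a fixed number of iterations keeps
        the computational effort linear in N, as noted in Sec. 2.2."""
        for _ in range(iterations):
            grad = plan_gradient(state, plan, g)
            plan = [a - stepsize * da for a, da in zip(plan, grad)]
        return plan

    g = np.ones(N)                                        # simplest case g_k(tau) = 1
    state = rng.normal(size=STATE_DIM)
    plan = [rng.normal(scale=0.1, size=ACTION_DIM) for _ in range(N)]   # initial plan
    plan = optimize_plan(state, plan, g)
    first_action = plan[0]                                # performed on the world
    # Next initial plan: previous plan shifted by one step; the missing last
    # action would come from the experience network (Sec. 2.3).
    next_plan = plan[1:] + [plan[-1].copy()]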

3 APPROACHING A ROLLING BALL WITH A ROBOT ARM

We applied planning with an adaptive world model to a simulation of a real-time robotics task: a robot arm in 3-dimensional space was to approach a rolling ball. Both the hand position (i.e. x, y, z and hand angle) and the ball position (i.e. x, y) were observed by a camera system in workspace. Conversely, actions were defined as angular changes of the robot joints in configuration space. Model and experience networks are shown in Fig. 3a. Note that the ball movement was predicted by a recurrent Elman-type network, since only the current ball position was visible at any time. The arm prediction is mathematically more sophisticated, because kinematics and inverse kinematics are required to solve it analytically.

Figure 3: (a) The recurrent model network (white) and the experience network (grey) for the robotics task; the model network maps hand position (workspace), ball position, a context layer, and the action (configuration space) to hand, ball, and reinforcement predictions, while the experience network proposes plans. (b) Planning in X-Y space (H: current hand position, B: current ball position, B^: previous ball position): starting with the initial plan, the approximation finally leads to the resulting plan. The first action of this plan is then performed on the world.

The reason why planning makes sense for this task is that we did not want the robot arm to minimize the distance between hand and ball at each step; this would obviously yield trajectories in which the hand follows the ball, e.g.:

Figure 4: Basic strategy, the arm "follows" the ball (the plot shows the robot arm trajectory together with the initial hand position and the initial ball position).

Instead, we wanted the system to find short cuts by making predictions about the ball's next movement. Thus, the reinforcement measured the distance in workspace.
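In code, the workspace reinforcement and the exponential weighting g_k(\tau) used in the planning example below might look roughly as follows. The functional forms and constants are purely our own assumptions: the paper only states that the reinforcement measured the workspace distance and that an exponential weighting is crucial for minimizing later distances rather than sooner ones.

    import numpy as np

    def reinforcement(hand_pos, ball_pos):
        """Illustrative workspace reinforcement: close to 1 when the predicted
        hand position is near the predicted ball position, close to 0 otherwise."""
        return float(np.exp(-np.linalg.norm(np.asarray(hand_pos) - np.asarray(ball_pos))))

    def g(tau, N, base=2.0):
        """Hypothetical exponential weighting g_k(tau) emphasizing later
        look-ahead steps over sooner ones."""
        return base ** (tau - N)

    N = 4
    print([g(tau, N) for tau in range(1, N + 1)])   # [0.125, 0.25, 0.5, 1.0]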

Fig. 3b illustrates a "typical" planning process with look-ahead N = 4, 9 iterations, an exponentially growing weighting g_k(\tau) (c.f. (3)), a weighted stepsize, and well-trained model and experience networks. (This exponential function is crucial for minimizing later distances rather than sooner ones.) Starting with an initial plan proposed by the experience network, the optimization led to the resulting plan. It is easy to see that the resulting plan surpassed the initial one, which demonstrates the appropriateness of the optimization. The final trajectory was:

Figure 5: Planning: the arm finds the short cut (the plot shows the robot arm trajectory together with the initial hand position and the initial ball position).

We were now interested in modifying the behavior of the arm. Without further learning of either the model or the experience network, we wanted the arm to approach the ball from above. For this purpose we changed the energy function (c.f. (3)): before, the arm was to approach the ball; now the energy was minimal if the arm reached a position exactly above the ball. Since the experience network was not trained for that task, we doubled the number of iteration steps. This led to:

Figure 6: The arm approaches from above due to a modified energy function.

A first implementation on a real robot arm with a camera system showed similar results.

4 POLE BALANCING

Next, we applied our planning method to the pole balancing task adopted from [And89]. One main difference from the task described above is the fact that gradient descent is not applicable with binary reinforcement, since the better the approximation by the world model, the more the gradients vanish. This effect can be prevented by using a second model network with weight decay, which is trained on the same training patterns. Weight decay smoothes the binary mapping. By using the plain model network for prediction only and the smoothed network for gradient propagation, the pole balancing problem became solvable. We see this as a general technique for applying gradient descent to binary reinforcement tasks.
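The weight-decay trick can be illustrated with a small, self-contained sketch. This is our own construction, not the paper's code, and it uses a 1-D logistic model in place of the paper's model networks: two models are trained on the same binary reinforcement data, one of them with L2 weight decay. The undamped model is used for prediction, while the gradient with respect to the action is taken from the weight-decayed (smoothed) model, whose output does not saturate as sharply and therefore still provides a usable gradient.

    import numpy as np

    rng = np.random.default_rng(3)

    def logistic(x):
        return 1.0 / (1.0 + np.exp(-x))

    def train_logistic(X, y, weight_decay=0.0, lr=0.5, epochs=2000):
        """Logistic model of a binary reinforcement signal, trained by gradient
        descent on the squared error, optionally with L2 weight decay."""
        w = np.zeros(X.shape[1])
        b = 0.0
        for _ in range(epochs):
            p = logistic(X @ w + b)
            err = p - y
            grad_w = X.T @ (err * p * (1.0 - p)) / len(y) + weight_decay * w
            grad_b = np.mean(err * p * (1.0 - p))
            w -= lr * grad_w
            b -= lr * grad_b
        return w, b

    # Toy data: the "reinforcement" is 1 for actions that keep the pole up and
    # 0 otherwise; here it is simply a threshold on a 1-D action.
    actions = rng.uniform(-2.0, 2.0, size=(200, 1))
    reinf = (actions[:, 0] > 0.0).astype(float)

    w_plain, b_plain = train_logistic(actions, reinf)                       # prediction model
    w_smooth, b_smooth = train_logistic(actions, reinf, weight_decay=0.05)  # smoothed model

    def predicted_reinf(a):
        return logistic(w_plain @ a + b_plain)         # plain model: prediction only

    def d_reinf_d_action(a):
        """Gradient for planning, taken from the weight-decayed (smoothed) model."""
        p = logistic(w_smooth @ a + b_smooth)
        return p * (1.0 - p) * w_smooth

    a = np.array([-1.5])
    print(predicted_reinf(a), d_reinf_d_action(a))
    # For comparison, the plain model's own input gradient p*(1-p)*w_plain is
    # typically much smaller at actions far from the decision boundary, because
    # its nearly binary output saturates there.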

We were especially interested in the dependency between the look-ahead and the duration of balancing. It turned out that in most randomly chosen initial configurations of pole and cart the look-ahead N = 4 was sufficient to balance the pole for more than ... steps. If the cart is moved randomly, the pole falls after ... movements on average.

5 DISCUSSION

The planning procedure presented in this paper has two crucial limitations. By using a bounded look-ahead, effects of actions on reinforcement beyond this bound cannot be taken into account. Even if the plan lengths are kept variable (as described above), each particular planning process must use a finite plan. Moreover, using gradient descent as a search heuristic implies the danger of getting stuck in local minima. It might be interesting to investigate other search heuristics.

On the other hand, this planning algorithm overcomes certain problems of adaptive controller networks, namely: a) The training is relatively fast, since the model network does not include temporal effects. b) Decisions are optimized with respect to the current "knowledge" in the system, and no controller lags behind the model network. c) The incorporation of additional constraints into the objective function at runtime is possible, as demonstrated. d) By using a probabilistic experience network the planning algorithm is able to act as a non-deterministic many-to-many controller. However, we have not investigated the latter point yet.

Acknowledgements

The authors thank Jörg Kindermann and Frank Smieja for many fruitful discussions, and Michael Contzen and Michael Faßbender for their help with the robot arm.

References

[And89] C. W. Anderson. Learning to control an inverted pendulum using neural networks. IEEE Control Systems Magazine, 9(3):31-37, 1989.

[Bar89] A. G. Barto. Connectionist learning for control: An overview. Technical Report COINS TR 89-89, Dept. of Computer and Information Science, University of Massachusetts, Amherst, MA, September 1989.

[Jor89] M. I. Jordan. Generic constraints on unspecified target constraints. In Proceedings of the First International Joint Conference on Neural Networks, Washington, DC, San Diego, 1989. IEEE TAB NN Committee.

[Mun87] P. Munro. A dual backpropagation scheme for scalar-reward learning. In Ninth Annual Conference of the Cognitive Science Society, pages 65-76, Hillsdale, NJ, 1987. Cognitive Science Society, Lawrence Erlbaum.

[Sut84] R. S. Sutton. Temporal Credit Assignment in Reinforcement Learning. PhD thesis, University of Massachusetts, 1984.

[TML90] S. Thrun, K. Möller, and A. Linden. Adaptive look-ahead planning. In G. Dorffner, editor, Proceedings KONNAI/OEGAI. Springer, September 1990.

[TS90] S. Thrun and F. Smieja. A general feed-forward algorithm for gradient descent in connectionist networks. TR 48, GMD, FRG, November 1990.

[WZ88] R. J. Williams and D. Zipser. A learning algorithm for continually running fully recurrent neural networks. ICS Report 8805, Institute for Cognitive Science, University of California, San Diego, CA, 1988.