Multilayer Perceptrons with Radial Basis Functions as Value Functions in Reinforcement Learning

Victor Uc Cetina
Humboldt University of Berlin - Department of Computer Science
Unter den Linden 6, 10099 Berlin - Germany

Abstract. Using multilayer perceptrons (MLPs) to approximate the state-action value function in reinforcement learning (RL) algorithms can become a nightmare due to the constant possibility of unlearning past experiences. Moreover, since the target values in the training examples are bootstrap values, that is, estimates of other estimates, the chances of getting stuck in a local minimum increase. These problems occur very often in the mountain car task, as shown by Boyan and Moore [2]. In this paper we present empirical evidence showing that MLPs augmented with one layer of radial basis functions (RBFs) can avoid these problems. Our experimental testbeds are the mountain car task and a robot control problem.

1 Introduction

Reinforcement learning [9] is a very appealing artificial intelligence approach to the machine learning problem. The idea of programming a computational system in such a way that it can improve its performance through repeated interaction with the environment is certainly attractive. In relatively small problems with discrete state and action spaces, a lookup table together with algorithms like TD(λ) [8], Q-learning [12] or Sarsa [10] should be enough to obtain optimal results. Of course, we need to find a good set of parameters and allow for enough training episodes. The challenging part of reinforcement learning comes when we try to solve more complicated problems involving continuous spaces, and particularly high-dimensional ones. A lookup table is then no longer sufficient to represent the value function, and we need to approximate it somehow. At this point we have to decide between a linear and a non-linear method. Linear methods such as cerebellar model articulation controllers (CMACs) [10, 5] and RBF networks of Gaussian functions [1, 3] are by far the most commonly recommended methods for RL, primarily because they are localised function approximators and are therefore less affected by the unlearning problem. Kretchmar and Anderson [4] studied the similarities and differences between CMACs and RBFs with Q-learning applied to the mountain car task. Another option worth mentioning is the use of regression trees, as in the method proposed by Wang and Dietterich [11], although its applicability is limited for tasks where incremental learning is required.

In this paper we present experimental results showing how a non-linear function approximator, the MLP augmented with an RBF layer, can become a good choice for representing the state-action value function in RL problems with continuous state spaces and high dimensionality. We tested this approach on the mountain car task, which is well known as a tricky control problem, especially for neural networks, as demonstrated by Boyan and Moore [2]. We also experimented with the dribbling problem in the framework of the RoboCup competitions.

The rest of this paper is organized as follows. In Section 2 we present the Sarsa algorithm and the learning structure we propose to approximate the value function. In Sections 3 and 4 we describe the experiments performed with the mountain car task and the dribbling problem, respectively. Finally, we present our conclusion in Section 5 and comment on our future work.

2 Algorithm and Value Function Structure

Sarsa is an on-policy temporal difference control algorithm which continually estimates the state-action value function Q^π for the behavior policy π, and at the same time changes π toward greediness with respect to Q^π [9]. In problems with a small number of state-action pairs and discrete spaces, the Q function is stored in a lookup table. However, when the number of such pairs grows, the use of lookup tables becomes impractical, or simply impossible, and we need a function approximator instead. In our case, the Q function is represented by a set of MLPs, one MLP per action. The Sarsa algorithm with the changes needed to use a set of MLPs as function approximator is presented in Algorithm 1.

Algorithm 1: Sarsa algorithm for continuous states using MLPs
 1  initialize the weight vector W_i of every MLP_i arbitrarily
 2  foreach training episode do
 3      initialize s
 4      choose a from s using the policy derived from Q
 5      repeat for each step of the episode:
 6          take action a, observe r, s'
 7          choose a' from s' using the policy derived from Q
 8          TargetQ ← MLP_a(s) + α[r + γ MLP_{a'}(s') − MLP_a(s)]
 9          train MLP_a with the example (s, TargetQ)
10          s ← s'; a ← a'
11      until s is terminal
12  end
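To make Algorithm 1 concrete, the following Python sketch implements the episode loop under a few assumptions of our own: an environment object exposing reset() and step() (returning the next state, the reward and a termination flag), and per-action approximators offering predict() and train_on_example() methods. All names are illustrative; the paper does not prescribe an API.

```python
import random

class QNet:
    """Interface assumed for each per-action approximator (e.g. the RBF-augmented
    MLP described in this paper); names are illustrative, not the authors' API."""
    def predict(self, state):                    # scalar estimate of Q(state, action)
        raise NotImplementedError
    def train_on_example(self, state, target):   # one supervised update toward target
        raise NotImplementedError

def epsilon_greedy(nets, state, epsilon):
    """Policy derived from Q (lines 4 and 7): random with probability epsilon, else greedy."""
    if random.random() < epsilon:
        return random.randrange(len(nets))
    return max(range(len(nets)), key=lambda a: nets[a].predict(state))

def sarsa_episode(env, nets, alpha, gamma, epsilon):
    """One training episode of Algorithm 1, with one approximator per action.
    env is assumed to expose reset() -> s and step(a) -> (s', r, done)."""
    state = env.reset()
    action = epsilon_greedy(nets, state, epsilon)
    done = False
    while not done:                               # repeat until s is terminal
        next_state, reward, done = env.step(action)
        next_action = epsilon_greedy(nets, next_state, epsilon)
        q_sa = nets[action].predict(state)
        q_next = 0.0 if done else nets[next_action].predict(next_state)  # zero bootstrap at terminal states
        # line 8: TargetQ <- Q(s,a) + alpha * (r + gamma * Q(s',a') - Q(s,a))
        target = q_sa + alpha * (reward + gamma * q_next - q_sa)
        nets[action].train_on_example(state, target)   # line 9: supervised step on (s, TargetQ)
        state, action = next_state, next_action        # line 10
```

The target in line 8 is simply the current estimate moved a fraction α toward the one-step Sarsa return, and line 9 regresses the chosen action's network toward that value.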

The use of MLPs as value function approximators in reinforcement learning is usually not recommended, given that they suffer from the unlearning problem and very often fall into local optima. However, if we add a layer of radial basis functions to the standard MLP, it is possible to create a semi-localised function approximator that can be used to obtain optimal policies in hard problems with continuous state spaces and high dimensionality.

Fig. 1: Multilayer perceptron with one layer of radial basis functions

The proposed MLP has 4 layers: 2 hidden layers plus the input and output layers (see Fig. 1). The number m of input units must equal the size of the feature vector that represents the current state of the environment. The first hidden layer contains k RBFs: for each input variable x_i there is a set R_i of RBFs r_ij, and the r_ij ∈ R_i should be defined to cover the range of values that x_i can take. The outputs of the RBF layer are fed into the second hidden layer, which consists of n sigmoidal units. Finally, the outputs of the second hidden layer reach the output unit. During the training stage only the connection weights between the two hidden layers, and between the second hidden layer and the output layer, are learned; the weights between the input layer and the first hidden layer are fixed to 1. Although one possibility when working with radial basis functions is to optimize their parameters with unsupervised learning methods, in the results presented here we only experimented with the number of radial basis functions needed to learn the Q value function. We used Gaussian functions of the form

    RBF(x_i) = exp( −(x_i − c_{i,j})² / (2σ_i²) ),

where the centers c_{i,j} of the b_i basis functions defined for x_i are placed at a distance dist_i from one another, with dist_i = (max(x_i) − min(x_i)) / b_i and σ_i = dist_i / 2. Comprehensive introductions to radial basis functions and their training can be found in [1, 3].

The main advantage of our topology is that it can be applied to high-dimensional state spaces without an exponential growth in the number of RBFs. That is, if we use the same number p of RBFs for each of the m input variables, we need only mp RBFs, in contrast to the p^m we would use in a straightforward implementation of RBF networks. One common option to avoid the curse of dimensionality is to group the input variables in pairs and define the number of RBFs required to cover the resulting 2-dimensional subspaces generated by each pair. However, the successful selection of the variable pairs requires some prior knowledge about the input space of the problem, or a considerable amount of experimentation instead.
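The sketch below is one possible NumPy realisation of this topology, filling in details the paper leaves open (weight initialisation, a tanh output unit, plain gradient descent on the squared error, and the exact offset of the evenly spaced centers): a fixed Gaussian RBF layer with σ_i = dist_i/2 per input variable, a trainable sigmoidal hidden layer, and a single output unit.

```python
import numpy as np

class RBFMLP:
    """MLP with one layer of per-variable Gaussian RBFs; the input-to-RBF weights
    are fixed, and only the two upper layers are trained by gradient descent."""

    def __init__(self, var_ranges, rbfs_per_var, n_hidden, lr=0.001, seed=0):
        self.centers, self.sigmas = [], []
        for lo, hi in var_ranges:                        # one group of RBFs per input variable
            dist = (hi - lo) / rbfs_per_var              # dist_i = (max - min) / b_i
            # centers spaced dist apart (the dist/2 offset is our choice), sigma_i = dist_i / 2
            self.centers.append(lo + dist * (np.arange(rbfs_per_var) + 0.5))
            self.sigmas.append(dist / 2.0)
        k = rbfs_per_var * len(var_ranges)               # total RBF units: m * p, not p**m
        rng = np.random.default_rng(seed)
        self.W1 = rng.uniform(-0.5, 0.5, (n_hidden, k))  # RBF layer -> sigmoidal layer
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.uniform(-0.5, 0.5, n_hidden)       # sigmoidal layer -> output unit
        self.b2 = 0.0
        self.lr = lr

    def _rbf(self, state):
        """Gaussian features exp(-(x_i - c_ij)^2 / (2 sigma_i^2)) for every RBF."""
        feats = [np.exp(-((x - c) ** 2) / (2.0 * s ** 2))
                 for x, c, s in zip(state, self.centers, self.sigmas)]
        return np.concatenate(feats)

    def _forward(self, state):
        phi = self._rbf(np.asarray(state, dtype=float))
        h = 1.0 / (1.0 + np.exp(-(self.W1 @ phi + self.b1)))  # sigmoidal hidden layer
        y = np.tanh(self.W2 @ h + self.b2)                     # output in (-1, 1)
        return phi, h, float(y)

    def predict(self, state):
        return self._forward(state)[2]

    def train_on_example(self, state, target):
        """One gradient-descent step on the squared error toward the Sarsa target."""
        phi, h, y = self._forward(state)
        dy = (y - target) * (1.0 - y ** 2)        # gradient at the tanh output
        dh = (dy * self.W2) * h * (1.0 - h)       # backprop through the sigmoid layer
        self.W2 -= self.lr * dy * h
        self.b2 -= self.lr * dy
        self.W1 -= self.lr * np.outer(dh, phi)
        self.b1 -= self.lr * dh
```

With p RBFs per input variable the feature vector produced by _rbf has only m·p entries, which is exactly the mp versus p^m saving discussed above.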

3 Mountain Car Problem

Our first testbed is the mountain car problem, where a car driving along a mountain road must get up a hill whose slope is too steep for its weak engine to climb directly. This problem is commonly used as a testbed in reinforcement learning, and a complete description of it and its dynamics is given by Sutton and Barto [9].

3.1 Experiments and Results

For this problem we experimented with 2, 6, 8 and 12 RBFs for each input variable, and 2 sigmoidal units in the second hidden layer. The best results were obtained with 12 RBFs and 50,000 training episodes, as illustrated in Fig. 2a. Each training episode was terminated either when the goal was reached or when 100 movements had been performed. The reward function penalizes every action with −0.1, except when the last action performed allows the car to reach the goal, in which case the reward is 0. The training policy was ε-greedy with a constant ε = 0.01, α = 0.5 and γ = 0.5. For the MLPs we used α_MLP = 0.001 and activation functions with outputs in the interval (−1, 1). Some of our best policies were able to reach the goal in 59 steps, although on average the goal is reached in 63 steps. The quality of our solution is similar to those presented by Smart and Kaelbling [7], and more recently by Whiteson and Stone [13]. Moreover, given the great similarity between the shape of our final value function, shown in Fig. 2b, and the best one provided by Singh and Sutton [6, 10], we conclude that our solution is a near-optimal policy.

Fig. 2: Mountain car problem: (a) learning curves for different numbers of RBFs, calculated with a moving average of size 1,000 and averaged over 10 runs; (b) the learned value function has the typical shape for this problem.
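For illustration, the following snippet wires the two sketches above into the mountain car setup reported here (12 RBFs per variable, 2 sigmoidal units, α_MLP = 0.001, ε = 0.01, α = 0.5, γ = 0.5, 50,000 episodes). The MountainCar class is our own minimal stub using the standard Sutton and Barto dynamics together with the reward of −0.1 per step and the 100-movement limit described above; it is not the authors' code.

```python
import math, random

class MountainCar:
    """Minimal mountain car stub with the standard Sutton & Barto dynamics and the
    reward scheme reported in the paper (-0.1 per step, 0 on reaching the goal)."""
    def reset(self):
        self.x = random.uniform(-0.6, -0.4)   # conventional random start near the valley
        self.v = 0.0
        self.steps = 0
        return (self.x, self.v)

    def step(self, action):                    # action in {0, 1, 2}: reverse, coast, forward
        self.v += 0.001 * (action - 1) - 0.0025 * math.cos(3 * self.x)
        self.v = max(-0.07, min(0.07, self.v))
        self.x += self.v
        self.x = max(-1.2, min(0.6, self.x))
        if self.x <= -1.2:                     # hitting the left wall stops the car
            self.v = 0.0
        self.steps += 1
        reached_goal = self.x >= 0.5
        done = reached_goal or self.steps >= 100      # 100-movement episode limit
        reward = 0.0 if reached_goal else -0.1
        return (self.x, self.v), reward, done

STATE_RANGES = [(-1.2, 0.6), (-0.07, 0.07)]           # position and velocity bounds
nets = [RBFMLP(var_ranges=STATE_RANGES, rbfs_per_var=12, n_hidden=2, lr=0.001)
        for _ in range(3)]                             # one network per action

env = MountainCar()
for _ in range(50_000):
    sarsa_episode(env, nets, alpha=0.5, gamma=0.5, epsilon=0.01)
```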

4 Dribbling Problem

In the RoboCup simulation league, one of the most difficult skills that the robots can perform is dribbling. Dribbling can be defined as the skill that allows a player to run on the field while keeping the ball always within its kick range. In order to accomplish this skill, the player must alternate run and kick actions. The run action is performed with the command (dash Power), while the kick action is performed with the command (kick Power Direction), where Power ∈ [−100, 100] and Direction ∈ [−180, 180]. Three factors make this skill difficult to accomplish. First, the simulator adds noise to the movement of objects and to the parameters of commands; this is done to simulate a noisy environment and make the competition more challenging. Second, since the ball must remain close to the robot without colliding with it, and at the same time must be kept within the kick range, the margin for error is small. Third, and most challenging, heterogeneous players are used during competitions: for each game the simulator generates seven different player types at startup, and the eleven players of each team are selected from this set of seven types. Given that each player type has different physical capacities, an optimal policy learned with one type of player is simply suboptimal when followed by a player of a different type. In theory, the number of player types is infinite. For these three reasons, a good performance in the dribbling skill is very difficult to obtain. Even today, the best teams perform only a small number of dribbling sequences during a game; most of the time the ball is simply passed from one player to another.

4.1 Experiments and Results

For this problem we experimented with 5, 10 and 20 RBFs for each input variable, and 4 sigmoidal units in the second hidden layer. The best results were obtained with 5 RBFs and 100,000 training episodes, as illustrated in Fig. 3. Each training episode was terminated either when the agent kicked the ball out of its kicking range or when 35 actions had been performed. The reward function returns 4 and 8 when the agent runs more than 5 and 10 meters respectively, and the agent is penalized with −4 when it collides with or loses the ball. The training policy was ε-greedy with a constant ε = 0.01, α = 0.5 and γ = 0.7. For the MLPs we used α_MLP = 0.05 and activation functions with outputs in the interval (0, 1).

Fig. 3: Learning curves for different numbers of RBFs, calculated with a moving average of size 1,000 and averaged over 10 runs.
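The reward scheme just described can be summarised in a few lines. The function below is an illustrative reconstruction: the name and arguments are our own, and the 0 reward for all other cases as well as the precedence of the penalty are assumptions, since the paper describes the scheme only in prose.

```python
def dribbling_reward(distance_run_m, collided, lost_ball):
    """Reward scheme described above: +4 for running more than 5 m, +8 for more
    than 10 m, and -4 when the agent collides with or loses the ball.
    The 0 default and the precedence of the penalty are our assumptions."""
    if collided or lost_ball:
        return -4.0
    if distance_run_m > 10.0:
        return 8.0
    if distance_run_m > 5.0:
        return 4.0
    return 0.0
```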

5 Conclusion

In this paper we provide empirical evidence showing that multilayer perceptrons with one layer of radial basis functions can be used as robust approximators of the value function in reinforcement learning problems. We present experimental work with the Sarsa algorithm and two testbeds: the mountain car task and a difficult robot control problem known as the dribbling task. Extensions of this work include the use of Q-learning and Actor-Critic methods.

Acknowledgements

This research work was supported by a PROMEP scholarship from the Education Secretariat of Mexico (SEP), and by the Universidad Autónoma de Yucatán.

References

[1] C. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, 1995.
[2] J. Boyan and A. Moore, Generalization in reinforcement learning: safely approximating the value function, in Advances in Neural Information Processing Systems 7, 1995.
[3] S. Haykin, Neural Networks: A Comprehensive Foundation, Prentice Hall, 1999.
[4] R. Kretchmar and C. Anderson, Comparison of CMACs and radial basis functions for local function approximators in reinforcement learning, Proceedings of the IEEE International Conference on Neural Networks, Houston, pages 834-837, 1997.
[5] W. Miller, F. Glanz, and L. Kraft, CMAC: an associative neural network alternative to backpropagation, Proceedings of the IEEE, Special Issue on Neural Networks, 78:1561-1567, October 1990.
[6] S. Singh and R. Sutton, Reinforcement learning with replacing eligibility traces, Machine Learning, 22:123-158, 1996.
[7] W. Smart and L. Kaelbling, Practical reinforcement learning in continuous spaces, Proceedings of the International Conference on Machine Learning, pages 903-910, 2000.
[8] R. Sutton, Learning to predict by the methods of temporal differences, Machine Learning, 3:9-44, 1988.
[9] R. Sutton and A. Barto, Reinforcement Learning: An Introduction, The MIT Press, 1998.
[10] R. Sutton, Generalization in reinforcement learning: successful examples using sparse coarse coding, in Advances in Neural Information Processing Systems 8, 1996.
[11] X. Wang and T. G. Dietterich, Efficient value function approximation using regression trees, Proceedings of the IJCAI Workshop on Statistical Machine Learning for Large-Scale Optimization, 1999.
[12] C. Watkins, Learning from Delayed Rewards, PhD thesis, University of Cambridge, England, 1989.
[13] S. Whiteson and P. Stone, Evolutionary function approximation for reinforcement learning, Journal of Machine Learning Research, 7:877-917, 2006.