A REINFORCEMENT LEARNING ALGORITHM WITH EVOLVING FUZZY NEURAL NETWORKS

Proceedings of the 2013 International Conference on Systems, Control and Informatics

Hitesh Shah
Professor, Department of Electronics & Communication
G H Patel College of Engineering & Technology
Vallabh Vidyanagar, Gujarat (India)
iitd.hitesh@gmail.com

M. Gopal
Director, School of Engineering
Shiv Nadar University
Noida, Uttar Pradesh (India)
mgopal@snu.edu.in

Abstract— The synergy of the two paradigms, neural network and fuzzy inference system, has given rise to the rapidly emerging field of neuro-fuzzy systems. Evolving neuro-fuzzy systems are intended to use online learning to extract knowledge from data and perform a high-level adaptation of the network structure. We explore the potential of evolving neuro-fuzzy systems in reinforcement learning (RL) applications. In this paper, a novel online sequential-learning evolving neuro-fuzzy model design for RL is proposed. We develop a dynamic evolving fuzzy neural network (DENFIS) function approximation approach to RL systems. The potential of this approach is demonstrated through a case study: a two-link robot manipulator. Simulation results demonstrate that the proposed approach performs well in reinforcement learning problems.

Keywords— Reinforcement learning, Neuro-fuzzy system

I. INTRODUCTION

The reinforcement learning (RL) paradigm is a computationally simple and direct approach to the adaptive optimal control of nonlinear systems [1]. In RL, the learning agent (controller) interacts with an initially unknown environment (system) by measuring states and applying actions according to its policy so as to maximize its cumulative rewards. Thus, RL provides a general methodology for solving complex, uncertain sequential decision problems, which are very challenging in many real-world applications. The environment in RL is typically formulated as a Markov Decision Process (MDP), consisting of a set of all states S, a set of all possible actions A, a state transition probability distribution P : S × A × S → [0, 1], and a reward function R : S × A → ℝ. When all components of the MDP are known, an optimal policy can be determined, e.g., using dynamic programming.

There has been a great deal of progress in the machine learning community on value-function based reinforcement learning methods [2]. In value-function based reinforcement learning, rather than learning a direct mapping from states to actions, the agent learns an intermediate data structure known as a value function that maps states (or state-action pairs) to the expected long-term reward. Value-function based learning methods are appealing because the value function has well-defined semantics that enable a straightforward representation of the optimal policy, with theoretical results guaranteeing the convergence of certain methods [3].

Q-learning is a common model-free value-function strategy for RL [4]. A Q-learning system maps every state-action pair to a real number, the Q-value, which tells how optimal that action is in that state. For small domains, this mapping can be represented explicitly by a table of Q-values. For large domains, this approach is simply infeasible. When one deals with large discrete or continuous state and action spaces, it is inevitable to resort to function approximation, for two reasons: first, to overcome the storage problem (the curse of dimensionality); second, to achieve data efficiency (i.e., requiring only a few observations to derive a near-optimal policy) by generalizing to unobserved state-action pairs. There is a large literature on RL algorithms using various value-function estimation techniques. Functionally, a fuzzy system or a neural network can be described as a function approximator.
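As a concrete reference point for the table-based Q-value mapping just described, the short Python sketch below implements tabular Q-learning with ε-greedy exploration on a small discrete task. The environment interface (reset/step returning reward and a termination flag) and all numeric settings are illustrative assumptions, not details taken from the paper.

```python
import random
from collections import defaultdict

def tabular_q_learning(env, n_actions, episodes=500, eta=0.1, gamma=0.95, eps=0.2):
    """Minimal tabular Q-learning: Q maps every (state, action) pair to a real number."""
    Q = defaultdict(float)                      # Q[(s, a)] -> estimated long-term return

    for _ in range(episodes):
        s = env.reset()                         # assumed environment API
        done = False
        while not done:
            # epsilon-greedy: explore with probability eps, otherwise act greedily
            if random.random() < eps:
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda act: Q[(s, act)])

            s_next, r, done = env.step(a)       # assumed to return (next state, reward, done)

            # one-step Q-learning update toward r + gamma * max_a' Q(s', a')
            best_next = 0.0 if done else max(Q[(s_next, act)] for act in range(n_actions))
            Q[(s, a)] += eta * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```

Since the table stores one entry per state-action pair, its memory and data requirements grow with |S| × |A|, which is exactly the limitation that motivates the function approximators discussed next.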
Theoretical investigations have revealed that neural networks and fuzzy inference systems are universal approximators [5, 6]. Neural networks have been used to generalize the value function pertaining to specific situations. However, these works still assume discrete actions and cannot handle continuous-valued actions. In realistic applications, it is imperative to deal with continuous states and actions. A Fuzzy Inference System (FIS) can be used to facilitate generalization in the state space and to generate continuous actions, in particular in conjunction with Q-learning, widely known as fuzzy Q-learning (FQL). Glorennec [7] and the extension proposed by Jouffe [8] provided a fundamental contribution to the definition of FQL; this is the basis for many of the existing implementations. In FQL, the consequent parts of a FIS are selected by Q-learning. However, the structure and premise parameters are still determined by a priori knowledge. To circumvent this problem, Er and Deng [9] proposed a dynamic fuzzy Q-learning (DFQL) approach to construct a self-tuning FIS based on reinforcement signals and to deal with continuous state and action spaces.

Recently, the synergy of the two paradigms, neural network and fuzzy inference system, has given rise to the rapidly emerging field of neuro-fuzzy systems. The term neuro-fuzzy denotes a type of system with a structure similar to that of a fuzzy controller, where the fuzzy sets and rules are adjusted using neural network tuning techniques in an iterative way with the input-output data vectors. A neuro-fuzzy system is widely termed a fuzzy neural network (FuNN) [10, 11] in the literature. Fuzzy neural network systems are intended to capture the advantages of both fuzzy logic (approximate reasoning) and neural networks (learning), i.e., to acquire fuzzy rules based on the learning ability of neural networks [12].

Many researchers have developed such neuro-fuzzy systems for solving real-world problems effectively. The evolving fuzzy neural network (EFuNN), proposed by Kasabov in [13], is one such hybrid neuro-fuzzy architecture. The dynamic evolving neural-fuzzy inference system (dmEFuNN/DENFIS) [14] is a modified version of the EFuNN, built on the idea that, depending on the position of the input vector in the input space, a FIS for calculating the output is formed dynamically based on m fuzzy rules that have been created during the past learning process. These networks have been applied to classification and regression using supervised learning methods; DENFIS is used especially for online learning adaptive systems [14][15]. The use of neuro-fuzzy systems for value function approximation in an RL setup has not yet been explored.

In this paper, we explore the potential of an alternative dynamic evolving fuzzy neural network (dmEFuNN) for reinforcement learning algorithms. We compare the learning performances of dmEFuNN and dynamic FNN (here, dynamic fuzzy Q-learning) in a reinforcement learning framework, using a simulation experiment on the two-link robot manipulator tracking control problem. Further, we examine the robustness of the proposed approach in handling uncertainty in terms of parameter variations and external disturbances.

The paper is organized as follows. Section II presents the theoretical background of fuzzy inference systems with the reinforcement learning approach and recent trends in neuro-fuzzy systems. Section III proposes the architecture and learning framework of the dmEFuNN function approximator for RL systems. Section IV presents the empirical performance based on the experimental results of the two-link robot manipulator simulations. Conclusions are drawn in Section V.

II. THEORETICAL BACKGROUND

A neuro-fuzzy system is widely termed a fuzzy neural network (FuNN) [10, 11] in the literature. Fuzzy neural network systems are intended to capture the advantages of both the learning and computational power of neural networks and the high-level, human-like thinking and reasoning of fuzzy systems. The evolving fuzzy neural network and the dynamic evolving fuzzy neural network are hybrid neuro-fuzzy architectures.

A. Evolving Fuzzy Neural Network (EFuNN)

EFuNN implements a five-layer Mamdani-type FIS. The first layer passes crisp input variables to the second layer, which calculates the degrees of compatibility with the predefined membership functions. The third layer is the rule layer; each node in this layer represents either an existing rule or a rule anticipated after training. The rule nodes represent prototypes of input-output data as an association of hyperspheres from the fuzzy input and fuzzy output spaces. Each rule node is defined by two vectors of connection weights, which are adjusted through a hybrid learning technique. The fourth layer represents a fuzzy quantization of each output variable and calculates the degree to which the output membership functions are matched by the input data. The fifth layer carries out defuzzification and calculates the crisp value of the output variable. In EFuNN, all the rule nodes are created during the learning phase. EFuNN can be used as a function approximator in an RL framework, where the input to the EFuNN is the state or state-action pair and the output is the Q-value.

B. Dynamic Evolving Fuzzy Neural Network (DENFIS)

The dynamic evolving neural-fuzzy inference system, DENFIS (also known as dmEFuNN), uses a first-order Takagi-Sugeno type of inference engine [14]. DENFIS is similar to EFuNN in some principles.
It inherits and develops EFuNN's dynamic features, which make DENFIS suitable for online adaptive systems. The DENFIS model uses local generalization. In principle, the structures of EFuNN and DENFIS are somewhat similar. The dynamic feature was developed with the idea that, depending on the position of the input vector in the input space, a FIS for calculating the output value is formed dynamically based on m fuzzy rules that have been created during the past learning process. The evolving clustering method (ECM) [15] is used for fuzzy rule creation and updating within the input space partitioning. Although DENFIS meets the requirements of online learning for adaptive intelligent systems to a great extent, there is still scope for advancement. Our objective is to use DENFIS as a function approximator in a reinforcement learning framework.

III. APPROXIMATION OF VALUE FUNCTION USING DENFIS

A novel value function approximator for online sequential learning on continuous state-action domains, based on DENFIS, is proposed in this paper. Fig. 1 shows an architectural view of the DENFIS function approximation approach to an RL system.

Fig. 1 DENFIS controller architecture (block diagram: the state s and each candidate action a_i ∈ A are fed to the DENFIS network, which estimates Q(s, a_i); an ε-greedy action selector chooses the applied action, the TD error built from c and γV(s) drives the update, and an inner PD loop, the two-link robot, the desired trajectory q_d, and an error metric evaluator close the loop).

The state-action pair (s, a), where s = [s_1, s_2, ..., s_n] ∈ S is the current system state and a is each possible discrete control action in the action set A = {a_i}, i = 1, ..., m, is the input of the DENFIS model, and the estimated Q-value corresponding to (s, a) is the output of the network:

    Q(s, a) = y = f(x) = f(x_1, x_2, ..., x_q)        (1)

where x is the input vector (x = [x_1, x_2, ..., x_q] = (s, a)) of the DENFIS model, and the output y corresponds to the estimated Q-value associated with each state-action pair in rule R_i, i = 1, 2, ..., m. Training samples are obtained online from the interaction between the learning agent (controller) and its environment (plant). The online learning process of DENFIS involves the creation of new fuzzy rules, and existing fuzzy rules can be updated incrementally. In addition, the evolving clustering method (ECM) is used to partition the input sample space and determine the fuzzy sets of the antecedent part, i.e., ECM determines the cluster centers and membership functions of the antecedent part, while weighted recursive least squares (wRLS) with a forgetting factor determines the parameters of the consequent part of a fuzzy rule.

The agent's action is selected based on the outputs of DENFIS. Specifically, control actions are selected using an exploration/exploitation policy [4] in order to explore the set of possible actions and acquire experience through the online RL signals. We use pseudo-stochastic ε-greedy exploration as in [4]. In ε-greedy exploration, we gradually reduce the exploration (determined by the ε parameter) according to some schedule; we reduce ε to 90 percent of its value after every fixed number of iterations. The lower limit of the parameter ε is kept fixed (to maintain exploration).

The algorithm is an online learning algorithm that learns an approximate state-action value function Q(s_t, a_t) converging to the optimal function Q* (commonly called the Q-value). The online version is given by

    Q(s_t, a_t) ← Q(s_t, a_t) + η [ c_t + γ V_t(s_{t+1}) − Q(s_t, a_t) ]        (2)

where s_t → s_{t+1} is the state transition under the control action a_t ∈ A(s_t) (in fact a_t = a_c(s_t) + a_pd(s_t), where a_pd(s_t) is the action generated by the inner PD loop), c_t is the cost incurred by the controller, η ∈ (0, 1] is the learning-rate parameter that can be used to optimize the speed of learning, and γ ∈ (0, 1] is the discount factor that controls the trade-off between immediate and future costs.

A. Learning Process in the DENFIS Online Model

First-order Takagi-Sugeno fuzzy rules [14] are employed in the DENFIS online model. The linear functions in the consequent parts are created and updated by a linear least-squares estimator (LSE) on the learning data. The linear function for a learning data set of p data pairs, {([x_i1, x_i2, ..., x_iq], y_i), i = 1, 2, ..., p}, can be expressed as

    y = β_0 + β_1 x_1 + β_2 x_2 + ... + β_q x_q        (3)

The least-squares estimate of β = [β_0, β_1, β_2, ..., β_q]^T is obtained as the coefficient vector b = [b_0, b_1, b_2, ..., b_q]^T by applying the weighted least-squares formula

    b = (A^T W A)^{-1} A^T W y        (4)

where A = [a_1^T; a_2^T; ...; a_p^T] with row vectors a_j^T = [1, x_j1, x_j2, ..., x_jq], y = [y_1, y_2, ..., y_p]^T, and W = diag(w_1, w_2, ..., w_p). Here W is the weight matrix; its elements w_j are defined as 1 − d_j, where d_j is the distance between the j-th sample and the corresponding cluster center, j = 1, 2, ..., p. Equation (4) can be rewritten, using the recursive LSE formulation [14], as

    P = (A^T W A)^{-1},   b = P A^T W y        (5)

In the DENFIS online model, Kasabov and Song [14] used a weighted recursive LSE with a forgetting factor, defined as follows. Let the k-th row vector of matrix A be denoted a_k^T and the k-th element of y be denoted y_k. Then b can be calculated iteratively as

    b_{k+1} = b_k + w_{k+1} P_{k+1} a_{k+1} ( y_{k+1} − a_{k+1}^T b_k )
    P_{k+1} = (1/λ) [ P_k − ( w_{k+1} P_k a_{k+1} a_{k+1}^T P_k ) / ( λ + w_{k+1} a_{k+1}^T P_k a_{k+1} ) ]        (6)

where k = n, n+1, ..., p; w_{k+1} is the weight of the (k+1)-th sample, defined as 1 − d_{k+1} (d_{k+1} is the distance between the (k+1)-th sample and the corresponding cluster centre); and λ ∈ (0.8, 1) is the forgetting factor. The initial values of P_n and b_n can be calculated directly from (5) using the first n data pairs from the learning data set.
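To make the consequent-parameter update in (4)-(6) concrete, the following NumPy sketch implements the batch initialisation of (5) and the weighted recursive least-squares step with a forgetting factor as reconstructed above. The sample weights (taken as 1 minus the distance to a cluster centre) and all numeric values in the usage example are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def wls_batch(A, y, w):
    """Batch weighted LSE, eqs. (4)-(5): P = (A^T W A)^-1, b = P A^T W y."""
    W = np.diag(w)
    P = np.linalg.inv(A.T @ W @ A)
    b = P @ A.T @ W @ y
    return b, P

def wrls_step(b, P, a, y, w, lam=0.9):
    """One weighted recursive LSE update with forgetting factor lam, eq. (6).

    a : regressor row [1, x_1, ..., x_q] of the new sample
    w : sample weight (1 minus distance to the rule's cluster centre)
    """
    a = a.reshape(-1, 1)
    den = lam + w * (a.T @ P @ a).item()
    # P_{k+1} = (1/lam) * [P_k - w * P_k a a^T P_k / (lam + w * a^T P_k a)]
    P_new = (P - (w * P @ a @ a.T @ P) / den) / lam
    # b_{k+1} = b_k + w * P_{k+1} a (y_{k+1} - a^T b_k)
    err = y - (a.T @ b).item()
    b_new = b + w * (P_new @ a).ravel() * err
    return b_new, P_new

# Illustrative usage: fit y = 1 + 2*x, initialising from the first n = 5 pairs.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(20, 1))
A = np.hstack([np.ones((20, 1)), X])            # rows [1, x]
y = 1.0 + 2.0 * X.ravel() + 0.01 * rng.standard_normal(20)
b, P = wls_batch(A[:5], y[:5], np.ones(5))
for k in range(5, 20):                          # remaining samples arrive online
    b, P = wrls_step(b, P, A[k], y[k], w=1.0)
print(b)                                        # approximately [1.0, 2.0]
```

In DENFIS each fuzzy rule keeps its own (b, P) pair, so a new training sample mainly affects the rules whose cluster centres lie close to it (large weight w).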
In the online DENFIS model, the rules are created and updated at the same time as the input space partitioning, using online ECM and equations (4) and (6).

IV. SIMULATION EXPERIMENTS

To demonstrate the usefulness of the dynamic evolving fuzzy neural network function approximator in a reinforcement learning framework, we conducted experiments using the well-known two-link robot manipulator tracking control problem.

In the implementation, the DENFIS takes the state-action pair as input and produces, as output, the Q-value corresponding to that state-action pair. In particular, the DENFIS network begins with zero clusters. We first obtained a group of fuzzy rules using a DENFIS off-line learning model, with training samples available from a well-defined reinforcement fuzzy system (here we take training samples from the dynamic fuzzy Q-learning controller). Then, through agent-environment interaction, training samples become available and the DENFIS model builds an online model based on dynamic inference, i.e., clustering and reformulation of the rules are performed whenever a new training example is presented to the network. The DENFIS off-line learning model, when used as an initialization, improves the generalization (e.g., improves the learning efficiency).
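A compact sketch of the interaction loop described above, following update (2), is given below. The `q_model` object stands in for the DENFIS approximator (any incremental regressor with `predict`/`update` methods would do); the environment interface, the inner PD helper, and all numeric constants are assumptions made for illustration only.

```python
import random

def denfis_q_control_episode(env, q_model, actions, eta=0.5, gamma=0.95,
                             eps=0.5, eps_min=0.05, eps_decay=0.9):
    """One learning episode of the DENFIS-based Q-controller, per eq. (2).

    q_model : incremental approximator with predict(x) and update(x, target),
              where x is the state-action vector [s_1, ..., s_n, a]
    actions : discrete set of candidate corrective actions a_c
    """
    s = env.reset()                                    # assumed environment API
    done = False
    while not done:
        # epsilon-greedy selection over the DENFIS Q-estimates (Q holds costs: lower is better)
        if random.random() < eps:
            a_c = random.choice(actions)
        else:
            a_c = min(actions, key=lambda a: q_model.predict(list(s) + [a]))

        a_pd = env.pd_action(s)                        # inner PD loop contribution (assumed helper)
        s_next, cost, done = env.step(a_c + a_pd)      # applied action a = a_c + a_pd

        # V(s_{t+1}) = min over actions of Q(s_{t+1}, a) for cost minimisation
        v_next = 0.0 if done else min(q_model.predict(list(s_next) + [a]) for a in actions)

        q_sa = q_model.predict(list(s) + [a_c])
        target = q_sa + eta * (cost + gamma * v_next - q_sa)   # right-hand side of eq. (2)
        q_model.update(list(s) + [a_c], target)                # fit DENFIS toward the new Q-value
        s = s_next
        eps = max(eps_min, eps * eps_decay)            # gradually reduce exploration toward a floor
    return q_model
```

As in the paper, the exploration parameter ε is decayed multiplicatively toward a fixed lower limit, and the DENFIS rule base grows or is refined each time a new (state-action, target) pair is presented.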

For simplicity, the controller uses two DENFIS models as function approximators, one for each of the two links. DENFIS is one module of the ECOS toolbox working in the MATLAB numeric computing environment. For constructing DENFIS, the distance threshold Dthr is set to 0.8 and the default value of the number of rules in the dynamic fuzzy inference system is set to 3.

A. Simulation Results and Discussion

Simulations were carried out to study the learning performance, and the robustness against uncertainties, of the DENFIS learning approach on the two-link robot manipulator control problem. To analyze the DENFIS algorithm for computational cost, accuracy, and robustness, we compare the proposed approach with the dynamic fuzzy Q-learning (DFQL) approach. MATLAB has been used as the simulation tool.

Learning performance study: The physical system has been simulated for a single run of 10 sec using the fourth-order Runge-Kutta method, with a fixed time step of the order of milliseconds. Fig. 2 and Fig. 3 show the output tracking errors (both links) for both controllers, and Table 1 tabulates the mean square error, absolute maximum error (max |e(t)|), and absolute maximum control effort (max |τ|) under nominal operating conditions.

Fig. 2 Standard two-link controller comparison: output tracking errors (link 1).
Fig. 3 Standard two-link controller comparison: output tracking errors (link 2).

Table 1 Comparison of controllers, learning performance study: MSE (rad), max |e(t)| (rad), and max |τ| (Nm) for link 1 and link 2, and training time (sec), for the DFQL and DENFIS-based controllers.

From the results (Figs. 2-3 and Table 1), we observe that the training time for the DENFIS-based controller is higher than that of the DFQL controller, while the DENFIS-based controller outperforms DFQL in terms of lower tracking errors and lower values of absolute error and control effort for both links.

Robustness study: In the following, we compare the performance of the DFQL and DENFIS-based controllers under uncertainties. For this study, we trained the controllers for a fixed number of episodes and then evaluated the performance for two cases.

Effect of payload variations: The end-effector mass is varied with time, which corresponds to the robotic arm picking up and releasing payloads having different masses. Fig. 4 and Fig. 5 show the output tracking errors for link 1 and link 2, respectively, and Table 2 tabulates the mean square error, absolute maximum error, and absolute maximum control effort under payload variations with time.

Fig. 4 Effect of payload variation comparison: output tracking errors (link 1).
Fig. 5 Effect of payload variation comparison: output tracking errors (link 2).

Table 2 Comparison of controllers, effect of payload variations: MSE (rad), max |e(t)| (rad), and max |τ| (Nm) for link 1 and link 2.

Effect of external disturbances: A torque disturbance τ_dis with a sinusoidal variation of frequency 2π rad/sec was added to the model. The magnitude of the torque disturbance is expressed as a percentage of the control effort. Fig. 6 and Fig. 7 show the output tracking errors for link 1 and link 2, respectively, and Table 3 tabulates the mean square error, absolute maximum error (max |e(t)|), and absolute maximum control effort (max |τ|) for the torque disturbances added to the model.
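The comparison metrics used throughout the tables (MSE, absolute maximum tracking error, and absolute maximum control effort) are straightforward to compute from the simulated trajectories; a small NumPy sketch is shown below, where the trajectory arrays, sampling, and dummy signals are illustrative assumptions.

```python
import numpy as np

def tracking_metrics(q, q_desired, tau):
    """Per-link metrics of the kind reported in Tables 1-3.

    q, q_desired : arrays of shape (T, 2), joint angles in rad (link 1 and link 2)
    tau          : array of shape (T, 2), applied joint torques in Nm
    Returns per-link MSE, max |e(t)| (rad), and max |tau| (Nm).
    """
    e = q_desired - q                           # tracking error per time step and link
    mse = np.mean(e ** 2, axis=0)
    max_abs_err = np.max(np.abs(e), axis=0)
    max_abs_tau = np.max(np.abs(tau), axis=0)
    return mse, max_abs_err, max_abs_tau

# Illustrative usage with a dummy 10 s trajectory sampled at 1 kHz.
t = np.linspace(0.0, 10.0, 10001)
q_d = np.column_stack([np.sin(t), np.cos(t)])   # assumed desired trajectory
q = q_d + 0.01 * np.random.default_rng(0).standard_normal(q_d.shape)
tau = 50.0 * np.random.default_rng(1).standard_normal(q_d.shape)
print(tracking_metrics(q, q_d, tau))
```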

Fig. 6 Effect of external disturbances comparison: output tracking errors (link 1).
Fig. 7 Effect of external disturbances comparison: output tracking errors (link 2).

Table 3 Comparison of controllers, effect of external disturbances: MSE (rad), max |e(t)| (rad), and max |τ| (Nm) for link 1 and link 2.

Simulation results (Figs. 4-7, Table 2 and Table 3) show a comparable robustness property for the DENFIS Q-learning based controller and the dynamic fuzzy Q-learning based controller.

V. CONCLUSIONS

We have explored the potential of the dynamic evolving fuzzy neural network (DENFIS) for reinforcement learning algorithms. DENFIS is a sequential learning architecture with the ability to grow and prune to ensure a parsimonious structure that is well suited for real-time control applications. From the simulation results, it is evident that the training time of the DENFIS-based RL system is larger compared to the dynamic fuzzy Q-learning based RL system. This feature is achieved without any loss of performance.

REFERENCES

[1] R. S. Sutton, A. G. Barto, and R. J. Williams, "Reinforcement learning is direct adaptive optimal control," IEEE Control Systems Magazine, vol. 12, no. 2, pp. 19-22, 1992.
[2] J. A. Boyan and A. W. Moore, "Generalization in reinforcement learning: Safely approximating the value function," Advances in Neural Information Processing Systems, pp. 369-376, 1995.
[3] B. Ratitch, "On characteristics of Markov decision processes and reinforcement learning in large domains," PhD thesis, Montréal: McGill University, School of Computer Science, 2004.
[4] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction (Adaptive Computation and Machine Learning), Cambridge: MIT Press, 1998.
[5] K. Hornik, M. Stinchcombe, and H. White, "Multilayer feedforward networks are universal approximators," Neural Networks, vol. 2, pp. 359-366, 1989.
[6] L. Wang, "Fuzzy systems are universal approximators," in Proc. IEEE Int. Conf. Fuzzy Systems, 1992.
[7] P. Y. Glorennec and L. Jouffe, "Fuzzy Q-learning," in Proc. IEEE Int. Conf. Fuzzy Systems, vol. 2, pp. 659-662, 1997.
[8] L. Jouffe, "Fuzzy inference system learning by reinforcement methods," IEEE Trans. Systems, Man, and Cybernetics, Part C, vol. 28, no. 3, pp. 338-355, 1998.
[9] M. J. Er and C. Deng, "Online tuning of fuzzy inference systems using dynamic fuzzy Q-learning," IEEE Trans. Systems, Man, and Cybernetics, Part B, vol. 34, no. 3, pp. 478-489, 2004.
[10] N. Kasabov, Foundations of Neural Networks, Fuzzy Systems and Knowledge Engineering, The MIT Press, Cambridge, MA, 1996.
[11] J. Vieira, F. M. Dias, and A. Mota, "Neuro-fuzzy systems: A survey," WSEAS Trans. on Systems, vol. 3, no. 2, April 2004.
[12] D. A. Linkens and H. O. Nyongesa, "Learning systems in intelligent control: An appraisal of fuzzy, neural and genetic algorithm control applications," IEE Proc. Control Theory and Applications, vol. 143, pp. 367-386, 1996.
[13] N. Kasabov, "Evolving fuzzy neural networks for supervised/unsupervised online knowledge-based learning," IEEE Trans. Systems, Man, and Cybernetics, Part B, vol. 31, no. 6, pp. 902-918, Dec. 2001.
[14] N. Kasabov and Q. Song, "DENFIS: Dynamic evolving neural-fuzzy inference system and its application for time-series prediction," IEEE Trans. Fuzzy Systems, vol. 10, no. 2, pp. 144-154, April 2002.
[15] M. J. Watts, "A decade of Kasabov's evolving connectionist systems: A review," IEEE Trans. Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 39, no. 3, pp. 253-269, May 2009.