Proceedings of the 2013 International Conference on Systems, Control and Informatics

A REINFORCEMENT LEARNING ALGORITHM WITH EVOLVING FUZZY NEURAL NETWORKS

Hitesh Shah
Professor, Department of Electronics & Communication
G H Patel College of Engineering & Technology
Vallabh Vidyanagar, Gujarat (INDIA)
iid.hitesh@gmail.com

Abstract — The synergy of the two paradigms, neural network and fuzzy inference system, has given rise to a rapidly emerging field: neuro-fuzzy systems. Evolving neuro-fuzzy systems are intended to use online learning to extract knowledge from data and perform a high-level adaptation of the network structure. We explore the potential of evolving neuro-fuzzy systems in reinforcement learning (RL) applications. In this paper, a novel online sequential learning evolving neuro-fuzzy model design for RL is proposed. We develop a dynamic evolving fuzzy neural network (DENFIS) function approximation approach to RL systems. The potential of this approach is demonstrated through a case study: a two-link robot manipulator. Simulation results demonstrate that the proposed approach performs well in reinforcement learning problems.

Keywords — Reinforcement learning, Neuro-fuzzy system

I. INTRODUCTION

The reinforcement learning (RL) paradigm is a computationally simple and direct approach to the adaptive optimal control of nonlinear systems [1]. In RL, the learning agent (controller) interacts with an initially unknown environment (system) by measuring states and applying actions according to its policy so as to maximize its cumulative reward. Thus, RL provides a general methodology for solving complex uncertain sequential decision problems, which are very challenging in many real-world applications. The environment of RL is typically formulated as a Markov Decision Process (MDP), consisting of a set of all states $S$, a set of all possible actions $A$, a state transition probability distribution $P: S \times A \times S \to [0,1]$, and a reward function $R: S \times A \to \mathbb{R}$. When all components of the MDP are known, an optimal policy can be determined, e.g., using dynamic programming.
M. Gopal
Director, School of Engineering
Shiv Nadar University
Noida, Uttar Pradesh (INDIA)
mgopal@snu.edu.in

There has been a great deal of progress in the machine learning community on value-function based reinforcement learning methods [2]. In value-function based reinforcement learning, rather than learning a direct mapping from states to actions, the agent learns an intermediate data structure known as a value function that maps states (or state-action pairs) to the expected long-term reward. Value-function based learning methods are appealing because the value function has well-defined semantics that enable a straightforward representation of the optimal policy, and theoretical results guarantee the convergence of certain methods [3]. Q-learning is a common model-free value-function strategy for RL [4]. A Q-learning system maps every state-action pair to a real number, the Q-value, which tells how optimal that action is in that state. For small domains, this mapping can be represented explicitly by a table of Q-values. For large domains, this approach is simply infeasible. If one deals with large discrete or continuous state and action spaces, it is inevitable to resort to function approximation, for two reasons: first, to overcome the storage problem (curse of dimensionality); second, to achieve data efficiency (i.e., requiring only a few observations to derive a near-optimal policy) by generalizing to unobserved state-action pairs. There is a large literature on RL algorithms using various value-function estimation techniques. Functionally, a fuzzy system or a neural network can be described as a function approximator. Theoretical investigations have revealed that neural networks and fuzzy inference systems are universal approximators [5, 6]. Neural networks have been used to generalize the value function pertaining to specific situations. However, these works still assume discrete actions and cannot handle continuous-valued actions. In realistic applications, it is imperative to deal with continuous states and actions.
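The table-based Q-value mapping described above can be sketched in a few lines. This is a minimal illustration only; the state/action labels, reward, and learning parameters below are invented for the sketch and are not taken from the paper:

```python
# Minimal tabular Q-learning backup (illustrative sketch).
Q = {}  # maps (state, action) -> estimated Q-value; missing entries read as 0.0

def q_update(s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    """One Q-learning backup:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (r + gamma * best_next - old)

actions = [0, 1]
q_update('s0', 0, 1.0, 's1', actions)  # first backup from a zero-initialised table
```

For small domains the dictionary stays tractable; for continuous state-action spaces the table is exactly the structure that the function approximators discussed next replace.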
A Fuzzy Inference System (FIS) can be used to facilitate generalization in the state space and to generate continuous actions, in particular in conjunction with Q-learning, widely known as fuzzy Q-learning (FQL). Glorennec [7] and the extension proposed by Jouffe [8] provided a fundamental contribution in the definition of FQL; this is the basis for many of the existing implementations. In FQL, the consequent parts of a FIS are selected by Q-learning. However, structure and premise parameters are still determined by a priori knowledge. To circumvent this problem, Er and Deng [9] proposed a dynamic fuzzy Q-learning (DFQL) approach to construct a self-tuning FIS based on reinforcement signals and deal with continuous state and action spaces.

Recently, the synergy of the two paradigms, neural network and fuzzy inference system, has given rise to a rapidly emerging field: neuro-fuzzy systems. The term neuro-fuzzy denotes a type of system characterized by a structure similar to that of a fuzzy controller, where the fuzzy sets and rules are adjusted using neural network tuning techniques in an iterative way with the input-output data vectors. A neuro-fuzzy system is widely termed a fuzzy neural network (FuNN) [10, 11] in the literature. Fuzzy neural network systems are intended to capture the advantages of both fuzzy logic (approximate reasoning) and neural networks (learning), i.e., to acquire fuzzy rules based on the learning ability of neural networks [12].
Many researchers have developed such neuro-fuzzy systems for solving real-world problems effectively. The evolving fuzzy neural network (EFuNN), one of the hybrid neuro-fuzzy architectures, was proposed by Kasabov in [13]. The dynamic evolving neural-fuzzy inference system (dmEFuNN/DENFIS) [14] is a modified version of the EFuNN, with the idea that, depending on the position of the input vector in the input space, a FIS for calculating the output is formed dynamically, based on m fuzzy rules that have been created during the past learning process. The application of these networks has been in the areas of classification and regression using supervised learning methods. DENFIS is especially suited for online learning adaptive systems [14][15]. The use of neuro-fuzzy systems for value function approximation in the RL setup has not yet been explored.

In this paper, we explore the potential of an alternative dynamic evolving fuzzy neural network (dmEFuNN) for reinforcement learning algorithms. We compare the learning performances of dmEFuNN and a dynamic FNN (here, dynamic fuzzy Q-learning) in a reinforcement learning framework, using a simulation experiment on the two-link robot manipulator tracking control problem. Further, we examine the robustness of the proposed approach in handling uncertainty in terms of parameter variations and external disturbances.

The paper is organized as follows. Section II presents the theoretical background of fuzzy inference systems with the reinforcement learning approach and recent trends in neuro-fuzzy systems. Section III proposes the architecture and learning framework of the dmEFuNN function approximator for RL systems. Section IV exhibits the empirical performance based on the experimental results of two-link robot manipulator simulations. Conclusions are drawn in Section V.

II. THEORETICAL BACKGROUND

A neuro-fuzzy system is widely termed a fuzzy neural network (FuNN) [10, 11] in the literature.
Fuzzy neural network systems are intended to capture the advantages of both the learning and computational power of neural networks and the high-level, human-like thinking and reasoning of fuzzy systems. The evolving fuzzy neural network and the dynamic evolving fuzzy neural network are hybrid neuro-fuzzy architectures.

A. Evolving Fuzzy Neural Network (EFuNN)

EFuNN implements a five-layer Mamdani-type FIS. The first layer passes crisp input variables to the second layer, which calculates the degrees of compatibility in relation to the predefined membership functions. The third layer is the rule layer; each node in this layer represents either an existing rule or a rule anticipated after training. The rule nodes represent prototypes of input-output data as an association of hyperspheres from the fuzzy input and the fuzzy output spaces. Each rule node is defined by two vectors of connection weights, which are adjusted through a hybrid learning technique. The fourth layer represents a fuzzy quantization of each output variable and calculates the degree to which the output membership functions are matched by the input data. The fifth layer carries out defuzzification and calculates the crisp value of the output variable. In EFuNN, all the rule nodes are created during the learning phase. We use EFuNN as a function approximator in the RL framework, where the input to the EFuNN is the state or state-action pair, and the output is the Q-value.

B. Dynamic Evolving Fuzzy Neural Network (DENFIS)

The dynamic evolving neural-fuzzy inference system, DENFIS (also known as dmEFuNN), uses a first-order Takagi-Sugeno type of inference engine [14]. DENFIS is similar to EFuNN in some principles. It inherits and develops EFuNN's dynamic features, which make DENFIS suitable for online adaptive systems. The DENFIS model uses local generalization. In principle, the structures of EFuNN and DENFIS are somewhat similar.
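The first-order Takagi-Sugeno inference that DENFIS uses can be sketched as a firing-strength-weighted average of local linear models. The sketch below is illustrative only: the Gaussian membership form, rule centres, widths, and consequent coefficients are assumptions for the example, whereas DENFIS itself evolves the rule set from data:

```python
import numpy as np

def ts_output(x, centers, sigmas, betas):
    """First-order Takagi-Sugeno inference: the crisp output is the
    firing-strength-weighted average of local linear consequents
    y_i = beta_i0 + beta_i1*x1 + ... + beta_iq*xq."""
    x = np.asarray(x, dtype=float)
    # firing strength of each rule: product of Gaussian memberships
    w = np.exp(-np.sum((x - centers) ** 2 / (2.0 * sigmas ** 2), axis=1))
    local = betas[:, 0] + betas[:, 1:] @ x  # local linear consequents
    return float(w @ local / w.sum())

# Two illustrative rules over a 2-D input (a flattened state-action pair)
centers = np.array([[0.0, 0.0], [1.0, 1.0]])
sigmas = np.ones((2, 2))
betas = np.array([[0.0, 1.0, 0.0],    # rule 1: y = x1
                  [1.0, 0.0, 0.0]])   # rule 2: y = 1
y = ts_output([0.0, 0.0], centers, sigmas, betas)
```

Near a rule centre the output is dominated by that rule's local linear model, which is what gives DENFIS its local-generalization character.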
The dynamic feature of EFuNN is developed with the idea that, depending on the position of the input vector in the input space, a FIS for calculating the output value is formed dynamically, based on m fuzzy rules that have been created during the past learning process. The evolving clustering method (ECM) [15] is used for fuzzy rule creation and updating within the input space partitioning. Although DENFIS meets the requirements of online learning to form adaptive intelligent systems to a great extent, there is still scope for advancement. Our objective is to use DENFIS as a function approximator in the reinforcement learning framework.

III. APPROXIMATION OF VALUE FUNCTION USING DENFIS

A novel value function approximator for online sequential learning on continuous state-action domains, based on DENFIS, is proposed in this paper. Fig. 1 shows the architectural view of the DENFIS function approximation approach to an RL system.

[Fig. 1 DENFIS controller architecture: DENFIS block producing $Q(s, a_i)$, $\varepsilon$-greedy action selector, two-link robot, desired trajectory $q_d$, and error metric evaluator]

The state-action pair $(s, a)$, where $s = \{s_1, s_2, \ldots, s_n\} \in S$ is the current system state and $a$ is each possible discrete control action in the action set $A = \{a_i\};\ i = 1, \ldots, m$, is the input of the DENFIS model, and the estimated Q-value corresponding to $(s, a)$ is the output of the network:

$$Q(s,a) = y = f(\mathbf{x}) = f(x_1, x_2, \ldots, x_q) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_q x_q \qquad (1)$$
Proceedings of he 23 Inernaional Conference on Sysems, Conrol and Informaics where x is he inpu vecor ( x =[ x, x2,, xq] = ( s, a) ) of he DENFIS model and oupu y corresponds o esimaed Q-value associaed wih each sae-acion in rule Ri ; i =,2,...,m. raining samples are obained online as he ineracion beween he learning agen (conroller) and is environmen (plan). he online learning process of DENFIS involves he creaion of new fuzzy rules, and exising fuzzy rules can be updaed incremenally. In addiion, evolving clusering mehod (ECM) is used o pariion he inpu sample space o deermine he fuzzy ses in he aneceden par, i.e., ECM is used o deermine cluser ceners and membership funcions of he aneceden par, and wrls wih forgeing facor deermine he parameers of he consequen par of a fuzzy rule. he agen s acion is seleced based on he oupus of DENFIS. In specific, conrol acions are seleced using an exploraion/exploiaion policy [4] in order o explore he se of possible acions and acquire experience hrough he online RL signals. We use a pseudo-sochasic exploraion ε -greedy as in [4]. In ε -greedy exploraion, we gradually reduce he exploraion (deermined by he ε parameer) according o some schedule; we have reduced ε o is 9 percen value afer every ieraions. he lower limi of parameer ε has been kep fixed a.2 (o mainain exploraion). I is an online learning algorihm ha learns an approximae sae-acion value funcion Qs (, a ) ha converges o he opimal funcion Q (commonly called Q-value). Online version is given by Qs (, a) Qs (, a) η[ c γvs ( ) Qs (, a)] (2) c where s s is he sae ransiion under he conrol a A( s acion )(in fac a = a ( s ) a ( s ) ; where ( apd s ) is he acion generaed by inner PD loop), c is he cos incurred by he conroller, η (,] is he learning rae parameer ha can be used o opimize he speed of learning, and γ (,] is he discoun facor ha conrols he rade-off beween immediae and fuure coss. c pd A. 
A. Learning Process in DENFIS Online Model

The first-order Takagi-Sugeno fuzzy rules [15] are employed in the DENFIS online model. The linear functions in the consequent parts are created and updated by a linear least-squares estimator (LSE) [15] on the learning data. The linear function for a learning data set of $p$ data pairs, $\{([x_{i1}, x_{i2}, \ldots, x_{iq}], y_i),\ i = 1, 2, \ldots, p\}$, can be expressed as

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_q x_q \qquad (3)$$

The least-squares estimator of $\boldsymbol{\beta} = [\beta_0\ \beta_1\ \beta_2\ \cdots\ \beta_q]^T$ is calculated as the coefficients $\mathbf{b} = [b_0\ b_1\ b_2\ \cdots\ b_q]^T$, by applying the following weighted least-squares estimator formula:

$$\mathbf{b} = (A^T W A)^{-1} A^T W \mathbf{y} \qquad (4)$$

where

$$A = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1q} \\ 1 & x_{21} & x_{22} & \cdots & x_{2q} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{p1} & x_{p2} & \cdots & x_{pq} \end{bmatrix};\qquad \mathbf{y} = [y_1\ y_2\ \cdots\ y_p]^T;\qquad W = \mathrm{diag}(w_1, w_2, \ldots, w_p)$$

Here $W$ is the weight matrix, and its elements $w_j$ are defined by $w_j = 1 - d_j$ ($d_j$ is the distance between the $j$-th sample and the corresponding cluster center), $j = 1, 2, \ldots, p$. We can rewrite equation (4) with the use of the recursive LSE formula [14] as follows:

$$P = (A^T W A)^{-1}, \qquad \mathbf{b} = P A^T W \mathbf{y} \qquad (5)$$

In the DENFIS online model, Kasabov and Song [14] used a weighted recursive LSE with a forgetting factor, defined as follows. Let the $k$-th row vector of matrix $A$ be denoted $\mathbf{a}_k$ and the $k$-th element of $\mathbf{y}$ be denoted $y_k$. Then $\mathbf{b}$ can be calculated iteratively as follows:

$$\mathbf{b}_{k+1} = \mathbf{b}_k + w_{k+1} P_{k+1} \mathbf{a}_{k+1} \left( y_{k+1} - \mathbf{a}_{k+1}^T \mathbf{b}_k \right)$$

$$P_{k+1} = \frac{1}{\lambda} \left[ P_k - \frac{w_{k+1} P_k \mathbf{a}_{k+1} \mathbf{a}_{k+1}^T P_k}{\lambda + w_{k+1} \mathbf{a}_{k+1}^T P_k \mathbf{a}_{k+1}} \right] \qquad (6)$$

where $k = n, n+1, \ldots, p-1$; $w_{k+1}$ is the weight of the $(k+1)$-th sample, defined by $w_{k+1} = 1 - d_{k+1}$ ($d_{k+1}$ is the distance between the $(k+1)$-th sample and the corresponding cluster center); and $\lambda \in (0.8, 1)$ is the forgetting factor. The initial values of $P_n$ and $\mathbf{b}_n$ can be calculated directly from (5) with the use of the first $n$ data pairs from the learning data set. In the online DENFIS model, the rules are created and updated at the same time as the input space partitioning using online ECM, together with equations (4) and (6).

IV. SIMULATION EXPERIMENTS

To demonstrate the usefulness of the dynamic evolving fuzzy neural network function approximator in the reinforcement learning framework, we conducted experiments using the well-known two-link robot manipulator tracking control problem.
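The weighted recursive LSE with forgetting factor of Eq. (6) can be sketched in Python (rather than the paper's MATLAB environment). One assumption is flagged in the code: the large-diagonal initialisation of $P$ is a common RLS convention used here for brevity, whereas the paper initialises $P_n$ and $\mathbf{b}_n$ from the batch formula (5):

```python
import numpy as np

def wrls_step(b, P, a, y, w, lam=0.9):
    """One update of Eq. (6):
      P_{k+1} = (1/lam) * [P - w*P a a^T P / (lam + w*a^T P a)]
      b_{k+1} = b + w * P_{k+1} a * (y - a^T b)
    a is the regressor [1, x1, ..., xq], w the sample weight (1 - distance
    to the cluster centre), lam the forgetting factor."""
    a = np.asarray(a, dtype=float)
    Pa = P @ a
    P_new = (P - (w * np.outer(Pa, Pa)) / (lam + w * (a @ Pa))) / lam
    b_new = b + w * (P_new @ a) * (y - a @ b)
    return b_new, P_new

# Illustrative run: recover y = 2 + 3*x from noiseless samples.
b = np.zeros(2)
P = 1e4 * np.eye(2)  # large-P initialisation (assumption; paper uses Eq. (5))
for _ in range(50):
    for x in (0.0, 0.5, 1.0):
        b, P = wrls_step(b, P, [1.0, x], 2.0 + 3.0 * x, w=1.0, lam=0.99)
```

Note that the gain uses the updated $P_{k+1}$, matching the order in (6); with $w_{k+1} = 1$ and $\lambda = 1$ the step reduces to ordinary recursive least squares.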
In implementation, the DENFIS has as input the state-action pair and as output the Q-value corresponding to that state-action pair. In particular, the DENFIS network begins with zero clusters. We first obtained a group of fuzzy rules using a DENFIS offline learning model, with the use of training samples available from well-defined reinforcement fuzzy systems (here we take training samples from the dynamic fuzzy Q-learning controller). Then, with agent-environment interaction, training samples become available and the DENFIS model builds up an online mode based on dynamic inference, i.e., clustering and reformulation of the rules are performed whenever a new training example is presented to the network. The DENFIS offline learning model, when used as an
initialization, improves the generalization (e.g., improves the learning efficiency). For simplicity, the controller uses two DENFIS models as function approximators, one for each of the two links. DENFIS is one module of the ECOS toolbox, working in the MATLAB numeric computing environment. The distance threshold $D_{thr}$ is set to 0.8, and the default number of rules in the dynamic fuzzy inference system is set to 3 for constructing DENFIS.

A. Simulation Results and Discussion

Simulations were carried out to study the learning performance, and the robustness against uncertainties, of the DENFIS learning approach on the two-link robot manipulator control problem. To analyze the DENFIS algorithm for computational cost, accuracy, and robustness, we compare the proposed approach with the dynamic fuzzy reinforcement learning approach. MATLAB 7.10 (R2010a) has been used as the simulation tool.

Learning performance study: The physical system has been simulated for a single run of 10 sec using the fourth-order Runge-Kutta method, with a fixed time step of 10 msec. Fig. 2 and Fig. 3 show the output tracking errors (for both links) for both controllers. Table 1 tabulates the mean square error, absolute maximum error ($\max|e(t)|$), and absolute maximum control effort ($\max|\tau|$) under nominal operating conditions.

From the results (Figs. 2-3 and Table 1), we observe that the training time for the DENFIS Q-learning based controller is higher than that for the dynamic fuzzy Q-learning based controller, while the DENFIS Q-learning based controller achieves lower tracking errors and lower values of absolute maximum error and control effort for both links.

Robustness study: In the following, we compare the performance of the DFQL and DENFIS Q-learning based controllers under uncertainties. For this study, we trained the controller for 20 episodes, and then evaluated the performance for two cases.

Effect of payload variations: The end-effector mass is varied with time, which corresponds to the robotic arm picking up and releasing payloads having different masses. Fig. 4 and Fig.
5 show the output tracking errors for link 1 and link 2, respectively, and Table 2 tabulates the mean square error, absolute maximum error, and absolute maximum control effort under payload variations with time.

[Fig. 4 Effect of payload variation comparison: output tracking errors (link 1)]

[Fig. 2 Standard two-link controller comparison: output tracking errors (link 1)]

[Fig. 3 Standard two-link controller comparison: output tracking errors (link 2)]

[Table 1 Comparison of controllers: learning performance study — MSE (rad), $\max|e(t)|$ (rad), and $\max|\tau|$ (Nm) for link 1 and link 2, and training time (sec), for both controllers]

[Fig. 5 Effect of payload variation comparison: output tracking errors (link 2)]

[Table 2 Comparison of controllers: effect of payload variations — MSE (rad), $\max|e(t)|$ (rad), and $\max|\tau|$ (Nm) for link 1 and link 2, for both controllers]

Effect of external disturbances: A torque disturbance $\tau_{dis}$ with a sinusoidal variation of frequency $2\pi$ rad/sec was added with time to the model. The magnitude of the torque disturbance is expressed as a percentage of the control effort. Fig. 6 and Fig. 7 show the output tracking errors for link 1 and link 2, respectively, and Table 3 tabulates the mean square error, absolute maximum error ($\max|e(t)|$), and absolute maximum control effort ($\max|\tau|$) for torque disturbances added with time to the model.
[Fig. 6 Effect of external disturbances comparison: output tracking errors (link 1)]

[Fig. 7 Effect of external disturbances comparison: output tracking errors (link 2)]

[Table 3 Comparison of controllers: effect of external disturbances — MSE (rad), $\max|e(t)|$ (rad), and $\max|\tau|$ (Nm) for link 1 and link 2, for both controllers]

Simulation results (Figs. 4-7, Table 2 and Table 3) show comparable robustness properties for the DENFIS Q-learning based controller and the dynamic fuzzy Q-learning based controller.

V. CONCLUSIONS

We have explored the potential of the dynamic evolving fuzzy neural network (DENFIS) for reinforcement learning algorithms. DENFIS is a sequential learning architecture with the ability to grow and prune, ensuring a parsimonious structure that is well suited for real-time control applications. From the simulation results, it is evident that the training time in the DENFIS based RL system is larger compared to the dynamic fuzzy Q-learning based RL system. This feature is achieved without any loss of performance.

REFERENCES

[1] R. S. Sutton, A. G. Barto, and R. J. Williams, "Reinforcement learning is direct adaptive optimal control," IEEE Control Sys. Mag., vol. 12, no. 2, pp. 19-22, 1992.
[2] J. A. Boyan and A. W. Moore, "Generalization in reinforcement learning: Safely approximating the value function," Advances in Neural Information Proc. Sys., pp. 369-376, 1995.
[3] B. Ratitch, "On characteristics of Markov decision processes and reinforcement learning in large domains," PhD thesis, Montréal: McGill University, School of Computer Science, 2004.
[4] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction (Adaptive Computation and Machine Learning), Cambridge: MIT Press, 1998.
[5] K. Hornik, M. Stinchcombe, and H. White, "Multilayer feedforward networks are universal approximators," Neural Networks, vol. 2, pp. 359-366, 1989.
[6] L.
Wang, "Fuzzy systems are universal approximators," in Proc. Int. Conf. Fuzzy Systems, 1992.
[7] P. Y. Glorennec and L. Jouffe, "Fuzzy Q-learning," Proc. IEEE Int. Conf. Fuzzy Systems, vol. 2, pp. 659-662, 1997.
[8] L. Jouffe, "Fuzzy inference system learning by reinforcement methods," IEEE Trans. Systems, Man, and Cybernetics, Part C, vol. 28, no. 3, pp. 338-355, 1998.
[9] M. J. Er and C. Deng, "Online tuning of fuzzy inference systems using dynamic fuzzy Q-learning," IEEE Trans. Systems, Man, and Cybernetics, Part B, vol. 34, no. 3, pp. 478-489, 2004.
[10] N. Kasabov, Foundations of Neural Networks, Fuzzy Systems and Knowledge Engineering, The MIT Press, Cambridge, MA, 1996.
[11] J. Vieira, F. M. Dias, and A. Mota, "Neuro-fuzzy systems: A survey," WSEAS Trans. on Systems, vol. 3, no. 2, April 2004.
[12] D. A. Linkens and H. O. Nyongesa, "Learning systems in intelligent control: An appraisal of fuzzy, neural and genetic algorithm control applications," IEE Proc. Control Theory and Applications, vol. 143, pp. 367-386, 1996.
[13] N. Kasabov, "Evolving fuzzy neural networks for supervised/unsupervised online knowledge-based learning," IEEE Trans. Systems, Man, and Cybernetics, Part B, vol. 31, no. 6, pp. 902-918, Dec. 2001.
[14] N. Kasabov and Q. Song, "DENFIS: Dynamic evolving neuro-fuzzy inference system and its application for time-series prediction," IEEE Trans. Fuzzy Systems, vol. 10, no. 2, pp. 144-154, April 2002.
[15] M. J. Watts, "A decade of Kasabov's evolving connectionist systems: A review," IEEE Trans. Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 39, no. 3, pp. 253-269, May 2009.