A Novel Dynamic Target Tracking Algorithm for Image Based on Two-step Reinforcement Learning

Sensors & Transducers, Vol. 166, Issue 3, March 2014, pp. 23-28
Sensors & Transducers, © 2014 by IFSA Publishing, S. L.
http://www.sensorsportal.com
Article number P_1929

A Novel Dynamic Target Tracking Algorithm for Image Based on Two-step Reinforcement Learning

1, 2, * Xiaokun WANG, 1 Jinhua YANG, 1 Lijuan ZHANG, 1 Chenghao JIANG
1 Changchun University of Science and Technology, Changchun, Jilin, 130022, China
2 Aviation University of Air Force, Changchun, Jilin, 130022, China
* E-mail: wangxkcust@163.com

Received: 7 December 2013 / Accepted: 28 February 2014 / Published: 3 March 2014

Abstract: In this article, we model image target tracking in a reinforcement learning framework and propose a two-step reinforcement learning algorithm for target tracking. In this algorithm, multiple tracker agents are set to track the target pixel; the purpose of the reinforcement learning is to obtain a tracking strategy for every tracker agent. Each learning step of a tracker is divided into two parts: one learns the task-division strategy and the other learns the action strategy, and every tracker agent shares the experience it has learned. Simulation results illustrate the feasibility and effectiveness of the algorithm. Copyright © 2014 IFSA Publishing, S. L.

Keywords: Target tracking, Reinforcement learning, Image processing, Machine learning.

1. Introduction

With the rapid development of computer vision and image processing techniques, target tracking has been used extensively in military areas, such as military satellites, theater missile defense, reconnaissance aircraft and missile guidance, and it is also widely used in civil areas such as video monitoring. Target tracking in images is the core technology in the research of target motion analysis, and it combines image processing, pattern recognition, computer vision, automation and other disciplines. It analyzes and tracks moving objects in image sequences and then estimates the motion parameters of the target in each frame, such as its two-dimensional coordinates. Behavior understanding of the moving object can then be implemented by further processing and analysis of these motion parameters. In a word, target tracking has become one of the key research areas in the field of image processing.

There are many traditional approaches to target tracking. Gomez et al. [1] developed an aerial vehicle stabilization system based on computer vision to track a moving target on the ground. Xu et al. [2] proposed a min-max approximation method to estimate the target location for tracking. Granstrom et al. [3] presented a random-set based approach to track an unknown number of extended targets. Amodeo et al. [4] addressed single-target tracking in controlled-mobility sensor networks and proposed a method for estimating the current position of a single target. Liu et al. [5] proposed a reinforcement learning based feature selection method for target tracking. Wang et al. [6] proposed a novel distributed Kalman filter to estimate the target position and avoid collisions. Caterina et al. [7] proposed the concept of a matrix of pixel weights to preserve the structure of the target template.

Reinforcement learning is a powerful machine learning approach; it is essentially dynamic programming based on Markov decision processes, and it has become one of the most important techniques for constructing agents [8, 9]. In reinforcement learning, the agents perceive the states of the environment, choose proper actions and receive rewards under uncertainty; based on these, the agents learn an optimal action strategy [10]. Reinforcement learning solves the problem of a single agent choosing optimal behavior strategies in an environment described by a Markov decision process [11].

In this paper, we model the target tracking problem as a reinforcement learning problem and propose a two-step reinforcement learning algorithm for target tracking in images. In the algorithm, we set multiple tracker agents to track the target in the image. At each step the algorithm first assigns tasks to the tracker agents dynamically, that is, it assigns a sub-goal to each tracker agent; each tracker agent then chooses an action to move toward its sub-goal. After learning, the tracker agents have acquired optimal action strategies, so they can move to the target quickly; when all tracker agents have moved to (caught) the target, target tracking is complete.

The remainder of this paper is organized as follows. Section 2 describes how to convert the target tracking problem into a reinforcement learning problem, and Section 3 offers brief background knowledge about reinforcement learning. We propose the new two-step reinforcement learning algorithm for target tracking in Section 4, and Section 5 presents experimental results. Finally, conclusions and recommendations for future work are summarized in Section 6.

2. Problem Modeling

Taking military aircraft tracking as an example, the aim of target tracking is to calculate, for each frame of the image, the two-dimensional coordinates of the target, as shown in Fig. 1(a). We abstract and simplify the problem: find the coordinates of the central pixel of the target in the image composed of pixels, as shown in Fig. 1(b). Furthermore, we can represent the image as a two-dimensional grid, where each element of the grid represents one or more pixels of the image (this can be set flexibly according to the actual situation), and the target is represented as one cell of the grid.

Based on the above setting, target tracking can be completed by the following steps. First, we set 4 tracker agents, as shown in Fig. 2(a); each tracker agent occupies a cell in the image grid, so the 4 tracker agents form a collaborative team whose aim is to track the target pixel. The initial states of the tracker agents can be set to the four corner pixels of the image, as shown in Fig. 2(a). Tracker and target can move at most one cell per time step, and there are 5 possible actions: Up, Down, Left, Right, Standstill. A tracker moves within the image grid, and no two trackers can occupy the same cell at the same time. When the 4 tracker agents have moved to the cells adjacent to the target pixel, tracking is successfully complete; for example, Fig. 2(b) is a successful finishing status, and the pixel in the red frame is the target pixel.

Fig. 1. Abstraction of image target tracking (panels (a) and (b)).

Fig. 2. The image target tracking problem modeled as a reinforcement learning problem (panels (a) and (b)).

According to the above modeling approach, target tracking is actually a procedure in which four tracker agents pursue the target agent through teamwork. One of the most critical issues is how to design the tracking strategy of each tracker agent, so that each tracker agent can select the appropriate action to move toward the target based on the current state and its own knowledge. Moreover, target tracking requires cooperation among the tracker agents too. The problem is thus transformed into a multi-agent reinforcement learning problem: both the tracking strategy of each tracker agent and the collaboration among the multiple tracker agents can be learned via a reinforcement learning approach.
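To make the grid abstraction concrete, the following minimal Python sketch implements the setting described above (the class name, the 10×10 default grid size and the clamping behavior at the border are illustrative assumptions, not details fixed by the paper):

    ACTIONS = {"Up": (0, -1), "Down": (0, 1), "Left": (-1, 0),
               "Right": (1, 0), "Standstill": (0, 0)}

    class GridTrackingEnv:
        """Grid abstraction of Section 2: a target cell and four tracker agents."""
        def __init__(self, size=10, target=(5, 5)):
            self.size = size
            self.target = target
            # the four tracker agents start at the four corners of the grid
            self.trackers = [(0, 0), (0, size - 1), (size - 1, 0), (size - 1, size - 1)]

        def step_tracker(self, idx, action):
            """Move one tracker by at most one cell; no two trackers may share a cell."""
            dx, dy = ACTIONS[action]
            x, y = self.trackers[idx]
            nx = min(max(x + dx, 0), self.size - 1)
            ny = min(max(y + dy, 0), self.size - 1)
            if (nx, ny) not in self.trackers and (nx, ny) != self.target:
                self.trackers[idx] = (nx, ny)

        def done(self):
            """Tracking succeeds when the four cells adjacent to the target are all occupied."""
            tx, ty = self.target
            adjacent = {(tx - 1, ty), (tx + 1, ty), (tx, ty - 1), (tx, ty + 1)}
            return adjacent.issubset(set(self.trackers))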
3. Reinforcement Learning

Research on reinforcement learning focuses on how to make agents perceive and act in an environment and select the optimal sequence of actions to achieve their goals [12]. During the learning process, every action of the agent in the environment yields a reward or a punishment, and the agent learns to choose the series of actions that obtains the maximum cumulative reward.

Q-learning [13] is currently the most widely used reinforcement learning algorithm; it can learn the optimal action strategy by sampling from the environment.

The value Q(s, a) is defined as the maximum reward the agent can obtain by using action a as the first action from state s. Thus, the optimal strategy of the agent is to choose the action with the maximum Q(s, a). In order to learn the Q function, the agent repeatedly observes the current state s, chooses and executes an action a, and then observes the reward r = r(s, a) and the new state s'. The agent updates each entry of Q(s, a) with the following rule:

    Q(s, a) ← r + γ max_{a'} Q(s', a')

Q-learning therefore modifies the action strategy using the experience gained by trial and error, in order to obtain the strategy with the maximum reward. In the initial steps of learning the agent has no experience or knowledge and can only rely on trial and error; as learning proceeds, the agent accumulates knowledge and uses it to modify its action strategy [14].
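The update rule can be read directly as code. The short Python sketch below is only an illustration (the ε-greedy exploration and the table layout are our assumptions, not prescribed by the paper); it keeps a tabular Q function and applies exactly the rule above:

    import random
    from collections import defaultdict

    def make_q_table():
        # Q values default to 0 for unseen (state, action) pairs
        return defaultdict(float)

    def choose_action(Q, state, actions, epsilon=0.1):
        """Epsilon-greedy choice: explore occasionally, otherwise exploit Q."""
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    def q_update(Q, state, action, reward, next_state, actions, gamma=0.9):
        """Q(s, a) <- r + gamma * max_a' Q(s', a')."""
        Q[(state, action)] = reward + gamma * max(Q[(next_state, a)] for a in actions)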
A multi-agent system is composed of multiple autonomous agents; the system completes complex tasks and solves complex problems through collaboration among the agents [15, 16]. If a traditional single-agent reinforcement learning algorithm is applied in a multi-agent environment, it takes a long time to converge because the joint state and action space grows exponentially. To reduce the scale of the state and action space, distributed independent reinforcement learning is generally used, but it is difficult to converge to the globally optimal strategy. One reason is that traditional methods either do not divide the task among the agents at all, or divide it only once with a fixed strategy before learning, i.e. each agent is assigned a fixed sub-task that remains constant throughout learning. As a result, each agent cannot take the team's global benefit into account: the learning result is merely each agent's locally optimal strategy for its own sub-task, and it cannot adapt to dynamic changes of the environment.

4. The Proposed Approach for Image Target Tracking

Aiming at the characteristics of the image target tracking problem, and to overcome the shortcomings of traditional reinforcement learning, we propose a two-step reinforcement learning algorithm for target tracking in images. The learning procedure is divided into two stages. The first is "learning the strategy for task division", i.e. making a proper division of work among the four tracker agents so that they move to the target efficiently. The second is "learning the strategy for action selection", i.e. reinforcement learning of the action strategy of each tracker agent, so that each tracker agent can choose the most appropriate behavior (action) to move toward the target based on the current state. We now describe the two stages separately.

4.1. Learning the Strategy for Task Division

The condition for successful target tracking is that the four tracker agents simultaneously occupy the four grid cells adjacent to the target agent. Existing approaches fall into two categories. The first is to learn the tracking strategy directly without any task division; such approaches explore blindly and converge slowly. The second is to perform task division only once, before learning starts: the overall goal is decomposed into four sub-goals, namely occupying the cells above, below, to the left of and to the right of the target agent, and the four tracker agents must complete their sub-goals to achieve the overall goal. Each tracker agent is assigned a fixed sub-goal before the learning process and always pursues this fixed sub-goal during the whole learning process.

Suppose the tracker agent whose sub-goal is "occupy the cell to the left of the target" (hereafter tracker agent A) has moved below the target agent, while the tracker agent whose sub-goal is "occupy the cell below the target" (hereafter tracker agent B) has reached the left of the target agent. In this situation the original learning strategy can reach the local optimum of A and B individually, but it is obviously not the globally optimal tracking strategy, i.e. the tracking speed is low. If, at this moment, the sub-goal of tracker agent A becomes "occupy the cell below the target" and the sub-goal of tracker agent B becomes "occupy the cell to the left of the target", the target will be tracked more quickly. Therefore, in this article we propose to distribute the sub-goals dynamically: at each step, before the tracker agents choose their actions, the sub-goals are redistributed, and each tracker agent then tracks the target according to its new sub-goal. The strategy for redistributing sub-goals is itself obtained by reinforcement learning.

In order to reduce the state-action space, the state is represented by the relative position of the target agent with respect to the tracker agent. We establish a coordinate system on the two-dimensional grid: the downward direction is the positive y-axis, the direction from left to right is the positive x-axis, and the side length of a grid cell equals one unit. Each tracker agent has a sensing radius; only when the target agent is within this sensing range can the tracker agent perceive it. This can be understood as the field of view of the tracker agent: a tracker agent can only perceive the things within its range of view. When a tracker agent perceives the target agent, the relative position of the tracker and target agents is represented as the tuple (x_Tracker − x_Target, y_Tracker − y_Target), in which (x_Tracker, y_Tracker) is the current coordinate of the tracker agent and (x_Target, y_Target) is the current coordinate of the target agent.
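A small helper illustrates this perception model; the Chebyshev-style visibility test and the default radius are assumptions made for the sketch (the paper only fixes a sensing radius of 2 in the example of Fig. 3):

    def relative_state(tracker_xy, target_xy, sensing_radius=2):
        """Return (x_Tracker - x_Target, y_Tracker - y_Target) if the target
        lies inside the tracker's field of view, otherwise None."""
        dx = tracker_xy[0] - target_xy[0]
        dy = tracker_xy[1] - target_xy[1]
        if max(abs(dx), abs(dy)) <= sensing_radius:
            return (dx, dy)
        return None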

Because one sub-goal cannot be distributed to two tracker agents at the same time, a four-bit binary number is used as a mask to represent the distribution of the sub-goals Right, Down, Left, Up. For example, 1001 = 9 represents that the sub-goals Right and Up have already been distributed to some trackers, while the sub-goals Down and Left have not. The mask ranges from 0000 to 1111, corresponding to the decimal numbers 0 to 15, so it can be represented as an integer.

During the process of learning the strategy for task division (i.e. sub-goal distribution), the state of each tracker agent is represented as S1 = {x_Tracker − x_Target, y_Tracker − y_Target, mask}, and the action space is the selection among the four sub-goals, a1 = {Right, Down, Left, Up}. The Q-value of each tracker agent is updated according to the following rule:

    Q1(S1_t, a1_t) ← r1 + γ max_{a1} Q1(S1_{t+1}, a1),

where r1 is the reward. For each tracker agent, if the chosen sub-goal has already been distributed to another tracker, the distribution is unreasonable and the tracker receives a negative reward. Otherwise the squared distance between the tracker and the sub-goal cell, (x_Tracker − x_Subgoal)^2 + (y_Tracker − y_Subgoal)^2, is used to measure the quality of the distribution: if the distance between the tracker and the cell of its chosen sub-goal is smaller than its distance to every other sub-goal, i.e. the tracker agent is closer to this sub-goal than to any other, a large positive reward is given; otherwise a small positive reward is given. Because the tracking strategies of the four tracker agents can be shared, the Q-values of the tracker agents are shared to enhance the learning performance.

4.2. Learning the Strategy for Action Selection

For each tracker agent, the aim of this stage is to learn a strategy with which the tracker agent can complete its own sub-goal. The actions of the tracker agents are {Up, Down, Left, Right, Standstill}. The relative coordinate of the tracker with respect to its sub-goal is used as the state, so the state is S2 = {x_Tracker − x_Subgoal, y_Tracker − y_Subgoal}. Fig. 3 shows the state space of a tracker agent at each location when its sensing radius is 2 and its sub-goal is to occupy the cell to the left of the target agent. The action space is a2 = {Up, Down, Left, Right, Standstill}, and the Q-value of each tracker agent is updated as follows:

    Q2(S2_t, a2_t) ← r2 + γ max_{a2} Q2(S2_{t+1}, a2),

where r2 is the reward. After the tracker agent chooses an action, if it reaches its sub-goal it gets the maximum reward; if it reduces its distance to the sub-goal it gets the second highest reward; if the distance to the sub-goal is unchanged it gets a reward of 0; and if the distance to the sub-goal increases it gets a negative reward.

Fig. 3. The state space for learning the strategy for action selection.

The states of the tracker agents for different sub-goals correspond to one another. From Fig. 3 it can be seen that the states for the other three sub-goals can be converted to the state space of Fig. 3 by rotation around the target agent. For example, the state of a tracker agent whose sub-goal is occupying the cell above the target agent, at relative position (0, −1) with action Down, is equivalent to the state of a tracker agent whose sub-goal is occupying the cell to the left of the target agent, at relative position (−1, 0) with action Right, and the corresponding Q-values are also equal. Their behavior strategies can therefore be shared, and sharing the Q-function among the tracker agents improves learning efficiency.
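The rotation correspondence just described can be written as a small canonicalization helper. The sketch below is illustrative (the function names and the per-sub-goal rotation counts are our own bookkeeping, derived from the paper's coordinate convention); it maps a (state, action) pair for any sub-goal onto the "left-of-target" frame, so that a single shared Q2 table serves all four sub-goals:

    # Coordinates follow the paper's convention: x to the right, y downward.
    ROT_STEPS = {"Left": 0, "Up": 1, "Right": 2, "Down": 3}
    ACTION_VEC = {"Up": (0, -1), "Down": (0, 1), "Left": (-1, 0),
                  "Right": (1, 0), "Standstill": (0, 0)}
    VEC_ACTION = {v: k for k, v in ACTION_VEC.items()}

    def _rot90(p):
        # one quarter turn: (x, y) -> (y, -x)
        return (p[1], -p[0])

    def canonical(state, action, subgoal):
        """Map (relative state, action) for `subgoal` to the left-of-target frame."""
        s, a = state, ACTION_VEC[action]
        for _ in range(ROT_STEPS[subgoal]):
            s, a = _rot90(s), _rot90(a)
        return s, VEC_ACTION[a]

    # Example from the text: sub-goal "Up", state (0, -1), action "Down"
    # maps to state (-1, 0), action "Right" in the canonical frame.
    assert canonical((0, -1), "Down", "Up") == ((-1, 0), "Right")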
In summary, the overall framework of the algorithm for each tracker agent is as follows:

Algorithm: Target Tracking Based on Two-step Reinforcement Learning

    /* Initialization */
    Q1 ← 0, Q2 ← 0
    while not converged do
        /* Learning the strategy for task division */
        Get the current state S1_t for task division;
        Choose an action a1_t according to S1_t and Q1;
        Execute action a1_t;
        Get the sub-goal, the reward r1 and the next state S1_{t+1};
        Update Q1 with the rule:
            Q1(S1_t, a1_t) ← r1 + γ max_{a1} Q1(S1_{t+1}, a1)
        /* Learning the strategy for action selection */
        Get the state S2_t according to the current sub-goal;
        Choose an action a2_t according to S2_t and Q2;
        Execute action a2_t;
        Get the reward r2 and the next state S2_{t+1};
        Update Q2 with the rule:
            Q2(S2_t, a2_t) ← r2 + γ max_{a2} Q2(S2_{t+1}, a2)
    end while
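The per-iteration body of this loop can be sketched in Python as follows. This is an illustration under assumptions: the reward magnitudes, the ε-greedy exploration, the dictionary-based tables, and the fact that collisions, grid borders and the mask update are simplified away are our choices, not details fixed by the paper.

    import random
    from collections import defaultdict

    GAMMA = 0.9
    SUBGOALS = {"Right": (1, 0), "Down": (0, 1), "Left": (-1, 0), "Up": (0, -1)}
    MOVES = {"Up": (0, -1), "Down": (0, 1), "Left": (-1, 0),
             "Right": (1, 0), "Standstill": (0, 0)}

    Q1 = defaultdict(float)   # task-division values:   key = (S1, sub-goal)
    Q2 = defaultdict(float)   # action-selection values: key = (S2, move)

    def eps_greedy(Q, state, actions, eps=0.1):
        if random.random() < eps:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    def dist2(p, q):
        return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

    def learning_step(tracker, target, mask_taken):
        """One two-step update for a single tracker.
        tracker, target: absolute grid coordinates; mask_taken: set of sub-goals
        already claimed by other trackers at this step."""
        # --- step 1: learn the strategy for task division ---------------------
        rel = (tracker[0] - target[0], tracker[1] - target[1])
        mask = sum(1 << i for i, g in enumerate(SUBGOALS) if g in mask_taken)
        S1 = (rel, mask)
        subgoal = eps_greedy(Q1, S1, list(SUBGOALS))
        goal_cell = (target[0] + SUBGOALS[subgoal][0], target[1] + SUBGOALS[subgoal][1])
        if subgoal in mask_taken:                       # duplicated assignment
            r1 = -1.0
        else:
            others = [dist2(tracker, (target[0] + d[0], target[1] + d[1]))
                      for g, d in SUBGOALS.items() if g != subgoal]
            r1 = 1.0 if dist2(tracker, goal_cell) < min(others) else 0.2
        # --- step 2: learn the strategy for action selection ------------------
        S2 = (tracker[0] - goal_cell[0], tracker[1] - goal_cell[1])
        move = eps_greedy(Q2, S2, list(MOVES))
        new_pos = (tracker[0] + MOVES[move][0], tracker[1] + MOVES[move][1])
        before, after = dist2(tracker, goal_cell), dist2(new_pos, goal_cell)
        if new_pos == goal_cell:
            r2 = 2.0
        elif after < before:
            r2 = 1.0
        elif after == before:
            r2 = 0.0
        else:
            r2 = -1.0
        # --- Q-value updates (shared tables stand in for experience sharing;
        #     the mask is kept fixed in S1_next for simplicity) ----------------
        S1_next = ((new_pos[0] - target[0], new_pos[1] - target[1]), mask)
        Q1[(S1, subgoal)] = r1 + GAMMA * max(Q1[(S1_next, g)] for g in SUBGOALS)
        S2_next = (new_pos[0] - goal_cell[0], new_pos[1] - goal_cell[1])
        Q2[(S2, move)] = r2 + GAMMA * max(Q2[(S2_next, m)] for m in MOVES)
        return new_pos, subgoal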

5. Simulated Experiment and Results

To evaluate the use of reinforcement learning for target tracking, the simulated experiments use a 10×10 two-dimensional grid to represent the image, where each grid cell stands for one or more pixels. The tracker and target agents can only move within the 10×10 grid; the initial positions of the tracker agents are (0,0), (0,9), (9,0) and (9,9), and the initial position of the target agent is chosen at random. We carried out 6 experiments comparing the method with dynamic task division against the method without dynamic task division. In each experiment, simulated image frames are given and target tracking is performed repeatedly; the number of steps at which tracking terminates is recorded, the tracking runs are divided into groups, and the average number of steps per group is taken, so each experiment yields 10 groups of frame images. The number of tracking steps reflects the speed of the tracking algorithm. Fig. 4 shows the comparison of the number of steps for the two methods in the 6 experiments; the abscissa is the group number of the simulated image frames and the ordinate is the average number of tracking steps.

Fig. 4. Comparison of the proposed method and the traditional method in the 6 experiments (panels (a)-(f); abscissa: number of simulated image frame group; ordinate: number of tracking steps).
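For completeness, the grouping-and-averaging protocol can be expressed in a few lines of Python; the random-walk episode below is only a stand-in for the learned tracking policy, and the counts of runs and the group size are assumptions:

    import random
    import statistics

    def run_episode(size=10, max_steps=200):
        """Stand-in episode: steps until a random walker reaches a random target cell."""
        pos, target = (0, 0), (random.randrange(size), random.randrange(size))
        for step in range(1, max_steps + 1):
            pos = (min(max(pos[0] + random.choice((-1, 0, 1)), 0), size - 1),
                   min(max(pos[1] + random.choice((-1, 0, 1)), 0), size - 1))
            if pos == target:
                return step
        return max_steps

    steps = [run_episode() for _ in range(100)]              # tracking runs
    groups = [steps[i:i + 10] for i in range(0, 100, 10)]    # groups of runs
    print([round(statistics.mean(g), 1) for g in groups])    # average steps per group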

6. Conclusions and Future Work

In this article we converted target tracking in images into the reinforcement learning domain: we set multiple tracker agents to track the target agent and presented a two-step reinforcement learning algorithm for target tracking. In the algorithm, at each step dynamic task division is performed first, i.e. a sub-goal is assigned to each tracker agent; each tracker agent then chooses its action according to its current sub-goal. The learning procedure is divided into two parts, one learning the strategy for task division and the other learning the strategy for action selection, and the tracker agents share their Q-functions to enhance efficiency. The method improves efficiency to some extent, but each tracker agent still performs distributed learning separately, so it is not a thoroughly global optimal method. In future work we will concentrate on achieving deeper cooperation among the tracker agents through more interaction, so as to obtain solutions closer to the global optimum, and we will apply the method to real application domains for validation. Zhu et al. proposed several Bayesian-network oriented machine learning methods [17-20]; it is promising to use these strategies to enhance the action selection for target tracking, and this is also work we will concentrate on in the future.

References

[1]. J. E. Gomez-Balderas, G. Flores, L. R. Garcia Carrillo, Tracking a ground moving target with a quadrotor using switching control, Journal of Intelligent & Robotic Systems, Vol. 70, Issue 1-4, 2013, pp. 65-78.
[2]. Xu Enyang, Ding Zhi, Dasgupta Soura, Target tracking and mobile sensor navigation in wireless sensor networks, IEEE Transactions on Mobile Computing, Vol. 12, Issue 1, 2013, pp. 177-186.
[3]. Granstrom Karl, Orguner Umut, A PHD filter for tracking multiple extended targets using random matrices, IEEE Transactions on Signal Processing, Vol. 60, Issue 11, 2012, pp. 5657-5671.
[4]. Lionel Amodeo, Mourad Farah, Chehade Hicham, Snoussi Hichem, Controlled mobility sensor networks for target tracking using ant colony optimization, IEEE Transactions on Mobile Computing, Vol. 11, Issue 8, 2012, pp. 1261-1273.
[5]. Fang Liu, Jianbo Su, Reinforcement learning-based feature learning for object tracking, in Proceedings of the 17th International Conference on Pattern Recognition (ICPR'04), 2004, pp. 748-751.
[6]. Wang Zongyao, Gu Dongbing, Cooperative target tracking control of multiple robots, IEEE Transactions on Industrial Electronics, Vol. 59, Issue 8, 2012, pp. 3232-3240.
[7]. G. Di Caterina, J. J. Soraghan, Robust complete occlusion handling in adaptive template matching target tracking, Electronics Letters, Vol. 48, Issue 14, 2012, pp. 831-848.
[8]. T. M. Mitchell, Machine Learning, McGraw-Hill, 1997, pp. 367-384.
[9]. Changying Wang, Xiaohu Yin, Yiping Bao, Li Yao, A shared experience tuples multi-agent cooperative reinforcement learning algorithm, Pattern Recognition and Artificial Intelligence, Vol. 18, Issue 2, 2005, pp. 234-239.
[10]. Xiaohu Yin, Changying Wang, Multi-agent reinforcement learning algorithm based on decomposition, in Proceedings of the 10th CAAI Conference, November 2003.
[11]. Bo Fan, Quan Pan, Hongcai Zhang, A method for multi-agent coordination based on distributed reinforcement learning, Computer Simulation, Vol. 22, Issue 6, 2005, pp. 5-7.
[12]. Xiao Dan, Tan Ah-Hwee, Cooperative reinforcement learning in topology-based multi-agent systems, Autonomous Agents and Multi-Agent Systems, Vol. 26, Issue 1, 2013, pp. 86-119.
[13]. Changying Wang, Bo Zhang, An agent team based reinforcement learning model and its application, Journal of Computer Research and Development, Vol. 37, Issue 9, 2000, pp. 1087-1093.
[14]. Sharma Rajneesh, Matthijs T. J. Spaan, Bayesian-game-based fuzzy reinforcement learning control for decentralized POMDPs, IEEE Transactions on Computational Intelligence and AI in Games, Vol. 4, Issue 4, 2012, pp. 309-328.
[15]. Freek Stulp, Evangelos A. Theodorou, Stefan Schaal, Reinforcement learning with sequences of motion primitives for robust manipulation, IEEE Transactions on Robotics, Vol. 28, Issue 6, 2012, pp. 1360-1370.
[16]. Yong Duan, Baoxia Cui, Xinhe Xu, A multi-agent reinforcement learning approach to robot soccer, Artificial Intelligence Review, Vol. 38, Issue 3, 2012, pp. 193-211.
[17]. Yungang Zhu, Dayou Liu, Haiyang Jia, Yuxiao Huang, Structure learning of Bayesian network with bee triple-population evolution strategies, International Journal of Advancements in Computing Technology, Vol. 3, Issue 10, 2011, pp. -48.
[18]. Yungang Zhu, Dayou Liu, Haiyang Jia, A new evolutionary computation based approach for learning Bayesian network, Procedia Engineering, No. 15, 2011, pp. 26-.
[19]. Yungang Zhu, Dayou Liu, Haiyang Jia, D. Trinugroho, Incremental learning of Bayesian networks based on chaotic dual-population evolution strategies and its application to nanoelectronics, Journal of Nanoelectronics and Optoelectronics, Vol. 7, Issue 2, 2012, pp. 113-118.
[20]. Yungang Zhu, Dayou Liu, Guifen Chen, Haiyang Jia, Helong Yu, Mathematical modeling for active and dynamic diagnosis of crop diseases based on Bayesian networks and incremental learning, Mathematical and Computer Modelling, Vol. 58, Issue 3-4, 2013, pp. 514-523.

Copyright © 2014, International Frequency Sensor Association (IFSA) Publishing, S. L. All rights reserved. (http://www.sensorsportal.com)