Improved Multi-Agent Reinforcement Learning for Minimizing Traffic Waiting Time
Vijay Kumar, M.T.U., India; B. Kaushik, K.E.C., M.T.U., India; H. Banka, ISM, India

ABSTRACT
This paper describes the use of a multi-agent reinforcement learning (MARL) algorithm for learning traffic patterns in order to minimize travelling time, maximize safety and optimize the traffic pattern (OTP). The model provides a description of, and a solution to, traffic-pattern optimization using multi-agent reinforcement learning algorithms. MARL uses a multi-agent structure in which vehicles and traffic signals act as agents. In this model the traffic area is divided into different traffic zones. Each zone has its own distributed agent, and these agents pass information from one zone to another through the network. The optimization objectives include the number of vehicle stops, the average waiting time and the maximum queue length at the next intersection (node). In addition, this research introduces priority control of buses and emergency vehicles into the model. The outcome of the algorithm is compared with the performance of Q-learning and temporal-difference learning. The results show a significant reduction in waiting time compared with those algorithms, so the model works more efficiently than other traffic systems.

General Terms
Learning Algorithm, Artificial Intelligence, Agent-Based Learning.

Keywords
Agent-Based System, Intelligent Traffic Signal Control, Multi-Objective Scheme, Optimization Objectives, RL, Multi-Agent System (MAS).

1. INTRODUCTION
Managing traffic in high-traffic areas is a major problem. An increasing population requires more efficient transportation systems and hence better traffic control. Even developed countries suffer high costs because of increasing road congestion. In the European Union (EU) alone, congestion costs 0.5% of the member countries' Gross Domestic Product (GDP) [11], [8], and this was expected to rise to roughly 1% of the EU's GDP by 2009 if the problem was not dealt with properly.
In 2002, the number of vehicles per thousand persons had reached 460, nearly double the 1974 figure of 232. High-traffic situations and poor driving in the EU account for up to 50% of fuel consumption on road networks, producing harmful emissions that could otherwise be reduced. Heavy traffic contributes 41% of the carbon dioxide emitted by road traffic in the EU, resulting in serious health and safety problems. To avoid the high costs imposed by these threats, urban traffic control (UTC) has to provide solutions to the problem of traffic management [11], [8]. To achieve the global UTC optimization goal, communication between vehicles and infrastructure systems may provide extra detail, and this detail can support a local view of the traffic conditions. Under medium traffic conditions, Wiering's method reduces the overall waiting time for vehicles and optimizes towards the goal. In a real traffic system, the model should consider different optimization objectives in different traffic situations; this is called the multi-agent control scheme in this paper. In the free-traffic situation, the presented model tries to minimize the overall number of vehicle stops in the traffic network. In the medium-traffic situation, it tries to minimize the waiting time as the optimal goal. In the congested-traffic condition, the main focus is queue length. The multi-agent control scheme can therefore adapt to different traffic conditions and produce a more intelligent traffic control system. This model accordingly proposes a multi-agent control strategy using MARL. Multi-objective control and parametric simulation models share a shortcoming: the traffic situation of the first node is passed to all subsequent nodes. If the first node has free traffic, this condition is propagated to all the next nodes, which does not reflect real traffic, so this model instead calculates the traffic situation individually for each node.
In the congested-traffic situation, queue spillovers must be avoided to keep the network from large-scale congestion, so the queue length must be the focus [6]. In this model such a cycle is prevented. The priority value is not fixed (e.g. at 3); it is set by the traffic-control administrator and may be 4, 5, etc. Based on this value, the model manages the green light for emergency vehicles in the traffic network. The model requires data exchange between vehicles and roadside traffic equipment, so a vehicular ad hoc network is used to build a wireless traffic information system; a distributed network is therefore helpful for developing such a system. Different researchers have chosen various artificial-intelligence algorithms and methods for optimizing traffic flow under real traffic conditions. Genetic (evolutionary) algorithms are among the most common methods introduced into traffic control systems, and routing traffic flow with a genetic algorithm has shown some improvement in traffic control. Fuzzy logic control is also useful in traffic light systems for better control of traffic flow. The performance of real traffic light systems can be improved with ideas such as lengthening the green-light period for vehicles. Another approach is to use wireless network communication between vehicles and traffic control systems to obtain traffic information; this information can be used for optimization under medium and high traffic conditions. Reinforcement learning techniques have been used in several research studies for traffic flow control and
optimization. Reinforcement learning can be applied effectively to traffic signal control to respond to frequent changes in traffic flow and to outperform traditional traffic control algorithms, helping optimality, reducing traffic delay and building a better traffic light system. The objectives of this model are minimizing travel time, maximizing safety, reducing traffic delay, increasing vehicle velocity and prioritizing emergency traffic. Since building OTP controllers by hand is a complex and tedious task, this research studies how multi-agent reinforcement learning (MARL) algorithms can be used for this goal.

2. AGENT-BASED MODEL OF TRAFFIC SYSTEM
This model uses an agent-based model to describe the practical traffic system. On the road there are two types of agents, vehicles and traffic signal controllers, called distributed agents. Traffic information is exchanged between these agents. Each traffic controller has options that prevent traffic threats and accidents: two traffic lights from opposing directions allow a vehicle to go straight ahead or turn right, while two traffic lights in the same direction of the intersection allow a vehicle to go straight ahead, turn right or turn left. When a new vehicle is added, the traffic light decisions are made and each vehicle moves to a cell if the cell is not occupied; this decision is controlled by the traffic system according to traffic conditions. Each vehicle is therefore at a traffic node (node), at a direction at the node (dir), at a position in the queue (place), and has a particular destination (des). The model uses [node, dir, place, des] to denote the state of each vehicle [7]. The main objective is optimization by reducing the waiting time, the number of stops and the traffic queue length. Reinforcement learning supports such a dynamic environment using dynamic programming. A popular approach is model-based reinforcement learning, in which the transition and reward functions are estimated from experience and then used to find a policy via planning methods such as dynamic programming.
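The state tuple [node, dir, place, des] described above can be sketched as a small data structure keyed into a Q-table. This is only an illustration of the representation, not the paper's implementation; all names and values are hypothetical.

```python
from dataclasses import dataclass

# Illustrative sketch of the vehicle state [node, dir, place, des] from
# Section 2. Field names mirror the paper's notation; types are assumptions.
@dataclass(frozen=True)  # frozen makes the state hashable, so it can key a dict
class VehicleState:
    node: int    # intersection the vehicle is currently at
    dir: int     # approach direction at that node
    place: int   # position in the queue
    des: int     # destination node

# Q-values are indexed by (state, traffic-light action), e.g. "green"/"red".
q_table: dict = {}

def q_value(state: VehicleState, action: str) -> float:
    """Return the learned value, defaulting to 0.0 for unseen pairs."""
    return q_table.get((state, action), 0.0)

s = VehicleState(node=1, dir=0, place=2, des=7)
q_table[(s, "green")] = -1.5
print(q_value(s, "green"))   # -1.5
print(q_value(s, "red"))     # 0.0 (never updated)
```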
3.1 Simple model
Figure 2 shows the learning process of an agent. At each time step the agent receives reinforcement feedback from the environment along with the current state. The goal of the agent is to create an optimal action-selection policy that maximizes the reward. In many cases, not only the immediate reward but also the subsequent (delayed) rewards should be considered when actions are taken.

Fig 2: Agent with state and action. The agent and environment interact at discrete time steps t = 0, 1, 2, ...: the agent observes state s_t ∈ S at step t, produces action a_t ∈ A, gets the resulting reward r_{t+1} ∈ R, and observes the resulting next state s_{t+1}.

Fig 1: Agent-Based Model.

This model uses Q([node, dir, place, des], action) to represent the total expected value of the optimized indices over all traffic lights for each vehicle. This process continues until the vehicle arrives at its destination. Wiering's model passes the traffic situation of the first node to all subsequent nodes: if the first node has free traffic, this condition is propagated to all the next nodes. This model instead calculates the traffic situation individually for each node, which is the most important difference between this model and Wiering's model.

3. REINFORCEMENT LEARNING FOR TRAFFIC CONTROL
Several methods for learning traffic control, such as Sarsa and Q-learning, have been developed previously. These techniques all suffer from the same problem in high-traffic conditions: in urban or congested traffic they do not scale to multi-agent reinforcement learning. In urban traffic, congestion may grow dynamically, so a dynamic method is needed to handle it. Q-learning and Sarsa have been applied only to small networks.

Fig 3: A general process model of RL [8]

3.2 Basic Elements of Reinforcement Learning
1. Model of the process
2. Reward functions
3. Learning objective
4. Controllers
5. Exploration

3.3 Multi-agent Framework
The multi-agent framework is based on the same idea as Figure 2 but, this time, there are several agents deciding on actions over the environment.
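The single-agent interaction loop of Section 3.1 can be sketched as standard tabular Q-learning. This is a minimal illustration under assumptions, not the paper's controller: the two-state toy environment, the parameter values and the epsilon-greedy rule are all invented for the example.

```python
import random

# Minimal tabular Q-learning sketch of the agent loop in Section 3.1.
# The toy environment and all parameter values are illustrative.
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1   # learning rate, discount, exploration
ACTIONS = ["green", "red"]

def step(state, action):
    """Toy dynamics: 'green' in state 0 earns reward 1 and advances."""
    if state == 0 and action == "green":
        return 1, 1.0          # (next state, reward)
    return 0, 0.0

Q = {(s, a): 0.0 for s in (0, 1) for a in ACTIONS}
state = 0
random.seed(0)
for t in range(1000):
    # epsilon-greedy action selection
    if random.random() < EPSILON:
        action = random.choice(ACTIONS)
    else:
        action = max(ACTIONS, key=lambda a: Q[(state, a)])
    nxt, reward = step(state, action)
    # TD update: move Q towards reward + discounted best next value
    best_next = max(Q[(nxt, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
    state = nxt

# After learning, the greedy policy prefers "green" in state 0
print(Q[(0, "green")] > Q[(0, "red")])
```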
The big difference lies in the fact that each agent potentially affects the environment, so actions can have different outcomes depending on what the other agents are doing. Figure 4 shows the multi-agent model (framework).
Fig 4: Multi-Agent Model

In addition to benefits owing to the distributed nature of the multi-agent solution, such as the speedup made possible by parallel computation, multiple RL agents can gain new benefits from sharing experience, e.g. by communication or teaching. Conversely, they inherit challenges from single-agent RL, including the curse of dimensionality.

4. MULTI-AGENT CONTROL ALGORITHM BASED ON REINFORCEMENT LEARNING
The multi-agent control algorithm considers three types of traffic situations: less traffic (low or free traffic), medium traffic, and congested traffic.

4.1 Free traffic condition
The number of stops increases when a vehicle moving at a green light in the current time step meets a red light in the next time step. In the free-traffic condition the main goal is to minimize the number of stops, so Q([node, dir, pos], green) is used as the expected cumulative number of stops. The formulation of Q([node, dir, pos], green) is as follows:

Q([node, dir, pos], green) = Σ_(dir', pos') P_red([node, dir', pos']) (R([node, dir, pos], [node, dir', pos']) + γ Q([node, dir', pos'], green))   (1)

where [node, dir', pos'] is the state of the vehicle in the next time step; P_red gives the probability that the traffic light turns red in the next time step; R([node, dir, pos], [node, dir', pos']) is a reward function defined as follows: if a vehicle stays at the same traffic light then R = 1, otherwise R = 0 (the vehicle gets through this intersection and enters the next one); and γ is the discount factor (0 < γ < 1), which ensures that the Q-values are bounded. The probability that a traffic light turns red is calculated as

P_red([node, dir, pos]) = C([node, dir, pos], red) / C([node, dir, pos])   (2)

where C([node, dir, pos]) is the number of times a vehicle has been in state [node, dir, pos] and C([node, dir, pos], red) is the number of times the light turned red in that state.

4.2 Medium traffic condition
In the medium traffic condition the main goal of this model is to minimize the overall waiting time of vehicles. If the number of vehicles is larger than 100 but less than 150, the traffic is considered medium.
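The empirical red-light probability of equation (2) is just a ratio of visit counts. A small sketch, with an illustrative state encoding:

```python
from collections import defaultdict

# Empirical estimate of equation (2): P_red(s) = C(s, red) / C(s),
# where C counts visits to state s and how often the light was red there.
# The tuple encoding of [node, dir, pos] is an assumption for illustration.
visits = defaultdict(int)      # C(s)
red_counts = defaultdict(int)  # C(s, red)

def observe(state, light_is_red: bool) -> None:
    """Record one visit to `state` and whether the light was red."""
    visits[state] += 1
    if light_is_red:
        red_counts[state] += 1

def p_red(state) -> float:
    """Return C(s, red)/C(s); 0.0 for never-visited states."""
    return red_counts[state] / visits[state] if visits[state] else 0.0

s = ("node1", "north", 0)            # [node, dir, pos]
for red in (True, False, True, True):
    observe(s, red)
print(p_red(s))   # 3 red observations out of 4 visits -> 0.75
```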
The value function and Q-function for the medium traffic condition are

V([node, dir, pos]) = Σ_L P(L | [node, dir, pos]) Q([node, dir, pos], L)   (3)

Q([node, dir, pos], L) = Σ_(node', pos') P([node, dir, pos], L, [node', pos']) (R([node, dir, pos], [node', pos']) + γ V([node', pos', des]))   (4)

where L is the traffic light state (red or green); P(L | [node, dir, pos]) is calculated in the same way as in equation (2); R([node, dir, pos], [node', pos']) is defined as follows: if a vehicle stays at the same traffic light then R = 1, otherwise R = 0; and the value 10 is used to force the light to green.

4.3 Congested traffic condition
In this condition, queue spillovers must be avoided, since they undermine the traffic control effect and will probably cause large-scale traffic congestion:

Q([node, dir, pos], green) = Σ_(dir', pos') P_green([node, dir', pos']) (R([node, dir, pos], [node, dir', pos']) + α R'([node, dir, pos], [node', dir', pos']) + γ V([node', dir', pos', des]))   (5)

Q([node, dir, pos], red) = Σ_(dir', pos') P_red([node, dir', pos']) (R([node, dir, pos], [node, dir', pos']) + γ V([node', dir', pos']))   (6)

where Q([·], ·) and V([·]) have the same meanings as
under the medium traffic condition. Compared with equation (4), equation (5) adds another reward function R'([node, dir, pos], [node', dir', pos']) to indicate the influence of the traffic condition at the next node, again with the value 10 used to force the light to green. R([node, dir, pos], [node', dir', pos']) is the reward for vehicle waiting time, while R'([node, dir, pos], [node', dir', pos']) indicates the reward from the change of the queue length at the next traffic node. The queue length is considered when designing the Q-learning procedure: let K_l' denote the maximum queue length at the next traffic light and L the capacity of the lane at the next traffic light; α is the adjusting factor determined by the queue length K_l' as follows:

α = 0                        if K_l' < 0.8L
α = (K_l' − 0.8L) / (0.2L)   if 0.8L ≤ K_l' < L     (7)
α = 1.2                      if K_l' ≥ L

The largest value of α is set to 1.2 in this model.

4.4 Priority Control for Emergency Vehicles
Emergency vehicles such as fire trucks, ambulances and official motorcades require the traffic lights to be managed when such conditions arise, so these types of vehicles are given high priority. The traffic administrator can manage the traffic lights according to traffic conditions: if an emergency arises, the traffic-control admin can adjust the green-light timing, i.e. set the green-light priority according to the type of vehicle. In the priority condition the main focus is managing the green light; on this basis, the present model can reduce the waiting time for emergency vehicles:

Q([node, dir, pos], green) = Σ_(dir', pos') P_green([node, dir', pos']) (R([node, dir, pos], [node', dir', pos']) + γ V([node', dir', pos', des']))   (8)

5. RESULT
In this research, 1000 time steps were used per simulation run; 2000 steps were used for the learning process, and 2000 steps were also used for the simulation results. The discount factor γ was set to 0.9 in this model. The priority weight is set according to the emergency-vehicle situation; for example, the green-light priority for a fire truck and an ambulance may differ, and it is not fixed at 3. If 100 or fewer vehicles enter the traffic network in a minute, the traffic is considered free; if the number is larger than 100 but less than 150, it is considered medium; and if it is larger than 150, it is considered congested (high traffic).
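The piecewise adjusting factor of equation (7) and the traffic-level thresholds of Section 5 translate directly into code. A minimal sketch; the function names are illustrative, not from the paper:

```python
# Sketch of the adjusting factor alpha from equation (7) and the
# traffic-level thresholds of Section 5. Names are illustrative.
def adjusting_factor(queue_len: float, capacity: float) -> float:
    """Piecewise alpha: 0 below 80% of lane capacity, rising linearly to
    1.0 at capacity, and the maximum value 1.2 once the lane is full."""
    if queue_len < 0.8 * capacity:
        return 0.0
    if queue_len < capacity:
        return (queue_len - 0.8 * capacity) / (0.2 * capacity)
    return 1.2

def traffic_level(vehicles_per_minute: int) -> str:
    """Classify traffic per the thresholds in Section 5."""
    if vehicles_per_minute <= 100:
        return "free"
    if vehicles_per_minute < 150:
        return "medium"
    return "congested"

print(adjusting_factor(70, 100))   # below 0.8L -> 0.0
print(adjusting_factor(90, 100))   # halfway through [0.8L, L) -> 0.5
print(adjusting_factor(120, 100))  # over capacity -> 1.2
print(traffic_level(120))          # medium
```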
5.1 Comparison of average waiting time
The comparison of average waiting time as the traffic volume increases rapidly is shown in Figure 5. TD means temporal difference, QL means the Q-learning algorithm, and MARL means the multi-agent reinforcement learning algorithm, the model proposed in this paper. The following tables show the data set used in TD, QL and MARL.

Table 1. Visiting points with q-capacity and q-length

Visiting Point   q-capacity   q-length
Lambeth          1000         50
Watford          500          150
West Drayton     800          100
Leatherhead      900          200
Otford           800          700
Dartford         950          200
Loughton         600          105
Aylesford        800          600

Table 2. Visitor distances

               Lambeth  Watford  WestDrayton  Leatherhead  Otford  Dartford  Loughton  Aylesford
Lambeth           0       25        30           28          -1      27         22        -1
Watford           25      0         40           -1          -1      -1         52        -1
WestDrayton       30      40        0            45          -1      -1         -1        -1
Leatherhead       28      -1        45           0           47      -1         -1        -1
Otford            -1      -1        -1           47          0       22         -1        35
Dartford          27      -1        -1           -1          22      0          32        33
Loughton          22      52        -1           -1          -1      32         0         -1
Aylesford         -1      -1        -1           -1          -1      33         -1        0

In Table 2, -1 indicates that there is no path between the two visitor nodes. The number of stops under multi-agent RL control is less than under other control strategies such as TD and Q-learning: reinforcement learning minimizes the number of stops compared with the TD and Q-learning techniques under medium and congested traffic conditions.

6. CONCLUSION
This paper presented a multi-agent control algorithm based on reinforcement learning. The simulation indicated that MARL achieved the minimum waiting time under free traffic compared with QL and TD. MARL could effectively prevent queue spillovers and so avoid large-scale traffic jams. There are still some system parameters that should be carefully determined by hand; for example, the adjusting factor α indicating the influence of the queue at the next traffic node on
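The -1 "no path" convention of Table 2 can be represented as a sparse, symmetric lookup. Only a few entries from the table are reproduced; the helper function is an illustration, not part of the paper's system:

```python
# Sketch of the Table 2 distance data with -1 as a "no path" sentinel.
# Only a subset of the table's entries is included for illustration.
NO_PATH = -1
dist = {
    ("Lambeth", "Watford"): 25,
    ("Lambeth", "WestDrayton"): 30,
    ("Dartford", "Aylesford"): 33,
}

def distance(a: str, b: str) -> int:
    """Symmetric lookup; returns -1 (NO_PATH) when no direct edge exists."""
    if a == b:
        return 0
    return dist.get((a, b), dist.get((b, a), NO_PATH))

print(distance("Watford", "Lambeth"))    # 25, via the symmetric lookup
print(distance("Lambeth", "Aylesford"))  # -1, no direct path in Table 2
```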
the waiting time of vehicles at the current light under the congested traffic condition. This is a very important parameter; further research should investigate determining it with a fuzzy-logic approach, such as crisp-to-fuzzy conversion using lambda cuts, for minimizing the traffic pattern. A neural network could also be used as a tool for detecting trends in traffic patterns and predicting the minimal waiting time for traffic.

Fig 5: Simulation between TD, QL and MARL with increasing opposite traffic length.

7. ACKNOWLEDGMENTS
First and foremost, I would like to express my sincere thanks to my paper advisor, Associate Prof. Baijnath Kaushik, for providing me with precious advice and suggestions. This model would not have been a success for me without his cooperation and valuable comments and suggestions. I also want to express my gratitude to Prof. P. S. Gill (H.O.D.) and Associate Prof. Sunita Tiwari (M.Tech. Coordinator) for their support, kind help, continued interest and inspiration during this work.

8. REFERENCES
[1] Bowling, M.: Convergence and no-regret in multiagent learning. In: L. K. Saul, Y. Weiss, L. Bottou (eds.) Advances in Neural Information Processing Systems 17, pp. 209-216. MIT Press (2005).
[2] Busoniu, L., De Schutter, B., Babuska, R.: Multiagent reinforcement learning with adaptive state focus. In: Proceedings 17th Belgian-Dutch Conference on Artificial Intelligence (BNAIC-05), pp. 35-42. Brussels, Belgium (2005).
[3] Chalkiadakis, G.: Multiagent reinforcement learning: Stochastic games with multiple learning players. Tech. rep., Dept. of Computer Science, University of Toronto, Canada (2003).
[4] Guestrin, C., Lagoudakis, M. G., Parr, R.: Coordinated reinforcement learning. In: Proceedings 19th International Conference on Machine Learning (ICML-02), pp. 227-234. Sydney, Australia (2002).
[5] Hu, J., Wellman, M. P.: Nash Q-learning for general-sum stochastic games. Journal of Machine Learning Research 4, 1039-1069 (2003).
[6] M. Wiering, et al. (2004). Intelligent Traffic Light Control. Technical Report UU-CS-2004-029, University of Utrecht.
[7] M. Wiering (2000).
Multi-Agent Reinforcement Learning for Traffic Light Control. Machine Learning: Proceedings of the 17th International Conference (ICML 2000), 1151-1158.
[8] Mitchell, T. M. (1995). Machine Learning. McGraw-Hill International Editions.
[9] Nunes, L., and Oliveira, E. C. Learning from multiple sources. In Proceedings of the 3rd International Joint Conference on Autonomous Agents and Multi Agent Systems, AAMAS (New York, USA, July 2004), vol. 3, New York, IEEE Computer Society, pp. 1106-1113.
[10] Oliveira, D., Bazzan, A. L. C., and Lesser, V. Using cooperative mediation to coordinate traffic lights: a case study. In Proceedings of the 4th International Joint Conference on Autonomous Agents and Multi Agent Systems (AAMAS) (July 2005), New York, IEEE Computer Society, pp. 463-470.
[11] Price, B., Boutilier, C.: Accelerating reinforcement learning through implicit imitation. Journal of Artificial Intelligence Research 19, 569-629 (2003).
[12] Tan, M.: Multi-agent reinforcement learning: Independent vs. cooperative agents. In: Proceedings 10th International Conference on Machine Learning (ICML-93), pp. 330-337. Amherst, US (1993).