Subgoal Discovery for Hierarchical Reinforcement Learning Using Learned Policies

Sandeep Goel and Manfred Huber
Department of Computer Science and Engineering
University of Texas at Arlington
Arlington, Texas 76019-0015
{goel, huber}@cse.uta.edu

Abstract

Reinforcement learning addresses the problem of learning to select actions in order to maximize an agent's performance in unknown environments. To scale reinforcement learning to complex real-world tasks, agents must be able to discover hierarchical structures within their learning and control systems. This paper presents a method by which a reinforcement learning agent can discover subgoals with certain structural properties. By discovering subgoals and including policies to subgoals as actions in its action set, the agent is able to explore more effectively and accelerate learning in other tasks in the same or similar environments where the same subgoals are useful. The agent discovers the subgoals by searching a learned policy model for states that exhibit certain structural properties. This approach is illustrated using gridworld tasks.

Introduction

Reinforcement learning (RL) (Kaelbling, Littman, and Moore, 1996) comprises a family of incremental algorithms that construct control policies through real-world experimentation. A key scaling problem of reinforcement learning is that in large domains an enormous number of decisions must be made. Hence, instead of learning with individual primitive actions, an agent could potentially learn much faster if it could abstract away the innumerable micro-decisions and focus instead on a small set of important decisions. This immediately raises the question of how to recognize hierarchical structures within learning and control systems and how to learn strategies for hierarchical decision making. Within the reinforcement learning paradigm, one way to do this is to introduce subgoals with their own reward functions, learn policies for achieving these subgoals, and then include these policies as actions. This strategy can facilitate skill transfer to other tasks and accelerate learning.

It is desirable that the reinforcement learning agent discover the subgoals automatically. Several researchers have proposed methods by which policies learned for a set of related tasks are examined for commonalities (Thrun and Schwartz, 1995) or are probabilistically combined to form new policies (Bernstein, 1999). However, neither of these RL methods introduces subgoals. In other work, subgoals are chosen based on the frequency with which a state was visited during policy acquisition or based on the reward obtained. Digney (Digney 1996, 1998) chooses states that are visited frequently, or states where the reward gradient is high, as subgoals. Similarly, McGovern (McGovern and Barto, 2001a) uses diverse density to discover useful subgoals automatically. However, in more complicated environments and reward structures it can be difficult to accumulate and classify the sets of successful and unsuccessful trajectories needed to compute the density measure or the frequency counts. In addition, these methods do not allow the agent to discover subgoals that are not explicitly part of the tasks used in the process of discovering them.

In this paper, the focus is on discovering subgoals by searching a learned policy model for certain structural properties. This method is able to discover subgoals even if they are not part of the successful trajectories of the policy.
If the agent can discover these subgoal states and learn policies to reach them, it can include these policies as actions and use them for effective exploration as well as to accelerate learning in other tasks in which the same subgoals are useful.

Reinforcement Learning

In the reinforcement learning framework, a learning agent interacts with an environment over a series of time steps t = 0, 1, 2, 3, ... At any instant in time the learner can observe the state of the environment, denoted by s ∈ S, and apply an action, a ∈ A. Actions change the state of the environment and also produce a scalar pay-off value (reward), denoted by r ∈ R. In a Markovian system, the next state and reward depend only on the preceding state and action, but they may depend on these in a stochastic manner. The objective of the agent is to learn to maximize the expected value of the reward received over time. It does this by learning a (possibly stochastic) mapping from states to actions called a policy, Π : S → A, i.e. a mapping from states s ∈ S to actions a ∈ A. More precisely, the objective is to choose each action so as to maximize the expected return:

    R = E[ Σ_{i=0}^∞ γ^i r_i ]    (1)

where γ ∈ [0,1) is a discount-rate parameter and r_i refers to the pay-off at time i. A common approach to this problem is to approximate the optimal state-action value function, or Q-function (Watkins, 1989), Q : S × A → R, which maps states s ∈ S and actions a ∈ A to scalar values. In particular, Q(s, a) represents the expected discounted sum of future rewards if action a is taken in state s and the optimal policy is followed afterwards. Hence Q, once learned, allows the learner to maximize R by picking actions greedily with respect to Q:

    Π(s) = argmax_{a ∈ A} Q(s, a)    (2)

The value function Q is learned on-line through experimentation. Suppose that during learning the learner executes action a in state s, which leads to a new state s' and the immediate pay-off r_{s,a}. Q-learning uses this state transition to update Q(s, a) according to:

    Q(s, a) ← (1 − α) Q(s, a) + α (r_{s,a} + γ max_{a'} Q(s', a'))    (3)

The scalar α ∈ [0,1) is the learning rate.
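To make equations 2 and 3 concrete, the following is a minimal tabular Q-learning sketch in Python. It is illustrative only: the action set, the dictionary representation of Q, and the default values of α and γ are assumptions, not details taken from the paper.

```python
# Minimal tabular Q-learning sketch (representation and hyperparameter
# values are assumptions for illustration).

ACTIONS = ["N", "E", "S", "W", "NE", "NW", "SE", "SW"]  # assumed action set

def greedy_action(Q, s):
    """Equation 2: pick the action that maximizes Q(s, a)."""
    return max(ACTIONS, key=lambda a: Q.get((s, a), 0.0))

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    """Equation 3: one Q-learning update after observing (s, a, r, s')."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in ACTIONS)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * (r + gamma * best_next)
```

Here Q is a plain dictionary keyed by (state, action) pairs; with ε-greedy exploration, the greedy choice above would be taken with probability 1 − ε and a random action otherwise.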

Subgoal Extraction

An example that shows that subgoals can be useful is a room-to-room navigation task in which the agent should discover the utility of doorways as subgoals. If the agent can recognize that a doorway is a subgoal, then it can learn a policy to reach the doorway. This policy can accelerate learning on related tasks in the same or similar environments by allowing the agent to move between rooms using single actions. The idea of using subgoals, however, is not confined to gridworlds or navigation tasks; other tasks should also benefit from subgoal discovery. For example, consider a game in which the agent must find a key to open a door before it can proceed. If it can discover that having the key is a useful subgoal, then it will more quickly be able to learn how to advance from level to level (McGovern and Barto, 2001b).

In the approach described in this paper, the focus is on discovering useful subgoals that can be defined in the agent's state space. Policies to those subgoals are then learned and added as actions. In a regular space (regular space here refers to a uniformly connected state space), every state will have approximately the same expected number of direct predecessors under a given policy, except for regions near the goal state or close to boundaries (where the space is not regular). In a regular and unconstrained space, if the counts of all the predecessors of every state under a given policy are accumulated and these counts are plotted along a chosen path, the expected curve behaves like the positive part of a quadratic, and the expected ratio of gradients along such a curve is a positive constant.

In the approach presented here, a subgoal state is a state with the following structural property: the state space trajectories originating from a significantly larger than expected number of states lead to the subgoal state, while its successor state does not have this property. Such states represent a funnel for the given policy. To identify such states it is possible to evaluate the ratio of the gradients of the count curve before and after the subgoal state. Consider a path under a given policy going through a subgoal state. The predecessors of the subgoal state along this path lie in a relatively unconstrained space and thus the count curve for those states should be quadratic. However, the dynamics change strongly at the subgoal state: there will be a strong increase in the count and the curve will become steeper as the path approaches the subgoal state. On the other hand, the increase in the count can be expected to be much lower for the successor state of the subgoal, as it again lies in a relatively unconstrained space. Thus the ratio of the gradients at this point will be high and easily distinguishable.

Let C(s) represent the count of predecessors of a state s under a given policy, and let C_t(s) be the count of predecessors that can reach s in exactly t steps:

    C_1(s) = Σ_{s' ≠ s} P(s | s', Π(s'))    (4)

    C_{t+1}(s) = Σ_{s' ≠ s} P(s | s', Π(s')) C_t(s')    (5)

    C(s) = Σ_{i=1}^n C_i(s)    (6)

where n is such that C_{n+1} = C_n or n equals the number of states, whichever is smaller. The condition s' ≠ s prevents the counting of one-step loops. P(s | s', Π(s')) is the probability of reaching state s from state s' by taking action Π(s') (in a deterministic world this probability is 1 or 0). If there are loops within the policy, the counts for the states in the loop will become very high. This implies that, if no precautions are taken, the gradient criterion used here might also identify states in such a loop as subgoals.
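As a concrete reading of equations 4-6, the sketch below accumulates the predecessor counts for a deterministic world, where P(s | s', Π(s')) is either 1 or 0 and the policy can be represented as a state-to-action table. The names `policy` and `next_state` are assumptions for illustration; the stopping rule follows the definition of n given above.

```python
def predecessor_counts(states, policy, next_state):
    """Accumulate C(s) = sum_t C_t(s) for a deterministic policy (Eqs. 4-6).

    states     : list of all states
    policy     : dict mapping state -> action (the learned policy)
    next_state : function (state, action) -> successor state
    """
    def one_step(weights):
        # C_{t+1}(s): sum of weights over predecessors s' != s whose policy
        # action leads to s (the s' != s condition avoids one-step loops)
        counts = {s: 0.0 for s in states}
        for s_prime in states:
            s = next_state(s_prime, policy[s_prime])
            if s != s_prime:
                counts[s] += weights[s_prime]
        return counts

    C_t = one_step({s: 1.0 for s in states})   # C_1
    C = dict(C_t)                              # running total C(s)
    for _ in range(len(states) - 1):           # n is at most the number of states
        C_next = one_step(C_t)
        if C_next == C_t:                      # C_{n+1} = C_n: counts have settled
            break
        for s in states:
            C[s] += C_next[s]
        C_t = C_next
    return C
```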
To calculate the ratio along a path under the given policy, let C(s_1) be the predecessor count for the initial state of the path and C(s_t) be the count for the state the agent will be in after executing t steps from the initial state. The slope of the count curve at step t can be computed as:

    Δ_t = C(s_t) − C(s_{t−1})    (7)

To identify subgoals, the gradient ratio Δ_t / Δ_{t+1} is computed if Δ_t > Δ_{t+1} (if Δ_t ≤ Δ_{t+1} then the ratio is less than 1 and the state does not fit the criterion; avoiding the computation of the ratio for such points thus saves computational effort). If the computed ratio is higher than a specified threshold, state s_t is considered a potential subgoal. The threshold depends largely on the characteristics of the state space but can often be computed independently of the particular environment.
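Putting equation 7 and the ratio test together, the identification step might be sketched as follows. The helper takes a path sampled under the learned policy and the counts C from the previous sketch; the threshold value is a placeholder, since, as noted above, it depends on the characteristics of the state space.

```python
def subgoal_candidates(path, C, threshold=10.0):
    """Return states on a path whose gradient ratio exceeds the threshold.

    path      : [s_0, s_1, ..., s_T], states visited under the learned policy
    C         : dict of predecessor counts
    threshold : placeholder value; environment-dependent in general
    """
    # Equation 7: slope of the count curve entering each state on the path
    slopes = [C[path[t]] - C[path[t - 1]] for t in range(1, len(path))]
    candidates = []
    for t in range(len(slopes) - 1):
        d_t, d_next = slopes[t], slopes[t + 1]
        # Only evaluate the ratio when the curve flattens after this state
        # (d_next > 0 additionally guards against division by zero).
        if d_t > d_next and d_next > 0 and d_t / d_next > threshold:
            candidates.append(path[t + 1])
    return candidates
```

For the gridworld experiment reported below, the ratio is about 1.4 in regular space and 95 at the subgoal state, so the exact threshold value is not critical.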

The subgoal extraction technique presented here is illustrated using a simple gridworld navigation problem. Figure 1 shows a four-room example environment on a 20x20 grid. For these experiments, the goal state was placed in the lower right portion of the grid and each trial started from the same state in the upper left corner, as shown in Figure 1.

Figure 1. Set of primitive actions (right) and gridworld (left) with the initial state in the upper left corner, the goal in the lower right portion, and a random path under the learned policy.

The action space consists of eight primitive actions (North, East, West, South, North-west, North-east, South-west, and South-east). The world is deterministic and each action succeeds in moving the agent in the chosen direction. With every action the agent receives a negative reward of -1 for a straight action and -1.2 for a diagonal action. In addition, the agent gets a reward of +10 when it reaches the goal state. The agent learns using Q-learning and ε-greedy exploration. It starts with ε = 0.9 (which means that 90% of the time it tries to explore by choosing a random action) and gradually decreases the exploration rate to 0.05.

In this experiment the predecessor count for every state is computed exhaustively using equations 4, 5, and 6. However, for large state spaces the counts can be approximated using Monte Carlo sampling methods. The agent then evaluates the ratio of gradients along the count curve by choosing random paths, and picks the states in which the ratio is higher than the specified threshold as subgoal states. For this experiment the count curve along one of the randomly chosen paths through a subgoal state is shown in Figure 2. The path chosen is indicated in Figure 1 and the subgoal state is highlighted in both Figure 1 and Figure 2. The value of the gradient ratio at step 4 (which is in regular space) is 1.444, while it is 95.0 at step 6 (which is a subgoal state).

Figure 2. Count curve along a randomly chosen path through a subgoal state under the learned policy.

To show that the gradient ratio in the unconstrained portion of the state space and at a subgoal state are easily distinguishable, histograms of the distribution of these ratios in randomly generated environments are shown in Figure 3. The histogram shows data collected from 12 randomly generated 20x20 gridworlds with randomly placed rooms and goals. Each run learns a policy model for the respective task using Q-learning and computes the counts of predecessors for every state using equations 4, 5, and 6. Gradient ratios for 40 random paths in each environment are shown in the histogram.

The subgoal states that the agent discovered in this experiment are shown in Figure 4. The subgoal state leading to the left room is identified here due to its structural properties under the policy, despite the fact that it does not lie on the successful paths between the start and the goal state. The agent did not discover the doorway of the smaller room as a subgoal state because the number of states for which the policy leads through that doorway is small compared to the other rooms, and hence the count for this state is not influenced significantly by its structural property.
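The reward scheme and primitive action set just described translate directly into code. The sketch below is a hedged reconstruction of the experimental environment: the grid size, step rewards, and goal bonus follow the text, while the wall handling and the shape of the exploration schedule are assumptions, since the paper does not specify them.

```python
import random

# Eight primitive actions as (row offset, column offset)
MOVES = {
    "N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1),
    "NE": (-1, 1), "NW": (-1, -1), "SE": (1, 1), "SW": (1, -1),
}

def gridworld_step(state, action, walls, goal, size=20):
    """One deterministic step with the paper's reward scheme.

    Straight moves cost -1, diagonal moves cost -1.2, and reaching the goal
    adds +10. Blocked moves leaving the state unchanged is an assumption;
    the paper does not describe wall collisions.
    """
    dr, dc = MOVES[action]
    nxt = (state[0] + dr, state[1] + dc)
    if nxt in walls or not (0 <= nxt[0] < size and 0 <= nxt[1] < size):
        nxt = state
    reward = -1.0 if action in ("N", "S", "E", "W") else -1.2
    if nxt == goal:
        reward += 10.0
    return nxt, reward

def epsilon_greedy(Q, state, epsilon):
    """Explore with probability epsilon, otherwise act greedily on Q."""
    if random.random() < epsilon:
        return random.choice(list(MOVES))
    return max(MOVES, key=lambda a: Q.get((state, a), 0.0))
```

In the experiments ε starts at 0.9 and is gradually reduced to 0.05; any decay schedule with those endpoints would fit the description.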

To show that the method for discovering subgoals discussed above is not confined to gridworlds or navigation tasks, random worlds with 1600 states were generated. In these worlds a fixed number of actions was available in each state, and each action in a state s connects to a randomly chosen state s' in its local neighborhood. The count metric was then established and gradient ratios were computed for these spaces with and without a subgoal. The results showed that the gradient ratios in the unconstrained portion of the state space and at a subgoal state are again easily distinguishable.

Figure 3. Histogram of the distribution of the gradient ratio in regular space (dark bars) and at subgoal states (light bars).

Figure 4. Subgoal states discovered by the agent (light gray states).

Hierarchical Policy Formation

The motivation for discovering subgoals is the effect that available policies leading to subgoals have on the agent's exploration and speed of learning in related tasks in the same or similar environments. If the agent randomly selects exploratory primitive actions, it is likely to remain within the more strongly connected regions of the state space. A policy for achieving a subgoal region, on the other hand, will tend to connect separate strongly connected areas. For example, in a room-to-room navigation task, navigation using primitive movement commands produces relatively strongly connected dynamics within each room but not between rooms. A doorway links two strongly connected regions. By adding a policy to reach a doorway subgoal, the rooms become more closely connected, which allows the agent to explore its environment more uniformly. It has been shown that this effect on exploration is one of the two main reasons that extended actions can dramatically affect learning (McGovern, 1998).

Learning policies to subgoals

To take advantage of the subgoal states, the agent uses Q-learning to learn a policy to each of the subgoals discovered in the previous step. These policies, which lead to the respective subgoal states (subgoal policies), are added to the action set of the agent.

Learning hierarchical policies

One reason that it is important for the learning agent to be able to detect subgoal states is the effect of subgoal policies on the rate of convergence to a solution. If the subgoals are useful, then learning should be accelerated. To ascertain that these subgoals help the agent improve its policy more quickly, two experiments were performed in which the agent learned a new task with and without the subgoal policies. The same 20x20 gridworld with three rooms was used to illustrate the results. Subgoal policies were included in the action set of the agent (Subg1, Subg2). The task was changed by moving the goal to the left-hand room, as shown in Figure 5. The agent solves the new task using Q-learning with an exploration rate of 5%. The action sequence under the policy learned for the new task, when its action set included the subgoal policies, is (Subg2, South-west, South, South, South, South), where Subg2 refers to the subgoal policy that leads to the state shown in Figure 5.

Figure 6 shows the learning curves when the agent was using the subgoal policies and when it was using only primitive actions. The learning performance is compared in terms of the total reward that the agent would receive under the learned policy at that point of the learning process. The curves in Figure 6 are averaged over 10 learning runs. Only an initial part of the data is plotted to compare the two learning curves; with primitives only, the agent is still learning after 150,000 learning steps, while with subgoal policies the policy has already converged. After 400,000 learning steps the agent without subgoal policies also converges to the same overall performance. The vertical intervals along the curves indicate one standard deviation in each direction at that point.

Figure 5. New task with the goal state in the left-hand room.

Figure 6. Comparison of learning speed using subgoal policies and using primitive actions only.
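One way to realize "adding subgoal policies to the action set" is to treat each learned subgoal policy as a temporally extended action that runs until its subgoal state is reached. The sketch below illustrates that reading; the class name, the step limit, and the way the accumulated reward would feed back into the higher-level Q-update are assumptions rather than details given in the paper.

```python
class SubgoalPolicy:
    """A learned policy executed as a single extended action (e.g. Subg1, Subg2)."""

    def __init__(self, name, policy, subgoal, max_steps=500):
        self.name = name            # label used in the action set
        self.policy = policy        # dict: state -> primitive action
        self.subgoal = subgoal      # state this policy funnels into
        self.max_steps = max_steps  # safety cap (assumption)

    def run(self, state, step_fn):
        """Follow the subgoal policy until the subgoal (or the step cap) is reached.

        step_fn(state, action) -> (next_state, reward) is the environment step.
        Returns the final state and the total reward accumulated along the way.
        """
        total, steps = 0.0, 0
        while state != self.subgoal and steps < self.max_steps:
            state, r = step_fn(state, self.policy[state])
            total += r
            steps += 1
        return state, total

# The higher-level learner then chooses among primitive moves and subgoal
# policies alike, e.g. (pi_1, pi_2, doorway_1, doorway_2 are hypothetical):
#   actions = list(MOVES) + [SubgoalPolicy("Subg1", pi_1, doorway_1),
#                            SubgoalPolicy("Subg2", pi_2, doorway_2)]
```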

Conclusions

This paper presents a method for discovering subgoals by searching a learned policy model for states that exhibit a funneling property. These subgoals are discovered by studying the dynamics along the predecessor count curve and can include states that are not an integral part of the initial policy. The experiments presented here show that discovering subgoals and including policies for these subgoals in the action set can significantly accelerate learning in other, related tasks. While the examples shown here are gridworld tasks, the presented approach for discovering and using subgoals is not confined to gridworlds or navigation tasks.

Acknowledgements

This work was supported in part by NSF ITR-0121297 and UTA REP/RES-Huber-CSE.

References

Bernstein, D. S. (1999). Reusing old policies to accelerate learning on new MDPs. Technical Report UM-CS-1999-026, Dept. of Computer Science, Univ. of Massachusetts, Amherst, MA.

Digney, B. (1996). Emergent hierarchical structures: Learning reactive/hierarchical relationships in reinforcement environments. From Animals to Animats 4: SAB 96. MIT Press/Bradford Books.

Digney, B. (1998). Learning hierarchical control structure for multiple tasks and changing environments. From Animals to Animats 5: SAB 98.

Kaelbling, L. P., Littman, M. L., and Moore, A. W. (1996). Reinforcement learning: A survey. Journal of Artificial Intelligence Research, Volume 4.

McGovern, A. (1998). Roles of macro-actions in accelerating reinforcement learning. Master's thesis, Univ. of Massachusetts, Amherst. Also Technical Report 98-70.

McGovern, A., and Barto, A. G. (2001a). Automatic discovery of subgoals in reinforcement learning using diverse density. Proceedings of the 18th International Conference on Machine Learning, pages 361-368.

McGovern, A., and Barto, A. G. (2001b). Accelerating reinforcement learning through the discovery of useful subgoals. Proceedings of the 6th International Symposium on Artificial Intelligence, Robotics and Automation in Space.

Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning 3: 9-44.

Sutton, R. S., and Barto, A. G. (1998). Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press.

Thrun, S. B., and Schwartz, A. (1995). Finding structure in reinforcement learning. NIPS 7, pages 385-392. San Mateo, CA: Morgan Kaufmann.

Watkins, C. J. C. H. (1989). Learning from delayed rewards. PhD thesis, Dept. of Psychology, Univ. of Cambridge.