Imitation Learning Using Graphical Models

Deepak Verma and Rajesh P.N. Rao
Dept. of Computer Science & Engineering, University of Washington, Seattle, WA, USA
{deepak,rao}@cs.washington.edu
http://neural.cs.washington.edu/

Abstract. Imitation-based learning is a general mechanism for rapid acquisition of new behaviors in autonomous agents and robots. In this paper, we propose a new approach to learning by imitation based on parameter learning in probabilistic graphical models. Graphical models are used not only to model an agent's own dynamics but also the dynamics of an observed teacher. Parameter tying between the agent-teacher models ensures consistency and facilitates learning. Given only observations of the teacher's states, we use the expectation-maximization (EM) algorithm to learn both dynamics and policies within graphical models. We present results demonstrating that EM-based imitation learning outperforms pure exploration-based learning on a benchmark problem (the FlagWorld domain). We additionally show that the graphical model representation can be leveraged to incorporate domain knowledge (e.g., state space factoring) to achieve significant speed-up in learning.

1 Introduction

Learning by imitation is a general mechanism for rapidly acquiring new skills or behaviors in humans and robots. Several approaches to imitation have previously been proposed (e.g., [1,2]). Many of these treat the problem of imitation as trajectory-following, where the goal is to follow the teacher's trajectory as closely as possible. However, imitation often involves the need to infer intentions and goals, which introduces considerable uncertainty into the problem, beyond the uncertainty already present in the observation process and in the environment. Previous models of imitation have typically not been probabilistic and are therefore not geared towards handling uncertainty. There have been some recent efforts in modeling goal-based imitation [3], but these either assume that the dynamics of the environment are given or need to learn the dynamics using a time-consuming exploration stage.
A different approach to imitation is based on ideas from the field of Reinforcement Learning (RL) [4]. In reinforcement learning, the agent is assumed to receive rewards in certain states, and the agent's goal is to learn a state-to-action mapping ("policy") that maximizes the total future expected reward. Solving an RL problem is computationally hard for a variety of reasons: (1) the state space is often exponential in the number of attributes, and (2) for

J.N. Kok et al. (Eds.): ECML 2007, LNAI 4701, pp. 757–764, 2007.
© Springer-Verlag Berlin Heidelberg 2007
uncertain environments with large state spaces, the agent needs to perform a large amount of exploration to learn a model of the environment before learning a good policy. These problems can be ameliorated by using imitation [5] (or apprenticeship [6]), where a teacher exhibits the optimal behavior that is observed by the student, or the teacher guides the student to the most important states for exploration. Price and Boutilier formulate this in the RL framework as Implicit Imitation [7], in which the student learns the dynamics of the environment by passively observing the teacher, without any explicit communication regarding what actions to take. This speeds up the learning of policies. However, these approaches rely on knowing or inferring an explicit reward function in the environment, which may not always be available or easy to infer.

In this paper, we propose a new approach to imitation that is based on probabilistic Graphical Models (GMs). We pose the problem of imitation learning as learning the parameters of the underlying GM for the mentor's and observer's behavior (we use the terms mentor/teacher and observer/student interchangeably in this paper). To facilitate the transfer of knowledge from mentor to observer, we tie the dynamics parameters of the mentor to those of the observer, and update the observer's policy using the learned mentor policy. Parameters are learned using the expectation-maximization (EM) algorithm for learning in GMs from partial data. Our approach provides a principled approach to imitation based entirely on an internal GM representation, allowing us to leverage the growing number of efficient inference and learning techniques for GMs.

2 Graphical Models for Imitation

Notation: We use capital letters for variables and lower case letters to denote specific instances. We assume there are two agents, the observer A^o and the mentor A^m, operating in the environment¹. Let Ω_S be the set of states in the environment and Ω_A the set of all possible actions available to the agent (both finite). At time t, the agent is in state S_t and executes action A_t.
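As a concrete illustration of this setup (a sketch under our own encoding choices, not the authors' implementation): states and actions can be represented as integer indices, with stochastic dynamics stored as a conditional probability table.

```python
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions = 5, 2  # toy sizes for illustration, not the FlagWorld numbers

# tau[s_next, s, a] = P(S_{t+1} = s_next | S_t = s, A_t = a);
# for each (s, a), the column over s_next is a probability distribution.
tau = np.moveaxis(rng.dirichlet(np.ones(n_states), size=(n_states, n_actions)), -1, 0)

def step(s, a):
    """Sample S_{t+1} given S_t = s and A_t = a under the stationary dynamics tau."""
    return int(rng.choice(n_states, p=tau[:, s, a]))
```

Here `step` samples one transition of the Markov chain; a trajectory is generated by iterating it with actions drawn from a policy.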
The agent's state changes in a stochastic manner, given by the transition probability P(S_{t+1} | S_t, A_t), which is assumed to be independent of t, i.e., P(S_{t+1} = s' | S_t = s, A_t = a) = τ_{s'sa}. When obvious from context, we use s for S_t = s and a for A_t = a, etc. For each state s and action a, there is a real-valued reward R^m(s, a) for the mentor (R^o(s, a) for the observer) associated with being in state s and executing the action a (with negative values denoting undesirable states or the cost of the action). The parameters described above define a Markov Decision Process (MDP) [9]. Solving an MDP typically involves computing an optimal policy a = π(s) that maximizes the total expected future reward (either a finite

¹ We use the superscript to distinguish the two agents and omit it for common variables (e.g., dynamics of the environment). For simplicity of exposition, we assume that the agents operate (non-interactively) in the same environment. However, as discussed in [8], this assumption is not essential and one can apply the techniques discussed here to the more general setting where observer and mentor(s) have different action and state spaces.
horizon cumulative reward or a discounted infinite-horizon cumulative reward) when action a is executed in state s. In a typical reinforcement learning problem, the dynamics and the reward function are not known, and one cannot therefore compute an optimal policy directly. One can learn both these functions by exploration, but this requires the agent to execute a large number of exploration steps before an optimal policy can be computed. Learning can be greatly sped up via implicit imitation [7], which involves an agent (the observer) observing another agent (the mentor) who has similar goals. The main idea is to allow the agent to quickly learn the parameters in the relevant portion of the state space, thereby cutting down on the exploration required to compute a near-optimal policy.

We assume that the mentor follows a stationary policy π^m(s), which defines its behavior completely. The observer is only able to observe the sequence of states the mentor has been in (S^m_{1:t}) and not the actions: this is important because some of the most useful forms of imitation learning are those in which the teacher's actions are not available, e.g., when a robot must learn by watching a human. In such a scenario, the robot can observe body poses but has no access to the human's actions (muscle or motor commands). The task of the observer is then to compute the best estimate of the dynamics τ̂ and mentor policy π̂^m, given its own history S^o_{1:t}, A^o_{1:t} and the mentor's state history S^m_{1:t}. Note that π^m can be completely independent of the observer's reward function R^o: in fact, the problem as formulated above does not require the introduction of a reward function at all. The goal is simply to imitate the mentor by estimating and executing the mentor's policy. In the special case where the mentor is optimizing the same reward function as the observer, π^m becomes the optimal MDP policy.
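The MDP-solving step mentioned above can be made concrete with a standard tabular value-iteration solver. This is a generic sketch under our own naming conventions; the paper does not prescribe a particular solver, and, as noted above, the imitation formulation itself needs no reward function.

```python
import numpy as np

def value_iteration(tau, R, gamma=0.99, tol=1e-8):
    """Compute an optimal policy for a tabular MDP.

    tau[s, a, s'] = P(s' | s, a); R[s, a] = immediate reward.
    Iterates the Bellman optimality backup until the value function
    changes by less than tol, then returns the greedy policy."""
    n_s, n_a, _ = tau.shape
    V = np.zeros(n_s)
    while True:
        Q = R + gamma * (tau @ V)        # Q[s, a] = R(s,a) + gamma * E[V(s')]
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return Q.argmax(axis=1), V           # greedy policy a = pi(s), and values
```

For a two-state chain where action 1 moves between states, action 0 stays put, and only staying in state 1 pays reward, the solver recovers the obvious policy: move to state 1, then stay.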
Note that since the observer cannot see the actions that the mentor took and the transition parameters are not given, the problem is different from other approaches which speed up RL via imitation [8,10].

2.1 Generative Graphical Model

Both the mentor and the observer are solving an MDP. One key observation we make is that, given the mentor policy, the action choice and dynamics can be modeled easily using a generative model based on the well-known graphical model for an MDP shown in Fig. 1(a). One does not need to know the mentor's reward model, as π^m completely explains the observed mentor state sequence. The figure shows the 2-slice representation of the Dynamic Bayesian Network (DBN) used to model the imitation problem. Since we are assuming that the two agents operate in the same environment, they have the same transition parameters (τ^m = τ^o = τ). Note that the two graphical models (for the mentor and observer, respectively) are disconnected, as the two agents are non-interacting. The mentor's actions are guided by the optimal mentor policy P(A^m_t = a | S^m_t = s) = π^m(a|s) and the observer's actions by the policy P(A^o_t = a | S^o_t = s) = π^o_t(a|s). Unlike the mentor, the observer updates its policy over time (hence the subscript t on π^o). We require only the mentor to have a stationary policy. The mentor observations s^m_{1:T} are generated by sampling the DBN. In our
Fig. 1. Model and Domain for Imitation. (a) Graphical model representation for imitation: mentor and observer chains with tied transition parameters τ_{s'sa}. (b) FlagWorld domain.

experiments, when a goal state is reached, we jump to the start state in the next step. T thus represents the total number of steps taken by the agent, which could span multiple episodes of reaching a goal state.

3 Imitation via Parameter Learning

Our approach to imitation is based on estimating the unknown parameters θ = (τ, π^m) of the graphical model in Fig. 1(a) given the observed data as evidence, i.e., θ̂ = (τ̂, π̂^m) = argmax_θ P(θ | s^m_{1:T}, s^o_{1:T}, a^o_{1:T}). Note that the evidence does not include the mentor actions A^m_{1:T}. This means that the data is incomplete, as not all nodes of the graphical model are observed. A well-known approach to learning the parameters of a GM from incomplete data [11] is to use the expectation-maximization (EM) algorithm [12]. Although any parameter learning method could be used, we use EM in the present study since it is a general-purpose, well-understood algorithm widely used in machine learning. The EM algorithm starts with an initial estimate θ^0 (chosen randomly or incorporating any prior knowledge), which is then iteratively improved by performing the following two steps:

Expectation: The current set of parameters θ^i is used to compute a distribution (expectation) over the hidden nodes: h(a^m_{1:T}) = P(a^m_{1:T} | θ^i, s^m_{1:T}, s^o_{1:T}, a^o_{1:T}). This allows the expected sufficient statistics to be computed for the complete data set.

Maximization: The distribution h is then used to compute the new parameters θ^{i+1}, which maximize the expected log-likelihood of the evidence:

θ^{i+1} = argmax_θ Σ_{a^m_{1:T}} h(a^m_{1:T}) log P(s^m_{1:T}, a^m_{1:T}, s^o_{1:T}, a^o_{1:T} | θ)

When states and actions are discrete, the new estimate can be computed by simply using the expected counts. The two steps above are performed alternately
until convergence. The method is guaranteed to improve performance in each iteration, in that the incomplete log-likelihood of the data (log P(s^m_{1:T}, s^o_{1:T}, a^o_{1:T} | θ^i)) is guaranteed to increase in every iteration and converge to a local maximum [12]. We then use the estimate θ̂ to control the observer. In particular, the observer combines the learned mentor policy π̂^m with an exploration strategy to arrive at the policy π^o_t.

3.1 Parameter Learning Results

Domain: We tested our approach on a benchmark problem known as the FlagWorld domain [13], shown in Fig. 1(b). The agent's objective is to reach the goal state G starting from the state S, picking up a subset of the three flags located at states F1, F2 and F3. It receives a reward of 1 point for each flag picked up, but rewards are discounted by a factor of γ = 0.99 at each time step until the goal is reached; the latter constraint favors shortest paths to the goal. The environment is a standard maze environment used in RL [4], in that each action (N, E, S, W) takes the agent to the intended state with high probability (0.9) and to a state perpendicular to the intended one with small probability (0.1). Probability mass that would go into a wall or outside the maze is assigned to the state in which the action was taken. This domain is interesting in that there are 264 states (33 locations, augmented with a boolean attribute for each flag picked up), resulting in a large number of parameters to be learned (1056 state-action pairs for which τ(s, a, ·) and π^m(a|s) need to be learned). However, the optimal policy path is sparse, and hence only a small subset of parameters needs to be learned to compute a near-optimal policy, making the domain ideal for demonstrating the utility of imitation as a means of speeding up RL.

Exploration versus Exploitation: We used the ε-greedy method to trade off exploration of the domain against exploitation of the current learned policy: a random action is chosen with probability ε, with ε gradually decreased over time to favor exploration initially and exploitation of the learned policy in later time steps.
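The E- and M-steps take a particularly simple form for this model: each hidden mentor action A^m_t links only the observed pair (S^m_t, S^m_{t+1}), so the posterior over actions factorizes across time steps. The following is a minimal tabular sketch of EM on the mentor chain alone; function and variable names are ours, and a small smoothing prior is added to avoid zero counts (this is not the authors' implementation).

```python
import numpy as np

def em_imitation(mentor_states, n_s, n_a, n_iters=50, seed=0):
    """EM for the imitation setting: only the mentor's state sequence
    s_1..s_T is observed; the actions are hidden.

    Learns tau[s, a, s'] = P(s' | s, a) and pi[s, a] = pi^m(a | s).
    Because action a_t couples only (s_t, s_{t+1}), the E-step posterior
    over a_t depends only on that transition."""
    rng = np.random.default_rng(seed)
    tau = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))  # tau[s, a, s']
    pi = rng.dirichlet(np.ones(n_a), size=n_s)          # pi[s, a]
    s = np.asarray(mentor_states[:-1])
    s_next = np.asarray(mentor_states[1:])
    for _ in range(n_iters):
        # E-step: gamma[t, a] = P(a_t = a | s_t, s_{t+1}, theta^i)
        gamma = pi[s] * tau[s, :, s_next]
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from expected counts
        # (the tiny uniform prior keeps unvisited rows normalizable)
        pi_new = np.full((n_s, n_a), 1e-6)
        tau_new = np.full((n_s, n_a, n_s), 1e-6)
        for t in range(len(s)):
            pi_new[s[t]] += gamma[t]
            tau_new[s[t], :, s_next[t]] += gamma[t]
        pi = pi_new / pi_new.sum(axis=1, keepdims=True)
        tau = tau_new / tau_new.sum(axis=2, keepdims=True)
    return tau, pi
```

Note that only the product Σ_a π(a|s) τ(s'|s,a) is identified by mentor observations alone; in the full model, the observer's own (state, action, next-state) triples pin down τ through the tied parameters.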
Results: The results of EM-based learning are shown in Fig. 2(a) (averaged over 5 runs). The parameters were learned in batch mode, with T increased in fixed-size increments and the reward over the final steps of each batch reported. The average reward received is shown in the top right corner. Also shown are the error in the parameters (mean absolute difference w.r.t. the true parameters³), the log-likelihood of the learned parameters, and the value function of the start state under the current estimate of the observer policy, V^{π̂^o}(S), w.r.t. the true transition parameters. The results show that the observer is able to learn the mentor policy to a high degree of accuracy, though not perfectly. The uncertain dynamics of the environment lead it to collect less reward than the mentor, as the optimal policy is not learned everywhere. An important point to note is that the error in

³ The error between uniformly random parameters and the true parameters is 1.5 for π^m and 1.75 for τ.
Fig. 2. Imitation Learning Results for FlagWorld Domain. (a) (Clockwise) Error in parameters (mean absolute difference w.r.t. true parameters), average reward received, the log-likelihood of the learned parameters, and value function of the start state V^{π̂^o}(S) w.r.t. the true transition parameters. (b) Comparison of the learned policy (ParamImit) with some popular exploration techniques (measured in terms of average discounted reward obtained per step). ParamImit outperforms all the pure exploration-based methods.

the parameters is still quite high even when the observer policy is quite good, confirming the intuition that only a small (relevant) subset of parameters needs to be learned well before the agent can start exploiting a learned policy.

Figure 2(b) compares the relative quality of the learned policy with a number of pure exploration-based techniques used in [13]. The bars represent the average discounted reward obtained per step in the 2nd stage, i.e., in the steps following an initial 1st stage of exploration. For ParamImit (our algorithm), the average is taken after far fewer exploration steps. The rightmost bar is the mentor's value. As can be seen, ParamImit outperforms all the exploration strategies with far less experience.

3.2 Factored Graphical Model

A major advantage of using a graphical-models-based approach to imitation is the ability to leverage domain knowledge to speed up learning.
For example, the number of true parameters in the FlagWorld is actually much smaller than the number learned in the previous section, since there are only 33 locations for which the transition parameters need to be learned: the dynamics are the same irrespective of which flags have been picked up. To reflect this fact, we can factor the mentor state S^m into a location L^m and a flag-status variable ("Picked Flag") PF^m, as shown in Fig. 3(a) (and similarly for the observer). This reduces the number of transition parameters significantly (from τ_{s'sa} to τ_{l'la}).
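The deterministic flag-status update implied by this factoring (entering F_i picks up flag i) can be written out as a conditional probability table. The sketch below uses a bitmask encoding for PF and our own function name; axes are ordered (pf_next, l_next, pf) by analogy with τ_{s'sa}.

```python
import numpy as np

def flag_cpt(n_flags, flag_locs, n_locs):
    """Deterministic CPT P(PF_{t+1} | L_{t+1}, PF_t).

    flag_locs[i] is the location index of flag F_i; PF is a bitmask over
    flags. Entering F_i sets bit i; any other location leaves PF unchanged."""
    n_pf = 2 ** n_flags
    cpt = np.zeros((n_pf, n_locs, n_pf))  # cpt[pf_next, l_next, pf]
    for l in range(n_locs):
        for pf in range(n_pf):
            if l in flag_locs:
                pf_next = pf | (1 << flag_locs.index(l))  # pick up flag i
            else:
                pf_next = pf                              # status unchanged
            cpt[pf_next, l, pf] = 1.0
    return cpt
```

Because every column of this CPT is a point mass, none of its entries need to be learned; only the 33-location dynamics τ_{l'la} remain as free transition parameters.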
We can incorporate domain knowledge about the flags by defining the CPT P(PF_{t+1} | L_{t+1}, PF_t) as⁴:

P(PF_{t+1} | L_{t+1}, PF_t) = δ(pf_{t+1}, pf(PF_t, i))  if L_{t+1} = F_i
                            = δ(pf_{t+1}, PF_t)         otherwise

where pf(PF_t, i) is the deterministic function which maps the old value of PF_t to one in which the i-th flag is picked up.

Fig. 3. Fast Learning using Factored Graphical Models. (a) Factored model for FlagWorld (only the mentor model is shown). (b) Results using the factored model. Note the speed-up in learning w.r.t. the unfactored case (Fig. 2(a)).

The results of EM-based parameter learning for the factored graphical model are shown in Fig. 3(b). As expected, the error in the transition parameters goes down much more rapidly than in the unfactored case (compare with Fig. 2(a)).

4 Conclusion

This paper introduces a new framework for learning by imitation based on modeling the imitation process in terms of probabilistic graphical models. Imitative policies are learned in a principled manner using the expectation-maximization (EM) algorithm. The model achieves transfer of knowledge by tying the parameters for the mentor's dynamics with those of the observer. Our results⁵ demonstrate that the mentor's policy can be estimated directly from observations of

⁴ This is a common trick used in GMs to encode deterministic domain knowledge.
⁵ Additional results are presented in the extended version of the paper, available at http://neural.cs.washington.edu/. In particular, we show how learning can be further sped up by incorporating reward information collected along the way.
Also, we demonstrate the generality of parameter learning by extending the graphical model to learn task-oriented policies.
the mentor's state sequences, and that significant speed-up in learning can be achieved by exploiting the graphical models framework to factor the state space in accordance with domain knowledge. Our current work is focused on testing the approach more exhaustively, especially in the context of robotic imitation. Not only do graphical models provide a computationally efficient framework for general imitation, they are also being used for modeling behavior [14]. An exciting prospect of using graphical models for imitation is the ease of extension to models with more abstraction, including partially observable, hierarchical, and relational models.

Acknowledgments. This material is based upon work supported by ONR, the Packard Foundation, and NSF Grants 13335 and 5.

References

1. Schaal, S.: Is imitation learning the route to humanoid robots? Trends in Cognitive Sciences 3, 233–242 (1999)
2. Dautenhahn, K., Nehaniv, C.: Imitation in Animals and Artifacts. MIT Press, Cambridge, MA (2002)
3. Verma, D., Rao, R.P.N.: Goal-based imitation as probabilistic inference over graphical models. In: NIPS 18 (2006)
4. Sutton, R.S., Barto, A.: Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA (1998)
5. Atkeson, C.G., Schaal, S.: Robot learning from demonstration. In: Proc. 14th ICML, pp. 12–20 (1997)
6. Abbeel, P., Ng, A.Y.: Apprenticeship learning via inverse reinforcement learning. In: ICML (2004)
7. Price, B., Boutilier, C.: Accelerating reinforcement learning through implicit imitation. JAIR 19, 569–629 (2003)
8. Price, B., Boutilier, C.: A Bayesian approach to imitation in reinforcement learning. In: IJCAI, pp. 712–720 (2003)
9. Boutilier, C., Dean, T., Hanks, S.: Decision-theoretic planning: Structural assumptions and computational leverage. JAIR 11, 1–94 (1999)
10. Ratliff, N.D., Bagnell, J.A., Zinkevich, M.A.: Maximum margin planning. In: ICML, pp. 729–736 (2006)
11. Heckerman, D.: A tutorial on learning with Bayesian networks. Technical report, Microsoft Research, Redmond, Washington (1995)
12.
Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 39, 1–38 (1977)
13. Dearden, R., Friedman, N., Andre, D.: Model-based Bayesian exploration. In: UAI-99, San Francisco, CA, pp. 150–159 (1999)
14. Griffiths, T.L., Tenenbaum, J.B.: Structure and strength in causal induction. Cognitive Psychology 51(4), 334–384 (2005)