Learning Perceptual Coupling for Motor Primitives

Jens Kober, Betty Mohler, Jan Peters
Max-Planck-Institute for Biological Cybernetics
Spemannstr. 38, 72076 Tuebingen, Germany
Email: {kober,mohler,jrpeters}@tuebingen.mpg.de

Abstract
Dynamic system-based motor primitives [1] have enabled robots to learn complex tasks ranging from tennis swings to locomotion. However, to date there have been only few extensions which have incorporated perceptual coupling to variables of external focus, and, furthermore, these modifications have relied upon handcrafted solutions. Humans learn how to couple their movement primitives with external variables. Clearly, such a solution is needed in robotics. In this paper, we propose an augmented version of the dynamic systems motor primitives which incorporates perceptual coupling to an external variable. The resulting perceptually driven motor primitives include the previous primitives as a special case and can inherit some of their interesting properties. We show that these motor primitives can perform complex tasks such as the Ball-in-a-Cup or Kendama task even with large variances in the initial conditions where a skilled human player would be challenged. For doing so, we initialize the motor primitives in the traditional way by imitation learning without perceptual coupling. Subsequently, we improve the motor primitives using a novel reinforcement learning method which is particularly well-suited for motor primitives.

I. INTRODUCTION

The recent introduction of motor primitives based on dynamic systems [1]-[4] has allowed both imitation learning and reinforcement learning to acquire new behaviors fast and reliably. The resulting successes have shown that it is possible to rapidly learn motor primitives for complex behaviors such as tennis swings [1], [2], T-ball batting [5], drumming [6], biped locomotion [3], [7] and even tasks with potential industrial application [8].
However, in their current form these motor primitives are generated in such a way that they are either only coupled to internal variables [1], [2] or only include manually tuned phase-locking, e.g., with an external beat [6] or between the gait-generating primitive and the contact time of the feet [3], [7]. In many human motor control tasks, more complex perceptual coupling is needed in order to perform the task. Using handcrafted coupling based on human insight will in most cases no longer suffice. In this paper, it is our goal to augment the Ijspeert-Nakanishi-Schaal approach [1], [2] of using dynamic systems as motor primitives in such a way that it includes perceptual coupling with external variables. Similar to the biokinesiological literature on motor learning (see e.g., [9]), we assume that there is an object of internal focus described by a state x and one of external focus y. The coupling between both foci usually depends on the phase of the movement and, sometimes, the coupling only exists in short phases; e.g., in a catching movement, this could be at the initiation of the movement (which is largely predictive) and during the last moment when the object is close to the hand (which is largely prospective or reactive and includes movement correction). Often, it is also important that the internal focus is in a different space than the external one. Fast movements, such as a tennis swing, always follow a similar pattern in the joint-space of the arm while the external focus is clearly on an object in Cartesian space or fovea-space. As a result, we have augmented the motor primitive framework in such a way that the coupling to the external, perceptual focus is phase-variant and both foci y and x can be in completely different spaces. Integrating the perceptual coupling requires additional function approximation and, as a result, the number of parameters of the motor primitives grows significantly. It becomes increasingly harder to manually tune these parameters to high performance, and a learning approach for perceptual coupling is needed.
The need for learning perceptual coupling in motor primitives has long been recognized in the motor primitive community [4]. However, learning perceptual coupling to an external variable is not as straightforward: it requires many trials in order to properly determine the connections from external to internal focus. It is straightforward to grasp a general movement by imitation, and a human can produce a Ball-in-a-Cup movement or a tennis swing after a single or few observed trials of a teacher, but he will never have a robust coupling to the ball. Furthermore, small differences between the kinematics of teacher and student are amplified in the perceptual coupling. This is part of the reason why perceptually driven motor primitives can be initialized by imitation learning but will usually require self-improvement by reinforcement learning. This is analogous to the case of a human learning tennis: a teacher can show a forehand but a lot of self-practice is needed for a proper tennis game.

II. AUGMENTED MOTOR PRIMITIVES WITH PERCEPTUAL COUPLING

In this section, we first introduce the general idea behind dynamic system motor primitives as suggested in [1], [2] and then show how perceptual coupling can be introduced. Subsequently, we show how the perceptual coupling can be realized by augmenting the acceleration-based framework from [4].

A. Perceptual Coupling for Motor Primitives

The basic idea in the original work of Ijspeert, Nakanishi and Schaal [1], [2] is that motor primitives can be parted into
two components, i.e., a canonical system h which drives transformed systems g_k for every considered degree of freedom k. As a result, we have a system of differential equations given by

ż = h(z), (1)
ẋ = g(x, z, w), (2)

which determines the variables of internal focus x. Here, z denotes the state of the canonical system and w the internal parameters for transforming the output of the canonical system.

Figure 1. Illustration of the behavior of the motor primitives (i) and the augmented motor primitives (ii): the canonical system drives a transformed system (position, velocity, acceleration) through a transformation function f_k for each degree of freedom, with the external variable additionally feeding into f_1, ..., f_n in the augmented case.

The schematic in Figure 2 illustrates this traditional setup in black. In Section II-B, we will discuss good choices for these dynamical systems as well as their coupling based on the most current formulation [4]. When taking an external variable y into account, there are three different ways in which this variable can influence the motor primitive system: (i) it could only influence Eq. (1), which would be appropriate for synchronization problems and phase-locking (similar as in [6], [10]); (ii) it could only affect Eq. (2), which allows the continuous modification of the current state of the system by another variable; and (iii) the combination of (i) and (ii). While (i) and (iii) are the right solution if phase-locking or synchronization are needed, the coupling in the canonical system would destroy many of the nice properties of the system and make it prohibitively hard to learn in practice. Furthermore, as we concentrate on discrete movements in this paper, we focus on case (ii), which has not been used to date. In this case, we have a modified dynamical system

ż = h(z), (3)
ẋ = ĝ(x, y, ȳ, z, v), (4)
ẏ̄ = ḡ(ȳ, z, w), (5)

where y denotes the state of the external variable, ȳ the expected state of the external variable and ẏ̄ its derivative. This architecture inherits most positive properties of the original work while allowing the incorporation of external feedback.
We will show that we can incorporate previous work with ease and that the resulting framework resembles the one in [4] while allowing the external variables to be coupled into the system.

B. Realization for Discrete Movements

The original formulation in [1], [2] was a major breakthrough as the right choice of the dynamical systems in Equations (1, 2) allows determining the stability of the movement, choosing between a rhythmic and a discrete movement, and is invariant under rescaling in both time and movement amplitude.

Figure 2. General schematic illustrating both the original motor primitive framework by [2], [4] in black and the augmentation for perceptual coupling in red.

With the right choice of function approximator (in our case locally-weighted regression), fast learning from a teacher's presentation is possible. In this section, we first discuss how the most current formulation of the motor primitives, as discussed in [4], is instantiated from Section II-A. Subsequently, we show how it can be augmented in order to incorporate perceptual coupling. While the original formulation in [1], [2] used a second-order canonical system, it has since been shown that a single first-order system suffices [4], i.e., we have ż = h(z) = −τ α_h z, which represents the phase of the trajectory. It has a time constant τ and a parameter α_h which is chosen such that the system is stable. We can now choose our internal state such that the position of degree of freedom k is given by q_k = x_{2k}, i.e., the 2k-th component of x, the velocity by q̇_k = τ x_{2k+1} = ẋ_{2k} and the acceleration by q̈_k = τ ẋ_{2k+1}. Under these assumptions, we can express the motor primitive function g in the following form

ẋ_{2k+1} = τ α_g (β_g (t_k − x_{2k}) − x_{2k+1}) + τ ((t_k − x⁰_{2k}) + a_k) f_k, (6)
ẋ_{2k} = τ x_{2k+1}. (7)

This function has the same time constant τ as the canonical system, appropriately set parameters α_g, β_g, a goal parameter t_k, an amplitude modifier a_k, and a transformation function f_k.
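Concretely, Equations (6, 7) together with the first-order canonical system can be integrated numerically. The following is a minimal sketch for a single degree of freedom; the numeric constants (α_g = 25, β_g = 6.25, a_k = 1), the decaying-phase sign convention, and the explicit Euler integration are our assumptions rather than the paper's implementation, and the transformation function is passed in as a callable:

```python
import numpy as np

def integrate_primitive(t_goal, x0, f=lambda z: 0.0, tau=1.0, alpha_h=2.0,
                        alpha_g=25.0, beta_g=6.25, a_k=1.0, dt=0.002, T=1.5):
    """Euler integration of Equations (6, 7) for one degree of freedom,
    driven by the first-order canonical system z' = -tau * alpha_h * z
    (sign convention assumed so that the phase decays to zero)."""
    z, pos, vel = 1.0, x0, 0.0        # vel corresponds to x_{2k+1}
    traj = []
    for _ in range(int(T / dt)):
        acc = tau * (alpha_g * (beta_g * (t_goal - pos) - vel)
                     + ((t_goal - x0) + a_k) * f(z))   # Eq. (6)
        vel += dt * acc
        pos += dt * tau * vel                           # Eq. (7)
        z += dt * (-tau * alpha_h * z)                  # canonical system
        traj.append(pos)
    return np.array(traj)
```

With the transformation function identically zero, the system behaves like a critically damped spring and converges to the goal t_k, which is the stability property the text refers to.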
This transformation function transforms the output of the canonical system so that the transformed system can represent complex nonlinear patterns and is given by

f_k(z) = Σ_{i=1}^{N} ψ_i(z) w_i z, (8)

where w are adjustable parameters and normalized Gaussian kernels without scaling are used, i.e.,

ψ_i(z) = exp(−h_i (z − c_i)²) / Σ_{j=1}^{N} exp(−h_j (z − c_j)²), (9)

for localizing the interaction in phase space, where we have centers c_i and widths h_i.

In order to learn a motor primitive with perceptual coupling, we need two components. First, we need to learn the normal or average behavior ȳ of the variable of external focus y, which can be represented by a single motor primitive ḡ, i.e., we can use the same type of function from Equations (2, 5) for ḡ, which is learned based on the same z and given by Equations (6, 7). Additionally, we have the system ĝ for the variable of internal focus x, which determines our actual movements and incorporates the inputs of the normal behavior ȳ as well as the current state y of the external variable. We obtain the system ĝ by inserting a modified coupling function f̂(z, y, ȳ) instead of the original f(z) in g. The function f(z) is modified in order to include perceptual coupling to y and we obtain

f̂_k(z, y, ȳ) = Σ_{i=1}^{N} ψ_i(z) ŵ_i z + Σ_{j=1}^{M} ψ̂_j(z) (κ_{jk}^T (y − ȳ) + δ_{jk}^T (ẏ − ẏ̄)),

where the ψ̂_j(z) denote Gaussian kernels as in Equation (9) with centers ĉ_j and widths ĥ_j. Note that it can be useful to set N > M in order to reduce the number of parameters. All parameters are given by v = [ŵ, κ, δ]. Here, ŵ are just the standard transformation parameters while κ_{jk} and δ_{jk} are the local coupling factors, which can be interpreted as gains acting on the difference between the desired behavior of the external variable and its actual behavior. Note that for noise-free behavior and perfect initial positions, such coupling would never play a role; thus, the approach would simplify to the original one. However, in the noisy, imperfect case, this perceptual coupling can ensure success even in extreme cases.

III. LEARNING FOR PERCEPTUALLY COUPLED MOTOR PRIMITIVES

While the transformation function f_k(z) can be learned from few or even just a single trial, this simplicity no longer transfers to learning the new function f̂_k(z, y, ȳ), as perceptual coupling requires that the coupling to an uncertain external variable is learned.
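The modified coupling function f̂_k can be sketched directly; the array shapes (κ and δ stored as M × dim(y) gain matrices) and the variable names below are our assumptions:

```python
import numpy as np

def coupled_forcing(z, y, y_bar, yd, yd_bar,
                    w_hat, centers, widths, kappa, delta, c_hat, h_hat):
    """Sketch of the modified transformation function: the usual
    normalized-kernel expansion plus phase-localized gains acting on
    the external-variable tracking error (y - y_bar, yd - yd_bar)."""
    # First sum: standard transformation term with normalized kernels.
    psi = np.exp(-widths * (z - centers) ** 2)
    base = np.dot(psi, w_hat) * z / psi.sum()
    # Second sum: coupling term with its own M normalized kernels.
    psi_hat = np.exp(-h_hat * (z - c_hat) ** 2)
    psi_hat = psi_hat / psi_hat.sum()
    gains = kappa @ (y - y_bar) + delta @ (yd - yd_bar)  # one scalar per kernel
    return base + np.dot(psi_hat, gains)
```

When the external variable follows its expected behavior exactly, the coupling term vanishes and the primitive reduces to the original formulation, which is the special-case property noted above.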
While imitation learning approaches are feasible, they require larger numbers of presentations by a teacher with very similar kinematics in order to learn the behavior sufficiently well. As an alternative, we can follow nature as our teacher and create a concerted approach of imitation and self-improvement by trial-and-error: we first have a teacher who presents several trials and, subsequently, we improve our behavior by reinforcement learning.

A. Imitation Learning with Perceptual Coupling

For imitation learning, we can largely follow the original work in [1], [2] and only need minor modifications. We also make use of locally-weighted regression in order to determine the optimal motor primitives, use the same weighting and compute the targets based on the dynamic systems. However, unlike in [1], [2], we need a bootstrapping step: we first determine the parameters for the system described by Equation (5) and, subsequently, use the learned results in the learning of the system in Equation (4). For doing so, we can compute the regression targets for the first system by taking Equation (6), replacing ȳ and ẏ̄ by samples of y and ẏ, and solving for f_k(z) as discussed in [1], [2]. A local regression yields good values for the parameters of f_k(z). Subsequently, we can perform the exact same step for f̂_k(z, y, ȳ), where only the number of variables has increased but the resulting regression follows analogously. Note, however, that while a single demonstration suffices for the parameter vectors w and ŵ, the parameters κ and δ cannot be learned by imitation as these require deviations from the nominal behavior of the external variable. Furthermore, as discussed before, pure imitation can be difficult for learning the coupling parameters as well as the best nominal behavior for a robot with kinematics different from the human's, under many different initial conditions and in the presence of significant noise. Thus, we suggest improving the policy by trial-and-error using reinforcement learning after an initial imitation.
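The bootstrapping step above (solving Equation (6) for the regression targets, then fitting the kernel weights by locally-weighted regression) might look as follows; the constants and the a_k = 1 amplitude choice are our assumptions, not the authors' implementation:

```python
import numpy as np

def imitation_targets(q, qd, qdd, t_goal, tau=1.0, alpha_g=25.0, beta_g=6.25):
    """Solve Equation (6) for f_k along a demonstration (q, qd, qdd)."""
    amp = (t_goal - q[0]) + 1.0        # (t_k - x0_{2k}) + a_k, with a_k = 1
    return (qdd / tau - tau * alpha_g * (beta_g * (t_goal - q) - qd / tau)) \
        / (tau * amp)

def fit_weights(z, f_target, centers, widths):
    """Locally-weighted regression of f_target on the phase z:
    one weighted least-squares fit per kernel."""
    w = np.zeros(len(centers))
    for i, (c, h) in enumerate(zip(centers, widths)):
        psi = np.exp(-h * (z - c) ** 2)
        w[i] = np.sum(psi * z * f_target) / (np.sum(psi * z * z) + 1e-10)
    return w
```

For a demonstration that already satisfies the unforced dynamics, the targets are zero, and a target that is exactly linear in the phase is reproduced by constant weights, which matches the linear-in-z form of Equation (8).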
B. Reinforcement Learning for Perceptually Coupled Motor Primitives

Reinforcement learning [11] of discrete motor primitives is a very specific type of learning problem where it is hard to apply generic reinforcement learning algorithms [5], [12]. For this reason, the focus of this paper is largely on domain-appropriate reinforcement learning algorithms which operate on parametrized policies for episodic control problems.

1) Reinforcement Learning Setup: When modeling our problem as a reinforcement learning problem, we always have a high-dimensional state s_t = [z, y, ȳ, x] (as a result, standard RL methods which discretize the state-space can no longer be applied), and the action a_t = [f(z) + ε_t, f̂(z, y, ȳ) + ε̂_t] is the output of our motor primitives. Here, the exploration is denoted by ε_t and ε̂_t, and we have a stochastic policy a_t ∼ π(a_t | s_t), a distribution over actions given states, with parameters θ = [w, v] ∈ ℝⁿ. After a time-step δt, the actor transfers to a state s_{t+1} and receives a reward r_t. As we are interested in learning complex motor tasks consisting of a single stroke [4], [9], we focus on finite horizons of length T with episodic restarts [11] and learn the optimal parametrized policy for such problems. The general goal in reinforcement learning is to optimize the expected return of the policy with parameters θ defined by

J(θ) = ∫_𝕋 p(τ) R(τ) dτ, (10)

where τ = [s_{1:T+1}, a_{1:T}] denotes a sequence of states s_{1:T+1} = [s_1, s_2, ..., s_{T+1}] and actions a_{1:T} = [a_1,
a_2, ..., a_T], the probability of an episode τ is denoted by p(τ) and R(τ) refers to the return of the episode τ. Using the Markov assumption, we can write the path distribution as

p(τ) = p(s_1) Π_{t=1}^{T} p(s_{t+1} | s_t, a_t) π(a_t | s_t, t),

where p(s_1) denotes the initial state distribution and p(s_{t+1} | s_t, a_t) is the next-state distribution conditioned on the last state and action. Similarly, if we assume additive, accumulated rewards, the return of a path is given by

R(τ) = (1/T) Σ_{t=1}^{T} r(s_t, a_t, s_{t+1}, t),

where r(s_t, a_t, s_{t+1}, t) denotes the immediate reward.

Figure 3. This figure shows schematic drawings of the Ball-in-a-Cup motion, the final learned robot motion as well as a motion-captured human motion. The green arrows show the directions of the momentary movements. The human cup motion was taught to the robot by imitation learning with 91 parameters for 1.5 seconds. Also see the supplementary video in the proceedings.

While episodic Reinforcement Learning (RL) problems with finite horizons are common in motor control, few methods exist in the RL literature (c.f., model-free methods such as Episodic REINFORCE [13] and the Episodic Natural Actor-Critic eNAC [5] as well as model-based methods, e.g., using differential dynamic programming [14]). In order to avoid learning complex models, we focus on model-free methods and, to reduce the number of open parameters, we use a novel reinforcement learning algorithm based on expectation-maximization. Our new algorithm is called Policy learning by Weighting Exploration with the Returns (PoWER) and can be derived from the same higher principle as previous policy gradient approaches; see [15] for details.

2) Policy learning by Weighting Exploration with the Returns (PoWER): When learning motor primitives, we intend to learn a deterministic mean policy ā = θ^T μ(s) = [f(z), f̂(z, y, ȳ)] which is linear in the parameters θ and augmented by additive exploration ε(s, t) = [ε_t, ε̂_t] in order to make model-free reinforcement learning possible. As a result, the explorative policy can be given in the form a = θ^T μ(s, t) + ε(μ(s, t)).
Previous work in [5], [12] has focused on state-independent, white Gaussian exploration, i.e., ε(μ(s, t)) ∼ N(0, Σ), and has resulted in applications such as T-ball batting [5] and operational space control [12]. However, such unstructured exploration at every step has several disadvantages: (i) it causes a large variance which grows with the number of time-steps [5], (ii) it perturbs actions too frequently, thus washing out their effects, and (iii) it can damage the system executing the trajectory. Alternatively, one could generate a form of structured, state-dependent exploration ε(μ(s, t)) = ε_t^T μ(s, t) with [ε_t]_{ij} ∼ N(0, σ_{ij}²), where the σ_{ij}² are meta-parameters of the exploration that can also be optimized. This argument results in the policy a ∼ π(a_t | s_t, t) = N(a | μ(s, t), Σ̂(s, t)). Based on the EM updates for reinforcement learning as suggested in [12], [15], we can derive the update rule

θ' = θ + E_τ{Σ_{t=1}^{T} ε_t Q^π(s_t, a_t, t)} / E_τ{Σ_{t=1}^{T} Q^π(s_t, a_t, t)}.

In order to reduce the number of trials in this on-policy scenario, we reuse the trials through importance sampling [11], [16]. To avoid the fragility that sometimes results from importance sampling in reinforcement learning, samples with very small importance weights are discarded.

IV. EVALUATION & APPLICATION

In this section, we demonstrate the effectiveness of the augmented framework for perceptually coupled motor primitives as presented in Section II and show that our concerted approach of using imitation for initialization and reinforcement learning for improvement works well in practice, particularly
with our novel PoWER algorithm from Section III. We show that this method can be used for learning a complex, real-life motor learning problem, Ball-in-a-Cup, in a physically realistic simulation of an anthropomorphic robot arm. This problem is a good benchmark for testing motor learning performance, and we show that we can learn the problem roughly at the efficiency of a young child. The algorithm successfully creates a perceptual coupling even for perturbations that are very challenging for a skilled adult player.

A. Robot Application: Ball-in-a-Cup

We have applied the presented algorithm in order to teach a physically-realistic simulation of an anthropomorphic SARCOS robot arm how to perform the traditional American children's game Ball-in-a-Cup, also known as Balero, Bilboquet or Kendama. The toy consists of a ball which is attached to a wooden cup by a string. In the initial position, the ball hangs down vertically on the string, and the player has to toss the ball into the cup by jerking his arm [17]; see Figure 3 (top) for an illustration. The state of the system is described by the Cartesian coordinates of the cup (i.e., the operational space) and the Cartesian coordinates of the ball. The actions are the cup accelerations in Cartesian coordinates, with each direction represented by a motor primitive. An operational space control law [18] is used in order to transform accelerations in the operational space of the cup into joint-space torques. All motor primitives are perturbed separately but employ the same joint reward, which is r = exp(−α(x_c − x_b)² − α(y_c − y_b)²) at the moment where the ball passes the rim of the cup with a downward direction, and r = 0 at all other times. The cup position is denoted by [x_c, y_c, z_c] ∈ ℝ³, the ball position by [x_b, y_b, z_b] ∈ ℝ³, and the scaling parameter is α = 10000. The task is quite complex as the reward is not modified solely by the movements of the cup but foremost by the movements of the ball, and the movements of the ball are very sensitive to perturbations.
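The PoWER update of Section III can be illustrated on a toy maximization problem. The sketch below replaces the importance-sampling reuse of trials with a simpler device of the same flavor, keeping only the best-returning rollouts of the current batch; the function names, the best-n heuristic, and the toy reward are all our assumptions:

```python
import numpy as np

def power_update(theta, rollouts, n_best=10):
    """Return-weighted average of the explorations of the best rollouts,
    a simplified stand-in for the expectations in the PoWER update rule."""
    best = sorted(rollouts, key=lambda r: r[1], reverse=True)[:n_best]
    num = sum(ret * eps for eps, ret in best)
    den = sum(ret for _, ret in best) + 1e-10
    return theta + num / den

# Toy usage: climb a smooth return surface peaked at theta_star.
rng = np.random.default_rng(0)
theta_star = np.array([0.5, -0.3])
theta = np.zeros(2)
for _ in range(50):
    rollouts = []
    for _ in range(20):
        eps = rng.normal(0.0, 0.1, size=2)                       # exploration
        ret = np.exp(-np.sum((theta + eps - theta_star) ** 2))   # toy return
        rollouts.append((eps, ret))
    theta = power_update(theta, rollouts)
```

Because the explorations are weighted by their returns, parameter changes that happened to yield high returns dominate the update, without any gradient computation.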
A small perturbation of the initial conditions or of the trajectory will drastically change the movement of the ball, and hence the outcome of the trial, if we do not use any form of perceptual coupling to the external variable, the ball. Due to the complexity of the task, Ball-in-a-Cup is a hard motor task even for children, who only succeed at it by observing another person playing, or by deducing from similar previously learned tasks how to maneuver the ball above the cup in such a way that it can be caught. Subsequently, a lot of improvement by trial-and-error is required until the desired solution can be achieved in practice. The child will have an initial success when the initial conditions and the executed cup trajectory fit together by chance; afterwards the child still has to practice a lot until it is able to get the ball into the cup (almost) every time and thus cancel various perturbations. Learning the necessary perceptual coupling to get the ball into the cup on a consistent basis is a hard task even for adults, as our whole lab can testify. In contrast to a tennis swing, where a human just needs to learn a goal function for the one moment the racket hits the ball, in Ball-in-a-Cup we need a complete dynamical system, as cup and ball constantly interact.

Figure 4. This figure shows the expected return for one specific perturbation of the learned policy in the Ball-in-a-Cup scenario (averaged over 3 runs with different random seeds and with the standard deviation indicated by error bars). Convergence is not uniform as the algorithm is optimizing the returns for a whole range of perturbations and not for this test case alone; thus, the return varies as the improved policy might get worse for the test case while improving over all cases. Our algorithm rapidly improves, regularly beating a hand-tuned solution after less than fifty trials and converging after approximately 600 trials. Note that this is a double-logarithmic plot and, thus, single unit changes are significant as they correspond to orders of magnitude.
Mimicking how children learn to play Ball-in-a-Cup, we first initialize the motor primitives by imitation and, subsequently, improve them by reinforcement learning in order to get an initial success. Afterwards, we also acquire the perceptual coupling by reinforcement learning. We recorded the motions of a human player using a VICON™ motion-capture setup in order to obtain an example for imitation, as shown in Figure 3 (c). The extracted cup trajectories were used to initialize the motor primitives using locally-weighted regression for imitation learning. The simulation of the Ball-in-a-Cup behavior was verified using the tracked movements. We used one of the recorded trajectories for which, when played back in simulation, the ball goes into the cup but does not pass through the center of the opening of the cup and thus does not optimize the reward. This movement is then used for initializing the motor primitives and for determining their parametric structure, where cross-validation indicates that 91 parameters per motor primitive are optimal from a bias-variance point of view. The trajectories are optimized by reinforcement learning using the PoWER algorithm on the parameters w for unperturbed initial conditions. The robot consistently succeeds at bringing the ball into the cup after approximately 60-80 iterations given no noise and perfect initial conditions. One set of the found trajectories is then used to calculate the baseline ȳ = (h − b) and ẏ̄ = (ḣ − ḃ), where h and b are the hand and ball trajectories. This set is also used to set the standard cup trajectories. Hand-tuned coupling factors work quite well for small perturbations of the initial conditions. In order to make them more robust, we use reinforcement learning with the same joint reward as before. The initial conditions (positions and velocities) of the ball are perturbed completely randomly (no PEGASUS trick), using Gaussian random values with variances set according to the desired stability region. The PoWER algorithm converges after approximately 600-800 iterations.
This is roughly comparable to the learning speed of a ten-year-old child (Figure 4). For the training, we concurrently used standard deviations of 0.01 m for x and y and of 0.1 m/s for ẋ and ẏ. The learned perceptual coupling gets the ball into the cup in all tested cases where the hand-tuned coupling was also successful. The learned coupling pushes the limits of the canceled perturbations significantly further and still performs consistently well at double the standard deviations seen during the reinforcement learning process. Figure 5 shows an example of how the visual coupling adapts the hand trajectories in order to cancel perturbations and get the ball into the cup.

Figure 5. This figure compares cup and ball trajectories (x, y, and z positions over time) with and without perceptual coupling. The trajectories for the different initial conditions are clearly distinguishable. The perceptual coupling cancels out the swinging motion of the string-and-ball pendulum. The successful trial is marked either by overlapping (x and y) or parallel (z) trajectories of ball and cup from 1.2 seconds on.

V. CONCLUSION

Perceptual coupling for motor primitives is an important topic as it results in more general and more reliable solutions, while it allows the application of the dynamic systems motor primitive framework to many other motor control problems. As manual tuning can only work in limited setups, an automatic acquisition of this perceptual coupling is essential. In this paper, we have contributed an augmented version of the motor primitive framework originally suggested by [1], [2], [4] such that it incorporates perceptual coupling while keeping a distinctly similar structure to the original approach and, thus, preserving most of the important properties. We present a concerted learning approach which relies on an initialization by imitation learning and subsequent self-improvement by reinforcement learning.
We introduce a particularly well-suited algorithm for this reinforcement learning problem called PoWER. The resulting framework works well for learning Ball-in-a-Cup on a simulated anthropomorphic SARCOS arm in setups where the original motor primitive framework would not suffice to fulfill the task.

REFERENCES

[1] A. J. Ijspeert, J. Nakanishi, and S. Schaal, "Movement imitation with nonlinear dynamical systems in humanoid robots," in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Washington, DC, May 11-15, 2002, pp. 1398-1403.
[2] A. J. Ijspeert, J. Nakanishi, and S. Schaal, "Learning attractor landscapes for learning motor primitives," in Advances in Neural Information Processing Systems (NIPS), S. Becker, S. Thrun, and K. Obermayer, Eds., vol. 15. Cambridge, MA: MIT Press, 2003, pp. 1547-1554.
[3] S. Schaal, J. Peters, J. Nakanishi, and A. J. Ijspeert, "Control, planning, learning, and imitation with dynamic movement primitives," in Proceedings of the Workshop on Bilateral Paradigms on Humans and Humanoids, IEEE 2003 International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, Oct. 27-31, 2003.
[4] S. Schaal, P. Mohajerian, and A. J. Ijspeert, "Dynamics systems vs. optimal control: a unifying view," Progress in Brain Research, vol. 165, no. 1, pp. 425-445, 2007.
[5] J. Peters and S. Schaal, "Policy gradient methods for robotics," in Proceedings of the IEEE/RSJ 2006 International Conference on Intelligent Robots and Systems (IROS), Beijing, China, 2006, pp. 2219-2225.
[6] D. Pongas, A. Billard, and S. Schaal, "Rapid synchronization and accurate phase-locking of rhythmic motor primitives," in Proceedings of the IEEE 2005 International Conference on Intelligent Robots and Systems (IROS), 2005, pp. 2911-2916.
[7] J. Nakanishi, J. Morimoto, G. Endo, G. Cheng, S. Schaal, and M. Kawato, "Learning from demonstration and adaptation of biped locomotion," Robotics and Autonomous Systems (RAS), vol. 47, no. 2-3, pp. 79-91, 2004.
[8] H. Urbanek, A.
Albu-Schäffer, and P. van der Smagt, "Learning from demonstration: repetitive movements for autonomous service robotics," in Proceedings of the IEEE/RSJ 2004 International Conference on Intelligent Robots and Systems (IROS), Sendai, Japan, 2004, pp. 3495-3500.
[9] G. Wulf, Attention and Motor Skill Learning. Champaign, IL: Human Kinetics, 2007.
[10] J. Nakanishi, J. Morimoto, G. Endo, G. Cheng, S. Schaal, and M. Kawato, "A framework for learning biped locomotion with dynamic movement primitives," in Proceedings of the IEEE-RAS International Conference on Humanoid Robots (HUMANOIDS), Santa Monica, CA, Nov. 10-12, 2004.
[11] R. Sutton and A. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998.
[12] J. Peters and S. Schaal, "Reinforcement learning for operational space control," in Proceedings of the International Conference on Robotics and Automation (ICRA), Rome, Italy, 2007.
[13] R. J. Williams, "Simple statistical gradient-following algorithms for connectionist reinforcement learning," Machine Learning, vol. 8, pp. 229-256, 1992.
[14] C. G. Atkeson, "Using local trajectory optimizers to speed up global optimization in dynamic programming," in Advances in Neural Information Processing Systems 6 (NIPS), J. E. Hanson, S. J. Moody, and R. P. Lippmann, Eds. Denver, CO, USA: Morgan Kaufmann, 1994, pp. 503-521.
[15] J. Kober and J. Peters, "Policy search for motor primitives in robotics," in Advances in Neural Information Processing Systems (NIPS), 2008.
[16] C. Andrieu, N. de Freitas, A. Doucet, and M. I. Jordan, "An introduction to MCMC for machine learning," Machine Learning, vol. 50, no. 1, pp. 5-43, 2003.
[17] Wikipedia, "Ball in a cup," June 2008. [Online]. Available: http://en.wikipedia.org/wiki/ball_in_a_cup
[18] J. Nakanishi, M. Mistry, J. Peters, and S. Schaal, "Experimental evaluation of task space position/orientation control towards compliant control for humanoid robots," in Proceedings of the IEEE/RSJ 2007 International Conference on Intelligent Robots and Systems (IROS), 2007.