Learning Perceptual Coupling for Motor Primitives


Jens Kober, Betty Mohler, Jan Peters
Max Planck Institute for Biological Cybernetics
Spemannstr. 38, 72076 Tuebingen, Germany
Email: {kober,mohler,jrpeters}@tuebingen.mpg.de

Abstract: Dynamic system-based motor primitives [1] have enabled robots to learn complex tasks ranging from tennis swings to locomotion. To date, however, there have been only few extensions which incorporate perceptual coupling to variables of external focus, and these modifications have relied upon handcrafted solutions. Humans learn how to couple their movement primitives with external variables; clearly, such a solution is needed in robotics. In this paper, we propose an augmented version of the dynamic-systems motor primitives which incorporates perceptual coupling to an external variable. The resulting perceptually driven motor primitives include the previous primitives as a special case and can inherit some of their interesting properties. We show that these motor primitives can perform complex tasks such as the Ball-in-a-Cup or Kendama task even with large variances in the initial conditions, where even a skilled human player would be challenged. For doing so, we initialize the motor primitives in the traditional way by imitation learning without perceptual coupling. Subsequently, we improve the motor primitives using a novel reinforcement learning method which is particularly well-suited for motor primitives.

I. INTRODUCTION

The recent introduction of motor primitives based on dynamic systems [1]-[4] has allowed both imitation learning and reinforcement learning to acquire new behaviors quickly and reliably. Resulting successes have shown that it is possible to rapidly learn motor primitives for complex behaviors such as tennis swings [1], [2], T-ball batting [5], drumming [6], biped locomotion [3], [7], and even tasks with potential industrial application [8]. However, in their current form these motor primitives are generated in such a way that they are either only coupled to internal variables [1], [2] or only include manually tuned phase-locking, e.g., with an external beat [6] or between the gait-generating primitive and the contact time of the feet [3], [7]. Many human motor control tasks require more complex perceptual coupling, and handcrafted coupling based on human insight will in most cases no longer suffice.

In this paper, our goal is to augment the Ijspeert-Nakanishi-Schaal approach [1], [2] of using dynamic systems as motor primitives in such a way that it includes perceptual coupling with external variables. Similar to the biokinesiological literature on motor learning (see, e.g., [9]), we assume that there is an object of internal focus described by a state x and one of external focus y. The coupling between both foci usually depends on the phase of the movement, and sometimes the coupling only exists in short phases; in a catching movement, for example, these could be the initiation of the movement (which is largely predictive) and the last moment when the object is close to the hand (which is largely prospective or reactive and includes movement correction). Often, it is also important that the internal focus is in a different space than the external one. Fast movements, such as a tennis swing, always follow a similar pattern in the joint space of the arm, while the external focus is clearly on an object in Cartesian space or fovea space. As a result, we have augmented the motor primitive framework in such a way that the coupling to the external, perceptual focus is phase-variant and both foci y and x can be in completely different spaces.
Integrating the perceptual coupling requires additional function approximation, and, as a result, the number of parameters of the motor primitives grows significantly. It becomes increasingly harder to manually tune these parameters to high performance, and a learning approach for perceptual coupling is needed. The need for learning perceptual coupling in motor primitives has long been recognized in the motor primitive community [4]. However, learning perceptual coupling to an external variable is not as straightforward: it requires many trials in order to properly determine the connections from external to internal focus. It is straightforward to grasp a general movement by imitation, and a human can produce a Ball-in-a-Cup movement or a tennis swing after one or a few observed trials of a teacher, but he will never have a robust coupling to the ball. Furthermore, small differences between the kinematics of teacher and student are amplified in the perceptual coupling. This is part of the reason why perceptually driven motor primitives can be initialized by imitation learning but will usually require self-improvement by reinforcement learning. This is analogous to a human learning tennis: a teacher can show a forehand, but a lot of self-practice is needed for a proper tennis game.

II. AUGMENTED MOTOR PRIMITIVES WITH PERCEPTUAL COUPLING

In this section, we first introduce the general idea behind dynamic-systems motor primitives as suggested in [1], [2] and show how perceptual coupling can be introduced. Subsequently, we show how the perceptual coupling can be realized by augmenting the acceleration-based framework from [4].

A. Perceptual Coupling for Motor Primitives

The basic idea in the original work of Ijspeert, Nakanishi and Schaal [1], [2] is that motor primitives can be parted into two components, i.e., a canonical system h which drives transformed systems g_k for every considered degree of freedom k. As a result, we have a system of differential equations given by

    ż = h(z),            (1)
    ẋ = g(x, z, w),      (2)

which determines the variables of internal focus x. Here, z denotes the state of the canonical system and w the internal parameters for transforming the output of the canonical system. The schematic in Figure 2 illustrates this traditional setup in black. In Section II-B, we will discuss good choices for these dynamical systems as well as their coupling, based on the most current formulation [4].

[Figure 1. Illustration of the behavior of the motor primitives (i) and the augmented motor primitives (ii): the canonical system drives transformed systems 1, ..., n through f_1, ..., f_n, each producing position, velocity, and acceleration; in the augmented version, an external variable additionally enters the transformed systems.]

[Figure 2. General schematic illustrating both the original motor primitive framework of [2], [4] in black and the augmentation for perceptual coupling in red.]

When taking an external variable y into account, there are three different ways in which this variable can influence the motor primitive system: (i) it could only influence Eq. (1), which would be appropriate for synchronization problems and phase-locking (similar to [6], [10]); (ii) it could only affect Eq. (2), which allows the continuous modification of the current state of the system by another variable; or (iii) a combination of (i) and (ii). While (i) and (iii) are the right solution if phase-locking or synchronization is needed, the coupling in the canonical system destroys many of the nice properties of the system and makes it prohibitively hard to learn in practice. Furthermore, as we concentrate on discrete movements in this paper, we focus on case (ii), which has not been used to date. In this case, we have the modified dynamical system

    ż = h(z),                    (3)
    ẋ = ĝ(x, y, ȳ, z, v),        (4)
    ȳ̇ = g(ȳ, z, w),              (5)

where y denotes the state of the external variable, ȳ the expected state of the external variable, and ȳ̇ its derivative. This architecture inherits most positive properties of the original work while allowing the incorporation of external feedback. We will show that we can incorporate previous work with ease and that the resulting framework resembles the one in [4] while allowing the external variables to be coupled into the system.

B. Realization for Discrete Movements

The original formulation in [1], [2] was a major breakthrough, as the right choice of the dynamical systems in Equations (1, 2) allows determining the stability of the movement and choosing between a rhythmic and a discrete movement, and it is invariant under rescaling in both time and movement amplitude. With the right choice of function approximator (in our case locally-weighted regression), fast learning from a teacher's presentation is possible. In this section, we first discuss how the most current formulation of the motor primitives, as discussed in [4], is instantiated from Section II-A. Subsequently, we show how it can be augmented in order to incorporate perceptual coupling.

While the original formulation in [1], [2] used a second-order canonical system, it has since been shown that a single first-order system suffices [4], i.e., we have

    ż = h(z) = −τ α_h z,

which represents the phase of the trajectory. It has a time constant τ and a parameter α_h chosen such that the system is stable. We can now choose our internal state such that the position of degree of freedom k is given by q_k = x_{2k}, i.e., the 2k-th component of x, the velocity by q̇_k = τ x_{2k+1} = ẋ_{2k}, and the acceleration by q̈_k = τ ẋ_{2k+1}.
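As a minimal illustration of this canonical system and state convention, the following sketch (Python; variable names, constants, and the Euler integration scheme are our own illustrative choices, not taken from the paper) integrates the phase numerically:

```python
import numpy as np

# Minimal sketch (our notation, illustrative constants): the first-order
# canonical system zdot = -tau * alpha_h * z decays from z = 1 towards 0
# and serves as the phase of the movement.
tau, alpha_h, dt = 1.0, 8.0, 0.002   # time constant, gain, Euler step [s]

z = np.empty(int(1.5 / dt))          # phase over a 1.5 s movement
z[0] = 1.0
for t in range(1, z.size):
    z[t] = z[t - 1] + dt * (-tau * alpha_h * z[t - 1])  # Euler step of Eq. (1)

# With the state convention above, degree of freedom k reads out as
# position q_k = x[2 * k] and velocity qd_k = tau * x[2 * k + 1].
```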
Upon these assumptions, we can express the function g of the motor primitives in the following form:

    ẋ_{2k+1} = τ α_g (β_g (g_k − x_{2k}) − x_{2k+1}) + τ ((g_k − x_{2k}^0) + a_k) f_k,   (6)
    ẋ_{2k}   = τ x_{2k+1}.                                                             (7)

This function has the same time constant τ as the canonical system, appropriately set parameters α_g, β_g, a goal parameter g_k, an amplitude modifier a_k, and a transformation function f_k. The transformation function transforms the output of the canonical system so that the transformed system can represent complex nonlinear patterns; it is given by

    f_k(z) = Σ_{i=1}^{N} ψ_i(z) w_i z,   (8)

where w are adjustable parameters, and it uses normalized Gaussian kernels without scaling, such as

    ψ_i(z) = exp(−h_i (z − c_i)²) / Σ_{j=1}^{N} exp(−h_j (z − c_j)²),   (9)

for localizing the interaction in phase space, where we have centers c_i and widths h_i.

In order to learn a motor primitive with perceptual coupling, we need two components. First, we need to learn the normal or average behavior ȳ of the variable of external focus y, which can be represented by a single motor primitive ḡ; i.e., we can use the same type of function from Equations (2, 5) for ḡ, which is learned based on the same z and given by Equations (6, 7). Additionally, we have the system ĝ for the variable of internal focus x, which determines our actual movements and incorporates as inputs the normal behavior ȳ as well as the current state y of the external variable. We obtain the system ĝ by inserting a modified coupling function f̂(z, y, ȳ) instead of the original f(z) in g. The function f(z) is modified in order to include perceptual coupling to y, and we obtain

    f̂_k(z, y, ȳ) = Σ_{i=1}^{N} ψ_i(z) ŵ_i z + Σ_{j=1}^{M} ψ̂_j(z) (κ_{jk}^T (y − ȳ) + δ_{jk}^T (ẏ − ȳ̇)),

where the ψ̂_j(z) denote Gaussian kernels as in Equation (9) with centers ĉ_j and widths ĥ_j. Note that it can be useful to set N > M in order to reduce the number of parameters. All parameters are given by v = [ŵ, κ, δ]. Here, the ŵ are just the standard transformation parameters, while κ_{jk} and δ_{jk} are local coupling factors which can be interpreted as gains acting on the difference between the desired behavior of the external variable and its actual behavior. Note that for noise-free behavior and perfect initial positions such coupling would never play a role, and the approach would thus simplify to the original one. However, in the noisy, imperfect case, this perceptual coupling can ensure success even in extreme cases.

III. LEARNING FOR PERCEPTUALLY COUPLED MOTOR PRIMITIVES

While the transformation function f_k(z) can be learned from few or even just a single trial, this simplicity no longer transfers to learning the new function f̂_k(z, y, ȳ), as perceptual coupling requires that the coupling to an uncertain external variable is learned. While imitation learning approaches are feasible, they require large numbers of presentations by a teacher with very similar kinematics in order to learn the behavior sufficiently well. As an alternative, we can follow Nature as our teacher and create a concerted approach of imitation and self-improvement by trial-and-error: a teacher first presents several trials and, subsequently, we improve our behavior by reinforcement learning.

A. Imitation Learning with Perceptual Coupling

For imitation learning, we can largely follow the original work in [1], [2] and only need minor modifications. We also make use of locally-weighted regression in order to determine the optimal motor primitives, use the same weighting, and compute the targets based on the dynamic systems. However, unlike in [1], [2], we need a bootstrapping step: we first determine the parameters for the system described by Equation (5) and, subsequently, use the learned results when learning the system in Equation (4). For doing so, we can compute the regression targets for the first system by taking Equation (6), replacing ȳ and ȳ̇ by samples of y and ẏ, and solving for f_k(z) as discussed in [1], [2]. A local regression yields good values for the parameters of f_k(z). Subsequently, we can perform the exact same step for f̂_k(z, y, ȳ), where only the number of variables has increased; the resulting regression follows analogously.
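To tie Equations (8, 9) and the coupled forcing term together, here is a compact sketch of evaluating f̂_k (Python; all names, dimensions, and values are our own illustrative choices, not the authors' code). Imitation learning as just described would fit ŵ by locally-weighted regression, while the gains κ and δ enter only through the coupling term:

```python
import numpy as np

def psi(z, centers, widths):
    """Normalized Gaussian kernels of Eq. (9), localizing the coupling in phase."""
    k = np.exp(-widths * (z - centers) ** 2)
    return k / k.sum()

# Illustrative dimensions: N kernels for the standard term, M < N for coupling.
N, M, dim_y = 10, 5, 2
c, h = np.linspace(0, 1, N), np.full(N, 100.0)         # centers c_i, widths h_i
c_hat, h_hat = np.linspace(0, 1, M), np.full(M, 50.0)  # centers c^_j, widths h^_j
w_hat = np.zeros(N)                                    # standard weights w^
kappa = np.zeros((M, dim_y))                           # gains on (y - ybar)
delta = np.zeros((M, dim_y))                           # gains on (yd - ybar_d)

def f_hat(z, y, yd, ybar, ybar_d):
    """Augmented forcing term f^_k(z, y, ybar): Eq. (8) plus perceptual coupling."""
    standard = psi(z, c, h) @ (w_hat * z)              # sum_i psi_i(z) w_i z
    deviation = kappa @ (y - ybar) + delta @ (yd - ybar_d)
    return standard + psi(z, c_hat, h_hat) @ deviation
```

Note that this regression fit determines only the standard parameters; the coupling gains are the subject of what follows.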
Note, however, that while a single demonstration suffices for the parameter vectors w and ŵ, the parameters κ and δ cannot be learned by imitation, as these require deviation from the nominal behavior of the external variable. Moreover, as discussed before, pure imitation makes it difficult to learn the coupling parameters as well as the best nominal behavior for a robot whose kinematics differ from the human's, under many different initial conditions, and in the presence of significant noise. Thus, we suggest improving the policy by trial-and-error using reinforcement learning after an initial imitation.

B. Reinforcement Learning for Perceptually Coupled Motor Primitives

Reinforcement learning [11] of discrete motor primitives is a very specific type of learning problem in which it is hard to apply generic reinforcement learning algorithms [5], [12]. For this reason, the focus of this paper is largely on domain-appropriate reinforcement learning algorithms which operate on parametrized policies for episodic control problems.

1) Reinforcement Learning Setup: When modeling our problem as a reinforcement learning problem, we always have a high-dimensional state s_t = [z, y, ȳ, x] (as a result, standard RL methods which discretize the state space can no longer be applied), and the action a_t = [f(z) + ε_t, f̂(z, y, ȳ) + ε̂_t] is the output of our motor primitives. Here, the exploration is denoted by ε_t and ε̂_t, and we can give a stochastic policy a_t ∼ π(s_t) as a distribution over the states with parameters θ = [w, v] ∈ R^n. After the next time step δt, the actor transfers to a state s_{t+1} and receives a reward r_t. As we are interested in learning complex motor tasks consisting of a single stroke [4], [9], we focus on finite horizons of length T with episodic restarts [11] and learn the optimal parametrized policy for such problems. The general goal in reinforcement learning is to optimize the expected return of the policy with parameters θ, defined by

    J(θ) = ∫_T p(τ) R(τ) dτ,   (10)

where τ = [s_{1:T+1}, a_{1:T}] denotes a sequence of states s_{1:T+1} = [s_1, s_2, ..., s_{T+1}] and actions a_{1:T} = [a_1, a_2, ..., a_T], the probability of an episode τ is denoted by p(τ), and R(τ) refers to the return of episode τ. Using the Markov assumption, we can write the path distribution as

    p(τ) = p(s_1) ∏_{t=1}^{T} p(s_{t+1} | s_t, a_t) π(a_t | s_t, t),

where p(s_1) denotes the initial state distribution and p(s_{t+1} | s_t, a_t) the next-state distribution conditioned on the last state and action. Similarly, if we assume additive, accumulated rewards, the return of a path is given by

    R(τ) = (1/T) Σ_{t=1}^{T} r(s_t, a_t, s_{t+1}, t),

where r(s_t, a_t, s_{t+1}, t) denotes the immediate reward.

While episodic reinforcement learning (RL) problems with finite horizons are common in motor control, few methods exist in the RL literature (cf. model-free methods such as Episodic REINFORCE [13] and the Episodic Natural Actor-Critic eNAC [5], as well as model-based methods, e.g., using differential dynamic programming [14]). In order to avoid learning complex models, we focus on model-free methods and, to reduce the number of open parameters, we use a novel reinforcement learning algorithm based on expectation-maximization. Our new algorithm is called Policy learning by Weighting Exploration with the Returns (PoWER) and can be derived from the same higher principle as previous policy gradient approaches; see [15] for details.

2) Policy learning by Weighting Exploration with the Returns (PoWER): When learning motor primitives, we intend to learn a deterministic mean policy ā_t = θ^T μ(s_t, t) = [f(z), f̂(z, y, ȳ)] which is linear in the parameters θ and augmented by additive exploration ε(s_t, t) in order to make model-free reinforcement learning possible. As a result, the explorative policy can be given in the form a_t = θ^T μ(s_t, t) + ε(μ(s_t, t)). Previous work in [5], [12] has focused on state-independent, white Gaussian exploration, i.e., ε(μ(s_t, t)) ∼ N(0, Σ), and has resulted in applications such as T-ball batting [5] and operational space control [12]. However, such unstructured exploration at every step has several disadvantages: (i) it causes a large variance which grows with the number of time steps [5], (ii) it perturbs actions too frequently, thus washing out their effects, and (iii) it can damage the system executing the trajectory. Alternatively, one can generate a form of structured, state-dependent exploration ε(μ(s_t, t)) = ε_t^T μ(s_t, t) with [ε_t]_{ij} ∼ N(0, σ_{ij}²), where the σ_{ij}² are meta-parameters of the exploration that can also be optimized. This argument results in the policy a_t ∼ π(a_t | s_t, t) = N(a | μ(s_t, t), Σ̂(s_t, t)). Based on the EM updates for reinforcement learning as suggested in [12], [15], we can derive the update rule

    θ′ = θ + E_τ{ Σ_{t=1}^{T} ε_t Q^π(s_t, a_t, t) } / E_τ{ Σ_{t=1}^{T} Q^π(s_t, a_t, t) }.

In order to reduce the number of trials in this on-policy scenario, we reuse the trials through importance sampling [11], [16]. To avoid the fragility sometimes resulting from importance sampling in reinforcement learning, samples with very small importance weights are discarded.
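To make the update rule concrete, the following is a schematic sketch (Python; our own notation, and a common simplification that treats the exploration offset as constant per rollout rather than per time step; this is not the authors' implementation):

```python
import numpy as np

def power_update(theta, eps, q):
    """Return-weighted PoWER-style update (simplified sketch).
    theta: (n,) policy parameters; eps: (R, n) exploration offsets of the
    R retained rollouts; q: (R,) accumulated Q-values per rollout."""
    q = np.asarray(q, dtype=float)
    return theta + (q[:, None] * np.asarray(eps)).sum(axis=0) / q.sum()

# Illustrative use: importance sampling is mimicked here by keeping only
# the rollouts with the largest returns and discarding low-weight samples.
rng = np.random.default_rng(0)
theta = np.zeros(91)                        # e.g., one weight per parameter
eps = rng.normal(0.0, 0.1, size=(20, 91))   # exploration used in 20 rollouts
q = rng.random(20)                          # stand-in returns for this sketch
keep = np.argsort(q)[-10:]
theta = power_update(theta, eps[keep], q[keep])
```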

IV. EVALUATION & APPLICATION

In this section, we demonstrate the effectiveness of the augmented framework for perceptually coupled motor primitives presented in Section II and show that our concerted approach of using imitation for initialization and reinforcement learning for improvement works well in practice, particularly with our novel PoWER algorithm from Section III. We show that this method can be used to learn a complex, real-life motor learning problem, Ball-in-a-Cup, in a physically realistic simulation of an anthropomorphic robot arm. This problem is a good benchmark for testing motor learning performance, and we show that we can learn the task at roughly the efficiency of a young child. The algorithm successfully creates a perceptual coupling even to perturbations that are very challenging for a skilled adult player.

A. Robot Application: Ball-in-a-Cup

We have applied the presented algorithm to teach a physically-realistic simulation of an anthropomorphic SARCOS robot arm how to perform the traditional American children's game Ball-in-a-Cup, also known as Balero, Bilboquet, or Kendama. The toy consists of a ball which is attached to a wooden cup by a string. The initial position is the ball hanging down vertically on the string, and the player has to toss the ball into the cup by jerking his arm [17]; see Figure 3 (top) for an illustration.

[Figure 3. Schematic drawings of the Ball-in-a-Cup motion, the final learned robot motion, and a motion-captured human motion. The green arrows show the directions of the momentary movements. The human cup motion was taught to the robot by imitation learning with 91 parameters for 1.5 seconds. Also see the supplementary video in the proceedings.]

The state of the system is described by the Cartesian coordinates of the cup (i.e., the operational space) and the Cartesian coordinates of the ball. The actions are the cup accelerations in Cartesian coordinates, with each direction represented by one motor primitive. An operational space control law [18] is used to transform accelerations in the operational space of the cup into joint-space torques. All motor primitives are perturbed separately but employ the same joint reward, which is

    r_t = exp(−α (x_c − x_b)² − α (y_c − y_b)²)

at the moment where the ball passes the rim of the cup in a downward direction, and r_t = 0 at all other times. The cup position is denoted by [x_c, y_c, z_c] ∈ R³, the ball position by [x_b, y_b, z_b] ∈ R³, and the scaling parameter is α = 10000. The task is quite complex, as the reward is determined not solely by the movements of the cup but foremost by the movements of the ball, and the movements of the ball are very sensitive to perturbations. A small perturbation of the initial condition or of the trajectory will drastically change the movement of the ball, and hence the outcome of the trial, if we do not use any form of perceptual coupling to the external variable ball.

Due to the complexity of the task, Ball-in-a-Cup is a hard motor task even for children, who only succeed at it by observing another person playing, or by deducing from similar previously learned tasks how to maneuver the ball above the cup in such a way that it can be caught. Subsequently, a lot of improvement by trial-and-error is required until the desired solution can be achieved in practice. The child will have an initial success when the initial conditions and the executed cup trajectory fit together by chance; afterwards, the child still has to practice a lot until it is able to get the ball into the cup (almost) every time and thus cancel various perturbations. Learning the perceptual coupling necessary to get the ball into the cup on a consistent basis is a hard task even for adults, as our whole lab can testify. In contrast to a tennis swing, where a human just needs to learn a goal function for the one moment the racket hits the ball, in Ball-in-a-Cup we need a complete dynamical system, as cup and ball constantly interact.

[Figure 4. Expected return for one specific perturbation of the learned policy in the Ball-in-a-Cup scenario, averaged over 3 runs with different random seeds and with the standard deviation indicated by error bars; double-logarithmic axes, return vs. number of trials, with curves for the initialization, a hand-tuned solution, and the learned policy. Convergence is not uniform, as the algorithm optimizes the returns for a whole range of perturbations and not for this test case; hence the variance in the return, as the improved policy might get worse for the test case while improving over all cases. The algorithm rapidly improves, regularly beating the hand-tuned solution after fewer than fifty trials and converging after approximately 600 trials. Note that on a double-logarithmic plot single-unit changes are significant, as they correspond to orders of magnitude.]

Mimicking how children learn Ball-in-a-Cup, we first initialize the motor primitives by imitation and, subsequently, improve them by reinforcement learning in order to get an initial success. Afterwards, we also acquire the perceptual coupling by reinforcement learning. We recorded the motions of a human player using a VICON motion-capture setup in order to obtain an example for imitation, as shown in Figure 3 (c). The extracted cup trajectories were used to initialize the motor primitives using locally-weighted regression for imitation learning. The simulation of the Ball-in-a-Cup behavior was verified using the tracked movements. We used one of the recorded trajectories for which, when played back in simulation, the ball goes into the cup but does not pass through the center of its opening and thus does not maximize the reward. This movement is then used for initializing the motor primitives and determining their parametric structure, where cross-validation indicates that 91 parameters per motor primitive are optimal from a bias-variance point of view. The trajectories are optimized by reinforcement learning using the PoWER algorithm on the parameters w for unperturbed initial conditions. Given no noise and perfect initial conditions, the robot consistently succeeds at bringing the ball into the cup after approximately 60-80 iterations.

One set of the found trajectories is then used to calculate the baseline ȳ = (h − b) and ȳ̇ = (ḣ − ḃ), where h and b are the hand and ball trajectories. This set is also used to set the standard cup trajectories. Hand-tuned coupling factors work quite well for small perturbations of the initial conditions. In order to make them more robust, we use reinforcement learning with the same joint reward as before. The initial conditions (positions and velocities) of the ball are perturbed completely at random (no PEGASUS trick), using Gaussian random values with variances set according to the desired stability region. The PoWER algorithm converges after approximately 600-800 iterations.
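For concreteness, the joint reward just described can be sketched as a small function (Python; variable names are our own, and the rim-crossing test is task-specific logic left abstract here):

```python
import numpy as np

ALPHA = 10000.0  # scaling parameter alpha from Section IV-A

def reward(cup_xy, ball_xy, ball_crosses_rim_downward):
    """Joint reward: nonzero only at the moment the ball passes the rim of
    the cup moving downward; decays with the squared ball-cup xy offset."""
    if not ball_crosses_rim_downward:
        return 0.0
    dx, dy = cup_xy[0] - ball_xy[0], cup_xy[1] - ball_xy[1]
    return float(np.exp(-ALPHA * dx ** 2 - ALPHA * dy ** 2))
```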

[Figure 5. Comparison of cup and ball trajectories with and without perceptual coupling: x, y, and z positions [m] over time [s] (0 to 1.5 s). The trajectories under different initial conditions are clearly distinguishable. The perceptual coupling cancels out the swinging motion of the string-and-ball pendulum. The successful trial is marked either by overlapping (x and y) or parallel (z) trajectories of ball and cup from 1.2 seconds onward.]

The learning speed is roughly comparable to that of a ten-year-old child (Figure 4). For the training we concurrently used standard deviations of 0.01 m for x and y and of 0.1 m/s for ẋ and ẏ. The learned perceptual coupling gets the ball into the cup in all tested cases in which the hand-tuned coupling was also successful. The learned coupling pushes the limits of the cancelable perturbations significantly further and still performs consistently well at double the standard deviations seen during the reinforcement learning process. Figure 5 shows an example of how the visual coupling adapts the hand trajectories in order to cancel perturbations and get the ball into the cup.

V. CONCLUSION

Perceptual coupling for motor primitives is an important topic, as it results in more general and more reliable solutions while allowing the application of the dynamic-systems motor primitive framework to many other motor control problems. As manual tuning can only work in limited setups, automatic acquisition of this perceptual coupling is essential. In this paper, we have contributed an augmented version of the motor primitive framework originally suggested in [1], [2], [4] such that it incorporates perceptual coupling while keeping a distinctly similar structure to the original approach and, thus, preserving most of its important properties. We have presented a concerted learning approach which relies on initialization by imitation learning and subsequent self-improvement by reinforcement learning, and we have introduced a particularly well-suited algorithm for this reinforcement learning problem called PoWER. The resulting framework works well for learning Ball-in-a-Cup on a simulated anthropomorphic SARCOS arm in setups where the original motor primitive framework would not suffice to fulfill the task.

REFERENCES

[1] A. J. Ijspeert, J. Nakanishi, and S. Schaal, "Movement imitation with nonlinear dynamical systems in humanoid robots," in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Washington, DC, May 11-15, 2002, pp. 1398-1403.
[2] A. J. Ijspeert, J. Nakanishi, and S. Schaal, "Learning attractor landscapes for learning motor primitives," in Advances in Neural Information Processing Systems (NIPS), S. Becker, S. Thrun, and K. Obermayer, Eds., vol. 15. Cambridge, MA: MIT Press, 2003, pp. 1547-1554.
[3] S. Schaal, J. Peters, J. Nakanishi, and A. J. Ijspeert, "Control, planning, learning, and imitation with dynamic movement primitives," in Proceedings of the Workshop on Bilateral Paradigms on Humans and Humanoids, IEEE International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, Oct. 27-31, 2003.
[4] S. Schaal, P. Mohajerian, and A. J. Ijspeert, "Dynamics systems vs. optimal control - a unifying view," Progress in Brain Research, vol. 165, no. 1, pp. 425-445, 2007.
[5] J. Peters and S. Schaal, "Policy gradient methods for robotics," in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Beijing, China, 2006, pp. 2219-2225.
[6] D. Pongas, A. Billard, and S. Schaal, "Rapid synchronization and accurate phase-locking of rhythmic motor primitives," in Proceedings of the IEEE International Conference on Intelligent Robots and Systems (IROS), 2005, pp. 2911-2916.
[7] J. Nakanishi, J. Morimoto, G. Endo, G. Cheng, S. Schaal, and M. Kawato, "Learning from demonstration and adaptation of biped locomotion," Robotics and Autonomous Systems (RAS), vol. 47, no. 2-3, pp. 79-91, 2004.
[8] H. Urbanek, A. Albu-Schäffer, and P. van der Smagt, "Learning from demonstration repetitive movements for autonomous service robotics," in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Sendai, Japan, 2004, pp. 3495-3500.
[9] G. Wulf, Attention and Motor Skill Learning. Champaign, IL: Human Kinetics, 2007.
[10] J. Nakanishi, J. Morimoto, G. Endo, G. Cheng, S. Schaal, and M. Kawato, "A framework for learning biped locomotion with dynamic movement primitives," in Proceedings of the IEEE-RAS International Conference on Humanoid Robots (HUMANOIDS), Santa Monica, CA, Nov. 10-12, 2004.
[11] R. Sutton and A. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998.
[12] J. Peters and S. Schaal, "Reinforcement learning for operational space control," in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Rome, Italy, 2007.
[13] R. J. Williams, "Simple statistical gradient-following algorithms for connectionist reinforcement learning," Machine Learning, vol. 8, pp. 229-256, 1992.
[14] C. G. Atkeson, "Using local trajectory optimizers to speed up global optimization in dynamic programming," in Advances in Neural Information Processing Systems (NIPS), vol. 6, J. E. Hanson, S. J. Moody, and R. P. Lippmann, Eds. Morgan Kaufmann, 1994, pp. 503-521.
[15] J. Kober and J. Peters, "Policy search for motor primitives in robotics," in Advances in Neural Information Processing Systems (NIPS), 2008.
[16] C. Andrieu, N. de Freitas, A. Doucet, and M. I. Jordan, "An introduction to MCMC for machine learning," Machine Learning, vol. 50, no. 1, pp. 5-43, 2003.
[17] Wikipedia, "Ball in a cup," June 2008. [Online]. Available: http://en.wikipedia.org/wiki/ball_in_a_cup
[18] J. Nakanishi, M. Mistry, J. Peters, and S. Schaal, "Experimental evaluation of task space position/orientation control towards compliant control for humanoid robots," in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2007.