
A Practice Strategy for Robot Learning Control

Terence D. Sanger
Department of Electrical Engineering and Computer Science
Massachusetts Institute of Technology, room E25-534
Cambridge, MA 02139
tds@ai.mit.edu

Abstract

"Trajectory Extension Learning" is a new technique for Learning Control in robots which assumes that there exists some parameter of the desired trajectory that can be smoothly varied from a region in which the dynamics are easy to solve to a region of desired behavior whose dynamics may be more difficult. By gradually varying the parameter, practice movements remain near the desired path while a Neural Network learns to approximate the inverse dynamics. For example, the average speed of motion might be varied, so that the inverse dynamics can be "bootstrapped" from slow movements with simpler dynamics to fast movements. This provides an example of the more general concept of a "Practice Strategy", in which a sequence of intermediate tasks is used to simplify learning a complex task. I show an example of the application of this idea to a real 2-joint direct-drive robot arm.

1 INTRODUCTION

The most general definition of Adaptive Control is one which includes any controller whose behavior changes in response to the controlled system's behavior. In practice, this definition is usually restricted to modifying a small number of controller parameters in order to maintain system stability or global asymptotic stability of the errors during execution of a single trajectory (Sastry and Bodson 1989, for review).

Learning Control represents a second level of operation, since it uses Adaptive Control to modify parameters during repeated performance trials of a desired trajectory so that future trials result in greater accuracy (Arimoto et al. 1984). In this paper I present a third level called a "Practice Strategy", in which Learning Control is applied to a sequence of intermediate trajectories leading ultimately to the true desired trajectory. I claim that this can significantly increase learning speed and make learning possible for systems which would otherwise become unstable.

1.1 LEARNING CONTROL

During repeated practice of a single desired trajectory, the actual trajectory followed by the robot may be significantly different. Many Learning Control algorithms modify the commands stored in a sequence memory to minimize this difference (Atkeson 1989, for review). However, the performance errors are usually measured in a sensory coordinate system, while command corrections must be made in the motor coordinate system. If the relationship between these two coordinate systems is not known, then command corrections might be made in the wrong direction and inadvertently worsen performance. However, if the practice trajectory is close to the desired trajectory, then the errors will be small and the relationship between command and sensory errors can be approximated by the system Jacobian.

An alternative to a stored command sequence is to use a Neural Network to learn an approximation to the inverse dynamics in the region of interest (Sanner and Slotine 1992, Yabuta and Yamada 1991, Atkeson 1989). In this case, the commands and results from the actual movement are used as training data for the network, and smoothness properties are assumed such that the error on the desired trajectory will decrease. A significant problem with this method, however, is that if the actual practice trajectory is far from the desired trajectory, then its inverse dynamics information will be of little use in training the inverse dynamics for the desired trajectory. In fact, the network may achieve perfect approximation on the actual trajectory while still making significant errors on the desired trajectory. In this case learning will stop (since the training error is zero), leading to the phenomenon of "learning lock-up" (An et al. 1988). So whether Learning Control uses a sequence memory or a Neural Network, learning may proceed poorly if large errors are made during the initial practice movements.

1.2 PRACTICE STRATEGIES

I define a "practice strategy" as a sequence of trajectories such that the first element in the sequence is any previously learned trajectory, and the last element in the sequence is the ultimate desired trajectory. A well-designed practice strategy yields a sequence for which learning control of the trajectory at any particular step is simplified if the prior steps have already been learned. This will occur if learning the prior trajectories reduces the initial performance error for subsequent trajectories, so that the network is less likely to experience learning lock-up.

One example of a practice strategy is a three-step sequence in which the intermediate step is a set of independently executable subtasks which partition the desired trajectory into discrete pieces. Another example is a multi-step sequence in which the intermediate steps are a set of trajectories which are somehow related to the desired trajectory.
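To make the first of these two examples concrete, here is a minimal Python sketch of a practice strategy treated as nothing more than an ordered set of trajectories handed to an ordinary Learning Control routine. It is an illustration only: the names `partition`, `practice_strategy`, and the `learn_trajectory` callback are hypothetical, and the paper does not prescribe any particular implementation.

```python
import numpy as np

def partition(trajectory, n_pieces):
    """Split a sampled trajectory (T x dof array) into contiguous subtasks."""
    return np.array_split(trajectory, n_pieces, axis=0)

def practice_strategy(controller, desired, learn_trajectory, n_pieces=4):
    # Step 1: start from whatever the controller already does
    # (any previously learned trajectory).
    # Step 2: practice each independently executable piece of the
    # desired trajectory, so errors on any one piece stay small.
    for piece in partition(desired, n_pieces):
        controller = learn_trajectory(controller, piece)
    # Step 3: practice the full desired trajectory, now with a small
    # initial performance error and little risk of learning lock-up.
    return learn_trajectory(controller, desired)
```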

In this paper I present a multi-step sequence which gradually transforms some known trajectory into the desired trajectory by varying a single parameter. This method has the advantage of not requiring detailed knowledge of the task structure in order to break the task up into meaningful subtasks, and conditions for convergence can be stated explicitly. It has a close relationship to Continuation Methods for solving differential equations, and can be considered a particular application of the Banach Extension Theorem.

2 METHODS

As in (Sanger 1992), we need to specify four aspects of the use of a neural network within a control system:

1. the network's function in the control system,
2. the network learning algorithm which modifies the connection weights,
3. the training signals used for network learning, and
4. the practice strategy used to generate sample movements.

The network's function is to learn the inverse dynamics of an equilibrium-point controlled plant (Shadmehr 1990). The LMS-tree learning algorithm trains the network (Sanger 1991a, Sanger 1991b). The training signals are determined from the actual practice data using either "Actual Trajectory Training" or "Desired Trajectory Training", as defined below. And the practice strategy is "Trajectory Extension Learning", in which a parameter of the movement is gradually modified during training.
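Since the plant is an equilibrium-point controlled arm, a brief sketch may help fix ideas. The Python fragment below simulates a single joint driven toward a commanded equilibrium angle by a spring-and-damper law. The stiffness and damping values are the ones reported for the experiment in Section 3; the inertia, integration scheme, and function names are assumptions for illustration, not the controller of Shadmehr (1990) or Bizzi et al. (1984).

```python
import numpy as np

# Illustrative equilibrium-point control of one joint (assumed simplification).
K = 15.0   # stiffness, Nm/rad (value used in the paper's experiment)
B = 1.5    # damping, Nm/(rad/sec) (value used in the paper's experiment)
I = 0.1    # joint inertia, kg m^2 (hypothetical value for illustration)

def simulate(theta_eq, dt=0.001):
    """Integrate the joint dynamics under an equilibrium-point command sequence."""
    theta, theta_dot = theta_eq[0], 0.0
    trajectory = []
    for eq in theta_eq:
        torque = K * (eq - theta) - B * theta_dot   # spring-damper toward equilibrium
        theta_dot += (torque / I) * dt              # semi-implicit Euler step
        theta += theta_dot * dt
        trajectory.append(theta)
    return np.array(trajectory)

# A slow half-cycle equilibrium command: at low speed the joint tracks it
# closely (a near-perfect servo); at higher speeds the actual trajectory
# lags, which is why the inverse dynamics must be learned.
theta_eq = np.sin(np.linspace(0.0, np.pi, 2000))
actual = simulate(theta_eq)
```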

2.1 TRAINING SIGNALS

Figure 1: Training signals for network learning.

Figure 1 shows the general structure of the network and training signals. A desired trajectory $y$ is fed into the network $N$ to yield an estimated command $\hat{u}$. This command is then applied to the plant $P_\alpha$, where the subscript indicates that the plant is parameterized by the variable $\alpha$, and the plant produces the actual trajectory $\hat{y}$. Although the true command $u$ which achieves $y$ is unknown, we do know that the estimated command $\hat{u}$ produces $\hat{y}$, so these signals are used for training: the network's response to $\hat{y}$, given by $\tilde{u} = N\hat{y}$, is compared with the known value $\hat{u}$, and subtracting these yields the training error $\delta_{\hat{y}} = \hat{u} - N\hat{y}$.

Normally, network training would use this error signal to modify the network output for inputs near $\hat{y}$, and I refer to this as "Actual Trajectory Training". However, if $\hat{y}$ is far from $y$ then no change in response may occur at $y$, and this may lead even more quickly to learning lock-up. Therefore an alternative is to use the error $\delta_{\hat{y}}$ to train the network output for inputs near $y$. I refer to this as "Desired Trajectory Training", and in the figure it is represented by the dotted arrow.
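The two training modes differ only in where the computed command error is applied. The Python sketch below spells this out for one practice movement. The interface names (`network`, `plant`, `train_at`) are hypothetical stand-ins, and reading Desired Trajectory Training as "apply the same error at the desired input" is my interpretation of the dotted arrow in Figure 1, not code from the paper; $\gamma$ is the learning rate used in the theorems below.

```python
def practice_trial(network, plant, y_desired, mode="desired", gamma=0.5):
    """One practice movement and one training update (illustrative only).

    network(y) -> estimated command for trajectory y.
    network.train_at(x, target) -> adjust the output near input x toward target.
    plant(u) -> actual trajectory produced by command u.
    """
    u_hat = network(y_desired)        # estimated command for the desired trajectory
    y_hat = plant(u_hat)              # actual trajectory produced by that command
    delta = u_hat - network(y_hat)    # command error: u_hat is known to produce y_hat

    if mode == "actual":
        # Actual Trajectory Training: correct the network near the actual trajectory.
        network.train_at(y_hat, network(y_hat) + gamma * delta)
    else:
        # Desired Trajectory Training: apply the same error near the desired
        # trajectory (the dotted arrow in Figure 1).
        network.train_at(y_desired, network(y_desired) + gamma * delta)
    return y_hat, delta
```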

The following discussion will summarize the convergence conditions and theorems presented in (Sanger 1992). Define $Ru = (1 - NP(x))u = u - \hat{u}$ to be an operator which maps commands into command errors for states $x$ on the desired trajectory. Similarly, let $\hat{R}u = (1 - NP(\hat{x}))u = u - \tilde{u}$ map commands into command errors for states $\hat{x}$ on the actual trajectory. Convergence depends upon the following assumptions:

A1: The plant $P$ is smooth and invertible with respect to both the state $x$ and the input $u$, with Lipschitz constants $k_x$ and $k_u$, and it has stable zero-dynamics.

A2: The network $N$ is smooth with Lipschitz constant $k_N$.

A3: Network learning reduces the error in response to a training pair $(y, \delta_y)$.

A4: The change in network output in response to training is smooth with Lipschitz constant $k_L$.

A5: There exists a smoothly controllable parameter $\alpha$ such that an inverse dynamics solution is available at $\alpha = \alpha_0$, and the desired performance occurs when $\alpha = \alpha_d$.

A6: The change in command required to produce a desired output after any change in $\alpha$ is bounded by the change in $\alpha$ multiplied by a constant $k_\alpha$.

A7: The change in plant response for any fixed input is bounded by the change in $\alpha$ multiplied by a constant $k_P$.

Under assumptions A1-A3 we can prove convergence of Desired Trajectory Training:

Theorem 1: If there exists a $k_{\hat{R}}$ such that $\|\hat{R}u - \hat{R}\bar{u}\| < k_{\hat{R}} \|u - \bar{u}\|$, the learning rate satisfies $0 < \gamma \le 1$, and $k_{\hat{R}} < 1$, then the network output $\hat{u}$ approaches the correct command $u$.

Under assumptions A1-A4, we can prove convergence of Actual Trajectory Training:

Theorem 2: If there exists a $k_{\hat{R}}$ such that $\|\hat{R}u - \hat{R}\bar{u}\| < k_{\hat{R}} \|u - \bar{u}\|$ and the learning rate satisfies $0 < \gamma \le 1$, then the analogous convergence result holds for Actual Trajectory Training.

2.2 TRAJECTORY EXTENSION LEARNING

Let $\alpha$ be some modifiable parameter of the plant such that for $\alpha = \alpha_0$ there exists a simple inverse dynamics solution, and we seek a solution when $\alpha = \alpha_d$. For example, if the plant uses Equilibrium Point Control (Shadmehr 1990), then at low speeds the inverse dynamics behave like a perfect servo controller, yielding desired trajectories without the need to solve the dynamics. We can continue to train a learning controller as the average speed of movement ($\alpha$) is gradually increased. The inverse dynamics learned at one speed provide an approximation to the inverse dynamics at a slightly faster speed, and thus the performance errors remain small during practice. This leads to significantly faster learning rates and a greater likelihood that the conditions for convergence at any given speed will be satisfied. Note that, unlike traditional learning schemes, the error does not decrease monotonically with practice, but instead maintains a steady magnitude as the speed increases, until the network is no longer able to approximate the inverse dynamics.

The following is a summary of a result from (Sanger 1992). Let $\alpha$ change from $\alpha_1$ to $\alpha_2$, and let $P = P_{\alpha_1}$ and $P' = P_{\alpha_2}$. Then under assumptions A1-A7 we can prove convergence of Trajectory Extension Learning:

Theorem 3: If there exists a $k_R$ such that $\|Ru - R\bar{u}\| < k_R \|u - \bar{u}\|$ for $\alpha = \alpha_1$, then for $\alpha = \alpha_2$,
$$\|R'u' - R'\bar{u}'\| < k_R \|u' - \bar{u}'\| + (2 k_\alpha + k_N k_P)\,|\alpha_2 - \alpha_1|.$$

This shows that, given the smoothness assumptions and a small enough change in $\alpha$, the error will continue to decrease.
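As a procedural summary of Section 2.2, the sketch below shows one way Trajectory Extension Learning might be organized in Python: practice at the current value of $\alpha$ (here, average movement speed), train on each trial, and advance $\alpha$ toward $\alpha_d$ only while the tracking error stays acceptable. The loop structure, step size, and error threshold are assumptions for illustration; the paper specifies only that $\alpha$ is increased gradually during training. The `run_trial` callback could, for instance, wrap the `practice_trial` sketch above and return the norm of the trajectory error.

```python
def trajectory_extension_learning(network, make_plant, y_desired_at, run_trial,
                                  alpha_0, alpha_d, alpha_step=0.05,
                                  error_tolerance=0.1, max_trials=1000):
    """Illustrative Trajectory Extension Learning loop (assumed structure).

    make_plant(alpha)   -> plant parameterized by alpha (e.g., average speed)
    y_desired_at(alpha) -> the desired trajectory executed at that alpha
    run_trial(...)      -> one practice movement plus one training update,
                           returning the tracking error magnitude
    """
    alpha = alpha_0
    for _ in range(max_trials):
        plant = make_plant(alpha)
        error = run_trial(network, plant, y_desired_at(alpha))
        # Move alpha toward the desired value only while the inverse dynamics
        # learned so far keep the practice error small, so each new trial
        # starts close to a trajectory the network already handles.
        if error < error_tolerance:
            if alpha >= alpha_d:
                break
            alpha = min(alpha + alpha_step, alpha_d)
    return network
```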

3 EXAMPLE

Figure 2 shows the result of 15 learning trials performed by a real direct-drive two-joint robot arm on a sampled desired trajectory. The initial trial required 11.5 seconds to execute, and the speed was gradually increased until the final trial required only 4.5 seconds. Simulated equilibrium point control was used (Bizzi et al. 1984) with stiffness and damping coefficients of 15 Nm/rad and 1.5 Nm/(rad/sec), respectively. The grey line in Figure 2 shows the equilibrium point control signal which generated the actual movement represented by the solid line. The difference between these two indicates the nontrivial nature of the dynamics calculations required to derive the control signal from the desired trajectory. Note that without Trajectory Extension Learning, the network does not converge and the arm becomes unstable.

The neural network was an LMS tree (Sanger 1991a, Sanger 1991b) with 10 Gaussian basis functions for each of the 6 input dimensions, and a total of 15 subtrees were grown per joint (see Sanger 1992 for further explanation).

4 CONCLUSION

Trajectory Extension Learning is one example of the way in which a practice strategy can be used to improve convergence for Learning Control. This or other types of practice strategies might increase the performance of many different learning algorithms, both within and outside the control domain. Such strategies may also provide a theoretical model for the practice strategies used by humans to learn complex tasks, and the theoretical analysis and convergence conditions could lead to a deeper understanding of human motor learning and to successful techniques for optimizing performance.

Acknowledgements

Thanks are due to Simon Giszter, Reza Shadmehr, Sandro Mussa-Ivaldi, Emilio Bizzi, and many people at the NIPS conference for their comments and criticisms. This report describes research done within the laboratory of Dr. Emilio Bizzi in the Department of Brain and Cognitive Sciences at MIT. The author was supported during this work by a National Defense Science and Engineering Graduate Fellowship and by NIH grants 5R37AR26710 and 5R01NS09343 to Dr. Bizzi.

References

An C. H., Atkeson C. G., Hollerbach J. M., 1988, Model-Based Control of a Robot Manipulator, MIT Press, Cambridge, MA.

Arimoto S., Kawamura S., Miyazaki F., 1984, Bettering operation of robots by learning, Journal of Robotic Systems, 1(2):123-140.

Atkeson C. G., 1989, Learning arm kinematics and dynamics, Ann. Rev. Neurosci., 12:157-183.

Bizzi E., Accornero N., Chapple W., Hogan N., 1984, Posture control and trajectory formation during arm movement, J. Neurosci., 4:2738-2744.

Sanger T. D., 1991a, A tree-structured adaptive network for function approximation in high dimensional spaces, IEEE Trans. Neural Networks, 2(2):285-293.

Sanger T. D., 1991b, A tree-structured algorithm for reducing computation in networks with separable basis functions, Neural Computation, 3(1):67-78.

Sanger T. D., 1992, Neural network learning control of robot manipulators using gradually increasing task difficulty, submitted to IEEE Trans. Robotics and Automation.

Sanner R. M., Slotine J.-J. E., 1992, Gaussian networks for direct adaptive control, IEEE Trans. Neural Networks, in press. Also MIT NSL Reports 910303 and 910503, March 1991, and Proc. American Control Conference, Boston, pages 2153-2159, June 1991.

Sastry S., Bodson M., 1989, Adaptive Control: Stability, Convergence, and Robustness, Prentice Hall, New Jersey.

Shadmehr R., 1990, Learning virtual equilibrium trajectories for control of a robot arm, Neural Computation, 2:436-446.

Yabuta T., Yamada T., 1991, Learning control using neural networks, Proc. IEEE Int'l Conf. on Robotics and Automation, Sacramento, pages 740-745.

Figure 2: Dotted line is the desired trajectory, solid line is the actual trajectory, and the grey line is the equilibrium point control trajectory.