Meta Learning & Self Play Ilya Sutskever MARCH 24, 2018
The Reinforcement Learning Problem
Reinforcement Learning (RL)
A good framework for building intelligent agents: acting to achieve goals is a key part of intelligence, and RL can specify nearly any AI problem.
RL is interesting because interesting RL algorithms exist.
[Diagram: the agent sends actions to the environment; the environment returns observations and rewards]
Reinforcement Learning
Formulation: find a policy that maximizes expected reward (written out below).
Rewards are given by the environment. In the real world, environments don't specify rewards; it is up to the agent to determine that a reward has occurred.
[Diagram: the same agent-environment loop of actions, observations, and rewards]
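A minimal formalization of this objective, assuming an episodic setting with horizon $T$ and discount factor $\gamma$:

$$J(\pi) = \mathbb{E}_{\tau \sim \pi}\Big[\textstyle\sum_{t=0}^{T} \gamma^{t} r_t\Big], \qquad \pi^{*} = \arg\max_{\pi} J(\pi)$$

where $\tau$ is a trajectory obtained by running policy $\pi$ in the environment and $r_t$ is the reward at step $t$.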
Reinforcement Learning
The agent is a neural network.
[Diagram: the neural-network agent in the same action/observation/reward loop]
Reinforcement Learning algorithms in a nutshell:
Add randomness to your actions. If the result was better than expected, do more of the same in the future. (A toy sketch of this recipe follows.)
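To make the nutshell concrete, here is a toy, hypothetical sketch (not code from the talk) of the recipe on a two-armed bandit: actions are sampled randomly from the policy, and actions whose reward beats a running baseline are made more probable.

```python
# Toy REINFORCE-style bandit: add randomness via sampling, then
# reinforce actions that did better than expected (the baseline).
import numpy as np

rng = np.random.default_rng(0)
logits = np.zeros(2)                # policy parameters
baseline, lr = 0.0, 0.1
true_means = np.array([0.2, 0.8])   # arm 1 is secretly better

for step in range(2000):
    probs = np.exp(logits) / np.exp(logits).sum()
    a = rng.choice(2, p=probs)                  # randomness in the action
    r = true_means[a] + rng.normal(scale=0.1)   # noisy reward
    advantage = r - baseline                    # better than expected?
    grad = -probs
    grad[a] += 1.0                              # d log pi(a) / d logits
    logits += lr * advantage * grad             # do more of what worked
    baseline += 0.05 * (r - baseline)           # track expected reward

print(probs)  # probability mass should concentrate on arm 1
```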
RL's potential
An agent running a really good RL algorithm can accomplish an overwhelming variety of tasks: the goal achiever.
A truly good RL algorithm will combine elements of supervised learning, unsupervised and representation learning, reasoning and inference at test time, and more!
Today's RL algorithms have a very long way to go, but that doesn't mean progress will be slow.
Hindsight Experience Replay [Andrychowicz et al., 2017]
Exploration can be hard
When rewards are sparse, most random attempts result in failure, and thus no learning.
Can we learn from failure?
Learn From Failure
Setup: build a system that can reach any state.
Goal: reach state A. Any trajectory ends up in some other state B. Use that trajectory as training data for how to reach state B. (A relabeling sketch follows.)
[Diagram: from the starting point, an attempted trajectory toward A actually ends at B; the failure is a success for the goal "reach B"]
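A minimal sketch of this relabeling idea, assuming a goal-conditioned agent with sparse rewards; the names and the toy trajectory are illustrative, not from the paper's code.

```python
# Hindsight relabeling: a trajectory that failed to reach goal A is
# reused as a successful demonstration of reaching the state B it
# actually ended in.
import numpy as np

def reward_fn(achieved_state, goal):
    # Sparse reward: success only when the goal state is reached.
    return 0.0 if np.allclose(achieved_state, goal) else -1.0

def relabel_with_hindsight(trajectory):
    """trajectory: list of (state, action, next_state) tuples collected
    while pursuing some original goal that was never reached."""
    achieved_goal = trajectory[-1][2]   # the state we actually ended in (B)
    relabeled = []
    for state, action, next_state in trajectory:
        # Pretend B was the goal all along; the trajectory now succeeds.
        r = reward_fn(next_state, achieved_goal)
        relabeled.append((state, action, achieved_goal, r, next_state))
    return relabeled

# Tiny demo: a two-step trajectory that ends in state 2.0.
traj = [(0.0, +1.0, 1.0), (1.0, +1.0, 2.0)]
print(relabel_with_hindsight(traj))   # final transition gets reward 0.0
```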
Cool visual explanation of HER
Dynamics randomization for Sim2Real [Peng et al., 2017]
Sim2Real with meta learning [Peng et al., 2017]
It would be nice to train robots in simulation and have the resulting policies succeed on the real robot.
Key idea: simulation randomization
Randomize the simulation parameters: gravity, friction, torques, the width and length of different geometric shapes, the type of contact simulation, etc.
Train a policy that can adapt to all settings of the simulation parameters. (A sketch of the sampling step follows.)
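A minimal sketch of the randomization step, with illustrative parameter ranges (the actual ranges and simulator API in Peng et al. differ):

```python
# Sample a fresh set of simulator parameters for every episode, so the
# policy must infer the current dynamics from observations instead of
# overfitting to one simulator.
import random

def sample_sim_params():
    return {
        "gravity":       random.uniform(-11.0, -8.0),  # m/s^2
        "friction":      random.uniform(0.5, 1.5),
        "torque_scale":  random.uniform(0.8, 1.2),
        "link_scale":    random.uniform(0.9, 1.1),     # geometry scaling
        "contact_model": random.choice(["soft", "hard"]),
    }

# One draw per training episode.
for episode in range(3):
    print(episode, sample_sim_params())
```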
This is a meta learning approach
The policy quickly infers the simulation parameters. Could it infer the "simulation parameters" of the real world?
Baseline
Results
Learning a hierarchy of actions with meta learning [Frans et al., 2017]
It would be nice if learning were hierarchical
Current RL learns by trying out random actions at each timestep. Downsides:
- Hard to explore in a persistent direction
- Hard to do credit assignment over long horizons
Example: suppose all your agents want to maximize GDP. Should each agent decide whether to go to work on the basis of GDP fluctuations? Really solving this problem may require a real model.
Meta learning approach to hierarchy
Ingredients: a distribution over tasks.
Goal: learn a set of meta-actions that solve the training tasks as quickly as possible. (A sketch of meta-actions over sub-policies follows.)
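A minimal sketch of the meta-action idea in the spirit of Frans et al.: a master policy selects one of K shared sub-policies and commits to it for a fixed horizon. The placeholder policies and toy dynamics below are assumptions for illustration, not the paper's implementation.

```python
# Meta-actions: the master chooses a sub-policy every `horizon` steps;
# the chosen sub-policy emits low-level actions in between.
import numpy as np

rng = np.random.default_rng(0)
K, horizon, episode_len = 4, 10, 100

def master_policy(obs):
    return int(rng.integers(K))        # placeholder: untrained master

def sub_policy(k, obs):
    return 0.1 * k + rng.normal()      # placeholder: untrained sub-policy

obs, k = 0.0, 0
for t in range(episode_len):
    if t % horizon == 0:
        k = master_policy(obs)         # meta-action: pick a sub-policy
    action = sub_policy(k, obs)        # low-level action
    obs += action                      # stand-in environment dynamics
print(obs)
```

Because the master acts only once every `horizon` steps, exploration is persistent in one direction and credit assignment happens over a much shorter effective horizon.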
Evolved Policy Gradients [Houthooft et al., 2018]
Goal: learn a cost function that leads to rapid learning
Train a cost function such that RL on this cost function learns very quickly.
Ingredients: a distribution over tasks. Use evolution strategies to learn the cost function. (An outer-loop sketch follows.)
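A minimal sketch of the outer loop, assuming a stand-in `inner_rl` that trains an agent against the candidate cost parameters phi and reports true task performance; everything here is an illustrative proxy, not the paper's implementation.

```python
# Evolution strategies over the parameters of a learned cost function:
# perturb phi, score each perturbation by how well inner-loop RL does,
# and step phi toward the better-scoring perturbations.
import numpy as np

rng = np.random.default_rng(0)

def inner_rl(phi, task):
    # Stand-in: the true return achieved by an agent trained against the
    # cost parameterized by phi on this task (a toy proxy here).
    return -np.sum((phi - task) ** 2)

phi = np.zeros(5)                    # cost-function parameters
sigma, lr, pop = 0.1, 0.05, 32

for generation in range(100):
    task = rng.standard_normal(5)    # sample a task from the distribution
    eps = rng.standard_normal((pop, phi.size))
    scores = np.array([inner_rl(phi + sigma * e, task) for e in eps])
    scores = (scores - scores.mean()) / (scores.std() + 1e-8)
    phi += lr / (pop * sigma) * eps.T @ scores
```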
Result: a single learning trial
Result: a single learning trial Learned cost: never learned to move right
Self Play
Self Play: TD-Gammon
TD-Gammon (Tesauro, 1992): incredibly old work combining TD learning, neural networks, and self-play.
It beat top human players and discovered unconventional strategies that were deemed to be better!
The approach lay dormant until DQN for Atari.
Self Play: AlphaGo Zero
Self Play: Dota 2
Trained with pure self play.
A popular competitive online e-sports game with a serious professional scene: $140M awarded in prizes in 2016.
5v5 is the main variant; 1v1 is also played. OpenAI's bot beat all the pros at 1v1.
Appealing properties of Self Play
- A simple environment gives rise to extremely complex strategy
- Converts compute into data
- A perfect curriculum: the opponent is always at the right level
Self Play: Artificial Life (Karl Sims, 1994)
Self Play for physicality and dexterity [Bansal et al., 2017]
The environment is simple, yet the behavior is very complex. Pre-train general dexterity by competing against an opponent.
What's next?
Main open question: design the self play environment so that the result will be useful to some external task.
Can Self Play lead all the way to AGI?
Social life incentivizes the evolution of intelligence.
"Because corvids and apes share these cognitive tools, we argue that complex cognitive abilities evolved multiple times in distantly related species with vastly different brain structures in order to solve similar socioecological problems." (Science, Vol. 306, Issue 5703, pp. 1903-1907)
Open-ended self play produces: theory of mind, negotiation, social skills, empathy, real language understanding.
[Figure: cranial capacity (500 to 1500 cm^3) versus millions of years ago, from Sahelanthropus and Australopithecus through Homo habilis, Homo erectus, and Homo neanderthalensis to Homo sapiens, with mate selection noted as a driver]
AI Alignment: Learning from human feedback [Christiano et al., 2017]
How to communicate goals quickly? One approach: have humans judge the behavior of an algorithm
Human judges select good behavior
Fit a scalar reward function to the human feedback
Optimize a preference loss: if a human judge deems that A > B, learn a real-valued reward function consistent with the human feedback. (A sketch follows.)
[Diagram: the RL algorithm sends actions to the environment and receives observations; a reward predictor, fit to the human feedback, supplies the predicted reward to the RL algorithm]
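A minimal sketch of one common instantiation, a Bradley-Terry-style logistic preference loss as in Christiano et al., with a toy linear reward model; the simulated judgments and features are illustrative assumptions.

```python
# Fit a scalar reward so preferred clips get higher predicted reward:
# maximize log P(A > B) = log sigmoid(r(A) - r(B)).
import numpy as np

rng = np.random.default_rng(0)
true_w = rng.standard_normal(4)       # hidden "human" reward (toy)

# Simulated judgments: order each pair so the preferred clip comes first.
pairs = [(rng.standard_normal(4), rng.standard_normal(4)) for _ in range(256)]
data = [(a, b) if true_w @ a > true_w @ b else (b, a) for a, b in pairs]

w, lr = np.zeros(4), 0.1              # learned linear reward: r(x) = w @ x
for epoch in range(50):
    for a, b in data:                 # judge deemed A > B
        p = 1.0 / (1.0 + np.exp(-(w @ a - w @ b)))   # P(A preferred)
        w += lr * (1.0 - p) * (a - b) # ascend log-likelihood of judgment

print(np.dot(w, true_w) / (np.linalg.norm(w) * np.linalg.norm(true_w)))
# cosine similarity near 1: learned reward is consistent with the feedback
```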
It works: 500 bits of interaction.
It works: several thousand bits of interaction to solve Atari games.
Can easily convey unusual goals: e.g., "drive right behind the competitor."
Alignment: the future
The technical problem of subtle goal communication will likely be solved. But what are the right goals? That is a political problem.
Thanks! Visit openai.com to learn more.