Meta Learning & Self Play

1 Meta Learning & Self Play Ilya Sutskever MARCH 24, 2018

2 The Reinforcement Learning Problem

3 Reinforcement Learning (RL)
- Good framework for building intelligent agents
- Acting to achieve goals is a key part of intelligence
- Can specify nearly any AI problem
- RL is interesting because interesting RL algorithms exist
[Diagram: the agent sends actions to the environment; the environment returns observations and rewards]

4 Reinforcement Learning
- Formulation: find a policy that maximizes expected reward
- Rewards are given by the environment
- In the real world, environments don't specify rewards. It is up to the agent to determine that a reward has occurred.
[Diagram: the agent sends actions to the environment; the environment returns observations and rewards]
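
The formulation on this slide can be written compactly. The notation below (policy parameters \theta, trajectory \tau, per-step reward r_t, horizon T) is an assumption made for illustration and does not appear on the slide:

```latex
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} r_t\right],
\qquad
\theta^{*} = \arg\max_{\theta} J(\theta)
```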

5 Reinforcement Learning
- Agent = neural network
[Diagram: agent-environment loop with action, observation, and reward]

6 Reinforcement Learning algorithms in a nutshell
- Add randomness to your actions
- If the result was better than expected, do more of the same in the future
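
The two steps on this slide map onto a policy-gradient update with a baseline. The sketch below is a minimal illustration on a two-armed bandit, not code from the talk: the function name `train_bandit`, the reward values, the learning rate, and the running-average baseline are all assumptions chosen to keep the example small.

```python
import math
import random


def softmax(logits):
    """Turn logits into action probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_bandit(arm_rewards=(0.0, 1.0), steps=2000, lr=0.1, seed=0):
    """Hypothetical toy problem: each arm pays a fixed reward."""
    rng = random.Random(seed)
    logits = [0.0] * len(arm_rewards)
    baseline = 0.0  # running estimate of the reward we "expect"
    for _ in range(steps):
        probs = softmax(logits)
        # "Add randomness to your actions": sample rather than take the argmax.
        action = rng.choices(range(len(probs)), weights=probs)[0]
        reward = arm_rewards[action]
        # "Better than expected" is measured against the baseline.
        advantage = reward - baseline
        for a in range(len(logits)):
            # Gradient of log pi(action) with respect to logit a.
            grad_log_prob = (1.0 if a == action else 0.0) - probs[a]
            # "Do more of the same" when the advantage is positive.
            logits[a] += lr * advantage * grad_log_prob
        baseline += 0.1 * (reward - baseline)
    return softmax(logits)
```

Calling `train_bandit()` returns the final action probabilities; with these settings the policy should concentrate on the arm with the higher reward.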

7 RL's potential
- An agent running a really good RL algorithm can accomplish an overwhelming variety of tasks
- The goal achiever
- A truly good RL algorithm will combine elements of:
  - supervised learning
  - unsupervised and representation learning
  - reasoning and inference at test time
  - and more!
- Today's RL algorithms have a very long way to go
- But it doesn't mean that progress will be slow

8 Hindsight Experience Replay [Andrychowicz et al., 2017]

9 Exploration can be hard
- When rewards are sparse, most random attempts result in failure, and thus no learning
- Can we learn from failure?

10 Learn From Failure
- Setup: build a system that can reach any state
- Goal: reach state A
- Any trajectory ends up in some other state B
- Use this as training data to reach state B?
[Diagram: from the starting point, the agent tries to reach A but ends up at B; the result is a lesson in how to reach B]
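
The relabeling idea on this slide can be sketched in a few lines. This is an illustrative sketch only: the `Transition` type, the function names, the integer states, and the 0 / -1 sparse reward convention are assumptions, and only the "use the final achieved state as the goal" variant is shown.

```python
from typing import List, NamedTuple, Tuple


class Transition(NamedTuple):
    state: int
    action: int
    next_state: int
    goal: int
    reward: float


def sparse_reward(next_state: int, goal: int) -> float:
    """Assumed convention: 0 when the goal is reached, -1 otherwise."""
    return 0.0 if next_state == goal else -1.0


def relabel_with_final_state(
    trajectory: List[Tuple[int, int, int]], goal: int
) -> List[Transition]:
    """Store each step twice: once for the intended goal A and once
    pretending the state actually reached (B) had been the goal."""
    achieved = trajectory[-1][2]  # state B, where the episode ended
    replay = []
    for state, action, next_state in trajectory:
        replay.append(Transition(state, action, next_state, goal,
                                 sparse_reward(next_state, goal)))
        replay.append(Transition(state, action, next_state, achieved,
                                 sparse_reward(next_state, achieved)))
    return replay
```

For a trajectory that walks 0 -> 1 -> 2 while aiming for goal 5, every original transition carries reward -1, while the relabeled copy of the last step carries reward 0 for goal 2, so the failed episode still yields a learning signal.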

11 Cool visual explanation of HER

12 Dynamics randomization for Sim2Real [Peng et al., 2017]

13 Sim2Real with meta learning [Peng et al., 2017]
- It would be nice to train robots in simulation
- And have the policies succeed on the real robot

14 Key idea: simulation randomization
- Randomize simulation parameters:
  - Gravity
  - Friction
  - Torques
  - Width and length of different geometric shapes
  - Type of contact simulation
  - Etc.
- Train a policy that can adapt to all settings of simulation parameters
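
A rough sketch of per-episode randomization follows. Everything specific in it is hypothetical: the `SimParams` fields, the numeric ranges, and the sliding-block formula that stands in for a real physics simulator. It only shows the pattern of drawing fresh dynamics before each episode.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SimParams:
    gravity: float
    friction: float
    torque_scale: float


# Hypothetical ranges, chosen only for illustration.
RANGES = {
    "gravity": (8.0, 11.0),
    "friction": (0.2, 1.0),
    "torque_scale": (0.8, 1.2),
}


def sample_params(rng: random.Random) -> SimParams:
    """Draw one setting of the simulation parameters."""
    return SimParams(**{name: rng.uniform(lo, hi)
                        for name, (lo, hi) in RANGES.items()})


def stopping_distance(params: SimParams, initial_speed: float = 2.0) -> float:
    """Toy stand-in for a simulator: how far a sliding block travels
    before friction stops it."""
    return initial_speed ** 2 / (2.0 * params.friction * params.gravity)


def randomized_episodes(num_episodes: int = 5, seed: int = 0):
    rng = random.Random(seed)
    results = []
    for _ in range(num_episodes):
        params = sample_params(rng)  # fresh dynamics every episode
        results.append((params, stopping_distance(params)))
    return results
```

Because the same action produces a different outcome in every episode, a policy trained this way cannot rely on one fixed set of dynamics and has to adapt to whatever it observes.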

15 This is a meta learning approach
- Policy quickly infers simulation parameters
- Could it infer the simulation parameters of the real world?

16 Baseline

17 Results

18 Learning a hierarchy of actions with meta learning [Frans et al., 2017]

19 It would be nice if learning was hierarchical
- Current RL learns by trying out random actions at each timestep
- Downsides:
  - Hard to explore in a persistent direction
  - Hard to do credit assignment over long horizons
- Example: suppose all your agents want to maximize your GDP. Should each agent decide if it should go to work on the basis of GDP fluctuations?
- May require a real model to really solve this problem

20 Meta learning approach to hierarchy
- Ingredients: a distribution over tasks
- Goal: learn a set of meta-actions that solve training tasks as quickly as possible

21 Evolved Policy Gradients [Houthooft et al., 2018]

22 Goal: learn a cost function that leads to rapid learning
- Train a cost function such that RL on this cost function learns very quickly
- Ingredients: a distribution over tasks
- Use evolution strategies to learn the cost function
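
For readers unfamiliar with evolution strategies, here is a generic sketch of the optimizer on a toy problem. It is not the method of Houthooft et al.: the quadratic `fitness` function is a cheap stand-in for "how quickly RL learns under a candidate cost function", and the mirrored (antithetic) perturbations, population size, and step sizes are assumptions made for this example.

```python
import random


def fitness(theta, target=(1.0, -2.0, 0.5)):
    """Toy objective (higher is better). In the setting of the slide this
    would be the return reached after RL training with the candidate cost."""
    return -sum((t - g) ** 2 for t, g in zip(theta, target))


def evolution_strategies(dim=3, iterations=200, pairs=20,
                         sigma=0.1, lr=0.1, seed=0):
    rng = random.Random(seed)
    theta = [0.0] * dim
    for _ in range(iterations):
        grad = [0.0] * dim
        for _ in range(pairs):
            eps = [rng.gauss(0.0, 1.0) for _ in range(dim)]
            # Evaluate a mirrored pair of perturbed parameter vectors.
            plus = fitness([t + sigma * e for t, e in zip(theta, eps)])
            minus = fitness([t - sigma * e for t, e in zip(theta, eps)])
            scale = (plus - minus) / (2.0 * sigma)
            for i in range(dim):
                grad[i] += scale * eps[i] / pairs
        # Move towards perturbations that scored better; no backprop needed.
        theta = [t + lr * g for t, g in zip(theta, grad)]
    return theta
```

The optimizer only ever calls `fitness`, which is why this family of methods suits objectives such as "learning speed" that are not differentiable with respect to the cost function's parameters.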

23 Result: a single learning trial

24 Result: a single learning trial Learned cost: never learned to move right

25 Self Play

26 Self Play: TD-Gammon
- TD-Gammon (Tesauro, 1992)
- Incredibly old work: Q-learning + neural networks + self-play
- Beat all humans, discovered unconventional strategies that were deemed to be better!
- Approach was dormant until DQN for Atari

27 Self Play: AlphaGo Zero

28 Self Play: Dota 2
- Pure self play
- Popular competitive online e-sports game
- Serious professional scene: $140M awarded in prizes
- 5v5 is main variant; 1v1 also played
- OpenAI beat all the pros 1v1

29 Appealing properties of Self Play
- Simple environment, extremely complex strategy
- Convert compute into data
- Perfect curriculum
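
As a very small illustration of an agent improving purely by playing a copy of itself, the sketch below runs fictitious play on rock-paper-scissors. Fictitious play is a classic game-theory procedure swapped in here because it fits in a few lines; it is not the algorithm behind TD-Gammon, AlphaGo Zero, or the Dota 2 bot, and the function names and round count are assumptions.

```python
# Payoff to the row action against the column action.
# Actions: 0 = rock, 1 = paper, 2 = scissors.
PAYOFF = [
    [0, -1, 1],
    [1, 0, -1],
    [-1, 1, 0],
]


def best_response(opponent_counts):
    """Pick the action that does best against the opponent's past play."""
    values = [sum(PAYOFF[a][b] * opponent_counts[b] for b in range(3))
              for a in range(3)]
    return max(range(3), key=lambda a: values[a])


def self_play(rounds=20000):
    # Each player keeps a count of the actions it has played so far.
    counts = [[1, 1, 1], [1, 1, 1]]
    for _ in range(rounds):
        a0 = best_response(counts[1])
        a1 = best_response(counts[0])
        counts[0][a0] += 1
        counts[1][a1] += 1
    total = sum(counts[0])
    return [c / total for c in counts[0]]
```

No external data is involved: every round of play generates the experience used in the next round, and the empirical action frequencies returned by `self_play()` drift towards the balanced mixed strategy of the game.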

30 Self Play: Artificial Life [Karl Sims, 1994]

31 Self Play: Artificial Life [Karl Sims, 1994]

32 Self Play for physicality and dexterity [Bansal et al., 2017]
- Environment is simple, behavior is very complex
- Pre-train general dexterity by competing against an opponent

33 What's next?
- Main open question: design the self play environment so that the result will be useful to some external task

34 Can Self Play lead all the way to AGI?
- Social life incentivizes evolution of intelligence
- "Because corvids and apes share these cognitive tools, we argue that complex cognitive abilities evolved multiple times in distantly related species with vastly different brain structures in order to solve similar socioecological problems." Science, Vol. 306, Issue 5703
- Open-ended self play produces: theory of mind, negotiation, social skills, empathy, real language understanding
[Figure: cranial capacity (500 to 1500 cm^3) against millions of years ago for Sahelanthropus, Australopithecus, Homo habilis, Homo erectus, Homo neanderthalensis, and Homo sapiens; annotated "Mate selection"]

35 AI Alignment: Learning from human feedback [Christiano et al., 2017]

36 How to communicate goals quickly?
- One approach: have humans judge the behavior of an algorithm

37 Human judges select good behavior

38 Fit a scalar reward function to the human feedback
- Optimize a triplet loss: if a human judge deems that A > B, learn a real-valued reward consistent with the human feedback
[Diagram: the reward predictor is trained on human feedback and passes a predicted reward to the RL algorithm, which exchanges observations and actions with the environment]
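
One way to fit a scalar reward to pairwise judgments is sketched below. It uses a Bradley-Terry style logistic loss on comparisons, which is an assumption and may differ from the exact loss the slide has in mind. The linear reward model, the two made-up features (progress, collisions), and the example comparisons are all hypothetical.

```python
import math

# Each entry is (preferred_segment, other_segment); a segment is summarized
# by two hypothetical features: (progress, collisions).
PREFERENCES = [
    ((1.0, 0.0), (0.0, 0.0)),
    ((2.0, 1.0), (2.0, 3.0)),
    ((3.0, 0.0), (1.0, 1.0)),
    ((1.0, 1.0), (0.0, 2.0)),
]


def reward(weights, features):
    """Linear reward model: a real-valued score for one segment."""
    return sum(w * x for w, x in zip(weights, features))


def fit_reward(preferences, dim, steps=500, lr=0.1):
    """Gradient ascent on the mean of log sigmoid(r(A) - r(B)) over pairs
    where the judge said A > B."""
    weights = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for preferred, other in preferences:
            margin = reward(weights, preferred) - reward(weights, other)
            # Derivative of log sigmoid(margin) with respect to the margin.
            coeff = 1.0 - 1.0 / (1.0 + math.exp(-margin))
            for i in range(dim):
                grad[i] += coeff * (preferred[i] - other[i]) / len(preferences)
        weights = [w + lr * g for w, g in zip(weights, grad)]
    return weights
```

After `fit_reward(PREFERENCES, dim=2)`, the learned reward ranks every preferred segment above its alternative, and it can then be handed to an RL algorithm in place of an environment reward.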

39 It works
- 500 bits of interaction

40 It works
- Several thousand bits of interaction to solve Atari games

41 Can easily convey unusual goals
- Example: drive right behind the competitor

42 Alignment: the future
- The technical problem of subtle communication will likely be solved
- But what are the right goals? Political problem

43 Thanks! Visit openai.com to learn more.
