REINFORCEMENT LEARNING

[Slide: table of self-driving tasks vs. methods. Tasks (localization, perception, detection/segmentation/classification, planning/control, driver state and behavior prediction/driver identification, vehicle diagnosis, smart factory) are mapped to non-machine-learning methods (GPS, SLAM, optimal control), classical machine learning (SVM, MLP; e.g., pedestrian detection with HOG+SVM), and deep learning (CNN, RNN/LSTM; e.g., dry/wet road classification, end-to-end learning), with DNN, reinforcement, and unsupervised approaches marked where they apply.]

Planning

Hope for Reinforcement Learning
Supervised learning: neural networks are great at memorization and not (yet) great at reasoning.
Reinforcement learning: brute-force propagation of outcomes into knowledge about states and actions.
Hope for deep learning + reinforcement learning: general-purpose artificial intelligence through efficient, generalizable learning of the optimal thing to do, given a formalized set of actions and states.

INTRODUCTION TO REINFORCEMENT LEARNING

[Slide repeated: the self-driving tasks vs. methods table shown above.]

DeepMind's DQN playing Breakout

Deep Q-network

How to train? In the supervised learning setting, we collect training samples (x_i, y_i) and train the network: here x_i is a game state (the screen image) and y_i the corresponding joystick control. In the reinforcement setting, where do such labeled samples come from?

INTRODUCTION TO REINFORCEMENT LEARNING

Reinforcement Learning
Reinforcement learning is an area of machine learning concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward. Atari example.

Reinforcement Learning
Learning from interaction; goal-oriented learning; learning about, from, and while interacting with an external environment.
Key features of RL: the learner is not told which actions to take; trial-and-error search; the possibility of delayed reward (sacrificing short-term gains for greater long-term gains); the need to explore and exploit.

Reinforcement Learning Setting
S: the set of states. A: the set of actions. R: S × A → ℝ, the reward for a given state and action.

Reinforcement Learning Terms
Policy: a = π(s). A policy π is a mapping from each state s ∈ S to an action a ∈ A(s).
(State-) value function: V^π(s), the expected future reward given a current state s ∈ S and policy π.
Q-function (action-value function): Q^π(s, a), the expected future reward given a state-action pair (s, a) and policy π.
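The slide leaves "expected future reward" informal; a standard way to write it, assuming a discount factor γ ∈ [0, 1) (not stated on the slide), is:

```latex
V^{\pi}(s)   = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\middle|\, s_t = s\right],
\qquad
Q^{\pi}(s,a) = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\middle|\, s_t = s,\; a_t = a\right].
```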

Reinforcement Learning Terms
[Slide: diagrams contrasting three network types. Policy network: state in, action out. Value network: state in, a single value out. Q-network: state in, one value per action out (practical when the number of actions is small). Value = expected reward.]

Deep Q-network Network output: expected future reward when taking each action

LEARNING METHOD: DEEP Q-LEARNING

Deep Q-network From pixels to Actions: Human-level control through Deep Reinforcement Learning

How to train: Q-Learning
Optimal Q-values should obey the Bellman equation:
Q*(s, a) = Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ max_{a'} Q*(s', a') ]
With a parameterized network Q(s, a; w), the sampled form is
Q(s, a; w) = r + γ max_{a'} Q(s', a'; w).
Treat the right-hand side, r + γ max_{a'} Q(s', a'; w), as a target and minimize the MSE loss by stochastic gradient descent.
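A minimal sketch of that update in PyTorch (the network shapes, the replay-buffer batch format, and the separate target network are assumptions, not from the slides):

```python
import torch
import torch.nn as nn

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """Regress Q(s, a; w) toward the target r + gamma * max_a' Q(s', a'; w)."""
    s, a, r, s_next, done = batch                            # tensors from a replay buffer
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)     # Q(s, a; w) for the taken actions
    with torch.no_grad():                                    # the target is held fixed
        best_next = target_net(s_next).max(dim=1).values     # max_a' Q(s', a'; w)
        target = r + gamma * (1.0 - done) * best_next        # no future value at terminal states
    return nn.functional.mse_loss(q_sa, target)
```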

LEARNING METHOD: POLICY GRADIENT

Policy Network

Policy Gradient Method
Random initialization, then repeat:
1. Generate samples (run the policy).
2. Policy improvement: reward-weighted gradient learning (similar to supervised learning).
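A minimal sketch of one such reward-weighted update (REINFORCE-style; the policy interface, episode format, and discounting are assumptions):

```python
import torch

def reinforce_update(policy, optimizer, episodes, gamma=0.99):
    """Reward-weighted gradient step: scale each log pi(a|s) by the return that followed."""
    loss = torch.tensor(0.0)
    for states, actions, rewards in episodes:   # samples generated by running the policy
        g, returns = 0.0, []
        for r in reversed(rewards):             # discounted return from each time step
            g = r + gamma * g
            returns.append(g)
        returns.reverse()
        for s, a, g_t in zip(states, actions, returns):
            log_p = torch.log(policy(s)[a])     # log pi(a|s)
            loss = loss - g_t * log_p           # gradient ascent on reward-weighted likelihood
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```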

[Slide: visualization of 40 (out of 200) learned hidden neurons.]

CASE STUDY: ALPHAGO

Go (baduk)

Search space
Go: branching factor b ≈ 250, depth d ≈ 150, so roughly 250^150 ≈ 5 × 10^359 sequences.
Chess: b ≈ 35, d ≈ 80, so roughly 35^80 ≈ 3 × 10^123.

BOARD GAME STRATEGY

Board Game Strategy To win the game, we only need to build a game tree

Board Game Strategy
To win the game, we need to find p*(a|s), the optimal action-selection policy: which action should I take?

Board Game Strategy
To win the game, we need to find v*(s), the optimal value function.

THREE COMPONENTS OF ALPHAGO

Monte Carlo Tree Search

Reducing search depth with the value network

Reducing search breadth with the policy network

MONTE CARLO TREE SEARCH

Monte Carlo Tree Search: a method for finding optimal decisions in a given domain by taking random samples in the decision space and building a search tree according to the results.

One iteration of the general MCTS approach

General MCTS approach
Selection: starting at the root node, a child-selection policy is recursively applied to descend through the tree until the most urgent expandable node is reached.
Expansion: one (or more) child nodes are added to expand the tree, according to the available actions.
Simulation: a simulation is run from the new node(s) according to the default policy to produce an outcome (a reward).
Backpropagation: the simulation result is backed up through the selected nodes to update their statistics.

General MCTS approach
Playout, rollout, simulation: playing out the task to completion according to the default policy.
Four criteria for selecting the winning action:
Max child: select the root child with the highest reward.
Robust child: select the most visited root child.
Max-robust child: select the root child with both the highest visit count and the highest reward; if none exists, continue searching until an acceptable visit count is achieved.
Secure child: select the child which maximizes a lower confidence bound.
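A minimal sketch of the four-phase loop in Python (the Node interface with untried_actions, children, expand, state, parent, visits, and total_reward is an assumption; the tree policy is left pluggable and is discussed in the next section):

```python
import random

def mcts(root, n_iters, tree_policy, default_policy):
    """Generic MCTS: selection, expansion, simulation, backpropagation."""
    for _ in range(n_iters):
        node = root
        # Selection: descend with the tree policy until an expandable node is reached
        while not node.untried_actions and node.children:
            node = tree_policy(node)
        # Expansion: add one child node for a not-yet-tried action
        if node.untried_actions:
            node = node.expand(random.choice(node.untried_actions))
        # Simulation: play out to completion with the default policy
        reward = default_policy(node.state)
        # Backpropagation: back the result up through the selected nodes
        while node is not None:
            node.visits += 1
            node.total_reward += reward
            node = node.parent
    return max(root.children, key=lambda ch: ch.visits)  # the "robust child" criterion
```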

HOW TO DESIGN TREE POLICY? MULTI-ARMED BANDIT

Multi-armed bandit The K-armed bandit problem may be approached using a policy that determines which bandit to play, based on past rewards.

UCT (Upper Confidence Bounds for Trees) algorithm

Exploration vs. Exploitation
In the UCT score, UCT = X̄_j + 2 C_p √(2 ln n / n_j), the first term (the average reward X̄_j of child j) encourages the exploitation of higher-reward choices, while the second term (large when child j has few visits n_j relative to its parent's n visits) encourages the exploration of less-visited choices.
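A sketch of that score as the tree policy for the MCTS loop above (same assumed Node fields as before; C_p = 1/√2 is the commonly cited default, not a value from the slide):

```python
import math

def uct_select(node, c_p=1 / math.sqrt(2)):
    """Pick the child maximizing average reward plus an exploration bonus."""
    def uct(ch):
        exploit = ch.total_reward / ch.visits  # average reward of the child
        explore = 2 * c_p * math.sqrt(2 * math.log(node.visits) / ch.visits)
        return exploit + explore
    return max(node.children, key=uct)
```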

ALPHAGO

3 key components in AlphaGo: MCTS, the policy network, and the value network.

POLICY NETWORK

Policy network
To imitate expert moves. There are 19² = 361 possible actions (with different probabilities).

Policy network

3 policy networks: the supervised learning policy network, the reinforcement learning policy network, and the roll-out policy network.

Supervised learning of policy networks
Policy network: 12-layer convolutional neural network.
Training data: 30M positions from human expert games (KGS 5+ dan).
Training algorithm: maximize the likelihood by stochastic gradient descent, Δσ ∝ ∂ log p_σ(a|s) / ∂σ.
Training time: 4 weeks on 50 GPUs using Google Cloud.
Results: 57% accuracy on held-out test data (the previous state of the art was 44%).

Supervised learning of policy networks
[Slide: architecture diagram. 19×19×48 input features, 12 convolutional + rectifier layers, and a softmax output; the predicted probability map is compared with the move played by the human expert.]
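A toy sketch of one maximum-likelihood step for such a network (the layer sizes here are placeholders, far smaller than the real 12-layer model):

```python
import torch
import torch.nn as nn

# Toy stand-in for the policy network: 48 feature planes on a 19x19 board in,
# a distribution over the 361 board points out.
policy = nn.Sequential(
    nn.Conv2d(48, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 1, kernel_size=1), nn.Flatten(),   # logits over 19*19 = 361 moves
)
opt = torch.optim.SGD(policy.parameters(), lr=0.01)

def sl_step(states, expert_moves):
    """Maximize log p_sigma(a|s): cross-entropy against the human expert's move."""
    loss = nn.functional.cross_entropy(policy(states), expert_moves)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```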

Reinforcement learning of policy networks
Policy network: 12-layer convolutional neural network.
Training data: games of self-play between policy networks.
Training algorithm: maximize wins z by policy gradient reinforcement learning, Δρ ∝ ∂ log p_ρ(a_t|s_t) / ∂ρ · z.
Training time: 1 week on 50 GPUs using Google Cloud.
Results: 80% win rate vs. the supervised learning network; the raw network plays at roughly 3 amateur dan.

Training the RL policy network p_ρ
A refined version of the SL policy p_σ: initialize the weights to ρ = σ.
Maintain an opponent pool {ρ⁻ : ρ⁻ is an old version of ρ} and train by playing p_ρ vs. p_ρ⁻.
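A sketch of the self-play update (the trajectory format, the logits output of policy(s), and the outcome encoding z ∈ {+1, −1} for win/loss are assumptions consistent with the slide's description):

```python
import torch

def rl_policy_step(policy, opt, trajectory, z):
    """REINFORCE on the game outcome: z = +1 for a win, -1 for a loss."""
    loss = torch.tensor(0.0)
    for s, a in trajectory:                                  # moves made by the current network
        log_p = torch.log_softmax(policy(s), dim=-1)[0, a]   # log p_rho(a_t | s_t)
        loss = loss - z * log_p                              # ascend z * grad log p_rho(a|s)
    opt.zero_grad()
    loss.backward()
    opt.step()
```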

Roll-out policy network
A faster version of the supervised learning policy network p(a|s), using shallow networks (3 ms → 2 μs per move).

VALUE NETWORK

Value network

Value network

Reinforcement learning of value networks
Value network: 12-layer convolutional neural network.
Training data: 30 million games of self-play.
Training algorithm: minimize MSE by stochastic gradient descent, Δθ ∝ ∂v_θ(s)/∂θ · (z − v_θ(s)).
Training time: 1 week on 50 GPUs using Google Cloud.
Results: the first strong position-evaluation function, previously thought impossible.

Training the value network v_θ
[Slide: architecture diagram. 19×19×48 input, convolutional + rectifier layers, a fully connected layer, and a scalar output.]
Position evaluation: approximating the optimal value function. Input: state; output: probability of winning. Goal: minimize MSE.
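A sketch of the regression step (the tanh squashing to a scalar in [−1, 1] and the network handle are assumptions):

```python
import torch
import torch.nn as nn

def value_step(value_net, opt, states, z):
    """Minimize MSE between v_theta(s) and the self-play outcome z."""
    v = torch.tanh(value_net(states)).squeeze(-1)   # one scalar evaluation per position
    loss = nn.functional.mse_loss(v, z)             # z: +1 win, -1 loss, per position
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```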

TRAINING

Input Features

Training the Deep Neural Networks

Summary: Training the Deep Neural Networks

MCTS

Monte Carlo Tree Search

Edges store the statistics {P(s,a), N_v(s,a), N_r(s,a), W_v(s,a), W_r(s,a), Q(s,a)}:
P(s,a): prior probability.
N_v(s,a): number of leaf (value network) evaluations.
W_v(s,a): Monte Carlo estimated action value accumulated over the N_v(s,a) evaluations.
N_r(s,a): number of roll-out evaluations.
W_r(s,a): Monte Carlo estimated action value accumulated over the N_r(s,a) roll-outs.

Monte Carlo Tree Search: selection
Each edge (s, a) stores: Q(s, a), the action value (average value of the subtree); N(s, a), the visit count; and P(s, a), the prior probability.
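The slide does not reproduce the selection rule; in the AlphaGo paper, selection maximizes Q(s, a) + u(s, a), with a bonus u(s, a) ∝ P(s, a) / (1 + N(s, a)). A sketch under that rule (the edge object with fields P, N, Q and the constant c_puct are assumptions):

```python
import math

def select_edge(node, c_puct=5.0):
    """AlphaGo-style selection: argmax over Q(s, a) + u(s, a)."""
    total_n = sum(e.N for e in node.edges)                 # visits to the parent state
    def score(e):
        u = c_puct * e.P * math.sqrt(total_n) / (1 + e.N)  # prior-weighted, decays with visits
        return e.Q + u
    return max(node.edges, key=score)
```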

Monte Carlo Tree Search: evaluation
Leaf evaluation uses two sources: the value network and a random roll-out.

Monte Carlo Tree Search: backup Value network Roll-out
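The slide shows the two backup paths without the combination formula; for reference, the AlphaGo paper mixes the value-network and roll-out statistics per edge with a mixing constant λ:

```latex
Q(s, a) \;=\; (1 - \lambda)\,\frac{W_v(s, a)}{N_v(s, a)} \;+\; \lambda\,\frac{W_r(s, a)}{N_r(s, a)}
```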

How to choose the next move? The maximum visit count: it is less sensitive to outliers than the maximum action value.

Training the Deep Neural Networks

AlphaGo vs. Experts: 4:1