
Title: Comparison between different Reinforcement Learning algorithms on Open AI Gym environment (CartPole-v0)
Author: KIM Zi Won
Date: 2017. 11. 24.

Table of Contents
1. Introduction
   (1) Q-Learning
   (2) The introduction to Deep Q Network (DQN) by DeepMind
   (3) Development of improvement variations to the DQN
   (4) Project Scope
2. Infrastructure
   (1) Environment
   (2) System
      a) Input & Output
      b) Evaluation
      c) Changes
3. Outcome
4. Evaluation
5. Future Research
   (1) Different GYM environments and implementations
   (2) Pygame Environments
   (3) Running on Cloud

1. Introduction

(1) Q-Learning
Q-learning is a model-free reinforcement learning technique that can be used to learn an optimal action-selection policy for a Markov decision process. The learning process involves an action-value function Q(s, a) that gives the expected utility of taking action a in a given state s, also taking into account the discounted expected utility of future actions under an optimal policy at the future state. The Q-learning update rule is:

Q(s, a) := Q(s, a) + α [r + γ max_a' Q(s', a') − Q(s, a)]

where Q(s, a) is updated for the last state-action pair (s, a) using the observed outcome state s' and reward r, with α as the learning rate and γ as the discount factor. (A minimal code sketch of this update is given at the end of this section.)

(2) The introduction to Deep Q Network (DQN) by DeepMind
In December 2013, DeepMind introduced its Deep Q Network (DQN) algorithm [1]. It was a breakthrough for reinforcement learning in that it makes use of Convolutional Neural Networks (CNNs) and uses raw visual inputs as states to play Atari games. The technique was a huge success and was later featured on the cover of the Nature journal.

(3) Development of improvement variations to the DQN
Since then, the deep reinforcement learning community has come up with many variations of the initial DQN, including Dueling DQN, Asynchronous Advantage Actor-Critic (A3C), Double DQN, and more. In October 2017, DeepMind released another paper on the Rainbow DQN [2], in which they combine the benefits of the previous DQN variants and show that it outperforms all previous DQN models.

(4) Project Scope
This project covers DQN, DRQN, Actor-Critic, and Actor-Critic with Experience Replay using existing code and compares their performance on an OpenAI Gym game environment.

[1] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, Martin Riedmiller. Playing Atari with Deep Reinforcement Learning. NIPS Deep Learning Workshop 2013. arXiv:1312.5602v1
[2] Matteo Hessel, Joseph Modayil, Hado van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, David Silver. Rainbow: Combining Improvements in Deep Reinforcement Learning. https://arxiv.org/pdf/1710.02298.pdf
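As an illustration of the Q-learning update rule in (1), the following minimal sketch applies tabular Q-learning to CartPole-v0 with a discretized observation space. It is illustrative only and is not one of the implementations compared in this paper; the discretization bins and the hyperparameter values (alpha, gamma, epsilon, number of episodes) are assumptions chosen for the example.

    import gym
    import numpy as np

    env = gym.make("CartPole-v0")

    # Discretize the 4-dimensional continuous observation into bins
    # (illustrative bin ranges, not tuned).
    n_bins = 8
    bins = [np.linspace(-2.4, 2.4, n_bins),    # cart position
            np.linspace(-3.0, 3.0, n_bins),    # cart velocity
            np.linspace(-0.21, 0.21, n_bins),  # pole angle
            np.linspace(-3.0, 3.0, n_bins)]    # pole angular velocity

    def discretize(obs):
        # Map a continuous observation to a tuple of bin indices.
        return tuple(int(np.digitize(x, b)) for x, b in zip(obs, bins))

    Q = np.zeros([n_bins + 1] * 4 + [env.action_space.n])
    alpha, gamma, epsilon = 0.1, 0.99, 0.1

    for episode in range(2000):
        s = discretize(env.reset())
        done = False
        while not done:
            # Epsilon-greedy action selection.
            if np.random.rand() < epsilon:
                a = env.action_space.sample()
            else:
                a = int(np.argmax(Q[s]))
            obs, r, done, _ = env.step(a)   # gym 0.9.x step API
            s_next = discretize(obs)
            # Q(s, a) := Q(s, a) + alpha * [r + gamma * max_a' Q(s', a') - Q(s, a)]
            Q[s][a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s][a])
            s = s_next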

2. Infrastructure

(1) Environment
Compute instance: AWS EC2 instance (g2.2xlarge, us-west region)
OS: Ubuntu
Python: 3.5.4
TensorFlow: 1.4.0
Gym: 0.9.3
NumPy: 1.12.1
Base source code used: implementations by Kyushik (https://github.com/kyushik/gym_drl) [3]

(2) System

a) Input & Output
Input: an agent on CartPole-v0 with DQN and its variations as its learning algorithm
Output: training time required to reach the maximum score (optimal policy)

b) Evaluation
The training time and the average score achieved with the output policy will be compared between the different learning algorithms.

c) Changes
The four algorithms compared in this paper share three important parameters for the learning process comparison: Epsilon, Final_Epsilon, and # Training Episodes. Epsilon denotes the degree of exploration of the learning algorithm; it is therefore set to 0 when the resulting policy is used in the testing process, so that we can tell whether an optimal policy has been found. # Training Episodes denotes the number of training iterations the program goes through to find the optimal policy. Epsilon is decreased by 1/(# Training Episodes) per episode until it reaches the Final_Epsilon value of 0.01, as sketched at the end of this section. (Fixed constants: learning rate = 0.001, initial epsilon = 1, final epsilon = 0.01, testing epsilon = 0, num_replay_memory = 500, number of observation episodes = # Training Episodes / 5.) For the actor-critic models, the learning rates given in the source code are used.

Within the scope of this project, # Training Episodes, # Observation, and # Replay Memory are varied to compare which algorithm gives the best performance in terms of score and execution time. All other factors are shared. Note that the maximum score achievable in this game environment is 200.

[3] The base source code was modified to measure program execution time and to turn off rendering of UI elements so that it can run on AWS EC2 Ubuntu instances.
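The epsilon schedule described in (2) c) could be written as the following sketch, assuming a linear decrease of 1/(# Training Episodes) per training episode down to Final_Epsilon; the function name and its exact form are illustrative assumptions, not code taken from the base source.

    def annealed_epsilon(episode, num_training_episodes,
                         initial_epsilon=1.0, final_epsilon=0.01):
        # Subtract 1 / (# Training Episodes) each episode, clamped at
        # Final_Epsilon; during testing epsilon is set to 0 instead.
        return max(initial_epsilon - episode / float(num_training_episodes),
                   final_epsilon)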

3. Outcome

In each table below, "Avg. Score" is the average score achieved with the learned policy (maximum 200) and "Execution Time" is the program execution time.

A. Training Episodes = 1000, Observation Episodes = 1000, # Replay Memory = 500

   Algorithm                            Avg. Score   Execution Time
   DQN                                  66.41        5.29
   DRQN                                 9.0          84.84
   Actor Critic                         8.1          44.75
   Actor Critic w/ Experience Replay    7.73         46.0

B. Training Episodes = 5000, Observation Episodes = 1000, # Replay Memory = 500

   Algorithm                            Avg. Score   Execution Time
   DQN                                  45.17        23.47
   DRQN                                 60.49        58.67
   Actor Critic                         81.5         24.44
   Actor Critic w/ Experience Replay    83.5         28.40

C. Training Episodes = 10000, Observation Episodes = 2000, # Replay Memory = 500

   Algorithm                            Avg. Score   Execution Time
   DQN                                  145.73       46.35
   DRQN                                 167.71       108.65
   Actor Critic                         129.0        47.46
   Actor Critic w/ Experience Replay    200.0        53.35

D. Training Episodes = 20000, Observation Episodes = 2000, # Replay Memory = 500

   Algorithm                            Avg. Score   Execution Time
   DQN                                  174.95       92.98
   DRQN                                 6.82         203.60
   Actor Critic                         200.0        93.66
   Actor Critic w/ Experience Replay    200.0        109.7

It was observed that Actor Critic and Actor Critic w/ Experience Replay did not require 20000 iterations.

E. Training Episodes = 8000, Observation Episodes = 2000, # Replay Memory = 1000

   Algorithm                            Avg. Score   Execution Time
   DQN                                  180.99       36.98
   DRQN                                 194.21       87.81
   Actor Critic                         200.0        38.72
   Actor Critic w/ Experience Replay    98.5         45.24

F. Training Episodes = 8000, Observation Episodes = 2000, # Replay Memory = 500

   Algorithm                            Avg. Score   Execution Time
   DQN                                  168.94       37.06
   DRQN                                 180.24       86.90
   Actor Critic                         200.0        38.65
   Actor Critic w/ Experience Replay    155.0        45.05

G. Training Episodes = 8000, Observation Episodes = 2000, # Replay Memory = 200

   Algorithm                            Avg. Score   Execution Time
   DQN                                  167.31       36.97
   DRQN                                 152.09       87.49
   Actor Critic                         200.0        38.63
   Actor Critic w/ Experience Replay    200.0        45.05

H. Training Episodes = 25000, Observation Episodes = 2500, # Replay Memory = 250

   Algorithm                            Avg. Score   Execution Time
   DQN                                  199.62       116.06
   DRQN                                 -180.32      252.39
   Actor Critic                         177.0        116.84
   Actor Critic w/ Experience Replay    200.0        137.91

4. Evaluation

Overall, each algorithm has parameters that can be tuned to maximize its performance, so under a single shared configuration it is impractical to conclude which algorithm is best in general. Still, among the four algorithms tested in this paper, the Actor Critic algorithms perform best, reaching the maximum score in 4 of the 8 experiments above with execution times comparable to DQN and noticeably shorter than DRQN. This is probably because Actor Critic algorithms have an advantage over DQN-style algorithms in that they estimate and iterate on both the policy and the value, whereas DQN only estimates the value.

For all algorithms, experiments A to D show that the learned policy improves with more training iterations. Performance as a function of the number of observation episodes, i.e. the episodes spent on random exploration with an epsilon value of 1 before training begins, was not measured, because the number of training episodes is of major comparative importance.

It is also clear that, for Actor Critic with Experience Replay, reducing the replay memory size significantly improves its score. With a smaller replay memory, the algorithm can adjust its cost function more often and ultimately achieve a better result in a shorter time, i.e. in fewer training iterations. A sketch of such a fixed-capacity replay memory is given at the end of this section.

In conclusion, from the above observations it can be said that Actor Critic outperforms the other algorithms within the scope of this paper: it runs faster than Actor Critic with Experience Replay and is less sensitive to the replay memory size parameter.
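Below is a minimal sketch of a fixed-capacity replay memory of the kind whose size (num_replay_memory) is varied in the experiments above. It assumes a uniform-sampling buffer and is an illustration only, not the implementation used in the base source code.

    import random
    from collections import deque

    class ReplayMemory:
        """Fixed-capacity experience replay buffer (illustrative sketch)."""

        def __init__(self, capacity):
            # Once the buffer is full, the oldest transitions are discarded.
            self.buffer = deque(maxlen=capacity)

        def add(self, state, action, reward, next_state, done):
            self.buffer.append((state, action, reward, next_state, done))

        def sample(self, batch_size):
            # Uniformly sample a mini-batch of stored transitions for one update.
            return random.sample(self.buffer, min(batch_size, len(self.buffer)))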

5. Future Research

(1) Different GYM environments and implementations
Different implementations of the above algorithms and more Gym environments could be tested for further comparison. Some directions to try are listed below.
https://github.com/morvanzhou/reinforcement-learning-with-tensorflow
https://github.com/keon/deep-q-learning

(2) Pygame Environments
Running similar experiments with different types of reinforcement learning algorithms in a different kind of environment is also suggested. For example, the DRL repository by Kyushik on GitHub [4] contains many different RL algorithms implemented for Pygame environments.

(3) Running on Cloud
Much reinforcement learning code involves deep learning, so the research can be sped up by utilizing powerful compute resources on the cloud. Using AWS EC2 GPU instances such as g2.2xlarge together with Jupyter and SSH port forwarding is suggested, to speed up the training process without sacrificing much of the development environment.

[4] https://github.com/kyushik/drl