Title: Comparison between Different Reinforcement Learning Algorithms on an OpenAI Gym Environment (CartPole-v0)
Author: KIM Zi Won
Date: 2017. 11. 24.
Table of Contents
1. Introduction
  (1) Q-Learning
  (2) The introduction to Deep Q Network (DQN) by DeepMind
  (3) Development of improvement variations to the DQN
  (4) Project Scope
2. Infrastructure
  (1) Environment
  (2) System
    a) Input & Output
    b) Evaluation
    c) Changes
3. Outcome
4. Evaluation
5. Future Research
  (1) Different Gym environments and implementations
  (2) Pygame environments
  (3) Running on cloud
1. Introduction

(1) Q-Learning
Q-learning is a reinforcement learning technique that can be used to learn a model-free optimal action-selection policy for a Markov decision process. Learning is driven by an action-value function that gives the expected utility of taking an action in a given state, taking into account the discounted expected utility of future actions under an optimal policy at the future state. The Q-learning update rule is:

    Q(s, a) := Q(s, a) + α [r + γ max_{a'} Q(s', a') − Q(s, a)]

where Q(s, a) is updated for the last state-action pair (s, a) using the observed next state s' and reward r, with α as the learning rate and γ as the discount factor.

(2) The introduction to Deep Q Network (DQN) by DeepMind
In December 2013, DeepMind introduced its Deep Q Network (DQN) algorithm [1]. It was a breakthrough for reinforcement learning in that it makes use of Convolutional Neural Networks (CNNs) and takes raw visual input as the state to play Atari games. The technique was a huge success and was later featured on the cover of the journal Nature.

(3) Development of improvement variations to the DQN
The deep reinforcement learning community has since come up with many variations of the initial DQN, including Dueling DQN, Asynchronous Advantage Actor-Critic (A3C), Double DQN, and more. In early October 2017, DeepMind released another paper on Rainbow DQN [2], which combines the benefits of the previous DQN variants and shows that the combination outperforms all previous DQN models.

(4) Project Scope
This project covers DQN, DRQN, Actor-Critic, and Actor-Critic with Experience Replay using existing code, and compares their performance on an OpenAI Gym game environment.

[1] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, Martin Riedmiller. Playing Atari with Deep Reinforcement Learning. NIPS Deep Learning Workshop 2013.
arXiv:1312.5602v1.
[2] Matteo Hessel, Joseph Modayil, Hado van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, David Silver. Rainbow: Combining Improvements in Deep Reinforcement Learning. https://arxiv.org/pdf/1710.02298.pdf
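As a concrete illustration of the tabular Q-learning update rule described in the introduction, the sketch below applies one update step in Python. The state/action space sizes, learning rate, discount factor, and the example transition are illustrative assumptions, not values taken from this project.

```python
import numpy as np

# Tabular Q-learning update for the rule
#   Q(s, a) := Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
# Hypothetical sizes: 16 states, 4 actions (not from this project).
n_states, n_actions = 16, 4
alpha, gamma = 0.1, 0.99          # learning rate and discount factor

Q = np.zeros((n_states, n_actions))

def q_update(s, a, r, s_next):
    """Apply one Q-learning step for the transition (s, a, r, s_next)."""
    td_target = r + gamma * np.max(Q[s_next])   # observed reward + best future utility
    Q[s, a] += alpha * (td_target - Q[s, a])    # move Q(s, a) toward the target

# Example transition: state 0, action 1, reward 1.0, next state 2.
q_update(0, 1, 1.0, 2)
```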
2. Infrastructure

(1) Environment
Compute instance: AWS EC2 (g2.2xlarge, us-west region)
OS: Ubuntu
Python: 3.5.4
TensorFlow: 1.4.0
Gym: 0.9.3
NumPy: 1.12.1
Base source code: Implementations by Kyushik (https://github.com/kyushik/gym_drl) [3]

(2) System
a) Input & Output
Input: An agent on CartPole-v0 with DQN or one of its variations as its learning algorithm
Output: Training time required to reach the maximum score (optimal policy)

b) Evaluation
The training time and the average score achieved by the output policy are compared across the learning algorithms.

c) Changes
The four algorithms compared in this paper share three parameters that matter for the learning-process comparison: Epsilon, Final_Epsilon, and # Training Episodes. Epsilon denotes the degree of exploration in the learning algorithm; it is set to 0 during the testing process, when the learned policy is used, so that we can tell whether an optimal policy has been found. # Training Episodes denotes the number of training iterations the program goes through to find the optimal policy. Epsilon is decreased by 1/(# Training Episodes) each episode until it reaches the Final_Epsilon value of 0.01. (Fixed constants: learning rate = 0.001, initial epsilon = 1, final epsilon = 0.01, testing epsilon = 0, num_replay_memory = 500 unless stated otherwise, number of observation episodes = # Training Episodes / 5.) For the Actor-Critic models, the learning rates given in the source code are used. Within the scope of this project, # Training Episodes, # Observation, and # Replay Memory are varied to compare which algorithm performs best in terms of score and execution time; all other factors are shared. Note that the maximum score achievable in this game environment is 200.

[3] The base source code was modified to measure program execution time and to turn off UI rendering so that it can run on an AWS EC2 Ubuntu instance.
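The epsilon-annealing schedule described in c) can be sketched as follows. This assumes a linear decrement of 1/(# Training Episodes) per episode, floored at Final_Epsilon; the function and variable names are illustrative and are not taken from the base source code.

```python
# Linear epsilon annealing: epsilon starts at 1 and is reduced by
# 1 / num_training_episodes each episode, floored at final_epsilon.
# Names are illustrative, not from the base source code.
initial_epsilon, final_epsilon = 1.0, 0.01
num_training_episodes = 1000

def epsilon_at(episode):
    """Exploration rate after `episode` training episodes."""
    eps = initial_epsilon - episode / num_training_episodes
    return max(eps, final_epsilon)
```

During testing, epsilon is simply fixed at 0 so the learned policy acts greedily.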
3. Outcome

A. Training Episodes = 1000, Observation Episodes = 1000, #_Replay_Memory = 500

    Algorithm                            Avg. Score   Execution Time
    DQN                                  66.41        5.29
    DRQN                                 9.0          84.84
    Actor Critic                         8.1          44.75
    Actor Critic w/ Experience Replay    7.73         46.0

B. Training Episodes = 5000, Observation Episodes = 1000, #_Replay_Memory = 500

    Algorithm                            Avg. Score   Execution Time
    DQN                                  45.17        23.47
    DRQN                                 60.49        58.67
    Actor Critic                         81.5         24.44
    Actor Critic w/ Experience Replay    83.5         28.40

C. Training Episodes = 10000, Observation Episodes = 2000, #_Replay_Memory = 500

    Algorithm                            Avg. Score   Execution Time
    DQN                                  145.73       46.35
    DRQN                                 167.71       108.65
    Actor Critic                         129.0        47.46
    Actor Critic w/ Experience Replay    200.0        53.35

D. Training Episodes = 20000, Observation Episodes = 2000, #_Replay_Memory = 500

    Algorithm                            Avg. Score   Execution Time
    DQN                                  174.95       92.98
    DRQN                                 6.82         203.60
    Actor Critic                         200.0        93.66
    Actor Critic w/ Experience Replay    200.0        109.7

It was observed that Actor Critic and Actor Critic w/ Experience Replay did not require 20000 iterations.

E. Training Episodes = 8000, Observation Episodes = 2000, #_Replay_Memory = 1000

    Algorithm                            Avg. Score   Execution Time
    DQN                                  180.99       36.98
    DRQN                                 194.21       87.81
    Actor Critic                         200.0        38.72
    Actor Critic w/ Experience Replay    98.5         45.24

F. Training Episodes = 8000, Observation Episodes = 2000, #_Replay_Memory = 500

    Algorithm                            Avg. Score   Execution Time
    DQN                                  168.94       37.06
    DRQN                                 180.24       86.90
    Actor Critic                         200.0        38.65
    Actor Critic w/ Experience Replay    155.0        45.05

G. Training Episodes = 8000, Observation Episodes = 2000, #_Replay_Memory = 200

    Algorithm                            Avg. Score   Execution Time
    DQN                                  167.31       36.97
    DRQN                                 152.09       87.49
    Actor Critic                         200.0        38.63
    Actor Critic w/ Experience Replay    200.0        45.05

H. Training Episodes = 25000, Observation Episodes = 2500, #_Replay_Memory = 250

    Algorithm                            Avg. Score   Execution Time
    DQN                                  199.62       116.06
    DRQN                                 -180.32      252.39
    Actor Critic                         177.0        116.84
    Actor Critic w/ Experience Replay    200.0        137.91
4. Evaluation
Overall, each algorithm seems to have its own parameter settings under which it performs best, which makes it impractical to conclude, under a single shared condition, which algorithm is the best. Still, among the four algorithms tested in this paper, the Actor Critic algorithms seem to perform best, reaching the maximum score in 4 of the 8 experiments above while keeping execution times close to DQN's and well below DRQN's. This is probably because Actor Critic algorithms have an advantage over DQN-style algorithms: they estimate and iteratively improve both the policy and the value function, whereas DQN estimates only the value function.
For all algorithms, experiments A to D show clearly that the policy improves with more training iterations. A comparison over the observation episode parameter, which denotes the number of episodes observed with epsilon fixed at 1 (pure random exploration) before training begins, was not performed, because the number of training episodes is of greater comparative importance.
It is also clear that for Actor Critic with Experience Replay, reducing the replay memory size significantly improves its score. This is because with a smaller replay memory, the algorithm can adjust its cost function more often and thus achieve a better result in less time, i.e., in fewer training iterations.
In conclusion, within the scope of this paper, Actor Critic outperforms the other algorithms: it runs faster than Actor Critic with Experience Replay and is less sensitive to the replay memory size.
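The replay-memory effect discussed above comes down to a bounded buffer of recent transitions: a smaller capacity discards old experience sooner, so sampled minibatches track recent behavior more closely. Below is a minimal sketch of such a buffer; the class and method names are illustrative and do not come from the base source code.

```python
import random
from collections import deque

class ReplayMemory:
    """Minimal experience-replay buffer sketch (illustrative names)."""

    def __init__(self, capacity):
        # With maxlen set, the oldest transitions fall off automatically.
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Draw a random minibatch of transitions for one training step."""
        return random.sample(self.buffer, batch_size)

memory = ReplayMemory(capacity=500)
for t in range(600):
    memory.store(t, 0, 1.0, t + 1, False)
# Capacity 500: only the most recent 500 transitions remain in the buffer.
```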
5. Future Research
(1) Different Gym environments and implementations
Different implementations of the above algorithms and more Gym environments could be tested for further comparison. Some directions to try are listed below.
https://github.com/morvanzhou/reinforcement-learning-with-tensorflow
https://github.com/keon/deep-q-learning
(2) Pygame environments
Running similar experiments with different types of reinforcement learning algorithms in a different environment is also suggested. For example, the DRL repository by Kyushik on GitHub [4] has many different RL algorithms developed for Pygame environments.
(3) Running on cloud
Most RL code involves deep learning, so this research can be sped up by using powerful compute resources in the cloud. It is suggested to use AWS EC2 GPU instances such as g2.2xlarge, together with Jupyter and SSH port forwarding, to speed up the training process without sacrificing much of the development environment.
[4] https://github.com/kyushik/drl
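As an illustration of the suggested Jupyter-over-SSH setup, the commands below show one common way to forward a remote Jupyter server to a local browser. The key path, instance address, and port numbers are placeholders, not values from this project.

```shell
# On the EC2 instance: start Jupyter without opening a browser.
jupyter notebook --no-browser --port=8888

# On the local machine: forward local port 8888 to the instance's port 8888.
# (Placeholder key path and hostname; substitute your own.)
ssh -i ~/.ssh/my-key.pem -N -L 8888:localhost:8888 ubuntu@ec2-xx-xx-xx-xx.us-west-1.compute.amazonaws.com

# Then open http://localhost:8888 in a local browser.
```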