Computational Science and Engineering (Int. Master's Program)
Technische Universität München
Master's Thesis
Deep Reinforcement Learning for Superhuman Performance in Doom
Ivan Rodríguez

Computational Science and Engineering (Int. Master's Program)
Technische Universität München
Master's Thesis
Deep Reinforcement Learning for Superhuman Performance in Doom
Author: Ivan Rodríguez
1st examiner: Univ.-Prof. Dr. Hans-Joachim Bungartz
2nd examiner: Univ.-Prof. Dr. Thomas Huckle
Assistant advisor(s): M.Sc. Moritz August
Thesis handed in on: July 15, 2017

I hereby declare that this thesis is entirely the result of my own work except where otherwise indicated. I have only used the resources given in the list of references.
June 24, 2017
Ivan Rodríguez

Abstract

Over the last few years, Reinforcement Learning (RL) has attracted the attention of many researchers. Its combination with deep artificial neural networks used as function approximators has proven successful in many works. Rather than targeting classical RL problems, the most prominent of these works develop techniques that allow agents to learn to play video and board games from raw input data at a human level. In this thesis, we describe the implementation of an algorithm to train an agent for the popular 1990s computer game Doom. Doom features several recurrent problems in RL, such as delayed rewards and partial observability, which are tackled by the algorithm. In particular, we discuss our efforts to improve the efficiency of our approach and the results obtained in several test scenarios.


Contents

Abstract
Outline of the Thesis

I. Introduction and Theory

II. Development of a bot for Doom
1. Approach
   1.1. DFP setting
   1.2. Model
   1.3. Training
2. A more efficient implementation
   2.1. GA3C setting
   2.2. Asynchronous DFP

III. Results

Appendix
A. Implementation Details

Bibliography

Outline of the Thesis

Part I: Introduction and Theory

CHAPTER 1: INTRODUCTION
This chapter presents an overview of the thesis and the motivation behind it.

CHAPTER 2: CLASSIC REINFORCEMENT LEARNING
Here we give the fundamental elements used to describe Reinforcement Learning (RL) problems and the abstractions used to solve them. Afterwards, we present the properties of different RL methods, which mainly fall into two categories: tabular and approximation methods.

CHAPTER 3: DEEP REINFORCEMENT LEARNING
In this chapter, we discuss how Deep Neural Networks come into play for RL problems. In particular, we present simulation environments for video games (including ViZDoom for Doom) and approaches built around them.

Part II: Development of a bot for Doom

CHAPTER 4: APPROACH
This chapter presents the approach our work is based on and how it tackles the challenges posed by Doom.

CHAPTER 5: A MORE EFFICIENT IMPLEMENTATION
Here we present the improvements built on top of the original approach.

Part III: Results

CHAPTER 6: EXPERIMENTS
Four scenarios of increasing difficulty were considered for experimentation. The training and evaluation performance of several artificial agents is shown in this chapter.

Part IV: Conclusion

CHAPTER 7: SUMMARY AND OUTLOOK
In this chapter, we present our conclusions.

Part I. Introduction and Theory

Part II. Development of a bot for Doom

1. Approach

We followed the approach taken by [2], the winners of the Full Deathmatch track (unknown maps, more than one weapon available) of the Visual Doom AI Competition 2016. In this chapter, their algorithm, called DFP (Direct Future Prediction), is described in detail.

1.1. DFP setting

An artificial Doom player is spawned in an unknown map (the environment) with a set of actions A. The interaction with the environment is carried out over discrete timesteps t = 0, 1, 2, ... in the form of episodes that end with the death of the player or when a maximum number of steps is reached. At each timestep, the player receives an observation o_t composed of an input image s_t and a vector of measurements m_t. Depending on the observation, the agent performs an action a_t ∈ A and, as a consequence, its measurements are affected. The objective is thus to choose actions in such a way that the measurement values are maximized during an episode.

For discrete temporal offsets τ_1, τ_2, ..., τ_n, the vector f contains the differences between future and current measurements and is defined as

f = [m_{t+τ_1} - m_t, m_{t+τ_2} - m_t, ..., m_{t+τ_n} - m_t].

In addition, the maximization objective for the measurements is assumed to take the form

u(f; g) = g^T f    (1.1)

where g, the goal vector, is a parametrization vector of the same size as f that is specified at the beginning of training but can be changed at test time. This representation allows us to specify how much weight particular future measurements carry relative to others.

1.2. Model

To predict future measurements, a Deep Neural Network (DNN) with parameter vector θ is used. The network takes an image s_t, a vector of measurements m_t, and a goal vector g as inputs (Figure 1.1). The inputs are first processed by a convolutional network and several stacked fully connected layers. Next, the results are concatenated and split into two streams, Expectation and Advantage. The former estimates the average of the future measurements given the current observation, while the latter estimates the advantage of taking a particular action over all the other possible actions. In the Advantage stream, the Normalize operation is carried out as in Equation ??. Finally, both streams are added and a prediction is obtained.
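As an illustration of this two-stream combination, the following minimal NumPy sketch adds a normalized Advantage stream to the Expectation stream. It is not the thesis code: the array names and shapes are assumptions, and the Normalize step is sketched here as subtracting the mean over actions, which should be checked against the referenced equation.

```python
import numpy as np

def combine_streams(expectation, advantage):
    """Combine the Expectation and Advantage streams of the DFP network.

    expectation: array of shape (n_targets,) -- average future measurements
                 predicted for the current observation.
    advantage:   array of shape (n_actions, n_targets) -- per-action offsets.

    The Normalize step (as assumed here) subtracts the mean over actions from
    the advantage stream, so only relative differences between actions remain.
    """
    advantage = advantage - advantage.mean(axis=0, keepdims=True)
    # Broadcast the expectation over all actions and add the normalized advantage.
    return expectation[None, :] + advantage

# Toy example: 3 actions, 3 prediction targets (measurements at 3 temporal offsets).
expectation = np.array([1.0, 2.0, 3.0])
advantage = np.random.randn(3, 3)
prediction = combine_streams(expectation, advantage)   # shape (3, 3): one row per action
```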

Figure 1.1.: DFP neural network

The prediction is thus defined as

P_t(a) = F(m_t, s_t, a, g; θ),  a ∈ A    (1.2)

where the function F represents the computation performed by the DNN. After obtaining a prediction from the model at timestep t, we choose the action that maximizes our objective function (Equation 1.1) with respect to the specified goal vector g:

a_t = argmax_{a ∈ A} g^T P_t(a)    (1.3)

1.3. Training

The agent starts interacting with the environment according to an ε-greedy policy (Section ??). At the beginning of training the value of ε is set to 1.0 (random actions) and is gradually decreased to 0.1 by the end of training. This schedule allows the agent to keep exploring, in a smaller proportion, even when the model already contains information about the environment. Similar to DQN (Section ??), DFP uses a replay memory to store experiences, which helps to increase the stability of the algorithm by breaking the correlation between consecutive sequences. Experiences are represented by tuples (m_i, s_i, a_i, g, f_i) and sampled randomly in minibatches of size N every k steps of the game. When the replay memory reaches its maximum capacity, the oldest experiences are replaced by new ones.
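The action selection of Equation 1.3 together with the decaying ε-greedy exploration can be summarized in a short sketch. This is an illustrative NumPy snippet under assumed shapes, not the thesis implementation; the linear schedule simply interpolates ε from 1.0 to 0.1 over the training steps.

```python
import numpy as np

rng = np.random.default_rng()

def epsilon_schedule(step, total_steps, eps_start=1.0, eps_end=0.1):
    """Linearly anneal epsilon from eps_start to eps_end over training."""
    frac = min(step / total_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def select_action(predictions, goal, epsilon):
    """Epsilon-greedy version of Equation 1.3.

    predictions: array of shape (n_actions, n_targets), P_t(a) for every a in A.
    goal:        array of shape (n_targets,), the goal vector g.
    """
    n_actions = predictions.shape[0]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))      # explore: uniformly random action
    objectives = predictions @ goal              # g-weighted objective per action
    return int(np.argmax(objectives))            # exploit: Equation 1.3
```

At training time the agent would call epsilon_schedule once per step and feed the result to select_action; at test time ε can simply be held at a small constant.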

Given a minibatch of experiences, the DNN is trained to minimize the loss function

L(θ) = Σ_{i=1}^{N} || F(m_i, s_i, a_i, g; θ) - f_i ||^2    (1.4)

which corresponds to a mean squared error, i.e. the error of predicting the differences between current and future measurements. It is important to highlight that in Equation 1.4 the action performed is used to update only its corresponding part of the prediction function (Figure 1.2).

Figure 1.2.: Target and prediction in loss function

In a typical RL problem, agents are trained with experiences collected during training, without using any pre-existing dataset. To collect those experiences, agents must repeatedly interact with the environment, which adds a significant overhead to the overall training time. To reduce this overhead, [2] used eight agents gathering experiences in parallel that synchronously perform actions and request predictions from the model in batches (Figure 1.3). Although this technique effectively reduces the overhead, we describe in the next section a more efficient implementation based on asynchronous updates to the model.

Figure 1.3.: Scheme of DFP implementation with multiple agents
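The per-action masking in Equation 1.4 (Figure 1.2) can be made concrete with a small sketch: although the network outputs a prediction for every action, only the slice belonging to the action actually taken is compared against the observed future-measurement differences. This is a NumPy sketch under assumed array shapes, not the thesis code.

```python
import numpy as np

def masked_squared_error(predictions, actions, targets):
    """Loss of Equation 1.4 with the per-action mask of Figure 1.2.

    predictions: (N, n_actions, n_targets) network outputs for a minibatch.
    actions:     (N,) integer indices of the actions actually taken.
    targets:     (N, n_targets) observed future-measurement differences f_i.

    Only the prediction slice of the taken action receives an error signal;
    the outputs for all other actions do not contribute to the loss.
    """
    rows = np.arange(predictions.shape[0])
    taken = predictions[rows, actions]           # (N, n_targets) selected slices
    return np.sum((taken - targets) ** 2)
```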


2. A more efficient implementation

As mentioned in the last chapter, [2] implemented an algorithm in which several agents perform synchronous updates to the model. We describe below a modification of this strategy that allows agents to run asynchronously, as proposed in the work of [1] to speed up A3C (Section ??).

2.1. GA3C setting

In GA3C (GPU-based Asynchronous Advantage Actor-Critic) [1], multiple agents run asynchronously in parallel without keeping their own copies of the global network parameters, as opposed to the original A3C implementation [3]. Instead, a Server instance receives minibatches to train the model and predicts actions for every agent, and is thus the only process allowed to communicate with the GPU. Two types of Server threads, Trainers and Predictors, handle two asynchronous queues that carry data between the Server and the agents. Into the first queue (the training queue), agents push their minibatches of experiences; from the second queue (the prediction queue), predictor threads receive requests for predictions to be performed with the model. In order to use GPU bandwidth efficiently and keep GPU utilization high, a balanced combination of the numbers of agents, predictors, and trainers is desired. However, each combination also affects the convergence of the algorithm, so a trade-off must be found. To that end, [1] designed a Dynamic Adjustment thread which systematically tries different configurations to improve the number of trainings per second (TPS), which is directly related to the number of predictions per second (PPS). Since agents perform one update every t_max steps in the A3C setting, it is expected that PPS ≈ TPS × t_max, which makes it possible to find a single fixed optimal configuration for the entire training by maximizing the TPS metric.

2.2. Asynchronous DFP

We included GA3C's asynchronous communication scheme for training and prediction in our original approach (Figure 2.1). In our case, achieving an optimal combination becomes problematic since PPS is not constant during training. When agents start interacting with the environment, no prediction queries are issued and actions are chosen randomly. As the interaction advances, PPS increases in proportion to the decrease of ε (the exploration/exploitation rate). Consequently, a fixed configuration of agents, predictors, and trainers does not necessarily lead to the fastest solution.
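The communication pattern of Section 2.1, two shared queues served by trainer and predictor threads, can be sketched as follows. This is a heavily simplified Python illustration, not the actual GA3C or thesis code: model.predict_batch, model.train_batch, and action_channel are hypothetical stand-ins, and details such as batching limits and dynamic adjustment are omitted.

```python
import queue
import threading

prediction_queue = queue.Queue()   # agents put (agent_id, observation) requests here
training_queue = queue.Queue()     # agents put minibatches of experiences here

def predictor(model, agents, max_batch=16):
    """Server thread: drain prediction requests and issue one batched GPU call."""
    while True:
        agent_id, obs = prediction_queue.get()              # block until work arrives
        ids, observations = [agent_id], [obs]
        while len(ids) < max_batch:
            try:
                agent_id, obs = prediction_queue.get_nowait()
            except queue.Empty:
                break
            ids.append(agent_id)
            observations.append(obs)
        actions = model.predict_batch(observations)         # single forward pass
        for agent_id, action in zip(ids, actions):
            agents[agent_id].action_channel.put(action)     # return result to agent

def trainer(model):
    """Server thread: apply queued minibatches of experiences to the model."""
    while True:
        minibatch = training_queue.get()
        model.train_batch(minibatch)                        # single GPU update

def start_server_threads(model, agents, n_predictors=2, n_trainers=2):
    """Spawn the Server-side predictor and trainer threads."""
    for _ in range(n_predictors):
        threading.Thread(target=predictor, args=(model, agents), daemon=True).start()
    for _ in range(n_trainers):
        threading.Thread(target=trainer, args=(model,), daemon=True).start()
```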

Figure 2.1.: Implementation of an asynchronous version of DFP

In the next chapter, we discuss how we arrive at an optimal configuration for DFP and present the results of the modified approach.

Part III. Results

Appendix

A. Implementation Details

Here come the details that are not supposed to be in the regular text.

Bibliography

[1] Mohammad Babaeizadeh, Iuri Frosio, Stephen Tyree, Jason Clemons, and Jan Kautz. Reinforcement learning through asynchronous advantage actor-critic on a GPU. In ICLR, 2017.

[2] Alexey Dosovitskiy and Vladlen Koltun. Learning to act by predicting the future. CoRR, abs/1611.01779, 2016.

[3] Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. CoRR, abs/1602.01783, 2016.