
A Reinforcement Learning Approach for the Dynamic Container Relocation Problem

Paul Alexandru Bucur (Alpen-Adria Universität Klagenfurt, Austria, pabucur@edu.aau.at)
Philipp Hungerländer (Alpen-Adria Universität Klagenfurt, Austria, philipp.hungerlaender@aau.at)

July 21, 2017

Abstract

Given an initial configuration of a container bay and an a priori known departure sequence of the containers, the goal of the Container Relocation Problem is to retrieve the requested containers in the predefined order while minimizing the number of container relocations inside the bay. The Dynamic Container Relocation Problem (DCRP) introduces an additional aspect by also considering arriving containers. Although the DCRP originates from the port operations environment, its applications extend to other fields, such as industrial warehousing or the steel industry. In this paper, motivated by a cooperation with an Austrian company, we propose a Reinforcement Learning (RL) approach for solving the DCRP. In particular, we use RL and problem-specific heuristics for guiding a Monte Carlo Tree Search. In our computational experiments we compare our method with a Beam Search (BS) algorithm on benchmark instances from the literature. While our RL approach cannot quite match the results provided by the problem-specific BS algorithm, it is more flexible and can be adapted much more easily whenever extensions to the standard version of the DCRP have to be considered.

Keywords: Dynamic Container Relocation Problem, Monte Carlo Tree Search, Reinforcement Learning.

1 Introduction

Given an initial configuration of a container bay and a departure sequence of the containers, the Container Relocation Problem (CRP) aims to retrieve the requested containers in the predefined order while minimizing the number of container relocations inside the bay. The Dynamic Container Relocation Problem (DCRP) extends the CRP by additionally considering arriving containers.

The DCRP originates from port operations, but it is also relevant for other fields like the steel industry [9] and industrial warehousing [5]. The work presented in this paper is motivated by our cooperation with an Austrian company from the chemical industry that encountered an extended version of the DCRP when optimizing its warehouse operations. In particular, the company plans to customize its warehouse operations in a two-stage process:

1. In the first stage the DCRP is extended by stochastic changes to the container arrival and departure list, due to uncertainty in the business process caused by customer prioritization and customer cancellation of orders.

2. In the second stage the company wants to consider incoming goods instead of incoming containers, where the assignment of the incoming goods to appropriate containers is then an additional component of the optimization problem. Accordingly, outgoing goods should be considered instead of specific outgoing containers.

Due to these requested extensions to the DCRP we developed a Reinforcement Learning (RL) approach that can easily be adapted to different problem versions of the DCRP and is better suited to deal with stochastic changes to the input data than existing exact approaches and heuristics for the DCRP.

For solving the CRP, several mathematical programming approaches [7, 8] and heuristics [4, 12] have recently been proposed, some of which have also been extended to the DCRP. Wan et al. [8] considered both the CRP and the DCRP. Akyüz and Lee [2] suggested the first integer linear programming (ILP) formulation and a Beam Search (BS) heuristic for the DCRP. Borjian et al. [3] presented an ILP formulation for a slightly different version of the DCRP where service time windows for the containers are given instead of an arrival and departure sequence of the containers.

The paper is structured as follows. In Section 2 we describe the DCRP in more detail and discuss different ways of modelling it. In Section 3 we propose how to adapt RL approaches from the game-playing setting in order to apply them to the DCRP. In Section 4 we compare our RL approach with a BS heuristic on benchmark instances from the literature. Finally, in Section 5 we conclude the paper and give suggestions for future research.

2 Problem Formulations of the DCRP

In a container bay, containers are stacked in tiers on top of each other in several columns. For two-dimensional bays the exact location of a container is therefore determined by its column and its tier. The number of columns (tiers) is assumed to be finite and bounded by W (H). A straightforward property of such a storage system is that the containers may only be accessed from above, i.e. a container may only be retrieved from the bay if it is the highest in its column.
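
To make these stacking rules concrete, the following minimal sketch (our own illustrative Python code, not taken from any implementation used in the paper) represents a bay as a list of bottom-to-top stacks in which only the topmost container of a column can be moved:

    class Bay:
        """Two-dimensional container bay with W columns and at most H tiers per column."""

        def __init__(self, width, height):
            self.width, self.height = width, height
            self.columns = [[] for _ in range(width)]  # bottom-to-top stacks

        def legal_columns(self):
            # Columns that can still accept a container (arrival or relocation).
            return [c for c in range(self.width) if len(self.columns[c]) < self.height]

        def stack(self, container, column):
            # Place an arriving (or relocated) container on top of a column.
            assert len(self.columns[column]) < self.height, "column is full"
            self.columns[column].append(container)

        def relocate(self, from_column, to_column):
            # Move the topmost container of one column onto another: one relocation.
            self.stack(self.columns[from_column].pop(), to_column)

        def retrieve(self, container):
            # A container may only leave the bay if it is the topmost in its column.
            for col in self.columns:
                if col and col[-1] == container:
                    return col.pop()
            raise ValueError(f"container {container} is not accessible")

Retrieving a requested container that is buried then requires one relocate call per blocking container above it, and the number of such calls is exactly the quantity the (D)CRP minimizes.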

To further clarify the workings of the DCRP we present a toy example with one arriving and one departing container in Fig. 1.

Figure 1: Toy example for the DCRP with one arriving and one departing container: first Container 7 arrives and then Container 1 departs, requiring the relocation of all containers above it.

The DCRP can also be modeled as a sequential decision-making problem in a Markov Decision Process framework, formally describing a fully observable environment. At every time step, the process is in a state s that is characterized by the combination of the current container bay configuration and the arrival and departure sequence. The agent selects the next container from the sequence and chooses a bay column, either for stacking the container in the case of an arriving container, or for relocation in the case of a container blocking the next outgoing container. This represents an action a. The set of actions A is thus given by the different bay columns, and in each state s a subset of A is available as legal actions. Choosing an action a from this subset propels the environment from state s into a new state s'. The environment offers feedback by means of a numerical reward R_a(s, s') for each state transition. The Markov property is fulfilled, since the information on how the current container bay configuration developed, i.e. the history of stackings, retrievals and relocations of the containers, is irrelevant for future decisions.

3 Reinforcement Learning

In recent years, Google DeepMind has developed a famous algorithm for playing the game of Go [1], achieving remarkable results by combining Monte Carlo Tree Search (MCTS) and Reinforcement Learning (RL). We take inspiration from this work and apply the same algorithmic concept to solving the DCRP.

Our approach consists of four phases: selection, expansion, rollout and backpropagation. In the selection phase, starting from the root, child nodes are recursively chosen based on the UCT formula (sketched below), which balances exploration and exploitation, until a leaf node L is reached. In the expansion phase, unless node L is terminal, one or more children are created, from which one, C, is selected. In the rollout phase, which represents the connection between MCTS, RL and problem-specific heuristics, one or several playout simulations are run from node C. The playout structure may vary: it is possible to simulate only a limited sequence of moves, or to simulate until the end of the arrival and departure schedule. We also differentiate between light playouts using random moves and heavy playouts using problem-specific heuristics. The playout simulations can use the different problem-specific heuristics from the literature or the policy learned by an RL agent. In the backpropagation phase, the outcome of the playout is finally propagated back along the visited path and the statistics of the corresponding nodes are updated.
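
As an illustration of the selection phase, the following sketch shows a standard form of the UCT rule; the exploration constant c and the node bookkeeping fields (visits, total_value, children) are our own assumptions, since the paper does not specify its exact parametrization:

    import math

    def uct_score(parent, child, c=1.4):
        # Average backed-up value of the child plus an exploration bonus
        # that shrinks as the child is visited more often.
        if child.visits == 0:
            return float("inf")  # unvisited children are tried first
        exploitation = child.total_value / child.visits
        exploration = c * math.sqrt(math.log(parent.visits) / child.visits)
        return exploitation + exploration

    def select_child(parent, c=1.4):
        # Selection phase: descend to the child with the highest UCT score.
        return max(parent.children, key=lambda child: uct_score(parent, child, c))

Since the rewards described in Section 4 are non-positive (relocations are penalized), maximizing the average backed-up value corresponds to preferring branches that are expected to require fewer relocations.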

In the current work we use the Q-learning algorithm introduced in [10]. In Q-learning, the agent learns to directly approximate the optimal state-action value function Q* by means of the learned estimate Q(·, ·). Watkins and Dayan [10] showed that Q(·, ·) converges to Q* with probability 1 under the assumption that all actions are repeatedly sampled in all states and the action values are represented discretely. Upon convergence, the optimal policy is given by the action with the highest Q-value in each state.

We also reduce the size of the state space by describing a state through the current bay configuration and the next n containers from the schedule. The agent thus keeps track of an integer vector of length W·H + n, where each vector entry represents the index of the corresponding container in the subset of the departure list consisting of the containers in the vector. The choice of n has important implications both for the agent's behaviour, i.e. its learned policy, and for the size of the state space: the higher n, the more information is available, but the size of the search space also grows accordingly. We chose this description since it reflects the order in which the currently stored containers depart and additionally includes the next n containers from the schedule; a minimal sketch of this encoding and of the tabular update is given at the end of this section. Should the agent not have encountered a certain state before, the MCTS rollout policy defaults to a greedy heuristic for choosing the action.

If the state-action space reaches considerable dimensions for large problem instances, the Q-values can no longer be stored in a table. In this case the Q-function can be approximated with a neural network, with the Q-values stored in the network weights [11]. The introduction of neural networks offers several advantages, among them the possibility of using experience replay, which allows agents to leverage experiences from the past.
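
The following minimal sketch (our own illustrative code, not taken from the paper's implementation) shows the state encoding and the tabular Q-learning update referred to above, re-using the Bay class from the sketch in Section 2. For readability it stores container labels rather than the relative departure indices described above, and the ε-greedy scheme and the hyperparameter defaults are assumptions based on the values reported in Section 4:

    import random
    from collections import defaultdict

    def encode_state(bay, schedule, n):
        # Integer vector of length W*H + n: bay contents column by column, padded
        # with 0 for empty slots (container labels are assumed to be positive),
        # followed by the next n containers from the arrival/departure schedule.
        flat = []
        for col in bay.columns:
            flat += col + [0] * (bay.height - len(col))
        upcoming = schedule[:n] + [0] * max(0, n - len(schedule))
        return tuple(flat + upcoming)  # hashable key for the Q-table

    Q = defaultdict(float)  # tabular action-value estimates, default 0.0

    def choose_action(state, legal_actions, epsilon=0.1):
        # Epsilon-greedy behaviour policy during training.
        if random.random() < epsilon:
            return random.choice(legal_actions)
        return max(legal_actions, key=lambda a: Q[(state, a)])

    def q_update(state, action, reward, next_state, next_legal, alpha=0.7, gamma=0.5):
        # One-step Q-learning update with learning rate alpha and discount gamma.
        best_next = max((Q[(next_state, a)] for a in next_legal), default=0.0)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

With the reward setting of Section 4 (r = -1 per relocation, r = 0 otherwise), the learned Q-values approximate the negated, discounted number of future relocations expected from a state.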

Bucur and Hungerländer 5 1.8 GHz i5 processor and 8 GB RAM. We first aimed to replicate the results obtained in [2], where the score presented for a bay size is the empirically found minimum average number of relocations over all its corresponding 120 Group-II medium instances. Then we chose another underlying greedy heuristic for the BS algorithm: the Expected Minimax (EM) heuristic was recently introduced in [6]. We observe that by respecting the other original parameter choices from [2] but by using the EM heuristic, the BS can be clearly improved. Due to space restrictions, in Table 1 we only present the results for a beam width of N = 10, where N describes the amount of most promising nodes which are kept and considered for further branching after each step of the heuristic. We refer the reader to [2, Table 4] for comparison with the results of the original BS heuristic for different values of N. Next we explored the potential of the MCTS-RL technique, where we offered the agent a total training time of 2 minutes per instance before using the learned policy for the MCTS rollout. The MCTS search was stopped after the same time that was needed by the BS algorithm using the EM heuristic. For the Q- learning algorithm we used a learning rate of α = 0.7 and a discounting factor of γ = 0.5. Action values were chosen according to the Q-table. The probability of choosing a random action was set to ɛ = 0.1, linearly decreasing with each training epoch. We empirically found a value of n = 3 containers considered by the agent to work best for most of the instances. The reward setting is another important factor in the learning process: we set a reward of r = 1 for each relocation and of r = 0 for every other legal move. The agent may thus maximally aspire to achieve a total cumulated reward of 0, if no relocations were necessary. One of the tested architectures was that of a Multilayer Perceptron (MLP), modelled with three hidden layers: the first layer features 384 neurons, the second layer 192 neurons, and the last layer 96 neurons. A final layer maps the output of the last hidden layer to the number of legal actions. Each neuron has a non-linear activation function: for this architecture, we chose the rectifier, i.e. f(x) = max(0, x). In this short paper, we used the simple MLP architecture due to the time constraint required for the comparison to the BS approach. In the forthcoming extended version, we will explore the importance of choosing an optimal architecture for the neural network, comparing the MLP to Convolutional Neural Networks. If a certain state had not been experienced and thus no information was available on the quality of the actions, the agent would choose the action for that state based on the Expected Reshuffling Index heuristic, since it is computationally less expensive than the EM heuristic. We observed that while for small instances optimal values were found, for bigger instances many of the states encountered later in the decision-making had not been experienced by the agent.

If a certain state had not been experienced and thus no information was available on the quality of the actions, the agent would choose the action for that state based on the Expected Reshuffling Index heuristic, since it is computationally less expensive than the EM heuristic. We observed that while optimal values were found for small instances, for bigger instances many of the states encountered later in the decision making had not been experienced by the agent.

Bay size (W, H)    CPU time (s)    BS-EM UB    MCTS-RL UB
(6, 2)                    23.06        4.40          6.25
(6, 3)                    31.62       34.06         50.96
(6, 4)                    41.50       63.24         98.43
(6, 5)                    58.70       80.71        126.71
(6, 6)                   125.17      103.74        173.29

Table 1: Comparison of the computational results obtained by the MCTS-RL (Monte Carlo Tree Search with Reinforcement Learning) and BS-EM (Beam Search with Expected Minimax) approaches on the Group-II medium instances from [2]. BS-EM UB and MCTS-RL UB denote the average minimum number of relocations required by the respective approach.

5 Conclusion

Motivated by a cooperation with an Austrian company, we designed a Reinforcement Learning (RL) approach for the Dynamic Container Relocation Problem (DCRP). While our RL approach could not achieve the same results as a problem-specific Beam Search heuristic, it is more flexible and therefore better suited for dealing with the extensions requested by our project partner. In an extended version of this paper we will provide further details on our RL approach and conduct a more extensive computational study, considering both DCRP instances from the literature and real-world instances for extended versions of the DCRP.

References

[1] Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T. and Hassabis, D. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484-489, 2016.

[2] Akyüz, M.H. and Lee, C.-Y. A mathematical formulation and efficient heuristics for the dynamic container relocation problem. Naval Research Logistics (NRL), 61(2):101-118, 2014.

[3] Borjian, S., Manshadi, V.H., Barnhart, C. and Jaillet, P. Dynamic Stochastic Optimization of Relocations in Container Terminals. http://www.mit.edu/~jaillet/general/container13.pdf, 2013.

[4] Caserta, M., Schwarze, S. and Voß, S. A New Binary Description of the Blocks Relocation Problem and Benefits in a Look Ahead Heuristic. Evolutionary Computation in Combinatorial Optimization: 9th European Conference Proceedings, Springer Berlin Heidelberg: 37-48, 2009.

[5] Chen, L., Langevin, A. and Riopel, D. A tabu search algorithm for the relocation problem in a warehousing system. International Journal of Production Economics, 129(1):147-156, 2011.

[6] Galle, V., Borjian, S.B., Manshadi, V.H., Barnhart, C. and Jaillet, P. The Stochastic Container Relocation Problem. CoRR, abs/1703.04769, 2017.

[7] Kim, K.H. and Hong, G.-P. A Heuristic Rule for Relocating Blocks. Computers and Operations Research, 33(4):940-954, 2006.

[8] Wan, Y., Liu, J. and Tsai, P. The assignment of storage locations to containers for a container stack. Naval Research Logistics (NRL), 56(8):699-713, 2009.

[9] Wang, G., Jin, C. and Deng, X. Modeling and optimization on steel plate pick-up operation scheduling on stackyard of shipyard. IEEE International Conference on Automation and Logistics, 2008.

[10] Watkins, C.J.C.H. and Dayan, P. Q-learning. Machine Learning, 8(3-4):279-292, 1992.

[11] Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D. and Riedmiller, M. Playing Atari with Deep Reinforcement Learning. CoRR, abs/1312.5602, 2013.

[12] Zhu, W., Qin, H., Lim, A. and Zhang, H. Iterative Deepening A* Algorithms for the Container Relocation Problem. IEEE Transactions on Automation Science and Engineering, 9(4):710-722, 2012.