
A Reinforcement Learning Approach for the Dynamic Container Relocation Problem

Paul Alexandru Bucur (Alpen-Adria Universität Klagenfurt, Austria, pabucur@edu.aau.at)
Philipp Hungerländer (Alpen-Adria Universität Klagenfurt, Austria, philipp.hungerlaender@aau.at)

July 21, 2017

Abstract

Given an initial configuration of a container bay and an a priori known departure sequence of the containers, the goal of the Container Relocation Problem is to retrieve the requested containers in the predefined order while minimizing the number of container relocations inside the bay. The Dynamic Container Relocation Problem (DCRP) introduces an additional aspect by also considering arriving containers. Although the DCRP originates from the port operations environment, its applications extend to other fields, such as industrial warehousing or the steel industry. In this paper, motivated by a cooperation with an Austrian company, we propose a Reinforcement Learning (RL) approach for solving the DCRP. In particular, we use RL and problem-specific heuristics for guiding a Monte Carlo Tree Search. In our computational experiments we compare our method with a Beam Search (BS) algorithm on benchmark instances from the literature. While our RL approach cannot quite match the results provided by the problem-specific BS algorithm, it is more flexible and can be adapted much more easily whenever extensions to the standard version of the DCRP have to be considered.

Keywords: Dynamic Container Relocation Problem, Monte Carlo Tree Search, Reinforcement Learning.

1 Introduction

Given an initial configuration of a container bay and a departure sequence of the containers, the Container Relocation Problem (CRP) aims to retrieve the requested containers in the predefined order while minimizing the number of container relocations inside the bay. The Dynamic Container Relocation Problem (DCRP) extends the CRP by additionally considering arriving containers.

The DCRP originates from port operations, but it is also relevant for other fields like the steel industry [9] and industrial warehousing [5]. The work presented in this paper is motivated by our cooperation with an Austrian company from the chemical industry that encountered an extended version of the DCRP when optimizing its warehouse operations. In particular, the company plans to customize its warehouse operations in a two-stage process:

1. In the first stage the DCRP is extended by stochastic changes to the container arrival and departure list, due to uncertainty in the business process caused by customer prioritization and customer cancellation of orders.

2. In the second stage the company wants to consider incoming goods instead of incoming containers, where the assignment of the incoming goods to appropriate containers is then an additional component of the optimization problem. Accordingly, outgoing goods should be considered instead of specific outgoing containers.

Due to these requested extensions to the DCRP we developed a Reinforcement Learning (RL) approach that can easily be adapted to different problem versions of the DCRP and is better suited to deal with stochastic changes to the input data than existing exact approaches and heuristics for the DCRP.

For solving the CRP, several mathematical programming approaches [7, 8] and heuristics [4, 12] have recently been proposed, some of which have also been extended to the DCRP. Wan et al. [8] considered both the CRP and the DCRP. Akyüz and Lee [2] suggested the first integer linear programming (ILP) formulation and a Beam Search (BS) heuristic for the DCRP. Borjian et al. [3] presented an ILP formulation for a slightly different version of the DCRP where service time windows for the containers are given instead of an arrival and departure sequence of the containers.

The paper is structured as follows. In Section 2 we describe the DCRP in more detail and discuss different ways of modelling it. In Section 3 we propose how to adapt RL approaches from the game-playing setting in order to apply them to the DCRP. In Section 4 we compare our RL approach with a BS heuristic on benchmark instances from the literature. Finally, in Section 5 we conclude the paper and give suggestions for future research.

2 Problem Formulations of the DCRP

In a container bay, containers are stacked in tiers on top of each other in several columns. For two-dimensional bays the exact location of a container is therefore determined by its column and its tier. The number of columns (tiers) is assumed to be finite and bounded by W (H). A straightforward property of such a storage system is that the containers may only be accessed from above, i.e. a container may only be retrieved from the bay if it is the highest in its column.
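
To make these stacking rules concrete, the following minimal sketch (our own illustrative Python code, not taken from any implementation used in the paper) represents a bay as a list of bottom-to-top stacks in which only the topmost container of a column can be moved:

    class Bay:
        """Two-dimensional container bay with W columns and at most H tiers per column."""

        def __init__(self, width, height):
            self.width, self.height = width, height
            self.columns = [[] for _ in range(width)]  # bottom-to-top stacks

        def legal_columns(self):
            # Columns that can still accept a container (arrival or relocation).
            return [c for c in range(self.width) if len(self.columns[c]) < self.height]

        def stack(self, container, column):
            # Place an arriving (or relocated) container on top of a column.
            assert len(self.columns[column]) < self.height, "column is full"
            self.columns[column].append(container)

        def relocate(self, from_column, to_column):
            # Move the topmost container of one column onto another: one relocation.
            self.stack(self.columns[from_column].pop(), to_column)

        def retrieve(self, container):
            # A container may only leave the bay if it is the topmost in its column.
            for col in self.columns:
                if col and col[-1] == container:
                    return col.pop()
            raise ValueError(f"container {container} is not accessible")

Retrieving a requested container that is buried then requires one relocate call per blocking container above it, and the number of such calls is exactly the quantity the (D)CRP minimizes.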

To further clarify the workings of the DCRP we present a toy example with one arriving and one departing container in Fig. 1.

Figure 1: Toy example for the DCRP with one arriving and one departing container: first Container 7 arrives and then Container 1 departs, requiring the relocation of all containers above it.

The DCRP can also be modeled as a sequential decision-making problem in a Markov Decision Process framework, formally describing a fully observable environment. At every time step, the process is in a state s that is characterized by the combination of the current container bay configuration and the arrival and departure sequence. The agent selects the next container from the sequence and chooses a bay column, either for stacking the container in the case of an arriving container, or for relocation in the case of a container blocking the next outgoing container. This represents an action a. The set of actions A is thus given by the different bay columns, and in each state s a subset of A is available as legal actions. Choosing an action a from this subset propels the environment from state s into a new state s'. The environment offers feedback by means of a numerical reward R_a(s, s') for each state transition. The Markov property is fulfilled, since the information on how the current container bay configuration developed, i.e. the history of stackings, retrievals and relocations of the containers, is irrelevant for future decisions.

3 Reinforcement Learning

In recent years, Google DeepMind has developed a famous algorithm for playing the game of Go [1], achieving remarkable results by combining Monte Carlo Tree Search (MCTS) and Reinforcement Learning (RL). We take inspiration from this work and apply the same algorithmic concept to solving the DCRP.

Our approach consists of four phases: selection, expansion, rollout and backpropagation. In the selection phase, starting from the root, child nodes are recursively chosen based on the UCT formula (sketched below), which balances exploration and exploitation, until a leaf node L is reached. In the expansion phase, unless node L is terminal, one or more children are created, from which one, C, is selected. In the rollout phase, which represents the connection between MCTS, RL and problem-specific heuristics, one or several playout simulations are run from node C. The playout structure may vary: it is possible to simulate only a limited sequence of moves, or to simulate until the end of the arrival and departure schedule. We also differentiate between light playouts using random moves and heavy playouts using problem-specific heuristics. The playout simulations can use the different problem-specific heuristics from the literature or the policy learned by an RL agent. In the backpropagation phase, the outcome of the playout is finally propagated back along the visited path and the statistics of the corresponding nodes are updated.
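
As an illustration of the selection phase, the following sketch shows a standard form of the UCT rule; the exploration constant c and the node bookkeeping fields (visits, total_value, children) are our own assumptions, since the paper does not specify its exact parametrization:

    import math

    def uct_score(parent, child, c=1.4):
        # Average backed-up value of the child plus an exploration bonus
        # that shrinks as the child is visited more often.
        if child.visits == 0:
            return float("inf")  # unvisited children are tried first
        exploitation = child.total_value / child.visits
        exploration = c * math.sqrt(math.log(parent.visits) / child.visits)
        return exploitation + exploration

    def select_child(parent, c=1.4):
        # Selection phase: descend to the child with the highest UCT score.
        return max(parent.children, key=lambda child: uct_score(parent, child, c))

Since the rewards described in Section 4 are non-positive (relocations are penalized), maximizing the average backed-up value corresponds to preferring branches that are expected to require fewer relocations.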

In the current work we use the Q-learning algorithm introduced in [10]. In Q-learning, the agent learns to directly approximate the optimal state-action value function Q* by means of the learned estimate Q(·, ·). Watkins and Dayan [10] showed that Q(·, ·) converges to Q* with probability 1 under the assumption that all actions are repeatedly sampled in all states and the action values are represented discretely. Upon convergence, the optimal policy is given by the action with the highest Q-value in each state.

We also reduce the size of the state space by describing a state through the current bay configuration and the next n containers from the schedule. The agent thus keeps track of an integer vector of length W·H + n, where each vector entry represents the index of the corresponding container in the subset of the departure list consisting of the containers in the vector. The choice of n has important implications both for the agent's behaviour, i.e. its learned policy, and for the size of the state space: the higher n, the more information is available, but the size of the search space also grows accordingly. We chose this description since it reflects the order in which the currently stored containers depart and additionally includes the next n containers from the schedule; a minimal sketch of this encoding and of the tabular update is given at the end of this section. Should the agent not have encountered a certain state before, the MCTS rollout policy defaults to a greedy heuristic for choosing the action.

If the state-action space reaches considerable dimensions for large problem instances, the Q-values can no longer be stored in a table. In this case the Q-function can be approximated with a neural network, with the Q-values stored in the network weights [11]. The introduction of neural networks offers several advantages, among them the possibility of using experience replay, which allows agents to leverage experiences from the past.
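
The following minimal sketch (our own illustrative code, not taken from the paper's implementation) shows the state encoding and the tabular Q-learning update referred to above, re-using the Bay class from the sketch in Section 2. For readability it stores container labels rather than the relative departure indices described above, and the ε-greedy scheme and the hyperparameter defaults are assumptions based on the values reported in Section 4:

    import random
    from collections import defaultdict

    def encode_state(bay, schedule, n):
        # Integer vector of length W*H + n: bay contents column by column, padded
        # with 0 for empty slots (container labels are assumed to be positive),
        # followed by the next n containers from the arrival/departure schedule.
        flat = []
        for col in bay.columns:
            flat += col + [0] * (bay.height - len(col))
        upcoming = schedule[:n] + [0] * max(0, n - len(schedule))
        return tuple(flat + upcoming)  # hashable key for the Q-table

    Q = defaultdict(float)  # tabular action-value estimates, default 0.0

    def choose_action(state, legal_actions, epsilon=0.1):
        # Epsilon-greedy behaviour policy during training.
        if random.random() < epsilon:
            return random.choice(legal_actions)
        return max(legal_actions, key=lambda a: Q[(state, a)])

    def q_update(state, action, reward, next_state, next_legal, alpha=0.7, gamma=0.5):
        # One-step Q-learning update with learning rate alpha and discount gamma.
        best_next = max((Q[(next_state, a)] for a in next_legal), default=0.0)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

With the reward setting of Section 4 (r = -1 per relocation, r = 0 otherwise), the learned Q-values approximate the negated, discounted number of future relocations expected from a state.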

Bucur and Hungerländer 5 1.8 GHz i5 processor and 8 GB RAM. We first aimed to replicate the results obtained in [2], where the score presented for a bay size is the empirically found minimum average number of relocations over all its corresponding 120 Group-II medium instances. Then we chose another underlying greedy heuristic for the BS algorithm: the Expected Minimax (EM) heuristic was recently introduced in [6]. We observe that by respecting the other original parameter choices from [2] but by using the EM heuristic, the BS can be clearly improved. Due to space restrictions, in Table 1 we only present the results for a beam width of N = 10, where N describes the amount of most promising nodes which are kept and considered for further branching after each step of the heuristic. We refer the reader to [2, Table 4] for comparison with the results of the original BS heuristic for different values of N. Next we explored the potential of the MCTS-RL technique, where we offered the agent a total training time of 2 minutes per instance before using the learned policy for the MCTS rollout. The MCTS search was stopped after the same time that was needed by the BS algorithm using the EM heuristic. For the Q- learning algorithm we used a learning rate of α = 0.7 and a discounting factor of γ = 0.5. Action values were chosen according to the Q-table. The probability of choosing a random action was set to ɛ = 0.1, linearly decreasing with each training epoch. We empirically found a value of n = 3 containers considered by the agent to work best for most of the instances. The reward setting is another important factor in the learning process: we set a reward of r = 1 for each relocation and of r = 0 for every other legal move. The agent may thus maximally aspire to achieve a total cumulated reward of 0, if no relocations were necessary. One of the tested architectures was that of a Multilayer Perceptron (MLP), modelled with three hidden layers: the first layer features 384 neurons, the second layer 192 neurons, and the last layer 96 neurons. A final layer maps the output of the last hidden layer to the number of legal actions. Each neuron has a non-linear activation function: for this architecture, we chose the rectifier, i.e. f(x) = max(0, x). In this short paper, we used the simple MLP architecture due to the time constraint required for the comparison to the BS approach. In the forthcoming extended version, we will explore the importance of choosing an optimal architecture for the neural network, comparing the MLP to Convolutional Neural Networks. If a certain state had not been experienced and thus no information was available on the quality of the actions, the agent would choose the action for that state based on the Expected Reshuffling Index heuristic, since it is computationally less expensive than the EM heuristic. We observed that while for small instances optimal values were found, for bigger instances many of the states encountered later in the decision-making had not been experienced by the agent.

If a certain state had not been experienced and thus no information was available on the quality of the actions, the agent would choose the action for that state based on the Expected Reshuffling Index heuristic, since it is computationally less expensive than the EM heuristic. We observed that while optimal values were found for small instances, for bigger instances many of the states encountered later in the decision making had not been experienced by the agent.

Bay size (W, H)    CPU time (s)    BS-EM UB    MCTS-RL UB
(6, 2)                    23.06        4.40          6.25
(6, 3)                    31.62       34.06         50.96
(6, 4)                    41.50       63.24         98.43
(6, 5)                    58.70       80.71        126.71
(6, 6)                   125.17      103.74        173.29

Table 1: Comparison of the computational results obtained by the MCTS-RL (Monte Carlo Tree Search with Reinforcement Learning) and BS-EM (Beam Search with Expected Minimax) approaches on the Group-II medium instances from [2]. BS-EM UB and MCTS-RL UB denote the average minimum number of relocations required by the respective approach.

5 Conclusion

Motivated by a cooperation with an Austrian company, we designed a Reinforcement Learning (RL) approach for the Dynamic Container Relocation Problem (DCRP). While our RL approach could not achieve the same results as a problem-specific Beam Search heuristic, it is more flexible and therefore better suited for dealing with the extensions requested by our project partner. In an extended version of this paper we will provide further details on our RL approach and conduct a more extensive computational study, considering both DCRP instances from the literature and real-world instances for extended versions of the DCRP.

References

[1] Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T. and Hassabis, D. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484-489, 2016.

[2] Akyüz, M.H. and Lee, C.-Y. A mathematical formulation and efficient heuristics for the dynamic container relocation problem. Naval Research Logistics (NRL), 61(2):101-118, 2014.

[3] Borjian, S., Manshadi, V.H., Barnhart, C. and Jaillet, P. Dynamic Stochastic Optimization of Relocations in Container Terminals. http://www.mit.edu/~jaillet/general/container13.pdf, 2013.

[4] Caserta, M., Schwarze, S. and Voß, S. A New Binary Description of the Blocks Relocation Problem and Benefits in a Look Ahead Heuristic. Evolutionary Computation in Combinatorial Optimization: 9th European Conference Proceedings, Springer Berlin Heidelberg: 37-48, 2009.

[5] Chen, L., Langevin, A. and Riopel, D. A tabu search algorithm for the relocation problem in a warehousing system. International Journal of Production Economics, 129(1):147-156, 2011.

[6] Galle, V., Borjian, S.B., Manshadi, V.H., Barnhart, C. and Jaillet, P. The Stochastic Container Relocation Problem. CoRR, abs/1703.04769, 2017.

[7] Kim, K.H. and Hong, G.-P. A Heuristic Rule for Relocating Blocks. Computers and Operations Research, 33(4):940-954, 2006.

[8] Wan, Y., Liu, J. and Tsai, P. The assignment of storage locations to containers for a container stack. Naval Research Logistics (NRL), 56(8):699-713, 2009.

[9] Wang, G., Jin, C. and Deng, X. Modeling and optimization on steel plate pick-up operation scheduling on stackyard of shipyard. IEEE International Conference on Automation and Logistics, 2008.

[10] Watkins, C.J.C.H. and Dayan, P. Q-learning. Machine Learning, 8(3-4):279-292, 1992.

[11] Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D. and Riedmiller, M. Playing Atari with Deep Reinforcement Learning. CoRR, abs/1312.5602, 2013.

[12] Zhu, W., Qin, H., Lim, A. and Zhang, H. Iterative Deepening A* Algorithms for the Container Relocation Problem. IEEE Transactions on Automation Science and Engineering, 9(4):710-722, 2012.