
A Production Scheduling Strategy for an Assembly Plant based on Reinforcement Learning

DRANIDIS D., KEHRIS E.
Computer Science Department
CITY LIBERAL STUDIES - Affiliated College of the University of Sheffield
13 Tsimiski st., 54624 Thessaloniki
GREECE

Abstract: - A reinforcement learning algorithm for the development of a system scheduling policy that controls a manufacturing system is investigated. The manufacturing system is characterized by considerable operation and setup times. The reinforcement learning algorithm learns to develop a scheduling policy that satisfies demand while keeping a given production mix. This paper discusses the reinforcement learning algorithm used, the state space representation and the structure of the neural network employed (necessary due to the large state space of the problem). Despite the difficulty of the task assigned to the reinforcement learning algorithm, the results show that the learned policy demonstrates the desired properties.

Key-Words: - Scheduling policies, Manufacturing systems, Reinforcement learning, Neural networks

1 Introduction
Production scheduling deals with the way the resources of a manufacturing system (i.e. the workstations, personnel and support systems) are assigned over time to a set of activities so as to best meet a set of objectives. The inherent complexity of current manufacturing systems, the frequently unstructured nature of managerial decision-making and the usually contradicting goals that have to be achieved through production scheduling render the problem not amenable to analytical solution. On the other hand, heuristic algorithms that obtain good production workplans have been developed; they are usually evaluated through the use of simulation models that mimic the dynamic behavior of the real manufacturing system.
Production scheduling is often regarded as two-level decision making [1,4]: at the upper level system loading is decided, while at the lower level the workstation loading is determined. It has been observed and reported in the literature that system scheduling (or system loading) plays a more important role than workstation control. In this paper, we investigate the possibility of utilizing a reinforcement learning (RL) algorithm for developing a system scheduling policy for a manufacturing system.
The structure of this paper is as follows: section 2 describes the general concept of reinforcement learning, section 3 presents the manufacturing system for which we are going to develop a scheduling policy, and section 4 describes the simulation model developed for the simulation of the manufacturing system. The RL agent developed for the specific manufacturing system is described in section 5, and section 6 presents the results obtained by controlling the manufacturing system using the RL agent. These results and the conclusions derived are discussed in section 7.

2 Reinforcement Learning
RL algorithms approximate dynamic programming on an incremental basis. In contrast to dynamic programming, RL algorithms do not require a model of the dynamics of the system and can be used on-line in an operating environment. A reinforcement learning agent senses its environment, takes actions and receives rewards depending on the effect of its actions on the environment. The agent has no knowledge of the dynamics of the environment (it cannot predict the consequences of its actions). Rewards provide the necessary information to the agent to adapt its actions.
The aim of a reinforcement learning agent is to maximize the total reward received from the environment. RL agents do not have any memory and thus do not keep track of their past actions; they decide based only on the knowledge they have about the current state of the environment. Figure 1 illustrates the interaction of the RL agent with the simulation system.

Fig. 1. The interaction of the RL agent with the simulation system: the agent observes the state s_t of the system, selects an action α_t and receives a reward r_t.

If s_t ∈ S (S is a finite set of states) is the state of the system at time t, the RL agent decides the next action α_t ∈ A (A is a finite set of actions) according to its current policy π : S → A. A usual policy is to take greedy actions, i.e. to choose the actions that return the maximum expected long-term reward. The estimated long-term reward is represented by a function Q : S × A → R, so a greedy action corresponds to the action associated with

    V(s_t) = max_α Q(s_t, α)    (1)

V : S → R is called the value function and represents the long-term reward the agent will receive if, beginning from state s_t, it follows the greedy policy. The policy of an RL agent is not constant, since the Q values are constantly changing during the on-line learning procedure. The following formula describes the update of Q(s_t, α_t):

    Q(s_t, α_t) ← r_t + γ Q(s_{t+1}, α_{t+1})    (2)

The new estimate of Q gets the value of the immediately received reward r_t plus the discounted estimated Q value of taking the next action in the next state; the next Q value is discounted by a discount factor γ. During learning, a random exploration strategy is performed: with some small probability, random actions are chosen instead of greedy actions, to allow the network to explore new regions of the state-action space. It is proved [10] that under certain conditions the algorithm converges and the final policy of the agent is the optimal policy π*; these conditions require that the system can be described as a stationary Markov Decision Process (MDP) and that a table is used for storing the Q values. In real-world problems the state space S is usually too large or even continuous, so the Q function cannot be represented in a tabular way. In these cases a function approximator, usually a neural network, is employed to approximate the Q function.
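As a minimal sketch of this scheme, the following code combines ε-greedy action selection with the update of Eq. (2), using a simple linear approximator in place of a neural network. The feature size, action set, learning rate and discount factor are illustrative assumptions, and Python is used purely for brevity; this is not the implementation described later in the paper.

```python
import numpy as np

N_FEATURES = 64          # length of the state encoding vector (illustrative)
ACTIONS = (0, 1, 2)      # e.g. enter a part of type A, of type B, or do nothing

rng = np.random.default_rng(0)
weights = {a: np.zeros(N_FEATURES) for a in ACTIONS}   # one approximator per action

def q_value(state, action):
    """Estimated long-term reward Q(s, a) under a linear approximator."""
    return float(weights[action] @ state)

def select_action(state, epsilon=0.1):
    """Greedy action with a small probability of random exploration."""
    if rng.random() < epsilon:
        return int(rng.choice(ACTIONS))
    return max(ACTIONS, key=lambda a: q_value(state, a))

def td_update(s_t, a_t, r_t, s_next, a_next, gamma=0.95, lr=0.01):
    """Move Q(s_t, a_t) toward r_t + gamma * Q(s_next, a_next), as in Eq. (2)."""
    td_error = r_t + gamma * q_value(s_next, a_next) - q_value(s_t, a_t)
    weights[a_t] += lr * td_error * s_t      # gradient step for the linear model
```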
2.1 Related work
Reinforcement learning has been successfully applied to many application areas. The most impressive one is the TD-Gammon system [8], which achieved a master level in backgammon by applying the TD(λ) reinforcement learning algorithm (the temporal difference algorithm [7]). TD-Gammon uses a backpropagation network for approximating the value function. The neural network receives the full representation of the board, and the problem is clearly a Markov decision problem.
RL algorithms have also been successfully applied to control problems. Known applications are briefly described below. Crites and Barto [2] successfully apply RL algorithms in the domain of elevator dispatching: a team of RL agents (employing neural networks) is used to improve the performance of multiple elevator systems. Zhang and Dietterich [3] apply RL methods to incrementally improve a repair-based job-shop scheduler. They use the TD(λ) algorithm (the same algorithm used in TD-Gammon) to learn an evaluation function over states of scheduling. Their system has the disadvantage that it does not learn on-line (concurrently with the simulation). Mahadevan et al. [6] introduce a different algorithm for average-reward RL (called SMART) and apply it to the control of a production-inventory system with multiple product types; their RL agent has to decide between the two actions of producing or maintaining the system in order to avoid costly repairs.
The work we present in this paper is closely related to Mahadevan et al. [6] since it concerns a manufacturing production system. However, the task assigned to the RL agent in our case is considerably harder due to the specific characteristics of the manufacturing system and the demanding objectives to be met by the system scheduling policy.

3 System description
3.1 Description of the assembly plant
The manufacturing system described in this paper is a simplification of an existing assembly plant. We consider an assembly plant that consists of ten different workstations and produces two types of printed circuit boards (PCBs), referred to as Type A and Type B. Table 1 shows the existing workstations.

Parts waiting for processing by the workstations are temporarily stored at their local buffers, which have a capacity of five parts each. Due to the limited buffer capacities the system suffers from workstation blockages. Reflow soldering (workstation 3) and wave soldering (workstation 7) are continuous process machines and are limited only by the physical size of the moving belt the boards are placed on; the capacity of these workstations reflects the number of lots which can simultaneously be processed on them. Automatic surface mounting (workstation 2) and integrated circuit testing (workstation 10) require a set-up operation whenever a change in the type of board is encountered; Table 1 gives the set-up times for these workstations.

Table 1: Production resources
Workstation Id  Work area                Comments
1               Solder paste painting
2               Surface mounting         Setup time: 9 (sec)
3               Reflow soldering         Continuous process
4               SMD vision control
5               Assembly
6               Assembly
7               Wave soldering           Continuous process
8               Final assembly
9               Vision control
10              Integrated circuit test  Setup time: 18 (sec)

The process plans of the two types are given in Table 2 in terms of the sequence of workstations they have to visit to complete their assembly, while the durations of the corresponding operations are given in Table 3. It is evident from the durations of the operations that the setup operation is a time-consuming activity.

Table 2: Process plans
Board Type  Process plan (workstation ids)
A           1 2 3 4 5 7 9 10
B           1 2 3 4 6 7 9 8 10

Table 3: Processing times (in sec)
Board Type  Workstation id: 1   2   3   4   5   6   7   8   9   10
A                           2   2   10  6   11  0   18  0   2   2
B                           2   4   10  14  0   10  18  4   4   5

3.2 Scheduling policy objectives
The system scheduling policy is responsible for deciding the time instances at which a part will be input into the manufacturing plant as well as the part type. The objective of the scheduling policy is to ensure demand satisfaction and a balanced production rate of the required types. The balanced production is necessary because the assembly plant feeds successive production stages.
The aim of this work is to investigate the possibility of deriving an RL agent that is capable of developing workplans for the given assembly plant that satisfy the demand while keeping a good balance of the production mix. A simulation program has been developed that mimics the dynamics of the assembly plant, and then an RL agent has been built that determines the loading policy of the simulated plant.
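For illustration, the static plant data of Tables 1-3 can be written down directly as a few plain data structures. The sketch below is an assumed encoding (not part of the simulation software described next); it simply transcribes the process plans, processing times, setup times and buffer capacity.

```python
# Table 2 - process plans: the sequence of workstation ids each board type visits.
PROCESS_PLAN = {
    "A": [1, 2, 3, 4, 5, 7, 9, 10],
    "B": [1, 2, 3, 4, 6, 7, 9, 8, 10],
}

# Table 3 - processing times in seconds per workstation (0: workstation not used).
PROCESSING_TIME = {
    "A": {1: 2, 2: 2, 3: 10, 4: 6, 5: 11, 6: 0, 7: 18, 8: 0, 9: 2, 10: 2},
    "B": {1: 2, 2: 4, 3: 10, 4: 14, 5: 0, 6: 10, 7: 18, 8: 4, 9: 4, 10: 5},
}

# Table 1 - setup times in seconds, incurred when the board type changes.
SETUP_TIME = {2: 9, 10: 18}

# Local buffer capacity in front of each workstation.
BUFFER_CAPACITY = 5
```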
4 The Simulation Program
The simulation model built for the manufacturing system was based on the FMSLIB simulation library [5], a generic software library written in C that facilitates the simulation of flexible manufacturing systems (FMS) and their real-time control strategies. FMSLIB employs the three-phase approach [9] and provides facilities for modelling the physical structure, the part flow and the system and workstation loading policies of a family of FMSs. FMSLIB currently supports the following simulation entities:
- parts of different types
- machines
- workstations (a group of machines)
- limited capacity buffers
- non-accumulating conveyors
FMSLIB advocates the separate development of the conceptually different views of the simulated system. This approach facilitates modular program development, program readability and maintainability, and the evaluation of different strategies (dispatching, control, etc.) on the same system.
A simulation program based on FMSLIB is comprised of the following modules:
- Physical Structure (Equipment) - Contains the descriptions of the machines, conveyors, buffers and workstations that make up the simulated system.
- Operational Logic - Contains the descriptions of the feeding policies for the machines, workstations and conveyors of the simulated system, i.e. determines the input buffer policy for each machine.

- Input Data - Provides system-related static data, like the demand and the machine processing times.
- Part path - Describes the part flow through the system. This module explicitly describes the equipment required by each part type at each stage of its manufacturing process.
- Data Collection - Defines the user-defined parts of the system for which data collection is necessary.
- Control Strategy - Determines the scheduling policy to be implemented for the control of the system, i.e. it determines which part type will be introduced into the system and when. In this paper, the control strategy is implemented by the neural network which is trained using the RL agent.
The separation of the different views of the simulated system advocated by FMSLIB greatly facilitated the integration of the software that implements the RL agent (written in C++) with the simulation code.

5 The RL agent
In order to define a reinforcement learning agent one has to define a suitable representation of the state and a suitable reward function. Both definitions are very important for the success of an RL agent. Due to the large state space a neural network approximator is used for representing the Q function; specifically, the backpropagation learning algorithm is used for updating the weights of the network. The input to the network is described in the following section.

5.1 State representation
One of the most important decisions when designing an RL agent is the representation of the state. In the system described in this paper this is one of the major concerns, since a complete representation is not possible due to the complexity of the problem. Therefore we choose to include the following information in the state representation:
- State of machines. Each machine may be found in one of four distinct states: idle, working, setup or blocked. A separate input unit is used for each one of these states.
- State of input buffers. Buffers are of limited capacity. We use two units for each buffer: one unit for the level of the buffer and a second one which turns on when the buffer is full.
- Elapsed simulation time divided by the total simulation time, t/T. Ten units are used for the representation of time; these units encode the time as thermometer units.
- Feeding rates e_t(i)/P(i) for each type i of production parts.
- Producing rates p_t(i)/P(i) for each type i of production parts.
Here P(i) is the total demand, e_t(i) the number of entered parts and p_t(i) the number of produced parts of type i at time t, and T is the total production time. For each of the continuous rate measures 10 units are used, similarly to the encoding of the simulation time.
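A state vector along these lines could be assembled as in the following sketch. The unit counts follow the description above, but the helper names and the exact feature layout are assumptions for illustration, not the authors' C++ code.

```python
import numpy as np

MACHINE_STATES = ("idle", "working", "setup", "blocked")
BUFFER_CAPACITY = 5
N_UNITS = 10                      # units per thermometer-coded quantity

def thermometer(x, n_units=N_UNITS):
    """Thermometer code for x in [0, 1]: unit k switches on once x >= (k+1)/n."""
    return [1.0 if x >= (k + 1) / n_units else 0.0 for k in range(n_units)]

def encode_state(machine_states, buffer_levels, t, T, entered, produced, demand):
    """Assemble the network input from the quantities listed in Section 5.1."""
    features = []
    for s in machine_states:              # four units per machine (one per state)
        features.extend(1.0 if s == m else 0.0 for m in MACHINE_STATES)
    for level in buffer_levels:           # buffer level plus a "buffer full" flag
        features.append(level / BUFFER_CAPACITY)
        features.append(1.0 if level == BUFFER_CAPACITY else 0.0)
    features.extend(thermometer(t / T))   # elapsed simulation time t / T
    for i in sorted(demand):              # feeding and producing rates per type
        features.extend(thermometer(entered[i] / demand[i]))
        features.extend(thermometer(produced[i] / demand[i]))
    return np.array(features)
```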
5.2 Actions
The RL agent has to decide between three actions: entering a part of type A, entering a part of type B, or doing nothing. The decision is based on the comparison of the outputs of three separate neural networks, which are trained simultaneously during the simulation. All three networks receive the same state input.

5.3 The reward function
Special care has to be taken when defining the reward function. The implicit mapping of reward functions to scheduling policies has to be monotonic: higher rewards should correspond to better scheduling policies. Taking into consideration the scheduling policy objectives, the reward is calculated with the following formula:

    r_τ = − max_i ( |p_τ(i) − d_i·τ| / P(i) )    (3)

where τ denotes simulation time, to distinguish it from the decision epochs (the times at which the RL agent takes decisions), which are denoted as t. The term d_i = P(i)/T is the ideal production rate of part type i. In the ideal case, in which the production is balanced, p_τ(i) − d_i·τ tends to zero. So, the RL agent is punished with the maximum distance between the desired and the actual amount of production.
Simulation steps do not coincide with decision epochs of the RL agent, since during the simulation states occur in which there is only one possible action. At these states the simulation proceeds without consulting the RL controller. However, rewards are calculated at each simulation step and accumulated until the next decision epoch.
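A small numeric sketch of the per-step reward in Eq. (3) follows; the function name and the example values (including the assumed horizon T) are purely illustrative.

```python
def step_reward(produced, demand, tau, T):
    """Reward of Eq. (3): minus the largest normalised deviation between the actual
    production p_tau(i) and the ideal cumulative production d_i * tau, d_i = P(i)/T."""
    return -max(abs(produced[i] - (demand[i] / T) * tau) / demand[i]
                for i in demand)

# Example: demand of 10 type-A and 7 type-B parts over an assumed horizon T = 500.
print(step_reward({"A": 3, "B": 2}, {"A": 10, "B": 7}, tau=150.0, T=500.0))
# -> about -0.014 (type B lags its ideal cumulative production the most)
```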

6 Experimental results
After training the RL agent we conducted several experiments to test its performance on different manufacturing scenarios. Fig. 2 illustrates the derived scheduling policy for the task of producing 10 parts of type A and seven parts of type B. The selection of small production volumes increases the impact of the initial transient time on the overall performance. In the graph of Fig. 2 one can qualitatively compare the ideal accumulative productions d_i·τ (shown in the figure as two straight lines, one for each type of production part) with the actual accumulative productions p_τ(i) (shown as stepwise functions). It can be observed (Fig. 2) that, despite the initial transient time, the RL agent produces a schedule that quickly approximates the ideal productivity rates.

Fig. 2. Actual accumulative productions versus ideal accumulative productions of types A and B over simulation time.

7 Discussion
The manufacturing system presented in this paper is characterized by time-consuming processing and setup times as well as possible workstation blockages due to limited buffer capacities. As a result of these characteristics, there is a considerable time delay between the entrance of a part into the system and its exit (production). This implies that the consequences of the actions taken at the entry level are observed with considerable time delay. This fact has been identified and tackled appropriately by the RL agent.
Furthermore, the system undergoes a long transient period due to the fact that it is initially empty. Thus, the behaviour observed by the RL agent at the beginning of the simulation is considerably different from that of the steady state. This means that the RL agent should be able to distinguish between the transient and the steady state.
In addition to these characteristics, the objective of the system scheduling policy is to satisfy the demand while producing the part types at a given production mix and keeping the work in progress low. Demand satisfaction is favored by keeping the setups at a minimum level, while achieving the desired production mix requires an interchange of part types at feeding, which results in setup operations. The work in progress may be kept low by adjusting the feeding to the system production capacity. As a consequence, the RL agent has to strike a balance between batch processing and product mixing while not feeding the system continuously. It becomes obvious from these observations that the RL agent is assigned a hard problem.
An open question remains whether the manufacturing system can be represented as a Markov Decision Process, since many rewards occur between successive decision epochs and the system is not fully visible (due to the abstraction of the representation). Mahadevan et al. [6] argue that these kinds of problems can be represented as Semi-Markov Decision Processes. Although we do not use the same average-reward RL algorithm, we do accumulate rewards between successive decision epochs and award them to the preceding action.
As already mentioned, the system used for the experiments is a simplification of an existing assembly plant. The actual system consists of multiple machines per workstation and produces 17 part types (2 types were used for this study).
Representation of the actual system would require an even larger state-action space and would considerably lengthen learning times. One of our future goals is to examine ways of adapting the RL agent already developed to deal with the increased complexity of the real problem.

References:
[1] R. Akella, Y. F. Choong, and S. B. Gershwin, Performance of hierarchical production scheduling policy, IEEE Transactions on Components, Hybrids, and Manufacturing Technology, Vol. 7, No. 3, 1984, pp. 225-240.
[2] R. Crites and A. Barto, Improving Elevator Performance Using Reinforcement Learning, in D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, editors, Advances in Neural Information Processing Systems 8, MIT Press, 1996.

[3] W. Zhang and T. G. Dietterich, A reinforcement learning approach to job-shop scheduling, in Proceedings of the 14th International Joint Conference on Artificial Intelligence, 1995.
[4] R. Graves, Hierarchical scheduling approach in flexible assembly systems, in Proceedings of the 1987 IEEE Conference on Robotics and Automation, Raleigh, NC, Vol. 1, 1987, pp. 118-123.
[5] E. Kehris and Z. Doulgeri, An FMS simulation development environment for real time control strategies, XVI European Conference on Operational Research, Brussels, 1998.
[6] S. Mahadevan, N. Marchalleck, T. K. Das, and A. Gosavi, Self-improving factory simulation using continuous-time average-reward reinforcement learning, in Proceedings of the 13th International Conference on Machine Learning, 1996, pp. 202-210.
[7] R. S. Sutton, Learning to predict by the methods of temporal differences, Machine Learning, 3, 1988, pp. 9-44.
[8] G. Tesauro, TD-Gammon, a Self-Teaching Backgammon Program, Achieves Master-Level Play, Neural Computation, 6, 1994, pp. 215-219.
[9] K. Tocher, The Art of Simulation, Van Nostrand Company, Princeton, NJ, 1963.
[10] C. J. C. H. Watkins and P. Dayan, Q-learning, Machine Learning, 8, 1992, pp. 279-292.