REINFORCEMENT LEARNING OF INDUSTRIAL SEQUENTIAL DECISION-MAKING TASKS UNDER NEAR-PREDICTABLE DYNAMICS: A BI-CRITIC VARIANCE REDUCTION APPROACH

Abstract

Learning to plan and schedule is receiving increasing attention for industrial decision-making tasks, partly for its potential to outperform heuristics, especially under dynamic uncertainty, and partly for its efficiency in problem solving, particularly with the adoption of neural networks and the underlying GPU computing. Reinforcement learning (RL) with the Markov decision process (MDP) has naturally become a popular paradigm. Rather than handling near-stationary environments like Atari games, or the opposite extreme of open-world dynamics with high uncertainty, in this paper we devise a tailored RL-based approach for the practical setting in between: near-predictable dynamics, which often hold in many industrial applications, e.g., elevator scheduling and bin packing, the two empirical case studies investigated in this paper. Specifically, we propose a two-stage MDP to decouple the state transition uncertainty caused by the data dynamics from that caused by the constrained action space in the industrial environment. A bi-critic framework is then devised, based on the two-stage MDP, to amortize the uncertainty and reduce the variance of value estimation. Experimental results show that our engine can adaptively handle tasks with different data dynamics and outperform recent learning-based models and traditional heuristic algorithms.

1. INTRODUCTION

The advent of Industry 4.0 has put forward demanding requirements for resolving sequential decision-making tasks in industry. Such tasks, which involve planning and scheduling, have been researched for decades for their commercial value. A planning task, like the bin packing problem (BPP) (Zhao et al., 2020; Zhu et al., 2021; Duan et al., 2022; Zhao et al., 2022; Zhao & Xu, 2022), arranges a series of discrete objects under certain constraints to optimize an objective function, while a scheduling task focuses on allocating limited resources to multiple objects to optimize performance indicators under certain constraints, such as the elevator group scheduling problem (EGSP) (Crites & Barto, 1998; Zheng et al., 2013; Wei et al., 2020), the vehicle routing problem (VRP) (Nazari et al., 2018), and the job scheduling problem (JSP) (Chen & Tian, 2019). As problem scale and variety increase, methods that find high-quality solutions effectively and apply jointly across tasks become more and more attractive yet challenging. Traditional solutions in industry are often rule-based, tuning a score function with expert experience for a specific task, but they can hardly be generalized to others. Other approaches formulate the tasks as combinatorial optimization problems and then apply heuristic algorithms (often owing to their NP-hardness), such as search algorithms (ELA, 2019; TUR, 2020) and greedy algorithms (Ramalingam et al., 2017), but these lack real-time response and scalability. Many learning-based methods (Wei et al., 2020; Zhao et al., 2020) have been developed, with RL showing remarkable advantages in sequential decision-making problems. However, emerging learning-based works still often fall behind industry standards, which can be partly attributed to the lack of real-world training data and the unavailability of strong simulators to provide rich and realistic data for training.
We consider two aspects for addressing industrial sequential decision-making tasks with RL. The first, and the one mainly addressed in our work, is to better exploit the characteristics of the environment dynamics in the industrial pipeline. Existing efforts (Hadoux et al., 2014; Chandak et al., 2020; Chen et al., 2021) assume general non-stationary environments and develop classical RL algorithms to learn structural features of the environment dynamics, including Meta-RL (Xu et al., 2020; Chen et al., 2021) and context-detection RL (Padakandla et al., 2020). However, we argue that the environment in fact often bears inherent regularities and is sometimes nearly predictable; e.g., in BPP, items of similar shape and size usually appear in a batch. The above existing works neglect such near-predictable dynamics and leave room for more tailored algorithmic development. Another practical aspect is to strictly obey hard constraints, which are common in industry for specific reasons, e.g., safety. Although Chen et al. (2021) and Wei & Luo (2021) further consider the constrained Markov decision process (CMDP) (Altman, 1999) and the robust constrained Markov decision process (RCMDP) (Russel et al., 2020) for safe RL (Hewing et al., 2020), they tolerate constraint violations and are not up to industry standards. In particular, enforcing hard constraints often increases state transition uncertainty (Mao et al., 2018), which in turn leads to high variance in value estimation. Thus, we argue that near-predictability must be considered more carefully to mitigate the challenge posed by hard constraints. In this paper, we propose a Dynamic-Aware and Constraint-Confined (DACC) RL framework for industrial sequential decision-making tasks.
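To make the distinction between strict constraint satisfaction and CMDP-style penalty tolerance concrete, a common device for enforcing hard constraints at decision time is invalid-action masking: the policy's probability mass is restricted to the feasible action set before sampling, so no violating action can ever be drawn. The sketch below is our illustration under that general technique, not the paper's exact mechanism; the function and variable names are hypothetical.

```python
import numpy as np

def masked_action_probs(logits: np.ndarray, feasible: np.ndarray) -> np.ndarray:
    """Restrict a softmax policy to the feasible action set.

    Infeasible actions receive -inf logits, so their probability is
    exactly zero and a sampled action can never violate a hard
    constraint. (Illustrative sketch; names are hypothetical.)
    """
    masked = np.where(feasible, logits, -np.inf)
    z = masked - masked.max()  # shift for numerical stability
    e = np.exp(z)              # exp(-inf) == 0, masking out infeasible actions
    return e / e.sum()

# Example: action 1 violates a hard constraint and must get zero mass.
p = masked_action_probs(np.array([1.0, 2.0, 0.5]),
                        np.array([True, False, True]))
```

A CMDP-style penalty, by contrast, only discourages violations in expectation, which is the gap with industry standards noted above.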
Unlike previous RL-based efforts for industrial cases that formulate the problem as a non-stationary CMDP, we first identify the non-stationary but near-predictable environmental dynamics and reformulate these tasks as a two-stage MDP (Kim et al., 2019) for its potential to distinguish the effects of the environment dynamics (exogenous variables) and the constrained action space (endogenous variables) in the state transition. Furthermore, value estimation based on a two-stage MDP reduces the variance of value estimates without introducing bias, as proved by Mao et al. (2019). Specifically, we design DACC, a bi-critic framework for perceiving the dynamics and for making decisions under hard industrial constraints with the guidance of heuristic rules. By estimating the state value in two stages with our bi-critic framework, we reduce the state transition uncertainty and the value estimation variance caused by the mutually adverse effects of dynamic variability and hard constraints. To evaluate our method's effectiveness and generalization, we conduct experiments on two typical industrial sequential decision tasks: 3D bin packing and elevator group scheduling. For the latter, which lacks a realistic simulator, we improve the open-source simulator by adding more constraints, business rules, and logic, and will release a more realistic one, based on our first-hand engagement with a top-tier elevator manufacturer, to benefit the community. The highlights of this paper are: 1) Many sequential decision-making tasks in industry require strict constraints, which increase state transition uncertainty and thereby challenge RL-based methods. Fortunately, we identify that in many cases the environment is near-predictable, allowing for more tailored MDP model development, which is largely ignored by existing methods. 2) We innovatively separate the state transition process of these industrial tasks into two stages.
We derive theoretical solutions, embodied by a two-stage MDP, to the high-variance problem of value estimation that appears in single-stage settings. We then propose a bi-critic framework, called Dynamic-Aware and Constraint-Confined (DACC), to capture the regularity of the dynamics and make decisions under hard industrial constraints. 3) We apply our framework to two representative yet challenging real-world cases: the 3D bin packing and elevator group scheduling problems. Results show that our method outperforms conventional rule-based and state-of-the-art learning-based models. Further comparisons with Meta-RL methods verify our framework's superiority in capturing the inherent regularities in these dynamic industrial scenes. Generalization experiments on untrained data show that the model generalizes well.
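The variance-reduction intuition behind the two-stage decomposition can be checked with a toy numerical sketch (our illustration, not the paper's experiment, and the return model is hypothetical): if the return decomposes into an exogenous dynamics component and an endogenous action component, a single global baseline leaves all the exogenous variance in the advantage estimate, whereas a baseline conditioned on the observed dynamics stage, the bi-critic analogue, removes it.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
# Toy return model (hypothetical): an exogenous "arrival regime"
# contributes 0 or 10 to the return; the endogenous, action-driven
# part contributes N(0, 1) noise.
exo = rng.choice([0.0, 10.0], size=n)
ret = exo + rng.normal(0.0, 1.0, size=n)

# Single-critic analogue: one global baseline for all states.
adv_single = ret - ret.mean()

# Bi-critic analogue: condition the baseline on the exogenous regime,
# amortizing the dynamics uncertainty before judging the action.
baseline = np.where(exo == 0.0, ret[exo == 0.0].mean(), ret[exo == 10.0].mean())
adv_bi = ret - baseline

# adv_single carries the regime variance (~25) plus the action noise;
# adv_bi retains only the action noise (~1).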

2. RELATED WORK

Many works extend the Markov decision process model. The constrained Markov decision process (CMDP) (Altman, 1999) is suited to constrained physical systems, such as avoiding obstacles or unsafe regions in space. The robust Markov decision process (RMDP) (Petrik & Russel, 2019) is suited to scenarios where transition probabilities or rewards are uncertain, and the robust constrained Markov decision process (RCMDP) (Russel et al., 2020) merges CMDP and RMDP. The time-dependent MDP (TMDP) (Boyan & Littman, 2000) considers both stochastic state transitions and stochastic, time-dependent action durations. A two-stage MDP task is designed by Kim et al. (2019) to differentiate the effects of state transition uncertainty and state-space complexity on the brain's arbitration process.

