REINFORCEMENT LEARNING OF INDUSTRIAL SEQUENTIAL DECISION-MAKING TASKS UNDER NEAR-PREDICTABLE DYNAMICS: A BI-CRITIC VARIANCE REDUCTION APPROACH

Abstract

Learning to plan and schedule is receiving increasing attention for industrial decision-making tasks, partly for its potential to outperform heuristics, especially under dynamic uncertainty, and for its efficiency in problem-solving, particularly with the adoption of neural networks and the GPU computing behind them. Naturally, reinforcement learning (RL) with the Markov decision process (MDP) has become a popular paradigm. Rather than handling near-stationary environments like Atari games, or the opposite extreme of open-world dynamics with high uncertainty, in this paper we aim to devise a tailored RL-based approach for the practical setting in between: near-predictable dynamics, which often hold in many industrial applications, e.g., elevator scheduling and bin packing, the two empirical case studies investigated in this paper. Specifically, we propose a two-stage MDP to decouple the state transition uncertainty caused by the data dynamics from the constrained action space of the industrial environment. A bi-critic framework is then devised to amortize the uncertainty and reduce the variance of value estimation according to the two-stage MDP. Experimental results show that our engine can adaptively handle tasks with different data dynamics and outperform both recent learning-based models and traditional heuristic algorithms.
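The bi-critic idea over a two-stage MDP can be illustrated with a minimal tabular sketch. This is an illustrative assumption, not the paper's actual implementation: the toy environment, the `act`/`exo`/`reward` functions, and the update rules below are hypothetical. One critic values pre-decision states and bootstraps through a deterministic, constraint-respecting action stage; a second critic values post-decision states and absorbs only the exogenous (near-predictable) dynamics, so each TD error spans a single source of randomness.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, alpha = 0.9, 0.1
n_states = 5

V_pre = np.zeros(n_states)   # critic 1: values of pre-decision states
V_post = np.zeros(n_states)  # critic 2: values of post-decision states

def act(s):
    # Stage 1: deterministic, constraint-respecting action effect.
    return (s + 1) % n_states

def exo(s_post):
    # Stage 2: near-predictable exogenous dynamics (small stochastic drift).
    return (s_post + rng.integers(0, 2)) % n_states

def reward(s_post):
    return 1.0 if s_post == 0 else 0.0

s = 0
for _ in range(5000):
    s_post = act(s)  # stage 1 transition (no exogenous noise)
    r = reward(s_post)
    # Critic 1's target bootstraps from critic 2: it sees no exogenous noise.
    V_pre[s] += alpha * (r + gamma * V_post[s_post] - V_pre[s])
    s_next = exo(s_post)  # stage 2 transition (stochastic)
    # Critic 2's TD error isolates the exogenous-dynamics randomness.
    V_post[s_post] += alpha * (V_pre[s_next] - V_post[s_post])
    s = s_next
```

Under this decomposition, each critic's bootstrap target averages over only one stage of the transition, which is the intuition behind the variance reduction claimed for the bi-critic framework.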

1. INTRODUCTION

The advent of Industry 4.0 has put forward demanding requirements for solving sequential decision-making tasks in industry. Such tasks, which involve planning and scheduling, have been researched for decades for their commercial value. A planning task, like the bin packing problem (BPP) (Zhao et al., 2020; Zhu et al., 2021; Duan et al., 2022; Zhao et al., 2022; Zhao & Xu, 2022), involves arranging a series of discrete objects under certain constraints to optimize an objective function, while a scheduling task focuses on allocating limited resources to multiple objects to optimize performance indicators under certain constraints, such as the elevator group scheduling problem (EGSP) (Crites & Barto, 1998; Zheng et al., 2013; Wei et al., 2020), the vehicle routing problem (VRP) (Nazari et al., 2018), and the job scheduling problem (JSP) (Chen & Tian, 2019). As problem scale and variety grow, methods that find optimal solutions with both effectiveness and broad applicability become more and more attractive yet challenging. Traditional industrial solutions are often rule-based, tuning a score function with expert experience for a specific task, but can hardly generalize to others. Others formulate the tasks as combinatorial optimization problems and then apply heuristic algorithms (often necessitated by NP-hardness), such as search algorithms (ELA, 2019; TUR, 2020) and greedy algorithms (Ramalingam et al., 2017), but these lack real-time responsiveness and scalability. Many learning-based methods (Wei et al., 2020; Zhao et al., 2020) have been developed, with RL showing remarkable advantages in sequential decision-making problems. However, emerging learning-based works still often fall behind industry standards, which can be partly attributed to the lack of real-world training data and the unavailability of strong simulators that could provide rich and realistic data for training.
We consider two aspects for addressing industrial sequential decision-making tasks with RL. The first and mainly addressed issue in our work is to better utilize the characteristics of the environment

