MULTI-AGENT REINFORCEMENT LEARNING WITH SHARED RESOURCES FOR INVENTORY MANAGEMENT

Anonymous authors
Paper under double-blind review

Abstract

In this paper, we consider the inventory management (IM) problem, where we need to make replenishment decisions for a large number of stock keeping units (SKUs) to balance their supply and demand. In our setting, a constraint on shared resources (such as the inventory capacity) couples the otherwise independent control of each SKU. We formulate the problem with this structure as a Shared-Resource Stochastic Game (SRSG) and propose an efficient algorithm called Context-aware Decentralized PPO (CD-PPO). Through extensive experiments, we demonstrate that CD-PPO accelerates learning and achieves better performance than standard MARL algorithms.

1. INTRODUCTION

The inventory management (IM) problem has long been one of the most important application scenarios in the supply-chain industry (Nahmias & Smith, 1993). Its main purpose is to maintain a balance between the supply and demand of stock keeping units (SKUs) in a supply chain by optimizing the replenishment decisions for each SKU. An efficient inventory management strategy can not only increase profit and reduce operational cost, but also improve service quality and thereby maintain customer satisfaction (Eckert, 2007). Nevertheless, this task is quite challenging in practice because the replenishment decisions for different SKUs compete for shared resources (e.g., the inventory capacity or the procurement budget) while also having to cooperate with each other to achieve a high total profit. The difficulty grows as the number of SKUs in the supply chain increases. Such co-existence of cooperation and competition renders IM a complicated and challenging problem.

Traditional methods usually reduce the IM problem to dynamic programming (DP). However, these approaches often rely on unrealistic assumptions such as i.i.d. customer demands and deterministic lead time (Kaplan, 1970; Ehrhardt, 1984). Moreover, since the state space grows rapidly with key factors such as the lead time and the number of SKUs, the problem becomes intractable for DP due to the curse of dimensionality (Gijsbrechts et al., 2019). Due to these limitations, many approaches based on approximate dynamic programming have been proposed to solve IM problems in different settings (Halman et al., 2009; Fang et al., 2013; Chen & Yang, 2019). While these approaches perform well in certain scenarios, they rely heavily on problem-specific expertise or assumptions, e.g., the zero- or one-period lead time assumption in (Halman et al., 2009), and thus can hardly generalize to other settings.
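To make the classical setting concrete, the sketch below simulates a single SKU under a base-stock replenishment rule with i.i.d. demand and zero lead time, the kind of simplifying assumptions the DP literature discussed above relies on. All function names, cost parameters, and the demand distribution are illustrative choices, not taken from the paper.

```python
import random

def base_stock_order(inventory_position: int, base_stock_level: int) -> int:
    """Classic base-stock rule: order exactly enough to reach a fixed target."""
    return max(0, base_stock_level - inventory_position)

def simulate(base_stock_level: int, horizon: int = 100, seed: int = 0) -> float:
    """Simulate one SKU with i.i.d. uniform demand and zero lead time.

    Returns the total holding + stockout cost over the horizon; the cost
    coefficients below are arbitrary for illustration.
    """
    rng = random.Random(seed)
    inventory = base_stock_level
    holding_cost, stockout_cost, total_cost = 1.0, 5.0, 0.0
    for _ in range(horizon):
        # Zero lead time: the order arrives before demand is realized.
        inventory += base_stock_order(inventory, base_stock_level)
        demand = rng.randint(0, 10)  # i.i.d. demand, independent across periods
        sold = min(inventory, demand)
        inventory -= sold
        total_cost += holding_cost * inventory + stockout_cost * (demand - sold)
    return total_cost
```

Policies of this form are easy to compute per SKU, but they ignore exactly the coupling this paper targets: nothing in the rule accounts for other SKUs competing for the same capacity or budget.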
In contrast, reinforcement learning (RL) based methods, with short inference time, can generalize to various scenarios in a data-driven manner. However, it is hard to train a single global policy that makes decisions for all SKUs due to the large global state and action spaces (Jiang & Agarwal, 2018). To address this training efficiency issue, it is natural to adopt the multi-agent reinforcement learning (MARL) paradigm, where each SKU is controlled by an individual agent whose state and action spaces are localized and only contain information relevant to itself. There are currently two popular paradigms for training MARL in the literature: independent learning (Tan, 1993) and joint action learning (Lowe et al., 2017). Despite their success in many scenarios, these two MARL paradigms also exhibit certain weaknesses that limit their effectiveness on the IM problem. In independent learning, the policy training of one agent treats all other agents as part of the stochastic environment, which can severely hinder training convergence due to the non-stationarity of the environment. In joint action learning, a centralized critic is usu-
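The per-SKU decentralization and the shared-resource coupling described above can be sketched as follows. Each toy agent observes only its own inventory (a localized state), while a shared capacity constraint ties all orders together at every step. The class and function names, the random policy, and the order-shrinking rule are all illustrative assumptions; this is not the CD-PPO algorithm proposed in the paper.

```python
import random

class SKUAgent:
    """Toy per-SKU agent with a localized observation (its own inventory only)."""

    def __init__(self, seed: int):
        self.rng = random.Random(seed)

    def act(self, own_inventory: int) -> int:
        # From this agent's point of view, all other agents are folded into
        # the environment dynamics -- the source of non-stationarity that
        # makes independent learning hard to converge.
        return self.rng.randint(0, 5)

def step(agents, inventories, capacity: int = 20):
    """One environment step: orders from all agents compete for the shared
    inventory capacity, coupling otherwise independent controls."""
    orders = [a.act(inv) for a, inv in zip(agents, inventories)]
    free_space = max(0, capacity - sum(inventories))
    while sum(orders) > free_space:
        # Shrink the largest order until the joint constraint holds.
        i = orders.index(max(orders))
        orders[i] -= 1
    return [inv + o for inv, o in zip(inventories, orders)]
```

The point of the sketch is the coupling: even though each agent decides alone, the feasibility of its order depends on every other agent's order through the shared capacity, which is exactly the structure the SRSG formulation makes explicit.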

