MULTI-AGENT REINFORCEMENT LEARNING WITH SHARED RESOURCES FOR INVENTORY MANAGEMENT
Anonymous authors
Paper under double-blind review

Abstract

In this paper, we consider the inventory management (IM) problem, in which replenishment decisions must be made for a large number of stock keeping units (SKUs) to balance their supply and demand. In our setting, the constraint on shared resources (such as the inventory capacity) couples the otherwise independent control of each SKU. We formulate the problem with this structure as a Shared-Resource Stochastic Game (SRSG) and propose an efficient algorithm called Context-aware Decentralized PPO (CD-PPO). Through extensive experiments, we demonstrate that CD-PPO can accelerate the learning procedure and achieve better performance compared with standard MARL algorithms.

1. INTRODUCTION

The inventory management (IM) problem has long been one of the most important application scenarios in the supply-chain industry (Nahmias & Smith, 1993). Its main purpose is to maintain a balance between the supply and demand of stock keeping units (SKUs) in a supply chain by optimizing the replenishment decisions for each SKU. An efficient inventory management strategy can not only increase profit and reduce operational cost but also provide better service, helping to maintain customer satisfaction (Eckert, 2007). Nevertheless, this task is quite challenging in practice because the replenishment decisions for different SKUs compete for shared resources (e.g., the inventory capacity or the procurement budget) while also cooperating with each other to achieve a high total profit. This becomes more challenging as the number of SKUs involved in the supply chain grows. Such co-existence of cooperation and competition renders IM a complicated and challenging problem. Traditional methods usually reduce the IM problem to dynamic programming (DP). However, these approaches often rely on unrealistic assumptions such as i.i.d. customer demands and deterministic leading time (Kaplan, 1970; Ehrhardt, 1984). Moreover, when the state space grows rapidly with the scaling-up of key factors such as the leading time and the number of SKUs, the problem becomes intractable for DP due to the curse of dimensionality (Gijsbrechts et al., 2019). Due to these limitations, many approaches based on approximate dynamic programming have been proposed to solve IM problems in different settings (Halman et al., 2009; Fang et al., 2013; Chen & Yang, 2019). While these approaches perform well in certain scenarios, they rely heavily on problem-specific expertise or assumptions, e.g., the zero or one period leading time assumption in (Halman et al., 2009), and thus can hardly generalize to other settings.
In contrast, reinforcement learning (RL) based methods, with short inference time, can generalize to various scenarios in a data-driven manner. However, it is hard to train a global policy that makes decisions for all SKUs due to the large global state and action space (Jiang & Agarwal, 2018). To address the training efficiency issue, it is natural to adopt the multi-agent reinforcement learning (MARL) paradigm, where each SKU is controlled by an individual agent whose state and action spaces are localized and contain only information relevant to itself. There are currently two popular paradigms for training MARL in the literature: independent learning (Tan, 1993) and joint action learning (Lowe et al., 2017). Despite their success in many scenarios, these two MARL paradigms exhibit certain weaknesses that limit their effectiveness in solving the IM problem. In independent learning, the policy training of one agent treats all other agents as part of the stochastic environment; the resulting non-stationarity of the environment can make training convergence substantially harder. In joint action learning, a centralized critic is usually trained to predict the value based on the global state (of all SKUs) and the joint action, which can easily become intractable as the number of SKUs grows. Furthermore, it is time-consuming to sample data from a joint simulator for a large number of SKUs due to the high computational cost of calculating the internal variables that model the complex agent interactions. To address these challenges, we leverage the structure of the IM problem to design a more effective MARL paradigm. In particular, each agent in the system interacts with the others only through the competition for shared resources such as the inventory capacity. We introduce an auxiliary variable called the context to represent the shared resources (e.g., the available inventory level for all SKUs).
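To make the context concrete, consider the shared inventory capacity: the context at each step can be derived from the per-SKU inventory levels alone, without exposing each agent to the full global state. The sketch below is purely illustrative; the function name and interface are our own assumptions, not the paper's implementation.

```python
import numpy as np

def shared_resource_context(inventory_levels, capacity):
    """Remaining shared inventory capacity, given every SKU's stock level.

    This single scalar is the 'context' each agent can observe in place
    of the full global state of all other SKUs.
    """
    used = float(np.sum(inventory_levels))
    return max(capacity - used, 0.0)
```

For example, with three SKUs holding 10, 20, and 30 units under a total capacity of 100, each agent would observe a context of 40.0 units of remaining capacity.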
From the MARL perspective, the dynamics of the context reflect the collective behavior of all the agents. Conditioned on the context dynamics, we assume the transition dynamics and the reward function of the agents are independent. In this way, leveraging the context as an additional input to the policy or the value network of each agent enables us not only to prevent the non-stationarity in independent learning but also to feed global information to the critic without resorting to an intractable global critic. Based on the context, we propose the Shared-Resource Stochastic Game (SRSG) to model the IM problem. Since the context and the policies of the agents depend on each other, it is hard to solve for them simultaneously. Accordingly, we make two assumptions to circumvent this issue: 1) rearranging the sampling process by first sampling the contexts and then sampling the local state/action/reward for each agent; 2) using context dynamics sampled by previous policies. With these assumptions, we design an efficient algorithm called Context-aware Decentralized PPO (CD-PPO) that consists of two iterative learning procedures: 1) obtaining context samples from a joint simulator, and 2) updating the policy of each agent with data sampled from its corresponding local simulator conditioned on the collective context dynamics. By decoupling each agent from the others with a separate local simulation, our method can greatly reduce the model complexity and accelerate the learning procedure. Finally, we conduct extensive experiments whose results validate the effectiveness of our method. Beyond the IM problem considered in this paper, our method may also apply to other applications with shared resources, such as portfolio management (Ye et al., 2020), where different stocks share the same capital pool, and smart grid scheduling (Remani et al., 2019), where different nodes share a total budget.
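The two alternating procedures above can be sketched as the following training skeleton. All interfaces here (the simulator stubs, the PPO update placeholder, and the function names) are hypothetical stand-ins to show the control flow, not the paper's actual implementation.

```python
import random

def sample_context_trajectory(joint_policy, horizon):
    # Stub: a joint simulator would roll out all agents together and
    # record the shared-resource context (e.g., remaining capacity)
    # observed at each step.
    return [random.random() for _ in range(horizon)]

def local_rollout(agent_id, policy, contexts):
    # Stub: a per-agent local simulator steps a single SKU, treating
    # the pre-sampled context sequence as exogenous input.
    return [(c, policy(c)) for c in contexts]

def ppo_update(policy, rollout):
    # Placeholder for a standard PPO update on the agent's local rollout.
    return policy

def cd_ppo(n_agents, iterations=2, horizon=5):
    policies = {i: (lambda ctx: 0) for i in range(n_agents)}
    for _ in range(iterations):
        # Phase 1: sample context dynamics from the joint simulator
        # under the current (previous-iteration) policies.
        contexts = sample_context_trajectory(policies, horizon)
        # Phase 2: update each agent independently on its own local
        # simulator, conditioned on the sampled context dynamics.
        for i in range(n_agents):
            rollout = local_rollout(i, policies[i], contexts)
            policies[i] = ppo_update(policies[i], rollout)
    return policies
```

The key design point is that Phase 2 parallelizes trivially across agents, since each local simulator only needs its own SKU's state plus the shared context sequence.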
Our contributions are summarized as follows:
• We propose the Shared-Resource Stochastic Game to capture the problem structure in the IM problem, where agents interact with each other through competing for shared resources.
• We propose a novel algorithm called Context-aware Decentralized PPO that leverages the shared-resource structure to solve the IM problem efficiently.
• We conduct extensive experiments to demonstrate that our method achieves performance on par with state-of-the-art MARL algorithms while being more computationally and sample efficient.

2. BACKGROUND

2.1 STOCHASTIC GAMES

We build our work on the formulation of the stochastic game (SG) (Shapley, 1953) (also known as Markov game). A stochastic game is defined as a tuple $(\mathcal{N}, \mathcal{S}, \mathcal{A}, \mathcal{T}, R, \gamma)$, where $\mathcal{N} = [n]$ denotes the set of $n > 1$ agents, $\mathcal{S}$ is the state space, $\mathcal{A} := \mathcal{A}^1 \times \cdots \times \mathcal{A}^n$ is the action space composed of the action spaces of the individual agents, $\mathcal{T}: \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ is the transition dynamics, $R = \sum_{i=1}^{n} R^i$ is the total reward, i.e., the sum of the individual rewards $R^i: \mathcal{S} \times \mathcal{A}^i \times \mathcal{S} \to \mathbb{R}$, and $\gamma \in [0, 1)$ is the discount factor. For the $i$-th agent, we denote its policy as $\pi^i: \mathcal{S} \to \Delta(\mathcal{A}^i)$ and the joint policy of the other agents as $\pi^{-i} = \prod_{j \in [n] \setminus \{i\}} \pi^j$. Each agent optimizes its policy conditioned on the policies of the others, i.e., $\max_{\pi^i} \eta^i(\pi^i, \pi^{-i}) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t^i \mid \pi^i, \pi^{-i}\right]$,

