LPMARL: LINEAR PROGRAMMING-BASED IMPLICIT TASK ASSIGNMENT FOR HIERARCHICAL MULTI-AGENT REINFORCEMENT LEARNING

Abstract

Training a multi-agent reinforcement learning (MARL) model with sparse reward is notoriously difficult as the terminal reward is induced by numerous interactions among agents. In this study, we propose linear programming (LP)-based hierarchical MARL (LPMARL) to learn effective cooperative strategies among agents. LPMARL is composed of two hierarchical decision-making schemes: (1) solving an agent-task assignment LP using the state-dependent cost parameters generated by a graph neural network (GNN) and (2) solving low-level cooperative games among agents assigned to the same task. We train the LP parameter-generating GNN and the low-level MARL policy in an end-to-end manner using the implicit function theorem. We empirically demonstrate that LPMARL learns an optimal agent-task allocation and the subsequent local cooperative policy for agents in sub-groups for solving various mixed cooperative-competitive games.

1. INTRODUCTION

Multi-agent reinforcement learning (MARL) has recently drawn much attention due to its practical and potential applications in controlling complicated, distributed multi-agent systems. Despite this potential, training an MARL model with sparse reward is notoriously difficult, as the final sparse reward is induced by complex long-term interactions among the agents (Liu et al., 2021). To overcome this challenge, one needs an algorithm that can learn how the interactions among the agents over a long episode produce the outcome of the target task, a delayed and sparse episodic reward, and distill this understanding into an effective sequential decision-making policy.

In this study, we propose linear programming-based hierarchical MARL (LPMARL), a hierarchically structured decision-making scheme, to learn an effective coordination strategy among the agents. LPMARL performs two hierarchical decision-making steps: (1) solving an agent-task assignment problem and (2) solving local cooperative games among the agents assigned to the same task. For the first step, LPMARL formulates agent-task assignment as an LP using state-dependent cost coefficients generated by a graph neural network (GNN). The solution of the formulated LP serves as an agent-to-task assignment, which decomposes the original team game into a set of smaller team games among the agents assigned to the same task. In the second step, LPMARL employs a general MARL strategy to solve each sub-task cooperatively. We train the LP-parameter-generating GNN layer and the low-level MARL policy network in an end-to-end manner using the implicit function theorem. We validate the effectiveness of LPMARL on various cooperative games with constrained resource allocation. The technical contributions and novelties of the proposed method are as follows:

• Interpretability (Section 6.2.1). LPMARL can induce designed behaviors of the agents (i.e., behavioral inductive biases) through specific objective terms or constraints when formulating the LP. This structured framework helps one interpret the decision-making procedure.

• Transferability (Section 6.2.2). LPMARL learns to construct and solve task assignment optimization problems. When constructing a resource assignment LP, LPMARL uses a GNN to produce the state-dependent cost coefficients used in the objective function of the LP. Due to the size generalization/transferability of GNNs, the trained LPMARL policy can be applied to target problems with varying numbers of agents, tasks, and constraints.

• Amortization (Section 6.1). To execute LPMARL, the agents must solve the LP and use its optimal solution as the high-level action. As this centralized execution may be impossible in the real world, we amortize the central solution-finding procedure with a learned distributed task selection network. In the experiments, we demonstrate that the amortized high-level policy does not degrade the performance of the centralized version of LPMARL and outperforms existing hierarchical MARL algorithms.
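The high-level assignment step can be sketched as a small LP. The following is a minimal sketch using SciPy; the random cost matrix stands in for the GNN-generated coefficients, and the task-capacity constraint is one illustrative choice of constraint, not necessarily the ones used in the paper:

```python
import numpy as np
from scipy.optimize import linprog

def assign_agents_to_tasks(cost, capacity):
    """Solve a relaxed agent-task assignment LP.

    cost[i, j]  -- cost of assigning agent i to task j (in LPMARL these
                   would be state-dependent outputs of a GNN; here random).
    capacity[j] -- maximum number of agents task j can absorb.
    """
    n_agents, n_tasks = cost.shape
    c = cost.ravel()  # decision variables x[i, j], flattened row-major

    # Each agent is assigned to exactly one task: sum_j x[i, j] = 1.
    A_eq = np.zeros((n_agents, n_agents * n_tasks))
    for i in range(n_agents):
        A_eq[i, i * n_tasks:(i + 1) * n_tasks] = 1.0
    b_eq = np.ones(n_agents)

    # Task capacity: sum_i x[i, j] <= capacity[j].
    A_ub = np.zeros((n_tasks, n_agents * n_tasks))
    for j in range(n_tasks):
        A_ub[j, j::n_tasks] = 1.0  # column j of the flattened x

    res = linprog(c, A_ub=A_ub, b_ub=capacity, A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, 1), method="highs")
    return res.x.reshape(n_agents, n_tasks)

rng = np.random.default_rng(0)
x = assign_agents_to_tasks(rng.random((4, 2)), np.array([2.0, 2.0]))
print(x.round(2))  # each row sums to 1; column sums respect capacity
```

Because the assignment constraint matrix is totally unimodular, the LP relaxation returns an integral (0/1) assignment at a vertex solution, which is what allows the relaxed LP to serve directly as a discrete agent-to-task grouping.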

2. BACKGROUND

2.1. HIERARCHICAL MULTI-AGENT REINFORCEMENT LEARNING

Tang et al. (2018) introduced a hierarchical Dec-POMDP composed of high- and low-level actions, which run on different time scales. This study introduced a temporal abstraction concept to integrate the two types of actions into the overall environmental dynamics. Specifically, each agent $i$ receives observation $o_{i,t}$ and chooses a high-level action $a^h_{i,t} \in \mathcal{A}^h_i$, where $\mathcal{A}^h_i$ denotes the set of possible high-level actions. While a high-level action may last for $\tau$ timesteps, low-level actions are executed until the current $a^h_{i,t}$ terminates. After $a^h_{i,t}$ ends, the next high-level action $a^h_{i,t+\tau}$ is selected based on observation $o_{i,t+\tau}$. The agent receives a low-level (intrinsic) reward for reaching a sub-goal, denoted by $r^l(s_t, a_t, s_{t+1} \mid a^h_{i,t})$, depending on its own high-level action $a^h_{i,t}$. The agent receives the high-level reward $r^h(s_t, a_t, s_{t+1})$ whenever it accomplishes the whole task, i.e., reaches the final success state $s_{T+1}$.
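The two-timescale interaction above can be illustrated with a toy rollout. Everything here is a hypothetical stand-in (the environment, policies, and reward shapes are not those of the paper): the high-level action $a^h$ is refreshed every $\tau$ steps, and low-level actions run underneath it.

```python
class ToyEnv:
    """Minimal stand-in environment: the overall task is to reach state 3."""
    def reset(self):
        self.s = 0
        return self.s

    def step(self, act, goal):
        self.s += act
        r_low = 1.0 if self.s == goal else 0.0   # intrinsic sub-goal reward r^l
        done = self.s >= 3
        r_high = 1.0 if done else 0.0            # sparse terminal reward r^h
        return self.s, r_low, r_high, done

def hierarchical_rollout(env, high_policy, low_policy, tau, horizon=10):
    obs = env.reset()
    ret, goal = 0.0, None
    for t in range(horizon):
        if t % tau == 0:                 # refresh high-level action every tau steps
            goal = high_policy(obs)
        act = low_policy(obs, goal)      # low-level action conditioned on the goal
        obs, r_low, r_high, done = env.step(act, goal)
        ret += r_high
        if done:
            break
    return ret

ret = hierarchical_rollout(ToyEnv(),
                           high_policy=lambda o: o + 1,  # sub-goal: next state
                           low_policy=lambda o, g: 1,    # always step forward
                           tau=2)
print(ret)  # 1.0 once state 3 is reached
```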

2.2. IMPLICIT DEEP LEARNING

Implicit deep learning is a framework that incorporates implicit rules (e.g., ordinary differential equations (Chen et al., 2018), fixed-point iterations (Bai et al., 2019), and optimization (Amos & Kolter, 2017)) into a feed-forward neural network. Specifically, differentiable optimization is a framework that embeds an optimization problem into a layer. A differentiable optimization layer takes problem-specific parameters $x$ as input and finds the optimal solution $z^*(x)$ such that $z^*(x) := \arg\min_{z \in g(x)} f(z, x)$, where $f(z, x)$ is the objective function constructed with the given $x$ and $g(x)$ is the feasible set. The output of the layer, $z^*(x)$, is then fed into the next layer to conduct various end tasks. Through this approach, one can infuse an optimization inductive bias into the layers. OptNet (Amos & Kolter, 2017), for example, proposes a differentiable optimization layer specifically for quadratic programming (QP). Backpropagation through this optimization layer requires the derivative of the QP solution with respect to the input parameters, which is derived by taking matrix differentials of the KKT conditions of the QP. Ferber et al. (2020) and Wilder et al. (2019) extended the idea of OptNet to general LP and mixed-integer LP (MILP). To compute the gradient of the optimal solution in combinatorial optimization easily, Vlastelica et al. (2019) suggest a way to construct a continuous interpolation of the loss function.
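The implicit-differentiation idea behind these layers can be shown on a one-dimensional toy problem (an unconstrained quadratic, far simpler than the constrained QP/LP cases OptNet and its extensions handle): differentiate the optimality condition of the inner problem rather than unrolling the solver.

```python
import numpy as np

# Toy differentiable-optimization layer: z*(x) = argmin_z f(z, x)
# with f(z, x) = 0.5 * a * z**2 - x * z,  a > 0, unconstrained.
# Stationarity: df/dz = a*z - x = 0  =>  z*(x) = x / a.
# Implicit function theorem on g(z, x) := a*z - x = 0:
#   dz*/dx = -(dg/dz)^{-1} * (dg/dx) = -(1/a) * (-1) = 1/a.

a = 3.0

def solve_layer(x):
    """Forward pass: return the inner problem's optimizer z*(x)."""
    return x / a

def grad_via_ift(x):
    """Backward pass: dz*/dx from the implicit function theorem."""
    dg_dz, dg_dx = a, -1.0
    return -dg_dx / dg_dz  # = 1/a, independent of x for this quadratic

x0, eps = 2.0, 1e-6
fd = (solve_layer(x0 + eps) - solve_layer(x0 - eps)) / (2 * eps)
print(grad_via_ift(x0), fd)  # both ~ 1/3: IFT matches finite differences
```

The same recipe scales up: for a QP, the stationarity condition becomes the full KKT system, and the matrix differentials of that system yield the Jacobian of the solution with respect to the problem parameters.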

3. RELATED WORK

Hierarchical MARL with pre-defined high-level action space. Some studies introduced hierarchical policies with pre-defined goal-level actions in MARL to decompose the main problem into subproblems under a semi-MDP framework (Sutton et al., 1999). Tang et al. (2018) applied a temporally abstracted high-level policy to induce the agents to cooperate when selecting the sub-goals. Ahilan & Dayan (2019) also introduced a centralized sub-goal selection policy to assign the agents to tasks optimally. Liu et al. (2021) introduced an exploration policy, which acts as a high-level policy, to limit the explorable action space of the low-level policy. Although these methodologies learn to divide the agents into goal-dependent sub-groups cooperatively, the low-level policies of these algorithms are trained individually, making it difficult to induce cooperation among the agents within the sub-tasks.

