CAMA: A NEW FRAMEWORK FOR SAFE MULTI-AGENT REINFORCEMENT LEARNING USING CONSTRAINT AUGMENTATION

Abstract

With the widespread application of multi-agent reinforcement learning (MARL) in real-life settings, the ability to meet safety constraints has become an urgent problem. For example, multiple drones controlled toward a common goal must avoid collisions. We address this problem by introducing the Constraint Augmented Multi-Agent framework (CAMA). CAMA serves as a plug-and-play module for popular MARL algorithms, including centralized training with decentralized execution and independent learning frameworks. In our approach, we represent the safety constraint as the sum of discounted safety costs bounded by a predefined value, which we call the safety budget. Experiments demonstrate that CAMA converges quickly to a high degree of constraint satisfaction and surpasses state-of-the-art safe counterpart algorithms in both cooperative and competitive settings.

1. INTRODUCTION

Multi-agent problems are ubiquitous in the real world, arising in robotics (Al-Abbasi et al., 2019; Mguni et al., 2021), transportation systems (Zhou et al., 2020; Chu et al., 2019), network optimization (Wang et al., 2020; Wai et al., 2018), and multi-player video games (Du et al., 2019; Samvelyan et al., 2019; Han et al., 2019; Peng et al., 2017). A modern approach to solving these decision-making problems is multi-agent reinforcement learning (MARL), which tackles them using only interactions with the environment. MARL encompasses several frameworks, such as fully centralized learning (Berner et al., 2019; Sukhbaatar & Fergus, 2016), independent learning (IL) (de Witt et al., 2020; Zhang et al., 2018), and a hybrid framework, centralized training with decentralized execution (CTDE) (Foerster et al., 2018; Lowe et al., 2017; Yang et al., 2018). However, in the deployment of MARL, safety remains a crucial problem that has not yet been fully solved. In recent years, several works have incorporated safety constraints into RL training, for example by optimizing the policy under constraints (Di Castro et al., 2012; Tessler et al., 2018; Achiam et al., 2017; Chow et al., 2018), adding safety layers (Dalal et al., 2018), or constructing verifiably safe exploration (Anderson et al., 2020). In the context of safe MARL, recent papers extend constrained policy optimization (Achiam et al., 2017) to the multi-agent domain (Gu et al., 2021) as model-free safe MARL algorithms, but these methods still exhibit low reward performance compared to unconstrained MARL algorithms. Other works perform constrained policy optimization by transforming it into a min-max game (Lu et al., 2021; Liu et al., 2021); however, being tied to a specific framework, they cannot generalize to other settings such as competitive games. Therefore, a general safe MARL framework with high reward performance is still lacking.
To fill this gap, in this paper we propose a general module that can be incorporated into different MARL algorithms. The proposed Constraint Augmented Multi-Agent framework, coined CAMA, is a plug-and-play method that enables cutting-edge unconstrained MARL algorithms to satisfy added constraints. Furthermore, CAMA addresses both cooperative and competitive settings under the CTDE and IL frameworks. In our algorithm, we represent the safety constraint as the sum of discounted safety costs bounded by a pre-defined scalar, which we call the safety budget. The main idea of this approach is the introduction of the hazard value, which tracks the accumulated costs and represents the remaining safety budget. When the hazard value falls below zero, CAMA assigns a large negative reward to the agents, incentivizing them to learn a safe policy. CAMA can be implemented as a direct modification of the environment, influencing the algorithm only indirectly through constraint augmentation, so no new algorithm-specific assumptions are required. In summary, our contributions are three-fold. Firstly, CAMA is a flexible framework with a plug-and-play feature, which can be combined with many existing MARL algorithms. Secondly, CAMA works in both CTDE and IL settings, covering cooperative and competitive multi-agent games. Lastly, we evaluate CAMA on a series of multi-agent control tasks in SMAMujoco (Gu et al., 2021) and Gym Compete (Bansal et al., 2017). Empirical results demonstrate the effectiveness of our solution in terms of both constraint satisfaction and reward maximisation compared to state-of-the-art counterparts.
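The hazard-value mechanism described above can be sketched as a simple environment wrapper. This is a minimal illustration under stated assumptions, not the authors' implementation: we assume a gym-style multi-agent environment whose `step()` reports per-agent costs via `info["costs"]`, a single shared safety budget, and a fixed penalty value; all of these names are hypothetical.

```python
class HazardValueWrapper:
    """Sketch of CAMA-style constraint augmentation: track the remaining
    safety budget (the hazard value) and penalize agents once it is spent."""

    def __init__(self, env, safety_budget, cost_discount=0.99, penalty=-10.0):
        self.env = env
        self.safety_budget = safety_budget  # pre-defined scalar bound
        self.cost_discount = cost_discount  # discount applied to step costs
        self.penalty = penalty              # reward once the budget is exhausted
        self.hazard_value = safety_budget   # remaining safety budget
        self.t = 0

    def reset(self):
        self.hazard_value = self.safety_budget
        self.t = 0
        return self.env.reset()

    def step(self, actions):
        obs, rewards, done, info = self.env.step(actions)
        # Deduct the discounted step cost from the remaining budget.
        step_cost = sum(info.get("costs", [0.0]))
        self.hazard_value -= (self.cost_discount ** self.t) * step_cost
        self.t += 1
        if self.hazard_value < 0:
            # Budget exhausted: replace every agent's reward with a large
            # negative penalty, incentivizing a safe policy.
            rewards = [self.penalty for _ in rewards]
        return obs, rewards, done, info
```

Because the modification lives entirely in the environment interface, any MARL algorithm that consumes `(obs, rewards, done, info)` tuples can train against it unchanged.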

2. RELATED WORK

In the following, we review related work on general MARL, MARL with constraints, and state augmentation methods. Existing MARL algorithms are often developed under the paradigms of CTDE and IL (OroojlooyJadid & Hajinezhad, 2021). CTDE is a commonly used learning framework, which updates decentralized policies using a centralized critic architecture (Du et al., 2019; Gu et al., 2021). In CTDE, the joint critic network conditions on all agents' states and actions, and thus naturally handles joint team-reward scenarios (Kuba et al., 2022; Yu et al., 2021). Conversely, if the setting is a competitive game or focuses only on individual rewards, IL is the alternative MARL paradigm. The earliest discussions of IL in multi-agent environments can be traced back to Tan (1993); IL subsequently evolved into algorithms using neural networks as function approximators (Foerster et al., 2018; Rashid et al., 2020). Some recent attempts, such as de Witt et al. (2020), extend the single-agent proximal policy optimization algorithm (Schulman et al., 2017) to the multi-agent IL setting. Although MARL has received significant attention in recent years, many safety-related challenges remain unresolved (Gu et al., 2022), such as handling multiple constraints and low algorithm efficiency. Several approaches exist for handling additional safety constraints. Recent attempts such as MACPO and MAPPO-L (Gu et al., 2021), the first model-free safe MARL algorithms, extend CPO (Achiam et al., 2017) and HATRPO (Kuba et al., 2022), respectively. However, neither MACPO nor MAPPO-L is guaranteed to apply to competitive games, which results in limited scalability. Another research direction is based on the parameter-sharing hypothesis. For example, CMIX (Liu et al., 2021) combines multi-objective programming with the QMIX framework to solve the constrained MARL problem.
However, in CMIX, a separate Q-function approximator is required for each constraint and each agent, which leads to scalability and efficiency challenges. Another parameter-sharing approach, Safe Dec-PG (Lu et al., 2021), aims to satisfy the constraint by passing parameters through a predefined communication network. In contrast, our framework requires no communication during policy execution. Another route to avoiding unsafe actions is to use shielding and barrier functions, as in ElSayed-Aly et al. (2021) and Cai et al. (2021). However, those approaches require pre-training or strong prior knowledge to create the shields for filtering actions. Moreover, they cannot generalize to new scenarios where the safety shield is not known. In contrast, CAMA is more flexible when dealing with new constrained environments: we test different types of tasks with varying numbers of agents without designing or pre-training a particular shielding function for each task. State augmentation methods extend the environment state with specific variables in order to enhance policy performance or satisfy certain constraints (Calvo-Fullana et al., 2021). Among recent works, Qiu et al. (2021) augmented the state with CVaR measures over the learned distributions of individual agents' Q-values, while Chen et al. (2020) and Foerster et al. (2017) augmented delay awareness and experience replay, respectively. The idea of augmenting the state with safety-related variables has been considered in the past, e.g., in classical control methods (Daryin & Kurzhanski, 2005) and in single-agent safe RL (Sootla et al., 2022a;b; Chow et al., 2017). We apply it in a multi-agent framework, where the presence of multiple constraints and multiple policies hinders a direct extension. Some works, such as (Chen et al., 2020;
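To make the state augmentation idea concrete, the following hedged sketch appends a safety-related variable (here, the normalized remaining safety budget) to each agent's observation so that policies can condition on it. The function name and observation layout are assumptions for illustration, not taken from any of the methods cited above.

```python
import numpy as np

def augment_observations(observations, remaining_budget, safety_budget):
    """Append the fraction of safety budget remaining to each agent's
    observation vector (a common form of safety-related state augmentation)."""
    frac = remaining_budget / safety_budget
    return [np.append(obs, frac) for obs in observations]
```

Each augmented observation is one element longer than the original; the shared scalar lets every agent's policy react as the budget is depleted without any inter-agent communication.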

