DPMAC: DIFFERENTIALLY PRIVATE COMMUNICATION FOR COOPERATIVE MULTI-AGENT REINFORCEMENT LEARNING

Abstract

Communication lays the foundation for cooperation in human society and in multi-agent reinforcement learning (MARL). Humans also desire to maintain their privacy when communicating with others, yet this privacy concern has not been considered in existing MARL work. We propose the differentially private multi-agent communication (DPMAC) algorithm, which protects the sensitive information of individual agents by equipping each agent with a local message sender that carries a rigorous (ϵ, δ)-differential privacy (DP) guarantee. In contrast to directly perturbing the messages with predefined DP noise, as is commonly done in privacy-preserving scenarios, we adopt a stochastic message sender for each agent and incorporate the DP requirement into the sender, which automatically adjusts the learned message distribution to alleviate the instability caused by DP noise. Further, we prove the existence of a Nash equilibrium in cooperative MARL with privacy-preserving communication, which suggests that this problem is game-theoretically learnable. Extensive experiments demonstrate a clear advantage of DPMAC over baseline methods in privacy-preserving scenarios.

1. INTRODUCTION

Multi-agent reinforcement learning (MARL) has shown remarkable achievements in many real-world applications such as sensor networks (Zhang & Lesser, 2011), autonomous driving (Shalev-Shwartz et al., 2016b), and traffic control (Wei et al., 2019). To mitigate non-stationarity when training the multi-agent system, the centralized training and decentralized execution (CTDE) paradigm has been proposed. However, the CTDE paradigm struggles to enable complex cooperation and coordination among agents during execution, due to the inherent partial observability in multi-agent scenarios (Wang et al., 2020b). To make agents cooperate more efficiently in complex partially observable environments, communication between agents has been considered. Numerous works have proposed differentiable communication methods between agents, which can be trained in an end-to-end manner, for more efficient cooperation among agents (Foerster et al., 2016; Jiang & Lu, 2018; Das et al., 2019; Ding et al., 2020; Kim et al., 2021; Wang et al., 2020b). The communication can be either broadcast (Das et al., 2019; Jiang & Lu, 2018; Wang et al., 2020b), where the connection between agents can be modeled as a complete graph, or one-to-one as a general graph (Ding et al., 2020).

However, the advantages of communication, resulting from full information sharing, come with possible privacy leakage of individual agents for both broadcast and one-to-one messages. Therefore, in practice, an agent may be unwilling to fully share its private information with other agents, even in cooperative scenarios. For instance, if we train and deploy a MARL-based autonomous driving system, each autonomous vehicle involved in this system can be regarded as an agent, and all vehicles work together to improve the safety and efficiency of the system. Hence, this can be regarded as a cooperative MARL scenario (Shalev-Shwartz et al., 2016a; Yang et al., 2020).
However, owners of autonomous vehicles may not allow their vehicles to send private information to other vehicles without any desensitization, since this may divulge private information such as their personal life routines (Hassan et al., 2020). Hence, a natural question arises: Can a MARL algorithm with communication under the CTDE framework be endowed with both a rigorous privacy guarantee and empirical efficiency?

To answer this question, we start with a simple motivating example called single round binary sums, where several players attempt to guess the bits possessed by others and can share their own information by communication. In Section 4, we show that a local message sender using the randomized response mechanism allows an analytical receiver to correctly calculate the binary sum in a privacy-preserving way. From the example we gain two insights: 1) The information should not be aggregated as in previous communication methods in MARL (Das et al., 2019; Ding et al., 2020), since a trusted data curator is not available in general. Instead, privacy should be achieved locally for every agent; 2) Once the agents know a priori that a certain privacy constraint exists, they can adjust their inference on the noised message. These two insights give the principles of our privacy-preserving communication structure: we desire a privacy-preserving local sender and a privacy-aware analytical receiver.

Our algorithm, differentially private multi-agent communication (DPMAC), instantiates the described principles. More specifically, for the sender part, each agent is equipped with a local sender which ensures differential privacy (DP) (Dwork, 2006) by adding Gaussian noise. The message sender in DPMAC is local in the sense that each agent is equipped with its own message sender, which is only used to send its own messages.
Equipped with this local sender, DPMAC is able not only to protect the privacy of communications between agents but also to satisfy the different privacy levels required by different agents. In addition, the sender adopts a Gaussian distribution to represent the message space and samples the stochastic message from the learned distribution. However, it is known that the DP noise may impede the original learning process (Dwork et al., 2014; Alvim et al., 2011), resulting in unstable or even divergent algorithms, especially for deep-learning-based methods (Abadi et al., 2016; Chen et al., 2020). To cope with this issue, we incorporate the noise variance into the representation of the message distribution, so that the agents can learn to adjust the message distribution automatically according to varying noise scales. For the receiver part, because of the gradient chain between the sender and the receiver, our receiver naturally utilizes the privacy-relevant information hidden in the gradients. This implements the privacy-aware analytical receiver described in the motivating example.

When privacy protection in communication is required in a cooperative game, the game is no longer purely cooperative, since each player involved faces a trade-off between the team utility and its personal privacy. To analyze the convergence of cooperative games with privacy-preserving communication, we first define a single-step game, namely the collaborative game with privacy (CGP). We prove that under some mild assumptions on the players' value functions, CGP can be transformed into a potential game (Monderer & Shapley, 1996), which leads to the existence of a Nash equilibrium (NE). With this property, an NE can also be proved to exist in the single round binary sums game. Furthermore, we extend single round binary sums into a multi-step game called multiple round sums using the notion of Markov potential game (MPG) (Leonardos et al., 2021). Inspired by Macua et al. (2018) and modeling the privacy-preserving communication as part of the agent action, we prove the existence of an NE, which indicates that the multi-step game with privacy-preserving communication is learnable.

To validate the effectiveness of DPMAC, extensive experiments are conducted in the multi-agent particle environment (MPE) (Lowe et al., 2017), including cooperative navigation, cooperative communication and navigation, and predator-prey. Specifically, in privacy-preserving scenarios, DPMAC significantly outperforms the baselines. Moreover, even without any privacy constraints, DPMAC achieves competitive performance against the baselines.

To sum up, the contributions of this work are threefold:
• To the best of our knowledge, we make the first attempt to develop a framework for private communication in MARL, named DPMAC, with the theoretical guarantee of (ϵ, δ)-DP.
• We prove the existence of the Nash equilibrium for cooperative games with privacy-preserving communication, which shows that these games are learnable.
• Experiments on the MPE show that DPMAC clearly outperforms other algorithms in privacy-preserving scenarios and achieves competitive performance in non-private scenarios.

3. PRELIMINARIES

We consider a fully cooperative MARL problem where $N$ agents work collaboratively to maximize the joint rewards. The underlying environment can be captured by a decentralized partially observable Markov decision process (Dec-POMDP), denoted by the tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{O}, P, R, \gamma \rangle$. Specifically, $\mathcal{S}$ is the global state space, $\mathcal{A} = \prod_{i=1}^{N} \mathcal{A}_i$ is the joint action space, $\mathcal{O} = \prod_{i=1}^{N} \mathcal{O}_i$ is the joint observation space, $P(s' \mid s, a) : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$ determines the state transition dynamics, $R(s, a) : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function, and $\gamma \in [0, 1)$ is the discount factor. Given a joint policy $\pi = \{\pi_i\}_{i=1}^{N}$, the joint action-value function at time $t$ is $Q^{\pi}(s_t, a_t) = \mathbb{E}[G_t \mid s_t, a_t, \pi]$, where $G_t = \sum_{i=0}^{\infty} \gamma^{i} R_{t+i}$ is the cumulative reward and $a_t = \{a_i^t\}_{i=1}^{N}$ is the joint action. The ultimate goal of the agents is to find an optimal policy $\pi^*$ which maximizes $Q^{\pi}(s_t, a_t)$.

Under the aforementioned cooperative setting, we study the case where agents are allowed to communicate with a joint message space $\mathcal{M} = \prod_{i=1}^{N} \mathcal{M}_i$. When the communication is unrestricted, the problem is reduced to a single-agent RL problem, which effectively solves the challenge posed by partially observable states but puts the individual agent's privacy at risk. To overcome the challenges of privacy and partially observable states simultaneously, we investigate algorithms that maximize the cumulative rewards while satisfying differential privacy (DP), given in Definition 3.1. DP offers a mathematically rigorous way to quantify the privacy of an algorithm (Dwork, 2006). An algorithm is said to be "privatized" under the DP model if it is statistically hard to infer the presence of an individual data point in the dataset by observing the output of the algorithm.
More intuitively, an algorithm satisfies DP if it provides nearly the same outputs given neighbouring input datasets (i.e., $\Pr[f(D) \in S] \approx \Pr[f(D') \in S]$), which protects the sensitive information from a curious attacker. With DP, each agent $i$ is assigned a privacy budget $\epsilon_i$, which is negatively correlated with the level of privacy protection (a smaller $\epsilon_i$ means stronger protection). We write $\epsilon = \{\epsilon_i\}_{i=1}^{N}$ for the set of all privacy budgets. In addition to maximizing the joint rewards, as usually required in cooperative MARL, the messages sent from agent $i$ are also required to satisfy the privacy budget $\epsilon_i$ with probability at least $1 - \delta$.
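As a small aside on the Dec-POMDP notation above, the cumulative reward $G_t = \sum_{i \ge 0} \gamma^i R_{t+i}$ can be evaluated for a finite reward sequence with a backward recursion. The sketch below is only illustrative: the function name `discounted_return` and the truncation to a finite list of rewards are assumptions, not part of the paper.

```python
def discounted_return(rewards, gamma):
    """Return sum_i gamma**i * rewards[i] for a finite reward list.

    Illustrative helper (hypothetical name); the paper's G_t is an infinite
    sum, which is truncated here to the rewards that are actually given.
    """
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    g = 0.0
    # Walk backwards so that each step applies G = R + gamma * G_next.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


example_return = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```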

4. MOTIVATING EXAMPLE

Before introducing our communication framework, we first investigate a motivating example, a cooperative game that inspires the design principles of private communication mechanisms in MARL. The motivating example is a simple yet interesting game, called single round binary sums. The game is extended from the example provided in Cheu (2021) for analyzing the shuffle model, while we illustrate the game from the perspective of multi-agent systems. We note that although this game is one-step, unlike a sequential decision process such as an MDP, it is illustrative enough to show how the communication protocol works as a tool to achieve a better trade-off between privacy and utility.

Assume that there are $N$ agents involved in this game. Each agent $i \in [N]$ has a bit $b_i \in \{0, 1\}$ and can tell other agents the information about its bit by communication. The objective of the game is for every agent to guess $\sum_j b_j$, the sum of the bits of all agents. Namely, each agent $i$ makes a guess $g_i$ and aims to maximize its utility $r_i = -\left| \sum_j b_j - \mathbb{E}[g_i] \right|$. The (global) reward of this game is the sum of the utilities over all agents, i.e., $\sum_i r_i$. Without loss of generality, we write the guess as $g_i = \sum_{j \neq i} y_{ij} + b_i$, where $y_{ij}$ is agent $i$'s guess of the bit of agent $j$. If all agents share their bits without covering up, the guessed bit $y_{ij}$ will obviously be equal to $b_j$ and all agents attain an optimal return. Hence this game is fully cooperative under no privacy constraints.

However, the optimal strategy relies on the assumption that everyone is altruistic enough to share their own bits. To preserve privacy in communication, the message (i.e., the sent bit) can be randomized using randomized response, which perturbs the bit $b_i$ with probability $p$:

$$x_i = R_{RR}(b_i) := \begin{cases} \mathrm{Ber}(1/2) & \text{with probability } p, \\ b_i & \text{otherwise}, \end{cases}$$

where $x_i$ is the random message and $\mathrm{Ber}$ indicates the Bernoulli distribution.
In our context, $R_{RR}$ is a privacy-preserving message sender, whose privacy guarantee is shown in Proposition 4.1. When each agent is equipped with such a privacy-preserving sender $R_{RR}$ while adhering to the originally optimal strategy (i.e., believing what others tell and making the guess accordingly), all agents make an inaccurate guess. The bias of the guess caused by $R_{RR}$, denoted $\mathrm{err}_i$, is then

$$\mathrm{err}_i = \mathbb{E}[g_i] - \sum_j b_j = \sum_{j \neq i} \mathbb{E}[x_j - b_j] = p \sum_{j \neq i} \left( \tfrac{1}{2} - b_j \right) = \frac{p(N-1)}{2} - p \sum_{j \neq i} b_j .$$

Without any prior knowledge, the bias cannot be reduced for (ϵ, 0)-DP algorithms. However, if the perturbation probability $p$ is common prior knowledge for all agents before the game starts, the story is different. One can transform the biased guess into $g_i^A = A_{RR}(\vec{x}_{-i}) + b_i$ with

$$A_{RR}(\vec{x}_{-i}) := \frac{1}{1-p} \left( \sum_{j \neq i} x_j - \frac{(N-1)p}{2} \right),$$

where $\vec{x}_{-i} = [x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_N]^{\top}$ denotes the messages received by agent $i$. Since $\mathbb{E}[x_j] = p/2 + (1-p) b_j$, the estimate is unbiased:

$$\mathbb{E}\left[g_i^A\right] = \frac{1}{1-p} \left( \mathbb{E}\Big[ \sum_{j \neq i} x_j \Big] - \frac{p(N-1)}{2} \right) + b_i = \sum_j b_j .$$

This example suggests that a communication algorithm can be both privacy-preserving and efficient. From the perspective of privacy, by the post-processing lemma of DP, any post-processing does not affect the original privacy level. From the perspective of utility, we can eliminate the bias $\mathrm{err}_i$ if the agent is equipped with the receiver $A_{RR}$ and the prior knowledge $p$ is given.

In general, our motivating example gives two principles for designing privacy-preserving communication frameworks. First, to prevent sensitive information from being inferred by other curious agents, we equip each agent with a local message sender with certain privacy constraints. Second, given a priori knowledge about the privacy requirements of other agents, the receiver can strategically analyze the received noisy messages to statistically reduce the error due to the noisy communication.
These two design principles correspond to two parts of our DPMAC framework respectively, i.e., a privacy-preserving local sender and a privacy-aware receiver.
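The single round binary sums example can be simulated in a few lines. The following is a minimal sketch, assuming a fixed bit vector, agent 0 as the receiver, and a Monte Carlo average in place of the expectation; all function names (`rr_sender`, `naive_guess`, `debiased_guess`, `simulate`) are hypothetical and chosen only for illustration. It contrasts a receiver that trusts the randomized messages with one that uses the known perturbation probability $p$ to remove the bias.

```python
import random


def rr_sender(bit, p, rng):
    """Randomized response: with probability p send a fair coin, else the true bit."""
    if rng.random() < p:
        return rng.randint(0, 1)
    return bit


def naive_guess(own_bit, received):
    """Receiver that treats the received messages as if they were the true bits."""
    return own_bit + sum(received)


def debiased_guess(own_bit, received, p):
    """Privacy-aware receiver: subtract the known offset and rescale by 1 / (1 - p)."""
    n_others = len(received)
    return own_bit + (sum(received) - n_others * p / 2) / (1 - p)


def simulate(bits, p, trials=20000, seed=0):
    """Average both guesses of agent 0 over many independent rounds."""
    rng = random.Random(seed)
    own, others = bits[0], bits[1:]
    naive_total = 0.0
    debiased_total = 0.0
    for _ in range(trials):
        msgs = [rr_sender(b, p, rng) for b in others]
        naive_total += naive_guess(own, msgs)
        debiased_total += debiased_guess(own, msgs, p)
    return naive_total / trials, debiased_total / trials


bits = [1, 1, 1, 0, 1, 1, 0, 1]  # hypothetical private bits, true sum is 6
naive_mean, debiased_mean = simulate(bits, p=0.5)
```

Under these assumptions the averaged naive guess should sit near the true sum plus the bias $p(N-1)/2 - p\sum_{j \neq i} b_j$, while the averaged debiased guess should sit near the true sum itself. The debiased guess has larger variance because of the $1/(1-p)$ rescaling, which is the price paid for unbiasedness.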

5. METHODOLOGY

Based on our design principles, we now introduce our DPMAC framework, as shown in Figure 1. Our framework is general and flexible, which makes it compatible with any CTDE method.

5.1. PRIVACY-PRESERVING LOCAL SENDER WITH STOCHASTIC GAUSSIAN MESSAGES

In this section, we present the sender's perspective on the privacy guarantee. At time $t$, for agent $i$, a message function $f_i^s$ is used to generate a message for communication. $f_i^s$ takes a subset of transitions in the local trajectory $\tau_i^t$ as input, where the subset is sampled uniformly without replacement from $\tau_i^t$ (denote the sampling rate as $\gamma_1$). This message is perturbed by the Gaussian mechanism with variance $\sigma_i^2$ (Dwork, 2006). Agent $i$ then samples a subset of other agents to share this message with (denote the sampling rate as $\gamma_2$). The following theorem guarantees differential privacy.

Theorem 5.1 (Privacy guarantee for DPMAC). Let $\gamma_1, \gamma_2 \in (0, 1)$, and let $C$ be the $\ell_2$ norm of the message functions. For any $\delta > 0$ and privacy budget $\epsilon_i$, the communication of agent $i$ satisfies $(\epsilon_i, \delta)$-DP when

$$\sigma_i^2 = \frac{14 \gamma_2 \gamma_1^2 N C^2 \alpha}{\beta \epsilon_i},$$

if we have

$$\alpha = \frac{\log \delta^{-1}}{\epsilon_i (1 - \beta)} + 1 \le \frac{2 \sigma'^2 \log\left( 1 / \left( \gamma_1 \alpha (1 + \sigma'^2) \right) \right)}{3} + 1$$

with $\beta \in (0, 1)$ and $\sigma'^2 = \sigma_i^2 / (4 C^2) \ge 0.7$.

With Theorem 5.1, one can directly translate a non-private MARL algorithm with communication into a private one. However, as we shall see in our experiment section, directly injecting the privacy noise into existing MARL algorithms with communication may lead to serious performance degradation. In fact, the injected noise might jeopardize the useful information incorporated in the messages, or even lead to meaningless messages. To alleviate the negative impact of the injected privacy noise on the cooperation between agents, we adopt a stochastic message sender in the sense that the messages sent by our sender are sampled from a learned message distribution. This makes DPMAC different from existing works in MARL that communicate through deterministic messages (Sukhbaatar et al., …).

In the following, we drop the dependency of parameters on $t$ when it is clear from the context.
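Returning to Theorem 5.1, the noise scale it prescribes can be computed directly from the privacy parameters. The sketch below assumes the theorem's formulas read $\sigma_i^2 = 14\gamma_2\gamma_1^2 N C^2 \alpha / (\beta\epsilon_i)$ and $\alpha = \log\delta^{-1} / (\epsilon_i(1-\beta)) + 1$, with the side condition $\alpha \le 2\sigma'^2 \log(1/(\gamma_1\alpha(1+\sigma'^2)))/3 + 1$ and $\sigma'^2 = \sigma_i^2/(4C^2) \ge 0.7$. The function name `dpmac_noise_variance`, the default $\beta = 0.5$, and the example parameter values are all hypothetical.

```python
import math


def dpmac_noise_variance(eps, delta, gamma1, gamma2, n_agents, c_norm, beta=0.5):
    """Return (sigma_sq, alpha, condition_ok) under the assumed reading of Theorem 5.1.

    condition_ok reports whether the theorem's side conditions hold for the
    computed variance; the variance is only meaningful when it is True.
    """
    if not (eps > 0 and 0 < delta < 1 and 0 < beta < 1):
        raise ValueError("need eps > 0, delta in (0, 1), beta in (0, 1)")
    if not (0 < gamma1 < 1 and 0 < gamma2 < 1):
        raise ValueError("sampling rates must lie in (0, 1)")

    # Order parameter alpha from the theorem statement.
    alpha = math.log(1.0 / delta) / (eps * (1.0 - beta)) + 1.0
    # Prescribed Gaussian noise variance for agent i.
    sigma_sq = 14.0 * gamma2 * gamma1 ** 2 * n_agents * c_norm ** 2 * alpha / (beta * eps)

    # Side conditions: sigma'^2 >= 0.7 and the upper bound on alpha.
    sigma_prime_sq = sigma_sq / (4.0 * c_norm ** 2)
    log_arg = 1.0 / (gamma1 * alpha * (1.0 + sigma_prime_sq))
    bound = 2.0 * sigma_prime_sq * math.log(log_arg) / 3.0 + 1.0
    condition_ok = sigma_prime_sq >= 0.7 and alpha <= bound
    return sigma_sq, alpha, condition_ok


# Hypothetical parameter values, chosen only to exercise the function.
sigma_sq, alpha, ok = dpmac_noise_variance(
    eps=1.0, delta=1e-4, gamma1=0.01, gamma2=0.5, n_agents=3, c_norm=1.0
)
```

The returned flag should be checked before relying on the variance: for some parameter choices the side conditions fail, in which case the formula alone does not certify the stated guarantee.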
Without loss of generality, let the message distribution be multivariate Gaussian and let $p_i$ be the message sampled from the message distribution $\mathcal{N}(\mu_i, \Sigma_i)$, where $\mu_i = f_i^{\mu}(o_i, a_i; \theta_i^{\mu})$ and $\Sigma_i = f_i^{\sigma}(o_i, a_i; \theta_i^{\sigma})$, and $\theta_i^{\mu}$, $\theta_i^{\sigma}$ are the parameters of the sender's neural networks. Then $\theta_i^{\mu}$ and $\theta_i^{\sigma}$ will be optimized so that all agents send more effective messages, encouraging better team cooperation and higher team rewards. For notational convenience, let $\theta_i^{s} = [\theta_i^{\mu\top}, \theta_i^{\sigma\top}]^{\top}$. Then the sent privatized message is $m_i = p_i + u_i$, where $u_i \sim \mathcal{N}(0, \sigma_i^2 I_d)$ is an additional noise. It is clear that $m_i \sim \mathcal{N}(\mu_i, \Sigma_i + \sigma_i^2 I_d)$ since $p_i$ is independent of $u_i$. Counterfactually, let $m_i' \sim \mathcal{N}(\mu_i', \Sigma_i')$, where $\mu_i' = f_i^{\mu}(o_i, a_i; \theta_i^{\mu'})$ and $\Sigma_i' = f_i^{\sigma}(o_i, a_i; \theta_i^{\sigma'})$, be the message sent when there is no privacy constraint. Let the optimal message distribution be $\mathcal{N}(\mu_i^*, \Sigma_i^*)$. We are interested in characterizing $\theta_i^{s'}$ and $\theta_i^{s}$. By the optimality of $\mu_i^*, \Sigma_i^*$,

$$\theta_i^{s'} = \arg\min_{\theta} D_{KL}\left( \mathcal{N}(\mu_i', \Sigma_i') \,\|\, \mathcal{N}(\mu_i^*, \Sigma_i^*) \right) = \arg\min_{\theta} \log \frac{|\Sigma_i^*|}{|\Sigma_i'|} + \mathrm{tr}\{ \Sigma_i^{*-1} \Sigma_i' \} + \| \mu_i' - \mu_i^* \|^2_{\Sigma_i^{*-1}} . \tag{1}$$

Then under the privacy constraints, the stochastic sender will learn $\theta_i^{s}$ such that

$$\theta_i^{s} = \arg\min_{\theta} D_{KL}\left( \mathcal{N}(\mu_i, \Sigma_i + \sigma_i^2 I_d) \,\|\, \mathcal{N}(\mu_i^*, \Sigma_i^*) \right) = \arg\min_{\theta} \log \frac{|\Sigma_i^*|}{|\Sigma_i + \sigma_i^2 I_d|} + \mathrm{tr}\{ \Sigma_i^{*-1} (\Sigma_i + \sigma_i^2 I_d) \} + \| \mu_i - \mu_i^* \|^2_{\Sigma_i^{*-1}} . \tag{2}$$

Through Equation (2), it is possible to directly incorporate the distribution of the privacy noise into the optimization process of the sender, which helps to learn $\theta_i^{s}$ such that

$$D_{KL}\left( \mathcal{N}(\mu_i, \Sigma_i + \sigma_i^2 I_d) \,\|\, \mathcal{N}(\mu_i^*, \Sigma_i^*) \right) \le D_{KL}\left( \mathcal{N}(\mu_i', \Sigma_i') \,\|\, \mathcal{N}(\mu_i^*, \Sigma_i^*) \right),$$

which means that the sender can learn to send a private message $m_i = p_i + u_i$ that is at least as effective as the non-private message $m_i'$. In this manner, the performance degradation is expected to be well alleviated.
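The effect of folding the noise variance into the message distribution can be illustrated numerically. The sketch below makes several simplifying assumptions that are not in the paper: covariances are diagonal, the target distribution $\mathcal{N}(\mu_i^*, \Sigma_i^*)$ is known, and the DP noise variance is smaller than every target variance. The helper names `kl_diag_gaussians` and `privatized_message` are hypothetical. It compares a sender that ignores the added noise with one that shrinks its own variance so that $\Sigma_i + \sigma_i^2 I_d$ matches the target.

```python
import math
import random


def kl_diag_gaussians(mu0, var0, mu1, var1):
    """KL( N(mu0, diag(var0)) || N(mu1, diag(var1)) ) for diagonal covariances."""
    kl = 0.0
    for m0, v0, m1, v1 in zip(mu0, var0, mu1, var1):
        kl += 0.5 * (math.log(v1 / v0) + v0 / v1 + (m0 - m1) ** 2 / v1 - 1.0)
    return kl


def privatized_message(mu, var, sigma_sq, rng):
    """Sample p ~ N(mu, diag(var)) and add DP noise u ~ N(0, sigma_sq * I)."""
    return [
        rng.gauss(m, math.sqrt(v)) + rng.gauss(0.0, math.sqrt(sigma_sq))
        for m, v in zip(mu, var)
    ]


# Hypothetical target message distribution and DP noise level.
mu_star, var_star = [0.0, 1.0], [1.0, 2.0]
sigma_sq = 0.5

# Noise-unaware sender: learns the target exactly, then noise inflates its variance.
unaware_total_var = [v + sigma_sq for v in var_star]
kl_unaware = kl_diag_gaussians(mu_star, unaware_total_var, mu_star, var_star)

# Noise-aware sender: shrinks its own variance so that variance + noise hits the target.
aware_var = [v - sigma_sq for v in var_star]
aware_total_var = [v + sigma_sq for v in aware_var]
kl_aware = kl_diag_gaussians(mu_star, aware_total_var, mu_star, var_star)

sample = privatized_message(mu_star, aware_var, sigma_sq, random.Random(0))
```

In this toy setting the noise-aware sender reaches zero divergence from the target while the noise-unaware one does not. This compensation is only available when $\sigma_i^2$ does not exceed the target variance in each coordinate; otherwise some residual divergence remains.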

5.2. PRIVACY-AWARE MESSAGE RECEIVER

As shown in our motivating example, a message receiver with a priori knowledge can statistically reduce the communication error in privacy-preserving scenarios. In the practical design, this motivation is naturally instantiated by the gradient flow between the message sender and the message receiver. Specifically, agent $i$ first concatenates all the received privatized messages as $m_{(-i)i} := \{m_{ji}\}_{j=1, j \neq i}^{N}$ and then encodes $m_{(-i)i}$ into an aggregated message $q_i = f_i^r(m_{(-i)i} \mid \theta_i^r)$ with the decoding function $f_i^r$ parameterized by $\theta_i^r$. Then an argument similar to the policy gradient theorem (Sutton et al., 1999) states that the gradient of the receiver is

$$\nabla_{\theta_i^r} J(\theta_i^r) = \mathbb{E}_{\tau, o, a} \, \mathbb{E}_{\pi_i} \left[ \nabla_{\theta_i^r} f_i^r\left( q_i \mid m_{(-i)i} \right) \nabla_{q_i} \log \pi_i(a_i \mid o_i, q_i) \, Q^{\pi}(a, o) \right],$$

where $J(\theta_i^r) = \mathbb{E}[G_1 \mid \pi]$ is the cumulative discounted reward from the starting state. In this way, the receiver can utilize the prior knowledge $\sigma_i$ of the privacy-preserving sender encoded in the gradient during the optimization process. Please refer to Appendix D for the detailed optimization process of the message senders and receivers.

6. PRIVACY-PRESERVING EQUILIBRIUM ANALYSIS

Many cooperative multi-agent games enjoy the existence of a unique NE, which ensures the convergence of iterative algorithms. Under the privacy constraints, however, the existence of a unique Nash equilibrium can no longer be guaranteed even if the original game admits a unique equilibrium. As the convergence of MARL algorithms could depend on the existence of an equilibrium, we investigate such existence in single-step games and extend the result to multi-step games.

6.1. SINGLE-STEP GAMES

We study a class of two-player collaborative games, denoted as collaborative game with privacy (CGP). The game involves two agents, each equipped with a privacy parameter $p_n$, $n \in \{1, 2\}$. The value of $p_n$ represents the importance of privacy to agent $n$, with a larger value indicating greater importance. Let $M$ be some message mechanism. We denote the privacy loss by $c^M(p_n)$, which measures the quantity of the potential privacy leakage and is formally defined in Definition B.2. Besides, let $b\left(V_n, V_n^M(p_1, p_2)\right)$ be the utility gained by measuring the gap between the private value function $V_n^M(p_1, p_2)$ and the non-private value function $V_n$. Then the trade-off between utility and privacy is depicted by the total utility function $u_n(p_1, p_2)$ in Equation (3). The formal definition of CGP is given in Definition 6.1. See more details in Appendix B.1.

Definition 6.1 (Collaborative game with privacy (CGP)). The collaborative game with privacy is denoted by a tuple $\langle \mathcal{N}, \Sigma, \mathcal{U} \rangle$, where $\mathcal{N} = \{1, 2\}$ is the set of players, $\Sigma = \{p_1, p_2\}$ is the action set with $p_1, p_2 \in [0, 1]$ representing the privacy level, and $\mathcal{U} = \{u_1, u_2\}$ is the set of utility functions satisfying, $\forall n \in \mathcal{N}$,

$$u_n(p_1, p_2) = B_n \cdot b\left(V_n, V_n^M(p_1, p_2)\right) - C_n^M \cdot c^M(p_n) . \tag{3}$$

The following theorem shows that if changes in the value function of each player can be expressed as a change in their own privacy parameter, then CGP is a potential game and a pure NE therefore exists. The proof is deferred to Appendix B.1.

Theorem 6.1 (CGP's NE guarantee). The collaborative game with privacy has at least one nontrivial pure-strategy Nash equilibrium if $\partial^i_{p_1} V_1 = \partial^i_{p_2} V_2$, $\forall i \in \{1, 2\}$.

Equilibrium in single round binary sums. Let us revisit our motivating example. Armed with the CGP framework, it is immediate that the single round binary sums game guarantees the existence of an NE. This result is formalized in Theorem B.2 in Appendix B.1.
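To make the link between potential games and pure Nash equilibria concrete, the following toy sketch brute-forces the equilibria of a two-player game with the same shape as the CGP utility, $u_n = B \cdot b(\cdot) - C \cdot c(p_n)$. Everything specific here is an assumption made for illustration: the functional forms `team_benefit` and `privacy_loss`, the weights `B` and `C`, and the discretized action grid are hypothetical and are not the definitions from Appendix B.

```python
from itertools import product

B, C = 1.0, 0.4  # hypothetical weights on team utility and privacy loss


def team_benefit(p1, p2):
    """Hypothetical b(.): the shared team value drops as either player adds noise."""
    return -(p1 ** 2 + p2 ** 2) / 2.0


def privacy_loss(p):
    """Hypothetical c(.): leakage shrinks as the privacy level p grows."""
    return 1.0 - p


def utility(n, p1, p2):
    """Utility of player n (1 or 2): shared benefit minus own privacy loss."""
    own = p1 if n == 1 else p2
    return B * team_benefit(p1, p2) - C * privacy_loss(own)


def potential(p1, p2):
    """Candidate potential: shared benefit minus both players' privacy losses."""
    return B * team_benefit(p1, p2) - C * privacy_loss(p1) - C * privacy_loss(p2)


def is_pure_ne(p1, p2, grid, tol=1e-12):
    """True if neither player gains by unilaterally moving to another grid point."""
    best1 = max(utility(1, q, p2) for q in grid)
    best2 = max(utility(2, p1, q) for q in grid)
    return utility(1, p1, p2) >= best1 - tol and utility(2, p1, p2) >= best2 - tol


grid = [k / 10 for k in range(11)]  # discretized privacy levels in [0, 1]
equilibria = [(p1, p2) for p1, p2 in product(grid, grid) if is_pure_ne(p1, p2, grid)]
potential_argmax = max(product(grid, grid), key=lambda p: potential(*p))
```

In this toy game a unilateral deviation changes a player's utility by exactly the change in `potential`, so any maximizer of the potential over the grid is a pure equilibrium. This mirrors the argument used for potential games in general, though the paper's own proof works with the value functions rather than with closed-form utilities.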

6.2. MULTI-STEP GAMES

We now consider an extended version of single round binary sums named multiple round sums. Consider an $N$-player game where player $i$ owns a saving $x_{i,t}$. Rather than sending a binary bit, the agent can choose to give out $b_{i,t}$ at round $t$. Meanwhile, each player $i$ selects a privacy level $p_{i,t}$ and sends messages to the others with a sender $f_i^s$ encoding the information of $b_{i,t}$ with the privacy level $p_{i,t}$. The reward of the agent is designed to find a good trade-off between privacy and utility. The setting of the game is thus similar to the empirical implementation of DPMAC. We first transform this game into a Markov potential game (MPG), with the reward of each agent transformed into a combination of the team reward and the individual reward. Then, with existing theoretical results from Macua et al. (2018), we present the following result while deferring its proof to Appendix B.2.

Theorem 6.2 (NE guarantee in multiple round sums). If Assumptions 1, 2, 3, 4 (see Appendix B.2) are satisfied, our MPG has an NE with potential function $J$ defined as

$$J(x_t, \pi(x_t)) = \sum_{j \in [N]} \left( (1 - p_{j,t}) b_{j,t} + \alpha x_{j,t} + \beta p_{i,t} \right) . \tag{4}$$

7. EXPERIMENTS

In this section, we present the experiment results and the corresponding analyses. Please see Appendix G for more detailed analyses of the experiment results.

Experiment results without privacy. DPMAC is first compared with TarMAC, I2C, and MADDPG on three MPE tasks without the privacy requirement, as shown in Figure 2a. DPMAC outperforms the baselines on CCN and PP and has competitive performance on CN. Note that for the PP task we pick DPMAC with ϵ = 0.10, since it performs even better than its non-private variant. The comparison between DPMAC (non-private) and the baselines is provided in Appendix F.

Baselines

Experiment results with privacy. We further add the privacy constraint to the communication algorithms, setting $\delta = 10^{-4}$ on all tasks. Figure 2b and Figure 2c show the performance under the privacy budgets ϵ = 0.10 and ϵ = 1.0, respectively. We include MADDPG as a non-communication baseline method. We observe that DPMAC with the privacy requirement still maintains good results compared to MADDPG, while the performance of TarMAC and I2C drops greatly. Figure 2d further compares the performance of DPMAC under different privacy budgets. When ϵ = 0.01, DPMAC still achieves remarkable performance, while the baselines' performance degrades greatly, as shown in Figure 2b.

Variance adjustment of DPMAC. Experiments with privacy also support our claim that DPMAC can automatically adjust the variance of its stochastic message sender so that it learns a noise-robust representation. As shown in Figure 2d, DPMAC achieves very close performance when ϵ = 0.1 and ϵ = 1.0, even though these privacy requirements differ by one order of magnitude. In contrast, Figure 2b and Figure 2c show large gaps for the same baseline algorithms under different ϵ. Please see Figure 4 and Figure 5 for direct presentations of these gaps.

8. CONCLUSION

In this paper, we study privacy-preserving communication in MARL. Motivated by a simple yet effective example, the binary sums game, we propose DPMAC, a new efficient communicating MARL algorithm that preserves agents' privacy through differential privacy. Our algorithm is justified both theoretically and empirically. Besides, to show that the privacy-preserving communication problem is learnable, we analyze the single-step game and the multi-step game via the notion of Markov potential games (MPG) and show the existence of the Nash equilibrium. This existence further implies the learnability of several instances of MPG under privacy constraints. Extensive experiments are conducted on MPE and show the effectiveness of DPMAC compared to baseline methods on multiple tasks, both with and without privacy constraints.

Though we take the first step towards an efficient MARL algorithm with differentially private communication, some interesting questions remain open. First, it is still unclear whether a Nash equilibrium exists in private competitive games. Second, on the empirical side, investigating the performance of DPMAC in competitive games with privacy-preserving communication would also be interesting and valuable.



Definition 3.1 ((ϵ, δ)-DP, Dwork (2006)). A randomized mechanism $f : \mathcal{D} \to \mathcal{R}$ satisfies (ϵ, δ)-differential privacy if for any neighbouring datasets $D, D' \in \mathcal{D}$ and any $S \subset \mathcal{R}$, it holds that $\Pr[f(D) \in S] \le e^{\epsilon} \Pr[f(D') \in S] + \delta$.

Proposition 4.1 (Beimel et al. (2008)). Setting $p = \frac{2}{e^{\epsilon} + 1}$ in $R_{RR}$ suffices for (ϵ, 0)-differential privacy.
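The choice $p = 2/(e^{\epsilon} + 1)$ can be checked numerically against the (ϵ, 0)-DP inequality. The sketch below assumes the randomized response sender of Section 4 (a fair coin with probability $p$, the true bit otherwise) and treats the two possible input bits as the neighbouring inputs; the function names are hypothetical.

```python
import math


def rr_flip_probability(eps):
    """Perturbation probability p = 2 / (e^eps + 1) for randomized response."""
    return 2.0 / (math.exp(eps) + 1.0)


def rr_output_prob(output, bit, p):
    """Pr[R_RR(bit) = output]: fair coin with probability p, true bit otherwise."""
    return p / 2.0 + (1.0 - p) * (1.0 if output == bit else 0.0)


def worst_case_ratio(p):
    """Largest likelihood ratio over both outputs and both neighbouring inputs."""
    return max(
        rr_output_prob(x, b, p) / rr_output_prob(x, 1 - b, p)
        for x in (0, 1)
        for b in (0, 1)
    )


eps = 0.5  # hypothetical privacy budget
p = rr_flip_probability(eps)
ratio = worst_case_ratio(p)
```

The worst-case ratio works out to $(1 - p/2)/(p/2)$, which equals $e^{\epsilon}$ for this choice of $p$, so the inequality in the DP definition holds with δ = 0 and is tight.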

Figure 1: The overall structure of DPMAC. The message receiver of agent i integrates other agents' messages {m ji , m ki , m li } with the self-attention mechanism and the integrated message is fed into the policy π i together with the observation o i . Agent i interacts with the environment by taking action a i . Then o i and a i are concatenated and encoded by a privacy-preserving message sender and sent to other agents.

(a) Learning curves of DPMAC, TarMAC, I2C, and MADDPG on three MPE tasks. Note that on the PP task DPMAC (ϵ = 0.10) is shown.
(b) Learning curves of different algorithms under the privacy budget ϵ = 0.10. MADDPG (non-private) is also displayed for comparison.
(c) Learning curves of different algorithms under the privacy budget ϵ = 1.0. MADDPG (non-private) is also displayed for comparison.
(d) Learning curves of different privacy budgets (ϵ = 0.01, 0.10, 1.00) for DPMAC.

Figure 2: Learning curves of DPMAC and baseline algorithms. The curves are averaged over 5 seeds. Shaded areas denote 1 standard deviation.

al., 2016; Jiang & Lu, 2018; Das et al., 2019; Ding et al., 2020; Kim et al., 2021; Wang et al., 2020b). The communication can be either broadcast (Das et al., 2019; Jiang & Lu, 2018;

We implement our DPMAC and evaluate it against TarMAC (Das et al., 2019), I2C (Ding et al., 2020), and MADDPG (Lowe et al., 2017). All algorithms are tested with and without the privacy requirement except for MADDPG, which involves no communication among agents. Since TarMAC and I2C do not have a local sender and have no DP guarantee, we add Gaussian noise to their receiver according to the noise variance specified in Theorem 5.1 for a fair comparison. Please see Appendix D for more training details. We remark that the code will be made publicly available once this manuscript is accepted.

Environments. We evaluate the algorithms on the multi-agent particle environment (MPE) (Mordatch & Abbeel, 2017), which has a continuous observation space and a discrete action space. This environment is commonly used in the existing literature (Lowe et al., 2017; Jiang & Lu, 2018; Ding et al., 2020; Kim et al., 2021). We evaluate a wide range of tasks in MPE, including cooperative navigation (CN), cooperative communication and navigation (CCN), and predator prey (PP). More details on the environmental settings are given in Appendix E.

