BI-LEVEL DYNAMIC PARAMETER SHARING AMONG INDIVIDUALS AND TEAMS FOR PROMOTING COLLABORATIONS IN MULTI-AGENT REINFORCEMENT LEARNING

Abstract

Parameter sharing has contributed greatly to the success of multi-agent reinforcement learning in recent years. However, most existing parameter sharing mechanisms are static, sharing parameters indiscriminately among individuals and ignoring both the dynamics of the environment and the different roles agents play. In addition, although a single-level selective parameter sharing mechanism can promote the diversity of strategies, it struggles to establish complementary and cooperative relationships between agents. To address these issues, we propose BDPS, a bi-level dynamic parameter sharing mechanism among individuals and teams for promoting effective collaborations. Specifically, at the individual level, we define virtual dynamic roles based on agents' long-term cumulative advantages and share parameters among agents with the same role. At the team level, we combine agents of different virtual roles into groups and share parameters among agents in the same group. Through the joint efforts of these two levels, we achieve a dynamic balance between the individuality and commonality of agents, enabling them to learn more complex and complementary collaborative relationships. We evaluate BDPS on a challenging set of StarCraft II micromanagement tasks. The experimental results show that our method outperforms current state-of-the-art baselines, and ablation experiments demonstrate the reliability of our proposed structure.

1. INTRODUCTION

Collaborative Multi-Agent Reinforcement Learning (MARL) has broad application prospects in many areas, such as robot cluster control (Buşoniu et al., 2010), multi-vehicle autonomous driving (Bhalla et al., 2020), and shop scheduling (Jiménez, 2012). In a multi-agent environment, an agent must track the environment's dynamics and understand the learned policies of other agents to form good collaborations. Real-world scenarios usually involve large numbers of agents with different identities or capabilities, which places higher demands on collaboration among agents. Therefore, scaling MARL to large numbers of agents and promoting stable, complementary cooperation among agents with different identities and capabilities are both particularly important. To handle large numbers of agents, many collaborative MARL methods that adopt the centralized training paradigm use a full, static parameter sharing mechanism (Gupta et al., 2017), in which all agents share the parameters of a single policy network, simplifying the algorithm's structure and improving training efficiency. This mechanism is effective because agents generally receive similar observations in existing narrow and simple multi-agent environments. In our Google Research Football (GRF) (Kurach et al., 2020) experiments, however, we find that blindly applying full parameter sharing does not improve performance, because the observations of different players diverge substantially as they move. At the same time, because full static parameter sharing ignores the identities and abilities of different agents, it constantly limits the diversity of agents' behavior policies (Li et al., 2021; Yang et al., 2022), making it difficult to promote complementary and reliable cooperation between agents in complex scenarios.
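To make the full static parameter sharing mechanism concrete, the following is a minimal sketch (not the paper's method): every agent's action distribution is computed from one shared weight matrix, with a one-hot agent ID appended to the observation so shared parameters can still condition on identity. The linear policy and all names here are illustrative assumptions.

```python
import numpy as np

def shared_policy(obs, agent_id, n_agents, weights):
    """One set of weights serves every agent; the agent's one-hot ID is
    appended to its observation so the shared parameters can still
    condition behaviour on identity."""
    one_hot = np.zeros(n_agents)
    one_hot[agent_id] = 1.0
    x = np.concatenate([obs, one_hot])      # input: observation + agent ID
    logits = weights @ x                    # a single linear layer as a stand-in
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()                  # softmax action distribution

rng = np.random.default_rng(0)
n_agents, obs_dim, n_actions = 4, 6, 3
W = rng.normal(size=(n_actions, obs_dim + n_agents))  # the ONLY parameters
obs = rng.normal(size=obs_dim)
p0 = shared_policy(obs, 0, n_agents, W)
p1 = shared_policy(obs, 1, n_agents, W)
# same observation, different IDs: both distributions come from the same W
```

Note that even with the ID trick, every agent's behavior is constrained by the same weight matrix, which is the diversity limitation discussed above.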
Recently, to eliminate the disadvantages of full parameter sharing, single-level selective parameter sharing mechanisms have been proposed (Christianos et al., 2021; Wang et al., 2022): an encoder extracts deep features from agents' observations, and the features are clustered to decide which agents share parameters with each other. Although single-level selective parameter sharing can promote the diversity of agents' policies, it fragments the relationships between agents that do not share parameters at the same time, so agents cannot establish complementary cooperative relationships over a broader range. More importantly, designing an effective selector is the key to selective parameter sharing, especially for a single-level dynamic mechanism that must repeat the selection operation every few rounds. Most methods use only the agents' real-time observations, ignoring the agents' histories, which hinders correctly mining the agents' implicit identity characteristics. Consider a football team: players receive specialized training for their positions, such as shooting training for forwards and defensive training for defenders. However, winning a game requires not only specialized training but also coordination between players in different roles. That is, we need not only to share parameters among agents with the same role, but also to combine agents of different identities to ensure that they can form robust and complementary collaborations on a larger scale. To address these issues, in this paper we propose BDPS, a bi-level dynamic parameter sharing mechanism among individuals and teams. The advantage function of an agent expresses how much better taking an action is than the average action in the current state.
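The single-level selective mechanism described above can be sketched as follows, under simplified assumptions: a random linear map stands in for the learned encoder, and plain k-means over the encoded observations stands in for the clustering step. Agents whose features fall in the same cluster would share one policy network.

```python
import numpy as np

def assign_sharing_groups(obs_batch, n_groups, n_iters=10, seed=0):
    """Cluster encoded observations with k-means; agents in the same
    cluster share one policy network. The random linear 'encoder' is a
    hypothetical stand-in for the learned encoder such methods train."""
    rng = np.random.default_rng(seed)
    enc = rng.normal(size=(obs_batch.shape[1], 4))    # stand-in encoder
    z = obs_batch @ enc                               # "deep" features
    # deterministic init: spread initial centres across the agent index range
    centers = z[np.linspace(0, len(z) - 1, n_groups).astype(int)].copy()
    for _ in range(n_iters):
        d = ((z[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)                          # nearest-centre assignment
        for k in range(n_groups):
            if (labels == k).any():
                centers[k] = z[labels == k].mean(0)
    return labels  # labels[i] = index of the network agent i shares

# six agents whose observations form two clearly separated clusters
obs = np.vstack([np.zeros((3, 5)), 5.0 * np.ones((3, 5))])
groups = assign_sharing_groups(obs, n_groups=2)
```

Because the assignment looks only at the current observations, agents in different clusters have no mechanism tying their policies together, which is exactly the fragmentation problem discussed above.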
We consider that the advantage function better represents the actual role an agent plays under its current identity and grouping. To identify the roles of agents more accurately, at the individual level we compute the agents' long-term advantage information as the key to virtual role identification, use a variational autoencoder (VAE) (Kingma & Welling, 2014) to learn the distribution of the agents' role characteristics, and obtain more accurate virtual roles directly by sampling this distribution. To alleviate the way a single-level dynamic parameter sharing mechanism splits the relationships between agents of different virtual roles, we further use a graph attention network (GAT) (Velickovic et al., 2018) to learn the topological relationships between the roles obtained at the individual level, thereby combining agents of different identities at a higher level and over a broader range. Through this design, we achieve dynamic, selective parameter sharing at two levels, individual and team, stabilizing complementary collaboration among agents over a more extensive scope while preserving the diversity of agents' policies. We test BDPS and algorithms with different parameter sharing mechanisms on the StarCraft II micromanagement environments (SMAC) (Samvelyan et al., 2019) and Google Research Football (GRF) (Kurach et al., 2020). The experimental results show that our method not only generally outperforms methods with single-level selective parameter sharing and full parameter sharing on all super-hard maps and four hard maps of SMAC, but also performs well in the GRF scenarios we use.
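The bi-level grouping idea can be illustrated with a deliberately simplified sketch. Here a recency-weighted cumulative advantage per agent stands in for the VAE-based role features, rank bucketing stands in for sampling roles from the learned distribution, and a round-robin combination of one agent per role stands in for the GAT-based team formation; none of these stand-ins are the paper's actual components.

```python
import numpy as np

def bdps_grouping(advantage_hist, n_roles):
    """Two-level grouping sketch.
    Individual level: a recency-weighted cumulative advantage per agent
    serves as the role feature, and agents are bucketed into n_roles by
    rank (stand-in for VAE-based role identification).
    Team level: one agent from each role is combined into a team
    (stand-in for GAT-based combination of different roles)."""
    T, n = advantage_hist.shape
    decay = 0.9 ** np.arange(T)[::-1]        # recent advantages weigh most
    long_term_adv = decay @ advantage_hist   # long-term advantage per agent
    order = np.argsort(long_term_adv)
    roles = np.empty(n, dtype=int)
    roles[order] = np.arange(n) * n_roles // n   # equal-size rank buckets
    teams = np.empty(n, dtype=int)
    for r in range(n_roles):                 # round-robin across roles
        members = np.where(roles == r)[0]
        teams[members] = np.arange(len(members))
    return roles, teams

# 3 timesteps of per-agent advantages for 4 agents
adv = np.arange(12, dtype=float).reshape(3, 4)
roles, teams = bdps_grouping(adv, n_roles=2)
```

Agents sharing a `roles` index would share individual-level parameters, while each `teams` index combines one agent of every role, so team-level sharing spans agents with complementary roles.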
In addition, we conduct ablation experiments to verify the influence of different parameter sharing settings on the formation of complementary cooperation between agents, which further demonstrates the reliability of our proposed method.

2. BACKGROUND

2.1 DECENTRALIZED PARTIALLY OBSERVABLE MARKOV DECISION PROCESS

A fully cooperative MARL task can usually be modeled as a decentralized partially observable Markov decision process (Dec-POMDP) (Oliehoek & Amato, 2016), represented by a tuple $M = \langle \mathcal{N}, \mathcal{S}, \mathcal{A}, R, P, \Omega, O, \gamma \rangle$, where $\mathcal{N}$ is the finite set of $n$ agents, $s \in \mathcal{S}$ is the global state, $P$ is the state transition function, and $\gamma \in [0, 1)$ is the discount factor. At each time step $t$, each agent $i \in \mathcal{N}$ receives a local observation $o_i \in \Omega$ according to the observation function $O(s, i)$, takes an action $a_i \in \mathcal{A}$ to form a joint action $\mathbf{a} \in \mathcal{A}^n$, and receives a shared global reward $r = R(s, \mathbf{a})$. Due to partial observability, each agent conditions its policy $\pi_i(a_i \mid \tau_i)$ on its own local action-observation history $\tau_i \in T \equiv (\Omega \times \mathcal{A})^*$. The agents jointly aim to maximize the expected return, that is, to find a joint policy $\pi = \langle \pi_1, \ldots, \pi_n \rangle$ that maximizes the joint action-value function $Q^{\pi}(s, \mathbf{a}) = \mathbb{E}_{s_{0:\infty}, \mathbf{a}_{0:\infty}}\!\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s, \mathbf{a}_0 = \mathbf{a}, \pi\right]$.
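The return inside the expectation above is a plain discounted sum of the shared global rewards, which can be computed directly; this small helper (an illustrative name, not part of the paper) makes the objective concrete.

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Compute sum_t gamma^t * r_t, the quantity whose expectation the
    joint action-value function Q^pi estimates for a start state and
    joint action."""
    weights = gamma ** np.arange(len(rewards))  # gamma^0, gamma^1, ...
    return float(weights @ np.asarray(rewards, dtype=float))

# a team receiving a shared global reward of 1 at every step:
# 1 + 0.5 + 0.25 = 1.75
ret = discounted_return([1.0, 1.0, 1.0], gamma=0.5)  # -> 1.75
```

Because $\gamma < 1$, the infinite sum converges, which is why the Dec-POMDP definition restricts $\gamma \in [0, 1)$.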

