BOOSTING MULTIAGENT REINFORCEMENT LEARNING VIA PERMUTATION INVARIANT AND PERMUTATION EQUIVARIANT NETWORKS

Abstract

The state space in Multiagent Reinforcement Learning (MARL) grows exponentially with the agent number. Such a curse of dimensionality results in poor scalability and low sample efficiency, inhibiting MARL for decades. To break this curse, we propose a unified agent permutation framework that exploits the permutation invariance (PI) and permutation equivariance (PE) inductive biases to reduce the multiagent state space. Our insight is that permuting the order of entities in the factored multiagent state space does not change the information. Specifically, we propose two novel implementations: a Dynamic Permutation Network (DPN) and a Hyper Policy Network (HPN). The core idea is to build separate entity-wise PI input and PE output network modules to connect the entity-factored state space and action space in an end-to-end way. DPN achieves such connections by two separate module selection networks, which consistently assign the same input module to the same input entity (guarantee PI) and assign the same output module to the same entity-related output (guarantee PE). To enhance the representation capability, HPN replaces the module selection networks of DPN with hypernetworks to directly generate the corresponding module weights. Extensive experiments in SMAC, SMACv2, Google Research Football, and MPE validate that the proposed methods significantly boost the performance and the learning efficiency of existing MARL algorithms. Remarkably, in SMAC, we achieve 100% win rates in almost all hard and super-hard scenarios (never achieved before).

1. INTRODUCTION

Multiagent Reinforcement Learning (MARL) has successfully addressed many real-world problems (Vinyals et al., 2019; Berner et al., 2019; Hüttenrauch et al., 2017). However, MARL algorithms still suffer from poor sample efficiency and poor scalability due to the curse of dimensionality, i.e., the joint state-action space grows exponentially as the number of agents increases (Li et al., 2022). One way to address this problem is to properly reduce the size of the state-action space (van der Pol et al., 2021; Li et al., 2021). In this paper, we study how to utilize the permutation invariance (PI) and permutation equivariance (PE) inductive biases to reduce the state space in MARL.

Let $G$ be the set of all permutation matrices of size $m \times m$ and $g$ be a specific permutation matrix in $G$. A function $f : X \to Y$, where $X = [x_1, \ldots, x_m]^T$, is PI if permuting the input components does not change the function output, i.e., $f(g[x_1, \ldots, x_m]^T) = f([x_1, \ldots, x_m]^T), \forall g \in G$. In contrast, a function $f : X \to Y$, where $X = [x_1, \ldots, x_m]^T$ and $Y = [y_1, \ldots, y_m]^T$, is PE if permuting the input components also permutes the outputs with the same permutation $g$, i.e., $f(g[x_1, \ldots, x_m]^T) = g[y_1, \ldots, y_m]^T, \forall g \in G$. We refer to functions that are neither PI nor PE as permutation-sensitive functions.

A multiagent environment typically consists of $m$ individual entities, including $n$ learning agents and $m - n$ non-player objects. The observation $o_i$ of each agent $i$ is usually composed of the features of the $m$ entities, i.e., $[x_1, \ldots, x_m]$, where $x_i \in \mathcal{X}$ represents each entity's features and $\mathcal{X}$ is the feature space. If $o_i$ is simply represented as a concatenation of $[x_1, \ldots, x_m]$ in a fixed order, the size of the observation space is $|\mathcal{X}|^m$. Although there are $m!$ different orders of these entities, all of them carry the same information. Thus, building functions that are insensitive to the entities' order can significantly reduce the observation space, to roughly $\frac{1}{m!}$ of its original size. To this end, in this paper, we exploit both PI and PE functions to design more sample-efficient MARL algorithms.

To achieve PI, there are two types of previous methods. The first employs the idea of data augmentation, e.g., Ye et al. (2020) propose data augmented MADDPG, which generates more training data by shuffling the order of the input components and forces these generated data to map to the same output through training. However, it is inefficient to train a permutation-sensitive function to output the same value when taking features in different orders as inputs. The second type applies naturally PI architectures, such as Deep Sets (Li et al., 2021) and GNNs (Wang et al., 2020b; Liu et al., 2020), to MARL. These models use shared input embedding layers and entity-wise pooling layers to achieve PI. However, using shared embedding layers limits the model's representational capacity and may result in poor performance (Wagstaff et al., 2019).
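The PI and PE definitions above can be checked numerically on small examples. The following is a minimal sketch, not taken from the paper: the helper names `is_permutation_invariant` and `is_permutation_equivariant` and the three toy functions are illustrative assumptions, and row indexing stands in for multiplication by a permutation matrix $g$.

```python
import itertools
import numpy as np

def is_permutation_invariant(f, x):
    """Check f(g x) == f(x) for every permutation g of the rows of x."""
    base = f(x)
    return all(np.allclose(f(x[list(p)]), base)
               for p in itertools.permutations(range(len(x))))

def is_permutation_equivariant(f, x):
    """Check f(g x) == g f(x) for every permutation g of the rows of x."""
    base = f(x)
    return all(np.allclose(f(x[list(p)]), base[list(p)])
               for p in itertools.permutations(range(len(x))))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))  # m = 4 entities with 3 features each (assumed sizes)

def sum_pool(x):
    return x.sum(axis=0)                       # PI: the entity order is summed away

def row_norm(x):
    return np.linalg.norm(x, axis=1)           # PE: one output per entity

def flatten_dot(x):
    return x.reshape(-1) @ np.arange(x.size)   # permutation-sensitive: fixed-order concatenation

pi_sum = is_permutation_invariant(sum_pool, x)
pe_norm = is_permutation_equivariant(row_norm, x)
pi_flat = is_permutation_invariant(flatten_dot, x)
```

Here `flatten_dot` mimics feeding a fixed-order concatenation into a linear layer, which is the permutation-sensitive case the text contrasts with PI and PE functions.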
PE, to the best of our knowledge, has drawn relatively little attention in the MARL community, and few works exploit this property. In general, the architecture of an agent's policy network can be considered as three parts: ❶ an input layer, ❷ a backbone network (main architecture) and ❸ an output layer. To achieve PI and PE, we follow the minimal modification principle and propose a light-yet-efficient agent permutation framework, in which we only modify the input and output layers while keeping the backbone unchanged. Our method can therefore be plugged into existing MARL methods more easily. The core idea is that, instead of using shared embedding layers, we build non-shared entity-wise PI input and PE output network modules to connect the entity-factored state space and action space in an end-to-end way. Specifically, we propose two novel implementations: a Dynamic Permutation Network (DPN) and a Hyper Policy Network (HPN). To achieve PI, DPN builds a separate module selection network, which consistently selects the same input module for the same input entity no matter where the entity is arranged, and then merges all input modules' outputs by sum pooling. Similarly, to achieve PE, it builds a second module selection network, which always assigns the same output module to the same entity-related output. However, one restriction of DPN is that the number of network modules is limited. As a result, the module assigned to each entity may not be the best fit. To relax this restriction and enhance the representational capability, we further propose HPN, which replaces the module selection networks of DPN with hypernetworks and directly generates the network parameters of the corresponding modules (by taking each entity's own features as input). Entities with different features are processed by modules with entity-specific parameters. Therefore, the model's representational capability is improved while the PI and PE properties are preserved.
Extensive evaluations in SMAC, SMACv2, Google Research Football and MPE validate that DPN and HPN can be easily integrated into many existing MARL algorithms (both value-based and policy-based) and significantly boost their learning efficiency and converged performance. Remarkably, we achieve 100% win-rates in almost all hard and super-hard scenarios of SMAC, which has never been achieved before to the best of our knowledge. The code is available at https://github.com/tjuHaoXiaotian/API-Network.

2. RELATED WORK

To highlight our method, we briefly summarize the related works that consider the PI or PE property.

Concatenation. Typical MARL algorithms, e.g., QMIX (Rashid et al., 2018) and MADDPG (Lowe et al., 2017), simply represent the set input as a concatenation of the $m$ entities' features in a fixed order and feed the concatenated features into permutation-sensitive functions, e.g., a multilayer perceptron (MLP). As each entity's feature space has size $|\mathcal{X}|$, the size of the joint feature space after concatenation grows exponentially to $|\mathcal{X}|^m$; thus these methods suffer from sample inefficiency.

Data Augmentation. To reduce the number of environmental interactions, Ye et al. (2020) propose data augmented MADDPG, which generates more training data by shuffling the order of $[x_1, \ldots, x_m]$ and additionally updates the model based on the generated data. However, the method requires more computational resources and is more time-consuming. Besides, as the generated data contain the same information as the original data, they should have the same Q-value. But it is inefficient to train a permutation-sensitive function to output the same value when taking differently ordered inputs.

Deep Set & Graph Neural Network. Instead of doing data augmentation, Deep Set (Zaheer et al., 2017) constructs a family of PI neural architectures for learning set representations. Each component $x_i$ is mapped separately to some latent space using a shared embedding layer $\phi(x_i)$. These latent representations are then merged by a PI pooling layer (e.g., sum, mean) to ensure the PI of the whole function, e.g., $f(X) = \rho\left(\sum_{i=1}^{m} \phi(x_i)\right)$, where $\rho$ can be any function. Graph Neural Networks (GNNs) (Veličković et al., 2018; Battaglia et al., 2018) also adopt shared embedding and pooling layers to learn functions on graphs. Wang et al. (2020b); Li et al. (2021) and Jiang et al. (2018); Liu et al. (2020) have applied Deep Set and GNN to MARL. However, due to the use of the shared embedding $\phi(x_i)$, the representational capacity is usually limited (Wagstaff et al., 2019).

Multi-head Self-Attention & Transformer. To improve the representational capacity, Set Transformer (Lee et al., 2019) employs the multi-head self-attention mechanism (Vaswani et al., 2017) to process every $x_i$ of the input set, which allows the method to encode higher-order interactions between elements in the set. Most recently, Hu et al. (2021b) adapt the Transformer (Vaswani et al., 2017) to MARL and propose UPDeT, which can handle various input sizes. But UPDeT was originally designed for transfer learning scenarios and does not explicitly consider the PI and PE properties.

PE Functions in MARL. In the deep learning literature, some works have studied the effectiveness of PE functions when dealing with problems defined over graphs (Maron et al., 2018; Keriven & Peyré, 2019). However, in MARL, few works exploit the PE property to the best of our knowledge. One related work is Action Semantics Network (ASN) (Wang et al., 2019), which studies the different effects of different types of actions but does not directly consider the PE property.
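To make the Deep Set construction $f(X) = \rho(\sum_i \phi(x_i))$ discussed above concrete, here is a minimal sketch with randomly initialized weights. The layer sizes, the `tanh` nonlinearity and the names `W_phi` and `W_rho` are assumptions chosen for illustration; they are not the configuration used by any of the cited works.

```python
import numpy as np

rng = np.random.default_rng(0)
d, hidden = 3, 8                       # feature and latent sizes (assumed)
W_phi = rng.normal(size=(d, hidden))   # weights of the single shared embedding phi
W_rho = rng.normal(size=(hidden, 1))   # weights of the post-pooling function rho

def deep_set(x):
    """f(X) = rho(sum_i phi(x_i)), with the same phi applied to every entity."""
    latent = np.tanh(x @ W_phi)        # phi: shared across all rows of x
    pooled = latent.sum(axis=0)        # PI sum pooling over entities
    return np.tanh(pooled) @ W_rho     # rho: any function of the pooled vector

x = rng.normal(size=(5, d))            # a set of 5 entities
perm = rng.permutation(5)
out_original = deep_set(x)
out_permuted = deep_set(x[perm])       # same set, different order
```

Because every entity passes through the same `W_phi`, the output is identical for any ordering; this sharing is also the capacity limitation the text attributes to Deep Set style models.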

3.1. ENTITY-FACTORED MODELING IN DEC-POMDP

A cooperative multiagent environment typically consists of $m$ entities, including $n$ learning agents and $m - n$ non-player objects. We follow the definition of Dec-POMDP (Oliehoek & Amato, 2016). At each step, each agent $i$ receives an observation $o_i \in O_i$, which contains partial information of the state $s \in S$, and executes an action $a_i \in A_i$ according to a policy $\pi_i(a_i \mid o_i)$. The environment transits to the next state $s'$ and all agents receive a shared global reward. The target is to find optimal policies for all agents that maximize the expected cumulative global reward. Each agent's individual action-value function is denoted as $Q_i(o_i, a_i)$. A detailed definition can be found in Appendix B.1.

Modeling MARL in factored spaces is a common practice. Many recent works (Qin et al., 2022; Wang et al., 2020b; Hu et al., 2021b; Long et al., 2019; Wang et al., 2019) model the observation and the state of typical MARL benchmarks, e.g., SMAC (Qin et al., 2022; Hu et al., 2021b), MPE (Long et al., 2019) and Neural MMO (Wang et al., 2019), as factored parts relating to the environment, the agent itself and other entities. We follow this common entity-factored setting, in which both the state space and the observation space are factorizable and can be represented as entity-related features, i.e., $s \in S \subseteq \mathbb{R}^{m \times d_s}$ and $o_i \in O \subseteq \mathbb{R}^{m \times d_o}$, where $d_s$ and $d_o$ denote the feature dimension of each entity in the state and the observation. For the action space, we consider a general setting in which each agent's actions are composed of two types: a set of $m$ entity-correlated actions $A_{equiv}$ and a set of entity-uncorrelated actions $A_{inv}$, i.e., $A_i \triangleq (A_{equiv}, A_{inv})$. Entity-correlated actions are those for which there exists a one-to-one correspondence between each entity and each action, e.g., 'attacking which enemy' in SMAC or 'passing the ball to which teammate' in football games. Therefore, $a_{equiv} \in A_{equiv}$ should be equivariant with the permutation of $o_i$ and $a_{inv} \in A_{inv}$ should be invariant. Tasks that only consider one of the two types of actions can be regarded as special cases.

3.2. OUR TARGET: DESIGNING PI AND PE POLICY NETWORKS

Let $g \in G$ be an arbitrary permutation matrix. We define that $g$ operates on $o_i$ by permuting the order of the $m$ entities' features, i.e., $g o_i$, and that $g$ operates on $a_i = (a_{equiv}, a_{inv})$ by permuting the order of $a_{equiv}$ but leaving $a_{inv}$ unchanged, i.e., $g a_i \triangleq (g a_{equiv}, a_{inv})$. Our target is to inject the PI and PE inductive biases into the policy network design such that:

$$\pi_i(a_i \mid g o_i) = g \pi_i(a_i \mid o_i) \triangleq \left(g \pi_i(a_{equiv} \mid o_i),\ \pi_i(a_{inv} \mid o_i)\right), \quad \forall g \in G,\ o_i \in O \tag{1}$$

where $g$ operates on $\pi_i$ by permuting the order of $\pi_i(a_{equiv} \mid o_i)$ while leaving $\pi_i(a_{inv} \mid o_i)$ unchanged.

Figure 2: The ideal PI and PE policy network.
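The target property in equation 1 can be stated as an executable check. The sketch below is an illustration under assumptions: `toy_policy`, `concat_policy`, their weights and all sizes are hypothetical stand-ins, and logits are used in place of action probabilities.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, n_inv = 4, 3, 2                   # entities, features per entity, |A_inv| (assumed)
w_equiv = rng.normal(size=d)            # hypothetical per-entity scoring weights
W_inv = rng.normal(size=(d, n_inv))     # hypothetical weights for pooled features
W_cat = rng.normal(size=(m * d, m + n_inv))

def toy_policy(o):
    """Logits for (A_equiv, A_inv) built to satisfy equation 1."""
    equiv_logits = o @ w_equiv          # one logit per entity, so PE
    inv_logits = o.sum(axis=0) @ W_inv  # pooled over entities, so PI
    return equiv_logits, inv_logits

def concat_policy(o):
    """Permutation-sensitive baseline: an FC layer on the fixed-order concatenation."""
    logits = o.reshape(-1) @ W_cat
    return logits[:m], logits[m:]

def satisfies_eq1(policy, o, perm):
    """Check pi(. | g o) == (g pi(a_equiv | o), pi(a_inv | o)) for one permutation."""
    equiv, inv = policy(o)
    equiv_g, inv_g = policy(o[perm])
    return bool(np.allclose(equiv_g, equiv[perm]) and np.allclose(inv_g, inv))

o = rng.normal(size=(m, d))
perm = np.array([1, 2, 3, 0])           # a fixed non-identity permutation
holds = satisfies_eq1(toy_policy, o, perm)
violated = not satisfies_eq1(concat_policy, o, perm)
```

The check covers a single permutation; verifying the property for all $g \in G$ would loop over every permutation of the $m$ entities.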

4. METHODOLOGY

Our target policy network architecture is shown in Fig. 2. It consists of four modules: ❶ input module A, ❷ backbone module B, which could be any architecture, ❸ output module C for actions $a_{inv}$ and ❹ output module D for actions $a_{equiv}$. For brevity, we denote the inputs and outputs of these modules as $z_i = A(o_i)$, $h_i = B(z_i)$, $\pi_i(a_{inv} \mid o_i) = C(h_i)$ and $\pi_i(a_{equiv} \mid o_i) = D(h_i)$, respectively. To achieve equation 1, we have to modify the architectures of {A, B, C, D} such that the outputs of C are PI and the outputs of D are PE. In this paper, we propose to directly modify A to be PI and D to be PE with respect to the input $o_i$, and to keep the backbone module B and the output module C unchanged. The following propositions show that our proposal is a feasible and simple solution.

Proposition 1. If we make module A PI, the output of module C immediately becomes PI without modifying modules B and C.

Proof. Given two different inputs $g o_i$ and $o_i$, $\forall g \in G$, since module A is PI, $A(g o_i) = A(o_i)$. Accordingly, for any functions B and C, we have $(C \circ B \circ A)(g o_i) = (C \circ B \circ A)(o_i)$, which is exactly the definition of PI.

Corollary 1. With module A being PI, if module D is not modified, the output of D also immediately becomes PI, i.e., $(D \circ B \circ A)(g o_i) = (D \circ B \circ A)(o_i)$, $\forall g \in G$.

Proposition 2. With module A being PI, to make the output of module D PE, we must introduce $o_i$ as an additional input.

Proof. According to Corollary 1, with module A being PI, module D also becomes PI. Its output therefore cannot change with the order of the entities in $o_i$, so D can only become PE if it additionally receives the order information, i.e., $o_i$ itself, as input.

Following this minimal modification principle, our proposed method can be more easily plugged into many types of existing MARL algorithms, which will be shown in section 5. In the following, we propose two designs to implement the PI module A and the PE module D.
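Proposition 1 and Corollary 1 can be observed on a toy instance of the four-module pipeline. In this sketch the module definitions are assumptions made for illustration: A is a linear embedding followed by sum pooling, while B, C and D are ordinary dense layers with random weights.

```python
import numpy as np

rng = np.random.default_rng(1)
m, d, h = 4, 3, 6                       # entities, feature dim, hidden dim (assumed)
W_a = rng.normal(size=(d, h))
W_b = rng.normal(size=(h, h))
W_c = rng.normal(size=(h, 2))           # 2 entity-uncorrelated actions
W_d = rng.normal(size=(h, m))           # m entity-correlated actions

def A(o):
    return (o @ W_a).sum(axis=0)        # PI input module (sum pooling)

def B(z):
    return np.tanh(z @ W_b)             # arbitrary backbone, left unmodified

def C(hid):
    return hid @ W_c                    # output module for a_inv

def D(hid):
    return hid @ W_d                    # output module for a_equiv, sees only h_i

o = rng.normal(size=(m, d))
perm = np.array([2, 0, 3, 1])           # permutation without fixed points

c_invariant = np.allclose(C(B(A(o))), C(B(A(o[perm]))))        # Proposition 1
d_invariant = np.allclose(D(B(A(o))), D(B(A(o[perm]))))        # Corollary 1
d_equivariant = np.allclose(D(B(A(o[perm]))), D(B(A(o)))[perm])
```

Since `D` only sees $h_i$, its output is the same for every ordering of `o` and cannot follow the permutation, which matches the motivation for giving D access to $o_i$ in Proposition 2.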

4.2. DYNAMIC PERMUTATION NETWORK

As a common practice adopted by many MARL algorithms, e.g., QMIX (Rashid et al., 2018), MADDPG (Lowe et al., 2017) and MAPPO (Yu et al., 2021), the input module A and the output module D are fully connected (FC) layers. Denoting $W_{in} = [W^{in}_1, \ldots, W^{in}_m]^T$ as the weights of the FC-layer A, the output is computed as:

$$z_i = \sum_{j=1}^{m} o_i[j] W_{in}[j] \tag{2}$$

where $[j]$ indicates the $j$-th element. Similarly, we denote $W_{out} = [W^{out}_1, \ldots, W^{out}_m]^T$ as the weight matrices of the output layer D. The output $\pi_i(a_{equiv} \mid o_i)$ is computed as:

$$\pi_i(a_{equiv} \mid o_i)[j] = h_i W_{out}[j], \quad \forall j \in \{1, \ldots, m\} \tag{3}$$

PI Input Layer A. According to equation 2, permuting $o_i$, i.e., $o'_i = g o_i$, will result in a different output $z'_i$, as the $j$-th input is changed to $o'_i[j]$ while the $j$-th weight matrix remains unchanged. To make an FC-layer PI, we design a PI weight matrix selection strategy for each $o_i[j]$ such that no matter where $o_i[j]$ is arranged, the same $o_i[j]$ will always be multiplied by the same weight matrix. Specifically, we build a weight selection network whose output dimension is $m$:

$$p_{in}\left(\left[W^{in}_1, \ldots, W^{in}_m\right] \mid o_i[j]\right) = \mathrm{softmax}(\mathrm{MLP}(o_i[j]))$$

where the $k$-th output $p_{in}(W^{in}_k \mid o_i[j])$ indicates the probability that $o_i[j]$ selects the $k$-th weight matrix of $W_{in}$. Then, for each $o_i[j]$, we choose the weight matrix with the maximum probability. However, directly selecting the argmax index is not differentiable. To make the selection process trainable, we apply a Straight Through Estimator (Van Den Oord et al., 2017) to get the one-hot encoding of the argmax index. We denote it as $\hat{p}_{in}(o_i[j])$. The weight matrix with the maximum probability can be acquired by $\hat{p}_{in}(o_i[j]) W_{in}$, i.e., by selecting the corresponding row from $W_{in}$. Overall, no matter in which order the input $o_i$ is arranged, the output of layer A is computed as:

$$z_i = \sum_{j=1}^{m} o'_i[j] \left(\hat{p}_{in}(o'_i[j]) W_{in}\right), \quad o'_i = g o_i,\ \forall g \in G$$

Since the selection network only takes each entity's features $o_i[j]$ as input, the same $o_i[j]$ will always generate the same probability distribution, and thus the same weight matrix will be selected no matter where $o_i[j]$ is ranked. Therefore, the resulting $z_i$ remains the same regardless of the order in which $o_i$ is arranged, i.e., layer A becomes PI. An illustration of the architecture is shown in Fig. 3 (right).

PE Output Layer D. To make the output layer D achieve PE, we also build a weight selection network $\hat{p}_{out}(o_i[j])$ for each entity-related output. The $j$-th output of D is computed as:

$$\pi_i(a_{equiv} \mid o'_i)[j] = h_i \left(\hat{p}_{out}(o'_i[j]) W_{out}\right), \quad o'_i = g o_i,\ \forall g \in G,\ \forall j \in \{1, \ldots, m\}$$

For all $g \in G$, each element $o'_i[j]$ always selects the same matrix of $W_{out}$, and thus the output associated with that entity always has the same value wherever the entity is placed. A change of the input order therefore results in the same change of the output order, which achieves PE. The architecture is shown in Fig. 3 (right).
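The DPN input and output layers described above can be sketched in a few lines. This is a forward-pass illustration only, under several assumptions: the selection networks are replaced by single linear maps (`S_in`, `S_out`) instead of MLPs, all sizes are arbitrary, and the Straight Through Estimator needed for training is omitted because plain NumPy has no gradients.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, h = 4, 3, 5                        # entities, feature dim, hidden dim (assumed)
W_in = rng.normal(size=(m, d, h))        # m candidate input weight matrices
W_out = rng.normal(size=(m, h))          # m candidate output weight vectors
S_in = rng.normal(size=(d, m))           # linear stand-in for the input selection MLP
S_out = rng.normal(size=(d, m))          # linear stand-in for the output selection MLP

def select(o, S):
    """Index of the highest-probability module per entity (argmax of the softmax)."""
    return (o @ S).argmax(axis=1)

def dpn_input(o):
    """z = sum_j o[j] W_in[k_j], where k_j depends only on o[j]."""
    k = select(o, S_in)
    return np.einsum("jd,jdh->h", o, W_in[k])

def dpn_output(hid, o):
    """j-th logit = hid . W_out[k_j], where k_j depends only on o[j]."""
    k = select(o, S_out)
    return W_out[k] @ hid

o = rng.normal(size=(m, d))
perm = np.array([3, 1, 0, 2])
z, z_perm = dpn_input(o), dpn_input(o[perm])
logits = dpn_output(np.tanh(z), o)
logits_perm = dpn_output(np.tanh(z_perm), o[perm])
```

Because the selected index is a function of the entity's own features, `z` is unchanged under reordering and `logits_perm` equals `logits[perm]`. Nothing in this sketch prevents two entities from selecting the same module; how the selection behaves after training is not something the sketch shows.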

4.3. HYPER POLICY NETWORK

The core idea of DPN is to always assign the same weight matrix to each $o_i[j]$. Compared with Deep Set style methods, which use a single shared weight matrix to embed the input $o_i$, i.e., $|W_{in}| = 1$, DPN's representational capacity is improved, i.e., $|W_{in}| = m$. However, the sizes of $W_{in}$ and $W_{out}$ are still limited. One restriction is that for each $o_i[j]$, we can only select weight matrices from these limited parameter sets. Thus, the weight matrix assigned to each $o_i[j]$ may not be the best fit. One question is whether we can provide an infinite number of candidate weight matrices such that the solution space of the assignment is no longer constrained. To achieve this, we propose a Hyper Policy Network (HPN), which incorporates hypernetworks (Ha et al., 2016) to generate customized embedding weights for each $o_i[j]$ by taking $o_i[j]$ as input. Hypernetworks (Ha et al., 2016) are a family of neural architectures that use one network to generate the weights of another network. Since we consider the outputs of hypernetworks as the weights in $W_{in}$ and $W_{out}$, the sizes of these parameter sets are no longer limited to $m$. To better explain our motivation, we provide a simple policy evaluation experiment in Appendix A.1, where we can directly construct an infinite number of candidate weight matrices, to show the influence of the model's representational capacity.

PI Input Layer A. For the input layer A, we build a hypernetwork $hp_{in}(o_i[j]) = \mathrm{MLP}(o_i[j])$ to generate a corresponding $W_{in}[j]$ for each $o_i[j]$. The output of layer A is computed as:

$$z_i = \sum_{j=1}^{m} o'_i[j]\, hp_{in}(o'_i[j]), \quad o'_i = g o_i,\ \forall g \in G \tag{7}$$

PE Output Layer D. Analogously, a hypernetwork $hp_{out}(o_i[j])$ generates the weight matrix $W_{out}[j]$ for each entity-related output (see Fig. 4), and the $j$-th output of D is computed as:

$$\pi_i(a_{equiv} \mid o'_i)[j] = h_i\, hp_{out}(o'_i[j]), \quad o'_i = g o_i,\ \forall g \in G,\ \forall j \in \{1, \ldots, m\}$$

By utilizing hypernetworks, the input $o_i[j]$ and the weight matrix $W_{out}[j]$ directly correspond one-to-one. A change of the input order results in the same change of the output order, thus achieving PE. See Fig. 4 for the whole architecture of HPN.
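The hypernetwork-based layers of HPN can be sketched in the same style. The code below is a forward-pass illustration under assumptions: each hypernetwork is a two-layer ReLU MLP with an arbitrary hidden width `k`, the weights are random rather than trained, and the names `H1_in`, `H2_in`, `H1_out` and `H2_out` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, h, k = 4, 3, 5, 16                 # entities, feature dim, hidden dim, hypernet width (assumed)
H1_in, H2_in = rng.normal(size=(d, k)), rng.normal(size=(k, d * h))
H1_out, H2_out = rng.normal(size=(d, k)), rng.normal(size=(k, h))

def hyper_in(o):
    """Generate one (d x h) input weight matrix per entity from its own features."""
    return (np.maximum(o @ H1_in, 0.0) @ H2_in).reshape(-1, d, h)

def hyper_out(o):
    """Generate one output weight vector of length h per entity."""
    return np.maximum(o @ H1_out, 0.0) @ H2_out

def hpn_input(o):
    """z = sum_j o[j] hp_in(o[j]), as in equation 7."""
    return np.einsum("jd,jdh->h", o, hyper_in(o))

def hpn_output(hid, o):
    """j-th logit = hid . hp_out(o[j])."""
    return hyper_out(o) @ hid

o = rng.normal(size=(m, d))
perm = np.array([2, 3, 1, 0])
z, z_perm = hpn_input(o), hpn_input(o[perm])
logits = hpn_output(np.tanh(z), o)
logits_perm = hpn_output(np.tanh(z_perm), o[perm])
```

Compared with a fixed bank of $m$ candidate modules, the generated weights vary continuously with the entity features, which is the sense in which the set of candidate weight matrices is no longer limited to $m$.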
In this section, when describing our methods, we represent module A and D as networks with only one layer, but the idea of DPN and HPN can be easily extended to multiple layers. Besides, both DPN and HPN are general designs and can be easily integrated into existing MARL algorithms to boost their performance. All parameters of DPN and HPN are trained end-to-end with backpropagation according to the underlying RL loss function. Implementation details can be found in B.2.

5.1. STARCRAFT MULTIAGENT CHALLENGE (SMAC AND SMACV2)

Setup and codebase. We first evaluate our methods in SMAC, which is an important testbed for MARL algorithms. SMAC consists of a set of StarCraft II micro battle scenarios, where units are divided into two teams: allies and enemies. The ally units are controlled by the agents, while the enemy units are controlled by the built-in rule-based bots. The agents can observe the distance, relative location, health, shield and type of the units within their sight range. The goal is to train the agents to defeat the enemy units. We evaluate our methods in all Hard and Super Hard scenarios. Following Samvelyan et al. (2019); Hu et al. (2021a), the evaluation metric is a function that maps the environment steps to test winning rates. Each experiment is repeated using 5 independent training runs, and the resulting plots show the median performance as well as the 25%-75% percentiles. Recently, Hu et al. (2021a) demonstrated that the optimized QMIX (Rashid et al., 2018) achieves SOTA performance in SMAC. Thus, all code and hyperparameters used in this paper are based on their released project PyMARL2. Detailed parameter settings are given in Appendix C. Besides SMAC, we also evaluate our HPN on a new challenging benchmark, SMACv2. Due to the space limit, detailed settings and evaluation results are presented in Appendix A.10.

5.1.1. APPLYING HPN TO SOTA FINE-TUNED QMIX

We apply HPN to each $Q_i(a_i \mid o_i)$ of the fine-tuned QMIX and VDN (Hu et al., 2021a). The learning curves over 8 hard and super hard scenarios are shown in Fig. 5. We conclude that: (1) HPN-QMIX surpasses the fine-tuned QMIX by a large margin and achieves 100% test win rates in almost all scenarios, especially in 5m_vs_6m, 3s5z_vs_3s6z and 6h_vs_8z, which has never been achieved before. Our HPN-QMIX achieves the new SOTA in SMAC; (2) HPN-VDN significantly improves the performance of fine-tuned VDN, and it even surpasses the fine-tuned QMIX in most scenarios, which narrows the gap between the VDN-based and QMIX-based algorithms; (3) HPN significantly improves the sample efficiency of both VDN and QMIX, especially in 5m_vs_6m, MMM2, 3s5z_vs_3s6z and 6h_vs_8z. In 5m_vs_6m, to achieve the same 80% win rate, HPN-VDN and HPN-QMIX reduce the environmental interaction steps by a factor of $\frac{1}{4}$ compared with their counterparts.

The results are shown in Fig. 6. We conclude that: HPN > DPN ≥ UPDeT ≥ ASN > QMIX > GNN ≈ SET ≈ DA. Specifically, (1) HPN-QMIX achieves the best win rates, which validates the effectiveness of our PI and PE design; (2) HPN-QMIX performs better than DPN-QMIX. The reason is that HPN-QMIX utilizes more flexible hypernetworks to achieve PI and PE with enhanced representational capacity. (3) UPDeT-QMIX and ASN-QMIX achieve win rates comparable to those of DPN-QMIX in most scenarios, which indicates that utilizing the PI and PE properties is important. (4) GNN-QMIX and SET-QMIX achieve similar performance. Although PI is achieved, GNN-QMIX and SET-QMIX still perform worse than vanilla QMIX, especially in 3s5z_vs_3s6z and 6h_vs_8z. This confirms that using a shared embedding layer $\phi(x_i)$ limits the representational capacity and restricts the final performance. (5) DA-QMIX improves the performance of QMIX in 3s5z_vs_3s6z through many more parameter updates. However, its learning process is unstable and it collapses in all other scenarios due to the perturbation of the input features, which validates that it is hard to make a permutation-sensitive function (e.g., MLP) output the same value when taking differently ordered features as inputs.

For HPN, incorporating hypernetworks leads to a 'bigger' model. To make a fairer comparison, we enlarge the network size of $Q_i(a_i \mid o_i)$ in fine-tuned QMIX (denoted as BIG-QMIX) such that it has more parameters than HPN-QMIX. Besides, the UPDeT baselines already have comparable or more parameters than HPN. The detailed sizes of these models are shown in Table 3 of Appendix A.3. From Fig. 7, we see that simply increasing the number of parameters does not improve the performance. On the contrary, it even leads to worse results. VDN-based results are presented in Appendix A.3.

5.1.4. ABLATION: IMPORTANCE OF THE PE OUTPUT LAYER AND INPUT NETWORK CAPACITY

(1) To validate the importance of the PE output layer, we add the hypernetwork-based output layer of HPN to SET-QMIX (denoted as HPN-SET-QMIX). From Fig. 7, we see that incorporating a PE output layer could significantly boost the performance of SET-QMIX. The converged performance of HPN-SET-QMIX is superior to that of vanilla QMIX. (2) However, due to the limited representational capacity of the shared embedding layer of Deep Set, the performance of HPN-SET-QMIX is still worse than that of HPN-QMIX. The only difference between HPN-SET-QMIX and HPN-QMIX is the number of embedding weight matrices in the input layer. The results validate that improving the representational capacity of the PI input layer is beneficial.

The results of applying HPN to MAPPO (Yu et al., 2021) and QPLEX (Wang et al., 2020a) in 5m_vs_6m and 3s5z_vs_3s6z are shown in Fig. 8. We see that HPN-MAPPO and HPN-QPLEX consistently improve the performance of MAPPO and QPLEX, which validates that HPN can be easily integrated into many types of MARL algorithms and boost their performance. Full results are shown in Appendix A.4.

We also evaluate the proposed DPN and HPN in the cooperative navigation and the predator-prey scenarios of MPE (Lowe et al., 2017), where the actions only consist of movements. Therefore, only the PI property is needed. We follow the experimental settings of PIC (Liu et al., 2020) (which utilizes GNN to achieve PI, i.e., GNN-MADDPG) and apply our DPN and HPN to the joint Q-function of MADDPG. We implement the code based on the official PIC codebase. The learning curves are shown in Fig. 9. We see that our HPN-MADDPG outperforms the GNN-based PIC, DA-MADDPG and MADDPG baselines by a large margin, which validates the superiority of our PI designs. The detailed experimental settings and full results are presented in Appendix A.6.

Finally, we evaluate HPN in two Google Research Football (GRF) (Kurach et al., 2020) academic scenarios: 3 vs 1 with keeper and counterattack hard.
We control the left team players, which need to coordinate their positions to organize attacks, and only scoring leads to rewards. The observations consist of 5 parts: ball information, left team, right team, controlled player and match state. Each agent has 19 discrete actions, including moving, sliding, shooting and passing, where the targets of the passing actions correspond to its teammates. Detailed settings can be found in Appendix A.7. We apply HPN to QMIX and compare it with the SOTA CDS-QMIX (Chenghao et al., 2021) . We show the average win rate across 5 seeds in Fig. 10 . HPN can significantly boost the performance of QMIX and HPN-QMIX outperforms the SOTA method CDS-QMIX by a large margin in these two scenarios.

6. CONCLUSION

In this paper, we propose two PI and PE designs, both of which can be easily integrated into existing MARL algorithms to boost their performance. Although we only test the proposed methods in model-free MARL algorithms, they are general designs and have great potential to improve the performance of model-based (Yuan et al., 2022), multitask and transfer learning (Wu et al., 2023) algorithms. Besides, we currently follow the settings of Qin et al. (2022); Wang et al. (2020b); Hu et al. (2021b); Long et al. (2019), where the configuration of the input-output relationships and the observation/state structure is set manually. For future work, it would be interesting to detect such structural information automatically.

A ADDITIONAL EXPERIMENTAL SETTINGS AND RESULTS

A.1 A SIMPLE POLICY EVALUATION PROBLEM

We present a simple policy evaluation experiment to show the influence of the model's representational capacity on the converged performance. We consider the following permutation-invariant (PI) models: ❶ Deep Set, ❷ DA-MLP (apply data augmentation to a permutation-sensitive MLP model), ❸ DPN (apply DPN to the same MLP model), ❹ HPN (apply HPN to the same MLP model) and ❺ Attention (use self-attention layers and a pooling function to achieve PI). The experimental settings are as follows. There are 2 agents in total. Each agent $i$ has only one feature dimension $x_i$. For convenience of analysis, each $x_i$ is an integer with $x_i \in \{1, 2, \ldots, 30\}$, i.e., each agent $i$ has only 30 different features. Thus the size of the joint state space ($[x_1, x_2]$) after concatenation is $30 \times 30 = 900$. To make the policy evaluation task permutation-invariant, we simply set the target value $Y$ of each state $[x_1, x_2]$ to $x_1 \cdot x_2$. In section 4.3, when introducing HPN, we asked whether we can provide an infinite number of candidate weight matrices for DPN. In this simple task, we can directly construct a separate weight matrix for each feature $x_i$. Since $x_i$ is discrete (and hence enumerable), we can explicitly maintain a parameter table and record a different embedding weight for each different feature $x_i$. We denote this direct method as ❻ DPN(∞), which means 'DPN with infinite weights'. Our HPN uses a hypernetwork to approximately achieve 'infinite weights' by generating a different weight matrix for each different input $x_i$. We use the Mean Square Error (MSE) as the loss function to train these different models. The code for this simple experiment is also available at https://github.com/tjuHaoXiaotian/API-Network (see 'Synthetic Policy Evaluation.py'). The learning curves and the converged MSE losses of these different PI models are shown in Fig. 11 and Table 1, respectively.
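The task setup above is small enough to enumerate. The following sketch builds the 900 states with targets $x_1 \cdot x_2$ and shows that a per-feature parameter table, in the spirit of DPN(∞), is expressive enough to represent the target exactly. It is not the released experiment script: the table entries are hand-picked rather than learned, and using `exp` as the function after pooling is an assumption made so that the example has a closed form.

```python
import itertools
import math

FEATURES = range(1, 31)                       # x_i in {1, ..., 30}

# All ordered joint states [x1, x2] with the PI target Y = x1 * x2.
dataset = {(x1, x2): x1 * x2 for x1, x2 in itertools.product(FEATURES, repeat=2)}

# One weight per distinct feature value (hand-picked here, learned in the experiment).
table = {x: math.log(x) / x for x in FEATURES}

def table_model(state):
    """Sum-pool x * table[x] over entities, then apply exp (assumed choice of rho)."""
    z = sum(x * table[x] for x in state)      # equals log(x1) + log(x2)
    return math.exp(z)

mse = sum((table_model(s) - y) ** 2 for s, y in dataset.items()) / len(dataset)
num_states = len(dataset)
num_unordered = len({tuple(sorted(s)) for s in dataset})
```

With two agents, the 900 ordered states collapse to 465 unordered ones, slightly more than half because the 30 states with $x_1 = x_2$ have no distinct permuted copy.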
We conclude that DPN(∞) > HPN > Attention > DPN > DeepSet > DA-MLP, where '>' means 'performs better than', which indicates that increasing the representational capacity of the model helps to achieve a much lower MSE loss.

(2) Since UPDeT uses a shared token embedding layer followed by multi-head self-attention layers to process all components of the input sets, the PI and PE properties are implicitly taken into consideration. The results of UPDeT-VDN also validate that incorporating PI and PE into the model design could reduce the observation space and improve the converged performance in most scenarios. (3) GNN-VDN achieves slightly better performance than SET-VDN. Although permutation invariance is maintained, GNN-VDN and SET-VDN perform worse than vanilla QMIX (especially in 3s5z_vs_3s6z and 6h_vs_8z, where the win rates are approximately 0%). This confirms that the use of a shared embedding layer $\phi(x_i)$ for each component $x_i$ limits the representational capacity and restricts the final performance. (4) DA-VDN significantly improves the learning speed and performance of vanilla VDN in 3s5z_vs_3s6z through data augmentation and many more parameter updates. However, its learning process is unstable and it collapses in all other scenarios due to the perturbation of the input features, which validates that it is hard to train a permutation-sensitive function (e.g., MLP) to output the same value when taking different orders of features as inputs.

A.3 ABLATION STUDIES

We also enlarge the agent network of vanilla VDN (denoted as BIG-VDN) such that the number of parameters is larger than that of our HPN-VDN. The detailed numbers of parameters are shown in Table 3. The results are shown in Fig. 13. We see that simply increasing the number of parameters cannot always guarantee better performance. For example, in 5m_vs_6m, the win rate of BIG-VDN is worse than that of vanilla VDN.
In 3s5z_vs_3s6z and 6h_vs_8z, BIG-VDN does achieve better performance, but the performance of BIG-VDN is still worse than that of our HPN-VDN in all scenarios.

A.3.2 IMPORTANCE OF THE PE OUTPUT LAYER AND THE CAPACITY OF THE PI INPUT LAYER

To validate the importance of the permutation-equivariant output layer, we also add the hypernetwork-based output layer of HPN to SET-VDN (denoted as HPN-SET-VDN). The results are shown in Fig. 13. We see that incorporating a PE output layer could significantly boost the performance of SET-VDN, and that the converged performance of HPN-SET-VDN is superior to the vanilla VDN in 5m vs 6m and 3s5z vs 3s6z. However, due to the limited representational capacity of the shared embedding layer of Deep Set, the performance of HPN-SET-VDN is still worse than our HPN-VDN, especially in 6h vs 8z. Note that the only difference between HPN-VDN and HPN-SET-VDN is the input layer, i.e., using hypernetwork-based customized embeddings or a single shared one. The results validate the importance of improving the representational capacity of the permutation-invariant input layer.

To demonstrate that our methods can be easily integrated into many types of MARL algorithms and boost their performance, we also apply HPN to a typical credit-assignment method, QPLEX (Wang et al., 2020a) (denoted as HPN-QPLEX), and a policy-based MARL algorithm, MAPPO (Yu et al., 2021) (denoted as HPN-MAPPO). The results are shown in Fig. 14 and Fig. 15. We see that HPN significantly improves the performance of QPLEX and MAPPO, which validates that our method can be easily combined with existing MARL algorithms and improves their performance (especially in super hard scenarios).

A.5 APPLYING HPN TO DEEP COORDINATION GRAPH

Recently, Deep Coordination Graph (DCG) (Böhmer et al., 2020) scaled traditional coordination-graph-based MARL methods to large state-action spaces, showed its ability to solve the relative overgeneralization problem, and obtained competitive results on StarCraft II micromanagement tasks. Building on DCG, Wang et al. (2021) propose an improved version, named Context-Aware SparsE Coordination graphs (CASEC). CASEC learns a sparse and adaptive coordination graph (Wang et al., 2021), which can largely reduce the communication overhead and improve the performance. Besides, CASEC incorporates action representations into the utility and payoff functions to reduce the estimation errors and alleviate the learning instability issue.

Both DCG and CASEC inject the permutation invariance inductive bias into the design of the pairwise payoff function $q_{ij}(a_i, a_j \mid o_i, o_j)$. They achieve permutation invariance by permuting the input order of $[o_i, o_j]$ and taking the average of both. To show the generality of our method, we also apply HPN to the utility function and payoff function of CASEC. In Fig. 16, we compare HPN-CASEC with the vanilla CASEC in 5m vs 6m. The results show that HPN can significantly improve the performance of CASEC, which validates that HPN is easy to implement and can be easily integrated into many existing MARL approaches.
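The averaging trick mentioned above for pairwise payoffs can be illustrated with a minimal sketch. It is not the DCG or CASEC implementation: `raw_payoff` is a hypothetical stand-in for an order-sensitive payoff network, observations are reduced to scalars for brevity, and the transposition of the swapped call is our assumption about how the action indices are kept aligned.

```python
import math

N_ACTIONS = 3  # illustrative number of actions per agent


def raw_payoff(o_i, o_j):
    """Hypothetical order-sensitive payoff table f[a_i][a_j] for the ordered pair (o_i, o_j)."""
    return [
        [math.tanh(0.7 * o_i * (a + 1) - 0.3 * o_j * (b + 1) + 0.1 * a * b)
         for b in range(N_ACTIONS)]
        for a in range(N_ACTIONS)
    ]


def symmetric_payoff(o_i, o_j):
    """Average the payoff over both input orders.

    The swapped call is read transposed so that entry [a_i][a_j] always means
    "agent i takes a_i and agent j takes a_j".
    """
    f_ij = raw_payoff(o_i, o_j)
    f_ji = raw_payoff(o_j, o_i)
    return [
        [0.5 * (f_ij[a][b] + f_ji[b][a]) for b in range(N_ACTIONS)]
        for a in range(N_ACTIONS)
    ]


q_ij = symmetric_payoff(0.4, -1.2)
q_ji = symmetric_payoff(-1.2, 0.4)
# Swapping the two agents only transposes the payoff table.
consistent = all(
    q_ij[a][b] == q_ji[b][a] for a in range(N_ACTIONS) for b in range(N_ACTIONS)
)
```

In this sketch the invariance comes from evaluating the network twice, whereas HPN builds it into the input and output layers.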

A.6 MULTIAGENT PARTICLE ENVIRONMENT

We evaluate the proposed DPN and HPN on the classical Multiagent Particle Environment (MPE) (Lowe et al., 2017) tasks, where the actions only consist of movement actions. Therefore, only the permutation invariance property is needed. We follow the experimental settings of PIC (Permutation-Invariant Critic for MADDPG, which utilizes a GNN to achieve PI, i.e., GNN-MADDPG) (Liu et al., 2020) and apply our DPN and HPN to the centralized critic Q-function of MADDPG (Lowe et al., 2017). Each component $x_i$ represents the concatenation of agent i's observation and action. The input set $X_j$ contains all agents' observation-actions. We implement the code based on the official PIC. The baselines we consider are PIC (Liu et al., 2020), DA-MADDPG (Ye et al., 2020) and MADDPG (Lowe et al., 2017). The tasks we consider are as follows:

• Cooperative navigation: n agents move cooperatively to cover L landmarks in the environment. The reward encourages the agents to get close to landmarks. An agent observes its location and velocity, and the relative location of the landmarks and other agents.
• Cooperative predator-prey: n slower predators work together to chase M fast-moving prey. The predators get a positive reward when colliding with prey. The prey are controlled by the environment. A predator observes its location and velocity, the relative location of the L landmarks and other predators, and the relative location and velocity of the prey.

The learning curves of different methods in the cooperative navigation task (the agent number n = 6) and the cooperative predator-prey task (the agent number n = 3) are given in Fig. 9. Besides, we further test HPN on two more cooperative navigation tasks with 100 and 200 agents respectively. The learning curves are shown in Figure 17. The results show that HPN-MADDPG can significantly improve the performance of vanilla MADDPG and achieves better sample efficiency and converged performance than PIC. All experiments are repeated for five runs with different random seeds. We see that our HPN-MADDPG outperforms the PIC, DA-MADDPG and MADDPG baselines in these two tasks, which validates the superiority of our permutation-invariant designs.

4. Each agent has 19 discrete actions, including moving, sliding, shooting and passing. Following the settings of CDS (Chenghao et al., 2021), we also make a reasonable change to the two half-court offensive scenarios: we lose if our players or the ball return to our half-court. All methods are tested with this modification. The final reward is +100 when our team wins, -1 when our player or the ball returns to our half-court, and 0 otherwise. We apply HPN to QMIX and compare it with the SOTA CDS-QMIX (Chenghao et al., 2021). In detail, when applying HPN to QMIX, both the PI actions, e.g., moving, sliding and shooting, and the PE actions, e.g., long pass, high pass and short pass, are considered. For each player, since the targets of these passing actions directly correspond to its teammates, we apply the PE output layer to generate the Q-values of these passing actions, where the hypernetwork takes each ally player's features as input and generates the weight matrices for the passing actions. Besides, in the official GRF environment, as we cannot directly control which teammate the current player passes the ball to, we take a max pooling over all ally-related Q-values to get the final Q-values for the three passing actions. We show the average win rate across 5 seeds in Fig. 10. HPN can significantly boost the performance of QMIX, and our HPN-QMIX outperforms the SOTA method CDS by a large margin in these two scenarios.
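The max pooling over ally-related Q-values described above for the three passing actions can be sketched as follows. This is a minimal illustration, not the released code: the function name `passing_q_values` and the numeric values are hypothetical, and each row is assumed to hold the (long pass, high pass, short pass) Q-values generated from one ally's features.

```python
def passing_q_values(per_ally_q):
    """Reduce per-ally passing Q-values to one Q-value per pass type.

    per_ally_q[j] = [q_long, q_high, q_short] generated from ally j's features.
    Taking the max over allies makes the result independent of ally order.
    """
    return [max(column) for column in zip(*per_ally_q)]


# Hypothetical Q-values for three allies.
per_ally_q = [
    [0.2, -0.1, 0.5],
    [0.7, 0.0, 0.1],
    [-0.3, 0.4, 0.3],
]

q_pass = passing_q_values(per_ally_q)
q_pass_shuffled = passing_q_values(list(reversed(per_ally_q)))
```

Because the environment does not expose the pass target, pooling collapses the PE outputs into a PI quantity; the pooled values are the same for any ordering of the allies.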

A.8 GENERALIZATION: CAN HPN GENERALIZE TO A NEW TASK WITH A DIFFERENT NUMBER OF AGENTS?

Apart from achieving PI and PE, another benefit of HPN is that it can naturally handle variable numbers of inputs and outputs. Therefore, as also stated in the conclusion section, HPN can potentially be used to design more efficient multitask learning and transfer learning algorithms. For example, we can directly transfer the HPN policy learned in one task to new tasks with different numbers of agents and improve the learning efficiency in the new tasks. Transfer learning results of

surrounded by enemy units. This challenges the allied units to overcome enemies approaching from multiple angles at once. Secondly, there are the reflect position scenarios. These randomly select positions for the allied units, and then reflect their positions in the midpoint of the map to get the enemy spawn positions. Example figures are shown in Figure 20

The target is to find an optimal joint policy $\pi$ that maximizes the expected return $R_t = \sum_{t=0}^{T} \gamma^t r(s_t, a_t)$, where $\gamma$ is a discount factor and $T$ is the time horizon. The joint action-value function is defined as $Q^{\pi}(s_t, a_t) = \mathbb{E}_{\pi, P}[R_t \mid s_t, a_t]$. Each agent's individual action-value function is denoted as $Q_i(o_i, a_i)$.

B.2 IMPLEMENTATION DETAILS OF OUR APPROACH AND BASELINES

The key points of implementing the baselines and our methods are summarized here:

(1) DPN and HPN: The two proposed methods inherently support heterogeneous scenarios since the entity's 'type' information is included in each entity's features. Moreover, among homogeneous agents, the sample efficiency can be further improved compared to a fixed-order representation. For MMM and MMM2, we implemented a permutation-equivariant 'rescue-action' module for the only Medivac agent, which uses similar prior knowledge to ASN and UPDeT, i.e., action semantics. To focus on the core idea of our methods, we omitted these details in the method section.

The objective used to train the weight selection network of DPN. As stated in the last paragraph of Section 4, all parameters of DPN are trained end-to-end with backpropagation according to the RL loss function. The weight selection network and the other networks work cooperatively to minimize the overall RL loss function.

How PI is achieved in DPN. As described in Section 4.2, for each $o_i[j]$, the weight selection network outputs the probability of selecting each weight matrix. During the forward pass at training step t, given the parameter snapshot of the weight selection network, the output probability of selecting each weight matrix is fixed. For each $o_i[j]$, we select the weight matrix with the maximum probability. However, directly selecting the argmax index is not differentiable. To make the selection process trainable, we apply a Straight Through Estimator [7] to get the one-hot encoding of the argmax index. We denote it as $p_{in}(o_i[j])$. The weight matrix with the maximum probability can be acquired by $p_{in}(o_i[j]) \cdot W_{in}$. As selecting the weight matrix with the maximum probability is a deterministic process, according to Equation (5), PI is guaranteed. Besides, to encourage more exploration at the beginning of training, we also add small Gumbel noise (Jang et al., 2016) to the 'logits' within the epsilon anneal time. Within this interval, PI cannot be strictly guaranteed. When the epsilon anneal schedule is over, PI is strictly guaranteed. The Straight Through Estimator written in PyTorch is shown below:

For image inputs, the rotational or reflectional symmetries are more prominent characteristics. Thus, we could leverage rotation invariance or rotation equivariance to design better MARL algorithms, which is also a novel research direction. For vector inputs, when the structural information is unknown, a potential solution is:

• (1) Learning action representations using a forward model. We want to learn action representations that can reflect the effects of actions on the environment and other agents. The effect of an action can be measured by the induced reward and the change in the states.
• (2) Using all actions' representations as queries and using all entities' embedded features (potentially generated by HPN) as keys and values, we leverage the self-attention mechanism to generate the Q-values of each action. Since the self-attention computation is invariant to the input entities' order, PI and PE are achieved, and the input-output relationships may be learned implicitly by the self-attention mechanism.

Automatically detecting such structural information is interesting and we leave this as future work.



For brevity, we use PI/PE as abbreviations of permutation invariance/permutation equivariance (nouns) or permutation-invariant/permutation-equivariant (adjectives) depending on the context.
A permutation matrix has exactly a single unit value in every row and column and zeros everywhere else.
Note that the size of $A_{equiv}$ can be smaller than m, i.e., only corresponding to a subset of the entities (e.g., enemies or teammates). We use m here to simplify the notation.
For brevity and clarity, all linear layers are described without an explicit bias term; adding one does not affect the analysis.
$d_o$ is the dimension of $o_i[j]$ and $d_h$ is the dimension of $h_i$.
We use the binary comparison operators here to indicate the performance order of these algorithms.



Figure 1: A motivation example in SMAC.

Figure 3: An FC-layer in submodule view (left); dynamic permutation network architecture (right).

Minimal Modification Principle. Although modifying different parts of {A, B, C, D} may also achieve equation 1, the main advantage of our proposal is that we can keep the backbone module B (and output module C) unchanged. Since existing algorithms have invested a lot in handling MARL-specific problems, e.g., they usually incorporate Recurrent Neural Networks (RNNs) into the backbone module B to handle the partially observable inputs, we believe that achieving PI and PE without modifying the backbone architectures of the underlying MARL algorithms is beneficial. Following this minimal modification principle, our proposed method can be more easily plugged into many types of existing MARL algorithms, which will be shown in section 5. In the following, we propose two designs to implement the PI module A and the PE module D.

, modules A, C and D are Fully Connected (FC) layers and module B is a Deep Neural Network (DNN) that usually incorporates RNNs to handle the partially observable inputs. Given an input $o_i$ containing m entities' features, an FC-layer under a submodule view is shown in Fig. 3 (left), which has m independent weight matrices. If we denote W in = W in 1 , . . . W in m

Figure 4: PI and PE network with hypernetworks.

Figure 5: Comparisons of HPN-QMIX, HPN-VDN against fine-tuned QMIX and fine-tuned VDN.

PE Output Layer D. Similarly, we build a hypernetwork $hp_{out}(o_i[j]) = \mathrm{MLP}(o_i[j])$ to generate the weight matrices $W_{out}$ for the output layer D. The j-th output of layer D is computed as:

Figure 7: Ablation studies.

Figure 8: Apply HPN to QPLEX and MAPPO.

Figure 9: Apply HPN and DPN to MADDPG.

Figure 10: Apply HPN to GRF.

Figure 12: Comparisons of VDN-based methods considering the PI and PE properties.

Figure 13: Ablation studies. All methods are equipped with VDN.

Figure 14: The learning curves of HPN-QPLEX compared with vanilla QPLEX in the hard and super hard scenarios of SMAC.

Figure 15: The learning curves of HPN-MAPPO compared with the vanilla MAPPO in the hard and super hard scenarios of SMAC.

Figure 16: The learning curves of HPN-CASEC and CASEC in 5m vs 6m.

Figure 17: Comparisons of HPN-MADDPG against PIC, DA-MADDPG and MADDPG in cooperative navigation with 100 and 200 agents.

(a) Transfer learning results on 12m. Red: reload the policy learned in 5m to 12m and then continue training the policy. Blue: learn from scratch.
(b) Transfer learning results on 8m vs 10m. Red: reload the policy learned in 5m vs 6m to 8m vs 10m and then continue training the policy. Blue: learn from scratch.
(c) Transfer learning results on 3s vs 5z. Red: reload the policy learned in 3s vs 3z to 3s vs 5z and then continue training the policy. Blue: learn from scratch.

Figure 18: Transferring the learned HPN-VDN policy in one task to a new task with a different number of agents.

below.

• Random Unit Types: Battles in SMACv2 do not always feature units of the same type each time, as they did in SMAC. Instead, units are spawned randomly according to certain pre-fixed probabilities. Units in StarCraft II are split up into different races. Units from different races cannot be on the same team. For each of the three races (Protoss, Terran, and Zerg), SMACv2 uses three unit types. Detailed generation probabilities are shown in Figure 21.

Figure 20: Examples of the two different types of start positions, opposite and surrounded. Allied units are shown in blue and enemy units in dark red.

Figure 21: Detailed generation probabilities of the three types of units for the three races (Protoss, Terran, and Zerg).

Figure 22: The learning curves of HPN-VDN and VDN in 3 difficult scenarios of SMAC-v2.

To make module D permutation-equivariant, module D must know the entities' order in $o_i$. Thus we have to add $o_i$ as an additional input to D, i.e., $\pi_i(a_{equiv} \mid o_i) = D(h_i, o_i)$. Then, we only have to modify the architecture of D such that $D(h_i, g o_i) = g D(h_i, o_i) = g \pi_i(a_{equiv} \mid o_i), \forall g \in G$.
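The equivariance condition above can be checked numerically on a toy output layer. The sketch below is an assumption-laden illustration rather than the paper's module D: `hyper_out` is a hypothetical one-layer hypernetwork with random fixed weights, and we assume the j-th output is the inner product of $h_i$ with a weight vector generated from $o_i[j]$.

```python
import math
import random

D_O, D_H = 3, 4  # illustrative entity-feature and hidden dimensions
_rng = random.Random(0)
# Hypothetical hypernetwork weights: map an entity (size D_O) to a weight vector (size D_H).
HYPER_W = [[_rng.gauss(0, 1) for _ in range(D_O)] for _ in range(D_H)]


def hyper_out(entity):
    """Generate the output weight vector for one entity from its own features."""
    return [math.tanh(sum(w * x for w, x in zip(row, entity))) for row in HYPER_W]


def pe_output(h, entities):
    """D(h, o): the j-th output uses the weights generated from the j-th entity."""
    return [sum(hk * wk for hk, wk in zip(h, hyper_out(e))) for e in entities]


h = [0.5, -1.0, 0.25, 2.0]
o = [[1.0, 0.0, 0.3], [0.2, 0.9, -0.4], [-0.7, 0.1, 0.8]]
perm = [2, 0, 1]  # a permutation g of the entity order

out = pe_output(h, o)
out_permuted_input = pe_output(h, [o[p] for p in perm])
permuted_out = [out[p] for p in perm]
```

Here `out_permuted_input` equals `permuted_out`, i.e., permuting the entities permutes the outputs in the same way, because each output depends only on $h_i$ and its own entity.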

, we first feed all $o_i[j]$s (colored by different blues) into the shared hypernetwork $hp_{in}(o_i[j])$ (colored in yellow), whose input size is $d_o$ and output size is $d_o d_h$. Then, we reshape the output for each $o_i[j]$ to $d_o \times d_h$ and regard it as the weight matrix $W_{in}[j]$. Different $o_i[j]$s will generate different weight matrices, and the same $o_i[j]$ will always correspond to the same one no matter where it is arranged. The output of layer A is computed according to equation 7. Since each $o_i[j]$ is embedded separately by its corresponding $W_{in}[j]$ and the embeddings are then merged by a PI 'sum' function, the PI property is ensured.
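The steps above (generate a weight matrix per entity, embed each entity with its own matrix, then sum) can be sketched in a few lines. This is a minimal illustration under stated assumptions: `hyper_in` is a hypothetical one-layer hypernetwork with random fixed weights standing in for the MLP, and the dimensions are arbitrary.

```python
import math
import random

D_O, D_H = 3, 2  # illustrative entity-feature and hidden dimensions
_rng = random.Random(0)
# Hypothetical hypernetwork weights: D_O inputs, D_O * D_H outputs.
HYPER_W = [[_rng.gauss(0, 1) for _ in range(D_O)] for _ in range(D_O * D_H)]


def hyper_in(entity):
    """Generate a D_O x D_H weight matrix from a single entity's features."""
    flat = [math.tanh(sum(w * x for w, x in zip(row, entity))) for row in HYPER_W]
    return [flat[r * D_H:(r + 1) * D_H] for r in range(D_O)]


def pi_input(entities):
    """Embed each entity with its own generated matrix and sum the embeddings."""
    h = [0.0] * D_H
    for e in entities:
        w = hyper_in(e)
        for k in range(D_H):
            h[k] += sum(e[r] * w[r][k] for r in range(D_O))
    return h


o = [[1.0, 0.0, 0.3], [0.2, 0.9, -0.4], [-0.7, 0.1, 0.8]]
h_original = pi_input(o)
h_shuffled = pi_input([o[2], o[0], o[1]])
same = all(math.isclose(a, b, abs_tol=1e-12) for a, b in zip(h_original, h_shuffled))
```

The two results agree up to floating-point summation order, since the weight matrix of each entity depends only on that entity's features and not on its position.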

Comparisons of HPN and DPN against baselines considering the PI and PE properties.

5.1.2 COMPARISON WITH PI AND PE BASELINES

The baselines we compare with include: (1) DA-QMIX: Following Ye et al. (2020), we apply data augmentation to QMIX by generating more training data through shuffling the input order and using the generated data to additionally update the parameters. (2) SET-QMIX: We apply Deep Set (Li et al., 2021) to each $Q_i(a_i \mid o_i)$ of QMIX, i.e., all $x_i$ use a shared embedding layer and are then aggregated by sum pooling, which can be considered as a special case of DPN, i.e., DPN(1). (3) GNN-QMIX: We apply GNN (Liu et al., 2020; Wang et al., 2020b) to each $Q_i(a_i \mid o_i)$ of QMIX to achieve PI. (4) ASN-QMIX (Wang et al., 2019): ASN-QMIX models the influences of different actions on other agents based on their semantics, e.g., move or attack, which is similar to the PE property considered in this paper. (5) UPDeT-QMIX (Hu et al., 2021b): a recent MARL algorithm based on Transformer which implicitly considers the PI and PE properties.

, MLP) achieve PI solely through training. The learning curves of these PI/PE baselines equipped with VDN are shown in Appendix A.2 and show similar results. All implementation details are given in Appendix B.2. Besides, in Appendix A.1, we also test and analyze these methods via a simple policy evaluation experiment.

The comparison of the converged MSE losses of these different PI models.

The comparison of the learning curves of different PI models.

If we keep each $x_i \in \{1, 2, \ldots, 30\}$ and simply increase the agent number, the changes of the original state space by simple concatenation and the reduced state space by using PI representations are shown in Table 2 below. We see that by using PI representations, the state space can be significantly reduced.
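As a rough illustration of this reduction (not a reproduction of Table 2), the two counts can be computed directly. The sketch assumes that concatenation distinguishes every ordering, giving $30^n$ states, and that a PI representation identifies states that differ only by ordering, so the count becomes the number of multisets of size $n$ drawn from 30 values, $\binom{n + 29}{n}$. The function names are ours.

```python
from math import comb


def ordered_states(n_agents, n_values=30):
    """States under simple concatenation: every ordering counts separately."""
    return n_values ** n_agents


def unordered_states(n_agents, n_values=30):
    """States under a PI representation: multisets of size n_agents."""
    return comb(n_agents + n_values - 1, n_agents)


for n in (1, 2, 3, 5, 10):
    print(n, ordered_states(n), unordered_states(n))
```

For example, with two agents the concatenated space has 900 states while the order-free space has 465, and the gap widens quickly as the agent number grows.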

Changes of the state space size with the increase of the agent number.

The number of parameters of the individual Q-networks in VDN, QMIX, BIG-VDN, BIG-QMIX, HPN-VDN, HPN-QMIX, UPDeT-VDN and UPDeT-QMIX.

The codes of CASEC and HPN-CASEC are also available at https://github.com/tjuHaoXiaotian/API-Network (see code/src/config/algs/casec.yaml and code/src/config/algs/hpn casec.yaml).

The feature composition of the observation and the state in Google Research Football

ACKNOWLEDGMENTS

This work is supported by the National Natural Science Foundation of China (Grant No. 62106172), the "New Generation of Artificial Intelligence" Major Project of Science & Technology 2030 (Grant No. 2022ZD0116402), and the Science and Technology on Information Systems Engineering Laboratory (Grant No. WDZC20235250409, No. WDZC20205250407). We thank Zipeng Dai for the insightful discussions and helpful writing suggestions. We are very grateful for the valuable and constructive comments of all reviewers.


Published as a conference paper at ICLR 2023

5m → 12m, 5m vs 6m → 8m vs 10m, 3s vs 3z → 3s vs 5z are shown in Fig. 18. We see that the previously trained HPN policies can serve as better initialization policies for new tasks.

A.9 TO ACHIEVE PI AND PE, WHAT IF WE JUST SORT THE ENTITIES ACCORDING TO DISTANCE FROM THE FOCAL AGENT?

(1) When we first started working on this project, we also considered a similar baseline: we sort the entities according to (type, distance), i.e., according to their types first and then by their relative distances if two entities' types are the same. But we found that this solution does not always work well. Here, we provide the learning curves of HPN, QMIX, VDN, SORT-QMIX, and SORT-VDN in 4 hard and super hard scenarios on SMAC in Figure 19. The results show that the sorting baseline can slightly improve the performance of vanilla QMIX/VDN in 5m vs 6m and 3s5z vs 3s6z. However, in 8m vs 9m and 6h vs 8z, it harms the performance.

(2) The reason is that each entity has many types of features, e.g., relative x, relative y, relative distance, entity type, health point, shield, etc. Relative distance is just one of them. Simply sorting the entities by their relative distances while ignoring the influences of the other features may not be appropriate. Besides, as different x and y can have the same distance and different distances can have the same order, the same $o_i[j]$ may be arranged at different positions and be multiplied by different 'weight matrices' (according to). Therefore, learning may become unstable if we frequently reorder the inputs by distance only.

(3) Thus, our target is not only matching the observation and action belonging to the same entity but also stabilizing the learning process by always assigning the same weight matrix $W_{in}[j]$, i.e., a stable weight, to the same entity features $o_i[j]$ no matter where $o_i[j]$ is arranged. In this paper, we propose DPN and HPN to achieve this.

A.10 EVALUATE HPN ON SMAC-V2

SMAC-v2 makes three major changes to SMAC: randomising start positions, randomising unit types, and restricting the agent field-of-view and shooting range to a cone. The first two changes add randomness to challenge contemporary MARL algorithms. The third change makes features harder to infer and adds the challenge that agents must actively gather information (which requires more efficient exploration). Since our target is not to design more efficient exploration algorithms, we keep the field-of-view and attack range of the agents as a full circle, as in SMAC.

• Random Start Positions: Random start positions come in two different types. First, there is the surrounded type, where the allied units are spawned in the middle of the map, and

(2) Data Augmentation (DA) (Ye et al., 2020): we apply the core idea of Data Augmentation (Ye et al., 2020) to SMAC by randomly generating a number of permutation matrices to shuffle the 'observation', 'state', 'action' and 'available action mask' of each sample simultaneously to generate more training data. An illustration of the Data Augmentation process is shown in Fig. 23. A noteworthy detail is that since the attack actions are permutation-equivariant to the enemies in the observation, the same permutation matrix $M_2$ that is used to permute $o_i^{enemy}$ should also be applied to permute the 'attack action' and the 'available action mask of attack action'. The code is implemented based on PyMARL2 (see footnote 8) for fair comparison.

(3) Deep Set (Zaheer et al., 2017; Li et al., 2021): the only difference between SET-QMIX and the vanilla QMIX is that the vanilla QMIX uses a fully connected layer to process the fixed-order concatenation of the m components in $o_i$, while SET-QMIX uses a shared embedding layer $h_i = \phi(x_i)$ to separately process each component $x_i$ in $o_i$ first, and then aggregates all $h_i$s by sum pooling. The code is also implemented based on PyMARL2 for fair comparison.

(4) GNN: Following PIC (Liu et al., 2020) and DyAN (Wang et al., 2020b), we apply GNN to the individual Q-network of QMIX (denoted as GNN-QMIX) to achieve permutation invariance. The code is also implemented based on PyMARL2.

(5) ASN (Wang et al., 2019): we use the official code and adapt it to PyMARL2 for fair comparison.

(6) UPDeT (Hu et al., 2021b): we use the official code (see footnote 9) and adapt it to PyMARL2 for fair comparison.

(7) VDN and QMIX: As mentioned in Section 3, vanilla VDN/QMIX uses fixed-order entity inputs and fixed-order action outputs (both are sorted by agent/enemy indices). Although VDN and QMIX do not explicitly consider the permutation invariance and permutation equivariance properties, they train a permutation-sensitive function to figure out the input-output relationships according to their fixed positions, which is implicit and inefficient.

The codes for the baselines are also published at https://github.com/tjuHaoXiaotian/API-Network.

8 https://github.com/hijkzzz/pymarl2
9 https://github.com/hhhusiyi-monash/UPDeT

C HYPERPARAMETER SETTINGS

For all MARL algorithms we use in SMAC (Samvelyan et al., 2019) (under the MIT License), we keep the hyperparameters the same as in PyMARL2 (Hu et al., 2021a) (under the Apache License v2.0). We list the detailed hyperparameter settings used in the paper in Table 5 to help peers replicate our experiments more easily.

E LIMITATIONS

We currently follow the settings of (Qin et al., 2022; Wang et al., 2020b; Hu et al., 2021b; Long et al., 2019; Wang et al., 2019), where the configuration of the input-output relationships and the observation/state structures are set manually.

What if the observations are images or the structural information is not available?

The high-level idea of this paper is to leverage some forms of symmetry to reduce the size of the search space. Since typical MARL benchmarks represent observations as factorizable vectors (which can provide more direct and compact information than images), we currently focus on the permutation symmetries, i.e., PI and PE.

