UPDET: UNIVERSAL MULTI-AGENT REINFORCEMENT LEARNING VIA POLICY DECOUPLING WITH TRANSFORMERS

Abstract

Recent advances in multi-agent reinforcement learning have largely been limited to training one model from scratch for every new task. This limitation stems from model architectures with fixed input and output dimensions, which hinders the accumulation and transfer of a learned agent's experience across tasks of differing difficulty (e.g., 3 vs 3 or 5 vs 6 multi-agent games). In this paper, we make the first attempt to explore a universal multi-agent reinforcement learning pipeline, designing a single architecture to fit tasks with different observation and action configuration requirements. Unlike previous RNN-based models, we utilize a transformer-based model to generate a flexible policy by decoupling the policy distribution from the intertwined input observation, using an importance weight determined with the aid of the self-attention mechanism. Compared to a standard transformer block, the proposed model, which we name Universal Policy Decoupling Transformer (UPDeT), further relaxes the action restriction and makes the decision process of the multi-agent task more explainable. UPDeT is general enough to be plugged into any multi-agent reinforcement learning pipeline and to equip it with strong generalization abilities that enable multiple tasks to be handled at a time. Extensive experiments on large-scale SMAC multi-agent competitive games demonstrate that the proposed UPDeT-based multi-agent reinforcement learning achieves significant improvements relative to state-of-the-art approaches, demonstrating advantageous transfer capability in terms of both performance and training speed (10 times faster).

1. INTRODUCTION

Reinforcement Learning (RL) provides a framework for decision-making problems in an interactive environment, with applications including robotics control (Hester et al. (2010)), video gaming (Mnih et al. (2015)), autonomous driving (Bojarski et al. (2016)), person search (Chang et al. (2018)) and vision-language navigation (Zhu et al. (2020)). Cooperative multi-agent reinforcement learning (MARL), a long-standing problem in the RL context, involves organizing multiple agents to achieve a goal, and is thus a key tool used to address many real-world problems, such as mastering multi-player video games (Peng et al. (2017)) and studying population dynamics (Yang et al. (2017)).
A number of methods have been proposed that exploit an action-value function to learn a multi-agent model (Sunehag et al. (2017), Rashid et al. (2018), Du et al. (2019), Mahajan et al. (2019), Hostallero et al. (2019), Zhou et al. (2020), Yang et al. (2020)). However, current methods have poor representation learning ability and fail to exploit the common structure underlying the tasks, because they tend to treat observations from different entities in the environment as an integral whole. Accordingly, they tacitly assume that neural networks are able to automatically decouple the observation to find the best mapping between the whole observation and the policy. Adopting this approach means that they treat all information from other agents or different parts of the environment in the same way. The most commonly used method involves concatenating 2020)), which makes zero-shot transfer impossible. Thus, current methods are of limited use in real-world applications.
Our solution to these problems is to develop a MARL framework with no limitation on input or output dimension. Moreover, this model should be general enough to be applicable to any existing MARL method.
More importantly, the model should be explainable and capable of providing further improvement for both the final performance on single-task scenarios and transfer capability on multi-task scenarios.
Inspired by the self-attention mechanism (Vaswani et al. (2017)), we propose a transformer-based MARL framework, named Universal Policy Decoupling Transformer (UPDeT). There are four key advantages of this approach: 1) once trained, it can be universally deployed; 2) it provides a more robust representation with a policy decoupling strategy; 3) it is more explainable; 4) it is general enough to be applied to any MARL model. We further design a transformer-based function to handle various observation sizes by treating individual observations as "observation-entities". We match each observation-entity with an action-group by separating the action space into several action-groups with reference to the corresponding observation-entity, which gives us a set of matched observation-entity/action-group pairs. We further use a self-attention mechanism to learn the relationship between the matched observation-entity and the other observation-entities. Through the use of the self-attention map and the embedding of each observation-entity, UPDeT can optimize the policy at an action-group level. We refer to this strategy as Policy Decoupling. By combining the transformer and policy decoupling strategies, UPDeT significantly outperforms conventional RNN-based models. In UPDeT, there is no need to introduce any new parameters for new tasks. We also prove that it is only with a decoupled policy and matched observation-entity/action-group pairs that UPDeT can learn a strong representation with high transfer capability. Finally, our proposed UPDeT can be plugged into any existing method with almost no changes to the framework architecture required, while still bringing significant improvements to the final performance, especially in hard and complex multi-agent tasks.
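To make the idea of policy decoupling more concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes that every observation-entity shares one feature dimension, that a "basic" action-group is read from the agent's own entity, and that each enemy entity yields exactly one interaction action; the single-head, single-layer attention, the class name `DecoupledPolicySketch` and all weight names are hypothetical stand-ins for the transformer described above. The point it illustrates is that one fixed parameter set can serve tasks with different numbers of entities and actions.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class DecoupledPolicySketch:
    """Hypothetical single-head, single-layer stand-in for the transformer."""

    def __init__(self, feat_dim, emb_dim, n_basic_actions, seed=0):
        rng = random.Random(seed)
        self.emb_dim = emb_dim
        self.W_embed = rand_matrix(emb_dim, feat_dim, rng)  # shared by all entities
        self.W_q = rand_matrix(emb_dim, emb_dim, rng)
        self.W_k = rand_matrix(emb_dim, emb_dim, rng)
        self.W_v = rand_matrix(emb_dim, emb_dim, rng)
        # Assumed action-group heads: own entity -> basic actions,
        # each enemy entity -> one interaction action.
        self.W_basic = rand_matrix(n_basic_actions, emb_dim, rng)
        self.W_interact = rand_matrix(1, emb_dim, rng)

    def attend(self, tokens):
        """Scaled dot-product self-attention over any number of tokens."""
        q = [matvec(self.W_q, t) for t in tokens]
        k = [matvec(self.W_k, t) for t in tokens]
        v = [matvec(self.W_v, t) for t in tokens]
        scale = math.sqrt(self.emb_dim)
        outputs, attn_map = [], []
        for qi in q:
            scores = [sum(a * b for a, b in zip(qi, kj)) / scale for kj in k]
            weights = softmax(scores)
            attn_map.append(weights)
            outputs.append([sum(w * vj[d] for w, vj in zip(weights, v))
                            for d in range(self.emb_dim)])
        return outputs, attn_map

    def q_values(self, own, allies, enemies):
        """Return per-action values assembled action-group by action-group."""
        entities = [own] + allies + enemies
        tokens = [matvec(self.W_embed, e) for e in entities]
        outputs, attn_map = self.attend(tokens)
        q = matvec(self.W_basic, outputs[0])       # basic group from own entity
        for h in outputs[1 + len(allies):]:        # one value per enemy entity
            q.append(matvec(self.W_interact, h)[0])
        return q, attn_map


# Usage: the same parameters handle a 3 vs 3 and a 5 vs 6 scenario.
rng = random.Random(1)


def random_entity():
    return [rng.uniform(-1.0, 1.0) for _ in range(4)]


policy = DecoupledPolicySketch(feat_dim=4, emb_dim=8, n_basic_actions=6)
q_small, attn_small = policy.q_values(
    random_entity(), [random_entity() for _ in range(2)],
    [random_entity() for _ in range(3)])
q_large, attn_large = policy.q_values(
    random_entity(), [random_entity() for _ in range(4)],
    [random_entity() for _ in range(6)])
print(len(q_small), len(q_large))  # 9 12
```

Because the action values are assembled per entity rather than produced by one fixed-width output layer, the number of outputs grows with the number of enemy entities while the parameter count stays constant. The rows of the returned attention map are the kind of importance weights that could be inspected when explaining a decision.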
The main contributions of this work are as follows: First, our UPDeT-based MARL framework outperforms RNN-based frameworks by a large margin in terms of final performance on state-of-



Figure 1: An overview of the MARL framework. Our work replaces the widely used GRU/LSTM-based individual value function with a transformer-based function. Actions are separated into action groups according to observations.

availability

//github.com/

