MULTI-AGENT MULTI-GAME ENTITY TRANSFORMER

Abstract

Building large-scale generalist pre-trained models for many tasks is an emerging and promising direction in reinforcement learning (RL). Models such as Gato and the Multi-Game Decision Transformer have demonstrated outstanding performance and generalization capabilities across many games and domains. However, developing highly capable generalist models in multi-agent RL (MARL) remains largely unexplored, even though it could substantially accelerate progress towards general AI. To fill this gap, we propose the Multi-Agent multi-Game ENtity TrAnsformer (MAGENTA), which takes an entity perspective orthogonal to prior time-sequential modeling. Specifically, to handle the different state/observation spaces of different games, we treat games as languages, aligning each game with a single language and thus training a separate "tokenizer" per game alongside a shared transformer. The feature inputs are split by entity and tokenized into the same continuous space. We then propose two types of transformer-based models as permutation-invariant architectures that handle varying numbers of entities and capture attention over different entities. MAGENTA is trained on Honor of Kings, StarCraft II micromanagement, and Neural MMO with a single set of transformer weights. Extensive experiments show that MAGENTA can play games across various categories with arbitrary numbers of agents and improves fine-tuning efficiency on new games and scenarios by 50%-100%. See our project page at https://sites.google.com/view/rl-magenta.

1. INTRODUCTION

In recent years, transformer-based models, as a means of building large-scale generalist models, have made substantial progress in natural language processing (Brown et al., 2020; Devlin et al., 2018), computer vision (Dosovitskiy et al., 2020; Bao et al., 2021), and graph learning (Yun et al., 2019; Rong et al., 2020). Furthermore, they are showing their potential in reinforcement learning (RL) (Reed et al., 2022; Lee et al., 2022; Wen et al., 2022) by modeling and solving sequential decision problems. However, there are a few inherent challenges in building large-scale general RL agents. First, because RL agents are typically trained and tested in the same environment, they are inclined to overfit the training environment and lack generalizability to unseen environments. As a result, a model often needs to be retrained from scratch for each new task. Second, it is challenging for a single model to adapt to environments that differ in the numbers of agents, states, observations, actions, and dynamics. Third, training from scratch normally incurs high computational cost, especially for large-scale RL. For example, AlphaStar requires 16 TPUs training for 14 days, and Honor of Kings (HoK) requires 19,600 CPU cores and 168 V100 GPUs training for nearly half a month. Thus, building a general, reusable, and efficient RL model has become an increasingly important task for both industrial and non-industrial research. To this end, we investigate whether a single model, with a single set of parameters, can be trained by playing multiple multi-agent games in an online manner, a question left open by Gato (Reed et al., 2022) and MGDT (Lee et al., 2022). We consider training on Honor of Kings (HoK), StarCraft II micromanagement (SMAC), and Neural MMO (NMMO), informally asking: Can models learn some general knowledge about games across various categories?
In this paper, we answer this question by proposing the Multi-Agent multi-Game ENtity TrAnsformer (MAGENTA). We frame the problem as few-shot transfer learning under the hypothesis that a single model can play many games and can be adapted to never-before-seen games or scenarios via fine-tuning. For interpretability, we align the design of MAGENTA with the Entity Component System (ECS) architectural pattern of video games and with the multilingual transformer. Specifically, we treat different games as different languages, aligning each game with a single language as shown in Fig. 3. Just as different languages have different tokenizers, so do different games. We also expect the learned representations across games to be analogous to word2vec embeddings in NLP. In this view, our transformer acts as a multilingual transformer that captures common knowledge across games. We split the feature input according to different entities and tokenize the features into the same continuous space. Unlike existing applications of the transformer as a causal time-sequential model, the transformer in MAGENTA serves as a permutation-invariant architecture that attends to the features of different entities. We propose two types of transformer architecture, Encoder-Pooling (EP) and Encoder-Decoder (ED), to build permutation-invariant models; the output of such a model does not change under any permutation of the elements in the input. Lastly, we present MAGENTA's training scheme, describing how the model is pre-trained and transferred. As a step towards developing a general model in RL/MARL, our contributions are threefold. First, by aligning games with languages, we show that it is possible to train a single generalist agent to act across multiple environments in an online manner. Second, we present permutation-invariant models that accommodate varying numbers of agents in different games.
Third, we find that MAGENTA can be rapidly fine-tuned in an online fashion to different scenarios within a single game, to a new type of game, and to different numbers of agents. Furthermore, we release the pre-trained models and code to encourage further research in this direction.
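To make the Encoder-Pooling idea above concrete, the sketch below is a minimal, hypothetical NumPy illustration (not the paper's implementation): each game gets its own "tokenizer" (here just a per-game linear projection, with dimensions chosen for illustration) mapping entity features into a shared token space, a shared self-attention layer attends over entity tokens, and mean-pooling makes the final embedding invariant to the order of entities.

```python
import numpy as np

rng = np.random.default_rng(0)
D_TOKEN = 8  # shared token dimension (illustrative)

# Per-game "tokenizers": e.g. game A entities carry 5 features, game B entities 3.
W_game_a = rng.normal(size=(5, D_TOKEN))
W_game_b = rng.normal(size=(3, D_TOKEN))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens):
    """Single-head self-attention over entity tokens (projection weights omitted)."""
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[-1])
    return softmax(scores, axis=-1) @ tokens

def encode(entity_features, tokenizer_W):
    tokens = entity_features @ tokenizer_W  # (n_entities, D_TOKEN): shared token space
    attended = self_attention(tokens)       # entity-to-entity attention
    return attended.mean(axis=0)            # pooling -> permutation-invariant embedding

# A game-A observation with 4 entities: shuffling entity order leaves the output unchanged.
obs = rng.normal(size=(4, 5))
perm = rng.permutation(4)
z1 = encode(obs, W_game_a)
z2 = encode(obs[perm], W_game_a)
print(np.allclose(z1, z2))  # True
```

Because self-attention is equivariant to row permutations and the mean pool is symmetric, the composition is permutation-invariant regardless of the number of entities, which is what allows a single model to handle varying agent counts.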

2.1. REINFORCEMENT LEARNING AND MULTI-AGENT RL

An RL problem is generally studied as a Markov decision process (MDP) (Bellman, 1957), defined by the tuple MDP = (S, A, P, r, γ, T), where S ⊆ R^n is an n-dimensional state space, A ⊆ R^m an m-dimensional action space, P : S × A × S → R_+ a transition probability function, r : S → R a bounded reward function, γ ∈ (0, 1] a discount factor, and T a time horizon. In an MDP, an agent receives the current state s_t ∈ S from the environment and performs an action a_t ∈ A defined by a policy π_θ : S → A parameterized by θ. The objective of the agent is to learn an optimal policy: π_{θ*} := argmax_{π_θ} E_{π_θ}[ Σ_{i=0}^{T} γ^i r_{t+i} | s_t = s ]. A MARL problem is formulated as a decentralised partially observable Markov decision process (Dec-POMDP) (Bernstein et al., 2002), described as a tuple ⟨n, S, A, P, R, O, Ω, γ⟩, where n represents the number of agents. S, A, P, R are the global versions of their MDP counterparts. O = {O_i}_{i=1,...,n} denotes the space of observations of all agents. Each agent i receives a private observation o_i ∈ O_i according to the observation function Ω(s, i) : S → O_i.
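The objective above maximizes the expected discounted return Σ_{i=0}^{T} γ^i r_{t+i}. As a minimal sketch, the following computes that return for a fixed, illustrative reward sequence:

```python
def discounted_return(rewards, gamma):
    """Discounted return: sum of gamma^i * r_{t+i} over a finite horizon."""
    return sum(gamma ** i * r for i, r in enumerate(rewards))

# Three steps of reward 1.0 with gamma = 0.5: 1 + 0.5 + 0.25 = 1.75
print(discounted_return([1.0, 1.0, 1.0], 0.5))  # 1.75
```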

2.2. ATTENTION MECHANISM IN TRANSFORMER

One of the most essential components of the Transformer (Vaswani et al., 2017) is the attention mechanism, which captures the interrelationships within input sequences. Assume that we have n query vectors (corresponding to a set with n elements), each with dimension d_q: Q ∈ R^{n×d_q}. The attention function is written as Attention(Q, K, V) = ω(QK^⊤)V, which maps the queries Q to outputs using n_v key-value pairs K ∈ R^{n_v×d_q}, V ∈ R^{n_v×d_v}. The pairwise dot product QK^⊤ ∈ R^{n×n_v} measures how similar each pair of query and key vectors is, with weights computed by an activation function ω. The output ω(QK^⊤)V ∈ R^{n×d_v} is a weighted sum of V, where a value gains more weight if its corresponding key has a larger dot product with the query. Self-attention refers to the case where Q, K, and V are derived from the same input. Multi-head attention extends attention by computing h attention functions in parallel and outputting a linear transformation of the concatenation of all attention outputs.



We present this paper in the scope of MARL, and use the term "feature" interchangeably with "observation".

