MULTI-AGENT POLICY TRANSFER VIA TASK RELATIONSHIP MODELING

Abstract

Team adaptation to new cooperative tasks is a hallmark of human intelligence that has yet to be fully realized in learning agents. Previous works on multi-agent transfer learning accommodate teams of different sizes but rely heavily on the generalization ability of neural networks for adapting to unseen tasks. We posit that the relationship among tasks provides the key information for policy adaptation. To exploit this relationship for efficient transfer, we discover and utilize the knowledge shared among tasks from different teams: we propose to learn effect-based task representations as a common latent space among tasks and design an alternatively fixed training scheme for learning them. We demonstrate that the task representations capture the relationship among teams and generalize to unseen tasks. As a result, the proposed method transfers learned cooperation knowledge to new tasks after training on only a few source tasks, and the transferred policies also help solve tasks that are hard to learn from scratch.

1. INTRODUCTION

Cooperation in human groups is characterized by resiliency to unexpected changes and purposeful adaptation to new tasks (Tjosvold, 1984). This flexibility and transferability of cooperation is a hallmark of human intelligence. Computationally, multi-agent reinforcement learning (MARL) (Zhang et al., 2021a) provides an important means for machines to imitate human cooperation. Although recent MARL research has made prominent progress in many aspects of cooperation, such as policy decentralization (Lowe et al., 2017; Rashid et al., 2018; Wang et al., 2021a;c; Cao et al., 2021), communication (Foerster et al., 2016; Jiang & Lu, 2018), and organization (Jiang et al., 2019; Wang et al., 2020a; 2021b), how to realize group knowledge transfer remains an open question.

Compared to single-agent knowledge reuse (Zhu et al., 2020), a unique challenge faced by multi-agent transfer learning is the varying size of agent groups: the number of agents and the length of observation inputs in unseen tasks may differ from those in source tasks. To handle this, existing multi-agent transfer learning approaches build population-invariant (Long et al., 2019) and input-length-invariant (Wang et al., 2020c) learning structures using graph neural networks (Agarwal et al., 2020) and attention mechanisms such as transformers (Hu et al., 2021; Zhou et al., 2021). Although these methods handle varying populations and input lengths well, their transfer to unseen tasks mainly depends on the inherent generalization ability of neural networks; the relationship among tasks in MARL is not fully exploited for more efficient transfer. To address this shortcoming, we study the discovery and utilization of common structures in multi-agent tasks and propose Multi-Agent Transfer reinforcement learning via modeling TAsk Relationship (MATTAR).
In this learning framework, we capture the common structure of tasks by modeling the similarity among the transition and reward functions of different tasks. Specifically, we train a forward model for all source tasks that predicts the observation, state, and reward at the next timestep given the current observation, state, and actions. The challenge is how to embody both the similarity and the difference among tasks in this forward model. We introduce the difference by giving each source task a unique representation, and model the similarity by generating the parameters of the forward model via a shared hypernetwork, which we call the representation explainer. To learn a well-formed representation space that encodes task relationships, we propose an alternatively fixed training scheme for the task representations and the representation explainer. During training, the representations of the source tasks are pre-defined and fixed as mutually orthogonal vectors, and the representation explainer is learned by minimizing the forward-model prediction loss on all source tasks. When facing an unseen task, we fix the representation explainer and backpropagate gradients through the fixed forward model to learn the representation of the new task from a few samples.

Furthermore, we design a population-invariant policy network conditioned on the learned task representation. During policy training, the representations of all source tasks are fixed, and the policy is updated to maximize the expected return over all source tasks. On an unseen task, we obtain the transferred policy by simply plugging the new task representation into the learned policy network. On the SMAC (Samvelyan et al., 2019) and MPE (Lowe et al., 2017) benchmarks, we empirically show that knowledge learned from source tasks can be transferred to a series of unseen tasks with high success rates. We also pinpoint several other advantages brought by our method.
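The two-phase scheme above can be sketched with a linear stand-in for both the representation explainer and the forward model. All sizes, variable names, and the synthetic data below are illustrative assumptions, not the paper's actual architecture: the point is only that source-task representations are fixed as mutually orthogonal vectors, while on an unseen task the explainer is frozen and only the new representation is fit by gradient descent on the prediction loss.

```python
import numpy as np

rng = np.random.default_rng(0)
z_dim, in_dim, out_dim = 4, 6, 3   # illustrative sizes, not from the paper

# Representation explainer: a linear hypernetwork G mapping a task
# representation z to the parameters W(z) of a linear forward model.
G = rng.normal(scale=0.5, size=(in_dim * out_dim, z_dim))

def forward_model(z, x):
    W = (G @ z).reshape(in_dim, out_dim)  # parameters generated from z
    return x @ W                          # predicted (next obs, reward)

# Phase 1: representations of three source tasks are fixed as mutually
# orthogonal vectors; only G would be trained on their prediction loss.
source_z = np.eye(z_dim)[:3]

# Phase 2: on an unseen task, freeze G and learn only z_new by gradient
# descent on the forward-model prediction error over a few samples.
x = rng.normal(size=(32, in_dim))
true_z = np.array([0.5, -0.3, 0.8, 0.1])  # ground truth for this sketch
y = forward_model(true_z, x)              # synthetic (next obs, reward) targets

z_new = np.zeros(z_dim)
lr = 0.05
for _ in range(5000):
    err = forward_model(z_new, x) - y     # (batch, out_dim) prediction error
    dW = x.T @ err / len(x)               # d(loss)/dW of the mean squared error
    dz = G.T @ dW.reshape(-1)             # chain rule through W(z) = G z
    z_new -= lr * dz

print(np.round(z_new, 3))                 # recovers true_z on this toy problem
```

Because the loss is quadratic in z here, gradient descent recovers the ground-truth representation; in the paper's setting the forward model is a learned network, but the same frozen-explainer gradient flow applies.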
First, finetuning the transferred policy on unseen tasks achieves better performance than learning from scratch, indicating that the task representation and pre-trained policy network provide a good initialization point. Second, training on multiple source tasks yields better performance than training on each of them individually and than other multi-task learning methods, showing that MATTAR also serves as a method for multi-agent multi-task learning. Finally, although not designed for this goal, our architecture achieves performance comparable to that of single-task learning algorithms when trained on single tasks.

2. METHOD

In this paper, we focus on knowledge transfer among fully cooperative multi-agent tasks that can be modeled as a Dec-POMDP (Oliehoek & Amato, 2016) consisting of a tuple G = ⟨I, S, A, P, R, Ω, O, n, γ⟩, where I is the finite set of n agents, s ∈ S is the true state of the environment, and γ ∈ [0, 1) is the discount factor. At each timestep, each agent i receives an observation o_i ∈ Ω drawn according to the observation function O(s, i) and selects an action a_i ∈ A. Individual actions form a joint action a ∈ A^n, which leads to a next state s′ according to the transition function P(s′|s, a) and a reward r = R(s, a) shared by all agents. Each agent has a local action-observation history τ_i ∈ T ≡ (Ω × A)* × Ω. Agents learn to collectively maximize the global action-value function Q_tot(s, a) = 𝔼_{s_{0:∞}, a_{0:∞}}[Σ_{t=0}^{∞} γ^t R(s_t, a_t) | s_0 = s, a_0 = a] (with a slight abuse of notation: the subscripts of s and a here indicate the timestep, while the subscripts of observations and actions elsewhere in this paper index the agent).

Overall, our framework first trains on several source tasks {S_i} and then transfers the learned cooperative knowledge to unseen tasks {T_j}. As shown in Fig. 1, our learning framework achieves this by designing modules for (1) task representation learning and (2) policy learning. In the following sections, we first introduce the design of the representation learning module and its learning scheme in different phases. We then describe the details of policy learning, including the population-invariant structure for dealing with inputs and outputs of varying sizes.
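The objective that Q_tot estimates is the expected discounted return of the shared team reward. A minimal sketch of this quantity for a single episode (the reward values and discount factor are illustrative):

```python
def discounted_return(rewards, gamma=0.99):
    """Monte-Carlo return Σ_t γ^t R(s_t, a_t) for one episode,
    where each element of `rewards` is the scalar team reward
    shared by all agents in the Dec-POMDP."""
    return float(sum(gamma ** t * r for t, r in enumerate(rewards)))

# Q_tot(s, a) is the expectation of this return over trajectories
# starting from state s with joint action a; e.g. with γ = 0.5:
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```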

2.1. TASK REPRESENTATION LEARNING

Our main idea for achieving knowledge transfer among multi-agent tasks is to capture and exploit both the common structure and the unique characteristics of tasks by learning task representations. A task distinguishes itself from other tasks through its transition and reward functions. Therefore, we incorporate



Figure 1: Transfer learning scheme of our method. The black arrows indicate the direction of data flow and the red ones indicate the direction of gradient flow. The dashed arrows indicate the flow between the hypernetwork and the generated network.

