RPM: GENERALIZABLE MULTI-AGENT POLICIES FOR MULTI-AGENT REINFORCEMENT LEARNING

Abstract

Despite recent advances in multi-agent reinforcement learning (MARL), MARL agents easily overfit their training environment and perform poorly in evaluation scenarios where the other agents behave differently. Obtaining generalizable policies for MARL agents is therefore necessary but challenging, mainly due to complex multi-agent interactions. In this work, we model the MARL problem with Markov Games and propose a simple yet effective method, called ranked policy memory (RPM), which maintains a look-up memory of policies to achieve good generalizability. The main idea of RPM is to train MARL policies by gathering massive amounts of multi-agent interaction data. In particular, we first rank each agent's policies by its training episode return, i.e., the episode return of each agent in the training environment; we then save the ranked policies in the memory; when an episode starts, each agent can randomly select a policy from RPM as its behavior policy. Each agent uses the behavior policy to gather multi-agent interaction data for MARL training. This self-play framework guarantees the diversity of multi-agent interactions in the training data. Experimental results on Melting Pot demonstrate that RPM enables MARL agents to interact with unseen agents in multi-agent generalization evaluation scenarios and complete the given tasks, boosting performance by up to 818% on average.

1. INTRODUCTION

In Multi-Agent Reinforcement Learning (MARL) (Yang & Wang, 2020), each agent acts decentrally and interacts with other agents to complete given tasks or achieve specified goals via reinforcement learning (RL) (Sutton & Barto, 2018). In recent years, much progress has been made in MARL research (Vinyals et al., 2019; Jaderberg et al., 2019; Perolat et al., 2022). However, agents trained with current MARL methods tend to generalize poorly (Hupkes et al., 2020) to new environments. This generalization issue is critical to real-world MARL applications (Leibo et al., 2021), yet it is mostly neglected in current research. In this work, we aim to train MARL agents that can adapt to new scenarios where other agents' policies are unseen during training. As an example, we illustrate a two-agent hunting game in Fig. 1. The objective of the two agents is to catch the stag together, as one agent acting alone cannot catch the stag and risks being killed. The trained agents may perform well in evaluation scenarios similar to the training environment, as shown in Fig. 1 (a) and (b), but they often fail when evaluated in scenarios different from the training ones. As shown in Fig. 1 (c), the learning agent (called the focal agent, following (Leibo et al., 2021)) is supposed to work together with the other agent (called the background agent, also following (Leibo et al., 2021)), which is pre-trained and can capture both the hare and the stag. In this case, the focal agent cannot capture the stag without help from its teammate. The teammate may be tempted to catch the hare alone and not cooperate, or may only choose to cooperate with the focal agent after capturing the hare. The focal agent should thus adapt to its teammate's behavior in order to catch the stag. However, the policy of the background agent is unseen to the focal agent during training; without generalization, agents trained as in Fig. 1 (left) cannot achieve an optimal policy in the new evaluation scenario.

Inspired by the fact that human learning is often accelerated by interacting with individuals of diverse skills and experiences (Meltzoff et al., 2009; Tomasello, 2010), we propose a novel method that improves the generalization of MARL by collecting diverse multi-agent interactions. Concretely, we first model the MARL problem with Markov Games (Littman, 1994) and then propose a simple yet effective method called ranked policy memory (RPM) to attain generalizable policies. The core idea of RPM is to maintain a look-up memory of the agents' policies during training. In particular, we evaluate the trained agents' policies after each training update, rank them by their training episode returns, and save them in the memory. In this way, the memory covers policies at various performance levels. When an episode starts, each agent can access the memory and load a randomly sampled policy to replace its current behavior policy. The resulting ensemble of policies enables the agents, in self-play, to collect diversified experiences in the training environment. These diversified experiences contain many novel multi-agent interactions that enhance the extrapolation capacity of MARL, thus boosting generalization performance. We note that an easy extension incorporating different behavior properties as keys in RPM could potentially further enrich generalization, but we leave it for future work. We implement RPM on top of the state-of-the-art MARL algorithm MAPPO (Yu et al., 2021). To verify its effectiveness, we conduct large-scale experiments on Melting Pot (Leibo et al., 2021), a well-recognized benchmark for MARL generalization evaluation.
The experimental results demonstrate that RPM significantly boosts the performance of generalized social behaviors, by up to 818% on average, and outperforms many baselines across a variety of multi-agent generalization evaluation scenarios. Our code, pictorial examples, videos, and experimental results are available at this link: https://sites.google.com/view/rpm-iclr2023/.
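The ranked policy memory described above can be sketched as a score-keyed look-up table. The following minimal Python sketch is illustrative only (the class name, bucket width, and dictionary representation of policy checkpoints are our own assumptions, not the paper's implementation): policies are ranked by training episode return, stored under their rank, and sampled uniformly at episode start to serve as behavior policies.

```python
import random
from collections import defaultdict


class RankedPolicyMemory:
    """Illustrative sketch of a ranked policy memory (RPM)."""

    def __init__(self, bucket_size=10.0):
        self.bucket_size = bucket_size   # width of one return bucket, i.e., one "rank"
        self.memory = defaultdict(list)  # rank -> list of saved policy checkpoints

    def save(self, policy_params, episode_return):
        """Rank a policy by its training episode return and store it."""
        rank = int(episode_return // self.bucket_size)
        self.memory[rank].append(policy_params)

    def sample(self):
        """Pick a rank uniformly at random, then a policy uniformly within that rank."""
        rank = random.choice(list(self.memory.keys()))
        return random.choice(self.memory[rank])


# At each episode start, an agent may swap in a sampled behavior policy:
rpm = RankedPolicyMemory(bucket_size=10.0)
rpm.save({"step": 100}, episode_return=3.5)   # stored under rank 0
rpm.save({"step": 200}, episode_return=27.0)  # stored under rank 2
behavior_policy = rpm.sample()
```

Sampling first over ranks and then over policies within a rank keeps low-return policies as likely to be replayed as high-return ones, which is what makes the collected multi-agent interaction data diverse.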

2. PRELIMINARIES

Markov Games. We consider Markov Games (Littman, 1994) represented by a tuple $G = \langle \mathcal{N}, \mathcal{S}, \mathcal{A}, \mathcal{O}, P, R, \gamma, \rho \rangle$. $\mathcal{N}$ is a set of agents of size $|\mathcal{N}| = N$; $\mathcal{S}$ is a set of states; $\mathcal{A} = \times_{i=1}^{N} \mathcal{A}^i$ is the set of joint actions, with $\mathcal{A}^i$ denoting the action set of agent $i$; $\mathcal{O} = \times_{i=1}^{N} \mathcal{O}^i$ is the observation set, with $\mathcal{O}^i$ denoting the observation set of agent $i$; $P: \mathcal{S} \times \mathcal{A} \rightarrow \mathcal{S}$ is the transition function; $R = \times_{i=1}^{N} r^i$ is the reward function, where $r^i: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ specifies the reward for agent $i$ given the state and the joint action; $\gamma$ is the discount factor; and the initial states are determined by a distribution $\rho: \mathcal{S} \rightarrow [0, 1]$. Given a state $s \in \mathcal{S}$, each agent $i \in \mathcal{N}$ chooses its action $u^i$ and obtains the reward $r^i(s, u)$ with the private observation $o^i \in \mathcal{O}^i$, where $u = \{u^i\}_{i=1}^{N}$ is the joint action. The joint policy of the agents is denoted as $\pi_{\theta} = \{\pi_{\theta_i}\}_{i=1}^{N}$, where $\pi_{\theta_i}: \mathcal{S} \times \mathcal{A}^i \rightarrow [0, 1]$ is the policy of agent $i$. The objective of each agent is to maximize its total expected return $R^i = \sum_{t=0}^{\infty} \gamma^t r_t^i$.

Multi-Agent RL. In MARL, multiple agents act in the multi-agent system to maximize their respective returns with RL. Each agent's policy $\pi^i$ is optimized by maximizing the following objective: $J(\pi^i) \triangleq \mathbb{E}_{s_{0:\infty} \sim \rho_G^{0:\infty},\, a_{0:\infty}^i \sim \pi^i} \left[ \sum_{t=0}^{\infty} \gamma^t r_t^i \right]$, where $J(\pi^i)$ is a performance measure for policy-gradient RL methods (Williams, 1992; Lillicrap et al., 2016; Fujimoto et al., 2018). Each policy's Q value $Q^i$ is optimized by minimizing the following regression loss (Mnih et al., 2015) with TD-learning (Sutton, 1984): $\mathcal{L}(\theta_i) \triangleq \mathbb{E}_{\mathcal{D}' \sim \mathcal{D}} \left[ \left( y_t^i - Q_{\theta_i}^i(s_t, u_t, s_t^i, u_t^i) \right)^2 \right]$, where $y_t^i$ is the TD target and $\mathcal{D}'$ is a mini-batch sampled from the replay buffer $\mathcal{D}$.
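The discounted return and the TD regression loss above can be illustrated with a small numeric sketch. This is not the paper's implementation: the function names are ours, and the one-step target $y_t^i = r_t^i + \gamma Q'$ used below is the standard form from the cited TD-learning literature rather than a definition given in this section.

```python
def discounted_return(rewards, gamma):
    """R^i = sum_{t>=0} gamma^t * r_t^i, the quantity each agent maximizes."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))


def td_regression_loss(q_value, reward, gamma, next_q):
    """Squared TD error (y_t^i - Q^i)^2 with the one-step target y = r + gamma * Q'."""
    td_target = reward + gamma * next_q
    return (td_target - q_value) ** 2


# Agent i receives rewards [1.0, 0.0, 2.0] under gamma = 0.9:
ret = discounted_return([1.0, 0.0, 2.0], gamma=0.9)  # 1.0 + 0.0 + 0.81 * 2.0 = 2.62

# One transition with current Q = 1.5, reward 1.0, and bootstrap Q' = 1.0:
loss = td_regression_loss(q_value=1.5, reward=1.0, gamma=0.9, next_q=1.0)  # (1.9 - 1.5)^2
```

In practice the expectation over $\mathcal{D}' \sim \mathcal{D}$ is approximated by averaging this per-transition loss over a sampled mini-batch.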



Figure 1: Two-Agent Hunting Game. (a) Training environment: two agents (hunters) hunt in the environment. (b) After training in the training environment, both agents behave cooperatively to capture the stag. (c) In the new evaluation scenario, one agent is picked as the focal agent (in the magenta circle) and paired with a pre-trained agent (in the brown circle) that behaves in different ways, in order to evaluate the performance of the selected agent. The conventional evaluation protocol fails to assess such behavior, and current MARL methods easily fail to learn the optimal policy due to the lack of diversified multi-agent interaction data during training.

