A POLICY GRADIENT ALGORITHM FOR LEARNING TO LEARN IN MULTIAGENT REINFORCEMENT LEARNING

Anonymous authors. Paper under double-blind review.

Abstract

A fundamental challenge in multiagent reinforcement learning is to learn beneficial behaviors in a shared environment with other agents that are also simultaneously learning. In particular, each agent perceives the environment as effectively non-stationary due to the changing policies of other agents. Moreover, each agent is itself constantly learning, leading to natural non-stationarity in the distribution of experiences encountered. In this paper, we propose a novel meta-multiagent policy gradient theorem that directly accounts for the non-stationary policy dynamics inherent to these multiagent settings. This is achieved by modeling our gradient updates to directly consider both an agent's own non-stationary policy dynamics and the non-stationary policy dynamics of the other agents interacting with it in the environment. We find that our theoretically grounded approach provides a general solution to the multiagent learning problem, which inherently combines key aspects of previous state-of-the-art approaches on this topic. We test our method on several multiagent benchmarks and demonstrate a more efficient ability to adapt to new agents as they learn than previous related approaches across the spectrum of mixed incentive, competitive, and cooperative environments.

1. INTRODUCTION

Learning in multiagent settings is inherently more difficult than single-agent learning because an agent interacts both with the environment and with other agents (Buşoniu et al., 2010). Specifically, the fundamental challenge in multiagent reinforcement learning (MARL) is the difficulty of learning optimal policies in the presence of other simultaneously learning agents, because their changing behaviors jointly affect the environment's transition and reward functions. This dependence on non-stationary policies renders the Markov property invalid from the perspective of each agent, requiring agents to adapt their behaviors with respect to potentially large, unpredictable, and endless changes in the policies of fellow agents (Papoudakis et al., 2019). In such environments, it is also critical that agents adapt to the changing behaviors of others in a very sample-efficient manner, as it is likely that their policies could update again after a small number of interactions (Al-Shedivat et al., 2018). Therefore, effective agents should consider the learning of other agents and adapt quickly to non-stationary behaviors. Otherwise, undesirable outcomes may arise when an agent is constantly lagging in its ability to deal with the current policies of other agents.

In this paper, we propose a new framework based on meta-learning for addressing the inherent non-stationarity of MARL. Meta-learning (also referred to as learning to learn) was recently shown to be a promising methodology for fast adaptation in multiagent settings. The framework by Al-Shedivat et al. (2018), for example, introduces a meta-optimization scheme by which a meta-agent can adapt more efficiently to changes in a new opponent's policy after collecting only a handful of interactions. The key idea underlying their meta-optimization is to model the meta-agent's learning process so that its updated policy performs better than that of an evolving opponent.
However, their work does not directly consider the opponent's learning process in the meta-optimization, treating the evolving opponent as an external factor and assuming the meta-agent cannot influence the opponent's future policy. As a result, their work fails to consider an important property of MARL: the opponent is also a learning agent, changing its policy based on trajectories collected by interacting with the meta-agent. As such, the meta-agent has an opportunity to influence the opponent's future policy by changing the distribution of trajectories, and the meta-agent can take advantage of this opportunity to improve its performance during learning.

Our contribution. With this insight, we develop a new meta-multiagent policy gradient theorem (Meta-MAPG) that directly models the learning processes of all agents in the environment within a single objective function. We start from the meta-policy gradient theorem of Al-Shedivat et al. (2018) and extend it based on the multiagent stochastic policy gradient theorem (Wei et al., 2018) to derive a novel meta-policy gradient theorem. This is achieved by removing the unrealistic implicit assumption of Al-Shedivat et al. (2018) that the learning of other agents in the environment is not dependent on an agent's own behavior. Interestingly, performing our derivation with this more general set of assumptions inherently results in an additional term that was not present in the previous work of Al-Shedivat et al. (2018). We observe that this added term is closely related to the process of shaping the learning dynamics of other agents in the framework of Foerster et al. (2018a). As such, our work can be seen as contributing a theoretically grounded framework that unifies the collective benefits of previous work by Al-Shedivat et al. (2018) and Foerster et al. (2018a). Meta-MAPG is evaluated on a diverse suite of multiagent domains, including the full spectrum of mixed incentive, competitive, and cooperative environments.
Our experiments demonstrate that Meta-MAPG consistently results in superior adaptation performance in the presence of novel evolving agents.

Figure 1: (a) A Markov chain of joint policies representing the inherent non-stationarity of MARL. Each agent updates its policy leveraging a Markovian update function, resulting in a change to the joint policy. (b) A probabilistic graph for Meta-MAPG. Unlike Meta-PG, our approach actively influences the future policies of other agents as well, through the peer learning gradient.

2. PRELIMINARIES

Interactions between multiple agents can be represented by stochastic games (Shapley, 1953). Specifically, an $n$-agent stochastic game is defined as a tuple $\mathcal{M}^n = \langle \mathbf{I}, \mathbf{S}, \mathbf{A}, \mathbf{P}, \mathbf{R}, \gamma \rangle$; $\mathbf{I} = \{1, \ldots, n\}$ is the set of $n$ agents, $\mathbf{S}$ is the set of states, $\mathbf{A} = \times_{i \in \mathbf{I}} \mathbf{A}^i$ is the set of joint action spaces, $\mathbf{P} : \mathbf{S} \times \mathbf{A} \to \Delta(\mathbf{S})$ is the state transition probability function, $\mathbf{R} = \times_{i \in \mathbf{I}} R^i$ is the set of reward functions, and $\gamma \in [0, 1)$ is the discount factor. We typeset sets in bold for clarity. Each agent $i$ executes an action at each timestep $t$ according to its stochastic policy $a^i_t \sim \pi^i(a^i_t | s_t; \phi^i)$ parameterized by $\phi^i$, where $s_t \in \mathbf{S}$. A joint action $\mathbf{a}_t = \{a^i_t, \mathbf{a}^{-i}_t\}$ yields a transition from the current state $s_t$ to the next state $s_{t+1} \in \mathbf{S}$ with probability $\mathbf{P}(s_{t+1} | s_t, \mathbf{a}_t)$, where the notation $-i$ indicates all other agents with the exception of agent $i$. Agent $i$ then obtains a reward according to its reward function $r^i_t = R^i(s_t, \mathbf{a}_t)$. At the end of an episode, the agents have collected a trajectory $\tau_{\boldsymbol{\phi}}$ under the joint policy with parameters $\boldsymbol{\phi}$, where $\tau_{\boldsymbol{\phi}} := (s_0, \mathbf{a}_0, \mathbf{r}_0, \ldots, \mathbf{r}_H)$, $\boldsymbol{\phi} = \{\phi^i, \boldsymbol{\phi}^{-i}\}$ represents the joint parameters of all policies, $\mathbf{r}_t = \{r^i_t, \mathbf{r}^{-i}_t\}$ is the joint reward, and $H$ is the horizon of the trajectory or episode.
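As a concrete illustration, the tuple and trajectory notation above can be sketched in code. The class, function names, and the trivial one-state game below are our own simplifications for illustration, not part of the paper:

```python
import random
from dataclasses import dataclass
from typing import Callable

@dataclass
class StochasticGame:
    """Minimal container for the tuple M^n = <I, S, A, P, R, gamma>."""
    n_agents: int
    states: list          # S
    actions: list         # A: one action list per agent
    P: Callable           # P(s, joint_a) -> next state (sampled transition)
    R: Callable           # R(s, joint_a) -> tuple of per-agent rewards
    gamma: float

def rollout(game, policies, s0, horizon, rng=None):
    """Collect one trajectory tau = (s_0, a_0, r_0, ..., r_H) under the joint policy."""
    rng = rng or random.Random(0)
    traj, s = [], s0
    for _ in range(horizon + 1):
        joint_a = tuple(pi(s, rng) for pi in policies)  # a_t = {a^i_t, a^-i_t}
        r = game.R(s, joint_a)                          # r_t = {r^i_t, r^-i_t}
        traj.append((s, joint_a, r))
        s = game.P(s, joint_a)                          # s_{t+1} ~ P(.|s_t, a_t)
    return traj
```

For instance, a trivial one-state, two-agent game with constant rewards yields a trajectory of length H+1, one entry per timestep.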

2.1. A MARKOV CHAIN OF POLICIES

The perceived non-stationarity in multiagent settings results from a distribution of sequential joint policies, which can be represented by a Markov chain (Al-Shedivat et al., 2018). Formally, a Markov chain of policies begins from a stochastic game between agents with an initial set of joint policies parameterized by $\boldsymbol{\phi}_0$. We assume that each agent updates its policy leveraging a Markovian update function that changes the policy after every $K$ trajectories. After this time period, each agent $i$ adapts its policy to maximize the expected return expressed as its value function:
$$V^i_{\boldsymbol{\phi}_0}(s_0) = \mathbb{E}_{\tau_{\boldsymbol{\phi}_0} \sim p(\tau_{\boldsymbol{\phi}_0} | \phi^i_0, \boldsymbol{\phi}^{-i}_0)}\Big[\sum_{t=0}^{H} \gamma^t r^i_t \,\Big|\, s_0\Big] = \mathbb{E}_{\tau_{\boldsymbol{\phi}_0} \sim p(\tau_{\boldsymbol{\phi}_0} | \phi^i_0, \boldsymbol{\phi}^{-i}_0)}\big[G^i(\tau_{\boldsymbol{\phi}_0})\big],$$
where $G^i$ denotes agent $i$'s discounted return from the beginning of an episode with initial state $s_0$. The joint policy update results in a transition from $\boldsymbol{\phi}_0$ to the updated set of joint parameters $\boldsymbol{\phi}_1$. The Markov chain continues for a maximum chain length of $L$ (see Figure 1a). This Markov chain perspective highlights the following inherent aspects of the experienced non-stationarity:

Sequential dependency. The future joint policy parameters $\boldsymbol{\phi}_{1:L} = \{\boldsymbol{\phi}_1, \ldots, \boldsymbol{\phi}_L\}$ sequentially depend on $\boldsymbol{\phi}_0$, since a change in $\tau_{\boldsymbol{\phi}_0}$ results in a change in $\boldsymbol{\phi}_1$, which in turn affects $\tau_{\boldsymbol{\phi}_1}$ and all successive joint policy updates up to $\boldsymbol{\phi}_L$.

Controllable levels of non-stationarity. As in Al-Shedivat et al. (2018) and Foerster et al. (2018a), we assume stationary policies during the collection of $K$ trajectories, with the joint policy update happening afterward. In such a setting, it is possible to control the non-stationarity by adjusting the $K$ and $H$ hyperparameters: smaller $K$ and $H$ increase the rate at which agents change their policies, leading to a higher degree of non-stationarity in the environment. In the limit of $K = H = 1$, all agents change their policy at every step.
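The discounted return $G^i$ appearing in this value function is straightforward to compute from a trajectory's reward sequence; a minimal sketch (function name ours):

```python
def discounted_return(rewards, gamma):
    """G^i(tau): discounted return sum_{t=0}^{H} gamma^t * r^i_t for one trajectory,
    given agent i's reward sequence [r^i_0, ..., r^i_H]."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))
```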

3. LEARNING TO LEARN IN MULTIAGENT REINFORCEMENT LEARNING

This section explores learning policies that can adapt quickly to non-stationarity in the policies of other agents in the environment. To achieve this, we leverage meta-learning and devise a new meta-multiagent policy gradient theorem that exploits the inherent sequential dependencies of MARL discussed in the previous section. Specifically, our meta-agent addresses this non-stationarity by considering its current policy's impact on its own adapted policies while actively influencing the future policies of other agents as well by inducing changes to the distribution of trajectories. In this section, we first outline the meta-optimization process in MARL and then discuss how the meta-policy gradient theorem of Al-Shedivat et al. (2018) optimizes for this objective while ignoring the dependence of the future policy of other agents on our current policy. Finally, we derive a new extension of this policy gradient theorem that explicitly leverages this dependence and discuss how to interpret the impact of the resulting form of the gradient.

3.1. GRADIENT BASED META-OPTIMIZATION IN MULTIAGENT REINFORCEMENT LEARNING

We formalize the meta-objective of MARL as optimizing meta-agent $i$'s initial policy parameters $\phi^i_0$ so that it maximizes the expected adaptation performance over a Markov chain of policies drawn from a stationary initial distribution of policies for the other agents $p(\boldsymbol{\phi}^{-i}_0)$:
$$\max_{\phi^i_0} \; \mathbb{E}_{p(\boldsymbol{\phi}^{-i}_0)}\Big[\sum_{\ell=0}^{L-1} V^i_{\boldsymbol{\phi}_{0:\ell+1}}(s_0, \phi^i_0)\Big], \quad \text{s.t.} \quad V^i_{\boldsymbol{\phi}_{0:\ell+1}}(s_0, \phi^i_0) = \mathbb{E}_{\tau_{\boldsymbol{\phi}_{0:\ell}} \sim p(\tau_{\boldsymbol{\phi}_{0:\ell}} | \phi^i_{0:\ell}, \boldsymbol{\phi}^{-i}_{0:\ell})}\Big[\mathbb{E}_{\tau_{\boldsymbol{\phi}_{\ell+1}} \sim p(\tau_{\boldsymbol{\phi}_{\ell+1}} | \phi^i_{\ell+1}, \boldsymbol{\phi}^{-i}_{\ell+1})}\big[G^i(\tau_{\boldsymbol{\phi}_{\ell+1}})\big]\Big],$$
where $\tau_{\boldsymbol{\phi}_{0:\ell}} = \{\tau_{\boldsymbol{\phi}_0}, \ldots, \tau_{\boldsymbol{\phi}_\ell}\}$ and $V^i_{\boldsymbol{\phi}_{0:\ell+1}}(s_0, \phi^i_0)$ denotes the meta-value function. This meta-value function generalizes the notion of each agent's primitive value function for the current set of policies $V^i_{\boldsymbol{\phi}_0}(s_0)$ over the length of the Markov chain of policies. In this work, as in Al-Shedivat et al. (2018), we follow the MAML (Finn et al., 2017) meta-learning framework. As such, we assume that the Markov chain of policies is governed by a policy gradient update function, corresponding to what is generally referred to as the inner-loop optimization in the meta-learning literature:
$$\phi^i_{\ell+1} := \phi^i_\ell + \alpha^i \nabla_{\phi^i_\ell} \mathbb{E}_{\tau_{\boldsymbol{\phi}_\ell} \sim p(\tau_{\boldsymbol{\phi}_\ell} | \phi^i_\ell, \boldsymbol{\phi}^{-i}_\ell)}\big[G^i(\tau_{\boldsymbol{\phi}_\ell})\big], \qquad \boldsymbol{\phi}^{-i}_{\ell+1} := \boldsymbol{\phi}^{-i}_\ell + \boldsymbol{\alpha}^{-i} \nabla_{\boldsymbol{\phi}^{-i}_\ell} \mathbb{E}_{\tau_{\boldsymbol{\phi}_\ell} \sim p(\tau_{\boldsymbol{\phi}_\ell} | \phi^i_\ell, \boldsymbol{\phi}^{-i}_\ell)}\big[\mathbf{G}^{-i}(\tau_{\boldsymbol{\phi}_\ell})\big], \tag{4}$$
where $\alpha^i$ and $\boldsymbol{\alpha}^{-i}$ denote the learning rates used by each agent in the environment.
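A minimal sketch of one inner-loop update in the style of Equation (4), for a hypothetical one-step matrix game with a tabular softmax policy. All names, and the use of an exact expectation over actions instead of sampled trajectories with a baseline, are our simplifications:

```python
import math

def softmax(phi):
    z = [math.exp(p) for p in phi]
    total = sum(z)
    return [v / total for v in z]

def inner_loop_step(phi, rewards, alpha):
    """One update phi_{l+1} = phi_l + alpha * grad_phi E[G], for a one-step game
    in which action a deterministically yields return rewards[a].

    The gradient uses the score-function (policy gradient) form,
    grad_phi E[G] = E_a[grad_phi log pi(a|phi) * rewards[a]],
    computed here as an exact expectation over the small action set."""
    pi = softmax(phi)
    grad = []
    for k in range(len(phi)):
        # For a softmax policy: d log pi(a) / d phi_k = 1{a == k} - pi_k
        g = sum(pi[a] * ((1.0 if a == k else 0.0) - pi[k]) * rewards[a]
                for a in range(len(phi)))
        grad.append(g)
    return [p + alpha * g for p, g in zip(phi, grad)]
```

Repeated updates shift probability mass toward the higher-reward action, mimicking how each agent's inner loop drives the Markov chain of policies forward.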

3.2. THE META-POLICY GRADIENT THEOREM

Intuitively, if we optimize the meta-value function, we are searching for initial parameters $\phi^i_0$ such that successive inner-loop optimization steps with Equation (4) result in adapted parameters $\phi^i_{\ell+1}$ that can perform better than the updated policies of other agents with policy parameters $\boldsymbol{\phi}^{-i}_{\ell+1}$ (see Figure 1b).

Algorithm 1 Meta-Learning at Training Time
Require: $p(\boldsymbol{\phi}^{-i}_0)$: distribution over other agents' initial policies; $\alpha, \beta$: learning rates
1: Randomly initialize $\phi^i_0$
2: while $\phi^i_0$ has not converged do
3:   Sample a meta-train batch of $\boldsymbol{\phi}^{-i}_0 \sim p(\boldsymbol{\phi}^{-i}_0)$
4:   for each $\boldsymbol{\phi}^{-i}_0$ do
5:     for $\ell = 0, \ldots, L$ do
6:       Sample and store trajectory $\tau_{\boldsymbol{\phi}_\ell}$
7:       Compute $\boldsymbol{\phi}_{\ell+1} = f(\boldsymbol{\phi}_\ell, \tau_{\boldsymbol{\phi}_\ell}, \alpha)$ from inner-loop optimization (Equation (4))
8:     end for
9:   end for
10:  Update $\phi^i_0 \leftarrow \phi^i_0 + \beta \sum_{\ell=0}^{L-1} \nabla_{\phi^i_0} V^i_{\boldsymbol{\phi}_{0:\ell+1}}(s_0, \phi^i_0)$
11: end while

Algorithm 2 Meta-Learning at Test Time
Require: $p(\boldsymbol{\phi}^{-i}_0)$: distribution over other agents' initial policies; $\alpha$: learning rate
1: Initialize $\phi^i_0$ with meta-trained parameters
2: Sample a meta-test batch of $\boldsymbol{\phi}^{-i}_0 \sim p(\boldsymbol{\phi}^{-i}_0)$
3: for each $\boldsymbol{\phi}^{-i}_0$ do
4:   for $\ell = 0, \ldots, L$ do
5:     Sample trajectory $\tau_{\boldsymbol{\phi}_\ell}$
6:     Compute $\boldsymbol{\phi}_{\ell+1} = f(\boldsymbol{\phi}_\ell, \tau_{\boldsymbol{\phi}_\ell}, \alpha)$ from inner-loop optimization (Equation (4))
7:   end for
8: end for

In Deep RL, a very practical way to optimize a value function is by following its gradient. The work of Al-Shedivat et al. (2018) derived the meta-policy gradient theorem (Meta-PG) for optimizing a setup like this. However, it is important to note that they derived this gradient under the implicit assumption that the dependence of the future parameters of other agents on $\phi^i_0$ can be ignored:
$$\nabla_{\phi^i_0} V^i_{\boldsymbol{\phi}_{0:\ell+1}}(s_0, \phi^i_0) = \mathbb{E}_{\tau_{\boldsymbol{\phi}_{0:\ell}} \sim p(\tau_{\boldsymbol{\phi}_{0:\ell}} | \phi^i_{0:\ell}, \boldsymbol{\phi}^{-i}_{0:\ell})}\Bigg[\mathbb{E}_{\tau_{\boldsymbol{\phi}_{\ell+1}} \sim p(\tau_{\boldsymbol{\phi}_{\ell+1}} | \phi^i_{\ell+1}, \boldsymbol{\phi}^{-i}_{\ell+1})}\bigg[\Big(\underbrace{\nabla_{\phi^i_0} \log \pi(\tau_{\boldsymbol{\phi}_0} | \phi^i_0)}_{\text{Current Policy}} + \underbrace{\sum_{\ell'=0}^{\ell} \nabla_{\phi^i_0} \log \pi(\tau_{\boldsymbol{\phi}_{\ell'+1}} | \phi^i_{\ell'+1})}_{\text{Own Learning}}\Big) G^i(\tau_{\boldsymbol{\phi}_{\ell+1}})\bigg]\Bigg].$$
In particular, Meta-PG has two primary terms. The first term corresponds to the standard policy gradient with respect to the current policy parameters used during the initial trajectory. Meanwhile, the second term explicitly differentiates $\log \pi(\tau_{\boldsymbol{\phi}_{\ell'+1}} | \phi^i_{\ell'+1})$ with respect to $\phi^i_0$. This enables a meta-agent $i$ to model its own learning dynamics and account for the impact of $\phi^i_0$ on its eventual adapted parameters $\phi^i_{\ell+1}$. As such, we can see how this term would be quite useful in improving adaptation across a Markov chain of policies.
Indeed, it directly accounts for an agent's own learning process during meta-optimization in order to improve future performance.
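The own learning term can be illustrated on a scalar toy problem of our own construction (not the paper's estimator): differentiate the post-update value $V(\phi_1(\phi_0))$ through the inner-loop update itself, so the meta-gradient picks up the extra factor $d\phi_1/d\phi_0$:

```python
def V(phi):
    """Toy return landscape (our choice): a quadratic peaked at phi = 3."""
    return -(phi - 3.0) ** 2

def dV(phi):
    return -2.0 * (phi - 3.0)

def inner_update(phi0, alpha):
    """One inner-loop ascent step: phi_1 = phi_0 + alpha * V'(phi_0)."""
    return phi0 + alpha * dV(phi0)

def meta_objective(phi0, alpha):
    """Post-update performance V(phi_1(phi_0)), the quantity meta-learning ascends."""
    return V(inner_update(phi0, alpha))

def meta_gradient_own_learning(phi0, alpha):
    """Chain rule through the update: V'(phi_1) * (1 + alpha * V''(phi_0)).
    For this quadratic toy, V'' = -2 everywhere."""
    phi1 = inner_update(phi0, alpha)
    return dV(phi1) * (1.0 + alpha * (-2.0))
```

A finite-difference check of `meta_objective` confirms the extra factor $(1 + \alpha V'')$ that comes from differentiating through the learning step, which a plain policy gradient at $\phi_0$ would miss.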

3.3. THE META-MULTIAGENT POLICY GRADIENT THEOREM

In this section, we consider doing away with the implicit assumption from Al-Shedivat et al. (2018) discussed in the last section: that we can ignore the dependence of the future parameters of other agents on $\phi^i_0$. Indeed, meta-agents need to account for both their own learning process and the learning processes of other peer agents in the environment to fully address the inherent non-stationarity of MARL. We will now demonstrate that our generalized gradient includes a new term explicitly accounting for the effect an agent's current policy has on the learned future policies of its peers.

Theorem 1 (Meta-Multiagent Policy Gradient Theorem (Meta-MAPG)). For any stochastic game $\mathcal{M}^n$, the gradient of the meta-objective function for agent $i$ at state $s_0$ with respect to the current parameters $\phi^i_0$ of stochastic policy $\pi$, evolving in the environment along with other peer agents using initial parameters $\boldsymbol{\phi}^{-i}_0$, is:
$$\nabla_{\phi^i_0} V^i_{\boldsymbol{\phi}_{0:\ell+1}}(s_0, \phi^i_0) = \mathbb{E}_{\tau_{\boldsymbol{\phi}_{0:\ell}} \sim p(\tau_{\boldsymbol{\phi}_{0:\ell}} | \phi^i_{0:\ell}, \boldsymbol{\phi}^{-i}_{0:\ell})}\Bigg[\mathbb{E}_{\tau_{\boldsymbol{\phi}_{\ell+1}} \sim p(\tau_{\boldsymbol{\phi}_{\ell+1}} | \phi^i_{\ell+1}, \boldsymbol{\phi}^{-i}_{\ell+1})}\bigg[\Big(\underbrace{\nabla_{\phi^i_0} \log \pi(\tau_{\boldsymbol{\phi}_0} | \phi^i_0)}_{\text{Current Policy}} + \underbrace{\sum_{\ell'=0}^{\ell} \nabla_{\phi^i_0} \log \pi(\tau_{\boldsymbol{\phi}_{\ell'+1}} | \phi^i_{\ell'+1})}_{\text{Own Learning}} + \underbrace{\sum_{\ell'=0}^{\ell} \nabla_{\phi^i_0} \log \pi(\tau_{\boldsymbol{\phi}_{\ell'+1}} | \boldsymbol{\phi}^{-i}_{\ell'+1})}_{\text{Peer Learning}}\Big) G^i(\tau_{\boldsymbol{\phi}_{\ell+1}})\bigg]\Bigg].$$

Proof. See Appendix A for a detailed proof of Theorem 1.

Probabilistic model perspective. Probabilistic models for Meta-PG and Meta-MAPG are depicted in Figure 1b. As shown by the own learning gradient direction, a meta-agent $i$ optimizes $\phi^i_0$ by accounting for the impact of $\phi^i_0$ on its updated parameters $\phi^i_{1:\ell+1}$ and adaptation performance $G^i(\tau_{\boldsymbol{\phi}_{\ell+1}})$. However, Meta-PG considers the other agents as an external factor that cannot be influenced by the meta-agent, as indicated by the absence of the dependence between $\tau_{\boldsymbol{\phi}_{0:\ell}}$ and $\boldsymbol{\phi}^{-i}_{1:\ell+1}$ in Figure 1b. As a result, the meta-agent loses an opportunity to influence the future policies of other agents and further improve its adaptation performance.
By contrast, the peer learning term in Theorem 1 additionally computes gradients through the sequential dependency between the agent's initial policy $\phi^i_0$ and the future policies of other agents in the environment $\boldsymbol{\phi}^{-i}_{1:\ell+1}$, so that the agent can learn to change $\tau_{\boldsymbol{\phi}_0}$ in a way that maximizes performance over the Markov chain of policies. Interestingly, the peer learning term that naturally arises when taking the gradient in Meta-MAPG has been previously considered in the literature by Foerster et al. (2018a). In the Learning with Opponent-Learning Awareness (LOLA) approach (Foerster et al., 2018a), this term was derived in an alternate way, following a first-order Taylor approximation with respect to the value function. Indeed, it is quite surprising to see how taking a principled policy gradient under a more general set of assumptions leads to a unification of the benefits of past works (Al-Shedivat et al., 2018; Foerster et al., 2018a) on adjusting to the learning behavior of other agents in MARL.

Algorithm. We provide pseudo-code for Meta-MAPG in Algorithm 1 for meta-training and Algorithm 2 for meta-testing. Note that Meta-MAPG is centralized during meta-training, as it requires the policy parameters of other agents to compute the peer learning gradient. However, for settings where a meta-agent cannot access the policy parameters of other agents during meta-training, we provide a decentralized meta-training algorithm with opponent modeling, motivated by the approach used in Foerster et al. (2018a), in Appendix B; it computes the peer learning gradient while leveraging only an approximation of the parameters of peer agents. Once meta-trained in either case, the adaptation to new agents during meta-testing is purely decentralized, such that the meta-agent can decide how to shape other agents with its own observations and rewards alone.
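The peer learning path can likewise be illustrated with a scalar toy model of our own construction (a bilinear opponent value and a fixed learning rate, chosen only for tractability): because the opponent's update depends on the meta-agent's parameters, the total derivative of the meta-agent's return gains an extra term through the opponent's new parameters:

```python
ALPHA = 0.1  # opponent's inner-loop learning rate (toy value)

def opponent_update(phi0, theta0):
    """theta_1 = theta_0 + alpha * dV_opp/dtheta. With a toy opponent value
    V_opp(phi, theta) = phi * theta, the update direction is simply phi_0,
    so the opponent's next parameters depend on the meta-agent's parameters."""
    return theta0 + ALPHA * phi0

def G(phi, theta):
    """Meta-agent's toy return, a function of both agents' parameters."""
    return -phi ** 2 + phi * theta

def meta_gradient(phi0, theta0):
    """Total derivative d G(phi0, theta_1(phi0)) / d phi0:
    a direct term plus the peer learning path (dG/dtheta_1) * (dtheta_1/dphi0)."""
    theta1 = opponent_update(phi0, theta0)
    direct = -2.0 * phi0 + theta1  # partial dG/dphi evaluated at (phi0, theta1)
    peer = phi0 * ALPHA            # dG/dtheta1 = phi0, times dtheta1/dphi0 = ALPHA
    return direct + peer
```

Dropping the `peer` term leaves only the direct part, mirroring how Meta-PG ignores the opponent-shaping path; a finite-difference check on `G(phi, opponent_update(phi, theta0))` recovers the full sum.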

4. RELATED WORK

The standard approach for addressing non-stationarity in MARL is to consider information about the other agents and reason about the effects of their joint actions (Hernandez-Leal et al., 2017). The literature on opponent modeling, for instance, infers opponents' behaviors and conditions an agent's policy on the inferred behaviors of others (He et al., 2016; Raileanu et al., 2018; Grover et al., 2018). Studies regarding the centralized training with decentralized execution framework (Lowe et al., 2017; Foerster et al., 2018b; Yang et al., 2018; Wen et al., 2019), which accounts for the behavior of others through a centralized critic, can also be classified into this category. While this body of work alleviates non-stationarity, it generally assumes that each agent will have a stationary policy in the future. Because other agents can have different behaviors in the future as a result of learning (Foerster et al., 2018a), this incorrect assumption can cause sample-inefficient and improper adaptation (see Example 1 in the appendix). In contrast, Meta-MAPG models the learning process of each agent in the environment, allowing a meta-learning agent to adapt efficiently.

Our approach is also related to prior work that considers the learning of other agents in the environment. This includes Zhang & Lesser (2010), who attempted to discover the best-response adaptation to the anticipated future policy of other agents. Our work is also related, as discussed previously, to LOLA (Foerster et al., 2018a) and more recent improvements (Foerster et al., 2018c). Another relevant idea, explored by Letcher et al. (2019), is to interpolate between the frameworks of Zhang & Lesser (2010) and Foerster et al. (2018a) in a way that guarantees convergence while influencing the opponent's future policy.
However, all of these approaches only account for the learning processes of other agents and fail to consider an agent's own non-stationary policy dynamics, as in the own learning gradient discussed in the previous section. Additionally, these papers do not leverage meta-learning. As a result, these approaches may require many samples to properly adapt to new agents.

Meta-learning (Schmidhuber, 1987; Bengio et al., 1992) has recently become very popular as a method for improving sample efficiency in the presence of changing tasks in the Deep RL literature (Wang et al., 2016a; Duan et al., 2016b; Finn et al., 2017; Mishra et al., 2017; Nichol & Schulman, 2018). See Vilalta & Drissi (2002) and Hospedales et al. (2020) for in-depth surveys of meta-learning. In particular, our work builds on the popular model-agnostic meta-learning (MAML) framework (Finn et al., 2017), where gradient-based learning is used both for conducting so-called inner-loop learning and to improve this learning by computing gradients through the computational graph. When we train our agents so that the inner loop can accommodate a dynamic Markov chain of other-agent policies, we are leveraging an approach that has recently become popular for supervised learning, called meta-continual learning (Riemer et al., 2019; Javed & White, 2019; Spigler, 2019; Beaulieu et al., 2020; Caccia et al., 2020; Gupta et al., 2020). This means that our agent trains not just to adapt to a single set of policies during meta-training, but rather to adapt to a set of changing policies with Markovian updates. As a result, we avoid an issue of past work (Al-Shedivat et al., 2018) that required the use of importance sampling during meta-testing (see Appendix D.1 for more discussion).

5. EXPERIMENTS

We demonstrate the efficacy of Meta-MAPG on a diverse suite of multiagent domains, including the full spectrum of mixed incentive, competitive, and cooperative environments. To this end, we directly compare with the following baseline adaptation strategies:
1) Meta-PG (Al-Shedivat et al., 2018): a meta-learning approach that only considers how to improve an agent's own learning. We detail our implementation of Meta-PG, and a low-level difference from the implementation in the original paper by Al-Shedivat et al. (2018), in Appendix D.
2) LOLA-DiCE (Foerster et al., 2018c): an approach that only considers how to shape the learning dynamics of other agents in the environment, through the Differentiable Monte-Carlo Estimator (DiCE) operation. Note that LOLA-DiCE is an extension of the original LOLA approach.
3) REINFORCE (Williams, 1992): a simple policy gradient approach that considers neither an agent's own learning nor the learning processes of other agents. This baseline represents multiagent approaches that assume each agent leverages a stationary policy in the future.
In our experiments, we implement each method's policy leveraging an LSTM. The inner-loop updates are based on the policy gradient with a linear feature baseline (Duan et al., 2016a), and we use generalized advantage estimation (Schulman et al., 2016) with a learned value function for the meta-optimization. We also learn dynamic inner-loop learning rates during meta-training, as suggested in Al-Shedivat et al. (2018). We refer readers to Appendices C, D, E, H, and the source code in the supplementary material for the remaining details, including selected hyperparameters.

Question 1. Is it essential to consider both an agent's own learning and the learning of others?

              Agent j: C      Agent j: D
Agent i: C    (0.5, 0.5)      (-1.5, 1.5)
Agent i: D    (1.5, -1.5)     (-0.5, -0.5)

Table 1: IPD payoff table (each cell: agent i's reward, agent j's reward).
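The payoff structure of Table 1 and the state convention used in this domain can be encoded directly; a small sketch (dictionary and function names ours):

```python
# Payoffs from Table 1: each entry maps a joint action to (r^i, r^j).
IPD_PAYOFF = {
    ('C', 'C'): (0.5, 0.5),
    ('C', 'D'): (-1.5, 1.5),
    ('D', 'C'): (1.5, -1.5),
    ('D', 'D'): (-0.5, -0.5),
}

def ipd_step(a_i, a_j):
    """One IPD timestep. The next state is the previous joint action
    (s_0 is the empty state; s_t = a_{t-1} for t >= 1)."""
    r_i, r_j = IPD_PAYOFF[(a_i, a_j)]
    next_state = (a_i, a_j)
    return next_state, r_i, r_j
```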

To address this question, we consider the classic iterated prisoner's dilemma (IPD) domain. In IPD, agents i and j act by either (C)ooperating or (D)efecting and receive rewards according to the mixed incentive payoff defined in Table 1. As in Foerster et al. (2018a), we model the state space as $s_0 = \emptyset$ and $s_t = \mathbf{a}_{t-1}$ for $t \geq 1$. For meta-learning, we construct a population of initial personas $p(\boldsymbol{\phi}^{-i}_0)$ that includes cooperating personas (i.e., having a probability of cooperating between 0.5 and 1.0 at any state) and defecting personas (i.e., having a probability of cooperating between 0 and 0.5 at any state). Figure 3b shows the population distribution utilized for training and evaluation. An agent j is initialized randomly from the population and adapts its behavior leveraging the inner-loop learning process throughout the Markov chain (see Figure 6 in the appendix). Importantly, the initial persona of agent j is hidden from i. Hence, an agent i should: 1) adapt to a differently initialized agent j with varying amounts of cooperation, and 2) continuously adapt with respect to the learning of j. The adaptation performance during meta-testing when an agent i, meta-trained with either Meta-MAPG or the baseline methods, interacts with an initially cooperating or defecting agent j is shown in Figure 2a and Figure 2b, respectively. In both cases, our meta-agent successfully infers the underlying persona of the other agent and adapts throughout the Markov chain, obtaining higher rewards than our baselines. We observe that performance generally decreases as the number of joint policy updates increases across all adaptation methods. This decrease in performance is expected, as each model is playing with another agent that is also constantly learning. As a result, the other agent realizes it could potentially achieve more reward by defecting more often.
Hence, to achieve good adaptation performance in IPD, an agent i should attempt to shape j's future policies toward staying cooperative as long as possible so that i can take advantage, which is achieved by accounting for both an agent's own learning and the learning of other peer agents in Meta-MAPG. We explore each adaptation method in more detail by visualizing the action probability dynamics throughout the Markov chain. In general, we observe that the baseline methods have converged to initially defecting strategies, attempting to get larger rewards than a peer agent j in the first trajectory $\tau_{\boldsymbol{\phi}_0}$. While this strategy can result in better initial performance than j's, the peer agent will quickly change its policy so that it is defecting with high probability as well (see Figures 9 to 11 in the appendix). By contrast, our meta-agent learns to act cooperatively in $\tau_{\boldsymbol{\phi}_0}$ and then take advantage by deceiving agent j as it attempts to cooperate at future steps (see Figure 12 in the appendix).

Question 2. How is adaptation performance affected by the number of trajectories between changes?

We control the level of non-stationarity by adjusting the number of trajectories K between updates (refer to Section 2.1). The results in Figure 3a show that the area under the curve (AUC) (i.e., the reward summation during $\boldsymbol{\phi}_{1:L}$) generally decreases as K decreases in IPD. This result is expected, since the inner-loop updates are based on the policy gradient, which can suffer from high variance. Thus, with a smaller batch size, policy updates have a higher variance (leading to noisier policy updates). As a result, it is harder to anticipate and influence the future policies of other agents. Nevertheless, in all cases, Meta-MAPG achieves the best AUC.

Question 3. Can Meta-MAPG generalize its learning outside the meta-training distribution?

We have demonstrated that a meta-agent can generalize well and adapt to a new peer.
However, we would like to investigate this further and see whether a meta-agent can still perform well when the meta-testing distribution is drawn from a significantly different distribution in IPD. We thus evaluate Meta-MAPG and Meta-PG using both in-distribution (as in the previous questions) and out-of-distribution personas for j's initial policies (see Figures 3b and 3c). Meta-MAPG achieves an AUC of 13.77±0.25 and 11.12±0.33 for the in- and out-of-distribution evaluations, respectively. On the other hand, Meta-PG achieves an AUC of 6.13±0.05 and 7.60±0.07 for the in- and out-of-distribution evaluations, respectively. Variances are based on 5 seeds, and we leveraged K = 64 for this experiment; the mean and 95% confidence interval are computed using 5 seeds in (a) and 10 seeds in (c). We note that Meta-MAPG's performance decreases during the out-of-distribution evaluation, but it still consistently performs better than the baseline.

Question 4. How does Meta-MAPG perform with decentralized meta-training?

We compare the performance of Meta-MAPG with and without opponent modeling in Figure 4a. We note that Meta-MAPG with opponent modeling can infer policy parameters for peer agents and compute the peer learning gradient in a decentralized manner, performing better than the Meta-PG baseline. However, opponent modeling introduces noise in predicting the future policy parameters of peer agents, because the parameters must be inferred by observing the actions they take alone, without any supervision about the parameters themselves. Thus, as expected, meta-agents experience difficulty in correctly considering the learning process of peer agents, which leads to lower performance than Meta-MAPG with centralized meta-training.

Question 5. How effective is Meta-MAPG in a fully competitive scenario?

We have demonstrated the benefit of our approach in the mixed incentive scenario of IPD.
Here, we consider another classic iterated game, rock-paper-scissors (RPS), with a fully competitive payoff table (see Table 2). In RPS, at each timestep agents i and j can choose an action of either (R)ock, (P)aper, or (S)cissors. The state space is defined as $s_0 = \emptyset$ and $s_t = \mathbf{a}_{t-1}$ for $t \geq 1$.

              Agent j: R    Agent j: P    Agent j: S
Agent i: R    (0, 0)        (-1, 1)       (1, -1)
Agent i: P    (1, -1)       (0, 0)        (-1, 1)
Agent i: S    (-1, 1)       (1, -1)       (0, 0)

Table 2: RPS payoff table (each cell: agent i's reward, agent j's reward).

Similar to our meta-learning setup for IPD, we consider a population of initial personas $p(\boldsymbol{\phi}^{-i}_0)$, including the rock persona (with a rock action probability between 1/3 and 1.0), the paper persona (with a paper action probability between 1/3 and 1.0), and the scissors persona (with a scissors action probability between 1/3 and 1.0). As in IPD, an agent j is initialized randomly from the population and updates its policy based on the policy gradient with a linear baseline while interacting with i. Figure 2c shows the adaptation performance during meta-testing. Similar to the IPD results, we observe that the baseline methods have effectively converged to winning against the opponent j in the first few trajectories. For instance, agent i has a high rock probability when playing against a j with a high initial scissors probability (see Figures 13 to 15 in the appendix). This strategy, however, results in the opponent quickly changing its behavior toward the mixed Nash equilibrium strategy of (1/3, 1/3, 1/3) for the rock, paper, and scissors probabilities. In contrast, our meta-agent learned to lose slightly in the first two trajectories $\tau_{\boldsymbol{\phi}_{0:1}}$ to achieve much larger rewards in the later trajectories $\tau_{\boldsymbol{\phi}_{2:7}}$, relying on its ability to adapt more efficiently than its opponent (see Figure 16 in the appendix). Compared to the IPD results, we observe that it is more difficult for our meta-agent to shape j's future policies in RPS, possibly because RPS has a fully competitive payoff structure while IPD has a mixed incentive structure.

Question 6.
How effective is Meta-MAPG in settings with more than one peer? We note that the meta-multiagent policy gradient theorem is general and can be applied to scenarios with more than one peer. To validate this, we experiment with 3-player and 4-player RPS, where we consider sampling peers randomly from the entire persona population. Figure 4b shows a comparison against the Meta-PG baseline. We generally observe that the peer agents change their policies toward the mixed Nash equilibrium more quickly as the number of agents increases, which results in decreased performance for all methods. Nevertheless, Meta-MAPG achieves the best performance in all cases and can readily be extended to settings with a greater number of agents.

Question 7. Is it necessary to consider both the own learning and peer learning gradients?

Our meta-multiagent policy gradient theorem inherently includes both the own learning and peer learning gradients, but is it important to consider both terms? To answer this question, we conduct an ablation study and compare Meta-MAPG to two methods: one trained without the peer learning gradient and another trained without the own learning gradient. Note that not having the peer learning term is equivalent to Meta-PG, and not having the own learning term is similar to LOLA-DiCE but alternatively trained with a meta-optimization procedure. Figure 4c shows that a meta-agent trained without the peer learning term cannot properly exploit the peer agent's learning process. Also, a meta-agent trained without the own learning term cannot change its own policy effectively in response to anticipated learning by peer agents. By contrast, Meta-MAPG achieves superior performance by accounting for both its own learning process and the learning process of peer agents.

Question 8. Does considering the peer learning gradient always improve performance?
To answer this question, we experiment with a fully cooperative setting from the multiagent-MuJoCo benchmark (de Witt et al., 2020). Specifically, we consider the 2-Agent HalfCheetah domain, where the first and second agents control the three joints of the back and front legs, respectively, with continuous action spaces (see Figure 5). Both agents receive a joint reward for making the cheetah robot run to the right as fast as possible. Note that the two agents are coupled within the cheetah robot, so accomplishing the objective requires close cooperation and coordination between them. For meta-learning, we consider a population of teammates with varying degrees of expertise in running to the left. Specifically, we pre-train a teammate $j$ and build the population from checkpoints of its parameters saved during learning (see Figure 7 in the appendix). During meta-learning, $j$ is then initialized randomly from this population of policies. Importantly, the teammate must adapt its behavior in this setting because it acquired the opposite skill during pre-training relative to the true objective of moving to the right. Hence, a meta-agent $i$ should succeed both by adapting to differently initialized teammates with varying expertise in moving in the opposite direction, and by guiding the teammate's learning process in order to coordinate eventual movement to the right. Our results are displayed in Figure 2d. There are two notable observations. First, influencing peer learning does not help much in cooperative settings, and Meta-MAPG performs similarly to Meta-PG. The peer learning gradient attempts to shape the future policies of other agents so that the meta-agent can take advantage of them. In IPD, for example, the meta-agent influenced $j$ to be cooperative in the future so that the meta-agent could defect with high probability and receive higher returns.
However, in cooperative settings, the joint reward means the teammate is already changing its policy in order to benefit the meta-agent, so the peer learning gradient has a less significant effect. Second, Meta-PG and Meta-MAPG outperform the other approaches of LOLA-DiCE and REINFORCE, achieving higher rewards when interacting with a new teammate.

6. CONCLUSION

In this paper, we introduced Meta-MAPG, a meta-learning algorithm that adapts quickly to non-stationarity in the policies of other agents in a shared environment. The key idea underlying our proposed meta-optimization is to directly model both an agent's own learning process and the non-stationary policy dynamics of other agents. We evaluated our method on several multiagent benchmarks spanning the full spectrum of mixed incentive, competitive, and cooperative environments. Our results indicate that Meta-MAPG adapts more efficiently than previous state-of-the-art approaches. We hope that our work provides the community with a theoretical foundation to build upon for addressing the inherent non-stationarity of MARL in a principled manner. In the future, we plan to extend our approach to real-world scenarios, such as those involving collaborative exploration between multiple agents (Chan et al., 2019).

A DERIVATION OF META-MULTIAGENT POLICY GRADIENT THEOREM

Theorem 1 (Meta-Multiagent Policy Gradient Theorem (Meta-MAPG)). For any stochastic game $\mathcal{M}^n$, the gradient of the meta-objective function for agent $i$ at state $s_0$ with respect to the current parameters $\phi^i_0$ of a stochastic policy $\pi$, evolving in the environment along with the other peer agents using initial parameters $\phi^{-i}_0$, is:

$$\nabla_{\phi^i_0} V^i_{\phi_{0:\ell+1}}(s_0,\phi^i_0) = \mathbb{E}_{\tau_{\phi_{0:\ell}} \sim p(\tau_{\phi_{0:\ell}} \mid \phi^i_{0:\ell},\, \phi^{-i}_{0:\ell})}\bigg[\mathbb{E}_{\tau_{\phi_{\ell+1}} \sim p(\tau_{\phi_{\ell+1}} \mid \phi^i_{\ell+1},\, \phi^{-i}_{\ell+1})}\bigg[\bigg(\underbrace{\nabla_{\phi^i_0} \log \pi(\tau_{\phi_0} \mid \phi^i_0)}_{\text{Current Policy}} + \underbrace{\sum_{\ell'=0}^{\ell} \nabla_{\phi^i_0} \log \pi(\tau_{\phi_{\ell'+1}} \mid \phi^i_{\ell'+1})}_{\text{Own Learning}} + \underbrace{\sum_{\ell'=0}^{\ell} \nabla_{\phi^i_0} \log \pi(\tau_{\phi_{\ell'+1}} \mid \phi^{-i}_{\ell'+1})}_{\text{Peer Learning}}\bigg)\, G^i(\tau_{\phi_{\ell+1}})\bigg]\bigg]$$

Proof. We begin our derivation from the meta-value function defined in Equation (3). We expand the meta-value function with the state-action value and joint actions, assuming conditional independence between agents' actions (Wen et al., 2019):

$$V^i_{\phi_{0:\ell+1}}(s_0,\phi^i_0) = \mathbb{E}_{\tau_{\phi_{0:\ell}} \sim p(\tau_{\phi_{0:\ell}} \mid \phi^i_{0:\ell},\, \phi^{-i}_{0:\ell})}\Big[\mathbb{E}_{\tau_{\phi_{\ell+1}} \sim p(\tau_{\phi_{\ell+1}} \mid \phi^i_{\ell+1},\, \phi^{-i}_{\ell+1})}\big[G^i(\tau_{\phi_{\ell+1}})\big]\Big] = \mathbb{E}_{\tau_{\phi_{0:\ell}}}\Big[V^i_{\phi_{\ell+1}}(s_0)\Big] = \mathbb{E}_{\tau_{\phi_{0:\ell}}}\Big[\sum_{a^i_0} \pi(a^i_0 \mid s_0, \phi^i_{\ell+1}) \sum_{a^{-i}_0} \pi(a^{-i}_0 \mid s_0, \phi^{-i}_{\ell+1})\, Q^i_{\phi_{\ell+1}}(s_0, a_0)\Big], \tag{7}$$

where $Q^i_{\phi_{\ell+1}}(s_0, a_0)$ denotes the state-action value under the joint policy with parameters $\phi_{\ell+1}$ at state $s_0$ with joint action $a_0$. In Equation (7), we note that both $\phi^i_{1:\ell+1}$ and $\phi^{-i}_{1:\ell+1}$ depend on $\phi^i_0$. Considering the joint update from $\phi_0$ to $\phi_1$ for simplicity, we can write the gradients in the inner-loop (Equation (4)) based on the multiagent stochastic policy gradient theorem (Wei et al., 2018):

$$\nabla_{\phi^i_0}\, \mathbb{E}_{\tau_{\phi_0} \sim p(\tau_{\phi_0} \mid \phi^i_0,\, \phi^{-i}_0)}\big[G^i(\tau_{\phi_0})\big] = \sum_s \rho_{\phi_0}(s) \sum_{a^i} \nabla_{\phi^i_0} \pi(a^i \mid s, \phi^i_0) \sum_{a^{-i}} \pi(a^{-i} \mid s, \phi^{-i}_0)\, Q^i_{\phi_0}(s, a),$$
$$\nabla_{\phi^{-i}_0}\, \mathbb{E}_{\tau_{\phi_0} \sim p(\tau_{\phi_0} \mid \phi^i_0,\, \phi^{-i}_0)}\big[G^{-i}(\tau_{\phi_0})\big] = \sum_s \rho_{\phi_0}(s) \sum_{a^{-i}} \nabla_{\phi^{-i}_0} \pi(a^{-i} \mid s, \phi^{-i}_0) \sum_{a^i} \pi(a^i \mid s, \phi^i_0)\, Q^{-i}_{\phi_0}(s, a), \tag{8}$$

where $\rho_{\phi_0}$ denotes the stationary distribution under the joint policy with parameters $\phi_0$.
Under review as a conference paper at ICLR 2021

Importantly, the inner-loop gradients for an agent $i$ and its peers are functions of $\phi^i_0$. Hence, the updated joint policy parameters $\phi_1$ depend on $\phi^i_0$. Following Equation (8), the successive inner-loop optimization up to $\phi_{\ell+1}$ results in dependencies between $\phi^i_0$ and $\phi^i_{1:\ell+1}$ and between $\phi^i_0$ and $\phi^{-i}_{1:\ell+1}$ (see Figure 1b). Having identified which terms depend on $\phi^i_0$, we continue from Equation (7) and derive the gradient of the meta-value function with respect to $\phi^i_0$ by applying the product rule:

$$\begin{aligned}
\nabla_{\phi^i_0} V^i_{\phi_{0:\ell+1}}(s_0, \phi^i_0) &= \nabla_{\phi^i_0}\, \mathbb{E}_{\tau_{\phi_{0:\ell}} \sim p(\tau_{\phi_{0:\ell}} \mid \phi^i_{0:\ell},\, \phi^{-i}_{0:\ell})}\Big[\sum_{a^i_0} \pi(a^i_0 \mid s_0, \phi^i_{\ell+1}) \sum_{a^{-i}_0} \pi(a^{-i}_0 \mid s_0, \phi^{-i}_{\ell+1})\, Q^i_{\phi_{\ell+1}}(s_0, a_0)\Big] \\
&= \nabla_{\phi^i_0} \sum_{\tau_{\phi_{0:\ell}}} p(\tau_{\phi_{0:\ell}} \mid \phi^i_{0:\ell}, \phi^{-i}_{0:\ell}) \sum_{a^i_0} \pi(a^i_0 \mid s_0, \phi^i_{\ell+1}) \sum_{a^{-i}_0} \pi(a^{-i}_0 \mid s_0, \phi^{-i}_{\ell+1})\, Q^i_{\phi_{\ell+1}}(s_0, a_0) \\
&= \underbrace{\sum_{\tau_{\phi_{0:\ell}}} \Big(\nabla_{\phi^i_0}\, p(\tau_{\phi_{0:\ell}} \mid \phi^i_{0:\ell}, \phi^{-i}_{0:\ell})\Big) \sum_{a^i_0} \pi(a^i_0 \mid s_0, \phi^i_{\ell+1}) \sum_{a^{-i}_0} \pi(a^{-i}_0 \mid s_0, \phi^{-i}_{\ell+1})\, Q^i_{\phi_{\ell+1}}(s_0, a_0)}_{\text{Term A}} \\
&\quad+ \underbrace{\sum_{\tau_{\phi_{0:\ell}}} p(\tau_{\phi_{0:\ell}} \mid \phi^i_{0:\ell}, \phi^{-i}_{0:\ell}) \sum_{a^i_0} \Big(\nabla_{\phi^i_0}\, \pi(a^i_0 \mid s_0, \phi^i_{\ell+1})\Big) \sum_{a^{-i}_0} \pi(a^{-i}_0 \mid s_0, \phi^{-i}_{\ell+1})\, Q^i_{\phi_{\ell+1}}(s_0, a_0)}_{\text{Term B}} \\
&\quad+ \underbrace{\sum_{\tau_{\phi_{0:\ell}}} p(\tau_{\phi_{0:\ell}} \mid \phi^i_{0:\ell}, \phi^{-i}_{0:\ell}) \sum_{a^i_0} \pi(a^i_0 \mid s_0, \phi^i_{\ell+1}) \sum_{a^{-i}_0} \Big(\nabla_{\phi^i_0}\, \pi(a^{-i}_0 \mid s_0, \phi^{-i}_{\ell+1})\Big)\, Q^i_{\phi_{\ell+1}}(s_0, a_0)}_{\text{Term C}} \\
&\quad+ \underbrace{\sum_{\tau_{\phi_{0:\ell}}} p(\tau_{\phi_{0:\ell}} \mid \phi^i_{0:\ell}, \phi^{-i}_{0:\ell}) \sum_{a^i_0} \pi(a^i_0 \mid s_0, \phi^i_{\ell+1}) \sum_{a^{-i}_0} \pi(a^{-i}_0 \mid s_0, \phi^{-i}_{\ell+1})\, \nabla_{\phi^i_0}\, Q^i_{\phi_{\ell+1}}(s_0, a_0)}_{\text{Term D}}.
\end{aligned} \tag{9}$$

We first focus on the derivative of the trajectory distribution $\tau_{\phi_{0:\ell}}$ in Term A:

$$\begin{aligned}
\nabla_{\phi^i_0} \sum_{\tau_{\phi_{0:\ell}}} p(\tau_{\phi_{0:\ell}} \mid \phi^i_{0:\ell}, \phi^{-i}_{0:\ell}) &= \nabla_{\phi^i_0} \bigg[\sum_{\tau_{\phi_0}} p(\tau_{\phi_0} \mid \phi^i_0, \phi^{-i}_0) \sum_{\tau_{\phi_1}} p(\tau_{\phi_1} \mid \phi^i_1, \phi^{-i}_1) \times \dots \times \sum_{\tau_{\phi_\ell}} p(\tau_{\phi_\ell} \mid \phi^i_\ell, \phi^{-i}_\ell)\bigg] \\
&= \sum_{\tau_{\phi_0}} \Big(\nabla_{\phi^i_0}\, p(\tau_{\phi_0} \mid \phi^i_0, \phi^{-i}_0)\Big) \prod_{\ell' \in \{0,\dots,\ell\} \setminus \{0\}} \sum_{\tau_{\phi_{\ell'}}} p(\tau_{\phi_{\ell'}} \mid \phi^i_{\ell'}, \phi^{-i}_{\ell'}) \\
&\quad+ \sum_{\tau_{\phi_1}} \Big(\nabla_{\phi^i_0}\, p(\tau_{\phi_1} \mid \phi^i_1, \phi^{-i}_1)\Big) \prod_{\ell' \in \{0,\dots,\ell\} \setminus \{1\}} \sum_{\tau_{\phi_{\ell'}}} p(\tau_{\phi_{\ell'}} \mid \phi^i_{\ell'}, \phi^{-i}_{\ell'}) \\
&\quad+ \dots + \sum_{\tau_{\phi_\ell}} \Big(\nabla_{\phi^i_0}\, p(\tau_{\phi_\ell} \mid \phi^i_\ell, \phi^{-i}_\ell)\Big) \prod_{\ell' \in \{0,\dots,\ell\} \setminus \{\ell\}} \sum_{\tau_{\phi_{\ell'}}} p(\tau_{\phi_{\ell'}} \mid \phi^i_{\ell'}, \phi^{-i}_{\ell'}),
\end{aligned} \tag{10}$$

where the probability of collecting a trajectory under the joint policy with parameters $\phi_{\ell'}$ is given by:

$$p(\tau_{\phi_{\ell'}} \mid \phi^i_{\ell'}, \phi^{-i}_{\ell'}) = p(s_0) \prod_{t=0}^{H} \pi(a^i_t \mid s_t, \phi^i_{\ell'})\, \pi(a^{-i}_t \mid s_t, \phi^{-i}_{\ell'})\, \mathcal{P}(s_{t+1} \mid s_t, a_t). \tag{11}$$

Using Equation (11) and the log-derivative trick, Equation (10) can be further expressed as:

$$\begin{aligned}
&\mathbb{E}_{\tau_{\phi_0} \sim p(\tau_{\phi_0} \mid \phi^i_0,\, \phi^{-i}_0)}\Big[\nabla_{\phi^i_0} \log \pi(\tau_{\phi_0} \mid \phi^i_0)\Big] \prod_{\ell' \in \{0,\dots,\ell\} \setminus \{0\}} \sum_{\tau_{\phi_{\ell'}}} p(\tau_{\phi_{\ell'}} \mid \phi^i_{\ell'}, \phi^{-i}_{\ell'}) \\
&\quad+ \mathbb{E}_{\tau_{\phi_1} \sim p(\tau_{\phi_1} \mid \phi^i_1,\, \phi^{-i}_1)}\Big[\nabla_{\phi^i_0} \big(\log \pi(\tau_{\phi_1} \mid \phi^i_1) + \log \pi(\tau_{\phi_1} \mid \phi^{-i}_1)\big)\Big] \prod_{\ell' \in \{0,\dots,\ell\} \setminus \{1\}} \sum_{\tau_{\phi_{\ell'}}} p(\tau_{\phi_{\ell'}} \mid \phi^i_{\ell'}, \phi^{-i}_{\ell'}) \\
&\quad+ \dots + \mathbb{E}_{\tau_{\phi_\ell} \sim p(\tau_{\phi_\ell} \mid \phi^i_\ell,\, \phi^{-i}_\ell)}\Big[\nabla_{\phi^i_0} \big(\log \pi(\tau_{\phi_\ell} \mid \phi^i_\ell) + \log \pi(\tau_{\phi_\ell} \mid \phi^{-i}_\ell)\big)\Big] \prod_{\ell' \in \{0,\dots,\ell\} \setminus \{\ell\}} \sum_{\tau_{\phi_{\ell'}}} p(\tau_{\phi_{\ell'}} \mid \phi^i_{\ell'}, \phi^{-i}_{\ell'}),
\end{aligned} \tag{12}$$

where the summations of log-terms, such as $\nabla_{\phi^i_0} \big(\log \pi(\tau_{\phi_\ell} \mid \phi^i_\ell) + \log \pi(\tau_{\phi_\ell} \mid \phi^{-i}_\ell)\big)$, are inherently included due to the sequential dependencies between $\phi^i_0$ and $\phi_{1:\ell}$. We use the result of Equation (12) and organize terms to arrive at the following expression for Term A in Equation (9):

$$\mathbb{E}_{\tau_{\phi_{0:\ell}} \sim p(\tau_{\phi_{0:\ell}} \mid \phi^i_{0:\ell},\, \phi^{-i}_{0:\ell})}\bigg[\bigg(\nabla_{\phi^i_0} \log \pi(\tau_{\phi_0} \mid \phi^i_0) + \sum_{\ell'=0}^{\ell-1} \nabla_{\phi^i_0} \log \pi(\tau_{\phi_{\ell'+1}} \mid \phi^i_{\ell'+1}) + \sum_{\ell'=0}^{\ell-1} \nabla_{\phi^i_0} \log \pi(\tau_{\phi_{\ell'+1}} \mid \phi^{-i}_{\ell'+1})\bigg) \times \sum_{a^i_0} \pi(a^i_0 \mid s_0, \phi^i_{\ell+1}) \sum_{a^{-i}_0} \pi(a^{-i}_0 \mid s_0, \phi^{-i}_{\ell+1})\, Q^i_{\phi_{\ell+1}}(s_0, a_0)\bigg]. \tag{13}$$
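The log-derivative trick used to obtain Equation (12) can be sanity-checked numerically. The sketch below is a hypothetical two-armed bandit (not part of the paper's experiments): it compares the analytic gradient of the expected return under a sigmoid policy with the Monte-Carlo score-function estimate.

```python
import math
import random

# Sanity check of the log-derivative trick, ∇θ E[G] = E[∇θ log π(a) · G],
# on a two-armed bandit with a sigmoid policy π(a=1) = σ(θ).
random.seed(0)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

theta, rewards = 0.3, (1.0, 2.0)
p1 = sigmoid(theta)  # probability of pulling arm 1

# Analytic gradient: d/dθ [(1 - p1) r0 + p1 r1] = (r1 - r0) p1 (1 - p1).
analytic = (rewards[1] - rewards[0]) * p1 * (1 - p1)

# Monte-Carlo score-function estimate:
# ∇θ log π(a) is (1 - p1) for arm 1 and -p1 for arm 0.
n = 100_000
est = 0.0
for _ in range(n):
    a = 1 if random.random() < p1 else 0
    score = (1 - p1) if a == 1 else -p1
    est += score * rewards[a] / n

print(abs(est - analytic) < 0.02)  # True: estimator matches the analytic gradient
```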
Coming back to Terms B–D in Equation (9), repeatedly unrolling the derivative of the Q-function $\nabla_{\phi^i_0} Q^i_{\phi_{\ell+1}}(s_0, a_0)$ by following Sutton & Barto (1998) yields:

$$\mathbb{E}_{\tau_{\phi_{0:\ell}} \sim p(\tau_{\phi_{0:\ell}} \mid \phi^i_{0:\ell},\, \phi^{-i}_{0:\ell})}\bigg[\sum_s \rho_{\phi_{\ell+1}}(s) \sum_{a^i} \nabla_{\phi^i_0} \pi(a^i \mid s, \phi^i_{\ell+1}) \sum_{a^{-i}} \pi(a^{-i} \mid s, \phi^{-i}_{\ell+1})\, Q^i_{\phi_{\ell+1}}(s, a)\bigg] + \mathbb{E}_{\tau_{\phi_{0:\ell}} \sim p(\tau_{\phi_{0:\ell}} \mid \phi^i_{0:\ell},\, \phi^{-i}_{0:\ell})}\bigg[\sum_s \rho_{\phi_{\ell+1}}(s) \sum_{a^{-i}} \nabla_{\phi^i_0} \pi(a^{-i} \mid s, \phi^{-i}_{\ell+1}) \sum_{a^i} \pi(a^i \mid s, \phi^i_{\ell+1})\, Q^i_{\phi_{\ell+1}}(s, a)\bigg], \tag{14}$$

which adds the consideration of the future joint policy $\phi_{\ell+1}$ to Equation (13). Finally, we combine Equations (13) and (14) and express the result in expectations:

$$\nabla_{\phi^i_0} V^i_{\phi_{0:\ell+1}}(s_0,\phi^i_0) = \mathbb{E}_{\tau_{\phi_{0:\ell}} \sim p(\tau_{\phi_{0:\ell}} \mid \phi^i_{0:\ell},\, \phi^{-i}_{0:\ell})}\bigg[\mathbb{E}_{\tau_{\phi_{\ell+1}} \sim p(\tau_{\phi_{\ell+1}} \mid \phi^i_{\ell+1},\, \phi^{-i}_{\ell+1})}\bigg[\bigg(\underbrace{\nabla_{\phi^i_0} \log \pi(\tau_{\phi_0} \mid \phi^i_0)}_{\text{Current Policy}} + \underbrace{\sum_{\ell'=0}^{\ell} \nabla_{\phi^i_0} \log \pi(\tau_{\phi_{\ell'+1}} \mid \phi^i_{\ell'+1})}_{\text{Own Learning}} + \underbrace{\sum_{\ell'=0}^{\ell} \nabla_{\phi^i_0} \log \pi(\tau_{\phi_{\ell'+1}} \mid \phi^{-i}_{\ell'+1})}_{\text{Peer Learning}}\bigg)\, G^i(\tau_{\phi_{\ell+1}})\bigg]\bigg]$$

This completes the proof.

B META-MAPG WITH OPPONENT MODELING

In this section, we explain Meta-MAPG with opponent modeling for settings where a meta-agent cannot access the policy parameters of its peers during meta-training. Our decentralized meta-training method in Algorithm 3 replaces the other agents' true policy parameters $\phi^{-i}_{1:L}$ with inferred parameters $\hat\phi^{-i}_{1:L}$ when computing the peer learning gradient:

Algorithm 3 Meta-Learning at Training Time with Opponent Modeling
3:  Sample a meta-train batch of $\phi^{-i}_0 \sim p(\phi^{-i}_0)$
4:  for each $\phi^{-i}_0$ do
5:    Randomly initialize $\hat\phi^{-i}_0$
6:    for $\ell = 0, \dots, L$ do
7:      Sample and store trajectory $\tau_{\phi_\ell}$
8:      Approximate $\hat\phi^{-i}_\ell = f(\hat\phi^{-i}_\ell, \tau_{\phi_\ell}, \alpha)$ using opponent modeling (Algorithm 4)
9:      Compute $\phi^i_{\ell+1} = f(\phi^i_\ell, \tau_{\phi_\ell}, \alpha)$ from inner-loop optimization (Equation (4))
10:     Compute $\hat\phi^{-i}_{\ell+1} = f(\hat\phi^{-i}_\ell, \tau_{\phi_\ell}, \alpha)$ from inner-loop optimization (Equation (4))
11:    end for
12:  end for
13:  Update $\phi^i_0 \leftarrow \phi^i_0 + \beta \sum_{\ell=0}^{L-1} \nabla_{\phi^i_0} V^i_{\phi_{0:\ell+1}}(s_0, \phi^i_0)$ based on Equation (6)

Specifically, we follow Foerster et al. (2018a) for opponent modeling and estimate $\hat\phi^{-i}_\ell$ from $\tau_{\phi_\ell}$ based on the log-likelihood $\mathcal{L}_{\text{likelihood}}$ (Line 8 in Algorithm 3):

$$\mathcal{L}_{\text{likelihood}} = \sum_{t=0}^{H} \log \pi^{-i}(a^{-i}_t \mid s_t, \hat\phi^{-i}_\ell), \quad \text{where } s_t, a^{-i}_t \in \tau_{\phi_\ell}.$$
A meta-agent can obtain $\hat\phi^{-i}_{1:L}$ by iteratively applying the opponent modeling procedure up to the maximum chain length $L$. We also apply the inner-loop update with the Differentiable Monte-Carlo Estimator (DiCE) (Foerster et al., 2018c) to the inferred policy parameters of peer agents (Line 10 in Algorithm 3). By applying DiCE, we can save the sequential dependencies between $\phi^i_0$ and the updates to the inferred peer policy parameters $\hat\phi^{-i}_{1:L}$ in a computation graph and compute the peer learning gradient efficiently via automatic differentiation (Line 13 in Algorithm 3).
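For a tabular opponent policy, the log-likelihood objective above has a closed-form maximizer: the inferred action probability at each state is simply the empirical frequency of the opponent's observed actions there. The sketch below illustrates this (the state and action names are hypothetical; the trajectory format is ours, not the paper's implementation):

```python
from collections import Counter, defaultdict

# Maximum-likelihood opponent modeling for a tabular policy: maximizing
# L = Σ_t log π^{-i}(a_t^{-i} | s_t) over a tabular π^{-i} yields the
# empirical frequency of each opponent action at each observed state.
def infer_opponent_policy(trajectory):
    """trajectory: list of (state, opponent_action) pairs observed in τ_φ."""
    counts = defaultdict(Counter)
    for s, a in trajectory:
        counts[s][a] += 1
    return {
        s: {a: c / sum(ctr.values()) for a, c in ctr.items()}
        for s, ctr in counts.items()
    }

# e.g. in IPD, the opponent cooperated twice and defected once at state "CC":
traj = [("CC", "C"), ("CC", "D"), ("CC", "C"), ("CD", "D")]
phi_hat = infer_opponent_policy(traj)
print(phi_hat["CC"]["C"])  # 2/3: empirical cooperation frequency at "CC"
```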

C ADDITIONAL IMPLEMENTATION DETAILS

C.1 NETWORK STRUCTURE

Our neural networks for the policy and value function consist of a fully-connected input layer with 64 units, followed by a single-layer LSTM with 64 units and a fully-connected output layer. We reset the LSTM states to zeros at the beginning of each trajectory and retain them until the end of the episode. The LSTM policy outputs a probability for the Bernoulli distribution in the iterated games (i.e., IPD and RPS).

The framework of Al-Shedivat et al. (2018) samples trajectories by initializing every $\phi^i_\ell$ from $\phi^i_0$. However, as noted in Al-Shedivat et al. (2018), this assumption requires interacting with the same peers multiple times and is often impossible during meta-testing. To address this issue, their framework applies an importance sampling correction during meta-testing. However, this correction generally suffers from high variance (Wang et al., 2016b). As such, we avoid the correction entirely by initializing from $\phi^i_0$ only once at the beginning of the Markov chain for both meta-training and meta-testing. The above theoretical differences result in an improved meta-agent that learns to additionally affect the future policies of other peer agents, achieving better results than the Meta-PG baseline in our experiments.

D.2 LOLA-DICE

We used an open-source PyTorch implementation of LOLA-DiCE¹ and made minor changes to the code, such as adding the LSTM policy and value function.

Figure 6: IPD meta-learning setup. An agent j's policy is initialized randomly from the initial persona population p(φ⁻ⁱ₀), which includes various cooperating and defecting personas. The agent j then updates its policy throughout the Markov chain, requiring an agent i to adapt with respect to the learning of j.

E ADDITIONAL EXPERIMENT DETAILS

E.1 IPD

We choose to represent the peer agent j's policy with a tabular representation to effectively construct the population of initial personas p(φ⁻ⁱ₀) for the meta-learning setup. Specifically, the tabular policy has a dimension of 5, corresponding to the number of states in IPD. We then randomly sample a probability between 0.5 and 1.0 at each state to construct the cooperating population, and a probability between 0 and 0.5 at each state to construct the defecting population. As such, the tabular representation enables us to sample as many personas as needed while also controlling the distribution p(φ⁻ⁱ₀) by merely adjusting the probability range. We sample a total of 480 initial personas, including cooperating personas and defecting personas, and split them into 400 for meta-training, 40 for meta-validation, and 40 for meta-testing. Figure 3b visualizes the distribution using principal component analysis (PCA) with two components.
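The persona construction above can be sketched in a few lines. This is a hypothetical illustration: the state names, random seed, and the even cooperating/defecting split are our assumptions; the paper specifies only the probability ranges and the 480-persona total with a 400/40/40 split.

```python
import random

# Sketch of the IPD persona population: a tabular policy is one cooperation
# probability per state (5 IPD states); cooperating personas sample from
# [0.5, 1.0] and defecting personas from [0.0, 0.5] at each state.
IPD_STATES = ["s0", "CC", "CD", "DC", "DD"]  # illustrative state names

def sample_persona(kind, rng):
    lo, hi = (0.5, 1.0) if kind == "cooperating" else (0.0, 0.5)
    return {s: rng.uniform(lo, hi) for s in IPD_STATES}

rng = random.Random(0)
population = [sample_persona("cooperating", rng) for _ in range(240)] + \
             [sample_persona("defecting", rng) for _ in range(240)]
rng.shuffle(population)

# 480 personas split into 400 meta-train / 40 meta-validation / 40 meta-test.
meta_train, meta_val, meta_test = population[:400], population[400:440], population[440:]
print(len(meta_train), len(meta_val), len(meta_test))  # 400 40 40
```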

E.2 RPS

In RPS, we follow the same meta-learning setup as in IPD, except we sample a total of 720 initial opponent personas, including rock, paper, and scissors personas, and split them into 600 for meta-training, 60 for meta-validation, and 60 for meta-testing. Additionally, because RPS has three possible actions, we sample a rock preference probability between 1/3 and 1.0 when building the rock persona population, such that the rock probability is larger than the other two action probabilities. We follow the same procedure to construct the paper and scissors persona populations.
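A minimal sketch of the rock-persona sampling follows. The rejection step and the way the remaining probability mass is split are our assumptions; the text above specifies only that the rock probability lies in (1/3, 1.0) and exceeds the other two action probabilities.

```python
import random

# Sketch: sample a rock probability in (1/3, 1.0), split the remainder
# between paper and scissors, and reject draws where rock is not the
# strictly largest action probability.
def sample_rock_persona(rng):
    while True:
        p_rock = rng.uniform(1 / 3, 1.0)
        p_paper = rng.uniform(0.0, 1.0 - p_rock)
        p_scissors = 1.0 - p_rock - p_paper
        if p_rock > p_paper and p_rock > p_scissors:  # rock must dominate
            return (p_rock, p_paper, p_scissors)

rng = random.Random(0)
persona = sample_rock_persona(rng)
print(max(persona) == persona[0])  # True: rock has the largest probability
```

The paper and scissors personas would follow symmetrically, with the dominant index moved to the corresponding action.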

E.3 2-AGENT HALFCHEETAH

Figure 7: Visualization of a teammate j's initial expertise in the 2-Agent HalfCheetah domain, where the meta-test distribution differs sufficiently from meta-train/val.

We used an open-source implementation of the multiagent-MuJoCo benchmark.² Agents in our experiments receive state observations that include information about all the joints. For the meta-learning setup, we pre-train a teammate j with an LSTM policy that has varying expertise in moving to the left. Specifically, we train the teammate for up to 500 training iterations and save a checkpoint at each iteration. Intuitively, as the number of training iterations increases, the teammate gains more expertise. We then use the checkpoints from iterations 50 to 300 as the meta-train/val distribution and from iterations 475 to 500 as the meta-test distribution (see Figure 7). We construct the distributions with this gap to ensure that the meta-test distribution differs sufficiently from meta-train/val, so that we can test the generalization of our approach. Lastly, the teammate agent j updates its policy based on the policy gradient with the linear feature baseline, as in IPD and RPS.
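The checkpoint-based split can be sketched as follows (illustrative only; the checkpoint indices follow the description above, and the helper name is ours):

```python
# Sketch: teammate checkpoints 50-300 form the meta-train/val distribution,
# checkpoints 475-500 form the held-out meta-test distribution, and the
# gap (301-474) keeps the two distributions sufficiently different.
def build_populations(n_checkpoints=500):
    checkpoints = list(range(1, n_checkpoints + 1))  # one checkpoint per iteration
    meta_train_val = [c for c in checkpoints if 50 <= c <= 300]
    meta_test = [c for c in checkpoints if 475 <= c <= 500]
    return meta_train_val, meta_test

train_val, test = build_populations()
print(max(train_val), min(test))  # 300 475: the distributions do not overlap
```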

F IMPORTANCE OF PEER LEARNING

For example, consider a stateless zero-sum game played between two agents. Agents i and j maximize the simple value functions V^i_φ = φ^i φ^j and V^j_φ = -φ^i φ^j, respectively, where φ^i, φ^j ∈ ℝ. This game has a unique Nash equilibrium at the origin (i.e., {φ^i, φ^j} = {0, 0}). We compare: 1) the standard approach that optimizes the value function in Equation (1) under the stationarity assumption; and 2) an approach that considers the learning process of others, such as the LOLA method. As Figure 8 shows, the standard approach diverges from the equilibrium, resulting in worse outcomes for both agents. The failure in this example is caused by the stationarity assumption: each agent assumes its opponent will behave the same way in the future (Letcher et al., 2019). In contrast, by considering the learning process of the opponent, the LOLA approach converges to the equilibrium. As this example highlights, it is important to consider the learning of the other agents.
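This example can be reproduced in a few lines. The sketch below is our own minimal implementation with exact gradients (rather than the policy-gradient estimates used in the experiments): naive simultaneous gradient ascent spirals away from the Nash equilibrium at the origin, while a LOLA-style update that differentiates through the opponent's anticipated learning step converges toward it.

```python
# Zero-sum example: V^i = x*y for agent i and V^j = -x*y for agent j,
# with parameters x = φ^i and y = φ^j and the Nash equilibrium at (0, 0).
def naive_step(x, y, alpha):
    # Each agent ascends its own value, assuming the other stays fixed:
    # dV^i/dx = y and dV^j/dy = -x.
    return x + alpha * y, y - alpha * x

def lola_step(x, y, alpha):
    # Each agent ascends its value after the opponent's anticipated update:
    # d/dx [x * (y - alpha*x)] = y - 2*alpha*x, and symmetrically for y.
    return x + alpha * (y - 2 * alpha * x), y + alpha * (-x - 2 * alpha * y)

alpha, steps = 0.1, 500
xn, yn = 1.0, 1.0  # naive learners
xl, yl = 1.0, 1.0  # LOLA-style learners
for _ in range(steps):
    xn, yn = naive_step(xn, yn, alpha)
    xl, yl = lola_step(xl, yl, alpha)

print(xn * xn + yn * yn > 2.0)   # True: naive learners diverge from (0, 0)
print(xl * xl + yl * yl < 1e-3)  # True: LOLA-style learners approach (0, 0)
```

Each naive step multiplies the squared distance to the origin by (1 + α²), which explains the outward spiral in Figure 8, whereas the lookahead term contracts it by (1 - 2α²)² + α² < 1.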



1 Available at https://github.com/alexis-jacq/LOLA_DiCE
2 Available at https://github.com/schroederdewitt/multiagent_mujoco



Figure 2: Adaptation performance during meta-testing in mixed incentive ((a), (b)), competitive (c), and cooperative (d) environments. The results show that Meta-MAPG can successfully adapt to a new and learning peer agent throughout the Markov chain. Mean and 95% confidence interval computed for 10 random seeds for ((a), (b), (c)) and 5 random seeds for (d) are shown in figures.

Figure 3: (a) Adaptation performance with a varying number of trajectories. Meta-MAPG achieves the best AUC in all cases, and its performance generally improves with a larger K. Mean and 95% confidence interval are computed over 10 seeds. (b), (c) Visualization of j's initial policy for in-distribution and out-of-distribution meta-testing, respectively, where the out-of-distribution split has a smaller overlap between the policies used for meta-training/validation and those used for meta-testing.

Figure 4: (a) Adaptation performance with opponent modeling (OM). Meta-MAPG with OM uses inferred policy parameters for peer agents, computing the peer learning gradient in a decentralized manner. (b) Adaptation performance with a varying number of agents in RPS. Meta-MAPG achieves the best AUC in all cases. (c) Ablation study for Meta-MAPG. Meta-MAPG achieves significantly better performance than ablated baselines with no own learning gradient and no peer learning gradient. The mean and 95% confidence interval are computed using 5 seeds in (a) and 10 seeds in (c).

Figure 5: 2-Agent HalfCheetah domain, where two agents are coupled within the robot and control the robot together. Graphic credit: de Witt et al. (2020).

Failure to consider the learning process of the other agents can result in divergence of learning objectives.

Figure 8: Learning paths on the zero-sum game. The standard approach with the stationarity assumption diverges, resulting in worse performance for both agents. In contrast, an approach that considers the learning process of the other agents, such as LOLA (Foerster et al., 2018a), converges to the equilibrium.

Table 2: RPS payoff table.

Algorithm 3 Meta-Learning at Training Time with Opponent Modeling



For the 2-Agent HalfCheetah domain, the policy outputs the mean and variance of a Gaussian distribution. We empirically observe that not sharing parameters between the policy and value networks results in more stable learning than sharing them.

C.2 OPTIMIZATION

We detail additional important notes about our implementation:

• We apply the linear feature baseline (Duan et al., 2016a) and generalized advantage estimation (GAE) (Schulman et al., 2016) during the inner-loop and outer-loop optimization, respectively, to reduce the variance of the policy gradient.

• We use DiCE (Foerster et al., 2018c) to compute the peer learning gradient efficiently. Specifically, we apply DiCE during the inner-loop optimization and save the sequential dependencies between φⁱ₀ and φ⁻ⁱ₁:ₗ in a computation graph. Because the computation graph contains these sequential dependencies, we can compute the peer learning gradient by backpropagating the meta-value function via the automatic-differentiation toolbox.

• Learning from diverse peers can potentially cause conflicting gradients and unstable learning. In IPD, for instance, a strategy to adapt against cooperating peers can be the complete opposite of the adaptation strategy against defecting peers, resulting in conflicting gradients. To address this potential issue, we use projecting conflicting gradients (PCGrad) (Yu et al., 2020) during the outer-loop optimization. We also tested the baseline methods with PCGrad.

• We use distributed training to speed up the meta-optimization. Each thread interacts with a Markov chain of policies until the chain horizon and then computes the meta-optimization gradients using Equation (6). Then, similar to Mnih et al. (2016), each thread asynchronously updates the shared meta-agent's policy and value network parameters.
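The PCGrad projection can be sketched as follows. This is a simplified version of Yu et al. (2020): the full algorithm shuffles the task order and projects sequentially, while the sketch below projects each gradient against every conflicting gradient in a fixed order.

```python
# Simplified PCGrad sketch: when two per-persona meta-gradients conflict
# (negative inner product), project each onto the normal plane of the
# gradient it conflicts with before summing.
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def pcgrad(grads):
    """grads: list of per-task gradient vectors; returns the combined update."""
    projected = []
    for i, g in enumerate(grads):
        g = list(g)
        for j, other in enumerate(grads):
            if i != j and dot(g, other) < 0:  # conflicting gradient pair
                coef = dot(g, other) / dot(other, other)
                g = [gi - coef * oi for gi, oi in zip(g, other)]
        projected.append(g)
    return [sum(col) for col in zip(*projected)]

# A cooperating-persona gradient and a conflicting defecting-persona gradient:
g_coop, g_defect = [1.0, 0.0], [-1.0, 1.0]
print(pcgrad([g_coop, g_defect]))  # [0.5, 1.5]: conflict-free combination
```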

D ADDITIONAL BASELINE DETAILS

We train all adaptation methods based on a meta-training set until convergence. We then measure the adaptation performance on a meta-testing set using the best-learned policy determined by a meta-validation set.

D.1 META-PG

We have improved the Meta-PG baseline beyond its implementation in the original work (Al-Shedivat et al., 2018) to further isolate the importance of the peer learning gradient term. Specifically, compared to Al-Shedivat et al. (2018), we make the following theoretical contributions:

Underlying problem statement. Al-Shedivat et al. (2018) base their problem formulation on that of multi-task / continual single-agent RL. In contrast, ours is based on a general stochastic game between n agents (Shapley, 1953).

A Markov chain of joint policies. Al-Shedivat et al. (2018) treat an evolving peer agent as an external factor, which leaves out the sequential dependencies between a meta-agent's current policy and the peer agents' future policies in the Markov chain. Our important insight is that these sequential dependencies exist in general multiagent settings, because the peer agents are also learning from the trajectories generated by interacting with the meta-agent (see Figure 1b).

Meta-objective. The meta-objective defined in Al-Shedivat et al. (2018) is based on single-agent settings. In contrast, our meta-objective is based on general multiagent settings (see Equations (2) to (4)).

Meta-optimization gradient. Compared to Al-Shedivat et al. (2018), our meta-optimization gradient inherently includes the additional term of the peer learning gradient, which considers how an agent can directly influence the learning process of other agents.

