A POLICY GRADIENT ALGORITHM FOR LEARNING TO LEARN IN MULTIAGENT REINFORCEMENT LEARNING

Anonymous authors
Paper under double-blind review

Abstract

A fundamental challenge in multiagent reinforcement learning is to learn beneficial behaviors in a shared environment with other agents that are simultaneously learning. In particular, each agent perceives the environment as effectively non-stationary due to the changing policies of other agents. Moreover, each agent is itself constantly learning, leading to natural non-stationarity in the distribution of experiences encountered. In this paper, we propose a novel meta-multiagent policy gradient theorem that directly accounts for the non-stationary policy dynamics inherent to these multiagent settings. This is achieved by modeling our gradient updates to consider both an agent's own non-stationary policy dynamics and the non-stationary policy dynamics of the other agents interacting with it in the environment. We show that our theoretically grounded approach provides a general solution to the multiagent learning problem that inherently combines key aspects of previous state-of-the-art approaches on this topic. We test our method on several multiagent benchmarks and demonstrate that it adapts to new agents as they learn more efficiently than previous related approaches across the spectrum of mixed-incentive, competitive, and cooperative environments.

1. INTRODUCTION

Learning in multiagent settings is inherently more difficult than single-agent learning because an agent interacts both with the environment and with other agents (Buşoniu et al., 2010). Specifically, the fundamental challenge in multiagent reinforcement learning (MARL) is learning optimal policies in the presence of other simultaneously learning agents, whose changing behaviors jointly affect the environment's transition and reward functions. This dependence on non-stationary policies renders the Markov property invalid from the perspective of each agent, requiring agents to adapt their behaviors with respect to potentially large, unpredictable, and endless changes in the policies of fellow agents (Papoudakis et al., 2019). In such environments, it is also critical that agents adapt to the changing behaviors of others in a sample-efficient manner, since those policies are likely to update again after only a small number of interactions (Al-Shedivat et al., 2018). Effective agents should therefore consider the learning of other agents and adapt quickly to non-stationary behaviors; otherwise, undesirable outcomes may arise when an agent constantly lags in its ability to deal with the current policies of other agents.

In this paper, we propose a new framework based on meta-learning for addressing the inherent non-stationarity of MARL. Meta-learning (also referred to as learning to learn) was recently shown to be a promising methodology for fast adaptation in multiagent settings. The framework of Al-Shedivat et al. (2018), for example, introduces a meta-optimization scheme by which a meta-agent can adapt more efficiently to changes in a new opponent's policy after collecting only a handful of interactions. The key idea underlying their meta-optimization is to model the meta-agent's own learning process so that its updated policy performs better against an evolving opponent.
However, their work does not directly consider the opponent's learning process in the meta-optimization: the evolving opponent is treated as an external factor, and the meta-agent is assumed to have no influence over the opponent's future policy. As a result, their approach misses an important property of MARL: the opponent is also a learning agent that changes its policy based on trajectories collected while interacting with the meta-agent. The meta-agent therefore has an opportunity to influence the opponent's future policy by changing the distribution of trajectories, and it can exploit this opportunity to improve its performance during learning.

Our contribution. With this insight, we develop a new meta-multiagent policy gradient theorem (Meta-MAPG) that directly models the learning processes of all agents in the environment within a single objective function. We start from the meta-policy gradient theorem of Al-Shedivat et al. (2018) and extend it, using the multiagent stochastic policy gradient theorem (Wei et al., 2018), to derive a novel meta-policy gradient theorem. This is achieved by removing the unrealistic implicit assumption of Al-Shedivat et al. (2018) that the learning of other agents in the environment does not depend on an agent's own behavior. Interestingly, performing the derivation under this more general set of assumptions inherently yields an additional term that was not present in the work of Al-Shedivat et al. (2018). We observe that this added term is closely related to shaping the learning dynamics of other agents in the framework of Foerster et al. (2018a). As such, our work can be seen as a theoretically grounded framework that unifies the collective benefits of Al-Shedivat et al. (2018) and Foerster et al. (2018a). Meta-MAPG is evaluated on a diverse suite of multiagent domains spanning the full spectrum of mixed-incentive, competitive, and cooperative environments.
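To make this distinction concrete, a schematic one-step version of the objective can be written as follows. This is an illustrative sketch in our notation, not the formal statement (which is derived later in the paper); in particular, $\alpha$ is a hypothetical inner-loop learning rate and the single gradient step stands in for an arbitrary Markovian update function:

```latex
% Agent i optimizes its current parameters with respect to the return
% obtained after ALL agents, itself included, perform a learning step:
\max_{\phi^i_0} \;
\mathbb{E}_{\tau_{\boldsymbol{\phi}_0} \sim p(\tau_{\boldsymbol{\phi}_0} \mid \boldsymbol{\phi}_0)}
\Big[ \mathbb{E}_{\tau_{\boldsymbol{\phi}_1} \sim p(\tau_{\boldsymbol{\phi}_1} \mid \boldsymbol{\phi}_1)}
\big[ G^i(\tau_{\boldsymbol{\phi}_1}) \big] \Big],
\qquad
\phi^j_1 = \phi^j_0 + \alpha \, \nabla_{\phi^j_0} G^j(\tau_{\boldsymbol{\phi}_0})
\;\; \text{for all } j \in \boldsymbol{I}.
```

Because the other agents' updated parameters $\boldsymbol{\phi}^{-i}_1$ are computed from trajectories whose distribution depends on $\phi^i_0$, differentiating this objective through the opponents' updates produces the additional peer-learning term discussed above, which vanishes when the opponents' updates are treated as external.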
Our experiments demonstrate that Meta-MAPG consistently achieves superior adaptation performance in the presence of novel, evolving agents.

2. PRELIMINARIES

Interactions between multiple agents can be represented by stochastic games (Shapley, 1953). Specifically, an $n$-agent stochastic game is defined as a tuple $\boldsymbol{M}_n = \langle \boldsymbol{I}, S, \boldsymbol{A}, P, \boldsymbol{R}, \gamma \rangle$; $\boldsymbol{I} = \{1, \ldots, n\}$ is the set of $n$ agents, $S$ is the set of states, $\boldsymbol{A} = \times_{i \in \boldsymbol{I}} A^i$ is the set of action spaces, $P : S \times \boldsymbol{A} \times S \to [0, 1]$ is the state transition probability function, $\boldsymbol{R} = \times_{i \in \boldsymbol{I}} R^i$ is the set of reward functions, and $\gamma \in [0, 1)$ is the discount factor. We typeset sets in bold for clarity. Each agent $i$ executes an action at each timestep $t$ according to its stochastic policy $a^i_t \sim \pi^i(a^i_t \mid s_t; \phi^i)$ parameterized by $\phi^i$, where $s_t \in S$. A joint action $\boldsymbol{a}_t = \{a^i_t, \boldsymbol{a}^{-i}_t\}$ yields a transition from the current state $s_t$ to the next state $s_{t+1} \in S$ with probability $P(s_{t+1} \mid s_t, \boldsymbol{a}_t)$, where the notation $-i$ indicates all agents except agent $i$. Agent $i$ then obtains a reward according to its reward function $r^i_t = R^i(s_t, \boldsymbol{a}_t)$. At the end of an episode, the agents have collected a trajectory $\tau_{\boldsymbol{\phi}}$ under the joint policy with parameters $\boldsymbol{\phi}$, where $\tau_{\boldsymbol{\phi}} := (s_0, \boldsymbol{a}_0, \boldsymbol{r}_0, \ldots, \boldsymbol{r}_H)$, $\boldsymbol{\phi} = \{\phi^i, \boldsymbol{\phi}^{-i}\}$ represents the joint parameters of all policies, $\boldsymbol{r}_t = \{r^i_t, \boldsymbol{r}^{-i}_t\}$ is the joint reward, and $H$ is the horizon of the trajectory or episode.
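As a minimal concrete instance of this tuple, consider an iterated two-agent matrix game with a single abstract state. The sketch below is purely illustrative; all class and variable names are ours, not from the paper, and the payoff matrices shown are the standard iterated Prisoner's Dilemma, a common benchmark in this line of work:

```python
import numpy as np

class MatrixStochasticGame:
    """Illustrative instance of <I, S, A, P, R, gamma> with n = 2 agents,
    a single abstract state, and deterministic payoff-matrix rewards."""

    def __init__(self, payoff_1, payoff_2, horizon=150, gamma=0.96):
        self.payoffs = (np.asarray(payoff_1), np.asarray(payoff_2))  # R^1, R^2
        self.horizon = horizon  # episode horizon H
        self.gamma = gamma      # discount factor in [0, 1)
        self.t = 0

    def reset(self):
        self.t = 0
        return 0  # the single abstract state s_t

    def step(self, joint_action):
        """A joint action a_t = (a^1_t, a^2_t) yields r_t = (r^1_t, r^2_t)."""
        a1, a2 = joint_action
        rewards = (float(self.payoffs[0][a1, a2]),
                   float(self.payoffs[1][a1, a2]))
        self.t += 1
        return 0, rewards, self.t >= self.horizon  # (s_{t+1}, r_t, done)

# Iterated Prisoner's Dilemma: action 0 = cooperate, action 1 = defect.
game = MatrixStochasticGame(payoff_1=[[-1, -3], [0, -2]],
                            payoff_2=[[-1, 0], [-3, -2]],
                            horizon=2)
game.reset()
_, r, done = game.step((0, 1))  # agent 1 cooperates, agent 2 defects
print(r)  # (-3.0, 0.0)
```

Each agent's policy would map the (here trivial) state to a distribution over its own action space; the joint action then determines both agents' rewards, which is exactly the coupling that makes the game non-stationary from either agent's perspective once the other agent starts updating its policy.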

2.1. A MARKOV CHAIN OF POLICIES

The perceived non-stationarity in multiagent settings results from a distribution of sequential joint policies, which can be represented by a Markov chain (Al-Shedivat et al., 2018). Formally, a Markov chain of policies begins from a stochastic game between agents with an initial set of joint policies parameterized by $\boldsymbol{\phi}_0$. We assume that each agent updates its policy leveraging a Markovian update function that changes the policy after every $K$ trajectories. After this time period, each agent $i$ adapts its policy to maximize the expected return expressed as its value function:

$$V^i_{\boldsymbol{\phi}_0}(s_0) = \mathbb{E}_{\tau_{\boldsymbol{\phi}_0} \sim p(\tau_{\boldsymbol{\phi}_0} \mid \phi^i_0, \boldsymbol{\phi}^{-i}_0)}\Big[\textstyle\sum_{t=0}^{H} \gamma^t r^i_t \,\Big|\, s_0\Big] = \mathbb{E}_{\tau_{\boldsymbol{\phi}_0} \sim p(\tau_{\boldsymbol{\phi}_0} \mid \phi^i_0, \boldsymbol{\phi}^{-i}_0)}\big[G^i(\tau_{\boldsymbol{\phi}_0})\big].$$
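In practice this expectation is estimated from the $K$ sampled trajectories. A minimal Monte Carlo sketch, where the function names are ours and `sample_rewards` stands in for rolling out the fixed joint policy $\boldsymbol{\phi}_0$ and recording agent $i$'s rewards:

```python
def discounted_return(rewards, gamma):
    """G^i(tau) = sum_{t=0}^{H} gamma^t * r^i_t for a single trajectory."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def mc_value_estimate(sample_rewards, num_trajectories, gamma):
    """Monte Carlo estimate of V^i_{phi_0}(s_0): the average discounted
    return over trajectories collected under the fixed joint policy."""
    returns = [discounted_return(sample_rewards(), gamma)
               for _ in range(num_trajectories)]
    return sum(returns) / len(returns)

# Example: a degenerate rollout whose reward is 1 at each of H+1 = 3 steps.
value = mc_value_estimate(lambda: [1.0, 1.0, 1.0],
                          num_trajectories=5, gamma=0.5)
print(value)  # 1.75, i.e. 1 + 0.5 + 0.25
```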



Figure 1: (a) A Markov chain of joint policies representing the inherent non-stationarity of MARL. Each agent updates its policy leveraging a Markovian update function, resulting in a change to the joint policy. (b) A probabilistic graph for Meta-MAPG. Unlike Meta-PG, our approach actively influences the future policies of other agents as well through the peer learning gradient.

