A POLICY GRADIENT ALGORITHM FOR LEARNING TO LEARN IN MULTIAGENT REINFORCEMENT LEARNING

Anonymous authors
Paper under double-blind review

Abstract

A fundamental challenge in multiagent reinforcement learning is to learn beneficial behaviors in a shared environment with other agents that are also simultaneously learning. In particular, each agent perceives the environment as effectively non-stationary due to the changing policies of other agents. Moreover, each agent is itself constantly learning, leading to natural non-stationarity in the distribution of experiences encountered. In this paper, we propose a novel meta-multiagent policy gradient theorem that directly accounts for the non-stationary policy dynamics inherent to these multiagent settings. This is achieved by modeling our gradient updates to directly consider both an agent's own non-stationary policy dynamics and the non-stationary policy dynamics of the other agents interacting with it in the environment. We find that our theoretically grounded approach provides a general solution to the multiagent learning problem, which inherently combines key aspects of previous state-of-the-art approaches on this topic. We test our method on several multiagent benchmarks and demonstrate that it adapts to new learning agents more efficiently than previous related approaches across mixed-incentive, competitive, and cooperative environments.

1. INTRODUCTION

Learning in multiagent settings is inherently more difficult than single-agent learning because an agent interacts both with the environment and with other agents (Buşoniu et al., 2010). Specifically, the fundamental challenge in multiagent reinforcement learning (MARL) is learning optimal policies in the presence of other simultaneously learning agents, whose changing behaviors jointly affect the environment's transition and reward functions. This dependence on non-stationary policies renders the Markov property invalid from the perspective of each agent, requiring agents to adapt their behaviors to potentially large, unpredictable, and endless changes in the policies of fellow agents (Papoudakis et al., 2019). In such environments, it is also critical that agents adapt to the changing behaviors of others in a sample-efficient manner, since those behaviors may change again after only a small number of interactions (Al-Shedivat et al., 2018). Effective agents should therefore consider the learning of other agents and adapt quickly to non-stationary behaviors; otherwise, undesirable outcomes arise when an agent constantly lags behind the current policies of other agents.

In this paper, we propose a new framework based on meta-learning for addressing the inherent non-stationarity of MARL. Meta-learning (also referred to as learning to learn) was recently shown to be a promising methodology for fast adaptation in multiagent settings. The framework of Al-Shedivat et al. (2018), for example, introduces a meta-optimization scheme by which a meta-agent can adapt more efficiently to changes in a new opponent's policy after collecting only a handful of interactions. The key idea underlying their meta-optimization is to model the meta-agent's own learning process so that its updated policy performs well against an evolving opponent.
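In a simplified form (the notation here is ours, and the exact objective in Al-Shedivat et al. (2018) differs in its details), this meta-optimization can be sketched as follows, where \(\theta\) and \(\phi\) denote the meta-agent's and opponent's policy parameters, \(\alpha\) is an inner-loop learning rate, and \(R(\tau)\) is the meta-agent's return along trajectory \(\tau\):

```latex
% Inner adaptation step: the meta-agent updates its own parameters
% from trajectories generated jointly with the opponent.
\theta' = \theta + \alpha \,\nabla_{\theta}\,
  \mathbb{E}_{\tau \sim p(\tau \mid \theta,\, \phi)}\!\left[\, R(\tau) \,\right]

% Meta-objective: the adapted policy \theta' should perform well
% against the opponent's updated policy \phi'.
\max_{\theta}\;
  \mathbb{E}_{\tau' \sim p(\tau' \mid \theta',\, \phi')}\!\left[\, R(\tau') \,\right]
```

Note that in this formulation the opponent's updated parameters \(\phi'\) are treated as given, rather than as a quantity the meta-agent can influence.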
However, their work does not directly model the opponent's learning process in the meta-optimization: the evolving opponent is treated as an external factor, and the meta-agent is assumed to have no influence on the opponent's future policy. As a result, their approach misses an important property of MARL: the opponent is itself a learning agent that changes its policy based on trajectories collected while interacting with the meta-agent. The meta-agent can therefore influence the opponent's future policy by changing the distribution of these trajectories, and it can exploit this influence to improve its performance during learning.
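This dependence can be made explicit (again in our notation, under the simplifying assumption that the opponent takes a single policy-gradient step with learning rate \(\beta\) on its own return \(R^{o}\)):

```latex
% The opponent's update is computed from trajectories whose
% distribution depends on the meta-agent's parameters \theta.
\phi' = \phi + \beta \,\nabla_{\phi}\,
  \mathbb{E}_{\tau \sim p(\tau \mid \theta,\, \phi)}\!\left[\, R^{o}(\tau) \,\right]
```

Because the trajectory distribution \(p(\tau \mid \theta, \phi)\) depends on \(\theta\), the opponent's updated parameters \(\phi'\) are themselves a function of the meta-agent's parameters, so a gradient of the meta-agent's objective with respect to \(\theta\) should include a term that flows through \(\phi'\).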

