DECENTRALIZED OPTIMISTIC HYPERPOLICY MIRROR DESCENT: PROVABLY NO-REGRET LEARNING IN MARKOV GAMES

Abstract

We study decentralized policy learning in Markov games where we control a single agent to play against nonstationary and possibly adversarial opponents. Our goal is to develop a no-regret online learning algorithm that (i) takes actions based on the local information observed by the agent and (ii) is able to find the best policy in hindsight. For such a problem, the nonstationary state transitions induced by the varying opponent pose a significant challenge. In light of a recent hardness result [33], we focus on the setting where the opponent's previous policies are revealed to the agent for decision making. With such an information structure, we propose a new algorithm, Decentralized Optimistic hypeRpolicy mIrror deScent (DORIS), which achieves √K-regret in the context of general function approximation, where K is the number of episodes. Moreover, when all the agents adopt DORIS, we prove that their mixture policy constitutes an approximate coarse correlated equilibrium. In particular, DORIS maintains a hyperpolicy, which is a distribution over the policy space. The hyperpolicy is updated via mirror descent, where the update direction is obtained by an optimistic variant of least-squares policy evaluation. Furthermore, to illustrate the power of our method, we apply DORIS to constrained and vector-valued MDPs, which can be formulated as zero-sum Markov games with a fictitious opponent.
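To make the hyperpolicy update concrete, the following is a minimal sketch of one mirror-descent step with the KL divergence as the Bregman distance, which reduces to an exponential-weights update over the policy class. It assumes a finite policy class for illustration (the paper treats general function approximation), and the optimistic value estimates `value_estimates` stand in for the output of optimistic least-squares policy evaluation; the function name and learning rate `eta` are hypothetical, not from the paper.

```python
import numpy as np

def hyperpolicy_mirror_descent_step(weights, value_estimates, eta):
    """One mirror-descent step on the hyperpolicy (a distribution over policies).

    With the KL divergence as the Bregman distance, the update is the
    exponential-weights rule: w'_i ∝ w_i * exp(eta * V_i), where V_i is an
    optimistic value estimate for policy i (here supplied externally).
    """
    logits = np.log(weights) + eta * value_estimates
    logits -= logits.max()          # subtract max for numerical stability
    new_weights = np.exp(logits)
    return new_weights / new_weights.sum()

# Toy usage: three candidate policies, starting from the uniform hyperpolicy.
p = np.ones(3) / 3
v = np.array([0.2, 0.5, 0.1])       # hypothetical optimistic value estimates
p = hyperpolicy_mirror_descent_step(p, v, eta=1.0)
```

The update shifts probability mass toward policies with larger optimistic value estimates while the KL regularization keeps successive hyperpolicies close, which is what drives the √K-regret analysis.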

1. INTRODUCTION

Multi-agent reinforcement learning (MARL) studies how each agent learns to maximize its cumulative reward by interacting with the environment as well as with other agents, where the state transitions and rewards are affected by the actions of all the agents. Equipped with powerful function approximators such as deep neural networks [31], MARL has achieved significant empirical success in various domains, including the game of Go [47], StarCraft [50], DOTA2 [5], Atari [38], multi-agent robotic systems [8], and autonomous driving [45]. Compared with the centralized setting, where a central controller collects the information of all agents and coordinates their behaviors, decentralized algorithms [19, 42], in which each agent autonomously chooses its actions based on its own local information, are often more desirable in MARL applications. Specifically, decentralized methods (1) are easier to implement and enjoy better scalability, (2) are more robust to possible adversaries, and (3) require less communication overhead [21, 22, 9, 59, 18]. In this work, we aim to design a provably efficient decentralized reinforcement learning (RL) algorithm in the online setting with function approximation. In the sequel, for ease of presentation, we refer to the controllable agent as the player and regard the rest of the agents as a meta-agent, called the opponent, which specifies its policies arbitrarily. Our goal is to maximize the cumulative rewards of the player in the face of a possibly adversarial opponent, in the online setting where the policies of the player and the opponent can be based on adaptively gathered local information. From a theoretical perspective, arguably the most distinctive challenge of the decentralized setting is nonstationarity. That is, from the perspective of any single agent, the state transitions are affected by the policies of the other agents in an unpredictable and potentially adversarial way and are thus nonstationary.
This is in stark contrast to the centralized setting, which can be regarded as a standard single-agent RL problem for the central controller that decides the actions of all the agents. Furthermore, in the online setting, as the environment is unknown, to achieve sample efficiency the player needs to strike a balance

