DECENTRALIZED OPTIMISTIC HYPERPOLICY MIRROR DESCENT: PROVABLY NO-REGRET LEARNING IN MARKOV GAMES

Abstract

We study decentralized policy learning in Markov games where we control a single agent playing against nonstationary and possibly adversarial opponents. Our goal is to develop a no-regret online learning algorithm that (i) takes actions based on the local information observed by the agent and (ii) is able to find the best policy in hindsight. For such a problem, the nonstationary state transitions induced by the varying opponent pose a significant challenge. In light of a recent hardness result [33], we focus on the setting where the opponent's previous policies are revealed to the agent for decision making. With such an information structure, we propose a new algorithm, Decentralized Optimistic hypeRpolicy mIrror deScent (DORIS), which achieves √K-regret in the context of general function approximation, where K is the number of episodes. Moreover, when all the agents adopt DORIS, we prove that their mixture policy constitutes an approximate coarse correlated equilibrium. In particular, DORIS maintains a hyperpolicy, which is a distribution over the policy space. The hyperpolicy is updated via mirror descent, where the update direction is obtained by an optimistic variant of least-squares policy evaluation. Furthermore, to illustrate the power of our method, we apply DORIS to constrained and vector-valued MDPs, which can be formulated as zero-sum Markov games with a fictitious opponent.

1. INTRODUCTION

Multi-agent reinforcement learning (MARL) studies how each agent learns to maximize its cumulative rewards by interacting with the environment as well as other agents, where the state transitions and rewards are affected by the actions of all the agents. Equipped with powerful function approximators such as deep neural networks [31], MARL has achieved significant empirical success in various domains including the game of Go [47], StarCraft [50], DOTA2 [5], Atari [38], multi-agent robotic systems [8] and autonomous driving [45]. Compared with the centralized setting, where a central controller collects the information of all agents and coordinates their behaviors, decentralized algorithms [19, 42], where each agent autonomously chooses its action based on its own local information, are often more desirable in MARL applications. Specifically, decentralized methods (1) are easier to implement and enjoy better scalability, (2) are more robust to possible adversaries, and (3) require less communication overhead [21, 22, 9, 59, 18]. In this work, we aim to design a provably efficient decentralized reinforcement learning (RL) algorithm in the online setting with function approximation. In the sequel, for ease of presentation, we refer to the controllable agent as the player and regard the rest of the agents as a meta-agent, called the opponent, which specifies its policies arbitrarily. Our goal is to maximize the cumulative rewards of the player in the face of a possibly adversarial opponent, in the online setting where the policies of the player and opponent can be based on adaptively gathered local information. From a theoretical perspective, arguably the most distinctive challenge of the decentralized setting is nonstationarity. That is, from the perspective of any agent, the state transitions are affected by the policies of the other agents in an unpredictable and potentially adversarial way and are thus nonstationary.
This is in stark contrast to the centralized setting, which can be regarded as a standard RL problem for the central controller that decides the actions of all the players. Furthermore, in the online setting, as the environment is unknown, to achieve sample efficiency the player needs to strike a balance between exploration and exploitation, in the context of function approximation and in the presence of an adversarial opponent. The dual challenges of nonstationarity and efficient exploration are thus intertwined, making it difficult to develop provably efficient decentralized MARL algorithms. Consequently, there seems to be only limited theoretical understanding of the decentralized MARL setting with a possibly adversarial opponent. Most of the existing algorithms [7, 53, 49, 27, 23] can only compete against the Nash value of the Markov game when faced with an arbitrary opponent. This is a much weaker baseline compared with the results in classic matrix games [17, 1], where the player is required to compete against the best fixed policy in hindsight. Meanwhile, [33] seems to be the only work we are aware of that achieves no-regret learning in MARL against the best policy in hindsight; it focuses on the policy revealing setting where the player observes the policies played by the opponent in previous episodes. However, the algorithm and theory in that work are limited to tabular cases and fail to deal with large or even continuous state and action spaces. To this end, we would like to answer the following question: Can we design a decentralized MARL algorithm that provably achieves no-regret against the best fixed policy in hindsight in the context of function approximation? In this work, we provide a positive answer to the above question under the policy revealing setting with general function approximation. Specifically, we propose an actor-critic-type algorithm [29] called DORIS, which maintains a distribution over the policy space, named the hyperpolicy, for decision making.
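To make the "no-regret against the best fixed policy in hindsight" objective concrete, one standard way to write it (with notation we adopt here for illustration; the paper's formal definitions appear later) is
$$\mathrm{Regret}(K) \;=\; \max_{\pi \in \Pi} \sum_{k=1}^{K} V_{1}^{\pi,\, \nu^{k}}(s_{1}) \;-\; \sum_{k=1}^{K} V_{1}^{\pi^{k},\, \nu^{k}}(s_{1}),$$
where $\pi^{k}$ is the player's policy in episode $k$, $\nu^{k}$ is the opponent's (possibly adversarial) joint policy, and $V_{1}^{\pi, \nu}(s_{1})$ denotes the player's expected cumulative reward from the initial state when the player follows $\pi$ and the opponent follows $\nu$. No-regret learning requires $\mathrm{Regret}(K) = o(K)$; the $\sqrt{K}$-regret bound stated in the abstract is of this form.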
To combat the nonstationarity, DORIS updates the hyperpolicy via mirror descent (or equivalently, Hedge [16]). Furthermore, to encourage exploration, the descent directions of mirror descent are obtained by solving optimistic variants of policy evaluation subproblems with general function approximation, which only involve the local information of the player. Under standard regularity assumptions on the underlying function classes, we prove that DORIS achieves a sublinear regret in the presence of an adversarial opponent. In addition, when all the agents adopt DORIS independently, we prove that their average policy constitutes an approximate coarse correlated equilibrium (CCE). At the core of our analysis is a new complexity measure of function classes that is tailored to the decentralized MARL setting. Furthermore, to demonstrate the power of DORIS, we adapt it for solving constrained Markov decision processes (CMDPs) and vector-valued Markov decision processes (VMDPs), which can both be formulated as zero-sum Markov games with a fictitious opponent. Our Contributions. Our contributions are four-fold. First, we propose a new decentralized policy optimization algorithm, DORIS, that provably achieves no-regret in the context of general function approximation. As a result, when all agents adopt DORIS, their average policy converges to a CCE of the Markov game. Second, we propose a new complexity measure named the Bellman Evaluation Eluder dimension, which generalizes the Bellman Eluder dimension [25] for single-agent MDPs to decentralized learning in Markov games and might be of independent interest. Third, we modify DORIS for solving CMDPs with general function approximation, which is shown to achieve sublinear regret and constraint violation. Finally, we extend DORIS to solving the approachability task [36] in VMDPs and attain a near-optimal solution.
To the best of our knowledge, DORIS seems to be the first provably efficient decentralized algorithm for achieving no-regret in MARL with general function approximation. Notations. In this paper we let [n] = {1, . . . , n} for any integer n. We denote the set of probability distributions over any set S by ∆_S or ∆(S). We also let ‖·‖ denote the ℓ2-norm by default. Related works. Our work is related to the bodies of literature on decentralized learning with an adversarial opponent, finding equilibria in self-play Markov games, CMDPs, and VMDPs. These works either consider the centralized setting or do not have function approximation in the decentralized online setting. Due to the page limit, we compare to these works in Appendix B.
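The hyperpolicy mirror-descent (Hedge) update discussed above can be sketched as follows. This is a minimal illustration assuming a finite candidate policy class; the function and variable names (`hedge_update`, `values`, `eta`) are our own, and the finite class stands in for the paper's general function-approximation setup.

```python
import numpy as np

def hedge_update(weights, values, eta):
    """One exponential-weights (mirror descent with entropy regularizer) step
    on the hyperpolicy, a distribution over candidate policies.

    weights: current distribution over the candidate policies.
    values:  estimated episode values for each candidate policy, e.g. from an
             optimistic policy evaluation subroutine (simplified here).
    eta:     mirror-descent step size.
    """
    logits = np.log(weights) + eta * np.asarray(values, dtype=float)
    logits -= logits.max()          # subtract max for numerical stability
    new_weights = np.exp(logits)
    return new_weights / new_weights.sum()

# Usage: start uniform over 4 candidate policies and reward the third one;
# its probability mass increases after the update.
p = np.full(4, 0.25)
p = hedge_update(p, values=[0.0, 0.0, 1.0, 0.0], eta=1.0)
```

The multiplicative form of the update is what yields the sublinear (√K-type) regret guarantees typical of Hedge-style algorithms, here applied at the level of distributions over policies rather than actions.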

2. PRELIMINARIES

2.1 MARKOV GAMES

Let us consider an n-agent general-sum Markov game (MG) $\mathcal{M}_{\mathrm{MG}} = (\mathcal{S}, \{\mathcal{A}_i\}_{i=1}^{n}, \{P_h\}_{h=1}^{H}, \{r_{h,i}\}_{h=1,i=1}^{H,n}, H)$, where $\mathcal{S}$ is the state space, $\mathcal{A}_i$ is the action space of the $i$-th agent, $P_h : \mathcal{S} \times \prod_{i=1}^{n} \mathcal{A}_i \to \Delta(\mathcal{S})$ is the transition function at the $h$-th step, $r_{h,i} : \mathcal{S} \times \prod_{i=1}^{n} \mathcal{A}_i \to \mathbb{R}_+$ is the reward function of the $i$-th agent at the $h$-th step, and $H$ is the length of each episode. We assume

