REPRESENTATION LEARNING FOR GENERAL-SUM LOW-RANK MARKOV GAMES

Abstract

We study multi-agent general-sum Markov games with nonlinear function approximation. We focus on low-rank Markov games whose transition matrix admits a hidden low-rank structure on top of an unknown non-linear representation. The goal is to design an algorithm that (1) finds an ε-equilibrium policy in a sample-efficient manner, without prior knowledge of the environment or the representation, and (2) permits a deep-learning-friendly implementation. We leverage representation learning and present a model-based and a model-free approach to constructing an effective representation from the collected data. For both approaches, the algorithm achieves a sample complexity of poly(H, d, A, 1/ε), where H is the game horizon, d is the dimension of the feature vector, A is the size of the joint action space, and ε is the optimality gap. When the number of players is large, the above sample complexity can scale exponentially with the number of players in the worst case. To address this challenge, we consider Markov games with a factorized transition structure and present an algorithm that escapes such exponential scaling. To the best of our knowledge, this is the first sample-efficient algorithm for multi-agent general-sum Markov games that incorporates (non-linear) function approximation. We accompany our theoretical result with a neural-network-based implementation of our algorithm and evaluate it against the widely used deep RL baseline, DQN with fictitious play.

1. INTRODUCTION

Multi-agent reinforcement learning (MARL) studies the problem where multiple agents learn to make sequential decisions in an unknown environment to maximize their own cumulative rewards. Recently, MARL has achieved remarkable empirical success, for example in traditional games such as Go (Silver et al., 2016; 2017) and Poker (Moravčík et al., 2017), real-time video games such as Starcraft and Dota 2 (Vinyals et al., 2019; Berner et al., 2019), decentralized control and multi-agent robotic systems (Brambilla et al., 2013), and autonomous driving (Shalev-Shwartz et al., 2016). On the theoretical front, however, provably sample-efficient algorithms for Markov games have been largely restricted to either two-player zero-sum games (Bai et al., 2020; Xie et al., 2020; Chen et al., 2021; Jin et al., 2021c) or general-sum games with small, finite state and action spaces (Bai and Jin, 2020; Liu et al., 2021; Jin et al., 2021b). These algorithms typically do not permit a scalable implementation applicable to real-world games, because either (1) they only work for tabular or linear Markov games, which are too restrictive to model real-world games, or (2) those that do handle rich non-linear function approximation (Jin et al., 2021c) are not computationally efficient. This motivates us to ask the following question:

Can we design an efficient algorithm that (1) provably learns multi-player general-sum Markov games with rich nonlinear function approximation and (2) permits scalable implementations?

This paper presents the first positive answer to the above question. In particular, we make the following contributions:

1. We design a new centralized self-play meta-algorithm for multi-agent low-rank Markov games: General Representation Learning for Multi-player General-sum Markov Games (GERL_MG2). We present a model-based and a model-free instantiation of GERL_MG2, which differ in the way function approximation is used, together with a clean and unified analysis for both approaches.

2. We show that the model-based variant requires access to an MLE oracle and a NE/CE/CCE oracle for matrix games, and enjoys a Õ(H^6 d^4 A^2 log(|Φ||Ψ|)/ε^2) sample complexity for learning an ε-NE/CE/CCE equilibrium policy, where d is the dimension of the feature vector, A is the size of the joint action space, H is the game horizon, and Φ and Ψ are the function classes for the representation and the emission process. The model-free variant replaces model learning with solving a minimax optimization problem, and enjoys a sample complexity of Õ(H^6 d^4 A^3 M log(|Φ|)/ε^2), where M is the number of players, for a slightly restricted class of Markov games with latent block structure.

3. Both of the above algorithms have sample complexities scaling with the size of the joint action space, which is exponential in the number of players. This unfavorable scaling is referred to as the curse of multi-agent, and is unavoidable in the worst case under general function approximation. We consider a spatial factorization structure where the transition of each player's local state is directly affected by at most L = O(1) players in its adjacency. Given this additional structure, we provide an algorithm that achieves Õ(M^4 H^6 d^{2(L+1)} Ã^{2(L+1)}/ε^2) sample complexity, where Ã is the size of a single player's action space, thus escaping the exponential scaling in the number of agents.

4. Finally, we provide an efficient implementation of our reward-free algorithm and show that it achieves superior performance against traditional deep RL baselines that lack principled representation learning.
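To make the low-rank structure and the MLE oracle concrete, the following toy sketch builds a tabular transition kernel of the factored form P(s'|s,a) = ⟨φ(s,a), μ(s')⟩ and evaluates the negative log-likelihood objective that an MLE-style model-learning oracle would minimize over candidate models. All names and sizes (S, A, d, phi, mu) are illustrative choices of ours, not the paper's implementation.

```python
import numpy as np

# Toy low-rank transition: P(s' | s, a) = <phi(s, a), mu(s')>, rank at most d.
rng = np.random.default_rng(0)
S, A, d = 6, 3, 2  # number of states, joint actions, feature dimension

# Build a valid low-rank kernel: nonnegative factors, rows normalized.
phi = rng.random((S, A, d))              # feature of each (s, a) pair
mu = rng.random((S, d))                  # emission feature of each next state s'
P = np.einsum('sad,td->sat', phi, mu)    # unnormalized P(s' | s, a)
P /= P.sum(axis=-1, keepdims=True)       # each row is now a distribution

def neg_log_likelihood(model, data):
    """Average negative log-likelihood of (s, a, s') triples under `model`;
    an MLE oracle would minimize this over a candidate model class."""
    return -np.mean([np.log(model[s, a, sp]) for (s, a, sp) in data])

# Sample transitions from the true kernel.
data = []
for _ in range(500):
    s, a = rng.integers(S), rng.integers(A)
    sp = rng.choice(S, p=P[s, a])
    data.append((s, a, sp))

# Compare the true low-rank model against a uniform baseline model.
uniform = np.full((S, A, S), 1.0 / S)
print(neg_log_likelihood(P, data), neg_log_likelihood(uniform, data))
```

Note that normalizing each row only rescales φ(s,a) by a scalar, so the flattened kernel still has rank at most d; this is the hidden structure the representation-learning step exploits.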

1.1. RELATED WORKS

Markov games. Markov games (Littman, 1994; Shapley, 1953) are an extensively used framework for game playing with sequential decision making. Earlier works (Littman, 1994; Hu and Wellman, 2003; Hansen et al., 2013) studied how to find the Nash equilibrium of a Markov game when the transition matrix and reward function are known. When the dynamics of the Markov game are unknown, recent works provide a line of finite-sample guarantees for learning a Nash equilibrium in two-player zero-sum Markov games (Bai and Jin, 2020; Xie et al., 2020; Bai et al., 2020; Zhang et al., 2020; Liu et al., 2021; Jin et al., 2021c; Huang et al., 2021) and for learning various equilibria (including NE, CE, and CCE, which are standard solution notions in games (Roughgarden, 2010)) in general-sum Markov games (Liu et al., 2021; Bai et al., 2021; Jin et al., 2021b). Some of the analyses in these works build on techniques for learning single-agent Markov decision processes (MDPs) (Azar et al., 2017; Jin et al., 2018; 2020).

RL with function approximation. Function approximation in reinforcement learning has been extensively studied in recent years. For single-agent Markov decision processes, function approximation is adopted to achieve a better sample complexity that depends on the complexity of the function approximators rather than on the size of the state-action space. For example, Yang and Wang (2019); Jin et al. (2020); Zanette et al. (2020) considered the linear MDP model, where the transition probability function and reward function are linear in some feature mapping over state-action pairs. Another line of works (see, e.g., Jiang et al., 2017; Jin et al., 2021a; Du et al., 2021; Foster et al., 2021)
studied MDPs with general nonlinear function approximation. When it comes to Markov games, Chen et al. (2021); Xie et al. (2020); Jia et al. (2019) studied Markov games with linear function approximation. Recently, Huang et al. (2021) and Jin et al. (2021c) proposed the first algorithms for two-player zero-sum Markov games with general function approximation, and provided a sample complexity governed by the minimax Eluder dimension. However, technical difficulties prevent extending these results to multi-player general-sum Markov games with nonlinear function approximation. The results for linear function approximation assume a known state-action feature, and cannot handle Markov games with more general non-linear approximation where both the feature and the function parameters are unknown. As for the works on general function classes, their approaches rely heavily on the two-player structure, and it is not clear how to apply their methods to the general multi-player setting.

Representation learning in RL. Our work is closely related to representation learning in single-agent RL, where the study mainly focuses on low-rank MDPs. A low-rank MDP is strictly more

