A SHARP ANALYSIS OF MODEL-BASED REINFORCEMENT LEARNING WITH SELF-PLAY

Anonymous

Abstract

Model-based algorithms (algorithms that explore the environment by building and utilizing an estimated model) are widely used in reinforcement learning practice and are theoretically shown to achieve optimal sample efficiency for single-agent reinforcement learning in Markov Decision Processes (MDPs). However, for multi-agent reinforcement learning in Markov games, the current best known sample complexity for model-based algorithms is rather suboptimal and compares unfavorably against recent model-free approaches. In this paper, we present a sharp analysis of model-based self-play algorithms for multi-agent Markov games. We design an algorithm, Optimistic Nash Value Iteration (Nash-VI), for two-player zero-sum Markov games that is able to output an ε-approximate Nash policy in Õ(H³SAB/ε²) episodes of game playing, where S is the number of states, A and B are the numbers of actions for the two players respectively, and H is the horizon length. This significantly improves over the best known model-based guarantee of Õ(H⁴S²AB/ε²), and is the first to match the information-theoretic lower bound Ω(H³S(A+B)/ε²) up to a min{A, B} factor. In addition, our guarantee compares favorably against the best known model-free algorithm if min{A, B} = o(H³), and outputs a single Markov policy, whereas existing sample-efficient model-free algorithms output a nested mixture of Markov policies that is in general non-Markov and rather inconvenient to store and execute. We further adapt our analysis to design a provably efficient task-agnostic algorithm for zero-sum Markov games, and the first line of provably sample-efficient algorithms for multi-player general-sum Markov games.

1. INTRODUCTION

This paper is concerned with the problem of multi-agent reinforcement learning (multi-agent RL), in which multiple agents learn to make decisions in an unknown environment in order to maximize their (own) cumulative rewards. Multi-agent RL has achieved significant recent success in traditionally hard AI challenges, including large-scale strategy games such as Go (Silver et al., 2016; 2017), real-time video games involving team play such as Starcraft and Dota2 (OpenAI, 2018; Vinyals et al., 2019), and behavior learning in complex social scenarios (Baker et al., 2020). Achieving human-like (or super-human) performance in these games using multi-agent RL typically requires a large number of samples (steps of game playing) due to the necessity of exploration, and how to improve the sample complexity of multi-agent RL has been an important research question.

One prevalent approach to multi-agent RL is model-based methods: use the existing visitation data to build an estimate of the model (i.e., transition dynamics and rewards), run an offline planning algorithm on the estimated model to obtain the policy, and play the policy in the environment. Such a principle underlies some of the earliest single-agent online RL algorithms such as E3 (Kearns & Singh, 2002) and R-Max (Brafman & Tennenholtz, 2002), and is conceptually appealing for multi-agent RL too, since the multi-agent structure does not add complexity to the model estimation part and only requires an appropriate multi-agent planning algorithm (such as value iteration for games (Shapley, 1953)) in a black-box fashion. On the other hand, model-free methods do not directly build estimates of the model; instead, they directly estimate the value functions or action-value (Q) functions of the problem at the optimal/equilibrium policies, and play the greedy policies with respect to the estimated value functions. Model-free algorithms have also been well developed for multi-agent RL, such as friend-or-foe Q-Learning (Littman, 2001) and Nash Q-Learning (Hu & Wellman, 2003).

While both model-based and model-free algorithms have been shown to be provably efficient for multi-agent RL in a recent line of work (Bai & Jin, 2020; Xie et al., 2020; Bai et al., 2020), a more precise understanding of the optimal sample complexities within these two types of algorithms (respectively) is still lacking, in particular in the specific setting of two-player zero-sum Markov games.

Table 1: Sample complexity (the required number of episodes) for algorithms to find ε-approximate Nash equilibrium policies in zero-sum Markov games: VI-Explore and VI-ULCB by Bai & Jin (2020), OMVI-SM by Xie et al. (2020), and Nash Q/V-learning by Bai et al. (2020). The lower bound Ω(H³S(A+B)/ε²) was proved by Jin et al. (2018); Domingues et al. (2020).

In this paper, we advance the theoretical understanding of multi-agent RL by presenting a sharp analysis of model-based algorithms on Markov games. Our core contribution is the design of a new model-based algorithm, Optimistic Nash Value Iteration (Nash-VI), that achieves an almost optimal sample complexity for zero-sum Markov games and improves significantly over existing model-based approaches. We summarize our main contributions as follows. A comparison between our and prior results can be found in Table 1.

• We design a new model-based algorithm Optimistic Nash Value Iteration (Nash-VI) that provably finds ε-approximate Nash equilibria for Markov games in Õ(H³SAB/ε²) episodes of game playing (Section 3). This improves over the best existing model-based algorithm by a factor of Õ(HS), and is the first algorithm that matches the sample complexity lower bound up to a Õ(min{A, B}) factor, showing that model-based algorithms can indeed achieve an almost optimal sample complexity. Further, unlike state-of-the-art model-free algorithms such as Nash V-Learning (Bai et al., 2020), this algorithm achieves in addition a Õ(√T) regret bound, and outputs a simple Markov policy (instead of a nested mixture of Markov policies as returned by Nash V-Learning).
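The model-based planning principle described above (estimate the transition dynamics and rewards from visitation data, then plan on the estimated model) can be illustrated with a minimal sketch for tabular two-player zero-sum Markov games. This is not the paper's Nash-VI: all function names are hypothetical, the per-state zero-sum matrix games are solved only approximately by fictitious play rather than exactly, and the optimism bonuses that drive exploration in Nash-VI are omitted.

```python
# Minimal sketch of model-based planning in a tabular two-player zero-sum
# Markov game. Names are hypothetical; the paper's Nash-VI additionally adds
# optimism bonuses to each Q matrix before solving it, which is omitted here.

def estimate_model(episodes):
    """Empirical model from visitation data, given as (s, a, b, r, s') tuples."""
    counts, rewards = {}, {}
    for s, a, b, r, s2 in episodes:
        key = (s, a, b)
        counts.setdefault(key, {})
        counts[key][s2] = counts[key].get(s2, 0) + 1
        rewards[key] = rewards.get(key, 0.0) + r
    P_hat = {k: {s2: c / sum(d.values()) for s2, c in d.items()}
             for k, d in counts.items()}
    r_hat = {k: rewards[k] / sum(counts[k].values()) for k in counts}
    return P_hat, r_hat

def matrix_game_value(Q, iters=2000):
    """Approximate Nash value of the zero-sum matrix game Q (row maximizes,
    column minimizes) via fictitious play; returns the midpoint of the
    upper and lower best-response estimates."""
    nrows, ncols = len(Q), len(Q[0])
    row_payoff = [0.0] * nrows  # cumulative payoff of each row action vs. column's history
    col_payoff = [0.0] * ncols  # cumulative payoff of each column action vs. row's history
    for _ in range(iters):
        i = max(range(nrows), key=lambda x: row_payoff[x])  # row best response
        j = min(range(ncols), key=lambda y: col_payoff[y])  # column best response
        for y in range(ncols):
            col_payoff[y] += Q[i][y]
        for x in range(nrows):
            row_payoff[x] += Q[x][j]
    return (max(row_payoff) + min(col_payoff)) / (2 * iters)

def nash_vi_plan(P_hat, r_hat, S, A, B, H):
    """Finite-horizon Nash value iteration on the estimated model: at each
    stage, form the Q matrix at every state and back up its game value."""
    V = [0.0] * S  # V_{H+1} = 0
    for _ in range(H):
        V = [matrix_game_value(
                 [[r_hat.get((s, a, b), 0.0)
                   + sum(p * V[s2] for s2, p in P_hat.get((s, a, b), {}).items())
                   for b in range(B)] for a in range(A)])
             for s in range(S)]
    return V
```

For instance, on a single-state game with reward r(a, b) = 1{a = b} (a matching-pennies-style game with stage value 1/2), planning over horizon H returns a value close to H/2. Note that unvisited (s, a, b) triples default to reward 0 and no transition mass here; this is exactly where an optimistic algorithm would instead assign exploration bonuses.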

