A SHARP ANALYSIS OF MODEL-BASED REINFORCEMENT LEARNING WITH SELF-PLAY

Anonymous

Abstract

Model-based algorithms, which explore the environment by building and utilizing an estimated model, are widely used in reinforcement learning practice and are theoretically shown to achieve optimal sample efficiency for single-agent reinforcement learning in Markov Decision Processes (MDPs). However, for multi-agent reinforcement learning in Markov games, the current best known sample complexity for model-based algorithms is rather suboptimal and compares unfavorably against recent model-free approaches. In this paper, we present a sharp analysis of model-based self-play algorithms for multi-agent Markov games. We design an algorithm, Optimistic Nash Value Iteration (Nash-VI), for two-player zero-sum Markov games that is able to output an ε-approximate Nash policy in Õ(H^3 SAB/ε^2) episodes of game playing, where S is the number of states, A and B are the numbers of actions for the two players respectively, and H is the horizon length. This significantly improves over the best known model-based guarantee of Õ(H^4 S^2 AB/ε^2), and is the first that matches the information-theoretic lower bound Ω(H^3 S(A + B)/ε^2) up to a min{A, B} factor. In addition, our guarantee compares favorably against the best known model-free algorithm if min{A, B} = o(H^3), and outputs a single Markov policy, while existing sample-efficient model-free algorithms output a nested mixture of Markov policies that is in general non-Markov and rather inconvenient to store and execute. We further adapt our analysis to design a provably efficient task-agnostic algorithm for zero-sum Markov games, and the first line of provably sample-efficient algorithms for multi-player general-sum Markov games.
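To see the stated gap concretely, the following sketch (illustrative only; the function names are ours, and constants and logarithmic factors are dropped) computes the two episode counts and checks that their ratio, AB/(A + B), never exceeds min{A, B}:

```python
def nash_vi_episodes(H, S, A, B, eps):
    """Nash-VI sample complexity, up to constants and log factors: H^3 * S * A * B / eps^2."""
    return H**3 * S * A * B / eps**2

def lower_bound_episodes(H, S, A, B, eps):
    """Information-theoretic lower bound, up to constants: H^3 * S * (A + B) / eps^2."""
    return H**3 * S * (A + B) / eps**2

# Example instance: the gap between the two bounds is AB / (A + B),
# which is always at most min(A, B) -- the factor quoted in the abstract.
H, S, A, B, eps = 10, 100, 4, 8, 0.1
ratio = nash_vi_episodes(H, S, A, B, eps) / lower_bound_episodes(H, S, A, B, eps)
assert abs(ratio - A * B / (A + B)) < 1e-9   # here 32/12, about 2.67
assert ratio <= min(A, B)
```

In particular, when one player's action set is small (say B = O(1)), AB/(A + B) is a constant and the upper bound is tight up to constants and log factors.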

1. INTRODUCTION

This paper is concerned with the problem of multi-agent reinforcement learning (multi-agent RL), in which multiple agents learn to make decisions in an unknown environment in order to maximize their (own) cumulative rewards. Multi-agent RL has achieved significant recent success in traditionally hard AI challenges including large-scale strategy games (such as Go) (Silver et al., 2016; 2017), real-time video games involving team play such as Starcraft and Dota2 (OpenAI, 2018; Vinyals et al., 2019), as well as behavior learning in complex social scenarios (Baker et al., 2020). Achieving human-like (or super-human) performance in these games using multi-agent RL typically requires a large number of samples (steps of game playing) due to the necessity of exploration, and how to improve the sample complexity of multi-agent RL has been an important research question.

One prevalent approach towards solving multi-agent RL is model-based methods, that is, to use the existing visitation data to build an estimate of the model (i.e., transition dynamics and rewards), run an offline planning algorithm on the estimated model to obtain the policy, and play the policy in the environment. Such a principle underlies some of the earliest single-agent online RL algorithms such as E3 (Kearns & Singh, 2002) and RMax (Brafman & Tennenholtz, 2002), and is conceptually appealing for multi-agent RL too, since the multi-agent structure adds no complexity to the model estimation part and only requires an appropriate multi-agent planning algorithm (such as value iteration for games (Shapley, 1953)) in a black-box fashion.

On the other hand, model-free methods do not directly build estimates of the model, but instead directly estimate the value functions or action-value (Q) functions of the problem at the optimal/equilibrium policies, and play the greedy policies with respect to the estimated value functions. Model-free algorithms have also
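The model-based principle described above can be sketched in the single-agent tabular case: estimate transitions and rewards from visitation counts, then plan on the estimated model with finite-horizon value iteration. This is a minimal illustration, not E3, RMax, or the paper's Nash-VI; in particular, the uniform fallback for unvisited state-action pairs is a simplifying assumption (E3/RMax instead treat unknown pairs optimistically to drive exploration).

```python
import numpy as np

def estimate_model(counts, reward_sums):
    """Build empirical transition and reward estimates from visitation data.

    counts[s, a, s'] = number of observed transitions (s, a) -> s'
    reward_sums[s, a] = sum of rewards observed at (s, a)
    Unvisited (s, a) pairs fall back to a uniform transition (a
    simplification; optimistic treatment is what the classic
    algorithms actually use).
    """
    S, A, _ = counts.shape
    n_sa = counts.sum(axis=2)                      # visits to each (s, a)
    P_hat = np.where(n_sa[..., None] > 0,
                     counts / np.maximum(n_sa[..., None], 1),
                     1.0 / S)
    R_hat = reward_sums / np.maximum(n_sa, 1)
    return P_hat, R_hat

def plan_value_iteration(P_hat, R_hat, H):
    """Finite-horizon value iteration on the estimated model.

    Returns a non-stationary greedy policy (one action per step and
    state) and the estimated value of each state at step 0.
    """
    S, A = R_hat.shape
    V = np.zeros(S)
    policy = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        Q = R_hat + P_hat @ V                      # Q[s, a] = r(s,a) + E[V(s')]
        policy[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return policy, V
```

The agent would then execute the greedy policy in the true environment, add the new transitions to `counts`, and repeat. In the two-player zero-sum setting, only the planning step changes: the per-step `argmax` is replaced by solving a matrix game over the two players' actions, with model estimation left untouched.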

