BEST POSSIBLE Q-LEARNING

Abstract

Fully decentralized learning, where global information, i.e., the actions of other agents, is inaccessible, is a fundamental challenge in cooperative multi-agent reinforcement learning. However, the convergence and optimality of most decentralized algorithms are not theoretically guaranteed, since the transition probabilities are non-stationary as all agents update their policies simultaneously. To tackle this challenge, we propose best possible operator, a novel decentralized operator, and prove that the policies of agents converge to the optimal joint policy if each agent independently updates its individual state-action values with this operator. Further, to make the update more efficient and practical, we simplify the operator and prove that convergence and optimality still hold with the simplified one. By instantiating the simplified operator, the derived fully decentralized algorithm, best possible Q-learning (BQL), does not suffer from non-stationarity. Empirically, we show that BQL achieves remarkable improvement over baselines in a variety of cooperative multi-agent tasks.

1. INTRODUCTION

Multi-agent reinforcement learning (MARL) trains a group of agents to cooperatively maximize the cumulative shared reward, which has great significance for real-world applications, including logistics (Li et al., 2019), traffic signal control (Xu et al., 2021), power dispatch (Wang et al., 2021b), and games (Vinyals et al., 2019). Although most existing MARL methods follow the paradigm of centralized training and decentralized execution (CTDE), in many scenarios the information of all agents is unavailable during training, and each agent has to learn independently without centralized information. Thus, fully decentralized learning, where agents can only use local experiences without the actions of other agents, is highly desirable.

However, in fully decentralized learning, as other agents are treated as a part of the environment and are updating their policies simultaneously, the transition probabilities from the perspective of each individual agent are non-stationary. Thus, the convergence of most decentralized algorithms, e.g., independent Q-learning (IQL) (Tan, 1993), is not theoretically guaranteed. Multi-agent alternate Q-learning (MA2QL) (Su et al., 2022) guarantees convergence to a Nash equilibrium, but the converged equilibrium may not be the optimal one when there are multiple equilibria (Zhang et al., 2021a). Distributed IQL (Lauer & Riedmiller, 2000) can learn the optimal joint policy, yet is limited to deterministic environments. How to guarantee convergence to the optimal joint policy in stochastic environments remains open.

To tackle this challenge, we propose best possible operator, a novel decentralized operator for updating the individual state-action value of each agent, and prove that the policies of agents converge to the optimal joint policy under this operator. However, the operator is inefficient and thus impractical, because at each update it needs to compute the expected values under all possible transition probabilities and set the state-action value to the maximum. Therefore, we further propose simplified best possible operator. At each update, the simplified operator only computes the expected value under one of the possible transition probabilities and monotonically updates the state-action value. We prove that the policies of agents also converge to the optimal joint policy under the simplified operator.

We instantiate the simplified operator with a Q-table for tabular cases and with neural networks for complex environments. In the Q-table instantiation, non-stationarity is naturally avoided, and in the neural network instantiation, non-stationarity in the replay buffer is no longer a drawback but a necessary condition for convergence.
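To make the intuition concrete, the following is a minimal tabular sketch of the monotonic update just described, not the full algorithm developed later in the paper. All names (e.g., `simplified_best_possible_update`, the contents of `batch`) are hypothetical; the sketch assumes the transitions in `batch` were collected while the other agents' policies, and hence one possible transition probability, were held fixed.

```python
from collections import defaultdict

import numpy as np


def greedy_value(q, state, actions):
    """Return max over this agent's own actions a' of Q_i(state, a')."""
    return max(q[(state, a)] for a in actions)


def simplified_best_possible_update(q, batch, actions, gamma=0.99):
    """Monotonically update one agent's individual Q-table.

    `batch` holds (s, a, r, s') transitions gathered under one fixed behavior
    of the other agents, i.e., one possible transition probability for this
    agent. The expected target under that transition probability is estimated
    by averaging, and each Q-value is only ever increased (the monotonic part
    of the simplified operator).
    """
    targets = defaultdict(list)
    for (s, a, r, s_next) in batch:
        targets[(s, a)].append(r + gamma * greedy_value(q, s_next, actions))
    for (s, a), ys in targets.items():
        q[(s, a)] = max(q[(s, a)], float(np.mean(ys)))  # monotonic max-update
    return q


# Hypothetical usage: two actions, Q initialized to zero purely for illustration.
actions = [0, 1]
q = defaultdict(float)
batch = [("s0", 0, 1.0, "s1"), ("s0", 0, 0.0, "s1"), ("s0", 1, 0.5, "s1")]
q = simplified_best_possible_update(q, batch, actions)
```

Repeating such updates over batches generated under different behaviors of the other agents is what lets the individual value approach the best value achievable under any of the possible transition probabilities, which is the idea the convergence analysis formalizes.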

