BEST POSSIBLE Q-LEARNING

Abstract

Fully decentralized learning, where the global information, i.e., the actions of other agents, is inaccessible, is a fundamental challenge in cooperative multi-agent reinforcement learning. However, the convergence and optimality of most decentralized algorithms are not theoretically guaranteed, since the transition probabilities are non-stationary as all agents are updating policies simultaneously. To tackle this challenge, we propose best possible operator, a novel decentralized operator, and prove that the policies of agents will converge to the optimal joint policy if each agent independently updates its individual state-action value by the operator. Further, to make the update more efficient and practical, we simplify the operator and prove that the convergence and optimality still hold with the simplified one. By instantiating the simplified operator, the derived fully decentralized algorithm, best possible Q-learning (BQL), does not suffer from non-stationarity. Empirically, we show that BQL achieves remarkable improvement over baselines in a variety of cooperative multi-agent tasks.

1. INTRODUCTION

Multi-agent reinforcement learning (MARL) trains a group of agents to cooperatively maximize the cumulative shared reward, which has great significance for real-world applications, including logistics (Li et al., 2019), traffic signal control (Xu et al., 2021), power dispatch (Wang et al., 2021b), and games (Vinyals et al., 2019). Although most existing MARL methods follow the paradigm of centralized training and decentralized execution (CTDE), in many scenarios the information of all agents is unavailable during training, and each agent has to learn independently without centralized information. Thus, fully decentralized learning, where agents can only use local experiences without the actions of other agents, is highly desirable. However, in fully decentralized learning, other agents are treated as a part of the environment while updating their policies simultaneously, so the transition probabilities from the perspective of an individual agent are non-stationary. Thus, the convergence of most decentralized algorithms, e.g., independent Q-learning (IQL) (Tan, 1993), is not theoretically guaranteed. Multi-agent alternate Q-learning (MA2QL) (Su et al., 2022) guarantees convergence to a Nash equilibrium, but the converged equilibrium may not be the optimal one when there are multiple equilibria (Zhang et al., 2021a). Distributed IQL (Lauer & Riedmiller, 2000) can learn the optimal joint policy, yet is limited to deterministic environments. How to guarantee convergence to the optimal joint policy in stochastic environments remains open. To tackle this challenge, we propose best possible operator, a novel decentralized operator to update the individual state-action value of each agent, and prove that the policies of agents converge to the optimal joint policy under this operator.
However, it is inefficient and thus impractical to perform best possible operator, because at each update it needs to compute the expected values under all possible transition probabilities and update the state-action value to the maximal one. Therefore, we further propose simplified best possible operator. At each update, the simplified operator computes the expected value under only one of the possible transition probabilities and monotonically updates the state-action value. We prove that the policies of agents also converge to the optimal joint policy under the simplified operator. We instantiate the simplified operator with Q-tables for tabular cases and with neural networks for complex environments. In the Q-table instantiation, non-stationarity is naturally avoided, and in the neural network instantiation, non-stationarity in the replay buffer is no longer a drawback but a necessary condition for convergence. The proposed algorithm, best possible Q-learning (BQL), is fully decentralized, without using the information of other agents. We evaluate BQL on a variety of cooperative multi-agent tasks, i.e., stochastic games, MPE-based differential games (Lowe et al., 2017), multi-agent MuJoCo (de Witt et al., 2020b), and SMAC (Samvelyan et al., 2019), covering fully and partially observable, deterministic and stochastic, discrete and continuous environments. Empirically, BQL substantially outperforms baselines. To the best of our knowledge, BQL is the first decentralized algorithm that guarantees convergence to the global optimum in stochastic environments, and more simplifications and instantiations of best possible operator can be further explored. We believe BQL will be a new paradigm for fully decentralized learning.
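To make the monotonic-update idea concrete, the following is a minimal tabular sketch of our own (not the paper's exact BQL algorithm): at each sweep it evaluates the Bellman target under one sampled candidate transition model (each candidate corresponding to one possible joint policy of the other agents) and keeps the best value seen so far. The names `P_models` and the fixed candidate set are illustrative assumptions.

```python
import numpy as np

gamma = 0.9
nS, nA = 3, 2
rng = np.random.default_rng(2)

# Hypothetical candidate transition models P_k(s'|s, a_i), one per
# possible policy of the other agents (illustrative numbers only).
P_models = rng.random((4, nS, nA, nS))
P_models /= P_models.sum(-1, keepdims=True)  # normalize over s'
R = rng.random((nS, nS))                     # shared reward r = R(s, s')

Q = np.zeros((nS, nA))
for sweep in range(200):
    k = rng.integers(len(P_models))          # pick ONE candidate model
    # Expected Bellman target under that single model.
    target = np.einsum('sat,st->sa', P_models[k], R + gamma * Q.max(-1))
    # Monotonic ("best possible") update: never decrease the estimate.
    Q = np.maximum(Q, target)
```

Because rewards here lie in [0, 1], the monotonically increasing estimates stay bounded by 1/(1 - gamma), so the iteration converges; the paper's analysis covers when the limit coincides with the optimal value.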

2.1. PRELIMINARIES

We consider an $N$-agent MDP $M_{\mathrm{env}} = \langle S, O, A, R, P_{\mathrm{env}}, \gamma \rangle$ with the state space $S$ and the joint action space $A$. At each timestep, each agent $i$ chooses an individual action $a_i$, and the environment transitions to the next state $s'$ by taking the joint action $a$ with the transition probabilities $P_{\mathrm{env}}(s'|s, a)$. For simplicity of theoretical analysis, we assume all agents obtain the state $s$, though in practice each agent $i$ can make decisions based on its local observation $o_i \in O$ or trajectory. All agents obtain a shared reward $r = R(s, s')$ and learn to maximize the expected return $\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$ with the discount factor $\gamma$. In the fully decentralized setting, $M_{\mathrm{env}}$ is partially observable to each agent, since each agent $i$ only observes its own action $a_i$ instead of the joint action $a$. From the perspective of each agent $i$, there is an MDP $M_i = \langle S, A_i, R, P_i, \gamma \rangle$ with the individual action space $A_i$ and the transition probabilities

$$P_i(s'|s, a_i) = \sum_{a_{-i}} P_{\mathrm{env}}(s'|s, a_i, a_{-i}) \, \pi_{-i}(a_{-i}|s), \qquad (1)$$

where $\pi_{-i}$ denotes the joint policy of all agents except agent $i$, and similarly for $a_{-i}$. According to (1), the transition probabilities $P_i$ depend on the policies of other agents $\pi_{-i}$. As other agents update their policies continuously, $P_i$ becomes non-stationary. Under non-stationary transition probabilities, the convergence of independent Q-learning

$$Q_i(s, a_i) = \mathbb{E}_{P_i(s'|s, a_i)}\left[r + \gamma \max_{a_i'} Q_i(s', a_i')\right] \qquad (2)$$

is not guaranteed, and how to learn the optimal joint policy in fully decentralized settings remains a challenge. In the next section, we propose best possible operator, a novel fully decentralized operator, which theoretically guarantees convergence to the optimal joint policy in stochastic environments.
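The non-stationarity in (1) can be seen numerically. The sketch below (a hypothetical two-agent, two-state example with random numbers) computes agent 1's effective transition probabilities $P_1$ under two different policies of agent 2; changing the other agent's policy changes $P_1$ even though the environment $P_{\mathrm{env}}$ is fixed.

```python
import numpy as np

# Hypothetical 2-agent example: 2 states, 2 actions per agent.
# P_env[s, a1, a2, s'] holds the joint transition probabilities.
rng = np.random.default_rng(0)
P_env = rng.random((2, 2, 2, 2))
P_env /= P_env.sum(axis=-1, keepdims=True)  # normalize over s'

def effective_transition(P_env, pi_other):
    """Equation (1) from agent 1's perspective:
    P_1(s'|s, a_1) = sum_{a_2} P_env(s'|s, a_1, a_2) * pi_2(a_2|s)."""
    # pi_other[s, a2]: policy of agent 2
    return np.einsum('sabt,sb->sat', P_env, pi_other)

pi_uniform = np.full((2, 2), 0.5)           # agent 2 acts uniformly
pi_greedy = np.tile([1.0, 0.0], (2, 1))     # agent 2 always picks action 0

P1_before = effective_transition(P_env, pi_uniform)
P1_after = effective_transition(P_env, pi_greedy)
# Agent 1's effective transition model shifts when agent 2's policy
# changes -- this is exactly the non-stationarity described above.
print(np.abs(P1_before - P1_after).max() > 0)  # True
```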

2.2. BEST POSSIBLE OPERATOR

First, let us consider the optimal joint Q-value

$$Q(s, a) = \mathbb{E}_{P_{\mathrm{env}}(s'|s, a)}\left[r + \gamma \max_{a'} Q(s', a')\right], \qquad (3)$$

which is the expected return of the optimal joint policy $\pi^*(s) = \arg\max_a Q(s, a)$. Based on the optimal joint Q-value, for each agent $i$, we define $\max_{a_{-i}} Q(s, a_i, a_{-i})$, which follows the fixed-point equation:

$$\max_{a_{-i}} Q(s, a_i, a_{-i}) = \max_{a_{-i}} \mathbb{E}_{P_{\mathrm{env}}(s'|s, a)}\left[r + \gamma \max_{a_i'} \max_{a_{-i}'} Q(s', a_i', a_{-i}')\right] \qquad (4)$$
$$= \mathbb{E}_{P_{\mathrm{env}}(s'|s, a_i, \pi_{-i}^*(s, a_i))}\left[r + \gamma \max_{a_i'} \max_{a_{-i}'} Q(s', a_i', a_{-i}')\right], \qquad (5)$$

where $\pi_{-i}^*(s, a_i) = \arg\max_{a_{-i}} Q(s, a_i, a_{-i})$ is the optimal conditional joint policy of the other agents given $a_i$. (4) follows from taking $\max_{a_{-i}}$ on both sides of (3), and (5) from folding $\pi_{-i}^*(s, a_i)$ into $P_{\mathrm{env}}$. Then we have the following lemma.
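The fixed-point equation (5) can be checked numerically. The sketch below (a toy stochastic game with hypothetical random numbers) runs joint value iteration on (3) until convergence, then verifies for agent 1 that $\max_{a_{-i}} Q(s, a_i, a_{-i})$ equals the expectation under $P_{\mathrm{env}}$ with $\pi_{-i}^*(s, a_i)$ folded in.

```python
import numpy as np

# Toy stochastic game: 2 states, 2 actions per agent (illustrative numbers).
rng = np.random.default_rng(1)
nS, nA = 2, 2
P = rng.random((nS, nA, nA, nS))
P /= P.sum(-1, keepdims=True)   # P[s, a1, a2, s']
R = rng.random((nS, nS))        # shared reward r = R(s, s')
gamma = 0.9

# Joint value iteration on the Bellman operator (3):
# Q(s, a) = E_{P_env(s'|s,a)}[ r + gamma * max_{a'} Q(s', a') ]
Q = np.zeros((nS, nA, nA))
for _ in range(500):
    V = Q.reshape(nS, -1).max(-1)                  # max over joint actions
    Q = np.einsum('sabt,st->sab', P, R + gamma * V)

# For agent 1, M(s, a1) = max_{a2} Q(s, a1, a2), and
# pi*_{-i}(s, a1) = argmax_{a2} Q(s, a1, a2). Check equation (5).
M = Q.max(axis=2)
a2_star = Q.argmax(axis=2)
V = Q.reshape(nS, -1).max(-1)
for s in range(nS):
    for a1 in range(nA):
        rhs = P[s, a1, a2_star[s, a1]] @ (R[s] + gamma * V)
        assert np.isclose(M[s, a1], rhs, atol=1e-4)
print("fixed point (5) verified")
```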



For simplicity, we refer to the optimal value Q * as Q in this paper, unless stated otherwise.

