BEST POSSIBLE Q-LEARNING

Abstract

Fully decentralized learning, where the global information, i.e., the actions of other agents, is inaccessible, is a fundamental challenge in cooperative multi-agent reinforcement learning. However, the convergence and optimality of most decentralized algorithms are not theoretically guaranteed, since the transition probabilities are non-stationary as all agents are updating policies simultaneously. To tackle this challenge, we propose best possible operator, a novel decentralized operator, and prove that the policies of agents will converge to the optimal joint policy if each agent independently updates its individual state-action value by the operator. Further, to make the update more efficient and practical, we simplify the operator and prove that the convergence and optimality still hold with the simplified one. By instantiating the simplified operator, the derived fully decentralized algorithm, best possible Q-learning (BQL), does not suffer from non-stationarity. Empirically, we show that BQL achieves remarkable improvement over baselines in a variety of cooperative multi-agent tasks.

1. INTRODUCTION

Multi-agent reinforcement learning (MARL) trains a group of agents to cooperatively maximize the cumulative shared reward, which has great significance for real-world applications, including logistics (Li et al., 2019), traffic signal control (Xu et al., 2021), power dispatch (Wang et al., 2021b), and games (Vinyals et al., 2019). Although most existing MARL methods follow the paradigm of centralized training and decentralized execution (CTDE), in many scenarios the information of all agents is unavailable during training, and each agent has to learn independently without centralized information. Thus, fully decentralized learning, where the agents can only use local experiences without the actions of other agents, is highly desirable. However, in fully decentralized learning, as other agents are treated as a part of the environment and are updating their policies simultaneously, the transition probabilities from the perspective of an individual agent are non-stationary. Thus, the convergence of most decentralized algorithms, e.g., independent Q-learning (IQL) (Tan, 1993), is not theoretically guaranteed. Multi-agent alternate Q-learning (MA2QL) (Su et al., 2022) guarantees the convergence to a Nash equilibrium, but the converged equilibrium may not be the optimal one when there are multiple equilibria (Zhang et al., 2021a). Distributed IQL (Lauer & Riedmiller, 2000) can learn the optimal joint policy, yet is limited to deterministic environments. How to guarantee convergence to the optimal joint policy in stochastic environments remains open. To tackle this challenge, we propose best possible operator, a novel decentralized operator to update the individual state-action value of each agent, and prove that the policies of agents converge to the optimal joint policy under this operator.
However, it is inefficient and thus impractical to perform best possible operator, because at each update it needs to compute the expected values of all possible transition probabilities and update the state-action value to be the maximal one. Therefore, we further propose simplified best possible operator. At each update, the simplified operator only computes the expected value of one of the possible transition probabilities and monotonically updates the state-action value. We prove that the policies of agents also converge to the optimal joint policy under the simplified operator. We instantiate the simplified operator with Q-tables for tabular cases and with neural networks for complex environments. In the Q-table instantiation, non-stationarity is naturally avoided, and in the neural network instantiation, non-stationarity in the replay buffer is no longer a drawback, but a necessary condition for convergence. The proposed algorithm, best possible Q-learning (BQL), is fully decentralized, without using the information of other agents. We evaluate BQL on a variety of multi-agent cooperative tasks, i.e., stochastic games, MPE-based differential games (Lowe et al., 2017), multi-agent MuJoCo (de Witt et al., 2020b), and SMAC (Samvelyan et al., 2019), covering fully and partially observable, deterministic and stochastic, and discrete and continuous environments. Empirically, BQL substantially outperforms baselines. To the best of our knowledge, BQL is the first decentralized algorithm that guarantees convergence to the global optimum in stochastic environments, and more simplifications and instantiations of best possible operator can be further explored. We believe BQL will be a new paradigm for fully decentralized learning.

2. METHOD

2.1. PRELIMINARIES

We consider an $N$-agent MDP $\mathcal{M}_{env} = \langle S, O, A, R, P_{env}, \gamma \rangle$ with the state space $S$ and the joint action space $A$. At each timestep, each agent $i$ chooses an individual action $a_i$, and the environment transitions to the next state $s'$ by taking the joint action $a$ with the transition probabilities $P_{env}(s'|s,a)$. For simplicity of theoretical analysis, we assume all agents obtain the state $s$, though in practice each agent $i$ can make decisions based on its local observation $o_i \in O$ or trajectory. All agents obtain a shared reward $r = R(s,s')$ and learn to maximize the expected return $\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$ with the discount factor $\gamma$.

In the fully decentralized setting, $\mathcal{M}_{env}$ is partially observable to each agent, since each agent $i$ only observes its own action $a_i$ instead of the joint action $a$. From the perspective of each agent $i$, there is an MDP $\mathcal{M}_i = \langle S, A_i, R, P_i, \gamma \rangle$ with the individual action space $A_i$ and the transition probabilities

$$P_i(s'|s,a_i) = \sum_{a_{-i}} P_{env}(s'|s,a_i,a_{-i})\,\pi_{-i}(a_{-i}|s), \quad (1)$$

where $\pi_{-i}$ denotes the joint policy of all agents except agent $i$, and similarly for $a_{-i}$. According to (1), the transition probabilities $P_i$ depend on the policies of the other agents $\pi_{-i}$. As the other agents update their policies continuously, $P_i$ becomes non-stationary. Under non-stationary transition probabilities, the convergence of independent Q-learning

$$Q_i(s,a_i) = \mathbb{E}_{P_i(s'|s,a_i)}\left[r + \gamma \max_{a_i'} Q_i(s',a_i')\right] \quad (2)$$

is not guaranteed, and how to learn the optimal joint policy in fully decentralized settings remains a challenge. In the next section, we propose best possible operator, a novel fully decentralized operator that theoretically guarantees convergence to the optimal joint policy in stochastic environments.
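To make the source of non-stationarity in (1) concrete, the following numpy sketch computes the transition probabilities induced for one agent by another agent's policy. The small random MDP and all names here are illustrative, not from the paper.

```python
import numpy as np

# A hypothetical 2-agent MDP: P_env[s, a1, a2, s'] gives P_env(s'|s, a1, a2).
rng = np.random.default_rng(0)
S, A1, A2 = 3, 2, 2
P_env = rng.random((S, A1, A2, S))
P_env /= P_env.sum(axis=-1, keepdims=True)   # normalize over next states

def induced_transition(P_env, pi_other):
    """P_i(s'|s, a_i) = sum_{a_-i} P_env(s'|s, a_i, a_-i) * pi_-i(a_-i|s)."""
    # pi_other[s, a2] is agent 2's (possibly changing) policy
    return np.einsum('sabt,sb->sat', P_env, pi_other)

pi_uniform = np.full((S, A2), 0.5)                    # agent 2 explores uniformly
pi_greedy = np.zeros((S, A2)); pi_greedy[:, 0] = 1.0  # later, a deterministic policy

P1_early = induced_transition(P_env, pi_uniform)
P1_late = induced_transition(P_env, pi_greedy)
# Agent 1's environment is non-stationary: its transition probabilities change
# whenever agent 2 changes policy.
print(np.abs(P1_early - P1_late).max())  # > 0 in general
```

The induced matrices are valid transition probabilities (rows sum to one), yet they differ whenever $\pi_{-i}$ differs, which is exactly why the fixed point of (2) drifts during decentralized training.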

2.2. BEST POSSIBLE OPERATOR

First, let us consider the optimal joint Q-value

$$Q(s,a) = \mathbb{E}_{P_{env}(s'|s,a)}\left[r + \gamma \max_{a'} Q(s',a')\right], \quad (3)$$

which is the expected return of the optimal joint policy $\pi^*(s) = \arg\max_a Q(s,a)$. Based on the optimal joint Q-value, for each agent $i$, we define $\max_{a_{-i}} Q(s,a_i,a_{-i})$, which follows the fixed point equation

$$\max_{a_{-i}} Q(s,a_i,a_{-i}) = \max_{a_{-i}} \mathbb{E}_{P_{env}(s'|s,a)}\left[r + \gamma \max_{a_i'} \max_{a_{-i}'} Q(s',a_i',a_{-i}')\right] \quad (4)$$
$$= \mathbb{E}_{P_{env}(s'|s,a_i,\pi^*_{-i}(s,a_i))}\left[r + \gamma \max_{a_i'} \max_{a_{-i}'} Q(s',a_i',a_{-i}')\right], \quad (5)$$

where $\pi^*_{-i}(s,a_i) = \arg\max_{a_{-i}} Q(s,a_i,a_{-i})$ is the optimal conditional joint policy of the other agents given $a_i$. (4) follows from taking $\max_{a_{-i}}$ on both sides of (3), and (5) follows from folding $\pi^*_{-i}(s,a_i)$ into $P_{env}$. Then we have the following lemma.

Lemma 1. If each agent $i$ learns the independent value function $Q_i(s,a_i) = \max_{a_{-i}} Q(s,a_i,a_{-i})$ and takes actions as $\arg\max_{a_i} Q_i(s,a_i)$, the agents will obtain the optimal joint policy, when there is only one optimal joint policy.

Proof. Since $\max_{a_i} \max_{a_{-i}} Q(s,a_i,a_{-i}) = \max_a Q(s,a)$, and there is only one optimal joint policy, $\arg\max_{a_i} Q_i(s,a_i)$ is the action of agent $i$ in the optimal joint action $a$.

According to Lemma 1, obtaining the optimal joint policy reduces to letting each agent $i$ learn the value function $Q_i(s,a_i) = \max_{a_{-i}} Q(s,a_i,a_{-i})$. To this end, we propose a new operator to update $Q_i$ in a fully decentralized way:

$$Q_i(s,a_i) = \max_{P_i(\cdot|s,a_i)} \mathbb{E}_{P_i(s'|s,a_i)}\left[r + \gamma \max_{a_i'} Q_i(s',a_i')\right]. \quad (6)$$

Given $s$ and $a_i$, there are numerous possible $P_i(s'|s,a_i)$ due to different other agents' policies $\pi_{-i}$. To reduce the complexity, we can consider only deterministic policies: when there is only one optimal joint policy, the optimal joint policy must be deterministic (Puterman, 1994), so the operator (6) takes the maximum only over the transition probabilities $P_i(s'|s,a_i)$ induced by deterministic $\pi_{-i}$.
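Operator (6) can be spelled out in a toy tabular setting. The sketch below enumerates every deterministic policy of the other agent and takes the maximal expected backup; the random 3-state MDP is illustrative, not from the paper.

```python
import numpy as np
from itertools import product

# Hypothetical tabular 2-agent MDP: P_env(s'|s, a1, a2) and shared reward R(s, s').
rng = np.random.default_rng(1)
S, A1, A2, gamma = 3, 2, 2, 0.9
P_env = rng.random((S, A1, A2, S))
P_env /= P_env.sum(-1, keepdims=True)
R = rng.random((S, S))

def best_possible_update(Q1):
    """One application of (6) for agent 1: for each (s, a1), take the maximal
    expected backup over the transition probabilities induced by every
    deterministic policy of the other agent."""
    V1 = Q1.max(axis=1)                        # max_{a1'} Q1(s', a1')
    new_Q = np.full((S, A1), -np.inf)
    for pi2 in product(range(A2), repeat=S):   # enumerate deterministic pi_-i
        # P1[s, a1, s'] induced by this deterministic pi2
        P1 = np.stack([P_env[s, :, pi2[s], :] for s in range(S)])
        backup = np.einsum('sat,st->sa', P1, R + gamma * V1[None, :])
        new_Q = np.maximum(new_Q, backup)
    return new_Q

Q1 = np.zeros((S, A1))   # 0 lower-bounds all returns since R >= 0 here
for _ in range(200):
    Q1 = best_possible_update(Q1)
# By Lemma 3, Q1 now approximates max_{a2} Q(s, a1, a2).
```

Iterating the update drives $Q_1$ to $\max_{a_2} Q(s,a_1,a_2)$ of the optimal joint Q-value, matching Theorem 1; the enumeration over all $|A_2|^{|S|}$ deterministic policies is exactly the cost the simplified operator of Section 2.3 removes.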
Intuitively, the operator continuously pursues the "best possible expected return" until $Q_i$ reaches the optimal expected return $\max_{a_{-i}} Q(s,a_i,a_{-i})$, so we name the operator (6) best possible operator. In the following, we theoretically prove that $Q_i(s,a_i)$ converges to $\max_{a_{-i}} Q(s,a_i,a_{-i})$ under best possible operator, and thus the agents learn the optimal joint policy. Let $Q_i^k(s,a_i)$ denote the value function at update $k$, let $Q_i^0$ be initialized to the minimal return, and let $Q_i(s,a_i) := Q_i^\infty(s,a_i)$. Then, we have the following lemma.

Lemma 2. $\max_{a_{-i}} Q(s,a_i,a_{-i}) \ge Q_i^k(s,a_i), \; \forall s, a_i, \forall k$, under best possible operator.

Proof. We prove the lemma by induction. First, as $Q_i^0$ is initialized to the minimal return, $\max_{a_{-i}} Q(s,a_i,a_{-i}) \ge Q_i^0(s,a_i)$. Then, suppose $\max_{a_{-i}} Q(s,a_i,a_{-i}) \ge Q_i^{k-1}(s,a_i), \forall s, a_i$. Denoting $\arg\max_{P_i(s'|s,a_i)} \mathbb{E}_{P_i(s'|s,a_i)}\left[r + \gamma \max_{a_i'} Q_i^{k-1}(s',a_i')\right]$ as $P_i^*(s'|s,a_i)$, we have

$$\max_{a_{-i}} Q(s,a_i,a_{-i}) - Q_i^k(s,a_i)$$
$$= \max_{a_{-i}} \sum_{s'} P_{env}(s'|s,a_i,a_{-i})\left[r(s,s') + \gamma \max_{a_i'} \max_{a_{-i}'} Q(s',a_i',a_{-i}')\right] - \sum_{s'} P_i^*(s'|s,a_i)\left[r(s,s') + \gamma \max_{a_i'} Q_i^{k-1}(s',a_i')\right]$$
$$\ge \sum_{s'} P_i^*(s'|s,a_i)\left[r(s,s') + \gamma \max_{a_i'} \max_{a_{-i}'} Q(s',a_i',a_{-i}')\right] - \sum_{s'} P_i^*(s'|s,a_i)\left[r(s,s') + \gamma \max_{a_i'} Q_i^{k-1}(s',a_i')\right]$$
$$= \gamma \sum_{s'} P_i^*(s'|s,a_i)\left[\max_{a_i'} \max_{a_{-i}'} Q(s',a_i',a_{-i}') - \max_{a_i'} Q_i^{k-1}(s',a_i')\right]$$
$$\ge \gamma \sum_{s'} P_i^*(s'|s,a_i)\left[\max_{a_{-i}'} Q(s',a_i'^*,a_{-i}') - Q_i^{k-1}(s',a_i'^*)\right] \ge 0,$$

where $a_i'^* = \arg\max_{a_i'} Q_i^{k-1}(s',a_i')$. Thus, the claim holds at update $k$. By the principle of induction, the lemma holds for all updates. Intuitively, $\max_{a_{-i}} Q(s,a_i,a_{-i})$ is the optimal expected return after taking action $a_i$, so it is an upper bound of $Q_i(s,a_i)$.

Further, based on Lemma 2, we prove the following lemma.

Lemma 3.
$Q_i(s,a_i)$ converges to $\max_{a_{-i}} Q(s,a_i,a_{-i})$ under best possible operator.

Proof. From (5) and (6), we have

$$\left\| \max_{a_{-i}} Q(s,a_i,a_{-i}) - Q_i^k(s,a_i) \right\|_\infty$$
$$= \max_{s,a_i} \left[ \sum_{s'} P_{env}\!\left(s'|s,a_i,\pi^*_{-i}(s,a_i)\right)\left[r(s,s') + \gamma \max_{a_i'}\max_{a_{-i}'} Q(s',a_i',a_{-i}')\right] - \sum_{s'} P_i^*(s'|s,a_i)\left[r(s,s') + \gamma \max_{a_i'} Q_i^{k-1}(s',a_i')\right] \right]$$

(the difference is non-negative by Lemma 2)

$$\le \max_{s,a_i} \left[ \sum_{s'} P_{env}\!\left(s'|s,a_i,\pi^*_{-i}(s,a_i)\right)\left[r(s,s') + \gamma \max_{a_i'}\max_{a_{-i}'} Q(s',a_i',a_{-i}')\right] - \sum_{s'} P_{env}\!\left(s'|s,a_i,\pi^*_{-i}(s,a_i)\right)\left[r(s,s') + \gamma \max_{a_i'} Q_i^{k-1}(s',a_i')\right] \right]$$
$$\le \gamma \max_{s',a_i'} \left[ \max_{a_{-i}'} Q(s',a_i',a_{-i}') - Q_i^{k-1}(s',a_i') \right] = \gamma \left\| \max_{a_{-i}} Q(s,a_i,a_{-i}) - Q_i^{k-1}(s,a_i) \right\|_\infty.$$

Therefore,

$$\left\| \max_{a_{-i}} Q(s,a_i,a_{-i}) - Q_i^k(s,a_i) \right\|_\infty \le \gamma^k \left\| \max_{a_{-i}} Q(s,a_i,a_{-i}) - Q_i^0(s,a_i) \right\|_\infty.$$

Letting $k \to \infty$, $Q_i(s,a_i) \to \max_{a_{-i}} Q(s,a_i,a_{-i})$, thus the lemma holds.

According to Lemmas 1 and 3, we have:

Theorem 1. The agents learn the optimal joint policy under best possible operator.

2.3. SIMPLIFIED BEST POSSIBLE OPERATOR

Best possible operator guarantees convergence to the optimal joint policy. However, to perform (6), at every update each agent $i$ has to compute the expected values of all possible transition probabilities and update $Q_i$ to be the maximal expected value, which is too costly. Therefore, we introduce an auxiliary value function $Q_i^e(s,a_i)$ and simplify (6) into two operators. First, at each update, we randomly select one of the possible transition probabilities $\tilde{P}_i$ for each $(s,a_i)$ and update $Q_i^e(s,a_i)$ by

$$Q_i^e(s,a_i) = \mathbb{E}_{\tilde{P}_i(s'|s,a_i)}\left[r + \gamma \max_{a_i'} Q_i(s',a_i')\right]. \quad (7)$$

$Q_i^e(s,a_i)$ represents the expected value under the selected transition probabilities. Then we monotonically update $Q_i(s,a_i)$ by

$$Q_i(s,a_i) = \max\left(Q_i(s,a_i),\, Q_i^e(s,a_i)\right). \quad (8)$$

We define (7) and (8) together as simplified best possible operator. By performing simplified best possible operator, $Q_i(s,a_i)$ is efficiently updated towards the maximal expected value. And we have the following lemma.

Lemma 4. $Q_i(s,a_i)$ converges to $\max_{a_{-i}} Q(s,a_i,a_{-i})$ under simplified best possible operator.

Proof. According to (8), $Q_i(s,a_i)$ is monotonically increasing, i.e., $Q_i^k(s,a_i) \ge Q_i^{k-1}(s,a_i)$ at update $k$. Similar to the proof of Lemma 2, we can easily prove $\max_{a_{-i}} Q(s,a_i,a_{-i}) \ge Q_i^k(s,a_i)$ under (7) and (8). Thus, $\{Q_i^k(s,a_i)\}$ is an increasing sequence bounded above. According to the monotone convergence theorem, $\{Q_i^k(s,a_i)\}$ converges as $k \to \infty$, and we let $Q_i(s,a_i) := Q_i^\infty(s,a_i)$. Then we prove that the converged value $Q_i(s,a_i)$ equals $\max_{a_{-i}} Q(s,a_i,a_{-i})$. Due to monotonicity and convergence, $\forall \epsilon, s, a_i, \exists K$ such that when $k > K$, $Q_i^k(s,a_i) - Q_i^{k-1}(s,a_i) \le \epsilon$, no matter which $\tilde{P}_i$ is selected at update $k$.
Since each $\tilde{P}_i$ has a positive probability of being selected, when

$$\tilde{P}_i(s'|s,a_i) = \arg\max_{P_i(s'|s,a_i)} \mathbb{E}_{P_i(s'|s,a_i)}\left[r + \gamma \max_{a_i'} Q_i^{k-1}(s',a_i')\right] = P_i^*(s'|s,a_i)$$

is selected, performing (7) and (8) gives

$$Q_i^{k-1}(s,a_i) + \epsilon \ge Q_i^k(s,a_i) \ge Q_i^e(s,a_i) = \sum_{s'} P_i^*(s'|s,a_i)\left[r(s,s') + \gamma \max_{a_i'} Q_i^{k-1}(s',a_i')\right].$$

According to the proof of Lemma 3, we have

$$\max_{s,a_i}\left[\max_{a_{-i}} Q(s,a_i,a_{-i}) - Q_i^e(s,a_i)\right] \le \gamma \max_{s,a_i}\left[\max_{a_{-i}} Q(s,a_i,a_{-i}) - Q_i^{k-1}(s,a_i)\right].$$

Let $s^*, a_i^* = \arg\max_{s,a_i}\left[\max_{a_{-i}} Q(s,a_i,a_{-i}) - Q_i^{k-1}(s,a_i)\right]$. Since $Q_i^{k-1}(s,a_i) + \epsilon \ge Q_i^e(s,a_i)$, we obtain

$$\max_{a_{-i}} Q(s^*,a_i^*,a_{-i}) - Q_i^{k-1}(s^*,a_i^*) - \epsilon \le \gamma \max_{a_{-i}} Q(s^*,a_i^*,a_{-i}) - \gamma Q_i^{k-1}(s^*,a_i^*).$$

Then, we have

$$\left\|\max_{a_{-i}} Q(s,a_i,a_{-i}) - Q_i^{k-1}(s,a_i)\right\|_\infty \le \frac{\epsilon}{1-\gamma}.$$

Since $\epsilon$ is arbitrary, $Q_i(s,a_i)$ converges to $\max_{a_{-i}} Q(s,a_i,a_{-i})$.

According to Lemmas 1 and 4, we have:

Theorem 2. The agents learn the optimal joint policy under simplified best possible operator.
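The two-step update (7)-(8) is cheap enough to run directly. The toy numpy sketch below (the random MDP is illustrative, not from the paper) evaluates one randomly selected deterministic policy of the other agent per update and takes the monotone max:

```python
import numpy as np

# Hypothetical tabular 2-agent MDP, as in the earlier sketches.
rng = np.random.default_rng(2)
S, A1, A2, gamma = 3, 2, 2, 0.9
P_env = rng.random((S, A1, A2, S))
P_env /= P_env.sum(-1, keepdims=True)
R = rng.random((S, S))

Q1 = np.zeros((S, A1))   # lower bound: all rewards here are non-negative
for k in range(3000):
    pi2 = rng.integers(A2, size=S)   # one randomly selected deterministic pi_-i
    P1 = np.stack([P_env[s, :, pi2[s], :] for s in range(S)])
    # (7): expected value under the selected transition probabilities
    Q1e = np.einsum('sat,st->sa', P1, R + gamma * Q1.max(axis=1)[None, :])
    # (8): monotone max, never discarding a better "possible" estimate
    Q1 = np.maximum(Q1, Q1e)
# As Lemma 4 states, Q1 approaches max_{a2} Q(s, a1, a2) even though no single
# update ever enumerates all of the other agent's policies.
```

Because every candidate $\tilde{P}_i$ keeps being selected with positive probability, the monotone sequence climbs to the same fixed point as the full operator (6), which is the whole point of the simplification.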

2.4. BEST POSSIBLE Q-LEARNING

Best possible Q-learning (BQL) is instantiated upon simplified best possible operator. We first consider learning a Q-table for tabular cases. The key challenge is how to obtain all possible transition probabilities under deterministic $\pi_{-i}$ during learning. To address this, the whole training process is divided into $M$ epochs. At epoch $m$, each agent $i$ randomly and independently initializes a deterministic policy $\bar{\pi}_i^m$ and selects a subset of states $S_i^m$. Then each agent $i$ interacts with the environment using the deterministic policy

$$\pi_i(s) = \begin{cases} \arg\max_{a_i} Q_i(s,a_i) & \text{if } s \notin S_i^m, \\ \bar{\pi}_i^m(s) & \text{else.} \end{cases}$$

Each agent $i$ stores its independent experiences $(s,a_i,s',r)$ in the replay buffer $D_i^m$. As $P_i$ depends on $\pi_{-i}$ and the agents act deterministic policies, $D_i^m$ contains one $P_i$ under a deterministic $\pi_{-i}$. Since $P_i$ changes if other agents modify their policies $\pi_{-i}$, acting the randomly initialized policy $\bar{\pi}_i^m$ on $S_i^m$ at epoch $m$ not only helps each agent $i$ explore state-action pairs, but also helps the other agents explore possible transition probabilities. When $M$ is sufficiently large, given any $(s,a_i)$ pair, any $P_i(s'|s,a_i)$ can be found in some replay buffer. After the interaction of epoch $m$, each agent $i$ has a buffer series $\{D_i^1, \cdots, D_i^m\}$. It then randomly selects a buffer $D_i^j$ from $\{D_i^1, \cdots, D_i^m\}$ and samples mini-batches $\{s,a_i,s',r\}$ from $D_i^j$ to update the Q-table $Q_i^e(s,a_i)$ by (7) and then $Q_i(s,a_i)$ by (8). The Q-table implementation is summarized in Algorithm 1.

Then we analyze the sample efficiency of collecting the buffer series. Simplified best possible operator requires that any possible $P_i(s'|s,a_i)$ of an $(s,a_i)$ pair can be found in one buffer, but does not care about the relationship between the transition probabilities of different state-action pairs in the same buffer.
So BQL ideally needs only $|A_i| \times |A_{-i}| = |A|$ small buffers to cover all possible $P_i$ for any $(s,a_i)$ pair, which is very efficient for experience collection. We give an intuitive illustration of this and show that BQL has sample complexity similar to that of joint Q-learning (3) in Appendix B.

Algorithm 1: BQL with Q-table for each agent $i$
1: Initialize tables $Q_i$ and $Q_i^e$.
2: for $m = 1, \dots, M$ do
3:     Initialize the replay buffer $D_i^m$ and the exploration policy $\bar{\pi}_i^m$.
4:     All agents interact with the environment and store experiences $(s,a_i,s',r)$ in $D_i^m$.
5:     for $t = 1, \dots, n_{update}$ do
6:         Randomly select a buffer $D_i^j$ from $\{D_i^1, \cdots, D_i^m\}$.
7:         Update $Q_i^e$ according to (7) using samples from $D_i^j$.
8:         Update $Q_i$ according to (8) using samples from $D_i^j$.
9:     end for
10: end for

In complex environments with large or continuous state-action spaces, it is inefficient and costly to follow the experience collection of the tabular case, where the agents cannot update their policies during the interaction of each epoch, and each epoch requires adequate samples to accurately estimate the expectation in (7). Thus, in complex environments, as in IQL, each agent $i$ maintains only one replay buffer $D_i$, which contains all historical experiences, and uses the same $\epsilon$-greedy policy as IQL, without the randomly initialized deterministic policy $\bar{\pi}_i$. Then we instantiate simplified best possible operator with neural networks $Q_i$ and $Q_i^e$. $Q_i^e$ is updated by minimizing

$$\mathbb{E}_{s,a_i,s',r \sim D_i}\left[\left(Q_i^e(s,a_i) - r - \gamma Q_i(s',a_i'^*)\right)^2\right], \quad a_i'^* = \arg\max_{a_i'} Q_i(s',a_i'). \quad (9)$$

And $Q_i$ is updated by minimizing

$$\mathbb{E}_{s,a_i \sim D_i}\left[w(s,a_i)\left(Q_i(s,a_i) - \bar{Q}_i^e(s,a_i)\right)^2\right], \quad w(s,a_i) = \begin{cases} 1 & \text{if } \bar{Q}_i^e(s,a_i) > Q_i(s,a_i), \\ \lambda & \text{else,} \end{cases} \quad (10)$$

where $\bar{Q}_i^e$ is the softly updated target network of $Q_i^e$. When $\lambda = 0$, (10) is equivalent to (8). However, when $\lambda = 0$, the positive random noise of $Q_i$ in the update can be continuously accumulated, which may cause value overestimation. So we adopt the weighted max in (10) by setting $0 < \lambda < 1$ to offset the positive random noise. In continuous action spaces, following DDPG (Lillicrap et al., 2016), we train a policy network $\pi_i(s)$ by maximizing $Q_i(s,\pi_i(s))$ as a substitute for $\arg\max_{a_i} Q_i(s,a_i)$. The neural network implementation is summarized in Algorithm 2.

Algorithm 2: BQL with neural network for each agent $i$
1: Initialize neural networks $Q_i$ and $Q_i^e$.
2: for each training step do
3:     All agents interact with the environment and store experiences $(s,a_i,s',r)$ in $D_i$.
4:     Sample a mini-batch from $D_i$.
5:     Update $Q_i^e$ by minimizing (9).
6:     Update $Q_i$ by minimizing (10).
7:     Update the target network $\bar{Q}_i^e$.
8: end for

Simplified best possible operator is meaningful for the neural network implementation. As there is only one buffer $D_i$, we cannot perform (6), but we can still perform (7) and (8) on $D_i$. As the other agents are updating their policies, the transition probabilities in $D_i$ will continuously change. If $D_i$ sufficiently goes through all possible transition probabilities, $Q_i(s,a_i)$ converges to $\max_{a_{-i}} Q(s,a_i,a_{-i})$ and the agents learn the optimal joint policy. That is to say, non-stationarity in the replay buffer is no longer a drawback, but a necessary condition for BQL.
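The asymmetric weighting in the loss (10) can be sketched in a few lines of numpy; the mini-batch values below are made up for illustration. Increases toward the target network get full weight 1, while decreases get weight $\lambda$, softening the hard max of (8):

```python
import numpy as np

def bql_weighted_loss(q, q_e_target, lam=0.01):
    """Weighted squared error of loss (10): full weight when the target-network
    value exceeds Q_i (an increase), weight lam otherwise."""
    w = np.where(q_e_target > q, 1.0, lam)
    return (w * (q - q_e_target) ** 2).mean()

q = np.array([0.5, 1.0, 0.2])    # current Q_i(s, a_i) on a hypothetical mini-batch
q_e = np.array([0.8, 0.6, 0.2])  # target-network values of Q_i^e
loss = bql_weighted_loss(q, q_e, lam=0.01)
# Only the first sample (target above Q_i) contributes with full weight; the
# second is down-weighted by lam, which offsets accumulated positive noise.
```

In an actual deep-learning implementation the same weighting would be applied per sample inside the training step, with `q_e` produced by the softly updated target network and gradients flowing only through `q`.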

3. RELATED WORK

Most existing MARL methods (Lowe et al., 2017; Iqbal & Sha, 2019; Wang et al., 2020; Zhang et al., 2021b; Su & Lu, 2022; Peng et al., 2021; Li et al., 2022; Sunehag et al., 2018; Rashid et al., 2018; Son et al., 2019; Wang et al., 2021a; Rashid et al., 2020) follow the paradigm of centralized training and decentralized execution (CTDE), where the information of all agents can be accessed in a centralized way during training. Unlike these methods, we focus on fully decentralized learning where global information is not available. The most straightforward decentralized methods, i.e., independent Q-learning (Tan, 1993) and independent PPO (IPPO) (de Witt et al., 2020a), cannot guarantee the convergence of the learned policy, because the transition probabilities are non-stationary from the perspective of each agent, as all agents are learning policies simultaneously. Multi-agent alternate Q-learning (MA2QL) (Su et al., 2022) guarantees the convergence to a Nash equilibrium, but the converged equilibrium may not be the optimal one when there are multiple equilibria, and it has to be trained in an on-policy manner and cannot use replay buffers, which leads to poor sample efficiency. Following the principle of optimistic estimation, Hysteretic IQL (Matignon et al., 2007) uses a slower learning rate for value decreases. Distributed IQL (Lauer & Riedmiller, 2000), a special case of Hysteretic IQL with the slower learning rate set to zero, guarantees the convergence to the optimum but only in deterministic environments. Our BQL is the first fully decentralized algorithm that converges to the optimal joint policy in stochastic environments. In the next section, we compare BQL against these Q-learning variants (Distributed IQL is included in Hysteretic IQL).
Comparison with on-policy algorithms, e.g., IPPO, which are not sample-efficient especially in fully decentralized settings, is beyond our focus and thus deferred to the Appendix. Decentralized methods with communication (Zhang et al., 2018; Konan et al., 2021) allow information sharing with neighboring agents through a time-varying communication channel; they do not follow the fully decentralized setting and thus are beyond the scope of this paper.

4. EXPERIMENTS

In experiments, we first test BQL with Q-table on randomly generated cooperative stochastic games to verify its convergence and optimality. Then, to illustrate its performance on complex tasks, we compare BQL with neural networks against Q-learning variants on MPE-version (Lowe et al., 2017) differential games, Multi-Agent MuJoCo (Peng et al., 2021), and SMAC (Samvelyan et al., 2019) . The experiments cover both fully and partially observable, deterministic and stochastic, discrete and continuous environments. Since we consider the fully decentralized setting, BQL and the baselines do not use parameter sharing. The results are presented using mean and standard deviation with different random seeds. More details about hyperparameters are available in Appendix E.

4.1. STOCHASTIC GAMES

To support the theoretical analysis of BQL, we test the Q-table instantiation on stochastic games with 4 agents, 30 states, and infinite horizon. The action space of each agent is 4, so the joint action space is $|A| = 256$. The distribution of initial states is uniform. Each state can transition to any state given a joint action according to the transition probabilities. The transition probabilities and reward function are randomly generated and fixed in each game. We randomly generate 20 games and train the agents with four different seeds in each game. The mean normalized return (normalized by the optimal return) and standard deviation over the 20 games are shown in Figure 1a. IQL cannot learn the optimal policies due to non-stationarity. Although it uses the optimistic update to remedy the non-stationarity, Hysteretic IQL (H-IQL) still cannot theoretically solve this problem in stochastic environments and shows similar performance to IQL. In Appendix A, we thoroughly analyze the difference between H-IQL and BQL, and show that H-IQL is a special case of BQL in deterministic environments. MA2QL guarantees the convergence to a Nash equilibrium, but the converged equilibrium may not be the optimal one; thus there is a performance gap between MA2QL and the optimal policies. BQL converges to the optimum, and the tiny remaining gap is caused by the fitting error of the Q-table update, which verifies the theoretical analysis. Note that, in the Q-table instantiations, MA2QL and BQL use different experience collection from IQL, i.e., exploration strategy and replay buffer: MA2QL only uses on-policy experiences and BQL collects a series of small buffers. However, for sample efficiency, the two methods have to use the same experience collection as IQL in complex tasks with neural networks. MA2QL- and BQL- respectively denote the two methods with the same experience collection as IQL. Trained on off-policy experiences, MA2QL- suffers from non-stationarity and achieves similar performance to IQL.
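Why the optimistic update fails in stochastic environments can be seen in a toy example (hypothetical, not one of the generated games): the optimistic max over sampled targets latches onto lucky outcomes, while the max over expected targets does not.

```python
import numpy as np

# Toy one-step problem: after some action, the environment ends with reward
# +1 or -1 with equal probability, independent of the other agents.
# The true value is 0.
rng = np.random.default_rng(0)
rewards = rng.choice([1.0, -1.0], size=1000)   # sampled terminal rewards

# Distributed IQL (H-IQL with zero learning rate for decreases): max over
# individual sampled targets latches onto the lucky sample.
q_dist = -np.inf
for r in rewards:
    q_dist = max(q_dist, r)          # settles at +1, a large overestimate

# BQL's simplified operator maxes over EXPECTED targets: here there is a
# single candidate P_i, whose expectation is estimated from the buffer.
q_bql = max(-np.inf, rewards.mean())  # close to the true value 0
```

In a deterministic environment every sampled target equals the expected target, which is why the two updates coincide there (Appendix A) but diverge under stochastic transitions.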
Even when using only one buffer, as analyzed in Section 2.4, if the non-stationary buffer sufficiently goes through all possible transition probabilities, BQL agents can also converge to the optimum. Although going through all possible transition probabilities with one buffer is inefficient, BQL- significantly outperforms IQL, which implies the potential of BQL with one buffer in complex tasks. Figures 1b and 1c show the effect of the number of states on which the agents perform the randomly initialized deterministic policy $\bar{\pi}_i^m$ for exploration. A larger $|S_i^m|$ means stronger exploration of both state-action pairs and possible transition probabilities, which leads to better performance.

We then consider a one-stage game that is widely adopted in MARL (Son et al., 2019). There are 2 agents, and the action space of each agent is 3. The reward matrix is

a_1 \ a_2 | A(1) | A(2) | A(3)
A(1)      |   8  |  -12 |  -12
A(2)      |  -12 |   0  |   0
A(3)      |  -12 |   0  |   0

where the reward 8 is the global optimum and the reward 0 is the sub-optimal Nash equilibrium. As shown in Figure 1d, MA2QL converges to the sub-optimal Nash equilibrium when the initial policy of the second agent selects A(2) or A(3), but BQL easily converges to the global optimum.
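The one-stage game can be checked directly (a minimal sketch): the best-possible value $\max_{a_{-i}} R$ keeps A(1) attractive for agent 1, while averaging under a sub-optimal partner does not.

```python
import numpy as np

# Reward matrix of the one-stage game from Section 4.1.
R = np.array([[8, -12, -12],
              [-12, 0, 0],
              [-12, 0, 0]])

# BQL-style value for agent 1: the best possible return of each own action,
# i.e., the max over the other agent's deterministic choices.
q_bql = R.max(axis=1)     # [8, 0, 0] -> argmax picks A(1), the global optimum

# IQL-style value while the partner currently plays A(2): the expected return
# under the partner's policy makes A(1) look bad.
q_iql = R[:, 1]           # [-12, 0, 0] -> agent 1 avoids A(1)

print(q_bql.argmax(), q_iql.argmax())  # prints: 0 1
```

Once both agents evaluate actions by their best possible return, $\arg\max$ of each individual table selects the jointly optimal action, matching Lemma 1.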

4.2. MPE

To evaluate the effectiveness of BQL with the neural network implementation, we design an MPE-based differential game, in which 3 agents can move in the range $[-1, 1]$. At each timestep, agent $i$ takes an action $a_i \in [-1,1]$, and the position of agent $i$ is updated as $x_i = \mathrm{clip}(x_i + 0.1 \times a_i, -1, 1)$ (i.e., the updated position is clipped to $[-1,1]$) with probability $1 - \beta$, or is updated as $-x_i$ with probability $\beta$. $\beta$ controls the stochasticity. The state is the vector of positions $\{x_1, x_2, x_3\}$. The reward function of each timestep is $r(l)$ with $l = \frac{2}{3}(x_1^2 + x_2^2 + x_3^2)$, given in Appendix E; we visualize the relation between $r$ and $l$ in Figure 15. There is only one global optimum ($l = 0$ and $r = 1$) but infinite sub-optima ($l = 0.8$ and $r = 0.3$), and the narrow region with $r > 0.3$ is surrounded by the region with $r = 0$. So it is quite a challenge to learn the optimal policies in a fully decentralized way. Each episode contains 100 timesteps, and the initial positions follow the uniform distribution. We perform experiments with different stochasticities $\beta$ and train the agents with eight seeds for each $\beta$. In continuous environments, BQL and the baselines are built on DDPG. As shown in Figure 2, IQL always falls into the local optimum (total reward ≈ 30) because of the non-stationary transition probabilities. H-IQL escapes the local optimum in only one seed in the setting with $\beta = 0.3$. In the neural network implementations, MA2QL and BQL use the same experience collection as IQL, so there is no MA2QL- or BQL-. MA2QL converges to the local optimum because it cannot guarantee that the converged equilibrium is the global optimum, especially when trained on off-policy data. BQL ($\lambda = 0.01$) escapes from the local optimum in more than 4 of 8 seeds in all settings, which demonstrates the effectiveness of our optimization objectives (9) and (10). The large difference between the global optimum (total reward ≈ 100) and the local optimum results in the large variance of BQL.
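The stochastic position update above can be sketched as a step function. The text does not fully specify whether the sign flip applies per agent or to the whole state vector; the sketch below assumes independent per-agent flips, and the function and variable names are illustrative.

```python
import numpy as np

def step(x, a, beta, rng):
    """Stochastic position update from the differential game: with probability
    1 - beta the position moves by 0.1 * a (clipped to [-1, 1]); with
    probability beta it flips sign. beta controls the stochasticity."""
    x = np.asarray(x, dtype=float)
    flip = rng.random(x.shape) < beta          # assumed independent per agent
    moved = np.clip(x + 0.1 * np.asarray(a), -1.0, 1.0)
    return np.where(flip, -x, moved)

rng = np.random.default_rng(0)
x = np.array([0.5, -0.2, 0.9])                 # positions of the 3 agents
x = step(x, a=np.array([1.0, -1.0, 1.0]), beta=0.3, rng=rng)
```

With `beta = 0` the game is deterministic and reduces to clipped movement; raising `beta` injects exactly the kind of transition noise that separates BQL from the optimistic-update baselines.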
In the objective (10), $\lambda$ controls the balance between performing best possible operator and offsetting the overestimation caused by the operator. As shown in Figure 2, a large $\lambda$, e.g., 0.1, weakens the strength of BQL, while a too-small $\lambda$, i.e., 0, causes severe overestimation and destroys the performance.

4.3. MULTI-AGENT MUJOCO

To evaluate BQL in partially observable environments, we adopt Multi-Agent MuJoCo (Peng et al., 2021), where each agent independently controls one or some joints of the robot and can only observe the state of its own joints and bodies (with the parameter agent_obsk = 0). In each task, we test four random seeds and plot the learning curves in Figure 3. BQL achieves higher reward or learns faster than the baselines, which verifies that BQL can be applied to partially observable environments. Here, we set $\lambda = 0.5$ and show that BQL is robust to $\lambda$ in Appendix C.1. In partially observable environments, BQL is performed on the transition probabilities of observations $P_i(o_i'|o_i,a_i)$, which also depend on $\pi_{-i}$. The convergence and optimality of BQL can only be guaranteed when each observation $o_i$ uniquely corresponds to one state $s$. It has been proven that optimality is undecidable in partially observable Markov decision processes (POMDPs) (Madani et al., 1999), so this is not a limitation specific to BQL. We only consider two-agent cases in the partially observable setting, because the overly limited observation range cannot support strong policies when there are more agents. In Appendix C.2, we test BQL on 17-agent Humanoid with full observation to verify its scalability.

4.4. SMAC

We also perform experiments on partially observable and stochastic SMAC tasks (Samvelyan et al., 2019) with version SC2.4.10, including both easy and hard maps (Yu et al., 2021). Agent numbers vary between 2 and 9. We build BQL on the implementation of PyMARL (Samvelyan et al., 2019) and train the agents for four random seeds. The learning curves are shown in Figure 4. BQL outperforms the baselines, which verifies that BQL can also obtain performance gains in high-dimensional complex tasks.

5. CONCLUSION

We propose best possible operator and theoretically prove that the policies of agents will converge to the optimal joint policy if each agent independently updates its individual state-action value by the operator. We then simplify the operator and derive BQL, the first decentralized MARL algorithm that guarantees the convergence to the global optimum in stochastic environments. Empirically, BQL outperforms baselines in a variety of multi-agent tasks. We believe BQL will be a new paradigm for fully decentralized learning.

A COMPARISON WITH HYSTERETIC IQL

Hysteretic IQL is a special case of BQL when the environment is deterministic. To illustrate this thoroughly, we rewrite the loss function of BQL as

$$w(s,a_i)\left(Q_i(s,a_i) - \mathbb{E}_{\tilde{P}_i(s'|s,a_i)}\left[r + \gamma \max_{a_i'} Q_i(s',a_i')\right]\right)^2, \quad w(s,a_i) = \begin{cases} 1 & \text{if } \mathbb{E}_{\tilde{P}_i(s'|s,a_i)}\left[r + \gamma \max_{a_i'} Q_i(s',a_i')\right] > Q_i(s,a_i), \\ \lambda & \text{else.} \end{cases}$$

If $\lambda = 0$, the update of BQL is

$$Q_i(s,a_i) = \max\left(Q_i(s,a_i),\, \mathbb{E}_{\tilde{P}_i(s'|s,a_i)}\left[r + \gamma \max_{a_i'} Q_i(s',a_i')\right]\right).$$

Hysteretic IQL follows the loss function

$$w(s,a_i)\left(Q_i(s,a_i) - r - \gamma \max_{a_i'} Q_i(s',a_i')\right)^2, \quad w(s,a_i) = \begin{cases} 1 & \text{if } r + \gamma \max_{a_i'} Q_i(s',a_i') > Q_i(s,a_i), \\ \lambda & \text{else.} \end{cases}$$

If $\lambda = 0$, Hysteretic IQL degenerates into Distributed IQL (Lauer & Riedmiller, 2000):

$$Q_i(s,a_i) = \max\left(Q_i(s,a_i),\, r + \gamma \max_{a_i'} Q_i(s',a_i')\right).$$

BQL takes the max of the expected target under the transition probabilities $\tilde{P}_i(s'|s,a_i)$, while Hysteretic IQL takes the max of the target at the sampled next state $s'$. When the environment is deterministic, they are equivalent. However, in stochastic environments, Hysteretic IQL cannot guarantee convergence to the global optimum since the environment will not always transition to the same $s'$. BQL guarantees the global optimum in both deterministic and stochastic environments.

B EFFICIENCY OF BQL

[Figure 5: the space of other agents' policies $\pi_{-i}$ given an $(s,a_i)$ pair, in the deterministic and stochastic cases.]

We now discuss the efficiency of collecting the replay buffers for BQL. The space of other agents' policies $\pi_{-i}$ given an $(s,a_i)$ pair is a convex polytope. For clarity, Figure 5 shows a triangle space. Each $\pi_{-i}$ corresponds to a $P_i(s'|s,a_i)$. Deterministic policies $\pi_{-i}$ locate at the vertexes, while the edges and the inside of the polytope are stochastic $\pi_{-i}$, mixtures of the deterministic ones. Since BQL only considers deterministic policies, the buffer series only needs to cover all the vertexes by acting deterministic policies during the collection of each buffer $D_i^m$, which is efficient.
BQL needs only |Ai| × |A−i| = |A| small buffers, a number irrelevant to the state space size |S|, to meet the requirement of the simplified best possible operator that any one of the possible Pi(s′|s, ai) can be found in one (ideally exactly one) buffer for each (s, ai) pair. More specifically, |Ai| buffers are needed to cover the action space, and |A−i| buffers are needed to cover the transition space for each action. We illustrate this intuitively in Figure 6. Each state in D_i^m requires # samples to estimate the expectation in (7), so the sample complexity is O(|A||S|#). For joint Q-learning (3), the most efficient known method that guarantees convergence and optimality in stochastic environments, each state-joint action pair (s, a) also requires # samples to estimate the expectation, so its sample complexity is likewise O(|A||S|#). Thus, BQL is close to joint Q-learning in terms of sample complexity, which is empirically verified in Figure 7.

One may ask: "since you obtain all possible transition probabilities, why not perform IQL on each transition probability and choose the highest value?" This naive algorithm can indeed also learn the optimal policy, but its buffer collection is much more costly than that of BQL. The naive algorithm requires that any one of the possible transition probability functions over the whole state-action space can be found in one buffer, which needs |A−i|^|S| buffers, and training IQL |A−i|^|S| times is also formidable. BQL only requires that any one of the possible transition probabilities of each state-action pair can be found in one buffer, which is much more efficient.

However, considering sample efficiency, BQL with neural networks maintains only one replay buffer Di containing all historical experiences, the same as IQL. Pi in Di then corresponds to the average of the other agents' historical policies, which is stochastic.
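The buffer-counting argument can be checked numerically for the toy case of Figure 6 (a trivial sketch; the sizes are those stated in the figure caption):

```python
# Toy case of Figure 6: |S| = 3, |A_i| = 2, |A_-i| = 2.
S, A_i, A_minus_i = 3, 2, 2

# BQL: per state-action coverage needs only |A_i| * |A_-i| buffers,
# independent of |S|.
bql_buffers = A_i * A_minus_i

# Naive per-transition-function IQL: one buffer per possible transition
# probability function over the whole state space, i.e. |A_-i|^|S| buffers.
naive_buffers = A_minus_i ** S

print(bql_buffers, naive_buffers)  # 4 8
```

The gap widens exponentially with |S|: for |S| = 10 the naive scheme would already need 2^10 = 1024 buffers while BQL still needs 4.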
Therefore, to guarantee optimality in theory, BQL with one buffer has to traverse almost the whole π−i space, which is costly. As shown in Figure 1d, BQL- (with one buffer) outperforms IQL but cannot match BQL (with the buffer series), showing that maintaining one buffer is costly but still effective. For the neural network instantiation, we show the results of BQL with the buffer series in Figure 8. Due to low sample efficiency, the buffer series cannot achieve strong performance, and maintaining one buffer like IQL is the better choice in complex environments.

C.1 HYPERPARAMETER λ

In objective (10), λ controls the balance between performing best possible operator and offsetting the overestimation caused by the operator. We investigate the effect of λ in Figure 9. Too large a λ weakens the strength of BQL; when λ = 1.0, BQL degenerates into IQL. Too small a λ, i.e., 0, causes severe overestimation and destroys the performance. When λ falls within the interval [0.2, 0.8], BQL obtains a performance gain, showing its robustness to λ.
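The role of λ in the regression can be sketched as follows (function names are ours; objective (10) itself is given in the main text). The weight simply switches between 1 and λ depending on whether the target exceeds the current estimate:

```python
def bql_weight(target, q, lam):
    # Weight 1 when the target would raise the estimate (best possible update),
    # lambda otherwise, damping the overestimation the operator can cause.
    return 1.0 if target > q else lam

def bql_loss(q, target, lam):
    # Asymmetrically weighted squared TD error.
    return bql_weight(target, q, lam) * (q - target) ** 2

print(bql_loss(0.5, 1.0, 0.5))   # 0.25    -- increase: full weight
print(bql_loss(0.5, 0.25, 0.5))  # 0.03125 -- decrease: down-weighted by lambda
```

With lam = 1.0 the loss is symmetric and the update reduces to IQL; with lam = 0 decreases are ignored entirely, which is the source of the overestimation discussed above.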

C.2 SCALABILITY

We test BQL on 17-agent Humanoid with full observation; the results are shown in Figure 10. BQL obtains a significant performance gain in this many-agent task, which evidences the good scalability of BQL.

C.3 OTHER BASE ALGORITHMS

Besides DDPG, BQL can also be built on other variants of Q-learning, e.g., SAC. Figure 11 shows that BQL also obtains a performance gain on independent SAC. Independent PPO (IPPO) (de Witt et al., 2020a) is an on-policy decentralized MARL baseline. IPPO is not a Q-learning method, so it cannot serve as the base algorithm of BQL. Moreover, on-policy algorithms do not reuse old experiences, which makes them weak in sample efficiency (Achiam, 2018), especially in fully decentralized settings, as shown in Figure 11. BQL can also be naturally and theoretically applied to MA2QL. We test BQL+MA2QL on the three MuJoCo tasks used in MA2QL; Figure 12 shows that BQL obtains a performance gain on MA2QL, especially on HalfCheetah. We also compare BQL against MA2QL on more SMAC tasks. As shown in Figure 13, BQL and MA2QL achieve similar performance, and the converged results are very close to those of some CTDE methods, indicating that non-stationarity is not a severe problem in these two tasks.

D MULTIPLE OPTIMAL JOINT POLICIES

Our theoretical results assume that there is only one optimal joint policy. With multiple optimal actions (those attaining the max Qi(s, ai)), if each agent arbitrarily selects one of the optimal individual actions, the joint action might not be optimal. To address this, we set a performance tolerance ε and introduce a fixed, randomly initialized reward function r̃(s, s′) ∈ (0, (1 − γ)ε]. All agents then perform BQL to learn Q̃i(s, ai) of the shaped reward r + r̃. Since r̃ > 0, we have Q̃i(s, ai) > Qi(s, ai). In Q̃i(s, ai), the maximal contribution from r̃ is (1 − γ)ε/(1 − γ) = ε, so the minimal contribution from r is Q̃i(s, ai) − ε > Qi(s, ai) − ε, which means the maximal performance drop is ε when selecting actions according to Q̃i. Since r̃(s, s′) is randomly initialized, it is a small-probability event that the shaped reward r + r̃ admits multiple optimal joint policies. Therefore, if ε is set small enough, BQL can solve tasks with multiple optimal joint policies. However, this technique is introduced only to remedy the assumption for our theoretical results; empirically it is not required, because there is usually only one optimal joint policy in complex environments. In all experiments, we do not use the randomly initialized reward function for BQL or the baselines, so the comparison is fair. We test the randomly initialized reward function on a one-stage matrix game with two optimal joint policies, (1, 2) and (2, 1), as shown in Figure 14. If the agents independently select actions, they might choose the miscoordinated joint actions (1, 1) and (2, 2). IQL cannot converge, but BQL agents always select coordinated actions, even though the value gap between the optimal and suboptimal policies is small, which verifies the effectiveness of the randomly initialized reward.
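This tie-breaking mechanism can be checked on a toy one-stage 2 × 2 game. This is our own construction with 0-indexed actions, not the paper's exact matrix: both (0, 1) and (1, 0) are optimal, and a tiny fixed random bonus per joint action makes independent greedy selection coordinate on one of them.

```python
import itertools
import random

# Payoff matrix with two optimal joint actions, (0, 1) and (1, 0), both 1.0.
R = [[0.0, 1.0],
     [1.0, 0.0]]

random.seed(3)  # the bonus is randomly initialized once, then fixed
EPS = 1e-3
# For a one-stage game, the shaped reward r + r~ amounts to a tiny random
# bonus in (0, eps] per joint action, which breaks the tie consistently.
bonus = {a: random.uniform(0, EPS)
         for a in itertools.product(range(2), repeat=2)}

def q_i(agent, action):
    # BQL-style individual value: best achievable shaped return over the
    # other agent's deterministic policies.
    joints = [(action, b) if agent == 0 else (b, action) for b in range(2)]
    return max(R[a][b] + bonus[(a, b)] for a, b in joints)

# Each agent independently picks the greedy action of its own Q_i.
greedy = tuple(max(range(2), key=lambda a: q_i(i, a)) for i in range(2))
print(greedy, R[greedy[0]][greedy[1]])
```

Whichever of the two optima receives the larger bonus, both agents' individual values point to it, so the independently selected joint action is optimal with probability 1 (the bonuses are distinct almost surely).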



For simplicity, we refer to the optimal value Q * as Q in this paper, unless stated otherwise. We provide a simple solution for situations with multiple optimal joint policies in Appendix D.



Learning curves on cooperative stochastic games.

Learning curves on MPE-based differential games with different β.

Learning curves on SMAC.

Figure 6: Toy case for illustrating the ideal buffer number. |S| = 3, |Ai| = 2, and |A−i| = 2, corresponding to P_i^1 and P_i^2. Any Pi(s, ai) can be found in the 4 buffers.

Figure 7: Learning curves of BQL and joint Q-learning (JQL). BQL shows similar sample efficiency to JQL.

Figure 8: BQL with one buffer and with the buffer series on 2 × 3 Swimmer.

Figure 9: Learning curves with different λ on 2c vs 64zg.

Figure 10: Learning curves on 17-agent Humanoid.

Figure 11: Learning curves of other base algorithms on 2 × 4d Ant.

Figure 12: Learning curves of BQL+MA2QL on 3 × 2 Walker (obsk = 1).




We also study the effect of the size of buffer D_i^m at epoch m. If |D_i^m| is too small, e.g., 200, the experiences in D_i^m are insufficient to accurately estimate the expected value (7). If |D_i^m| is too large, e.g., 10000, the experiences in D_i^m are redundant, and the buffer series can hardly cover all possible transition probabilities within the total training timesteps.


In MPE, we plot the reward density under the uniform state distribution. There is only one global optimum, but the density of the local optimum is high, so decentralized agents will easily learn the locally optimal policies.

E HYPERPARAMETERS

In MPE-based differential games, the reward function is defined as visualized in Figure 15, which shows the relationship between r and l.

In 2 × 3 Swimmer, there are two agents, each controlling 3 joints of ManyAgent Swimmer. In 6|2 Ant, there are two agents: one controls 6 joints and the other controls 2 joints. And so on for the other configurations.

In MPE-based differential games and Multi-Agent MuJoCo, we adopt the SpinningUp (Achiam, 2018) implementation of DDPG and follow all of its hyperparameters: the discount factor γ = 0.99, the learning rate 0.001 with the Adam optimizer, the batch size 100, a replay buffer of 5 × 10^5 transitions, and 256 hidden units.

In SMAC, we adopt the PyMARL (Samvelyan et al., 2019) implementation and follow all of its hyperparameters: the discount factor γ = 0.99, the learning rate 0.0005 with the RMSprop optimizer, the batch size 32 episodes, a replay buffer of 5000 episodes, and 64 hidden units. We adopt version SC2.4.10 of SMAC.

We set λ = 0.01 in MPE-based differential games, λ = 0.5 in Multi-Agent MuJoCo, and λ = 0.8 in SMAC.

The experiments are carried out on an Intel i7-8700 CPU and an NVIDIA GTX 1080Ti GPU. The training of each MPE and Multi-Agent MuJoCo task can be finished in 5 hours, and the training of each SMAC task can be finished in 20 hours.

F MORE SMAC RESULTS

We test BQL on the SMAC MMM2 task; the results are shown in Figure 16.

