TOWARDS UNDERSTANDING LINEAR VALUE DECOMPOSITION IN COOPERATIVE MULTI-AGENT Q-LEARNING

Abstract

Value decomposition is a popular and promising approach to scaling up multi-agent reinforcement learning in cooperative settings. However, the theoretical understanding of such methods is limited. In this paper, we introduce a variant of the fitted Q-iteration framework for analyzing multi-agent Q-learning with value decomposition. Based on this framework, we derive a closed-form solution to the empirical Bellman error minimization with linear value decomposition. With this novel solution, we further reveal two interesting insights: i) linear value decomposition implicitly implements a classical multi-agent credit assignment called counterfactual difference rewards; and ii) on-policy data distribution or richer Q function classes can improve the training stability of multi-agent Q-learning. In the empirical study, our experiments demonstrate the realizability of our theoretical closed-form formulation and implications in didactic examples and a broad set of StarCraft II unit micromanagement tasks, respectively.

1. INTRODUCTION

Cooperative multi-agent reinforcement learning (MARL) has great promise for addressing coordination problems in a variety of applications, such as robotic systems (Hüttenrauch et al., 2017), autonomous cars (Cao et al., 2012), and sensor networks (Zhang & Lesser, 2011). Such complex tasks often require MARL to learn decentralized policies for agents to jointly optimize a global cumulative reward signal, and pose a number of challenges, including multi-agent credit assignment (Wolpert & Tumer, 2002; Nguyen et al., 2018), non-stationarity (Zhang & Lesser, 2010; Song et al., 2019), and scalability (Zhang & Lesser, 2011; Panait & Luke, 2005). Recently, by leveraging the strength of deep learning techniques, cooperative MARL has made a series of great strides (Sunehag et al., 2018; Baker et al., 2020; Wang et al., 2020b; a), particularly in value-based methods that demonstrate state-of-the-art performance on challenging tasks such as StarCraft unit micromanagement (Samvelyan et al., 2019). Sunehag et al. (2018) proposed a popular approach called value-decomposition network (VDN) based on the paradigm of centralized training with decentralized execution (CTDE; Foerster et al., 2016). VDN learns a centralized but factorizable joint value function $Q_{tot}$, represented as the summation of individual value functions $Q_i$. During execution, a decentralized policy can be easily derived for each agent $i$ by greedily selecting actions with respect to its local value function $Q_i$. By utilizing this decomposition structure, an implicit multi-agent credit assignment is realized, because $Q_i$ is learned by neural network backpropagation from the total temporal-difference error on the single global reward signal, rather than on a local reward signal specific to agent $i$.
This decomposition technique significantly improves the scalability of multi-agent Q-learning algorithms and fosters a series of subsequent works, including QMIX (Rashid et al., 2018), QTRAN (Son et al., 2019), and QPLEX (Wang et al., 2020a). In spite of the empirical success in a broad class of tasks, multi-agent Q-learning with linear value decomposition has not been theoretically well understood. Because of its limited representational complexity, the standard Bellman update is not a closed operator in the joint action-value function class with linear value decomposition. The approximation error induced by this incompleteness is known as inherent Bellman error (Munos & Szepesvári, 2008), which can drive Q-learning toward unexpected behavior. To develop a deeper understanding of learning with value decomposition, this paper introduces a multi-agent variant of the popular Fitted Q-Iteration (FQI; Ernst et al., 2005; Levine et al., 2020) framework and derives a closed-form solution to its empirical Bellman error minimization. To the best of our knowledge, this is the first theoretical analysis that characterizes the underlying mechanism of linear value decomposition in cooperative multi-agent Q-learning, and it can serve as a toolkit for establishing follow-up theories and exploring potential insights from different perspectives on this popular value decomposition structure. By utilizing this novel closed-form solution, this paper formally reveals two interesting insights: 1) Learning linear value decomposition implicitly implements a classical multi-agent credit assignment method called counterfactual difference rewards (Wolpert & Tumer, 2002), which draws a connection with COMA (Foerster et al., 2018), a multi-agent policy-gradient method. 2) Multi-agent Q-learning with linear value decomposition potentially suffers from the risk of unbounded divergence from arbitrary initialization.
On-policy data distribution or richer Q function classes can provide local or global convergence guarantees for multi-agent Q-learning, respectively. Finally, we set up an extensive set of experiments to demonstrate the realizability of our theoretical implications. Besides the FQI framework, we also consider deep-learning-based implementations of different multi-agent value decomposition structures. Through didactic examples and the StarCraft II benchmark, we design several experiments to illustrate the consistency of our closed-form formulation with the empirical results, and to show that online data distribution and richer Q function classes can significantly alleviate the limitations of VDN in the offline training process (Levine et al., 2020).

2. RELATED WORK

Deep Q-learning algorithms that use neural networks as function approximators have shown great promise in solving complicated decision-making problems (Mnih et al., 2015). One of the core components of such methods is iterative Bellman error minimization, which can be modelled by a classical framework called Fitted Q-Iteration (FQI; Ernst et al., 2005). FQI utilizes a specific Q function class to iteratively minimize the empirical Bellman error on a dataset D. Great efforts have been made towards theoretically characterizing the behavior of FQI with finite samples and imperfect function classes (Munos & Szepesvári, 2008; Farahmand et al., 2010; Chen & Jiang, 2019). From an empirical perspective, there is also a growing trend to adopt FQI for the empirical analysis of deep offline Q-learning algorithms (Fu et al., 2019; Levine et al., 2020). In MARL, the joint Q function class grows exponentially with the number of agents, leading many algorithms (Sunehag et al., 2018; Rashid et al., 2018) to utilize value decomposition structures with limited expressiveness to improve scalability. In this paper, we extend FQI to a multi-agent variant as our grounding theoretical framework for analyzing cooperative multi-agent Q-learning with linear value decomposition. To achieve superior effectiveness and scalability in multi-agent settings, centralized training with decentralized execution (CTDE) has become a popular MARL paradigm (Oliehoek et al., 2008; Kraemer & Banerjee, 2016). The Individual-Global-Max (IGM) principle (Son et al., 2019) is a critical concept for value-based CTDE (Mahajan et al., 2019), which ensures the consistency between joint and local greedy action selections and enables effective performance in both the training and execution phases. VDN (Sunehag et al., 2018) utilizes linear value decomposition to satisfy a sufficient condition of IGM. The simple additivity structure of VDN has achieved excellent scalability and inspired many follow-up methods.
QMIX (Rashid et al., 2018) proposes a monotonic Q network structure to improve the expressiveness of the factorized function class. QTRAN (Son et al., 2019) tries to realize the entire IGM function class, but its method is computationally intractable and requires two extra soft regularizations to approximate IGM (which forfeits the IGM guarantee). QPLEX (Wang et al., 2020a) encodes the IGM principle into the Q network architecture and realizes a complete IGM function class, but it may also have potential limitations in scalability. Owing to the simplicity and scalability of VDN, linear value decomposition has become very popular in MARL (Son et al., 2019; Wang et al., 2020a; d). This paper focuses on the theoretical and empirical understanding of multi-agent Q-learning with linear value decomposition to explore its underlying implications.

3. PRELIMINARIES

3.1. MULTI-AGENT MARKOV DECISION PROCESS (MMDP)

To support theoretical analysis of multi-agent Q-learning, we adopt the framework of MMDP (Boutilier, 1996), a special case of Dec-POMDP (Oliehoek et al., 2016), to model fully cooperative multi-agent decision-making tasks. An MMDP is defined as a tuple $M = \langle \mathcal{N}, \mathcal{S}, \mathcal{A}, P, r, \gamma \rangle$. $\mathcal{N} \equiv \{1, \dots, n\}$ is a finite set of agents. $\mathcal{S}$ is a finite set of global states. $\mathcal{A}$ denotes the action space of an individual agent. The joint action $\boldsymbol{a} \in \boldsymbol{\mathcal{A}} \equiv \mathcal{A}^n$ is a collection of individual actions $[a_i]_{i=1}^n$. At each timestep $t$, a selected joint action $\boldsymbol{a}_t$ results in a transition $s_{t+1} \sim P(\cdot \mid s_t, \boldsymbol{a}_t)$ and a global reward signal $r(s_t, \boldsymbol{a}_t)$. $\gamma \in [0, 1)$ is a discount factor. The goal of MARL is to construct a joint policy $\boldsymbol{\pi} = \langle \pi_1, \dots, \pi_n \rangle$ maximizing the expected discounted return $V^{\boldsymbol{\pi}}(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, \boldsymbol{\pi}(s_t)) \mid s_0 = s\right]$, where $\pi_i \colon \mathcal{S} \to \mathcal{A}$ denotes the individual policy of agent $i$. The corresponding action-value function is $Q^{\boldsymbol{\pi}}(s, \boldsymbol{a}) = r(s, \boldsymbol{a}) + \gamma \mathbb{E}_{s' \sim P(\cdot \mid s, \boldsymbol{a})}\left[V^{\boldsymbol{\pi}}(s')\right]$. We use $Q^*$ and $V^*$ to denote the action-value function and the state-value function of the optimal policy $\pi^*$, respectively. Dec-POMDP (Oliehoek et al., 2016) is a generalization of MMDP that accounts for partial observability: each agent can only access its local observations rather than the full global state. As infinite-horizon Dec-POMDPs are undecidable in general (Madani et al., 1999), this paper focuses its theoretical analysis on settings with full observability. In practice, partial observability is not a hard constraint. A Dec-POMDP can be transformed into an MMDP when communication is available. Many prior efforts have been made to construct efficient communication protocols for exchanging information among agents (Foerster et al., 2016; Das et al., 2019; Wang et al., 2020c). By constructing belief states through extended observation scopes, these methods can approximately transform learning problems in Dec-POMDPs into learning problems in MMDPs.
From this perspective, we consider MMDP as a simplification of notations to make the underlying insights more accessible.
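To make the MMDP notation above concrete, the value functions can be computed exactly in a small example. The following sketch (with hypothetical transition and reward tables; all sizes are illustrative, not from the paper) evaluates a fixed deterministic joint policy by solving the linear Bellman expectation system $V^{\pi} = r_{\pi} + \gamma P_{\pi} V^{\pi}$:

```python
import numpy as np

# A tiny 2-agent MMDP with hypothetical dynamics: 2 states, 2 actions per agent.
n_states, n_actions = 2, 2
gamma = 0.9
rng = np.random.default_rng(0)
# P[s, a1, a2] is a distribution over next states; r[s, a1, a2] is the global reward.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions, n_actions))
r = rng.uniform(0.0, 1.0, size=(n_states, n_actions, n_actions))

# A fixed deterministic joint policy pi(s) = <a1, a2> (arbitrary choice).
pi = [(0, 1), (1, 0)]

# V^pi solves V = r_pi + gamma * P_pi V (Bellman expectation equation).
r_pi = np.array([r[s][pi[s]] for s in range(n_states)])
P_pi = np.array([P[s][pi[s]] for s in range(n_states)])
V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

# Q^pi(s, a) = r(s, a) + gamma * E_{s' ~ P(.|s,a)}[V^pi(s')].
Q = r + gamma * (P @ V)
```

By construction, $Q^{\pi}(s, \pi(s)) = V^{\pi}(s)$ for every state, which provides a quick sanity check of the definitions.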

3.2. CENTRALIZED TRAINING WITH DECENTRALIZED EXECUTION (CTDE)

Most deep multi-agent Q-learning algorithms with value decomposition adopt the paradigm of centralized training with decentralized execution (Foerster et al., 2016). In the training phase, the centralized trainer can access all global information, including global states, shared global rewards, agents' policies, and value functions. In the decentralized execution phase, every agent makes individual decisions based on its local observations. Note that this paper considers MMDP as a simplified setting which rules out the concerns of partial observability; thus our notation does not distinguish between states and observations. Individual-Global-Max (IGM) (Son et al., 2019) is a common principle for realizing effective decentralized policy execution. It enforces action selection consistency between the global joint action-value $Q_{tot}$ and the individual action-values $[Q_i]_{i=1}^n$, specified as follows:
$$\forall s \in \mathcal{S}, \quad \arg\max_{\boldsymbol{a} \in \boldsymbol{\mathcal{A}}} Q_{tot}(s, \boldsymbol{a}) = \left\langle \arg\max_{a_1 \in \mathcal{A}} Q_1(s, a_1), \dots, \arg\max_{a_n \in \mathcal{A}} Q_n(s, a_n) \right\rangle. \quad (1)$$
As stated in Eq. (2), the additivity constraint adopted by VDN (Sunehag et al., 2018) is a sufficient condition for the IGM constraint stated in Eq. (1):
$$\text{(Additivity)} \quad Q_{tot}(s, \boldsymbol{a}) = \sum_{i=1}^{n} Q_i(s, a_i). \quad (2)$$
However, this linear decomposition structure is not a necessary condition, and it induces a limited joint action-value function class: a linear number of individual functions cannot represent the full joint action-value function class, whose size is exponential in the number of agents.

3.3. FITTED Q-ITERATION (FQI) FOR MULTI-AGENT Q-LEARNING

For multi-agent Q-learning with value decomposition, we use $Q_{tot}$ to denote the global but factorized value function, which can be expressed as a function of the individual value functions $[Q_i]_{i=1}^n$. In other words, we can use $[Q_i]_{i=1}^n$ to represent $Q_{tot}$. For brevity, we overload $Q$ to denote both of them.
In the MMDP setting, the shared reward signal can only supervise the training of the joint value function $Q_{tot}$, which requires us to adapt the notation of the Bellman optimality operator $\mathcal{T}$ as follows:
$$(\mathcal{T} Q)_{tot}(s, \boldsymbol{a}) = r(s, \boldsymbol{a}) + \gamma \, \mathbb{E}_{s' \sim P(\cdot \mid s, \boldsymbol{a})}\left[\max_{\boldsymbol{a}' \in \boldsymbol{\mathcal{A}}} Q_{tot}(s', \boldsymbol{a}')\right]. \quad (3)$$
Fitted Q-iteration (FQI) (Ernst et al., 2005) provides a unified framework which extends the above operator to solve high-dimensional tasks using function approximation. It follows an iterative optimization framework based on a given dataset $\mathcal{D} = \{(s, \boldsymbol{a}, r, s')\}$:
$$Q^{(t+1)} \leftarrow \arg\min_{Q \in \mathcal{Q}} \mathbb{E}_{(s, \boldsymbol{a}, r, s') \sim \mathcal{D}}\left[\left(r + \gamma \max_{\boldsymbol{a}' \in \boldsymbol{\mathcal{A}}} Q^{(t)}_{tot}(s', \boldsymbol{a}') - Q_{tot}(s, \boldsymbol{a})\right)^2\right], \quad (4)$$
where an initial solution $Q^{(0)}$ is selected arbitrarily from a function class $\mathcal{Q}$. By constructing a specific function class $\mathcal{Q}$ that only contains instances satisfying the IGM condition stated in Eq. (1) (Sunehag et al., 2018; Rashid et al., 2018), the centralized training procedure in Eq. (4) will naturally produce suitable individual values $[Q_i]_{i=1}^n$, from which individual policies can be derived for decentralized execution.
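As a reference point for the restricted classes analyzed later, note that when $\mathcal{Q}$ is the unrestricted tabular class and the dataset is adequate, the minimizer of Eq. (4) is exactly the Bellman target, so FQI reduces to value iteration over joint actions. A minimal NumPy sketch (hypothetical sizes; joint actions enumerated flatly):

```python
import numpy as np

# FQI with an unrestricted tabular joint Q class: the least-squares minimizer
# over an adequate dataset is the regression target itself,
#   Q^{t+1}(s, a) = r(s, a) + gamma * E_{s'}[max_{a'} Q^{t}(s', a')].
rng = np.random.default_rng(1)
n_states, n_joint_actions = 3, 4
gamma = 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_joint_actions))
r = rng.uniform(0.0, 1.0, size=(n_states, n_joint_actions))

def fqi_step(Q):
    # Closed-form arg min for the tabular class: just the Bellman update.
    return r + gamma * P @ Q.max(axis=1)

Q = np.zeros((n_states, n_joint_actions))
for _ in range(500):
    Q = fqi_step(Q)  # gamma-contraction: converges to the fixed point Q*
```

Because the tabular operator is a $\gamma$-contraction, the iterates approach a fixed point; the restricted classes studied in Sections 4 and 5 lose exactly this property.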

4. MULTI-AGENT Q-LEARNING WITH LINEAR VALUE DECOMPOSITION

In the literature of deep MARL, constructing a specific value function class $\mathcal{Q}$ satisfying the IGM condition is a critical step to realize the paradigm of centralized training with decentralized execution. Linear value decomposition, proposed by VDN (Sunehag et al., 2018), is a simple yet effective method to implement this paradigm. In this section, we provide a theoretical analysis towards a deeper understanding of this popular decomposition structure. Our result is based on a multi-agent variant of fitted Q-iteration (FQI) with linear value decomposition, named FQI-LVD. We derive the closed-form update rule of FQI-LVD and then reveal the underlying credit assignment mechanism realized by linear value decomposition learning.

4.1. MULTI-AGENT FITTED Q-ITERATION WITH LINEAR VALUE DECOMPOSITION (FQI-LVD)

To provide a clear perspective on the effects of linear value decomposition, we make an additional assumption to simplify the notation and facilitate the analysis.

Assumption 1 (Adequate and Factorizable Dataset). The dataset $\mathcal{D}$ contains all applicable state-action pairs $(s, \boldsymbol{a})$, and its empirical probabilities are factorizable with respect to the individual behaviors of the agents. Formally, let $p_{\mathcal{D}}(\boldsymbol{a} \mid s)$ denote the empirical probability of joint action $\boldsymbol{a}$ executed in state $s$, which factorizes into the product of individual components:
$$p_{\mathcal{D}}(\boldsymbol{a} \mid s) = \prod_{i \in \mathcal{N}} p_{\mathcal{D}}(a_i \mid s), \quad \sum_{a_i \in \mathcal{A}} p_{\mathcal{D}}(a_i \mid s) = 1, \quad p_{\mathcal{D}}(a_i \mid s) > 0, \quad (5)$$
where $p_{\mathcal{D}}(a_i \mid s)$ denotes the empirical probability of the event that agent $i$ executes $a_i$ in state $s$.

Assumption 1 is based on the fact that an adequate dataset is necessary for FQI algorithms to find a feasible solution (Farahmand et al., 2010; Chen & Jiang, 2019). In practice, a factorizable data distribution can be directly induced by a decentralized data collection procedure: when agents perform fully decentralized execution, the empirical probability of an event $(s, \boldsymbol{a})$ in the collected dataset $\mathcal{D}$ is naturally factorized. We now define FQI with linear value decomposition as follows.

Definition 1 (FQI-LVD). Given a dataset $\mathcal{D}$, FQI-LVD specifies the action-value function class
$$\mathcal{Q}^{\text{LVD}} = \left\{ Q \,\middle|\, Q_{tot}(\cdot, \boldsymbol{a}) = \sum_{i=1}^{n} Q_i(\cdot, a_i), \ \forall \boldsymbol{a} \in \boldsymbol{\mathcal{A}} \text{ and } \forall \left[ Q_i \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}|} \right]_{i=1}^{n} \right\} \quad (6)$$
and induces the empirical Bellman operator $\mathcal{T}^{\text{LVD}}_{\mathcal{D}}$:
$$Q^{(t+1)} \leftarrow \mathcal{T}^{\text{LVD}}_{\mathcal{D}} Q^{(t)} \equiv \arg\min_{Q \in \mathcal{Q}^{\text{LVD}}} \sum_{(s, \boldsymbol{a}, s') \in \mathcal{S} \times \boldsymbol{\mathcal{A}} \times \mathcal{S}} p_{\mathcal{D}}(\boldsymbol{a}, s' \mid s) \left( \hat{y}^{(t)}(s, \boldsymbol{a}, s') - \sum_{i=1}^{n} Q_i(s, a_i) \right)^2$$
$$= \arg\min_{Q \in \mathcal{Q}^{\text{LVD}}} \sum_{(s, \boldsymbol{a}) \in \mathcal{S} \times \boldsymbol{\mathcal{A}}} p_{\mathcal{D}}(\boldsymbol{a} \mid s) \left( y^{(t)}(s, \boldsymbol{a}) - \sum_{i=1}^{n} Q_i(s, a_i) \right)^2, \quad (7)$$
where $\hat{y}^{(t)}(s, \boldsymbol{a}, s') = r(s, \boldsymbol{a}) + \gamma \max_{\boldsymbol{a}'} Q^{(t)}_{tot}(s', \boldsymbol{a}')$ denotes the sample-based regression target, and $y^{(t)}(s, \boldsymbol{a}) = (\mathcal{T} Q^{(t)})_{tot}(s, \boldsymbol{a}) = r(s, \boldsymbol{a}) + \gamma \mathbb{E}_{s' \sim P(\cdot \mid s, \boldsymbol{a})}\left[\max_{\boldsymbol{a}'} Q^{(t)}_{tot}(s', \boldsymbol{a}')\right]$ denotes the ground-truth target value derived from the Bellman optimality operator.
$p_{\mathcal{D}}(\boldsymbol{a}, s' \mid s) = p_{\mathcal{D}}(\boldsymbol{a} \mid s) P(s' \mid s, \boldsymbol{a})$ denotes the empirical probability of the event that the agents execute joint action $\boldsymbol{a}$ in state $s$ and transition to $s'$. $Q_{tot}$ and $[Q_i]_{i=1}^n$ follow the CTDE conventions defined in Section 3.3. The proof of Eq. (7) is deferred to Lemma 1 in Appendix A. The value-decomposition network (VDN) (Sunehag et al., 2018) provides an implementation of FQI-LVD, in which the individual value functions $[Q_i]_{i=1}^n$ are parameterized by deep neural networks and the joint value function $Q_{tot}$ is formed by their summation.
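Assumption 1 can be illustrated with a short sketch (hypothetical agent count, action count, and greedy profile — none of these numbers come from the paper): a decentralized $\epsilon$-greedy collection rule yields strictly positive per-agent marginals whose product is a valid joint distribution:

```python
import numpy as np
from itertools import product

# Decentralized epsilon-greedy marginals p_D(a_i | s) for one fixed state s.
n_agents, n_actions, eps = 3, 2, 0.3
greedy = [0, 1, 1]  # each agent's current greedy action (arbitrary choice)

p_i = np.full((n_agents, n_actions), eps / n_actions)
for i, g in enumerate(greedy):
    p_i[i, g] += 1.0 - eps  # strictly positive for eps > 0

# The joint empirical distribution factorizes into the product of marginals,
# p_D(a | s) = prod_i p_D(a_i | s), as required by Assumption 1.
p_joint = {a: float(np.prod([p_i[i, a[i]] for i in range(n_agents)]))
           for a in product(range(n_actions), repeat=n_agents)}
```

Every joint action receives positive probability, and the joint probabilities sum to one, so such a dataset is both adequate and factorizable in the sense of Assumption 1.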

4.2. IMPLICIT CREDIT ASSIGNMENT IN LINEAR VALUE DECOMPOSITION

In the formulation of FQI-LVD, the empirical Bellman error minimization in Eq. (7) can be regarded as a weighted linear least-squares problem, which contains $n|\mathcal{S}||\mathcal{A}|$ variables forming the individual value functions $[Q_i]_{i=1}^n$ and $|\mathcal{S}||\mathcal{A}|^n$ data points corresponding to all entries of the regression target $y^{(t)}(s, \boldsymbol{a})$. To solve this least-squares problem, we derive a closed-form solution stated in Theorem 1, which can be verified through the Moore-Penrose inverse (Moore, 1920) for weighted linear regression. Proofs of all theorems, lemmas, and propositions in this paper are deferred to the Appendix.

Theorem 1. Let $Q^{(t+1)} = \mathcal{T}^{\text{LVD}}_{\mathcal{D}} Q^{(t)}$ denote a single iteration of the empirical Bellman operator. Then $\forall i \in \mathcal{N}$, $\forall (s, \boldsymbol{a}) \in \mathcal{S} \times \boldsymbol{\mathcal{A}}$, the individual action-value function
$$Q^{(t+1)}_i(s, a_i) = \underbrace{\mathbb{E}_{\boldsymbol{a}_{-i} \sim p_{\mathcal{D}}(\cdot \mid s)}\left[ y^{(t)}\left(s, a_i \oplus \boldsymbol{a}_{-i}\right) \right]}_{\text{evaluation of the individual action } a_i} - \underbrace{\frac{n-1}{n} \mathbb{E}_{\boldsymbol{a}' \sim p_{\mathcal{D}}(\cdot \mid s)}\left[ y^{(t)}(s, \boldsymbol{a}') \right]}_{\text{counterfactual baseline}} + w_i(s), \quad (8)$$
where $a_i \oplus \boldsymbol{a}_{-i} = \langle a_1, \dots, a_{i-1}, a_i, a_{i+1}, \dots, a_n \rangle$ and $\boldsymbol{a}_{-i}$ denotes the actions of all agents except agent $i$. The residual term $w \equiv [w_i]_{i=1}^n$ is an arbitrary vector satisfying $\forall s, \sum_{i=1}^{n} w_i(s) = 0$.

As shown in Theorem 1, the local action-value function $Q^{(t+1)}_i$ consists of three terms. The first term is the expectation of the one-step TD target value over the actions of the other agents, which evaluates the expected return of executing an individual action $a_i$. The second term is the expectation of the one-step TD target values over all joint actions, which can be regarded as a baseline evaluating average performance. The arbitrary vector $w$ spans the entire space of valid individual action-value functions. We can ignore this term because $w$ neither affects the local action selection of each agent nor survives the summation of linear value decomposition (see Eq. (2)), which implies that the joint action-value $Q^{(t+1)}_{tot} = \sum_i Q^{(t+1)}_i$ has a unique closed-form solution.
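Theorem 1 can be checked numerically. The sketch below (hypothetical sizes: $n = 2$ agents, $|\mathcal{A}| = 3$, a single state, a random target and a random factorizable distribution) solves the weighted least-squares problem of Eq. (7) directly and compares the resulting $Q_{tot}$ with the closed form of Eq. (8); the two coincide, even though the individual $[Q_i]$ are only determined up to the residual $w$:

```python
import numpy as np

rng = np.random.default_rng(3)
nA, n = 3, 2
y = rng.normal(size=(nA, nA))                  # regression target y(a1, a2)
p1, p2 = rng.dirichlet(np.ones(nA)), rng.dirichlet(np.ones(nA))
p = np.outer(p1, p2)                           # factorizable p_D(a)

# Closed form (Theorem 1): Q_i(a_i) = E_{a_-i}[y] - (n-1)/n * E[y] (+ w_i).
Ey = float((p * y).sum())
Q1 = y @ p2 - (n - 1) / n * Ey                 # E_{a2 ~ p2}[y(a1, .)] - baseline
Q2 = p1 @ y - (n - 1) / n * Ey
Q_tot_closed = Q1[:, None] + Q2[None, :]

# Direct weighted least squares over (Q1, Q2) with a one-hot design matrix.
X = np.zeros((nA * nA, 2 * nA))
for a1 in range(nA):
    for a2 in range(nA):
        X[a1 * nA + a2, a1] = 1.0
        X[a1 * nA + a2, nA + a2] = 1.0
w = np.sqrt(p.reshape(-1))
theta, *_ = np.linalg.lstsq(X * w[:, None], y.reshape(-1) * w, rcond=None)
Q_tot_lstsq = (X @ theta).reshape(nA, nA)
```

The design matrix is rank-deficient (the residual direction $w$), so `lstsq` returns the minimum-norm minimizer, but $Q_{tot} = X\theta$ is the same for every minimizer, matching the uniqueness claim above.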
We compare the theoretical analysis of FQI-LVD with the empirical results of VDN to verify the accuracy of our closed-form update rule (see Eq. (8)) in Section 6.1. Note that, if we regard the empirical probability $p_{\mathcal{D}}(\boldsymbol{a} \mid s)$ of the dataset $\mathcal{D}$ as a default policy, the first term of Eq. (8) is the expected value of an individual action $a_i$, and the second term is the expected value of the default policy, which serves as the counterfactual baseline. Their difference corresponds to a credit assignment mechanism called counterfactual difference rewards, which has been used by counterfactual multi-agent policy gradient (COMA) (Foerster et al., 2018).

Implication 1. As shown in Eq. (8), linear value decomposition implicitly implements a counterfactual credit assignment mechanism similar to the one used by COMA. Compared to COMA, this implicit credit assignment is naturally provided by empirical Bellman error minimization through linear value decomposition, which is much more scalable. The extra weight $(n-1)/n$ makes the derived credit assignment more consistent, in the sense that the global reward is fully distributed among the agents. Consider a simple case where all joint actions generate the same reward: Eq. (8) assigns $1/n$ of the reward to each agent, whereas COMA assigns 0. This gap gradually closes as $n$ becomes sufficiently large.
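The equal-reward case in Implication 1 is easy to verify numerically. A sketch (hypothetical $n = 4$ agents, $|\mathcal{A}| = 3$, constant target $c$ under a uniform distribution; the COMA-style quantity is the simplified counterfactual advantage for this constant-reward setting):

```python
import numpy as np

n, nA, c = 4, 3, 1.0
y = np.full([nA] * n, c)                # every joint action yields reward c

# Under a uniform (factorizable) distribution, both expectations in Eq. (8)
# equal c, so each individual value is c - (n-1)/n * c = c/n.
Ey = float(y.mean())
E_a0 = y.mean(axis=tuple(range(1, n)))  # E_{a_-0}[y(a_0 + a_-0)] per action of agent 0
Q_0 = E_a0 - (n - 1) / n * Ey           # closed-form credit for agent 0

# A COMA-style counterfactual advantage subtracts the full baseline instead:
coma_adv = E_a0 - Ey                    # zero credit for every action here
```

Summing the per-agent credit over all $n$ agents recovers the full reward $c$, illustrating the "all global rewards are assigned" property; the COMA-style advantage is identically zero in this case.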

5. IMPROVING THE LEARNING STABILITY OF VALUE DECOMPOSITION

In the previous section, we derived the closed-form update rule of FQI-LVD, which reveals the underlying credit assignment mechanism of the linear value decomposition structure. This derivation also enables us to investigate further algorithmic properties of linear value decomposition in multi-agent Q-learning. Although linear value decomposition offers superior scalability in multi-agent settings, we find that FQI-LVD carries the risk of unbounded divergence from arbitrary initialization. To improve the stability of linear value decomposition training, we theoretically demonstrate that on-policy data distribution or richer Q function classes can provide convergence guarantees. Moreover, we use a concrete MMDP example to visualize our implications.

Figure 1: (a) An MMDP on which FQI-LVD diverges to infinity when $\gamma \in \left(\frac{4}{5}, 1\right)$; $r$ is shorthand for $r(s, \boldsymbol{a})$, and the action space of each agent is $\mathcal{A} \equiv \{A^{(1)}, \dots, A^{(|\mathcal{A}|)}\}$. (b) Learning curves of $\|Q_{tot}\|_\infty$ for on-policy FQI-LVD on this MMDP, where the dataset is generated by different choices of the $\epsilon$-greedy hyper-parameter. (c) Learning curves of $\|Q_{tot}\|_\infty$ for several deep multi-agent Q-learning algorithms. [Panel (a) is a two-state transition diagram; its extracted labels read: $s_1$, $r = 0$; $s_2$, $a_1 = a_2 = A^{(2)}$, $r = 0$; $a_1 \neq a_2$, $r = 0$; $a_1 = a_2 = A^{(1)}$, $r = 1$.]

5.1. UNBOUNDED DIVERGENCE IN OFFLINE TRAINING

We now analyze the convergence of FQI-LVD with offline training on a dataset $\mathcal{D}$.

Proposition 1. The empirical Bellman operator $\mathcal{T}^{\text{LVD}}_{\mathcal{D}}$ is not a $\gamma$-contraction, i.e., the following important property of the standard Bellman optimality operator $\mathcal{T}$ no longer holds for $\mathcal{T}^{\text{LVD}}_{\mathcal{D}}$:
$$\forall Q_{tot}, Q'_{tot} \in \mathcal{Q}, \quad \left\| \mathcal{T} Q_{tot} - \mathcal{T} Q'_{tot} \right\|_\infty \leq \gamma \left\| Q_{tot} - Q'_{tot} \right\|_\infty. \quad (9)$$
For the standard Bellman optimality operator $\mathcal{T}$ (Sutton & Barto, 2018), $\gamma$-contraction is critical for deriving theoretical guarantees. In the context of FQI-LVD, the additivity constraint limits the expressible joint action-value function class, which deviates the empirical Bellman operator $\mathcal{T}^{\text{LVD}}_{\mathcal{D}}$ from the original Bellman optimality operator $\mathcal{T}$ (see Theorem 1). This deviation is induced by the negatively weighted baseline term with coefficient $(n-1)/n$ in Eq. (8) and is also known as inherent Bellman error (Munos & Szepesvári, 2008), which corrupts a broad set of stability properties, including $\gamma$-contraction. As a concrete example, we construct a simple MMDP with two agents, two global states, and two actions (see Figure 1a). The optimal policy of this MMDP is simply to execute action $A^{(1)}$ in state $s_2$, which is the only way for the two agents to obtain a positive reward. The learning curve for $\epsilon = 1.0$ (the green one) in Figure 1b refers to an offline setting with uniform data distribution, in which unbounded divergence can be observed, as formalized by the following proposition.

Proposition 2. There exist MMDPs such that, with uniform data distribution, the value function of FQI-LVD diverges to infinity from an arbitrary initialization $Q^{(0)}$.

Note that the unbounded divergence discussed in Proposition 2 occurs from an arbitrary initialization $Q^{(0)}$. To provide implications for practical scenarios, we also investigate the performance of several deep multi-agent Q-learning algorithms on this MMDP.
As shown in Figure 1c , VDN (Sunehag et al., 2018) , a deep-learning-based implementation of FQI-LVD, results in unbounded divergence. We postpone the discussion of other deep-learning-based algorithms to the next subsection.

5.2. LOCAL AND GLOBAL CONVERGENCE IMPROVEMENTS

To improve the training stability of FQI-LVD, we investigate methods that enable local and global convergence of value decomposition learning, respectively.

Local Convergence Improvement. As shown in Theorem 1, the choice of training data distribution affects the output of the empirical Bellman operator $\mathcal{T}^{\text{LVD}}_{\mathcal{D}}$. We find that FQI-LVD has a local convergence property in the on-policy mode, i.e., when the dataset $\mathcal{D}$ is accumulated by running an $\epsilon$-greedy policy (Mnih et al., 2015). Here we include an informal statement of the local stability of FQI-LVD and defer the precise version, its proof, and the algorithm box of on-policy FQI-LVD to Appendix C.1.

Theorem 2 (Informal). On-policy FQI-LVD locally converges to the optimal policy and has a fixed-point value function when the hyper-parameter $\epsilon$ is sufficiently small.

Theorem 2 indicates that multi-agent Q-learning with linear value decomposition has a convergent region, in which the value function induces optimal actions. By combining this local stability with Brouwer's fixed-point theorem (Brouwer, 1911), we can further verify the existence of a fixed-point solution for the on-policy Bellman operator $\mathcal{T}^{\text{LVD}}_{\mathcal{D}_t}$. Figure 1b visualizes the performance of on-policy FQI-LVD with different values of the hyper-parameter $\epsilon$. With a smaller $\epsilon$ (such as 0.1 or 0.01), on-policy FQI-LVD demonstrates numerical stability, and the corresponding collected datasets are closer to the on-policy data distribution.

Global Convergence Improvement. The linear value decomposition structure limits the joint action-value function class $\mathcal{Q}^{\text{LVD}}$, which is the origin of the deviation of the empirical Bellman operator $\mathcal{T}^{\text{LVD}}_{\mathcal{D}}$ discussed in Proposition 1. Another way to improve training stability is therefore to enrich the expressiveness of the value decomposition.
We consider a multi-agent fitted Q-iteration with the full action-value function class derived from IGM, named FQI-IGM, whose action-value function class is
$$\mathcal{Q}^{\text{IGM}} = \left\{ Q \,\middle|\, Q_{tot} \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}|^n} \text{ and } \forall \left[ Q_i \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}|} \right]_{i=1}^{n} \text{ such that Eq. (1) is satisfied} \right\}. \quad (10)$$
Note that $\mathcal{Q}^{\text{LVD}} \subset \mathcal{Q}^{\text{IGM}}$, which indicates that the linear decomposition structure stated in Eq. (2) is a sufficient condition for the IGM constraint. The formal definition of FQI-IGM is deferred to Appendix C.2, and its global convergence property is established by the following theorem.

Theorem 3. FQI-IGM globally converges to the optimal value function.

Theorem 3 relies on the fact that $\mathcal{Q}^{\text{IGM}}$ is complete in MMDP settings, i.e., the inherent Bellman error discussed in Proposition 1 reaches zero and the empirical Bellman operator $\mathcal{T}^{\text{IGM}}_{\mathcal{D}}$ is a $\gamma$-contraction. Using the universal function approximation of neural networks, QPLEX (Wang et al., 2020a), a deep-learning-based implementation of FQI-IGM, theoretically realizes the complete IGM function class. QTRAN (Son et al., 2019) is an approximate implementation of FQI-IGM which uses soft penalties to realize the IGM constraints. As the basis of comparison, VDN (Sunehag et al., 2018) is the deep-learning-based implementation of FQI-LVD. An intermediate version, QMIX (Rashid et al., 2018), establishes a non-linear monotonic mapping between local and global value functions. The value function class of QMIX can be summarized as
$$\mathcal{Q}^{\text{QMIX}} = \left\{ Q \,\middle|\, Q_{tot}(s, \boldsymbol{a}) = f(s, Q_1(s, a_1), \dots, Q_n(s, a_n)) \text{ and } f(s, \cdot) \text{ is monotonic} \right\}, \quad (11)$$
which is known to underrepresent the IGM function class, since the monotonic correspondence is not necessary for the IGM constraint stated in Eq. (1) (Mahajan et al., 2019). Formally, $\mathcal{Q}^{\text{LVD}} \subset \mathcal{Q}^{\text{QMIX}} \subset \mathcal{Q}^{\text{IGM}}$ is a sequence of strict inclusions. As shown in Figure 1c, QPLEX and QTRAN, two algorithms with representational capacity $\mathcal{Q}^{\text{IGM}}$, exhibit outstanding numerical stability in the proposed MMDP example.
By contrast, unbounded divergence occurs for both VDN and QMIX, whose function classes are incomplete with respect to the IGM constraint. This experiment is a didactic study connecting our theoretical implications with practical algorithms. Combining the theoretical and empirical results above, we summarize this section with the following insight.

Implication 2. Multi-agent Q-learning with linear value decomposition potentially suffers from the risk of unbounded divergence from arbitrary initialization. On-policy data distribution or richer Q function classes can improve its local or global convergence, respectively.

6. EMPIRICAL ANALYSIS

In this section, we conduct an empirical study to connect our theoretical implications to practical scenarios of deep multi-agent Q-learning algorithms. An empirical analysis of a didactic example, a two-state MMDP, was carried out in Section 5, showing that the linear value decomposition structure suffers from training instability in the offline mode. To verify the other implications, we evaluate four state-of-the-art deep-learning-based methods, VDN (Sunehag et al., 2018), QMIX (Rashid et al., 2018), QTRAN (Son et al., 2019), and QPLEX (Wang et al., 2020a), on the matrix game proposed by QTRAN and on StarCraft Multi-Agent Challenge (SMAC) benchmark tasks (Samvelyan et al., 2019). The implementation details of the four baselines and the experimental settings are deferred to Appendix F. We run all experiments with 6 random seeds and report the median performance with 25-75% percentiles.

(Under review as a conference paper at ICLR 2021.)

6.1. CONSISTENCY OF THE CLOSED-FORM UPDATE RULE WITH VDN

As shown in Theorem 1, we derive the closed-form update rule of FQI-LVD. From an optimization perspective, FQI-LVD and VDN share the same objective function (see Definition 1) but use different optimization methods, i.e., exact arg min vs. gradient descent. Starting from a common matrix game used by QTRAN (Son et al., 2019) and QPLEX (Wang et al., 2020a), stated in Table 1a, we illustrate the correctness of our closed-form formulation. This matrix game describes a simple cooperative multi-agent task with two agents and three actions. Miscoordination penalties are included, and the optimal strategy is for the two agents to perform action $A^{(1)}$ simultaneously. We adopt a uniform data distribution in this didactic example.

Table 1: (a) payoff of the matrix game; (b) $Q_{tot}$ of FQI-LVD; (c) $Q_{tot}$ of VDN.

(a) Payoff of matrix game
  a_1 \ a_2 |  A^(1)   A^(2)   A^(3)
  A^(1)     |    8      -12     -12
  A^(2)     |  -12        0       0
  A^(3)     |  -12        0       0

(b) Q_tot of FQI-LVD
  a_1 \ a_2 |  A^(1)   A^(2)   A^(3)
  A^(1)     |  -6.22   -4.89   -4.89
  A^(2)     |  -4.89   -3.56   -3.56
  A^(3)     |  -4.89   -3.56   -3.56

(c) Q_tot of VDN
  a_1 \ a_2 |  A^(1)   A^(2)   A^(3)
  A^(1)     |  -6.23   -4.90   -4.90
  A^(2)     |  -4.90   -3.57   -3.57
  A^(3)     |  -4.90   -3.57   -3.57

Tables 1b and 1c show the joint action-value functions of FQI-LVD and VDN, respectively. Comparing these two joint action-value functions, we find that the estimation error of VDN is only $\|Q^{\text{FQI-LVD}}_{tot} - Q^{\text{VDN}}_{tot}\|_\infty = 0.01$, which corresponds to a 0.2% relative error. This simulation result strongly illustrates the accuracy of Theorem 1. A learning curve of the relative error for Table 1c is provided in Appendix G.1. In addition, as discussed by QTRAN and QPLEX, VDN with its limited function class cannot learn the optimal policy in this didactic matrix game. The joint action-value functions of QPLEX, QTRAN, and QMIX are deferred to Appendix G.2, where QPLEX and QTRAN solve this task but QMIX does not.
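The FQI-LVD values in Table 1b can be reproduced directly from Theorem 1: for this one-shot matrix game under a uniform distribution, the regression target is the payoff itself. A NumPy sketch:

```python
import numpy as np

# Payoff of the matrix game (Table 1a).
r = np.array([[  8., -12., -12.],
              [-12.,   0.,   0.],
              [-12.,   0.,   0.]])
n, nA = 2, 3
p = np.full(nA, 1.0 / nA)        # uniform per-agent marginals

# Closed-form projection (Theorem 1) with target y = r.
Ey = float(p @ r @ p)
Q1 = r @ p - (n - 1) / n * Ey    # E_{a2}[r(a1, .)] - counterfactual baseline
Q2 = p @ r - (n - 1) / n * Ey
Q_tot = Q1[:, None] + Q2[None, :]
print(np.round(Q_tot, 2))        # matches Table 1b up to rounding

# The greedy joint action of the fitted values is not <A1, A1>: the additive
# class cannot represent this payoff, so the optimal policy is missed.
```

The rounded entries recover -6.22 at $(A^{(1)}, A^{(1)})$, -4.89 on the miscoordination cells, and -3.56 elsewhere, and the fitted greedy action is suboptimal, consistent with the discussion of VDN's limited function class.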

6.2. IS LINEAR VALUE DECOMPOSITION LIMITED IN OFFLINE TRAINING?

Section 5 shows that in the offline training mode, linear value decomposition is limited on a didactic MMDP task. To generalize our implications to complex domains, we investigate the performance of deep multi-agent Q-learning on StarCraft II benchmark tasks with offline data collection. Recently, offline reinforcement learning has attracted great attention because it can leverage multi-source datasets and is regarded as a key step towards real-world applications (Dulac-Arnold et al., 2019; Levine et al., 2020). Differing from related work studying distributional shift (Fujimoto et al., 2019; Levine et al., 2020; Yu et al., 2020), we adopt a diverse dataset to investigate the effect of the expressiveness of a value decomposition structure on offline training, i.e., which value decomposition structure is suitable for multi-agent offline reinforcement learning. These datasets are constructed by training a behavior policy with VDN (Sunehag et al., 2018) and collecting a fixed number of experienced episodes over the whole training procedure. We evaluate the learning curves on nine common StarCraft II maps. The results are shown in Figure 2. To approximate the MMDP setting, we concatenate the global state with the local observations of each agent to handle partial observability. Figure 2(b-d, f-h, j-l) illustrates that VDN (Sunehag et al., 2018) and QMIX (Rashid et al., 2018) perform poorly and cannot make good use of an offline dataset collected by an unfamiliar behavior policy. In contrast, QPLEX (Wang et al., 2020a) and QTRAN (Son et al., 2019), with richer Q function classes, perform very well, which indicates that the expressiveness of value decomposition structures dramatically affects the performance of multi-agent offline Q-learning. The learning curves of the Behavior line, shown in Figure 2(a, e, i), are produced by VDN with $\epsilon$-greedy online data collection.
Figure 2(a, e, i) shows that VDN with online data collection can solve these nine tasks but fails with offline data collection; that is, there is a considerable gap between online and offline training with linear value decomposition. Although distribution shift (Levine et al., 2020) is a potential cause of this gap, the remarkable performance of QPLEX and QTRAN suggests that our datasets are sufficient for offline training. In contrast to the theoretical convergence analysis of Section 5, the empirical experiments in this subsection evaluate the performance of a deep-learning-based implementation of linear value decomposition (i.e., VDN) in the online and offline data collection settings. We design several comparative experiments to demonstrate two empirical implications shared with the above theoretical understanding (see Implication 2): 1) VDN with online data collection outperforms VDN with offline data collection. 2) VDN, a deep-learning-based algorithm with a linear value decomposition structure, has considerable limitations in offline training, while QPLEX and QTRAN, two deep-learning-based algorithms with a complete or approximately complete IGM function class, are the state-of-the-art value decomposition algorithms for multi-agent offline training.
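The offline data pipeline described above can be sketched as follows. This is a minimal illustration, not the actual SMAC setup: `env_reset`, `env_step`, and `behavior_policy` are hypothetical placeholders for the real environment and the trained VDN behavior policy.

```python
import random

def collect_offline_dataset(env_reset, env_step, behavior_policy,
                            n_episodes, max_steps=100):
    """Run a frozen behavior policy and store every transition.

    env_reset/env_step/behavior_policy are placeholders for the real
    environment interface and the VDN behavior policy described in the text.
    """
    dataset = []
    for _ in range(n_episodes):
        s = env_reset()
        for _ in range(max_steps):
            a = behavior_policy(s)
            s2, r, done = env_step(s, a)
            dataset.append((s, a, r, s2, done))
            s = s2
            if done:
                break
    return dataset

# Offline training then repeatedly samples minibatches from the frozen
# `dataset` without any further environment interaction, e.g.:
# batch = random.sample(dataset, k=32)
```

The key property exercised in this subsection is that the dataset is fixed once collected, so the learner's own (possibly diverging) value estimates never influence the data distribution.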

7. CONCLUSION

This paper makes an initial effort to provide a theoretical analysis of multi-agent Q-learning with value decomposition. We derive a closed-form solution to the empirical Bellman error minimization with linear value decomposition. Based on this novel result, we reveal the implicit credit assignment mechanism of linear value decomposition and provide a formal analysis of its learning stability and convergence. We also formally show that on-policy training or a richer value function class improves the stability of factorized multi-agent Q-learning. Empirical studies are conducted with state-of-the-art deep multi-agent Q-learning algorithms with value decomposition and verify our theoretical insights in both didactic examples and complex StarCraft II benchmark tasks. To close this paper, we connect our results to a related line of literature on the relative overgeneralization pathology (Wiegand, 2003). In empirical studies of cooperative learning, the behaviors of individual agents are often negatively affected by their uncooperative partners. Focusing on a similar issue, our theoretical analysis of the implicit counterfactual credit assignment provides a detailed characterization of this pathological phenomenon in linear value decomposition, which also suggests feasible solutions. Regarding the relative overgeneralization pathology, coordination graphs (Böhmer et al., 2020) explore a different methodology for cooperative multi-agent Q-learning. This method allows coordinated action selection through communication, which does not follow the principle of IGM consistency. In comparison with linear value decomposition, coordination graphs use a higher-order value decomposition structure (Castellini et al., 2019) from the viewpoint of function approximation. Complementary to the results of this paper, the value factorization of coordination graphs provides a different and promising perspective for future studies.

A OMITTED PROOFS IN SECTION 4

Definition 1 (FQI-LVD). Given a dataset $D$, FQI-LVD specifies the action-value function class
$$\mathcal{Q}^{\text{LVD}} = \left\{ Q \,\middle|\, Q_{tot}(\cdot, \boldsymbol{a}) = \sum_{i=1}^{n} Q_i(\cdot, a_i),\ \forall \boldsymbol{a} \in \mathcal{A}, \text{ and } \forall \left[ Q_i \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}|} \right]_{i=1}^{n} \right\} \quad (6)$$
and induces the empirical Bellman operator $\mathcal{T}^{\text{LVD}}_{D}$:
$$Q^{(t+1)} \leftarrow \mathcal{T}^{\text{LVD}}_{D} Q^{(t)} \equiv \mathop{\arg\min}_{Q \in \mathcal{Q}^{\text{LVD}}} \sum_{(s, \boldsymbol{a}, s') \in \mathcal{S} \times \mathcal{A} \times \mathcal{S}} p_D(\boldsymbol{a}, s' | s) \left( \hat{y}^{(t)}(s, \boldsymbol{a}, s') - \sum_{i=1}^{n} Q_i(s, a_i) \right)^2, \quad (7)$$
where $\hat{y}^{(t)}(s, \boldsymbol{a}, s') = r(s, \boldsymbol{a}) + \gamma \max_{\boldsymbol{a}'} Q^{(t)}_{tot}(s', \boldsymbol{a}')$ denotes the sample-based regression target. Let
$$y^{(t)}(s, \boldsymbol{a}) = (\mathcal{T} Q^{(t)})_{tot}(s, \boldsymbol{a}) = r(s, \boldsymbol{a}) + \gamma \mathbb{E}_{s' \sim P(\cdot | s, \boldsymbol{a})} \left[ \max_{\boldsymbol{a}'} Q^{(t)}_{tot}(s', \boldsymbol{a}') \right]$$
denote the expected regression target. The two induced minimization problems are equivalent:
$$\mathcal{T}^{\text{LVD}}_{D} Q^{(t)} \equiv \mathop{\arg\min}_{Q \in \mathcal{Q}^{\text{LVD}}} \sum_{(s, \boldsymbol{a}, s')} p_D(\boldsymbol{a}, s' | s) \left( \hat{y}^{(t)}(s, \boldsymbol{a}, s') - \sum_{i=1}^{n} Q_i(s, a_i) \right)^2 = \mathop{\arg\min}_{Q \in \mathcal{Q}^{\text{LVD}}} \sum_{(s, \boldsymbol{a})} p_D(\boldsymbol{a} | s) \left( y^{(t)}(s, \boldsymbol{a}) - \sum_{i=1}^{n} Q_i(s, a_i) \right)^2. \quad (12)$$

Proof. Recall Definition 1,
$$\begin{aligned} \mathcal{T}^{\text{LVD}}_{D} Q^{(t)} &\equiv \mathop{\arg\min}_{Q \in \mathcal{Q}^{\text{LVD}}} \mathbb{E}_{(s, \boldsymbol{a}, s') \sim D} \left[ \left( \hat{y}^{(t)}(s, \boldsymbol{a}, s') - Q_{tot}(s, \boldsymbol{a}) \right)^2 \right] \\ &= \mathop{\arg\min}_{Q \in \mathcal{Q}^{\text{LVD}}} \mathbb{E}_{(s, \boldsymbol{a}, s') \sim D} \left[ \left( \hat{y}^{(t)}(s, \boldsymbol{a}, s') - y^{(t)}(s, \boldsymbol{a}) + y^{(t)}(s, \boldsymbol{a}) - Q_{tot}(s, \boldsymbol{a}) \right)^2 \right] \\ &= \mathop{\arg\min}_{Q \in \mathcal{Q}^{\text{LVD}}} \Big\{ \mathbb{E}_{(s, \boldsymbol{a}, s') \sim D} \left[ \left( \hat{y}^{(t)} - y^{(t)} \right)^2 \right] + \mathbb{E}_{(s, \boldsymbol{a}, s') \sim D} \left[ 2 \left( \hat{y}^{(t)} - y^{(t)} \right) \left( y^{(t)} - Q_{tot} \right) \right] + \mathbb{E}_{(s, \boldsymbol{a}, s') \sim D} \left[ \left( y^{(t)} - Q_{tot} \right)^2 \right] \Big\}. \end{aligned} \quad (13)$$
The first term is a constant since $y^{(t)}$ and $\hat{y}^{(t)}$ are fixed targets. The second term equals zero since
$$\mathbb{E}_{(s, \boldsymbol{a}, s') \sim D} \left[ 2 \left( \hat{y}^{(t)} - y^{(t)} \right) \left( y^{(t)} - Q_{tot} \right) \right] = 2\, \mathbb{E}_{(s, \boldsymbol{a}) \sim D} \Big[ \underbrace{\mathbb{E}_{s' \sim P(\cdot | s, \boldsymbol{a})} \left[ \hat{y}^{(t)}(s, \boldsymbol{a}, s') - y^{(t)}(s, \boldsymbol{a}) \right]}_{=0} \left( y^{(t)}(s, \boldsymbol{a}) - Q_{tot}(s, \boldsymbol{a}) \right) \Big] = 0. \quad (14)$$
The third term exactly corresponds to Eq. (12).

Lemma 2. Consider the following weighted linear regression problem
$$\min_{x} \| p \circ (Ax - b) \|_2^2, \quad (15)$$
where $A \in \mathbb{R}^{m^n \times mn}$, $x \in \mathbb{R}^{mn}$, $b, p \in \mathbb{R}^{m^n}$, and $m, n \in \mathbb{Z}^+$. Moreover, $A$ is the $m$-ary encoding matrix, namely, for all $i \in [m^n]$ and $j \in [mn]$,
$$A_{i,j} = \begin{cases} 1, & \text{if } \exists u \in [n],\ j = m \times u + (\lfloor i / m^u \rfloor \bmod m), \\ 0, & \text{otherwise.} \end{cases} \quad (16)$$
For simplicity, the $j$-th row of $A$ corresponds to an $m$-ary number $\vec{a}_j = (j)_m$, where $\vec{a} = a_0 a_1 \ldots a_{n-1}$ with $a_u \in [m]$, $\forall u \in [n]$. Assume $p$ is a positive vector which factorizes as
$$p_j = p(\vec{a}_j) = \prod_{u \in [n]} p_u(a_{u,j}), \quad \text{where } p_u : [m] \to (0, 1) \text{ and } \sum_{a_u \in [m]} p_u(a_u) = 1,\ \forall u \in [n]. \quad (17)$$
The optimal solutions of this problem are the following. Denote $i = u \times m + v$ with $v \in [m]$, $u \in [n]$, and let $w \in \mathbb{R}^{mn}$ be an arbitrary vector:
$$x^*_i = \sum_{\vec{a}} \frac{p(\vec{a})}{p_u(a_u)} b_{\vec{a}} \cdot \mathbb{1}(a_u = v) - \frac{n-1}{n} \sum_{\vec{a}} p(\vec{a}) b_{\vec{a}} - \frac{1}{mn} \sum_{i' \in [mn]} w_{i'} + \frac{1}{m} \sum_{v' \in [m]} w_{um+v'}. \quad (18)$$

Proof. For brevity, denote $A^p = p \circ A$ and $b^p = p \circ b$; the weighted linear regression then becomes a standard linear regression problem w.r.t. $A^p$ and $b^p$. To compute the optimal solutions, we need to calculate the Moore-Penrose inverse of $A^p$. The sufficient and necessary condition for this inverse matrix $A^{p,\dagger} \in \mathbb{R}^{mn \times m^n}$ is the following three statements (Moore, 1920):
(1) $A^p A^{p,\dagger}$ and $A^{p,\dagger} A^p$ are self-adjoint; (20)
(2) $A^p = A^p A^{p,\dagger} A^p$; (21)
(3) $A^{p,\dagger} = A^{p,\dagger} A^p A^{p,\dagger}$. (22)
We consider the following matrix as $A^{p,\dagger}$ and prove that it satisfies all three statements. For all $i \in [mn]$ with $i = u \times m + v$, $u \in [n]$, $v \in [m]$, and all $j \in [m^n]$,
$$A^{p,\dagger}_{i,j} = A^{p,\dagger}_{i, \vec{a}_j} = \frac{p(\vec{a}_{-u,j})}{p_u(a_{u,j})} \cdot \mathbb{1}(a_{u,j} = v) - \frac{n-1}{n} p(\vec{a}_j) - \frac{1}{m} \frac{p(\vec{a}_{-u,j})}{p_u(a_{u,j})} + \frac{1}{mn} \sum_{u'=0}^{n-1} \frac{p(\vec{a}_{-u',j})}{p_{u'}(a_{u',j})}, \quad (23)$$
where $p(\vec{a}_{-u}) = \prod_{u' \neq u} p_{u'}(a_{u'})$.

First, we verify that

A p A p, † is a m n × m n self-adjoint matrix in statement (1). For simplicity, O( a i , a j ) = {u|a u,i = a u,j , u ∈ [n]}. (A p A p, † ) i,j = u∈[n] p( a i )[ p( a -u,j ) p u (a u,j ) • 1(a u,j = a u,i ) - n -1 n p( a j ) - 1 m p( a -u,j ) p u (a u,j ) + 1 mn n-1 u =0 p( a -u ,j ) p u (a u ,j ) ] = u∈O( ai, aj ) p( a j )p( a i ) p u (a u,j ) - n -1 n u∈[n] p( a i )p( a j ) - 1 m u∈[n] p( a j )p( a i ) p u (a u,j ) + u∈[n] 1 mn n-1 u =0 p( a j )p( a i ) p u (a u ,j ) = u∈O( ai, aj ) p( a j )p( a i ) p u (a u,j ) -(n -1) p( a i )p( a j ) - 1 m u∈[n] p( a j )p( a i ) p u (a u,j ) + 1 m u∈[n] p( a j )p( a i ) p u (a u,j ) = u∈O( ai, aj ) p( a j )p( a i ) p u (a u,j ) -(n -1) p( a i )p( a j ) Observe that p u (a u,j ) = p u (a u,i ) if a u,i = a u,j , thus (A p A p, † ) i,j = (A p A p, † ) j,i for any i, j ∈ [m n ]. This proves that A p A p, † is self-adjoint. Second, we prove that A p, † A p is a mn × mn self-adjoint matrix and has surprisingly succinct form. Let i = u × m + v, u ∈ [n], v ∈ [m]. 1. i = i . Besides, O(i) = { a ∈ [m n ]|a u = v} (A p, † A p ) i,i = a∈O(i) p( a)[ p( a -u ) p u (a u ) • 1(a u = v) - n -1 n p( a) - 1 m p( a -u ) p u (a u ) + 1 mn n-1 u =0 p( a -u ) p u (a u ) ] = a∈O(i) p( a) p u (a u ) - n -1 n p( a) - 1 m p( a) p u (a u ) + 1 mn n-1 u =0 p( a) p u (a u ) = a∈O(i) p( a -u ) - 1 m p( a -u ) + 1 mn n-1 u =0 p( a -u ) - n -1 n p u (a u = v) = 1 - 1 m - n -1 n p u (a u = v) + 1 mn u ∈[n] u =u a∈O(i) p( a -u ) + 1 mn a∈O(i) p( a -u ) = 1 - 1 m - n -1 n p u (a u = v) + 1 mn + n -1 mn mp u (a u = v) = 1 - 1 m + 1 mn (25) 2. i = u × m + v, i = u × m + v , v = v . 
This implies that Q(i) ∩ O(i ) = ∅ (A p, † A p ) i,i = a∈O(i ) p( a)[ p( a -u ) p u (a u ) • 1(a u = v) - n -1 n p( a) - 1 m p( a -u ) p u (a u ) + 1 mn n-1 u =0 p( a -u ) p u (a u ) ] = a∈O(i)∩O(i ) p( a) p u (a u ) - n -1 n a∈O(i ) p( a) - 1 m a∈O(i ) p( a) p u (a u ) + 1 mn u ∈[n] u =u a∈O(i ) p( a) p u (a u ) + 1 mn a∈O(i ) p( a) p u (a u ) = - n -1 n p u (a u = v ) - 1 m + n -1 mn a∈O(i ) p( a -u ) + 1 mn = - 1 m + 1 mn (26) 3. i = u 1 × m + v 1 , i = u 2 × m + v 2 , u 1 = u 2 . (A p, † A p ) i,i = a∈O(i ) p( a)[ p( a -u1 ) p u1 (a u1 ) • 1(a u1 = v) - n -1 n p( a) - 1 m p( a -u1 ) p u1 (a u1 ) + 1 mn n-1 u =0 p( a -u ) p u (a u ) ] = a∈O(i)∩O(i ) p( a) p u1 (a u1 ) - n -1 n a∈O(i ) p( a) - 1 m a∈O(i ) p( a) p u1 (a u1 ) + 1 mn u ∈[n] u =u2 a∈O(i ) p( a) p u (a u ) + 1 mn a∈O(i ) p( a) p u2 (a u2 ) = p u2 (a u2 ) - n -1 n p u2 (a u2 ) -p u2 (a u2 ) + n -1 mn mp u2 (a u2 ) + 1 mn = 1 mn Observe that A p, † A p is self-adjoint by equation (2,3,4) and the expression is succinct. Third, we verify statement (2). Since we have computed A p, † A p , the verification is straightforward. For brevity, denote A p, † A p as A p 0 (A p A p 0 ) a,i = p( a) u∈[n] (A p 0 ) um+au,i = p( a) 1(∃u ∈ [n], i = um + a u ) - 1 m + 1 mn + (n -1) 1 mn = p( a) • 1(∃u ∈ [n], i = um + a u ) (28) Thus, A p A p, † A p = A p . Similarly, we can verify statement (3). 
Suppose i 0 = u 0 × m + v 0 , we have (A p 0 A p, † ) i0, a = 1 mn u =u 0 u∈[n] v∈[m] [ p( a -u ) p u (a u ) • 1(a u = v) - n -1 n p( a) - 1 m p( a -u ) p u (a u ) + 1 mn n-1 u =0 p( a -u ) p u (a u ) ] + v∈[m] (1(v = v 0 ) - 1 m + 1 mn )[ p( a -u0 ) p u0 (a u0 ) • 1(a u0 = v) - n -1 n p( a) - 1 m p( a -u0 ) p u0 (a u0 ) + 1 mn n-1 u =0 p( a -u ) p u (a u ) ] = 1 mn u∈[n] v∈[m] [ p( a -u ) p u (a u ) • 1(a u = v) - n -1 n p( a) - 1 m p( a -u ) p u (a u ) + 1 mn n-1 u =0 p( a -u ) p u (a u ) ] + v∈[m] (1(v = v 0 ) - 1 m )[- n -1 n p( a) - 1 m p( a -u0 ) p u0 (a u0 ) + 1 mn n-1 u =0 p( a -u ) p u (a u ) ] + v∈[m] (1(v = v 0 ) - 1 m ) p( a -u0 ) p u0 (a u0 ) • 1(a u0 = v) = 1 mn u∈[n] p( a -u ) p u (a u ) - n -1 n p( a) + 1 n u∈[n] [- 1 m p( a -u ) p u (a u ) + 1 mn n-1 u =0 p( a -u ) p u (a u ) ] +   v∈[m] (1(v = v 0 ) - 1 m )   [- n -1 n p( a) - 1 m p( a -u0 ) p u0 (a u0 ) + 1 mn n-1 u =0 p( a -u ) p u (a u ) ] + (1(a u0 = v 0 ) - 1 m ) p( a -u0 ) p u0 (a u0 ) Clearly, we have the following relations u∈[n] [- 1 m p( a -u ) p u (a u ) + 1 mn n-1 u =0 p( a -u ) p u (a u ) ] =0 (30) v∈[m] (1(v = v 0 ) - 1 m ) = 0 (31) Thus (A p 0 A p, † ) i0, a = 1 mn u∈[n] p( a -u ) p u (a u ) - n -1 n p( a) + (1(a u0 = v 0 ) - 1 m ) p( a -u0 ) p u0 (a u0 ) (32) = A p, † i0, a This proves A p, † = A p, † A p A p, † in statement (3) and A p, † is the Moore-Penrose inverse of A p . Since the optimal solution x * = A p, † b p + (I mn×mn -A p, † A p )w where w ∈ R mn is any vector (Moore, 1920). Denote x p = A p, † b p . 
We have, for all $i = u \times m + v$,
$$x^p_i = \left( A^{p,\dagger} b^p \right)_i = \sum_{\vec{a}} \left[ \frac{p(\vec{a})}{p_u(a_u)} \cdot \mathbb{1}(a_u = v) - \frac{n-1}{n} p(\vec{a}) - \frac{1}{m} \frac{p(\vec{a})}{p_u(a_u)} + \frac{1}{mn} \sum_{u'=0}^{n-1} \frac{p(\vec{a})}{p_{u'}(a_{u'})} \right] b_{\vec{a}}. \quad (34)$$
From Eqs. (25), (26), and (27), we have, for $i = u \times m + v$ and $i' = u' \times m + v'$,
$$\left( I - A^{p,\dagger} A^p \right)_{i,i'} = \begin{cases} \frac{1}{m} - \frac{1}{mn}, & \text{if } u = u', \\ -\frac{1}{mn}, & \text{if } u \neq u'. \end{cases} \quad (35)$$
If we choose $w$ as follows, for $i_0 = u_0 \times m + v_0$,
$$w_{i_0} = \sum_{\vec{a} \in O(i_0)} \frac{p(\vec{a})}{p_{u_0}(a_{u_0})} b_{\vec{a}}, \quad (36)$$
then for $i = u \times m + v$,
$$\left( \left( I - A^{p,\dagger} A^p \right) w \right)_i = \sum_{i_0 : u_0 \neq u} \left( -\frac{1}{mn} \right) w_{i_0} + \sum_{i_0 : u_0 = u} \left( \frac{1}{m} - \frac{1}{mn} \right) w_{i_0} \quad (37)$$
$$= -\frac{1}{mn} \sum_{u' \in [n]} \sum_{\vec{a}} \frac{p(\vec{a})}{p_{u'}(a_{u'})} b_{\vec{a}} + \frac{1}{m} \sum_{\vec{a}} \frac{p(\vec{a})}{p_u(a_u)} b_{\vec{a}}, \quad (38)$$
which is exactly the negative of the last two terms of $x^p_i$ in Eq. (34). Moreover, for an arbitrary $w$, Eq. (35) gives $\left( \left( I - A^{p,\dagger} A^p \right) w \right)_i = -\frac{1}{mn} \sum_{i' \in [mn]} w_{i'} + \frac{1}{m} \sum_{v' \in [m]} w_{um+v'}$, which is exactly the form of the last two terms in Eq. (18). Therefore, the optimal solutions of this weighted linear regression problem can be written as follows: for $i = u \times m + v$, $v \in [m]$, $u \in [n]$, and an arbitrary vector $w \in \mathbb{R}^{mn}$,
$$x^*_i = \sum_{\vec{a}} \frac{p(\vec{a})}{p_u(a_u)} b_{\vec{a}} \cdot \mathbb{1}(a_u = v) - \frac{n-1}{n} \sum_{\vec{a}} p(\vec{a}) b_{\vec{a}} - \frac{1}{mn} \sum_{i' \in [mn]} w_{i'} + \frac{1}{m} \sum_{v' \in [m]} w_{um+v'}. \quad (39)$$
This completes the proof.

Theorem 1. Let $Q^{(t+1)} = \mathcal{T}^{\text{LVD}}_{D} Q^{(t)}$ denote a single iteration of the empirical Bellman operator. Then $\forall i \in \mathcal{N}$, $\forall (s, \boldsymbol{a}) \in \mathcal{S} \times \mathcal{A}$, the individual action-value function
$$Q^{(t+1)}_i(s, a_i) = \underbrace{\mathbb{E}_{\boldsymbol{a}_{-i} \sim p_D(\cdot | s)} \left[ y^{(t)}\left( s, a_i \oplus \boldsymbol{a}_{-i} \right) \right]}_{\text{evaluation of the individual action } a_i} - \underbrace{\frac{n-1}{n} \mathbb{E}_{\boldsymbol{a}' \sim p_D(\cdot | s)} \left[ y^{(t)}(s, \boldsymbol{a}') \right]}_{\text{counterfactual baseline}} + w_i(s), \quad (8)$$
where we denote $a_i \oplus \boldsymbol{a}_{-i} = \langle a_1, \ldots, a_{i-1}, a_i, a_{i+1}, \ldots, a_n \rangle$, and $\boldsymbol{a}_{-i}$ denotes the actions of all agents except agent $i$. The residue term $w \equiv [w_i]_{i=1}^{n}$ is an arbitrary vector satisfying $\sum_{i=1}^{n} w_i(s) = 0$ for all $s$.

Proof. In the formulation of FQI-LVD stated in Definition 1, the empirical Bellman error minimization in Eq. (7) can be regarded as a weighted linear least squares problem as follows: $\forall s \in \mathcal{S}$,
$$\min_{x} \| p \circ (Ax - b) \|_2^2, \quad (40)$$
where $m, n \in \mathbb{Z}^+$ denote the size of the action space $|A|$ and the number of agents, respectively; $A \in \mathbb{R}^{m^n \times mn}$ denotes the multi-agent credit assignment coefficient matrix of action-value functions with linear value decomposition; $x \in \mathbb{R}^{mn}$ denotes the individual action-value functions $[Q_i(s, \cdot)]_{i=1}^{n}$; $b \in \mathbb{R}^{m^n}$ denotes the one-step TD targets $y^{(t)}(s, \cdot)$; and $A$ is the $m$-ary encoding matrix,
$$A_{i,j} = \begin{cases} 1, & \text{if } \exists u \in [n],\ j = m \times u + (\lfloor i / m^u \rfloor \bmod m), \\ 0, & \text{otherwise.} \end{cases} \quad (41)$$
For simplicity, the $j$-th row of $A$ corresponds to an $m$-ary number $\vec{a}_j = (j)_m$, where $\vec{a} = a_0 a_1 \ldots a_{n-1}$ with $a_u \in [m]$, $\forall u \in [n]$. According to the factorizable empirical probability $p_D$ shown in Assumption 1, $p$ is a corresponding positive vector which satisfies
$$p_j = p(\vec{a}_j) = \prod_{u \in [n]} p_u(a_{u,j}), \quad \text{where } p_u : [m] \to (0, 1) \text{ and } \sum_{a_u \in [m]} p_u(a_u) = 1,\ \forall u \in [n]. \quad (42)$$
According to Lemma 2, the optimal solutions of this problem are the following: for $i = u \times m + v$, $v \in [m]$, $u \in [n]$, and an arbitrary vector $w \in \mathbb{R}^{mn}$,
$$x^*_i = \sum_{\vec{a}} \frac{p(\vec{a})}{p_u(a_u)} b_{\vec{a}} \cdot \mathbb{1}(a_u = v) - \frac{n-1}{n} \sum_{\vec{a}} p(\vec{a}) b_{\vec{a}} - \frac{1}{mn} \sum_{i' \in [mn]} w_{i'} + \frac{1}{m} \sum_{v' \in [m]} w_{um+v'}, \quad (43)$$
which means that $\forall i \in \mathcal{N}$, $\forall (s, \boldsymbol{a}) \in \mathcal{S} \times \mathcal{A}$, the individual action-value function
$$Q^{(t+1)}_i(s, a_i) = \mathbb{E}_{\boldsymbol{a}_{-i} \sim p_D(\cdot | s)} \left[ y^{(t)}\left( s, a_i \oplus \boldsymbol{a}_{-i} \right) \right] - \frac{n-1}{n} \mathbb{E}_{\boldsymbol{a}' \sim p_D(\cdot | s)} \left[ y^{(t)}(s, \boldsymbol{a}') \right] + w_i(s),$$
where $a_i \oplus \boldsymbol{a}_{-i} = \langle a_1, \ldots, a_{i-1}, a_i, a_{i+1}, \ldots, a_n \rangle$, $\boldsymbol{a}_{-i}$ denotes the actions of all agents except agent $i$, and the residue term $w \equiv [w_i]_{i=1}^{n}$ is an arbitrary vector satisfying $\sum_{i=1}^{n} w_i(s) = 0$ for all $s$.

$$\forall Q_{tot}, Q'_{tot} \in \mathcal{Q}, \quad \left\| \mathcal{T} Q_{tot} - \mathcal{T} Q'_{tot} \right\|_\infty \leq \gamma \left\| Q_{tot} - Q'_{tot} \right\|_\infty. \quad (9)$$
Proof. Assume, for contradiction, that the empirical Bellman operator $\mathcal{T}^{\text{LVD}}_{D}$ is a $\gamma$-contraction in the sense of Eq. (9). Then, for any MMDP, when using a uniform data distribution, the value function of FQI-LVD would converge (Ernst et al., 2005), because the distance (infinity norm) between any pair of $Q$ functions contracts.
However, a counterexample is given in Proposition 2, which shows that there exist MMDPs such that, when using a uniform data distribution, the value function of FQI-LVD diverges to infinity from an arbitrary initialization $Q^{(0)}$. Hence the $\gamma$-contraction assumption does not hold, and the empirical Bellman operator $\mathcal{T}^{\text{LVD}}_{D}$ is not a $\gamma$-contraction.

Proposition 2. There exist MMDPs such that, when using a uniform data distribution, the value function of FQI-LVD diverges to infinity from an arbitrary initialization $Q^{(0)}$.

Proof. We consider the following MMDP with 2 agents and 2 states $(s_1, s_2)$, where each agent $(i = 1, 2)$ has 2 actions $\mathcal{A} \equiv \{A^{(1)}, A^{(2)}\}$. The reward function is listed below, where $r(s_j, \boldsymbol{a})$ denotes the reward at $(s_j, \boldsymbol{a})$ with $\boldsymbol{a} = \langle a_1, a_2 \rangle$:
$$r(s_1) = \begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix}, \qquad r(s_2) = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}. \quad (45)$$
Besides, the transition function is deterministic:
$$T(s_1) = \begin{pmatrix} s_1 & s_1 \\ s_1 & s_1 \end{pmatrix}, \qquad T(s_2) = \begin{pmatrix} s_2 & s_2 \\ s_2 & s_1 \end{pmatrix}. \quad (46)$$
Furthermore, $\gamma \in (\frac{4}{5}, 1)$. (In practice, $\gamma$ is usually chosen as 0.99 or 0.95.) The following proves that value iteration on this MMDP diverges for any initialization. Denote $Q^t_i(s_j, a_i)$ as the decomposed Q-value of agent $i$ after the $t$-th value iteration at state $s_j$ with action $a_i$. The total Q-value can then be described as $Q^t_{tot}(s_j, \boldsymbol{a}) = Q^t_1(s_j, a_1) + Q^t_2(s_j, a_2)$. For brevity, the $0$-th Q-value is the initialization. First, we clarify the process of each iteration. The value iteration for the linearly decomposed function class solves the MSE problem in Lemma 2, where $b$ is the one-step TD target w.r.t. the Q-value of the last iteration. As described in Lemma 2, the optimal solution of this MSE problem is not unique. We can ignore the arbitrary-vector term $w$ when considering the joint action-value functions, because $w$ does not affect the local action selection of each agent and is eliminated in the summation operator of the linear value decomposition. In addition, under uniform sampling, we observe that $p_u(a_u) = \frac{1}{2}$ for any $a$ and $u$.
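The construction above can also be checked numerically before the hand derivation. The sketch below iterates the closed-form FQI-LVD update of Theorem 1 (uniform sampling, residue $w = 0$) on this two-state MMDP and shows $V_{tot}(s_2)$ blowing up.

```python
import numpy as np

gamma = 0.99  # any gamma in (4/5, 1) triggers divergence
# Rewards r(s, a1, a2) and deterministic transitions T(s, a1, a2) from Prop. 2.
r = {1: np.zeros((2, 2)), 2: np.array([[1.0, 0.0], [0.0, 0.0]])}
T = {1: [[1, 1], [1, 1]], 2: [[2, 2], [2, 1]]}

Q1 = {s: np.zeros(2) for s in (1, 2)}  # Q_1(s, a_1)
Q2 = {s: np.zeros(2) for s in (1, 2)}  # Q_2(s, a_2)
v_hist = []
for t in range(60):
    # Joint greedy value V_tot(s) = max_{a1,a2} Q_1(s,a1) + Q_2(s,a2).
    V = {s: max(Q1[s][a1] + Q2[s][a2] for a1 in range(2) for a2 in range(2))
         for s in (1, 2)}
    newQ1, newQ2 = {}, {}
    for s in (1, 2):
        y = np.array([[r[s][a1, a2] + gamma * V[T[s][a1][a2]]
                       for a2 in range(2)] for a1 in range(2)])
        # Theorem 1 with uniform p_D and residue w = 0:
        newQ1[s] = y.mean(axis=1) - 0.5 * y.mean()
        newQ2[s] = y.mean(axis=0) - 0.5 * y.mean()
    Q1, Q2 = newQ1, newQ2
    v_hist.append(max(Q1[2][a1] + Q2[2][a2] for a1 in range(2) for a2 in range(2)))

print(v_hist[9], v_hist[-1])  # V_tot(s2) keeps growing: divergence
```

Consistent with the derivation, the iterates at $s_2$ eventually follow a recursion with ratio $5\gamma/4 > 1$, so the value estimates grow without bound even though all rewards lie in $[0, 1]$.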
Then, in equation 34 - 1 m p( a) p u (a u ) + 1 mn n-1 u =0 p( a) p u (a u ) = 0 (47) Second, we denote V t tot (s j ) = max a Q t tot (s j , a) and observe that ∀t ≥ 1, s j Q t 1 (s j , a 1 ) = 1 2 a2∈A r(s j , a) + γV t-1 tot (T (s j , a) - 1 2 a∈A 1 4 r(s j , a) + γV t-1 tot (T (s j , a)) (48) = Q t 2 (s j , a 2 ) ( ) The second equation holds because the transition T and the reward R are symmetric for both agents. Thus, we omit the subscript of local Q-values as Q t (s j , a) when t ≥ 1. Third, we analyze the Q-values on state s 1 . Clearly, its iteration is irrelevant to s 2 . According to equation 48, ∀a ∈ A, t ≥ 1 2) . Therefore, we observe that Q t (s 1 , •) = γ t q 1 , ∀t ≥ 1 where q 1 is determined by the initialization Q 0 tot (s 1 , a), ∀a ∈ A. Q t (s 1 , a) = γ 2 V t-1 tot (s 1 ) (50) = γ 2 max a1,a2∈A Q t-1 (s 1 , a 1 ) + Q t-1 (s 1 , a 2 ) (51) Clearly, when t ≥ 1, Q t s 1 , A (1) = Q t s 1 , A Last, we consider state s 2 . It is straightforward to observe the following recursion for t ≥ 2 from equation 48 Q t s 2 , A (1) = 1 2 (1 + 2γV t-1 tot (s 2 )) - 1 8 [1 + γ(3V t-1 tot (s 2 ) + V t-1 tot (s 1 ))] = 5γ 8 V t-1 tot (s 2 ) + 3 8 - 1 4 γ t q 1 = 5γ 4 max a∈A Q t-1 (s 2 , a) + 3 8 - 1 4 γ t q 1 (52) Q t s 2 , A (2) = 1 2 (γV t-1 tot (s 2 ) + γV t-1 tot (s 1 )) - 1 8 [1 + γ(3V t-1 tot (s 2 ) + V t-1 tot (s 1 ))] = γ 8 V t-1 tot (s 2 ) - 1 8 + 3 4 γ t q 1 = γ 4 max a∈A Q t-1 (s 2 , a) - 1 8 + 3 4 γ t q 1 (53) We consider some δ > 0 and t δ = log γ δ 6|q1| . Then, t > t δ Q t s 2 , A (2) ≥ γ 4 max a∈A Q t-1 (s 2 , a) - 1 + δ 8 ≥ γ 4 Q t-1 s 2 , A (2) - 1 + δ 8 (54) Denote Q t s 2 , A (2) = γ 4 Q t-1 s 2 , A (2) -1+δ 8 , ∀t > t δ and Q t δ s 2 , A (2) = Q t δ s 2 , A (2) . Consequently, Q t (s 2 , a 2 ) ≥ Q t δ s 2 , A (2) , ∀t ≥ t δ by equation 54. Since t ≥ t δ Q t s 2 , A (2) = γ 4 t-t δ Q t δ s 2 , A (2) - 1 + δ 2γ -8 + 1 + δ 2γ -8 Furthermore, γ ∈ ( 4 5 , 1). 
Since $\frac{\gamma}{4} < 1$, there exists some $T_\delta \geq t_\delta$ at which
$$Q^{T_\delta}\left( s_2, A^{(2)} \right) \geq \tilde{Q}^{T_\delta}\left( s_2, A^{(2)} \right) \geq \frac{1 + 2\delta}{2\gamma - 8} > -\frac{1 + 2\delta}{6}. \quad (56)$$
According to equation 52, and letting $\delta < \frac{1}{11}$,
$$Q^{T_\delta + 1}\left( s_2, A^{(1)} \right) \geq \frac{5\gamma}{4} Q^{T_\delta}\left( s_2, A^{(2)} \right) + \frac{3}{8} - \frac{1}{4} \gamma^{T_\delta + 1} q_1 \quad (57)$$
$$> -\frac{5 + 10\delta}{24} + \frac{3}{8} - \frac{1}{24} \delta \quad (58)$$
$$> \frac{1}{8}. \quad (59)$$
Similar to equation 54, we observe from equation 52 that, for all $t > T_{\delta = \frac{1}{11}} + 1$,
$$Q^t\left( s_2, A^{(1)} \right) \geq \frac{5\gamma}{4} Q^{t-1}\left( s_2, A^{(1)} \right) + \frac{1}{4}, \quad (60)$$
and
$$V^t_{tot}(s_2) = 2 Q^t\left( s_2, A^{(1)} \right) \quad (61)$$
$$\geq 2 \left( \frac{5\gamma}{4} Q^{t-1}\left( s_2, A^{(1)} \right) + \frac{1}{4} \right) \quad (62)$$
$$= \frac{5\gamma}{4} V^{t-1}_{tot}(s_2) + \frac{1}{2}. \quad (63)$$
Since $\frac{5\gamma}{4} > 1$ and the initial point at $T_{\delta = \frac{1}{11}} + 1$ is larger than $\frac{1}{8}$, this implies that $V^t_{tot}(s_2)$ will eventually diverge. Notice that our proof holds with respect to any initialization $\{Q^0_{tot}(s_j, \boldsymbol{a}) \mid \forall s_j \in \mathcal{S}, \boldsymbol{a} \in \mathcal{A}\}$. Thus, value iteration on the linearly decomposed function class w.r.t. this MMDP will diverge eventually under any circumstances.

Algorithm 1 (on-policy FQI-LVD), key steps:
3: Construct an exploratory policy $\bar{\pi}_t$ based on $Q^{(t)}$, i.e., $\epsilon$-greedy exploration
$$\bar{\pi}_t(\boldsymbol{a} | s) = \prod_{i=1}^{n} \left[ \frac{\epsilon}{|A|} + (1 - \epsilon)\, \mathbb{I}\left( a_i = \mathop{\arg\max}_{a'_i \in A} Q^{(t)}_i(s, a'_i) \right) \right].$$
4: Collect a new dataset $D_t$ by running $\bar{\pi}_t$.
5: Apply the on-policy Bellman operator $Q^{(t+1)} \leftarrow \mathcal{T}^{\text{LVD}}_{\epsilon} Q^{(t)} \equiv \mathcal{T}^{\text{LVD}}_{D_t} Q^{(t)}$.

Algorithm 1 is a variant of fitted Q-iteration which adopts an on-policy sample distribution. At line 3, exploratory noise is integrated into the greedy policy, since the function approximator generally requires an extensive set of samples to regularize extrapolated values. In particular, we investigate a standard exploration module called $\epsilon$-greedy, in which every agent takes a small probability of exploring actions with non-maximum values. To make the underlying insights more accessible, we assume that the data collection procedure at line 4 can obtain infinite samples, which makes the dataset $D_t$ a sufficient coverage of the state-action space (see Assumption 1). This algorithmic framework serves as a foundation for discussions of local stability. We consider an additional assumption stated as follows.

Assumption 2 (Unique Optimal Policy). The optimal policy $\pi^*$ is unique.
The intuitive motivation of this assumption is to make the optimal policy $\pi^*$ a potential stable solution. In situations where the optimal policy is not unique, most Q-learning algorithms oscillate around multiple optimal policies (Simchowitz & Jamieson, 2019), and Assumption 2 helps us rule out these uninteresting cases. Based on this setting, the local stability of FQI-LVD can be characterized by the following lemma.

Lemma 3. There exists a threshold $\delta > 0$ such that the on-policy Bellman operator $\mathcal{T}^{\text{LVD}}_{\epsilon}$ is closed in the following subspace $\mathcal{B} \subset \mathcal{Q}^{\text{LVD}}$ when the hyper-parameter $\epsilon$ is sufficiently small:
$$\mathcal{B} = \left\{ Q \in \mathcal{Q}^{\text{LVD}} \,\middle|\, \pi_Q = \pi^*,\ \max_{s \in \mathcal{S}} \left| Q_{tot}(s, \pi^*(s)) - V^*(s) \right| \leq \delta \right\}.$$
Formally, $\exists \delta > 0$, $\exists \epsilon > 0$, $\forall Q \in \mathcal{B}$, we must have $\mathcal{T}^{\text{LVD}}_{\epsilon} Q \in \mathcal{B}$.

Lemma 3 indicates that once the value function $Q$ steps into the subspace $\mathcal{B}$, the induced policy $\pi_Q$ converges to the optimal policy $\pi^*$. By combining this local stability with Brouwer's fixed-point theorem (Brouwer, 1911), we can further verify the existence of a fixed-point solution of the on-policy Bellman operator $\mathcal{T}^{\text{LVD}}_{\epsilon}$ (see Theorem 4).

Theorem 4 (Formal version of Theorem 2). Under the conditions of Lemma 3, Algorithm 1 has a fixed-point value function expressing the optimal policy if the hyper-parameter $\epsilon$ is sufficiently small.

Theorem 4 indicates that multi-agent Q-learning with linear value decomposition has a convergent region, in which the value function induces optimal actions. Note that $\mathcal{Q}^{\text{LVD}}$ is a limited function class, which cannot even guarantee to contain the one-step TD target $\mathcal{T}^{\text{LVD}}_{D} Q$. From this perspective, an on-policy data distribution becomes necessary to project the one-step TD target onto a small set of critical state-action pairs, which helps construct the stable subspace $\mathcal{B}$ stated in Lemma 3.
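The factored exploration step of Algorithm 1 (line 3) can be sketched as follows; `Q_locals` is an assumed per-agent table mapping states to local action values, and the per-agent sampling below realizes exactly the product-form ε-greedy policy.

```python
import numpy as np

def factored_epsilon_greedy(Q_locals, state, epsilon, rng):
    """Sample a joint action where each agent i independently plays
    epsilon-greedy with respect to its own Q_i (Algorithm 1, line 3).

    Q_locals[i][state] is assumed to be a length-|A| array of Q_i(s, .).
    With probability epsilon an agent acts uniformly over all actions,
    matching the per-action probability eps/|A| + (1-eps)*I[greedy].
    """
    joint = []
    for Qi in Q_locals:
        q = Qi[state]
        if rng.random() < epsilon:
            joint.append(int(rng.integers(len(q))))  # uniform exploratory action
        else:
            joint.append(int(np.argmax(q)))          # greedy local action
    return tuple(joint)

rng = np.random.default_rng(0)
Q_locals = [{0: np.array([0.1, 0.9])}, {0: np.array([0.7, 0.2])}]
print(factored_epsilon_greedy(Q_locals, 0, epsilon=0.0, rng=rng))  # (1, 0)
```

Because each agent explores independently, the induced joint distribution factorizes across agents, which is the product structure assumed throughout the stability analysis.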

C.2 GLOBAL CONVERGENCE IMPROVEMENT

Definition 2 (FQI-IGM). Given a dataset $D$, FQI-IGM specifies the action-value function class
$$\mathcal{Q}^{\text{IGM}} = \left\{ Q \,\middle|\, Q_{tot} \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}|^n} \text{ and } \forall \left[ Q_i \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}|} \right]_{i=1}^{n} \text{ such that Eq. (1) is satisfied} \right\} \quad (65)$$
and induces the empirical Bellman operator
$$Q^{(t+1)} \leftarrow \mathcal{T}^{\text{IGM}}_{D} Q^{(t)} \equiv \mathop{\arg\min}_{Q \in \mathcal{Q}^{\text{IGM}}} \sum_{(s, \boldsymbol{a}) \in \mathcal{S} \times \mathcal{A}} p_D(\boldsymbol{a} | s) \left( y^{(t)}(s, \boldsymbol{a}) - Q_{tot}(s, \boldsymbol{a}) \right)^2, \quad (66)$$
where $y^{(t)}(s, \boldsymbol{a}) = r(s, \boldsymbol{a}) + \gamma \mathbb{E}_{s' \sim P(\cdot | s, \boldsymbol{a})} \left[ \max_{\boldsymbol{a}'} Q^{(t)}_{tot}(s', \boldsymbol{a}') \right]$ denotes the regression target derived by the Bellman optimality operator. $Q_{tot}$ and $[Q_i]_{i=1}^{n}$ refer to the interfaces of CTDE defined in Section 3.3. Compared with FQI-LVD stated in Definition 1, the difference is the Q function class, i.e., $\mathcal{Q}^{\text{IGM}}$ vs. $\mathcal{Q}^{\text{LVD}}$.
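The gap between the two function classes is easy to see numerically: projecting a non-additive payoff matrix onto $\mathcal{Q}^{\text{LVD}}$ leaves a nonzero residual, whereas $\mathcal{Q}^{\text{IGM}}$ can represent any $Q_{tot}$ by definition. A minimal numpy sketch with illustrative payoff values:

```python
import numpy as np

# Payoff of a one-shot 2-agent game that is not additively decomposable
# (illustrative values, not taken from the paper's tables).
R = np.array([[8.0, -12.0],
              [-12.0, 6.0]])
m = R.shape[0]

# Encoding matrix A mapping (Q_1, Q_2) to Q_tot(a1, a2) = Q_1(a1) + Q_2(a2).
A = np.zeros((m * m, 2 * m))
for a1 in range(m):
    for a2 in range(m):
        A[a1 * m + a2, a1] = 1.0
        A[a1 * m + a2, m + a2] = 1.0

x, *_ = np.linalg.lstsq(A, R.reshape(-1), rcond=None)
residual = np.max(np.abs(A @ x - R.reshape(-1)))
print(residual)  # > 0: Q^LVD cannot represent this Q_tot exactly
```

For $\mathcal{Q}^{\text{IGM}}$ the analogous residual is zero for every target table, since $Q_{tot}$ is unconstrained there up to the IGM consistency of the local argmaxes.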

D OMITTED PROOFS OF THEOREM 3

Lemma 4. The empirical Bellman operator $\mathcal{T}^{\text{IGM}}_{D}$ stated in Definition 2 is a $\gamma$-contraction, i.e., the following important property of the standard Bellman optimality operator $\mathcal{T}$ holds for $\mathcal{T}^{\text{IGM}}_{D}$:
$$\forall Q_{tot}, Q'_{tot} \in \mathcal{Q}, \quad \left\| \mathcal{T} Q_{tot} - \mathcal{T} Q'_{tot} \right\|_\infty \leq \gamma \left\| Q_{tot} - Q'_{tot} \right\|_\infty. \quad (67)$$
Proof. We want to prove
$$\mathcal{T}^{\text{IGM}}_{D} Q_{tot} = r(s, \boldsymbol{a}) + \gamma \left\langle P(s, \boldsymbol{a}), V_Q \right\rangle, \quad (68)$$
where $P$ is the transition function, $V_Q(\cdot) = \max_{\boldsymbol{a} \in \mathcal{A}} Q_{tot}(\cdot, \boldsymbol{a})$, and $\langle \cdot, \cdot \rangle$ is the inner product. According to Eq. (68) and Lemma 1.5 in the RL textbook (Agarwal et al., 2019), we can prove that $\mathcal{T}^{\text{IGM}}_{D}$ is a $\gamma$-contraction. Eq. (68) indicates that the empirical Bellman error
$$\text{err}^{\text{IGM}}_{D} \equiv \min_{Q \in \mathcal{Q}^{\text{IGM}}} \sum_{(s, \boldsymbol{a}) \in \mathcal{S} \times \mathcal{A}} p_D(\boldsymbol{a} | s) \left( y^{(t)}(s, \boldsymbol{a}) - Q_{tot}(s, \boldsymbol{a}) \right)^2 = 0. \quad (69)$$
Let $\boldsymbol{a}^{*,(t)} = \left[ a^{*,(t)}_i \right]_{i=1}^{n} = \arg\max_{\boldsymbol{a} \in \mathcal{A}} y^{(t)}(s, \boldsymbol{a})$. Then, for any $y^{(t)}(s, \cdot)$, we construct $Q_{tot}(s, \boldsymbol{a}) = y^{(t)}(s, \boldsymbol{a})$ and its corresponding local action-value functions $[Q_i]_{i=1}^{n}$ satisfying the IGM principle:
$$Q_i(s, a_i) = \begin{cases} 1, & \text{if } a_i = a^{*,(t)}_i, \\ 0, & \text{if } a_i \neq a^{*,(t)}_i. \end{cases} \quad (70)$$
To avoid multiple solutions of the $\arg\max$ operator in $\boldsymbol{a}^{*,(t)}$, we take the lexicographic order of joint actions as the second priority. This construction illustrates the completeness of the IGM function class in the MMDP setting. Then Eq. (68) holds, and $\mathcal{T}^{\text{IGM}}_{D}$ is a $\gamma$-contraction in the MMDP framework.

Theorem 3. FQI-IGM globally converges to the optimal value function.
Proof. Let $Q^*(s, \boldsymbol{a}) = \max_{\pi \in \Pi} Q^\pi(s, \boldsymbol{a})$, where $\Pi$ is the space of all policies. According to Lemma 4 and Theorem 1.4 in the RL textbook (Agarwal et al., 2019), we have:
• There exists a stationary and deterministic policy $\pi$ such that $Q^\pi_{tot} = Q^*_{tot}$.
• A vector $Q_{tot} \in \mathbb{R}^{|\mathcal{S}| \times |\mathcal{A}|^n}$ is equal to $Q^*_{tot}$ if and only if it satisfies $Q_{tot} = \mathcal{T}^{\text{IGM}}_{D} Q_{tot}$.
• $\forall Q_{tot} \in \mathcal{Q}^{\text{IGM}}$,
$$\left\| Q^*_{tot} - \mathcal{T}^{\text{IGM}}_{D} Q_{tot} \right\|_\infty = \left\| \mathcal{T}^{\text{IGM}}_{D} Q^*_{tot} - \mathcal{T}^{\text{IGM}}_{D} Q_{tot} \right\|_\infty \quad (71)$$
$$\leq \gamma \left\| Q^*_{tot} - Q_{tot} \right\|_\infty. \quad (72)$$
Thus, FQI-IGM globally converges to the optimal value function.
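The indicator-style construction used in the proof of Lemma 4 can be sketched directly. The helper below is a hypothetical illustration: given any target table $y(s, \cdot)$, it keeps $Q_{tot} = y$ and builds local utilities whose individual argmaxes reproduce the joint argmax.

```python
import numpy as np

def igm_construction(y):
    """Realize Q_tot = y inside Q^IGM: keep the joint table itself and set
    indicator-style local utilities (1 on the optimal local action, 0
    elsewhere), following the construction in the proof of Lemma 4."""
    # Flat argmax with ties broken toward the first (lexicographic) entry.
    a_star = np.unravel_index(np.argmax(y), y.shape)
    Q_locals = []
    for i, m in enumerate(y.shape):
        qi = np.zeros(m)
        qi[a_star[i]] = 1.0
        Q_locals.append(qi)
    return y, Q_locals, a_star

y = np.array([[0.3, 1.7], [0.2, -0.5]])
Q_tot, Q_locals, a_star = igm_construction(y)
greedy_locals = tuple(int(np.argmax(q)) for q in Q_locals)
print(greedy_locals == a_star)  # True: IGM consistency holds
```

This is why the empirical Bellman error in Eq. (69) is exactly zero: the joint table is reproduced verbatim, and only the decentralized action selection is delegated to the indicator utilities.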

E OMITTED PROOFS OF APPENDIX C.1 E.1 SOME NOTATIONS

In this section, we only consider the data distribution generated by the optimal joint policy $\pi^*$. To simplify the notation, we use $\varepsilon = \epsilon / |A|$ to reformulate the exploratory policy generated by $\epsilon$-greedy exploration as follows:
$$\bar{\pi}(\boldsymbol{a} | s) = \prod_{i=1}^{n} \left[ \varepsilon + (1 - \epsilon)\, \mathbb{I}\left( a_i = \mathop{\arg\max}_{a'_i \in A} Q^*_i(s, a'_i) \right) \right],$$
where we further denote $\bar{\varepsilon} = (|A| - 1)\varepsilon$; i.e., each agent takes its greedy action with probability $1 - \bar{\varepsilon}$ and each non-greedy action with probability $\varepsilon$. In addition, we use $f(s, \cdot, \cdot)$ to denote the corresponding coefficients in the closed-form update
$$\left( \mathcal{T}^{\text{LVD}}_{D} Q \right)_{tot}(s, \boldsymbol{a}) = \sum_{\boldsymbol{a}' \in A^n} f(s, \boldsymbol{a}, \boldsymbol{a}') \left( \mathcal{T} Q \right)_{tot}(s, \boldsymbol{a}'),$$
where $(\mathcal{T} Q)_{tot}(s, \boldsymbol{a}') = r(s, \boldsymbol{a}') + \gamma V_{tot}(s')$ denotes the precise target values derived by the Bellman optimality equation. Formally, according to Eq. (8),
$$f(s, \boldsymbol{a}, \boldsymbol{a}') = \left[ \frac{h^{(1)}(s, \boldsymbol{a}, \boldsymbol{a}')}{1 - \bar{\varepsilon}} + \frac{h^{(0)}(s, \boldsymbol{a}, \boldsymbol{a}')}{\varepsilon} - (n - 1) \right] (1 - \bar{\varepsilon})^{h_{\pi^*}(s, \boldsymbol{a}')} \varepsilon^{n - h_{\pi^*}(s, \boldsymbol{a}')},$$
in which
$$h_{\pi^*}(s, \boldsymbol{a}) = \sum_{i=1}^{n} \mathbb{I}[a_i = \pi^*_i(s)], \qquad h^{(1)}(s, \boldsymbol{a}, \boldsymbol{a}') = \sum_{i=1}^{n} \mathbb{I}[a'_i = \pi^*_i(s)]\, \mathbb{I}[a'_i = a_i], \qquad h^{(0)}(s, \boldsymbol{a}, \boldsymbol{a}') = \sum_{i=1}^{n} \mathbb{I}[a'_i \neq \pi^*_i(s)]\, \mathbb{I}[a'_i = a_i].$$
As a reference indicating whether the learned value function produces the optimal policy, we denote
$$\mathcal{E}(Q) = \min_{s \in \mathcal{S}} \min_{\boldsymbol{a} \in A^n \setminus \{\pi^*(s)\}} \left( Q_{tot}(s, \pi^*(s)) - Q_{tot}(s, \boldsymbol{a}) \right).$$
Notice that $\pi^*$ denotes the optimal policy of the given MDP, so $\mathcal{E}(Q)$ might be negative for a non-optimal or inaccurate value function $Q$.
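The coefficient expression for $f$ can be cross-checked by brute force against the counterfactual form of Theorem 1. The sketch below assumes a hypothetical setting where $\pi^*(s)$ is the all-zeros joint action; the targets $y$ are arbitrary random numbers, and the per-agent action probabilities follow the ε-greedy product distribution above.

```python
import itertools
import numpy as np

n, m = 3, 2
eps = 0.05                 # per non-greedy action probability (the text's epsilon/|A|)
eps_bar = (m - 1) * eps    # the text's eps-bar
pi_star = (0,) * n         # assumed optimal joint action (hypothetical)

def p_act(ai):
    """Per-agent action probability under eps-greedy around pi*_i = 0."""
    return (1 - eps_bar) if ai == 0 else eps

def p_joint(a):
    return float(np.prod([p_act(ai) for ai in a]))

def f_paper(a, ap):
    """Coefficient form from this section (h^(1), h^(0), h_{pi*})."""
    h = sum(api == 0 for api in ap)
    h1 = sum(api == 0 and api == ai for api, ai in zip(ap, a))
    h0 = sum(api != 0 and api == ai for api, ai in zip(ap, a))
    return (h1 / (1 - eps_bar) + h0 / eps - (n - 1)) \
        * (1 - eps_bar) ** h * eps ** (n - h)

def f_thm1(a, ap):
    """Expansion of Theorem 1: p(a') * (sum_i 1[a'_i = a_i]/p_i(a'_i) - (n-1))."""
    match = sum(1.0 / p_act(api) for api, ai in zip(ap, a) if api == ai)
    return p_joint(ap) * (match - (n - 1))

acts = list(itertools.product(range(m), repeat=n))
gap = max(abs(f_paper(a, ap) - f_thm1(a, ap)) for a in acts for ap in acts)
print(gap)  # ~0: the coefficient form matches Theorem 1
```

The match holds because each agent's matched component contributes $1/(1-\bar{\varepsilon})$ when it is optimal and $1/\varepsilon$ otherwise, which is exactly the $h^{(1)}$/$h^{(0)}$ split.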

E.2 OMITTED PROOFS

Lemma 5. Given a dataset D generated by the optimal policy π * with -greedy exploration, for any target value function Q, ∀δ > 0, ∀0 < ε ≤ δ n 2 |A| n 2 n+1 (R max + γ V tot ∞ ) , we have ∀s ∈ S, (T LVD D Q) tot (s, π * (s)) -(T Q) tot (s, π * (s)) ≤ δ, where (T Q) tot (s, a) = r(s, a) + γV tot (s ) denotes the regression target generated by Q. Proof. ∀s ∈ S, (T LVD D Q) tot (s, π * (s)) -(T Q) tot (s, π * (s)) ≤ |(f (s, π * (s), π * (s)) -1)(T Q) tot (s, π * (s))| + a ∈A n \{π * (s)} f (s, π * (s), a )(T Q) tot (s, a ) ≤   |f (s, π * (s), π * (s)) -1| + a ∈A n \{π * (s)} |f (s, π * (s), a )|   (T Q) tot ∞ . In the first term, ∀s ∈ S, |f (s, π * (s), π * (s)) -1| = n 1 - ε -(n -1) (1 -ε) n -1 = (n -(n -1)(1 -ε))(1 -ε) n-1 -1 = (1 + (n -1)ε)(1 -ε) n-1 -1 = (1 + (n -1)ε) n-1 =0 n -1 (-1) ε -1 = (1 + (n -1)ε) 1 -(n -1)ε + n-1 =2 n -1 (-1) ε -1 = 1 -(n -1) 2 ε2 + (1 + (n -1)ε) n-1 =2 n -1 (-1) ε -1 = ε2 (n -1) 2 -(1 + (n -1)ε) n-1 =2 n -1 (-1) ε -2 ≤ |A| 2 ε 2 n 2 + 2 n-1 =2 n -1 ≤ |A| 2 ε 2 n 2 + 2 n ≤ ε 2 n 2 |A| 2 2 n . In the second term, ∀s ∈ S, a ∈A n \{π * (s)} |f (s, π * (s), a )| ≤ a ∈A n \{π * (s)} h π * (s, a ) 1 - ε -(n -1) (1 -ε) h π * (s,a ) ε n-h π * (s,a ) = a ∈A n \{π * (s)} h π * (s, a ) -(n -1)(1 -ε) (1 -ε) h π * (s,a )-1 ε n-h π * (s,a ) ≤ a ∈A n \{π * (s)} 2n(1 -ε) h π * (s,a )-1 ε n-h π * (s,a ) ≤ a ∈A n \{π * (s)} 2nε ≤ 2nε|A| n . ( ) Thus ∀s ∈ S, (T LVD D Q) tot (s, π * (s)) -(T Q) tot (s, π * (s)) ≤   |f (s, π * (s), π * (s)) -1| + a ∈A n \{π * (s)} |f (s, π * (s), a )|   (T Q) tot ∞ ≤ (ε 2 n 2 |A| 2 2 n + 2nε|A| n ) (T Q) tot ∞ ≤ εn 2 |A| n 2 n+1 (T Q) tot ∞ ≤ εn 2 |A| n 2 n+1 (R max + γ V tot ∞ ) ≤ δ. ( ) Lemma 6. 
Given a dataset D generated by the optimal policy π * with -greedy exploration, for any target value function Q, ∀0 < ε ≤ (1 -γ)E(Q * ) γn 3 |A| n 2 n+4 (R max /(1 -γ) + γ V π * tot -V * ∞ ) , we have ∀s ∈ S, (T LVD D Q) tot (s, π * (s)) -V * (s) ≤ γ V π * tot -V * ∞ + 1 -γ 8nγ E(Q * ), where V π * tot (s) = Q tot (s, π * (s)). Proof. ∀s ∈ S, (T LVD D Q) tot (s, π * (s)) -V * (s) ≤ (T LVD D Q) tot (s, π * (s)) -(T Q) tot (s, π * (s)) + |(T Q) tot (s, π * (s)) -V * (s)| = (T LVD D Q) tot (s, π * (s)) -(T Q) tot (s, π * (s)) + |(T Q) tot (s, π * (s)) -Q * (s, π * (s))| = (T LVD D Q) tot (s, π * (s)) -(T Q) tot (s, π * (s)) + |(T Q) tot (s, π * (s)) -(T Q * )(s, π * (s))| ≤ (T LVD D Q) tot (s, π * (s)) -(T Q) tot (s, π * (s)) + γ|V tot (s ) -V * (s )| ≤ (T LVD D Q) tot (s, π * (s)) -(T Q) tot (s, π * (s)) + γ|Q tot (s , π * (s )) -V * (s )| ≤ (T LVD D Q) tot (s, π * (s)) -(T Q) tot (s, π * (s)) + γ V π * tot -V * ∞ (88) Let δ = 1-γ 8nγ E(Q * ). According to Lemma 5, with the condition 0 < ε ≤ δ n 2 |A| n 2 n+1 (R max + γ V tot ∞ ) = (1 -γ)E(Q * )/(8nγ) n 2 |A| n 2 n+1 (R max + γ V tot ∞ ) , we have (T LVD D Q) tot (s, π * (s)) -(T Q) tot (s, π * (s)) ≤ δ = 1 -γ 8nγ E(Q * ). Notice that V tot ∞ ≤ V * ∞ + V tot -V * ∞ (91) ≤ R max 1 -γ + V π * tot -V * ∞ . The overall statement is ∀0 < ε ≤ (1 -γ)E(Q * ) γn 3 |A| n 2 n+4 (R max /(1 -γ) + γ V π * tot -V * ∞ ) ≤ (1 -γ)E(Q * )/(8nγ) n 2 |A| n 2 n+1 (R max + γ V tot ∞ ) we have ∀s ∈ S, (T LVD D Q) tot (s, π * (s)) -V * (s) ≤ (T LVD D Q) tot (s, π * (s)) -(T Q) tot (s, π * (s)) + γ V π * tot -V * ∞ ≤ γ V π * tot -V * ∞ + 1 -γ 8nγ E(Q * ). ( ) Lemma 7. For any value function Q, the corresponding sub-optimality gap satisfies E(T Q) ≥ E(Q * ) -2γ V tot -V * ∞ (95) Proof. With a slight abuse of notation, let s 1 and s 2 denote the next states while taking actions π * (s) and a at the state s, respectively. 
According to the definition,
\[
\mathcal{E}(\mathcal{T}Q)
= \min_{(s,a)\in \mathcal{S}\times(\mathcal{A}^n\setminus\{\pi^*(s)\})}
\Big((\mathcal{T}Q)_{tot}(s,\pi^*(s)) - (\mathcal{T}Q)_{tot}(s,a)\Big)
\ge \min_{(s,a)\in \mathcal{S}\times(\mathcal{A}^n\setminus\{\pi^*(s)\})}
\Big(Q^*_{tot}(s,\pi^*(s)) - Q^*_{tot}(s,a)\Big) - 2\gamma\,\|V_{tot}-V^*\|_\infty
= \mathcal{E}(Q^*) - 2\gamma\,\|V_{tot}-V^*\|_\infty .
\]

Lemma 8. Given a dataset $\mathcal{D}$ generated by the optimal policy $\pi^*$ with $\epsilon$-greedy exploration, for any target value function $Q$ and any $\delta > 0$, if
\[
0 < \epsilon \le \frac{\delta}{n^2 |\mathcal{A}|^n 2^n \big(R_{\max}/(1-\gamma) + \gamma\|V_{tot}-V^*\|_\infty\big)},
\]
then for all $s \in \mathcal{S}$ and all $a \in \mathcal{A}^n \setminus \{\pi^*(s)\}$,
\[
(\mathcal{T}^{LVD}_{\mathcal{D}} Q)_{tot}(s,a) \le (\mathcal{T}Q)_{tot}(s,\pi^*(s)) - \mathcal{E}(Q^*) + 2n\gamma\|V_{tot}-V^*\|_\infty + \delta, \tag{98}
\]
where $(\mathcal{T}Q)_{tot}(s,a) = r(s,a) + \gamma V_{tot}(s')$ denotes the regression target generated by $Q$.

Proof. Fix $s \in \mathcal{S}$ and $a \in \mathcal{A}^n \setminus \{\pi^*(s)\}$, and split the weighted sum according to $h_{\pi^*}(s,a')$, the number of agents whose action in $a'$ agrees with $\pi^*(s)$:
\[
(\mathcal{T}^{LVD}_{\mathcal{D}} Q)_{tot}(s,a)
= \sum_{a'\in\mathcal{A}^n} f(s,a,a')\,(\mathcal{T}Q)_{tot}(s,a')
= f(s,a,\pi^*(s))\,(\mathcal{T}Q)_{tot}(s,\pi^*(s))
+ \sum_{\substack{a'\in\mathcal{A}^n\\ h_{\pi^*}(s,a')=n-1}} f(s,a,a')\,(\mathcal{T}Q)_{tot}(s,a')
+ \sum_{\substack{a'\in\mathcal{A}^n\\ h_{\pi^*}(s,a')<n-1}} f(s,a,a')\,(\mathcal{T}Q)_{tot}(s,a') .
\]
For the first term, writing $h = h_{\pi^*}(s,a)$ for brevity,
\begin{align*}
f(s,a,\pi^*(s))\,(\mathcal{T}Q)_{tot}(s,\pi^*(s))
&= \Big(\tfrac{h}{1-\epsilon} - (n-1)\Big)(1-\epsilon)^{n}\,(\mathcal{T}Q)_{tot}(s,\pi^*(s)) \\
&= \big(h - (n-1)(1-\epsilon)\big)(1-\epsilon)^{n-1}\,(\mathcal{T}Q)_{tot}(s,\pi^*(s)) \\
&= \big(h - (n-1) + (n-1)\epsilon\big)(1-\epsilon)^{n-1}\,(\mathcal{T}Q)_{tot}(s,\pi^*(s)) \\
&\le \big(h-(n-1)\big)(1-\epsilon)^{n-1}\,(\mathcal{T}Q)_{tot}(s,\pi^*(s)) + \epsilon n|\mathcal{A}|\,\|(\mathcal{T}Q)_{tot}\|_\infty \\
&\le \big(h-(n-1)\big)(\mathcal{T}Q)_{tot}(s,\pi^*(s)) + \big|h-(n-1)\big|\,\big|(1-\epsilon)^{n-1}-1\big|\,\|(\mathcal{T}Q)_{tot}\|_\infty + \epsilon n|\mathcal{A}|\,\|(\mathcal{T}Q)_{tot}\|_\infty \\
&\le \big(h-(n-1)\big)(\mathcal{T}Q)_{tot}(s,\pi^*(s)) + 2n\Big|\sum_{\ell=1}^{n-1}\tbinom{n-1}{\ell}(-\epsilon)^{\ell}\Big|\,\|(\mathcal{T}Q)_{tot}\|_\infty + \epsilon n|\mathcal{A}|\,\|(\mathcal{T}Q)_{tot}\|_\infty \\
&\le \big(h-(n-1)\big)(\mathcal{T}Q)_{tot}(s,\pi^*(s)) + 2n\epsilon \sum_{\ell=1}^{n-1}\tbinom{n-1}{\ell}\,\|(\mathcal{T}Q)_{tot}\|_\infty + \epsilon n|\mathcal{A}|\,\|(\mathcal{T}Q)_{tot}\|_\infty \\
&\le \big(h-(n-1)\big)(\mathcal{T}Q)_{tot}(s,\pi^*(s)) + \epsilon n 2^{n}\,\|(\mathcal{T}Q)_{tot}\|_\infty + \epsilon n|\mathcal{A}|\,\|(\mathcal{T}Q)_{tot}\|_\infty .
\end{align*}
For the second term,
\begin{align*}
\sum_{h_{\pi^*}(s,a')=n-1} f(s,a,a')\,(\mathcal{T}Q)_{tot}(s,a')
&= \sum_{h_{\pi^*}(s,a')=n-1}\Big(\tfrac{h^{(1)}(s,a,a')}{1-\epsilon} + \tfrac{h^{(0)}(s,a,a')}{\epsilon} - (n-1)\Big)(1-\epsilon)^{n-1}\epsilon\,(\mathcal{T}Q)_{tot}(s,a') \\
&= \sum_{h_{\pi^*}(s,a')=n-1} h^{(0)}(s,a,a')(1-\epsilon)^{n-1}(\mathcal{T}Q)_{tot}(s,a')
+ \sum_{h_{\pi^*}(s,a')=n-1}\Big(\tfrac{h^{(1)}(s,a,a')}{1-\epsilon} - (n-1)\Big)(1-\epsilon)^{n-1}\epsilon\,(\mathcal{T}Q)_{tot}(s,a') \\
&\le \sum_{h_{\pi^*}(s,a')=n-1} h^{(0)}(s,a,a')\,(\mathcal{T}Q)_{tot}(s,a')
+ \epsilon n^2 |\mathcal{A}|\,2^{n-1}\,\|(\mathcal{T}Q)_{tot}\|_\infty + 2\epsilon n^2 |\mathcal{A}|\,\|(\mathcal{T}Q)_{tot}\|_\infty \\
&\le \sum_{h_{\pi^*}(s,a')=n-1} h^{(0)}(s,a,a')\,(\mathcal{T}Q)_{tot}(s,a') + \epsilon n^2 |\mathcal{A}|\,2^{n}\,\|(\mathcal{T}Q)_{tot}\|_\infty ,
\end{align*}
where the first inequality expands $(1-\epsilon)^{n-1}$ by the binomial theorem as in the first term and uses $h^{(0)} \le n$, $|h^{(1)}/(1-\epsilon) - (n-1)| \le 2n$ for $\epsilon \le 1/2$, and the fact that at most $n(|\mathcal{A}|-1) \le n|\mathcal{A}|$ joint actions satisfy $h_{\pi^*}(s,a') = n-1$.
For the third term, since $n - h_{\pi^*}(s,a') \ge 2$ there,
\begin{align*}
\sum_{h_{\pi^*}(s,a')<n-1} f(s,a,a')\,(\mathcal{T}Q)_{tot}(s,a')
&\le \sum_{h_{\pi^*}(s,a')<n-1} \Big(\tfrac{h^{(1)}}{1-\epsilon} + \tfrac{h^{(0)}}{\epsilon} + (n-1)\Big)(1-\epsilon)^{h_{\pi^*}(s,a')}\epsilon^{\,n-h_{\pi^*}(s,a')}\,\big|(\mathcal{T}Q)_{tot}(s,a')\big| \\
&\le \sum_{h_{\pi^*}(s,a')<n-1} n\Big(1 + \tfrac{1}{1-\epsilon} + \tfrac{1}{\epsilon}\Big)(1-\epsilon)^{h_{\pi^*}(s,a')}\epsilon^{\,n-h_{\pi^*}(s,a')}\,\|(\mathcal{T}Q)_{tot}\|_\infty \\
&\le \sum_{h_{\pi^*}(s,a')<n-1} n\Big(1 + \tfrac{2}{\epsilon}\Big)\epsilon^{\,n-h_{\pi^*}(s,a')}\,\|(\mathcal{T}Q)_{tot}\|_\infty
\le \sum_{h_{\pi^*}(s,a')<n-1} 3n\,\epsilon^{\,n-h_{\pi^*}(s,a')-1}\,\|(\mathcal{T}Q)_{tot}\|_\infty \\
&\le \sum_{h_{\pi^*}(s,a')<n-1} 3n\epsilon\,\|(\mathcal{T}Q)_{tot}\|_\infty
\le 3n\epsilon|\mathcal{A}|^{n}\,\|(\mathcal{T}Q)_{tot}\|_\infty .
\end{align*}
Combining the three terms and collecting all $\epsilon$-dependent bounds,
\[
(\mathcal{T}^{LVD}_{\mathcal{D}} Q)_{tot}(s,a)
\le \big(h-(n-1)\big)(\mathcal{T}Q)_{tot}(s,\pi^*(s))
+ \sum_{h_{\pi^*}(s,a')=n-1} h^{(0)}(s,a,a')\,(\mathcal{T}Q)_{tot}(s,a')
+ \epsilon n^2 |\mathcal{A}|^{n} 2^{n}\,\|(\mathcal{T}Q)_{tot}\|_\infty , \tag{103}
\]
in which
\begin{align*}
\sum_{h_{\pi^*}(s,a')=n-1} h^{(0)}(s,a,a')\,(\mathcal{T}Q)_{tot}(s,a')
&\le \Big(\sum_{h_{\pi^*}(s,a')=n-1} h^{(0)}(s,a,a')\Big)\max_{h_{\pi^*}(s,a')=n-1} (\mathcal{T}Q)_{tot}(s,a') \\
&= (n - h)\max_{h_{\pi^*}(s,a')=n-1} (\mathcal{T}Q)_{tot}(s,a')
\le (n - h)\max_{a'\in\mathcal{A}^n\setminus\{\pi^*(s)\}} (\mathcal{T}Q)_{tot}(s,a') \\
&\le (n - h)\big((\mathcal{T}Q)_{tot}(s,\pi^*(s)) - \mathcal{E}(\mathcal{T}Q)\big) .
\end{align*}
Thus, for all $s \in \mathcal{S}$ and $a \in \mathcal{A}^n \setminus \{\pi^*(s)\}$,
\[
(\mathcal{T}^{LVD}_{\mathcal{D}} Q)_{tot}(s,a)
\le \big(h-(n-1)\big)(\mathcal{T}Q)_{tot}(s,\pi^*(s)) + (n-h)\big((\mathcal{T}Q)_{tot}(s,\pi^*(s)) - \mathcal{E}(\mathcal{T}Q)\big) + \epsilon n^2 |\mathcal{A}|^{n} 2^{n}\,\|(\mathcal{T}Q)_{tot}\|_\infty
= (\mathcal{T}Q)_{tot}(s,\pi^*(s)) - (n-h)\,\mathcal{E}(\mathcal{T}Q) + \epsilon n^2 |\mathcal{A}|^{n} 2^{n}\,\|(\mathcal{T}Q)_{tot}\|_\infty . \tag{105}
\]
According to Lemma 7, $\mathcal{E}(\mathcal{T}Q) \ge \mathcal{E}(Q^*) - 2\gamma\|V_{tot}-V^*\|_\infty$. Since $1 \le n-h \le n$ and $\|(\mathcal{T}Q)_{tot}\|_\infty \le R_{\max}/(1-\gamma) + \gamma\|V_{tot}-V^*\|_\infty$, the assumed bound on $\epsilon$ yields, for all $s$ and $a \neq \pi^*(s)$,
\[
(\mathcal{T}^{LVD}_{\mathcal{D}} Q)_{tot}(s,a)
\le (\mathcal{T}Q)_{tot}(s,\pi^*(s)) - (n-h)\big(\mathcal{E}(Q^*) - 2\gamma\|V_{tot}-V^*\|_\infty\big) + \delta
\le (\mathcal{T}Q)_{tot}(s,\pi^*(s)) - \mathcal{E}(Q^*) + 2n\gamma\|V_{tot}-V^*\|_\infty + \delta .
\]

Lemma 9. Let $\bar{\mathcal{B}}$ denote the subspace of value functions
\[
\bar{\mathcal{B}} = \Big\{ Q \in \mathcal{Q}^{LVD} \;\Big|\; \mathcal{E}(Q) \ge 0,\; \|V_{tot}-V^*\|_\infty \le \tfrac{1}{8n\gamma}\mathcal{E}(Q^*) \Big\}.
\]
Given a dataset $\mathcal{D}$ generated by the optimal policy $\pi^*$ with $\epsilon$-greedy exploration, for any
\[
0 < \epsilon \le \frac{(1-\gamma)\,\mathcal{E}(Q^*)}{n^3 |\mathcal{A}|^n 2^{n+4}\big(R_{\max}/(1-\gamma) + \mathcal{E}(Q^*)/(8n)\big)}, \tag{108}
\]
we have $\mathcal{T}^{LVD}_{\mathcal{D}} Q \in \mathcal{B} \subset \bar{\mathcal{B}}$ for every $Q \in \bar{\mathcal{B}}$, where
\[
\mathcal{B} = \Big\{ Q \in \mathcal{Q}^{LVD} \;\Big|\; \mathcal{E}(Q) > 0,\; \|V_{tot}-V^*\|_\infty \le \tfrac{1}{8n\gamma}\mathcal{E}(Q^*) \Big\}. \tag{109}
\]
Proof. According to Lemma 5, under the condition
\[
0 < \epsilon \le \frac{\mathcal{E}(Q^*)/4}{n^2|\mathcal{A}|^n 2^{n+1}\big(R_{\max}/(1-\gamma) + \mathcal{E}(Q^*)/(8n)\big)} \le \frac{\mathcal{E}(Q^*)/4}{n^2|\mathcal{A}|^n 2^{n+1}\big(R_{\max} + \gamma\|V_{tot}\|_\infty\big)}, \tag{110}
\]
we have, for every $Q \in \bar{\mathcal{B}}$ and every $s \in \mathcal{S}$,
\[
\big|(\mathcal{T}^{LVD}_{\mathcal{D}} Q)_{tot}(s,\pi^*(s)) - (\mathcal{T}Q)_{tot}(s,\pi^*(s))\big| \le \tfrac{1}{4}\mathcal{E}(Q^*),
\]
which implies $(\mathcal{T}^{LVD}_{\mathcal{D}} Q)_{tot}(s,\pi^*(s)) \ge (\mathcal{T}Q)_{tot}(s,\pi^*(s)) - \tfrac{1}{4}\mathcal{E}(Q^*)$.
According to Lemma 8 (applied with $\delta = \mathcal{E}(Q^*)/4$), under the condition
\[
0 < \epsilon \le \frac{\mathcal{E}(Q^*)/4}{n^2|\mathcal{A}|^n 2^{n}\big(R_{\max}/(1-\gamma) + \mathcal{E}(Q^*)/(8n)\big)} \le \frac{\mathcal{E}(Q^*)/4}{n^2|\mathcal{A}|^n 2^{n}\big(R_{\max}/(1-\gamma) + \gamma\|V_{tot}-V^*\|_\infty\big)}, \tag{113}
\]
we have, for every $Q \in \bar{\mathcal{B}}$, every $s \in \mathcal{S}$, and every $a \in \mathcal{A}^n \setminus \{\pi^*(s)\}$,
\begin{align*}
(\mathcal{T}^{LVD}_{\mathcal{D}} Q)_{tot}(s,a)
&\le (\mathcal{T}Q)_{tot}(s,\pi^*(s)) - \mathcal{E}(Q^*) + 2n\gamma\|V_{tot}-V^*\|_\infty + \tfrac{1}{4}\mathcal{E}(Q^*) \\
&\le (\mathcal{T}Q)_{tot}(s,\pi^*(s)) - \mathcal{E}(Q^*) + \tfrac{1}{4}\mathcal{E}(Q^*) + \tfrac{1}{4}\mathcal{E}(Q^*)
= (\mathcal{T}Q)_{tot}(s,\pi^*(s)) - \tfrac{1}{2}\mathcal{E}(Q^*) \\
&< (\mathcal{T}^{LVD}_{\mathcal{D}} Q)_{tot}(s,\pi^*(s)),
\end{align*}
which implies $\mathcal{E}(\mathcal{T}^{LVD}_{\mathcal{D}} Q) > 0$.
According to Lemma 6, under the condition
\[
0 < \epsilon \le \frac{(1-\gamma)\,\mathcal{E}(Q^*)}{\gamma n^3|\mathcal{A}|^n 2^{n+4}\big(R_{\max}/(1-\gamma) + \mathcal{E}(Q^*)/(8n)\big)} \le \frac{(1-\gamma)\,\mathcal{E}(Q^*)}{\gamma n^3|\mathcal{A}|^n 2^{n+4}\big(R_{\max}/(1-\gamma) + \gamma\|V_{tot}-V^*\|_\infty\big)}, \tag{115}
\]
we have, for every $Q \in \bar{\mathcal{B}}$ and every $s \in \mathcal{S}$,
\[
\big|(\mathcal{T}^{LVD}_{\mathcal{D}} V)(s) - V^*(s)\big| = \big|(\mathcal{T}^{LVD}_{\mathcal{D}} Q)_{tot}(s,\pi^*(s)) - V^*(s)\big| \le \gamma\|V_{tot}-V^*\|_\infty + \frac{1-\gamma}{8n\gamma}\mathcal{E}(Q^*) \le \frac{1}{8n\gamma}\mathcal{E}(Q^*). \tag{116}
\]
Combining Eqs. (110), (113), and (115), the overall condition is
\[
0 < \epsilon \le \frac{(1-\gamma)\,\mathcal{E}(Q^*)}{n^3|\mathcal{A}|^n 2^{n+4}\big(R_{\max}/(1-\gamma) + \mathcal{E}(Q^*)/(8n)\big)} .
\]

Lemma 3. There exists a threshold $\delta > 0$ such that the on-policy Bellman operator $\mathcal{T}^{LVD}$ is closed in the following subspace $\mathcal{B}' \subset \mathcal{Q}^{LVD}$ when the exploration hyper-parameter $\epsilon$ is sufficiently small:
\[
\mathcal{B}' = \Big\{ Q \in \mathcal{Q}^{LVD} \;\Big|\; \pi_Q = \pi^*,\; \max_{s\in\mathcal{S}} \big|Q_{tot}(s,\pi^*(s)) - V^*(s)\big| \le \delta \Big\}.
\]
Formally, $\exists \delta > 0$, $\exists \epsilon_0 > 0$ such that, for any $0 < \epsilon \le \epsilon_0$ and any $Q \in \mathcal{B}'$, we must have $\mathcal{T}^{LVD} Q \in \mathcal{B}'$.
Proof. It is implied by Lemma 9.

Theorem 4 (Formal version of Theorem 2). Under the closure property of Lemma 3, Algorithm 1 has a fixed-point value function expressing the optimal policy when the exploration hyper-parameter $\epsilon$ is sufficiently small.
Proof. Notice that the state value function is sufficient to determine the target values, so the subspace $\bar{\mathcal{B}}$ defined in Lemma 9 is a compact and convex set in terms of $V_{tot}$. The operator $\mathcal{T}^{LVD}_{\mathcal{D}}$ is a continuous mapping because it only involves elementary functions. According to Brouwer's fixed-point theorem (Brouwer, 1911), there exists $Q \in \bar{\mathcal{B}}$ satisfying $\mathcal{T}^{LVD}_{\mathcal{D}} Q = Q$. In addition, since Lemma 9 shows that $\mathcal{T}^{LVD}_{\mathcal{D}}$ maps $\bar{\mathcal{B}}$ into $\mathcal{B}$, the fixed point cannot lie on the boundary with $\mathcal{E}(Q) = 0$; by the definition stated in Eq. (109), it therefore represents the unique optimal policy.
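The fixed-point argument above can be illustrated numerically. The following is a toy sketch of our own (a single-state game with two agents, not the construction used in the proof): iterating a Bellman backup followed by the weighted additive least-squares projection converges to a fixed point whose greedy joint action is optimal when the data distribution is ε-greedy around π* with small ε.

```python
import numpy as np

# Toy illustration of the closure/fixed-point phenomenon (our own example):
# one state, two agents, two actions each, reward 1 only for joint action (0, 0).
R = np.array([[1.0, 0.0], [0.0, 0.0]])
gamma = 0.9
eps = 0.05                      # epsilon-greedy around the optimal action (0, 0)
p = np.array([1 - eps, eps])    # per-agent action distribution
w = np.outer(p, p).reshape(-1)  # factorizable data distribution (Assumption 1)

# Design matrix of linear value decomposition: Q_tot(a1, a2) = Q_1(a1) + Q_2(a2).
A = np.zeros((4, 4))
for a1 in range(2):
    for a2 in range(2):
        A[2 * a1 + a2, a1] = 1.0       # one-hot block for Q_1
        A[2 * a1 + a2, 2 + a2] = 1.0   # one-hot block for Q_2

def lvd_operator(q_tot):
    """One FQI-LVD iteration: Bellman backup, then weighted additive projection."""
    y = R.reshape(-1) + gamma * q_tot.max()          # regression targets
    sw = np.sqrt(w)
    x, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return A @ x                                      # projected joint values

q = np.zeros(4)
for _ in range(500):
    q = lvd_operator(q)

print(np.argmax(q))  # index 0, i.e., the greedy joint action is the optimal (0, 0)
```

With this near-on-policy weighting the iteration converges geometrically (the single-state projection only shifts by the scalar backup), and the fixed point keeps a strictly positive action gap, mirroring the role of E(Q) > 0 in Lemma 9.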

F EXPERIMENT SETTINGS AND IMPLEMENTATION DETAILS

F.1 IMPLEMENTATION DETAILS

We adopt the PyMARL (Samvelyan et al., 2019) implementation with default hyper-parameters to investigate state-of-the-art multi-agent Q-learning algorithms: VDN (Sunehag et al., 2018), QMIX (Rashid et al., 2018), QTRAN (Son et al., 2019), and QPLEX (Wang et al., 2020a). The training time depends on the number of agents and the episode length limit of each map. The performance measure for the StarCraft II tasks is the percentage of episodes in which the RL agents defeat all enemy units within the time limit, called the test win rate. The datasets providing off-policy exploration are constructed by training a behavior policy with VDN and collecting 20k, 30k, or 50k of its experienced episodes. The dataset configurations are shown in Table 2. We investigate five multi-agent Q-learning algorithms over 6 random seeds, i.e., 3 different datasets with two seeds evaluated on each dataset. We train for 300 epochs to evaluate learning performance with a given static dataset; each update trains on a batch of 32 episodes, and each epoch trains on 160k transitions in total. Moreover, the training process of the behavior policy is the same as that discussed in PyMARL (Samvelyan et al., 2019): it collects a total of 2 million timesteps of data and anneals the ε of the ε-greedy exploration strategy linearly from 1.0 to 0.05 over 50k timesteps. The target network is updated periodically after training every 200 episodes. We call this period of 200 episodes an Iteration, which corresponds to an iteration of FQI-LVD (see Definition 1).
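The bookkeeping between gradient updates, trained episodes, and FQI-LVD Iterations can be sketched as follows (constants taken from the text above; the function name is illustrative, not PyMARL's actual API):

```python
# Each update trains on a batch of 32 episodes; the target network is refreshed
# every 200 trained episodes, which the text calls one Iteration (one FQI-LVD step).
EPISODES_PER_UPDATE = 32
TARGET_UPDATE_PERIOD = 200  # episodes per Iteration

def count_iterations(num_updates):
    """Count completed FQI-LVD Iterations after `num_updates` gradient updates."""
    episodes_seen = 0
    iterations = 0
    next_sync = TARGET_UPDATE_PERIOD
    for _ in range(num_updates):
        episodes_seen += EPISODES_PER_UPDATE
        while episodes_seen >= next_sync:   # periodic target-network sync
            iterations += 1
            next_sync += TARGET_UPDATE_PERIOD
    return iterations

print(count_iterations(100))  # 100 updates = 3200 episodes -> 16 Iterations
```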

F.2 TWO-STATE MMDP

In the two-state MMDP shown in Figure 1a, due to the GRU-based implementation of the finite-horizon paradigm in the above five deep multi-agent Q-learning algorithms, we let the two agents start from state s_2 and execute 100 environmental steps with a uniform ε-greedy exploration strategy (i.e., ε = 1). We use this long-horizon pattern and uniform ε-greedy exploration to approximate an infinite-horizon MMDP with a uniform data distribution. We adopt γ = 0.99 to implement FQI-LVD and the deep MARL algorithms. In the FQI-LVD framework, V_max = 1/(1 - γ) = 100, as shown in Figure 1b. In Figure 1c, the Optimal line is approximately Σ_{i=0}^{99} γ^i ≈ 63.4, the return of one episode of 100 timesteps.
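The two quoted constants follow directly from the discount factor. A quick arithmetic check, assuming γ = 0.99 (the value consistent with V_max = 100 and the Optimal line of 63.4):

```python
# V_max is the geometric-series bound 1/(1 - gamma); the optimal return of a
# 100-step episode is the truncated series sum_{i=0}^{99} gamma^i.
gamma = 0.99
v_max = 1.0 / (1.0 - gamma)
optimal_return = sum(gamma ** i for i in range(100))
print(round(v_max), round(optimal_return, 1))  # 100 63.4
```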

F.3 STARCRAFT II

StarCraft II unit micromanagement tasks consider a combat game between two groups of units, where the built-in StarCraft II AI controls the enemy units and MARL algorithms control each ally unit. The two groups can contain different types of soldiers, but the soldiers within a group must belong to the same race. The action space of each agent includes no-op, move[direction], attack[enemy id], and stop. At each timestep, agents choose to move or attack on a continuous map. MARL agents receive a global reward equal to the amount of damage dealt to enemy units; in addition, killing one enemy unit and winning the combat bring bonuses of 10 and 200, respectively. The maps of the SMAC challenges used in this paper are introduced in Table 3.

As shown in Figure 4, increasing the number of parameters can benefit VDN on easy maps such as 2s3z and 2s_vs_1sc, but it cannot provide a fundamental improvement on harder tasks. As shown in Figure 5, the effects of increasing parameters are rather weak for QMIX. These experiments demonstrate that increasing the number of parameters cannot address the limitations of VDN and QMIX in representational capacity.

Remark. Assumption 1 on the factorizable dataset does not require the factorizability of the underlying transition and reward functions or the decomposability of the joint action-value function.
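The SMAC reward structure described in F.3 (damage dealt, plus bonuses of 10 per kill and 200 for winning) can be sketched as follows (an illustrative helper of our own, not the SMAC implementation, which may additionally scale rewards):

```python
# Shaped global reward: damage dealt to enemies, plus a bonus per enemy killed
# and a terminal bonus for winning the combat.
KILL_BONUS = 10.0
WIN_BONUS = 200.0

def global_reward(damage_dealt, enemies_killed, won):
    return damage_dealt + KILL_BONUS * enemies_killed + (WIN_BONUS if won else 0.0)

print(global_reward(5.0, 2, True))  # 225.0
```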

I ADDITIONAL EXPERIMENTS ON MMDP EXAMPLE

On the contrary, all our theorems and examples focus on situations where the joint Q-function cannot be perfectly factorized. In addition, Assumption 1 is naturally satisfied when the dataset is collected by decentralized execution of the agents' policies, e.g., an on-policy dataset collected using ε-greedy exploration policies or an offline dataset collected by given decentralized policies. All algorithms discussed in this paper, including VDN, QMIX, QTRAN, and QPLEX, learn decentralized policies, which are executed in a decentralized manner, so the theoretical implications derived in this paper are applicable whenever such factorizable data-collection procedures are carried out. To investigate the dependency of our theoretical implications on Assumption 1, we provide an experiment that evaluates the performance of deep multi-agent Q-learning algorithms on unfactorizable datasets. As shown in Figure 6, the choice of the parameter η has no impact on the performance of QPLEX and QTRAN, which matches the fact that Theorem 3 does not rely on the assumption of a factorizable dataset. Extending Proposition 2, VDN and QMIX empirically suffer from unbounded divergence when the dataset is not factorizable. The only exception is the case of η = 1, in which the dataset contains only two kinds of joint actions. In this case, the given MMDP degenerates to a single-agent MDP because the agents only perform identical actions in the dataset. As a result, VDN and QMIX do not diverge in this special situation.
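The factorizability of the η-mixture dataset can be checked directly: the joint-action distribution satisfies Assumption 1 exactly when its 2x2 probability matrix has rank one, i.e., is an outer product of per-agent marginals (a small sketch with our own helper name):

```python
import numpy as np

# Joint-action distribution of the eta-mixture dataset: a weighted combination
# of a uniform distribution (factorizable) and a diagonal one (not factorizable).
def joint_dist(eta):
    return np.array([[0.5 * eta + 0.25 * (1 - eta), 0.25 * (1 - eta)],
                     [0.25 * (1 - eta), 0.5 * eta + 0.25 * (1 - eta)]])

for eta in (0.0, 0.5, 1.0):
    rank = np.linalg.matrix_rank(joint_dist(eta))
    print(eta, rank)  # rank 1 (factorizable) only for eta = 0
```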



Figure 1: (a) An MMDP where FQI-LVD will diverge to infinity when γ ∈ (4/5, 1). r is a shorthand for r(s, a), and the action space of each agent is A ≡ {A^(1), ..., A^(|A|)}. (b) The learning curves of ‖Q_tot‖_∞ of on-policy FQI-LVD on the given MMDP, where the dataset is generated with different choices of the ε-greedy hyper-parameter. (c) The learning curves of ‖Q_tot‖_∞ while running several deep multi-agent Q-learning algorithms.

Figure 2: (a,e,i) Constructing datasets using online data collection of VDN. (b-d,f-h,j-l) Evaluating the performance of deep multi-agent Q-learning algorithms with a given static dataset on nine maps.

y^(t)(s, a) denotes the ground-truth target value derived by the Bellman optimality operator. p_D(a, s'|s) = p_D(a|s)P(s'|s, a) denotes the empirical probability of the event that the agents execute joint action a in state s and transit to s'. Q_tot and [Q_i]_{i=1}^n refer to the discussion of CTDE defined in Section 3.3. Lemma 1. The empirical Bellman operator T^LVD_D defined in Definition 1 is equivalent to

Bellman error minimization; according to Lemma 1, b ∈ R^{m^n} denotes the regression target y^(t)(s, ·) derived by the Bellman optimality operator; p ∈ R^{m^n} denotes the empirical probability p_D(a|s) of joint action a executed in state s, which can be factorized into the product of individual components as illustrated in Assumption 1. Besides, A is the m-ary encoding matrix, namely ∀i ∈ [m^n], j ∈ [mn]
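As a concrete sketch of this encoding (with assumed sizes n = 2 agents and m = 3 actions; illustrative, not the paper's code), the m-ary encoding matrix A maps the stacked individual values x = [Q_1; Q_2] of length mn to the additive joint values of length m^n:

```python
import numpy as np

# Row index enumerates joint actions (a1, a2) in base m; each row has exactly
# one 1 in agent 1's block and one 1 in agent 2's block, so
# (A @ x)[m*a1 + a2] = Q_1(a1) + Q_2(a2).
n, m = 2, 3
A = np.zeros((m ** n, m * n))
for a1 in range(m):
    for a2 in range(m):
        A[m * a1 + a2, a1] = 1.0        # agent 1's one-hot block
        A[m * a1 + a2, m + a2] = 1.0    # agent 2's one-hot block

q1 = np.array([1.0, 2.0, 3.0])
q2 = np.array([10.0, 20.0, 30.0])
x = np.concatenate([q1, q2])
print((A @ x).reshape(m, m))  # entry (a1, a2) equals q1[a1] + q2[a2]
```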

The empirical Bellman operator T^LVD_D is not a γ-contraction, i.e., the following important property of the standard Bellman optimality operator T no longer holds for T^LVD_D.
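To make the failure of the contraction argument concrete, here is a small numeric illustration (our own toy example with an assumed factorizable data distribution, not taken from the paper's proof): the weighted additive projection inside T^LVD_D can expand the sup-norm of a target difference, so the standard γ-contraction proof cannot go through.

```python
import numpy as np

# Two agents, two actions; non-uniform but factorizable data weights.
p = np.array([0.1, 0.9])            # per-agent action distribution
w = np.outer(p, p).reshape(-1)
# Difference between two regression targets, with sup-norm exactly 1.
d = np.array([[1.0, -1.0], [-1.0, 1.0]]).reshape(-1)

# Additive (linear value decomposition) design: Q_tot(a1, a2) = Q_1(a1) + Q_2(a2).
A = np.zeros((4, 4))
for a1 in range(2):
    for a2 in range(2):
        A[2 * a1 + a2, a1] = 1.0
        A[2 * a1 + a2, 2 + a2] = 1.0

# Weighted least-squares projection of the target difference onto the additive class.
sw = np.sqrt(w)
x, *_ = np.linalg.lstsq(A * sw[:, None], d * sw, rcond=None)
proj = A @ x
print(np.abs(d).max(), np.abs(proj).max())  # sup-norm expands (1.0 -> about 2.24)
```

Since the projection step alone can enlarge the sup-norm, composing it with the γ-discounted backup need not shrink distances by γ.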

C OMITTED ALGORITHM BOX, THEOREM, AND DEFINITION IN SECTION 5.2

C.1 LOCAL CONVERGENCE IMPROVEMENT

Algorithm 1 On-Policy Fitted Q-Iteration with ε-greedy Exploration
1: Initialize Q^(0).
2: for t = 0, ..., T - 1 do   ▷ T denotes the computation budget
3:
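A minimal sketch of the decentralized ε-greedy execution step used by Algorithm 1 (the helper name and signature are our own, not the paper's code): each agent i greedily follows its local Q_i with probability 1 - ε and otherwise explores uniformly.

```python
import numpy as np

def epsilon_greedy_joint_action(q_locals, eps, rng):
    """q_locals: list of per-agent action-value arrays; returns a joint action."""
    joint = []
    for q_i in q_locals:
        if rng.random() < eps:
            joint.append(int(rng.integers(len(q_i))))  # explore uniformly
        else:
            joint.append(int(np.argmax(q_i)))          # exploit local greedy action
    return tuple(joint)

rng = np.random.default_rng(0)
print(epsilon_greedy_joint_action([np.array([0.1, 0.9]), np.array([0.7, 0.3])], 0.0, rng))
# with eps = 0 this is the decentralized greedy joint action: (1, 0)
```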

Figure 5: Evaluating the performance of Large-QMIX with a given static dataset.

Figure 6: The learning curves of ‖Q_tot‖_∞ while running several deep multi-agent Q-learning algorithms with an unfactorizable dataset.

Figure 6 presents the learning curves of VDN, QMIX, QPLEX, and QTRAN in the example MMDP shown in Figure 1a with an unfactorizable dataset D constructed by a parameter η as follows: ∀s ∈ S,
\[
p_{\mathcal{D}}\big(a_1, a_2 \mid s\big) = \begin{pmatrix} 0.5\eta + 0.25(1-\eta) & 0.25(1-\eta) \\ 0.25(1-\eta) & 0.5\eta + 0.25(1-\eta) \end{pmatrix},
\]
where rows and columns are indexed by the two actions A^(1), A^(2) of each agent.

(a) Payoff matrix of the one-step game; boldface marks the optimal joint action selection in the payoff matrix. (b,c) Joint action-value functions Q_tot of FQI-LVD and VDN; boldface marks the greedy joint action selection from Q_tot.

Tonghan Wang, Heng Dong, Victor Lesser, and Chongjie Zhang. Multi-agent reinforcement learning with emergent roles. In International Conference on Machine Learning, 2020b.
Tonghan Wang, Jianhao Wang, Chongyi Zheng, and Chongjie Zhang. Learning nearly decomposable value functions via communication minimization. In International Conference on Learning Representations, 2020c.
Yihan Wang, Beining Han, Tonghan Wang, Heng Dong, and Chongjie Zhang. Off-policy multi-agent decomposed policy gradients. arXiv preprint arXiv:2007.12322, 2020d.
R Paul Wiegand. An analysis of cooperative coevolutionary algorithms. PhD thesis, Citeseer, 2003.
David H Wolpert and Kagan Tumer. Optimal payoff functions for members of collectives. In Modeling Complexity in Economic and Social Systems, pp. 355-369. World Scientific, 2002.
Tianhe Yu, Garrett Thomas, Lantao Yu, Stefano Ermon, James Zou, Sergey Levine, Chelsea Finn, and Tengyu Ma. MOPO: Model-based offline policy optimization. In Advances in Neural Information Processing Systems, 2020.
Chongjie Zhang and Victor Lesser. Multi-agent learning with policy prediction. In Twenty-Fourth AAAI Conference on Artificial Intelligence, 2010.
Chongjie Zhang and Victor Lesser. Coordinated multi-agent reinforcement learning in networked distributed POMDPs. In Twenty-Fifth AAAI Conference on Artificial Intelligence, 2011.

The training time of these algorithms on an NVIDIA RTX 2080 Ti GPU is about 4 to 12 hours.

Table 2: The dataset configurations of the offline data collection setting.


G.2 DEFERRED TABLES IN SECTION 6.1

The deferred tables report the ground-truth payoff matrix of the one-step matrix game,
\[
\begin{pmatrix} 8 & -12 & -12 \\ -12 & 0 & 0 \\ -12 & 0 & 0 \end{pmatrix},
\]
together with the joint action-value functions Q_tot learned by each of the compared algorithms.

To address the concern that QPLEX naturally uses more hidden parameters than VDN and QMIX, which may also improve its representational capacity, and to demonstrate that the performance gap between QPLEX and the other methods does not come from a difference in the number of parameters, we increase the number of neurons in VDN and QMIX so that the resulting Large-VDN and Large-QMIX have a comparable number of parameters to QPLEX. In addition to the ablation study in the matrix game, Figure 4 and Figure 5 present the corresponding ablation studies in StarCraft II benchmark tasks with offline data collection. As shown in Figure 4, increasing parameters can benefit VDN only on some easy maps, not fundamentally on harder tasks.
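For reference, the one-step additive least-squares projection that linear value decomposition performs on this payoff matrix can be computed directly (a sketch assuming a uniform data distribution; tables learned by the deep algorithms differ slightly because of sampling and exploration noise):

```python
import numpy as np

# Payoff 8 for the joint action (0, 0), -12 for unilateral deviations, 0 elsewhere.
R = np.array([[8.0, -12.0, -12.0],
              [-12.0, 0.0, 0.0],
              [-12.0, 0.0, 0.0]])
m = 3
# Additive design: Q_tot(a1, a2) = Q_1(a1) + Q_2(a2).
A = np.zeros((m * m, 2 * m))
for a1 in range(m):
    for a2 in range(m):
        A[m * a1 + a2, a1] = 1.0
        A[m * a1 + a2, m + a2] = 1.0

# Unweighted least-squares projection of the payoff onto the additive class.
x, *_ = np.linalg.lstsq(A, R.reshape(-1), rcond=None)
fit = (A @ x).reshape(m, m)
print(np.round(fit, 2))
# The fitted table under-values (0, 0) (about -6.22), so its greedy joint
# action is not the optimal one.
```

This reproduces the qualitative failure discussed in Section 6.1: under a uniform data distribution, the best additive fit ranks the optimal joint action below the suboptimal ones.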

