TOWARDS UNDERSTANDING LINEAR VALUE DECOMPOSITION IN COOPERATIVE MULTI-AGENT Q-LEARNING

Abstract

Value decomposition is a popular and promising approach to scaling up multi-agent reinforcement learning in cooperative settings. However, the theoretical understanding of such methods is limited. In this paper, we introduce a variant of the fitted Q-iteration framework for analyzing multi-agent Q-learning with value decomposition. Based on this framework, we derive a closed-form solution to the empirical Bellman error minimization with linear value decomposition. With this novel solution, we further reveal two interesting insights: (i) linear value decomposition implicitly implements a classical multi-agent credit assignment called counterfactual difference rewards; and (ii) on-policy data distribution or richer Q function classes can improve the training stability of multi-agent Q-learning. In our empirical study, experiments on didactic examples and a broad set of StarCraft II unit micromanagement tasks demonstrate that our theoretical closed-form formulation and its implications are realized in practice.

1. INTRODUCTION

Cooperative multi-agent reinforcement learning (MARL) holds great promise for addressing coordination problems in a variety of applications, such as robotic systems (Hüttenrauch et al., 2017), autonomous cars (Cao et al., 2012), and sensor networks (Zhang & Lesser, 2011). Such complex tasks often require MARL to learn decentralized policies that jointly optimize a global cumulative reward signal, and they pose a number of challenges, including multi-agent credit assignment (Wolpert & Tumer, 2002; Nguyen et al., 2018), non-stationarity (Zhang & Lesser, 2010; Song et al., 2019), and scalability (Zhang & Lesser, 2011; Panait & Luke, 2005). Recently, by leveraging the strength of deep learning techniques, cooperative MARL has made substantial progress (Sunehag et al., 2018; Baker et al., 2020; Wang et al., 2020b;a), particularly in value-based methods that demonstrate state-of-the-art performance on challenging tasks such as StarCraft unit micromanagement (Samvelyan et al., 2019). Sunehag et al. (2018) proposed a popular approach called the value-decomposition network (VDN), based on the paradigm of centralized training with decentralized execution (CTDE; Foerster et al., 2016). VDN learns a centralized but factorizable joint value function Q_tot, represented as the summation of individual value functions Q_i. During execution, a decentralized policy can be easily derived for each agent i by greedily selecting actions with respect to its local value function Q_i. This decomposition structure realizes an implicit multi-agent credit assignment, because each Q_i is learned by backpropagating the total temporal-difference error on the single global reward signal, rather than a local reward signal specific to agent i.
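The additive structure can be made concrete with a small tabular sketch; the sizes and random value tables below are illustrative assumptions, not VDN's actual neural-network implementation:

```python
from itertools import product

import numpy as np

# Toy, tabular sketch of VDN-style linear value decomposition:
# each agent i keeps its own utility Q_i, and the joint value is the sum
# Q_tot(s, a) = sum_i Q_i(s, a_i).
n_agents, n_states, n_actions = 2, 3, 2
rng = np.random.default_rng(0)
Q = rng.standard_normal((n_agents, n_states, n_actions))  # Q[i, s, a_i]

def q_tot(s, joint_action):
    """Joint value under linear decomposition."""
    return sum(Q[i, s, a_i] for i, a_i in enumerate(joint_action))

def decentralized_greedy(s):
    """Each agent maximizes its own Q_i independently; because Q_tot is a
    sum of per-agent terms, this joint action also maximizes Q_tot."""
    return tuple(int(np.argmax(Q[i, s])) for i in range(n_agents))

s = 0
a_star = decentralized_greedy(s)
# Sanity check against brute-force maximization over the joint action space.
a_brute = max(product(range(n_actions), repeat=n_agents),
              key=lambda a: q_tot(s, a))
assert abs(q_tot(s, a_star) - q_tot(s, a_brute)) < 1e-12
```

The check at the end illustrates why linear decomposition scales: decentralized per-agent argmax recovers the joint greedy action without enumerating the exponentially large joint action space.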
This decomposition technique significantly improves the scalability of multi-agent Q-learning algorithms and has fostered a series of subsequent works, including QMIX (Rashid et al., 2018), QTRAN (Son et al., 2019), and QPLEX (Wang et al., 2020a). In spite of this empirical success across a broad class of tasks, multi-agent Q-learning with linear value decomposition has not been theoretically well understood. Because of its limited representational complexity, the standard Bellman update is not a closed operator in the joint action-value function class with linear value decomposition. The approximation error induced by this incompleteness is known as the inherent Bellman error (Munos & Szepesvári, 2008), which can drive Q-learning toward unexpected behavior. To develop a deeper understanding of learning with value decomposition, this paper introduces a multi-agent variant of the popular Fitted Q-Iteration (FQI; Ernst et al., 2005; Levine et al., 2020) framework and derives a closed-form solution to its empirical Bellman error minimization. To the best of our knowledge, this is the first theoretical analysis that characterizes the underlying mechanism of linear value decomposition in cooperative multi-agent Q-learning; it can serve as a toolkit for building follow-up theory and exploring this popular value decomposition structure from different perspectives. Using this novel closed-form solution, this paper formally reveals two interesting insights: 1) Learning linear value decomposition implicitly implements a classical multi-agent credit assignment method called counterfactual difference rewards (Wolpert & Tumer, 2002), which draws a connection with COMA (Foerster et al., 2018), a multi-agent policy-gradient method. 2) Multi-agent Q-learning with linear value decomposition potentially suffers from the risk of unbounded divergence from arbitrary initialization.
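For reference, the classical difference reward credits agent i with the change in the global reward caused by replacing its action with a default action (Wolpert & Tumer, 2002). The generic form below is included for context; it is the standard definition, not the closed-form solution derived in this paper:

```latex
% Counterfactual difference reward for agent i: the change in the global
% reward when agent i's action a_i is replaced by a default action c_i,
% while the other agents' actions a_{-i} are held fixed.
D_i(s, \mathbf{a}) \;=\; r(s, \mathbf{a}) \;-\; r\big(s, (\mathbf{a}_{-i}, c_i)\big)
```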
On-policy data distribution or richer Q function classes can provide local or global convergence guarantees for multi-agent Q-learning, respectively. 

2. RELATED WORK

Deep Q-learning algorithms that use neural networks as function approximators have shown great promise in solving complicated decision-making problems (Mnih et al., 2015). One of the core components of such methods is iterative Bellman error minimization, which can be modelled by a classical framework called Fitted Q-Iteration (FQI; Ernst et al., 2005). FQI utilizes a specific Q function class to iteratively minimize the empirical Bellman error on a dataset D. Great efforts have been made towards theoretically characterizing the behavior of FQI with finite samples and imperfect function classes (Munos & Szepesvári, 2008; Farahmand et al., 2010; Chen & Jiang, 2019). From an empirical perspective, there is also a growing trend of adopting FQI for the analysis of deep offline Q-learning algorithms (Fu et al., 2019; Levine et al., 2020). In MARL, the joint Q function class grows exponentially with the number of agents, leading many algorithms (Sunehag et al., 2018; Rashid et al., 2018) to utilize value decomposition structures with limited expressiveness to improve scalability. In this paper, we extend FQI to a multi-agent variant as our grounding theoretical framework for analyzing cooperative multi-agent Q-learning with linear value decomposition.

To achieve superior effectiveness and scalability in multi-agent settings, centralized training with decentralized execution (CTDE) has become a popular MARL paradigm (Oliehoek et al., 2008; Kraemer & Banerjee, 2016). The Individual-Global-Max (IGM) principle (Son et al., 2019) is a critical concept for value-based CTDE (Mahajan et al., 2019), which ensures consistency between joint and local greedy action selections and enables effective performance in both the training and execution phases. VDN (Sunehag et al., 2018) utilizes linear value decomposition to satisfy a sufficient condition of IGM. The simple additive structure of VDN has achieved excellent scalability and inspired many follow-up methods.
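As a rough illustration of the FQI loop described above, the tabular sketch below repeatedly regresses Q onto empirical Bellman targets computed from a fixed dataset D. The environment sizes, random data, and tabular function class are assumptions for illustration only, not the paper's multi-agent setting:

```python
import numpy as np

# Minimal tabular Fitted Q-Iteration (FQI) over a fixed dataset D of
# (s, a, r, s') transitions with rewards in [0, 1).
n_states, n_actions, gamma = 4, 2, 0.9
rng = np.random.default_rng(1)
D = [(int(rng.integers(n_states)), int(rng.integers(n_actions)),
      float(rng.random()), int(rng.integers(n_states))) for _ in range(200)]

Q = np.zeros((n_states, n_actions))
for _ in range(50):  # iterate Bellman-target regression toward a fixed point
    targets = {}
    for s, a, r, s2 in D:
        # Bellman target uses the previous iterate's greedy value at s'.
        targets.setdefault((s, a), []).append(r + gamma * Q[s2].max())
    Q_new = Q.copy()
    for (s, a), ys in targets.items():
        Q_new[s, a] = np.mean(ys)  # least-squares fit within the tabular class
    Q = Q_new
```

Because the tabular class is complete (zero inherent Bellman error), this loop is stable; the paper's analysis concerns what changes when the class is restricted to linear decompositions.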
QMIX (Rashid et al., 2018) proposes a monotonic Q network structure to improve the expressiveness of the factorized function class. QTRAN (Son et al., 2019) attempts to realize the entire IGM function class, but its method is computationally intractable and requires two extra soft regularizations to approximate IGM (which actually loses the IGM guarantee). QPLEX (Wang et al., 2020a) encodes the IGM principle into the Q network architecture and realizes a complete IGM function class, but it may also have potential limitations in scalability. Owing to the advantages of VDN's simplicity and scalability, linear value decomposition has become very popular in MARL (Son et al., 2019; Wang et al., 2020a;d). This paper focuses on the theoretical and empirical understanding of multi-agent Q-learning with linear value decomposition to explore its underlying implications.
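A minimal sketch of the monotonic mixing idea behind QMIX: a single linear mixing layer with non-negative weights stands in for the hypernetwork-generated mixing network, so the joint value is monotonically increasing in each agent's utility and the greedy-consistency (IGM) property is preserved. The weights and sizes below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
n_agents = 3
# Non-negativity enforced via abs, analogous to QMIX constraining
# its mixing-network weights to be non-negative.
W = np.abs(rng.standard_normal(n_agents))
b = rng.standard_normal()

def mix(q_locals):
    """Joint value as a monotonic mixing of per-agent utilities:
    dQ_tot/dQ_i = W[i] >= 0 for every agent i."""
    return float(W @ np.asarray(q_locals) + b)

q = [0.1, -0.4, 0.7]
q_up = [0.1, -0.4, 0.9]    # raise one agent's local utility
assert mix(q_up) >= mix(q)  # joint value can only increase
```

Monotonicity is a strictly richer condition than VDN's additivity (which fixes every W[i] to 1), yet still guarantees that per-agent greedy actions maximize the joint value.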

3.1. MULTI-AGENT MARKOV DECISION PROCESS (MMDP)

To support theoretical analysis of multi-agent Q-learning, we adopt the framework of MMDP (Boutilier, 1996), a special case of Dec-POMDP (Oliehoek et al., 2016), to model fully cooperative multi-agent tasks.
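For context, an MMDP is commonly written as a tuple in which all agents observe the full state and share a single reward; the notation below is the standard formulation and may differ from the paper's exact symbols:

```latex
% Multi-agent MDP: n agents, state space S, per-agent action spaces A_i
% (joint action space A = A_1 x ... x A_n), transition kernel P, one shared
% reward function r, and discount factor \gamma.
\mathcal{M} = \langle \mathcal{S}, \{\mathcal{A}_i\}_{i=1}^{n}, P, r, \gamma \rangle,
\qquad
P(s' \mid s, \mathbf{a}), \quad r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}
```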



, we set up an extensive set of experiments to demonstrate the realizability of our theoretical implications. Besides the FQI framework, we also consider deep-learning-based implementations of different multi-agent value decomposition structures. Through didactic examples and the StarCraft II benchmark, we design several experiments to illustrate that our closed-form formulation is consistent with the empirical results, and that online data distribution and richer Q function classes can significantly alleviate the limitations of VDN in offline training (Levine et al., 2020).

