TOWARDS UNDERSTANDING LINEAR VALUE DECOMPOSITION IN COOPERATIVE MULTI-AGENT Q-LEARNING

Abstract

Value decomposition is a popular and promising approach to scaling up multi-agent reinforcement learning in cooperative settings. However, the theoretical understanding of such methods is limited. In this paper, we introduce a variant of the fitted Q-iteration framework for analyzing multi-agent Q-learning with value decomposition. Based on this framework, we derive a closed-form solution to the empirical Bellman error minimization with linear value decomposition. With this novel solution, we further reveal two interesting insights: i) linear value decomposition implicitly implements a classical multi-agent credit assignment called counterfactual difference rewards; and ii) an on-policy data distribution or a richer Q-function class can improve the training stability of multi-agent Q-learning. In our empirical study, experiments on didactic examples demonstrate the realizability of our theoretical closed-form formulation, and experiments on a broad set of StarCraft II unit micromanagement tasks verify its implications.

1. INTRODUCTION

Cooperative multi-agent reinforcement learning (MARL) has great promise for addressing coordination problems in a variety of applications, such as robotic systems (Hüttenrauch et al., 2017), autonomous cars (Cao et al., 2012), and sensor networks (Zhang & Lesser, 2011). Such complex tasks often require MARL to learn decentralized policies for agents that jointly optimize a global cumulative reward signal, and they pose a number of challenges, including multi-agent credit assignment (Wolpert & Tumer, 2002; Nguyen et al., 2018), non-stationarity (Zhang & Lesser, 2010; Song et al., 2019), and scalability (Zhang & Lesser, 2011; Panait & Luke, 2005). Recently, by leveraging the strength of deep learning techniques, cooperative MARL has made great progress (Sunehag et al., 2018; Baker et al., 2020; Wang et al., 2020b;a), particularly in value-based methods that demonstrate state-of-the-art performance on challenging tasks such as StarCraft unit micromanagement (Samvelyan et al., 2019).

Sunehag et al. (2018) proposed a popular approach called the value-decomposition network (VDN), based on the paradigm of centralized training with decentralized execution (CTDE; Foerster et al., 2016). VDN learns a centralized but factorizable joint value function Q_tot, represented as the summation of individual value functions Q_i. During execution, a decentralized policy can be easily derived for each agent i by greedily selecting actions with respect to its local value function Q_i. This decomposition structure realizes an implicit multi-agent credit assignment, because each Q_i is learned by neural-network backpropagation from the total temporal-difference error on the single global reward signal, rather than on a local reward signal specific to agent i. The decomposition technique significantly improves the scalability of multi-agent Q-learning algorithms and has fostered a series of subsequent works, including QMIX (Rashid et al., 2018), QTRAN (Son et al., 2019), and QPLEX (Wang et al., 2020a). In spite of this empirical success on a broad class of tasks, multi-agent Q-learning with linear value decomposition has not been theoretically well understood.

Because of its limited representational complexity, the standard Bellman update is not a closed operator on the class of joint action-value functions with linear value decomposition. The approximation error induced by this incompleteness is known as inherent Bellman error (Munos & Szepesvári, 2008), which usually deviates Q-learning toward unexpected behavior. To develop a deeper understanding of learning with value decomposition, this paper introduces a multi-agent variant of the popular Fitted Q-Iteration framework (FQI; Ernst et al., 2005) for analyzing multi-agent Q-learning with value decomposition.
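The linear decomposition behind VDN and its decentralized greedy execution can be sketched in a few lines. The sketch below uses a one-step, two-agent task with made-up per-agent utilities (the numbers are illustrative, not from any paper); it shows that when Q_tot is a sum of local utilities, each agent maximizing its own Q_i recovers the same joint action as a centralized argmax over Q_tot.

```python
from itertools import product

# Hypothetical per-agent utilities Q_i(a_i) for a one-step, 2-agent task;
# the values are illustrative assumptions, not taken from the paper.
local_qs = [
    [1.0, 3.0, 2.0],   # Q_1 over agent 1's three actions
    [0.5, -1.0, 4.0],  # Q_2 over agent 2's three actions
]

# VDN's linear decomposition: Q_tot(a_1, a_2) = Q_1(a_1) + Q_2(a_2).
def q_tot(joint_action):
    return sum(q[a] for q, a in zip(local_qs, joint_action))

# Decentralized greedy execution: each agent argmaxes its own Q_i ...
decentralized = tuple(max(range(len(q)), key=q.__getitem__) for q in local_qs)

# ... which, for an additive Q_tot, coincides with the centralized argmax
# over the exponentially large joint action space.
centralized = max(product(range(3), repeat=2), key=q_tot)
assert decentralized == centralized == (1, 2)
```

This consistency between local and joint greedy maximization is exactly what makes decentralized execution cheap under the CTDE paradigm: each agent only searches its own action set instead of the joint one.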

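The representational gap behind inherent Bellman error can be made concrete with a small example. The payoff matrix below is our own illustrative construction (in the style of the nonmonotonic matrix games used in the QTRAN paper, not copied from it): for a balanced table, the least-squares additive fit Q_1(a_1) + Q_2(a_2) is given by the two-way main-effects formula row_mean + col_mean − grand_mean, and its strictly positive residual shows the joint value lies outside the linear decomposition class.

```python
# A nonmonotonic 3x3 cooperative matrix game; payoffs are illustrative
# assumptions, chosen so that no additive Q_1(a_1) + Q_2(a_2) matches them.
payoff = [[ 8.0, -12.0, -12.0],
          [-12.0,  0.0,   0.0],
          [-12.0,  0.0,   0.0]]

n = 3
grand = sum(sum(row) for row in payoff) / (n * n)
row_mean = [sum(row) / n for row in payoff]
col_mean = [sum(payoff[i][j] for i in range(n)) / n for j in range(n)]

# For a balanced table, the least-squares additive (main-effects) fit is
# Q_1(a_1) + Q_2(a_2) = row_mean[a_1] + col_mean[a_2] - grand_mean.
residual_sq = sum(
    (payoff[i][j] - (row_mean[i] + col_mean[j] - grand)) ** 2
    for i in range(n) for j in range(n)
)

# A strictly positive residual means the linear value-decomposition class
# cannot represent this joint value exactly -- the Bellman update can leave
# the class, which is the source of inherent Bellman error.
assert residual_sq > 0.0
```

Because the Bellman backup of a representable function can land on a target like this one, projecting back into the additive class necessarily introduces approximation error at every iteration.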
