FSV: LEARNING TO FACTORIZE SOFT VALUE FUNCTION FOR COOPERATIVE MULTI-AGENT REINFORCEMENT LEARNING

Abstract

We explore stochastic policy solutions for cooperative multi-agent reinforcement learning (MARL) using the idea of function factorization in centralized training with decentralized execution (CTDE). Existing CTDE-based factorization methods are susceptible to relative overgeneralization, a well-known game-theoretic pathology in which agents converge to a suboptimal Nash Equilibrium. To resolve this issue, we propose a novel factorization method for cooperative MARL, named FSV, which learns to factorize the joint soft value function into individual ones for decentralized execution. Theoretical analysis shows that FSV solves a rich class of factorization tasks. Our experiments on the well-known Non-Monotonic Matrix game and Max of Two Quadratics game show that FSV converges to the optimum in the joint action space through local search, in both discrete and continuous tasks. We further evaluate FSV on a challenging set of StarCraft II micromanagement tasks and show that it significantly outperforms existing factorization-based multi-agent reinforcement learning methods.

1. INTRODUCTION

Cooperative multi-agent reinforcement learning (MARL) aims to instill in agents policies that maximize the team reward accumulated over time (Panait & Luke (2005); Busoniu et al. (2008); Tuyls & Weiss (2012)), and has great potential to address complex real-world problems such as coordinating autonomous cars (Cao et al. (2013)). Owing to measurement and communication limitations in practical problems, cooperative MARL faces the challenge of partial observability: each agent must choose actions based only on its local observations. Centralized training with decentralized execution (CTDE) (Oliehoek et al. (2011)) is a common paradigm for addressing partial observability, in which agents' policies are trained with access to global information in a centralized way and executed based only on local observations in a decentralized way, as in MADDPG (Lowe (2017)) and COMA (Foerster et al. (2017)). However, the joint state-action space of the centralized value function grows exponentially with the number of agents, which is known as the scalability challenge. Value function factorization methods have become an increasingly popular paradigm for addressing scalability in CTDE by satisfying the Individual-Global-Max (IGM) condition, which requires that the optimal joint action selection be consistent with the optimal individual action selections. Three representative examples of value function factorization methods are VDN (Sunehag et al. (2017)), QMIX (Rashid et al. (2018)), and QTRAN (Son et al. (2019)). All these methods use ε-greedy policies; VDN and QMIX impose sufficient but unnecessary conditions for IGM through additivity and monotonicity structures respectively, while QTRAN formulates IGM as an optimization problem with linear constraints. Although these methods have seen success on some tasks, they all face relative overgeneralization, in which agents may get stuck in a suboptimal Nash Equilibrium.

In fact, relative overgeneralization is a grave pathology that occurs when a suboptimal Nash Equilibrium in the joint action space is preferred over an optimal one, because each agent's action in the suboptimal equilibrium is a better choice when matched with arbitrary actions from the other agents (Wei & Luke (2016)). The non-monotonic matrix game is a simple discrete example. Both VDN and QMIX fail to learn the optimal policy in the non-monotonic
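For concreteness, the IGM condition mentioned above can be written in the notation commonly used in this line of work (e.g., by QTRAN); here $\boldsymbol{\tau}$ denotes the joint action-observation history, $Q_{jt}$ the joint value function, and $Q_i$ the individual ones:

\[
\arg\max_{\mathbf{u}} Q_{jt}(\boldsymbol{\tau}, \mathbf{u})
=
\begin{pmatrix}
\arg\max_{u_1} Q_1(\tau_1, u_1)\\
\vdots\\
\arg\max_{u_n} Q_n(\tau_n, u_n)
\end{pmatrix}
\]

That is, greedy decentralized action selection by each agent must recover the greedy joint action of the centralized value function.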

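Relative overgeneralization under additive factorization can be made concrete with a small sketch. The payoff values below follow a commonly cited variant of the non-monotonic matrix game (they are illustrative, not taken from this paper's experiments): we fit the best VDN-style additive factorization Q(u1, u2) ≈ Q1(u1) + Q2(u2) by least squares over all joint actions, then let each agent act greedily on its own factor.

```python
import numpy as np

# Payoff matrix of a non-monotonic matrix game (illustrative values):
# the optimal joint action is (0, 0) with payoff 8, but miscoordination
# around it is heavily penalized.
P = np.array([[  8., -12., -12.],
              [-12.,   0.,   0.],
              [-12.,   0.,   0.]])
n = P.shape[0]

# Best additive factorization Q(u1, u2) ~ Q1(u1) + Q2(u2), fit by least
# squares over all joint actions (VDN-style structural constraint).
A = np.zeros((n * n, 2 * n))
for i in range(n):
    for j in range(n):
        A[i * n + j, i] = 1.0       # selects Q1[i]
        A[i * n + j, n + j] = 1.0   # selects Q2[j]
theta, *_ = np.linalg.lstsq(A, P.ravel(), rcond=None)
Q1, Q2 = theta[:n], theta[n:]

# Decentralized greedy action selection per agent.
u1, u2 = int(np.argmax(Q1)), int(np.argmax(Q2))
print("greedy joint action:", (u1, u2), "payoff:", P[u1, u2])
print("optimal payoff:", P.max())
```

Because the additive fit averages each action's payoff over the other agent's actions, the row containing the optimum looks worse than the safe rows, and the decentralized greedy joint action lands in the suboptimal equilibrium with payoff 0 rather than the optimum at (0, 0).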
