UNEVEN: UNIVERSAL VALUE EXPLORATION FOR MULTI-AGENT REINFORCEMENT LEARNING

Abstract

This paper focuses on cooperative value-based multi-agent reinforcement learning (MARL) in the paradigm of centralized training with decentralized execution (CTDE). Current state-of-the-art value-based MARL methods leverage CTDE to learn a centralized joint-action value function as a monotonic mixing of each agent's utility function, which enables easy decentralization. However, this monotonic restriction leads to inefficient exploration in tasks with nonmonotonic returns due to suboptimal approximations of the values of joint actions. To address this, we present a novel MARL approach called Universal Value Exploration (UneVEn), which uses universal successor features (USFs) to learn, in a sample-efficient manner, policies for tasks related to the target task but with simpler reward functions. UneVEn uses novel action-selection schemes between randomly sampled related tasks during exploration, which enables the monotonic joint-action value function of the target task to place more importance on useful joint actions. Empirical results on a challenging cooperative predator-prey task requiring significant coordination amongst agents show that UneVEn significantly outperforms state-of-the-art baselines.

1. INTRODUCTION

Learning control policies for cooperative multi-agent reinforcement learning (MARL) remains challenging as agents must search the joint-action space, which grows exponentially with the number of agents. Current state-of-the-art value-based methods such as VDN (Sunehag et al., 2017) and QMIX (Rashid et al., 2020b) learn a centralized joint-action value function as a monotonic factorization of decentralized agent utility functions and can therefore cope with large joint-action spaces. Due to this monotonic factorization, the joint-action value function can be maximized decentrally, as each agent can simply select the action that maximizes its corresponding utility function. This monotonic restriction, however, prevents VDN and QMIX from representing nonmonotonic joint-action value functions (Mahajan et al., 2019), in which an agent's best action depends on which actions the other agents choose. For example, consider a predator-prey task where at least three agents need to coordinate to capture a prey and any capture attempt by fewer agents is penalized with a penalty of magnitude p. As a result, both VDN and QMIX tend to get stuck in a suboptimal equilibrium (also called the relative overgeneralization pathology, Panait et al., 2006; Wei et al., 2018) in which agents simply avoid the prey (Mahajan et al., 2019; Böhmer et al., 2020). This happens for two reasons. First, depending on p, successful coordination by at least three agents is a needle in a haystack and any step towards it is penalized. Second, the monotonically factorized joint-action value function lacks the representational capacity to distinguish the values of coordinated and uncoordinated joint actions during exploration. Recent work addresses the inefficient exploration of VDN and QMIX caused by monotonic factorization.
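The relative overgeneralization pathology above can be illustrated with a small single-state matrix game. The sketch below (the payoff values are illustrative and not taken from the paper) computes the best VDN-style additive factorization Q(a1, a2) ≈ u1(a1) + u2(a2) under uniform least squares, and shows that decentralized greedy execution on the fitted utilities avoids the coordinated action:

```python
import numpy as np

# Illustrative 2-agent, 3-action matrix game (values assumed, not from
# the paper): both agents must choose action 0 to "capture" the prey
# (+8); a lone capture attempt is penalized (-12); avoiding it gives 0.
payoff = np.array([
    [  8.0, -12.0, -12.0],
    [-12.0,   0.0,   0.0],
    [-12.0,   0.0,   0.0],
])

# Best additive (VDN-style) fit Q(a1, a2) ~ u1(a1) + u2(a2) under
# uniform least squares is the main-effects (ANOVA) decomposition:
# u1_i + u2_j = row_mean_i + col_mean_j - grand_mean.
grand = payoff.mean()
u1 = payoff.mean(axis=1) - grand / 2  # agent 1 utilities
u2 = payoff.mean(axis=0) - grand / 2  # agent 2 utilities
q_fit = u1[:, None] + u2[None, :]     # fitted monotonic joint values

# Decentralized greedy execution: each agent maximizes its own utility.
a1, a2 = int(u1.argmax()), int(u2.argmax())
print((a1, a2), payoff[a1, a2])  # the greedy joint action avoids the prey (payoff 0)
print(payoff.max())              # the optimal coordinated joint action is worth 8
```

Because the penalized miscoordination entries drag down the average value of each agent's capture action, the additive fit ranks "avoid" above "capture" for both agents, even though the joint capture action is optimal; the fitted Q cannot represent this nonmonotonic dependence.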
QTRAN (Son et al., 2019) and WQMIX (Rashid et al., 2020a) address this problem by weighting important joint actions differently, where such actions are identified by simultaneously learning a centralized value function; however, these approaches still rely on inefficient ε-greedy exploration, which may fail on harder tasks (e.g., the predator-prey task above with a higher value of p). MAVEN (Mahajan et al., 2019) learns an ensemble of monotonic joint-action value functions through committed exploration by maximizing the entropy of the trajectories conditioned on a latent variable. Its exploration focuses on diversity in the joint team behaviour using mutual information. By contrast,

