UNEVEN: UNIVERSAL VALUE EXPLORATION FOR MULTI-AGENT REINFORCEMENT LEARNING

Abstract

This paper focuses on cooperative value-based multi-agent reinforcement learning (MARL) in the paradigm of centralized training with decentralized execution (CTDE). Current state-of-the-art value-based MARL methods leverage CTDE to learn a centralized joint-action value function as a monotonic mixing of each agent's utility function, which enables easy decentralization. However, this monotonic restriction leads to inefficient exploration in tasks with nonmonotonic returns due to suboptimal approximations of the values of joint actions. To address this, we present a novel MARL approach called Universal Value Exploration (UneVEn), which uses universal successor features (USFs) to learn, in a sample-efficient manner, policies of tasks related to the target task but with simpler reward functions. UneVEn uses novel action-selection schemes between randomly sampled related tasks during exploration, which enables the monotonic joint-action value function of the target task to place more importance on useful joint actions. Empirical results on a challenging cooperative predator-prey task requiring significant coordination amongst agents show that UneVEn significantly outperforms state-of-the-art baselines.

1. INTRODUCTION

Learning control policies for cooperative multi-agent reinforcement learning (MARL) remains challenging as agents must search the joint-action space, which grows exponentially with the number of agents. Current state-of-the-art value-based methods such as VDN (Sunehag et al., 2017) and QMIX (Rashid et al., 2020b) learn a centralized joint-action value function as a monotonic factorization of decentralized agent utility functions and can therefore cope with large joint-action spaces. Due to this monotonic factorization, the joint-action value function can be maximized decentrally, as each agent can simply select the action that maximizes its corresponding utility function. This monotonic restriction, however, prevents VDN and QMIX from representing nonmonotonic joint-action value functions (Mahajan et al., 2019), in which an agent's best action depends on the actions the other agents choose. For example, consider a predator-prey task where at least three agents need to coordinate to capture a prey and any capture attempt by fewer agents is penalized with a penalty of magnitude p. As a result, both VDN and QMIX tend to get stuck in a suboptimal equilibrium (also called the relative overgeneralization pathology, Panait et al., 2006; Wei et al., 2018) in which agents simply avoid the prey (Mahajan et al., 2019; Böhmer et al., 2020). This happens for two reasons. First, depending on p, successful coordination by at least three agents is a needle in a haystack and any step towards it is penalized. Second, the monotonically factorized joint-action value function lacks the representational capacity to distinguish the values of coordinated and uncoordinated joint actions during exploration. Recent work addresses the inefficient exploration of VDN and QMIX caused by monotonic factorization.
QTRAN (Son et al., 2019) and WQMIX (Rashid et al., 2020a) address this problem by weighing important joint actions differently, which can be identified by simultaneously learning a centralized value function, but these approaches still rely on inefficient ε-greedy exploration, which may fail on harder tasks (e.g., the predator-prey task above with a higher value of p). MAVEN (Mahajan et al., 2019) learns an ensemble of monotonic joint-action value functions through committed exploration by maximizing the entropy of the trajectories conditioned on a latent variable. Its exploration focuses on diversity in the joint team behaviour using mutual information. By contrast, this paper proposes Universal Value Exploration (UneVEn), which follows the intuitive premise that tasks with a simpler reward function than the target task (e.g., a smaller miscoordination penalty in predator-prey) can be efficiently solved using a monotonic factorization of the joint-action value function. Therefore, UneVEn samples tasks related to the target task, which are often easier to solve but tend to share the same important joint actions. Selecting actions based on these related tasks during exploration can bias the monotonic approximation of the value function towards important joint actions of the target task (Son et al., 2019; Rashid et al., 2020a), which can overcome relative overgeneralization. To leverage the policies of the sampled related tasks, which differ only in their reward functions, UneVEn uses Universal Successor Features (USFs, Borsa et al., 2018), which have demonstrated excellent zero-shot generalization in single-agent tasks with different reward functions (Barreto et al., 2017; 2020). USFs generalize policy dynamics over tasks using Universal Value Functions (UVFs, Schaul et al., 2015), along with Generalized Policy Improvement (GPI, Barreto et al., 2017), which combines solutions of previous tasks into new policies for unseen tasks. Our contributions are as follows.
First, we propose Multi-Agent Universal Successor Features (MAUSFs), factorized into novel decentralized agent-specific SFs with value decomposition networks (Sunehag et al., 2017) from MARL. This factorization enables agents to compute decentralized greedy policies and to perform decentralized local GPI, which is particularly well suited for MARL, as it allows maximization over a combinatorial set of agent policies. Second, we propose Universal Value Exploration (UneVEn), which uses novel action-selection schemes based on related tasks to solve tasks with nonmonotonic values using monotonic approximations thereof. We evaluate our novel approach on predator-prey tasks that require significant coordination amongst agents and highlight the relative overgeneralization pathology. We empirically show that UneVEn with MAUSFs significantly outperforms current state-of-the-art value-based methods, both on the target tasks and in zero-shot generalization (Borsa et al., 2018) across MARL tasks with different reward functions, which enables UneVEn to be leveraged effectively.
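To make the idea of decentralized local GPI concrete, the following is a minimal sketch (our own illustration, not the paper's implementation; all names are hypothetical): each agent holds action-value estimates for a set of sampled related tasks and greedily selects the action maximizing the maximum over those tasks, so the joint maximization decomposes across agents.

```python
# Hypothetical sketch of decentralized local GPI: each agent computes
# argmax_u max_z Q_z(u) over its own per-task action values, independently
# of the other agents (names and values are illustrative only).
def local_gpi_action(task_qs):
    """task_qs: list of per-task action-value lists for one agent.
    Returns the action index chosen by local GPI."""
    num_actions = len(task_qs[0])
    gpi_values = [max(q[u] for q in task_qs) for u in range(num_actions)]
    return max(range(num_actions), key=gpi_values.__getitem__)

# Two agents, two sampled related tasks, three actions each.
q_agent1 = [[0.1, 0.5, 0.2],   # task 1
            [0.3, 0.0, 0.4]]   # task 2
q_agent2 = [[0.9, 0.1, 0.0],
            [0.2, 0.8, 0.3]]
joint_action = (local_gpi_action(q_agent1), local_gpi_action(q_agent2))
```

Because each agent maximizes only over its own utilities, the cost of this selection grows linearly, not exponentially, in the number of agents.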

2. BACKGROUND

Dec-POMDP: A fully cooperative decentralized multi-agent task can be formalized as a decentralized partially observable Markov decision process (Dec-POMDP, Oliehoek et al., 2016) consisting of a tuple $G = \langle S, U, P, R, \Omega, O, n, \gamma \rangle$. $s \in S$ describes the true state of the environment. At each time step, each agent $a \in A \equiv \{1, \ldots, n\}$ chooses an action $u^a \in U$, forming a joint action $\mathbf{u} \in \mathbf{U} \equiv U^n$. This causes a transition in the environment according to the state transition kernel $P(s' \mid s, \mathbf{u}) : S \times \mathbf{U} \times S \to [0, 1]$. All agents are collaborative and therefore share the same reward function $R(s, \mathbf{u}) : S \times \mathbf{U} \to \mathbb{R}$, and $\gamma \in [0, 1)$ is a discount factor. Due to partial observability, each agent $a$ cannot observe the true state $s$, but receives an observation $o^a \in \Omega$ drawn from the observation kernel $o^a \sim O(s, a)$. At time $t$, each agent $a$ has access to its action-observation history $\tau^a_t \in T_t \equiv (\Omega \times U)^t \times \Omega$, on which it conditions a stochastic policy $\pi^a(u^a_t \mid \tau^a_t)$. $\boldsymbol{\tau}_t \in T^n_t$ denotes the histories of all agents. The joint stochastic policy $\pi(\mathbf{u}_t \mid s_t, \boldsymbol{\tau}_t) \equiv \prod_{a=1}^{n} \pi^a(u^a_t \mid \tau^a_t)$ induces a joint-action value function $Q^{\pi}(s_t, \boldsymbol{\tau}_t, \mathbf{u}_t) = \mathbb{E}[G_t \mid s_t, \boldsymbol{\tau}_t, \mathbf{u}_t]$, where $G_t = \sum_{i=0}^{\infty} \gamma^i r_{t+i}$ is the discounted return. CTDE: We adopt the framework of centralized training and decentralized execution (CTDE, Kraemer & Banerjee, 2016), which assumes access to all action-observation histories $\boldsymbol{\tau}_t$ and the global state $s_t$ during training, but each agent's decentralized policy $\pi^a$ can condition only on its own action-observation history $\tau^a$. This approach can exploit information that is not available during execution and can also freely share parameters and gradients, which improves sample efficiency considerably (see e.g., Foerster et al., 2018; Rashid et al., 2020b; Böhmer et al., 2020).
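The two core quantities above can be sketched in a few lines (our own illustration; function names are ours): the discounted return $G_t = \sum_{i=0}^{\infty} \gamma^i r_{t+i}$ over a finite reward sequence, and the joint policy probability as the product of per-agent action probabilities.

```python
import math

def discounted_return(rewards, gamma):
    """G_t = sum_i gamma^i * r_{t+i}, truncated to a finite reward sequence."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

def joint_policy_prob(per_agent_probs):
    """pi(u|tau) = prod_a pi^a(u^a|tau^a), given each agent's probability
    of its chosen action under its decentralized policy."""
    return math.prod(per_agent_probs)

g = discounted_return([1.0, 1.0, 1.0], gamma=0.5)  # 1 + 0.5 + 0.25 = 1.75
p = joint_policy_prob([0.5, 0.2])                  # 0.5 * 0.2 = 0.1
```

The product form of the joint policy is exactly what makes decentralized execution possible: each factor depends only on one agent's local history.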
Value Decomposition Networks: A naive way to learn in MARL is independent Q-learning (IQL, Tan, 1993), which learns an independent action-value function $Q^a(\tau^a_t, u^a_t; \theta^a)$ for each agent $a$ that conditions only on its local action-observation history $\tau^a_t$. To make better use of other agents' information in CTDE, value decomposition networks (VDN, Sunehag et al., 2017) represent the joint-action value function $Q_{tot}$ as a sum of per-agent utility functions $Q^a$: $Q_{tot}(\boldsymbol{\tau}, \mathbf{u}; \theta) \equiv \sum_{a=1}^{n} Q^a(\tau^a, u^a; \theta)$. Each $Q^a$ still conditions only on individual action-observation histories and can be represented by an agent network that shares parameters across all agents. The joint-action value function $Q_{tot}$ can be trained using Deep Q-Networks (DQN, Mnih et al., 2015). Compared to VDN, QMIX (Rashid et al., 2020b) allows the joint-action value function $Q_{tot}$ to be represented as

