NEVER REVISIT: CONTINUOUS EXPLORATION IN MULTI-AGENT REINFORCEMENT LEARNING

Abstract

Recently, intrinsic motivation has been widely used for exploration in multi-agent reinforcement learning. However, we discover that intrinsic rewards bring with them the issue of revisitation: the relative values of intrinsic rewards, estimated with neural networks, fluctuate during learning, causing failures to rediscover promising areas after the detachment of exploration. Consequently, agents risk exploring some sub-spaces repeatedly and getting stuck near the fixed initial point. In this paper, we formally define the concept of revisitation, based on which we propose an observation-distribution matching approach to detect its appearance. For each detected revisitation, we dynamically augment branches of the agents' local Q-networks and the mixing network to provide sufficient representational capacity. Furthermore, we use historical joint observations to adjust intrinsic rewards, both reducing the probability of revisitation and penalizing its occurrence. By virtue of these advances, our method achieves superior performance on three challenging Google Research Football (GRF) scenarios and three StarCraft II micromanagement (SMAC) maps with sparse rewards.

1. INTRODUCTION

Multi-agent cooperation is ubiquitous in real-world problems, such as sensor networks (Zhang & Lesser, 2013) and traffic light control (Zhang et al., 2019). To introduce intelligence into multi-agent systems and achieve sophisticated cooperative behavior, multi-agent reinforcement learning (MARL) has been gaining increasing interest in recent years. Advanced MARL methods have largely pushed forward the performance of machine learning algorithms on tasks such as StarCraft II micromanagement (Rashid et al., 2018; Wang et al., 2021b), Hanabi (Bard et al., 2020; Foerster et al., 2019), and robotic control (Kurin et al., 2020). Despite these achievements, a problem persists and prevents MARL from extending successfully to more complex problems: the action-observation space grows exponentially with the number of agents, and the efficiency of exploration strategies in such large search spaces largely limits the learning efficiency of MARL algorithms. Basic exploration schemes, like ϵ-greedy, adopted by many previous works (Wang et al., 2021a; Yu et al., 2021; de Witt et al., 2020), appear to struggle even in tasks with a moderate number of agents.
In this paper, we give a formal definition of revisitation in multi-agent settings, according to which we propose a framework for solving this issue. To provide sufficient policy representation capacity, we add branches to agents' local Q-networks and the mixing network when revisitation occurs, yielding a dynamically growing network structure. Besides that, we use historical samples to calculate joint observation novelty for adjusting intrinsic rewards. Meanwhile, for each revisitation, we further introduce a KL divergence between the historical point's recorded joint observation distribution and the current one, penalizing that revisitation from happening again.
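One plausible (hypothetical) reading of the branch-growing idea above can be sketched as follows: keep a list of output heads, add a fresh head whenever a revisitation is detected, and retain earlier heads so that representational capacity only grows. This is an illustrative sketch, not the paper's architecture; real local Q-networks and the mixing network would be neural networks, and the aggregation rule (here, "newest branch is active") is an assumption.

```python
# Hypothetical sketch of dynamically growing branches; `make_head` is an
# assumed factory that builds a fresh branch (in practice, a neural head).
class BranchingQHead:
    def __init__(self, make_head):
        self.make_head = make_head        # factory for a fresh branch
        self.branches = [make_head()]     # start with a single branch

    def add_branch(self):
        # Called once per detected revisitation; old branches are kept.
        self.branches.append(self.make_head())

    def q_values(self, obs):
        # Assumed aggregation choice: the newest branch produces the Q-values.
        return self.branches[-1](obs)
```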
Based on the above merits for detecting and preventing revisitation, our approach NRT (Never Revisit) achieves a nearly 100% coverage rate on the maze task (Fig. 1 right) with continuous exploration (Fig. 4 in Sec. 4). Furthermore, we benchmark our approach on both Google Research Football (GRF; Kurach et al. (2020)) and SMAC (Samvelyan et al., 2019) under the sparse-reward setting, and find that NRT significantly pushes forward the state of the art. Ablation studies show that adding branches is the most important component for revisitation avoidance and performance improvement on challenging tasks, while the proposed intrinsic reward adjustment modules are critical for achieving continuous exploration.
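The two reward-shaping ideas described above can be sketched in a few lines. This is a simplified, hypothetical illustration: a count-based novelty term stands in for the paper's novelty estimate from historical samples, and the KL term is added to the reward so that diverging from a revisited historical point's recorded distribution is encouraged (an assumed sign convention). `RewardAdjuster` and all its names are illustrative.

```python
import math
from collections import Counter

class RewardAdjuster:
    """Hypothetical sketch: novelty-scaled intrinsic reward plus a KL term
    applied when a revisitation has been detected."""

    def __init__(self):
        self.visits = Counter()  # historical joint-observation counts

    def novelty(self, joint_obs):
        # Count-based proxy for joint observation novelty: decays with revisits.
        self.visits[joint_obs] += 1
        return 1.0 / math.sqrt(self.visits[joint_obs])

    def adjusted_reward(self, joint_obs, intrinsic,
                        hist_dist=None, cur_dist=None, beta=1.0):
        r = self.novelty(joint_obs) * intrinsic
        if hist_dist is not None and cur_dist is not None:
            # Revisitation detected: reward divergence (KL) of the recorded
            # historical distribution from the current one, discouraging a
            # return to that distribution. Assumes cur_dist > 0 on the
            # support of hist_dist.
            kl = sum(p * math.log(p / cur_dist[x])
                     for x, p in hist_dist.items() if p > 0)
            r += beta * kl
        return r
```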

2. BACKGROUND

In this paper, we formulate multi-agent coordination tasks as Dec-POMDPs (Oliehoek et al., 2016), which can be defined as a tuple G = ⟨I, S, A, P, R, O, n, γ⟩, where I is the set of agents, S is the state space, A is the action space, P is the transition function, R is the reward function, O is the observation space, n is the number of agents, and γ ∈ [0, 1) is the discount factor. During sampling, each agent i ∈ I observes its unique information o_i and selects an action a_i ∈ A independently. According to the joint action a and the environment transition function P(s′ | s, a), the environment transfers to a new state s′ and provides a reward that is shared across all agents.
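For concreteness, the tuple above can be mirrored by a minimal container; this is purely an illustrative sketch (all names are ours), with the transition, reward, and observation functions supplied by the concrete task.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence

@dataclass
class DecPOMDP:
    """Illustrative container for G = ⟨I, S, A, P, R, O, n, γ⟩."""
    agents: Sequence[int]                    # I: the set of agents
    states: Any                              # S: state space
    actions: Sequence[Any]                   # A: action space
    transition: Callable[[Any, tuple], Any]  # P(s' | s, joint action a)
    reward: Callable[[Any, tuple], float]    # R: shared across all agents
    observe: Callable[[Any, int], Any]       # O: per-agent observation o_i
    n: int                                   # number of agents
    gamma: float                             # discount factor in [0, 1)
```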



Figure 1: The issue of revisitation. During learning, we store the joint observation distribution induced by the joint policy π every 100k time steps (historical point times). Meanwhile, every 100k time steps, we calculate the JS distance between the current distribution and all historical points (checkpoint times). Experiments are carried out on a 6 × 12 maze task introduced in Sec. 4. Left: a basic exploration scheme (QMIX (Rashid et al., 2018) with ϵ-greedy exploration) covers less than 20% of the joint observation space. Middle: adding CDS (Li et al., 2021) intrinsic incentives improves the coverage rate to 80%. However, the fluctuating JS distances indicate that agents periodically revisit some sub-spaces; consequently, most samples are wasted on revisitation. Right: with our method, revisitation is avoided, and the JS distances remain large and stable before converging to an optimal strategy, indicating the continuous exploration achieved by our approach.
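The JS distance used in Figure 1 can be computed from empirical visit counts; the following is an illustrative sketch (not the paper's code), assuming joint observations have been discretized into hashable keys.

```python
import math

def _kl(p, q):
    # KL(p || q) over the support of p; assumes q > 0 wherever p > 0.
    return sum(p[x] * math.log(p[x] / q[x]) for x in p if p[x] > 0)

def js_distance(counts_a, counts_b):
    """Jensen-Shannon distance (base e) between two visit-count dicts."""
    total_a, total_b = sum(counts_a.values()), sum(counts_b.values())
    support = set(counts_a) | set(counts_b)
    p = {x: counts_a.get(x, 0) / total_a for x in support}
    q = {x: counts_b.get(x, 0) / total_b for x in support}
    m = {x: 0.5 * (p[x] + q[x]) for x in support}  # mixture distribution
    jsd = 0.5 * _kl(p, m) + 0.5 * _kl(q, m)        # JS divergence
    return math.sqrt(jsd)                          # square root is a metric
```

Identical distributions give a distance of 0, and disjoint supports give the maximal value sqrt(log 2) in base e.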

CENTRALIZED TRAINING WITH DECENTRALIZED EXECUTION. Our approach follows the recently advanced multi-agent optimization framework of centralized training with decentralized execution (CTDE) (Lowe et al., 2017; Foerster et al., 2018; Sunehag et al., 2018; Rashid et al., 2018). In this framework, each agent executes based only on its local action-observation history. This distributed decision-making tackles the exponentially growing joint action space.
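The CTDE recipe behind value decomposition can be sketched with the simplest, VDN-style additive mixer (QMIX instead mixes local utilities through a monotonic, state-conditioned network). This is a hedged illustration with tabular Q-functions standing in for neural networks; with a monotonic mixer, per-agent greedy actions recover the joint greedy action.

```python
def greedy_decentralized(local_qs, obs, actions):
    # Execution: each agent i acts greedily on its own Q(o_i, a_i) only.
    return tuple(max(actions, key=lambda a: q[(o, a)])
                 for q, o in zip(local_qs, obs))

def joint_q(local_qs, obs, joint_action):
    # Training: the centralized joint value is a monotonic mix (here a sum,
    # as in VDN) of the local utilities.
    return sum(q[(o, a)] for q, o, a in zip(local_qs, obs, joint_action))
```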

For example, in a 6 × 12 maze game with two agents (Fig. 4(b)), QMIX (Rashid et al., 2018) using independent search can only cover less than 20% of the joint observation space (Fig. 1 left) and struggles to find any rewards. Various intrinsic motivations have been introduced into MARL algorithms to enhance their exploration ability by encouraging interaction among agents (Wang et al., 2020b), maximizing measures of behavioral randomness (Houthooft et al., 2016; Mahajan et al., 2019; Gupta et al., 2021), and spurring individuality (Jiang & Lu, 2021) and diversity (Li et al., 2021). These methods significantly enlarge the sub-space that can be explored; for example, the coverage rate increases from ∼20% to ∼80% using the CDS (Li et al., 2021) diversity-encouragement incentives (Fig. 1 middle). However, in this paper, we find that the augmented exploration comes with a revisitation problem that severely hurts the expected exploration ability and prevents efficient learning.
Revisitation refers to the situation where, with an enlarged exploration space, agents forget the areas they have visited before after the detachment (Ecoffet et al., 2019) of exploration, causing them to return and re-explore. In this way, agents get stuck near the fixed initial point and cannot explore continuously. Meanwhile, revisitation of some sub-spaces can happen repeatedly, making this issue even more detrimental. For example, in Fig. 1 (middle), we present the empirical joint observation distribution induced by CDS (Li et al., 2021) policies at different training time steps. It can be observed that similar distributions occur periodically during learning, and the algorithm wastes most training samples on low-value, revisited experiences.
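The observation-distribution matching idea described above can be sketched as a simple detector: snapshot the empirical joint-observation distribution periodically, and flag a revisitation whenever the current distribution drifts back within a threshold of an earlier snapshot. This is a hypothetical sketch; total-variation distance is used here purely for brevity, and the class name and threshold are our own.

```python
class RevisitDetector:
    """Illustrative sketch: flag revisitations by matching the current
    empirical joint-observation distribution against stored snapshots."""

    def __init__(self, threshold=0.1):
        self.history = []          # recorded historical points
        self.threshold = threshold

    @staticmethod
    def tv_distance(p, q):
        # Total-variation distance between two probability dicts.
        support = set(p) | set(q)
        return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)

    def snapshot(self, dist):
        # Called every fixed number of time steps (a historical point).
        self.history.append(dict(dist))

    def detect(self, current):
        # Indices of historical points the agents have drifted back to.
        return [i for i, h in enumerate(self.history)
                if self.tv_distance(h, current) < self.threshold]
```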

