COORDINATED MULTI-AGENT EXPLORATION USING SHARED GOALS

Abstract

Exploration is critical to the success of deep reinforcement learning algorithms and has attracted much attention. However, existing multi-agent deep reinforcement learning algorithms still mostly use noise-based techniques. It was recognized recently that noise-based exploration is suboptimal in multi-agent settings, and exploration methods that consider agents' cooperation have been developed. However, existing methods suffer from a common challenge: agents struggle to identify states that are worth exploring, and do not coordinate their exploration efforts toward those states. To address this shortcoming, in this paper we propose coordinated multi-agent exploration (CMAE): agents share a common goal while exploring. The goal is selected from multiple projected state spaces by a normalized entropy-based technique. Then, agents are trained to reach the goal in a coordinated manner. We demonstrate that our approach needs only 1%-5% of the environment steps to achieve similar or better returns than state-of-the-art baselines on various sparse-reward tasks, including a sparse-reward version of the StarCraft multi-agent challenge (SMAC).
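
To make the goal-selection criterion mentioned above concrete, the following is a schematic sketch, not the paper's exact procedure: it assumes each projected state space is summarized by a visitation-count histogram, picks the space whose normalized entropy is lowest (i.e., the least uniformly explored), and selects the least-visited state in that space as the shared goal. All names and the count-based representation are illustrative assumptions.

```python
import numpy as np

def normalized_entropy(counts):
    """Shannon entropy of a visitation-count histogram, normalized to [0, 1].
    Assumes counts.sum() > 0 and at least two bins (so log(K) > 0)."""
    p = counts / counts.sum()
    p = p[p > 0]
    h = -(p * np.log(p)).sum()
    return h / np.log(len(counts))  # divide by the maximum entropy log(K)

def select_goal(space_counts):
    """Pick the projected space with the lowest normalized entropy,
    then the least-visited state in that space as the shared goal."""
    idx = min(range(len(space_counts)),
              key=lambda i: normalized_entropy(space_counts[i]))
    goal_state = int(np.argmin(space_counts[idx]))
    return idx, goal_state
```

Under this sketch, a uniformly visited space has normalized entropy 1 and is deprioritized, while a skewed histogram signals under-explored states worth targeting.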

1. INTRODUCTION

Cooperative multi-agent reinforcement learning (MARL) is an increasingly important field. Indeed, many real-world problems are naturally modeled using MARL techniques. For instance, tasks from areas as diverse as robot fleet coordination (Swamy et al., 2020; Hüttenrauch et al., 2019) and autonomous traffic control (Bazzan, 2008; Sunehag et al., 2018) fit MARL formulations. To address MARL problems, early work followed the independent single-agent reinforcement learning paradigm (Tampuu et al., 2015; Tan, 1993; Matignon et al., 2012). However, more recently, specifically tailored techniques such as monotonic value function factorization (QMIX) (Rashid et al., 2018), multi-agent deep deterministic policy gradient (MADDPG) (Lowe et al., 2017), and counterfactual multi-agent policy gradients (COMA) (Foerster et al., 2018) have been developed. Those methods excel in a multi-agent setting because they address the non-stationarity of MARL and develop communication protocols between agents. Despite those advances and the resulting reported performance improvements, a common issue remains: all of the aforementioned methods use exploration techniques from classical algorithms. Specifically, these methods employ noise-based exploration, i.e., the exploration policy is a noisy version of the actor policy.

For instance, Lowe et al. (2017) add Ornstein-Uhlenbeck (OU) (Uhlenbeck & Ornstein, 1930) noise or Gaussian noise to the actor policy. Foerster et al. (2016); Rashid et al. (2018); Yang et al. (2018); Foerster et al. (2017) use variants of ε-greedy exploration, where a random suboptimal action is selected with probability ε. It was recognized recently that the use of classical exploration techniques is sub-optimal in a multi-agent reinforcement learning setting. Specifically, Mahajan et al. (2019) show that QMIX with ε-greedy exploration results in slow exploration and sub-optimality. Mahajan et al. (2019) improve exploration by conditioning an agent's behavior on a shared latent variable controlled by a hierarchical policy. Even more recently, Wang et al. (2020) encourage coordinated exploration by considering the influence of one agent's behavior on other agents' behaviors.
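The two noise-based schemes above can be sketched as follows. This is a minimal illustration, not the cited implementations; the hyperparameter values (θ, σ, dt) are common illustrative defaults, not values from the cited papers.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """epsilon-greedy: pick a random action with probability epsilon,
    otherwise the greedy (argmax) action."""
    if rng.random() < epsilon:
        return rng.integers(len(q_values))
    return int(np.argmax(q_values))

class OUNoise:
    """Ornstein-Uhlenbeck process: temporally correlated noise added to a
    deterministic actor's continuous action for exploration."""
    def __init__(self, dim, theta=0.15, sigma=0.2, dt=1e-2, seed=0):
        self.theta, self.sigma, self.dt = theta, sigma, dt
        self.x = np.zeros(dim)
        self.rng = np.random.default_rng(seed)

    def sample(self):
        # dx = -theta * x * dt + sigma * sqrt(dt) * N(0, 1)
        dx = (-self.theta * self.x * self.dt
              + self.sigma * np.sqrt(self.dt) * self.rng.standard_normal(self.x.shape))
        self.x = self.x + dx
        return self.x
```

In both cases the exploration policy is just the actor policy perturbed by state-independent noise, which is exactly the property the coordinated-exploration methods discussed next aim to move beyond.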

