COORDINATED MULTI-AGENT EXPLORATION USING SHARED GOALS

Abstract

Exploration is critical to the performance of deep reinforcement learning algorithms and has attracted much attention. However, existing multi-agent deep reinforcement learning algorithms still rely mostly on noise-based techniques. It was recognized recently that noise-based exploration is suboptimal in multi-agent settings, and exploration methods that take agents' cooperation into account have been developed. However, existing methods suffer from a common challenge: agents struggle to identify states that are worth exploring, and do not coordinate their exploration efforts toward those states. To address this shortcoming, in this paper we propose coordinated multi-agent exploration (CMAE): agents share a common goal while exploring. The goal is selected from multiple projected state spaces by a normalized entropy-based technique. Then, agents are trained to reach the goal in a coordinated manner. We demonstrate that our approach needs only 1% to 5% of the environment steps to achieve similar or better returns than state-of-the-art baselines on various sparse-reward tasks, including a sparse-reward version of the StarCraft multi-agent challenge (SMAC).

1. INTRODUCTION

Cooperative multi-agent reinforcement learning (MARL) is an increasingly important field. Indeed, many real-world problems are naturally modeled using MARL techniques. For instance, tasks from areas as diverse as robot fleet coordination (Swamy et al., 2020; Hüttenrauch et al., 2019) and autonomous traffic control (Bazzan, 2008; Sunehag et al., 2018) fit MARL formulations. To address MARL problems, early work followed the independent single-agent reinforcement learning paradigm (Tampuu et al., 2015; Tan, 1993; Matignon et al., 2012). More recently, specifically tailored techniques such as monotonic value function factorization (QMIX) (Rashid et al., 2018), multi-agent deep deterministic policy gradient (MADDPG) (Lowe et al., 2017), and counterfactual multi-agent policy gradients (COMA) (Foerster et al., 2018) have been developed. These methods excel in multi-agent settings because they address the non-stationarity of MARL and develop communication protocols between agents. Despite these advances and the resulting reported performance improvements, a common issue remains: all of the aforementioned methods use exploration techniques from classical single-agent algorithms. Specifically, these methods employ noise-based exploration, i.e., the exploration policy is a noisy version of the actor policy. While such exploration techniques significantly improve results in multi-agent reinforcement learning, they suffer from a common challenge: agents struggle to identify states that are worth exploring, and do not coordinate their exploration efforts toward those states. To give an example, consider a push-box task where two agents need to jointly push a heavy box to a specific location before observing a reward. In this situation, instead of exploring the environment independently, the agents need to coordinate on pushing the box through the environment to find that specific location.
To address this issue, we propose coordinated multi-agent exploration (CMAE), where multiple agents share a common goal. We achieve this by first projecting the joint state space to multiple subspaces. We then develop a normalized entropy (Cover & Thomas, 1991) based technique to select a goal from the under-explored subspaces. Finally, exploration policies are trained to reach the goal in a coordinated manner. To show that CMAE improves results, we evaluate our approach on various sparse-reward environments from Wang et al. (2020) and on the sparse-reward version of the StarCraft multi-agent challenge (SMAC) (Samvelyan et al., 2019), which requires coordinated actions among agents over extended time steps before a reward is observed. The experimental results show that our approach needs only 1% to 5% of the environment steps to achieve similar or better average test episode returns than current state-of-the-art baselines.
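The goal-selection idea can be illustrated with a small sketch. The function names and the count structure below are hypothetical, not CMAE's actual implementation: each projected subspace keeps empirical visit counts, the subspace whose visit distribution has the lowest normalized entropy is treated as under-explored, and a rarely visited state in that subspace becomes the shared goal.

```python
import math

def normalized_entropy(counts):
    """Normalized entropy H / log(K) of an empirical visit distribution.

    counts: mapping from projected state -> visit count.
    Returns a value in [0, 1]; low values indicate under-explored subspaces.
    """
    total = sum(counts.values())
    k = len(counts)
    if k <= 1 or total == 0:
        return 0.0
    probs = [c / total for c in counts.values()]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(k)

def select_goal(subspace_counts):
    """Pick the subspace with the lowest normalized entropy, then return
    its least-visited projected state as the shared exploration goal."""
    name, counts = min(subspace_counts.items(),
                       key=lambda kv: normalized_entropy(kv[1]))
    goal = min(counts, key=counts.get)
    return name, goal
```

For example, a subspace visited uniformly has normalized entropy 1 and is skipped, while a subspace dominated by a single state (e.g., the box barely ever moved) has low normalized entropy and is selected.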

2. PRELIMINARIES

In this section, we define the multi-agent Markov decision process (MDP) in Sec. 2.1, and introduce the multi-agent reinforcement learning setting in Sec. 2.2.

2.1. MULTI-AGENT MARKOV DECISION PROCESS

We model a cooperative multi-agent system as a multi-agent Markov decision process (MDP). An n-agent MDP is defined by a tuple G = (S, A, T, R, Z, O, n, γ, H). S is the global state space of the environment and A is the action space. At each time step t, each agent's policy π_i, i ∈ {1, . . . , n}, selects an action a_i^t ∈ A. All selected actions form a joint action a^t ∈ A^n. The transition function T maps the current state s^t and the joint action a^t to the next state s^{t+1}, i.e., T : S × A^n → S. All agents receive a collective reward r^t ∈ ℝ according to the reward function R : S × A^n → ℝ. The goal of all agents' policies is to maximize the collective expected return Σ_{t=0}^{H} γ^t r^t, where γ ∈ [0, 1] is the discount factor, H is the horizon, and r^t is the collective reward obtained at time step t. Each agent i receives a local observation o_i^t ∈ Z according to the observation function O : S → Z. Note that observations usually reveal only partial information about the global state. For instance, the global state may contain the locations of all agents, while the local observation of an agent may only contain the locations of other agents within a limited distance. All agents' local observations form a joint observation, denoted o^t. The global state space S is the product of component spaces V_i, i.e., S = Π_{i=1}^{M} V_i, where V_i ⊆ ℝ (Samvelyan et al., 2019; Lowe et al., 2017; Rashid et al., 2018; Foerster et al., 2018; Mahajan et al., 2019). We refer to V_i as a 'state component.' The set of all component spaces of a product space is referred to as the component set. For instance, the component set of S is {V_i | i ∈ {1, . . . , M}}. Each entity in the environment, e.g., an agent or an object, is described by a set of state components. We refer to the set of state components associated with an entity as an 'entity set.'
For instance, in a 2-agent push-box environment, where two agents can only collaboratively push a box to a goal location, the global state space is S = Π_{i=1}^{6} V_i, where {V_1, V_2}, {V_3, V_4}, and {V_5, V_6} represent the location of agent one, agent two, and the box, respectively. Consequently, {V_1, V_2}, {V_3, V_4}, and {V_5, V_6} are three entity sets.
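The decomposition above can be made concrete with a short sketch for the 2-agent push-box example; all names and index choices below are illustrative, not part of the paper's formalism.

```python
# Global state s^t in S = V_1 x ... x V_6, laid out as (x1, y1, x2, y2, bx, by).
# Each entity set lists the indices of its state components.
ENTITY_SETS = {
    "agent_one": (0, 1),  # components {V_1, V_2}
    "agent_two": (2, 3),  # components {V_3, V_4}
    "box":       (4, 5),  # components {V_5, V_6}
}

def project(state, entity):
    """Project the global state onto the component spaces of one entity set."""
    return tuple(state[i] for i in ENTITY_SETS[entity])

state = (0.0, 1.0, 2.0, 3.0, 4.5, 5.5)
```

Projections of this kind yield the restricted subspaces over which visit statistics can be tracked separately per entity.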

2.2. MULTI-AGENT REINFORCEMENT LEARNING

In this paper, we follow the standard centralized training and decentralized execution (CTDE) paradigm (Lowe et al., 2017; Rashid et al., 2018; Foerster et al., 2018; Mahajan et al., 2019). That is, at training time, the learning algorithm has access to all agents' local observations, actions, and the global state. At execution time, i.e., at test time, each individual agent's policy only has access to its own local observation. The proposed CMAE is applicable to off-policy MARL methods (e.g., Rashid et al., 2018; Lowe et al., 2017; Sunehag et al., 2018; Matignon et al., 2012). In off-policy MARL, exploration policies interact with the environment to collect data, which is then used to train the target policies. In prior work, the exploration policy is typically a noisy version of the target policy.
For instance, Lowe et al. (2017) add Ornstein-Uhlenbeck (OU) (Uhlenbeck & Ornstein, 1930) noise or Gaussian noise to the actor policy. Foerster et al. (2016), Rashid et al. (2018), Yang et al. (2018), and Foerster et al. (2017) use variants of ε-greedy exploration, where a random action is selected with probability ε. It was recognized recently that the use of such classical exploration techniques is sub-optimal in a multi-agent reinforcement learning setting. Specifically, Mahajan et al. (2019) show that QMIX with ε-greedy exploration results in slow exploration and sub-optimal policies. Mahajan et al. (2019) improve exploration by conditioning an agent's behavior on a shared latent variable controlled by a hierarchical policy. Even more recently, Wang et al. (2020) encourage coordinated exploration by considering the influence of one agent's behavior on other agents' behaviors.
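For reference, the two classical noise-based schemes discussed above can be sketched as follows. This is a minimal illustration of the generic techniques, not the exact implementation of any of the cited methods, and the parameter defaults are common textbook choices rather than values from those papers.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """epsilon-greedy exploration: with probability epsilon pick a uniformly
    random action index, otherwise the greedy (argmax-Q) action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

class OUNoise:
    """Ornstein-Uhlenbeck process: temporally correlated noise that is added
    to a deterministic actor's continuous action (as in MADDPG-style setups)."""
    def __init__(self, theta=0.15, sigma=0.2, mu=0.0):
        self.theta, self.sigma, self.mu = theta, sigma, mu
        self.x = mu
    def sample(self, rng=random):
        # Mean-reverting update: drift toward mu plus a Gaussian perturbation.
        self.x += self.theta * (self.mu - self.x) + self.sigma * rng.gauss(0.0, 1.0)
        return self.x
```

Both schemes perturb each agent's policy independently, which is exactly why they provide no mechanism for agents to agree on where to explore.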

