AN ADAPTIVE ENTROPY-REGULARIZATION FRAMEWORK FOR MULTI-AGENT REINFORCEMENT LEARNING

Abstract

In this paper, we propose an adaptive entropy-regularization framework (ADER) for multi-agent reinforcement learning (RL) that learns an appropriate level of exploration for each agent based on its required degree of exploration. To handle the instability that arises from updating multiple entropy temperature parameters for multiple agents, we disentangle the soft value function into two types: one for the pure reward and the other for the entropy. By applying multi-agent value factorization to the disentangled value function of the pure reward, we obtain a relevant metric for assessing the necessary degree of exploration for each agent. Based on this metric, we propose the ADER algorithm, built on maximum entropy RL, which controls the necessary level of exploration across agents over time by learning a proper target entropy for each agent. Experimental results show that the proposed scheme significantly outperforms current state-of-the-art multi-agent RL algorithms.

1. INTRODUCTION

RL is one of the most notable approaches to solving decision-making problems such as robot control (Hester et al., 2012; Ebert et al., 2018), traffic light control (Wei et al., 2018; Wu et al., 2020), and games (Mnih et al., 2015; Silver et al., 2017). The goal of RL is to find an optimal policy that maximizes the expected return. Guaranteeing convergence of model-free RL requires the assumption that each element of the joint state-action space is visited infinitely often (Sutton & Barto, 2018), which is impractical given the large state and/or action spaces of real-world problems. Thus, effective exploration has been a core problem in RL. In practical problems, however, the time available for learning is limited, so the learner must also exploit the policy it has learned from its experience so far. Hence, the learner should balance exploration and exploitation over time; this is the well-known exploration-exploitation trade-off in RL.

The exploration-exploitation trade-off becomes more challenging in multi-agent RL (MARL) because the state-action space grows exponentially with the number of agents. Furthermore, the necessity and benefit of exploration can differ across agents, and one agent's exploration can even hinder other agents' exploitation. Thus, in MARL, the balance between exploration and exploitation should be considered across agents in addition to across time. We refer to this problem as the multi-agent exploration-exploitation trade-off. Although there exist many algorithms for better exploration in MARL (Mahajan et al., 2019; Kim et al., 2020; Liu et al., 2021a; Zhang et al., 2021), the multi-agent exploration-exploitation trade-off has not been investigated much yet.

In this paper, we propose a new framework based on entropy regularization for adaptive exploration in MARL that handles the multi-agent exploration-exploitation trade-off. The proposed framework allocates different target entropies across agents and over time, based on a newly proposed metric that captures the benefit of further exploration for each agent. To implement the proposed framework, we adopt the method of disentangling exploration and exploitation (Beyer et al., 2019; Han & Sung, 2021) to decompose the joint soft value function into two types: one for the return and the other for the entropy sum. This disentanglement alleviates the instability that can occur when the temperature parameters are updated. It also enables applying value factorization to the return and the entropy separately, since an agent's contribution to the reward can differ from its contribution to the entropy. Based on this disentanglement, we propose a metric for the desired level of exploration for each agent: the partial derivative of the joint value function of the pure return with respect to (w.r.t.) the entropy of that agent's policy. The intuition behind this choice is clear for entropy-based exploration: agents with a higher gradient of the joint pure-return value w.r.t. their action entropy should increase their target entropy, raising their exploration level so as to contribute more to the pure return. Under the constraint that the total target entropy summed across all agents is fixed, which we will impose, the target entropy of agents with a lower gradient of the joint pure-return value w.r.t. their action entropy is then reduced, inclining those agents toward exploitation rather than exploration. In this way, the multi-agent exploration-exploitation trade-off can be achieved; a sketch of this target-entropy reallocation is given below.
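To make the allocation rule concrete, the following is a minimal sketch (not the paper's exact update) of how per-agent target entropies could be reallocated under a fixed total-entropy budget, using the gradient of the joint pure-return value w.r.t. each agent's policy entropy as the metric. All names (`dV_dH`, `total_entropy_budget`) and the softmax weighting are illustrative assumptions.

```python
import numpy as np

def allocate_target_entropies(dV_dH, total_entropy_budget, temperature=1.0):
    """Split a fixed total-entropy budget across agents.

    dV_dH[i] is an estimate of the partial derivative of the joint
    pure-return value w.r.t. agent i's policy entropy. Agents whose
    extra entropy helps the return more receive a larger share of the
    budget; the softmax weighting here is an illustrative choice, not
    the paper's exact rule.
    """
    dV_dH = np.asarray(dV_dH, dtype=np.float64)
    weights = np.exp(dV_dH / temperature)
    weights /= weights.sum()
    return total_entropy_budget * weights  # per-agent target entropies

# Example: agent 1 benefits most from further exploration, so it
# receives the largest target entropy; the sum stays equal to the budget.
targets = allocate_target_entropies(dV_dH=[0.1, 0.9, 0.3],
                                    total_entropy_budget=3.0)
print(targets, targets.sum())
```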
The experiments demonstrate the effectiveness of the proposed framework for handling the multi-agent exploration-exploitation trade-off.

2. BACKGROUND

Basic setup We consider a decentralized partially observable MDP (Dec-POMDP), which describes a fully cooperative multi-agent task (Oliehoek & Amato, 2016). A Dec-POMDP is defined by a tuple $\langle \mathcal{N}, \mathcal{S}, \{\mathcal{A}^i\}, \mathcal{P}, \{\Omega^i\}, O, r, \gamma \rangle$ (Oliehoek et al., 2008), where $\mathcal{N} = \{1, 2, \cdots, N\}$ is the set of agents. At time step $t$, Agent $i \in \mathcal{N}$ makes its own observation $o_t^i \in \Omega^i$ according to the observation function $O(s, i): \mathcal{S} \times \mathcal{N} \to \Omega^i: (s_t, i) \mapsto o_t^i$, where $s_t \in \mathcal{S}$ is the global state at time step $t$. Agent $i$ selects action $a_t^i \in \mathcal{A}^i$, forming a joint action $a_t = \{a_t^1, a_t^2, \cdots, a_t^N\}$. The joint action yields the next global state $s_{t+1}$ according to the transition probability $\mathcal{P}(\cdot|s_t, a_t)$ and a joint reward $r(s_t, a_t)$. Each agent $i$ has an observation-action history $\tau^i \in (\Omega^i \times \mathcal{A}^i)^*$ and trains its decentralized policy $\pi^i(a^i|\tau^i)$ to maximize the return $\mathbb{E}[\sum_{t=0}^{\infty} \gamma^t r_t]$. We consider the framework of centralized training with decentralized execution (CTDE), where decentralized policies are trained in a centralized manner with additional information, including the global state, during the training phase.
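As a concrete reading of this setup, the sketch below walks through one Dec-POMDP episode under decentralized execution with a hypothetical environment interface; the method names (`env.reset`, `env.observe`, `env.step`, per-agent `policy.act`) are assumptions for illustration, not an API from the paper.

```python
def run_episode(env, policies, max_steps=100):
    """One Dec-POMDP episode with decentralized execution.

    Each agent i acts only on its own observation-action history tau^i;
    the global state is used by the environment (and, during training,
    by a centralized critic), never by the decentralized policies.
    """
    histories = [[] for _ in policies]        # tau^i for each agent
    episode_return, discount = 0.0, 1.0
    env.reset()
    for t in range(max_steps):
        obs = [env.observe(i) for i in range(len(policies))]  # o_t^i
        actions = []
        for i, policy in enumerate(policies):
            histories[i].append(obs[i])
            actions.append(policy.act(histories[i]))  # a_t^i ~ pi^i(.|tau^i)
            histories[i].append(actions[-1])
        reward, done = env.step(actions)      # joint reward r(s_t, a_t)
        episode_return += discount * reward
        discount *= env.gamma                 # gamma-discounted return
        if done:
            break
    return episode_return
```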

Value Factorization

It is difficult to learn the joint action-value function, defined as $Q_{JT}(s, \boldsymbol{\tau}, \boldsymbol{a}) = \mathbb{E}[\sum_{t=0}^{\infty} \gamma^t r_t \,|\, s, \boldsymbol{\tau}, \boldsymbol{a}]$, due to the curse of dimensionality as the number of agents increases. For efficient learning of the joint action-value function, value factorization techniques have been proposed to factorize it into individual action-value functions $Q_i(\tau^i, a^i)$, $i = 1, \cdots, N$. One representative example is QMIX, which introduces a monotonic constraint between the joint action-value function and the individual action-value functions. The joint action-value function in QMIX is expressed as
$$Q_{JT}(s, \boldsymbol{\tau}, \boldsymbol{a}) = f_{mix}\big(s, Q_1(\tau^1, a^1), \cdots, Q_N(\tau^N, a^N)\big), \quad \frac{\partial Q_{JT}(s, \boldsymbol{\tau}, \boldsymbol{a})}{\partial Q_i(\tau^i, a^i)} \geq 0, ~\forall i \in \mathcal{N},$$
where $f_{mix}$ is a mixing network that combines the individual action-values into the joint action-value based on the global state. To satisfy the monotonic constraint $\partial Q_{JT}/\partial Q_i \geq 0$, the mixing network is restricted to have positive weights. Other value-based MARL algorithms with value factorization also exist (Son et al., 2019; Wang et al., 2020a), and actor-critic based MARL algorithms have likewise used value factorization to learn a centralized critic (Peng et al., 2021; Su et al., 2021).
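To illustrate the monotonicity construction, below is a minimal PyTorch sketch of a QMIX-style mixing network in which the mixing weights are generated by hypernetworks conditioned on the global state and passed through an absolute value so they stay non-negative. Layer sizes and the use of `torch.abs` follow common QMIX implementations but are assumptions here, not details taken from this paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QMixer(nn.Module):
    """QMIX-style monotonic mixing network (illustrative sketch).

    Hypernetworks map the global state to the mixing weights; taking
    their absolute value enforces dQ_JT/dQ_i >= 0 for every agent i.
    """
    def __init__(self, n_agents, state_dim, embed_dim=32):
        super().__init__()
        self.n_agents = n_agents
        self.embed_dim = embed_dim
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Linear(state_dim, 1)

    def forward(self, agent_qs, state):
        # agent_qs: (batch, n_agents), state: (batch, state_dim)
        bs = agent_qs.size(0)
        w1 = torch.abs(self.hyper_w1(state)).view(bs, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(bs, 1, self.embed_dim)
        hidden = F.elu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(bs, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(bs, 1, 1)
        q_jt = torch.bmm(hidden, w2) + b2   # (batch, 1, 1)
        return q_jt.view(bs)                # Q_JT(s, tau, a)
```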



Maximum Entropy RL and Entropy Regularization Maximum entropy RL aims to promote exploration by finding an optimal policy that maximizes the sum of the cumulative reward and the entropy (Haarnoja et al., 2017; 2018a). The objective function of maximum entropy RL is given by
$$J_{MaxEnt}(\pi) = \mathbb{E}_{\pi}\left[ \sum_{t=0}^{\infty} \gamma^t \big( r_t + \alpha \mathcal{H}(\pi(\cdot|s_t)) \big) \right],$$
where $\mathcal{H}(\cdot)$ is the entropy function and $\alpha$ is the temperature parameter that determines the importance of the entropy relative to the reward. Soft actor-critic (SAC) is an off-policy actor-critic algorithm that efficiently solves the maximum entropy RL problem based on soft policy iteration, which alternates soft policy evaluation and soft policy improvement. For this, the soft Q function is defined as the sum of the total reward and the future entropy, i.e., $Q^{\pi}(s_t, a_t) := r_t + \mathbb{E}_{\tau_{t+1} \sim \pi}\left[ \sum_{l=t+1}^{\infty} \gamma^{l-t} \big( r_l + \alpha \mathcal{H}(\pi(\cdot|s_l)) \big) \right]$. In the soft policy evaluation step, for a fixed policy $\pi$, the soft Q function is estimated with a convergence guarantee by repeatedly applying the soft Bellman backup operator $\mathcal{T}^{\pi}_{sac}$ to an estimate function $Q: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, where the operator is given by
$$\mathcal{T}^{\pi}_{sac} Q(s_t, a_t) = r_t + \gamma \mathbb{E}_{s_{t+1}}[V(s_{t+1})], \quad V(s_t) = \mathbb{E}_{a_t \sim \pi}[Q(s_t, a_t) - \alpha \log \pi(a_t|s_t)].$$
In the soft policy improvement step, the policy is updated using the evaluated soft Q function as follows:
$$\pi_{new} = \arg\max_{\pi} \mathbb{E}_{a_t \sim \pi}\left[ Q^{\pi_{old}}(s_t, a_t) - \alpha \log \pi(a_t|s_t) \right].$$
By iterating soft policy evaluation and soft policy improvement, called soft policy iteration, the policy converges to an optimal policy of the maximum entropy RL objective (Haarnoja et al., 2018a).
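As a numerical illustration of one soft policy evaluation backup, the sketch below applies $\mathcal{T}^{\pi}_{sac}$ once to a tabular Q estimate with discrete actions; the tabular setting and variable names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def soft_state_value(q_row, policy_row, alpha):
    """V(s) = E_{a~pi}[Q(s,a) - alpha * log pi(a|s)] for discrete actions."""
    return np.sum(policy_row * (q_row - alpha * np.log(policy_row + 1e-8)))

def soft_bellman_backup(q, policy, rewards, transitions, alpha, gamma):
    """Apply the soft Bellman backup operator T^pi_sac once.

    q:           (S, A) current Q estimate
    policy:      (S, A) action probabilities pi(a|s)
    rewards:     (S, A) immediate rewards r(s, a)
    transitions: (S, A, S) transition probabilities P(s'|s, a)
    """
    v = np.array([soft_state_value(q[s], policy[s], alpha)
                  for s in range(q.shape[0])])   # soft V(s') per state
    return rewards + gamma * transitions @ v     # r + gamma * E[V(s_{t+1})]
```

Repeating this backup to convergence implements the soft policy evaluation step; alternating it with the soft policy improvement update above yields soft policy iteration.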

