ADAPTIVE LEARNING RATES FOR MULTI-AGENT REINFORCEMENT LEARNING

Abstract

In multi-agent reinforcement learning (MARL), the learning rates of the actors and critic are mostly hand-tuned and fixed. This not only requires heavy tuning but, more importantly, limits the learning. Some optimizers have been proposed that adapt learning rates according to gradient patterns for general optimization, but they do not take the characteristics of MARL into consideration. In this paper, we propose AdaMa to bring adaptive learning rates to cooperative MARL. AdaMa evaluates the contribution of the actors' updates to the improvement of the Q-value and adaptively updates the learning rates of the actors towards the direction of maximally improving the Q-value. AdaMa also dynamically balances the learning rates between the critic and actors according to their varying effects on the learning. Moreover, AdaMa can incorporate a second-order approximation to capture the contribution of pairwise actors' updates and thus update the actors' learning rates more accurately. Empirically, we show that AdaMa accelerates the learning and improves the performance in a variety of multi-agent scenarios, and visualizations of the learning rates during training clearly explain how and why AdaMa works.

1. INTRODUCTION

Recently, multi-agent reinforcement learning (MARL) has been applied to decentralized cooperative systems, e.g., autonomous driving (Shalev-Shwartz et al., 2016), smart grid control (Yang et al., 2018), and traffic signal control (Wei et al., 2019). Many MARL methods (Lowe et al., 2017; Foerster et al., 2018; Rashid et al., 2018; Iqbal & Sha, 2019; Son et al., 2019) have been proposed for multi-agent cooperation, following the paradigm of centralized training and decentralized execution. In many of these methods, a centralized critic learns the joint Q-function using the information of all agents, and the decentralized actors are updated towards maximizing the Q-value based on local observations. However, the actors are usually assigned the same learning rate, which is not optimal for maximizing the Q-value: some agents may be more critical than others to improving the Q-value and should therefore have higher learning rates. On the other hand, the learning rates of the actors and critic are often hand-tuned and fixed, and hence require heavy tuning. More importantly, the effect of the actors and critic on the learning varies over the course of training, so fixed learning rates will not be the best at every learning stage. Hand-crafted schedules, e.g., time-based decay and step decay, are pre-defined and require expert knowledge about the model and problem. Some optimizers, e.g., AdaGrad (Duchi et al., 2011), adjust the learning rate adaptively, but they are proposed for general optimization problems and are not specialized for MARL.

In this paper, we propose AdaMa for adaptive learning rates in cooperative MARL. AdaMa dynamically evaluates the contributions of the actors and critic to the optimization and adaptively updates their learning rates based on these quantitative contributions. First, we examine the gain in Q-value contributed by the update of each actor and derive the direction along which the Q-value improves the most. We can thus update the vector of actors' learning rates towards the direction that maximizes the Q-value, which leads to diverse learning rates that explicitly capture the contributions of individual actors. Second, we consider the case where the critic and actors are updated simultaneously. If the critic's update causes a large change in the Q-value, we should give the critic a high learning rate since it is leading the learning. However, the optimization of the actors, which relies on the critic, would struggle with this fast-moving target, so the actors' learning rates should be reduced accordingly. Conversely, if the critic has reached a plateau, increasing the actors' learning rates quickly improves the actors, which in turn generates new experiences that boost the critic's learning. These two processes alternate during training, promoting the overall learning. Further, by incorporating a second-order approximation, we additionally capture the pairwise interactions between actors' updates and thus update the actors' learning rates more accurately towards maximizing the improvement of the Q-value.

We evaluate AdaMa in four typical multi-agent cooperation scenarios: going together, cooperative navigation, predator-prey, and clustering. Empirical results demonstrate that dynamically regulating the learning rates of the actors and critic according to their contributions to the change of the Q-value accelerates the learning and improves the performance, which can be further enhanced by additionally considering the effect of pairwise actors' updates. Visualizations of the learning rates during training clearly explain how and why AdaMa works.
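The critic/actor balancing idea above can be sketched as a toy rule. This is an illustrative simplification, not the paper's exact update: the function name, the `threshold` on the Q-value change, and the multiplicative `factor` are all assumptions for the sketch.

```python
def balance_learning_rates(delta_q_critic, lr_critic, lr_actors,
                           threshold=0.1, factor=1.05):
    """Toy balancing rule (hypothetical constants, not AdaMa's exact form).

    When the critic's update changes the Q-value a lot, the critic is leading
    the learning: raise its learning rate and damp the actors' so they do not
    chase a fast-moving target. When the critic has plateaued, do the opposite
    so the actors improve quickly and generate fresh experiences.
    """
    if abs(delta_q_critic) > threshold:
        lr_critic *= factor
        lr_actors = [lr / factor for lr in lr_actors]
    else:
        lr_critic /= factor
        lr_actors = [lr * factor for lr in lr_actors]
    return lr_critic, lr_actors

# Critic causing a large Q-value change -> critic sped up, actors damped.
lr_c, lr_a = balance_learning_rates(0.5, 1e-3, [1e-3, 1e-3])
```

In AdaMa the two regimes alternate over training rather than being switched by a fixed threshold; the sketch only conveys the direction of the adjustment.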

2. RELATED WORK

MARL. We consider the formulation of the decentralized partially observable Markov decision process (Dec-POMDP). There are $N$ agents interacting with the environment. At each timestep $t$, each agent $i$ receives a local observation $o_t^i$, takes an action $a_t^i$, and gets a shared reward $r_t$. The agents aim to maximize the expected return $\mathbb{E}\left[\sum_{t=0}^{T} \gamma^t r_t\right]$, where $\gamma$ is a discount factor and $T$ is the episode time horizon. Many methods (Lowe et al., 2017; Foerster et al., 2018; Rashid et al., 2018; Iqbal & Sha, 2019; Son et al., 2019) have been proposed for Dec-POMDP, which adopt centralized training and decentralized execution (CTDE). In many of these methods, a centralized critic learns a joint Q-function by minimizing the TD-error. In training, the critic is allowed to use the information of all agents. The actors, which only have access to local information, learn to maximize the Q-value learned by the critic. In execution, the critic is abandoned and the actors act in a decentralized manner.

Adaptive Learning Rate. Learning rate schedules reduce the learning rate during training according to a pre-defined schedule, e.g., time-based decay, step decay, or exponential decay. These schedules have to be defined in advance and depend heavily on the type of model and problem, which requires much expert knowledge. Some optimizers, such as AdaGrad (Duchi et al., 2011), AdaDelta (Zeiler, 2012), RMSprop (Tieleman & Hinton, 2012), and Adam (Kingma & Ba, 2015), provide adaptive learning rates to ease manual tuning. AdaGrad performs larger updates for sparser parameters and smaller updates for less sparse parameters, and the other methods are derived from AdaGrad. However, these methods only exploit gradient patterns for general optimization problems, offering no specialized way to boost multi-agent learning. WoLF (Bowling & Veloso, 2002) provides variable learning rates for stochastic games, but not for cooperation.

Meta Gradients for Hyperparameters. Some meta-learning methods employ hyperparameter gradients to tune hyperparameters automatically. Maclaurin et al. (2015) utilized the reverse-mode differentiation of hyperparameters to optimize step sizes, momentum schedules, weight initialization distributions, parameterized regularization schemes, and neural network architectures. Xu et al. (2018) computed the meta-gradient to update the discount factor and bootstrapping parameter in reinforcement learning. OL-AUX (Lin et al., 2019) uses the meta-gradient to automate the weights of auxiliary tasks. The proposed AdaMa can also be viewed as a meta-gradient method for adaptive learning rates in MARL.
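To make the contrast with general-purpose adaptive optimizers concrete, a minimal AdaGrad step can be written in a few lines: the accumulated squared gradient divides the step size per coordinate, so rarely-updated (sparse) parameters keep larger effective learning rates. The function name and constants here are illustrative, not from any particular library.

```python
import numpy as np

def adagrad_step(params, grads, accum, lr=0.1, eps=1e-8):
    """One AdaGrad update. `accum` holds the running sum of squared
    gradients; dividing by its square root shrinks the effective step
    for frequently-updated coordinates."""
    accum = accum + grads ** 2
    params = params - lr * grads / (np.sqrt(accum) + eps)
    return params, accum

params, accum = np.array([1.0, 1.0]), np.zeros(2)
# Coordinate 0 gets a gradient every step (dense); coordinate 1 only
# on the second step (sparse), so its second step is effectively larger.
for g in (np.array([1.0, 0.0]), np.array([1.0, 1.0])):
    params, accum = adagrad_step(params, g, accum)
```

Note that this per-coordinate scaling reacts only to gradient magnitudes; it has no notion of which agent's update actually improves the joint Q-value, which is the gap AdaMa targets.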

3. METHOD

In this section, we first introduce the single-critic version of MADDPG (Lowe et al., 2017), on which we instantiate AdaMa. AdaMa can also be instantiated on other MARL methods; the instantiation on MAAC (Iqbal & Sha, 2019) for discrete action spaces is given in Appendix A.1. Then, we use a Taylor approximation to evaluate the contributions of the critic's and actors' updates to the change of the Q-value. Based on the derived quantitative contributions, we dynamically adjust the direction of the vector of actors' learning rates and balance the learning rates between the critic and actors. Further, we incorporate a higher-order approximation to estimate the contributions more accurately.
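The first-order idea can be sketched numerically. Assuming each actor $i$ is updated along its gradient, $\Delta\theta_i = \eta_i g_i$ with $g_i = \nabla_{\theta_i} Q$, a first-order Taylor expansion gives $\Delta Q \approx \sum_i \eta_i \|g_i\|^2$; under a fixed norm for the learning-rate vector, the improvement is maximized by taking $\eta$ proportional to the contributions $\|g_i\|^2$. The function name and `lr_norm` budget below are assumptions for the sketch, not the paper's exact derivation.

```python
import numpy as np

def actor_lr_direction(actor_grads, lr_norm=1e-3):
    """First-order sketch: point the learning-rate vector along the
    per-actor contributions ||g_i||^2, rescaled to a fixed norm
    `lr_norm` (hypothetical budget)."""
    contributions = np.array([float(np.sum(g ** 2)) for g in actor_grads])
    direction = contributions / (np.linalg.norm(contributions) + 1e-12)
    return lr_norm * direction

# Actor 0's gradient is twice as long as actor 1's, so its contribution
# (squared norm) is 4x larger and it receives a 4x larger learning rate.
grads = [np.array([2.0, 0.0]), np.array([1.0, 0.0])]
etas = actor_lr_direction(grads)
```

Because the contributions are squared gradient norms, the resulting learning rates are always non-negative and automatically emphasize the actors whose updates currently move the Q-value the most.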

