ADAPTIVE LEARNING RATES FOR MULTI-AGENT REINFORCEMENT LEARNING

Abstract

In multi-agent reinforcement learning (MARL), the learning rates of actors and critic are mostly hand-tuned and fixed. This not only requires heavy tuning but, more importantly, limits the learning. Some optimizers have been proposed that adapt learning rates according to gradient patterns, but they target general optimization problems and do not take into consideration the characteristics of MARL. In this paper, we propose AdaMa to bring adaptive learning rates to cooperative MARL. AdaMa evaluates the contribution of actors' updates to the improvement of the Q-value and adaptively updates the learning rates of actors in the direction that maximally improves the Q-value. AdaMa can also dynamically balance the learning rates between the critic and actors according to their varying effects on the learning. Moreover, AdaMa can incorporate a second-order approximation to capture the contribution of pairwise actors' updates and thus update the learning rates of actors more accurately. Empirically, we show that AdaMa accelerates learning and improves performance in a variety of multi-agent scenarios, and visualizations of the learning rates during training clearly explain how and why AdaMa works.

1. INTRODUCTION

Recently, multi-agent reinforcement learning (MARL) has been applied to decentralized cooperative systems, e.g., autonomous driving (Shalev-Shwartz et al., 2016), smart grid control (Yang et al., 2018), and traffic signal control (Wei et al., 2019). Many MARL methods (Lowe et al., 2017; Foerster et al., 2018; Rashid et al., 2018; Iqbal & Sha, 2019; Son et al., 2019) have been proposed for multi-agent cooperation, following the paradigm of centralized training and decentralized execution. In many of these methods, a centralized critic learns the joint Q-function using the information of all agents, and the decentralized actors are updated towards maximizing the Q-value based on local observations. However, in these methods, the actors are usually assigned the same learning rate, which is not optimal for maximizing the Q-value, because some agents may be more critical than others to improving the Q-value and thus should have higher learning rates. On the other hand, the learning rates of actors and critic are often hand-tuned and fixed, and hence require heavy tuning. More importantly, over the course of training, the effect of actors and critic on the learning varies, so fixed learning rates will not be the best at every learning stage. Artificial schedules, e.g., time-based decay and step decay, are pre-defined and require expert knowledge about the model and the problem. Some optimizers, e.g., AdaGrad (Duchi et al., 2011), adjust the learning rate adaptively, but they are proposed for general optimization problems and are not specialized for MARL.

In this paper, we propose AdaMa for adaptive learning rates in cooperative MARL. AdaMa dynamically evaluates the contribution of actors and critic to the optimization and adaptively updates the learning rates based on their quantitative contributions. First, we examine the gain of Q-value contributed by the update of each actor. We derive the direction along which the Q-value improves the most.
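To make the first-order intuition concrete, here is a minimal sketch, not AdaMa's exact rule: if actor i takes the update Δθ_i = η_i g_i with g_i = ∇_{θ_i} Q, then a first-order expansion gives ΔQ ≈ Σ_i η_i ||g_i||², which, under a fixed total learning-rate budget, is maximized by setting η_i proportional to ||g_i||². The function name and the proportional allocation below are our illustration, assumed for this sketch.

```python
import numpy as np

def adaptive_actor_lrs(actor_grads, total_lr=0.01):
    """Illustrative first-order allocation (not AdaMa's exact formula).

    actor_grads: list of per-actor gradient vectors g_i = dQ/dtheta_i.
    Since the update dtheta_i = eta_i * g_i changes Q by roughly
    sum_i eta_i * ||g_i||^2, the gain under a fixed budget
    sum_i eta_i = total_lr is maximized by allocating each eta_i
    in proportion to its squared gradient norm ||g_i||^2.
    """
    contribs = np.array([np.sum(g ** 2) for g in actor_grads])
    if contribs.sum() == 0.0:
        # No gradient signal: fall back to a uniform split.
        return np.full(len(actor_grads), total_lr / len(actor_grads))
    return total_lr * contribs / contribs.sum()
```

For example, an actor whose gradient norm is twice another's receives four times the learning rate, so the agents currently most critical to improving the Q-value learn fastest.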
Thus, we can update the vector of learning rates of all actors towards the direction of maximizing the Q-value, which leads to diverse learning rates that explicitly capture the contributions of actors. Second, we consider the case where the critic and actors are updated simultaneously. If the critic's update causes a large change of the Q-value, we should give a high learning rate to the critic, since it is leading the learning. However, the optimization of actors, which relies on the critic, would strug-

