ADAPTIVE LEARNING RATES FOR MULTI-AGENT REINFORCEMENT LEARNING

Abstract

In multi-agent reinforcement learning (MARL), the learning rates of the actors and the critic are mostly hand-tuned and fixed. This not only requires heavy tuning but, more importantly, limits the learning. Some optimizers that adapt learning rates according to gradient patterns have been proposed for general optimization, but they do not take the characteristics of MARL into consideration. In this paper, we propose AdaMa to bring adaptive learning rates to cooperative MARL. AdaMa evaluates the contribution of the actors' updates to the improvement of the Q-value and adaptively updates the actors' learning rates in the direction that maximally improves the Q-value. AdaMa can also dynamically balance the learning rates between the critic and actors according to their varying effects on the learning. Moreover, AdaMa can incorporate a second-order approximation to capture the contribution of pairwise actors' updates and thus update the actors' learning rates more accurately. Empirically, we show that AdaMa accelerates learning and improves performance in a variety of multi-agent scenarios, and visualizations of the learning rates during training clearly explain how and why AdaMa works.

1. INTRODUCTION

Recently, multi-agent reinforcement learning (MARL) has been applied to decentralized cooperative systems, e.g., autonomous driving (Shalev-Shwartz et al., 2016), smart grid control (Yang et al., 2018), and traffic signal control (Wei et al., 2019). Many MARL methods (Lowe et al., 2017; Foerster et al., 2018; Rashid et al., 2018; Iqbal & Sha, 2019; Son et al., 2019) have been proposed for multi-agent cooperation, following the paradigm of centralized training and decentralized execution. In many of these methods, a centralized critic learns the joint Q-function using the information of all agents, and the decentralized actors are updated towards maximizing the Q-value based on local observations. However, the actors are usually assigned the same learning rate, which is not optimal for maximizing the Q-value: some agents might be more critical than others to improving the Q-value and thus should have higher learning rates. Moreover, the learning rates of the actors and critic are often hand-tuned and fixed, and hence require heavy tuning. More importantly, over the course of training, the effect of the actors and critic on the learning varies, so fixed learning rates will not be the best at every learning stage. Artificial schedules, e.g., time-based decay and step decay, are pre-defined and require expert knowledge about the model and problem. Some optimizers, e.g., AdaGrad (Duchi et al., 2011), adjust the learning rate adaptively, but they are designed for general optimization problems, not specialized for MARL.

In this paper, we propose AdaMa for adaptive learning rates in cooperative MARL. AdaMa dynamically evaluates the contribution of the actors and critic to the optimization and adaptively updates the learning rates based on their quantitative contributions. First, we examine the gain of Q-value contributed by the update of each actor and derive the direction along which the Q-value improves the most.
Thus, we can update the vector of actors' learning rates toward the direction that maximizes the Q-value, which leads to diverse learning rates that explicitly capture the contributions of the actors. Second, we consider the case where the critic and actors are updated simultaneously. If the critic's update causes a large change of the Q-value, we should give a high learning rate to the critic, since it is leading the learning. However, the optimization of the actors, which relies on the critic, would struggle with the fast-moving target. Thus, the actors' learning rates should be reduced accordingly. On the other hand, if the critic has reached a plateau, increasing the actors' learning rates could quickly improve the actors, which in turn generates new experiences to boost the critic's learning. These two processes alternate during training, promoting the overall learning. Further, by incorporating the second-order approximation, we additionally capture the pairwise interaction between actors' updates so as to update the actors' learning rates more accurately toward maximizing the improvement of the Q-value. We evaluate AdaMa in four typical multi-agent cooperation scenarios: going together, cooperative navigation, predator-prey, and clustering. Empirical results demonstrate that dynamically regulating the learning rates of the actors and critic according to their contributions to the change of the Q-value accelerates learning and improves performance, which can be further enhanced by additionally considering the effect of pairwise actors' updates. Visualizations of the learning rates during training clearly explain how and why AdaMa works.

2. RELATED WORK

MARL. We consider the formulation of decentralized partially observable Markov decision process (Dec-POMDP). There are $N$ agents interacting with the environment. At each timestep $t$, each agent $i$ receives a local observation $o^i_t$, takes an action $a^i_t$, and gets a shared reward $r_t$. The agents aim to maximize the expected return $\mathbb{E}\left[\sum_{t=0}^{T}\gamma^t r_t\right]$, where $\gamma$ is a discount factor and $T$ is the episode time horizon. Many methods (Lowe et al., 2017; Foerster et al., 2018; Rashid et al., 2018; Iqbal & Sha, 2019; Son et al., 2019) have been proposed for Dec-POMDP, which adopt centralized training and decentralized execution (CTDE). In many of these methods, a centralized critic learns a joint Q-function by minimizing the TD-error; during training, the critic is allowed to use the information of all agents. The actors, which only have access to local information, learn to maximize the Q-value learned by the critic. In execution, the critic is abandoned and the actors act in a decentralized manner.

Adaptive Learning Rate. Learning rate schedules reduce the learning rate during training according to a pre-defined schedule, including time-based decay, step decay, and exponential decay. The schedules have to be defined in advance and depend heavily on the type of model and problem, which requires much expert knowledge. Some optimizers, such as AdaGrad (Duchi et al., 2011), AdaDelta (Zeiler, 2012), RMSprop (Tieleman & Hinton, 2012), and Adam (Kingma & Ba, 2015), provide adaptive learning rates to ease manual tuning. AdaGrad performs larger updates for sparser parameters and smaller updates for less sparse parameters, and the other methods are derived from AdaGrad. However, these methods only deal with the gradient pattern for general optimization problems, offering no specialized way to boost multi-agent learning. WoLF (Bowling & Veloso, 2002) provides variable learning rates for stochastic games, but not for cooperation.

Meta Gradients for Hyperparameters. Some meta-learning methods employ hyperparameter gradients to tune hyperparameters automatically. Maclaurin et al. (2015) utilized the reverse-mode differentiation of hyperparameters to optimize step sizes, momentum schedules, weight initialization distributions, parameterized regularization schemes, and neural network architectures. Xu et al. (2018) computed the meta-gradient to update the discount factor and bootstrapping parameter in reinforcement learning. OL-AUX (Lin et al., 2019) uses the meta-gradient to adapt the weights of auxiliary tasks. The proposed AdaMa can also be viewed as a meta-gradient method for adaptive learning rates in MARL.

3. METHOD

In this section, we first introduce the single-critic version of MADDPG (Lowe et al., 2017), on which we instantiate AdaMa. However, AdaMa can also be instantiated on other MARL methods; the instantiation on MAAC (Iqbal & Sha, 2019) for discrete action spaces is given in Appendix A.1. Then, we use the Taylor approximation to evaluate the contributions of the critic's and actors' updates to the change of the Q-value. Based on the derived quantitative contributions, we dynamically adjust the direction of the vector of actors' learning rates and balance the learning rates between the critic and actors. Further, we incorporate a higher-order approximation to estimate the contributions more accurately.

3.1. SINGLE-CRITIC MADDPG

In mixed cooperative-competitive settings, each MADDPG agent learns an actor $\pi_i$ and a critic for its local reward. However, since the agents share the reward in Dec-POMDP, we only maintain a single shared critic, which takes the observation vector $\vec{o}$ and the action vector $\vec{a}$ and outputs the Q-value, as illustrated in Figure 1. The critic, parameterized by $\phi$, is trained by minimizing the TD-error $\delta$:

$$\mathbb{E}_{(\vec{o},\vec{a},r,\vec{o}')\sim D}\left[\left(Q(\vec{o},\vec{a})-y\right)^2\right], \qquad y = r + \gamma Q^-\!\left(\vec{o}',\, \pi^-_i(o'_i)\right),$$

where $Q^-$ is the target critic, $\pi^-_i$ is the target actor, and $D$ is the replay buffer. Each actor $\pi_i$ (parameterized by $\theta_i$) is updated to maximize the learned Q-value by gradient ascent; the gradient of $\theta_i$ is $\frac{\partial Q(\vec{o},\vec{a})}{\partial a_i}\frac{\partial a_i}{\partial \theta_i}$. We denote the learning rates of each actor $i$ and the critic as $l_{a_i}$ and $l_c$, respectively.
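As an illustrative sketch only (not the authors' implementation), the single-critic update can be written with a toy linear critic; all names here (`q_value`, `td_error`, `critic_step`, `phi`) are hypothetical stand-ins:

```python
import numpy as np

def q_value(phi, obs, act):
    # Toy linear critic: Q(o, a) = phi . [o; a]
    return phi @ np.concatenate([obs, act])

def td_error(phi, phi_target, obs, act, reward, next_obs, next_act, gamma=0.95):
    # delta = Q(o, a) - y, with y = r + gamma * Q^-(o', a'),
    # where a' comes from the target actors
    y = reward + gamma * q_value(phi_target, next_obs, next_act)
    return q_value(phi, obs, act) - y

def critic_step(phi, phi_target, obs, act, reward, next_obs, next_act,
                lr_c, gamma=0.95):
    # For the linear critic, the gradient of 0.5 * delta^2 w.r.t. phi
    # is delta * dQ/dphi = delta * [o; a]
    delta = td_error(phi, phi_target, obs, act, reward, next_obs, next_act, gamma)
    grad = delta * np.concatenate([obs, act])
    return phi - lr_c * grad
```

A single gradient step with a small `lr_c` moves the Q-value toward the (frozen) TD target, shrinking the TD-error.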

3.2. ADAPTIVE l a DIRECTION

First, suppose that the critic is trained and frozen, and we only update the actors. By expanding the Q-function, we can estimate the gain of Q-value contributed by the actors' updates via the Taylor approximation:

$$\begin{aligned}
\Delta Q &= Q(\vec{o}, \vec{a}+\Delta\vec{a}) - Q(\vec{o},\vec{a}) \approx Q(\vec{o},\vec{a}) + \sum_{i=1}^{N} \Delta a_i \frac{\partial Q(\vec{o},\vec{a})}{\partial a_i}^T - Q(\vec{o},\vec{a}) \\
&= \sum_{i=1}^{N} \left[\pi_i\!\left(\theta_i + l_{a_i}\frac{\partial Q(\vec{o},\vec{a})}{\partial \theta_i}\right) - \pi_i(\theta_i)\right] \frac{\partial Q(\vec{o},\vec{a})}{\partial a_i}^T \\
&\approx \sum_{i=1}^{N} l_{a_i} \frac{\partial Q(\vec{o},\vec{a})}{\partial \theta_i}\frac{\partial a_i}{\partial \theta_i}^T \frac{\partial Q(\vec{o},\vec{a})}{\partial a_i}^T
= \sum_{i=1}^{N} l_{a_i} \frac{\partial Q(\vec{o},\vec{a})}{\partial \theta_i}\frac{\partial Q(\vec{o},\vec{a})}{\partial \theta_i}^T
= \vec{l}_a \cdot \frac{\partial Q}{\partial \vec{\theta}}\frac{\partial Q}{\partial \vec{\theta}}^T.
\end{aligned}$$

Assuming the magnitude of the learning rate vector $\vec{l}_a$ is a fixed small constant $l_a$, the largest $\Delta Q$ is obtained when the direction of $\vec{l}_a$ is consistent with the direction of the vector $\frac{\partial Q}{\partial \vec{\theta}}\frac{\partial Q}{\partial \vec{\theta}}^T$, whose $i$-th component is $\frac{\partial Q}{\partial \theta_i}\frac{\partial Q}{\partial \theta_i}^T$. Thus, we can softly update $\vec{l}_a$ toward the direction of $\frac{\partial Q}{\partial \vec{\theta}}\frac{\partial Q}{\partial \vec{\theta}}^T$ to improve the Q-value:

$$\vec{l}_a = \alpha \vec{l}_a + (1-\alpha)\, l_a\, \frac{\partial Q}{\partial \vec{\theta}}\frac{\partial Q}{\partial \vec{\theta}}^T \Big/ \left\|\frac{\partial Q}{\partial \vec{\theta}}\frac{\partial Q}{\partial \vec{\theta}}^T\right\|, \qquad \vec{l}_a = l_a\, \frac{\vec{l}_a}{\|\vec{l}_a\|}, \tag{1}$$

where the second step normalizes the magnitude of $\vec{l}_a$ to $l_a$, and $\alpha$ is a parameter that controls the soft update. From another perspective, update rule (1) can be seen as updating $\vec{l}_a$ by gradient ascent to increase the Q-value the most, since $\frac{\partial \Delta Q}{\partial \vec{l}_a} = \frac{\partial Q}{\partial \vec{\theta}}\frac{\partial Q}{\partial \vec{\theta}}^T$.
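Update rule (1) can be sketched as follows, assuming the per-actor gradients $\partial Q/\partial\theta_i$ are already available as flat arrays (function and variable names are hypothetical):

```python
import numpy as np

def update_lr_direction(l_a, grads, alpha=0.9):
    """Soft-update the vector of actors' learning rates toward the
    direction v, where v_i = ||dQ/dtheta_i||^2, then renormalize so that
    ||l_a|| keeps its original magnitude."""
    l_a = np.asarray(l_a, dtype=float)
    magnitude = np.linalg.norm(l_a)          # fixed magnitude ||l_a||
    v = np.array([g @ g for g in grads])     # v_i = dQ/dtheta_i . dQ/dtheta_i^T
    v_dir = v / np.linalg.norm(v)            # unit direction of v
    l_a = alpha * l_a + (1 - alpha) * magnitude * v_dir
    return magnitude * l_a / np.linalg.norm(l_a)
```

An actor with a larger gradient norm, i.e., a larger contribution to $\Delta Q$, ends up with a larger learning rate, while the overall magnitude of the vector stays fixed.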

3.3. ADAPTIVE l c AND l a

In the previous section, we assumed that the critic is frozen. However, in MADDPG and other MARL methods, the critic and actors are trained simultaneously. Therefore, we investigate the change of Q-value by additionally considering the critic's update:

$$\begin{aligned}
\Delta Q &= Q(\phi+\Delta\phi, \vec{o}, \vec{a}+\Delta\vec{a}) - Q(\phi,\vec{o},\vec{a}) \\
&\approx Q(\phi,\vec{o},\vec{a}) + \sum_{i=1}^{N}\Delta a_i \frac{\partial Q(\phi,\vec{o},\vec{a})}{\partial a_i}^T + \Delta\phi\, \frac{\partial Q(\phi,\vec{o},\vec{a})}{\partial \phi}^T - Q(\phi,\vec{o},\vec{a}) \\
&\approx \vec{l}_a \cdot \frac{\partial Q}{\partial \vec{\theta}}\frac{\partial Q}{\partial \vec{\theta}}^T - l_c\, \frac{\partial\delta}{\partial\phi}\frac{\partial Q}{\partial\phi}^T.
\end{aligned}$$

We can see that $\Delta Q$ is contributed by the updates of both the critic and actors. In principle, the critic's learning is prioritized, since the actors' learning is determined by the improved critic. When the critic's update causes a large change of the Q-value, the critic is leading the learning, and we should assign it a high learning rate. However, the optimization of the actors, which relies on the current critic, would struggle with the fast-moving target, so the actors' learning rates should be reduced. On the other hand, when the critic has reached a plateau, increasing the actors' learning rates could quickly optimize the actors, which further injects new experiences into the replay buffer to boost the critic's learning, thus promoting the overall learning. The contributions of the actors' updates are always nonnegative, but the critic's update might either increase or decrease the Q-value. Therefore, we use the absolute value $\left|\frac{\partial\delta}{\partial\phi}\frac{\partial Q}{\partial\phi}^T\right|$ to evaluate the contribution of the critic to the change of Q-value. Based on the principles above, we adaptively adjust $l_c$ and $l_a$ by the update rules:

$$l_c = \alpha l_c + (1-\alpha)\, l\, \mathrm{clip}\!\left(\left|\frac{\partial\delta}{\partial\phi}\frac{\partial Q}{\partial\phi}^T\right|\Big/ m,\ \epsilon,\ 1-\epsilon\right), \qquad l_a = l - l_c. \tag{2}$$

The hyperparameters $\alpha$, $m$, $l$, and $\epsilon$ have intuitive interpretations and are easy to tune: $\alpha$ controls the soft update and $m$ controls the target value of $l_c$. The clip function and the small constant $\epsilon$ prevent the learning rates from being too large or too small. Therefore, AdaMa works as follows: first update $l_c$ and obtain $l_a$ using (2), then regulate the direction and magnitude of $\vec{l}_a$ according to (1).
As Liessner et al. (2019) pointed out, the actor should have a lower learning rate than the critic, and a high actor learning rate leads to a performance breakdown. Also, empirically, in DDPG (Lillicrap et al., 2016) the critic's learning rate is set 10 times higher than the actor's. However, we believe such a setting only partially addresses the problem: if the actors' learning rates are always low during training, the actors learn slowly and thus the learning is limited. Therefore, AdaMa decreases $l_c$ and increases $l_a$ when the learning of the critic reaches a plateau, which avoids the fast-moving target while speeding up the learning.
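Update rule (2) can be sketched as below; the gradient vectors and the default hyperparameter values are illustrative stand-ins, not the paper's settings:

```python
import numpy as np

def update_lc_la(l_c, grad_delta_phi, grad_q_phi, l_total=0.01,
                 alpha=0.9, m=50.0, eps=0.05):
    """Soft-update the critic's learning rate toward a target set by the
    critic's contribution |d(delta)/d(phi) . dQ/d(phi)^T|, clipped to
    [eps, 1 - eps]; the actors receive the remaining budget l - l_c."""
    contrib = abs(grad_delta_phi @ grad_q_phi)
    target = np.clip(contrib / m, eps, 1.0 - eps)
    l_c = alpha * l_c + (1 - alpha) * l_total * target
    l_a = l_total - l_c
    return l_c, l_a
```

When the critic's contribution is large (e.g., large TD-error), `l_c` rises toward $(1-\epsilon)l$ and the actors slow down; when the critic plateaus, `l_c` decays toward $\epsilon l$ and the actors' budget grows.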

3.4. SECOND-ORDER APPROXIMATION

Under the first-order Taylor approximation, actor $i$'s contribution to $\Delta Q$ is only related to the change of $a_i$, without capturing the joint effect with other agents' updates. However, when there are strong correlations between agents, the increase of the Q-value cannot be sufficiently estimated as the sum of individual contributions of each actor's update; it is instead a result of the joint update. To estimate the actors' contributions more precisely, we extend AdaMa to the second-order Taylor approximation to take pairwise agents' updates into account:

$$\begin{aligned}
\Delta Q &= Q(\vec{o}, \vec{a}+\Delta\vec{a}) - Q(\vec{o},\vec{a}) \\
&\approx Q(\vec{o},\vec{a}) + \sum_{i=1}^{N}\Delta a_i \frac{\partial Q(\vec{o},\vec{a})}{\partial a_i}^T + \frac{1}{2}\sum_{i,j=1}^{N}\Delta a_i \frac{\partial^2 Q(\vec{o},\vec{a})}{\partial a_i \partial a_j}\Delta a_j^T - Q(\vec{o},\vec{a}) \\
&\approx \sum_{i=1}^{N} l_{a_i}\frac{\partial Q(\vec{o},\vec{a})}{\partial\theta_i}\frac{\partial Q(\vec{o},\vec{a})}{\partial\theta_i}^T + \frac{1}{2}\sum_{i,j=1}^{N} l_{a_i} l_{a_j}\, \frac{\partial Q(\vec{o},\vec{a})}{\partial\theta_i}\frac{\partial a_i}{\partial\theta_i}^T \frac{\partial^2 Q(\vec{o},\vec{a})}{\partial a_i\partial a_j}\frac{\partial a_j}{\partial\theta_j}\frac{\partial Q(\vec{o},\vec{a})}{\partial\theta_j}^T.
\end{aligned}$$

As the actors are updated by the first-order gradient, we still estimate $\Delta\vec{a}$ using the first-order approximation and compute the second-order $\Delta Q$ on this first-order $\Delta\vec{a}$. Then, the gradient of $l_{a_i}$ is

$$\frac{\partial\Delta Q}{\partial l_{a_i}} = \frac{\partial Q}{\partial\theta_i}\frac{\partial Q}{\partial\theta_i}^T + \frac{1}{2}\sum_{j=1}^{N} l_{a_j}\frac{\partial Q}{\partial\theta_i}\frac{\partial a_i}{\partial\theta_i}^T\frac{\partial^2 Q}{\partial a_i\partial a_j}\frac{\partial a_j}{\partial\theta_j}\frac{\partial Q}{\partial\theta_j}^T + \frac{1}{2}\sum_{j=1}^{N} l_{a_j}\frac{\partial Q}{\partial\theta_j}\frac{\partial a_j}{\partial\theta_j}^T\frac{\partial^2 Q}{\partial a_j\partial a_i}\frac{\partial a_i}{\partial\theta_i}\frac{\partial Q}{\partial\theta_i}^T.$$

Similarly, $\vec{l}_a$ can be updated as:

$$\vec{l}_a = \alpha\vec{l}_a + (1-\alpha)\, l_a\, \frac{\partial\Delta Q}{\partial\vec{l}_a}\Big/\left\|\frac{\partial\Delta Q}{\partial\vec{l}_a}\right\|, \qquad \vec{l}_a = l_a\, \frac{\vec{l}_a}{\|\vec{l}_a\|}. \tag{3}$$
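A sketch of the gradient in update rule (3), assuming the first-order terms $v_i = \|\partial Q/\partial\theta_i\|^2$, the action-space vectors $u_i = \frac{\partial Q}{\partial\theta_i}\frac{\partial a_i}{\partial\theta_i}^T$, and the Hessian blocks $H_{ij} = \frac{\partial^2 Q}{\partial a_i\partial a_j}$ have been precomputed (all names hypothetical):

```python
import numpy as np

def second_order_lr_grad(l_a, v, u, H):
    """Gradient of the second-order DeltaQ estimate w.r.t. each l_a_i.
    v[i]    : ||dQ/dtheta_i||^2                 (first-order term)
    u[i]    : dQ/dtheta_i @ (da_i/dtheta_i)^T   (vector in action space)
    H[i][j] : d^2 Q / (da_i da_j)               (pairwise Hessian block)
    """
    n = len(l_a)
    grad = np.array(v, dtype=float)
    for i in range(n):
        for j in range(n):
            # the two symmetric pairwise terms, each weighted by 1/2
            grad[i] += 0.5 * l_a[j] * (u[i] @ H[i][j] @ u[j])
            grad[i] += 0.5 * l_a[j] * (u[j] @ H[j][i] @ u[i])
    return grad
```

With all Hessian blocks zero, this reduces to the first-order direction of rule (1); nonzero off-diagonal blocks shift the learning rates of actors whose joint updates reinforce each other.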

4. EXPERIMENTS

We validate AdaMa in four cooperation scenarios with continuous observation space and continuous action space, illustrated in Figure 2. In these scenarios, agents observe the relative positions of other agents, landmarks, and other items, and take two-dimensional actions in $[-1, 1]$ as physical velocity.

• Going Together. There are 2 agents and 1 landmark. The reward is $-0.5(d_i + d_j) - d_{ij}$, where $d_i$ is the distance from agent $i$ to the landmark and $d_{ij}$ is the distance between the two agents. The agents have to go to the landmark together, avoiding moving away from each other.
• Cooperative Navigation. There are 4 agents and 4 corresponding landmarks. The reward is $-\max_i(d_i)$, where $d_i$ is the distance from agent $i$ to landmark $i$; the slowest agent determines the reward.
• Predator-Prey. 4 slower agents learn to chase a faster rule-based prey. Each time one of the agents collides with the prey, the agents get a reward of +1.
• Clustering. 8 agents learn to cluster together. The reward is $-d_i$, where $d_i$ is the distance from agent $i$ to the center of the agents' positions. Since the center changes along with the agents' movements, there are strong interactions between agents.

To investigate the effectiveness of AdaMa and for ablation, we evaluate the following methods:

• AdaMa adjusts $l_c$ and $l_a$ using (2) and $\vec{l}_a$ according to (1).
• Fixed lr uses grid search to find the optimal combination of $l_c$ and $l_a$ from 0.01 to 0.001 with step 0.001. The learning rate of each agent is set to $l_a/\sqrt{N}$.
• Adaptive $l_a$ direction sets $l_c$ and $l_a$ as in Fixed lr and only adjusts the direction of $\vec{l}_a$ using (1). Additionally, Adaptive $l_a$ direction (2nd) uses update rule (3) for the second-order approximation.
• Adaptive $l_c$ and $l_a$ adjusts $l_c$ and $l_a$ using (2) and sets $l_{a_i} = l_a/\sqrt{N}$.
• AdaGrad is an adaptive learning rate optimizer that performs larger updates for sparser parameters and smaller updates for less sparse parameters. The initial learning rates are set as in Fixed lr.

Except for AdaGrad, all other methods use the SGD optimizer without momentum. More details about the experimental settings and hyperparameters are available in Appendix A.3. We trained all models for five runs with different random seeds. All learning curves are plotted with mean and standard deviation.
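As a minimal sketch of two of the scenario rewards described above (hypothetical helper names; agent and landmark positions are rows of NumPy arrays):

```python
import numpy as np

def navigation_reward(agent_pos, landmark_pos):
    # Cooperative navigation: the shared reward is minus the distance of
    # the slowest (farthest) agent to its corresponding landmark, -max_i(d_i)
    d = np.linalg.norm(agent_pos - landmark_pos, axis=1)
    return -np.max(d)

def clustering_reward(agent_pos):
    # Clustering: each agent is penalized by its distance d_i to the
    # center of all agents' positions; returns the per-agent rewards -d_i
    center = agent_pos.mean(axis=0)
    return -np.linalg.norm(agent_pos - center, axis=1)
```

Note how in clustering the center itself moves with every agent, so every actor's update changes every other agent's reward, which is exactly the strong-interaction case motivating the second-order approximation.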

4.1. PERFORMANCE OF ADAPTIVE l a DIRECTION

As shown in Figures 3(a) and 3(c), Adaptive $l_a$ direction converges to a higher reward than Fixed lr, which treats each agent as equally important. To explain this explicitly, we visualize the normalized actors' learning rates $\vec{l}_a/l_a$ in Figure 4 for one run; more results are available in Appendix A.4. In going together and predator-prey, the actors' learning rates fluctuate dynamically and alternately, as depicted in Figures 4(a) and 4(b). An actor has a much higher learning rate than the other actors in different periods, meaning that this actor is critical to the learning. The direction of $\vec{l}_a$ adapts to the changing contributions during the learning, assigning higher learning rates to the actors that contribute more to $\Delta Q$. In clustering, the center is determined by all agents' positions, and the actors' updates make similar contributions to $\Delta Q$, leading to similar learning rates for the actors; that is why Adaptive $l_a$ direction is not beneficial in this scenario. Moreover, in single-critic MADDPG the gradient of an actor depends on the current policies of the other actors. If the other actors are updating at a similar rate, the update of this actor becomes unstable, since the changes of others' policies are invisible and unpredictable. In our method, the agents critical to increasing the Q-value learn fast while the other agents have low learning rates, which partly attenuates the instability.

4.2. PERFORMANCE OF ADAPTIVE l c AND l a

As illustrated in Figures 3(b), 3(c), and 3(d), Adaptive $l_c$ and $l_a$ learns faster than Fixed lr. To interpret the results, we plot $l_c$ and $l_a$ during training in Figure 5 and find that $l_c$ and $l_a$ rise and fall alternately and periodically. When the update of the critic greatly impacts $\Delta Q$, e.g., at the beginning with large TD-error, the fast-moving Q-value, which is the optimization target of the actors, might cause a performance breakdown if the actors are also learning fast. In this situation, our method adaptively speeds up the learning of the critic and slows down the learning of the actors for stability. After a while, the TD-error becomes small and the critic reaches a plateau. According to update rule (2), the learning of the actors is accelerated whilst the learning rate of the critic falls, which keeps the target of the actors stable and thus avoids the breakdown. The fast-improving actors generate new experiences, which change the distribution in the replay buffer and increase the TD-error; as a consequence, the learning rate of the critic rises again. Therefore, the learning rates of the critic and actors fluctuate alternately, promoting the overall learning continuously. In going together, the alternate fluctuation is not obvious, so Adaptive $l_c$ and $l_a$ performs worse than Fixed lr with grid search. Combining the two adaptive mechanisms, AdaMa learns faster and converges to a higher reward than all other baselines in Figure 3, where adaptive $l_c$ and $l_a$ is the mechanism that brings the main improvement. Since Fixed lr has to search 100 combinations, its cost is prohibitive. Despite adaptively adjusting the learning rates, AdaGrad does not show competitive performance, since it only focuses on the gradient pattern, ignoring the characteristics of MARL.

4.3. PERFORMANCE OF SECOND-ORDER APPROXIMATION

In Figure 3, the performance gain of Adaptive $l_a$ direction is limited, which we attribute to the first-order approximation being relatively rough when an actor's update affects other actors' updates. We apply the second-order approximation in Adaptive $l_a$ direction (2nd) and find that it achieves better results, as shown in Figure 6. Comparing the later episodes in Figures 4(a) and 4(c), there is a larger gap between the two actors' learning rates under the second-order approximation. This accurately reflects that the later training is dominated by one actor, which is the reason for the higher reward. In Figure 4(d), there are obvious ups and downs in the learning rates of the actors before convergence ($1\times10^4$ episodes), after which the fluctuation becomes gentle. The second-order approximation, which captures the pairwise effect of agents' updates on $\Delta Q$, obtains a more accurate update of the learning rates and eventually leads to better performance.

4.4. TUNING HYPERPARAMETER m

The hyperparameter $m$ controls the target value of the critic's learning rate. If $m$ is too large or too small, the learning rate will reach the boundary value $\epsilon l$ or $(1-\epsilon)l$, which destroys the adaptability and hampers the learning process. An empirical approach for tuning is setting $m$ to be the mean of $\left|\frac{\partial\delta}{\partial\phi}\frac{\partial Q}{\partial\phi}^T\right|$; we plot the learning rates under different values of $m$ in Figure 7(b) to interpret the robustness. Since most of the time $\left|\frac{\partial\delta}{\partial\phi}\frac{\partial Q}{\partial\phi}^T\right|$ is higher than 60 or lower than 40, similar learning rate patterns are observed when $m$ is between 40 and 60, which verifies that there is high fault tolerance in $m$. Although Adaptive $l_c$ and $l_a$ converges to a similar reward as Fixed lr, the former learns faster and is much easier to tune.
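The empirical choice of m described above can be sketched as tracking a running mean of the critic's contribution; `running_mean_m` and its defaults are hypothetical, not the paper's implementation:

```python
def running_mean_m(contributions, m_init=50.0, beta=0.99):
    """One empirical choice for m: an exponential moving average of the
    observed critic contributions |ddelta/dphi . dQ/dphi^T|, so that
    clip(|contribution| / m, eps, 1 - eps) stays in its sensitive range."""
    m = m_init
    for c in contributions:
        m = beta * m + (1 - beta) * abs(c)
    return m
```

Keeping m near the typical contribution magnitude means the clipped ratio neither saturates at $\epsilon$ nor at $1-\epsilon$ all the time, preserving the adaptability of rule (2).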

5. CONCLUSION

In this paper, we proposed AdaMa for adaptive learning rates in MARL. AdaMa adaptively updates the vector of actors' learning rates in the direction that maximally improves the Q-value. It also dynamically balances the learning rates between the critic and actors during learning. Moreover, AdaMa can incorporate a higher-order approximation to update the actors' learning rates more accurately. Empirically, we show that AdaMa accelerates learning and improves performance in a variety of multi-agent scenarios.

A APPENDIX

A.1 INSTANTIATING ADAMA ON MAAC

AdaMa can be applied to other multi-agent actor-critic methods, modified according to the gradient computation. We present how to instantiate AdaMa on MAAC (Iqbal & Sha, 2019). In discrete action spaces, $\Delta\vec{a}$ does not meet the assumption of the Taylor approximation. In practice, however, it is feasible to learn the value function $Q(\phi,\vec{o},\vec{\pi})$, taking as input the action distribution $\pi_i$ instead of the action $a_i$. Since the update of the actor is different from that in MADDPG, the change of Q-value is written as

$$\begin{aligned}
\Delta Q &= Q(\phi+\Delta\phi,\vec{o},\vec{\pi}+\Delta\vec{\pi}) - Q(\phi,\vec{o},\vec{\pi}) \\
&\approx Q(\phi,\vec{o},\vec{\pi}) + \sum_{i=1}^{N}\Delta\pi_i \frac{\partial Q(\phi,\vec{o},\vec{\pi})}{\partial\pi_i}^T + \Delta\phi\,\frac{\partial Q(\phi,\vec{o},\vec{\pi})}{\partial\phi}^T - Q(\phi,\vec{o},\vec{\pi}) \\
&= \sum_{i=1}^{N}\left[\pi_i\!\left(\theta_i + l_{a_i}\frac{\partial \log\pi_i A_i}{\partial\theta_i}\right)-\pi_i(\theta_i)\right]\frac{\partial Q(\phi,\vec{o},\vec{\pi})}{\partial\pi_i}^T + \Delta\phi\,\frac{\partial Q(\phi,\vec{o},\vec{\pi})}{\partial\phi}^T \\
&\approx \sum_{i=1}^{N} l_{a_i}\frac{\partial\log\pi_i A_i}{\partial\theta_i}\frac{\partial\pi_i}{\partial\theta_i}^T\frac{\partial Q(\phi,\vec{o},\vec{\pi})}{\partial\pi_i}^T - l_c\frac{\partial\delta}{\partial\phi}\frac{\partial Q(\phi,\vec{o},\vec{\pi})}{\partial\phi}^T \\
&= \sum_{i=1}^{N} l_{a_i}\frac{\partial\log\pi_i A_i}{\partial\theta_i}\frac{\partial Q}{\partial\theta_i}^T - l_c\frac{\partial\delta}{\partial\phi}\frac{\partial Q}{\partial\phi}^T,
\end{aligned}$$

where $A_i$ is the advantage function of agent $i$ proposed in MAAC. Rewriting the gradient of $l_{a_i}$ as $\frac{\partial\Delta Q}{\partial l_{a_i}} = \frac{\partial\log\pi_i A_i}{\partial\theta_i}\frac{\partial Q}{\partial\theta_i}^T$, the AdaMa implementation on MAAC is the same as that on MADDPG.

A.2 ADAMA ALGORITHM

For completeness, we provide the AdaMa algorithm on MADDPG below.

Algorithm 1 AdaMa on MADDPG
1: Initialize critic network $\phi$, actor networks $\theta_i$, and target networks; initialize the learning rates $l_c$ and $\vec{l}_a$; initialize replay buffer $D$
2: for episode = 1, . . . , M do
3:    Store transition $(\vec{o}_t, \vec{a}_t, r_t, \vec{o}_{t+1})$ in $D$
4:    Adjust $l_c$ and $l_a$ by $l_c = \alpha l_c + (1-\alpha)\,l\,\mathrm{clip}\!\left(\left|\frac{\partial\delta}{\partial\phi}\frac{\partial Q}{\partial\phi}^T\right|/m,\ \epsilon,\ 1-\epsilon\right)$, $l_a = l - l_c$
5:    Adjust $\vec{l}_a$ by $\vec{l}_a = \alpha\vec{l}_a + (1-\alpha)\,l_a\frac{\partial\Delta Q}{\partial\vec{l}_a}\big/\left\|\frac{\partial\Delta Q}{\partial\vec{l}_a}\right\|$, $\vec{l}_a = l_a\frac{\vec{l}_a}{\|\vec{l}_a\|}$
6:    Update the critic by $\phi = \phi - l_c\frac{\partial\delta}{\partial\phi}$, where $\delta$ is the TD-error
7:    Update each actor by $\theta_i = \theta_i + l_{a_i}\frac{\partial Q(\vec{o},\vec{a})}{\partial a_i}\frac{\partial a_i}{\partial\theta_i}$
8:    Update the target networks
9: end for

A.3 EXPERIMENTAL SETTINGS AND HYPERPARAMETERS

For each task, the experimental settings and hyperparameters are summarized in Table 1. Initially, we set $l_c = l_a$ and $l_{a_i} = l_a/\sqrt{N}$ in AdaMa. For exploration, we add random noise to the action as $(1-\varepsilon)a_i + \varepsilon\eta$, where $\eta$ is drawn from the uniform distribution on $[-1, 1]$. We anneal $\varepsilon$ linearly from 1.0 to 0.1 over $10^4$ episodes and keep it constant for the rest of the learning. We update the model every episode and update the target networks every 20 episodes.
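The exploration scheme above can be sketched as follows (hypothetical function name; the anneal horizon of 10^4 episodes follows the text):

```python
import numpy as np

def explore_action(a, episode, total_anneal=10_000, rng=None):
    """Blend the policy action with uniform noise: (1 - eps) * a + eps * eta,
    eta ~ U[-1, 1], with eps annealed linearly from 1.0 to 0.1 over
    total_anneal episodes and held constant afterwards."""
    rng = np.random.default_rng() if rng is None else rng
    eps = max(0.1, 1.0 - 0.9 * episode / total_anneal)
    eta = rng.uniform(-1.0, 1.0, size=np.shape(a))
    return (1 - eps) * np.asarray(a) + eps * eta
```

Early in training the action is almost pure noise; after the anneal the output stays within a 0.1-wide band around the scaled policy action.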



Figure 1: Single-Critic MADDPG

Figure 2: Illustration of experimental scenarios: going together, cooperative navigation, predatorprey, and clustering (from left to right).

Figure 3: Learning curves in the four scenarios.

Figure 4: Normalized actors' learning rates.

Figure 5: $l_c$ and $l_a$ during training.

Figure 6: Learning curves with the second-order approximation.

Figure 7: Learning rates under different values of $m$.

Figure 9: Normalized actors' learning rates in predator-prey




