MA2QL: A MINIMALIST APPROACH TO FULLY DECENTRALIZED MULTI-AGENT REINFORCEMENT LEARNING

Abstract

Decentralized learning has shown great promise for cooperative multi-agent reinforcement learning (MARL). However, non-stationarity remains a significant challenge in fully decentralized learning. In this paper, we tackle the non-stationarity problem in the simplest and most fundamental way and propose multi-agent alternate Q-learning (MA2QL), where agents take turns to update their Q-functions by Q-learning. MA2QL is a minimalist approach to fully decentralized cooperative MARL but is theoretically grounded. We prove that when each agent guarantees ε-convergence at each turn, their joint policy converges to a Nash equilibrium. In practice, MA2QL requires only minimal changes to independent Q-learning (IQL). We empirically evaluate MA2QL on a variety of cooperative multi-agent tasks. Results show that MA2QL consistently outperforms IQL, which verifies the effectiveness of MA2QL despite such minimal changes.

1. INTRODUCTION

Cooperative multi-agent reinforcement learning (MARL) is a well-abstracted model for a broad range of real applications, including logistics (Li et al., 2019), traffic signal control (Xu et al., 2021), power dispatch (Wang et al., 2021b), and inventory management (Feng et al., 2022). In cooperative MARL, centralized training with decentralized execution (CTDE) is a popular learning paradigm, where the information of all agents can be gathered and used in training. Many CTDE methods (Lowe et al., 2017; Foerster et al., 2018; Sunehag et al., 2018; Rashid et al., 2018; Wang et al., 2021a; Zhang et al., 2021; Su & Lu, 2022; Li et al., 2022) have been proposed and have shown great potential for solving cooperative multi-agent tasks. Another paradigm is decentralized learning, where each agent learns its policy based only on local information. Decentralized learning is less investigated but is desirable in many scenarios where the information of other agents is not available, and it offers better robustness, scalability, and security (Zhang et al., 2019).

However, fully decentralized learning of agent policies (i.e., without communication) is still an open challenge in cooperative MARL. The most straightforward approach to fully decentralized learning is to directly apply independent learning at each agent (Tan, 1993), which, however, induces the well-known non-stationarity problem for all agents (Zhang et al., 2019) and may lead to learning instability and a non-convergent joint policy, though the performance varies as shown in empirical studies (Rashid et al., 2018; de Witt et al., 2020; Papoudakis et al., 2021; Yu et al., 2021). In this paper, we directly tackle the non-stationarity problem in the simplest and most fundamental way, i.e., fixing the policies of all other agents while one agent is learning.
Following this principle, we propose multi-agent alternate Q-learning (MA2QL), a minimalist approach to fully decentralized cooperative multi-agent reinforcement learning, where agents take turns to update their policies by Q-learning. MA2QL is theoretically grounded: we prove that when each agent guarantees ε-convergence at each turn, their joint policy converges to a Nash equilibrium. In practice, MA2QL requires only minimal changes to independent Q-learning (IQL) (Tan, 1993; Tampuu et al., 2015), and also to independent DDPG (Lillicrap et al., 2016) for continuous actions, i.e., simply swapping the order of two lines of code as follows. Their major difference can be highlighted as: MA2QL agents take turns to update their Q-functions by Q-learning, whereas IQL agents update their Q-functions by Q-learning simultaneously.

IQL
1: repeat
2:   all agents interact in the environment
3:   for i ← 1, n do
4:     agent i updates by Q-learning
5:   end for
6: until terminate

MA2QL
1: repeat
2:   for i ← 1, n do
3:     all agents interact in the environment
4:     agent i updates by Q-learning
5:   end for
6: until terminate

We evaluate MA2QL on a didactic game to empirically verify its convergence, and on multi-agent particle environments (Lowe et al., 2017), multi-agent MuJoCo (Peng et al., 2021), and the StarCraft multi-agent challenge (Samvelyan et al., 2019) to verify its performance with discrete and continuous action spaces, and in fully and partially observable environments. We find that MA2QL consistently outperforms IQL, despite such minimal changes. The effectiveness of MA2QL suggests that simpler approaches may have been left underexplored for fully decentralized cooperative multi-agent reinforcement learning.
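The two listings above differ only in the position of the interaction step relative to the agent loop. As a minimal sketch, the difference can be expressed as two Python training loops; `interact` and `q_update` are hypothetical placeholders for the environment interaction and the Q-learning update, not functions from the paper:

```python
# Minimal sketch of the IQL vs. MA2QL training loops.
# `interact` and `q_update` are illustrative placeholders:
# interact(agents) lets all agents act and returns collected experience;
# q_update(agent, batch) performs a Q-learning update for one agent.

def train_iql(agents, interact, q_update, num_iters):
    """IQL: after each interaction, ALL agents update simultaneously."""
    for _ in range(num_iters):
        batch = interact(agents)       # all agents interact in the environment
        for agent in agents:
            q_update(agent, batch)     # every agent updates this round

def train_ma2ql(agents, interact, q_update, num_iters):
    """MA2QL: agents take turns; only one agent updates after each interaction."""
    for _ in range(num_iters):
        for agent in agents:
            batch = interact(agents)   # all agents interact in the environment
            q_update(agent, batch)     # only this agent updates this turn
```

Both loops perform the same total number of Q-learning updates; MA2QL simply ensures that between two consecutive updates of the same agent, the other agents' Q-functions (and hence their greedy policies) stay fixed.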

2. BACKGROUND

2.1 PRELIMINARIES

Dec-POMDP. The decentralized partially observable Markov decision process (Dec-POMDP) is a general model for cooperative MARL. A Dec-POMDP is a tuple M = {S, A, P, Y, O, I, n, r, γ}. S is the state space, n is the number of agents, γ ∈ [0, 1) is the discount factor, and I = {1, 2, ..., n} is the set of all agents. A = A_1 × A_2 × ⋯ × A_n is the joint action space, where A_i is the individual action space of agent i. P(s′|s, a): S × A × S → [0, 1] is the transition function, and r(s, a): S × A → R is the reward function of state s and joint action a. Y is the observation space, and O(s, i): S × I → Y is a mapping from state to observation for each agent. The objective of Dec-POMDP is to maximize J(π) = E_π[Σ_{t=0}^{∞} γ^t r(s_t, a_t)], and thus we need to find the optimal joint policy π* = arg max_π J(π). To handle partial observability, the history τ_i ∈ T_i: (Y × A_i)* is often used in place of the observation o_i ∈ Y. Each agent i has an individual policy π_i(a_i|τ_i), and the joint policy π is the product of the individual policies π_i. Though the individual policy is learned as π_i(a_i|τ_i) in practice, since Dec-POMDP is undecidable (Madani et al., 1999) and analysis in partially observable environments is much harder, we use π_i(a_i|s) in the analysis and proofs for simplicity.

Dec-MARL. Although decentralized cooperative multi-agent reinforcement learning (Dec-MARL) has been previously investigated (Zhang et al., 2018; de Witt et al., 2020), the setting varies across these studies. In this paper, we consider Dec-MARL as a fully decentralized solution to Dec-POMDP, where each agent individually learns its policy/Q-function from its own actions, without communication or parameter sharing. Therefore, in Dec-MARL, each agent i actually learns in an environment with transition function P_i(s′|s, a_i) = E_{a_{-i}∼π_{-i}}[P(s′|s, a_i, a_{-i})] and reward function r_i(s, a_i) = E_{a_{-i}∼π_{-i}}[r(s, a_i, a_{-i})], where π_{-i} and a_{-i} respectively denote the joint policy and joint action of all agents except i. As the other agents are also learning (i.e., π_{-i} is changing), from the perspective of each individual agent the environment is non-stationary. This non-stationarity problem is the main challenge in Dec-MARL.

IQL. Independent Q-learning (IQL) is a straightforward method for Dec-MARL, where each agent i learns a Q-function Q(s, a_i) by Q-learning. However, as all agents learn simultaneously, to the best of our knowledge there is no theoretical guarantee of convergence, due to non-stationarity. In practice, IQL is often taken as a simple baseline in favor of more elaborate MARL approaches, such as value-based CTDE methods (Rashid et al., 2018; Son et al., 2019). However, much less attention has been paid to IQL itself for Dec-MARL.

2.2 MULTI-AGENT ALTERNATE POLICY ITERATION

To address the non-stationarity problem in Dec-MARL, a fundamental remedy is simply to make the environment stationary during the learning of each agent. Following this principle, we let agents learn by turns: in each turn, one agent performs policy iteration while the policies of the other agents are fixed. We refer to this procedure as multi-agent alternate policy iteration. As illustrated in Figure 1,
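To make the turn-taking idea concrete, the sketch below runs tabular Q-learning for one agent of a two-agent, stateless coordination game while the other agent's greedy policy is held fixed, so the induced single-agent problem r_i(s, a_i) seen by the learner is stationary during its turn. The payoff matrix, hyperparameters, and helper names are illustrative assumptions, not taken from the paper:

```python
import random

# Illustrative 2-agent coordination game: reward 1 iff actions match.
R = [[1.0, 0.0], [0.0, 1.0]]

def greedy(q):
    # Greedy action of a Q-table (ties broken by lowest index).
    return max(range(len(q)), key=lambda a: q[a])

def q_learn_turn(q_learner, partner_action, steps=200, alpha=0.1, eps=0.1):
    """One turn of alternate learning: the learner updates its Q-table by
    (stateless) Q-learning while the partner's action is held fixed, so the
    learner faces a stationary single-agent problem."""
    for _ in range(steps):
        # epsilon-greedy exploration
        a = random.randrange(2) if random.random() < eps else greedy(q_learner)
        r = R[a][partner_action]                     # reward under fixed partner policy
        q_learner[a] += alpha * (r - q_learner[a])   # Q-learning update (no next state)

random.seed(0)
q1, q2 = [0.0, 0.0], [0.0, 0.0]
for _ in range(3):                     # agents take turns to update
    q_learn_turn(q1, greedy(q2))       # agent 1 learns, agent 2 fixed
    q_learn_turn(q2, greedy(q1))       # agent 2 learns, agent 1 fixed
```

After alternating turns, the greedy joint policy has both agents choosing the same action, a Nash equilibrium of this coordination game: neither agent can improve the reward by deviating unilaterally.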

