ACHIEVING COMMUNICATION-EFFICIENT POLICY EVALUATION FOR MULTI-AGENT REINFORCEMENT LEARNING: LOCAL TD-STEPS OR BATCHING?

Abstract

In many consensus-based actor-critic multi-agent reinforcement learning (MARL) strategies, one of the key components is the MARL policy evaluation (PE) problem, where a set of N agents work cooperatively to evaluate the value function of the global states under a given policy, communicating only with their neighbors. In MARL-PE, a critical challenge is how to lower the communication complexity, which is defined as the number of rounds of communication between neighboring agents required to converge to some ϵ-stationary point. To lower the communication complexity of MARL-PE, there exist two "natural" ideas: i) using batching to reduce the variance of TD (temporal difference) errors, which in turn improves the convergence rate of MARL-PE; and ii) performing multiple local TD update steps between consecutive communication rounds, so as to reduce the communication frequency. While the effectiveness of the batching approach has been verified and is relatively well understood, the validity of the local TD-steps approach remains unclear due to the potential "agent-drift" phenomenon resulting from various heterogeneity factors across agents. This leads to an interesting open question in MARL-PE: does the local TD-steps approach really work, and how does it perform in comparison to the batching approach? In this paper, we make the first attempt to answer this fundamental question. Our theoretical analysis and experimental results confirm that allowing multiple local TD steps is indeed a valid approach to lowering the communication complexity of MARL-PE compared to vanilla consensus-based MARL-PE algorithms. Specifically, the number of local TD steps between two consecutive communication rounds can be as large as O(1/ϵ log(1/ϵ)) while still converging to an ϵ-stationary point of MARL-PE.
Theoretically, we show that in order to reach the optimal sample complexity up to a log factor, the communication complexity of the local TD-steps approach is O(1/ϵ log(1/ϵ)), which is worse than that of TD learning with batching, whose communication complexity is O(log(1/ϵ)). However, our experimental results show that allowing multiple local TD steps can perform as well as the batching approach.

1. INTRODUCTION

1) Background and Motivation: With the recent success of reinforcement learning (RL) (Sutton & Barto, 2018) techniques in dynamic decision-making processes where the underlying model is unknown, MARL, a natural extension of RL to multi-agent systems, has also found increasing applications. Compared to traditional RL, the richness of multi-agent systems has given rise to far more diverse problem settings in MARL, including cooperative, competitive, and mixed MARL (see (Zhang et al., 2021a) for an excellent survey). In this paper, we are interested in cooperative MARL, which has found a wide range of applications in networked large-scale systems, such as power networks (Chen et al., 2022; Riedmiller et al., 2000), autonomous driving (Yu et al., 2019; Shalev-Shwartz et al., 2016), and so on. A defining feature of cooperative MARL is that all agents in the system collaborate to learn a joint policy that maximizes the long-term system-wide total reward by communicating with each other. However, due to the decentralized nature (i.e., the lack of a centralized infrastructure) of cooperative MARL, collaboration between the agents can only rely on algorithmic designs that induce a "consensus" among the agents. In many consensus-based actor-critic MARL strategies, one of the key components is the MARL policy evaluation (PE) problem, where a set of N agents work cooperatively to evaluate the value function of the global states for a given joint policy. Just as in the PE problem of conventional RL, temporal difference (TD) learning (Sutton, 1988) has been the prevailing method for MARL-PE thanks to its simplicity and empirical success in real-world applications. Simply speaking, the key idea of TD learning is to learn the value function by using the Bellman equation to bootstrap from the current estimate of the value function.
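To make the bootstrapping idea concrete, the following is a minimal TD(0) sketch with a tabular (one-hot) value representation on a hypothetical two-state Markov reward process. The transition matrix `P`, rewards `r`, discount factor, and step size are illustrative assumptions of ours, not taken from this paper.

```python
import numpy as np

# Hypothetical 2-state Markov reward process, used purely for illustration.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])   # transition probabilities P[s, s']
r = np.array([1.0, 0.0])     # reward r(s)
gamma = 0.9                  # discount factor

def phi(s):
    """One-hot feature vector for state s (tabular special case)."""
    f = np.zeros(2)
    f[s] = 1.0
    return f

def td0(num_steps=20000, alpha=0.05, seed=0):
    """Plain TD(0): bootstrap the value estimate via the Bellman equation."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(2)              # value-function parameters
    s = 0
    for _ in range(num_steps):
        s_next = rng.choice(2, p=P[s])
        # TD error: reward + discounted bootstrapped value - current estimate
        delta = r[s] + gamma * theta @ phi(s_next) - theta @ phi(s)
        theta = theta + alpha * delta * phi(s)
        s = s_next
    return theta

theta = td0()
```

With one-hot features this iteration converges (up to step-size noise) to the exact value function, i.e., the solution of the Bellman equation V = r + γPV.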
However, as mentioned earlier, the decentralized nature of the MARL-PE problem necessitates communication among agents for TD learning. Hence, a critical challenge in consensus-based MARL-PE is how to lower the communication complexity, which is defined as the number of rounds of communication between neighboring agents required to converge to some ϵ-stationary point of the MARL-PE problem. To lower the communication complexity of MARL-PE, there exist two "natural" ideas: i) using batching of trajectory samples to reduce the variance of TD errors, which in turn improves the convergence rate of MARL-PE; and ii) using an "infrequent communication" approach that performs multiple local TD update steps between consecutive communication rounds to reduce the communication frequency. While the effectiveness of the "batching" approach has been verified and is relatively well understood (Hairi et al., 2022; Chen et al., 2021), the validity of the "local TD-steps" approach remains unclear due to the potential "agent-drift" phenomenon resulting from various heterogeneity factors across agents (more on this soon). This leads to two interesting open questions: 1) Can the local TD-steps approach really lower the communication complexity of solving MARL-PE? 2) If the answer to 1) is "yes," how does the local TD-steps approach perform in comparison to the batching approach? Answering these two questions, both in theory and in practice, constitutes the main goal of this paper.

2) Technical Challenges: Answering the above questions is highly non-trivial due to several technical challenges in the convergence analysis of the local TD-steps approach. Notably, it is easy to see that the structure of TD learning in consensus-based cooperative MARL resembles that of the decentralized stochastic gradient descent (DSGD) method in consensus-based decentralized optimization (Nedic & Ozdaglar, 2009; Lian et al., 2017; Pu & Nedić, 2021).
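As a concrete illustration of idea i), the following single-agent sketch averages the TD update direction over a batch of B consecutive transitions before applying it, which lowers the variance of each update. The environment and all hyperparameters here are our own toy assumptions, not the algorithm analyzed in the cited works.

```python
import numpy as np

# Toy 2-state Markov reward process (illustrative assumption).
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
r = np.array([1.0, 0.0])
gamma = 0.9

def batched_td(num_updates=400, batch_size=50, alpha=0.5, seed=0):
    """Batched TD(0): average the TD update over a batch before applying it."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(2)          # tabular value estimate
    s = 0
    for _ in range(num_updates):
        g = np.zeros(2)          # accumulated TD update direction
        for _ in range(batch_size):
            s_next = rng.choice(2, p=P[s])
            delta = r[s] + gamma * theta[s_next] - theta[s]
            e = np.zeros(2)      # one-hot feature of the visited state
            e[s] = delta
            g += e
            s = s_next
        # One low-variance update per batch; in the decentralized setting,
        # fewer updates translate into fewer communication rounds.
        theta = theta + alpha * g / batch_size
    return theta

theta = batched_td()
```

Because the averaged update has variance roughly 1/B of a single-sample update, a larger step size can be used, so fewer (communication-triggering) updates are needed for the same accuracy.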
Thus, it is tempting to believe that one can borrow the convergence analysis techniques of DSGD and apply them to TD learning. However, despite such similarities, there are also significant differences between TD learning in MARL and DSGD:

• Structural Differences: First, we note that TD learning is not a true gradient-based method, since the TD error is not a gradient estimator of any static objective function, whereas DSGD works with a well-defined objective in a consensus-based decentralized optimization problem. Also, in decentralized optimization, the gradient terms are often assumed to be bounded. However, in TD learning, the TD errors cannot be assumed to be bounded without further assuming that the value function approximation parameters lie in some compact set.

• Markovian Data in TD Learning: In RL/MARL problems, there exists an underlying Markovian dynamic process across time steps, where the state distribution may differ at different time steps. By contrast, in decentralized optimization, it is often safe to assume that the data at each agent are independently distributed. Thus, convergence analysis techniques from decentralized optimization are not directly applicable to TD learning for MARL-PE. The coupling and dependence among samples render the convergence analysis of TD learning in MARL far more challenging.

• "Agent-Drift" Phenomenon: Due to the heterogeneous nature of the rewards across agents, executing multiple local TD update steps inevitably pulls the local function approximation parameters toward approximating the local value functions rather than the global value function, leading to the "agent-drift" phenomenon. Hence, it is unclear whether such local steps help or hurt the convergence of TD learning in MARL.
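To illustrate the local TD-steps structure and the source of agent-drift, here is a toy two-agent sketch (our own construction, not this paper's algorithm): both agents observe the same global state trajectory but hold heterogeneous local rewards, take K local TD(0) steps between communication rounds, and then average their parameters via a doubly stochastic mixing matrix W. All quantities are illustrative assumptions.

```python
import numpy as np

# Shared state dynamics; the global value function corresponds to the
# average of the agents' local rewards (illustrative assumption).
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
gamma = 0.9
R = np.array([[2.0, 0.0],    # agent 0's local rewards r_0(s)
              [0.0, 0.0]])   # agent 1's local rewards r_1(s)
W = np.array([[0.5, 0.5],    # doubly stochastic mixing (consensus) matrix
              [0.5, 0.5]])

def local_steps_td(rounds=400, K=10, alpha=0.05, seed=0):
    """Consensus TD with K local steps between communication rounds."""
    rng = np.random.default_rng(seed)
    N = R.shape[0]
    Theta = np.zeros((N, 2))             # one parameter vector per agent
    s = 0                                # shared global state
    for _ in range(rounds):
        for _ in range(K):               # K local TD steps, no communication
            s_next = rng.choice(2, p=P[s])
            for i in range(N):
                # Each agent bootstraps with its OWN local reward, so its
                # parameters drift toward its LOCAL value function.
                delta = R[i, s] + gamma * Theta[i, s_next] - Theta[i, s]
                Theta[i, s] += alpha * delta
            s = s_next
        Theta = W @ Theta                # one communication (consensus) round
    return Theta

Theta = local_steps_td()
```

During the K local steps, agent 0's parameters are pulled up (toward the value of r_0) and agent 1's down, and the mixing step must repeatedly pull them back together; the product K·α controls how far they can drift per round.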
Intuitively, if the agent-drift is too large, then the low-communication-complexity benefit of infrequent communication might be offset by the errors incurred in aligning the local model parameters at each agent with the true global value function parameters. Because of the agent-drift effect, the number of local TD update steps has to be chosen judiciously to mitigate the potentially large divergence of the function approximation parameters between communication rounds.

3) Main Results and Contribution: The main contribution of this paper is that we overcome the above challenges in analyzing the communication complexity of the local TD-steps approach for cooperative MARL-PE. By doing so, we are able to shed light on the feasibility and effect of local

