LEARNING TO SHARE IN MULTI-AGENT REINFORCEMENT LEARNING

Abstract

In this paper, we study the problem of networked multi-agent reinforcement learning (MARL), where a number of agents are deployed as a partially connected network. Networked MARL requires all agents to make decisions in a decentralized manner to optimize a global objective, with communication restricted to neighbors over the network. We propose a hierarchically decentralized MARL method, LToS, which enables agents to learn to dynamically share rewards with neighbors so as to encourage cooperation on the global objective. For each agent, the high-level policy learns how to share its reward with neighbors to decompose the global objective, while the low-level policy learns to optimize the local objective induced by the high-level policies in the neighborhood. The two policies form a bi-level optimization and learn alternately. We empirically demonstrate that LToS outperforms existing methods in both a social dilemma and two networked MARL scenarios.

1. INTRODUCTION

In multi-agent reinforcement learning (MARL), multiple agents interact with the environment via their joint action to cooperatively optimize an objective. Many methods of centralized training and decentralized execution (CTDE) have been proposed for cooperative MARL, such as VDN (Sunehag et al., 2018), QMIX (Rashid et al., 2018), and QTRAN (Son et al., 2019). However, these methods suffer from the overgeneralization issue (Palmer et al., 2018; Castellini et al., 2019). Moreover, they may not easily scale up with the number of agents due to centralized learning (Qu et al., 2020a). In many MARL applications, a large number of agents are deployed as a partially connected network and collaboratively make decisions to optimize the globally averaged return, such as smart grids (Dall'Anese et al., 2013), network routing (Jiang et al., 2020), traffic signal control (Chu et al., 2020), and IoT (Xu et al., 2019). To deal with such scenarios, networked MARL is formulated to decompose the dependency among all agents into dependencies between neighbors only. To avoid decision-making with insufficient information, agents are permitted to exchange messages with neighbors over the network. In such settings, it is feasible for agents to learn to make decisions in a decentralized way (Zhang et al., 2018; Qu et al., 2020b). However, difficulties of dependency remain when each agent attempts to make decisions independently, e.g., the prisoner's dilemma and the tragedy of the commons (Pérolat et al., 2017). Existing methods tackle these problems by consensus update of the value function (Zhang et al., 2018), credit assignment (Wang et al., 2020), or reward shaping (Chu et al., 2020). However, these methods rely on either access to the global state and joint action (Zhang et al., 2018) or handcrafted reward functions (Wang et al., 2020; Chu et al., 2020).
Inspired by the fact that sharing plays a key role in how humans learn to cooperate, in this paper we propose Learning To Share (LToS), a hierarchically decentralized learning method for networked MARL. LToS enables agents to learn to dynamically share rewards with neighbors so as to collaboratively optimize the global objective. The high-level policies decompose the global objective into local ones by determining how to share their rewards, while the low-level policies optimize the local objectives induced by the high-level policies. LToS learns in a decentralized manner, and we prove that the high-level policies are a mean-field approximation of the joint high-level policy. Moreover, the high-level and low-level policies form a bi-level optimization and alternately learn to optimize the global objective.
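To make the reward-sharing idea concrete, the following is a minimal sketch of how high-level sharing weights could induce the local rewards that low-level policies optimize. This is an illustrative assumption, not the paper's implementation: the function name `shared_rewards`, the line-graph example, and the particular weight values are all hypothetical; in LToS the weights would be produced by learned high-level policies rather than fixed by hand.

```python
import numpy as np

def shared_rewards(rewards, weights, neighbors):
    """Redistribute per-agent rewards according to sharing weights.

    rewards[i]   : environment reward of agent i
    weights[i]   : dict mapping agent j -> fraction of agent i's
                   reward given to j (keys are i itself plus its
                   neighbors; values sum to 1)
    neighbors[i] : list of agent i's neighbors on the network
    Returns the induced reward each low-level policy optimizes.
    """
    n = len(rewards)
    induced = np.zeros(n)
    for i in range(n):
        # Sharing is restricted to the neighborhood of agent i.
        for j, w in weights[i].items():
            assert j == i or j in neighbors[i], "can only share with neighbors"
            induced[j] += w * rewards[i]
    return induced

# Three agents on a line graph: 0 - 1 - 2.
neighbors = {0: [1], 1: [0, 2], 2: [1]}
rewards = np.array([1.0, 0.0, 2.0])
# Hypothetical high-level outputs: each agent keeps part of its
# reward and shares the rest with a neighbor.
weights = {
    0: {0: 0.5, 1: 0.5},
    1: {1: 1.0},
    2: {2: 0.8, 1: 0.2},
}
induced = shared_rewards(rewards, weights, neighbors)
print(induced)  # [0.5 0.9 1.6]
```

Note that because each agent's weights sum to one, the total reward is conserved: redistributing rewards changes each agent's local objective but leaves the global objective (the sum, and hence the average, of rewards) unchanged.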

