UTS: WHEN MONOTONIC VALUE FACTORISATION MEETS NON-MONOTONIC AND STOCHASTIC TARGETS

Abstract

In the paradigm of centralised training with decentralised execution, monotonic value decomposition is one of the most popular methods to guarantee consistency between centralised and decentralised policies. Because it can only represent values in a restricted monotonic space, this method always underestimates the value of the optimal joint action and converges to a suboptimal policy. A possible way to rectify this issue is to introduce a weighting function that prioritises the true optimal joint action and learns a biased joint action-value function. However, an appropriate weighting may not exist for more general tasks with non-monotonic and stochastic target joint action-values. To solve this problem, we propose a novel value factorisation method named uncertainty-based target shaping (UTS), which projects the original target into the space that monotonic value factorisation can represent, based on the target's stochasticity. First, we employ networks to predict the reward and the embedding of the next state, where the prediction error quantifies the stochasticity. Then, we introduce a target shaping function that replaces the targets of deterministic suboptimal joint actions with the best per-agent value. Since shaping leaves the optimal policy unchanged, monotonic value decomposition can converge to the true optimum for any original targets. Theoretical and empirical results demonstrate the improved performance of UTS on tasks with non-monotonic and stochastic target action-value functions.
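The target-shaping idea described above can be sketched roughly as follows. This is a hypothetical simplification for intuition, not the UTS implementation: the threshold `eps`, the use of the maximum target as the "best" value, and all function names are assumptions made for illustration only.

```python
import numpy as np

def shape_targets(targets, pred_errors, eps=1e-3):
    """Illustrative target shaping: lift deterministic suboptimal targets.

    targets:     target values, one per joint action (illustrative layout).
    pred_errors: reward/next-state prediction errors, used here as a
                 stochasticity proxy (low error => deterministic target).
    """
    best = targets.max()                # stand-in for the optimal joint-action value
    shaped = targets.copy()
    deterministic = pred_errors < eps   # low prediction error => deterministic
    suboptimal = targets < best
    # Replace only targets that are both deterministic and suboptimal,
    # so the optimal joint action (and hence the optimal policy) is unchanged.
    shaped[deterministic & suboptimal] = best
    return shaped

# Flattened non-monotonic payoff matrix for two agents with two actions each;
# all four transitions are deterministic, so every suboptimal entry is lifted.
targets = np.array([8.0, -12.0, -12.0, 0.0])
errors = np.array([1e-5, 1e-5, 1e-5, 1e-5])
print(shape_targets(targets, errors))
```

After shaping, the deterministic suboptimal entries no longer conflict with the monotonicity constraint, while stochastic targets (large prediction error) would be left untouched.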

1. INTRODUCTION

Recent progress in cooperative multi-agent reinforcement learning (MARL) has shown attractive prospects for various real-world applications, such as smart grid management (Aladdin et al., 2020) and autonomous vehicles (Zhou et al., 2021). Due to practical communication constraints and the intractably large joint action space, decentralised policies are often used in MARL. In a simulated or laboratory setting, it is possible to use extra information from the environment and other agents; exploiting this information can significantly benefit policy optimisation and improve learning performance (Foerster et al., 2016; 2018; Rashid et al., 2020). In the paradigm of centralised training with decentralised execution (CTDE), agents' policies are trained with access to global information in a centralised way and executed only on the basis of local histories in a decentralised way (Oliehoek et al., 2008; Kraemer & Banerjee, 2016). One of the most significant challenges is to guarantee consistency between the individual policies and the centralised policy, i.e., the Individual-Global-Max (IGM) principle (Son et al., 2019). Among value decomposition methods, QMIX (Rashid et al., 2018) applies a monotonic mixing network to factorise the joint Q-value function, which naturally satisfies the IGM principle. Inspired by QMIX, many algorithms have been proposed to improve coordination from different perspectives, e.g., multi-agent exploration (Mahajan et al., 2019), role-based learning (Wang et al., 2020b;c), and policy-based algorithms (Wang et al., 2020d). Nevertheless, they can only represent the same class of joint Q-values as QMIX because they use the same monotonic mixing network. Since QMIX can only represent values in the restricted monotonic space, there exists a gap between the approximated joint Q-values and the non-monotonic target values from the environment.
In some special tasks, QMIX can underestimate the value of the true optimal joint action and converge to a suboptimal policy (Son et al., 2019; Mahajan et al., 2019; Rashid et al., 2020). Recent works attempt to overcome this representational limitation from two different perspectives. The first category

