UTS: WHEN MONOTONIC VALUE FACTORISATION MEETS NON-MONOTONIC AND STOCHASTIC TARGETS

Abstract

In the paradigm of centralised training with decentralised execution, monotonic value decomposition is one of the most popular methods to guarantee consistency between centralised and decentralised policies. Because it can only represent values in the restricted monotonic space, this method always underestimates the value of the optimal joint action and converges to a suboptimal policy. A possible remedy is to introduce a weighting function that prioritises the true optimal joint action and learns biased joint action-value functions. However, an appropriate weight may not exist for more general tasks with non-monotonic and stochastic target joint action-values. To solve this problem, we propose a novel value factorisation method named uncertainty-based target shaping (UTS), which projects the original target onto the space that monotonic value factorisation can represent, based on its stochasticity. First, we employ networks to predict the reward and the embedding of the next state, where the prediction error quantifies the stochasticity. Then, we introduce a target shaping function that replaces the targets of deterministic suboptimal joint actions with the best per-agent value. Since shaping leaves the optimal policy unchanged, monotonic value decomposition can converge to the true optimum for any original targets. Theoretical and empirical results demonstrate the improved performance of UTS on tasks with non-monotonic and stochastic target action-value functions.

1. INTRODUCTION

Recent progress in cooperative multi-agent reinforcement learning (MARL) has shown attractive prospects for various real-world applications, such as smart grid management (Aladdin et al., 2020) and autonomous vehicles (Zhou et al., 2021). Due to practical communication constraints and the intractably large joint action space, decentralised policies are often used in MARL. In a simulated or laboratory setting, however, extra information from the environment and other agents is available, and exploiting it can significantly benefit policy optimisation and improve learning performance (Foerster et al., 2016; 2018; Rashid et al., 2020). In the paradigm of centralised training with decentralised execution (CTDE), agents' policies are trained with access to global information in a centralised way and executed based only on local histories in a decentralised way (Oliehoek et al., 2008; Kraemer & Banerjee, 2016). One of the most significant challenges is to guarantee consistency between the individual policies and the centralised policy, i.e., the Individual-Global-Max (IGM) principle (Son et al., 2019). Among value decomposition methods, QMIX (Rashid et al., 2018) applies a monotonic mixing network to factorise the joint Q-value function, which naturally satisfies the IGM principle. Inspired by QMIX, many algorithms have been proposed to improve coordination from different perspectives, e.g., multi-agent exploration (Mahajan et al., 2019), role-based learning (Wang et al., 2020b;c), and policy-based algorithms (Wang et al., 2020d). However, because they use the same monotonic mixing network, they can represent only the same class of joint Q-values as QMIX. Since QMIX can only represent values in the restricted monotonic space, a gap exists between the approximated joint Q-values and the non-monotonic target values from the environment.
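As a toy illustration (our own sketch, not the authors' implementation), a QMIX-style mixer satisfies the IGM principle by forcing its mixing weights to be non-negative, e.g. via an absolute value, so that Q_tot is monotonically non-decreasing in every agent's individual Q-value:

```python
import numpy as np

rng = np.random.default_rng(0)

def monotonic_mix(per_agent_qs, state_embedding):
    """Toy QMIX-style mixer: the (state-conditioned) mixing weights are
    passed through abs(), so Q_tot is monotonic in each agent's Q-value.
    In QMIX these weights come from hypernetworks; here we simply reuse
    entries of a random state embedding for illustration."""
    n = len(per_agent_qs)
    w = np.abs(state_embedding[:n])   # non-negative mixing weights
    b = state_embedding[n]            # the bias may take any sign
    return float(np.dot(w, per_agent_qs) + b)

state = rng.normal(size=8)
qs = np.array([0.2, -0.5, 1.0])
base = monotonic_mix(qs, state)
# Increasing any single agent's Q-value can never decrease Q_tot.
bumped = monotonic_mix(qs + np.array([0.3, 0.0, 0.0]), state)
assert bumped >= base
```

Because the argmax of such a Q_tot decomposes into per-agent argmaxes, decentralised greedy action selection is consistent with the centralised greedy policy.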
In some special tasks, QMIX can therefore underestimate the value of the true optimal joint action and converge to a suboptimal policy (Son et al., 2019; Mahajan et al., 2019; Rashid et al., 2020). Recent works try to address this representational limitation from two perspectives. The first category introduces the joint actions (Wang et al., 2020a; Mahajan et al., 2021) or pairwise interactions (Böhmer et al., 2020; Li et al., 2021) into centralised learning to achieve full representational capacity for the target Q-values. However, learning such centralised values is difficult due to the large joint action space. The second category prioritises the true optimal joint action and learns biased joint Q-value functions. WQMIX (Rashid et al., 2020) introduces a weighting function into the projection from the target value functions to the joint Q-values and uses it to down-weight every suboptimal action whose target value is less than the current estimate. However, poor empirical results on decentralised micromanagement tasks in StarCraft II show that it is difficult to apply the weighting function to more general tasks (Rashid et al., 2020; Wang et al., 2020a). We prove that the weight on suboptimal actions must be small for QMIX to focus its representation on the optimal joint Q-value and thus handle non-monotonic targets. At the same time, the weight for each action must be uniform to avoid overestimating suboptimal actions whose targets are large only with low probability. Because of this contradiction, an appropriate weight that recovers the optimal policy may not exist when the target is both non-monotonic and stochastic. This paper takes a step towards the latter category. We propose a novel value factorisation method named uncertainty-based target shaping (UTS), which projects the original target onto the space that monotonic value factorisation can represent, based on its stochasticity.
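To make the weighting idea concrete, here is a minimal sketch (our own simplification, with a hypothetical down-weight α) of a WQMIX-style regression loss: samples whose target falls below the current joint estimate receive a small weight, so the monotonic network concentrates on under-estimated, potentially optimal joint actions:

```python
import numpy as np

def weighted_td_loss(q_tot, targets, alpha=0.1):
    """WQMIX-style weighted projection sketch (alpha is a hypothetical
    setting, not the paper's value). Targets above the current estimate
    keep full weight 1; targets below it are down-weighted by alpha."""
    w = np.where(targets > q_tot, 1.0, alpha)
    return float(np.mean(w * (targets - q_tot) ** 2))

q_tot = np.array([1.0, 2.0, 0.5])     # current joint estimates per sample
targets = np.array([2.0, 1.0, 0.5])   # bootstrapped targets per sample
loss = weighted_td_loss(q_tot, targets)
# Only the first sample (target > estimate) contributes with full weight.
```

As the surrounding text argues, choosing α involves a trade-off: a small α helps represent the optimum under non-monotonic targets, but under stochastic targets it risks overestimating suboptimal actions that occasionally draw large returns.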
First, we formulate two prediction problems and use the prediction error to quantify the stochasticity of the target joint Q-values. We employ a reward predictor and a state predictor to approximate the standard deviation of the reward and the embedding of the next state, respectively. The predicted standard deviation of the reward and the prediction error of the state embedding are expected to be large if the state-action pair leads to a stochastic reward or a stochastic state transition. Then, we introduce a shaping function that projects the original targets Q onto the monotonic space while keeping the optimal policy unchanged. In practice, a best-action-value network predicts the value each agent can achieve when all other agents cooperate with it, and we replace each deterministic suboptimal target with the minimum of these best per-agent values over all agents. We prove that this shaping guarantees that all shaped targets are tractable for monotonic value decomposition and that the optimal policy is identical for the original and the shaped targets. Therefore, QMIX can achieve full representational capacity for the shaped target Q-values rather than the original ones and converge to the true optimum. We list our main contributions as follows:
• We first analyse the limitations of the weighting function in value decomposition methods and show that it cannot guarantee convergence to the optimum when target Q-value functions are non-monotonic and stochastic.
• We introduce a target shaping function that projects the original targets onto a monotonic space, which ensures that QMIX can converge to the optimum for any original targets.
• We propose uncertainty-based target shaping (UTS) and empirically show its improved performance in practice, especially in tasks with non-monotonic and stochastic targets.
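The shaping rule above can be sketched on a classic two-agent non-monotonic matrix game (a toy sketch of our own; function and variable names are hypothetical, not the paper's code):

```python
import numpy as np

def shape_targets(targets, stochastic_mask):
    """UTS-style shaping sketch for a 2-agent matrix game.
    best[a][u_a] is agent a's value for action u_a when the other agent
    cooperates (maximises over its own actions). Every deterministic
    suboptimal joint action gets the minimum of these best per-agent
    values; stochastic entries and the optimum keep their targets."""
    best = [targets.max(axis=1 - a) for a in range(2)]
    optimal = np.unravel_index(np.argmax(targets), targets.shape)
    shaped = targets.copy()
    for idx in np.ndindex(targets.shape):
        if idx != optimal and not stochastic_mask[idx]:
            shaped[idx] = min(best[a][idx[a]] for a in range(2))
    return shaped

# Classic non-monotonic 2-agent payoff matrix with deterministic rewards.
targets = np.array([[8.0, -12.0],
                    [-12.0, 0.0]])
shaped = shape_targets(targets, np.zeros_like(targets, dtype=bool))
# shaped == [[8, 0], [0, 0]]: monotonic, with the optimum (0, 0) unchanged.
```

The original matrix cannot be represented by any monotonic factorisation (each agent's best action depends on the other's), while the shaped matrix can, and both share the same optimal joint action.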
VDN (Sunehag et al., 2018), QMIX (Rashid et al., 2018), and WQMIX (Rashid et al., 2020) are Q-learning algorithms for fully cooperative multi-agent tasks, which estimate the joint Q-value function Q(s, u) as Q tot with specific forms. For ease of presentation, we consider only a fully observable setting here. VDN factorises Q tot into a sum of individual Q-value functions. By contrast, QMIX mixes the individual Q-value functions with a monotonic mixing network.



A cooperative multi-agent task in the partially observable setting can be formulated as a Decentralised Partially Observable Markov Decision Process (Dec-POMDP) (Oliehoek & Amato, 2016), consisting of a tuple G = ⟨A, S, Ω, O, U, P, R, n, γ⟩, where A ≡ {1, . . . , n} is the set of agents, S the set of states, Ω the set of joint observations, and R the set of rewards. At each time step, each agent a ∈ A obtains an observation o ∈ Ω through the observation function O(s, a) : S × A → Ω and maintains an action-observation history τ a ∈ T ≡ (Ω × U)*. Each agent a chooses an action u a ∈ U according to a stochastic policy π a (u a | τ a ) : T × U → [0, 1], forming a joint action u ∈ U, which induces a transition of the environment through the transition function P (s′, r | s, u) : S × U × S × R → [0, 1], where r ∈ R is the team reward. The goal of the task is to find the joint policy π that maximises the joint Q-value function Q π (s t , u t ) = E s t+1:∞ , u t+1:∞ [G t | s t , u t ], where G t = Σ ∞ i=0 γ i r t+i is the discounted return.
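For concreteness, the discounted return G t that defines Q π can be computed by a simple backward recursion (a minimal illustration with made-up rewards):

```python
def discounted_return(rewards, gamma=0.99):
    """G_t = sum_{i>=0} gamma^i * r_{t+i}, the quantity whose conditional
    expectation defines the joint Q-value Q^pi(s_t, u_t). Computed
    backwards: G_t = r_t + gamma * G_{t+1}."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Three-step episode with team rewards 1, 0, 2 and gamma = 0.5:
# G = 1 + 0.5 * 0 + 0.25 * 2 = 1.5
g0 = discounted_return([1.0, 0.0, 2.0], gamma=0.5)
```

In the Dec-POMDP setting, each agent conditions its policy only on its own history τ a, while the centralised critic may condition on the full state s t.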

