REPRESENTATION INTERFERENCE SUPPRESSION VIA NON-LINEAR VALUE FACTORIZATION FOR INDECOMPOSABLE MARKOV GAMES

Abstract

Value factorization is an efficient approach for centralized training with decentralized execution in cooperative multi-agent reinforcement learning tasks. As the simplest implementation of value factorization, Linear Value Factorization (LVF) has attracted wide attention. In this paper, firstly, we investigate the conditions under which LVF is applicable, a question that is important but usually neglected by previous works. We prove that, due to its representation limitation, LVF is perfectly applicable only to an extremely narrow class of tasks, which we define as decomposable Markov games. Secondly, to handle indecomposable Markov games, where LVF is inapplicable, we turn to value factorization with complete representation capability (CRC) and explore the general form of value factorization functions that satisfy both the Independent Global Max (IGM) and CRC conditions. A common problem of these value factorization functions is representation interference among the true Q values that share local Q value functions. As a result, the policy could be trapped in local optima due to the representation interference on the optimal true Q values. Thirdly, to address this problem, we propose a novel value factorization method, namely Q Factorization with Representation Interference Suppression (QFRIS). QFRIS adaptively reduces the gradients of the local Q value functions contributed by the non-optimal true Q values. Our method is evaluated on various benchmarks, and the experimental results demonstrate the good convergence of QFRIS.

1. INTRODUCTION

Centralized training with decentralized execution (CTDE) (Lowe et al., 2017; Oliehoek et al., 2008; Foerster et al., 2016) shows surprising performance and great scalability in challenging fully cooperative multi-agent reinforcement learning (MARL) tasks (Tan, 1993b). Such tasks provide only a reward shared by all agents, so each agent is expected to deduce its own contribution to the team, which introduces the problem of credit assignment (Foerster et al., 2018). As a simple and efficient approach to credit assignment in the CTDE paradigm, value factorization, especially Linear Value Factorization (LVF), has recently gained growing attention, e.g., VDN (Sunehag et al., 2017) and QMIX (Rashid et al., 2018). An important property of LVF is that it concisely satisfies the Independent Global Max (IGM) principle (Son et al., 2019). The IGM principle requires that the greedy joint action of the joint Q value function coincide with the greedy actions of the factorized local Q value functions, which is widely acknowledged as a critical rule for value factorization. However, the linearly factorizable joint Q value function in LVF is incapable of representing non-linear true Q value functions, which is known as the representation limitation of LVF. Recent works focus on remedies for this representation limitation but usually neglect the question of under what conditions the true Q value function is not linearly factorizable.

In this paper, we prove that in the context of Markov games, linear factorizability relies on two conditions: (1) the reward function is linearly factorizable on a set of subspaces of the joint state-action space; (2) the state transition within each subspace is independent of the states and actions outside that subspace. Based on these two conditions, we define the decomposability of a Markov game. In other words, the true Q value function is linearly factorizable if and only if the Markov game is decomposable. Most tasks are indecomposable Markov games, so we go deeper into the properties of LVF in this case. We prove that the target of the joint Q value function in the Bellman equation (Sutton & Barto, 2018) is always unbiased for LVF under value iteration in a SARSA manner.

To deal with indecomposable Markov games, where the true Q value function is not linearly factorizable, we consider improving the representation capability of the value factorization function by introducing extra approximators. According to the partial derivatives with respect to the local Q value functions, value factorization functions can be classified into two categories, i.e., linear and non-linear. For both categories, we investigate the conditions under which a value factorization function satisfies complete representation capability (CRC, i.e., the capability to approximate any true Q value function) and the IGM principle. We then propose a rule to generate qualified functions and list some example functions for both the linear and non-linear cases.

A common problem of these value factorization functions is representation interference among true Q values. Specifically, a local Q value corresponds to multiple true Q values in value factorization. As a result, the representations of these true Q values interfere with each other through the training of the shared local Q value function. The representation interference on the optimal true Q value function could leave the policy trapped in a local optimum. To address this problem, we design a novel value factorization function.
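As a concrete, self-contained illustration of the LVF representation limitation discussed above (our own toy example, not taken from the paper), the sketch below fits Q(u_1, u_2) ≈ Q_1(u_1) + Q_2(u_2) by least squares to a single-state payoff matrix of the kind commonly used in the value factorization literature. The large residual error and the mismatch between the true and fitted greedy joint actions show that such a matrix is not linearly factorizable.

```python
import numpy as np

# Illustrative 2-agent, 3-action payoff matrix (true Q values for a single state);
# rows index agent 1's action u1, columns index agent 2's action u2.
Q_true = np.array([[  8., -12., -12.],
                   [-12.,   0.,   0.],
                   [-12.,   0.,   0.]])
n_actions = Q_true.shape[0]

# Design matrix for the linear factorization Q(u1, u2) ~ Q1(u1) + Q2(u2).
rows = []
for u1 in range(n_actions):
    for u2 in range(n_actions):
        x = np.zeros(2 * n_actions)
        x[u1] = 1.0              # one-hot selector for Q1(u1)
        x[n_actions + u2] = 1.0  # one-hot selector for Q2(u2)
        rows.append(x)
X = np.stack(rows)
y = Q_true.reshape(-1)

# Best least-squares fit of the local Q tables.
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
Q1, Q2 = theta[:n_actions], theta[n_actions:]
Q_lvf = Q1[:, None] + Q2[None, :]

true_greedy = tuple(int(i) for i in np.unravel_index(Q_true.argmax(), Q_true.shape))
lvf_greedy = tuple(int(i) for i in np.unravel_index(Q_lvf.argmax(), Q_lvf.shape))
print("best linear fit:\n", Q_lvf.round(2))
print("max approximation error:", float(np.abs(Q_lvf - Q_true).max()))
print("true greedy joint action:", true_greedy)
print("LVF greedy joint action: ", lvf_greedy)
```

Even the best possible linear fit misrepresents the cooperative optimum at (0, 0), so any LVF agent greedy with respect to the factorized values converges to a suboptimal joint action for this payoff structure.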
Our method, namely Q Factorization with Representation Interference Suppression (QFRIS), alleviates the representation interference on the optimal true Q value by reducing the weight contributed by the non-optimal ones. QFRIS is evaluated on a matrix game, predator-prey, and the StarCraft Multi-Agent Challenge. The experimental results demonstrate the good convergence of our method. We make three main contributions in this work: (1) We prove a sufficient and necessary condition for the linear factorizability of the true Q value function, which can be used to determine whether the joint Q value function of LVF suffers from the representation limitation; (2) To deal with indecomposable Markov games, we propose rules to generate value factorization functions that satisfy both the IGM and CRC conditions; (3) We point out a common problem of value factorization, namely representation interference, and design a novel value factorization function to address it. Our method shows good convergence in experiments on various benchmarks.
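The excerpt above describes QFRIS only at a high level, so the following is merely a conceptual sketch of the general idea of down-weighting gradients that stem from non-optimal targets; the weighting rule, the fixed constant alpha, and all names here are our illustrative assumptions, not the authors' algorithm.

```python
import torch

def interference_suppressed_td_loss(q_joint: torch.Tensor,
                                    td_target: torch.Tensor,
                                    q_joint_greedy: torch.Tensor,
                                    alpha: float = 0.1) -> torch.Tensor:
    """Conceptual sketch only -- NOT the paper's QFRIS loss.

    q_joint:        Q(s, u) for the sampled joint actions, shape [batch]
    td_target:      bootstrapped targets, e.g. r + gamma * max_u' Q_target(s', u'), shape [batch]
    q_joint_greedy: Q(s, u_gre) at the current greedy joint action, shape [batch]
    alpha:          assumed fixed down-weighting factor for non-optimal samples
    """
    td_error = td_target - q_joint
    # Heuristic stand-in for "non-optimal true Q values": samples whose target
    # does not exceed the value of the current greedy joint action.
    non_optimal = (td_target < q_joint_greedy).float()
    # Weight 1.0 for optimal-looking samples and alpha for the rest, so that
    # non-optimal samples contribute smaller gradients to the shared local Q functions.
    weights = 1.0 - (1.0 - alpha) * non_optimal
    return (weights.detach() * td_error.pow(2)).mean()
```

In this sketch, samples judged non-optimal still update the shared local Q value functions, just with a smaller weight; the "adaptive" reduction mentioned in the abstract presumably replaces the fixed alpha with a learned or state-dependent quantity.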

2.1. DEC-POMDP

A fully cooperative multi-agent reinforcement learning problem can be modelled as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP), which is usually described by a tuple G = ⟨S, U, P, r, Z, O, n, γ⟩ (Guestrin et al., 2001; Oliehoek & Amato, 2016; Seuken & Zilberstein, 2008). s ∈ S denotes the global state of the environment, from which a local observation z_i ∈ Z_i is assigned to agent i ∈ I ≡ {1, 2, ..., n} according to the observation function O: S × I → Z_i. After receiving z_i, each agent chooses an individual action u_i ∈ U_i based on its local policy π_i(u_i | τ_i): T_i × U_i → [0, 1], where τ_i ∈ T_i ≡ (Z_i × U_i)^t is the local observation-action history, i.e., the local trajectory. After the execution of the joint action u = {u_1, ..., u_n}, a reward r shared by all agents and the next state s' are generated by the reward function r(s, u): S × U → R and the transition function P(s' | s, u): S × U × S → [0, 1], respectively. γ ∈ [0, 1) is the discount factor. Note that we use bold symbols to denote global and joint variables, e.g., S and u. The true Q value function is defined as the expectation of the accumulated reward, i.e., Q^{true}(s_t, u_t) := E_{s_{t+1:∞}, u_{t+1:∞}}[R_t | s_t, u_t], where R_t = Σ_{i=0}^{∞} γ^i r_{t+i}. Q^{true}(s_t, u_t) is approximated by the joint Q value function Q(s, u). We denote the optimal action and the greedy action by u* := argmax_u Q^{true}(s, u) and u^{gre} := argmax_u Q(s, u), respectively.
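To make the interface concrete, the following minimal sketch (our own illustration with assumed names and a toy reward and transition; it is not an environment from the paper) shows the defining Dec-POMDP structure: one hidden global state, lossy per-agent observations, and a single team reward shared by all agents.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class DecPOMDPStep:
    """One Dec-POMDP transition: a single team reward is shared by all agents."""
    observations: List[int]  # local observation z_i for each agent i
    reward: float            # shared reward r(s, u)
    done: bool

class ToyDecPOMDP:
    """Toy Dec-POMDP: one hidden global state, partial per-agent observations."""

    def __init__(self, n_agents: int = 2):
        self.n_agents = n_agents
        self.state = 0  # global state s, never revealed to the agents directly

    def reset(self) -> List[int]:
        self.state = 0
        return [self._observe(i) for i in range(self.n_agents)]

    def _observe(self, agent_id: int) -> int:
        # O(s, i): each agent sees only a lossy function of the global state.
        return (self.state + agent_id) % 2

    def step(self, joint_action: Tuple[int, ...]) -> DecPOMDPStep:
        # r(s, u): all agents receive the same reward; here they are rewarded
        # only when every agent picks action 1 (a trivial coordination task).
        reward = 1.0 if all(a == 1 for a in joint_action) else 0.0
        # P(s' | s, u): a deterministic toy transition.
        self.state = (self.state + sum(joint_action)) % 4
        obs = [self._observe(i) for i in range(self.n_agents)]
        return DecPOMDPStep(observations=obs, reward=reward, done=False)
```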

2.2. VALUE FACTORIZATION

In value factorization, the joint Q value function is factorized through a value factorization operator F(·) as

Q(s, u) = F(Q_1(τ_1, u_1), ..., Q_n(τ_n, u_n)),     (1)

where Q_i(τ_i, u_i): T_i × U_i → R (i ∈ {1, ..., n}) is defined as the local Q value function of agent i. A critical rule of value factorization is the Independent Global Max (IGM) principle, defined as the identity between the joint greedy action and the set of local greedy actions. Formally, given the joint Q value function Q(s, u) and the local Q value functions {Q_i(τ_i, u_i)}, IGM requires

argmax_u Q(s, u) = (argmax_{u_1} Q_1(τ_1, u_1), ..., argmax_{u_n} Q_n(τ_n, u_n)).
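As a minimal numeric sketch of Eq. (1) in its simplest (VDN-style) linear form, together with the IGM check that additive factorization trivially passes (a toy single-state example with assumed values, not from the paper):

```python
import numpy as np

# Tabular local Q values Q_i(tau_i, u_i) for two agents at a fixed local trajectory
# (toy numbers, three actions per agent).
Q1 = np.array([2.0, 5.0, 1.0])
Q2 = np.array([0.5, 3.0, 4.0])

# Linear instance of Eq. (1): Q(s, u) = Q1(u1) + Q2(u2).
Q_joint = Q1[:, None] + Q2[None, :]

# IGM check: the greedy joint action must equal the tuple of local greedy actions.
joint_greedy = tuple(int(i) for i in np.unravel_index(Q_joint.argmax(), Q_joint.shape))
local_greedy = (int(Q1.argmax()), int(Q2.argmax()))
assert joint_greedy == local_greedy, "IGM violated"
print("joint greedy action:", joint_greedy, "| local greedy actions:", local_greedy)
```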




