GENERATIVE MULTI-FLOW NETWORKS: CENTRALIZED, INDEPENDENT AND CONSERVATION

Abstract

Generative flow networks use the flow matching loss to learn a stochastic policy that generates objects through a sequence of actions, such that the probability of generating a pattern is proportional to the corresponding reward. However, existing works handle only single-flow-model tasks and cannot directly generalize to multi-agent flow networks due to limitations such as flow estimation complexity and independent sampling. In this paper, we propose the framework of generative multi-flow networks (GMFlowNets), in which multiple agents collaboratively generate objects through a series of joint actions. We then propose the centralized flow network algorithm for centralized training of GMFlowNets and the independent flow network algorithm for decentralized execution of GMFlowNets. Based on the independent global conservation condition, the flow conservation network algorithm is further proposed to realize the centralized-training-with-decentralized-execution paradigm. Theoretical analysis proves that the multi-flow matching loss function trains a unique Markovian flow, and that the flow conservation network ensures independent policies generate samples with probability proportional to the reward function. Experimental results demonstrate the performance superiority of the proposed algorithms over reinforcement learning and MCMC-based methods.

1. INTRODUCTION

Generative flow networks (GFlowNets) Bengio et al. (2021b) can sample a diverse set of candidates in an active learning setting, where the training objective is to sample them approximately proportionally to a given reward function. Compared to reinforcement learning (RL), where the learned policy tends to sample action sequences with higher rewards, GFlowNets perform better on exploration tasks, since their goal is not to generate the single highest-reward action sequence but to sample action sequences from the leading modes of the reward function Bengio et al. (2021a). Unfortunately, GFlowNets currently do not support multi-agent systems. A multi-agent system is a set of autonomous, interacting entities that share a common environment, perceive through sensors, and act through actuators Busoniu et al. (2008). Multi-agent reinforcement learning (MARL), especially cooperative MARL, is widely used in robot teams, distributed control, resource management, data mining, etc. Zhang et al. (2021). To address challenges in MARL such as scalability and partial observability, the popular centralized training with decentralized execution (CTDE) paradigm Oliehoek et al. (2008); Oliehoek & Amato (2016) was proposed, in which each agent's policy is trained in a centralized manner with access to global information and executed in a decentralized manner based only on local history. However, extending these techniques to GFlowNets is not straightforward; in particular, constructing CTDE-architecture flow networks and finding IGM-like conditions for flow networks are worth investigating. In this paper, we propose the Generative Multi-Flow Networks (GMFlowNets) framework for cooperative decision-making tasks, which can generate diverse patterns through sequential joint actions with probabilities proportional to the reward function. Unlike vanilla GFlowNets, our method analyzes the interaction of multiple agents' actions and shows how to sample actions from multi-flow functions.
We propose the Centralized Flow Network (CFN), Independent Flow Network (IFN), and Flow Conservation Network (FCN) algorithms, all based on the flow matching condition, to solve GMFlowNets. CFN treats the multi-agent dynamics as a whole for policy optimization, ignoring the combinatorial complexity and the demand for independent execution, while IFN suffers from the flow non-stationarity problem. In contrast, FCN takes full advantage of both CFN and IFN: it is trained based on the independent global conservation (IGC) condition. Since FCN follows the CTDE paradigm, it reduces the complexity of flow estimation and supports decentralized execution, which is beneficial for solving practical cooperative decision-making problems. Main contributions: 1) We are the first to propose the concept of generative multi-flow networks for cooperative decision-making tasks; 2) We propose three algorithms, CFN, IFN, and FCN, for training GMFlowNets, respectively based on the centralized training, independent execution, and CTDE paradigms; 3) We propose the IGC condition and then prove that the joint state-action flow function can be decomposed into a product of multiple independent flows, and that a unique Markovian flow can be trained based on the flow matching condition; 4) We conduct experiments on cooperative control tasks to demonstrate that the proposed algorithms outperform current cooperative MARL algorithms, especially in exploration capability.
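As background for the flow-matching-based training used by all three algorithms, the single-flow matching objective of Bengio et al. (2021b) penalizes the squared log-difference between total inflow and total outflow at each interior state. The helper below is a hypothetical minimal sketch of that idea; the `eps` constant and the flow values are illustrative, not taken from the paper:

```python
import math

def flow_matching_loss(inflows, outflows, eps=1e-6):
    """Squared log-difference between total inflow and outflow at a state.

    A consistent flow makes this zero at every interior state; `eps`
    stabilizes the logarithms when estimated flows are near zero.
    """
    return (math.log(eps + sum(inflows)) - math.log(eps + sum(outflows))) ** 2

# A matched state incurs (near) zero loss; a mismatched one does not.
matched = flow_matching_loss([1.0, 2.0], [3.0])
mismatched = flow_matching_loss([1.0, 2.0], [1.0])
```

In practice the inflows and outflows would come from a neural flow estimator, and the loss would be summed over states visited by sampled trajectories.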

2.1. PRELIMINARIES

Let $F: \mathcal{T} \to \mathbb{R}^+$ be a trajectory flow Bengio et al. (2021b), such that $F(\tau)$ can be interpreted as the probability mass associated with trajectory $\tau$. Then, the corresponding edge flow is defined as $F(s \to s') = \sum_{\tau: s \to s' \in \tau} F(\tau)$ and the state flow as $F(s) = \sum_{\tau: s \in \tau} F(\tau)$. The forward transition probability $P_F$ for each step of a trajectory can then be defined as Bengio et al. (2021b) $P_F(s' \mid s) = F(s \to s') / F(s)$. GFlowNets aim to train a neural network that approximates the trajectory flow function, with output proportional to the reward function, based on the flow matching condition Bengio et al. (2021b): $\sum_{s' \in \mathrm{Parent}(s)} F(s' \to s) = \sum_{s'' \in \mathrm{Child}(s)} F(s \to s'')$, where $\mathrm{Parent}(s)$ and $\mathrm{Child}(s)$ denote the parent set and child set of state $s$, respectively. In this way, for any consistent flow $F$ whose terminating flows equal the rewards, i.e., $F(x \to s_f) = R(x)$ with $s_f$ being the final state and $x$ a terminating state (one that can transition directly to the final state), the policy $\pi$ defined by the forward transition probabilities, $\pi(s' \mid s) = P_F(s' \mid s)$, samples terminating states $x$ with probability proportional to $R(x)$.
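The definitions above can be made concrete on a small example. The sketch below builds edge flows on a tiny hand-crafted DAG (all states and flow values are made-up toy numbers, not from the paper), derives the forward policy $P_F$, and checks the flow matching condition at an interior state:

```python
# Toy edge flows F(s -> s') on a small DAG; "sf" is the final state,
# so F(s3 -> sf) plays the role of the terminating flow R(s3).
edge_flow = {
    ("s0", "s1"): 2.0,
    ("s0", "s2"): 1.0,
    ("s1", "s3"): 2.0,
    ("s2", "s3"): 1.0,
    ("s3", "sf"): 3.0,
}

def state_flow(s):
    # F(s): total flow through s (here computed from outgoing edges).
    return sum(f for (u, _), f in edge_flow.items() if u == s)

def forward_policy(s):
    # P_F(s' | s) = F(s -> s') / F(s)
    z = state_flow(s)
    return {v: f / z for (u, v), f in edge_flow.items() if u == s}

def flow_matching_residual(s):
    # Inflow minus outflow; zero at interior states of a consistent flow.
    inflow = sum(f for (_, v), f in edge_flow.items() if v == s)
    outflow = sum(f for (u, _), f in edge_flow.items() if u == s)
    return inflow - outflow

policy_s0 = forward_policy("s0")      # {'s1': 2/3, 's2': 1/3}
residual_s3 = flow_matching_residual("s3")  # 0.0: flow is matched at s3
```

Here the residual at the interior state `s3` is zero, so this particular flow is consistent; a learned flow estimator would instead be trained to drive such residuals toward zero.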

2.2. PROBLEM FORMULATION

A multi-agent directed graph is defined as a tuple $(\mathcal{S}, \mathcal{A})$, with $\mathcal{S}$ a set of states and $\mathcal{A} = \mathcal{A}^1 \times \cdots \times \mathcal{A}^k$ the set of joint edges (also called actions or transitions), which consists of all possible combinations of the actions available to each agent. A trajectory in such a graph is a sequence $(s_1, \ldots, s_n)$ of elements of $\mathcal{S}$. A corresponding multi-agent directed acyclic graph (MADAG) is a multi-agent directed graph in which all pairs of states in a trajectory are distinct. Given an initial state $s_0$ and final state $s_f$, we call a trajectory $\tau = (s_0, \ldots, s_f) \in \mathcal{T}$ starting from $s_0$ and ending in $s_f$ a complete trajectory, where $\mathcal{T}$ denotes the set of complete trajectories. We consider a partially observable scenario, where the state $s \in \mathcal{S}$ is shared by all agents but is not necessarily fully observed. Hence, each agent $i \in \mathcal{I}$ selects an action $a^i \in \mathcal{A}^i$ based only on a local observation $o^i$ of the shared state $s \in \mathcal{S}$. We then define the individual edge/action flow $F(o^i_t, a^i_t) = F(o^i_t \to o^i_{t+1})$ as the flow through an edge $o^i_t \to o^i_{t+1}$, and the joint edge/action flow as $F(s_t, \boldsymbol{a}_t) = F(s_t \to s_{t+1})$ with $\boldsymbol{a}_t = [a^1_t, \ldots, a^k_t]^\top$. The state flow $F(s): \mathcal{S} \to \mathbb{R}$ is defined as $F(s) = \sum_{\tau \in \mathcal{T}} \mathbb{1}_{s \in \tau} F(\tau)$. Based on the flow matching condition Bengio et al. (2021b), the joint flow must satisfy $\sum_{s' \in \mathrm{Parent}(s_t)} F(s' \to s_t) = \sum_{s'' \in \mathrm{Child}(s_t)} F(s_t \to s'')$.
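To make the joint action set $\mathcal{A} = \mathcal{A}^1 \times \cdots \times \mathcal{A}^k$ concrete, the toy sketch below enumerates joint actions for two agents and forms a joint policy from a product of per-agent flows, in the spirit of the product decomposition stated in the contributions. Every name and number here is hypothetical, and the exact decomposition GMFlowNets use is given by the paper's IGC condition, not by this sketch:

```python
from itertools import product

# Hypothetical individual flows F(o^i, a^i) for k = 2 agents at fixed
# observations o^1, o^2; keys are each agent's local actions.
indiv_flow = [
    {"left": 1.0, "right": 3.0},   # agent 1
    {"up": 2.0, "down": 2.0},      # agent 2
]

def joint_flow(joint_action):
    # Product-form joint flow over a_t = [a^1_t, ..., a^k_t].
    f = 1.0
    for i, a in enumerate(joint_action):
        f *= indiv_flow[i][a]
    return f

# Enumerate A = A^1 x A^2 and normalize joint flows into a joint policy.
joint_actions = list(product(*[d.keys() for d in indiv_flow]))
z = sum(joint_flow(a) for a in joint_actions)
joint_policy = {a: joint_flow(a) / z for a in joint_actions}
```

Note that the resulting joint policy factorizes into independent per-agent policies (e.g., the probability of `("right", "up")` equals 3/4 times 1/2), which is exactly what makes decentralized execution possible when such a decomposition holds.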



Cooperative MARL has been studied extensively Zhang et al. (2021); Canese et al. (2021); Feriani & Hossain (2021). Two major challenges for cooperative MARL are scalability and partial observability Yang et al. (2019); Spaan (2012). Since the joint state-action space grows exponentially with the number of agents, and the environment is only partially observable with communication constraints, each agent needs to make individual decisions based on its local action-observation history with guaranteed performance Sunehag et al. (2017); Wang et al. (2020); Rashid et al. (2018).

