GENERATIVE MULTI-FLOW NETWORKS: CENTRALIZED, INDEPENDENT AND CONSERVATION

Abstract

Generative flow networks utilize the flow matching loss to learn a stochastic policy that generates objects through a sequence of actions, such that the probability of generating a pattern is proportional to a given reward. However, existing works handle only single-flow-model tasks and cannot directly generalize to multi-agent flow networks, due to limitations such as flow estimation complexity and the need for independent sampling. In this paper, we propose the framework of generative multi-flow networks (GMFlowNets), in which multiple agents generate objects collaboratively through a series of joint actions. We then propose the centralized flow network algorithm for centralized training of GMFlowNets, and the independent flow network algorithm for decentralized execution of GMFlowNets. Based on the individual global conservation condition, the flow conservation network algorithm is further proposed to realize the centralized training with decentralized execution paradigm. Theoretical analysis proves that the multi-flow matching loss trains a unique Markovian flow, and that the flow conservation network ensures independent policies generate samples with probability proportional to the reward function. Experimental results demonstrate the performance superiority of the proposed algorithms over reinforcement learning and MCMC-based methods.

1. INTRODUCTION

Generative flow networks (GFlowNets) Bengio et al. (2021b) can sample a diverse set of candidates in an active learning setting, where the training objective is to sample them proportionally to a given reward function. Compared to reinforcement learning (RL), where the learned policy is inclined to sample the action sequences with the highest rewards, GFlowNets perform better on exploration tasks, since their goal is not to generate the single highest-reward action sequence but to sample sequences of actions from the leading modes of the reward function Bengio et al. (2021a). Unfortunately, GFlowNets currently cannot support multi-agent systems. A multi-agent system is a set of autonomous, interacting entities that share a common environment, perceive through sensors, and act through actuators Busoniu et al. (2008). Multi-agent reinforcement learning (MARL), especially cooperative MARL, is widely used in robotics teams, distributed control, resource management, data mining, etc. Zhang et al. (2021); Canese et al. (2021); Feriani & Hossain (2021). Two major challenges for cooperative MARL are scalability and partial observability Yang et al. (2019); Spaan (2012). Since the joint state-action space grows exponentially with the number of agents, coupled with the environment's partial observability and communication constraints, each agent needs to make individual decisions based on its local action-observation history with guaranteed performance Sunehag et al. (2017); Wang et al. (2020); Rashid et al. (2018). To address these challenges, MARL has adopted the popular centralized training with decentralized execution (CTDE) paradigm Oliehoek et al. (2008); Oliehoek & Amato (2016), in which each agent's policy is trained in a centralized manner with access to global information and executed in a decentralized manner based only on local history.
However, extending these techniques to GFlowNets is not straightforward; in particular, constructing CTDE-style flow networks and finding IGM-like conditions for flow networks are both worth investigating. In this paper, we propose the Generative Multi-Flow Networks (GMFlowNets) framework for cooperative decision-making tasks, which can generate more diverse patterns through sequential joint actions with probabilities proportional to the reward function. Unlike vanilla GFlowNets, our method analyzes the interaction of multiple agents' actions and shows how to sample actions from multi-flow functions. We propose the Centralized Flow Network (CFN), Independent Flow Network (IFN), and Flow Conservation Network (FCN) algorithms, all based on the flow matching condition, to solve GMFlowNets. CFN treats the multi-agent dynamics as a whole for policy optimization, regardless of combinatorial complexity and the demand for independent execution, while IFN suffers from the flow non-stationarity problem. In contrast, FCN takes full advantage of both CFN and IFN: it is trained based on the individual global conservation (IGC) condition. Since FCN follows the CTDE paradigm, it reduces the complexity of flow estimation and supports decentralized execution, which is beneficial for solving practical cooperative decision-making problems.
Main Contributions: 1) We are the first to propose the concept of generative multi-flow networks for cooperative decision-making tasks; 2) We propose three algorithms for training GMFlowNets, CFN, IFN, and FCN, based respectively on centralized training, independent execution, and the CTDE paradigm; 3) We propose the IGC condition and then prove that the joint state-action flow function can be decomposed into the product of multiple independent flows, and that a unique Markovian flow can be trained based on the flow matching condition; 4) We conduct experiments on cooperative control tasks to demonstrate that the proposed algorithms outperform current cooperative MARL algorithms, especially in terms of exploration capability.

2.1. PRELIMINARIES

Let $F: \mathcal{T} \to \mathbb{R}^+$ be a trajectory flow Bengio et al. (2021b), such that $F(\tau)$ can be interpreted as the probability mass associated with trajectory $\tau$. The corresponding edge flow is then defined as $F(s \to s') = \sum_{\tau: s \to s' \in \tau} F(\tau)$ and the state flow as $F(s) = \sum_{\tau: s \in \tau} F(\tau)$. The forward transition probability $P_F$ for each step of a trajectory can then be defined as Bengio et al. (2021b)
$$P_F(s' \mid s) = \frac{F(s \to s')}{F(s)}.$$
GFlowNets aim to train a neural network that approximates the trajectory flow function, with output proportional to the reward function, based on the flow matching condition Bengio et al. (2021b):
$$\sum_{s' \in \mathrm{Parent}(s)} F(s' \to s) = \sum_{s'' \in \mathrm{Child}(s)} F(s \to s''),$$
where $\mathrm{Parent}(s)$ and $\mathrm{Child}(s)$ denote the parent set and child set of state $s$, respectively. In this way, for any consistent flow $F$ whose terminating flow equals the reward, i.e., $F(s \to s_f) = R(s)$ with $s_f$ the final state and $s$ a terminating state (one that can transition directly to the final state), the policy $\pi$ defined by the forward transition probability satisfies $\pi(s' \mid s) = P_F(s' \mid s)$ and samples terminating states with probability proportional to $R$.
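To make the notation concrete, here is a minimal, self-contained sketch of edge flows, state flows, and the induced forward policy on a toy four-state DAG. The dictionary `edge_flow` and the helper names are ours, not from the paper; a real GFlowNet would parameterize these flows with a neural network.

```python
# Toy DAG: 0 -> {1, 2}, 1 -> 3, 2 -> 3 (state 3 is terminal).
# Edge flows F(s -> s') are fixed numbers here for illustration.
edge_flow = {(0, 1): 2.0, (0, 2): 1.0, (1, 3): 2.0, (2, 3): 1.0}

def state_flow(s):
    """F(s): total flow through s (inflow, or outflow for the root)."""
    inflow = sum(f for (u, v), f in edge_flow.items() if v == s)
    outflow = sum(f for (u, v), f in edge_flow.items() if u == s)
    return outflow if s == 0 else inflow  # the root has no inflow

def forward_policy(s, s_next):
    """P_F(s' | s) = F(s -> s') / F(s)."""
    return edge_flow[(s, s_next)] / state_flow(s)

# The flow matching condition holds at the interior states: inflow == outflow.
for s in (1, 2):
    inflow = sum(f for (u, v), f in edge_flow.items() if v == s)
    outflow = sum(f for (u, v), f in edge_flow.items() if u == s)
    assert abs(inflow - outflow) < 1e-9

print(forward_policy(0, 1))  # 2/3: edge flow over state flow
```

Because the terminating flows here play the role of rewards, sampling with `forward_policy` visits terminal edges proportionally to those flows.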

2.2. PROBLEM FORMULATION

A multi-agent directed graph is defined as a tuple $(\mathcal{S}, \mathcal{A})$ with $\mathcal{S}$ a set of states and $\mathcal{A} = \mathcal{A}_1 \times \cdots \times \mathcal{A}_k$ the set of joint edges (also called actions or transitions), which consists of all possible combinations of the actions available to each agent. A trajectory in such a graph is a sequence $(s_1, \ldots, s_n)$ of elements of $\mathcal{S}$. A multi-agent directed acyclic graph (MADAG) is a multi-agent directed graph in which the states of any trajectory are pairwise distinct. Given an initial state $s_0$ and final state $s_f$, we call a trajectory $\tau = (s_0, \ldots, s_f) \in \mathcal{T}$ starting in $s_0$ and ending in $s_f$ a complete trajectory, where $\mathcal{T}$ denotes the set of complete trajectories. We consider a partially observable scenario, where the state $s \in \mathcal{S}$ is shared by all agents but is not necessarily fully observed. Each agent $i \in \mathcal{I}$ selects an action $a^i \in \mathcal{A}_i$ based on its local observation, and the joint edge flow is $F(s_t, \mathbf{a}_t) = F(s_t \to s_{t+1})$ with $\mathbf{a}_t = [a_t^1, \ldots, a_t^k]^\top$. The state flow $F(s): \mathcal{S} \to \mathbb{R}$ is defined as $F(s) = \sum_{\tau \in \mathcal{T}} \mathbb{1}_{s \in \tau} F(\tau)$. Based on the flow matching condition Bengio et al. (2021b), the state flow equals the inflows and the outflows, i.e.,
$$F(s) = \sum_{s', \mathbf{a}': T(s', \mathbf{a}') = s} F(s', \mathbf{a}') = \sum_{s' \in \mathrm{Parent}(s)} F(s' \to s), \quad (1)$$
$$F(s) = \sum_{\mathbf{a} \in \mathcal{A}} F(s, \mathbf{a}) = \sum_{s'' \in \mathrm{Child}(s)} F(s \to s''), \quad (2)$$
where $T(s', \mathbf{a}') = s$ denotes that joint action $\mathbf{a}'$ transfers state $s'$ to $s$. To this end, generative multi-flow networks (GMFlowNets) are defined as learning machines that approximate trajectory flow functions in a MADAG, with outputs proportional to a predefined reward function, trained based on the flow matching conditions in equation 1 and equation 2.

3. GMFLOWNETS: ALGORITHMS

3.1. CENTRALIZED FLOW NETWORK

Given such a MADAG, a straightforward way to train a GMFlowNet is centralized training that estimates joint flows, which we name the Centralized Flow Network (CFN) algorithm: multiple flows are trained together based on the flow matching conditions. In particular, for any state $s$ in the trajectory, we require that the inflows equal the outflows. In addition, the boundary condition is given by the flow through a terminating state $s$, which must match the reward $R(s)$. Assuming a sparse reward setting, i.e., interior states satisfy $R(s) = 0$ while the final state satisfies $\mathcal{A}(s_f) = \emptyset$, we have the flow consistency equation:
$$\sum_{s, \mathbf{a}: T(s, \mathbf{a}) = s'} F(s, \mathbf{a}) = R(s') + \sum_{\mathbf{a}' \in \mathcal{A}(s')} F(s', \mathbf{a}'). \quad (3)$$
Lemma 1 Define a joint policy $\pi$ that generates trajectories starting in state $s_0$ by sampling actions $\mathbf{a} \in \mathcal{A}(s)$ according to
$$\pi(\mathbf{a} \mid s) = \frac{F(s, \mathbf{a})}{F(s)},$$
where $F(s, \mathbf{a}) > 0$ is the flow through the allowed edge $(s, \mathbf{a})$, satisfying the flow consistency equation 3. Let $\pi(s)$ be the probability of visiting state $s$ when starting at $s_0$ and following $\pi$. Then we have (a) $\pi(s) = \frac{F(s)}{F(s_0)}$; (b) $F(s_0) = \sum_{s_f} R(s_f)$; (c) $\pi(s_f) = \frac{R(s_f)}{\sum_{s'_f} R(s'_f)}$.

Proof: The proof follows that of Proposition 2 in Bengio et al. (2021a).

Lemma 1 shows that a joint flow function produces $\pi(s_f) = R(s_f)/Z$ correctly when the flow consistency equation is satisfied. We can then use a TD-like objective to optimize the joint flow parameters $\theta$:
$$\mathcal{L}_\theta(\tau) = \sum_{s' \in \tau \neq s_0} \Bigg[ \sum_{s, \mathbf{a}: T(s, \mathbf{a}) = s'} F_\theta(s, \mathbf{a}) - R(s') - \sum_{\mathbf{a}' \in \mathcal{A}(s')} F_\theta(s', \mathbf{a}') \Bigg]^2. \quad (5)$$
Note that optimizing equation 5 is not straightforward. On the one hand, each iteration needs to estimate flows on the order of $O(|\mathcal{A}_i|^N)$, which leads to exponential complexity. The joint flow estimation method may get stuck in local optima and can hardly scale beyond dozens of agents.
On the other hand, joint flow networks require all agents to sample jointly, which is impractical since in many applications agents only have access to their own observations.
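As a hedged sketch (not the authors' implementation), the TD-like CFN objective in equation 5 can be written as follows. Here `flow_model` stands in for the joint flow network $F_\theta(s, \mathbf{a})$; it is any callable returning a positive scalar, whereas the paper uses a neural network over the joint state-action space.

```python
# Illustrative sketch of the CFN flow-matching objective (equation 5).

def cfn_loss(flow_model, batch):
    """batch: iterable of (parents, reward, children) per interior state s',
    with parents = [(s, a), ...] such that T(s, a) = s', and
    children = [(s', a'), ...] over the allowed joint actions at s'."""
    loss = 0.0
    for parents, reward, children in batch:
        inflow = sum(flow_model(s, a) for s, a in parents)             # sum of F(s, a) into s'
        outflow = reward + sum(flow_model(s, a) for s, a in children)  # R(s') + flows out of s'
        loss += (inflow - outflow) ** 2
    return loss

# When the inflow already matches reward plus outflow, the loss vanishes.
constant_flow = lambda s, a: 1.0
batch = [([("s0", 0), ("s1", 0)], 1.0, [("s2", 0)])]  # inflow 2 = 1 + 1 outflow
print(cfn_loss(constant_flow, batch))  # 0.0
```

In the multi-agent setting, the action argument `a` is a joint action, which is exactly where the $O(|\mathcal{A}_i|^N)$ enumeration of children becomes the bottleneck.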

3.2. INDEPENDENT FLOW NETWORK

To reduce the complexity and achieve independent sampling for each agent, a simple approach is to treat each agent independently, so that each agent learns its own flow function on the order of $O(|\mathcal{A}_i|)$. We call this the Independent Flow Network (IFN) algorithm, which reduces the exponential complexity to linear. However, due to the non-stationarity of the flow (see Definition 1), it is difficult for this algorithm to train a high-performance GMFlowNet.

Definition 1 (Flow Non-Stationarity) Define the independent policy $\pi_i$ as
$$\pi_i(a^i \mid o^i) = \frac{F_i(o^i, a^i)}{F_i(o^i)},$$
where $a^i \in \mathcal{A}_i(o^i)$ and $F_i(o^i, a^i)$ is the independent flow of agent $i$. The flow consistency equation can be rewritten as
$$\sum_{o^i, a^i: T(o^i, a^i, \mathbf{a}^{-i}) = o^{i\prime}} F_i(o^i, a^i) = R(o^i, a^i) + \sum_{a^{i\prime} \in \mathcal{A}(o^{i\prime})} F_i(o^{i\prime}, a^{i\prime}), \quad (7)$$
where $-i$ denotes the agents other than agent $i$, and $R(o^i, a^i)$ is the reward with respect to state $s$ and action $a^i$.

Note that the transition function $T(o^i, a^i, \mathbf{a}^{-i}) = o^{i\prime}$ in equation 7 also depends on the actions of the other agents, which makes estimating parent nodes difficult. In addition, the reward in many multi-agent systems is a node reward $R(s)$; that is, we cannot accurately estimate the per-node action reward $R(o^i, a^i)$. This transition uncertainty and these spurious rewards cause the flow non-stationarity property, making it difficult to assign accurate rewards to each action and thus to train independent flow networks with a TD-like objective. As shown in Figure 1, the independent method can hardly learn a sampling policy comparable to that of the centralized training method. One way to improve the performance of independent flow networks is to design individual reward functions that are more directly related to the behavior of individual agents.
However, this approach is difficult to implement in many environments, because the direct relationship between individual performance and overall system performance is hard to determine. Even in the single-agent case, only a small fraction of shaped reward functions align with the true objective.

3.3. FLOW CONSERVATION NETWORK

In this subsection, we propose the Flow Conservation Network (FCN) algorithm, which reduces complexity and solves the flow non-stationarity problem simultaneously. FCN learns the optimal flow decomposition from the final reward by back-propagating the gradients of the joint flow function $F$ through deep neural networks representing the individual flow functions $F_i$, $\forall i \in \mathcal{N}$. We first give Definition 2, which states the Individual Global Conservation (IGC) condition relating joint and individual edge flows.

Definition 2 (Individual Global Conservation) The joint edge flow is the product of the individual edge flows, i.e., $F(s_t, \mathbf{a}_t) = \prod_i F_i(o_t^i, a_t^i)$.

Then, we propose the following flow decomposition theorem.

Theorem 1 Let the joint policy be the product of the individual policies $\{\pi_i\}_{i=1}^k$, where each $\pi_i$ is defined by the individual flow function $F_i(o^i, a^i)$, i.e.,
$$\pi_i(a^i \mid o^i) = \frac{F_i(o^i, a^i)}{F_i(o^i)}, \quad \forall i = 1, \cdots, k.$$
Assume that the individual flows $F_i(o^i, a^i)$ satisfy the condition in Definition 2. Define a flow function $F$ such that all agents generate trajectories using the independent policies $\pi_i$, $i = 1, \ldots, k$, and the matching conditions $\forall s' > s_0$, $F(s') = \sum_{s \in \mathcal{P}(s')} F(s \to s')$ and $\forall s' < s_f$, $F(s') = \sum_{s'' \in \mathcal{C}(s')} F(s' \to s'')$ are satisfied. Then, we have: 1) $\pi(s_f) \propto R(s_f)$; 2) $F$ uniquely defines a Markovian flow $\hat{F}$ matching $F$ such that
$$\hat{F}(\tau) = \frac{\prod_{t=1}^{n+1} F(s_{t-1} \to s_t)}{\prod_{t=1}^{n} F(s_t)}.$$

During the individual sampling process, each agent samples trajectories using its own policy, and these compose a batch of data for joint training. During the joint training process, the system calls the independent flow functions of each agent and uses the joint reward function to train the flow network. After training, each agent obtains a trained independent flow network that meets the needs of independent sampling. In particular, for each sampled state, we first seek its parent nodes together with the corresponding observations and independent actions. Then, we compute the estimated joint flow $F(s, \mathbf{a})$ by the flow conservation condition:
$$F(s, \mathbf{a}) = \exp\Bigg( \sum_{i=1}^k \log \hat{F}_i(o^i, a^i; \theta_i) \Bigg),$$
where $\theta_i$ is the model parameter of the $i$-th agent, which can be trained based on equation 5 as:
$$\mathcal{L}(\tau; \theta) = \sum_{s' \in \tau \neq s_0} \Bigg[ \sum_{s, \mathbf{a}: T(s, \mathbf{a}) = s'} F(s, \mathbf{a}) - R(s') - \sum_{\mathbf{a}' \in \mathcal{A}(s')} F(s', \mathbf{a}') \Bigg]^2. \quad (12)$$
Note that this loss may suffer from mismatched flow magnitudes across the nodes of a trajectory; for example, the flow at the root node is large while the flow at a leaf node is very small. To solve this problem, we adopt the log-scale loss introduced in Bengio et al. (2021a) and modify equation 12 as
$$\mathcal{L}(\tau, \epsilon; \theta) = \sum_{s' \in \tau \neq s_0} \big( \log[\epsilon + \mathrm{Inflows}] - \log[\epsilon + \mathrm{Outflows}] \big)^2, \quad (13)$$
where
$$\mathrm{Inflows} := \sum_{s, \mathbf{a}: T(s, \mathbf{a}) = s'} \exp \sum_{i=1}^k \log \hat{F}_i(o^i, a^i; \theta_i),$$
$$\mathrm{Outflows} := R(s') + \sum_{\mathbf{a}'} \exp \sum_{i=1}^k \log \hat{F}_i(o^{i\prime}, a^{i\prime}; \theta_i),$$
and $\epsilon$ is a hyper-parameter that trades off large and small flows and avoids the numerical problem of taking the logarithm of tiny flows.

3.4. DISCUSSION: RELATIONSHIP WITH MARL

Interestingly, similar independent execution algorithms exist in the multi-agent reinforcement learning literature. In this subsection, we therefore discuss the relationship between flow conservation networks and multi-agent RL. The value decomposition approach has been widely used in multi-agent RL based on IGM conditions, such as VDN and QMIX.
For a given global state $s$ and joint action $\mathbf{a}$, the IGM condition asserts consistency between the joint greedy action selection in the joint action-value $Q_{tot}(s, \mathbf{a})$ and the local greedy selections in the individual action values $[Q_i(o^i, a^i)]_{i=1}^k$:
$$\arg\max_{\mathbf{a} \in \mathcal{A}} Q_{tot}(s, \mathbf{a}) = \Big( \arg\max_{a^1 \in \mathcal{A}_1} Q_1(o^1, a^1), \cdots, \arg\max_{a^k \in \mathcal{A}_k} Q_k(o^k, a^k) \Big), \quad \forall s \in \mathcal{S}.$$

Assumption 1 For any complete trajectory $\tau = (s_0, \ldots, s_f)$ in a MADAG, we assume that $Q_{tot}^\mu(s_{f-1}, \mathbf{a}) = R(s_f) f(s_{f-1})$ with $f(s_n) = \prod_{t=0}^n \frac{1}{\mu(\mathbf{a}_t \mid s_t)}$.

Remark 1 Assumption 1 is a strong assumption that does not always hold in practical environments. We only use it here for the discussion; it does not affect the performance of the proposed algorithms. A scenario where the assumption holds directly is sampling actions according to a uniform distribution in a tree structure, i.e., $\mu(\mathbf{a} \mid s) = 1/|\mathcal{A}(s)|$. A uniform policy is also assumed in Bengio et al. (2021a).

Lemma 2 Suppose Assumption 1 holds and the environment has a tree structure. Based on the IGC and IGM conditions, we have: 1) $Q_{tot}^\mu(s, \mathbf{a}) = F(s, \mathbf{a}) f(s)$; 2) $(\arg\max_{a^i} Q_i(o^i, a^i))_{i=1}^k = (\arg\max_{a^i} F_i(o^i, a^i))_{i=1}^k$.

Based on Assumption 1, Lemma 2 shows the connection between the IGC condition and the IGM condition. This action-value equivalence helps us better understand multi-flow network algorithms, and in particular shows the rationality of the IGC condition.
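As a concrete complement to Section 3.3, the FCN log-scale matching loss can be sketched as follows. Each agent contributes $\log \hat{F}_i(o^i, a^i)$, and the IGC condition recovers the joint flow as the exponential of the sum. All function and argument names here are illustrative stand-ins for the per-agent neural networks in the paper.

```python
import math

def joint_flow(agent_log_flows):
    """IGC condition: F(s, a) = exp(sum_i log F_i(o_i, a_i))."""
    return math.exp(sum(agent_log_flows))

def fcn_loss(parent_edges, reward, child_edges, eps=1.0):
    """Log-scale matching loss for one interior state s'.
    parent_edges / child_edges: lists of per-edge k-agent log-flow lists."""
    inflow = sum(joint_flow(lf) for lf in parent_edges)
    outflow = reward + sum(joint_flow(lf) for lf in child_edges)
    return (math.log(eps + inflow) - math.log(eps + outflow)) ** 2

# Matched inflow/outflow gives zero loss: one parent edge with joint flow
# exp(0 + 0) = 1 against reward 0 plus one child edge with joint flow 1.
print(fcn_loss([[0.0, 0.0]], 0.0, [[0.0, 0.0]]))  # 0.0
```

Working in log-space per agent and summing before a single `exp` is what lets each agent keep a small $O(|\mathcal{A}_i|)$ flow table or network while the loss is still computed on the joint flow.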

4. RELATED WORKS

Generative Flow Networks: GFlowNets are an emerging class of generative models that learn a policy to generate objects with probability proportional to a given reward function. RL aims to maximize the expected reward and usually generates only the single highest-reward action sequence. Conversely, the learned policies of GFlowNets sample actions proportionally to the reward and are more suitable for exploration. This exploration ability makes GFlowNets promising as a new paradigm for policy optimization in RL, but many problems remain open, such as solving multi-agent collaborative tasks.

Cooperative Multi-agent Reinforcement Learning: Many MARL algorithms have been developed for collaborative tasks; two extremes are independent learning Tan (1993) and fully joint learning, while CTDE methods in between, such as MAPPO Yu et al. (2021), have shown surprising effectiveness in cooperative multi-agent games. The goal of these algorithms is to find the policy that maximizes the long-term reward; however, it is difficult for them to learn more diverse policies, which can generate more promising results.

5. EXPERIMENTS

We first verify the performance of CFN on a multi-agent hyper-grid domain where the partition function can be computed exactly. We then compare the performance of IFN and FCN with standard MCMC and several RL methods to show that their sampling distributions better match the normalized rewards. All our code is implemented with the PyTorch Paszke et al. (2019) library. We reimplement the multi-agent RL algorithms and the other baselines.

5.1. HYPER-GRID ENVIRONMENT

We consider a multi-agent MDP where states are the cells of an $N$-dimensional hypercubic grid of side length $H$. In this environment, all agents start from the initialization point $x = (0, 0, \cdots)$, and agent $i$ is only allowed to increase coordinate $i$ with action $a^i$. In addition, each agent has a stop action. When all agents choose the stop action or the maximum episode length $H$ is reached, the entire system resets for the next round of sampling. The reward function is designed as
$$R(x) = R_0 + R_1 \prod_j \mathbb{I}\big(0.25 < |x_j/H - 0.5|\big) + R_2 \prod_j \mathbb{I}\big(0.3 < |x_j/H - 0.5| < 0.4\big),$$
where $x = [x^1, \cdots, x^k]$ collects all agent states, and the reward terms $0 < R_0 \ll R_1 < R_2$ lead to a multimodal distribution. Setting $R_0$ closer to 0 makes this environment harder to solve, creating unexplored regions of the state space due to the sparse reward setting. We conduct experiments in hyper-grid environments with different numbers of agents and different dimensions, using version numbers to differentiate them: the higher the number, the more dimensions and agents. The specific details about the environments and experiments can be found in the appendix. We compare CFN and FCN with a modified MCMC method and RL methods. In the modified MCMC method Xie et al. (2021), we allow iterative reduction of coordinates on the basis of the joint action space and cancel the stop actions to form an ergodic chain. As for RL methods, we consider the maximum entropy algorithm, i.e., multi-agent SAC Haarnoja et al. (2018), and a previous cooperative multi-agent algorithm, i.e., MAPPO Yu et al. (2021). Note that the maximum entropy method uses a softmax policy over the value function to make decisions, so as to explore other rewarding states, which makes it related to our proposed algorithms.
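A possible implementation of this reward is sketched below; the default values of `R0`, `R1`, `R2` are illustrative choices satisfying $R_0 \ll R_1 < R_2$, and the product-of-indicators form follows the single-agent grid task of Bengio et al. (2021a).

```python
import numpy as np

# Sketch of the multi-agent hyper-grid reward. x concatenates all agents'
# coordinates; the high-reward modes sit near the grid corners.
def grid_reward(x, H, R0=1e-2, R1=0.5, R2=2.0):
    z = np.abs(np.asarray(x, dtype=float) / H - 0.5)
    term1 = R1 * np.prod(0.25 < z)               # broad corner plateaus
    term2 = R2 * np.prod((0.3 < z) & (z < 0.4))  # sharper peaks inside them
    return R0 + term1 + term2

print(grid_reward([7, 7], H=8))  # both indicator products fire: R0 + R1 + R2
print(grid_reward([4, 4], H=8))  # grid center: only the base reward R0
```

With these defaults, shrinking `R0` toward 0 leaves the interior of the grid nearly unrewarded, which is exactly the sparse-reward regime described above.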
To measure the performance of these methods, we define the empirical L1 error as $\mathbb{E}[|p(s_f) - \pi(s_f)|]$ with $p(s_f) = R(s_f)/Z$. Figure 3 illustrates the performance superiority of our proposed algorithms over the other methods in both L1 error and modes found. For FCN, we consider two decision-making variants: the first samples actions independently, called FCN v1, and the other combines the policies for sampling, named FCN v2. We find that on the small-scale environment shown in Figure 3-Left, CFN achieves the best performance, because CFN can accurately estimate the flow of joint actions when the joint action space is small. However, as the complexity of the joint action flow to be estimated increases, the performance of CFN degrades, while the independently executed method still achieves good estimation and maintains its convergence speed, as shown in Figure 3-Middle. Note that the RL-based methods do not achieve the expected performance; their performance curves first rise and then fall because, as training progresses, these methods tend to find the highest-reward nodes rather than more modes. In addition, as shown in Table 1, both the reinforcement learning methods and our proposed methods achieve the highest reward, but the average reward of reinforcement learning over all found modes is slightly better. Our algorithms do not always obtain higher rewards than RL, which is reasonable since the goal of GMFlowNets is not to maximize reward.

5.2. SMALL MOLECULES GENERATION

Following Bengio et al. (2021a), we consider the task of molecule generation to evaluate the performance of FCN. For any given molecule, subject to chemical validity constraints, we can choose an atom to attach a block to. The action space consists of choosing the location of the additional block and selecting the block itself, and the reward function is calculated by a pretrained model. We modify the environment to meet the multi-agent setting, where the task allows two agents to perform actions simultaneously depending on the state. Although this approach is not as refined as single-agent decision making, we only use it to verify the performance of FCN. Figure 4 shows the number of molecules with reward greater than a threshold $\tau = 8$ found by the different algorithms; FCN generates more high-reward molecules over three independent runs.
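The empirical L1 metric used above can be computed with a short helper when the terminal state set is enumerable; the function and argument names here are ours, not from the paper's code.

```python
# Empirical L1 error E[|p(s_f) - pi(s_f)|], where the target distribution is
# p(s_f) = R(s_f)/Z and pi is estimated from sample counts over an
# enumerable set of terminal states.

def empirical_l1(rewards, sample_counts):
    Z = float(sum(rewards))
    n = float(sum(sample_counts))
    p = [r / Z for r in rewards]             # target distribution R(s_f)/Z
    pi = [c / n for c in sample_counts]      # empirical sampling distribution
    return sum(abs(a - b) for a, b in zip(p, pi)) / len(rewards)

# A perfectly proportional sampler achieves zero error.
print(empirical_l1([1.0, 3.0], [25, 75]))  # 0.0
```

In practice $Z$ is intractable for large grids, which is why this metric is reported only on environments where the partition function can be computed exactly.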

6. CONCLUSION

In this paper, we discuss the policy optimization problem when GFlowNets meet multi-agent systems. Different from RL, the goal of GMFlowNets is to find diverse samples with probability proportional to the reward function. Since the joint flow is equivalent to the product of the independent flows of each agent, we design a CTDE method that simultaneously avoids the flow estimation complexity of the fully centralized algorithm and the non-stationarity of the independent learning process. Experimental results on hyper-grid environments and a small molecules generation task demonstrate the performance superiority of the proposed algorithms. Limitation and Future Work: Unlike multi-agent RL algorithms, which typically use RNNs as the value estimation network Hochreiter & Schmidhuber (1997); Rashid et al. (2018), our algorithms are not well suited to RNN-based flow estimation, because computing the parent nodes of each historical state introduces additional overhead. Another limitation is that, like the original GFlowNets, GMFlowNets are constrained to DAGs and discrete environments, which makes them temporarily unavailable for multi-agent continuous control tasks. Our future work is therefore to design multi-agent continuous algorithms to overcome these problems.

A PROOF OF MAIN RESULTS

A.1 PROOF OF THEOREM 1

Theorem 1. Let the joint policy be the product of the individual policies $\{\pi_i\}_{i=1}^k$, where each $\pi_i$ is defined by the individual flow function $F_i(o^i, a^i)$, i.e.,
$$\pi_i(a^i \mid o^i) = \frac{F_i(o^i, a^i)}{F_i(o^i)}, \quad \forall i = 1, \cdots, k.$$
Assume that the individual flows $F_i(o^i, a^i)$ satisfy the condition in Definition 2. Define a flow function $F$ such that all agents generate trajectories using the independent policies $\pi_i$, $i = 1, \ldots, k$, and the matching conditions $\forall s' > s_0$, $F(s') = \sum_{s \in \mathcal{P}(s')} F(s \to s')$ and $\forall s' < s_f$, $F(s') = \sum_{s'' \in \mathcal{C}(s')} F(s' \to s'')$ are satisfied. Then, we have: 1) $\pi(s_f) \propto R(s_f)$; 2) $F$ uniquely defines a Markovian flow $\hat{F}$ matching $F$ such that
$$\hat{F}(\tau) = \frac{\prod_{t=1}^{n+1} F(s_{t-1} \to s_t)}{\prod_{t=1}^{n} F(s_t)}.$$

Proof: We first prove part 1). Since $F(s_t, \mathbf{a}_t) = \prod_i F_i(o_t^i, a_t^i)$, the global state flow is
$$F(s_t) = \sum_{\mathbf{a}_t \in \mathcal{A}} F(s_t, \mathbf{a}_t) = \sum_{\mathbf{a}_t \in \mathcal{A}} \prod_i F_i(o_t^i, a_t^i).$$
According to the flow definitions, the observation flow $F_i(o_t^i)$ and the individual edge flows satisfy $F_i(o_t^i) = \sum_{a_t^i \in \mathcal{A}_i} F_i(o_t^i, a_t^i)$. Hence, we have
$$\prod_{i=1}^k F_i(o_t^i) = \prod_{i=1}^k \Bigg( \sum_{a_t^i \in \mathcal{A}_i} F_i(o_t^i, a_t^i) \Bigg) \quad (21)$$
$$= \Bigg( \sum_{a_t^1 \in \mathcal{A}_1} F_1(o_t^1, a_t^1) \Bigg) \cdots \Bigg( \sum_{a_t^k \in \mathcal{A}_k} F_k(o_t^k, a_t^k) \Bigg) \quad (22)$$
$$= \sum_{(a_t^1, \cdots, a_t^k) \in \mathcal{A}_1 \times \cdots \times \mathcal{A}_k} F_1(o_t^1, a_t^1) \cdots F_k(o_t^k, a_t^k) \quad (23)$$
$$= \sum_{\mathbf{a}_t \in \mathcal{A}} \prod_{i=1}^k F_i(o_t^i, a_t^i), \quad (24)$$
yielding $F(s_t) = \prod_i F_i(o_t^i)$. Therefore, the joint policy satisfies
$$\pi(\mathbf{a} \mid s) = \frac{F(s_t, \mathbf{a}_t)}{F(s_t)} = \frac{\prod_i F_i(o_t^i, a_t^i)}{\prod_i F_i(o_t^i)} = \prod_i \pi_i(a^i \mid o^i). \quad (25)$$
Equation 25 indicates that if the condition in Definition 2 is satisfied, the joint and individual policies are consistent. Based on Lemma 1, we conclude that the generated states satisfy $\pi(s_f) \propto R(s_f)$ when each agent uses its individual policy $\pi_i(a^i \mid o^i)$. Next, we prove part 2). We first prove the necessity part.
According to Definition 2 and Bengio et al. (2021b), we have
$$F(s') = \prod_i F_i(o^{i\prime}) = \prod_i \sum_{o^i \in \mathcal{P}(o^{i\prime})} F_i(o^i \to o^{i\prime}) = \sum_{o \in \mathcal{P}(o')} \prod_i F_i(o^i \to o^{i\prime}),$$
$$F(s') = \prod_i F_i(o^{i\prime}) = \prod_i \sum_{o^{i\prime\prime} \in \mathcal{C}(o^{i\prime})} F_i(o^{i\prime} \to o^{i\prime\prime}) = \sum_{o'' \in \mathcal{C}(o')} \prod_i F_i(o^{i\prime} \to o^{i\prime\prime}).$$
Then we prove the sufficiency part. We first present Lemma 3, which shows that $\sum_{\tau \in \mathcal{T}_{0,s}} P_B(\tau) = \sum_{\tau \in \mathcal{T}_{0,s}} \prod_{s_t \to s_{t+1} \in \tau} P_B(s_t \mid s_{t+1}) = 1$.

Lemma 3 (Independent Transition Probability) Define the independent forward and backward transition probabilities respectively as
$$P_F(o_{t+1}^i \mid o_t^i) := P_i(o_t^i \to o_{t+1}^i \mid o_t^i) = \frac{F_i(o_t^i \to o_{t+1}^i)}{F_i(o_t^i)},$$
$$P_B(o_t^i \mid o_{t+1}^i) := P_i(o_{t+1}^i \to o_t^i \mid o_{t+1}^i) = \frac{F_i(o_{t+1}^i \to o_t^i)}{F_i(o_{t+1}^i)}.$$
Then we have
$$\sum_{\tau \in \mathcal{T}_{s,f}} P_F(\tau) = 1, \ \forall s \in \mathcal{S} \setminus \{s_f\}, \qquad \sum_{\tau \in \mathcal{T}_{0,s}} P_B(\tau) = 1, \ \forall s \in \mathcal{S} \setminus \{s_0\},$$
where $\mathcal{T}_{s,f}$ is the set of trajectories starting in $s$ and ending in $s_f$, and $\mathcal{T}_{0,s}$ is the set of trajectories starting in $s_0$ and ending in $s$.

Define $\hat{Z} = F(s_0)$ as the partition function and $\hat{P}_F$ as the forward probability function. Then, according to Proposition 18 in Bengio et al. (2021b), there exists a unique Markovian flow $\hat{F}$ with forward transition probability function $P_{\hat{F}} = \hat{P}_F$ and partition function $\hat{Z}$, such that
$$\hat{F}(\tau) = \hat{Z} \prod_{t=1}^{n+1} \hat{P}_F(s_t \mid s_{t-1}) = \frac{\prod_{t=1}^{n+1} F(s_{t-1} \to s_t)}{\prod_{t=1}^{n} F(s_t)},$$
where $s_{n+1} = s_f$. Thus, for $s' \neq s_0$:
$$\hat{F}(s') = \hat{Z} \sum_{\tau \in \mathcal{T}_{0,s'}} \prod_{(s_t \to s_{t+1}) \in \tau} \hat{P}_F(s_{t+1} \mid s_t) = \hat{Z} \frac{F(s')}{F(s_0)} \sum_{\tau \in \mathcal{T}_{0,s'}} \prod_{(s_t \to s_{t+1}) \in \tau} \hat{P}_B(s_t \mid s_{t+1}) = F(s'). \quad (30)$$
Combining equation 30 with $P_{\hat{F}} = \hat{P}_F$, we have $\hat{F}(s \to s') = F(s \to s')$ for all $s \to s' \in \mathcal{A}$. Finally, for any Markovian flow $F'$ matching $F$ on states and edges, we have $F'(\tau) = \hat{F}(\tau)$ according to Proposition 16 in Bengio et al. (2021b), which shows the uniqueness property. This completes the proof.

A.2 PROOF OF LEMMA 2

Lemma 2.
Suppose Assumption 1 holds and the environment has a tree structure. Based on the IGC and IGM conditions, we have: 1) $Q_{tot}^\mu(s, \mathbf{a}) = F(s, \mathbf{a}) f(s)$; 2) $(\arg\max_{a^i} Q_i(o^i, a^i))_{i=1}^k = (\arg\max_{a^i} F_i(o^i, a^i))_{i=1}^k$.

Proof: The proof is an extension of that of Proposition 4 in Bengio et al. (2021a). For any $(s, \mathbf{a})$ satisfying $s_f = T(s, \mathbf{a})$, we have $Q_{tot}^\mu(s, \mathbf{a}) = R(s_f) f(s)$ and $F(s, \mathbf{a}) = R(s_f)$; therefore, $Q_{tot}^\mu(s, \mathbf{a}) = F(s, \mathbf{a}) f(s)$. Then, for each non-final node $s'$, expressing the action-value in terms of the action-values at the next step, we have by induction:
$$Q_{tot}^\mu(s, \mathbf{a}) = R(s') + \mu(\mathbf{a} \mid s') \sum_{\mathbf{a}' \in \mathcal{A}(s')} Q_{tot}^\mu(s', \mathbf{a}'; R) \overset{(a)}{=} 0 + \mu(\mathbf{a} \mid s') \sum_{\mathbf{a}' \in \mathcal{A}(s')} F(s', \mathbf{a}'; R) f(s'),$$
where $R(s')$ is the reward of $Q_{tot}^\mu(s, \mathbf{a})$ and (a) follows because $R(s') = 0$ when $s'$ is not a final state. Since the environment has a tree structure, we have $F(s, \mathbf{a}) = \sum_{\mathbf{a}' \in \mathcal{A}(s')} F(s', \mathbf{a}')$, which yields
$$Q_{tot}^\mu(s, \mathbf{a}) = \mu(\mathbf{a} \mid s') F(s, \mathbf{a}) f(s') = \mu(\mathbf{a} \mid s') F(s, \mathbf{a}) f(s) \frac{1}{\mu(\mathbf{a} \mid s')} = F(s, \mathbf{a}) f(s).$$
According to the IGC condition, $F(s_t, \mathbf{a}_t) = \prod_i F_i(o_t^i, a_t^i)$, yielding
$$\arg\max_{\mathbf{a}} Q_{tot}(s, \mathbf{a}) \overset{(a)}{=} \arg\max_{\mathbf{a}} \log F(s, \mathbf{a}) f(s) \overset{(b)}{=} \arg\max_{\mathbf{a}} \sum_{i=1}^k \log F_i(o^i, a^i) \overset{(c)}{=} \Big( \arg\max_{a^1 \in \mathcal{A}_1} F_1(o^1, a^1), \cdots, \arg\max_{a^k \in \mathcal{A}_k} F_k(o^k, a^k) \Big),$$
where (a) follows because $F$ and $f(s)$ are positive, and (b) and (c) follow from the IGC condition. Combining with the IGM condition
$$\arg\max_{\mathbf{a} \in \mathcal{A}} Q_{tot}(s, \mathbf{a}) = \Big( \arg\max_{a^1 \in \mathcal{A}_1} Q_1(o^1, a^1), \cdots, \arg\max_{a^k \in \mathcal{A}_k} Q_k(o^k, a^k) \Big), \quad \forall s \in \mathcal{S},$$
we conclude that $(\arg\max_{a^i \in \mathcal{A}_i} F_i(o^i, a^i))_{i=1}^k = (\arg\max_{a^i \in \mathcal{A}_i} Q_i(o^i, a^i))_{i=1}^k$. This completes the proof.

A.3 PROOF OF LEMMA 3

Lemma 3 is restated from the main text, where $\mathcal{T}_{s,f}$ is the set of trajectories starting in $s$ and ending in $s_f$, and $\mathcal{T}_{0,s}$ is the set of trajectories starting in $s_0$ and ending in $s$.

Proof: When the maximum trajectory length is at most 1, we trivially have $\sum_{\tau \in \mathcal{T}_{s,f}} P_F(\tau) = 1$.
Then we have the following result by induction:
$$\sum_{\tau \in \mathcal{T}_{s,f}} P_F(\tau) = \sum_{s' \in \mathcal{C}(s)} \sum_{\tau \in \mathcal{T}_{s \to s', f}} P_F(\tau) = \sum_{o' \in \mathcal{C}(o)} P_F(o' \mid o) \sum_{\tau \in \mathcal{T}_{s', f}} P_F(\tau) = \prod_{i=1}^{k} \sum_{o^{i\prime} \in \mathcal{C}(o^i)} P_F(o^{i\prime} \mid o^i) \sum_{\tau \in \mathcal{T}_{s', f}} P_F(\tau) = 1,$$
where $\mathcal{C}(\cdot)$ is the children set of the current state or observation, and the last equality follows from $\sum_{o^{i\prime} \in \mathcal{C}(o^i)} P_F(o^{i\prime} \mid o^i) = 1$. Since the proof for $P_B$ is similar to that for $P_F$, it is omitted here.

B EXPERIMENTAL DETAILS

B.1 HYPER-GRID ENVIRONMENT

Here we present the experimental details for the hyper-grid environments. Figure 5 shows the flow matching loss as a function of training steps. The loss of our proposed algorithm decreases gradually, ensuring a stable learning process. For RL algorithms based on state-action value estimation, the loss usually oscillates; this may be because RL-based methods use an experience replay buffer, so the transition data distribution is not stable enough. Our method uses on-policy optimization, where the data distribution changes with the current sampling policy, hence the loss function is comparatively stable. We set the same number of training steps for all algorithms for a fair comparison. Moreover, we list the key hyperparameters of the different algorithms in Tables 2, 3 and 4. We study the effect of different rewards in Figure 6. In particular, we set $R_0 = \{10^{-1}, 10^{-2}, 10^{-4}\}$ for different task difficulties. A smaller value of $R_0$ makes the reward function distribution more



For simplicity, here we consider homogeneous agents, i.e., $A_i = A_j, \forall i, j \in N$. Note that heterogeneous agents also face the problem of combinatorial complexity.



Figure 2: Framework of GMFlowNets. For each state, each agent obtains its own observation and computes its independent flow to sample actions. During training, each agent seeks the parent nodes to compute inflows and outflows, and performs policy optimization through flow matching.

Figure 1: The performance of centralized training and independent learning on the Hyper-Grid task.

Theorem 1 states two facts. First, the joint state-action flow function $F(s, a)$ can be decomposed into a product of independent flows. Second, any non-negative function satisfying the flow matching conditions determines a unique flow. On this basis, we can design flow decomposition algorithms based on the conservation properties. Each agent maintains a neural network that estimates the flow of its own actions; the joint flow function is then computed through the flow conservation condition, and the model is trained with the corresponding reward function. In this case, each agent maintains a flow estimation network with the above architecture, which only estimates $|A_i|$ flows. Compared with a centralized flow estimation network, this reduces the complexity to $O(N|A_i|)$. By combining the $F_i(o_i, a_i)$, we obtain an unbiased estimate of $F(s, a)$ to calculate a TD-like objective function. Next, we illustrate the overall training process.
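The complexity argument can be made concrete: a centralized estimator must score $|A_i|^N$ joint actions, while $N$ per-agent estimators expose only $N|A_i|$ outputs whose product (a sum in log space) scores any joint action. A minimal tabular sketch, with all values illustrative:

```python
import random

random.seed(1)
N_AGENTS, N_ACTIONS = 3, 5

# Per-agent log-flow tables: each agent stores only |A_i| values.
log_flows = [[random.gauss(0.0, 1.0) for _ in range(N_ACTIONS)]
             for _ in range(N_AGENTS)]

per_agent_outputs = N_AGENTS * N_ACTIONS    # O(N |A_i|) values maintained
centralized_outputs = N_ACTIONS ** N_AGENTS  # O(|A_i|^N) joint actions to score

# Product decomposition: log F(s, a) = sum_i log F_i(o_i, a_i), so any of
# the |A_i|^N joint actions is scored from the small per-agent tables.
a = (1, 4, 0)  # an example joint action
log_joint_flow = sum(log_flows[i][a[i]] for i in range(N_AGENTS))
```

Even in this tiny setting the gap is clear: 15 per-agent outputs versus 125 joint-action scores, and the gap grows exponentially with the number of agents.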

GFlowNets have achieved promising performance in many fields, such as molecule generation Bengio et al. (2021a); Malkin et al. (2022); Jain et al. (2022), discrete probabilistic modeling Zhang et al. (2022), and structure learning Deleu et al. (2022). Such networks sample from the distribution of high-reward trajectories and are useful in tasks where the reward distribution is diverse. This learning method is similar to reinforcement learning (RL) Sutton & Barto (2018), but

Algorithm 1 Flow Conservation Network (FCN) Algorithm
Input: MADAG ⟨S, A, P, R, N⟩, number of iterations T, sample size B, initial flow functions $F^0_i, \forall i = 1, \cdots, k$, and parameters.
1: for iteration $t = 1, \cdots, T$ do
2:   \\ Individual sampling process
3:   Sample observations $\{(o^b_i, a'^b_i, R^b)\}_{b=1}^B$ based on the individual flow functions $\hat{F}_i$ for all agents
4:   \\ Joint training process
5:   Seek all parent nodes $\{p^b\}$ of the global states $\{s^b\}_{b=1}^B$ and calculate the inflow $F(s^b, a^b)$
6:   Calculate the outflow $Y^b = R^b(s) + \sum_{a \in A(s)} F(s, a)$ by the flow conservation condition
7:   Update the individual flow functions: $\{\hat{F}_i\} \leftarrow \arg\min_{\{F_i\}_{i=1}^k} \sum_b [Y^b - F(s^b, a^b)]^2$
8: end for
9: Define the joint sampling policy as the product of the individual policies $\{\hat{\pi}_i\}_{i=1}^k$ w.r.t. $\{\hat{F}_i\}_{i=1}^k$
Output: flow function $\hat{F}_T$ and individual sampling policies $\{\hat{\pi}_i\}_{i=1}^k$
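To make lines 5-7 of Algorithm 1 concrete, the sketch below computes the inflow, outflow, and flow-matching loss for a single state of a toy two-agent problem. The parent set, available actions, and flow values are all made-up illustrative quantities, not the paper's implementation:

```python
import math

# Toy setup: 2 agents, 2 actions each, tabular per-agent log-flows keyed by
# (observation, action). All numbers are illustrative.
log_F = [
    {("o1", 0): 0.1, ("o1", 1): -0.3},   # agent 1
    {("o2", 0): 0.4, ("o2", 1): 0.2},    # agent 2
]

def joint_flow(obs, joint_action):
    """IGC condition: F(s, a) = prod_i F_i(o_i, a_i), computed in log space."""
    return math.exp(sum(log_F[i][(obs[i], joint_action[i])] for i in range(2)))

# Line 5: inflow of state s = sum over (parent, action) pairs leading into s.
parents = [(("o1", "o2"), (0, 1)), (("o1", "o2"), (1, 0))]  # assumed parent set
inflow = sum(joint_flow(obs, a) for obs, a in parents)

# Line 6: outflow = R(s) + sum over actions available at s.
reward = 0.5
children_actions = [(0, 0), (1, 1)]                          # assumed A(s)
outflow = reward + sum(joint_flow(("o1", "o2"), a) for a in children_actions)

# Line 7: flow-matching loss, the squared mismatch between inflow and outflow
# (in practice computed in log space for stability; gradients reach each F_i).
loss = (inflow - outflow) ** 2
```

In a full implementation this loss would be averaged over the batch of B sampled states and minimized by gradient descent on the per-agent flow networks.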

and centralized training. Independent training methods regard the influence of other agents as part of the environment, but a team reward usually makes it difficult to measure the contribution of each agent, so each agent faces a non-stationary environment Sunehag et al. (2017); Yang et al. (2020). On the contrary, centralized training treats the multi-agent problem as a single-agent counterpart. Unfortunately, this method exhibits combinatorial complexity and is difficult to scale beyond dozens of agents Yang et al. (2019). Therefore, the most popular paradigm is centralized training with decentralized execution (CTDE), including value-based Sunehag et al. (2017); Rashid et al. (2018); Son et al. (2019); Wang et al. (2020) and policy-based Lowe et al. (2017); Yu et al. (2021); Kuba et al. (2021) methods. The goal of value-based methods is to decompose the joint value function among agents for decentralized execution, which requires that the local maxima of the agents' value functions equal the global maximum of the joint value function. VDN Sunehag et al. (2017) and QMIX Rashid et al. (2018) propose two classic and efficient factorization structures, additivity and monotonicity, respectively, albeit with strict factorization assumptions. QTRAN Son et al. (2019) and QPLEX Wang et al. (2020) introduce extra designs for decomposition, such as factorization structures and advantage functions. The policy-based methods extend the single-agent TRPO Schulman et al. (2015) and PPO Schulman et al. (2017) to the multi-agent setting, such as MAPPO Yu et al. (2021).

Figure 3: L1 error and Mode Found performance of different algorithms on various Hyper-grid environments. Top and bottom are respectively Mode Found (higher is better) and L1 Error (lower is better). Left: Hyper-Grid v1, Middle: Hyper-Grid v2, Right: Hyper-Grid v3.

Figure 4: Performance of FCN and MAPPO on the molecule generation task.

Similar to Jin et al. (2018); Bengio et al. (2021a); Xie et al. (2021), we consider the task of molecule generation to evaluate the performance of FCN. Given a molecule and chemical validity constraints, we can choose an atom to attach a block to. The action space consists of choosing the attachment location and selecting the block to add. The reward function is calculated by a pretrained model. We modify the environment to meet the multi-agent setting, allowing two agents to perform actions simultaneously depending on the state. Although this approach is not as refined as single-agent decision making, we only use it to verify the performance of FCN. Figure 4 shows the number of molecules found with reward greater than a threshold τ = 8 for each algorithm; FCN generates more high-reward molecules over three independent runs.


Figure 5: The flow matching loss of different algorithms.

being the sample distribution computed from the true reward function. Moreover, we consider the modes found, as well as the best reward found by different methods, to demonstrate the superiority of the algorithm.
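Concretely, the empirical L1 error can be computed as the mean absolute difference between the empirical distribution of generated samples and the target distribution $p(x) \propto R(x)$. A small sketch, where the rewards and sample counts are hypothetical:

```python
# Target distribution proportional to the true reward over 4 objects.
rewards = [4.0, 2.0, 1.0, 1.0]
target = [r / sum(rewards) for r in rewards]

# Empirical distribution from generated samples (hypothetical counts).
counts = [45, 28, 14, 13]
empirical = [c / sum(counts) for c in counts]

# L1 error: mean absolute difference between the two distributions.
l1_error = sum(abs(p - q) for p, q in zip(empirical, target)) / len(target)
```

A sampler whose draws are exactly reward-proportional would drive this quantity to zero as the number of samples grows.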


Hyper-parameters of FCN under different environments

Hyper-parameters of CFN under different environments


As shown in Figure 6, we found that our proposed method is robust in the cases $R_0 = 10^{-1}$ and $R_0 = 10^{-2}$. When the reward distribution becomes sparse, the performance of the proposed algorithm degrades slightly.

