IMPROVING GENERATIVE FLOW NETWORKS WITH PATH REGULARIZATION

Abstract

Generative Flow Networks (GFlowNets) are recently proposed models for learning stochastic policies that generate compositional objects by sequences of actions with the probability proportional to a given reward function. The central problem of GFlowNets is to improve their exploration and generalization. In this work, we propose a novel path regularization method based on optimal transport theory that places prior constraints on the underlying structure of the GFlowNet. The prior is designed to help the GFlowNet better discover the latent structure of the target distribution or enhance its ability to explore the environment in the context of active learning. The path regularization controls the flow in the GFlowNet to generate more diverse and novel candidates via maximizing the optimal transport distances between two forward policies or to improve the generalization via minimizing the optimal transport distances. In addition, we derive an efficient implementation of the regularization by finding its closed-form solutions in specific cases and a meaningful upper bound that can be used as an approximation to minimize the regularization term. We empirically demonstrate the advantages of our path regularization on a wide range of tasks, including synthetic hypergrid environment modeling, discrete probabilistic modeling, and biological sequence design.

1. INTRODUCTION

Recently proposed by Bengio et al. (2021a), Generative Flow Networks (GFlowNets) are generative models for compositional objects. They learn a stochastic policy that sequentially modifies a partially constructed object through a sequence of actions, so that the probability of generating an object is proportional to a given reward function. Specifically, GFlowNets aim to solve the problem of generating a diverse set of good candidates. In biological sequence design, diversity is a crucial consideration because it improves the chance of discovering candidates that satisfy many evaluation criteria in later downstream phases (Jain et al. (2022)). The effect of diverse generation is especially apparent in the multi-round active learning setting, where the generator is iteratively improved by receiving feedback from an oracle on its proposed candidates, because more diversity means more exploration and more knowledge gained. Besides, the generalization ability of GFlowNets (Zhang et al. (2022); Malkin et al. (2022)) over structured data makes them a good framework for discrete probabilistic modeling. The central problems of GFlowNets are improving exploration and generalization. In this work, we propose to train the GFlowNet with an additional path regularization via optimal transport (OT) (Villani (2003)), which acts as a prior constraint on its underlying structure. The prior is designed to help the GFlowNet better discover the latent structure of the target distribution or enhance its ability to explore the environment in the context of active learning. Precisely, the path regularization via OT can help the GFlowNet generate more diverse and novel candidates by maximizing the OT distances between two forward policies, or improve generalization by minimizing those OT distances.
For generalization. To improve the GFlowNet's generalization, we propose the following prior constraints: (i) The forward policies of two neighbor states are expected to be similar, in the sense that both concentrate on a few next actions, which implicitly forces the GFlowNet to find states with high rewards rather than exploring, especially in sparse environments; (ii) Trajectories leading to positive objects (objects with high rewards) should share their paths. As a result, the similarity of states along trajectories with high flow is higher than elsewhere. From a probabilistic perspective, we propose to measure the similarity of states $s$ and $s'$ in the GFlowNet by the transition probability from $s$ to $s'$; (iii) Once the GFlowNet has learned something, a sparse flow is expected to generalize better. Thus, although many solutions exist for learning a GFlowNet, our proposed priors promote refining the GFlowNet's flow, i.e., enhancing the flow on high-flow trajectories and reducing it on low-flow ones.

For diversity and exploration. To encourage the GFlowNet's policy to generate more diverse candidates, such as in the multi-round active learning setting, we propose to put a prior constraint on the forward policies of two neighbor states. Specifically, this prior constraint intentionally promotes the "dissimilarity" between the forward policies of two neighbor states. In other words, it pushes the child states of the two neighbor states far from each other in terms of transition probability, which helps the GFlowNet generate more diverse and novel candidates.

Why is OT a good solution? We need a measure of "distance" between pairs of probability distributions. Optimal transport (OT) theory (Villani (2003)) studies how probability mass can be optimally transported from the support of one probability distribution to the support of another, given a cost function.
The minimum transportation cost, called the OT distance, can be used as a metric that quantifies the distance between two probability distributions. In the context of GFlowNets, we want to affect nearby states, which can be done by regularizing the OT distance between the forward policies $P_F(\cdot|s)$ and $P_F(\cdot|s')$ of two neighbor states $s$ and $s'$. To compute the OT distance, we solve an OT problem between two discrete probability measures, whose support points are the child states of $s$ and $s'$ respectively, given the transportation cost $c(u_i, v_j)$ from each child $u_i$ of $s$ to each child $v_j$ of $s'$. A weakness of the KL divergence is that it requires the two distributions of interest to share the same support, whereas OT handles distributions with different supports. Another reason is that the cost used in our OT distance can capture the structure of the given DAG and the GFlowNet's flow, while directly using the KL divergence cannot.

Contributions. In this work, we develop a novel path regularization based on OT theory for either helping the GFlowNet better discover the latent structure of the target distribution or enhancing its ability to explore the environment in the context of active learning. Our contributions can be summarized as follows:

1. We propose to train the GFlowNet with an additional path regularization via OT, which acts as a prior on the underlying structure of the GFlowNet for either improving its generalization capability or enhancing its exploration ability.
2. We define a new directed distance between two arbitrary states in the GFlowNet, which can be naturally chosen as the transportation cost for computing the OT distance and which links the proposed regularization to entropy terms.
3. We derive an efficient implementation of the proposed regularization by finding its closed-form solutions in specific cases and a meaningful upper bound that can be used as an approximation when we want to minimize the regularization term.

Organization.
The paper is organized as follows. In Section 2, we provide the background of GFlowNets and OT. In Section 3, we propose a new directed distance between two arbitrary states in the GFlowNet and then derive the formulation of path regularization via OT. We also explain why it is the natural and optimal choice for constructing the transportation cost between states. Then, we derive the upper bound and efficient implementation of the proposed path regularization. We provide extensive experiment results of our path regularization via OT in Section 4 and conclude the paper with a few discussions in Section 5. Theoretical proofs, as well as experimental settings and additional results, are provided in the Appendix.

2. BACKGROUND

2.1. GFLOWNETS

Consider a compositional space $\mathcal{X}$, where each object $x \in \mathcal{X}$ can be constructed by taking a sequence of discrete actions from the action space $\mathcal{A}$. Specifically, the construction of each object begins from the source state $s_0$ and ends in the final state $s_f$. The generation process incrementally modifies a partially constructed object, which is called a state $s \in \mathcal{S}$. In addition, a specific action signals that the object is completely constructed, which yields a terminal state such that $s = x \in \mathcal{X}$. These states and actions correspond to the vertices and edges of a directed acyclic graph $G = (\mathcal{S}, \mathcal{A})$. The construction of an object $x \in \mathcal{X}$ defines a complete trajectory, which is a sequence of transitions $\tau = (s_0 \to s_1 \to \dots \to s_n = x \to s_f)$. $\mathcal{T}$ is defined as the set of all complete trajectories. Following Malkin et al. (2022), we may assume that each terminal state $s \in \mathcal{X}$ has only one outgoing edge, which is $s \to s_f$.

Flows. Following Bengio et al. (2021b), a trajectory flow is a nonnegative function $F : \mathcal{T} \to \mathbb{R}_+$, which represents the probability mass of each complete trajectory $\tau$. Consequently, the flow through each state can be defined as $F(s) = \sum_{\tau \in \mathcal{T},\, s \in \tau} F(\tau)$, and the flow through each edge as $F(s \to s') = \sum_{\tau \in \mathcal{T},\, s \to s' \in \tau} F(\tau)$. We can associate a probability measure $P$ with the trajectory flow $F$. Two conditional probabilities of this measure are important: the forward transition probability (forward policy) $P_F(s'|s) := F(s \to s')/F(s)$, which is related to adding an element to build the object, and the backward transition probability (backward policy) $P_B(s|s') := F(s \to s')/F(s')$, which is related to removing an element.
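The definitions of state flow, edge flow, and the two policies can be illustrated with a minimal sketch. The toy trajectories, state names, and the helper `flows_and_policies` below are hypothetical and only serve to make the sums concrete; they are not part of the original method.

```python
from collections import defaultdict


def flows_and_policies(trajectory_flows):
    """Derive F(s), F(s -> s'), P_F and P_B from trajectory flows.

    trajectory_flows maps a complete trajectory (tuple of states) to F(tau) >= 0.
    In a DAG a state occurs at most once per trajectory, so summing over the
    states of each trajectory matches the definition of F(s).
    """
    state_flow = defaultdict(float)
    edge_flow = defaultdict(float)
    for traj, flow in trajectory_flows.items():
        for s in traj:
            state_flow[s] += flow
        for s, s_next in zip(traj, traj[1:]):
            edge_flow[(s, s_next)] += flow
    # Keys are edges (s, s'): p_forward is P_F(s'|s), p_backward is P_B(s|s').
    p_forward = {(s, t): f / state_flow[s] for (s, t), f in edge_flow.items()}
    p_backward = {(s, t): f / state_flow[t] for (s, t), f in edge_flow.items()}
    return dict(state_flow), dict(edge_flow), p_forward, p_backward


# Hypothetical toy DAG with three complete trajectories.
toy = {
    ("s0", "a", "x", "sf"): 1.0,
    ("s0", "b", "x", "sf"): 3.0,
    ("s0", "b", "y", "sf"): 4.0,
}
state_flow, edge_flow, p_f, p_b = flows_and_policies(toy)
```

In this toy example the forward probabilities out of each state sum to one, and the backward probabilities into `x` split its flow between the parents `a` and `b`.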
Learning Objective. Theoretically, if a training objective such as the flow matching objective (Bengio et al., 2021a), the detailed balance objective (Bengio et al., 2021b), or the trajectory balance objective (Malkin et al., 2022) is satisfied on all states and all possible trajectories, then the GFlowNet is trained to completion, i.e., it generates objects with probability exactly proportional to their rewards. In this paper, we use the trajectory balance objective because it brings more efficient credit assignment and faster convergence (Malkin et al. (2022)). We provide more background on GFlowNets in Appendix B.

2.2. OPTIMAL TRANSPORT DISTANCE

Transportation plans and joint probabilities. For two discrete probability measures $\alpha$ and $\beta$ over some space $\mathcal{X}$, the set of admissible couplings, which can be interpreted as the set of transportation plans or joint probability distributions, is defined as:

$$ \Pi(\alpha, \beta) = \left\{ \pi \in \mathbb{R}_+^{k \times l} : \pi \mathbf{1}_l = \alpha,\ \pi^\top \mathbf{1}_k = \beta \right\}. \tag{1} $$

Optimal transportation. The Kantorovich optimal transport (Peyré & Cuturi (2019)) between $\alpha$ and $\beta$ is defined as follows:

$$ OT_C(\alpha, \beta) := \min_{\pi \in \Pi(\alpha, \beta)} \langle C, \pi \rangle, \tag{2} $$

where $C$ is the cost matrix and $C_{ij}$ describes the cost of transporting mass from the $i$-th support point of $\alpha$ to the $j$-th support point of $\beta$. Whenever the matrix $C$ is itself a metric matrix, the optimum of this problem, $OT_C(\alpha, \beta)$, can be proved to be a distance as well. Assuming that $k = l = d$, the worst-case complexity of computing that optimum with any of the algorithms known so far scales in $O(d^3 \log d)$ and turns out to be super-cubic in practice (Pele & Werman (2009), §2.1).
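For intuition, the Kantorovich problem can be solved by hand when both measures have two support points: the admissible couplings then form a one-parameter family, and the linear objective is minimized at one of the two endpoints. The following is a minimal sketch under that assumption; the function name `ot_2x2` and the example numbers are illustrative only, and larger problems need a general linear-programming or Sinkhorn solver.

```python
def ot_2x2(alpha, beta, cost):
    """Exact OT between two probability vectors of length 2.

    Every admissible plan has the form
        [[t,          a1 - t],
         [b1 - t, a2 - b1 + t]]
    with max(0, a1 - b2) <= t <= min(a1, b1). The objective is linear in t,
    so the minimum is attained at one of the two endpoints.
    """
    a1, a2 = alpha
    b1, b2 = beta
    lo = max(0.0, a1 - b2)
    hi = min(a1, b1)

    def plan(t):
        return [[t, a1 - t], [b1 - t, a2 - b1 + t]]

    def total(t):
        p = plan(t)
        return sum(cost[i][j] * p[i][j] for i in range(2) for j in range(2))

    best = min((lo, hi), key=total)
    return total(best), plan(best)


# Hypothetical example: moving mass between two points at unit distance.
value, coupling = ot_2x2((0.5, 0.5), (0.25, 0.75), [[0.0, 1.0], [1.0, 0.0]])
```

Here a quarter of the mass has to move from the first support point to the second, so the returned value equals the amount of mass moved times the unit cost.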

3.1. OPTIMAL TRANSPORT FORMULATION OF THE PATH REGULARIZATION

Turning transition probabilities into a directed distance. We define a new directed distance between two arbitrary states in the GFlowNet, which is used as the transportation cost to compute the OT distance. Specifically, the directed distance from a state $s$ to another state $s'$ is designed to decrease as the probability of going from $s$ to $s'$ increases. Because we consider two arbitrary states $s$ and $s'$, there does not always exist a sequence of forward transitions $\tau = (s = s_0 \to s_1 \to \dots \to s_n = s')$ with $s_t \to s_{t+1} \in \mathcal{A}$. As a result, many entries of the cost matrix could be infinite, which can also make the OT cost infinite. Therefore, we consider a generalized notion of $\tau$ as a sequence of transitions $\tau = (s = s_0 \to s_1 \to \dots \to s_n = s')$ where each $s_t \to s_{t+1}$ can be a forward or a backward transition, i.e., $\tau$ can be a back-and-forth trajectory (Zhang et al. (2022)). Such a sequence of transitions always exists because, when we disregard the direction of each edge, the given DAG can be considered an undirected connected graph.

Definition 3.1 (Directed distance in the GFlowNet) Let $\tau = (s = s_0 \to s_1 \to \dots \to s_n = s')$ be a sequence of transitions from $s$ to $s'$ where each $s_t \to s_{t+1}$ can be a forward or a backward transition. The length of the trajectory $\tau$ is defined as:

$$ Len(\tau) := -\log(P(\tau | s)), \tag{3} $$

and the directed distance from $s$ to $s'$ is defined as:

$$ d(s, s') := \min_{\tau = (s \to \dots \to s')} -\log(P(\tau | s)), \tag{4} $$

where the minimum is taken over all such sequences of transitions from $s$ to $s'$, and $d(s, s') = 0$ when $s \equiv s'$. Intuitively, $d(s, s')$ is the shortest path length from $s$ to $s'$.
Although, d(s, s ′ ) does not satisfy conditions of being a "distance", it is still a pseudoquasimetric, which motivates us to use the term "directed distance" as an analogy with "directed distance in digraphs" Chartrand & Tian (1997) . Indeed, the proposed directed distance is the natural choice for distance in the GFlowNet. Firstly, let's consider two states s and s ′ . They can be considered the equivalent of one another, and have zero distance, if P (s ′ |s), the probability of transitioning from state s to state s ′ , is equal to 1. In contrast, if the transition probability is equal to 0, we cannot reach s ′ from s by following the GFlowNet policy, then the distance must be infinite. Another reason is that distances are additive, whereas probabilities are multiplicative. Consequently, if we want the length of a trajectory to be related to its likelihood, the transition probabilities between states should be changed to their distance by a "negative" logarithmic scale. Why do we use the proposed directed distance as transportation cost to compute OT distance? Several reasons exist for using the proposed directed distance as transportation cost to compute the OT distance. First, as discussed above, it is the natural choice for turning a transition probability into a distance. Second, with the transportation cost defined as our directed distance, we can decompose the OT distance into entropy and other terms. Consequently, this property not only gives us an upper bound for the OT distance but also imposes sparsity prior on the GFlowNet's structure when minimizing the OT distance or improves exploration when maximizing the OT distance. Besides, the construction of the transportation cost makes minimizing OT distance correspond to maximizing the transition probability between children of neighbor states. Moreover, it also allows us to derive closed-form solutions for the OT distance under some conditions. 
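Since each transition contributes the non-negative weight $-\log p$, the directed distance of Definition 3.1 is a shortest-path length and can be sketched with Dijkstra's algorithm on a small graph whose edges carry forward or backward transition probabilities. The function `directed_distance`, the state names, and the probabilities below are hypothetical; in practice the full graph is not available and the distance is only approximated on a local sub-graph.

```python
import heapq
import math


def directed_distance(transitions, source, target):
    """Shortest path length with edge weights -log(p).

    transitions maps (s, s_next) to the probability of that single move,
    P_F for a forward move and P_B for a backward move (both assumed given).
    Returns math.inf when target cannot be reached from source.
    """
    if source == target:
        return 0.0
    neighbors = {}
    for (s, t), p in transitions.items():
        if p > 0.0:
            neighbors.setdefault(s, []).append((t, -math.log(p)))
    best = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        dist, s = heapq.heappop(heap)
        if s == target:
            return dist
        if dist > best.get(s, math.inf):
            continue  # stale heap entry
        for t, weight in neighbors.get(s, []):
            cand = dist + weight
            if cand < best.get(t, math.inf):
                best[t] = cand
                heapq.heappush(heap, (cand, t))
    return math.inf


# Hypothetical local sub-graph: s -> {u, s2}, s2 -> v, a direct edge u -> v,
# and one backward move u -> s with probability P_B(s|u) = 0.8.
transitions = {
    ("s", "u"): 0.4,
    ("s", "s2"): 0.6,
    ("s2", "v"): 0.5,
    ("u", "v"): 0.1,
    ("u", "s"): 0.8,
}
d_uv = directed_distance(transitions, "u", "v")
d_vu = directed_distance(transitions, "v", "u")
```

In this example the back-and-forth path `u -> s -> s2 -> v` has probability 0.24, which is more likely than the direct edge with probability 0.1, so it determines `d_uv`. No move out of `v` is listed, so `d_vu` is infinite, which illustrates that the distance is directed.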
Optimal transport formulation of the path regularization. Once we have defined the directed distance between two states in the GFlowNet, we can define the OT distance between the forward policies of two neighbor states. Consider two neighbor states $s$ and $s'$ in a trajectory $\tau$, such that $s \to s' \in \mathcal{A}$. The forward policy $P_F(\cdot|s)$ is a discrete probability measure supported on $Child(s) = \{u_1, \dots, u_k\}$, and $P_F(\cdot|s')$ is a discrete probability measure supported on $Child(s') = \{v_1, \dots, v_l\}$. The OT distance between $P_F(\cdot|s)$ and $P_F(\cdot|s')$ is defined as:

$$ OT_C(P_F(\cdot|s), P_F(\cdot|s')) := \min_{\pi \in \Pi(P_F(\cdot|s), P_F(\cdot|s'))} \langle C, \pi \rangle, \tag{5} $$

where the set of admissible couplings is defined as:

$$ \Pi(P_F(\cdot|s), P_F(\cdot|s')) := \left\{ \pi \in \mathbb{R}_+^{k \times l} : \pi \mathbf{1}_l = P_F(\cdot|s),\ \pi^\top \mathbf{1}_k = P_F(\cdot|s') \right\}, \tag{6} $$

and $C$ is a cost matrix whose entries are the lengths of the shortest paths from $u_i$ to $v_j$:

$$ C_{ij} = c(u_i, v_j) := d(u_i, v_j) = \min_{\tau = (u_i \to \dots \to v_j)} -\log(P(\tau | u_i)). \tag{7} $$

However, the full DAG is unavailable during training because of the enormous number of states and edges. Hence $C_{ij}$ in Eqn. 7 can only be computed approximately, using trajectories in the sub-graph containing $s$, $s'$, their child states, and the edges connecting them. Specifically, besides going directly from $u_i$ to $v_j$ when this edge exists, we can always move from $u_i$ to $v_j$ along a back-and-forth trajectory, i.e., $u_i \to s \to s' \to v_j$, with probability $P_B(s|u_i) P_F(s'|s) P_F(v_j|s')$. Therefore, in practice each entry of the cost matrix $C$ is approximated as follows (we abuse notation and write "=" instead of "≈" in Eqn. 8 for readability):

$$ C_{ij} = \begin{cases} 0, & \text{if } u_i \equiv v_j, \\ \min\big( -\log(P_B(s|u_i) P_F(s'|s) P_F(v_j|s')),\ -\log(P(v_j|u_i)) \big), & \text{else if } u_i \to v_j \in \mathcal{A}, \\ -\log(P_B(s|u_i) P_F(s'|s) P_F(v_j|s')), & \text{otherwise.} \end{cases} \tag{8} $$

Definition 3.2 (Optimal Transport Formulation of the Path Regularization) For any complete trajectory $\tau = (s_0 \to s_1 \to \dots \to s_n)$, we define the path regularization via OT as follows:

$$ \mathcal{L}_{OT}(\tau) := \sum_{t=0}^{n-1} OT_{C_{t;\theta}}(P_F(\cdot|s_t; \theta), P_F(\cdot|s_{t+1}; \theta)), \tag{9} $$

where $C_{t;\theta}$ is the cost matrix whose entries are defined in Eqn. 8. If $\pi_\theta$ is the training policy (usually the one given by $P_F(\cdot|\cdot; \theta)$ or a modified version of it), then the trajectory loss is updated along trajectories sampled from $\pi_\theta$, i.e., with the stochastic gradient:

$$ \mathbb{E}_{\tau \sim \pi_\theta} \nabla_\theta \big( \mathcal{L}_{TB}(\tau) + \lambda \mathcal{L}_{OT}(\tau) \big), \tag{10} $$

where $\lambda \in \mathbb{R}$; $\lambda > 0$ means that we minimize the path regularization and $\lambda < 0$ means that we maximize it.

[Figure: example of the cost matrix for two neighbor states $s$ and $s'$ with three child states each.] The cost matrix is a $3 \times 3$ matrix. For example, $c_{11} = d(u_1, v_1) = 0$ (because $u_1 \equiv v_1$). There exist several possible paths from $u_3$ to $v_3$. First, we can go directly from $u_3$ to $v_3$ with length $Len(u_3 \to v_3) = -\log(P_F(v_3|u_3))$. Second, we can move from $u_3$ to $v_3$ along a back-and-forth trajectory, i.e., $u_3 \to s \to s' \to v_3$, with length $-\log(P_B(s|u_3)) - \log(P_F(s'|s)) - \log(P_F(v_3|s'))$. Because the transportation cost from $u_3$ to $v_3$ is the length of the shortest path from $u_3$ to $v_3$, $c_{33} = d(u_3, v_3) \approx \min\big(-\log(P_B(s|u_3) P_F(s'|s) P_F(v_3|s')),\ -\log(P_F(v_3|u_3))\big)$.

The effect of minimizing the path regularization: Following Eqn. 5, minimizing $OT_C(P_F(\cdot|s), P_F(\cdot|s'))$ makes $P_F(v|s')$ close to $P_F(u|s)$ wherever $c(u, v)$ is small. This is because all the probability mass at $u$ can then be transferred to $v$ at a smaller cost than to other places, which reduces the expected transportation cost. Also, $P_F(v|s')$ and $P_F(u|s)$ increase where $c(u, v)$ is small, which induces similarity between the forward policies and makes the flow focus on specific directions. Note that the cost matrix $C$ is not constant and depends on the forward policies. Therefore, minimizing the OT distance affects not only the forward policies but also the transportation cost. However, the effect on the forward policies is consistent with the effect on the transportation costs. As in Eqn. 14 of Theorem 3.2, increasing $P_F(u|s)$ leads to an increase in $P_B(s|u)$.
Besides, $-\log(P_B(s|u) P_F(s'|s) P_F(v|s'))$ is an upper bound of $c(u, v)$, so increasing $P_F(u|s)$ and $P_F(v|s')$ also makes $c(u, v)$ smaller. Also, the OT distance is equal to zero when all the probability mass $P_F(v|s') = P_F(u|s)$ concentrates on pairs $u, v$ with $c(u, v) = 0$. For example, $P_F(s_t|s_{t-1}) = 1$ and $P_F(s_{t+1}|s_t) = 1$ is a special case. When the GFlowNet visits a high-reward object, the prior by design helps the GFlowNet quickly adapt its flow to this high-reward terminal state. The relation with entropy is discussed further in Theorem 3.1.

The effect of maximizing the path regularization: In contrast, maximizing the OT distance makes the forward policies different, so more diverse actions are chosen, leading to more diverse and novel candidates. As shown in Theorem 3.1, the upper bound of the OT distance contains the entropy of the forward policy $H(P_F(\cdot|s))$. Moreover, when the training policy is given by the forward policy $P_F$, the upper bound also contains the entropy of the path $H(P(\tau))$. Thus, maximizing the OT distance is expected to increase the upper bound, which increases the entropy as well. This means more diversity and exploration. Recall that, in terms of probabilistic interpretation, we can rewrite the OT distance as the minimum expected transportation cost $\min_{\gamma \in \Pi(\alpha, \beta)} \mathbb{E}_{(u, v) \sim \gamma}\, c(u, v)$. Thus, maximizing the OT distance means maximizing the costs $c(u, v)$. Because $c(u, v)$ is a decreasing function of the transition probability from $u$ to $v$, maximizing the OT distance means minimizing $P(u \to v)$. Consequently, the flow is distributed over more states, so more diverse actions are chosen.
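The approximate cost matrix of Eqn. 8 can be sketched as a small function. All argument names below are hypothetical, every probability is assumed to be strictly positive, and the toy numbers are only for illustration.

```python
import math


def approx_cost_matrix(children_s, children_s2, p_b_s_given_u, p_f_s2_given_s,
                       p_f_v_given_s2, p_direct=None):
    """Approximate cost matrix of Eqn. 8 for neighbor states s -> s2.

    children_s / children_s2: lists of child states of s and s2 (= s').
    p_b_s_given_u[u] = P_B(s|u), p_f_s2_given_s = P_F(s'|s),
    p_f_v_given_s2[v] = P_F(v|s'), and p_direct[(u, v)] = P(v|u) for the
    direct edges u -> v that exist in the local sub-graph.
    """
    p_direct = p_direct or {}
    cost = []
    for u in children_s:
        row = []
        for v in children_s2:
            if u == v:
                row.append(0.0)  # shared child: zero cost
                continue
            # Length of the back-and-forth path u -> s -> s' -> v.
            detour = -math.log(p_b_s_given_u[u] * p_f_s2_given_s * p_f_v_given_s2[v])
            if (u, v) in p_direct:
                row.append(min(detour, -math.log(p_direct[(u, v)])))
            else:
                row.append(detour)
        cost.append(row)
    return cost


# Hypothetical example: s has children {s2, u}, s2 has children {v1, v2}.
C = approx_cost_matrix(
    children_s=["s2", "u"],
    children_s2=["v1", "v2"],
    p_b_s_given_u={"s2": 1.0, "u": 0.5},
    p_f_s2_given_s=0.5,
    p_f_v_given_s2={"v1": 0.25, "v2": 0.75},
    p_direct={("s2", "v1"): 0.25, ("s2", "v2"): 0.75, ("u", "v1"): 0.5},
)
```

In this example the row of `s2` uses the direct edges to its own children, the entry from `u` to `v1` uses the more likely direct edge, and the entry from `u` to `v2` falls back to the back-and-forth path.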

3.2. UPPER BOUND AND EFFICIENT IMPLEMENTATION OF THE PATH REGULARIZATION

The cost of computing the OT distance $OT(P_F(\cdot|s), P_F(\cdot|s'))$ scales at least in $O(d^3 \log d)$, where $d$ is the number of support points. By using the Sinkhorn algorithm (Cuturi (2013)), which solves OT with entropic regularization, we can reduce the computational complexity to $O(d^2)$ (Altschuler et al., 2017; Lin et al., 2019; 2022). However, the definition of our path regularization requires computing the OT distances for all edges in the trajectory $\tau$, which imposes a heavy computational burden. To overcome this problem, in Theorem 3.1 we propose an upper bound of the OT distance $OT(P_F(\cdot|s), P_F(\cdot|s'))$. The upper bound provides an efficient implementation and can explain the behavior of the path regularization, because it decomposes the OT distance into entropy and other terms. Moreover, when the GFlowNet's setting satisfies certain conditions (Section 3.3), we can solve the OT problem in closed form. Our upper bound and closed-form formulation both have a computational complexity of $O(LV)$, where $L$ is the maximal length of the constructed sequences and $V$ is the size of the action space.

Theorem 3.1 (Upper bound of optimal transport distance) For any trajectory $\tau = (s_0 \to s_1 \to \dots \to s_n)$, the path regularization via OT $\mathcal{L}_{OT}(\tau)$ can be upper bounded by:

$$ \mathcal{L}_{UB}(\tau) := \sum_{t=0}^{n-1} \Big( -\sum_{u \in Child(s_t)} P_F(u|s_t) \log(P_B(s_t|u)) - \log(P_F(s_{t+1}|s_t)) + H(P_F(\cdot|s_{t+1})) \Big). \tag{11} $$

The proof of Theorem 3.1 is provided in Appendix D.1. Each summand is the expected cost of the back-and-forth paths $u \to s_t \to s_{t+1} \to v$ from Eqn. 8, which upper bounds every entry of the cost matrix. The entropy terms in the upper bound $\mathcal{L}_{UB}(\tau)$ encourage the sparsity of the GFlowNet's flow. In addition, when we minimize the upper bound $\mathcal{L}_{UB}(\tau)$, the terms $-\sum_{u \in Child(s_t)} P_F(u|s_t) \log(P_B(s_t|u))$ push $P_B(s_t|u)$ higher when $P_F(u|s_t)$ is high. As a result, increasing $P_B(s_t|u)$ prunes the other flows leading to $u$.
While the meanings of the terms $-\sum_{u \in Child(s_t)} P_F(u|s_t) \log(P_B(s_t|u))$ and of the entropy terms are clear, we now explain the meaning of the regularization terms $-\log(P_F(s_{t+1}|s_t))$. Direct calculation gives:

$$ \sum_{t=0}^{n-1} -\log(P_F(s_{t+1}|s_t)) = -\log \prod_{t=0}^{n-1} P_F(s_{t+1}|s_t) = -\log(P(\tau)). \tag{12} $$

Moreover, when the training policy $\pi_\theta$ is given by the forward policies $P_F(\cdot|\cdot; \theta)$, we have:

$$ \mathbb{E}_{\tau \sim \pi_\theta} \big( -\log(P(\tau)) \big) = H(P(\tau)). \tag{13} $$

Hence, the upper bound regularizes not only the forward policy via $H(P_F(\cdot|s_{t+1}))$ but also the path via $H(P(\tau))$. Besides, we can think of $-\log(P(\tau)) = -\log(P(\tau|s_0) P(s_0)) = -\log(P(\tau|s_0))$ as the length of $\tau$. Minimizing the path regularization can then result in shorter paths and a smaller number of paths with high flows.
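One summand of the upper bound can be sketched as follows. The sketch assumes the sign convention in which every summand equals the expected length of the back-and-forth paths $u \to s_t \to s_{t+1} \to v$, that is, $-\sum_u P_F(u|s_t)\log P_B(s_t|u) - \log P_F(s_{t+1}|s_t) + H(P_F(\cdot|s_{t+1}))$; the function names and the numbers are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (natural log) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def upper_bound_step(p_f_children, p_b_parent, p_f_next, p_f_next_children):
    """One summand of L_UB for the edge s_t -> s_{t+1}.

    p_f_children[i]   = P_F(u_i | s_t)
    p_b_parent[i]     = P_B(s_t | u_i), assumed strictly positive
    p_f_next          = P_F(s_{t+1} | s_t)
    p_f_next_children = P_F(. | s_{t+1})
    """
    backward_term = -sum(
        pf * math.log(pb) for pf, pb in zip(p_f_children, p_b_parent) if pf > 0.0
    )
    return backward_term - math.log(p_f_next) + entropy(p_f_next_children)


# Hypothetical numbers for one edge of a trajectory.
p = [0.5, 0.5]       # P_F(. | s_t); the first child is s_{t+1}
p_b = [1.0, 0.5]     # P_B(s_t | u_i)
q = [0.25, 0.75]     # P_F(. | s_{t+1})
ub = upper_bound_step(p, p_b, p[0], q)

# The same value, computed as the expected detour cost under the product coupling.
expected_detour = sum(
    p[i] * q[j] * -math.log(p_b[i] * p[0] * q[j])
    for i in range(2) for j in range(2)
)
```

Because the detour cost is a sum of a term depending only on $u$, a constant, and a term depending only on $v$, its expectation is the same under every admissible coupling, which is why the bound needs no OT solver and costs only one pass over the children.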

3.3. CLOSED-FORM FORMULATION FOR THE PATH REGULARIZATION

In some specific settings, e.g., the synthetic hypergrid environment (Bengio et al. (2021a)), discrete probabilistic modeling (Zhang et al. (2022)), and biological sequence design (Jain et al. (2022)), Theorem 3.2 allows us to compute the OT loss in closed form, where the cost matrix $C$ is redefined by its approximation as in Eqn. 8. Specifically, the closed-form formulation is derived from the following observations. First, in these cases, any two neighbor states $s$ and $s'$ on a sampled trajectory, such that $s \to s' \in \mathcal{A}$, do not share any child state. In other words, there does not exist any state $s''$ satisfying $s \to s'' \in \mathcal{A}$ and $s' \to s'' \in \mathcal{A}$. This follows from the property that no action can be decomposed into a composition of other actions in the action space. Second, if there exists an edge from a child state $u$ of $s$, with $u \neq s'$, to a child state $v$ of $s'$, then the action of the transition from $s$ to $u$ must be the same as the action of the transition from $s'$ to $v$, and the action of the transition from $u$ to $v$ must be exactly the action of the transition from $s$ to $s'$.

Theorem 3.2 (Closed-form solution for optimal transport distance) Let the OT cost between the forward policies of two neighbor states be defined as in Eqn. 5, where the cost matrix $C$ is redefined by its approximation as in Eqn. 8. For non-terminal neighbor states $s$ and $s'$ such that $s \to s' \in \mathcal{A}$, let $a_i$ be an action such that the state-action pair $(s, a_i)$ leads to $u_i$ and $(s', a_i)$ leads to $v_i$, where $u_i \in Child(s)$ and $v_i \in Child(s')$. Let $A^*_s$, $A^*_{s'}$ be the sets of non-terminal valid actions at states $s$, $s'$, and let $a^*_s$ be the action of moving from $s$ to $s'$.
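If the following conditions are satisfied: (1) $a_i \neq a_k + a_h\ \forall a_i, a_k, a_h \in \mathcal{A}$, and (2) if $\exists\, a_i, a_h, a_m, a_n \in \mathcal{A}$ such that $a_i + a_h = a_m + a_n$ and $a_i \neq a_m$, then $a_i = a_n$ and $a_h = a_m$; then the following result holds:

$$ \begin{aligned} OT(P_F(\cdot|s), P_F(\cdot|s')) = {} & -\sum_{u \in Child(s)} P_F(u|s) \log(P_B(s|u)) - \log(P_F(s'|s)) + H(P_F(\cdot|s')) \\ & + P_F(s'|s) \big( \log(P_B(s|s')) + \log(P_F(s'|s)) \big) \\ & + \sum_{a_i \in A^*_s \cap A^*_{s'},\ u_i \neq s',\ u_i + a^*_s = v_i} \min(P_F(u_i|s), P_F(v_i|s'))\, c'_i, \end{aligned} \tag{14} $$

where we define:

$$ c'_i = \begin{cases} \min\big( 0,\ \log(P_B(s|u_i)) + \log(P_F(s'|s)) + \log(P_F(v_i|s')) - \log(P_F(v_i|u_i)) \big), & \text{if } u_i \neq s', \\ 0, & \text{if } u_i = s'. \end{cases} \tag{15} $$

The first line of Eqn. 14 is the summand of the upper bound in Eqn. 11, and the remaining terms are the non-positive corrections due to the direct edges out of $s'$ and the direct edges $u_i \to v_i$. The proof of Theorem 3.2 and the closed-form solution for the OT distance at terminal states are provided in Appendix D.2. These closed-form solutions of the OT distance are used in our experiments in Section 4 (see Appendix D.2 for the reasons).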
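As a sanity check, the closed form can be compared against an exact solver on the smallest non-trivial case, where $s$ and $s'$ have two children each. The sketch below assumes the closed form is written as the upper-bound summand $-\sum_u P_F(u|s)\log P_B(s|u) - \log P_F(s'|s) + H(P_F(\cdot|s'))$ plus the corrections $P_F(s'|s)(\log P_B(s|s') + \log P_F(s'|s))$ and $\sum_i \min(P_F(u_i|s), P_F(v_i|s'))\,c'_i$. It further assumes that index 0 is the child $u_0 = s'$ and that every other pair $(u_i, v_i)$ is joined by a direct edge. All names and numbers are hypothetical.

```python
import math


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def closed_form_ot(p, q, p_b, p_direct):
    """Closed form under the stated assumptions.

    p[i] = P_F(u_i|s), q[i] = P_F(v_i|s'), p_b[i] = P_B(s|u_i), u_0 = s',
    p_direct[i] = P_F(v_i|u_i) for i >= 1.
    """
    base = -sum(pi * math.log(bi) for pi, bi in zip(p, p_b)) - math.log(p[0]) + entropy(q)
    via_s_next = p[0] * (math.log(p_b[0]) + math.log(p[0]))
    shortcut = 0.0
    for i in range(1, len(p)):
        c_i = min(0.0, math.log(p_b[i]) + math.log(p[0]) + math.log(q[i])
                  - math.log(p_direct[i]))
        shortcut += min(p[i], q[i]) * c_i
    return base + via_s_next + shortcut


def exact_ot_2x2(p, q, cost):
    """Exact OT for two support points each (the optimum is at an endpoint)."""
    lo, hi = max(0.0, p[0] - q[1]), min(p[0], q[0])

    def total(t):
        plan = [[t, p[0] - t], [q[0] - t, p[1] - q[0] + t]]
        return sum(cost[i][j] * plan[i][j] for i in range(2) for j in range(2))

    return min(total(lo), total(hi))


def detour(i, j, p, q, p_b):
    return -math.log(p_b[i] * p[0] * q[j])


p, q, p_b, p_direct = [0.5, 0.5], [0.25, 0.75], [1.0, 0.5], {1: 0.5}
# Cost matrix of Eqn. 8: row 0 is s' itself, row 1 has a direct edge to v_1 only.
cost = [
    [min(detour(0, j, p, q, p_b), -math.log(q[j])) for j in range(2)],
    [detour(1, 0, p, q, p_b), min(detour(1, 1, p, q, p_b), -math.log(p_direct[1]))],
]
closed = closed_form_ot(p, q, p_b, p_direct)
exact = exact_ot_2x2(p, q, cost)
```

On this toy instance the two values agree, which is consistent with reading the closed form as the upper bound minus the savings from direct edges. It is a numerical check on one example, not a substitute for the proof in the appendix.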

4. EXPERIMENTAL RESULTS

In this section, we numerically justify the advantage of OT regularization over the baseline GFlowNet model only trained with trajectory balance loss on a wide range of tasks: hypergrid environment, discrete probabilistic modeling, and biological sequence design tasks. We aim to show that: (i) Minimizing the path regularization via OT improves the GFlowNets' generalization, and the upper bound can be used as an efficient approximation when we want to minimize the regularization term (hyper-grid environment, discrete probabilistic modeling); while (ii) Maximizing path regularization via OT enhances the exploration ability of the GFlowNet (biological sequence tasks).

4.1. HYPER-GRID ENVIRONMENT

Task. We follow the framework of Malkin et al. (2022) with slight changes to study a hyper-grid environment, which evaluates the ability of the GFlowNet to generalize, i.e., to guess and sample unvisited modes of the distribution of interest. Consider a $D$-dimensional hyper-grid environment with side length $H$, where each cell represents a non-terminal state of the given DAG: $s = (s^1, \dots, s^D)$ where $s^d \in \{0, 1, \dots, H-1\}$ for $d \in \{1, \dots, D\}$. The source state is $(0, 0, \dots, 0)$. For any non-terminal state, the available actions are increasing coordinate $i$ by 1, as long as $s^i \leq H-1$ still holds, and a terminating action that moves to the corresponding terminal state $s^\top$, whose reward is:

$$ R(s^\top) = R_0 + 0.5 \prod_{d=1}^{D} \mathbb{I}\big[ |s^d/(H-1) - 0.5| \in (0.25, 0.5] \big] + 2 \prod_{d=1}^{D} \mathbb{I}\big[ |s^d/(H-1) - 0.5| \in (0.3, 0.4) \big], $$

where $R_0$ is the constant that controls the difficulty of discovery and $\mathbb{I}$ is the indicator function. This reward function places considerable rewards only near the corners of the environment, and there are exactly $2^D$ modes. The experiment is conducted for two hyper-grid environments with 4 and 8 dimensions (a higher number of dimensions is more challenging). We consider the same side length $H = 8$ and $R_0 = 10^{-3}$ for both environments. To evaluate the performance on this task, we use the KL divergence and the number of modes found during training as the main evaluation metrics. More details about architectures, hyper-parameters, and evaluation criteria are provided in Appendix E.1.
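The reward can be sketched in a few lines. The sketch assumes that the two indicator terms are combined as products over dimensions, which matches the statement that there are exactly $2^D$ modes; the function name `hypergrid_reward` is hypothetical.

```python
def hypergrid_reward(state, side_length, r0=1e-3):
    """Reward of the terminal state reached from `state` in the hyper-grid.

    state is a tuple of integer coordinates in {0, ..., side_length - 1}.
    Assumption: both indicator terms are products over the D dimensions.
    """
    outer = 1.0  # all coordinates in the outer band (0.25, 0.5]
    inner = 1.0  # all coordinates in the narrow band (0.3, 0.4)
    for coord in state:
        x = abs(coord / (side_length - 1) - 0.5)
        outer *= 1.0 if 0.25 < x <= 0.5 else 0.0
        inner *= 1.0 if 0.3 < x < 0.4 else 0.0
    return r0 + 0.5 * outer + 2.0 * inner


# With H = 8, a coordinate lies in the narrow band only when it is 1 or 6.
mode_reward = hypergrid_reward((1, 6, 1, 6), side_length=8)
corner_reward = hypergrid_reward((0, 0, 0, 0), side_length=8)
center_reward = hypergrid_reward((3, 3, 3, 3), side_length=8)
```

For $H = 8$ each dimension contributes two coordinates (1 and 6) to a mode, so a 4-dimensional grid has $2^4 = 16$ modes, while most cells only receive the small base reward $R_0$.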

Results

We plot the mean results over 10 runs for each configuration in Fig. 2 . Generally, the observed behaviors of the proposed regularization methods become more apparent when the hypergrid environment is sparser. Precisely, although recovering full modes, the GFlowNet model trained by minimizing the path regularization via OT discovers modes faster than the baseline, which indicates its focus on finding directions leading to states with high rewards during the training progress rather than spending time exploring the environment. This also helps the model better discover the latent structures of the interested distribution and achieve lower KL error. We can also see that the upper bound is an efficient approximation in terms of complexity when using a positive regularization coefficient, whose performance is even better. Meanwhile, Max OT seems unsuitable because of its motivation to improve the model's exploration, while we only need the high-reward states near the corners, and the majority of states have minimal rewards. 

4.2. BIOLOGICAL SEQUENCE DESIGN

Task. We follow the framework of Jain et al. (2022) to simulate the process of designing biological sequences, such as anti-microbial peptides (AMP), DNA and protein sequences (TF Bind 8, GFP). The experiments are conducted in the multi-round active learning setting, with the goal of generating a diverse set of useful candidates after the evaluation rounds. We report the performance score, diversity score, and novelty score of the TopK scoring candidates to evaluate the performance of each method. More details about the task description, datasets, hyper-parameters, and evaluation criteria are provided in Appendix E.2.

GFP. Lastly, the results for the GFP task are shown in Table 3, where the objective is to find protein sequences with high fluorescence. We observe that the GFlowNet-AL model trained by maximizing the OT regularization generates more diverse and novel candidates than the other methods. In addition, its performance score is second only to the best one, achieved by COMs, and is higher than that of the GFlowNet-AL baseline. When all metrics are taken into account, the GFlowNet-AL model trained by maximizing the path regularization via OT still outperforms all other baselines. Note that in all biological sequence tasks, Min OT and UB OT do not improve the performance of GFlowNets, since these methods make the forward policy at each state concentrate on a few actions, which does not seem to increase the diversity and novelty of the generated candidates.

Table 3: Results on the GFP task with K = 128.

[...] discretizations of continuous distributions over the plane. The state space $\mathcal{S}$ of the GFlowNet consists of vectors of length $D = 32$ with entries in $\{0, 1, \oslash\}$. The source state is $s_0 = (\oslash, \oslash, \dots, \oslash)$. For any non-terminal state, the available actions are turning a void entry $\oslash$ into 0 or 1. After $D$ actions, we reach a terminal state, in which all entries are in $\{0, 1\}$. The main evaluation metrics are the NLL score and the MMD score.
The detailed settings about architectures, hyper-parameters, and evaluation criteria are provided in Appendix E.3.
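The state space and actions described above can be sketched as follows. The representation (a tuple with `None` standing for the void entry) and the function names are hypothetical, and a short length is used in place of 32 to keep the example small.

```python
VOID = None  # stands for the void entry


def valid_actions(state):
    """All (position, bit) pairs that turn a void entry into 0 or 1."""
    return [(i, bit) for i, entry in enumerate(state) if entry is VOID for bit in (0, 1)]


def apply_action(state, action):
    """Return the child state obtained by filling one void entry."""
    i, bit = action
    if state[i] is not VOID:
        raise ValueError("position is already filled")
    return state[:i] + (bit,) + state[i + 1:]


def is_terminal(state):
    """A state is terminal once no void entry is left."""
    return all(entry is not VOID for entry in state)


D = 4  # the setting in the text uses D = 32
source = (VOID,) * D
state = source
steps = 0
while not is_terminal(state):
    # Deterministically pick the first valid action, only to walk one trajectory.
    state = apply_action(state, valid_actions(state)[0])
    steps += 1
```

Every trajectory from the source state fills exactly one entry per step, so it reaches a terminal state after exactly `D` actions, and a state with `m` void entries has `2 * m` children.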

Results

The results for the synthetic discrete probabilistic modeling tasks (Synthetic EB-GFN) are shown in Table 4. Training the GFlowNet by minimizing either the path regularization via OT (Min OT) or its upper bound (UB OT) achieves better NLL and MMD scores than the baseline and Max OT. We also observe that the performances of EB-GFN trained with Min OT and with UB OT are quite similar. Meanwhile, Max OT is not useful, for the same reasons as in the hypergrid environment modeling task. Because there is a gap between our reproduced results and the baseline reported for EB-GFN in Zhang et al. (2022), we only take into account the reproduced results of EB-GFN when comparing with our methods (Min OT, Max OT, and UB OT).

Table 4: Results on the Synthetic EB-GFN tasks. The negative log-likelihood (NLL) and MMD are displayed in units of $1 \times 10^{-4}$. ALOE+ uses a 30 times larger parametrization than ALOE and EB-GFN.

5. CONCLUDING REMARKS

In this paper, we propose to train the GFlowNet with an additional path regularization via optimal transport that places prior constraints on the underlying structure of the GFlowNet. We have empirically shown that minimizing the path regularization via OT improves the GFlowNet's generalization, while maximizing it enhances the GFlowNet's exploration ability. In addition, we derive an efficient implementation of the regularization by finding its closed-form solutions in specific cases and a meaningful upper bound that can be used as an approximation when we want to minimize the regularization term. A limitation of the current method is the cost of computing the optimal transport distances for all pairs of neighboring states. Our proposed Dropout OT (see Appendix C) might be a solution. In future work, we aim to develop a more efficient path regularization for high-dimensional discrete data or to propose a new cost function for computing the optimal transport distances.

Supplement to "Improving Generative Flow Networks with Path Regularization"

A RELATED WORK

GFlowNets The objective of GFlowNets is related to MCMC methods for sampling from a given unnormalized density function, especially in discrete spaces where exact sampling is intractable (Dai et al. (2020); Grathwohl et al. (2021)). However, GFlowNets amortize the cost of iterative sampling through a training procedure that treats the data's compositional structure as part of the learning problem. Empirically, GFlowNets perform better than earlier methods on a wide variety of tasks: small molecule generation (Bengio et al. (2021a)), probabilistic modeling (Zhang et al. (2022)), Bayesian structure learning (Deleu et al. (2022)), and biological sequence design (Jain et al. (2022)). On the theoretical side, the definitions and properties of GFlowNets are investigated further in Bengio et al. (2021b).

Optimal Transport Optimal transport (OT) theory (Villani (2003)) has established a natural and useful geometric tool for comparing measures supported on metric probability spaces. OT theory has a long history, during which it has been discovered in many settings and under different forms. In recent years, the use of OT has spread again, thanks to the emergence of approximate solvers that scale to high-dimensional problems. As a consequence, OT is being widely used to solve various problems in computer graphics (Bonneel et al. (2011

Energy-based models EBMs, or energy functions parameterized by deep neural networks, have demonstrated effectiveness in generative modeling (Salakhutdinov & Hinton (2009); Hinton et al. (2006)). Contrastive divergence methods (Hinton (2002); Tieleman (2008); Du et al. (2021)) have been proposed to handle costly MCMC processes by approximating the energy gradient. Recently, it has been shown that simultaneously learning the proposal distribution can also be helpful (Dai et al. (2019); Arbel et al. (2021)). This finding was then extended to discrete spaces using GFlowNets by Zhang et al. (2022).

B BACKGROUND OF GFLOWNETS

Generative Flow Networks (GFlowNets) are a recently proposed class of generative models that aim to sample a structured object $x$ with probability proportional to a given reward function $R(x)$. From the reinforcement learning viewpoint, GFlowNets learn a stochastic policy that generates an object $x \in X$ by applying a sequence of discrete actions $a \in A$, where $A$ is the action space. The construction of an object $x \in X$ defines a complete trajectory $\tau = (s_0, s_1, \dots, s_n = x, s_f)$, where $s_0$ is the initial state, $s_n = x \in X$ is the terminal state (indicating an entirely constructed object), and $s_f$ is the final state. Note that the same terminal state can be formed by different sequences of actions. These states and actions correspond to the vertices and edges of a directed acyclic graph $G = (S, A)$. In addition, for each transition $s \to s' \in A$, we call $s$ a parent of $s'$ and $s'$ a child of $s$. $T$ denotes the set of all complete trajectories.

Following Bengio et al. (2021b), a trajectory flow is any nonnegative function defined on the set of complete trajectories, $F : T \to \mathbb{R}_+$. Correspondingly, the flow through a state (state flow) and the flow through an edge (edge flow) are defined as

$$F(s) = \sum_{\tau \in T,\, s \in \tau} F(\tau), \qquad F(s \to s') = \sum_{\tau \in T,\, s \to s' \in \tau} F(\tau).$$

Additionally, the forward transition probabilities $P_F$ and the backward transition probabilities $P_B$ are defined as follows:

$$P_F(s'|s) := \frac{F(s \to s')}{F(s)}, \qquad P_B(s|s') := \frac{F(s \to s')}{F(s')}.$$

The training objective of the GFlowNet is then to learn a consistent flow (Bengio et al. (2021b); Malkin et al. (2022)) whose terminal flow $F(x \to s_f)$ approximately equals a given reward function $R(x)$ for any $x \in X$. In addition, when the flow is consistent, the forward transition probabilities $P_F$ and the backward transition probabilities $P_B$ define distributions over the children and the parents of each state, respectively, which can be considered as the forward and backward policies of GFlowNets.
Specifically, following Malkin et al. (2022), the GFlowNet models the forward policy, backward policy, and total flow of a Markovian flow $F$ by $P_F(\cdot|\cdot;\theta)$, $P_B(\cdot|\cdot;\theta)$, and $Z_\theta$. The trajectory balance objective is then optimized for each complete trajectory $\tau$ sampled from the training policy $\pi_\theta$:

$$L_{TB}(\tau, \theta) = \left( \log\Big(Z_\theta \prod_{t=1}^{n} P_F(s_t|s_{t-1};\theta)\Big) - \log\Big(R(x) \prod_{t=1}^{n} P_B(s_{t-1}|s_t;\theta)\Big) \right)^2,$$

which is derived from the trajectory balance constraint (Malkin et al. (2022)). Moreover, as already proved by Bengio et al. (2021b), $\pi_\theta$ can be chosen as any distribution with full support on the set of complete trajectories $T$, so the GFlowNet can also be trained off-policy, for example with a mixture between the GFlowNet's forward policy and a uniform distribution over the allowed actions in each state:

$$\pi_\theta = (1 - \alpha) P_F(\cdot|\cdot;\theta) + \alpha\, \mathrm{Uniform}.$$

There also exist other objectives for learning a GFlowNet, which are based on the flow matching constraint or the detailed balance constraint, as in Bengio et al. (2021a; b). However, Malkin et al. (2022) empirically show that the trajectory balance objective improves the training of a GFlowNet, with more efficient credit assignment and faster convergence than the previously proposed objectives. These advantages lead us to choose it as the training objective in this paper.
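As a small illustration of the two formulas above, the following sketch evaluates the trajectory balance loss in log space and builds the mixture training policy for one state. The function names and the list-based inputs are assumptions made for this example; an actual implementation would operate on batched tensors produced by the policy network.

```python
import math

def trajectory_balance_loss(log_z, log_pf_steps, log_pb_steps, log_reward):
    """Squared log-ratio of Z * prod(P_F) to R(x) * prod(P_B) for one trajectory.

    log_pf_steps[t] holds log P_F(s_{t+1} | s_t) and log_pb_steps[t] holds
    log P_B(s_t | s_{t+1}) along the sampled trajectory.
    """
    forward = log_z + sum(log_pf_steps)
    backward = log_reward + sum(log_pb_steps)
    return (forward - backward) ** 2

def mixed_training_policy(pf_probs, alpha):
    """(1 - alpha) * P_F + alpha * Uniform over the allowed actions of one state."""
    k = len(pf_probs)
    return [(1.0 - alpha) * p + alpha / k for p in pf_probs]

if __name__ == "__main__":
    # A one-step trajectory with Z = 2, P_F = 0.5, P_B = 1, R(x) = 1 is balanced.
    print(trajectory_balance_loss(math.log(2.0), [math.log(0.5)], [0.0], 0.0))
    print(mixed_training_policy([0.7, 0.2, 0.1], alpha=0.1))
```

Working with log-probabilities avoids the numerical underflow that the products over long trajectories would otherwise cause.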

C DROPOUT OPTIMAL TRANSPORT

A limitation of the current method is the cost of computing the optimal transport distances for all pairs of neighboring states, especially for high-dimensional discrete data. Our proposed Dropout OT might be a solution: rather than sampling trajectories $\tau$ and using all of their edges, we can separately sample edges $s \to s'$ in proportion to the edge flows, which allows us to compute the path regularization efficiently.

Theorem C.1 For any complete trajectory $\tau = (s_0 \to s_1 \to \dots \to s_n)$ sampled from the training policy $\pi_\theta$,

$$\mathbb{E}_{\tau \sim \pi_\theta}\big(L_{OT}(\tau)\big) \propto \mathbb{E}_{s \to s' \sim \pi_\theta}\big(\mathrm{OT}(P_F(\cdot|s), P_F(\cdot|s'))\big).$$

The proof of Theorem C.1 is in Appendix D.3. Here we train GFlowNets with the trajectory balance objective. Therefore, when sampling a trajectory $\tau$, we obtain a set of edges from $\tau$, and we uniformly sample a fraction $p$ of these edges to compute the OT loss. To do so, we sample $r_s \sim \mathrm{Ber}(p)$, so that

$$\mathbb{E}_{s \to s' \sim \pi_\theta}\big(\mathrm{OT}(P_F(\cdot|s), P_F(\cdot|s'))\big) = \frac{1}{p}\, \mathbb{E}_{r_s \sim \mathrm{Ber}(p)}\, \mathbb{E}_{s \to s' \sim \pi_\theta}\big(r_s \cdot \mathrm{OT}(P_F(\cdot|s), P_F(\cdot|s'))\big). \tag{22}$$

We approximate the path regularization loss via

$$L_{OT}(\tau) \simeq \frac{1}{p} \sum_{t=0}^{n-1} x_t\, \mathrm{OT}(P_F(\cdot|s_t), P_F(\cdot|s_{t+1})),$$

with $x_t$ drawn independently from $\mathrm{Ber}(p)$ for all $0 \le t \le n-1$. Intuitively, if $x_t = 0$, we do not need to calculate the corresponding optimal transport cost, which reduces the computing time and memory spent on the regularizer to roughly a fraction $p$ of the original.
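The Dropout OT approximation above can be sketched as follows. This is a minimal sketch under stated assumptions: `ot_distance` is a hypothetical callable that returns the OT distance between the forward policies at two neighboring states, and the trajectory is a plain list of states.

```python
import random

def dropout_ot_loss(states, ot_distance, p, rng=random):
    """Estimate sum_t OT(P_F(.|s_t), P_F(.|s_{t+1})) from a random subset of edges.

    Each edge is kept independently with probability p (x_t ~ Ber(p)); the kept
    terms are rescaled by 1/p so that the estimate is unbiased.
    """
    total = 0.0
    for s, s_next in zip(states[:-1], states[1:]):
        if rng.random() < p:  # x_t = 1: this edge contributes
            total += ot_distance(s, s_next)
        # x_t = 0: the OT computation for this edge is skipped entirely
    return total / p
```

With `p = 1.0` the estimator reduces to the full sum over the trajectory, and smaller values of `p` trade variance for fewer OT computations.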

D PROOFS

D.1 PROOF OF THEOREM 3.1

For any trajectory $\tau = (s_0 \to s_1 \to \dots \to s_n)$, we first prove that for any $t \in \{0, \dots, n-1\}$,

$$\mathrm{OT}(P_F(\cdot|s_t), P_F(\cdot|s_{t+1})) \le -\sum_{u \in \mathrm{Child}(s_t)} P_F(u|s_t) \log P_B(s_t|u) - \log P_F(s_{t+1}|s_t) + H(P_F(\cdot|s_{t+1})). \tag{24}$$

Consider two neighboring states $s_t$ and $s_{t+1}$ with the children sets $\mathrm{Child}(s_t) = \{u_1, \dots, u_k\}$ and $\mathrm{Child}(s_{t+1}) = \{v_1, \dots, v_l\}$. By Definition 5, the optimal transportation distance between the two distributions $P_F(\cdot|s_t)$ and $P_F(\cdot|s_{t+1})$ is defined as

$$\mathrm{OT}_C(P_F(\cdot|s_t), P_F(\cdot|s_{t+1})) := \min_{\pi \in \Pi(P_F(\cdot|s_t), P_F(\cdot|s_{t+1}))} \langle C, \pi \rangle, \tag{25}$$

where the set of admissible couplings is defined as

$$\Pi(P_F(\cdot|s_t), P_F(\cdot|s_{t+1})) = \left\{ \pi \in \mathbb{R}_+^{k \times l} : \pi 1_l = P_F(\cdot|s_t),\ \pi^T 1_k = P_F(\cdot|s_{t+1}) \right\}. \tag{26}$$

For any admissible coupling $\pi$, we have

$$\begin{aligned}
\mathrm{OT}(P_F(\cdot|s_t), P_F(\cdot|s_{t+1})) &\le \sum_i \sum_j \pi_{ij} C_{ij} \\
&\le -\sum_i \sum_j \pi_{ij} \log\big(P_B(s_t|u_i)\, P_F(s_{t+1}|s_t)\, P_F(v_j|s_{t+1})\big) \\
&= -\sum_i \sum_j \pi_{ij} \log P_B(s_t|u_i) - \sum_i \sum_j \pi_{ij} \log P_F(s_{t+1}|s_t) - \sum_i \sum_j \pi_{ij} \log P_F(v_j|s_{t+1}) \\
&= -\sum_i \log P_B(s_t|u_i) \sum_j \pi_{ij} - \log P_F(s_{t+1}|s_t) \sum_i \sum_j \pi_{ij} - \sum_j \log P_F(v_j|s_{t+1}) \sum_i \pi_{ij} \\
&= -\sum_i \log P_B(s_t|u_i)\, P_F(u_i|s_t) - \log P_F(s_{t+1}|s_t) - \sum_j \log P_F(v_j|s_{t+1})\, P_F(v_j|s_{t+1}) \\
&= -\sum_{u \in \mathrm{Child}(s_t)} P_F(u|s_t) \log P_B(s_t|u) - \log P_F(s_{t+1}|s_t) + H(P_F(\cdot|s_{t+1})).
\end{aligned}$$

The first inequality follows from the definition of the optimal transport distance in Eq. 25, the second inequality comes from Eq. 8, and the fifth line is due to the marginal constraints on admissible couplings in Eq. 26. As a consequence, the upper bound loss is obtained by summing inequality 24 over all $t$.

D.2 PROOF OF THEOREM 3.2

Recall from Definition 5 that the optimal transportation distance between the two distributions $P_F(\cdot|s)$ and $P_F(\cdot|s')$ is defined as

$$\mathrm{OT}_C(P_F(\cdot|s), P_F(\cdot|s')) := \min_{\pi \in \Pi(P_F(\cdot|s), P_F(\cdot|s'))} \langle C, \pi \rangle.$$
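Let us decompose the total cost $\langle C, \pi \rangle$:

$$\begin{aligned}
\langle C, \pi \rangle = \sum_{i,j} \pi_{ij} C_{ij} ={}& \sum_{i,j} \pi_{ij} \big( -\log P_B(s|u_i) - \log P_F(s'|s) - \log P_F(v_j|s') \big) \\
&+ \sum_{u_i = s',\, j} \pi_{ij} \big( \log P_B(s|s') + \log P_F(s'|s) \big) \\
&+ \sum_{u_i \neq s',\, v_j \in \mathrm{Child}(u_i),\, a_i \neq a_\top} \pi_{ij} \big( \log P_B(s|u_i) + \log P_F(s'|s) + \log P_F(v_j|s') + C_{ij} \big) \\
&+ \sum_{u_i = v_j} \pi_{ij} \big( \log P_B(s|u_i) + \log P_F(s'|s) + \log P_F(v_j|s') \big).
\end{aligned} \tag{29}$$

We will prove that $u_i \neq v_j\ \forall i, j$, i.e., $\mathrm{Child}(s) \cap \mathrm{Child}(s') = \emptyset$. Indeed,

$$a_i \neq a_k + a_h\ \ \forall a_i, a_k, a_h \in A \implies a_i \neq a^*_s + a_j \implies s + a_i \neq s + a^*_s + a_j \implies u_i \neq v_j\ \ \forall i, j.$$

We also have

$$\begin{aligned}
u_i \neq s',\ v_j \in \mathrm{Child}(u_i),\ a_i \neq a_\top &\implies a_i \neq a^*_s,\ s + a_i + a^*_{u_i} = s + a^*_s + a_j,\ a_i \neq a_\top \\
&\implies a_i \neq a^*_s,\ a_i + a^*_{u_i} = a^*_s + a_j,\ a_i \neq a_\top \\
&\implies a_i \neq a^*_s,\ a_i = a_j \neq a_\top,\ a^*_{u_i} = a^*_s.
\end{aligned}$$

As a result, we can rewrite Eq. 29 as

$$\begin{aligned}
\langle C, \pi \rangle ={}& \sum_{i,j} \pi_{ij} \big( -\log P_B(s|u_i) - \log P_F(s'|s) - \log P_F(v_j|s') \big) + \sum_{u_i = s',\, j} \pi_{ij} \big( \log P_B(s|s') + \log P_F(s'|s) \big) \\
&+ \sum_{u_i \neq s',\, a_i = a_j \neq a_\top} \pi_{ij} \big( \log P_B(s|u_i) + \log P_F(s'|s) + \log P_F(v_j|s') + C_{ii} \big).
\end{aligned}$$

The first term of the above equation is the upper bound of the optimal transport distance. Therefore, using the marginal constraints on $\pi$, we can rewrite the total transportation cost as

$$\begin{aligned}
\langle C, \pi \rangle ={}& -\sum_{u \in \mathrm{Child}(s)} P_F(u|s) \log P_B(s|u) - \log P_F(s'|s) + H(P_F(\cdot|s')) + P_F(s'|s) \big( \log P_B(s|s') + \log P_F(s'|s) \big) \\
&+ \sum_{u_i \neq s',\, a_i = a_j \neq a_\top} \pi_{ij} \big( \log P_B(s|u_i) + \log P_F(s'|s) + \log P_F(v_j|s') + C_{ii} \big).
\end{aligned} \tag{33}$$

From the definition of $c'_i$ in Eq. 15, we have

$$c'_i = \begin{cases} \log P_B(s|u_i) + \log P_F(s'|s) + \log P_F(v_j|s') + C_{ii}, & \text{if } u_i \neq s',\ a_i = a_j \neq a_\top, \\ 0, & \text{if } u_i = s' \text{ or } a_i = a_\top. \end{cases} \tag{34}$$

From Eq. 33 and Eq. 34, the only term of $\langle C, \pi \rangle$ that depends on $\pi$ is $\langle C', \pi \rangle$, so that

$$\arg\min_{\pi \in \Pi(P_F(\cdot|s), P_F(\cdot|s'))} \langle C, \pi \rangle = \arg\min_{\pi \in \Pi(P_F(\cdot|s), P_F(\cdot|s'))} \langle C', \pi \rangle,$$

where $C'$ is a diagonal matrix with diagonal entries $c'_i \le 0$. For convenience, if action $a_i$ is invalid at state $s$, we assign $P_F(u_i|s) := 0$, so that the cost matrix of the optimal transport problem remains square, with zero cost at invalid actions. Applying Lemma 1, we have

$$\min_{\pi \in \Pi(P_F(\cdot|s), P_F(\cdot|s'))} \langle C', \pi \rangle = \sum_i \min\big(P_F(u_i|s), P_F(v_i|s')\big)\, C'_{ii}.$$

We obtain the closed-form formulation for the optimal transport distance:

$$\begin{aligned}
\mathrm{OT}(P_F(\cdot|s), P_F(\cdot|s')) ={}& -\sum_{u \in \mathrm{Child}(s)} P_F(u|s) \log P_B(s|u) - \log P_F(s'|s) + H(P_F(\cdot|s')) + P_F(s'|s) \big( \log P_B(s|s') + \log P_F(s'|s) \big) \\
&+ \sum_{i \in A^*_s \cap A^*_{s'}} \min\big(P_F(u_i|s), P_F(v_i|s')\big)\, c'_i.
\end{aligned}$$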
Lemma 1 Given a square diagonal cost matrix $C'$ with non-positive entries on the diagonal, the optimal transport problem between two distributions $P_F(\cdot|s)$ and $P_F(\cdot|s')$ with the same number of support points has the solution

$$\min_{\pi \in \Pi(P_F(\cdot|s), P_F(\cdot|s'))} \langle C', \pi \rangle = \sum_i \min\big(P_F(u_i|s), P_F(v_i|s')\big)\, C'_{ii}.$$

Proof of Lemma 1: Define $F(\pi) = \langle C', \pi \rangle$ and

$$\bar{\pi}_{ij} = \begin{cases} \min(p^i_s, p^i_{s'}), & \text{if } i = j, \\[4pt] \dfrac{\big(p^i_s - \min(p^i_s, p^i_{s'})\big)\big(p^j_{s'} - \min(p^j_s, p^j_{s'})\big)}{1 - \sum_k \min(p^k_s, p^k_{s'})}, & \text{if } i \neq j, \end{cases}$$

where $p^i_s := P_F(u_i|s)$ and $p^j_{s'} := P_F(v_j|s')$. We will prove that $\bar{\pi} \in \Pi(P_F(\cdot|s), P_F(\cdot|s'))$ and $F(\pi) \ge F(\bar{\pi})\ \forall \pi \in \Pi(P_F(\cdot|s), P_F(\cdot|s'))$.

It is not difficult to show that $\bar{\pi}_{ij} \ge 0$. From the definition of $\bar{\pi}$, we have

$$\sum_{j=1}^{n} \bar{\pi}_{ij} = \sum_{j \neq i} \bar{\pi}_{ij} + \bar{\pi}_{ii} = \sum_{j \neq i} \frac{\big(p^i_s - \min(p^i_s, p^i_{s'})\big)\big(p^j_{s'} - \min(p^j_s, p^j_{s'})\big)}{1 - \sum_k \min(p^k_s, p^k_{s'})} + \min(p^i_s, p^i_{s'}). \tag{40}$$

If $\min(p^i_s, p^i_{s'}) = p^i_s$, then

$$\sum_{j=1}^{n} \bar{\pi}_{ij} = 0 + \min(p^i_s, p^i_{s'}) = p^i_s.$$

Otherwise $\min(p^i_s, p^i_{s'}) = p^i_{s'}$, so the $i$-th term of the sum below vanishes and

$$\sum_{j \neq i} \big(p^j_{s'} - \min(p^j_s, p^j_{s'})\big) = \sum_{j} \big(p^j_{s'} - \min(p^j_s, p^j_{s'})\big) = 1 - \sum_k \min(p^k_s, p^k_{s'}),$$

which implies

$$\sum_{j=1}^{n} \bar{\pi}_{ij} = \big(p^i_s - \min(p^i_s, p^i_{s'})\big) \frac{\sum_{j \neq i} \big(p^j_{s'} - \min(p^j_s, p^j_{s'})\big)}{1 - \sum_k \min(p^k_s, p^k_{s'})} + \min(p^i_s, p^i_{s'}) = p^i_s.$$

Therefore $\sum_{j=1}^{n} \bar{\pi}_{ij} = p^i_s = P_F(u_i|s)$. Similarly, $\sum_{i=1}^{n} \bar{\pi}_{ij} = p^j_{s'} = P_F(v_j|s')$. Combining this with $\bar{\pi}_{ij} \ge 0$, we have $\bar{\pi} \in \Pi(P_F(\cdot|s), P_F(\cdot|s'))$. Moreover, any admissible coupling satisfies $\pi_{ii} \le \min(p^i_s, p^i_{s'})$ and $C'_{ii} \le 0$, so

$$F(\pi) = \langle C', \pi \rangle = \sum_i \pi_{ii} C'_{ii} \ge \sum_i \min(p^i_s, p^i_{s'})\, C'_{ii} = \langle C', \bar{\pi} \rangle = F(\bar{\pi}) \quad \forall \pi \in \Pi(P_F(\cdot|s), P_F(\cdot|s')). \tag{44}$$

As a consequence, we obtain the solution of the optimal transport problem.

Closed-form solution for the optimal transport distance at a terminal state. We will derive the closed-form solution for the optimal transport distance in the case of two neighbor states s < s ′, in which $s'$ is a terminal state.
In the hyper-grid environment, the EB-GFN experiments, and biological sequence design, every terminal state $x$ has only one child, the final state $s_f$, and $P_F(s_f|x) = 1\ \forall x$. Thus, the set of admissible couplings $\Pi(P_F(\cdot|s), P_F(\cdot|s'))$ has only one element, namely $\pi^* = P_F(\cdot|s)$. As a result, the optimal transportation distance between $P_F(\cdot|s)$ and $P_F(\cdot|s')$ is

$$\mathrm{OT}(P_F(\cdot|s), P_F(\cdot|s')) = \min_{\pi \in \Pi(P_F(\cdot|s), P_F(\cdot|s'))} \langle C, \pi \rangle = \langle C, \pi^* \rangle.$$

In particular, in the EB-GFN experiments, every child $u_i$ of $s$ is a terminal state, so $d(u_i, s_f) = -\log(1) = 0$. This makes $C = 0$ and $\mathrm{OT}(P_F(\cdot|s), P_F(\cdot|s')) = 0$. In the hyper-grid environment experiment, for a terminal state $s'$, because $c'_i = 0$, we have

$$\mathrm{OT}(P_F(\cdot|s), P_F(\cdot|s')) = -\sum_{u \in \mathrm{Child}(s)} P_F(u|s) \log P_B(s|u) + P_F(s'|s) \big( \log P_B(s|s') + \log P_F(s'|s) \big).$$

The hyper-grid environment (Bengio et al., 2021a) (Section 4.1) and the EB-GFN experiments (Zhang et al., 2022) (Section 4.3) satisfy the two conditions in Theorem 3.2. In biological sequence design (Jain et al., 2022) (Section 4.2), for DNA and protein sequences, the action space consists of actions that append a nucleic acid in {A, T, G, U} and an amino acid, respectively. Such settings satisfy the former condition $a_i \neq a_k + a_h\ \forall a_i, a_k, a_h \in A$. However, the latter condition, $a_i + a_h = a_m + a_n,\ a_i \neq a_m \iff a_i = a_n,\ a_h = a_m,\ a_i \neq a_m$, no longer holds because the order of actions matters, i.e., $a_i + a_j \neq a_j + a_i$. In this situation, the third term in Eq. 14 is zero, and we can still use the formulation in Eq. 14. In general, the two conditions require the action space to be independent and to admit unique factorization.
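The coupling used in the proof of Lemma 1 can be checked numerically on a small example. The sketch below builds the coupling from two discrete distributions given as lists `p` and `q` (standing in for the forward policies at $s$ and $s'$) and evaluates the diagonal cost; the function names are hypothetical, and the zero off-diagonal block for identical distributions is a convention chosen here to avoid a 0/0 division.

```python
def lemma1_coupling(p, q):
    """Coupling from the proof of Lemma 1: maximal mass on the diagonal, rest spread off it."""
    n = len(p)
    m = [min(a, b) for a, b in zip(p, q)]
    rest = 1.0 - sum(m)  # mass that cannot stay on the diagonal
    pi = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                pi[i][j] = m[i]
            elif rest > 0.0:
                pi[i][j] = (p[i] - m[i]) * (q[j] - m[j]) / rest
    return pi

def diagonal_cost(pi, c_diag):
    """<C', pi> for a diagonal cost matrix with diagonal entries c_diag."""
    return sum(pi[i][i] * c_diag[i] for i in range(len(c_diag)))

if __name__ == "__main__":
    p, q, c = [0.5, 0.3, 0.2], [0.2, 0.3, 0.5], [-1.0, -2.0, -3.0]
    pi = lemma1_coupling(p, q)
    print([sum(row) for row in pi])         # row marginals, should match p
    print([sum(col) for col in zip(*pi)])   # column marginals, should match q
    print(diagonal_cost(pi, c))             # sum_i min(p_i, q_i) * c_i
```

For comparison, the independent coupling with entries `p[i] * q[j]` is also admissible but places less mass on the diagonal, so with non-positive diagonal costs it gives a larger (worse) value.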

D.3 PROOF OF THEOREM C.1

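By the definition of the edge flow, we have

$$\sum_{\tau : s \to s' \in \tau} P(\tau) = \sum_{\tau : s \to s' \in \tau} \frac{F(\tau)}{Z} = \frac{F(s \to s')}{Z} = P(s \to s').$$

From that equation, we find that

$$\begin{aligned}
\mathbb{E}_{\tau \sim \pi_\theta}\big(L_{OT}(\tau)\big) &= \mathbb{E}_{\tau \sim \pi_\theta} \sum_{s \to s' \in \tau} \mathrm{OT}(P_F(\cdot|s), P_F(\cdot|s')) \\
&= \sum_{\tau} \sum_{s \to s' \in \tau} \mathrm{OT}(P_F(\cdot|s), P_F(\cdot|s'))\, P(\tau) \\
&= \sum_{s \to s'} \sum_{\tau : s \to s' \in \tau} \mathrm{OT}(P_F(\cdot|s), P_F(\cdot|s'))\, P(\tau) \\
&= \sum_{s \to s'} \mathrm{OT}(P_F(\cdot|s), P_F(\cdot|s')) \sum_{\tau : s \to s' \in \tau} P(\tau) \\
&= \sum_{s \to s'} \mathrm{OT}(P_F(\cdot|s), P_F(\cdot|s'))\, P(s \to s') \\
&\propto \mathbb{E}_{s \to s' \sim \pi_\theta}\big(\mathrm{OT}(P_F(\cdot|s), P_F(\cdot|s'))\big). \qquad \square
\end{aligned}$$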
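The exchange of sums in this proof can be verified on a toy DAG by enumerating all complete trajectories. The sketch below assumes on-policy sampling (the training policy is the forward policy itself) and replaces the OT distance by an arbitrary per-edge cost; the graph, the cost values, and the function names are all made up for illustration.

```python
def trajectories(policy, state, prob=1.0, path=()):
    """Enumerate (path, probability) for every complete trajectory starting at `state`."""
    path = path + (state,)
    children = policy[state]
    if not children:  # no outgoing edges: the trajectory is complete
        yield path, prob
        return
    for child, p in children.items():
        yield from trajectories(policy, child, prob * p, path)

def expected_path_cost(policy, source, edge_cost):
    """Left-hand side: E_tau[ sum of per-edge costs along tau ]."""
    return sum(
        prob * sum(edge_cost[(s, t)] for s, t in zip(path, path[1:]))
        for path, prob in trajectories(policy, source)
    )

def edge_marginals(policy, source):
    """P(s -> s'): total probability of the trajectories that contain the edge."""
    marginals = {}
    for path, prob in trajectories(policy, source):
        for edge in zip(path, path[1:]):
            marginals[edge] = marginals.get(edge, 0.0) + prob
    return marginals

if __name__ == "__main__":
    policy = {"s0": {"a": 0.6, "b": 0.4}, "a": {"x": 0.5, "y": 0.5},
              "b": {"y": 1.0}, "x": {}, "y": {}}
    cost = {("s0", "a"): 1.0, ("s0", "b"): 2.0, ("a", "x"): 0.5,
            ("a", "y"): 3.0, ("b", "y"): 4.0}
    lhs = expected_path_cost(policy, "s0", cost)
    rhs = sum(cost[e] * m for e, m in edge_marginals(policy, "s0").items())
    print(lhs, rhs)  # the two values agree up to floating-point error
```

The edge marginals sum to the expected trajectory length rather than to one, which is why the theorem is stated with a proportionality sign instead of an equality.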

E EXPERIMENT SETTINGS

In this part, we report the experiment settings, including the evaluation metrics for comparing the methods, hyper-parameter choices, and neural network architectures for all experiments. For the biological sequence design tasks, we also give more details about the task description and the datasets used for training. Note that the regularization coefficients provided in this part are task-specific. Specifically, λ is chosen from a predefined set of values with different scales in each task. Because none of the tasks in our experiments changes the target distribution between training and test time, the reported λ is the value that gives the best model performance.

E.1.1 EVALUATION CRITERIA

To evaluate the performance, we measure the KL divergence between the true distribution and the empirical distribution of the last $2 \times 10^5$ visited states. The number of modes found during training is also used to measure the performance of the learned models.

These experiments simulate the process of designing biological sequences, such as anti-microbial peptides, DNA, and protein sequences, in drug discovery applications. This process often consists of an active loop with several rounds of ideating molecules and multiple-stage evaluations for filtering candidates, with rising levels of precision and cost. This characteristic makes the diversity of the proposed candidates a considerable concern in the ideation phase, because many similar candidates can all fail in the later phases.

Proxy model: We parameterize it as an MLP with two hidden layers, each having 2048 hidden units, and use ReLU activations. We also use ensembles of 5 models with the same architecture for uncertainty estimation. For the acquisition function, we use UCB ($\mu + \kappa\sigma$) with $\kappa = 0.1$. The proxy is trained with the MSE loss using mini-batches of size 256 and the Adam optimizer with $(\beta_0, \beta_1) = (0.9, 0.999)$ and learning rate $10^{-4}$. During training, early stopping is applied based on a validation set containing 10% of the data.

GFlowNet generator: We use an MLP with 2 hidden layers of 2048 hidden units each. The model is trained with the trajectory balance objective as the main loss function, using the Adam optimizer with $(\beta_0, \beta_1) = (0.9, 0.999)$. Additionally, $\log Z$ is trained with a learning rate of $10^{-3}$ for the AMP and TF Bind 8 tasks, and $5 \times 10^{-3}$ for the GFP task. Other hyper-parameters are shown in the following table.

There are some changes in the hyper-parameter choices and in the number of active learning rounds for the TF Bind 8 and GFP tasks compared to the original training setups of Jain et al. (2022). We made these changes because, in our experiments, these settings gave the results closest to those reported in Jain et al. (2022).

Proposed OT regularization: The regularization coefficients for Min OT, UB OT, and Max OT are the same within each biological sequence design task. Specifically, the coefficients for the AMP, TF Bind 8, and GFP tasks are 0.025, 0.1, and 0.02, respectively.
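The UCB acquisition rule used with the proxy ensemble, $\mu + \kappa\sigma$, can be sketched as follows. This is an illustrative sketch only: the function name is hypothetical, and computing σ as the population standard deviation over the ensemble members is an assumption, since the text does not specify the estimator.

```python
import statistics

def ucb_score(ensemble_predictions, kappa=0.1):
    """Upper confidence bound mu + kappa * sigma from the predictions of an ensemble.

    `ensemble_predictions` holds one predicted score per ensemble member for a
    single candidate (5 members in the setting described above).
    """
    mu = statistics.fmean(ensemble_predictions)
    sigma = statistics.pstdev(ensemble_predictions)  # assumed: population std. dev.
    return mu + kappa * sigma
```

A candidate on which the ensemble members disagree receives a bonus proportional to κ, which favors candidates whose predicted score is uncertain.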

E.3.1 EVALUATION CRITERIA

To evaluate the performance, we keep the same evaluation criteria as in Zhang et al. (2022), who use the NLL of a large independent sample of ground truth data and the exponential Hamming MMD (Gretton et al. (2012)) between ground truth data and generated samples as performance metrics. To measure NLL and MMD, we use 10 fixed sets, each consisting of 4000 ground truth data samples.

E.3.2 IMPLEMENTATION DETAILS

GFlowNet: For the implementation of the GFlowNet model, we use an MLP with 2 hidden layers of 512 dimensions each. The GFlowNet policy model, which includes both $P_F$ and $P_B$, is trained with a learning rate of 0.001. We use a mini-batch size of 128 and 1e5 training steps with the trajectory balance objective.

EBMs: For the implementation of the energy-based model, we use an MLP with 3 hidden layers of 256 dimensions each. The learning rate is 0.001.

Proposed OT regularization: The regularization coefficient is 0.001 for both Min OT and UB OT and is the same for all tasks.

F ADDITIONAL EXPERIMENT RESULTS

F.1 ABLATION STUDY ABOUT VARYING λ

We further investigate the proposed path regularization via OT with different values of the regularization coefficient λ in the 8-D hyper-grid environment of Section 4.1. The regularization coefficient is selected from the set {0.001, 0.01, 0.1, 0.4}. We plot the mean results over 10 runs for each configuration in Fig. 3. Note that the good range of values for the regularization coefficient is observed to depend strongly on the specific setting of the experiment task.

When λ is relatively small, i.e., λ ∈ {0.001, 0.01}, the performance of GFlowNets trained with our proposed regularization via OT does not seem to differ significantly from that of the baseline model, which holds for UB OT, Min OT, and Max OT alike. This may result from the small contribution of the regularization to the regularized training objective when λ is not large enough. Still, when λ = 0.01, the performance of GFlowNets trained by minimizing the upper bound is slightly better than the baseline's result.

When λ is relatively large (λ = 0.4), the learned GFlowNets perform even worse than the baseline. A possible reason is that a large value of λ forces the model to focus on the regularization part more than necessary, which harms the optimization of the main training objective (the trajectory balance objective). Specifically, this can be observed in the lower KL divergence of UB OT, Min OT, and Max OT compared to the baseline.

Meanwhile, when λ = 0.1, GFlowNets trained by minimizing the OT regularization and its upper bound clearly perform better than the baselines in terms of the number of modes found and the KL divergence between the true and empirical distributions, which supports our motivation that minimizing the proposed path regularization is more beneficial in this setting.

F.2 ADDITIONAL RESULTS OF THE HYPERGRID ENVIRONMENT

We also plot the mean results over 10 runs for each configuration with variance in Fig. 4 . 



Figure 1: OT distance between $P_F(\cdot|s)$ and $P_F(\cdot|s')$. The forward policy $P_F(\cdot|s)$ is a discrete probability measure supported on $\mathrm{Child}(s) = \{u_1, u_2 := s', u_3\}$. Similarly, $P_F(\cdot|s')$ is a discrete probability measure supported on $\mathrm{Child}(s') = \{v_1 \equiv u_1, v_2, v_3\}$. The cost matrix is a 3×3 matrix. For example, $c_{11} = d(u_1, v_1) = 0$ (because $u_1 \equiv v_1$). There exist many possible paths from $u_3$ to $v_3$. First, we can go directly from $u_3$ to $v_3$, with a distance $\mathrm{len}(u_3 \to v_3) = -\log P_F(v_3|u_3)$. Second, we can move from $u_3$ to $v_3$ along a back-and-forth trajectory, i.e., $u_3 \to s \to s' \to v_3$, with a distance $-\log P_B(s|u_3) - \log P_F(s'|s) - \log P_F(v_3|s')$. Because the transportation cost from $u_3$ to $v_3$ is the length of the shortest path from $u_3$ to $v_3$, $c_{33} = d(u_3, v_3) \approx \min\big(-\log(P_B(s|u_3)\, P_F(s'|s)\, P_F(v_3|s')),\ -\log P_F(v_3|u_3)\big)$.

Figure 2: Results on the 4-D (upper) and 8-D (lower) hyper-grid environment. Left: Number of modes found during training. Right: KL divergence between the true and empirical distribution.

), Nguyen et al. (2021)), image processing (Xia et al. (2014)), and machine learning (Courty et al. (2014), Ho et al. (2017), Genevay et al. (2018), Bunne et al. (2019)).

Various methods have been proposed to handle biological sequence design tasks: deep model-based optimization (Trabucco et al. (2021)), Bayesian optimization (Belanger et al. (2019); Swersky et al. (2020)), reinforcement learning (Angermueller et al. (2020)), adaptive evolutionary methods (Hansen (2006); Sinai et al. (2020)), and so on. Recently, GFlowNets have also been proposed as a useful generator of diverse candidates for this problem in Jain et al. (2022).

$\pi_{ij} \big( \log P_B(s|s') + \log P_F(s'|s) \big) + \sum_{u_i \neq s',\, v_j \in \mathrm{Child}(u_i),\, a_i \neq a_\top} \pi_{ij} \big( \log P_B(s|u_i) + \log P_F(s'|s) + \log P_F(v_j|s') + C_{ij} \big) + \sum_{u_i = v_j} \pi_{ij} \big( \log P_B(s|u_i) + \log P_F(s'|s) + \log P_F(v_j|s') \big).$

$\mathrm{OT}(P_F(\cdot|s), P_F(\cdot|s')) = -\sum_{u \in \mathrm{Child}(s)} P_F(u|s) \log P_B(s|u) + H(P_F(\cdot|s')) + P_F(s'|s) \big( \log P_B(s|s') + \log P_F(s'|s) \big)$

Figure 3: Results on the 8-D hyper-grid environment with λ ∈ {0.001, 0.01, 0.1, 0.4} (from top to bottom). Left: Number of modes found during training. Right: KL divergence between the true and empirical distribution.

Figure 4: Results with variance on the 4-D (upper) and 8-D (lower) hyper-grid environment. Left: Number of modes found during training. Right: KL divergence between the true and empirical distribution.

Table 1: Results on the AMP task with K = 100.

AMP The results for the AMP design task are shown in Table 1. We see that the GFlowNet-AL model trained by maximizing the regularization via OT performs better than the other baselines in terms of diversity and novelty. In addition, the TopK performance of the GFlowNet-AL baseline increases from 0.874 to 0.917 when we maximize the OT regularization and is lower only than the reported performance of DynaPPO. However, DynaPPO has much lower diversity and novelty scores, which implies that it mostly generates candidates similar to those in the training dataset.

TF Bind 8 An interesting observation here is that the initial dataset $D_0$ contains only the lower-scoring half of all possible DNA sequences of length 8. Low-quality data is very common in practice, and in this task it makes it very challenging for all methods to obtain good results. From Table 2, we can see that MINs have the highest diversity among the methods. However, this method has a much lower TopK performance and novelty score, which indicates that its generated samples are very similar to the low-quality training dataset. Moreover, despite a slightly lower diversity, the GFlowNet-AL model trained by maximizing the path regularization via OT performs better than the others when looking at all metrics: it outperforms the other baselines in terms of performance and novelty score.

Table 2: Results on the TF Bind 8 task with K = 128.

$-\sum_{u \in \mathrm{Child}(s)} P_F(u|s) \log P_B(s|u) - \log P_F(s'|s) + H(P_F(\cdot|s')) + P_F(s'|s) \big( \log P_B(s|s') + \log P_F(s'|s) \big)$

$\sum_{u_i \neq s',\, a_i = a_j \neq a_\top} \pi_{ij} \big( \log P_B(s|u_i) + \log P_F(s'|s) + \log P_F(v_j|s') + C_{ii} \big) =$

For the implementation of the GFlowNet model, we also follow the framework of Malkin et al. (2022): an MLP with two hidden layers of 256 dimensions each. The GFlowNet policy model, which includes both $P_F$ and $P_B$, is trained with a learning rate of 0.001, while the learning rate for the total flow $Z_\theta$ is 0.1. We use a mini-batch size of 16 and 62500 training steps with the trajectory balance objective.

table: Hyper-parameters for the GFlowNet.


Specifically, we consider the problem of finding objects $x$ in a space of discrete objects $X$ that maximize a given oracle $f : X \to \mathbb{R}_+$. We can only query this oracle $N$ times, each time with an input batch of fixed size $b$. This forms $N$ rounds of evaluation in the active learning setting, where the generative policy is initially given a dataset $D_0 = \{(x^0_1, y^0_1), \dots, (x^0_n, y^0_n)\}$ collected from the oracle, with $y^0_i = f(x^0_i)$ for $1 \le i \le n$. Because the oracle can only be called a limited number of times, we also train a supervised proxy model $M$ that predicts $y$ from $x$ to approximate the oracle $f$. Specifically, in the $i$-th round, given the current dataset $D_i$, this proxy model is used as a reward function $R$ to train our generative policy, which then proposes a batch of candidates $B_i = \{x^i_1, \dots, x^i_b\}$. The current dataset $D_i$ is then updated for the next round of evaluation.

Following the framework of Jain et al. (2022), we conduct experiments on the following biological sequence design tasks:

Anti-Microbial Peptide Design: This task aims to generate short amino-acid sequences of length less than 51 that have anti-microbial properties. The vocabulary has 20 amino acids [A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y]. The active learning algorithm is evaluated for N = 10 rounds, with b = 1000 candidates generated in each round. The initial dataset $D_0$ contains 3219 AMPs and 4611 non-AMP sequences, collected from the DBAASP database (Pirtskhalava et al. (2021)).

TF Bind 8: The goal of this task is to generate DNA sequences of length 8 that have high binding activity with human transcription factors. The vocabulary has 4 nucleobases [A, C, T, G]. The active learning algorithm is evaluated for N = 10 rounds, with b = 128 candidates generated in each round. The initial dataset $D_0$ contains 32,898 samples, which is the lower-scoring half of all possible DNA sequences of length 8.
The data and the oracle used are from Barrera et al. (2016) .

GFP:

The objective of this task is to generate protein sequences of length 237 that have high fluorescence. The vocabulary is similar to that of the AMP task (size 20). The active learning algorithm is evaluated for N = 10 rounds, with b = 128 candidates generated in each round. The initial dataset $D_0$ contains 5,000 samples and, together with the oracle, is taken from Rao et al. (2019); Sarkisyan et al. (2016).

E.2.2 EVALUATION CRITERIA

To evaluate the performance, we use the same metrics as in Jain et al. (2022). Specifically, considering a set of candidates $D$, we have the following metrics:

Performance score: the mean score of the candidates in the set.

Diversity: a measurement of how well the generated candidates can capture the modes of the distribution implied by the oracle, where $d$ is a distance defined over $X$, such as the Levenshtein distance (Miller et al. (2009)).

Novelty: a measure of the difference between the candidates in $D$ and $D_0$.

These metrics are evaluated on the set of candidates that have the top $K$ scores, $D = \mathrm{TopK}(D_N \setminus D_0)$.
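The following sketch shows one way such metrics could be computed. The exact formulas are not reproduced in the text above, so the definitions used here are assumptions: diversity is taken as the mean pairwise distance within the candidate set, and novelty as the mean distance from each candidate to its nearest sequence in $D_0$, both under the Levenshtein distance. All function names are hypothetical.

```python
from itertools import combinations

def levenshtein(a, b):
    """Edit distance between two sequences (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def performance(scores):
    """Mean score of the candidates in the set."""
    return sum(scores) / len(scores)

def diversity(candidates, dist=levenshtein):
    """Assumed definition: mean pairwise distance (needs at least two candidates)."""
    pairs = list(combinations(candidates, 2))
    return sum(dist(a, b) for a, b in pairs) / len(pairs)

def novelty(candidates, initial_dataset, dist=levenshtein):
    """Assumed definition: mean distance to the nearest sequence in D_0."""
    return sum(min(dist(c, x) for x in initial_dataset) for c in candidates) / len(candidates)

def top_k(scored_candidates, k):
    """Keep the k highest-scoring (sequence, score) pairs."""
    return sorted(scored_candidates, key=lambda pair: pair[1], reverse=True)[:k]
```

In this sketch, the three metrics would be applied to the output of `top_k` after removing any sequence that already appears in $D_0$.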

E.2.3 IMPLEMENTATION DETAILS

For the implementation of the GFlowNet-AL baseline model, we use the previously published implementation with slight changes, which follows the training setups of Jain et al. (2022) :

