GFLOWNETS AND VARIATIONAL INFERENCE

Abstract

This paper builds bridges between two families of probabilistic algorithms: (hierarchical) variational inference (VI), which is typically used to model distributions over continuous spaces, and generative flow networks (GFlowNets), which have been used for distributions over discrete structures such as graphs. We demonstrate that, in certain cases, VI algorithms are equivalent to special cases of GFlowNets in the sense of equality of expected gradients of their learning objectives. We then point out the differences between the two families and show how these differences emerge experimentally. Notably, GFlowNets, which borrow ideas from reinforcement learning, are more amenable than VI to off-policy training without the cost of high gradient variance induced by importance sampling. We argue that this property of GFlowNets can provide advantages for capturing diversity in multimodal target distributions.

1. INTRODUCTION

Many probabilistic generative models produce a sample through a sequence of stochastic choices. Non-neural latent variable models (e.g., Blei et al., 2003), autoregressive models, hierarchical variational autoencoders (Sønderby et al., 2016), and diffusion models (Ho et al., 2020) can be said to rely upon a shared principle: richer distributions can be modeled by chaining together a sequence of simple actions, whose conditional distributions are easy to describe, than by performing generation in a single sampling step. When many intermediate sampled variables could generate the same object, making exact likelihood computation intractable, hierarchical models are trained with variational objectives that involve the posterior over the sampling sequence (Ranganath et al., 2016b).
This work connects variational inference (VI) methods for hierarchical models (i.e., sampling through a sequence of choices conditioned on the previous ones) with the emerging area of research on generative flow networks (GFlowNets; Bengio et al., 2021a). GFlowNets have been formulated as a reinforcement learning (RL) algorithm, with states, actions, and rewards, that constructs an object by a sequence of actions so as to make the marginal likelihood of producing an object proportional to its reward. While hierarchical VI is typically used for distributions over real-valued objects, GFlowNets have been successful at approximating distributions over discrete structures for which exact sampling is intractable, such as for molecule discovery (Bengio et al., 2021a), for Bayesian posteriors over causal graphs (Deleu et al., 2022), or as an amortized learned sampler for approximate maximum-likelihood training of energy-based models (Zhang et al., 2022b). Although GFlowNets appear to have different foundations (Bengio et al., 2021b) and applications than hierarchical VI algorithms, we show here that the two are closely connected.
As our main theoretical contribution, we show that special cases of variational algorithms and GFlowNets coincide in their expected gradients. In particular, hierarchical VI (Ranganath et al., 2016b) and nested VI (Zimmermann et al., 2021) are related to the trajectory balance and detailed balance objectives for GFlowNets (Malkin et al., 2022; Bengio et al., 2021b). We also point out the differences between VI and GFlowNets: notably, that GFlowNets automatically perform gradient variance reduction by estimating a marginal quantity (the partition function) that acts as a baseline and allow off-policy learning without the need for reweighted importance sampling.
Our theoretical results are accompanied by experiments that examine what similarities and differences emerge when one applies hierarchical VI algorithms to discrete problems where GFlowNets have been used before. These experiments serve two purposes. First, they supply a missing hierarchical VI baseline for problems where GFlowNets have been used in past work. The relative performance of this baseline illustrates the aforementioned similarities and differences between VI and GFlowNets. Second, the experiments demonstrate the ability of GFlowNets, not shared by hierarchical VI, to learn from off-policy distributions without introducing high gradient variance. We show that this ability to learn with exploratory off-policy sampling is beneficial in discrete probabilistic modeling tasks, especially in cases where the target distribution has many modes.

2.1. GFLOWNETS: NOTATION AND BACKGROUND

We consider the setting of Bengio et al. (2021a). We are given a pointed directed acyclic graph (DAG) $\mathcal{G} = (\mathcal{S}, \mathcal{A})$, where $\mathcal{S}$ is a finite set of vertices (states), and $\mathcal{A} \subset \mathcal{S} \times \mathcal{S}$ is a set of directed edges (actions). If $s \to s'$ is an action, we say $s$ is a parent of $s'$ and $s'$ is a child of $s$. There is exactly one state that has no incoming edge, called the initial state $s_0 \in \mathcal{S}$. States that have no outgoing edges are called terminating. We denote by $\mathcal{X}$ the set of terminating states. A complete trajectory is a sequence $\tau = (s_0 \to \dots \to s_n)$ such that each $s_i \to s_{i+1}$ is an action and $s_n \in \mathcal{X}$. We denote by $\mathcal{T}$ the set of complete trajectories and by $x_\tau$ the last state of a complete trajectory $\tau$.
GFlowNets are a class of models that amortize the cost of sampling from an intractable target distribution over $\mathcal{X}$ by learning a functional approximation of the target distribution using its unnormalized density or reward function, $R : \mathcal{X} \to \mathbb{R}_+$. While there exist different parametrizations and loss functions for GFlowNets, they all define a forward transition probability function, or a forward policy, $P_F(- \mid s)$, which is a distribution over the children of every state $s \in \mathcal{S}$. The forward policy is typically parametrized by a neural network that takes a representation of $s$ as input and produces the logits of a distribution over its children. Any forward policy $P_F$ induces a distribution over complete trajectories $\tau \in \mathcal{T}$ (denoted by $P_F$ as well), which in turn defines a marginal distribution over terminating states $x \in \mathcal{X}$ (denoted by $P_F^\top$):
$$P_F(\tau = (s_0 \to \dots \to s_n)) = \prod_{i=0}^{n-1} P_F(s_{i+1} \mid s_i) \quad \forall \tau \in \mathcal{T}, \tag{1}$$
$$P_F^\top(x) = \sum_{\tau \in \mathcal{T} : x_\tau = x} P_F(\tau) \quad \forall x \in \mathcal{X}. \tag{2}$$
Given a forward policy $P_F$, terminating states $x \in \mathcal{X}$ can be sampled from $P_F^\top$ by sampling trajectories $\tau$ from $P_F(\tau)$ and taking their final states $x_\tau$. GFlowNets aim to find a forward policy $P_F$ for which $P_F^\top(x) \propto R(x)$.
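As a concrete illustration of (2), the following Python sketch enumerates the complete trajectories of a small DAG and sums their probabilities to obtain the terminating distribution. The toy graph, its state names, and its transition probabilities are hypothetical and chosen only for illustration; brute-force enumeration of this kind is feasible only for very small graphs.

```python
from collections import defaultdict

# Hypothetical toy DAG (not from the paper): the forward policy is given as
# state -> {child: probability}. States absent from the dict are terminating.
TOY_POLICY = {
    "s0": {"a": 0.6, "b": 0.4},
    "a": {"x1": 0.5, "x2": 0.5},
    "b": {"x2": 1.0},
}


def trajectory_prob(policy, trajectory):
    """P_F(tau): product of the transition probabilities along tau."""
    prob = 1.0
    for state, child in zip(trajectory, trajectory[1:]):
        prob *= policy[state][child]
    return prob


def complete_trajectories(policy, state):
    """Yield every complete trajectory that starts at `state`."""
    if state not in policy:  # no outgoing edges: terminating state
        yield (state,)
        return
    for child in policy[state]:
        for rest in complete_trajectories(policy, child):
            yield (state,) + rest


def terminating_marginal(policy, initial_state):
    """Terminating distribution: total probability of trajectories ending in x."""
    marginal = defaultdict(float)
    for tau in complete_trajectories(policy, initial_state):
        marginal[tau[-1]] += trajectory_prob(policy, tau)
    return dict(marginal)


marginal = terminating_marginal(TOY_POLICY, "s0")
```

In this toy graph the state `x2` is reached by two distinct trajectories, so its marginal probability is the sum of two trajectory probabilities, which is exactly the situation that makes the sum in (2) expensive in large graphs.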
Because the sum in (2) is typically intractable to compute exactly, training objectives for GFlowNets introduce auxiliary objects into the optimization. For example, the trajectory balance objective (TB; Malkin et al., 2022) introduces an auxiliary backward policy $P_B$, which is a learned distribution $P_B(- \mid s)$ over the parents of every state $s \in \mathcal{S}$, and an estimated partition function $Z$, typically parametrized as $\exp(\log Z)$ where $\log Z$ is the learned parameter. The TB objective for a complete trajectory $\tau$ is defined as
$$\mathcal{L}_{\mathrm{TB}}(\tau; P_F, P_B, Z) = \left( \log \frac{Z \cdot P_F(\tau)}{R(x_\tau) P_B(\tau \mid x_\tau)} \right)^2, \quad \text{where } P_B(\tau \mid x_\tau) = \prod_{(s \to s') \in \tau} P_B(s \mid s'). \tag{3}$$
If $\mathcal{L}_{\mathrm{TB}}$ is made equal to 0 for every complete trajectory $\tau$, then $P_F^\top(x) \propto R(x)$ for all $x \in \mathcal{X}$ and $Z$ is the inverse constant of proportionality: $Z = \sum_{x \in \mathcal{X}} R(x)$.
The objective (3) is minimized by sampling trajectories $\tau$ from some distribution and making gradient steps on (3) with respect to the parameters of $P_F$, $P_B$, and $\log Z$. The distribution from which $\tau$ is sampled amounts to a choice of scalarization weights for the multi-objective problem of minimizing (3) over all $\tau \in \mathcal{T}$. If $\tau$ is sampled from $P_F(\tau)$ (note that this is a nonstationary scalarization), we say the algorithm runs on-policy. If $\tau$ is sampled from another distribution, the algorithm runs off-policy; typical choices are to sample $\tau$ from a tempered version of $P_F$ to encourage exploration (Bengio et al., 2021a; Deleu et al., 2022) or to sample $\tau$ from the backward policy $P_B(\tau \mid x)$ starting from given terminating states $x$ (Zhang et al., 2022b). By analogy with the RL nomenclature, we call the behavior policy the one that samples $\tau$ for the purpose of obtaining a stochastic gradient, e.g., the gradient of the objective $\mathcal{L}_{\mathrm{TB}}$ in (3) for the sampled $\tau$.
Other objectives have been studied and successfully used in past works, including detailed balance (DB; proposed by Bengio et al. (2021b) and evaluated by Malkin et al. (2022)) and subtrajectory balance (SubTB; Madan et al., 2022). In the next sections, we will show how the TB objective relates to hierarchical variational objectives. In §C, we generalize this result to the SubTB loss, of which both TB and DB are special cases.
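The TB objective (3) can be evaluated in log space from per-step log-probabilities. The sketch below is a minimal illustration rather than the implementation used in the experiments; the function name `tb_loss`, the toy trajectory, and the reward values are hypothetical.

```python
import math


def tb_loss(log_z, log_pf_steps, log_reward, log_pb_steps):
    """Squared log-ratio of Z * P_F(tau) to R(x_tau) * P_B(tau | x_tau)."""
    log_pf = sum(log_pf_steps)  # log P_F(tau)
    log_pb = sum(log_pb_steps)  # log P_B(tau | x_tau)
    return (log_z + log_pf - log_reward - log_pb) ** 2


# Hypothetical trajectory s0 -> a -> x2, where x2 has two parents (a and b).
# Assumed rewards R(x1) = 3 and R(x2) = 7, so the true partition function is 10.
log_pf_steps = [math.log(0.6), math.log(0.5)]        # P_F(a | s0), P_F(x2 | a)
log_pb_steps = [math.log(1.0), math.log(3.0 / 7.0)]  # P_B(s0 | a), P_B(a | x2)

# With the correct log Z the trajectory is balanced and the loss vanishes.
balanced = tb_loss(math.log(10.0), log_pf_steps, math.log(7.0), log_pb_steps)

# With a wrong estimate of Z the same trajectory incurs a positive loss.
unbalanced = tb_loss(math.log(5.0), log_pf_steps, math.log(7.0), log_pb_steps)
```

Working with sums of log-probabilities instead of products of probabilities avoids numerical underflow on long trajectories.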

2.2. HIERARCHICAL VARIATIONAL MODELS AND GFLOWNETS

Variational methods provide a way of sampling from distributions by means of learning an approximate probability density. Hierarchical variational models (HVMs; Ranganath et al., 2016b; Sobolev & Vetrov, 2019; Vahdat & Kautz, 2020; Zimmermann et al., 2021) typically assume that the sample space is a set of sequences $(z_1, \dots, z_n)$ of fixed length, with an assumption of conditional independence between $z_{i-1}$ and $z_{i+1}$ conditioned on $z_i$, i.e., the likelihood has a factorization $q(z_1, \dots, z_n) = q(z_1) q(z_2 \mid z_1) \cdots q(z_n \mid z_{n-1})$. The marginal likelihood of $z_n$ in a hierarchical model involves a possibly intractable sum,
$$q(z_n) = \sum_{z_1, \dots, z_{n-1}} q(z_1) q(z_2 \mid z_1) \cdots q(z_n \mid z_{n-1}).$$
The goal of VI algorithms is to find the conditional distributions $q$ that minimize some divergence between the marginal $q(z_n)$ and a target distribution. The target is often given as a distribution with intractable normalization constant: a typical setting is a Bayesian posterior (used in VAEs, variational EM, and other applications), for which we desire $q(z_n) \propto p_{\text{likelihood}}(x \mid z_n)\, p_{\text{prior}}(z_n)$.
The GFlowNet corresponding to a HVM: Sampling sequences $(z_1, \dots, z_n)$ from a hierarchical model is equivalent to sampling complete trajectories in a certain pointed DAG $\mathcal{G}$. The states of $\mathcal{G}$ at a distance of $i$ from the initial state are in bijection with possible values of the variable $z_i$, and the action distribution is given by $q$. Sampling from the HVM is equivalent to sampling trajectories from the policy $P_F(z_{i+1} \mid z_i) = q(z_{i+1} \mid z_i)$ (and $P_F(z_1 \mid s_0) = q(z_1)$), and the marginal distribution $q(z_n)$ is the terminating distribution $P_F^\top$.
The HVM corresponding to a GFlowNet: Conversely, suppose $\mathcal{G} = (\mathcal{S}, \mathcal{A})$ is a graded pointed DAG and that a forward policy $P_F$ on $\mathcal{G}$ is given. Sampling trajectories $\tau = (s_0 \to s_1 \to \dots \to s_L)$ in $\mathcal{G}$ is equivalent to sampling from a HVM in which the random variable $z_i$ is the identity of the $(i+1)$-th state $s_i$ in $\tau$ and the conditional distributions $q(z_{i+1} \mid z_i)$ are given by the forward policy $P_F(s_{i+1} \mid s_i)$. Specifying an approximation of the target distribution in a hierarchical model with $n$ layers is thus equivalent to specifying a forward policy $P_F$ in a graded DAG.
The correspondence can be extended to non-graded DAGs. Every pointed DAG $\mathcal{G} = (\mathcal{S}, \mathcal{A})$ can be canonically transformed into a graded pointed DAG by the insertion of dummy states that have one child and one parent. To be precise, every edge $s \to s' \in \mathcal{A}$ is replaced with a sequence of $\ell' - \ell(s)$ edges, where $\ell(s)$ is the length of the longest trajectory from $s_0$ to $s$, $\ell' = \ell(s')$ if $s' \notin \mathcal{X}$, and $\ell' = \max_{s'' \in \mathcal{S}} \ell(s'')$ otherwise. This process is illustrated in §A. We thus restrict our analysis in this section, without loss of generality, to graded DAGs.
The meaning of the backward policy: Typically, the target distribution is over the objects $\mathcal{X}$ of the last layer of a graded DAG, rather than over complete sequences or trajectories. Any backward policy $P_B$ on the DAG turns an unnormalized target distribution $R$ over $\mathcal{X}$ into an unnormalized distribution over complete trajectories $\mathcal{T}$:
$$\forall \tau \in \mathcal{T} \quad P_B(\tau) \propto R(x_\tau) P_B(\tau \mid x_\tau),$$
with unknown partition function $\hat{Z} = \sum_{x \in \mathcal{X}} R(x)$. The marginal distribution of $P_B$ over terminating states is equal to $R(x)/\hat{Z}$ by construction. Therefore, if $P_F$ is a forward policy that equals $P_B$ as a distribution over trajectories, then $P_F^\top(x) = R(x)/\hat{Z} \propto R(x)$.
VI training objectives: In its most general form, the hierarchical variational objective ('HVI objective' in the remainder of the paper) minimizes a statistical divergence $D_f$ between the learned and the target distributions over trajectories:
$$\mathcal{L}_{\mathrm{HVI},f}(P_F, P_B) = D_f(P_B \,\|\, P_F) = \mathbb{E}_{\tau \sim P_F}\left[ f\!\left( \frac{P_B(\tau)}{P_F(\tau)} \right) \right]. \tag{5}$$
Two common objectives are the forward and reverse Kullback-Leibler (KL) divergences (Mnih & Gregor, 2014), corresponding to $f : t \mapsto t \log t$ for $D_{\mathrm{KL}}(P_B \,\|\, P_F)$ and $f : t \mapsto -\log t$ for $D_{\mathrm{KL}}(P_F \,\|\, P_B)$, respectively. Other $f$-divergences have been used, as discussed in Zhang et al. (2019b); Wan et al. (2020). Note that, similar to GFlowNets, (5) can be minimized with respect to both the forward and backward policies, or can be minimized using a fixed backward policy.
Divergences between two distributions over trajectories and divergences between their two marginal distributions over terminating states are linked via the data processing inequality, assuming $f$ is convex (see, e.g., Zhang et al. (2019b)), making the former a sensible surrogate objective for the latter:
$$D_f(R/\hat{Z} \,\|\, P_F^\top) \le D_f(P_B \,\|\, P_F). \tag{6}$$
When both $P_B$ and $P_F$ are learned, the divergences with respect to which they are optimized need not be the same, as long as both objectives are 0 if and only if $P_F = P_B$. For example, wake-sleep algorithms (Hinton et al., 1995) optimize the generative model $P_F$ using $D_{\mathrm{KL}}(P_B \,\|\, P_F)$ and the posterior $P_B$ using $D_{\mathrm{KL}}(P_F \,\|\, P_B)$. A summary of common combinations is shown in Table 1.
We remark that tractable unbiased gradient estimators for objectives such as (5) may not always exist, as we cannot exactly sample from or compute the density of $P_B(\tau)$ when its normalization constant $\hat{Z}$ is unknown. For example, while the REINFORCE estimator gives unbiased estimates of the gradient with respect to $P_F$ when the objective is REVERSE KL (see §2.3), other objectives, such as FORWARD KL, require importance-weighted estimators. Such estimators approximate sampling from $P_B$ by sampling a batch of trajectories $\{\tau_i\}$ from another distribution $\pi$ (which may equal $P_F$) and weighting a loss computed for each $\tau_i$ by a scalar proportional to $\frac{P_B(\tau_i)}{\pi(\tau_i)}$. Such reweighted importance sampling is helpful in various variational algorithms, despite its bias when the number of samples is finite (e.g., Bornschein & Bengio, 2015; Burda et al., 2016), but it may also introduce variance that increases with the discrepancy between $P_B$ and $\pi$.
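The reweighting step described above can be sketched as follows. This is a generic self-normalized importance weighting routine, not code from the paper; the function names and the effective-sample-size diagnostic are illustrative assumptions. Because the weights are normalized within the batch, the unknown partition function cancels and only the unnormalized log-density of the target over trajectories is needed.

```python
import math


def self_normalized_weights(log_unnorm_target, log_proposal):
    """Batch-normalized weights proportional to P_B(tau_i) / pi(tau_i).

    log_unnorm_target[i] is assumed to hold log R(x_i) + log P_B(tau_i | x_i),
    and log_proposal[i] to hold log pi(tau_i), for trajectories drawn from pi.
    """
    log_w = [t - q for t, q in zip(log_unnorm_target, log_proposal)]
    shift = max(log_w)  # subtract the maximum for numerical stability
    w = [math.exp(lw - shift) for lw in log_w]
    total = sum(w)
    return [wi / total for wi in w]


def effective_sample_size(weights):
    """1 / sum(w_i^2); equals the batch size when all weights are equal."""
    return 1.0 / sum(w * w for w in weights)


# Proposal matches the target up to a constant: uniform weights.
uniform = self_normalized_weights([0.0, 0.0, 0.0, 0.0], [-1.0, -1.0, -1.0, -1.0])

# Large discrepancy between target and proposal: one sample dominates.
skewed = self_normalized_weights([0.0, 50.0], [0.0, 0.0])
```

A small effective sample size signals that a few trajectories dominate the batch, which is the high-variance regime that grows with the discrepancy between the target and the sampling distribution.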

2.3. ANALYSIS OF GRADIENTS

The following proposition summarizes our main theoretical claim, relating the GFN objective of (3) and the variational objective of (5). In §C, we extend this result by showing an equivalence between the subtrajectory balance objective (introduced in Malkin et al. (2022) and empirically evaluated in Madan et al. (2022)) and a natural extension of the nested variational objective (Zimmermann et al., 2021) to subtrajectories. A special case of this equivalence is between the Detailed Balance objective (Bengio et al., 2021b) and the nested VI objective (Zimmermann et al., 2021).
Proposition 1. Given a graded DAG $\mathcal{G}$, and denoting by $\theta, \phi$ the parameters of the forward and backward policies $P_F, P_B$ respectively, the gradients of the TB objective (3) satisfy:
$$\nabla_\phi D_{\mathrm{KL}}(P_B \,\|\, P_F) = \frac{1}{2} \mathbb{E}_{\tau \sim P_B}[\nabla_\phi \mathcal{L}_{\mathrm{TB}}(\tau)], \tag{7}$$
$$\nabla_\theta D_{\mathrm{KL}}(P_F \,\|\, P_B) = \frac{1}{2} \mathbb{E}_{\tau \sim P_F}[\nabla_\theta \mathcal{L}_{\mathrm{TB}}(\tau)]. \tag{8}$$
The proof of the extended result appears in §C. An alternative proof is provided in §B.
While (8) is the on-policy TB gradient with respect to the parameters of $P_F$, (7) is not the on-policy TB gradient with respect to the parameters of $P_B$, as the expectation is taken over $P_B$, not $P_F$. The on-policy TB gradient can however be expressed through a surrogate loss
$$\mathbb{E}_{\tau \sim P_F}[\nabla_\phi \mathcal{L}_{\mathrm{TB}}(\tau)] = \nabla_\phi \left[ D_{\log^2}(P_B \,\|\, P_F) + 2(\log Z - \log \hat{Z})\, D_{\mathrm{KL}}(P_F \,\|\, P_B) \right], \tag{9}$$
where $\hat{Z} = \sum_{x \in \mathcal{X}} R(x)$ is the unknown true partition function. Here $D_{\log^2}$ is the pseudo-$f$-divergence defined by $f(x) = \log(x)^2$, which is not convex for large $x$. (Proof in §B.)
The loss in (7) is not possible to optimize directly unless using importance weighting (cf. the end of §2.2), but optimization of $P_B$ using (7) and $P_F$ using (8) would yield the gradients of REVERSE WAKE-SLEEP in expectation.
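The identity (8) can be checked numerically on a graph small enough to enumerate. The sketch below is only a sanity check under stated assumptions, not part of the proofs in §B and §C: it assumes a hypothetical graded DAG in which the initial state has four children, a softmax forward policy over those children, a fixed backward policy, and made-up rewards. It compares a finite-difference gradient of the reverse KL with half the exactly computed on-policy expectation of the TB gradient, for several values of $\log Z$.

```python
import math


def softmax(theta):
    shift = max(theta)
    exps = [math.exp(t - shift) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical graded DAG: s0 has four children; the first two lead to x0 and
# the last two lead to x1, so each trajectory is fixed by its first action.
LOG_REWARD = [math.log(1.0), math.log(1.0), math.log(3.0), math.log(3.0)]
LOG_PB = [math.log(0.3), math.log(0.7), math.log(0.5), math.log(0.5)]


def reverse_kl(theta):
    """D_KL(P_F || P_B) with P_B(tau) = R(x_tau) P_B(tau | x_tau) / Z_hat."""
    p = softmax(theta)
    log_target = [r + b for r, b in zip(LOG_REWARD, LOG_PB)]
    log_z_hat = math.log(sum(math.exp(t) for t in log_target))
    return sum(pi * (math.log(pi) - (t - log_z_hat))
               for pi, t in zip(p, log_target))


def half_expected_tb_grad(theta, log_z):
    """0.5 * E_{tau ~ P_F}[grad_theta L_TB(tau)], by exact enumeration."""
    p = softmax(theta)
    grad = [0.0] * len(theta)
    for k, pk in enumerate(p):
        residual = log_z + math.log(pk) - LOG_REWARD[k] - LOG_PB[k]
        for j in range(len(theta)):
            score = (1.0 if j == k else 0.0) - p[j]  # d log p_k / d theta_j
            grad[j] += pk * residual * score  # 0.5 * (2 * residual * score)
    return grad


def finite_difference_grad(f, theta, eps=1e-6):
    """Central finite differences, used here as an independent reference."""
    grad = []
    for j in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[j] += eps
        down[j] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


theta = [0.2, -0.5, 1.0, 0.3]
kl_grad = finite_difference_grad(reverse_kl, theta)
tb_grads = {log_z: half_expected_tb_grad(theta, log_z)
            for log_z in (-2.0, 0.0, 3.0)}
```

The expected gradient does not depend on the value of $\log Z$, since the score function has zero mean under $P_F$; the choice of $\log Z$ affects only the variance of single-sample estimates.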
Score function estimator and variance reduction: Optimizing the reverse KL loss $D_{\mathrm{KL}}(P_F \,\|\, P_B)$ with respect to $\theta$, the parameters of $P_F$, requires a likelihood ratio (also known as REINFORCE) estimator of the gradient (Williams, 1992), using a trajectory $\tau$ (or a batch of trajectories), which takes the form:
$$\Delta(\tau) = \nabla_\theta \log P_F(\tau; \theta)\, c(\tau), \quad \text{where } c(\tau) = \log \frac{P_F(\tau)}{R(x_\tau) P_B(\tau \mid x_\tau)}. \tag{10}$$
(Note that the remaining term $\nabla_\theta c(\tau) = \nabla_\theta \log P_F(\tau)$ vanishes in expectation, since $\mathbb{E}_{\tau \sim P_F}[\nabla_\theta \log P_F(\tau)] = \sum_{\tau \in \mathcal{T}} \frac{P_F(\tau)}{P_F(\tau)} \nabla_\theta P_F(\tau) = 0$.) The estimator of (10) is known to exhibit high variance, thus slowing down learning. A common workaround is to subtract a baseline $b$ from $c(\tau)$, which does not bias the estimator. The value of the baseline $b$ (also called control variate) that most reduces the trace of the covariance matrix of the gradient estimator is
$$b^* = \frac{\mathbb{E}_{\tau \sim P_F}\left[ c(\tau)\, \|\nabla_\theta \log P_F(\tau; \theta)\|^2 \right]}{\mathbb{E}_{\tau \sim P_F}\left[ \|\nabla_\theta \log P_F(\tau; \theta)\|^2 \right]},$$
commonly approximated with $\mathbb{E}_{\tau \sim P_F}[c(\tau)]$ (see, e.g., Weaver & Tao (2001); Wu et al. (2018)). This approximation is itself often approximated with a batch-dependent local baseline, from a batch of trajectories $\{\tau_i\}_{i=1}^{B}$:
$$b_{\text{local}} = \frac{1}{B} \sum_{i=1}^{B} c(\tau_i). \tag{11}$$
A better approximation of the expectation $\mathbb{E}_{\tau \sim P_F}[c(\tau)]$ can be obtained by maintaining a running average of the values $c(\tau)$, leading to a global baseline. After observing each batch of trajectories, the running average is updated with step size $\eta$:
$$b_{\text{global}} \leftarrow (1 - \eta)\, b_{\text{global}} + \eta\, b_{\text{local}}. \tag{12}$$
This coincides with the update rule of $\log Z$ in the minimization of $\mathcal{L}_{\mathrm{TB}}(P_F, P_B, Z)$ with a learning rate $\frac{\eta}{2}$ for the parameter $\log Z$ (with respect to which the TB objective is quadratic). Consequently, (8) of Prop. 1 shows that the update rule for the parameters of $P_F$, when optimized using the REVERSE KL objective, with (12) as a control variate for the score function estimator of its gradient, is the same as the update rule obtained by optimizing the TB objective using on-policy trajectories.
While learning a backward policy $P_B$ can speed up convergence (Malkin et al., 2022), the TB objective can also be used with a fixed backward policy, in which case the REVERSE KL objective and the TB objective differ only in how they reduce the variance of the estimated gradients, if the trajectories are sampled on-policy. In §4, we experimentally explore the differences between the two learning paradigms that arise when $P_B$ is learned, or when the algorithms run off-policy.
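The correspondence between the running-average baseline (12) and gradient descent on $\log Z$ can be made explicit with a few lines of arithmetic. The sketch below assumes the sign convention $c(\tau) = \log P_F(\tau) - \log R(x_\tau) - \log P_B(\tau \mid x_\tau)$, so that the TB loss of a trajectory is $(\log Z + c(\tau))^2$; under that convention the quantity that follows (12) is $-\log Z$. The batch values and function names are hypothetical.

```python
def update_global_baseline(b_global, c_batch, eta):
    """Running-average update (12) with step size eta."""
    b_local = sum(c_batch) / len(c_batch)
    return (1.0 - eta) * b_global + eta * b_local


def sgd_step_log_z(log_z, c_batch, eta):
    """One gradient step on the batch mean of (log_z + c)^2, learning rate eta / 2."""
    grad = sum(2.0 * (log_z + c) for c in c_batch) / len(c_batch)
    return log_z - (eta / 2.0) * grad


# Hypothetical batches of c(tau) values observed during training.
batches = [[-1.2, -0.8, -1.5], [-0.9, -1.1, -1.0], [-1.4, -0.7, -1.3]]

eta = 0.1
b_global = 0.3
log_z = -b_global  # start the two quantities in correspondence
history = []
for c_batch in batches:
    b_global = update_global_baseline(b_global, c_batch, eta)
    log_z = sgd_step_log_z(log_z, c_batch, eta)
    history.append((b_global, log_z))
```

After every batch the two quantities remain negatives of each other, which illustrates why a learned $\log Z$ plays the role of a global baseline for the score function estimator.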

3. RELATED WORK

(Hierarchical) VI: Variational inference (Zhang et al., 2019a) techniques originate from graphical models (Saul et al., 1996; Jordan et al., 2004), which typically include an inference machine and a generative machine to model the relationship between latent variables and observed data. The line of work on black-box VI (Ranganath et al., 2014) focuses on learning the inference machine given a data generating process, i.e., inferring the posterior over latent variables. Hierarchical modeling exhibits appealing properties under such settings as discussed in Ranganath et al. (2016b); Yin & Zhou (2018); Sobolev & Vetrov (2019). On the other hand, works on variational auto-encoders (VAEs) (Kingma & Welling, 2014; Rezende et al., 2014) focus on generative modeling, where the inference machine (the estimated variational posterior) is a tool to assist optimization of the generative machine or decoder. Hierarchical construction of multiple latent variables has also been shown to be beneficial (Sønderby et al., 2016; Maaløe et al., 2019; Child, 2021).
While earlier works simplify the variational family with mean-field approximations (Bishop, 2006), modern inference methods rely on amortized stochastic optimization (Hoffman et al., 2013). One of the oldest and most commonly used ideas is REINFORCE (Williams, 1992; Paisley et al., 2012), which gives unbiased gradient estimation. Follow-up work (Titsias & Lázaro-Gredilla, 2014; Gregor et al., 2014; Mnih & Gregor, 2014; Mnih & Rezende, 2016) proposes advanced estimators to reduce the high variance of REINFORCE. The log-variance loss proposed by Richter et al. (2020) is equivalent in expected gradient of $P_F$ to the on-policy TB loss for a GFlowNet with a batch-optimal value of $\log Z$. On the other hand, path-wise gradient estimators (Kingma & Welling, 2014) have much lower variance, but have limited applicability. Later works combine these two approaches for particular distribution families (Tucker et al., 2017; Grathwohl et al., 2018).
Beyond the evidence lower bound (ELBO) objective used in most variational inference methods, more complex objectives have been studied. Tighter evidence bounds have proved beneficial to the learning of generative machines (Burda et al., 2016; Domke & Sheldon, 2018; Rainforth et al., 2018; Masrani et al., 2019). As KL divergence optimization suffers from issues such as mean-seeking behavior and posterior variance underestimation (Minka, 2005), other divergences are adopted as in expectation propagation (Minka, 2001; Li et al., 2015), more general $f$-divergences (Dieng et al., 2017; Wang et al., 2018; Wan et al., 2020), their special case $\alpha$-divergences (Hernández-Lobato et al., 2016), and Stein discrepancy (Liu & Wang, 2016; Ranganath et al., 2016a). GFlowNets could be seen as providing a novel pseudo-divergence criterion, namely TB, as discussed in this work.
Wake-sleep algorithms: Another branch of work, starting with Hinton et al. (1995), proposes to avoid issues from stochastic optimization (such as REINFORCE) by alternately optimizing the generative and inference (posterior) models. Modern versions extending this framework include reweighted wake-sleep (Bornschein & Bengio, 2015; Le et al., 2019) and memoised wake-sleep (Hewitt et al., 2020; Le et al., 2022). It was shown in Le et al. (2019) that wake-sleep algorithms behave well for tasks involving stochastic branching.
GFlowNets: GFlowNets have been used successfully in settings where RL and MCMC methods have been used in other work, including molecule discovery (Bengio et al., 2021a; Malkin et al., 2022; Madan et al., 2022), biological sequence design (Malkin et al., 2022; Jain et al., 2022; Madan et al., 2022), and Bayesian structure learning (Deleu et al., 2022). A connection of the theoretical foundations of GFlowNets (Bengio et al., 2021a;b) with variational methods was first mentioned by Malkin et al. (2022) and expanded in Zhang et al. (2022a; 2023). A concurrent and closely related paper (Zimmermann et al., 2022) theoretically and experimentally explores interpolations between forward and reverse KL objectives.

4. EXPERIMENTS

The goal of the experiments is to empirically investigate two main observations consistent with the above theoretical analysis:
Observation 1. On-policy VI and TB (GFlowNet) objectives can behave similarly in some cases, when both can be stably optimized, while in others on-policy TB strikes a better compromise than either the (mode-seeking) REVERSE KL or (mean-seeking) FORWARD KL VI objectives. This claim is supported by the experiments on all three domains below. However, in all cases, notable differences emerge. In particular, HVI training becomes less stable near convergence and is sensitive to learning rates, which is consistent with the hypotheses about gradient variance in §2.3.
Observation 2. When exploration matters, off-policy TB outperforms both on-policy TB and VI objectives, avoiding the possible high variance induced by importance sampling in off-policy VI. GFlowNets are capable of stable off-policy training without importance sampling. This claim is supported by experiments on all domains, but is especially well illustrated on the realistic domains in §4.2 and §4.3. This capability provides advantages for capturing a more diverse set of modes.
Observation 1 and Observation 2 provide evidence that off-policy TB is the best method among those tested in terms of both accurately fitting the target distribution and effectively finding modes, where the latter is particularly important for the challenging molecule graph generation and causal graph discovery problems studied below.

4.1. HYPERGRID: EXPLORATION OF LEARNING OBJECTIVES

In this section, we comparatively study the ability of the variational objectives and the GFlowNet objectives to learn a multimodal distribution given by its unnormalized density, or reward function, $R$. We use the synthetic hypergrid environment introduced by Bengio et al. (2021a) and further explored by Malkin et al. (2022). The states form a $D$-dimensional hypergrid with side length $H$, and the reward function has $2^D$ flat modes near the corners of the hypergrid. The states form a pointed DAG, where the source state is the origin $s_0 = 0$, and each edge corresponds to the action of incrementing one coordinate in a state by 1 (without exiting the grid). More details about the environment are provided in §D.1. We focus on the case where $P_B$ is learned, which has been shown to accelerate convergence (Malkin et al., 2022).
In Fig. 1, we compare how fast each learning objective discovers the 4 modes of a 128 × 128 grid, with an exploration parameter $R_0 = 0.001$ in the reward function. The gap between the learned distribution $P_F^\top$ and the target distribution is measured by the Jensen-Shannon divergence (JSD) between the two distributions, to avoid giving a preference to one KL or the other. Additionally, we show graphical representations of the learned 2D terminating states distribution, along with the target distribution. We provide in §E details on how $P_F^\top$ and the JSD are evaluated and how hyperparameters were optimized separately for each learning algorithm.
[Figure 1, caption (partially recovered): ... $P_F^\top$ for the different algorithms, along with the target distribution. To amplify variation, the plot intensity at each grid position is resampled from the Gaussian approximating the distribution over the 5 runs. Although WS, FORWARD KL, and REVERSE WS (off-policy) find the 4 target modes, they do not model them with high precision, and produce a textured pattern at the modes, where it should be flat.]
Exploration poses a challenge in this environment, given the distance that separates the different modes.
We thus include in our analysis an off-policy version of each objective, where the behavior policy is different from, but related to, the trained sampler $P_F(\tau)$. The GFlowNet behavior policy used here encourages exploration by reducing the probability of terminating a trajectory at any state of the grid. This biases the learner towards sampling longer trajectories and helps with faster discovery of farther modes. When off-policy, the HVI gradients are corrected using importance sampling weights.
For the algorithms that use a score function estimator of the gradient (FORWARD KL, REVERSE WS, and REVERSE KL), we found that using a global baseline, as explained in §2.3, was better than using the more common local baseline in most cases (see Fig. D.1). This brings the VI methods closer to GFlowNets and thus factors out this issue from the comparison with the GFlowNet objectives.
We see from Fig. 1 that while FORWARD KL and WS (the two algorithms that use $D_{\mathrm{KL}}(P_B \,\|\, P_F)$ as the objective for $P_F$) discover the four modes of the distribution faster, they converge to a local minimum and do not model all the modes with high precision. This is due to the mean-seeking behavior of the forward KL objective, requiring that $P_F^\top$ puts non-zero mass on terminating states $x$ where $R(x) > 0$. Objectives that use the reverse KL to train the forward policy (REVERSE KL and REVERSE WS) are mode-seeking and can thus have a low loss without finding all the modes. The TB GFlowNet objective offers the best of both worlds, as it converges to a lower value of the JSD, discovers the four modes, and models them with high precision. This supports Observation 1. Additionally, in support of Observation 2, while both the TB objective and the HVI objectives benefit from off-policy sampling, TB benefits more, as convergence is greatly accelerated.
We supplement this study with a comparative analysis of the algorithms on smaller grids in §D.1.
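For reference, the Jensen-Shannon divergence between two discrete distributions can be computed as below. This is a sketch of the standard definition with natural logarithms; the exact evaluation procedure used in the experiments is described in §E and may differ in details such as the logarithm base, so the function here should be read as an assumption rather than as the evaluation code.

```python
import math


def kl_divergence(p, q):
    """D_KL(p || q) for discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def jensen_shannon_divergence(p, q):
    """Symmetric divergence: average KL of p and q to their midpoint mixture."""
    mixture = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)


# Hypothetical learned and target distributions over three terminating states.
learned = [0.1, 0.2, 0.7]
target = [0.3, 0.3, 0.4]
jsd = jensen_shannon_divergence(learned, target)
```

Unlike either KL divergence, this quantity is symmetric in its arguments and stays finite when one distribution assigns zero mass to a state, which fits the stated aim of not favoring one KL direction over the other.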

4.2. MOLECULE SYNTHESIS

We study the molecule synthesis task from Bengio et al. (2021a), in which molecular graphs are generated by sequential addition of subgraphs from a library of blocks (Jin et al., 2020; [...] et al., 2012). The reward function is expressed in terms of a fixed, pretrained graph neural network $f$ that estimates the strength of binding to the soluble epoxide hydrolase protein (Trott & Olson, 2010). To be precise, $R(x) = f(x)^\beta$, where $f(x)$ is the output of the binding model on molecule $x$ and $\beta$ is a parameter that can be varied to control the entropy of the sampling model.
Because the number of terminating states is too large to make exact computation of the target distribution possible, we use a performance metric from past work on this task (Bengio et al., 2021a) to evaluate sampling agents. Namely, for each molecule $x$ in a held-out set, we compute $\log P_F^\top(x)$, the likelihood of $x$ under the trained model (computable by dynamic programming, see §E), and evaluate the Pearson correlation of $\log P_F^\top(x)$ and $\log R(x)$. This value should equal 1 for a perfect sampler, as $\log P_F^\top(x)$ and $\log R(x)$ would differ by a constant, the log-partition function $\log \hat{Z}$.
In Malkin et al. (2022), GFlowNet samplers using the DB and TB objectives, with the backward policy $P_B$ fixed to a uniform distribution over the parents of each state, were trained off-policy. Specifically, the trajectories used for DB and TB gradient updates were sampled from a mixture of the (online) forward policy $P_F$ and a uniform distribution at each sampling step, with a special weight depending on the trajectory length used for the termination action.
We wrote an extension of the published code of Malkin et al. (2022) with an implementation of the HVI (REVERSE KL) objective, using a reweighted importance sampling correction. We compare the off-policy TB from past work with the off-policy REVERSE KL, as well as on-policy TB and REVERSE KL objectives. (Note that on-policy TB and REVERSE KL are equivalent in expectation in this setting, since the backward policy is fixed.) Each of the four algorithms was evaluated with four values of the inverse temperature parameter $\beta$ and of the learning rate $\alpha$, for a total of 4 × 4 × 4 = 64 settings. (We also experimented with the off-policy FORWARD KL / WS objective for optimizing $P_F$, but none of the hyperparameter settings resulted in an average correlation greater than 0.1.)
The results are shown in Fig. 2, in which, for each hyperparameter ($\alpha$ or $\beta$), we plot the performance for the optimal value of the other hyperparameter. We make three observations:
• In support of Observation 2, off-policy REVERSE KL performs poorly compared to its on-policy counterpart, especially for smoother distributions (smaller values of $\beta$) where more diversity is present in the target distribution. Because the two algorithms agree in the expected gradient, this suggests that importance sampling introduces unacceptable variance into HVI gradients.
• In support of Observation 1, the difference between on-policy REVERSE KL and on-policy TB is quite small, consistent with their gradients coinciding in the limit of descent along the full-batch gradient field. However, REVERSE KL algorithms are more sensitive to the learning rate.
• In support of Observation 2, off-policy TB gives the best and lowest-variance fit to the target distribution, showing the importance of an exploratory training policy, especially for sparser reward landscapes (higher $\beta$).
[Figure 2, caption (partially recovered): ... against on-policy TB (orange) and both on-policy (blue) and off-policy HVI (green). For each hyperparameter setting on the $x$-axis ($\alpha$ or $\beta$), we take the optimal choice of the other hyperparameter ($\beta$ or $\alpha$, respectively) and plot the mean and standard error region over three random seeds.]
We only consider settings where the true posterior distribution $p(G \mid \mathcal{D})$ can be computed exactly by enumerating all the possible DAGs $G$ over $d$ nodes (for $d \le 5$).
This allows us to exactly compare the posterior approximations, found either with the GFlowNet objectives or HVI, with the target posterior distribution. The state space grows rapidly with the number of nodes (e.g., there are 29k DAGs over $d = 5$ nodes). For each experiment, we sampled a dataset $\mathcal{D}$ of 100 observations from a randomly generated ground-truth graph $G^\star$; the size of $\mathcal{D}$ was chosen to obtain highly multimodal posteriors. In addition to the (Modified) DB objective introduced by Deleu et al. (2022), we also study the TB (GFlowNet) and the REVERSE KL (HVI) objectives, both on-policy and off-policy. In Table 2, we compare the posterior approximations found using these different objectives in terms of their Jensen-Shannon divergence (JSD) to the target posterior distribution $p(G \mid \mathcal{D})$. We observe that in the easiest setting (graphs over $d = 3$ nodes), all methods accurately approximate the posterior distribution. But as we increase the complexity of the problem (with larger graphs), the accuracy of the approximation found with Off-Policy REVERSE KL degrades significantly, while the ones found with the off-policy GFlowNet objectives ((Modified) DB & TB) remain very accurate. We also note that the performance of On-Policy TB and On-Policy REVERSE KL degrades too, but not as significantly; furthermore, both of these methods achieve similar performance across all experimental settings, confirming our Observation 1 and the connection highlighted in §2.2. The consistent behavior of the off-policy GFlowNet objectives compared to the on-policy objectives (TB & REVERSE KL) as the problem increases in complexity (i.e., as the number of nodes $d$ increases, requiring better exploration) also supports our Observation 2. These observations are further confirmed when comparing the edge marginals $p(X_i \to X_j \mid \mathcal{D})$ in Fig. D.3 (§D.3), computed either with the target posterior distribution or with the posterior approximations.
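The comparison in Table 2 relies on the Jensen-Shannon divergence between two distributions over the same finite set of DAGs. The following is a minimal sketch of that computation, not the evaluation code used for the experiments. It assumes both distributions are available as dictionaries keyed by a hashable graph identifier and that natural logarithms are used; the function name and the toy posteriors are illustrative only.

```python
import math


def jensen_shannon_divergence(p, q):
    """JSD between two discrete distributions given as {outcome: probability}.

    Natural logarithms are assumed, so the value lies in [0, log 2].
    """
    support = set(p) | set(q)
    total = 0.0
    for outcome in support:
        p_x = p.get(outcome, 0.0)
        q_x = q.get(outcome, 0.0)
        m_x = 0.5 * (p_x + q_x)  # mixture M = (P + Q) / 2
        # Terms with zero probability contribute nothing (0 log 0 = 0).
        if p_x > 0.0:
            total += 0.5 * p_x * math.log(p_x / m_x)
        if q_x > 0.0:
            total += 0.5 * q_x * math.log(q_x / m_x)
    return total


# Hypothetical posteriors over three graphs labelled "G1", "G2", "G3".
target = {"G1": 0.7, "G2": 0.2, "G3": 0.1}
approx = {"G1": 0.6, "G2": 0.3, "G3": 0.1}
jsd_value = jensen_shannon_divergence(target, approx)
```

Because the divergence is bounded and symmetric, values for different objectives and graph sizes can be compared directly.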

5. DISCUSSION AND CONCLUSIONS

The theory and experiments in this paper place GFlowNets, which had been introduced and motivated as a reinforcement learning method, in the family of variational methods. They suggest that off-policy GFlowNet objectives may be an advantageous replacement for previous VI objectives, especially when the target distribution is highly multimodal, striking an interesting balance between the mode-seeking (REVERSE KL) and mean-seeking (FORWARD KL) VI variants. This work should prompt more research on how best to choose the behavior policy in off-policy GFlowNet training, seen as a means to efficiently explore and discover modes. Whereas the experiments performed here focused on the realm of discrete variables, future work should also investigate GFlowNets for continuous action spaces as potential alternatives to VI in continuous-variable domains. We make some first steps in this direction in the Appendix (§F). While this paper was under review, Lahlou et al. (2023) introduced theory for continuous GFlowNets and showed that some of our claims extend to continuous domains.

B PROOFS

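We prove Prop. 1.

Proof. For a complete trajectory $\tau \in \mathcal{T}$, let $c(\tau) = \log \frac{P_F(\tau)}{R(x_\tau) P_B(\tau \mid x_\tau)}$. We have the following:

$$\nabla_\theta c(\tau) = \nabla_\theta \log P_F(\tau), \tag{13}$$

$$\nabla_\phi c(\tau) = -\nabla_\phi \log P_B(\tau \mid x_\tau) = -\nabla_\phi \log P_B(\tau). \tag{14}$$

Denoting by $f_1 : t \mapsto t \log t$ and $f_2 : t \mapsto -\log t$, which correspond to the forward and reverse KL divergences respectively, recalling that $P_B(\tau) = R(x_\tau) P_B(\tau \mid x_\tau) / \hat{Z}$, and starting from

$$\mathcal{L}_{\mathrm{HVI}, f_2}(P_F, P_B) = D_{\mathrm{KL}}(P_F \,\|\, P_B) = \mathbb{E}_{\tau \sim P_F}\left[\log \frac{P_F(\tau)}{P_B(\tau)}\right] = \mathbb{E}_{\tau \sim P_F}[c(\tau)] + \log \hat{Z},$$

$$\mathcal{L}_{\mathrm{HVI}, f_1}(P_F, P_B) = D_{\mathrm{KL}}(P_B \,\|\, P_F) = \mathbb{E}_{\tau \sim P_B}\left[\log \frac{P_B(\tau)}{P_F(\tau)}\right] = -\mathbb{E}_{\tau \sim P_B}[c(\tau)] - \log \hat{Z},$$

we obtain:

$$\nabla_\theta \mathcal{L}_{\mathrm{HVI}, f_2}(P_F, P_B) = \nabla_\theta \mathbb{E}_{\tau \sim P_F}[c(\tau)] = \mathbb{E}_{\tau \sim P_F}\left[\nabla_\theta \log P_F(\tau)\, c(\tau) + \nabla_\theta c(\tau)\right],$$

$$\nabla_\phi \mathcal{L}_{\mathrm{HVI}, f_1}(P_F, P_B) = -\nabla_\phi \mathbb{E}_{\tau \sim P_B}[c(\tau)] = -\mathbb{E}_{\tau \sim P_B}\left[\nabla_\phi \log P_B(\tau)\, c(\tau) + \nabla_\phi c(\tau)\right].$$

From (13), we obtain:

$$\mathbb{E}_{\tau \sim P_F}[\nabla_\theta c(\tau)] = \mathbb{E}_{\tau \sim P_F}[\nabla_\theta \log P_F(\tau)] = \sum_{\tau \in \mathcal{T}} P_F(\tau) \nabla_\theta \log P_F(\tau) = \sum_{\tau \in \mathcal{T}} \nabla_\theta P_F(\tau) = \nabla_\theta 1 = 0.$$

Hence, for any scalar $Z > 0$, we can write:

$$\mathbb{E}_{\tau \sim P_F}[\nabla_\theta c(\tau)] = 0 = \mathbb{E}_{\tau \sim P_F}[\nabla_\theta \log P_F(\tau) \log Z],$$

and similarly, from (14),

$$\mathbb{E}_{\tau \sim P_B}[\nabla_\phi c(\tau)] = 0 = \mathbb{E}_{\tau \sim P_B}[\nabla_\phi \log P_B(\tau) \log Z].$$

Plugging these two equalities back into the HVI gradients above, we obtain:

$$\nabla_\theta \mathcal{L}_{\mathrm{HVI}, f_2}(P_F, P_B) = \mathbb{E}_{\tau \sim P_F}\left[\nabla_\theta \log P_F(\tau) \log \frac{Z P_F(\tau)}{R(x_\tau) P_B(\tau \mid x_\tau)}\right],$$

$$\nabla_\phi \mathcal{L}_{\mathrm{HVI}, f_1}(P_F, P_B) = -\mathbb{E}_{\tau \sim P_B}\left[\nabla_\phi \log P_B(\tau) \log \frac{Z P_F(\tau)}{R(x_\tau) P_B(\tau \mid x_\tau)}\right].$$

The last two equalities hold for any scalar $Z$ that depends neither on the parameters of $P_F$ and $P_B$ nor on any trajectory. In particular, the equations hold for the parameter $Z$ of the Trajectory Balance objective. It thus follows that:

$$\nabla_\theta \mathcal{L}_{\mathrm{HVI}, f_2}(P_F, P_B) = \frac{1}{2} \mathbb{E}_{\tau \sim P_F}\left[\nabla_\theta \left(\log \frac{Z P_F(\tau)}{R(x_\tau) P_B(\tau \mid x_\tau)}\right)^2\right] = \frac{1}{2} \mathbb{E}_{\tau \sim P_F}\left[\nabla_\theta \mathcal{L}_{\mathrm{TB}}(\tau; P_F, P_B, Z)\right],$$

$$\nabla_\phi \mathcal{L}_{\mathrm{HVI}, f_1}(P_F, P_B) = \frac{1}{2} \mathbb{E}_{\tau \sim P_B}\left[\nabla_\phi \left(\log \frac{Z P_F(\tau)}{R(x_\tau) P_B(\tau \mid x_\tau)}\right)^2\right] = \frac{1}{2} \mathbb{E}_{\tau \sim P_B}\left[\nabla_\phi \mathcal{L}_{\mathrm{TB}}(\tau; P_F, P_B, Z)\right].$$

As an immediate corollary, we obtain that the expected on-policy TB gradient does not depend on the estimated partition function $Z$.

Next, we will prove the identity (9), which we restate here:

$$\mathbb{E}_{\tau \sim P_F}[\nabla_\phi \mathcal{L}_{\mathrm{TB}}(\tau)] = \nabla_\phi \left[D_{\log^2}(P_B \,\|\, P_F) + 2(\log Z - \log \hat{Z})\, D_{\mathrm{KL}}(P_F \,\|\, P_B)\right]. \tag{15}$$

Proof. The RHS of (15) equals

$$\nabla_\phi \mathbb{E}_{\tau \sim P_F}\left[\left(\log \frac{P_B(\tau \mid x_\tau) R(x_\tau)}{\hat{Z} P_F(\tau)}\right)^2 + 2(\log Z - \log \hat{Z}) \log \frac{P_F(\tau) \hat{Z}}{P_B(\tau \mid x_\tau) R(x_\tau)}\right]$$

$$= \mathbb{E}_{\tau \sim P_F}\left[\nabla_\phi \left(\left(\log \frac{P_B(\tau \mid x_\tau) R(x_\tau)}{\hat{Z} P_F(\tau)}\right)^2 + 2(\log Z - \log \hat{Z}) \log \frac{P_F(\tau) \hat{Z}}{P_B(\tau \mid x_\tau) R(x_\tau)}\right)\right]$$

$$= \mathbb{E}_{\tau \sim P_F}\left[2 \nabla_\phi \log P_B(\tau \mid x_\tau) \log \frac{P_B(\tau \mid x_\tau) R(x_\tau)}{\hat{Z} P_F(\tau)} - 2(\log Z - \log \hat{Z}) \nabla_\phi \log P_B(\tau \mid x_\tau)\right]$$

$$= 2\, \mathbb{E}_{\tau \sim P_F}\left[\nabla_\phi \log P_B(\tau \mid x_\tau) \log \frac{P_B(\tau \mid x_\tau) R(x_\tau)}{Z P_F(\tau)}\right]$$

$$= \mathbb{E}_{\tau \sim P_F}[\nabla_\phi \mathcal{L}_{\mathrm{TB}}(\tau)].$$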
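The first identity of Prop. 1 can be checked numerically on a toy problem. The sketch below is an illustration under simplifying assumptions, not part of the proof: the forward policy is taken to be a single softmax over four complete trajectories (so $\theta$ is a vector of four logits), the rewards and backward probabilities are made-up numbers, and the reverse KL gradient is approximated by central finite differences. All names are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up toy problem: trajectories 0 and 1 end in one terminating state,
# trajectories 2 and 3 in another.
REWARD = [1.0, 1.0, 3.0, 3.0]          # R(x_tau) for each trajectory
P_B_GIVEN_X = [0.25, 0.75, 0.5, 0.5]   # P_B(tau | x_tau), sums to 1 per terminal state


def reverse_kl(theta):
    """D_KL(P_F || P_B) with P_B(tau) = R(x_tau) P_B(tau | x_tau) / Z_hat."""
    p = softmax(theta)
    unnorm = [r * b for r, b in zip(REWARD, P_B_GIVEN_X)]
    z_hat = sum(unnorm)
    return sum(pi * math.log(pi * z_hat / u) for pi, u in zip(p, unnorm))


def expected_tb_grad(theta, log_z):
    """E_{tau ~ P_F}[grad_theta L_TB(tau)], with no gradient through sampling."""
    p = softmax(theta)
    n = len(theta)
    grad = [0.0] * n
    for t in range(n):
        # delta = log(Z P_F(tau) / (R(x_tau) P_B(tau | x_tau)))
        delta = log_z + math.log(p[t]) - math.log(REWARD[t] * P_B_GIVEN_X[t])
        for j in range(n):
            dlogp = (1.0 if j == t else 0.0) - p[j]  # d log p_t / d theta_j
            grad[j] += p[t] * 2.0 * delta * dlogp
    return grad


def numeric_grad(fn, theta, eps=1e-6):
    """Central finite-difference gradient of fn at theta."""
    grad = []
    for j in range(len(theta)):
        up = list(theta)
        up[j] += eps
        down = list(theta)
        down[j] -= eps
        grad.append((fn(up) - fn(down)) / (2.0 * eps))
    return grad


theta = [0.3, -1.2, 0.7, 0.1]
kl_grad = numeric_grad(reverse_kl, theta)
tb_grad_a = expected_tb_grad(theta, log_z=0.0)
tb_grad_b = expected_tb_grad(theta, log_z=2.5)
max_gap = max(abs(g - 2.0 * k) for g, k in zip(tb_grad_a, kl_grad))
```

Under these assumptions, `tb_grad_a` and `tb_grad_b` agree with each other, illustrating the corollary that the expected on-policy TB gradient does not depend on $Z$, and both agree with twice `kl_grad` up to finite-difference error.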

C A VARIATIONAL OBJECTIVE FOR SUBTRAJECTORIES

In this section, we extend the claim made in Prop. 1 to connect alternative GFlowNet losses to other variational objectives. Prop. 1 is thus a special case of Prop. 2, which provides an alternative proof of Prop. 1.

The detailed balance objective (DB): The loss proposed in Bengio et al. (2021b) parametrizes a GFlowNet using its forward and backward policies $P_F$ and $P_B$, along with a state flow function $F$, a positive function of the states that matches the target reward function on the terminating states. It decomposes as a sum of transition-dependent losses:

$$\forall s \to s' \in \mathcal{A} \quad \mathcal{L}_{\mathrm{DB}}(s \to s'; P_F, P_B, F) = \left(\log \frac{F(s) P_F(s' \mid s)}{F(s') P_B(s \mid s')}\right)^2, \quad \text{where } F(s') = R(s') \text{ if } s' \in \mathcal{X}. \tag{16}$$

The subtrajectory balance objective (SubTB): Both the DB and TB objectives can be seen as special instances of the subtrajectory balance objective (Malkin et al., 2022; Madan et al., 2022). Malkin et al. (2022) suggested that, instead of defining the state flow function $F$ for every state $s$, a state flow function could be defined on a subset of the state space $\mathcal{S}$, called the hub states. The loss can be decomposed into a sum of subtrajectory-dependent losses:

$$\forall \tau = (s_1, \dots, s_n) \in \mathcal{T}^{\mathrm{partial}} \quad \mathcal{L}_{\mathrm{SubTB}}(\tau; P_F, P_B, F) = \left(\log \frac{F(s_1) P_F(\tau)}{F(s_n) P_B(\tau \mid s_n)}\right)^2, \tag{17}$$

where $P_F(\tau)$ is defined for partial trajectories similarly to complete trajectories (2), $P_B(\tau \mid s_n) = \prod_{(s \to s') \in \tau} P_B(s \mid s')$, and we again fix $F(x) = R(x)$ for terminating states $x \in \mathcal{X}$. The SubTB objective reduces to the DB objective for subtrajectories of length 1 and to the TB objective for complete trajectories, in which case we use $Z$ to denote $F(s_0)$.

A variational objective for transitions: From now on, we work with a graded DAG $\mathcal{G} = (\mathcal{S}, \mathcal{A})$, in which the state space $\mathcal{S}$ is decomposed into layers: $\mathcal{S} = \bigcup_{l=0}^{L} \mathcal{S}_l$, with $\mathcal{S}_0 = \{s_0\}$ and $\mathcal{S}_L = \mathcal{X}$. HVI provides a class of algorithms to learn forward and backward policies on $\mathcal{G}$. Rather than learning these policies ($P_F$ and $P_B$) using a variational objective requiring distributions over complete trajectories, nested variational inference (NVI; Zimmermann et al., 2021), which combines nested importance sampling and variational inference, defines an objective dealing with distributions over transitions, or edges. To this end, it makes use of positive functions $F_k$ of the states $s_k \in \mathcal{S}_k$, for $k = 0, \dots, L-1$, to define two sets of distributions $\hat{p}_k$ and $\check{p}_k$ over edges from $\mathcal{S}_k$ to $\mathcal{S}_{k+1}$:

$$\hat{p}_k(s_k \to s_{k+1}) \propto F_k(s_k) P_F(s_{k+1} \mid s_k), \qquad \check{p}_k(s_k \to s_{k+1}) \propto \begin{cases} R(s_L) P_B(s_k \mid s_L) & k = L-1 \\ F_{k+1}(s_{k+1}) P_B(s_k \mid s_{k+1}) & \text{otherwise.} \end{cases} \tag{18}$$

Learning the policies $P_F$, $P_B$ and the functions $F_k$ is done by minimizing losses of the form:

$$\mathcal{L}_{\mathrm{NVI}}(P_F, P_B, F) = \sum_{k=0}^{L-1} D_f(\check{p}_k \,\|\, \hat{p}_k). \tag{19}$$

The positive function $F_k$ plays the same role as the state flow function in GFlowNets (in the DB objective in particular). Before drawing the links between DB and NVI, we first propose a natural extension of NVI to subtrajectories.
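The relationship between the DB, SubTB, and TB losses defined above can be made concrete in a few lines. The following is a minimal sketch in log space, assuming the per-step log-probabilities of a (sub)trajectory are already available as lists of floats; the function names and the two-terminal toy flow are illustrative assumptions, not an existing implementation.

```python
import math


def subtb_loss(log_flow_start, log_pf_steps, log_pb_steps, log_flow_end):
    """Squared log-ratio (17) for a subtrajectory s_1 -> ... -> s_n.

    log_pf_steps[i] is log P_F(s_{i+2} | s_{i+1}) and log_pb_steps[i] is
    log P_B(s_{i+1} | s_{i+2}) for the i-th transition.
    """
    log_ratio = (log_flow_start + sum(log_pf_steps)
                 - log_flow_end - sum(log_pb_steps))
    return log_ratio ** 2


def db_loss(log_flow_s, log_pf, log_pb, log_flow_next):
    """DB (16) is SubTB on a single transition."""
    return subtb_loss(log_flow_s, [log_pf], [log_pb], log_flow_next)


def tb_loss(log_z, log_pf_steps, log_pb_steps, log_reward):
    """TB is SubTB on a complete trajectory, with F(s_0) = Z and F(x) = R(x)."""
    return subtb_loss(log_z, log_pf_steps, log_pb_steps, log_reward)


# Toy flow (made up): s0 -> a -> x1 with R(x1) = 1 and s0 -> b -> x2 with
# R(x2) = 3, so Z = 4, F(a) = 1, P_F(a | s0) = 0.25, P_F(b | s0) = 0.75,
# and every other forward or backward probability equals 1.
log_z = math.log(4.0)
db_value = db_loss(log_z, math.log(0.25), 0.0, math.log(1.0))
tb_value = tb_loss(log_z, [math.log(0.75), 0.0], [0.0, 0.0], math.log(3.0))
tb_wrong_z = tb_loss(math.log(8.0), [math.log(0.75), 0.0], [0.0, 0.0], math.log(3.0))
```

With balanced flows both `db_value` and `tb_value` vanish, while an incorrect partition function estimate leaves a residual of $(\log 2)^2$ in `tb_wrong_z`.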

C.1 A VARIATIONAL OBJECTIVE FOR SUBTRAJECTORIES

Consider a graded DAG $\mathcal{G} = (\mathcal{S}, \mathcal{A})$ where $\mathcal{S} = \bigcup_{l=0}^{L} \mathcal{S}_l$, $\mathcal{S}_0 = \{s_0\}$, $\mathcal{S}_L = \mathcal{X}$. Amongst the $L + 1$ layers $l = 0, \dots, L$, we consider $K + 1 \le L + 1$ special layers, which we call junction layers, whose states are called hub states. We denote by $m_0, \dots, m_K$ the indices of these layers, and we constrain $m_0 = 0$ to represent the layer comprising the source state only, and $m_K = L$ to represent the terminating states $\mathcal{X}$. On each non-terminating junction layer $m_k \ne L$, we define a state flow function $F_k : \mathcal{S}_{m_k} \to \mathbb{R}_+^*$. Given any forward and backward policies $P_F$ and $P_B$ consistent with the DAG $\mathcal{G}$, the state flow functions define two sets of distributions $\hat{p}_k$ and $\check{p}_k$ over partial trajectories starting from a state $s_{m_k} \in \mathcal{S}_{m_k}$ and ending in a state $s_{m_{k+1}} \in \mathcal{S}_{m_{k+1}}$ (we denote by $\mathcal{T}_k$ the set of these partial trajectories, for $k = 0, \dots, K-1$):

$$\forall \tau_k = (s_{m_k} \to \dots \to s_{m_{k+1}}) \in \mathcal{T}_k \quad \hat{p}_k(\tau_k) \propto F_k(s_{m_k}) P_F(\tau_k), \tag{20}$$

$$\forall \tau_k = (s_{m_k} \to \dots \to s_{m_{k+1}}) \in \mathcal{T}_k \quad \check{p}_k(\tau_k) \propto F_{k+1}(s_{m_{k+1}}) P_B(\tau_k \mid s_{m_{k+1}}), \tag{21}$$

where $F_K$ is fixed to the target reward function $R$.

Lemma 1 If $\hat{p}_k = \check{p}_k$ for all $k = 0, \dots, K-1$, then the forward policy $P_F$ induces a terminating state distribution $P_F^\top$ that matches the target unnormalized distribution (or reward function) $R$.

Proof. Consider a complete trajectory $\tau = (s_{m_0} \to \dots \to s_{m_1} \to \dots \to s_{m_2} \to \dots \to s_{m_K})$, and let $\tau_k = (s_{m_k} \to \dots \to s_{m_{k+1}})$ for every $k < K$. Denote by $\hat{Z}_k$ and $\check{Z}_k$ the partition functions (constants of proportionality in (20) and (21)) of $\hat{p}_k$ and $\check{p}_k$ respectively, for every $k < K$. It is straightforward to see that for every $0 \le k < K-1$:

$$\hat{Z}_{k+1} = \check{Z}_k = \sum_{s_{m_{k+1}} \in \mathcal{S}_{m_{k+1}}} F_{k+1}(s_{m_{k+1}}) \tag{22}$$

(the second equality also holds for $k = K-1$, giving $\check{Z}_{K-1} = \sum_{x \in \mathcal{X}} R(x)$, and likewise $\hat{Z}_0 = F_0(s_0)$). Moreover,

$$\prod_{k=0}^{K-1} \hat{p}_k(\tau_k) = \frac{\prod_{k=0}^{K-1} F_k(s_{m_k})}{\prod_{k=0}^{K-1} \hat{Z}_k} P_F(\tau), \tag{23}$$

$$\prod_{k=0}^{K-1} \check{p}_k(\tau_k) = \frac{\prod_{k=0}^{K-1} F_{k+1}(s_{m_{k+1}})}{\prod_{k=0}^{K-1} \check{Z}_k} P_B(\tau \mid s_{m_K}). \tag{24}$$

Because $\hat{p}_k = \check{p}_k$ for all $k = 0, \dots, K-1$, the right-hand sides of (23) and (24) are equal. Combining this with (22), we obtain:

$$\forall \tau \in \mathcal{T} \quad \underbrace{\frac{F_0(s_0)}{\hat{Z}_0}}_{=1} P_F(\tau) = \frac{R(x_\tau)}{\sum_{x \in \mathcal{X}} R(x)} P_B(\tau \mid x_\tau), \tag{25}$$

which implies the TB constraint is satisfied for all $\tau \in \mathcal{T}$. Malkin et al. (2022) show that this is a sufficient condition for the terminating state distribution induced by $P_F$ to match the target reward function $R$, which completes the proof.

Similar to NVI, we can use Lemma 1 to define objective functions for $P_F$, $P_B$, $F_k$, of the form:

$$\mathcal{L}_{\mathrm{SubNVI}, f}(P_F, P_B, F_{0:K-1}) = \sum_{k=0}^{K-1} D_f(\check{p}_k \,\|\, \hat{p}_k). \tag{26}$$

Note that the SubNVI objective of (26) matches the NVI objective (Zimmermann et al., 2021) when all layers are junction layers (i.e., $K = L$ and $m_k = k$ for all $k \le L$), and matches the HVI objective of (5) when only the first and last layers are junction layers (i.e., $K = 1$, $m_0 = 0$, and $m_1 = L$).

C.2 AN EQUIVALENCE BETWEEN THE SUBNVI AND THE SUBTB OBJECTIVES

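Proposition 2 Consider a graded DAG $\mathcal{G}$ as in §2.1, with junction layers $m_0 = 0, m_1, \dots, m_K = L$ as in §C.1. For any forward and backward policies, and for any positive functions $F_k$ defined on the hubs, consider $\hat{p}_k$ and $\check{p}_k$ defined in (20) and (21). The subtrajectory variational objectives of (26) are equivalent to the subtrajectory balance objective (17) for specific choices of the $f$-divergences. Namely, denoting by $\theta, \phi$ the parameters of $P_F, P_B$ respectively:

$$\mathbb{E}_{\tau_k \sim \check{p}_k}[\nabla_\phi \mathcal{L}_{\mathrm{SubTB}}(\tau_k; P_F, P_B, F)] = 2 \nabla_\phi D_{f_1}(\check{p}_k \,\|\, \hat{p}_k), \tag{27}$$

$$\mathbb{E}_{\tau_k \sim \hat{p}_k}[\nabla_\theta \mathcal{L}_{\mathrm{SubTB}}(\tau_k; P_F, P_B, F)] = 2 \nabla_\theta D_{f_2}(\check{p}_k \,\|\, \hat{p}_k), \tag{28}$$

where $F = F_{0:K-1}$, $f_1 : t \mapsto t \log t$, and $f_2 : t \mapsto -\log t$.

Proof. For a subtrajectory $\tau_k = (s_{m_k} \to \dots \to s_{m_{k+1}}) \in \mathcal{T}_k$, let $c(\tau_k) = \log \frac{F_k(s_{m_k}) P_F(\tau_k)}{F_{k+1}(s_{m_{k+1}}) P_B(\tau_k \mid s_{m_{k+1}})}$. First, note that because $\hat{Z}_k$ and $\check{Z}_k$ are not functions of $\phi, \theta$ (see (22)):

$$\nabla_\phi c(\tau_k) = -\nabla_\phi \log \frac{F_{k+1}(s_{m_{k+1}}) P_B(\tau_k \mid s_{m_{k+1}})}{\check{Z}_k} = -\nabla_\phi \log \check{p}_k(\tau_k), \tag{29}$$

$$\nabla_\theta c(\tau_k) = \nabla_\theta \log \frac{F_k(s_{m_k}) P_F(\tau_k)}{\hat{Z}_k} = \nabla_\theta \log \hat{p}_k(\tau_k). \tag{30}$$

We will prove (27). The proof of (28) follows the same reasoning, and is left as an exercise for the reader. Since $D_{f_1}(\check{p}_k \,\|\, \hat{p}_k) = D_{\mathrm{KL}}(\check{p}_k \,\|\, \hat{p}_k)$, we have:

$$\begin{aligned}
\nabla_\phi D_{f_1}(\check{p}_k \,\|\, \hat{p}_k) &= \nabla_\phi \sum_{\tau_k \in \mathcal{T}_k} \check{p}_k(\tau_k) \log \frac{\check{p}_k(\tau_k)}{\hat{p}_k(\tau_k)} \\
&= -\nabla_\phi \sum_{\tau_k \in \mathcal{T}_k} \check{p}_k(\tau_k) c(\tau_k) + \underbrace{\nabla_\phi \log \frac{\hat{Z}_k}{\check{Z}_k}}_{=0, \text{ according to (22)}} \\
&= -\sum_{\tau_k \in \mathcal{T}_k} \left(\nabla_\phi \check{p}_k(\tau_k)\, c(\tau_k) + \check{p}_k(\tau_k) \nabla_\phi c(\tau_k)\right) \\
&= -\sum_{\tau_k \in \mathcal{T}_k} \left(\check{p}_k(\tau_k) \nabla_\phi \log \check{p}_k(\tau_k)\, c(\tau_k) + \check{p}_k(\tau_k) \nabla_\phi c(\tau_k)\right) \\
&= -\mathbb{E}_{\tau_k \sim \check{p}_k}[\nabla_\phi \log \check{p}_k(\tau_k)\, c(\tau_k)] + \sum_{\tau_k \in \mathcal{T}_k} \check{p}_k(\tau_k) \nabla_\phi \log \check{p}_k(\tau_k) \quad \text{following (29)} \\
&= -\mathbb{E}_{\tau_k \sim \check{p}_k}[\nabla_\phi \log P_B(\tau_k \mid s_{m_{k+1}})\, c(\tau_k)] + \underbrace{\nabla_\phi \sum_{\tau_k \in \mathcal{T}_k} \check{p}_k(\tau_k)}_{=0} \\
&= \mathbb{E}_{\tau_k \sim \check{p}_k}\left[\nabla_\phi \log P_B(\tau_k \mid s_{m_{k+1}}) \log \frac{F_{k+1}(s_{m_{k+1}}) P_B(\tau_k \mid s_{m_{k+1}})}{F_k(s_{m_k}) P_F(\tau_k)}\right] \\
&= \frac{1}{2} \mathbb{E}_{\tau_k \sim \check{p}_k}\left[\nabla_\phi \left(\log \frac{F_k(s_{m_k}) P_F(\tau_k)}{F_{k+1}(s_{m_{k+1}}) P_B(\tau_k \mid s_{m_{k+1}})}\right)^2\right] \\
&= \frac{1}{2} \mathbb{E}_{\tau_k \sim \check{p}_k}[\nabla_\phi \mathcal{L}_{\mathrm{SubTB}}(\tau_k; P_F, P_B, F)].
\end{aligned}$$

As a special case of Prop. 2, when the state flow function is defined for $s_0$ only (and for the terminating states, at which it equals the target reward function), i.e., when $K = 1$, the distributions $\hat{p}_0(\tau)$ and $P_F(\tau)$ are equal, and so are the distributions $\check{p}_0(\tau)$ and $P_B(\tau)$. We thus obtain the first two equations of Prop. 1 as a consequence of Prop. 2.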

D ADDITIONAL EXPERIMENTAL DETAILS

D.1 HYPERGRID EXPERIMENTS

Details about the environment For completeness, we provide more details about the environment, as explained in Malkin et al. (2022). In a $D$-dimensional hypergrid of side length $H$, the state space $\mathcal{S}$ is partitioned into the non-terminating states $\mathcal{S}^o = \{0, \dots, H-1\}^D$ and terminating states $\mathcal{X} = \mathcal{S}^\top = \{0, \dots, H-1\}^D$. The initial state is $\mathbf{0}_{\mathbb{R}^D} = (0, \dots, 0) \in \mathcal{S}^o$, and in addition to the transitions from a non-terminating state to another (by incrementing one coordinate of the state), an "exit" action is available for all $s \in \mathcal{S}^o$, which leads to a terminating state $s^\top \in \mathcal{S}^\top$. The reward at a terminating state $s^\top = (s^1, \dots, s^D)^\top$ is:

$$R(s^\top) = R_0 + 0.5 \prod_{d=1}^{D} \mathbb{1}\left[\left|\frac{s^d}{H-1} - 0.5\right| \in (0.25, 0.5]\right] + 2 \prod_{d=1}^{D} \mathbb{1}\left[\left|\frac{s^d}{H-1} - 0.5\right| \in (0.3, 0.4)\right], \tag{31}$$

where $R_0$ is an exploration parameter (lower values indicate harder exploration).

Architectural details The forward and backward policies are parametrized as neural networks with 2 hidden layers of 256 units each. The neural networks take as input a one-hot representation of a state (also called a K-hot or multi-hot representation), which is an $H \times D$ vector including exactly $D$ ones and $(H-1)D$ zeros, and output the logits of $P_F$ and $P_B$ respectively. Forbidden actions (e.g., when a coordinate is already maxed out at $H-1$) are masked out by setting the corresponding logits to $-\infty$ after the forward pass. Unlike Malkin et al. (2022), we do not tie the parameters of $P_F$ and $P_B$.

Behavior policy The behavior policy is obtained from the forward policy $P_F$ by subtracting a scalar $\epsilon$ from the logits output by the forward policy neural network. The value of $\epsilon$ is decayed from $\epsilon_{init}$ to 0 following a cosine annealing schedule (Loshchilov & Hutter, 2017), and the value $\epsilon = 0$ is reached at an iteration $T_{max}$. The values of $\epsilon_{init}$ and $T_{max}$ were treated as hyperparameters.
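The hypergrid reward of (31) is simple to evaluate directly. The sketch below is an illustrative transcription of that formula, assuming states are given as tuples of integer coordinates in $\{0, \dots, H-1\}$; the function and argument names are hypothetical.

```python
def hypergrid_reward(state, side_length, r0):
    """Reward (31) for a terminating state given as a tuple of D integers."""
    in_outer_band = True   # every coordinate has |s/(H-1) - 0.5| in (0.25, 0.5]
    in_inner_band = True   # every coordinate has |s/(H-1) - 0.5| in (0.3, 0.4)
    for coordinate in state:
        distance = abs(coordinate / (side_length - 1) - 0.5)
        in_outer_band = in_outer_band and (0.25 < distance <= 0.5)
        in_inner_band = in_inner_band and (0.3 < distance < 0.4)
    reward = r0
    if in_outer_band:
        reward += 0.5
    if in_inner_band:
        reward += 2.0
    return reward


# Example values on the 128 x 128 grid with R_0 = 1e-3.
corner = hypergrid_reward((0, 0), 128, 1e-3)      # outer band only
centre = hypergrid_reward((64, 64), 128, 1e-3)    # neither band
mode = hypergrid_reward((19, 19), 128, 1e-3)      # both bands
```

Because both indicator terms are products over dimensions, a state earns the bonus only if every coordinate lies in the band, which is what concentrates the high-reward regions near the corners of the grid.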
Hyperparameter optimization Our experiments have shown that HVI objectives were brittle to the choice of hyperparameters (mainly learning rates), and that the ones used for Trajectory Balance in Malkin et al. (2022) do not perform as well in the larger $128 \times 128$ grid we considered. To obtain a fair comparison between GFlowNets and HVI methods, particular care was given to the optimization of hyperparameters in this domain. The optimization was performed in two stages:

1. We use a batch size of 64 for all learning objectives, whether on-policy or off-policy, and the Adam optimizer with secondary parameters set to their default values, for the parameters of $P_F$, the parameters of $P_B$, and $\log Z$ (which is initialized at 0). The learning rates of $P_F$, $P_B$, $\log Z$, along with a schedule factor $\gamma < 1$ by which they are multiplied when the JSD plateaus for more than 500 iterations (i.e., $500 \times 64$ trajectories sampled), were searched for separately for each combination of learning objective and sampling method (on-policy or off-policy), using a Bayesian search with the JSD evaluated at 200k trajectories as an optimization target. The choice of the baseline for HVI methods (except WS, which does not have a score function estimator of the gradient) was treated as a hyperparameter as well.
2. All objectives were then trained for $10^6$ trajectories using all the combinations of hyperparameters found in the first stage, for 5 seeds each. The final set of hyperparameters for each objective and sampling mode was then chosen as the one that leads to the lowest area under the JSD curve (approximated with the trapezoidal rule).

For off-policy runs, $T_{max}$ was defined as a fraction $1/n$ of the total number of iterations (which is equal to $10^6/64$). The values of $n$ and $\epsilon_{init}$ were optimized the same way as the learning rate and the schedule, as described above. In Fig. D.1, we illustrate the differences between the two types of baselines considered (global and local) for the 3 algorithms that use a score function estimator of the gradient, both on-policy and off-policy.

Smaller environments: The environment studied in the main body of text ($128 \times 128$, with $R_0 = 10^{-3}$) already illustrates some key differences between the Forward and Reverse KL objectives. As a sanity check for the HVI methods that failed to converge in this challenging environment, we consider two alternative grids, $64 \times 64$ and $8 \times 8 \times 8 \times 8$, both with an easier exploration parameter ($R_0 = 0.1$), and compare the 5 algorithms on-policy on these two extra domains. Additionally, for the two-dimensional domain ($64 \times 64$), we show in Fig. D.2 a visual representation of the average distribution obtained after sampling $10^6$ trajectories, for each method separately. Interestingly, unlike the hard exploration domain, the two algorithms with the mode-seeking KL (REVERSE KL and REVERSE WS) converge to a lower JSD than the mean-seeking KL algorithms (FORWARD KL and WS), and are on par with TB.
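The decay of the exploration parameter $\epsilon$ used by the off-policy behavior policy in this section can be sketched as follows. This is an assumption-laden illustration: it uses the standard cosine annealing form with a minimum value of 0, and the value $n = 4$ is a made-up example, since the tuned values of $n$ and $\epsilon_{init}$ are not stated here.

```python
import math


def epsilon_schedule(iteration, eps_init, t_max):
    """Cosine decay from eps_init at iteration 0 to 0 at iteration t_max."""
    if iteration >= t_max:
        return 0.0
    return 0.5 * eps_init * (1.0 + math.cos(math.pi * iteration / t_max))


total_iterations = 10**6 // 64      # 10^6 trajectories with batch size 64
n = 4                               # hypothetical fraction denominator
t_max = total_iterations // n       # exploration ends after 1/n of training
schedule = [epsilon_schedule(i, 1.0, t_max) for i in range(0, total_iterations, 500)]
```

After iteration `t_max` the behavior policy coincides with $P_F$, so the remainder of training is effectively on-policy.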

D.2 MOLECULE EXPERIMENTS

Most experiment settings were identical to those of Malkin et al. (2022), in particular the reward model $f$, the held-out set of molecules used to compute the performance metric, the GFlowNet model architecture (a graph neural network introduced by Bengio et al. (2021a)), and the off-policy exploration rate. All models were trained with the Adam optimizer and batch size 4 for a maximum of 50000 batches. The metric was computed after every 5000 batches and the last computed value of the metric was reported, which was sometimes not the value after 50000 batches when the training runs terminated early because of numerical errors.

D.3 BAYESIAN STRUCTURE LEARNING EXPERIMENTS

Bayesian Networks A Bayesian Network is a probabilistic model where the joint distribution over $d$ random variables $\{X_1, \dots, X_d\}$ factorizes according to a directed acyclic graph (DAG) $G$:

$$p(X_1, \dots, X_d) = \prod_{i=1}^{d} p(X_i \mid \mathrm{Pa}_G(X_i)),$$

where $\mathrm{Pa}_G(X_i)$ is the set of parent variables of $X_i$ in the graph $G$. Each conditional distribution in the factorization above is also associated with a set of parameters $\theta \in \Theta$. The structure $G$ of the Bayesian Network is often assumed to be known. However, when the structure is unknown, we can learn it based on a dataset of observations $\mathcal{D}$: this is called structure learning.

Structure of the state space We use the same structure of graded DAG $\mathcal{G}$ as the one described in Deleu et al. (2022), where each state of $\mathcal{G}$ is itself a DAG $G$, and where actions correspond to adding one edge to the current graph $G$ to transition to a new graph $G'$. Only the actions maintaining the acyclicity of $G'$ are considered valid; this ensures that all the states are well-defined DAGs, meaning that all the states are terminating here (we define a distribution over DAGs). Similar to the hypergrid environment, the action space also contains an extra action "stop" to terminate the generation process and return the current graph as a sample of our distribution; this "stop" action is denoted $G \to G^\top$, to follow the notation introduced in §2.1.

Reward function Our objective in Bayesian structure learning is to approximate the posterior distribution over DAGs $p(G \mid \mathcal{D})$, given a dataset of observations $\mathcal{D}$. Since our goal is to find a forward policy $P_F$ for which $P_F^\top(G) \propto R(G)$ (see §2.1), we can define the reward function as the joint distribution $R(G) = p(G, \mathcal{D}) = p(\mathcal{D} \mid G)\, p(G)$, where $p(G)$ is a prior over graphs (assumed to be uniform throughout the paper), and $p(\mathcal{D} \mid G)$ is the marginal likelihood. Since the marginal likelihood involves marginalizing over the parameters of the Bayesian Network, $p(\mathcal{D} \mid G) = \int_\Theta p(\mathcal{D} \mid \theta, G)\, p(\theta \mid G)\, d\theta$, it is in general intractable.
We consider here a special class of models, called linear-Gaussian models, where the marginal likelihood can be computed in closed form; for this class of models, the log-marginal likelihood is also called the BGe score (Geiger & Heckerman, 1994; Kuipers et al., 2014) in the structure learning literature. For each experiment, we sampled a dataset $\mathcal{D}$ of 100 samples from a randomly generated Bayesian network. The (ground truth) structure of the Bayesian Network was generated following an Erdős-Rényi model, with about $d$ edges on average (to encourage sparsity on such small graphs with $d \le 5$). Once the structure is known, the parameters of the linear-Gaussian model were sampled randomly from a standard Normal distribution $\mathcal{N}(0, 1)$. See Deleu et al. (2022) for more details about the data generation process. For each setting (different values of $d$) and each objective, we repeated the experiment over 20 different seeds.

Forward policy Deleu et al. (2022) parametrized the forward policy $P_F$ using a linear transformer, taking all the $d^2$ possible edges in the graph $G$ as an input, and returning a probability distribution over those edges, where the invalid actions were masked out. We chose to parametrize $P_F$ using a simpler neural network architecture, based on a graph neural network (Battaglia et al., 2018). The GNN takes the graph $G$ as an input, where each node of the graph is associated with a (learned) embedding, and it returns for each node $X_i$ a pair of embeddings $\boldsymbol{u}_i$ and $\boldsymbol{v}_i$. The probability of adding an edge $X_i \to X_j$ to transition from $G$ to $G'$ (given that we do not terminate in $G$) is then given by $P_F(G' \mid G, \neg G^\top) \propto \exp(\boldsymbol{u}_i^\top \boldsymbol{v}_j)$, assuming that $X_i \to X_j$ is a valid action (i.e., it does not introduce a cycle in $G$), and where the normalization is taken over the valid actions only.
We then use a hierarchical model to obtain the forward policy $P_F(G' \mid G)$, following Deleu et al. (2022): $P_F(G' \mid G) = (1 - P_F(G^\top \mid G))\, P_F(G' \mid G, \neg G^\top)$. Recall that the backward policy $P_B$ is fixed here, as the uniform distribution over the parents of $G$ (i.e., all the graphs where exactly one edge has been removed from $G$).

(Modified) Detailed Balance objective For completeness, we recall here the modified Detailed Balance (DB) objective (Deleu et al., 2022) as a special case of the DB objective (Bengio et al., 2021b; see also (16)) when all the states of $\mathcal{G}$ are terminating (which is the case in our Bayesian structure learning experiments):

$$\mathcal{L}_{\mathrm{DB}}^{(M)}(G \to G'; P_F, P_B) = \left(\log \frac{R(G') P_B(G \mid G') P_F(G^\top \mid G)}{R(G) P_F(G' \mid G) P_F(G'^\top \mid G')}\right)^2.$$

Optimization Following Deleu et al. (2022), we used a replay buffer for all our off-policy objectives ((Modified) DB, TB, and REVERSE KL). All the objectives were optimized using a batch size of 256 graphs sampled either on-policy from $P_F$, or from the replay buffer. We used the Adam optimizer, with the best learning rate found among $\{10^{-6}, 3 \times 10^{-6}, 10^{-5}, 3 \times 10^{-5}, 10^{-4}\}$. For the TB objective, we learned $\log Z$ using SGD with a learning rate of 0.1 and momentum 0.8.

Edge marginals In addition to the Jensen-Shannon divergence (JSD) between the true posterior distribution $p(G \mid \mathcal{D})$ and the posterior approximation $P_F^\top(G)$ (see §E for details about how this divergence is computed), we also compare the edge marginals computed with both distributions. That is, for any edge $X_i \to X_j$ in the graph, we compare

$$p(X_i \to X_j \mid \mathcal{D}) = \sum_{G \,:\, X_i \in \mathrm{Pa}_G(X_j)} p(G \mid \mathcal{D}) \qquad \text{and} \qquad P_F^\top(X_i \to X_j) = \sum_{G \,:\, X_i \in \mathrm{Pa}_G(X_j)} P_F^\top(G).$$

The edge marginal quantifies how likely an edge $X_i \to X_j$ is to be present in the structure of the Bayesian Network, and is of particular interest in the (Bayesian) structure learning literature.
To measure how accurate the posterior approximation 𝑃 ⊤ 𝐹 is for the different objectives considered here, we use the Root Mean Square Error (RMSE) between 𝑝(𝑋 𝑖 → 𝑋 𝑗 | D) and 𝑃 ⊤ 𝐹 (𝑋 𝑖 → 𝑋 𝑗 ), for all possible pairs of nodes (𝑋 𝑖 , 𝑋 𝑗 ) in the graph. 
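The edge-marginal comparison described above can be sketched as follows. This is an illustrative example, not the authors' code. It assumes a distribution over DAGs is stored as a dictionary mapping a `frozenset` of directed edges `(i, j)` to a probability, and that the RMSE is averaged over all ordered pairs of distinct nodes; both choices, and all names, are assumptions.

```python
import math


def edge_marginals(distribution, num_nodes):
    """marginals[i][j] = total probability of graphs containing the edge i -> j."""
    marginals = [[0.0] * num_nodes for _ in range(num_nodes)]
    for edges, probability in distribution.items():
        for (i, j) in edges:
            marginals[i][j] += probability
    return marginals


def edge_marginal_rmse(target, approx, num_nodes):
    """RMSE between two sets of edge marginals over ordered pairs i != j."""
    m_target = edge_marginals(target, num_nodes)
    m_approx = edge_marginals(approx, num_nodes)
    squared_errors = [
        (m_target[i][j] - m_approx[i][j]) ** 2
        for i in range(num_nodes)
        for j in range(num_nodes)
        if i != j
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy distributions over the three DAGs on two nodes (made-up numbers).
target = {frozenset(): 0.5, frozenset({(0, 1)}): 0.3, frozenset({(1, 0)}): 0.2}
approx = {frozenset(): 0.4, frozenset({(0, 1)}): 0.4, frozenset({(1, 0)}): 0.2}
rmse = edge_marginal_rmse(target, approx, num_nodes=2)
```

Enumerating the support in this way is only feasible for the small graphs considered here ($d \le 5$).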

E METRICS

Evaluation of the terminating state distribution $P_F^\top$: When the state space is small enough (e.g., graphs with $d \le 5$ nodes in the structure learning experiments, or a 2-D hypergrid with side length 128, as in the hypergrid experiments), we can propagate the flows in order to compute the terminating state distribution $P_F^\top$ from the forward policy $P_F$. This is done using a flow function $F$ defined recursively:

$$F(s') = \begin{cases} 1 & \text{if } s' = s_0 \\ \sum_{s \in \mathrm{Par}(s')} F(s) P_F(s' \mid s) & \text{otherwise.} \end{cases}$$

$P_F^\top$ is then given by:

$$P_F^\top(s^\top) \propto F(s) P_F(s^\top \mid s).$$

The recursion can be carried out by dynamic programming, by enumerating the states in any topological ordering consistent with the graded DAG $\mathcal{G}$. In particular, computation of the flow at a given terminating state $s$ is linear in the number of states and actions that lie on trajectories leading to $s$, and computation of the full distribution $P_F^\top$ is linear in $|\mathcal{S}| + |\mathcal{A}|$.

Evaluation of the Jensen-Shannon divergence (JSD) Similarly, when the state space is small enough, the target distribution $P^\top = R/Z^*$ can be evaluated exactly, given that the marginalization is over $\mathcal{X}$ only. The JSD is a symmetric divergence, thus motivating our choice. The JSD can directly be evaluated as:

$$\mathrm{JSD}(P^\top \,\|\, P_F^\top) = \frac{1}{2} \sum_{x \in \mathcal{X}} \left[P^\top(x) \log \frac{2 P^\top(x)}{P^\top(x) + P_F^\top(x)} + P_F^\top(x) \log \frac{2 P_F^\top(x)}{P^\top(x) + P_F^\top(x)}\right].$$

F EXTENSION TO CONTINUOUS DOMAINS

As a first step towards understanding GFlowNets with continuous action spaces, we perform an experiment on a stochastic control problem. The goal of this experiment is to explore whether the observations in the main text may hold in continuous settings as well. We consider an environment in which an agent begins at the point $\mathbf{x}_0 = (0, 0)$ in the plane and makes a sequence of $K = 10$ steps over the time interval $[0, 1]$, through points $\mathbf{x}_{0.1}, \mathbf{x}_{0.2}, \dots, \mathbf{x}_1$. Each step from $\mathbf{x}_t$ to $\mathbf{x}_{t+0.1}$ is Gaussian with learned mean depending on $\mathbf{x}_t$ and $t$ and with fixed variance; the variance is isotropic with standard deviation $\frac{1}{2\sqrt{K}}$. Equivalently, the agent samples the Euler-Maruyama discretization with interval $\Delta t = \frac{1}{K}$ of the Itô stochastic differential equation

$$d\mathbf{x}_t = f(\mathbf{x}_t, t)\, dt + \frac{1}{2}\, d\mathbf{w}_t,$$

where $\mathbf{w}_t$ is the two-dimensional Wiener process. The choice of the drift function $f$ determines the marginal density of the final point, $\mathbf{x}_1$. We aim to find $f$ such that this marginal density is proportional to a given reward function, in this case a quantity proportional to the density function of the standard 8gaussians distribution, shown in Fig. F.2. We scale the distribution so that the modes of the 8 Gaussian components are at a distance of 2 from the origin and their standard deviations are 0.25.

In GFlowNet terms, the set of states is $\mathcal{S} = \{(0, 0)\} \cup \{(\mathbf{x}, t) : \mathbf{x} \in \mathbb{R}^2, t \in \{0.1, 0.2, \dots, 1\}\}$. States with $t = 1$ are terminating. There is an action from $(\mathbf{x}, t)$ to $(\mathbf{x}', t')$ if and only if $t' = t + \Delta t$. The forward policy is given by a conditional Gaussian:

$$P_F((\mathbf{x}', t + \Delta t) \mid (\mathbf{x}, t)) = \mathcal{N}\left(\mathbf{x}' - \mathbf{x};\; f(\mathbf{x}, t) \Delta t,\; \left(\frac{\sqrt{\Delta t}}{2}\right)^2\right).$$

We impose a conditional Gaussian assumption on the backward policy as well, i.e.,

$$P_B((\mathbf{x}, t) \mid (\mathbf{x}', t + \Delta t)) = \begin{cases} \mathcal{N}\left(\mathbf{x} - \mathbf{x}';\; \mu_B(\mathbf{x}', t + \Delta t) \Delta t,\; \sigma_B^2(\mathbf{x}', t + \Delta t) \Delta t\right) & t \ne 0 \\ 1 & t = 0, \end{cases}$$

where $\mu_B$ and $\log \sigma_B^2$ are learned. Notice that all the policies, except the backward policy from time $\frac{1}{K}$ to time 0, now represent probability densities; states can have uncountably infinite numbers of children and parents. We parametrize the three functions $f$, $\mu_B$, $\log \sigma_B^2$ as small (two hidden layers, 64 units per layer) MLPs taking as input the position $\mathbf{x}$ and an embedding of the time $t$. Their parameters can be optimized using any of the five algorithms in Table 1 of the main text. Fig. F.1 shows the marginal densities of $\mathbf{x}_t$ (estimated using KDE) for different $t$ in one well-trained model, as well as some sampled points and paths. In addition to training on policy, we consider exploratory training policies that add Gaussian noise to the mean of each transition distribution.
We experiment with adding standard normal noise scaled by 𝜎 exp , where 𝜎 exp ∈ {0, 0.1, 0.2}. for the parameters of 𝑓 , 𝜇 𝐵 , log 𝜎 2 𝐵 and 10 -1 for the log 𝑍 parameter of the GFlowNet). These results suggest that the observations made for discrete-space GFlowNets in the main text may continue to hold in continuous settings. The first two rows of Fig. F.2 show that off-policy exploration is essential for finding the modes and that TB achieves a better fit to the target distribution. Just as in Fig. 1 , although all modes are found by WAKE-SLEEP, they are modeled with lower precision, appearing off-centre and having an oblong shape, which is reflected in the slightly higher MMD. 
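The sampling procedure of this section can be sketched in plain Python. This is a minimal illustration under stated assumptions: the learned drift MLP is replaced by an arbitrary callable `drift(position, time)`, exploration is implemented by perturbing the mean of each transition with standard normal noise scaled by `sigma_exp`, and all names are hypothetical.

```python
import math
import random


def sample_path(drift, num_steps=10, sigma_exp=0.0, rng=None):
    """Euler-Maruyama rollout of dx = f(x, t) dt + (1/2) dw from x_0 = (0, 0)."""
    rng = rng if rng is not None else random.Random()
    dt = 1.0 / num_steps
    step_std = math.sqrt(dt) / 2.0  # fixed isotropic standard deviation 1 / (2 sqrt(K))
    x, y = 0.0, 0.0
    path = [(x, y)]
    for k in range(num_steps):
        fx, fy = drift((x, y), k * dt)
        # Exploratory policy: Gaussian noise added to the transition mean.
        mean_x = x + fx * dt + sigma_exp * rng.gauss(0.0, 1.0)
        mean_y = y + fy * dt + sigma_exp * rng.gauss(0.0, 1.0)
        x = mean_x + step_std * rng.gauss(0.0, 1.0)
        y = mean_y + step_std * rng.gauss(0.0, 1.0)
        path.append((x, y))
    return path


def zero_drift(position, time):
    """Placeholder drift standing in for the learned function f."""
    return (0.0, 0.0)


rng = random.Random(0)
finals = [sample_path(zero_drift, rng=rng)[-1] for _ in range(4000)]
var_x = sum(px * px for px, _ in finals) / len(finals)
```

With zero drift and no exploration noise, the ten steps accumulate a variance of $K \cdot \Delta t / 4 = 0.25$ per coordinate at $t = 1$, which `var_x` estimates.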



A pointed DAG is one with a designated initial state. We recall some facts about partially ordered sets. A pointed graded DAG is a pointed DAG in which all complete trajectories have the same length. Pointed graded DAGs $\mathcal{G}$ are also characterized by the following equivalent property: the state space $\mathcal{S}$ can be partitioned into disjoint sets $\mathcal{S} = \bigcup_{l=0}^{L} \mathcal{S}_l$, with $\mathcal{S}_0 = \{s_0\}$, called layers, such that all edges $s \to s'$ are between states of adjacent layers ($s \in \mathcal{S}_i$, $s' \in \mathcal{S}_{i+1}$ for some $i$). We conjecture (and strongly believe under mild assumptions) but do not prove that the necessary GFlowNet theory continues to hold when probabilities are replaced by probability densities; the results obtained here are evidence in support of this conjecture.



Figure 1: Top: The evolution of the JSD between the learned sampler 𝑃 ⊤ 𝐹 and the target distribution on the 128 × 128 grid, as a function of the number of trajectories sampled. Shaded areas represent the standard error evaluated across 5 different runs (on-policy left, off-policy right). Bottom: The average (across 5 runs) final learned distribution 𝑃 ⊤𝐹 for the different algorithms, along with the target distribution. To amplify variation, the plot intensity at each grid position is resampled from the Gaussian approximating the distribution over the 5 runs. Although WS, FORWARD KL, and REVERSE WS (off-policy) find the 4 target modes, they do not model them with high precision, and produce a textured pattern at the modes, where it should be flat.

Figure 2: Correlation between marginal sampling log-likelihood and log-reward on the molecule generation task for different learning algorithms, showing the advantage of off-policy TB (red) against on-policy TB (orange) and both on-policy (blue) and off-policy HVI (green). For each hyperparameter setting on the $x$-axis ($\alpha$ or $\beta$), we take the optimal choice of the other hyperparameter ($\beta$ or $\alpha$, respectively) and plot the mean and standard error region over three random seeds.

Figure A.1: Illustration of the process by which a DAG (left) can turn into a graded DAG (right). Nodes with a double border represent terminating states. Nodes with a dashed border represent dummy states added to make the DAG graded.

Fig. A.1 shows the canonical conversion of a DAG into a graded DAG as described in §2.2. Note that this operation is idempotent: applying it to a graded DAG yields the same graded DAG.

Figure D.1: A comparison of the type of baseline used (local or global) for the three HVI algorithms that use a score function estimator of the gradient.

Figure D.2: Top: The evolution of the JSD between the learned sampler 𝑃 ⊤ 𝐹 and the target distribution on the 8 × 8 × 8 × 8 grid (left) and the 64 × 64 grid (right). Trajectories are sampled on-policy. Shaded areas represent the standard error evaluated across 5 different runs. Bottom: The average (across 5 runs) final learned distribution 𝑃 ⊤ 𝐹 for the different algorithms, along with the target distribution. To amplify variation, the plot intensity at each grid position is resampled from the Gaussian approximating the distribution over the 5 runs.

architecture (a graph neural network introduced by Bengio et al. (2021a)), and the off-policy exploration rate. All models were trained with the Adam optimizer and batch size 4 for a maximum of 50000 batches. The metric was computed after every 5000 batches, and the last computed value was reported; this was sometimes not the value after 50000 batches, because some training runs terminated early due to numerical errors.

D.3 BAYESIAN STRUCTURE LEARNING EXPERIMENTS

Bayesian Networks. A Bayesian Network is a probabilistic model where the joint distribution over 𝑑 random variables {𝑋 1 , . . . , 𝑋 𝑑 } factorizes according to a directed acyclic graph (DAG) 𝐺:

Figure D.3: Comparison of edge marginals computed using the target posterior distribution and using the posterior approximations found either with the GFlowNet objectives, or REVERSE KL. Performance is reported as the Root Mean Square Error (RMSE) between the marginals (lower is better).

Fig. D.3 shows the RMSE of the edge marginals, for different GFlowNet objectives and REVERSE KL (denoted as HVI here for brevity). The results on the edge marginals largely confirm the observations made in § 4.3: the off-policy GFlowNet objectives ((Modified) DB & TB) consistently perform well across all experimental settings; On-Policy TB & On-Policy REVERSE KL perform similarly and degrade as the complexity of the experiment increases (as 𝑑 increases); and Off-Policy REVERSE KL has a performance that degrades the most as the complexity increases, where the edge marginals given by 𝑃 ⊤ 𝐹 (𝑋 𝑖 → 𝑋 𝑗 ) do not match the true edge marginals 𝑝(𝑋 𝑖 → 𝑋 𝑗 | D) accurately.

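$\mathrm{JSD}(P^\top \,\|\, P_F^\top) = \tfrac{1}{2}\left( D_{\mathrm{KL}}(P^\top \,\|\, M) + D_{\mathrm{KL}}(P_F^\top \,\|\, M) \right)$, where $M = \tfrac{1}{2}\left(P^\top + P_F^\top\right)$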
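As a concrete illustration of the Jensen-Shannon divergence used as the evaluation metric, the sketch below computes it for two discrete distributions given as lists of probabilities over the same support. It uses the standard definition with natural logarithms; the function names are illustrative and are not taken from the paper's code.

```python
import math


def kl(p, q):
    """D_KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence with mixture M = (p + q) / 2."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    # M is positive wherever p or q is, so both KL terms are finite.
    return 0.5 * (kl(p, m) + kl(q, m))


target = [0.25, 0.25, 0.25, 0.25]
learned = [0.40, 0.30, 0.20, 0.10]
print(jsd(target, learned))
```

Unlike either KL divergence taken alone, this quantity is symmetric and stays finite (at most log 2 in nats) even when the learned sampler assigns zero mass to part of the target's support.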

Fig. F.2 compares the marginal densities obtained using different algorithms with on-policy and off-policy training. The algorithms that use a forward KL objective to learn 𝑃 𝐵 (namely, REVERSE WS and FORWARD KL) are not shown because they encounter NaN values in the gradients early in training, even when using a 10× lower learning rate than that used for all other algorithms ($10^{-3}$ for the parameters of $f$, $\mu_B$, $\log \sigma_B^2$ and $10^{-1}$ for the $\log Z$ parameter of the GFlowNet). These results suggest that the observations made for discrete-space GFlowNets in the main text may continue to hold in continuous settings. The first two rows of Fig. F.2 show that off-policy exploration is essential for finding the modes and that TB achieves a better fit to the target distribution. Just as in Fig. 1, although all modes are found by WAKE-SLEEP, they are modeled with lower precision, appearing off-centre and having an oblong shape, which is reflected in the slightly higher MMD.

Figure F.1: Above: KDE (2560 samples, bandwidth 0.25) of the agent's position after 𝑖 steps for 𝑖 = 0, 1, . . . , 10 (𝑡 = 0, 0.1, . . . , 1) for a model trained with off-policy TB, showing a close match to the target distribution (also convolved with the KDE kernel for fair comparison). Below: A sample of 2560 points from the trained model and the trajectories taken by 128 of the points.

Algorithm              𝑃 𝐹 objective        𝑃 𝐵 objective
REVERSE KL             𝐷 KL (𝑃 𝐹 ∥ 𝑃 𝐵 )     𝐷 KL (𝑃 𝐹 ∥ 𝑃 𝐵 )
FORWARD KL             𝐷 KL (𝑃 𝐵 ∥ 𝑃 𝐹 )     𝐷 KL (𝑃 𝐵 ∥ 𝑃 𝐹 )
WAKE-SLEEP (WS)        𝐷 KL (𝑃 𝐵 ∥ 𝑃 𝐹 )     𝐷 KL (𝑃 𝐹 ∥ 𝑃 𝐵 )
REVERSE WAKE-SLEEP     𝐷 KL (𝑃 𝐹 ∥ 𝑃 𝐵 )     𝐷 KL (𝑃 𝐵 ∥ 𝑃 𝐹 )

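$\nabla_\theta c(\tau)$ that is typically present in the REINFORCE estimator is 0 in expectation, since

$$\mathbb{E}_{\tau \sim P_F}\left[\nabla_\theta \log P_F(\tau)\right] = \sum_{\tau} P_F(\tau)\,\frac{\nabla_\theta P_F(\tau)}{P_F(\tau)} = \nabla_\theta \sum_{\tau} P_F(\tau) = \nabla_\theta 1 = 0.$$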
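The zero-expectation property of the score function can be verified numerically on a toy case. The sketch below is an illustration under simplifying assumptions, not the paper's setup: the sampler is a single categorical distribution with softmax logits `theta`, so each "trajectory" is one draw, and the gradient of the log-probability has the closed form (one-hot of the outcome) minus (probabilities). The expectation is computed exactly by summing over all outcomes.

```python
import math


def softmax(theta):
    """Numerically stable softmax over a list of logits."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def grad_log_prob(theta, k):
    """Gradient of log softmax(theta)[k] with respect to theta: one_hot(k) - p."""
    p = softmax(theta)
    return [(1.0 if j == k else 0.0) - p[j] for j in range(len(theta))]


def expected_score(theta):
    """Exact E_{k ~ p}[grad_theta log p(k)], summed over all outcomes k."""
    p = softmax(theta)
    total = [0.0] * len(theta)
    for k, pk in enumerate(p):
        g = grad_log_prob(theta, k)
        for j in range(len(theta)):
            total[j] += pk * g[j]
    return total


print(expected_score([0.3, -1.2, 2.0]))
```

Every component of the result is zero up to floating-point rounding. This is the same property that allows a constant baseline to be subtracted in score function estimators without introducing bias.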

Comparison of the Jensen-Shannon divergence for Bayesian structure learning, showing the advantage of off-policy TB over on-policy TB and on-policy or off-policy HVI. The JSD is measured between the true posterior distribution 𝑝(𝐺 | D) and the learned approximation 𝑃 ⊤ 𝐹 (𝐺).

GENERATION OF DAGS IN BAYESIAN STRUCTURE LEARNING

Finally, we consider the problem of learning the (posterior) distribution over the structure of Bayesian networks, as studied in Deleu et al. (2022). The goal of Bayesian structure learning is to approximate the posterior distribution 𝑝(𝐺 | D) over DAGs 𝐺, given a dataset of observations D. Following Deleu et al. (2022), we treat the generation of a DAG as a sequential decision problem, where directed edges are added one at a time, starting from the completely disconnected graph. Since our goal is to approximate the posterior distribution 𝑝(𝐺 | D), we use the joint probability 𝑅(𝐺) = 𝑝(𝐺, D) as the reward function, which is proportional to the former up to a normalizing constant. Details about how this reward is computed, as well as the parametrization of the forward policy 𝑃 𝐹 , are available in §D.3. Note that similarly to §4.2, and following Deleu et al. (2022), we leave the backward policy 𝑃 𝐵 fixed to uniform.
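To make the sequential construction concrete, the sketch below enumerates the edge additions that keep a partially built graph acyclic. It is an illustrative assumption about how such an action set could be computed, not the implementation of Deleu et al. (2022): nodes are integers `0..d-1`, the graph is a set of `(parent, child)` pairs, and any terminating action is left out.

```python
def reachable(edges, src, dst):
    """Return True if dst can be reached from src by following directed edges."""
    stack, seen = [src], set()
    while stack:
        u = stack.pop()
        if u == dst:
            return True
        if u in seen:
            continue
        seen.add(u)
        stack.extend(child for (parent, child) in edges if parent == u)
    return False


def valid_edge_additions(d, edges):
    """List edges (i, j) that can be added to `edges` without creating a cycle."""
    actions = []
    for i in range(d):
        for j in range(d):
            # Adding i -> j closes a cycle exactly when j already reaches i.
            if i != j and (i, j) not in edges and not reachable(edges, j, i):
                actions.append((i, j))
    return actions


print(valid_edge_additions(3, set()))
print(valid_edge_additions(3, {(0, 1), (1, 2)}))
```

Starting from the completely disconnected graph on three nodes, all six directed edges are available; once the chain 0 → 1 → 2 is in place, only the edge 0 → 2 can still be added.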

ACKNOWLEDGMENTS

The authors thank Moksh Jain for valuable discussions about the project. This research was enabled in part by computational resources provided by the Digital Research Alliance of Canada. All authors are funded by their primary institution. We also acknowledge funding from CIFAR, Genentech, Samsung, and IBM.

availability

https://github.com/GFNOrg/GFN_vs_HVI.

AUTHOR CONTRIBUTIONS

N.M., X.J., D.Z., and Y.B. observed the connection between GFlowNets and variational inference, providing motivation for the main ideas in this work. N.M., X.J., and T.D. did initial experimental exploration. S.L., N.M., and D.Z. contributed to the theoretical analysis. S.L. and N.M. extended the theoretical analysis to subtrajectory objectives. D.Z. reviewed the related work. S.L. performed experiments on the hypergrid domain. N.M. performed experiments on the molecule domain and the stochastic control domain. T.D., E.H., and K.E. performed experiments on the causal graph domain. All authors contributed to planning the experiments, analyzing their results, and writing the paper.

