LEARNING GFLOWNETS FROM PARTIAL EPISODES FOR IMPROVED CONVERGENCE AND STABILITY

Anonymous authors
Paper under double-blind review

Abstract

Generative flow networks (GFlowNets) are a family of algorithms for training a sequential sampler of discrete objects under an unnormalized target density and have been successfully used for various probabilistic modeling tasks. Existing training objectives for GFlowNets are either local to states or transitions, or propagate a reward signal over an entire sampling trajectory. We argue that these alternatives represent opposite ends of a gradient bias-variance tradeoff and propose a way to exploit this tradeoff to mitigate its harmful effects. Inspired by the TD(λ) algorithm in reinforcement learning, we introduce subtrajectory balance or SubTB(λ), a GFlowNet training objective that can learn from partial action subsequences of varying lengths. We show that SubTB(λ) accelerates sampler convergence in previously studied and new environments and enables training GFlowNets in environments with longer action sequences and sparser reward landscapes than was previously possible. We also perform a comparative analysis of stochastic gradient dynamics, shedding light on the bias-variance tradeoff in GFlowNet training and the advantages of subtrajectory balance.

1. INTRODUCTION

Generative flow networks (GFlowNets; Bengio et al., 2021a) are generative models that construct objects lying in a target space X by taking sequences of actions sampled from a learned policy. GFlowNets are trained so as to make the probability of sampling an object x ∈ X proportional to a given nonnegative reward R(x). GFlowNets' use of a parametric policy that can generalize to states not seen during training makes them a competitive alternative to methods based on local exploration in various probabilistic modeling tasks (Bengio et al., 2021a; Malkin et al., 2022; Zhang et al., 2022; Jain et al., 2022; Deleu et al., 2022). GFlowNets solve the variational inference problem of approximating a target distribution over X with the distribution induced by the sampling policy, and they are trained by algorithms reminiscent of reinforcement learning (although GFlowNets model the diversity present in the reward distribution, rather than maximizing reward by seeking its mode).

In most past works (Bengio et al., 2021a; Malkin et al., 2022; Zhang et al., 2022; Jain et al., 2022), GFlowNets are trained by exploratory sampling from the policy and receive their training signal from the reward of the sampled object. The flow matching (FM) and detailed balance (DB) learning objectives for GFlowNets proposed in Bengio et al. (2021a;b) resemble temporal difference learning (Sutton & Barto, 2018). A third objective, trajectory balance (TB), was proposed in Malkin et al. (2022) to address the problem of slow temporal credit assignment with the FM and DB objectives. The TB objective propagates learning signals over entire episodes, while the temporal difference-like objectives (FM and DB) make updates local to states or actions. It has been hypothesized by Malkin et al. (2022) that the improved credit assignment with TB comes at the cost of higher gradient variance, analogous to the bias-variance tradeoff seen in temporal difference learning (TD(n) or TD(λ)) with different eligibility trace schemes (Sutton & Barto, 2018; Kearns & Singh, 2000; van Hasselt et al., 2018; Bengio et al., 2020). This hypothesis is one of the starting points for the present paper.

In this paper, we propose a new learning objective for GFlowNets, called subtrajectory balance (SubTB, or SubTB(λ) when its real-valued hyperparameter λ is specified). Building upon theoretical results of Bengio et al. (2021b) and Malkin et al. (2022), we show how the SubTB(λ) objective allows the flexibility of learning from partial experiences of any length. Experiments on two synthetic and four real-world domains support the following empirical claims: (1) SubTB(λ) improves convergence of GFlowNets in previously studied environments: models trained with SubTB(λ) approach the target distribution in fewer training iterations and are less sensitive to hyperparameter choices. (2) SubTB(λ) enables training of GFlowNets in environments where past approaches perform poorly due to sparsity of the reward function or length of action sequences. (3) The benefits of SubTB(λ) are explained by lower variance of the stochastic gradient, with the parameter λ allowing interpolation between the high-bias, low-variance DB objective and the low-bias, high-variance TB objective.
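To make the two ends of this tradeoff concrete, the sketch below evaluates the detailed balance and trajectory balance losses on a single complete trajectory, in the standard log-space forms given by Bengio et al. (2021b) and Malkin et al. (2022). All numerical values are illustrative placeholders, as is the backward policy P_B that appears in both losses; this is a minimal sketch, not the paper's implementation.

```python
# Illustrative log-space quantities along one complete trajectory
# s_0 -> s_1 -> s_2, where s_2 is terminal.
log_PF = [-0.7, -0.3]        # log P_F(s_{i+1} | s_i), the forward policy
log_PB = [0.0, -0.5]         # log P_B(s_i | s_{i+1}), a backward policy
log_R = 0.2                  # log R(x) at the terminal state s_2
log_F = [0.9, 0.4, log_R]    # log state flows; terminal flow tied to the reward
log_Z = 0.8                  # log partition function estimate (used by TB)

def db_loss(log_F, log_PF, log_PB):
    # Detailed balance: one squared residual per transition, so the
    # learning signal is local to each action.
    residuals = [
        log_F[i] + log_PF[i] - log_F[i + 1] - log_PB[i]
        for i in range(len(log_PF))
    ]
    return sum(r ** 2 for r in residuals)

def tb_loss(log_Z, log_PF, log_PB, log_R):
    # Trajectory balance: a single squared residual over the whole episode,
    # so the reward signal reaches every action at once, at the cost of
    # higher gradient variance.
    residual = log_Z + sum(log_PF) - log_R - sum(log_PB)
    return residual ** 2

print(db_loss(log_F, log_PF, log_PB))            # ≈ 0.2
print(tb_loss(log_Z, log_PF, log_PB, log_R))     # ≈ 0.01
```

SubTB(λ) sits between these two extremes by applying balance conditions to partial trajectories of intermediate lengths.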

2. METHOD

2.1. GFLOWNET PRELIMINARIES

Let (S, A) be the states and actions (directed edges) of a directed acyclic graph. There is a unique initial state s_0 ∈ S with no parents. States with no children are called terminal, and the set of terminal states is denoted by X. A trajectory or an action sequence is a sequence of states τ = (s_m → s_{m+1} → ... → s_n), where each (s_i → s_{i+1}) is an action. The trajectory is complete if s_m = s_0 and s_n is terminal. The set of complete trajectories is denoted by T.

A forward policy is a collection of distributions P_F(· | s) over the children of each nonterminal state s. A forward policy determines a distribution over complete trajectories via

P_F(τ = (s_0 → ... → s_n)) = ∏_{i=0}^{n-1} P_F(s_{i+1} | s_i).   (1)

Any distribution over complete trajectories that arises from a forward policy satisfies a Markov property: the marginal choice of action out of a state s is independent of how s was reached. Conversely, any Markovian distribution over T arises from a forward policy (Bengio et al., 2021b). A forward policy can thus be used to sample terminal states x ∈ X by starting at s_0 and iteratively sampling actions from P_F or, equivalently, by taking the terminating state of a complete trajectory τ ∼ P_F(τ). The marginal likelihood of sampling x ∈ X is the sum of the likelihoods of all complete trajectories that terminate at x.

Suppose that a nontrivial (not identically 0) nonnegative reward function R : X → R_{≥0} is given. The learning problem solved by GFlowNets is to estimate a policy P_F such that the likelihood of sampling x ∈ X is proportional to R(x). That is, there should exist a constant Z such that

R(x) = Z Σ_{τ=(s_0 → ... → s_n = x)} P_F(τ)  ∀x ∈ X.   (2)

If (2) is satisfied, then Z = Σ_{x ∈ X} R(x).

2.2. GFLOWNET TRAINING OBJECTIVES

Because the sum in (2) may be intractable to compute, it is in general not possible to directly convert this constraint into a training objective. To solve this problem, GFlowNet training objectives introduce auxiliary variables into the parametrization in various ways, but all have the property that (2) is satisfied at the global optimum. The key properties of these objectives are summarized in Table 1.

Flow matching (FM; Bengio et al., 2021a). Motivating the 'flow network' terminology, Bengio et al. (2021a) proved that (2) is satisfied if P_F arises from an edge flow function satisfying certain constraints. Namely, an assignment F : A → R_{≥0} of a nonnegative number (flow) to each action defines a policy via

P_F(t | s) = F(s → t) / Σ_{t′:(s → t′) ∈ A} F(s → t′).   (3)

A sufficient condition for the terminating distribution of P_F to be proportional to the reward R(x) is that a family of flow-matching (flow in = flow out) conditions is satisfied at all interior states and that the flow into each terminal state x equals its reward R(x).
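The definitions above can be made concrete in a few lines: a hypothetical edge flow F on a toy DAG, the forward policy it induces by normalizing outgoing flows, and a brute-force enumeration checking that the resulting terminating distribution is proportional to the flow into each terminal state, which plays the role of the reward R. All states, flows, and rewards here are made up for illustration.

```python
# Edge flows F(s -> t) on a toy DAG with initial state 's0' and
# terminal states 'x' and 'y'. These flows satisfy flow matching by
# construction: flow into each interior state equals flow out of it.
flow = {
    ("s0", "a"): 2.0, ("s0", "b"): 2.0,
    ("a", "x"): 1.0, ("a", "y"): 1.0,
    ("b", "y"): 2.0,
}

children = {}
for (s, t) in flow:
    children.setdefault(s, []).append(t)

def P_F(t, s):
    # Normalize outgoing flows at s into a forward policy.
    out = sum(flow[(s, u)] for u in children[s])
    return flow[(s, t)] / out

def terminating_distribution():
    # Enumerate all complete trajectories from s0 and accumulate the
    # probability mass reaching each terminal state.
    probs = {}
    stack = [("s0", 1.0)]
    while stack:
        s, p = stack.pop()
        if s not in children:          # terminal state
            probs[s] = probs.get(s, 0.0) + p
            continue
        for t in children[s]:
            stack.append((t, p * P_F(t, s)))
    return probs

probs = terminating_distribution()
reward = {"x": 1.0, "y": 3.0}          # flow into each terminal state
Z = sum(reward.values())
# The terminating distribution matches R(x)/Z: x -> 0.25, y -> 0.75.
assert all(abs(probs[x] - reward[x] / Z) < 1e-9 for x in reward)
```

Training a GFlowNet amounts to learning F (or P_F directly) so that such conditions hold approximately under a parametric model; here they hold exactly because the flows were chosen by hand.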

