LEARNING GFLOWNETS FROM PARTIAL EPISODES FOR IMPROVED CONVERGENCE AND STABILITY

Anonymous authors
Paper under double-blind review

Abstract

Generative flow networks (GFlowNets) are a family of algorithms for training a sequential sampler of discrete objects under an unnormalized target density and have been successfully used for various probabilistic modeling tasks. Existing training objectives for GFlowNets are either local to states or transitions, or propagate a reward signal over an entire sampling trajectory. We argue that these alternatives represent opposite ends of a gradient bias-variance tradeoff and propose a way to exploit this tradeoff to mitigate its harmful effects. Inspired by the TD(𝜆) algorithm in reinforcement learning, we introduce subtrajectory balance or SubTB(𝜆), a GFlowNet training objective that can learn from partial action subsequences of varying lengths. We show that SubTB(𝜆) accelerates sampler convergence in previously studied and new environments and enables training GFlowNets in environments with longer action sequences and sparser reward landscapes than what was possible before. We also perform a comparative analysis of stochastic gradient dynamics, shedding light on the bias-variance tradeoff in GFlowNet training and the advantages of subtrajectory balance.

1. INTRODUCTION

Generative flow networks (GFlowNets; Bengio et al., 2021a) are generative models that construct objects lying in a target space $\mathcal{X}$ by taking sequences of actions sampled from a learned policy. GFlowNets are trained so as to make the probability of sampling an object $x \in \mathcal{X}$ proportional to a given nonnegative reward $R(x)$. GFlowNets' use of a parametric policy that can generalize to states not seen during training makes them a competitive alternative to methods based on local exploration in various probabilistic modeling tasks (Bengio et al., 2021a; Malkin et al., 2022; Zhang et al., 2022; Jain et al., 2022; Deleu et al., 2022). GFlowNets solve the variational inference problem of approximating a target distribution over $\mathcal{X}$ with the distribution induced by the sampling policy, and they are trained by algorithms reminiscent of reinforcement learning (although GFlowNets model the diversity present in the reward distribution, rather than maximizing reward by seeking its mode).

In most past works (Bengio et al., 2021a; Malkin et al., 2022; Zhang et al., 2022; Jain et al., 2022), GFlowNets are trained by exploratory sampling from the policy and receive their training signal from the reward of the sampled object. The flow matching (FM) and detailed balance (DB) learning objectives for GFlowNets proposed in Bengio et al. (2021a; b) resemble temporal difference learning (Sutton & Barto, 2018). A third objective, trajectory balance (TB), was proposed in Malkin et al. (2022) to address the problem of slow temporal credit assignment with the FM and DB objectives. The TB objective propagates learning signals over entire episodes, while the temporal difference-like objectives (FM and DB) make updates local to states or actions. It has been hypothesized by Malkin et al. (2022) that the improved credit assignment with TB comes at the cost of higher gradient variance, analogous to the bias-variance tradeoff seen in temporal difference learning (TD($n$) or TD($\lambda$)) with different eligibility trace schemes (Sutton & Barto, 2018; Kearns & Singh, 2000; van Hasselt et al., 2018; Bengio et al., 2020). This hypothesis is one of the starting points for the present paper.

In this paper, we propose a new learning objective for GFlowNets, called subtrajectory balance (SubTB, or SubTB($\lambda$) when its real-valued hyperparameter $\lambda$ is specified). Building upon theoretical results of Bengio et al. (2021b) and Malkin et al. (2022), we show how the SubTB($\lambda$) objective allows the flexibility of learning from partial experiences of any length. Experiments on two synthetic and four real-world domains support the following empirical claims:

(1) SubTB($\lambda$) improves convergence of GFlowNets in previously studied environments: models trained with SubTB($\lambda$) approach the target distribution in fewer training iterations and are less sensitive to hyperparameter choices.
(2) SubTB($\lambda$) enables training of GFlowNets in environments where past approaches perform poorly due to sparsity of the reward function or length of action sequences.
(3) The benefits of SubTB($\lambda$) are explained by lower variance of the stochastic gradient, with the parameter $\lambda$ allowing interpolation between the high-bias, low-variance DB objective and the low-bias, high-variance TB objective.

2. METHOD

2.1. PRELIMINARIES

In this section, we summarize the necessary preliminaries on GFlowNets. We follow the notation of Malkin et al. (2022), to which the reader is directed for a more thorough exposition written with a view towards motivating the trajectory and subtrajectory balance objectives. A deeper introduction is given in Bengio et al. (2021b).

Let $G = (\mathcal{S}, \mathcal{A})$ be a directed acyclic graph. The vertices $s \in \mathcal{S}$ are called states and the directed edges $(u \to v) \in \mathcal{A}$ are actions. If $(u \to v)$ is an edge, we say $v$ is a child of $u$ and $u$ is a parent of $v$. There is a unique initial state $s_0 \in \mathcal{S}$ with no parents. States with no children are called terminal, and the set of terminal states is denoted by $\mathcal{X}$.

A trajectory or an action sequence is a sequence of states $\tau = (s_m \to s_{m+1} \to \dots \to s_n)$, where each $(s_i \to s_{i+1})$ is an action. The trajectory is complete if $s_m = s_0$ and $s_n$ is terminal. The set of complete trajectories is denoted by $\mathcal{T}$.

A (forward) policy is a collection of distributions $P_F(\cdot \mid s)$ over the children of every nonterminal state $s \in \mathcal{S}$. A forward policy determines a distribution over $\mathcal{T}$ by

$$P_F(\tau = (s_0 \to \dots \to s_n)) = \prod_{i=0}^{n-1} P_F(s_{i+1} \mid s_i). \tag{1}$$

Any distribution over complete trajectories that arises from a forward policy satisfies a Markov property: the marginal choice of action out of a state $s$ is independent of how $s$ was reached. Conversely, any Markovian distribution over $\mathcal{T}$ arises from a forward policy (Bengio et al., 2021b).

A forward policy can thus be used to sample terminal states $x \in \mathcal{X}$ by starting at $s_0$ and iteratively sampling actions from $P_F$, or, equivalently, taking the terminating state of a complete trajectory $\tau \sim P_F(\tau)$. The marginal likelihood of sampling $x \in \mathcal{X}$ is the sum of likelihoods of all complete trajectories that terminate at $x$.

Suppose that a nontrivial (not identically 0) nonnegative reward function $R : \mathcal{X} \to \mathbb{R}_{\geq 0}$ is given.
The learning problem solved by GFlowNets is to estimate a policy $P_F$ such that the likelihood of sampling $x \in \mathcal{X}$ is proportional to $R(x)$. That is, there should exist a constant $Z$ such that

$$R(x) = Z \sum_{\tau = (s_0 \to \dots \to s_n = x)} P_F(\tau) \quad \forall x \in \mathcal{X}. \tag{2}$$

If (2) is satisfied, then $Z = \sum_{x \in \mathcal{X}} R(x)$.
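As an illustration of equations (1) and (2), the following minimal sketch enumerates the complete trajectories of a small DAG and checks that a hand-picked forward policy samples terminal states in proportion to the reward. The graph, the reward values, and all function names are hypothetical and chosen only for this example; they do not come from the paper's experiments.

```python
from collections import defaultdict

# Hypothetical toy DAG: s0 -> {a, b}, a -> {x, y}, b -> {y}; x and y are terminal.
children = {"s0": ["a", "b"], "a": ["x", "y"], "b": ["y"]}
reward = {"x": 1.0, "y": 3.0}

# A forward policy P_F(child | state), chosen by hand for this example.
P_F = {
    "s0": {"a": 0.5, "b": 0.5},
    "a": {"x": 0.5, "y": 0.5},
    "b": {"y": 1.0},
}

def complete_trajectories(state="s0", prefix=None):
    """Yield every complete trajectory (as a list of states) through `state`."""
    prefix = (prefix or []) + [state]
    if state not in children:  # terminal state: no children
        yield prefix
        return
    for child in children[state]:
        yield from complete_trajectories(child, prefix)

def trajectory_prob(traj):
    """Equation (1): product of forward-policy probabilities along the path."""
    p = 1.0
    for s, t in zip(traj, traj[1:]):
        p *= P_F[s][t]
    return p

def terminating_distribution():
    """Marginal likelihood of each terminal state: sum over trajectories ending there."""
    marginal = defaultdict(float)
    for traj in complete_trajectories():
        marginal[traj[-1]] += trajectory_prob(traj)
    return dict(marginal)

marginal = terminating_distribution()
Z = sum(reward.values())  # if (2) holds, Z is the sum of all rewards
for x in reward:
    print(x, reward[x], Z * marginal[x])  # the last two columns agree when (2) holds
```

Exhaustive enumeration is feasible only for tiny graphs; in realistic state spaces the sum in (2) cannot be computed this way.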

2.2. GFLOWNET TRAINING OBJECTIVES

Because the sum in (2) may be intractable to compute, it is in general not possible to directly convert this constraint into a training objective. To solve this problem, GFlowNet training objectives introduce auxiliary variables in the parametrization in various ways, but all have the property that (2) is satisfied at the global optimum. The key properties of these objectives are summarized in Table 1.

Flow matching (FM; Bengio et al., 2021a). Motivating the 'flow network' terminology, Bengio et al. (2021a) proved that (2) is satisfied if $P_F$ arises from an edge flow function satisfying certain constraints. Namely, an assignment $F : \mathcal{A} \to \mathbb{R}_{\geq 0}$ of a nonnegative number (flow) to each action defines a policy via

$$P_F(t \mid s) = \frac{F(s \to t)}{\sum_{t' : (s \to t') \in \mathcal{A}} F(s \to t')}. \tag{3}$$

A sufficient condition for the terminating distribution of $P_F$ to be proportional to the reward $R(x)$ is that a family of flow-matching (flow in = flow out) conditions is satisfied at all interior states and that the total flow into each terminal state equals its reward:

$$\sum_{s : (s \to t) \in \mathcal{A}} F(s \to t) = \sum_{u : (t \to u) \in \mathcal{A}} F(t \to u) \quad \forall t \in \mathcal{S} \setminus (\mathcal{X} \cup \{s_0\}), \qquad \sum_{s : (s \to x) \in \mathcal{A}} F(s \to x) = R(x) \quad \forall x \in \mathcal{X}. \tag{4}$$

The flow $F(s \to t)$ is then proportional to the marginal likelihood that a complete trajectory sampled from $P_F$ includes the action $s \to t$. In Bengio et al. (2021a), a GFlowNet is described by a parametric estimate of the edge flow function, $F(u \to v; \theta)$ (a neural net with parameters $\theta$). These conditions can be converted into objectives that are minimized when (4) is satisfied. For example, the flow-matching objective at an interior state $t$ is defined by

$$\mathcal{L}_{\mathrm{FM}}(t) = \left( \log \frac{\sum_{s : (s \to t) \in \mathcal{A}} F(s \to t; \theta) + \epsilon}{\sum_{u : (t \to u) \in \mathcal{A}} F(t \to u; \theta) + \epsilon} \right)^2, \tag{5}$$

where $\epsilon$ is a smoothing constant that can safely be set to 0 if the flows are constrained to be strictly positive, and a similar objective (or a constraint by construction) is defined to force the flow $F(s \to x)$ into terminal states $x$ to match $R(x)$.
If these objectives are globally minimized for all states, then the policy $P_F(\cdot \mid \cdot; \theta)$ defined by $F(\cdot; \theta)$ via (3) satisfies (2), with $Z = \sum_{t : (s_0 \to t) \in \mathcal{A}} F(s_0 \to t; \theta) = \sum_{x \in \mathcal{X}} R(x)$. The question of how to sample states for training is discussed below.

Detailed balance (DB; Bengio et al., 2021b; Malkin et al., 2022). In the DB parametrization, a forward policy model $P_F(\cdot \mid \cdot; \theta)$ is learned directly, jointly with two additional objects: a backward policy model $P_B(\cdot \mid \cdot; \theta)$, which can predict a distribution over the parents of any noninitial state, and a state flow function $F(s; \theta)$ (typically parametrized in the log domain). The detailed balance conditions state that

$$F(s; \theta) P_F(t \mid s; \theta) = F(t; \theta) P_B(s \mid t; \theta) \tag{6}$$

for all actions $(s \to t)$ and $F(x; \theta) = R(x)$ for $x$ terminal. Satisfaction of these conditions for all actions $(s \to t)$ and $x \in \mathcal{X}$ implies that $P_F$ samples proportionally to the reward (i.e., satisfies (2), with $Z = F(s_0)$). The DB condition (6) can be converted into a squared log-ratio objective $\mathcal{L}_{\mathrm{DB}}(s \to t)$ in the same way that (4) yields (5), and $\mathcal{L}_{\mathrm{DB}}(s \to t)$ can be optimized over sampled actions $(s \to t)$.

Trajectory balance (TB; Malkin et al., 2022). The parametrization required for the TB objective includes forward and backward policy models $P_F(\cdot \mid \cdot; \theta)$ and $P_B(\cdot \mid \cdot; \theta)$, as well as an estimate $Z_\theta$ of the constant of proportionality in (2). Satisfaction of the following condition for all complete trajectories $\tau = (s_0 \to \dots \to s_n)$ implies that (2) is satisfied:

$$Z_\theta P_F(\tau; \theta) = R(s_n) P_B(\tau \mid s_n; \theta), \tag{7}$$

where we have used the conventions

$$P_F(\tau; \theta) = \prod_{i=0}^{n-1} P_F(s_{i+1} \mid s_i; \theta), \qquad P_B(\tau \mid s_n; \theta) = \prod_{i=0}^{n-1} P_B(s_i \mid s_{i+1}; \theta).$$

The condition (7) can again be made into a squared log-ratio objective $\mathcal{L}_{\mathrm{TB}}(\tau)$ and optimized for complete trajectories $\tau$ taken from some training policy. In Malkin et al. (2022), the TB objective was empirically demonstrated to have better convergence properties than FM and DB on various problem domains.
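The squared log-ratio forms of the DB condition (6) and the TB condition (7) can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the function names and the toy flow and policy values are assumptions made for this example, and a real model would produce the log-quantities with neural networks.

```python
import math

def db_loss(log_F_s, log_PF_t_given_s, log_F_t, log_PB_s_given_t):
    """Squared log-ratio of the two sides of the detailed balance condition (6)."""
    return (log_F_s + log_PF_t_given_s - log_F_t - log_PB_s_given_t) ** 2

def tb_loss(log_Z, log_PF_steps, log_PB_steps, log_R):
    """Squared log-ratio of the two sides of the trajectory balance condition (7).

    log_PF_steps[i] = log P_F(s_{i+1} | s_i), log_PB_steps[i] = log P_B(s_i | s_{i+1}).
    """
    return (log_Z + sum(log_PF_steps) - log_R - sum(log_PB_steps)) ** 2

# Hypothetical toy values (not from the paper) for the trajectory s0 -> b -> y:
# F(s0) = Z = 4, F(b) = 2, R(y) = 3, P_F(b|s0) = 0.5, P_F(y|b) = 1,
# P_B(s0|b) = 1, P_B(b|y) = 2/3. These satisfy both (6) and (7).
log = math.log
db_first = db_loss(log(4.0), log(0.5), log(2.0), log(1.0))
db_second = db_loss(log(2.0), log(1.0), log(3.0), log(2.0 / 3.0))
pf_steps = [log(0.5), log(1.0)]
pb_steps = [log(1.0), log(2.0 / 3.0)]
tb = tb_loss(log(4.0), pf_steps, pb_steps, log(3.0))

# With a wrong estimate of Z (8 instead of 4) the TB residual is log 2.
tb_wrong_Z = tb_loss(log(8.0), pf_steps, pb_steps, log(3.0))
```

Working in the log domain turns the products in (7) into sums, which is also why the state flow is typically parametrized as $\log F(s; \theta)$.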
Training policy and exploration. Global minimization of the FM, DB, and TB objectives for all values of their respective arguments (states, actions, or complete trajectories) implies satisfaction of (2). Therefore, given a sufficiently expressive model and convergence of the optimization procedure, a GFlowNet policy that samples $x$ with likelihood proportional to $R(x)$ can be trained by minimizing any of these losses over a distribution with full support, enabling offline training of GFlowNets. As in other RL algorithms, the distribution over sampled states, actions, or episodes can be fixed and off-policy, or can vary over the course of training and use available information about terminal states in interesting ways (Zhang et al., 2022; Deleu et al., 2022).

The simplest approach, which is also taken in this paper, is on-policy learning or a very similar off-policy variant that flattens the current policy to ensure exploration. Complete trajectories $\tau = (s_0 \to \dots \to s_n)$ are sampled from the forward policy $P_F(\cdot \mid \cdot; \theta)$ (tempered or mixed with a uniform policy with a small weight so as to ensure full support and exploration). One then takes gradient descent steps on $\mathcal{L}_{\mathrm{TB}}(\tau)$, on $\mathcal{L}_{\mathrm{DB}}(s_i \to s_{i+1})$ over all actions in $\tau$, or on $\mathcal{L}_{\mathrm{FM}}(s_i)$ for all intermediate states in $\tau$.

The GFlowNets in this paper are trained on-policy, or off-policy with a training policy that is a mixture of $P_F$ with a uniform policy: $\tau = (s_0 \to s_1 \to \dots \to s_n)$ is sampled with

$$s_{i+1} \sim (1 - \epsilon) P_F(s_{i+1} \mid s_i; \theta) + \epsilon \, \frac{1}{\#\{t : (s_i \to t) \in \mathcal{A}\}}.$$

Here $\epsilon$ is the random exploration weight.
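The mixed training policy above can be sketched in a few lines. The dictionary-based policy representation, the function names, and the toy graph in the usage lines are assumptions made for illustration; only the mixing rule itself is taken from the text.

```python
import random

def mixed_policy(p_forward, eps):
    """Mix P_F(. | s) with the uniform distribution over the children of s."""
    n_children = len(p_forward)
    return {t: (1.0 - eps) * p + eps / n_children for t, p in p_forward.items()}

def sample_trajectory(policy, s0, eps, rng):
    """Sample a complete trajectory from the eps-mixed policy.

    `policy` maps each nonterminal state to a dict {child: P_F(child | state)};
    any state that is not a key of `policy` is treated as terminal.
    """
    traj = [s0]
    while traj[-1] in policy:
        probs = mixed_policy(policy[traj[-1]], eps)
        children, weights = zip(*probs.items())
        traj.append(rng.choices(children, weights=weights, k=1)[0])
    return traj

# Hypothetical toy policy: with eps = 0.1, the child "b" keeps probability 0.05
# even though the learned policy assigns it probability 0.
toy_policy = {"s0": {"a": 1.0, "b": 0.0}, "a": {"x": 1.0}, "b": {"x": 1.0}}
print(mixed_policy(toy_policy["s0"], eps=0.1))
print(sample_trajectory(toy_policy, "s0", eps=0.1, rng=random.Random(0)))
```

Because every child receives probability at least $\epsilon$ divided by the number of children, the sampled trajectory distribution has full support whenever $\epsilon > 0$.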

2.3. SUBTRAJECTORY BALANCE: LEARNING FROM PARTIAL EPISODES

Recall the GFlowNet parametrization used in the DB objective above, with a state flow estimator $F(\cdot; \theta)$ and a pair of policies $P_F(\cdot \mid \cdot; \theta)$, $P_B(\cdot \mid \cdot; \theta)$. It is shown in §A.2 of Malkin et al. (2022) that the detailed balance conditions (6) are satisfied for all actions if and only if the following subtrajectory balance constraint holds for all (not necessarily complete) trajectories $\tau = (s_m \to \dots \to s_n)$:

$$F(s_m; \theta) \prod_{i=m}^{n-1} P_F(s_{i+1} \mid s_i; \theta) = F(s_n; \theta) \prod_{i=m}^{n-1} P_B(s_i \mid s_{i+1}; \theta), \tag{8}$$

where we again enforce that $F(x; \theta) = R(x)$ if $x$ is terminal. Observe that the DB condition (6) is a special case of (8) when the trajectory consists of one action, and the TB condition (7) is precisely the case when $\tau$ is complete, with the identification $Z_\theta = F(s_0; \theta)$.

The above constraint yields the subtrajectory balance objective

$$\mathcal{L}_{\mathrm{SubTB}}(\tau) = \left( \log \frac{F(s_m; \theta) \prod_{i=m}^{n-1} P_F(s_{i+1} \mid s_i; \theta)}{F(s_n; \theta) \prod_{i=m}^{n-1} P_B(s_i \mid s_{i+1}; \theta)} \right)^2. \tag{9}$$

If this objective is made equal to 0 for all partial trajectories $\tau$, where $R(s_n)$ is substituted for $F(s_n; \theta)$ if $s_n$ is terminal, then the policy $P_F$ satisfies the desired condition (2). (Proof: When $\mathcal{L}_{\mathrm{SubTB}}(\tau) = 0$, (8) is satisfied, implying satisfaction of both (7) and (6). Either of these conditions is a sufficient condition for (2), as shown by Bengio et al. (2021b) and Malkin et al. (2022).)

Extracting subtrajectories for training. Suppose that an episode (complete trajectory) $\tau = (s_0 \to s_1 \to \dots \to s_n)$ is sampled for training. There are $\binom{n+1}{2} = O(n^2)$ nontrivial subtrajectories:

$$\tau_{i:j} := (s_i \to s_{i+1} \to \dots \to s_j), \quad 0 \leq i < j \leq n. \tag{10}$$

Having sampled a complete trajectory $\tau$ for training, we make gradient steps on a convex combination of the subtrajectory balance losses $\mathcal{L}_{\mathrm{SubTB}}(\tau_{i:j})$: $\theta \leftarrow \theta - \nabla_\theta \mathcal{L}$, where

$$\mathcal{L} = \frac{\sum_{0 \leq i < j \leq n} \lambda^{j-i} \mathcal{L}_{\mathrm{SubTB}}(\tau_{i:j})}{\sum_{0 \leq i < j \leq n} \lambda^{j-i}}. \tag{11}$$
Here $\lambda > 0$ is a hyperparameter controlling the weights assigned to subtrajectories of different lengths, and when $\lambda$ is set to 1, it leads to a uniform weighting scheme. Notice that the $\lambda \to 0^+$ limit leads precisely to the average detailed balance loss $\mathcal{L}_{\mathrm{DB}}(s_i \to s_{i+1})$ over all transitions in $\tau$, while the $\lambda \to +\infty$ limit gives the trajectory balance objective $\mathcal{L}_{\mathrm{TB}}(\tau)$. Other schemes for weighting subtrajectories are possible and should be explored in future work.

Computational considerations. It may appear that the optimization of (11) induces a computation cost that is quadratic in the trajectory length. However, a closer inspection of the gradient of (11) with respect to the state flows $\log F(s_i; \theta)$ and the forward and backward policy logits shows that gradient computation requires only one forward and one backward pass through the neural networks giving $\log F(s_i; \theta)$, $\log P_F(\cdot \mid s_i; \theta)$, and $\log P_B(\cdot \mid s_i; \theta)$. The quadratic computation cost is incurred only in performing linear operations on these log-flows and policy logits, not in the evaluation of the deep networks. Thus the SubTB loss has little computation overhead over DB or TB.

Hypothesized benefits. We hypothesize that SubTB($\lambda$) brings two benefits to GFlowNet training:

VARIANCE REDUCTION. The TB loss terms $\mathcal{L}_{\mathrm{TB}}(\tau)$ for trajectories $\tau$ that take a given sequence of actions until a state $s$, then diverge, share the terms $\log Z$ and the policy logits for all transitions preceding $s$ inside the square. However, the 'tail' of the TB loss, involving the forward and backward policy logits for transitions that appear after $s$ in $\tau$, can be seen as a stochastic least-squares regression target. That is, if $s = s_m$ in a trajectory $\tau = (s_0 \to s_1 \to \dots \to s_n)$, then

$$\log \left( Z \cdot \prod_{i=0}^{m-1} \frac{P_F(s_{i+1} \mid s_i)}{P_B(s_i \mid s_{i+1})} \right) \tag{12}$$

is regressed to

$$\log \left( R(s_n) \cdot \prod_{i=m}^{n-1} \frac{P_B(s_i \mid s_{i+1})}{P_F(s_{i+1} \mid s_i)} \right). \tag{13}$$
Similarly, for trajectories that share the transitions following $s$ but may differ in their initial actions, (12) is a stochastic regression target for (13). The subtrajectory balance loss terms $\mathcal{L}_{\mathrm{SubTB}}(\tau_{m:j})$ for partial trajectories beginning at $s$ regress the log-state flow $\log F(s)$ to (parts of) expressions like (13), while loss terms $\mathcal{L}_{\mathrm{SubTB}}(\tau_{i:m})$ regress (parts of) expressions like (12) to the log-state flow $\log F(s)$. The learned $\log F(s)$ is thus a learned estimate of a stochastic piece of the TB loss for trajectories that contain $s$. Replacing a stochastic term in the TB loss by a learned estimate of its expectation may introduce bias into the gradient (with respect to the gradient of the TB loss), but is expected to reduce variance. This is akin to the variance-reducing effect of actor-critic methods in RL. This hypothesis is studied empirically in our experiments and in particular §4.1.1, where we provide evidence that SubTB($\lambda$) is a practically useful interpolation between TB (high variance) and DB (low variance, high bias relative to the true TB gradient) losses.

FASTER LEARNING DUE TO GENERALIZATION OF STATE FLOWS. Another benefit of subtrajectory balance for convergence speed may come from the ability of estimated state flow functions $\log F(s; \theta)$ to be modeled with high precision and generalize between states $s$ faster than the often high-dimensional policy logits $\log P_F(\cdot \mid s; \theta)$, $\log P_B(\cdot \mid s; \theta)$. Such generalization is important in problems where the state graph becomes 'wide' far from the initial state, making the learning signal sparse at states that are near termination. Indeed, in all of our experiment domains except the hypergrids in §4.1 (and even there, for the largest hypergrids), the number of terminal states is many orders of magnitude larger than the total number of states seen in training.
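The weighted objective (11) can be sketched as below, under the assumption that the per-state log-flows and per-transition log-probabilities of one sampled trajectory have already been computed by the model. The function name and the example numbers are hypothetical. Prefix sums make each subtrajectory residual an O(1) operation, in line with the remark that the quadratic cost lies only in simple operations on the network outputs.

```python
def subtb_lambda_loss(log_F, log_PF, log_PB, lam):
    """SubTB(lambda) loss (11) for one complete trajectory s_0 -> ... -> s_n.

    log_F[i]  = log F(s_i) for i = 0..n, with log_F[n] = log R(s_n) (s_n terminal);
    log_PF[i] = log P_F(s_{i+1} | s_i) and log_PB[i] = log P_B(s_i | s_{i+1})
    for i = 0..n-1.
    """
    n = len(log_PF)
    # cum[j] - cum[i] = sum over transitions i..j-1 of (log P_F - log P_B)
    cum = [0.0]
    for f, b in zip(log_PF, log_PB):
        cum.append(cum[-1] + f - b)
    weighted_sum = 0.0
    weight_total = 0.0
    for i in range(n):
        for j in range(i + 1, n + 1):
            residual = log_F[i] - log_F[j] + cum[j] - cum[i]  # log-ratio in (9)
            weight = lam ** (j - i)
            weighted_sum += weight * residual ** 2
            weight_total += weight
    return weighted_sum / weight_total

# Hypothetical numbers for a trajectory with n = 3 transitions.
log_F = [1.0, 0.5, 0.2, -0.3]
log_PF = [-0.7, -1.1, -0.2]
log_PB = [-0.1, -0.9, -0.4]
for lam in (1e-9, 0.9, 1.0, 1e9):
    print(lam, subtb_lambda_loss(log_F, log_PF, log_PB, lam))
```

With a very small `lam` the value approaches the average DB loss over the three transitions, and with a very large `lam` it approaches the TB loss of the full trajectory, matching the two limits stated for (11).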

3. RELATED WORK

Eligibility traces. SubTB($\lambda$) draws inspiration from the TD($\lambda$) algorithm in RL (Sutton, 1988; Sutton & Barto, 2018), which forms an estimate of the expected return via a convex combination of $n$-step returns, each weighted by $(1 - \lambda)\lambda^{n-1}$. The parameter $\lambda \in [0, 1]$ enables a bias-variance tradeoff (Kearns & Singh, 2000). Intuitively, larger $\lambda$ leads to lower bias and higher variance, since the estimate of the expected return approaches the single-point Monte Carlo estimate as $\lambda \to 1$. We take inspiration from this idea to mix together different (possibly all) subtrajectories, akin to how $n$-step returns are mixed together. We hypothesize that the right mixing may reduce variance, compared to TB, with the additional benefits of inducing consistency between the flows of intermediate states, and thus of helping propagate credit faster and enable faster convergence. In addition, GFlowNet training objectives are reminiscent of residual gradient RL methods (Baird, 1995; Zhang et al., 2020) since the "endpoint" (e.g. $F(s_n)$ in (9)) is also considered in the gradient.

MaxEnt RL. RL has a rich literature on energy-based, or maximum entropy, methods (Ziebart, 2010; Mnih et al., 2016; Haarnoja et al., 2017; Nachum et al., 2017; Schulman et al., 2017; Haarnoja et al., 2018), which are close or equivalent to the GFlowNet framework in certain settings (in particular when the MDP has a tree structure (Bengio et al., 2021a)). Also related are methods that maximize entropy not on the policy, but rather on the state visitation distribution (Hazan et al., 2019; Islam et al., 2019; Zhang et al., 2021) or some proxy of it (Eysenbach et al., 2018), which achieve a similar objective to GFlowNet models by flattening the state visitation distribution. If the state graph of the environment is a directed tree, the loss $\mathcal{L}_{\mathrm{SubTB}}$ on individual subtrajectories is equivalent to that of path consistency learning (Nachum et al., 2017).
However, attempts to use path consistency learning in settings without intermediate rewards have only computed the loss on subtrajectories that have length 1 or include a terminal state (Guo et al., 2021).

2022) and Fig. 1). The initial state is $(0, 0, \dots, 0)$, and each action is a step that increments one of the $d$ coordinates by 1 without leaving the grid. A special termination action is also allowed from each state. This environment is designed to challenge a learning agent to infer and discover new modes from those that have already been visited.

We study various sizes of 2-dimensional and 4-dimensional hypergrids, using the hardest variant of the reward function from past work (the minimal reward, away from the corners of the grid, is set to $10^{-3}$). We train GFlowNets to sample from the target reward functions and plot the evolution of the $L_1$ distance between the target distribution and the empirical distribution of the last $2 \cdot 10^5$ states seen in training. In all cases, we tune the learning rates for the TB and SubTB($\lambda = 0.9$) objectives. (See §A for details.)

We also study an even sparser variant of the environment, in which the background reward is set to $10^{-4}$. In this case, SubTB($\lambda$) continues to perform strongly (last row of Fig. 3), while models trained with TB do not even discover all modes of the target distribution for grids larger than $8 \times 8$ (Fig. 2).

Additional results are given in §A.1. In particular, SubTB($\lambda$) continues to perform strongly when only subtrajectories of less than a certain length are used for training, which can be beneficial in realistic settings where only partial episodes are given. We also show the effect of $\lambda$ on the convergence rate (

Figure caption: Above: Self-similarity of the DB, SubTB($\lambda$), and TB gradients, showing DB < SubTB($\lambda$) < TB in gradient variance. Below: Similarity of small-batch DB, SubTB($\lambda$), and TB gradients to the large-batch TB gradient, showing that the small-batch SubTB($\lambda$) gradient is a good estimator of large-batch TB.

We take a closer look at gradient bias and variance to understand the benefits of training GFlowNets with SubTB($\lambda$). The methodology of these experiments is inspired by Ilyas et al. (2020). We train GFlowNets on the $8 \times 8$ grid environment using SubTB($\lambda = 0.8$) and monitor various gradient metrics during training. To remove the effect of parameter sharing between policies at different states and to isolate the effect of the objective, we use a tabular representation of the GFlowNet, i.e., all flows and policy logits are optimized as independent parameters.

Gradient variance. To measure gradient variance, we use the following procedure for each training objective (DB, TB, or SubTB($\lambda$)). A large batch of $2^{10} = 1024$ trajectories is sampled, and the gradient $g^{(0)}_j$ of the objective with respect to the policy logits at all states is computed for each trajectory $\tau_j$ in the batch. Then, for each $k \in \{0, 1, \dots, 9\}$, the gradients $g^{(0)}_j$ are combined into $2^{10-k}$ sub-batches, each of size $2^k$. The sub-batch gradient $g^{(k)}_i$ for the $i$-th sub-batch, $i \in \{1, 2, \dots, 2^{10-k}\}$, is set to the average of the trajectory gradients $g^{(0)}_j$ contained within the sub-batch. We then report the average cosine similarity between the sub-batch and full-batch gradients:

$$\frac{1}{2^{10-k}} \sum_{i=1}^{2^{10-k}} \frac{g^{(k)}_i \cdot g^{(10)}_1}{\left\| g^{(k)}_i \right\| \left\| g^{(10)}_1 \right\|}.$$

If this quantity is positive, then gradient steps of infinitesimally small norm along the stochastic sub-batch gradient decrease the full-batch objective in expectation. Fig. 4 (left) shows the dependence of this metric on $k$ at various iterations. A steeper curve, such as those of DB and SubTB($\lambda$), indicates lower gradient variance.

Gradient bias.
We next compare the small-batch stochastic gradients with large-batch stochastic gradients, using different objectives for the small and full batches. Specifically, we compare the small-batch DB, SubTB(𝜆), and TB gradients with the full-batch TB gradient. (The full-batch TB gradient can be seen as a 'canonical' gradient against which bias can be measured, as its expectation equals the gradient of the KL divergence between the distribution over trajectories defined by 𝑃 𝐹 and that defined by the reward 𝑅 and 𝑃 𝐵 ; see §A.3 of Malkin et al. (2022) .) Fig. 5 (bottom) shows the cosine similarity at the batch size used for training. Notably, at intermediate iterations, the similarity of SubTB(𝜆) with TB is higher than that of TB with TB: despite its bias, the small-batch SubTB(𝜆) gradient estimates the full-batch TB gradient better than the small-batch TB gradient does. Fig. 4 (right) shows the dependence of the similarity on 𝑘 at selected iterations and suggests that this effect may be even larger for smaller batch sizes. Moreover, at 𝑘 = 10, the similarity of SubTB(𝜆) vs. TB always lies between DB vs. TB and TB vs. TB, indicating that SubTB(𝜆) interpolates between TB's unbiased and DB's biased estimates of the TB gradient. The effect of learned state flows. For additional experiments, see §A.2. 
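The sub-batch similarity metric used in the gradient variance and bias experiments can be sketched as follows. This is a minimal sketch with assumed names and a tiny deterministic set of made-up gradient vectors; in the experiments the per-trajectory gradients come from the tabular GFlowNet, and for the bias measurement the reference would be the full-batch TB gradient rather than the mean of the same gradients.

```python
import math

def mean_vec(vectors):
    """Coordinate-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity; assumes neither vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def subbatch_similarity(per_traj_grads, k, reference=None):
    """Average cosine similarity between size-2^k sub-batch gradients and a reference.

    If `reference` is None, the full-batch mean of `per_traj_grads` is used
    (the variance measurement); pass another full-batch gradient to compare
    objectives (the bias measurement).
    """
    size = 2 ** k
    if reference is None:
        reference = mean_vec(per_traj_grads)
    sims = []
    for start in range(0, len(per_traj_grads), size):
        sub_grad = mean_vec(per_traj_grads[start:start + size])
        sims.append(cosine(sub_grad, reference))
    return sum(sims) / len(sims)

# Made-up per-trajectory gradients: a shared component plus alternating noise.
grads = [[1.0, 1.0], [1.0, -1.0], [1.0, 1.0], [1.0, -1.0]]
for k in range(3):
    print(k, subbatch_similarity(grads, k))
```

In this toy batch the noise cancels within every pair, so the similarity rises from about 0.707 at `k = 0` to 1 at `k = 1`.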

4.2. SMALL MOLECULE SYNTHESIS

We use SubTB($\lambda$) to train models on the molecule generation task of Bengio et al. (2021a). The task is to generate binders of the sEH (soluble epoxide hydrolase) protein, based on a docking prediction (Trott & Olson, 2010). To be precise, molecules are generated by sequentially joining 'blocks' from a fixed library to the partial molecular graph (Jin et al., 2020; Kumar et al., 2012), resulting in a state space of estimated size $10^{12}$. The reward function is given by a pretrained proxy model made available by Bengio et al. (2021a). To adjust the greediness of the agent, an inverse temperature hyperparameter $\beta$ is used, i.e., the reward used for training is $R(x) = \tilde{R}(x)^\beta$, where $\tilde{R}(x)$ is the proxy's prediction.

We train models with the DB, TB, and SubTB($\lambda$) objectives, with four values each of $\lambda$, $\beta$, and learning rate, averaging the results over 3 random runs for each setting. We measure how well the trained models match the target distribution by the correlation of $\log R(x)$ and $\log p_\theta(x)$, the log-probability assigned to $x$ by the GFlowNet, computed on a held-out set of terminal states $x$. The results are shown in Fig. 6. SubTB($\lambda$), in particular with $\lambda = 1$, performs better than both DB and TB when the optimal hyperparameters (learning rate $\alpha$ and inverse temperature $\beta$) are used (solid lines) and is far more robust to the choice of hyperparameters (dashed lines). Additional details can be found in §B.
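The evaluation metric described above can be sketched as follows. The text does not state which correlation coefficient is used, so the Pearson coefficient here is an assumption, as are the function name and the made-up proxy scores.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up proxy predictions for four held-out terminal states.
proxy_scores = [0.5, 1.0, 2.0, 4.0]
beta = 4.0

# Tempered log-reward: log R(x) = beta * log(proxy prediction).
log_R = [beta * math.log(r) for r in proxy_scores]

# A sampler that matches the target exactly has log p(x) = log R(x) - log Z.
log_Z = math.log(sum(math.exp(v) for v in log_R))
log_p_exact = [v - log_Z for v in log_R]
print(pearson(log_R, log_p_exact))
```

Pearson correlation is unchanged by shifting or positively rescaling $\log p_\theta(x)$, so under this assumption a correlation of 1 is necessary for matching the target but does not by itself establish it.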

4.3. SEQUENCE GENERATION

We consider three sequence generation tasks in which sequences are generated left to right, with each action appending one symbol from a vocabulary to a partial sequence: a synthetic task with varying sequence lengths and vocabulary sizes (§4.3.1), a practical biological sequence design task (§4.3.2), and a new protein design task with longer sequences (§4.3.3). For all three tasks, we consider the baselines Soft Actor-Critic (Haarnoja et al., 2018; Christodoulou, 2019), A2C with entropy regularization (Williams & Peng, 1991; Mnih et al., 2016), and MARS-like MCMC (Xie et al., 2021), and compare them with three GFlowNet training objectives: TB, FM, and SubTB($\lambda$). In §F, we also study a non-autoregressive sequence generation problem (inverse protein folding).

4.3.1. BIT SEQUENCES

We consider the synthetic sequence generation setting from Malkin et al. (2022), where the goal is to generate sequences of bits of fixed length $n = 120$. The reward is specified by a set of modes $M \subset \mathcal{X} = \{0, 1\}^n$ that is unknown to the learning agent. The reward of a generated sequence $x$ is defined in terms of Hamming distance $d$ from the modes: $R(x) = \exp(-\min_{y \in M} d(x, y))$. The vocabulary size can be varied: for any integer $k$ dividing 120, we take a vocabulary consisting of words of length $k$ (so that the vocabulary size is $2^k$ and the full sequence is generated in $n/k$ actions). By varying the value of $k$ and keeping $n$ and $M$ constant, we study the behavior of learning agents with varying action space sizes and trajectory lengths without changing the underlying modeling problem. Most experiment settings are taken from Malkin et al. (2022); see §C.

Models are evaluated by computing the Spearman correlation, on a test set of sequences $x$, between the probability of generating $x$ and the reward $R(x)$. We also track the number of modes discovered during the training process for all the methods; see Fig. 7. We find that models trained with the SubTB($\lambda$) objective have a higher Spearman correlation at the end of training and discover modes faster compared to the other GFlowNet objectives and non-GFlowNet baselines.

Figure 7 caption: Training with SubTB($\lambda$) leads to policies that have the highest correlation with the reward across all lengths and vocabulary sizes. Right: For $k = 1$, the number of modes discovered by each method over the course of training is plotted. SubTB($\lambda$) discovers more modes faster.

4.3.2. ANTIMICROBIAL PEPTIDE GENERATION

Next, we consider the task of generating peptides with antimicrobial properties (AMPs). These sequences have maximum length 60 and use a vocabulary of 20 amino acids (and an end-of-sequence token), resulting in a state space of size $21^{60}$. The reward function is a pretrained proxy neural network that estimates the antimicrobial activity. (See Jain et al. (2022) for details on this task.)

We train GFlowNets with the SubTB($\lambda$), TB, and FM losses and compare them with baselines. To evaluate the trained models, we sample 2048 sequences from the policy, then compute the mean reward and mean pairwise edit distance of the 100 highest-reward sequences. The metrics and model architecture are taken from Malkin et al. (2022); see §D. The results are presented in Table 2. SubTB($\lambda$) provides significant improvements over all the baselines (including the TB and FM GFlowNets) in both reward and diversity.

4.3.3. FLUORESCENT PROTEIN GENERATION

We consider the task of generating protein sequences with fluorescence properties (Trabucco et al., 2022) to evaluate SubTB($\lambda$) in settings with longer trajectories. In this task, sequences have a fixed length of 237, and the size of the state space is $20^{237}$. The proxy reward function $R(x)$ is trained on a dataset of proteins with their fluorescence scores from Sarkisyan et al. (2016). The metrics and models are the same as in §4.3.2; see §E for details.

The GFlowNet objectives outperform all other methods in both metrics, finding more diverse and higher-reward sequences (Table 3). SubTB($\lambda$) significantly outperforms TB, while achieving a similar diversity. We note that the advantage of SubTB($\lambda$) is greater than that in the AMP task (Table 2) and speculate that the benefits of SubTB($\lambda$) become more prominent for longer action sequences.
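The two evaluation metrics used in §4.3.2 and §4.3.3 (mean reward and mean pairwise edit distance of the highest-reward samples) can be sketched as follows. The function names, the toy reward, and the toy samples are assumptions for illustration; whether duplicate sequences are removed before ranking is not stated in the text, and this sketch keeps them.

```python
from itertools import combinations

def edit_distance(a, b):
    """Levenshtein distance between two sequences (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]

def top_k_metrics(samples, reward_fn, k=100):
    """Mean reward and mean pairwise edit distance of the k highest-reward samples."""
    top = sorted(samples, key=reward_fn, reverse=True)[:k]
    mean_reward = sum(reward_fn(s) for s in top) / len(top)
    pairs = list(combinations(top, 2))
    if not pairs:
        return mean_reward, 0.0
    diversity = sum(edit_distance(a, b) for a, b in pairs) / len(pairs)
    return mean_reward, diversity

# Toy usage with a made-up reward (number of 'A' symbols) and k = 2.
toy_samples = ["AAA", "AAB", "BBB", "C"]
print(top_k_metrics(toy_samples, lambda s: s.count("A"), k=2))
```

Reporting both numbers together guards against a sampler that achieves high reward by collapsing onto a few near-identical sequences.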

5. DISCUSSION AND CONCLUSION

We have given evidence of a bias-variance tradeoff in GFlowNet training algorithms. The high-variance stochastic regression objective of TB and the low-variance local consistency objective of DB lie at opposite ends of this range. We showed that SubTB($\lambda$) can harness the variance-reducing effects of local objectives while retaining the fast credit assignment properties of trajectory-level objectives. We see learnable strategies for selecting and weighting (sub)trajectories for training (e.g., a dynamic choice of $\lambda$ and an active-learning approach to sampling trajectories) as the most interesting questions for future work. The ability of subtrajectory objectives to learn from incomplete episodes also makes their application in RL environments an appealing research direction.

REPRODUCIBILITY STATEMENT

We provide extensive experiment details, such as learning rates, batch sizes, number of training steps, choices of 𝜆, description of attempted hyperparameters, and additional clarifying experiments in the Appendices. Code for experiments on the hypergrid domain ( §4.1) and on the molecule domain ( §4.2) is also provided with the submission. 

A EXPERIMENT DETAILS: HYPERGRID

The environment is identical to that in Malkin et al. (2022), with reward function parameters $(R_0, R_1, R_2) = (10^{-3}, 0.5, 2)$ for the standard variant of the grid and $(10^{-4}, 1.0, 3.0)$ for the harder variant. The models giving logits of $P_F(\cdot \mid s)$ and $P_B(\cdot \mid s)$, as well as $\log F(s)$, are MLPs of the same architecture as in Bengio et al. (2021a), taking a one-hot representation of the coordinates of $s$ as input and sharing all layers except the last. The initial state flow $\log Z = \log F(s_0)$ is an independent parameter whose learning rate is set to 10× the learning rate of other parameters. All models are trained with the Adam optimizer and a batch size of 16 for a total of $10^6$ trajectories (62500 batches). The optimal learning rate for each experiment is chosen from {0.0005, 0.00075, 0.001, 0.003, 0.005, 0.0075, 0.01}, and $\lambda = 0.9$ is chosen as the optimal value from the set {0.8, 0.9, 0.99}.

Gradient bias and variance experiments are conducted in the harder variant of the $8 \times 8$ grid. The tabular GFlowNet is trained using Adam with a learning rate of 0.007 and the SubTB($\lambda = 0.8$) objective.
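The hypergrid evaluation metric from §4.1, the $L_1$ distance between the target distribution and the empirical distribution of recently visited terminal states, can be sketched as follows. The function name and the two-state example are assumptions for illustration. The sketch uses an unnormalized sum over states and assumes every visited state is a key of the reward table; any normalization used for the plots is not stated in the text.

```python
from collections import Counter

def l1_to_target(rewards, visited):
    """L1 distance between R(x)/Z and the empirical distribution of `visited`.

    rewards: dict mapping every terminal state to its reward R(x);
    visited: list of terminal states seen in training (e.g. the most recent ones).
    """
    Z = sum(rewards.values())
    counts = Counter(visited)
    n = len(visited)
    return sum(abs(r / Z - counts[x] / n) for x, r in rewards.items())

# Made-up two-state example: the target distribution is (0.25, 0.75).
toy_rewards = {"a": 1.0, "b": 3.0}
print(l1_to_target(toy_rewards, ["a", "b", "b", "b"]))  # samples match the target
print(l1_to_target(toy_rewards, ["a", "a"]))            # samples miss state "b"
```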

A.2 MORE ON BIAS AND VARIANCE: THE EFFECT OF LEARNED STATE FLOWS

To better understand the variance-reducing properties of SubTB(𝜆), we perform the gradient bias experiments with a modified computation of gradients that removes the effect of learning the state flows. Recall from §2.1 that a forward policy 𝑃_𝐹 uniquely determines a distribution over trajectories. If the initial state flow 𝑍 and forward policy 𝑃_𝐹 are fixed, there is a unique state flow function 𝐹_𝐹 and backward policy 𝑃_𝐵 that satisfy the detailed balance conditions (6). This 'true forward' flow function, written 𝐹_𝐹(𝑠) = 𝑍 ∑_{𝜏: 𝑠∈𝜏} 𝑃_𝐹(𝜏), is determined by an initial state flow fixed to the true

To define an exploratory training policy, we set the random action probability to 0.01, selected from {0.0001, 0.0005, 0.001, 0.01}, and the reward exponent 𝛽 (having the same meaning as in §4.2) to 3, selected from {2, 3, 4}. For trajectory balance we use a learning rate of 5 × 10⁻³, selected from {10⁻⁵, 10⁻⁴, 5 × 10⁻⁴, 10⁻³, 5 × 10⁻³}, for the flow parameters and 1 × 10⁻² for log 𝑍. For SubTB(𝜆), we select 𝜆 from {0.7, 0.8, 0.9, 0.99} and find 𝜆 = 0.99 to perform best. For TB and SubTB(𝜆), we tune the learning rate for the forward logits over {0.0001, 0.0003, 0.0005, 0.00075, 0.001}. For log 𝑍, we use a learning rate of 10× the learning rate for the forward logits. For FM we use a learning rate of 10⁻³, selected from {10⁻⁵, 10⁻⁴, 5 × 10⁻⁴, 10⁻³, 5 × 10⁻³}, with leaf loss coefficient 𝜆_𝑇 = 30. For A2C with entropy regularization we share parameters between the actor and critic networks and use a learning rate of 5 × 10⁻³, selected from {10⁻⁵, 10⁻⁴, 5 × 10⁻⁴, 10⁻³, 5 × 10⁻³}, with entropy regularization coefficient 5 × 10⁻², selected from {10⁻⁴, 10⁻³, 5 × 10⁻³, 10⁻², 5 × 10⁻²}. For SAC we use the formulation in Christodoulou (2019) with a learning rate of 10⁻³, selected from {10⁻⁵, 10⁻⁴, 5 × 10⁻⁴, 10⁻³, 5 × 10⁻³}, a target network update frequency of 400, and 200 initial random steps.
For the MARS baseline, we set the learning rate to 5 × 10⁻⁴, selected from {10⁻⁵, 10⁻⁴, 5 × 10⁻⁴, 10⁻³, 5 × 10⁻³}. We run the experiments on 3 seeds and report the mean and standard error over the three runs in Table 3.
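The exploratory training policy described above can be sketched as follows. The sketch assumes that the random action probability mixes the learned forward policy with a uniform policy over actions, and that the reward exponent 𝛽 raises the reward to a power, 𝑅(𝑥)^𝛽. The function names are hypothetical and the actual implementation may differ.

```python
import random


def sample_exploratory_action(policy_probs, random_action_prob, rng):
    """With probability random_action_prob pick a uniform action, else follow the policy."""
    n = len(policy_probs)
    if rng.random() < random_action_prob:
        return rng.randrange(n)
    return rng.choices(range(n), weights=policy_probs, k=1)[0]


def tempered_reward(reward, beta):
    """Reward raised to the exponent beta (assumed meaning of the reward exponent)."""
    return reward ** beta


rng = random.Random(0)
policy = [0.0, 1.0, 0.0]  # toy policy that always prefers action 1
on_policy = [sample_exploratory_action(policy, 0.0, rng) for _ in range(100)]
explore = [sample_exploratory_action(policy, 1.0, rng) for _ in range(300)]
```

With the value 0.01 quoted in the text, about one action in a hundred would be drawn uniformly under this assumed scheme.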

GENERATION

We consider the inverse protein folding problem suggested in Sinai et al. (2020). A target protein 3D backbone conformation is given, and the task is to sample amino acid sequences of a fixed length 𝐿 = 40 from the Boltzmann distribution corresponding to their energy in the target conformation. The energy is provided by a physics model (Rohl et al., 2004; Chaudhury et al., 2010). The policy model is a 3-layer convolutional architecture that closely follows previous work (Sinai et al., 2020). Specifically, for the policy function, the convolution size was set to 7 with 32 hidden features and ReLU activation in each layer. The policy network has one additional convolutional layer of size 20 (the number of amino acids) without an activation function. The flow network has two additional linear layers of sizes [1280, 64] and [64, 1] with ReLU activation in between. We report the mean result over three runs. For this task, rather than generating sequences from left to right, we consider an action space in which actions modify one letter at a time at arbitrary positions. The first action samples an amino acid sequence uniformly at random. On each subsequent action, the agent selects a position in the sequence and replaces the letter in this position with another letter in the vocabulary. Generation terminates after exactly 𝑁 = 40 replacement steps. The forward policy is conditioned on the number of steps taken so far in the trajectory; the backward policy is fixed to be uniform over the 𝑁 · 𝐿 actions. As a metric of how well the learned model matches the target distribution, we measure the correlation between log 𝑅(𝑥) and the marginal sampling likelihood log 𝑝_𝜃(𝑥) on a held-out set of terminal



When a batch of trajectories is used for training, the convex combination weights may either be normalized over all subtrajectories of all trajectories in the batch, or normalized independently over the subtrajectories of each trajectory. For consistency, we choose the first option for the experiments in this paper.

Such an evaluation is possible in this synthetic environment because the exact target distribution function can be tractably computed. Note that the metric shown in Fig. differs from what is called '𝐿₁ distance' in past work, as we do not divide by the total number of states.

Comparing the exact sampling and target distributions, as in §4.1, is not possible here, since we cannot enumerate all terminal states. However, the marginal likelihood that a trained GFlowNet generates a given 𝑥 is tractable to compute by dynamic programming. For a model that samples perfectly from the target distribution, log 𝑅(𝑥) and log 𝑝_𝜃(𝑥) would differ by a constant log 𝑍 independent of 𝑥 and thus be perfectly correlated.
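The two options for normalizing the subtrajectory weights over a batch, noted above, can be made concrete with a small sketch. It assumes that a subtrajectory spanning states 𝑖 < 𝑗 receives unnormalized weight 𝜆^(𝑗−𝑖), as in the SubTB(𝜆) objective; the function names are hypothetical.

```python
def raw_weights(n_steps, lam):
    """Unnormalized weights lam**(j - i) for all 0 <= i < j <= n_steps."""
    return {(i, j): lam ** (j - i)
            for i in range(n_steps) for j in range(i + 1, n_steps + 1)}


def normalize_per_batch(lengths, lam):
    """One normalizer shared by all subtrajectories of all trajectories."""
    all_w = [raw_weights(n, lam) for n in lengths]
    total = sum(sum(w.values()) for w in all_w)
    return [{k: v / total for k, v in w.items()} for w in all_w]


def normalize_per_trajectory(lengths, lam):
    """A separate normalizer for each trajectory."""
    out = []
    for n in lengths:
        w = raw_weights(n, lam)
        total = sum(w.values())
        out.append({k: v / total for k, v in w.items()})
    return out


lengths = [2, 5]  # toy batch: trajectories with 2 and 5 transitions
batch_w = normalize_per_batch(lengths, 0.9)
traj_w = normalize_per_trajectory(lengths, 0.9)
```

In this sketch, batch-level normalization gives longer trajectories a larger share of the total weight, whereas per-trajectory normalization gives every trajectory the same total weight.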



Figure 3: 𝐿₁ distance between empirical and target distributions over the course of training on the hypergrid environment. SubTB(𝜆 = 0.9) consistently gives faster convergence than TB, the strongest objective from past work, on all grid sizes. The difference is especially visible for the harder variant of the reward function (last row). The 𝑥-axis is the cumulative number of training trajectories (episodes).

Figure 2: Distribution of 2 × 10 5 samples from GFlowNets trained on the harder variant of the 32 × 32 grid with TB and SubTB(𝜆) objectives.

Figure 4: Mean cosine similarity between small-batch (2^𝑘) and large-batch (1024) gradients at selected training iterations. Left: Small-batch vs. large-batch gradients of DB, SubTB(𝜆), and TB objectives. Right: Small-batch DB, SubTB(𝜆), and TB gradients vs. large-batch TB gradient.

4.1.1 A CLOSER LOOK AT GRADIENT VARIANCE

Fig. 5 (top) shows the metric at 𝑘 = 6 (corresponding to the batch size of 64 used for training) over the course of training. We see that the DB gradient has the highest self-consistency at all iterations, TB has the lowest, and SubTB(𝜆 = 0.8) is in between.
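One way to compute such a self-consistency metric is sketched below, assuming access to per-trajectory gradient vectors. The averaging over several random small batches, the toy gradients, and all function names are assumptions made for illustration, not the exact procedure used in the experiments.

```python
import math
import random


def mean_vector(vectors):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def mean_cosine_similarity(per_traj_grads, k, n_repeats, rng):
    """Average cosine between gradients of random batches of size 2**k
    and the gradient of the full (large) batch."""
    large = mean_vector(per_traj_grads)
    sims = []
    for _ in range(n_repeats):
        small = mean_vector(rng.sample(per_traj_grads, 2 ** k))
        sims.append(cosine(small, large))
    return sum(sims) / len(sims)


rng = random.Random(0)
# Toy per-trajectory gradients: a shared signal plus independent noise.
grads = [[1.0 + rng.gauss(0, 2.0), -1.0 + rng.gauss(0, 2.0)] for _ in range(1024)]
sim_small = mean_cosine_similarity(grads, k=0, n_repeats=200, rng=rng)
sim_large = mean_cosine_similarity(grads, k=6, n_repeats=200, rng=rng)
```

In this toy setting the similarity grows with 𝑘 because averaging over more trajectories suppresses the noise; a lower-variance objective would reach high similarity at smaller 𝑘.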

Figure 6: Correlation between marginal sampling log-likelihood and log-reward on the molecule task. For each hyperparameter setting on the 𝑥-axis, we plot the best result over choices of the other hyperparameter(s) (𝛼 in the left plot, 𝛽 in the centre plot, and both 𝛼 and 𝛽 in the right plot) with a solid line. The mean result over values of other hyperparameter(s) is plotted with a dashed line.

Figure 7: Left: For the number of bits 𝑘 ∈ {1, 2, 4, 6, 8, 10} in each vocabulary token, we plot the Spearman correlation between the sampling probability and reward on a test set for each method. Training with SubTB(𝜆) leads to policies that have the highest correlation with the reward across all lengths and vocabulary sizes. Right: For 𝑘 = 1, the number of modes discovered by each method over the course of training is plotted. SubTB(𝜆) discovers more modes faster.

Figure A.1: Additional results for hypergrid experiments. Above: The evolution of the 𝐿₁ distance between empirical sampling and target distributions on the harder variants of 4-dimensional grids, in the same format as Fig. 3. Below: The number of cumulative distinct terminal states visited as a function of training time on the standard 2-dimensional grid. Models trained with SubTB(𝜆) discover more states faster.

Fig. A.1 shows additional results on more difficult grid environments. We perform another experiment in which only short (up to length 4) subtrajectories are used for training with the SubTB(𝜆) objective (i.e., the sum in (11) is truncated to exclude pairs (𝑖, 𝑗) with 𝑗 − 𝑖 > 4). The results, shown in Fig. A.4, show that SubTB(𝜆) continues to perform strongly in this restricted setting.
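The restriction to short subtrajectories can be illustrated with a minimal sketch of a SubTB(𝜆)-style loss for a single trajectory. It assumes per-state log-flows (with the terminal log-flow set to the log-reward), per-transition forward and backward log-probabilities, and weights proportional to 𝜆^(𝑗−𝑖). The `max_len` argument and all names are hypothetical.

```python
import math


def subtb_loss(log_flow, log_pf, log_pb, lam, max_len=None):
    """Weighted mean of squared subtrajectory-balance residuals for one trajectory.

    log_flow[i]: log F(s_i) for i = 0..n (log_flow[n] plays the role of log R(x)).
    log_pf[i], log_pb[i]: log P_F(s_{i+1}|s_i) and log P_B(s_i|s_{i+1}).
    max_len: if given, skip pairs (i, j) with j - i > max_len.
    Returns (loss, number of subtrajectories used).
    """
    n = len(log_pf)
    total, weight_sum, n_pairs = 0.0, 0.0, 0
    for i in range(n):
        for j in range(i + 1, n + 1):
            if max_len is not None and j - i > max_len:
                continue
            residual = (log_flow[i] + sum(log_pf[i:j])
                        - log_flow[j] - sum(log_pb[i:j]))
            w = lam ** (j - i)
            total += w * residual ** 2
            weight_sum += w
            n_pairs += 1
    return total / weight_sum, n_pairs


# Toy trajectory with 6 transitions and flows that satisfy detailed balance.
log_pf = [math.log(p) for p in (0.5, 0.25, 0.5, 0.2, 0.5, 0.1)]
log_pb = [math.log(p) for p in (1.0, 0.5, 1.0, 0.5, 1.0, 0.5)]
log_flow = [0.0]
for f, b in zip(log_pf, log_pb):
    log_flow.append(log_flow[-1] + f - b)

full_loss, full_pairs = subtb_loss(log_flow, log_pf, log_pb, lam=0.9)
short_loss, short_pairs = subtb_loss(log_flow, log_pf, log_pb, lam=0.9, max_len=4)
```

With 6 transitions there are 21 subtrajectories in total, of which 18 have length at most 4, so the truncation removes only the longest few terms in this example.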

Figure A.2: Empirical 𝐿₁ curves on the 8 × 8 grid for varying values of 𝜆.

Fig. A.2 shows the effect of the SubTB parameter 𝜆 on the training curves, showing a gradual interpolation between DB and TB and fastest convergence at values slightly less than 1.

Fig. A.3 contains visualizations of the exploration behavior of different training algorithms. It shows that TB can perform better with off-policy training and can benefit from a higher temperature of the policy logits, but still does not learn as fast as SubTB(𝜆), nor does it find all the modes in the maximum number of training iterations.

Figure F.1: The Spearman correlation between the sampling probability and reward on a test set is plotted over the course of training for each value of 𝜆.

Summary of GFlowNet training objectives.

Results on the AMP generation task (mean and standard error over 3 runs).

Results on the GFP generation task (mean and standard error over 3 runs).


partition function 𝑍 = ∑_{𝑥∈X} 𝑅(𝑥) and the learned forward policy 𝑃_𝐹. Similarly, the 'true backward' flow function, written 𝐹_𝐵(𝑠) = ∑_{𝜏: 𝑠∈𝜏} 𝑃_𝐵(𝜏) 𝑅(𝑥_𝜏), where 𝑥_𝜏 is the terminal state of 𝜏, is uniquely determined by the reward function 𝑅 and the learned backward policy 𝑃_𝐵. In particular, 𝐹_𝐵(𝑠₀) = ∑_{𝑥∈X} 𝑅(𝑥). We repeat the experiments on gradient bias, replacing the learned state flows 𝐹 in the losses by either the true forward or the true backward state flows (𝐹_𝐹 or 𝐹_𝐵, respectively), computed exactly using the current values of the learned 𝑃_𝐹 and 𝑃_𝐵. (These modifications are not applied in training, but are used only to compute the gradient similarities. The small size of the environment makes computation of the true state flows tractable; this is not possible in general.) The gradient similarity over the course of training is shown in Fig. A.5 (cf. Fig. 5 in the main text). The similar behavior of SubTB(𝜆) with learned and true forward state flows suggests that the learned state flows remain close enough to their optimal values and that the variance-reducing benefits of SubTB(𝜆) with true state flows are retained.
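For a small environment, both 'true' state flows can be computed by dynamic programming over the state graph. The sketch below uses a hypothetical five-state DAG in which the terminal states are the sinks; the graph, the policies, and the function names are illustrative assumptions rather than the setup used in the experiments.

```python
def true_forward_flows(order, p_f, z):
    """F_F(s) = Z * (total P_F-probability of trajectories passing through s)."""
    flow = {s: 0.0 for s in order}
    flow[order[0]] = z
    for s in order:  # topological order: parents are finished before children
        for (u, v), p in p_f.items():
            if u == s:
                flow[v] += flow[s] * p
    return flow


def true_backward_flows(order, p_b, reward):
    """F_B(s) = sum over trajectories through s of P_B(tau) * R(x_tau)."""
    flow = {s: reward.get(s, 0.0) for s in order}
    for s in reversed(order):  # children are finished before parents
        for (u, v), p in p_b.items():  # p = P_B(u | v) for the edge u -> v
            if v == s:
                flow[u] += flow[s] * p
    return flow


# Hypothetical DAG: 0 -> {1, 2}, 1 -> {3, 4}, 2 -> 3; states 3 and 4 are terminal.
order = [0, 1, 2, 3, 4]
reward = {3: 3.0, 4: 1.0}
p_f = {(0, 1): 0.5, (0, 2): 0.5, (1, 3): 0.5, (1, 4): 0.5, (2, 3): 1.0}
p_b = {(0, 1): 1.0, (0, 2): 1.0, (1, 3): 1 / 3, (2, 3): 2 / 3, (1, 4): 1.0}

f_f = true_forward_flows(order, p_f, z=sum(reward.values()))
f_b = true_backward_flows(order, p_b, reward)
```

In this toy example the two flows coincide at every state because the chosen 𝑃_𝐹 samples terminal states in proportion to their reward and 𝑃_𝐵 is the matching backward policy; for an imperfectly trained model they would generally differ.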

B EXPERIMENT DETAILS: MOLECULES

All experiments with SubTB(𝜆) are based upon the published code of Malkin et al. (2022), which extends that of Bengio et al. (2021a). The proxy model giving the reward, the held-out set of molecules used to compute the correlation metric, and the GFlowNet model architecture (a graph neural network) are identical to those in Bengio et al. (2021a), and the off-policy exploration rate and early stopping likelihood are the same as those tuned for the training with the TB objective in (Some training runs terminated early because of numerical overflows in the gradients, in which case we report the metric of the last stable model whose cumulative number of batches is a multiple of 5000.)

C EXPERIMENT DETAILS: BIT SEQUENCES

The modes 𝑀 as well as the test sequences are selected as described in Malkin et al. (2022). The policy for all methods is parameterized by a Transformer (Vaswani et al., 2017) with 3 layers, dimension 64, and 8 attention heads. All methods are trained for 50,000 iterations with a minibatch size of 16 using the Adam optimizer. For GFlowNets with the FM objective as well as the baselines, we use the exact same implementation and hyperparameters reported in Malkin et al. (2022). For TB and SubTB(𝜆), we pick the best learning rate from {0.0075, 0.001, 0.001, 0.003, 0.005} for the forward logits and, for 𝑍, use a learning rate of 10× the learning rate for the forward logits. For SubTB(𝜆), we found the best 𝜆 value of 1.9 from the values {0.8, 0.9, 1.1, 1.3, 1.5, 1.7, 1.9, 2.0}.

D EXPERIMENT DETAILS: ANTIMICROBIAL PEPTIDE GENERATION

Following Malkin et al. (2022), we use the following amino acids: ['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y']. We take 6438 known AMP sequences and 9522 non-AMP sequences from the DBAASP database (Pirtskhalava et al., 2021). The classifier that serves as the proxy reward function is trained on this dataset, using 20% of the data as the validation set. The reward model is a Transformer with 4 hidden layers, hidden dimension 64, and 8 attention heads. We train it with a minibatch size of 256 and a learning rate of 10⁻⁴, with early stopping on the validation set. The policy architecture for all methods is a Transformer with 3 hidden layers, hidden dimension 64, and 8 attention heads. All methods are trained for 20,000 iterations with a minibatch size of 16, using the reported hyperparameters for all the baselines from Malkin et al. (2022). For TB and SubTB(𝜆), we pick the best learning rates from {0.005, 0.007, 0.01, 0.03, 0.05, 0.07} for the forward logits and from {0.007, 0.01, 0.03, 0.05} for log 𝑍. For SubTB(𝜆), the best-performing 𝜆 value of 1.9, chosen from {0.9, 0.99, 1.1, 1.2, 1.3, 1.4, 1.6, 1.7, 1.8, 1.9, 2.0}, is used.

E EXPERIMENT DETAILS: FLUORESCENT PROTEIN GENERATION

We consider a variant of the GFP task from Trabucco et al. (2022). The vocabulary of amino acids is the same as in §D: ['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y']. Following Trabucco et al. (2022), we consider the dataset of 56,086 proteins from Sarkisyan et al. (2016), processed based on Brookes et al. (2019). Each protein is accompanied by a score quantifying its fluorescence. As with the AMP data, we keep 20% of the data as a validation set used for early stopping. The regressor trained on this dataset is a Transformer with 4 hidden layers, hidden dimension 64, and 8 attention heads. We train it with a minibatch size of 256 and a learning rate of 10⁻⁴, with early stopping on the validation set. The policy architecture for all methods is a Transformer with 3 hidden layers, hidden dimension 64, and 8 attention heads. All methods are trained for 20,000 iterations with a minibatch size of 16. We use the same implementation for all methods as in §D.

