LEARNING GFLOWNETS FROM PARTIAL EPISODES FOR IMPROVED CONVERGENCE AND STABILITY

Anonymous authors
Paper under double-blind review

Abstract

Generative flow networks (GFlowNets) are a family of algorithms for training a sequential sampler of discrete objects under an unnormalized target density and have been successfully used for various probabilistic modeling tasks. Existing training objectives for GFlowNets are either local to states or transitions, or propagate a reward signal over an entire sampling trajectory. We argue that these alternatives represent opposite ends of a gradient bias-variance tradeoff and propose a way to exploit this tradeoff to mitigate its harmful effects. Inspired by the TD(𝜆) algorithm in reinforcement learning, we introduce subtrajectory balance or SubTB(𝜆), a GFlowNet training objective that can learn from partial action subsequences of varying lengths. We show that SubTB(𝜆) accelerates sampler convergence in previously studied and new environments and enables training GFlowNets in environments with longer action sequences and sparser reward landscapes than what was possible before. We also perform a comparative analysis of stochastic gradient dynamics, shedding light on the bias-variance tradeoff in GFlowNet training and the advantages of subtrajectory balance.

1. INTRODUCTION

Generative flow networks (GFlowNets; Bengio et al., 2021a) are generative models that construct objects lying in a target space X by taking sequences of actions sampled from a learned policy. GFlowNets are trained so as to make the probability of sampling an object 𝑥 ∈ X proportional to a given nonnegative reward 𝑅(𝑥). GFlowNets' use of a parametric policy that can generalize to states not seen during training makes them a competitive alternative to methods based on local exploration in various probabilistic modeling tasks (Bengio et al., 2021a; Malkin et al., 2022; Zhang et al., 2022; Jain et al., 2022; Deleu et al., 2022). GFlowNets solve the variational inference problem of approximating a target distribution over X with the distribution induced by the sampling policy, and they are trained by algorithms reminiscent of reinforcement learning (although GFlowNets model the diversity present in the reward distribution, rather than maximizing reward by seeking its mode).

In most past works (Bengio et al., 2021a; Malkin et al., 2022; Zhang et al., 2022; Jain et al., 2022), GFlowNets are trained by exploratory sampling from the policy and receive their training signal from the reward of the sampled object. The flow matching (FM) and detailed balance (DB) learning objectives for GFlowNets proposed in Bengio et al. (2021a;b) resemble temporal difference learning (Sutton & Barto, 2018). A third objective, trajectory balance (TB), was proposed in Malkin et al. (2022) to address the problem of slow temporal credit assignment with the FM and DB objectives. The TB objective propagates learning signals over entire episodes, while the temporal difference-like objectives (FM and DB) make updates local to states or actions. It has been hypothesized by Malkin et al. (2022) that the improved credit assignment with TB comes at the cost of higher gradient variance, analogous to the bias-variance tradeoff seen in temporal difference learning (TD(𝑛) or TD(𝜆)) with different eligibility trace schemes (Sutton & Barto, 2018; Kearns & Singh, 2000; van Hasselt et al., 2018; Bengio et al., 2020). This hypothesis is one of the starting points for the present paper.

In this paper, we propose a new learning objective for GFlowNets, called subtrajectory balance (SubTB, or SubTB(𝜆) when its real-valued hyperparameter 𝜆 is specified). Building upon theoretical results of Bengio et al. (2021b) and Malkin et al. (2022), we show how the SubTB(𝜆) objective allows the flexibility of learning from partial experiences of any length. Experiments on two synthetic and four real-world domains support the following empirical claims:


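To make the idea concrete, the following is a minimal numerical sketch (not the paper's implementation) of a SubTB(𝜆)-style loss over one sampled trajectory. It assumes per-state log-flow estimates `log_F`, forward-policy log-probabilities `log_PF`, and backward-policy log-probabilities `log_PB` are given as arrays; each subtrajectory contributes a squared log-balance residual, with a geometric weight 𝜆^(length). The function name and the exact normalization are illustrative choices.

```python
import numpy as np

def subtb_lambda_loss(log_F, log_PF, log_PB, lam=0.9):
    """Illustrative SubTB(lambda) loss over a trajectory s_0 -> ... -> s_n.

    log_F:  length n+1, log state-flow estimates log F(s_t)
    log_PF: length n, log forward probs log P_F(s_{t+1} | s_t)
    log_PB: length n, log backward probs log P_B(s_t | s_{t+1})
    lam:    geometric weighting over subtrajectory lengths

    Each subtrajectory (s_i, ..., s_j) contributes the squared residual of
    the balance condition
        log F(s_i) + sum log P_F  =  log F(s_j) + sum log P_B,
    weighted by lam**(j - i); weights are normalized to sum to one.
    """
    n = len(log_PF)
    # Prefix sums let each subtrajectory's log-prob sum be read off in O(1).
    cum_F = np.concatenate([[0.0], np.cumsum(log_PF)])
    cum_B = np.concatenate([[0.0], np.cumsum(log_PB)])
    total_w, total_loss = 0.0, 0.0
    for i in range(n):
        for j in range(i + 1, n + 1):
            resid = (log_F[i] + (cum_F[j] - cum_F[i])
                     - log_F[j] - (cum_B[j] - cum_B[i]))
            w = lam ** (j - i)
            total_w += w
            total_loss += w * resid ** 2
    return total_loss / total_w
```

Length-1 subtrajectories recover DB-like local updates, and the single length-𝑛 subtrajectory is the TB constraint over the whole episode; varying 𝜆 interpolates how much weight each extreme receives, which is the bias-variance knob discussed above. When the balance condition holds exactly for all subtrajectories, the loss is zero.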