GENERATIVE AUGMENTED FLOW NETWORKS

Abstract

The Generative Flow Network (GFlowNet; Bengio et al., 2021b) is a probabilistic framework in which an agent learns a stochastic policy for object generation, such that the probability of generating an object is proportional to a given reward function. Compared to reward-maximizing reinforcement learning methods, its effectiveness has been shown in discovering high-quality and diverse solutions. Nonetheless, GFlowNets learn only from the rewards of terminal states, which can limit their applicability. Indeed, intermediate rewards play a critical role in learning: for example, intrinsic motivation provides intermediate feedback even in particularly challenging sparse-reward tasks. Inspired by this, we propose Generative Augmented Flow Networks (GAFlowNets), a novel learning framework that incorporates intermediate rewards into GFlowNets. We specify intermediate rewards via intrinsic motivation to tackle the exploration problem in sparse-reward environments. GAFlowNets can leverage edge-based and state-based intrinsic rewards jointly to improve exploration. Based on extensive experiments on the GridWorld task, we demonstrate the effectiveness and efficiency of GAFlowNet in terms of convergence, performance, and diversity of solutions. We further show that GAFlowNet is scalable to a more complex and large-scale molecule generation domain, where it achieves consistent and significant performance improvements.

1. INTRODUCTION

Deep reinforcement learning (RL) has achieved significant progress in recent years, with particular success in games (Mnih et al., 2015; Silver et al., 2016; Vinyals et al., 2019). RL methods applied to settings where a reward is given only at the end (i.e., at terminal states) typically aim at maximizing that reward function to learn the optimal policy. However, diversity of the generated states is desirable in a wide range of practical scenarios, including molecule generation (Bengio et al., 2021a), biological sequence design (Jain et al., 2022b), recommender systems (Kunaver & Požrl, 2017), dialogue systems (Zhang et al., 2020), etc. For example, in molecule generation, the reward function used in in-silico simulations can itself be uncertain and imperfect (compared to the more expensive in-vivo experiments). Therefore, it is not sufficient to search only for the solution that maximizes the return. Instead, it is desirable to sample many high-reward candidates, which can be achieved by sampling them proportionally to the reward of each terminal state. Interestingly, GFlowNets (Bengio et al., 2021a;b) learn a stochastic policy to sample composite objects x ∈ X with probability proportional to the return R(x). The learning paradigm of GFlowNets differs from that of other RL methods, as it explicitly aims at modeling the diversity of the target distribution, i.e., all the modes of the reward function. This makes it natural for practical applications where the model should discover objects that are both interesting and diverse, which has been a focus of previous GFlowNet works (Bengio et al., 2021a;b; Malkin et al., 2022; Jain et al., 2022b). Yet, GFlowNets learn only from the reward of the terminal state and do not consider intermediate rewards, which can limit their applicability, especially in more general RL settings. Rewards play a critical role in learning (Silver et al., 2021).
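The distinction between reward maximization and reward-proportional sampling can be made concrete with a toy example (the candidate objects and reward values below are our own illustration, not from the paper): a reward-maximizing policy collapses onto the single best object, while a GFlowNet-style sampler visits every positive-reward mode in proportion to its reward.

```python
import random
from collections import Counter

# Toy reward landscape: four candidate objects, two high-reward "modes".
rewards = {"mol_A": 10.0, "mol_B": 9.0, "mol_C": 1.0, "mol_D": 0.0}

# A reward-maximizing policy returns only the single argmax object.
best = max(rewards, key=rewards.get)

# A GFlowNet-style sampler draws objects with probability proportional
# to their reward: p(x) = R(x) / Z, where Z is the total reward mass.
Z = sum(rewards.values())
objects, probs = zip(*[(x, r / Z) for x, r in rewards.items()])

random.seed(0)
samples = random.choices(objects, weights=probs, k=10_000)
counts = Counter(samples)
# Both high-reward modes ("mol_A" and "mol_B") are sampled frequently,
# in proportion to R(x); the zero-reward object is never sampled.
```

Note that "mol_B" is nearly as well represented as "mol_A" under proportional sampling, whereas pure maximization would never return it at all; this is the diversity property motivating GFlowNets.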
The tremendous success of RL largely depends on reward signals that provide intermediate feedback. Even in environments with sparse rewards, RL agents can motivate themselves toward efficient exploration through intrinsic motivation, which augments the sparse extrinsic learning signal with a dense intrinsic reward at each step. Our focus in this paper is precisely on introducing such intermediate intrinsic rewards into GFlowNets, since they can be applied even in settings where the extrinsic reward is sparse (say, non-zero only at a few terminal states). Inspired by this missing element of GFlowNets, we propose a new GFlowNet learning framework that takes intermediate feedback signals into account to provide an exploration incentive during training. The notion of flow in GFlowNets (Bengio et al., 2021a;b) refers to a marginalized quantity that sums rewards over all downstream terminal states following a given state, while sharing that reward with other states leading to the same terminal states. Apart from the existing flows in the network, we introduce augmented flows as intermediate rewards. By considering intrinsic motivation as intermediate rewards, our framework is well suited to sparse-reward tasks, where the training of a GFlowNet can otherwise get trapped in a few modes, since it may be difficult for it to discover new modes beyond those it has already visited (Bengio et al., 2021b). We first propose an edge-based augmented flow, based on incorporating an intrinsic reward at each transition. However, we find that although it improves learning efficiency, it only performs local exploration and still lacks sufficient exploration ability to drive the agent to visit solutions with zero rewards. On the other hand, we find that incorporating intermediate rewards in a state-based manner (Bengio et al., 2021b) can empirically result in slower convergence and large bias, although it can explore more broadly.
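One common way to instantiate such a dense intrinsic reward is a novelty bonus based on the prediction error of a fixed, randomly initialized network (a random-network-distillation-style bonus). The sketch below is our own minimal illustration of this idea, not the paper's exact architecture: a trainable predictor tries to match a frozen random target embedding, and the residual error, which shrinks for frequently visited states, serves as the per-step intrinsic reward.

```python
import numpy as np

rng = np.random.default_rng(0)

D_STATE, D_EMB, LR = 8, 16, 0.05

W_target = rng.normal(size=(D_STATE, D_EMB))        # frozen random target network
W_pred = rng.normal(size=(D_STATE, D_EMB)) * 0.01   # trainable predictor network

def intrinsic_reward(s):
    """Squared embedding error: large for novel states, small for familiar ones."""
    return float(np.sum((s @ W_target - s @ W_pred) ** 2))

def update_predictor(s):
    """One gradient step on 0.5 * ||s @ W_pred - s @ W_target||^2."""
    global W_pred
    err = s @ W_pred - s @ W_target          # embedding error, shape (D_EMB,)
    W_pred = W_pred - LR * np.outer(s, err)  # exact gradient of the squared loss

# A unit-norm toy state: novel at first, familiar after repeated visits.
s = rng.normal(size=D_STATE)
s /= np.linalg.norm(s)

r_novel = intrinsic_reward(s)
for _ in range(50):
    update_predictor(s)
r_familiar = intrinsic_reward(s)  # bonus decays as the state becomes familiar
```

In a GFlowNet agent, such a bonus would be computed at each transition and supplied as the intermediate reward, so exploration pressure is strongest in rarely visited parts of the state space.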
Therefore, we propose a joint approach that takes both edge-based and state-based augmented flows into account. Our method improves the diversity of solutions and learning efficiency by reaping the best of both worlds. Extensive experiments on the GridWorld and molecule domains that have previously been used to benchmark GFlowNets corroborate the effectiveness of our proposed framework. The code is publicly available at https://github.com/ling-pan/GAFN. The main contributions of this paper are summarized as follows:
• We propose a novel GFlowNet learning framework, dubbed Generative Augmented Flow Networks (GAFlowNet), to incorporate intermediate rewards, which are represented by augmented flows in the flow network.
• We specify intermediate rewards by intrinsic motivation to address the exploration of the state space by GFlowNets in sparse-reward tasks. We theoretically prove that our augmented objective asymptotically yields an unbiased solution to the original formulation.
• We conduct extensive experiments on the GridWorld domain, demonstrating the effectiveness of our method in terms of convergence, diversity, and performance. Our method is also general, being applicable to different types of GFlowNets. We further extend our method to the larger-scale and more challenging molecule generation task, where it achieves consistent and substantial improvements over strong baselines.

2. BACKGROUND

Consider a directed acyclic graph (DAG) G = (S, A), where S denotes the state space and A ⊆ S × S represents the action space; state-action pairs correspond to edges. We denote by s_0 ∈ S the initial state, which has no incoming edges, while the vertex s_f without outgoing edges is called the sink state. The goal of GFlowNets is to learn a stochastic policy π that constructs discrete objects x ∈ X with probability proportional to the reward function R : X → R_{≥0}, i.e., π(x) ∝ R(x). GFlowNets construct objects sequentially, where each step adds an element to the construction. We call the resulting sequence of state transitions from the initial state to a terminal state, τ = (s_0 → ⋯ → s_n), a trajectory, where τ ∈ T with T denoting the set of trajectories. Bengio et al. (2021a) define a trajectory flow F : T → R_{≥0}. Let F(s) = Σ_{τ ∋ s} F(τ) define the state flow for any state s, and F(s → s′) = Σ_{τ ∋ s→s′} F(τ) the edge flow for any edge s → s′. The trajectory flow induces a probability measure P(τ) = F(τ)/Z, where Z = Σ_{τ ∈ T} F(τ) denotes the total flow. We then define the corresponding forward policy P_F(s′|s) = F(s → s′)/F(s) and the backward policy P_B(s|s′) = F(s → s′)/F(s′). The flows can be thought of as amounts of water flowing through edges (like pipes) or states (like tees connecting pipes) (Malkin et al., 2022), with R(x) the amount of water through terminal state x, and P_F(s′|s) the relative amount of water flowing into the edges outgoing from s.
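To make these definitions concrete, the sketch below enumerates the trajectories of a small tree-structured DAG (a toy example of our own, not from the paper). Because each terminal state here is reached by exactly one trajectory, the trajectory flow equals the terminal reward, and the state flows, edge flows, and policies follow directly from the definitions above.

```python
# Toy tree-structured DAG: s0 -> {s1, s2}, s1 -> {x1, x2}, s2 -> {x3}.
# Each terminal state is reached by a unique trajectory, so F(tau) = R(x_tau).
trajectories = [
    (("s0", "s1", "x1"), 2.0),  # (trajectory tau, reward of its terminal state)
    (("s0", "s1", "x2"), 1.0),
    (("s0", "s2", "x3"), 3.0),
]

Z = sum(r for _, r in trajectories)  # total flow Z = sum_tau F(tau)

def state_flow(s):
    """F(s): sum of F(tau) over trajectories passing through state s."""
    return sum(r for tau, r in trajectories if s in tau)

def edge_flow(s, sp):
    """F(s -> s'): sum of F(tau) over trajectories using the edge s -> s'."""
    return sum(r for tau, r in trajectories
               if any(a == s and b == sp for a, b in zip(tau, tau[1:])))

def P_F(sp, s):
    """Forward policy P_F(s'|s) = F(s -> s') / F(s)."""
    return edge_flow(s, sp) / state_flow(s)

def P_B(s, sp):
    """Backward policy P_B(s|s') = F(s -> s') / F(s')."""
    return edge_flow(s, sp) / state_flow(sp)

# Rolling out the forward policy from s0 reaches x1 with probability
# P_F(s1|s0) * P_F(x1|s1) = (3/6) * (2/3) = 1/3 = R(x1) / Z.
p_x1 = P_F("s1", "s0") * P_F("x1", "s1")
```

Sampling forward from s_0 according to P_F thus reaches each terminal state x with probability R(x)/Z, which is exactly the property π(x) ∝ R(x) that GFlowNets are trained to achieve.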

