CFLOWNETS: CONTINUOUS CONTROL WITH GENERATIVE FLOW NETWORKS

Abstract

Generative flow networks (GFlowNets), as an emerging technique, can be used as an alternative to reinforcement learning for exploratory control tasks. GFlowNets aim to generate a distribution proportional to the rewards over terminating states, and to sample different candidates in an active learning fashion. GFlowNets need to form a DAG and compute the flow matching loss by traversing the inflows and outflows of each node in the trajectory. No experiments have yet shown that GFlowNets can handle continuous tasks. In this paper, we propose generative continuous flow networks (CFlowNets) that can be applied to continuous control tasks. First, we present the theoretical formulation of CFlowNets. Then, a training framework for CFlowNets is proposed, including the action selection process, the flow approximation algorithm, and the continuous flow matching loss function. Afterward, we theoretically prove the error bound of the flow approximation, which decreases rapidly as the number of flow samples increases. Finally, experimental results on continuous control tasks demonstrate the performance advantages of CFlowNets compared to many reinforcement learning methods, especially regarding exploration ability.

1. INTRODUCTION

As an emerging technology, generative flow networks (GFlowNets) (Bengio et al., 2021a;b) can make up for the shortcomings of reinforcement learning (Kaelbling et al., 1996; Sutton & Barto, 2018) on exploratory tasks. Specifically, based on the Bellman equation (Sutton & Barto, 2018), reinforcement learning is usually trained to maximize the expectation of future rewards; hence the learned policy is more inclined to sample action sequences with higher rewards. In contrast, the training goal of GFlowNets is to fit a distribution proportional to the rewards over terminating states, i.e., the parent states of the final states, rather than to generate a single high-reward action sequence (Bengio et al., 2021a). This is more like sampling different candidates in an active learning setting (Bengio et al., 2021b), and thus better suited for exploration tasks. GFlowNets arrange the state transitions of trajectories into a directed acyclic graph (DAG) structure. Each node in the graph corresponds to a different state, and actions correspond to transitions between states, that is, edges connecting nodes in the graph. For discrete tasks, the number of nodes in this graph is finite, and each edge can only correspond to one discrete action. However, in real environments, the state and action spaces of many tasks are continuous, such as quadrupedal locomotion (Kohl & Stone, 2004), autonomous driving (Kiran et al., 2021; Shalev-Shwartz et al., 2016; Pan et al., 2017), or dexterous in-hand manipulation (Andrychowicz et al., 2020). Moreover, the reward distributions of these environments may be multimodal, requiring more diverse exploration. The needs of these environments closely match the strengths of GFlowNets.
Bengio et al. (2021b) propose an idea for adapting GFlowNets to continuous tasks by replacing sums with integrals over continuous variables, and suggest using integrable densities together with the detailed balance (DB) or trajectory balance (TB) (Malkin et al., 2022) criteria to obtain tractable training objectives that avoid some integration operations. However, this idea has not been verified experimentally. In this paper, we propose generative Continuous Flow Networks, named CFlowNets for short, for continuous control tasks, to generate policies proportional to continuous reward functions. Applying GFlowNets to continuous control tasks is exceptionally challenging. In generative flow networks, the transition probability is defined as the ratio of the action flow to the state flow. For discrete state and action spaces, we can form a DAG and compute the state flow by traversing a node's incoming and outgoing flows. In contrast, for continuous tasks it is impossible to traverse all state-action pairs and their corresponding rewards. To address this issue, we use importance sampling to approximate the integrals over inflows and outflows in the flow-matching constraint, where a deep neural network predicts the parent nodes of each state in the sampled trajectory. The main contributions of this paper are summarized as follows: 1) We extend the theoretical formulation and flow matching theorem of previous GFlowNets to continuous scenarios.
Based on this, a loss function for training CFlowNets is presented; 2) We propose an efficient way to sample actions with probabilities approximately proportional to the output of the flow network, and a flow sampling approach to approximate continuous inflows and outflows, which allows us to construct a continuous flow matching loss; 3) We theoretically analyze the error bound between sampled flows and true inflows/outflows, whose tail becomes minor as the number of flow samples increases; 4) We conduct experiments on continuous control tasks demonstrating that CFlowNets can outperform current state-of-the-art RL algorithms, especially in terms of exploration capability. To the best of our knowledge, our work is the first to empirically demonstrate the effectiveness of flow networks on continuous control tasks. The code is available at http://gitee.com/mindspore/models/tree/master/research/gflownets/cflownets

2. PRELIMINARIES
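To make contribution 2) concrete, the following is a minimal, hypothetical sketch (not the paper's exact procedure or architecture) of sampling a continuous action with probability approximately proportional to a flow network's output: evaluate the network on a batch of uniformly drawn candidate actions, then sample one candidate with probability given by its normalized flow. `flow_net`, the 1-D toy flow function, and the action bounds are illustrative assumptions.

```python
import numpy as np

def sample_action(flow_net, state, action_low, action_high,
                  num_candidates=100, rng=None):
    """Sample an action roughly proportional to F(state, action)."""
    rng = np.random.default_rng() if rng is None else rng
    # Draw candidate actions uniformly from the bounded continuous action space.
    candidates = rng.uniform(action_low, action_high,
                             size=(num_candidates, len(action_low)))
    # Evaluate the (non-negative) flow at each candidate.
    flows = np.array([flow_net(state, a) for a in candidates])
    # Normalize flows into a categorical distribution over candidates.
    probs = flows / flows.sum()
    idx = rng.choice(num_candidates, p=probs)
    return candidates[idx]

# Toy flow function over a 1-D action space, peaked near a = 0.7.
toy_flow = lambda s, a: np.exp(-10.0 * (a[0] - 0.7) ** 2)
action = sample_action(toy_flow, state=None,
                       action_low=np.array([0.0]), action_high=np.array([1.0]))
assert 0.0 <= action[0] <= 1.0
```

As the number of candidates grows, the sampled actions approach the normalized flow distribution over the action space, which is the behavior contribution 2) requires of the action selection process.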

2.1. MARKOV DECISION PROCESS

A stochastic, discrete-time, sequential decision task can be described as a Markov Decision Process (MDP), which is canonically formulated by the tuple
$\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, P, R, \gamma \rangle$. (1)
In the process, $s \in \mathcal{S}$ represents a state in the state space of the environment. At each time step, the agent receives a state $s$ and selects an action $a$ from the action space $\mathcal{A}$. This results in a transition to the next state $s'$ according to the state transition function $P(s' \mid s, a): \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$. The agent then receives a reward $r$ based on the reward function $R(s, a): \mathcal{S} \times \mathcal{A} \to \mathbb{R}$. A stochastic policy $\pi$ maps each state to a distribution over actions $\pi(\cdot \mid s)$ and gives the probability $\pi(a \mid s)$ of choosing action $a$ in state $s$. The agent interacts with the environment by executing the policy $\pi$ and obtaining admissible trajectories $\{(s_t, a_t, r_t, s_{t+1})\}_{t=1}^{n}$, where $n$ is the trajectory length. The goal of the agent is to maximize the expected discounted return $\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s, a_0 = a, \pi\right]$, where $\mathbb{E}$ is the expectation over the distribution of trajectories and $\gamma \in [0, 1)$ is the discount factor.
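The discounted-return objective above can be illustrated with a short computation over a sampled reward sequence; the recursion $G_t = r_t + \gamma G_{t+1}$ used here is a standard identity, not specific to this paper.

```python
def discounted_return(rewards, gamma=0.99):
    """Compute sum_t gamma^t * r_t for a finite reward sequence."""
    g = 0.0
    # Accumulate from the last step backward: G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Example: rewards [1.0, 0.0, 2.0] with gamma = 0.5
# gives 1 + 0.5 * 0 + 0.25 * 2 = 1.5.
print(discounted_return([1.0, 0.0, 2.0], gamma=0.5))  # 1.5
```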



2.2. GENERATIVE FLOW NETWORK

GFlowNets view the MDP as a flow network. Define $s' = T(s, a)$ as the state transition and $F(s)$ as the total flow going through node $s$. Define an edge/action flow $F(s, a) = F(s \to s')$ as the flow through an edge $s \to s'$. The training process of vanilla GFlowNets needs to sum the flows of each node's (state's) parents and children, which depends on a discrete state space and a discrete action space. The framework is optimized by the following flow consistency equations:
$\sum_{s, a: T(s, a) = s'} F(s, a) = R(s') + \sum_{a' \in \mathcal{A}(s')} F(s', a')$, (2)
which means that for any node $s'$, the incoming flow equals the outgoing flow, which is the total flow $F(s')$ of node $s'$.
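The flow consistency condition in Eq. (2) can be checked on a tiny hand-built DAG. This is an illustrative sketch: the node names and numeric edge flows are invented for the example, and in an actual GFlowNet the edge flows $F(s, a)$ would be the outputs of a learned network rather than fixed constants.

```python
import math

# Edge flows F(s, a), keyed as (s, s') for the edge s -> s'.
edge_flow = {
    ("s0", "s1"): 2.0,
    ("s0", "s2"): 1.0,
    ("s1", "s3"): 2.0,
    ("s2", "s3"): 1.0,
}
# Rewards at non-initial nodes; only the terminating state s3 has reward.
reward = {"s1": 0.0, "s2": 0.0, "s3": 3.0}

def flow_matching_residual(node):
    """Inflow minus (reward + outflow) at a node; zero when Eq. (2) holds."""
    inflow = sum(f for (s, s2), f in edge_flow.items() if s2 == node)
    outflow = reward[node] + sum(f for (s, s2), f in edge_flow.items() if s == node)
    return inflow - outflow

# These flows are consistent: at every non-initial node the incoming flow
# equals the reward plus the outgoing flow.
for node in ("s1", "s2", "s3"):
    assert math.isclose(flow_matching_residual(node), 0.0)
```

Training a discrete GFlowNet amounts to minimizing a loss built from these residuals over visited states; the continuous setting studied in this paper replaces the sums over parents and children with integrals, which CFlowNets approximate by sampling.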

