CFLOWNETS: CONTINUOUS CONTROL WITH GENERATIVE FLOW NETWORKS

Abstract

Generative flow networks (GFlowNets), as an emerging technique, can be used as an alternative to reinforcement learning for exploratory control tasks. GFlowNets aim to generate a distribution proportional to the rewards over terminating states, and to sample different candidates in an active learning fashion. To do so, GFlowNets need to form a DAG and compute the flow matching loss by traversing the inflows and outflows of each node in the trajectory. No experiments have yet shown that GFlowNets can handle continuous tasks. In this paper, we propose generative continuous flow networks (CFlowNets) that can be applied to continuous control tasks. First, we present the theoretical formulation of CFlowNets. Then, a training framework for CFlowNets is proposed, including the action selection process, the flow approximation algorithm, and the continuous flow matching loss function. Afterward, we theoretically prove the error bound of the flow approximation. The error decreases rapidly as the number of flow samples increases. Finally, experimental results on continuous control tasks demonstrate the performance advantages of CFlowNets compared to many reinforcement learning methods, especially regarding exploration ability.

1. INTRODUCTION

As an emerging technology, generative flow networks (GFlowNets) (Bengio et al., 2021a; b) can make up for the shortcomings of reinforcement learning (Kaelbling et al., 1996; Sutton & Barto, 2018) on exploratory tasks. Specifically, based on the Bellman equation (Sutton & Barto, 2018), reinforcement learning is usually trained to maximize the expectation of future rewards; hence the learned policy is more inclined to sample action sequences with higher rewards. In contrast, the training goal of GFlowNets is to define a distribution proportional to the rewards over terminating states, i.e., the parent states of the final states, rather than generating a single high-reward action sequence (Bengio et al., 2021a). This is more like sampling different candidates in an active learning setting (Bengio et al., 2021b), and is thus better suited for exploration tasks. GFlowNets organize the state transitions of trajectories into a directed acyclic graph (DAG). Each node in the graph corresponds to a different state, and each action corresponds to a transition between states, that is, an edge connecting two nodes. For discrete tasks, the number of nodes in this graph is limited, and each edge corresponds to exactly one discrete action. However, in real environments, the state and action spaces are continuous for many tasks, such as quadrupedal locomotion (Kohl & Stone, 2004), autonomous driving (Kiran et al., 2021; Shalev-Shwartz et al., 2016; Pan et al., 2017), or dexterous in-hand manipulation (Andrychowicz et al., 2020). Moreover, the reward distributions corresponding to these environments may be multimodal, requiring more diverse exploration. The needs of these environments closely match the strengths of GFlowNets.
Bengio et al. (2021b) propose an idea for adapting GFlowNets to continuous tasks by replacing sums with integrals for continuous variables, and suggest the use of integrable densities and the detailed balance (DB) or trajectory balance (TB) (Malkin et al., 2022) criterion to obtain tractable training objectives, which can avoid some integration operations. However, this idea has not been verified experimentally. In this paper, we propose generative Continuous Flow Networks, named CFlowNets for short, for continuous control tasks to generate policies that can be proportional to continuous reward functions. Applying GFlowNets to continuous control tasks is exceptionally challenging. In generative flow networks, the transition probability is defined as the ratio of action flow and state flow. For discrete state and action spaces, we can form a DAG and compute the state flow by traversing a node's incoming and outgoing flows. For continuous tasks, by contrast, it is impossible to traverse all state-action pairs and corresponding rewards. To address this issue, we use importance sampling to approximate the integrals over inflows and outflows in the flow-matching constraint, where we use a deep neural network to predict the parent nodes of each state in the sampled trajectory. The main contributions of this paper are summarized as follows: 1) We extend the theoretical formulation and flow matching theorem of previous GFlowNets to continuous scenarios.
Based on this, a loss function for training CFlowNets is presented; 2) We propose an efficient way to sample actions with probabilities approximately proportional to the output of the flow network, and propose a flow sampling approach to approximate continuous inflows and outflows, which allows us to construct a continuous flow matching loss; 3) We theoretically analyze the error bound between sampled flows and inflows/outflows, and show that the tail of the error shrinks as the number of flow samples increases; 4) We conduct experiments based on continuous control tasks to demonstrate that CFlowNets can outperform current state-of-the-art RL algorithms, especially in terms of exploration capabilities. To the best of our knowledge, our work is the first to empirically demonstrate the effectiveness of flow networks on continuous control tasks. The code is available at http://gitee.com/mindspore/models/tree/master/research/gflownets/cflownets

2. PRELIMINARIES

2.1. MARKOV DECISION PROCESS

A stochastic, discrete-time, sequential decision task can be described as a Markov Decision Process (MDP), which is canonically formulated by the tuple

$$\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, P, R, \gamma \rangle. \tag{1}$$

Here, $\mathcal{S}$ represents the state space of the environment. At each time step, the agent receives a state $s \in \mathcal{S}$ and selects an action $a$ from the action space $\mathcal{A}$. This results in a transition to the next state $s'$ according to the state transition function $P(s' \mid s, a): \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$. Then the agent gets the reward $r$ based on the reward function $R(s, a): \mathcal{S} \times \mathcal{A} \to \mathbb{R}$. A stochastic policy $\pi$ maps each state to a distribution over actions $\pi(\cdot \mid s)$ and gives the probability $\pi(a \mid s)$ of choosing action $a$ in state $s$. The agent interacts with the environment by executing the policy $\pi$ and obtaining the admissible trajectories $\{(s_t, a_t, r_t, s_{t+1})\}_{t=1}^{n}$, where $n$ is the trajectory length. The goal of an agent is to maximize the discounted return $\mathbb{E}_{s_{0:n}, a_{0:n}}\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s, a_0 = a, \pi\right]$, where $\mathbb{E}$ is the expectation over the distribution of the trajectories and $\gamma \in [0, 1)$ is the discount factor.
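As a small illustration of the discounted return defined above, the following sketch evaluates the sum for one finite reward sequence. The function name `discounted_return` and the example rewards are illustrative assumptions, not part of the original formulation, and the expectation over trajectories is not computed here.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for one finite reward sequence."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    total = 0.0
    # Accumulate backwards so each step applies one more factor of gamma.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical sparse-reward episode: the reward arrives only at the last step.
sparse_rewards = [0.0, 0.0, 1.0]
print(discounted_return(sparse_rewards, 0.5))
```

With a sparse reward, the return depends only on how late the reward arrives, which is one reason such tasks are hard for return-maximizing methods to explore.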

2.2. GENERATIVE FLOW NETWORK

GFlowNet sees the MDP as a flow network. Define $s' = T(s, a)$ as the node's transition and $F(s)$ as the total flow going through $s$. Define an edge/action flow $F(s, a) = F(s \to s')$ as the flow through an edge $s \to s'$. The training process of vanilla GFlowNets needs to sum the flow of parents and children through nodes (states), which depends on the discrete state space and discrete action space. The framework is optimized by the following flow consistency equations:

$$\sum_{s, a : T(s, a) = s'} F(s, a) = R(s') + \sum_{a' \in \mathcal{A}(s')} F(s', a'), \tag{2}$$

which means that for any node $s'$, the incoming flow equals the outgoing flow, which is the total flow $F(s')$ of node $s'$.
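The flow consistency equation can be checked by hand on a tiny discrete DAG. The sketch below is only an illustration under assumed values: the node names, edge flows, and rewards are hypothetical and chosen so that every node balances.

```python
def flow_mismatch(edge_flow, reward, node):
    """Return inflow(node) - R(node) - outflow(node); zero means the node is consistent."""
    inflow = sum(f for (src, dst), f in edge_flow.items() if dst == node)
    outflow = sum(f for (src, dst), f in edge_flow.items() if src == node)
    return inflow - reward.get(node, 0.0) - outflow


# Hypothetical DAG: s0 -> {s1, s2} -> s3, with flows on edges and rewards on nodes.
edge_flow = {
    ("s0", "s1"): 3.0,
    ("s0", "s2"): 2.0,
    ("s1", "s3"): 1.0,
    ("s2", "s3"): 1.0,
}
reward = {"s1": 2.0, "s2": 1.0, "s3": 2.0}

mismatches = {n: flow_mismatch(edge_flow, reward, n) for n in ("s1", "s2", "s3")}
print(mismatches)
```

Enumerating parents and children like this is exactly what becomes infeasible once states and actions are continuous.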

3. CFLOWNETS: THEORETICAL FORMULATION

Consider a continuous task with tuple $(\mathcal{S}, \mathcal{A})$, where $\mathcal{S}$ denotes the continuous state space and $\mathcal{A}$ denotes the continuous action space. Define a trajectory $\tau = (s_1, ..., s_n)$ in this continuous task as a sequence of sampled elements of $\mathcal{S}$ such that every transition $a_t : s_t \to s_{t+1} \in \mathcal{A}$. Further, we define an acyclic trajectory $\tau = (s_1, ..., s_n)$ as a trajectory that satisfies the acyclic constraint: $\forall s_m \in \tau, s_k \in \tau, m \neq k$, we have $s_m \neq s_k$. Denote $s_0$ and $s_f$ respectively as the initial state and the final state related to the continuous task $(\mathcal{S}, \mathcal{A})$. We define a complete trajectory as any sampled acyclic trajectory from $(\mathcal{S}, \mathcal{A})$ starting in $s_0$ and ending in $s_f$. Correspondingly, a transition $s \to s_f$ into the final state is defined as the terminating transition, and $F(s \to s_f)$ is a terminating flow. A trajectory flow $F(\tau) : \tau \mapsto \mathbb{R}^+$ is defined as any nonnegative function defined on the set of complete trajectories $\tau$. For each trajectory $\tau$, the associated flow $F(\tau)$ contains the number of particles (Bengio et al., 2021b) sharing the same path $\tau$. In addition, the tuple $(\mathcal{S}, \mathcal{A}, F)$ is called a continuous flow network. Let $T(s, a) = s'$ indicate an action $a$ that could make a transition from state $s$ to attain $s'$. Then we make the following assumptions.

Assumption 1. Assume that the continuous task $(\mathcal{S}, \mathcal{A})$ is an "acyclic" task, which means that arbitrarily sampled trajectories $\tau$ are acyclic, i.e., $s_i \neq s_j, \forall s_i, s_j \in \tau = (s_0, ..., s_n), i \neq j$.

Assumption 2. Assume the flow function $F(s, a)$ is Lipschitz continuous, i.e.,

$$|F(s, a) - F(s, a')| \leq L \|a - a'\|, \quad a, a' \in \mathcal{A}, \tag{3}$$

$$|F(s, a) - F(s', a)| \leq L \|s - s'\|, \quad s, s' \in \mathcal{S}, \tag{4}$$

where $L$ is a constant.

Assumption 3. Assume that for any state pair $(s_t, s_{t+1})$, there is a unique action $a_t$ such that $T(s_t, a_t) = s_{t+1}$, i.e., taking action $a_t$ in $s_t$ is the only way to get to $s_{t+1}$. Hence we can define $s_t := g(s_{t+1}, a_t)$, where $g(\cdot)$ is a transition function. We further assume that the actions are translation actions.
The necessity and rationality of Assumptions 1-3 are analyzed in the appendix. Under Assumption 1, we define the parent set $\mathcal{P}(s_t)$ of a state $s_t$ as the set that contains all of the direct parents of $s_t$ that could make a direct transition to $s_t$, i.e., $\mathcal{P}(s_t) = \{s \in \mathcal{S} : T(s, a \in \mathcal{A}) = s_t\}$. Similarly, define the child set $\mathcal{C}(s_t)$ of a state $s_t$ as the set that contains all of the direct children of $s_t$ that can be reached by a direct transition from $s_t$, i.e., $\mathcal{C}(s_t) = \{s \in \mathcal{S} : T(s_t, a \in \mathcal{A}) = s\}$. Then, we have the following continuous flow definitions, where Assumptions 2-3 make these integrals integrable and meaningful.

Definition 1 (Continuous State Flow). The continuous state flow $F(s) : \mathcal{S} \to \mathbb{R}$ is the integral of the complete trajectory flows passing through the state:

$$F(s) = \int_{\tau : s \in \tau} F(\tau) \, d\tau. \tag{5}$$

Definition 2 (Continuous Inflows). For any state $s_t$, its inflows are the integral of flows that can reach state $s_t$, i.e.,

$$\int_{s \in \mathcal{P}(s_t)} F(s \to s_t) \, ds = \int_{s : T(s, a) = s_t} F(s, a) \, ds = F(s_t) = \int_{a : T(s, a) = s_t} F(s, a) \, da, \tag{6}$$

where $a : s \to s_t$ and $s = g(s_t, a)$ since Assumption 3 holds.

Definition 3 (Continuous Outflows). For any state $s_t$, the outflows are the integral of flows passing through state $s_t$ with all possible actions $a \in \mathcal{A}$, i.e.,

$$\int_{s \in \mathcal{C}(s_t)} F(s_t \to s) \, ds = F(s_t) = \int_{a \in \mathcal{A}} F(s_t, a) \, da. \tag{7}$$

Based on the above definitions, we can define the transition probability $P(s \to s' \mid s)$ of edge $s \to s'$ as a special case of the conditional probability introduced in Bengio et al. (2021b). In particular, the forward transition probability is given by

$$P_F(s_{t+1} \mid s_t) := P(s_t \to s_{t+1} \mid s_t) = \frac{F(s_t \to s_{t+1})}{F(s_t)}. \tag{8}$$

Similarly, the backward transition probability is given by

$$P_B(s_t \mid s_{t+1}) := P(s_t \to s_{t+1} \mid s_{t+1}) = \frac{F(s_t \to s_{t+1})}{F(s_{t+1})}. \tag{9}$$
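For any trajectory sampled from a continuous task $(\mathcal{S}, \mathcal{A})$, we have

$$\forall \tau = (s_1, ..., s_n), \quad P_F(\tau) := \prod_{t=1}^{n-1} P_F(s_{t+1} \mid s_t), \qquad P_B(\tau) := \prod_{t=1}^{n-1} P_B(s_t \mid s_{t+1}).$$

Given any trajectory $\tau = (s_0, ..., s_n, s)$ that starts in $s_0$ and ends in $s$, a Markovian flow (Bengio et al., 2021b) is defined as the flow that satisfies $P(s \to s' \mid \tau) = P(s \to s' \mid s) = P_F(s' \mid s)$, and the corresponding flow network $(\mathcal{S}, \mathcal{A}, F)$ is called a Markovian flow network (Bengio et al., 2021b). Then, we present Theorem 1, proved in Appendix B.1, which is an extension of Proposition 19 in Bengio et al. (2021b) to continuous scenarios.

Theorem 1 (Continuous Flow Matching Condition). Consider a non-negative function $\hat{F}(s, a)$ taking a state $s \in \mathcal{S}$ and an action $a \in \mathcal{A}$ as inputs. Then $\hat{F}$ corresponds to a flow if and only if the following continuous flow matching conditions are satisfied:

$$\begin{aligned} \forall s' > s_0, \quad \hat{F}(s') &= \int_{s \in \mathcal{P}(s')} \hat{F}(s \to s') \, ds = \int_{s : T(s, a) = s'} \hat{F}(s, a : s \to s') \, ds, \\ \forall s' < s_f, \quad \hat{F}(s') &= \int_{s'' \in \mathcal{C}(s')} \hat{F}(s' \to s'') \, ds'' = \int_{a \in \mathcal{A}} \hat{F}(s', a) \, da. \end{aligned} \tag{13}$$

Furthermore, $\hat{F}$ uniquely defines a Markovian flow $F$ matching $\hat{F}$ such that

$$F(\tau) = \frac{\prod_{t=1}^{n+1} \hat{F}(s_{t-1} \to s_t)}{\prod_{t=1}^{n} \hat{F}(s_t)}. \tag{14}$$

Theorem 1 means that as long as any non-negative function satisfies the flow matching conditions, a unique flow is determined. Therefore, for sparse reward environments, i.e., $R(s) = 0, \forall s \neq s_f$, we can obtain the target flow by training a flow network that satisfies the flow matching conditions. Such learning machines are called CFlowNets, and we have the following continuous loss function:

$$\mathcal{L}(\tau) = \sum_{s_t = s_1}^{s_f} \left( \int_{s_{t-1} \in \mathcal{P}(s_t)} F(s_{t-1} \to s_t) \, ds_{t-1} - R(s_t) - \int_{s_{t+1} \in \mathcal{C}(s_t)} F(s_t \to s_{t+1}) \, ds_{t+1} \right)^2.$$

However, the above continuous loss function cannot be directly applied in practice, since the integrals cannot be evaluated exactly. Next, we propose a method to approximate the continuous loss function based on the sampled trajectories to obtain the flow model.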

4. CFLOWNETS: TRAINING FRAMEWORK

For continuous tasks, it is usually difficult to access all state-action pairs to calculate continuous inflows and outflows. In the following, we propose the CFlowNets training framework to address this problem, which includes an action sampling process and a flow matching approximation process. CFlowNets can then be trained based on an approximate flow matching loss function.

4.1. OVERALL FRAMEWORK

The overall framework of CFlowNets is shown in Figure 1, including the environment interaction, flow sampling, and training procedures. During the environment interaction phase (left part of Figure 1), we build an action probability buffer from the forward propagation of CFlowNets and sample an action from it. We name this process the action selection procedure, as detailed in Section 4.2. After acquiring the action, the agent can interact with the environment to update the state, and this process repeats for several steps until a complete trajectory is sampled. Once a buffer of complete trajectories is available, we randomly sample K actions and compute the child states to approximately calculate the outflows. For the inflows, we use these sampled actions together with the current state as the input to the deep neural network G to estimate the parent states. Based on these, we can approximately determine the inflows. We name this process the flow matching approximation procedure (middle part of Figure 1), as detailed in Section 4.3. Finally, based on the approximate inflows and outflows, we can train a CFlowNet based on the continuous flow matching loss function (right part of Figure 1), as detailed in Section 4.4. The pseudocode is provided in Appendix C.

4.2. ACTION SELECTION PROCEDURE

Starting from an empty set, CFlowNets aim to obtain complete trajectories $\tau = (s_0, s_1, ..., s_f) \in \mathcal{T}$ by iteratively sampling $a_t \sim \pi(a_t \mid s_t) = \frac{F(s_t, a_t)}{F(s_t)}$ with tuple $\{(s_t, a_t, r_t, s_{t+1})\}_{t=0}^{f}$. However, it is difficult to sample trajectories strictly according to the corresponding probability of $a_t$: since the actions are continuous, we cannot get the exact action probability distribution function based on the flow network $F(s_t, a_t)$. To solve this problem, at each state $s_t$, we first uniformly sample $M$ actions from $\mathcal{A}$ and generate an action probability buffer $P = \{F(s_t, a_i)\}_{i=1}^{M}$, which is used as an approximation of the action probability distribution. Then we sample an action from $P$ according to the corresponding probabilities of all actions. Actions with larger $F(s_t, a_i)$ will therefore be sampled with higher probability. In this way, we approximately sample actions from a continuous distribution according to their corresponding probabilities.

Remark 1. After the training process, for tasks that require a larger reward, we can select the action with the maximum flow output in $P$ during the test process to obtain a relatively higher reward. How the output of the flow model is used is flexible, and we can adjust it for different tasks.
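The action selection procedure can be sketched in a few lines. This is a minimal illustration, assuming a box-shaped action space given by per-dimension bounds and a callable `flow_fn(state, action)` standing in for the trained flow network; `select_action`, its parameters, and the toy flow are hypothetical names introduced here, and the `greedy` flag mirrors the maximum-flow choice mentioned in Remark 1.

```python
import random


def select_action(flow_fn, state, low, high, num_candidates, rng, greedy=False):
    """Sample M uniform candidate actions and pick one with probability proportional to its flow."""
    candidates = [
        [rng.uniform(lo, hi) for lo, hi in zip(low, high)]
        for _ in range(num_candidates)
    ]
    # Action probability buffer: one non-negative flow value per candidate.
    flows = [max(flow_fn(state, action), 0.0) for action in candidates]
    if greedy:
        best = max(range(num_candidates), key=lambda i: flows[i])
        return candidates[best]
    if sum(flows) <= 0.0:
        # Degenerate buffer: fall back to a uniform choice (an assumption of this sketch).
        return rng.choice(candidates)
    return rng.choices(candidates, weights=flows, k=1)[0]


# Toy flow (assumption): only actions with a positive first coordinate carry flow.
toy_flow = lambda state, action: 1.0 if action[0] > 0.0 else 0.0
rng = random.Random(0)
action = select_action(toy_flow, None, [-1.0], [1.0], 50, rng)
print(action)
```

Larger values of M give a finer approximation of the continuous action distribution at the cost of more forward passes per step.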

4.3. FLOW MATCHING APPROXIMATION

Once a batch of trajectories $\mathcal{B}$ is available, to satisfy the flow conditions, we require that for any node $s_t$, the inflows $\int_{a : T(s, a) = s_t} F(s, a) \, da$ equal the outflows $\int_{a \in \mathcal{A}} F(s_t, a) \, da$, which is the total flow $F(s_t)$ of node $s_t$. However, we cannot directly calculate the continuous inflows and outflows to complete the flow matching condition. An intuitive idea is to discretize the inflows and outflows based on a reasonable approximation and match the discretized flows. To do this, we sample $K$ actions independently and uniformly from the continuous action space $\mathcal{A}$ and calculate the corresponding $F(s_t, a_k), k = 1, ..., K$ as the outflows, i.e., we use the following approximation:

$$\int_{a \in \mathcal{A}} F(s_t, a) \, da \approx \frac{\mu(\mathcal{A})}{K} \sum_{k=1}^{K} F(s_t, a_k), \tag{15}$$

where $\mu(\mathcal{A})$ denotes the measure of the continuous action space $\mathcal{A}$. By contrast, an approximation of the inflows is more difficult since we should find the parent states first. To solve this problem, we construct a deep neural network $G$ (named the "retrieval" neural network) parameterized by $\phi$ with $(s_{t+1}, a_t)$ as the input and $s_t$ as the output, and train this network based on $\mathcal{B}$ with the MSE loss. That is, we want to use $G$ to fit the function $g(\cdot)$. The network $G$ is usually easy to train since we consider tasks that satisfy Assumption 3, and we can obtain a high-precision network $G$ through simple pre-training. As the training progresses, we can also occasionally update $G$ based on the sampled trajectories to ensure accuracy. Then, the inflows can be calculated approximately:

$$\int_{a : T(g(s_t, a), a) = s_t} F(g(s_t, a), a) \, da \approx \frac{\mu(\mathcal{A})}{K} \sum_{k=1}^{K} F(G_\phi(s_t, a_k), a_k). \tag{16}$$

Next, by assuming that the flow function $F(s, a)$ is Lipschitz continuous as in Assumption 2, we can provide a non-asymptotic analysis of the error between the sample inflows/outflows and the true inflows/outflows. Theorem 2 establishes the error bound between the sample outflows (resp. inflows) and the actual outflows (resp. inflows) in the tail form and shows that the tail decreases exponentially.
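Furthermore, the tail gets much smaller as $K$ increases, which means the sample outflows (resp. inflows) are a good estimate of the actual outflows (resp. inflows).

Theorem 2. Let $\{a_k\}_{k=1}^{K}$ be sampled independently and uniformly from the continuous action space $\mathcal{A}$. Assume $G_\phi$ can optimally output the actual state $s_t$ with $(s_{t+1}, a_t)$. For any bounded continuous action $a \in \mathcal{A}$ and any state $s_t \in \mathcal{S}$, we have

$$\mathbb{P}\left( \left| \frac{\mu(\mathcal{A})}{K} \sum_{k=1}^{K} F(s_t, a_k) - \int_{a \in \mathcal{A}} F(s_t, a) \, da \right| \geq t \right) \leq 2 \exp\left( - \frac{K t^2}{2 \left( L \mu(\mathcal{A}) \, \mathrm{diam}(\mathcal{A}) \right)^2} \right) \tag{17}$$

and

$$\mathbb{P}\left( \left| \frac{\mu(\mathcal{A})}{K} \sum_{k=1}^{K} F(G_\phi(s_t, a_k), a_k) - \int_{a : T(s, a) = s_t} F(s, a) \, da \right| \geq t \right) \leq 2 \exp\left( - \frac{K t^2}{2 \left( L \mu(\mathcal{A}) (\mathrm{diam}(\mathcal{A}) + \mathrm{diam}(\mathcal{S})) \right)^2} \right), \tag{18}$$

where $L$ is the Lipschitz constant, $\mathrm{diam}(\mathcal{A})$ denotes the diameter of the action space $\mathcal{A}$, and $\mathrm{diam}(\mathcal{S})$ denotes the diameter of the state space $\mathcal{S}$.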
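To make the outflow approximation and its tail bound concrete, the sketch below estimates the outflow integral by uniform sampling for a one-dimensional action interval and evaluates the right-hand side of the outflow bound in Theorem 2. The toy flow $F(s, a) = 1 + a$ on $[0, 1]$ (so $L = 1$, $\mu(\mathcal{A}) = 1$, $\mathrm{diam}(\mathcal{A}) = 1$, true outflow $1.5$) and all function names are assumptions made for illustration.

```python
import math
import random


def estimate_outflow(flow_fn, state, low, high, num_samples, rng):
    """Monte Carlo estimate mu(A)/K * sum_k F(s, a_k) with a_k uniform on [low, high]."""
    measure = high - low
    total = sum(flow_fn(state, rng.uniform(low, high)) for _ in range(num_samples))
    return measure / num_samples * total


def outflow_tail_bound(num_samples, t, lipschitz, measure, diam):
    """Right-hand side of the outflow bound: 2 * exp(-K t^2 / (2 (L mu(A) diam(A))^2))."""
    scale = lipschitz * measure * diam
    return 2.0 * math.exp(-num_samples * t ** 2 / (2.0 * scale ** 2))


# Toy flow (assumption): linear in the action, independent of the state.
toy_flow = lambda state, action: 1.0 + action
rng = random.Random(0)
estimate = estimate_outflow(toy_flow, None, 0.0, 1.0, 20000, rng)
print(estimate)
for k in (100, 1000, 10000):
    print(k, outflow_tail_bound(k, 0.1, 1.0, 1.0, 1.0))
```

For small K the bound can exceed 1 and is then uninformative; it only becomes useful once K is large relative to $(L \mu(\mathcal{A}) \, \mathrm{diam}(\mathcal{A}) / t)^2$.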

4.4. LOSS FUNCTION

Based on (15) and (16), the continuous loss function can be approximated by

$$\mathcal{L}_\theta(\tau) = \sum_{s_t = s_1}^{s_f} \left[ \sum_{k=1}^{K} F_\theta(G_\phi(s_t, a_k), a_k) - \lambda R(s_t) - \sum_{k=1}^{K} F_\theta(s_t, a_k) \right]^2, \tag{19}$$

where $\theta$ is the parameter of the flow network $F(\cdot)$ and $\lambda = K / \mu(\mathcal{A})$. Note that in many tasks we cannot obtain the exact $\mu(\mathcal{A})$. For such tasks, we can directly set $\lambda$ to 1, and then adjust the reward shaping to ensure the convergence of the algorithm. It is noteworthy that the magnitude of the state flow at different locations in the trajectory may not match. For example, the initial node flow is likely to be larger than the ending node flow. To solve this problem, inspired by the log-scale loss introduced in GFlowNets (Bengio et al., 2021a), we can modify (19) into:

$$\mathcal{L}_\theta(\tau) = \sum_{s_t = s_1}^{s_f} \left\{ \log\left[ \epsilon + \sum_{k=1}^{K} \exp F_\theta^{\log}(G_\phi(s_t, a_k), a_k) \right] - \log\left[ \epsilon + \lambda R(s_t) + \sum_{k=1}^{K} \exp F_\theta^{\log}(s_t, a_k) \right] \right\}^2, \tag{20}$$

where $\epsilon$ is a hyper-parameter that helps to trade off small versus large flows and helps avoid the numerical problem of taking the logarithm of tiny flows. Note that Theorem 2 cannot be used to guarantee the unbiasedness of (20) because $\log \mathbb{E}(x) \neq \mathbb{E} \log(x)$. But experiments show that this approximation works well.
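The log-scale loss can be evaluated directly once the sampled log-flows are available. The following sketch assumes the flow network outputs have already been computed and are passed in as plain lists (K log-inflows and K log-outflows per state); the function name, argument layout, and default values of `lam` and `eps` are illustrative assumptions rather than the authors' implementation.

```python
import math


def log_flow_matching_loss(log_inflows, log_outflows, rewards, lam=1.0, eps=1.0):
    """Sum over states of (log(eps + inflow) - log(eps + lam * R + outflow))**2."""
    loss = 0.0
    for in_k, out_k, reward in zip(log_inflows, log_outflows, rewards):
        inflow = sum(math.exp(v) for v in in_k)    # sum_k exp F_log(G(s_t, a_k), a_k)
        outflow = sum(math.exp(v) for v in out_k)  # sum_k exp F_log(s_t, a_k)
        diff = math.log(eps + inflow) - math.log(eps + lam * reward + outflow)
        loss += diff * diff
    return loss


# One state with K = 2 samples: inflow 1 + 2 = 3, outflow 1 + 1 = 2, reward 1 (balanced).
balanced = log_flow_matching_loss(
    [[math.log(1.0), math.log(2.0)]],
    [[math.log(1.0), math.log(1.0)]],
    [1.0],
)
# Same flows with zero reward: inflow exceeds outflow, so the loss is positive.
unbalanced = log_flow_matching_loss(
    [[math.log(1.0), math.log(2.0)]],
    [[math.log(1.0), math.log(1.0)]],
    [0.0],
)
print(balanced, unbalanced)
```

In practice the same expression would be written with an automatic differentiation library so that gradients with respect to $\theta$ are available; the plain-Python version here only shows the arithmetic.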

5. RELATED WORKS

Generative Flow Networks. Generative flow networks are proposed to enhance exploration capabilities by generating a distribution proportional to the rewards over terminating states (Bengio et al., 2021b; a). Since the network samples actions based on the distribution of the corresponding rewards, rather than focusing only on actions that maximize rewards as reinforcement learning does, it can perform well on tasks with more diverse reward distributions, and has been successfully applied to molecule generation (Bengio et al., 2021a; Malkin et al., 2022; Jain et al., 2022). Bengio et al. (2021b) mentioned that objective functions such as DB and TB can also be used in continuous scenarios by replacing the policy likelihoods in the objective with probability densities. A possible disadvantage is that it is not easy to estimate $P_F$ and $P_B$ in a continuous environment, since the state space is much larger than in a discrete scenario, and a small error in modeling probability densities can greatly affect the final performance. How to combine DB and TB with CFlowNets is a worthy direction for future work.

Continuous Reinforcement Learning. Policy gradient algorithms are widely used for reinforcement learning problems with continuous action spaces. The deterministic policy gradient (DPG) (Silver et al., 2014) algorithm is an actor-critic (Grondman et al., 2012; Rosenstein et al., 2004) method that uses an estimate of the learned value $Q(s, a)$ to train a deterministic policy $\mu : \mathcal{S} \to \mathcal{A}$ parameterized by $\theta^\mu$. In contrast to CFlowNets, the policy is updated by applying the chain rule to the expected return $J$ from the start distribution with respect to the policy parameters:

$$\nabla_{\theta^\mu} J \approx \mathbb{E}_{\mathcal{D}}\left[ \nabla_{\theta^\mu} Q(s, a \mid \theta^Q) \big|_{a = \mu(s \mid \theta^\mu)} \right] = \mathbb{E}_{\mathcal{D}}\left[ \nabla_a Q(s, a \mid \theta^Q) \big|_{a = \mu(s_t)} \nabla_{\theta^\mu} \mu(s \mid \theta^\mu) \right], \tag{21}$$

where $\mathcal{D}$ is the replay buffer. The policy aims to maximize the expectation of future rewards, which are estimated by Q-learning. In this setting, the trajectories generated by the policy may be relatively homogeneous.
However, the training goal of CFlowNets is to define a distribution proportional to the rewards over terminating states, resulting in more diverse trajectories that are beneficial for exploring the environment. Later, deep DPG (DDPG) (Lillicrap et al., 2015) improves DPG and has good sample efficiency but suffers from extreme brittleness and hyperparameter sensitivity. Therefore, it is difficult to extend DDPG to complex, high-dimensional tasks. To improve DDPG, twin delayed DDPG (TD3) (Fujimoto et al., 2018) adopts an actor-critic framework and considers the interaction between function approximation error in the value update and in the policy update. There are also some policy gradient (Sutton et al., 1999; Kohl & Stone, 2004; Khadka & Tumer, 2018) based algorithms that can be adapted for continuous tasks, such as proximal policy optimization (PPO) (Schulman et al., 2017), asynchronous advantage actor-critic (A3C) (Stooke & Abbeel, 2018), and importance weighted actor-learner architecture (IMPALA) (Espeholt et al., 2018). PPO has the benefits of trust region policy optimization (Schulman et al., 2015), enabling multiple batches of data to be updated together. Therefore, it is simpler to implement, more general, and has lower sample complexity. Recently, phasic policy gradient (PPG) (Cobbe et al., 2021) was proposed to decouple the training of the policy and the value function while keeping their feature sharing; PPG optimizes each objective with an appropriate level of sample reuse to improve sample efficiency. Most of these improved policy gradient methods aim at maximizing reward, so none of them is better suited for exploration tasks than CFlowNets. Furthermore, some maximum entropy (Pitis et al., 2020; Haarnoja et al., 2018a; Hazan et al., 2019; Yarats et al., 2021) based reinforcement learning algorithms can also be adapted for continuous tasks, such as soft actor-critic (SAC) (Haarnoja et al., 2018b).
By maximizing the expected reward and entropy, the actor network of SAC can successfully complete tasks while acting as randomly as possible. The differences between CFlowNets and SAC are: 1) SAC selects actions by a Gaussian policy, which is less expressive than using a general unnormalized action p.d.f. $F(s, a)$; 2) In the general case, SAC learns to be proportional to the long-term return, which generates a trajectory distribution satisfying $p(\tau) \propto R(\tau)$, where $R(\tau)$ is the return of $\tau$. CFlowNets consider all possible trajectories that lead to a terminal state $s_f$, and learn the policy to generate $s_f$ with $p(s_f) \propto R(s_f)$.

6. EXPERIMENTS

To demonstrate the effectiveness of the proposed CFlowNets, we conduct experiments on several continuous control tasks with sparse rewards, including Point-Robot-Sparse, Reacher-Goal-Sparse, and Swimmer-Sparse. The visualization of these environments is shown in Figures 7, 8 and 9. Then we compare CFlowNets with a few state-of-the-art baseline RL algorithms, such as DDPG (Lillicrap et al., 2015), TD3 (Fujimoto et al., 2018), PPO (Schulman et al., 2017), and SAC (Haarnoja et al., 2018b). More implementation details are provided in Appendix D. Figure 2 illustrates the distributions of learned policies for CFlowNets and RL algorithms. All curves are max-min normalized. The gray curve is the ground truth of the reward distribution generated by the agent's different actions when it goes to coordinates (7, 7), which indicates that the optimal action here is to go right or up. The red curve shows the flow network output of CFlowNets under different actions, indicating that CFlowNets have an excellent fitting ability to the reward. In contrast, other reinforcement learning algorithms have difficulty fitting the actual reward distribution well. Figures 3(a)-(c) show the number of valid-distinctive trajectories explored as training progresses in the Point-Robot-Sparse, Reacher-Goal-Sparse, and Swimmer-Sparse environments, respectively. After a certain number of training epochs, 10000 trajectories are collected. A valid-distinctive trajectory is defined as a trajectory whose reward is above a threshold $\delta_r$ and whose MSE with respect to the other trajectories is greater than another threshold $\delta_{mse}$. That is, if the returns of two trajectories are both high, but the two are close and the MSE is small, we count them as only one valid-distinctive exploration. $\delta_r$ in Point-Robot-Sparse, Reacher-Goal-Sparse, and Swimmer-Sparse is set to 0.5, -0.2, and 5.0, respectively. $\delta_{mse}$ in Point-Robot-Sparse, Reacher-Goal-Sparse, and Swimmer-Sparse is set to 0.02, 4.0, and 1.0, respectively.
As can be seen from the figure, DDPG, TD3 and PPO have the worst exploration ability: only one valid-distinctive trajectory is generated. SAC explores better at the beginning of training, but its exploration decreases as the training progresses and it gradually converges. In contrast, the exploration ability of CFlowNets is outstanding: the number of trajectories explored far exceeds that of the other algorithms, and the exploration ability remains stable as the training progresses. Figures 3(d)-(e) show that CFlowNets have the fastest and most stable upward trend, and the final reward is ahead of that of other algorithms by a large margin. In contrast, CFlowNets do not perform as well as other algorithms in Figure 3(f). Since the rewards in Point-Robot-Sparse and Reacher-Goal-Sparse are more evenly distributed, these two tasks are more inclined to exploration. CFlowNets have better exploration ability and hence can converge stably. As for Swimmer-Sparse, its reward distribution is relatively steep, and sampling near the maximum reward can achieve faster convergence. It is reasonable for CFlowNets to perform worse than RL on this task in terms of reward. However, in this environment, CFlowNets can still maintain a good exploration ability.
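The valid-distinctive criterion used in this section can be sketched as a simple greedy filter. This is one possible reading of the definition, not the evaluation code used for Figure 3: it assumes trajectories are flattened into equal-length lists of floats, processes them in order, and keeps a trajectory only if its return exceeds $\delta_r$ and its MSE to every previously kept trajectory exceeds $\delta_{mse}$. All names and the toy data are hypothetical.

```python
def trajectory_mse(first, second):
    """Mean squared error between two equal-length flattened trajectories."""
    return sum((x - y) ** 2 for x, y in zip(first, second)) / len(first)


def count_valid_distinctive(trajectories, returns, reward_threshold, mse_threshold):
    """Greedy count: high-return trajectories that are far from all previously kept ones."""
    kept = []
    for trajectory, ret in zip(trajectories, returns):
        if ret <= reward_threshold:
            continue  # not valid: return does not exceed delta_r
        if all(trajectory_mse(trajectory, other) > mse_threshold for other in kept):
            kept.append(trajectory)  # distinct from everything kept so far
    return len(kept)


# Toy data (assumption): two near-duplicates, one distinct success, one failure.
toy_trajectories = [[0.0, 0.0], [0.01, 0.0], [1.0, 1.0], [2.0, 2.0]]
toy_returns = [1.0, 1.0, 1.0, 0.0]
print(count_valid_distinctive(toy_trajectories, toy_returns, 0.5, 0.02))
```

Because the filter is greedy, the count can depend on the order in which trajectories are visited; a clustering-based variant would remove that dependence at higher cost.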

7. CONCLUSION

In this paper, we propose generative continuous flow networks to enhance exploration in continuous control tasks. The theoretical formulation of CFlowNets is first presented. Then, a training framework for CFlowNets is proposed, including the action selection process, the flow approximation algorithm, and the continuous flow matching loss function. Theoretical analysis shows that the error of the flow approximation decreases rapidly as the number of flow samples increases. Experimental results on continuous control tasks illustrate the performance advantages of CFlowNets compared to many reinforcement learning methods. In terms of exploration ability in particular, CFlowNets far exceed other state-of-the-art reinforcement learning algorithms.

Limitations: Similar to GFlowNets, CFlowNets aim to sample actions according to the flow network, rather than selecting actions that maximize rewards. Therefore, CFlowNets are more suitable for exploration-biased tasks. They do not perform as well as reinforcement learning on tasks that aim to maximize reward. Of course, the purpose of CFlowNets is not to completely replace reinforcement learning, but to supplement it, giving a new option for continuous control tasks.

Future work: Future work will study how to combine CFlowNets with the DB and TB objective functions to improve training efficiency.

A.1 WHY IS ASSUMPTION 1 NECESSARY AND REASONABLE?

Necessity: For most environments, it is difficult to generate cycles when sampling a trajectory in a continuous space. This is because $\forall t$, $\mu(\{s_0, ..., s_t\}) = 0$ and $\mu(\mathcal{A}) = \mu(\mathcal{A} \setminus \{s_0, ..., s_t\})$; that is, the probability that $s_{t+1} \in \{s_0, ..., s_t\}$ is very small. However, cycles often arise when certain environments have some special constraints. For example, in a simple pendulum task (see Figure 4), the action is to control the pendulum to rotate from the previous position to the next position at a certain angle.
For this task, it is difficult for a pendulum to rotate to exactly the same position in continuous space. However, if a wall is added to the task, the pendulum can easily go to the same position (see Figure 5 ), i.e. a cycle will occur. Therefore, we still need to add an acyclic assumption to make the theory and performance of CFlowNets guaranteed. Rationality: This assumption is reasonable because for many continuous environments it is difficult to form cycles in trajectories without special constraints. Even for tasks prone to form cycles, we can directly add time steps in the state space to satisfy this assumption. 

A.2 WHY IS ASSUMPTION 2 NECESSARY AND REASONABLE?

Necessity: This assumption is mainly used to guarantee the existence of flow-related integrals, and to ensure that Theorem 2 holds.

Rationality:

We justify this assumption based on simulations, as shown in Figure 6. In addition, Lipschitz continuity is a common assumption for neural networks; to give just some quick examples, Du et al. (2019), Jacot et al. (2018), Allen-Zhu et al. (2019), and Alistarh et al. (2018) all use this assumption to prove the convergence of algorithms.

A.3 WHY IS ASSUMPTION 3 NECESSARY AND REASONABLE?

Necessity: This assumption is used in Definition 2 and enables the retrieval neural network to fit the function $g(s, a)$. While there is a one-to-one correspondence between most environment state transitions and actions, there are still some special cases where, given a state pair $(s, s')$, there can be an infinite number of actions. For example, for Pendulum-with-Wall in Figure 5, after reaching the wall, continuing to increase the action will not continue to change the state $s'$. In addition, a special case of the translation action could be $T(s, a) = s + a$ or using the special linear group, such that Definitions 2 and 3 hold. The translation action is used to ensure that there is no Jacobian term in the continuous flow definition.

Rationality:

This assumption is a property of many environments and is therefore reasonable. For environments that do not satisfy this assumption, we can try to satisfy it by modifying the state to add more information. For example, we can add the duration of the action to the state space of the Pendulum-with-Wall task in Figure 5. Even if the action increases after reaching the wall, the position information will not change, but the duration will increase, so that the state transition and the action correspond one-to-one. The worst case is that we cannot change the environment to satisfy the assumption. In that case, the main problem to solve is that the retrieval neural network G cannot produce multiple outputs for a fixed input. One of our conjectures is that we may be able to alleviate this problem by adding some small random noise to the input, but this idea has not been tested.

B PROOFS

B.1 PROOF OF THEOREM 1

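Theorem 1 (Continuous Flow Matching Condition). Consider a non-negative function $\hat{F}(s, a)$ taking a state $s \in \mathcal{S}$ and an action $a \in \mathcal{A}$ as inputs. Then $\hat{F}$ corresponds to a flow if and only if the following continuous flow matching conditions are satisfied:

$$
\begin{aligned}
\forall s' > s_0,\quad \hat{F}(s') &= \int_{s \in \mathcal{P}(s')} \hat{F}(s \to s')\,ds = \int_{s:\,T(s,a)=s'} \hat{F}(s, a : s \to s')\,ds, \\
\forall s' < s_f,\quad \hat{F}(s') &= \int_{s'' \in \mathcal{C}(s')} \hat{F}(s' \to s'')\,ds'' = \int_{a \in \mathcal{A}} \hat{F}(s', a)\,da.
\end{aligned}
\tag{22}
$$

Furthermore, $\hat{F}$ uniquely defines a Markovian flow $F$ matching $\hat{F}$ such that

$$
F(\tau) = \frac{\prod_{t=1}^{n+1} \hat{F}(s_{t-1} \to s_t)}{\prod_{t=1}^{n} \hat{F}(s_t)}. \tag{23}
$$

Proof. The proof is an extension of that of Proposition 19 in Bengio et al. (2021b) to the continuous case. We first prove necessity. Given a flow network, for non-initial and non-final nodes on a trajectory, the set of complete trajectories passing through state $s'$ is the union of the sets of trajectories going through $s \to s'$ for all $s \in \mathcal{P}(s')$, and is also the union of the sets of trajectories going through $s' \to s''$ for all $s'' \in \mathcal{C}(s')$, i.e.,

$$
\{\tau \in \mathcal{T} : s' \in \tau\} = \bigcup_{s \in \mathcal{P}(s')} \{\tau \in \mathcal{T} : s \to s' \in \tau\} = \bigcup_{s'' \in \mathcal{C}(s')} \{\tau \in \mathcal{T} : s' \to s'' \in \tau\}.
$$

Then we have

$$
\begin{aligned}
F(s') &= \int_{\tau:\, s' \in \tau} F(\tau)\,d\tau = \int_{s \in \mathcal{P}(s')} \int_{\tau:\, s \to s' \in \tau} F(\tau)\,d\tau\,ds = \int_{s \in \mathcal{P}(s')} F(s \to s')\,ds, \\
F(s') &= \int_{\tau:\, s' \in \tau} F(\tau)\,d\tau = \int_{s'' \in \mathcal{C}(s')} \int_{\tau:\, s' \to s'' \in \tau} F(\tau)\,d\tau\,ds'' = \int_{s'' \in \mathcal{C}(s')} F(s' \to s'')\,ds''.
\end{aligned}
$$

This completes the necessity part. Next we show sufficiency. Let $\hat{Z} = \hat{F}(s_0)$ be the partition function and $\hat{P}_F$ be the forward probability function. Then, according to Proposition 18 in Bengio et al. (2021b), there exists a unique Markovian flow $F$ with forward transition probability function $P_F = \hat{P}_F$ and partition function $Z = \hat{Z}$, and such that

$$
F(\tau) = \hat{Z} \prod_{t=1}^{n+1} \hat{P}_F(s_t \mid s_{t-1}) = \frac{\prod_{t=1}^{n+1} \hat{F}(s_{t-1} \to s_t)}{\prod_{t=1}^{n} \hat{F}(s_t)}, \tag{24}
$$

where $s_{n+1} = s_f$. In addition, according to Lemma 1, we have

$$
\int_{\tau \in \mathcal{T}_{0,s'}} P_B(\tau)\,d\tau = \int_{\tau \in \mathcal{T}_{0,s'}} \prod_{s_t \to s_{t+1} \in \tau} P_B(s_t \mid s_{t+1})\,d\tau = 1.
$$

Lemma 1. Consider a continuous task $(\mathcal{S}, \mathcal{A})$ with the transition probabilities defined in (8) and (9). Define $\mathcal{T}_{s,f}$ and $\mathcal{T}_{0,s}$ as the sets of trajectories sampled from a continuous task starting in $s$ and ending in $s_f$, and starting in $s_0$ and ending in $s$, respectively. Then we have

$$
\forall s \in \mathcal{S} \setminus \{s_f\},\quad \int_{\tau \in \mathcal{T}_{s,f}} P_F(\tau)\,d\tau = 1, \tag{25}
$$

$$
\forall s \in \mathcal{S} \setminus \{s_0\},\quad \int_{\tau \in \mathcal{T}_{0,s}} P_B(\tau)\,d\tau = 1. \tag{26}
$$

Thus, for $s' \neq s_0$ we have

$$
F(s') = \hat{Z} \int_{\tau \in \mathcal{T}_{0,s'}} \prod_{(s_t \to s_{t+1}) \in \tau} \hat{P}_F(s_{t+1} \mid s_t)\,d\tau = \hat{Z}\,\frac{\hat{F}(s')}{\hat{F}(s_0)} \int_{\tau \in \mathcal{T}_{0,s'}} \prod_{(s_t \to s_{t+1}) \in \tau} \hat{P}_B(s_t \mid s_{t+1})\,d\tau = \hat{F}(s'). \tag{27}
$$

Combining (27) with $P_F = \hat{P}_F$ yields $\forall s \to s' \in \mathcal{A},\ F(s \to s') = \hat{F}(s \to s')$. Finally, according to Proposition 16 in Bengio et al. (2021b), for any Markovian flow $F'$ matching $\hat{F}$ on states and edges, we have $F'(\tau) = F(\tau)$, which shows the uniqueness property. This completes the proof.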

B.2 PROOF OF THEOREM 2

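Theorem 2. Let $\{a_k\}_{k=1}^{K}$ be sampled independently and uniformly from the continuous action space $\mathcal{A}$. Assume $G_\phi$ can optimally output the actual state $s_t$ with $(s_{t+1}, a_t)$. For any bounded continuous action $a \in \mathcal{A}$ and any state $s_t \in \mathcal{S}$, we have

$$
\mathbb{P}\left( \left| \frac{\mu(\mathcal{A})}{K} \sum_{k=1}^{K} F(s_t, a_k) - \int_{a \in \mathcal{A}} F(s_t, a)\,da \right| \ge t \right) \le 2 \exp\left( - \frac{K t^2}{2 \big(L \mu(\mathcal{A})\,\mathrm{diam}(\mathcal{A})\big)^2} \right)
$$

and

$$
\mathbb{P}\left( \left| \frac{\mu(\mathcal{A})}{K} \sum_{k=1}^{K} F(G_\phi(s_t, a_k), a_k) - \int_{a:\,T(s,a)=s_t} F(s, a)\,da \right| \ge t \right) \le 2 \exp\left( - \frac{K t^2}{2 \big(L \mu(\mathcal{A})(\mathrm{diam}(\mathcal{A}) + \mathrm{diam}(\mathcal{S}))\big)^2} \right), \tag{29}
$$

where $L$ is the Lipschitz constant, $\mathrm{diam}(\mathcal{A})$ denotes the diameter of the action space $\mathcal{A}$, and $\mathrm{diam}(\mathcal{S})$ denotes the diameter of the state space $\mathcal{S}$.

Proof. First, Lemma 2 shows that the expectation of the sample outflow is the true outflow and the expectation of the sample inflow is the true inflow.

Lemma 2. Let $\{a_k\}_{k=1}^{K}$ be sampled independently and uniformly from the continuous action space $\mathcal{A}$. Assume $G_\phi$ can optimally output the actual state $s_t$ with $(s_{t+1}, a_t)$. Then for any state $s_t \in \mathcal{S}$, we have

$$
\mathbb{E}\left[ \frac{\mu(\mathcal{A})}{K} \sum_{k=1}^{K} F(s_t, a_k) \right] = \int_{a \in \mathcal{A}} F(s_t, a)\,da
$$

and

$$
\mathbb{E}\left[ \frac{\mu(\mathcal{A})}{K} \sum_{k=1}^{K} F(G_\phi(s_t, a_k), a_k) \right] = \int_{a:\,T(s,a)=s_t} F(s, a)\,da,
$$

where $s = g(s_t, a)$.

Then, define the following terms:

$$
\Gamma_k = \frac{\mu(\mathcal{A})}{K} F(s_t, a_k) - \frac{1}{K} \int_{a \in \mathcal{A}} F(s_t, a)\,da = \frac{1}{K} \int_{a \in \mathcal{A}} \big[ F(s_t, a_k) - F(s_t, a) \big]\,da
$$

and

$$
\begin{aligned}
\Lambda_k &= \frac{\mu(\mathcal{A})}{K} F(G_\phi(s_t, a_k), a_k) - \frac{1}{K} \int_{a:\,T(s,a)=s_t} F(s, a)\,da \qquad (33) \\
&= \frac{1}{K} \int_{a:\,T(s,a)=s_t} \big[ F(G_\phi(s_t, a_k), a_k) - F(s, a) \big]\,da,
\end{aligned}
$$

where $s = g(s_t, a)$. Note that the variables $\{\Gamma_k\}_{k=1}^{K}$ are independent and $\mathbb{E}[\Gamma_k] = 0$, $k = 1, \ldots, K$, according to Lemma 2. So the following equations hold:

$$
\mathbb{P}\left( \left| \frac{\mu(\mathcal{A})}{K} \sum_{k=1}^{K} F(s_t, a_k) - \int_{a \in \mathcal{A}} F(s_t, a)\,da \right| \ge t \right) = \mathbb{P}\left( \left| \sum_{k=1}^{K} \Gamma_k \right| \ge t \right)
$$

and

$$
\mathbb{P}\left( \left| \frac{\mu(\mathcal{A})}{K} \sum_{k=1}^{K} F(G_\phi(s_t, a_k), a_k) - \int_{a:\,T(s,a)=s_t} F(s, a)\,da \right| \ge t \right) = \mathbb{P}\left( \left| \sum_{k=1}^{K} \Lambda_k \right| \ge t \right).
$$

Since $F(s, a)$ is a Lipschitz function, we have

$$
|\Gamma_k| \le \frac{1}{K} \int_{a \in \mathcal{A}} \big| F(s_t, a_k) - F(s_t, a) \big|\,da \le \frac{L}{K} \int_{a \in \mathcal{A}} \|a_k - a\|\,da \le \frac{L \mu(\mathcal{A})\,\mathrm{diam}(\mathcal{A})}{K}. \tag{37}
$$

Together with Assumption 3, that is, for any pair $(s, a)$ satisfying $T(s, a) = s_t$, $a$ is unique if we fix $s$, we have

$$
\begin{aligned}
|\Lambda_k| &\le \frac{1}{K} \int_{a:\,T(s,a)=s_t} \big| F(G_\phi(s_t, a_k), a_k) - F(s, a) \big|\,da \\
&\le \frac{1}{K} \int_{a:\,T(s,a)=s_t} \Big( \big| F(G_\phi(s_t, a_k), a_k) - F(s, a_k) \big| + \big| F(s, a_k) - F(s, a) \big| \Big)\,da \\
&\le \frac{1}{K} \int_{a:\,T(s,a)=s_t} \Big( L \|G_\phi(s_t, a_k) - s\| + L \|a_k - a\| \Big)\,da \\
&\le \frac{L \mu(\mathcal{A}) \big(\mathrm{diam}(\mathcal{A}) + \mathrm{diam}(\mathcal{S})\big)}{K}.
\end{aligned}
\tag{38}
$$

Lemma 3 (Hoeffding's inequality, Vershynin (2018)). Let $x_1, \ldots, x_K$ be independent random variables. Assume the variables $\{x_k\}_{k=1}^{K}$ are bounded in the interval $[T_l, T_r]$. Then for any $t > 0$, we have

$$
\mathbb{P}\left( \left| \sum_{k=1}^{K} (x_k - \mathbb{E} x_k) \right| \ge t \right) \le 2 \exp\left( - \frac{2 t^2}{K (T_r - T_l)^2} \right). \tag{39}
$$

Incorporating $T_r = \frac{L}{K} \mu(\mathcal{A})\,\mathrm{diam}(\mathcal{A})$ and $T_l = -\frac{L}{K} \mu(\mathcal{A})\,\mathrm{diam}(\mathcal{A})$ in Lemma 3 with (37), and incorporating $T_r = \frac{L}{K} \mu(\mathcal{A})(\mathrm{diam}(\mathcal{A}) + \mathrm{diam}(\mathcal{S}))$ and $T_l = -\frac{L}{K} \mu(\mathcal{A})(\mathrm{diam}(\mathcal{A}) + \mathrm{diam}(\mathcal{S}))$ in Lemma 3 with (38), we complete the proof.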
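As a rough numerical illustration of the outflow part of Theorem 2, the following sketch estimates the outflow integral with K uniform action samples and evaluates the right-hand side of the bound. The flow function `toy_flow`, the box-shaped action space, and all constants are assumptions chosen so that the true integral is known in closed form; they are not taken from the experiments.

```python
import numpy as np

def mc_outflow(flow, s_t, low, high, K, rng):
    """Estimate the integral of F(s_t, a) over A = [low, high]^d
    as mu(A)/K * sum_k F(s_t, a_k), with a_k drawn uniformly from A."""
    low, high = np.asarray(low, float), np.asarray(high, float)
    mu_A = float(np.prod(high - low))            # Lebesgue measure of the box
    actions = rng.uniform(low, high, size=(K, low.size))
    return mu_A / K * sum(flow(s_t, a) for a in actions)

def hoeffding_rhs(t, K, L, mu_A, diam_A):
    """Right-hand side of the outflow bound: 2 exp(-K t^2 / (2 (L mu(A) diam(A))^2))."""
    return 2.0 * np.exp(-K * t**2 / (2.0 * (L * mu_A * diam_A) ** 2))

# Hypothetical flow on A = [0, 1]: F(s, a) = 1 + a is 1-Lipschitz in a,
# and its integral over A is exactly 1.5.
toy_flow = lambda s, a: 1.0 + a[0]

rng = np.random.default_rng(0)
estimate = mc_outflow(toy_flow, s_t=None, low=[0.0], high=[1.0], K=1000, rng=rng)
bound_1k = hoeffding_rhs(t=0.1, K=1000, L=1.0, mu_A=1.0, diam_A=1.0)
bound_4k = hoeffding_rhs(t=0.1, K=4000, L=1.0, mu_A=1.0, diam_A=1.0)
print(f"estimate={estimate:.4f}, bound(K=1000)={bound_1k:.2e}, bound(K=4000)={bound_4k:.2e}")
```

Since K appears linearly in the exponent, quadrupling the number of flow samples raises the exponential factor to the fourth power, which is the sense in which the error probability decreases rapidly with K.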

B.3 PROOF OF LEMMA 1

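Lemma 1. Consider a continuous task $(\mathcal{S}, \mathcal{A})$ with the transition probabilities defined in (8) and (9). Define $\mathcal{T}_{s,f}$ and $\mathcal{T}_{0,s}$ as the sets of trajectories sampled from a continuous task starting in $s$ and ending in $s_f$, and starting in $s_0$ and ending in $s$, respectively. Then we have

$$
\forall s \in \mathcal{S} \setminus \{s_f\},\quad \int_{\tau \in \mathcal{T}_{s,f}} P_F(\tau)\,d\tau = 1, \tag{40}
$$

$$
\forall s \in \mathcal{S} \setminus \{s_0\},\quad \int_{\tau \in \mathcal{T}_{0,s}} P_B(\tau)\,d\tau = 1. \tag{41}
$$

Proof. We show by strong induction that (40) holds, mainly following the proof of Lemma 5 in Bengio et al. (2021b); extending the argument to (41) is then trivial. Define $d$ as the maximum trajectory length in $\mathcal{T}_{s,f}$, $s \neq s_f$. We have:

Base case: If $d = 1$, then $\int_{\tau \in \mathcal{T}_{s,f}} P_F(\tau)\,d\tau = P_F(s \to s_f) = 1$ holds by noting that $\mathcal{T}_{s,f} = \{(s \to s_f)\}$.

Induction step: Consider $d > 1$. By noting (12), we have

$$
\begin{aligned}
\int_{\tau \in \mathcal{T}_{s,f}} P_F(\tau)\,d\tau &= \int_{s' \in \mathcal{C}(s)} \int_{\tau \in \mathcal{T}_{s \to s',f}} P_F(\tau)\,d\tau\,ds' \qquad (42) \\
&= \int_{s' \in \mathcal{C}(s)} \int_{\tau \in \mathcal{T}_{s',f}} P_F(s' \mid s)\,P_F(\tau)\,d\tau\,ds' \qquad (43) \\
&= \int_{s' \in \mathcal{C}(s)} P_F(s' \mid s)\,ds' \int_{\tau \in \mathcal{T}_{s',f}} P_F(\tau)\,d\tau = 1,
\end{aligned}
$$

where the last equality follows from the induction hypothesis.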

B.4 PROOF OF LEMMA 2

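Lemma 2. Let $\{a_k\}_{k=1}^{K}$ be sampled independently and uniformly from the continuous action space $\mathcal{A}$. Assume $G_\phi$ can optimally output the actual state $s_t$ with $(s_{t+1}, a_t)$. Then for any state $s_t \in \mathcal{S}$, we have

$$
\mathbb{E}\left[ \frac{\mu(\mathcal{A})}{K} \sum_{k=1}^{K} F(s_t, a_k) \right] = \int_{a \in \mathcal{A}} F(s_t, a)\,da
$$

and

$$
\mathbb{E}\left[ \frac{\mu(\mathcal{A})}{K} \sum_{k=1}^{K} F(G_\phi(s_t, a_k), a_k) \right] = \int_{a:\,T(s,a)=s_t} F(s, a)\,da,
$$

where $s = g(s_t, a)$.

Proof. Since $\{a_k\}_{k=1}^{K}$ is sampled independently and uniformly from the continuous action space $\mathcal{A}$, we have

$$
\mathbb{E}\big[ F(s_t, a_k) \big] = \frac{1}{\mu(\mathcal{A})} \int_{a \in \mathcal{A}} F(s_t, a)\,da.
$$

Therefore, we obtain

$$
\begin{aligned}
\mathbb{E}\left[ \frac{\mu(\mathcal{A})}{K} \sum_{k=1}^{K} F(s_t, a_k) \right] &= \frac{\mu(\mathcal{A})}{K} \sum_{k=1}^{K} \mathbb{E}\big[ F(s_t, a_k) \big] \qquad (48) \\
&= \int_{a \in \mathcal{A}} F(s_t, a)\,da. \qquad (49)
\end{aligned}
$$

Since Assumption 3 holds, that is, for any pair $(s, a)$ satisfying $T(s, a) = s_t$, $a$ is unique if we fix $s$, we have

$$
\mathbb{E}\big[ F(G_\phi(s_t, a_k), a_k) \big] = \frac{1}{\mu(\mathcal{A})} \int_{a:\,T(s,a)=s_t} F(s, a)\,da,
$$

where $s = g(s_t, a)$. Therefore, we get

$$
\mathbb{E}\left[ \frac{\mu(\mathcal{A})}{K} \sum_{k=1}^{K} F(G_\phi(s_t, a_k), a_k) \right] = \frac{\mu(\mathcal{A})}{K} \sum_{k=1}^{K} \mathbb{E}\big[ F(G_\phi(s_t, a_k), a_k) \big] = \int_{a:\,T(s,a)=s_t} F(s, a)\,da.
$$

This completes the proof.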
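The inflow part of Lemma 2 can be checked numerically on a toy problem. The sketch below assumes the translation transition $T(s, a) = s + a$, so that the parent state is known exactly as $s_t - a$. The function `exact_parent` stands in for an optimally trained retrieval network $G_\phi$, and `toy_flow` is a hypothetical flow chosen so that the inflow has a closed form.

```python
import numpy as np

def exact_parent(s_t, a):
    """Stand-in for an optimal retrieval network under T(s, a) = s + a."""
    return s_t - a

def mc_inflow(flow, parent_fn, s_t, low, high, K, rng):
    """Estimate the inflow into s_t as mu(A)/K * sum_k F(G(s_t, a_k), a_k)
    with a_k drawn uniformly from the interval A = [low, high]."""
    mu_A = high - low
    actions = rng.uniform(low, high, size=K)
    return mu_A / K * sum(flow(parent_fn(s_t, a), a) for a in actions)

# Hypothetical flow: F(s, a) = (s + 2)(1 + a), non-negative for s >= -2, a >= 0.
toy_flow = lambda s, a: (s + 2.0) * (1.0 + a)

# For s_t = 1 and A = [0, 1], the inflow is the integral of
# (3 - a)(1 + a) = 3 + 2a - a^2 over [0, 1], which equals 11/3.
true_inflow = 11.0 / 3.0

rng = np.random.default_rng(0)
inflow_estimate = mc_inflow(toy_flow, exact_parent, s_t=1.0, low=0.0, high=1.0, K=20000, rng=rng)
print(f"estimate={inflow_estimate:.4f}, closed form={true_inflow:.4f}")
```

With a learned $G_\phi$ the parent states would only be approximate, which is why the lemma is stated under the assumption that $G_\phi$ is optimal.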

C PSEUDOCODE OF CFLOWNETS

For clarity, we show pseudocode for CFlowNets in Algorithm 1.

Figures 7, 8 and 9 visualize the Point-Robot-Sparse, Reacher-Goal-Sparse, and Swimmer-Sparse tasks. In Point-Robot-Sparse, the goal of the agent is to navigate to two different goals. The agent starts at the coordinate (0, 0) and moves towards a target coordinate one step at a time. The environment has two target coordinates, (5, 10) and (10, 5), with a maximum episode length of 12, and it returns a reward only when the last step is reached. The reward is computed from the distance between the agent's current position and the target node: the smaller the distance, the greater the reward. At each step, the agent can move at any angle towards the upper right.

Both Reacher-Goal-Sparse and Swimmer-Sparse are adapted from OpenAI Gym's MuJoCo environments. In Reacher-Goal-Sparse, "Reacher" is a two-jointed robotic arm. The goal of the agent is to reach a randomly generated target by moving the robot's end effector. Figure 8 shows the movement process of the robotic arm. By adjusting the torque applied at the hinge joint, the end effector can gradually approach the target. In Swimmer-Sparse, the "swimmer" is suspended in a two-dimensional pool, and the goal is to move as fast as possible towards the right or left. Figure 9 shows how the shape of the robot changes during motion. By applying torque on the rotors and using the fluid's friction, the robot can swim faster. We set the maximum number of steps to 50 for these two environments. For Reacher-Goal-Sparse, when the last step is reached, the environment returns a reward that measures how far the agent is from the randomly generated target. The closer the agent is to the target, the greater the reward. For Swimmer-Sparse, the farther the agent is to the left or right of the starting point, the greater the reward returned.
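To make the Point-Robot-Sparse description concrete, here is a minimal sketch of such an environment. The start position, the two goals, and the episode length come from the text; the unit step size, the angle parameterization in $[0, \pi/2]$, and the exponential form of the terminal reward are assumptions made for illustration, and the class name is hypothetical.

```python
import math

GOALS = [(5.0, 10.0), (10.0, 5.0)]  # target coordinates from the text
MAX_STEPS = 12                      # maximum episode length from the text

class PointRobotSparse:
    """Toy sparse-reward point robot; reward is returned only at the last step."""

    def __init__(self, step_size=1.0):
        self.step_size = step_size  # assumption: unit-length steps
        self.reset()

    def reset(self):
        self.pos = (0.0, 0.0)
        self.t = 0
        return self.pos

    def step(self, angle):
        # Assumption: the action is an angle clipped to [0, pi/2] (upper right).
        angle = min(max(angle, 0.0), math.pi / 2)
        x, y = self.pos
        self.pos = (x + self.step_size * math.cos(angle),
                    y + self.step_size * math.sin(angle))
        self.t += 1
        done = self.t >= MAX_STEPS
        reward = self._terminal_reward() if done else 0.0
        return self.pos, reward, done

    def _terminal_reward(self):
        # Assumption: reward decays exponentially with distance to the nearest goal.
        nearest = min(math.dist(self.pos, g) for g in GOALS)
        return math.exp(-nearest)

env = PointRobotSparse()
env.reset()
rewards = []
for _ in range(MAX_STEPS):
    pos, reward, done = env.step(math.atan2(10.0, 5.0))  # head towards (5, 10)
    rewards.append(reward)
print(pos, rewards[-1], done)
```

Because both goals yield the same terminal reward under this sketch, a policy that only maximizes return can settle on one of them, whereas sampling in proportion to reward should visit both.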

D.2 ADDITIONAL ANALYSIS

Figure 10 shows the average reward and reward distribution of different algorithms on the Point-Robot-OneGoal-Sparse task, where an agent needs to navigate to a specific location. Figure 10 (a) indicates that CFlowNets obtain the highest average return compared to the RL-based algorithms. In Figure 10 (b), all algorithms are able to fit the reward distribution well under the one-goal setting, while CFlowNets fit it better. Note that RL algorithms can also learn the reward distribution in this task, since maximizing the reward is the optimal policy in the case of a single objective, and the policy is not difficult to learn. In Figure 11, we provide the action reward distribution of different algorithms with 2e4 total timesteps on Point-Robot-Sparse at Point (4,8), Point (8,4) and Point (7,7), respectively. Note that unlike Figure 2, where the total number of timesteps is 1e5, here we show the result with 2e4 total timesteps, since we found that DDPG overfits after 1e5 timesteps in this task. We therefore show the results without overfitting for a fairer comparison. We can see that, at every point, the policy of CFlowNets better matches the real reward distribution. For example, at points (4,8) and (8,4), CFlowNets tend to choose actions that guide the agent towards (5, 10) and (10, 5), respectively. For a location between the two goals (point (7,7)), there are two directions that allow the agent to reach goals with high rewards. In contrast, the policies learned by RL algorithms only occasionally match the true reward distribution at a certain point, and cannot stably match it at every point. This also shows that the policies learned by RL algorithms are relatively simple. CFlowNets learn more diverse policies for agents to reach different goals with high rewards, while other methods usually find one goal instead of all potentially high-reward locations.
Figure 12 and Figure 13 visualize the trajectories produced by different algorithms. In the Point-Robot-OneGoal-Sparse task, DDPG, TD3, and PPO each produce a single trajectory, while SAC can select actions from the policy probability distribution, so different trajectories can be obtained. In contrast, CFlowNets find more diverse trajectories and also find the highest-reward goal (thickened red trajectory), which means that CFlowNets can better explore the region near the goal. In the Point-Robot-Sparse task, the RL-based algorithms seek only one goal, whereas CFlowNets can find all goals. It is worth noting that in Figure 13 (e), the trajectories sampled by CFlowNets are not as dense near the maximum reward as in Figure 12 (e). Rather, they are denser on the diagonal. This is because, in most positions, the probabilities of actions that go up and to the right are relatively high, so in combination it is easier to move in the diagonal direction. In addition, the reward on the line between the two goals is not small. When sampling with the output of the flow model as a probability, many trajectories are therefore more likely to reach the diagonal. Figure 14 shows the true reward distribution of Point-Robot-Sparse, where the reward is higher in the area near the two goals and on the line between them.

D.3 EXPERIMENT RESULTS ON HIGHWAY-PARKING-SPARSE

We evaluate the performance of CFlowNets on Highway-Parking-Sparse, which is an ego-vehicle control task. As shown in Figure 15, the goal is to make the ego-vehicle park in a given space with the appropriate orientation by adjusting its controller. The dimension of the vehicle observation is 18, consisting of the distance between the vehicle and the parking space, the vehicle speed, the triangular heading information, the goal the agent should attempt to achieve, and the goal that it currently achieves. The action space includes control over the throttle and steering angle, and the reward function is set as the distance between the ego-vehicle and the parking space. Figure 16 shows the average reward and the number of valid-distinctive trajectories explored by different algorithms as training progresses, which illustrates that the performance of CFlowNets is more promising than that of the RL-based algorithms. Even for higher-dimensional continuous tasks, CFlowNets have very competitive reward results (outperforming DDPG, TD3, and SAC), while achieving much better exploration performance than RL-based algorithms.

D.4 BASELINES

We compare our proposed CFlowNets to the following baselines:

• Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al., 2015). https://github.com/sfujim/TD3/blob/master/DDPG.py
• Twin Delayed Deep Deterministic Policy Gradient (TD3) (Fujimoto et al., 2018). https://github.com/sfujim/TD3
• Soft Actor-Critic (SAC) (Haarnoja et al., 2018b). https://github.com/denisyarats/pytorch_sac/
• Proximal Policy Optimization (PPO) (Schulman et al., 2017). https://github.com/DLR-RM/stable-baselines3/blob/master/stable_baselines3/ppo/ppo.py

D.5 HYPER-PARAMETER

We provide the hyper-parameters of all compared methods under different environments in Table 1, Table 2, Table 3, Table 4, and Table 5. For a fair comparison, we set "Total Timesteps", "Start Training Timestep", "Max Episode Length", "Actor Network Hidden Layers", "Critic Network Hidden Layers", "Optimizer", "Learning Rate", and "Discount Factor" to the same values for all algorithms. The parameters specific to the baseline algorithms are kept the same as those in the original code to achieve good performance. For the parameters specific to CFlowNets, we set the number of sample flows to 100 and the action probability buffer size to 1000 to trade off performance and computational load. Note that CFlowNets do not require as large a replay buffer as the RL algorithms, since the exploration ability of CFlowNets is better and a good policy can already be learned from a small replay buffer. This is also an advantage of CFlowNets compared to RL-based algorithms.



A commonly used reward shaping method is to multiply the reward by a constant and adjust the reward to an appropriate range to ensure better convergence. Therefore, after setting λ to 1, a reasonable reward shaping operation can also compensate for the influence of λ error.




Figure 2: Reward distributions on Point-Robot-Sparse Task.

Figure 3: Comparison results of CFlowNets, DDPG, TD3, SAC and PPO on Point-Robot-Sparse, Reacher-Goal-Sparse, and Swimmer-Sparse tasks. Top: Number of valid-distinctive trajectories generated under 10000 explorations. Bottom: The average reward of different methods.

Figure 4: Pendulum. It is difficult for the state to be completely consistent in this continuous space.

Figure 5: Pendulum-with-Wall. The state becomes consistent when reaching the wall.

Figure 6: Accumulated maximum Lipschitz constant of flow network F (s, a).

As shown in Figure 6, we calculate $\frac{|F(s,a) - F(s,a')|}{\|a - a'\|}$ and $\frac{|F(s,a) - F(s',a)|}{\|s - s'\|}$ for each sample tuple $(s, a, a')$ and $(s, s', a)$, respectively, to analyze the Lipschitz constant. Their accumulated maximum Lipschitz constants are shown in Figures 6 (a) and (b), respectively. Clearly, there exists a finite Lipschitz constant for our flow network.
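The accumulated maximum of the action-side ratio described above could be computed along the following lines. This is a sketch under stated assumptions: `toy_flow` is a hypothetical stand-in for the trained flow network, the sample tuples are drawn at random rather than taken from a replay buffer, and the function name is illustrative.

```python
import numpy as np

def accumulated_max_lipschitz(flow, states, actions, rng, n_pairs=2000):
    """Running maximum of |F(s,a) - F(s,a')| / ||a - a'|| over random tuples (s, a, a')."""
    ratios = []
    for _ in range(n_pairs):
        i, j = rng.integers(0, len(actions), size=2)
        gap = np.linalg.norm(actions[i] - actions[j])
        if gap < 1e-8:  # skip (near-)identical actions to avoid dividing by zero
            continue
        s = states[i]
        ratios.append(abs(flow(s, actions[i]) - flow(s, actions[j])) / gap)
    return np.maximum.accumulate(np.array(ratios))

# Hypothetical flow: 1 + tanh(.) is non-negative and, because tanh is
# 1-Lipschitz, its Lipschitz constant with respect to a is at most ||w||.
w = np.array([0.5, -1.0])
toy_flow = lambda s, a: 1.0 + np.tanh(w @ a + 0.1 * s.sum())

rng = np.random.default_rng(0)
states = rng.normal(size=(256, 3))
actions = rng.uniform(-1.0, 1.0, size=(256, 2))
curve = accumulated_max_lipschitz(toy_flow, states, actions, rng)
print(f"estimated constant: {curve[-1]:.3f}, analytic upper bound: {np.linalg.norm(w):.3f}")
```

The state-side ratio can be estimated the same way by fixing the action and varying the state. A curve that flattens out as more tuples are processed is consistent with a finite Lipschitz constant, although sampling alone cannot prove one.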

Figure 7: Visualization of Point-Robot-Sparse task.

Figure 10: The average reward and reward distributions of CFlowNets, DDPG, TD3, SAC and PPO on Point-Robot-OneGoal-Sparse task.

Figure 12: Sampled trajectories on Point-Robot-OneGoal-Sparse task.

Figure 13: Sampled trajectories on Point-Robot-Sparse task.

Figure 14: Reward distributions on Point-Robot-Sparse Task.

Figure 15: Visualization of Highway-Parking-Sparse task.

Overall framework of CFlowNets. Left: During the environment interaction phase, we sample actions to update states with probabilities proportional to the reward according to CFlowNet. Middle: We randomly sample actions to approximately calculate the inflows and outflows, where a DNN is used to estimate the parent states. Right: Continuous flow matching loss is used to train the CFlowNet based on making inflows equal to outflows or reward.


Compute edge flow $F_\theta(s_t, a_i)$ for each $a_i \in \{a_i\}_{i=1}^{M}$ to generate $P$
Sample $a_t \sim P$ and execute $a_t$ in the environment to obtain $r_{t+1}$ and $s_{t+1}$
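The two steps above can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the M candidate actions are drawn uniformly from a box-shaped action space, $P$ is obtained by normalizing the edge flows, and `toy_flow` is a hypothetical stand-in for $F_\theta$.

```python
import numpy as np

def select_action(flow, s_t, low, high, M, rng):
    """Sample M candidate actions, score them with the edge flow,
    and draw one action with probability proportional to its flow."""
    low, high = np.asarray(low, float), np.asarray(high, float)
    candidates = rng.uniform(low, high, size=(M, low.size))
    flows = np.array([flow(s_t, a) for a in candidates], dtype=float)
    flows = np.maximum(flows, 0.0)              # flows are non-negative by definition
    total = flows.sum()
    # Fall back to a uniform distribution if every candidate has zero flow.
    probs = flows / total if total > 0 else np.full(M, 1.0 / M)
    idx = rng.choice(M, p=probs)
    return candidates[idx], probs

# Hypothetical flow that only assigns mass to actions with a[0] > 0.5.
toy_flow = lambda s, a: 1.0 if a[0] > 0.5 else 0.0

rng = np.random.default_rng(0)
a_t, P = select_action(toy_flow, s_t=np.zeros(2), low=[0.0], high=[1.0], M=100, rng=rng)
print(a_t, P.sum())
```

Sampling in proportion to the flow, rather than taking the candidate with the largest flow, is what lets the policy visit several high-reward regions instead of committing to one.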

Hyper-parameters of CFlowNets under different environments.

Hyper-parameter of DDPG under different environments.

Hyper-parameter of TD3 under different environments.


Published as a conference paper at ICLR 2023 

