

Abstract

In many applications of machine learning, like drug discovery and material design, the goal is to generate candidates that simultaneously maximize a set of objectives. As these objectives are often conflicting, there is no single candidate that maximizes all objectives simultaneously, but rather a set of Pareto-optimal candidates where one objective cannot be improved without worsening another. Moreover, in practice, these objectives are often under-specified, making the diversity of candidates a key consideration. Existing multi-objective optimization methods focus predominantly on covering the Pareto front, failing to capture diversity in the space of candidates. Motivated by the success of GFlowNets for the generation of diverse candidates in the single-objective setting, in this paper we consider Multi-Objective GFlowNets (MOGFNs). We study two variants: (a) Preference-Conditional GFlowNets (MOGFN-PC), which combine reward-conditional GFlowNets (Bengio et al., 2021b) with weighted-sum scalarization (Ehrgott, 2005) to model a family of single-objective sub-problems derived by decomposing the multi-objective optimization problem, and (b) MOGFN-AL, an extension of GFlowNet-AL (Jain et al., 2022) for multi-objective active learning settings. Our work is the first to empirically demonstrate conditional GFlowNets. Through a series of experiments on synthetic and benchmark tasks covering high-dimensional problems of practical relevance -- the generation of small molecules, DNA aptamer sequences and fluorescent proteins -- we demonstrate that MOGFNs outperform existing methods in terms of Hypervolume, R2-distance and candidate diversity. We also demonstrate the effectiveness of MOGFNs over existing methods in active learning settings. Finally, we supplement our empirical results with a careful analysis of each component of MOGFNs.
Our contributions are as follows:
C1 We demonstrate how two variants of GFlowNets, MOGFN-PC and MOGFN-AL, can be applied to multi-objective optimization. Our work is the first successful empirical validation of reward-conditional GFlowNets (Bengio et al., 2021b).
C2 Through a series of experiments on molecule generation and sequence generation, we demonstrate that MOGFN-PC generates diverse Pareto-optimal candidates.
C3 In a challenging active learning task for designing fluorescent proteins, we show that MOGFN-AL results in significant improvements to sample-efficiency and diversity of generated candidates.
C4 We perform a thorough analysis of the main components of MOGFNs to provide insights into design choices that affect performance.

2. BACKGROUND

Multi-objective optimization (MOO) involves finding a set of feasible candidates x⋆ ∈ X which simultaneously maximize a set of d objectives:

max_{x∈X} (R_1(x), . . . , R_d(x)).  (1)

In general, the objectives being optimized can be conflicting, such that there is no single x⋆ which simultaneously maximizes all objectives. Consequently, the concept of Pareto optimality is adopted in MOO, giving rise to a set of solutions trading off the objectives in different ways. Given x_1, x_2 ∈ X, x_1 is said to dominate x_2, written x_1 ≻ x_2, iff R_i(x_1) ≥ R_i(x_2) ∀i ∈ {1, . . . , d} and ∃k ∈ {1, . . . , d} such that R_k(x_1) > R_k(x_2). A candidate x⋆ is Pareto-optimal if there exists no other solution x′ ∈ X which dominates x⋆; in other words, for a Pareto-optimal candidate it is impossible to improve one objective without sacrificing another. The Pareto set is the set of all Pareto-optimal candidates in X, and the Pareto front is defined as the image of the Pareto set in objective space. It is important to note that since the objectives being optimized are in general not injective, any point on the Pareto front can be the image of several candidates in the Pareto set.
This introduces a notion of diversity in the candidate space, capturing all the candidates corresponding to a point on the Pareto front, that is critical for applications such as drug discovery. While there are several paradigms for tackling the MOO problem (Ehrgott, 2005; Miettinen, 2012; Pardalos et al., 2017), we consider scalarization, where the multi-objective problem is decomposed into simpler single-objective problems, as it is well suited to the GFlowNet formulation introduced in Section 3.1. A set of weights (preferences) ω_i is assigned to the objectives R_i, such that ω_i ≥ 0 and Σ_{i=1}^d ω_i = 1. The MOO problem in Equation 1 is then decomposed into solving single-objective sub-problems of the form max_{x∈X} R(x|ω), where R is a scalarization function. Weighted-sum scalarization, R(x|ω) = Σ_{i=1}^d ω_i R_i(x), is a widely used scalarization function which results in Pareto-optimal candidates for problems with a convex Pareto front (Ehrgott, 2005). Weighted Tchebycheff scalarization, R(x|ω) = min_{1≤i≤d} ω_i |R_i(x) − z⋆_i|, where z⋆_i denotes some ideal value for objective R_i, results in Pareto-optimal solutions even for problems with a non-convex Pareto front (Pardalos et al., 2017). See Appendix B for more discussion on scalarization. In summary, using scalarization, the MOO problem can be viewed as solving a family of single-objective optimization problems.

Generative Flow Networks (Bengio et al., 2021a;b) are a family of probabilistic models which generate, through a sequence of steps, compositional objects x ∈ X with probability proportional to a given reward R : X → R+.
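The dominance relation and Pareto-set extraction defined above can be sketched in a few lines (a minimal pure-Python illustration; candidate names and objective values are hypothetical):

```python
def dominates(r1, r2):
    """True iff r1 Pareto-dominates r2: no worse in every objective and
    strictly better in at least one (objectives are maximized)."""
    return all(a >= b for a, b in zip(r1, r2)) and any(a > b for a, b in zip(r1, r2))

def pareto_set(candidates):
    """Return the non-dominated members of a list of (name, rewards) pairs."""
    return [
        (x, r) for x, r in candidates
        if not any(dominates(r2, r) for _, r2 in candidates)
    ]

# Two distinct candidates ("a" and "b") can map to the same Pareto-front point:
cands = [("a", (1.0, 0.5)), ("b", (1.0, 0.5)), ("c", (0.5, 1.0)), ("d", (0.4, 0.4))]
front = pareto_set(cands)  # "d" is dominated and excluded
```

Note that identical reward vectors do not dominate each other, so several candidates sharing one front point all remain in the Pareto set, which is exactly the source of candidate-space diversity discussed above.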

1. INTRODUCTION

Decision making in practical applications often involves reasoning about multiple, often conflicting, objectives (Keeney et al., 1993). For example, in drug discovery, the goal is to generate novel drug-like molecules that inhibit a target, are easy to synthesize and can safely be used by humans (Dara et al., 2021). Unfortunately, these objectives often conflict -- molecules effective against a target might also have adverse effects on humans -- so there is no single molecule which maximizes all the objectives simultaneously. Such problems fall under the umbrella of Multi-Objective Optimization (MOO; Ehrgott, 2005; Miettinen, 2012), wherein one is interested in identifying Pareto-optimal candidates. The set of Pareto-optimal candidates covers all the best trade-offs among the objectives, i.e., the Pareto front, where each point on that front corresponds to a different set of weights associated with the objectives. In-silico drug discovery and material design are typically driven by proxies trained with finite data, which only approximate the problem's true objectives and therefore carry intrinsic epistemic uncertainty in their predictions. In such problems, it is important not only to cover the Pareto front, but also to generate sets of diverse candidates at each solution of the front, so as to increase the likelihood of success in downstream evaluations (Jain et al., 2022). Generative Flow Networks (GFlowNets; Bengio et al., 2021a;b) are a recently proposed family of probabilistic models which tackle the problem of diverse candidate generation. Contrary to the reward-maximization view of reinforcement learning (RL) and Bayesian optimization (BO), GFlowNets sample candidates with probability proportional to the reward. Sampling candidates, as opposed to greedily generating them, implicitly encourages diversity in the generated candidates.
GFlowNets have shown promising results in single-objective problems of molecule generation (Bengio et al., 2021a) and biological sequence design (Jain et al., 2022). In this paper, we study Multi-Objective GFlowNets (MOGFNs), extensions of GFlowNets which tackle the multi-objective optimization problem; we consider two variants, MOGFN-PC and MOGFN-AL, introduced in Section 3. The sequential construction of x ∈ X by a GFlowNet can be described as a trajectory τ ∈ T in a weighted directed acyclic graph (DAG) G = (S, E), starting from an empty object s_0 and following actions a ∈ A as building blocks. The nodes S of this graph (states) correspond to the set of all possible objects that can be constructed using sequences of actions in A. An edge (s, s′) ∈ E indicates that applying action a at state s leads to state s′. The forward policy P_F(·|s) is a distribution over the children of state s; x can be generated by starting at s_0 and iteratively sampling a sequence of actions from P_F. Similarly, the backward policy P_B(·|s) is a distribution over the parents of state s and can generate backward trajectories starting at any state x, e.g., iteratively sampling from P_B starting at x shows a way x could have been constructed. Let π(x) be the marginal likelihood of sampling trajectories terminating in x following P_F, and let the partition function be Z = Σ_{x∈X} R(x). The learning problem solved by GFlowNets is to estimate P_F such that π(x) ∝ R(x). This is achieved using learning objectives like trajectory balance (TB; Malkin et al., 2022), which learns P_F(·|s; θ), P_B(·|s; θ) and Z_θ approximating the forward and backward policies and partition function, parameterized by θ. We refer the reader to Bengio et al. (2021b); Malkin et al. (2022) for a more thorough introduction to GFlowNets.
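The sequential construction of an object by iteratively sampling actions from a forward policy can be sketched on a toy sequence-building DAG (a minimal sketch; the uniform policy and the token alphabet are illustrative stand-ins for a learned, neural P_F):

```python
import random

def sample_trajectory(policy, s0="", max_len=4, seed=0):
    """Roll out a forward policy P_F step by step. `policy(s)` returns a dict
    mapping actions to probabilities over the children of state s; the special
    action "stop" terminates the trajectory at a complete object x."""
    rng = random.Random(seed)
    s, tau = s0, [s0]
    while len(s) < max_len:
        probs = policy(s)
        a = rng.choices(list(probs), weights=list(probs.values()))[0]
        if a == "stop":
            break
        s = s + a          # one edge s -> s' in the DAG: append a building block
        tau.append(s)
    return s, tau

# Hypothetical uniform policy over building blocks {A, B} plus termination:
uniform = lambda s: {"A": 1 / 3, "B": 1 / 3, "stop": 1 / 3}
x, tau = sample_trajectory(uniform, seed=1)
```

Training would then adjust the policy's probabilities so that the marginal likelihood of terminating in x is proportional to R(x), rather than keeping them uniform.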

3. MULTI-OBJECTIVE GFLOWNETS

We broadly categorize Multi-Objective GFlowNets (MOGFNs) as GFlowNets which solve a family of sub-problems derived from a Multi-Objective Optimization (MOO) problem. We first consider solving a family of MOO sub-problems simultaneously with preference-conditional GFlowNets, followed by MOGFN-AL, which solves a sequence of MOO sub-problems.

3.1. PREFERENCE-CONDITIONAL GFLOWNETS

Whereas a GFlowNet learns how to sample according to a single reward function, reward-conditional GFlowNets (Bengio et al., 2021b) are a generalization of GFlowNets that simultaneously model a family of distributions associated with a corresponding family of reward functions. Let C denote a set of values c, with each c ∈ C inducing a unique reward function R(x|c). We can define a family of weighted DAGs {G_c = (S_c, E), c ∈ C} which describe the construction of x ∈ X, with the conditioning information c available at all states in S_c. We denote by P_F(·|s, c) and P_B(·|s′, c) the conditional forward and backward policies, by Z(c) = Σ_{x∈X} R(x|c) the conditional partition function, and by π(x|c) the marginal likelihood of sampling trajectories τ from P_F terminating in x given c. The learning objective in reward-conditional GFlowNets is thus to estimate P_F(·|s, c) such that π(x|c) ∝ R(x|c). We refer the reader to Bengio et al. (2021b) for a more formal discussion of conditional GFlowNets. Recall from Section 2.1 that MOO problems can be decomposed into a family of single-objective problems, each defined by a preference ω over the objectives. Thus, we can employ reward-conditional GFlowNets to model this family of reward functions by using as the conditioning set C the d-simplex ∆d spanned by the preferences ω over d objectives. Preference-conditional GFlowNets (MOGFN-PC) are reward-conditional GFlowNets conditioned on the preferences ω ∈ ∆d over a set of objectives {R_1(x), . . . , R_d(x)}. In other words, MOGFN-PC models the family of reward functions R(x|ω), where R(x|ω) itself corresponds to a scalarization of the MOO problem. We consider three scalarization techniques, which are discussed further in Appendix B:
• Weighted-sum (WS) (Ehrgott, 2005): R(x|ω) = Σ_{i=1}^d ω_i R_i(x)
• Weighted-log-sum (WL): R(x|ω) = Π_{i=1}^d R_i(x)^{ω_i}
• Weighted-Tchebycheff (WT) (Choo & Atkins, 1983): R(x|ω) = min_{1≤i≤d} ω_i |R_i(x) − z⋆_i|, where z⋆_i denotes an ideal value for objective R_i.
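The three scalarizations listed above can be computed directly from a reward vector and a preference (a minimal sketch; the reward values, preference and ideal point z⋆ below are illustrative):

```python
import math

def weighted_sum(rewards, w):
    """WS: convex combination of the objectives."""
    return sum(wi * ri for wi, ri in zip(w, rewards))

def weighted_log_sum(rewards, w):
    """WL: exp of the weighted sum of log-rewards, i.e. a product of powers."""
    return math.prod(ri ** wi for wi, ri in zip(w, rewards))

def weighted_tchebycheff(rewards, w, z_star):
    """WT, as written above: min over weighted distances to the ideal point."""
    return min(wi * abs(ri - zi) for wi, ri, zi in zip(w, rewards, z_star))

r, w = (0.8, 0.2), (0.5, 0.5)
ws = weighted_sum(r, w)                              # 0.5
wl = weighted_log_sum(r, w)                          # sqrt(0.8 * 0.2) = 0.4
wt = weighted_tchebycheff(r, w, z_star=(1.0, 1.0))   # min(0.1, 0.4) = 0.1
```

Note how WL penalizes a small value in any single objective much more strongly than WS does, which is the motivation for it given in Appendix B.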
MOGFN-PC is not constrained to any particular scalarization function, and can incorporate any user-defined scalarization scheme that fits the desired optimization needs.

Training MOGFN-PC: The procedure to train MOGFN-PC, or any reward-conditional GFlowNet, closely follows that of a standard GFlowNet and is described in Algorithm 1. The objective is to learn the parameters θ of the forward and backward conditional policies P_F(·|s, ω; θ) and P_B(·|s′, ω; θ), and the log-partition function log Z_θ(ω). To this end, we consider an extension of the trajectory balance objective for reward-conditional GFlowNets:

L(τ, ω; θ) = ( log [ Z_θ(ω) Π_{s→s′∈τ} P_F(s′|s, ω; θ) / ( R(x|ω) Π_{s→s′∈τ} P_B(s|s′, ω; θ) ) ] )².  (2)

One important component is the distribution p(ω) used to sample preferences during training, as p(ω) influences the regions of the Pareto front that are captured by MOGFN-PC. In our experiments, we use a Dirichlet(α) distribution to sample preferences ω, which are encoded with a thermometer encoding (Buckman et al., 2018) when input to the policy. Following prior work, we also use an exponent β for the reward R(x|ω), i.e. π(x|ω) ∝ R(x|ω)^β. This incentivizes the policy to focus on the modes of R(x|ω), which is critical for the generation of high-reward and diverse candidates.

MOGFN-PC and MOReinforce: MOGFN-PC is closely related to MOReinforce (Lin et al., 2021) in that both learn a preference-conditional policy to sample Pareto-optimal candidates. The key difference is the learning objective: MOReinforce uses a multi-objective version of REINFORCE (Williams, 1992), whereas MOGFN-PC uses a preference-conditional GFlowNet objective as in Equation 2. As discussed in Section 2.1, each point on the Pareto front (corresponding to a unique ω) can be the image of multiple candidates in the Pareto set. MOReinforce, given a preference ω, will converge to sampling a single candidate that maximizes R(x|ω).
MOGFN-PC, on the other hand, samples from R(x|ω), which enables generation of diverse candidates from the Pareto set for a given ω. This is a key feature of MOGFN-PC whose advantage we empirically demonstrate in Section 5.
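The preference-conditional trajectory balance objective described above reduces, in the log domain, to a squared log-ratio between forward and backward flows along a trajectory. A minimal sketch in plain Python (the log-probabilities and log Z_θ(ω) here are hypothetical stand-ins for neural-network outputs):

```python
import math

def tb_loss(log_Z_w, log_pf_steps, log_pb_steps, reward, beta=1.0):
    """Squared log-ratio between the forward flow Z_theta(w) * prod P_F and the
    backward flow R(x|w)^beta * prod P_B along a single trajectory."""
    lhs = log_Z_w + sum(log_pf_steps)
    rhs = beta * math.log(reward) + sum(log_pb_steps)
    return (lhs - rhs) ** 2

# At the optimum the two flows match and the loss vanishes: here a one-step
# trajectory with Z(w) = 2, P_F = 0.5, P_B = 1 and R(x|w) = 1 gives loss 0.
loss = tb_loss(log_Z_w=math.log(2.0),
               log_pf_steps=[math.log(0.5)],
               log_pb_steps=[math.log(1.0)],
               reward=1.0)
```

In practice the loss would be averaged over trajectories sampled with preferences ω ~ p(ω), and minimized with respect to the parameters θ of the policies and of log Z_θ(ω).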

3.2. MULTI-OBJECTIVE ACTIVE LEARNING WITH GFLOWNETS

In many practical scenarios, the objective functions of interest are computationally expensive. For instance, in drug discovery, evaluating objectives such as the binding energy to a target can take several hours even in simulation. Sample efficiency, in terms of the number of evaluations of the objective functions, and diversity of candidates thus become critical in such scenarios. Black-box optimization approaches involving active learning (Zuluaga et al., 2013), particularly multi-objective Bayesian optimization (MOBO) methods (Shah & Ghahramani, 2016; Garnett, 2022), are powerful in these settings. MOBO uses a probabilistic model to approximate the objectives R = {R_1, . . . , R_d} and leverages the epistemic uncertainty in the predictions of the model as a signal for prioritizing potentially useful candidates. The optimization is performed over M rounds, where each round i consists of generating a batch of candidates B given all the candidates D_i proposed in the previous rounds. The batch B is then evaluated using the true objective functions. The candidates are generated in each round by maximizing an acquisition function a, which combines the predictions with their epistemic uncertainty into a single scalar utility score. We note that each round thus effectively reduces the MOO problem to a single-objective problem, analogous to a scalarization. We broadly define MOGFN-AL as approaches which use GFlowNets to generate candidates in each round of an active learning loop for multi-objective optimization. MOGFN-AL tackles MOO through a sequence of single-objective sub-problems defined by the acquisition function a. As such, MOGFN-AL can be viewed as a multi-objective extension of GFlowNet-AL (Jain et al., 2022). In this work, we consider an instantiation of MOGFN-AL for biological sequence design, summarized in Algorithm 2 (Appendix A), building upon the framework proposed by Stanton et al. (2022).
We start with an initial dataset D_0 = {(x_i, y_i)}_{i=1}^N of candidates x_i ∈ X and their evaluations with the true objectives, y_i = R(x_i). D_i is used to train a surrogate probabilistic model (proxy) of the true objectives, f : X → R^d, which we parameterize as a multi-task Gaussian process (Shah & Ghahramani, 2016) with a deep kernel (DKL GP; Maddox et al., 2021a;b). Using this proxy, the acquisition function defines the utility to be maximized, a : X × F → R, where F denotes the space of functions represented by DKL GPs. In our work we use as acquisition function the noisy expected hypervolume improvement (NEHVI; Daulton et al., 2020). We use GFlowNets to propose candidates at each round i by generating mutations for candidates x ∈ P_i, where P_i is the set of non-dominated candidates in D_i. Given a sequence x, the GFlowNet generates a set of mutations m = {(l_j, v_j)}_{j=1}^T, where l_j ∈ {1, . . . , |x|} is the location to be replaced, v_j ∈ A is the token to replace x[l_j], and T is the number of mutations. This set is generated sequentially, such that each mutation is sampled from P_F conditioned on x and the mutations sampled so far. Let x′_m be the sequence resulting from applying mutations m to sequence x. The reward for a set of sampled mutations for x is the value of the acquisition function on x′_m, i.e. R(m, x) = a(x′_m | f). This approach of generating mutations to existing sequences provides a key advantage over generating sequences token-by-token as done in prior work (Jain et al., 2022): better scaling to longer sequences. We show empirically in Section 5.3 that generating mutations with GFlowNets results in more diverse candidates and faster improvements to the Pareto front than LaMBO (Stanton et al., 2022).
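The mutation-based reward above can be sketched concretely (a minimal illustration; positions are 0-indexed here for simplicity, and the acquisition function passed in is a hypothetical stand-in for a(· | f)):

```python
def apply_mutations(x, mutations):
    """Apply a set of point mutations m = {(l, v)} to sequence x, where l is a
    0-indexed position and v the replacement token, yielding x'_m."""
    chars = list(x)
    for l, v in mutations:
        chars[l] = v
    return "".join(chars)

def mutation_reward(x, mutations, acquisition):
    """Reward of a mutation set: the acquisition value of the mutated sequence."""
    return acquisition(apply_mutations(x, mutations))

seq = "ACGT"
mutated = apply_mutations(seq, [(1, "T"), (3, "A")])  # "ATGA"
```

In the full method, the GFlowNet would propose the mutation set sequentially, conditioned on x and on the mutations chosen so far, with `mutation_reward` playing the role of R(m, x).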

4. RELATED WORK

Evolutionary Algorithms (EA): Traditionally, evolutionary algorithms such as NSGA-II have been widely used in various multi-objective optimization problems (Ehrgott, 2005; Konak et al., 2006; Blank & Deb, 2020). More recently, Miret et al. (2022) incorporated graph neural networks into evolutionary algorithms, enabling them to tackle large combinatorial spaces. Unlike MOGFNs, evolutionary algorithms do not leverage data such as past experience, and are therefore required to solve each MOO instance from scratch rather than amortizing computation during training in order to quickly generate solutions at run-time. Evolutionary algorithms can, however, be augmented with MOGFNs for generating mutations to improve efficiency, as in Section 3.2.

Multi-Objective Reinforcement Learning: MOO problems have also received significant interest in the reinforcement learning (RL) literature (Hayes et al., 2022). Traditional approaches broadly consist of learning sets of Pareto-dominant policies (Roijers et al., 2013; Van Moffaert & Nowé, 2014; Reymond et al., 2022). Recent work has focused on extending deep RL algorithms to multi-objective settings, such as Envelope-MOQ (Yang et al., 2019), MO-MPO (Abdolmaleki et al., 2020; 2021), and MOReinforce (Lin et al., 2021). A general shortcoming of RL-based approaches is that they discover only a single mode of the reward function and thus cannot generate diverse candidates, a limitation which persists in the multi-objective setting. In contrast, MOGFNs sample candidates proportionally to the reward, implicitly resulting in diverse candidates.

Multi-Objective Bayesian Optimization (MOBO): Bayesian optimization (BO) has been used in the context of MOO when the objectives are expensive to evaluate and sample efficiency is a key consideration.
MOBO approaches consist of learning a surrogate model of the true objective functions, which is used to define an acquisition function such as expected hypervolume improvement (Emmerich et al., 2011; Daulton et al., 2020; 2021) or max-value entropy search (Belakaria et al., 2019), as well as scalarization-based approaches (Paria et al., 2020; Zhang & Golovin, 2020). Stanton et al. (2022) proposed LaMBO, which uses language models in conjunction with BO for multi-objective sequence design problems. The key drawbacks of MOBO approaches are that they do not consider the need for diversity in generated candidates and that they mainly consider continuous state spaces. As we discuss in Section 3.2, MOBO approaches can be augmented with GFlowNets for diverse candidate generation in discrete spaces.

Other Works: Zhao et al. (2022) introduced LaMOO, which tackles the MOO problem by iteratively splitting the candidate space into smaller regions, whereas Daulton et al. (2022) introduced MORBO, which performs BO in parallel on multiple local regions of the candidate space. Both of these methods, however, are limited to continuous candidate spaces.

5. EMPIRICAL RESULTS

In this section, we present our empirical findings across a wide range of tasks ranging from sequence design to molecule generation. The experiments cover two distinct classes of problems in the context of GFlowNets: those where G is a DAG and those where it is a tree. Through our experiments, we aim to answer the following questions:
Q1 Can MOGFNs model the preference-conditional reward distribution?
Q2 Can MOGFNs sample Pareto-optimal candidates?
Q3 Are candidates sampled by MOGFNs diverse?
Q4 Do MOGFNs scale to high-dimensional problems relevant in practice?

Metrics: We rely on standard metrics such as the Hypervolume (HV) and R2 indicators, as well as the Generational Distance+ (GD+). To measure diversity we use the Top-K Diversity and Top-K Reward metrics of Bengio et al. (2021a). We detail all metrics in Appendix D. For all our empirical evaluations we follow the same protocol. First, we sample a set of preferences which are fixed for all the methods. For each preference we sample 128 candidates, from which we pick the top 10, compute their scalarized reward and diversity, and report the averages over preferences. We then use these samples to compute the HV and R2 indicators. We pick the best hyperparameters for all methods based on the HV and report the mean and standard deviation over 3 seeds for all quantities.

Baselines: We consider the closely related MOReinforce (Lin et al., 2021) as a baseline. We also study its variants MOSoftQL and MOA2C, which use Soft Q-Learning (Haarnoja et al., 2017) and A2C (Mnih et al., 2016) in place of REINFORCE. We also compare against Envelope-MOQ (Yang et al., 2019), another popular multi-objective reinforcement learning method. For fragment-based molecule generation we consider an additional baseline, MARS (Xie et al., 2021), a relevant MCMC approach for this task.
To keep comparisons fair, we omit baselines like LaMOO (Zhao et al., 2022) and MORBO (Daulton et al., 2022) as they are designed for continuous spaces and rely on latent representations from pre-trained models for discrete tasks like molecule generation.
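For intuition on the primary metric used above, the hypervolume indicator in two dimensions can be computed with a simple sweep over the non-dominated points (a minimal sketch for maximization with a fixed reference point; practical evaluations typically rely on libraries such as pymoo or BoTorch):

```python
def hypervolume_2d(points, ref=(0.0, 0.0)):
    """Area of objective space dominated by `points` (maximization) relative
    to the reference point `ref`, via a sweep in decreasing first objective."""
    # Keep only points that strictly improve on the reference, sorted by x desc.
    pts = sorted((p for p in points if p[0] > ref[0] and p[1] > ref[1]),
                 reverse=True)
    hv, y_covered = 0.0, ref[1]
    for x, y in pts:
        if y > y_covered:                      # non-dominated during the sweep
            hv += (x - ref[0]) * (y - y_covered)
            y_covered = y
    return hv

# Union of [0,1]x[0,0.5] and [0,0.5]x[0,1]: 0.5 + 0.5 - 0.25 = 0.75
hv = hypervolume_2d([(1.0, 0.5), (0.5, 1.0), (0.4, 0.4)])
```

A larger hypervolume indicates a Pareto front that pushes further out in all objectives; dominated points such as (0.4, 0.4) above contribute nothing.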

5.1.1. HYPER-GRID

We first study the ability of MOGFN-PC to capture the preference-conditional reward distribution in a multi-objective version of the HyperGrid task from Bengio et al. (2021a), where the goal is to sample points in a hypergrid with probability proportional to a reward. We consider the following objectives for our experiments: branin(x), currin(x), shubert(x). Since the state space is small, we can compute the distribution learned by MOGFN-PC in closed form. In Figure 1a, we visualize π(x|ω), the distribution learned by MOGFN-PC conditioned on a set of fixed preference vectors ω, and contrast it with the true distribution R(x|ω) in a 32 × 32 hypergrid with 3 objectives. We observe that π(·|ω) and R(·|ω) are very similar. To quantify this, we compute E_x[|π(x|ω) − R(x|ω)/Z(ω)|] averaged over a set of 64 preferences, and find a difference of about 10^−4. Note that MOGFN-PC is able to capture all the modes in the distribution, which suggests that the candidates sampled from π would be diverse. Further, we compute the GD+ metric for the Pareto front of candidates generated with MOGFN-PC, which averages 0.42. For more details about the task and additional results, refer to Appendix E.1.
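On a small state space the discrepancy E_x[|π(x|ω) − R(x|ω)/Z(ω)|] can be evaluated exactly; a minimal sketch (distributions are given as dictionaries over states, and the example values are illustrative):

```python
def l1_gap(pi, R):
    """Mean absolute gap between the learned distribution pi(x|w) and the
    normalized target R(x|w)/Z(w), averaged over all states x."""
    Z = sum(R.values())
    return sum(abs(pi[x] - R[x] / Z) for x in R) / len(R)

# Target reward puts equal mass on two states; the learned pi is slightly off:
pi = {"x1": 0.6, "x2": 0.4}
R = {"x1": 1.0, "x2": 1.0}
gap = l1_gap(pi, R)  # (|0.6 - 0.5| + |0.4 - 0.5|) / 2 = 0.1
```

In the experiment above, the same quantity is additionally averaged over a set of 64 preference vectors ω.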

5.1.2. N-GRAMS TASK

We consider a version of the synthetic sequence design task from Stanton et al. (2022). The task consists of generating strings, with the objectives given by the number of occurrences of a set of d n-grams. In the results summarized in Table 1, we consider 3 Bigrams (with common characters in the bigrams, resulting in correlated objectives) and 3 Unigrams (conflicting objectives) as the objectives. MOGFN-PC outperforms the baselines in terms of the MOO objectives while generating diverse candidates. Since the objectives count occurrences of n-grams, the diversity is limited by the performance, i.e. high-scoring sequences will have lower diversity, explaining the higher diversity of MOSoftQL. We note that the MOReinforce and Envelope-MOQ baselines struggle in this task, potentially due to longer trajectories with sparse rewards. MOGFN-PC adequately models the trade-off between conflicting objectives in the 3 Unigrams task, as illustrated by the Pareto front of generated candidates in Figure 1b. For the 3 Bigrams task with correlated objectives, Figure 1c demonstrates that MOGFN-PC generates candidates which can simultaneously maximize multiple objectives. We refer the reader to Appendix E.2 for more task details and additional results with different numbers of objectives and varying sequence lengths.
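The n-gram objectives used in this task can be sketched as simple occurrence counts (a minimal illustration; the particular bigrams chosen below are hypothetical, not the exact ones used in the experiments):

```python
def ngram_reward(seq, ngram):
    """Count (possibly overlapping) occurrences of an n-gram in a sequence."""
    n = len(ngram)
    return sum(seq[i:i + n] == ngram for i in range(len(seq) - n + 1))

def objectives(seq, ngrams=("AB", "BC", "CA")):
    """A vector of d n-gram objectives; sharing characters across the n-grams
    makes the objectives correlated, as in the 3 Bigrams task."""
    return tuple(ngram_reward(seq, g) for g in ngrams)

vals = objectives("ABCABC")  # AB occurs twice, BC twice, CA once
```

With disjoint unigrams instead (e.g. counts of "A", "B", "C" in a fixed-length string), increasing one count necessarily decreases another, which is what makes the 3 Unigrams variant a conflicting-objectives task.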

5.2.1. QM9

We first consider a small-molecule generation task based on the QM9 dataset (Ramakrishnan et al., 2014). We generate molecules atom-by-atom and bond-by-bond, with up to 9 atoms, and use 4 reward signals. The main reward is obtained via an MXMNet proxy (Zhang et al., 2020) trained on QM9 to predict the HOMO-LUMO gap. The other rewards are Synthetic Accessibility (SA), a molecular weight target, and a molecular logP target. Rewards are normalized to be between 0 and 1, but the gap proxy can exceed 1, so it is clipped at 2. We train the models with 1M molecules and present the results in Table 2, showing that MOGFN-PC outperforms all baselines both in Pareto performance and in diverse candidate generation.

5.2.2. FRAGMENT-BASED MOLECULE GENERATION

We evaluate our method on the fragment-based (Kumar et al., 2012) molecular generation task of Bengio et al. (2021a), where the task is to generate molecules by linking fragments to form a junction tree (Jin et al., 2020). The main reward function is obtained via a pretrained proxy, available from Bengio et al. (2021a), trained on molecules docked with AutodockVina (Trott & Olson, 2010) for the sEH target. The other rewards are based on Synthetic Accessibility (SA), drug-likeness (QED), and a molecular weight target. We detail the reward construction in Appendix E.4. As for QM9, we train MOGFN-PC to generate 1M molecules and report the results in Table 3. We observe that MOGFN-PC consistently outperforms the baselines not only in terms of HV and R2, but also in candidate diversity score. Note that we do not report reward and diversity scores for MARS, since the lack of preference conditioning would make the comparison unfair.

5.2.3. DNA SEQUENCE GENERATION

As a practical domain where the GFlowNet graph is a tree, we consider the generation of DNA aptamers, single-stranded nucleotide sequences that are popular in biological polymer design due to their specificity and affinity as sensors in crowded biochemical environments (Zhou et al., 2017; Corey et al., 2022; Yesselman et al., 2019; Kilgour et al., 2021). We generate sequences by adding one nucleobase (A, C, T or G) at a time, with a maximum length of 60 bases. We consider three objectives: the free energy of the secondary structure calculated with the software NUPACK (Zadeh et al., 2011), the number of base pairs, and the inverse of the sequence length to favour shorter sequences. We report the results in Table 4. In this case, the best Pareto performance is obtained by the multi-objective RL algorithm MOReinforce (Lin et al., 2021). However, it does so by finding a quasi-trivial solution with the pattern GCGCGC... for most lengths, yielding very low diversity. In contrast, MOGFN-PC obtains much higher diversity and Top-K rewards but worse Pareto performance. An extended discussion, ablation study and further details are provided in Appendix E.5.

5.3. ACTIVE LEARNING WITH MOGFN-AL

Figure 2b illustrates the Pareto front of candidates generated with MOGFN-AL, which dominates the Pareto front of the initial dataset. As the candidates are generated by mutating sequences on the existing Pareto front, we also highlight, in the same color, the sequences that are mutations of each sequence in the initial dataset. To quantify the diversity of the generated candidates, we measure the average e-value from DIAMOND (Buchfink et al., 2021) between the initial Pareto front and the Pareto front of generated candidates. Table 2c shows that MOGFN-AL generates candidates that are more diverse than the baselines.

6. ANALYSIS

In this section, we isolate the important components of MOGFN-PC -- the distribution p(ω) for sampling preferences during training, the reward exponent β, and the reward scalarization R(x|ω) -- to understand the impact of each component on Pareto performance and diversity. We consider the 3 Bigrams task discussed in Section 5.1.2 and the fragment-based molecule generation task from Section 5.2.2 for this analysis, and provide further results in the Appendix.

Impact of p(ω)

To examine the effect of p(ω), which controls the coverage of the Pareto front, we set it to Dirichlet(α) and vary α ∈ {0.1, 1, 10}. This results in ω being sampled from different regions of ∆d. Specifically, α = 1 corresponds to a uniform distribution over ∆d, α > 1 is skewed towards the center of ∆d, whereas α < 1 is skewed towards the corners of ∆d. In Table 5 and Table 6 we observe that α = 1 results in the best performance. Despite the skewed distributions with α = 0.1 and α = 10, we still achieve performance close to that of α = 1, indicating that MOGFN-PC is able to interpolate to preferences not sampled during training. Note that diversity is not affected significantly by p(ω).

Impact of β: During training, β controls the concentration of the reward density around the modes of the distribution: for large values of β the reward density around the modes becomes more peaked, and vice versa. In Table 5 and Table 6 we present the results obtained by varying β ∈ {16, 32, 48}. As β increases, MOGFN-PC is incentivized to generate samples closer to the modes of R(x|ω), resulting in better Pareto performance. However, with high β values the reward density is concentrated close to the modes, which negatively impacts the diversity of the candidates.

Choice of scalarization R(x|ω): Next, we analyse the effect of the scalarization defining R(x|ω) used for training. The set of R(x|ω) for different ω specifies the family of MOO sub-problems and thus has a critical impact on Pareto performance. Table 5 and Table 6 include results for the Weighted-sum (WS), Weighted-log-sum (WL) and Weighted Tchebycheff (WT) scalarizations. Note that we do not compare the Top-K Reward, as different scalarizations cannot be compared directly. WS scalarization results in the best performance. WL scalarization, on the other hand, is not formally guaranteed to cover the Pareto front and consequently results in poor Pareto performance.
We suspect the poor performance of WT and WL are in part also due to the harder reward landscapes they induce.
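The preference distributions compared in this analysis can be sampled with nothing more than normalized Gamma draws (a minimal stdlib-only sketch of a symmetric Dirichlet; the dimension d = 3 matches the tasks above):

```python
import random

def sample_preference(alpha, d=3, rng=random):
    """Draw w ~ Dirichlet(alpha, ..., alpha) over d objectives via normalized
    Gamma variates. alpha < 1 concentrates mass near the corners of the
    simplex, alpha = 1 is uniform, and alpha > 1 concentrates it at the center."""
    g = [rng.gammavariate(alpha, 1.0) for _ in range(d)]
    s = sum(g)
    return [gi / s for gi in g]

rng = random.Random(0)
w = sample_preference(1.0, d=3, rng=rng)  # one point on the 3-simplex
```

Sweeping α ∈ {0.1, 1, 10} with this sampler reproduces the three training regimes compared in Table 5 and Table 6.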

7. CONCLUSION

In this work, we empirically demonstrated the generalization of GFlowNets to conditional GFlowNets for multi-objective optimization problems (MOGFNs), which promote the generation of diverse optimal candidates. We presented two instantiations of MOGFNs: MOGFN-PC, which leverages reward-conditional GFlowNets (Bengio et al., 2021b) to model a family of single-objective sub-problems, and MOGFN-AL, which sequentially solves a set of single-objective problems defined by multi-objective acquisition functions. Finally, we empirically demonstrated the efficacy of MOGFNs for generating diverse Pareto-optimal candidates on sequence and graph generation tasks. As a limitation, we identify that in certain domains, such as DNA sequence generation, MOGFNs generate diverse candidates but do not currently match RL algorithms in terms of Pareto performance. The analysis in Section 6 hints that the distribution p(ω) used to sample preferences affects Pareto performance. Since for certain practical applications only a specific region of the Pareto front is of interest, future work may explore gradient-based techniques to learn preferences for more structured exploration of the preference space. Within the context of MOGFN-AL, an interesting research avenue is the development of preference-conditional acquisition functions.

Reproducibility Statement

We include the code necessary to replicate our experiments with our submission and provide a detailed description of the experimental setups in the Appendix. All datasets and pretrained models used are publicly available or included in the supplementary materials.

Ethics Statement

We acknowledge that as with all machine learning algorithms, there is potential for dual use of multi-objective GFlowNets by nefarious agents. This work was motivated by the application of machine learning to accelerate scientific discovery in areas that can benefit humanity. We explicitly discourage the use of multi-objective GFlowNets in applications that may be harmful to others.

A ALGORITHMS

We summarize the algorithms for MOGFN-PC and MOGFN-AL here (Algorithm 1 and Algorithm 2).

B SCALARIZATION

Scalarization is a popular approach for tackling multi-objective optimization problems. MOGFN-PC can build upon any scalarization approach; we consider three choices. Weighted-sum (WS) scalarization has been widely used in the literature. WS finds candidates on the convex hull of the Pareto front (Ehrgott, 2005). Under the assumption that the Pareto front is convex, every Pareto-optimal solution is a solution to a weighted sum problem, and the solution to every weighted sum problem is Pareto optimal. Weighted Tchebycheff (WT), proposed by Choo & Atkins (1983), is an alternative designed for non-convex Pareto fronts. Any Pareto-optimal solution can be found by solving the weighted Tchebycheff problem with appropriate weights, and the solution for any weights corresponds to a weakly Pareto-optimal solution of the original problem (Pardalos et al., 2017). Lin et al. (2021) demonstrated through their empirical results that WT can be used with neural network based policies. The third scheme we consider, Weighted-log-sum (WL), has not been considered in prior work. We hypothesized that in some practical scenarios we might want to ensure that all objectives are optimized since, for instance, in WS the scalarized reward can be dominated by a single reward. WL, which considers the weighted sum in log space, can potentially help with this drawback. However, as discussed in Section 6, in practice WL can be hard to optimize and lead to poor performance.

C ADDITIONAL ANALYSIS

Can MOGFN-PC match Single Objective GFNs? To evaluate how well MOGFN-PC models the family of rewards R(x|ω), we consider a comparison with single-objective GFlowNets. More specifically, we first sample a set of 10 preferences ω1, . . . , ω10, and train a standard single-objective GFlowNet using the weighted sum scalar reward for each preference. We then generate N = 128 candidates from each GFlowNet throughout training, and compute the mean reward for the top 10 candidates for each preference. We average this top-10 reward across {ω1, . . . , ω10} and call it Rso.

We then train MOGFN-PC and apply the same procedure with the preferences {ω1, . . . , ω10}, calling the resulting mean of top-10 rewards Rmo. We plot the value of the ratio Rmo/Rso in Figure 3. We observe that the ratio stays close to 1, indicating that MOGFN-PC can indeed model the entire family of rewards simultaneously, at least as fast as a single-objective GFlowNet could.

As MOGFN-PC models a conditional distribution, an entire family of functions as described above, we expect model capacity to play a crucial role, since the amount of information to be learned is higher than for a single-objective GFN. We increase the model size in the 3 Bigrams task to study this effect, and see in Table 7 that larger models do help with performance, although the performance plateaus after a point. We suspect that in order to fully utilize the model capacity we might need better training objectives.

Table 7: Analysing the impact of model size on the performance of MOGFN-PC. Each architecture choice for the policy is denoted as A-B-C, where A is the number of layers, B is the number of hidden units in each layer, and C is the number of attention heads.
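The Rmo/Rso evaluation described in this appendix can be sketched as follows; the function names and the layout of the inputs (one list of scalarized rewards for the N = 128 sampled candidates per preference) are illustrative assumptions.

```python
import numpy as np

def mean_topk_reward(scalar_rewards, k=10):
    # Mean of the k highest scalarized rewards among the sampled candidates.
    top = np.sort(np.asarray(scalar_rewards, dtype=float))[-k:]
    return float(top.mean())

def reward_ratio(mo_rewards_per_pref, so_rewards_per_pref, k=10):
    # Each argument is a list with one entry per preference omega_1..omega_10;
    # each entry holds the scalarized rewards of N sampled candidates.
    r_mo = np.mean([mean_topk_reward(r, k) for r in mo_rewards_per_pref])
    r_so = np.mean([mean_topk_reward(r, k) for r in so_rewards_per_pref])
    return float(r_mo / r_so)
```

A ratio close to 1 indicates the conditional model matches the per-preference single-objective GFlowNets.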

D METRICS

In this section we discuss the various metrics that we used to report the results in Section 5.

1. Generational Distance Plus (GD+) (Ishibuchi et al., 2015): This metric measures the Euclidean distance between the solutions of the Pareto approximation and the true Pareto front, taking the dominance relation into account. Calculating GD+ requires knowledge of the true Pareto front, so we only report this metric for the HyperGrid experiments (Section 5.1.1).

2. Hypervolume (HV) Indicator (Fonseca et al., 2006): This is a standard metric reported in MOO works, which measures the volume in objective space spanned, with respect to a reference point, by the set of non-dominated solutions in the Pareto front approximation.

3. R2 Indicator (Hansen & Jaszkiewicz, 1994): R2 provides a monotonic metric for comparing Pareto front approximations using a set of uniform reference vectors and a utopian point z* representing the ideal solution of the MOO. Specifically, we define a set of uniform reference vectors λ ∈ Λ that cover the objective space and calculate

R2(Γ, Λ, z*) = (1/|Λ|) Σ_{λ∈Λ} min_{γ∈Γ} max_{i∈{1,...,k}} { λ_i |z*_i − γ_i| },

where Γ is the set of solutions in a given Pareto front approximation and z* is the utopian point corresponding to the ideal solution of the MOO. Generally, R2 calculations are performed with z* equal to the origin and all objectives transformed to a minimization setting, which preserves the monotonic nature of the metric. This holds true for our experiments as well.

4. Top-K Reward: This metric was originally used in Bengio et al. (2021a), and we extend it to our multi-objective setting. For MOGFN-PC, we sample N candidates per test preference, pick the top-k candidates (k < N) with the highest scalarized rewards, and calculate their mean. We repeat this for all test preferences enumerated from the simplex and report the average top-k reward.
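The R2 indicator can be computed directly from its definition; below is a minimal NumPy sketch assuming minimization with the utopian point z* at the origin, as in our experiments.

```python
import numpy as np

def r2_indicator(front, ref_vectors, z_star):
    # front: (n, k) array of solutions (minimization convention);
    # ref_vectors: (m, k) array of uniform reference weights lambda;
    # z_star: (k,) utopian point.
    front = np.asarray(front, dtype=float)
    ref_vectors = np.asarray(ref_vectors, dtype=float)
    gap = np.abs(np.asarray(z_star, dtype=float) - front)      # (n, k)
    # Weighted Tchebycheff value of every (lambda, gamma) pair: (m, n)
    vals = np.max(ref_vectors[:, None, :] * gap[None, :, :], axis=2)
    # For each lambda, keep the best solution; average over all lambdas.
    return float(np.mean(np.min(vals, axis=1)))
```

Lower values indicate a Pareto front approximation closer to the utopian point under the chosen reference vectors.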

5. Top-K Diversity

This metric was also originally used in Bengio et al. (2021a), and we again extend it to our multi-objective setting, using it to quantify the diversity of the generated candidates. Given a distance metric d(x, y) between candidates x and y, we consider candidates diverse when d(x, y) exceeds a threshold ϵ. For MOGFN-PC, we sample N candidates per test preference, pick the top-k candidates based on their diversity scores, and take the mean. We repeat this for all test preferences sampled from the simplex and report the average top-k diversity score. We use the edit distance for sequences, and 1 minus the Tanimoto similarity for molecules.
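For sequences, the distance d(x, y) above is the edit distance; a minimal sketch, where averaging all pairwise distances of a batch is an illustrative simplification of the diversity score:

```python
import itertools

def edit_distance(a, b):
    # Standard Levenshtein distance via dynamic programming over two rows.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def mean_pairwise_diversity(seqs):
    # Average pairwise edit distance over a batch of candidate sequences.
    pairs = list(itertools.combinations(seqs, 2))
    return sum(edit_distance(a, b) for a, b in pairs) / len(pairs)
```

For molecules, the same scheme applies with d(x, y) = 1 minus the Tanimoto similarity of the molecular fingerprints.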

E ADDITIONAL EXPERIMENTAL DETAILS

E.1 HYPER-GRID

Here we elaborate on the HyperGrid experimental setup which we discussed in Section 5.1.1. Consider an n-dimensional hypercube gridworld where each cell in the grid corresponds to a state. The agent starts at the top-left coordinate, marked as (0, 0, . . . ), and is allowed to move only towards the right, down, or to stop. When the agent performs the stop action, the trajectory terminates and the agent receives a non-zero reward. In this work, we consider the following reward functions: branin(x), currin(x), sphere(x), shubert(x), beale(x). In Figure 4, we show the heatmap for each reward function. Note that we normalize all the reward functions between 0 and 1.

Additional Results To verify the efficacy of MOGFNs across different numbers of objectives, we perform additional experiments and measure the L1 loss and the GD+ metric. In Figure 5, we can see that as the reward dimension increases, the loss and GD+ increase. This is expected because the number of rewards is indicative of the difficulty of the problem. We also present extended qualitative visualizations across more preferences in Figure 6.

Model Details and Hyperparameters For MOGFN-PC policies we use an MLP with two hidden layers, each consisting of 64 units. We use LeakyReLU as our activation function, as in Bengio et al. (2021a). All models are trained with a learning rate of 0.01 with the Adam optimizer (Kingma & Ba, 2015) and a batch size of 128. We sample preferences ω from Dirichlet(α) with α = 1.5. We try two techniques for encoding the preferences: 1) vanilla encoding, where we use the raw values of the preference vectors, and 2) thermometer encoding (Buckman et al., 2018). In our experiments we have not observed a significant difference in performance between the two.

E.2 N-GRAMS TASK

Task Details The task is to generate sequences of some maximum length L, which we set to 36 for the experiments in Section 5.1.2.
We consider a vocabulary (actions) of size 21, with 20 characters ["A", "R", "N", "D", "C", "E", "Q", "G", "H", "I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V"] and a special token to indicate the end of the sequence. The rewards {Ri}, i = 1, . . . , d, are defined by the number of occurrences of a given set of n-grams in a sequence x. For instance, consider ["AB", "BA"] as the n-grams: the rewards for a sequence x = ABABC would be [2, 1]. We consider two choices of n-grams: (a) Unigrams: the number of occurrences of a set of unigrams induces conflicting objectives, since we cannot increase the number of occurrences of one unigram without replacing another in a string of a particular length; (b) Bigrams: given common characters within the bigrams, the occurrences of multiple bigrams can be increased simultaneously within a string of a fixed length. We also consider different sizes for the set of n-grams, i.e. different numbers of objectives. This allows us to evaluate the behaviour of MOGFN-PC on a variety of objective spaces. We summarize the specific objectives used in our experiments in Table 8, and we normalize the rewards to [0, 1]:

2 Bigrams: ["AC", "CV"]
3 Unigrams: ["A", "C", "V"]
3 Bigrams: ["AC", "CV", "VA"]
4 Unigrams: ["A", "C", "V", "W"]
4 Bigrams: ["AC", "CV", "VA", "AW"]

Model Details and Hyperparameters We build upon the implementation from Stanton et al. (2022) for the task: https://github.com/samuelstanton/lambo. For the string generation task, the backward policy PB is trivial (as there is only one parent for each node s ∈ S), so we only have to parameterize PF and log Z. As PF(-|s, ω) is a conditional policy, we use a conditional Transformer encoder as the architecture. This consists of a Transformer encoder (Vaswani et al., 2017) with 3 hidden layers of dimension 64 and 8 attention heads to embed the current state (the string generated so far) s.
An MLP embeds the preferences ω, which are encoded using a thermometer encoding with 50 bins. The embeddings of the state and preferences are concatenated and passed to a final MLP which generates a categorical distribution over the actions (vocabulary tokens). We use the same architecture for the baselines with a conditional policy, MOReinforce and MOSoftQL. For Envelope-MOQ, which does not condition on the preferences, we use a standard Transformer encoder with a similar architecture. We present the hyperparameters we used in Table 9. Each method is trained for 10,000 iterations with a minibatch size of 128. For the baselines we adopt the official implementations released by the authors: MOReinforce (https://github.com/Xi-L/PMOCO) and Envelope-MOQ (https://github.com/RunzheYang/MORL).
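The n-gram reward for this task reduces to counting (overlapping) occurrences of each n-gram, as in the [2, 1] example above; a minimal sketch:

```python
def ngram_reward(seq, ngrams):
    # Count overlapping occurrences of each n-gram in the sequence; e.g. with
    # n-grams ["AB", "BA"], the sequence "ABABC" scores [2, 1].
    counts = []
    for g in ngrams:
        n = len(g)
        counts.append(sum(1 for i in range(len(seq) - n + 1) if seq[i:i + n] == g))
    return counts
```

In the experiments these counts are additionally normalized to [0, 1] given the maximum length L.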

Additional Results

We present some additional results for the N-grams task. We consider different numbers of objectives, d ∈ {2, 4}, in Table 10 and Table 11 respectively. As with the experiments in Section 5.1.2, we observe that MOGFN-PC outperforms the baselines in Pareto performance while achieving high diversity scores. In Table 12, we consider the case of shorter sequences, L = 24. MOGFN-PC continues to provide significant improvements over the baselines. Considering the N-grams task holistically, we can observe two trends: 1) as the sequence length increases, the advantage of MOGFN-PC becomes more significant; 2) the advantage of MOGFN-PC increases with the number of objectives.

E.5 DNA SEQUENCE DESIGN

Task Details The set of building blocks here consists of the bases ["A", "C", "T", "G"], in addition to a special end-of-sequence token. To compute the free energy and the number of base pairs with the software NUPACK (Zadeh et al., 2011), we used 310 K as the temperature. The inverse-length objective was calculated as 30/L, as 30 was the minimum length for sampled sequences. The rewards are normalized to [0, 1] for our experiments.
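The inverse-length objective and the min-max normalization applied to all rewards are simple enough to state directly; the helper names below are illustrative (the free energy and base-pair counts themselves come from NUPACK and are not reproduced here):

```python
def inverse_length_reward(seq_len, min_len=30):
    # Inverse-length objective 30 / L, equal to 1 at the minimum length of 30.
    return min_len / seq_len

def normalize(value, lo, hi):
    # Min-max normalization of an objective value to [0, 1].
    return (value - lo) / (hi - lo)
```

With this convention, longer sequences receive a lower inverse-length reward, putting length in tension with the free energy and base-pair objectives.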

Model Details and Hyperparameters

We use the same implementation as the N-grams task, detailed in Appendix E.2. Here we consider a 4-layer Transformer architecture instead, with 256 units per layer and 16 attention heads. We detail the most relevant hyperparameters in Table 15.

Discussion of Results Contrary to the other tasks on which we evaluated MOGFN-PC, for the generation of DNA aptamer sequences our proposed model did not match the best baseline, multi-objective reinforcement learning (Lin et al., 2021), in terms of Pareto performance. Nonetheless, it is worth delving into the details in order to better understand the different solutions found by the two methods. First, as indicated in Section 5, despite the better Pareto performance, the best sequences generated by the RL method have extremely low diversity (0.62), compared to MOGFN, which generates optimal sequences with diversity of 19.6 or higher. As a matter of fact, MOReinforce mostly samples sequences with the well-known pattern GCGC... for all possible lengths. Sequences with this pattern indeed have low (negative) energy and a large number of pairs, but they offer few new insights and poor diversity if the model is not able to generate sequences with other distinct patterns. On the contrary, GFlowNets are able to generate sequences with patterns other than repetitions of the base pair G and C. Interestingly, we observed that GFlowNets were able to generate sequences with even lower energy than the best sequences generated by MOReinforce by inserting bases A and T into chains of GCGC.... Finally, we observed that one reason why MOGFN does not match the Pareto performance of MOReinforce is that for short lengths (one of the objectives) the energy and number of pairs are not successfully optimized. Nonetheless, the optimization of energy and number of pairs is very good for the longest sequences. Given these observations, we conjecture that there is room for improving the set of hyperparameters or certain aspects of the algorithm.

Additional Results

In order to better understand the impact of the main hyperparameters of MOGFN-PC on the Pareto performance and the diversity of the optimal candidates, we train multiple instances, sweeping over several values of the hyperparameters as indicated in Table 15. We present the results in Table 16. One key observation is that there seems to be a trade-off between Pareto performance and the diversity of the top-k sequences. Nonetheless, even the models with the lowest diversity generate much more diverse sequences than MOReinforce. Furthermore, we observe that α < 1 for the Dirichlet distribution used to sample the weight preferences, as well as a higher β (reward exponent), both yield better Pareto performance but slightly worse diversity. In the case of β, this observation is consistent with the results in the Bigrams task (Table 5), but for Bigrams the best performance was obtained with α = 1. This indicates a degree of dependence on the task and the nature of the objectives.

E.6 ACTIVE LEARNING

Task Details We consider the Proxy RFP task from Stanton et al. (2022), an in silico benchmark task designed to simulate searching for improved red fluorescent protein (RFP) variants (Dance et al., 2021). The objectives considered are stability (-dG, the negative change in Gibbs free energy) and solvent-accessible surface area (SASA). Note that, as opposed to the sequence generation experiments, PB here is not trivial, as there are multiple ways (orders) of generating the set. For our experiments, we use a uniform random PB. PF takes as input the sequence x with the mutations generated so far applied. We use a Transformer encoder with 3 layers, hidden dimension 64, and 8 attention heads as the architecture for the policy. The policy outputs a distribution over the locations in x, {1, . . . , |x|}, and a distribution over tokens for each location. The vocabulary of actions here is the same as in the N-grams task: ["A", "R", "N", "D", "C", "E", "Q", "G", "H", "I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V"]. The logits of the locations of the mutations generated so far are set to -1000, to prevent generating the same sequence. The acquisition function (NEHVI) value for the mutated sequence is used as the reward. We also use a reward exponent β. To make optimization easier (as the acquisition function becomes harder to optimize with growing β), we reduce β linearly by a factor δβ at each round. We train the GFlowNet for 750 iterations in each round. The active learning batch size is 16, and we run 64 rounds of optimization. Table 18 presents the hyperparameters used for MOGFN-AL.
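Two mechanics above, masking the logits of already-mutated locations and linearly annealing β across active learning rounds, can be sketched as follows (plain NumPy; the floor of 1.0 on β is an illustrative assumption):

```python
import numpy as np

def mask_mutated_positions(location_logits, mutated_positions, mask_value=-1000.0):
    # Set the logits of locations mutated so far to a large negative value so
    # the policy cannot propose the same position twice.
    out = np.array(location_logits, dtype=float)
    out[list(mutated_positions)] = mask_value
    return out

def annealed_beta(beta0, delta_beta, round_idx):
    # Reduce the reward exponent beta linearly by delta_beta each round,
    # which eases optimization of the increasingly peaked reward.
    return max(beta0 - delta_beta * round_idx, 1.0)
```

After masking, a softmax over the location logits assigns (numerically) zero probability to previously mutated positions.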



If the object is constructed in a canonical order (say a string constructed from left to right), G is a tree. We present additional results with more objectives in Appendix E.1



Figure 1: (a) The distribution learned by MOGFN-PC (Top) almost exactly matches the ground truth distribution (Bottom), in particular capturing all the modes, on hypergrid of size 32 × 32 with 3 objectives. (b) and (c) illustrate the Pareto front of candidates generated by MOGFN-PC with conflicting and correlated objectives respectively.

Figure 2: (a) MOGFN-AL demonstrates a substantial advantage in terms of Relative Hypervolume; (b) the Pareto frontier of candidates generated by MOGFN-AL dominates the Pareto front of the initial dataset; (c) MOGFN-AL is particularly strong in terms of diversity of candidates.

Finally, to evaluate MOGFN-AL, we consider the Proxy RFP task from Stanton et al. (2022), with the aim of discovering novel proteins with red fluorescence properties, optimizing for folding stability and solvent-accessible surface area. We adopt all the experimental details (described in Appendix E.6) from Stanton et al. (2022), using MOGFN-AL for candidate generation. In addition to LaMBO, we use the model-free (NSGA-II) and model-based EAs from Stanton et al. (2022) as baselines. We observe in Figure 2a that MOGFN-AL results in significant gains in hypervolume relative to the initial dataset, within a given budget of black-box evaluations. In fact, MOGFN-AL is able to match the performance of LaMBO with about half the number of black-box evaluations.

Algorithm 1: Training preference-conditional GFlowNets (MOGFN-PC)
Input: p(ω): distribution for sampling preferences; β: reward exponent; δ: mixing coefficient for uniform actions in the sampling policy; N: number of training steps;
Initialize: (PF(s'|s, ω), PB(s|s', ω), log Z(ω)): conditional GFlowNet with parameters θ;
for i = 1 to N do
    Sample preference ω ∼ p(ω);
    Sample trajectory τ following the policy π = (1 − δ)PF + δ·Uniform;
    Compute the reward R(x|ω)^β for the generated samples and the corresponding loss L(τ, ω; θ) as in Equation 2;
    Update parameters θ with gradients from the loss, ∇θ L(τ, ω);
end

Algorithm 2: Training MOGFN-AL
Input: R = {R1, . . . , Rd}: oracles to evaluate candidates x and return the true objectives (R1(x), . . . , Rd(x)); D0 = {(xi, yi)}: initial dataset with yi = R(xi); f̂: probabilistic surrogate model of the posterior over R given a dataset D; a(x|f̂): acquisition function computing a scalar utility for x given f̂; πθ: learnable GFlowNet policy; b: size of the candidate batch to be generated; N: number of active learning rounds;
Initialize: f̂, πθ;
for i = 1 to N do
    Fit f̂ on dataset Di−1;
    Extract the set of non-dominated candidates P̂i−1 from Di−1;
    Train πθ to generate mutations for x ∈ P̂i−1 using a(-|f̂) as the reward;
    Generate a batch B = {x'1, . . . , x'b} by sampling x'i from P̂i−1 and applying to it mutations mi sampled from πθ;
    Evaluate the batch B with R to obtain D̂i = {(x1, R(x1)), . . . , (xb, R(xb))};
    Update the dataset Di = Di−1 ∪ D̂i;
end
Result: Approximate Pareto set P̂N
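Two pieces of Algorithm 1 can be sketched concretely: the exploratory sampling policy π = (1 − δ)PF + δ·Uniform, and the per-trajectory loss, assuming the trajectory-balance objective of Equation 2 with reward R(x|ω)^β. The function signatures are illustrative; in practice the log-probabilities come from the conditional policy networks.

```python
import numpy as np

def trajectory_balance_loss(log_Z_w, logpf_sum, logpb_sum, log_reward, beta):
    # Squared trajectory-balance residual for one trajectory tau under
    # preference omega:
    #   (log Z(w) + sum log P_F - beta * log R(x|w) - sum log P_B)^2
    delta = log_Z_w + logpf_sum - beta * log_reward - logpb_sum
    return delta ** 2

def mixture_action(pf_probs, delta, rng):
    # Sample an action from pi = (1 - delta) * P_F + delta * Uniform.
    pf_probs = np.asarray(pf_probs, dtype=float)
    mix = (1.0 - delta) * pf_probs + delta / len(pf_probs)
    return int(rng.choice(len(mix), p=mix))
```

The loss is averaged over a minibatch of (ω, τ) pairs before the gradient step on θ.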

Figure 3: We plot the ratio of rewards Rmo/Rso for candidates generated with MOGFN-PC (Rmo) and single-objective GFlowNets (Rso) for a set of preferences in the (a) 3 Bigrams and (b) Fragment-based molecule generation tasks. We observe that MOGFN-PC matches and occasionally surpasses single-objective GFlowNets.

Figure 4: Reward Functions. The different reward functions considered for the HyperGrid experiments presented in Section 5.1.1. Here the grid dimension is H = 32.

Figure 5: (Left) Average test loss between the MOGFN-PC distribution and the true distribution for an increasing number of objectives. (Right) GD+ metric of MOGFN-PC across objectives.

Figure 6: Extended qualitative visualizations for the HyperGrid experiments.

N-Grams Task: Diversity and Pareto performance of various algorithms for the 3 Bigrams and 3 Unigrams tasks, with MOGFN-PC achieving superior Pareto performance.

Atom-based QM9 task: MOGFN-PC achieves superior diversity and Pareto performance on the QM9 task with HOMO-LUMO gap, SA, QED and molecular weight objectives, compared to baselines.

Fragment-based Molecule Generation Task: Diversity and Pareto performance on the Fragment-based drug design task with sEH, QED, SA and molecular weight objectives.

DNA Sequence Design Task: Diversity and Pareto performance of various algorithms on the DNA sequence generation task with free energy, number of base pairs and inverse sequence length objectives.

N-grams: Analysing the impact of α, β and R(x|ω) on the performance of MOGFN-PC.

Fragment-based molecule generation: Analysing the impact of α, β and R(x|ω) on the performance of MOGFN-PC.


Objectives considered for the N-grams task

Hyperparameters for N-grams Task

N-grams Task. 2 Objectives.

N-grams Task. 4 Objectives.

Hyperparameters tuned for DNA-Aptamers Task.

Hyperparameters for MOGFN-AL

E.3 QM9

Reward Details As mentioned in Section 5.2.1, we consider four reward functions for our experiments. The first reward function is the HOMO-LUMO gap, for which we rely on the predictions of a pretrained MXMNet (Zhang et al., 2020) model trained on the QM9 dataset (Ramakrishnan et al., 2014). The second reward is the standard Synthetic Accessibility (SA) score, which we calculate using the RDKit library (Landrum); to get the reward we compute (10 − SA)/9. The third reward function is a molecular weight target: we first calculate the molecular weight of a molecule using RDKit, and then construct a reward function of the form exp(−(molWt − 105)²/150), which is maximized at 105. Our final reward function is a logP target, exp(−(logP − 2.5)²/2), which is again calculated with RDKit and is maximized at 2.5.
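The three RDKit-derived reward shapings above reduce to simple scalar functions; in the sketch below the descriptor values (SA score, molecular weight, logP) are plain inputs rather than RDKit calls:

```python
import math

def sa_reward(sa_score):
    # Synthetic accessibility (1 = easy .. 10 = hard) mapped to [0, 1].
    return (10.0 - sa_score) / 9.0

def molwt_reward(mol_wt):
    # Gaussian-shaped target reward, maximized at a molecular weight of 105.
    return math.exp(-(mol_wt - 105.0) ** 2 / 150.0)

def logp_reward(logp):
    # Gaussian-shaped target reward, maximized at logP = 2.5.
    return math.exp(-(logp - 2.5) ** 2 / 2.0)
```

Each function maps its descriptor into [0, 1], so all four objectives live on a comparable scale for scalarization.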

Model Details and Hyperparameters

We sample new preferences for every episode from Dirichlet(α), and encode the desired sampling temperature using a thermometer encoding (Buckman et al., 2018). We use a graph neural network based on a graph transformer architecture (Yun et al., 2019). We transform this conditional encoding to an embedding using an MLP. The embedding is then fed to the GNN as a virtual node, as well as concatenated with the node embeddings in the graph. The model's action space is to add a new node to the graph, add a new bond, or set node or bond properties (like making a bond a double bond). It also has a stop action. For more details please refer to the code provided in the supplementary material. We summarize the hyperparameters used in Table 13.

E.4 FRAGMENT-BASED MOLECULE GENERATION

Reward Details The first reward function is a proxy trained on molecules docked with AutoDock Vina (Trott & Olson, 2010) for the sEH target; we use the weights provided by Bengio et al. (2021a). We also use synthetic accessibility, as for QM9, and a weight target region (instead of the specific target weight used for QM9), ((300 − molwt) / 700 + 1).clip(0, 1), which favors molecules with a weight under 300. Our final reward function is QED, which is again calculated with RDKit. We again use a graph neural network based on a graph transformer architecture (Yun et al., 2019). The experimental protocol is similar to the QM9 experiments discussed in Appendix E.3. We additionally sample from a lagged model whose parameters are updated as an exponential moving average of the training model's parameters.
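The weight-target-region reward above is a small clipped linear function, written out here for clarity:

```python
def molwt_region_reward(mol_wt):
    # ((300 - molwt) / 700 + 1).clip(0, 1): equal to 1 for molecular weights
    # under 300, decaying linearly to 0 at a weight of 1000.
    return min(max((300.0 - mol_wt) / 700.0 + 1.0, 0.0), 1.0)
```

Unlike the Gaussian-shaped target used for QM9, this plateaus at 1 for every weight below 300 rather than rewarding a single target value.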

Additional Results

We also present in Figure 7 a view of the reward distribution produced by MOGFN-PC. Generally, the model is able to find good near-Pareto-optimal samples, but it also spends a lot of time exploring. The figure also shows that the model respects the preference conditioning, and remains capable of generating a diverse distribution rather than a single point. In the off-diagonal plots of Figure 7, we show pairwise scatter plots for each objective pair; the Pareto front is depicted with a red line; each point corresponds to a molecule generated by the model as it explores the state space; color denotes density (linear viridis palette). The diagonal plots show two overlaid pieces of information: a blue histogram for each objective, and an orange scatter plot showing the relationship between preference conditioning and generated molecules. The effect of this conditioning is particularly visible for seh (top left) and wt (bottom right). As the preference for the sEH binding reward gets closer to 1, the generated molecules' reward for sEH gets closer to 1 as well. Indeed, the expected shape for such a scatter plot is roughly triangular: when the preference ωi for reward Ri is close to 1, the model is expected to generate objects with a high reward for Ri; as ωi gets further away from 1, the model can generate anything, including objects with a high Ri, unless there is a trade-off between objectives, in which case it cannot. This is the case for the seh objective, but not for the wt objective, which has a more triangular shape.

The SASA objective used in the active learning task (Appendix E.6) is computed in simulation (Shrake & Rupley, 1973) using the FoldX suite (Schymkowitz et al., 2005) and BioPython (Cock et al., 2009). We use the dataset introduced in Stanton et al. (2022) as the initial pool of candidates D0, with |D0| = 512.

Method Details and Hyperparameters Our implementation builds upon the publicly released code from Stanton et al. (2022): https://github.com/samuelstanton/lambo. We follow the exact experimental setup used in Stanton et al. (2022). The surrogate model f̂ consists of an encoder with 1D convolutions (masking positions corresponding to padding tokens). We use 3 standard pre-activation residual blocks with two convolution layers, layer norm and swish activations, with a kernel size of 5, 64 intermediate channels and 16 latent channels. A multi-task GP with an ICM kernel is defined in the latent space of this encoder, which outputs the predictions for each objective. We also use the training tricks detailed in Stanton et al. (2022) for the surrogate model. The hyperparameters, taken from Stanton et al. (2022), are shown in Table 17. The acquisition function used is NEHVI (Daulton et al., 2021), defined as

a(x) = (1/N) Σ_{t=1}^{N} HVI(f̂t(x) | P̂t),

where HVI denotes the hypervolume improvement, f̂t, t = 1, . . . , N, are independent draws from the surrogate model (which is a posterior over functions), and P̂t denotes the Pareto frontier in the current dataset D under f̂t. We replace the LaMBO candidate generation with GFlowNets. We generate a set of mutations m = {(li, vi)} for a sequence x from the current approximation of the Pareto front P̂i.
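A minimal two-objective Monte Carlo sketch of this acquisition value, averaging hypervolume improvements over posterior draws, is shown below. The actual experiments use the NEHVI implementation from Stanton et al. (2022); this standalone NumPy version is only illustrative, and the exact 2D sweep would be replaced by box decompositions in higher dimensions.

```python
import numpy as np

def hypervolume_2d(points, ref):
    # Hypervolume (maximization) dominated by a 2-objective point set with
    # respect to a reference point; points not strictly better than ref
    # contribute nothing.
    pts = np.asarray(points, dtype=float).reshape(-1, 2)
    ref = np.asarray(ref, dtype=float)
    pts = pts[(pts > ref).all(axis=1)]
    if len(pts) == 0:
        return 0.0
    pts = pts[np.argsort(-pts[:, 0])]  # sort by first objective, descending
    hv, prev_y = 0.0, ref[1]
    for x, y in pts:
        if y > prev_y:  # add the new horizontal strip this point contributes
            hv += (x - ref[0]) * (y - prev_y)
            prev_y = y
    return float(hv)

def nehvi_estimate(candidate_draws, pareto_fronts, ref):
    # Average hypervolume improvement of one candidate over N posterior draws
    # f_hat_t, each paired with its Pareto front P_t (an NEHVI-style estimate).
    gains = []
    for y, front in zip(candidate_draws, pareto_fronts):
        base = hypervolume_2d(front, ref)
        gains.append(hypervolume_2d(list(front) + [y], ref) - base)
    return float(np.mean(gains))
```

The candidate's reward for the GFlowNet is then this acquisition value raised to the (annealed) exponent β.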

