EQUIVARIANT ENERGY-GUIDED SDE FOR INVERSE MOLECULAR DESIGN

Abstract

Inverse molecular design is critical in material science and drug discovery, where the generated molecules should satisfy certain desirable properties. In this paper, we propose equivariant energy-guided stochastic differential equations (EEGSDE), a flexible framework for controllable 3D molecule generation under the guidance of an energy function in diffusion models. Formally, we show that EEGSDE naturally exploits the geometric symmetry in 3D molecular conformation, as long as the energy function is invariant to orthogonal transformations. Empirically, under the guidance of designed energy functions, EEGSDE significantly improves the baseline on QM9, in inverse molecular design targeted to quantum properties and molecular structures. Furthermore, EEGSDE is able to generate molecules with multiple target properties by combining the corresponding energy functions linearly.

1. INTRODUCTION

The discovery of new molecules with desired properties is critical in many fields, such as drug and material design (Hajduk & Greer, 2007; Mandal et al., 2009; Kang et al., 2006; Pyzer-Knapp et al., 2015). However, brute-force search in the overwhelming molecular space is extremely challenging. Recently, inverse molecular design (Zunger, 2018) has provided an efficient way to explore the molecular space, by directly predicting promising molecules that exhibit desired properties. A natural way of inverse molecular design is to train a conditional generative model (Sanchez-Lengeling & Aspuru-Guzik, 2018). Formally, it learns a distribution of molecules conditioned on certain properties from data, and new molecules are predicted by sampling from the distribution with the condition set to the desired properties. Among them, equivariant diffusion models (EDM) (Hoogeboom et al., 2022) leverage state-of-the-art diffusion models (Ho et al., 2020), which involve a forward process to perturb data and a reverse process to generate 3D molecules conditionally or unconditionally. While EDM generates stable and valid 3D molecules, we argue that a single conditional generative model is insufficient for generating molecules that accurately exhibit desired properties (see Table 1 and Table 3 for an empirical verification). In this work, we propose equivariant energy-guided stochastic differential equations (EEGSDE), a flexible framework for controllable 3D molecule generation under the guidance of an energy function in diffusion models. EEGSDE formalizes the generation process as an equivariant stochastic differential equation, and plugs in energy functions to improve the controllability of generation. Formally, we show that EEGSDE naturally exploits the geometric symmetry in 3D molecular conformation, as long as the energy function is invariant to orthogonal transformations. We apply EEGSDE to various applications by carefully designing task-specific energy functions.
[Figure 1: As the energy function is invariant to a rotational transformation R, its gradient (i.e., the energy guidance) is equivariant to R, and therefore the distribution of generated samples is invariant to R.]

When targeted to quantum properties, EEGSDE is able to generate more accurate molecules than EDM, e.g., reducing the mean absolute error by more than 30% on the dipole moment property. When targeted to specific molecular structures, EEGSDE captures the structure information in molecules better than EDM, e.g., improving the similarity to target structures by more than 10%. Furthermore, EEGSDE is able to generate molecules targeted to multiple properties by combining the corresponding energy functions linearly. These results demonstrate that our EEGSDE enables flexible and controllable generation of molecules, providing a smart way to explore the chemical space.

2. RELATED WORK

Diffusion models were initially proposed by Sohl-Dickstein et al. (2015). Recently, they have been better understood in theory by connecting them to score matching and stochastic differential equations (SDEs) (Ho et al., 2020; Song et al., 2020). Since then, diffusion models have shown strong empirical performance in many applications (Dhariwal & Nichol, 2021; Ramesh et al., 2022; Chen et al., 2020; Kong et al., 2020). There are also variants proposed to improve or accelerate diffusion models (Nichol & Dhariwal, 2021; Vahdat et al., 2021; Dockhorn et al., 2021; Bao et al., 2022b;a; Salimans & Ho, 2022; Lu et al., 2022).

Guidance is a technique to control the generation process of diffusion models. Initially, Song et al. (2020) and Dhariwal & Nichol (2021) use classifier guidance to generate samples belonging to a class. Then, guidance is extended to CLIP (Radford et al., 2021) for text-to-image generation, and to semantic-aware energy (Zhao et al., 2022) for image-to-image translation. Prior guidance methods focus on image data, and are nontrivial to apply to molecules, since they do not consider the geometric symmetry. In contrast, our work proposes a general guidance framework for 3D molecules, where an invariant energy function is employed to leverage the geometric symmetry of molecules.

Molecule generation. Several works attempt to model molecules as 3D objects via deep generative models (Nesterov et al., 2020; Gebauer et al., 2019; Satorras et al., 2021a; Hoffmann & Noé, 2019; Hoogeboom et al., 2022). Among them, the most relevant one is the equivariant diffusion model (EDM) (Hoogeboom et al., 2022), which generates molecules in an iterative denoising manner. Benefiting from recent advances in diffusion models, EDM is stable to train and is able to generate high-quality molecules. We provide a formal description of EDM in Section 3.
Some other methods generate simplified representations of molecules, such as 1D SMILES strings (Weininger, 1988) and 2D graphs of molecules. These include variational autoencoders (Kusner et al., 2017; Dai et al., 2018; Jin et al., 2018; Simonovsky & Komodakis, 2018; Liu et al., 2018), normalizing flows (Madhawa et al., 2019; Zang & Wang, 2020; Luo et al., 2021), generative adversarial networks (Bian et al., 2019; Assouel et al., 2018), and autoregressive models (Popova et al., 2019; Flam-Shepherd et al., 2021). There are also methods for generating torsion angles in molecules. For instance, torsional diffusion (Jing et al., 2022) employs the SDE formulation of diffusion models to model torsion angles in a given 2D molecular graph for the conformation generation task.

Inverse molecular design. Generative models have been applied to inverse molecular design. For example, conditional autoregressive models (Gebauer et al., 2022) and EDM (Hoogeboom et al., 2022) directly generate 3D molecules with desired quantum properties. Gebauer et al. (2019) also finetune pretrained generative models on a biased subset to generate 3D molecules with small HOMO-LUMO gaps. In contrast to these conditional generative models, our work further proposes a guidance method, a flexible way to control the generation process of molecules. Some other methods apply optimization to search for molecules with desired properties, such as reinforcement learning (Zhou et al., 2019; You et al., 2018) and genetic algorithms (Jensen, 2019; Nigam et al., 2019). These optimization methods generally consider the 1D SMILES strings or 2D graphs of molecules, and the 3D information is not considered.

3. BACKGROUND

3D representation of molecules. Suppose a molecule has M atoms and let x_i ∈ R^n (n = 3 in general) be the coordinate of the i-th atom. The collection of coordinates x = (x_1, ..., x_M) ∈ R^{Mn} determines the conformation of the molecule. In addition to the coordinate, each atom is also associated with an atom feature, e.g., the atom type. We use h_i ∈ R^d to represent the atom feature of the i-th atom, and use h = (h_1, ..., h_M) ∈ R^{Md} to represent the collection of atom features in a molecule. We use a tuple z = (x, h) to represent a molecule, which contains both the 3D geometry information and the atom feature information.

Equivariance and invariance. Suppose R is a transformation. A distribution p(x, h) is said to be invariant to R if p(x, h) = p(Rx, h) holds for all x and h. Here Rx = (Rx_1, ..., Rx_M) applies R to each coordinate. A function (a_x, a_h) = f(x, h) that has two components a_x, a_h in its output is said to be equivariant to R if f(Rx, h) = (Ra_x, a_h) holds for all x and h. A function f(x, h) is said to be invariant to R if f(Rx, h) = f(x, h) holds for all x and h.

Zero CoM subspace. It has been shown that the invariance to translational and rotational transformations is an important factor for the success of 3D molecule modeling (Köhler et al., 2020; Xu et al., 2022). However, translational invariance is impossible for a distribution in the full space R^{Mn} (Satorras et al., 2021a). Nevertheless, we can view two collections of coordinates x and y as equivalent if x can be translated from y, since translation doesn't change the identity of a molecule. Such an equivalence relation partitions the whole space R^{Mn} into disjoint equivalence classes. Indeed, all elements in the same equivalence class represent the same conformation, and we can use the element with zero center of mass (CoM), i.e., (1/M) Σ_{i=1}^M x_i = 0, as the specific representation.
These elements collectively form the zero CoM linear subspace X (Xu et al., 2022; Hoogeboom et al., 2022), and the rest of the paper always uses elements in X to represent conformations.

Equivariant graph neural network. Satorras et al. (2021b) propose equivariant graph neural networks (EGNNs), which incorporate the equivariance inductive bias into neural networks. Specifically, (a_x, a_h) = EGNN(x, h) is a composition of L equivariant convolutional layers. The l-th layer takes the tuple (x^l, h^l) as the input and outputs an updated version (x^{l+1}, h^{l+1}), as follows:

m_ij = Φ_m(h_i^l, h_j^l, ∥x_i^l - x_j^l∥², e_ij; θ_m),   w_ij = Φ_w(m_ij; θ_w),   h_i^{l+1} = Φ_h(h_i^l, Σ_{j≠i} w_ij m_ij; θ_h),

x_i^{l+1} = x_i^l + Σ_{j≠i} (x_i^l - x_j^l) / (∥x_i^l - x_j^l∥ + 1) · Φ_x(h_i^l, h_j^l, ∥x_i^l - x_j^l∥², e_ij; θ_x),

where Φ_m, Φ_w, Φ_h, Φ_x are parameterized by fully connected neural networks with parameters θ_m, θ_w, θ_h, θ_x respectively, and e_ij are optional feature attributes. We can verify that these layers are equivariant to orthogonal transformations, which include rotational transformations as special cases. As their composition, the EGNN is also equivariant to orthogonal transformations. Furthermore, let EGNN_h(x, h) = a_h, i.e., the second component in the output of the EGNN. Then EGNN_h(x, h) is invariant to orthogonal transformations.

Equivariant diffusion models (EDM) (Hoogeboom et al., 2022) are a variant of diffusion models for molecule data. EDMs gradually inject noise into the molecule z = (x, h) via a forward process

q(z_{1:N} | z_0) = Π_{n=1}^N q(z_n | z_{n-1}),   q(z_n | z_{n-1}) = N_X(x_n | √α_n x_{n-1}, β_n) N(h_n | √α_n h_{n-1}, β_n),   (1)

where α_n and β_n represent the noise schedule and satisfy α_n + β_n = 1, and N_X represents the Gaussian distribution in the zero CoM subspace X (see its formal definition in Appendix A.2). Let ᾱ_n = α_1 α_2 ··· α_n, β̄_n = 1 - ᾱ_n, and β̃_n = β_n β̄_{n-1} / β̄_n.
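As a concrete illustration, one equivariant convolutional layer can be sketched in NumPy. The linear "MLPs" below (W_m, W_w, W_h, W_x) and the feature sizes are hypothetical stand-ins for Φ_m, Φ_w, Φ_h, Φ_x, not the actual EDM parameterization; the point is that messages depend only on features and inter-atomic distances (rotation-invariant), while coordinate updates move along relative directions x_i - x_j (rotation-equivariant):

```python
import numpy as np

rng = np.random.default_rng(0)
d, hid = 4, 8                                   # hypothetical feature / hidden sizes

# Toy linear maps standing in for the MLPs Phi_m, Phi_w, Phi_h, Phi_x.
W_m = rng.normal(size=(2 * d + 1, hid))
W_w = rng.normal(size=(hid, 1))
W_h = rng.normal(size=(d + hid, d))
W_x = rng.normal(size=(2 * d + 1, 1))

def egnn_layer(x, h):
    """One equivariant convolutional layer: messages m_ij are built from
    features and squared distances, and coordinates are updated along the
    relative directions x_i - x_j."""
    M = x.shape[0]
    x_new = x.copy()
    h_agg = np.zeros((M, hid))
    for i in range(M):
        for j in range(M):
            if i == j:
                continue
            d2 = np.sum((x[i] - x[j]) ** 2)              # invariant input
            inp = np.concatenate([h[i], h[j], [d2]])
            m_ij = np.tanh(inp @ W_m)                    # Phi_m
            w_ij = 1.0 / (1.0 + np.exp(-(m_ij @ W_w)))   # Phi_w (soft edge weight)
            h_agg[i] += w_ij * m_ij
            coef = np.tanh(inp @ W_x)[0]                 # Phi_x
            x_new[i] = x_new[i] + (x[i] - x[j]) / (np.sqrt(d2) + 1.0) * coef
    h_new = np.tanh(np.concatenate([h, h_agg], axis=1) @ W_h)  # Phi_h
    return x_new, h_new
```

Rotating the input coordinates rotates the output coordinates and leaves the output features unchanged, which is exactly the equivariance property used later in Theorem 1.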
To generate samples, the forward process is reversed using a Markov chain:

p(z_{0:N}) = p(z_N) Π_{n=1}^N p(z_{n-1} | z_n),   p(z_{n-1} | z_n) = N_X(x_{n-1} | μ_n^x(z_n), β̃_n) N(h_{n-1} | μ_n^h(z_n), β̃_n).   (2)

Here p(z_N) = N_X(x_N | 0, 1) N(h_N | 0, 1). The mean μ_n(z_n) = (μ_n^x(z_n), μ_n^h(z_n)) is parameterized by a noise prediction network ϵ_θ(z_n, n), and is trained using an MSE loss, as follows:

μ_n(z_n) = (1/√α_n) (z_n - (β_n/√β̄_n) ϵ_θ(z_n, n)),   min_θ E_n E_{q(z_0, z_n)} w(n) ∥ϵ_θ(z_n, n) - ϵ_n∥²,

where ϵ_n = (z_n - √ᾱ_n z_0)/√β̄_n is the standard Gaussian noise injected into z_0 and w(n) is the weight term. Hoogeboom et al. (2022) show that the distribution of generated samples p(z_0) is invariant to rotational transformations if the noise prediction network is equivariant to orthogonal transformations. In Section 4.2, we extend this proposition to the SDE formulation of molecular diffusion modeling.
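To make the discrete reverse process concrete, the sketch below implements one ancestral step of the reverse Markov chain described above; the noise predictions are passed in as arrays, the scalar schedule values are illustrative assumptions, and the coordinate noise has its CoM removed so that x stays in the zero CoM subspace:

```python
import numpy as np

def subtract_com(v):
    """Project per-atom vectors onto the zero-CoM subspace X."""
    return v - v.mean(axis=0, keepdims=True)

def edm_reverse_step(x_n, h_n, eps_x, eps_h, alpha_n, beta_n,
                     beta_bar_n, beta_bar_prev, rng):
    """One step z_n -> z_{n-1} of the reverse chain.
    (eps_x, eps_h) is the output of the noise prediction network at step n."""
    beta_tilde = beta_n * beta_bar_prev / beta_bar_n
    # mu_n(z_n) = (z_n - beta_n / sqrt(beta_bar_n) * eps_theta) / sqrt(alpha_n)
    mu_x = (x_n - beta_n / np.sqrt(beta_bar_n) * eps_x) / np.sqrt(alpha_n)
    mu_h = (h_n - beta_n / np.sqrt(beta_bar_n) * eps_h) / np.sqrt(alpha_n)
    # coordinate noise is drawn in X by removing its CoM
    noise_x = subtract_com(rng.normal(size=x_n.shape))
    noise_h = rng.normal(size=h_n.shape)
    return (mu_x + np.sqrt(beta_tilde) * noise_x,
            mu_h + np.sqrt(beta_tilde) * noise_h)
```

If x_n and the predicted noise both lie in X, the output coordinates remain in X, mirroring how the sampled molecules stay translation-invariant.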

4. EQUIVARIANT ENERGY-GUIDED SDE

In this part, we introduce our equivariant energy-guided SDE (EEGSDE), as illustrated in Figure 1 . EEGSDE is based on the SDE formulation of molecular diffusion modeling, which is described in Section 4.1 and Section 4.2. Then, we formally present our EEGSDE that incorporates an energy function to guide the molecular generation in Section 4.3. We provide derivations in Appendix A.

4.1. SDE IN THE PRODUCT SPACE

Recall that a molecule is represented as a tuple z = (x, h), where x = (x_1, ..., x_M) ∈ X represents the conformation and h = (h_1, ..., h_M) ∈ R^{Md} represents atom features. Here X = {x ∈ R^{Mn} : (1/M) Σ_{i=1}^M x_i = 0} is the zero CoM subspace mentioned in Section 3, and d is the feature dimension. We first introduce a continuous-time diffusion process {z_t}_{0≤t≤T} in the product space X × R^{Md}, which gradually adds noise to x and h. This can be described by the forward SDE

dz = f(t)z dt + g(t) d(w_x, w_h),   z_0 ∼ q(z_0),   (3)

where f(t) and g(t) are two scalar functions, w_x and w_h are independent standard Wiener processes in X and R^{Md} respectively, and the SDE starts from the data distribution q(z_0). Note that w_x can be constructed by subtracting the CoM of a standard Wiener process w in R^{Mn}, i.e., w_x = w - w̄, where w̄ = (1/M) Σ_{i=1}^M w_i is the CoM of w = (w_1, ..., w_M). It can be shown that the SDE has a linear Gaussian transition kernel q(z_t | z_s) = q(x_t | x_s) q(h_t | h_s) from z_s to z_t, where 0 ≤ s < t ≤ T. Specifically, there exist two scalars α_{t|s} and β_{t|s}, s.t. q(x_t | x_s) = N_X(x_t | √α_{t|s} x_s, β_{t|s}) and q(h_t | h_s) = N(h_t | √α_{t|s} h_s, β_{t|s}). Here N_X denotes the Gaussian distribution in the subspace X (see Appendix A.2 for its formal definition). Indeed, the forward process of EDM in Eq. (1) is a discretization of the forward SDE in Eq. (3).

To generate molecules, we reverse Eq. (3) from T to 0. Such a time reversal forms another SDE, which can be represented in both the score function form and the noise prediction form:

dz = [f(t)z - g(t)² (∇_x log q_t(z) - ∇̄_x log q_t(z), ∇_h log q_t(z))] dt + g(t) d(w̃_x, w̃_h)   (score function form)
   = [f(t)z + (g(t)²/√β_{t|0}) E_{q(z_0|z_t)} ϵ_t] dt + g(t) d(w̃_x, w̃_h),   z_T ∼ q_T(z_T),   (noise prediction form)   (4)

Here q_t(z) is the marginal distribution of z_t, ∇_x log q_t(z) is the gradient of log q_t(z) w.r.t. x, ∇̄_x log q_t(z) = (1/M) Σ_{i=1}^M ∇_{x_i} log q_t(z) is the CoM of ∇_x log q_t(z), dt is the infinitesimal negative timestep, w̃_x and w̃_h are independent reverse-time standard Wiener processes in X and R^{Md} respectively, and ϵ_t = (z_t - √α_{t|0} z_0)/√β_{t|0} is the standard Gaussian noise injected to z_0. Compared to the original SDE introduced by Song et al. (2020), our reverse SDE in Eq. (4) additionally subtracts the CoM of ∇_x log q_t(z). This ensures x_t always stays in the zero CoM subspace as time flows back.

To sample from the reverse SDE in Eq. (4), we use a noise prediction network ϵ_θ(z_t, t) to estimate E_{q(z_0|z_t)} ϵ_t, by minimizing the MSE loss

min_θ E_t E_{q(z_0, z_t)} w(t) ∥ϵ_θ(z_t, t) - ϵ_t∥²,

where t is uniformly sampled from [0, T], and w(t) controls the weight of the loss term at time t. Note that the noise ϵ_t is in the product space X × R^{Md}, so we subtract the CoM of the predicted noise of x_t to ensure ϵ_θ(z_t, t) is also in the product space. Substituting ϵ_θ(z_t, t) into Eq. (4), we get an approximate reverse-time SDE parameterized by θ:

dz = [f(t)z + (g(t)²/√β_{t|0}) ϵ_θ(z, t)] dt + g(t) d(w̃_x, w̃_h),   z_T ∼ p_T(z_T),   (5)

where p_T(z_T) = N_X(x_T | 0, 1) N(h_T | 0, 1) is a Gaussian prior in the product space that approximates q_T(z_T). We define p_θ(z_0) as the marginal distribution of Eq. (5) at time t = 0, which is the distribution of our generated samples. Similarly to the forward process, the reverse process of EDM in Eq. (2) is a discretization of the reverse SDE in Eq. (5).
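The transition-kernel scalars α_{t|s} and β_{t|s} are determined by f and g (Proposition 4 in Appendix A.2 gives the closed form as integrals). A minimal numerical sketch, using a constant-coefficient schedule purely as a test assumption, evaluates those integrals and draws x_t from the zero-CoM Gaussian kernel:

```python
import numpy as np

def transition_scalars(f, g, s, t, steps=2000):
    """Numerically evaluate alpha_{t|s} = exp(2 * int_s^t f(tau) dtau) and
    beta_{t|s} = alpha_{t|s} * int_s^t g(tau)^2 / alpha_{tau|s} dtau."""
    taus = np.linspace(s, t, steps)
    f_int = np.cumsum(f(taus)) * (t - s) / steps   # running integral of f
    alpha_tau = np.exp(2 * f_int)                  # alpha_{tau|s}
    alpha_ts = alpha_tau[-1]
    beta_ts = alpha_ts * np.sum(g(taus) ** 2 / alpha_tau) * (t - s) / steps
    return alpha_ts, beta_ts

def sample_xt(x0, alpha_ts, beta_ts, rng):
    """Draw x_t ~ N_X(x_t | sqrt(alpha) x_0, beta): Gaussian noise with its
    CoM removed so the sample stays in the zero-CoM subspace."""
    eps = rng.normal(size=x0.shape)
    eps -= eps.mean(axis=0, keepdims=True)
    return np.sqrt(alpha_ts) * x0 + np.sqrt(beta_ts) * eps
```

For the constant choice f = -1/2, g = 1 the closed forms are α_{t|s} = e^{-(t-s)} and β_{t|s} = 1 - α_{t|s}, which the numerical integration reproduces.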

4.2. EQUIVARIANT SDE

To leverage the geometric symmetry in 3D molecular conformation, p_θ(z_0) should be invariant to translational and rotational transformations. As mentioned in Section 3, the translational invariance of p_θ(z_0) is already satisfied by considering the zero CoM subspace. The rotational invariance can be satisfied if the noise prediction network is equivariant to orthogonal transformations, as summarized in the following theorem:

Theorem 1. Let ϵ_θ(z_t, t) = (ϵ_θ^x(z_t, t), ϵ_θ^h(z_t, t)), where ϵ_θ^x(z_t, t) and ϵ_θ^h(z_t, t) are the predicted noise of x_t and h_t respectively. If for any orthogonal transformation R ∈ R^{n×n}, ϵ_θ(z_t, t) is equivariant to R, i.e., ϵ_θ(Rx_t, h_t, t) = (Rϵ_θ^x(x_t, h_t, t), ϵ_θ^h(x_t, h_t, t)), and p_T(z_T) is invariant to R, i.e., p_T(Rx_T, h_T) = p_T(x_T, h_T), then p_θ(z_0) is invariant to any rotational transformation.

As mentioned in Section 3, the EGNN satisfies the equivariance constraint, and we parameterize ϵ_θ(z_t, t) using an EGNN following Hoogeboom et al. (2022). See details in Appendix D.

4.3. EQUIVARIANT ENERGY-GUIDED SDE

Now we describe equivariant energy-guided SDE (EEGSDE), which guides the generated molecules of Eq. (5) towards desired properties c by leveraging a time-dependent energy function E(z, c, t):

dz = [f(t)z + g(t)² ((1/√β_{t|0}) ϵ_θ(z, t) + (∇_x E(z, c, t) - ∇̄_x E(z, c, t), ∇_h E(z, c, t)))] dt + g(t) d(w̃_x, w̃_h),   z_T ∼ p_T(z_T),   (6)

where the energy gradient is taken in the product space. Eq. (6) defines a distribution p_θ(z_0 | c) conditioned on the property c. Here the CoM ∇̄_x E(z, c, t) of the gradient is subtracted to keep the SDE in the product space, which ensures the translational invariance of p_θ(z_0 | c). Besides, the rotational invariance is satisfied by using an energy invariant to orthogonal transformations, as summarized in the following theorem:

Theorem 2. Suppose the assumptions in Theorem 1 hold and E(z, c, t) is invariant to any orthogonal transformation R, i.e., E(Rx, h, c, t) = E(x, h, c, t). Then p_θ(z_0 | c) is invariant to any rotational transformation.

Note that we can also use a conditional model ϵ_θ(z, c, t) in Eq. (6). See Appendix C for details. To sample from p_θ(z_0 | c), various solvers can be used for Eq. (6), such as the Euler-Maruyama method (Song et al., 2020) and the Analytic-DPM sampler (Bao et al., 2022b;a). We present the Euler-Maruyama method as an example in Algorithm 1 in Appendix B.
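A self-contained sketch of an Euler-Maruyama discretization of the guided reverse SDE is below. The callables eps_net, energy_grad, f, g, and beta_t0 are placeholders for ϵ_θ, ∇E, and the schedule (all assumptions for illustration, not the paper's trained networks); note the CoM of the coordinate guidance and of the injected noise is subtracted at every step:

```python
import numpy as np

def subtract_com(v):
    return v - v.mean(axis=0, keepdims=True)

def eegsde_sample(eps_net, energy_grad, f, g, beta_t0, x_T, h_T,
                  T=1.0, steps=100, rng=None):
    """Euler-Maruyama integration of the energy-guided reverse SDE.
    eps_net(x, h, t) -> (eps_x, eps_h) plays the role of eps_theta;
    energy_grad(x, h, t) -> (grad_x, grad_h) is the gradient of E(z, c, t)."""
    rng = rng or np.random.default_rng(0)
    dt = T / steps
    x, h = x_T.copy(), h_T.copy()
    for i in range(steps):
        t = T - i * dt
        ex, eh = eps_net(x, h, t)
        gx, gh = energy_grad(x, h, t)
        gx = subtract_com(gx)                        # subtract CoM of grad_x E
        drift_x = f(t) * x + g(t) ** 2 * (ex / np.sqrt(beta_t0(t)) + gx)
        drift_h = f(t) * h + g(t) ** 2 * (eh / np.sqrt(beta_t0(t)) + gh)
        # step backwards in time: z_{t-dt} = z_t - drift*dt + g*sqrt(dt)*noise
        x = x - drift_x * dt + g(t) * np.sqrt(dt) * subtract_com(rng.normal(size=x.shape))
        h = h - drift_h * dt + g(t) * np.sqrt(dt) * rng.normal(size=h.shape)
    return x, h
```

Because every coordinate term (drift, guidance, and noise) lies in the zero-CoM subspace, the trajectory never leaves it, which is the translational-invariance mechanism of Eq. (6).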

4.4. HOW TO DESIGN THE ENERGY FUNCTION

Our EEGSDE is a general framework, which can be applied to various applications by specifying different energy functions E(z, c, t). For example, we can design the energy function according to the consistency between the molecule z and the property c, where a low energy represents a good consistency. As the generation process in Eq. (6) proceeds, the gradient of the energy function encourages generated molecules to have a low energy, and consequently a good consistency. Thus, we can expect the generated molecule z to align well with the property c. In the rest of the paper, we specify the choice of energy functions, and show these energies improve controllable molecule generation targeted to quantum properties, molecular structures, and even a combination of them.

Remark. The term "energy" in this paper refers to a general notion in statistical machine learning: a scalar function that captures dependencies between input variables (LeCun et al., 2006). Thus, the "energy" in this paper can be set to an MSE loss when we want to capture how well the molecule aligns with the property (as done in Section 5). Also, the "energy" in this paper does not exclude potential energy or free energy in chemistry, and these might be applicable when we want to generate molecules with small potential or free energy.

5. GENERATING MOLECULES WITH DESIRED QUANTUM PROPERTIES

Let c ∈ R be a certain quantum property. To generate molecules with the desired property, we set the energy function as the squared error between the predicted property and the desired property E(z t , c, t) = s|g(z t , t) -c| 2 , where g(z t , t) is a time-dependent property prediction model, and s is the scaling factor controlling the strength of the guidance. Specifically, g(z t , t) can be parameterized by equivariant models such as EGNN (Satorras et al., 2021b), SE3-Transformer (Fuchs et al., 2020) and DimeNet (Klicpera et al., 2020) to ensure the invariance of E(z t , c, t), as long as they perform well in the task of property prediction. In this paper we consider EGNN. We provide details on parameterization and the training objective in Appendix E. We can also generate molecules targeted to multiple quantum properties by combining energy functions linearly (see details in Appendix F.1).
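The invariance requirement on E(z_t, c, t), and the resulting equivariance of its guidance, can be checked numerically. Below, a toy invariant predictor (the sum of pairwise distances, a hypothetical stand-in for the EGNN-based g(z_t, t)) defines the squared-error energy, and its coordinate gradient is computed by finite differences:

```python
import numpy as np

def predict_property(x):
    """Toy invariant property predictor: sum of pairwise distances.
    Depends only on distances, hence invariant to orthogonal transformations."""
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(-1) + 1e-12).sum()

def energy(x, c, s=1.0):
    """Squared-error energy E = s * |g(x) - c|^2 for target property c."""
    return s * (predict_property(x) - c) ** 2

def energy_grad(x, c, s=1.0, eps=1e-6):
    """Central finite-difference gradient of the energy w.r.t. coordinates."""
    g = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        xp = x.copy(); xp[idx] += eps
        xm = x.copy(); xm[idx] -= eps
        g[idx] = (energy(xp, c, s) - energy(xm, c, s)) / (2 * eps)
    return g
```

Rotating the conformation leaves the energy unchanged and rotates its gradient, which is the property Theorem 2 relies on.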

5.1. SETUP

We evaluate on QM9 (Ramakrishnan et al., 2014). To measure how well generated molecules align with the desired property, we report the mean absolute error (MAE) (1/K) Σ_{i=1}^K |ϕ_p(z_i) - c_i|, where ϕ_p is a property prediction model, z_i is a generated molecule, and c_i is its desired property. We generate K = 10,000 samples for evaluation with ϕ_p. For fairness, the property prediction model ϕ_p is different from the g(z_t, t) used in the energy function.

Conclusion:

These results suggest that molecules generated by our EEGSDE align better with the desired properties than molecules generated by the conditional EDM baseline (for both the single-property and multiple-property cases). As a consequence, EEGSDE is able to explore the chemical space in a guided way to generate promising molecules for downstream applications such as virtual screening, which may benefit drug and material discovery.

6. GENERATING MOLECULES WITH TARGET STRUCTURES

Following Gebauer et al. (2022), we use the molecular fingerprint to encode the structure information of a molecule. The molecular fingerprint c = (c_1, ..., c_L) is a series of bits that capture the presence or absence of substructures in the molecule. Specifically, a substructure is mapped to a specific position l in the bitmap, and the corresponding bit c_l will be 1 if the substructure exists in the molecule and 0 otherwise. To generate molecules with a specific structure (encoded by the fingerprint c), we set the energy function as the squared error E(z_t, c, t) = s∥m(z_t, t) - c∥² between a time-dependent multi-label classifier m(z_t, t) and c. Here s is the scaling factor, and m(z_t, t) is trained with the binary cross-entropy loss to predict the fingerprint, as detailed in Appendix E.2. Note that the choice of the energy function is flexible and can differ from the training loss of m(z_t, t). In initial experiments, we also tried the binary cross-entropy loss for the energy function, but we found it makes the generation process unstable. The multi-label classifier m(z_t, t) is parameterized in a similar way to the property prediction model g(z_t, t) in Section 5, and we present details in Appendix E.1.

6.1. SETUP

We evaluate on QM9 and GEOM-Drug. We train our method (including the noise prediction network and the multi-label classifier) on the whole training set. By default, we use a conditional noise prediction network ϵ_θ(z, c, t) in Eq. (6) for better performance. See more details in Appendix F.4.

Evaluation metric: To measure how well the structure of a generated molecule aligns with the target one, we use the Tanimoto similarity (Gebauer et al., 2022), which captures similarity between structures by comparing their fingerprints.
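For concreteness, the Tanimoto similarity between two binary fingerprints is the ratio of shared on-bits to the union of on-bits. A minimal implementation (the treatment of two all-zero fingerprints as identical is a convention chosen here):

```python
import numpy as np

def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between binary fingerprints: |A ∩ B| / |A ∪ B|."""
    a = np.asarray(fp_a, dtype=bool)
    b = np.asarray(fp_b, dtype=bool)
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    # convention: two empty fingerprints are considered identical
    return inter / union if union else 1.0

# e.g., tanimoto([1, 1, 0, 1], [1, 0, 0, 1]) gives 2/3
```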

Baseline:

The most direct baseline is conditional EDM (Hoogeboom et al., 2022) , which only adopts a conditional noise prediction network without the guidance of an energy model. We also consider cG-SchNet (Gebauer et al., 2022) , which generates molecules in an autoregressive manner.

6.2. RESULTS

As shown in Table 4, EEGSDE significantly improves the similarity between target structures and generated structures compared to conditional EDM and cG-SchNet on QM9. Also note that, within a proper range, a larger scaling factor results in a better similarity, and EEGSDE with s=1 improves the similarity by more than 10% compared to conditional EDM. In Figure 2, we plot molecules generated by conditional EDM and EEGSDE (s=1) targeted to specific structures, where our EEGSDE aligns better with them. We further visualize the effect of the scaling factor in Appendix G.3, where the generated structures align better as the scaling factor grows. These results demonstrate that our EEGSDE captures the structure information in molecules well. We also perform experiments on the more challenging GEOM-Drug (Axelrod & Gomez-Bombarelli, 2022) dataset, and we train the conditional EDM baseline following the default setting of Hoogeboom et al. (2022). As shown in Table 4, we find the conditional EDM baseline has a similarity of 0.165, which is much lower than the value on QM9. We hypothesize this is because molecules in GEOM-Drug have many more atoms than those in QM9 and a more complex structure, and the default setting in Hoogeboom et al. (2022) is suboptimal. For example, the conditional EDM on GEOM-Drug has fewer parameters than the conditional EDM on QM9 (15M vs. 26M), which is insufficient to capture the structure information. Nevertheless, our EEGSDE still improves the similarity by ∼17%. We provide generated molecules on GEOM-Drug in Appendix G.5. Finally, we demonstrate that our EEGSDE is a flexible framework to generate molecules targeted to multiple properties, which is often the practical case. We additionally target the quantum property α (polarizability) on QM9 by combining the energy function for structures in this section and the energy function for quantum properties in Section 5.
Here we choose α = 100 Bohr 3 , which is a relatively large value, and we expect it to encourage less isometrically shaped structures. As shown in Figure 3 , the generated molecule aligns better with the target structure as the scaling factor s 1 grows, and meanwhile a ring substructure in the generated molecule vanishes as the scaling factor for polarizability s 2 grows, leading to a less isometrically shaped structure, which is as expected.

Conclusion:

These results suggest that molecules generated by our EEGSDE align better with the target structure than molecules generated by the conditional EDM baseline. Besides, EEGSDE can generate molecules targeted to both specific structures and desired quantum properties by combining energy functions linearly. As a result, EEGSDE may benefit practical cases in molecular design when multiple properties should be considered at the same time. 

A DERIVATIONS

A.1 ZERO COM SUBSPACE

The zero CoM subspace X = {x ∈ R^{Mn} : Σ_{i=1}^M x_i = 0} is an (M-1)n dimensional subspace of R^{Mn}. Therefore, there exists an isometric isomorphism ϕ from R^{(M-1)n} to X, i.e., ϕ is a linear bijection from R^{(M-1)n} to X, and ∥ϕ(x̂)∥ = ∥x̂∥ for all x̂ ∈ R^{(M-1)n}. We use A_ϕ ∈ R^{Mn×(M-1)n} to represent the matrix corresponding to ϕ, so we have ϕ(x̂) = A_ϕ x̂. An important property of the isometric isomorphism is that A_ϕ A_ϕ^⊤ x = x - x̄ for all x ∈ R^{Mn}. We show the proof as follows.

Proposition 1. Suppose ϕ is an isometric isomorphism from R^{(M-1)n} to X, and let A_ϕ ∈ R^{Mn×(M-1)n} be the matrix corresponding to ϕ. Then we have A_ϕ A_ϕ^⊤ x = x - x̄ for all x ∈ R^{Mn}, where x̄ = (1/M) Σ_{i=1}^M x_i.

Proof. We consider a new subspace of R^{Mn}, X^⊥ = {x ∈ R^{Mn} : x ⊥ X}, i.e., the orthogonal complement of X. We can verify that X^⊥ = {x ∈ R^{Mn} : x_1 = x_2 = ··· = x_M}, and X^⊥ is n dimensional. Thus, there exists an isometric isomorphism ψ from R^n to X^⊥. Let A_ψ ∈ R^{Mn×n} represent the matrix corresponding to ψ. Then we define λ(x̂, ŷ) = ϕ(x̂) + ψ(ŷ), where x̂ ∈ R^{(M-1)n} and ŷ ∈ R^n. The image of λ is {x + y : x ∈ X, y ∈ X^⊥} = R^{Mn}. Therefore, λ is a linear bijection from R^{Mn} to R^{Mn}, and the matrix corresponding to λ is A_λ = [A_ϕ, A_ψ]. Furthermore, λ is an isometric isomorphism, since ∥λ(x̂, ŷ)∥² = ∥ϕ(x̂)∥² + ∥ψ(ŷ)∥² + 2⟨ϕ(x̂), ψ(ŷ)⟩ = ∥x̂∥² + ∥ŷ∥² = ∥(x̂^⊤, ŷ^⊤)^⊤∥², where the cross term vanishes because ϕ(x̂) ⊥ ψ(ŷ). This means A_λ is an orthogonal matrix. Therefore, A_λ A_λ^⊤ = A_ϕ A_ϕ^⊤ + A_ψ A_ψ^⊤ = I, and hence A_ϕ A_ϕ^⊤ x + A_ψ A_ψ^⊤ x = x. Since A_ϕ A_ϕ^⊤ x ∈ X and A_ψ A_ψ^⊤ x ∈ X^⊥, we can conclude that A_ϕ A_ϕ^⊤ x is the orthogonal projection of x onto X, which is exactly x - x̄.

Since R^{(M-1)n} and X are two intrinsically equivalent spaces, an equivalence of distributions in these two spaces can also be established, as shown in the following propositions.

Proposition 2.
Suppose x is a random vector distributed in X, and x̂ = ϕ^{-1}(x) is its equivalent representation in R^{(M-1)n}. If x ∼ q(x), then x̂ ∼ q̂(x̂), where q̂(x̂) = q(ϕ(x̂)).

Proposition 3. Suppose x, y are two random vectors distributed in X, and x̂ = ϕ^{-1}(x), ŷ = ϕ^{-1}(y) are their equivalent representations in R^{(M-1)n}. If x|y ∼ q(x|y), then x̂|ŷ ∼ q̂(x̂|ŷ), where q̂(x̂|ŷ) = q(ϕ(x̂)|ϕ(ŷ)).
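Proposition 1 can also be checked numerically: build A_ϕ from an orthonormal basis of the zero-sum subspace (the construction below is one concrete choice of ϕ, introduced here for illustration) and verify that A_ϕ A_ϕ^⊤ x equals x - x̄:

```python
import numpy as np

M, n = 5, 3
# Columns e_i - (1/M) * 1 span the zero-sum subspace of R^M; orthonormalizing
# them gives Q, and A_phi = Q ⊗ I_n is the matrix of an isometric isomorphism
# phi: R^{(M-1)n} -> X (one concrete choice of phi).
B = np.eye(M)[:, : M - 1] - 1.0 / M
Q, _ = np.linalg.qr(B)                     # Q: M x (M-1), orthonormal columns
A_phi = np.kron(Q, np.eye(n))              # A_phi: Mn x (M-1)n

x = np.random.default_rng(0).normal(size=(M, n))
proj = (A_phi @ (A_phi.T @ x.reshape(-1))).reshape(M, n)
# Proposition 1: proj coincides with x - x_bar, the orthogonal projection onto X
```

Here `proj` equals `x - x.mean(axis=0)`, and `A_phi.T @ A_phi` is the identity, confirming that ϕ preserves norms.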

A.2 SDE IN THE PRODUCT SPACE

Definition 1 (Gaussian distributions in the zero CoM subspace). Suppose μ ∈ X. Let N_X(x | μ, σ²) := (2πσ²)^{-(M-1)n/2} exp(-(1/2σ²) ∥x - μ∥²), which is the isotropic Gaussian distribution with mean μ and variance σ² in the zero CoM subspace X.

Proposition 4 (Transition kernels of an SDE in the zero CoM subspace). Suppose dx = f(t)x dt + g(t) dw_x, x_0 ∼ q(x_0) is an SDE in the zero CoM subspace. Then the transition kernel from x_s to x_t (0 ≤ s < t ≤ T) can be expressed as q(x_t | x_s) = N_X(x_t | √α_{t|s} x_s, β_{t|s}), where α_{t|s} = exp(2 ∫_s^t f(τ) dτ) and β_{t|s} = α_{t|s} ∫_s^t g(τ)²/α_{τ|s} dτ are two scalars determined by f(·) and g(·).

Proof. Firstly, we map the process {x_t}_{t=0}^T in the zero CoM subspace to the equivalent space R^{(M-1)n} through the isometric isomorphism ϕ introduced in Appendix A.1. This produces a new process {x̂_t}_{t=0}^T, where x̂_t = ϕ^{-1}(x_t). By applying ϕ^{-1} to the SDE in the zero CoM subspace, we know x̂_t = ϕ^{-1}(x_t) satisfies the following SDE in R^{(M-1)n}:

dx̂ = f(t)x̂ dt + g(t) dŵ,   x̂_0 ∼ q̂(x̂_0),   (7)

where ŵ is the standard Wiener process in R^{(M-1)n} and q̂(x̂_0) = q(ϕ(x̂_0)). According to Song et al. (2020), the transition kernel of Eq. (7) is q̂(x̂_t | x̂_s) = N(x̂_t | √α_{t|s} x̂_s, β_{t|s}). By Proposition 3, the transition kernel q(x_t | x_s) from x_s to x_t satisfies

q(x_t | x_s) = q̂(ϕ^{-1}(x_t) | ϕ^{-1}(x_s))
= (2πβ_{t|s})^{-(M-1)n/2} exp(-(1/2β_{t|s}) ∥ϕ^{-1}(x_t) - √α_{t|s} ϕ^{-1}(x_s)∥²)
= (2πβ_{t|s})^{-(M-1)n/2} exp(-(1/2β_{t|s}) ∥ϕ^{-1}(x_t - √α_{t|s} x_s)∥²)   // linearity of ϕ^{-1}
= (2πβ_{t|s})^{-(M-1)n/2} exp(-(1/2β_{t|s}) ∥x_t - √α_{t|s} x_s∥²)   // norm preservation of ϕ^{-1}
= N_X(x_t | √α_{t|s} x_s, β_{t|s}).

Proposition 5 (Transition kernels of the SDE in the product space). Suppose dz = f(t)z dt + g(t) d(w_x, w_h), z_0 ∼ q(z_0) is the SDE in the product space X × R^{Md}, as introduced in Eq. (3). Then the transition kernel from z_s to z_t (0 ≤ s < t ≤ T) can be expressed as q(z_t | z_s) = q(x_t | x_s) q(h_t | h_s), where q(x_t | x_s) = N_X(x_t | √α_{t|s} x_s, β_{t|s}) and q(h_t | h_s) = N(h_t | √α_{t|s} h_s, β_{t|s}).
Here α_{t|s} and β_{t|s} are defined as in Proposition 4.

Proof. Since w_x and w_h are independent of each other, the transition kernel from z_s to z_t can be factorized as q(z_t | z_s) = q(x_t | x_s) q(h_t | h_s), where q(x_t | x_s) is the transition kernel of dx = f(t)x dt + g(t) dw_x and q(h_t | h_s) is the transition kernel of dh = f(t)h dt + g(t) dw_h. According to Proposition 4 and Song et al. (2020), we have q(x_t | x_s) = N_X(x_t | √α_{t|s} x_s, β_{t|s}) and q(h_t | h_s) = N(h_t | √α_{t|s} h_s, β_{t|s}).

Remark 1. The marginal distribution of z_t is q_t(z_t) = ∫_X ∫_{R^{Md}} q(z_0) q(x_t | x_0) q(h_t | h_0) dh_0 λ(dx_0), where λ is the Lebesgue measure in the zero CoM subspace X. While q_t(z_t) is a distribution in X × R^{Md}, it has a natural differentiable extension to R^{Mn} × R^{Md}, since q(x_t | x_0) = N_X(x_t | √α_{t|0} x_0, β_{t|0}) has a differentiable extension to R^{Mn} according to Definition 1. Thus, we can take the gradient of q_t(z_t) w.r.t. x_t in the whole space R^{Mn}.

Proposition 6 (Time reversal of the SDE in the product space). Suppose dz = f(t)z dt + g(t) d(w_x, w_h), z_0 ∼ q(z_0) is the SDE in the product space X × R^{Md}, as introduced in Eq. (3). Then its time reversal satisfies the following reverse-time SDE, which can be represented in both the score function form and the noise prediction form:

dz = [f(t)z - g(t)² (∇_x log q_t(z) - ∇̄_x log q_t(z), ∇_h log q_t(z))] dt + g(t) d(w̃_x, w̃_h)   (score function form)
   = [f(t)z + (g(t)²/√β_{t|0}) E_{q(z_0|z_t)} ϵ_t] dt + g(t) d(w̃_x, w̃_h),   z_T ∼ q_T(z_T),   (noise prediction form)

where w̃_x and w̃_h are reverse-time standard Wiener processes in X and R^{Md} respectively, and ϵ_t = (z_t - √α_{t|0} z_0)/√β_{t|0} is the standard Gaussian noise in X × R^{Md} injected to z_0. Furthermore, we have

∇_x log q_t(z) - ∇̄_x log q_t(z) = -(1/√β_{t|0}) E_{q(x_0|x_t)} ϵ_t^x,

where ϵ_t^x = (x_t - √α_{t|0} x_0)/√β_{t|0} is the standard Gaussian noise in the zero CoM subspace injected to x_0.
Let $\hat{z}_t = (\hat{x}_t, h_t)$, where $\hat{x}_t = \phi^{-1}(x_t)$, as introduced in the proof of Proposition 4. Then $\{\hat{z}_t\}_{t=0}^T$ is a process in $\mathbb{R}^{(M-1)n} \times \mathbb{R}^{Md}$ determined by the following SDE:
$$\mathrm{d}\hat{z} = f(t)\hat{z}\,\mathrm{d}t + g(t)\,\mathrm{d}\hat{w}, \quad \hat{z}_0 \sim \hat{q}(\hat{z}_0), \tag{8}$$
where $\hat{w}$ is the standard Wiener process in $\mathbb{R}^{(M-1)n} \times \mathbb{R}^{Md}$ and $\hat{q}(\hat{z}_0) = q(\phi(\hat{x}_0), h_0)$. According to Song et al. (2020), Eq. (8) has a time reversal:
$$\mathrm{d}\hat{z} = [f(t)\hat{z} - g(t)^2 \nabla_{\hat{z}}\log \hat{q}_t(\hat{z})]\,\mathrm{d}t + g(t)\,\mathrm{d}\bar{w}, \quad \hat{z}_T \sim \hat{q}_T(\hat{z}_T), \tag{9}$$
where $\hat{q}_t(\hat{z})$ is the marginal distribution of $\hat{z}_t$, which satisfies $\hat{q}_t(\hat{z}_t) = q_t(\phi(\hat{x}_t), h_t)$ according to Proposition 2, and $\bar{w}$ is the reverse-time standard Wiener process in $\mathbb{R}^{(M-1)n} \times \mathbb{R}^{Md}$. Then we apply the linear transformation $\mathcal{T}\hat{z} = (\phi(\hat{x}), h)$ to Eq. (9), which maps $\hat{z}_t$ back to $z_t$. This yields
$$\mathrm{d}z = [f(t)z - g(t)^2\,\mathcal{T}(\nabla_{\hat{z}}\log\hat{q}_t(\hat{z}))]\,\mathrm{d}t + g(t)\,\mathrm{d}(\bar{w}^x, \bar{w}^h), \quad z_T \sim q_T(z_T). \tag{10}$$
Here $\bar{w}^x$ and $\bar{w}^h$ are reverse-time standard Wiener processes in $X$ and $\mathbb{R}^{Md}$ respectively, and $\mathcal{T}(\nabla_{\hat{z}}\log\hat{q}_t(\hat{z}))$ can be expressed as $(\phi(\nabla_{\hat{x}}\log\hat{q}_t(\hat{z})), \nabla_h\log\hat{q}_t(\hat{z}))$. Since $\nabla_{\hat{x}}\log\hat{q}_t(\hat{z}) = A_\phi^\top \nabla_x \log q_t(x, h) = A_\phi^\top \nabla_x \log q_t(z)$, we have $\phi(\nabla_{\hat{x}}\log\hat{q}_t(\hat{z})) = A_\phi A_\phi^\top \nabla_x \log q_t(z)$, where $A_\phi$ represents the matrix corresponding to $\phi$. According to Proposition 1, we have $\phi(\nabla_{\hat{x}}\log\hat{q}_t(\hat{z})) = \nabla_x\log q_t(z) - \overline{\nabla_x\log q_t(z)}$. Besides, $\nabla_h\log\hat{q}_t(\hat{z}) = \nabla_h\log q_t(z)$. Thus, Eq. (10) can be written as
$$\mathrm{d}z = [f(t)z - g(t)^2\big(\nabla_x \log q_t(z) - \overline{\nabla_x \log q_t(z)},\; \nabla_h \log q_t(z)\big)]\,\mathrm{d}t + g(t)\,\mathrm{d}(\bar{w}^x, \bar{w}^h),$$
which is the score function form of the reverse-time SDE. We can also write the score function in Eq. (9) as
$$\nabla_{\hat{z}}\log\hat{q}_t(\hat{z}) = \mathbb{E}_{\hat{q}(\hat{z}_0|\hat{z}_t)}\nabla_{\hat{z}}\log\hat{q}(\hat{z}_t|\hat{z}_0) = -\frac{1}{\sqrt{\beta_{t|0}}}\mathbb{E}_{\hat{q}(\hat{z}_0|\hat{z}_t)}\hat{\epsilon}_t,$$
where $\hat{\epsilon}_t = \frac{\hat{z}_t - \sqrt{\alpha_{t|0}}\,\hat{z}_0}{\sqrt{\beta_{t|0}}}$ is the standard Gaussian noise injected into $\hat{z}_0$. With this expression, we have
$$\mathcal{T}(\nabla_{\hat{z}}\log\hat{q}_t(\hat{z})) = -\frac{1}{\sqrt{\beta_{t|0}}}\mathbb{E}_{\hat{q}(\hat{z}_0|\hat{z}_t)}\mathcal{T}(\hat{\epsilon}_t) = -\frac{1}{\sqrt{\beta_{t|0}}}\mathbb{E}_{q(z_0|z_t)}\epsilon_t,$$
where $\epsilon_t = \frac{z_t - \sqrt{\alpha_{t|0}}\,z_0}{\sqrt{\beta_{t|0}}}$ is the standard Gaussian noise in $X \times \mathbb{R}^{Md}$ injected into $z_0$. Thus, Eq.
(10) can also be written as
$$\mathrm{d}z = \Big[f(t)z + \frac{g(t)^2}{\sqrt{\beta_{t|0}}}\,\mathbb{E}_{q(z_0|z_t)}\epsilon_t\Big]\mathrm{d}t + g(t)\,\mathrm{d}(\bar{w}^x, \bar{w}^h),$$
which is the noise prediction form of the reverse-time SDE.

A.3 EQUIVARIANCE

Theorem 1. Let $\epsilon_\theta(z_t, t) = (\epsilon^x_\theta(z_t, t), \epsilon^h_\theta(z_t, t))$, where $\epsilon^x_\theta(z_t, t)$ and $\epsilon^h_\theta(z_t, t)$ are the predicted noise of $x_t$ and $h_t$ respectively. If for any orthogonal transformation $R \in \mathbb{R}^{n\times n}$, $\epsilon_\theta(z_t, t)$ is equivariant to $R$, i.e., $\epsilon_\theta(Rx_t, h_t, t) = (R\,\epsilon^x_\theta(x_t, h_t, t), \epsilon^h_\theta(x_t, h_t, t))$, and $p_T(z_T)$ is invariant to $R$, i.e., $p_T(Rx_T, h_T) = p_T(x_T, h_T)$, then $p_\theta(z_0)$ is invariant to any rotational transformation.

Proof. Suppose $R \in \mathbb{R}^{n\times n}$ is an orthogonal transformation. Let $z^\theta_t = (x^\theta_t, h^\theta_t)$ ($0 \le t \le T$) be the process determined by Eq. (5). Let $y^\theta_t = R x^\theta_t$ and $u^\theta_t = (y^\theta_t, h^\theta_t)$. We use $p^u_t(u_t)$ and $p^z_t(z_t)$ to denote the distributions of $u^\theta_t$ and $z^\theta_t$ respectively, and they satisfy $p^u_t(y_t, h_t) = p^z_t(R^{-1}y_t, h_t)$. By applying the transformation $\mathcal{T}z = (Rx, h)$ to Eq. (5), we know the new process $\{u^\theta_t\}_{t=0}^T$ satisfies the following SDE:
$$\mathrm{d}u = \mathcal{T}\mathrm{d}z = \Big[f(t)u + \frac{g(t)^2}{\sqrt{\beta_{t|0}}}\,(R\,\epsilon^x_\theta(z, t), \epsilon^h_\theta(z, t))\Big]\mathrm{d}t + g(t)\,\mathrm{d}(R\bar{w}^x, \bar{w}^h)$$
$$= \Big[f(t)u + \frac{g(t)^2}{\sqrt{\beta_{t|0}}}\,(\epsilon^x_\theta(Rx, h, t), \epsilon^h_\theta(Rx, h, t))\Big]\mathrm{d}t + g(t)\,\mathrm{d}(R\bar{w}^x, \bar{w}^h) \quad \text{// equivariance of } \epsilon_\theta$$
$$= \Big[f(t)u + \frac{g(t)^2}{\sqrt{\beta_{t|0}}}\,\epsilon_\theta(u, t)\Big]\mathrm{d}t + g(t)\,\mathrm{d}(R\bar{w}^x, \bar{w}^h).$$
Since $R$ is orthogonal, $R\bar{w}^x$ is also a reverse-time standard Wiener process in $X$, and the initial distribution satisfies $p^u_T(u_T) = p^u_T(y_T, h_T) = p^z_T(R^{-1}y_T, h_T) = p_T(R^{-1}y_T, h_T) = p_T(y_T, h_T) = p_T(u_T)$. Thus, the SDE of $\{u^\theta_t\}_{t=0}^T$ is exactly the same as that of $\{z^\theta_t\}_{t=0}^T$. This indicates that the distribution of $u^\theta_0$ is the same as the distribution of $z^\theta_0$, i.e., $p^u_0(u_0) = p^z_0(u_0) = p_\theta(u_0)$. Also note that $p^u_0(u_0) = p^z_0(R^{-1}y_0, h_0) = p_\theta(R^{-1}y_0, h_0)$. Thus, $p_\theta(u_0) = p_\theta(R^{-1}y_0, h_0)$, and consequently $p_\theta(Rx_0, h_0) = p_\theta(x_0, h_0)$. This means $p_\theta(z_0)$ is invariant to any orthogonal transformation, which includes rotational transformations as special cases.

Theorem 2.
Suppose the assumptions in Theorem 1 hold and $E(z, c, t)$ is invariant to any orthogonal transformation $R$, i.e., $E(Rx, h, c, t) = E(x, h, c, t)$. Then $p_\theta(z_0|c)$ is invariant to any rotational transformation.

Proof. Suppose $R \in \mathbb{R}^{n\times n}$ is an orthogonal transformation. Taking the gradient of both sides of $E(Rx, h, c, t) = E(x, h, c, t)$ w.r.t. $x$, we get $R^\top \nabla_y E(y, h, c, t)|_{y=Rx} = \nabla_x E(x, h, c, t)$. Multiplying both sides by $R$, we get $\nabla_y E(y, h, c, t)|_{y=Rx} = R\,\nabla_x E(x, h, c, t)$. Let $\varphi(z, c, t) = (\nabla_x E(z, c, t) - \overline{\nabla_x E(z, c, t)},\; \nabla_h E(z, c, t))$, where $\overline{\,\cdot\,}$ denotes the CoM of the coordinate component. Then we have
$$\varphi(Rx, h, c, t) = \big(\nabla_y E(y, h, c, t) - \overline{\nabla_y E(y, h, c, t)},\; \nabla_h E(y, h, c, t)\big)\big|_{y=Rx}$$
$$= \big(R\nabla_x E(x, h, c, t) - \overline{R\nabla_x E(x, h, c, t)},\; \nabla_h E(Rx, h, c, t)\big)$$
$$= \big(R(\nabla_x E(x, h, c, t) - \overline{\nabla_x E(x, h, c, t)}),\; \nabla_h E(x, h, c, t)\big).$$
Thus, $\varphi(z, c, t)$ is equivariant to $R$. Let $\bar{\epsilon}_\theta(z, c, t) = \epsilon_\theta(z, t) + \sqrt{\beta_{t|0}}\,\varphi(z, c, t)$, which is a linear combination of two equivariant functions and is therefore also equivariant to $R$. Then Eq. (6) can be written as
$$\mathrm{d}z = \Big[f(t)z + \frac{g(t)^2}{\sqrt{\beta_{t|0}}}\,\bar{\epsilon}_\theta(z, c, t)\Big]\mathrm{d}t + g(t)\,\mathrm{d}(\bar{w}^x, \bar{w}^h), \quad z_T \sim p_T(z_T).$$
According to Theorem 1, its marginal distribution at time $t = 0$, i.e., $p_\theta(z_0|c)$, is invariant to any rotational transformation.
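The equivariance argument above rests on two elementary facts: the zero-CoM projection commutes with any orthogonal transformation, and a rotation-invariant function has rotation-equivariant gradients. The first fact can be checked numerically; the sketch below (a toy 2D "molecule", not the paper's code) verifies that rotating and then subtracting the CoM gives the same result as subtracting the CoM and then rotating.

```python
import math

def rotate(points, theta):
    """Apply a 2D rotation R (an orthogonal transformation) to each atom."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y, s * x + c * y) for x, y in points]

def subtract_com(points):
    """Project coordinates onto the zero-CoM subspace X (subtract the mean)."""
    M = len(points)
    cx = sum(p[0] for p in points) / M
    cy = sum(p[1] for p in points) / M
    return [(x - cx, y - cy) for x, y in points]

# Toy "molecule": three atoms in the plane.
pts = [(1.0, 2.0), (-0.5, 0.3), (2.5, -1.0)]
theta = 0.7

a = subtract_com(rotate(pts, theta))   # rotate, then project to zero CoM
b = rotate(subtract_com(pts), theta)   # project to zero CoM, then rotate

# The two orders agree: the zero-CoM projection is equivariant to rotations,
# because the mean of rotated points equals the rotated mean (R is linear).
assert all(math.isclose(p[0], q[0], abs_tol=1e-12) and
           math.isclose(p[1], q[1], abs_tol=1e-12) for p, q in zip(a, b))
```

The same linearity argument underlies Proposition 1's identity $A_\phi A_\phi^\top x = x - \bar{x}$ used in the proofs above.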

B SAMPLING

In Algorithm 1, we present the Euler-Maruyama method to sample from EEGSDE in Eq. (6).

Algorithm 1: Sampling from EEGSDE with the Euler-Maruyama method
Require: number of steps $N$
  $\Delta t = T / N$
  $z \leftarrow (x - \bar{x}, h)$, where $x \sim \mathcal{N}(0, I)$, $h \sim \mathcal{N}(0, I)$   {sample from the prior $p_T(z_T)$}
  for $i = N$ to $1$ do
    $t \leftarrow i\Delta t$
    $g_x \leftarrow \nabla_x E(z, c, t)$, $g_h \leftarrow \nabla_h E(z, c, t)$   {calculate the gradient of the energy function}
    $g \leftarrow (g_x - \bar{g}_x, g_h)$   {subtract the CoM of the coordinate gradient}
    $F \leftarrow f(t)z + g(t)^2\big(\frac{1}{\sqrt{\beta_{t|0}}}\,\epsilon_\theta(z, t) + g\big)$
    $\epsilon \leftarrow (\epsilon_x - \bar{\epsilon}_x, \epsilon_h)$, where $\epsilon_x \sim \mathcal{N}(0, I)$, $\epsilon_h \sim \mathcal{N}(0, I)$
    $z \leftarrow z - F\Delta t + g(t)\sqrt{\Delta t}\,\epsilon$   {update $z$ according to Eq. (6)}
  end for
  return $z$
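As an illustration, Algorithm 1 can be sketched in plain Python. The noise prediction network, the energy gradients, and the coefficients $f$, $g$, $\beta_{t|0}$ below are placeholder assumptions (a constant-drift VP-style choice), not the paper's trained models; the sketch only shows the loop structure and the CoM bookkeeping.

```python
import math
import random

T, N = 1.0, 100                          # time horizon and number of steps

def f(t): return -0.5                    # assumed drift coefficient f(t)
def g(t): return 1.0                     # assumed diffusion coefficient g(t)
def beta(t): return 1.0 - math.exp(-t)   # beta_{t|0} implied by this f, g

def eps_theta(x, h, t):
    """Hypothetical noise prediction network (placeholder, not EGNN)."""
    return [xi * 0.1 for xi in x], [hi * 0.1 for hi in h]

def grad_energy(x, h, t):
    """Hypothetical energy gradients (grad_x E, grad_h E)."""
    return [0.01 * xi for xi in x], [0.0 for _ in h]

def sub_com(v):
    """Subtract the CoM so the coordinates stay in the subspace X."""
    m = sum(v) / len(v)
    return [vi - m for vi in v]

def sample(M=5, seed=0):
    rng = random.Random(seed)
    # Sample from the prior p_T(z_T): CoM-free coordinates, free-space features.
    x = sub_com([rng.gauss(0, 1) for _ in range(M)])
    h = [rng.gauss(0, 1) for _ in range(M)]
    dt = T / N
    for i in range(N, 0, -1):
        t = i * dt
        gx, gh = grad_energy(x, h, t)
        gx = sub_com(gx)                             # subtract CoM of gradient
        ex, eh = eps_theta(x, h, t)
        s = 1.0 / math.sqrt(beta(t))
        Fx = [f(t)*xi + g(t)**2 * (s*ei + gi) for xi, ei, gi in zip(x, ex, gx)]
        Fh = [f(t)*hi + g(t)**2 * (s*ei + gi) for hi, ei, gi in zip(h, eh, gh)]
        nx = sub_com([rng.gauss(0, 1) for _ in range(M)])
        nh = [rng.gauss(0, 1) for _ in range(M)]
        x = [xi - Fi*dt + g(t)*math.sqrt(dt)*ni for xi, Fi, ni in zip(x, Fx, nx)]
        h = [hi - Fi*dt + g(t)*math.sqrt(dt)*ni for hi, Fi, ni in zip(h, Fh, nh)]
    return x, h

x, h = sample()
assert abs(sum(x)) < 1e-6   # coordinates remain in the zero-CoM subspace
```

Because every term added to $x$ (drift, predicted noise, energy gradient, injected noise) is itself CoM-free, the coordinates never leave the subspace $X$, which is exactly why Algorithm 1 subtracts the CoM of the gradient and of the injected noise.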

C CONDITIONAL NOISE PREDICTION NETWORKS

In Eq. (6), we can alternatively use a conditional noise prediction network $\epsilon_\theta(z, c, t)$ for stronger guidance, as follows:
$$\mathrm{d}z = \Big[f(t)z + g(t)^2\Big(\frac{1}{\sqrt{\beta_{t|0}}}\,\epsilon_\theta(z, c, t) + \big(\nabla_x E(z, c, t) - \overline{\nabla_x E(z, c, t)},\; \nabla_h E(z, c, t)\big)\Big)\Big]\mathrm{d}t + g(t)\,\mathrm{d}(\bar{w}^x, \bar{w}^h), \quad z_T \sim p_T(z_T).$$
The conditional noise prediction network is trained similarly to the unconditional one, using the following MSE loss:
$$\min_\theta \mathbb{E}_t \mathbb{E}_{q(c, z_0, z_t)}\, w(t)\,\|\epsilon_\theta(z_t, c, t) - \epsilon_t\|_2^2.$$
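One Monte Carlo sample of this training loss can be sketched as follows. The coefficients and the conditional network below are placeholder assumptions (the same constant-drift choice as before, and a network that always predicts zero); the sketch only shows how $z_t$ is formed from $z_0$ and the injected noise, and how the squared error against that noise is computed.

```python
import math
import random

rng = random.Random(0)

def alpha(t): return math.exp(-t)        # assumed alpha_{t|0} for f(t) = -1/2
def beta(t): return 1.0 - math.exp(-t)   # matching beta_{t|0}

def eps_theta(z_t, c, t):
    """Hypothetical conditional noise prediction network (placeholder)."""
    return [0.0 for _ in z_t]

def training_loss(z0, c, w=lambda t: 1.0):
    """One Monte Carlo sample of E_t E_q w(t) ||eps_theta(z_t,c,t) - eps_t||^2."""
    t = rng.uniform(1e-3, 1.0)                                # t ~ U(0, T]
    eps = [rng.gauss(0, 1) for _ in z0]                       # injected noise
    z_t = [math.sqrt(alpha(t)) * z + math.sqrt(beta(t)) * e   # perturb z_0
           for z, e in zip(z0, eps)]
    pred = eps_theta(z_t, c, t)
    return w(t) * sum((p - e) ** 2 for p, e in zip(pred, eps))

loss = training_loss([0.3, -1.2, 0.8], c=1.5)
assert loss >= 0.0
```

In practice the expectation is approximated by averaging such samples over minibatches, and the CoM bookkeeping of Appendix B applies to the coordinate part of $z_0$.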

D PARAMETERIZATION OF NOISE PREDICTION NETWORKS

We parameterize the noise prediction network following Hoogeboom et al. (2022), and we provide the specific parameterization for completeness. For the unconditional model $\epsilon_\theta(z, t)$, we first concatenate each atom feature $h^i$ and $t$, which gives $h^{i\prime} = (h^i, t)$. Then we input $x$ and $h' = (h^{1\prime}, \dots, h^{M\prime})$ to the EGNN as follows:
$$(a^x, a^{h'}) = \mathrm{EGNN}(x, h') - (x, 0).$$
Finally, we subtract the CoM of $a^x$, which gives the parameterization of $\epsilon_\theta(z, t)$:
$$\epsilon_\theta(z, t) = (a^x - \bar{a}^x, a^h),$$
where $a^h$ comes from discarding the last component of $a^{h'}$, which corresponds to the time. For the conditional model $\epsilon_\theta(z, c, t)$, we additionally concatenate $c$ to the atom feature $h^i$, i.e., $h^{i\prime} = (h^i, t, c)$, and the other parts of the parameterization remain the same.

E DETAILS OF ENERGY FUNCTIONS E.1 PARAMETERIZATION OF TIME-DEPENDENT MODELS

The time-dependent property prediction model $g(z_t, t)$ is parameterized using the second component of the output of the EGNN (see Section 3) followed by a decoder (Dec):
$$g(z_t, t) = \mathrm{Dec}(\mathrm{EGNN}^h(x_t, h'_t)), \quad h'_t = \mathrm{concatenate}(h_t, t), \tag{11}$$
where the concatenation is performed on each atom feature, and the decoder is a small neural network based on Satorras et al. (2021b). This parameterization ensures that the energy function $E(z_t, c, t)$ is invariant to orthogonal transformations, and thus the distribution of generated samples is also invariant according to Theorem 2. Similarly, the time-dependent multi-label classifier is parameterized by an EGNN as
$$m(z_t, t) = \sigma(\mathrm{Dec}(\mathrm{EGNN}^h(x_t, h'_t))), \quad h'_t = \mathrm{concatenate}(h_t, t).$$
The multi-label classifier has the same backbone as the property prediction model in Eq. (11), except that the decoder outputs a vector of dimension $L$, and the sigmoid function $\sigma$ is adopted for multi-label classification. As with Eq. (11), the EGNN in the multi-label classifier guarantees the invariance of the distribution of generated samples according to Theorem 2.

E.2 TRAINING OBJECTIVES OF ENERGY FUNCTIONS

Time-dependent property prediction model. Since the quantum property is a scalar, we train the time-dependent property prediction model $g(z_t, t)$ using the $\ell_1$ loss
$$\mathbb{E}_t \mathbb{E}_{q(c, z_0, z_t)}\, |g(z_t, t) - c|,$$
where $t$ is uniformly sampled from $[0, T]$.

Time-dependent multi-label classifier. Since the fingerprint is a bit map, predicting it can be viewed as a multi-label classification task. Thus, we use a time-dependent multi-label classifier $m(z_t, t)$ and train it using the binary cross entropy loss
$$-\,\mathbb{E}_t \mathbb{E}_{q(c, z_0, z_t)} \sum_{l=1}^L c_l \log m_l(z_t, t) + (1 - c_l)\log(1 - m_l(z_t, t)),$$
where $t$ is uniformly sampled from $[0, T]$, and $m_l(z_t, t)$ is the $l$-th component of $m(z_t, t)$.
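The two objectives above can be sketched directly; the predictions passed in below are toy placeholders, not outputs of the paper's trained models.

```python
import math

def l1_property_loss(g_pred, c):
    """l1 regression loss |g(z_t, t) - c| for a scalar quantum property."""
    return abs(g_pred - c)

def bce_fingerprint_loss(m_pred, c_bits, eps=1e-12):
    """Binary cross entropy over the L fingerprint bits: the negated
    log-likelihood sum, which is minimized during training."""
    return -sum(c * math.log(p + eps) + (1 - c) * math.log(1 - p + eps)
                for p, c in zip(m_pred, c_bits))

# A property prediction off by 0.5 gives an l1 loss of 0.5:
assert l1_property_loss(2.5, 2.0) == 0.5
# A (near-)perfect fingerprint prediction gives a (near-)zero BCE loss:
assert bce_fingerprint_loss([1.0, 0.0, 1.0], [1, 0, 1]) < 1e-9
```

During training, both losses are evaluated on noisy samples $z_t$ drawn via the transition kernels of Proposition 5, with $t$ uniform on $[0, T]$.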

F EXPERIMENTAL DETAILS

F.1 HOW TO GENERATE MOLECULES TARGETED TO MULTIPLE QUANTUM PROPERTIES

When we want to generate molecules with $K$ quantum properties $c = (c_1, c_2, \dots, c_K)$, we combine the energy functions for the single properties linearly:
$$E(z_t, c, t) = \sum_{k=1}^K E_k(z_t, c_k, t),$$
where $E_k(z_t, c_k, t) = s_k\,|g_k(z_t, t) - c_k|^2$ is the energy function for the $k$-th property, $s_k$ is the scaling factor, and $g_k(z_t, t)$ is the time-dependent property prediction model for the $k$-th property. Then we use the gradient of $E(z_t, c, t)$ to guide the reverse SDE, as described in Eq. (6).
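Because the combined energy is a plain sum, its gradient is just the sum of the per-property gradients. The sketch below illustrates this with scalar toy values; the predictions `preds` and prediction gradients `grads` are hypothetical stand-ins for the outputs of the models $g_k$ and their coordinate derivatives.

```python
def grad_single_energy(g_k, grad_g_k, c_k, s_k):
    """Gradient of E_k = s_k * |g_k - c_k|^2 w.r.t. a coordinate, by the
    chain rule: dE_k/dx = 2 * s_k * (g_k - c_k) * dg_k/dx."""
    return 2.0 * s_k * (g_k - c_k) * grad_g_k

def grad_combined(preds, grads, targets, scales):
    """Gradient of E = sum_k E_k: the sum of the per-property gradients."""
    return sum(grad_single_energy(g, dg, c, s)
               for g, dg, c, s in zip(preds, grads, targets, scales))

# Two hypothetical properties with toy predictions and gradients:
total = grad_combined(preds=[1.0, 3.0], grads=[0.5, -0.2],
                      targets=[0.0, 2.0], scales=[1.0, 2.0])
# 2*1*(1-0)*0.5 + 2*2*(3-2)*(-0.2) = 1.0 - 0.8 = 0.2
assert abs(total - 0.2) < 1e-12
```

The per-property scaling factors $s_k$ thus directly weight how strongly each property pulls on the reverse SDE.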

F.2 THE "U-BOUND" AND "#ATOMS" BASELINES

The "U-bound" and "#Atoms" baselines are from Hoogeboom et al. (2022). The "U-bound" baseline shuffles the labels in $D_b$ and then calculates the loss of $\phi_c$ on it, which can be regarded as an upper bound of the MAE. The "#Atoms" baseline predicts a quantum property $c$ using only the number of atoms in a molecule.

F.3 GENERATING MOLECULES WITH DESIRED QUANTUM PROPERTIES

For the noise prediction network, we use the same setting as EDM (Hoogeboom et al., 2022) for a fair comparison: the model is trained for ~2000 epochs with a batch size of 64, a learning rate of 0.0001 with the Adam optimizer, and an exponential moving average (EMA) with a rate of 0.9999. The EGNN used in the energy function has 192 hidden features and 7 layers. We train it for 2000 epochs with a batch size of 128, a learning rate of 0.0001 with the Adam optimizer, and an EMA with a rate of 0.9999. During evaluation, we need to generate a set of molecules. Following EDM (Hoogeboom et al., 2022), we first sample the number of atoms in a molecule $M \sim p(M)$ and the property value $c \sim p(c|M)$ (or $c_1 \sim p(c_1|M), c_2 \sim p(c_2|M), \dots, c_K \sim p(c_K|M)$ for multiple properties). Here $p(M)$ is the distribution of molecule sizes on the training data, and $p(c|M)$ is the distribution of the property on the training data. Then we generate a molecule given $M$ and $c$.

F.4 GENERATING MOLECULES WITH TARGET STRUCTURES

Computation of Tanimoto similarity. Let $S_g$ be the set of bits that are set to 1 in the fingerprint of a generated molecule, and $S_t$ the set of bits that are set to 1 in the fingerprint of the target structure. The Tanimoto similarity is defined as $|S_g \cap S_t| / |S_g \cup S_t|$, where $|\cdot|$ denotes the number of elements in a set.

Experimental details on QM9. For the backbone of the noise prediction network, we use a three-layer MLP with 768, 512 and 192 hidden nodes to embed the fingerprint, add its output to the embedding of the atom features $h$, and feed the result into the following EGNN. The EGNN has 256 hidden features and 9 layers. We train it for 1500 epochs with a batch size of 64, a learning rate of 0.0001 with the Adam optimizer, and an exponential moving average (EMA) with a rate of 0.9999. The energy function is trained for 1750 epochs with a batch size of 128, a learning rate of 0.0001 with the Adam optimizer, and an EMA with a rate of 0.9999. Its EGNN has 192 hidden features and 7 layers. For the baseline cG-SchNet, we reproduce it using the public code. Since the default data split in cG-SchNet differs from ours, we train cG-SchNet under the same data split as ours for a fair comparison. We report results at 200 epochs (there is no gain on the similarity metric after 150 epochs). We evaluate the Tanimoto similarity on the whole test set.

Experimental details on GEOM-Drug. We use the same data split of GEOM-Drug as Hoogeboom et al. (2022), where the training, validation and test sets include 554K, 70K and 70K samples respectively. We train the noise prediction network and the energy function on the training set. For the noise prediction network, we use the recommended hyperparameters of EDM (Hoogeboom et al., 2022), where the EGNN has 256 hidden features and 4 layers; the other parts are the same as in QM9.
We train it for 10 epochs with a batch size of 64, a learning rate of 0.0001 with the Adam optimizer, and an exponential moving average (EMA) with a rate of 0.9999. The backbone of the energy function is the same as in QM9, and we train the energy function for 14 epochs. We evaluate the Tanimoto similarity on 10K molecules randomly selected from the test set.
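The Tanimoto similarity defined above is straightforward to compute from two binary fingerprints; a minimal sketch (toy bit vectors, not actual molecular fingerprints):

```python
def tanimoto_similarity(fp_gen, fp_target):
    """Tanimoto similarity between two binary fingerprints given as bit maps.
    S_g and S_t are the sets of indices whose bits are set to 1."""
    s_g = {i for i, b in enumerate(fp_gen) if b}
    s_t = {i for i, b in enumerate(fp_target) if b}
    if not s_g and not s_t:
        return 1.0  # convention chosen here for two empty fingerprints
    return len(s_g & s_t) / len(s_g | s_t)

# Example: bits {0, 2, 3} vs {0, 3, 4} -> |intersection| = 2, |union| = 4 -> 0.5
assert tanimoto_similarity([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]) == 0.5
```

In practice the fingerprints would come from a cheminformatics toolkit applied to the generated and target molecules; only the set arithmetic is shown here.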

G.1 ABLATION STUDY ON NOISE PREDICTION NETWORKS AND ENERGY GUIDANCE

We perform an ablation study on the conditioning (i.e., using conditional or unconditional noise prediction networks) and the energy guidance. When neither the conditioning nor the energy guidance is adopted, a single unconditional model cannot perform conditional generation, and therefore we only report its atom stability and molecule stability. As shown in Table 5, Table 6 and Table 7, the conditional noise prediction network improves the MAE compared to the unconditional one, and the energy guidance improves the MAE compared to a sole conditional model. Neither the conditioning nor the energy guidance affects the atom stability and the molecule stability much. For completeness, we also report novelty, atom stability and molecule stability on 10K generated molecules. Below we briefly introduce these metrics.

• Novelty (Simonovsky & Komodakis, 2018) is the proportion of generated molecules that do not appear in the training set. Specifically, let $G$ be the set of generated molecules; the novelty is calculated as $1 - \frac{|G \cap D_b|}{|G|}$. Note that the novelty is evaluated on $D_b$, since the reproduced conditional EDM and our method are trained on $D_b$. This leads to an inflated value compared to the one evaluated on the whole dataset (Hoogeboom et al., 2022).

• Atom stability (AS) (Hoogeboom et al., 2022) is the proportion of atoms that have the right valency. Molecule stability (MS) (Hoogeboom et al., 2022) is the proportion of generated molecules whose atoms are all stable. We use the official implementation of the two metrics from the EDM paper (Hoogeboom et al., 2022).

As shown in Table 8 and Table 9, conditional EDM and EEGSDE with a small scaling factor have slightly better stability, and EEGSDE with a large scaling factor has slightly better novelty in general. The additional energy changes the distribution of generated molecules, which improves the novelty of generated molecules in general.
Since there is a tradeoff between novelty and stability (see the caption of Table 5 in the EDM paper (Hoogeboom et al., 2022) ), a slight decrease on the stability is possible when the scaling factor is large.
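The novelty metric above can be sketched in a few lines; the molecule identifiers below are hypothetical canonical strings used purely for illustration.

```python
def novelty(generated, train_set):
    """Novelty: fraction of generated molecules not in the training set,
    i.e. 1 - |G intersect D_b| / |G|, with molecules as canonical strings."""
    G = set(generated)
    return 1.0 - len(G & set(train_set)) / len(G)

# Hypothetical canonical representations, for illustration only:
gen = ["mol_a", "mol_b", "mol_c", "mol_d"]
train = ["mol_a", "mol_c"]
assert novelty(gen, train) == 0.5   # 2 of 4 generated molecules are novel
```

Note that deduplicating the generated list (via `set`) matches the set-based definition $G$ used in the formula.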



While $q_t(z)$ is defined in $X \times \mathbb{R}^{Md}$, its domain can be extended to $\mathbb{R}^{Mn} \times \mathbb{R}^{Md}$, and the gradient is valid. See Remark 1 in Appendix A.2 for details.



Figure 1: Overview of our EEGSDE. EEGSDE iteratively generates molecules with desired properties (represented by the condition $c$) by adopting the guidance of energy functions in each step. As the energy function is invariant to a rotational transformation $R$, its gradient (i.e., the energy guidance) is equivariant to $R$, and therefore the distribution of generated samples is invariant to $R$.

Hoogeboom et al. (2022) also present a conditional version of EDM for inverse molecular design by adding an extra input of the condition $c$ to the noise prediction network, as $\epsilon_\theta(z_n, c, n)$.

Figure 3: Generated molecules on QM9 targeted to both the quantum property $\alpha$ and a molecular structure. As the scaling factor $s_2$ grows, the substructure of the generated molecule gradually changes from the symmetric ring to a less isometrically shaped structure. Meanwhile, the generated molecule aligns better with the target structure as the scaling factor $s_1$ grows.



The QM9 dataset contains quantum properties and coordinates of ~130k molecules with up to nine heavy atoms from (C, N, O, F). Following EDM, we split QM9 into training, validation and test sets, which include 100K, 18K and 13K samples respectively. The training set is further divided equally into two non-overlapping halves $D_a$ and $D_b$. The noise prediction network and the time-dependent property prediction model of the energy function are trained on $D_b$ separately. By default, EEGSDE uses a conditional noise prediction network $\epsilon_\theta(z, c, t)$.

How generated molecules align with the target quantum property. The L-bound (Hoogeboom et al., 2022) represents the loss of $\phi_p$ on $D_b$ and can be viewed as a lower bound of the MAE metric. The conditional EDM results are reproduced and are consistent with Hoogeboom et al. (2022) (see Appendix G.4). "#Atoms" uses public results from Hoogeboom et al. (2022).

The mean absolute error (MAE) computed by the Gaussian software instead of ϕ p .

How generated molecules align with multiple target quantum properties.

How generated molecules align with target structures.

…, and they are trained on the two non-overlapping training subsets $D_a$ and $D_b$ respectively. This ensures that no information leak occurs when evaluating our EEGSDE. To further verify the effectiveness of EEGSDE, we also calculate the MAE without relying on a neural network. Specifically, we use the Gaussian software (which calculates properties according to theories of quantum chemistry, without using neural networks) to calculate the properties of 100 generated molecules. For completeness, we also report the novelty, the atom stability and the molecule stability following Hoogeboom et al. (2022) in Appendix G.2, although they are not our main focus.

Figure 2: Generated molecules on QM9 targeted to specific structures (unseen during training). The molecular structures of EEGSDE align better with the target structures than those of conditional EDM.

We generate molecules targeted to one of these six properties. As shown in Table 1, with the energy guidance, our EEGSDE has a significantly better MAE than the conditional EDM on all properties. Remarkably, with a proper scaling factor $s$, the MAE of EEGSDE is reduced by more than 25% compared to conditional EDM on the properties $\Delta\varepsilon$ and $\varepsilon_{\mathrm{LUMO}}$, and by more than 30% on $\mu$. Moreover, as shown in Table 2, our EEGSDE still has a better MAE under evaluation by the Gaussian software, which further verifies its effectiveness. We further generate molecules targeted to multiple quantum properties by combining energy functions linearly. As shown in Table 3, our EEGSDE still has a significantly better MAE than the conditional EDM.


Effects of conditioning and energy guidance on a single quantum property µ (D).

Effects of conditioning and energy guidance on a single quantum property $C_v$ (cal/(mol·K)).

Effects of conditioning and energy guidance on multiple quantum properties C v , µ.

ACKNOWLEDGMENTS

We thank Han Guo and Zhen Jia for their help with their expertise in chemistry. This work was supported by NSF of China Projects (Nos. 62061136001, 61620106010, 62076145, U19B2034, U1811461, U19A2081, 6197222); Beijing Outstanding Young Scientist Program NO. BJJWZYJH012019100020098; a grant from Tsinghua Institute for Guo Qiang; the High Performance Computing Center, Tsinghua University; the Fundamental Research Funds for the Central Universities, and the Research Funds of Renmin University of China (22XNKJ13). J.Z was also supported by the XPlorer Prize. C. Li was also sponsored by Beijing Nova Program.

ETHICS STATEMENT

Inverse molecular design is critical in fields like material science and drug discovery. Our EEGSDE is a flexible framework for inverse molecular design and thus might benefit these fields. Currently the negative consequences are not obvious.

REPRODUCIBILITY STATEMENT

Our code is included in the supplementary material. The implementation of our experiment is described in Section 5 and Section 6. Further details such as the hyperparameters of training and model backbones are provided in Appendix F. We provide complete proofs and derivations of all theoretical results in Appendix A.

Published as a conference paper at ICLR 2023

G.3 VISUALIZATION OF THE EFFECT OF THE SCALING FACTOR

We visualize the effect of the scaling factor in Figure 4, where the generated structures align better with the target structure as the scaling factor grows.

Figure 4: Visualization of the effect of the scaling factor on QM9. As the scaling factor grows, the generated structures align better with the target structure. $s = 0$ corresponds to the conditional EDM.

G.4 REPRODUCE

We compare our reproduced results with the original results of conditional EDM (Hoogeboom et al., 2022) in Table 10; the results are consistent. We plot generated molecules on GEOM-Drug in Figure 5, where the atom types of generated molecules from EEGSDE often match the target better than those from conditional EDM.

