3D EQUIVARIANT DIFFUSION FOR TARGET-AWARE MOLECULE GENERATION AND AFFINITY PREDICTION

Abstract

Rich data and powerful machine learning models allow us to design drugs for a specific protein target in silico. Recently, the inclusion of 3D structures in targeted drug design has shown superior performance over target-free models, since atomic interactions in 3D space are explicitly modeled. However, current 3D target-aware models rely on either voxelized atom densities or autoregressive sampling, and are therefore not equivariant to rotation or easily violate geometric constraints, resulting in unrealistic structures. In this work, we develop a 3D equivariant diffusion model to address these challenges. To achieve target-aware molecule design, our method learns a joint generative process of both continuous atom coordinates and categorical atom types with an SE(3)-equivariant network. Moreover, we show that under proper parameterization our model can serve as an unsupervised feature extractor to estimate binding affinity, which provides an effective way for drug screening. To evaluate our model, we propose a comprehensive framework that assesses the quality of sampled molecules along multiple dimensions. Empirical studies show that our model generates molecules with more realistic 3D structures and better affinities towards protein targets, and improves binding affinity ranking and prediction without retraining.

1. INTRODUCTION

Rational drug design against a known protein binding pocket is an efficient and economical approach for finding lead molecules (Anderson, 2003; Batool et al., 2019) and has attracted growing attention from the research community. However, it remains challenging and computationally intensive due to the large synthetically feasible chemical space (Ragoza et al., 2022) and the high degrees of freedom of binding poses (Hawkins, 2017). Previously prevailing molecular generative models are based on either molecular string representations (Bjerrum and Threlfall, 2017; Kusner et al., 2017; Segler et al., 2018) or graph representations (Li et al., 2018; Liu et al., 2018; Jin et al., 2018; Shi et al., 2020), but neither representation takes 3D spatial interactions into account, so both are ill-suited for target-aware molecule generation. With recent developments in structural biology and protein structure prediction (Jumper et al., 2021), more structural data have become available (Francoeur et al., 2020), unlocking new opportunities for machine learning algorithms to design drugs directly inside a 3D binding complex (Gebauer et al., 2019; Simm et al., 2020a;b). Recently, a new generation of generative models has been proposed specifically for the target-aware molecule generation task (Luo et al., 2021; Ragoza et al., 2022; Tan et al., 2022; Liu et al., 2022; Peng et al., 2022). However, existing approaches suffer from several drawbacks. For instance, Tan et al. (2022) do not explicitly model the interactions between molecule and protein atoms in 3D space, but only treat the target as an intermediate conditional embedding. Among those that do consider atom interactions in 3D space, Ragoza et al. (2022) represent the 3D space as voxelized grids and model proteins and molecules with 3D Convolutional Neural Networks (CNNs). However, this model is not rotationally equivariant and cannot fully capture 3D inductive biases.
In addition, the voxelization operation leads to poor scalability since the number of voxels grows cubically with the pocket size. More advanced approaches achieve SE(3)-equivariance through different modeling techniques (Luo et al., 2021; Liu et al., 2022; Peng et al., 2022). However, these methods adopt autoregressive sampling, where atoms are generated one by one based on the learned probability density of atom types and coordinates. This scheme suffers from several limitations. First, the mismatch between training and sampling incurs exposure bias. Second, the model assigns an unnatural generation order during sampling and cannot consider the probability of the entire 3D structure. For instance, it would be easy for the model to correctly place the n-th atom of a benzene ring if the first n-1 carbon atoms had already been placed in the same plane; however, it is difficult for the model to place the first several atoms accurately with limited context information, which yields unrealistic fragments as a consequence. Moreover, the sampling scheme does not scale well when large binding molecules must be generated. Finally, current autoregressive models cannot estimate the quality of generated molecules; one has to rely on external tools based on physical-chemical energy functions such as AutoDock (Trott and Olson, 2010) to select drug candidates. To address these problems, we propose TargetDiff, a 3D full-atom diffusion model that generates target-aware molecules in a non-autoregressive fashion. Thanks to recent progress in probabilistic diffusion models (Ho et al., 2020; Hoogeboom et al., 2021) and equivariant neural networks (Fuchs et al., 2020; Satorras et al., 2021b), our model generates molecules in continuous 3D space based on the context provided by protein atoms, and its likelihood is invariant to global translation and rotation of the binding complex.
Specifically, we represent protein binding pockets and small molecules as atom point sets in 3D space, where each atom is associated with a 3D Cartesian coordinate. We define a diffusion process for both continuous atom coordinates and discrete atom types in which noise is gradually added, and learn the joint generative process with an SE(3)-equivariant graph neural network that alternately updates the atom hidden embeddings and atom coordinates of molecules. Under certain parameterization, we can extract representative features from the model with a single forward pass over the input molecules, without retraining. We find these features provide strong signals for estimating the binding affinity between the sampled molecule and the target protein, which can then be used to rank drug candidates and to improve supervised learning frameworks for binding affinity prediction. An empirical study on the CrossDocked2020 dataset (Francoeur et al., 2020) shows that TargetDiff generates molecules with more realistic 3D structures and better binding energies towards protein binding sites than the baselines. Our main contributions can be summarized as follows:
• An end-to-end framework for generating molecules conditioned on a protein target, which explicitly considers the physical interaction between proteins and molecules in 3D space.
• To the best of our knowledge, the first probabilistic diffusion formulation for target-aware drug design, in which training and sampling are aligned in a non-autoregressive and SE(3)-equivariant fashion thanks to a center-shifting operation and an equivariant GNN.
• Several new evaluation metrics and additional insights that allow us to evaluate generated molecules along many different dimensions; the empirical results demonstrate the superiority of our model over representative baselines.
• An effective way to evaluate the quality of generated molecules based on our framework, where the model can serve as either a scoring function to help ranking or an unsupervised feature extractor to improve binding affinity prediction.

2. RELATED WORK

Molecule Generation with Different Representations Based on different levels of representation, existing molecular generative models can be roughly divided into three categories: string-based, graph-based, and 3D-structure-based. The most common molecular string representation is SMILES (Weininger, 1988), for which many existing language models such as RNNs can be re-purposed for molecule generation (Bjerrum and Threlfall, 2017; Gómez-Bombarelli et al., 2018; Kusner et al., 2017; Segler et al., 2018). However, SMILES is not an optimal choice since it fails to capture molecular similarities and suffers from validity issues during generation (Jin et al., 2018). Thus, many graph-based methods have been proposed to operate directly on molecular graphs (Liu et al., 2018; Shi et al., 2020; Jin et al., 2018; 2020; You et al., 2018; Zhou et al., 2019). However, these methods are very limited in modeling the spatial information of molecules, which is crucial for determining molecular properties and functions. Therefore, recent work (Gebauer et al., 2019; Skalic et al., 2019a; Ragoza et al., 2020; Simm et al., 2020a;b) focuses on generating molecules in 3D space. More recently, flow-based and diffusion-based generative models (Satorras et al., 2021a; Hoogeboom et al., 2022) generate 3D molecules with equivariant GNNs. Despite the progress made in this direction, existing target-aware models still suffer from several issues, including separately encoding the small molecules and protein pockets (Skalic et al., 2019b; Xu et al., 2021; Tan et al., 2022; Ragoza et al., 2022), relying on voxelization and non-equivariant networks (Skalic et al., 2019b; Xu et al., 2021; Ragoza et al., 2022), and autoregressive sampling (Luo et al., 2021; Liu et al., 2022; Peng et al., 2022). Different from all these models, our equivariant model explicitly considers the interaction between proteins and molecules in 3D and performs non-autoregressive sampling, which better aligns the training and sampling procedures.
Diffusion Models Diffusion models (Sohl-Dickstein et al., 2015) are a new family of latent variable generative models. Ho et al. (2020) propose denoising diffusion probabilistic models (DDPM), which establish a connection between diffusion models and denoising score-based models (Song and Ermon, 2019). Diffusion models have shown remarkable success in generating image data (Ho et al., 2020; Nichol and Dhariwal, 2021) and discrete data such as text (Hoogeboom et al., 2021; Austin et al., 2021). Recently, they have also been applied in the molecular domain. For example, GeoDiff (Xu et al., 2022) generates molecular conformations given 2D molecular graphs, and EDM (Hoogeboom et al., 2022) generates 3D molecules. However, their unawareness of potential targets makes them hard for biologists to utilize in real scenarios.

3.1. PROBLEM DEFINITION

A protein binding site is represented as a set of atoms $\mathcal{P} = \{(x_P^{(i)}, v_P^{(i)})\}_{i=1}^{N_P}$, where $N_P$ is the number of protein atoms, $x_P^{(i)} \in \mathbb{R}^3$ represents the 3D coordinates of the atom, and $v_P^{(i)} \in \mathbb{R}^{N_f}$ represents protein atom features such as element types and amino acid types. Our goal is to generate binding molecules $\mathcal{M} = \{(x_L^{(i)}, v_L^{(i)})\}_{i=1}^{N_M}$ conditioned on the protein target. For brevity, we denote a molecule as $\mathcal{M} = [x, v]$, where $[\cdot, \cdot]$ is the concatenation operator and $x \in \mathbb{R}^{M \times 3}$ and $v \in \mathbb{R}^{M \times K}$ denote atom Cartesian coordinates and one-hot atom types, respectively.

3.2. OVERVIEW OF TARGETDIFF

As discussed in Sec. 1 and Sec. 2, we hope to develop a non-autoregressive model to bypass the drawbacks of autoregressive sampling models. In addition, we require the model to represent the protein-ligand complex in continuous 3D space to avoid the voxelization operation. Last but not least, the model also needs to be SE(3)-equivariant to global translation and rotation. We therefore develop TargetDiff, an equivariant non-autoregressive method for target-aware molecule generation based on the DDPM framework (Ho et al., 2020). TargetDiff is a latent variable model of the form $p_\theta(\mathcal{M}_0 | \mathcal{P}) = \int p_\theta(\mathcal{M}_{0:T} | \mathcal{P}) \, d\mathcal{M}_{1:T}$, where $\mathcal{M}_1, \mathcal{M}_2, \dots, \mathcal{M}_T$ is a sequence of latent variables with the same dimensionality as the data $\mathcal{M}_0 \sim p(\mathcal{M}_0 | \mathcal{P})$. As shown in Fig. 1, the approach includes a forward diffusion process and a reverse generative process, both defined as Markov chains. The diffusion process gradually injects noise into the data, and the generative process learns to recover the data distribution from the noise distribution with a network parameterized by $\theta$:
$$q(\mathcal{M}_{1:T} | \mathcal{M}_0, \mathcal{P}) = \prod_{t=1}^{T} q(\mathcal{M}_t | \mathcal{M}_{t-1}, \mathcal{P}), \qquad p_\theta(\mathcal{M}_{0:T-1} | \mathcal{M}_T, \mathcal{P}) = \prod_{t=1}^{T} p_\theta(\mathcal{M}_{t-1} | \mathcal{M}_t, \mathcal{P}) \tag{1}$$
Since our goal is to generate 3D molecules based on a given protein binding site, the model needs to generate both continuous atom coordinates and discrete atom types, while remaining SE(3)-equivariant throughout the generative process. In the following sections, we elaborate on how we construct the diffusion process, parameterize the generative process, and eventually train the model.

3.3. MOLECULAR DIFFUSION PROCESS

Following recent progress in learning continuous distributions (Ho et al., 2020) and discrete distributions (Hoogeboom et al., 2021) with diffusion models, we use a Gaussian distribution $\mathcal{N}$ to model continuous atom coordinates $x$ and a categorical distribution $\mathcal{C}$ to model discrete atom types $v$. The atom types are constructed as a one-hot vector containing information such as element types and membership in an aromatic ring. We formulate the molecular distribution as a product of the atom coordinate distribution and the atom type distribution. At each time step $t$, a small Gaussian noise and a uniform noise across all categories are added to atom coordinates and atom types separately, according to a Markov chain with fixed variance schedules $\beta_1, \dots, \beta_T$:
$$q(\mathcal{M}_t | \mathcal{M}_{t-1}, \mathcal{P}) = \mathcal{N}\big(x_t; \sqrt{1 - \beta_t}\, x_{t-1}, \beta_t I\big) \cdot \mathcal{C}\big(v_t \mid (1 - \beta_t) v_{t-1} + \beta_t / K\big) \tag{2}$$
We note that the schedules for coordinates and types can differ in practice, but we denote them with the same symbol for conciseness. Here, we decompose the joint molecule distribution into the product of two independent distributions over atom coordinates and atom types during diffusion, because the independent distributions have concise mathematical formulations from which we can efficiently draw noisy samples. In the next section, we will see that the dependencies between atom coordinates and atom types are captured by the model in the generative process. Denoting $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$, a desirable property of the diffusion process is that the noisy data distribution $q(\mathcal{M}_t | \mathcal{M}_0)$ of any time step can be computed in closed form:
$$q(x_t | x_0) = \mathcal{N}\big(x_t; \sqrt{\bar{\alpha}_t}\, x_0, (1 - \bar{\alpha}_t) I\big), \qquad q(v_t | v_0) = \mathcal{C}\big(v_t \mid \bar{\alpha}_t v_0 + (1 - \bar{\alpha}_t)/K\big) \tag{3}$$
Using Bayes' theorem, the Gaussian posterior of atom coordinates and the categorical posterior of atom types can both be computed in closed form:
$$q(x_{t-1} | x_t, x_0) = \mathcal{N}\big(x_{t-1}; \tilde{\mu}_t(x_t, x_0), \tilde{\beta}_t I\big), \qquad q(v_{t-1} | v_t, v_0) = \mathcal{C}\big(v_{t-1} \mid \tilde{c}_t(v_t, v_0)\big) \tag{4}$$
where $\tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}} \beta_t}{1 - \bar{\alpha}_t} x_0 + \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} x_t$, $\tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \beta_t$, and $\tilde{c}_t(v_t, v_0) = c^\star / \sum_{k=1}^{K} c^\star_k$ with $c^\star(v_t, v_0) = \big[\alpha_t v_t + (1 - \alpha_t)/K\big] \odot \big[\bar{\alpha}_{t-1} v_0 + (1 - \bar{\alpha}_{t-1})/K\big]$.
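The closed-form distributions in equation 3 make forward sampling cheap: we can jump directly to any timestep $t$ without simulating the chain. A minimal NumPy sketch of this noising step; the linear schedule, molecule size, and $K$ below are illustrative assumptions, not the paper's actual hyperparameters:

```python
import numpy as np

def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear variance schedule; alpha_bar_t = prod_{s<=t} (1 - beta_s)."""
    betas = np.linspace(beta_start, beta_end, T)
    alpha_bars = np.cumprod(1.0 - betas)
    return betas, alpha_bars

def diffuse(x0, v0, t, alpha_bars, K, rng):
    """Sample q(x_t | x_0) and q(v_t | v_0) in closed form (equation 3).

    x0: (N, 3) atom coordinates; v0: (N, K) one-hot atom types.
    """
    a = alpha_bars[t]
    # Continuous coordinates: Gaussian with mean sqrt(a) * x0 and variance (1 - a).
    xt = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * rng.standard_normal(x0.shape)
    # Discrete types: interpolate toward the uniform distribution, then sample.
    probs = a * v0 + (1.0 - a) / K
    probs = probs / probs.sum(axis=1, keepdims=True)
    vt = np.eye(K)[[rng.choice(K, p=p) for p in probs]]
    return xt, vt

rng = np.random.default_rng(0)
betas, alpha_bars = make_schedule()
x0 = rng.standard_normal((5, 3))                 # toy molecule: 5 atoms
v0 = np.eye(4)[rng.integers(0, 4, size=5)]       # K = 4 atom types
xt, vt = diffuse(x0, v0, t=500, alpha_bars=alpha_bars, K=4, rng=rng)
```

At large $t$, `xt` approaches a standard Gaussian and the type distribution approaches uniform, matching the prior the generative process starts from.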

3.4. PARAMETERIZATION OF EQUIVARIANT MOLECULAR GENERATIVE PROCESS

The generative process, in reverse, recovers the ground-truth molecule $\mathcal{M}_0$ from the initial noise $\mathcal{M}_T$, and we approximate the reverse distribution with a neural network parameterized by $\theta$:
$$p_\theta(\mathcal{M}_{t-1} | \mathcal{M}_t, \mathcal{P}) = \mathcal{N}\big(x_{t-1}; \mu_\theta([x_t, v_t], t, \mathcal{P}), \sigma_t^2 I\big) \cdot \mathcal{C}\big(v_{t-1} \mid c_\theta([x_t, v_t], t, \mathcal{P})\big) \tag{5}$$
One desired property of the generative process is that the likelihood $p_\theta(\mathcal{M}_0 | \mathcal{P})$ should be invariant to translation and rotation of the protein-ligand complex, a critical inductive bias for generating 3D objects such as molecules (Köhler et al., 2020; Satorras et al., 2021a; Xu et al., 2022; Hoogeboom et al., 2022). A key fact here is that an invariant distribution composed with an equivariant transition function results in an invariant distribution. Leveraging this fact, we have the following proposition in the setting of target-aware molecule generation.
Proposition 1. Denoting an SE(3)-transformation as $T_g$, we can achieve a likelihood invariant w.r.t. $T_g$ on the protein-ligand complex, i.e., $p_\theta(T_g(\mathcal{M}_0) | T_g(\mathcal{P})) = p_\theta(\mathcal{M}_0 | \mathcal{P})$, if we shift the Center of Mass (CoM) of the protein atoms to zero and parameterize the Markov transition $p_\theta(x_{t-1} | x_t, x_P)$ with an SE(3)-equivariant network.
With a slight abuse of notation, in the following we use $x_t$ ($t = 1, \dots, T$) to denote ligand atom coordinates and $x_P$ to denote protein atom coordinates. We analyze the operation of shifting the CoM in Appendix B and prove the invariance of the likelihood in Appendix C. There are different ways to parameterize $\mu_\theta([x_t, v_t], t, \mathcal{P})$ and $c_\theta([x_t, v_t], t, \mathcal{P})$. Here, we choose to let the neural network predict $[\hat{x}_0, \hat{v}_0]$ and feed it through equation 4 to obtain $\mu_\theta$ and $c_\theta$, which define the posterior distributions.
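Plugging the network's prediction $\hat{x}_0$ into the posterior of equation 4 gives one reverse sampling step for coordinates. A minimal sketch under stated assumptions (a linear $\beta$ schedule, and treating the predicted $\hat{x}_0$ as given; the function name is illustrative):

```python
import numpy as np

def reverse_coord_step(xt, x0_pred, t, betas, alpha_bars, rng):
    """One reverse step for atom coordinates: plug the predicted x0 into
    the closed-form posterior q(x_{t-1} | x_t, x_0) of equation 4 and sample."""
    beta_t = betas[t]
    abar_t = alpha_bars[t]
    abar_prev = alpha_bars[t - 1] if t > 0 else 1.0
    # Posterior mean mu_tilde_t(x_t, x0_pred) from equation 4.
    mu = (np.sqrt(abar_prev) * beta_t / (1 - abar_t)) * x0_pred \
       + (np.sqrt(1 - beta_t) * (1 - abar_prev) / (1 - abar_t)) * xt
    # Posterior variance beta_tilde_t.
    var = (1 - abar_prev) / (1 - abar_t) * beta_t
    return mu + np.sqrt(var) * rng.standard_normal(xt.shape)

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bars = np.cumprod(1.0 - betas)
xt = rng.standard_normal((5, 3))
x_prev = reverse_coord_step(xt, np.zeros((5, 3)), 500, betas, alpha_bars, rng)
```

A useful sanity check on the coefficients: at $t = 0$ the posterior collapses onto the prediction, since $\bar{\alpha}_{t-1} = 1$ makes the variance zero and the mean exactly $\hat{x}_0$.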
Inspired by recent progress in equivariant neural networks (Thomas et al., 2018; Fuchs et al., 2020; Satorras et al., 2021b; Guan et al., 2022), we model the interaction between ligand atoms and protein atoms with an SE(3)-equivariant GNN:
$$[\hat{x}_0, \hat{v}_0] = \phi_\theta(\mathcal{M}_t, t, \mathcal{P}) = \phi_\theta([x_t, v_t], t, \mathcal{P}) \tag{6}$$
At the $l$-th layer, the atom hidden embeddings $h$ and coordinates $x$ are updated alternately as follows:
$$h_i^{l+1} = h_i^l + \sum_{j \in \mathcal{V}, j \neq i} f_h\big(d_{ij}^l, h_i^l, h_j^l, e_{ij}; \theta_h\big), \qquad x_i^{l+1} = x_i^l + \sum_{j \in \mathcal{V}, j \neq i} \big(x_i^l - x_j^l\big) f_x\big(d_{ij}^l, h_i^{l+1}, h_j^{l+1}, e_{ij}; \theta_x\big) \cdot \mathbb{1}_{\text{mol}} \tag{7}$$
where $d_{ij} = \lVert x_i - x_j \rVert$ is the Euclidean distance between atoms $i$ and $j$, and $e_{ij}$ is an additional feature indicating whether the connection is between two protein atoms, two ligand atoms, or a protein atom and a ligand atom. $\mathbb{1}_{\text{mol}}$ is the ligand molecule mask, since we do not want to update protein atom coordinates. The initial atom hidden embedding $h^0$ is obtained from an embedding layer that encodes the atom information. The final atom hidden embedding $h^L$ is fed into a multi-layer perceptron and a softmax function to obtain $\hat{v}_0$. Since $\hat{x}_0$ is rotation-equivariant to $x_t$, and it is easy to see from equation 4 that $x_{t-1}$ is rotation-equivariant to $\hat{x}_0$, we achieve the desired equivariance for the Markov transition. The complete proof can be found in Appendix A.
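The alternating update of equation 7 can be sketched numerically. Below is a simplified NumPy toy layer, not the paper's actual implementation (which uses graph attention with edge-type features): messages depend only on SE(3)-invariant quantities (hidden features and pairwise distances), and coordinates move along relative vectors, so rotating the input rotates the output while protein atoms stay fixed under the mask.

```python
import numpy as np

def egnn_layer(h, x, mol_mask, params):
    """One update in the spirit of equation 7 (simplified sketch: fully
    connected graph, single-layer tanh messages, no edge types)."""
    Wh, Wx = params                      # weights of shape (2F+1, F) and (2F+1, 1)
    n, F = h.shape
    rel = x[:, None, :] - x[None, :, :]              # (n, n, 3) relative vectors
    d = np.linalg.norm(rel, axis=-1, keepdims=True)  # (n, n, 1) SE(3)-invariant distances
    pair = np.concatenate([np.broadcast_to(h[:, None], (n, n, F)),
                           np.broadcast_to(h[None, :], (n, n, F)), d], axis=-1)
    off_diag = 1.0 - np.eye(n)[:, :, None]           # exclude i == j terms
    m_h = np.tanh(pair @ Wh) * off_diag              # feature messages  f_h
    m_x = np.tanh(pair @ Wx) * off_diag              # scalar weights    f_x
    h_new = h + m_h.sum(axis=1)
    # 1_mol mask: only ligand atoms move, protein atoms stay fixed.
    x_new = x + (rel * m_x).sum(axis=1) * mol_mask[:, None]
    return h_new, x_new

rng = np.random.default_rng(0)
F, n = 8, 6
params = (0.1 * rng.standard_normal((2 * F + 1, F)),
          0.1 * rng.standard_normal((2 * F + 1, 1)))
h, x = rng.standard_normal((n, F)), rng.standard_normal((n, 3))
mask = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])      # first 3 atoms are the ligand
```

Because the only geometric inputs are distances and relative vectors, applying a rotation `Q` to `x` leaves `h_new` unchanged and rotates `x_new` by the same `Q`, which is exactly the equivariance claimed for equation 7.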

3.5. TRAINING OBJECTIVE

The combination of $q$ and $p$ is a variational auto-encoder (Kingma and Welling, 2013), so the model can be trained by optimizing the variational bound on the negative log-likelihood. For the atom coordinate loss, since $q(x_{t-1} | x_t, x_0)$ and $p_\theta(x_{t-1} | x_t)$ are both Gaussian distributions, the KL-divergence can be written in closed form:
$$L_{t-1}^{(x)} = \frac{1}{2\sigma_t^2} \big\lVert \tilde{\mu}_t(x_t, x_0) - \mu_\theta([x_t, v_t], t, \mathcal{P}) \big\rVert^2 + C = \gamma_t \lVert x_0 - \hat{x}_0 \rVert^2 + C \tag{8}$$
where $\gamma_t = \frac{\bar{\alpha}_{t-1} \beta_t^2}{2 \sigma_t^2 (1 - \bar{\alpha}_t)^2}$ and $C$ is a constant. In practice, training the model with an unweighted MSE loss (setting $\gamma_t = 1$) can achieve better performance, as Ho et al. (2020) suggested. For the atom type loss, we can directly compute the KL-divergence of the categorical distributions:
$$L_{t-1}^{(v)} = \sum_k \tilde{c}(v_t, v_0)_k \log \frac{\tilde{c}(v_t, v_0)_k}{\tilde{c}(v_t, \hat{v}_0)_k} \tag{9}$$
The final loss is a weighted sum of the atom coordinate loss and the atom type loss: $L = L_{t-1}^{(x)} + \lambda L_{t-1}^{(v)}$. We summarize the overall training and sampling procedures of TargetDiff in Appendix E.
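Both loss terms can be sketched directly from equations 8 and 9. A minimal NumPy sketch under our assumptions (unweighted MSE for the coordinate term, and the categorical posterior $\tilde{c}$ from equation 4 for the KL term); function names are illustrative:

```python
import numpy as np

def coord_loss(x0, x0_pred):
    """Unweighted surrogate of L^(x)_{t-1}: gamma_t set to 1, as in Ho et al. (2020)."""
    return float(((x0 - x0_pred) ** 2).sum(axis=-1).mean())

def type_kl(vt, v0, v0_pred, alpha_t, abar_prev, K):
    """KL( c~(v_t, v_0) || c~(v_t, v0_pred) ) between categorical posteriors (equation 9)."""
    def posterior(v):
        # c*(v_t, v) = [alpha_t v_t + (1-alpha_t)/K] * [abar_{t-1} v + (1-abar_{t-1})/K]
        c = (alpha_t * vt + (1 - alpha_t) / K) * (abar_prev * v + (1 - abar_prev) / K)
        return c / c.sum(axis=-1, keepdims=True)
    q = np.clip(posterior(v0), 1e-12, 1.0)
    p = np.clip(posterior(v0_pred), 1e-12, 1.0)
    return float((q * np.log(q / p)).sum(axis=-1).mean())
```

Note that when the prediction matches the ground truth ($\hat{v}_0 = v_0$) the KL term vanishes, and it is non-negative otherwise, as expected of a KL-divergence.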

3.6. AFFINITY RANKING AND PREDICTION AS UNSUPERVISED LEARNER

Generative models are unsupervised learners. However, in the area of target-aware molecule generation, no prior work has established a connection between the generative model and binding affinity, an important indicator for evaluating generated molecules. Existing generative models cannot (accurately) estimate the quality of generated molecules. Models relying on autoregressive sampling, in particular, have to assign an unnatural order when performing likelihood estimation (if possible at all) and cannot capture the global context as a whole. We first establish the connection between unsupervised generative models and binding affinity ranking / prediction. Under our parameterization, the network predicts the denoised $[\hat{x}_0, \hat{v}_0]$. Given the protein-ligand complex, we can feed $\phi_\theta$ with $[x_0, v_0]$ while freezing the $x$-update branch (i.e., only the atom hidden embedding $h$ is updated), and finally obtain $h^L$ and $\hat{v}_0$:
$$h_i^{l+1} = h_i^l + \sum_{j \in \mathcal{V}, j \neq i} f_h\big(d_{ij}^l, h_i^l, h_j^l, e_{ij}; \theta_h\big), \quad l = 0, \dots, L-1; \qquad \hat{v}_0 = \mathrm{softmax}\big(\mathrm{MLP}(h^L)\big) \tag{10}$$
Our assumption is that if the ligand molecule has a good binding affinity to the protein, the flexibility of its atom types should be low, which is reflected in the entropy of $\hat{v}_0$. This entropy can therefore be used as a scoring function to help ranking, whose effectiveness is justified in the experiments. In addition, $h^L$ also encodes useful global information; we found that binding affinity ranking performance can be greatly improved by utilizing this feature with a simple linear transformation.
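The entropy score itself is a one-liner. A sketch, assuming `v0_pred` holds the per-atom softmax outputs $\hat{v}_0$ as a NumPy array; lower scores indicate the model is more certain about atom identities, which we use as an unsupervised proxy correlated with binding affinity:

```python
import numpy as np

def type_entropy_score(v0_pred):
    """Average per-atom entropy of the denoised type distribution v0_hat.

    v0_pred: (N, K) rows of per-atom probabilities over K atom types.
    """
    p = np.clip(v0_pred, 1e-12, 1.0)                 # avoid log(0)
    ent = -(p * np.log(p)).sum(axis=-1)              # entropy per atom
    return float(ent.mean())
```

The score ranges from 0 (one-hot, fully confident) to log K (uniform, maximally uncertain), so molecules can be ranked by it directly.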

4.1. SETUP

Data We use CrossDocked2020 (Francoeur et al., 2020) to train and evaluate TargetDiff. Similar to Luo et al. (2021), we refined the 22.5 million docked protein-ligand complexes by keeping only poses with a low RMSD (< 1 Å) and using a 30% sequence identity threshold for the data split. In the end, we have 100,000 complexes for training and 100 novel complexes as references for testing.
Baselines For benchmarking, we compare with various baselines: liGAN (Ragoza et al., 2022), AR (Luo et al., 2021), Pocket2Mol (Peng et al., 2022), and GraphBP (Liu et al., 2022). liGAN is a 3D CNN-based method that generates 3D voxelized molecular images following a conditional VAE scheme. AR, Pocket2Mol, and GraphBP are all GNN-based methods that generate 3D molecules by sequentially placing atoms into a protein binding pocket. We choose AR and Pocket2Mol as representative baselines with an autoregressive sampling scheme because of their good empirical performance. All baselines are considered in Table 3 for a comprehensive comparison.
TargetDiff Our model contains 9 equivariant layers as described in equation 7, where $f_h$ and $f_x$ are implemented as graph attention layers with 16 attention heads and 128 hidden features. We first decide the number of atoms to sample by drawing from a prior distribution estimated from training complexes with similar binding pocket sizes. After the model finishes the generative process, we use OpenBabel (O'Boyle et al., 2011) to construct the molecule from the individual atom coordinates, as done in AR and liGAN. Please see Appendix F for the full details.

4.2. TARGET-AWARE MOLECULE GENERATION

We propose a comprehensive evaluation framework for target-aware molecule generation to assess our model and the baselines from the following perspectives: molecular structures, target binding affinity, and molecular properties.
Molecular Structures First, we plot the empirical distributions of all-atom distances and carbon-carbon bond distances in Figure 2, and compare them against the same empirical distributions for reference molecules. For overall atom distances, TargetDiff captures the distribution very well, while AR and Pocket2Mol over-represent small atom distances. Due to its limited voxelized resolution, liGAN can only capture the overall shape but not the specific modes. Similarly, different carbon-carbon bonds form two representative distance modes in the reference molecular structures. While we can still see the two modes in TargetDiff-generated structures, only a single mode is observed in structures generated by liGAN, AR, and Pocket2Mol. In Table 1, we further evaluate how well the different generated molecular structures capture the empirical distributions of bond distances in reference molecules, measured by the Jensen-Shannon divergence (JSD) (Lin, 1991). We find that TargetDiff outperforms the other methods by a clear margin across all major bond types. Second, we investigate whether TargetDiff can generate rigid sub-structures / fragments in a consistent fashion (e.g., all carbons in a benzene ring lying in the same plane). To measure such consistency, we optimize the generated structure with the Merck Molecular Force Field (MMFF) (Halgren, 1996) and calculate the RMSD between pre- and post-MMFF-optimized coordinates for rigid fragments that do not contain any rotatable bonds. As shown in Figure 3, TargetDiff generates more consistent rigid fragments. In a further analysis, we discover that liGAN and AR tend to generate a large number of 3- and 4-membered rings (Table 2).
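The JSD metric used in Table 1 can be reproduced by histogramming bond distances on shared bins and comparing the normalized histograms. A sketch; the 0–3 Å range and bin width below are illustrative choices, not necessarily those used in the paper:

```python
import numpy as np

def jsd(p, q):
    """Jensen-Shannon divergence between two discrete distributions (natural log)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0                       # 0 * log(0) = 0 by convention
        return float((a[mask] * np.log(a[mask] / b[mask])).sum())
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def distance_jsd(ref_d, gen_d, bins=np.linspace(0.0, 3.0, 61)):
    """Histogram two sets of bond distances (in Angstrom) on shared bins and compare."""
    p, _ = np.histogram(ref_d, bins=bins)
    q, _ = np.histogram(gen_d, bins=bins)
    return jsd(p + 1e-9, q + 1e-9)         # smoothing avoids empty-bin issues
```

With the natural log, JSD is symmetric and bounded by ln 2, so scores from different bond types are directly comparable.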
While TargetDiff shows a larger proportion of 7-membered rings, we believe this reflects a limitation of the reconstruction algorithm, and replacing such post-hoc operations with bond generation could be an interesting future direction. These results suggest that TargetDiff produces more realistic molecular structures than existing baselines throughout the process, and therefore learns a more accurate 3D representation that leads to better protein binders.
Target Binding Affinity Figure 4 shows the median Vina energy (computed by AutoDock Vina (Eberhardt et al., 2021)) of all generated molecules for each binding pocket. Based on the Vina energy, molecules generated by TargetDiff show the best binding affinity on 57% of the targets, while those from liGAN, AR, and Pocket2Mol are best on only 4%, 13%, and 26% of the targets, respectively. In terms of high-affinity binders, we find that on average 58.1% of the TargetDiff molecules show better binding affinity than the reference molecule, which is clearly better than the other baselines (see Table 3). We further compute Vina Score and Vina Min in Table 3, where the Vina score function is directly computed or locally optimized without re-docking; these metrics directly reflect the quality of the model-generated 3D molecules, and our model similarly outperforms all other baselines. To better understand the differences between generated molecules, we sample a generated molecule from each model for two pockets where TargetDiff outperforms AR. As shown in Figure 5a, while TargetDiff can generate molecules occupying the entire pocket, AR is only able to generate a molecule that covers part of the space, potentially losing its specificity for the desired target and causing off-target effects. Consider the AR-generated molecule for 4QLK A: despite having a similar number of atoms as the TargetDiff molecule (27 vs. 29), the frontier network in AR keeps placing atoms deep inside the pocket instead of considering the global structure and covering the entire binding pocket, which results in poor binding affinity. To further quantify this effect, we measure the distance between the center of mass (CoM) of reference molecules and the CoM of generated molecules. As shown in Figure 5b, the sequential generation of AR results in a larger shift in CoM (1.79 Å vs. 1.45 Å) and yields sub-optimal binding poses with poorer binding affinity.
Molecular Properties Besides binding affinity, we further investigate other molecular properties of the generated molecules, including drug-likeness QED (Bickerton et al., 2012), synthesizability SA (Ertl and Schuffenhauer, 2009; You et al., 2018), and diversity, computed as the average pairwise Tanimoto distance (Bajusz et al., 2015; Tanimoto, 1958). As shown in Table 3, TargetDiff generates more high-affinity binders than liGAN, AR, and GraphBP while maintaining similar 2D metrics. The metrics on which TargetDiff does fall behind Pocket2Mol are the QED and SA scores. However, we put less emphasis on them because, in the context of drug discovery, QED and SA are used as rough filters and are acceptable as long as they fall within a reasonable range; they are therefore not the metrics we want to optimize against. We believe future work on bond prediction and fragment-based (instead of atom-based) generation could lead to improvement.

4.3. BINDING AFFINITY RANKING AND PREDICTION

To show that our model can serve as an unsupervised learner to improve binding affinity ranking and prediction, we first check Spearman's rank correlation on CrossDocked2020. The AutoDock Vina score (i.e., vina) and the negative log-transformed experimentally measured binding affinity pK are provided with the dataset. As shown in Figure 6, we found: (1) The entropy of the denoised atom types $\hat{v}_0$ (i.e., v ent) has a reasonable correlation with pK, indicating that unsupervised learning can provide a certain degree of information for binding affinity ranking. (2) The entropy score provides complementary information to traditional chemical / physics-based scoring functions like Vina, since their combination (i.e., combined) achieves a better correlation. (3) When labeled data are provided, the final hidden embedding $h^L$ (i.e., hidden emb) with a simple linear transformation improves the correlation to a large extent. We further demonstrate that our unsupervised features can improve supervised affinity prediction on the PDBBind v2020 dataset (Liu et al., 2015). We adopt the more difficult time split of Stärk et al. (2022), in which the test set consists of structures deposited after 2019 and the training and validation sets consist of earlier structures. We augment EGNN (Satorras et al., 2021b) with the unsupervised features $h^L$ provided by our model, and compare it with two state-of-the-art sequence-based models, TransCPI (Chen et al., 2020) and MONN (Li et al., 2020), one complex-based model, IGN (Jiang et al., 2021), two structure-based models, HOLOPROT (Somnath et al., 2021) and STAMP-DPI (Wang et al., 2022), and the base EGNN model. As shown in Table 4, the augmented EGNN clearly improves over the vanilla EGNN and achieves the best results among the baselines.
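For reference, Spearman's rank correlation used above is simply the Pearson correlation of the ranks. A minimal sketch that ignores tie handling (a full implementation such as `scipy.stats.spearmanr` averages tied ranks):

```python
import numpy as np

def spearman(a, b):
    """Spearman rank correlation of two score lists (no tie handling)."""
    def ranks(x):
        order = np.argsort(x)
        r = np.empty(len(x))
        r[order] = np.arange(len(x))     # rank of each element
        return r
    ra, rb = ranks(np.asarray(a, float)), ranks(np.asarray(b, float))
    ra, rb = ra - ra.mean(), rb - rb.mean()
    return float((ra * rb).sum() / np.sqrt((ra ** 2).sum() * (rb ** 2).sum()))
```

Because only ranks matter, any monotone transform of a score (e.g. cubing) leaves the correlation at exactly 1, which is why rank correlation is the right tool for comparing scoring functions like v ent and Vina.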

5. CONCLUSION

This paper proposes TargetDiff, a 3D equivariant diffusion model for target-aware molecule generation and enhanced binding affinity prediction. In terms of future work, it would be interesting to incorporate bond generation into the diffusion process so that the bond inference algorithm can be skipped. Beyond bond inference, another promising direction is to incorporate techniques from fragment-based molecule generation (Podda et al., 2020) and generate molecules with common, more synthesizable molecular sub-structures.

Reproducibility Statements

The model implementation, experimental data and model checkpoints can be found here: https://github.com/guanjq/targetdiff 

A PROOF OF SE(3)-EQUIVARIANCE OF GENERATIVE MARKOV TRANSITION

One crucial property the model needs to satisfy is that rotating or translating the protein-ligand complex $(\mathcal{M}, \mathcal{P})$ does not change the estimated likelihood $p_\theta(\mathcal{M} | \mathcal{P})$. Leveraging the conclusion from Köhler et al. (2020) and Xu et al. (2022), this requires the initial density of our generative process $p(\mathcal{M}_T | \mathcal{P})$ to be SE(3)-invariant and the Markov transition $p_\theta(\mathcal{M}_{t-1} | \mathcal{M}_t, \mathcal{P})$ to be SE(3)-equivariant. Since atom types are always invariant to SE(3)-transformations during the generative process, we only need to consider the atom coordinates. More concretely, the model needs to satisfy:
$$p(x_T | x_P) = p(T_g(x_T) | T_g(x_P)), \qquad p(x_{t-1} | x_t, x_P) = p(T_g(x_{t-1}) | T_g(x_t), T_g(x_P))$$
where $T_g$ is an SE(3)-transformation, and $x_t$ and $x_P$ denote the atom coordinates of the ligand molecule and the protein, respectively. $T_g(x)$ can be written explicitly as $T_g(x) = Rx + b$, where $R \in \mathbb{R}^{3 \times 3}$ is a rotation matrix and $b \in \mathbb{R}^3$ is a translation vector. In Sec. 3.4, we provide a way to implement the SE(3)-equivariance of the generative Markov transition; in this section, we prove the SE(3)-equivariance of our design. Recall equation 7, where the atom hidden embeddings $h$ and coordinates $x$ are updated alternately:
$$h_i^{l+1} = h_i^l + \sum_{j \in \mathcal{V}, j \neq i} f_h\big(d_{ij}^l, h_i^l, h_j^l, e_{ij}; \theta_h\big), \qquad x_i^{l+1} = x_i^l + \sum_{j \in \mathcal{V}, j \neq i} \big(x_i^l - x_j^l\big) f_x\big(d_{ij}^l, h_i^{l+1}, h_j^{l+1}, e_{ij}; \theta_x\big) \cdot \mathbb{1}_{\text{mol}}$$
First, it is easy to see that $d_{ij}$ does not change under the 3D roto-translation $T_g$:
$$\tilde{d}_{ij}^2 = \lVert T_g(x_i) - T_g(x_j) \rVert^2 = \lVert (R x_i + b) - (R x_j + b) \rVert^2 = \lVert R(x_i - x_j) \rVert^2 = (x_i - x_j)^\top R^\top R (x_i - x_j) = \lVert x_i - x_j \rVert^2 = d_{ij}^2$$
Since $h_i$, $h_j$, and $e_{ij}$ are initially obtained from atom and edge features, which are invariant to SE(3)-transformations, $h_i^l$ is SE(3)-invariant for every $l = 1, \dots, L$.
Then, we can prove that x updated from equation 7 is SE(3)-equivariant as follows: ϕ θ (T (x l )) = T (x l i ) + j∈V,i̸ =j (T (x l i ) -T (x l j ))f x (d l ij , h l+1 i , h l+1 j , e ij ; θ x ) • 1 mol = Rx l i + b + j∈V,i̸ =j R(x l i -x l j )f x (d l ij , h l+1 i , h l+1 j , e ij ; θ x ) • 1 mol = R   x l i + j∈V,i̸ =j R(x l i -x l j )f x (d l ij , h l+1 i , h l+1 j , e ij ; θ x ) • 1 mol   + b = Rx l+1 i + b = T (ϕ θ (x l )) Under our parameterization, the neural network predicts [x 0 , v0 ]. By stacking L such equivariant layers together, we can draw the conclusion that the output of neural network x0 is SE(3)equivariant w.r.t the input x t . Finally, we can obtain the mean of posterior xt-1 from equation 4: xt-1 = √ ᾱt-1βt 1-ᾱt x0 + √ αt(1-ᾱt-1) 1-ᾱt x t . The last thing the model needs to satisfy is that xt-1 is SE(3)-equivariant w.r.t x t . However, we can see the translation vector will be changed under this formula: µ θ (T (x t ), t) = √ ᾱt-1 β t 1 -ᾱt T (x 0 ) + √ α t (1 -ᾱt-1 ) 1 -ᾱt T (x t ) = √ ᾱt-1 β t 1 -ᾱt R(x 0 ) + √ α t (1 -ᾱt-1 ) 1 -ᾱt R(x t ) + b (12) where b = √ ᾱt-1βt 1-ᾱt + √ αt(1-ᾱt-1) 1-ᾱt b As the Sec. 3.4 discussed, we can move CoM of the protein atoms to zero once to achieve translation invariance in the whole generative process, which is same to how we achieve the invariant initial density. Thus, we only need to consider rotation equivariance in the Markov transition, which is straightforward to see that it can be achieved from equation 12 when b is ignored: µ θ (R(x t ), t) = R(µ θ (x t , t)).
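The rotation-translation equivariance proved above can also be checked numerically. Below is a minimal NumPy sketch of a simplified coordinate update (fully connected graph, a toy invariant message `f_x`, no edge features or hidden-state update); it is an illustration of the property, not the authors' implementation:

```python
import numpy as np

def egnn_coord_update(x, h, f_x):
    """One coordinate update in the style of Eq. 7: x_i <- x_i + sum_j (x_i - x_j) * f_x(d_ij, h_i, h_j)."""
    n = x.shape[0]
    x_new = x.copy()
    for i in range(n):
        for j in range(n):
            if i != j:
                d = np.linalg.norm(x[i] - x[j])        # invariant pairwise distance
                x_new[i] = x_new[i] + (x[i] - x[j]) * f_x(d, h[i], h[j])
    return x_new

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))                            # toy ligand coordinates
h = rng.normal(size=(5, 4))                            # toy invariant features
f_x = lambda d, hi, hj: float(np.tanh(d + hi @ hj))    # toy SE(3)-invariant message

R, _ = np.linalg.qr(rng.normal(size=(3, 3)))           # random orthogonal matrix
if np.linalg.det(R) < 0:
    R[:, 0] *= -1                                      # make it a proper rotation
b = rng.normal(size=3)                                 # random translation

out_transformed = egnn_coord_update(x @ R.T + b, h, f_x)       # transform, then update
out_then_transform = egnn_coord_update(x, h, f_x) @ R.T + b    # update, then transform
assert np.allclose(out_transformed, out_then_transform)        # equivariance holds
```

Because the message depends only on invariant quantities and the update direction $(\mathbf{x}_i - \mathbf{x}_j)$ rotates with the input while the translation cancels in the difference, the two orders of operations agree.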

B ANALYSIS OF INVARIANT INITIAL DENSITY

We assume that when the timestep $T$ of the diffusion process is sufficiently large, $q(\mathbf{x}_T|\mathbf{x}_P)$ is a Gaussian distribution whose mean is the center of the protein and whose standard deviation is one, i.e. $q(\mathbf{x}_T|\mathbf{x}_P) \sim \mathcal{N}(C_P \otimes \mathbf{1}_{N_P}, I_{3 N_P})$, where $C_P = \frac{1}{N_P}\sum \mathbf{x}_P$, $\otimes$ denotes the Kronecker product, $I_k$ denotes the $k \times k$ identity matrix, and $\mathbf{1}_k$ denotes the $k$-dimensional all-ones vector. To achieve an SE(3)-invariant initial density, we move the center of the protein to zero, i.e. $C_P = 0$. One could also define the initial density on other CoM-free systems, such as the ligand or complex CoM-free system. We choose the protein CoM-free system because only one center-shifting operation is needed at the beginning of the generative or diffusion process (the protein context is a fixed input). Formally, the shift can be viewed as a linear transformation $\hat{\mathbf{x}}_P = Q\mathbf{x}_P$, where $Q = I_3 \otimes (I_N - \frac{1}{N}\mathbf{1}_N\mathbf{1}_N^\top)$. This simplifies the formulas in several ways: in the diffusion process, $q(\mathbf{x}_T|\mathbf{x}_P)$ becomes a standard Gaussian when $T$ is sufficiently large; accordingly, in the generative process, we can sample $\mathbf{x}_T$ from $p(\mathbf{x}_T|\mathbf{x}_P)$, which is also a standard Gaussian. To evaluate a complex position $(\mathbf{x}_T, \mathbf{x}_P)$, we first translate the complex so that the protein CoM is zero, which is again a linear transformation $(\hat{\mathbf{x}}_T, \hat{\mathbf{x}}_P) = \tilde{Q}(\mathbf{x}_T, \mathbf{x}_P)$, where
\[ \tilde{Q} = I_3 \otimes \begin{pmatrix} I_M & -\frac{1}{N}\mathbf{1}_M\mathbf{1}_N^\top \\ 0 & I_N - \frac{1}{N}\mathbf{1}_N\mathbf{1}_N^\top \end{pmatrix}. \]
Then we can evaluate the density $p(\mathbf{x}_T|\mathbf{x}_P)$ with the standard normal distribution. We denote by $\tilde{p}$ the density function on this protein zero-CoM subspace: $\tilde{p}(\mathbf{x}_T|\mathbf{x}_P) = p(\tilde{Q}(\mathbf{x}_T, \mathbf{x}_P))$. For any rigid transformation $T_g(\mathbf{x}) = R\mathbf{x} + \mathbf{b}$, we have $\tilde{Q} \circ T_g(\mathbf{x}_T, \mathbf{x}_P) = \tilde{Q} \circ R(\mathbf{x}_T, \mathbf{x}_P)$, since the projection removes the translation. Moreover, the rotation acts on the 3-dimensional factor while $\tilde{Q}$ acts on the atom-index factor, so the two operators commute, $\tilde{Q} \circ R(\mathbf{x}) = R \circ \tilde{Q}(\mathbf{x})$; and since $R$ is orthogonal, $\|\tilde{Q} \circ R\mathbf{x}\|^2 = \|\tilde{Q}\mathbf{x}\|^2$.
Given that $p$ is an isotropic normal distribution, it follows that $\tilde{p}(T_g(\mathbf{x}_T, \mathbf{x}_P)) = \tilde{p}(\mathbf{x}_T, \mathbf{x}_P)$, i.e. the initial density is SE(3)-invariant.
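The protein CoM-free projection and the resulting invariance can be verified numerically. The following is a minimal NumPy sketch with toy coordinates (not the actual implementation):

```python
import numpy as np

def to_protein_com_free(x_lig, x_prot):
    """Translate ligand and protein so that the protein's center of mass is zero."""
    c = x_prot.mean(axis=0)
    return x_lig - c, x_prot - c

rng = np.random.default_rng(1)
x_lig = rng.normal(size=(8, 3))      # toy ligand atom coordinates
x_prot = rng.normal(size=(20, 3))    # toy protein atom coordinates

R, _ = np.linalg.qr(rng.normal(size=(3, 3)))
if np.linalg.det(R) < 0:
    R[:, 0] *= -1                    # proper rotation
b = rng.normal(size=3)               # translation

# Apply a rigid transform, then project; compare with projecting the original.
l1, p1 = to_protein_com_free(x_lig @ R.T + b, x_prot @ R.T + b)
l0, p0 = to_protein_com_free(x_lig, x_prot)

# The translation b is removed entirely; only the rotation remains.
assert np.allclose(l1, l0 @ R.T)
assert np.allclose(p1, p0 @ R.T)

# Hence an isotropic Gaussian log-density of the projected coordinates is invariant.
logp = lambda z: -0.5 * (z ** 2).sum()
assert np.isclose(logp(np.concatenate([l1, p1])), logp(np.concatenate([l0, p0])))
```

The two assertions mirror the proof: projection cancels $\mathbf{b}$, and the rotation preserves norms, so the standard-normal density is unchanged.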

C PROOF OF INVARIANT LIKELIHOOD

In Sec. 3.4, we argue that an invariant initial density composed with equivariant transition functions results in an invariant distribution. In this section, we provide the proof. The two conditions that guarantee an invariant likelihood $p_\theta(\mathcal{M}_0|\mathcal{P})$ are:
\[ p(\mathbf{x}_T, \mathbf{x}_P) = p(T_g(\mathbf{x}_T, \mathbf{x}_P)) \quad \text{(1, invariant prior)} \]
\[ p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_P) = p_\theta(T_g(\mathbf{x}_{t-1})|T_g(\mathbf{x}_t, \mathbf{x}_P)) \quad \text{(2, equivariant transition)} \]
The conclusion follows:
\begin{align*}
p_\theta(T_g(\mathbf{x}_0, \mathbf{x}_P)) &= p(T_g(\mathbf{x}_T, \mathbf{x}_P)) \prod_{t=1}^{T} p_\theta(T_g(\mathbf{x}_{t-1})|T_g(\mathbf{x}_t, \mathbf{x}_P)) \\
&= p(\mathbf{x}_T, \mathbf{x}_P) \prod_{t=1}^{T} p_\theta(T_g(\mathbf{x}_{t-1})|T_g(\mathbf{x}_t, \mathbf{x}_P)) && \text{(apply 1)} \\
&= p(\mathbf{x}_T, \mathbf{x}_P) \prod_{t=1}^{T} p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_P) && \text{(apply 2)} \\
&= p_\theta(\mathbf{x}_0, \mathbf{x}_P).
\end{align*}

D DERIVATION OF ATOM TYPES DIFFUSION PROCESS

According to Bayes' theorem, we have:
\[ q(\mathbf{v}_{t-1}|\mathbf{v}_t, \mathbf{v}_0) = \frac{q(\mathbf{v}_t|\mathbf{v}_{t-1}, \mathbf{v}_0)\, q(\mathbf{v}_{t-1}|\mathbf{v}_0)}{\sum_{\mathbf{v}_{t-1}} q(\mathbf{v}_t|\mathbf{v}_{t-1}, \mathbf{v}_0)\, q(\mathbf{v}_{t-1}|\mathbf{v}_0)} = \frac{q(\mathbf{v}_t|\mathbf{v}_{t-1})\, q(\mathbf{v}_{t-1}|\mathbf{v}_0)}{\sum_{\mathbf{v}_{t-1}} q(\mathbf{v}_t|\mathbf{v}_{t-1})\, q(\mathbf{v}_{t-1}|\mathbf{v}_0)}. \tag{14} \]
According to Eqs. 2 and 3, $q(\mathbf{v}_t|\mathbf{v}_{t-1})$ and $q(\mathbf{v}_{t-1}|\mathbf{v}_0)$ can be calculated as:
\[ q(\mathbf{v}_t|\mathbf{v}_{t-1}) = \mathcal{C}(\mathbf{v}_t\,|\,\alpha_t \mathbf{v}_{t-1} + (1-\alpha_t)/K), \qquad q(\mathbf{v}_{t-1}|\mathbf{v}_0) = \mathcal{C}(\mathbf{v}_{t-1}\,|\,\bar\alpha_{t-1} \mathbf{v}_0 + (1-\bar\alpha_{t-1})/K). \]
Note that when computing $\mathcal{C}(\mathbf{v}_t\,|\,\alpha_t \mathbf{v}_{t-1} + (1-\alpha_t)/K)$, the value is $\alpha_t + (1-\alpha_t)/K$ if $\mathbf{v}_t = \mathbf{v}_{t-1}$ and $(1-\alpha_t)/K$ otherwise, which makes this function symmetric (Hoogeboom et al., 2021), i.e. $\mathcal{C}(\mathbf{v}_t\,|\,\alpha_t \mathbf{v}_{t-1} + (1-\alpha_t)/K) = \mathcal{C}(\mathbf{v}_{t-1}\,|\,\alpha_t \mathbf{v}_t + (1-\alpha_t)/K)$. Let $\mathbf{c}^\star(\mathbf{v}_t, \mathbf{v}_0)$ denote the numerator of Eq. 14. It can be computed as:
\[ \mathbf{c}^\star(\mathbf{v}_t, \mathbf{v}_0) = q(\mathbf{v}_t|\mathbf{v}_{t-1})\, q(\mathbf{v}_{t-1}|\mathbf{v}_0) = \big[\alpha_t \mathbf{v}_t + (1-\alpha_t)/K\big] \odot \big[\bar\alpha_{t-1} \mathbf{v}_0 + (1-\bar\alpha_{t-1})/K\big], \]
and therefore the posterior over atom types is:
\[ q(\mathbf{v}_{t-1}|\mathbf{v}_t, \mathbf{v}_0) = \mathcal{C}(\mathbf{v}_{t-1}\,|\,\tilde{\mathbf{c}}(\mathbf{v}_t, \mathbf{v}_0)), \quad \text{where } \tilde{\mathbf{c}}(\mathbf{v}_t, \mathbf{v}_0) = \mathbf{c}^\star \Big/ \sum_{k=1}^{K} c^\star_k. \]
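The categorical posterior above can be computed in a few lines. The following sketch (with arbitrary placeholder schedule values) checks that it is a valid distribution and concentrates correctly in the low-noise regime:

```python
import numpy as np

def categorical_posterior(v_t, v_0, alpha_t, alpha_bar_tm1, K):
    """q(v_{t-1} | v_t, v_0) for the uniform-noise categorical diffusion.

    v_t, v_0: one-hot vectors of length K;
    alpha_t, alpha_bar_tm1: noise-schedule scalars (placeholders here).
    """
    c_star = (alpha_t * v_t + (1 - alpha_t) / K) * \
             (alpha_bar_tm1 * v_0 + (1 - alpha_bar_tm1) / K)
    return c_star / c_star.sum()   # normalize the element-wise product

K = 4
v0 = np.eye(K)[0]
vt = np.eye(K)[2]

post = categorical_posterior(vt, v0, alpha_t=0.9, alpha_bar_tm1=0.5, K=K)
assert np.isclose(post.sum(), 1.0) and (post >= 0).all()   # a valid distribution

# When v_t = v_0 and the alphas are close to 1, the posterior concentrates on that class.
post2 = categorical_posterior(v0, v0, alpha_t=0.99, alpha_bar_tm1=0.99, K=K)
assert post2.argmax() == 0
```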

E OVERALL TRAINING AND SAMPLING PROCEDURES

In this section, we summarize the overall training and sampling procedures of TargetDiff as Algorithm 1 and Algorithm 2 respectively.

Algorithm 1 Training Procedure of TargetDiff

Input: protein-ligand binding dataset $\{(\mathcal{P}, \mathcal{M})\}_{i=1}^N$, neural network $\phi_\theta$
1: while $\phi_\theta$ has not converged do
2: Sample diffusion time $t \in \mathcal{U}(0, \dots, T)$
3: Move the complex so that the CoM of protein atoms is zero
4: Perturb $\mathbf{x}_0$ to obtain $\mathbf{x}_t$: $\mathbf{x}_t = \sqrt{\bar\alpha_t}\,\mathbf{x}_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$
5: Perturb $\mathbf{v}_0$ to obtain $\mathbf{v}_t$: $\log \mathbf{c} = \log(\bar\alpha_t \mathbf{v}_0 + (1-\bar\alpha_t)/K)$, $\mathbf{v}_t = \text{one\_hot}(\arg\max_i [g_i + \log c_i])$, where $g \sim \text{Gumbel}(0, 1)$
6: Predict $[\hat{\mathbf{x}}_0, \hat{\mathbf{v}}_0]$ from $[\mathbf{x}_t, \mathbf{v}_t]$ with $\phi_\theta$: $[\hat{\mathbf{x}}_0, \hat{\mathbf{v}}_0] = \phi_\theta([\mathbf{x}_t, \mathbf{v}_t], t, \mathcal{P})$
7: Compute the posterior atom types $\mathbf{c}(\mathbf{v}_t, \mathbf{v}_0)$ and $\mathbf{c}(\mathbf{v}_t, \hat{\mathbf{v}}_0)$ according to Eq. 4
8: Compute the unweighted MSE loss on atom coordinates and the KL loss on posterior atom types: $L = \|\mathbf{x}_0 - \hat{\mathbf{x}}_0\|^2 + \alpha\,\mathrm{KL}(\mathbf{c}(\mathbf{v}_t, \mathbf{v}_0)\,\|\,\mathbf{c}(\mathbf{v}_t, \hat{\mathbf{v}}_0))$
9: Update $\theta$ by minimizing $L$
10: end while

Algorithm 2 Sampling Procedure of TargetDiff

Input: the protein binding site $\mathcal{P}$, the learned model $\phi_\theta$
Output: generated ligand molecule $\mathcal{M}$ that binds to the protein pocket
1: Sample the number of atoms in $\mathcal{M}$ from a prior distribution conditioned on the pocket size
2: Move the CoM of protein atoms to zero
3: Sample initial atom coordinates $\mathbf{x}_T \sim \mathcal{N}(0, I)$ and atom types $\mathbf{v}_T = \text{one\_hot}(\arg\max_i g_i)$, where $g \sim \text{Gumbel}(0, 1)$
4: for $t$ in $T, T-1, \dots, 1$ do
5: Predict $[\hat{\mathbf{x}}_0, \hat{\mathbf{v}}_0]$ from $[\mathbf{x}_t, \mathbf{v}_t]$ with $\phi_\theta$: $[\hat{\mathbf{x}}_0, \hat{\mathbf{v}}_0] = \phi_\theta([\mathbf{x}_t, \mathbf{v}_t], t, \mathcal{P})$
6: Sample $\mathbf{x}_{t-1}$ from the posterior $p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t, \hat{\mathbf{x}}_0)$ according to Eq. 4
7: Sample $\mathbf{v}_{t-1}$ from the posterior $p_\theta(\mathbf{v}_{t-1}|\mathbf{v}_t, \hat{\mathbf{v}}_0)$ according to Eq. 4
8: end for

F EXPERIMENT DETAILS

F.1 FEATURIZATION

At the $l$-th layer, we dynamically construct the protein-ligand complex as a $k$-nearest-neighbor (knn) graph based on the known protein atom coordinates and the current ligand atom coordinates, i.e. the output of the $(l-1)$-th layer. We choose $k = 32$ in our experiments. The protein atom features include the chemical element, the amino acid type, and whether the atom is a backbone atom. The ligand atom types are one-hot vectors encoding the chemical element and aromaticity. The edge features are the outer product of a distance embedding and a bond-type embedding, where we expand the distance with radial basis functions located at 20 centers between 0 Å and 10 Å, and the bond type is a 4-dim one-hot vector indicating whether the connection is between protein atoms, ligand atoms, protein-ligand atoms, or ligand-protein atoms.
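Step 5 of Algorithm 1 (and step 3 of Algorithm 2) perturbs atom types with the Gumbel-max trick. A minimal NumPy sketch, with a placeholder schedule value:

```python
import numpy as np

def perturb_atom_types(v0, alpha_bar_t, K, rng):
    """Sample v_t ~ C(alpha_bar_t * v0 + (1 - alpha_bar_t)/K) via the Gumbel-max trick."""
    log_c = np.log(alpha_bar_t * v0 + (1 - alpha_bar_t) / K)
    g = rng.gumbel(size=K)                       # g ~ Gumbel(0, 1)
    return np.eye(K)[np.argmax(g + log_c)]       # one-hot sample

rng = np.random.default_rng(0)
K = 8
v0 = np.eye(K)[3]

# With little accumulated noise (alpha_bar close to 1), v_t almost always equals v_0.
samples = [perturb_atom_types(v0, alpha_bar_t=0.999, K=K, rng=rng) for _ in range(200)]
frac_same = np.mean([s.argmax() == 3 for s in samples])
assert frac_same > 0.9
```

Adding i.i.d. Gumbel noise to log-probabilities and taking the argmax is equivalent to sampling from the categorical distribution, which keeps the perturbation step simple and parallel over atoms.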

F.2 MODEL PARAMETERIZATION

Our model consists of 9 equivariant layers of the form of Eq. 7; each layer is a Transformer with hidden dimension 128 and 16 attention heads. The key/value embeddings and attention scores are produced by 2-layer MLPs with LayerNorm and ReLU activations. We use a sigmoid $\beta$ schedule with $\beta_1 = 1\mathrm{e}{-7}$ and $\beta_T = 2\mathrm{e}{-3}$ for atom coordinates, and the cosine $\beta$ schedule suggested by Nichol and Dhariwal (2021) with $s = 0.01$ for atom types. We set the number of diffusion steps to 1000.
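The two noise schedules can be sketched as follows. This is a minimal NumPy version; the exact discretization of the sigmoid schedule is an assumption (only its endpoints $\beta_1 = 1\mathrm{e}{-7}$ and $\beta_T = 2\mathrm{e}{-3}$ come from the text), while the cosine schedule follows the standard Nichol & Dhariwal (2021) construction:

```python
import numpy as np

def sigmoid_beta_schedule(T, beta_1=1e-7, beta_T=2e-3):
    """Sigmoid-shaped beta schedule for atom coordinates (one common discretization)."""
    t = np.linspace(-6, 6, T)
    s = 1.0 / (1.0 + np.exp(-t))
    # rescale to exactly hit [beta_1, beta_T]
    return beta_1 + (beta_T - beta_1) * (s - s.min()) / (s.max() - s.min())

def cosine_beta_schedule(T, s=0.01):
    """Cosine schedule of Nichol & Dhariwal (2021) for atom types, returned as betas."""
    steps = np.arange(T + 1)
    f = np.cos((steps / T + s) / (1 + s) * np.pi / 2) ** 2
    alpha_bar = f / f[0]
    betas = 1 - alpha_bar[1:] / alpha_bar[:-1]
    return np.clip(betas, 0.0, 0.999)

betas_x = sigmoid_beta_schedule(1000)
betas_v = cosine_beta_schedule(1000)
assert betas_x[0] >= 1e-7 and betas_x[-1] <= 2e-3 + 1e-12
assert betas_v.shape == (1000,) and (betas_v >= 0).all()
```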

F.3 TRAINING DETAILS

The model is trained with the Adam optimizer (Kingma and Ba, 2014) with an initial learning rate of 0.001, betas = (0.95, 0.999), batch size 4, and gradient norm clipping at 8. We multiply the atom type loss by a factor $\alpha = 100$ to balance the scales of the two losses. During training, we add small Gaussian noise with a standard deviation of 0.1 to protein atom coordinates as data augmentation. We also decay the learning rate exponentially by a factor of 0.6, with a minimum learning rate of 1e-6; the learning rate is decayed when the validation loss shows no improvement for 10 consecutive evaluations, with an evaluation performed every 2000 training steps. We trained our model on one NVIDIA GeForce RTX 3090 GPU, and it converges within 24 hours and 200k steps.
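The learning-rate decay rule described above (decay by 0.6 after 10 stalled evaluations, floor at 1e-6) can be sketched as a small scheduler. This is a re-implementation of the described behavior, not the authors' code:

```python
class PlateauLRDecay:
    """Decay lr by `factor` whenever the validation loss fails to improve
    for `patience` consecutive evaluations, never going below `min_lr`."""

    def __init__(self, lr=1e-3, factor=0.6, patience=10, min_lr=1e-6):
        self.lr, self.factor, self.patience, self.min_lr = lr, factor, patience, min_lr
        self.best = float("inf")
        self.bad = 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best, self.bad = val_loss, 0      # improvement resets the counter
        else:
            self.bad += 1
            if self.bad >= self.patience:          # plateau detected: decay once
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.bad = 0
        return self.lr

sched = PlateauLRDecay()
sched.step(1.0)                 # first evaluation sets the best loss
for _ in range(10):             # 10 evaluations without improvement trigger one decay
    lr = sched.step(1.0)
assert abs(lr - 0.0006) < 1e-9  # 0.001 * 0.6
```

In practice the same behavior is available out of the box as `torch.optim.lr_scheduler.ReduceLROnPlateau`.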

H SAMPLING TIME ANALYSIS

One major advantage of TargetDiff over autoregressive models such as AR is that TargetDiff scales better with molecule size. While AR must run additional steps to generate larger molecules, the diffusion model can operate on additional atoms in parallel without sacrificing much inference time. To demonstrate this effect, we randomly select 5 binding pockets as targets and record the time spent generating 100 molecules for each pocket using the autoregressive models (AR, Pocket2Mol, and GraphBP) and TargetDiff. Since these models have different numbers of parameters and sampling schemes, we compare the inference time ratio relative to generating a 10-atom molecule, rather than simply comparing wall time. As shown in Figure S3, as we generate larger and larger molecules, the wall time for AR grows almost linearly with molecule size, while the wall time for TargetDiff stays relatively flat. In terms of wall-clock time, AR, Pocket2Mol, and GraphBP take 7785s, 2544s, and 105s on average, respectively, to generate 100 valid molecules, versus 3428s on average for TargetDiff. GraphBP has the fastest sampling, but the quality of its generated molecules is lower than that of the other models (see Table 3). TargetDiff has moderate sampling efficiency compared to AR and Pocket2Mol.



Figure 1: Overview of TargetDiff. The diffusion process gradually injects noise into the data, and the generative process learns to recover the data distribution from the noise distribution with a network parameterized by θ.

Figure 2: Comparing the distributions of all-atom (top row) and carbon-carbon (bottom row) pairwise distances for reference molecules in the test set (gray) and model-generated molecules (color). The Jensen-Shannon divergence (JSD) between the two distributions is reported.

Figure 4: Median Vina energy of generated molecules (liGAN vs. AR vs. TargetDiff) across 100 test binding targets. Binding targets are sorted by the median Vina energy of TargetDiff-generated molecules. Lower Vina energy means a higher estimated binding affinity.

Figure 6: Binding affinity ranking results on CrossDocked2020. Spearman's rank correlation coefficients between different indicators and experimentally measured binding affinity are shown.

DETERMINING THE NUMBER OF LIGAND ATOMS

Pocket Size Estimation. We compute the top-10 farthest pairwise distances between protein atoms and take their median as the pocket size, for robustness.

Prior Distribution over the Number of Ligand Atoms. We compute 10 quantiles of the training pocket sizes and estimate a prior distribution over the number of ligand atoms for each bin. Specifically, we take the histogram of the number of atoms in the training set as the prior distribution. The relationship between the estimated prior distribution and the actual training/testing atom-number distributions is shown in Figure S1. We can see that the number of ligand atoms has a clear positive correlation with pocket size, and that the prior distribution estimated from the training set also generalizes to the test set.
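The pocket-size estimate and quantile binning described above can be sketched as follows (random coordinates stand in for real protein atoms; the helper names are illustrative, not from the released code):

```python
import numpy as np

def pocket_size(x_prot):
    """Median of the top-10 largest pairwise distances among protein atoms."""
    diff = x_prot[:, None, :] - x_prot[None, :, :]
    d = np.linalg.norm(diff, axis=-1)
    iu = np.triu_indices(len(x_prot), k=1)      # unique atom pairs
    top10 = np.sort(d[iu])[-10:]
    return np.median(top10)

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 3)) * 5.0              # toy protein coordinates
s = pocket_size(x)
assert s > 0

# Bin pocket sizes into 10 quantile bins; within each bin the prior over the
# number of ligand atoms is the training-set histogram (fake sizes used here).
sizes = rng.uniform(10, 30, size=1000)          # toy training pocket sizes
bins = np.quantile(sizes, np.linspace(0, 1, 11))
idx = int(np.clip(np.digitize(20.0, bins) - 1, 0, 9))   # bin for a pocket of size 20
assert 0 <= idx <= 9
```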

Summary of different properties of reference molecules and molecules generated by our model and other baselines.

Figure S2: A scatter plot compares the SA score and the number of atoms for a given molecule.

Figure S3: Inference time growth as a function of molecule size for AR, Pocket2Mol, GraphBP and TargetDiff.

Figure S4: More examples of binding poses of generated molecules. To present a fair overview of model performance, we specifically select the best (1H36_A), median (1DXO_A), and worst (2GNS_A) targets from Figure 4 for visualization, along with the generated molecules and computed Vina energies (kcal/mol).

Target-Aware Molecule Generation. As more structural data become available, various generative models have been proposed for the target-aware molecule generation task. For example, Skalic et al. (2019b) and Xu et al. (2021) generate SMILES based on protein contexts. Tan et al. (2022) propose a flow-based model that generates molecular graphs conditioned on a protein target represented as a sequence embedding. Ragoza et al. (2022) generate 3D molecules by voxelizing molecules into atomic density grids within a conditional VAE framework. Li et al. (2021) leverage Monte-Carlo Tree Search and a policy network to optimize molecules in 3D space. Luo et al. (2021), Liu et al. (2022), and Peng et al. (2022) develop autoregressive models that generate molecules atom by atom in 3D space with

Percentage of different ring sizes for reference and model generated molecules.


Binding affinity prediction results on PDBbind v2020. EGNN augmented with our unsupervised features achieves the best results on all four metrics.




Acknowledgement We thank all the reviewers for their feedback throughout the review cycles of the manuscript. This work was supported by the National Key R&D Program of China (No. 2021YFF1201600), U.S. National Science Foundation grant no. 2019897, and U.S. Department of Energy award DE-SC0018420.

G ADDITIONAL EVALUATION RESULTS

In the main text, we provided the evaluation results in Table 3, where the Vina score is computed with AutoDock Vina (Eberhardt et al., 2021). Here, we provide the evaluation results in Table S1, where Vina scores are computed with QVina (Alhossary et al., 2015), a faster but less accurate docking tool (following what AR and Pocket2Mol used). We observe a similar trend: molecules generated by TargetDiff achieve SOTA binding affinity. Upon further investigation, we also discover a strong negative correlation between SA score and molecular size, as shown in Figure S2 (Pearson R = -0.56, p ≤ 10^-80), and the difference in SA score between these model-generated molecules could be an artifact of their size differences.

