EQUIVARIANT 3D-CONDITIONAL DIFFUSION MODELS FOR MOLECULAR LINKER DESIGN

Anonymous

Abstract

Fragment-based drug discovery has been an effective paradigm in early-stage drug development. An open challenge in this area is designing linkers between disconnected molecular fragments of interest to obtain chemically-relevant candidate drug molecules. In this work, we propose DiffLinker, an E(3)-equivariant 3D-conditional diffusion model for molecular linker design. Given a set of disconnected fragments, our model places missing atoms in between and designs a molecule incorporating all the initial fragments. Unlike previous approaches, which can only connect pairs of molecular fragments, our method can link an arbitrary number of fragments. Additionally, the model automatically determines the number of atoms in the linker and its attachment points to the input fragments. We demonstrate that DiffLinker outperforms other methods on the standard datasets, generating more diverse and synthetically-accessible molecules. We further test our method in real-world settings, showing that it can successfully generate valid linkers conditioned on target protein pockets.

1. INTRODUCTION

The space of pharmacologically-relevant molecules is estimated to exceed 10^60 structures (Virshup et al., 2013), and searching this space poses significant challenges for drug design. A successful approach to reducing the size of this space is to start from fragments, smaller molecular compounds that usually have no more than 20 heavy (non-hydrogen) atoms. This strategy is known as fragment-based drug design (FBDD) (Erlanson et al., 2016). Given a protein pocket (a part of the target protein with properties suitable for binding a ligand), computationally determining fragments that interact with the pocket is a cheaper and more efficient alternative to experimental high-throughput screening methods (Erlanson et al., 2016). Once the relevant fragments have been identified and docked to the target protein, it remains to combine them into a single, connected molecule. Among various strategies such as fragment linking, merging, and growing (Lamoree & Hubbard, 2017), linking has been preferred because it can rapidly boost the binding energy between the target and the compound (Jencks, 1981; Hajduk et al., 1997). This work addresses the fragment linking problem. Early computational methods for molecular linker design were based on database search and physical simulations (Sheng & Zhang, 2013), both of which are computationally intensive. There is therefore increasing interest in machine learning methods that can go beyond the available data and design diverse linkers more efficiently. Existing approaches are based either on syntactic pattern recognition (Yang et al., 2020) or on autoregressive models (Imrie et al., 2020; 2021; Huang et al., 2022). While the former operates solely on SMILES (Weininger, 1988), the latter take into account the 3D positions and orientations of the input fragments, as this information is essential for designing stable molecules in various applications (see Appendix A.1 for details).
However, these methods are not equivariant with respect to the permutation of atoms and can only combine pairs of fragments. Moreover, linker design depends on the target protein pocket, and using this information correctly can improve the affinity of the resulting compound. To date, however, there is no computational method for molecular linker design that takes the pocket into account. In this work, we introduce DiffLinker, a conditional diffusion model that generates molecular linkers for a set of input fragments represented as a 3D atomic point cloud. First, our model predicts the size of the prospective linker and samples initial linker atom types and positions from a normal distribution. Next, the linker atom types and coordinates are iteratively updated using a neural network conditioned on the input fragments. Eventually, the denoised linker atoms and the input fragment atoms form a single connected molecule, as shown in Figure 1. DiffLinker enjoys several desirable properties: it is equivariant to translations, rotations, reflections and permutations; it is not limited by the number of input fragments; it does not require information on the attachment atoms; and it generates linkers with no predefined size. Besides, the conditioning mechanism of DiffLinker allows passing additional information about the surrounding protein pocket atoms, which makes the model applicable in structure-based drug design (Congreve et al., 2005). We empirically show that DiffLinker is more effective than previous methods at generating chemically-relevant linkers between pairs of fragments. Our method achieves state-of-the-art results in synthetic accessibility and drug-likeness, which makes it preferable for use in drug design pipelines. Besides, DiffLinker remarkably outperforms other methods in the diversity of the generated linkers.
We further propose a more challenging benchmark and show that our method is able to successfully link more than two fragments, which the other methods cannot do. We also demonstrate that DiffLinker can be conditioned on the target protein pocket: our model respects the geometric constraints imposed by the surrounding protein atoms and generates molecules with minimal clashes with the corresponding pockets. To the best of our knowledge, DiffLinker is the first method that is not limited by the number of input fragments and accounts for pocket information. The overall goal of this work is to provide practitioners with an effective tool for molecular linker generation in realistic drug design scenarios.

2. RELATED WORK

Molecular linker design has been widely used in the fragment-based drug discovery community (Sheng & Zhang, 2013), and various de novo design methods address the fragment linking problem (Tschinke & Cohen, 1993; Miranker & Karplus, 1995; Roe & Kuntz, 1995; Pearlman & Murcko, 1996; Stahl et al., 2002; Thompson et al., 2008; Ichihara et al., 2011). Early fragment linking methods were based on search in predefined libraries of linkers (Böhm, 1992; Lauri & Bartlett, 1994), genetic algorithms, tabu search (Glover, 1986) and force field optimization (Dey & Caflisch, 2008). Although successfully used in multiple applications (Ji et al., 2003; Silverman, 2009; Sheng & Zhang, 2011), these methods are computationally expensive and substantially limited by the available data. Hence, there has recently been interest in developing learning-based methods for molecular linker design. Yang et al. (2020) proposed SyntaLinker, a SMILES-based deep conditional transformer neural network that solves a sentence completion problem (Zweig et al., 2012). This method inherits the drawbacks of SMILES, namely the absence of 3D structure and the lack of consistency (atoms that are close in the molecule can be far apart in the SMILES string). Imrie et al. (2020) overcome these limitations with the autoregressive model DeLinker and its extension DEVELOP (Imrie et al., 2021), which uses additional pharmacophore information. Although these methods operate on 3D molecular conformations, they use very limited geometric information and require the attachment atoms of the fragments to be specified. Recently, Huang et al. (2022) proposed another autoregressive method, 3DLinker, which does not require one to specify attachment points and leverages the geometric information to a much greater extent. This makes the approach more relevant for connecting docked fragments.
As both DeLinker and 3DLinker are autoregressive models, they are not permutation equivariant, which limits their sample efficiency and ability to scale to large molecules (Elesedy & Zaidi, 2021; Rath & Condurache, 2022). Besides, these methods are capable of connecting only pairs of fragments and cannot be easily extended to larger sets of fragments. Outside of the linker design problem, several recent works proposed denoising diffusion models for molecular data in 3D. The conformer generation methods GeoDiff (Xu et al., 2022) and ConfGF (Shi et al., 2021) condition the model on the adjacency matrix of the molecular graph. Since they have access to connectivity information, they can compute torsion angles between atoms and optimize them (Jing et al., 2022). The Equivariant Diffusion Model (EDM) (Hoogeboom et al., 2022) generates 3D molecules from scratch and can be conditioned on predefined scalar properties. Another diffusion model has recently been proposed for designing protein scaffolds given protein motifs (Trippe et al., 2022); while its sampling procedure is conditional, this model is trained in an unconditional setup. Finally, Luo et al. (2022) proposed a model for antibody design that combines discrete diffusion for the molecular graphs and continuous diffusion on the 3D coordinates. The conditioning mechanism proposed in that work is the closest to ours, but their model is targeted at generating chains of amino acids rather than atomic point clouds.

3.1. DIFFUSION MODELS

Diffusion models (Sohl-Dickstein et al., 2015) are a class of generative methods that consist of a diffusion process, which progressively distorts a data point into noise, and a generative denoising process, which approximates the reverse of the diffusion process. The diffusion process iteratively adds noise to a data point $x$ in order to progressively transform it into Gaussian noise. At a time step $t = 0, \dots, T$, the conditional distribution of the intermediate state $z_t$ given the previous state is a multivariate normal distribution,

$$q(z_t \mid z_{t-1}) = \mathcal{N}(z_t;\ \alpha_{t|t-1} z_{t-1},\ \sigma_{t|t-1}^2 I),$$

where $\alpha_{t|t-1} \in \mathbb{R}^+$ controls how much signal is retained and $\sigma_{t|t-1} \in \mathbb{R}^+$ controls how much noise is added. By hypothesis, the full transition model is Markov, so it can be written as

$$q(z_0, z_1, \dots, z_T \mid x) = q(z_0 \mid x) \prod_{t=1}^{T} q(z_t \mid z_{t-1}). \quad (2)$$

Since the distribution $q$ is normal, a simple formula for the distribution of $z_t$ given $x$ can be derived: $q(z_t \mid x) = \mathcal{N}(z_t;\ \alpha_t x,\ \sigma_t^2 I)$, where $\alpha_{t|t-1} = \alpha_t / \alpha_{t-1}$ and $\sigma_{t|t-1}^2 = \sigma_t^2 - \alpha_{t|t-1}^2 \sigma_{t-1}^2$. This closed-form expression shows that noise does not need to be added iteratively to $x$ in order to reach an intermediate state $z_t$. Another key property of Gaussian noise is that the reverse of the diffusion process, the true denoising process, also admits a closed-form solution when conditioned on $x$:

$$q(z_{t-1} \mid x, z_t) = \mathcal{N}(z_{t-1};\ \mu_t(x, z_t),\ \varsigma_t^2 I),$$

where the distribution parameters can be obtained analytically:

$$\mu_t(x, z_t) = \frac{\alpha_{t|t-1} \sigma_{t-1}^2}{\sigma_t^2} z_t + \frac{\alpha_{t-1} \sigma_{t|t-1}^2}{\sigma_t^2} x \qquad\text{and}\qquad \varsigma_t = \frac{\sigma_{t|t-1} \sigma_{t-1}}{\sigma_t}.$$

This formula simply says that if a diffusion trajectory starts at $x$ and ends at $z_T$, then the expected value of any intermediate state is an interpolation between $x$ and $z_T$. The second component of a diffusion model is the generative denoising process, which learns to invert this trajectory when the data point $x$ is unknown.
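As a sanity check of the closed-form marginal above, the sketch below composes the per-step transitions numerically and verifies that the accumulated noise matches $\sigma_t$ directly. The cosine-style schedule is an illustrative assumption, not the schedule used in the paper.

```python
import numpy as np

# Illustrative variance-preserving schedule (an assumption, not the paper's):
# alpha[t] is the total signal retained after t steps, sigma[t]^2 = 1 - alpha[t]^2.
T = 100
alpha = np.clip(np.cos(0.5 * np.pi * np.arange(T + 1) / T) ** 2, 1e-4, 1.0)
sigma = np.sqrt(1.0 - alpha ** 2)

def marginal_std_via_transitions(t):
    """Compose per-step transitions q(z_s | z_{s-1}); the accumulated noise
    std must equal the closed-form sigma[t]."""
    var = sigma[0] ** 2
    for s in range(1, t + 1):
        a_step = alpha[s] / alpha[s - 1]                       # alpha_{s|s-1}
        # transition variance sigma_{s|s-1}^2 = sigma_s^2 - a_step^2 sigma_{s-1}^2
        var = a_step ** 2 * var + (sigma[s] ** 2 - a_step ** 2 * sigma[s - 1] ** 2)
    return np.sqrt(var)
```

Unrolling the recurrence shows the variance telescopes to $\sigma_t^2$ exactly, which is why the marginal can be sampled in one shot.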
The generative transition distribution is defined as $p(z_{t-1} \mid z_t) = q(z_{t-1} \mid \hat{x}, z_t)$, where $\hat{x}$ is an approximation of the data point $x$ computed by a neural network $\varphi$. Ho et al. (2020) have empirically shown that it works better to predict the Gaussian noise $\hat{\epsilon}_t = \varphi(z_t, t)$ instead, and then estimate the data point as

$$\hat{x} = \frac{1}{\alpha_t} z_t - \frac{\sigma_t}{\alpha_t} \hat{\epsilon}_t. \quad (3)$$

The neural network is trained to maximize an evidence lower bound on the likelihood of the data under the model. Up to a prefactor that depends on $t$, this objective is equivalent to the mean squared error between predicted and true noise (Ho et al., 2020; Kingma et al., 2021). We therefore use the simplified objective $L(t) = \|\epsilon - \hat{\epsilon}_t\|^2$, which can be optimized by mini-batch gradient descent using the estimator $\mathbb{E}_{t \sim U(0, \dots, T)}[T \cdot L(t)]$. Finally, once the network is trained, it can be used to sample new data points. For this purpose, one first samples Gaussian noise $z_T \sim \mathcal{N}(0, I)$. Then, for $t = T, \dots, 1$, one iteratively samples $z_{t-1} \sim p(z_{t-1} \mid z_t)$ and finally samples $x \sim p(x \mid z_0)$.
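The sampling procedure just described can be sketched as an ancestral sampling loop over the posterior $q(z_{t-1} \mid \hat{x}, z_t)$. The noise predictor `phi` below is an untrained placeholder standing in for the real network, and the schedule is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 50
alpha = np.clip(np.cos(0.5 * np.pi * np.arange(T + 1) / T) ** 2, 1e-4, 1.0)
sigma = np.sqrt(1.0 - alpha ** 2)

def phi(z, t):
    """Placeholder noise predictor; a trained network would go here."""
    return np.zeros_like(z)

def sample(shape):
    z = rng.standard_normal(shape)                      # z_T ~ N(0, I)
    for t in range(T, 0, -1):
        eps_hat = phi(z, t)
        # Equation (3): estimate the data point from the predicted noise
        x_hat = z / alpha[t] - (sigma[t] / alpha[t]) * eps_hat
        a_step = alpha[t] / alpha[t - 1]                # alpha_{t|t-1}
        s2_step = sigma[t] ** 2 - a_step ** 2 * sigma[t - 1] ** 2
        # posterior q(z_{t-1} | x_hat, z_t): interpolate between z_t and x_hat
        mu = (a_step * sigma[t - 1] ** 2 / sigma[t] ** 2) * z \
           + (alpha[t - 1] * s2_step / sigma[t] ** 2) * x_hat
        std = np.sqrt(s2_step) * sigma[t - 1] / sigma[t]
        z = mu + std * rng.standard_normal(shape)       # z_{t-1} ~ p(z_{t-1}|z_t)
    return z                                            # final state, approx. x
```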

3.2. DIFFUSION FOR MOLECULES

Molecule Representation. We now consider diffusion models that operate on molecules represented as 3D atomic point clouds. A data point $x$, an attributed point cloud of $M$ atoms, is represented by atom coordinates $r = (r_1, \dots, r_M) \in \mathbb{R}^{M \times 3}$ and the corresponding feature vectors $h = (h_1, \dots, h_M) \in \mathbb{R}^{M \times n_f}$, which are one-hot encoded atom types. We therefore denote the point cloud $x$ as a tuple $x = [r, h]$.

Categorical Features. Along with continuous atom coordinates, a molecular diffusion model has to operate on atom types, which are discrete variables. While categorical diffusion models do exist (Hoogeboom et al., 2021; Austin et al., 2021), we follow a simpler strategy (Hoogeboom et al., 2022) based on lifting the atom types to a continuous space: we consider a one-hot encoding of the discrete variables and add Gaussian noise on top of it. At the end of the denoising process, once $z_0$ is sampled, the continuous values corresponding to the atom types are converted back to discrete values. We take the argmax over the different categories and include it in the final transition from $z_0$ to $x$. For more details on the structure of the final transition distribution $p(x \mid z_0)$ and the likelihood computation in this setting, we refer the reader to Hoogeboom et al. (2022).

Equivariance. Processing 3D molecules requires operations that respect data symmetries. In this work, we consider the Euclidean group E(3), which comprises translations, rotations and reflections of $\mathbb{R}^3$, and the orthogonal group O(3), which includes rotations and reflections of $\mathbb{R}^3$. A function $f: \mathbb{R}^3 \to \mathbb{R}^3$ is E(3)-equivariant if $f(Rx + t) = Rf(x) + t$ for any orthogonal matrix $R \in \mathbb{R}^{3 \times 3}$ with $\det R = \pm 1$, any translation vector $t \in \mathbb{R}^3$ and any $x \in \mathbb{R}^3$. Note that, for simplicity, we use the notation $Rx = (Rx_1, \dots, Rx_M)^\top$. A conditional distribution $p(x \mid y)$ is E(3)-equivariant if $p(Rx + t \mid Ry + t) = p(x \mid y)$ for any $x, y \in \mathbb{R}^3$.
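A minimal sketch of the categorical-feature strategy described above: lift one-hot atom types to a continuous space, add Gaussian noise, and decode with an argmax at the end. The atom-type vocabulary here is a made-up example.

```python
import numpy as np

rng = np.random.default_rng(0)
ATOM_TYPES = ["C", "N", "O", "F"]  # illustrative vocabulary, not the paper's

def onehot(types):
    """One-hot encode a list of atom-type symbols into an (M, n_f) array."""
    h = np.zeros((len(types), len(ATOM_TYPES)))
    for i, t in enumerate(types):
        h[i, ATOM_TYPES.index(t)] = 1.0
    return h

def noise_features(h, alpha_t, sigma_t):
    """Treat the one-hot rows as continuous vectors and apply the same
    Gaussian noising used for coordinates."""
    return alpha_t * h + sigma_t * rng.standard_normal(h.shape)

def decode(z0):
    """Final transition: the argmax over categories recovers discrete types."""
    return [ATOM_TYPES[i] for i in z0.argmax(axis=-1)]
```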
A function $f$ and a distribution $p$ are O(3)-equivariant if $f(Rx) = Rf(x)$ and $p(Rx \mid Ry) = p(x \mid y)$, respectively. We call the function $f$ translation invariant if $f(x + t) = f(x)$ for any translation $t$.
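These definitions can be checked numerically. The sketch below verifies O(3)-equivariance of a simple point-cloud map, the center of mass, against a random orthogonal matrix; the QR-based sampler is a standard construction, used here purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def center_of_mass(x):
    """An O(3)-equivariant map from a point cloud (M, 3) to a point in R^3."""
    return x.mean(axis=0)

def random_orthogonal():
    """A random 3x3 matrix R with R R^T = I (a rotation or rotoreflection)."""
    q, r = np.linalg.qr(rng.standard_normal((3, 3)))
    return q * np.sign(np.diag(r))  # sign fix for a well-distributed sample

def is_o3_equivariant(f, x, R, atol=1e-8):
    """Check f(Rx) = R f(x), written row-wise as f(x R^T) == f(x) R^T."""
    return np.allclose(f(x @ R.T), f(x) @ R.T, atol=atol)
```

Because the mean is linear, the check passes up to floating-point error; a deliberately non-equivariant map (e.g. coordinate-wise absolute value) would fail it.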

4. DIFFLINKER: EQUIVARIANT 3D-CONDITIONAL DIFFUSION MODEL FOR MOLECULAR LINKER DESIGN

In this section, we introduce DiffLinker, a new E(3)-equivariant diffusion model for generating molecular linkers conditioned on 3D fragments. We formulate equivariance requirements for the underlying denoising distributions in Section 4.1 and propose an appropriate learnable dynamics function in Section 4.2. We discuss the strategy for sampling the size of a linker and for conditioning on protein pockets in Sections 4.3 and 4.4, respectively. The full linker generation workflow is schematically represented in Figure 1, and an overview of the DiffLinker training and sampling procedures is provided in Algorithms 1 and 2, respectively.

4.1. EQUIVARIANT 3D-CONDITIONAL DIFFUSION MODEL

Unlike other diffusion models for molecule generation (Hoogeboom et al., 2022; Xu et al., 2022), our method is conditioned on three-dimensional data. More specifically, we assume that each point cloud $x$ has a corresponding context $u$: another point cloud consisting of all input fragments and (optionally) protein pocket atoms, which remains unchanged throughout the diffusion and denoising processes, as shown in Figure 1. Hence, we consider the generative process to operate on the point cloud $x$ while being conditioned on the fixed context:

$$p(z_{t-1} \mid z_t, u) = q(z_{t-1} \mid \hat{x}, z_t), \qquad\text{where}\quad \hat{x} = \frac{1}{\alpha_t} z_t - \frac{\sigma_t}{\alpha_t} \varphi(z_t, u, t).$$

The presence of a 3D context puts additional requirements on the generative process, as it should be equivariant to transformations of the context.

Proposition 1. Consider a prior noise distribution $p(z_T \mid u) = \mathcal{N}(z_T;\ f(u),\ I)$ and transition distributions $p(z_{t-1} \mid z_t, u) = q(z_{t-1} \mid \hat{x}, z_t)$, where $q$ is an isotropic Gaussian, $f: \mathbb{R}^{M \times 3} \to \mathbb{R}^3$ is a function operating on 3D point clouds, and $\hat{x}$ is the approximation computed by the neural network $\varphi$ as above. Let the conditional denoising probabilistic model $p$ be a Markov chain defined as

$$p(z_0, z_1, \dots, z_T \mid u) = p(z_T \mid u) \prod_{t=1}^{T} p(z_{t-1} \mid z_t, u).$$

If $f$ is O(3)-equivariant and $\varphi$ is equivariant to joint O(3)-transformations of $z_t$ and $u$, then $p(z_0 \mid u)$ is O(3)-equivariant.

The choice of the function $f$ depends strongly on the problem being solved and the available priors. In our experiments, we consider two cases. First, following Imrie et al. (2020), we make use of information about the atoms that should be connected by the linker. We call these atoms anchors and define $f(u)$ as the anchors' center of mass. However, in real-world scenarios the anchors are rarely known in advance; in that case we define $f(u)$ as the center of mass of the whole context $u$.
We note that the probabilistic model p is not equivariant to translations, as shown in Appendix A.3. To overcome this issue and make the generative process independent of translations, we construct the network φ to be additionally translation invariant. Then, instead of sampling the initial noise from N (f (u), I) we center the data at f (u) and sample from N (0, I).
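A minimal sketch of this centering trick: shift everything so that $f(u)$, here taken as the context center of mass, sits at the origin, after which the initial noise can be drawn from $\mathcal{N}(0, I)$.

```python
import numpy as np

def center_at_context(linker_coords, context_coords):
    """Translate both point clouds so the context center of mass (our f(u))
    is at the origin; returns the shift so coordinates can be restored."""
    origin = context_coords.mean(axis=0)
    return linker_coords - origin, context_coords - origin, origin
```

After generation, adding `origin` back to the denoised coordinates returns the molecule to the input frame.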

4.2. EQUIVARIANT GRAPH NEURAL NETWORK

The learnable function $\varphi$ that models the dynamics of the diffusion model is implemented as a modified E(3)-equivariant graph neural network (EGNN) (Satorras et al., 2021). Its input is the noisy version of the linker $z_t$ at time $t$ and the context $u$. These two parts are modeled as a single fully-connected graph whose nodes are represented by coordinates $r$ and feature vectors $h$ that include atom types, the time $t$, fragment flags and (optionally) anchor flags. The predicted noise $\hat{\epsilon}$ includes coordinate and feature components: $\hat{\epsilon} = [\hat{\epsilon}^r, \hat{\epsilon}^h]$. In order to make the function $\varphi$ invariant to translations, we subtract the initial coordinates from the coordinate component of the predicted noise, following Hoogeboom et al. (2022):

$$\hat{\epsilon} = [\hat{\epsilon}^r, \hat{\epsilon}^h] = \varphi(z_t, u, t) = \mathrm{EGNN}(z_t, u, t) - [z_t^r, 0].$$

The EGNN is a composition of Equivariant Graph Convolutional Layers, $r^{l+1}, h^{l+1} = \mathrm{EGCL}[r^l, h^l]$, defined as follows:

$$m_{ij} = \phi_e(h_i^l, h_j^l, d_{ij}^2), \qquad h_i^{l+1} = \phi_h\Big(h_i^l, \sum_{j \neq i} m_{ij}\Big), \qquad r_i^{l+1} = r_i^l + \phi_{vel}(r^l, h^l, i),$$

where $d_{ij} = \|r_i^l - r_j^l\|$ and $\phi_e, \phi_h$ are learnable functions parametrized by fully connected neural networks (see Appendix A.5 for details). The update of the node coordinates is computed by the learnable function $\phi_{vel}$. Note that our graph includes both the noisy linker $z_t$ and the fixed context $u$, and $\varphi$ is intended to predict the noise that should be subtracted from the coordinates and features of $z_t$. It is therefore natural to keep the context coordinates unchanged when computing the dynamics and to apply non-zero displacements only to the linker part at each EGCL step. Hence, we model the node displacements as

$$\phi_{vel}(r^l, h^l, i) = \begin{cases} \displaystyle\sum_{j \neq i} \frac{r_i^l - r_j^l}{d_{ij} + 1}\, \phi_r(h_i^l, h_j^l, d_{ij}^2) & \text{if node } i \text{ belongs to the point cloud } z_t, \\ 0 & \text{if node } i \text{ belongs to the context } u, \end{cases}$$

where $\phi_r$ is a learnable function parametrized by a fully connected neural network.
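A simplified, runnable sketch of such an EGCL with random weights: tanh replaces SiLU, the tiny MLPs are toy stand-ins for $\phi_e$, $\phi_h$, $\phi_r$, and explicit loops replace batched message passing. Context nodes receive zero displacement, as in the layer above.

```python
import numpy as np

rng = np.random.default_rng(0)
NF = 8  # hidden feature size (illustrative)

def mlp(sizes):
    """Tiny random-weight MLP standing in for the learnable phi functions."""
    Ws = [rng.standard_normal((a, b)) * 0.1 for a, b in zip(sizes[:-1], sizes[1:])]
    def f(x):
        for W in Ws[:-1]:
            x = np.tanh(x @ W)  # SiLU in the paper; tanh keeps the sketch short
        return x @ Ws[-1]
    return f

phi_e = mlp([2 * NF + 1, NF, NF])   # message:    (h_i, h_j, d_ij^2) -> m_ij
phi_h = mlp([2 * NF, NF, NF])       # feature:    (h_i, sum_j m_ij)  -> update
phi_r = mlp([2 * NF + 1, NF, 1])    # coordinate: (h_i, h_j, d_ij^2) -> scalar

def egcl(r, h, is_linker):
    """One EGCL step; `is_linker[i]` is False for fixed context nodes."""
    n = r.shape[0]
    m = np.zeros((n, NF))
    disp = np.zeros_like(r)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d2 = np.sum((r[i] - r[j]) ** 2)
            inp = np.concatenate([h[i], h[j], [d2]])
            m[i] += phi_e(inp)
            if is_linker[i]:  # context coordinates stay unchanged
                disp[i] += (r[i] - r[j]) / (np.sqrt(d2) + 1) * phi_r(inp)[0]
    h_new = h + phi_h(np.concatenate([h, m], axis=1))
    return r + disp, h_new
```

Because messages depend only on invariant quantities and displacements are linear in coordinate differences, the layer is rotation-equivariant by construction, which can be verified numerically.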
The equivariance of the convolutional layers is achieved by construction. The messages $\phi_e$ and node updates $\phi_h$ depend only on scalar node features and distances between nodes, which are E(3)-invariant. Coordinate updates $\phi_{vel}$ additionally depend linearly on the difference between the coordinate vectors $r_i^l$ and $r_j^l$, which makes them E(3)-equivariant. After the sequence of EGCLs is applied, we obtain an updated graph with new node coordinates $\hat{r} = [\hat{u}^r, \hat{z}_t^r]$ and new node features $\hat{h} = [\hat{u}^h, \hat{z}_t^h]$. Since we are interested only in the linker-related part, we discard the coordinates and features of the context nodes and take the tuple $[\hat{z}_t^r, \hat{z}_t^h]$ as the EGNN output.

4.3. LINKER SIZE PREDICTION

To predict the size of the missing linker between a set of fragments, we represent the fragments as a fully-connected graph with one-hot encoded atom types as node features and distances between nodes as edge features. From this graph, a separately trained GNN (see Appendix A.6 for details) produces probabilities for the linker size. Our assumption is that the relative fragment positions and orientations, together with the atom types, contain all the information needed to predict the most likely size of the prospective linker. When generating a linker, we first sample its size from the categorical distribution over the linker sizes seen in the training data, using the predicted probabilities, as shown in Figure 1.
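Sampling the linker size then reduces to drawing from a categorical distribution over sizes seen in training. The sizes and probabilities below are made-up placeholders standing in for the GNN output.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical GNN output: probabilities over linker sizes seen in training.
linker_sizes = np.array([3, 4, 5, 6, 7, 8])
probs = np.array([0.05, 0.15, 0.35, 0.25, 0.15, 0.05])

def sample_linker_size():
    """Draw a linker size from the predicted categorical distribution."""
    return int(rng.choice(linker_sizes, p=probs))
```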

4.4. PROTEIN POCKET CONDITIONING

In real-world fragment-based drug design applications, fragments are often selected and docked into a target protein pocket (Igashov et al., 2022). To propose a drug candidate molecule, the fragments have to be linked. When generating the linker, one should take the surrounding pocket into account and construct a linker that has no clashes with protein pocket atoms (in other words, the configuration of the linker and pocket atoms should be physically realistic) and keeps the binding strength high. To add pocket conditioning to DiffLinker, we represent a protein pocket as an atomic point cloud and consider it part of the context $u$ (cf. Section 4.1). We also extend the node features with an additional binary flag marking atoms that belong to the protein pocket. Finally, as the new context point cloud contains many more atoms, we modify the joint representation of the data point $z_t$ and the context $u$ passed to the neural network $\varphi$. Instead of considering fully-connected graphs, we assign edges between nodes based on a 4 Å distance cutoff, which makes the resulting graphs less dense and counterbalances the increase in the number of nodes.
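Building the cutoff graph is straightforward; a sketch (distances in Å):

```python
import numpy as np

def cutoff_edges(coords, cutoff=4.0):
    """Return the directed edge list connecting all atom pairs closer than
    `cutoff` angstroms, replacing the fully-connected graph when the pocket
    makes the point cloud large."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    i, j = np.where((dist < cutoff) & (dist > 0))  # dist > 0 drops self-loops
    return list(zip(i.tolist(), j.tolist()))
```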

5.1. DATASETS

ZINC. We follow Imrie et al. (2020) and consider a subset of 250,000 molecules randomly selected by Gómez-Bombarelli et al. (2018) from the ZINC database (Irwin & Shoichet, 2005). First, we generate 3D conformers using RDKit (Landrum, 2013) and define a reference 3D structure for each molecule by selecting the lowest-energy conformation. These molecules are then fragmented by enumerating all double cuts of acyclic single bonds that are not within functional groups. The resulting splits are filtered by the number of atoms in the linker and fragments, synthetic accessibility (Ertl & Schuffenhauer, 2009), ring aromaticity, and pan-assay interference compounds (PAINS) (Baell & Holloway, 2010).

GEOM. The ZINC and CASF datasets used in previous works only contain pairs of fragments. However, real-world applications often require connecting more than two fragments with one or more linkers (Igashov et al., 2022). To address this case, we construct a new dataset based on GEOM molecules (Axelrod & Gómez-Bombarelli, 2022).

Pockets Dataset. To assess the ability of DiffLinker to generate valid linkers given additional information about protein pockets, we use the protein-ligand dataset curated by Schneuing et al. (2022) from Binding MOAD (Hu et al., 2005). To define pockets, we consider amino acids that have at least one atom closer than 6 Å to any atom of the ligand; all atoms belonging to these residues constitute the pocket. We split molecules into fragments using RDKit's implementation of an MMPA-based algorithm (Dossetter et al., 2013).
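The 6 Å pocket definition can be sketched as follows; the residue container format (a dict from residue id to atom coordinates) is an assumption for illustration.

```python
import numpy as np

def pocket_residues(residue_coords, ligand_coords, cutoff=6.0):
    """Select residues with at least one atom within `cutoff` angstroms of
    any ligand atom. `residue_coords` maps residue id -> (n_atoms, 3) array."""
    pocket = []
    for res_id, atoms in residue_coords.items():
        # all pairwise residue-atom / ligand-atom distances
        d = np.linalg.norm(atoms[:, None, :] - ligand_coords[None, :, :], axis=-1)
        if d.min() < cutoff:
            pocket.append(res_id)
    return pocket
```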

5.2. EVALUATION

Metrics. First, we report several chemical properties of the generated molecules that are especially important in drug design applications: the average quantitative estimate of drug-likeness (QED) (Bickerton et al., 2012), the average synthetic accessibility (SA) (Ertl & Schuffenhauer, 2009) and the average number of rings in the linker. Next, following Imrie et al. (2020), we measure validity, uniqueness and novelty of the samples. We then determine whether the generated linkers are consistent with the 2D filters used to produce the ZINC training set; these filters are explained in detail in Appendix A.11. In addition, we record the percentage of original molecules recovered by the generation process. To compare the 3D shapes of the sampled and ground-truth molecules, we estimate the root mean squared deviation (RMSD) between the generated and real linker coordinates in the cases where the true molecules are recovered. We also compute the SC RDKit metric, which evaluates the geometric and chemical similarity between the ground-truth and generated molecules (Putta et al., 2005; Landrum et al., 2006).

Baselines. We compare our method with DeLinker (Imrie et al., 2020) and 3DLinker (Huang et al., 2022) on the ZINC test set, and with DeLinker on the CASF dataset. Besides, we adapt 3DLinker to connect more than two fragments (see Appendix A.7 for details) and evaluate its performance on the GEOM dataset. To obtain 3D conformations for the molecules generated by DeLinker on ZINC and CASF, we apply a pre-trained ConfVAE (Xu et al., 2021) followed by a force-field relaxation procedure. For all methods including ours, we generate linkers with the ground-truth size unless explicitly noted otherwise. To obtain SMILES representations of the atomic point clouds generated by our models, we use OpenBabel (O'Boyle et al., 2011) to compute covalent bonds between atoms.
We also use OpenBabel to rebuild covalent bonds for the molecules in test sets in order to correctly compute the recovery rate, RMSD and SC RDKit scores for our models. In ZINC and CASF experiments, we sample 250 linkers for each input pair of fragments. For the GEOM dataset and in experiments with pocket conditioning, we sample 100 linkers for each input set of fragments.

5.3. RESULTS

ZINC and CASF. While our models are far more flexible and broadly applicable, as we show below, they also outperform other methods on the standard ZINC and CASF benchmarks in terms of the chemical relevance of the generated molecules. As shown in Table 1, molecules sampled by DiffLinker are more synthetically accessible and demonstrate higher drug-likeness, which is especially important for drug design applications. Besides, our models generate linkers containing more rings, and our molecules usually share higher chemical and geometric similarity with the reference molecules, as demonstrated by the SC RDKit scores in Table 2. In terms of validity, our models perform on par with the other methods. Note that both autoregressive approaches explicitly apply valency rules at each generation step, while our model is shown to learn these rules from the data. Remarkably, the validity of the reference molecules from CASF with covalent bonds computed by OpenBabel is 92.2%, while our model generates molecules with 90.2% validity. Notably, sampling the size of the linker significantly improves the novelty and uniqueness of the generated linkers without significant degradation of the most important metrics. Examples of linkers generated by DiffLinker for different input fragments are provided in Figure 6.

Multiple Fragments

The major advantage of DiffLinker over the recently proposed autoregressive models DeLinker and 3DLinker is one-shot generation of the whole linker among an arbitrary number of fragments. This overcomes the limitation of DeLinker and 3DLinker, which can only link two fragments at a time. Although these autoregressive models can be adjusted to connect pairs of fragments iteratively while growing the molecule, the full context cannot be taken into account in this case, making suboptimal solutions more likely. To illustrate this difference, we adapted 3DLinker to iteratively connect pairs of fragments in molecules where more than two fragments should be connected. As shown in Table 1, 3DLinker fails to construct valid molecules in almost 84% of cases and cannot recover any reference molecule, while, despite the higher complexity of linkers in this dataset, our models achieve 93% validity and recover more than 85% of the reference molecules. Besides, molecules generated by 3DLinker have no rings in the linkers, have substantially lower QED and are much harder to synthesize. Examples of linkers generated by DiffLinker for different input fragments are provided in Figure 5. An example of the DiffLinker sampling process for a molecule from the GEOM dataset is shown in Figure 1.

Protein Pocket Conditioning. To illustrate the ability of DiffLinker to take surrounding pockets into account, we trained three models on the Pockets Dataset: one conditioned on the full-atom pocket representation, one conditioned on the pocket backbone atoms, and one unconditioned. Besides the standard metrics reported in Tables 8 and 9, we also compute the number of clashes between generated molecules and surrounding pockets. We say that two atoms clash if the distance between them is lower than the sum of their van der Waals radii.
As shown in Figure 2, the model conditioned on the full-atom pocket representation generates molecules with almost the same number of clashes (on average 7 per molecule) as the reference complexes from the test set (on average 6 per molecule). The number of clashes clearly depends on how much pocket information DiffLinker is conditioned on: the model conditioned on pocket backbone atoms generates molecules with 14 clashes on average, and the unconditioned model produces molecules with 21 clashes on average.
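The clash criterion can be sketched as below; the van der Waals radii are approximate literature values, and the quadratic loop is for clarity only.

```python
import numpy as np

# Approximate van der Waals radii in angstroms (illustrative subset).
VDW = {"H": 1.2, "C": 1.7, "N": 1.55, "O": 1.52, "S": 1.8}

def count_clashes(mol_coords, mol_types, pocket_coords, pocket_types):
    """Count atom pairs whose distance is below the sum of their vdW radii."""
    clashes = 0
    for r1, t1 in zip(mol_coords, mol_types):
        for r2, t2 in zip(pocket_coords, pocket_types):
            if np.linalg.norm(r1 - r2) < VDW[t1] + VDW[t2]:
                clashes += 1
    return clashes
```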

6. CONCLUSION

In this work, we introduced DiffLinker, a new E(3)-equivariant 3D-conditional diffusion model for molecular linker design. DiffLinker designs realistic molecules from a set of disconnected fragments by generating a linker, i.e., an atomic point cloud that interconnects the input fragments. While previous methods were only capable of connecting pairs of fragments, DiffLinker naturally scales to an arbitrary number of fragments. Our method does not require specifying the attachment points of the fragments and predicts the distribution of linker sizes from the fragments. We show that the proposed method outperforms other models on standard benchmarks and produces more chemically-relevant molecules. Besides, we demonstrate that our model can be conditioned on protein pockets and generate linkers with a minimal number of clashes. We believe that our method will accelerate the development of prospective drug candidates and has the potential to become widely used in the fragment-based drug design community.

A.2 PROOF OF PROPOSITION 1

O(3)-equivariance of the function $f$ and the fact that $q$ is an isotropic Gaussian imply O(3)-equivariance of the prior distribution:

$$p(Rz_T \mid Ru) = \mathcal{N}(Rz_T;\ f(Ru),\ I) = \mathcal{N}(Rz_T;\ Rf(u),\ I) = \mathcal{N}(z_T;\ f(u),\ I) = p(z_T \mid u).$$

Likewise, O(3)-equivariance of the function $\varphi$ and the definition of $\hat{x}$ imply O(3)-equivariance of all transition probabilities $p(z_{t-1} \mid z_t, u)$. To obtain the distribution $p(z_0 \mid u)$ of the data point $z_0$, we consider the joint distribution $p(z_0, z_1, \dots, z_T \mid u)$ and marginalize over $z_{1 \dots T}$:

$$p(z_0 \mid u) = \int p(z_0, z_1, \dots, z_T \mid u)\, dz_{1 \dots T} = \int p(z_T \mid u) \prod_{t=0}^{T-1} p(z_t \mid z_{t+1}, u)\, dz_{1 \dots T}.$$
With the prior and all transition distributions equivariant, it is now straightforward to show O(3)-equivariance of $p(z_0 \mid u)$:

$$\begin{aligned}
p(Rz_0 \mid Ru) &= \int p(Rz_T \mid Ru) \prod_{t=0}^{T-1} p(Rz_t \mid Rz_{t+1}, Ru)\, dz_{1 \dots T} \\
&= \int p(z_T \mid u) \prod_{t=0}^{T-1} p(Rz_t \mid Rz_{t+1}, Ru)\, dz_{1 \dots T} && \text{(equivariant prior } p(z_T \mid u)\text{)} \\
&= \int p(z_T \mid u) \prod_{t=0}^{T-1} p(z_t \mid z_{t+1}, u)\, dz_{1 \dots T} && \text{(equivariant transition kernels } p(z_t \mid z_{t+1}, u)\text{)} \\
&= \int p(z_0, z_1, \dots, z_T \mid u)\, dz_{1 \dots T} = p(z_0 \mid u).
\end{aligned}$$

A.3 PROBLEM WITH TRANSLATIONS

Consider the transition probability $p(z_{t-1} \mid z_t, u) = q(z_{t-1} \mid \hat{x}, z_t)$. Translation equivariance of $p(z_{t-1} \mid z_t, u)$ would mean that

$$p(z_{t-1} + t \mid z_t + t, u + t) = p(z_{t-1} \mid z_t, u) \qquad \forall t \in \mathbb{R}^3,$$

or, more precisely,

$$\mathcal{N}(z_{t-1} + t;\ \tilde{\mu}_t(z_t + t, u + t),\ \varsigma_t^2 I) = \mathcal{N}(z_{t-1};\ \tilde{\mu}_t(z_t, u),\ \varsigma_t^2 I),$$

where

$$\tilde{\mu}_t(z_t, u) = \mu_t(\hat{x}, z_t) = \frac{\alpha_{t|t-1} \sigma_{t-1}^2}{\sigma_t^2} z_t + \frac{\alpha_{t-1} \sigma_{t|t-1}^2}{\sigma_t^2} \hat{x}, \qquad \hat{x} = \frac{1}{\alpha_t} z_t - \frac{\sigma_t}{\alpha_t} \varphi(z_t, u, t).$$

The mean of this distribution can therefore be written as

$$\tilde{\mu}_t(z_t, u) = \frac{1}{\alpha_{t|t-1}} z_t - \frac{\sigma_{t|t-1}^2}{\alpha_{t|t-1} \sigma_t} \varphi(z_t, u, t).$$

The neural network $\varphi$ is translation invariant, meaning that $\varphi(z_t + t, u + t, t) = \varphi(z_t, u, t)$. It follows that

$$\begin{aligned}
\tilde{\mu}_t(z_t + t, u + t) &= \frac{1}{\alpha_{t|t-1}} (z_t + t) - \frac{\sigma_{t|t-1}^2}{\alpha_{t|t-1} \sigma_t} \varphi(z_t + t, u + t, t) \\
&= \frac{1}{\alpha_{t|t-1}} (z_t + t) - \frac{\sigma_{t|t-1}^2}{\alpha_{t|t-1} \sigma_t} \varphi(z_t, u, t) \\
&= \tilde{\mu}_t(z_t, u) + \frac{1}{\alpha_{t|t-1}} t = \tilde{\mu}_t(z_t, u) + \lambda_t t,
\end{aligned}$$

with $\lambda_t = 1 / \alpha_{t|t-1}$. So $\tilde{\mu}_t(z_t, u)$ is equivariant to translations, but the input and output translations are not equal because $\lambda_t \neq 1$, and hence translation equivariance of the transition distributions does not hold. More formally,

$$p(z_{t-1} + t \mid z_t + t, u + t) = \mathcal{N}(z_{t-1} + t;\ \tilde{\mu}_t(z_t, u) + \lambda_t t,\ \varsigma_t^2 I) = \mathcal{N}(z_{t-1};\ \tilde{\mu}_t(z_t, u) + (\lambda_t - 1) t,\ \varsigma_t^2 I) \neq \mathcal{N}(z_{t-1};\ \tilde{\mu}_t(z_t, u),\ \varsigma_t^2 I).$$

Equivalently, $p(z_{t-1} + t \mid z_t + t, u + t) = p(z_{t-1} + (1 - \lambda_t) t \mid z_t, u) \neq p(z_{t-1} \mid z_t, u)$.

A.4 DIFFUSION MODEL

We trained all DiffLinker models with T = 500 diffusion steps using the polynomial noise schedule $\alpha_t = (1 - 2s)\cdot(1 - (t/T)^2)$, where $s = 10^{-5}$ is a precision value that helps to avoid numerically unstable situations (Hoogeboom et al., 2022).

Sampling  For all the experiments discussed in the main text, we sampled with the same number of denoising steps T = 500 as used in training. Sampling times for all the datasets are provided in Table 3. Although the times reported in Table 3 are more than affordable for applying our method in practice, we explored the capability of DiffLinker to sample even faster without performance loss. Following Nichol & Dhariwal (2021), we additionally evaluated DiffLinker with a reduced number of denoising steps at sampling time, considering T/2, T/5, T/10, T/20, T/50 and T/100 steps. Figure 3 shows how the performance metrics obtained on the ZINC test set depend on the number of denoising steps performed in sampling. In all cases we used DiffLinker pretrained on ZINC with T = 500 denoising steps. As shown in Figure 3, our model is robust to a significant reduction of the number of denoising steps in sampling, resulting in a 10-fold gain in sampling speed without any performance degradation. Effectively, one can reduce the sampling time from 0.365 to 0.036 seconds per molecule with no significant loss in performance metrics.

A.5 DYNAMICS

EGNN takes as input a graph of atoms belonging to the linker $z_t$ and its context $u$, represented by feature vectors $h_i \in \mathbb{R}^{\text{in}}$ and coordinates $r_i \in \mathbb{R}^3$. The feature vector $h_i$ consists of atom types, a fragment flag and the time step $t$. If anchors are known, an anchor flag is additionally passed. If the model is conditioned on the protein pocket, a pocket flag is additionally passed. First, atom features are passed to the encoder: $h_i \rightarrow \text{Linear(in, nf)} \rightarrow h_i^0$. Next, as discussed in Section 4.2, L Equivariant Graph Convolutional Layers (EGCL) are sequentially applied.
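As an illustration of the input featurization described above, here is a minimal sketch; the exact feature layout, atom-type vocabulary, and time-step normalization are assumptions for illustration, not the paper's implementation:

```python
import numpy as np

# Hypothetical atom-type vocabulary size for the one-hot encoding
N_ATOM_TYPES = 4

def node_features(atom_type, is_fragment, t, T=500, anchor=None, pocket=None):
    """Build a node feature vector h_i: one-hot atom type, a fragment
    flag, the (normalized) time step, and optional anchor / pocket
    flags. The layout here is an assumed, illustrative choice."""
    onehot = np.zeros(N_ATOM_TYPES)
    onehot[atom_type] = 1.0
    feats = [onehot, [float(is_fragment)], [t / T]]
    if anchor is not None:
        feats.append([float(anchor)])
    if pocket is not None:
        feats.append([float(pocket)])
    return np.concatenate(feats)

# A fragment atom of type 1 at the middle of the diffusion trajectory
h = node_features(atom_type=1, is_fragment=True, t=250)
assert h.shape == (6,) and h[4] == 1.0 and h[5] == 0.5
```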
The learnable components of EGCL, $\phi_e$, $\phi_h$ and $\phi_r$, are implemented as neural networks that include fully-connected layers (FC), batch normalization layers (BN) and SiLU activations.

Message $\phi_e$: takes a pair of node embeddings $h_i^l$ and $h_j^l$ and the squared distance $d_{ij}^2 = \|r_i - r_j\|^2$ between these nodes, and outputs a message $m_{ij} \in \mathbb{R}^{\text{nf}}$:

concat[$h_i^l$, $h_j^l$, $d_{ij}^2$] → {FC(2·nf + 1, nf) → SiLU → FC(nf, nf) → SiLU} → $m_{ij}$

Feature update $\phi_h$: takes as input a node embedding $h_i^l$ and its aggregated message $m_i = \sum_j m_{ij}$, and returns the updated node embedding:

concat[$h_i^l$, $m_i$] → {FC(2·nf, nf) → BN → SiLU → FC(nf, nf) → BN → add($h_i^l$)} → $h_i^{l+1}$

Coordinate update $\phi_r$: takes the same input as $\phi_e$ and outputs a scalar value:

concat[$h_i^l$, $h_j^l$, $d_{ij}^2$] → {FC(2·nf + 1, nf) → SiLU → FC(nf, nf) → SiLU → FC(nf, 1)} → output

Training  We trained separate models for the ZINC, Multi-Frag and Pocket datasets. Hyperparameters of the models and the average time required to train one epoch are provided in Table 4. All models were trained on a single Tesla V100-PCIE-32GB GPU using Adam with learning rate $2 \cdot 10^{-5}$ and weight decay $10^{-13}$.
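A drastically simplified, NumPy-only sketch of one EGCL pass may clarify how the three components interact. Random matrices stand in for the trained FC stacks, batch normalization is omitted, and the coordinate update is reduced to its equivariant core; none of this is the paper's exact implementation. The final assertion checks O(3)-equivariance numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
nf = 8

def silu(x):
    return x / (1 + np.exp(-x))

# Random weights standing in for the trained FC layers (illustrative only)
W_e = rng.normal(size=(2 * nf + 1, nf), scale=0.1)
W_h = rng.normal(size=(2 * nf, nf), scale=0.1)
W_r = rng.normal(size=(2 * nf + 1, 1), scale=0.1)

def egcl(h, r):
    """One simplified EGCL pass: message phi_e, feature update phi_h,
    and an equivariant coordinate update driven by the phi_r scalar."""
    n = len(h)
    h_new, r_new = h.copy(), r.copy()
    for i in range(n):
        m_i = np.zeros(nf)
        for j in range(n):
            if i == j:
                continue
            d2 = np.sum((r[i] - r[j]) ** 2)          # invariant input
            inp = np.concatenate([h[i], h[j], [d2]])
            m_i += silu(inp @ W_e)                   # message phi_e
            w = silu(inp @ W_r).item()               # scalar from phi_r
            r_new[i] = r_new[i] + (r[i] - r[j]) * w  # equivariant update
        h_new[i] = h[i] + silu(np.concatenate([h[i], m_i]) @ W_h)  # phi_h
    return h_new, r_new

h = rng.normal(size=(4, nf))
r = rng.normal(size=(4, 3))
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))  # random rotation

h1, r1 = egcl(h, r)
h2, r2 = egcl(h, r @ Q.T)  # rotate the input coordinates
# Features are invariant and coordinates are equivariant to the rotation
assert np.allclose(h1, h2) and np.allclose(r1 @ Q.T, r2)
```

Because distances are rotation-invariant, messages and scalars are unchanged under rotation, and the coordinate update rotates with the inputs, which is the essence of the EGCL design.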

A.6 LINKER SIZE PREDICTION

A graph neural network for predicting the probabilities of the number of atoms in the prospective linker for a given set of fragments takes as input a fully-connected graph of the atoms belonging to the fragments, represented by feature vectors $h_i \in \mathbb{R}^{\text{in}}$ and inter-atomic squared distances $d_{ij}^2 = \|r_i - r_j\|^2$, and outputs a vector of probabilities $p \in [0, 1]^{\text{out}}$ corresponding to the predefined linker sizes. First, node embeddings are computed: $h_i \rightarrow \text{Linear(in, nf)} \rightarrow h_i^0$. Next, a stack of Graph Convolutional Layers (GCL) is applied. Finally, node embeddings $h_i^L$ are projected onto $\mathbb{R}^{\text{out}}$, aggregated and normalized, resulting in the vector of label probabilities:

$h_i^L$ → {FC(nf, out) → Mean → Softmax} → $p$

Training  We trained two models, for the ZINC and Multi-Frag datasets. Hyperparameters of the models and the average time required to train one epoch are provided in Table 5. Both models were trained using Adam with learning rate $10^{-4}$ and weight decay $10^{-13}$, on a single Tesla V100-PCIE-32GB GPU.
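The projection-aggregation-normalization head can be sketched as follows; a random matrix stands in for the trained FC(nf, out) layer, and the dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
nf, out = 8, 5  # hidden width and number of candidate linker sizes

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Random projection standing in for the trained FC(nf, out) head
W_out = rng.normal(size=(nf, out))

def size_probabilities(h_final):
    """h_final: (n_atoms, nf) node embeddings after the GCL stack.
    Project each node to R^out, average over nodes, then normalize."""
    logits = (h_final @ W_out).mean(axis=0)
    return softmax(logits)

h_final = rng.normal(size=(6, nf))
p = size_probabilities(h_final)
# A valid probability vector over the predefined linker sizes
assert p.shape == (out,) and np.isclose(p.sum(), 1.0) and (p >= 0).all()
```

Averaging before the softmax makes the prediction invariant to the ordering of fragment atoms, which matches the permutation-invariant aggregation described above.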

A.7 3DLINKER ON GEOM DATASET

In order to run 3DLinker on the GEOM dataset, we had to additionally filter the original test set, which consisted of 1,288 input fragment sets, removing examples with more than 3 disconnected fragments. For the remaining test set of 1,170 input fragment triplets, we ran 3DLinker twice: first to connect two randomly selected fragments (10 samples per fragment pair), and then to connect the resulting compound with the third fragment (10 samples per input). In both steps, we used half of the original linker size. Overall, we obtained 100 samples for each input fragment triplet. We used a pre-trained 3DLinker model available at https://github.com/YinanHuang/3DLinker. The results are provided in Tables 1 and 2.

A.8 EVALUATION DETAILS

The principal difference between our method and others is that we generate a 3D point cloud of atoms that should be further connected with covalent bonds, while other methods generate covalent bonds along with atom types. We emphasize that both DeLinker and 3DLinker employ valency rules at each generation step, which makes it much easier to achieve high validity of samples. In our case, DiffLinker learns these chemical rules from the data and places atoms at the relevant distances from each other. Since the output of DiffLinker is a 3D point cloud, we need to additionally compute covalent bonds between pairs of atoms based on their types and pairwise distances. To do so, we use OpenBabel (O'Boyle et al., 2011). To be consistent in the evaluation methodology, we recomputed covalent bonds using OpenBabel for all molecules in the ZINC and CASF test sets. Next, for each updated molecule, we obtained linkers by removing irrelevant atoms and saved the resulting molecules and fragments in SDF and SMILES formats. Molecules saved in SDF format were considered as ground truth and used for 3D comparison.

An alternative way of conditional linker generation with diffusion models is the inpainting strategy (Lugmayr et al., 2022), in which the model is trained to denoise a full molecule, while at inference the known part of the molecule is generated using the true denoising process. The inpainting strategy for molecular linker design can be easily implemented using vanilla EDM (Hoogeboom et al., 2022) with a slight tweak of the sampling function. This was the first approach we tried in our experiments. However, models trained in the 3D-conditioning setting, explained in Section 4.1, significantly outperformed the inpainting models. In Table 7 we provide a comparison of DiffLinker trained in the 3D-conditioning setting with DiffLinker trained in the inpainting setting. Both models have identical architectures and were trained with identical hyperparameters.
For evaluation we used the ZINC validation set (400 examples) and sampled 50 linkers for every pair of input fragments. As shown in Table 7, using the 3D-conditioned model reduces the number of invalid molecules (i.e., chemically incorrect or with disconnected fragments) by 25%. Besides, the inpainting model generates molecules with more than 2-fold higher RMSD. Even though the chemical properties of the molecules generated by both methods are comparable, the significantly lower validity indicates that the inpainting approach is suboptimal for the molecular linker design task.

A.11 2D FILTERS

The 2D filters used by Imrie et al. (2020) for constructing the ZINC and CASF datasets include synthetic accessibility (SA) (Ertl & Schuffenhauer, 2009), ring aromaticity (RA), and pan-assay interference compounds (PAINS) (Baell & Holloway, 2010) criteria. RA controls the correctness of the covalent bond orders in the rings of a linker, and PAINS checks that a molecule does not contain substructures that often give false-positive results in high-throughput screens (Baell & Holloway, 2010). Even though we used the same datasets as Imrie et al. (2020), which were created using all three filters, we modified the metric "Passed 2D Filters" by removing SA from it. Instead, we introduce our own SA-based metric that we report separately.

Problem with the SA filter used by Imrie et al. (2020) and Huang et al. (2022)  A molecule is considered to pass the synthetic accessibility filter if its SA-score is lower than the SA-score of the corresponding pair of fragments. Even though our models performed on par with or better than DeLinker and 3DLinker according to all other metrics, almost all molecules generated by our models did not pass the SA filter. We investigated this issue and found that the SMILES of fragments passed to the SA filter by DeLinker and 3DLinker contained dummy atoms representing anchors. These atoms did not have any atom type assigned, and such molecules were therefore considered hard to synthesize.
For almost all molecules in the test set, SA-scores of fragments with dummy atoms were higher than SA-scores of the whole molecules. However, for most fragments without dummy atoms, SA-scores were much lower. Figure 4 shows four examples of molecules and fragments with and without dummy atoms and the corresponding SA-scores. We conclude that the SA filter proposed by the authors of DeLinker shows nothing but the fact that molecules with unknown atoms are hard to synthesize. Therefore, we considered this metric to be irrelevant and excluded it from our report. Instead, we report the average synthetic accessibility score of full generated molecules.
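For clarity, the filter's semantics can be stated in two lines; the scores below are made-up illustrations (lower SA-score means easier to synthesize), not values from the paper:

```python
def passes_sa_filter(sa_molecule, sa_fragments):
    """The SA filter of Imrie et al. (2020): a generated molecule passes
    if its SA-score is lower than that of the corresponding fragments."""
    return sa_molecule < sa_fragments

# When the fragment SMILES contain typeless dummy atoms, the fragments
# appear hard to synthesize (high SA-score), so nearly everything passes;
# with clean fragments the same molecule can fail. Scores are made up.
assert passes_sa_filter(sa_molecule=3.1, sa_fragments=5.8)       # dummy atoms
assert not passes_sa_filter(sa_molecule=3.1, sa_fragments=2.4)   # clean fragments
```

This makes the artifact described above concrete: the pass/fail outcome is driven almost entirely by the fragment-side score, which the dummy atoms inflate.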



DiffLinker code is available at: https://anonymous.4open.science/r/DiffLinker



Figure 1: (Top) Overview of the molecular linker generation process. First, probabilities of linker sizes are computed for the input fragments. Next, linker atoms are sampled and denoised using our fragment-conditioned equivariant diffusion model. (Bottom) Example of the linker generation process. Linker atoms are highlighted in orange.

Figure 2: (Left) Examples of linkers sampled by DiffLinker conditioned on pocket atoms (top row) and unconditioned (bottom row). Linkers sampled by the unconditioned model have multiple clashes with the protein pocket. (Right) Distribution of numbers of clashes in reference test molecules and samples from three DiffLinker models differently conditioned (or unconditioned) on pocket atoms.

which we decompose into three or more fragments with one or two linkers connecting them. To achieve such splits, we use RDKit implementations of two fragmentation techniques, an MMPA-based algorithm (Dossetter et al., 2013) and BRICS (Degen et al., 2008), and combine the results, removing duplicates. Overall, we obtain 41,907 molecules and 285,140 fragmentations that are randomly split into train (282,602 examples), validation (1,250 examples) and test (1,288 examples) sets.

Figure 3: Dependency of validity, recovery and RMSD on the number of denoising steps in sampling shows that DiffLinker is robust to reducing the number of denoising steps. The robustness of DiffLinker allows for 10-fold gain in sampling speed without any performance degradation. For all experiments we used DiffLinker trained on ZINC with T = 500 steps and performed evaluation on ZINC test set sampling 250 linkers for each input set of fragments.
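As a sketch of the machinery behind Figure 3, the polynomial noise schedule from Appendix A.4 and an evenly strided subset of timesteps (in the spirit of Nichol & Dhariwal, 2021) might be computed as follows; the exact respacing scheme used in the paper is not specified, so the strided variant here is an assumption:

```python
import numpy as np

def polynomial_alpha(T=500, s=1e-5):
    # alpha_t = (1 - 2s) * (1 - (t/T)^2) for t = 0..T, as in Appendix A.4
    t = np.arange(T + 1)
    return (1 - 2 * s) * (1 - (t / T) ** 2)

def respaced_steps(T=500, factor=10):
    """Evenly strided subset of timesteps for faster sampling, e.g.
    T/10 denoising steps instead of T (an assumed respacing scheme)."""
    return np.linspace(T, 0, T // factor + 1).round().astype(int)

alpha = polynomial_alpha()
# Schedule starts near 1 (little noise) and decays to 0 at t = T
assert alpha[0] == 1 - 2e-5 and abs(alpha[-1]) < 1e-12

steps = respaced_steps(factor=10)
assert steps[0] == 500 and steps[-1] == 0 and len(steps) == 51
```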

Figure 4: Synthetic accessibility scores (SA-scores) for fragments without dummy atoms (top row), full molecules (middle row) and fragments with dummy atoms (bottom row).

Figure 5: Examples of linkers generated by DiffLinker (sampled size) for fragments from the GEOM dataset.

criteria. One molecule can therefore result in various combinations of two fragments with a linker in between. The resulting dataset is randomly split into train (438,610 examples), validation (400 examples), and test (400 examples) sets.

We randomly split the resulting data into train (185,678 examples), validation (490 examples) and test (566 examples) sets, taking into account the Enzyme Commission (EC) numbers of the proteins.

Performance metrics on ZINC, CASF and GEOM test sets. The first three metrics assess the chemical relevance of the generated molecules. The last three metrics evaluate the standard generative properties of the methods.

Metrics assessing the ability of the methods to generate molecules that are chemically and geometrically similar to the reference ones.

Sampling time for different datasets (with T = 500 denoising steps). Experiments were performed on a single Tesla V100-PCIE-32GB GPU.

Hyperparameters of EGNN models trained on ZINC, Multi-Fragment and Pocket datasets.

Hyperparameters of SizeGNN models trained on ZINC and Multi-Fragment datasets. In SizeGNN, a stack of Graph Convolutional Layers (GCL) is applied to the node embeddings; the learnable components $\phi_e$ and $\phi_h$ of GCL are implemented in the same way as for EGNN.

QED, SA and number of rings for molecules in training, validation and test datasets.

Molecules saved in SDF format were used to compute the 3D comparison metrics (RMSD and SC_RDKit). Molecules and linkers saved in SMILES format were considered as ground truth and used for 2D comparison (novelty and recovery rates). To evaluate other methods, we used the original SMILES representations.

Our samples  For each generated point cloud, we computed covalent bonds with OpenBabel and extracted the largest connected component. Next, we obtained a linker by matching the generated molecule with the corresponding fragments (computed with OpenBabel as explained above) and removing irrelevant atoms. Finally, we kekulized the resulting linker and saved the generated molecule with recomputed covalent bonds, together with the corresponding linker, in SDF and SMILES formats.

Metrics  To compute validity, we apply sanitization and additionally check that the molecule contains all atoms from the fragments. For all other metrics, we consider only the subset of valid samples. To compute novelty, we first preprocess the SMILES of the linker by removing stereochemistry and canonicalizing tautomer SMILES; then we count how many of the resulting generated linker SMILES were represented in the training set. To compute uniqueness, we compare SMILES of whole molecules and count the number of unique molecules sampled for each input pair of fragments. To compute recovery, we compare the SMILES of each molecule sampled for a given pair of fragments with the SMILES of the corresponding ground-truth molecule; before comparison, we remove hydrogens and stereochemistry from both molecules. To compute RMSD, we consider only recovered molecules and align them with the corresponding ground-truth molecules using the RDKit module rdkit.Chem.rdMolAlign, which returns the optimal RMSD for aligning two molecules. To compute the quantitative estimate of drug-likeness (QED) (Bickerton et al., 2012), we used the RDKit function rdkit.Chem.QED.qed.
To compute synthetic accessibility, we used the function calculateScore provided by Ertl & Schuffenhauer (2009) in the RDKit-compatible package sascorer.py. For calculating the number of rings, we used the RDKit function rdkit.Chem.rdMolDescriptors.CalcNumRings. Table 6 provides the mean QED, SA and number of rings computed for molecules in the training, validation and test datasets. DiffLinker results computed on the full GEOM test set with 250 samples per input are provided in Tables 8 and 9. DiffLinker results on the Pockets test set with 100 samples per input are provided in Tables 8 and 9 as well. We note that the train/test split of the Pockets dataset was performed solely based on PDB codes of the protein-ligand complexes and EC numbers of the proteins. The resulting evaluation set therefore contains 17 molecules that are also represented in the training set but bound to different proteins. For the full picture, we additionally provide a reduced test set that does not contain any molecules from the training set; it includes 453 examples from the initial test set. We provide evaluation metrics obtained on this reduced test set in Table 10.
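The optimal alignment underlying the RMSD metric can be sketched with the Kabsch algorithm. RDKit's rdMolAlign utilities additionally search over symmetry-equivalent atom matchings, which this NumPy sketch (with a fixed atom mapping) omits:

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """Optimal RMSD between two conformations (n, 3) under a fixed
    atom mapping: center both point sets, find the optimal rotation
    via SVD of the covariance matrix, then measure the residual."""
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(P.T @ Q)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    D = np.diag([1.0, 1.0, d])          # guard against reflections
    R = Vt.T @ D @ U.T                  # rotation mapping P onto Q
    return float(np.sqrt(np.mean(np.sum((P @ R.T - Q) ** 2, axis=1))))

rng = np.random.default_rng(0)
P = rng.normal(size=(5, 3))
angle = 0.7
Rz = np.array([[np.cos(angle), -np.sin(angle), 0.0],
               [np.sin(angle),  np.cos(angle), 0.0],
               [0.0,            0.0,           1.0]])
# A rotated and translated copy aligns back to RMSD ~ 0
Q = P @ Rz.T + np.array([1.0, 2.0, 3.0])
assert kabsch_rmsd(P, Q) < 1e-8
```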

Comparison of DiffLinker trained in the 3D-conditioning setting with DiffLinker trained in the inpainting setting. Evaluation was performed on the ZINC validation set; for each input pair of fragments we sampled 50 linkers. Columns: Method | Valid, % | Recovered, % | RMSD ↓ | SC_RDKit ↑ | QED ↑ | SA ↓ | # Rings ↑

A APPENDIX

Algorithm 1 Training
Input: linker x, context u, neural network φ
  Sample t ∼ U(0, . . . , T), ϵ_t ∼ N(0, I)
  z_t ← α_t x + σ_t ϵ_t
  ε̂_t ← φ(z_t, u, t)
  Minimize ∥ϵ_t − ε̂_t∥²

Algorithm 2 Sampling
Input: context u, neural network φ
  Center context u at f(u)
  Sample z_T ∼ N(0, I)
  for t in T, T − 1, . . . , 1:

A.1 APPLICATIONS

There are several major drug discovery directions in which one highly relevant, or even preferred, strategy is to design a linker for fragments that are placed in space and have fixed positions.

Fragment-Based Drug Discovery (FBDD)  By analogy with classical drug discovery methods, one of the common strategies in FBDD is to operate on fragments that strongly interact with the target protein. First, strongly binding fragments are identified and characterized (using high-throughput screening followed by X-ray / NMR, or virtual screening and docking). As a result, the exact location and orientation in which the fragments bind strongly to parts of the protein pocket are determined. The next step is to find a linker between the fragments that preserves their positions and thus the binding strength of the fragments (preferably, adding the linker will boost the binding strength of the whole molecule) (Bancet et al., 2020). A range of successful works has been reported in which the starting point for linker design was a crystal structure of a protein with fragments bound to it (Bancet et al., 2020). To name a few, inhibitors for CK2 (De Fusco et al., 2017), LDH-A (Kohlmann et al., 2013) and Dot1L (Mobitz et al., 2017), proteins playing crucial roles in the progression of various cancers, were designed by linking fragments that were experimentally observed in a bound state with the corresponding targets.

Proteolysis targeting chimera (PROTAC)  A PROTAC is a heterobifunctional small molecule designed to stimulate degradation of a target protein by connecting it to an E3 ligase.
A PROTAC consists of two ligands joined by a linker: one ligand recruits and binds a target protein while the other recruits and binds an E3 ubiquitin ligase (Békés et al., 2022). For designing PROTACs, one possible strategy is to dock the two proteins (with ligands bound to them) to explore a favorable conformation of the prospective ternary complex. This information about the initial docking pose of the proteins and the exact positions of the bound fragments is further used for designing a linker that will stabilize the whole complex (Bai et al., 2021; Farnaby et al., 2019).

Scaffold hopping  Scaffold hopping is a strategy for designing novel compounds by replacing the central core structure of a known molecule. As shown by Sun et al. (2012), various scaffold-hopping strategies rely on experimental 3D data of the initial compound bound to a target complex: the information about the geometry of the initially bound molecule is important for altering its core while increasing the binding affinity, potency or selectivity of the whole molecule. In such a case, scaffold hopping of the bound molecule can be considered as a linking problem for several disconnected fragments with fixed, known 3D coordinates.
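Returning to Algorithm 1 (training) in this appendix, the noising-and-denoising objective is compact enough to sketch. Below, a toy linear denoiser and a simple variance-preserving linear schedule stand in for the EGNN φ and the paper's polynomial schedule; all names and constants are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(z_t, u, t, theta):
    # Toy stand-in for the EGNN denoiser: predicts the noise from z_t
    return theta * z_t

def training_step(x, u, theta, T=500):
    """One step of Algorithm 1: sample t and noise, form the noised
    linker z_t, predict the noise, and return the squared error.
    Uses an illustrative variance-preserving schedule, not the paper's."""
    t = rng.integers(0, T + 1)
    alpha_t = 1.0 - t / T
    sigma_t = np.sqrt(1.0 - alpha_t**2)
    eps = rng.normal(size=x.shape)
    z_t = alpha_t * x + sigma_t * eps       # noising
    eps_hat = phi(z_t, u, t, theta)         # denoiser prediction
    return np.sum((eps - eps_hat) ** 2)     # objective to minimize

x = rng.normal(size=(4, 3))  # linker atom coordinates
u = rng.normal(size=(6, 3))  # fragment (context) coordinates
loss = training_step(x, u, theta=0.5)
assert np.isfinite(loss) and loss >= 0.0
```

In the real model the gradient of this loss with respect to the denoiser parameters would drive the update; the sketch only illustrates how `z_t` and the target noise are constructed.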

