STRUCTURE-BASED DRUG DESIGN WITH EQUIVARIANT DIFFUSION MODELS

Abstract

Structure-based drug design (SBDD) aims to design small-molecule ligands that bind with high affinity and specificity to pre-determined protein targets. Traditional SBDD pipelines start with large-scale docking of compound libraries from public databases, thus limiting the exploration of chemical space to existing, previously studied regions. Recent machine learning methods have approached this problem with atom-by-atom generation, which is computationally expensive. In this paper, we formulate SBDD as a 3D-conditional generation problem and present DiffSBDD, an SE(3)-equivariant 3D-conditional diffusion model that generates novel ligands conditioned on protein pockets. Furthermore, we curate a new dataset of experimentally determined binding complex data from Binding MOAD to provide a realistic binding scenario that complements the synthetic CrossDocked dataset. Comprehensive in silico experiments demonstrate the efficiency of DiffSBDD in generating novel and diverse drug-like ligands that engage protein pockets with high binding energies as predicted by in silico docking.

1. INTRODUCTION

The rational design of molecular compounds to act as drugs remains an outstanding challenge in biopharmaceutical research. Towards supporting such efforts, structure-based drug design (SBDD) aims to generate small-molecule ligands that bind to a specific 3D protein structure with high affinity and specificity (Anderson, 2003). However, SBDD remains very challenging and has important limitations. A traditional SBDD campaign starts with the identification and validation of a target of interest and its subsequent structural characterization using experimental structural determination methods. The first step in this process is the identification of the binding pocket, a cavity in which ligands may bind the target to elicit the desired therapeutic effect. This can be achieved via experimental means or a plethora of computational approaches (Pérot et al., 2010). Once a binding site is identified, the goal is to discover lead compounds that exhibit the desired biological activity. Importantly, to transition from leads to promising candidates, the compounds need to be evaluated against other drug development constraints that are also hard to predict (toxicity, absorption, etc.). Traditionally, SBDD is handled either by high-throughput experimental or virtual screening (Lyne, 2002; Shoichet, 2004) of large chemical databases. Not only is this expensive and time-consuming, but it also limits the exploration of chemical space to the historical knowledge of previously studied molecules, with a further emphasis usually placed on commercial availability (Irwin & Shoichet, 2005). Moreover, the optimization of initial lead molecules is often a biased process, with heavy reliance on human intuition (Ferreira et al., 2015). Recent advances in geometric deep learning, especially in modeling geometric structures of biomolecules (Bronstein et al., 2021; Atz et al., 2021), provide a promising direction for structure-based drug design (Gaudelet et al., 2021).
Even though utilizing deep learning as surrogate docking models has achieved remarkable progress (Lu et al., 2022; Stärk et al., 2022), deep learning-based design of ligands that bind to target proteins is still an open problem. Early attempts represented molecules as atomic density maps and used variational auto-encoders to generate new atomic density maps corresponding to novel molecules (Ragoza et al., 2022). However, it is nontrivial to map atomic density maps back to molecules, necessitating a subsequent atom-fitting stage. Follow-up work addressed this limitation by representing molecules as 3D graphs with atomic coordinates and types, which circumvents the unnecessary post-processing steps. Li et al. (2021) proposed an autoregressive generative model to sample ligands given the protein pocket as a conditioning constraint. Peng et al. (2022) improved this method by using an E(3)-equivariant graph neural network which respects rotation and translation symmetries in 3D space. Similarly, Drotár et al. (2021); Liu et al. (2022) used autoregressive models to generate atoms sequentially and incorporate angles during the generation process. Li et al. (2021) formulated the generation process as a reinforcement learning problem and connected the generator with Monte Carlo Tree Search for protein pocket-conditioned ligand generation. However, the main premise of sequential generation methods may not hold in real scenarios, since there is no ordering of the generation process and, as a result, the global context of the generated ligands may be lost. In addition, sequential methods incur additional computational cost that makes model inference inefficient (Luo et al., 2021; Peng et al., 2022). An alternative is a one-shot generation strategy that samples the atomic coordinates and types of all the atoms at once (Du et al., 2022b).

Figure 1: DiffSBDD in the protein-conditioned scenario. We first simulate the forward diffusion process q to gain a trajectory of progressively noised samples over T timesteps. We then train a model p_θ to reverse, or denoise, this process conditional on the target structure. Once trained, we can generate new drug candidates by denoising samples drawn from a Gaussian distribution N(0, I). Both atom features and coordinates are diffused throughout the process. Ligands (z^(L)) are represented as fully-connected graphs during the diffusion process (edges not shown for clarity), and covalent bonds are added to the resultant point cloud at the end of generation. The protein (z^(P)) is represented as a graph but is shown as a surface here for clarity.

In this work, we develop an equivariant diffusion model for structure-based drug design (DiffSBDD) which, to the best of our knowledge, is the first of its kind. Specifically, we formulate SBDD as a 3D-conditioned generation problem where we aim to generate diverse ligands with high binding affinity for specific protein targets. We propose an SE(3)-equivariant 3D-conditional diffusion model that respects translation, rotation, and permutation symmetries. We introduce two strategies for producing new ligands conditioned on protein pockets: protein-conditioned generation and ligand-inpainting generation. Protein-conditioned generation treats the protein as a fixed context, while ligand-inpainting models the joint distribution of the protein-ligand complex, and new ligands are inpainted at inference time. We also demonstrate that our model can be used out of the box for molecular optimization. We further curate an experimentally determined binding dataset derived from Binding MOAD (Hu et al., 2005), which supplements the commonly used synthetic CrossDocked (Francoeur et al., 2020) dataset to validate our model performance under realistic binding scenarios. The experimental results demonstrate that DiffSBDD is capable of generating novel, diverse and drug-like ligands with predicted high binding affinities to given protein pockets. The code is available at https://anonymous.4open.science/r/DiffSBDD-AF75/.

2. BACKGROUND

Denoising Diffusion Probabilistic Models Denoising diffusion probabilistic models (DDPMs) (Sohl-Dickstein et al., 2015; Ho et al., 2020) are a class of generative models in-

