STRUCTURE-BASED DRUG DESIGN WITH EQUIVARIANT DIFFUSION MODELS

Abstract

Structure-based drug design (SBDD) aims to design small-molecule ligands that bind with high affinity and specificity to pre-determined protein targets. Traditional SBDD pipelines start with large-scale docking of compound libraries from public databases, thus limiting the exploration of chemical space to existing, previously studied regions. Recent machine learning methods have approached this problem with atom-by-atom generation, which is computationally expensive. In this paper, we formulate SBDD as a 3D-conditional generation problem and present DiffSBDD, an SE(3)-equivariant 3D-conditional diffusion model that generates novel ligands conditioned on protein pockets. Furthermore, we curate a new dataset of experimentally determined binding complex data from Binding MOAD to provide a realistic binding scenario that complements the synthetic CrossDocked dataset. Comprehensive in silico experiments demonstrate the efficiency of DiffSBDD in generating novel and diverse drug-like ligands that engage protein pockets with high binding energies as predicted by in silico docking.

1. INTRODUCTION

The rational design of molecular compounds to act as drugs remains an outstanding challenge in biopharmaceutical research. Towards supporting such efforts, structure-based drug design (SBDD) aims to generate small-molecule ligands that bind to a specific 3D protein structure with high affinity and specificity (Anderson, 2003). However, SBDD remains very challenging and has important limitations. A traditional SBDD campaign starts with the identification and validation of a target of interest and its subsequent structural characterization using experimental structure determination methods. The first step in this process is the identification of the binding pocket, a cavity in which ligands may bind the target to elicit the desired therapeutic effect. This can be achieved via experimental means or a plethora of computational approaches (Pérot et al., 2010). Once a binding site is identified, the goal is to discover lead compounds that exhibit the desired biological activity. Importantly, to transition from leads to promising candidates, the compounds must also be evaluated against other drug development constraints that are hard to predict (toxicity, absorption, etc.). Traditionally, SBDD is handled either by high-throughput experimental or virtual screening (Lyne, 2002; Shoichet, 2004) of large chemical databases. Not only is this expensive and time-consuming, but it also limits the exploration of chemical space to the historical knowledge of previously studied molecules, with a further emphasis usually placed on commercial availability (Irwin & Shoichet, 2005). Moreover, the optimization of initial lead molecules is often a biased process, with heavy reliance on human intuition (Ferreira et al., 2015). Recent advances in geometric deep learning, especially in modeling geometric structures of biomolecules (Bronstein et al., 2021; Atz et al., 2021), provide a promising direction for structure-based drug design (Gaudelet et al., 2021).
Although deep learning-based surrogate docking models have achieved remarkable progress (Lu et al., 2022; Stärk et al., 2022), deep learning-based design of ligands that bind to target proteins is still an open problem. Early attempts represented molecules as atomic density maps and used variational auto-encoders to generate new atomic density maps corresponding to novel molecules (Ragoza et al., 2022). However, it is nontrivial to map atomic density maps back to molecules, necessitating a subsequent atom-fitting stage. Follow-up work addressed this limitation by representing molecules as 3D graphs with atomic coordinates and types, which circumvents the unnecessary post-processing

