E3BIND: AN END-TO-END EQUIVARIANT NETWORK FOR PROTEIN-LIGAND DOCKING

Abstract

In silico prediction of the ligand binding pose to a given protein target is a crucial but challenging task in drug discovery. This work focuses on flexible blind self-docking, where we aim to predict the positions, orientations and conformations of docked molecules. Traditional physics-based methods usually suffer from inaccurate scoring functions and high inference costs. Recently, data-driven methods based on deep learning techniques have attracted growing interest thanks to their efficiency during inference and promising performance. These methods usually either adopt a two-stage approach, first predicting the distances between proteins and ligands and then generating the final coordinates from the predicted distances, or directly predict the global roto-translation of ligands. In this paper, we take a different route. Inspired by the resounding success of AlphaFold2 for protein structure prediction, we propose E3Bind, an end-to-end equivariant network that iteratively updates the ligand pose. E3Bind models the protein-ligand interaction through careful consideration of the geometric constraints in docking and the local context of the binding site. Experiments on standard benchmark datasets demonstrate the superior performance of our end-to-end trainable model compared to traditional and recently proposed deep learning methods.

1. INTRODUCTION

For nearly a century, small molecules, or organic compounds with low molecular weight, have been the major weapon of the pharmaceutical industry. They take effect by ligating (binding) to their target, usually a protein, to alter the molecular pathways of diseases. The structure of the protein-ligand interface holds the key to understanding the potency, mechanisms and potential side effects of small molecule drugs. Despite huge efforts made for protein-ligand complex structure determination, there are so far only about $10^4$ protein-ligand complex structures available in the Protein Data Bank (PDB) (Berman et al., 2000), a number dwarfed by the enormous combinatorial space of possible complexes between $10^{60}$ drug-like molecules (Hert et al., 2009; Reymond & Awale, 2012) and at least 20,000 human proteins (Gaudet et al., 2017; Consortium, 2019). This gap highlights the urgent need for in silico protein-ligand docking methods. Furthermore, a fast and accurate docking tool capable of predicting binding poses for molecules yet to be synthesized would empower mass-scale virtual screening (Lyu et al., 2019), a vital step in modern structure-based drug discovery (Ferreira et al., 2015). It also provides pharmaceutical scientists with an interpretable, information-rich result.

Predicting the docked pose of a ligand is as challenging as it is crucial. Traditional docking methods (Halgren et al., 2004; Morris et al., 1996; Trott & Olson, 2010; Coleman et al., 2013) rely on physics-inspired scoring functions and extensive conformation sampling to obtain the predicted binding pose. Some deep learning methods focus on learning a more accurate scoring function (McNutt et al., 2021; Méndez-Lucio et al., 2021), but at the cost of even lower inference speed because they retain the sampling-scoring framework.
Distinct from the above methods, TankBind (Lu et al., 2022) removes the burden of conformation sampling by predicting the protein-ligand distance map, then converting the distance map to a docked pose using gradient descent. The optimization objective is the weighted sum of the protein-ligand distance error with respect to the predicted distance map and the intra-ligand distance error with respect to the reference intra-ligand distances. This two-stage approach might run into problems during the distance-to-coordinate transformation, as the predicted distance map is, in many cases, not a valid Euclidean distance matrix (Liberti et al., 2014). Recently, Stärk et al. (2022) proposed EquiBind, an equivariant model that directly predicts the coordinates of the docked pose. EquiBind updates the ligand conformation with a graph neural network, then roto-translates the ligand into the pocket using a key-point alignment mechanism. It enjoys a significant speedup compared to the popular docking baselines (Hassan et al., 2017; Koes et al., 2013) and provides good pose initializations for them, but its docking performance on its own is less satisfactory. This is probably because, after the one-shot roto-translation, the ligand might fall into an unfavorable position while its conformation cannot be further refined.

In this paper, we move one step forward in this important direction and propose E3Bind, the first end-to-end equivariant network that iteratively docks the ligand into the binding pocket. Inspired by AlphaFold2 (Jumper et al., 2021), our model comprises a feature extractor named Trioformer and an iterative coordinate refinement module. The Trioformer encodes the protein and ligand graphs into three information-rich embeddings: the protein residue embeddings, the ligand atom embeddings and the protein-ligand pair embeddings, where the pair embeddings are fused with geometry awareness to enforce the implicit constraints in docking.
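To make the two-stage distance-to-coordinate step discussed above concrete, the following is a minimal NumPy sketch of fitting ligand coordinates to a target distance map by gradient descent. It is not TankBind's implementation: the function names (`pose_loss_and_grad`, `fit_pose_to_distances`, `pairwise_dist`), the squared-error form of the two terms, the weight `w_intra`, the step size and the toy data are all assumptions chosen for illustration. The toy target map is computed from a known pose, so it is a valid Euclidean distance matrix; a predicted map generally would not be.

```python
import numpy as np


def pose_loss_and_grad(ligand, protein, d_pl, d_ll, w_intra=1.0, eps=1e-8):
    """Distance-matching loss and its gradient w.r.t. the ligand coordinates.

    ligand:  (n_l, 3) current ligand atom coordinates
    protein: (n_p, 3) fixed protein node coordinates
    d_pl:    (n_p, n_l) target protein-ligand distances (e.g. a predicted map)
    d_ll:    (n_l, n_l) symmetric reference intra-ligand distances
    """
    # Protein-ligand term: diff_pl[i, j] = ligand[j] - protein[i]
    diff_pl = ligand[None, :, :] - protein[:, None, :]
    dist_pl = np.linalg.norm(diff_pl, axis=-1) + eps
    err_pl = dist_pl - d_pl

    # Intra-ligand term: diff_ll[j, k] = ligand[j] - ligand[k]
    diff_ll = ligand[:, None, :] - ligand[None, :, :]
    dist_ll = np.linalg.norm(diff_ll, axis=-1) + eps
    err_ll = dist_ll - d_ll
    np.fill_diagonal(err_ll, 0.0)  # an atom has no distance error to itself

    loss = np.sum(err_pl ** 2) + w_intra * np.sum(err_ll ** 2)

    # Gradient of the protein-ligand term, summed over protein nodes.
    grad = np.sum((2.0 * err_pl / dist_pl)[..., None] * diff_pl, axis=0)
    # Each ligand pair appears twice in the symmetric sum, hence the factor 2.
    grad += 2.0 * w_intra * np.sum(
        (2.0 * err_ll / dist_ll)[..., None] * diff_ll, axis=1
    )
    return loss, grad


def fit_pose_to_distances(init_ligand, protein, d_pl, d_ll, lr=0.002, n_steps=2000):
    """Plain gradient descent on the ligand coordinates (protein stays fixed)."""
    ligand = init_ligand.copy()
    for _ in range(n_steps):
        _, grad = pose_loss_and_grad(ligand, protein, d_pl, d_ll)
        ligand -= lr * grad
    return ligand


def pairwise_dist(a, b):
    """All pairwise Euclidean distances between rows of a and rows of b."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)


# Toy data: random "protein" nodes and a small "ligand" with a known pose.
rng = np.random.default_rng(0)
protein = 5.0 * rng.normal(size=(20, 3))
true_ligand = rng.normal(size=(6, 3))
d_pl = pairwise_dist(protein, true_ligand)
d_ll = pairwise_dist(true_ligand, true_ligand)

init_ligand = true_ligand + rng.normal(size=(6, 3))  # perturbed starting pose
loss_before, _ = pose_loss_and_grad(init_ligand, protein, d_pl, d_ll)
fitted = fit_pose_to_distances(init_ligand, protein, d_pl, d_ll)
loss_after, _ = pose_loss_and_grad(fitted, protein, d_pl, d_ll)
print("loss before:", loss_before, "loss after:", loss_after)
```

When the target map is not realizable by any set of 3D coordinates, no pose drives this loss to zero, which is one way to see why the second stage can be error-prone.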
Our coordinate refinement module decodes the rich representations into E(3)-equivariant coordinate updates. The iterative coordinate update scheme feeds the output pose of a decoder block as the initial pose of the next one, allowing the model to dynamically sense the local context and fix potential errors (see Figure 3). We further propose a self-confidence predictor to select the final pose and evaluate the soundness of our predictions. E3Bind is trained end-to-end with loss directly defined on the output ligand coordinates, relieving the burden of conformation sampling or distance-to-coordinate transformation.

Our contributions can be summarized as follows:
• We formulate the docking problem as an iterative refinement process where the model updates the ligand coordinates based on the current context at each iteration.
• We propose an end-to-end E(3)-equivariant network to generate the coordinate updates. The network comprises an expressive geometry-aware encoder and an equivariant context-aware coordinate update module.
• Quantitative results show that E3Bind outperforms both traditional score-based methods and recent deep learning models.
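As an illustration of what an iterative, E(3)-equivariant coordinate update means in practice, here is a small NumPy sketch. It is not the E3Bind decoder: it uses a generic EGNN-style rule in which each ligand atom moves along its relative vectors to protein nodes, scaled by scalars that depend only on invariant distances. The names `equivariant_step`, `refine` and `toy_weight` are hypothetical, and `toy_weight` is a hand-written stand-in for what would be a learned function of the embeddings.

```python
import numpy as np


def toy_weight(dist):
    # Hypothetical stand-in for a learned scalar function of invariant features:
    # positive (push away) below 4 units, negative (pull closer) above.
    return 0.05 * np.tanh(4.0 - dist)


def equivariant_step(ligand, protein, weight_fn=toy_weight):
    """One coordinate update built only from relative vectors and distances."""
    diff = ligand[:, None, :] - protein[None, :, :]   # (n_l, n_p, 3), rotates with the input
    dist = np.linalg.norm(diff, axis=-1)              # (n_l, n_p), invariant
    weights = weight_fn(dist)                         # invariant scalars
    update = np.mean(weights[..., None] * diff, axis=1)
    return ligand + update


def refine(ligand, protein, n_iters=4, weight_fn=toy_weight):
    """Feed each block's output pose in as the next block's input pose."""
    poses = []
    pose = ligand
    for _ in range(n_iters):
        pose = equivariant_step(pose, protein, weight_fn)
        poses.append(pose)
    return poses


# Equivariance check: transforming the inputs transforms every output pose
# in the same way.
rng = np.random.default_rng(1)
protein = 4.0 * rng.normal(size=(15, 3))
ligand = rng.normal(size=(5, 3))

q, _ = np.linalg.qr(rng.normal(size=(3, 3)))  # random orthogonal matrix
shift = rng.normal(size=3)

poses = refine(ligand, protein)
poses_moved = refine(ligand @ q.T + shift, protein @ q.T + shift)
max_dev = max(np.abs(pm - (p @ q.T + shift)).max() for p, pm in zip(poses, poses_moved))
print("max deviation from equivariance:", max_dev)
```

Because the update is a weighted sum of relative vectors with invariant weights, the property holds for any choice of `weight_fn`; intermediate poses are kept so that a separate selection step could choose among them.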

2. RELATED WORKS

Protein-ligand docking. Traditional approaches to protein-ligand docking (Morris et al., 1996; Halgren et al., 2004; Coleman et al., 2013) mainly adopt a sampling, scoring, ranking, and fine-tuning paradigm, with AutoDock Vina (Trott & Olson, 2010) being a popular example. Each part of the docking pipeline has been extensively studied in the literature to increase both accuracy and speed (Durrant & McCammon, 2011; Liu et al., 2013; Hassan et al., 2017; Zhang et al., 2020). Multiple subsequent works use deep learning on 3D voxels (Ragoza et al., 2017; Francoeur et al., 2020; McNutt et al., 2021; Bao et al., 2021) or graphs (Méndez-Lucio et al., 2021) to improve the scoring functions. Nevertheless, these methods are generally inefficient, often taking minutes or more to predict the docking poses of a single protein-ligand pair, which hinders large-scale virtual screening experiments. Recently, methods that directly model the distance geometry between protein-ligand pairs have been investigated (Masters et al., 2022; Lu et al., 2022; Zhou et al., 2022). They adopt a two-stage approach for docking, generating docked poses from predicted protein-ligand distance maps using post-optimization algorithms. Advanced techniques in geometric deep learning, e.g. triangle attention (Jumper et al., 2021) with geometric constraints, have been leveraged to encourage the local geometrical consistency of the distance map (Lu et al., 2022). To bypass the error-prone two-stage framework, EquiBind (Stärk et al., 2022) proposes a fully differentiable equivariant model, which directly predicts coordinates of docked poses with a novel attention-based key-point alignment mechanism (Ganea et al., 2021b). Despite being more efficient, EquiBind fails to beat popular docking baselines on its own, stressing the importance of increasing model expressiveness.
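For readers unfamiliar with key-point alignment, a rigid roto-translation between two matched point sets can be obtained in closed form with the Kabsch algorithm. The NumPy sketch below shows that standard algorithm only; treating it as representative of the alignment step in EquiBind is an assumption, and the function name `kabsch` and the toy key points are illustrative.

```python
import numpy as np


def kabsch(src, dst):
    """Rigid transform (rot, trans) minimizing sum_i ||rot @ src[i] + trans - dst[i]||^2.

    src, dst: (n, 3) arrays of matched points.
    """
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    cov = (src - src_c).T @ (dst - dst_c)            # 3x3 cross-covariance
    u, _, vt = np.linalg.svd(cov)
    # Flip the last singular direction if needed so that det(rot) = +1
    # (a proper rotation, not a reflection).
    sign = 1.0 if np.linalg.det(vt.T @ u.T) >= 0 else -1.0
    rot = vt.T @ np.diag([1.0, 1.0, sign]) @ u.T
    trans = dst_c - rot @ src_c
    return rot, trans


# Toy check: recover a known rotation and translation from matched key points.
rng = np.random.default_rng(2)
ligand_keypoints = rng.normal(size=(8, 3))

true_rot, _ = np.linalg.qr(rng.normal(size=(3, 3)))
if np.linalg.det(true_rot) < 0:
    true_rot[:, 0] *= -1.0                           # make it a proper rotation
true_trans = rng.normal(size=3)
pocket_keypoints = ligand_keypoints @ true_rot.T + true_trans

rot, trans = kabsch(ligand_keypoints, pocket_keypoints)
aligned = ligand_keypoints @ rot.T + trans
print("max alignment error:", np.abs(aligned - pocket_keypoints).max())
```

A single alignment of this kind moves the ligand as a rigid body, so any error left in the conformation or the placement is not revisited afterwards.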

