XTRIMOABFOLD: IMPROVING ANTIBODY STRUCTURE PREDICTION WITHOUT MULTIPLE SEQUENCE ALIGNMENTS

Abstract

Antibodies, which the immune system uses to identify and neutralize foreign objects such as pathogenic bacteria and viruses, play an important role in immunity. In drug engineering, an essential task is to design a novel antibody whose paratope (a substructure of the antibody) binds to the epitope of a specific antigen with high precision. Understanding the structure of an antibody and its paratope also facilitates a mechanistic understanding of its function. Antibody structure prediction has therefore long been a highly valuable problem for drug discovery. AlphaFold2, a breakthrough in structural biology, provides a feasible way to predict protein structure from protein sequences and computationally expensive coevolutionary multiple sequence alignments (MSAs). However, its computational cost and its unsatisfactory prediction accuracy on antibodies, especially on the complementarity-determining regions (CDRs), limit its application to industrial high-throughput drug design. In this paper, we present xTrimoABFold, a novel method that predicts antibody structure from the antibody sequence using a pretrained antibody language model (ALM) together with homologous templates, which are retrieved from the Protein Data Bank (PDB) via fast and cheap algorithms. xTrimoABFold outperforms the MSA-based AlphaFold2 and the protein-language-model-based SOTAs, e.g., OmegaFold, HelixFold-Single and IgFold, by a significant margin (30+% improvement on RMSD) while running 151x faster than AlphaFold2. To the best of our knowledge, xTrimoABFold is the most accurate antibody structure predictor to date.

1. INTRODUCTION

Antibodies are an important class of proteins for disease diagnosis and treatment. The structures of antibodies are closely related to their functions, so antibody structure prediction, which aims to predict the 3D coordinates of the atoms in an antibody, is essential in biological and medical applications such as protein engineering, modifying antigen binding affinity, and identifying the epitope of a specific antibody. However, manual experimental methods such as X-ray crystallography are time-consuming and expensive. Recently, deep learning methods have achieved great success in protein structure prediction (Jumper et al., 2021; Baek et al., 2021; Li et al., 2022). In short, these methods combine evolutionary and geometric information about protein structures with deep neural networks. In particular, AlphaFold2 (Jumper et al., 2021) introduces an architecture that jointly models multiple sequence alignments (MSAs) and pairwise information and can predict protein structures end-to-end with near-experimental accuracy. Nevertheless, unlike general proteins, antibodies do not evolve naturally; rather, they bind to specific antigens and evolve specifically (fast, one-way evolution). As a result, the MSAs of antibodies, especially on the complementarity-determining regions (CDRs), are not always available or reliable, which hurts the accuracy of such models on antibody data.
With the development of large-scale pretrained language models, many protein language models (Rao et al., 2020; Elnaggar et al., 2022; Rives et al., 2021; Rao et al., 2021; Ruffolo et al., 2021; Ofer et al., 2021; Wu et al., 2022) have been developed to generate representations of protein sequences, and they show promising performance on contact prediction (Iuchi et al., 2021; Rao et al., 2020), functional property prediction (Meier et al., 2021; Hie et al., 2021) and structure prediction from a single sequence (Hong et al., 2022; Wu et al., 2022; Fang et al., 2022; Chowdhury et al., 2021; Ruffolo & Gray, 2022). These single-sequence-based structure predictors typically follow a two-stage framework that i) trains a protein language model (PLM) on large-scale unlabeled protein databases, e.g., UniRef50, UniRef90, BFD or the Observed Antibody Space (OAS) database (Olsen et al., 2022a), and ii) employs Evoformer variants and structure module variants to predict protein structures from the representation learned by the pretrained PLM. Their experimental results show accuracy comparable to that of standard AlphaFold2 with much higher efficiency, because they skip the computationally expensive CPU-based MSA search stage. Although large-scale PLMs show promising results, none of the PLM-based methods outperforms AlphaFold2 on both general protein databases and antibody databases.
Contribution. In this paper, we focus on one of the most important problems in the field of drug discovery: antibody structure prediction. We claim that, for structure prediction on antibodies, a general protein language model (PLM) is not the best option. Instead, we employ an antibody language model (ALM) pretrained on the large-scale OAS database and use Evoformer and structure modules to learn antibody structures in an end-to-end fashion. We also design fast and cheap template search algorithms based on two modalities, sequence and structure.
The retrieved templates give xTrimoABFold a good starting point to learn from. We construct two large databases from RCSB PDB (Berman et al., 2000) (footnote 0): an antibody structure database of 19K structures and a protein structure database of 501K structures. Experimental results show that xTrimoABFold performs much better than all the latest SOTAs, e.g., AlphaFold2, OmegaFold, HelixFold-Single and IgFold, by a significant margin (30+% improvement on RMSD). To the best of our knowledge, xTrimoABFold is currently the most accurate antibody structure prediction model. We believe that such a large improvement in antibody structure prediction will have a substantive impact on drug discovery.
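To make the RMSD figures quoted above concrete, the following minimal sketch computes the root-mean-square deviation between a predicted and a reference set of atom coordinates, and the relative reduction of one RMSD with respect to a baseline. It is an illustration under stated assumptions, not the evaluation code used for xTrimoABFold: the function names `rmsd` and `relative_improvement` are hypothetical, the two structures are assumed to be already superimposed (no alignment step is performed), and "improvement on RMSD" is read here as a relative reduction.

```python
import math
from typing import Sequence, Tuple

Coord = Tuple[float, float, float]


def rmsd(pred: Sequence[Coord], ref: Sequence[Coord]) -> float:
    """RMSD between two equally long coordinate lists.

    Assumption: the structures are already superimposed; no optimal
    rotation/translation is computed here.
    """
    if len(pred) != len(ref) or len(pred) == 0:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared_error = sum(
        (px - rx) ** 2 + (py - ry) ** 2 + (pz - rz) ** 2
        for (px, py, pz), (rx, ry, rz) in zip(pred, ref)
    )
    return math.sqrt(squared_error / len(pred))


def relative_improvement(baseline_rmsd: float, new_rmsd: float) -> float:
    """Fractional RMSD reduction relative to a baseline (0.30 means 30% lower)."""
    if baseline_rmsd <= 0:
        raise ValueError("baseline RMSD must be positive")
    return (baseline_rmsd - new_rmsd) / baseline_rmsd


# Toy data (hypothetical): three atoms on a line, predictions shifted along y.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
baseline_pred = [(0.0, 2.0, 0.0), (1.0, 2.0, 0.0), (2.0, 2.0, 0.0)]
improved_pred = [(0.0, 1.0, 0.0), (1.0, 1.0, 0.0), (2.0, 1.0, 0.0)]

baseline_score = rmsd(baseline_pred, reference)   # 2.0
improved_score = rmsd(improved_pred, reference)   # 1.0
print(relative_improvement(baseline_score, improved_score))  # 0.5
```

Under this reading, a 30+% improvement means the new model's RMSD is at most 0.7 times that of the baseline on the same test structures.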

2. RELATED WORKS

Protein & Antibody Structure Prediction. Protein structure prediction aims to obtain the 3D coordinates from an amino acid sequence and has been an important open research problem for over 50 years (Dill et al., 2008; Anfinsen, 1973). In recent years, deep learning methods have been widely used in protein structure prediction, and considerable progress has been made by exploiting the co-evolution information in multiple sequence alignments (MSAs), as in AlphaFold (Senior et al., 2019; 2020), AlphaFold2 (Jumper et al., 2021), OpenFold (Ahdritz et al., 2021) and RoseTTAFold (Baek et al., 2021). However, these methods are time-consuming and strictly dependent on MSAs, which remains a challenge for the structure prediction of orphan proteins with little homologous information and of antibodies, for which MSAs are not always useful because of their fast-evolving nature. Recently, Lin et al. (2022), Fang et al. (2022) and Wu et al. (2022) perform protein structure prediction with large protein language models that no longer depend on MSAs, which drastically reduces computation time but incurs a certain loss of prediction precision. In particular, models such as DeepAb (Ruffolo et al., 2022), ABlooper (Abanades et al., 2022) and IgFold (Ruffolo & Gray, 2022) are developed specifically for antibody structure prediction.

Footnote 0: The two databases are split by release date on January 17th,

Pretrained Language Model on General Protein and Antibody.
• General Protein Language Model (PLM). Typically, protein language models (Rao et al., 2020; Elnaggar et al., 2022; Rives et al., 2021; Rao et al., 2021; Ruffolo et al., 2021; Ofer et al., 2021; Wu et al., 2022) employ popular variants of the transformer architecture (Vaswani et al., 2017) and are trained on different protein databases, such as UniRef50, UniRef90 and BFD. For example, Rao et al. (2021) introduce ESM variants and use axial attention to learn row and column representations from MSAs. Elnaggar et al. (2022) train several PLMs with different numbers of parameters on the UniRef and BFD datasets. Lin et al. (2022) extend ESM and propose ESM-2, which uses relative position encoding to capture the intrinsic interactions of amino acids in the sequence. Wu et al. (2022) pro-

