XTRIMOABFOLD: IMPROVING ANTIBODY STRUCTURE PREDICTION WITHOUT MULTIPLE SEQUENCE ALIGNMENTS

Abstract

Antibodies, which the immune system uses to identify and neutralize foreign objects such as pathogenic bacteria and viruses, play an important role in the immune response. In drug engineering, an essential task is to design a novel antibody whose paratope (a substructure of the antibody) binds the epitope of a specific antigen with high precision. Understanding the structure of an antibody and its paratope also facilitates a mechanistic understanding of its function. Antibody structure prediction is therefore a highly valuable problem for drug discovery. AlphaFold2, a breakthrough in structural biology, provides a feasible way to predict protein structure from protein sequences and computationally expensive coevolutionary multiple sequence alignments (MSAs). However, its computational cost and its limited prediction accuracy on antibodies, especially on the complementarity-determining regions (CDRs), restrict its use in industrial high-throughput drug design. In this paper, we present a novel method named xTrimoABFold that predicts antibody structure from the antibody sequence, based on a pretrained antibody language model (ALM) together with homologous templates retrieved from the Protein Data Bank (PDB) via fast and cheap search algorithms. xTrimoABFold outperforms the MSA-based AlphaFold2 and the protein-language-model-based SOTAs, e.g., OmegaFold, HelixFold-Single, and IgFold, by a large margin (30+% improvement in RMSD) while running 151x faster than AlphaFold2. To the best of our knowledge, xTrimoABFold is the best-performing antibody structure predictor to date.

1. INTRODUCTION

Antibodies are an important class of proteins for disease diagnosis and treatment. Because the structure of an antibody is closely tied to its function, antibody structure prediction, which aims to predict the 3D coordinates of the atoms in an antibody, is essential in biological and medical applications such as protein engineering, modifying antigen-binding affinity, and identifying the epitope of a specific antibody. However, manual experimental methods such as X-ray crystallography are time-consuming and expensive. Recently, deep learning methods have achieved great success in protein structure prediction (Jumper et al., 2021; Baek et al., 2021; Li et al., 2022). In short, these methods combine evolutionary and geometric information about protein structures with deep neural networks. In particular, AlphaFold2 (Jumper et al., 2021) introduces an architecture that jointly models multiple sequence alignments (MSAs) and pairwise information, and it predicts protein structures end to end with near-experimental accuracy. Nevertheless, unlike general proteins, antibodies do not evolve naturally; rather, they bind to specific antigens and evolve specifically (fast, one-way evolution). As a result, MSAs of antibodies, especially for the complementarity-determining regions (CDRs), are not always available or reliable, which hurts the accuracy of MSA-based models on antibody data. With the development of large-scale pretrained language models, many protein language models (Rao et al., 2020; Elnaggar et al., 2022; Rives et al., 2021; Rao et al., 2021; Ruffolo et al., 2021; Ofer et al.,

