AEDESIGN: A GRAPH PROTEIN DESIGN METHOD AND BENCHMARK ON ALPHAFOLD DB

Anonymous

Abstract

While AlphaFold has remarkably advanced protein folding, the inverse problem, protein design, in which protein sequences are predicted from the corresponding 3D structures, still faces significant challenges. First, there is no large-scale benchmark covering the vast protein space for evaluating methods and models; second, existing methods still suffer from low prediction accuracy and time-inefficient inference. This paper establishes a new benchmark based on AlphaFold DB, one of the world's largest protein structure databases. Moreover, we propose a new baseline method, AEDesign, which achieves 5% higher recovery than previous methods and about a 70-fold inference speed-up when designing long protein sequences. We also reveal AEDesign's potential for practical protein design tasks, where the designed proteins achieve good structural compatibility with native structures. The open-source code will be released.

1. INTRODUCTION

As "life machines", proteins play vital roles in almost all cellular processes, such as transcription, translation, signaling, and cell cycle control. Understanding the relationship between protein structures and their sequences brings significant scientific impact and social benefit in many fields, such as bioenergy, medicine, and agriculture (Huo et al., 2011; Williams et al., 2019). While AlphaFold2 has tentatively solved protein folding (Jumper et al., 2021; Wu et al., 2022; Lin et al., 2022; Mirdita et al., 2022; Li et al., 2022c), mapping 1D sequences to 3D structures, its inverse problem, protein design, raised by Pabo (1983), which aims to predict amino acid sequences from known 3D structures, has seen fewer breakthroughs in the ML community. The main obstacles to progress are: (1) the lack of large-scale standardized benchmarks; (2) the difficulty of improving protein design accuracy; (3) many methods are neither efficient nor open-source. Therefore, we aim to benchmark protein design and develop an effective and efficient open-source method.

Previous benchmarks may suffer from biased testing and unfair comparisons. Since SPIN (Li et al., 2014) introduced TS500 (and TS50), consisting of 500 (and 50) native structures, they have served as standard test sets for evaluating different methods (O'Connell et al., 2018; Wang et al., 2018; Chen et al., 2019; Jing et al., 2020; Zhang et al., 2020a; Qi & Zhang, 2020; Strokach et al., 2020). However, such a small number of proteins does not cover the vast protein space and is likely to produce biased tests. Besides, there are no canonical training and validation sets, so different methods may train on different data. If the training data is inconsistent, how can we determine that a performance gain comes from the method itself rather than from biases in the data distribution? Especially when the test set is small, adding training samples that match the test-set distribution can cause dramatic performance fluctuations. Considering these issues, we suggest establishing a large-scale standardized benchmark for fair and comprehensive comparisons.

Extracting expressive residue representations is a key challenge for accurate protein design, where both sequential and structural properties must be considered. For general 3D points, structural features should be rotationally and translationally invariant in classification tasks. For proteins, we should additionally consider the amino acids' stable local structure, number, and order. Previous studies (O'Connell et al., 2018; Wang et al., 2018; Ingraham et al., 2019; Jing et al., 2020) may have overlooked important protein features and data dependencies, e.g., bond angles; thus, few of them exceed 50% recovery, except DenseCPD (Qi & Zhang, 2020). How can protein features and neural models be designed to learn expressive residue representations?

Improving model efficiency is necessary for rapid iteration in research and applications. Current advanced methods have severe speed defects due to the sequential prediction paradigm. For example, GraphTrans (Ingraham et al., 2019) and GVP (Jing et al., 2020) predict residues one by one during inference rather than in parallel, which means the model must be called many times to produce an entire protein sequence. Moreover, DenseCPD (Qi & Zhang, 2020) takes 7 minutes to predict a 120-length protein on their public server. How can we improve model efficiency while ensuring accuracy?

To address these problems, we establish a new protein design benchmark and develop a graph model called AEDesign (Accurate and Efficient protein Design) that achieves SOTA accuracy and efficiency.
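The rotation and translation invariance mentioned above can be made concrete with a small sketch. The snippet below (a minimal NumPy illustration, not the paper's actual feature pipeline; function names are chosen for this example) computes two classic invariant backbone features, pairwise Cα distances and a torsion angle, and verifies that the distances are unchanged under a random rigid motion:

```python
import numpy as np

def dihedral(p0, p1, p2, p3):
    """Torsion angle (radians) defined by four consecutive backbone atoms."""
    b0, b1, b2 = p1 - p0, p2 - p1, p3 - p2
    b1 = b1 / np.linalg.norm(b1)
    # Components of b0 and b2 perpendicular to the central bond b1
    v = b0 - np.dot(b0, b1) * b1
    w = b2 - np.dot(b2, b1) * b1
    x = np.dot(v, w)
    y = np.dot(np.cross(b1, v), w)
    return np.arctan2(y, x)

def pairwise_ca_distances(ca):
    """All Ca-Ca distances: invariant to rotation and translation."""
    diff = ca[:, None, :] - ca[None, :, :]
    return np.linalg.norm(diff, axis=-1)

# Invariance check under a random rotation + translation
rng = np.random.default_rng(0)
ca = rng.normal(size=(5, 3))
q, _ = np.linalg.qr(rng.normal(size=(3, 3)))  # random orthogonal matrix
ca_moved = ca @ q.T + np.array([1.0, -2.0, 0.5])
assert np.allclose(pairwise_ca_distances(ca), pairwise_ca_distances(ca_moved))
```

Angle-based features such as the dihedral above are likewise invariant to rigid motions, which is why bond and torsion angles are natural candidates for residue representations.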
Firstly, we compare various graph models on consistent training, validation, and testing sets, all drawn from the AlphaFold Protein Structure Database (Varadi et al., 2021). In contrast to previous studies (Ingraham et al., 2019; Jing et al., 2020) that use limited-length proteins, we extend the experimental setup to proteins of arbitrary length. Secondly, we improve model accuracy by introducing protein angles as new features and designing a simplified graph transformer encoder (SGT). Thirdly, we improve model efficiency by proposing a confidence-aware protein decoder (CPD) to replace the auto-regressive decoder. Experiments show that AEDesign significantly outperforms previous methods in accuracy (+5%) and efficiency (over 70 times faster than before). We also reveal AEDesign's potential for practical protein design tasks, where the designed proteins achieve good structural compatibility with native structures.
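The efficiency gap between auto-regressive and parallel decoding comes down to the number of forward passes per protein. The toy sketch below (purely illustrative; `toy_model` is a hypothetical stand-in, not the paper's SGT/CPD architecture) counts model calls in the two decoding paradigms:

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_AA = 20  # the 20 standard amino acid types

def toy_model(node_feats, partial_seq):
    """Stand-in for a decoder: per-residue logits over 20 amino acids."""
    n = node_feats.shape[0]
    return rng.normal(size=(n, NUM_AA))

def autoregressive_decode(node_feats):
    """One model call per residue: n forward passes for an n-residue protein."""
    n = node_feats.shape[0]
    seq, calls = np.zeros(n, dtype=int), 0
    for i in range(n):
        logits = toy_model(node_feats, seq[:i])  # conditions on prefix
        calls += 1
        seq[i] = logits[i].argmax()
    return seq, calls

def one_shot_decode(node_feats):
    """All residues predicted in parallel: a single forward pass."""
    logits = toy_model(node_feats, None)
    return logits.argmax(axis=-1), 1

feats = rng.normal(size=(120, 8))  # a 120-residue protein
_, ar_calls = autoregressive_decode(feats)
_, os_calls = one_shot_decode(feats)
print(ar_calls, os_calls)  # 120 model calls vs. 1
```

For long proteins the per-call cost also grows, so the wall-clock gap between the two paradigms is typically even larger than the call count suggests.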

2. RELATED WORK

We focus on structure-based protein design (Gao et al., 2020; Pearce & Zhang, 2021; Wu et al., 2021; Ovchinnikov & Huang, 2021; Ding et al., 2022; Strokach & Kim, 2022; Li & Koehl, 2014; Greener et al., 2018; Anand et al., 2022; Karimi et al., 2020; Cao et al., 2021; Liu et al., 2022; McPartlon et al., 2022; Huang et al., 2022; Dumortier et al., 2022; Li et al., 2022a; Maguire et al., 2021; Anishchenko et al., 2021; Li et al., 2022b), and the approaches can be categorized into MLP-based, CNN-based, and GNN-based ones. Two terms need to be explained in advance: we refer to amino acids as residues, and accuracy denotes the fraction of correctly predicted residues, i.e., recovery.

Problem definition

Structure-based protein design aims to find the amino acid sequence S = {s_i : 1 ≤ i ≤ n} that folds into a known 3D structure X = {x_i ∈ R^3 : 1 ≤ i ≤ n}, where n is the number of residues. Natural proteins are composed of 20 types of amino acids, i.e., 1 ≤ s_i ≤ 20 with s_i ∈ N^+. Formally, the task is to learn a function F_θ : X → S. Because homologous proteins always share similar structures (Pearson & Sierk, 2005), the problem is underdetermined: the valid amino acid sequence may not be unique. In addition, the need to consider both 1D sequential and 3D structural information further increases the difficulty of algorithm design.

MLP-based models

These methods use a multi-layer perceptron (MLP) to predict the type of each residue: the MLP outputs a probability distribution over the 20 amino acids, and the input feature construction is the main difference between methods. SPIN (Li et al., 2014) integrates torsion angles (ϕ and ψ), fragment-derived sequence profiles, and structure-derived energy profiles to predict protein sequences. SPIN2 (O'Connell et al., 2018) adds backbone angles (θ and τ), local contact number, and neighborhood distance, improving accuracy from 30% to 34%. Wang et al. (2018) use backbone dihedrals (ϕ, ψ, and ω), the solvent-accessible surface area of backbone atoms (Cα, N, C, and O), secondary structure types (helix, sheet, loop), Cα-Cα distances, and unit direction vectors of Cα-Cα, Cα-N, and Cα-C, achieving 33.0% accuracy on 50 test proteins. MLP methods have high inference speed, but their accuracy is relatively low because structural information is only partially considered; they also require complex feature engineering based on multiple databases and computational tools, limiting their widespread usage.

CNN-based models

CNN methods extract protein features directly from the 3D structure (Torng & Altman, 2017; Boomsma & Frellsen, 2017; Weiler et al., 2018; Zhang et al., 2020a; Huang et al., 2017; Chen et al., 2019), which can be further classified as 2D CNN-based and 3D CNN-based.

Footnote: http://protein.org.cn/densecpd.html (the DenseCPD server)
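The recovery metric used to compare all of these methods can be made concrete. The sketch below is a minimal illustration (the `AMINO_ACIDS` ordering and function names are chosen for this example, not taken from any particular method): sequences are encoded as integer labels in [0, 20), and recovery is the fraction of designed residues identical to the native sequence.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residue types

def encode(seq):
    """Map a one-letter amino-acid string to integer labels in [0, 20)."""
    return np.array([AMINO_ACIDS.index(a) for a in seq])

def recovery(pred, native):
    """Fraction of designed residues identical to the native sequence."""
    p, n = encode(pred), encode(native)
    assert p.shape == n.shape, "sequences must have equal length"
    return float((p == n).mean())

print(recovery("ACDEG", "ACDEF"))  # 4 of 5 residues match
```

Because the problem is underdetermined (homologous sequences fold to similar structures), a recovery below 100% does not necessarily mean the designed sequence is invalid, which is why structural compatibility is also assessed.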

Statistics of structure-based protein design methods. N_train is the number of training samples. TS500 and TS50 are the test sets containing 500 and 50 proteins, respectively. All results are copied from the original manuscripts or related papers.

