AEDESIGN: A GRAPH PROTEIN DESIGN METHOD AND BENCHMARK ON ALPHAFOLD DB
Anonymous

Abstract

While AlphaFold has remarkably advanced protein folding, the inverse problem, protein design, in which protein sequences are predicted from the corresponding 3D structures, still faces significant challenges. First, there is no large-scale benchmark covering the vast protein space for evaluating methods and models; second, existing methods still suffer from low prediction accuracy and time-inefficient inference. This paper establishes a new benchmark based on AlphaFold DB, one of the world's largest protein structure databases. Moreover, we propose a new baseline method called AEDesign, which achieves 5% higher recovery than previous methods and about a 70-fold inference speed-up when designing long protein sequences. We also reveal AEDesign's potential for practical protein design tasks, where the designed proteins achieve good structural compatibility with native structures. The open-source code will be released.

Under review as a conference paper at ICLR 2022

1. INTRODUCTION

As "life machines", proteins play vital roles in almost all cellular processes, such as transcription, translation, signaling, and cell cycle control. Understanding the relationship between protein structures and their sequences brings significant scientific impacts and social benefits in many fields, such as bioenergy, medicine, and agriculture (Huo et al., 2011; Williams et al., 2019). While AlphaFold2 has tentatively solved protein folding (Jumper et al., 2021; Wu et al., 2022; Lin et al., 2022; Mirdita et al., 2022; Li et al., 2022c) from 1D sequences to 3D structures, its inverse problem, i.e., protein design, raised by (Pabo, 1983), which aims to predict amino acid sequences from known 3D structures, has seen fewer breakthroughs in the ML community. The main obstacles to research progress are: (1) the lack of large-scale standardized benchmarks; (2) the difficulty of improving protein design accuracy; (3) the fact that many methods are neither efficient nor open-source. Therefore, we aim to benchmark protein design and develop an effective and efficient open-source method.

Previous benchmarks may suffer from biased testing and unfair comparisons. Since SPIN (Li et al., 2014) introduced TS500 (and TS50), consisting of 500 (and 50) native structures, it has served as a standard test set for evaluating different methods (O'Connell et al., 2018; Wang et al., 2018; Chen et al., 2019; Jing et al., 2020; Zhang et al., 2020a; Qi & Zhang, 2020; Strokach et al., 2020). However, such a small number of proteins does not cover the vast protein space and is likely to lead to biased tests. Besides, there are no canonical training and validation sets, which means that different methods may use different training sets. If the training data is inconsistent, how can we determine that the performance gain comes from the method rather than from biases in the data distribution? Especially when the test set is small, adding training samples that match the test set distribution could cause dramatic performance fluctuations. Considering these issues, we suggest establishing a large-scale standardized benchmark for fair and comprehensive comparisons.

Extracting expressive residue representations is a key challenge for accurate protein design, where both sequential and structural properties must be considered. For general 3D points, structural features should be rotationally and translationally invariant in the classification task. Regarding proteins, we should consider the amino acids' stable structure, number, and order. Previous studies (O'Connell et al., 2018; Wang et al., 2018; Ingraham et al., 2019; Jing et al., 2020) may have overlooked some important protein features and data dependencies, i.e., bond angles; thus, few of them exceed 50% recovery except DenseCPD (Qi & Zhang, 2020). How can protein features and neural models be designed to learn expressive residue representations?

Improving model efficiency is necessary for rapid iteration of research and applications. Current advanced methods have severe speed defects due to the sequential prediction paradigm. For example, GraphTrans (Ingraham et al., 2019) and GVP (Jing et al., 2020) predict residues one by one during inference rather than in parallel, which means the model must be called multiple times to generate the entire protein sequence. Moreover, DenseCPD (Qi & Zhang, 2020) takes 7 minutes to predict a 120-residue protein on its server. How can we improve model efficiency while ensuring accuracy?

To address these problems, we establish a new protein design benchmark and develop a graph model called AEDesign (Accurate and Efficient Protein Design) to achieve SOTA accuracy and efficiency. Firstly, we compare various graph models on consistent training, validation, and testing sets, all of which come from the AlphaFold Protein Structure Database (Varadi et al., 2021). In contrast to previous studies (Ingraham et al., 2019; Jing et al., 2020) that use limited-length proteins, we extend the experimental setups to the case of arbitrary protein length. Secondly, we improve model accuracy by introducing protein angles as new features and a simplified graph transformer encoder (SGT). Thirdly, we improve model efficiency by proposing a confidence-aware protein decoder (CPD) to replace the auto-regressive decoder. Experiments show that AEDesign significantly outperforms previous methods in accuracy (+5%) and efficiency (over 70 times faster than before). We also reveal AEDesign's potential for practical protein design tasks, where the designed proteins achieve good structural compatibility with native structures.

Problem definition

Structure-based protein design aims to find the amino acid sequence S = {s_i : 1 ≤ i ≤ n} that folds into a known 3D structure X = {x_i ∈ R^3 : 1 ≤ i ≤ n}, where n is the number of residues and natural proteins are composed of 20 types of amino acids, i.e., 1 ≤ s_i ≤ 20 and s_i ∈ N^+. Formally, the task is to learn a function F_θ : X → S. Because homologous proteins always share similar structures (Pearson & Sierk, 2005), the problem itself is underdetermined, i.e., the valid amino acid sequence may not be unique. In addition, the need to consider both 1D sequential and 3D structural information further increases the difficulty of algorithm design.
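To make the problem statement concrete, here is a minimal sketch of the evaluation target used throughout the paper: sequence recovery, the fraction of predicted residues that match the native sequence (the function name is our own):

```python
from typing import List

def sequence_recovery(pred: List[int], native: List[int]) -> float:
    """Fraction of positions where the designed sequence matches the
    native one; each residue is an integer code in 1..20."""
    assert len(pred) == len(native)
    return sum(p == t for p, t in zip(pred, native)) / len(native)

# two 5-residue sequences agreeing at 3 of 5 positions
print(sequence_recovery([1, 5, 7, 2, 9], [1, 5, 3, 2, 4]))  # 0.6
```

Because the problem is underdetermined, a designed sequence with recovery well below 100% may still fold into the target structure.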

MLP-based models

These methods use a multi-layer perceptron (MLP) to predict the type of each residue. The MLP outputs the probability of 20 amino acids for each residue, and the construction of input features is the main difference between methods. SPIN (Li et al., 2014) integrates torsion angles (ϕ and ψ), fragment-derived sequence profiles, and structure-derived energy profiles to predict protein sequences. SPIN2 (O'Connell et al., 2018) adds backbone angles (θ and τ), local contact number, and neighborhood distance to improve the accuracy from 30% to 34%. (Wang et al., 2018) uses backbone dihedrals (ϕ, ψ and ω), the solvent-accessible surface area of backbone atoms (Cα, N, C, and O), secondary structure types (helix, sheet, loop), Cα-Cα distance, and unit direction vectors of Cα-Cα, Cα-N and Cα-C, achieving 33.0% accuracy on 50 test proteins. MLP methods have high inference speed, but their accuracy is relatively low because they consider structural information only partially. They also require complex feature engineering using multiple databases and computational tools, limiting their widespread usage.

CNN-based models

CNN methods extract protein features directly from the 3D structure (Torng & Altman, 2017; Boomsma & Frellsen, 2017; Weiler et al., 2018; Zhang et al., 2020a; Huang et al., 2017; Chen et al., 2019), and can be further classified as 2D CNN-based or 3D CNN-based. The 2D CNN-based SPROF (Chen et al., 2019) extracts structural features from the distance matrix and improves the accuracy to 39.8%. In contrast, 3D CNN-based methods extract residue features from the atom distribution in a three-dimensional grid box. For each residue, the atomic density distribution is computed after translating and rotating the residue to a standard position so that the model can learn translation- and rotation-invariant features.
ProDCoNN (Zhang et al., 2020a) designs a nine-layer 3D CNN with multi-scale convolution kernels to predict the residue at each position, achieving 40.69% recovery on TS50. DenseCPD (Qi & Zhang, 2020) further uses the DenseNet architecture (Huang et al., 2017) to boost the accuracy to 50.71%. Although 3D CNN-based models improve accuracy, their inference is slow, probably because they require separate pre-processing and prediction for each residue.

Graph-based models

Graph methods represent the 3D structure as a k-NN graph, then use graph neural networks (Defferrard et al., 2016; Kipf & Welling, 2016; Veličković et al., 2017; Zhou et al., 2020; Zhang et al., 2020b; Gao et al., 2022; Li et al., 2021a) to learn residue representations under structural constraints. The protein graph encodes residue information in node vectors and constructs edges and edge features between neighboring residues. GraphTrans (Ingraham et al., 2019) combines a graph encoder and an autoregressive decoder to generate protein sequences. GVP (Jing et al., 2020) increases the accuracy to 44.9% by proposing the geometric vector perceptron, which learns both scalar and vector features in a manner that is equivariant or invariant with respect to rotations and reflections. GCA (Tan et al., 2022) further improves recovery to 47.02% by introducing global graph attention. Another related work is ProteinSolver (Strokach et al., 2020), but it was mainly developed for scenarios where partial sequences are known and does not report results on standard datasets. In parallel with our work, ProteinMPNN (Dauparas et al., 2022) and ESM-IF (Hsu et al., 2022) achieve dramatic improvements; since they do not provide training code, we leave benchmarking them for future work.

3.1. OVERVIEW

We present the overall framework of AEDesign in Fig. 1. We suggest using AEDesign as a future baseline model because it is more accurate, straightforward, and efficient than previous methods. The methodological innovations include:
• Expressive features: We add new protein angles (α, β, γ) that steadily improve model accuracy.
• Simplified graph encoder: We use the simplified graph transformer (SGT) to extract more expressive representations.
• Fast sequence decoder: We propose the confidence-aware protein decoder (CPD), which speeds up inference by replacing the autoregressive generator.

3.2. GRAPH FEATURE AND ENCODER

The protein structure can be viewed as a particular 3D point cloud in which the order of residues is known. For ordinary 3D points, there are two ways to obtain rotation- and translation-invariant features: using particular network architectures (Fuchs et al., 2020; Satorras et al., 2021; Jing et al., 2020; Shuaibi et al., 2021) that take 3D points as input, or using hand-crafted invariant features. General 3D point cloud approaches cannot account for the particularities of proteins, including the regular structure and order of amino acids. Therefore, we prefer learning from hand-designed, invariant, and protein-specific features. How to create invariant features and learn expressive representations from them is the subject of further research.

Graph We represent the protein as a k-NN graph derived from residues to capture 3D dependencies, where the default k is 30. The protein graph G(A, X, E) consists of the adjacency matrix A ∈ {0, 1}^{n×n}, node features X ∈ R^{n×12}, and edge features E ∈ R^{m×23}, where n and m are the numbers of nodes and edges. We create these features from the residues' stable structure, order, and coordinates.

Node features As shown in Fig. 2, we consider two kinds of angles, i.e., the angles α_i, β_i, γ_i formed by adjacent edges and the dihedral angles ϕ_i, ψ_i, ω_i formed by adjacent planes, where α_i, β_i, γ_i are the new features we introduce. To clarify the dihedral angles: ϕ_{i-1} is the angle between the planes C_{i-2}-N_{i-1}-C_{α,i-1} and N_{i-1}-C_{α,i-1}-C_{i-1}, whose intersection is the line N_{i-1}-C_{α,i-1}. In total, there are 12 node features, derived from {sin, cos} × {α_i, β_i, γ_i, ϕ_i, ψ_i, ω_i}.

Edge features For edge j → i, we use the relative rotation R_{ji}, distance r_{ji}, and relative direction u_{ji} as edge features (see Fig. 2). At the i-th residue's alpha carbon, we establish a local coordinate system Q_i = (b_i, n_i, b_i × n_i). The relative rotation between Q_i and Q_j is R_{ji} = Q_i^T Q_j, which can be represented by a quaternion q(R_{ji}). We use a radial basis function r(·) to encode the distance between C_{α,i} and C_{α,j}. The relative direction of C_{α,j} with respect to C_{α,i} is u_{ji} = Q_i^T (x_j − x_i) / ||x_j − x_i||. In summary, there are 23 edge features: 4 (quaternion) + 16 (radial basis) + 3 (relative direction).
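A minimal numpy sketch of this edge-feature pipeline (k-NN neighbours, a 16-bin radial-basis distance encoding, and the relative direction in the local frame). The RBF centres and width are illustrative choices of ours, and the quaternion of R_ji is omitted for brevity:

```python
import numpy as np

def rbf(d, d_min=0.0, d_max=20.0, n_bins=16):
    """Radial-basis encoding of a scalar distance into 16 features.
    Centre spacing and width here are illustrative, not the paper's."""
    centers = np.linspace(d_min, d_max, n_bins)
    width = (d_max - d_min) / n_bins
    return np.exp(-((d - centers) / width) ** 2)

def edge_features(x, Q, i, j):
    """Distance and relative-direction features for edge j -> i.
    x: (n, 3) C-alpha coordinates; Q: (n, 3, 3) local frames."""
    delta = x[j] - x[i]
    dist = np.linalg.norm(delta)
    u_ji = Q[i].T @ (delta / dist)  # direction expressed in residue i's frame
    return np.concatenate([rbf(dist), u_ji])  # 16 + 3 features

def knn_edges(x, k=30):
    """Indices of the k nearest residues for each residue (no self-loops)."""
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    return np.argsort(d, axis=1)[:, :k]
```

Because distances, local-frame directions, and relative rotations are unchanged by any global rotation or translation of the coordinates, features built this way are invariant by construction.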

Simplified graph transformer Denote h_i^l and e_{ji}^l as the output feature vectors of node i and edge j → i at layer l. We use MLPs to project the input node and edge features into a d-dimensional space, so h_i^0 ∈ R^d and e_{ji}^0 ∈ R^d. For the attention mechanism centered at node i, the attention weight a_{ji} at layer l+1 is calculated by:

w_{ji} = MLP_1(h_j^l || e_{ji}^l || h_i^l),    a_{ji} = exp(w_{ji}) / Σ_{k∈N_i} exp(w_{ki})    (1)

where N_i is the neighborhood system of node i and || denotes concatenation. Here, we simplify GraphTrans (Ingraham et al., 2019) by using a single MLP to learn multi-headed attention weights instead of separate MLPs for Q and K (see Fig. 3). The updated node feature h_i^{l+1} is:

v_j = MLP_2(e_{ji}^l || h_j^l),    h_i^{l+1} = Σ_{j∈N_i} a_{ji} v_j    (2)

By stacking multiple simplified graph transformer (SGT) layers, we obtain expressive protein representations that account for 3D structural constraints via message passing.
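Eqs. (1)-(2) can be sketched directly. The toy implementation below replaces MLP_1 and MLP_2 with single linear maps (W1, W2 are hypothetical stand-ins) and uses a single attention head:

```python
import numpy as np

def softmax(w):
    """Numerically stable softmax over a 1D array."""
    w = w - w.max()
    e = np.exp(w)
    return e / e.sum()

def sgt_layer(h, e, neighbors, W1, W2):
    """One simplified-graph-transformer update (single head, linear 'MLPs').
    h: (n, d) node features; e: dict (j, i) -> (d,) edge features;
    neighbors: list of neighbor-index lists; W1: (3d,); W2: (2d, d)."""
    h_new = np.zeros_like(h)
    for i, nbrs in enumerate(neighbors):
        # attention logits from [h_j || e_ji || h_i], as in Eq. (1)
        w = np.array([np.concatenate([h[j], e[(j, i)], h[i]]) @ W1
                      for j in nbrs])
        a = softmax(w)
        # messages from [e_ji || h_j], aggregated as in Eq. (2)
        v = np.stack([np.concatenate([e[(j, i)], h[j]]) @ W2 for j in nbrs])
        h_new[i] = a @ v
    return h_new
```

A real implementation would batch these loops, use genuine MLPs, and add residual connections and normalization; the sketch only shows the simplified single-MLP attention that distinguishes SGT from the Q/K formulation of GraphTrans.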

3.3. SEQUENCE DECODER

To generate more accurate protein sequences, previous studies (Ingraham et al., 2019; Jing et al., 2020) prefer the autoregressive mechanism. However, this technique significantly slows down inference because residues must be predicted one by one. Can we parallelize the predictor while maintaining accuracy?

Context-aware Previously, we considered 3D constraints through graph networks but ignored the 1D inductive bias. As shown in Fig. 4, the input features are Z_gnn = {z_1, z_2, ..., z_N}, where z_i is the feature vector of node i extracted by the encoder. In the generation phase, we use 1D CNNs to capture local sequential dependencies on top of the 3D context-aware graph node features, where the convolution kernel can be viewed as a sliding window.

Confidence-aware Given the 3D structure X = {x_i : 1 ≤ i ≤ N} and protein sequence S = {s_i : 1 ≤ i ≤ N}, vanilla autoregressive prediction factorizes as p(S|X) = Π_i p(s_i | X, s_<i), where residues must be predicted one by one. We replace the autoregressive connections with a parallelly estimated confidence score c:

a = Conf(X),    c = f(a),    p(S|X) = Π_i p(s_i | X, x_i, c_i)    (3)

We encode the confidence score as learnable embeddings C ∈ R^{n×128}, concatenate them with the graph features, and feed the result into another CNN decoder to obtain the revised predictions. All CNN decoders, both for estimating confidence and for the final predictions, use the same cross-entropy loss:

L = − Σ_i Σ_{1≤j≤20} 1_{j}(y_i) log(p_{i,j})    (4)

where p_{i,j} is the predicted probability that residue i's type is j, y_i is the true label, and 1_{j}(·) is the indicator function.
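As a toy illustration of the context-aware step, here is a 'same'-padded 1D sliding-window pass over a single per-residue feature channel (our own minimal sketch; the real decoder uses multi-channel CNN layers):

```python
import numpy as np

def conv1d_same(z, kernel):
    """'Same'-padded 1D sliding-window pass (cross-correlation form)
    over a per-residue feature track of length N."""
    pad = len(kernel) // 2
    zp = np.pad(z, pad)
    return np.array([zp[i:i + len(kernel)] @ kernel
                     for i in range(len(z))])

# a width-3 averaging kernel mixes each residue with its sequence neighbors
z = np.array([0.0, 1.0, 0.0, 0.0, 1.0])
smoothed = conv1d_same(z, np.array([0.25, 0.5, 0.25]))
```

Each output position depends only on a fixed local window, so all residues can be decoded in parallel, which is exactly what removes the one-by-one bottleneck of autoregressive decoding.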

4. EXPERIMENTS

We conduct systematic experiments to establish a large-scale benchmark and evaluate the proposed AEDesign method. Specifically, we aim to answer:
• Q1: What is the difference between the new benchmark and the old one?
• Q2: Can AEDesign achieve SOTA accuracy and efficiency on the new benchmark?
• Q3: What is important for achieving SOTA performance?

4.1. BENCHMARK COMPARISON (Q1)

Metric Following (Li et al., 2014; O'Connell et al., 2018; Wang et al., 2018; Ingraham et al., 2019; Jing et al., 2020), we use sequence recovery to evaluate protein design methods. Compared with other metrics, such as perplexity, recovery is more intuitive: its value equals the average accuracy of predicted amino acids in a single protein sequence. By default, we report the median recovery score across the entire test set.

Previous benchmark In Table 1, we show the old benchmark collected from previous studies, including MLP (Li et al., 2014; O'Connell et al., 2018; Wang et al., 2018), CNN (Chen et al., 2019; Zhang et al., 2020a; Qi & Zhang, 2020; Huang et al., 2017), and GNN (Ingraham et al., 2019; Jing et al., 2020; Strokach et al., 2020; Dauparas et al., 2022; Hsu et al., 2022) models. Most approaches report results on the common test set TS50 (or TS500), consisting of 50 (or 500) native structures (Li et al., 2014). We also provide the results of AEDesign under the same experimental protocols as GraphTrans and GVP. Although the TS50 and TS500 test sets have contributed significantly to establishing benchmarks, they still do not cover a vast protein space and do not reveal how a model performs on species-specific data. Besides, there are no canonical training and validation sets, which means that different methods may use different training sets.

New dataset We use the AlphaFold Protein Structure Database (as of 2021-02-01) (Varadi et al., 2021) to benchmark graph-based protein design methods. As shown in Table 6 (Appendix), it contains over 360,000 structures predicted by AlphaFold2 (Jumper et al., 2021) across 21 model-organism proteomes. This dataset has several advantages:
• Species-specific: The dataset provides well-organized species-specific subsets, which is helpful for developing specialized models for each species.
• More structures: The dataset provides more than 360,000 structures, while the Protein Data Bank (PDB) (Burley et al., 2021) holds just over 180,000 structures for over 55,000 distinct proteins.
• High quality: The median predictive score of AlphaFold2 reaches 92.4%, comparable to experimental techniques (Callaway, 2020). (Fowler & Williamson, 2022) found that AlphaFold tends to be more accurate than NMR ensembles. In 2020, the CASP14 benchmark recognized AlphaFold2 as a solution to the protein-folding problem (Pereira et al., 2021).
• No missing values: The protein structures provided by AlphaFold DB contain no missing values.
• Widespread usage: AlphaFold DB has been used in many frontier works (Varadi et al., 2022; Morreale et al., 2022; Luyten et al., 2022; Alderson et al., 2022; Zhang et al., 2022; Fowler & Williamson, 2022; Shaban et al., 2022; Brems et al., 2022; Hsu et al., 2022), and we believe that investigating AlphaFold DB could yield more discoveries for protein design.

Dataset preprocessing It should be noted that the AlphaFold DB data itself may carry model bias. Similar to ESM-IF (Hsu et al., 2022), we address data quality and partitioning issues through pre-processing of each species-specific subset. As suggested by (Baek & Kepp, 2022; Callaway, 2020), the MAE between predicted and experimentally determined structures depends on pLDDT; we therefore filter out low-quality structures whose confidence score (pLDDT) is less than 70. To prevent potential information leakage, for each species-specific subset we cluster protein sequences whose sequence similarity (Steinegger & Söding, 2017) is higher than 30% (Qi & Zhang, 2020) and split the dataset by clusters. As a result, proteins belonging to the same cluster fall entirely within one of the training, validation, or test sets. By default, we keep the ratio of training set to test set at about 9:1 and choose 100 proteins from a randomly selected cluster for validation. If the selected cluster has fewer than 100 proteins, all of its proteins are used as the validation set.
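The cluster-aware split described above can be sketched as follows. We assume cluster labels have already been produced (e.g. by a sequence-clustering tool at the 30% identity threshold); the function name and signature are our own:

```python
import random

def split_by_clusters(cluster_of, ratio=0.9, n_val=100, seed=0):
    """Partition protein ids so that every cluster lands wholly in
    train, validation, or test (no cluster spans two splits).
    cluster_of: dict protein_id -> cluster_id."""
    rng = random.Random(seed)
    clusters = {}
    for pid, cid in cluster_of.items():
        clusters.setdefault(cid, []).append(pid)
    ids = list(clusters)
    rng.shuffle(ids)
    # one randomly chosen cluster (capped at n_val proteins) for validation
    val = clusters[ids.pop()][:n_val]
    train, test = [], []
    total = sum(len(clusters[c]) for c in ids)
    for cid in ids:
        bucket = train if len(train) < ratio * total else test
        bucket.extend(clusters[cid])
    return train, val, test
```

Because whole clusters are assigned to a single split, no test protein has a >30%-identical homolog in the training set, which is the leakage the preprocessing is designed to prevent.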

4.2. AEDESIGN BENCHMARK (Q2)

Overall settings We extend the experimental setups to arbitrary length and species-specific data, while most previous studies (Ingraham et al., 2019; Jing et al., 2020; Tan et al., 2022) do not explore such a vast protein space. Arbitrary length means that the protein length may be arbitrary, generalizing the model to broader situations; otherwise, the protein length must be between 30 and 500. Species-specific indicates that we develop a specific model for each organism's proteome to learn domain-specific knowledge. In summary, there are two settings: the species-specific dataset with limited length (SL) and the species-specific dataset with arbitrary length (SA). Denoting the total number of structures as N_all, where the i-th species has N_i structures, we have N_all = Σ_{i=1}^{21} N_i. As shown in Table 6, if the length is limited, N_all = 254,636; otherwise, N_all = 365,198. We develop 21 models, one for each species-specific subset with N_1, N_2, ..., N_21 structures, and report the median recovery scores across the test sets. All baseline results were obtained by running the official code on the same datasets. In parallel with our work, ProteinMPNN (Dauparas et al., 2022) and ESM-IF (Hsu et al., 2022) were also proposed, but their full training code is not available, so we leave benchmarking them for future work. Due to space constraints, we use abbreviations such as GTrans for GraphTrans and SGNN for StructGNN.

Hyper-parameters AEDesign's encoder contains ten SGT layers and its decoder three CNN layers, with a hidden dimension of 128. We use the Adam optimizer and the OneCycleLR scheduler to train all models for up to 100 epochs with early-stop patience 20 and learning rate 0.001. In the SL setting, we set the batch size to 16 for GraphTrans, StructGNN, GCA, and AEDesign, and the max node number to 2000 for GVP, which is the maximum number of residues per batch. In the SA setting, we change GVP's max node parameter to 3000 to make it applicable to all data. GraphTrans, StructGNN, and GCA require more GPU memory because they must pad data to the longest chain in each batch, so we reduce their batch size to 8 to avoid memory overflow.
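For readers unfamiliar with the OneCycleLR schedule named above, here is a pure-Python simplification of its cosine form (our own reduction of the PyTorch scheduler; the real OneCycleLR also has an initial and final learning-rate divisor, omitted here):

```python
import math

def one_cycle_lr(step, total_steps, max_lr=0.001, pct_start=0.3):
    """Cosine one-cycle schedule, simplified: warm up from ~0 to max_lr
    over the first pct_start of training, then anneal back toward 0."""
    warm = pct_start * total_steps
    if step < warm:
        frac = step / warm
        return max_lr * (0.5 - 0.5 * math.cos(math.pi * frac))
    frac = (step - warm) / (total_steps - warm)
    return max_lr * (0.5 + 0.5 * math.cos(math.pi * frac))
```

The learning rate peaks at max_lr = 0.001 (matching the setting above) 30% of the way through training and decays to zero by the final step.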

4.3. EFFICIENCY COMPARISON (Q2)

A good algorithm should have excellent computational efficiency in addition to high accuracy. We compare the inference time costs of different approaches, especially when designing the long proteins commonly found in AF2DB.

Setting We evaluate the inference time costs of various models under different scenarios, considering 100 proteins each of short (L < 500), medium (500 ≤ L < 1000), and long (L ≥ 1000) lengths. For long-sequence design, we further break down the time costs of the encoder, the decoder, and encoder+decoder. All experiments are conducted on an NVIDIA V100.

Results When designing short proteins (L < 500), AEDesign is 25+ times faster than the baselines, and the advantage extends to about 70 times for longer proteins (L ≥ 500). In Fig. 5, we observe that the baselines' time costs come mainly from the autoregressive decoder, and the proposed CPD module significantly speeds up decoding.

4.4. ABLATION STUDY (Q3)

While AEDesign has shown remarkable performance, we are more interested in where the improvements come from. As mentioned before, we add new protein features, simplify the graph transformer, and propose a confidence-aware protein decoder. Do these modifications actually improve model performance?

Setting We conduct ablation studies under the SL setting. Specifically, we replace the simplified attention module with the original GraphTrans attention (w/o SGT), replace the CPD module with the autoregressive decoder of GraphTrans (w/o CPD), or remove the newly introduced angle features (w/o new feat). All other experimental settings remain the same as in the SL benchmark.

Results and analysis

The ablation results are shown in Table 5. We conclude that: (1) The SGT module and the new features improve the recovery rate by 2.51% and 10.85%, respectively. Most of the performance improvement comes from the new angular features, which is consistent with the recent ProteinMPNN, although that work focuses on distance features. (2) If we replace the CPD module with the autoregressive decoder, the recovery rate improves by only 0.55%; this gain is marginal compared with those of the SGT module and the new features. We therefore conclude that the CPD module dramatically improves inference speed while maintaining good recovery. (3) If we remove the introduced angular features, AEDesign is not as accurate as GCA and GVP, but the improvement in efficiency remains significant.

4.5. VISUAL EXAMPLES

We show the potential of AEDesign in real applications, i.e., designing all-alpha, all-beta, and mixed native proteins. We ensemble multiple models by selecting the sequence with the lowest perplexity as the final solution. We use AlphaFold2 to predict the structures of the designed sequences and compare them with the references. Visual examples are provided in Figure 6: the all-alpha structure (3tld, Recovery=50.00, RMSD=1.18), the all-beta structure (2giy, Recovery=53.07, RMSD=0.54), and the mixed structure (1mgr, Recovery=46.39, RMSD=0.47).

Figure 6: Visual examples. For native structures, we color helix, sheet, and loop in cyan, magenta, and orange, respectively. Green structures are the protein chains designed by our algorithm. We provide the recovery score and the structural RMSD relative to the ground-truth proteins.

5. CONCLUSION

This paper establishes a new benchmark and proposes a new method (AEDesign) for AI-guided protein design. By introducing new protein features, simplifying the graph transformer, and proposing a confidence-aware protein decoder, AEDesign achieves state-of-the-art accuracy and efficiency. We hope this work will help standardize comparisons and inspire subsequent research.

How we re-run baselines We tune the hyperparameters of all models on CATH4.2. For the baseline models, we prefer the hyperparameters recommended in their original papers to ensure that the results produced by our code are consistent with those reported, as shown in Table 10. For AEDesign, we also tune the hidden dimension (128) and the number of layers (10 GNN layers + 3 CNN layers) on CATH4.2 to investigate whether it can achieve competitive results. When adapting to the new dataset, AlphaFold DB, we fix the hyperparameters of all models, including AEDesign, to be the same as on CATH4.2. For each model, we use the same batch size as in the reproduction phase and adjust the learning rate within [0.001, 0.0001] to ensure that the model is well-trained.

Results

We provide results for models trained on CATH4.2 in Table 10 to investigate the potential of AEDesign when designing native proteins. We use the same data split as GraphTrans (Ingraham et al., 2019) and GVP (Jing et al., 2020), where proteins are partitioned by CATH topology classification, resulting in 18,024 proteins for training, 608 for validation, and 1,120 for testing. We observe that the non-autoregressive AEDesign outperforms its competitors in recovery, while the autoregressive GVP and GCA achieve lower perplexity. Since recovery is the primary measure, we conclude that AEDesign is competitive for designing native proteins. More importantly, the performance ordering of the models is consistent with the results on AlphaFold DB: AEDesign > GCA ≈ GVP > StructGNN > GraphTrans.

E DOMAIN GENERALIZATION

This work does not study the domain generalization problem, which could be another research direction. However, it is useful for readers to know the difference between proteins predicted by AlphaFold2 and native ones.

Discussion of domain shifts We study the domain shifts between proteins predicted by AlphaFold2 and native proteins. Taking the test sets of AF2DB and CATH4.2 as examples, we compute statistics of the angle-feature distributions to study whether there are significant differences, as shown in Table 11. We also compare the angle distributions of AF2DB and CATH4.2 in Figure 7. We observe that the angle distributions of AF2DB and CATH4.2 are quite similar but not identical. The similarity means that knowledge learned from AF2DB can be transferred to CATH4.2, while the differences may cause performance degradation when transferring across domains. We also provide the KL divergence between the angle distributions of AF2DB and CATH4.2, where Gaussian noise can be added to the input structures with the noise std listed on the left.

Addressing the domain shift issue How can domain differences be eliminated? Inspired by (Li et al., 2021b; Dauparas et al., 2022; Hsu et al., 2022), we find that perturbing the input structures with Gaussian noise during training improves domain-generalization performance. As shown in Table 11, the KL divergence between the features of AF2DB and CATH4.2 decreases as the standard deviation of the Gaussian noise increases. When training AEDesign on SOYBN (a subset of AF2DB) and evaluating it on the test set of CATH4.2, generalization is enhanced by adding noise, as shown in Table 12. However, this does not mean that higher noise levels are always better, as too much noise may obscure all useful information.

Noise std   0.00   0.001  0.02   0.05   0.08   0.10   0.20   0.30
CATH4.2     20.74  29.21  32.62  34.51  34.24  33.63  31.90  29.26

Table 12: Results under the cross-domain setting. We train AEDesign on the training set of SOYBN and evaluate it on the test set of CATH4.2, showing how the noise level helps domain generalization.

We have discussed the domain shift between AF2DB and CATH4.2 and verified that a model trained on AF2DB with simple Gaussian-noise augmentation is effective for designing native proteins. This finding is consistent with the results of ESM-IF and ProteinMPNN, but we provide new insight into why adding noise is effective. We believe that better domain generalization methods could further improve performance, but we do not study this in depth here. Moreover, ESM-IF has verified that training on AF2DB together with CATH4.2 further improves performance from 38.3% to 51.6% on the CATH dataset, and this paper does not repeat that contribution. From our perspective, ESM-IF's data augmentation is in fact another domain generalization approach.
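The noise augmentation discussed above amounts to a one-line perturbation of the backbone coordinates at training time. A minimal sketch (function name is ours; std values follow the range explored in Table 12):

```python
import numpy as np

def perturb_structure(coords, noise_std=0.02, rng=None):
    """Add isotropic Gaussian noise to backbone coordinates before
    feature extraction, as a training-time augmentation against
    the AF2DB-to-native domain shift."""
    rng = rng or np.random.default_rng()
    return coords + rng.normal(scale=noise_std, size=coords.shape)
```

Applied fresh at every training step, the perturbation blurs the fine-grained geometric regularities specific to AlphaFold-predicted structures, which is one plausible reading of why it narrows the gap to native proteins.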



Footnotes:
1. DenseCPD server: http://protein.org.cn/densecpd.html
2. AlphaFold DB: https://alphafold.ebi.ac.uk



Figure 1: Overview of AEDesign. Compared with GraphTrans, StructGNN (Ingraham et al., 2019) and GVP (Jing et al., 2020), we add new protein features, simplify the graph transformer, and propose a confidence-aware protein decoder to improve accuracy and efficiency.

Figure 2: Illustration of the dihedral angles used as node features.

Figure 3: Simplified graph transformer.

Figure 4: CPD, the confidence-aware protein decoder. We use two 1D CNN networks to learn confidence scores and make final predictions based on the input graph node features.

where Conf(·) is the model containing the graph encoder and CNN decoder (the "confidence predictor" in Figure 4), which outputs logit scores a ∈ R^{n×20}, and f(·) is the function computing the confidence score c ∈ N^{n×1}. The confidence score captures knowledge from the first-pass prediction and serves as a hint that helps the network correct its previous predictions. For a single row a ∈ R^{1×20}, let i and j be the indices of the largest and second-largest values of a; then M = a_i, m = a_j, and f(a) = ⌊M/m⌋, where ⌊·⌋ denotes the floor function. Extending a to R^{n×20}, with M = ColumnMax(a) ∈ R^{n×1} and m = ColumnSubMax(a) ∈ R^{n×1} the largest and second-largest logits per row, the vectorized confidence score is c = ⌊M/m⌋ computed element-wise, as used in Eq. (3).
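The vectorized f(·) above can be sketched directly (our own implementation; a 3-class toy example stands in for the 20 amino-acid logits):

```python
import numpy as np

def confidence_scores(a):
    """c_i = floor(largest logit / second-largest logit) per residue.
    a: (n, k) logit scores from the confidence predictor."""
    top2 = np.sort(a, axis=1)[:, -2:]   # two largest logits per row, ascending
    m, M = top2[:, 0], top2[:, 1]
    return np.floor(M / m).astype(int)

a = np.array([[1.0, 4.0, 2.0],   # M=4.0, m=2.0 -> c=2 (confident)
              [3.0, 3.1, 0.5]])  # M=3.1, m=3.0 -> c=1 (ambiguous)
print(confidence_scores(a))  # [2 1]
```

Intuitively, a large ratio between the top two logits means the first-pass decoder is sure of its choice, so the second-pass decoder learns to trust high-c positions and revise low-c ones.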

Figure 5: Inference time cost when designing long proteins. We report the total inference time of different methods on 100 long proteins, each longer than 1000 residues. The time costs of the encoder, the decoder, and encoder+decoder are reported.

Figure 7: Comparing the angle distributions between CATH4.2 and AF2DB, structural noise = 0.00.

Figure 8: Comparing the angle distributions between CATH4.2 and AF2DB, structural noise = 0.02.

Figure 9: Comparing the angle distributions between CATH4.2 and AF2DB, structural noise = 0.05.

Figure 10: Comparing the angle distributions between CATH4.2 and AF2DB, structural noise = 0.10.

Statistics of structure-based protein design methods. N_train is the number of training samples. TS500 and TS50 are the test sets containing 500 and 50 proteins, respectively. All results are copied from their manuscripts or related papers.

SL benchmark. The length of the protein must be between 30 and 500. We highlight the best (or next best) results in bold (or underline).

SA benchmark. There is no constraint on the protein length. We highlight the best (or next best) results in bold (or underline). The larger the data volume, the better the model performance; refer to Table 6. For example, the recovery on METIJ is 58.54% when the number of structures is 1,605, which increases to 73.96% on SOYBN when the data volume grows to 41,048.

Inference time costs of different methods. #Number is the number of proteins for time cost evaluation. #Avg L is the average protein length.

Ablation study under the SL setting.

A DATA STATISTICS

Protein length In Table 6, we count the proteins in each species-specific dataset by length. AlphaFold DB contains some extra-long proteins, so it is necessary to improve the algorithm's efficiency.

AlphaFold DB: we show the total number of proteins N_all and the number of proteins whose lengths fall within (0, 30], (30, 500], (500, 1000] and (1000, +∞). The statistics of all species-specific subsets are also presented.

C RESULTS AFTER TM-SCORE FILTERING

TM-score filtered results We evaluated all models on test sets filtered by both 30% sequence identity and 0.5 TM-score, and provide the TM-score-filtered benchmarks in Table 8 and Table 9. Compared to the results without the TM-score filter, the relative performance gain of our model is slightly reduced. The average model performance on this TM-score-based test set remains nearly the same as on the sequence-identity-based test set. We note that ESM-IF also finds that "the model performance overall remains the same on the TM score-based test set as on the CATH topology split test set." These facts show that additional filters do not make prediction as difficult as we expected. This suggests that the model may not rely on so-called homologous information leakage for prediction, but actually learns the patterns of the data.
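The length buckets used in the statistics above can be reproduced with a small helper. This is a hypothetical utility (not from the paper's code) that counts proteins falling into (0, 30], (30, 500], (500, 1000] and (1000, +∞):

```python
from bisect import bisect_left

def bucket_counts(lengths):
    """Count proteins per length bucket: (0,30], (30,500], (500,1000], (1000,inf).

    bisect_left with right-closed edges [30, 500, 1000] maps a length equal
    to an edge into the lower bucket, matching the half-open-below intervals.
    """
    edges = [30, 500, 1000]
    counts = [0, 0, 0, 0]
    for length in lengths:
        counts[bisect_left(edges, length)] += 1
    return counts
```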

TM-score filtered SL benchmark. The length of the protein must be between 30 and 500. The sequence identity and TM-score between training and testing proteins are less than 30% and 0.5, respectively.

TM-score filtered SA benchmark. There is no constraint on the protein length. The sequence identity and TM-score between training and testing proteins are less than 30% and 0.5, respectively.

Results comparison on the CATH dataset. All baselines are reproduced under the same code framework, where perplexity (lower is better) and recovery (higher is better) are reported. The best and next best results are labeled with bold and underline.

Statistics of angle features. We count angle distributions for the testing sets of AF2DB (SL setting) and CATH4.2. The mean and standard deviation are provided, where the standard deviation is marked in brackets.

B PREPROCESSING

Filter low-quality data by pLDDT Similar to ESM-IF (Hsu et al., 2022), we address data quality and partitioning issues through data pre-processing for each species-specific subset. As suggested by (Baek & Kepp, 2022; Callaway, 2020), the MAE between predicted and experimentally determined structures depends on pLDDT. Thus, we filter out low-quality structures whose confidence score (pLDDT) is less than 70.

Filter test data by sequence identity To prevent potential information leakage, for each species-specific subset, we cluster protein sequences whose sequence similarities (Qi & Zhang, 2020; Steinegger & Söding, 2017) are higher than 30% and split the dataset by these clusters. As a result, all proteins belonging to the same cluster fall into exactly one of the training, validation, and test sets. By default, we keep the ratio of the training set to the test set around 9:1 and select 100 proteins from a randomly selected cluster for validation. If the randomly selected cluster has fewer than 100 proteins, then all of its proteins are used as the validation set. In Table 7, we show the number of training samples (#Train), validation samples (#Valid), and test samples (#Test) under the partition based on 30% sequence identity.

Filter test data by TM-score In addition to the sequence-identity clustering, we further filter the test sets by structural similarity using Foldseek (van Kempen et al., 2022) to exclude any structure whose TM-score to a training structure is larger than 0.5. Thus, the training and test sets are strictly different at both the sequence and structure levels. In
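The pLDDT filter and the cluster-level split described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the paper's pipeline: `clusters` and `plddt` are hypothetical inputs (cluster id to protein ids, and protein id to mean pLDDT), and the exact train/test balancing heuristic is ours.

```python
import random

def split_by_clusters(clusters, plddt, min_plddt=70.0, val_size=100, seed=0):
    """Drop structures with pLDDT < 70, then assign whole sequence-identity
    clusters to train/valid/test so no cluster is split across sets (~9:1)."""
    rng = random.Random(seed)
    # Keep only confident structures; discard clusters emptied by the filter.
    kept = {cid: [p for p in members if plddt[p] >= min_plddt]
            for cid, members in clusters.items()}
    kept = {cid: m for cid, m in kept.items() if m}
    cids = list(kept)
    rng.shuffle(cids)
    # One randomly chosen cluster supplies up to `val_size` validation proteins.
    val_cid = cids.pop()
    valid = kept[val_cid][:val_size]
    n_total = sum(len(kept[c]) for c in cids)
    train, test, n_test = [], [], 0
    for cid in cids:
        if n_test < n_total // 10:        # ~9:1 train/test ratio
            test.extend(kept[cid])
            n_test += len(kept[cid])
        else:
            train.extend(kept[cid])
    return train, valid, test
```

Because clusters are assigned atomically, sequences more than 30% similar can never straddle the train/test boundary; the Foldseek TM-score filter would then be applied on top of this split.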

