IMPROVING GENERALIZABILITY OF PROTEIN SEQUENCE MODELS WITH DATA AUGMENTATIONS

Abstract

While protein sequence data is an emerging application domain for machine learning methods, small modifications to protein sequences can result in difficult-to-predict changes to the protein's function. Consequently, protein machine learning models typically do not use randomized data augmentation procedures analogous to those used in computer vision or natural language, e.g., cropping or synonym substitution. In this paper, we empirically explore a set of simple string manipulations, which we use to augment protein sequence data when fine-tuning semi-supervised protein models. We provide 276 different comparisons to the Tasks Assessing Protein Embeddings (TAPE) baseline models, with Transformer-based models and training datasets that vary from the baseline methods only in the data augmentations and representation learning procedure. For each TAPE validation task, we demonstrate improvements to the baseline scores when the learned protein representation is fixed between tasks. We also show that contrastive learning fine-tuning methods typically outperform masked-token prediction in these models, with increasing amounts of data augmentation generally improving performance for contrastive learning protein methods. We find the most consistent results across TAPE tasks when using domain-motivated transformations, such as amino acid replacement, as well as restricting the Transformer attention to randomly sampled sub-regions of the protein sequence. In rarer cases, we even find that information-destroying augmentations, such as randomly shuffling entire protein sequences, can improve downstream performance.

1. INTRODUCTION

Semi-supervised learning has proven to be an effective mechanism to promote generalizability for protein machine learning models, as task-specific labels are generally very sparse. For other common data types, however, there are simple transformations that can be applied to the data to improve a model's ability to generalize: for instance, vision models use cropping, rotations, or color distortion; natural language models can employ synonym substitution; and time-series models benefit from window restriction or noise injection. Scientific data, such as a corpus of protein sequences, has few obvious transformations that unambiguously preserve the meaningful information in the data. Often, an easily understood transformation to a protein sequence (e.g., replacing an amino acid with a chemically similar one) will unpredictably produce either a very biologically similar or a very biologically different mutant protein. In this paper, we take the uncertainty arising from the unknown effect of simple data augmentations in protein sequence modeling as an empirical challenge that deserves a robust assessment. To our knowledge, no study has been performed to determine whether simple data augmentation techniques improve performance on a suite of protein tasks. We focus on fine-tuning previously published self-supervised models that are typically used for representation learning with protein sequences, viz. the Transformer-based methods of Rao et al. (2019), which have shown the best ability to generalize on a set of biological tasks referred to as Tasks Assessing Protein Embeddings (TAPE). We test one or more of the following data augmentations: replacing an amino acid with a pre-defined alternative; shuffling the input sequences either globally or locally; reversing the sequence; or subsampling the sequence to focus only on a local region (see Fig. 1).
We demonstrate that fine-tuning the baseline models with data augmentations yields protein sequence representations with relative improvements between 1% (secondary structure accuracy) and 41% (fluorescence ρ), as assessed with linear evaluation for all TAPE tasks we studied. When fine-tuning the same representations during supervised learning on each TAPE task, we show significant improvement over baseline for 3 out of 4 TAPE tasks, with the fourth (fluorescence) within 1σ in performance. We also study the effect of increasingly aggressive data augmentations: when fine-tuning baseline models with contrastive learning (Hadsell et al., 2006; Chen et al., 2020a) we see a local maximum in downstream performance as a function of the quantity of data augmentation, with "no augmentations" generally under-performing modest amounts of data augmentation. Conversely, performing the same experiments with masked-token prediction instead of contrastive learning, we detect a minor trend of decreasing performance on the TAPE tasks as we more frequently use data augmentations during fine-tuning. We interpret this as evidence that contrastive learning techniques, which require the use of data augmentation, are important methods for improving the generalizability of protein models.

2. RELATED WORKS

Self-supervised and semi-supervised methods have become the dominant paradigm in modeling protein sequences for use in downstream tasks. Rao et al. (2019) have studied next-token and masked-token prediction, inspired by the BERT natural language model (Devlin et al., 2018), while contrastive predictive coding (Oord et al., 2018) uses contrastive methods to predict future values of an input sequence. We consider sequence augmentations in natural language the most relevant comparison for the data augmentations we study in this paper. Commonly applied augmentations on strings include Lexical Substitution (Zhang et al., 2015), Back Translation (Xie et al., 2019a), Text Surface Transformation (Coulombe, 2018), Random Noise Injection (Xie et al., 2019b; Wei & Zou, 2019), and Synonym Replacement, Random Swap, and Random Deletion (Wei & Zou, 2019). However, sequence augmentations designed for natural languages often require preserving the contextual meaning of the sentences, a factor that is less explicit for protein sequences. Contrastive learning is a set of approaches that learn representations of data by distinguishing positive data pairs from negative pairs (Hadsell et al., 2006). SimCLR (v1 & v2) (Chen et al., 2020a;b) is the current state-of-the-art contrastive learning technique; we use this approach liberally in this paper not only because it performs well, but because it requires data transformations to exist.

3. METHODS

For any x in a dataset D, we form two copies x_1 = t_1(x) and x_2 = t_2(x) given functions t_1, t_2 ∼ T, where T denotes the distribution of the augmentation functions. Given D is of size N, the contrastive loss is written as:

L = (1/2N) Σ_{k=1}^{N} [ ℓ(z_k^{(1)}, z_k^{(2)}) + ℓ(z_k^{(2)}, z_k^{(1)}) ],  where  ℓ(u, v) ≡ −log [ exp(sim(u, v)/τ) / Σ_{w ≠ u} exp(sim(u, w)/τ) ]   (1)

Here, z_k^{(i)} = g_θ(f_ω(t_i(x_k))), sim(·, ·) is cosine similarity, and τ ∈ (0, ∞) is a scalar temperature; we choose τ = 0.2. By minimizing the contrastive loss, we obtain the learned h as the encoded feature for other downstream tasks.
Note that the contrastive loss takes the z's as inputs, whereas the encoded feature used downstream is h = f_ω(x), i.e., the variable after the encoder f_ω(·) and before the projection head g_θ(·).
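Eq. 1 can be sketched numerically as follows. This is a minimal NumPy illustration with hypothetical names, not the paper's code; it vectorizes the per-pair loss over all 2N views at once.

```python
import numpy as np

def nt_xent_loss(z1, z2, tau=0.2):
    """Contrastive loss of Eq. 1 (a sketch): z1, z2 are (N, d) arrays of
    projected embeddings g(f(t_i(x))) for the two augmented views."""
    z = np.concatenate([z1, z2], axis=0)               # (2N, d)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine sim via dot products
    sim = (z @ z.T) / tau                              # (2N, 2N) similarity logits
    np.fill_diagonal(sim, -np.inf)                     # exclude w == u from the sum
    n = z1.shape[0]
    # the positive partner of view k is the other view of the same sequence
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    log_prob = sim[np.arange(2 * n), pos] - np.log(np.exp(sim).sum(axis=1))
    return -log_prob.mean()                            # averages the 2N terms, i.e. 1/(2N) Σ
```

With temperature τ = 0.2 as in the paper, well-aligned pairs of views drive the loss toward zero, while mismatched pairs are penalized.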

3.1. EVALUATION PROCEDURE & APPROACH TO EXPERIMENT CONTROL

Our goal is to demonstrate that training self-supervised protein sequence models with simple string manipulations as data augmentations leads to better performance on downstream tasks. To control external variables, we study the following restricted setting; we provide the procedural diagram in Figure 2 and the corresponding explanations of the four major steps below (see Appendix A for training setups in detail).

Baseline.-A self-supervised model M_0 is trained on non-augmented sequence data D_seq to do representation learning. To have a consistent baseline, we set M_0 to the Transformer-based model trained and published in Rao et al. (2019), without modification. This was trained with masked-token prediction on Pfam protein sequence data (El-Gebali et al., 2019); it has 12 self-attention layers, 8 heads per layer, and 512 hidden dimensions, yielding 38M parameters in total.

Augmented training on validation set.-We fine-tune M_0 on augmented subsets D_val ⊂ D_seq, given a set of pre-defined data transformations T_aug. We define M_aug as the final trained model derived from T_aug(D_seq), with M_0 as the initial condition for the model parameters. We explore two different methods of fine-tuning on augmented data (a contrastive task, as in Eq. 1, and a masked-token task with exponentiated cross-entropy loss), as well as different combinations of data augmentations. We use reduced subsets |D_val| ≪ |D_seq| both to reduce the computational cost of running bulk experiments and to protect against overfitting. For consistency, we inherit the choice of D_val from the cross-validation split used to train M_0 in Rao et al. (2019).
To adapt the same baseline model M_0 to different self-supervised losses, we add a loss-specific, randomly-initialized layer to the M_0 architecture: contrastive learning uses a fully connected layer that outputs 256-dimensional vectors, and masked-token prediction uses fully connected layers with layer normalization to output one-hot vectors for each of the masked letters. We define our different choices of T_aug in the next section.

Linear evaluation on TAPE.-To assess the representations learned by M_aug, we evaluate performance on four TAPE downstream training tasks (Rao et al., 2019): stability, fluorescence, remote homology, and secondary structure. For consistency, we use the same training, validation, and testing sets. The first two tasks are evaluated by Spearman correlation (ρ) to the ground truth and the latter two by classification accuracy. We do not consider the fifth TAPE task, contact map prediction, as it relies only on the single CASP12 dataset, which has an incomplete test set due to data embargoes (AlQuraishi, 2019). Secondary structure prediction is a sequence-to-sequence task in which each input amino acid is classified into a particular secondary structure type (helix, beta sheet, or loop); it is evaluated on data from CASP12, TS115, and CB513 (Berman et al., 2000; Moult et al., 2018; Klausen et al., 2019), with "3-class" classification accuracy being the metric in this paper. The remote homology task classifies sequences into one of 1,195 classes representing different possible protein folds, which are further grouped hierarchically into families and then superfamilies; the datasets are derived from Fox et al. (2013). The fluorescence task regresses a protein sequence to a real-valued log-fluorescence intensity measured in Sarkisyan et al. (2016). The stability task regresses a sequence to a real-valued measure of the protein maintaining its fold above a concentration threshold (Rocklin et al., 2017).
We perform linear evaluation by training only a single linear layer for each downstream task for each contrastive-learning model M_aug^CL, without changing the parameters of M_aug^CL, and hence its learned encodings, across tasks. To compare the contrastive learning techniques to further fine-tuning with masked-token prediction, we identify the best-performing data augmentations per task, replace M_aug^CL with the masked-token model trained with the same augmentations, M_aug^MT, and then also perform linear evaluation on M_aug^MT.

Full fine-tuning on TAPE.-For the best-performing augmented models in the linear evaluation task (either M_aug^CL or M_aug^MT), we further study how the models improve when allowing the parameters of M_aug to vary along with the linear model during task-specific supervised fine-tuning.
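The linear-evaluation protocol above can be sketched as follows. The paper trains the linear layer by gradient descent; here ordinary least squares stands in as the linear readout, and `encode` is a hypothetical frozen embedding function, so this is an illustrative sketch rather than the authors' procedure.

```python
import numpy as np

def linear_evaluation(encode, X_train, y_train, X_test):
    """Fit only a linear readout on top of a frozen encoder.

    encode: frozen feature extractor (its parameters are never updated);
    least squares replaces the paper's SGD-trained linear layer."""
    H_train = np.stack([encode(x) for x in X_train])  # frozen embeddings
    H_test = np.stack([encode(x) for x in X_test])
    w, *_ = np.linalg.lstsq(H_train, y_train, rcond=None)
    return H_test @ w                                  # task predictions
```

Because the encoder is shared and frozen, any difference in downstream scores isolates the quality of the learned representation rather than task-specific fine-tuning.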

3.2. DATA AUGMENTATIONS

We focus on random augmentations to protein primary sequences, both chemically and non-chemically motivated (see Fig. 1). For each of these data augmentations there are reasons why it might help the model generalize and reasons why it might destroy the information contained in the primary sequence.

Replacement (Dictionary/Alanine) [RD & RA].-We randomly replace, with probability p, the i-th amino acid in the primary sequence S = {A_i}_{i=1}^N with a similar amino acid A'_i, according to a replacement rule; we do this independently for all i. We treat p as a hyperparameter and assess how p affects downstream TAPE predictions. For Replacement (Dictionary), following French & Robson (1983), we pair each naturally occurring amino acid with a partner that belongs to the same class (aliphatic, hydroxyl, cyclic, aromatic, basic, or acidic), but do not substitute anything for proline (only backbone cyclic), glycine (only an H side chain), tryptophan (indole side chain), or histidine (basic side chain with no size or chemical-reaction equivalent). We experimented with different pairings, finding little difference in the results; our best results were obtained with the final mappings: [[A,V], [S,T], [F,Y], [K,R], [C,M], [D,E], [N,Q], [V,I]]. We also study replacing residues with the single amino acid alanine (A), as motivated by alanine-scanning mutagenesis (Cunningham & Wells, 1989). Single mutations to A are used experimentally to probe the importance of an amino acid because alanine resembles a reduction of any amino acid to its C_β: it eliminates the functionality of other amino acids while maintaining a certain backbone rigidity, and is thus considered minimally disruptive to the overall fold of a protein, although many interesting exceptions can still occur by these mutations.

Shuffling [GRS & LRS].-For shuffling, we define an index range i ∈ [α, β] with α < β ≤ N, then replace the amino acids A_i in this range with a permutation chosen uniformly at random.
We define Global Random Shuffling (GRS) with α = 1 and β = N and Local Random Shuffling (LRS) with the starting point of intervals chosen randomly between α ∈ [1, N -2] and β = min(N, α + 50), ensuring at least two amino acids get shuffled. While shuffling aggressively destroys protein information, models trained with shuffling can focus more on permutation-invariant features, such as the overall amino acid counts and sequence length of the original protein.
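The replacement and shuffling augmentations above can be sketched in a few lines of Python. The function names are our own, and treating the replacement mapping as symmetric (so, e.g., T maps back to S) is an assumption not stated in the text.

```python
import random

# Replacement (Dictionary): the pairings quoted above. Residues without a
# listed partner (P, G, W, H) are never substituted; symmetrizing the
# mapping where unambiguous is our assumption.
PAIRS = [("A", "V"), ("S", "T"), ("F", "Y"), ("K", "R"),
         ("C", "M"), ("D", "E"), ("N", "Q"), ("V", "I")]
SWAP = {a: b for a, b in PAIRS}
SWAP.update({b: a for a, b in PAIRS if b not in SWAP})

def replace_dictionary(seq, p=0.01, rng=random):
    """RD: independently replace each residue with its partner with prob p."""
    return "".join(SWAP[a] if a in SWAP and rng.random() < p else a
                   for a in seq)

def replace_alanine(seq, p=0.01, rng=random):
    """RA: alanine-scanning-style augmentation; replace residues with 'A'."""
    return "".join("A" if rng.random() < p else a for a in seq)

def global_random_shuffle(seq, rng=random):
    """GRS: permute the entire sequence uniformly at random."""
    s = list(seq)
    rng.shuffle(s)
    return "".join(s)

def local_random_shuffle(seq, window=50, rng=random):
    """LRS: shuffle a random window of at most `window` residues."""
    alpha = rng.randrange(0, len(seq) - 1)   # 0-indexed start; >= 2 residues shuffled
    beta = min(len(seq), alpha + window)
    s = list(seq)
    chunk = s[alpha:beta]
    rng.shuffle(chunk)
    s[alpha:beta] = chunk
    return "".join(s)
```

Note that both shuffles preserve the amino acid counts and the sequence length, the permutation-invariant features mentioned above.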

Sequence Reversion & Subsampling [SR & SS].-For Sequence Reversion, we simply reverse the sequence: given S = {A_i}_{i=1}^N, we map i → i' = N + 1 - i. Note that a protein sequence is oriented, proceeding sequentially from the N- to the C-terminus; reversing a protein sequence changes the entire structure and function of the protein. However, including reversed sequences might encourage the model to use short-range features more efficiently, as it has for seq2seq LSTM models (Sutskever et al., 2014). For Subsampling, we let the sequence index range over i ∈ [α, β]: we uniformly sample α ∈ [1, N - 2] and preserve A_i for i = (α, α + 1, ..., min(N, α + 50)). While many properties pertaining to the global fold of a protein are due to long-range interactions between residues that are well separated in the primary sequence, properties such as proteasomal cleavage or docking depend more heavily on the local sequence regime, implying that training while prioritizing local features might still improve performance.

Combining Augmentations.-We consider applying augmentations to the fine-tuning of semi-supervised models both individually and together. For Single Augmentations we consider only one of the augmentations at a time. With Leave-One-Out Augmentation we define lists of augmentations to compare together; for each list, we iteratively remove one augmentation and apply all the others during training. Finally, in Pairwise Augmentation all pairs of augmentations are considered.
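Reversion, subsampling, and the composition of augmentations can be sketched as below; the helper names are illustrative, not from the paper's code.

```python
import random

def reverse_sequence(seq):
    """SR: reverse the N-to-C orientation of the primary sequence."""
    return seq[::-1]

def subsample(seq, window=50, rng=random):
    """SS: keep only a contiguous random window of at most `window` residues."""
    alpha = rng.randrange(0, len(seq) - 1)   # 0-indexed start
    return seq[alpha:min(len(seq), alpha + window)]

def compose(*augs):
    """Pairwise (or longer) augmentation: apply the transforms in sequence."""
    def apply(seq):
        for aug in augs:
            seq = aug(seq)
        return seq
    return apply
```

For example, `compose(reverse_sequence, subsample)` corresponds to one cell of the pairwise-augmentation grid explored in the experiments.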

4. RESULTS

Assessing data-augmented representations with linear evaluation.-The core result of this paper uses linear evaluation methods to probe how well data augmentations improve the learned representation of protein sequences. In Table 1 we summarize the best results from various data augmentation procedures, highlighting all cases that outperform the TAPE Baseline. We compare identical model architectures trained with the same data but with varying data augmentations against two baselines: (1) the masked-token Transformer model published in Rao et al. (2019), which we call the TAPE Baseline; and (2) a contrastive learning model trained in the SimCLR approach but using no data augmentations, i.e., using only the negative-sampling part of SimCLR, which we call the Contrastive Baseline. We also report the standard deviation (σ) for the major figures of this paper by bootstrapping the testing results 5,000 times, observing convergence after ∼3,000 samples. We see broad improvement when using contrastive learning with data augmentations in comparison to both baselines for the stability, fluorescence, and remote homology tasks, and better-or-similar results for secondary structure prediction.

Figure 3 shows our linear evaluation results using contrastive learning as a function of the composition of pairs of data augmentations. For stability, amino acid replacement (with either a dictionary (RD) or alanine alone (RA)) consistently improves performance compared to the TAPE baseline, as well as to other augmentation strategies, which typically underperform the baseline. Fluorescence sees improvements with all data augmentations, but random shuffling (LRS & GRS) as well as binomial replacement of both types give the best individual performance. For remote homology, subsampling clearly plays an important role in model performance given the improvement it introduces on the three testing sets; the "family" homology level is shown here, and the other remote homology tasks are qualitatively similar.
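The bootstrap error bars described above (σ estimated from 5,000 resamples of the test results) can be sketched as follows; the function name and the use of per-example scores are our assumptions about the setup.

```python
import random
import statistics

def bootstrap_sigma(scores, metric, n_boot=5000, rng=random):
    """Estimate the std. dev. of a test metric by resampling the
    per-example test scores with replacement n_boot times."""
    n = len(scores)
    resampled = [
        metric([scores[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    ]
    return statistics.stdev(resampled)
```

Convergence can be checked by plotting the running estimate against the number of resamples, which is how a plateau near ∼3,000 samples would be observed.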
Similarly, we see that data augmentation procedures that use subsampling tend to yield better performance than alternatives, with the best-performing approach using subsampling alone. For complete heatmaps for the 6 remote homology and secondary structure testing sets, please refer to Fig. 5 in Appendix B.

Effect of increasing augmentation rates.-Fig. 4 presents results on the effect of varying data augmentations in two cases: (1) increasing the amino acid replacement probability p for the Replacement Dictionary [RD] strategy with contrastive learning (top row); and (2) augmenting increasingly large fractions of the input data according to the best augmentations (see the "Best Aug." row in Table 2) found in linear evaluation for masked-token prediction (bottom row). We define the augmentation ratio γ as the fraction of the samples in the validation dataset that are randomly augmented per epoch; for contrastive learning fine-tuning, data augmentations are required for every data element. For masked-token prediction, we see little change in performance for any task as a function of γ for any of the best corresponding augmentation strategies; however, there is a small but consistent reduction in performance with increasing γ, implying that masked-token prediction is not always able to significantly improve its performance by using data augmentations. In particular, large augmentation ratios γ ∼ 1 hurt model performance on the stability, fluorescence, and remote homology tasks. We also see that the TAPE Baseline model generally performs worse than further training with no data augmentation, indicating that further training of the baseline model with the same data and procedure can improve on the performance of Rao et al. (2019). For contrastive learning, by contrast, we see clear evidence that data augmentations can help generalization.
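The augmentation ratio γ defined above can be sketched as a per-sample Bernoulli draw; the helper name is hypothetical.

```python
import random

def apply_with_ratio(batch, augment, gamma, rng=random):
    """Augment each sample independently with probability gamma (the
    augmentation ratio). gamma=0 reproduces un-augmented fine-tuning;
    gamma=1 matches the contrastive setting, where every element is
    augmented."""
    return [augment(s) if rng.random() < gamma else s for s in batch]
```

Sweeping gamma over [0, 1] with a fixed augmentation reproduces the x-axis of the bottom row of Fig. 4.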
We see increasing Spearman correlation for the stability and fluorescence tasks and increasing accuracy for remote homology and secondary structure with increasing p for p < 0.01. We see a consistent decrease in all metrics, to the lowest value seen for each task, at replacement probability p = 0.1, and then a recovery to larger values (sometimes the largest seen) at higher replacement p = 0.5. However, no augmentation, p = 0, consistently underperforms compared to alternative values p > 0.

Contrastive learning vs. masked-token prediction.-To assess the relative effects of contrastive learning and masked-token prediction, we compare results between the two approaches with and without data augmentations. All information for this comparison is in Fig. 4 and Table 1. It is unsurprising that using the identity function as the data transform in SimCLR (Eq. 1) yields little increase in generalizability; indeed, we see that masked-token prediction has better performance than contrastive learning for all tasks with no data augmentations (Fig. 4, γ = 0 vs. p = 0). However, we see mixed results when comparing contrastive learning and masked-token prediction with the same data augmentation techniques. As seen in Table 1 ("MT: Best Aug." row vs. highest/red numbers in the "CL: *" rows): contrastive learning significantly improves over masked-token prediction for stability (+0.046 in ρ), fluorescence (+0.061 in ρ), and remote homology ([+1.2%, +8.1%, +1.4%] in classification accuracy); and masked-token prediction improves over contrastive learning for secondary structure ([+1.4%, +1.2%, +0.9%] in classification accuracy). Overall, we cannot conclude from these pairs of linear evaluation studies that contrastive learning definitively performs better or worse than masked-token prediction on all downstream tasks: different tasks benefit from different training procedures and different combinations of data augmentations.
However, we observe that the overall best results in Table 1 come from combining contrastive learning with pairs of data augmentations for all tasks besides secondary structure prediction.

Exploring the best performance via full fine-tuning.-We provide the results of the best-performing fine-tuned models (on downstream tasks) and the comparison to TAPE's original baselines in Table 2, in order to verify whether the learned representations of the best models provide good initialization points for transfer learning. Here, we have done full fine-tuning only on the best-performing, per-task models found during the linear evaluation study (see Table 1). Note that the baseline comparison changes in this table relative to the linear evaluation results above, because we allow the optimization to also adjust the parameters of the self-supervised models for every task (the TAPE baselines in Table 2 are from Rao et al. (2019)). The fine-tuned, data-augmented models outperform the TAPE baseline on stability (+0.018 in Spearman correlation ρ), remote homology, and secondary structure; they perform within one σ on fluorescence, although the large difference between full fine-tuning and linear evaluation on the fluorescence task indicates that most of the model's predictive capacity comes from the supervised learning task itself. The random amino acid replacement strategy consistently achieved our best performance across tasks, and subsampling performed well on tasks that depend on the structural properties of proteins (remote homology and secondary structure): [+0.0%, +4.1%, +3.7%] classification accuracy for remote homology and [+0.1%, +0.8%, +0.9%] for secondary structure.

5. CONCLUSION

We experimentally verify that relatively naive string manipulations can be used as data augmentations to improve the performance of self-supervised protein sequence models on the TAPE validation tasks. We demonstrate that, in general, augmentations boost model performance, in both the linear evaluation and model fine-tuning cases. However, different downstream tasks benefit from different protein augmentations; no single augmentation that we studied was consistently the best. Nonetheless, the approach we have taken, fine-tuning a pretrained model on the validation set, requires significantly lower computational cost than training on the full training set. Consequently, a modeler interested in a small number of downstream tasks would not be over-burdened to attempt fine-tuning of the semi-supervised model on a broad range of data augmentation transformations.

A TRAINING SETUP

For the augmented training, we apply the hyperparameters in Table 3, Row 1, to the self-supervised part of either the SimCLR (contrastive learning) or the masked-token prediction model. We train all models for 30 epochs on the Pfam validation set to make a relatively fair comparison. As discussed in the main paper, after the augmented training is finished, we perform linear evaluations on the models pre-trained in the previous steps, with the hyperparameters listed in Table 3, Rows 2-5, for the 4 downstream tasks. All of the linear evaluation results shown in the main paper and appendix are based on the best results we find after the augmented training and linear evaluation with the hyperparameters in Table 3. For model fine-tuning, since models trained with contrastive learning and TAPE's semi-supervised models have different statistics (their parameters differ), we do not use the same set of hyperparameters; instead, we report the best results in comparison to TAPE's.
The corresponding hyperparameter setups of the best cases described above, including augmented training and linear evaluation, can be found in Table 4. The optimizer is AdamW, identical to the one in TAPE. Because the NVIDIA V100 GPUs we use have 16 GB of memory, we constrain the sequence length to ≤ 512 to enable training. The batch sizes we report in the appendix are total batch sizes across all GPUs. "Gradient Acc" in Table 3 is short for "Gradient Accumulation Steps", the number of forward steps accumulated per gradient update to the model.

B LINEAR EVALUATION RESULTS

Here we provide comprehensive linear evaluation results of contrastive learning with single-augmentation, leave-one-out, and pairwise-augmentation setups. To simplify the tables, we use the same abbreviations as in the main paper to indicate the augmentations. We summarize the best results after the training and evaluation according to Section A for both single and pairwise augmentations in Figure 5. We use a diverging palette whose center (gray/white) is the best linear evaluation baseline result with TAPE's pre-trained model, with warmer colors referring to better-than-baseline results and cooler colors to worse-than-baseline results. The diagonal values come from the single-augmentation setup and all other values from the pairwise setup. Table 5 and Table 6 include Figure 5's corresponding values. By checking the values of the figure and tables, we observe the following: (1) For stability, the binomial replacement works well with single and pairwise augmentation; there is no improvement from the leave-one-out cases. (2) For fluorescence, pairwise augmentation with binomial replacement and shuffling can improve model performance. (3) For remote homology, we clearly see improvements coming from subsampling on all three testing sets, independent of the other augmentation in the pairwise case. (4) For secondary structure, we do not observe gains from either single or pairwise augmentation, consistent with the discussion in the main paper; the best case for secondary structure comes from the masked-token prediction model with augmentations. Beyond the results above, we also provide leave-one-out results in Table 8, covering different leave-one-out cases that contain different augmentations. With leave-one-out augmentations, we observe only a few cases that outperform TAPE's baselines across the 4 downstream tasks.



Figure 1: Diagram of data augmentations. We study randomly replacing residues (with probability p) with (a) a chemically-motivated dictionary replacement or (b) the single amino acid alanine. We also consider randomly shuffling either (c) the entire sequence or (d) a local region only. Finally, we look at (e) reversing the whole sequence and (f) subsampling to a subset of the original.

Figure 2: Diagram of experimental approach (see Sect. 3.1). We use dashed boxes to indicate the different steps: semi-supervised pre-training, augmented learning, linear evaluation, and finally fine-tuning the best-performing augmented models on downstream tasks. In each box, we include the general model architectures, with major sub-modules in different colors. The model freezer indicates that the semi-supervised model is not updated during linear evaluation.

Figure 3: Contrastive learning performance with pairwise & single augmentations in linear evaluation for 4 different tasks. The axes refer to different augmentations, with the diagonal being single augmentations. The values in the heatmaps are correlation (stability and fluorescence) and classification accuracy (remote homology and secondary structure). We do not consider two pairs: RD(p = 0.01) & RD(p = 0.5) and GRS & LRS, due to redundancy. The per-task performance of the masked-token TAPE Baseline model is colored white in each subfigure; red is better performance, blue is worse. (Also see Appendix B.)

Figure 4: Top row: Effects of binomial replacement probability p for linear evaluation on contrastive learning models for TAPE tasks. Bottom row: Effects of augmentation ratio γ for linear evaluation on TAPE's self-supervised model with the best-performing task-specific augmentations (see "Best Aug." row in Table 2), using masked-token prediction. The subfigures include linear evaluation results with different augmentation ratios, γ. "γ=0.0" refers to fine-tuning with no data augmentations, whereas "Baseline" refers to the TAPE pre-trained model with no further training. (Also see Appendix C.)

Figure 5: Contrastive Learning Performance with Pairwise & Single Augmentation in Linear Evaluation. The heatmaps include the performance of contrastive learning models with pairwise/single augmentations for 4 different downstream tasks considering all possible testing sets. Both x and y axes refer to different augmentations. The diagonal of heatmaps refer to the single augmentation cases. All other cells refer to cases with pairwise augmentations. The values in the heatmaps refer to evaluation results: "ρ" for Stability and Fluorescence, and "Classification Accuracy" for Secondary Structure and Remote Homology.

Table 1: Best linear evaluation results. Bold refers to cases that outperform the TAPE baselines; red is the task-wise best-performing result. MT and CL refer to training with masked-token prediction and contrastive learning, respectively. Stability and fluorescence are scored by Spearman correlation, and remote homology (fold, family, superfamily) and secondary structure (CASP12, TS115, CB513) by classification accuracy. Bootstrap errors are reported per task by taking the maximum error found for any of the models. (Also see Appendix C.)

Table 2: Model fine-tuning results, with associated training method and data augmentation procedure for each task. Testing sets for remote homology: (fold, family, superfamily); for secondary structure: (CASP12, TS115, CB513).

Table 3: Hyperparameter setups for models in different tasks.

Table 4: Configurations of the best models.


Nevertheless, the leave-one-out results suggest that one should not combine arbitrary augmentations, given the decrease in performance we observe in the leave-one-out cases.

C SUPPLEMENTARY

In this section, we provide supplementary results. Specifically, we provide the complete comparison table for the masked-token prediction model with different augmentation ratios γ in Table 9; the corresponding plot is Figure 4, with analysis in the main paper. In addition, Table 7 provides similar information to Table 2 in the main paper, except that the values for remote homology and secondary structure here are cross entropy rather than classification error.

