IMPROVING GENERALIZABILITY OF PROTEIN SEQUENCE MODELS WITH DATA AUGMENTATIONS

Abstract

While protein sequence data is an emerging application domain for machine learning methods, small modifications to protein sequences can result in difficult-to-predict changes to the protein's function. Consequently, protein machine learning models typically do not use randomized data augmentation procedures analogous to those used in computer vision or natural language, e.g., cropping or synonym substitution. In this paper, we empirically explore a set of simple string manipulations, which we use to augment protein sequence data when fine-tuning semi-supervised protein models. We provide 276 different comparisons to the Tasks Assessing Protein Embeddings (TAPE) baseline models, with Transformer-based models and training datasets that vary from the baseline methods only in the data augmentations and representation learning procedure. For each TAPE validation task, we demonstrate improvements to the baseline scores when the learned protein representation is fixed between tasks. We also show that contrastive learning fine-tuning methods typically outperform masked-token prediction in these models, with increasing amounts of data augmentation generally improving performance for contrastive learning protein methods. We find the most consistent results across TAPE tasks when using domain-motivated transformations, such as amino acid replacement, as well as restricting the Transformer attention to randomly sampled sub-regions of the protein sequence. In rarer cases, we even find that information-destroying augmentations, such as randomly shuffling entire protein sequences, can improve downstream performance.

1. INTRODUCTION

Semi-supervised learning has proven to be an effective mechanism to promote generalizability for protein machine learning models, as task-specific labels are generally very sparse. However, with other common data types there are simple transformations that can be applied to the data in order to improve a model's ability to generalize: for instance, vision models use cropping, rotations, or color distortion; natural language models can employ synonym substitution; and time series data models benefit from window restriction or noise injection. Scientific data, such as a corpus of protein sequences, offers few obvious transformations that unambiguously preserve the meaningful information in the data. Often, an easily understood transformation to a protein sequence (e.g., replacing an amino acid with a chemically similar one) will unpredictably produce either a very biologically similar or very biologically different mutant protein. In this paper, we take the uncertainty arising from the unknown effect of simple data augmentations in protein sequence modeling as an empirical challenge that deserves a robust assessment. To our knowledge, no study has been performed to find out whether simple data augmentation techniques improve a suite of protein tasks. We focus on fine-tuning previously published self-supervised models that are typically used for representation learning with protein sequences, viz. the Transformer-based methods of Rao et al. (2019), which have shown the best ability to generalize on a set of biological tasks, referred to as Tasks Assessing Protein Embeddings (TAPE). We test one or more of the following data augmentations: replacing an amino acid with a pre-defined alternative; shuffling the input sequences either globally or locally; reversing the sequence; or subsampling the sequence to focus only on a local region (see Fig. 1).
We demonstrate that fine-tuning the baseline models with data augmentations yields protein sequence representations with relative improvements between 1% (secondary structure accuracy) and 41% (fluorescence ρ), as assessed with linear evaluation for all TAPE tasks we studied. When fine-tuning the same representations during supervised learning on each TAPE task, we show significant improvement as compared to baseline for 3 out of 4 TAPE tasks, with the fourth (fluorescence) within 1σ in performance. We also study the effect of increasingly aggressive data augmentations: when fine-tuning baseline models with contrastive learning (Hadsell et al., 2006; Chen et al., 2020a) we see a local maximum in downstream performance as a function of the quantity of data augmentation, with "no augmentations" generally under-performing modest amounts of data augmentation. Conversely, performing the same experiments but using masked-token prediction instead of contrastive learning, we detect a minor trend of decreasing performance on the TAPE tasks as we more frequently use data augmentations during fine-tuning. We interpret this as evidence that contrastive learning techniques, which require the use of data augmentation, are important methods that can be used to improve generalizability of protein models.
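For a concrete reference point, the contrastive fine-tuning discussed above follows the SimCLR recipe: two augmented views of the same protein sequence form a positive pair, and all other sequences in the batch serve as negatives. The following is a minimal NumPy sketch of the SimCLR-style NT-Xent loss; the function name and implementation details are our own illustrative choices, not the authors' training code.

```python
import numpy as np

def nt_xent_loss(z1, z2, temperature=0.5):
    """SimCLR-style NT-Xent loss (illustrative sketch).

    z1[i] and z2[i] are embeddings of two augmented views of the same
    protein; every other row in the batch acts as a negative.
    """
    n = len(z1)
    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit-normalize rows
    sim = z @ z.T / temperature                        # cosine similarities
    np.fill_diagonal(sim, -np.inf)                     # exclude self-pairs
    # Softmax cross-entropy where the positive for row i is row (i + n) % 2n.
    logits = sim - sim.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    pos = (np.arange(2 * n) + n) % (2 * n)
    return -log_probs[np.arange(2 * n), pos].mean()
```

As expected for this objective, the loss decreases as the two views of each sequence become more similar in embedding space, which is what drives the representation to become invariant to the chosen augmentations.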

2. RELATED WORKS

Self-supervised and semi-supervised methods have become the dominant paradigm in modeling protein sequences for use in downstream tasks. 



Figure 1: Diagram of data augmentations. We study randomly replacing residues (with probability p) with (a) a chemically-motivated dictionary replacement or (b) the single amino acid alanine. We also consider randomly shuffling either (c) the entire sequence or (d) a local region only. Finally, we look at (e) reversing the whole sequence and (f) subsampling to a contiguous region of the original sequence.
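Each of the six augmentations in Fig. 1 amounts to a short string manipulation. The sketch below is illustrative only: the helper names and the (truncated) similarity dictionary are our own assumptions, not the exact replacement tables or code used in the experiments.

```python
import random

# Hypothetical chemically-motivated replacement dictionary (illustrative
# subset): each residue maps to a chemically similar one.
SIMILAR = {"D": "E", "E": "D", "K": "R", "R": "K", "I": "L", "L": "I",
           "S": "T", "T": "S", "F": "Y", "Y": "F", "N": "Q", "Q": "N"}

def replace_dict(seq, p=0.1, rng=random):
    """(a) Replace each residue with a chemically similar one w.p. p."""
    return "".join(SIMILAR.get(c, c) if rng.random() < p else c for c in seq)

def replace_alanine(seq, p=0.1, rng=random):
    """(b) Replace each residue with alanine ('A') w.p. p."""
    return "".join("A" if rng.random() < p else c for c in seq)

def shuffle_global(seq, rng=random):
    """(c) Randomly shuffle the entire sequence."""
    chars = list(seq)
    rng.shuffle(chars)
    return "".join(chars)

def shuffle_local(seq, window=10, rng=random):
    """(d) Shuffle residues inside one randomly chosen local window."""
    i = rng.randrange(max(1, len(seq) - window + 1))
    chunk = list(seq[i:i + window])
    rng.shuffle(chunk)
    return seq[:i] + "".join(chunk) + seq[i + window:]

def reverse(seq):
    """(e) Reverse the whole sequence."""
    return seq[::-1]

def subsample(seq, min_len=16, rng=random):
    """(f) Keep only a random contiguous subregion."""
    length = rng.randrange(min_len, len(seq) + 1)
    start = rng.randrange(len(seq) - length + 1)
    return seq[start:start + length]
```

Note that (a), (b), and (f) preserve local sequence context, while (c) destroys essentially all positional information; this ordering roughly tracks how "aggressive" each augmentation is.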

Rao et al. (2019) have studied next-token and masked-token prediction, inspired by the BERT natural language model (Devlin et al., 2018). Riesselman et al. (2019) have extended this to autoregressive likelihoods; and Rives et al. (2019), Heinzinger et al. (2019), and Alley et al. (2019) have shown that unsupervised methods trained on unlabeled sequences are competitive with mutation effect predictors using evolutionary features. Of importance to this work are self-supervised learning algorithms employed for other data types that use or learn data augmentations. For example, Gidaris et al. (2018) learn image features through random rotations; Dosovitskiy et al. (2014) and Noroozi & Favaro (2016) study image patches and their correlations to the original samples; and van den Oord et al. (2018) use contrastive methods to predict future values of an input sequence. We consider sequence augmentations in natural language as the most relevant comparison for the data augmentations we study in this paper (Zou, 2019). However, sequence augmentations designed for natural languages often require preserving the contextual meaning of the sentences, a factor that is less explicit for protein sequences.

Contrastive Learning is a set of approaches that learn representations of data by distinguishing positive data pairs from negative pairs (Hadsell et al., 2006). SimCLR (v1 & v2) (Chen et al., 2020a;b) describes the current state-of-the-art contrastive learning technique; we use this approach liberally in this paper not only because it performs well, but because it requires data transformations to ex-

