IMPROVING GENERALIZABILITY OF PROTEIN SEQUENCE MODELS WITH DATA AUGMENTATIONS

Abstract

While protein sequence data is an emerging application domain for machine learning methods, small modifications to protein sequences can result in difficult-to-predict changes to the protein's function. Consequently, protein machine learning models typically do not use randomized data augmentation procedures analogous to those used in computer vision or natural language, e.g., cropping or synonym substitution. In this paper, we empirically explore a set of simple string manipulations, which we use to augment protein sequence data when fine-tuning semi-supervised protein models. We provide 276 different comparisons to the Tasks Assessing Protein Embeddings (TAPE) baseline models, with Transformer-based models and training datasets that vary from the baseline methods only in the data augmentations and representation learning procedure. For each TAPE validation task, we demonstrate improvements to the baseline scores when the learned protein representation is fixed between tasks. We also show that contrastive learning fine-tuning methods typically outperform masked-token prediction in these models, with increasing amounts of data augmentation generally improving performance for contrastive learning protein methods. We find the most consistent results across TAPE tasks when using domain-motivated transformations, such as amino acid replacement, as well as restricting the Transformer attention to randomly sampled sub-regions of the protein sequence. In rarer cases, we even find that information-destroying augmentations, such as randomly shuffling entire protein sequences, can improve downstream performance.

1. INTRODUCTION

Semi-supervised learning has proven to be an effective mechanism for promoting generalizability in protein machine learning models, as task-specific labels are generally very sparse. For other common data types, however, there are simple transformations that can be applied to the data to improve a model's ability to generalize: vision models use cropping, rotations, or color distortion; natural language models can employ synonym substitution; and time-series models benefit from window restriction or noise injection.

Scientific data, such as a corpus of protein sequences, admits few obvious transformations that unambiguously preserve the meaningful information in the data. Often, an easily understood transformation to a protein sequence (e.g., replacing an amino acid with a chemically similar one) will unpredictably produce either a very biologically similar or a very biologically different mutant protein.

In this paper, we take the uncertainty arising from the unknown effect of simple data augmentations in protein sequence modeling as an empirical challenge that deserves a robust assessment. To our knowledge, no study has been performed to determine whether simple data augmentation techniques improve performance on a suite of protein tasks. We focus on fine-tuning previously published self-supervised models that are typically used for representation learning with protein sequences, viz. the transformer-based methods of Rao et al. (2019), which have shown the best ability to generalize on a set of biological tasks referred to as Tasks Assessing Protein Embeddings (TAPE). We test one or more of the following data augmentations: replacing an amino acid with a pre-defined alternative; shuffling the input sequence either globally or locally; reversing the sequence; or subsampling the sequence to focus only on a local region (see Fig. 1).
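The augmentations listed above are all simple string manipulations. A minimal sketch of what they might look like is given below; the substitution dictionary, function names, and window/crop parameters here are illustrative assumptions, not the paper's actual implementation.

```python
import random

# Hypothetical chemically-similar substitution pairs (illustrative only;
# the paper's actual replacement dictionary is not reproduced here).
SIMILAR = {"L": "I", "I": "L", "D": "E", "E": "D", "S": "T", "T": "S"}

def replace(seq, p=0.05, rng=random):
    """Replace each amino acid with a pre-defined alternative with probability p."""
    return "".join(SIMILAR.get(aa, aa) if rng.random() < p else aa for aa in seq)

def global_shuffle(seq, rng=random):
    """Shuffle the entire sequence (information-destroying)."""
    chars = list(seq)
    rng.shuffle(chars)
    return "".join(chars)

def local_shuffle(seq, window=10, rng=random):
    """Shuffle amino acids only within non-overlapping local windows."""
    out = []
    for i in range(0, len(seq), window):
        chunk = list(seq[i:i + window])
        rng.shuffle(chunk)
        out.extend(chunk)
    return "".join(out)

def reverse(seq):
    """Reverse the full sequence."""
    return seq[::-1]

def subsample(seq, min_len=16, rng=random):
    """Crop to a random contiguous sub-region of the sequence."""
    if len(seq) <= min_len:
        return seq
    length = rng.randint(min_len, len(seq))
    start = rng.randint(0, len(seq) - length)
    return seq[start:start + length]
```

Note that `replace`, `local_shuffle`, and `subsample` are local or conservative edits, while `global_shuffle` and `reverse` destroy or invert the sequential structure, which is the distinction the experiments probe.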

