CONVOLUTIONS ARE COMPETITIVE WITH TRANSFORMERS FOR PROTEIN SEQUENCE PRETRAINING

Abstract

Pretrained protein sequence language models largely rely on the transformer architecture. However, transformer run-time and memory requirements scale quadratically with sequence length. We investigate the potential of a CNN-based architecture for protein sequence masked language model pretraining and subsequent finetuning. CNNs are competitive on the pretraining task with transformers across several orders of magnitude in parameter size while scaling linearly with sequence length. More importantly, CNNs are competitive with and occasionally superior to transformers across an extensive set of downstream evaluations, including structure prediction, zero-shot mutation effect prediction, and out-of-domain generalization. We also demonstrate strong performance on sequences longer than the positional embeddings allowed in the current state-of-the-art transformer protein masked language models. Finally, we close with a call to disentangle the effects of pretraining task and model architecture when studying pretrained protein sequence models.

1. INTRODUCTION

Large pretrained protein language models, largely relying on the attention-based transformer (Vaswani et al., 2017) architecture, have advanced the ability of machine-learning methods to predict protein structure and function from sequence, especially when labeled training data is sparse. Most modern self-supervised protein sequence pretraining combines a transformer model with either an autoregressive likelihood (Madani et al., 2020; 2021; Ferruz et al., 2022; Hesslow et al., 2022) or with the masked language modeling (MLM) task introduced for natural language by BERT (bidirectional encoder representations from transformers) (Devlin et al., 2018). Pretrained transformer protein MLMs contain structural information (Rao et al., 2019; Rives et al., 2021; Chowdhury et al., 2021), encode evolutionary trajectories (Hie et al., 2022a; 2021), are zero-shot predictors of mutation fitness effects (Meier et al., 2021), improve out-of-domain generalization on protein engineering datasets (Dallago et al., 2021), and suggest improved sequences for engineering (Hie et al., 2022b). Protein MLMs are now incorporated into the latest machine-learning methods for detecting signal peptides (Teufel et al., 2021) and predicting intracellular localization (Thumuluri et al., 2022).

One drawback of transformers is that the compute and memory required by the attention layers scale quadratically with input sequence length. In addition, transformer attention is invariant to position, so transformer sequence models include a positional embedding. Depending on the formulation, these embeddings can be difficult to extend past the maximum length seen during training. As a result, some popular pretrained protein transformer models limit the input length during pretraining and inference; for example, ESM has a maximum input length of 1022 residues.
Of the 42 million cluster representatives in the March 2020 release of UniRef50 (Suzek et al., 2015), 1.1 million, or 2.6%, are longer than 1022 residues. This includes many proteins of interest, such as the SARS-CoV-2 spike glycoprotein and Streptococcus pyogenes CRISPR-associated endonuclease Cas9.

Furthermore, there has been little investigation of how model architecture interacts with pretraining tasks on protein sequences. Transformers can perform the masked language model task on protein sequences, and pretraining improves the performance of transformers on downstream protein structure and property prediction tasks. However, it is important to disentangle pretraining from architectural advances and consider them independently. We seek to do this by investigating the effectiveness of pretrained and naive convolutions for proteins. We train protein sequence convolutional masked language models on UniRef50, which we refer to as CARP (Convolutional Autoencoding Representations of Proteins). Our CARP models are competitive with transformers on the pretraining task at comparable parameter counts. The largest CARP, with approximately 640M learnable parameters (CARP-640M), is competitive with the current state-of-the-art transformer protein sequence masked language model, ESM (Rives et al., 2021; Meier et al., 2021), on a variety of downstream prediction tasks, including structure prediction, zero-shot mutation effect prediction, and out-of-domain generalization on biologically-relevant protein engineering datasets. Because CARP's computation scales linearly with input sequence length and does not rely on an input positional embedding, it is straightforward to apply it to sequences longer than the longest sequences seen in training, which we demonstrate with zero-shot predictions of mutation effects in CRISPR-Cas9.
These empirical results demonstrate a need to deepen our understanding of protein sequence pretraining by disentangling the effects of architecture and the pretraining task. Finally, while performance on structure prediction tasks improves as model size and pretraining performance improve, this is not the case for all fitness prediction tasks, demonstrating that we also need to deepen our understanding of how pretraining relates to downstream tasks.

2. CONVOLUTIONAL PROTEIN SEQUENCE MASKED LANGUAGE MODELS

We pretrain CARP using the masked language model (MLM) objective described in Rives et al. (2021). Each sequence is corrupted by changing some tokens to a special mask token or another amino acid token, and the model is tasked with reconstructing the original sequence. Specifically, 15% of tokens from each sequence are randomly selected for supervision. Of those selected tokens, 80% are replaced by the mask token, 10% are replaced by a randomly-chosen amino acid, and 10% remain unchanged. The model is trained to minimize the cross-entropy loss between its predictions for the selected tokens and the true tokens at those locations. We train on the cluster representatives from the March 2020 release of UniRef50, with approximately 83k sequences held out for validation and another 210k sequences held out for testing, leaving 41.5 million sequences for training.

CARP combines the ByteNet encoder dilated CNN architecture from Kalchbrenner et al. (2016) with simple input embedding and output decoding layers, as shown in Figure 1a. CARP begins with an embedding layer, which maps an input sequence of L tokens x ∈ D^L to an 8-dimensional intermediate embedding, followed by a linear mapping into the model dimension d: e_0 ∈ R^(L×d). This passes through a stack of n ByteNet dilated CNN blocks (Figure 1b) with residual connections in between, followed by a final layer norm to produce the encoder representation e_n ∈ R^(L×d); finally, a linear decoder maps this to the L × t logits, where t is the number of possible tokens. The 1 × 5 convolution layer in every ByteNet block is dilated and padded to preserve sequence length. Dilation
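The 15% selection and 80/10/10 corruption scheme described above can be sketched as follows. This is a minimal illustration, not the paper's code; the mask symbol `#` and the helper names are assumptions made here for clarity.

```python
import random

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")
MASK = "#"  # placeholder mask symbol; the actual mask token is model-specific

def corrupt_sequence(seq, sup_frac=0.15, rng=None):
    """BERT-style corruption: select sup_frac of positions for supervision,
    then mask 80% of them, randomize 10%, and leave 10% unchanged.

    Returns the corrupted sequence and the supervised indices; the training
    loss is cross-entropy between predictions and true tokens at those
    indices only.
    """
    rng = rng or random.Random(0)
    tokens = list(seq)
    n_sup = max(1, round(sup_frac * len(tokens)))
    sup_idx = rng.sample(range(len(tokens)), n_sup)
    for i in sup_idx:
        r = rng.random()
        if r < 0.8:                       # 80%: replace with the mask token
            tokens[i] = MASK
        elif r < 0.9:                     # 10%: replace with a random amino acid
            tokens[i] = rng.choice(AMINO_ACIDS)
        # remaining 10%: token is supervised but left unchanged
    return "".join(tokens), sorted(sup_idx)
```

Note that positions outside the supervised set contribute nothing to the loss, even though the model produces logits for every position.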



Figure 1: The CARP architecture.
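A residual ByteNet block of the kind shown in Figure 1b can be sketched in PyTorch as below. The dilated 1 × 5 convolution is padded to preserve sequence length, as described in the text; the 1 × 1 bottleneck convolutions, the normalization placement, and the GELU activations are assumptions based on the original ByteNet design, not the paper's released code.

```python
import torch
import torch.nn as nn

class ByteNetBlock(nn.Module):
    """One residual ByteNet-style block (sketch).

    Input and output are (batch, d_model, L); the residual connection
    requires the inner stack to preserve both channel count and length.
    """
    def __init__(self, d_model: int, dilation: int, kernel_size: int = 5):
        super().__init__()
        # Padding chosen so the dilated convolution preserves length L.
        pad = (kernel_size - 1) // 2 * dilation
        d_h = d_model // 2  # bottleneck width; the factor of 2 is an assumption
        self.net = nn.Sequential(
            nn.GroupNorm(1, d_model), nn.GELU(),
            nn.Conv1d(d_model, d_h, kernel_size=1),
            nn.GroupNorm(1, d_h), nn.GELU(),
            nn.Conv1d(d_h, d_h, kernel_size, dilation=dilation, padding=pad),
            nn.GroupNorm(1, d_h), nn.GELU(),
            nn.Conv1d(d_h, d_model, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.net(x)
```

Because every layer here is a convolution, compute grows linearly with L, and no positional embedding is needed: position information enters implicitly through the convolutions' local, order-sensitive receptive fields.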

