MASKED INVERSE FOLDING WITH SEQUENCE TRANSFER FOR PROTEIN REPRESENTATION LEARNING

Abstract

Self-supervised pretraining on protein sequences has led to state-of-the-art performance on protein function and fitness prediction. However, sequence-only methods ignore the rich information contained in experimental and predicted protein structures. Meanwhile, inverse folding methods reconstruct a protein's amino-acid sequence given its structure, but do not take advantage of sequences that do not have known structures. In this study, we train a masked inverse folding protein language model parameterized as a structured graph neural network. We then show that using the outputs from a pretrained sequence-only protein masked language model as input to the inverse folding model further improves pretraining perplexity. We evaluate both of these models on downstream protein engineering tasks and analyze the effect of using information from experimental or predicted structures on performance.

1. INTRODUCTION

Large pretrained protein masked language models (MLMs) have advanced the ability of machine-learning methods to predict protein structure, function, and fitness from sequence, especially when labeled training data is sparse. The recent state of the art, inspired by BERT (bidirectional encoder representations from transformers) (Devlin et al., 2018), uses increasingly large transformer (Vaswani et al., 2017) models to reconstruct masked and mutated protein sequences taken from databases such as UniProt (UniProt Consortium, 2021), UniRef (Suzek et al., 2015), and BFD (Steinegger et al., 2019; Steinegger & Söding, 2018). Pretrained protein MLMs contain structural information (Rao et al., 2019; Rives et al., 2021; Chowdhury et al., 2021), encode evolutionary trajectories (Hie et al., 2022b; 2021), are zero-shot predictors of mutation fitness effects (Meier et al., 2021), improve out-of-domain generalization on protein engineering datasets (Dallago et al., 2021), and suggest improved sequences for engineering (Hie et al., 2022a). Protein MLMs are now incorporated into the latest machine-learning methods for detecting signal peptides (Teufel et al., 2021) and predicting intracellular localization (Thumuluri et al., 2022). However, training only on sequences ignores the rich information contained in experimental and predicted protein structures, especially as the number of high-quality structures from AlphaFold (Jumper et al., 2021; Varadi et al., 2022) increases. Meanwhile, inverse folding methods reconstruct a protein's amino-acid sequence given its structure. Deep learning-based inverse folding is usually parametrized as a graph neural network (GNN) (Ingraham et al., 2019; Strokach et al., 2020; Jin et al., 2021; Jing et al., 2020) or SE(3)-equivariant transformer (McPartlon et al., 2022) that either reconstructs or autoregressively decodes the amino-acid sequence conditioned on the desired backbone structure.
The ability to generate amino-acid sequences that fold into a desired structure is useful for developing novel therapeutics (Chevalier et al., 2017), biosensors (Quijano-Rubio et al., 2021), industrial enzymes (Siegel et al., 2010), and targeted small molecules (Lucas & Kortemme, 2020). Furthermore, single-chain inverse folding approaches could be coupled with recent sequential-assembly-based multimer structure prediction techniques (Bryant et al., 2022) for fixed-backbone multimer design. However, we are primarily interested in using inverse folding as a pretraining task, with the intuition that incorporating structural information should improve performance on downstream tasks. Moreover, current inverse folding methods must be trained on sequences with known or predicted structures, and thus take advantage of neither the large number of sequences without known structures nor the menagerie of pretrained protein MLMs. For example, UniRef50 contains 42 million sequences, while the PDB (Rose et al., 2016) currently contains 190 thousand experimentally-measured protein structures. In this study, we train a Masked Inverse Folding (MIF) protein masked language model (MLM) parameterized as a structured graph neural network (SGNN) (Ingraham et al., 2019). To our knowledge, this is the first example of combining the MLM task with structure in a pretraining task. We then show that using the outputs from a pretrained sequence-only protein MLM as input to MIF further improves pretraining perplexity by leveraging information from sequences without experimental structures. We refer to this model as Masked Inverse Folding with Sequence Transfer (MIF-ST). Figure 1 compares the previous sequence-only dilated convolutional protein MLM (CARP), MIF, and MIF-ST. This is a novel way of transferring information from unlabeled protein sequences into a model that requires structure.
We evaluate MIF and MIF-ST on downstream protein engineering tasks and analyze the effect of experimental and predicted structures on performance. Finally, we comment on the state of pretrained models for protein fitness prediction. 

2.1. BACKGROUND

Proteins are chains of amino acids that fold into three-dimensional structures. In masked language modeling pretraining on protein sequences, a model learns to reconstruct the original protein sequence from a corrupted version; the model likelihoods are then used to make zero-shot predictions, or the pretrained weights are used as a starting point for training on a downstream task, such as structure or fitness prediction. For example, ESM (Rives et al., 2021) and CARP (Yang et al., 2022) use the corruption scheme first described in BERT (Devlin et al., 2018). Given a vocabulary T of amino acids, we start from an amino-acid sequence s of length L with amino acids s_i ∈ T, 1 ≤ i ≤ L. A set M comprising 15% of the positions is selected uniformly at random; 80% of these positions are changed to a special mask token, 10% are randomly mutated to another amino acid, and the remaining 10% are left unchanged, yielding s_noised. The model learns to predict the original amino acids,

p(s_i | s_noised) ∀ i ∈ M,    (1)

by minimizing the negative log-likelihood at positions i ∈ M.
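The BERT-style corruption scheme described above can be sketched as follows. This is a minimal illustration, not the authors' implementation; the mask token `#` and the `bert_corrupt` helper are hypothetical, and real models operate on integer token IDs rather than characters.

```python
import random

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")  # 20 canonical amino acids
MASK = "#"  # placeholder mask token; the actual token is model-specific


def bert_corrupt(seq, mask_frac=0.15, seed=None):
    """Apply BERT-style corruption: select mask_frac of positions; of those,
    80% become the mask token, 10% are mutated to a random amino acid,
    and 10% are left unchanged. Returns (noised sequence, selected positions)."""
    rng = random.Random(seed)
    tokens = list(seq)
    n_select = max(1, round(mask_frac * len(tokens)))
    selected = sorted(rng.sample(range(len(tokens)), n_select))
    for i in selected:
        r = rng.random()
        if r < 0.8:
            tokens[i] = MASK
        elif r < 0.9:
            tokens[i] = rng.choice(AMINO_ACIDS)
        # else: position stays unchanged but is still predicted in the loss
    return "".join(tokens), selected


noised, positions = bert_corrupt("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", seed=0)
```

The loss in Eq. 1 is then computed only at the returned `positions`, regardless of whether the token there was masked, mutated, or left unchanged.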

2.2. MASKED INVERSE FOLDING

While MLM pretraining on protein sequences can encode structural and functional information, adding information about the protein's backbone structure improves sequence recovery. A protein's backbone structure consists of the coordinates of each amino-acid residue's C, Cα, Cβ, and N atoms, leaving out information about the side chains (which would trivially reveal each residue's amino-acid identity).
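Structure-conditioned models such as the SGNN of Ingraham et al. (2019) consume the backbone as a residue graph with geometric edge features. The sketch below, a simplified illustration and not the paper's featurization, builds only the distance portion of such a graph from Cα coordinates; the `knn_graph_from_backbone` helper is hypothetical, and the full model also uses orientations and dihedral angles.

```python
import numpy as np


def knn_graph_from_backbone(ca_coords, k=3):
    """Build a k-nearest-neighbor residue graph from Ca coordinates.

    ca_coords: array-like of shape (L, 3), one Ca atom per residue.
    Returns (neighbors, dists): for each residue, the indices of its k
    nearest residues and the corresponding Euclidean distances.
    """
    ca = np.asarray(ca_coords, dtype=float)                 # (L, 3)
    d = np.linalg.norm(ca[:, None, :] - ca[None, :, :], axis=-1)  # (L, L)
    np.fill_diagonal(d, np.inf)                             # exclude self-edges
    neighbors = np.argsort(d, axis=-1)[:, :k]               # (L, k) indices
    dists = np.take_along_axis(d, neighbors, axis=-1)       # (L, k) distances
    return neighbors, dists
```

Edge features like these are what make the pretraining task "inverse folding": the model must recover the masked amino acids from geometry alone, without side-chain information.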



Figure 1: Summary of models: (a) the Convolutional Autoencoding Representations of Proteins protein masked language model, (b) the Masked Inverse Folding model, and (c) the Masked Inverse Folding with Sequence Transfer model.

