SINGLE LAYERS OF ATTENTION SUFFICE TO PREDICT PROTEIN CONTACTS

Abstract

The established approach to unsupervised protein contact prediction estimates coevolving positions using undirected graphical models. This approach trains a Potts model on a Multiple Sequence Alignment, then predicts that the edges with highest weight correspond to contacts in the 3D structure. On the other hand, increasingly large Transformers are being pretrained on protein sequence databases but have demonstrated mixed results for downstream tasks, including contact prediction. This has sparked discussion about the role of scale and attention-based models in unsupervised protein representation learning. We argue that attention is a principled model of protein interactions, grounded in real properties of protein family data. We introduce a simplified attention layer, factored attention, and show that it achieves comparable performance to Potts models, while sharing parameters both within and across families. Further, we extract contacts from the attention maps of a pretrained Transformer and show they perform competitively with the other two approaches. This provides evidence that large-scale pretraining can learn meaningful protein features when presented with unlabeled and unaligned data. We contrast factored attention with the Transformer to indicate that the Transformer leverages hierarchical signal in protein family databases not captured by our single-layer models. This raises the exciting possibility for the development of powerful structured models of protein family databases.

1. INTRODUCTION

Inferring protein structure from sequence is a longstanding problem in computational biochemistry. Potts models, a particular kind of Markov Random Field (MRF), are the predominant unsupervised method for modeling interactions between amino acids. Potts models are trained to maximize pseudolikelihood on alignments of evolutionarily related proteins (Balakrishnan et al., 2011; Ekeberg et al., 2013; Seemayer et al., 2014). Features derived from Potts models were the main drivers of performance at the CASP11 competition (Monastyrskyy et al., 2016) and have become standard in state-of-the-art supervised models (Wang et al., 2017; Yang et al., 2019; Senior et al., 2020).

Inspired by the success of BERT (Devlin et al., 2018), GPT (Brown et al., 2020), and related unsupervised models in NLP, a line of work has emerged that learns features of proteins through self-supervised pretraining (Rives et al., 2020; Elnaggar et al., 2020; Rao et al., 2019; Madani et al., 2020; Nambiar et al., 2020). This new approach trains Transformer (Vaswani et al., 2017) models on large datasets of protein sequences. There is significant debate over the role of pretraining in protein modeling. Pretrained model performance raises questions about the importance of data and model scale (Lu et al., 2020; Elnaggar et al., 2020), the potential for neural features to compete with evolutionary features extracted by established bioinformatic methods (Rao et al., 2019), and the benefits of transfer learning for protein landscape prediction (Shanehsazzadeh et al., 2020).

In this paper, we take the position that attention-based models can build on the strengths of both Potts models trained on alignments and Transformers pretrained on databases. We introduce a simplified model, factored attention, and show that it is motivated directly by fundamental properties of protein sequences.
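The Potts pipeline described above (fit an MRF to an MSA, then rank position pairs by coupling strength) can be sketched in a few lines. The coupling tensor `J`, its shape, and the average-product correction (APC) are assumptions of this sketch rather than details stated here; APC is a heuristic commonly used in coevolutionary analysis to suppress background coupling signal.

```python
import numpy as np

def contact_scores_from_potts(J):
    """Score position pairs from a fitted Potts coupling tensor.

    J: array of shape (L, L, A, A) holding the pairwise coupling block
    J[i, j] between positions i and j (A = alphabet size). Returns an
    (L, L) matrix of contact scores; higher scores correspond to pairs
    predicted to be contacts. The APC step is a common heuristic, not
    something prescribed by the text above.
    """
    L = J.shape[0]
    # Strength of each pairwise coupling block: Frobenius norm.
    S = np.linalg.norm(J.reshape(L, L, -1), axis=-1)
    np.fill_diagonal(S, 0.0)
    # Average-product correction: subtract the outer product of the
    # per-position mean couplings, normalized by the overall mean.
    row = S.mean(axis=0, keepdims=True)
    return S - row * row.T / S.mean()

# Toy usage: random couplings for a length-6 protein, 20 amino acids.
rng = np.random.default_rng(0)
J = rng.normal(size=(6, 6, 20, 20))
scores = contact_scores_from_potts(J)
```

In practice the top-scoring pairs (excluding near-diagonal entries, which are trivially close in sequence) would be predicted as contacts.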
We demonstrate empirically that a single layer of factored attention suffices to predict protein contacts competitively with state-of-the-art Potts models, while leveraging parameter sharing across positions within a single family and across hundreds of families. This highlights the potential for explicit modeling of biological properties with attention mechanisms. Further, inspired by recent work from Vig et al. (2020), we systematically demonstrate that contacts extracted from ProtBERT-BFD (Elnaggar et al., 2020) are competitive with those estimated by Potts models across 748 protein families. We find that ProtBERT-BFD outperforms Potts for proteins with alignments smaller than 256 sequences. This indicates that large-scale Transformer pretraining merits continued efforts from the community. We contrast factored attention with ProtBERT-BFD to identify a large gap between the gains afforded by the assumptions of single-layer models and the performance of ProtBERT-BFD. This suggests the existence of properties linking protein families that allow for effective modeling of thousands of families at once. These properties are inaccessible to Potts models, yet are implicitly learned by Transformers through extensive pretraining. Understanding and leveraging these properties represents an exciting challenge for protein representation learning.

Our contributions are as follows:

1. We analyze the assumptions made by the attention mechanism in the Transformer and show that a single attention layer is a well-founded model of interactions within protein families; attention for proteins can be justified without any analogies to natural language.

2. We show that single-layer models can successfully share parameters across positions within a single family or share amino acid features across hundreds of families, demonstrating that factored attention achieves performance nearly identical to Potts with far fewer parameters.

3. We carefully benchmark ProtBERT-BFD against an optimized Potts implementation and show that the pretrained Transformer extracts contacts competitively with Potts.
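To make the parameter savings concrete, here is a minimal single-head sketch of how a factored layer could assemble Potts-like pairwise couplings from small position factors and one amino-acid matrix shared across all position pairs. The parameterization shown (names `Q`, `K`, `W_V`, a single head, no symmetrization) is an illustrative assumption, not the paper's precise definition.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def factored_attention_couplings(Q, K, W_V):
    """Assemble Potts-like pairwise couplings from factored parameters.

    Q, K: (L, d) position factors; W_V: (A, A) amino-acid features
    shared across all position pairs. Returns an (L, L, A, A) coupling
    tensor. A single-head sketch: the real model uses multiple heads.
    """
    L, d = Q.shape
    A = softmax(Q @ K.T / np.sqrt(d), axis=-1)  # (L, L) attention map
    # Every position pair (i, j) reuses the same amino-acid features,
    # scaled by how strongly i attends to j.
    return A[:, :, None, None] * W_V[None, None, :, :]

L, d, A_size = 10, 4, 20
rng = np.random.default_rng(1)
J = factored_attention_couplings(rng.normal(size=(L, d)),
                                 rng.normal(size=(L, d)),
                                 rng.normal(size=(A_size, A_size)))
# Parameter count: 2*L*d + A^2 = 480 here, versus L^2 * A^2 = 40,000
# for the corresponding dense Potts coupling tensor.
```

The point of the sketch is the scaling: position-interaction parameters grow as O(L·d) rather than O(L²·A²), and the amino-acid features can in principle be shared across families.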

2. BACKGROUND

Proteins are polymers composed of amino acids and are commonly represented as strings. Along with this 1D sequence representation, each protein folds into a 3D physical structure. Physical distance between positions in 3D is often a much better indicator of functional interaction than proximity in sequence. One representation of physical distance is a contact map C, a symmetric matrix in which entry C_ij = 1 if the β carbons of positions i and j are within 8 Å of one another, and 0 otherwise.

Multiple Sequence Alignments. To understand the structure and function of a protein sequence, one typically assembles a set of its evolutionary relatives and looks for patterns within the set. A set of related sequences is referred to as a protein family, commonly represented by a Multiple Sequence Alignment (MSA). Gaps in aligned sequences correspond to insertions from an alignment algorithm (Johnson et al., 2010; Remmert et al., 2012), ensuring that positions with similar structure and function line up for all members of the family. After aligning, sequence position carries significant evolutionary, structural, and functional information. See Appendix A.1 for more information.

Coevolutionary Analysis of Protein Families. The observation that statistical patterns in MSAs can be used to predict couplings has been widely applied to infer structure and function from protein families (Korber et al., 1993; Göbel et al., 1994; Lapedes et al., 1999; Lockless & Ranganathan, 1999; Fodor & Aldrich, 2004; Thomas et al., 2008; Weigt et al., 2009; Fatakia et al., 2009). Let X_i be the amino acid at position i sampled from a particular family. High mutual information between X_i and X_j suggests an interaction between positions i and j. The main challenge in estimating contacts from mutual information is to disentangle "direct couplings" corresponding to functional interactions from interactions induced by non-functional patterns (Lapedes et al., 1999; Weigt et al., 2009).
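The mutual-information statistic described above can be computed directly from MSA columns. A small sketch, where the toy alignment and column choices are purely illustrative:

```python
from collections import Counter
import math

def column_mutual_information(msa, i, j):
    """Empirical mutual information between columns i and j of an MSA.

    msa: list of equal-length aligned sequences (strings). High MI
    between columns i and j suggests the positions interact.
    """
    n = len(msa)
    pi = Counter(s[i] for s in msa)                 # marginal of column i
    pj = Counter(s[j] for s in msa)                 # marginal of column j
    pij = Counter((s[i], s[j]) for s in msa)        # joint of (i, j)
    mi = 0.0
    for (a, b), c in pij.items():
        p_ab = c / n
        # p_ab * log(p_ab / (p_a * p_b)), with counts rescaled by n.
        mi += p_ab * math.log(p_ab * n * n / (pi[a] * pj[b]))
    return mi

# Toy alignment: columns 0 and 2 covary perfectly; column 1 is constant.
msa = ["ALG", "VLC", "ALG", "VLC"]
print(column_mutual_information(msa, 0, 2))  # log(2) ≈ 0.693
print(column_mutual_information(msa, 0, 1))  # 0.0
```

The raw statistic is exactly what the "direct couplings" problem complicates: columns 0 and 2 here would score highly whether their covariation came from a physical contact or from an indirect, non-functional pattern.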
State-of-the-art methods estimate interactions from MRF parameters, as described below.

Supervised Structure Prediction. Modern structure prediction methods take a supervised approach, taking MSAs as inputs and outputting predicted structural features. Deep learning has greatly advanced the state of the art for supervised contact prediction (Wang et al., 2017; Jones & Kandathil, 2018; Senior et al., 2020; Adhikari, 2020). These methods train deep residual networks that take covariance statistics or coevolutionary parameters as inputs and output contact maps or distance matrices. Extraction of contacts without supervised structural signal had not seen competitive performance from neural networks until the recent introduction of Transformers pretrained on protein databases.

