PROTEIN REPRESENTATION LEARNING BY GEOMETRIC STRUCTURE PRETRAINING

Abstract

Learning effective protein representations is critical for a variety of biological tasks such as predicting protein function or structure. Existing approaches usually pretrain protein language models on a large number of unlabeled amino acid sequences and then finetune the models with some labeled data on downstream tasks. Despite the effectiveness of sequence-based approaches, the power of pretraining on known protein structures, which are available in much smaller numbers, has not been explored for protein property prediction, even though protein structures are known to be determinants of protein function. In this paper, we propose to pretrain protein representations according to their 3D structures. We first present a simple yet effective encoder to learn the geometric features of a protein. We pretrain the protein graph encoder by leveraging multiview contrastive learning and different self-prediction tasks. Experimental results on both function prediction and fold classification tasks show that our proposed pretraining methods outperform or are on par with state-of-the-art sequence-based methods while using much less pretraining data. Our implementation is available at https://github.com/DeepGraphLearning/GearNet.

1. INTRODUCTION

Proteins are the workhorses of the cell and are implicated in a broad range of applications, ranging from therapeutics to materials. They consist of a linear chain of amino acids (residues) that folds into specific conformations. With the advent of low-cost sequencing technologies (Ma & Johnson, 2012; Ma, 2015), a massive volume of new protein sequences has been discovered in recent years. As functional annotation of a new protein sequence remains costly and time-consuming, accurate and efficient in silico protein function annotation methods are needed to bridge the existing sequence-function gap. Since a large number of protein functions are governed by their folded structures, several data-driven approaches rely on learning representations of protein structures, which can then be used for a variety of tasks such as protein design (Ingraham et al., 2019; Strokach et al., 2020; Cao et al., 2021; Jing et al., 2021), structure classification (Hermosilla et al., 2021), model quality assessment (Baldassarre et al., 2021; Derevyanko et al., 2018), and function prediction (Gligorijević et al., 2021). Due to the challenge of experimental protein structure determination, the number of reported protein structures is orders of magnitude lower than the size of datasets in other machine learning application domains. For example, there are 182K experimentally determined structures in the Protein Data Bank (PDB) (Berman et al., 2000), versus 47M protein sequences in Pfam (Mistry et al., 2021) and 10M annotated images in ImageNet (Russakovsky et al., 2015). To address this gap, recent works have leveraged the large volume of unlabeled protein sequence data to learn effective representations of known proteins (Bepler & Berger, 2019; Rives et al., 2021; Elnaggar et al., 2021), pretraining protein encoders on millions of sequences via self-supervised learning.
However, these methods neither explicitly capture nor leverage the available protein structural information that is known to determine protein functions. To better utilize structural information, several structure-based protein encoders (Hermosilla et al., 2021; Hermosilla & Ropinski, 2022; Wang et al., 2022a) have been proposed. However, these models do not explicitly capture the interactions between edges, which are critical in protein structure modeling (Jumper et al., 2021). Moreover, until recently very few attempts (Hermosilla & Ropinski, 2022; Chen et al., 2022; Guo et al., 2022) had been made to develop pretraining methods that exploit unlabeled 3D structures, owing to the scarcity of experimentally determined protein structures. Thanks to recent advances in highly accurate deep learning-based protein structure prediction (Baek et al., 2021; Jumper et al., 2021), it is now possible to efficiently predict structures for a large number of protein sequences with reasonable confidence. Motivated by this development, we develop a protein encoder pretrained on the largest possible number¹ of protein structures that is able to generalize to a variety of property prediction tasks. We propose a simple yet effective structure-based encoder, the GeomEtry-Aware Relational Graph Neural Network (GearNet), which encodes spatial information by adding different types of sequential or structural edges and then performing relational message passing on protein residue graphs. Inspired by the design of triangle attention in Evoformer (Jumper et al., 2021), we further propose a sparse edge message passing mechanism to enhance the protein structure encoder, which is, to our knowledge, the first attempt to incorporate edge-level message passing in GNNs for protein structure encoding. We further introduce a geometric pretraining method to learn the protein structure encoder based on the popular contrastive learning framework (Chen et al., 2020).
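To make the relational message passing idea concrete, the core layer can be sketched as follows. This is an illustrative NumPy sketch under our own naming, not the exact GearNet implementation; the particular edge types and the self-loop transform are assumptions for exposition:

```python
import numpy as np

def relational_message_passing(h, edges, W_rel, W_self):
    """One relation-aware message passing layer on a residue graph.

    h:      (num_nodes, d) residue features
    edges:  list of (src, dst, rel) tuples; rel indexes the edge type,
            e.g. 0 = sequential neighbor, 1 = spatial (radius) contact
    W_rel:  (num_relations, d, d) one weight matrix per edge type
    W_self: (d, d) transform applied to each node's own features
    """
    out = h @ W_self                      # self-loop contribution
    for src, dst, rel in edges:
        out[dst] += h[src] @ W_rel[rel]   # relation-specific message
    return np.maximum(out, 0.0)           # ReLU nonlinearity
```

Edge-level message passing follows the same pattern on the relational graph among edges: each edge becomes a node, and two edges are connected when they share a residue, so messages flow between adjacent edges rather than between residues.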
We propose novel augmentation functions to discover biologically correlated protein substructures that co-occur in proteins (Ponting & Russell, 2002), and aim to maximize the similarity between the learned representations of substructures from the same protein while minimizing the similarity between those from different proteins. Simultaneously, we propose a suite of straightforward baselines based on self-prediction (Devlin et al., 2018). These pretraining tasks perform masked prediction of different geometric or physicochemical attributes, such as residue types, Euclidean distances, angles, and dihedral angles. By extensively benchmarking these pretraining techniques on diverse downstream property prediction tasks, we establish a solid starting point for pretraining protein structure representations. Extensive experiments on several benchmarks, including Enzyme Commission number prediction (Gligorijević et al., 2021), Gene Ontology term prediction (Gligorijević et al., 2021), fold classification (Hou et al., 2018), and reaction classification (Hermosilla et al., 2021), verify that our GearNet augmented with edge message passing can consistently outperform existing protein encoders on most tasks in a supervised setting. Furthermore, with the proposed pretraining method, our model trained on fewer than a million samples achieves comparable or even better results than state-of-the-art sequence-based encoders pretrained on million- or billion-scale datasets.
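As a simplified sketch of the multiview contrastive objective, the two substructure views of each protein form a positive pair while the other proteins in the batch act as negatives, which can be written as an InfoNCE-style loss. The function name and temperature value below are illustrative, not the paper's exact formulation:

```python
import numpy as np

def info_nce(z1, z2, tau=0.07):
    """InfoNCE loss between two views of a batch of proteins.

    z1, z2: (batch, d) substructure embeddings; row i of z1 and z2
            come from the same protein (the positive pair).
    """
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)   # L2-normalize
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / tau                              # pairwise similarities
    logits -= logits.max(axis=1, keepdims=True)           # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))                    # positives on diagonal
```

Minimizing this loss pulls the two views of the same protein together in embedding space while pushing apart views of different proteins.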

2. RELATED WORK

Previous works seek to learn protein representations based on different modalities of proteins, including amino acid sequences (Rao et al., 2019; Elnaggar et al., 2021; Rives et al., 2021), multiple sequence alignments (MSAs) (Rao et al., 2021; Biswas et al., 2021; Meier et al., 2021), and protein structures (Hermosilla et al., 2021; Gligorijević et al., 2021; Somnath et al., 2021). These works share the common goal of learning informative protein representations that can benefit various downstream applications, such as predicting protein function (Rives et al., 2021) and protein-protein interactions (Wang et al., 2019), as well as designing protein sequences (Biswas et al., 2021). Compared with sequence-based methods, structure-based methods should, in principle, be a better solution to learning informative protein representations, as the function of a protein is determined by its structure. This line of work seeks to encode the spatial information in protein structures with 3D CNNs (Derevyanko et al., 2018) or graph neural networks (GNNs) (Gligorijević et al., 2021; Baldassarre et al., 2021; Jing et al., 2021; Wang et al., 2022a; Aykent & Xia, 2022). Among these methods, IEConv (Hermosilla et al., 2021) aims to fit the inductive bias of protein structure modeling by introducing a graph convolution layer that incorporates intrinsic and extrinsic distances between nodes. Another potential direction is to extract features from protein surfaces (Gainza et al., 2020; Sverrisson et al., 2021; Dai & Bailey-Kellogg, 2021). Somnath et al. (2021) combined the advantages of both worlds and proposed a parameter-efficient multi-scale model. Besides, there are



¹ We use AlphaFoldDB v1 and v2 (Varadi et al., 2021) for pretraining, which was the largest protein structure database before March 2022.

