PROTEIN REPRESENTATION LEARNING BY GEOMETRIC STRUCTURE PRETRAINING

Abstract

Learning effective protein representations is critical for a variety of tasks in biology, such as predicting protein function or structure. Existing approaches usually pretrain protein language models on a large number of unlabeled amino acid sequences and then finetune the models with some labeled data on downstream tasks. Despite the effectiveness of sequence-based approaches, the power of pretraining on known protein structures, which are available in much smaller numbers, has not been explored for protein property prediction, even though protein structures are known to be determinants of protein function. In this paper, we propose to pretrain protein representations according to their 3D structures. We first present a simple yet effective encoder to learn the geometric features of a protein. We then pretrain the protein graph encoder by leveraging multiview contrastive learning and different self-prediction tasks. Experimental results on both function prediction and fold classification tasks show that our proposed pretraining methods outperform or are on par with state-of-the-art sequence-based methods, while using much less pretraining data. Our implementation is available at https://github.com/DeepGraphLearning/GearNet.

1. INTRODUCTION

Proteins are the workhorses of the cell and are implicated in a broad range of applications ranging from therapeutics to materials. They consist of a linear chain of amino acids (residues) which fold into specific conformations. Due to the advent of low-cost sequencing technologies (Ma & Johnson, 2012; Ma, 2015), in recent years a massive volume of protein sequences has been newly discovered. As functional annotation of a new protein sequence remains costly and time-consuming, accurate and efficient in silico protein function annotation methods are needed to bridge the existing sequence-function gap.

Since a large number of protein functions are governed by their folded structures, several data-driven approaches rely on learning representations of the protein structures, which can then be used for a variety of tasks such as protein design (Ingraham et al., 2019; Strokach et al., 2020; Cao et al., 2021; Jing et al., 2021), structure classification (Hermosilla et al., 2021), model quality assessment (Baldassarre et al., 2021; Derevyanko et al., 2018), and function prediction (Gligorijević et al., 2021).

Due to the challenge of experimental protein structure determination, the number of reported protein structures is orders of magnitude lower than the size of datasets in other machine learning application domains. For example, there are 182K experimentally-determined structures in the Protein Data Bank (PDB) (Berman et al., 2000) vs 47M protein sequences in Pfam (Mistry et al., 2021) and 10M annotated images in ImageNet (Russakovsky et al., 2015). To address this gap, recent works have leveraged the large volume of unlabeled protein sequence data to learn effective representations of known proteins (Bepler & Berger, 2019; Rives et al., 2021; Elnaggar et al., 2021). A number of studies have pretrained protein encoders on millions of sequences via self-supervised learning.
However, these methods neither explicitly capture nor leverage the available protein structural information that is known to be a determinant of protein function.
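As a concrete illustration of the multiview contrastive pretraining mentioned in the abstract, the sketch below implements a standard InfoNCE-style objective over two views of the same protein (e.g., two sampled substructures). This is a common formulation of multiview contrastive learning, not necessarily the exact objective used in this paper; the function name and batch layout are illustrative assumptions.

```python
import numpy as np

def info_nce_loss(z1, z2, temperature=0.07):
    """InfoNCE contrastive loss between two batches of view embeddings.

    z1[i] and z2[i] are embeddings of two views of the same protein
    (the positive pair); all other cross-pairs in the batch serve as
    negatives. Shapes: (batch, dim). Illustrative sketch only.
    """
    # L2-normalize so the dot product is cosine similarity
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature  # (batch, batch) similarity matrix
    # log-softmax over each row; diagonal entries are the positives
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

Minimizing this loss pulls embeddings of two views of the same protein together while pushing apart embeddings of different proteins in the batch, so the encoder learns structure-aware representations without any function labels.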

