MULTI-LEVEL PROTEIN STRUCTURE PRE-TRAINING WITH PROMPT LEARNING

Abstract

A protein can rely on different levels of structure to implement its functions. Each level has its own merit and driving forces in describing specific characteristics, and they cannot substitute for one another. Most existing function prediction methods take either the primary or the tertiary structure as input, unintentionally ignoring the other levels of protein structure. Since a protein's sequence determines its multi-level structures, in this paper we aim to realize the full potential of protein sequences for function prediction. Specifically, we propose a new prompt-guided multi-task pre-training and fine-tuning framework. Through prompt-guided multi-task pre-training, we learn multiple prompt signals that steer the model, called PromptProtein, to focus on different levels of structure. We also design a prompt fine-tuning module that gives downstream tasks on-demand flexibility in utilizing the respective levels of structural information. Extensive experiments on function prediction and protein engineering show that PromptProtein outperforms state-of-the-art methods by large margins. To the best of our knowledge, this is the first prompt-based pre-trained protein model.

1. INTRODUCTION

Pre-trained language models (PTLMs) have prevailed in natural language processing (NLP). Recently, some methods (Alley et al., 2019; Elnaggar et al., 2021; Rives et al., 2021) use PTLMs to encode protein sequences for predicting biological functions; these are called pre-trained protein models (PTPMs). In contrast to natural languages, proteins exhibit four distinct levels of structure (Kessel & Ben-Tal, 2018). The primary structure is the sequence of amino acids, the secondary refers to local folded structures (e.g., the α helix and the β pleated sheet), the tertiary describes the natively folded three-dimensional structure, and the quaternary is a protein multimer comprising multiple polypeptide chains. A protein can draw on different structure levels to implement its specific functions, such as conserving a stretch of the sequence, presenting the whole 3D structure as conformational elements, or even cooperating with other proteins. Therefore, when predicting protein functions, it is vital to flexibly utilize multi-level structural information. AlphaFold2 (Jumper et al., 2021) has made great progress in tertiary structure prediction from protein sequences. However, directly learning from predicted structures can be unreliable, as predictions for proteins without homologous sequences are inaccurate. More importantly, the quaternary structure of protein multimers, which faithfully depicts protein functions, usually differs from the tertiary structure (see Figure 1), and no reliable predictive models have been released. Fortunately, protein sequences are easy to obtain and determine all the other levels of structure. This paper aims to realize the full potential of protein sequences in function prediction by prompting a PTPM to exploit all levels of protein structure during pre-training. The main challenges are twofold: 1) how to design proper pre-training tasks for different protein structures?
and 2) how to efficiently integrate these tasks in the pre-training phase and transfer the implicit structural knowledge to function prediction in the fine-tuning phase? For the second challenge, a straightforward strategy is to combine the losses of different pre-training tasks via multi-task learning. However, many works (Wu et al., 2019; Yu et al., 2020) find that task interference is common when tasks are diverse. This problem can be more severe in multi-task pre-training due to the gap between pre-training and downstream tasks, causing negative knowledge transfer. For example, BERT (Kenton & Toutanova, 2019) uses Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) to learn sequential dependencies and sentence relationships simultaneously, while RoBERTa (Liu et al., 2019) finds that performance slightly improves when the NSP loss is removed. We postulate that this problem also arises with multi-level protein structures, as different structures can be inconsonant: the MLM task emphasizes neighboring relations along the sequence, while the alpha-carbon coordinate prediction (CRD) task focuses more on long-range amino acid pairs that can be spatially close in the tertiary structure. To address this challenge, inspired by recent prompt learning, we propose a prompt-guided multi-task pre-training and fine-tuning framework; the resulting protein model is called PromptProtein. The prompt-guided multi-task pre-training associates each pre-training task with a dedicated sentinel token, called a prompt. To utilize the prompt tokens, we introduce a prompt-aware attention module that modifies two components of the Transformer architecture: 1) the attention mask, which is designed to block attention from the input data to a prompt, since a prompt should be task-dependent rather than sample-dependent; and 2) the skip connection, where a prompt is used to compute a skip weight that can filter out task-irrelevant information.
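The two architectural changes can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the function names, the exact masking direction (prompt queries blocked from data keys, so prompt representations stay sample-independent), and the sigmoid form of the skip gate are our assumptions.

```python
import numpy as np

def prompt_attention_mask(n_data, n_prompt):
    """Boolean attention mask for a sequence laid out as
    [data tokens ... | prompt tokens ...].

    mask[i, j] = True means query token i MAY attend to key token j.
    Data tokens may attend to both data and prompt tokens, but prompt
    tokens may only attend to other prompt tokens, so a prompt's
    representation stays task-dependent rather than sample-dependent.
    """
    n = n_data + n_prompt
    mask = np.ones((n, n), dtype=bool)
    # Block prompt queries (rows) from attending to data keys (columns).
    mask[n_data:, :n_data] = False
    return mask

def prompt_skip_gate(prompt_vec, w, b):
    """Scalar skip-connection weight computed from a prompt vector.

    A sigmoid gate in (0, 1) that scales the skip branch, so
    task-irrelevant information can be attenuated per task.
    """
    return 1.0 / (1.0 + np.exp(-(prompt_vec @ w + b)))
```

With 4 residue tokens and 2 prompt tokens, `prompt_attention_mask(4, 2)` yields a 6×6 mask whose lower-left 2×4 block is `False`; the gate reduces to 0.5 when its pre-activation is zero.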
In the fine-tuning phase, we propose a prompt fine-tuning module to coordinate all prompt tokens, so that the model can flexibly leverage multi-level protein structure information, enabling positive transfer of the learned structural knowledge to downstream tasks. We conduct experiments on function prediction and protein engineering as downstream tasks, where PromptProtein significantly outperforms state-of-the-art methods on all datasets, especially on low-resource protein engineering tasks, where it achieves an average improvement of 17.0%.
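One plausible form of such coordination, sketched under our own assumptions (the paper does not specify this exact mechanism): keep one representation per structure-level prompt and let each downstream task learn a softmax mixture over them.

```python
import numpy as np

def combine_prompt_outputs(prompt_reprs, logits):
    """Mix per-prompt protein representations for a downstream task.

    prompt_reprs: (P, D) array, one representation per structure-level
                  prompt (e.g., MLM, CRD, PPI).
    logits:       (P,) learnable coordination scores for the task.

    Returns a (D,) representation: a softmax-weighted mixture, so each
    task can emphasize the structure levels it needs.
    """
    w = np.exp(logits - logits.max())  # numerically stable softmax
    w = w / w.sum()
    return w @ prompt_reprs
```

A task that depends mostly on tertiary structure would learn a large logit for the CRD prompt, pushing its weight toward 1.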

2. RELATED WORKS

Protein Representation Models. Proteins have complex structures that determine their biological functions (Epstein et al., 1963). A growing body of work focuses on how to leverage structural information. Since evolution through natural selection has written protein sequences as their "natural language", various natural language processing methods have been extended to proteins. Asgari & Mofrad (2015) and Yang et al. (2018) apply word embedding algorithms (Mikolov et al., 2013) to obtain protein representations. Dalkiran et al. (2018) and Öztürk et al. (2018) use one-dimensional convolutional neural networks to model protein sequences.



Figure 1: A comparison of protein CDK1 in the tertiary (left) and quaternary (right) structures.

For the first challenge, we design three complementary pre-training tasks across multiple structure levels, targeting both fine and coarse resolutions. Specifically, we use the de facto Masked Language Modeling (MLM) task to exploit primary structure information, where the model must predict randomly masked amino acids in a protein. For the secondary and tertiary structures, we propose the alpha-carbon CooRDinate prediction (CRD) task, where the model should output the relative positions between residues. For the quaternary structure, we propose the Protein-Protein Interaction prediction (PPI) task, where the model is required to estimate the interaction probability. We collect millions of samples covering the different levels of protein structure from UniRef50 (Consortium, 2021), the Protein Data Bank (Berman et al., 2000), and STRING (Szklarczyk et al., 2019).
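For the CRD task, "relative positions between residues" can be made concrete in several ways; a minimal sketch, assuming the target is the pairwise alpha-carbon distance matrix (one common rotation- and translation-invariant encoding, not necessarily the paper's exact target):

```python
import numpy as np

def crd_targets(ca_coords):
    """Pairwise alpha-carbon distance matrix as a stand-in CRD target.

    ca_coords: (L, 3) array of alpha-carbon coordinates, one row per
               residue.
    Returns:   (L, L) symmetric matrix of Euclidean distances with a
               zero diagonal; invariant to rotating or translating the
               whole structure.
    """
    diff = ca_coords[:, None, :] - ca_coords[None, :, :]  # (L, L, 3)
    return np.sqrt((diff ** 2).sum(-1))
```

Supervising on such a matrix lets the model attend to long-range residue pairs that are distant along the sequence but spatially close in the fold.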

