PROTEIN REPRESENTATION LEARNING VIA KNOWLEDGE ENHANCED PRIMARY STRUCTURE MODELING

Abstract

Protein representation learning has primarily benefited from the remarkable development of language models (LMs). Accordingly, pre-trained protein models also suffer from a problem shared by LMs: a lack of factual knowledge. A recent solution models the relationships between a protein and its associated knowledge terms as the knowledge encoding objective. However, it fails to explore these relationships at a more granular level, i.e., the token level. To mitigate this, we propose the Knowledge-exploited Auto-encoder for Protein (KeAP), which performs token-level knowledge graph exploration for protein representation learning. In practice, non-masked amino acids iteratively query the associated knowledge tokens to extract and integrate helpful information for restoring masked amino acids via attention. We show that KeAP consistently outperforms the previous counterpart on 9 representative downstream applications, sometimes surpassing it by large margins. These results suggest that KeAP provides an alternative yet effective way to perform knowledge enhanced protein representation learning. Code and models are available at https://github.com/RL4M/KeAP.

1. INTRODUCTION

The unprecedented success of AlphaFold (Jumper et al., 2021; Senior et al., 2020) has sparked public interest in artificial intelligence-based protein science, which in turn prompts scientists to develop more powerful deep neural networks for proteins. At present, a major challenge faced by researchers is how to learn generalized representations from a vast amount of protein data. An analogous challenge exists in natural language processing (NLP), where it has been addressed by large pre-trained language models (Devlin et al., 2018; Brown et al., 2020). However, as pointed out by (Peters et al., 2019; Zhang et al., 2019; Sun et al., 2020; Wang et al., 2021), pre-trained language models often suffer from a lack of factual knowledge. To alleviate similar problems in protein models, Zhang et al. (2022) proposed OntoProtein, which explicitly injects factual biological knowledge into the pre-trained model, leading to observable improvements on several downstream protein analysis tasks, such as amino acid contact prediction and protein-protein interaction identification. In practice, OntoProtein leverages the masked language modeling (MLM) (Devlin et al., 2018) and TransE (Bordes et al., 2013) objectives to perform structure and knowledge encoding, respectively. Specifically, the TransE objective is applied to triplets from knowledge graphs, where each triplet can be formalized as (Protein, Relation, Attribute). The relation and attribute terms, described in natural language, come from the gene ontologies (Ashburner et al., 2000) associated with each protein. However, OntoProtein models these relationships only on top of the contextual representations of the protein (averaged amino acid representations) and the textual knowledge (averaged word representations), preventing it from exploring knowledge graphs at a more granular level, i.e., the token level. We propose KeAP (Knowledge-exploited Auto-encoder for Protein) to perform knowledge enhanced protein representation learning.
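The TransE objective mentioned above scores a triplet (head, relation, tail) by how well the relation embedding translates the head embedding onto the tail, i.e., h + r ≈ t. The following is a minimal sketch of this scoring and the standard margin-based ranking loss; the embedding dimension, margin, and random embeddings are illustrative, not the values used by OntoProtein.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8

# Toy embeddings for a (Protein, Relation, Attribute) triplet.
head = rng.normal(size=dim)   # protein embedding
rel = rng.normal(size=dim)    # relation embedding
tail = rng.normal(size=dim)   # attribute embedding

def transe_score(h, r, t):
    """TransE energy: L2 distance between the translated head (h + r)
    and the tail t. Lower means the triplet is more plausible."""
    return np.linalg.norm(h + r - t)

# Margin-based ranking loss against a corrupted (negative) tail.
neg_tail = rng.normal(size=dim)
margin = 1.0
loss = max(0.0, margin + transe_score(head, rel, tail)
                       - transe_score(head, rel, neg_tail))
print(round(loss, 4))
```

Note that both sides of this objective are single sequence-level embeddings, which is precisely why the encoding cannot capture token-level interactions.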
To address the granularity issue of OntoProtein, KeAP performs token-level protein-knowledge exploration using the cross-attention mechanism. Specifically, each amino acid iteratively queries each word from the relation and attribute terms to extract useful, relevant information using QKV attention (Vaswani et al., 2017). The extracted information is then integrated into the protein representation via residual learning (He et al., 2016). The training process is guided by the MLM objective only, whereas OntoProtein uses contrastive learning and masked modeling simultaneously. Moreover, we propose to explore the knowledge in a cascaded manner, first extracting information from relation terms and then from attribute terms, which performs more effective knowledge encoding. KeAP has two advantages over OntoProtein (Zhang et al., 2022). First, KeAP explores knowledge graphs at a more granular level by applying cross-attention to sequences of amino acids and words from relations and attributes. Second, KeAP provides a neat solution for knowledge enhanced protein pre-training. The encoder-decoder architecture in KeAP can be trained using the MLM objective only (both a contrastive loss and MLM are used in OntoProtein), making the whole framework easy to optimize and implement. Experimental results verify the performance superiority of KeAP over OntoProtein. In Fig. 1, we fine-tune the pre-trained protein models on 9 downstream applications. KeAP outperforms OntoProtein on all 9 tasks, often by clear margins, such as on amino acid contact prediction and protein-protein interaction (PPI) identification. Compared to ProtBert, KeAP achieves better results on 8 tasks while performing comparably on protein stability prediction. In contrast, OntoProtein produces unsatisfactory results on homology, stability, and binding affinity prediction.
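The token-level exploration described above can be sketched as a pair of cascaded cross-attention steps: amino-acid tokens form the queries, knowledge tokens form the keys and values, and the attended output is folded back through a residual connection. This is a minimal single-head numpy sketch under assumed toy dimensions, not KeAP's actual implementation (which uses multi-head attention inside a pre-trained transformer).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(protein, knowledge, Wq, Wk, Wv):
    """Each amino-acid token queries the knowledge tokens (QKV attention);
    the extracted information is integrated via a residual connection."""
    Q = protein @ Wq                           # (L_p, d) queries
    K = knowledge @ Wk                         # (L_k, d) keys
    V = knowledge @ Wv                         # (L_k, d) values
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return protein + attn @ V                  # residual integration

rng = np.random.default_rng(0)
d = 16
protein = rng.normal(size=(10, d))    # 10 amino-acid token embeddings
relation = rng.normal(size=(4, d))    # tokens of the relation text
attribute = rng.normal(size=(7, d))   # tokens of the attribute text
W = lambda: rng.normal(size=(d, d)) / np.sqrt(d)

# Cascaded exploration: query relation tokens first, then attribute tokens.
h = cross_attend(protein, relation, W(), W(), W())
h = cross_attend(h, attribute, W(), W(), W())
print(h.shape)  # (10, 16)
```

Because every amino-acid position attends to every knowledge word, relationships are modeled token-by-token rather than between two pooled sequence embeddings.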

2. RELATED WORK

2.1. REPRESENTATION LEARNING FOR PROTEIN

How to learn generalized protein representations has recently become a hot topic in protein science, inspired by the widespread use of representation learning in language models (Devlin et al., 2018; Radford et al., 2019; Yang et al., 2019; Sarzynska-Wawer et al., 2021). Bepler & Berger (2018) introduced a multi-task protein representation learning framework, which obtains supervision signals from protein-protein structural similarity and individual amino acid contact maps. Owing to the plethora of uncharacterized protein data, self-supervised pre-training (Alley et al., 2019; Rao et al., 2019) was proposed to directly learn representations from chains of amino acids, where tremendous and significant efforts were made to improve the pre-training result by scaling up the size of the model and dataset (Elnaggar et al., 2021; Rives et al., 2021; Vig et al., 2020; Rao et al., 2020; Yang et al., 2022; Nijkamp et al., 2022; Ferruz et al., 2022; Chen et al., 2022). In contrast, protein-related factual knowledge, which provides abundant descriptive information for proteins, has long been ignored and largely unexploited. OntoProtein (Zhang et al., 2022) first showed that the performance of pre-trained models on downstream tasks can be improved by explicitly injecting the factual biological knowledge associated with protein sequences into pre-training. In practice, OntoProtein proposed to reconstruct masked amino acids while minimizing the embedding distance between the contextual representations of a protein and its associated knowledge terms. One potential pitfall of this operation is that it fails to explore the relationships between protein and associated knowledge at a more granular level, i.e., the token level.

Viewing amino acids as language tokens, unsupervised pre-training with self-supervision offers a viable solution: existing unsupervised pre-training techniques can be transferred from NLP to proteins with little modification, and their effectiveness has been verified in protein representation learning (Rao et al., 2019; Alley et al., 2019; Elnaggar et al., 2021; Unsal et al., 2022).
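The most common transferred technique is masked language modeling: randomly mask residues in a protein sequence and train the model to reconstruct them. A minimal masking sketch follows; the 15% mask rate, `[MASK]` token, and helper name are illustrative choices borrowed from BERT-style MLM, not a specific model's exact recipe.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def mask_sequence(seq, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Randomly replace residues with a mask token; return the corrupted
    token list and a {position: original residue} map of MLM targets."""
    rng = random.Random(seed)
    tokens, labels = [], {}
    for i, aa in enumerate(seq):
        if rng.random() < mask_rate:
            tokens.append(mask_token)
            labels[i] = aa          # target the model must reconstruct
        else:
            tokens.append(aa)
    return tokens, labels

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # toy amino-acid chain
tokens, labels = mask_sequence(seq)
print(sum(t == "[MASK]" for t in tokens), len(labels))
```

The MLM loss is then a cross-entropy over the masked positions only, predicting each entry of `labels` from the surrounding unmasked context.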

Figure 1: Transfer learning performance of ProtBert, OntoProtein, and our KeAP on downstream protein analysis tasks. S-, M-, and L-Contact stand for short-range, medium-range, and long-range contact prediction. PPI denotes protein-protein interaction prediction. † means the model is trained with the full ProteinKG25.

