IMPROVING PROTEIN INTERACTION PREDICTION USING PRETRAINED STRUCTURE EMBEDDING

Abstract

The study of protein-protein interactions (PPIs) plays an important role in the discovery of protein drugs and in revealing the behavior and function of cells. So far, most PPI prediction works focus on protein sequence and PPI network structure, but ignore the structural information of protein physical binding. As a result, they cannot account for the observation that interacting proteins are not necessarily similar, while similar proteins do not necessarily interact with each other. In this paper, we design a novel method, called PSE4PPI, which leverages pretrained structure embeddings that contain additional structural information and the physical pairwise relationships between amino acids. This method can be transferred to new PPI prediction tasks, such as antibody-target interactions and PPIs across different species. Experimental results on PPI prediction show that our pretrained structure embeddings lead to significant improvement compared to sequence-based and network-based methods. Furthermore, we show that embeddings pretrained on PPIs from different species can be transferred to improve prediction for human proteins.

1. INTRODUCTION

Proteins are the basic functional units of human biology. However, they rarely function alone and usually act through interactions with other proteins. Protein-protein interactions (PPIs) are important for studying cytomics and discovering new putative therapeutic targets to cure diseases Szklarczyk et al. (2015). But this research usually requires expensive and time-consuming wet-lab experiments to obtain PPI results. The purpose of PPI prediction is to predict, from the amino acid sequences of a given pair of proteins, whether physical binding exists between them.

Recently, most PPI prediction works focus on protein sequence Sun et al. (2017); Hashemifar et al. (2018); Zhang et al. (2019) and PPI network structure Hamilton et al. (2017); Yang et al. (2020). These methods obtain protein representations from the amino acid sequence and the local neighborhood structure of the PPI network, and then predict whether two proteins interact. The PPI prediction problem is often formalized as a link prediction problem, in which the similarity between two proteins is calculated to predict whether they will bind. But researchers report that interacting proteins are not necessarily similar, while similar proteins do not necessarily interact with each other Kovács et al. (2019). This is because the above methods ignore the influence of protein structure information on the PPI prediction problem.

We emphasize that the PPIs considered in this paper are limited to physical binding. For physical binding, it is not enough to consider similarity at the sequence level and at the structural level of the PPI network, because binding requires the two proteins to interact at the 3D-structure level. Physical binding between proteins requires interface contact, through which the corresponding amino acids form the bond. Therefore, we introduce protein structure information through pretrained structure embeddings to improve the accuracy of PPI prediction.

In this paper, we first obtain pretrained structure embeddings for the proteins in the PPI network from a pretrained protein structure model Wu et al. (2022); these embeddings contain the protein sequence information and the physical pairwise relationships between amino acids. Then, we use a GNN-based method to predict PPIs on the PPI network, in which we treat proteins as nodes, interactions as edges, and pretrained structure embeddings as node features. In addition, we use data from different species in StringDB as pretraining data for human PPI prediction, and explore whether PPI prediction is transferable across species.

Our contributions include: (1) To the best of our knowledge, we are the first to use pretrained protein structure embeddings as PPI network features to solve PPI prediction problems; (2) We use StringDB Mering et al. (2003) to introduce more PPI data of the same or different species to improve the performance of the prediction model; (3) We use PPIs from different species as pretraining data, and find that they are significantly helpful for human PPI prediction; (4) We achieve SOTA performance among GNN-based PPI prediction methods.

Figure 1: The illustration of PPI network and protein-protein physical binding.

2.1 SEQUENCE-BASED PPI METHOD

Sequence-based models can use high-level and complex input features to improve PPI prediction ability. Sun et al. (2017) first applied a stacked autoencoder (SAE) to obtain sequence-based input features for the sequence-based PPI prediction problem. Du et al. (2017) proposed a deep learning-based PPI model (DeepPPI), which can obtain high-level features from the general description of proteins. Hashemifar et al. (2018) proposed a convolution-based model that takes the sequences of a pair of proteins as input to predict whether they will interact. Gonzalez-Lopez et al. (2018) proposed a method that uses recurrent neural networks to process raw protein sequences and predict protein interactions without relying on feature engineering. Jha et al. (2021) proposed a deep multimodal framework to predict protein interactions using structural and sequence features of proteins.

2.2 GNN-BASED METHOD

Graph neural networks play a very important role in solving protein interaction problems through the PPI network. Yang et al. (2020) studied PPI prediction based on both sequence information and graph structure, showing superiority over existing sequence-based methods. Baranwal et al. (2022) proposed a graph attention network for structure-based prediction of PPIs, named Struct2Graph. Colonnese et al. (2021) adopted a Graph Signal Processing (GSP) based approach (GRABP), modeling the PPI network with a suitably designed GSP-based Markovian model to represent the connectivity properties of the nodes. Kabir & Shehu (2022) introduced a novel deep learning framework for semi-supervised learning over multi-relational graphs, which learns the effect of the different relations on the output. Jha et al. (2022) combined a molecular protein graph neural network (GNN) and a language model (LM), generating per-residue embeddings from the sequence information as the node features of the protein graph, to predict the interaction between proteins.
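The graph formulation used in GNN-based PPI prediction (proteins as nodes, known interactions as edges, one feature vector per protein) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the implementation of this paper or of any cited work: the node features are random stand-ins for pretrained structure embeddings, the single propagation step is a standard GCN-style normalized aggregation, the weights are untrained, and the dot-product decoder is only one possible way to score a candidate pair.

```python
import numpy as np


def normalized_adjacency(num_nodes, edges):
    """Build D^{-1/2} (A + I) D^{-1/2} for an undirected graph."""
    adj = np.eye(num_nodes)  # self-loops keep each node's own features
    for i, j in edges:
        adj[i, j] = 1.0
        adj[j, i] = 1.0
    deg = adj.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return adj * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def gcn_layer(a_hat, features, weight):
    """One GCN-style step: aggregate neighbors, project, apply ReLU."""
    return np.maximum(a_hat @ features @ weight, 0.0)


def score_pair(node_repr, i, j):
    """Hypothetical decoder: sigmoid of the dot product of two node vectors."""
    logit = float(node_repr[i] @ node_repr[j])
    return 1.0 / (1.0 + np.exp(-logit))


rng = np.random.default_rng(0)
num_proteins, embed_dim, hidden_dim = 5, 8, 4

# Assumption: random vectors stand in for pretrained structure embeddings.
structure_embeddings = rng.normal(size=(num_proteins, embed_dim))

# Assumption: a toy set of known interactions between protein indices.
known_interactions = [(0, 1), (1, 2), (3, 4)]

a_hat = normalized_adjacency(num_proteins, known_interactions)
weight = rng.normal(scale=0.1, size=(embed_dim, hidden_dim))  # untrained
node_repr = gcn_layer(a_hat, structure_embeddings, weight)

# Score a candidate pair that is not among the known interactions.
prob = score_pair(node_repr, 0, 2)
print(f"score for candidate pair (0, 2): {prob:.3f}")
```

In practice the weights would be trained on known interacting and non-interacting pairs, and the decoder would typically be a learned function of the two node representations rather than a raw dot product, since interacting proteins need not be similar.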

