SPLIT AND MERGE PROXY: PRE-TRAINING PROTEIN-PROTEIN CONTACT PREDICTION BY MINING RICH INFORMATION FROM MONOMER DATA

Abstract

Protein-protein contact prediction is a key intelligent biology computation technology for complex multimer protein function analysis but still suffers from low accuracy. An important problem is that the amount of training data cannot meet the requirements of deep-learning-based methods, because capturing the structure information of multimer data is expensive. In this paper, we address this data volume bottleneck cheaply by borrowing rich information from monomer data. To utilize monomer (single-chain) data in this multimer (multiple-chain) problem, we propose a simple but effective pre-training method called Split and Merge Proxy (SMP), which uses monomer data to construct a proxy task for model pre-training. This proxy task cuts a monomer into two sub-parts, called a pseudo multimer, and pre-trains the model to merge them back together by predicting their pseudo contacts. The pre-trained model is then used to initialize the model for our target task, protein-protein contact prediction. Because the proxy task is consistent with the final target, the method yields a stronger pre-trained model for subsequent fine-tuning, leading to significant performance gains. Extensive experiments validate the effectiveness of our method and show that the model outperforms the state of the art by 11.40% and 2.97% on the P@L/10 metric for the bounded benchmarks DIPS-Plus and CASP-CAPRI, respectively. Further, the model outperforms the state of the art by a factor of almost 1.5 on the harder unbounded benchmark DB5. The code, model, and pre-training data will be released after this paper is accepted.

1. INTRODUCTION

Proteins are large molecules consisting of sequences of amino acids (also called residues). Protein-protein contact prediction aims to compute the constraints between given protein sequences, specifically whether a residue (an individual amino acid) on one protein is in contact with a residue on the other protein, which is important for the structural and functional analysis of protein complexes. The predicted constraints reveal the relationship between each residue pair of the two protein sequences, which not only benefits complex protein structure prediction but is also useful for many protein function analysis scenarios, e.g., developing new drugs and designing new proteins. The success of RaptorX Wang et al. (2017) ...

Our main idea is to expand the training data by introducing monomer data into the training step for protein-protein contact prediction. Existing monomer data are freely available and provide useful biological priors. ComplexContact Zeng et al. (2018) is the first work to introduce monomer data into the multimer contact prediction task, and it demonstrates the potential value of monomer data for the multimer task. However, there is a non-negligible task gap between the monomer and the multimer settings: a monomer provides information about only one chain, while the multimer task requires information about more than one chain. As a result, ComplexContact Zeng et al. (2018) suffers from this task gap, and existing contact prediction methods often neglect monomer data.

In this paper, we design a novel and effective pre-training method called Split and Merge Proxy (SMP) to introduce monomer data into the protein-protein contact prediction task more effectively, which reduces the aforementioned task gap and leads to better results. The proposed SMP is a proxy task for contact prediction pre-training. As shown in Figure 1, SMP generates pseudo multimer data from monomers and uses them to pre-train the contact prediction model.
In particular, a single protein is split into two sub-parts that are treated as a pseudo multimer. These pseudo data are used to train the contact prediction model, which amounts to guiding the model to merge the split parts back together. Although the pseudo multimer data contain biological noise, they provide additional, richer information that complements the existing multimer data. The training targets of SMP and the final task are both contact prediction, so there is no task gap in the fine-tuning stage. The pre-trained model can be fine-tuned on real multimer data without any modification, leading to a better final model and more accurate contact predictions.

Our main contributions are as follows:

• We design a novel proxy task, Split and Merge Proxy (SMP), to pre-train contact prediction models on monomer data more effectively. To the best of our knowledge, this is the first work to leverage monomer protein data to pre-train a model for multimer protein contact prediction.
• Experiments show that we achieve a new state of the art and improve the P@L/10 metric by a large margin, 11.40% and 2.97% on the DIPS-Plus and CASP-CAPRI benchmarks respectively, compared with the latest state-of-the-art DeepInteract Morehead et al.



Figure 1: The main idea of Split and Merge Proxy (best viewed in color). In the pre-training stage, a monomer (single chain) is first split into two sub-parts that are treated as a pseudo multimer (a pair of chains). The deep model is then pre-trained by learning to merge the pseudo multimer back together by predicting its protein-protein contacts.

... recent most popular protein structure prediction model AlphaFold2 Jumper et al. (2021) is trained on about 400k monomer data points, 60k with 3D structure labels from the Protein Data Bank (PDB) wwp (2019) and 350k protein sequences, and achieves electron-microscope-level accuracy. Evidently, existing successful artificial intelligence models with human-level accuracy need big data to train. However, the current largest open-source multimer training dataset Morehead et al. (2022) is much smaller than the datasets mentioned above, at only 15k, which limits the performance of deep models. The main reason is the high cost of capturing multimer protein structural information with high-accuracy instruments. To tackle this scarcity of training data, we focus on finding a cheap way to obtain additional data and avoid the extra cost.
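
The split step described in the Figure 1 caption can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `split_monomer`, the use of one coordinate per residue (for example the C-alpha atom), the single cut position, and the 8 Å distance threshold are all assumptions made for illustration, since this section does not specify how the split point or the contact criterion is chosen.

```python
import numpy as np


def split_monomer(coords, split_idx, threshold=8.0):
    """Cut one chain into two pseudo chains and label their pseudo contacts.

    coords:    (N, 3) array, one 3D point per residue (assumed: C-alpha atoms).
    split_idx: hypothetical cut position; residues [0, split_idx) form part A.
    threshold: assumed contact cutoff in angstroms (not taken from the paper).
    """
    coords = np.asarray(coords, dtype=float)
    if not 0 < split_idx < len(coords):
        raise ValueError("split_idx must leave both sub-parts non-empty")

    part_a, part_b = coords[:split_idx], coords[split_idx:]

    # Pairwise Euclidean distances between residues of the two sub-parts,
    # computed by broadcasting to shape (len_a, len_b, 3).
    dists = np.linalg.norm(part_a[:, None, :] - part_b[None, :, :], axis=-1)

    # Boolean pseudo contact map of shape (len_a, len_b): the training target
    # a model would predict in order to "merge" the two parts back together.
    contacts = dists < threshold
    return part_a, part_b, contacts


# Toy chain: six residues on a straight line, 3.8 angstroms apart.
toy = np.array([[3.8 * i, 0.0, 0.0] for i in range(6)])
part_a, part_b, contacts = split_monomer(toy, split_idx=3)
print(contacts.astype(int))
```

In this toy chain, only residue pairs close to the cut fall under the threshold. A real folded protein would also bring residues that are far apart in sequence into spatial proximity, which is the kind of signal the pseudo contacts are meant to carry.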

