SPLIT AND MERGE PROXY: PRE-TRAINING PROTEIN-PROTEIN CONTACT PREDICTION BY MINING RICH INFORMATION FROM MONOMER DATA

Abstract

Protein-protein contact prediction is a key computational technique for analyzing the functions of multimeric proteins, but it still suffers from low accuracy. An important problem is that the amount of available training data cannot meet the requirements of deep-learning-based methods, because capturing structural information for multimer data is expensive. In this paper, we overcome this data-volume bottleneck cheaply by borrowing rich information from monomer data. To exploit monomer (single-chain) data for this multimer (multi-chain) problem, we propose a simple but effective pre-training method called Split and Merge Proxy (SMP), which uses monomer data to construct a proxy task for model pre-training. This proxy task cuts a monomer into two sub-parts, called a pseudo multimer, and pre-trains the model to merge them back together by predicting their pseudo contacts. The pre-trained model then initializes training for our final target, protein-protein contact prediction. Because the proxy task is consistent with the final target, the method yields a stronger pre-trained model for subsequent fine-tuning, leading to significant performance gains. Extensive experiments validate the effectiveness of our method: the model outperforms the state of the art by 11.40% and 2.97% on the P@L/10 metric on the bounded benchmarks DIPS-Plus and CASP-CAPRI, respectively. Furthermore, the model achieves almost 1.5 times the performance of the state of the art on the harder unbounded benchmark DB5. The code, model, and pre-training data will be released after this paper is accepted.
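The label construction behind the proxy task can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the function name, the minimum chain length, the 8 Å contact threshold, and the use of a plain residue-residue distance map are all our own assumptions.

```python
import numpy as np

def make_pseudo_multimer(dist_map, min_len=20, contact_thresh=8.0, rng=None):
    """Split a monomer into two pseudo chains and derive pseudo contacts.

    dist_map: (N, N) symmetric matrix of residue-residue distances (angstroms).
    Returns (split_point, pseudo_contacts), where pseudo_contacts is an
    (L_A, L_B) binary matrix over the residue pairs of pseudo chains A and B.
    """
    rng = rng or np.random.default_rng()
    n = dist_map.shape[0]
    # Pick a random split point, keeping both pseudo chains >= min_len residues.
    split = int(rng.integers(min_len, n - min_len))
    # The off-diagonal block holds every distance between the two pseudo chains,
    # i.e. it plays the role of the inter-chain distance map of a real multimer.
    inter_block = dist_map[:split, split:]
    # Pseudo contacts: residue pairs closer than the threshold.
    pseudo_contacts = (inter_block < contact_thresh).astype(np.int8)
    return split, pseudo_contacts
```

Training on (pseudo chain A, pseudo chain B, pseudo_contacts) triples then matches the input/output format of the final protein-protein contact prediction task, which is what makes the proxy consistent with the target.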

1. INTRODUCTION

Proteins are large molecules consisting of sequences of amino acids (also called residues). Protein-protein contact prediction aims to compute the constraints between given protein sequences, specifically whether a residue on one protein is in contact with a residue on the other, which is important for the structural and functional analysis of protein complexes. The predicted constraints reveal the relationship between each residue pair of the two protein sequences, which not only benefits complex protein structure prediction but is also useful in many protein function analysis scenarios, e.g., developing new drugs and designing new proteins.

The success of RaptorX Wang et al. (2017); Xu et al. (2021) and AlphaFold2 Jumper et al. (2021) demonstrates the application potential of deep learning in computational biology and has inspired a series of new biological computation methods. However, when extended to protein-protein (inter-chain) contact prediction, recent deep models have not achieved the satisfying performance of the aforementioned successes. An important bottleneck is the limited quantity of data: most well-known successful deep learning systems are trained on large-scale datasets. For example, in computer vision (CV), ConvNets Krizhevsky et al. (2012); Simonyan & Zisserman (2014); He et al. (2016) and ViTs Dosovitskiy et al. (2020); Liu et al. (2021); Yuan et al. (2021) are trained on ImageNet Deng et al. (2009), which has 14 million labeled images that provide rich visual category information about the real world. For natural language processing (NLP), the most popular language model, BERT Devlin et al. (2018), is trained on document-level data, BooksCorpus Zhu et al. (2015) and English Wikipedia, in an unsupervised manner. And in computational biology, the

