PRE-TRAINING PROTEIN STRUCTURE ENCODER VIA SIAMESE DIFFUSION TRAJECTORY PREDICTION

Abstract

Inspired by the governing role of protein structures in protein functions, structure-based protein representation pre-training has recently gained interest but remains largely unexplored. Along this direction, pre-training by maximizing the mutual information (MI) between different descriptions of the same protein (i.e., correlated views) has shown preliminary promise, while more in-depth studies are required to design more informative correlated views. Previous view designs focus on capturing the co-occurrence of structural motifs in the same protein structure, but they can capture neither detailed atom- and residue-level interactions nor the statistical dependencies of residue types along protein sequences. To address these limitations, we propose the Siamese Diffusion Trajectory Prediction (SiamDiff) method. SiamDiff employs a multimodal diffusion process as a faithful simulation of the structure-sequence co-diffusion trajectory that gradually and smoothly approaches the folded structure and the corresponding sequence of a protein from scratch. Given a native protein and its correlated counterpart obtained by random structure perturbation, we build two multimodal (structure and sequence) diffusion trajectories and regard them as two correlated views. A principled theoretical framework is designed to maximize the MI between such paired views, such that the model can acquire the atom- and residue-level interactions underlying protein structural changes as well as the residue type dependencies. We study the effectiveness of SiamDiff on both residue-level and atom-level structural representations. Experimental results on the EC and ATOM3D benchmarks show that SiamDiff is consistently competitive with existing baselines on all benchmark tasks. The source code will be made public upon acceptance.

1. INTRODUCTION

In the past year, thanks to the rise of highly accurate and efficient deep-learning-based protein structure predictors (Jumper et al., 2021; Baek et al., 2021), the gap between the number of reported protein sequences and corresponding (computationally) solved structures has been greatly narrowed. These advances open up the opportunity for self-supervised protein representation learning based on protein structures, i.e., learning informative protein representations from massive numbers of protein structures without using any annotation. Compared to extensively studied sequence-based protein self-supervised learning such as protein language models (Elnaggar et al., 2021; Rives et al., 2021), structure-based methods could learn representations that more effectively indicate protein functions, since a protein sequence determines its structure, and the structure in turn determines its diverse functions (Harms & Thornton, 2010). To attain this goal, some recent works have explored different self-supervised learning strategies on protein structures, including contrastive learning (Zhang et al., 2022; Hermosilla & Ropinski, 2022), self-prediction (Zhang et al., 2022; Chen et al., 2022) and denoising score matching (Guo et al., 2022; Wu et al., 2022a). Among these works, methods based on mutual information (MI) maximization (Zhang et al., 2022; Hermosilla & Ropinski, 2022) achieve superior performance on protein function and structural class prediction. At the core of these methods, different structural descriptions of the same protein (i.e., correlated views) are built to capture the co-occurrence of structural motifs. However, such a view construction scheme fails to capture detailed atom- and residue-level interactions as well as the statistical dependencies of residue types along protein sequences.
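To make the view-based MI maximization concrete, the sketch below builds a correlated view of a protein structure by random coordinate perturbation and scores paired embeddings with an InfoNCE-style objective, a standard variational lower bound on MI used by contrastive methods. This is a minimal illustration, not the implementation from any of the cited works; the noise scale `sigma`, temperature `tau`, and the use of raw NumPy embeddings in place of a learned encoder are all illustrative assumptions.

```python
import numpy as np

def perturb_structure(coords, sigma=0.3, rng=None):
    """Build a correlated view by adding Gaussian noise to atom coordinates.

    coords: (num_atoms, 3) array; sigma is a hypothetical noise scale.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    return coords + rng.normal(scale=sigma, size=coords.shape)

def info_nce(z1, z2, tau=0.1):
    """InfoNCE loss between paired embeddings z1, z2 of shape (N, d).

    Row i of z1 and row i of z2 are the two correlated views of protein i;
    all other rows act as negatives. Minimizing this loss maximizes a
    lower bound on the MI between the paired views.
    """
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / tau                      # pairwise similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))            # positives on the diagonal
```

In a real pipeline, `z1` and `z2` would come from a shared (Siamese) structure encoder applied to the native protein and its perturbed counterpart; here they are stand-ins to show the shape of the objective.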
Therefore, MI maximization based on these views may not produce effective representations for tasks that require modeling detailed local structures (e.g., the Residue Identity task from ATOM3D (Townshend et al., 2020)) or

