PRE-TRAINING PROTEIN STRUCTURE ENCODER VIA SIAMESE DIFFUSION TRAJECTORY PREDICTION

Abstract

Inspired by the governing role of protein structures in protein functions, structure-based protein representation pre-training has recently gained interest but remains largely unexplored. Along this direction, pre-training by maximizing the mutual information (MI) between different descriptions of the same protein (i.e., correlated views) has shown preliminary promise, while more in-depth studies are required to design more informative correlated views. Previous view designs focus on capturing the co-occurrence of structural motifs within the same protein structure, but they capture neither detailed atom- and residue-level interactions nor the statistical dependencies of residue types along protein sequences. To address these limitations, we propose the Siamese Diffusion Trajectory Prediction (SiamDiff) method. SiamDiff employs a multimodal diffusion process as a faithful simulation of the structure-sequence co-diffusion trajectory that gradually and smoothly approaches the folded structure and corresponding sequence of a protein from scratch. Given a native protein and a correlated counterpart obtained by random structure perturbation, we build two multimodal (structure and sequence) diffusion trajectories and regard them as two correlated views. A principled theoretical framework is designed to maximize the MI between such paired views, so that the model can acquire the atom- and residue-level interactions underlying protein structural changes as well as the residue type dependencies. We study the effectiveness of SiamDiff on both residue-level and atom-level structural representations. Experimental results on the EC and ATOM3D benchmarks show that SiamDiff is consistently competitive on all benchmark tasks compared with existing baselines. The source code will be made public upon acceptance.

1. INTRODUCTION

In the past year, thanks to the rise of highly accurate and efficient protein structure predictors based on deep learning (Jumper et al., 2021; Baek et al., 2021), the gap between the number of reported protein sequences and corresponding (computationally) solved structures has been greatly narrowed. These advances open up the opportunity for self-supervised protein representation learning based on protein structures, i.e., learning informative protein representations from massive protein structures without using any annotation. Compared to extensively studied sequence-based protein self-supervised learning such as protein language models (Elnaggar et al., 2021; Rives et al., 2021), structure-based methods could learn more effective representations for indicating protein functions, since a protein sequence determines its structure, and the structure is the determinant of its diverse functions (Harms & Thornton, 2010). To attain this goal, some recent works have explored different self-supervised learning strategies on protein structures, including contrastive learning (Zhang et al., 2022; Hermosilla & Ropinski, 2022), self-prediction (Zhang et al., 2022; Chen et al., 2022) and denoising score matching (Guo et al., 2022; Wu et al., 2022a). Among these works, mutual information (MI) maximization based methods (Zhang et al., 2022; Hermosilla & Ropinski, 2022) achieve superior performance on protein function and structural class prediction. At the core of these methods, different structural descriptions of the same protein (i.e., correlated views) are built to capture the co-occurrence of structural motifs. However, such a view construction scheme fails to capture detailed atom- and residue-level interactions as well as the statistical dependencies of residue types along protein sequences.
Therefore, MI maximization based on these views may not produce effective representations for tasks that require modeling detailed local structures (e.g., the Residue Identity task from ATOM3D (Townshend et al., 2020)) or minor differences in structures and sequences due to point mutations (e.g., the Mutation Stability Prediction task from ATOM3D), as demonstrated by the experimental results in Tables 1 and 2. To tackle this limitation, in this work, we propose the Siamese Diffusion Trajectory Prediction (SiamDiff) method to jointly model fine-grained atom- and residue-level interactions and residue type dependencies. Specifically, given a native protein, we first derive its correlated counterpart by random structure perturbation. We further extend the original protein and the generated counterpart respectively by the multimodal diffusion process, in which we transform both the protein structure and the protein sequence toward a random distribution by gradually and smoothly adding noise. Such a diffusion process has been verified as a faithful simulation of the structure-sequence co-diffusion trajectory by recent studies (Anand & Achim, 2022; Luo et al., 2022). We regard the diffusion trajectories of the original protein and the generated counterpart as two correlated views, and a principled theoretical framework is designed to maximize the MI between such paired views. Under this learning framework, the model can acquire the atom- and residue-level interactions underlying protein structural changes as well as the residue type dependencies. The learned protein representations are expected to boost diverse types of downstream tasks. SiamDiff can be flexibly applied to both residue-level and atom-level structures for effective representation learning.
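The forward noising of a multimodal diffusion trajectory can be sketched as follows. This is a minimal illustration, not the paper's actual formulation: it assumes a DDPM-style Gaussian forward kernel on atom coordinates and a uniform-resampling discrete kernel on residue types; all names (`forward_diffusion_step`, `betas`, etc.) and the specific noise schedule are illustrative.

```python
import numpy as np

def forward_diffusion_step(coords, seq, t, betas, num_types=20, rng=None):
    """One noising step of a multimodal (structure + sequence) forward process.

    coords: (n_a, 3) array of atom Cartesian coordinates
    seq:    (n_r,) array of integer residue types in [0, num_types)
    t:      current timestep (index into betas)
    betas:  (T,) noise schedule, small values in (0, 1)
    """
    rng = np.random.default_rng() if rng is None else rng
    beta_t = betas[t]
    # Structure: Gaussian perturbation, q(R_t | R_{t-1}) = N(sqrt(1-beta) R_{t-1}, beta I)
    coords_next = (np.sqrt(1.0 - beta_t) * coords
                   + np.sqrt(beta_t) * rng.standard_normal(coords.shape))
    # Sequence: with probability beta_t, resample each residue type uniformly
    resample = rng.random(seq.shape) < beta_t
    seq_next = np.where(resample, rng.integers(0, num_types, size=seq.shape), seq)
    return coords_next, seq_next
```

Iterating this step from t = 0 to T - 1 yields a trajectory that smoothly degrades both modalities toward noise; reversing it corresponds to jointly "folding" the structure and "designing" the sequence from scratch, which is the trajectory SiamDiff treats as a view.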
We employ different self-supervised algorithms to pre-train residue-level and atom-level structure encoders, and the pre-trained models are extensively evaluated on the Enzyme Commission number prediction (Gligorijević et al., 2021) and ATOM3D (Townshend et al., 2020) benchmarks. Experimental results verify that SiamDiff consistently achieves competitive performance on all benchmark tasks and at both structure levels, compared with existing baselines.
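The MI maximization between paired views mentioned above is commonly lower-bounded in practice by a contrastive objective such as InfoNCE. The sketch below is a generic illustration of that bound, not the specific objective derived in this paper; the function name and temperature value are illustrative.

```python
import numpy as np

def info_nce(z1, z2, tau=0.1):
    """InfoNCE loss: a variational lower bound on MI between paired views.

    z1, z2: (batch, dim) L2-normalized embeddings of correlated views,
    where row i of z1 is paired with row i of z2 (the positive pair).
    Returns the mean cross-entropy over the batch (lower = tighter bound).
    """
    logits = z1 @ z2.T / tau                      # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))            # positives on the diagonal
```

Each diagonal entry scores a protein's trajectory view against its own counterpart, while off-diagonal entries act as in-batch negatives from other proteins.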

2. RELATED WORK

Protein Structure Representation Learning. The community has witnessed a surge of research interest in learning informative protein structure representations using structure-based encoders and training algorithms. The encoders are designed to capture protein structural information at different granularities, including residue-level structures (Gligorijević et al., 2021; Wang et al., 2022b; Zhang et al., 2022), atom-level structures (Hermosilla et al., 2021; Jing et al., 2021a; Wang et al., 2022a) and protein surfaces (Gainza et al., 2020; Sverrisson et al., 2021; Somnath et al., 2021). Recent works study pre-training on massive unlabeled protein structures for generalizable representations, covering contrastive learning (Zhang et al., 2022; Hermosilla & Ropinski, 2022), self-prediction of geometric quantities (Zhang et al., 2022; Chen et al., 2022) and denoising score matching (Guo et al., 2022; Wu et al., 2022a). All these methods employ only native proteins for pre-training. By comparison, the proposed SiamDiff uses the information from multimodal diffusion trajectories to better acquire atom- and residue-level interactions and residue type dependencies.

Diffusion Probabilistic Models (DPMs). DPMs were first proposed in Sohl-Dickstein et al. (2015) and have recently been rekindled for their strong performance on image and waveform generation (Ho et al., 2020; Chen et al., 2020a). Recent works (Nichol & Dhariwal, 2021; Song et al., 2020; 2021) have improved training and sampling for DPMs. Besides DPMs for continuous data, some works study discrete DPMs and achieve impressive results on generating texts (Austin et al., 2021; Li et al., 2022), images (Austin et al., 2021) and image segmentation data (Hoogeboom et al., 2021).
Inspired by this progress, DPMs have recently been adopted to solve problems in the chemistry and biology domains, including molecule generation (Xu et al., 2022; Hoogeboom et al., 2022; Wu et al., 2022b), molecular representation learning (Liu et al., 2022), protein design (Anand & Achim, 2022; Luo et al., 2022) and motif-scaffolding (Trippe et al., 2022). In this work, we present a novel study of how DPMs can help protein representation learning, which aligns with a recent effort (Abstreiter et al., 2021) on diffusion-based image representation learning.

3.1. PROBLEM DEFINITION

Notations. A protein with n_r residues (amino acids) and n_a atoms can be represented as a sequence-structure tuple P = (S, R). We use S = [s_1, s_2, ..., s_{n_r}] to denote its sequence, with s_i as the type of the i-th residue, while R = [r_1, r_2, ..., r_{n_a}] ∈ R^{n_a × 3} denotes its structure, with r_i as the Cartesian coordinates of the i-th atom.


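The sequence-structure tuple P = (S, R) above maps directly onto a simple data structure. The sketch below is only an illustration of the notation (the class name and fields are not from the paper): S is an integer array over the 20 standard amino acid types and R is an (n_a, 3) coordinate array.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Protein:
    """P = (S, R): sequence S of n_r residue types, structure R of n_a atoms."""
    S: np.ndarray  # shape (n_r,), integer residue types in [0, 20)
    R: np.ndarray  # shape (n_a, 3), Cartesian coordinates of each atom

    @property
    def n_r(self) -> int:
        """Number of residues in the sequence."""
        return self.S.shape[0]

    @property
    def n_a(self) -> int:
        """Number of atoms in the structure."""
        return self.R.shape[0]
```

Note that n_a and n_r generally differ: at the residue level each residue may be reduced to a single representative atom (e.g., its alpha carbon), while at the atom level each residue contributes several atoms.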