CONTINUOUS-DISCRETE CONVOLUTION FOR GEOMETRY-SEQUENCE MODELING IN PROTEINS

Abstract

The structure of proteins involves 3D geometry of amino acid coordinates and 1D sequence of peptide chains. The 3D structure exhibits irregularity because amino acids are distributed unevenly in Euclidean space and their coordinates are continuous variables. In contrast, the 1D structure is regular because amino acids are arranged uniformly in the chains and their sequential positions (orders) are discrete variables. Moreover, geometric coordinates and sequential orders are in two types of spaces and their units of length are incompatible. These inconsistencies make it challenging to capture the 3D and 1D structures while avoiding the impact of sequence and geometry modeling on each other. This paper proposes a Continuous-Discrete Convolution (CDConv) that uses irregular and regular approaches to model the geometry and sequence structures, respectively. Specifically, CDConv employs independent learnable weights for different regular sequential displacements but directly encodes geometric displacements due to their irregularity. In this way, CDConv significantly improves protein modeling by reducing the impact of geometric irregularity on sequence modeling. Extensive experiments on a range of tasks, including protein fold classification, enzyme reaction classification, gene ontology term prediction and enzyme commission number prediction, demonstrate the effectiveness of the proposed CDConv.
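As an illustration of the idea stated above, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the kernel factorizes into a weight matrix selected by the discrete sequential displacement j - i and an encoding of the continuous geometric displacement p_j - p_i. The function name `cdconv_sketch`, the parameters `seq_weights`, `geo_weights` and `radius`, and the single-layer `tanh` encoder are all hypothetical choices made for this example.

```python
import numpy as np


def cdconv_sketch(coords, feats, seq_weights, geo_weights, radius):
    """Toy continuous-discrete convolution over one protein chain.

    coords:      (N, 3) residue coordinates (continuous, irregular).
    feats:       (N, C_in) residue features.
    seq_weights: (window, C_out, C_hid), one independent weight matrix per
                 discrete sequential displacement (hypothetical layout).
    geo_weights: (C_hid * C_in, 5), parameters of an assumed one-layer
                 encoder applied to the geometric displacement.
    radius:      geometric neighbourhood radius (assumed hyperparameter).
    """
    n, c_in = feats.shape
    window, c_out, c_hid = seq_weights.shape
    half = window // 2
    out = np.zeros((n, c_out))
    for i in range(n):
        # Regular part: only residues within the sequential window of i.
        for j in range(max(0, i - half), min(n, i + half + 1)):
            delta = coords[j] - coords[i]
            dist = np.linalg.norm(delta)
            # Irregular part: keep only neighbours inside the ball.
            if dist > radius:
                continue
            # Directly encode the geometric displacement; the trailing
            # 1.0 acts as a bias input to the encoder.
            geo_in = np.concatenate([delta, [dist, 1.0]])
            g = np.tanh(geo_weights @ geo_in).reshape(c_hid, c_in)
            # Look up the weight matrix by discrete displacement j - i.
            w = seq_weights[j - i + half]
            out[i] += w @ (g @ feats[j])
    return out
```

Under these assumptions, sequence offsets index a lookup table of weights because they are discrete and regular, whereas geometric offsets pass through a continuous function because no finite table can enumerate them.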

1. INTRODUCTION

Proteins are large biomolecules and are essential for life. Understanding their function is important for the life sciences. However, determining it usually requires enormous experimental effort (Wüthrich, 2001; Jaskolski et al., 2014; Bai et al., 2015; Thompson et al., 2020). Recently, with the development of deep learning, computational and data-driven approaches have become particularly useful for efficient protein understanding (Derevyanko et al., 2018; Ingraham et al., 2019; Strokach et al., 2020; Cao et al., 2021; Jing et al., 2021; Jumper et al., 2021; Shanehsazzadeh et al., 2020), including protein design, structure classification, model quality assessment, function prediction, etc. Because the function of proteins is based on their structure, accurately modeling protein structures can facilitate a mechanistic understanding of their function.

Proteins are made up of different amino acids. There are 20 types of amino acids (residues) commonly found in plants and animals, and a typical protein is made up of 300 or more amino acids. Because these amino acids are linked by peptide bonds and form a chain (shown in Fig. 1), proteins exhibit a 1D sequence structure. Moreover, because amino acids are arranged uniformly in the chains and their orders are discrete, the sequence structure is regular. For this reason, 1D Convolutional Neural Networks (CNNs) (Kulmanov et al., 2018; Hou et al., 2018; Kulmanov & Hoehndorf, 2021) and Long Short-Term Memory (LSTM) networks (Bepler & Berger, 2019; Rao et al., 2019; Alley et al., 2019; Strodthoff et al., 2020) are widely used to model the regular 1D sequence structure of proteins.

In addition to its 1D sequential order in the peptide chain, each amino acid has a 3D coordinate that specifies its spatial position in the protein. As shown in Fig. 1, these 3D coordinates describe a geometry structure, which is also crucial for protein recognition. As mentioned by Alexander et al. (2009), 

