CONTINUOUS-DISCRETE CONVOLUTION FOR GEOMETRY-SEQUENCE MODELING IN PROTEINS

Abstract

The structure of proteins involves the 3D geometry of amino acid coordinates and the 1D sequence of the peptide chain. The 3D structure exhibits irregularity because amino acids are distributed unevenly in Euclidean space and their coordinates are continuous variables. In contrast, the 1D structure is regular because amino acids are arranged uniformly in the chain and their sequential positions (orders) are discrete variables. Moreover, geometric coordinates and sequential orders lie in two different spaces whose units of length are incompatible. These inconsistencies make it challenging to capture the 3D and 1D structures while preventing sequence and geometry modeling from interfering with each other. This paper proposes a Continuous-Discrete Convolution (CDConv) that uses irregular and regular approaches to model the geometry and sequence structures, respectively. Specifically, CDConv employs independent learnable weights for different regular sequential displacements but directly encodes geometric displacements due to their irregularity. In this way, CDConv significantly improves protein modeling by reducing the impact of geometric irregularity on sequence modeling. Extensive experiments on a range of tasks, including protein fold classification, enzyme reaction classification, gene ontology term prediction and enzyme commission number prediction, demonstrate the effectiveness of the proposed CDConv.

1. INTRODUCTION

Proteins are large biomolecules and are essential for life. Understanding their function is significant for the life sciences, but determining function usually requires enormous experimental effort (Wüthrich, 2001; Jaskolski et al., 2014; Bai et al., 2015; Thompson et al., 2020). Recently, with the development of deep learning, emerging computational and data-driven approaches have become particularly useful for efficient protein understanding (Derevyanko et al., 2018; Ingraham et al., 2019; Strokach et al., 2020; Cao et al., 2021; Jing et al., 2021; Jumper et al., 2021; Shanehsazzadeh et al., 2020), including protein design, structure classification, model quality assessment, function prediction, etc. Because the function of proteins is based on their structure, accurately modeling protein structures can facilitate a mechanistic understanding of their function. Proteins are made up of different amino acids: there are 20 types of amino acids (residues) commonly found in plants and animals, and a typical protein consists of 300 or more of them. Because these amino acids are linked by peptide bonds and form a chain (shown in Fig. 1), proteins exhibit a 1D sequence structure. Moreover, because amino acids are arranged uniformly in the chain and their orders are discrete, the sequence structure is regular. Accordingly, 1D Convolutional Neural Networks (CNNs) (Kulmanov et al., 2018; Hou et al., 2018; Kulmanov & Hoehndorf, 2021) and Long Short-Term Memory (LSTM) networks (Bepler & Berger, 2019; Rao et al., 2019; Alley et al., 2019; Strodthoff et al., 2020) are widely used to model the regular 1D sequence structure of proteins. In addition to its 1D sequential order in the peptide chain, each amino acid has a 3D coordinate that specifies its spatial position in the protein. As shown in Fig. 1, these 3D coordinates describe a geometry structure, which is also crucial for protein recognition. As mentioned by Alexander et al. (2009), proteins with similar peptide chains may fold into very different 3D geometry structures; conversely, proteins with similar 3D geometry structures may have entirely different amino acid chains (Agrawal & Kishan, 2001).
Therefore, it is necessary to consider both the 1D and 3D structures in protein modeling. However, unlike the sequence structure, the geometry structure is irregular because amino acids are distributed unevenly in Euclidean space and their coordinates are continuous variables. This increases the challenge for neural networks to understand proteins. To model the 1D and 3D structures in proteins, Gligorijević et al. (2021) employed an LSTM and a Graph Convolutional Network (GCN) for sequence and geometry modeling, respectively. Because the two structures are processed separately, such an approach may fail to properly capture proteins' local geometry-sequence structure. In contrast, a few unified networks try to model the two types of structures together (Baldassarre et al., 2021; Jing et al., 2021; Zhang et al., 2022; Hermosilla & Ropinski, 2022). However, those methods process geometric and sequential displacements together, or model the sequence structure in a way similar to geometry modeling, thus neglecting the regularity of the 1D structure. Moreover, because the length units of the sequential and geometric spaces in proteins are not compatible, treating their distances similarly may mislead deep neural networks. In this paper, we first propose and formulate a new class of convolution operation, named Continuous-Discrete Convolution (CDConv), which exploits the dual discrete and continuous nature of the data to prevent regular and irregular modeling from interfering with each other. Specifically, CDConv employs independent learnable weights to reflect different regular and discrete displacements but directly encodes continuous displacements due to their irregularity. Then, we implement a (3+1)D CDConv and use this operation to construct a hierarchical (3+1)D CNN for geometry-sequence modeling in proteins. We apply our network to protein fold classification, enzyme reaction classification, gene ontology term prediction and enzyme commission number prediction.
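To make the continuous-discrete idea concrete, below is a minimal NumPy sketch of such a convolution, not the authors' implementation: the regular sequential displacement between two residues indexes an independent learnable weight matrix (discrete part), while the irregular geometric displacement is encoded directly (continuous part; here via simple concatenation, whereas a learned encoding may be used in practice). All names and the radius/kernel parameters are illustrative assumptions.

```python
import numpy as np

def cdconv(feat, coords, orders, weights, radius):
    """Sketch of a (3+1)D continuous-discrete convolution.

    feat:    (N, C_in) per-residue input features
    coords:  (N, 3) continuous 3D coordinates (irregular)
    orders:  (N,) discrete sequence positions (regular)
    weights: (K + 1, C_in + 3, C_out) one independent matrix per sequential
             displacement in [-K//2, K//2], plus one shared matrix
             (index K) for neighbors outside that window
    radius:  spatial cutoff defining the geometric neighborhood
    """
    kernel = weights.shape[0] - 1   # K discrete displacement slots
    half = kernel // 2
    n, c_out = feat.shape[0], weights.shape[2]
    out = np.zeros((n, c_out))
    for i in range(n):
        for j in range(n):
            delta = coords[j] - coords[i]
            if np.linalg.norm(delta) > radius:
                continue            # only spatial neighbors contribute
            # Discrete part: the regular sequential displacement selects an
            # independent weight matrix (no encoding needed).
            d = orders[j] - orders[i]
            idx = d + half if abs(d) <= half else kernel
            # Continuous part: the irregular geometric displacement is fed in
            # directly rather than discretized into bins.
            out[i] += np.concatenate([feat[j], delta]) @ weights[idx]
    return out
```

In a trained network the weight matrices would be learned parameters and the geometric displacement would typically pass through a learned encoder; the sketch only illustrates how the two displacement types are treated asymmetrically.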
Experimental results demonstrate the effectiveness of the proposed method. The contributions of this paper are fourfold:
• We propose a new class of convolution, i.e., CDConv, which unifies continuous and discrete convolutions and makes the most of the regularity and irregularity in data, respectively.
• We implement a (3+1)D CDConv for geometry-sequence modeling in proteins. Based on this convolution, we construct deep neural networks for protein representation learning.
• We conduct extensive experiments on four tasks, where the proposed method surpasses previous methods by large margins, resulting in new state-of-the-art accuracy.
• We find that amino acids in central regions may be more important than those in surface areas. This may need to be verified via biochemical experiments in the future.

2. RELATED WORK

Protein Representation Learning. Protein representation learning has attracted increasing attention in the fields of protein modeling and structural bioinformatics and is critical in biology. Because proteins are sequences of amino acids, 1D CNNs, LSTMs and Transformers have been employed for sequence-based protein representation learning (Amidi et al., 2018; Kulmanov et al., 2018; Hou et al., 2018; Rao et al.,



Figure 1: Illustration of the geometry-sequence structure of a protein. The dot color indicates amino acid (residue) types. 1) Amino acids are linked by peptide bonds and form a chain, which exhibits a regular 1D sequence structure because they are arranged uniformly and their sequential orders are discrete. 2) In addition, amino acids have 3D coordinates that determine a geometry structure, which exhibits irregularity because they are distributed unevenly in Euclidean space and their coordinates are continuous variables.

