INTRINSIC-EXTRINSIC CONVOLUTION AND POOLING FOR LEARNING ON 3D PROTEIN STRUCTURES

Abstract

Proteins perform a large variety of functions in living organisms and thus play a key role in biology. However, commonly used algorithms in protein learning were not specifically designed for protein data and are therefore unable to capture all relevant structural levels of a protein during learning. To fill this gap, we propose two new learning operators, specifically designed to process protein structures. First, we introduce a novel convolution operator that considers the primary, secondary, and tertiary structure of a protein by using n-D convolutions defined on both the Euclidean distance and multiple geodesic distances between the atoms in a multi-graph. Second, we introduce a set of hierarchical pooling operators that enable multi-scale protein analysis. We further evaluate the accuracy of our algorithms on common downstream tasks, where we outperform state-of-the-art protein learning algorithms.

1. INTRODUCTION

Proteins perform specific biological functions essential for all living organisms and hence play a key role when investigating the most fundamental questions in the life sciences. These biomolecules are composed of one or several chains of amino acids, which fold into specific conformations to enable various biological functionalities. Proteins can be defined using a multi-level structure: The primary structure is given by the sequence of amino acids that are connected through covalent bonds and form the protein backbone. Hydrogen bonds between distant amino acids in the chain form the secondary structure, which defines substructures such as α-helices and β-sheets. The tertiary structure results from protein folding and expresses the 3D spatial arrangement of the secondary structures. Lastly, the quaternary structure is given by the interaction of multiple amino acid chains. Considering only a subset of these levels can lead to misinterpretations due to ambiguities. As shown by Alexander et al. (2009), proteins with almost identical primary structure, i.e., differing in only a few amino acids, can fold into entirely different conformations. Conversely, proteins from SH3 and OB folds have similar tertiary structures, but their primary and secondary structures differ significantly (Agrawal & Kishan, 2001) (Fig. 1). To avoid misinterpretations arising from these observations, capturing the invariances with respect to primary, secondary, and tertiary structures is of key importance when studying proteins and their functions. Previously, the state of the art was dominated by methods based on hand-crafted features, usually extracted from multi-sequence alignment tools (Altschul et al., 1990) or annotated databases (El-Gebali et al., 2019).
In recent years, these have been outperformed by protein learning algorithms in different protein modeling tasks such as protein fold classification (Hou et al., 2018; Rao et al., 2019; Bepler & Berger, 2019; Alley et al., 2019; Min et al., 2020) or protein function prediction (Strodthoff et al., 2020; Gligorijevic et al., 2019; Kulmanov et al., 2017; Kulmanov & Hoehndorf, 2019; Amidi et al., 2017). This can be attributed to the ability of machine learning algorithms to learn meaningful representations of proteins directly from the raw data. However, most of these techniques only consider a subset of the relevant structural levels of proteins and thus can only create a representation from partial information. For instance, due to the high amount of available protein sequence data, most techniques solely rely on protein sequence data as input and apply learning algorithms from the field of natural language processing (Rao et al., 2019; Alley et al., 2019; Min et al., 2020; Strodthoff et al., 2020). In this paper, we introduce a novel end-to-end protein learning algorithm that explicitly incorporates the multi-level structure of proteins and captures the resulting different invariances. We show how a multi-graph data structure can represent the primary and secondary structures effectively by considering covalent and hydrogen bonds, while the tertiary structure can be represented by the spatial 3D coordinates of the atoms (Sec. 3). By borrowing terminology from the differential geometry of surfaces, we define a new convolution operator that uses both intrinsic (primary and secondary structures) and extrinsic (tertiary and quaternary structures) distances (Sec. 4).
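To make the distinction concrete, the following minimal sketch (not the paper's implementation; all names are hypothetical) stores a protein as a multi-graph with per-atom 3D coordinates and two edge sets, one for covalent bonds (primary structure) and one for hydrogen bonds (secondary structure). The extrinsic distance between two atoms is their Euclidean distance in 3D, while an intrinsic distance is the geodesic, i.e., shortest-path, distance along one edge set:

```python
# Hypothetical sketch: a protein multi-graph with intrinsic (geodesic)
# and extrinsic (Euclidean) distances between atoms.
import heapq
import math


class ProteinMultiGraph:
    def __init__(self, coords):
        self.coords = coords                          # atom index -> (x, y, z)
        self.edges = {"covalent": {}, "hydrogen": {}}  # one graph per bond type

    def add_bond(self, kind, a, b, length):
        # Bonds are undirected; store both directions with their length.
        self.edges[kind].setdefault(a, []).append((b, length))
        self.edges[kind].setdefault(b, []).append((a, length))

    def extrinsic(self, a, b):
        """Euclidean distance in 3D space (tertiary structure)."""
        return math.dist(self.coords[a], self.coords[b])

    def intrinsic(self, kind, a, b):
        """Geodesic distance along bonds of one type (Dijkstra)."""
        dist = {a: 0.0}
        queue = [(0.0, a)]
        while queue:
            d, u = heapq.heappop(queue)
            if u == b:
                return d
            if d > dist.get(u, math.inf):
                continue  # stale queue entry
            for v, w in self.edges[kind].get(u, []):
                nd = d + w
                if nd < dist.get(v, math.inf):
                    dist[v] = nd
                    heapq.heappush(queue, (nd, v))
        return math.inf  # atoms not connected by this bond type
```

For example, in a four-atom chain with one hydrogen bond between its ends, the intrinsic covalent distance between the end atoms follows the backbone, while the hydrogen-bond graph provides a shortcut; this is the kind of complementary information the convolution operator can exploit.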
Moreover, since protein sizes range from less than one hundred to tens of thousands of amino acids (Brocchieri & Karlin, 2005), we propose protein-specific pooling operations that allow hierarchical grouping across such a wide range of sizes, enabling the detection of features at different scales (Sec. 5). Lastly, we demonstrate that, by considering all mentioned protein structure levels, we can significantly outperform recent state-of-the-art methods on protein tasks such as protein fold and enzyme classification. Code and data of our approach are available at https://github.com/phermosilla/IEConv_proteins.
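The idea of hierarchical pooling can be illustrated with a minimal sketch (a simplified stand-in, not the paper's exact operators): one pooling level merges consecutive atoms along the backbone chain, averaging their 3D coordinates and features, so that repeated application roughly halves the resolution at each level while preserving chain order.

```python
# Hypothetical sketch: one level of chain-aware pooling that merges
# consecutive pairs of atoms, averaging coordinates and features.
def pool_chain(coords, feats):
    pooled_coords, pooled_feats = [], []
    for i in range(0, len(coords), 2):
        group_c = coords[i:i + 2]   # pair of atoms (or a lone tail atom)
        group_f = feats[i:i + 2]
        n = len(group_c)
        # Average the 3D positions of the merged atoms.
        pooled_coords.append(tuple(sum(c[d] for c in group_c) / n for d in range(3)))
        # Average the feature vectors component-wise.
        pooled_feats.append([sum(f[d] for f in group_f) / n for d in range(len(group_f[0]))])
    return pooled_coords, pooled_feats
```

Applying `pool_chain` repeatedly yields a hierarchy of progressively coarser protein representations, on which convolutions can detect features at increasing spatial scales.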

2. RELATED WORK

Early works on learning protein representations (Asgari & Mofrad, 2015; Yang et al., 2018) used word embedding algorithms (Mikolov et al., 2013), as employed in Natural Language Processing (NLP). Other approaches have used 1D convolutional neural networks (CNN) to learn protein representations directly from an amino acid sequence, for tasks such as protein function prediction (Kulmanov et al., 2017; Kulmanov & Hoehndorf, 2019), protein-compound interaction (Tsubaki et al., 2018), or protein fold classification (Hou et al., 2018). Recently, researchers have applied complex NLP models, trained unsupervised on millions of unlabeled protein sequences and fine-tuned for different downstream tasks (Rao et al., 2019; Alley et al., 2019; Min et al., 2020; Strodthoff et al., 2020; Bepler & Berger, 2019). While representing proteins as amino acid sequences during learning is helpful when only sequence data is available, it does not leverage the full potential of spatial protein representations, which become more and more available with modern imaging and reconstruction techniques. To learn beyond sequences, approaches have been developed that consider the 3D structure of proteins. A range of methods has sampled protein structures onto regular volumetric 3D representations and assessed the quality of the structure (Derevyanko et al., 2018), classified proteins into enzyme classes (Amidi et al., 2017), predicted the protein-ligand binding affinity (Ragoza et al., 2017) and the binding site (Jiménez et al., 2017), as well as the contact region between two proteins (Townshend et al., 2019). While this is attractive, as 3D grids allow for unleashing the benefits of all approaches developed for 2D images, such as pooling and multi-resolution techniques, grids unfortunately do not scale well to fine structures or many atoms, and, even more importantly, they do not consider the primary and secondary structure of proteins.



Figure 1: Invariances present in protein structures.

Other sequence-based methods apply 1D convolutional neural networks (Kulmanov et al., 2017; Kulmanov & Hoehndorf, 2019) or use structural information during training (Bepler & Berger, 2019). Further methods have solely used 3D atomic coordinates as an input and applied 3D convolutional neural networks (3DCNN) (Amidi et al., 2017; Derevyanko et al., 2018) or graph convolutional neural networks (GCNN) (Kipf & Welling, 2017). While a few attempts have been made to consider more than one structural level of proteins in the network architecture (Gligorijevic et al., 2019), none of these hybrid methods incorporate all structural levels of proteins simultaneously. Instead, a common approach is to process one structural level with the network architecture and the others indirectly as input features (Baldassarre et al., 2020; Hou et al., 2018).

