INTRINSIC-EXTRINSIC CONVOLUTION AND POOLING FOR LEARNING ON 3D PROTEIN STRUCTURES

Abstract

Proteins perform a large variety of functions in living organisms and thus play a key role in biology. However, commonly used algorithms in protein learning were not specifically designed for protein data and are therefore unable to capture all relevant structural levels of a protein during learning. To fill this gap, we propose two new learning operators specifically designed to process protein structures. First, we introduce a novel convolution operator that considers the primary, secondary, and tertiary structure of a protein by using n-D convolutions defined on both the Euclidean distance and multiple geodesic distances between the atoms in a multi-graph. Second, we introduce a set of hierarchical pooling operators that enable multi-scale protein analysis. We further evaluate the accuracy of our algorithms on common downstream tasks, where we outperform state-of-the-art protein learning algorithms.

1. INTRODUCTION

Proteins perform specific biological functions essential for all living organisms and hence play a key role when investigating the most fundamental questions in the life sciences. These biomolecules are composed of one or several chains of amino acids, which fold into specific conformations to enable various biological functionalities. Proteins can be defined using a multi-level structure: The primary structure is given by the sequence of amino acids that are connected through covalent bonds and form the protein backbone. Hydrogen bonds between distant amino acids in the chain form the secondary structure, which defines substructures such as α-helices and β-sheets. The tertiary structure results from protein folding and expresses the 3D spatial arrangement of the secondary structures. Lastly, the quaternary structure is given by the interaction of multiple amino acid chains. Considering only a subset of these levels can lead to misinterpretations due to ambiguities. As shown by Alexander et al. (2009), proteins with almost identical primary structure, i.e., differing in only a few amino acids, can fold into entirely different conformations. Conversely, proteins from SH3 and OB folds have similar tertiary structures, but their primary and secondary structures differ significantly (Agrawal & Kishan, 2001) (Fig. 1). To avoid such misinterpretations, capturing the invariances with respect to primary, secondary, and tertiary structures is of key importance when studying proteins and their functions. Previously, the state of the art (SOTA) was dominated by methods based on hand-crafted features, usually extracted from multi-sequence alignment tools (Altschul et al., 1990) or annotated databases (El-Gebali et al., 2019). In recent years, these have been outperformed by protein learning algorithms in different

Figure 1: Invariances present in protein structures. Panel labels: geometry similar, topology different; geometry different, topology similar.

