LEARNING HIERARCHICAL PROTEIN REPRESENTATIONS VIA COMPLETE 3D GRAPH NETWORKS

Abstract

We consider representation learning for proteins with 3D structures. We build 3D graphs based on protein structures and develop graph networks to learn their representations. Depending on the level of detail we wish to capture, protein representations can be computed at different levels, e.g., the amino acid, backbone, or all-atom level. Importantly, there exist hierarchical relations among these levels. In this work, we propose a novel hierarchical graph network, known as ProNet, to capture these relations. ProNet is highly flexible and can compute protein representations at different levels of granularity. By treating each amino acid as a node in graph modeling and harnessing the inherent hierarchies, ProNet is more effective and efficient than existing methods. We also show that, given a base 3D graph network that is complete, our ProNet representations are complete at all levels. Experimental results show that ProNet outperforms recent methods on most datasets. In addition, the results indicate that different downstream tasks may require representations at different levels.

1. INTRODUCTION

Proteins consist of one or more amino acid chains and perform various functions by folding into 3D conformations. Learning representations of proteins with 3D structures is crucial for a wide range of tasks (Cao et al., 2021; Strokach et al., 2020; Wu et al., 2021; Yang et al., 2019; Ganea et al., 2022; Stärk et al., 2022; Morehead et al., 2022a;b; Liu et al., 2020). In machine learning, molecules, proteins, etc. are usually modeled as graphs (Liu et al., 2022; Fout et al., 2017; Jumper et al., 2021; Gao et al., 2021; Gao & Ji, 2019; Yan et al., 2022; Wang et al., 2022b; Yu et al., 2022; Xie et al., 2022a;b; Gui et al., 2022; Luo et al., 2022). With the advances of deep learning, 3D graph neural networks (GNNs) have been developed to learn from 3D graph data (Liu et al., 2022; Jumper et al., 2021; Xie & Grossman, 2018; Liu et al., 2021; Joshi et al., 2023). In this work, we build 3D graphs based on protein structures and develop 3D GNNs to learn protein representations. Depending on the level of granularity we wish to capture, we construct protein graphs at different levels, including the amino acid, backbone, and all-atom levels, as shown in Fig. 1. Specifically, each node in the constructed graphs represents an amino acid, and each amino acid possesses internal structures at different levels. Importantly, there exist hierarchical relations among the different levels. Existing methods for protein representation learning either ignore the hierarchical relations within proteins (Jing et al., 2021b; Zhang et al., 2023) or suffer from excessive computational complexity (Jing et al., 2021a; Hermosilla et al., 2021), as shown in Table 1. In this work, we propose a novel hierarchical graph network, known as ProNet, to learn protein representations at different levels. ProNet effectively captures the hierarchical relations naturally present in proteins.
By constructing representations at different levels, ProNet integrates the inherent hierarchical relations of proteins, resulting in a more principled protein learning scheme. Owing to its hierarchical design, our method achieves high efficiency even at the most complex all-atom level. In addition, completeness at all levels enables the model to generate informative and discriminative representations. Practically, ProNet offers great flexibility for different data and downstream tasks: users can easily choose the level of granularity at which the model operates based on their data and tasks. We conduct experiments on multiple downstream tasks, including protein fold and function prediction, protein-ligand binding affinity prediction, and protein-protein interaction prediction. Results show that ProNet outperforms recent methods on most datasets. We also show that different data and tasks may require representations at different levels.

2. BACKGROUND

Representation learning of small molecules with 3D structures has been studied extensively (Schütt et al., 2017; Klicpera et al., 2020; Liu et al., 2022; Wang et al., 2022a), and existing methods can fully determine the 3D structures of molecules (Wang et al., 2022a). However, representation learning of proteins with 3D structures remains challenging due to the large number of atoms and the special hierarchies naturally present in protein structures. Existing methods for protein representation learning either ignore the hierarchical relations within proteins or suffer from excessive computational complexity, as explained in Table 1. Detailed related work is discussed in Sec. 5. In this section, we first introduce the hierarchical structures of proteins in Sec. 2.1, which inspire our design of hierarchical protein representations in Sec. 3. We then introduce existing complete 3D graph networks in Sec. 2.2, which can be used to capture protein structures completely.

2.1. HIERARCHICAL PROTEIN STRUCTURES

Proteins are macromolecules consisting of one or more chains of amino acids. Each chain may contain hundreds or even thousands of amino acids. An amino acid consists of an amino (-NH2) group, a carboxyl (-COOH) group, and a side chain that is unique to each amino acid type. These functional groups are all attached to the alpha carbon (Cα) atom. The Cα atoms, together with the corresponding amino and carboxyl groups, form the backbone of a protein. As shown in Fig. 1, we can use Cα coordinates, backbone atom coordinates, or all-atom coordinates to represent protein structures, leading to three levels of representations. Note that protein structures are traditionally organized into primary, secondary, tertiary, and quaternary levels; our categorization of levels is different. Next, we show that complete 3D graph networks can fully capture protein structures at all three levels.
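As an illustration, the three levels of coordinates can be extracted from a parsed structure and used to build an amino-acid-level graph. The sketch below assumes a simple list-of-residues input (each residue a dict mapping atom names to coordinates); the atom naming convention and the 10 Å radius cutoff are illustrative assumptions, not details taken from this paper.

```python
import numpy as np

BACKBONE = {"N", "CA", "C", "O"}  # standard backbone atom names

def level_coords(residues, level):
    """Extract per-residue coordinates at a given granularity.

    residues: list of dicts mapping atom name -> (x, y, z).
    level: 'aa' (C-alpha only), 'backbone', or 'allatom'.
    Returns one coordinate array per residue (i.e., per graph node).
    """
    out = []
    for atoms in residues:
        if level == "aa":
            out.append(np.array([atoms["CA"]]))
        elif level == "backbone":
            out.append(np.array([xyz for name, xyz in atoms.items()
                                 if name in BACKBONE]))
        else:  # all-atom
            out.append(np.array(list(atoms.values())))
    return out

def radius_edges(residues, cutoff=10.0):
    """Connect amino-acid nodes whose C-alpha atoms lie within `cutoff` angstroms."""
    ca = np.array([r["CA"] for r in residues])
    dist = np.linalg.norm(ca[:, None, :] - ca[None, :, :], axis=-1)
    src, dst = np.nonzero((dist < cutoff) & (dist > 0))
    return list(zip(src.tolist(), dst.tolist()))
```

Note that the node set (one node per amino acid) is identical at all three levels; only the coordinates attached to each node change, which is what makes the hierarchical treatment natural.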



Figure 1: Illustration of hierarchical representations of proteins. Different colors indicate different types of amino acids. Filled circles are Cα atoms, and non-filled circles are other atoms. Each amino acid has inner structures at different levels. From coarse-grained to fine-grained levels, we can use Cα coordinates, backbone atom coordinates, or all-atom coordinates to represent the protein structure. Note that we treat each amino acid as a node in the graph modeling at all levels. The actual atoms are in 3D; this illustration uses 2D for simplicity.

Table 1: Comparisons of existing protein learning methods. First, treating atoms instead of amino acids as nodes leads to high complexity. Here n, N, and k denote the number of amino acids, the number of atoms, and the average node degree in a 3D protein graph, respectively, with N ≫ n. In addition, most existing methods capture only one level of protein structure, and only IEConv considers the hierarchical relations of proteins, using several pooling layers. Our method learns hierarchical representations at three levels. Lastly, our method captures 3D structures completely at all levels.
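To make the N ≫ n complexity gap concrete, the sketch below counts the messages exchanged per message-passing layer under each choice of node. The residue count, atoms-per-residue ratio, and average degree are illustrative assumptions, not figures reported in the paper.

```python
def messages_per_layer(num_nodes, avg_degree):
    """One message per directed edge per message-passing layer."""
    return num_nodes * avg_degree

# Illustrative numbers (not from the paper): a 300-residue protein,
# roughly 8 heavy atoms per residue, and average graph degree k = 30.
n, atoms_per_residue, k = 300, 8, 30
N = n * atoms_per_residue  # N >> n, as in Table 1

aa_cost = messages_per_layer(n, k)    # amino acids as nodes: O(nk)
atom_cost = messages_per_layer(N, k)  # atoms as nodes: O(Nk)
```

Under these assumptions the atom-level graph passes roughly 8x more messages per layer, which is why keeping one node per amino acid, even at the all-atom level, is the cheaper design.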

