LEARNING HARMONIC MOLECULAR REPRESENTATIONS ON RIEMANNIAN MANIFOLD

Abstract

Molecular representation learning plays a crucial role in AI-assisted drug discovery research. Encoding 3D molecular structures through Euclidean neural networks has become the prevailing method in the geometric deep learning community. However, the equivariance constraints and message passing in Euclidean space may limit the expressive power of the network. In this work, we propose a Harmonic Molecular Representation learning (HMR) framework, which represents a molecule using the Laplace-Beltrami eigenfunctions of its molecular surface. HMR offers a multi-resolution representation of molecular geometric and chemical features on a 2D Riemannian manifold. We also introduce a harmonic message passing method to realize efficient spectral message passing over the surface manifold for better molecular encoding. Our proposed method shows predictive power comparable to current models in small molecule property prediction, and outperforms the state-of-the-art deep learning models for ligand-binding protein pocket classification and the rigid protein docking challenge, demonstrating its versatility in molecular representation learning.

1. INTRODUCTION

Molecular representation learning is a fundamental step in AI-assisted drug discovery. Obtaining good molecular representations is crucial for the success of downstream applications, including protein function prediction (Gligorijević et al., 2021) and molecular matching, e.g., protein-protein docking (Ganea et al., 2021). In general, an ideal molecular representation should integrate both geometric information (e.g., 3D conformation) and chemical information (e.g., electrostatic potential). Additionally, such a representation should capture features at various resolutions to accommodate different tasks, e.g., high-level holistic features for molecular property prediction, and fine-grained features for describing whether two proteins can bind together at certain interfaces.
Recently, geometric deep learning (GDL) (Bronstein et al., 2017; 2021; Monti et al., 2017) has been widely used to learn molecular representations (Atz et al., 2021; Townshend et al., 2021). GDL captures the necessary information by performing neural message passing (NMP) on common structures such as 2D/3D molecular graphs (Klicpera et al., 2020; Stokes et al., 2020), 3D voxels (Liu et al., 2021), and point clouds (Unke et al., 2021). Specifically, GDL encodes: a) geometric features by modeling atomic positions in Euclidean space, and b) chemical features by feeding atomic information into the message passing networks. High-level features can then be obtained by aggregating these atom-level features, which has shown promising results empirically. However, we argue that learning molecular representations via NMP in Euclidean space is not necessarily the optimal solution, as it suffers from several drawbacks.
First, current GDL approaches need to employ equivariant networks (Thomas et al., 2018) to guarantee that the molecular representations transform accordingly upon rotation and translation (Fuchs et al., 2020), which could undermine the expressive power of the network (Cohen et al., 2018; Li et al., 2021). Therefore, it is desirable to develop a representation that properly encodes 3D molecular structure while bypassing the equivariance requirement. Second, current molecular representations in GDL are learned in a bottom-up manner and can hardly provide features at different resolutions for different tasks. Specifically, NMP in Euclidean space typically achieves long-range communication between distant atoms by stacking deep layers or increasing the neighborhood radius. This hinders the effective representation of macromolecules with tens of thousands of atoms (Battiston et al., 2020; Boguna et al., 2021). To remedy this, residue-level graph representations are commonly used for large molecules (Jumper et al., 2021; Gligorijević et al., 2021). Hence, designing efficient multi-resolution message passing mechanisms would be ideal for encoding molecules of distinct sizes.
On the other hand, the molecular surface is a high-level representation of a molecule's shape, which has been widely used to study inter-molecular interactions (Richards, 1977; Shulman-Peleg et al., 2004). Intuitively, the interaction between molecules is commonly described as a "key-lock pair", where both shape complementarity (Li et al., 2013) and chemical interactions (e.g., hydrogen bonds) determine whether the key matches the lock molecule. It has been shown that the molecular surface holds key information about inter-molecular interactions (Gainza et al., 2020), which makes it an ideal candidate for molecular representation learning (Sverrisson et al., 2021; Somnath et al., 2021).
Inspired by the idea of Shape-DNA (Reuter et al., 2006), we propose Harmonic Molecular Representation learning (HMR), which utilizes the Laplace-Beltrami eigenfunctions on the molecular surface (a 2D manifold). Our representation has the following advantages: a) HMR works on a 2D Riemannian manifold instead of in 3D Euclidean space, so the resulting molecular representation is roto-translation invariant by design; b) HMR represents a molecule in a top-down manner and is capable of offering multi-resolution features that accommodate various target molecules (i.e., from small molecules to large proteins), thanks to the smooth nature of the Laplace-Beltrami eigenfunctions (Fig. 1); c) HMR naturally integrates geometric and chemical features: the molecular shape defines the Riemannian manifold (i.e., the underlying domain equipped with a metric), and the atomic configurations determine the associated functions distributed on the manifold (e.g., electrostatics). To demonstrate that HMR is generally applicable to different downstream tasks, including molecular property prediction and molecular matching, we propose two specific techniques: (1) manifold harmonic message passing for realizing holistic molecular representations, and (2) learning regional functional correspondence for molecular surface matching. Without loss of generality, we apply the proposed techniques to three drug discovery-related problems: QM9 small molecule property regression, ligand-binding protein pocket classification, and rigid protein docking pose prediction. Our proposed method shows performance comparable to NMP-based models for small molecule property prediction, while outperforming the state-of-the-art deep learning models in protein pocket classification and the rigid protein docking challenge.
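The invariance claim in advantage a) can be checked numerically with a minimal sketch. The code below is an illustration under stated assumptions, not the HMR implementation: it replaces the Laplace-Beltrami operator of a triangulated molecular surface with a k-nearest-neighbor graph Laplacian on a toy point cloud, and all function names (`knn_graph_laplacian`, `spectrum`, `random_rigid_motion`) are hypothetical. Because the operator depends only on pairwise distances, its low-frequency spectrum (the quantity used as a fingerprint in Shape-DNA) should be unchanged by a rigid motion.

```python
import numpy as np

def knn_graph_laplacian(points, k=8):
    """Unnormalized kNN graph Laplacian with Gaussian edge weights.

    Assumption: this graph operator is only a rough stand-in for the
    Laplace-Beltrami operator of a triangulated molecular surface.
    """
    n = len(points)
    # Squared pairwise distances: the only geometric quantity used below.
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
    nbrs = np.argsort(d2, axis=1)[:, 1:k + 1]   # k nearest neighbors, self excluded
    rows = np.repeat(np.arange(n), k)
    cols = nbrs.ravel()
    sigma2 = d2[rows, cols].mean()               # data-driven kernel width
    W = np.zeros((n, n))
    W[rows, cols] = np.exp(-d2[rows, cols] / sigma2)
    W = np.maximum(W, W.T)                       # symmetrize the adjacency
    return np.diag(W.sum(axis=1)) - W

def spectrum(points, num_eigs=10):
    """Smallest Laplacian eigenvalues, in ascending order."""
    return np.linalg.eigvalsh(knn_graph_laplacian(points))[:num_eigs]

def random_rigid_motion(rng):
    """Random proper rotation (via QR) and a random translation."""
    q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(q) < 0:
        q[:, 0] *= -1.0                          # force det = +1
    return q, rng.normal(size=3)

rng = np.random.default_rng(0)
# Toy "surface": random points on the unit sphere (not a real molecule).
surface = rng.normal(size=(200, 3))
surface /= np.linalg.norm(surface, axis=1, keepdims=True)

rotation, shift = random_rigid_motion(rng)
moved = surface @ rotation.T + shift

spec_original = spectrum(surface)
spec_moved = spectrum(moved)
print(np.allclose(spec_original, spec_moved, atol=1e-8))
```

The eigenvalues are compared rather than the eigenvectors because eigenvectors are only defined up to sign (and up to rotation within a degenerate eigenspace), so any descriptor built from them has to account for that ambiguity.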

2. RELATED WORK

Molecular Surface Representation. The molecular surface representation is commonly adopted for tasks involving molecular interfaces (Duhovny et al., 2002), where non-covalent interactions (e.g., hydrophobic interactions) play a decisive role (Sharp, 1994). Non-Euclidean convolutional neural networks (Monti et al., 2017) and point cloud-based learning models (Sverrisson et al., 2022) have been applied to encode the molecular surface for downstream applications, e.g., protein binding site prediction (Mylonas et al., 2021). However, existing methods apply filters with fixed sizes and are highly dependent on the surface mesh quality, which limits their expressive power for molecular shape representation across different spatial scales (Somnath et al., 2021; Isert et al., 2022).



Figure 1: Multi-resolution molecular surface representation. The electrostatic potential (blue regions are negatively charged; PDB ID: 3V6F) is shown at different resolutions under our representation. See Appendix B for technical details on tuning the resolution.
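One common way to obtain a multi-resolution view like the one in Figure 1 is to keep only the first k Laplacian eigenfunctions when reconstructing a surface function. The sketch below illustrates that idea under loud assumptions: the surface is a toy periodic grid (a discretized torus) rather than a molecular mesh, the "potential" is synthetic, and the helper names `torus_laplacian` and `low_pass` are hypothetical. The procedure in Appendix B may differ from this truncation.

```python
import numpy as np

def torus_laplacian(m):
    """Graph Laplacian of an m x m periodic grid (a toy closed 2D surface)."""
    eye = np.eye(m)
    ring = 2.0 * eye - np.roll(eye, 1, axis=1) - np.roll(eye, -1, axis=1)
    return np.kron(ring, eye) + np.kron(eye, ring)

def low_pass(signal, eigvecs, k):
    """Project a surface function onto the k lowest-frequency eigenvectors."""
    basis = eigvecs[:, :k]
    return basis @ (basis.T @ signal)

m = 16
# eigh returns orthonormal eigenvectors ordered by increasing eigenvalue,
# i.e. from smooth (low frequency) to oscillatory (high frequency).
_, eigvecs = np.linalg.eigh(torus_laplacian(m))

# Synthetic "electrostatic potential": two smooth patches of opposite sign
# plus a fine checkerboard ripple that only high frequencies can capture.
u, v = np.meshgrid(np.arange(m), np.arange(m), indexing="ij")
positive = np.exp(-((u - 4) ** 2 + (v - 4) ** 2) / 8.0)
negative = np.exp(-((u - 11) ** 2 + (v - 11) ** 2) / 8.0)
ripple = 0.2 * (-1.0) ** (u + v)
potential = (positive - negative + ripple).ravel()

resolutions = (4, 32, m * m)
errors = [np.linalg.norm(potential - low_pass(potential, eigvecs, k))
          for k in resolutions]
for k, err in zip(resolutions, errors):
    print(f"k={k:3d}  reconstruction error={err:.4f}")
```

Small k keeps only the coarse charge distribution, while larger k restores finer detail; with all eigenvectors the function is recovered up to floating-point error.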


* This work was conducted during an internship at ByteDance Research.

