MOLECULAR GEOMETRY PRETRAINING WITH SE(3)-INVARIANT DENOISING DISTANCE MATCHING

Abstract

Molecular representation pretraining is critical in various applications for drug and material discovery due to the limited number of labeled molecules, and most existing work focuses on pretraining on 2D molecular graphs. However, the power of pretraining on 3D geometric structures has been less explored. This is owing to the difficulty of finding a sufficient proxy task that can empower the pretraining to effectively extract essential features from the geometric structures. Motivated by the dynamic nature of 3D molecules, where the continuous motion of a molecule in the 3D Euclidean space forms a smooth potential energy surface, we propose GeoSSL, a 3D coordinate denoising pretraining framework to model such an energy landscape. Further by leveraging an SE(3)-invariant score matching method, we propose GeoSSL-DDM in which the coordinate denoising proxy task is effectively boiled down to denoising the pairwise atomic distances in a molecule. Our comprehensive experiments confirm the effectiveness and robustness of our proposed method.

1. INTRODUCTION

Learning effective molecular representations is critical in a variety of tasks in drug and material discovery, such as molecular property prediction [14, 20, 21, 74], de novo molecular design and optimization [7, 36, 37, 39, 53, 77], and retrosynthesis and reaction planning [4, 22, 52, 64]. Recent work based on graph neural networks (GNNs) [20] has shown superior performance thanks to the simplicity and effectiveness of GNNs in modeling graph-structured data. However, the problem remains challenging due to the limited number of labeled molecules, as labeling molecules is in general expensive and time-consuming, usually requiring costly physics simulations or wet-lab experiments. As a result, there has recently been growing interest in developing pretraining or self-supervised learning methods for learning molecular representations by leveraging the huge amount of unlabeled molecule data [28, 35, 63, 75]. These methods have shown superior performance on many tasks, especially when the number of labeled molecules is insufficient. However, one limitation of these approaches is that they represent molecules as topological graphs, and molecular representations are learned by pretraining on 2D topological structures (i.e., based on the covalent bonds). Intrinsically, a more natural representation of molecules is based on their 3D geometric structures, which largely determine the corresponding physical and chemical properties. Indeed, recent works [20, 38] have empirically verified the importance of 3D geometric information for molecular property prediction tasks. Therefore, a more promising direction is to pretrain molecular representations based on their 3D geometric structures, which is the main focus of this paper. The main challenge for molecular geometric pretraining is discovering an effective proxy task that empowers the pretraining to extract essential features from the 3D geometric structures.
Our proxy task is motivated by the following observations. Studies [48] have shown that molecules are not static but in continuous motion in the 3D Euclidean space, forming a potential energy surface (PES). As shown in Figure 1, it is desirable to study the molecule at the local minima of the PES, called conformers. However, such stable conformers often come with various noises for the following reasons. First, statistical and systematic errors in conformation estimation are unavoidable [11]. Second, it is well-acknowledged that a conformer can vibrate around the local minima of the PES. These characteristics of molecular geometry motivate us to denoise the molecular coordinates around the local minima, mimicking the computational errors and conformational vibrations within the corresponding local region. The goal of denoising is to learn molecular representations that are robust to such noises and effectively capture the energy surface around the local minima.

Figure 1: Illustration of the coordinate geometry of molecules. The molecule is in continuous motion, forming a potential energy surface (PES), where each 3D coordinate (x-axis) corresponds to an energy value (y-axis). The provided molecules, i.e., conformers, are at the local minima ($g_1$). They often come with noises around the minima (e.g., statistical and systematic errors or vibrations), which can be captured using the perturbed geometry ($g_2$).

To achieve the aforementioned goal, we first introduce a general geometric self-supervised learning framework called GeoSSL. Based on this, we further propose an SE(3)-invariant denoising distance matching pretraining algorithm, GeoSSL-DDM. In a nutshell, to capture the smooth energy surface around the local minima, we aim to maximize the mutual information (MI) between a given stable geometry and its perturbed version (i.e., $g_1$ and $g_2$ in Figure 1).
In practice, it is difficult to directly maximize the mutual information between two random variables. Thus, we propose to maximize an equivalent lower bound of the above mutual information, which amounts to a pretraining framework on denoising a geometric structure, coined GeoSSL. Moreover, directly denoising such noisy coordinates remains challenging because one may need to effectively constrain the pairwise atomic distances while changing the atomic coordinates. To cope with this obstacle, we further leverage an SE(3)-invariant score matching method, GeoSSL-DDM, to transform the coordinate denoising objective into the denoising of pairwise atomic distances, which can then be effectively computed. In other words, our pretraining proxy task, namely mutual information maximization, effectively boils down to an intuitive learning objective: denoising a molecule's pairwise atomic distances. Using 22 downstream geometric molecular prediction tasks, we empirically verify that our method outperforms nine pretraining baselines. Our main contributions are summarized as follows. (1) We propose a novel geometric self-supervised learning framework, GeoSSL. To the best of our knowledge, it is the first pretraining framework focusing on pure 3D molecular data. (2) To overcome the challenge of attaining the coordinate denoising objective in GeoSSL, we propose GeoSSL-DDM, an SE(3)-invariant score matching strategy that transforms this objective into the denoising of pairwise atomic distances. (3) We empirically demonstrate the effectiveness and robustness of GeoSSL-DDM on 22 downstream tasks.

2. RELATED WORK

2.1. EQUIVARIANT GEOMETRIC MOLECULE REPRESENTATION LEARNING

Geometric representation learning. Recently, 3D geometric representation learning has been widely explored in the machine learning community, including but not limited to 3D point clouds [8, 44, 55, 67], N-body particles [45, 47], and 3D molecular conformations [6, 31, 32, 41, 50, 51, 56], amongst many others. The learned representation should satisfy physical constraints, e.g., it should be equivariant to rotation and translation in the 3D Euclidean space. Such constraints can be described using group symmetry, as introduced below.

SE(3)-invariant energy.

Constrained by the physical nature of 3D geometric data, a key principle we need to follow is to learn an SE(3)-equivariant representation function. SE(3) is the special Euclidean group consisting of rigid transformations in 3D Cartesian space, where the transformations include all combinations of translations and rotations. Namely, the learned representation should be equivariant to translations and rotations of molecular geometries. We also note that the representation function need not satisfy reflection equivariance for certain tasks involving molecular chirality [1]. For a more rigorous discussion, please refer to [17, 19, 65]. In this work, we will design an SE(3)-invariant energy (score) function on top of an SE(3)-equivariant representation backbone model.

3. PRELIMINARIES

Molecular geometry graph. Molecules can be naturally featured in a geometric formulation, i.e., all the atoms are spatially located in 3D Euclidean space. Note that covalent bonds are added heuristically by expert rules, so they are only applicable to 2D topology graphs. Besides, atoms are not static but in continuous motion along a potential energy surface [2]. The 3D structures at the local minima of this surface are called conformers, as shown in Figure 1. Conformers at such an equilibrium state possess nice properties, and we would like to model them during pretraining.

Geometric neural network. We denote each conformer as $g = (X, R)$, where $X \in \mathbb{R}^{n \times d}$ is the atom attribute matrix and $R \in \mathbb{R}^{n \times 3}$ is the atom 3D-coordinate matrix, with $n$ the number of atoms and $d$ the feature dimension. The representations for the $i$-th node and the whole molecule are $h_i = \text{GNN-3D}(T(g))_i = \text{GNN-3D}(T(X, R))_i$ and $h = \text{READOUT}(h_0, \ldots, h_{n-1})$, where $T$ is a transformation function such as atom masking, and READOUT is the readout function. In this work, we take the mean over all node representations as the readout function.

Energy-based model. An energy-based model (EBM) defines a distribution as $p(x) \propto \exp(-E(x))$, where the normalizing constant (partition function) integrates over the whole data space. The computation of such a probability is intractable due to the high cardinality of the data space. Recently, great progress has been made in solving this intractable estimation problem, including contrastive divergence [12], noise contrastive estimation [25], and score matching (SM) [30, 60, 61]. For example, SM first introduces the score, the gradient of the log-likelihood with respect to the data, and then matches the model score with the data score using the Fisher divergence. This approach has been further improved by combining SM with denoising auto-encoding, forming the promising denoising score matching (DSM) strategy [69]. In this work, we will explore the potential of leveraging DSM for molecular geometry representation learning.
We aim to utilize pairwise distance information, one of the most fundamental factors in the geometric molecule data.


Problem setup. Our goal here is to apply a self-supervised pretraining algorithm on a large molecular geometric dataset and adapt the pretrained representation for fine-tuning on geometric downstream tasks. For both the pretraining and downstream tasks, only the 3D geometric information is available, and our solution is agnostic in terms of the backbone geometric neural network.

4. METHOD

This section first introduces the GeoSSL framework and then proposes the GeoSSL-DDM algorithm. We start with exploring the coordinate perturbation for molecular data in Section 4.1. Then we introduce a coordinate-aware mutual information (MI) maximization formula and turn it into a coordinate denoising framework in Section 4.2. Nevertheless, the coordinate denoising is non-trivial since it requires geometric data reconstruction, and we adopt the score matching for estimation, as proposed in Section 4.3. The ultimate training objective is discussed in Section 4.4.

4.1. COORDINATE PERTURBATION FOR GEOMETRIC DATA

The mainstream self-supervised learning community designs pretraining tasks by defining multiple views of the data, where these views share common information to some degree. By designing generative or contrastive tasks to maximize the mutual information (MI) between these views, the pretrained representation can encode certain key information, making it more robust and more generalizable to downstream tasks. In our work, we propose GeoSSL-DDM, an SE(3)-invariant self-supervised learning (SSL) method for molecular geometric representation learning. The 3D geometric information, i.e., the atomic coordinates, is critical to molecular properties; we carry out an ablation study to verify this in Appendix B. Based on this observation, we introduce a geometry perturbation, which adds small noises to the atom coordinates. For notation, following Section 3, we define the original geometry graph and an augmented geometry graph as two views, denoted as $g_1 = (X_1, R_1)$ and $g_2 = (X_2, R_2)$, respectively. The augmented geometry graph can be seen as a coordinate perturbation of the original graph with the same atom types, i.e., $X_2 = X_1$ and $R_2 = R_1 + \epsilon$, where $\epsilon$ is drawn from a normal distribution.
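The view construction above can be sketched in a few lines of plain Python; the noise scale and toy geometry below are illustrative, not the values used in the paper.

```python
import random

def perturb_coordinates(R1, sigma=0.1, seed=None):
    """Build the second view: R2 = R1 + eps, eps ~ N(0, sigma^2) per coordinate.

    R1 is a list of (x, y, z) tuples; atom types X are shared by the two
    views, so only the coordinates are perturbed.
    """
    rng = random.Random(seed)
    return [tuple(c + rng.gauss(0.0, sigma) for c in atom) for atom in R1]

# A toy 3-atom geometry (coordinates in angstroms) and its perturbed view.
R1 = [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (-0.24, 0.93, 0.0)]
R2 = perturb_coordinates(R1, sigma=0.1, seed=0)
```

Since only $R$ changes and $X$ is reused verbatim, the two views trivially satisfy $X_2 = X_1$ while differing in geometry.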

4.2. COORDINATE DENOISING WITH MI MAXIMIZATION FRAMEWORK: GEOSSL

The two views defined above share certain common information. By maximizing the mutual information (MI) between them, we expect the learned representation to better capture the geometric information, be robust to noises, and thus generalize well to downstream tasks. To maximize the MI, we instead maximize the following lower bound on the two geometry views, leading to the geometric self-supervised learning framework GeoSSL:

$$\mathcal{L}_{\text{GeoSSL}} \triangleq \frac{1}{2} \mathbb{E}_{p(g_1, g_2)} \big[ \log p(g_1|g_2) + \log p(g_2|g_1) \big]. \quad (2)$$

In Equation (2), we transform the MI maximization problem into maximizing the sum of two conditional log-likelihoods. In addition, these two conditional log-likelihoods are in mirroring directions, and such symmetry reveals certain nice properties: it highlights the equal importance and uncertainty of the two views and can lead to a more robust representation of the geometry. To solve Equation (2), we adopt the energy-based model (EBM) for estimation. EBM has been acknowledged as a flexible framework for modeling distributions over highly-structured data, like molecules [27, 34]. Adapting it to GeoSSL, the objective becomes:

$$\mathcal{L}_{\text{GeoSSL-EBM}} = \frac{1}{2} \mathbb{E}_{p(g_1, g_2)} \big[ \log p(R_1|g_2) \big] + \frac{1}{2} \mathbb{E}_{p(g_1, g_2)} \big[ \log p(R_2|g_1) \big] = \frac{1}{2} \mathbb{E}_{p(g_1, g_2)} \Big[ \log \frac{\exp(f(R_1, g_2))}{A_{R_1|g_2}} \Big] + \frac{1}{2} \mathbb{E}_{p(g_2, g_1)} \Big[ \log \frac{\exp(f(R_2, g_1))}{A_{R_2|g_1}} \Big], \quad (3)$$

where $f(\cdot)$ is the negative energy function, and $A_{R_1|g_2}$ and $A_{R_2|g_1}$ are the intractable partition functions. The first equality in Equation (3) holds because the two views share the same atom types. This equation can be treated as denoising the atom coordinates of one view from the geometry of the other view. In the following, we explore how to use score matching to solve the above EBM estimation problem, and further transform the coordinate-aware GeoSSL into denoising distance matching as the final objective.
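Why the symmetric conditional log-likelihood objective maximizes the MI can be seen from a standard identity; this short derivation is added for completeness and is not spelled out in the text.

```latex
% I(g_1; g_2) written in both conditional directions:
I(g_1; g_2) = \mathbb{E}_{p(g_1, g_2)}\!\left[\log \frac{p(g_1, g_2)}{p(g_1)\,p(g_2)}\right]
            = \mathbb{E}\!\left[\log p(g_1 \mid g_2)\right] + H(g_1)
            = \mathbb{E}\!\left[\log p(g_2 \mid g_1)\right] + H(g_2).

% Averaging the two forms recovers the GeoSSL objective plus constants:
I(g_1; g_2) = \underbrace{\tfrac{1}{2}\,\mathbb{E}_{p(g_1, g_2)}
              \big[\log p(g_1 \mid g_2) + \log p(g_2 \mid g_1)\big]}_{\mathcal{L}_{\text{GeoSSL}}}
            + \tfrac{1}{2}\big(H(g_1) + H(g_2)\big).
```

Since the entropies $H(g_1)$ and $H(g_2)$ are fixed by the data distribution and do not depend on model parameters, maximizing $\mathcal{L}_{\text{GeoSSL}}$ maximizes the MI; when the entropies are nonnegative, $\mathcal{L}_{\text{GeoSSL}}$ is also a lower bound of the MI.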

4.3. FROM COORDINATE DENOISING TO DISTANCE DENOISING: GEOSSL-DDM

Before going into details, we first briefly discuss denoising score matching (DSM). DSM has three main advantages that inspire us to apply it to solving the coordinate-aware GeoSSL. (1) The DSM solution has a nice formulation, such that the final objective function can be simplified with an intuitive explanation: GeoSSL-DDM can be seen as denoising pairwise distances at multiple noise levels. (2) The score defined on geometric data can be viewed as a coordinate-based pseudo-force. Such a pseudo-force can play an important role in the corresponding geometric representation learning. (3) In terms of MI maximization, existing methods like InfoNCE [68], EBM-NCE, and Representation Reconstruction [38] map the data to the representation space for either inter-data contrastive learning or intra-data reconstruction. This operation avoids the decoder design issue for highly-structured data [13], yet the trade-off is losing the data-inherent information to a certain degree. In other words, a data-level reconstruction task (e.g., DSM) is expected to lead to a more robust representation. Considering the above points, we adopt DSM in our framework and propose GeoSSL-DDM. We expect it to learn an expressive geometric representation function by solving the coordinate-aware GeoSSL. Additionally, the two terms in Equation (3) are in mirroring directions. Thus, in what follows, we adopt a proxy task that handles the two directions separately, and we take one direction, $\log \frac{\exp(f(R_1, g_2))}{A_{R_1|g_2}}$, for illustration.

4.3.1. DENOISING DISTANCE MATCHING

Score. The score is defined as the gradient of the log-likelihood w.r.t. the data, i.e., the atom coordinates in our case. Because the partition function is a constant with respect to the data, it disappears during the score calculation. Adapted to our setting, the score is the gradient of the negative energy function w.r.t. the atom coordinates:

$$s(R_1, g_2) \triangleq \nabla_{R_1} \log p(R_1|g_2) = \nabla_{R_1} f(R_1, g_2). \quad (4)$$

If we assume that the learned optimal energy function $f(\cdot)$ possesses certain physical or chemical information, then the score in Equation (4) can be viewed as a special form of pseudo-force. This may require more domain-specific knowledge, which we leave for future exploration.

Score decomposition: from coordinates to distances. Through back-propagation [54], the score on atom coordinates can be further decomposed into scores attached to pairwise distances:

$$s(R_1, g_2)_i = \sum_{j \neq i} \frac{\partial f(R_1, g_2)}{\partial d_{1,ij}} \cdot \frac{\partial d_{1,ij}}{\partial r_{1,i}} = \sum_{j \neq i} \frac{1}{d_{1,ij}} \cdot s(d_1, g_2)_{ij} \cdot (r_{1,i} - r_{1,j}), \quad (5)$$

where $r_{1,i}$ is the $i$-th coordinate in $g_1$, $d_{1,ij}$ denotes the pairwise distance between the $i$-th and $j$-th nodes in $g_1$, and $s(d_1, g_2)_{ij} \triangleq \frac{\partial f(R_1, g_2)}{\partial d_{1,ij}}$. This decomposition has a nice intuition from the pseudo-force perspective: the pseudo-force on each atom can be decomposed as the sum of pseudo-forces attached to the pairwise distances between this atom and all its neighbors. Note that here the pairwise atoms are connected in the 3D Euclidean space, not by covalent bonds.

Denoising distance matching (DDM). We then adapt denoising score matching (DSM) [69] to our task. Concretely, we take the Gaussian kernel as the perturbation noise distribution on each pairwise distance, i.e., $q_\sigma(\tilde{d}_1|g_2) = \mathbb{E}_{p_{\text{data}}(d_1|g_2)}[q_\sigma(\tilde{d}_1|d_1)]$, where $\sigma$ is the standard deviation of the Gaussian perturbation.
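The chain-rule factor $\partial d_{ij}/\partial r_i = (r_i - r_j)/d_{ij}$ driving the decomposition above can be verified numerically; this is a sanity-check sketch with arbitrary example points, not part of the method itself.

```python
import math

def dist(ri, rj):
    """Euclidean distance between two 3D points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(ri, rj)))

def analytic_grad(ri, rj):
    """Closed form: d(d_ij)/d(r_i) = (r_i - r_j) / d_ij."""
    d = dist(ri, rj)
    return tuple((a - b) / d for a, b in zip(ri, rj))

def numeric_grad(ri, rj, eps=1e-6):
    """Central finite differences of d_ij w.r.t. each coordinate of r_i."""
    grad = []
    for k in range(3):
        rp, rm = list(ri), list(ri)
        rp[k] += eps
        rm[k] -= eps
        grad.append((dist(rp, rj) - dist(rm, rj)) / (2 * eps))
    return tuple(grad)

ri, rj = (1.0, 2.0, 0.5), (0.0, -1.0, 2.0)
g_analytic = analytic_grad(ri, rj)
g_numeric = numeric_grad(ri, rj)
```

The gradient is the unit vector pointing from $r_j$ to $r_i$, which is exactly the $(r_{1,i} - r_{1,j})/d_{1,ij}$ factor in Equation (5).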
One main advantage of using the Gaussian kernel is that the gradient of the conditional log-likelihood has a closed form: $\nabla_{\tilde{d}_1} \log q_\sigma(\tilde{d}_1|d_1, g_2) = (d_1 - \tilde{d}_1)/\sigma^2$, and the DSM objective trains a score network to match it. This trick was first introduced in [69] and has been widely utilized in deep generative modeling tasks [58, 59]. Adapted to our setting, this says that we want to train a score network $s_\theta(\tilde{d}_1, g_2)$ to match the distance perturbation, or, from the pseudo-force perspective, to match the pseudo-force attached to the pairwise distances. Taking the Fisher divergence as the discrepancy metric and using the trick above, the estimation objective simplifies to:

$$D_F\big(q_\sigma(\tilde{d}_1|g_2) \,\|\, p_\theta(\tilde{d}_1|g_2)\big) = \frac{1}{2} \mathbb{E}_{p_{\text{data}}(d_1|g_2)} \mathbb{E}_{q_\sigma(\tilde{d}_1|d_1, g_2)} \Big[ \big\| s_\theta(\tilde{d}_1, g_2) - \frac{d_1 - \tilde{d}_1}{\sigma^2} \big\|^2 \Big] + C. \quad (6)$$

For more detailed derivations, please refer to Appendix C. In this section, we have turned the coordinate-aware GeoSSL framework into a distance perturbation matching problem, which is equivalent to denoising distance matching, i.e., GeoSSL-DDM. The corresponding pipeline is illustrated in Figure 2.
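The closed-form regression target above is simple to compute. Below is a minimal sketch that samples perturbed distances and the targets a score network would be trained to match; the distance value and noise scale are illustrative, and the "perfect" score function is only a stand-in showing what zero loss looks like.

```python
import random

def dsm_targets(d, sigma, n_samples=4, seed=0):
    """Sample perturbed distances d~ ~ N(d, sigma^2) and the closed-form
    DSM targets (d - d~) / sigma^2 that the score network should match."""
    rng = random.Random(seed)
    out = []
    for _ in range(n_samples):
        d_tilde = d + rng.gauss(0.0, sigma)
        target = (d - d_tilde) / sigma ** 2
        out.append((d_tilde, target))
    return out

def dsm_loss(score_fn, d, sigma, n_samples=64, seed=0):
    """Monte-Carlo estimate of the squared-error DSM objective for one distance."""
    pairs = dsm_targets(d, sigma, n_samples, seed)
    return sum((score_fn(dt) - t) ** 2 for dt, t in pairs) / len(pairs)

# A "perfect" score function achieves zero loss by construction.
perfect = lambda d_tilde, d=1.5, sigma=0.2: (d - d_tilde) / sigma ** 2
loss = dsm_loss(perfect, d=1.5, sigma=0.2)
```

In practice the score network only sees the perturbed distance (and the other view), never the clean $d_1$, which is what makes matching this target a denoising task.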

4.3.2. SE(3)-INVARIANT SCORE NETWORK MODELING

The objective function in Equation (6) is essentially performing distance denoising. Since the distance is a type-0 feature [65], we simply design an SE(3)-invariant score network $s_\theta(\cdot)$. For modeling $h(\cdot)$, we take an SE(3)-equivariant 3D geometric graph neural network as the representation backbone. Following the notations in Section 3 and taking the modeling of $g_2$ as an example, we have

$$h(g_2)_i = \text{3D-GNN}(T(g_2))_i, \quad h(g_2)_{ij} = h(g_2)_i + h(g_2)_j, \quad (7)$$

for the atom-level and atom pairwise-level representations. We then define the score network as:

$$s_\theta(\tilde{d}_1, g_2)_{ij} = \text{MLP}\big(\text{MLP}(\tilde{d}_{1,ij}) \oplus h(g_2)_{ij}\big), \quad (8)$$

where $\oplus$ is concatenation and MLP is a multi-layer perceptron. GeoSSL-DDM is agnostic to the backbone geometric representation function, and its main module is the score network in Equation (8). Thus, GeoSSL-DDM is an SE(3)-invariant [19] pretraining algorithm. Meanwhile, the type-0 distance could be modeled in a more expressive SE(3)-equivariant manner, which we leave for future work.
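A minimal pure-Python sketch of the score network in Equation (8); the layer sizes, tanh activations, and random initialization are placeholders for illustration, not the paper's actual architecture, and the atom representations stand in for 3D-GNN outputs.

```python
import math
import random

def mlp(x, layers):
    """Tiny fully-connected net; tanh on hidden layers, linear output."""
    for li, (W, b) in enumerate(layers):
        x = [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]
        if li < len(layers) - 1:
            x = [math.tanh(v) for v in x]
    return x

def make_mlp(sizes, rng):
    """Random weight matrices for consecutive layer sizes (illustrative init)."""
    return [([[rng.uniform(-0.5, 0.5) for _ in range(m)] for _ in range(n)],
             [0.0] * n)
            for m, n in zip(sizes, sizes[1:])]

def score_net(d_tilde_ij, h_i, h_j, params):
    """s_theta(d~, g2)_ij = MLP(MLP(d~_ij) (+) (h_i + h_j)), cf. Equation (8)."""
    emb_layers, head_layers = params
    emb = mlp([d_tilde_ij], emb_layers)        # distance embedding
    h_ij = [a + b for a, b in zip(h_i, h_j)]   # pairwise representation, Eq. (7)
    return mlp(emb + h_ij, head_layers)[0]     # concatenation, then scalar score

rng = random.Random(0)
params = (make_mlp([1, 4], rng), make_mlp([8, 8, 1], rng))
h_i, h_j = [0.1, -0.2, 0.3, 0.0], [0.2, 0.1, -0.1, 0.4]
s = score_net(1.2, h_i, h_j, params)
```

Note that the sum $h_i + h_j$ makes the pairwise representation symmetric in $i$ and $j$, so the score of a distance does not depend on atom ordering; the SE(3) invariance comes from using only the scalar distance and invariant features.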

4.4. ULTIMATE OBJECTIVE

With the above score network, we can formulate the ultimate objective function. We adopt the following four training tricks from [38, 58, 59] to stabilize score matching training. (1) We carry out the distance denoising at $L$ levels of noise. (2) We add a weighting coefficient $\lambda(\sigma) = \sigma^\beta$ for each noise level, where $\beta$ acts as the annealing factor. (3) We scale the score network by a factor of $1/\sigma$. (4) We sample the same atoms from the two geometry views with a masking ratio $r$. Ultimately, the objective function for GeoSSL-DDM is:

$$\mathcal{L}_{\text{GeoSSL-DDM}} = \frac{1}{2L} \sum_{l=1}^{L} \sigma_l^\beta \, \mathbb{E}_{p_{\text{data}}(d_1|g_2)} \mathbb{E}_{q(\tilde{d}_1|d_1, g_2)} \Big\| \frac{s_\theta(\tilde{d}_1, g_2)}{\sigma_l} - \frac{d_1 - \tilde{d}_1}{\sigma_l^2} \Big\|_2^2 + \frac{1}{2L} \sum_{l=1}^{L} \sigma_l^\beta \, \mathbb{E}_{p_{\text{data}}(d_2|g_1)} \mathbb{E}_{q(\tilde{d}_2|d_2, g_1)} \Big\| \frac{s_\theta(\tilde{d}_2, g_1)}{\sigma_l} - \frac{d_2 - \tilde{d}_2}{\sigma_l^2} \Big\|_2^2. \quad (9)$$

Algorithm 1 GeoSSL-DDM pretraining
1: Input: A 3D geometry dataset and L levels of Gaussian noise.
2: Output: A pretrained 3D representation function $h(\cdot)$.
3: for each 3D geometry graph $g_1$ do
4:   Obtain $g_2$ by adding Gaussian noise to the atom coordinates in $g_1$.
5:   for each noise level $l \in \{1, \ldots, L\}$ do
6:     Perturb the pairwise distances: $\tilde{d}_1 = d_1 + \epsilon$, $\tilde{d}_2 = d_2 + \epsilon$, with $\epsilon \sim \mathcal{N}(0, \sigma_l^2)$.
7:     Compute the scores $s_\theta(\tilde{d}_1, g_2)$ and $s_\theta(\tilde{d}_2, g_1)$ with Equation (8).
8:   end for
9:   Update the 3D GNN representation function $h(\cdot)$ using Equation (9).
10: end for

The algorithm is in Algorithm 1. Comparison with score matching in generative modeling. We note that score matching has been widely used for generative modeling tasks. One of the main drawbacks in the generative setting is the long mixing time for MCMC sampling. However, our work aims at representation learning, so such a sampling issue will not affect our task. We further note that there also exists a series of works exploring the score matching for conformation generation [54] . However, their scores or pseudo-forces are attached to the 2D topology (the covalent bonds), while our work is for the pure geometric data and is attached to the pairwise distances defined in the 3D Euclidean space.
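Algorithm 1 can be condensed into a short pure-Python sketch. The `score_fn` placeholder stands in for the score network of Equation (8) (here it ignores the conditioning view), the coordinate noise scale and $\beta$ are illustrative, and only the distance-level bookkeeping of the loss is shown, not the 3D-GNN or its parameter update.

```python
import math
import random

def pairwise_distances(R):
    """All pairwise Euclidean distances of a 3D geometry (upper triangle)."""
    n = len(R)
    return [math.dist(R[i], R[j]) for i in range(n) for j in range(i + 1, n)]

def geossl_ddm_loss(R1, sigmas, score_fn, beta=2.0, seed=0):
    """Loss of Equation (9) for a single geometry, both mirroring directions."""
    rng = random.Random(seed)
    # Step 4: perturb coordinates to obtain the second view g2.
    R2 = [tuple(c + rng.gauss(0.0, 0.1) for c in atom) for atom in R1]
    d1, d2 = pairwise_distances(R1), pairwise_distances(R2)
    loss = 0.0
    for sigma in sigmas:                    # steps 5-8: loop over noise levels
        for dists in (d1, d2):              # the two denoising directions
            for d in dists:
                d_tilde = d + rng.gauss(0.0, sigma)   # step 6
                target = (d - d_tilde) / sigma ** 2
                s = score_fn(d_tilde) / sigma         # 1/sigma scaling trick
                loss += sigma ** beta * (s - target) ** 2
    return loss / (2 * len(sigmas))

# Toy run: a water-like 3-atom geometry, two noise levels, a zero score network.
R1 = [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (-0.24, 0.93, 0.0)]
loss = geossl_ddm_loss(R1, sigmas=[0.5, 0.1], score_fn=lambda d: 0.0)
```

A trained score function would drive the weighted squared error toward zero, whereas the zero-score placeholder leaves the full magnitude of the targets as loss.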

5. EXPERIMENTS

In this section, we compare our method with nine 3D geometric pretraining baselines, including one randomly initialized, one supervised, and seven self-supervised approaches. For the downstream tasks, we adopt 22 tasks covering quantum mechanics prediction, force prediction, and binding affinity prediction. We provide all the experiment details and ablation studies in Appendix D.

5.1. BACKBONE MODELS

Our proposed GeoSSL-DDM is model-agnostic, and here we evaluate our method using one of the state-of-the-art geometric graph neural networks, PaiNN [51]. We carry out the exact same experiments on another backbone model, SchNet [49], and present the results in Appendix D. PaiNN [51] is a follow-up work of SchNet [49]: it addresses the limitation of rotational invariance in SchNet by embracing rotational equivariance, attaining a more expressive 3D geometric model.

Other backbone models. First, we want to highlight that what we propose is a general solution that is agnostic to the backbone 3D geometric model. In addition to PaiNN, we acknowledge that there have recently been several works along this research line, including but not limited to [6, 17, 32, 41, 47, 56]. Yet, they may require large computational resources and may be infeasible (e.g., out of GPU memory) in our setting. Our choice of backbone balances model performance, computational efficiency, and memory cost. For more benchmark results and detailed comparisons of the 3D geometric models, please check Appendix A.

5.2. BASELINES AND PRETRAINING DATASET

Pretraining dataset. The PubChemQC database is a large-scale database of around 4M molecules with 3D geometries, where both the ground-state and excited-state 3D geometries are calculated using DFT (density functional theory). Due to the high computational cost, only several thousand molecules can be processed per day, and this dataset took years of effort in total. Following this, Molecule3D [73] takes the ground-state geometries from PubChemQC and transforms the data into a deep learning-friendly format. It also parses essential quantum properties for each molecule, including the energies of the highest occupied molecular orbital (HOMO) and the lowest unoccupied molecular orbital (LUMO), the HOMO-LUMO energy gap, and the total energy. For our molecular geometry pretraining, we take a subset of 1M molecules with 3D geometries from Molecule3D.

Self-supervised learning pretraining baselines. We first consider four coordinate-MI-unaware SSL methods: (1) Type Prediction predicts the atom type of masked atoms; (2) Distance Prediction predicts the pairwise distances among atoms; (3) Angle Prediction predicts the angle among triplets of atoms, i.e., bond angle prediction; (4) 3D InfoGraph adopts the contrastive learning paradigm by treating a node-graph pair from the same molecular geometry as positive and others as negative. Next, following the coordinate-aware GeoSSL framework introduced in Equation (2), we include two contrastive and one generative SSL baselines. (5) GeoSSL-InfoNCE [68] and (6) GeoSSL-EBM-NCE [38] are two widely-used contrastive learning objectives, whose goal is to align the positive views and contrast the negative views simultaneously. Finally, (7) GeoSSL-RR (RR for Representation Reconstruction) [38] is a generative SSL method that serves as a proxy for maximizing the MI.
RR is a more general form of non-contrastive SSL methods like BYOL [23] and SimSiam [9], and its goal is to reconstruct each view from its counterpart in the representation space. Following this taxonomy, our proposed GeoSSL-DDM can be classified as generative SSL for distance denoising.

Supervised pretraining baseline. We also compare our method with a supervised pretraining baseline. As aforementioned, the large-scale pretraining dataset uses DFT to calculate the energies and extracts the most stable conformers with the lowest energies, which reveal the most fundamental properties of molecules in the 3D Euclidean space. Thus, such energies can naturally be adopted as supervised signals, and we take this as a supervised pretraining baseline.

5.3. DOWNSTREAM TASKS ON QUANTUM MECHANICS AND FORCE PREDICTION

QM9 [46] is a dataset of 134K molecules with up to 9 heavy atoms. It includes 12 tasks related to quantum properties. For example, U0 and U298 are the internal energies at 0 K and 298.15 K respectively, and H298 and G298 are the enthalpy and free energy at 298.15 K. The other 8 tasks are quantum mechanics properties related to the DFT process. MD17 [10] is a dataset on molecular dynamics simulation. It includes eight tasks, corresponding to eight organic molecules, and each task includes the molecule's positions along the potential energy surface (PES), as shown in Figure 1. The goal is to predict the energy-conserving interatomic forces for each atom at each position. We follow the literature [31, 41, 50, 51] in using 1K samples for training and 1K for validation, while the test set (from 48K to 991K) is much larger. The results on QM9 and MD17 are displayed in Tables 1 and 2, respectively. We observe that most of the pretraining baselines perform on par with or even worse than the randomly-initialized baseline. The top-performing baseline is the representation reconstruction method (RR), which optimizes the coordinate-aware MI; it outperforms the other baselines on 5 of the 12 tasks in QM9 and 6 of the 8 tasks in MD17. This implies the potential of applying generative SSL for maximizing this coordinate-aware MI. Promisingly, our proposed GeoSSL-DDM achieves consistently improved performance on all 12 tasks in QM9 and all 8 tasks in MD17. All these observations empirically verify the effectiveness of the distance denoising in GeoSSL-DDM, which models the most determinant factor in molecular geometric data.

5.4. DOWNSTREAM TASKS ON BINDING AFFINITY PREDICTION

Atom3D [66] is a recently published benchmark. It gathers several core tasks for 3D molecules, including binding affinity. Binding affinity prediction measures the strength of the binding interaction between a small molecule and a target protein. Here we model both the small molecule and the protein with their provided 3D atom coordinates. We follow Atom3D for data preprocessing and splitting; for more detailed discussions and statistics, please check Appendix D. During the binding process, a cavity in a protein that can potentially possess suitable properties for binding a small molecule (ligand) is termed a pocket [62]. Because of the large volume of the protein, we follow [66] in taking only the binding pocket, such that there are no more than 600 atoms for each molecule-protein pair. Concretely, we consider two binding affinity tasks. (1) The first task is ligand binding affinity (LBA). It is gathered from [70], and the task is to predict the binding affinity strength between a small molecule and a protein pocket. (2) The second task is ligand efficacy prediction (LEP). Given a molecule bound to two pockets, the goal is to detect whether the molecule has a higher binding affinity with one pocket than the other. Results in Table 3 illustrate that, for the LBA task, two pretraining baselines fail to generalize (the loss diverges), and none of the other pretraining baselines beats the randomly initialized baseline. For the LEP task, the supervised and the two contrastive learning pretraining baselines stand out on both the ROC and PR metrics. Meaningfully, for both tasks, GeoSSL-DDM achieves promising improvements, revealing that modeling the local region around a conformer with distance denoising can also benefit binding affinity downstream tasks.

5.5. DISCUSSION: CONNECTION WITH MULTI-TASK PRETRAINING

In the above experiments, we test multiple self-supervised and supervised pretraining tasks separately. Yet, these pretraining methods are not contradictory but could be complementary. Existing work has successfully shown the effect of combining them in various ways. For example, [28] shows that jointly doing supervised and self-supervised pretraining can augment the pretrained representation. [38, 57] show that contrastive and generative SSL pretraining methods can be learned simultaneously as multi-task pretraining. In addition, in terms of molecule-specific pretraining, [38] empirically verifies that the 2D topology and 3D geometry views share certain information, and that maximizing their mutual information together with 2D topology SSL is beneficial for pretraining. With these insights, we believe all of these directions are worth exploring in the future, especially along the line of pretraining for molecular geometry, since pretraining datasets often come with multiple quantum properties and the 2D molecular topology can be obtained heuristically. Yet, as a first step exploring self-supervised learning using only 3D geometric data (i.e., without covalent bonds), our study leaves multi-task pretraining for future exploration.

6. CONCLUSIONS AND FUTURE DIRECTIONS

We proposed a novel coordinate denoising method, coined GeoSSL-DDM, for molecular geometry pretraining. GeoSSL-DDM leverages an SE(3)-invariant score matching strategy, under the GeoSSL framework, to decompose its coordinate denoising objective into the denoising of pairwise atomic distances in a molecule, which can then be effectively computed and directly targets the determinant factors in molecular geometric data. We empirically verified the effectiveness and robustness of our method, showing its superior performance over nine state-of-the-art pretraining baselines on 22 benchmarking geometric molecular property prediction and binding affinity tasks. Our work opens up avenues for multiple promising directions. First, from the machine learning perspective, we propose a general pipeline for using EBMs for MI maximization in geometric data pretraining. There are further explorations on the success of EBMs, like GFlowNet [3], and it would be interesting to explore how to combine them with molecular geometric data along this systematic path. In addition, GeoSSL-DDM does not utilize the 2D structure (i.e., covalent bonds), and it would be desirable to consider how to combine the distance denoising with 2D topology information. In terms of applications, GeoSSL-DDM is a general framework and can naturally be applied to other geometric data, such as point clouds and protein pretraining. Finally, our current goal is to perform denoising in a local region, yet it would be interesting to explore larger regions. From this aspect, the denoising can be viewed as recovering a molecular dynamics trajectory, and we would explore how generalizable such a pretrained representation is to downstream tasks.

A BENCHMARKS AND RELATED WORK

A.1 GEOMETRIC NEURAL NETWORKS

Recently, geometric neural networks have been actively proposed, including SchNet [50], TFN [17], DimeNet++ [31], SE(3)-Trans [17], EGNN [47], SEGNN [6], SphereNet [41], SpinConv [56], PaiNN [51], and GemNet [32]. We reproduce most of them on the QM9 dataset, as shown in Table 4. Among these, we would like to highlight two models: SchNet and PaiNN. SchNet [49] is composed of the following key steps:

z_i^{(0)} = embedding(x_i),   z_i^{(t)} = MLP( Σ_{j=1}^{n} f(x_j^{(t-1)}, r_i, r_j) ),   h_i = MLP(z_i^{(K)}),

where K is the number of hidden layers, and

f(x_j, r_i, r_j) = x_j ⊙ e_k(r_i − r_j) = x_j ⊙ exp(−γ (∥r_i − r_j∥_2 − µ_k)^2)

is the continuous-filter convolution layer, enabling the modeling of the continuous coordinates of atoms. PaiNN [51] builds on this message-passing scheme with equivariant vector features. During benchmarking, we make the following observations:
• The performance on QM9 is very robust to either using (1) 110K for training, 10K for validation, and 10,831 for test, or (2) 100K for training, 13,083 for validation, and 17,748 for test.
• The optimization, especially the learning rate scheduler, is very critical. During the benchmarking, we find that the cosine annealing learning rate schedule [43] is generally the most robust.

For a more detailed discussion on QM9, please refer to Appendix D. We show the benchmark results on QM9 in Table 4.
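To make the continuous-filter convolution concrete, here is a minimal NumPy sketch of the idea. The helper names (`rbf_expand`, `cfconv`) and the single linear map `W` standing in for SchNet's filter-generating MLP are our simplifications, not the actual implementation:

```python
import numpy as np

def rbf_expand(d, n_rbf=16, cutoff=5.0, gamma=10.0):
    """Expand scalar distances on a grid of Gaussian radial basis functions,
    mirroring e_k(r_i - r_j) = exp(-gamma * (||r_i - r_j||_2 - mu_k)^2)."""
    mu = np.linspace(0.0, cutoff, n_rbf)              # centers mu_k
    return np.exp(-gamma * (d[..., None] - mu) ** 2)  # (..., n_rbf)

def cfconv(x, coords, W):
    """One simplified continuous-filter convolution step:
    z_i = sum_j x_j * filter(||r_i - r_j||), where the filter is a function
    of the RBF-expanded distance. W: (n_rbf, feat) stands in for the MLP."""
    diff = coords[:, None, :] - coords[None, :, :]    # (n, n, 3)
    dist = np.linalg.norm(diff, axis=-1)              # (n, n)
    filt = rbf_expand(dist) @ W                       # (n, n, feat)
    return np.einsum('jf,ijf->if', x, filt)           # aggregate over j
```

Because the filter depends only on pairwise distances, the output is invariant to rigid translations of the coordinates, which is the property the paper relies on.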

B AN EXAMPLE ON THE IMPORTANCE OF ATOM COORDINATES

First, it has been widely acknowledged [15] that atom positions, or molecule shapes, are important factors for quantum properties. Here we present a simple empirical example to verify this. The goal is to make predictions on 12 quantum properties in QM9. The molecular geometric data includes two main components as input features: the atom types and the atom coordinates. Other key information can be inferred accordingly, including the pairwise distances and torsion angles. We corrupt each of the two components to empirically test their importance.
• Atom type corruption. There are in total 118 atom types, and the standard embedding option is to apply one-hot encoding. In the corruption case, we replace all the atom types with a hold-out index, i.e., index 119.
• Atom coordinate corruption. Originally QM9 includes atom coordinates at the equilibrium state; we replace them with coordinates generated by MMFF [26] from RDKit [33].
We take SchNet and PaiNN as the backbone 3D GNN models, and the results are in Table 5. We can observe that (1) both corruptions lead to a performance decrease, and (2) the atom coordinate corruption leads to a more severe performance decrease than the atom type corruption. Put another way, when we corrupt the atom types with the same hold-out type, it is equivalent to removing the atom type information. This setting can thus be viewed as using the equilibrium atom coordinates alone, and the property prediction remains comparatively robust. This observation is also supported from the domain perspective: according to valence bond theory, the atom type information can be implicitly and roughly inferred from the atom coordinates. Combining all the above observations and analysis, one can conclude that, for molecular geometric data, the atom coordinates reveal more fundamental information for representation learning.
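The two corruptions above can be sketched as follows. The atom-type corruption is exactly as described; for the coordinate corruption we substitute a simple Gaussian jitter as a stand-in, since the paper's actual procedure uses MMFF conformers from RDKit, which we do not reproduce here:

```python
import numpy as np

HOLDOUT_INDEX = 119  # hold-out type index, one past the 118 real atom types

def corrupt_atom_types(atom_types):
    """Atom-type corruption: map every atom to the same hold-out type,
    removing all type information while keeping the tensor shape."""
    return np.full_like(atom_types, HOLDOUT_INDEX)

def corrupt_coordinates(coords, rng, scale=0.3):
    """Stand-in for the paper's coordinate corruption (which replaces DFT
    equilibrium coordinates with MMFF geometries from RDKit): here we merely
    jitter the coordinates with Gaussian noise of an assumed scale."""
    return coords + rng.normal(scale=scale, size=coords.shape)
```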

C MUTUAL INFORMATION MAXIMIZATION WITH ENERGY-BASED MODEL

In this section, we give a detailed discussion of mutual information (MI) maximization with the energy-based model (EBM). First, we derive a lower bound of MI. Assuming that there exist (possibly negative) constants a and b such that a ≤ H(X) and b ≤ H(Y), i.e., lower bounds to the (differential) entropies, we have:

I(X; Y) = 1/2 [H(X) + H(Y) − H(Y|X) − H(X|Y)] ≥ 1/2 [a + b − H(Y|X) − H(X|Y)] = 1/2 (a + b) + L_MI,

where the loss L_MI is defined as L_MI = 1/2 E_{p(x,y)}[log p(x|y) + log p(y|x)]. Empirically, we use energy-based models to model the distributions. The existence of a and b can be understood as the requirement that the two distributions (p_x, p_y) are not collapsed. Notice that, to keep consistent with the notations in Section 3, we will be using g1 and g2 as the two variables. The goal is then equivalent to optimizing the following equation:

L_GeoSSL ≜ 1/2 E_{p(g1,g2)}[log p(g1|g2) + log p(g2|g1)].   (14)

Thus, we transform the MI maximization problem into maximizing the sum of two conditional log-likelihoods. Such an objective function opens a wider venue for estimating MI, e.g., using the EBM to estimate Equation (14).

Adaptation to Geometric Data. The 3D geometric information, i.e., the atom coordinates, is critical to molecular properties. Based on this, we propose a geometry perturbation, which adds small noises to the atom coordinates. This geometry perturbation is motivated from both the domain and machine learning perspectives. (1) From the practical experiment perspective, the statistical and systematic errors [11] in conformation estimation are unavoidable. Coordinate perturbation is a natural way to learn representations robust to such noises. (2) From the domain aspect, molecules are not static but in continuous motion in the 3D Euclidean space, and we can obtain a potential energy surface accordingly. We are interested in modeling the conformer, i.e., the 3D coordinates with the lowest energy.
However, even the conformer at the lowest-energy point can vibrate, and coordinate perturbation can better capture such movement while staying at the same order of magnitude of energy. (3) As will be illustrated later, our proposed method can be simplified to denoising atomic distance matching. (4) Leveraging coordinate perturbation for model regularization has also been empirically shown to be effective for supervised molecular geometric representation learning [21]. These characteristics of molecular geometry motivate us to apply coordinate perturbation. If we take each of the two views as adding noise to the coordinates of the other view, then the objective in Equation (14) essentially states that we want to conduct coordinate denoising, as shown in Figure 3. Yet this is not a trivial task, since reconstructing the complicated geometric space (e.g., 3D coordinates) is challenging.
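The geometry perturbation described above can be sketched in a few lines; the noise scale `sigma` below is an assumed value, not one reported in the paper:

```python
import numpy as np

def two_perturbed_views(coords, sigma=0.05, rng=None):
    """Build two geometry views around the same local minimum by adding
    small i.i.d. Gaussian noise to the atom coordinates. Each view can then
    be seen as a noisy version of the other, motivating coordinate denoising."""
    rng = rng if rng is not None else np.random.default_rng(0)
    g1 = coords + sigma * rng.normal(size=coords.shape)
    g2 = coords + sigma * rng.normal(size=coords.shape)
    return g1, g2
```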

C.1 AN EBM FRAMEWORK FOR MI ESTIMATION

The lower bound in Equation (14) is composed of two conditional log-likelihood terms, and we model each conditional likelihood with an EBM. This gives us:

L_GeoSSL-EBM = − 1/2 E_{p(g1,g2)} [ log ( exp(f_{g1}(g1, g2)) / A_{g1|g2} ) + log ( exp(f_{g2}(g2, g1)) / A_{g2|g1} ) ],   (15)

where f_{g1}(g1, g2) = −E(g1|g2) and f_{g2}(g2, g1) = −E(g2|g1) are the negative energy functions, and A_{g1|g2} and A_{g2|g1} are the corresponding partition functions. The energy functions can be flexibly defined; the bottleneck is thus the intractable partition functions due to the high cardinality. To solve this, existing methods include noise-contrastive estimation (NCE) [25] and score matching (SM) [60, 61], and we describe below how to apply them for MI maximization.

C.2 EBM-NCE FOR MI ESTIMATION

Under the EBM framework, if we solve Equation (15) with noise-contrastive estimation (NCE) [25], the final objective is termed EBM-NCE:

L_GeoSSL-EBM-NCE = − 1/2 E_{p_data(g2)} [ E_{p_n(g1|g2)}[log(1 − σ(f_{g1}(g1, g2)))] + E_{p_data(g1|g2)}[log σ(f_{g1}(g1, g2))] ]
                  − 1/2 E_{p_data(g1)} [ E_{p_n(g2|g1)}[log(1 − σ(f_{g2}(g2, g1)))] + E_{p_data(g2|g1)}[log σ(f_{g2}(g2, g1))] ].

The detailed derivations can be found in [25]. Specifically, EBM-NCE is equivalent to the Jensen-Shannon estimation of MI, although the mathematical intuitions and derivation processes are different. It also belongs to the contrastive SSL family: it aims at aligning the positive pairs and contrasting the negative pairs. In this subsection, we focus on geometric data like molecular geometry. Recall that we have two views, g1 and g2, and the goal is to maximize the lower bound of the mutual information in Equation (14). Because the two views share the same atomic features, it can be reduced to:
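At its core, EBM-NCE is a binary classification problem: the (negative) energy should score data pairs high and noise pairs low. A minimal sketch of one direction of the loss, assuming precomputed energy scores rather than an actual energy network:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ebm_nce_loss(f_pos, f_neg):
    """One direction of the EBM-NCE objective: treat f(g1, g2) as a logit,
    pushing it up on positive (data) pairs and down on noise pairs.
    f_pos, f_neg: arrays of scores from the negative energy function."""
    return -(np.mean(np.log(sigmoid(f_pos))) +
             np.mean(np.log(1.0 - sigmoid(f_neg))))
```

A model that separates data from noise (large positive `f_pos`, large negative `f_neg`) drives this loss toward zero, which is the alignment-and-contrast behavior described above.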

C.3 EBM-SM FOR MI ESTIMATION: GEOSSL-DDM

L_GeoSSL-EBM = 1/2 E_{p(g1,g2)}[log p(g1|g2)] + 1/2 E_{p(g1,g2)}[log p(g2|g1)]
             = 1/2 E_{p(g1,g2)}[log p(⟨X1, R1⟩ | ⟨X2, R2⟩)] + 1/2 E_{p(g1,g2)}[log p(⟨X2, R2⟩ | ⟨X1, R1⟩)]
             = 1/2 E_{p(g1,g2)}[log p(R1|g2)] + 1/2 E_{p(g1,g2)}[log p(R2|g1)]
             = 1/2 E_{p(g1,g2)}[log ( exp(f(R1, g2)) / A_{R1|g2} )] + 1/2 E_{p(g2,g1)}[log ( exp(f(R2, g1)) / A_{R2|g1} )],   (17)

where the f(•) are the negative energy functions, and A_{R1|g2} and A_{R2|g1} are the intractable partition functions. The reduction from ⟨X, R⟩ to R in Equation (17) follows from the fact that the two views share the same atom types. This equation can be treated as denoising the atom coordinates of one view given the geometry of the other view. In the following, we explore how to use score matching to solve the EBM, and further transform the coordinate-aware mutual information maximization into denoising distance matching (GeoSSL-DDM) as the final objective.

Score Definition

The two terms in Equation (17) are in mirroring directions, so they can be calculated separately; we take one direction for illustration, e.g., log ( exp(f(R1, g2)) / A_{R1|g2} ). The score is defined as the gradient of the log-likelihood w.r.t. the data, i.e., the atom coordinates in our case. Because the normalization constant does not depend on the data, it disappears during the score calculation. Adapted to our setting, the score is the gradient of the negative energy function w.r.t. the atom coordinates:

s(R1, g2) = ∇_{R1} log p(R1|g2) = ∇_{R1} f(R1, g2).   (18)

If we assume that the learned optimal energy function f(•) possesses certain physical or chemical information, then the score in Equation (18) can be viewed as a special form of pseudo-force. This may require more domain-specific knowledge, and we leave it for future exploration.

Score Decomposition: From Coordinates To Distances. Through back-propagation [54], the score on atom coordinates can be further decomposed into scores attached to pairwise distances:

s(R1, g2)_i = ∂f(R1, g2)/∂r_{1,i}
            = Σ_{j∈N(i)} ∂f(R1, g2)/∂d_{1,ij} · ∂d_{1,ij}/∂r_{1,i}
            = Σ_{j∈N(i)} (1/d_{1,ij}) · ∂f(R1, g2)/∂d_{1,ij} · (r_{1,i} − r_{1,j})
            = Σ_{j∈N(i)} (1/d_{1,ij}) · s(d1, g2)_{ij} · (r_{1,i} − r_{1,j}),

where s(d1, g2)_{ij} ≜ ∂f(R1, g2)/∂d_{1,ij}. Such a decomposition has a nice intuition from the pseudo-force perspective: the pseudo-force on each atom decomposes as the summation of pseudo-forces along the pairwise distances starting from this atom. Note that the pairwise atoms here are connected in the 3D Euclidean space, not by covalent bonds.

Denoising Distance Matching (DDM). We then adapt denoising score matching (DSM) [69] to our task.
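The chain-rule decomposition above can be recomposed numerically: given scores on pairwise distances, the per-atom coordinate score follows. A small sketch, where `dist_scores[i][j]` stands in for the learned ∂f/∂d_ij (names and structure are ours, for illustration):

```python
import numpy as np

def coordinate_scores_from_distance_scores(coords, dist_scores, neighbors):
    """Recompose per-atom coordinate scores from pairwise-distance scores:
    s(R)_i = sum_{j in N(i)} (1/d_ij) * s(d)_ij * (r_i - r_j)."""
    out = np.zeros_like(coords)
    for i in range(coords.shape[0]):
        for j in neighbors[i]:
            diff = coords[i] - coords[j]       # r_i - r_j
            d = np.linalg.norm(diff)           # d_ij
            out[i] += (dist_scores[i][j] / d) * diff
    return out
```

As a sanity check, for the toy energy f = Σ_{i<j} d_ij² one has ∂f/∂d_ij = 2 d_ij, and the recomposed score equals the analytic gradient 2 Σ_j (r_i − r_j).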
To be more concrete, we take the Gaussian kernel as the perturbed noise distribution on each pairwise distance, i.e., q_σ(d̃1|g2) = E_{p_data(d1|g2)}[q_σ(d̃1|d1)], where σ is the standard deviation of the Gaussian perturbation. One main advantage of the Gaussian kernel is that the gradient of the conditional log-likelihood has a closed form: ∇_{d̃1} log q_σ(d̃1|d1, g2) = (d1 − d̃1)/σ², and the goal of DSM is to train a score network to match it. This trick was first introduced in [69] and has been widely utilized in score matching applications [58, 59]. Adapted to our setting, this essentially says that we want to train a "distance network" s_θ(d̃1, g2) to match the distance perturbation; from another aspect, it aims at matching the pseudo-force on the pairwise distances. Taking the Fisher divergence as the discrepancy metric and applying the trick above, the estimation s_θ(d̃1, g2) ≈ ∇_{d̃1} log q_σ(d̃1|d1, g2) can be simplified to:

D_F( q_σ(d̃1|g2) ∥ p_θ(d̃1|g2) ) = 1/2 E_{p_data(d1|g2)} E_{q_σ(d̃1|d1,g2)} [ ∥ s_θ(d̃1, g2) − (d1 − d̃1)/σ² ∥² ] + C.

Final objective. We adopt the following four model training tricks from [38, 58, 59] to stabilize the score matching training process. (1) We carry out the distance denoising at L levels of noise. (2) We add a weighting coefficient λ(σ) = σ^β for each noise level, where β is the annealing factor. (3) We scale the score network by a factor of 1/σ. (4) We sample exactly the same atoms from the two geometry views with masking ratio r. Finally, considering both directions and all the tricks above, the objective function becomes:

L_GeoSSL-DDM = 1/(2L) Σ_{l=1}^{L} σ_l^β E_{p_data(d1|g2)} E_{q(d̃1|d1,g2)} [ ∥ s_θ(d̃1, g2)/σ_l − (d1 − d̃1)/σ_l² ∥²₂ ]
             + 1/(2L) Σ_{l=1}^{L} σ_l^β E_{p_data(d2|g1)} E_{q(d̃2|d2,g1)} [ ∥ s_θ(d̃2, g1)/σ_l − (d2 − d̃2)/σ_l² ∥²₂ ].
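One direction of the multi-level objective above can be sketched as follows. `score_net` is a stand-in for the learned s_θ (here an arbitrary callable), and the loop makes the σ^β weighting and the 1/σ scaling trick explicit:

```python
import numpy as np

def ddm_loss_one_direction(d_clean, score_net, sigmas, beta=1.0, rng=None):
    """Denoising distance matching (one direction): perturb pairwise
    distances at each noise level sigma_l, then match the scaled score
    network output against the closed-form target (d - d_tilde) / sigma^2,
    weighted by sigma^beta."""
    rng = rng if rng is not None else np.random.default_rng(0)
    total = 0.0
    for sigma in sigmas:
        d_tilde = d_clean + sigma * rng.normal(size=d_clean.shape)
        target = (d_clean - d_tilde) / sigma**2
        pred = score_net(d_tilde, sigma) / sigma   # 1/sigma scaling trick
        total += sigma**beta * np.mean((pred - target) ** 2)
    return total / (2 * len(sigmas))
```

An oracle score network that reproduces the target exactly drives this loss to zero, which is a quick way to sanity-check the scaling.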
C.4 DISCUSSIONS

Using the energy-based model (EBM) to solve MI maximization opens a novel venue, especially for highly structured data like molecular geometry. To solve the EBM, existing methods include noise-contrastive estimation (NCE) [25], score matching (SM) [61], etc. Under the MI maximization setting, EBM-NCE is essentially a contrastive learning method, where the goal is to align the positive pairs and contrast the negative pairs simultaneously. EBM-SM, i.e., GeoSSL-DDM, is instead a generative self-supervised learning (SSL) method based on distance denoising, which is especially appealing for geometric data representation learning.

Further interpretation as pseudo-force. Score matching can be smoothly adapted to the 3D geometric setting. Because scores are defined as gradients of the energy function with respect to the atom positions, they can be thought of as a form of pseudo-forces. Accordingly, GeoSSL-DDM can be viewed as a pseudo-force matching, which is natural for molecular structures. However, further understanding requires more domain knowledge on the design of the energy function. This is beyond the scope of this paper, and we leave it for future exploration.

Multi-view pretraining: complementary information with the 2D topological graph. Recently, certain works [38] have shown that 3D geometric information is useful for 2D topology. Here we conjecture that the reverse direction is also meaningful: 2D topology can also be useful for 3D representation learning. This may not seem reasonable from the domain perspective, since the 2D topology can be heuristically obtained from the 3D geometry, i.e., all the 2D information is redundant given the 3D geometry. However, from the machine learning theory perspective [5, 18], it is still helpful in reducing the sample complexity. From a higher-level perspective, we want to explicitly point out that such gaps between machine learning and scientific domains exist widely, and they form an interesting direction for further exploration.

D EXPERIMENTS

In this section, we discuss the experimental details of our work. The main structure is as follows:
• In Appendix D.1, we introduce the computation resources.
• In Appendices D.2 to D.4, we introduce the downstream datasets.
  - Notice that because the performance on QM9 and MD17 is quite stable after fixing the seed (e.g., 42), we will not run cross-validation. This also follows the main literature [41, 50, 51].
  - Yet, for LBA & LEP, these two datasets are quite small and very sensitive to data splitting, so we pick 5 seeds (12, 22, 32, 42, and 52) and run cross-validation on them.
• In Appendix D.5, we list the key hyperparameters for all the pretraining baselines and GeoSSL-DDM.
• In Appendix D.6, we show the empirical results using SchNet as the backbone model.

D.1 COMPUTATIONAL RESOURCES

We use around 20 V100 GPU cards on an internal cluster. Each job takes a single GPU card and finishes within 3-24 hours.

D.2 DATASET: QM9

QM9 [46] is a dataset of 134K molecules consisting of up to 9 heavy atoms. It includes 12 tasks related to quantum properties. For example, U0 and U298 are the internal energies at 0K and 298.15K respectively, and H298 and G298 are the enthalpy and free energy at 298.15K. The other 8 tasks are quantum mechanical properties related to the DFT process. We follow [50] in preprocessing the dataset (including unit transformation for each task). Existing works use different data splits (in terms of the splitting sizes). Originally there are 133,885 molecules in QM9; 3,054 are filtered out, leading to 130,831 molecules. During the benchmark, we find that the performance on QM9 is very robust to either using (1) 110K for training, 10K for validation, and 10,831 for test, or (2) 100K for training, 13,083 for validation, and 17,748 for test. In this paper, we use option (1).

D.3 DATASET: MD17

MD17 [10] is a dataset on molecular dynamics simulation. It includes eight tasks, corresponding to eight organic molecules, and each task includes the molecule positions along the potential energy surface (PES), as shown in Figure 1. The goal is to predict the energy-conserving interatomic forces for each atom at each molecule position. We list some basic statistics in Table 6. We follow [41, 51] in preprocessing the dataset (including unit transformation for each task).

D.4 DATASET: LBA & LEP

During the binding process, a cavity in a protein can potentially possess suitable properties for binding a small molecule (ligand), and it is termed a pocket [62]. Because of the large volume of a protein, we follow [66] by only taking the binding pocket, where there are no more than 600 atoms for each molecule-protein pair. To be more concrete, we consider two binding affinity tasks. (1) The first task is ligand binding affinity (LBA). It is gathered from [70], and the task is to predict the binding affinity strength between a small molecule and a protein pocket. (2) The second task is ligand efficacy prediction (LEP). We have a molecule bound to pockets, and the goal is to detect whether the same molecule has a higher binding affinity with one pocket compared to the other. We list some basic statistics in Table 7.

D.5 HYPERPARAMETER SPECIFICATION

We list all the detailed hyperparameters in this subsection. For all the methods, we use the same optimization strategy, i.e., a learning rate of 5e-4 with the cosine annealing learning rate schedule [43]. The other hyperparameters for each pretraining method are listed in Table 8. For all remaining hyperparameters, we use the defaults provided in the code.
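The cosine annealing schedule used above follows a simple closed form; a minimal sketch (the released code likely uses a framework scheduler such as PyTorch's `CosineAnnealingLR` instead, and `lr_min = 0` is our assumption):

```python
import math

def cosine_annealing_lr(step, total_steps, lr_max=5e-4, lr_min=0.0):
    """Cosine annealing [43]: decay the learning rate from lr_max
    (5e-4, as used for all methods here) to lr_min over total_steps,
    following a half cosine."""
    t = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))
```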

D.6 SCHNET AS BACKBONE MODEL

We want to highlight that some backbone models (e.g., DimeNet++ and SphereNet) may perform better than or on par with PaiNN, as shown in Table 4, yet they run out of GPU memory. Thus, considering model performance, computation efficiency, and memory cost together, we adopt PaiNN as the backbone model in the main paper. In this section, we carry out experiments using SchNet as the backbone model. We follow the same process as in Section 5, i.e., we compare our method with one randomly-initialized and seven pretraining baselines. The results on QM9, MD17, LBA and LEP are in Tables 9 to 11 accordingly. From these tables, we can observe that in general, GeoSSL-DDM achieves the best results, reaching the best performance on 21 of the 22 downstream tasks and comparable performance on the remaining task (within the top 2 models). This strongly supports the effectiveness of our proposed method, GeoSSL-DDM. In addition, we want to mention that several pretraining tasks show the negative transfer issue. Compared to the results in Section 5, we conjecture that this is related to the tasks (both pretraining and downstream) and the backbone model. Yet, this is beyond the scope of our work, and we leave it as a future direction. Among all the hyperparameters for GeoSSL-DDM (see Table 8), we find that the annealing factor is one of the most sensitive ones. The annealing factor β is applied to the weighting coefficient λ(σ) = σ^β. We carry out an ablation study to verify this by pretraining GeoSSL-DDM with annealing factors at five different scales. In Table 13, we can observe that in general, GeoSSL-DDM attains better performance with more denoising layers. This is consistent with observations in vision applications [61]. Promisingly, even with a smaller L (e.g., L = 1), GeoSSL-DDM can still achieve a modest improvement.

G COMPARISON WITH A PARALLEL WORK

We note that there is a parallel work [76], which also explores the effect of denoising for geometric data pretraining. That work differs from GeoSSL-DDM, and we summarize the main differences as follows:
• The parallel work [76] is similar to denoising score matching (DSM) as introduced in [69], i.e., with only one layer of denoising in score matching. In contrast, our model has multiple denoising layers, which is much closer to NCSN [58], where the number of noise layers has been shown to be important to the effectiveness of denoising score matching models. We also empirically verify this analysis: in the experimental results in Table 13, L = 1 is equivalent to the method in [76]. We observe that with layer number L = 1 (the third row of the table), the performance does increase in some cases, matching the observation in [76]. Nevertheless, the results in Table 13 clearly indicate that with larger L, the model attains further error reduction and improved robustness.
• Theoretically, the work in [76] specifically targets the application of representation learning in geometric pretraining, through a straightforward adaptation of denoising score matching from vision. In contrast, our GeoSSL-DDM provides a general framework that leverages the energy-based model (EBM) for mutual information (MI) maximization on geometric data pretraining. As such, the EBM solver in GeoSSL-DDM can be easily replaced by other models such as GFlowNet [3] to better capture the multi-mode distributions in geometric data during pretraining (please see Section 6 for more discussion).



During the rebuttal of our submission, one of the reviewers pointed us to this parallel work [76], which is also under review. We provide a detailed comparison with this work in Appendix G.



Energy-based model and denoising score matching. The energy-based model (EBM) is a flexible and powerful tool for modeling data distributions. It takes the form of a Gibbs distribution, p_θ(x) = exp(−E(x))/A, where p_θ(x) is the model distribution and A denotes the normalization constant.
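For intuition, over a finite set of states the Gibbs normalization is tractable and can be computed directly (in the continuous molecular setting, A is the intractable partition function the paper works around):

```python
import numpy as np

def gibbs_probs(energies):
    """Normalize a Gibbs distribution p(x) = exp(-E(x)) / A over a finite
    set of states; A = sum_x exp(-E(x)) is the partition function."""
    w = np.exp(-np.asarray(energies, dtype=float))
    return w / w.sum()
```

Lower-energy states receive higher probability, which is the basic behavior the energy functions f(•) = −E(•) in the appendix rely on.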

Figure 2: Pipeline for GeoSSL-DDM. The views g1 and g2 are around the same local minimum, yet with coordinate noise perturbations. Originally we want to conduct coordinate denoising between these two views. As proposed in GeoSSL-DDM, we transform it into an equivalent problem, i.e., distance denoising. This figure shows the three key steps: extract the distances from the two geometric views, perform distance perturbation, and denoise the perturbed distances. Notice that the covalent bonds in the 3D data are added for illustration only.

Figure 3: Pipeline for denoising coordinate matching.

Figure 4: Pipeline for GeoSSL-DDM. The views g1 and g2 are around the same local minimum, yet with coordinate noise perturbations. Originally we want to conduct coordinate denoising between these two views. As proposed in GeoSSL-DDM, we transform it into an equivalent problem, i.e., distance denoising. This figure shows the three key steps: extract the distances from the two geometric views, perform distance perturbation, and denoise the perturbed distances.

Downstream results on 12 quantum mechanics prediction tasks from QM9. We take 110K for training, 10K for validation, and 11K for test. The evaluation is mean absolute error, and the best results are in bold.

Downstream results on 8 force prediction tasks from MD17. We take 1K for training, 1K for validation, and the number of molecules for test are varied among different tasks, ranging from 48K to 991K. The evaluation is mean absolute error, and the best results are in bold.

Downstream results on 2 binding affinity tasks. We select three evaluation metrics for LBA: the root mean squared error (RMSE), the Pearson correlation (Rp) and the Spearman correlation (RS). LEP is a binary classification task, and we use the area under the curve for receiver operating characteristics (ROC) and precision-recall (PR) for evaluation. We run cross-validation with 5 seeds, and the best results are in bold.

of MI, which is symmetric in terms of the denoising directions. We believe that such symmetry treats the two views equally and can better reveal the mutual concept, making the pretrained representation more robust to position augmentations. (5) Empirical baselines. PTSSL lacks comparisons with other pretraining methods, while we compare with 7 SOTA pretraining methods, especially those driven by maximizing the MI with the same augmentations. Without such comparisons, it is hard to tell the effectiveness of pseudo-force matching for geometric data. (6) Score network. Last but not least, the score network designed in PTSSL does not satisfy the SE(3) equivariance property.

An evidence example on molecular data. The goal is to predict 12 quantum properties (regression tasks) of 3D molecules (with 3D coordinates on each atom). The evaluation metric is MAE.

Some basic statistics on MD17.

Pretraining  Aspirin ↓  Benzene ↓  Ethanol ↓  Malonaldehyde ↓  Naphthalene ↓  Salicylic ↓  Toluene ↓  Uracil ↓

Some basic statistics on LBA & LEP. For LBA, we use split-by-sequence-identity-30: we split protein-ligand complexes such that no protein in the test dataset has more than 30% sequence identity with any protein in the training dataset. For LEP, we split the complex pairs by protein target.

Hyperparameter specifications.

Downstream results on 12 quantum mechanics prediction tasks from QM9. We take 110K for training, 10K for validation, and 11K for test. The evaluation is mean absolute error, and the best results are in bold.

Downstream results on 8 force prediction tasks from MD17. We take 1K for training, 1K for validation, and the number of molecules for test are varied among different tasks, ranging from 48K to 991K. The evaluation is mean absolute error, and the best results are in bold.

Downstream results on 2 binding affinity tasks. We select three evaluation metrics for LBA: the root mean squared error (RMSE), the Pearson correlation (Rp) and the Spearman correlation (RS). LEP is a binary classification task, and we use the area under the curve for receiver operating characteristics (ROC) and precision-recall (PR) for evaluation. We run cross-validation with 5 seeds, and the best results are in bold.

± 0.02  0.522 ± 0.01  0.501 ± 0.01  0.436 ± 0.03  0.369 ± 0.02
Supervised  1.477 ± 0.04  0.528 ± 0.02  0.503 ± 0.03  0.462 ± 0.05  0.392 ± 0.03
Type Prediction  1.483 ± 0.04  0.498 ± 0.03  0.481 ± 0.03  0.570 ± 0.04  0.509 ± 0.07
Distance Prediction  1.461 ± 0.06  0.535 ± 0.04  0.512 ± 0.04  0.502 ± 0.06  0.415 ± 0.05
Angle Prediction  1.499 ± 0.01  0.475 ± 0.01  0.462 ± 0.02  0.532 ± 0.06  0.449 ± 0.03
3D InfoGraph  1.467 ± 0.06  0.526 ± 0.03  0.500 ± 0.03  0.515 ± 0.05  0.412 ± 0.04
GeoSSL-RR  -  -  -  0.439 ± 0.04  0.365 ± 0.02
GeoSSL-InfoNCE  1.528 ± 0.05  0.483 ± 0.02  0.464 ± 0.02  0.588 ± 0.06  0.523 ± 0.05
GeoSSL-EBM-NCE  1.499 ± 0.03  0.509 ± 0.02  0.498 ± 0.02  0.493 ± 0.07  0.429 ± 0.06
GeoSSL-DDM (ours)  1.432 ± 0.02  0.550 ± 0.02  0.529 ± 0.02  0.633 ± 0.03  0.541 ± 0.03

E ABLATION STUDIES

E.1 THE EFFECT OF ANNEALING FACTOR IN GEOSSL-DDM

Ablation study on the effect of annealing factor β on 12 quantum mechanics prediction tasks from QM9. We take 110K for training, 10K for validation, and 11K for test. The backbone model is PaiNN, and the evaluation is the mean absolute error.

β  Alpha ↓  Gap ↓  HOMO ↓  LUMO ↓  Mu ↓  Cv ↓  G298 ↓  H298 ↓  r2 ↓  U298 ↓  U0 ↓  Zpve ↓

As can be observed in Table 12, the models are more stable with smaller annealing values (e.g., 0.2 and 0.05). With large annealing values, the model performance can degrade drastically.

E.2 THE EFFECT OF THE NUMBER OF NOISE LAYERS IN GEOSSL-DDM

Another important hyperparameter listed in Table 8 is the number of noise layers, L. Here we conduct an ablation study on it, and the results are shown in Table 13.

Ablation study on the effect of the noise layer L on 12 quantum mechanics prediction tasks from QM9. We take 110K for training, 10K for validation, and 11K for test. The backbone model is PaiNN, and the evaluation is the mean absolute error.

ACKNOWLEDGEMENT

We would like to thank Anima Anandkumar, Chaowei Xiao, Weili Nie, Zhuoran Qiao, Chengpeng Wang, and Pierre-André Noël for their insightful discussions. This project is supported by the Natural Sciences and Engineering Research Council (NSERC) Discovery Grant, the Canada CIFAR AI Chair Program, collaboration grants between Microsoft Research and Mila, Samsung Electronics Co., Ltd., Amazon Faculty Research Award, Tencent AI Lab Rhino-Bird Gift Fund and two NRC Collaborative R&D Projects (AI4D-CORE-06, AI4D-CORE-08). This project was also partially funded by IVADO Fundamental Research Project grant PRF-2019-3583139727.

ETHICS STATEMENT

The authors acknowledge that we have read and committed to adhering to the ICLR Code of Ethics.

REPRODUCIBILITY STATEMENT

To ensure the reproducibility of the empirical results, we provide the implementation details (hyperparameters, dataset statistics, etc.) in Section 5 and Appendix D, and publicly share our source code through this GitHub link. Besides, the complete derivations of the equations and clear explanations are given in Section 4 and Appendix C.

F STRONG MODEL ROBUSTNESS WITH RANDOM SEEDS

To further illustrate that our proposed GeoSSL-DDM is robust and insensitive to particular random seeds, we provide downstream results with more random seeds. We list the key details as follows:
• Datasets. We conduct downstream experiments with random seeds on two datasets: QM9 and MD17.
• Backbone models. We run two backbone models: PaiNN in Appendix F.1 and SchNet in Appendix F.2.
• Seeds. So far, for both the main tables (Tables 1, 2, 9 and 10) and the ablation studies (in Appendix E), we used a fixed seed 42. In this section, we provide results with two additional seeds, 22 and 32.
• Baselines. We compare against the strongest baselines: random initialization (without any pretraining), distance prediction, representation reconstruction (RR), and EBM-NCE.
• Reported results. We report both the mean and standard deviation over seeds 22, 32, and 42 for all experiments.
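The aggregation in the protocol above is a plain mean ± standard deviation over seeds; a minimal sketch (`report_over_seeds` is a hypothetical helper, not from the released code):

```python
import statistics

def report_over_seeds(results_by_seed):
    """Aggregate one metric's downstream results over seeds (22, 32, 42
    in the paper) into the mean and (sample) standard deviation that the
    tables report as mean +/- std."""
    vals = list(results_by_seed.values())
    return statistics.mean(vals), statistics.stdev(vals)
```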

F.1 PAINN

Here we take PaiNN as the backbone model. The results on QM9 and MD17 are reported in Tables 14 and 15 respectively. These empirical results match the main results in Tables 1 and 2, and verify that our proposed GeoSSL-DDM indeed learns a more robust representation.

Table 14: Downstream results on 12 quantum mechanics prediction tasks from QM9. We take 110K for training, 10K for validation, and 11K for test. The evaluation metric is mean absolute error, and the best results are in bold. We report both the mean and standard deviation for seeds 22, 32, and 42.

F.2 SCHNET

Here we take SchNet as the backbone model. The results on QM9 and MD17 are reported in Tables 16 and 17 respectively. These empirical results match the main results in Tables 9 and 10, and verify that our proposed GeoSSL-DDM indeed learns a more robust representation.

Table 16: Downstream results on 12 quantum mechanics prediction tasks from QM9. We take 110K for training, 10K for validation, and 11K for test. The evaluation metric is mean absolute error, and the best results are in bold. We report both the mean and standard deviation for seeds 22, 32, and 42.

