PRE-TRAINING VIA DENOISING FOR MOLECULAR PROPERTY PREDICTION

Abstract

Many important problems involving molecular property prediction from 3D structures have limited data, posing a generalization challenge for neural networks. In this paper, we describe a pre-training technique based on denoising that achieves a new state-of-the-art in molecular property prediction by utilizing large datasets of 3D molecular structures at equilibrium to learn meaningful representations for downstream tasks. Relying on the well-known link between denoising autoencoders and score-matching, we show that the denoising objective corresponds to learning a molecular force field, arising from approximating the Boltzmann distribution with a mixture of Gaussians, directly from equilibrium structures. Our experiments demonstrate that using this pre-training objective significantly improves performance on multiple benchmarks, achieving a new state-of-the-art on the majority of targets in the widely used QM9 dataset. Our analysis then provides practical insights into the effects of different factors on pre-training: dataset sizes, model size and architecture, and the choice of upstream and downstream datasets.

1. INTRODUCTION

The success of the best performing neural networks in vision and natural language processing (NLP) relies on pre-training the models on large datasets to learn meaningful features for downstream tasks (Dai & Le, 2015; Simonyan & Zisserman, 2014; Devlin et al., 2018; Brown et al., 2020; Dosovitskiy et al., 2020). For molecular property prediction from 3D structures (a point cloud of atomic nuclei in R^3), the problem of how to similarly learn such representations remains open. For example, none of the best models on the widely used QM9 benchmark use any form of pre-training (e.g. Klicpera et al., 2020a; Liu et al., 2022b; Schütt et al., 2021; Thölke & De Fabritiis, 2022), in stark contrast with vision and NLP. Effective methods for pre-training could have a significant impact on fields such as drug discovery and materials science. In this work, we focus on the problem of how large datasets of 3D molecular structures can be utilized to improve performance on downstream molecular property prediction tasks that also rely on 3D structures as input. We address the question: how can one exploit large datasets like PCQM4Mv2, which contains over 3 million structures, to improve performance on datasets such as DES15K that are orders of magnitude smaller? Our answer is a form of self-supervised pre-training that generates useful representations for downstream prediction tasks, leading to state-of-the-art (SOTA) results. Inspired by recent advances in noise regularization for graph neural networks (GNNs) (Godwin et al., 2022), our pre-training objective is based on denoising in the space of structures (and is hence self-supervised). Unlike existing pre-training methods, which largely focus on 2D graphs, our approach targets the setting where the downstream task involves 3D point clouds defining the molecular structure.
Relying on the well-known connection between denoising and score-matching (Vincent, 2011; Song & Ermon, 2019; Ho et al., 2020), we show that the denoising objective is equivalent to learning a particular force field, adding a new interpretation of denoising in the context of molecules and shedding light on how it aids representation learning. The contributions of our work are summarized as follows:

• We investigate a simple and effective method for pre-training via denoising in the space of 3D structures with the aim of improving downstream molecular property prediction from such 3D structures. Our denoising objective is shown to be related to learning a specific force field.

• Our experiments demonstrate that pre-training via denoising significantly improves performance on multiple challenging datasets that vary in size, nature of task, and molecular composition. This establishes that denoising over structures successfully transfers to molecular property prediction, setting, in particular, a new state-of-the-art on 10 out of 12 targets in the widely used QM9 dataset. Figure 1 illustrates performance on one of the targets in QM9.

• We make improvements to a common GNN, in particular showing how to apply Tailored Activation Transformation (TAT) (Zhang et al., 2022) to Graph Network Simulators (GNS) (Sanchez-Gonzalez et al., 2020), which is complementary to pre-training and further boosts performance.

• We analyze the benefits of pre-training by gaining insights into the effects of dataset size, model size and architecture, and the relationship between the upstream and downstream datasets.

2. RELATED WORK

Pre-training of GNNs. Various recent works have formulated methods for pre-training using graph data, largely focusing on 2D molecular graphs. Most closely related to our work is that of Liu et al. (2021a), where 3D structure is treated as one view of a 2D molecule for the purpose of upstream contrastive learning. Their work focuses on downstream tasks that only involve 2D information, while our aim is to improve downstream models for molecular property prediction from 3D structures. After the release of this pre-print, similar ideas have been studied by Jiao et al. (2022) and Liu et al. (2022a).

Denoising, representation learning and score-matching. Noise has long been known to improve generalization in machine learning (Sietsma & Dow, 1991; Bishop, 1995). Denoising autoencoders have been used to effectively learn representations by mapping corrupted inputs to original inputs (Vincent et al., 2008; 2010). Specific to GNNs, denoising has been applied as an auxiliary loss by Godwin et al. (2022), discussed further in Section 3.2.2; we conjecture that similar improvements will hold for other models.

3.1. PROBLEM SETUP

Molecular property prediction consists of predicting scalar quantities given the structure of one or more molecules as input. Each data example is a labelled set specified as follows: we are provided with a set of atoms S = {(a_1, p_1), …, (a_|S|, p_|S|)}, where a_i ∈ {1, …, 118} and p_i ∈ R^3 are the atomic number and 3D position respectively of atom i in the molecule, alongside a label y ∈ R. We assume that the model, which takes S as input, is any architecture consisting of a backbone, which first processes S to build a latent representation of it, followed by a vertex-level or graph-level "decoder", which returns per-vertex predictions or a single prediction for the input respectively.
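The setup above can be made concrete with a minimal container for one labelled example; the class and field names here are illustrative, not taken from any released codebase:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class MoleculeExample:
    """One labelled example: atomic numbers a_i, positions p_i in R^3, label y."""
    atomic_numbers: np.ndarray  # shape (|S|,), integers in 1..118
    positions: np.ndarray       # shape (|S|, 3), Cartesian coordinates
    label: float                # scalar target y

    def __post_init__(self):
        assert self.atomic_numbers.ndim == 1
        assert self.positions.shape == (len(self.atomic_numbers), 3)

# a toy water-like molecule (coordinates are made up for illustration)
mol = MoleculeExample(
    atomic_numbers=np.array([8, 1, 1]),
    positions=np.array([[0.0, 0.0, 0.0], [0.96, 0.0, 0.0], [-0.24, 0.93, 0.0]]),
    label=-76.4,
)
```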

3.2. PRE-TRAINING VIA DENOISING

Given a dataset of molecular structures, we pre-train the network by denoising the structures, which operates as follows. Let D_structures = {S_1, …, S_n} denote the upstream dataset of equilibrium structures, and let GNN_θ denote a graph neural network with parameters θ which takes S ∈ D_structures as input and returns per-vertex predictions GNN_θ(S) = (ε_1, …, ε_|S|). The precise parameterization of the models we consider in this work is described in Section 3.3 and Appendix A. Starting with an input molecule S ∈ D_structures, we perturb it by adding i.i.d. Gaussian noise to its atomic positions p_i. That is, we create a noisy version of the molecule S̃ = {(a_1, p̃_1), …, (a_|S|, p̃_|S|)}, where

p̃_i = p_i + σϵ_i,  ϵ_i ∼ N(0, I_3).  (1)

The noise scale σ is a tuneable hyperparameter (an interpretation of which is given in Section 3.2.1). We train the model as a denoising autoencoder by minimizing the following loss with respect to θ:

E_{p(S̃, S)} ‖GNN_θ(S̃) − (ϵ_1, …, ϵ_|S|)‖².  (2)

The distribution p(S̃, S) corresponds to sampling a structure S from D_structures and adding noise to it according to Equation (1). Note that the model predicts the noise, not the original coordinates. Next, we motivate denoising as our pre-training objective for molecular modelling.
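A minimal sketch of the noising step and denoising loss above, using NumPy arrays in place of a real GNN; the `corrupt` and `denoising_loss` names are ours, and a perfect denoiser that outputs the noise exactly attains zero loss:

```python
import numpy as np

def corrupt(positions, sigma, rng):
    """Add i.i.d. Gaussian noise to atomic positions; return noisy positions and the noise."""
    eps = rng.standard_normal(positions.shape)  # eps_i ~ N(0, I_3), one row per atom
    return positions + sigma * eps, eps         # noisy position = p_i + sigma * eps_i

def denoising_loss(predicted_noise, eps):
    """Squared error between the per-atom noise prediction and the true noise."""
    return float(np.mean(np.sum((predicted_noise - eps) ** 2, axis=-1)))

rng = np.random.default_rng(0)
pos = rng.standard_normal((5, 3))               # a toy 5-atom structure
noisy_pos, eps = corrupt(pos, sigma=0.1, rng=rng)
# an ideal model predicting the noise exactly attains zero loss
assert denoising_loss(eps, eps) == 0.0
```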

3.2.1. DENOISING AS LEARNING A FORCE FIELD

Datasets in quantum chemistry are typically generated by minimizing expensive-to-compute interatomic forces with methods such as density functional theory (DFT) (Parr & Weitao, 1994). We speculate that learning this force field would give rise to useful representations for downstream tasks, since molecular properties vary with forces and energy. Therefore, a reasonable pre-training objective would be one that involves learning the force field. Unfortunately, this force field is either unknown or expensive to evaluate, and hence it cannot be used directly for pre-training. An alternative is to approximate the data-generating force field with one that can be cheaply evaluated and use it to learn good representations, an approach we outline in this section.

Using the well-known link between denoising autoencoders and score-matching (Vincent, 2011; Song & Ermon, 2019; Ho et al., 2020), we can show that the denoising objective in Equation (2) is equivalent to learning a particular force field directly from equilibrium structures, with some desirable properties. For clarity, in this subsection we condition on and suppress the atom types and molecule size in our notation, specifying a molecular structure by its coordinates x ∈ R^{3N} (with N as the size of the molecule). From the perspective of statistical physics, a structure x can be treated as a random quantity sampled from the Boltzmann distribution p_physical(x) ∝ exp(−E(x)), where E(x) is the (potential) energy of x. According to p_physical, low-energy structures have a high probability of occurring. Moreover, the per-atom forces are given by ∇_x log p_physical(x) = −∇_x E(x), which is referred to as the force field. Our goal is to learn this force field. However, both the energy function E and the distribution p_physical are unknown, and we only have access to a set of equilibrium structures x_1, …, x_n that locally minimize the energy E. Since x_1, …, x_n are then local maxima of the distribution p_physical, our main approximation is to replace p_physical with a mixture of Gaussians centered at the data:

p_physical(x) ≈ q_σ(x) := (1/n) Σ_{i=1}^{n} q_σ(x | x_i),  where q_σ(x | x_i) = N(x; x_i, σ² I_{3N}).

This approximation captures the fact that p_physical will have local maxima at the equilibrium structures, varies smoothly with x, and is computationally convenient. Learning the force field corresponding to q_σ(x̃) now yields a score-matching objective:

E_{q_σ(x̃)} ‖GNN_θ(x̃) − ∇_{x̃} log q_σ(x̃)‖².  (3)

As shown by Vincent (2011), defining q_0(x) = (1/n) Σ_{i=1}^{n} δ(x = x_i) to be the empirical distribution and q_σ(x̃, x) = q_σ(x̃ | x) q_0(x), the objective in Equation (3) is equivalent to:

E_{q_σ(x̃, x)} ‖GNN_θ(x̃) − ∇_{x̃} log q_σ(x̃ | x)‖² = E_{q_σ(x̃, x)} ‖GNN_θ(x̃) − (x − x̃)/σ²‖².  (4)

We notice that the RHS corresponds to the earlier denoising loss in Equation (2) (up to a constant factor of 1/σ applied to GNN_θ that can be absorbed into the network). To summarize, denoising equilibrium structures corresponds to learning the force field that arises from approximating the distribution p_physical with a mixture of Gaussians. Note that we can interpret the noise scale σ as being related to the sharpness of p_physical or E around the local maxima x_i. We also remark that the equivalence between Equation (3) and the LHS of Equation (4) does not require q_σ(x̃ | x_i) to be a Gaussian distribution (Vincent, 2011), and other choices will lead to different denoising objectives, which we leave as future work. See Appendix B for technical caveats and a discussion of the differences between denoising for generative modelling versus learning forces.
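The key identity used here, that the conditional score of a Gaussian satisfies ∇_{x̃} log q_σ(x̃ | x) = (x − x̃)/σ², can be sanity-checked numerically with finite differences; this is a verification sketch, not code from the paper:

```python
import numpy as np

def log_gaussian(x_tilde, x, sigma):
    """log N(x_tilde; x, sigma^2 I), dropping the constant normalizer."""
    return -np.sum((x_tilde - x) ** 2) / (2 * sigma ** 2)

def numerical_score(x_tilde, x, sigma, h=1e-5):
    """Central finite-difference gradient of log q_sigma(x_tilde | x) w.r.t. x_tilde."""
    g = np.zeros_like(x_tilde)
    for i in range(x_tilde.size):
        e = np.zeros_like(x_tilde)
        e.flat[i] = h
        g.flat[i] = (log_gaussian(x_tilde + e, x, sigma)
                     - log_gaussian(x_tilde - e, x, sigma)) / (2 * h)
    return g

rng = np.random.default_rng(1)
x, sigma = rng.standard_normal(6), 0.3          # a clean structure in R^{3N}, N = 2
x_tilde = x + sigma * rng.standard_normal(6)    # its noised version
analytic = (x - x_tilde) / sigma ** 2           # the score-matching target
assert np.allclose(numerical_score(x_tilde, x, sigma), analytic, atol=1e-4)
```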

3.2.2. NOISY NODES: DENOISING AS AN AUXILIARY LOSS

Recently, Godwin et al. (2022) applied denoising as an auxiliary loss to molecular property prediction, achieving significant improvements on a variety of molecular datasets. Their approach, called Noisy Nodes, augments the usual optimization objective for predicting y with an auxiliary denoising loss. They suggested two explanations for why Noisy Nodes improves performance. First, the presence of a vertex-level loss discourages oversmoothing (Chen et al., 2019; Cai & Wang, 2020) of vertex/edge features after multiple message-passing layers, a common problem plaguing GNNs, because successful denoising requires diversity amongst vertex features in order to match the diversity in the noise targets ϵ_i. Second, they argued that denoising can aid representation learning by encouraging the network to learn aspects of the input distribution. The empirical success of Noisy Nodes indicates that denoising can indeed result in meaningful representations. However, since Noisy Nodes incorporates denoising only as an auxiliary task, its representation learning benefits are limited to the downstream dataset on which it is used. Our approach is to apply denoising as a pre-training objective on another large (unlabelled) dataset of structures to learn higher-quality representations, which results in better performance.
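As a sketch of how Noisy Nodes combines objectives, the total loss is the supervised property loss plus a weighted auxiliary denoising term; the function name and the weight value here are illustrative, not tuned hyperparameters from the paper:

```python
def noisy_nodes_objective(property_loss, denoising_loss, aux_weight=0.1):
    """Noisy-Nodes-style total objective: a supervised graph-level property loss
    plus a weighted auxiliary per-vertex denoising loss (aux_weight is illustrative)."""
    return property_loss + aux_weight * denoising_loss

# the auxiliary term supplements, but does not replace, the supervised objective
total = noisy_nodes_objective(property_loss=1.0, denoising_loss=2.0)
assert abs(total - 1.2) < 1e-12
```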

3.3. GNS AND GNS-TAT

The two main models we consider in this work are the Graph Network Simulator (GNS) (Sanchez-Gonzalez et al., 2020), which is a type of GNN, and a better-performing variant we contribute called GNS-TAT. GNS-TAT makes use of a recently published network transformation method called Tailored Activation Transformation (TAT) (Zhang et al., 2022), which has been shown to prevent certain degenerate behaviors at initialization in deep MLPs/convnets that are reminiscent of oversmoothing in GNNs (and are also associated with training difficulties). While GNS is not by default compatible with the assumptions of TAT, we propose a novel GNN initialization scheme called "Edge-Delta" that makes it compatible by initializing to zero the weights that carry "messages" from vertices to edges. This marks the first application of TAT to an applied problem in the literature. See Appendix A for details.
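A minimal sketch of what the Edge-Delta initialization does to the first linear layer of an edge update function, assuming its input is the concatenation of an edge feature with the two incident vertex features (shapes and names are illustrative):

```python
import numpy as np

def edge_delta_init(edge_dim, vertex_dim, rng):
    """First linear layer of an edge update MLP whose input is [edge, v_i, v_j].
    Edge-Delta: the blocks multiplying the vertex features start at zero, so at
    initialization edges propagate only edge information; the fan-in used for
    scaling counts only the (non-zero) edge block."""
    w_edge = rng.standard_normal((edge_dim, edge_dim)) / np.sqrt(edge_dim)
    w_vertices = np.zeros((2 * vertex_dim, edge_dim))  # zeroed vertex-to-edge messages
    return np.concatenate([w_edge, w_vertices], axis=0)

W = edge_delta_init(edge_dim=8, vertex_dim=4, rng=np.random.default_rng(0))
assert np.all(W[8:] == 0.0) and np.any(W[:8] != 0.0)
```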

4. EXPERIMENTS

The goal of our experimental evaluation in this section is to answer the following questions. First, does pre-training a neural network via denoising improve performance on the downstream task compared to training from a random initialization? Second, how does the benefit of pre-training depend on the relationship between the upstream and downstream datasets? Our evaluation involves four realistic and challenging molecular datasets, which vary in size, compound compositions (organic or inorganic) and labelling methodology (DFT-or CCSD(T)-generated), as described below.

4.1. DATASETS AND TRAINING SETUP

Datasets. First, the main dataset we use for pre-training is PCQM4Mv2 (Nakata & Shimazaki, 2017), which contains 3.4 million organic molecules, specified by their 3D structures at equilibrium calculated using DFT. The molecules in PCQM4Mv2 carry only one label, but the labels are not used, as denoising only requires the structures. For downstream evaluation we use QM9, OC20 and DES15K. DES15K contains pairs of molecules exhibiting dimer interactions; each pair is labelled with its interaction energy computed using the gold-standard CCSD(T) method (Bartlett & Musiał, 2007). CCSD(T) is usually both more expensive and more accurate than DFT, which is used for all aforementioned datasets. See Appendix D for further details and a discussion about the choice of using DFT-generated structures for pre-training.

Figure 2 (right)

shows what percentage of elements appearing in each of QM9, OC20 and DES15K also appear in PCQM4Mv2. Whereas QM9 is fully covered by PCQM4Mv2, we observe that DES15K has less element overlap with PCQM4Mv2, and less than 30% of elements in OC20 are contained in PCQM4Mv2. This is owing to the fact that surface molecules in OC20 are inorganic lattices, none of which appear in PCQM4Mv2. This suggests that we can expect the least transfer from PCQM4Mv2 to OC20. We also compare PCQM4Mv2 and QM9 in terms of the molecular compositions, i.e. the number of atoms of each element, that appear in each. Due to the presence of isomers, both datasets contain multiple molecules with the same composition. For each molecular composition in QM9, Figure 2 (left) shows its frequency in both QM9 and PCQM4Mv2. We observe that most molecular compositions in QM9 also appear in PCQM4Mv2. We also remark that since pre-training is self-supervised using only unlabelled structures, test set contamination is not possible; in fact, PCQM4Mv2 does not have most of the labels in QM9.

Training setup. GNS/GNS-TAT were implemented in JAX (Bradbury et al., 2018).

4.2. RESULTS ON QM9

We evaluate two variants of our model on QM9 in Table 1: GNS-TAT with Noisy Nodes trained from a random initialization versus from pre-trained parameters. Pre-training is done on PCQM4Mv2 via denoising. For best performance on QM9, we found that additionally using atom type masking and prediction during pre-training helped (Hu et al., 2020a). Starting from a single pre-trained model, we fine-tune a separate model for each of the 12 targets, as is usually done on QM9. This is repeated for three seeds (including pre-training). Following customary practice, hyperparameters, including the noise scale for denoising during pre-training and fine-tuning, are tuned on the HOMO target and then kept fixed for all other targets.
We first observe that GNS-TAT with Noisy Nodes performs competitively with other models and significantly improves upon GNS with Noisy Nodes, revealing the benefit of the TAT modifications. Utilizing pre-training then further improves performance across all targets, achieving a new state-of-the-art compared to prior work on 10 out of 12 targets. Interestingly, for the electronic spatial extent target R², we found GNS-TAT to perform worse than other models, which may be due to the optimal noise scale differing from that of the other targets.

4.3. RESULTS ON OC20

[Figure 3 caption fragment, displaced in extraction: see Table 9 for a comparison to other models in the literature. Right: test performance curves for predicting interaction energies of dimer geometries in the DES15K dataset with GNS-TAT; "PT" and "NN" stand for pre-training and Noisy Nodes respectively.]

Next, we consider the Open Catalyst 2020 benchmark, focusing on the downstream task of predicting the relaxed energy from the initial structure (IS2RE). We compared GNS with Noisy Nodes trained from scratch versus using pre-trained parameters. We experimented with two options for pre-training: (1) pre-training via denoising on PCQM4Mv2, and (2) pre-training via denoising on OC20 itself. For the latter, we follow the approach of Godwin et al. (2022) of letting the denoising target be the relaxed structure, while the perturbed input is a random interpolation between the initial and relaxed structures with added Gaussian noise; this corresponds to the IS2RS task with additional noise. As shown in Figure 3 (left), pre-training on PCQM4Mv2 offers no benefit for validation performance on IS2RE; however, pre-training on OC20 leads to considerably faster convergence but the same final performance. The lack of transfer from PCQM4Mv2 to OC20 is likely due to the difference in nature of the two datasets and the small element overlap, as discussed in Section 4.1 and Figure 2 (right). On the other hand, the faster convergence from using parameters pre-trained on OC20 suggests that denoising learned meaningful features. Unsurprisingly, the final performance is unchanged since the upstream and downstream datasets are the same in this case, so pre-training with denoising is identical to the auxiliary task of applying Noisy Nodes. The performance achieved is also competitive with other models in the literature, as shown in Table 9.

4.4. RESULTS ON DES15K

In our experiments so far, all downstream tasks were based on DFT-generated datasets. While DFT calculations are more expensive than using neural networks, they are relatively cheap compared to even higher quality methods such as CCSD(T) (Bartlett & Musiał, 2007) . In this section, we evaluate how useful pre-training on DFT-generated structures from PCQM4Mv2 is when fine-tuning on the recent DES15K dataset which contains higher quality CCSD(T)-generated interaction energies. Moreover, unlike QM9, inputs from DES15K are systems of two interacting molecules and the dataset contains only around 15,000 examples, rendering it more challenging. We compare the test performance on DES15K achieved by GNS-TAT with Noisy Nodes when trained from scratch versus using pre-trained parameters from PCQM4Mv2. As a baseline, we also include pre-training on PCQM4Mv2 using 2D-based AttrMask (Hu et al., 2020a ) by masking and predicting atomic numbers. Figure 3 (right) shows that using Noisy Nodes significantly improves performance compared to training from scratch, with a further improvement resulting from using pre-training via denoising. AttrMask underperforms denoising since it likely does not fully exploit the 3D structural information. Importantly, this shows that pre-training by denoising structures obtained through relatively cheap methods such as DFT can even be beneficial when fine-tuning on more expensive and smaller downstream datasets. See Appendix G.1 for similar results on another architecture.

5. ANALYSIS

5.1. PRE-TRAINING A DIFFERENT ARCHITECTURE

To explore whether pre-training is beneficial beyond GNS/GNS-TAT, we applied pre-training via denoising to the TorchMD-NET architecture (Thölke & De Fabritiis, 2022). TorchMD-NET is a transformer-based architecture whose layers maintain per-atom scalar features x_i ∈ R^F and vector features v_i ∈ R^{3×F}, where F is the feature dimension, which are updated in each layer using a self-attention mechanism. We implemented denoising by using gated equivariant blocks (Weiler et al., 2018; Schütt et al., 2021) applied to the processed scalar and vector features. The resulting vector features are then used as the noise prediction.
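The reason an equivariant head can serve as a noise predictor is that gating rotation-equivariant vector features by rotation-invariant scalars preserves equivariance. A toy numerical check of this property (a drastic simplification of a gated equivariant block, not TorchMD-NET code):

```python
import numpy as np

def gated_vector_head(s, v, w_gate):
    """Hypothetical simplification of a gated equivariant block: scale each
    vector channel by a gate computed only from the rotation-invariant scalars."""
    gate = np.tanh(s @ w_gate)  # (F,) invariant per-channel gates
    return v * gate             # (3, F) equivariant output

rng = np.random.default_rng(0)
F = 4
s, v = rng.standard_normal(F), rng.standard_normal((3, F))
w = rng.standard_normal((F, F))
# a rotation about the z-axis
th = 0.7
R = np.array([[np.cos(th), -np.sin(th), 0.0],
              [np.sin(th),  np.cos(th), 0.0],
              [0.0,         0.0,        1.0]])
# rotating the input vectors rotates the output the same way (equivariance)
assert np.allclose(gated_vector_head(s, R @ v, w), R @ gated_vector_head(s, v, w))
```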

5.2. VARYING DATASET SIZES

We also investigate how downstream test performance on the HOMO target in QM9 varies as a function of the number of upstream and downstream training examples. First, we compare the performance of GNS-TAT with Noisy Nodes either trained from scratch or using pre-trained parameters for different numbers of training examples from QM9; we also include the performance of GNS-TAT alone. As shown in Figure 4 (left), pre-training improves the downstream performance for all dataset sizes, and the gap in test MAE grows as the amount of downstream training data shrinks. Second, we assess the effect of varying the amount of pre-training data while fixing the downstream dataset size for both GNS and GNS-TAT, as shown in Figure 4 (middle). For both models, we find that downstream performance generally improves as upstream data increases, saturating for GNS-TAT; more upstream data can yield better-quality representations.

5.3. VARYING MODEL SIZE

We study the benefit of pre-training as models are scaled up on large downstream datasets. Recall that the S2EF dataset in OC20 contains around 130 million DFT evaluations for catalytic systems, providing three orders of magnitude more training data than QM9. We compare the performance of four GNS models with sizes ranging from 10 million to 1.2 billion parameters, scaled up by increasing the hidden layer sizes in the MLPs. Each is pre-trained via denoising using the trajectories provided for the IS2RE/IS2RS tasks, as described in Section 4.3. We also compare this to a 130-million-parameter variant of GNS trained from scratch. As shown in Figure 4 (right), the pre-trained models continue to benefit from larger model sizes. We also observe that pre-training is beneficial, as the model trained from scratch underperforms in comparison: the 130-million-parameter model trained from scratch is outperformed by a pre-trained model of less than half its size.

As shown in Section 3.2.1, denoising structures corresponds to learning an approximate force field directly from equilibrium structures. We therefore explore whether pre-training via denoising also improves models trained to predict atomic forces. We compare the performance of TorchMD-NET for force prediction on the MD17 (aspirin) dataset with and without pre-training on PCQM4Mv2. Table 3 shows that pre-training improves force prediction. We also assess the effect of pre-training for force prediction on the OC20 dataset in Appendix G.3, similarly finding an improvement due to pre-training.

Finally, we perform an experiment to assess how useful the features learned by pre-training are when they are not fine-tuned for the downstream task but kept fixed. Specifically, on the HOMO target in QM9, we freeze the backbone of the model and fine-tune only the decoder (cf. Appendix A).
As a baseline, we compare to using randomly initialized parameters for the model's backbone, which allows us to isolate how useful the pre-trained features are. As described in Appendix A, the decoder is a simple module involving no message-passing. Figure 5 shows that training only the decoder while keeping the pre-trained parameters fixed results in a test MAE of 40 meV, which is worse than fine-tuning the entire model but substantially better than the >100 meV test MAE resulting from training the decoder when the remaining parameters are randomly initialized. This suggests that the features learned by denoising are more discriminative for downstream prediction than random features. We note that training only the decoder is also substantially faster than training the entire network: one batch on a single V100 GPU takes 15 ms, which is 50× faster than one batch using 16 TPUs for the full network.

6. LIMITATIONS & FUTURE WORK

We have shown that pre-training can significantly improve performance for various tasks. One additional advantage of pre-trained models is that they can be shared in the community, allowing practitioners to fine-tune models on their datasets. However, unlike vision and NLP, molecular networks vary widely and the community has not yet settled on a "standard" architecture, making pre-trained weights less reusable. Moreover, the success of pre-training inevitably depends on the relationship between the upstream and downstream datasets. In the context of molecular property prediction, understanding what aspects of the upstream data distribution must match the downstream data distribution for transfer is an important direction for future work. More generally, pre-training models on large datasets incurs a computational cost. However, our results show that pre-training for 3D molecular prediction does not require the same scale as large NLP and vision models. We discuss considerations on the use of compute and broader impact in Appendix C.

7. CONCLUSION

We investigated pre-training neural networks by denoising in the space of 3D molecular structures. We showed that denoising in this context is equivalent to learning a force field, motivating its ability to learn useful representations and shedding light on successful applications of denoising in other works (Godwin et al., 2022) . This technique enabled us to utilize existing large datasets of 3D structures for improving performance on various downstream molecular property prediction tasks, setting a new SOTA in some cases such as QM9. More broadly, this bridges the gap between the utility of pre-training in vision/NLP and molecular property prediction from structures. We hope that this approach will be particularly impactful for applications of deep learning to scientific problems.

REPRODUCIBILITY STATEMENT

Our theoretical results, complete with assumptions and proofs, are included in Section 3.

A GNS AND GNS-TAT

In the ENCODER, we represent the set of atoms S = {(a_1, p_1), (a_2, p_2), …, (a_|S|, p_|S|)} as a directed graph G = (V, E), where V = {v_1, v_2, …, v_|S|} and E = {e_{i,j}}_{i,j} are the sets of "featurized" vertices and edges, respectively. Edges e_{i,j} ∈ E are constructed whenever the distance between the i-th and j-th atoms is less than the connectivity radius R_cut (given in Appendix F), in which case we connect v_i and v_j with a directed edge e_{i,j} from i to j that is a featurization of the displacement vector p_j − p_i. Meanwhile, for the i-th atom, v_i is given by a learnable vector embedding of the atomic number a_i. The PROCESSOR consists of L message-passing steps that produce intermediate graphs G_1, …, G_L (with the same connectivity structure as the initial one). Each of these steps computes the sum of a shortcut connection from the previous graph and the application of an Interaction Network (Battaglia et al., 2016). Interaction Networks first update each edge feature by applying an "edge update function" to a combination of the existing feature and the features of the two connected vertices. They then update each vertex feature by applying a "vertex update function" to a combination of the existing feature and the (new) edge features of incoming edges. In GNS, edge update functions are 3-hidden-layer fully-connected MLPs, using a "shifted softplus" (ssp(x) = log(0.5eˣ + 0.5)) activation function, applied to the concatenation of the relevant edge and vertex features, followed by a layer normalization layer. Vertex update functions are similar, but are applied to the concatenation of the relevant vertex feature and the sum over relevant edge features.
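The ENCODER's radius-graph construction can be sketched as follows; this is a naive O(|S|²) illustration (real implementations use spatial partitioning), with the raw displacement vector standing in for the edge featurization:

```python
import numpy as np

def radius_graph(positions, r_cut):
    """Directed edges e_{i,j} for all ordered pairs with |p_j - p_i| < r_cut (i != j),
    each carrying the displacement vector p_j - p_i as its feature."""
    senders, receivers, displacements = [], [], []
    n = len(positions)
    for i in range(n):
        for j in range(n):
            d = positions[j] - positions[i]
            if i != j and np.linalg.norm(d) < r_cut:
                senders.append(i)
                receivers.append(j)
                displacements.append(d)
    return np.array(senders), np.array(receivers), np.array(displacements)

pos = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [5.0, 0.0, 0.0]])
s, r, d = radius_graph(pos, r_cut=2.0)
# only the two nearby atoms are connected, with one directed edge each way
assert sorted(zip(s.tolist(), r.tolist())) == [(0, 1), (1, 0)]
```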
In our implementation of GNS we applied the same PROCESSOR in sequence three times (with shared parameters), with the output of each being decoded to produce a prediction and corresponding loss value. The loss for the whole model is then given by the average of these. (Test-time predictions are meanwhile computed using only the output of the final PROCESSOR.) The DECODER is responsible for computing graph-level and vertex-level predictions from the output of each PROCESSOR. Vertex-level predictions, such as the noise described in Section 3.2, are decoded using an MLP applied to each vertex feature. Graph-level predictions (e.g. energies) are produced by applying an MLP to each vertex feature, aggregating the result over vertices (via a sum), and then applying another MLP to the result.

Unfortunately, the GNS architecture violates two key assumptions of TAT. Firstly, the sums over edge features (performed in the vertex update functions) violate the assumption that all sum operations must be between the outputs of linear layers with independently sampled initial weights. Secondly, GNS networks have multiple inputs for which information needs to be independently preserved and propagated to the output, while DKS/TAT assumes a single input (or multiple inputs whose representations evolve independently in the network). To address these issues we introduce a new initialization scheme called "Edge-Delta", which initializes to zero the weights that multiply incoming vertex features in the edge update functions (and treats these weights as absent for the purpose of computing the initial weight variance). This approach is inspired by the use of the "Delta initialization" (Balduzzi et al., 2017) for convolutional residual networks. With Edge-Delta, GNS behaves at initialization like a residual network (with edge update functions acting as the residual branches), which we will refer to as the "edge network". Given the use of Edge-Delta, we can then apply TAT to GNS as follows.
First, we replace GNS's activation functions with TAT's transformed Leaky-ReLU activation functions (or "Tailored ReLUs"), which we compute with TAT's η parameter set to 0.8 and its "subnetwork maximizing function" defined on the edge network. We also replace each sum involving shortcut connections with a weighted sum, whose weights are 0.9 and √(1 − 0.9²) for the shortcut and non-shortcut branches respectively. We retain the use of layer normalization layers in the edge/vertex update functions, but move them to before the first fully-connected layer, as this seems to give the best performance. As required by TAT, we use a standard Gaussian fan-in initialization for the weights and a zero initialization for the biases, with Edge-Delta used only for the first linear layer of the edge update functions. Finally, we replace the sum used to aggregate vertex features in the DECODER with an average. See Figures 6 and 7 for an illustration of these changes.

We experimented with an analogous "Vertex-Delta" initialization, which initializes to zero the weights in the vertex update functions that multiply summed edge features, but found that Edge-Delta gave the best results. This might be because the edge features, which encode distances between vertices (and are best preserved with the Edge-Delta approach), are generally much more informative than the vertex features in molecular property prediction tasks. We also ran informal ablation studies, and found that each of our changes to the original GNS model contributed to improved results, with the use of Edge-Delta and weighted shortcut sums being especially important.

GNS vs. GNS-TAT on OC20. Our preliminary experiments used to determine appropriate choices for the use of Edge-Delta/Vertex-Delta initialization, the weighting of shortcut sums, and TAT hyperparameters such as η were conducted on QM9. This led to a parameterization of GNS-TAT which gave improved performance on QM9/DES15K, as shown earlier in Sections 4.2 and 4.4.
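The weighted shortcut sums above are chosen so that, for independent unit-variance branches, the output variance stays at one at initialization (0.9² + (1 − 0.9²) = 1); a quick numerical sketch:

```python
import numpy as np

def tat_shortcut_sum(shortcut, branch, w=0.9):
    """Weighted residual sum used in GNS-TAT: weights w and sqrt(1 - w^2)
    preserve variance when the two branches are independent with unit variance."""
    return w * shortcut + np.sqrt(1.0 - w ** 2) * branch

rng = np.random.default_rng(0)
a, b = rng.standard_normal(100_000), rng.standard_normal(100_000)
out = tat_shortcut_sum(a, b)
# empirical variance stays close to one for independent unit-variance inputs
assert abs(np.var(out) - 1.0) < 0.05
```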
However, we kept the original parameterization of GNS from Godwin et al. (2022) on OC20, as the model size differed from QM9/DES15K (see Appendix F) and re-tuning hyperparameters on OC20 would be computationally expensive. Using the same TAT settings as QM9 on OC20 results in approximately the same performance as the original GNS model, as shown in Table 4.

B DENOISING AS LEARNING A FORCE FIELD

We specify a molecular structure as $x = (x^{(1)}, \ldots, x^{(N)}) \in \mathbb{R}^{3N}$, where $x^{(i)} \in \mathbb{R}^3$ is the coordinate of atom $i$. Let $E(x)$ denote the total (potential) energy of $x$, such that $-\nabla_x E(x)$ gives the forces on the atoms. As discussed in Section 3.2.1, learning the force field, i.e. the mapping $x \mapsto -\nabla_x E(x)$, is a reasonable pre-training objective. Furthermore, learning the force field can be viewed as score-matching if we define the distribution $p_{\mathrm{physical}}(x) \propto \exp(-E(x))$ and observe that the score of $p_{\mathrm{physical}}$ is the force field: $\nabla_x \log p_{\mathrm{physical}}(x) = -\nabla_x E(x)$. However, a technical caveat is that $p_{\mathrm{physical}}$ is an improper probability density, because it cannot be normalized due to the translation invariance of $E$. Writing the translation of a structure as $x + t := (x^{(1)} + t, \ldots, x^{(N)} + t)$, where $t \in \mathbb{R}^3$ is a constant vector, we have $E(x + t) = E(x)$. This implies that the normalizing constant $\int_{\mathbb{R}^{3N}} p_{\mathrm{physical}}(x)\,dx$ diverges to infinity. To remedy this, we can restrict ourselves to the $(3N - 3)$-dimensional subspace $V := \{x \in \mathbb{R}^{3N} \mid \sum_i x^{(i)} = 0\} \subseteq \mathbb{R}^{3N}$ consisting of the mean-centered structures, over which $p_{\mathrm{physical}}$ can be defined as a normalizable distribution. Proceeding as in Section 3.2.1, let $x_1, \ldots, x_n \in V$ be a set of mean-centered equilibrium structures. For any $\tilde{x} \in V$, we now approximate $p_{\mathrm{physical}}(\tilde{x}) \approx q_\sigma(\tilde{x}) := \frac{1}{n} \sum_{i=1}^n q_\sigma(\tilde{x} \mid x_i)$, where the Gaussian distributions $q_\sigma(\tilde{x} \mid x_i)$ are defined on $V$ as
$$q_\sigma(\tilde{x} \mid x_i) = \frac{1}{(2\pi\sigma^2)^{(3N-3)/2}} \exp\!\left(-\frac{1}{2\sigma^2} \lVert \tilde{x} - x_i \rVert^2\right).$$
For convenience, we have expressed structures as vectors in the ambient space $\mathbb{R}^{3N}$; however, they are restricted to lie in the smaller space $V$. Note that the normalizing constant accounts for the fact that $V$ is $(3N - 3)$-dimensional. As before, we define $q_0(x) = \frac{1}{n} \sum_{i=1}^n \delta(x = x_i)$ to be the empirical distribution and $q_\sigma(\tilde{x}, x) = q_\sigma(\tilde{x} \mid x)\, q_0(x)$.
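In practice, restricting to the subspace $V$ amounts to mean-centering: both the structures and the sampled noise are projected onto $V$ by subtracting the per-structure mean. A minimal numpy sketch (function names are ours):

```python
import numpy as np

def mean_center(x):
    """Project a structure of shape (N, 3) onto V by removing its mean."""
    return x - x.mean(axis=0, keepdims=True)

def sample_noisy_structure(x, sigma, rng):
    """Draw a noisy structure from q_sigma(. | x), supported on V."""
    eps = mean_center(rng.normal(size=x.shape))  # the noise lives in V too
    return mean_center(x) + sigma * eps

rng = np.random.default_rng(0)
x = rng.normal(size=(12, 3))       # a toy 12-atom structure
x_t = sample_noisy_structure(x, 0.05, rng)
```

The noisy structure `x_t` has zero mean by construction, reflecting that no translational component of the noise can (or should) be predicted.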
The score-matching objective is given by
$$J_1(\theta) = \mathbb{E}_{q_\sigma(\tilde{x})} \left\lVert \mathrm{GNN}_\theta(\tilde{x}) - \nabla_{\tilde{x}} \log q_\sigma(\tilde{x}) \right\rVert^2, \qquad (5)$$
where the expectation is now over $V$. As shown by Vincent (2011), minimizing the objective above is equivalent to minimizing the following objective:
$$J_2(\theta) = \mathbb{E}_{q_\sigma(\tilde{x}, x)} \left\lVert \mathrm{GNN}_\theta(\tilde{x}) - \nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x) \right\rVert^2. \qquad (6)$$
This is recognized as a denoising objective, because $\nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x) = (x - \tilde{x})/\sigma^2$. A practical implication of this analysis is that the noise target $(x - \tilde{x})/\sigma^2 \in V$ should be mean-centered, which is intuitive since it is impossible to predict a translational component of the noise. We include a proof of the equivalence between Equations (5) and (6) for completeness.

Proposition 1 (Vincent (2011)). The minimization objectives $J_1(\theta)$ and $J_2(\theta)$ are equivalent.

Proof. We first observe:
$$J_1(\theta) = \mathbb{E}_{q_\sigma(\tilde{x})} \lVert \mathrm{GNN}_\theta(\tilde{x}) \rVert^2 - 2\,\mathbb{E}_{q_\sigma(\tilde{x})} \left[ \langle \mathrm{GNN}_\theta(\tilde{x}), \nabla_{\tilde{x}} \log q_\sigma(\tilde{x}) \rangle \right] + C_1,$$
$$J_2(\theta) = \mathbb{E}_{q_\sigma(\tilde{x})} \lVert \mathrm{GNN}_\theta(\tilde{x}) \rVert^2 - 2\,\mathbb{E}_{q_\sigma(\tilde{x}, x)} \left[ \langle \mathrm{GNN}_\theta(\tilde{x}), \nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x) \rangle \right] + C_2,$$
where $C_1, C_2$ are constants independent of $\theta$. Therefore, it suffices to show that the middle terms on the right-hand sides are equal. Since the expectations over $q_\sigma(\tilde{x})$ and $q_\sigma(\tilde{x}, x)$ are restricted to $V \subseteq \mathbb{R}^{3N}$, we apply a change of basis to write them as integrals against the $(3N - 3)$-dimensional Lebesgue measure. Pick an orthonormal basis $\{v_1, \ldots, v_{3N-3}\} \subseteq \mathbb{R}^{3N}$ for $V$ and let $P_V = [v_1, \ldots, v_{3N-3}] \in \mathbb{R}^{3N \times (3N-3)}$ be the projection matrix, so that $z = P_V^\top \tilde{x}$ expresses a mean-centered structure $\tilde{x}$ in terms of the coordinates of the chosen basis for $V$. Noting that $P_V$ has orthonormal columns and that it yields a bijection between $V$ and $\mathbb{R}^{3N-3}$, we calculate:
$$\begin{aligned}
\mathbb{E}_{q_\sigma(\tilde{x})} \left[ \langle \mathrm{GNN}_\theta(\tilde{x}), \nabla_{\tilde{x}} \log q_\sigma(\tilde{x}) \rangle \right]
&= \int_{\mathbb{R}^{3N-3}} q_\sigma(P_V z)\, \langle \mathrm{GNN}_\theta(P_V z), \nabla \log q_\sigma(P_V z) \rangle \, dz \\
&= \int_{\mathbb{R}^{3N-3}} q_\sigma(P_V z) \left\langle \mathrm{GNN}_\theta(P_V z), \frac{\nabla q_\sigma(P_V z)}{q_\sigma(P_V z)} \right\rangle dz \\
&= \int_{\mathbb{R}^{3N-3}} \langle \mathrm{GNN}_\theta(P_V z), \nabla q_\sigma(P_V z) \rangle \, dz \\
&= \int_{\mathbb{R}^{3N-3}} \left\langle \mathrm{GNN}_\theta(P_V z), \frac{1}{n} \sum_{i=1}^n \nabla q_\sigma(P_V z \mid x_i) \right\rangle dz \\
&= \int_{\mathbb{R}^{3N-3}} \left\langle \mathrm{GNN}_\theta(P_V z), \frac{1}{n} \sum_{i=1}^n q_\sigma(P_V z \mid x_i)\, \nabla \log q_\sigma(P_V z \mid x_i) \right\rangle dz \\
&= \int_{\mathbb{R}^{3N-3}} \frac{1}{n} \sum_{i=1}^n q_\sigma(P_V z \mid x_i)\, \langle \mathrm{GNN}_\theta(P_V z), \nabla \log q_\sigma(P_V z \mid x_i) \rangle \, dz \\
&= \mathbb{E}_{q_\sigma(\tilde{x}, x)} \left[ \langle \mathrm{GNN}_\theta(\tilde{x}), \nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x) \rangle \right].
\end{aligned}$$

Below, we elaborate on the differences between the equivalence of denoising and score-matching for generative modelling and the equivalence of denoising and force learning. Both equivalences rely on the result of Vincent (2011), although with different assumptions and different aims in practice.

As background, generative score-based modelling consists of modelling a target distribution $p(x)$ by learning an approximation of its score function $\nabla \log p(x)$ with a neural network, given a training set of samples $x_1, \ldots, x_n \sim p(x)$. Using the learned approximation of the score function, new samples can be generated using score-based MCMC techniques, such as Langevin MCMC. Learning the score function corresponds to the optimization objective
$$\min_\theta\; \mathbb{E}_{p(x)} \lVert \varphi_\theta(x) - \nabla \log p(x) \rVert^2.$$
Although the expectation over $p(x)$ can be approximated using the samples $x_i$, the true score $\nabla \log p(x)$ is unknown, rendering the objective intractable as is. Denoising score-matching (Vincent, 2011) approaches this by replacing $p(x)$ with an approximation that leads to a tractable objective: the noisy approximation $q_\sigma(\tilde{x}) := \int q_\sigma(\tilde{x} \mid x)\, p(x)\, dx$, where $q_\sigma(\tilde{x} \mid x) := \mathcal{N}(\tilde{x}; x, \sigma^2 I)$ is the noising distribution. The advantage of doing so is that score-matching with $q_\sigma(\tilde{x})$ is equivalent, by Proposition 1, to the tractable denoising objective
$$\min_\theta\; \mathbb{E}_{q_\sigma(\tilde{x} \mid x)\, p(x)} \lVert \varphi_\theta(\tilde{x}) - \nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x) \rVert^2.$$
However, in order for this approximation to be accurate, it is crucial that the noise level $\sigma$ is as small as possible, ideally zero, in which case the approximation is exact. As shown by Song & Ermon (2019), choosing a value of $\sigma$ close to zero leads to unreliable learned estimates of the score away from the high-probability-density regions of $p(x)$.
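The identity $\nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x) = (x - \tilde{x})/\sigma^2$, which turns score-matching into noise prediction, can be checked numerically for an ambient (non-centered) isotropic Gaussian via finite differences:

```python
import numpy as np

def log_gaussian(x_t, x, sigma):
    # log N(x_t; x, sigma^2 I), dropping the constant normalizer
    return -np.sum((x_t - x) ** 2) / (2 * sigma**2)

def numerical_score(x_t, x, sigma, h=1e-5):
    """Central-difference estimate of the gradient of log q_sigma(x_t | x) w.r.t. x_t."""
    g = np.zeros_like(x_t)
    for i in range(x_t.size):
        e = np.zeros_like(x_t)
        e.flat[i] = h
        g.flat[i] = (log_gaussian(x_t + e, x, sigma) - log_gaussian(x_t - e, x, sigma)) / (2 * h)
    return g

rng = np.random.default_rng(1)
x = rng.normal(size=6)
sigma = 0.1
x_t = x + sigma * rng.normal(size=6)
analytic = (x - x_t) / sigma**2   # the closed-form conditional score
```

Since the log-density is quadratic in `x_t`, the central difference matches the analytic score up to floating-point error.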
Therefore, they proposed multi-scale denoising, where the model simultaneously learns to approximate the scores of $q_{\sigma_i}$ for $i = 1, \ldots, L$, where $\sigma_1 > \sigma_2 > \cdots > \sigma_L$ is a sequence of decreasing noise levels such that $\sigma_L$ is close to zero and $q_{\sigma_L} \approx p$.

In the context of learning forces, we first observe that the score function of the Boltzmann distribution $p(x) = p_{\mathrm{physical}}(x) \propto \exp(-E(x))$ is equal to the forces: $\nabla \log p_{\mathrm{physical}}(x) = -\nabla E(x)$. Therefore, learning the forces corresponds to the score-matching objective
$$\min_\theta\; \mathbb{E}_{p_{\mathrm{physical}}(x)} \lVert \varphi_\theta(x) - \nabla \log p_{\mathrm{physical}}(x) \rVert^2.$$
Once again, the score function $\nabla \log p_{\mathrm{physical}}(x)$ is unknown and must be approximated. However, unlike before, we now have access to a set of equilibrium, energy-minimizing structures $x_1, \ldots, x_n$; these examples are local maxima of the density $p_{\mathrm{physical}}(x)$ rather than samples from it as in the previous case. Let $q_\sigma(\tilde{x})$ denote a mixture of Gaussians with noise scale $\sigma$ centered at $x_1, \ldots, x_n$. In order to capture the fact that $p_{\mathrm{physical}}(x)$ has local maxima at $x_1, \ldots, x_n$, we can approximate $p_{\mathrm{physical}}(\tilde{x})$ with $q_\sigma(\tilde{x})$. This distribution can also be written as $q_\sigma(\tilde{x}) = \int q_\sigma(\tilde{x} \mid x)\, q_0(x)\, dx$, where $q_0(x) = \frac{1}{n} \sum_{i=1}^n \delta(x = x_i)$, hence we can apply Proposition 1 again and reduce the optimization objective to denoising, as explained in Section 3.2.1. In contrast with the generative setting, a desirable noise scale $\sigma$ is here not close to zero, since $x_1, \ldots, x_n$ are local maxima of $p_{\mathrm{physical}}$ and, intuitively, the approximation should place some non-zero mass around each local maximum. As a result, we do not use multi-scale denoising in practice. It is also worth noting that, in contrast with generative modelling, our aim is to learn the score function itself (i.e. the forces) rather than to produce new samples by combining the learned score function with MCMC approaches.
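To illustrate why the score of the Gaussian mixture acts like a restoring force at nonzero $\sigma$, consider a 1-D mixture with two "equilibrium" centers: the score points back toward the nearest center. A toy sketch (not from the paper):

```python
import numpy as np

def mixture_score(x, centers, sigma):
    """Score d/dx log q_sigma(x) for a 1-D equal-weight Gaussian mixture."""
    d = x - centers                       # displacement from each center
    w = np.exp(-d**2 / (2 * sigma**2))    # unnormalized responsibilities
    w = w / w.sum()
    return float(np.sum(w * (-d) / sigma**2))

centers = np.array([-1.0, 1.0])
# Slightly right of the center at x = 1, the score (the "force") is negative,
# pushing back toward the equilibrium point; slightly left, it is positive.
```

A model trained to predict this score on noised samples therefore learns a force-field-like quantity, which is the intuition behind using denoising as a pre-training objective.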

C BROADER IMPACT

Who may benefit from this work? Molecular property prediction works towards a range of applications in materials design, chemistry, and drug discovery. Wider use of pre-trained models may accelerate progress in a similar manner to how pre-trained language and image models have enabled practitioners to avoid training on large datasets from scratch. Pre-training via denoising is simple to implement and can be immediately adopted to improve performance on a wide range of molecular property prediction tasks. As research converges on more standardized architectures, we expect shared pre-trained weights to become more common across the community.

Potential negative impact and ethical considerations. Pre-training models on large structure datasets incurs additional computational cost compared to training a potentially smaller model with less capacity from scratch. The environmental impact should be taken into account and mitigated when pre-training large models (Patterson et al., 2021). However, the computational cost of pre-training can and should be offset by sharing pre-trained embeddings when possible. Moreover, in our ablations of upstream dataset sizes for GNS-TAT, we observed that training on a subset of PCQM4Mv2 was sufficient for strong downstream performance. In future work, we plan to investigate how smaller subsets with sufficient diversity can be used to minimize computational requirements, e.g. by requiring fewer gradient steps.

D DATASETS

PCQM4Mv2. The main dataset we use for pre-training is PCQM4Mv2 (Nakata & Shimazaki, 2017) (license: CC BY 4.0), which contains 3,378,606 organic molecules, specified by their 3D structures at equilibrium (atom types and coordinates) calculated using DFT. Molecules in PCQM4Mv2 have around 30 atoms on average and vary in terms of their composition, with the dataset containing 22 unique elements in total. Each molecule in PCQM4Mv2 has only one label (unlike e.g. QM9, which has 12 labels per molecule); in any case, we do not use these labels, as denoising requires only the structures.

Usage of DFT-generated structures for pre-training. The structures in PCQM4Mv2 are obtained using DFT calculations, which in principle could also have been used to generate labels for molecular properties in DFT-generated downstream datasets such as QM9. However, there are multiple reasons why denoising remains a desirable pre-training objective in such settings. First, although there is a computational cost to generating datasets such as PCQM4Mv2, the dataset is now openly available, and part of our aim is to understand how to leverage such datasets for tasks on other datasets (analogous to how ImageNet is expensive to build, but once available, it is important to understand how it can improve downstream performance on other datasets). Second, even if PCQM4Mv2 contained all the labels in QM9, pre-training via denoising structures allows one to pre-train a single, label-agnostic model which can be individually fine-tuned on any of the targets in QM9. This is substantially cheaper than per-target pre-training, and the resulting pre-trained model is also reusable for differing needs and downstream tasks.
In other settings where the downstream dataset is generated using more expensive methods than DFT, pre-training via denoising DFT-relaxed structures remains beneficial, as shown by our experiment on the CCSD(T)-generated dataset DES15K, where denoising on PCQM4Mv2 improves performance on a downstream task computed at a higher level of theory. Generally, we emphasize that the methodology of denoising structures can be applied to any dataset of structures (regardless of whether they are computed using DFT or not). We hope that denoising will prove useful for learning representations by pre-training on structures obtained through other means, such as experimental data (where labeling may be expensive) and databases generated by other models such as AlphaFold, where only structures are available (Jumper et al., 2021).

E EXPERIMENT SETUP AND COMPUTE RESOURCES

Below, we list details on our experiment setup and the hardware resources used.

GNS & GNS-TAT. GNS-TAT training for QM9, PCQM4Mv2 and DES15K was done on a cluster of 16 TPU v3 devices and evaluation on a single V100 device. GNS training for OC20 was done on 8 TPU v4 devices, with the exception of the 1.2-billion-parameter variant of the model, which was trained on 64 TPU v4 devices. Pre-training on PCQM4Mv2 was executed for $3 \times 10^5$ gradient updates (approximately 1.5 days of training). Fine-tuning experiments were run until convergence for QM9 ($10^6$ gradient updates taking approximately 2 days) and DES15K ($10^5$ gradient updates taking approximately 4 hours), and stopped after $5 \times 10^5$ gradient updates on OC20 (2.5 days) to minimize hardware use (the larger models keep benefiting from additional gradient updates).

TorchMD-NET. We implemented denoising for TorchMD-NET on top of Thölke & De Fabritiis's [2022] open-source code.6 Models were trained on QM9 using data parallelism over two NVIDIA RTX 2080Ti GPUs. Pre-training on PCQM4Mv2 was done using three GPUs to accommodate the larger molecules while keeping the batch size approximately the same as for QM9. All hyperparameters except the learning rate schedule were kept fixed at their defaults. Pre-training took roughly 24 hours, whereas fine-tuning took around 16 hours.

Hyperparameter optimization. We note that effective pre-training via denoising requires sweeping noise values, as well as loss coefficients for denoising and atom type recovery. For GNS/GNS-TAT, we relied on the hyperparameters published by Godwin et al. (2022) but determined new noise values for pre-training and fine-tuning by tuning over the set of values {0.005, 0.01, 0.02, 0.05, 0.1} for each of PCQM4Mv2 and QM9 (on the HOMO energy target). We used the same values for DES15K without modification. We also ran a similar number of experiments to determine cosine cycle parameters for learning rates.
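The noise-value sweep described above amounts to a simple grid search; a scheduling sketch (`train_and_evaluate` is a stand-in for the actual training job, which returns a validation error to minimize):

```python
NOISE_VALUES = [0.005, 0.01, 0.02, 0.05, 0.1]  # grid from the paper

def sweep_noise(train_and_evaluate):
    """Run one job per noise value; return the value with the lowest validation error."""
    results = {sigma: train_and_evaluate(sigma) for sigma in NOISE_VALUES}
    best_sigma = min(results, key=results.get)
    return best_sigma, results
```

In a real setup each call would launch a full pre-training or fine-tuning run; here the grid structure is the only part taken from the paper.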



Note that PCQM4Mv2 is a new version of PCQM4M that now offers 3D structures. The earlier version of this dataset, which lacked 3D structures, was used for supervised pre-training (Ying et al., 2021), but to our knowledge, this is the first time the 3D structures from v2 have been used, and in a self-supervised manner.

GitHub repository: https://github.com/shehzaidi/pre-training-via-denoising.

Note that Edge-Delta initialization is compatible with TAT, since for the purposes of q/c value propagation, zero-initialized connections in the network can be treated as absent.

For the purposes of computing the subnetwork maximizing function, we ignore the rest of the network and consider only the edge network. While the layer normalization layer (which we move before the MLP) technically depends on the vertex features, this dependency can be ignored as long as the q values of these features are 1 (which will be true given the complete set of changes we make to the GNS architecture).

Available on GitHub at: https://github.com/torchmd/torchmd-net.



Figure 1: GNS-TAT pre-trained via denoising on PCQM4Mv2 outperforms prior work on QM9.

Figure2: Left: Frequency of compositions of molecules appearing in QM9 overlayed with the corresponding frequency in PCQM4Mv2. Each bar represents one molecular composition (e.g. one carbon atom, two oxygen atoms). Right: Percentage of elements appearing in QM9, DES15K, OC20 that also appear in PCQM4Mv2.

Figure 3: Left: Validation performance curves on the OC20 IS2RE task (ood_both split) with GNS. See Table 9 for a comparison to other models in the literature. Right: Test performance curves for predicting interaction energies of dimer geometries in the DES15K dataset with GNS-TAT. "PT" and "NN" stand for pre-training and Noisy Nodes respectively.

Figure 4: Left: Impact of varying the downstream dataset size for the HOMO target in QM9 with GNS-TAT. Middle: Impact of varying the upstream dataset size for the HOMO target in QM9. Right: Validation performance curves on the OC20 S2EF task (ood_both split) for different GNS model sizes. "PT" and "NN" stand for pre-training and Noisy Nodes respectively.

(Figure 5 legend: decoder training w/ random features; decoder training w/ PT features; full model fine-tuning.)

Figure 5: Training only the decoder results in significantly better performance with pre-trained instead of random features with GNS-TAT.

2.1 and Appendix B. The code to reproduce our experiments on the TorchMD-NET architecture is open source. We report the hardware setup with training times in Appendix E and detailed hyperparameter settings in Appendix F. Datasets are described with their respective licenses in Appendix D.

A ARCHITECTURAL DETAILS

A.1 STANDARD GNS

As our base model architecture we chose a Graph Net Simulator (GNS) (Sanchez-Gonzalez et al., 2020), which consists of an ENCODER, which constructs a graph representation from the input S; a PROCESSOR, a stack of repeated message-passing blocks that update the latent graph representation; and a DECODER, which produces predictions. Our implementation follows Godwin et al.'s [2022] modifications to enable molecular and graph-level property predictions, and has been shown to achieve strong results across different molecular prediction tasks without relying on problem-specific features.
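The encode-process-decode pattern, with a shared PROCESSOR applied several times, per-step decoding, and a loss averaged over the per-step predictions, can be sketched as follows (all component names and shapes are illustrative stand-ins, not the actual implementation):

```python
import numpy as np

def gns_forward(encoder, processor, decoder, inputs, num_steps=3):
    """Encode once, apply a shared PROCESSOR num_steps times, decode each state.

    Training averages the losses of all per-step predictions; test-time
    predictions use only the final step's output.
    """
    state = encoder(inputs)
    predictions = []
    for _ in range(num_steps):          # shared parameters across all steps
        state = processor(state)
        predictions.append(decoder(state))
    return predictions

def averaged_loss(predictions, target, loss_fn):
    return sum(loss_fn(p, target) for p in predictions) / len(predictions)

# Toy components: the state is a vector; the decoder sums it to a
# graph-level scalar (mirroring MLP -> sum-aggregate -> MLP in spirit).
preds = gns_forward(lambda x: np.asarray(x, float),
                    lambda s: s * 0.5,
                    lambda s: float(s.sum()),
                    [1.0, 2.0, 3.0])
```

The test-time prediction here would be `preds[-1]`, while training would backpropagate through `averaged_loss(preds, ...)`.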

A.2 GNS WITH TAILORED ACTIVATION TRANSFORMATION (GNS-TAT)

Figure 6: Diagram showing the edge update for a single step t of the PROCESSOR. Left: Edge update for GNS. Right: Edge update for GNS-TAT (with modifications shown in red).

Figure 7: Diagram showing the vertex update for a single step t of the PROCESSOR. Left: Vertex update for GNS. Right: Vertex update for GNS-TAT (with modifications shown in red).

QM9. QM9 (Ramakrishnan et al., 2014) (license: CC BY 4.0) is a dataset of approximately 130,000 small organic molecules containing up to nine heavy (C, N, O, F) atoms, specified by their structures. Each molecule has 12 different labels corresponding to different molecular properties, such as the highest occupied molecular orbital (HOMO) energy and the internal energy, which we use for fine-tuning. Following prior work, we randomly split the dataset into 114k examples for training, 10k examples for validation and 10k examples for testing.

OC20. Open Catalyst 2020 (Chanussot* et al., 2021) (OC20, license: CC Attribution 4.0) is a recent large benchmark containing trajectories of interacting surfaces and adsorbates that are relevant to catalyst discovery and optimization. This dataset contains three tasks: predicting the relaxed-state energy from the initial structure (IS2RE), predicting the relaxed structure from the initial structure (IS2RS), and predicting the energy and forces given the structure at any point in the trajectory (S2EF). For IS2RE and IS2RS, there are 460,000 training examples, where each data point is a trajectory of a surface-adsorbate pair starting with a high-energy initial structure that is relaxed towards a low-energy, equilibrium structure. For S2EF, there are 113 million examples of (non-equilibrium) structures with their associated energies and per-atom forces. We evaluate the models on the provided validation sets, some of which include out-of-distribution data relative to the training data, as detailed in Chanussot* et al. (2021).

DES15K. DES15K (Donchev et al., 2021) (license: CC0 1.0) is a small dataset containing around 15,000 interacting molecule pairs, specifically dimer geometries with non-covalent molecular interactions.
Each pair is labelled with the associated interaction energy computed using the coupled-cluster method with single, double, and perturbative triple excitations (CCSD(T)) (Bartlett & Musiał, 2007), which is widely regarded as the gold-standard method in electronic structure theory. We randomly split the dataset into 13k training examples, 1k validation examples and 1k testing examples.

Autoregressive or reconstruction-based approaches, such as ours, learn representations by requiring the model to predict aspects of the input graph (Hu et al., 2020a,b; Rong et al., 2020; Liu et al., 2019). Most methods in the current literature are not designed to handle 3D structural information, focusing instead on 2D graphs. The closest work to ours is GraphMVP

Ho et al., 2020; Hoogeboom et al., 2022; Xu et al., 2022; Shi et al., 2021). We also rely on this connection to show that denoising structures corresponds to learning a force field.

Equivariant neural networks for 3D molecular property prediction. Recently, the dominant approach for improving models for molecular property prediction from 3D structures has been the design of architectures that incorporate roto-translational inductive biases into the model, such that the outputs are invariant to translating and rotating the input atomic positions. A simple way to achieve this is to use roto-translation invariant features as inputs, such as inter-atomic distances (Schütt et al., 2017; Unke & Meuwly, 2019), angles (Klicpera et al., 2020b;a; Shuaibi et al., 2021; Liu et al., 2022b), or the principal axes of inertia (Godwin et al., 2022). There is also a broad literature on equivariant neural networks, whose intermediate activations transform accordingly with roto-translations of the inputs, thereby naturally preserving inter-atomic distance and orientation information. Such models can be broadly categorized into those that are specifically designed for molecular property prediction (Thölke & De Fabritiis, 2022; Schütt et al., 2021; Batzner et al., 2021; Anderson et al., 2019; Miller et al., 2020) and general-purpose architectures (Satorras et al., 2021; Finzi et al., 2020; Hutchinson et al., 2021; Thomas et al., 2018; Kondor et al., 2018; Brandstetter et al., 2021). Our pre-training technique is architecture-agnostic, and we show that it can be applied to enhance performance in both

Results on QM9 comparing the performance of GNS-TAT + Noisy Nodes (NN) with and without pre-training on PCQM4Mv2 (averaged over three seeds) with other baselines.

Performance of TorchMD-NET with Noisy Nodes and pre-training on PCQM4Mv2.



GNS vs. GNS-TAT energy prediction MAE on OC20. TAT hyperparameters were tuned with experiments on QM9 and then kept fixed for OC20.

B.1 DIFFERENCES BETWEEN DENOISING FOR GENERATIVE MODELLING VS. LEARNING FORCES

Denoising has recently seen widespread use as a means for generative score-based modelling (e.g. Song & Ermon, 2019; 2020; Ho et al., 2020) due to the equivalence between denoising and score-matching (Vincent, 2011).

F HYPERPARAMETERS

We report the main hyperparameters used for GNS and GNS-TAT below. 

G.1 PERFORMANCE OF TORCHMD-NET ON DES15K

In addition to the experiments involving GNS-TAT on DES15K in Section 4.4, we also consider the performance of TorchMD-NET on DES15K with and without pre-training on PCQM4Mv2. As shown in Table 8, TorchMD-NET outperforms GNS-TAT when trained from scratch, and pre-training then yields a further boost in performance, as with GNS-TAT.

As shown in Figure 3 (left), the model pre-trained on OC20 itself achieves SOTA performance with faster convergence than all other GNS variants. Note that since we pre-train on OC20 itself, the pre-trained model performs equally well at convergence as the model trained from scratch with Noisy Nodes (cf. Section 4.3). Recall that in Section 5.4 we explored whether pre-training also improves force prediction models, given the link between denoising and learning forces described in Section 3.2.1. In this section, we consider a second experiment for force models. The OC20 dataset contains a force prediction task (S2EF), where a model is trained to predict point-wise energy and forces (each point being a single DFT evaluation during a relaxation trajectory) from a given 3D structure. Tables 10 and 11 show the performance of GNS when trained from scratch vs. pre-trained via denoising on equilibrium structures in OC20, as described in Section 4.3. We show two metrics for measuring force prediction performance on each of the four validation datasets: mean absolute error (lower is better) and cosine similarity (higher is better). We observe that the pre-trained model improves upon the model trained from scratch on both metrics and all four validation datasets, with improvements of up to 15%.

