PRE-TRAINING VIA DENOISING FOR MOLECULAR PROPERTY PREDICTION

Abstract

Many important problems involving molecular property prediction from 3D structures have limited data, posing a generalization challenge for neural networks. In this paper, we describe a pre-training technique based on denoising that achieves a new state-of-the-art in molecular property prediction by utilizing large datasets of 3D molecular structures at equilibrium to learn meaningful representations for downstream tasks. Relying on the well-known link between denoising autoencoders and score-matching, we show that the denoising objective corresponds to learning a molecular force field (arising from approximating the Boltzmann distribution with a mixture of Gaussians) directly from equilibrium structures. Our experiments demonstrate that using this pre-training objective significantly improves performance on multiple benchmarks, achieving a new state-of-the-art on the majority of targets in the widely used QM9 dataset. Our analysis then provides practical insights into the effects of different factors (dataset sizes, model size and architecture, and the choice of upstream and downstream datasets) on pre-training.

1. INTRODUCTION

The success of the best performing neural networks in vision and natural language processing (NLP) relies on pre-training the models on large datasets to learn meaningful features for downstream tasks (Dai & Le, 2015; Simonyan & Zisserman, 2014; Devlin et al., 2018; Brown et al., 2020; Dosovitskiy et al., 2020). For molecular property prediction from 3D structures (a point cloud of atomic nuclei in R^3), the problem of how to similarly learn such representations remains open. For example, none of the best models on the widely used QM9 benchmark use any form of pre-training (e.g. Klicpera et al., 2020a; Liu et al., 2022b; Schütt et al., 2021; Thölke & De Fabritiis, 2022), in stark contrast with vision and NLP. Effective methods for pre-training could have a significant impact on fields such as drug discovery and material science.

In this work, we focus on the problem of how large datasets of 3D molecular structures can be utilized to improve performance on downstream molecular property prediction tasks that also rely on 3D structures as input. We address the question: how can one exploit large datasets like PCQM4Mv2,¹ which contain over 3 million structures, to improve performance on datasets such as DES15K that are orders of magnitude smaller? Our answer is a form of self-supervised pre-training that generates useful representations for downstream prediction tasks, leading to state-of-the-art (SOTA) results. Inspired by recent advances in noise regularization for graph neural networks (GNNs) (Godwin et al., 2022), our pre-training objective is based on denoising in the space of structures (and is hence self-supervised). Unlike existing pre-training methods, which largely focus on 2D graphs, our approach targets the setting where the downstream task involves 3D point clouds defining the molecular structure.
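The pre-training recipe just described can be sketched in a few lines. The following is our own minimal illustration (the `zero_model` placeholder and the array shapes are hypothetical, standing in for a real GNN and molecule): perturb equilibrium atom coordinates with Gaussian noise, then regress the network's output onto the noise that was added.

```python
import numpy as np

rng = np.random.default_rng(0)

def denoising_loss(model, coords, sigma=0.1):
    """Denoising pre-training objective (sketch): add Gaussian noise to
    equilibrium atom coordinates and regress the model's output onto
    the noise that was added."""
    noise = sigma * rng.standard_normal(coords.shape)  # shape (n_atoms, 3)
    predicted = model(coords + noise)                  # same shape as coords
    return float(np.mean((predicted - noise) ** 2))    # L2 regression loss

# Placeholder standing in for a GNN: always predicts zero noise.
zero_model = lambda x: np.zeros_like(x)

coords = rng.standard_normal((5, 3))  # toy "molecule": 5 atoms in R^3
loss = denoising_loss(zero_model, coords, sigma=0.1)
# For the zero predictor, the expected loss is sigma^2 = 0.01.
```

In practice the `model` would be an equivariant GNN and the loss would be minimized over a large set of equilibrium structures before fine-tuning on the downstream property.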
Relying on the well-known connection between denoising and score-matching (Vincent, 2011; Song & Ermon, 2019; Ho et al., 2020), we show that the denoising objective is equivalent to learning a particular force field, adding a new interpretation of denoising in the context of molecules and shedding light on how it aids representation learning.
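To make this connection concrete, consider the standard formulation (our notation, following the Vincent (2011) identity rather than equations reproduced from this section): given equilibrium structures $x_1, \dots, x_N$, the denoising objective and the Gaussian-smoothed empirical density are

```latex
\mathcal{L}(\theta)
  = \mathbb{E}_{x \sim p_{\mathrm{data}},\; \varepsilon \sim \mathcal{N}(0, I)}
    \left\| \varepsilon_\theta(x + \sigma \varepsilon) - \varepsilon \right\|^2 ,
\qquad
q_\sigma(\tilde{x}) = \frac{1}{N} \sum_{i=1}^{N} \mathcal{N}\!\bigl(\tilde{x};\, x_i,\, \sigma^2 I\bigr),
\qquad
\varepsilon_\theta^{\ast}(\tilde{x}) = -\sigma\, \nabla_{\tilde{x}} \log q_\sigma(\tilde{x}).
```

That is, up to a scale factor the optimal network outputs the score of the mixture of Gaussians $q_\sigma$, which is exactly the force field of the corresponding Gaussian-mixture approximation to the Boltzmann distribution over structures.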

[Figure 1 schematic: a graph neural network receives atom coordinates with added noise and is trained to predict that noise; the pre-trained network is then fine-tuned on properties such as HOMO energy.]

The contributions of our work are summarized as follows:

• We investigate a simple and effective method for pre-training via denoising in the space of 3D structures, with the aim of improving downstream molecular property prediction from such 3D structures. Our denoising objective is shown to be related to learning a specific force field.

• Our experiments demonstrate that pre-training via denoising significantly improves performance on multiple challenging datasets that vary in size, nature of task, and molecular composition. This establishes that denoising over structures transfers successfully to molecular property prediction, setting, in particular, a new state-of-the-art on 10 out of 12 targets in the widely used QM9 dataset.
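The force-field interpretation of the optimal denoiser can be checked numerically. The toy example below is our own illustration (a one-dimensional "dataset" of equilibrium positions, not from the paper): the posterior-weighted optimal denoiser agrees with $-\sigma$ times a numerical gradient of $\log q_\sigma$.

```python
import numpy as np

# Toy 1-D "equilibrium structures" and noise scale (illustrative values).
xs = np.array([-1.0, 0.3, 2.0])
sigma = 0.5

def log_q(x):
    # log of the Gaussian mixture q_sigma obtained by smoothing the
    # empirical distribution over xs with N(0, sigma^2); x-independent
    # normalization constants are dropped (they do not affect the score).
    return np.log(np.mean(np.exp(-(x - xs) ** 2 / (2 * sigma ** 2))))

def optimal_denoiser(x):
    # Optimal noise prediction eps*(x) = E[eps | noisy position x],
    # computed from the posterior weights over mixture components.
    w = np.exp(-(x - xs) ** 2 / (2 * sigma ** 2))
    w /= w.sum()
    return np.sum(w * (x - xs)) / sigma

x, h = 0.9, 1e-5
score = (log_q(x + h) - log_q(x - h)) / (2 * h)  # numerical d/dx log q_sigma
# Vincent's identity: eps*(x) = -sigma * score(x).
assert abs(optimal_denoiser(x) + sigma * score) < 1e-6
```

The same identity holds coordinate-wise for point clouds in R^3, which is why minimizing the denoising loss amounts to learning the (scaled) force field of the smoothed distribution over structures.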

2. RELATED WORK

Pre-training of GNNs. Various recent works have formulated methods for pre-training using graph data (Liu et al., 2021b; Hu et al., 2020a; Xie et al., 2021; Kipf & Welling, 2016), rather than 3D point clouds of atomic nuclei as in this paper. Approaches based on contrastive methods rely on learning representations by contrasting different views of the input graph (Sun et al., 2019; Veličković et al., 2019; You et al., 2020; Liu et al., 2021a), or bootstrapping (Thakoor et al., 2021). Autoregressive or reconstruction-based approaches, such as ours, learn representations by requiring the model to predict aspects of the input graph (Hu et al., 2020a;b; Rong et al., 2020; Liu et al., 2019). Most methods in the current literature are not designed to handle 3D structural information, focusing instead on 2D graphs. The closest work to ours is GraphMVP (Liu et al., 2021a), where 3D structure is treated as one view of a 2D molecule for the purpose of upstream contrastive learning. Their work focuses on downstream tasks that only involve 2D information, while our aim is to improve downstream models for molecular property prediction from 3D structures. After the release of this pre-print, similar ideas have been studied by Jiao et al. (2022) and Liu et al. (2022a).

Denoising, representation learning and score-matching. Noise has long been known to improve generalization in machine learning (Sietsma & Dow, 1991; Bishop, 1995). Denoising autoencoders have been used to effectively learn representations by mapping corrupted inputs to original inputs (Vincent et al., 2008; 2010). Specific to GNNs (Battaglia et al., 2018; Scarselli et al., 2009; Bronstein et al., 2017), randomizing input graph features has been shown to improve performance (Hu et al., 2020a; Sato et al., 2021). Applications to physical simulation also involve corrupting the state with Gaussian noise (Sanchez-Gonzalez et al., 2018; 2020; Pfaff et al., 2020).
Our work builds on Noisy Nodes (Godwin et al., 2022), which incorporates denoising as an auxiliary task to improve performance, indicating the effectiveness of denoising for molecular property prediction (cf. Section 3.2.2). Denoising is also closely connected to score-matching (Vincent, 2011), which has become popular in generative modeling (Song & Ermon, 2019; Ho et al., 2020).



¹ Note that PCQM4Mv2 is a new version of PCQM4M that now offers 3D structures.



Figure 1: GNS-TAT pre-trained via denoising on PCQM4Mv2 outperforms prior work on QM9.

Figure 1 illustrates performance on one of the targets in QM9.

• We make improvements to a common GNN, in particular showing how to apply Tailored Activation Transformation (TAT) (Zhang et al., 2022) to Graph Network Simulators (GNS) (Sanchez-Gonzalez et al., 2020), which is complementary to pre-training and further boosts performance.

• We analyze the benefits of pre-training by gaining insights into the effects of dataset size, model size and architecture, and the relationship between the upstream and downstream datasets.

