MOLE-BERT: RETHINKING PRE-TRAINING GRAPH NEURAL NETWORKS FOR MOLECULES

Abstract

Recent years have witnessed the prosperity of pre-training graph neural networks (GNNs) for molecules. Typically, atom types as node attributes are randomly masked, and GNNs are then trained to predict the masked types, as in AttrMask (Hu et al., 2020), following the Masked Language Modeling (MLM) task of BERT (Devlin et al., 2019). However, unlike MLM with its large vocabulary, AttrMask pre-training does not learn informative molecular representations due to the small and unbalanced atom 'vocabulary'. To remedy this problem, we propose a variant of VQ-VAE (Van Den Oord et al., 2017) as a context-aware tokenizer that encodes atom attributes into chemically meaningful discrete codes. This enlarges the atom vocabulary and mitigates the quantitative divergence between dominant atoms (e.g., carbon) and rare ones (e.g., phosphorus). With the enlarged atom 'vocabulary', we propose a novel node-level pre-training task, dubbed Masked Atoms Modeling (MAM), which randomly masks some discrete codes and then pre-trains GNNs to predict them. MAM also mitigates another issue of AttrMask, namely negative transfer, and it can be easily combined with various pre-training tasks to improve their performance. Furthermore, we propose triplet masked contrastive learning (TMCL) for graph-level pre-training to model the heterogeneous semantic similarity between molecules for effective molecule retrieval. MAM and TMCL constitute a novel pre-training framework, Mole-BERT, which can match or outperform state-of-the-art methods in a fully data-driven manner. We release the code at https://github.com/junxia97/Mole-BERT.
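The tokenizer's core step can be illustrated with the standard VQ-VAE quantization: each context-aware atom embedding is assigned to its nearest entry in a learned codebook, and the resulting index serves as the atom's discrete code. The sketch below is a minimal, hedged illustration using NumPy with a randomly initialized codebook; the codebook size (512) and embedding dimension (16) are illustrative choices, not the paper's actual hyperparameters, and in practice the embeddings would come from a GNN encoder and the codebook would be trained end-to-end.

```python
import numpy as np

def quantize(atom_embeddings, codebook):
    """Assign each atom embedding to its nearest codebook entry.

    atom_embeddings: (n_atoms, dim) context-aware embeddings (here random,
                     in the paper produced by a GNN encoder).
    codebook:        (n_codes, dim) learned code vectors.
    Returns the discrete code indices and the quantized embeddings.
    """
    # Squared Euclidean distance between every atom and every code: (n_atoms, n_codes)
    dists = ((atom_embeddings[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    codes = dists.argmin(axis=1)          # discrete 'atom vocabulary' index per atom
    return codes, codebook[codes]

rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 16))     # illustrative enlarged atom 'vocabulary'
emb = rng.normal(size=(8, 16))            # embeddings for 8 atoms of one molecule
codes, quantized = quantize(emb, codebook)
print(codes.shape, quantized.shape)       # (8,) (8, 16)
```

For MAM, a random subset of these code indices would then be masked and the GNN pre-trained to predict them, analogous to predicting masked word tokens in BERT.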

1. INTRODUCTION

Pre-trained Language Models (PLMs) have revolutionized the landscape of Natural Language Processing (NLP) (Qiu et al., 2020b; Zheng et al., 2022). A representative example is BERT (Devlin et al., 2019), whose Masked Language Modeling (MLM) task first randomly masks some proportion of the tokens within a text and then recovers the masked tokens from the encoding of the corrupted text. Although BERT also includes a next-sentence-prediction pre-training task, MLM has been verified as the key recipe behind BERT's success (Liu et al., 2019). Inspired by this, the MLM-style pre-training task has been extended to many other domains (Hu et al., 2020; He et al., 2022).

Molecules can be naturally represented as graphs with atoms as nodes and chemical bonds as edges. Hence, Graph Neural Networks (GNNs) can be utilized to process molecular graph data. To exploit the abundant unlabeled molecules, tremendous efforts have been devoted to pre-training GNNs for molecular graph representations (Xia et al., 2022e). The pioneering work (Hu et al., 2020) on this topic first pre-trains GNNs with an MLM-style pre-training task (AttrMask) on large-scale unlabeled molecular graph datasets. Specifically, they randomly mask some proportion of the atoms and then pre-train the models to predict them. AttrMask has emerged as a fundamental pre-training task, and many subsequent works adopt it as a sub-task for pre-training (Zhang et al., 2021; Li et al., 2021a). During the fine-tuning stage, researchers replace the top layer of the pre-trained models with a task-specific sub-network and train the new model on the labeled molecules of the downstream tasks. However,

† Equal Contribution, * Corresponding Author

