MOLE-BERT: RETHINKING PRE-TRAINING GRAPH NEURAL NETWORKS FOR MOLECULES

Abstract

Recent years have witnessed the prosperity of pre-training graph neural networks (GNNs) for molecules. Typically, atom types serving as node attributes are randomly masked, and GNNs are then trained to predict the masked types, as in AttrMask (Hu et al., 2020), following the Masked Language Modeling (MLM) task of BERT (Devlin et al., 2019). However, unlike MLM with a large vocabulary, AttrMask pre-training fails to learn informative molecular representations due to its small and unbalanced atom 'vocabulary'. To address this problem, we propose a variant of VQ-VAE (Van Den Oord et al., 2017) as a context-aware tokenizer that encodes atom attributes into chemically meaningful discrete codes. This enlarges the atom vocabulary and mitigates the quantitative divergence between dominant atoms (e.g., carbons) and rare atoms (e.g., phosphorus). With the enlarged atom 'vocabulary', we propose a novel node-level pre-training task, dubbed Masked Atoms Modeling (MAM), which randomly masks some of the discrete codes and then pre-trains GNNs to predict them. MAM also mitigates another issue of AttrMask, namely negative transfer, and can be easily combined with various pre-training tasks to improve their performance. Furthermore, we propose triplet masked contrastive learning (TMCL) for graph-level pre-training to model the heterogeneous semantic similarity between molecules for effective molecule retrieval. MAM and TMCL together constitute a novel pre-training framework, Mole-BERT, which can match or outperform state-of-the-art methods in a fully data-driven manner. We release the code at https://github.com/junxia97/Mole-BERT.

1. INTRODUCTION

Pre-training Language Models (PLMs) have revolutionized the landscape of Natural Language Processing (NLP) (Qiu et al., 2020b; Zheng et al., 2022). The representative one is BERT (Devlin et al., 2019), whose Masked Language Modeling (MLM) task first randomly masks some proportion of the tokens within a text and then recovers the masked tokens from the encoding of the corrupted text. Although BERT also includes a next-sentence-prediction pre-training task, MLM has been verified as the key recipe for BERT's success (Liu et al., 2019). Inspired by this, the MLM-style pre-training task has been extended to many other domains (Hu et al., 2020; He et al., 2022). Molecules can be naturally represented as graphs with atoms as nodes and chemical bonds as edges, so Graph Neural Networks (GNNs) can be utilized to process molecular graph data. To exploit the abundant unlabeled molecules, tremendous efforts have been devoted to pre-training GNNs for molecular graph representations (Xia et al., 2022e). The pioneering work on this topic (Hu et al., 2020) pre-trains GNNs with an MLM-style pre-training task (AttrMask) on large-scale unlabeled molecular graph datasets. Specifically, they randomly mask some proportion of atoms and then pre-train the models to predict them. AttrMask has emerged as a fundamental pre-training task, and many subsequent works adopt it as a sub-task for pre-training (Zhang et al., 2021; Li et al., 2021a). During the fine-tuning stage, researchers replace the top layer of the pre-trained models with a task-specific sub-network and train the new model on the labeled molecules of the downstream tasks. However, Hu et al. (2020) observe that pre-training only with AttrMask (a node-level pre-training task) sometimes incurs the negative transfer issue (i.e., pre-trained models fall behind models trained from scratch).
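The AttrMask corruption step can be sketched as follows. This is a minimal, self-contained illustration with hypothetical helper names; the actual implementation operates on GNN node-feature tensors rather than plain Python lists:

```python
import random

MASK_TOKEN = -1  # placeholder id for a masked atom (illustrative choice)

def mask_atoms(atom_types, mask_ratio=0.15, seed=0):
    """Randomly mask a fraction of atom-type labels, AttrMask-style.

    Returns the corrupted input together with the (index -> label)
    prediction targets the GNN would be trained to recover.
    """
    rng = random.Random(seed)
    n_mask = max(1, int(len(atom_types) * mask_ratio))
    masked_idx = rng.sample(range(len(atom_types)), n_mask)
    corrupted = list(atom_types)
    targets = {}
    for i in masked_idx:
        targets[i] = corrupted[i]   # the GNN predicts this label
        corrupted[i] = MASK_TOKEN
    return corrupted, targets

# Toy molecule: atomic numbers of ethanol's heavy atoms (C, C, O)
corrupted, targets = mask_atoms([6, 6, 8], mask_ratio=0.34)
```

During pre-training, the masked positions are fed through the GNN and a linear head scores each over the atom-type vocabulary with a cross-entropy loss.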
Intuitively, they attribute this phenomenon to the lack of graph-level pre-training tasks and thus introduce supervised graph-level pre-training strategies, which are impractical because such labels are often expensive or unavailable. Additionally, supervised pre-training tasks unrelated to the downstream task of interest can even degrade downstream performance (Hu et al., 2020). In this paper, we offer a second voice on this predominant belief and aim to explain the negative transfer in molecular graph pre-training. Firstly, as can be observed in Figure 1(a), the pre-training accuracy of AttrMask quickly converges to ∼96%, which indicates that the AttrMask task (a 118-way classification, 118 being the number of known chemical elements) is extremely simple owing to the small atom vocabulary, i.e., the set of unique atom types among the chemical elements found in nature. In contrast, the training accuracy of the MLM task in BERT (a ∼30k-way classification) only grows to 70% and hardly converges because of the large text vocabulary, i.e., the set of ∼30k unique tokens in the corpus (Kosec et al., 2021). Secondly, the quantitative divergence between different atoms is extremely significant (see Figure 1(b)), which biases the models' predictions toward dominant atoms (e.g., carbons) and leads to fast convergence. Previous works (Clark et al., 2020; Robinson et al., 2021) have revealed that overly simple pre-training tasks capture less transferable knowledge and impair generalization or adaptation to novel tasks. Tokenization is the first step in any NLP pipeline; it separates a piece of text into smaller units called tokens (Sennrich et al., 2015). Hence, pre-training language models involves two stages: tokenizer training followed by language model pre-training.
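The imbalance argument above can be made concrete by counting atom frequencies. The compositions below are toy stand-ins (not drawn from the actual pre-training corpus), yet carbon already dominates, which is the bias a naive masked-type classifier exploits:

```python
from collections import Counter

# Toy heavy-atom compositions (atomic symbols) of a few drug-like
# molecules; illustrative only, not the actual pre-training data.
molecules = [
    ["C"] * 9 + ["N"] * 1 + ["O"] * 2,    # aspirin-like composition
    ["C"] * 8 + ["N"] * 4 + ["O"] * 2,    # caffeine-like composition
    ["C"] * 6 + ["O"] * 1,                # phenol-like composition
]

counts = Counter(atom for mol in molecules for atom in mol)
total = sum(counts.values())
ratios = {a: round(c / total, 2) for a, c in counts.most_common()}
# Carbon accounts for the clear majority of atoms even in this tiny
# sample, so always predicting "C" is already a strong baseline.
```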
For GNN pre-training, however, previous works adopt the atom types as tokens, which results in a small and unbalanced atom vocabulary. We argue that atoms with different contexts should be tokenized into different discrete values even if they belong to the same type. For example, aldehyde carbons and ester carbons indicate different molecular properties even though both are carbons. Hence, we introduce a context-aware tokenizer that encodes atoms into chemically meaningful discrete values. Specifically, these discrete values are the latent codes of a variant of graph VQ-VAE (Van Den Oord et al., 2017). The tokenizer is context-aware because the encoder of the graph VQ-VAE is a GNN model. In this way, we can categorize the dominant atoms (e.g., carbons) into several chemically meaningful sub-classes (e.g., aldehyde carbons, ester carbons, etc.) according to the atoms' contexts, which enlarges the atom vocabulary and mitigates the quantitative divergence between dominant and rare atoms. To support the above claims, we provide the t-SNE visualization of carbon representations learned by the proposed tokenizer in Figure 1(c). As can be observed, the representations of carbons are clustered by functional-group type, which indicates that our tokenizer encodes atoms into chemically meaningful values. With the new vocabulary, we propose a node-level pre-training task, dubbed Masked Atoms Modeling (MAM), which randomly masks the discrete values and pre-trains GNNs to predict them. For molecular graph-level pre-training, graph contrastive learning (You et al., 2020) is a feasible strategy. However, contrastive approaches push different molecules apart equally, regardless of their true degrees of similarity (Xia et al., 2022c; Liu et al., 2023; 2022c). To






Figure 1: (a): Pre-training accuracy curves of AttrMask and MAM; (b): The atom ratios of various chemical elements in the pre-training datasets; (c): The t-SNE visualization of the carbon representations learned by the proposed tokenizer. Firstly, we randomly sample 30,000 carbons from the QM9 dataset (Ruddigkeit et al., 2012). Among them, 250 are randomly chosen and colored according to the types of functional groups the carbons belong to. Both R and R' abbreviate the remaining groups of a molecule. The details of the 7 local structure groups are listed in Appendix I.
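The quantization step of the context-aware tokenizer can be sketched as a nearest-neighbor lookup in a learned codebook, as in VQ-VAE. The embeddings and codebook below are random stand-ins for the GNN encoder outputs and the trained codes, and all names are hypothetical:

```python
import math
import random

def nearest_code(embedding, codebook):
    """Assign a context-aware atom embedding to the index of its closest
    codebook entry (the VQ-VAE quantization step)."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(range(len(codebook)), key=lambda k: dist(embedding, codebook[k]))

rng = random.Random(42)
dim, vocab_size = 8, 512  # enlarged 'vocabulary' of discrete atom codes

# Random stand-in for a trained codebook of atom codes.
codebook = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(vocab_size)]

# Two carbons in different contexts yield different GNN embeddings and may
# therefore map to different codes (e.g., aldehyde vs. ester carbon).
carbon_a = [rng.gauss(0, 1) for _ in range(dim)]
carbon_b = [rng.gauss(0, 1) for _ in range(dim)]
code_a = nearest_code(carbon_a, codebook)
code_b = nearest_code(carbon_b, codebook)
```

MAM then masks these code ids instead of raw atom types, turning the node-level task into a classification over the enlarged code vocabulary rather than over ~118 elements.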

