ONE TRANSFORMER CAN UNDERSTAND BOTH 2D & 3D MOLECULAR DATA

Abstract

Unlike vision and language data, which usually have a unique format, molecules can naturally be characterized using different chemical formulations. One can view a molecule as a 2D graph or define it as a collection of atoms located in 3D space. For molecular representation learning, most previous works designed neural networks for only a particular data format, making the learned models likely to fail on other data formats. We believe a general-purpose neural network model for chemistry should be able to handle molecular tasks across data modalities. To achieve this goal, in this work, we develop a novel Transformer-based Molecular model called Transformer-M, which can take molecular data in 2D or 3D formats as input and generate meaningful semantic representations. Using the standard Transformer as the backbone architecture, Transformer-M develops two separate channels to encode 2D and 3D structural information and incorporates them with the atom features in the network modules. When the input data is in a particular format, the corresponding channel is activated and the other is disabled. By training on 2D and 3D molecular data with properly designed supervised signals, Transformer-M automatically learns to leverage knowledge from different data modalities and to capture representations correctly. We conducted extensive experiments on Transformer-M. All empirical results show that Transformer-M can simultaneously achieve strong performance on 2D and 3D tasks, suggesting its broad applicability. The code and models will be made publicly available at https://github.com/lsj2408/Transformer-M.

1. INTRODUCTION

Deep learning approaches have revolutionized many domains, including computer vision (He et al., 2016), natural language processing (Devlin et al., 2019; Brown et al., 2020), and games (Mnih et al., 2013; Silver et al., 2016). Recently, researchers have started investigating whether the power of neural networks could help solve important scientific problems in chemistry, e.g., predicting the properties of molecules and simulating molecular dynamics from large-scale training data (Hu et al., 2020a; 2021; Zhang et al., 2018; Chanussot et al., 2020).

One key difference between chemistry and conventional domains such as vision and language is the multimodality of data. In vision and language, a data instance is usually characterized in a single form. For example, an image is defined as RGB values over a pixel grid, while a sentence is defined as tokens in a sequence. In contrast, molecules naturally admit different chemical formulations. A molecule can be represented as a sequence (Weininger, 1988), a 2D graph (Wiswesser, 1985), or a collection of atoms located in 3D space. 2D and 3D structures are the most popularly used formulations, as many valuable properties and statistics can be obtained from them (Chmiela et al., 2017; Stokes et al., 2020). However, as far as we know, most previous works focus on designing neural network models for either 2D or 3D structures, so a model learned on one form cannot be applied to tasks of the other form. We argue that a general-purpose neural network model in chemistry should at least be able to handle molecular tasks across data modalities.

In this paper, we take the first step toward this goal by developing Transformer-M, a versatile Transformer-based Molecular model that performs well for both 2D and 3D molecular representation learning. Note that for a molecule, its 2D and 3D forms describe the same collection of atoms but characterize the structure differently. Therefore, the key challenge is to design a model that is expressive and compatible in capturing structural knowledge across formulations, and to train its parameters to learn from both sources of information. The Transformer is more favorable than other architectures because structural signals can be explicitly plugged into the model as bias terms (e.g., positional encodings (Vaswani et al., 2017; Raffel et al., 2020)). We can conveniently set 2D and 3D structural information as different bias terms through separate channels and incorporate them with the atom features in the attention layers.

Architecture. The backbone network of our Transformer-M is composed of standard Transformer blocks. We develop two separate channels to encode 2D and 3D structural information. The 2D channel uses degree encoding, shortest path distance encoding, and edge encoding extracted from the 2D graph structure, following Ying et al. (2021a). The shortest path distance encoding and edge encoding reflect the spatial relations and bond features of a pair of atoms and are used as bias terms in the softmax attention. The degree encoding is added to the atom features in the input layer. For the 3D channel, we follow Shi et al. (2022) and use 3D distance encoding to encode the spatial distance between atoms in the 3D geometric structure. Each atom pair's Euclidean distance is encoded via the Gaussian Basis Kernel function (Scholkopf et al., 1997) and used as a bias term in the softmax attention. For each atom, we sum up the 3D distance encodings between it and all other atoms and add the result to its atom features in the input layer. See Figure 1 for an illustration.
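To make the two-channel design concrete, the following PyTorch-style sketch shows how a 2D shortest-path-distance bias and a 3D Gaussian-Basis-Kernel distance bias could be added to the attention logits. All module names, tensor shapes, and hyperparameters (e.g., the number of Gaussian kernels and the maximum shortest-path distance) are illustrative assumptions, not the released implementation; the degree encoding and edge encoding of the 2D channel are omitted for brevity.

```python
# Illustrative sketch of the two structural bias channels (assumed shapes/names).
import torch
import torch.nn as nn


class GaussianBasisKernel(nn.Module):
    """Encode pairwise Euclidean distances with K Gaussian basis functions."""

    def __init__(self, num_kernels: int = 128):
        super().__init__()
        self.means = nn.Parameter(torch.linspace(0.0, 10.0, num_kernels))
        self.stds = nn.Parameter(torch.ones(num_kernels))

    def forward(self, dist):                       # dist: [B, N, N]
        d = dist.unsqueeze(-1)                     # broadcasts to [B, N, N, K]
        std = self.stds.abs().clamp(min=1e-2)
        return torch.exp(-0.5 * ((d - self.means) / std) ** 2)


class BiasedSelfAttention(nn.Module):
    """Standard multi-head attention with additive 2D/3D structural biases."""

    def __init__(self, dim: int, num_heads: int, num_kernels: int = 128):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.out = nn.Linear(dim, dim)
        # 2D channel: one learnable bias per shortest-path distance and head.
        self.spd_bias = nn.Embedding(512, num_heads)
        # 3D channel: project Gaussian basis features to one bias per head.
        self.gbf = GaussianBasisKernel(num_kernels)
        self.gbf_proj = nn.Linear(num_kernels, num_heads)

    def forward(self, x, spd=None, dist=None):
        # x: [B, N, dim]; spd: [B, N, N] integer shortest-path distances;
        # dist: [B, N, N] pairwise 3D Euclidean distances.
        B, N, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)

        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5   # [B, H, N, N]
        if spd is not None:    # 2D channel active
            attn = attn + self.spd_bias(spd).permute(0, 3, 1, 2)
        if dist is not None:   # 3D channel active
            attn = attn + self.gbf_proj(self.gbf(dist)).permute(0, 3, 1, 2)

        x = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, N, -1)
        return self.out(x)
```

In this sketch, either bias can be dropped independently by passing None, which mirrors the mechanism that allows the model to accept 2D-only or 3D-only inputs.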
Training. Except for the parameters in the two structural channels, all other parameters in Transformer-M (e.g., self-attention and feed-forward networks) are shared across data modalities. We design a joint-training approach for Transformer-M to learn its parameters. During training, when the instances in a batch are only associated with 2D graph structures, the 2D channel is activated and the 3D channel is disabled. Similarly, when the instances in a batch use 3D geometric structures, the 3D channel is activated and the 2D channel is disabled. When both 2D and 3D information are given, both channels are activated (a minimal sketch of this switching logic is given after the experimental overview below). In this way, we can collect 2D and 3D data from separate databases and train Transformer-M with different training objectives, making the training process more flexible. We expect a single model to learn to identify and incorporate information from different modalities and to utilize the parameters efficiently, leading to better generalization performance.

Experimental Results. We use the PCQM4Mv2 dataset in the OGB Large-Scale Challenge (OGB-LSC) (Hu et al., 2021) to train our Transformer-M, which consists of 3.4 million molecules in both 2D and 3D forms. The model is trained to predict the pre-computed HOMO-LUMO gap of each data instance in different formats, together with a 3D denoising pre-text task applied specifically to 3D data. With the pre-trained model, we directly use or fine-tune the parameters for various molecular tasks of different data formats. First, we show that on the validation set of the PCQM4Mv2 task, which only contains 2D molecular graphs, our Transformer-M surpasses all previous works by a large margin. The improvement is credited to the joint training, which effectively mitigates the overfitting problem. Second, on PDBBind (Wang et al., 2004; 2005b) (2D & 3D), the fine-tuned Transformer-M achieves state-of-the-art performance compared to strong baselines. Lastly, on the QM9 benchmark (Ramakrishnan et al., 2014) (3D), the fine-tuned Transformer-M models achieve competitive performance compared to recent methods. All results show that our Transformer-M has the potential to be used as a general-purpose model in a broad range of applications in chemistry.
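The modality-dependent channel switching and the joint objective described above can be summarized in a short training-step sketch. The batch field names, the model's output signature, and the denoising loss weight below are assumptions made for illustration only and do not reflect the exact objective or hyperparameters used in the paper.

```python
# Sketch of a modality-dependent training step (assumed field names and loss weight).
import torch
import torch.nn.functional as F


def training_step(model, batch, optimizer):
    has_2d = batch.get("spd") is not None          # 2D graph structure present?
    has_3d = batch.get("pos") is not None          # 3D coordinates present?

    dist = None
    if has_3d:
        # Pairwise Euclidean distances feed the 3D channel.
        dist = torch.cdist(batch["pos"], batch["pos"])

    pred, noise_pred = model(
        atoms=batch["atoms"],
        spd=batch["spd"] if has_2d else None,      # activates the 2D channel
        dist=dist,                                 # activates the 3D channel
    )

    # Supervised target (e.g., the HOMO-LUMO gap) is predicted for every modality.
    loss = F.l1_loss(pred, batch["target"])
    # The 3D denoising pre-text task is only added when coordinates are available.
    if has_3d:
        loss = loss + 0.2 * F.l1_loss(noise_pred, batch["noise"])

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```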

2. RELATED WORKS

Neural networks for learning 2D molecular representations. Graph Neural Networks (GNNs) are widely used in molecular graph representation learning (Kipf & Welling, 2016; Hamilton et al., 2017; Gilmer et al., 2017; Xu et al., 2019; Veličković et al., 2018). A GNN learns node and graph representations by recursively aggregating (i.e., message passing) neighbor representations and updating the node representations accordingly. Different architectures are developed by using different aggregation and update strategies. We refer the readers to Wu et al. (2020) for a comprehensive survey. Recently, many works extended the Transformer model to graph tasks (Dwivedi & Bresson, 2020; Kreuzer et al., 2021; Ying et al., 2021a; Luo et al., 2022; Kim et al., 2022; Rampášek et al., 2022; Park et al., 2022; Hussain et al., 2022; Zhang et al., 2023). Seminal works include Graphormer (Ying et al., 2021a), which developed graph structural encodings and used them in a standard Transformer model.
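For reference, the aggregate-and-update pattern described above can be written as a minimal mean-aggregation layer; this is a generic, hypothetical illustration of message passing, not any of the specific architectures cited here.

```python
# Minimal message-passing layer: aggregate neighbor features, then update node features.
import torch
import torch.nn as nn


class SimpleMPNNLayer(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.update = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())

    def forward(self, h, adj):
        # h: [N, dim] node features; adj: [N, N] adjacency matrix (0/1 entries).
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1)
        agg = adj @ h / deg                                 # mean over neighbor representations
        return self.update(torch.cat([h, agg], dim=-1))     # update from self + aggregate
```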

