SUBSTRUCTURE-ATOM CROSS ATTENTION FOR MOLECULAR REPRESENTATION LEARNING

Abstract

Designing a neural network architecture for molecular representation is crucial for AI-driven drug discovery and molecule design. In this work, we propose a new framework for molecular representation learning. Our contribution is threefold: (a) demonstrating the usefulness of incorporating substructure information into node-wise molecular features, (b) designing a two-branch network, consisting of a Transformer and a graph neural network, in which the two branches are fused via asymmetric attention, and (c) removing the need for heuristic features and computationally expensive molecular information. Using 1.8 million molecules collected from the ChEMBL and PubChem databases, we pretrain our network to learn a general representation of molecules with minimal supervision. The experimental results show that our pretrained network achieves competitive performance on 11 downstream tasks for molecular property prediction.

1. INTRODUCTION

Predicting the properties of molecules is a fundamental concern in many fields. For instance, researchers apply deep neural networks (DNNs) in place of expensive real-world experiments to measure the molecular properties of a drug candidate, e.g., blood-brain barrier permeability, solubility, and binding affinity. Such efforts can significantly reduce wet-lab experimentation, which often takes more than ten years and costs over $1 billion (Hughes et al., 2011; Mohs & Greig, 2017).

Among DNN architectures, graph neural networks (GNNs) and Transformers are widely adopted to recognize the graph structure of molecules. GNN architectures for molecular representation learning include the message-passing neural network (MPNN) and directed MPNN (Gilmer et al., 2017; Yang et al., 2019), which investigate how to obtain effective node, edge, and graph representations. GNNs are powerful at capturing the local information of a node, but may lack the ability to encode information from far-away nodes due to over-smoothing and over-squashing issues (Li et al., 2018; Alon & Yahav, 2020). On the other hand, Transformer-based architectures such as MAT (Maziarka et al., 2020) and Graphormer (Ying et al., 2021) augment the self-attention layer of a vanilla Transformer with high-order graph connectivity information. Transformers can encode global information since they consider attention between every pair of nodes from the first layer. Because a naive Transformer cannot recognize graph structure, previous work guides a structural bias into the attention mechanism with heuristic features such as the shortest path between two nodes.

From the study of chemical structure, it is known that meaningful substructures, also called motifs or fragments, recur across different molecules (Murray & Rees, 2009).
For example, carbon rings and NO2 groups are typical substructures contributing to mutagenicity (Debnath et al., 1991), showing that proper use of substructures can aid property prediction. Molecular substructures are often represented via molecular fingerprints or molecular fragmentation. Molecular fingerprints such as MACCS (Molecular ACCess System) keys (Durant et al., 2002) and Extended-Connectivity Fingerprints (ECFPs) (Rogers & Hahn, 2010) represent a molecule as a fixed-length binary vector in which each bit indicates the presence of a certain motif in the molecule. With a predefined fragmentation dictionary, such as BRICS (Degen et al., 2008) or tree decomposition (Jin et al., 2018), a molecule can be decomposed into distinct partitions. Interestingly, machine learning algorithms that utilize molecular substructures still show performance competitive with deep learning models on some datasets (Hu et al., 2020; Maziarka et al., 2020).

We name our network Substructure-Atom Cross Attention (SACA), as it uses both substructure and atom information in molecules and fuses them through cross-attention. This architectural choice allows us to avoid heuristic features, such as shortest-path or 3-dimensional distances in attention layers, which many existing Transformer architectures for molecular representation learning require (Maziarka et al., 2020; Ying et al., 2021). Furthermore, our model reduces the complexity of attention calculation from O(N^2) in node-level Transformer models to O(N), where N is the number of atoms. To demonstrate the empirical effectiveness of the proposed network and its ability to capture a general representation of molecules, we evaluate our model on 11 downstream tasks from MoleculeNet (Wu et al., 2018), where it achieves competitive performance. In what follows, we summarize the key contributions and benefits.
• We propose a novel network that combines the information from substructures and node features in a molecule. Our model combines the advantages of both Transformer and GNN architectures to represent the information given to each branch.
• We show the effectiveness of our model for molecular representation learning. Our model achieves competitive performance against strong baseline models on 11 molecular property prediction tasks.
• Our model does not require computationally expensive heuristic information about molecular graphs.
• The source code and pretrained networks will be publicly released upon acceptance.
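To make the fingerprint representation mentioned above concrete, below is a toy sketch of a fixed-length binary motif fingerprint. This is not MACCS or ECFP: the motif list is hypothetical and the substring test on SMILES strings is only illustrative, since real fingerprints use graph-based substructure matching (e.g., via RDKit).

```python
# Toy substructure fingerprint: a fixed-length binary vector where bit i
# marks the presence of motif i. Substring matching on SMILES is a crude
# stand-in for proper subgraph matching and is used here only to
# illustrate the idea of a presence-of-motif bit vector.

MOTIFS = ["c1ccccc1", "N(=O)=O", "C(=O)O", "O", "N"]  # hypothetical motif list

def toy_fingerprint(smiles: str) -> list[int]:
    """Return a binary vector: bit i is 1 if motif i occurs in the SMILES."""
    return [int(motif in smiles) for motif in MOTIFS]

# Nitrobenzene contains both a benzene ring and a nitro group.
print(toy_fingerprint("c1ccccc1N(=O)=O"))  # -> [1, 1, 0, 1, 1]
```

Note that two structurally different molecules can share the same bit vector under such a scheme, which is exactly the loss of local information discussed in Section 2.1.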

2.1. ARCHITECTURES FOR MOLECULAR REPRESENTATION LEARNING

Graph neural networks (GNNs) The most common architecture for molecular representation learning is the GNN, since molecules can be naturally represented as graphs: nodes as atoms and edges as bonds. Researchers have actively investigated variations of GNN architectures for molecules (Gilmer et al., 2017; Yang et al., 2019; Xiong et al., 2019; Song et al., 2020).
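A single message-passing step, the core operation shared by the GNN variants above, can be sketched as follows. The additive aggregation, tanh update, and random weights are illustrative assumptions, not any specific model's exact update rule.

```python
import numpy as np

# Minimal sketch of one message-passing step (in the spirit of MPNN):
# each node sums its neighbours' features and updates its own state.
# A is the adjacency matrix, H the node feature matrix, W a learned
# weight matrix (random here, for illustration only).

rng = np.random.default_rng(0)

def mp_step(A: np.ndarray, H: np.ndarray, W: np.ndarray) -> np.ndarray:
    messages = A @ H                     # sum of neighbour features per node
    return np.tanh((H + messages) @ W)   # simple illustrative update rule

# 4-atom path graph: 0-1-2-3
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
H = rng.normal(size=(4, 8))   # 8-dimensional node features
W = rng.normal(size=(8, 8))

H1 = mp_step(A, H, W)
print(H1.shape)  # -> (4, 8)
```

Stacking k such steps lets information propagate k hops, which is why deep stacks are needed to reach far-away nodes and why over-smoothing becomes a concern.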



Figure 1: Traditional GNNs, such as GCN and GAT, cannot distinguish the two molecular graphs in (a); however, they can easily be distinguished using simple 6-ring and 5-ring substructure information. On the other hand, the two molecules in (b) have similar substructures, so atom features from their neighborhoods are necessary to discriminate between them.

We propose a fusion architecture between a GNN and a Transformer to incorporate both molecular graph information and molecular substructures. Molecular substructures and the molecular graph are encoded by the Transformer and the GNN, respectively. The Transformer is designed to recognize molecular substructures, whose information is mixed through self-attention to obtain better representations. With a Transformer-only architecture, however, local information of molecules, such as atoms, bonds, and connectivity, can be lost. For example, the two molecules shown in Figure 1b share the same MACCS-key representation while having different structures. To overcome this, we use a separate GNN branch to preserve local information. In our model, we inject the GNN features into the intermediate Transformer layers through a fusing network. In this way, substructure and local node information are interactively fused, producing a final representation for molecular graphs.
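To sketch why substructure-atom cross-attention scales linearly in the number of atoms: with K substructure tokens as queries attending over N atom features, the score matrix is K x N, i.e., O(N) for fixed K, rather than the O(N^2) of node-level self-attention. The following minimal single-head sketch omits the learned query/key/value projections and multi-head structure; the query direction and dimensions are our assumptions for illustration, not the authors' exact design.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(S: np.ndarray, H: np.ndarray) -> np.ndarray:
    """Substructure tokens S (K x d) attend over atom features H (N x d).
    The score matrix is K x N, so the cost is O(N*K): linear in the
    number of atoms N when the number of substructure tokens K is fixed."""
    d = S.shape[1]
    scores = S @ H.T / np.sqrt(d)        # (K, N) attention logits
    return softmax(scores, axis=1) @ H   # (K, d) atom-informed tokens

rng = np.random.default_rng(1)
S = rng.normal(size=(3, 16))    # 3 substructure tokens
H = rng.normal(size=(10, 16))   # 10 atom features from the GNN branch

out = cross_attention(S, H)
print(out.shape)  # -> (3, 16)
```

Each output row is a convex combination of atom features, so injecting GNN features this way mixes local atom information into the substructure tokens without any pairwise atom-atom attention.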

