SUBSTRUCTURE-ATOM CROSS ATTENTION FOR MOLECULAR REPRESENTATION LEARNING

Abstract

Designing a neural network architecture for molecular representation is crucial for AI-driven drug discovery and molecule design. In this work, we propose a new framework for molecular representation learning. Our contribution is threefold: (a) demonstrating the usefulness of incorporating substructure information into node-wise features of molecules, (b) designing a two-branch network, consisting of a Transformer and a graph neural network, in which the branches are fused with asymmetric attention, and (c) requiring neither heuristic features nor computationally expensive information from molecules. Using 1.8 million molecules collected from the ChEMBL and PubChem databases, we pretrain our network to learn a general representation of molecules with minimal supervision. The experimental results show that our pretrained network achieves competitive performance on 11 downstream tasks for molecular property prediction.

1. INTRODUCTION

Predicting properties of molecules is a fundamental concern in various fields. For instance, researchers apply deep neural networks (DNNs) to replace expensive real-world experiments that measure the molecular properties of a drug candidate, e.g., the capability of permeating the blood-brain barrier, solubility, and binding affinity. Such efforts can significantly reduce wet-lab experimentation that often takes more than ten years and costs $1 million (Hughes et al., 2011; Mohs & Greig, 2017). Among DNN architectures, graph neural networks (GNNs) and Transformers are widely adopted to recognize the graph structure of molecules. GNN architectures for molecular representation learning include the message-passing neural network (MPNN) and directed MPNN (Gilmer et al., 2017; Yang et al., 2019), which investigate how to obtain effective node, edge, and graph representations. GNNs are powerful at capturing the local information around a node, but may lack the ability to encode information from far-away nodes due to over-smoothing and over-squashing issues (Li et al., 2018; Alon & Yahav, 2020). On the other hand, Transformer-based architectures such as MAT (Maziarka et al., 2020) and Graphormer (Ying et al., 2021) augment the self-attention layer of a vanilla Transformer with high-order graph connectivity information. Transformers can encode global information because they consider attention between every pair of nodes from the first layer. To introduce a structural bias into the attention mechanism, previous work relies on heuristic features such as the shortest-path distance between two nodes, since a naive Transformer cannot recognize the graph structure. From the study of chemical structure, it is known that meaningful substructures, also known as motifs or fragments, recur across different molecules (Murray & Rees, 2009).
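The shortest-path bias discussed above can be illustrated with a minimal sketch. The function below is a hypothetical single-head self-attention in NumPy whose attention scores are shifted by a learned scalar per shortest-path distance, in the spirit of Graphormer; all names (`spd`, `bias_per_hop`) and shapes are illustrative assumptions, not the actual architecture of any cited model.

```python
import numpy as np

def biased_self_attention(x, spd, w_q, w_k, w_v, bias_per_hop):
    """Single-head self-attention over node features with a shortest-path bias.

    x:            (n, d) node features
    spd:          (n, n) integer shortest-path distances between nodes
    w_q/w_k/w_v:  (d, d) projection matrices
    bias_per_hop: (max_hop + 1,) learned scalar bias per distance
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Structural bias: look up one scalar per pairwise shortest-path distance
    # and add it to the raw attention scores before the softmax.
    scores = scores + bias_per_hop[spd]
    scores = scores - scores.max(axis=-1, keepdims=True)  # stable softmax
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=-1, keepdims=True)
    return attn @ v
```

On a small path graph, setting larger biases for nearby hops steers the otherwise fully global attention toward the graph structure, which is the role the heuristic shortest-path feature plays in the cited Transformer variants.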
For example, carbon rings and NO2 groups are typical substructures that contribute to mutagenicity (Debnath et al., 1991), showing that proper use of substructures can help property prediction. Molecular substructures are often represented via molecular fingerprints or molecular fragmentation. Molecular fingerprints such as MACCS (Molecular ACCess System) keys (Durant et al., 2002) and Extended-Connectivity Fingerprints (ECFPs) (Rogers & Hahn, 2010) represent a molecule as a fixed-length binary vector, where each bit indicates the presence of a certain motif in the molecule. With a predefined fragmentation scheme, such as BRICS (Degen et al., 2008) or tree decomposition (Jin et al., 2018), a molecule can be decomposed into distinct partitions. Interestingly, machine learning algorithms that utilize molecular substructures still show performance competitive with deep learning models on some datasets (Hu et al., 2020; Maziarka et al., 2020).
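The fixed-bit-vector idea behind such fingerprints can be conveyed with a toy sketch. The snippet below maps a SMILES string to a binary vector by checking for a small, hypothetical list of motif patterns; real fingerprints such as MACCS keys or ECFPs are computed by subgraph matching with a cheminformatics toolkit (e.g., RDKit), not by raw substring search, so this is only an illustration of the representation, not of the actual algorithms.

```python
# Hypothetical motif list for illustration only; MACCS defines 166 keys and
# ECFPs hash enumerated circular subgraphs into (typically) 1024-2048 bits.
MOTIFS = ["c1ccccc1", "[N+](=O)[O-]", "C(=O)O", "N", "O", "S", "Cl", "F"]

def toy_fingerprint(smiles: str) -> list:
    """Map a SMILES string to a fixed-length binary vector.

    Each bit indicates the presence of one motif pattern; substring
    matching is a crude stand-in for proper subgraph matching.
    """
    return [1 if motif in smiles else 0 for motif in MOTIFS]

# Nitrobenzene contains both a carbon ring and an NO2 group, the two
# motifs cited as contributing to mutagenicity.
print(toy_fingerprint("c1ccccc1[N+](=O)[O-]"))  # [1, 1, 0, 1, 1, 0, 0, 0]
```

Because the vector length and bit positions are fixed across molecules, such fingerprints can be fed directly to classical machine learning models, which is one reason substructure-based baselines remain competitive on some datasets.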

