CONTRASTIVE LEARNING OF MOLECULAR REPRESENTATION WITH FRAGMENTED VIEWS

Abstract

Molecular representation learning is a fundamental task for AI-based drug design and discovery. Contrastive learning is an attractive framework for this task, as evidenced by its success in various domains of representation learning, e.g., image, language, and speech. However, molecule-specific ways of constructing good positive and negative views for contrastive training that account for chemical semantics have been relatively under-explored. In this paper, we consider a molecule as a bag of meaningful fragments, e.g., chemically informative substructures, obtained by disconnecting non-ring single bonds as a semantics-preserving transformation. We then propose constructing a complete (or incomplete) bag of fragments as a positive (or negative) view of a molecule: each fragment loses chemical substructures of the original molecule, while the union of the fragments does not. Namely, this simultaneously provides easy positive and hard negative views for contrastive representation learning, so that the model can selectively learn useful features and ignore nuisance features. Furthermore, we propose optimizing a torsional-angle reconstruction loss around the fragmented bond to incorporate 3D geometric structure from the pretraining dataset. Our experiments demonstrate that our scheme outperforms prior state-of-the-art molecular representation learning methods across various downstream molecular property prediction tasks.

1. INTRODUCTION

Obtaining discriminative representations of molecules is a long-standing research problem in chemistry (Morgan, 1965). Such a task is critical for many applications such as drug discovery (Capecchi et al., 2020) and material design (Gómez-Bombarelli et al., 2018), since it is a fundamental building block for various downstream tasks, e.g., molecule property prediction (Duvenaud et al., 2015) and molecule generation (Mahmood et al., 2021). Over the past decades, researchers have focused on handcrafting molecular fingerprint representations, which encode the presence or absence of chemically meaningful substructures, e.g., functional groups, in a molecule (Rogers & Hahn, 2010). Recently, graph neural networks (GNNs) (Kipf & Welling, 2016) have gained much attention as a framework for learning molecular representations due to their remarkable performance in predicting chemical properties (Wu et al., 2018). However, they suffer from overfitting when labeled training data is scarce (Rong et al., 2020b). To resolve this issue, researchers have investigated self-supervised learning to generate supervisory signals from large amounts of unlabeled molecules. A notable approach in this line of work is contrastive learning, which learns a discriminative representation by maximizing the agreement between representations of "similar" positive views while minimizing the agreement between "dissimilar" negative views (Chen et al., 2020a); it has widely demonstrated its effectiveness for representation learning not only for molecules (Wang et al., 2021; 2022), but also for other domains, e.g., image (Chen et al., 2020a; He et al., 2019), video (Pan et al., 2021), language (Wu et al., 2020), and speech (Chung et al., 2021). Here, the common challenge for learning good representations is how to construct effective positive and negative views in a self-supervised manner.
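To make the contrastive objective concrete, the "maximize agreement of positive views, minimize agreement of negative views" principle can be sketched as a minimal InfoNCE-style loss. The function name, NumPy implementation, and temperature value below are illustrative assumptions for exposition, not the exact objective used in this paper:

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.1):
    """Minimal InfoNCE-style contrastive loss (a sketch, not this paper's objective).

    Each anchor embedding is pulled toward its own positive view while the
    positives of all other anchors in the batch serve as negatives.
    """
    # L2-normalize so dot products are cosine similarities.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature               # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    # Diagonal entries are anchor-positive pairs; off-diagonals are negatives.
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

# Toy check: matched views yield a lower loss than mismatched ones.
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))
loss_matched = info_nce_loss(x, x)
loss_shuffled = info_nce_loss(x, x[::-1])
print(loss_matched < loss_shuffled)  # True
```

The molecule-specific question studied in this work is orthogonal to the loss itself: it concerns which views should populate the positive (diagonal) and negative (off-diagonal) slots.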
For molecular contrastive representation learning, most prior works have utilized graph-augmentation techniques, e.g., edge/node dropping, to produce positive views (You et al., 2020; 2021). However, such augmentations often fail to generate proper positive views of a molecular graph, losing important chemical semantics of the anchor molecule; e.g., randomly inserting an edge into a graph may produce a non-realistic molecule (Fang et al., 2021b). Thus, semantic-preserving transformation

