REAKE: CONTRASTIVE MOLECULAR REPRESENTATION LEARNING WITH CHEMICAL SYNTHESIS KNOWLEDGE GRAPH

Abstract

Molecular representation learning has demonstrated great promise in bridging machine learning and chemical science and in supporting novel chemical discoveries. State-of-the-art methods mostly employ graph neural networks (GNNs) with self-supervised learning (SSL) and extra chemical reaction knowledge to empower the learned embeddings. However, prior works ignore three major issues in modeling reaction data, namely the abnormal energy flow, ambiguous embedding, and sparse embedding space problems. To alleviate these problems, we propose ReaKE, a chemical synthesis knowledge graph-driven pre-training framework for molecular representation learning. We first construct a large-scale chemical synthesis knowledge graph comprising reactants, products, and reaction rules. We then propose triplet-level and graph-level contrastive learning strategies to jointly optimize the knowledge graph and molecular embeddings. Representations learned by ReaKE can capture the changes between the states before and after a reaction (template information) without prior information. Extensive experiments on downstream tasks, together with visualizations, demonstrate the effectiveness of our method compared with state-of-the-art methods.

1. INTRODUCTION

Organic chemistry has developed rapidly with the growing interest in big data technology (Schwaller et al., 2021b). Among its tasks, reaction prediction has become a necessary component of retrosynthesis analysis and virtual library generation for drug design (Kayala & Baldi, 2011). However, predicting chemical reaction outcomes in terms of products, yields[foot_0], or reaction rates with computational approaches remains a formidable undertaking. In recent years, natural language processing (NLP)-based methods have shown robustness and effectiveness in representing molecules and predicting reactions (Schwaller et al., 2020); these methods treat the precursors' Simplified Molecular-Input Line-Entry System (SMILES)[foot_1] strings as text. While effective, they face the challenge of capturing molecules' structural information. To handle this challenge, researchers leverage the advantages of graph neural networks (GNNs) in modeling 2D molecular structures (Liu et al., 2019; Yang et al., 2019; Liu et al., 2022; Ma et al., 2022). Still, predicting out-of-distribution data samples remains a problem, since labeled data are limited and the chemical space is complex (Wu et al., 2018). Thus, some recent methods employ self-supervised learning (SSL) strategies to exploit unlabeled data, including designing special pretext tasks and applying contrastive learning frameworks (You et al., 2020; Zhang et al., 2020; Xu et al., 2021; Wang et al., 2022; Li et al., 2022). However, SSL on molecular graph structures remains challenging, as current approaches mostly lack domain knowledge of chemical synthesis. Recent studies have pointed out that pre-training GNNs with random node/edge masking gives limited improvements and often leads to negative transfer on downstream tasks (Hu et al., 2020; Stärk et al., 2021), as perturbations of graph structures can hurt the structural inductive bias.



foot_0: Reaction yield is a measure of the quantity of moles of a product formed in relation to the reactant consumed in a chemical reaction, usually expressed as a percentage.

foot_1: SMILES is a specification for unambiguously describing molecular structures in ASCII strings. For example, the SMILES string of ethanol is 'CCO'.
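The two footnoted concepts can be made concrete with a small sketch. The snippet below is purely illustrative and not part of the paper: `percent_yield` applies the stated definition of reaction yield (moles of product per mole of reactant consumed, as a percentage), and `count_heavy_atoms` is a deliberately simplified, hypothetical reader of linear SMILES strings that only handles single-letter organic-subset atoms without rings, branches, or charges.

```python
def percent_yield(moles_product: float, moles_reactant: float) -> float:
    """Reaction yield: moles of product formed relative to the
    reactant consumed, expressed as a percentage."""
    return 100.0 * moles_product / moles_reactant


def count_heavy_atoms(smiles: str) -> dict:
    """Count heavy atoms in a linear, branch- and ring-free SMILES
    string with single-letter atom symbols only (e.g. 'CCO').
    A toy illustration, not a full SMILES parser."""
    counts: dict = {}
    for ch in smiles:
        if ch.isalpha() and ch.isupper():
            counts[ch] = counts.get(ch, 0) + 1
    return counts


# 0.8 mol of product from 1.0 mol of reactant -> 80% yield.
print(percent_yield(0.8, 1.0))   # 80.0
# Ethanol ('CCO') has two carbons and one oxygen.
print(count_heavy_atoms("CCO"))  # {'C': 2, 'O': 1}
```

A full SMILES implementation additionally handles ring-closure digits, parenthesized branches, bond symbols, and bracketed atoms, which is why real pipelines rely on cheminformatics toolkits rather than ad hoc parsing.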

