REAKE: CONTRASTIVE MOLECULAR REPRESENTATION LEARNING WITH CHEMICAL SYNTHESIS KNOWLEDGE GRAPH

Abstract

Molecular representation learning has demonstrated great promise in bridging machine learning and chemical science and in supporting novel chemical discoveries. State-of-the-art methods mostly employ graph neural networks (GNNs) with self-supervised learning (SSL) and extra chemical reaction knowledge to empower the learned embeddings. However, prior works ignore three major issues in modeling reaction data: abnormal energy flow, ambiguous embeddings, and sparse embedding spaces. To alleviate these problems, we propose ReaKE, a chemical synthesis knowledge graph-driven pre-training framework for molecular representation learning. We first construct a large-scale chemical synthesis knowledge graph comprising reactants, products, and reaction rules. We then propose triplet-level and graph-level contrastive learning strategies to jointly optimize the knowledge graph and molecular embeddings. Representations learned by ReaKE capture the changes a reaction induces between reactants and products (template information) without prior information. Extensive experiments on downstream tasks, together with visualization studies, demonstrate the effectiveness of our method compared with state-of-the-art methods.

1. INTRODUCTION

Organic chemistry has developed rapidly with the growing interest in big data technology (Schwaller et al., 2021b). In particular, reaction prediction has become a necessary component of retrosynthesis analysis and virtual library generation for drug design (Kayala & Baldi, 2011). However, predicting chemical reaction outcomes in terms of products, yields[1], or reaction rates with computational approaches remains a formidable undertaking. In recent years, natural language processing (NLP)-based methods have shown robustness and effectiveness in representing molecules and predicting reactions (Schwaller et al., 2020); these methods treat the precursors' Simplified Molecular-Input Line-Entry System (SMILES)[2] strings as text. While effective, they struggle to capture molecules' structural information. To handle this challenge, researchers leverage the strengths of graph neural networks (GNNs) in modeling 2D molecular structures (Liu et al., 2019; Yang et al., 2019; Liu et al., 2022; Ma et al., 2022). Still, predicting out-of-distribution data samples remains a problem, since labeled data are limited and the chemical space is complex (Wu et al., 2018). Thus, some recent methods employ self-supervised learning (SSL) strategies to exploit unlabeled data, including designing special pretext tasks and applying contrastive learning frameworks (You et al., 2020; Zhang et al., 2020; Xu et al., 2021; Wang et al., 2022; Li et al., 2022). However, SSL on molecular graph structures remains challenging, as current approaches mostly lack domain knowledge in chemical synthesis. Recent studies have pointed out that pre-training GNNs with random node/edge masking gives limited improvements and often leads to negative transfer on downstream tasks (Hu et al., 2020; Stärk et al., 2021), as perturbations of graph structures can hurt the structural inductive bias of molecules.
More recently, a few studies inject extra chemical reaction knowledge into SSL training to empower the learned embeddings (Wen et al., 2022; Wang et al., 2021). Among them, the state-of-the-art method MolR (Wang et al., 2021) preserves the equivalence of molecules with respect to chemical reactions in the embedding space, i.e., forcing the sum of reactant embeddings and the sum of product embeddings to be equal for each chemical equation.[3] Albeit promising, such chemical reaction-aware methods face the following three problems: (1) Abnormal energy flow: all chemical reactions are accompanied by changes in entropy, and changes in entropy require reaction conditions such as temperature and pressure to trigger. Under the equivalence assumption of the previous method, reactants and products can be interconverted whenever their embedding sums are equal, which violates the principle of entropy increase in the second law of thermodynamics. For example, given a reaction A + B → C and a reaction D + E → C, the assumption implies A + B → D + E, but that reaction might not occur. (2) Ambiguous embeddings: the previous method assumes that the embeddings of reactants and products are equal in the embedding space; however, reactants and products are often structurally similar yet very different in properties. This assumption leads to a lack of discrimination between reactants and products in the embedding space, for example, incorrectly predicting products as reactants; more detailed examples are given in Table 5 of Appendix F (such as the No. 5 and No. 72 reactions). (3) Sparse embedding space: since the amount of recorded chemical reactions is limited, the embedding spaces of reactants and products learned by previous methods are sparse and lack smoothness, which may lead to a large offset of embeddings under a small perturbation of the reaction. That is, given a reaction A → B, the previous method enforces e(A) = e(B), where e(·) denotes the embedding function.
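To make the abnormal-energy-flow issue concrete, the sum-equivalence constraint can be reproduced with toy embeddings. The vectors below are hypothetical, hand-picked values (not learned embeddings), chosen so that two recorded reactions share a product:

```python
# Toy 3-d "embeddings" illustrating the sum-equivalence constraint.
# Values are hand-picked so that e(A) + e(B) = e(C) = e(D) + e(E).
e = {
    "A": [1.0, 0.0, 2.0],
    "B": [0.0, 1.0, 1.0],
    "C": [1.0, 1.0, 3.0],   # recorded reaction: A + B -> C
    "D": [0.5, 0.5, 1.5],
    "E": [0.5, 0.5, 1.5],   # recorded reaction: D + E -> C
}

def vec_sum(names):
    """Element-wise sum of the embeddings of the given molecules."""
    return [sum(e[n][i] for n in names) for i in range(3)]

# Both recorded reactions satisfy the constraint exactly ...
assert vec_sum(["A", "B"]) == e["C"] == vec_sum(["D", "E"])
# ... so the embedding space also treats the unrecorded reaction
# A + B -> D + E as equivalent, even though it may never occur.
assert vec_sum(["A", "B"]) == vec_sum(["D", "E"])
```

Nothing in the constraint distinguishes the two directions or requires any trigger condition, which is exactly the flow problem described above.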
Suppose a small perturbation σ (removing an atom outside the reaction center) is applied to both A and B; it may cause a large offset due to the sparsity of the chemical space, so that e(A + σ) ≠ e(B + σ). To address these problems, we develop ReaKE, a novel deep learning framework that learns chemically meaningful molecular representations from a graph-in-graph data architecture, i.e., a knowledge graph (KG) that connects 2D molecular graphs via reaction templates. First, to alleviate the abnormal energy flow and ambiguous embedding problems, we construct a chemical synthesis knowledge graph and build explicit connections between molecules through reaction template information. This not only introduces changes at reaction sites as the trigger conditions for flow between molecules, but also helps distinguish reactants from products in the embedding space. Then, to address the sparse embedding space problem, we design a functional group-based SSL method for reaction triplet-level representation learning, which helps build a denser chemical embedding space. Finally, we propose a reaction-aware contrastive learning strategy to improve the efficiency of knowledge graph-level training. Extensive experiments demonstrate that the representations learned by our proposed model benefit a wide range of downstream tasks that require chemical synthesis priors. For example, ReaKE achieves a 6.8% absolute Hit@1 gain in pretext reaction prediction, an average 9.4% absolute F1-score gain in reaction classification, and an average 4% R2 improvement in yield prediction over existing state-of-the-art methods. Further visualization studies indicate that our reaction representations can not only categorize reactions clearly but also capture discriminative properties of reaction templates.
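The exact triplet-level objective is specified later in the paper; as a rough, hypothetical sketch of the general idea, a (reactant, template, product) triplet can be scored in a TransE-like fashion, with a margin-based contrastive loss pushing true triplets above corrupted ones. All embeddings below are made-up 2-d vectors, and the scoring function is an illustrative stand-in, not ReaKE's actual objective:

```python
import math

def triplet_score(h, r, t):
    """Plausibility of a (reactant h, template r, product t) triplet:
    negative L2 distance between h + r and t (higher is better)."""
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))

def margin_loss(pos, neg, margin=1.0):
    """Contrastive margin ranking loss: push the true triplet's score
    above a corrupted triplet's score by at least `margin`."""
    return max(0.0, margin - pos + neg)

# Hypothetical embeddings: the template r shifts the reactant onto the product.
h, r = [0.0, 1.0], [1.0, 0.0]
t_true, t_corrupt = [1.0, 1.0], [3.0, -2.0]

pos = triplet_score(h, r, t_true)      # 0.0: h + r lands exactly on t_true
neg = triplet_score(h, r, t_corrupt)   # strongly negative: poor match
loss = margin_loss(pos, neg)           # 0.0: pair already separated beyond the margin
```

Because the template embedding r mediates the transition, a reactant and its product need not collapse onto the same point, which is what distinguishes this family of objectives from the plain sum-equivalence constraint.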

2. METHODS

An illustrative overview of our proposed method of molecular pre-training with Reaction Knowledge Embedding (ReaKE) is presented in Fig. 1. In this section, we first introduce the definition of a chemical synthesis knowledge graph (Section 2.1), as schematically shown in Fig. 2(a). We then describe the joint learning of the triplet-level encoder and the graph-level knowledge encoder (Section 2.2), followed by the overall pre-training objectives (Section 2.3). Knowledge graph embedding (KGE) aims to encode the components of a KG into a low-dimensional continuous vector space to support downstream graph operations and knowledge reuse.
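As a minimal data-structure sketch (not the paper's implementation), such a knowledge graph can be stored as an adjacency map from molecule SMILES strings to (template, product) edges. The template label below is a simplified placeholder, not an extracted reaction template:

```python
from collections import defaultdict

# Adjacency map: molecule SMILES -> list of (reaction template, product SMILES).
kg = defaultdict(list)

def add_reaction(reactants, template, products):
    """Insert one (reactant, template, product) triplet per reactant-product pair."""
    for reactant in reactants:
        for product in products:
            kg[reactant].append((template, product))

# Fischer esterification: acetic acid + ethanol -> ethyl acetate + water.
add_reaction(["CC(=O)O", "CCO"], "esterification", ["CC(=O)OCC", "O"])

# Each reactant is now explicitly linked to each product via the template edge.
assert ("esterification", "CC(=O)OCC") in kg["CC(=O)O"]
```

Storing the template on the edge, rather than forcing reactant and product embeddings to coincide, is what lets the triplets carry the "before vs. after" information the method relies on.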



[1] Reaction yield is a measure of the quantity of moles of a product formed in relation to the reactant consumed in a chemical reaction, usually expressed as a percentage.
[2] SMILES is a specification for unambiguously describing molecular structures in ASCII strings. For example, the SMILES string of ethanol is 'CCO'.
[3] For example, given the chemical equation of the Fischer esterification of acetic acid and ethanol, CH3COOH + C2H5OH → CH3COOC2H5 + H2O, MolR assumes that e(CH3COOH) + e(C2H5OH) = e(CH3COOC2H5) + e(H2O) also holds, where e(·) represents the molecule embedding function.

