RETCL: A SELECTION-BASED APPROACH FOR RETROSYNTHESIS VIA CONTRASTIVE LEARNING

Abstract

Retrosynthesis, whose goal is to find a set of reactants for synthesizing a target product, is an emerging research area of deep learning. While existing approaches have shown promising results, they currently lack the ability to consider the availability (e.g., stability or purchasability) of the reactants or to generalize to unseen reaction templates (i.e., chemical reaction rules). In this paper, we propose a new approach that mitigates these issues by reformulating retrosynthesis as a problem of selecting reactants from a candidate set of commercially available molecules. To this end, we design an efficient reactant selection framework, named RETCL (retrosynthesis via contrastive learning), for enumerating all of the candidate molecules based on selection scores computed by graph neural networks. For learning the score functions, we also propose a novel contrastive training scheme with hard negative mining. Extensive experiments demonstrate the benefits of the proposed selection-based approach. For example, when all 671k reactants in the USPTO database are given as candidates, our RETCL achieves a top-1 exact match accuracy of 71.3% on the USPTO-50k benchmark, while a recent transformer-based approach achieves 59.6%. We also demonstrate that RETCL generalizes well to unseen templates in various settings, in contrast to template-based approaches. The code will be released.

1. INTRODUCTION

Retrosynthesis (Corey, 1991), finding a synthetic route starting from commercially available reactants to synthesize a target product (see Figure 1a), is at the center of focus for discovering new materials in both academia and industry. It plays an essential role in practical applications by finding a new synthetic path, which can be more cost-effective or avoid patent infringement. However, retrosynthesis is a challenging task that requires searching over a vast number of molecules and chemical reactions, which is intractable to enumerate. Nevertheless, due to its importance, researchers have developed computer-aided frameworks to automate the process of retrosynthesis for more than three decades (Corey et al., 1985).

The computer-aided approaches for retrosynthesis mainly fall into two categories depending on their reliance on reaction templates, i.e., sub-graph patterns describing how the chemical reaction occurs among reactants (see Figure 1b). The template-based approaches (Coley et al., 2017b; Segler & Waller, 2017; Dai et al., 2019) first enumerate known reaction templates and then apply a well-matched template to the target product to obtain reactants. Although they can provide chemically interpretable predictions, they limit the search space to known templates and cannot discover novel synthetic routes. In contrast, template-free approaches (Liu et al., 2017; Karpov et al., 2019; Zheng et al., 2019; Shi et al., 2020) generate the reactants from scratch to avoid relying on the reaction templates. However, they must search the entire molecular space, and their predictions could be either unstable or commercially unavailable.

We emphasize that retrosynthesis methods are often required to consider the availability of reactants and to generalize to unseen templates in real-world scenarios. For example, when a predicted reactant is not available (e.g., not purchasable) to a chemist or a laboratory, the synthetic path starting from the predicted reactant cannot be used immediately in practice. Moreover, chemists often require retrosynthetic analysis based on unknown reaction rules. This is especially significant due to our

Contribution. In this paper, we propose a new selection-based approach, which allows considering the commercial availability of reactants. To this end, we reformulate the task of retrosynthesis as a problem where reactants are selected from a candidate set of available molecules. This approach has two benefits over the existing ones: (a) it guarantees the commercial availability of the selected reactants, which allows chemists to proceed to practical procedures such as lab-scale experiments or optimization of reaction conditions; (b) it can generalize to unseen reaction templates and find novel synthetic routes.

For the selection-based retrosynthesis, we propose an efficient selection framework, named RETCL (retrosynthesis via contrastive learning). To this end, we design two effective selection scores in synthetic and retrosynthetic manners. To be specific, we use the cosine similarity between molecular embeddings of the product and the reactants computed by graph neural networks. For training the score functions, we also propose a novel contrastive learning scheme (Sohn, 2016; He et al., 2019; Chen et al., 2020b) with hard negative mining (Harwood et al., 2017) to overcome a scalability issue while handling a large-scale candidate set.

To demonstrate the effectiveness of our RETCL, we conduct various experiments based on the USPTO database (Lowe, 2012) containing 1.8M chemical reactions in the US patent literature. Thanks to our prior knowledge of the candidate reactants, our method achieves 71.3% test accuracy and significantly outperforms the baselines without such prior knowledge.
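As a rough illustration of the selection idea described above, the following sketch ranks candidate molecules by cosine similarity to a product embedding. It is a minimal toy under stated assumptions, not the RETCL implementation: the embeddings are hand-written vectors standing in for graph neural network outputs, and the names `cosine_similarity`, `rank_candidates`, and the `mol_*` identifiers are hypothetical. It also shows only a single generic score, not the two synthetic and retrosynthetic scores of the framework.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_candidates(product_embedding, candidate_embeddings):
    """Return (name, score) pairs sorted from most to least similar.

    candidate_embeddings maps a candidate name to its embedding vector.
    In a real system these vectors would come from a trained encoder;
    here they are supplied by hand (assumption for illustration).
    """
    scored = [
        (name, cosine_similarity(product_embedding, embedding))
        for name, embedding in candidate_embeddings.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data: a product embedding and three hypothetical candidates.
product = [1.0, 0.0, 1.0]
candidates = {
    "mol_A": [1.0, 0.1, 0.9],    # points in nearly the same direction
    "mol_B": [0.0, 1.0, 0.0],    # orthogonal to the product
    "mol_C": [-1.0, 0.0, -1.0],  # points in the opposite direction
}

ranking = rank_candidates(product, candidates)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Because the score depends only on the direction of the embeddings, selecting reactants reduces to a nearest-neighbour search over the candidate set, which is what makes enumerating a large set of available molecules feasible.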
Furthermore, our algorithm demonstrates its superiority even when the baselines are enhanced with candidate reactants, e.g., our algorithm improves upon the existing template-free approach (Chen et al., 2019) by 11.7%. We also evaluate the generalization ability of RETCL by testing USPTO-50k-trained models on the USPTO-full dataset; we obtain 39.9% test accuracy while the state-of-the-art template-based approach (Dai et al., 2019) achieves 26.7%. Finally, we demonstrate how our RETCL can improve multi-step retrosynthetic analysis where intermediate reactants are not in our candidate set.

A chemical database, https://www.reaxys.com

Our framework is based on solving the retrosynthesis task as a selection problem over a candidate set of commercially available reactants given the target product. In particular, we design a selection procedure based on molecular embeddings computed by graph neural networks and train the networks via contrastive learning. To this end, we define a chemical reaction R → P as a synthetic process of converting a reactant-set R = {R_1, ..., R_n}, i.e., a set of reactant molecules, to a product molecule P (see Figure 1a). We

Figure 1: Examples of (a) a chemical reaction and (b) the corresponding reaction template in the USPTO-50k dataset. The objective of retrosynthesis is to find the reactants for the given product.
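To make the idea of contrastive training with hard negative mining more concrete, the sketch below computes a generic InfoNCE-style loss over cosine similarities and picks the negatives that are most similar to the anchor. This is an assumption-laden toy, not the training objective of RETCL: the function names, the temperature value of 0.1, and the two-dimensional vectors are all hypothetical, and the loss is the standard softmax contrastive form rather than the formulation defined later in the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mine_hard_negatives(anchor, negatives, k):
    """Return the k negatives most similar to the anchor (the hardest ones)."""
    by_similarity = sorted(
        negatives,
        key=lambda negative: cosine_similarity(anchor, negative),
        reverse=True,
    )
    return by_similarity[:k]


def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: negative log-softmax of the positive's logit.

    temperature=0.1 is an illustrative choice, not a value from the paper.
    """
    pos_logit = cosine_similarity(anchor, positive) / temperature
    logits = [pos_logit] + [
        cosine_similarity(anchor, negative) / temperature
        for negative in negatives
    ]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    max_logit = max(logits)
    log_denominator = max_logit + math.log(
        sum(math.exp(logit - max_logit) for logit in logits)
    )
    return log_denominator - pos_logit


# Toy embeddings (hypothetical): one anchor, one positive, three negatives.
anchor = [1.0, 0.0]
positive = [0.9, 0.1]
negatives = [[-1.0, 0.0], [0.0, 1.0], [0.8, 0.6]]

hard = mine_hard_negatives(anchor, negatives, k=2)
easy = [[-1.0, 0.0]]  # the negative least similar to the anchor

loss_hard = contrastive_loss(anchor, positive, hard)
loss_easy = contrastive_loss(anchor, positive, easy)
print(f"loss with hard negatives: {loss_hard:.4f}")
print(f"loss with an easy negative: {loss_easy:.4f}")
```

The hard negatives yield a larger loss than the easy one, which illustrates why mining them can give a more informative training signal when the candidate set is too large to contrast against in full.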

