ENERGY-BASED VIEW OF RETROSYNTHESIS

Abstract

Retrosynthesis is the process of identifying a set of reactants to synthesize a target molecule. It is of vital importance to material design and drug discovery. Existing machine learning approaches based on language models and graph neural networks have achieved encouraging results. However, the inner connections between these models are rarely discussed, and rigorous evaluations of these models are much needed. In this paper, we propose a framework that unifies sequence- and graph-based methods as energy-based models (EBMs) with different energy functions. This unified view establishes connections and reveals the differences between models, thereby enhancing our understanding of model design. We also provide a comprehensive assessment of performance to the community. Moreover, we present a novel dual variant within the framework that performs consistent training to induce agreement between forward and backward prediction. This model improves the state of the art of template-free methods with or without reaction types.



Retrosynthesis is a critical problem in organic chemistry and drug discovery (Corey, 1988; 1991; Segler et al., 2018b; Szymkuć et al., 2016; Strieth-Kalthoff et al., 2020). As the reverse process of chemical synthesis (Coley et al., 2017a; 2019), retrosynthesis aims to find the set of reactants that can synthesize the provided target via chemical reactions (Fig 1). Since the search space of theoretically feasible reactant candidates is enormous, models should be designed carefully so that they have the expressive power to learn complex chemical rules while maintaining computational efficiency.

Recent machine learning applications to retrosynthesis, including sequence- and graph-based models, have made significant progress (Segler & Waller, 2017a; Segler et al., 2018b; Johansson et al., 2020). Sequence-based models treat molecules as one-dimensional token sequences (SMILES (Weininger, 1988), bottom of Fig 1) and formulate retrosynthesis as a sequence-to-sequence problem, where recent advances in neural machine translation (Vaswani et al., 2017; Schwaller et al., 2019) can be applied. Following this principle, LSTM-based encoder-decoder frameworks and, more recently, transformer-based approaches have achieved promising results (Liu et al., 2017; Schwaller et al., 2019; Zheng et al., 2019). Graph-based models, on the other hand, work on human-interpretable molecular graphs, a natural representation to which chemical rules are easily applied. Graph-based approaches that perform graph matching with chemical rules ("templates"; definition in Sec 3.2) or reaction centers have reached encouraging results (Dai et al., 2019; Shi et al., 2020).

In this paper, we focus on one-step retrosynthesis, which is also the foundation of multi-step retrosynthesis (Segler et al., 2018b). Our goal is to provide a unified view of both sequence- and graph-based retrosynthesis models using an energy-based model (EBM) framework. This is beneficial for three reasons. First, model design with EBM is very flexible: within this framework, both types of models can be formulated as different EBM variants by instantiating the energy score function in specific forms. Second, EBM provides principled ways of training models, including maximum likelihood estimation, pseudo-likelihood, etc. Third, a unified view is critical for gaining insight into different EBM variants, as it makes it easy to extract commonalities and differences between EBM variants and to understand their strengths.

• We propose a unified energy-based model (EBM) framework that integrates sequence- and graph-based models for retrosynthesis. To the best of our knowledge, this is the first effort to unify and exploit the inner connections between different models.
• We perform rigorous evaluations by running tens of experiments on different model designs. We believe that revealing the performance to the community contributes to the development of retrosynthesis models.

2. ENERGY-BASED MODEL FOR RETROSYNTHESIS

Retrosynthesis aims to predict a set of reactant molecules from a product molecule. We denote the product as $y$ and the set of reactants predicted for one-step retrosynthesis as $X$. The key to retrosynthesis is modeling the conditional probability $p(X|y)$ (Dai et al., 2019; Shi et al., 2020; Liu et al., 2017; Schwaller et al., 2019). EBM provides a common theoretical framework that can unify many retrosynthesis models, including but not limited to existing models.

An EBM (et al., 2006; Hinton, 2012) defines the distribution using an energy function. Without loss of generality, we define the joint distribution of product and reactants as follows:

$$p_\theta(X, y) = \frac{\exp(-E_\theta(X, y))}{Z(\theta)},$$

where the partition function $Z(\theta) = \sum_{y} \sum_{X} \exp(-E_\theta(X, y))$ is a normalization constant that ensures a valid probability distribution. Since the design of $E_\theta$ is free of choice, EBMs can be used to unify many retrosynthesis models by instantiating the energy function $E_\theta$ with different designs and different approximations of the partition function. Note that there is a trade-off between model expression capacity and learning tractability.

With an EBM it is also easy to obtain arbitrary conditional distributions by using different partition functions. For example, the forward prediction probability for reaction outcome prediction can be written with the same form of energy function as

$$p_\theta(y|X) = \frac{\exp(-E_\theta(X, y))}{\sum_{y'} \exp(-E_\theta(X, y'))}.$$

Overall, the proposed framework works as follows: (1) design and train an energy function $E_\theta$ (Sec 3 and Sec 4), and (2) use $E_\theta$ for inference in retrosynthesis (Sec 5).
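To make the role of the partition function concrete, the following minimal sketch enumerates a tiny discrete space and computes the joint, backward, and forward distributions from a single energy function. The energy table, the SMILES strings, and all function names are hypothetical and chosen purely for illustration; they are not the energy functions studied in this paper, and exhaustive enumeration is only possible because the toy space is tiny.

```python
import math

# Hypothetical toy energies E(X, y); lower energy means a more plausible pair.
# Reactant sets X are written as "."-separated SMILES, products y as SMILES.
TOY_ENERGY = {
    ("CC(=O)O.OCC", "CC(=O)OCC"): 0.5,
    ("CC(=O)O.OCC", "CCO"): 3.0,
    ("CC(=O)Cl.OCC", "CC(=O)OCC"): 1.0,
    ("CC(=O)Cl.OCC", "CCO"): 4.0,
}
REACTANT_SETS = ["CC(=O)O.OCC", "CC(=O)Cl.OCC"]
PRODUCTS = ["CC(=O)OCC", "CCO"]


def energy(X, y):
    """Look up the (assumed) energy of a reactant-set/product pair."""
    return TOY_ENERGY[(X, y)]


def normalize(weights):
    """Divide unnormalized weights exp(-E) by their sum (the partition function)."""
    Z = sum(weights.values())
    return {key: w / Z for key, w in weights.items()}


def joint():
    """p(X, y): the partition function sums over every (X, y) pair."""
    return normalize(
        {(X, y): math.exp(-energy(X, y)) for X in REACTANT_SETS for y in PRODUCTS}
    )


def backward(y):
    """p(X | y): same energy, partition function sums over reactant sets only."""
    return normalize({X: math.exp(-energy(X, y)) for X in REACTANT_SETS})


def forward(X):
    """p(y | X): same energy, partition function sums over products only."""
    return normalize({yp: math.exp(-energy(X, yp)) for yp in PRODUCTS})
```

All three distributions share one energy function and differ only in which variables the normalizer sums over. In realistic settings the sums are intractable, which is why the choice of energy design and partition-function approximation matters.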

3.1. SEQUENCE-BASED MODELS

Here we describe several sequence-based parameterizations that instantiate our EBM framework, using SMILES strings as the representation of molecules. We first define the sequence-based notation.
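As an illustration of what treating a molecule as a token sequence means, the sketch below splits a SMILES string into tokens with a regular expression. The pattern is a simplified assumption written for this example (it handles bracket atoms, the two-letter halogens Br and Cl, ring-closure digits, and common bond and branch symbols); it is not the tokenizer used by any model discussed here, and the function name is hypothetical.

```python
import re

# Simplified, assumed SMILES token pattern: bracket atoms, Br/Cl, two-digit
# ring closures (%NN), single-letter atoms, ring digits, and bond/branch symbols.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[()=#\-+\\/.:@*$~]")


def tokenize(smiles):
    """Split a SMILES string into a list of tokens.

    Raises ValueError if some character is not covered by the pattern,
    so that unsupported input is not silently dropped.
    """
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unsupported characters in SMILES: {smiles!r}")
    return tokens


# A reactant set can be written as one string, with "." separating molecules.
reactant_tokens = tokenize("CC(=O)Cl.OCC")
product_tokens = tokenize("CC(=O)OCC")
```

With such a tokenization, a reactant set and a product each become one token sequence, which is the form a sequence-to-sequence model consumes.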



Figure 1: and SMILES.

See Fig 2 and Algorithm 1. Based on how the reactants $X$ and the product $y$ are parameterized, the model designs can be divided into two categories: sequence-based and graph-based models.

Inspired by this unified framework, we propose a novel dual EBM variant that performs consistent training over the forward and backward prediction directions. The dual model improves the state-of-the-art accuracy by 9.9% for fully automated template-free methods and by 2.7% for template-based methods.

The goal of this paper is to investigate the performance of different models in a setup without any hand-crafted chemistry features (e.g., reaction centers) during training. Incorporating such hand-crafted features can usually boost accuracy significantly regardless of the model design, so adding features is not the focus of our paper. See the discussion in Appendix A.2, A.3, and A.4.

Obtain a list of $X$ candidates by $P$: $L_{\text{test}} \leftarrow P(y_{\text{test}})$
5. $X^* = \arg\min_{X \in L_{\text{test}}} E_{\theta^*}(X, y_{\text{test}})$
Return: $X^*$
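The two inference steps above (propose candidates, then pick the lowest-energy one) can be sketched as follows. The proposal function, the energy table, and every name here are hypothetical stand-ins used only for illustration; in practice the proposal $P$ and the trained energy $E_{\theta^*}$ would be learned models.

```python
def rank_by_energy(energy_fn, proposal_fn, y_test, top_k=1):
    """Return the top_k candidate reactant sets with the lowest energy.

    energy_fn(X, y) plays the role of the trained energy function, and
    proposal_fn(y) plays the role of the proposal P that lists candidates.
    """
    candidates = proposal_fn(y_test)  # L_test <- P(y_test)
    ranked = sorted(candidates, key=lambda X: energy_fn(X, y_test))
    return ranked[:top_k]  # top_k = 1 gives the arg min


def toy_proposal(y):
    """Hypothetical proposal: a fixed candidate list, ignoring y."""
    return ["CC(=O)Cl.OCC", "CC(=O)O.OCC", "CCO.CCO"]


def toy_energy(X, y):
    """Hypothetical energy: a lookup table with a high default energy."""
    table = {"CC(=O)O.OCC": 0.5, "CC(=O)Cl.OCC": 1.0}
    return table.get(X, 10.0)


best = rank_by_energy(toy_energy, toy_proposal, "CC(=O)OCC")
top3 = rank_by_energy(toy_energy, toy_proposal, "CC(=O)OCC", top_k=3)
```

Because ranking only compares energies for a fixed product, the partition function never needs to be evaluated at inference time; returning more than one candidate gives the ranked list commonly used for top-k accuracy.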

