ENERGY-BASED VIEW OF RETROSYNTHESIS

Abstract

Retrosynthesis is the process of identifying a set of reactants with which to synthesize a target molecule. It is of vital importance to material design and drug discovery. Existing machine learning approaches based on language models and graph neural networks have achieved encouraging results. However, the inner connections between these models are rarely discussed, and rigorous evaluations of them are still lacking. In this paper, we propose a framework that unifies sequence- and graph-based methods as energy-based models (EBMs) with different energy functions. This unified view establishes connections and reveals differences between models, thereby enhancing our understanding of model design. We also provide a comprehensive performance assessment to the community. Moreover, we present a novel dual variant within the framework that performs consistent training to induce agreement between forward and backward prediction. This model improves the state of the art of template-free methods both with and without reaction types.

Introduction

Retrosynthesis is a critical problem in organic chemistry and drug discovery (Corey, 1988; 1991; Segler et al., 2018b; Szymkuć et al., 2016; Strieth-Kalthoff et al., 2020). As the reverse process of chemical synthesis (Coley et al., 2017a; 2019), retrosynthesis aims to find the set of reactants that can synthesize the provided target via chemical reactions (Fig 1). Since the search space of theoretically feasible reactant candidates is enormous, models must be designed carefully so that they have the expressive power to learn complex chemical rules while remaining computationally efficient.

Recent machine learning applications to retrosynthesis, including sequence- and graph-based models, have made significant progress (Segler & Waller, 2017a; Segler et al., 2018b; Johansson et al., 2020). Sequence-based models treat molecules as one-dimensional token sequences (SMILES (Weininger, 1988), bottom of Fig 1) and formulate retrosynthesis as a sequence-to-sequence problem, to which recent advances in neural machine translation (Vaswani et al., 2017; Schwaller et al., 2019) can be applied. Following this principle, LSTM-based encoder-decoder frameworks and, more recently, transformer-based approaches have achieved promising results (Liu et al., 2017; Schwaller et al., 2019; Zheng et al., 2019). Graph-based models, on the other hand, operate on a natural, human-interpretable representation of molecular graphs, to which chemical rules are easily applied. Graph-based approaches that perform graph matching with chemical rules ("templates"; definition in Sec 3.2) or reaction centers have achieved encouraging results (Dai et al., 2019; Shi et al., 2020).

In this paper, we focus on one-step retrosynthesis, which is also the foundation of multi-step retrosynthesis (Segler et al., 2018b). Our goal is to provide a unified view of both sequence- and graph-based retrosynthesis models using an energy-based model (EBM) framework. This view is beneficial for three reasons. First, model design with EBMs is very flexible.
Within this framework, both types of models can be formulated as different EBM variants by instantiating the energy function in specific forms. Second, EBMs provide principled training procedures, including maximum likelihood estimation, pseudo-likelihood, and others. Third, a unified view offers insight into the different EBM variants, as it makes it easy to extract their commonalities and differences and to understand their strengths.
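For concreteness, the sequence-based view above consumes SMILES as a flat token stream. A minimal tokenizer in the style commonly used by transformer-based reaction models can be written with a single regular expression (the pattern below follows the widely used open-vocabulary SMILES tokenization; it is a sketch, and a production system should be checked against a reference implementation):

```python
import re

# Open-vocabulary SMILES tokenization pattern: bracketed atoms, two-letter
# halogens (Br, Cl), single-letter atoms, bonds, branches, and ring closures.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p"
    r"|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def smiles_tokenize(smiles: str) -> list[str]:
    """Split a SMILES string into the tokens a seq2seq model consumes."""
    tokens = SMILES_TOKEN.findall(smiles)
    # Lossless check: the tokens must concatenate back to the input.
    assert "".join(tokens) == smiles, "unrecognized characters in SMILES"
    return tokens

# Aspirin as a one-dimensional token sequence:
tokens = smiles_tokenize("CC(=O)Oc1ccccc1C(=O)O")
```

Here `"CC(=O)Oc1ccccc1C(=O)O"` splits into 21 tokens such as `C`, `(`, `=`, `c`, and the ring-closure digit `1`, which then serve as the vocabulary units for a translation-style model.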
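The unified view can be sketched in a few lines: an EBM assigns a scalar energy E(R, P) to a reactant set R and product P, and defines p(R | P) ∝ exp(−E(R, P)) over a candidate pool. In the sketch below the energy is a toy character-overlap score standing in for a learned sequence or graph scorer; the molecule strings and candidate pool are illustrative only:

```python
import math

def energy(reactants: str, product: str) -> float:
    """Toy energy: purely illustrative stand-in for a learned neural scorer.
    Lower energy means a more plausible reactant set; here we simply use
    negative character overlap between the two SMILES strings."""
    shared = sum(min(reactants.count(ch), product.count(ch))
                 for ch in set(product))
    return -float(shared)

def posterior(candidates: list[str], product: str) -> list[float]:
    """p(R | P) = exp(-E(R, P)) / Z, normalized over the candidate pool."""
    scores = [math.exp(-energy(r, product)) for r in candidates]
    z = sum(scores)  # partition function over the candidate set
    return [s / z for s in scores]

product = "CC(=O)Oc1ccccc1C(=O)O"            # aspirin (illustrative target)
candidates = ["CC(=O)O.Oc1ccccc1C(=O)O",     # a plausible reactant pair
              "CCO",                          # implausible candidate
              "c1ccccc1"]                     # implausible candidate
probs = posterior(candidates, product)
```

Swapping in a sequence model, a graph network, or a template scorer for `energy` yields the different variants discussed in this paper, while the probabilistic form and the training objectives (maximum likelihood, pseudo-likelihood) stay the same.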



Figure 1: A retrosynthesis example, represented as molecular graphs and as SMILES strings (bottom).

