METRO: MEMORY-ENHANCED TRANSFORMER FOR RETROSYNTHETIC PLANNING VIA REACTION TREE

Abstract

Retrosynthetic planning plays a critical role in drug discovery and organic chemistry. Starting from a target molecule as the root node, it aims to find a complete reaction tree in which every leaf node belongs to a set of starting materials. Multi-step reactions are crucial because they determine the process flow of production in the organic chemical industry. However, existing datasets are not curated as tree-structured multi-step reactions and do not provide such reaction trees, which limits models' understanding of organic molecule transformations. In this work, we first develop a benchmark curated for the retrosynthetic planning task, consisting of 124,869 reaction trees retrieved from the public USPTO-full dataset. On top of that, we propose Metro: Memory-Enhanced Transformer for RetrOsynthetic planning. Specifically, the dependency among molecules in the reaction tree is captured as context information for multi-step retrosynthesis predictions through a Transformer with a memory module. Extensive experiments show that Metro outperforms existing single-step retrosynthesis models by at least 10.7% in top-1 accuracy, demonstrating the benefit of exploiting context information in the retrosynthetic planning task. Moreover, the proposed model can be directly used for synthetic accessibility analysis, as it is trained on reaction trees with the shortest depths. Our work is a first step towards a new formulation of retrosynthetic planning in terms of data construction, model design, and evaluation.

1. INTRODUCTION

Retrosynthetic planning is a fundamental problem in organic chemistry (Coley et al., 2018a; Genheden et al., 2020). Its goal is to find a set of starting molecules that, through a sequence of reactions that can be represented as a reaction tree, synthesize the target molecule. Retrosynthetic planning can be decomposed into multi-step retrosynthesis reactions, through which we find all starting molecules that meet the requirements. The multi-step reactions outline the direction and the target of the transformation of organic molecules. Production in the organic chemical industry requires efficient synthesis routes that yield the desired target products at low cost. Therefore, given a target molecule, predicting reasonable and efficient reaction routes to synthesize it is a crucial problem in both machine learning and organic chemistry (Segler et al., 2018).

To tackle this problem, past works, including MCTS (Segler et al., 2018), DFPN-E (Kishimoto et al., 2019), Retro* (Chen et al., 2020), self-improved retrosynthetic planning (Kim et al., 2021), RetroGraph (Xie et al., 2022), and Grasp (Yu et al., 2022), model retrosynthetic planning as a search problem (Xie et al., 2022). Specifically, they first use reactions to train a template-based MLP retrosynthesis model (Segler et al., 2017) and then learn a search algorithm that performs a backward search, transforming molecules through retrosynthesis predictions until all reactants are starting materials (Chen et al., 2020). The current benchmark for evaluating retrosynthetic planning models consists of 189 test routes (Chen et al., 2020).

These approaches have the following limitations: 1) the training dataset of single-step reactions limits the understanding of the transformation of organic molecules as a chain of chemical reactions; 2) past works use single-step retrosynthesis models, which neglect the context information in the reaction tree; 3) the test set is too small to evaluate performance comprehensively; 4) the evaluation unit of the existing benchmark is the reaction route, which is a single path from the root node to a leaf node, rather than the complete reaction tree.

In this work, we address these limitations by first constructing a new benchmark with 124,869 reaction trees retrieved from the public USPTO dataset and then leveraging a retrosynthesis Transformer with an additional memory module to capture reaction tree information for retrosynthetic planning.

Benchmark. SCScore (Coley et al., 2018b) concludes that the number of steps required to synthesize a molecule is an accurate metric for estimating its synthetic accessibility. Based on this observation, and inspired by the prediction of synthetic accessibility with a reaction knowledge graph (Li & Chen, 2022), we construct a reaction graph from the existing reactions in the database. In the reaction graph, each directed edge represents a retrosynthesis reaction and points from a product molecule to a reactant molecule used to synthesize that product. Given a target molecule, we search the reaction graph for the shortest routes to form an efficient reaction tree, where the end points of these routes are starting molecules that satisfy the requirements. By constructing reaction trees for the target molecules, we obtain a new benchmark for the retrosynthetic planning task.

Metro. In this work, we propose Metro: Memory-Enhanced Transformer for RetrOsynthetic planning, which extends the Transformer with an additional memory module. Metro captures the dependency among the molecules on the reaction route as context information. By taking this context information into consideration, we can restrict the search to a reasonable reaction space specific to the reaction route.
Extensive experimental results on retrosynthetic planning show that Metro improves over the Transformer by up to 13.2%, 14.5%, 11.1%, 10.5%, and 10.0% in top-1, top-2, top-3, top-4, and top-5 accuracy, respectively, which demonstrates the benefit of exploiting context information for the retrosynthetic planning task.
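
As a rough illustration of the benchmark construction described above (a reaction graph searched for shortest routes), the following Python sketch computes, for every molecule, the smallest number of reaction steps needed to reach it from starting materials and then expands a target into a reaction tree. It is a sketch under stated assumptions, not the procedure used to build the benchmark: the depth of a molecule is assumed to be one plus the largest depth among the reactants of its chosen reaction, the function names `shortest_depths` and `build_tree` are hypothetical, and the single-letter molecule labels form a made-up toy graph.

```python
from math import inf


def shortest_depths(reactions, starting_materials):
    """Fixed-point relaxation over (product, reactants) pairs.

    Assumption: depth(m) = 0 for starting materials, otherwise
    1 + max(depth of reactants) minimized over reactions producing m.
    """
    depth = {m: 0 for m in starting_materials}
    best = {}  # product -> reactants of its shallowest known reaction
    changed = True
    while changed:
        changed = False
        for product, reactants in reactions:
            if product in starting_materials or not reactants:
                continue
            if any(r not in depth for r in reactants):
                continue  # some reactant is not yet known to be reachable
            candidate = 1 + max(depth[r] for r in reactants)
            if candidate < depth.get(product, inf):
                depth[product] = candidate
                best[product] = tuple(reactants)
                changed = True
    return depth, best


def build_tree(target, best, starting_materials):
    """Expand a target into a nested dict; returns None if unreachable."""
    if target in starting_materials:
        return {"molecule": target, "children": []}
    if target not in best:
        return None
    children = [build_tree(r, best, starting_materials) for r in best[target]]
    return {"molecule": target, "children": children}


# Hypothetical toy reaction graph: A can be made in two steps via B + C,
# or in three steps via D.
toy_reactions = [
    ("A", ["B", "C"]),
    ("B", ["E", "F"]),
    ("C", ["G", "H"]),
    ("A", ["D"]),
    ("D", ["C"]),
]
toy_starting = {"E", "F", "G", "H"}

toy_depth, toy_best = shortest_depths(toy_reactions, toy_starting)
toy_tree = build_tree("A", toy_best, toy_starting)
```

In this toy graph the relaxation keeps the two-step decomposition of `A` into `B` and `C` and discards the longer alternative through `D`. Because the chosen reactants always have a strictly smaller depth than their product, the recursive expansion terminates even when the reaction graph contains cycles.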

2. PRELIMINARIES

In this section, we formally define the key terms used in the rest of the paper: SMILES representation, starting material, and reaction tree.

SMILES Representation.

The simplified molecular-input line-entry system (SMILES) (Weininger, 1988) is a chemical specification for describing the structure of chemical compounds using strings. Organic compounds can be denoted by SMILES representations, as in Fig. 1, which are well suited for processing by machine learning models. We denote the SMILES representation of molecule $x$ as $s(x)$, where $s(x)_i$ is the character at the $i$-th position of the string $s(x)$. Given a reaction $r_1 + r_2 + \ldots + r_n \rightarrow p$, the SMILES representation of this reaction is as follows:

$$s(r_1).s(r_2) \ldots s(r_n) \rightarrow s(p), \tag{1}$$

where multiple reactants are concatenated by "." in the SMILES representation.

Starting Material.

We denote the space of all chemical molecules as $M$. The starting materials are a special set of molecules, denoted as $S \subseteq M$. AiZynthFinder (Genheden et al., 2020) defines a starting material as a commercially purchasable compound. ZINC (Sterling & Irwin, 2015) releases open-source databases of purchasable compounds. We define the compounds listed in these databases as our starting materials.

Reaction Tree and Reaction Routes.

Given the above definitions, we can denote a reaction tree (Shibukawa et al., 2020; Nguyen & Tsuda, 2021) as $\mathcal{T} = \{T, R, I, \tau\}$, where $T \in M \setminus S$ is the product molecule we desire to synthesize (A in Fig. 1), $R = \{r_1, r_2, \ldots, r_n\} \subseteq S$ is the set of starting materials (E, F, G, H in Fig. 1) that go through a series of reactions $\tau$ to synthesize A, and $I = \{m_1, m_2, \ldots, m_u\} \subseteq M \setminus S$ is the set of intermediate products (B, C, D in Fig. 1). Each intermediate product is formed from starting materials or other intermediate products and reacts further to give either another intermediate product or the final product. A reaction tree consists of multiple reaction routes. A reaction route is a path from the target molecule to a starting material in the reaction tree. According to this definition, the number of reaction routes is equal to the number of starting materials.
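
To make these definitions concrete, the short Python sketch below represents a reaction tree as nested nodes, enumerates its reaction routes as root-to-leaf paths, and joins reactant strings with "." as in Eq. (1). Everything in it is illustrative: the names `Node`, `routes`, and `reaction_smiles` are hypothetical, the tree topology over the labels A to H is made up rather than copied from Fig. 1, and the `>>` separator follows the common reaction SMILES convention in place of the arrow used in Eq. (1).

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """One molecule in a reaction tree; leaves are starting materials."""
    molecule: str
    children: list[Node] = field(default_factory=list)


def routes(node, prefix=()):
    """Return every root-to-leaf path (reaction route) below `node`."""
    path = prefix + (node.molecule,)
    if not node.children:
        return [path]
    found = []
    for child in node.children:
        found.extend(routes(child, path))
    return found


def reaction_smiles(reactants, product, arrow=">>"):
    """Join reactant SMILES with '.' and append the product SMILES."""
    return ".".join(reactants) + arrow + product


# Hypothetical topology: target A, intermediates B, C, D, leaves E, F, G, H.
example_tree = Node("A", [
    Node("B", [Node("E"), Node("F")]),
    Node("C", [Node("D", [Node("H")]), Node("G")]),
])

example_routes = routes(example_tree)
example_leaves = [route[-1] for route in example_routes]

# Acetic acid + ethanol giving ethyl acetate (water by-product omitted).
example_reaction = reaction_smiles(["CC(=O)O", "CCO"], "CCOC(C)=O")
```

Each route ends at exactly one leaf, so `example_routes` contains one entry per leaf of the tree, which mirrors the statement that the number of reaction routes equals the number of starting materials.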
We denote a reaction route as $l$ and the set of reaction routes as $L = \{l_1, l_2, \ldots, l_n\}$, and we have $\tau = \tau_{l_1} \cup \tau_{l_2} \cup \cdots \cup \tau_{l_n}$,

