RETCL: A SELECTION-BASED APPROACH FOR RETROSYNTHESIS VIA CONTRASTIVE LEARNING

Abstract

Retrosynthesis, the goal of which is to find a set of reactants for synthesizing a target product, is an emerging research area of deep learning. While existing approaches have shown promising results, they currently lack the ability to consider the availability (e.g., stability or purchasability) of reactants or to generalize to unseen reaction templates (i.e., chemical reaction rules). In this paper, we propose a new approach that mitigates these issues by reformulating retrosynthesis as a problem of selecting reactants from a candidate set of commercially available molecules. To this end, we design an efficient reactant-selection framework, named RETCL (retrosynthesis via contrastive learning), that scores all of the candidate molecules with selection scores computed by graph neural networks. To learn the score functions, we also propose a novel contrastive training scheme with hard negative mining. Extensive experiments demonstrate the benefits of the proposed selection-based approach. For example, when all 671k reactants in the USPTO database are given as candidates, RETCL achieves a top-1 exact-match accuracy of 71.3% on the USPTO-50k benchmark, while a recent transformer-based approach achieves 59.6%. We also demonstrate that RETCL generalizes well to unseen templates in various settings, in contrast to template-based approaches. The code will be released.

1. INTRODUCTION

Retrosynthesis (Corey, 1991), finding a synthetic route that starts from commercially available reactants and synthesizes a target product (see Figure 1a), is at the center of attention for discovering new materials in both academia and industry. It plays an essential role in practical applications by finding new synthetic paths, which can be more cost-effective or avoid patent infringement. However, retrosynthesis is a challenging task that requires searching over a vast number of molecules and chemical reactions, which is intractable to enumerate. Nevertheless, due to its importance, researchers have developed computer-aided frameworks to automate retrosynthesis for more than three decades (Corey et al., 1985).

Computer-aided approaches for retrosynthesis mainly fall into two categories depending on their reliance on reaction templates, i.e., sub-graph patterns describing how a chemical reaction occurs among reactants (see Figure 1b). Template-based approaches (Coley et al., 2017b; Segler & Waller, 2017; Dai et al., 2019) first enumerate known reaction templates and then apply a well-matched template to the target product to obtain reactants. Although they can provide chemically interpretable predictions, they limit the search space to known templates and cannot discover novel synthetic routes. In contrast, template-free approaches (Liu et al., 2017; Karpov et al., 2019; Zheng et al., 2019; Shi et al., 2020) generate the reactants from scratch to avoid relying on reaction templates. However, they must search the entire molecular space, and their predictions could be either unstable or commercially unavailable. We emphasize that retrosynthesis methods are often required to consider the availability of reactants and to generalize to unseen templates in real-world scenarios.
For example, when a predicted reactant is not available (e.g., not purchasable) for a chemist or a laboratory, the synthetic path starting from that reactant cannot be used in practice. Moreover, chemists often require retrosynthetic analysis based on unknown reaction rules. This is especially significant due to our incomplete knowledge of chemical reactions; e.g., 29 million reactions were recorded between 2009 and 2019 in Reaxys [1] (Mutton & Ridley, 2019).

Contribution. In this paper, we propose a new selection-based approach, which allows considering the commercial availability of reactants. To this end, we reformulate the task of retrosynthesis as a problem where reactants are selected from a candidate set of available molecules. This approach has two benefits over existing ones: (a) it guarantees the commercial availability of the selected reactants, which allows chemists to proceed to practical procedures such as lab-scale experiments or optimization of reaction conditions; (b) it can generalize to unseen reaction templates and find novel synthetic routes.

For selection-based retrosynthesis, we propose an efficient selection framework, named RETCL (retrosynthesis via contrastive learning). To this end, we design two effective selection scores in synthetic and retrosynthetic manners. To be specific, we use the cosine similarity between molecular embeddings of the product and the reactants computed by graph neural networks. To train the score functions, we also propose a novel contrastive learning scheme (Sohn, 2016; He et al., 2019; Chen et al., 2020b) with hard negative mining (Harwood et al., 2017) that overcomes the scalability issue of handling a large-scale candidate set. To demonstrate the effectiveness of RETCL, we conduct various experiments based on the USPTO database (Lowe, 2012) containing 1.8M chemical reactions from the US patent literature.
Thanks to our prior knowledge of the candidate reactants, our method achieves 71.3% test accuracy and significantly outperforms baselines without such prior knowledge. Furthermore, our algorithm demonstrates its superiority even when the baselines are enhanced with candidate reactants, e.g., it improves upon the existing template-free approach (Chen et al., 2019) by 11.7%. We also evaluate the generalization ability of RETCL by testing USPTO-50k-trained models on the USPTO-full dataset; we obtain 39.9% test accuracy while the state-of-the-art template-based approach (Dai et al., 2019) achieves 26.7%. Finally, we demonstrate how RETCL can improve multi-step retrosynthetic analysis where intermediate reactants are not in the candidate set. We believe our scheme has the potential to improve further in the future by utilizing (a) additional chemical knowledge such as atom-mapping or leaving groups (Shi et al., 2020; Somnath et al., 2020); and (b) various contrastive learning techniques from other domains, e.g., computer vision (He et al., 2019; Chen et al., 2020b; Hénaff et al., 2019; Tian et al., 2019), audio processing (Oord et al., 2018), and reinforcement learning (Srinivas et al., 2020).

2.1. OVERVIEW OF RETCL

In this section, we propose a selection framework for retrosynthesis via contrastive learning, coined RETCL. Our framework solves the retrosynthesis task as a selection problem over a candidate set of commercially available reactants given the target product. In particular, we design a selection procedure based on molecular embeddings computed by graph neural networks and train the networks via contrastive learning.

To this end, we define a chemical reaction R → P as a synthetic process converting a reactant-set R = {R_1, ..., R_n}, i.e., a set of reactant molecules, into a product molecule P (see Figure 1a). We aim to solve retrosynthesis by finding, from a candidate set C, the reactant-set R which can be synthesized into the target product P. In particular, we consider the case where the candidate set C consists of commercially available molecules. Throughout this paper, we say that the synthetic direction (from R to P) is forward and the retrosynthetic direction (from P to R) is backward.

Note that our framework stands out from existing works in terms of the candidate set C. To be specific, (a) template-free approaches (Lin et al., 2019; Karpov et al., 2019; Shi et al., 2020) choose C as the whole space of (possibly unavailable) molecules, and (b) template-based approaches (Coley et al., 2017b; Segler & Waller, 2017; Dai et al., 2019) choose C as the possible reactants extracted from known reaction templates. In comparison, our framework requires neither (a) a search over the entire space of molecules nor (b) domain knowledge to extract reaction templates.

We now briefly outline the RETCL framework. It first searches for the most likely reactant-sets R^(1), ..., R^(T) ⊂ C in a sequential manner based on a backward selection score ψ(R | P, R_given), and then ranks the reactant-sets using ψ(R | P, R_given) and another forward score φ(P | R).
To learn the score functions, we propose a novel contrastive learning scheme with hard negative mining that improves the selection quality. We provide detailed descriptions of the search procedure and the training scheme in Sections 2.2 and 2.3, respectively.

2.2. SEARCH PROCEDURE WITH GRAPH NEURAL NETWORKS

We first introduce the search procedure of RETCL in detail. To find a reactant-set R = {R_1, ..., R_n}, we select each element R_i sequentially from the candidate set C based on the backward (retrosynthetic) selection score ψ(R | P, R_given). It represents the score of selecting a reactant R given a target product P and a set of previously selected reactants R_given ⊂ C. The score function is also capable of selecting a special reactant R_halt to stop updating the reactant-set. Using beam search, we choose the T most likely reactant-sets R^(1), ..., R^(T).

Furthermore, we rank the chosen reactant-sets R^(1), ..., R^(T) based on the backward selection score ψ(R | P, R_given) and the forward (synthetic) score φ(P | R). The latter represents the synthesizability of P from R. Note that ψ(R | P, R_given) and φ(P | R) correspond to the backward and forward directions of a chemical reaction R → P, respectively (see Section 2.1 and Figure 1a). Using both score functions, we define an overall score for a chemical reaction R → P as follows:

    \mathrm{score}(P, R) = \frac{1}{n+2} \Big[ \max_{\pi \in \Pi} \sum_{i=1}^{n+1} \psi\big(R_{\pi(i)} \mid P, \{R_{\pi(1)}, \ldots, R_{\pi(i-1)}\}\big) + \phi(P \mid R) \Big],    (1)

where R_{n+1} = R_halt and Π is the space of permutations on the integers 1, ..., n+1 satisfying π(n+1) = n+1. Based on score(P, R), we decide the rankings of R^(1), ..., R^(T) for synthesizing the target product P. We note that the max over π ∈ Π and the 1/(n+2) term make the overall score (equation 1) independent of the order and the number of reactants, respectively. Figure 2 illustrates the search procedure of our framework.

Score design. We next elaborate on our design choices for the score functions ψ and φ. We first observe that the molecular graph of the product P can be decomposed into subgraphs coming from each reactant in the reactant-set R, as illustrated in Figure 1a. Moreover, when selecting reactants sequentially, the structural information of the previously selected reactants R_given should be ignored to avoid duplicate selections.
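As a concrete illustration, the overall score of equation (1) can be computed by brute force over permutations, which is feasible because single-step reactant-sets are small. This is a minimal sketch with hypothetical placeholder score functions `psi` and `phi`, not the actual RETCL implementation:

```python
import itertools

def overall_score(psi, phi, reactants, product):
    """Overall reaction score of equation (1): the best-permutation sum of
    backward scores psi plus the forward score phi, averaged over n + 2 terms.
    The halt token R_halt is always the final (fixed) selection."""
    n = len(reactants)
    best = float("-inf")
    for perm in itertools.permutations(reactants):  # pi(n+1) = n+1 is fixed
        order = list(perm) + ["<halt>"]
        total = sum(
            psi(r, product, frozenset(order[:i])) for i, r in enumerate(order)
        )
        best = max(best, total)
    return (best + phi(frozenset(reactants), product)) / (n + 2)
```

Taking the maximum over permutations makes the score order-invariant, and the 1/(n+2) factor normalizes away the number of score terms, matching the two properties noted above.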
From these observations, we design the scores ψ(R | P, R_given) and φ(P | R) as follows:

    \psi(R \mid P, R_{\mathrm{given}}) = \mathrm{CosSim}\Big( f_\theta(P) - \sum_{S \in R_{\mathrm{given}}} g_\theta(S),\; h_\theta(R) \Big),
    \phi(P \mid R) = \mathrm{CosSim}\Big( \sum_{R' \in R} g_\theta(R'),\; h_\theta(P) \Big),

where CosSim is the cosine similarity and f_θ, g_θ, h_θ are embedding functions mapping a molecule to a fixed-size vector with parameters θ. One can think of f_θ and g_θ as query functions for a product and a reactant, respectively, while h_θ is a key function for a molecule. Such a query-key separation allows the search procedure to be processed as an efficient matrix-vector multiplication. This computational efficiency is important in our selection-based setting because the number of candidates is often very large, e.g., |C| ≈ 6 × 10^5 for the USPTO dataset. To parameterize the embedding functions f_θ, g_θ, and h_θ, we use the recently proposed graph neural network (GNN) architecture structure2vec (Dai et al., 2016; 2019). The implementation details of the architecture are described in Section 3.1.
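The query-key separation can be sketched as follows: with all candidate keys h_θ(R) stacked into a matrix, scoring every candidate against one query reduces to a single matrix-vector product. A minimal numpy sketch (function and variable names are hypothetical):

```python
import numpy as np

def backward_scores(f_P, g_given, key_matrix):
    """Scores psi(R | P, R_given) for every candidate at once.
    f_P: query embedding f_theta(P), shape (d,).
    g_given: list of g_theta(S) for previously selected reactants S.
    key_matrix: stacked keys h_theta(R) for all candidates, shape (|C|, d)."""
    q = f_P - np.sum(g_given, axis=0) if g_given else f_P
    q = q / (np.linalg.norm(q) + 1e-12)
    keys = key_matrix / (np.linalg.norm(key_matrix, axis=1, keepdims=True) + 1e-12)
    return keys @ q  # one matrix-vector product scores all |C| candidates
```

Since the keys depend only on the candidate set, they can be precomputed once and reused for every query, which is what makes scoring |C| ≈ 6 × 10^5 candidates per selection step affordable.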

Incorporating reaction types.

A human expert may have prior information about the reaction type, e.g., carbon-carbon bond formation, for the target product P. To utilize this prior knowledge, we add trainable bias vectors u^(t) and v^(t) for each reaction type t to the query embeddings of ψ and φ, respectively. For example, φ(P | R) becomes CosSim(Σ_{R'∈R} g_θ(R') + v^(t), h_θ(P)). The bias vectors are initialized to zero at the beginning of training.
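A minimal sketch of this bias mechanism (names hypothetical): because v^(t) starts at zero, providing no learned type information initially leaves the score unchanged, and training moves the query toward reactions of type t:

```python
import numpy as np

def forward_score_with_type(reactant_queries, h_P, type_bias):
    """phi(P | R) with a reaction-type bias: the trainable vector v_t is added
    to the aggregated reactant query before the cosine similarity. Because the
    bias starts at zero, an all-zero type_bias leaves the score unchanged."""
    q = np.sum(reactant_queries, axis=0) + type_bias
    return float(q @ h_P / (np.linalg.norm(q) * np.linalg.norm(h_P) + 1e-12))
```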

2.3. TRAINING SCHEME WITH CONTRASTIVE LEARNING

Finally, we describe our scheme for training the score functions defined in Sections 2.1 and 2.2. We are inspired by how the score functions ψ(R | P, R_given) and φ(P | R) resemble the classification scores of selecting (a) the reactant R given the product P and the previously selected reactants R_given, and (b) the product P given all of the selected reactants R, respectively. Based on this intuition, we consider two classification tasks with the following probabilities:

    p(R \mid P, R_{\mathrm{given}}, C) = \frac{\exp(\psi(R \mid P, R_{\mathrm{given}})/\tau)}{\sum_{R' \in C \setminus \{P\}} \exp(\psi(R' \mid P, R_{\mathrm{given}})/\tau)},
    q(P \mid R, C) = \frac{\exp(\phi(P \mid R)/\tau)}{\sum_{P' \in C \setminus R} \exp(\phi(P' \mid R)/\tau)},

where τ is a hyperparameter for temperature scaling and C is the given candidate set of molecules. Note that we do not consider P and R' ∈ R as available reactants and products for the classification tasks of p and q, respectively. This reflects our prior knowledge that the product P is always different from the reactants R in a chemical reaction. As a result, we arrive at the following losses defined on a reaction with product P and reactant-set R = {R_1, ..., R_n}:

    L_{\mathrm{backward}}(P, R \mid \theta, C) = -\max_{\pi \in \Pi} \sum_{i=1}^{n+1} \log p\big(R_{\pi(i)} \mid P, \{R_{\pi(1)}, \ldots, R_{\pi(i-1)}\}, C\big),
    L_{\mathrm{forward}}(P, R \mid \theta, C) = -\log q(P \mid R, C),

where R_{n+1} = R_halt and Π is the space of permutations on the integers 1, ..., n+1 satisfying π(n+1) = n+1. Minimizing these losses increases the scores ψ(R | P, R_given) and φ(P | R) of correct product-reactant pairs, i.e., the numerators, while decreasing those of wrong pairs, i.e., the denominators. Such an objective is known as a contrastive loss, which has recently gained much attention in various domains (Sohn, 2016; He et al., 2019; Chen et al., 2020b; Oord et al., 2018; Srinivas et al., 2020). Unfortunately, optimizing L_backward and L_forward is intractable since the denominators of p(R | P, R_given, C) and q(P | R, C) require summation over the large set of candidate molecules C.
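Concretely, each term of these losses is a temperature-scaled cross-entropy over a candidate set. The following minimal sketch (a hypothetical helper, not the RETCL implementation) computes one such term; note that the normalizing sum runs over the entire candidate score vector, which is exactly the part that becomes intractable when C is large:

```python
import math

def contrastive_term(scores, positive_idx, tau=0.1):
    """One classification term, e.g. -log p(R | P, R_given, C): a softmax over
    the candidate scores with temperature tau, where scores[positive_idx] is
    the score of the ground-truth molecule."""
    logits = [s / tau for s in scores]
    m = max(logits)  # subtract the max for numerical stability
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    return -(logits[positive_idx] - log_norm)
```

Lowering the loss requires raising the positive score relative to all other candidates, which is the contrastive effect described above.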
To resolve this, for each mini-batch of reactions B sampled from the training dataset, we approximate C with the following set of molecules:

    C_B = \{ M \mid \exists\, (R, P) \in B \text{ such that } M = P \text{ or } M \in R \},

i.e., C_B is the set of all molecules in B. We then arrive at the following training objective:

    L(B \mid \theta) = \frac{1}{|B|} \sum_{(R, P) \in B} \big[ L_{\mathrm{backward}}(P, R \mid \theta, C_B) + L_{\mathrm{forward}}(P, R \mid \theta, C_B) \big].    (2)

Hard negative mining. In our setting, molecules in the candidate set C_B are easily distinguishable. Hence, learning to discriminate between them is often not informative. To alleviate this issue, we replace C_B with an augmented version \widehat{C}_B obtained by adding hard negative samples, i.e., similar molecules, as follows:

    \widehat{C}_B = C_B \cup \bigcup_{M \in C_B} \{ \text{top-}K \text{ nearest neighbors of } M \text{ from } C \},

where K is a hyperparameter controlling the hardness of the contrastive task. The nearest neighbors are defined with respect to the cosine similarity on {h_θ(M)}_{M ∈ C}. Since computing all embeddings {h_θ(M)}_{M ∈ C} at every iteration is time-consuming, we update the nearest-neighbor information periodically. We found that hard negative mining plays a significant role in improving the performance of RETCL (see Section 3.3).
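The mining step can be sketched as follows, assuming the periodically refreshed embeddings are given as dense arrays (names are hypothetical; a real implementation would map indices back to molecule identifiers):

```python
import numpy as np

def mine_hard_negatives(batch_emb, candidate_emb, k):
    """Indices of the top-K nearest candidates (cosine similarity) for each
    in-batch molecule; these are the hard negatives added to C_B. The rows
    play the role of h_theta(M) and are refreshed only periodically."""
    b = batch_emb / (np.linalg.norm(batch_emb, axis=1, keepdims=True) + 1e-12)
    c = candidate_emb / (np.linalg.norm(candidate_emb, axis=1, keepdims=True) + 1e-12)
    sims = b @ c.T                            # (|B|, |C|) cosine similarities
    topk = np.argsort(-sims, axis=1)[:, :k]   # K most similar candidates per row
    return set(topk.ravel().tolist())
```

Note that each molecule's nearest neighbor is itself when it belongs to C, which is harmless here since C_B already contains the batch molecules.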

3.1. EXPERIMENTAL SETUP

Dataset. We mainly evaluate our framework on USPTO-50k, a standard benchmark for retrosynthesis. It contains 50k reactions of 10 reaction types derived from the US patent literature, and we divide it into training/validation/test splits following Coley et al. (2017b). To apply our framework, we choose the candidate set C of commercially available molecules as all reactants in the entire USPTO database, as Guo et al. (2020) did. This results in a candidate set of size 671,518. As the evaluation metric, we use the top-k exact-match accuracy, which is widely used in the retrosynthesis literature. We also experiment with other USPTO benchmarks for more challenging tasks, e.g., generalization to unseen templates. We provide a more detailed description of the USPTO benchmarks in Appendix A.

Hyperparameters. We use a single shared 5-layer structure2vec (Dai et al., 2016; 2019) architecture and three separate 2-layer residual blocks with an embedding size of 256. To obtain graph-level embedding vectors, we use sum pooling instead of mean pooling since it captures the size information of molecules. For contrastive learning, we use a temperature of τ = 0.1 and K = 4 nearest neighbors for hard negative mining. More details are provided in Appendix B.

3.2. SINGLE-STEP RETROSYNTHESIS IN USPTO-50K

Table 1 evaluates RETCL and other baselines using the top-k exact-match accuracy with k ∈ {1, 3, 5, 10, 20, 50}. We first note that our framework significantly outperforms a concurrent selection-based approach [2], Bayesian-Retro (Guo et al., 2020), by 23.8% and 23.7% in terms of top-1 accuracy when the reaction type is unknown and given, respectively. Furthermore, ours also outperforms template-based approaches utilizing different knowledge, i.e., reaction templates instead of candidates, by a large margin, e.g., 18.8% over GLN (Dai et al., 2019) in terms of top-1 accuracy when the reaction type is unknown.

Incorporating the knowledge of candidates into baselines. However, it is hard to fairly compare methods operating under different assumptions. For example, template-based approaches require knowledge of reaction templates, while our selection-based approach requires knowledge of available reactants. To alleviate this concern, we incorporate our prior knowledge of the candidates C into the baselines; we filter out reactants outside the candidates C from the predictions made by the baselines. As reported in Table 2, our framework still outperforms the template-free approaches by a large margin, e.g., Transformer (Chen et al., 2019) achieves 68.4% top-1 accuracy, while we achieve 78.9% when the reaction type is given. Although GLN uses more knowledge than ours in this setting, its top-k accuracy saturates at 93.3%, which is the coverage of known templates, i.e., the upper bound of template-based approaches. In contrast, our framework continues to increase the top-k accuracy as k grows, e.g., 97.5% in terms of top-200 accuracy. We additionally compare with SCROP (Zheng et al., 2019) using their publicly available predictions with reaction types; SCROP achieves 70.4% top-1 accuracy, which also underperforms ours.

3.3. ANALYSIS AND ABLATION STUDY

Failure cases. Figure 3 shows examples of wrong predictions from our framework. We found that the reactants of wrong predictions are still similar to the ground-truth ones. For example, the top-3 predictions for examples A and B are partially correct; the larger reactant is correct while the smaller one is slightly different. In example C, the ring at the center of the product is broken in the ground-truth reactants, while RETCL predicts non-broken reactants. Surprisingly, in the chemical database Reaxys, we found a synthetic route that synthesizes the target product starting from the reactants in the top-2 prediction. We attach the corresponding route in Appendix C. These results show that RETCL could provide meaningful information for retrosynthetic analysis in practice.

Nearest neighbors on molecular embeddings. For the hard negative mining described in Section 2.3, it is required to find similar molecules using the cosine similarity on {h_θ(M)}_{M ∈ C}. As illustrated in Figure 4, h_θ(M) is capable of capturing molecular structures.

Effect of components. Table 3 shows the effect of the components of our framework. First, we found that the hard negative mining described in Section 2.3 increases the performance significantly. This is because there are many similar molecules in the candidate set C, so a model could predict slightly different reactants without hard negative mining. We also demonstrate the effect of checking the synthesizability of the predicted reactants with φ(P | R). As seen in the fourth and fifth rows of Table 3, using φ(P | R) provides a 2.6% gain in terms of top-10 accuracy. Moreover, we empirically found that sum pooling for aggregating node embedding vectors is more effective than mean pooling. This is because the former can capture the size of molecules in the norm of the graph embedding vectors.

3.4. MORE CHALLENGING RETROSYNTHESIS TASKS

Generalization to unseen templates. An advantage of our framework over template-based approaches is its ability to generalize to unseen reaction templates. To demonstrate this, we remove reactions of classes (i.e., reaction types) 6 to 10 from the training/validation splits of the USPTO-50k benchmark, leaving 27k reactions. In this case, the templates extracted from the modified dataset cannot be applied to reactions of the removed classes. Hence, template-based approaches suffer from a generalization issue; for example, GLN (Dai et al., 2019) cannot provide correct predictions for reactions of unseen types, as reported in Table 4, while our RETCL is able to provide correct answers.

Table 5: Generalization to USPTO-full.

Method                           Top-1   Top-10   Top-50
Transformer (Chen et al., 2019)   29.9     46.6     51.0
GLN (Dai et al., 2019)            26.7     42.2     46.7
RETCL (Ours)                      39.9     57.1     60.9

We also conduct a more realistic experiment: testing on a larger dataset, the test split of the USPTO-full dataset preprocessed by Dai et al. (2019), using a model trained on a smaller dataset, USPTO-50k. We note that the number of reactions for training, 40k, is smaller than that of the testing reactions.

Multi-step retrosynthesis. To consider a more practical scenario, we evaluate our algorithm on the task of multi-step retrosynthesis. To this end, we use the synthetic-route benchmark provided by Chen et al. (2020a). Here, we assume that only the building blocks (or starting materials) are commercially available, and intermediate reactants need to be synthesized from the building blocks. In this challenging task, we demonstrate how our method can be used to improve the existing template-free Transformer model (TF, Chen et al. 2019). Given a target product, the hybrid algorithm operates as follows: (1) RETCL proposes a set of reactants from the candidates C; (2) TF proposes additional reactants outside the candidates C; (3) TF chooses the top-K reactants among all the proposed reactants based on its log-likelihood. As an additional baseline, we replace RETCL with another independently trained TF in the hybrid algorithm. We use Retro* (Chen et al., 2020a) for efficient route search with the retrosynthesis models and evaluate the discovered routes based on the metrics used by Kishimoto et al. (2019) and Chen et al. (2020a). As reported in Table 6, our model can enhance the search quality of the existing template-free model in multi-step retrosynthesis scenarios. This is because RETCL is able to recommend available and plausible reactants to TF at each retrosynthesis step. Note that the MLP column is the same as reported in Chen et al. (2020a), which uses a template-based single-step MLP model.
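Steps (1)-(3) of the hybrid algorithm can be sketched as follows (a simplified illustration with hypothetical names; in the actual setup the ranking scores come from the trained Transformer's log-likelihood):

```python
def hybrid_proposals(retcl_reactants, tf_reactants, tf_log_likelihood, k):
    """Steps (1)-(3) of the hybrid: merge RETCL's in-candidate proposals with
    the Transformer's out-of-candidate ones, then keep the top-k proposals
    ranked by the Transformer's log-likelihood."""
    merged = list(dict.fromkeys(retcl_reactants + tf_reactants))  # dedupe, keep order
    merged.sort(key=tf_log_likelihood, reverse=True)
    return merged[:k]
```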
The detailed description of this multi-step retrosynthesis experiment and the discovered routes are provided in Appendix D.

4. RELATED WORK

Template-based approaches (Coley et al., 2017b; Segler & Waller, 2017; Dai et al., 2019) rely on reaction templates that are extracted from a reaction database (Coley et al., 2017a; 2019) or encoded by experts (Szymkuć et al., 2016). They first select one of the known templates and then apply it to the target product. On the other hand, template-free methods (Liu et al., 2017; Karpov et al., 2019; Zheng et al., 2019; Shi et al., 2020) consider retrosynthesis as a conditional generation problem such as machine translation. Recently, synthon-based approaches (Shi et al., 2020; Somnath et al., 2020) have also shown promising results by utilizing the atom-mapping between products and reactants as an additional type of supervision. Concurrent to our work, Guo et al. (2020) also propose a selection-based approach, Bayesian-Retro, based on sequential Monte Carlo sampling (Del Moral et al., 2006). As reported in Table 1, our RETCL significantly outperforms Bayesian-Retro. The gap is even more evident considering that Bayesian-Retro uses 6 × 10^5 forward evaluations (i.e., 6 hours [3]) of Molecular Transformer (Schwaller et al., 2019a) for single-step retrosynthesis of one target product, while RETCL requires only one second.

5. CONCLUSION

In this paper, we propose RETCL for solving retrosynthesis. To this end, we reformulate retrosynthesis as a problem of selecting commercially available reactants, and propose a contrastive learning scheme with hard negative mining to train RETCL. Through extensive experiments, we show that our framework achieves outstanding performance on the USPTO benchmarks. Furthermore, we demonstrate the generalizability of RETCL to unseen reaction templates. We believe that extending our framework to multi-step retrosynthesis or combining it with contrastive learning techniques from other domains could be interesting future research directions.

B IMPLEMENTATION DETAILS

We here provide a detailed description of our implementation. Since the USPTO datasets provide molecule information in the SMILES (Weininger, 1988) format, we convert each SMILES representation into a bidirectional graph with atom and bond features. To this end, we use RDKit [4] and the Deep Graph Library (DGL) (Wang et al., 2019). Let G = (V, E) be the molecular graph, and let X(v) ∈ R^{d_atom} and X(uv) ∈ R^{d_bond} be the features of an atom v ∈ V and a bond uv ∈ E, respectively. The atom feature X(v) includes the atom type (e.g., C, I, B), degree, formal charge, and so on; the bond feature X(uv) includes the bond type (single, double, triple, or aromatic), whether the bond is in a ring, and so on. For more details, we recommend DGL and its extension, DGL-LifeSci [5].

Architecture. We build our graph neural network (GNN) architecture on the molecular graph G with features X as follows:

    H^{(0)}(v) = \mathrm{ReLU}\Big(\mathrm{BN}\Big(W^{(0)}_{\mathrm{atom}} X(v) + \sum_{u \in N(v)} W^{(0)}_{\mathrm{bond}} X(uv)\Big)\Big),
    \widetilde{H}^{(l)}(v) = \mathrm{ReLU}\Big(\mathrm{BN}\Big(W^{(l)}_1 \sum_{u \in N(v)} H^{(l-1)}(u) + \sum_{u \in N(v)} W^{(l)}_{\mathrm{bond}} X(uv)\Big)\Big),
    H^{(l)}(v) = \mathrm{ReLU}\Big(\mathrm{BN}\Big(W^{(l)}_2 \widetilde{H}^{(l)}(v) + H^{(l-1)}(v)\Big)\Big), \quad l = 1, 2, \ldots, L,
    H(v) = W_{\mathrm{last}} H^{(L)}(v),

where N(v) is the set of vertices adjacent to v. This architecture is based on structure2vec (Dai et al., 2016; 2019), but it is slightly different from the model used by Dai et al. (2019): we use ReLU after BN instead of BN after ReLU, and we append a last linear layer W_last. Based on the atom-level embeddings H(v), we construct the query and key embeddings f_θ, g_θ, and h_θ using three separate residual blocks as follows:

    f_\theta(M) = \sum_{v \in V} \Big[ H(v) + \mathrm{BN}\big(W^{(f)}_2 \mathrm{ReLU}\big(\mathrm{BN}\big(W^{(f)}_1 \mathrm{ReLU}(H(v))\big)\big)\big) \Big],
    g_\theta(M) = \sum_{v \in V} \Big[ H(v) + \mathrm{BN}\big(W^{(g)}_2 \mathrm{ReLU}\big(\mathrm{BN}\big(W^{(g)}_1 \mathrm{ReLU}(H(v))\big)\big)\big) \Big],
    h_\theta(M) = \sum_{v \in V} \Big[ H(v) + \mathrm{BN}\big(W^{(h)}_2 \mathrm{ReLU}\big(\mathrm{BN}\big(W^{(h)}_1 \mathrm{ReLU}(H(v))\big)\big)\big) \Big],

where M is the molecule corresponding to the molecular graph G. Note that θ includes all the W matrices defined above, and we omit the bias vectors of the linear layers for notational simplicity.
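As an illustration, the message-passing update above can be sketched in a few lines of numpy. This simplified version omits the bond-feature terms and BatchNorm for brevity and is not the exact training architecture:

```python
import numpy as np

def gnn_forward(X_atom, adj, W_atom, W1, W2, W_last, n_layers):
    """Simplified structure2vec-style message passing: an initial atom
    embedding, n_layers neighbor-aggregation updates with a residual
    connection, and a final linear layer.
    X_atom: (|V|, d_atom) atom features; adj: (|V|, |V|) 0/1 adjacency."""
    relu = lambda z: np.maximum(z, 0.0)
    H = relu(X_atom @ W_atom)              # H^(0)(v)
    for _ in range(n_layers):
        msg = adj @ H                      # sum over neighbors u in N(v)
        H = relu(relu(msg @ W1) @ W2 + H)  # update with residual connection
    return H @ W_last                      # per-atom embeddings H(v)
```

The adjacency-matrix product `adj @ H` computes the neighbor sum Σ_{u ∈ N(v)} H(u) for every atom at once, which is how message passing is typically vectorized.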
We found that these design choices, e.g., sharing the GNN layers and using residual blocks, also provide an accuracy gain. Therefore, more sophisticated architecture designs could provide further improvements; we leave this for future work.

Optimization. To learn the parameters θ, we use stochastic gradient descent (SGD) with a learning rate of 0.01, a momentum of 0.9, a weight decay of 10^-5, a batch size of 64, and a gradient clip of 5.0. We train our model for 200k iterations and evaluate on the validation split every 1000 iterations. The nearest-neighbor information is also updated every 1000 iterations. When evaluating on the test split, we use the best validation model with a beam size of 200. To sum up, we use PyTorch (Paszke et al., 2017) for automatic differentiation, the Deep Graph Library (Wang et al., 2019) for building graph neural networks, and RDKit for processing SMILES (Weininger, 1988) representations. All our models can be executed on a single NVIDIA RTX 2080 Ti GPU.

As illustrated in Figure 5, we found that RETCL's prediction differs from the ground-truth reactants in USPTO-50k; however, it exists as a 3-step reaction with two reagents (sodium acetate and thiophene) in the chemical literature (Gonda & Novak, 2015) [6]. Note that our framework currently does not consider reagent prediction. Therefore, our prediction can be regarded as an available (i.e., correct) synthetic path in practice.
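The optimization recipe can be sketched as a single parameter update. This is a plain-numpy illustration of SGD with momentum, weight decay, and gradient-norm clipping using the stated hyperparameters; in practice these steps are handled by PyTorch's optimizer and gradient-clipping utilities:

```python
import numpy as np

def sgd_momentum_step(param, grad, velocity,
                      lr=0.01, momentum=0.9, weight_decay=1e-5, clip=5.0):
    """One update with the stated hyperparameters: weight decay folded into
    the gradient, gradient-norm clipping at 5.0, then momentum SGD."""
    g = grad + weight_decay * param
    norm = np.linalg.norm(g)
    if norm > clip:
        g = g * (clip / norm)          # rescale so the gradient norm equals clip
    velocity = momentum * velocity + g
    return param - lr * velocity, velocity
```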

D MULTI-STEP RETROSYNTHESIS

For the multi-step retrosynthesis experiment described in Section 3.4, we use the synthetic-route dataset provided by Chen et al. (2020a). This dataset is constructed from the USPTO (Lowe, 2012) database like the other benchmarks; we refer the reader to Chen et al. (2020a) for the construction details. The dataset contains 299,202 training routes, 65,274 validation routes, and 190 test routes. We first extract single-step reactions and molecules from the training and validation splits of the dataset. The extracted reactions are used for training our RETCL and the Transformer (TF, Chen et al. 2019), and the molecules are used as the candidate set C_train [7] for ours. When testing the single-step models with Retro* (Chen et al., 2020a), we use all starting molecules (i.e., 114,802 molecules) in the routes of the dataset as the candidate set C. This reflects more practical scenarios because intermediate reactants are often unavailable in multi-step retrosynthesis. We remark that TF also uses the candidate set C as prior knowledge for finishing the search procedure.

The evaluation metrics used in Section 3.4 are the success rate and the average length of routes. Success means that a synthetic route for a target product is discovered within a limit on the number of expansions. We set the limit to 100 and use only the top-5 predictions of a single or hybrid model for each expansion. When computing the average length, we only consider the cases where all the single-step models discover routes successfully. As Chen et al. (2020a) did, we use the negative log-likelihood computed by TF as the reaction cost.

E GENERALIZATION TO UNSEEN CANDIDATES

The knowledge of the candidate set C could be updated after learning the RETCL framework. In this case, the set used in the test phase is larger than that in the training phase, i.e., C train C test . One can learn the framework once again, however someone want to use it instantly without additional training. To validate that our framework can generalize to unseen candidates, we conduct an additional experiments with a smaller candidate set C small . We first train our model with C train = C small and then test with the larger candidate set C test = C large . Here we consider two cases of C small : (a) 91k molecules in training and validation splits of (b) 100k molecules in all splits of USPTO-50k. As reported in Table 8 , the model trained with C small achieves comparable performance to the model trained with C large . This demonstrates that our model trained with a small corpora (e.g., USPTO-50k) can work with unseen candidates. F RESTRICTION OF KNOWLEDGE OF CANDIDATE REACTANTS One might have very restricted knowledge of the candidate set C of commercially available reactants due to own circumstances such as a budget limit. In this case, one of ground-truth reactants might be missing from the candidate set. Since there exist multiple solutions in the retrosynthesis task, retrosynthesis tools should be able to provide alternative solutions in such a case. To verify that our RETCL can recommend such a solution, we experiment with varying sizes of the candidate set C. Remark that using a smaller candidate set means only a small number of reactants is practically available, thus performance degradation is expected. Here, we use two metrics: exact-match accuracy and coverage proposed by Schwaller et al. (2019b) . To be specific, the coverage measures whether the target product is synthesized from the predicted reactant-set made by RetCL using a forward synthesis model (Schwaller et al., 2019a) . 
This can evaluate the plausibility of a prediction even if one of the ground-truth reactants is missing from the candidate set. We report the metrics using the top-10 predictions on the USPTO-50k test split. As shown in Table 9, even though some ground-truth candidates are missing, our framework can provide plausible solutions.
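The coverage metric amounts to a round-trip check. The following sketch illustrates it; `forward_model` is a hypothetical stand-in for a forward synthesis predictor (e.g., the model of Schwaller et al., 2019a) that maps a reactant set to a predicted product.

```python
def coverage_at_k(product, predicted_reactant_sets, forward_model, k=10):
    """Round-trip coverage check (sketch).

    `forward_model` is a hypothetical stand-in for a forward synthesis
    predictor: given a reactant set, it returns a predicted product.
    A target is covered if any of the top-k predicted reactant sets
    maps back to the target product.
    """
    for reactants in predicted_reactant_sets[:k]:
        if forward_model(reactants) == product:
            return True
    return False
```

In practice the comparison would be performed on canonicalized SMILES strings (e.g., via RDKit) rather than raw strings.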



Footnotes:
- Reaxys, a chemical database: https://www.reaxys.com
- Note that Bayesian-Retro (Guo et al., 2020) is not scalable to a large candidate set; see Section 4 for details and https://github.com/zguo235/bayesian_retro.
- RDKit, open-source cheminformatics software: https://www.rdkit.org/
- DGL-LifeSci, bringing graph neural networks to chemistry and biology: https://lifesci.dgl.ai/
- We found this synthetic path and the corresponding literature from a chemical database, Reaxys. Note that the sodium acetate and the thiophene are considered as reagents in Reaxys.
- Note that this candidate set is used only for training.



Figure 1: Examples of (a) a chemical reaction and (b) the corresponding reaction template in the USPTO-50k dataset. The objective of retrosynthesis is to find the reactants for the given product.

Figure 2: Illustration of the search procedure in RETCL. It first (1-3) selects reactants sequentially based on ψ(R|P, R_given), and then (4) checks the synthesizability of the selected reactant set based on φ(P|R). The overall score is the average over all scores from (1) to (4).

Figure 3: Failure cases of RETCL. The panels show example A with its top-4 nearest neighbors and example B with its top-4 nearest neighbors.

Figure 5: A synthetic path existing in Reaxys based on RETCL's prediction.

Figures 6 and 7 illustrate the routes discovered by TF and RETCL+TF under the aforementioned setting. The molecules in the blue boxes are building blocks (i.e., available reactants) and the numbers indicate the reaction costs (i.e., the negative log-likelihoods computed by TF). As shown in the figures, our algorithm discovers a shorter and cheaper route.

Figure 6: Synthetic routes discovered by (a) Transformer and (b) our RETCL+Transformer.

Figure 7: Synthetic routes discovered by (a) Transformer and (b) our RETCL+Transformer.

The top-k exact match accuracy (%) of computer-aided approaches on USPTO-50k. The template-based approaches use the knowledge of reaction templates while the others do not. † The results are reproduced using the code of Chen et al. (2019).

The top-k exact match accuracy (%) of our RETCL, Transformer (Chen et al., 2019), and GLN (Dai et al., 2019) when discarding predictions not in the candidate set C.

Ablation study.

The top-10 exact match accuracy (%) of our RETCL and GLN (Dai et al., 2019) trained on USPTO-50k without reaction types 6 to 10. The average column indicates the average of class-wise accuracies for each reaction type. As reported in Table 5, our framework provides a consistent benefit over the template-based approaches. These results demonstrate the generalization ability of our framework.

Multi-step retrosynthesis.

Generalization to unseen candidates. The columns report |C_train|, |C_test|, and the top-1, top-3, top-5, top-10, top-20, and top-50 accuracies.

Top-10 exact-match accuracy (%) and coverage (%) under restricted knowledge of the candidate set C of commercially available reactants.

A DATASET DETAILS

We here describe the details of the USPTO datasets. The reactions in the USPTO datasets are derived from the US patent literature (Lowe, 2012). The entire set, USPTO 1976-2016, contains 1.8 million raw reactions. The commonly-used benchmark for single-step retrosynthesis is USPTO-50k, which contains 50k clean atom-mapped reactions that can be classified into 10 broad reaction types (Schneider et al., 2016). See Table 7a for information on the reaction types. For the generalization experiments in Section 3.4, we introduce a filtered dataset, USPTO-50k-modified, which contains only reactions of types 1 to 5. We report the number of reactions in the modified dataset in Table 7b. We also use the USPTO-full dataset, provided by Dai et al. (2019), which contains 1.1 million reactions. Note that we use only the test split of USPTO-full (i.e., 101k reactions) for testing generalizability, and that we do not use atom-mappings in the USPTO benchmarks. Moreover, we do not consider reagents for single-step retrosynthesis, following prior work (Liu et al., 2017; Dai et al., 2019; Lin et al., 2019; Karpov et al., 2019).
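The construction of USPTO-50k-modified reduces to filtering by the broad reaction type label. A minimal sketch, assuming each reaction is represented as a dictionary with a `"reaction_type"` field (the field name is an assumption for illustration; the actual dataset format may differ):

```python
def filter_by_reaction_type(reactions, keep_types=frozenset({1, 2, 3, 4, 5})):
    """Keep only reactions whose broad reaction type is in `keep_types`.

    `reactions` is assumed to be a list of dicts with an integer
    "reaction_type" field (an illustrative assumption, not the actual
    USPTO-50k storage format).
    """
    return [rxn for rxn in reactions if rxn["reaction_type"] in keep_types]
```

Applying this filter to USPTO-50k with `keep_types={1, ..., 5}` yields the USPTO-50k-modified split used for the generalization experiments.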

