DATA TRANSFER APPROACHES TO IMPROVE SEQ-TO-SEQ RETROSYNTHESIS

Abstract

Retrosynthesis is the problem of inferring reactant compounds that can synthesize a given product compound through chemical reactions. Recent studies on retrosynthesis focus on proposing more sophisticated prediction models, but the dataset used to train a model also plays an essential role in achieving the best-generalizing model. Generally, the dataset best suited for a specific task tends to be small. In such a case, the standard solution is to transfer knowledge from a large or clean dataset in the same domain. In this paper, we conduct a systematic and intensive examination of data transfer approaches for end-to-end generative models, applied to retrosynthesis. Experimental results show that typical data transfer methods can improve the test prediction scores of an off-the-shelf Transformer baseline model. In particular, the pre-training plus fine-tuning approach boosts the accuracy scores of the baseline, achieving a new state of the art. In addition, we conduct a manual inspection of the erroneous prediction results. The inspection shows that the pre-training plus fine-tuning models generate chemically appropriate or sensible proposals in almost all cases.

1. INTRODUCTION

Retrosynthesis, first formalized by Corey & Wipke (1969), is a fundamental chemical problem: inferring a set of reactant compounds that can be synthesized into a desired product compound through a series of chemical reactions. The search space over sets of compounds is inherently huge. Further, a product compound can be synthesized through different series of reactions from different reactant sets. Because of these difficulties, building a retrosynthesis engine has required years of effort by human chemical experts and a large knowledge base. Thus, expectations for machine-learning (ML) based retrosynthesis engines have grown in recent years. The need for retrosynthesis has become more pressing with the development of in silico (computational) chemical compound generation (Jin et al., 2018; Kusner et al., 2017), which has also been applied to new drug discovery for COVID-19 (Cantürk et al., 2020; Chenthamarakshan et al., 2020). These generative models can propose unseen compounds in silico but do not answer how to synthesize them in practice. Retrosynthesis engines can help chemists and pharmacists fill this gap. Practical retrosynthesis planning requires a strong model that learns the inherent biases of the target dataset while retaining the generalization performance needed for unseen (test) product compounds. The current trend is to focus on developing such strong ML model architectures, such as seq-to-seq models (Liu et al., 2017; Karpov et al., 2019) and graph-to-graph models (Shi et al., 2020; Yan et al., 2020; Somnath et al., 2020), which achieve the state-of-the-art (SotA) retrosynthesis accuracy. However, model architecture is not the only issue to consider. In the current deep neural network (DNN) era, the quantity (many samples) and quality (fewer noisy or corrupted samples) of the available dataset often govern the final performance of the ML model.
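In the seq-to-seq framing, each training sample is simply a (product, reactants) string pair. A minimal illustrative sketch follows; the reaction shown (acetylation of salicylic acid to aspirin) is a textbook example chosen for illustration, not a sample from any dataset used in this paper, and the function name is ours:

```python
# One retrosynthesis sample in the seq-to-seq (translation) framing:
# the source side is the product SMILES, the target side is the
# dot-separated SMILES of the reactant set.
sample = {
    "product": "CC(=O)Oc1ccccc1C(=O)O",            # aspirin
    "reactants": "O=C(O)c1ccccc1O.CC(=O)OC(C)=O",  # salicylic acid + acetic anhydride
}

def to_translation_pair(sample):
    # Retrosynthesis runs "backwards": translate product -> reactants.
    return sample["product"], sample["reactants"]

src, tgt = to_translation_pair(sample)
print(src, "->", tgt)
```

The dot separator is standard SMILES notation for a set of disconnected molecules, so a multi-reactant answer stays a single target string.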
The problem is that only a few large, high-quality supervised training datasets are publicly available. Instead, usually only a small and/or noisy dataset is available for the target application task. To cope with this problem, data transfer approaches are widely employed as ordinary research pipelines in the computer vision (CV), natural language processing (NLP), and machine translation (MT) domains (Kornblith et al., 2019; Xie et al., 2020; He et al., 2020; Khan et al., 2019; Sennrich et al., 2016). A data transfer approach transfers knowledge to help the difficult training on the small and/or noisy target dataset. That knowledge is imported from an augmented dataset, which is usually a large or clean dataset in the same domain but does not share the same task or assumptions as the target dataset. Such augmented datasets are beneficial if their quantity or quality is superior to that of the target dataset. However, this data transfer approach has not yet been well investigated in previous retrosynthesis studies, as explained above. In this paper, we conduct a systematic investigation of the effect of data transfer on the improvement of retrosynthesis models. We examine three standard data transfer methods: joint training, self-training, and pre-training plus fine-tuning. The results show that every data transfer method can improve the test prediction accuracy of an off-the-shelf Transformer retrosynthesis model. In particular, a Transformer with pre-training plus fine-tuning achieves performance comparable with, and in some cases better than, the SotA models. In addition, we conducted an intensive manual inspection of the erroneous prediction results. This inspection clarifies the limitations of our approaches.
At the same time, it reveals that the pre-training plus fine-tuning model generates chemically appropriate or sensible proposals in more than 99% of top-1 prediction cases. As seen, most efforts in ML-based retrosynthesis research are dedicated to stronger DNN architectures. In this paper, we propose another approach to improving the SotA of retrosynthesis: transferring the knowledge of additional datasets that are not directly prepared for the user's target task.
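The three data transfer schemes differ mainly in how the augmented dataset and the target dataset are combined during training. A minimal sketch of that dataset-handling logic, with `train` and `pseudo_label` as hypothetical stand-ins for a seq-to-seq training routine and a trained model's prediction step (these names are illustrative, not our actual implementation):

```python
# Illustrative sketches of the three data transfer schemes, reduced to how
# the target and augmented datasets are combined. `train(data, init=...)`
# and `pseudo_label(model, product)` are hypothetical stand-ins.

def joint_training(target_data, augmented_data, train):
    # Train a single model on the union of both datasets at once.
    return train(target_data + augmented_data)

def self_training(target_data, unlabeled_products, train, pseudo_label):
    # Train a teacher on the target data, pseudo-label extra unlabeled
    # products with it, then retrain a student on the enlarged dataset.
    teacher = train(target_data)
    pseudo = [(p, pseudo_label(teacher, p)) for p in unlabeled_products]
    return train(target_data + pseudo)

def pretrain_finetune(target_data, augmented_data, train):
    # Pre-train on the large augmented dataset, then continue training
    # (fine-tune) on the small target dataset from the pre-trained weights.
    pretrained = train(augmented_data)
    return train(target_data, init=pretrained)
```

The key difference is when each dataset influences the model: jointly (one mixed run), indirectly via teacher-generated labels, or sequentially as an initialization.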

2.2. DATA TRANSFER IN GENERAL ML DOMAINS

Generally, the performance of a statistical ML model depends on the design of the model (variable dependencies) and on the dataset. On the dataset side, both the quality (less feature noise, fewer mislabeled or conflicting samples) and the quantity (many samples, not-few and



Figure 1: Overview of the retrosynthesis problem.

Wei et al., 2016)). Among them, Liu et al. (2017) first introduced the LSTM seq-to-seq model. Their model handled compounds in the SMILES (Weininger, 1988) string representation, and the retrosynthesis problem was solved as an MT problem. Later, Karpov et al. (2019) replaced the LSTM with the Transformer (Vaswani et al., 2017), which is the current baseline seq-to-seq DNN and achieves good performance. Recently, Shi et al. (2020) introduced a graph-to-graph approach, which treats compounds as molecular graphs with the help of graph neural networks. This approach matches human experts' intuition well, and a very recent model (Somnath et al., 2020) greatly improved the accuracy, outperforming the former SotA model based on a deep logic network (Dai et al., 2019).
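To treat retrosynthesis as MT, the SMILES strings must first be tokenized. A common choice in this line of work is a regular expression that splits a SMILES string into chemically meaningful units (bracketed atoms, two-letter atoms such as Br and Cl, bonds, and ring-closure digits) rather than raw characters. A minimal sketch, assuming this widely used tokenization pattern; the function name is ours:

```python
import re

# Regular expression splitting SMILES into chemically meaningful tokens:
# bracketed atoms ([nH], [O-], ...), two-letter atoms (Br, Cl), single
# atoms, bond symbols, and ring-closure digits. Alternation order matters:
# "Br?" and "Cl?" must precede the single-letter atoms.
SMILES_TOKEN_RE = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/"
    r"|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize_smiles(smiles):
    tokens = SMILES_TOKEN_RE.findall(smiles)
    # Sanity check: tokenization must be lossless, i.e. rejoining the
    # tokens must reproduce the input string exactly.
    assert "".join(tokens) == smiles, "untokenizable SMILES"
    return tokens

print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin
```

The resulting token sequences are then fed to the seq-to-seq model exactly like source/target sentences in machine translation.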

