POLYRETRO: FEW-SHOT POLYMER RETROSYNTHESIS VIA DOMAIN ADAPTATION

Abstract

Polymers appear everywhere in our daily lives (fabrics, plastics, rubbers, and so on), and we could hardly live without them. To make polymers, chemists develop processes that combine smaller building blocks (monomers) to form long chains or complex networks (polymers). These processes are called polymerizations and usually take considerable human effort to develop. Although machine learning models for small molecules have produced many promising results, the prediction problem for polymerization is new and suffers from the scarcity of polymerization datasets available in the field. The problem is made even more challenging by the large size of polymers and by additional recursive constraints, which are not present in the small-molecule problem. In this paper, we take an initial step towards this challenge and propose a learning-based search framework that can automatically identify a sequence of reactions leading to the polymerization of a target polymer while using minimal polymerization data. Our method transfers models trained on small-molecule retrosynthesis datasets to check the validity of polymerization reactions. It also incorporates a template prior, learned on a limited amount of polymer data, to adapt the model from the small-molecule domain to the polymer domain. We demonstrate that our method is able to propose high-quality polymerization plans for a dataset of 52 real-world polymers, and that for more than 50% of them it successfully recovers the polymerization processes currently used in the real world.

1. INTRODUCTION

Human beings live in a world of chemical products, among which one category of chemicals, called polymers, plays an essential role. From fabrics to plastics to rubbers, polymers appear in every corner of our daily lives. Different circumstances call for polymers with different properties, and chemists have spent tremendous effort designing and synthesizing new polymers in pursuit of better properties. To make polymers, chemists develop processes that combine small building blocks, which we call monomers, to form longer chains or complex networks. Such processes are called polymerizations and take a significant amount of human effort to develop. Since the rise of deep learning (LeCun et al., 2015), applying these models to scientific problems, such as those in biology and chemistry, has gradually gathered attention. Specifically, the application of AI methods to the retrosynthetic design of chemical compounds has become very popular recently (Segler et al., 2018; Coley et al.). While most work focuses on synthesizing drug-like small molecules, the study of polymer retrosynthesis is still in its infancy. The reasons are manifold, but one of the most important is the lack of available polymerization datasets, which makes it difficult for existing learning-based methods to learn meaningful patterns for polymerization reactions. Moreover, polymers usually have a chain or network structure with repeat units, which is very different from small molecules. This additional constraint also introduces difficulties in the formulation and modeling of polymer design and retrosynthesis. In this paper, we focus specifically on the polymer retrosynthesis problem.
While there has been a series of works focusing on small molecule retrosynthesis (Corey & Wipke, 1969; Gasteiger et al., 1992; Coley et al., 2017; Liu et al., 2017; Segler & Waller, 2017; Segler et al., 2018; Coley et al.; Karpov et al., 2019; Baylon et al.; Schreck et al.; Dai et al., 2019; Chen et al., 2020), the problem of polymerization is very different and, from a machine learning perspective, challenging for two reasons:
1. To predict synthesis routes for polymer repeat units, additional structural constraints, such as the recursive constraint, must be imposed to guarantee a potentially valid polymerization procedure. Such constraints do not exist in existing formulations of molecule retrosynthesis, so most methods cannot be applied directly.
2. Polymerization data for training is very limited. Retrosynthesis models built for small molecules have access to at least tens of thousands of training examples, whereas polymerization data is tiny; in our case there are fewer than 100 examples. This is far too few for most existing models to learn any synthesis patterns.
In this paper, we formulate polymer retrosynthesis as a constrained optimization problem and present PolyRetro, a novel learning-based search framework to tackle it. With beam search and rejection sampling, PolyRetro is able to propose monomer candidates with high polymerization probability while satisfying monomer synthesizability constraints. Our method is based on reaction templates collected from small molecule reactions, which capture the local structural properties of those reactions. We leverage a one-step retrosynthesis model trained on a small molecule reaction dataset and adapt it to the polymer domain by incorporating a template prior learned on a tiny amount of polymerization data. To verify whether the proposed monomers are synthesizable, we employ Retro* (Chen et al., 2020), a multi-step retrosynthesis model, to predict their synthesis routes.
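To make the idea of adapting a small-molecule model with a template prior concrete, the following is a minimal sketch. It is not the PolyRetro implementation: it assumes the prior is a Laplace-smoothed frequency of templates in the small polymer dataset and that adaptation multiplies the one-step model's template probabilities by this prior and renormalizes. The template names, the probabilities, and the functions `template_prior` and `adapt_scores` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def template_prior(polymer_templates, all_templates, alpha=1.0):
    """Laplace-smoothed template frequencies from a tiny polymer dataset.

    Assumption: a simple count-based prior; the paper may learn it differently.
    """
    counts = Counter(polymer_templates)
    total = len(polymer_templates) + alpha * len(all_templates)
    return {t: (counts[t] + alpha) / total for t in all_templates}


def adapt_scores(model_probs, prior):
    """Reweight small-molecule model probabilities by the prior, then renormalize."""
    unnormalized = {t: p * prior[t] for t, p in model_probs.items()}
    z = sum(unnormalized.values())
    return {t: v / z for t, v in unnormalized.items()}


# Hypothetical template vocabulary and data, for illustration only.
all_templates = ["ester_formation", "amide_formation", "suzuki_coupling"]
polymer_templates = ["ester_formation", "ester_formation", "amide_formation"]

# Hypothetical output of a one-step model trained on small molecules.
model_probs = {
    "ester_formation": 0.3,
    "amide_formation": 0.2,
    "suzuki_coupling": 0.5,
}

prior = template_prior(polymer_templates, all_templates)
adapted = adapt_scores(model_probs, prior)
ranked = sorted(adapted, key=adapted.get, reverse=True)
print(ranked)
```

In this toy setting the small-molecule model alone prefers the coupling template, while the prior shifts the ranking towards the template that actually occurs in the polymer data. Smoothing keeps unseen templates at a nonzero probability, which matters when the polymer dataset has fewer than 100 examples.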
We demonstrate through experiments that PolyRetro is able to predict monomers accurately given target polymer repeat units. To our knowledge, we are the first to formulate, model, and tackle polymer retrosynthesis as a machine learning problem. The approach we developed is general in the sense that it can also be applied to other machine learning problems, such as theorem proving and program synthesis, where the results we want to obtain involve recursion. For instance, the analogous problem in theorem proving is deriving a proof for a theorem that contains recursive relations; the analogous problem in program synthesis is generating programs containing loops and recursive calls. We choose to focus our application on polymer synthesis because of its importance and high societal impact. See Appendix E for a more concrete discussion of the generality of PolyRetro. Our contributions are summarized below:
• We formulate the problem of polymer retrosynthesis as a constrained optimization problem. To our knowledge, this is the first machine learning formulation that takes the constraints in polymer retrosynthesis into consideration.
• We propose PolyRetro, a learning-based search framework that tackles the problem of polymer retrosynthesis. To our knowledge, this is also the first learning-based method in this problem setting.
• PolyRetro is able to recover 53% of ground truth monomers for a real-world polymer dataset using limited training data, significantly outperforming all existing algorithms.
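The combination of beam search with a constraint check, which underlies the search framework summarized in the contributions, can be illustrated with a small generic sketch. This is a simplification under stated assumptions, not the PolyRetro algorithm: states are tuples of hypothetical step names, `expand` is a stand-in for a one-step model that returns scored successors, and `is_valid` is a stand-in for a synthesizability oracle such as Retro*. Rejection is applied here as a plain filter on the final beam, and the polymer-specific recursive constraint is not modeled.

```python
import heapq
import math


def beam_search(start, expand, is_valid, beam_width=3, depth=2):
    """Keep the highest-scoring partial solutions at each step,
    then reject final candidates that fail the constraint check."""
    beam = [(0.0, start)]  # (cumulative log-probability, state)
    for _ in range(depth):
        candidates = []
        for logp, state in beam:
            for prob, nxt in expand(state):
                candidates.append((logp + math.log(prob), nxt))
        if not candidates:
            break
        beam = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    # Rejection step: discard candidates that violate the constraint.
    return [(logp, state) for logp, state in beam if is_valid(state)]


def expand(state):
    """Hypothetical one-step model: three possible steps with fixed probabilities."""
    return [(0.6, state + ("a",)), (0.3, state + ("b",)), (0.1, state + ("c",))]


def is_valid(state):
    """Hypothetical oracle: any route that uses step "b" is not synthesizable."""
    return "b" not in state


results = beam_search((), expand, is_valid)
for logp, state in results:
    print(state, round(math.exp(logp), 2))
```

Filtering only at the end keeps the sketch short. A practical system could also reject infeasible candidates during expansion so that beam slots are not spent on them.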

2. RELATED WORKS

Computer-aided retrosynthetic planning for chemical molecules was first formalized by E. J. Corey (Corey & Wipke, 1969) and has been deployed over the years since. The task of retrosynthetic design is to identify a series of reactions that leads to the synthesis of a target molecule. This is one of the most fundamental problems in organic chemistry. Recently, many machine learning methods have been proposed for an easier but still important subproblem, in which one is given a target molecule and the task is to predict its direct precursors (Coley et al.). Methods that tackle this "one-step version" of retrosynthesis can be roughly divided into two categories: template-based and template-free. A reaction template essentially describes how bonds and atoms change during a reaction, and it can be applied in reverse to obtain reactants from products. Accordingly, a series of methods predict reaction templates given product molecules in order to obtain the corresponding reactants (Coley et al., 2017; Segler & Waller, 2017; Baylon et al.; Dai et al., 2019). While powerful, these methods are not applicable when the training data comes without templates. To resolve this, there have been attempts to use

