SMILESFORMER: LANGUAGE MODEL FOR MOLECULAR DESIGN

Abstract

The objective of drug discovery is to find novel compounds with desirable chemical properties. Generative models have been utilized to sample molecules at the intersection of multiple property constraints. In this paper, we pose molecular design as a language modeling problem in which the model implicitly learns the vocabulary and composition of valid molecules and is hence able to generate novel molecules of interest. We present SmilesFormer, a Transformer-based model that encodes molecules, molecule fragments, and fragment compositions as latent variables, which are in turn decoded to stochastically generate novel molecules. This is achieved by fragmenting molecules into smaller combinatorial groups and learning the mapping between the input fragments and valid SMILES sequences. The model optimizes molecular properties through a stochastic latent-space traversal technique, which systematically searches the encoded latent space for latent vectors that decode to molecules meeting a multi-property objective. We validated the model on various de novo molecular design tasks, achieving state-of-the-art performance compared with previous methods. Furthermore, we used the proposed method to demonstrate a drug rediscovery pipeline for Donepezil, a known acetylcholinesterase inhibitor.

1. INTRODUCTION

Generative models play a major role in discovering and designing new molecules, which is key to innovation in in-silico drug discovery (Schwalbe-Koda & Gómez-Bombarelli, 2020). A vast amount of molecular data is available, so generative models should be able to learn what constitutes a valid and desirable molecule using a data-centric approach. However, due to the high dimensionality of the molecular space (Probst & Reymond, 2020) and the substantial number of data samples, it remains challenging to traverse the space of valid molecules that satisfy the objectives of de novo molecular design. Other challenges identified for the de novo molecule generation task include the reliance on brute-force trial-and-error search for hit compounds, the difficulty of designing an effective reward function for molecular optimization, and limited sample efficiency due to online oracle calls during optimization (Fu et al., 2022). Addressing these challenges is still an active area of research, with approaches based on reinforcement learning (Zhou et al., 2019; Jin et al., 2020b), genetic algorithms (Wüthrich et al., 2021), variational auto-encoders (VAEs) (Kusner et al., 2017), and generative adversarial networks (GANs) (Schwalbe-Koda & Gómez-Bombarelli, 2020; Prykhodko et al., 2019). In this work, we employ a Transformer-based language model (Vaswani et al., 2017) to encode a molecular latent space by generating valid molecule sequences from fragments and fragment compositions. The encoded latent space can then be explored to generate molecules (represented by SMILES strings (Weininger, 1988)) that satisfy desired properties. Similar to Gao et al.
(2022), who addressed the problem of synthesizability by modeling synthetic pathways within the molecule design pipeline, we leverage data to minimize costly experiments in downstream generation and optimization tasks, using a fragment-based approach that associates synthesizable building blocks with target molecules during training. Our model, however, introduces an online fragmentation approach, removing the need to create a separate fragment dataset while learning the intrinsic relationships that make up a valid SMILES string; this is, in essence, an approach for learning a SMILES language model. While SMILES is often considered less informative than other molecular representations, we argue that it is simple and easy to follow, as it is a linear walk through the molecular structure. We also see its non-canonical property as a form of data augmentation, which has been shown to benefit the training of generative models (Coley, 2021). We explore this idea by using only non-canonical SMILES as input to our model and teaching the model to generate canonical SMILES. Our contributions are summarized as follows: (1) We propose an approach for learning efficient representations of the molecular space using molecule fragments; our fragment-based training pipeline constrains the model to learn the building blocks necessary to generate stable molecules while also meeting multi-property objectives. (2) We present an optimization strategy that efficiently traverses the molecular space with flexible parameterization. (3) We demonstrate a practical de novo molecular optimization use case with a rediscovery pipeline for an established acetylcholinesterase inhibitor.
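To make the latent-space traversal idea concrete, the following is a minimal stdlib sketch of stochastic hill climbing over a latent vector. The decode-then-score step of the actual pipeline is replaced by a hypothetical `toy_score` oracle (the paper's decoder and property predictors are not reproduced here), so this illustrates only the search strategy, not the chemistry.

```python
import random

def toy_score(z):
    # Hypothetical multi-property oracle: stands in for decoding a latent
    # vector into a SMILES string and scoring the resulting molecule.
    # Here the optimum is simply the all-ones vector.
    return -sum((zi - 1.0) ** 2 for zi in z)

def stochastic_traversal(z0, score, n_steps=200, step_size=0.1, seed=0):
    """Randomly perturb a latent vector and keep perturbations that
    improve the objective -- a simple stochastic hill climb."""
    rng = random.Random(seed)
    z, best = list(z0), score(z0)
    for _ in range(n_steps):
        cand = [zi + rng.gauss(0.0, step_size) for zi in z]
        s = score(cand)
        if s > best:  # accept only improving moves
            z, best = cand, s
    return z, best

z, best = stochastic_traversal([0.0] * 8, toy_score)
```

In the full method, `score` would aggregate several property oracles, which is where the multi-property objective enters; the search loop itself is deliberately oracle-agnostic.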

2. RELATED WORK

The majority of generative models focus on the generation of valid molecules with desired properties (Gao et al., 2022), albeit with heavy reliance on domain knowledge from medicinal chemistry to guide the generation process. For example, in Jin et al. (2018), an approach based on VAEs (Kingma & Welling, 2014) generates a molecular graph by first generating a tree-structured scaffold over chemical substructures; this involves building a cluster vocabulary that grows with the size of the considered chemical space. Other recent state-of-the-art approaches are based on genetic algorithms with expert-guided learning (Ahn et al., 2020; Nigam et al., 2021; Wüthrich et al., 2021), a differentiable scaffolding tree aided by a graph convolutional knowledge network (Fu et al., 2022), and VAE reinforcement learning with an expert-designed reward function to explore the encoded latent space (Zhavoronkov et al., 2019). Our work is motivated by Fabian et al. (2020), who made an earlier attempt at using a Transformer-based language model trained over a vast amount of data to learn molecular representations for domain-relevant auxiliary tasks, but did not explore de novo molecular design. Defining molecular design as a language problem enables us to move away from heavy reliance on expert knowledge and allows us to choose a method that can scale to the massive amount of available data. Intuitively, we see fragments as analogous to words or phrases in a sentence; learning the relationship between fragments and full sequences means that a full sequence can be generated from fragments, or even from a combination of fragments. The most similar works to this idea use molecule fragments as the building blocks of molecule generation pipelines (Polishchuk, 2020), showing that fragment-based approaches sit between atom-based and reaction-based approaches.
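The fragments-as-words analogy presupposes some tokenization of SMILES strings into a vocabulary. The paper does not specify its tokenizer, so the sketch below uses a common regex-based scheme (handling bracket atoms, two-letter elements such as `Cl` and `Br`, and ring-bond digits) purely as an illustration of how a SMILES "sentence" decomposes into tokens:

```python
import re

# Illustrative SMILES token pattern: bracket atoms, two-letter elements
# (tried before single letters so "Cl" is not split into "C" + "l"),
# two-digit ring bonds (%nn), organic-subset atoms, and bond/branch symbols.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|%\d{2}|[BCNOSPFIbcnosp]|\(|\)|\.|=|#|-|\+|/|\\|@|\d)"
)

def tokenize(smiles):
    """Split a SMILES string into tokens; the tokens round-trip exactly."""
    tokens = SMILES_TOKEN.findall(smiles)
    assert "".join(tokens) == smiles, "untokenizable characters present"
    return tokens

aspirin = "CC(=O)Oc1ccccc1C(=O)O"
tokens = tokenize(aspirin)
```

The round-trip assertion matters: because tokenization is lossless, a language model over these tokens is a model over SMILES strings themselves, which is the sense in which fragments and full sequences share one vocabulary.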
While that approach utilizes a database of known compounds to perform chemically reasonable mutations of input structures, ours relies on an online stochastic fragmentation process based on the retrosynthetic combinatorial analysis procedure (RECAP) (Lewell et al., 1998) as provided by the RDKit library (Landrum, 2022). This fragmentation is applied only during training, and fragments are not stored for later use. Another method (Firth et al., 2015) coupled a rule-based fragmentation scheme with a fragment-replacement algorithm to broaden the scope of re-connection options considered when generating potential solution structures. Fragments were used optionally in Nigam et al. (2021) to bias the operators in a genetic-algorithm pipeline; the effect of this optional bias, however, was not reported. Similar to REINVENT (Blaschke et al., 2020), a production-ready tool for de novo design applicable to drug discovery projects tackling exploration or exploitation problems while navigating the chemical space, we adopt a transfer-learning approach that uses a pretrained generative model as a prior and fine-tunes it on a smaller set of compounds relevant to the desired outcome in downstream tasks. To evaluate generative models for molecular design, distribution benchmarks and goal-directed benchmarks have been proposed (Brown et al., 2019; Polykovskiy et al., 2020). However, since molecular design is usually tied to specific targets, these earlier benchmarks do not fully represent how a generative model is utilized in the drug discovery pipeline. We observe that recent works have focused more on multi-objective property optimization, as this is more practically useful. Aided by standardized libraries like TD Commons (Huang et al., 2021), recent works by Jin et al. (2020b); Nigam et al. (2021); Fu et al. (2022), and Gao et al. (2022) evaluated their models on such multi-objective tasks.

