SMILESFORMER: LANGUAGE MODEL FOR MOLECULAR DESIGN

Abstract

The objective of drug discovery is to find novel compounds with desirable chemical properties. Generative models have been utilized to sample molecules at the intersection of multiple property constraints. In this paper, we pose molecular design as a language modeling problem in which the model implicitly learns the vocabulary and composition of valid molecules and is therefore able to generate novel molecules of interest. We present SmilesFormer, a Transformer-based model that encodes molecules, molecule fragments, and fragment compositions as latent variables, which are in turn decoded to stochastically generate novel molecules. This is achieved by fragmenting the molecules into smaller combinatorial groups, then learning the mapping between the input fragments and valid SMILES sequences. The model optimizes molecular properties through a stochastic latent space traversal technique, which systematically searches the encoded latent space for latent vectors that produce molecules meeting a multi-property objective. The model was validated on various de novo molecular design tasks, achieving state-of-the-art performance compared to previous methods. Furthermore, we used the proposed method to demonstrate a drug rediscovery pipeline for Donepezil, a known acetylcholinesterase inhibitor.

1. INTRODUCTION

Generative models play a major role in discovering and designing new molecules, which is key to innovation in in-silico drug discovery (Schwalbe-Koda & Gómez-Bombarelli, 2020). A vast amount of molecular data is available, so generative models should be able to learn the concepts underlying valid and desirable molecules using a data-centric approach. However, due to the high dimensionality of the molecular space (Probst & Reymond, 2020) and the substantial number of data samples, it remains challenging to traverse the space of valid molecules that satisfy the objectives of de novo molecular design. Other challenges identified for the de novo molecule generation task include the reliance on brute-force trial and error when searching for hit compounds, the difficulty of designing an effective reward function for molecular optimization, and limited sample efficiency due to online oracle calls during molecule optimization (Fu et al., 2022). Addressing these challenges remains an active area of research, with approaches including Reinforcement Learning (Zhou et al., 2019; Jin et al., 2020b), Genetic Algorithms (Wüthrich et al., 2021), Variational Auto-Encoders (VAEs) (Kusner et al., 2017), and Generative Adversarial Networks (GANs) (Schwalbe-Koda & Gómez-Bombarelli, 2020; Prykhodko et al., 2019).

In this work, we employ a Transformer-based language model (Vaswani et al., 2017) to encode a molecular latent space by generating valid molecule sequences from fragments and fragment compositions. The encoded latent space can then be explored to generate molecules (represented as SMILES strings (Weininger, 1988)) that satisfy desired properties. Similar to Gao et al. (2022), who addressed the problem of synthesizability by modeling synthetic pathways within the molecule design pipeline, we leverage data to minimize costly experiments in downstream generation and optimization tasks, using a fragment-based approach that associates synthesizable building blocks with target molecules during training.

Our model, however, introduces an online fragmentation approach, removing the need to create a separate fragment dataset while learning the intrinsic relationships that govern the formation of a valid SMILES string. This is, in essence, an approach for learning a SMILES language model.
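The idea of online fragmentation can be illustrated with a minimal sketch: tokenize each SMILES string and sample contiguous token spans on the fly as (fragment, target) training pairs, so no separate fragment dataset is ever materialized. The regex tokenizer and the fragment-sampling routine below are illustrative assumptions, not the paper's actual implementation.

```python
import random
import re

# Regex covering common SMILES tokens: bracket atoms, two-character
# element symbols, ring-closure labels, organic-subset atoms,
# bonds, and branch parentheses. Illustrative only.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]|Br|Cl|Si|%\d{2}|[BCNOPSFIbcnops]|[=#/\\\-+()]|\d"
)

def tokenize(smiles):
    """Split a SMILES string into chemically meaningful tokens."""
    return SMILES_TOKEN.findall(smiles)

def online_fragments(smiles, n_pairs=3, rng=None):
    """Sample contiguous token spans from a single SMILES string on
    the fly, yielding (fragment, full_sequence) training pairs
    without building a separate fragment dataset."""
    rng = rng or random.Random()
    tokens = tokenize(smiles)
    pairs = []
    for _ in range(n_pairs):
        length = rng.randint(1, max(1, len(tokens) // 2))
        start = rng.randint(0, len(tokens) - length)
        fragment = "".join(tokens[start:start + length])
        pairs.append((fragment, smiles))
    return pairs

# Aspirin: each sampled fragment is paired with the full target
# sequence the model learns to reconstruct.
for fragment, target in online_fragments("CC(=O)Oc1ccccc1C(=O)O",
                                         n_pairs=2,
                                         rng=random.Random(0)):
    print(fragment, "->", target)
```

Because fragments are drawn afresh from each mini-batch of input SMILES, the fragment distribution is implicit in the data, which is what removes the need for a precomputed fragment corpus.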

