DEEP EVOLUTIONARY LEARNING FOR MOLECULAR DESIGN

Abstract

In this paper, we propose a deep evolutionary learning (DEL) process that integrates a fragment-based deep generative model and multi-objective evolutionary computation for molecular design. Our approach enables (1) evolutionary operations in the latent space of the generative model, rather than the structural space, to generate novel promising molecular structures for the next evolutionary generation, and (2) fine-tuning of the generative model using newly generated high-quality samples. Thus, DEL implements a data-model co-evolution concept that improves both the sample population and generative model learning. Experiments on two public datasets indicate that the sample population obtained by DEL exhibits improved property distributions and dominates samples generated by multi-objective Bayesian optimization algorithms.

1. INTRODUCTION

A drug is a molecule that binds to a target (e.g. a protein) to inhibit or activate specific pathways in pathogens or host cells that cause an abnormal phenotype. Drug discovery and development is a costly and time-consuming process, and the difficulty is compounded by the development of personalized medicines for cancer and other complex or rare diseases. Computational drug discovery has been shown to accelerate the whole discovery process using simulations and machine intelligence. However, challenges remain in this field: it still demands robust and unbiased feature representation theories for molecules and their corresponding receptors, as well as efficient search algorithms. The rise of AI and data science provides a unique opportunity to reevaluate the problem and develop fast, intelligent search or design approaches (Gromski et al., 2019; Chen et al., 2018). These new technologies differ from traditional ones in two aspects: (1) features can be learned automatically using embedding techniques on a large number of training samples, and (2) high-level relationships (in the supervised case) and complex distributions (in the unsupervised case) can be captured using appropriate deep architectures.
Recently, new representation theories and architectures have been proposed in the domain of molecular generation. There are two major methods of representing a molecule for a machine learning algorithm. The first converts a molecular structure to a string, such as a simplified molecular-input line-entry system (SMILES) string (Weininger, 1988), and adopts natural language processing (NLP) methods for supervised or unsupervised learning. The second uses an undirected graph to represent a molecular structure and applies graph convolutional neural networks (Duvenaud et al., 2015). Three major families of AI algorithms have been developed for novel drug discovery: deep generative models (DGMs), reinforcement learning, and the combination of both.
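To make the two representations concrete, the following minimal sketch encodes ethanol both as a SMILES string and as an undirected graph. The helper names (`linear_smiles_to_graph`, `adjacency`) are hypothetical and introduced only for illustration; the toy converter assumes an unbranched chain of single-letter atoms joined by single bonds, whereas real SMILES strings require a full parser.

```python
from typing import Dict, List, Tuple

# A molecular graph: a list of atom symbols plus undirected bonds (index pairs).
Graph = Tuple[List[str], List[Tuple[int, int]]]


def linear_smiles_to_graph(smiles: str) -> Graph:
    """Toy converter (assumption): handles only unbranched chains of
    single-letter atoms joined by single bonds, e.g. "CCO"."""
    atoms = list(smiles)
    if not all(a in "CNOS" for a in atoms):
        raise ValueError("toy parser supports only chains of C, N, O, S")
    # In a linear chain, each atom is bonded to its successor in the string.
    edges = [(i, i + 1) for i in range(len(atoms) - 1)]
    return atoms, edges


def adjacency(graph: Graph) -> Dict[int, List[int]]:
    """Build an undirected adjacency list from the bond list."""
    atoms, edges = graph
    adj: Dict[int, List[int]] = {i: [] for i in range(len(atoms))}
    for i, j in edges:
        adj[i].append(j)
        adj[j].append(i)
    return adj


# Ethanol written in two different atom orders.
ethanol_a = linear_smiles_to_graph("CCO")
ethanol_b = linear_smiles_to_graph("OCC")
```

Here `"CCO"` and `"OCC"` are different strings that describe the same molecule: the two graphs differ only in how the atoms are numbered. The string view lends itself to NLP-style sequence models, while the graph view exposes the connectivity directly.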
As one major family of neural probabilistic models for data modelling, generative autoencoders (e.g. the variational autoencoder (VAE) (Kingma & Welling, 2014)) have been adopted to learn from either SMILES strings (Gómez-Bombarelli et al., 2018) or molecular graphs (Simonovsky & Komodakis, 2018), together with the corresponding physical and biochemical properties, for molecular generation. By integrating generative adversarial nets and autoencoders, adversarial autoencoders (AAEs) have also been successfully applied to molecular design (Kadurin et al., 2017). The advantage of generative autoencoders is that molecules, which are discrete objects, are mapped to a continuous latent space whose landscape can be organized by their properties, which helps generate new structures with preferred property values. However, critical problems remain due to imperfect representation methods. When SMILES strings are used in a VAE, the model suffers from an imbalance of tokens in the embedding, the generation of invalid structures, and the problem that two almost identical molecules can have markedly different canonical SMILES strings. When a graph is used as the molecular representation in a VAE, a technical difficulty is designing an effective graph decoder. In addition to other heuristic methods, a SMILES decoder can be paired with a graph encoder.
While DGMs offer the convenience of searching in latent space, reinforcement learning algorithms can search directly in the structural space of molecules by adding or deleting bonds and atoms (Zhou et al., 2019). In the Markov decision process (MDP) for drug design, the agent is a molecular generator, the molecular structure is the state, the actions are modifications to the current structure, and a simulator (e.g. a surrogate model) is often used as the environment to provide the reward.
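The MDP formulation above can be sketched in a few lines. This is only an illustration under stated assumptions, not the method of any cited work: `Molecule`, `add_atom`, `remove_last_atom`, `surrogate_reward`, `random_policy`, and `rollout` are hypothetical names, the reward is an arbitrary stand-in for a surrogate property model, the agent is a random policy rather than a learned one, and chemical valence is not checked.

```python
import random
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Molecule:
    """State of the MDP: atom symbols and bonds between atom indices."""
    atoms: Tuple[str, ...]
    bonds: Tuple[Tuple[int, int], ...]


def add_atom(mol: Molecule, element: str, attach_to: int) -> Molecule:
    """Action: attach a new atom to an existing atom with a single bond."""
    new_index = len(mol.atoms)
    return Molecule(mol.atoms + (element,), mol.bonds + ((attach_to, new_index),))


def remove_last_atom(mol: Molecule) -> Molecule:
    """Action: delete the most recently added atom and its bond."""
    if len(mol.atoms) <= 1:
        return mol  # keep at least one atom
    last = len(mol.atoms) - 1
    return Molecule(mol.atoms[:-1], tuple(b for b in mol.bonds if last not in b))


def surrogate_reward(mol: Molecule) -> float:
    """Hypothetical environment: prefers five atoms and rewards oxygens."""
    return -abs(len(mol.atoms) - 5) + 0.5 * mol.atoms.count("O")


def random_policy(mol: Molecule, rng: random.Random) -> Molecule:
    """Placeholder agent: picks a structural modification at random."""
    if rng.random() < 0.2:
        return remove_last_atom(mol)
    return add_atom(mol, rng.choice(["C", "N", "O"]), rng.randrange(len(mol.atoms)))


def rollout(steps: int, seed: int = 0) -> Tuple[Molecule, float]:
    """Run one episode and return the final state and the summed reward."""
    rng = random.Random(seed)
    state = Molecule(("C",), ())
    total = 0.0
    for _ in range(steps):
        state = random_policy(state, rng)   # agent modifies the structure
        total += surrogate_reward(state)    # environment scores the new state
    return state, total
```

Calling `rollout(10)` yields one final structure and its accumulated reward. In a reinforcement learning setting, `random_policy` would be replaced by a trained generator and `surrogate_reward` by a property predictor or simulator.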
Furthermore, generative and predictive models can be integrated into the MDP to form deep reinforcement learning (DRL) methods, where the generative model is trained as a policy approximation and the predictive model can be used as a value function approximation (You et al., 2018; Popova et al., 2018). Searching in a discrete input space and inefficient learning are arguably concerns that must be addressed when applying reinforcement-learning-based solutions to compound design.
Interestingly, as a long-standing peer of reinforcement learning for black-box optimization, evolutionary computation (EC) methods (Eiben & Smith, 2015) have been catching up, with promising performance in modern optimization, design, and modelling problems. Besnard et al. (2012) present a strategy for evolving ligands along multiple properties in the structural space, where a library of knowledge-based chemical structural transformations is used as the mutation operator. Interactions between EC and neural networks have mainly focused on network evolution and neural surrogate models for fitness functions. For example, EC has been used at a large scale for neuroevolution, which evolves neural network architectures (Stanley et al., 2019), and feedforward neural networks are commonly used as fitness functions in EC (Mandal et al., 2019). Furthermore, it has recently been discovered that evolution strategies (ES) can perform competitively with reinforcement learning in game AI (Salimans et al., 2017). ES (Wierstra et al., 2014) and estimation of distribution algorithms (EDA) (Hauschild & Pelikan, 2011) build parameterized search distributions over promising points and either employ gradient information or sample from such a probabilistic model to find better points. Both probabilistic strategies from EC can potentially be used as alternatives to Bayesian optimization (BO) in (continuous) black-box optimization.
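As an illustration of the "sample from a parameterized search distribution" idea, here is a minimal sketch of a Gaussian EDA with truncation selection on a continuous black-box objective. It is a generic textbook-style variant, not the algorithm of any cited paper; the function names (`gaussian_eda`, `sphere`), the diagonal Gaussian model, and all hyperparameter values are assumptions chosen for the example.

```python
import random
import statistics
from typing import Callable, List, Sequence


def gaussian_eda(
    objective: Callable[[Sequence[float]], float],
    dim: int,
    generations: int = 30,
    pop_size: int = 100,
    elite_frac: float = 0.2,
    seed: int = 0,
) -> List[float]:
    """Minimize `objective` by repeatedly refitting a diagonal Gaussian."""
    rng = random.Random(seed)
    mean = [0.0] * dim   # initial search distribution (assumption)
    std = [2.0] * dim
    n_elite = max(2, int(pop_size * elite_frac))
    for _ in range(generations):
        # 1. Sample a population from the current search distribution.
        population = [
            [rng.gauss(mean[d], std[d]) for d in range(dim)]
            for _ in range(pop_size)
        ]
        # 2. Keep the most promising points (lowest objective values).
        population.sort(key=objective)
        elite = population[:n_elite]
        # 3. Re-estimate the distribution parameters from the elite set.
        for d in range(dim):
            column = [x[d] for x in elite]
            mean[d] = statistics.fmean(column)
            std[d] = max(statistics.pstdev(column), 1e-3)  # avoid collapse
    return mean


def sphere(x: Sequence[float]) -> float:
    """Toy black-box objective with its minimum at (1, 1, ..., 1)."""
    return sum((xi - 1.0) ** 2 for xi in x)


best = gaussian_eda(sphere, dim=3)
```

The objective is only ever evaluated, never differentiated, which is what makes this family of methods a candidate alternative to BO for continuous black-box problems. An ES in the style of Wierstra et al. (2014) would instead update the distribution parameters along an estimated gradient.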
Single-objective BO has recently been applied to molecule optimization in the latent space of a VAE (Gómez-Bombarelli et al., 2018).
Since model learning is essentially parameter estimation from the statistical modelling perspective, the quality of data is crucial for model performance in deep learning. Data augmentation is becoming a common strategy in deep learning to improve the training of a model. For example, in computer vision, transformations of images (such as rotation and flipping) are used to increase the sample size when the original dataset is insufficient (Perez & Wang, 2017; Cubuk et al., 2019; Shorten & Khoshgoftaar, 2019). In NLP, a text dataset can be augmented using tricks such as replacing words or phrases with their synonyms (Wei & Zou, 2019) and drawing on other language models (e.g. word embeddings and neural machine translation) (Sennrich et al., 2016; Wei & Zou, 2019). Essentially, these methods either increase the data by transforming existing information, which can only alleviate the limitations of certain techniques (e.g. convolution), or indirectly borrow new information from other sources (e.g. the methods in NLP).
In summary, even though modern machine learning, evolutionary computation, and data science methods have been applied to molecular design and have achieved promising results, three chief issues remain. (1) An effective representation method for compound structures that is encoding-decoding friendly and invariant to multimorphic forms is still missing. (2) New ideas are needed for the effective representation and coding of discrete structures in EC. (3) The quality of data can be further improved, as current data augmentation tricks only increase the number of samples and do not specifically address data quality.
In this paper, we propose a novel deep evolutionary learning (DEL) process that combines the merits of deep generative models and multi-objective evolutionary computation for molecular design.
Specifically, our work makes three major contributions. (1) In our approach, latent representations of phenotypic samples in a population serve as genotypic codes for evolutionary operations. This differs from traditional evolutionary algorithms, which search in the original space of a problem. In our framework, the DGM encoder projects the molecular structures in a population from the discrete structural space to a continuous latent space, where evolutionary operations are applied to explore the latent representation space. Subsequently, the DGM decoder maps the genotypic representations back to the phenotypic space to generate new molecules with desired property values. (2) In each evolutionary generation, the newly formed population, which contains novel competitive molecules, can be used to further fine-tune the DGM. This approach is an innovative data augmentation strategy that enriches training data with novel high-

