DEEP EVOLUTIONARY LEARNING FOR MOLECULAR DESIGN

Abstract

In this paper, we propose a deep evolutionary learning (DEL) process that integrates a fragment-based deep generative model and multi-objective evolutionary computation for molecular design. Our approach enables (1) evolutionary operations in the latent space of the generative model, rather than the structural space, to generate novel promising molecular structures for the next evolutionary generation, and (2) generative model fine-tuning using newly generated high-quality samples. Thus, DEL implements a data-model co-evolution concept which improves both the sample population and generative model learning. Experiments on two public datasets indicate that the sample population obtained by DEL exhibits improved property distributions and dominates samples generated by multi-objective Bayesian optimization algorithms.

1. INTRODUCTION

A drug is a molecule that binds to a target (e.g., a protein) to inhibit or activate specific pathways in pathogens or host cells that cause an abnormal phenotype. Drug discovery and development is a costly and time-consuming process, which is compounded by the development of personalized medicines for cancer or other complex and rare diseases. Computational drug discovery has been shown to accelerate the whole discovery process using simulations and machine intelligence. However, challenges remain in this field: it demands robust and unbiased feature representation theories for molecules and their corresponding receptors, as well as efficient search algorithms. The rise of AI and data science provides a unique opportunity to reevaluate the problem and develop fast, intelligent search or design approaches (Gromski et al., 2019; Chen et al., 2018). These new technologies differ from traditional ones in two aspects: (1) features can be learned automatically using embedding techniques on a large number of training samples, and (2) high-level relationships (in the supervised case) and complex distributions (in the unsupervised case) can be captured using appropriate deep architectures. Recently, new representation theories and architectures have been proposed in the domain of molecular generation. There are two major methods to represent a molecule for a machine learning algorithm. The first converts a molecular structure to a string, such as a simplified molecular-input line-entry system (SMILES) string (Weininger, 1988), and adopts natural language processing (NLP) methods for supervised or unsupervised learning. The second uses an undirected graph to represent a molecular structure and applies graph convolutional neural networks (Duvenaud et al., 2015). Three major families of AI algorithms have been developed for novel drug discovery: deep generative models (DGMs), reinforcement learning, and the combination of both.
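The two molecular representations described above, a SMILES string and an undirected graph, can be illustrated with a minimal Python sketch. This is not part of the DEL method; the tokenizer pattern, the helper names (`tokenize`, `adjacency`), and the choice of acetic acid as the example are assumptions made purely for illustration. The regular expression covers only a small subset of SMILES syntax, and the graph is written out by hand rather than parsed from the string.

```python
import re

# Hypothetical, simplified SMILES tokenizer (illustration only): it handles
# bracket atoms, two-letter halogens, common organic-subset atoms, aromatic
# atoms, bond and branch symbols, and single-digit ring closures.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]|[=#\-+()/\\.%]|\d"
)


def tokenize(smiles):
    """Split a SMILES string into the tokens an NLP-style model would see."""
    return SMILES_TOKEN.findall(smiles)


# String view: acetic acid written as a SMILES string.
smiles = "CC(=O)O"
tokens = tokenize(smiles)

# Graph view of the same molecule, written by hand (heavy atoms only).
# Each bond is (atom_index_a, atom_index_b, bond_order).
atoms = ["C", "C", "O", "O"]
bonds = [(0, 1, 1), (1, 2, 2), (1, 3, 1)]


def adjacency(n_atoms, bond_list):
    """Build an undirected adjacency list from a list of bonds."""
    adj = {i: [] for i in range(n_atoms)}
    for a, b, _order in bond_list:
        adj[a].append(b)
        adj[b].append(a)
    return adj


adj = adjacency(len(atoms), bonds)

print(tokens)  # sequence input for a string-based model
print(adj)     # neighbourhood structure for a graph-based model
```

In the string view the model receives a token sequence in which branch symbols such as `(` and `)` encode topology indirectly, whereas in the graph view each atom's neighbours are explicit. A cheminformatics toolkit would normally be used to derive the graph from the string; that step is omitted here to keep the sketch self-contained.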
As one major family of neural probabilistic models for data modelling, generative autoencoders (e.g., the variational autoencoder (VAE) (Kingma & Welling, 2014)) have been adopted to learn on either SMILES strings (Gómez-Bombarelli et al., 2018) or molecular graphs (Simonovsky & Komodakis, 2018), together with the corresponding physical and biochemical properties, for molecular generation. By integrating generative adversarial nets and autoencoders, adversarial autoencoders (AAEs) have also been applied to molecular design (Kadurin et al., 2017). The advantage of generative autoencoders is that molecules, which are discrete objects, are mapped to a continuous latent space whose landscape can be organized by their properties, which helps generate new structures with preferred property values. However, critical problems remain due to imperfect representation methods. When SMILES strings are used in a VAE, the model suffers from an imbalance of tokens in the embedding, the generation of invalid structures, and the problem that two almost identical molecules can have markedly different canonical SMILES strings. When using graph as molecular

