DEEP EVOLUTIONARY LEARNING FOR MOLECULAR DESIGN

Abstract

In this paper, we propose a deep evolutionary learning (DEL) process that integrates a fragment-based deep generative model and multi-objective evolutionary computation for molecular design. Our approach enables (1) evolutionary operations in the latent space of the generative model, rather than the structural space, to generate novel promising molecular structures for the next evolutionary generation, and (2) generative model fine-tuning using newly generated high-quality samples. Thus, DEL implements a data-model co-evolution concept which improves both sample populations and generative model learning. Experiments on two public datasets indicate that the sample populations obtained by DEL exhibit improved property distributions, and dominate samples generated by multi-objective Bayesian optimization algorithms.

1. INTRODUCTION

A drug is a molecule that binds to a target (e.g. a protein) to inhibit or activate specific pathways in pathogens or host cells that cause an abnormal phenotype. Drug discovery and development is a costly and time-consuming process, which is compounded by the development of personalized medicines for cancer and other complex or rare diseases. Computational drug discovery has been shown to accelerate the whole discovery process using simulations and machine intelligence. However, challenges remain in this field, driven by the demand for robust and unbiased feature representation theories for molecules and their corresponding receptors, as well as for efficient search algorithms. The rise of AI and data science provides us with a unique opportunity to reevaluate the problem and develop fast intelligent search or design approaches (Gromski et al., 2019; Chen et al., 2018). These new technologies differ from the traditional ones in two aspects: (1) features can be automatically learned using embedding techniques on a large number of training samples, and (2) high-level relationships (in the supervised case) and complex distributions (in the unsupervised case) can be captured using appropriate deep architectures. Recently, new representation theories and architectures have been proposed in the domain of molecular generation. There exist two major methods to represent a molecule for a machine learning algorithm. The first converts a molecular structure to a string, such as the simplified molecular-input line-entry system (SMILES) string (Weininger, 1988), and adopts natural language processing (NLP) methods for supervised or unsupervised learning. The second uses an undirected graph to represent a molecular structure and applies graph convolutional neural networks (Duvenaud et al., 2015). Three major families of AI algorithms have been developed for novel drug discovery: deep generative models (DGMs), reinforcement learning, and the combination of both.
As one family of major neural probabilistic models for data modelling, generative autoencoders (e.g. the variational autoencoder (VAE) (Kingma & Welling, 2014)) have been adopted to learn on either SMILES strings (Gómez-Bombarelli et al., 2018) or molecular graphs (Simonovsky & Komodakis, 2018) with corresponding physical and biochemical properties for molecular generation. By integrating generative adversarial nets and autoencoders, adversarial autoencoders (AAEs) have also been applied to molecular design (Kadurin et al., 2017). The advantage of using generative autoencoders is that molecules, as discrete objects in our world, are mapped to a continuous latent space whose landscape can be organized by their properties, which helps generate new structures with preferred property values. However, critical problems remain due to imperfect representation methods. When SMILES strings are used in a VAE, the model suffers from an imbalance of tokens in embedding, the generation of invalid structures, and the problem that two almost identical molecules can have markedly different canonical SMILES strings. When a graph is used as the molecular representation in a VAE, a technical difficulty is to design an effective graph decoder. In addition to other heuristic methods, a SMILES decoder can be paired with a graph encoder. While DGMs offer the convenience of searching in latent space, reinforcement learning algorithms can directly search in the molecules' structural space by adding or deleting bonds and atoms (Zhou et al., 2019). In the Markov decision process (MDP) for drug design, the agent is a molecular generator, the molecular structure indicates the state, the actions are modifications to the current structure, and a simulator (e.g. a surrogate model) is often used as the environment to provide rewards.
Furthermore, generative and predictive models can be integrated in an MDP to form deep reinforcement learning (DRL) methods, where the generative model is trained as a policy approximation and the predictive model can be used as a value function approximation (You et al., 2018; Popova et al., 2018). Search in the discrete input space and inefficient learning are arguable concerns to be addressed when applying reinforcement-learning-based solutions to compound design. Interestingly, as an old peer of reinforcement learning for black-box optimization, evolutionary computation (EC) methods (Eiben & Smith, 2015) have been catching up with promising performance in modern optimization, design and modelling problems. Besnard et al. (2012) present a strategy for the evolution of ligands along multiple properties in the structural space, where a library of knowledge-based chemical structural transformations is used as the mutation operator. Interactions between EC and neural networks have mainly focused on network evolution and neural surrogate models for fitness functions. For example, EC has been used at a large scale for neuroevolution, which leads to the evolution of neural network architectures (Stanley et al., 2019), and a feedforward neural network is commonly used as a fitness function in EC (Mandal et al., 2019). Furthermore, it has recently been discovered that evolution strategies (ES) can perform competitively with reinforcement learning in game AI (Salimans et al., 2017). ES (Wierstra et al., 2014) and estimation of distribution algorithms (EDA) (Hauschild & Pelikan, 2011) build parameterized search distributions over promising points and either employ gradient information or sample from such a probabilistic model to find better points. Both probabilistic strategies from EC can potentially be used as alternatives to Bayesian optimization (BO) in (continuous) black-box optimization.
Single-objective BO has recently been applied to molecule optimization in the latent space of a VAE (Gómez-Bombarelli et al., 2018). Since model learning is essentially parameter estimation from the statistical modelling perspective, the quality of data in deep learning is crucial for model performance. Data augmentation is becoming a new strategy in deep learning to improve the training of a model. For example, in computer vision, transformations (such as rotation and flipping) of images are used to increase the sample size when the original data set is insufficient (Perez & Wang, 2017; Cubuk et al., 2019; Shorten & Khoshgoftaar, 2019). In NLP, a text dataset can be augmented using tricks such as replacing words or phrases with their synonyms (Wei & Zou, 2019), or by borrowing from other language models (e.g. word embeddings and neural machine translation) (Sennrich et al., 2016; Wei & Zou, 2019). Basically, these methods either increase the data by transforming existing information, which can only alleviate the limits of certain techniques (e.g. convolution), or indirectly borrow new information from other sources (e.g. the methods in NLP). In summary, even though modern machine learning, evolutionary computation, and data science methods have been applied to molecular design and achieved promising results, we are still challenged by three chief issues. (1) An effective representation method for compound structures, one that is encoding-decoding friendly and invariant to multimorphic forms, is still missing. (2) New ideas are needed for effective representation and coding of discrete structures in EC. And (3) the quality of data can be further improved, as current data augmentation tricks only increase the number of samples but do not specifically address data quality. In this paper, we propose a novel deep evolutionary learning (DEL) process that combines the merits of deep generative models and multi-objective evolutionary computation for molecular design.
Specifically, our work makes three major contributions. (1) In our approach, latent representations of phenotypic samples in a population serve as genotypic codes for evolutionary operations. This approach differs from traditional evolutionary algorithms that search in the original space of a problem. Specifically, our framework's DGM encoder projects the molecular structures in a population from the discrete space to the continuous latent space, where evolutionary operations are applied to help explore the latent representation space. Subsequently, the DGM decoder maps the genotypic representations back to the phenotypic space to generate new molecules with desired property values. (2) In each evolutionary generation, the newly formed population containing novel competitive molecules can be used to further fine-tune the DGM. This is an innovative data augmentation strategy that enriches the training data with novel high-quality samples. The whole DEL process implements a new learning paradigm that co-evolves data and model alternatingly through multiple evolutionary generations. (3) Our comprehensive experiments demonstrate that DEL is able to produce populations of novel samples with improved property values and outperforms state-of-the-art multi-objective BO (MOBO) algorithms.

2. METHOD

The proposed deep evolutionary learning (DEL) process combines deep learning and multi-objective EC through the latent representation space of molecules. One of the theoretical innovations of our approach is that it demonstrates that EC methods are extendable to corresponding deep versions. The main idea is illustrated in Figure 1a and formally presented in Algorithm 1 (see Appendix A.1). The major advantages of our DEL algorithm over existing interactions between EC and neural networks can be explained as follows. (a) We directly evolve a collection of data rather than many neural network structures and parameters; data evolution tends to be more efficient than direct evolution of model structures and parameters. (b) The single neural network model (i.e. the DGM in DEL) can be indirectly improved through learning on the evolved data with modern gradient-based variational learning and inference algorithms. Thus, the improvement of populations along evolutionary generations can be viewed as an effective data augmentation strategy that supplies novel and high-quality samples for further training of the neural network. (c) The continuous latent representation space established by the encoder of the DGM can naturally be used as the encoding (genotypic) space for evolutionary computation. Thus, evolutionary operations are carried out in the latent space instead of the discrete structural input space, allowing more efficient and smooth exploration: the latent space is often multimodal and can be organized by properties (regularized by the property predictor), and evolutionary operations in this space can help the search escape from local regions and explore new regions of interest. (d) The multi-objective operations, non-dominated sorting and crowding distance, can help identify competitive and diverse parent samples to breed offspring. In summary, DEL takes advantage of both multi-objective EC and probabilistic neural model learning.
The DGM, the multi-objective components (non-dominated sorting and crowding distance), the evolutionary operations, and the formation of new populations are discussed in detail below.

2.1. FRAGVAE FOR FRAGMENT-BASED MOLECULAR MODELLING

In our DEL process, we adopted a VAE model originally proposed for fragment-based molecular generation (Podda et al., 2020). The concept of fragment-based drug design (FBDD) was introduced in (Shuker et al., 1996). In FBDD-based approaches, small organic molecules that bind to proximal subsites of a protein are identified, optimized, and linked together to produce high-affinity ligands. Wet-lab approaches for FBDD include X-ray crystallography and NMR spectroscopy. Compared to atom-based drug design, FBDD has the following advantages (Erlanson, 2011). (1) The search space in FBDD is much smaller (10^7 versus 10^60). (2) Identifying a fragment with certain affinity to the target may mean finding a pharmacophore. (3) Fragment-based synthesis can be more efficient than high-throughput screening. Most fragmentation methods, which break a molecule into parts, are based on synthetic accessibility. For example, RECAP (retrosynthetic combinatorial analysis procedure) breaks bonds formed by chemical reactions (Lewell et al., 1998), while BRICS (breaking of retrosynthetically interesting chemical substructures) uses a more elaborate set of fragmentation rules along synthetically accessible bonds and generates more fragments than RECAP (Degen et al., 2008). BRICS is used in (Podda et al., 2020) to chop a SMILES string into several fragments. Then, fragment embeddings are produced using Word2Vec (Mikolov et al., 2013). Next, the sequences of fragments are modelled by a GRU-based VAE. In our work, a multi-head feedforward neural network component for predicting property values is added to the original model so that the latent representations can be regularized by the properties of interest. Additionally, we normalize the three loss terms using the batch size, and allow tuning of the weights among the loss terms. A crucial implementation bug in the original VAE model was also corrected (see Appendix A.2).
Hereafter, we name this modified VAE model for fragments FragVAE; its architecture is displayed in Figure 1b. We denote the encoder parameters by φ, the decoder parameters by θ, and the property predictor network by f_ψ(z), parameterized by ψ. The objective (to be minimized) of this DGM employed in DEL is a weighted combination of three terms:

l(φ, θ, ψ) = −E_{q_φ(z|x)}[log p_θ(x|z)] + β KL(q_φ(z|x) ‖ p(z)) + α E_{q_φ(z|x)}[MSE(f_ψ(z), y)],

where the first term reduces the reconstruction error, the second term regularizes the posterior latent distribution with a simple prior, and the third term uses the mean squared error of property prediction to further regularize the posterior distribution of latent codes. Previous studies reveal that VAEs can easily fail at modelling text data because of the training imbalance between the reconstruction error (difficult to reduce once the KL divergence becomes very small) and the KL divergence (easy to diminish to zero). Thus, a proper trade-off between the reconstruction error and the KL divergence through β-VAE (Higgins et al., 2017) is vital in text generation and molecular generation (Yan et al., 2020; Bowman et al., 2016). In practice, the value of β should be smaller than 1. To look for a suitable value of β, we design a versatile function, called the β-function, formulated as

β(t) = min(max(a e^{k(1 − T/t)}, l), u),

where T represents the total number of epochs, t ∈ {1, 2, ..., T} is the current epoch index, k controls the incremental speed, a defines the amplitude, and l and u serve as lower and upper bounds, respectively, for the value of β. Curves of this function under different settings are shown in Figure 6 (see Appendix A.7). The value of α can be set similarly in FragVAE.
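The β-function can be sketched directly from the formula above (a minimal illustration assuming the exponent k(1 − T/t); the function name and defaults are ours, not from the original implementation):

```python
import math

def beta_schedule(t, T, k=1.0, a=1.0, l=0.0, u=1.0):
    """KL-weight annealing: beta(t) = min(max(a * exp(k * (1 - T / t)), l), u)
    for epoch t in {1, ..., T}; k sets the ramp speed, a the amplitude,
    and l, u the lower/upper bounds on beta."""
    return min(max(a * math.exp(k * (1.0 - T / t)), l), u)
```

With k = 0 the schedule is constant at min(max(a, l), u), recovering a fixed β; larger k delays the ramp towards the amplitude a until the final epochs.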

2.2. NON-DOMINATION RANK AND CROWDING DISTANCE

To obtain the non-domination rank and crowding distance of a feasible solution for guiding sample selection (Section 2.3) and population merging (Section 2.4), the fast non-dominated sort and crowding comparison methods are adopted from the classic NSGA-II algorithm for multi-objective optimization (Deb et al., 2002). The properties in molecular design are treated as objectives. In an optimization problem with K objectives f(z) = {f_1(z), ..., f_K(z)}, a feasible solution z_1 is said to dominate z_2 (denoted by z_1 ≺ z_2) if ∀k ∈ {1, ..., K}: f_k(z_1) ≤ f_k(z_2) and ∃k ∈ {1, ..., K}: f_k(z_1) < f_k(z_2). Using this concept of domination, all feasible solutions in a collection can be sorted to form Pareto frontiers (or fronts, ranks) F = {F_1, F_2, ...}. Samples in the same frontier do not dominate each other, and frontier F_i dominates F_j for j > i. We define the function F(z) to retrieve the rank (i.e. frontier index) of any feasible solution z in the population. The crowding distance of a feasible solution measures the density of the area around it and is computed as the normalized perimeter of the cuboid formed by its immediate neighbours along all objective axes. To compute the crowding distance of z_i, the normalized distance between its nearest neighbours above (denoted by z_a) and below (denoted by z_b) it w.r.t. the k-th objective axis is calculated as d_k(z_i) = (f_k(z_a) − f_k(z_b)) / (f_k^max − f_k^min), where f_k^max and f_k^min are the maximal and minimal values of the k-th objective. These individual results are then summed to form the crowding distance of z_i: d(z_i) = Σ_{k=1}^{K} d_k(z_i). Using the two concepts, a partial order can be defined: we say z_1 ≺_n z_2 if either (1) z_1 ≺ z_2 (that is, F(z_1) < F(z_2)), or (2) F(z_1) = F(z_2) and d(z_1) > d(z_2). When two solutions have the same rank, the one with the larger crowding distance is preferred because it helps maintain a diverse population.
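A minimal, self-contained Python sketch of the fast non-dominated sort and crowding distance described above (for minimization; function names are ours and this follows the generic NSGA-II procedure, not necessarily the authors' exact implementation):

```python
def dominates(f1, f2):
    """f1 dominates f2 (minimization): no worse in every objective, better in one."""
    return all(a <= b for a, b in zip(f1, f2)) and any(a < b for a, b in zip(f1, f2))

def non_dominated_fronts(objs):
    """Fast non-dominated sort: return index lists F1, F2, ..., one per front."""
    n = len(objs)
    dominated_by = [[] for _ in range(n)]   # indices that each i dominates
    counts = [0] * n                        # how many points dominate i
    fronts = [[]]
    for i in range(n):
        for j in range(n):
            if i != j and dominates(objs[i], objs[j]):
                dominated_by[i].append(j)
            elif i != j and dominates(objs[j], objs[i]):
                counts[i] += 1
        if counts[i] == 0:
            fronts[0].append(i)
    while fronts[-1]:
        nxt = []
        for i in fronts[-1]:
            for j in dominated_by[i]:
                counts[j] -= 1
                if counts[j] == 0:
                    nxt.append(j)
        fronts.append(nxt)
    return fronts[:-1]

def crowding_distance(objs, front):
    """d(z_i) = sum over objectives of the normalized gap between neighbours."""
    d = {i: 0.0 for i in front}
    for k in range(len(objs[0])):
        order = sorted(front, key=lambda i: objs[i][k])
        d[order[0]] = d[order[-1]] = float("inf")   # boundary points always kept
        span = objs[order[-1]][k] - objs[order[0]][k]
        if span == 0:
            continue
        for below, i, above in zip(order, order[1:], order[2:]):
            d[i] += (objs[above][k] - objs[below][k]) / span
    return d
```

For example, among the 2-objective points (1,1), (2,2), (1,2), (2,1) and (3,3), the point (1,1) alone forms the first front, since it dominates all the others.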

2.3. EVOLUTIONARY OPERATIONS

The evolutionary operations include parent selection, recombination and mutation to produce new offspring in the evolutionary process. Binary tournament selection is applied to select one out of two randomly drawn samples from the current population. In such a selection process, suppose z_1 and z_2 are randomly taken from the population and z_1 ≺_n z_2. Sample z_1 will be selected with a selection probability p_s close to one, and z_2 will be selected with the small remaining chance 1 − p_s. This selection process is repeated M times to select M parents, where M is the fixed population size. A pair of such parents produces two children through recombination and mutation operations. Given two parents' latent representations z_p1 and z_p2, there are two recombination options, linear and discrete, to produce their children ẑ_1 and ẑ_2. For linear recombination, ẑ_1 = z_p1 + r_1(z_p2 − z_p1) and ẑ_2 = z_p1 + r_2(z_p2 − z_p1), where r_1 = −d + (1 + 2d)α_1, r_2 = −d + (1 + 2d)α_2, d = 0.25, and α_1, α_2 ~ Uniform(0, 1). For the discrete method, supposing a latent representation vector has length L, an integer l is randomly drawn from {1, ..., L − 1} such that ẑ_1 = [z_p1[1:l], z_p2[l+1:L]] and ẑ_2 = [z_p2[1:l], z_p1[l+1:L]]. After crossover, each new sample ẑ_m (m ∈ {1, ..., M}) has a small mutation probability p_m (say 0.01) of being mutated. For ẑ_m, a random value r is drawn from Uniform(0, 1); if r < p_m, then a random integer l is selected from {1, ..., L} and the l-th position of ẑ_m is replaced with a value drawn from the standard Gaussian distribution: ẑ_{m,l} ~ N(0, 1).
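The three operators above can be sketched on plain Python lists standing in for latent vectors (a minimal illustration; the comparator `better`, encoding the partial order ≺_n, is passed in, and all names are ours):

```python
import random

def tournament_select(pop, better, p_s=0.95):
    """Binary tournament: draw two random samples; return the one preferred
    under the partial order (the `better` callable) with probability p_s."""
    z1, z2 = random.sample(pop, 2)
    if better(z2, z1):
        z1, z2 = z2, z1
    return z1 if random.random() < p_s else z2

def linear_crossover(zp1, zp2, d=0.25):
    """Two children on the segment between the parents, extended by d on each side."""
    children = []
    for _ in range(2):
        r = -d + (1 + 2 * d) * random.random()   # r = -d + (1 + 2d) * alpha
        children.append([a + r * (b - a) for a, b in zip(zp1, zp2)])
    return children

def discrete_crossover(zp1, zp2):
    """One-point crossover: swap the tails after a random cut point l."""
    l = random.randint(1, len(zp1) - 1)
    return [zp1[:l] + zp2[l:], zp2[:l] + zp1[l:]]

def mutate(z, p_m=0.01):
    """With probability p_m, resample one random coordinate from N(0, 1)."""
    if random.random() < p_m:
        z = list(z)
        z[random.randrange(len(z))] = random.gauss(0.0, 1.0)
    return z
```

Note that with d = 0.25 the linear recombination coefficient r ranges over [−0.25, 1.25], so children can land slightly outside the segment joining the two parents, which adds exploration.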

2.4. FORMING NEW POPULATION

Ideally, we need to maintain excellent and diverse populations. After possible mutation operations, all M genotypic coding vectors are passed through the decoder of the DGM to produce phenotypic samples. All valid samples (collected in the set P̂_{t+1}) are kept and merged with the previous population (denoted by P_t) to produce the next generation (denoted by P_{t+1}). To implement this, all samples in P̂_{t+1} ∪ P_t are sorted first by their non-domination ranks and then by their crowding distances. Finally, only the top M samples are retained to form the new population.
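The elitist merge can be sketched as follows (self-contained for illustration, so the front-peeling and crowding computations are re-implemented compactly inline; `objs_of` is a hypothetical callable returning the objective vector of a sample):

```python
def merge_population(candidates, prev_pop, objs_of, M):
    """Elitist survivor selection: sort the merged set by non-domination
    rank, then by decreasing crowding distance, and keep the top M."""
    merged = list(candidates) + list(prev_pop)
    objs = [objs_of(z) for z in merged]

    def dom(i, j):  # minimization: i dominates j
        return (all(a <= b for a, b in zip(objs[i], objs[j]))
                and any(a < b for a, b in zip(objs[i], objs[j])))

    # Peel off Pareto fronts one at a time (first front = best rank).
    remaining, fronts = set(range(len(merged))), []
    while remaining:
        front = [i for i in remaining
                 if not any(dom(j, i) for j in remaining if j != i)]
        fronts.append(front)
        remaining -= set(front)

    survivors = []
    for front in fronts:
        # Crowding distance within the front, accumulated per objective.
        cd = {i: 0.0 for i in front}
        for k in range(len(objs[0])):
            order = sorted(front, key=lambda i: objs[i][k])
            cd[order[0]] = cd[order[-1]] = float("inf")
            span = objs[order[-1]][k] - objs[order[0]][k]
            if span > 0:
                for below, i, above in zip(order, order[1:], order[2:]):
                    cd[i] += (objs[above][k] - objs[below][k]) / span
        for i in sorted(front, key=lambda i: -cd[i]):
            if len(survivors) < M:
                survivors.append(merged[i])
    return survivors
```

Because the previous population P_t competes with the new candidates, good samples are never lost between generations (elitism), while the crowding-distance tie-break keeps the retained set spread out.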

3. EXPERIMENTS

The performance of DEL was investigated on the ZINC (Irwin & Shoichet, 2005) and PCBA (Wang et al., 2016) datasets. These data were processed in the work of Podda et al. (2020). ZINC and PCBA are composed of 227,945 and 383,790 molecules with two or more fragments, respectively. More statistics of both datasets can be found in (Podda et al., 2020). We comprehensively investigated the empirical performance of FragVAE and DEL. Three properties (QED: quantitative estimation of drug-likeness, SAS: synthetic accessibility score, and logP: water-octanol partition coefficient) are selected as objectives in DEL. Molecules with large QED, low SAS, and small logP values are prioritized. The incorporation of other properties (e.g. binding affinity, structure-property relationships, and ADME) will be considered in future work. QED, a scalarization of eight molecular properties (including logP) (Bickerton et al., 2012), is an adequate initial screening step for drug candidates. As we dive into more specific applications, selective properties can be tailored for subsequent screening. Thus, the explicit use of logP as one objective in DEL can help better assess lipophilicity, a key factor in drug design for some diseases, e.g. kidney and heart problems.

3.1. EVALUATION OF FRAGVAE

As FragVAE is a significant modification of the original model used in (Podda et al., 2020), we investigated the impact of the β value on the performance of FragVAE in terms of loss function values through Figure 7 (see Appendix A.7). Other hyperparameter values can be found in Appendix A.3. One can see that a large β value quickly reduces the KL loss to near zero, which leads to stagnant reductions of the reconstruction error and property regression error, the notorious posterior collapse problem (Goyal et al., 2017), because it is much easier to reduce the KL divergence than the reconstruction error in complex sequence modelling. Using a suitably small value of β allows the continuous decrease of the reconstruction error and property regression error. This observation is consistent with discoveries in language generative models (Yan et al., 2020; Bowman et al., 2016). Table 2 (see Appendix A.6) shows the validity, novelty and diversity of 20,000 samples generated from trained FragVAEs by sampling z from the standard Normal prior and passing it through the decoder. Results of previous language-model-based and graph-based methods are also given for comparison. In general, the validity is defined as the ratio of the number of valid generated samples to the total number of generated samples. To clarify, the perfect validity reported in Podda et al. (2020) is actually calculated as the ratio of valid generated SMILES strings, after discarding invalid fragment sequences, to the total number of valid fragment sequences, i.e. Validity (SMILES) in Table 2. We found that this ratio is always 1 in fragment-based models. To gain a better understanding of the model, we hence computed the validity of fragment sequences as the percentage of valid fragment sequences among all generated fragment sequences, i.e. Validity (Fragments) in Table 2.
The novelty is defined as the ratio of the number of generated novel valid molecules, i.e. those that do not exist in the training data, to the total number of generated valid samples. The diversity is calculated as the percentage of unique valid samples among all generated valid samples. When the value of β is very small (0.01), the posterior p(z|x) is highly different from the simple standard Normal prior p(z). Thus, it is reasonable to see relatively low diversity in samples derived using the standard Normal distribution. However, this does not imply that FragVAE with a very small value of β is poor at learning latent representations; in fact, previous work on β-VAE shows that small values of β tend to encourage disentangled representations and form latent clusters (Li et al., 2020). The property distributions of generated samples can be important indicators of the proximity of generated samples to actual samples. Figure 2 shows the distributions of QED, SAS and logP in samples generated using the standard Normal prior. A large β value can shift the sample SAS distribution to the right of the actual SAS distribution. Interestingly, the property distributions when using β = 0.01 do not resemble the actual data either, implying that using a very small value of β leads to latent representations that deviate from the standard Normal distribution. To summarize, β ≤ 0.1 can prevent model training from posterior collapse and can form a structured latent representation space, which is useful for latent space exploration using optimization techniques. Figure 3 shows population validity, novelty and diversity in different DEL processes.
In this chart, validity (SMILES) is the validity of SMILES strings in the population; validity (fragments) is the validity of fragment sequences sampled using FragVAE after evolutionary operations; novelty is the ratio of population samples that are not in the training data to the population size; and diversity is the ratio of unique population samples to the population size. We observe that almost all samples in the populations are novel, and the population samples in the last generation are quite diverse, ranging from 0.798 to 0.988. Table 9 lists the increasing numbers of high-quality novel molecules discovered along the DEL processes. In contrast to the training molecules visualized in Figures 9 and 10 (Appendix), DEL is able to discover novel and diverse high-quality molecules (see the Appendix). Furthermore, the distributions of properties and structural features of samples along the evolutionary process are compared in Figure 4 (and Figures 19-21 in Appendix). It can be seen that DEL is able to gradually improve the distributions of QED, SAS and logP in population samples towards the preferred goals, even though these distributions can become quite different from the actual data distribution. Interestingly, samples randomly generated using the standard Normal prior do not clearly show this trend (Figures 22-25 in Appendix).
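The novelty and diversity metrics above reduce to simple set operations once each sample has a canonical string form; a minimal sketch (our own helper, with validity assumed to be computed upstream, e.g. by RDKit parsing):

```python
def population_metrics(population, training_set):
    """Novelty and diversity of a population of valid samples, where each
    sample is e.g. a canonical SMILES string; validity is checked upstream."""
    n = len(population)
    novel = sum(1 for s in population if s not in training_set)
    return {
        "novelty": novel / n,                    # fraction unseen in training data
        "diversity": len(set(population)) / n,   # fraction of unique samples
    }
```

Canonicalizing the strings before comparison matters here, since the same molecule can otherwise appear under several different SMILES spellings.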

3.3. COMPARISON WITH MULTI-OBJECTIVE BAYESIAN OPTIMIZATION

DEL was compared with two MOBO methods: q-Pareto Efficient Global Optimization (qParEGO) and q-Expected Hypervolume Improvement (qEHVI) (Daulton et al., 2020). These MOBO methods were run in the latent space of the FragVAE trained in the first generation of DEL using all training samples. Hyperparameter settings of these algorithms are listed in Appendix A.5. Figure 26 shows the hypervolumes obtained by both algorithms and by the quasi-random baseline, which selects candidates from a scrambled Sobol sequence along batches. It shows that qParEGO and qEHVI work better with β = 0.01 than with β = 0.1. To qualitatively compare DEL with the MOBO algorithms, the first five Pareto fronts obtained using DEL and the last five batches obtained using qParEGO and qEHVI are visualized in Figure 5 and Figure 27 in Appendix. We can see that the solutions from qParEGO and qEHVI lie behind the Pareto fronts from DEL. To quantitatively compare DEL with qParEGO and qEHVI, we conducted non-dominated sorting on the combination of the first Pareto front of DEL and the last six batches of qParEGO and qEHVI, and report the results in Table 1. All solutions from the DEL Pareto front stay in the new integrated Pareto front, while almost all solutions from qParEGO and qEHVI fall behind the integrated Pareto front. Furthermore, as indicated in Table 3 in Appendix, DEL runs more efficiently than these MOBO algorithms, even though the population size (20,000) of DEL is much larger than the batch sizes (8
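For intuition about the hypervolume indicator tracked in Figure 26, here is a sketch for the 2-objective minimization case (our own illustration; the experiments use three objectives, for which exact computation is more involved and is handled by libraries such as BoTorch):

```python
def hypervolume_2d(front, ref):
    """Hypervolume of a 2-objective minimization front w.r.t. a reference
    point `ref` dominated by every front point (larger is better)."""
    hv, prev_f2 = 0.0, ref[1]
    for f1, f2 in sorted(front):        # sweep by increasing first objective
        if f2 < prev_f2:                # skip dominated points
            hv += (ref[0] - f1) * (prev_f2 - f2)
            prev_f2 = f2
    return hv
```

Each non-dominated point contributes a rectangle between itself, its predecessor in the sweep, and the reference point, so a front that pushes further towards the origin yields a larger hypervolume.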

3.4. ABLATION STUDIES

By default, DEL uses the property predictor for latent representation regularization, fine-tunes FragVAE using the new population data in each generation, forms new child latent codes using the linear crossover method, and fixes the population size to 20K. Variants were created by (1) disabling the property predictor, (2) disabling fine-tuning, (3) using the discrete crossover method, and (4) allowing a much larger population size (100K). While it is difficult to compare these variants in terms of validity, novelty and diversity of population samples (see Figure 28 in Appendix), non-dominated sorting turns out to be an informative method for comparison. Table 4, in Appendix A.6, indicates that DEL with the property predictor outperforms the variant without it. Table 5 shows that FragVAE fine-tuning in DEL helps obtain better Pareto fronts. Table 6 implies that both linear and discrete crossover operations behave well in DEL. Also, DEL with a larger population size can form better Pareto fronts (see Table 7). Additionally, we applied non-dominated sorting to compare the quality of the Pareto fronts obtained using different values of β in DEL, and found that the integrated Pareto front consists of samples relatively evenly from all settings (Table 8).

4. CONCLUSION

In this paper, we presented our DEL framework, in which a fragment-based VAE is integrated such that evolutionary exploration is conducted in the continuous latent representation space rather than the discrete structural space. Our intensive experiments show that DEL is able to generate novel populations of molecules with improved properties, and outperforms state-of-the-art multi-objective Bayesian optimization algorithms. Applications of DEL are certainly not restricted to the design of small molecules. As future work, DEL will be tested on different datasets, other design problems, and more specific applications. Other types of DGMs and search strategies will be explored to further enhance DEL. New MOBO algorithms for latent-space-based optimization need to be studied further to address issues such as scalability, unknown invalid domains in the latent space, and the curse of dimensionality. A GitHub link to our PyTorch and BoTorch implementation will be made available.



The DEL process is formally presented in Algorithm 1 (see Appendix A.1). This algorithm consists of the following steps. (a) A VAE (as molecule modeller) and a multilayer perceptron neural network (MLP; the property predictor, acting as a regularizer) are pretrained, using all the original training data to start the first evolutionary generation, or, if not in the first generation, using samples from the previous population. (b) Training samples (in the first generation) or population samples (otherwise) are projected to the latent space using the encoder of the VAE. (c) Based on the non-dominated ranking and crowding distances of samples with respect to multiple properties, evolutionary operations (selection, recombination/crossover and mutation) are conducted on the latent representations of the samples. (d) Given these new latent codes after evolutionary operations, new molecule samples are generated by the decoder of the VAE. (e) The properties of these generated samples are obtained using a simulator (e.g. RDKit (Landrum, 2006) in our experiments). (f) New samples with good desired properties and good samples from the previous generation form the new population. (g) Steps (b-f) are iterated for multiple generations. (h) The final population is returned.
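The steps (a)-(h) above can be condensed into a short control-flow skeleton (a sketch only: `model`, `simulator`, `evolve` and `merge` are hypothetical stand-ins for the FragVAE, the RDKit-based property simulator, and the EC operators, injected as callables):

```python
def deep_evolutionary_learning(train_data, model, simulator, evolve, merge,
                               M, n_generations):
    """Skeleton of the DEL loop. `model` stands in for the DGM with
    fit/encode/decode methods; `simulator` scores molecular properties;
    `evolve` and `merge` stand in for the evolutionary operations and
    the elitist population merge. All names here are illustrative."""
    model.fit(train_data)                                # (a) pretrain DGM + predictor
    population = list(train_data)[:M]
    for _ in range(n_generations):                       # (g) iterate
        latents = [model.encode(x) for x in population]  # (b) project to latent space
        offspring = evolve(latents)                      # (c) select/crossover/mutate
        children = [model.decode(z) for z in offspring]  # (d) decode to molecules
        scored = [(x, simulator(x)) for x in children
                  if x is not None]                      # (e) keep valid, score
        population = merge(scored, population, M)        # (f) elitist new population
        model.fit(population)                            #     fine-tune DGM on new data
    return population                                    # (h) final population
```

The fine-tuning call inside the loop is what implements the data-model co-evolution: each improved population becomes fresh training data for the generative model.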

Figure 1: Deep evolutionary learning process and deep generative model integrated in DEL.


Structural feature distributions over PCBA.

Figure 2: Property and structural feature distributions over 20K randomly sampled molecules.

3.2. DEL PERFORMANCE

We executed DEL processes using fixed or annealed loss trade-off weights: (1) β = 0.1 and α = 1, (2) β = 0.01 and α = 1, (3) β = 0.1 and α = 1 in the initial training of FragVAE, then annealing β to 0.4 and α to 4 in the second generation of DEL (denoted by β = 0.1 → 0.4, α = 1 → 4), and similarly (4) β = 0.01 → 0.4, α = 1 → 4. Other hyperparameter values can be found in Appendix A.4. The change of losses is shown in Figure 8. We observe that (1) all settings lead to similar reconstruction error convergence, and (2) smaller values of β tend to obtain smaller property

Figure 3: Population validity, novelty and diversity during DEL processes.

Structural feature distributions over PCBA.

Figure 4: Property & structural feature distributions of DEL populations (β=0.01 → 0.4, α=1 → 4).

Figure 5: Pareto fronts of DEL and MOBO algorithms (β = 0.01 → 0.4). In the legend, the last batch of qParEGO or qEHVI is written as Front 1.

Table 1: Non-dominated sorting of the combination of DEL, qParEGO and qEHVI Pareto fronts.

Figure 6: Different forms of the β-function. The total number of epochs is T = 100.
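One plausible realization of such an annealing schedule is a simple linear ramp; this is an illustrative assumption only (the actual β-functions are the forms plotted in Figure 6, and the function name `anneal` is our own):

```python
def anneal(start, end, step, total_steps):
    """Linear annealing from `start` to `end` over `total_steps` steps."""
    frac = min(step / max(total_steps, 1), 1.0)
    return start + frac * (end - start)

# e.g. annealing beta = 0.01 -> 0.4 and alpha = 1 -> 4 across 10 generations:
betas  = [anneal(0.01, 0.4, g, 9) for g in range(10)]
alphas = [anneal(1.0, 4.0, g, 9) for g in range(10)]
```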

Figure 7: Loss of FragVAE in initial training on ZINC and PCBA respectively.

Figure 8: Loss of FragVAE in DEL on ZINC and PCBA respectively.

Figure 9: ZINC training molecules that satisfy properties QED ≥ 0.88, SAS ≤ 3, and logP ≤ 1. Note: 196 molecules satisfy these conditions, but only 64 are visualized.

Figure 10: PCBA training molecules that satisfy properties QED ≥ 0.88, SAS ≤ 3, and logP ≤ 1. Note: 406 molecules satisfy these conditions, but only 64 are visualized.

Figure 11: Novel molecules that satisfy properties QED ≥ 0.88, SAS ≤ 3, and logP ≤ 1 in the final generation of DEL trained on ZINC with hyperparameter: β = 0.1, α = 1.

Figure 12: Novel molecules that satisfy properties QED ≥ 0.88, SAS ≤ 3, and logP ≤ 1 in the final generation of DEL trained on ZINC with hyperparameter: β = 0.1 → 0.4, α = 1 → 4.

Figure 13: Novel molecules that satisfy properties QED ≥ 0.88, SAS ≤ 3, and logP ≤ 1 in the final generation of DEL trained on ZINC with hyperparameter: β = 0.01, α = 1.

Figure 14: Novel molecules that satisfy properties QED ≥ 0.88, SAS ≤ 3, and logP ≤ 1 in the final generation of DEL trained on ZINC with hyperparameter: β = 0.01 → 0.4, α = 1 → 4.

Figure 15: Novel molecules that satisfy properties QED ≥ 0.88, SAS ≤ 3, and logP ≤ 1 in the final generation of DEL trained on PCBA with hyperparameter: β = 0.1, α = 1.

Figure 16: Novel molecules that satisfy properties QED ≥ 0.88, SAS ≤ 3, and logP ≤ 1 in the final generation of DEL trained on PCBA with hyperparameter: β = 0.1 → 0.4, α = 1 → 4.

Figure 17: Novel molecules that satisfy properties QED ≥ 0.88, SAS ≤ 3, and logP ≤ 1 in the final generation of DEL trained on PCBA with hyperparameter: β = 0.01, α = 1.

Figure 18: Novel molecules that satisfy properties QED ≥ 0.88, SAS ≤ 3, and logP ≤ 1 in the final generation of DEL trained on PCBA with hyperparameter: β = 0.01 → 0.4, α = 1 → 4.

Figure 19: Property and structural feature distributions of population samples during DEL (β = 0.1).

Figure 20: Property and structural feature distributions of population samples during DEL (β = 0.1 → 0.4).

Structural feature distributions over PCBA.

Figure 21: Property and structural feature distributions of population samples during DEL (β = 0.01).

Figure 22: Property and structural feature distributions of randomly sampled molecules using FragVAE during DEL (β = 0.1).

Figure 23: Property and structural feature distributions of randomly sampled molecules using FragVAE during DEL (β = 0.1 → 0.4).

Structural feature distributions over PCBA.

Figure 24: Property and structural feature distributions of randomly sampled molecules using FragVAE during DEL (β = 0.01).

Figure 25: Property and structural feature distributions of randomly sampled molecules using FragVAE during DEL (β = 0.01 → 0.4).

Figure 26: Hypervolume along batches when running Sobol random search, qParEGO, and qEHVI.

Figure 27: Pareto fronts of DEL and MOBO algorithms (β = 0.1). In the legend, the last batch of qParEGO or qEHVI is written as Front 1.

Figure 28: Validity, novelty and diversity of population samples (after evolutionary operations and before merging with previous population) in different variants of DEL.

of MOBO algorithms. It has been a well-known challenge to scale up BO algorithms.

Non-dominated sorting of combination of DEL, qParEGO and qEHVI Pareto fronts.

Performance of FragVAE in comparison with existing methods.

Running time of DEL and MOBO. Since both methods involve training FragVAE, the FragVAE training time was not counted in this table. Format: Hours:Minutes:Seconds. Note: when β is annealed to 0.4 in DEL, α is annealed from 1 to 4. All experiments were carried out on a Dell Precision 5820 Workstation equipped with an Intel Xeon W-2255 CPU (10C), RAM of 128GB, and a Nvidia Quadro RTX 6000 (24GB) GPU.

Non-dominated sorting of combination of Pareto fronts from DEL with and without property prediction (PP) component as latent space regularization.

Non-dominated sorting of combination of Pareto fronts from DEL with and without DGM fine-tuning (FT) phase.

Non-dominated sorting of combination of Pareto fronts from DEL with linear or discrete crossover operation. Note: when β is annealed to 0.4, α is annealed from 1 to 4.

Non-dominated sorting of combination of Pareto fronts from DEL with population sizes of 20K and 100K. Note: when β is annealed to 0.4, α is annealed from 1 to 4.

Non-dominated sorting of combination of Pareto fronts from DEL with different values of β. Note: when β is annealed to 0.4, α is annealed from 1 to 4. Population size: 20K.

Numbers of novel molecules that satisfy properties QED ≥ 0.88, SAS ≤ 3, and logP ≤ 1 in the 1st, 5th and final (10th) generations of DEL. Numbers of training molecules satisfying the same conditions are also given. Note: when β is annealed to 0.4, α is annealed from 1 to 4.

A APPENDIX

Listing 1: Wrong use of view function in the original VAE model in (Podda et al., 2020).

    _, state = self.rnn(packed, state)
    # num_layers by batch by hidden_size -> batch by hidden_layers * hidden_size
    state = state.view(batch_size, self.hidden_size * self.hidden_layers)
    mean = self.rnn2mean(state)    # mean: batch by latent_size
    logvar = self.rnn2logv(state)  # logv: batch by latent_size

Listing 2: Our correction.

    _, state = self.rnn(packed, state)
    # num_layers by batch by hidden_size -> batch by num_layers by hidden_size
    state = state.transpose(1, 0)
    state = state.flatten(start_dim=1)  # batch by hidden_layers * hidden_size
    mean = self.rnn2mean(state)    # mean: batch by latent_size
    logvar = self.rnn2logv(state)  # logv: batch by latent_size
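The bug can be demonstrated concretely with a small NumPy analogue (NumPy reshape/transpose stand in for PyTorch's view/transpose; the shapes are illustrative): reshaping the (num_layers, batch, hidden) state directly to (batch, num_layers * hidden) interleaves hidden states of different samples, whereas transposing first keeps each sample's states together.

```python
import numpy as np

num_layers, batch, hidden = 2, 3, 4
state = np.arange(num_layers * batch * hidden).reshape(num_layers, batch, hidden)

# Wrong: reshape straight to (batch, num_layers * hidden) puts hidden states
# of *different* samples into the same row.
wrong = state.reshape(batch, num_layers * hidden)

# Correct: move the batch axis first, then flatten per sample.
right = np.transpose(state, (1, 0, 2)).reshape(batch, num_layers * hidden)

# Row 0 of the correct result concatenates sample 0's states from both layers.
expected_row0 = np.concatenate([state[0, 0], state[1, 0]])
assert np.array_equal(right[0], expected_row0)
assert not np.array_equal(wrong, right)
```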

A.3 HYPERPARAMETER SETTING FOR FRAGVAE

When not specified in the main text, the following values of hyperparameters are used in experiments.

• size of the embedding layer: 128

