LATENT OPTIMIZATION VARIATIONAL AUTOENCODER FOR CONDITIONAL MOLECULE GENERATION

Abstract

The variational autoencoder (VAE) is a generative model consisting of an encoder and a decoder, where the latent variable is used as the input of the decoder. VAEs are widely used for image, audio, and text generation tasks. In general, the training of a VAE is at risk of posterior collapse, especially for long sequential data. To alleviate this, modified evidence lower bounds (ELBOs) have been proposed. However, these approaches heuristically control the training loss using a hyper-parameter and do not solve the fundamental problem of the vanilla VAE. In this paper, we propose a method that inserts an optimization step for the latent variable and alternately updates the encoder and decoder to maximize the ELBO. In experiments, we applied the latent optimization VAE (LOVAE) to the ZINC dataset, which consists of string representations of molecules, for inverse molecular design. We show that the proposed LOVAE is more stable in training and achieves better performance than the vanilla VAE in terms of the ELBO and molecular generation metrics.

1. INTRODUCTION

Deep neural networks (DNNs) have demonstrated dramatic performance improvements in various applications. Text extraction from images, language translation, speech and natural language recognition, and personal identification by fingerprint and iris have already achieved high accuracy (Wu et al., 2016; Devlin et al., 2018; Awad, 2012; Nguyen et al., 2017). Recently, these applications have become successful commercial products. For image generation, the variational autoencoder (VAE) (Kingma & Welling, 2014), the generative adversarial network (GAN) (Goodfellow et al., 2014), and reversible generative models (Dinh et al., 2015; 2017; Kingma & Dhariwal, 2018) were proposed and showed much progress (Bleicher et al., 2003; Phatak et al., 2009; Grathwohl et al., 2018). These generative models were initially studied on image data and outperformed previous models. Since then, the research area has been extended to generating new sentences (Iqbal & Qureshi, 2020) and to discovering new drugs (Chen et al., 2018) and materials (Kim et al., 2018a). Traditional materials research consists of four steps: molecule design, physical or chemical property prediction, molecular synthesis, and experimental evaluation. These steps are repeated until a molecular structure satisfies the desired molecular properties. Until now, trial-and-error techniques based on human knowledge have been widely used; however, they are time consuming and very expensive. To improve on the traditional method, high-throughput computational screening (HTCS) (Bleicher et al., 2003) was studied. However, it also has limitations, such as high computational cost, molecular structures predefined by human knowledge, and low simulation accuracy.
Unlike the traditional approach, inverse molecular design attempts to find novel molecules that satisfy desired properties by exploring a large chemical space (Sanchez-Lengeling & Aspuru-Guzik, 2018). It extracts knowledge of potential molecular structures and properties from accumulated molecular structure databases (PubChem, ZINC, etc.) and proposes new molecular structures that do not exist in those databases (Bolton et al., 2008; Irwin et al., 2012). With inverse molecular design, cost can be saved by conducting molecular synthesis and experimental evaluation only for molecular structures with desired properties, instead of searching an almost infinite chemical space. With the development of machine learning techniques, generative models such as GANs and VAEs have been applied to inverse molecular design tasks in recent years (Sanchez-Lengeling & Aspuru-Guzik, 2018; Shi et al., 2020; Yang et al., 2020; Jin et al., 2018; Wang et al., 2020). In a GAN, the discriminator tries to distinguish generated molecular structures from those in the database, while the generator simultaneously tries to generate molecular structures similar to those in the database. In a VAE, the encoder outputs latent variables from the molecular structure given as input, and the decoder reconstructs the original molecular structure from those latent variables. However, generative models are difficult to train and can converge poorly. GANs are hard to train because of the alternating updates in the two-player game setting (Goodfellow et al., 2014); problems such as the two models oscillating without making adversarial progress, or parameters ceasing to learn in certain situations due to mode collapse, remain to be solved. In a VAE, the evidence lower bound (ELBO) used for training requires optimizing both the reconstruction loss and the KL divergence of the latent variables, which can cause a phenomenon called posterior collapse.
To prevent posterior collapse, beta-VAE (Higgins et al., 2017), Re-balancing VAE (Yan et al., 2019), and KL annealing (Bowman et al., 2016) have been proposed. In this paper, we propose a latent optimization VAE (LOVAE) that provides a stable learning method by inserting a latent variable optimization step into the conditional VAE (cVAE) (Sohn et al., 2015). We apply two stages of latent optimization to the vanilla VAE. By first training the encoder once, the latent variable is optimized in the direction of reducing the training loss for the same input data. After reparameterizing the latent variable from the updated encoder, an additional latent optimization step is applied. The proposed LOVAE was compared and verified on the inverse molecular design task, using a database of drug-like molecular structures (the ZINC dataset (Sterling & Irwin, 2015)). We show that the proposed method outperforms the vanilla VAE in terms of reconstruction loss and ELBO, which are training indicators. In addition, it shows improved behavior in the generation phase: molecules generated by LOVAE had higher uniqueness, novelty ratio, and target property satisfaction than several previous approaches, and LOVAE generated molecules with higher penalized LogP values than existing methods.

2. RELATED WORK

Inverse molecular design Human-knowledge-based inverse molecular design is very time consuming and relies on the intuition of the researcher, so many researchers have recently tried to address it through simulation and other methods. HTCS (Bleicher et al., 2003), one of the major simulation methods, is an automated computational process that can rapidly identify active compounds, antibodies, or genes, and its results provide starting points for many chemical tasks such as drug and material design (Chen et al., 2018; Kim et al., 2018a). HTCS is essentially a brute-force method that searches for and analyzes desired chemical characteristics using combinations of hundreds of thousands or tens of millions of active compounds, which is a disadvantage because of the large amount of resources required to reach the desired goals (Phatak et al., 2009). Recently, to address these problems, artificial intelligence approaches have been widely applied to molecular design under the name of inverse molecular design. VAEs and GANs are typical generative models, and they have been applied to the inverse molecular design field (Sanchez-Lengeling & Aspuru-Guzik, 2018; Jin et al., 2020; Yang et al., 2020; Simm et al., 2020). For inverse molecular design, various formats have been defined to represent molecular structure information compactly instead of atoms' xyz coordinates. The MDL format represents 3D coordinate information together with bonding information between adjacent atoms. The Extended Connectivity Fingerprint (ECFP) (Rogers & Hahn, 2010) and the Simplified Molecular Input Line Entry System (SMILES) (Weininger et al., 1989) represent the molecular structure as a sequential character string. More recently, representing the molecular structure as a graph has also been studied.
Among them, SMILES, the string representation of molecules, is relatively easy to handle and has shown good performance.

Generative models

Generative models such as GANs and VAEs are very sensitive to the latent variables. In other words, the training of the latent variables greatly affects the performance of the generative model. Depending on how the latent variables are handled, difficult problems such as posterior collapse and mode collapse occur during training. As a remedy, methods that adjust the KL loss weight to a value other than 1 have been proposed for VAEs. In Kim et al. (2018b), the latent variable from the encoder is optimized to maximize the ELBO, and both the encoder and decoder parameters are updated using the optimized latent variable. In Zhang et al. (2020), a reconstruction loss on the latent space is added to the previous loss term. For GANs, there are several works applying latent variable optimization. The method in Bojanowski et al. (2017) optimizes the latent variable by treating it as learnable noise. In Wu et al. (2019), randomly sampled latent variables undergo gradient descent in the direction of reducing the GAN loss; afterwards, the parameters of the discriminator and generator are updated with the loss from the optimized latent variable z. All of these approaches resulted in improved training stability and performance.

3. CONDITIONAL VAE

In the conditional VAE (Kang & Cho, 2018; Kingma et al., 2014), the input variable x is assumed to be generated from a generative distribution p_θ(x|y, z) conditioned on the output variable y and the latent variable z. The prior distribution of z is assumed to be p(z) = N(z|0, I). We use variational inference to approximate the posterior distribution of z given x and y by

q_φ(z|x, y) = N(z | μ_φ(x, y), diag(σ_φ(x, y))).   (1)

From the perspective of the auto-encoder, q_φ(z|x, y) and p_θ(x|y, z) are called the encoder and the decoder, respectively. Feed-forward neural networks are used for μ_φ(x, y) and σ_φ(x, y). The objective of the conditional VAE is to maximize the ELBO, a lower bound of the marginal log-likelihood:

log p_θ(x, y) ≥ E_{q_φ(z|x,y)}[log p_θ(x|y, z)] − KL(q_φ(·|x, y) || p(·)),   (2)

up to an additive constant, where KL denotes the Kullback-Leibler divergence. Given a random sample z generated from the encoder q_φ(z|x, y), define the total loss as

L_total(x, y, z) = −log p_θ(x|y, z) + KL(q_φ(·|x, y) || p(·)).   (3)

Then −L_total(x, y, z) is a Monte Carlo approximation of the ELBO in equation 2. We define L_recon(x, y, z) = −log p_θ(x|y, z), since it can be regarded as a reconstruction loss. In the vanilla conditional VAE, the parameters θ and φ are jointly optimized to minimize L_total. A string representation of molecules called SMILES is widely used to analyze molecular data (Higgins et al., 2017; Yan et al., 2019; Kang & Cho, 2018). To handle string data like SMILES in the conditional VAE, recurrent neural networks (Yan et al., 2019; Kang & Cho, 2018) are used for the decoder p_θ(x|y, z). Given a target molecular property y, a new molecule x having this property is generated as follows:

z ∼ p(z),  x ∼ p_θ(x|y, z).   (4)

Algorithm 1 An update of the encoder and decoder in LOVAE

# Update the Encoder
Generate z ∼ q_φ(z|x, y).
Calculate L_total(x, y, z).
Update φ → φ' using L_total(x, y, z).

# Additive Latent Optimization
Generate z' ∼ q_φ'(z'|x, y).
Update z' ← z' − α/(β + ||g||_2) · g, where g = ∂L_recon(x, y, z')/∂z'.

# Update the Decoder
Calculate L_recon(x, y, z').
Update θ → θ' using L_recon(x, y, z').
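As a minimal sketch (not the paper's actual implementation), the two terms of the total loss in equation 3 can be written in NumPy, assuming a diagonal-Gaussian encoder and a softmax decoder over SMILES token IDs; the function names are illustrative only:

```python
import numpy as np

def kl_to_standard_normal(mu, sigma):
    # KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions.
    return 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - 2.0 * np.log(sigma))

def recon_nll(logits, target_ids):
    # L_recon = -log p_theta(x | y, z) for a categorical (softmax) decoder:
    # sum of per-position negative log-likelihoods of the target tokens.
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(target_ids)), target_ids].sum()

def total_loss(logits, target_ids, mu, sigma):
    # Monte Carlo estimate of -ELBO (equation 3): L_recon + L_KL.
    return recon_nll(logits, target_ids) + kl_to_standard_normal(mu, sigma)
```

When the encoder posterior equals the prior (mu = 0, sigma = 1) the KL term vanishes, which is exactly the posterior-collapse regime discussed in the introduction.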

4. LATENT OPTIMIZATION VAE

We first considered improving the VAE in terms of latent variable optimization. The problem we address in this paper is that vanilla VAE training is not an optimized process. The decoder is trained on the encoder output z. However, in the vanilla VAE, the latent variable produced by the encoder before its update on the same input data x is used for decoder training. From the perspective of the decoder, when the same input data x is used, it may be more effective to calculate L_total using the updated latent variable z' obtained from the updated encoder. Our proposed method, LOVAE, addresses this problem through latent variable optimization. LOVAE uses the same input x for training the encoder and decoder, and z' is used for updating the decoder because the encoder is updated first. With this approach, L_total becomes smaller than in the vanilla VAE (L_total(x, y, z) > L_total(x, y, z')). In addition, decoder training is helped by optimizing z' once more in the direction of reducing L_total(x, y, z'), in a way that does not spoil the training of the encoder and decoder. That is, a better encoder and a better latent variable can make the decoder even better. To be more specific, the encoder is updated as usual while the decoder is fixed, and optimization of the latent variable from the encoder follows. Finally, the decoder is updated with the optimized latent variable. Updating the encoder first has an effect similar to latent optimization: a more suitable z can be created, and this latent variable not only reduces the loss but also depends on the input data. A brief comparison of the vanilla (conditional) VAE and LOVAE is given in Figure 1. First, the encoder parameter φ is updated to φ' in the direction of reducing the total loss L_total(x, y, z), where z is generated using the current encoder parameter φ.
Secondly, z' is generated using the updated encoder parameter φ', and then z' is further updated in the direction of reducing the reconstruction loss L_recon(x, y, z') using the natural gradient descent method (Wu et al., 2019). Lastly, the decoder parameter θ is updated to θ' in the direction of reducing the reconstruction loss L_recon(x, y, z') with the optimized z'. Since the optimized z' is used in the update of the decoder parameters, LOVAE is expected to achieve a larger ELBO and show stable convergence. This is verified with numerical results in Section 5. Note that latent optimization is applied only during training; the inference procedure of LOVAE remains the same as that of the vanilla model. In summary, the whole update process is detailed in Algorithm 1.
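The additive latent optimization step of Algorithm 1 amounts to a single normalized gradient update. Below is a minimal sketch; `grad_fn` is a hypothetical stand-in for the autograd computation of ∂L_recon/∂z', and the default α and β follow the values reported in Section 5.1:

```python
import numpy as np

def latent_opt_step(z, grad_fn, alpha=0.001, beta=5.0):
    # One additive latent optimization step (Algorithm 1):
    #   z' <- z' - alpha / (beta + ||g||_2) * g,  g = dL_recon / dz'.
    # The gradient is scaled by its norm so the step size stays bounded.
    g = grad_fn(z)
    return z - alpha / (beta + np.linalg.norm(g)) * g
```

On a toy quadratic loss L(z) = 0.5‖z‖² (gradient g = z), one step strictly shrinks z toward the minimum, illustrating that the update moves in a loss-reducing direction without large jumps.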

5. EXPERIMENTS

5.1. EXPERIMENT SETUP

The ZINC database (Sterling & Irwin, 2015) organizes information about various drug-like compounds. ZINC contains 3D structural information of compounds and molecular physical properties such as molecular weight (molWt), the partition coefficient (LogP), and the quantitative estimate of drug-likeness (QED). The distribution of target properties in our dataset, ZINC310K, is shown in Figure 3. The median values of molWt, LogP, and QED are 359.02, 2.91, and 0.70, respectively. Among the existing VAE variants, beta-VAE (Higgins et al., 2017), re-balancing VAE (Yan et al., 2019), and KL-annealing VAE (Bowman et al., 2016) control the weight of the KL loss in the total loss of equation 3 to achieve their own purposes, such as disentanglement of latent variables, avoidance of posterior collapse, and training stability. The proposed latent optimization technique can also be combined with those methods. Unlike LOVAE, the semi-amortized VAE (SA-VAE) of Kim et al. (2018b) updates the latent variable without an encoder update phase and applies a momentum-based optimization multiple times; afterwards, the decoder and encoder parameters are updated according to the optimized latent variable. According to Kim et al. (2018b), SA-VAE uses the encoder and the latent optimization parts even in the test phase. Thus SA-VAE is not well suited to our task, because it requires the latent optimization part at inference time. In contrast, LOVAE does not use the additive latent optimization in the inference phase. Like LOVAE, there is an existing study that trains the encoder first (He et al., 2019). In He et al. (2019), the encoder is updated several times until a certain condition is satisfied, and different input data are used when updating the encoder and the decoder. LOVAE differs in several ways.
It updates the encoder first, but only once and without any stopping condition; it uses the same input data for the encoder and decoder; and it performs the additive latent optimization with reparameterization to help overall VAE learning. We believe that the z obtained by updating the encoder on the same input data is more natural and better suited for decoder learning. In addition, we believe that using the same input data for training the encoder and decoder is more effective in terms of latent optimization than feeding different input data to the encoder and decoder. Also, Table 1 shows that LOVAE requires less training time; here we referred to the experimental results of Kim et al. (2018b).

The basic model structure in this paper follows the general VAE structure for sequential data. The encoder has a bi-directional RNN structure, and the decoder has a uni-directional RNN structure (Yan et al., 2019; Kang & Cho, 2018). Each RNN consists of three layers of GRU cells, and the dimension of the latent variable is set to 100. The hidden size of each GRU cell is 250, and the dimension of the properties is 3. A 103-dimensional vector, in which the 100-dimensional latent variable and the 3-dimensional property vector are concatenated, is input to the decoder. Vanilla VAE and LOVAE have exactly the same model structure; only the training strategy differs. We used the Adam optimizer with β1 = 0.9, β2 = 0.999, and ε = 10^-6, and a polynomial learning rate decay was applied, with initial and final learning rates of 0.001 and 0.0, respectively. For the maximum number of training epochs, several values were tried, and we chose one that showed good performance. During training, we normalize each property value to have mean 0 and standard deviation 1. For the additive latent optimization of LOVAE, α = 0.001 and β = 5 were used.
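The conditioning scheme above can be sketched in a few lines, assuming the stated dimensions (100-d latent, 3-d property vector); `normalize_properties` and `decoder_input` are illustrative names, not the authors' code:

```python
import numpy as np

def normalize_properties(Y, mean=None, std=None):
    # Normalize each property column (molWt, LogP, QED) to mean 0, std 1,
    # using training-set statistics when mean/std are supplied.
    mean = Y.mean(axis=0) if mean is None else mean
    std = Y.std(axis=0) if std is None else std
    return (Y - mean) / std, mean, std

def decoder_input(z, y_norm):
    # Concatenate the latent variable and the condition:
    # (100,) latent + (3,) normalized properties -> (103,) decoder input.
    return np.concatenate([z, y_norm], axis=-1)
```

At generation time, the same training-set mean and std would be used to normalize the target condition before concatenation, so train and inference conditions live on the same scale.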

5.2. EVALUATION: VAE TRAINING PHASE

As defined in Section 3, L_total of the VAE is the sum of L_recon and L_KL. L_total is of primary importance, and the relative importance of L_recon and L_KL can be weighed depending on the purpose. For evaluation, models initialized with 5 different random seeds were trained for each algorithm, vanilla VAE and LOVAE. The vanilla VAE is based on cVAE. The trained models were compared and analyzed using L_total and L_recon on the training set. The loss of LOVAE was measured without the additive latent optimization. Table 2 shows the training results for vanilla VAE and LOVAE. We ran each method five times and report the mean of the training losses. 'Original ELBO' uses equation 3 as the total loss, and 'reduced KL term' reduces the weight of L_KL in equation 3 as in Yan et al. (2019). In our experiments, when the weight of the KL loss was between 0.7 and 0.8, each method showed a smaller L_total. It can be seen that LOVAE is better than vanilla VAE in both L_recon and L_total. In the 'original ELBO' results, L_recon of LOVAE is on average 2.74 lower than that of vanilla VAE; L_KL of LOVAE is 1.68 greater, but L_total is 1.07 smaller. In the 'reduced KL term' case, LOVAE also shows a smaller L_total. The relative improvement of L_total is about 4.45% in the 'original ELBO' case.

5.3. EVALUATION: MOLECULAR GENERATION PHASE

The molecules generated by a generative model can be evaluated according to three criteria. The first is validity, meaning that the generated molecule has a chemically sound structure; this can be checked with a cheminformatics toolkit such as RDKit (Landrum). The second is novelty. The purpose of inverse molecular design is to find new molecules that have not been discovered yet; a molecule is said to be novel if it is not in the training set. The third is uniqueness. If the latent space is very narrow and the latent variable z is repeatedly sampled from a similar region, the same SMILES is likely to be generated repeatedly; the higher the uniqueness, the better the generative model. Generated molecules are more meaningful if they satisfy all three criteria: validity, novelty, and uniqueness. In this paper, the ratio of molecules satisfying all three criteria is defined as the generative efficiency. For example, if a generative model attempts to generate 1,000 molecules and 600 of them satisfy validity, novelty, and uniqueness at once, the generative efficiency is 0.6. For evaluation, three values were chosen for each property as the condition for the generative model, and ZINC310K was used. The values were chosen close to the median, lower 10%, and upper 10% of our training set. For molWt, 360.0, 260.0, and 460.0 were used as conditions; for LogP and QED, {3.0, 1.5, 4.5} and {0.7, 0.5, 0.9} were chosen, respectively. Molecular generation was attempted 3,500 times for each condition value, so 31,500 molecules were generated over the 9 conditions in total. For this analysis, the 'original ELBO' models were used. The generative efficiency results are shown in Table 3. It can be seen that LOVAE has good generative efficiency and uniform performance across all properties.
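The generative-efficiency metric defined above can be sketched in plain Python. This is an illustrative reconstruction, not the authors' evaluation code; `is_valid` is a caller-supplied predicate (in practice, e.g., RDKit SMILES parsing):

```python
def generative_efficiency(generated, train_set, is_valid):
    # Fraction of attempted generations that are simultaneously
    # valid, novel (not in the train set), and unique (first occurrence).
    seen = set()
    hits = 0
    for smiles in generated:
        if not is_valid(smiles):     # validity
            continue
        if smiles in train_set:      # fails novelty
            continue
        if smiles in seen:           # fails uniqueness
            continue
        seen.add(smiles)
        hits += 1
    return hits / len(generated)
```

Note that the denominator is the number of generation attempts, matching the paper's example (600 qualifying molecules out of 1,000 attempts gives 0.6).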

5.4. EVALUATION: PROPERTY SATISFACTION

In addition to the generative efficiency, property satisfaction can be used as a performance measure for the cVAE: how many generated molecules have properties close to the target condition. Evaluation of property satisfaction was conducted based on two criteria. The first is the percentage of generated molecules whose property value falls within a 10% error range of the condition value. For example, if the target molWt is 360.0, we measure the percentage of generated molecules whose molWt lies between 324.0 and 396.0. The results are shown in Table 4. For all three properties, LOVAE showed higher property satisfaction: at the 10% and 5% error ranges, LOVAE showed relative improvements of 12.1% and 10.5%, respectively. For comparison with previous works using ZINC250K, we referred to You et al. (2018b). In that paper, a property targeting task was performed for specific ranges of molWt and LogP. In our case, if the target range of molWt is from 150 to 200, we condition LOVAE on 175. The four target ranges are listed in Table 5. Except for the target range -2.5 ≤ LogP ≤ -2.0, LOVAE showed the best performance. Since the percentages of the training data in the ranges -2.5 ≤ LogP ≤ -2.0 and 5 ≤ LogP ≤ 5.5 are 0.28% and 1.30%, respectively, the first target range can be somewhat more difficult; in this respect, the result of LOVAE on -2.5 ≤ LogP ≤ -2.0 seems reasonable. In many previous papers, a property maximization task was performed and evaluated on penalized LogP (pLogP) and QED (Kusner et al., 2017). QED is bounded in [0, 1], but the range of penalized LogP is (-∞, ∞). pLogP is LogP penalized by the synthetic accessibility score (SA) and the number of large rings (cycle): pLogP = LogP - SA - cycle.
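The first criterion (the ±10% error-range check) can be sketched as follows; this is an illustrative reconstruction matching the paper's example (target molWt 360.0 gives the interval [324.0, 396.0]), not the authors' evaluation script:

```python
def property_satisfaction(values, target, tol=0.10):
    # Fraction of generated molecules whose property value lies within
    # +/- tol (e.g. 10%) of the target condition value.
    lo, hi = target * (1 - tol), target * (1 + tol)
    lo, hi = min(lo, hi), max(lo, hi)  # keep the bounds ordered for negative targets
    return sum(lo <= v <= hi for v in values) / len(values)
```

The bound-reordering line matters for negative targets such as LogP = -2.25, where target*(1+tol) is the smaller endpoint.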
This can be regarded as an extrapolation task, because the generative model has to create new molecules outside the range of property values in the training DB (the maximum pLogP in the training DB is 5.072). To find new molecules with the highest property values, some previous approaches used a reward function or property regressors with a sparse Gaussian process (Kusner et al., 2017; Shi et al., 2020; Gómez-Bombarelli et al., 2018). In our approach, LOVAE is simply conditioned on a high value, such as 30.0 for LogP and 0.98 for QED. Table 6 shows the property maximization results: LOVAE generated new molecules with the highest penalized LogP and QED values. It is noteworthy that LOVAE performed well with only a LogP condition, without a separate component such as a reward or property regressor. That is, LOVAE, a conditional VAE, works properly even in this extrapolation task. The top 3 molecules for each property are shown in Figure 4. In addition, a LOVAE conditioned on pLogP was trained and verified; its performance was worse than the LOVAE conditioned on LogP, but since the number of large rings can also be given as a condition, the trend of the generated molecules was different (Figure 5). The proposed method was applied and verified on the inverse molecular design task with the ZINC dataset, and we confirmed that it achieves better training loss, ELBO, and molecular generation performance than the vanilla VAE.



Figure 1: Comparison of the vanilla (conditional) VAE and LOVAE

Figure 2: Example of SMILES in ZINC dataset: COc1ccc(N2CC(C(=O)Oc3cc(C)ccc3C)CC2=O)cc1

Figure 4: Samples of generated molecules of LOVAE

Table 1: Comparison of total training time, in terms of relative speed, for vanilla VAE, LOVAE, He et al. (2019), and SA-VAE



Table 5: Probability that the property value of the generated molecule falls within the target range

