LATENT OPTIMIZATION VARIATIONAL AUTOENCODER FOR CONDITIONAL MOLECULE GENERATION

Abstract

The variational autoencoder (VAE) is a generative model consisting of an encoder and a decoder, where a latent variable produced by the encoder is used as the input of the decoder. VAEs are widely used for image, audio, and text generation tasks. In general, the training of a VAE is at risk of posterior collapse, especially for long sequential data. To alleviate this, modified evidence lower bounds (ELBOs) have been proposed. However, these approaches heuristically control the training loss with a hyper-parameter and do not solve the fundamental problem of the vanilla VAE. In this paper, we propose a method that inserts an optimization step for the latent variable and alternately updates the encoder and decoder to maximize the ELBO. In experiments, we applied the latent optimization VAE (LOVAE) to the ZINC dataset, which consists of string representations of molecules, for inverse molecular design. We show that the proposed LOVAE is more stable during training and achieves better performance than the vanilla VAE in terms of the ELBO and molecular generation metrics.

1. INTRODUCTION

Deep neural networks (DNNs) have demonstrated dramatic performance improvements in various applications. Text recognition in images, language translation, speech and natural language recognition, and personal identification by fingerprint and iris have already achieved high accuracy (Wu et al., 2016; Devlin et al., 2018; Awad, 2012; Nguyen et al., 2017). Recently, these applications have become successful commercial products. For image generation, the variational autoencoder (VAE) (Kingma & Welling, 2014), the generative adversarial network (GAN) (Goodfellow et al., 2014), and reversible generative models (Dinh et al., 2015; 2017; Kingma & Dhariwal, 2018) were proposed and have shown much progress (Bleicher et al., 2003; Phatak et al., 2009; Grathwohl et al., 2018). These generative models were initially studied on image data and outperformed previous models. Since then, the research area has been extended to generating new sentences (Iqbal & Qureshi, 2020) and discovering new drugs (Chen et al., 2018) and materials (Kim et al., 2018a). Traditional materials research consists of four steps: molecule design, physical or chemical property prediction, molecular synthesis, and experimental evaluation. These steps are repeated until a molecular structure satisfies the desired molecular properties. Until now, trial-and-error techniques based on human knowledge have been widely used, but they are time-consuming and very expensive. To improve on the traditional method, high-throughput computational screening (HTCS) (Bleicher et al., 2003) was studied. However, HTCS also has limitations such as high computational cost, molecular structures predefined by human knowledge, and low simulation accuracy.
Unlike the traditional approach, inverse molecular design attempts to find novel molecules that satisfy desired properties by exploring a large chemical space (Sanchez-Lengeling & Aspuru-Guzik, 2018). It extracts knowledge of potential molecular structures and properties from accumulated molecular structure databases (PubChem, ZINC, etc.) and proposes new molecular structures that do not exist in those databases (Bolton et al., 2008; Irwin et al., 2012). With inverse molecular design, it is possible to save cost by conducting molecular synthesis and experimental evaluation only for molecular structures predicted to have the desired properties, instead of searching an almost infinite chemical space. With the development of machine learning techniques, generative models such as the GAN and the VAE have been applied to inverse molecular design tasks in recent years (Sanchez-Lengeling & Aspuru-Guzik, 2018; Shi et al., 2020; Yang et al., 2020; Jin et al., 2018; Wang et al., 2020). In a GAN, the discriminator tries to distinguish generated molecular structures from those in the database, while the generator simultaneously tries to produce molecular structures similar to those in the database. In a VAE, the encoder outputs latent variables from a molecular structure given as input, and the decoder reconstructs the original molecular structure from those latent variables. However, training generative models is difficult, with problems such as poor convergence. A GAN is hard to train because of the alternating updates of its two-player game (Goodfellow et al., 2014): the two models may oscillate without converging, or the generator's parameters may stop being updated in certain situations due to mode collapse. In a VAE, the evidence lower bound (ELBO) used for training requires balancing the reconstruction loss against the KL-divergence loss on the latent variables, which can cause a phenomenon called posterior collapse.
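The ELBO mentioned above balances a reconstruction term against the KL divergence between the approximate posterior and the prior. As a minimal illustration (not the paper's implementation), both terms can be computed in closed form for a diagonal-Gaussian encoder, a standard-normal prior, and a Bernoulli decoder:

```python
import numpy as np

def gaussian_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dims."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)

def elbo(x, x_recon_logits, mu, log_var):
    """Reconstruction log-likelihood minus the KL term.

    x              : (batch, dim) binary targets
    x_recon_logits : (batch, dim) decoder outputs before the sigmoid
    """
    p = 1.0 / (1.0 + np.exp(-x_recon_logits))  # sigmoid
    recon = np.sum(x * np.log(p) + (1 - x) * np.log(1 - p), axis=-1)
    return recon - gaussian_kl(mu, log_var)

# Posterior collapse corresponds to the posterior matching the prior,
# at which point the KL term vanishes and the latent carries no information:
mu, log_var = np.zeros((1, 8)), np.zeros((1, 8))
print(float(gaussian_kl(mu, log_var)[0]))  # 0.0
```

The zero-KL case printed at the end is exactly the degenerate solution that posterior collapse drives the encoder toward.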
To prevent posterior collapse, modified objectives such as beta-VAE (Higgins et al., 2017) have been proposed. In this paper, we propose a latent optimization VAE (LOVAE) that provides a stable learning method by inserting a latent variable optimization step into the conditional VAE (cVAE) (Sohn et al., 2015). We apply two stages of latent optimization to the vanilla VAE. After training the encoder once, the latent variable is optimized in the direction that reduces the training loss on the same input data. After reparameterizing the latent variables from the updated encoder, an additional latent optimization is applied. Our proposed method, LOVAE, was evaluated on the inverse molecular design task using a drug-like molecular structure database (the ZINC dataset (Sterling & Irwin, 2015)). We show that the proposed method outperforms the vanilla VAE in terms of the reconstruction loss and the ELBO, which are training indicators, and that it also performs better in the generation phase. Consequently, molecules generated by LOVAE showed higher uniqueness, novelty, and target property satisfaction than some previous approaches, and LOVAE generated molecules with higher penalized LogP values than existing methods.
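To make the inner latent-optimization step concrete, the sketch below is a toy illustration (a fixed linear decoder with squared-error reconstruction, not the paper's actual architecture): after obtaining a latent sample, z itself is refined by a few gradient steps with the decoder held fixed, before the model parameters are updated in the alternating scheme.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 16, 4                       # data and latent dimensionality
W = rng.normal(size=(D, K)) * 0.1  # toy linear decoder: x_hat = W @ z

def recon_loss(x, z):
    return float(np.sum((x - W @ z) ** 2))

def optimize_latent(x, z, steps=20, lr=0.05):
    """Inner latent-optimization step: gradient descent on z with the
    decoder held fixed, mirroring the alternating update described above."""
    for _ in range(steps):
        grad = -2.0 * W.T @ (x - W @ z)  # d/dz of the squared error
        z = z - lr * grad
    return z

x = rng.normal(size=D)
z0 = rng.normal(size=K)            # stand-in for an encoder sample
z1 = optimize_latent(x, z0)
print(recon_loss(x, z0) > recon_loss(x, z1))  # True: refined z fits better
```

In the full method the decoder is a neural network and the refined latent then drives the encoder/decoder updates; the toy version only shows why the extra step reduces the reconstruction term for the current example.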

2. RELATED WORK

Inverse molecular design Inverse molecular design based on human knowledge is very time-consuming and relies on the intuition of the researcher, so many researchers have recently tried to address it through simulation and other methods. HTCS (Bleicher et al., 2003), one of the major simulation methods, is an automated computational process that can rapidly identify active compounds, antibodies, or genes, and its results provide starting points for many chemical tasks such as drug and material design (Chen et al., 2018; Kim et al., 2018a). HTCS uses a kind of brute-force method, searching and analyzing the desired chemical characteristics over combinations of hundreds to tens of millions of active compounds, but this is a disadvantage because of the very large amount of resources required to reach the desired goals (Phatak et al., 2009). Recently, to solve these problems, artificial intelligence approaches have been widely applied to molecular design under the name of inverse molecular design. The VAE and GAN are typical generative models, and they have been applied to inverse molecular design (Sanchez-Lengeling & Aspuru-Guzik, 2018; Jin et al., 2020; Yang et al., 2020; Simm et al., 2020). For inverse molecular design, various formats have been defined for compactly representing molecular structure information instead of raw atomic xyz coordinates. The MDL format represents 3D coordinate information together with bonding information between adjacent atoms. The Extended Connectivity Fingerprint (ECFP) (Rogers & Hahn, 2010) and the Simplified Molecular Input Line Entry System (SMILES) (Weininger et al., 1989) are other representations; SMILES encodes the molecular structure as a sequential character string. Recently, representing the molecular structure as a graph has also been researched.
Among them, SMILES, a string representation of molecules, is relatively easy to handle and has shown good performance.
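Because SMILES strings mix single-character atoms with two-character symbols such as Cl and Br, sequence models typically tokenize them before training. A minimal tokenizer, given as an illustrative sketch rather than the paper's actual preprocessing, might look like:

```python
import re

# Two-character atoms must be matched before single characters; the pattern
# below covers the organic subset plus bonds, branches, rings, and charges.
SMILES_TOKEN = re.compile(r"Cl|Br|\[[^\]]+\]|[BCNOPSFIcnops]|[=#()\d@+\-\\/%]")

def tokenize_smiles(smiles):
    """Split a SMILES string into chemically meaningful tokens."""
    tokens = SMILES_TOKEN.findall(smiles)
    # Reassembling the tokens must reproduce the input exactly,
    # otherwise some character was not recognized by the pattern.
    assert "".join(tokens) == smiles, "unrecognized character in SMILES"
    return tokens

print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin
print(tokenize_smiles("CCCl"))                   # ['C', 'C', 'Cl']
```

Treating Cl as one token rather than the two atoms C and l is exactly the kind of detail that determines whether a sequence model can emit valid SMILES.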

Generative models

Generative models such as the GAN and the VAE are very sensitive to the latent variables; in other words, how the latent variables are trained greatly affects the performance of the generative models. Depending on how the latent variables are handled, difficult problems such as posterior collapse can arise. To prevent posterior collapse, beta-VAE (Higgins et al., 2017), Re-balancing-VAE (Yan et al., 2019), and KL-annealing (Bowman et al., 2016) have been proposed.
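Of these, KL-annealing is the simplest to state: the weight beta on the KL term is ramped from 0 to 1 during training so the decoder first learns to reconstruct before the posterior is pulled toward the prior. A hypothetical linear schedule is sketched below; the schedule shape and warm-up length are our illustrative choices, not values taken from the cited papers:

```python
def kl_weight(step, warmup_steps=10_000):
    """Linear KL-annealing schedule: beta rises from 0 to 1, then stays at 1.

    The training loss is then  recon_loss + beta * kl_loss,  so early
    training is dominated by reconstruction, which discourages the
    encoder from collapsing onto the prior.
    """
    return min(1.0, step / warmup_steps)

print(kl_weight(0), kl_weight(5_000), kl_weight(20_000))  # 0.0 0.5 1.0
```

beta-VAE instead fixes beta at a constant value, and cyclical variants restart the ramp several times; all three reduce to choosing how this single weight evolves over training.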

