MOLEBM: MOLECULE GENERATION AND DESIGN BY LATENT SPACE ENERGY-BASED MODELING

Anonymous

Abstract

Generating molecules with desired chemical and biological properties, such as high drug-likeness and high binding affinity to target proteins, is critical in drug discovery. In this paper, we propose a probabilistic generative model to capture the joint distribution of molecules and their properties. Our model assumes an energy-based model (EBM) in the latent space. Given a latent vector sampled from the latent space EBM, both molecules and molecular properties are conditionally sampled via a molecule generator model and a property regression model, respectively. The EBM in a low-dimensional latent space allows our model to capture complex chemical rules implicitly but efficiently and effectively. Due to the joint modeling of chemical properties, molecule design can be conveniently and naturally achieved by conditional sampling from our learned model given desired properties, in both single-objective and multi-objective optimization settings. The latent space EBM, molecule generator, and property regression model are learned jointly by approximate maximum likelihood, while optimization of properties is accomplished by gradually shifting the model distribution towards the region supported by molecules with high property values. Our experiments show that our model outperforms state-of-the-art models on various molecule design tasks.

1. INTRODUCTION

In drug discovery, it is of vital importance to find or design molecules with desired pharmacologic or chemical properties such as high drug-likeness and binding affinity to a target protein. It is challenging to directly optimize or search over the drug-like molecule space since it is discrete and enormous, with an estimated size on the order of 10^33 (Polishchuk et al., 2013). Recently, a large body of work has attempted to tackle this problem. The first line of work leverages deep generative models to map the discrete molecule space to a continuous latent space, and optimizes molecular properties in the latent space with methods like Bayesian optimization (Gómez-Bombarelli et al., 2018; Kusner et al., 2017; Jin et al., 2018). The second line of work recruits reinforcement learning algorithms to optimize properties in the molecular graph space directly (You et al., 2018; De Cao & Kipf, 2018; Zhou et al., 2019; Shi et al., 2020; Luo et al., 2021). A number of other efforts have been made to optimize molecular properties with genetic algorithms (Nigam et al., 2020), particle-swarm algorithms (Winter et al., 2019), and specialized MCMC methods (Xie et al., 2021). In this work, we propose a method along the first line mentioned above, by learning a probabilistic latent generative model of molecule distributions and optimizing chemical properties in the latent space. Given the central role of latent variables in this approach, we emphasize that it is critical to learn a latent space model that captures the data regularities of the molecules.
Thus, instead of assuming a simple Gaussian distribution in the latent space as in prior work (Gómez-Bombarelli et al., 2018; Jin et al., 2018), we assume a flexible and expressive energy-based model (EBM) (LeCun et al., 2006; Ngiam et al., 2011; Kim & Bengio, 2016; Xie et al., 2016; Kumar et al., 2019; Nijkamp et al., 2019; Du & Mordatch, 2019; Grathwohl et al., 2019; Finn et al., 2016) in the latent space. For molecule modeling, without any explicit validity constraints in generation, our model generates molecules with high validity using the simple SMILES representation (Weininger, 1988). Given our goal of property optimization, we learn a joint distribution of molecules and their properties. Our model consists of 1) an EBM in a low-dimensional continuous latent space, 2) a generator mapping from the latent space to the observed molecule space, and 3) a property regression model mapping from the latent space to the property values (see Figure 1). We call our model MolEBM. All three components of our model are learned jointly by approximate maximum likelihood. A learned model generates a molecule with a high property value in two steps: 1) given the property value, sample the latent vector; 2) given the sampled latent vector, generate a molecule (see the top-to-bottom path in Figure 1a). Since the learned model approximates the data distribution, directly sampling from it conditional on a high property value does not work well, because a molecule with a high property value most likely lies outside the original data distribution. We thus design a method to gradually shift the learned distribution towards the region supported by molecules with high property values, and sample molecules with desirable properties from the shifted distribution. In drug discovery, we most often need to consider multiple properties simultaneously. Our model extends to this setting straightforwardly.
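To illustrate the first step of the two-step generation, the sketch below runs short-run Langevin dynamics to draw a latent vector from an EBM prior. This is a minimal toy example: the closed-form energy, its dimension, and step settings are our assumptions for illustration; in the paper the energy is a learned neural network on z.

```python
import numpy as np

rng = np.random.default_rng(0)

def energy(z):
    # Toy stand-in for the learned latent-space energy function;
    # the actual model parameterizes this with a neural network.
    return 0.5 * np.sum(z ** 2) + np.sum(np.cos(z))

def grad_energy(z):
    # Analytic gradient of the toy energy above.
    return z - np.sin(z)

def sample_prior(dim=16, steps=100, step_size=0.1):
    # Short-run Langevin dynamics targeting p(z) ∝ exp(-energy(z)),
    # initialized from a standard Gaussian.
    z = rng.standard_normal(dim)
    for _ in range(steps):
        z = (z - 0.5 * step_size ** 2 * grad_energy(z)
             + step_size * rng.standard_normal(dim))
    return z

z = sample_prior()
# Step 2 (not shown): pass z to the generator p(x|z) to decode a SMILES
# string, and to the regressor p(y|z) to read off the property value.
```

The sampled z would then be decoded by the generator network; the decoding step is omitted here since it depends on the learned SMILES decoder.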
With our method, we only need to add a regression model for each property, while the learning and sampling methods remain the same. Learning the model involves inferring the latent vector given both the molecule and the property value, and we recruit Langevin dynamics instead of an amortized inference network for this inference computation. This design makes our approach versatile in dealing with a varying number of properties. We evaluate our method in various settings, including single-objective and multi-objective optimization. Our method outperforms prior methods by significant margins. In summary, our contributions are as follows:
• We propose to learn a latent space energy-based model for the joint distribution of molecules and molecular properties.
• We develop a sampling method with gradual distribution shifting, enabling us to extrapolate beyond the data distribution and sample from the region supported by molecules with high property values.
• Our method is versatile enough to be extended to optimizing multiple properties jointly.
• Our model achieves state-of-the-art performance on a wide range of molecule optimization tasks.
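The Langevin-based posterior inference mentioned above can be sketched as follows. This is a toy example under strong simplifying assumptions: we replace the generator and regressor with linear-Gaussian stand-ins (the matrix W and vector v are made up) so that the posterior gradient has a closed form; in the paper these gradients come from backpropagating through the learned networks.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8

# Toy linear-Gaussian stand-ins for the generator p(x|z) and the
# property regressor p(y|z); these are illustrative, not the paper's.
W = rng.standard_normal((4, dim)) * 0.1   # "generator" weights
v = rng.standard_normal(dim) * 0.1        # "regressor" weights
x_obs = rng.standard_normal(4)            # observed molecule encoding
y_obs = 0.8                               # observed property value

def grad_log_posterior(z):
    # d/dz [log p(z) + log p(x|z) + log p(y|z)] for the Gaussian toys.
    g_prior = -z                       # standard-normal base of the prior
    g_lik_x = W.T @ (x_obs - W @ z)    # Gaussian p(x|z) with mean W z
    g_lik_y = v * (y_obs - v @ z)      # Gaussian p(y|z) with mean v·z
    return g_prior + g_lik_x + g_lik_y

def infer_posterior(steps=200, step_size=0.05):
    # Langevin dynamics ascending the log posterior over z given (x, y).
    z = rng.standard_normal(dim)
    for _ in range(steps):
        z = (z + 0.5 * step_size ** 2 * grad_log_posterior(z)
             + step_size * rng.standard_normal(dim))
    return z

z_post = infer_posterior()
```

Because inference is iterative rather than amortized, adding a property only adds one more gradient term to `grad_log_posterior`, which is the versatility the text refers to.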

2. RELATED WORK

Optimization with Generative Models. Deep generative models approximate the distribution of molecules with desired biological or non-biological properties. Existing approaches for generating molecules include applying variational autoencoders (VAE) (Kingma & Welling, 2014), generative adversarial networks (GAN) (Goodfellow et al., 2014), and related models to molecule data (Gómez-Bombarelli et al., 2018; Jin et al., 2018; De Cao & Kipf, 2018; Honda et al., 2019; Madhawa et al., 2019; Shi et al., 2020; Zang & Wang, 2020; Kotsias et al., 2020; Chen et al., 2021; Fu et al., 2020; Liu et al., 2021; Bagal et al., 2021; Eckmann et al., 2022; Segler et al., 2018). After learning continuous representations of molecules, these methods can further optimize molecules in the latent space. Segler et al. (2018) propose to optimize by simulating design-synthesis-test cycles. Gómez-Bombarelli et al. (2018), Jin et al. (2018), and Eckmann et al. (2022) propose to learn a surrogate function to predict properties, and then use Bayesian optimization to optimize the latent vectors. However, the performance of this latent optimization is unsatisfactory due to three major issues. First, it is difficult to train an accurate surrogate predictor, especially for novel molecules with high property values along the design trajectories. Second, as the learned latent space tries to cover the fixed data space, its ability to explore targets outside the distribution is limited (Brown et al., 2019; Huang et al., 2021). Third, these methods depend heavily on the quality of the learned latent space, which requires non-trivial effort in encoder design when dealing with multiple properties. To address the above issues, Eckmann et al. (2022) use a VAE to learn the latent space, train predictors separately using generated molecules, and then leverage latent inceptionism, which involves only the decoder, to optimize the latent vector with multiple predictors.
In this paper, we propose an encoder-free model, for both training and optimization, that learns the joint distribution of molecules and properties and makes it possible to obtain adequate predictors. We then design an efficient algorithm to shift the learned distribution iteratively.



in latent space. This leads to a latent space energy-based model (LSEBM) as studied in Pang et al. (2020) and Nie et al. (2021), where LSEBMs have been shown to model the distributions of natural images and text well.

