MOLEBM: MOLECULE GENERATION AND DESIGN BY LATENT SPACE ENERGY-BASED MODELING

Anonymous authors

Abstract

Generating molecules with desired chemical and biological properties, such as high drug-likeness and high binding affinity to target proteins, is critical in drug discovery. In this paper, we propose a probabilistic generative model that captures the joint distribution of molecules and their properties. Our model assumes an energy-based model (EBM) in the latent space. Given a latent vector sampled from the latent space EBM, both the molecule and its properties are conditionally sampled via a molecule generator model and a property regression model, respectively. The EBM in a low-dimensional latent space allows our model to capture complex chemical rules implicitly but efficiently and effectively. Because chemical properties are modeled jointly, molecule design is conveniently and naturally achieved by conditionally sampling from the learned model given desired property values, in both single-objective and multi-objective optimization settings. The latent space EBM, molecule generator, and property regression model are learned jointly by approximate maximum likelihood, while property optimization is accomplished by gradually shifting the model distribution towards the region supported by molecules with high property values. Our experiments show that our model outperforms state-of-the-art models on various molecule design tasks.

1. INTRODUCTION

In drug discovery, it is of vital importance to find or design molecules with desired pharmacologic or chemical properties, such as high drug-likeness and binding affinity to a target protein. It is challenging to directly optimize or search over the drug-like molecule space, since it is discrete and enormous, with an estimated size on the order of 10^33 (Polishchuk et al., 2013). Recently, a large body of work has attempted to tackle this problem. The first line of work leverages deep generative models to map the discrete molecule space to a continuous latent space, and optimizes molecular properties in the latent space with methods like Bayesian optimization (Gómez-Bombarelli et al., 2018; Kusner et al., 2017; Jin et al., 2018). The second line of work recruits reinforcement learning algorithms to optimize properties directly in the molecular graph space (You et al., 2018; De Cao & Kipf, 2018; Zhou et al., 2019; Shi et al., 2020; Luo et al., 2021). A number of other efforts optimize molecular properties with genetic algorithms (Nigam et al., 2020), particle-swarm algorithms (Winter et al., 2019), and specialized MCMC methods (Xie et al., 2021). In this work, we propose a method along the first line mentioned above, learning a probabilistic latent generative model of the molecule distribution and optimizing chemical properties in the latent space. Given the central role of latent variables in this approach, we emphasize that it is critical to learn a latent space model that captures the data regularities of the molecules.
Thus, instead of assuming a simple Gaussian distribution in the latent space as in prior work (Gómez-Bombarelli et al., 2018; Jin et al., 2018), we assume a flexible and expressive energy-based model (EBM) (LeCun et al., 2006; Ngiam et al., 2011; Kim & Bengio, 2016; Xie et al., 2016; Kumar et al., 2019; Nijkamp et al., 2019; Du & Mordatch, 2019; Grathwohl et al., 2019; Finn et al., 2016) in the latent space. This leads to a latent space energy-based model (LSEBM), as studied in Pang et al. (2020) and Nie et al. (2021), where the LSEBM has been shown to model the distributions of natural images and text well. For molecule modeling, even without any explicit validity constraints in generation, our model generates molecules with high validity using the simple SMILES representation (Weininger, 1988).
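The three components described above can be sketched as follows. This is a minimal illustration of the general LSEBM structure, not the authors' implementation: the network sizes, the Gaussian-tilted form of the EBM prior, and the short-run Langevin sampler are all assumptions for the sake of a runnable example.

```python
import torch
import torch.nn as nn

LATENT_DIM, VOCAB_SIZE, MAX_LEN = 64, 40, 100  # hypothetical dimensions

class EnergyPrior(nn.Module):
    """Scalar energy f(z); prior assumed proportional to exp(-f(z)) * N(z; 0, I)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(LATENT_DIM, 200), nn.GELU(),
                                 nn.Linear(200, 1))
    def forward(self, z):
        return self.net(z).squeeze(-1)

class Generator(nn.Module):
    """Maps z to per-position logits over SMILES tokens (toy stand-in for a
    sequence decoder such as an RNN)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(LATENT_DIM, MAX_LEN * VOCAB_SIZE)
    def forward(self, z):
        return self.net(z).view(-1, MAX_LEN, VOCAB_SIZE)

class Regressor(nn.Module):
    """Predicts a property value y from z; p(y|z) would be Gaussian around this mean."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(LATENT_DIM, 200), nn.GELU(),
                                 nn.Linear(200, 1))
    def forward(self, z):
        return self.net(z).squeeze(-1)

def sample_prior(energy, n=16, steps=20, step_size=0.1):
    """Short-run Langevin dynamics targeting the EBM prior (a common sampler
    choice for latent space EBMs)."""
    z = torch.randn(n, LATENT_DIM)
    for _ in range(steps):
        z = z.detach().requires_grad_(True)
        # Negative log-density of exp(-f(z)) N(z; 0, I), up to a constant.
        neg_logp = energy(z).sum() + 0.5 * (z ** 2).sum()
        grad, = torch.autograd.grad(neg_logp, z)
        z = z - 0.5 * step_size ** 2 * grad + step_size * torch.randn_like(z)
    return z.detach()

z = sample_prior(EnergyPrior())   # latent vectors from the EBM prior
logits = Generator()(z)           # molecule token logits, shape (16, MAX_LEN, VOCAB_SIZE)
y_hat = Regressor()(z)            # predicted property values, shape (16,)
```

Because the generator and regressor share the same latent vector z, conditioning on a desired property value corresponds to steering z, which is what makes design-by-conditional-sampling natural in this family of models.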

