RETRIEVAL-BASED CONTROLLABLE MOLECULE GENERATION

Abstract

Generating new molecules with specified chemical and biological properties via generative models has emerged as a promising direction for drug discovery. However, existing methods require extensive training or fine-tuning on a large dataset, which is often unavailable in real-world generation tasks. In this work, we propose a new retrieval-based framework for controllable molecule generation. We use a small set of exemplar molecules, i.e., those that (partially) satisfy the design criteria, to steer a pre-trained generative model towards synthesizing molecules that satisfy the given design criteria. We design a retrieval mechanism that retrieves and fuses the exemplar molecules with the input molecule, and we train it with a new self-supervised objective that predicts the nearest neighbor of the input molecule. We also propose an iterative refinement process that dynamically updates the generated molecules and the retrieval database for better generalization. Our approach is agnostic to the choice of generative model and requires no task-specific fine-tuning. On various tasks, ranging from simple design criteria to a challenging real-world scenario of designing lead compounds that bind to the SARS-CoV-2 main protease, we demonstrate that our approach extrapolates well beyond the retrieval database and achieves better performance and wider applicability than previous methods.

1. INTRODUCTION

Drug discovery is a complex, multi-objective problem (Vamathevan et al., 2019). For a drug to be safe and effective, the molecular entity must interact favorably with the desired target (Parenti and Rastelli, 2012), possess favorable physicochemical properties such as solubility (Meanwell, 2011), and be readily synthesizable (Jiménez-Luna et al., 2021). Compounding the challenge is the massive search space (up to 10^60 molecules; Polishchuk et al., 2013). Previous efforts address this challenge via high-throughput virtual screening (HTVS) techniques (Walters et al., 1998) by searching against existing molecular databases. Combinatorial approaches have also been proposed to enumerate molecules beyond the space of established drug-like molecule datasets. For example, genetic-algorithm (GA) based methods (Sliwoski et al., 2013; Jensen, 2019; Yoshikawa et al., 2018) explore potential new drug candidates via heuristics such as hand-crafted rules and random mutations. Although widely adopted in practice, these methods tend to be inefficient and computationally expensive due to the vast chemical search space (Hoffman et al., 2021). The performance of these combinatorial approaches also heavily depends on the quality of generation rules, which often require task-specific engineering expertise and may limit the diversity of the generated molecules. To this end, recent research focuses on learning to controllably synthesize molecules with generative models (Tang et al., 2021; Chen, 2021; Walters and Murcko, 2020). It usually involves first training an unconditional generative model from millions of existing molecules (Winter et al., 2019a; Irwin et al., 2022) and then controlling the generative models to synthesize new desired molecules that



* Work done during an internship at NVIDIA. † The first two authors contributed equally to this paper.

