DE NOVO MOLECULAR GENERATION VIA CONNECTION-AWARE MOTIF MINING

Abstract

De novo molecular generation is an essential task for scientific discovery. Recently, fragment-based deep generative models have attracted much research attention due to their flexibility in generating novel molecules from existing molecule fragments. However, the motif vocabulary, i.e., the collection of frequent fragments, is usually built upon heuristic rules, which makes it difficult to capture common substructures from large numbers of molecules. In this work, we propose a new method, MiCaM, to generate molecules based on mined connection-aware motifs. Specifically, it leverages a data-driven algorithm to automatically discover motifs from a molecule library by iteratively merging subgraphs based on their frequency. The obtained motif vocabulary consists of not only molecular motifs (i.e., the frequent fragments), but also their connection information, indicating how the motifs are connected with each other. Based on the mined connection-aware motifs, MiCaM builds a connection-aware generator, which simultaneously picks motifs and determines how they are connected. We test our method on distribution-learning benchmarks (i.e., generating novel molecules to resemble the distribution of a given training set) and goal-directed benchmarks (i.e., generating molecules with target properties), and achieve significant improvements over previous fragment-based baselines. Furthermore, we demonstrate that our method can effectively mine domain-specific motifs for different tasks.

1. INTRODUCTION

Drug discovery, from designing hit compounds to developing an approved product, often takes more than ten years and billions of dollars (Hughes et al., 2011). De novo molecular generation is a fundamental task in drug discovery, as it provides novel drug candidates and determines the underlying quality of final products. Recently, with the development of artificial intelligence, deep neural networks, especially graph neural networks (GNNs), have been widely used to accelerate novel molecular generation (Stokes et al., 2020; Bilodeau et al., 2022). Specifically, we can employ a GNN to generate a molecule iteratively: in each step, given an unfinished molecule G_0, we first determine a new generation unit G_1 to be added; next, determine the connecting sites on G_0 and G_1; and finally determine the attachments between the connecting sites, e.g., creating new bonds (Liu et al., 2018) or merging shared atoms (Jin et al., 2018). In different methods, the generation units could be either atoms (Li et al., 2018; Mercado et al., 2021) or frequent fragments (referred to as motifs) (Jin et al., 2020a; Kong et al., 2021; Maziarz et al., 2021).

For fragment-based models, building an effective motif vocabulary is a key factor in the success of molecular generation (Maziarz et al., 2021). Previous works usually rely on heuristic rules or templates. However, heuristic rules cannot cover some chemical structures that occur commonly yet are slightly more complex than the pre-defined structures. For example, the subgraph patterns benzaldehyde ("O=Cc1ccccc1") and trifluoromethylbenzene ("FC(F)(F)c1ccccc1") (as shown in Figure 1(a)) occur 398,760 times and 57,545 times, respectively, in the 1.8 million molecules in ChEMBL (Mendez et al., 2019), and both of them are industrially useful. Despite their high frequency in molecules, the aforementioned methods cannot cover such common motifs. Moreover, in different concrete generation tasks, different motifs with some domain-specific structures or patterns are favorable, which can hardly be enumerated by existing rules.

Another important factor that affects the generation quality is the connection information of motifs. Although many connections are valid under a valence check, the "reasonable" connections are predetermined and reflected by the data distribution, and they contribute to the chemical properties of molecules (Yang et al., 2021). For example, a sulfur atom with two triple bonds, i.e., "*#S#*" (see Figure 1(b)), is valid under a valence check but is "unreasonable" from a chemical point of view and does not occur in ChEMBL.

In this work, we propose MiCaM, a generative model based on Mined Connection-aware Motifs. It includes a data-driven algorithm to mine a connection-aware motif vocabulary from a molecule library, as well as a connection-aware generator for de novo molecular generation. The algorithm mines the most common substructures based on their frequency of appearance in the molecule library. Briefly, across all molecules in the library, we find the most frequent pair of fragments that are adjacent in the molecular graphs, and merge each such pair into a single fragment. We repeat this process for a pre-defined number of steps and collect the fragments to build a motif vocabulary. We preserve the connection information of the obtained motifs, and thus we call them connection-aware motifs.

Based on the mined vocabulary, we design the generator to simultaneously pick motifs to be added and determine the connection mode of the motifs. In each generation step, we focus on a non-terminal connection site in the currently generated molecule, and use it to query another connection either (1) from the motif vocabulary, which implies connecting a new motif, or (2) from the current molecule, which implies cyclizing the current molecule.

We evaluate MiCaM on distribution-learning benchmarks from GuacaMol (Brown et al., 2019), which require the generated molecules to resemble the distribution of a given molecular set. We conduct experiments on three different datasets, and MiCaM achieves the best overall performance compared with several strong baselines. We also evaluate MiCaM on goal-directed benchmarks, which aim to generate molecules with specific target properties. We combine MiCaM with iterative target augmentation (Yang et al., 2020) by jointly adapting the motif vocabulary and network parameters. In this way, we achieve state-of-the-art results on four different types of goal-directed tasks, and find motif patterns that are relevant to the target properties.

2.1. CONNECTION-AWARE MOLECULAR MOTIF MINING

Our motif mining algorithm aims to find the common molecule motifs from a given training data set D, and build a connection-aware motif vocabulary for the follow-up molecular generation. It

Figure 1: (a) Two substructures that occur frequently in ChEMBL. (b) Different connection modes of sulfur atoms. The connection mode "*-S(-*)(-*)=*" is commonly seen, while "*#S#*" does not appear in ChEMBL.
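The frequency-based merging idea (repeatedly fusing the most frequent pair of adjacent fragments across the library) can be illustrated with a minimal Python sketch. This is a toy illustration under stated assumptions, not the MiCaM implementation: the names `mine_motifs`, `merge_pair`, `adjacent_pairs`, `fragment_key`, and `TOY_LIBRARY` are hypothetical, molecules are plain dictionaries of atom symbols and bonds, and a fragment is identified by its sorted atom symbols rather than by a canonical structure. The sketch also omits bond orders and the connection information that the mined motifs preserve.

```python
from collections import Counter


def fragment_key(mol, atoms):
    """Toy fragment identity: sorted atom symbols (assumption; a real
    system would need a canonical structural representation)."""
    return "".join(sorted(mol["symbols"][a] for a in atoms))


def adjacent_pairs(mol, frags):
    """Index pairs (i, j), i < j, of fragments joined by at least one bond."""
    owner = {a: i for i, frag in enumerate(frags) for a in frag}
    pairs = set()
    for u, v in mol["bonds"]:
        i, j = owner[u], owner[v]
        if i != j:
            pairs.add((min(i, j), max(i, j)))
    return pairs


def pair_key(mol, frags, i, j):
    """Order-independent identity of a pair of adjacent fragments."""
    return tuple(sorted((fragment_key(mol, frags[i]), fragment_key(mol, frags[j]))))


def merge_pair(mol, frags, target):
    """Greedily merge non-overlapping adjacent pairs whose key equals target."""
    used, merged = set(), []
    for i, j in sorted(adjacent_pairs(mol, frags)):
        if i in used or j in used:
            continue
        if pair_key(mol, frags, i, j) == target:
            merged.append(frags[i] | frags[j])
            used.update((i, j))
    return merged + [f for k, f in enumerate(frags) if k not in used]


def mine_motifs(mols, num_steps):
    """Run up to num_steps merges; return the merge list and fragment counts."""
    # Start with every atom as its own fragment.
    partitions = [[frozenset([a]) for a in range(len(m["symbols"]))] for m in mols]
    merges = []
    for _ in range(num_steps):
        counts = Counter()
        for mol, frags in zip(mols, partitions):
            for i, j in adjacent_pairs(mol, frags):
                counts[pair_key(mol, frags, i, j)] += 1
        if not counts:
            break  # nothing left to merge
        # Most frequent pair; ties broken by the lexicographically smallest key.
        best = min(counts, key=lambda k: (-counts[k], k))
        merges.append(best)
        partitions = [merge_pair(m, f, best) for m, f in zip(mols, partitions)]
    vocab = Counter(
        fragment_key(m, f) for m, frags in zip(mols, partitions) for f in frags
    )
    return merges, vocab


# Hypothetical three-molecule library (heavy atoms only, bond orders ignored).
TOY_LIBRARY = [
    {"symbols": ["C", "O"], "bonds": [(0, 1)]},
    {"symbols": ["C", "O", "N"], "bonds": [(0, 1), (1, 2)]},
    {"symbols": ["N", "C", "O"], "bonds": [(0, 1), (1, 2)]},
]

if __name__ == "__main__":
    learned_merges, vocabulary = mine_motifs(TOY_LIBRARY, num_steps=3)
    print(learned_merges)
    print(vocabulary)
```

In this toy library the C/O pair is adjacent in all three molecules, so it is merged first. The number of steps plays the role of the pre-defined merge budget mentioned above. A connection-aware variant would additionally record, for each fragment, which atoms carry bonds that leave the fragment.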

