DE NOVO MOLECULAR GENERATION VIA CONNECTION-AWARE MOTIF MINING

Abstract

De novo molecular generation is an essential task for scientific discovery. Recently, fragment-based deep generative models have attracted much research attention due to their flexibility in generating novel molecules based on existing molecule fragments. However, the motif vocabulary, i.e., the collection of frequent fragments, is usually built upon heuristic rules, which makes it difficult to capture common substructures from large amounts of molecules. In this work, we propose a new method, MiCaM, to generate molecules based on mined connection-aware motifs. Specifically, it leverages a data-driven algorithm to automatically discover motifs from a molecule library by iteratively merging subgraphs based on their frequency. The obtained motif vocabulary consists of not only molecular motifs (i.e., the frequent fragments), but also their connection information, indicating how the motifs are connected with each other. Based on the mined connection-aware motifs, MiCaM builds a connection-aware generator, which simultaneously picks up motifs and determines how they are connected. We test our method on distribution-learning benchmarks (i.e., generating novel molecules to resemble the distribution of a given training set) and goal-directed benchmarks (i.e., generating molecules with target properties), and achieve significant improvements over previous fragment-based baselines. Furthermore, we demonstrate that our method can effectively mine domain-specific motifs for different tasks.

1. INTRODUCTION

Drug discovery, from designing hit compounds to developing an approved product, often takes more than ten years and billions of dollars (Hughes et al., 2011). De novo molecular generation is a fundamental task in drug discovery, as it provides novel drug candidates and determines the underlying quality of final products. Recently, with the development of artificial intelligence, deep neural networks, especially graph neural networks (GNNs), have been widely used to accelerate novel molecular generation (Stokes et al., 2020; Bilodeau et al., 2022). Specifically, we can employ a GNN to generate a molecule iteratively: in each step, given an unfinished molecule $G_0$, we first determine a new generation unit $G_1$ to be added; next, we determine the connecting sites on $G_0$ and $G_1$; and finally we determine the attachments between the connecting sites, e.g., by creating new bonds (Liu et al., 2018) or merging shared atoms (Jin et al., 2018). Depending on the method, the generation units are either atoms (Li et al., 2018; Mercado et al., 2021) or frequent fragments (referred to as motifs) (Jin et al., 2020a; Kong et al., 2021; Maziarz et al., 2021).

For fragment-based models, building an effective motif vocabulary is a key factor in the success of molecular generation (Maziarz et al., 2021). Previous works usually rely on heuristic rules or templates to obtain a motif vocabulary. For example, JT-VAE (Jin et al., 2018) decomposes molecules into pre-defined structures like rings, chemical bonds, and individual atoms, while MoLeR (Maziarz et al., 2021) separates molecules into ring systems and acyclic linkers or functional groups. Several other works build motif vocabularies in a similar manner (Jin et al., 2020a; Yang et al., 2021). However, heuristic rules cannot cover some chemical structures that occur commonly yet are slightly more complex than the pre-defined structures.
For example, the subgraph patterns benzaldehyde ("O=Cc1ccccc1") and trifluoromethylbenzene ("FC(F)(F)c1ccccc1") (as shown in Figure 1(a)) occur 398,760 times and 57,545 times, respectively, in the 1.8 million molecules in ChEMBL (Mendez et al., 2019), and both of them are industrially useful. Despite their high frequency in molecules, the aforementioned methods cannot cover such common motifs. Moreover, different concrete generation tasks favor different motifs with domain-specific structures or patterns, which can hardly be enumerated by existing rules.

Another important factor that affects the generation quality is the connection information of motifs. This is because although many connections are valid under a valence check, the "reasonable" connections are predetermined and reflected by the data distribution, which contribute to the chemical properties of molecules (Yang et al., 2021). For example, a sulfur atom with two triple bonds, i.e., "*#S#*" (see Figure 1(b)), is valid under a valence check but is "unreasonable" from a chemical point of view and does not occur in ChEMBL.

In this work, we propose MiCaM, a generative model based on Mined Connection-aware Motifs. It includes a data-driven algorithm to mine a connection-aware motif vocabulary from a molecule library, as well as a connection-aware generator for de novo molecular generation. The algorithm mines the most common substructures based on their frequency of appearance in the molecule library. Briefly, across all molecules in the library, we find the most frequent pair of fragments that are adjacent in the graphs, and merge each such pair into a single fragment. We repeat this process for a pre-defined number of steps and collect the fragments to build a motif vocabulary. We preserve the connection information of the obtained motifs, and thus we call them connection-aware motifs.
Based on the mined vocabulary, we design the generator to simultaneously pick up motifs to be added and determine the connection mode of the motifs. In each generation step, we focus on a non-terminal connection site in the currently generated molecule, and use it to query another connection site either (1) from the motif vocabulary, which implies connecting a new motif, or (2) from the current molecule, which implies cyclizing the current molecule.

We evaluate MiCaM on distribution learning benchmarks from GuacaMol (Brown et al., 2019), which assess how well generated molecules resemble the distributions of given molecular sets. We conduct experiments on three different datasets, and MiCaM achieves the best overall performance compared with several strong baselines. We also evaluate on goal directed benchmarks, which aim to generate molecules with specific target properties. We combine MiCaM with iterative target augmentation (Yang et al., 2020) by jointly adapting the motif vocabulary and network parameters. In this way, we achieve state-of-the-art results on four different types of goal-directed tasks, and find motif patterns that are relevant to the target properties.

2.1. CONNECTION-AWARE MOLECULAR MOTIF MINING

Our motif mining algorithm aims to find the common molecule motifs from a given training data set $D$, and build a connection-aware motif vocabulary for the follow-up molecular generation. It processes $D$ with two phases: the merging-operation learning phase and the motif-vocabulary construction phase. Figure 2 presents an example and more implementation details are in Algorithm 1 in Appendix A.1.

Figure 2: (a) Merging-operation Learning Phase. The merging graphs are initialized the same as the molecular graphs. In the first iteration, "c:c" (marked in blue) is the most frequent pattern. We merge the patterns "c:c" in all the molecules to update the merging graphs. In the second iteration, "c:c:c:c" (marked in yellow) is the most frequent pattern, so we merge such patterns and update the merging graphs. We repeat this process for 3 iterations and record the merging operations in order. (b) Motif-vocabulary Construction Phase. We apply the recorded merging operations sequentially on all molecules. Then the two molecules are fragmentized into motifs. We break the bonds between different motifs while preserving the broken bonds. In this way we construct a connection-aware motif vocabulary as shown in (c).

Merging-operation Learning Phase

In this phase, we aim to learn the top $K$ most common patterns (which correspond to rules indicating how to merge subgraphs) from the training data $D$, where $K$ is a hyperparameter. Each molecule in $D$ is represented as a graph $G(V, E)$ (the first row in Figure 2(a)), where the nodes $V$ and edges $E$ denote atoms and bonds, respectively. For each $G(V, E) \in D$, we use a merging graph $G_M(V_M, E_M)$ (the second row in Figure 2(a)) to track the merging status, i.e., to represent the fragments and their connections. In $G_M(V_M, E_M)$, each node $F \in V_M$ represents a fragment (either an atom or a subgraph) of the molecule, and the edges in $E_M$ indicate whether two fragments are connected with each other. We initialize each merging graph from the molecule graph by treating each atom as a single fragment and inheriting the bond connections from $G$, i.e., $G_M^{(0)}(V_M^{(0)}, E_M^{(0)}) = G(V, E)$.

We define an operation "$\oplus$" to create a new fragment $F_{ij} = F_i \oplus F_j$ by merging two fragments $F_i$ and $F_j$ together. The newly obtained $F_{ij}$ contains all nodes and edges from $F_i$ and $F_j$, as well as all edges between them.

We iteratively update the merging graphs to learn merging operations. In the merging graph $G_M^{(k)}(V_M^{(k)}, E_M^{(k)})$ at the $k$-th iteration ($k = 0, \cdots, K-1$), each edge represents a pair of fragments $(F_i, F_j)$ that are adjacent in the molecule, and thus yields a candidate new fragment $F_{ij} = F_i \oplus F_j$. We traverse all edges $(F_i, F_j) \in E_M^{(k)}$ in all merging graphs $G_M^{(k)}$ to count the frequency of $F_{ij} = F_i \oplus F_j$, and denote the most frequent $F_{ij}$ as $M^{(k)}$. Consequently, the $k$-th merging operation is defined as: if $F_i \oplus F_j = M^{(k)}$, then merge $F_i$ and $F_j$ together. We apply the merging operation on all merging graphs to update them into $G_M^{(k+1)}(V_M^{(k+1)}, E_M^{(k+1)})$. We repeat this process for $K$ iterations to obtain a merging operation sequence $\{M^{(k)}\}_{k=0}^{K-1}$.
Figure 3: The Generation Steps. (a) The generation procedure is based on the connection-aware motif vocabulary. We obtain the graph representations $h_{F^*}$ (yellow) and the node representations $h_v$ (dark blue) via $\mathrm{GNN}_{\mathrm{motif}}$. (b) In the $t$-th generation step, we obtain the graph representation $h_{G_t}$ (dark green) and node representations $h_v$ (light blue) via $\mathrm{GNN}_{\mathrm{pmol}}$. We focus on a connection site $v_t$ and use $[z, h_{G_t}, h_{v_t}]$ to query another connection site either from the motif vocabulary (which implies adding a motif) or from the partial molecule (which implies cyclizing). The correct choices at each step are marked by red boxes.

Motif-vocabulary Construction Phase

For each molecule $G(V, E) \in D$, we apply the merging operations sequentially to obtain the final merging graph $G_M(V_M, E_M) = G_M^{(K)}(V_M^{(K)}, E_M^{(K)})$. We then disconnect all edges between different fragments and add the symbol "*" at the disconnected positions (see Figure 2(b, c)). The fragments with "*" symbols are connection-aware, and we denote the connection-aware version of a fragment $F$ as $F^*$. The motif vocabulary is the collection of all such connection-aware motifs: $\mathrm{Vocab} = \bigcup_{G_M(V_M, E_M) \in D} \{F^* : F \in V_M\}$. During generation, our model connects the motifs together by directly merging the connection sites (i.e., the "*"s) to generate new molecules.
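
The bond-breaking step described above can be illustrated with a small sketch. It is not the authors' implementation: the graph encoding (a list of atom symbols plus a dictionary of bonds) and the function name `cut_into_motifs` are assumptions made purely for illustration. Given a molecule and its final fragment partition, the sketch breaks every bond between different fragments and leaves a `*` dummy atom on each side, keeping the bond order.

```python
def cut_into_motifs(atoms, bonds, fragments):
    """Split a molecule into connection-aware motifs (illustrative encoding).

    atoms: list of element symbols; bonds: {(i, j): order} with i < j;
    fragments: list of sets of atom indices that partition the molecule.
    Returns one (labels, bond_list) pair per fragment, where every broken
    bond is kept as a bond to a '*' dummy atom (the connection site).
    """
    owner = {a: k for k, frag in enumerate(fragments) for a in frag}
    motifs = []
    for k, frag in enumerate(fragments):
        local = {a: n for n, a in enumerate(sorted(frag))}  # re-index atoms
        labels = [atoms[a] for a in sorted(frag)]
        motif_bonds = []
        for (i, j), order in sorted(bonds.items()):
            if owner[i] == k and owner[j] == k:
                # Bond fully inside the fragment: keep it as is.
                motif_bonds.append((local[i], local[j], order))
            elif owner[i] == k or owner[j] == k:
                # Bond crossing the fragment boundary: replace the outside
                # atom by a '*' dummy atom and preserve the bond order.
                inside = i if owner[i] == k else j
                labels.append("*")
                motif_bonds.append((local[inside], len(labels) - 1, order))
        motifs.append((labels, motif_bonds))
    return motifs


# "CCN" split into the fragments {C} and {CN}.
motifs = cut_into_motifs(["C", "C", "N"], {(0, 1): 1, (1, 2): 1}, [{0}, {1, 2}])
```

In this toy case the first motif is a carbon with one connection site and the second is a C-N fragment whose carbon carries one connection site; a real pipeline would additionally canonicalize the motifs so that identical ones collapse to a single vocabulary entry.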

Time Complexity

The time complexity of learning merging operations is $O(K|D|e)$, where $K$ is the number of iterations, $|D|$ is the molecule library size, and $e = \max_{G(V,E) \in D} |E|$. In practice, the time cost decreases rapidly as the iteration step $k$ increases. It takes less than 10 minutes and about 90 minutes to run 3,000 iterations on the QM9 (~133K molecules) (Ruddigkeit et al., 2012) and ZINC (~219K molecules) (Irwin et al., 2012) datasets, respectively, using 6 CPU cores (see Appendix A.1). The time complexity of fragmentizing an arbitrary molecule $G(V, E)$ into motifs is $O(K|E|)$, i.e., linear in the number of bonds.

2.2. MOLECULAR GENERATION WITH CONNECTION-AWARE MOTIFS

The generation procedure of MiCaM is shown in Figure 3 and Algorithm 2 in Appendix A.2. MiCaM generates molecules by gradually adding new motifs to a current partial molecule (denoted as $G_t$, where $t$ is the generation step), or merging two connection sites in $G_t$ to form a new ring. For ease of reference, we define the following notation. We denote the node representation of any atom $v$ (including the connection sites) as $h_v$, and denote the graph representation of any graph $G$ (either a molecule or a motif) as $h_G$. Let $C_G$ denote the connection sites from a graph $G$ (either a partial molecule or a motif), and let $C_{\mathrm{Vocab}} = \bigcup_{F^* \in \mathrm{Vocab}} C_{F^*}$ be the set of all connection sites from the motif vocabulary.

Generation Steps

In the $t$-th generation step, MiCaM modifies the partial molecule $G_t$ as follows:

(1) Focus on a connection site $v_t$ from $G_t$. We use a queue $Q$ to manage the order of the connection sites in $C_{G_t}$. At the $t$-th step, we pop the head of $Q$ to get a connection site $v_t$. The queue $Q$ is maintained as follows: after selecting a new motif $F^*$ to connect with $G_t$, we use the RDKit library to give a canonical order of the atoms in $F^*$, and then put the connection sites into $Q$ following this order. After merging two connection sites together, we simply remove them from $Q$.

(2) Encode $v_t$ and candidate connections. We employ a graph neural network $\mathrm{GNN}_{\mathrm{pmol}}$ to encode the partial molecule $G_t$ and obtain the representations of all atoms (including the connection sites) and the graph. The node representation $h_{v_t}$ of $v_t$ and the graph representation $h_{G_t}$ of $G_t$ will be jointly used to query another connection site. For the motifs $F^* \in \mathrm{Vocab}$, we use another GNN, denoted as $\mathrm{GNN}_{\mathrm{motif}}$, to encode their atoms and connection sites. In this way we obtain the node representations $h_v$ of all connection sites $v \in C_{\mathrm{Vocab}}$. The candidate connections are either from $C_{\mathrm{Vocab}}$ or from $C_{G_t} \setminus \{v_t\}$.

(3) Query another connection site.
We employ two neural networks, $\mathrm{NN}_{\mathrm{query}}$ to make a query vector and $\mathrm{NN}_{\mathrm{key}}$ to make key vectors. Specifically, the probability $P_v$ of picking each connection site is calculated by:

$$P_v = \operatorname*{softmax}_{v \in C_{\mathrm{Vocab}} \cup C_{G_t} \setminus \{v_t\}} \left( \mathrm{NN}_{\mathrm{query}}([z, h_{G_t}, h_{v_t}]) \cdot \mathrm{NN}_{\mathrm{key}}(h_v) \right), \tag{1}$$

where $z$ is a latent vector as used in the variational auto-encoder (VAE) (Kingma & Welling, 2013). Using different $z$ results in diverse molecules. During training, $z$ is sampled from a posterior distribution given by an encoder, while during inference, $z$ is sampled from a prior distribution. For inference, we constrain the bond type of the picked $v$ by only considering the connection sites whose adjacent edges have the same bond type as that of $v_t$. This practice guarantees the validity of the generated molecules. We also implement two generation modes: the greedy mode, which picks the connection as $\arg\max P_v$, and the distributional mode, which samples the connection from $P_v$.

(4) Connect a new motif or cyclize. After the model picks a connection $v$, it enters a connecting phase or a cyclizing phase, depending on whether $v \in C_{\mathrm{Vocab}}$ or $v \in C_{G_t}$. If $v \in C_{\mathrm{Vocab}}$, and supposing that $v \in C_{F^*}$, we connect $F^*$ with $G_t$ by directly merging $v_t$ and $v$. Otherwise, when $v \in C_{G_t}$, we merge $v_t$ and $v$ together to form a new ring, and thus the molecule cyclizes itself. Notice that allowing the picked connection site to come from the partial molecule is important, because with this mechanism MiCaM can theoretically generate novel rings that are not in the motif vocabulary.

We repeat these steps until $Q$ is empty, i.e., until there is no non-terminal connection site in the partial molecule, which indicates that we have generated an entire molecule.

Starting

Since the partial graph is empty at the beginning (i.e., the $0$-th step), we handle this step separately. Specifically, we use another neural network $\mathrm{NN}_{\mathrm{start}}$ to pick the first motif from the vocabulary as $G_0$.
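The probability $P_{F^*}$ of picking each motif is calculated by:

$$P_{F^*} = \operatorname*{softmax}_{F^* \in \mathrm{Vocab}} \left( \mathrm{NN}_{\mathrm{start}}(z) \cdot \mathrm{NN}_{\mathrm{key}}(h_{F^*}) \right), \quad h_{F^*} = \mathrm{GNN}_{\mathrm{motif}}(F^*). \tag{2}$$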
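
The query step can be sketched in a few lines of plain Python. This is a hedged illustration, not the released code: the function name `pick_connection`, the tuple layout of the candidates, and the string bond types are all assumptions, and the query and key vectors stand in for the outputs of $\mathrm{NN}_{\mathrm{query}}$ and $\mathrm{NN}_{\mathrm{key}}$. The sketch applies the bond-type constraint as a mask, computes a softmax over dot products, and supports the greedy and distributional modes.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def pick_connection(query, candidates, focus_bond, mode="greedy", rng=random):
    """Pick a connection site by softmax over query-key dot products.

    query: query vector (stands in for NN_query([z, h_G, h_v_t]));
    candidates: list of (name, key_vector, bond_type) with unique names,
    where the key vector stands in for NN_key(h_v);
    focus_bond: bond type at the focused site v_t, used as a validity mask.
    Returns (chosen name, {name: probability}).
    """
    allowed = [c for c in candidates if c[2] == focus_bond]
    if not allowed:
        raise ValueError("no candidate with a matching bond type")
    logits = [dot(query, key) for _, key, _ in allowed]
    shift = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(value - shift) for value in logits]
    total = sum(weights)
    probs = {name: w / total for (name, _, _), w in zip(allowed, weights)}
    names = list(probs)
    if mode == "greedy":
        choice = max(names, key=probs.get)
    else:  # distributional mode: sample from the softmax distribution
        choice = rng.choices(names, weights=[probs[n] for n in names])[0]
    return choice, probs


candidates = [
    ("motif_a", [1.0, 0.0], "single"),
    ("motif_b", [0.0, 1.0], "single"),
    ("ring_close", [5.0, 5.0], "double"),  # masked out: wrong bond type
]
choice, probs = pick_connection([2.0, 1.0], candidates, "single")
```

Masking before the softmax, rather than after, keeps the remaining probabilities normalized, so both generation modes operate on a proper distribution over valid sites only.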

2.3. TRAINING MICAM

We train our model in a VAE (Kingma & Welling, 2013) paradigm. A standard VAE has an encoder and a decoder. The encoder maps the input molecule $G$ to its representation $h_G$, and then builds a posterior distribution of the latent vector $z$ based on $h_G$. The decoder takes $z$ as input and tries to reconstruct $G$. A VAE usually has a reconstruction loss term (between the original input $G$ and the reconstructed $\hat{G}$) and a regularization term (to control the posterior distribution of $z$).

In our work, we use a GNN model $\mathrm{GNN}_{\mathrm{mol}}$ as the encoder to encode a molecule $G$ and obtain its representation $h_G = \mathrm{GNN}_{\mathrm{mol}}(G)$. The latent vector $z$ is then sampled from a posterior distribution $q(\cdot|G) = \mathcal{N}(\mu(h_G), \exp(\Sigma(h_G)))$, where $\mu$ and $\Sigma$ output the mean and log variance, respectively. How $z$ is used is explained in Equations (1) and (2). The decoder consists of $\mathrm{GNN}_{\mathrm{pmol}}$, $\mathrm{GNN}_{\mathrm{motif}}$, $\mathrm{NN}_{\mathrm{query}}$, $\mathrm{NN}_{\mathrm{key}}$ and $\mathrm{NN}_{\mathrm{start}}$, which jointly work to generate molecules.

The overall training objective function is defined as:

$$\mathbb{E}_{G \sim D}[\mathcal{L}(G)] = \mathbb{E}_{G \sim D}\left[\mathcal{L}_{\mathrm{rec}}(G) + \beta_{\mathrm{prior}} \cdot \mathcal{L}_{\mathrm{prior}}(G) + \beta_{\mathrm{prop}} \cdot \mathcal{L}_{\mathrm{prop}}(G)\right]. \tag{3}$$

In Equation (3): (1) $\mathcal{L}_{\mathrm{rec}}(G)$ is the reconstruction loss as in a standard VAE. It uses a cross entropy loss to evaluate the likelihood of the reconstructed graph compared with the input $G$. (2) The loss $\mathcal{L}_{\mathrm{prior}}(G) = D_{\mathrm{KL}}(q(\cdot|G) \,\|\, \mathcal{N}(0, I))$ is used to regularize the posterior distribution in the latent space. (3) Following Maziarz et al. (2021), we add a property prediction loss $\mathcal{L}_{\mathrm{prop}}(G)$ to ensure the continuity of the latent space with respect to some simple molecule properties. Specifically, we build another network $\mathrm{NN}_{\mathrm{prop}}$ to predict the properties from the latent vector $z$. (4) $\beta_{\mathrm{prior}}$ and $\beta_{\mathrm{prop}}$ are hyperparameters to be determined according to validation performance.

Here we emphasize two useful details. (1) Since MiCaM is an iterative method, in the training procedure, we need to determine the order in which motifs are processed.
The order of the intermediate motifs (i.e., $t \geq 1$) is determined by the queue $Q$ introduced in Section 2.2. The only remaining item is the first motif, and we choose the motif with the largest number of atoms as the first one. The intuition is that the largest motif most strongly reflects the molecule's properties. The generation order is then determined according to Algorithm 2. (2) In the training procedure, we provide supervision for every individual generation step. We implement the reconstruction loss $\mathcal{L}_{\mathrm{rec}}(G)$ by viewing the generation steps as parallel classification tasks. In practice, as the vocabulary size is large due to the various possible connections, $\mathcal{L}_{\mathrm{rec}}(G)$ is costly to compute. To tackle this problem, we subsample the vocabulary and modify $\mathcal{L}_{\mathrm{rec}}(G)$ via contrastive learning (He et al., 2020). More details can be found in Appendix A.4.
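
The regularization term and the weighted objective of Equation (3) can be made concrete with a short sketch. It assumes a diagonal Gaussian posterior parameterized by a mean and a log variance, as stated above, and uses the standard closed-form KL divergence to $\mathcal{N}(0, I)$; the reparameterized sampling and all function names are illustrative assumptions rather than details confirmed by the paper.

```python
import math
import random


def sample_latent(mu, log_var, rng=random):
    """Reparameterized sample z = mu + sigma * eps with eps ~ N(0, I)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))


def total_loss(l_rec, l_prior, l_prop, beta_prior, beta_prop):
    """Weighted objective L_rec + beta_prior * L_prior + beta_prop * L_prop."""
    return l_rec + beta_prior * l_prior + beta_prop * l_prop


z = sample_latent([0.0, 0.0], [0.0, 0.0], rng=random.Random(0))
prior_term = kl_to_standard_normal([1.0], [0.0])
```

The KL term vanishes exactly when the posterior equals the prior (zero mean, zero log variance), which is why $\beta_{\mathrm{prior}}$ controls how strongly the latent codes are pulled toward $\mathcal{N}(0, I)$.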

2.4. DISCUSSION AND RELATED WORK

Molecular Generation

A plethora of generative models exist, and they fall into two categories: (1) string-based models (Kusner et al., 2017; Gómez-Bombarelli et al., 2018; Sanchez-Lengeling & Aspuru-Guzik, 2018; Segler et al., 2018), which rely on string representations of molecules such as SMILES (Weininger, 1988) and do not utilize the structural information of molecules, and (2) graph-based models (Liu et al., 2018; Guo et al., 2021), which are naturally based on molecular graphs. Graph-based approaches mainly include models that generate molecular graphs (1) atom-by-atom (Li et al., 2018; Mercado et al., 2021) and (2) fragment-by-fragment (Kong et al., 2021; Maziarz et al., 2021; Zhang et al., 2021; Guo et al., 2021). This work is mainly related to fragment-based methods.

Motif Mining

Our motif mining algorithm is inspired by Byte Pair Encoding (BPE) (Gage, 1994), which is widely adopted in natural language processing (NLP) to tokenize words into subwords (Sennrich et al., 2015). Compared with the strings handled by BPE in NLP, molecules have much more complex structures due to different connections and bond types, which we handle by building the merging graphs. Another related class of algorithms is Frequent Subgraph Mining (FSM) (Kuramochi & Karypis, 2001; Jiang et al., 2013), which also aims to mine frequent motifs from graphs. However, these algorithms do not provide a "tokenizer" to fragmentize an arbitrary molecule into disjoint fragments, as BPE does. Thus we cannot directly apply them to molecular generation tasks. Kong et al. (2021) also try to mine motifs, but they do not incorporate the connection information into the motif vocabulary and they apply a totally different generation procedure; both aspects are important to the performance (see Appendix B.3). See Appendix D for more discussion.

Motif Representation

Different from many prior works that view motifs as discrete tokens, we represent all the motifs in the motif vocabulary as graphs, and we apply $\mathrm{GNN}_{\mathrm{motif}}$ to obtain the representations of motifs. This approach has three advantages. (1) $\mathrm{GNN}_{\mathrm{motif}}$ obtains similar representations for similar motifs, and thus maintains the graph structure information of the motifs (see Appendix C.3). (2) $\mathrm{GNN}_{\mathrm{motif}}$, combined with contrastive learning, can handle a large motif vocabulary in training, which allows us to construct a large motif vocabulary (see Section 2.3 and Appendix A.4). (3) The model can be easily transferred to another motif vocabulary. Thus we can jointly tune the motif vocabulary and the network parameters on a new dataset, improving the capability of the model to fit new data (see Section 3.2).

3.1. DISTRIBUTIONAL LEARNING RESULTS

To demonstrate the effectiveness of MiCaM, we test it on the benchmarks from GuacaMol (Brown et al., 2019), a commonly used evaluation framework to assess de novo molecular generation models. We first consider the distribution learning benchmarks, which assess how well the models learn to generate novel molecules that resemble the distribution of the training set.

Experimental Setup

Following Brown et al. (2019), we consider five metrics for distribution learning: validity, uniqueness, novelty, KL divergence (KL Div) and Fréchet ChemNet Distance (FCD). The first three metrics measure whether the model can generate chemically valid, unique, and novel candidate molecules, while the last two are designed to measure the distributional similarity between the generated molecules and the training set. For the KL Div benchmark, we compare the probability distributions of a variety of physicochemical descriptors. A higher KL Div score, i.e., lower KL divergences for the descriptors, means that the generated molecules resemble the training set in terms of these descriptors. The FCD is calculated from the hidden representations of molecules in a neural network called ChemNet (Preuer et al., 2018), which can capture important chemical and biological features of molecules. A higher FCD score, i.e., a lower Fréchet ChemNet Distance, means that the generated molecules have similar chemical and biological properties to those from the training set. We evaluate our method on three datasets: QM9 (Ruddigkeit et al., 2012), ZINC (Irwin et al., 2012), and GuacaMol (Brown et al., 2019).

We can see that MiCaM achieves the best KL Divergence and FCD scores on all three datasets, which demonstrates that it closely resembles the distributions of the training sets. Meanwhile, it keeps high uniqueness and novelty, comparable to the previous best results. In this experiment, we set the number of merging operations to 1000 for QM9, based on the results in Figure 4.
For ZINC and GuacaMol, we simply set the number to 500 and find that MiCaM already outperforms all the baselines. This indicates that existing methods tend to perform well on sets of relatively simple molecules such as QM9, while MiCaM performs well on datasets of varying complexity. We further visualize the distributions in Figure 7 of Appendix B.2.

Number of Merging Operations

We conduct experiments with different numbers of merging operations. Figure 4 presents the experimental results on QM9. It shows that the FCD score and the KL Divergence score, which measure the similarity between the generated molecules and the training set, increase as the number of merging operations grows. Meanwhile, the novelty decreases as the number grows. Intuitively, the more merging operations we use for motif vocabulary construction, the larger the motifs contained in the vocabulary, and thus the more structural information they carry over from the training set. We can achieve a trade-off between similarity and novelty by controlling the number of merging operations. Empirically, a medium number (about 500) of operations is enough to achieve a high similarity.

Generation Modes

We also compare two different generation modes, i.e., the greedy mode and the distributional mode. With the greedy mode, the model always picks the motif or the connection site with the highest probability, whereas the distributional mode samples motifs or connection sites according to the predicted distribution. The results show that the greedy mode leads to slightly higher KL Divergence and FCD scores, while the distributional mode leads to a higher novelty.

3.2. GOAL DIRECTED GENERATION RESULTS

We further demonstrate the capability of our model to generate molecules with desired properties. In such goal directed generation tasks, we aim to generate molecules that achieve high scores under predefined scoring rules.

Iteratively Tuning

We combine MiCaM with iterative target augmentation (ITA) (Yang et al., 2020) and generate optimized molecules by iteratively generating new molecules and tuning the model on the molecules with the highest scores. Specifically, we first pick out the $N$ molecules with top scores from the GuacaMol dataset and store them in a training buffer. Iteratively, we tune our model on the training buffer and generate new molecules. In each iteration, we update the training buffer to store the top $N$ molecules that are either newly generated or from the training buffer in the last iteration. In order to accelerate the exploration of the latent space, we pair MiCaM with Molecular Swarm Optimization (MSO) (Winter et al., 2019) in each iteration to generate new molecules.

A novelty of our approach is that when tuning on a new dataset, we jointly update the motif vocabulary and the network parameters. We can do this because we apply $\mathrm{GNN}_{\mathrm{motif}}$ to obtain motif representations, which can be transferred to a newly built motif vocabulary. The representations of the new motifs are calculated by $\mathrm{GNN}_{\mathrm{motif}}$, and we then optimize the network parameters using both the new data and the newly constructed motif vocabulary.

Experimental Setup

We test MiCaM on several goal directed generation tasks from the GuacaMol benchmarks. Specifically, we consider four different categories of tasks: a Rediscovery task (Celecoxib Rediscovery), a Similarity task (Aripiprazole Similarity), an Isomers task (C11H24 Isomers), and two multi-property objective (MPO) tasks (Ranolazine MPO and Sitagliptin MPO). We compare MiCaM with several strong baselines.
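
The buffer update used in iterative tuning can be sketched as follows. This is a minimal illustration under stated assumptions: `score`, `tune`, and `generate` are hypothetical placeholder callables (in the paper's setting they would correspond to the benchmark scoring function, fine-tuning MiCaM, and MSO-guided generation), and molecules are assumed to be hashable objects such as SMILES strings.

```python
import heapq


def iterative_target_augmentation(initial_pool, score, tune, generate,
                                  n_keep, n_iters):
    """Keep the n_keep best molecules seen so far and retune on them.

    score(mol) -> float, tune(buffer) -> model, and
    generate(model) -> iterable of molecules are hypothetical placeholders.
    """
    buffer = heapq.nlargest(n_keep, set(initial_pool), key=score)
    for _ in range(n_iters):
        model = tune(buffer)               # fit on the current buffer
        candidates = set(generate(model))  # newly generated molecules
        # Buffer update: top n_keep among new molecules and the old buffer.
        buffer = heapq.nlargest(n_keep, candidates | set(buffer), key=score)
    return buffer


# Toy run: "molecules" are integers and the score is the integer itself.
result = iterative_target_augmentation(
    initial_pool=[1, 2, 3],
    score=lambda m: m,
    tune=lambda buf: list(buf),                   # the "model" is just the buffer
    generate=lambda model: [m + 1 for m in model],
    n_keep=2,
    n_iters=3,
)
```

Because the buffer always merges the previous buffer with the new candidates before truncating, the best score it holds can never decrease from one iteration to the next.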

Quantitative Results

For each benchmark, we run 7 iterations. In each iteration, we apply MSO to generate 80,000 molecules, and store the 10,000 molecules with the highest scores in the training buffer to tune MiCaM. We tune the model pretrained on the GuacaMol dataset, and Table 2 presents the results. MiCaM achieves scores of 1.0 on some relatively easy benchmarks. It also achieves high scores on several difficult benchmarks such as the MPO tasks, outperforming the baselines.

Case Studies

Some domain-specific motifs are beneficial to the target properties in different goal directed generation tasks, and these are likely to be the pharmacophores in drug molecules. We conduct case studies to demonstrate the ability of MiCaM to mine such favorable motifs for domain-specific tasks. In Figure 5 we present cases for the Ranolazine MPO benchmark, which tries to discover molecules similar to Ranolazine, a known drug molecule, but with additional requirements on some other properties. This benchmark calculates the geometric mean of four scores: Sim (the similarity between the molecule and Ranolazine), logP, TPSA, and Num Fs (the number of fluorine atoms). We present a generation trajectory as well as the scores in each generation step. Due to the domain-specific motif vocabulary and the connection query mechanism, it requires only a few steps to generate such a complex molecule. Moreover, we can see that the scores increase as some key motifs are added to the molecule, which implies that the picked motifs are relevant to the target properties. See Figure 8 in Appendix B.4 for more case studies.
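
To make the aggregation in the Ranolazine MPO benchmark concrete, the sketch below computes a geometric mean of component scores. The component values are invented for illustration, and the way each raw property (logP, TPSA, fluorine count) is mapped to a score in [0, 1] is not shown because it is not specified in this text.

```python
import math


def geometric_mean(scores):
    """Geometric mean of component scores in [0, 1]; 0 if any score is 0."""
    if not scores:
        raise ValueError("need at least one score")
    if any(s <= 0.0 for s in scores):
        return 0.0
    return math.exp(sum(math.log(s) for s in scores) / len(scores))


# Hypothetical component scores for one candidate: Sim, logP, TPSA, Num Fs.
example = geometric_mean([0.8, 0.9, 1.0, 0.5])
```

Unlike an arithmetic mean, the geometric mean is pulled down sharply by a single weak component, so a molecule has to satisfy all four criteria at once to score well.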

4. CONCLUSION

In this work, we proposed MiCaM, a novel model that generates molecules based on mined connection-aware motifs. Specifically, the contributions include (1) a data-driven algorithm to mine motifs by iteratively merging the most frequent subgraph patterns and (2) a connection-aware generator for de novo molecular generation. It achieves state-of-the-art results on distribution learning tasks on three different datasets. Combined with iterative target augmentation, it can learn domain-specific motifs related to some properties and performs well on goal directed benchmarks.

where "$a^*$" denotes that we change the label of the atom $a$ to "*". The symbol "*" can be seen as a dummy atom or a connection site, which indicates that the bond is non-terminal and that we will grow the molecule here.

Efficiency

Figure 6 presents the time costs of learning operations from QM9, which demonstrates that our algorithm is fast, and that the time cost of each iteration decreases rapidly as the motif frequency decreases.

Sequential Merging Operations

The merging operations are useful because they work as a "tokenizer" that can be applied to fragmentize an arbitrary molecule, which is unavailable for other methods like Frequent Subgraph Mining. An alternative way of using the merging operations is not to apply them sequentially, but to traverse the molecule edges and check whether any pattern appears among the learnt frequent motifs. However, this is sub-optimal, as it does not behave reasonably on arbitrary molecules outside the dataset used to learn the operations. For example, consider a trivial dataset $D$ = {CC, CN, CNN, CN=O, CC=O}. When we run two iterations, the learnt merging operations are {CN, CC}. Then we use them to fragmentize a new molecule "CCN". If we apply the merging operations sequentially, the molecule will be decomposed into {C, CN}. However, if we traverse the molecule edges to find the patterns, the molecule will be decomposed into {CC, N}, which is sub-optimal because "CN" appears with a higher frequency in the dataset.
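
The toy example above can be reproduced with a small, self-contained sketch of merging-operation learning and sequential fragmentization. It is an illustration under assumptions, not Algorithm 1 itself: molecules are encoded as plain atom lists and bond dictionaries, fragment identity is decided by a brute-force canonical key that is only practical for tiny fragments, each fragment takes part in at most one merge per pass, and ties between equally frequent patterns are broken by the smallest key. All function names are hypothetical.

```python
from collections import Counter
from itertools import permutations


def chain(symbols, orders):
    """Build a linear toy molecule: atom symbols plus {(i, i+1): bond order}."""
    return (list(symbols), {(i, i + 1): o for i, o in enumerate(orders)})


def canon(mol, atom_ids):
    """Brute-force canonical key of the induced subgraph (toy sizes only)."""
    atoms, bonds = mol
    best = None
    for perm in permutations(sorted(atom_ids)):
        pos = {a: k for k, a in enumerate(perm)}
        labels = tuple(atoms[a] for a in perm)
        edges = tuple(sorted(
            (min(pos[i], pos[j]), max(pos[i], pos[j]), o)
            for (i, j), o in bonds.items() if i in pos and j in pos))
        if best is None or (labels, edges) < best:
            best = (labels, edges)
    return best


def adjacent_pairs(mol, frags):
    """Index pairs of fragments joined by at least one bond."""
    owner = {a: k for k, f in enumerate(frags) for a in f}
    return sorted({tuple(sorted((owner[i], owner[j])))
                   for (i, j) in mol[1] if owner[i] != owner[j]})


def apply_op(mol, frags, pattern):
    """One pass: merge adjacent pairs whose union equals `pattern`."""
    used, merged = set(), []
    for a, b in adjacent_pairs(mol, frags):
        if a in used or b in used:
            continue  # each fragment takes part in at most one merge per pass
        if canon(mol, frags[a] | frags[b]) == pattern:
            used.update((a, b))
            merged.append(frags[a] | frags[b])
    return [f for k, f in enumerate(frags) if k not in used] + merged


def atoms_as_fragments(mol):
    return [frozenset([a]) for a in range(len(mol[0]))]


def learn_ops(mols, num_ops):
    """Learn up to num_ops merging operations, most frequent pattern first."""
    state = [atoms_as_fragments(m) for m in mols]
    ops = []
    for _ in range(num_ops):
        counts = Counter()
        for m, frags in zip(mols, state):
            for a, b in adjacent_pairs(m, frags):
                counts[canon(m, frags[a] | frags[b])] += 1
        if not counts:
            break
        top = max(counts.values())
        pattern = min(k for k, c in counts.items() if c == top)  # tie-break
        ops.append(pattern)
        state = [apply_op(m, f, pattern) for m, f in zip(mols, state)]
    return ops


def fragmentize(mol, ops):
    """Apply the learnt operations in order; return fragments as atom strings."""
    frags = atoms_as_fragments(mol)
    for pattern in ops:
        frags = apply_op(mol, frags, pattern)
    return sorted("".join(mol[0][a] for a in sorted(f)) for f in frags)


# The toy dataset {CC, CN, CNN, CN=O, CC=O}.
dataset = [chain("CC", [1]), chain("CN", [1]), chain("CNN", [1, 1]),
           chain("CNO", [1, 2]), chain("CCO", [1, 2])]
ops = learn_ops(dataset, 2)
print([op[0] for op in ops])                    # → [('C', 'N'), ('C', 'C')]
print(fragmentize(chain("CCN", [1, 1]), ops))   # → ['C', 'CN']
```

Applying the operations in their learnt order is what makes "CN" win over "CC" on the unseen molecule "CCN": the more frequent pattern is merged first, after which the remaining carbon has no single-atom carbon neighbor left to merge with.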

A.2 GENERATING PROCEDURE

Algorithm 2 presents the pseudo code of our generation procedure. For ease of reference, let $C_G$ denote the connection sites from a graph $G$ (either a partial molecule or a motif), and let $C_{\mathrm{Vocab}} = \bigcup_{F^* \in \mathrm{Vocab}} C_{F^*}$ be the set of all connection sites from the motif vocabulary.

A.3 NETWORKS

The backbone of MiCaM is a VAE. The encoder is $\mathrm{GNN}_{\mathrm{mol}}$, followed by three MLPs: $\mu$ and $\Sigma$ for resampling, and $\mathrm{NN}_{\mathrm{prop}}$ for property prediction. The decoder consists of $\mathrm{GNN}_{\mathrm{pmol}}$ and $\mathrm{GNN}_{\mathrm{motif}}$, which are GNNs, and $\mathrm{NN}_{\mathrm{start}}$, $\mathrm{NN}_{\mathrm{query}}$, and $\mathrm{NN}_{\mathrm{key}}$, which are MLPs. All MLPs have 3 layers and use ReLU as the activation function. The latent size and the hidden size are both 256. We employ GINE (Hu et al., 2019) as the GNN architecture, and in each GNN layer we employ a 3-layer MLP for message aggregation. For all GNNs, we use five atom-level features as inputs: atom symbol, is aromatic, formal charge, num explicit Hs, and num implicit Hs. The features are embedded with dimensions 192, 16, 16, 16, and 16, respectively, and thus the node embedding size is 256. For edges, we consider four types of bonds (single, double, triple, and aromatic bonds), and the embedding size is 256. $\mathrm{GNN}_{\mathrm{mol}}$ and $\mathrm{GNN}_{\mathrm{pmol}}$ have 15 layers and $\mathrm{GNN}_{\mathrm{motif}}$ has 6 layers. See our released code for more details.

Algorithm 2 (surviving fragment): $G \leftarrow G.\mathrm{Merge}(v_t, v)$ (cyclizing itself); $Q \leftarrow Q \setminus \{v\}$.

A.4 EXPERIMENT DETAILS

Learning Merging Operations

For QM9, we apply 1000 merging operations based on the comparative results in Figure 4. For ZINC and GuacaMol, we simply use 500 merging operations without an elaborate search, and find that this already achieves good results. Due to the large size of the GuacaMol dataset, we randomly sample 100,000 molecules from it to learn merging operations, and then apply these merging operations on all molecules to obtain the motif vocabulary.

Training MiCaM

We preprocess all the molecules to provide supervision signals for the decoder to rebuild the molecules. We provide the true indices of the picked connections at every step so that the model can learn the ground truth. In each optimization step, we update the network parameters by optimizing the loss function on a batch $B$ of molecules:

$$\mathcal{L}_B = \mathbb{E}_{G \sim B}\left[\beta_{\mathrm{prior}} \cdot \mathcal{L}_{\mathrm{prior}}(G) + \mathcal{L}_{\mathrm{rec}}(G) + \beta_{\mathrm{prop}} \cdot \mathcal{L}_{\mathrm{prop}}(G)\right].$$

Specifically, the reconstruction loss $\mathcal{L}_{\mathrm{rec}}$ is written as a sum over the negative log probabilities of the partial graphs $G_t$ at each step $t$, conditioned on $z$ and the previous steps:

$$\mathcal{L}_{\mathrm{rec}}(G) = -\mathbb{E}_{z \sim q(\cdot \mid G)}\left[\log p(G_0 \mid z) + \sum_t \log p(G_{t+1} \mid z, G_t)\right] = -\mathbb{E}_{z \sim q(\cdot \mid G)}\left[\log p(F^*_0 \mid z) + \sum_t \log p(u_t \mid z, G_t, v_t)\right],$$

where $F^*_0$, $v_t$ and $u_t$ are the first motif, the focused query connection site and the picked connection at the $t$-th step, respectively. In practice, since the motif vocabulary is large due to the different connections, we modify the loss via contrastive learning for efficient training:

$$\log p(F^*_0 \mid z) \leftarrow \log \frac{\exp\left(\mathrm{NN}_{\mathrm{start}}(z) \cdot \mathrm{NN}_{\mathrm{key}}(h_{F^*_0})\right)}{\sum_{F^* \in I_{F^*}} \exp\left(\mathrm{NN}_{\mathrm{start}}(z) \cdot \mathrm{NN}_{\mathrm{key}}(h_{F^*})\right)},$$

$$\log p(u_t \mid z, G_t, v_t) \leftarrow \log \frac{\exp\left(\mathrm{NN}_{\mathrm{query}}([z, h_{G_t}, h_{v_t}]) \cdot \mathrm{NN}_{\mathrm{key}}(h_{u_t})\right)}{\sum_{v \in I_v} \exp\left(\mathrm{NN}_{\mathrm{query}}([z, h_{G_t}, h_{v_t}]) \cdot \mathrm{NN}_{\mathrm{key}}(h_v)\right)},$$

where $I_{F^*}$ and $I_v$ are the sets of motifs and connections, respectively, each containing a positive sample and negative samples from the batch $B$.

We find that a proper choice of $\beta_{\mathrm{prior}}$ is essential to the performance on distribution-learning benchmarks, especially for FCD scores. For QM9, we use a short warm-up (3,000 steps) and a long sigmoid schedule (400,000 steps) (Bowman et al., 2015) to let $\beta_{\mathrm{prior}}$ reach 0.4. For property prediction, we predict four simple properties of molecules: molecular weight, synthetic accessibility (SA) score, octanol-water partition coefficient (logP) and quantitative estimate of drug-likeness (QED). The target values are computed using the RDKit library. Empirically, for distribution-learning benchmarks, a small $\beta_{\mathrm{prop}}$ (about 0.3) is beneficial.
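To make the contrastive terms and the annealing of the prior weight concrete, the following dependency-free sketch computes the in-batch log-probability of a positive candidate and one possible sigmoid schedule. It is illustrative only: embeddings are plain Python lists rather than network outputs, and the function names, the zero weight during warm-up, and the centre and steepness of the sigmoid are assumptions, since the text specifies only the step counts and the target value 0.4.

```python
import math


def contrastive_log_prob(query, keys, positive_index):
    """Log-softmax of the positive key's dot-product score over all candidate keys.

    `query` plays the role of a query/start embedding and `keys` the role of
    key embeddings for one positive and several in-batch negatives.
    """
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the maximum for numerical stability
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
    return scores[positive_index] - log_norm


def beta_prior(step, warmup_steps=3_000, anneal_steps=400_000,
               beta_max=0.4, steepness=10.0):
    """Prior weight at a training step: zero in warm-up, then a sigmoid ramp.

    The ramp approaches (but does not exactly equal) `beta_max`; both the
    zero warm-up value and `steepness` are assumptions of this sketch.
    """
    if step < warmup_steps:
        return 0.0
    progress = min((step - warmup_steps) / anneal_steps, 1.0)
    return beta_max / (1.0 + math.exp(-steepness * (progress - 0.5)))
```

Restricting the normalizing sum to in-batch candidates keeps the cost of each step independent of the vocabulary size, which is the efficiency argument made above.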

A.5 VALIDITY CHECK

We conduct a validity check during generation to prevent the model from generating invalid aromatic rings (e.g., merging two "*:c:c:c:c:*"s into "c1ccccccc1"). Specifically, when the model tries to generate such an invalid aromatic ring, we simply remove the aromaticity of this ring so that the molecule is still valid (e.g., "c1ccccccc1" is replaced with "C1CCCCCCC1"). Without this chemical validity check, the validity rates on QM9, ZINC, and GuacaMol are 99.68%, 98.6%, and 98.28%, respectively. The high validity rates indicate that MiCaM learns to generate valid aromatic rings.

We visualize the distributional results on the three datasets in Figure 7. Specifically, we first calculate the Morgan fingerprints of all molecules. The Morgan fingerprint has long been used in drug discovery, as it can represent the structural information of molecules (Rogers & Hahn, 2010). For visualization, we apply the t-distributed stochastic neighbor embedding (t-SNE) algorithm (Van der Maaten & Hinton, 2008), a nonlinear dimensionality reduction technique that keeps similar high-dimensional vectors close in the lower-dimensional space, to represent molecules in the two-dimensional plane, and then plot the distributions of molecules.

Figure 8: Generation trajectories for Celecoxib Rediscovery. We show the trajectories of the best molecules in three different iterations. In each generation step, the query connection is marked in red, and the newly added motif is marked in yellow.

B.4 CASE STUDIES

Figure 8 presents cases for the Celecoxib Rediscovery task, which aims to discover molecules similar to Celecoxib, a known drug molecule. Specifically, we present the trajectories that generate the molecules with the highest scores in the 1st, 3rd and 5th iterations. As the number of iterations increases, the motifs learnt by MiCaM tend to be more specific and better adapted to the target, leading the model to generate molecules with higher scores in fewer generation steps.

B.5 GENERATED MOLECULES

Some examples of the generated molecules are shown in Figure 9. For further comparison, Figure 10 visualizes the probability distributions of GuacaMol and of the molecules generated by MiCaM and MoLeR, respectively. MiCaM fits the reference distribution better than MoLeR. Moreover, the outermost contour line of MiCaM covers more area than that of MoLeR and fits that of the reference data better. This indicates that MiCaM explores some reasonable regions of chemical space more than MoLeR does. We find that such regions include molecules with large rings or complex ring systems; see Figure 11 for some concrete examples.

D DISCUSSION AND RELATED WORK

Motif Vocabulary Construction Many previous works have explored motif construction methods. JT-VAE (Jin et al., 2018) decomposes molecules into rings, chemical bonds, and individual atoms. HierVAE (Jin et al., 2020a) and MoLeR (Maziarz et al., 2021) decompose molecules into ring systems and acyclic linkers or functional groups. MGSSL (Zhang et al., 2021) first uses BRICS (Degen et al., 2008), which is built upon chemical rules, to split molecules into fragments. After that, it manually designs another two rules to further decompose the molecules into rings and chains. Some molecule fragmentation tools such as BRICS and RECAP (Lewell et al., 1998) are also well developed, though outside the ML community. However, as the motif vocabulary obtained by those methods is large and long-tailed, combining these methods with ML models is nontrivial. The aforementioned methods are mainly based upon pre-defined rules and templates. Guo et al. (2021) proposed DEG, which learns graph grammars to generate molecules and is similar to motif-based methods. However, as DEG applies REINFORCE and MCTS to search grammars, the learned grammars are only for specific metrics, and DEG cannot mine motifs from large datasets. MiCaM mines the most frequent motifs directly from the dataset. The resulting motif vocabulary is promising for further tasks such as large-scale pre-training, which we leave as future work.

Goal Directed Generation Many frameworks have been developed for goal-directed generation tasks, including iterative methods such as ITA (Yang et al., 2020), genetic methods such as SMILES GA (Yoshikawa et al., 2018) and Graph GA (Jensen, 2019), and latent space optimization methods such as MSO (Winter et al., 2019). RationaleRL (Jin et al., 2020b) proposes to extract rationales from a collection of molecules, and then learns to expand the rationales into full molecules. MolEvol (Chen et al., 2021) then proposes a novel EM-like evolution-by-explanation algorithm based on rationales. Such frameworks designed for goal-directed learning tasks can be naturally combined with our proposed generative procedure, which we leave as future work.



Different $(F_i, F_j)$ can produce the same $F_{ij}$. For example, both (CCC, C) and (CC, CC) can produce "CCCC".

When applying merging operations, we traverse edges in the order given by RDKit (Landrum et al., 2006).

During inference, the motif representations can be calculated offline to avoid additional computation time.

MiCaM can be naturally paired with other paradigms such as GAN or RL, which we plan to explore in future work.

The code of MiCaM is available at https://github.com/MIRALab-USTC/AI4Sci-MiCaM.



Figure 1: (a) Two substructures that occur frequently in ChEMBL. (b) Different connection modes of sulfur atoms. The "*-S(-*)(-*)=*" mode is commonly seen, while "*#S#*" does not appear in ChEMBL.

Figure 2: An example of connection-aware molecular motif mining. Given a training set D = {"Brc1ccccc1", "Cc1cccc(O)c1"}, the mining consists of two phases. (a) Merging-operation Learning Phase. The merging graphs are initialized to be the same as the molecular graphs. In the first iteration, "c:c" (marked in blue) is the most frequent pattern. We merge the patterns "c:c" in all the molecules to update the merging graphs. In the second iteration, "c:c:c:c" (marked in yellow) is the most frequent pattern, so we merge such patterns and update the merging graphs. We repeat this process for 3 iterations and record the merging operations in order. (b) Motif-vocabulary Construction Phase. We apply the recorded merging operations sequentially on all molecules. The two molecules are then fragmentized into motifs. We break the bonds between different motifs while preserving the broken bonds. In this way we construct a connection-aware motif vocabulary as shown in (c).

Figure 4: KL Divergence and FCD scores (higher is better) for different numbers of merging operations and different choices of generation modes.

Figure 5: A generation trajectory for the Ranolazine MPO benchmark. In each generation step, the query connection is marked in red, and the newly added motif is marked in yellow.

Figure 6: Time costs of learning merging operations on QM9. (a) and (b) show the accumulative time costs and time costs of each iteration, respectively. (c) shows the frequencies of the learnt merging operations.

Figure 7: Visualization of the probability distributions of training sets (QM9, ZINC and GuacaMol) and the generated molecules. The postfix " ref" means reference, i.e., the training sets (shown in green, blue and red, respectively), and the postfix " gen" means the sets of molecules generated by our model (shown in grey). We obtain the representations of molecules by calculating their molecular fingerprints, and we then apply t-SNE dimensionality reduction for visualization. The curves represent the contour lines of the probability distributions of the datasets. (a) shows the distributional shift over the three training sets. (b), (c) and (d) demonstrate that our generative model can properly fit the different distributions respectively.

Figure 9: Samples from molecules randomly generated by a MiCaM model, trained on GuacaMol.

Figure 11: Molecules with large rings or complex ring systems generated by MiCaM, which MoLeR is unlikely to generate.

Figure 14: The t-SNE visualization of Motif representations. Each color represents a collection of motifs with a common structure but different connections. Each point represents a motif.

Table 1: Distributional results on QM9, ZINC, and GuacaMol. Higher is better for all metrics. The results of JT-VAE, GCPN and GP-VAE are from Kong et al. (2021). For MoLeR, we use the released code from Maziarz et al. (2021) with no changes.

Goal directed generation results on five GuacaMol benchmarks. Higher is better for all metrics. The results of other baselines are from Brown et al. (2019) and Ahn et al. (2020). The benchmarks are: a. Celecoxib Rediscovery; b. Aripiprazole Similarity; c. C11H24 Isomers; d. Ranolazine MPO; and e. Sitagliptin MPO.

Table 1 presents our experimental results on distribution learning tasks. We compare our model with several state-of-the-art models: JT-VAE (Jin et al., 2018), GCPN (You et al., 2018), GP-VAE (Kong et al., 2021), and MoLeR (Maziarz et al., 2021). Since all the models are graph-based, they obtain 100% validity by introducing a chemical valence check during generation.

ACKNOWLEDGEMENT

The authors would like to thank all the anonymous reviewers for their insightful comments. This work was supported in part by National Nature Science Foundations of China grants U19B2026, U19B2044, 61836011, 62021001, and 61836006, and the Fundamental Research Funds for the Central Universities grant WK3490000004.


Our connection-aware motif mining algorithm is given in Algorithm 1. In the merging graph [...], where $\hat{V} \subset V$ and $\hat{E} \subset E$ are the atoms and bonds in $F$, respectively. The edge set $E_M$ is defined by: for any two fragments [...], which means the new fragment $F_{ij}$ contains all nodes and edges from $F_i$ and $F_j$, together with the edges between them. Notice that when we traverse the edges of graphs, we always follow the order determined by RDKit. In the motif-vocabulary construction phase, we disconnect the bonds between different fragments and add "*" atoms to create connection-aware motifs. Specifically, for each $F(\hat{V}, \hat{E}) \in V_M$, we define its corresponding connection-aware motif [...].

Algorithm 2: Generating a molecule
Input: A connection-aware motif vocabulary Vocab = $\{F^*\}$. A latent vector $z$.

Trade-off Among Metrics

As KL Divergence and FCD negatively correlate with Uniqueness and Novelty, there is a trade-off among the metrics in distribution learning tasks. We can control this trade-off via some hyperparameters. For example, on QM9, if we conduct 500 merging operations and use the distributional mode (sampling from the top 5 choices) for sampling, MiCaM achieves higher uniqueness and novelty and outperforms MoLeR in terms of all the metrics. The results are in Table 3.
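The difference between always taking the best-scoring candidate and sampling from the top 5 choices can be sketched as follows. This is a hedged illustration rather than the released sampler: the function name `pick`, the mode name "greedy" for the argmax variant, and the softmax renormalization over the retained candidates are all assumptions made here.

```python
import math
import random


def pick(scores, mode="greedy", top_k=5, rng=random):
    """Return the index of the chosen candidate.

    mode="greedy": take the highest-scoring candidate.
    mode="distributional": sample among the `top_k` highest-scoring
        candidates, with probabilities from a softmax over their scores
        (the renormalization is an assumption of this sketch).
    """
    order = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    if mode == "greedy":
        return order[0]
    top = order[:top_k]
    best = scores[top[0]]
    weights = [math.exp(scores[i] - best) for i in top]  # stable softmax weights
    return rng.choices(top, weights=weights, k=1)[0]
```

Sampling injects diversity, which is consistent with the higher uniqueness and novelty reported for the distributional mode, at the possible cost of fidelity to the training distribution.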

Motif Vocabulary

We conduct two more experiments on QM9 to verify the effect of the motif vocabulary. Specifically, we use the generation procedure of MiCaM, but replace the motif vocabulary with the vocabularies in MoLeR (Maziarz et al., 2021) and MGSSL (Zhang et al., 2021), respectively. For a fair comparison, we preserve the connection information (i.e., the "*"s) in the two new vocabularies. We name the two models MiCaM-moler and MiCaM-brics, respectively, as MGSSL applies BRICS (Degen et al., 2008) with further decomposition. The results are in Table 4.

Generating Procedure

Besides the molecule fragmentation strategy, two components contribute to the performance of MiCaM. The first is the connection information preserved in the motif vocabulary, together with the corresponding connection-aware decoder. The second is $\mathrm{GNN}_{\mathrm{motif}}$, which captures the graph structures of motifs and allows efficient training on a large motif vocabulary via contrastive learning. We conduct ablation studies to demonstrate the importance of the two components. Specifically, we implement three different versions of MiCaM. MiCaM-v1 does not apply $\mathrm{GNN}_{\mathrm{motif}}$ and does not use connection information for generation. In each step, it first picks a motif (without connection information) by viewing motifs as discrete tokens, and then determines the connecting points and bonds. MiCaM-v2 employs $\mathrm{GNN}_{\mathrm{motif}}$ to pick motifs, but does not directly query the connection sites. The results demonstrate that leveraging connection information and employing $\mathrm{GNN}_{\mathrm{motif}}$ both bring performance improvements.

C.3 MOTIF REPRESENTATIONS

Since we apply $\mathrm{GNN}_{\mathrm{motif}}$ to encode motif representations, instead of viewing motifs as discrete tokens, the learnt motif representations can maintain structural information in the sense that similar motifs have close representations. To show this, we visualize some motif representations in Figure 14.

