GRAPHVF: CONTROLLABLE PROTEIN-SPECIFIC 3D MOLECULE GENERATION WITH VARIATIONAL FLOW

Abstract

Designing molecules that bind to specific target proteins is a fundamental task in drug discovery. Recent generative models leveraging the geometrical constraints imposed by proteins and molecules have shown great potential in generating protein-specific 3D molecules. Nevertheless, these methods fail to generate 3D molecules under 2D skeletal constraints, which encode pharmacophoric patterns essential to drug potency and synthesizability. To cope with this challenge, we propose GraphVF, which integrates geometrical and skeletal restraints into a variational flow framework, where the former is captured through a normalizing flow transformation and the latter is encoded by an amortized factorized Gaussian. We empirically verify that our method achieves state-of-the-art performance on protein-specific 3D molecule generation in terms of binding affinity and other drug properties. In particular, it represents the first controllable, geometry-aware, protein-specific molecule generation method, enabling the creation of binding 3D molecules with specified chemical sub-structures or drug properties.

1. INTRODUCTION

The de novo design of synthetically feasible drug-like molecules that bind to specific protein pockets is a crucial yet very challenging task in drug discovery. To cope with this challenge, there has been a recent surge of interest in leveraging deep generative models to effectively search the chemical space for molecules with desired properties. These machine learning models typically encode the chemical structures of molecules into a low-dimensional space, which can then be optimized and sampled to generate potential 2D or 3D molecule candidates (Jin et al., 2018; Shi et al., 2020; Zhu et al., 2022; Hoogeboom et al., 2022). Along this research line, a more promising direction has also been explored recently: generating 3D molecules that bind to given proteins. Such binding 3D molecule generation is fundamentally important because binding is the primary mechanism through which drugs exert their function. Leveraging autoregressive models to generate drug molecules (i.e., ligands) directly based on the 3D geometry of the binding pocket has shown promising potential (Luo et al., 2021; Peng et al., 2022; Liu et al., 2022). These methods explicitly capture the fine-grained atomic interactions in 3D space and produce ligand poses that directly fit into the given binding pocket. Nevertheless, two critical issues remain unsolved for these existing geometric approaches: 1) effective encoding and sufficient preservation of pharmacophoric structural patterns in the ligand candidates, and 2) controllable ligand generation that targets specified drug properties or sub-structures. The former prevents generating ligands that seem geometrically plausible yet are structurally invalid or pharmacophorically impotent; the latter determines the synthesizability and the practical usefulness of the drugs. We elaborate on them next.
In practice, it is extremely valuable to keep track of the pharmacophoric patterns in existing ligands, which determine a ligand's bio-chemical activities and binding affinity to a large extent (Wermuth et al., 1998). Consider, for example, the molecules serotonin (a benign neurotransmitter) and N,N-Dimethyltryptamine (DMT, a famous hallucinogen). As can be seen in Figure 5a of Appendix E, serotonin and DMT share a large common bulk of their structures (both possess an indole and an ethylamine group), but differ enormously in their neural activities. In fact, the extra methyl groups in DMT's NHMe2 are pharmacophoric, inducing an attractive charge interaction with Asp-231 (Gomez-Jeria & Robles-Navarro, 2015). This pharmacophoric feature gives rise to DMT's binding affinity with the 5-HT2A binding site and produces hallucination.

[Table 1: capability comparison of representative molecule generation models: DMCG (VAE), JT-VAE (VAE), GraphAF (autoregressive flow), GraphBP (autoregressive flow), Pocket2Mol (spatial autoregression), and GraphVF (variational flow); only GraphVF covers all six compared capabilities.]

Such observations suggest that effectively enforcing pharmacophoric patterns in ligands is critical for binding. Equally important, controlling molecular properties like solubility, polarizability and heat capacity is instrumental to drug quality, ensuring that synthesized drug molecules have good exposure, e.g., absorption/distribution/metabolism/excretion (ADME) in vivo, and thus sufficient efficacy in clinical trials (Egan, 2010). It is worth noting that although recent diffusion models like EDM (Hoogeboom et al., 2022) have been popular for their ability to perform controlled generation over these properties, performing such control while remaining pertinent to a given pocket structure for binding has been under-explored by previous works.
To address the aforementioned two issues, we propose GraphVF, a protein-aware molecule generation framework that integrates both geometrical and skeletal constraints, enabling control over the structure and properties of the generated ligands. To this end, we leverage a flow-based architecture that combines amortized variational inference (Zhang et al., 2018) with autoregressive normalizing-flow generation. Specifically, the global structure of the drug ligand is organized as a junction tree (Jin et al., 2018), and the fine-grained geometrical context of the protein receptor is encoded via a valence-aware E(3)-GNN. These two constraints are integrated into a variational flow architecture, where the former enforces the variational distribution globally, while the latter administers the flow transformations autoregressively. We show empirically that GraphVF generates drug molecules with high binding affinity to the receptor proteins, with or without the aid of reference ligands, outperforming state-of-the-art methods in terms of binding affinity and several other drug properties. More importantly, GraphVF exposes a clean-cut interface for imposing customized constraints, which is extremely useful in practice for controlling the sub-structures and bio-chemical properties of generated drug ligands. To clarify what our proposed model can do, we compare GraphVF with several representative molecule generation models in Table 1. Our main contributions are summarized as follows.

• We devise a novel variational flow-based framework that seamlessly integrates geometrical and skeletal restraints to improve protein-specific 3D molecule generation.
• We present the first method that enables generating 3D molecules with specified chemical sub-structures or bio-chemical properties.
• We empirically demonstrate our method's superior performance over state-of-the-art approaches for generating binding 3D molecules.

2. RELATED WORK

Non-Protein-Specific Molecule Generation. Different generative techniques have been applied to molecular generation, including Variational Autoencoders (VAEs) (Kingma & Welling, 2013), Diffusion Models (Sohl-Dickstein et al., 2015), Normalizing Flows (NFs) (Dinh et al., 2016), and Autoregressive Models (Van Oord et al., 2016). This line of work is usually context-free, aiming to produce high-quality molecules from scratch or to render reasonable 3D conformations of given molecules. For example, JT-VAE (Jin et al., 2018) generates molecular graphs with the guidance of a tree-structured scaffold over chemical substructures. GraphAF (Shi et al., 2020) uses a flow-based model to generate atoms and bonds in an autoregressive manner. DMCG (Zhu et al., 2022) and EDM (Hoogeboom et al., 2022) leverage equivariant diffusion or iterative sampling and de-noising to generate 3D conformations from 2D structures. Unlike these methods, our approach aims at generating molecules that bind to given 3D protein pockets.

3D Molecule Generation for Target Protein Binding. With the wide availability of large-scale datasets for target protein binding (Francoeur et al., 2020; Li et al., 2021), recent works have been able to generate drug ligands directly based on the 3D geometry of the binding pockets. For example, Pocket2Mol (Peng et al., 2022) leverages a spatial-autoregressive model: it directly models the p.d.f. of atom occurrence in 3D space as a Gaussian mixture (GMM), then iteratively places atoms drawn from the learned distribution until there is no room for new atoms. GraphBP (Liu et al., 2022), an autoregressive model, retains good model capacity via normalizing flows: variables are randomly sampled from a compact latent space before being projected into the chemical space by an arbitrarily complex flow transformation.
Despite their promising potential, these methods ignore the topological organization of the drug ligand itself, as well as the structural patterns and pharmacophoric features embodied in it. As a result, existing methods tend to generate ligands that seem geometrically plausible yet are structurally invalid or pharmacophorically impotent. Our approach aims to address this problem. Moreover, our method enables controllable molecule generation, facilitating the creation of drug ligand candidates with specified chemical sub-structures or drug properties.

3. PRELIMINARIES

Our proposed method leverages an autoregressive flow strategy to generate binding molecules.

3.1. AUTOREGRESSIVE FLOW MODELS

Given a prior distribution p_Z, a flow model (Dinh et al., 2014; Rezende & Mohamed, 2015; Weng, 2018) is defined as an invertible parameterized function f_θ : z ∈ R^D → x ∈ R^D, where θ denotes the parameters of f and D is the dimension of z and x. It maps the latent variable z ∼ p_Z to the data variable x, and the log-likelihood of x is calculated as

log p_X(x) = log p_Z(f_θ^{-1}(x)) + log |det(∂f_θ^{-1}(x)/∂x)|   (1)

An autoregressive flow model (Papamakarios et al., 2017) formulates the flow with an autoregressive computation to enable easy Jacobian determinant computation. Let x_i be the i-th component of x, conditioning on x_{1:i-1}. The inverse function f_θ^{-1} is then defined as follows:

x_i = σ_i(x_{1:i-1}) ⊙ z_i + µ_i(x_{1:i-1}),  i = 1, ..., D   (2)

where ⊙ denotes element-wise multiplication, and σ_i(·) ∈ R and µ_i(·) ∈ R are non-linear functions of x_{1:i-1}. Doing so, we can effectively calculate

z_i = (x_i - µ_i) / σ_i,   det(∂f_θ^{-1}(x)/∂x) = ∏_{i=1}^{D} 1/σ_i   (3)
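The affine autoregressive flow above can be sketched in a few lines of numpy. Here `mu_fn` and `sigma_fn` are hypothetical stand-ins for the learned conditioners σ_i(·) and µ_i(·); the toy conditioners are chosen only so the forward/inverse round trip is checkable by hand.

```python
import numpy as np

def inverse_flow(x, mu_fn, sigma_fn):
    """Map data x back to latents z, accumulating the log-det term:
    det dF^{-1}/dx = prod_i 1/sigma_i (Eq. 3)."""
    D = len(x)
    z = np.empty(D)
    log_det = 0.0
    for i in range(D):
        mu, sigma = mu_fn(i, x[:i]), sigma_fn(i, x[:i])
        z[i] = (x[i] - mu) / sigma
        log_det += -np.log(sigma)
    return z, log_det

def forward_flow(z, mu_fn, sigma_fn):
    """Generate x autoregressively: x_i = sigma_i(x_<i) * z_i + mu_i(x_<i) (Eq. 2)."""
    D = len(z)
    x = np.empty(D)
    for i in range(D):
        x[i] = sigma_fn(i, x[:i]) * z[i] + mu_fn(i, x[:i])
    return x

# Toy conditioners: each dim shifts by the running sum and scales by 2.
mu_fn = lambda i, prev: prev.sum()
sigma_fn = lambda i, prev: 2.0

z = np.array([0.5, -1.0, 0.25])
x = forward_flow(z, mu_fn, sigma_fn)
z_rec, log_det = inverse_flow(x, mu_fn, sigma_fn)
```

Because each σ_i, µ_i depends only on earlier components, the Jacobian is triangular and inversion proceeds one coordinate at a time.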

3.2. PROBLEM FORMULATION AND NOTATIONS

Given a specific protein receptor, our task is to generate a ligand molecule that binds effectively to it. Here, proteins and ligands are represented as graphs in the 3D geometric space. Node features include atom type a and position r, while the edge feature is the bond type b. For training, we are given pairs of protein P and ligand R in their binding poses. For generation, we are given protein targets P for which to generate tightly binding drug ligands. Consider a given protein-ligand pair with M and N atoms, respectively. We denote the protein as P = (Ṽ, Ẽ), where Ṽ = {(ã_i, r̃_i)}_{i=1}^{M} and Ẽ = {b̃_ij}_{i,j=1}^{M}, and the ligand as R = (V, E), where V = {(a_i, r_i)}_{i=1}^{N} and E = {b_ij}_{i,j=1}^{N}. With a slight abuse of notation, the 2D topology of the ligand structure is also referred to as R.
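The notation above can be mirrored in a small illustrative data structure; the class and field names below are ours, not the paper's.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class Molecule3D:
    """Protein or ligand graph: atom types a_i, positions r_i, bond types b_ij.
    Purely illustrative; real pipelines store richer chemistry metadata."""
    atom_types: np.ndarray                          # (n,) integer atom types a_i
    positions: np.ndarray                           # (n, 3) 3D coordinates r_i
    bond_types: dict = field(default_factory=dict)  # {(i, j): b_ij}, 0 = no bond

    @property
    def num_atoms(self):
        return len(self.atom_types)

# A 3-atom toy ligand: types are arbitrary integers, positions in angstroms.
ligand = Molecule3D(
    atom_types=np.array([6, 6, 8]),
    positions=np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [2.2, 1.1, 0.0]]),
    bond_types={(0, 1): 1, (1, 2): 1},
)
```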

4. THE PROPOSED METHOD

In this section, we first introduce how we encode the ligand scaffold (Section 4.1) and the protein-ligand geometry (Section 4.2). Section 4.3 then presents how these two pieces of knowledge are used to generate a binding molecule autoregressively. Finally, Section 4.4 details how we cope with three challenges arising from Section 4.3 through a variational flow model.

4.1. LIGAND SCAFFOLD ENCODING

Inspired by Jin et al. (2018), we extract the coarse-grained structural patterns of the ligand scaffold in a fragment-driven fashion. The whole procedure is illustrated in Figure 1 and discussed next. First, the ligand molecule R is parsed into a compilation of occluded canonical sub-structures, according to a set of pre-defined vocabulary (more details are in Appendix A.1). Next, the resulting graph structure is pooled into a junction tree, in which structural patterns are well exposed: each node represents a sub-structure. Finally, the structural encoding φ_R = (µ_R, Σ_R) is derived from the tree structure via a Gated Recurrent Unit (GRU) (Chung et al., 2014) adapted for tree message passing. In particular, µ_R and Σ_R are equally-sized dense vectors, which parameterize the amortized variational distribution discussed in Sections 4.3 and 4.4; φ_R is merely shorthand for the concatenation of µ_R and Σ_R. It should also be emphasized that the GRU encoder and φ_R are not fixed: they are trained end-to-end alongside the whole model to avoid deviating from the training objective. Refer to Appendix A.2 for implementation details of the GRU architecture.
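A minimal sketch of the bottom-up tree encoding described above, with a tanh cell standing in for the tree-adapted GRU and randomly initialized weights in place of trained ones; all names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # latent size (illustrative)

# Hypothetical weights; in the paper these are trained end-to-end.
W_node, W_child = rng.normal(size=(D, D)) * 0.1, rng.normal(size=(D, D)) * 0.1
W_mu, W_sigma = rng.normal(size=(D, D)) * 0.1, rng.normal(size=(D, D)) * 0.1

def encode_tree(features, children, root=0):
    """Bottom-up pass over a junction tree (node -> list of child nodes),
    producing the variational parameters (mu_R, sigma_R) at the root."""
    def hidden(v):
        msgs = [hidden(c) for c in children.get(v, [])]
        agg = np.sum(msgs, axis=0) if msgs else np.zeros(D)
        return np.tanh(W_node @ features[v] + W_child @ agg)
    h_root = hidden(root)
    mu = W_mu @ h_root
    sigma = np.exp(W_sigma @ h_root)  # exp keeps the std-devs positive
    return mu, sigma

# Toy junction tree: root fragment 0 with two child fragments.
feats = {v: rng.normal(size=D) for v in range(3)}
mu_R, sigma_R = encode_tree(feats, {0: [1, 2]})
```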

4.2. GEOMETRY GRAPH ENCODING

Equivariant graph neural networks like SchNet (Schutt et al., 2017) and EGNN (Satorras et al., 2021) have become a routine component in this receptor-based line of work, as they are essential for encoding molecular features with roto-translational equivariance. Atoms around the binding pocket are organized into a kNN/radius graph, based on their Euclidean distances in 3D space. This distance-based approach is appropriate for modeling non-covalent interactions like hydrogen bonds and hydrophobic interactions, but remains inadequate for modeling covalent bonds. Bond lengths are known to be characteristic, e.g. C≡N 1.16 Å, C=C 1.34 Å (Lide, 2012). Explicitly incorporating bond types during message passing is thus beneficial for better perception of atomic interactions and a more reasonable delineation of molecular structures. To achieve this goal, we devise Echnet, an SE(3)-equivariant graph neural network specially tailored for bond-type message passing in the 3D geometric setting, which is formulated as follows:

h_i^{(0)} = Emb(a_i)   (4)
m_ij = concat(Erbf(||r_i - r_j||), Emb(b_ij))   (5)
h_i^{(l)} = h_i^{(l-1)} + Σ_{k∈N(i)} h_k^{(l-1)} ⊙ Φ^{(l)}(m_ki),  l = 1, ..., L   (6)

where Erbf(·) is a radial basis function (Liu et al., 2022), Emb(·) is an embedding layer, concat(·) denotes the concatenation of two vectors, Φ^{(l)} is a feed-forward neural network, and L is the number of convolution layers. h_i^{(l)} stands for the encoding of atom i at the l-th convolution layer, and m_ij is the message passed from atom i to atom j.
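The message-passing update of Eqs. (4)-(6) can be sketched as follows. The sketch is E(3)-invariant because messages depend on positions only through interatomic distances; the embedding tables, weights and sizes are illustrative stand-ins for the learned components of Echnet.

```python
import numpy as np

rng = np.random.default_rng(1)
D, K = 16, 8  # hidden size and number of RBF centers (illustrative)

atom_emb = rng.normal(size=(10, D)) * 0.1   # Emb(a): 10 toy atom types
bond_emb = rng.normal(size=(4, K)) * 0.1    # Emb(b): bond orders 0..3
W_phi = rng.normal(size=(D, 2 * K)) * 0.1   # stands in for the MLP Phi

def erbf(d, centers=np.linspace(0.5, 4.0, K), gamma=4.0):
    """Radial basis expansion of an interatomic distance (Eq. 5)."""
    return np.exp(-gamma * (d - centers) ** 2)

def echnet_layer(h, pos, bonds):
    """One bond-aware message-passing layer (Eq. 6): gated residual update
    from each neighbor, using distance RBFs plus the bond-type embedding."""
    h_new = h.copy()
    for (i, j), b in bonds.items():
        for src, dst in ((i, j), (j, i)):
            d = np.linalg.norm(pos[src] - pos[dst])
            m = np.concatenate([erbf(d), bond_emb[b]])  # m_{src,dst}
            h_new[dst] += h[src] * np.tanh(W_phi @ m)
    return h_new

pos = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [2.2, 1.1, 0.0]])
h0 = atom_emb[[6, 6, 8]]                     # initial encodings h_i^{(0)}
h1 = echnet_layer(h0, pos, {(0, 1): 1, (1, 2): 1})
```

Rotating all coordinates leaves the output unchanged, since only pairwise norms enter the messages.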

4.3. BINDING LIGAND GENERATION

We formalize the procedure of new ligand generation as a Markovian sampling process, where atoms and bonds are autoregressively added according to the intermediary state at the binding site. The generation process at step i = 4 is illustrated in Figure 2. In particular, we provide a prior distribution N(µ_ρ, Σ_ρ) to control certain desired traits of the generated ligand. The implication and derivation of N(µ_ρ, Σ_ρ) are discussed in Section 4.4; when not specified, the prior is set to N(0, I) by default.

We now elucidate the generation process at step i. First, we construct a radius graph G_i based on the protein graph P and the ligand sub-graph R_{1:i-1}:

G_i = τ(P ∪ R_{1:i-1})   (7)

where the radius operator τ(·) adds edges (of bond order 0) between neighboring atoms within radius τ. In particular, at generation step 1, when no ligand atoms have yet been generated, G_i is simply τ(P). Echnet outputs the encoding of each atom in both the protein and the ligand:

ẽ_{1:M}, e_{1:N-1} = Echnet(G_i)   (8)

Second, we select a focal atom f_i with the focal classifier. Except in the first step, f_i is only selected from the drug ligand. Based on f_i and two of its nearest neighbors, we construct a spherical coordinate system (SCS), transforming Cartesian coordinates into polar coordinates (d, θ, φ). Finally, we add a new atom to the drug ligand via sequential generation of its atom type a_i, bonds with existing atoms b_{1:i-1,i}, and position r_i = (d_i, θ_i, φ_i), which better captures the underlying dependencies (Liu et al., 2022). The prior random variables z_i^{(node)}, z_{1:i-1,i}^{(bond)} and z_i^{(pos)} are sampled from N(µ_ρ, Σ_ρ).
The priors are then consecutively projected into the 3D geometric space via the flow transformation F_i:

x_i^{(node)}, x_{1:i-1,i}^{(bond)}, x_i^{(pos)} = F_i(z_i^{(node)}, z_{1:i-1,i}^{(bond)}, z_i^{(pos)}; e_{1:N-1})   (9)

Specifically, the flow transformation F_i is parameterized as follows:

µ_i^{(node)}, σ_i^{(node)} = Node-MLP(e_{f_i})   (10)
x_i^{(node)} = σ_i^{(node)} ⊙ z_i^{(node)} + µ_i^{(node)}   (11)
µ_{1:i-1,i}^{(bond)}, σ_{1:i-1,i}^{(bond)} = Bond-MLP(e_{1:i-1}, x_i^{(node)})   (12)
x_{1:i-1,i}^{(bond)} = σ_{1:i-1,i}^{(bond)} ⊙ z_{1:i-1,i}^{(bond)} + µ_{1:i-1,i}^{(bond)}   (13)
µ_i^{(pos)}, σ_i^{(pos)} = Position-MLP(e_{f_i}, x_i^{(node)}, x_{1:i-1,i}^{(bond)})   (14)
x_i^{(pos)} = σ_i^{(pos)} ⊙ z_i^{(pos)} + µ_i^{(pos)}   (15)

where ⊙ denotes element-wise multiplication; x_i^{(node)}, x_{1:i-1,i}^{(bond)} and x_i^{(pos)} are the vectorized representations of atom type, bond types and position; and the σ's and µ's are the parameters of the flow transformation. The sequential dependency between a, b and d is embodied in Equations 12 and 14, where the atom/bond types that have just been generated immediately parameterize the σ and µ of the next flow transformation. We have thus rendered all the sampled features a_i, b_{1:i-1,i}, d_i, θ_i, φ_i for step i, successfully generating the new atom and its associated bonds. This iteration continues until the focal classifier reports that no atom is eligible as f_i, at which point the generation procedure ends. Algorithm 2 from Appendix B explains the generation algorithm in more detail.
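The node → bond → position ordering of Equations 10-15 can be sketched as below; `mlp` is a hypothetical stand-in for Node-/Bond-/Position-MLP, returning a (µ, σ) pair from its inputs, and all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
D = 4  # per-variable latent size (illustrative)

def mlp(*inputs):
    """Hypothetical stand-in for the Node-/Bond-/Position-MLP: deterministically
    derives (mu, sigma) from its inputs (sigma kept positive via exp)."""
    seed = int(abs(np.sum([x.sum() for x in inputs])) * 1e3) % (2**32)
    r = np.random.default_rng(seed)
    return r.normal(size=D) * 0.1, np.exp(r.normal(size=D) * 0.1)

def generation_step(e_focal, e_prev, prior_mu, prior_sigma):
    """One autoregressive step: atom type, then bonds, then position,
    each conditioning on what was just generated (Eqs. 10-15)."""
    z = lambda: prior_mu + prior_sigma * rng.standard_normal(D)
    mu, sig = mlp(e_focal)
    x_node = sig * z() + mu                  # atom type features (Eq. 11)
    mu, sig = mlp(e_prev, x_node)
    x_bond = sig * z() + mu                  # bond type features (Eq. 13)
    mu, sig = mlp(e_focal, x_node, x_bond)
    x_pos = sig * z() + mu                   # (d, theta, phi) features (Eq. 15)
    return x_node, x_bond, x_pos

e_focal, e_prev = rng.normal(size=D), rng.normal(size=D)
x_node, x_bond, x_pos = generation_step(e_focal, e_prev, np.zeros(D), np.ones(D))
```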

4.4. VARIATIONAL FLOW TRAINING

Though self-contained, the generation procedure in Section 4.3 nonetheless raises three critical technical challenges: 1) integration of both binding-pocket geometry and ligand structural patterns; 2) generation of molecules with high binding affinity, even without reference ligands; 3) a controllable generation interface for customized bio-chemical constraints. We propose variational flow, a dedicated training framework, to solve the above challenges. As illustrated in Figure 3, the term is coined after the two key components of the framework: the amortized variational distribution N(µ_R, Σ_R) and the invertible flow function F_i. We proceed to discuss the different components of this training framework.

Encode 2D structure: The primary difference between training and generation is that the complete structure of the reference ligand is only available during training. Therefore, for each protein-ligand pair, we use the junction tree encoder (Section 4.1) to encode the reference ligand's structural pattern as φ_R = (µ_R, Σ_R) on the fly, which parameterizes the variational distribution N(µ_R, Σ_R).

Encode 3D geometry: As suggested in Figure 3, this procedure is essentially the inverse of the generation step (Equation 9):

z_i^{(node)}, z_{1:i-1,i}^{(bond)}, z_i^{(pos)} = F_i^{-1}(x_i^{(node)}, x_{1:i-1,i}^{(bond)}, x_i^{(pos)}; e_{1:N-1})   (16)

To be specific, the latents are derived as follows in the training phase:

z_i^{(node)} = (x_i^{(node)} - µ_i^{(node)}) ⊙ (1/σ_i^{(node)})   (17)
z_{1:i-1,i}^{(bond)} = (x_{1:i-1,i}^{(bond)} - µ_{1:i-1,i}^{(bond)}) ⊙ (1/σ_{1:i-1,i}^{(bond)})   (18)
z_i^{(pos)} = (x_i^{(pos)} - µ_i^{(pos)}) ⊙ (1/σ_i^{(pos)})   (19)

where the σ's and µ's are derived in the same way as in Equations 10, 12 and 14.

Optimization: At the core of the variational flow methodology is our belief that molecular data are intrinsically multi-modal, and representations of different modalities should be rendered back into the same molecular entity.
That is to say, whether the input data is a 2D structure or a 3D geometry, a well-trained model should project them to the same distribution in the chemical space. That is why in Figure 3 we expect the generated molecular distribution X̂_i to overlap with the ground-truth distribution X_i, even without the guidance of the 2D structure R. We formalize this intuition as a bi-level optimization task for training step i:

maximize p_{X_i}(x_i)   (20)
subject to D_KL(X_i || X̂_i) < ξ   (21)

Since the F_i are invertible functions (i = 1...N), the LHS of Inequality 21 can be transformed into:

L_KL^b = D_KL(X_i || X̂_i)   (22)
       = D_KL(F_i(Z) || F_i(Ẑ))   (23)
       = D_KL(Z || Ẑ)   (24)
       = D_KL(N(µ_R, Σ_R) || N(0, I))   (25)

where b stands for a particular protein-ligand pair. This is an interesting result: across the N training steps of the same protein-ligand pair, we need only control the KL divergence between N(µ_R, Σ_R) and N(0, I). Note that N(µ_R, Σ_R) is contingent on the structure R; it differs across protein-ligand pairs. Equation 20 corresponds to the flow loss term described in Equation 1. For step i in pair b, it can be elaborated as:

L_i^{(node)} = -log(Prod(N(z_i^{(node)} | µ_R, Σ_R))) - log(Prod(1/σ_i^{(node)}))   (26)
L_{1:i-1,i}^{(bond)} = -log(Prod(N(z_{1:i-1,i}^{(bond)} | µ_R, Σ_R))) - log(Prod(1/σ_{1:i-1,i}^{(bond)}))   (27)
L_i^{(pos)} = -log(Prod(N(z_i^{(pos)} | µ_R, Σ_R))) - log(Prod(1/σ_i^{(pos)}))   (28)
L_flow^{i,b} = L_i^{(node)} + L_{1:i-1,i}^{(bond)} + L_i^{(pos)}   (29)

The bi-level optimization task thus adopts a β-regularized loss form, in which N(µ_R, Σ_R) is involved in both loss terms:

L_total^b = (1/N) Σ_{i=1}^{N} L_flow^{i,b} + β L_KL^b   (30)

We use mini-batch training and optimize the batch loss Σ_{b=1}^{B} L_total^b with Adam (Kingma & Ba, 2014). We use β-annealing to adequately train both the 2D and 3D encoders, and to trade off between binding affinity and controllability of the generated molecules.
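For diagonal Gaussians, the KL term in Equation 25 has a closed form, so the total loss of Equation 30 is cheap to evaluate. A small numpy sketch (function names are ours):

```python
import numpy as np

def kl_to_standard_normal(mu, sigma):
    """D_KL(N(mu, diag(sigma^2)) || N(0, I)) in closed form."""
    return 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - 2.0 * np.log(sigma))

def total_loss(flow_losses, mu_R, sigma_R, beta):
    """Eq. (30): mean per-step flow loss plus the beta-weighted global KL term."""
    return np.mean(flow_losses) + beta * kl_to_standard_normal(mu_R, sigma_R)

# Toy variational parameters and per-step flow losses for one protein-ligand pair.
mu_R, sigma_R = np.array([0.5, -0.5]), np.array([1.0, 1.0])
loss = total_loss([1.2, 0.8, 1.0], mu_R, sigma_R, beta=0.01)
```

The KL vanishes exactly when (µ_R, Σ_R) = (0, I), which is what β-annealing gradually encourages.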
Algorithm 1 from Appendix B explains the training algorithm in more detail.

Controllable generation: During generation, the variational prior provides a flexible interface for controlling certain properties of the generated molecules, without even re-training the model. This is achieved by taking a collection of molecules with a certain desirable property ρ, denoted as {R_a}_{a∈I}, where I is the index set. The latent distribution for the desired property ρ is defined as N(µ_ρ, Σ_ρ), which can be naturally parameterized as:

(µ_ρ, Σ_ρ) = (1/|I|) Σ_{a∈I} (µ_{R_a}, Σ_{R_a})   (31)

where (µ_{R_a}, Σ_{R_a}) (a ∈ I) can be obtained from the tree encoder. Intuitively, (µ_ρ, Σ_ρ) contains an inductive bias for the desired property ρ, and ligands sampled from this distribution should be more likely to possess property ρ.
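Equation 31 is a simple average over the tree-encoder outputs of the reference set; a sketch with toy latents:

```python
import numpy as np

def property_prior(latents):
    """Average the per-ligand (mu, sigma) of a reference set into (mu_rho, sigma_rho),
    as in Eq. (31). `latents` is a list of (mu, sigma) pairs from the tree encoder."""
    mus = np.stack([m for m, _ in latents])
    sigmas = np.stack([s for _, s in latents])
    return mus.mean(axis=0), sigmas.mean(axis=0)

# Toy reference set: three ligands sharing some property rho.
refs = [(np.array([1.0, 0.0]), np.array([1.0, 2.0])),
        (np.array([0.0, 1.0]), np.array([1.0, 1.0])),
        (np.array([2.0, 2.0]), np.array([1.0, 3.0]))]
mu_rho, sigma_rho = property_prior(refs)
```

The resulting pair then replaces (0, I) as the sampling prior at generation time, without touching the trained weights.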

5. EXPERIMENTS

5.1. 3D MOLECULAR GENERATION CONDITIONED ON PROTEIN POCKET

Dataset. We use the benchmark CrossDocked dataset (Francoeur et al., 2020), which contains 22.5 million protein-ligand pairs, to evaluate the generation performance of GraphVF. For a fair comparison, we follow Pocket2Mol (Peng et al., 2022) to prepare and split the data.

Metrics. (3) Quantitative Estimation of Drug-likeness (QED), a measure of drug-likeness based on the concept of desirability (Bickerton et al., 2012); (4) LogP, the octanol-water partition coefficient; good drug candidates typically have a LogP between -0.4 and 5.6 (Ghose et al., 1998); (5) Diversity, calculated as 1 - the average pairwise Tanimoto similarity of the generated molecules for each protein pocket, following Pocket2Mol; (6) Time, the time (in seconds) spent generating 100 molecules for a pocket.

Baselines. We choose GraphBP and Pocket2Mol, which represent the state-of-the-art models for binding molecule generation. We train GraphBP and GraphVF on the dataset for 40 epochs with the same hyperparameters. For Pocket2Mol (Peng et al., 2022), we obtain the pre-trained model from the authors and compute the scores using Gnina.


Results. The comparison results are presented in Table 5. Our GraphVF outperforms the two state-of-the-art baselines in terms of both binding affinity (HA) and Diversity. As shown by the generation time, GraphVF is more efficient than Pocket2Mol by two orders of magnitude. We also attain comparable performance on properties like QED, LogP and SA, even without explicit guidance from the variational prior. We further conduct an ablation experiment with the 'w/o 2D encoder' variant in Table 5. Without the information provided by the 2D encoder, the HA value of the ablated variant drops drastically from 31.1% to 26.3%. This shows that GraphVF learns better with the 2D encoder and significantly benefits from the variational flow architecture. We also provide visualizations of selected generated ligands in Figure 6 of Appendix E.

5.2. SUB-STRUCTURE ANALYSIS

Setup. As Peng et al. (2022) point out, conventional metrics cannot reflect the geometry of sampled molecules, so we conduct additional sub-structure analysis following Pocket2Mol. For the distributions of bond angles and dihedral angles, we evaluate GraphVF by KL divergence on the same benchmarks as Pocket2Mol. We also perform extra experiments on the distribution of bond lengths.

Results. The results are presented in Appendix D. In comparison with GraphBP and Pocket2Mol, GraphVF yields the best results on dihedral angles, which indicates that it is more capable of modeling complex dependencies. At the same time, it achieves comparable results on bond lengths and bond angles.
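A histogram-based KL estimate of the kind used for such angle-distribution comparisons can be sketched as follows; the binning choices are illustrative, not the benchmark's exact protocol.

```python
import numpy as np

def histogram_kl(samples_p, samples_q, bins=18, rng=(0.0, 180.0), eps=1e-6):
    """KL divergence between reference and generated angle distributions,
    estimated from binned samples (with additive smoothing for empty bins)."""
    p, _ = np.histogram(samples_p, bins=bins, range=rng, density=True)
    q, _ = np.histogram(samples_q, bins=bins, range=rng, density=True)
    p, q = p + eps, q + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

# Toy bond angles (degrees) near the tetrahedral value 109.5.
gen = np.random.default_rng(0).normal(109.5, 5.0, 1000)
ref = np.random.default_rng(1).normal(109.5, 5.0, 1000)
kl_close = histogram_kl(ref, gen)          # similar distributions: small KL
kl_far = histogram_kl(ref, gen + 30.0)     # shifted distribution: larger KL
```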

5.3. CONTROLLABLE GENERATION FOR SPECIFIED CHEMICAL SUB-STRUCTURES

Our pretrained framework can be used to encourage desired sub-structures in the generated molecules without losing diversity. We carry out case studies on the generation of molecules containing the following motifs: hydroxyl, peptide bond, 6-membered carbon ring, and 5-membered ring containing S. For each motif, we calculate (µ_ρ, Σ_ρ) over 500 randomly sampled reference ligand molecules that contain the motif as a sub-structure. Finally, we calculate the rate of generated molecules on the test set that contain the desired sub-structure, compared with directly sampling from the prior distribution N(0, I). The experimental results are summarized in Table 3. With prior distributions collected from molecules that contain certain desired sub-structures, our model is more likely to generate ligand molecules with those sub-structures.

Table 3: Controllable generation for specified chemical sub-structures (rate of desired sub-structure, %).

Sub-structure                    w/ latent ρ    w/o latent ρ
hydroxyl                              51.7            42.8
peptide bond                           6.4             1.5
6-membered carbon ring                14.5             0.3
5-membered ring containing S          29.7             0.4

5.4. CONTROLLABLE MOLECULAR GENERATION FOR SPECIFIED DRUG PROPERTIES

Our framework can also be explicitly controlled to generate drug-like molecules with desired properties. To support this claim, we perform case studies under two classical pharmaceutical settings: 1) Antibiotic Discovery (Stokes et al., 2020), which aims to identify molecules that inhibit the growth of E. coli, a bacterium canonically used for testing antibiotic activity; and 2) SARS Inhibition (Tokars & Mesecar, 2021), which aims to identify molecules that inhibit the 3CL protease of SARS-CoV, the pathogen behind the SARS respiratory outbreak of the early 2000s. For antibiotic discovery (and likewise for SARS inhibition), the inhibition scores of all reference ligands in the CrossDocked test set are evaluated via a pretrained ensemble model (Yang et al., 2019). We select the top 5% with the highest inhibition scores to calculate (µ_ρ, Σ_ρ). Results for the two case studies are presented in Table 4, which clearly shows that the latent prior (µ_ρ, Σ_ρ) is effective at steering the desired properties of the generated molecules.

6. CONCLUDING REMARKS

We proposed GraphVF, a novel variational flow-based framework for controllable binding 3D molecule generation. We empirically demonstrated that, by effectively integrating 2D structural semantics and 3D pocket geometry, GraphVF outperforms state-of-the-art strategies for pocket-based 3D molecule generation. We also showed experimentally that GraphVF can effectively generate binding molecules with desired ligand sub-structures and bio-chemical properties. Our work demonstrates that domain constraints can be effectively leveraged by deep generative models to improve the quality of molecule design and fulfill the need for controllable molecule generation, and it sheds light on the potential of generating binding ligands with sophisticated domain knowledge and finer-grained control over a variety of bio-chemical properties.

B ALGORITHMS FOR TRAINING AND GENERATION

The pseudo-code of the training and generation algorithms is given in Algorithms 1 and 2; their inner per-step bodies are as follows.

Algorithm 1 (training), inner steps:
7:  Re-order R with ring-first graph traversal
8:  (µ_R, Σ_R) = JT-Encoder(R), where prior Z ∼ N(µ_R, Σ_R)  ▷ 2D Global
9:  for i = 1, ..., N do  ▷ 3D Autoregressive
10:   Construct sub-graph G_i := τ(P ∪ R_{1:i-1})
11:   Construct SCS upon focal atom f_i
12:   ẽ_{1:M}, e_{1:N-1} = Echnet(G_i)  ▷ Encode 3D conformation
13:   x_i^{(node)} = a_i + u, u ∼ U[0, 1)^d  ▷ Atom type dequantization
14:   µ_i^{(node)}, σ_i^{(node)} = Node-MLP(e_{f_i})
15:   z_i^{(node)} = (x_i^{(node)} - µ_i^{(node)}) ⊙ (1/σ_i^{(node)})
16:   x_{1:i-1,i}^{(bond)} = b_{1:i-1,i} + u, u ∼ U[0, 1)^{(i-1)×d}  ▷ Bond type dequantization
17:   µ_{1:i-1,i}^{(bond)}, σ_{1:i-1,i}^{(bond)} = Bond-MLP(e_{1:i-1}, x_i^{(node)})
18:   z_{1:i-1,i}^{(bond)} = (x_{1:i-1,i}^{(bond)} - µ_{1:i-1,i}^{(bond)}) ⊙ (1/σ_{1:i-1,i}^{(bond)})
19:   x_i^{(pos)} = RBF_{f_i}(r_i)  ▷ Spherize atom position to f_i
20:   µ_i^{(pos)}, σ_i^{(pos)} = Position-MLP(e_{f_i}, x_i^{(node)}, x_{1:i-1,i}^{(bond)})
21:   z_i^{(pos)} = (x_i^{(pos)} - µ_i^{(pos)}) ⊙ (1/σ_i^{(pos)})
22:   L_i^{(node)} = -log(Prod(N(z_i^{(node)} | µ_R, Σ_R))) - log(Prod(1/σ_i^{(node)}))
23:   L_{1:i-1,i}^{(bond)} = -log(Prod(N(z_{1:i-1,i}^{(bond)} | µ_R, Σ_R))) - log(Prod(1/σ_{1:i-1,i}^{(bond)}))
24:   L_i^{(pos)} = -log(Prod(N(z_i^{(pos)} | µ_R, Σ_R))) - log(Prod(1/σ_i^{(pos)}))
25:   L_flow^{i,b} = L_i^{(node)} + L_{1:i-1,i}^{(bond)} + L_i^{(pos)}

Algorithm 2 (generation), inner steps:
13:   ẽ_{1:M}, e_{1:N-1} = Echnet(G_i)  ▷ Encode 3D conformation
14:   Sample z_i^{(node)} ∼ N(µ_ρ, Σ_ρ)
15:   µ_i^{(node)}, σ_i^{(node)} = Node-MLP(e_{f_i})
16:   x_i^{(node)} = σ_i^{(node)} ⊙ z_i^{(node)} + µ_i^{(node)}  ▷ Atom type generation
17:   Sample z_{1:i-1,i}^{(bond)} ∼ N(µ_ρ, Σ_ρ)
18:   µ_{1:i-1,i}^{(bond)}, σ_{1:i-1,i}^{(bond)} = Bond-MLP(e_{1:i-1}, x_i^{(node)})
19:   x_{1:i-1,i}^{(bond)} = σ_{1:i-1,i}^{(bond)} ⊙ z_{1:i-1,i}^{(bond)} + µ_{1:i-1,i}^{(bond)}  ▷ Bond type generation
20:   Sample z_i^{(pos)} ∼ N(µ_ρ, Σ_ρ)
21:   µ_i^{(pos)}, σ_i^{(pos)} = Position-MLP(e_{f_i}, x_i^{(node)}, x_{1:i-1,i}^{(bond)})
22:   x_i^{(pos)} = σ_i^{(pos)} ⊙ z_i^{(pos)} + µ_i^{(pos)}  ▷ Position generation
...
24:   V.append({(a_i, r_i)}); E.append({b_{1:i-1,i}})  ▷ Autoregressive ligand generation
25: end for
26: LigGen_t.append(R)

27: end for
28: end for
29: return [LigGen_1, LigGen_2, ..., LigGen_T]

C IMPLEMENTATION DETAILS

Network Architecture. We stack 6 Echnet layers and 20 tree-GRU layers, and use 6 variational flow layers for generation.

Training Details. We train GraphVF for 40 epochs on the full training split with batch size 4, using the Adam optimizer with learning rate 1e-4 and weight decay 1e-6. β-annealing is applied over the whole training process, with minimum β = 1e-4 and maximum β = 0.015.

Generation Details. We sample 100 molecules for each pocket in the test set. Molecules with fewer than 15 atoms are excluded and re-sampled, while molecules with more than 50 atoms are truncated at the 50th atom. Additionally, to help GraphVF generate ligand molecules with good geometric properties, we restrict the sample space with three validity constraints during generation: 1. A bond must exist between the newly generated atom and the focal atom; 2. At most one other atom may be connected to the newly generated atom with a bond; 3. The newly generated atom may only bond with atoms predicted positive by the focal classifier. Empirically, these constraints help. We apply all validity constraints by default in all experiments; notably, we found that without constraint 1, GraphVF produces better results on HA, so we removed that constraint only for that case.
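The sin² β-annealing schedule from Algorithm 1 can be instantiated with the β_min, β_max and epoch count stated above:

```python
import math

def beta_schedule(t, T, beta_min=1e-4, beta_max=0.015):
    """beta-annealing used during training: rises from beta_min to beta_max
    mid-training and falls back (the sin^2 schedule from Algorithm 1)."""
    return beta_min + (beta_max - beta_min) * math.sin(math.pi * t / T) ** 2

T = 40  # epochs, matching the training details above
betas = [beta_schedule(t, T) for t in range(1, T + 1)]
```

The peak at t = T/2 lets the flow loss dominate early and late in training, while the KL regularizer is strongest in the middle.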

D RESULTS ON STRUCTURAL ANALYSIS




Figure 1: The ligand molecule R (e.g., DMT) is first parsed into a collection of canonical sub-structures, then pooled into a junction-tree structure, and finally encoded into φ_R = (µ_R, Σ_R).

Figure 2: Generation procedure of GraphVF. Atoms are added autoregressively; their types, bonds, and positions are sampled from the prior distribution N(µ_ρ, Σ_ρ) and predicted via the normalizing flow.

Figure 3: Comparison of a single training/generation step in the variational flow framework.
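The per-dimension affine transform at the heart of each flow step can be sketched in a few lines. Here `flow_forward`/`flow_inverse` are hypothetical names, and µ and σ stand in for the outputs of the Node/Bond/Position-MLPs; the log-det term matches the log(Prod(1/σ)) contribution in the flow loss.

```python
import numpy as np

def flow_forward(x, mu, sigma):
    """Training direction: map dequantized data x to latent
    z = (x - mu) / sigma, and return the log|det| contribution
    log(prod(1 / sigma)) used in the flow likelihood."""
    z = (x - mu) / sigma
    log_det = np.sum(np.log(1.0 / sigma))
    return z, log_det

def flow_inverse(z, mu, sigma):
    """Generation direction: map a latent sample z back to data space,
    x = sigma * z + mu."""
    return sigma * z + mu

# Round trip on a toy 3-dimensional example.
mu = np.array([0.2, -0.5, 1.0])
sigma = np.array([0.5, 2.0, 1.5])
x = np.array([1.0, 0.0, -1.0])
z, log_det = flow_forward(x, mu, sigma)
x_rec = flow_inverse(z, mu, sigma)
```

The exact inverse relationship between the two directions is what lets the same learned (µ, σ) serve both the training likelihood and the generation sampler.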



Figure 4: Visualization of bond distributions. (a) CC Bond Length; (b) COC Bond Angle; (c) CCCC Dihedral Angle.

Figure 5: (a) Comparison between the structures of Serotonin and DMT; (b) Binding pose of DMT with 5-HT 2A; note in particular the interaction between the NHMe2 group and Asp-231.

Figure 6: Examples of generated molecules with higher binding affinity (Gnina score ↑) than the reference molecules. Protein names and residue/ligand IDs are listed on top.

Comparison among representative molecular generative methods.

Performance of different methods on 3D molecular generation based on protein pockets. Higher values indicate better results. Best results are in bold. The result marked with * is obtained with special valency constraints; valency constraints are detailed in Appendix C.

Setup. Following GraphBP (Liu et al., 2022) and Pocket2Mol, we randomly sample 100 molecules for every protein pocket in the generation stage. The quality of the generated molecules is evaluated by 6 widely adopted metrics: (1) High Affinity (HA), the percentage of generated molecules whose CNNAffinity, calculated by the Gnina program (McNutt et al., 2021), is higher than that of the reference molecule; (2) Synthetic Accessibility (SA), which represents the ease of drug synthesis;
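A minimal sketch of the HA metric, assuming the per-molecule affinity scores (e.g., Gnina CNNAffinity, where higher is better) have already been computed; the function name is illustrative.

```python
def high_affinity(gen_scores, ref_score):
    """Fraction of generated molecules whose predicted binding affinity
    exceeds that of the reference ligand for the same pocket."""
    if not gen_scores:
        return 0.0
    return sum(s > ref_score for s in gen_scores) / len(gen_scores)
```

In the evaluation above this fraction is computed per pocket over the 100 sampled molecules and then reported as a percentage.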

Controllable Generation for Antibiotic Discovery and SARS Inhibition.

The KL divergence of the bond distance (upper part), bond angle (middle part) and dihedral angle (lower part) distributions with respect to the test set. Lowercase letters denote atoms in aromatic rings. Best results are in bold.
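A minimal sketch of how such divergences can be computed, assuming each geometric statistic (bond lengths, bond angles, dihedrals) has been binned into normalized histograms for the generated set and the test set; the `eps` floor is our assumption to guard against empty bins.

```python
import math

def kl_divergence(p, q, eps=1e-10):
    """KL(P || Q) for two normalized discrete histograms, e.g. binned
    bond-length distributions of generated vs. test-set molecules.
    Zero-probability bins of P contribute nothing; zero bins of Q
    are floored at eps to keep the sum finite."""
    return sum(pi * math.log(pi / max(qi, eps))
               for pi, qi in zip(p, q) if pi > 0)
```

Identical histograms give a divergence of 0; mass placed where the test set has little yields a large penalty.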

A.1 PARSING AND POOLING

There are 3 types of canonical sub-structures, as exemplified by the DMT molecule in Figure 1: 1) rings, e.g. the blue and green nodes; 2) non-ring covalent atom pairs, e.g. the red, yellow and purple nodes; 3) pivot atoms that are connected to 3 or more items, e.g. the gray node. The rules for identifying sub-structures are self-contained, yielding a relatively sparse and stable vocabulary: a total of 427 canonical sub-structures are identified from the 100,000 reference ligands in the CrossDocked dataset. Once the ligand molecule is parsed into a collection of sub-structures, the molecular graph can be pooled into a junction tree in a straightforward manner, where each sub-structure corresponds to a tree node, and any two intersecting sub-structures yield an edge between their corresponding nodes.
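The pooling step can be sketched as follows, under the simplifying assumption that each parsed sub-structure is represented as a set of atom indices (the actual implementation carries richer sub-structure types).

```python
from itertools import combinations

def pool_junction_tree(substructures):
    """Pool parsed sub-structures (each a set of atom indices) into a
    junction-tree graph: one node per sub-structure, and one edge between
    any two sub-structures that share at least one atom."""
    nodes = list(range(len(substructures)))
    edges = [(i, j) for i, j in combinations(nodes, 2)
             if substructures[i] & substructures[j]]
    return nodes, edges

# Toy example: a 6-atom ring, a covalent pair sharing atom 5,
# and a pivot atom shared with the pair.
subs = [{0, 1, 2, 3, 4, 5}, {5, 6}, {6}]
nodes, edges = pool_junction_tree(subs)
```

Because any two sub-structures overlap in at most a shared atom (or bond), the resulting intersection graph is tree-structured for parsed molecules.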

A.2 TREE GRU ENCODER

We adopt the same junction tree encoder architecture as Jin et al. (2018). Namely, the tree message-passing scheme arbitrarily selects a leaf node as the root and passes messages from child nodes to parent nodes iteratively, in a bottom-up fashion. We denote the message from node i to node j as m_ij, which is updated via a GRU adapted for tree propagation, where x_i is a one-hot vector indicating the type of canonical sub-structure of node i. The latent representation h_i of each node is derived by aggregating all the inward messages from its child nodes. Finally, the structural encoding of the whole molecule is obtained by feeding h_root through an MLP. The resulting vector φ_R is split into two equally-sized dense vectors (µ_R, Σ_R), which parameterize the mean and variance of an amortized diagonal Gaussian distribution N(µ_R, Σ_R). It should be emphasized that all the parameters that appear in this section are trainable: our earlier attempts with a pretrained version of the GRU resulted in a serious degradation of the quality of the generated molecules, so end-to-end training of these parameters is ideal for achieving better model performance.
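The bottom-up traversal can be sketched with a toy aggregation in place of the GRU cell; this illustrates only the child-to-parent message order, not the actual gated updates of the tree GRU.

```python
def encode_tree(children, node_feats, root=0):
    """Bottom-up tree encoding: each node aggregates its own feature with
    the messages from its children (a toy sum stands in for the tree GRU
    of Jin et al. (2018)); the recursion bottoms out at the leaves and
    returns the root representation."""
    def message(node):
        return node_feats[node] + sum(message(c) for c in children.get(node, []))
    return message(root)

# Toy tree rooted at 0, with children 0 -> {1, 2} and 1 -> {3}.
children = {0: [1, 2], 1: [3]}
feats = {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0}
h_root = encode_tree(children, feats)  # 1 + (2 + 4) + 3 = 10
```

In the real encoder, `h_root` would then be fed through the MLP and split into (µ_R, Σ_R).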

