EQUIVARIANT SHAPE-CONDITIONED GENERATION OF 3D MOLECULES FOR LIGAND-BASED DRUG DESIGN

Abstract

Shape-based virtual screening is widely used in ligand-based drug design to search chemical libraries for molecules with similar 3D shapes yet novel 2D graph structures compared to known ligands. 3D deep generative models can potentially automate this exploration of shape-conditioned 3D chemical space; however, no existing models can reliably generate geometrically realistic drug-like molecules in conformations with a specific shape. We introduce a new multimodal 3D generative model that enables shape-conditioned 3D molecular design by equivariantly encoding molecular shape and variationally encoding chemical identity. We ensure local geometric and chemical validity of generated molecules by using autoregressive fragment-based generation with heuristic bonding geometries, allowing the model to prioritize the scoring of rotatable bonds to best align the growing conformation to the target shape. We evaluate our 3D generative model in tasks relevant to drug design including shape-conditioned generation of chemically diverse molecular structures and shape-constrained molecular property optimization, demonstrating its utility over virtual screening of enumerated libraries.

1. INTRODUCTION

Generative models for de novo molecular generation have revolutionized computer-aided drug design (CADD) by enabling efficient exploration of chemical space, goal-directed molecular optimization (MO), and automated creation of virtual chemical libraries (Segler et al., 2018; Meyers et al., 2021; Huang et al., 2021; Wang et al., 2022; Du et al., 2022; Bilodeau et al., 2022). Recently, several 3D generative models have been proposed to directly generate low-energy or (bio)active molecular conformations using 3D convolutional networks (CNNs) (Ragoza et al., 2020), reinforcement learning (RL) (Simm et al., 2020a;b), autoregressive generators (Gebauer et al., 2022; Luo & Ji, 2022), or diffusion models (Hoogeboom et al., 2022). These methods have especially enjoyed accelerated development for structure-based drug design (SBDD), where models are trained to generate drug-like molecules in favorable binding poses inside an explicit protein pocket (Drotár et al., 2021; Luo et al., 2022; Liu et al., 2022; Ragoza et al., 2022). However, SBDD requires atomically-resolved structures of a protein target, assumes knowledge of binding sites, and often ignores dynamic pocket flexibility, rendering these methods less effective in many CADD settings. Ligand-based drug design (LBDD) does not assume knowledge of protein structure. Instead, molecules are compared against previously identified "actives" on the basis of 3D pharmacophore or 3D shape similarity, under the principle that molecules with similar structures should share similar activity (Vázquez et al., 2020; Cleves & Jain, 2020). In particular, ROCS (Rapid Overlay of Chemical Structures) is commonly used as a shape-based virtual screening tool to identify molecules with similar shapes to a reference inhibitor and has shown promising results for scaffold-hopping tasks (Rush et al., 2005; Hawkins et al., 2007; Nicholls et al., 2010).
However, virtual screening relies on enumeration of chemical libraries, fundamentally restricting its ability to probe new chemical space. Here, we consider the novel task of generating chemically diverse 3D molecular structures conditioned on a molecular shape, thereby facilitating the shape-conditioned exploration of chemical space without the limitations of virtual screening (Fig. 1). Importantly, shape-conditioned 3D molecular generation presents unique challenges not encountered in typical 2D generative models:

Figure 1: We explore the task of shape-conditioned 3D molecular generation to generate chemically diverse molecules in 3D conformations with high shape similarity to an encoded target shape.

Challenge 1. 3D shape-based LBDD involves pairwise comparisons between arbitrary conformations of arbitrary molecules. Whereas traditional property-conditioned generative models or MO algorithms shift learned data distributions to optimize a single scalar property, a shape-conditioned generative model must generate molecules adopting any reasonable shape encoded by the model.

Challenge 2. Shape similarity metrics that compute volume overlaps between two molecules (e.g., ROCS) require the molecules to be aligned in 3D space. Unlike 2D similarity, the computed shape similarity between two molecules changes if one of the structures is rotated. This subtly impacts the learning problem: if the model encodes the target 3D shape into an SE(3)-invariant representation, it must learn how the generated molecule would fit the target shape under the implicit action of an SE(3)-alignment. Alternatively, if the model can natively generate an aligned structure, then it can more easily learn to construct molecules that fit the target shape.

Challenge 3. A molecule's 2D graph topology and 3D shape are highly dependent; small changes in the graph can strikingly alter the shapes accessible to a molecule.
It is thus unlikely that a generative model will reliably generate chemically diverse molecules with similar shapes to an encoded target without 1) simultaneous graph and coordinate generation, and 2) explicit shape-conditioning.

Challenge 4. The distribution of shapes a drug-like molecule can adopt is chiefly influenced by rotatable bonds, the foremost source of molecular flexibility. However, existing 3D generative models are mainly developed using tiny molecules (e.g., fewer than 10 heavy atoms) and cannot generate flexible drug-like molecules while maintaining chemical validity (satisfying valencies), geometric validity (non-distorted bond distances and angles; no steric clashes), and chemical diversity.

To surmount these challenges, we design a new generative model, SQUID, to enable the shape-conditioned generation of chemically diverse molecules in 3D. Our contributions are as follows:

• Given a 3D molecule with a target shape, we use equivariant point cloud networks to encode the shape into (rotationally) equivariant features. We then use graph neural networks (GNNs) to variationally encode chemical identity into invariant features. By mixing chemical features with equivariant shape features, we can generate diverse molecules in aligned poses that fit the shape.

• We develop a sequential fragment-based 3D generation procedure that fixes local bond lengths and angles to prioritize the scoring of rotatable bonds. By massively simplifying 3D coordinate generation, we generate drug-like molecules while maintaining chemical and geometric validity.

• We design a rotatable bond scoring network that learns how local bond rotations affect global shape, enabling our decoder to generate 3D conformations that best fit the target shape.
We evaluate the utility of SQUID over virtual screening in shape-conditioned 3D molecular design tasks that mimic ligand-based drug design objectives, including shape-conditioned generation of diverse 3D structures and shape-constrained molecular optimization. To inspire further research, we note that our tasks could also be approached with a hypothetical 3D generative model that disentangles latent variables controlling 2D chemical identity and 3D shape, thus enabling zero-shot generation of topologically distinct molecules with similar shapes to any encoded target.

2. RELATED WORK

Fragment-based molecular generation. Seminal works in autoregressive molecular generation applied language models to generate 1D SMILES strings character-by-character (Gómez-Bombarelli et al., 2018; Segler et al., 2018), or GNNs to generate 2D molecular graphs atom-by-atom (Liu et al., 2018; Simonovsky & Komodakis, 2018; Li et al., 2018). Recent works construct molecules fragment-by-fragment to improve the chemical validity of intermediate graphs and to scale generation to larger molecules (Podda et al., 2020; Jin et al., 2019; 2020). Our fragment-based decoder is related to MoLeR (Maziarz et al., 2022), which iteratively generates molecules by selecting a new fragment (or atom) to add to the partial graph, choosing attachment sites on the new fragment, and predicting new bonds to the partial graph. Yet, MoLeR only generates 2D graphs; we generate 3D molecular structures. Beyond 2D generation, Flam-Shepherd et al. (2022) use an RL agent to generate 3D molecules by sampling and connecting molecular fragments. However, they sample from a small multiset of fragments, restricting the accessible chemical space. Powers et al. (2022) use fragments to generate 3D molecules inside a protein pocket, but only consider 7 distinct rings.

Generation of drug-like molecules in 3D. In this work, we generate novel drug-like 3D molecular structures in free space, e.g., not conformers given a known molecular graph (Ganea et al., 2021; Jing et al., 2022). Myriad models have been proposed to generate small 3D molecules, such as E(3)-equivariant normalizing flows and diffusion models (Satorras et al., 2022a; Hoogeboom et al., 2022), RL agents with an SE(3)-covariant action space (Simm et al., 2020b), and autoregressive generators that build molecules atom-by-atom with SE(3)-invariant internal coordinates (Luo & Ji, 2022; Gebauer et al., 2022). However, fewer 3D generative models can generate larger drug-like molecules for realistic chemical design tasks.
Of these, Hoogeboom et al. (2022) and Arcidiacono & Koes (2021) fail to generate chemically valid molecules, while Ragoza et al. (2020) rely on postprocessing and geometry relaxation to extract stable molecules from their generated atom density grids. Only Roney et al. (2021) and Li et al. (2021), who develop autoregressive generators that simultaneously predict graph structure and internal coordinates, have been shown to reliably generate valid drug-like molecules. We also couple graph generation with 3D coordinate prediction; however, we employ fragment-based generation with fixed local geometries to ensure local chemical and geometric validity. Further, we focus on shape-conditioned molecular design; none of these works can natively address the aforementioned challenges posed by shape-conditioned molecular generation.

Shape-conditioned molecular generation. Other works partially address shape-conditioned 3D molecular generation. Skalic et al. (2019) and Imrie et al. (2021) train networks to generate 1D SMILES strings or 2D molecular graphs conditioned on CNN encodings of 3D pharmacophores. However, they do not generate 3D structures, and the CNNs do not respect Euclidean symmetries. Zheng et al. (2021) use supervised molecule-to-molecule translation on SMILES strings for scaffold-hopping tasks, but do not generate 3D structures. Papadopoulos et al. (2021) use REINVENT (Olivecrona et al., 2017) on SMILES strings to propose molecules whose conformers are shape-similar to a target, but they must re-optimize the agent for each target shape. Roney et al. (2021) fine-tune a 3D generative model on the hits of a ROCS virtual screen of >10^10 drug-like molecules to shift the learned distribution towards a target shape. Yet, this expensive screening approach must be repeated for each new target. Instead, we seek to achieve zero-shot generation of 3D molecules with similar shapes to any encoded shape, without requiring fine-tuning or post facto optimization.
Equivariant geometric deep learning on point clouds. Various equivariant networks have been designed to encode point clouds for updating coordinates in R^3 (Satorras et al., 2022b), predicting tensorial properties (Thomas et al., 2018), or modeling 3D structures natively in Cartesian space (Fuchs et al., 2020). Especially noteworthy are architectures which lift scalar neuron features to vector features in R^3 and employ simple operations to mix invariant and equivariant features without relying on expensive higher-order tensor products or Clebsch-Gordan coefficients (Deng et al., 2021; Jing et al., 2021). In this work, we employ Deng et al. (2021)'s Vector Neurons (VN)-based equivariant point cloud encoder VN-DGCNN to encode molecules into equivariant latent representations in order to generate molecules which are natively aligned to the target shape. Two recent works also employ VN operations for structure-based drug design and linker design (Peng et al., 2022; Huang et al., 2022). Huang et al. (2022) also build molecules in free space; however, they generate just a few atoms to connect existing fragments and do not condition on molecular shape.

3. METHODOLOGY

Problem definition. We model a conditional distribution P(M|S) over 3D molecules M = (G, G) with graph G and atomic coordinates G = {r_a ∈ R^3}, given a 3D molecular shape S. Specifically, we aim to sample molecules M′ ∼ P(M|S) with high shape similarity (sim_S(M′, M_S) ≈ 1) and low graph (chemical) similarity (sim_G(M′, M_S) < 1) to a target molecule M_S with shape S. This scheme differs from 1) typical 3D generative models that learn P(M) without modeling P(M|S), and from 2) shape-conditioned 1D/2D generators that attempt to model P(G|S), the distribution of molecular graphs that could adopt shape S, but do not actually generate specific 3D conformations. We define graph (chemical) similarity sim_G ∈ [0, 1] between two molecules as the Tanimoto similarity computed by RDKit with default settings (2048-bit fingerprints). We define shape similarity sim*_S ∈ [0, 1] using Gaussian descriptions of molecular shape, modeling atoms a ∈ M_A and b ∈ M_B from molecules M_A and M_B as isotropic Gaussians in R^3 (Grant & Pickup, 1995; Grant et al., 1996). We compute sim*_S using (2-body) volume overlaps between atom-centered Gaussians:

sim*_S(G_A, G_B) = V_AB / (V_AA + V_BB − V_AB);  V_AB = Σ_{a∈A, b∈B} V_ab;  V_ab ∝ exp(−(α/2) ||r_a − r_b||²),

where α controls the Gaussian width. We then define sim_S(M_A, M_B) = max_{R,t} sim*_S(G_A R + t, G_B) as the shape similarity when M_A is optimally aligned to M_B. We perform such alignments with ROCS.

Approach. At a high level, we model P(M|S) with an encoder-decoder architecture. Given a molecule M_S = (G_S, G_S) with shape S, we encode S (a point cloud) into equivariant features. We then variationally encode G_S into atomic features, conditioned on the shape features. We then mix these shape and atom features to pass global SE(3) {in,equi}variant latent codes to the decoder, which samples new molecules from P(M|S).
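The volume-overlap similarity sim*_S above can be sketched in a few lines of numpy. This sketch uses an illustrative width α and performs no alignment; a full implementation such as ROCS also optimizes over rotations R and translations t:

```python
import numpy as np

def gaussian_shape_similarity(coords_a, coords_b, alpha=0.81):
    """Volume-overlap shape similarity between two atom-coordinate arrays.

    Atoms are modeled as identical isotropic Gaussians, so the pairwise
    overlap V_ab is proportional to exp(-(alpha/2) * ||r_a - r_b||^2).
    `alpha` is an illustrative width parameter, not the paper's exact value.
    """
    def overlap(x, y):
        # squared pairwise distances between all atoms of the two molecules
        d2 = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
        return np.exp(-0.5 * alpha * d2).sum()

    v_ab = overlap(coords_a, coords_b)
    v_aa = overlap(coords_a, coords_a)
    v_bb = overlap(coords_b, coords_b)
    return v_ab / (v_aa + v_bb - v_ab)
```

By construction, a molecule compared against itself (already aligned) scores exactly 1, and the score decays toward 0 as the two structures are pulled apart.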
We autoregressively generate molecules by factoring P(M|S) = P(M_0|S) P(M_1|M_0, S) ... P(M_n|M_{n−1}, S), where each M_l = (G_l, G_l) is a partial molecule defined by a BFS traversal of a tree representation of the molecular graph (Fig. 2). Tree-nodes denote either non-ring atoms or rigid (ring-containing) fragments, and tree-links denote acyclic (rotatable, double, or triple) bonds. We generate M_{l+1} by growing the graph G_{l+1} around a focus atom/fragment, and then predict the coordinates G_{l+1} by scoring a query rotatable bond to best fit shape S.
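The BFS traversal that orders the autoregressive factorization can be sketched as follows; the tree here (mapping each tree-node to its children) is a hypothetical toy example, not a structure produced by the model:

```python
from collections import deque

def bfs_generation_order(tree, root):
    """Return the order in which tree-nodes (atoms or rigid fragments)
    become the 'focus' during generation. `tree` maps node -> children."""
    order, queue = [], deque([root])
    while queue:
        focus = queue.popleft()        # node whose neighborhood is grown next
        order.append(focus)
        for child in tree.get(focus, []):
            queue.append(child)        # children are scored once in focus
    return order
```

Each popped focus corresponds to one factor P(M_{l+1}|M_l, S) in the product above.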

Simplifying assumptions.

(1) We ignore hydrogens and only consider heavy atoms, as is common in molecular generation.

(2) We only consider molecules with fragments present in our fragment library to ensure that graph generation can be expressed as tree generation.

(3) Rather than generating all coordinates, we use rigid fragments, fix bond distances, and set bond angles according to hybridization heuristics (App. A.8); this lets the model focus on scoring rotatable bonds to best fit the growing conformer to the encoded shape.

(4) We seed generation with M_0 (the root tree-node), restricted to be a small (3-6 atom) substructure from M_S; hence, we only model P(M|S, M_0).
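Assumption (3) can be illustrated with a minimal planar sketch: given a hybridization label, a new neighbor is placed at a fixed bond length and an idealized VSEPR angle relative to the existing bond. The angle table and the 2D placement are illustrative stand-ins; App. A.8 describes the heuristics actually used:

```python
import math

# Idealized VSEPR bond angles in degrees (illustrative values)
IDEAL_ANGLES = {"sp3": 109.5, "sp2": 120.0, "sp": 180.0}

def place_neighbor_2d(prev, focus, bond_length, hybridization):
    """Place a new atom bonded to `focus` so that the angle
    prev-focus-new equals the idealized angle (planar 2D sketch)."""
    theta = math.radians(IDEAL_ANGLES[hybridization])
    # unit vector pointing from focus back toward prev
    ux, uy = prev[0] - focus[0], prev[1] - focus[1]
    norm = math.hypot(ux, uy)
    ux, uy = ux / norm, uy / norm
    # rotate (ux, uy) by theta to obtain the new bond direction
    dx = math.cos(theta) * ux - math.sin(theta) * uy
    dy = math.sin(theta) * ux + math.cos(theta) * uy
    return (focus[0] + bond_length * dx, focus[1] + bond_length * dy)
```

With geometry fixed this way, the only continuous degrees of freedom left to the model are the dihedrals of rotatable bonds.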

3.1. ENCODER

Featurization. We construct a molecular graph G using atoms as nodes and bonds as edges. We featurize each node with the atomic mass; one-hot codes of atomic number, charge, and aromaticity; and one-hot codes of the number of single, double, aromatic, and triple bonds the atom forms (including bonds to implicit hydrogens). This helps us fix bond angles during generation (App. A.8). We featurize each edge with one-hot codes of bond order. We represent a shape S as a point cloud built by sampling n_p points from each of n_h atom-centered Gaussians with (adjustable) variance σ_p².

Fragment encoder. We also featurize each node with a learned embedding f_i ∈ R^{d_f} of the atom/fragment type to which that atom belongs, making each node "fragment-aware" (similar to MoLeR). In principle, fragments could be any rigid substructure with ≥ 2 atoms. Here, we specify fragments as ring-containing substructures without acyclic single bonds (Fig. 14). We construct a library L_f of atom/fragment types by extracting the top-k (k = 100) most frequent fragments from the dataset and adding these, along with each distinct atom type, to L_f (App. A.13). We then encode each atom/fragment in L_f with a simple GNN (App. A.12) to yield the global atom/fragment embeddings:

f_i = Σ_a h^(a)_fi,  where {h^(a)_fi} = GNN_Lf(G_fi)  ∀ f_i ∈ L_f,

and h^(a)_fi are per-atom features.

Figure 2: Encoder-decoder architecture of SQUID, which equivariantly encodes molecular shape and variationally encodes chemical identity to generate chemically diverse 3D molecules that fit the shape. SQUID generates molecules atom-by-atom and fragment-by-fragment, iteratively growing the molecular graph in a tree-expansion and generating 3D coordinates by scoring rotatable bonds.

Shape encoder. Given M_S with n_h heavy atoms, we use VN-DGCNN (App. A.11) to encode the molecular point cloud P_S ∈ R^{(n_h n_p)×3} into a set of equivariant per-point vector features X̃_p ∈ R^{(n_h n_p)×q×3}.
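The point-cloud construction of S described above (n_p samples from each atom-centered Gaussian) can be sketched with numpy; the default n_p and σ_p here are placeholders, not the paper's settings:

```python
import numpy as np

def molecule_to_point_cloud(coords, n_p=5, sigma_p=0.5, seed=0):
    """Sample n_p points from an isotropic Gaussian centered on each heavy
    atom, giving a point cloud of shape (n_h * n_p, 3)."""
    rng = np.random.default_rng(seed)
    n_h = coords.shape[0]
    noise = rng.normal(scale=sigma_p, size=(n_h, n_p, 3))
    points = coords[:, None, :] + noise       # broadcast atom centers
    return points.reshape(n_h * n_p, 3)
```

The variance σ_p² controls how "soft" the encoded surface is: larger values blur atomic detail, smaller values approach the bare atom centers.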
We then locally mean-pool the n_p equivariant features per atom:

X̃_p = VN-DGCNN(P_S);  X̃ = LocalPool(X̃_p),

where X̃ ∈ R^{n_h×q×3} are per-atom equivariant representations of the molecular shape. Because VN operations are SO(3)-equivariant, rotating the point cloud will rotate X̃: X̃R = LocalPool(VN-DGCNN(P_S R)). Although VN operations are strictly SO(3)-equivariant, we subtract the molecule's centroid from the atomic coordinates prior to encoding, making X̃ effectively SE(3)-equivariant. Throughout this work, we denote SO(3)-equivariant vector features with tildes.

Variational graph encoder. To model P(M|S), we first use a GNN (App. A.12) to encode G_S into learned atom embeddings H = {h^(a) ∀ a ∈ G_S}. We condition the GNN on per-atom invariant shape features X = {x^(a)} ∈ R^{n_h×6q}, which we form by passing X̃ through a VN-Inv (App. A.11):

H = GNN((H_0, X); G_S);  X = VN-Inv(X̃),

where H_0 ∈ R^{n_h×(d_a+d_f)} are the initial atom features concatenated with the learned fragment embeddings, H ∈ R^{n_h×d_h}, and (·, ·) denotes concatenation in the feature dimension. For each atom in M_S, we then encode h^(a)_μ, h^(a)_logσ² = MLP(h^(a)) and sample h^(a)_var ∼ N(h^(a)_μ, h^(a)_σ):

H_var = {h^(a)_var = h^(a)_μ + ε^(a) ⊙ h^(a)_σ;  h^(a)_σ = exp(½ h^(a)_logσ²)  ∀ a ∈ G_S},

where ε^(a) ∼ N(0, 1) ∈ R^{d_h}, H_var ∈ R^{n_h×d_h}, and ⊙ denotes elementwise multiplication. Here, the second argument of N(·, ·) is the standard deviation vector of the diagonal covariance matrix.

Mixing shape and variational features. The variational atom features H_var are insensitive to rotations of S. However, we desire the decoder to construct molecules in poses that are natively aligned to S (Challenge 2). We achieve this by conditioning the decoder on an equivariant latent representation of P(M|S) that mixes both shape and chemical information. Specifically, we mix H_var with X̃ by encoding each h^(a)_var ∈ H_var into linear transformations, which are applied atom-wise to X̃.
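The variational sampling of H_var is the standard reparameterization trick; a minimal numpy sketch (the MLP that produces the mean and log-variance is omitted):

```python
import numpy as np

def sample_h_var(h_mu, h_log_sigma2, rng=None):
    """Reparameterized sample h_var = mu + eps * sigma, with
    sigma = exp(0.5 * log sigma^2) and eps ~ N(0, 1)."""
    rng = rng or np.random.default_rng(0)
    h_sigma = np.exp(0.5 * h_log_sigma2)
    eps = rng.normal(size=h_mu.shape)
    return h_mu + eps * h_sigma
```

Writing the sample as a deterministic function of (μ, log σ²) plus independent noise is what allows gradients to flow through the sampling step during training.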
We then pass the mixed equivariant features through a separate VN-MLP:

X̃^(a)_Hvar = VN-MLP(W^(a)_H X̃^(a), X̃^(a));  W^(a)_H = Reshape(MLP(h^(a)_var))  ∀ a ∈ G_S,

where W^(a)_H ∈ R^{q′×q}, X̃^(a) ∈ R^{q×3}, and X̃_Hvar ∈ R^{n_h×d_z×3}. This maintains equivariance since W^(a)_H (X̃^(a) R) = (W^(a)_H X̃^(a)) R. We then sum-pool the per-atom features into global latent codes Z̃ = Σ_a X̃^(a)_Hvar and z = Σ_a x^(a)_Hvar, where x^(a)_Hvar are the corresponding invariant features obtained via a VN-Inv. In summary, the encoder samples h^(a)_var for each atom in M_S, mixes H_var with X̃, and passes the resultant (Z̃, z) to the decoder.

3.2. DECODER

We seed generation with a small structure M_0 (extracted from M_S), and build M′ by sequentially generating larger structures M′_{l+1} in a tree-like manner (Fig. 2). Specifically, we grow new atoms/fragments around a "focus" atom/fragment in M′_l, which is popped from a BFS queue. To generate M′_{l+1} from M′_l (e.g., grow the tree from the focus), we factor P(M_{l+1}|M_l, S) = P(G_{l+1}|M_l, S) P(G_{l+1}|G_{l+1}, M_l, S). Given (Z̃, z), we sample the new graph G′_{l+1} by iteratively attaching a variable number C of new atoms/fragments (children tree-nodes) around the focus, yielding G′^(c)_l for c = 1, ..., C, where G′^(C)_l = G′_{l+1} and G′^(0)_l = G′_l. We then generate coordinates G′_{l+1} by scoring the (rotatable) bond between the focus and its parent tree-node. New bonds from the focus to its children are left unscored in M′_{l+1} until the children become "in focus".

Partial molecule encoder. Before bonding each new atom/fragment to the focus (or scoring bonds), we encode the partial molecule M′^(c−1)_l with the same scheme as for M_S (using a parallel encoder; Fig. 2), except we do not variationally embed H′. Instead, we process H′ analogously to H_var. Further, in addition to globally pooling the per-atom embeddings to obtain Z̃′ = Σ_a X̃′^(a)_H and z′ = Σ_a x′^(a)_H, we also selectively sum-pool the embeddings of the atom(s) in focus, yielding Z̃′_foc = Σ_{a∈focus} X̃′^(a)_H and z′_foc = Σ_{a∈focus} x′^(a)_H. We then align the equivariant representations of M′^(c−1)_l and M_S by concatenating Z̃, Z̃′, Z̃ − Z̃′, and Z̃′_foc and passing these through a VN-MLP:

Z̃_dec = VN-MLP(Z̃, Z̃′, Z̃ − Z̃′, Z̃′_foc).
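The mixing step maintains equivariance because the learned matrix acts on the feature (channel) dimension while rotations act on the spatial dimension, so the two commute. A quick numerical check of this identity with random matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
q, q_prime = 8, 4
W = rng.normal(size=(q_prime, q))   # learned, rotation-independent weights
X = rng.normal(size=(q, 3))         # equivariant vector features (q channels in R^3)

# a rotation about the z-axis
t = 0.7
R = np.array([[np.cos(t), -np.sin(t), 0.0],
              [np.sin(t),  np.cos(t), 0.0],
              [0.0,        0.0,       1.0]])

# mixing channels then rotating equals rotating then mixing channels
assert np.allclose((W @ X) @ R, W @ (X @ R))
```

This is the same associativity argument that underlies Vector Neurons: any left-multiplication on the channel axis of a (channels × 3) feature commutes with any right-multiplication by a rotation.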
Note that Z̃_dec ∈ R^{q×3} is equivariant to rotations of the overall system (M′^(c−1)_l, M_S). Finally, we form a global invariant feature z_dec ∈ R^{d_dec} to condition graph (or coordinate) generation:

z_dec = (VN-Inv(Z̃_dec), z, z′, z − z′, z′_foc).

Graph generation. We factor P(G_{l+1}|M_l, S) into a sequence of generation steps by which we iteratively connect children atoms/fragments to the focus until the network generates a (local) stop token. Given z_dec, the model first predicts whether to stop (local) generation via p_∅ = sigmoid(MLP_∅(z_dec)) ∈ (0, 1). If p_∅ ≥ τ_∅ (a threshold; App. A.16), we stop and proceed to bond scoring. Otherwise, we select which atom a_foc on the focus (if multiple) to grow from:

p_focus = softmax({MLP_focus(z_dec, x′^(a)_H) ∀ a ∈ focus}).

The decoder then predicts which atom/fragment f_next ∈ L_f to connect to the focus next:

p_next = softmax({MLP_next(z_dec, x′^(afoc)_H, f_fi) ∀ f_i ∈ L_f}).

If the selected f_next is a fragment, we predict the attachment site a_site on the fragment G_fnext:

p_site = softmax({MLP_site(z_dec, x′^(afoc)_H, f_next, h^(a)_fnext) ∀ a ∈ G_fnext}),

where h^(a)_fnext are the encoded atom features for G_fnext. Lastly, we predict the bond order (single, double, or triple) via p_bond = softmax(MLP_bond(z_dec, x′^(afoc)_H, f_next, h^(asite)_fnext)). We repeat this sequence of steps until p_∅ ≥ τ_∅, yielding G_{l+1}. At each step, we greedily select the action after masking actions that violate known chemical valence rules. After each sequence, we bond a new atom or fragment to the focus, giving G′^(c)_l. If an atom, the atom's position relative to the focus is fixed by heuristic bonding geometries (App. A.8). If a fragment, the position of the attachment site is fixed, but the dihedral of the new bond is yet unknown. Thus, in subsequent generation steps we only encode the attachment site and mask the remaining atoms in the new fragment until that fragment is "in focus" (Fig. 2).
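The valence-masked greedy selection used at each generation step can be sketched as follows; the logits and validity mask here are toy stand-ins for the MLP outputs and chemical valence rules:

```python
import math

def masked_greedy_choice(logits, valid):
    """Greedily pick the highest-probability action among chemically valid
    ones; invalid actions receive probability zero before the argmax."""
    exps = [math.exp(l) if ok else 0.0 for l, ok in zip(logits, valid)]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs
```

Masking before normalization guarantees that an invalid action can never be selected, even if its raw logit is the largest.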
This means that prior to bond scoring, the rotation angle of the focus is random. To account for this when training (with teacher forcing), we randomize the focal dihedral when encoding each M′^(c−1)_l.

Scoring rotatable bonds. After sampling G′_{l+1} ∼ P(G_{l+1}|M′_l, S), we generate G′_{l+1} by scoring the rotation angle ψ′_{l+1} of the bond connecting the focus to its parent node in the generation tree (Fig. 2). Since we ultimately seek to maximize sim_S(M′, M_S), we exploit the fact that our model generates shape-aligned structures to predict max_{ψ′_{l+2}, ψ′_{l+3}, ...} sim*_S(G′(ψ_foc), G_S) for various query dihedrals ψ′_{l+1} = ψ_foc of the focus rotatable bond in a supervised regression setting. Intuitively, the scorer is trained to predict how the choice of ψ_foc affects the maximum possible shape similarity of the final molecule M′ to the target M_S under an optimal policy. App. A.2 details how regression targets are computed. During generation, we sweep over each query dihedral ψ_foc ∈ [−π, π), encode each resultant structure M′(ψ_foc)_{l+1} into z^(ψfoc)_dec,scorer, and select the ψ_foc that maximizes the predicted score:

ψ′_{l+1} = argmax_{ψ_foc} sigmoid(MLP_scorer(z^(ψfoc)_dec,scorer)).    (12)

At generation time, we also score chirality by enumerating stereoisomers G^χ_foc ∈ G′_foc of the focus and selecting the (G^χ_foc, ψ_foc) that maximizes Eq. 12 (App. A.2).

Training. We supervise each step of graph generation with a multi-component loss function:

L_graph-gen = L_∅ + L_focus + L_next + L_site + L_bond + β_KL L_KL + β_next-shape L_next-shape + β_∅-shape L_∅-shape.    (13)

L_∅, L_focus, L_next, and L_bond are standard cross-entropy losses. L_site = −log(Σ_a p^(a)_site 1[c_a > 0]) is a modified cross-entropy loss that accounts for symmetric attachment sites in the fragments G_fi ∈ L_f, where p^(a)_site are the predicted attachment-site probabilities and c_a are multi-hot class probabilities.
L_KL is the KL divergence between the learned posterior N(h_μ, h_σ) and the prior N(0, 1). We also employ two auxiliary losses, L_next-shape and L_∅-shape, to 1) help the generator distinguish between incorrect shape-similar (near-miss) and shape-dissimilar fragments, and 2) encourage the generator to generate structures that fill the entire target shape (App. A.10). We train the rotatable bond scorer separately from the generator with an MSE regression loss. See App. A.15 for training details.
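Returning to the rotatable-bond scoring step: at generation time, sweeping query dihedrals amounts to rotating the focus subtree about the focus-parent bond axis and keeping the best-scoring angle. A sketch using Rodrigues' rotation formula, with an arbitrary stand-in for the trained scorer:

```python
import numpy as np

def rotate_about_axis(points, origin, axis, psi):
    """Rodrigues rotation of `points` by angle psi about the unit `axis`
    through `origin` (i.e., about a rotatable bond)."""
    k = axis / np.linalg.norm(axis)
    p = points - origin
    rot = (p * np.cos(psi)
           + np.cross(k, p) * np.sin(psi)
           + k[None, :] * (p @ k)[:, None] * (1.0 - np.cos(psi)))
    return rot + origin

def best_dihedral(points, origin, axis, score_fn, n_queries=36):
    """Sweep query dihedrals in [-pi, pi) and return the angle whose rotated
    conformation scores highest; score_fn stands in for the learned scorer."""
    angles = np.linspace(-np.pi, np.pi, n_queries, endpoint=False)
    return max(angles,
               key=lambda psi: score_fn(rotate_about_axis(points, origin, axis, psi)))
```

In SQUID the score is the sigmoid output of MLP_scorer on the encoded rotated structure; here any callable taking rotated coordinates works.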

4. EXPERIMENTS

Dataset. We train SQUID with drug-like molecules (up to n_h = 27 heavy atoms) from MOSES (Polykovskiy et al., 2020) using their train/test sets. L_f includes 100 fragments extracted from the dataset and 24 atom types. We remove molecules that contain excluded fragments. For the remaining molecules, we generate a 3D conformer with RDKit, set acyclic bond distances to their empirical means, and fix acyclic bond angles using heuristic rules. While this 3D manipulation neglects the distorted bonding geometries of real molecules, the global shapes are marginally impacted, and we may recover refined geometries without seriously altering the shape (App. A.8). The final dataset contains 1.3M 3D molecules, partitioned into 80/20 train/validation splits. The test set contains 147K 3D molecules. In the following experiments, we only consider molecules M_S for which we can extract a small (3-6 atom) 3D substructure M_0 containing a terminal atom, which we use to seed generation. In principle, M_0 could include larger structures from M_S, e.g., for scaffold-constrained tasks. Here, we use the smallest substructures to ensure that the shape-conditioned generation tasks are not trivial.

Shape-conditioned generation of chemically diverse molecules. "Scaffold-hopping", designing molecules with high 3D shape similarity but novel 2D graph topology compared to known inhibitors, is pursued in LBDD to develop chemical lead series, optimize drug activity, or evade intellectual property restrictions (Hu et al., 2017). We imitate this task by evaluating SQUID's ability to generate molecules M′ with high sim_S(M′, M_S) but low sim_G(M′, M_S). Specifically, for 1000 molecules M_S with target shapes S in the test set, we use SQUID to generate 50 molecules per M_S. To generate chemically diverse species, we linearly interpolate between the posterior N(h_μ, h_σ) and the prior N(0, 1), sampling each h_var ∼ N((1 − λ)h_μ, (1 − λ)h_σ + λ1) using either λ = 0.3 or λ = 1.0 (prior).
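The posterior-prior interpolation used to trade chemical diversity against shape similarity can be sketched directly from the sampling formula above:

```python
import numpy as np

def interpolated_sample(h_mu, h_sigma, lam, rng=None):
    """Sample h_var ~ N((1 - lam) * mu, (1 - lam) * sigma + lam * 1).
    lam = 0 recovers the posterior; lam = 1 recovers the prior N(0, 1)."""
    rng = rng or np.random.default_rng(0)
    mu = (1.0 - lam) * h_mu
    sigma = (1.0 - lam) * h_sigma + lam
    return mu + rng.normal(size=h_mu.shape) * sigma
```

Intermediate λ values (e.g., the paper's λ = 0.3) keep samples near the encoded molecule's chemistry while still injecting prior noise for diversity.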
We then filter the generated molecules to have sim_G(M′, M_S) < 0.7 (or < 0.3) so as to only evaluate molecules with substantial chemical differences from M_S. Of the filtered molecules, we randomly choose N_max samples and select the sample with the highest sim_S(M′, M_S). Figure 3A plots distributions of sim_S(M′, M_S) between the selected molecules and their respective target shapes, using different sampling (N_max = 1, 20) and filtering (sim_G(M′, M_S) < 0.7, 0.3) schemes. We compare against analogously sampling random 3D molecules from the training set. Overall, SQUID generates diverse 3D molecules that are quantitatively enriched in shape similarity compared to molecules sampled from the dataset, particularly for N_max = 20. Qualitatively, the molecules generated by SQUID have significantly more atoms that directly overlap with the atoms of M_S, even in cases where the computed shape similarity is comparable between SQUID-generated molecules and molecules sampled from the dataset (Fig. 3C). We quantitatively explore this observation in App. A.7. We also find that using λ = 0.3 yields greater sim_S(M′, M_S) than λ = 1.0, in part because λ = 0.3 yields less chemically diverse molecules (Fig. 3B; Challenge 3). Even so, sampling N_max = 20 molecules from the prior with sim_G(M′, M_S) < 0.3 still yields more shape-similar molecules than sampling N_max = 500 molecules from the dataset. We emphasize that 99% of samples from the prior are novel, 95% are unique, and 100% are chemically valid (App. A.4). Moreover, 87% of generated structures do not have any steric clashes (App. A.4), indicating that SQUID generates realistic 3D geometries of flexible drug-like molecules.

Ablating equivariance. SQUID's success in 3D shape-conditioned molecular generation is partly attributable to SQUID aligning the generated structures to the target shape in equivariant feature space (Eq.
7), which enables SQUID to generate 3D structures that fit the target shape without having to implicitly learn how to align two structures in R^3 (Challenge 2). We explicitly validate this design choice by setting Z̃ = 0 in Eq. 7, which prevents the decoder from accessing the 3D orientation of M_S during training/generation. As expected, ablating SQUID's equivariance reduces the enrichment in shape similarity (relative to the dataset baseline) by as much as 33% (App. A.9).

Shape-constrained molecular optimization. Scaffold-hopping is often goal-directed, e.g., aiming to reduce the toxicity or improve the bioactivity of a hit compound without altering its 3D shape. We mimic this shape-constrained MO setting by applying SQUID to optimize objectives from GuacaMol (Brown et al., 2019) while preserving high shape similarity (sim_S(M, M_S) ≥ 0.85) to various "hit" 3D molecules M_S from the test set. This task differs considerably from typical MO tasks, which optimize objectives without constraining 3D shape and without generating 3D structures. To adapt SQUID to shape-constrained MO, we implement a genetic algorithm (App. A.5) that iteratively mutates the variational atom embeddings H_var of encoded seed molecules ("hits") M_S in order to generate 3D molecules M* with improved objective scores, but which still fit the shape of M_S. Table 1 reports the optimized top-1 scores across 6 objectives and 8 seed molecules M_S (per objective, sampled from the test set), constrained such that sim_S(M*, M_S) ≥ 0.85. We compare against the score of M_S, as well as the (shape-constrained) top-1 score obtained by virtual screening (VS) our training dataset (>1M 3D molecules). Of the 8 seeds M_S per objective, 3 were selected from top-scoring molecules to serve as hypothetical "hits", 3 were selected from top-scoring large molecules (≥ 26 heavy atoms), and 2 were randomly selected from all large molecules.
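A minimal sketch of the shape-constrained genetic loop over latents (App. A.5 gives the actual algorithm; the objective, mutation scale, and constraint check here are stand-ins for the GuacaMol scorer and the sim_S ≥ 0.85 filter):

```python
import numpy as np

def latent_ga(h_var0, objective, constraint, n_gens=10, pop=16, scale=0.1, seed=0):
    """Iteratively mutate variational latents, keeping the best mutant that
    still satisfies the shape constraint (stand-in for sim_S >= 0.85)."""
    rng = np.random.default_rng(seed)
    best, best_score = h_var0, objective(h_var0)
    for _ in range(n_gens):
        # Gaussian mutations around the current best latent
        mutants = best + rng.normal(scale=scale, size=(pop,) + best.shape)
        for m in mutants:
            if constraint(m) and objective(m) > best_score:
                best, best_score = m, objective(m)
    return best, best_score
```

In SQUID, each mutated latent is decoded into a full 3D molecule before scoring, so the constraint is evaluated on the generated conformation rather than on the latent itself.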
In 40/48 tasks, SQUID improves the objective score of the seed M S while maintaining sim S (M * , M S ) ≥ 0.85. Qualitatively, SQUID optimizes the objectives through chemical alterations such as adding/deleting individual atoms, switching bonding patterns, or replacing entire substructures, all while generating 3D structures that fit the target shape (App. A.5). In 29 of the 40 successful cases, SQUID (limited to 31K samples) surpasses the baseline of virtually screening 1M molecules, demonstrating its ability to efficiently explore new shape-constrained chemical space.

5. CONCLUSION

We designed a novel 3D generative model, SQUID, to enable shape-conditioned exploration of chemically diverse molecular space. SQUID generates realistic 3D geometries of larger molecules that are chemically valid, and uniquely exploits equivariant operations to construct conformations that fit a target 3D shape. We envision that our model, alongside future work, will advance creative shape-based drug design tasks such as 3D scaffold hopping and shape-constrained 3D ligand design.

REPRODUCIBILITY STATEMENT

We have taken care to facilitate the reproducibility of this work by detailing the precise architecture of SQUID throughout the main text; we also provide extensive details on training protocols, model parameters, and further evaluations in the Appendices. Our source code can be found at https://github.com/keiradams/SQUID. Beyond the model implementation, our code includes links to access our datasets, as well as scripts to process the training dataset, train the model, and evaluate our trained models across the shape-conditioned generation and shape-constrained optimization tasks described in this paper.

ETHICS STATEMENT

Advancing the shape-conditioned 3D generative modeling of drug-like molecules has the potential to accelerate pharmaceutical drug design, showing particular promise for drug discovery campaigns involving scaffold hopping, hit expansion, or the discovery of novel ligand analogues. However, such advancements could also be exploited for nefarious pharmaceutical research and harmful biological applications.

A.1 OVERVIEW OF DEFINITIONS, TERMS, AND NOTATIONS

h (f i ): the learned embedding of atom/fragment f i ∈ L f
h (a): the learned embedding of atom a
P S : point cloud representation of shape S
R: an arbitrary rotation matrix in R 3×3
SO(3): the special orthogonal group (rotations) in R 3
SE(3): the special Euclidean group (rotations and translations) in R 3
X p : tensor containing the learned equivariant features of each point in a point cloud; X p ∈ R (n h n p )×q×3
X: tensor containing the learned equivariant features of each atom in the encoded 3D molecule; X ∈ R n h ×q×3
q: (1st) dimensionality of the vector features X, X p
H: matrix containing the learned atom embeddings; H = {h (a) ∀a ∈ G}
x (a): invariant shape features of atom a; x (a) ∈ R 6q
X: matrix containing the invariant shape features of each atom in the encoded 3D molecule; X ∈ R n h ×6q
d a : dimension of input atom features
d f : dimension of learned fragment embeddings
d h : dimension of learned atom embeddings
h µ : vector of means of the posterior distribution N (h µ , h σ )
h σ : vector of standard deviations of the posterior distribution N (h µ , h σ )
h var : sampled variational atom features
λ: interpolation factor between N (h µ , h σ ) and N (0, 1)
N (0, 1): multidimensional standard normal distribution with µ = 0, σ = 1

A.2 SCORING ROTATABLE BONDS AND STEREOCHEMISTRY

Recall that our goal is to train the scorer to predict max over ψ ′ l+2 , ψ ′ l+3 , ... of sim * S (G ′(ψ foc ) , G S ) for various query dihedrals ψ ′ l+1 = ψ foc . That is, we wish to predict the maximum possible shape similarity of the final molecule M ′ to M S when fixing ψ ′ l+1 = ψ foc and optimally rotating all the yet-to-be-scored (or generated) rotatable bond dihedrals ψ ′ l+2 , ψ ′ l+3 , ... so as to maximize sim * S (G ′(ψ foc ) , G S ).

Training. We train the scorer independently from the graph generator (with a parallel architecture) using a mean squared error loss between the predicted scores ŝ(ψ foc ) dec, scorer = sigmoid(MLP(z (ψ foc ) dec, scorer )) and the regression targets s (ψ foc ) for N s different query dihedrals ψ foc ∈ [-π, π):

L scorer = (1 / N s ) Σ i=1..N s ( s (ψ (i) foc ) - ŝ(ψ (i) foc ) dec, scorer ) 2

Each regression target s (ψ foc ) is the maximum of sim * S (G ′(ψ foc ) T foc , G S T foc ; α = 2.0) over N ψ enumerated conformers of the remaining rotatable bonds. Since we fix bonding geometries, we need only sample N ψ sets of dihedrals of the rotatable bonds in G S T foc to sample N ψ conformers, making this conformer enumeration very fast. Note that rather than using α = 0.81 in these regression targets, we use α = 2.0 to make the scorer more sensitive to shape differences (App. A.7). When computing regression targets, we use N ψ < 1800 and select 36 (evenly spaced) ψ foc ∈ [-π, π) per rotatable bond. Figure 4 visualizes how regression targets are computed. App. A.15 contains further training specifics.

Scoring stereochemistry. At generation time, we also enumerate all possible stereoisomers of the focus (except cis/trans bonds) and score each stereoisomer separately, ultimately selecting the (stereoisomer, ψ foc ) pair that maximizes the predicted score. Figure 5 illustrates how we enumerate stereoisomers. Note that although we use the learned scoring function to score stereoisomerism at generation time, we do not explicitly train the scorer to score different stereoisomers.

Masking severe steric clashes.
At generation time, we do not score any query dihedral ψ foc that causes a severe steric clash (< 1 Å) with the existing partially generated structure (unless all query dihedrals cause a severe clash).

Figure 5: In addition to scoring the dihedral of the focal bond, we also enumerate all stereoisomers around the focus and score each stereoisomer's focal dihedral separately. We then select the best-scoring (stereoisomer, dihedral) pair. We enumerate stereoisomers by combinatorially swapping chiral atoms, swapping axial/equatorial attachments to the focus (if relevant), and swapping the axial/equatorial attachment of the focus to the rest of the partial molecule (if relevant). Although swapping cis/trans bonds is feasible within SQUID's framework, we currently do not enumerate cis/trans isomers because of their minor presence in the dataset. Hence, any generated double bonds will have random stereochemistry.
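The scorer's training objective reduces to a plain mean squared error between sigmoid-squashed predictions and the precomputed regression targets. A minimal stand-alone sketch (the logits and target values below are made up for illustration):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def scorer_mse_loss(pred_logits, targets):
    """MSE between sigmoid-squashed predicted scores and the precomputed
    regression targets s^(psi_foc), one entry per query dihedral psi_foc."""
    assert len(pred_logits) == len(targets)
    preds = [sigmoid(z) for z in pred_logits]
    return sum((s - p) ** 2 for s, p in zip(targets, preds)) / len(targets)
```

For example, `scorer_mse_loss([0.0], [0.5])` is exactly zero, since sigmoid(0) = 0.5.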

A.3 RANDOM EXAMPLES OF GENERATED 3D MOLECULES

Figures 6 and 7 show additional random examples of molecules generated by SQUID when sampling N max = 1, 20 molecules with sim G (M ′ , M S ) < 0.7 from the prior (λ = 1.0) or with λ = 0.3, and selecting the sample with the highest sim S (M ′ , M S ). Note that the visualized poses of the generated conformers are those which are directly generated by SQUID; the generated conformers have not been explicitly aligned to M S (e.g., using ROCS). Even so, the conformers are (for the most part) aligned to M S , since SQUID's equivariance enables the model to generate natively aligned structures.

It is apparent in these examples that using a larger N max yields molecules with significantly improved shape similarity to M S , both qualitatively and quantitatively. This is in part caused by: 1) stochasticity in the variationally sampled atom embeddings H var ; 2) stochasticity in the input molecular point clouds, which are sampled from atom-centered isotropic Gaussians in R 3 ; 3) sampling sets of variational atom embeddings that may not be entirely self-consistent (for instance, if we sample only 1 atom embedding that implicitly encodes a ring structure); and 4) the choice of τ ∅ , the threshold for stopping local generation. While a small τ ∅ (we use τ ∅ = 0.01) helps prevent the model from adding too many atoms or fragments around a single focus, a small τ ∅ can also lead to early (local) stoppage, yielding molecules that do not completely fill the target shape. By sampling more molecules (using a larger N max ), we have more chances to avoid these adverse random effects. Further work will attempt to improve the robustness of the encoding scheme and generation procedure in order to increase SQUID's overall sample efficiency.

Table 4 reports the percentage of molecules that are chemically valid, novel, and unique when sampling 50 molecules from the prior (λ = 1.0) for 1000 encoded molecules M S (e.g., target shapes) from the test set, yielding a total of 50K generated molecules.
We define chemical validity as passing RDKit sanitization. Since we directly generate the molecular graph and mask actions which violate chemical valency, 100% of generated molecules are valid. We define novelty as the percentage of generated molecules whose molecular graphs are not present in the training data. We define uniqueness as the percentage of generated molecular graphs (of the 50K total) that are only generated once. For novelty and uniqueness calculations, we consider different stereoisomers to have the same molecular graph. We also report the percentage of generated 3D structures that have an apparent steric clash, defined to be a non-bonded interatomic distance below 2 Å.

When sampling from the prior (λ = 1.0), the average internal chemical similarity of the generated molecules is 0.26 ± 0.04. When sampling with λ = 0.3, the average internal chemical similarity is 0.32 ± 0.07. We define internal chemical similarity to be the average pairwise chemical similarity (Tanimoto fingerprint similarity) between molecules that are generated for the same target shape.

Table 5 reports the graph reconstruction accuracy when sampling 3D molecules from the posterior (λ = 0.0) for 1000 target molecules M S from the test set. We report the top-k graph reconstruction accuracy (ignoring stereochemical differences) when sampling k = 1 molecule per encoded M S , and when sampling k = 20 molecules per encoded M S . Since we have intentionally trained SQUID inside a shape-conditioned variational autoencoder framework in order to generate chemically diverse molecules with similar 3D shapes, the significance of graph reconstruction accuracy is debatable in our setting. However, it is worth noting that the top-1 reconstruction accuracy is 16.3%, while the top-20 reconstruction accuracy is much higher (57.2%). This large difference is likely attributable to both stochasticity in the variational atom embeddings and stochasticity in the input 3D point clouds.
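The novelty and uniqueness definitions above amount to simple set and count operations over canonical graph identifiers (e.g., canonical SMILES strings, assumed precomputed here so that stereoisomers collapse to one graph):

```python
from collections import Counter

def generation_metrics(generated, training_set):
    """Novelty: fraction of generated graphs absent from the training data.
    Uniqueness: fraction of generated graphs that occur exactly once among
    all generated samples. Inputs are canonical graph identifiers."""
    counts = Counter(generated)
    train = set(training_set)
    novelty = sum(1 for g in generated if g not in train) / len(generated)
    uniqueness = sum(1 for g, c in counts.items() if c == 1) / len(generated)
    return novelty, uniqueness
```

In the paper's setting, the identifiers would be RDKit-canonicalized SMILES with stereochemistry stripped.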
Table 4 (continued): Uniqueness (↑) 94.7%; Steric Clash (<2 Å) (↓) 13.1%.

Table 5: Graph reconstruction accuracy when sampling 3D molecules from the posterior (λ = 0.0) for 1000 targets M S from the test set. We report the "top-k" reconstruction accuracy when sampling k = 1 molecule per target, as well as k = 20 molecules per target.

Graph Reconstruction %
k = 1: 16.3%
k = 20: 57.2%

A.5.1 GENETIC ALGORITHM

We adapt SQUID to shape-constrained molecular optimization by implementing a genetic algorithm on the variational atom embeddings H var . Algorithm 1 details the exact optimization procedure. In summary, given the seed molecule M S with a target 3D shape and an initial substructure M 0 (which is contained by all generated molecules for a given M S ), we first generate an initial population of generated molecules M ′ by repeatedly sampling H var for various interpolation factors λ, mixing these H var with the encoded shape features of M S , and decoding new 3D molecules. We only add a generated molecule to the population if sim S (M ′ , M S ) ≥ τ S (we use τ S = 0.75), so that the GA does not overly explore regions of chemical space that have no chance of satisfying the ultimate constraint sim S (M ′ , M S ) ≥ 0.85. After generating the initial population, we iteratively 1) select the top-scoring samples in the population, 2) cross the top-scoring H var in crossover events, 3) mutate the top and crossed H var via adding random noise, and 4) generate new molecules M ′ for each mutated H var . The final optimized molecule M * is the top-scoring generated molecule that satisfies the shape-similarity constraint sim S (M ′ , M S ) ≥ 0.85.
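A compressed sketch of this loop, under stated simplifications: crossover is omitted, and `decode`, `score`, and `sim_s` are hypothetical stand-ins for SQUID's decoder, the GuacaMol objective, and sim_S, respectively:

```python
import random

def optimize_embeddings(h_seed, decode, score, sim_s, tau_s=0.75,
                        n_init=20, n_gens=5, top_k=5, sigma=0.1, seed=0):
    """Minimal sketch of the App. A.5 genetic algorithm: mutate variational
    atom embeddings, decode molecules, keep only those whose shape similarity
    to the seed exceeds tau_s, and iterate on the top scorers."""
    rng = random.Random(seed)

    def mutate(h):
        return [x + rng.gauss(0.0, sigma) for x in h]

    # Initial population: mutated copies of the seed embedding.
    pop = []
    for _ in range(n_init):
        h = mutate(h_seed)
        mol = decode(h)
        if sim_s(mol) >= tau_s:
            pop.append((score(mol), h, mol))
    # Evolve: mutate the current top scorers and re-filter by shape.
    for _ in range(n_gens):
        pop.sort(key=lambda t: t[0], reverse=True)
        for _, h, _ in pop[:top_k]:
            h_new = mutate(h)
            mol = decode(h_new)
            if sim_s(mol) >= tau_s:
                pop.append((score(mol), h_new, mol))
    return max(pop, key=lambda t: t[0]) if pop else None
```

A real run would also cross top-scoring H var pairs and vary the interpolation factor λ when seeding the population, as in Algorithm 1.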

A.5.2 VISUALIZATION OF OPTIMIZED MOLECULES

Figure 8 visualizes the structures of the SQUID-optimized molecules M * and their respective seed molecules M S (e.g., the starting "hit" molecules with target shapes) for each of the optimization tasks which led to an improvement in the objective score. We also overlay the generated 3D conformations of M * on those of M S , and report the objective scores for each M * and M S .

(Table caption, App. A.6:) Mean shape similarity when sampling from the prior (λ = 1) for 1000 target molecules M S from the test set. When sampling either from the dataset or from SQUID, we compute the mean shape similarity (across the 1000 targets) after selecting the best sample M which maximizes sim S (M ′ , M S ; α) amongst N max samples M ′ for which sim G (M ′ , M S ) < 0.7.

A.7 EXPLORING DIFFERENT VALUES OF α IN SIM S

Our analysis of shape similarity thus far has used Equation 1 with α = 0.81 in order to recapitulate the shape similarity function used by ROCS, which is widely used in drug discovery. However, compared to randomly sampled molecules from the dataset, the molecules generated by SQUID qualitatively appear to do a significantly better job of fitting the target shape S on an atom-by-atom basis, even if the computed shape similarities (with α = 0.81) are comparable (see examples in Figure 3). We quantify this observation by increasing the value of α when computing sim S (M ′ , M S ; α) for generated molecules M ′ , as α is inversely related to the width of the isotropic 3D Gaussians used in the volume overlap calculations in Equation 1. Intuitively, increasing α more strongly penalizes sim S if the atoms of M ′ and M S do not perfectly align.

Figure 10 plots the mean sim S (M, M S ; α) for the most shape-similar molecule M of N max sampled molecules M ′ for increasing values of α. Averages are calculated over 1000 target molecules M S from the test set, and we only consider generated molecules for which sim G (M ′ , M S ) < 0.7. Crucially, the gap between the mean sim S (M, M S ; α) obtained by generating molecules with SQUID vs. randomly sampling molecules from the dataset widens significantly with increasing α. This effect is especially apparent when using SQUID with λ = 0.3 and N max = 20, although it can be observed with other generation strategies as well. Hence, SQUID does a much better job of generating (still chemically diverse) molecules that have significant atom-to-atom overlap with M S .
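This α-dependence can be sketched with the standard first-order Gaussian volume overlap (constant prefactors are dropped since they cancel in the Tanimoto ratio; the toy coordinates below are illustrative, not from the paper's data):

```python
import math

def dist2(p, q):
    """Squared Euclidean distance between two 3D points."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

def gaussian_overlap(coords_a, coords_b, alpha):
    """First-order volume overlap of two sets of atom-centered isotropic
    Gaussians with width parameter alpha (up to a constant prefactor)."""
    return sum(math.exp(-0.5 * alpha * dist2(a, b))
               for a in coords_a for b in coords_b)

def sim_s(coords_a, coords_b, alpha=0.81):
    """Shape Tanimoto: O_AB / (O_AA + O_BB - O_AB). Larger alpha means
    narrower Gaussians, hence a sharper penalty for imperfect overlap."""
    oab = gaussian_overlap(coords_a, coords_b, alpha)
    oaa = gaussian_overlap(coords_a, coords_a, alpha)
    obb = gaussian_overlap(coords_b, coords_b, alpha)
    return oab / (oaa + obb - oab)
```

For two imperfectly overlapping atom sets, raising α from 0.81 to 2.0 lowers the computed similarity, which is exactly the sharpened penalty exploited in Figure 10.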

A.8 HEURISTIC BONDING GEOMETRIES AND THEIR IMPACT ON GLOBAL SHAPE

In all molecules (dataset and generated) considered in this work, we fix acyclic bond distances to their empirical averages and set acyclic bond angles to heuristic values based on hybridization rules in order to reduce the degrees of freedom in 3D coordinate generation. Here, we describe how we fix these bonding geometries and explore whether this local 3D structure manipulation significantly alters the global molecular shape.

Fixing bonding geometries. We fix acyclic bond distances by computing the mean bond distance between pairs of atom types across all the RDKit-generated conformers in our training set. After collecting these empirical mean values, we manually set each acyclic bond distance to its respective mean value for each conformer in our datasets. We set acyclic bond angles using simple hybridization rules: sp3-hybridized atoms have bond angles of 109.5°, sp2-hybridized atoms have bond angles of 120°, and sp-hybridized atoms have bond angles of 180°. We manually fix the acyclic bond angles to these heuristic values for all conformers in our datasets. We use RDKit to determine the hybridization state of each atom. During generation, the hybridization of certain atoms (N, O) may occasionally change once they are bonded to new neighbors; for instance, an sp3 nitrogen can become sp2 once bonded to an aromatic ring. We adjust bond angles on-the-fly in these edge cases.

Impact on global shape. Figure 11 plots the histogram of sim S (M fixed , M relaxed ) for 1000 test-set conformers M fixed whose bonding geometries have been fixed, and the original RDKit-generated conformers M relaxed with relaxed (true) bonding geometries. In the vast majority of cases, fixing the bonding geometries negligibly impacts the global shape of the 3D molecule (sim S (M fixed , M relaxed ) ≈ 1).
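The empirical averaging step can be sketched as follows; each conformer is represented as a list of (atom_type_i, atom_type_j, distance) bonds, a hypothetical pre-extracted form of the RDKit conformers:

```python
from collections import defaultdict

def mean_bond_distances(conformers):
    """Empirical mean acyclic bond distance per (unordered) atom-type pair,
    aggregated over a set of conformers."""
    sums = defaultdict(lambda: [0.0, 0])  # key -> [distance sum, count]
    for bonds in conformers:
        for ti, tj, d in bonds:
            key = tuple(sorted((ti, tj)))  # unordered pair
            sums[key][0] += d
            sums[key][1] += 1
    return {k: s / n for k, (s, n) in sums.items()}

# Heuristic acyclic bond angles (degrees) by hybridization, as in the text.
HEURISTIC_ANGLES = {"sp3": 109.5, "sp2": 120.0, "sp": 180.0}
```

In the real pipeline, the bond lists would be read from RDKit conformers, and the resulting means written back onto every conformer's acyclic bonds.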
This is because the main factor influencing global molecular shape is the rotatable bonds (e.g., flexible dihedrals), which are not altered by fixing bond distances and angles.

Recovering refined bonding geometries. Even though fixing bond distances and angles only marginally impacts molecular shape, we may still wish to recover refined bonding geometries of the generated 3D molecules without altering the generated 3D shape. We can accomplish this (to a first approximation) for generated molecules by creating a geometrically relaxed conformation of the generated molecular graph with RDKit, and then manually setting the dihedrals of the rotatable bonds in the relaxed conformer to match the corresponding dihedrals in the generated conformers. Importantly, if we perform this relaxation procedure for both the dataset molecules and the SQUID-generated molecules, the (relaxed) generated molecules still have significantly enriched shape-similarity to the (relaxed) target shape compared to (relaxed) random molecules from the dataset (Fig. 12).

A.9 ABLATING EQUIVARIANCE

SQUID aligns the equivariant representations of the encoded target shape and the partially generated structures in order to generate 3D conformations that natively fit the target shape, without having to implicitly learn SE(3)-alignments (Challenge 2). We achieve this in Equation 7, where we mix the equivariant representations of M S and the partially generated structure M ′(c-1) l . To empirically motivate this design choice, we ablate the equivariant alignment by setting Z = 0 in Eq. 7. We denote this ablated model as SQUID-NoEqui. Note that because we still pass the unablated invariant features z to the decoder (Eq. 8), SQUID-NoEqui is still conditioned on the shape of M S ; the model simply no longer has access to any explicit information about the relative spatial orientation of M ′(c-1) l to M S (and thus must learn this spatial relationship from scratch).
As expected, ablating SQUID's equivariance significantly reduces SQUID's ability to generate chemically diverse molecules that fit the target shape. Figure 13 plots the distributions of sim S (M ′ , M S ) for the best of N max generated molecules with sim G (M ′ , M S ) < 0.7 or 0.3 when using SQUID or SQUID-NoEqui. Crucially, the mean shape similarity when sampling with (λ = 1.0, N max = 20, sim G (M ′ , M S ) < 0.7) decreases from 0.828 (SQUID) to 0.805 (SQUID-NoEqui). When sampling with (λ = 0.3, N max = 20, sim G (M ′ , M S ) < 0.7), the mean shape similarity also decreases from 0.879 (SQUID) to 0.839 (SQUID-NoEqui). Relative to the mean shape similarity of 0.758 achieved by sampling random molecules from the dataset (N max = 20, sim G (M ′ , M S ) < 0.7), this corresponds to a substantial 33% reduction in the shape-enrichment of SQUID-generated molecules. Interestingly, sampling (λ = 1.0, N max = 20, sim G (M ′ , M S ) < 0.7) with SQUID-NoEqui still yields shape-enriched molecules compared to analogously sampling random molecules from the dataset (mean shape similarity of 0.805 vs. 0.758). This is because even without the equivariant feature alignment, SQUID-NoEqui still conditions molecular generation on the (invariant) encoding of the target shape S, and hence biases generation towards molecules which better fit the target shape (after alignment with ROCS).
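The 33% figure follows directly from the reported means (0.879 for SQUID, 0.839 for SQUID-NoEqui, 0.758 for the dataset baseline); a small helper, written here purely to illustrate the arithmetic:

```python
def enrichment_reduction(sim_full, sim_ablated, sim_baseline):
    """Fractional reduction in shape-similarity enrichment over the dataset
    baseline caused by ablating the equivariant alignment: the ablated
    model's enrichment divided by the full model's, subtracted from one."""
    return 1.0 - (sim_ablated - sim_baseline) / (sim_full - sim_baseline)
```

Here `enrichment_reduction(0.879, 0.839, 0.758)` gives roughly 0.33, i.e., the 33% reduction quoted above.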

A.10 AUXILIARY TRAINING LOSSES

We employ two auxiliary losses when training the graph generator in order to encourage the generated graphs to better fit the encoded target shape.

The first auxiliary loss penalizes the graph generator if it adds an incorrect atom/fragment to the focus that is of significantly different size than the correct (ground truth) atom/fragment. We first compute a matrix ∆V f ∈ R |L f |×|L f | + containing the (pairwise) volume differences between all atoms/fragments in the library L f :

∆V (i,j) f = |v f i - v f j |

where v f i is the volume of atom/fragment f i ∈ L f (computed with RDKit). We then compute the auxiliary loss L next-shape as:

L next-shape = (1 / |L f |) (p next • ∆V (g) f )

where g is the index of the correct (ground truth) next atom/fragment f next, true , ∆V (g) f is the g-th row of ∆V f , and p next are the predicted probabilities over the next atom/fragment types to be connected to the focus (see Eq. 10).

The second auxiliary loss penalizes the graph generator if it prematurely stops (local) generation, with larger penalties if the premature stop would result in larger portions of the (ground truth) graph not being generated. When predicting (local) stop tokens during graph generation (with teacher forcing), we compute the number of atoms in the subgraph induced by the subtree whose root treenode is the next atom/fragment to be added to the focus (in the current generation sequence). We then multiply the predicted probability for the local stop token by this number of "future" atoms that would not be generated if a premature stop token were generated. Hence, if the correct action is indeed to stop generation around the focus, the penalty will be zero. However, if the correct action is to add a large fragment to the current focus but the generator predicts a stop token, the penalty will be large.
Formally, we compute:

L ∅-shape = p ∅ |G S T next | if p ∅, true = 0, and L ∅-shape = 0 otherwise,

where p ∅, true is the ground truth action for local stopping (p ∅, true = 0 indicates that the correct action is to not stop local generation), and G S T next is the subgraph induced by the subtree whose root node is the next atom/fragment (to be generated) in the ground-truth molecular graph.
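Both auxiliary penalties are cheap elementwise computations; a minimal sketch (the probability values used in the comments are illustrative):

```python
def next_shape_loss(p_next, delta_v_row):
    """L_next-shape: dot the predicted distribution over next atoms/fragments
    with the row of pairwise volume differences for the ground-truth choice,
    normalized by the library size |L_f|."""
    return sum(p * dv for p, dv in zip(p_next, delta_v_row)) / len(p_next)

def premature_stop_loss(p_stop, n_future_atoms, true_stop):
    """L_stop-shape: scale the predicted local-stop probability by the number
    of ground-truth atoms a premature stop would leave ungenerated; zero
    when stopping is actually the correct action."""
    return 0.0 if true_stop else p_stop * n_future_atoms
```

Note that placing all probability mass on the ground-truth fragment (so its volume difference is 0) drives the first loss to zero, matching the intent described above.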

A.11 OVERVIEW OF VECTOR NEURONS (VN) OPERATIONS

In this work, we use Deng et al. (2021)'s VN-DGCNN to encode molecular point clouds into equivariant shape features. We also employ their general VN operations (VN-MLP, VN-Inv) during shape and chemical feature mixing. We refer readers to Deng et al. (2021) for a detailed description of these equivariant operations and models. Here, we briefly summarize some relevant VN operations for the reader's convenience.

VN-MLP. Vector neurons (VN) lift scalar neuron features to vector features in R 3 . Hence, instead of having features x ∈ R q , we have vector features X ∈ R q×3 . While linear transformations are naturally equivariant to global rotations since W(XR) = (WX)R for any rotation matrix R ∈ R 3×3 , Deng et al. (2021) construct a set of non-linear equivariant operations f such that f(XR) = f(X)R, thereby enabling natively equivariant network design. VN-MLPs combine linear transformations with equivariant activations. In this work, we use VN-LeakyReLU, which Deng et al. (2021) define as:

VN-LeakyReLU(X; α) = αX + (1 - α) VN-ReLU(X)

where, for each row x ∈ X (x ∈ R 3 ):

VN-ReLU(x) = x if ⟨x, k/||k||⟩ ≥ 0, and VN-ReLU(x) = x - ⟨x, k/||k||⟩ k/||k|| otherwise,

with k = UX for a learnable weight matrix U ∈ R 1×q . By composing series of linear transformations and equivariant activations, VN-MLPs map X ∈ R q×3 to X′ ∈ R q′×3 such that X′R = VN-MLP(XR).

VN-Inv. Deng et al. (2021) also define learnable operations that map equivariant features X ∈ R q×3 to invariant features x ∈ R 3q . In general, VN-Inv constructs invariant features by multiplying equivariant features X with other equivariant features Ỹ ∈ R 3×3 :

X = X Ỹ⊤

The invariant features X ∈ R q×3 can then be reshaped into standard invariant features x ∈ R 3q . In our work, we slightly modify Deng et al. (2021)'s original formulation.
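Before detailing that modification, the defining property f(XR) = f(X)R of VN-LeakyReLU can be sanity-checked with a minimal stand-alone implementation (plain Python 3-tuples; the direction k is supplied directly here, whereas in a VN-MLP it is the learned k = UX, so the check below rotates k together with the features):

```python
def vn_leaky_relu(X, k, alpha=0.2):
    """VN-LeakyReLU over a list of 3D vector features X with direction k.
    A vector is kept if its component along k is non-negative; otherwise
    that component is projected out. The result is blended with the input."""
    kk = sum(c * c for c in k)
    out = []
    for x in X:
        proj = sum(a * b for a, b in zip(x, k)) / kk
        relu_x = x if proj >= 0 else tuple(a - proj * b for a, b in zip(x, k))
        out.append(tuple(alpha * a + (1 - alpha) * b
                         for a, b in zip(x, relu_x)))
    return out

def rot_z90(v):
    """Rotate a 3-vector by 90 degrees about the z-axis."""
    return (-v[1], v[0], v[2])
```

Rotating both the features and k before the activation gives the same result as rotating the activation's output, which is the equivariance SQUID relies on.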
Given a set of equivariant features X = {X (i) } ∈ R n×q×3 , we define a VN-Inv as VN-Inv(X) = X, where X = {x (i) } ∈ R n×6q and:

x (i) = Flatten( Ṽ (i) T (i)⊤ )
Ṽ (i) = ( X (i) , mean i X (i) ) if n > 1, otherwise X (i)
T (i) = VN-MLP( Ṽ (i) )

where T (i) ∈ R 3×3 , and Ṽ (i) ∈ R 2q×3 (n > 1) or Ṽ (i) ∈ R q×3 (n = 1).

VN-DGCNN. Deng et al. (2021) introduce VN-DGCNN as an SO(3)-equivariant version of the Dynamic Graph Convolutional Neural Network (Wang et al., 2019). Given a point cloud P ∈ R n×3 , VN-DGCNN uses (dynamic) equivariant edge convolutions to update equivariant per-point features:

E (t+1) nm = VN-LeakyReLU (t) ( Θ (t) ( X (t) m - X (t) n ) + Φ (t) X (t) n )
X (t+1) n = Σ m∈KNN f (n) E (t+1) nm

where KNN f (n) are the k-nearest neighbors of point n in feature space, Φ (t) and Θ (t) are weight matrices, and X (t) n ∈ R q×3 are the per-point equivariant features.

A.12 GRAPH NEURAL NETWORKS

In this work, we employ graph neural networks (GNNs) to encode:

• each atom/fragment in the library L f
• the target molecule M S
• each partial molecular structure M ′(c) l during sequential graph generation
• the query structures M ′(ψ foc ) l+1 when scoring rotatable bonds

Our GNNs are loosely based upon a simple version of the EGNN (Satorras et al., 2022b). Given a molecular graph G with atoms as nodes and bonds as edges, we use graph convolutional layers defined by the following:

m (t+1) ij = ϕ (t) m ( h (t) i , h (t) j , ||r i - r j || 2 , m (t) ij )
m (t+1) i = Σ j∈N(i) m (t+1) ij
h (t=1) i = ϕ (0) h ( h (0) i , m (t=1) i )
h (t+1) i = ϕ (t) h ( h (t) i , m (t+1) i ) + h (t) i   (t > 0)

where h (t) i are the learned atom embeddings at each GNN layer, m (t) ij are learned (directed) messages, r i ∈ R 3 are the coordinates of atom i, N(i) is the set of 1-hop bonded neighbors of atom i, and each ϕ (t) m , ϕ (t) h is an MLP. Note that h (0) i are the initial atom features, and m (0) ij are the initial bond features for the bond between atoms i and j. In general, m (t) ij ̸= m (t) ji for t > 0, but here m (0) ij = m (0) ji .
Note that since we only aggregate messages from directly bonded neighbors, ||r i - r j || only encodes bond distances, and does not encode any information about specific 3D conformations. Hence, our GNNs effectively only encode 2D chemical identity, as opposed to 3D shape.

Our atom/fragment library L f includes 100 distinct fragments (Fig. 14) and 24 unique atom types. The 100 fragments were selected based on the top-100 most frequently occurring fragments in our training set. In this work, we specify fragments as ring-containing substructures that do not contain any acyclic single bonds. However, in principle fragments could be any (valid) chemical substructure. Note that we only use 1 (geometrically optimized) conformation per fragment, which is assumed to be rigid. Hence, in its current implementation, SQUID does not consider different ring conformations (e.g., boat vs. chair conformations of cyclohexane).

Batching. Otherwise, generation sequences are treated independently. When training the rotatable bond scorer, we batch together different query dihedrals ψ foc of the same focal bond. Rather than scoring all 36 rotation angles in the same batch, we include the ground-truth rotation angle and randomly sample 9 of the other 35 to include in the batch. Within each batch (for both graph generation and scoring), all the encoded molecules M S are constrained to have the same number of atoms, and all the partial molecular structures G ′(c) l are constrained to have the same number of atoms. This restriction on batch composition is purely for convenience: the public implementation of VN-DGCNN from Deng et al. (2021) is designed to train on point clouds with the same number of points, and we construct point clouds by sampling a (fixed) n p points for each atom.

Training setup. We train the graph generator and the rotatable bond scorer separately. For the graph generator, we train for 2M iterations (batches), with a maximum batch size of 400 (generation sequences).
We use the Adam optimizer with default parameters. We use an initial learning rate of 2.5 × 10 -4 , which we exponentially decay by a factor of 0.9 every 50K iterations to a minimum of 5 × 10 -6 . We weight the auxiliary losses by β next-shape = 10.0 and β ∅-shape = 10.0. We log-linearly increase β KL from 10 -5 to 10 -1 over the first 1M iterations, after which it remains constant at 10 -1 . For each generation sequence, we randomize the rotation angle of the bond connecting the focus to the rest of the partial graph (e.g., the focal dihedral), as this dihedral has yet to be scored. In order to make the graph generator more robust to imperfect rotatable bond scoring at generation time, during training we perturb the dihedrals of each rotatable bond in the partially generated structure M ′ l by δψ ∼ N (µ = 0°, σ = 15°) while fixing the coordinates of the focus.

For the rotatable bond scorer, we train for 2M iterations (batches), with a maximum batch size of 32 (focal bonds). Since we sample 10 query dihedrals per focal bond, the effective (maximum) batch size is 320. We use the Adam optimizer with default parameters. We use an initial learning rate of 5 × 10 -4 , which we exponentially decay by a factor of 0.9 every 50K iterations to a minimum of 1 × 10 -5 . In order to make the scorer more robust to imperfect bond scoring earlier in the generation sequence, during training we perturb the dihedrals of each rotatable bond in the partially generated structure M ′ l by δψ ∼ N (µ = 0°, σ = 5°) while fixing the coordinates of the focus.

Training times. Training the graph generator on 1 GPU takes ∼5-7 days. Training the rotatable bond scorer on 1 GPU takes ∼5 days.

Dataset processing times. Pre-computing the regression targets for the rotatable bond scorer is the most compute-intensive part of dataset pre-processing. We required ∼2-3 days when parallelizing across 384 CPU cores.
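The learning-rate schedule above (shown for the graph generator) is a stepped exponential decay with a floor:

```python
def learning_rate(step, lr0=2.5e-4, decay=0.9, every=50_000, lr_min=5e-6):
    """Stepped exponential decay: multiply lr0 by `decay` once per `every`
    iterations, never dropping below `lr_min`."""
    return max(lr0 * decay ** (step // every), lr_min)
```

The scorer uses the same shape of schedule with lr0 = 5e-4 and lr_min = 1e-5.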
We extract an initial seed structure M 0 from each M S that satisfies the following criteria:

• M 0 has ≤ 6 heavy atoms.
• M 0 contains at least 1 terminal atom (an atom with only 1 bonded neighbor).
• Any new atom/fragment that is bonded to M 0 forms a valid dihedral. For this to be satisfied, M 0 must contain at least 3 atoms.
• M 0 does not contain a sequence of 4 atoms that forms a valid dihedral of a flexible rotatable bond. Hence, all rotatable bonds are scored during generation.

Figure 15 shows some examples of common seed structures. If there are multiple suitable M 0 for a given M S , we arbitrarily select one of them. We only consider one M 0 per M S in our experiments. Once we extract M 0 , we fix the coordinates of M 0 to their ground-truth positions in M S . We note that the above seeding procedure could be replaced by a suitable model that builds and (equivariantly) predicts the coordinates of the starting structure M 0 . In this work, we extract seeds for simplicity, but use the above criteria to ensure that the downstream shape-conditioned 3D generation and design tasks are non-trivial. We also emphasize that our choice of seeding procedure is fairly arbitrary, and does not necessarily restrict how the model is used in other generation tasks. For instance, larger seed structures could be used.

Masking invalid actions. When generating the molecular graph, we mask actions which would violate chemical valency. Because we define atom types in part by the number of single, double, and triple bonds the atom forms (including bonds to implicit hydrogens), we create valency masks by ensuring that each atom forms the correct number and types of bonds. When scoring rotatable bonds, we also mask query dihedrals that would lead to a severe steric clash (< 1 Å) with the existing partially generated molecular structure (App. A.2).

Bond scoring. When scoring a rotatable bond at generation time, we sample 36 dihedral angles separated by 10°.
In principle, one could use finer discretization schemes to generate more refined 3D conformations, at the cost of speed and/or memory.

Threshold for stopping local generation. We manually tuned τ ∅ , the threshold for local stopping, to τ ∅ = 0.01 in order to balance two tendencies of the model: 1) prematurely stopping local generation (thus leading to molecules that do not completely fill the target shape), and 2) generating extra atoms/fragments around the focus (leading to steric crowding).

Generation time. Typically, generating (in serial, on a CPU) one 3D molecule with 20-30 atoms takes 2-3 seconds. Overall generation time (on a CPU) and memory cost significantly increase if many stereoisomers are enumerated when scoring any particular rotatable bond (App. A.2). We thus cap the number of enumerated stereoisomers to 32 (per focus).
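The 10° discretization and severe-clash masking can be sketched as follows; `min_nonbonded_dist`, mapping each query dihedral to the closest non-bonded distance it would produce, is a hypothetical precomputed lookup rather than part of SQUID's actual interface:

```python
def candidate_dihedrals(n=36):
    """Query dihedral angles scored at generation time, evenly spaced over
    [-180, 180) degrees (n=36 gives the 10-degree spacing used here)."""
    step = 360.0 / n
    return [-180.0 + i * step for i in range(n)]

def mask_clashing(dihedrals, min_nonbonded_dist, clash_cutoff=1.0):
    """Drop query dihedrals whose conformation puts any non-bonded atom pair
    closer than the severe-clash cutoff (1 Angstrom); if every dihedral
    clashes, fall back to scoring all of them, as in App. A.2."""
    kept = [psi for psi in dihedrals
            if min_nonbonded_dist[psi] >= clash_cutoff]
    return kept if kept else list(dihedrals)
```

A finer discretization simply means calling `candidate_dihedrals` with larger n, at proportionally higher scoring cost.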

A.17 RELAXATION OF GENERATED GEOMETRIES

This work focuses on generating flexible drug-like molecules in specific 3D conformations that fit a target shape. Notably, SQUID generates 3D conformations using heuristic bond distances and bond angles. Consequently, the generated 3D conformations are explicitly constrained to have reasonable local bonding geometries. Empirically, the generated molecules also have few steric clashes, with just 13% of the generated molecules having a steric clash under 2 Å. Nevertheless, the generated molecules are not necessarily in their minimum-energy conformations. Note that this is by design, as SQUID is intentionally designed to generate molecules that fit arbitrary 3D molecular shapes. In this section, we relax the generated geometries with force-field optimization and consider how this geometry relaxation affects the shape similarity to the target shape.

We use the following procedure to (locally) optimize a generated 3D conformer. Given a SQUID-generated 3D molecule, we first extract the molecule's 2D graph and use RDKit to generate a new (unrelated) conformation of the molecule, with explicit hydrogens included; explicit hydrogens (which are not natively generated by SQUID) are needed for accurate force-field geometry optimization. We then manually rotate each rotatable bond in the RDKit-generated conformer to the exact rotation angle that was natively generated by SQUID, thereby (approximately) recovering the conformation that SQUID generated. The major differences between these new conformations and the original conformations are that the new conformations (1) include hydrogens, (2) have relaxed local bonding geometries, and (3) may have slightly different fragment geometries. Note that this is similar to the procedure performed in Appendix A.8, except that explicit hydrogens are now included in the 3D structure. We then use the MMFF force field in RDKit to optimize the new 3D conformation, using either 20 or 50 optimization steps, and finally remove the hydrogens from the optimized conformation.

Figure 16 shows the distribution of the shape similarity between original (unrelaxed) SQUID-generated (λ = 1.0) conformations and their geometry-optimized counterparts. After 20 steps of MMFF optimization, approximately 68% of generated conformers have a shape similarity to their geometry-relaxed counterparts of at least 0.9. After 50 steps, this proportion only drops to 62%. Hence, the majority of generated conformations do not undergo large shape changes upon (local) geometry optimization. This indicates that the generated conformations, whilst not necessarily minimum-energy conformations, are for the most part geometrically reasonable.
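The torsion-transfer step of this relaxation procedure (rotating each rotatable bond of the freshly embedded conformer to the angle SQUID generated; RDKit's rdMolTransforms.SetDihedralDeg performs the equivalent operation on an RDKit conformer) can be sketched in plain numpy. The function signatures and index conventions below are illustrative assumptions:

```python
import numpy as np

def dihedral(p):
    """Signed dihedral angle (radians) of four points p[0..3]."""
    b1, b2, b3 = p[1] - p[0], p[2] - p[1], p[3] - p[2]
    n1, n2 = np.cross(b1, b2), np.cross(b2, b3)
    m1 = np.cross(n1, b2 / np.linalg.norm(b2))
    return np.arctan2(m1 @ n2, n1 @ n2)

def set_dihedral(coords, quad, idx_move, target):
    """Rotate the atoms in idx_move about the j-k bond so that the i-j-k-l
    dihedral (quad = (i, j, k, l)) equals `target` in radians."""
    i, j, k, l = quad
    current = dihedral(coords[[i, j, k, l]])
    axis = (coords[k] - coords[j]) / np.linalg.norm(coords[k] - coords[j])
    # rotating the moving side by +theta decreases this dihedral by theta
    theta = current - target
    v = coords[idx_move] - coords[j]
    out = coords.copy()
    out[idx_move] = (v * np.cos(theta) + np.cross(axis, v) * np.sin(theta)
                     + axis * (v @ axis)[:, None] * (1 - np.cos(theta))) + coords[j]
    return out
```

Applying this to every rotatable bond (with idx_move the atoms on the far side of the j-k bond) approximately reproduces the SQUID-generated conformation before MMFF optimization.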
The largest changes in 3D shape upon geometry relaxation occur either due to steric clashes in the generated conformations (∼13% of generated molecules have a steric clash), or because the model approximates all acyclic rotatable bonds as fully rotatable. In reality, some "rotatable" bonds actually have high barriers to rotation due to favorable orbital interactions. For instance, some amide bonds (such as those in peptides) prefer to be planar because of π-orbital interactions and electron delocalization. In its current formulation, SQUID does not attempt to account for these kinds of preferred orientations of otherwise rotatable bonds.

We are especially interested in how relaxing the generated geometries affects the shape similarity to the target 3D shape. Figure 17 shows the distribution in simS(M′, MS) for the best of Nmax = 20 SQUID-generated molecules M′ (sampled from the prior, λ = 1.0) with simG(M′, MS) < 0.7, before and after the generated molecules undergo geometry optimization. Importantly, relaxing the geometries of the generated 3D molecules does not have a large impact on the distribution of shape similarity to the target 3D shape. In particular, even after 50 steps of geometry optimization, the generated molecules remain significantly enriched in shape similarity to the target 3D shape compared to molecules randomly sampled from the dataset.

In this section, we compare SQUID to LigDream (Skalic et al., 2019), a shape-captioning network that generates 1D SMILES strings of molecules conditioned on a 3D CNN encoding of a target 3D molecular shape and pharmacophores. In contrast to SQUID, LigDream (1) does not generate 3D conformers end-to-end, and hence requires post hoc conformer generation/enumeration; (2) is not robust to rotations, as LigDream's 3D CNN encoder is not SO(3)-invariant; and (3) does not separately encode molecular shape and chemical identity.
Nevertheless, we still compare to LigDream to demonstrate SQUID's significant advantages for shape-conditioned molecular generation and design. Skalic et al. (2019) train LigDream to convergence using >25 million molecules from a drug-like subset of ZINC15. Rather than re-training LigDream on our significantly smaller dataset of ∼1 million molecules, we directly apply Skalic et al. (2019)'s pre-trained LigDream to molecules from our test set. Note that our dataset is derived from MOSES, itself a subset of ZINC15, and hence the two models are trained on very similar molecules. We find this sufficient for the purposes of this comparison, with the caveat that LigDream is still trained on ∼20 times more data than SQUID.

To fairly and directly compare SQUID (a 3D model) to LigDream (a 1D/2D model requiring post hoc conformer generation), we perform the following procedure. Given a reference molecule MS with a target 3D shape, we encode MS using both SQUID (λ = 1.0) and LigDream. LigDream's encoder also employs a VAE; LigDream samples diverse molecules by controlling a "variability factor" λLigDream, which scales the variance of LigDream's posterior distribution. Following Skalic et al. (2019), we consider both λLigDream = 1.0 and 5.0. After encoding MS, we use SQUID and LigDream to generate a pool of 50 molecules, which we then filter to only include those with simG(M′, MS) < 0.7. Of the filtered molecules, we then sample Nmax = 20 molecules. For SQUID, we directly select the generated molecule among these Nmax molecules that maximizes simS(M′, MS). For LigDream, we use RDKit to generate NC conformations for each of the Nmax generated molecules, and select the conformer among these NmaxNC total conformations that maximizes simS(M′, MS). Since SQUID directly generates just one conformation per generated molecule, we use NC = 1 for a head-to-head comparison.
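The filter-and-select protocol above can be sketched as follows; here sim_g and sim_s are precomputed similarity lookups (an assumption for illustration, since computing them in practice requires fingerprint comparison and ROCS-style alignment):

```python
import random

def select_best(candidates, sim_g, sim_s, n_max=20, graph_cutoff=0.7, seed=0):
    """Keep candidates whose graph similarity to the reference is below
    graph_cutoff, sample up to n_max of them, and return the candidate
    maximizing shape similarity to the reference."""
    rng = random.Random(seed)
    pool = [c for c in candidates if sim_g[c] < graph_cutoff]
    pool = rng.sample(pool, min(n_max, len(pool)))
    return max(pool, key=lambda c: sim_s[c])
```

For LigDream, the same selection would run over the Nmax * NC enumerated conformers instead of Nmax directly generated conformations.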
Using this procedure, Figure 18 plots the distribution in simS(M′, MS) for 1000 distinct MS, using SQUID (prior, λ = 1.0) and LigDream (λLigDream = 1.0, 5.0). Overall, SQUID generates 3D molecules that are significantly more shape-similar to MS than those from LigDream with λLigDream = 1.0 or 5.0. This is because SQUID directly generates 3D conformers that fit the target shape, rather than relying on a non-shape-conditioned conformer generator (such as RDKit). Importantly, the SQUID-generated molecules have an average chemical similarity to the encoded molecule (simG(M′, MS)) of 0.26, whereas the LigDream-generated molecules (using λLigDream = 1.0) have a higher average chemical similarity of 0.36. This is a substantial result, as we are interested in generating chemically diverse molecules that fit the target shape, not molecules that look like MS. Note that SQUID can always generate more chemically similar molecules by decreasing λ, if desired. Increasing λLigDream to 5.0 causes LigDream to generate more chemically diverse molecules, with the average chemical similarity now on par with that obtained with SQUID. However, setting λLigDream = 5.0 causes the LigDream-generated molecules to be substantially less shape-similar to MS. This emphasizes that 1D/2D shape-conditioned generative models will struggle to generate chemically diverse molecules that still fit the target 3D shape, presumably because they do not simultaneously generate both the molecular graph and the 3D coordinates (see Challenge 3). On the other hand, because SQUID (1) separately encodes 3D shape and 2D chemical identity, and (2) simultaneously generates both the molecular graph and the molecular coordinates, SQUID is able to generate more diverse molecules while still fitting the target 3D shape. To test whether increasing NC can close the gap in the shape similarity distributions between SQUID and LigDream (with λLigDream = 5.0), we increase NC to 20.
This essentially permits RDKit to sample more conformational space in order to find conformations of the LigDream-generated molecules that best fit the target shape. Note that this is no longer a head-to-head comparison, as we are now comparing the best-of-20 SQUID-generated conformations to the best-of-400 LigDream(+RDKit)-generated conformations. Nevertheless, Figure 19 shows that even when using NC = 20, SQUID still significantly outperforms LigDream in generating chemically diverse molecules with high shape similarity to the target 3D shape.



SQUID: Shape-Conditioned Equivariant Generator for Drug-Like Molecules

We have dropped the (c) notation for clarity; however, each zdec is specific to each (M′(c-1)_l, MS) system. We train the scorer independently from the graph generator, but with a parallel architecture; hence, zdec ≠ zdec,scorer. The main architectural difference between the two models (graph generator and scorer) is that we do not variationally encode Hscorer into Hvar,scorer, as we find it does not impact empirical performance.



…R for a rotation R. Finally, we sum-pool the per-atom features in X_Hvar into a global equivariant representation Z ∈ R^(dz×3). We also embed a global invariant representation z ∈ R^dz by applying a VN-Inv to X_Hvar, concatenating the output with Hvar, passing through an MLP, and sum-pooling the resultant per-atom features. Given MS, we sample new molecules M′ ∼ P(M|S, M0) by encoding PS into equivariant shape features X and variationally sampling h^(a)…

Fig. 2 sketches a generation sequence by which a new atom/fragment is attached to the focus, yielding the updated graph G′.

Figure 3: (A) Distributions of sim S (M, M S ) for the best of N max molecules with sim G (M, M S ) < 0.7 or 0.3 when sampling from the dataset (grey) or SQUID with either λ = 1.0 (prior; red) or λ = 0.3 (green). (B) Histograms of sim S vs. sim G for two sampling schemes from (A). (C) Three sets of samples from the prior (red) and the dataset (N max = 20; grey), overlaid on M S (blue).

Figure 4: Demonstration of conformer sampling when scoring rotatable bonds. For each query dihedral ψfoc = ψ1, ψ2, … ∈ [-π, π), we sample Nψ conformations of the subgraph GTfoc (keeping ψfoc held fixed) induced by the subtree Tfoc whose root node is the current focus. For each sampled (subgraph) conformation GTfoc, we compute the shape similarity between the sampled conformation and the ground-truth (subgraph) conformation G^S_Tfoc. We select the maximum computed (non-aligned) shape similarity sim*S amongst the Nψ sampled conformations as the regression target for that ψfoc.

Computing regression targets. When training with teacher forcing (M′l = M^S_l, G′ = G^S), we compute regression targets sψfoc ≈ max over ψl+2, ψl+3, … of sim*S(G′(ψfoc), G^S) by setting the focal dihedral ψl+1 = ψfoc, sampling Nψ conformations of the "future" graph GTfoc induced by the subtree Tfoc whose root (sub)tree-node is the focus, and computing sψfoc as the maximum of sim*S(G^(i)_Tfoc, G^S_Tfoc) over the i = 1, …, Nψ sampled conformations.
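In code, the regression target for one query dihedral reduces to a best-of-Nψ similarity; sample_future_conf and sim_star below are stand-ins (assumptions) for the subtree conformer sampler and the rotationally-sensitive sim*_S:

```python
def dihedral_regression_target(sample_future_conf, sim_star, n_psi=8):
    """Regression target for one query dihedral: fix the focal dihedral,
    sample n_psi conformations of the 'future' subtree graph, and keep the
    best (non-aligned) shape similarity to the ground-truth subtree."""
    return max(sim_star(sample_future_conf(i)) for i in range(n_psi))
```

The scorer is then trained to regress this target for each masked-in query dihedral.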

Figure 6: Random examples of molecules generated by SQUID using different sampling strategies. The generated 3D molecules are shown overlaid on the target molecule M S (blue). In these examples, the generated molecules have not been explicitly aligned to M S , and their displayed poses are those directly generated by SQUID.

Figure 7: Additional random examples of molecules generated by SQUID using different sampling strategies. The generated 3D molecules are shown overlaid on the target molecule M S (blue). In these examples, the generated molecules have not been explicitly aligned to M S , and their displayed poses are those directly generated by SQUID.

Figure 8: Results of shape-constrained MO for the 40/48 tasks which improved the objective score. Subpanels depict the target molecules M S (top row), the SQUID-optimized molecules M * (middle row), the overlaid 3D structures (bottom row; blue is M S , red is M * ), and their respective objective scores for the (from top to bottom) GSK3B, JNK3, Osimertinib MPO, Sitagliptin MPO, Celecoxib Rediscovery, and Thiothixene Rediscovery tasks. In the overlaid structures, the generated molecules have not been explicitly aligned to M S ; the poses shown are those directly generated by SQUID.

Figure 9: Parity plot empirically showing the close approximation of Eq. 1 (α = 0.81) to the shape-similarity function used by the commercial ROCS program, for 50000 shape comparisons.

Figure 10: Mean shape similarity simS(M, MS; α) for increasing values of α (Equation 1) for 1000 target molecules MS from the test set. When sampling either from the dataset or from SQUID, we compute the mean shape similarity (across the 1000 targets) after selecting the best sample M, i.e., the one maximizing simS(M′, MS; α) amongst Nmax samples M′ for which simG(M′, MS) < 0.7.

Figure 11: Histogram of shape similarity simS(Mfixed, Mrelaxed) between 1000 test molecules whose acyclic bonding geometries are fixed by heuristics and their geometrically relaxed counterparts.

Figure 12: Distribution of sim S (M ′ , M S ) when sampling (N max = 20, sim G (M ′ , M S ) < 0.7) molecules M ′ from SQUID (λ = 1.0) and from the dataset, when relaxing the bonding geometries of each M ′ and M S . Importantly, SQUID generates 3D molecules that are significantly enriched in shape similarity even after relaxing the (fixed) bonding geometries of the generated structures.

Figure 13: Distributions of simS(M′, MS) for the best of Nmax generated molecules for which simG(M′, MS) < 0.7 or 0.3, when using SQUID or SQUID-NoEqui. Generated distributions are compared to those obtained by similarly sampling molecules from the training dataset. Overall, ablating SQUID's equivariance significantly reduces the enrichment in shape similarity of generated molecules to the target shape.

Figure 14: All 100 distinct fragments in our atom/fragment library L f .

Figure 15: Examples of starting seed structures M 0 highlighting where new atoms/fragments might be attached (orange).

Figure 16: Distribution of shape similarity between SQUID-generated molecules sampled from the prior (λ = 1.0), and the same molecules after 20 (blue) or 50 (orange) steps of geometry relaxation with MMFF.

Figure 17: Distributions of sim S (M ′ , M S ) for the best of N max = 20 SQUID-generated (λ = 1.0) molecules M ′ with sim G (M ′ , M S ) < 0.7, before (red) or after each M ′ has undergone 20 (dark grey) or 50 (light grey) steps of MMFF geometry relaxation.

Figure 18: Distributions of simS(M′, MS) for the best of Nmax = 20 SQUID-generated (prior, λ = 1.0) or LigDream-generated (NC = 1, λ = 1.0 or 5.0) molecules with simG(M′, MS) < 0.7. SQUID (λ = 1.0) generates more chemically diverse molecules compared to LigDream (λ = 1.0), and SQUID generates more shape-similar molecules compared to LigDream (λ = 5.0).

Figure 19: Distributions of sim S (M ′ , M S ) for the best of N max = 20 SQUID-generated (prior, λ = 1.0) or LigDream-generated (N C = 20, λ = 5.0) molecules with sim G (M ′ , M S ) < 0.7.

Setting α = 0.81 approximates the shape similarity function used by the ROCS program (App. A.6). sim*S is sensitive to SE(3) transformations of molecule MA with respect to molecule MB. Thus, we define simS(MA, MB) as the shape similarity after MA is optimally aligned to MB (with ROCS).
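A shape similarity of this kind can be sketched with atom-centered Gaussians. The exact prefactors and exponent convention of the paper's Eq. 1 are assumptions here (constant prefactors common to every atom pair cancel in the Tanimoto-style ratio), but the construction illustrates the role of α:

```python
import numpy as np

def gaussian_overlap(A, B, alpha=0.81):
    """Sum of pairwise Gaussian overlaps between two (n, 3) coordinate arrays.
    Prefactors common to every pair are dropped, as they cancel below."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return float(np.exp(-0.5 * alpha * d2).sum())

def shape_tanimoto(A, B, alpha=0.81):
    """Tanimoto-style shape similarity over atom-centered Gaussian overlaps;
    larger alpha means narrower Gaussians and a stricter notion of overlap."""
    vab = gaussian_overlap(A, B, alpha)
    return vab / (gaussian_overlap(A, A, alpha) + gaussian_overlap(B, B, alpha) - vab)
```

Identical coordinate sets score 1.0 and well-separated ones score near 0.0, matching the behavior expected of simS.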

Top-1 optimized objective scores across 6 objectives and 8 seed molecules M S (per objective), under the shape-similarity constraint sim S (M * , M S ) ≥ 0.85. (-) indicates that SQUID, limited to 31K generated samples, or virtual screening (VS) could not improve upon M S .

List of definitions, terms, and notations:
• simG(M′, MS): graph (chemical) similarity between molecules M′ and MS
• sim*S(GA, GB): rotationally-sensitive shape similarity of GA and GB
• simS(MA, MB): shape similarity of molecules MA and MB after MA is optimally aligned to MB (with ROCS)
• α: parameter controlling the width of the atom-centered Gaussians when computing simS(MA, MB)

List of definitions, terms, and notations (continued)

Generation statistics computed by sampling 50 molecules from the prior (λ = 1.0) per encoded target molecule M S for 1000 random targets from the test set.

Summary of training and generation parameters. See App. A.15 and A.16 for further descriptions of training and generation protocols.

We use molecules from MOSES (Polykovskiy et al., 2020) to train, validate, and test SQUID. Starting from the train/test sets provided by MOSES, we first generate an RDKit conformer for each molecule and remove any molecules for which we cannot generate a conformer. Conformers are initially created with the ETKDG algorithm in RDKit, and then separately optimized for 200 iterations with the MMFF force field. We then fix the acyclic bond distances and bond angles for each conformer (App. A.8). Using the molecules from MOSES's train set, we create the fragment library by extracting the top-100 most frequently occurring fragments (ring-containing substructures without acyclic bonds). We separately generate a 3D conformer for each distinct fragment, optimizing the fragment structures with MMFF for 1000 steps. Given these 100 fragments, we then remove all molecules from the train and test sets containing non-included fragments. From the filtered training set, we extract 24 unique atom types, which we add to the atom/fragment library Lf, and we remove any molecule in the test set that contains an atom type not included in these 24. Finally, we randomly split the (filtered) training set into separate training/validation splits. The training split contains 1,058,352 molecules, the validation split contains 264,589 molecules, and the test set contains 146,883 molecules. Each molecule has one conformer.

Collecting training data for graph generation and scoring. We individually supervise each step of autoregressive graph generation and use teacher forcing. We collect the ground-truth generation actions by representing each molecular graph as a tree whose root tree-node is either a terminal atom or a terminal fragment in the graph. A "terminal" atom is bonded to only one neighboring atom. A "terminal" fragment has only one acyclic (rotatable) bond to a neighboring atom/fragment. Starting from this terminal atom/fragment, we construct the molecule according to a breadth-first-search traversal of the generation tree (see Fig. 2); we break ties using RDKit's canonical atom ordering. We augment the data by enumerating all generation trees starting from each possible terminal atom/fragment in the molecule. For each rotatable bond in the generation trees, we collect regression targets for training the scorer by following the procedure outlined in App. A.2.
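The breadth-first construction order with canonical-rank tie-breaking can be sketched as follows; adjacency and canonical_rank are plain dictionaries standing in for the molecular graph and RDKit's canonical atom ordering:

```python
from collections import deque

def bfs_generation_order(adjacency, root, canonical_rank):
    """Breadth-first traversal of a generation tree rooted at a terminal
    atom/fragment, visiting siblings in order of their canonical rank."""
    order, seen, queue = [], {root}, deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        # ties among unvisited neighbors are broken by canonical rank
        for nbr in sorted(adjacency[node], key=canonical_rank.get):
            if nbr not in seen:
                seen.add(nbr)
                queue.append(nbr)
    return order
```

Enumerating this traversal once per terminal atom/fragment yields the augmented set of ground-truth action sequences.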

Given a SQUID-generated 3D molecule, we first extract the molecule's 2D graph, and use RDKit to generate a new (unrelated) conformation of the molecule, with explicit hydrogens included. Note that explicit hydrogens (which are not natively generated by SQUID) are needed for accurate force-field geometry optimization. We then manually rotate each rotatable bond in the RDKit-generated conformer to the exact rotation angle that was natively generated by SQUID, thereby (approximately) yielding the same conformation that SQUID generated. The major differences between these new conformations and the original conformations are that the new conformations (1) include hydrogens, (2) have relaxed local bonding geometries, and (3) may have slightly different fragment geometries.

ACKNOWLEDGMENTS

This research was supported by the Office of Naval Research under grant number N00014-21-1-2195. This material is based upon work supported by the National Science Foundation Graduate Research Fellowship under Grant No. 2141064. The authors acknowledge the MIT SuperCloud and Lincoln Laboratory Supercomputing Center for providing HPC resources that have contributed to the research results reported within this paper. The authors thank Rocío Mercado, Sam Goldman, Wenhao Gao, and Lagnajit Pattanaik for providing helpful suggestions regarding the content and presentation of this paper.


for λ ∈ [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
    ▷ Mix mutated chemical and shape information
    ▷ Cross by randomly swapping half of the atom embeddings
    Add Hc to TC
end for

Parameter sharing. For both the graph generator and the rotatable bond scorer, the (variational) molecule encoder (in the Encoder, Fig. 2) and the partial molecule encoder (in the Decoder, Fig. 2) share the same fragment encoder (Lf-GNN), which is trained end-to-end with the rest of the model. Apart from Lf-GNN, these encoders do not share any learnable parameters, despite having parallel architectures. The graph generator and the rotatable bond scorer are completely independent and are trained separately.

Hyperparameters. Tables 6 and 7 tabulate the hyperparameters used for SQUID across all the experiments conducted in this paper. Table 8 summarizes training and generation parameters, but we refer the reader to App. A.15 and A.16 for more detailed discussion of training and generation protocols. Because of the large hyperparameter search space and long training times, we did not perform extensive hyperparameter optimizations. We manually tuned the learning rates and schedulers to maintain training stability, and we maximized batch sizes given memory constraints. We set β∅-shape = 10 and βnext-shape = 10 to make the magnitudes of L∅-shape and Lnext-shape comparable to the other loss components for graph generation. We slowly increase βKL over the course of training from 10^-5 to a maximum of 10^-1, which we found to provide a reasonable balance between LKL and graph reconstruction.

Table 6 (excerpt): Architecture hyperparameters for SQUID's graph generator.
• ϕh (GNN-Lf) [hidden, (output)] layer sizes: [64, (64)]
• Number of GNN-Lf layers: 3
• ϕm(t=0) (Enc., Dec. GNNs) [hidden, (output)] layer sizes: [128, (64)]
• ϕh(t=0) (Enc., Dec. GNNs) [hidden, (output)] layer sizes: [128, (64)]
• ϕm(t>0) (Enc., Dec. GNNs) [hidden, (output)] layer sizes: [64, (64)]
• ϕh(t>0) (Enc., Dec. GNNs) [hidden, (output)] layer sizes: [64, (64)]

Table 7: Architecture hyperparameters for SQUID's rotatable bond scorer. Many of the GNN, VN-DGCNN, VN-Inv, VN-MLP, and MLP modules use the same set of hyperparameters (not learnable parameters), but some do not. Where it appears, (→) indicates the outputs of the module(s) to which we refer (see Fig. 2). The rotatable bond scorer has a total of 1,270,080 learnable parameters.
• ϕm(t) (GNN-Lf) [hidden, (output)] layer sizes: [64, (64)]
• ϕh(t) (GNN-Lf) [hidden, (output)] layer sizes: [64, (64)]
• Number of GNN-Lf layers: 3
• ϕm(t=0) (Enc./Dec. GNNs) [hidden, (output)] layer sizes: [128, (64)]
• ϕh(t=0) (Enc./Dec. GNNs) [hidden, (output)] layer sizes: [128, (64)]
• ϕm(t>0) (Enc./Dec. GNNs) [hidden, (output)] layer sizes: [64, (64)]
• ϕh(t>0) (Enc./Dec. GNNs) [hidden, (output)] layer sizes: [64, (64)]
• Number of (Enc./Dec.)-GNN layers: 3
• Conv. dimensions for VN-DGCNN (→ X, X′): [32, 32, 64, 128]
• Conv. pooling type for VN-DGCNN (→ X, X′): mean
• Number of k-NN for VN-DGCNN (→ X, X′): 10
• VN-Inv (→ X, X′, XH, X′H) [hidden, (output)] layer sizes: [64, 32, (3)]
• VN-Inv (→ zdec) [hidden, (output)] layer sizes: [128, 64, (64)]
• MLP (scoring) [hidden, (output)] layer sizes: [64, 64, 64, (1)]
• MLP (all) hidden layer activation function: LeakyReLU(0.2)

