SYMMETRY-AWARE ACTOR-CRITIC FOR 3D MOLECULAR DESIGN

Abstract

Automating molecular design using deep reinforcement learning (RL) has the potential to greatly accelerate the search for novel materials. Despite recent progress on leveraging graph representations to design molecules, such methods are fundamentally limited by the lack of three-dimensional (3D) information. In light of this, we propose a novel actor-critic architecture for 3D molecular design that can generate molecular structures unattainable with previous approaches. This is achieved by exploiting the symmetries of the design process through a rotationally covariant state-action representation based on a spherical harmonics series expansion. We demonstrate the benefits of our approach on several 3D molecular design tasks, where we find that building in such symmetries significantly improves generalization and the quality of generated molecules.

1. INTRODUCTION

The search for molecular structures with desirable properties is a challenging task with important applications in de novo drug design and materials discovery (Schneider et al., 2019). There exist a plethora of machine learning approaches to accelerate this search, including generative models based on variational autoencoders (VAEs) (Gómez-Bombarelli et al., 2018), recurrent neural networks (RNNs) (Segler et al., 2018), and generative adversarial networks (GANs) (De Cao & Kipf, 2018). However, the reliance on a sufficiently large dataset for exploring unknown regions of chemical space is a severe limitation of such supervised models. Recent RL-based methods (e.g., Olivecrona et al. (2017), Jørgensen et al. (2019), Simm et al. (2020)) mitigate the need for an existing dataset of molecules as they only require access to a reward function.

Most approaches rely on graph representations of molecules, where atoms and bonds are represented by nodes and edges, respectively. This is a strongly simplified model designed for the description of single organic molecules. It is unsuitable for encoding metals and molecular clusters as it lacks information about the relative position of atoms in 3D space. Further, geometric constraints on the design process cannot be included, e.g. those given by the active site of an enzyme.

A more general representation closer to the physical system is one in which a molecule is described by its atoms' positions in Cartesian coordinates. However, it would be very inefficient to naively learn a model based on this representation. That is because molecular properties such as the energy are invariant (i.e. unchanged) under symmetry operations like translation or rotation of all atomic positions. A model without the right inductive bias would thus have to learn those symmetries from scratch.

In this work, we develop a novel RL approach for designing molecules in Cartesian coordinates that explicitly encodes these symmetry operations.
The agent builds molecules by consecutively placing atoms such that if the generated structure is rotated or translated, the agent's action is rotated and translated accordingly; this way, the reward remains the same (see Fig. 1 (a)). We achieve this through a rotationally covariant state representation based on spherical harmonics, which we integrate into a novel actor-critic network architecture with an auto-regressive policy that maintains the desired covariance. Building in this inductive bias enables us to generate molecular structures with more complex coordination geometry than the class of molecules that were attainable with previous approaches. Finally, we perform experiments on several 3D molecular design tasks, where we find that our approach significantly improves the generalization capabilities of the RL agent and the quality of the generated molecules.

In summary, our contributions are as follows:
• we propose the first approach for 3D molecular design that exploits symmetries of the design process by leveraging a rotationally covariant state representation;
• we integrate this state representation into an actor-critic neural network architecture with a rotationally covariant auto-regressive policy, where the orientation of the atoms to be placed is modeled through a flexible distribution based on spherical harmonics;
• we demonstrate the benefits of our approach on several 3D molecular design tasks, including a newly proposed task that showcases the generalization capabilities of our agent.

2.1. REINFORCEMENT LEARNING FOR MOLECULAR DESIGN

In the standard RL setting (Sutton & Barto, 2018), an agent interacts with the environment to maximize its reward. Formally, such an environment is described by a Markov decision process (MDP) $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{T}, \mu_0, \gamma, r)$ with states $s_t \in \mathcal{S}$, actions $a_t \in \mathcal{A}$, transition dynamics $\mathcal{T}: \mathcal{S} \times \mathcal{A} \to \mathcal{S}$, initial state distribution $\mu_0$, discount factor $\gamma \in (0, 1]$, and reward function $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$. The goal is to learn a stochastic policy $\pi(a_t | s_t)$ that maximizes the expected discounted return $J(\theta) = \mathbb{E}_{s_0 \sim \mu_0}[V^\pi(s_0)]$, where the value function $V^\pi(s_t) = \mathbb{E}_\pi[\sum_{t'=t}^{T} \gamma^{t'} r(s_{t'}, a_{t'}) \mid s_t]$ is defined as the expected discounted return when starting from state $s_t$ and following policy $\pi$.

Following Simm et al. (2020), we design molecules by iteratively picking atoms from a bag and positioning them on a 3D canvas. Such a sequential decision-making problem is described by an MDP where the state $s_t = (C_t, B_t)$ comprises both the canvas $C_t$ and the bag $B_t$. The canvas $C_t = C_0 \cup \{(e_i, x_i)\}_{i=0}^{t-1}$ is a set of atoms with chemical element $e_i \in \{\mathrm{H}, \mathrm{C}, \mathrm{N}, \mathrm{O}, \dots\}$ and position $x_i \in \mathbb{R}^3$ placed up to time $t-1$, where $C_0$ can either be empty or contain a set of initially placed atoms. The number of atoms on the canvas is denoted by $|C_t|$. The bag $B_t = \{(e, m(e))\}$ is a multi-set of atoms yet to be placed, where $m(e)$ is the multiplicity of the element $e$. Each action $a_t = (e_t, x_t)$ consists of the element $e_t \in B_t$ and position $x_t \in \mathbb{R}^3$ of the next atom to be added to the canvas. Placing an atom through action $a_t$ in state $s_t$ is modeled by a deterministic transition function $\mathcal{T}(s_t, a_t)$ that yields the next state $s_{t+1} = (C_{t+1}, B_{t+1})$ with $B_{t+1} = B_t \setminus e_t$. The reward function $r(s_t, a_t) = -\Delta E(s_t, a_t)$ is given by the negative energy difference between the resulting structure described by $C_{t+1}$, and the sum of energies of the current structure $C_t$ and a new atom of element $e_t$ placed at the origin, i.e.

$$\Delta E(s_t, a_t) = E(C_{t+1}) - [E(C_t) + E(\{(e_t, 0)\})].$$

Intuitively, the reward encourages the agent to build stable, low-energy structures. We evaluate the energy using the fast semi-empirical Parametrized Method 6 (PM6) (Stewart, 2007) as implemented in SPARROW (Husch et al., 2018; Bosia et al., 2020); see Appendix A for details.

An example of a rollout is shown in Fig. 1 (b). At the beginning of the episode, the agent observes the initial state $(C_0, B_0) \sim \mu_0(s_0)$, e.g. $C_0 = \emptyset$ and $B_0 = \mathrm{SOF_4}$ [1]. The agent then iteratively constructs a molecule by placing atoms from the bag onto the canvas until the bag is empty. [2]

2.2. ROTATIONALLY COVARIANT NEURAL NETWORKS

A function $f: \mathcal{X} \to \mathcal{Y}$ is invariant under a transformation operator $T_g: \mathcal{X} \to \mathcal{X}$ if $f(T_g[x]) = f(x)$ for all $x \in \mathcal{X}$, $g \in G$, where $G$ is a mathematical group. In contrast, $f$ is covariant with respect to $T_g$ if there exists an operator $T'_g: \mathcal{Y} \to \mathcal{Y}$ such that $f(T_g[x]) = T'_g[f(x)]$.

To achieve rotational covariance, it is natural to work with spherical harmonics. They are a set of complex-valued functions $Y_\ell^m: S^2 \to \mathbb{C}$ with $\ell = 0, 1, 2, \dots$ and $m = -\ell, -\ell+1, \dots, \ell-1, \ell$ on the unit sphere $S^2$ in $\mathbb{R}^3$. The first few spherical harmonics are in Appendix B. They are defined by

$$Y_\ell^m(\vartheta, \varphi) = (-1)^m \sqrt{\frac{2\ell+1}{4\pi} \frac{(\ell-m)!}{(\ell+m)!}} \, P_\ell^m(\cos\vartheta) \, e^{im\varphi}, \quad \varphi \in [0, 2\pi], \; \vartheta \in [0, \pi],$$

where $P_\ell^m$ denotes the associated normalized Legendre polynomials of the first kind (Bateman, 1953), and each $Y_\ell^m$ is normalized such that $\int_0^{2\pi} \int_0^{\pi} |Y_\ell^m(\vartheta, \varphi)|^2 \sin\vartheta \, d\vartheta \, d\varphi = 1$. Any square-integrable function $f: S^2 \to \mathbb{C}$ can be written as a series expansion in terms of the spherical harmonics,

$$f(x) = \sum_{\ell=0}^{\infty} \sum_{m=-\ell}^{\ell} \hat{f}_\ell^m \, Y_\ell^m(x), \qquad (2)$$

where $x = (\vartheta, \varphi) \in S^2$. The complex-valued coefficients $\{\hat{f}_\ell^m\}$ are the analogs of Fourier coefficients and are given by $\hat{f}_\ell^m = \int_{S^2} f(x) \, Y_\ell^{m*}(x) \, \Omega(dx)$.

Such a function $f$ can be modeled by learning the coefficients $\{\hat{f}_\ell^m\}$ using CORMORANT (Anderson et al., 2019), a neural network architecture for predicting properties of chemical systems that works entirely in Fourier space. A key feature is that each neuron is covariant to rotation but invariant to translation; further, each neuron explicitly corresponds to a subset of atoms in the molecule. The input of CORMORANT is a spherical function $f^0: S^2 \to \mathbb{C}^d$ and the output is a collection of vectors $\hat{f} = \{\hat{f}_0, \hat{f}_1, \dots, \hat{f}_L\}$, where each $\hat{f}_\ell \in \mathbb{C}^{\tau \times (2\ell+1)}$ is a rotationally covariant vector with $\tau$ channels. That is, if the input is rotated by $R \in SO(3)$, then each $\hat{f}_\ell$ transforms as $\hat{f}_\ell \mapsto D^\ell(R) \hat{f}_\ell$, where $D^\ell(R): SO(3) \to \mathbb{C}^{(2\ell+1) \times (2\ell+1)}$ are the irreducible representations of SO(3), also called the Wigner D-matrices.
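To make the distinction between invariance and covariance concrete, the following minimal sketch applies a rotation about the z-axis to a small set of atom positions. The functions `centroid` and `pairwise_distance_sum` are illustrative stand-ins chosen for this example (they are not part of the architecture): the first is covariant under rotation, the second invariant.

```python
import math

def rotate_z(p, angle):
    """Rotate a 3D point about the z-axis (one element of SO(3))."""
    c, s = math.cos(angle), math.sin(angle)
    x, y, z = p
    return (c * x - s * y, s * x + c * y, z)

def centroid(points):
    """Covariant map: rotating the inputs rotates the output in the same way."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))

def pairwise_distance_sum(points):
    """Invariant map: rotating the inputs leaves the output unchanged."""
    return sum(math.dist(p, q)
               for i, p in enumerate(points)
               for q in points[i + 1:])

# Illustrative atom positions (arbitrary values, in Angstrom).
atoms = [(0.0, 0.0, 0.0), (1.1, 0.0, 0.0), (0.3, 1.2, -0.4)]
angle = 0.7
rotated = [rotate_z(p, angle) for p in atoms]

# f(T_g[x]) = f(x) for the invariant map ...
invariant_gap = abs(pairwise_distance_sum(rotated) - pairwise_distance_sum(atoms))
# ... and f(T_g[x]) = T'_g[f(x)] for the covariant map.
covariant_gap = math.dist(centroid(rotated), rotate_z(centroid(atoms), angle))
```

Both gaps vanish up to floating-point error, which is the property that the energy (invariant) and the placed position (covariant) are required to satisfy in the remainder of the paper.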

3. COVARIANT POLICY FOR MOLECULAR DESIGN

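An efficient RL agent needs to exploit the symmetries of the molecular design process. Therefore, we require a policy $\pi(a|s)$ with actions $a = (e, x)$ that is covariant under translation and rotation with respect to the position $x$, i.e., $x$ should rotate (or translate) accordingly if the atoms on the canvas $C$ are rotated (or translated). In contrast, the policy needs to be invariant to the element $e$, i.e. the chosen element remains unchanged under such transformations (see Fig. 1 (a)). Since learning such a policy is difficult when working directly in global Cartesian coordinates, we instead follow Simm et al. (2020) and use an action representation that is local with respect to an already placed focal atom. If the next atom is placed relative to the focal atom, covariance under translation of $x$ is automatically achieved and only the rotational covariance remains to be dealt with.

As shown in Fig. 2, we model the action $a$ through a sequence of sub-actions: (1) the index $f \in \{1, \dots, |C|\}$ of the focal atom around which the next atom is placed [3], (2) the element $e \in \{1, \dots, N_e\}$ of the next atom from the set of available elements, (3) a distance $d \in \mathbb{R}_+$ between the focal atom and the next atom, and (4) the orientation $\tilde{x} = (\vartheta, \varphi) \in S^2$ of the atom on a unit sphere around the focal atom. Denoting $x_f$ as the position of the focal atom, we obtain action $a = (e, x)$ by mapping the local coordinates $(\tilde{x}, d, f)$ to global coordinates $x = x_f + d \cdot \tilde{x}$, where $x$ is now covariant under translation and rotation. We choose these sub-actions using the following auto-regressive policy:

$$\pi(a|s) = \pi(\tilde{x}, d, e, f \mid s) = p(\tilde{x} \mid d, e, f, s) \, p(d \mid e, f, s) \, p(e \mid f, s) \, p(f \mid s). \qquad (3)$$

A novel actor-critic neural network architecture that implements this policy is illustrated in Fig. 3. In the following, we discuss its state embedding, actor, and critic networks in more detail.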
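The mapping from local coordinates (orientation, distance, focal atom) to a global position, and the covariance it is meant to provide, can be checked numerically. The sketch below is only an illustration under simple assumptions: the rotation is restricted to the z-axis, and all numerical values and helper names (`unit_vector`, `to_global`) are made up for the example.

```python
import math

def unit_vector(theta, phi):
    """Orientation on the unit sphere from polar angle theta and azimuth phi."""
    return (math.sin(theta) * math.cos(phi),
            math.sin(theta) * math.sin(phi),
            math.cos(theta))

def to_global(focal_position, distance, orientation):
    """Global position of the new atom: focal position + distance * orientation."""
    return tuple(xf + distance * o for xf, o in zip(focal_position, orientation))

def rotate_z(p, angle):
    """Rotate a 3D vector about the z-axis."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * p[0] - s * p[1], s * p[0] + c * p[1], p[2])

def translate(p, shift):
    return tuple(a + b for a, b in zip(p, shift))

x_f = (0.5, -0.2, 1.0)            # position of the focal atom (arbitrary)
d = 1.1                           # sampled distance (arbitrary)
orientation = unit_vector(1.0, 2.0)
angle, shift = 0.9, (3.0, -1.0, 2.0)

# Place the atom first, then rotate and translate the whole structure ...
moved_action = translate(rotate_z(to_global(x_f, d, orientation), angle), shift)
# ... or rotate and translate the canvas first, rotate the orientation, then place.
action_in_moved_frame = to_global(translate(rotate_z(x_f, angle), shift),
                                  d, rotate_z(orientation, angle))
gap = math.dist(moved_action, action_in_moved_frame)
```

The two results coincide up to rounding: translation is absorbed entirely by the focal atom position, so only the orientation has to transform covariantly under rotation, while the distance stays invariant.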

3.1. STATE EMBEDDING

The state embedding network transforms canvas $C$ and bag $B$ to obtain a rotationally covariant and translationally invariant representation. For that, we concatenate a vectorized representation of the bag with each atom on the canvas and feed it into CORMORANT, i.e. $s^{cov} \leftarrow \mathrm{CORMORANT}(C, B)$, where $s^{cov} = \{s_\ell^{cov}\}_{\ell=0}^{L_{max}}$, $s_\ell^{cov} \in \mathbb{C}^{|C| \times \tau \times (2\ell+1)}$, and $\tau$ is the number of channels. For the sake of exposition, we assume a single channel for each element in the bag, i.e. $\tau = N_e$ (cf. Fig. 3); in practice, we use up to four channels per element.

However, not every sub-action in Eq. (3) should transform equally under rotation and translation. While the orientation $\tilde{x}$ needs to be covariant under rotation, the choice of focal atom $f$, element $e$, and distance $d$ have to be invariant to rotation and translation. For these sub-actions, we additionally require an invariant state representation. To obtain such a representation $s^{inv} \in \mathbb{R}^{|C| \times k}$, we employ a combination of transformations from Anderson et al. (2019) as listed in Appendix C (e.g. for $\ell = 0$, one can simply select the $s^{cov}_{\ell=0}$ component), which we collectively denote as $T_{inv}$.
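The scalar invariants listed in Appendix C can be sketched for a single channel as follows. This is a simplified illustration, not the CORMORANT implementation: the coefficients of degree l are stored as a plain list ordered m = -l, ..., l, and invariance is only checked for rotations about the z-axis, under which the coefficient of order m picks up a phase exp(-i m alpha) (the sign depends on convention and does not affect the check).

```python
import cmath

def xi1(f):
    """Select the l = 0 component, which does not change under rotation."""
    return f[0][0]

def xi2(f_l):
    """Re + Im of sum_m (-1)^m f^m f^(-m) for one degree l."""
    l = (len(f_l) - 1) // 2
    total = sum((-1) ** abs(m) * f_l[m + l] * f_l[-m + l]
                for m in range(-l, l + 1))
    return total.real + total.imag

def xi3(f_l):
    """SO(3)-invariant norm: sum_m f^m (f^m)^*."""
    return sum((c * c.conjugate()).real for c in f_l)

def rotate_about_z(f_l, alpha):
    """z-rotation acts diagonally on the coefficients of one degree."""
    l = (len(f_l) - 1) // 2
    return [cmath.exp(-1j * m * alpha) * f_l[m + l] for m in range(-l, l + 1)]

# Made-up coefficients for l = 0 and l = 1.
f = [[0.7 + 0.2j], [0.3 + 0.1j, -0.5j, 1.2 - 0.4j]]
f_rot = [rotate_about_z(f_l, 0.8) for f_l in f]

xi1_gap = abs(xi1(f) - xi1(f_rot))
xi2_gap = max(abs(xi2(a) - xi2(b)) for a, b in zip(f, f_rot))
xi3_gap = max(abs(xi3(a) - xi3(b)) for a, b in zip(f, f_rot))
```

Concatenating these quantities over all degrees would give one row of the invariant representation; a check against general rotations would additionally require the Wigner D-matrices, which are omitted here.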

3.2. ACTOR

Focal Atom and Element The distribution $p(f|s)$ over the focal atom $f$ is modeled as categorical, $f \sim \mathrm{Cat}(f; h_f)$, where $h_f$ are the logits for each atom in $C$ predicted by a multi-layer perceptron (MLP). Likewise, the distribution over the element $e$ is given by $p(e|f, s) = \mathrm{Cat}(e; h_e)$ with $h_e = \mathrm{MLP}_e(s^{inv}_f)$, where $s^{inv}_f$ is the invariant representation for the focal atom. Since the number of possible focal atoms $f$ increases and the set of available elements $e$ decreases during a rollout, we mask out invalid focal atoms $f \notin \{1, \dots, |C_t|\}$ and elements $e \notin B_t$ by setting their probabilities to zero and re-normalizing the categorical distributions. The agent does not make use of chemical concepts like bond connectivity to aid the choice of the focal atom.

Distance We select the channel $\tau_e$ corresponding to element $e$ from $s^{cov}_f$ to obtain $s^{cov}_{f,e} := \{s^{cov}_{\ell,f,e}\}_{\ell=0,\dots,L_{max}}$ and $s^{inv}_{f,e} \leftarrow T_{inv}(s^{cov}_{f,e})$. Then we model the distribution over the distance $d$ between the focal atom and the next atom to be placed as a mixture of $M$ Gaussians, $p(d|e, f, s) = \sum_{m=1}^{M} \pi_m \mathcal{N}(\mu_m, \sigma_m^2)$, where $\pi_m$ is the mixing coefficient of the $m$-th Gaussian $\mathcal{N}(\mu_m, \sigma_m^2)$. The mixing coefficients and the means are predicted by a mixture density network (MDN) (Bishop, 1994), i.e. $\{\pi_m, \mu_m\}_{m=1}^{M} = \mathrm{MDN}(s^{inv}_{f,e})$. The standard deviations $\{\sigma_m\}_{m=1}^{M}$ are global parameters. We guarantee that the sampled distances are positive by clipping values below zero.
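The masking of invalid choices and the clipped mixture sampling described above can be sketched as follows. The logits, means, and standard deviations are made-up numbers standing in for network outputs, and the function names are hypothetical; the clipping value 0.001 is the one stated in Appendix F.

```python
import math
import random

def masked_softmax(logits, valid):
    """Softmax in which invalid entries receive exactly zero probability."""
    top = max(l for l, v in zip(logits, valid) if v)
    weights = [math.exp(l - top) if v else 0.0 for l, v in zip(logits, valid)]
    total = sum(weights)
    return [w / total for w in weights]

def sample_distance(mix_logits, means, sigmas, rng, d_floor=0.001):
    """Draw d from a Gaussian mixture and clip draws that are not positive."""
    probs = masked_softmax(mix_logits, [True] * len(mix_logits))
    k = rng.choices(range(len(probs)), weights=probs)[0]
    return max(rng.gauss(means[k], sigmas[k]), d_floor)

rng = random.Random(0)

# Element choice: the bag no longer contains the second element, so it is masked.
element_probs = masked_softmax([0.2, 1.5, -0.3], [True, False, True])
element = rng.choices(range(3), weights=element_probs)[0]

# Distance choice: M = 3 components; one mean is negative to exercise the clipping.
distances = [sample_distance([0.0, 0.5, -0.5], [-0.5, 1.1, 1.5],
                             [0.2, 0.2, 0.2], rng) for _ in range(200)]
```

Masking before normalization keeps the categorical distributions well defined as the canvas grows and the bag shrinks, without changing the shape of the network outputs.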

Combining Invariant and Covariant Features

The choice of distance $d$ can significantly affect the orientation $\tilde{x}$ of the atom. For example, if $d$ has the length of a triple bond, then $\tilde{x}$ will be very different from the orientation chosen for a single bond. Thus, we condition $s^{cov}_{f,e}$ on distance $d$ through a non-linear and learnable transformation that preserves rotational covariance. We then use this representation to model a spherical distribution over $\tilde{x}$. Kondor & Trivedi (2018) showed that a linear transformation with learnable parameters is only covariant if the operation combines fragments with the same $\ell$. Further, the Clebsch-Gordan (CG) non-linearity allows one to combine two covariant features such that the result is still covariant. Thus, we obtain a rotationally covariant representation $r := \{r_\ell\}_{\ell=0,\dots,L_{max}} \leftarrow T_{cov}(d, s^{cov}_{f,e})$ conditioned on all previous sub-actions as follows:

$$r_\ell = s^{cov}_{\ell,f,e} \oplus d \cdot s^{cov}_{\ell,f,e} \oplus \left( d \cdot s^{cov}_{f,e} \otimes_{cg} d \cdot s^{cov}_{f,e} \right)_\ell \cdot W_\ell \quad \forall \ell,$$

where $\oplus$ denotes the appropriate concatenation of matrices, and $W_\ell$ is a learnable complex-valued matrix. As in Anderson et al. (2019), we perform the CG product $\otimes_{cg}$ only channel-wise to reduce computational complexity.

Orientation Next, we utilize $r$ to obtain a rotationally covariant spherical distribution for the orientation $\tilde{x}$ based on the series expansion in Eq. (2). Taking inspiration from commonly used distributions (Jammalamadaka & Terdik, 2019), we propose to use the following expression:

$$p(\tilde{x} \mid d, e, f, s) = \frac{1}{Z} \exp\left( -\beta \left| \frac{1}{\sqrt{k}} \sum_{\ell=0}^{L_{max}} \sum_{m=-\ell}^{\ell} r_\ell^m \, Y_\ell^m(\tilde{x}) \right|^2 \right), \qquad (5)$$

where $\beta \in \mathbb{R}$ is a scaling parameter, and the term $1/\sqrt{k}$ with $k = \sum_{\ell=0}^{L_{max}} \sum_{m=-\ell}^{\ell} |r_\ell^m|^2$ regularizes the distribution so that it does not approach a delta function. The normalization constant $Z$ is estimated via Lebedev quadrature (Lebedev, 1975; 1977). We sample from the distribution in Eq. (5) using rejection sampling (Bishop, 2009) with a uniform proposal distribution $q(\tilde{x}) = (4\pi)^{-1}$. Note that in contrast to more commonly used parametric distributions (e.g. von Mises-Fisher), this formulation allows us to model multimodal distributions. We discuss alternatives to Eq. (5) in Appendix D.
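Rejection sampling from a spherical distribution of this form can be sketched as below. The example is deliberately small and rests on several assumptions: it truncates the expansion at degree 1 (four coefficients, hand-written harmonics), the coefficient vector `r` and the value of `beta` are made up, and the acceptance bound uses the Cauchy-Schwarz inequality, which gives |g|^2 <= sum over (l, m) of |Y_l^m|^2 = 1/pi for this truncation. The normalization constant is not needed for sampling and is therefore not computed.

```python
import cmath
import math
import random

def sph_harm_l01(theta, phi):
    """Y_l^m for l <= 1, ordered (0,0), (1,-1), (1,0), (1,1)."""
    return [
        0.5 / math.sqrt(math.pi),
        math.sqrt(3 / (8 * math.pi)) * math.sin(theta) * cmath.exp(-1j * phi),
        math.sqrt(3 / (4 * math.pi)) * math.cos(theta),
        -math.sqrt(3 / (8 * math.pi)) * math.sin(theta) * cmath.exp(1j * phi),
    ]

def unnormalized_density(theta, phi, r, beta):
    """exp(-beta * |sum r Y / sqrt(k)|^2) with k = sum |r|^2."""
    k = sum(abs(c) ** 2 for c in r)
    g = sum(c * y for c, y in zip(r, sph_harm_l01(theta, phi))) / math.sqrt(k)
    return math.exp(-beta * abs(g) ** 2)

def sample_orientation(r, beta, rng):
    """Rejection sampling with a uniform proposal on the sphere."""
    # Upper bound on the unnormalized density, valid for either sign of beta.
    bound = math.exp(max(0.0, -beta) / math.pi)
    while True:
        theta = math.acos(1.0 - 2.0 * rng.random())  # cos(theta) uniform in [-1, 1]
        phi = 2.0 * math.pi * rng.random()
        if rng.random() * bound <= unnormalized_density(theta, phi, r, beta):
            return theta, phi

rng = random.Random(1)
r = [0.0, 0.0, 1.0, 0.0]   # only the (l, m) = (1, 0) coefficient is non-zero
samples = [sample_orientation(r, -20.0, rng) for _ in range(300)]
mean_abs_cos = sum(abs(math.cos(t)) for t, _ in samples) / len(samples)
```

With these made-up coefficients the density has two modes, one at each pole, so the average of |cos(theta)| over the samples lies well above the value 0.5 obtained for uniformly distributed orientations. A single von Mises-Fisher distribution could not represent both modes at once.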

3.3. CRITIC

The critic needs to compute a value $V$ for the state $s$ that is invariant under translation and rotation. Given $s^{inv}$, we apply a permutation-invariant set encoding (Zaheer et al., 2017) of the atoms, i.e.

$$V(s) = \mathrm{MLP}_\rho \left( \sum_{i=1}^{|C|} \mathrm{MLP}_\phi (s^{inv}_i) \right).$$

Finally, we use PPO (Schulman et al., 2017) to learn the parameters of the actor-critic architecture. To encourage sufficient exploration, we add an entropy regularization term over the categorical sub-action distributions of the auto-regressive policy. For offline evaluation, we evaluate the policy without any exploration by choosing the most probable action. While the mode of the distributions for $f$ and $e$ is available in closed form, we approximate the global mode of the distributions over $d$ and $\tilde{x}$ by evaluating the density at $S$ samples and picking the one with the highest density.
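The permutation-invariant set encoding used by the critic can be illustrated with a toy example. Single dense layers with fixed, made-up weights stand in for the two MLPs, and the per-atom invariant features are arbitrary numbers; the sketch only demonstrates that sum pooling makes the value independent of the order of the atoms.

```python
def relu(v):
    return max(0.0, v)

def dense(weights, bias, x, activation=None):
    """One fully connected layer: activation(W x + b)."""
    out = [sum(w * xi for w, xi in zip(row, x)) + b
           for row, b in zip(weights, bias)]
    return [activation(o) for o in out] if activation else out

def value(s_inv, phi, rho):
    """V(s) = rho(sum_i phi(s_inv_i)); the sum removes any dependence on atom order."""
    embedded = [dense(*phi, x, relu) for x in s_inv]
    pooled = [sum(column) for column in zip(*embedded)]
    return dense(*rho, pooled)[0]

# Hypothetical parameters: phi maps 2 features to 3, rho maps 3 to a scalar.
phi = ([[0.5, -0.2], [0.1, 0.3], [-0.4, 0.6]], [0.0, 0.1, -0.1])
rho = ([[1.0, -0.5, 0.25]], [0.05])

s_inv = [[0.2, 1.0], [-0.3, 0.4], [0.9, -0.7]]  # one row per atom on the canvas
v = value(s_inv, phi, rho)
v_permuted = value(s_inv[::-1], phi, rho)
```

Because the inputs are already invariant under rotation and translation, the resulting value inherits these invariances in addition to the permutation invariance shown here.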

4. RELATED WORK

Reinforcement Learning for Molecular Design There exists a large variety of RL-based approaches for molecular design using either string- or graph-based representations of molecules (Olivecrona et al., 2017; Guimaraes et al., 2018; Putin et al., 2018; Neil et al., 2018; Popova et al., 2018; You et al., 2018; Zhou et al., 2019). However, the choice of representation limits the molecules that can be generated to a (small) region of chemical space for which the representation is applicable, i.e., single organic molecules. Such representations also prohibit the use of reward functions based on quantum-mechanical properties; instead, heuristics are often used. Lastly, geometric constraints on the design process cannot be imposed as the representation does not include any 3D information.

5. EXPERIMENTS

We perform experiments to answer the following questions: (1) is the agent able to learn how to build highly symmetric molecules in Cartesian coordinates from scratch, (2) can we increase the validity, diversity, and stability of generated molecules, and (3) does our approach lead to improved generalization? We address (1) and (2) by evaluating the agent on a diverse range of tasks from the MOLGYM benchmark suite (Simm et al., 2020), and (3) on a newly proposed stochastic-bag task (see Section 5.1) where bags are sampled from a distribution over bags. In Appendix G, we show with an additional experiment that the agent can learn to place water molecules around a given solute to form a solvation shell. [4]

We compare our approach (COVARIANT) against the RL agent proposed by Simm et al. (2020), which iteratively builds molecules on a 3D canvas by working in internal coordinates (INTERNAL). As an additional baseline, we consider a classical, optimization-based agent (OPT) with access to a black-box function that yields the energy $E(C)$ and the atomic forces $F(C)$ for a given canvas. [5] The agent constructs molecules by alternating between randomly placing an atom and optimizing the structure. Moreover, the agent applies several heuristics inspired by fundamental chemical concepts to guide the placement of atoms. To make the comparisons fair, we grant OPT a comparable computational budget in terms of the total number of energy computations. Finally, for some experiments, the best possible performance based on quantum-chemical calculations can be reported. See Appendices E and F for more details on the baselines and experimental settings, and Appendix H for an additional runtime comparison between the agents.

5.1. STOCHASTIC-BAG TASK

In Simm et al. (2020), a set of molecular design tasks was introduced: the single-bag task assesses an agent's ability to build single stable molecules, whereas the multi-bag task focuses on building several molecules of different composition and size at the same time. A limitation of these tasks is that the initial bags were selected such that they correspond to known formulas, which in practice might not be known a priori. In the stochastic-bag task, we relax this assumption by sampling from a more general distribution over bags. Before each episode, we construct a Here, we obtain an empirical distribution $p_e$ from the multiplicities $m(e)$ of a given bag $B^*$. For example, with $B^* = \{(\mathrm{H}, 2), (\mathrm{O}, 1)\}$ we obtain $p_\mathrm{H} = 2/3$ and $p_\mathrm{O} = 1/3$. Since sampled bags might no longer correspond to valid molecules when placed completely, we discard bags where the sum of valence electrons over all atoms contained in the bag is odd. This ensures that the agent can build a closed-shell system.
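A possible way to sample bags for this task is sketched below. Only the empirical distribution and the parity filter are taken from the text; the rest is assumed for illustration: the bag size is drawn uniformly from a given interval, the atoms are drawn independently from the empirical distribution, and the valence electron counts are the standard main-group values. The names `sample_bag` and `empirical_distribution` are hypothetical.

```python
import random

# Standard valence electron counts for the elements used in this example.
VALENCE_ELECTRONS = {"H": 1, "C": 4, "N": 5, "O": 6}

def empirical_distribution(reference_bag):
    """p_e = m(e) / sum of all multiplicities in the reference bag B*."""
    total = sum(reference_bag.values())
    return {e: m / total for e, m in reference_bag.items()}

def sample_bag(reference_bag, size_range, rng):
    """Draw bags until one has an even number of valence electrons."""
    p = empirical_distribution(reference_bag)
    elements = sorted(p)
    while True:
        size = rng.randint(*size_range)  # assumption: uniform over the interval
        atoms = rng.choices(elements, weights=[p[e] for e in elements], k=size)
        bag = {e: atoms.count(e) for e in elements if atoms.count(e) > 0}
        electrons = sum(VALENCE_ELECTRONS[e] * m for e, m in bag.items())
        if electrons % 2 == 0:           # discard open-shell compositions
            return bag

water = empirical_distribution({"H": 2, "O": 1})

rng = random.Random(0)
reference = {"C": 7, "H": 8, "N": 2, "O": 2}   # B* = C7H8N2O2
bags = [sample_bag(reference, (16, 22), rng) for _ in range(50)]
```

For the water example this reproduces the probabilities 2/3 and 1/3 given above; the sampled bags vary in size and composition but always contain an even number of valence electrons.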

5.2. RESULTS

Building Highly Symmetric Molecules First, we evaluate the ability to build stable molecules featuring high symmetry and coordination numbers (e.g. trigonal bipyramidal, square pyramidal, and octahedral) on the single-bag task with bags SOF$_4$, IF$_5$, SOF$_6$, and SF$_6$. As shown in Fig. 5 (a), COVARIANT can solve the task for SOF$_4$ and IF$_5$ within 30 000 to 40 000 steps, whereas INTERNAL fails to build low-energy configurations as it cannot distinguish highly symmetric intermediates (cf. Fig. 4). Further results in Fig. 5 (b) for SOF$_6$ and SF$_6$ show that COVARIANT is capable of building such structures. Likewise, OPT found the optimal structures for all four bags. While the constructed molecules are small in size, they would be unattainable with graph- or string-based methods as such representations lack important 3D information. For example, RDKIT (Landrum, 2019), a state-of-the-art library for 3D structure generation of organic molecules, failed at this task.

Validity, Diversity, and Stability of Generated Molecules Since string and graph representations are not well-suited for designing molecules with complex 3D structure, it is difficult to directly compare to most prior work. To still enable comparisons, we follow the GuacaMol benchmark (Brown et al., 2019) and report the chemical validity, diversity, and stability of the molecules generated by the agents for different experiments. A generated structure is considered valid if it can be successfully converted to a molecular graph by the tool XYZ2MOL (Jensen, 2019; Kim & Kim, 1 is the ratio of valid molecules generated during offline evaluation at the end of training over 10 seeds. Two molecules are considered identical if their molecular graphs yield the same SMILES strings under RDKIT. The diversity shown in Table 1 is the total number of unique and valid structures generated during offline evaluation during training over 10 seeds. [6] In the two stochastic-bag experiments, the agents are trained on bags of sizes from the interval [16, 22] sampled with $B^*$ = C$_7$H$_8$N$_2$O$_2$ and C$_7$H$_{10}$O$_2$, respectively. Finally, to assess the stability of the generated molecules, valid structures generated in the last iteration underwent a structure optimization using the PM6 method (see Appendix A for details). Then, the root-mean-square deviation of atomic positions (RMSD, in Å) between the original and the optimized structure was computed. In Table 1, the median RMSD is given per experiment.

Results are listed in Table 1. We observe that COVARIANT significantly outperforms the other agents on most experiments both in terms of validity and diversity. The difference is particularly large for the more challenging stochastic-bag tasks, where COVARIANT does similarly well as on the single-bag experiments. This finding is confirmed in Fig. 6 (a), showing that the exact stoichiometry does not need to be known a priori for the agent to build valid molecules. Moreover, the structures generated by COVARIANT are overall slightly more stable compared to INTERNAL. In contrast, OPT often fails to build valid structures. Inspection of the generated structures reveals that for larger bags the agent tends to build multi-molecular clusters, which are considered invalid in this experiment. The stability for OPT is omitted as all of its valid structures are stable by definition.

are in Appendix G. Although the difference in performance seems to be marginal, we stress that chemical validity is often determined by the last 10% of the returns. Indeed, Table 1 and Fig. 9 in Appendix G highlight the higher quality of the structures generated by COVARIANT, indicating better generalization to unseen bags of larger size. OPT fails at this task.

6. CONCLUSION

We proposed a novel covariant actor-critic architecture based on spherical harmonics for designing highly symmetric molecules in 3D. We showed empirically that exploiting symmetries of the molecular design process improves the quality of the generated molecules and leads to better generalization. In future work, we aim to employ more accurate quantum-chemical methods (e.g., density functional theory) required for building transition metal complexes or structures in which weak intermolecular interactions are important. For that, however, the sample-efficiency of our agent needs to be improved. Finally, we aim to explore reward functions specifically tailored towards drug design.

A REWARD CALCULATION

In the reward function, the energy $E$ has to be computed using quantum-chemical methods. For that, we use the fast semi-empirical Parametrized Method 6 (PM6) (Stewart, 2007). In particular, we use the implementation in the software package SPARROW (Husch et al., 2018; Bosia et al., 2020). For each calculation, a molecular charge of zero and the lowest possible spin multiplicity are chosen. All calculations are spin-unrestricted. Limitations of semi-empirical methods are highlighted in, for example, recent work by Husch & Reiher (2018). More accurate methods such as approximate density functionals need to be employed especially for systems containing transition metals.

Further, we enforce that atoms are placed neither too close to (< 0.6 Å) nor too far away from (> 2.0 Å) each other. If the agent places an atom outside these boundaries, the minimum reward of -0.6 is awarded and the episode terminates. In addition, the environment encourages the agents to build single molecular structures by terminating the episode and returning a reward of -0.6 if elements forming stable bimolecular compounds (e.g., H$_2$) are placed too far away from other atoms on the canvas.
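The interplay between the energy-based reward and the distance constraints can be sketched as follows. Two parts are assumptions made for this illustration: the constraint is read as "the nearest atom on the canvas must lie between 0.6 Å and 2.0 Å from the new atom", and `toy_energy` is a made-up pairwise function standing in for the PM6 calculation (it is not chemically meaningful). The special handling of elements such as hydrogen is omitted.

```python
import math

MIN_DISTANCE, MAX_DISTANCE = 0.6, 2.0   # Angstrom, thresholds from the text
MIN_REWARD = -0.6                       # reward for an invalid placement

def placement_is_valid(positions, new_position):
    """Assumed reading: the nearest existing atom must be within [0.6, 2.0] Angstrom."""
    if not positions:
        return True
    nearest = min(math.dist(p, new_position) for p in positions)
    return MIN_DISTANCE <= nearest <= MAX_DISTANCE

def toy_energy(canvas):
    """Hypothetical stand-in for PM6: each pair contributes -exp(-(d - 1)^2)."""
    positions = [x for _, x in canvas]
    return -sum(math.exp(-(math.dist(p, q) - 1.0) ** 2)
                for i, p in enumerate(positions)
                for q in positions[i + 1:])

def reward(canvas, element, position, energy):
    """Return (reward, episode_terminated) for placing one atom on the canvas."""
    if not placement_is_valid([x for _, x in canvas], position):
        return MIN_REWARD, True
    new_canvas = canvas + [(element, position)]
    isolated_atom = [(element, (0.0, 0.0, 0.0))]
    delta_e = energy(new_canvas) - (energy(canvas) + energy(isolated_atom))
    return -delta_e, False

canvas = [("H", (0.0, 0.0, 0.0))]
good = reward(canvas, "H", (1.0, 0.0, 0.0), toy_energy)
too_close = reward(canvas, "H", (0.3, 0.0, 0.0), toy_energy)
too_far = reward(canvas, "H", (5.0, 0.0, 0.0), toy_energy)
```

Checking the geometric constraints before calling the energy function avoids spending a quantum-chemical calculation on structures that would terminate the episode anyway; whether the original environment orders the checks this way is not stated in the text.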

B SPHERICAL HARMONICS

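The spherical harmonics form an orthonormal basis of the Hilbert space of square-integrable functions $L^2_{\mathbb{C}}(S^2)$. The first few spherical harmonics are given by:

$$Y_0^0(\vartheta, \varphi) = \frac{1}{2} \sqrt{\frac{1}{\pi}},$$
$$Y_1^{-1}(\vartheta, \varphi) = \sqrt{\frac{3}{8\pi}} \sin\vartheta \, e^{-i\varphi}, \quad Y_1^0(\vartheta, \varphi) = \sqrt{\frac{3}{4\pi}} \cos\vartheta, \quad Y_1^1(\vartheta, \varphi) = -\sqrt{\frac{3}{8\pi}} \sin\vartheta \, e^{i\varphi},$$
$$Y_2^{-2}(\vartheta, \varphi) = \sqrt{\frac{15}{32\pi}} \sin^2\vartheta \, e^{-2i\varphi}, \quad Y_2^{-1}(\vartheta, \varphi) = \sqrt{\frac{15}{8\pi}} \cos\vartheta \sin\vartheta \, e^{-i\varphi}, \quad Y_2^0(\vartheta, \varphi) = \sqrt{\frac{5}{16\pi}} \left( 3\cos^2\vartheta - 1 \right),$$
$$Y_2^1(\vartheta, \varphi) = -\sqrt{\frac{15}{8\pi}} \cos\vartheta \sin\vartheta \, e^{i\varphi}, \quad Y_2^2(\vartheta, \varphi) = \sqrt{\frac{15}{32\pi}} \sin^2\vartheta \, e^{2i\varphi}. \qquad (8)$$

The spherical harmonics are normalized such that:

$$\int_0^{2\pi} \int_0^{\pi} |Y_\ell^m(\vartheta, \varphi)|^2 \sin\vartheta \, d\vartheta \, d\varphi = 1 \quad \forall \ell, m.$$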
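The orthonormality of the harmonics listed above can be verified numerically. The sketch below hard-codes the nine functions up to degree 2 and integrates their pairwise products over the sphere with a simple product rule (midpoint rule in cos(theta), equally spaced points in phi). This quadrature is an assumption made to keep the example short; it is not the Lebedev quadrature used in the main text.

```python
import cmath
import math

def harmonics_up_to_l2(theta, phi):
    """Y_l^m for l <= 2, ordered by l and then by m = -l, ..., l."""
    s, c, pi = math.sin(theta), math.cos(theta), math.pi
    phase = lambda m: cmath.exp(1j * m * phi)
    return [
        0.5 * math.sqrt(1 / pi),
        math.sqrt(3 / (8 * pi)) * s * phase(-1),
        math.sqrt(3 / (4 * pi)) * c,
        -math.sqrt(3 / (8 * pi)) * s * phase(1),
        math.sqrt(15 / (32 * pi)) * s * s * phase(-2),
        math.sqrt(15 / (8 * pi)) * c * s * phase(-1),
        math.sqrt(5 / (16 * pi)) * (3 * c * c - 1),
        -math.sqrt(15 / (8 * pi)) * c * s * phase(1),
        math.sqrt(15 / (32 * pi)) * s * s * phase(2),
    ]

def gram_matrix(n_u=200, n_phi=12):
    """Approximate the integrals of Y_a * conj(Y_b) over the unit sphere."""
    size = 9
    gram = [[0j] * size for _ in range(size)]
    weight = (2.0 / n_u) * (2.0 * math.pi / n_phi)  # du * dphi with u = cos(theta)
    for i in range(n_u):
        theta = math.acos(-1.0 + (i + 0.5) * 2.0 / n_u)
        for j in range(n_phi):
            y = harmonics_up_to_l2(theta, 2.0 * math.pi * j / n_phi)
            for a in range(size):
                for b in range(size):
                    gram[a][b] += weight * y[a] * y[b].conjugate()
    return gram

gram = gram_matrix()
max_error = max(abs(gram[a][b] - (1.0 if a == b else 0.0))
                for a in range(9) for b in range(9))
```

The resulting matrix agrees with the identity up to the discretization error of the quadrature, confirming both the normalization and the mutual orthogonality of the listed functions.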

C CALCULATION OF INVARIANT FEATURES

One can obtain scalar invariants from the covariant features $\hat{f}$ (Anderson et al., 2019):
• Take the component $\ell = 0$: $\xi_1(\hat{f}) = \hat{f}_{\ell=0}$.
• Calculate the scalar product with itself: $\xi_2(\hat{f}_\ell) = \mathrm{Re}[\tilde{\xi}_2(\hat{f}_\ell)] + \mathrm{Im}[\tilde{\xi}_2(\hat{f}_\ell)]$, where $\tilde{\xi}_2(\hat{f}_\ell) = \sum_{m=-\ell}^{\ell} (-1)^m \hat{f}_\ell^m \hat{f}_\ell^{-m}$.
• Calculate the SO(3)-invariant norm: $\xi_3(\hat{f}_\ell) = \sum_{m=-\ell}^{\ell} \hat{f}_\ell^m \hat{f}_\ell^{m*}$, where $*$ denotes the complex conjugate.

The invariant components are then concatenated:

$$f^{inv} \leftarrow T_{inv}(\hat{f}) = \xi_1(\hat{f}) \oplus \bigoplus_{\ell=0}^{L} \left( \xi_2(\hat{f}_\ell) \oplus \xi_3(\hat{f}_\ell) \right).$$

1. If the canvas $C_t$ is not empty, randomly choose a focal atom $f$ from the list of available atoms on the canvas. An atom is considered available if its number of neighbors is less than a predefined number that depends on its element (e.g., one for hydrogen and four for carbon). Two atoms on the canvas are neighbors if their Euclidean distance is below 1.5 Å. If there are no available atoms on the canvas, a focal atom is randomly chosen from the list of atoms on the canvas.
2. Randomly choose an element $e_t$ from the bag $B_t$.
3. Randomly place the atom $a_t = (e_t, x_t)$ on a sphere with radial distance $d = 1.1$ Å around $x_f$ to obtain $C_{t+1,raw}$. If the canvas is empty, place the atom at the origin.
4. Optimize only the position of $a_t$ using $F$ to obtain $C_{t+1,opt}$.
5. Compute the energy difference $\Delta E(t) = E(C_{t+1,opt}) - [E(C_t) + E(\{e_t, 0\})]$.
6. If $\Delta E(t) > 0$, return $e_t$ to the bag and go back to step 1.
7. Optimize canvas $C_{t+1,opt}$ using $F$ to obtain $C_{t+1}$.
8. Increment $t$ by 1.
9. If the bag is not empty, go back to step 1.

In the experiments, the different agents need to be given a comparable computational budget to ensure a meaningful comparison of their performance. This is difficult as they use different computational resources: OPT runs on a CPU whereas INTERNAL and COVARIANT perform many of their computations on a GPU. However, we found experimentally that the quantum-chemical calculations are the most computationally expensive ones. These calculations are performed in the same way for all approaches. Therefore, we believe that by granting each approach the same number of PM6 calculations we achieve a fair comparison.
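The focal-atom heuristic in step 1 of the procedure above can be sketched as follows. The neighbor cutoff and the limits for hydrogen and carbon are taken from the text; limits for any other element would have to be supplied and are deliberately left out here. The function names are hypothetical.

```python
import math
import random

MAX_NEIGHBORS = {"H": 1, "C": 4}   # limits stated in the text
NEIGHBOR_CUTOFF = 1.5              # Angstrom

def available_atoms(canvas):
    """Indices of atoms whose neighbor count is below the limit for their element."""
    available = []
    for i, (element, position) in enumerate(canvas):
        neighbors = sum(1 for j, (_, other) in enumerate(canvas)
                        if j != i and math.dist(position, other) < NEIGHBOR_CUTOFF)
        if neighbors < MAX_NEIGHBORS[element]:
            available.append(i)
    return available

def choose_focal_atom(canvas, rng):
    """Pick an available atom at random; fall back to any atom if none is available."""
    candidates = available_atoms(canvas) or list(range(len(canvas)))
    return rng.choice(candidates)

rng = random.Random(0)

# A carbon with one bonded hydrogen and a second, distant hydrogen.
fragment = [("C", (0.0, 0.0, 0.0)), ("H", (1.09, 0.0, 0.0)), ("H", (3.0, 0.0, 0.0))]
fragment_available = available_atoms(fragment)

# Two hydrogens at bonding distance: both are saturated, so the fallback applies.
h2 = [("H", (0.0, 0.0, 0.0)), ("H", (0.74, 0.0, 0.0))]
h2_available = available_atoms(h2)
h2_focal = choose_focal_atom(h2, rng)
```

In the first example the bonded hydrogen is excluded because it already has one neighbor, while the carbon and the distant hydrogen remain candidates. The remaining steps of the procedure require an energy and force evaluation and are not reproduced here.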

E.2 OPTIMAL RETURN

The optimal return for the single-bag tasks was derived in the following way. First, we obtained molecular structures for the complexes SOF$_4$, IF$_5$, SF$_6$, and SOF$_6$. Subsequently, we performed a structure optimization using the PM6 method. Since the undiscounted return is path-independent, we determined the return $R(s)$ by computing the total interaction energy in the canvas $C$, i.e.

$$R(s) = \left[ \sum_{i=1}^{|C|} E(\{e_i, 0\}) \right] - E(C).$$

F EXPERIMENTAL DETAILS

2020), e.g. regarding the number of hidden units/layers, activation functions, and initialization schemes. We initialize the biases of each network with 0 and each weight matrix as a (semi-)orthogonal matrix. After each hidden layer, a ReLU non-linearity is employed. As explained in the main text, both MLP$_f$ and MLP$_e$ use a masked softmax activation function to guarantee that only valid actions are chosen. To model the distance $d$, we employ a Gaussian mixture model consisting of $M = 3$ Gaussians. As we treat the standard deviations $\{\sigma_m\}_{m=1}^{3}$ m ∈ [-1, 1] to $\mu_m \in [d_{min}, d_{max}]$. If the sampled distance is negative, we clip the value at 0.001.

Hyperparameters for CORMORANT are listed in Table 3. In our experiments, we found it important to use multiple filters $\tau_e$ per element (e.g. 4) and to set $L_{max} = 4$. This gives the model enough flexibility to represent complex spherical distributions while remaining computationally tractable. For more details on CORMORANT, see the original work (Anderson et al., 2019).

Further hyperparameters used in our experiments are in Table 4. PPO is known to be relatively robust with respect to the choice of hyperparameters, and we found the default values to be sufficient in most cases. Within the actor, the scaling parameter $\beta$ is important to prevent the spherical distribution from approaching a delta distribution. Note that values of $\beta$ can vary significantly across experiments and might require some tuning. Lastly, the right number of samples $S$ for the global mode estimation of the spherical distribution generally depends on the shape of the distribution. In particular, we would expect that more samples are required as the distribution becomes more peaked. Since we avoid pathological behaviors by scaling the distribution with $\beta$, we found $S = 1024$ to be sufficient for all our experiments.

From Fig. 10, it can be seen that COVARIANT can solve this task by constructing stable H$_2$O molecules and placing them in the vicinity of the solute. From visual inspection of the generated structures, it can be observed that in many cases COVARIANT arranges the molecules such that intermolecular bonds can be formed. However, it should be noted that the quantum-chemical method used in the reward function is not very well suited for modeling these interactions. Finally, Fig. 10 shows that while INTERNAL learns faster at the beginning of training, COVARIANT slightly outperforms INTERNAL towards the end.

H RUNTIME EVALUATION

We compared the runtimes between OPT, INTERNAL, and COVARIANT. For instance, for the single-bag task with the bag C$_3$H$_5$NO$_3$, T = 240 steps of the last rollout took COVARIANT and INTERNAL on average 12 and 11 seconds (s), respectively. The final offline evaluation took the agents on average 4 s and 1 s, respectively. This speed difference is mainly due to the relatively slow rejection sampling procedure in COVARIANT. In each iteration, policy optimization took on average 2 s and 6 s for the agents COVARIANT and INTERNAL, respectively. In this case, INTERNAL is slower than COVARIANT as it performed around twice as many epochs during optimization due to early stopping. Since there is no training for OPT, this agent was overall faster than the others. Further, we note that the largest fraction of time was spent on the quantum-chemical calculations, which are the same for all agents. The time the quantum-chemical calculation takes to converge depends not only on the size but also on the geometry of the input structure. The entire experiment took COVARIANT approximately 4 hours, INTERNAL 5 hours, and OPT 3 hours.



[1] Shorthand for $\{(\mathrm{S}, 1), (\mathrm{O}, 1), (\mathrm{F}, 4)\}$.

[2] Hereafter, we drop the time index when it is clear from the context.

[3] If the canvas $C_0$ is empty, the agent selects an element $e_0 \in B_0$ and places it at the origin, i.e. $a_0 = (e_0, 0)$.

Molecular Design in Cartesian Coordinates Another downside of string- and graph-based approaches is their neglect of information encoded in the interatomic distances. In light of this, Gebauer et al. (2018; 2019) proposed a supervised generative neural network for sequentially placing atoms in Cartesian coordinates. While the model respects local symmetries by construction, atoms are placed on a 3D grid. Similar to other supervised approaches, one further requires a dataset that covers the particular class of molecules to be generated. Hammer and coworkers (Jørgensen et al., 2019; Meldgaard et al., 2020) employed a Deep Q-Network (Mnih et al., 2015) to build planar compounds and crystalline surfaces by placing atoms on a grid. Recently, Simm et al. (2020) presented an RL formulation for molecular design in continuous 3D space. The agent models the position of the next atom to be placed in internal coordinates (i.e. the distance, angle, and dihedral angle with respect to already existing atoms), which are invariant under translation and rotation. By mapping from internal to Cartesian coordinates, they then obtain a policy that is covariant under these symmetry operations. However, as shown in Fig. 4, the angle and dihedral angle are only defined with respect to two reference points, which are chosen to be the two closest points to a focal atom. In highly symmetric states, e.g. as commonly encountered in materials, this representation fails to distinguish different configurations as one cannot uniquely select the two closest atoms as reference points anymore. In contrast, we do not rely on such reference points as the agent directly samples the orientation from a spherical distribution.

Covariant Neural Networks in Chemical Science Prior work employed rotationally covariant neural networks to predict translation- and rotation-invariant physical properties (Thomas et al., 2018; Kondor et al., 2018; Weiler et al., 2018; Anderson et al., 2019; Miller et al., 2020; Finzi et al., 2020; Fuchs et al., 2020), e.g. scalars such as the electronic energy. In contrast, we propose a translation-invariant and rotation-covariant neural network architecture for generating molecules. For a more general treatment of covariance (or equivariance) in RL, see van der Pol et al. (2020).

[4] Source code of the agent and environment is available at https://github.com/gncs/molgym.

[5] For the calculation of $E(C)$ and $F(C)$ we employ PM6; the same method as in the reward function.

[6] For a fairer comparison in the stochastic-bag task, we use the structures generated during offline evaluation, instead of those generated during training as in Simm et al. (2020).



Figure 1: (a) Illustration of a rotation-covariant state-action representation. If the structure is rotated by R, the position x of the action transforms accordingly. (b) Rollout with bag B_0 = SOF4. The agent builds a molecule by repeatedly taking atoms from the bag and placing them onto the 3D canvas. Bonds connecting atoms are only for illustration and not part of the MDP.

1 (a)). Since learning such a policy is difficult when working directly in global Cartesian coordinates, we instead follow Simm et al. (2020) and use an action representation that is local with respect to an already placed focal atom. If the next atom is placed relative to the focal atom, covariance under translation of x is automatically achieved and only the rotational covariance remains to be dealt with.

As shown in Fig. 2, we model the action a through a sequence of sub-actions: (1) the index f ∈ {1, . . . , |C|} of the focal atom around which the next atom is placed, (2) the element e ∈ {1, . . . , N_e} of the next atom from the set of available elements, (3) a distance d ∈ R_+ between the focal atom and the next atom, and (4) the orientation x̃ = (ϑ, ϕ) ∈ S^2 of the atom on a unit sphere around the focal atom. Denoting x_f as the position of the focal atom, we obtain action a = (e, x) by mapping the local coordinates (x̃, d, f) to global coordinates x = x_f + d · x̃, where

Figure 2: Action representation of the auto-regressive policy. The agent chooses focal atom f, element e, distance d, and orientation x̃. We then map back to global coordinates x to obtain action a = (e, x). Bonds between atoms are only for illustration.
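The mapping from the local sub-actions to a global position can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used in the experiments: the function names are hypothetical, the focal index is 0-based (the text counts from 1), and we assume the convention in which ϑ is the polar angle and ϕ the azimuth.

```python
import numpy as np

def spherical_to_unit_vector(theta, phi):
    """Convert polar angle theta and azimuth phi to a Cartesian unit vector."""
    return np.array([
        np.sin(theta) * np.cos(phi),
        np.sin(theta) * np.sin(phi),
        np.cos(theta),
    ])

def to_global_position(positions, focal_index, distance, theta, phi):
    """Place the next atom at `distance` from the focal atom along (theta, phi)."""
    # Position of the focal atom (0-based index; an assumption of this sketch).
    x_f = np.asarray(positions, dtype=float)[focal_index]
    # Global position = focal position + distance * orientation on the unit sphere.
    return x_f + distance * spherical_to_unit_vector(theta, phi)

# Two atoms on the canvas; place a third one 1.5 units from atom 1 along +z.
canvas = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
new_position = to_global_position(canvas, focal_index=1, distance=1.5, theta=0.0, phi=0.0)
```

Because the new position is the focal position plus a scaled unit vector, translating the canvas shifts the result by the same amount, which is the translation covariance noted in the text; only the orientation needs a rotation-covariant treatment.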

Figure 3: Illustration of the state embedding, actor, and critic networks. Both canvas C and bag B are fed to the state embedding network CORMORANT to obtain rotation-covariant (s^cov) and -invariant (s^inv) state representations. The actor network then samples the different sub-actions highlighted in bold. The critic takes the invariant representation s^inv to compute a value V.

Figure 4: Example of two configurations (a) and (b) that the agent by Simm et al. (2020) cannot distinguish. While the values for distance d, angle α and dihedral angle ψ are the same, choosing different reference points (in red) leads to a different action. This is particularly problematic in symmetric states, where one cannot uniquely determine these reference points.

bag B = {(e, m(e))} by sampling counts (m(e_1), ..., m(e_max)) ∼ Mult(ζ, p_e), where the bag size ζ is sampled uniformly from the interval [ζ_min, ζ_max].
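The bag-sampling step can be sketched as follows. This is a hedged illustration under stated assumptions: `sample_bag` is a hypothetical helper, the size bounds in the example are made up, and the probabilities p_e are simply taken proportional to the composition of C7H10O2 for demonstration.

```python
import numpy as np

def sample_bag(elements, probs, zeta_min, zeta_max, rng):
    """Sample a bag {element: count} with a random size in [zeta_min, zeta_max]."""
    # Bag size: uniform over the integers zeta_min, ..., zeta_max (inclusive).
    zeta = int(rng.integers(zeta_min, zeta_max + 1))
    # Element counts: a single multinomial draw with zeta trials.
    counts = rng.multinomial(zeta, probs)
    # Elements with a zero count are omitted from the bag (a choice of this sketch).
    return {e: int(m) for e, m in zip(elements, counts) if m > 0}

rng = np.random.default_rng(0)
# Assumed p_e, proportional to the element counts in C7H10O2 (7 C, 10 H, 2 O).
probs = np.array([7, 10, 2]) / 19
bag = sample_bag(["C", "H", "O"], probs, zeta_min=16, zeta_max=22, rng=rng)
```

Each call yields a bag whose total atom count lies within the chosen interval, so an agent trained on such samples sees compositions that vary around the reference bag.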

Figure 5: (a) Average offline performance on the single-bag task with bags SOF4 (left) and IF5 (right) across 10 seeds. In the lower right, molecular structures generated by the agents are shown. Dashed lines denote the optimal return for each experiment. Error bars show two standard deviations. (b) Further molecular structures generated by COVARIANT, namely SOF6 and SF6.

Figure 6: Average offline performance on the stochastic-bag task with B* = C7H10O2 evaluated on (a) C7H10O2 and (b) larger, unseen bags {C6H14O3, C7H16O, C7H16O2, C8H18O} over 10 seeds. For comparison, we show an agent trained only on the test bags (purple). Error bars are two standard deviations. Molecular structures generated by COVARIANT (Stochastic) are shown.

Figure 10: Average offline performance on the solvation task with 5 H2O molecules and formaldehyde as the solute across 10 seeds. Error bars show two standard errors. The dashed line denotes the optimal return. A selection of molecular clusters generated by the COVARIANT agent is shown.

Validity, diversity, and stability (RMSD in Å) of generated structures.

achieved by COVARIANT are still relatively low. This can partly be explained by the fact that state-of-the-art graph-based approaches have the strict rules of chemical bonding in organic molecules encoded into their models. As a result, however, they are limited to generating single organic molecules and cannot build molecules for which these rules do not apply (e.g., hypervalent iodine compounds such as IF5). In terms of stability, Gebauer et al. (2019) reported an average RMSD of approximately 0.25 Å for their supervised generative model. While their approach and the considered molecules are significantly different from ours, this suggests that their generated structures are more stable than those generated by COVARIANT. Nonetheless, the RL approach presented in this work remains particularly attractive if no dataset exists on which such a supervised model can be trained.

Generalization. To evaluate the generalization capabilities of all agents to unseen bags, we train on a distribution over bags with B* = C7H10O2 and C7H8N2O2, and test on sets of larger, out-of-distribution bags {C6H14O3, C7H16O, C7H16O2, C8H18O} and {C8H12N2O, C6H12N2O3, C7H14N2O, C7H14N2O2}, respectively. As shown in Fig. 6 (b), COVARIANT obtains higher average returns with lower variance compared to INTERNAL on C7H10O2, while performing only slightly worse than the agent trained directly on the test bags (purple). Results for C7H8N2O2

The model architecture is summarized in Table 2, where the dimensions of s^inv and s^inv_{f,e} are d^inv = (L_max + 2) · τ · 2 and d^inv_{f,e} = (L_max + 2) · τ_e · 2, respectively. If possible, we made similar architectural choices as Simm et al. (

Model architecture for actor and critic networks.

Hyperparameters for CORMORANT (Anderson et al., 2019) used in all experiments.

Hyperparameters for the single-bag, multi-bag and stochastic-bag tasks. Values in parentheses were only used for the single-bag task. For further details on how the PPO hyperparameters are defined, please refer to Schulman et al. (2017).

ACKNOWLEDGEMENTS

We thank A. J. Tripp and K. T. Jensen for useful discussions and feedback. RP receives funding from iCASE grant #1950384 with support from Nokia. JMHL acknowledges support from a Turing AI Fellowship under grant EP/V023756/1. This work has been performed using resources operated by the University of Cambridge Research Computing Service (funded by grant EP/P020259/1).

D PROBABILITY DISTRIBUTION FOR ORIENTATION

In the main paper, we propose the expression in Eq. (5) for the distribution p(x̃|d, e, f, s) over the orientation x̃. An alternative, equally valid expression is p(x̃|d, e, f, s), given in Eq. (11), where the term 1/Σ_ℓ Σ_{m=-ℓ}^{ℓ} |r^ℓ_m|^2 normalizes the distribution. We found experimentally that an agent using this expression performs worse when generating molecular structures featuring complex geometries. We hypothesize that this is because the distribution cannot become peaked enough for L_max ≤ 5. As a larger L_max would significantly increase the computational cost, we chose the expression in Eq. (5) over that in Eq. (11).

The normalization constant Z in Eq. (5) is estimated via Lebedev quadrature with 1730 angular grid points (Lebedev, 1975; 1977). We sample from the distribution in Eq. (5) using rejection sampling (Bishop, 2009) with a uniform proposal distribution q(x̃) = 1/(4π). In rejection sampling, one first draws a sample x̃_0 from q(x̃). Then, one generates a random number u_0 from the uniform distribution over [0, M q(x̃_0)], where M is chosen such that M q(x̃) ≥ p(x̃|d, e, f, s) for all x̃. We determine M by evaluating p(x̃|d, e, f, s) on a uniform grid on S^2, for which we employ a Fibonacci 'sunflower' grid. Finally, the sample is accepted if u_0 ≤ p(x̃_0|d, e, f, s) and rejected otherwise.

We ran an experiment to compare the COVARIANT agent as described in the main paper with an agent employing Eq. (11) for the distribution p(x̃|d, e, f, s). Fig. 7 confirms the hypothesis that the distribution in Eq. (11) cannot become narrow enough. After 40 000 steps, the online performance of the alternative agent COVARIANT (ALT.) converges to around 0.5, which is significantly lower than that of COVARIANT. Note that the difference is smaller when considering the offline return because the estimated global mode of each distribution could still be similar.
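The sampling procedure described in this appendix can be sketched in NumPy as follows. This is an illustrative sketch only: the density of Eq. (5) is replaced by a toy unnormalized density peaked around a direction `mu`, the function names are hypothetical, and the `safety` margin on the grid-based bound M is our own assumption, since a finite grid can slightly underestimate the true maximum.

```python
import numpy as np

def fibonacci_sphere(n):
    """Return n nearly uniformly spaced unit vectors (Fibonacci 'sunflower' grid)."""
    i = np.arange(n) + 0.5
    z = 1.0 - 2.0 * i / n                     # uniform in z gives equal-area bands
    phi = i * np.pi * (3.0 - np.sqrt(5.0))    # golden-angle increments in azimuth
    r = np.sqrt(1.0 - z * z)
    return np.stack([r * np.cos(phi), r * np.sin(phi), z], axis=1)

def rejection_sample(density, rng, n_grid=2000, safety=1.1, max_tries=100_000):
    """Draw one point on S^2 from `density` using a uniform proposal q = 1/(4 pi)."""
    q = 1.0 / (4.0 * np.pi)
    # Estimate the bound M from the grid maximum; `safety` is an assumed margin.
    m_bound = safety * np.max(density(fibonacci_sphere(n_grid))) / q
    for _ in range(max_tries):
        x0 = rng.normal(size=3)
        x0 /= np.linalg.norm(x0)              # uniform sample on the sphere
        u0 = rng.uniform(0.0, m_bound * q)
        if u0 <= density(x0):                 # accept; otherwise draw again
            return x0
    raise RuntimeError("no sample accepted; the bound M may be too loose")

# Toy stand-in for p: an unnormalized density peaked around the direction mu.
mu = np.array([0.0, 0.0, 1.0])
toy_density = lambda x: np.exp(5.0 * (x @ mu))

sample = rejection_sample(toy_density, np.random.default_rng(0))
```

The acceptance rate is the ratio of the mean density to the bound M q, so a sharply peaked distribution rejects most proposals; this is consistent with rejection sampling being a comparatively slow part of the agent.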

E BASELINES

E.1 OPT AGENT

Below, we detail the algorithm of the OPT agent. At the beginning of each experiment, the agent is given a canvas C_0, a bag B_0, and a black-box function that can compute the energy E(C) and the atomic forces F(C) for a given canvas. We assume a total charge of zero and a low-spin configuration. At the end of each experiment, we compute the total reward obtained for the final structure on canvas C_T and report the total number of energy and gradient computations.

Next, we assess the ability of COVARIANT to generate solvation clusters, a type of molecular structure that cannot be built with graph-based approaches. Following Simm et al. (2020), we task the agent to place 5 water molecules around a formaldehyde molecule that is already on the canvas at the beginning of each episode. In addition, the reward function is augmented with a penalty term for placing atoms far away from the center, i.e. r(s_t, a_t) = -ΔE - ρ‖x‖_2, where ρ is a hyperparameter that is set to 0.01 (see Simm et al. (2020) for details). Therefore, the agent needs to place the water molecules such that hydrogen bonds can be formed between water molecules and between water molecules and the solute.
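As a small worked example, the penalized reward of the solvation task can be written as below. This is a sketch under assumptions: `solvation_reward` is a hypothetical helper, the center is taken to be the origin of the canvas, and the energy difference ΔE is supplied by the caller rather than computed quantum-chemically.

```python
import numpy as np

def solvation_reward(delta_energy, position, rho=0.01):
    """Reward -dE - rho * ||x||_2 for placing an atom at `position`."""
    # The penalty grows linearly with the distance from the (assumed) origin.
    return -delta_energy - rho * float(np.linalg.norm(position))

# Same energy change, two placements: the more distant one earns less reward.
near = solvation_reward(delta_energy=-0.2, position=[3.0, 4.0, 0.0])
far = solvation_reward(delta_energy=-0.2, position=[30.0, 40.0, 0.0])
```

With ρ = 0.01, the penalty is small compared with typical energy differences near the solute, but it steadily discourages placements that drift away from the center.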

