SYMMETRY-AWARE ACTOR-CRITIC FOR 3D MOLECULAR DESIGN

Abstract

Automating molecular design using deep reinforcement learning (RL) has the potential to greatly accelerate the search for novel materials. Despite recent progress on leveraging graph representations to design molecules, such methods are fundamentally limited by the lack of three-dimensional (3D) information. In light of this, we propose a novel actor-critic architecture for 3D molecular design that can generate molecular structures unattainable with previous approaches. This is achieved by exploiting the symmetries of the design process through a rotationally covariant state-action representation based on a spherical harmonics series expansion. We demonstrate the benefits of our approach on several 3D molecular design tasks, where we find that building in such symmetries significantly improves generalization and the quality of generated molecules.

1. INTRODUCTION

The search for molecular structures with desirable properties is a challenging task with important applications in de novo drug design and materials discovery (Schneider et al., 2019). There exists a plethora of machine learning approaches to accelerate this search, including generative models based on variational autoencoders (VAEs) (Gómez-Bombarelli et al., 2018), recurrent neural networks (RNNs) (Segler et al., 2018), and generative adversarial networks (GANs) (De Cao & Kipf, 2018). However, the reliance on a sufficiently large dataset for exploring unknown regions of chemical space is a severe limitation of such supervised models. Recent RL-based methods (e.g., Olivecrona et al., 2017; Jørgensen et al., 2019; Simm et al., 2020) mitigate the need for an existing dataset of molecules as they only require access to a reward function.

Most approaches rely on graph representations of molecules, where atoms and bonds are represented by nodes and edges, respectively. This is a strongly simplified model designed for the description of single organic molecules. It is unsuitable for encoding metals and molecular clusters as it lacks information about the relative position of atoms in 3D space. Further, geometric constraints on the design process cannot be included, e.g., those imposed by the active site of an enzyme.

A more general representation closer to the physical system is one in which a molecule is described by its atoms' positions in Cartesian coordinates. However, it would be very inefficient to naively learn a model based on this representation. That is because molecular properties such as the energy are invariant (i.e. unchanged) under symmetry operations like translation or rotation of all atomic positions. A model without the right inductive bias would thus have to learn those symmetries from scratch. In this work, we develop a novel RL approach for designing molecules in Cartesian coordinates that explicitly encodes these symmetry operations.
The agent builds molecules by consecutively placing atoms such that if the generated structure is rotated or translated, the agent's action is rotated and translated accordingly; this way, the reward remains the same (see Fig. 1(a)). We achieve this through a rotationally covariant state representation based on spherical harmonics, which we integrate into a novel actor-critic network architecture with an auto-regressive policy that maintains the desired covariance. Building in this inductive bias enables us to generate molecular structures with more complex coordination geometry than those attainable with previous approaches. Finally, we perform experiments on several 3D molecular design tasks, where we find that our approach significantly improves the generalization capabilities of the RL agent and the quality of the generated molecules. In summary, our contributions are as follows:

• we propose the first approach for 3D molecular design that exploits symmetries of the design process by leveraging a rotationally covariant state representation;

• we integrate this state representation into an actor-critic neural network architecture with a rotationally covariant auto-regressive policy, where the orientation of the atoms to be placed is modeled through a flexible distribution based on spherical harmonics;

• we demonstrate the benefits of our approach on several 3D molecular design tasks, including a newly proposed task that showcases the generalization capabilities of our agent.
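The covariance property above can be checked numerically. The following toy sketch (not the paper's implementation) uses interatomic distances as a stand-in for an energy-based reward: rotating the canvas and the agent's proposed atom position by the same rotation R changes neither any pairwise distance nor, consequently, any reward built from them.

```python
import numpy as np

def random_rotation(rng):
    # QR decomposition of a Gaussian matrix yields a random orthogonal
    # matrix; fix signs so that det(R) = +1 (a proper rotation).
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q *= np.sign(np.diag(r))
    if np.linalg.det(q) < 0:
        q[:, 0] *= -1
    return q

def pairwise_distances(positions):
    diff = positions[:, None, :] - positions[None, :, :]
    return np.linalg.norm(diff, axis=-1)

rng = np.random.default_rng(0)
canvas = rng.normal(size=(5, 3))   # positions of 5 already-placed atoms
action_pos = rng.normal(size=3)    # position proposed by the agent
R = random_rotation(rng)

# Rotating the whole state-action pair leaves every pairwise distance,
# and hence any distance-based reward, unchanged.
state_action = np.vstack([canvas, action_pos])
d_orig = pairwise_distances(state_action)
d_rot = pairwise_distances(state_action @ R.T)
assert np.allclose(d_orig, d_rot)
```

The same argument applies to translations, since distances depend only on position differences; this is exactly why a model without this inductive bias wastes capacity relearning it from data.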

2.1. REINFORCEMENT LEARNING FOR MOLECULAR DESIGN

In the standard RL setting (Sutton & Barto, 2018), an agent interacts with the environment to maximize its reward. Formally, such an environment is described by a Markov decision process (MDP) $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{T}, \mu_0, \gamma, r)$ with states $s_t \in \mathcal{S}$, actions $a_t \in \mathcal{A}$, transition dynamics $\mathcal{T}: \mathcal{S} \times \mathcal{A} \to \mathcal{S}$, initial state distribution $\mu_0$, discount factor $\gamma \in (0, 1]$, and reward function $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$. The goal is to learn a stochastic policy $\pi(a_t \mid s_t)$ that maximizes the expected discounted return $J(\theta) = \mathbb{E}_{s_0 \sim \mu_0}[V^\pi(s_0)]$, where the value function $V^\pi(s_t) = \mathbb{E}_\pi\left[\sum_{t'=t}^{T} \gamma^{t'-t} r(s_{t'}, a_{t'}) \mid s_t\right]$ is defined as the expected discounted return when starting from state $s_t$ and following policy $\pi$.

Following Simm et al. (2020), we design molecules by iteratively picking atoms from a bag and positioning them on a 3D canvas. Such a sequential decision-making problem is described by an MDP where the state $s_t = (\mathcal{C}_t, \mathcal{B}_t)$ comprises both the canvas $\mathcal{C}_t$ and the bag $\mathcal{B}_t$. The canvas $\mathcal{C}_t = \mathcal{C}_0 \cup \{(e_i, x_i)\}_{i=0}^{t-1}$ is a set of atoms with chemical element $e_i \in \{\text{H}, \text{C}, \text{N}, \text{O}, \dots\}$ and position $x_i \in \mathbb{R}^3$ placed up to time $t-1$, where $\mathcal{C}_0$ can either be empty or contain a set of initially placed atoms. The number of atoms on the canvas is denoted by $|\mathcal{C}_t|$. The bag $\mathcal{B}_t = \{(e, m(e))\}$ is a multiset of atoms yet to be placed, where $m(e)$ is the multiplicity of the element $e$. Each action $a_t = (e_t, x_t)$ consists of the element $e_t \in \mathcal{B}_t$ and position $x_t \in \mathbb{R}^3$ of the next atom to be added to the canvas. Placing an atom through action $a_t$ in state $s_t$ is modeled by a deterministic transition function $\mathcal{T}(s_t, a_t)$ that yields the next state $s_{t+1} = (\mathcal{C}_{t+1}, \mathcal{B}_{t+1})$ with $\mathcal{B}_{t+1} = \mathcal{B}_t \setminus e_t$. The reward function $r(s_t, a_t) = -\Delta E(s_t, a_t)$ is given by the negative energy difference between the resulting structure described by $\mathcal{C}_{t+1}$ and the sum of energies of the current structure $\mathcal{C}_t$ and a new atom of element $e_t$ placed at the origin, i.e.

$$\Delta E(s_t, a_t) = E(\mathcal{C}_{t+1}) - \left[E(\mathcal{C}_t) + E(\{(e_t, \mathbf{0})\})\right].$$

Intuitively, the reward encourages the agent to build stable, low-energy structures. We evaluate the energy using the fast semi-empirical Parametrized Method 6 (PM6) (Stewart, 2007) as implemented in SPARROW (Husch et al., 2018; Bosia et al., 2020); see Appendix A for details.
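This MDP can be sketched in a few lines of Python. The class name `MolecularMDP` and the toy Lennard-Jones-style pairwise potential below are illustrative assumptions standing in for the paper's PM6 energy; everything else (canvas, bag multiset, deterministic transition, reward as negative energy difference) follows the definitions above.

```python
from collections import Counter
from dataclasses import dataclass, field

import numpy as np

def energy(canvas):
    # Placeholder for the PM6 energy: a Lennard-Jones-like pairwise
    # potential over atom positions (element-independent for brevity).
    if len(canvas) < 2:
        return 0.0
    pos = np.array([x for _, x in canvas])
    diff = pos[:, None, :] - pos[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    iu = np.triu_indices(len(canvas), k=1)  # each pair counted once
    r = dist[iu]
    return float(np.sum(r**-12 - 2 * r**-6))

@dataclass
class MolecularMDP:
    canvas: list = field(default_factory=list)     # [(element, position)]
    bag: Counter = field(default_factory=Counter)  # element -> multiplicity

    def step(self, element, position):
        # Deterministic transition T(s_t, a_t): place the atom on the
        # canvas and remove it from the bag.
        assert self.bag[element] > 0, "element not available in the bag"
        # E(C_t) + E({(e_t, 0)}): current structure plus a lone atom.
        e_before = energy(self.canvas) + energy([(element, np.zeros(3))])
        self.canvas.append((element, np.asarray(position, dtype=float)))
        self.bag[element] -= 1
        delta_e = energy(self.canvas) - e_before
        reward = -delta_e  # low-energy placements earn high reward
        done = sum(self.bag.values()) == 0
        return (self.canvas, self.bag), reward, done

# Example rollout with a small bag
env = MolecularMDP(bag=Counter({"O": 1, "H": 2}))
state, reward, done = env.step("O", [0.0, 0.0, 0.0])
```

Note that the single-atom energy term $E(\{(e_t, \mathbf{0})\})$ vanishes under this toy pairwise potential, but it is kept so that the reward matches the definition of $\Delta E$ term by term.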



Figure 1: (a) Illustration of a rotation-covariant state-action representation. If the structure is rotated by $R$, the position $x$ of the action transforms accordingly. (b) Rollout with bag $\mathcal{B}_0 = \text{SOF}_4$. The agent builds a molecule by repeatedly taking atoms from the bag and placing them onto the 3D canvas. Bonds connecting atoms are only for illustration and not part of the MDP.

