ATOM3D: TASKS ON MOLECULES IN THREE DIMENSIONS

Abstract

While a variety of methods have been developed for predicting molecular properties, deep learning networks that operate directly on three-dimensional molecular structure have recently demonstrated particular promise. In this work we present ATOM3D, a collection of both novel and existing datasets spanning several key classes of biomolecules, to systematically assess such learning methods. We develop three-dimensional molecular learning networks for each of these tasks, finding that they consistently improve performance relative to one- and two-dimensional methods. The specific choice of architecture proves critical for performance, with three-dimensional convolutional networks excelling at tasks involving complex geometries and graph networks performing well on systems requiring detailed positional information. Furthermore, equivariant networks show significant promise but are currently unable to scale. Our results indicate that many molecular problems stand to gain from three-dimensional molecular learning. All code and datasets are available on GitHub.

1. INTRODUCTION

A molecule's three-dimensional (3D) shape is critical to understanding its physical mechanisms of action, and can be used to answer a number of questions relating to drug discovery, molecular design, and fundamental biology. A molecule's atoms often adopt specific 3D configurations that minimize its free energy, and by representing these 3D positions (the atomistic geometry) we can model this 3D shape in ways that would not be possible with 1D or 2D representations such as linear sequences or chemical bond graphs (Table 1). However, existing works that examine diverse molecular tasks, such as MoleculeNet (Wu et al., 2018) or TAPE (Rao et al., 2019), focus on these lower-dimensional representations. In this work, we demonstrate the benefit yielded by learning on 3D atomistic geometry and promote the development of 3D molecular learning by providing a collection of datasets leveraging this representation. Furthermore, we argue that the atom should be considered a "machine learning datatype" in its own right, deserving focused study much like images in computer vision or text in natural language processing. All molecules, including proteins, small-molecule compounds, and nucleic acids, can be represented homogeneously as atoms in 3D space. These atoms can belong only to a fixed class of element types (e.g. carbon, nitrogen, oxygen), and are all governed by the same underlying laws of physics, leading to important rotational, translational, and permutational symmetries. These systems also contain higher-level patterns that are poorly characterized, creating a ripe opportunity for learning them from data: though certain basic components are well understood (e.g. amino acids, nucleotides, functional groups), many others cannot easily be defined. These patterns are in turn composed in a hierarchy that is itself only partially elucidated.
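This uniform view of atoms as a machine-learning datatype can be made concrete with a minimal sketch: every molecule, regardless of class, reduces to element types drawn from a fixed vocabulary plus 3D coordinates. The vocabulary, the one-hot encoding, and the toy coordinates below are illustrative assumptions, not the actual ATOM3D format.

```python
# A fixed class of element types; real systems use a longer vocabulary.
ELEMENTS = ["H", "C", "N", "O", "S", "P"]

def one_hot(element):
    """Encode an element symbol as a one-hot vector over the vocabulary."""
    vec = [0.0] * len(ELEMENTS)
    vec[ELEMENTS.index(element)] = 1.0
    return vec

# A molecule is just a list of (element, xyz) pairs -- proteins, small
# molecules, and nucleic acids can all be represented this way.
# Coordinates here are rough illustrative values, not real measurements.
glycine_fragment = [
    ("N", (0.00, 0.00, 0.00)),
    ("C", (1.46, 0.00, 0.00)),
    ("C", (2.01, 1.42, 0.00)),
    ("O", (1.25, 2.39, 0.00)),
]

features = [one_hot(el) for el, _ in glycine_fragment]
coords = [xyz for _, xyz in glycine_fragment]
```

The same two arrays, element features and positions, suffice for any biomolecule; this homogeneity is what makes the atom a natural shared datatype across tasks.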
While deep learning methods such as graph neural networks (GNNs) and convolutional neural networks (CNNs) seem especially well suited to atomistic geometry, to date there has been no systematic evaluation of such methods on molecular tasks. Additionally, despite the growing number of 3D structures available in databases such as the Protein Data Bank (PDB) (Berman et al., 2000), these structures require significant processing before they are useful for machine learning tasks. Inspired by the success of accessible databases such as ImageNet (Deng et al., 2009) and SQuAD (Rajpurkar et al., 2016) in sparking progress in their respective fields, we create and curate benchmark datasets for atomistic tasks, process them into a simple and standardized format, systematically benchmark 3D molecular learning methods, and present a set of best practices for other machine learning researchers interested in entering the field of 3D molecular learning. We develop new methods for several datasets and reveal a number of insights related to 3D molecular learning, including the consistent improvements yielded by using atomistic geometry, the lack of a single dominant method, and the presence of several tasks that can be improved through 3D molecular learning.
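To illustrate the kind of processing such structures require, the sketch below flattens PDB ATOM records (a fixed-column text format) into simple per-atom dictionaries. The choice of record fields is an assumption made for illustration and does not reflect the actual ATOM3D schema.

```python
def parse_pdb_atom(line):
    """Parse one ATOM record from a PDB file (fixed-column format).

    Column ranges follow the PDB specification: coordinates occupy
    columns 31-54 and the element symbol columns 77-78 (1-indexed).
    """
    return {
        "element": line[76:78].strip(),
        "resname": line[17:20].strip(),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
    }

# Two illustrative ATOM records (coordinates are toy values).
pdb_lines = [
    "ATOM      1  N   GLY A   1       0.000   0.000   0.000  1.00  0.00           N",
    "ATOM      2  CA  GLY A   1       1.458   0.000   0.000  1.00  0.00           C",
]

atoms = [parse_pdb_atom(l) for l in pdb_lines if l.startswith("ATOM")]
```

A flat table of atom records like this, uniform across proteins, small molecules, and nucleic acids, is one way to realize the "simple and standardized format" described above.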

2. RELATED WORK

Three-dimensional molecular data have long been pursued as an attractive source of information in molecular learning and chemoinformatics, but until recently achieved underwhelming results relative to 1D and 2D representations (Swamidass et al., 2005; Azencott et al., 2007). However, due to increases in data availability and methodological advances, machine learning methods based on 3D molecular structure have begun to demonstrate significant impact in the last few years on specific tasks such as protein structure prediction (Senior et al., 2020), equilibrium state sampling (Noé et al., 2019), and drug design (Zhavoronkov et al., 2019). While there have been some broader assessments of groups of related biological tasks, these have focused on either 1D (Rao et al., 2019) or 2D (Wu et al., 2018) representations. By focusing instead on atomistic geometry, we can consistently improve performance and address disparate problems involving any combination of small molecules, proteins, and nucleic acids through a unified lens.

Graph neural networks (GNNs) have grown into a major area of study, providing a natural way of learning from data with complex spatial structure. Many GNN implementations have been motivated by applications to atomic systems, including molecular fingerprinting (Duvenaud et al., 2015), property prediction (Schütt et al., 2017; Gilmer et al., 2017; Liu et al., 2019), protein interface prediction (Fout et al., 2017), and protein design (Ingraham et al., 2019). Instead of encoding points in Euclidean space, GNNs encode their pairwise connectivity, capturing a structured representation of atomistic data.

Three-dimensional CNNs (3DCNNs) have also become popular as a way to capture these complex 3D geometries.
They have been applied to a number of biomolecular applications such as protein interface prediction (Townshend et al., 2019), protein model quality assessment (Pagès et al., 2019; Derevyanko et al., 2018), protein sequence design (Anand et al., 2020), and structure-based drug discovery (Wallach et al., 2015; Torng & Altman, 2017; Ragoza et al., 2017; Jiménez et al., 2018). These 3DCNNs can encode translational and permutational symmetries, but incur significant computational expense and cannot capture rotational symmetries without data augmentation.

In an attempt to address many of the problems of representing atomistic geometries, equivariant neural networks (ENNs) have emerged as a new class of methods for learning from molecular systems. These networks are built such that geometric transformations of their inputs lead to well-defined transformations of their outputs. This setup leads the neurons of the network to learn rules that resemble physical interactions. Tensor field networks (Thomas et al., 2018) and Cormorant (Kondor,
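A small sketch can make two of these points concrete: a GNN-style representation encodes pairwise connectivity (here, a distance-cutoff graph) rather than raw coordinates, and such a representation is automatically invariant to the rotations that ENNs handle by construction. The 4.5 Å cutoff and toy coordinates below are illustrative assumptions, not taken from any specific published model.

```python
import math

def radius_graph(coords, cutoff=4.5):
    """Edge list connecting atom pairs closer than `cutoff` (angstroms)."""
    return [(i, j) for i, a in enumerate(coords)
            for j, b in enumerate(coords)
            if i != j and math.dist(a, b) <= cutoff]

def rotate_z(coords, theta):
    """Rotate 3D points about the z-axis by `theta` radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in coords]

# Three toy atoms: the first two are bonded-distance apart, the third is far.
coords = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (10.0, 0.0, 0.0)]
rotated = rotate_z(coords, math.pi / 3)

# Rotation changes the raw coordinates, but pairwise distances are
# preserved, so the connectivity graph a GNN consumes is unchanged.
same_graph = radius_graph(coords) == radius_graph(rotated)
```

A network fed raw coordinates must learn (or be shown, via augmentation) that `coords` and `rotated` describe the same molecule; distance-based and equivariant representations get this for free.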

Table 1: Representation choice for molecules. Adding in 3D information consistently improves performance. The depicted 1D representations are the amino acid sequence and SMILES strings (Weininger, 1988) for proteins and small molecules, respectively.

