EQUIFORMER: EQUIVARIANT GRAPH ATTENTION TRANSFORMER FOR 3D ATOMISTIC GRAPHS

Abstract

Despite their widespread success in various domains, Transformer networks have yet to perform well across datasets in the domain of 3D atomistic graphs such as molecules even when 3D-related inductive biases like translational invariance and rotational equivariance are considered. In this paper, we demonstrate that Transformers can generalize well to 3D atomistic graphs and present Equiformer, a graph neural network leveraging the strength of Transformer architectures and incorporating SE(3)/E(3)-equivariant features based on irreducible representations (irreps). First, we propose a simple and effective architecture by only replacing original operations in Transformers with their equivariant counterparts and including tensor products. Using equivariant operations enables encoding equivariant information in channels of irreps features without complicating graph structures. With minimal modifications to Transformers, this architecture has already achieved strong empirical results. Second, we propose a novel attention mechanism called equivariant graph attention, which improves upon typical attention in Transformers by replacing dot product attention with multi-layer perceptron attention and including non-linear message passing. With these two innovations, Equiformer achieves results competitive with previous models on the QM9, MD17 and OC20 datasets.

1. INTRODUCTION

Machine-learned models can accelerate the prediction of quantum properties of atomistic systems like molecules by learning approximations of ab initio calculations (Gilmer et al., 2017; Zhang et al., 2018b; Jia et al., 2020; Gasteiger et al., 2020a; Batzner et al., 2022; Lu et al., 2021; Unke et al., 2021; Sriram et al., 2022; Rackers et al., 2023). In particular, graph neural networks (GNNs) have gained increasing popularity due to their performance. By modeling atomistic systems as graphs, GNNs naturally handle the set-like nature of collections of atoms, encode interactions between atoms in node features, and update those features by passing messages between nodes. One factor contributing to the success of neural networks is the ability to incorporate inductive biases that exploit the symmetry of data. Take convolutional neural networks (CNNs) for 2D images as an example: patterns in images should be recognized regardless of their positions, which motivates the inductive bias of translational equivariance. As for atomistic graphs, where each atom has a coordinate in 3D Euclidean space, we consider inductive biases related to the 3D Euclidean group E(3), which include equivariance to 3D translation, 3D rotation, and inversion. Concretely, some properties, like the energy of an atomistic system, should remain constant regardless of how we shift the system; others, like forces, should rotate accordingly if we rotate the system. To incorporate these inductive biases, equivariant and invariant neural networks have been proposed. The former leverage geometric tensors like vectors for equivariant node features (Thomas et al., 2018; Weiler et al., 2018; Kondor et al., 2018; Fuchs et al., 2020; Batzner et al., 2022; Brandstetter et al., 2022; Musaelian et al., 2022), and the latter augment graphs with invariant information such as distances and angles extracted from 3D graphs (Schütt et al., 2017; Gasteiger et al., 2020b;a; Liu et al., 2022; Klicpera et al., 2021).
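These symmetry constraints can be checked numerically. The sketch below uses a toy inverse-distance pair potential (a placeholder for illustration, not a real force field or anything used in Equiformer) to show that a scalar energy is invariant under a 3D rotation while the force vectors are equivariant, i.e. they rotate with the system:

```python
import math

def rot_z(theta):
    """Rotation matrix about the z-axis by angle theta."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def matvec(R, v):
    return [sum(R[i][j] * v[j] for j in range(3)) for i in range(3)]

def energy(pos):
    # Toy pairwise energy: sum of inverse distances over atom pairs.
    # It depends only on distances, so it is rotation-invariant by construction.
    E = 0.0
    for i in range(len(pos)):
        for j in range(i + 1, len(pos)):
            E += 1.0 / math.dist(pos[i], pos[j])
    return E

def forces(pos):
    # Analytic negative gradient of the toy energy:
    # F_i = sum_{j != i} (r_i - r_j) / |r_i - r_j|^3
    F = [[0.0, 0.0, 0.0] for _ in pos]
    for i in range(len(pos)):
        for j in range(len(pos)):
            if i == j:
                continue
            d = [pos[i][k] - pos[j][k] for k in range(3)]
            r = math.sqrt(sum(x * x for x in d))
            for k in range(3):
                F[i][k] += d[k] / r**3
    return F

pos = [[0.0, 0.0, 0.0], [1.0, 0.2, -0.3], [-0.5, 1.1, 0.7]]
R = rot_z(0.77)
rot_pos = [matvec(R, p) for p in pos]

# Invariance: the energy is unchanged when the system is rotated.
assert abs(energy(pos) - energy(rot_pos)) < 1e-10

# Equivariance: forces of the rotated system equal the rotated forces.
F_rot = forces(rot_pos)
RF = [matvec(R, f) for f in forces(pos)]
assert all(abs(F_rot[i][k] - RF[i][k]) < 1e-10
           for i in range(3) for k in range(3))
```

Translation invariance holds for the same reason: shifting every position by a constant vector leaves all pairwise distances, and hence the energy and forces, unchanged.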
A parallel line of research applies Transformer networks (Vaswani et al., 2017) to other domains like computer vision (Carion et al., 2020; Dosovitskiy et al., 2021; Touvron et al., 2020) and graphs (Dwivedi & Bresson, 2020; Kreuzer et al., 2021; Ying et al., 2021; Shi et al., 2022) and has demonstrated widespread success. However, as Transformers were developed for sequence data (Devlin et al., 2019; Baevski et al., 2020; Brown et al., 2020), it is crucial to incorporate domain-related inductive biases. For example, Vision Transformer (Dosovitskiy et al., 2021) shows that applying a pure Transformer to image classification generalizes poorly and achieves worse results than CNNs when trained on only ImageNet (Russakovsky et al., 2015), since it lacks inductive biases like translational invariance. Note that ImageNet contains over 1.28M images, already more than many quantum property prediction datasets (Ruddigkeit et al., 2012; Chmiela et al., 2017; Chanussot et al., 2021). This highlights the necessity of including the correct inductive biases when applying Transformers to the domain of 3D atomistic graphs. Despite their widespread success in various domains, Transformers have yet to perform well across datasets (Fuchs et al., 2020; Thölke & Fabritiis, 2022; Le et al., 2022) in this domain even when relevant inductive biases are incorporated. In this work, we demonstrate that Transformers can generalize well to 3D atomistic graphs and present Equiformer, an equivariant graph neural network utilizing SE(3)/E(3)-equivariant features built from irreducible representations (irreps) and a novel attention mechanism to combine 3D-related inductive biases with the strength of Transformers.
First, we propose a simple and effective architecture, Equiformer with dot product attention and linear message passing, obtained by only replacing the original operations in Transformers with their equivariant counterparts and including tensor products. Using equivariant operations enables encoding equivariant information in the channels of irreps features without complicating graph structures. With minimal modifications to Transformers, this architecture already achieves strong empirical results (Index 3 in Tables 6 and 7). Second, we propose a novel attention mechanism called equivariant graph attention, which improves upon the typical attention in Transformers by replacing dot product attention with multi-layer perceptron attention and including non-linear message passing. Combining these two innovations, Equiformer (Index 1 in Tables 6 and 7) achieves competitive results on the QM9 (Ruddigkeit et al., 2012; Ramakrishnan et al., 2014), MD17 (Chmiela et al., 2017; Schütt et al., 2017; Chmiela et al., 2018) and OC20 (Chanussot et al., 2021) datasets. For QM9 and MD17, Equiformer achieves overall better results across all tasks or all molecules compared to previous models like NequIP (Batzner et al., 2022) and TorchMD-NET (Thölke & Fabritiis, 2022). For OC20, when trained with IS2RE data and optionally IS2RS data, Equiformer improves upon state-of-the-art models such as SEGNN (Brandstetter et al., 2022) and Graphormer (Shi et al., 2022). In particular, as of the submission of this work, Equiformer achieves the best IS2RE result when only IS2RE and IS2RS data are used and reduces training time by 2.3× to 15.5× compared to previous models.

2. RELATED WORKS

We focus on equivariant neural networks here. A detailed comparison between other equivariant Transformers and Equiformer, along with a discussion of other related works, is provided in Sec. B in the appendix.

SE(3)/E(3)-Equivariant GNNs. Equivariant neural networks (Thomas et al., 2018; Kondor et al., 2018; Weiler et al., 2018; Fuchs et al., 2020; Miller et al., 2020; Townshend et al., 2020; Batzner et al., 2022; Jing et al., 2021; Schütt et al., 2021; Satorras et al., 2021; Unke et al., 2021; Brandstetter et al., 2022; Thölke & Fabritiis, 2022; Le et al., 2022; Musaelian et al., 2022) operate on geometric tensors like type-L vectors to achieve equivariance. The central idea is to use functions of geometry built from spherical harmonics and irreps features to achieve 3D rotational and translational equivariance, as proposed in Tensor Field Network (TFN) (Thomas et al., 2018), which generalizes 2D counterparts (Worrall et al., 2016; Cohen & Welling, 2016; Cohen et al., 2018) to 3D Euclidean space (Thomas et al., 2018; Weiler et al., 2018; Kondor et al., 2018). Previous works differ in the equivariant operations used in their networks. TFN (Thomas et al., 2018) and NequIP (Batzner et al., 2022) use graph convolution with linear messages, with the latter utilizing extra equivariant gate activations (Weiler et al., 2018). SEGNN (Brandstetter et al., 2022) introduces non-linear messages (Gilmer et al., 2017; Sanchez-Gonzalez et al., 2020) for irreps features; these non-linear messages use the same gate activation and improve upon linear messages. SE(3)-Transformer (Fuchs et al., 2020) adopts an equivariant version of dot product (DP) attention (Vaswani et al., 2017) with linear messages, and the attention can support vectors of any type L. Subsequent works on equivariant Transformers (Thölke & Fabritiis, 2022; Le et al., 2022) follow the practice of DP attention and linear messages but use more specialized architectures considering only type-0 and type-1 vectors.
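The role of tensor products between irreps can be illustrated without any equivariance library. For two type-1 vectors, the tensor product decomposes as 1 ⊗ 1 = 0 ⊕ 1 ⊕ 2: the dot product is the type-0 (invariant) part and the cross product is the type-1 (equivariant) part. A minimal numerical check of this decomposition, assuming a proper rotation (this is a didactic sketch, not code from any of the cited networks):

```python
import math

def rot_z(theta):
    """Proper rotation about the z-axis (determinant +1)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def matvec(R, v):
    return [sum(R[i][j] * v[j] for j in range(3)) for i in range(3)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

a, b = [0.3, -1.2, 0.5], [1.1, 0.4, -0.7]
R = rot_z(1.23)
Ra, Rb = matvec(R, a), matvec(R, b)

# Type-0 part of the 1 (x) 1 tensor product: a scalar, invariant under rotation.
assert abs(dot(Ra, Rb) - dot(a, b)) < 1e-12

# Type-1 part (the cross product): equivariant, i.e. it rotates with the inputs.
lhs = cross(Ra, Rb)
rhs = matvec(R, cross(a, b))
assert all(abs(x - y) < 1e-12 for x, y in zip(lhs, rhs))
```

Networks such as TFN, NequIP and SEGNN generalize this pattern to arbitrary types via Clebsch-Gordan tensor products with learnable weights, so that every intermediate feature transforms predictably under rotation.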
The proposed Equiformer incorporates all of these advantages by combining MLP attention with non-linear messages and supporting vectors of any type. Compared to TFN, NequIP, SEGNN and SE(3)-Transformer, the combination of MLP attention and non-linear messages is more expressive than any of their designs, which use linear or non-linear messages alone or pair dot product attention with only linear messages. Compared to other equivariant Transformers (Thölke & Fabritiis, 2022; Le et al., 2022), in addition to being more expressive, the proposed attention mechanism supports vectors of higher degrees (types) and involves higher-order tensor product interactions, which can lead to better performance.
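The difference between the two attention scoring functions can be made concrete with a toy sketch on invariant edge features. All weights below are random stand-ins, not Equiformer's learned parameters; the MLP variant follows the GATv2-style form score = a · LeakyReLU(f_ij), while dot product attention scores with q_i · k_ij:

```python
import math
import random

random.seed(0)
d = 8  # feature dimension of the invariant (type-0) edge features

def leaky_relu(x, slope=0.2):
    return [v if v > 0 else slope * v for v in x]

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [x / s for x in e]

# Toy invariant features f_ij for 3 neighbors j of a node i. In an equivariant
# network these would come from tensor products of node features with
# spherical harmonics of the edge direction, so they are already invariant.
f = [[random.gauss(0, 1) for _ in range(d)] for _ in range(3)]

# Dot product attention: a bilinear score q_i . k_ij (random toy query here).
q = [random.gauss(0, 1) for _ in range(d)]
dp_scores = softmax([dot_qf := sum(qi * fi for qi, fi in zip(q, fj))
                     for fj in f])

# MLP attention: a nonlinearity is applied before the final projection,
# so the score is a non-linear function of f_ij rather than a bilinear one.
a_vec = [random.gauss(0, 1) for _ in range(d)]
mlp_scores = softmax([sum(ai * hi for ai, hi in zip(a_vec, leaky_relu(fj)))
                      for fj in f])

# Both produce a valid distribution over neighbors.
assert abs(sum(dp_scores) - 1.0) < 1e-12
assert abs(sum(mlp_scores) - 1.0) < 1e-12
```

The point of the sketch is only the structural contrast: because the scores depend solely on invariant features, both variants leave equivariance intact, and the choice between them is purely about expressivity.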

3. BACKGROUND

3.1 E(3) EQUIVARIANCE

Atomistic systems are often described using coordinate systems. For 3D Euclidean space, we can freely choose coordinate systems and change between them via the symmetries of 3D space: 3D

