HOLOGRAPHIC-(V)AE: AN END-TO-END SO(3)-EQUIVARIANT (VARIATIONAL) AUTOENCODER IN FOURIER SPACE

Abstract

Group-equivariant neural networks have emerged as a data-efficient approach to solve classification and regression tasks, while respecting the relevant symmetries of the data. However, little work has been done to extend this paradigm to the unsupervised and generative domains. Here, we present Holographic-(V)AE (H-(V)AE), a fully end-to-end SO(3)-equivariant (variational) autoencoder in Fourier space, suitable for unsupervised learning and generation of data distributed around a specified origin. H-(V)AE is trained to reconstruct the spherical Fourier encoding of data, learning in the process a latent space with a maximally informative invariant embedding alongside an equivariant frame describing the orientation of the data. We extensively test the performance of H-(V)AE on diverse datasets and show that its latent space efficiently encodes the categorical features of spherical images and structural features of protein atomic environments. Our work can further be seen as a case study for equivariant modeling of a data distribution by reconstructing its Fourier encoding.

1. INTRODUCTION

In supervised learning, the success of state-of-the-art algorithms is often attributed to respecting known inductive biases of the function they are trying to approximate. One such bias is the invariance of the function to certain transformations of the input; for example, image classification is translationally invariant. To achieve such invariance, conventional techniques use data augmentation to train an algorithm on many transformed forms of the data. However, this solution is only approximate and increases training time significantly, up to prohibitive scales for high-dimensional and continuous transformations (∼500 augmentations are required to learn 3D rotation-invariant patterns (Geiger & Smidt, 2022)). Alternatively, one could use invariant features of the data (e.g., pairwise distances between different features) as input to train any machine learning algorithm (Capecchi et al., 2020; Uhrin, 2021). However, the choice of these invariants is arbitrary, and the resulting network could lack expressiveness. Recent advances have developed neural network architectures that are equivariant under the actions of different symmetry groups. These networks can systematically treat and interpret various transformations in the data, and learn models that are agnostic to these transformations. For example, models equivariant to Euclidean transformations have recently advanced the state of the art on tasks over 3D point-cloud data (Liao & Smidt, 2022; Musaelian et al., 2022; Brandstetter et al., 2022). These models are more flexible and expressive than their purely invariant counterparts (Geiger & Smidt, 2022), and exhibit high data efficiency. Extending such group-invariant and equivariant paradigms to unsupervised learning could map out compact representations of data that are agnostic to a specified symmetry transformation (e.g., the global orientation of an object). In recent work, Winter et al.
(2022) proposed a general mathematical framework for autoencoders that can be applied to data with arbitrary symmetry structures, by learning an invariant latent space together with an equivariant factor related to the elements of the underlying symmetry group. Here, we focus on unsupervised learning that is equivariant to rotations around a specified origin in 3D, described by the group SO(3). We encode the data in spherical Fourier space and construct holograms of the data that are conveniently structured for equivariant operations. These data holograms are the inputs to our end-to-end SO(3)-equivariant (variational) autoencoder in spherical Fourier space, with a fully equivariant encoder-decoder architecture trained to reconstruct the Fourier coefficients of the input; we term this approach Holographic-(V)AE (H-(V)AE). Similar to Winter et al. (2022), our network learns an SO(3)-equivariant latent space composed of a maximally informative set of invariants and an equivariant frame describing the orientation of the data. We extensively test the performance of H-(V)AE and demonstrate high accuracy in unsupervised classification and clustering tasks for spherical images and atomic point clouds within protein structures. The learned SO(3)-invariant and equivariant representations would be useful for many real-world applications in computer vision and structural biology.

2. BACKGROUND

2.1. SPHERICAL HARMONICS AND IRREPS OF SO(3)

We are interested in modeling 3D data (i.e., functions in ℝ³), for which the global orientation of the data should not impact the inferred model (Einstein, 1916). We consider functions distributed around a specified origin, which we express in the resulting spherical coordinates (r, θ, ϕ) around the origin. In this case, the set of rotations about the origin defines the 3D rotation group SO(3), and we consider models that are rotationally equivariant under SO(3). It is convenient to project data to spherical Fourier space, where equivariant transformations under rotations are easy to define. To map a radially distributed function ρ(r, θ, ϕ) to spherical Fourier space, we use the Zernike Fourier Transform (ZFT),

Ẑ_{nℓm} = ∫ ρ(r, θ, ϕ) Y_{ℓm}(θ, ϕ) R_{nℓ}(r) dV,    (1)

where Y_{ℓm}(θ, ϕ) is the spherical harmonic of degree ℓ and order m, with ℓ a non-negative integer (ℓ ≥ 0) and m an integer in the interval −ℓ ≤ m ≤ ℓ. R_{nℓ}(r) is the radial Zernike polynomial in 3D (Eq. A.7) with radial frequency n ≥ 0 and degree ℓ; R_{nℓ}(r) is non-zero only for even values of n − ℓ ≥ 0. Zernike polynomials, defined as the products Y_{ℓm}(θ, ϕ) R_{nℓ}(r), form a complete orthonormal basis in 3D, and can therefore be used to expand and retrieve 3D shapes if large enough ℓ and n values are used; approximations that restrict the series to finite n and ℓ are often sufficient for shape retrieval, and hence desirable algorithmically. Thus, in practice, we cap the resolution of the ZFT at a maximum degree L and a maximum radial frequency N. The operators that describe how spherical harmonics transform under rotations are called the Wigner D-matrices. Notably, Wigner D-matrices are the irreducible representations (irreps) of SO(3), which implies that the action of every element of the SO(3) group on any vector space can be represented as a direct sum of Wigner D-matrices.
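To illustrate the projection in Eq. 1, the sketch below projects a weighted point cloud onto spherical harmonics, with a simple Gaussian radial basis standing in for the Zernike polynomials R_{nℓ}(r) (a hypothetical substitution; the function names are illustrative, not from the paper). A useful property to verify is that the per-(n, ℓ) power Σ_m |Ẑ_{nℓm}|² is invariant under rotations of the input.

```python
import numpy as np
from scipy.special import sph_harm

def zft_coefficients(points, weights, l_max, n_max):
    """Project a weighted point cloud onto spherical harmonics times a
    simple Gaussian radial basis (a stand-in for the Zernike R_{nl})."""
    x, y, z = points.T
    r = np.linalg.norm(points, axis=1)
    theta = np.arccos(np.clip(z / np.maximum(r, 1e-12), -1.0, 1.0))  # polar angle
    phi = np.arctan2(y, x)                                           # azimuth
    coeffs = {}
    for n in range(n_max + 1):
        # Gaussian radial profile centered at n / n_max; depends only on r,
        # so it is unaffected by rotations (like the true Zernike radial part)
        radial = np.exp(-0.5 * ((r - n / max(n_max, 1)) / 0.25) ** 2)
        for l in range(l_max + 1):
            for m in range(-l, l + 1):
                # scipy's sph_harm takes (m, l, azimuth, polar)
                Y = sph_harm(m, l, phi, theta)
                coeffs[(n, l, m)] = np.sum(weights * radial * np.conj(Y))
    return coeffs

def power_spectrum(coeffs, l_max, n_max):
    """Per-(n, l) power: sum_m |Z_{nlm}|^2, a rotation invariant."""
    return {(n, l): sum(abs(coeffs[(n, l, m)]) ** 2 for m in range(-l, l + 1))
            for n in range(n_max + 1) for l in range(l_max + 1)}
```

Because the coefficients of fixed (n, ℓ) transform under the unitary Wigner D-matrix D^ℓ, the per-degree power is exactly rotation invariant, which makes it a quick numerical sanity check for any equivariant Fourier encoding.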
As spherical harmonics form a basis for the irreps of SO(3), the SO(3) group acts on spherical Fourier space via a direct sum of irreps. Specifically, the ZFT encodes a data point into a tensor composed of a direct sum of features, each associated with a degree ℓ indicating the irrep by which it transforms under the action of SO(3). We refer to these tensors as SO(3)-steerable tensors, and to the vector spaces they occupy as SO(3)-steerable vector spaces, or simply steerable for short, since we only deal with the SO(3) group in this work. We note that a tensor may contain multiple features of the same degree ℓ, which we generically refer to as occupying distinct channels c. Throughout the paper, we refer to generic steerable tensors as h and index them by ℓ, m and c. We adopt the "hat" notation for individual entries to remind ourselves of the analogy with Fourier coefficients. See Figure 1A for a graphical illustration of a tensor.
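Concretely, a steerable tensor can be stored as a mapping from degree ℓ to an array of channels, each of length 2ℓ + 1. The container below is a hypothetical minimal sketch of this layout, not the paper's implementation.

```python
import numpy as np

def make_steerable_tensor(channels_per_degree, rng=None):
    """Represent an SO(3)-steerable tensor as a dict mapping degree l to a
    complex array of shape (channels, 2l+1); entry [c, m + l] holds the
    Fourier-like coefficient h_{c,l,m}.  Values are random placeholders."""
    if rng is None:
        rng = np.random.default_rng()
    return {l: rng.normal(size=(c, 2 * l + 1))
               + 1j * rng.normal(size=(c, 2 * l + 1))
            for l, c in channels_per_degree.items()}
```

Under a rotation, each degree-ℓ block is multiplied (along its last axis) by the corresponding Wigner D-matrix D^ℓ, while channels never mix across degrees.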

2.2. MAPPING BETWEEN SO(3)-STEERABLE VECTOR SPACES

Constructing equivariant operations equates to constructing maps between steerable vector spaces. There are precise rules constraining the kinds of operations that guarantee a valid SO(3)-steerable output, the most important one being the Clebsch-Gordan (CG) tensor product ⊗_cg. The CG tensor


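The CG coupling can be illustrated numerically. The sketch below builds a generic CG tensor-product output from exact Clebsch-Gordan coefficients (evaluated with sympy) and checks a classical special case: coupling a degree-1 feature with itself into degree 0 recovers the rotation-invariant squared norm, up to the constant −1/√3, while the degree-1 output vanishes by antisymmetry. This is an illustrative sketch under the standard Condon-Shortley conventions, not the paper's implementation.

```python
import numpy as np
from sympy import S
from sympy.physics.quantum.cg import CG

def cg_coefficient(l1, m1, l2, m2, L, M):
    """Clebsch-Gordan coefficient <l1 m1; l2 m2 | L M>, exact via sympy."""
    return float(CG(S(l1), S(m1), S(l2), S(m2), S(L), S(M)).doit())

def cg_tensor_product(u, v, l1, l2, L):
    """Couple a degree-l1 feature u (length 2*l1+1, indexed m = -l1..l1)
    with a degree-l2 feature v into a degree-L steerable output."""
    out = np.zeros(2 * L + 1, dtype=complex)
    for m1 in range(-l1, l1 + 1):
        for m2 in range(-l2, l2 + 1):
            M = m1 + m2
            if -L <= M <= L:
                out[M + L] += (cg_coefficient(l1, m1, l2, m2, L, M)
                               * u[m1 + l1] * v[m2 + l2])
    return out

def vector_to_spherical(v):
    """Real Cartesian vector -> complex spherical components (m = -1, 0, +1)."""
    x, y, z = v
    return np.array([(x - 1j * y) / np.sqrt(2), z, -(x + 1j * y) / np.sqrt(2)])
```

Coupling two degree-1 inputs yields outputs of degrees 0, 1, and 2, corresponding to the dot product, cross product, and symmetric traceless part of the outer product; the degree-0 channel is what an invariant latent space can retain.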