SE(3)-EQUIVARIANT ATTENTION NETWORKS FOR SHAPE RECONSTRUCTION IN FUNCTION SPACE

Abstract

We propose a method for 3D shape reconstruction from unoriented point clouds. Our method is built around a novel SE(3)-equivariant coordinate-based network (TF-ONet) that parametrizes the occupancy field of the shape and respects the inherent symmetries of the problem. In contrast to previous shape reconstruction methods that align the input to a regular grid, we operate directly on the irregular point cloud. Our architecture leverages equivariant attention layers that operate on local tokens. This mechanism enables local shape modeling, a property crucial for scalability to large scenes. Given an unoriented, sparse, noisy point cloud as input, we produce equivariant features for each point. These serve as keys and values for the subsequent equivariant cross-attention blocks that parametrize the occupancy field. By querying an arbitrary point in space, we predict its occupancy score. We show that our method outperforms previous SO(3)-equivariant methods, as well as non-equivariant methods trained on SO(3)-augmented datasets. More importantly, local modeling together with SE(3)-equivariance creates an ideal setting for SE(3) scene reconstruction. We show that, by training only on single, aligned objects and without any pre-segmentation, we can reconstruct novel scenes containing arbitrarily many objects in random poses without any performance loss.

1. INTRODUCTION

With the advent of range sensors in robotics and in medical applications, research in shape reconstruction from point clouds has seen increasing activity (Berger et al., 2017). The performance of classical optimization methods tends to degrade when point clouds become sparser, noisier, unoriented, or untextured. Deep learning methods have proven useful for encoding shape priors and for solving the reconstruction problem end to end (Riegler et al., 2017). Many of these deep learning methods operate on meshes (Wang & Zhang, 2022; Gong et al., 2019), voxels (Riegler et al., 2017), and point clouds (Qi et al., 2016). While voxels are easy to manipulate, shape resolution is limited by memory. Meshes can guarantee watertight reconstructions, but they only handle a predefined topology. Point clouds are lightweight in terms of memory, but they discard topology. Recently proposed deep learning methods represent the geometry via a learned occupancy map or a signed distance function (SDF). In particular, the seminal works of Mescheder et al. (2019) and Park et al. (2019) inspired many follow-up works (Chen & Zhang, 2019b; Genova et al., 2020; Sitzmann et al., 2019). Such representations can encode arbitrary topologies at an effectively infinite resolution.

According to Kendall, "Shape is the geometry of an object modulo position, orientation, and scale" (Kendall, 1989). While intensive research in the field (Niemeyer & Geiger, 2021; Peng et al., 2020; Niemeyer et al., 2020) has led to increasingly better results, very few of these methods incorporate symmetries as an inductive bias for learning. Most translation-equivariant reconstruction methods build on the convolutional occupancy network (Peng et al., 2020), while most SO(3)-equivariant architectures (Zhu et al., 2021), and their extension to SE(3) with GraphOnet (Chen et al., 2022), use the equivariant modules from Vector Neurons (Deng et al., 2021).
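The symmetry in question can be stated precisely. Writing $o_\theta(\,\cdot \mid P) : \mathbb{R}^3 \to [0, 1]$ for the occupancy field predicted from an input point cloud $P$ (notation ours, introduced here for illustration), an SE(3)-equivariant reconstruction method satisfies, for every rotation $R \in SO(3)$ and translation $t \in \mathbb{R}^3$,

$$
o_\theta(Rx + t \mid RP + t) = o_\theta(x \mid P) \qquad \text{for all } x \in \mathbb{R}^3,
$$

i.e., rigidly transforming the input rigidly transforms the reconstructed surface in the same way, without retraining or data augmentation.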
We propose TF-ONet, a novel SE(3)-equivariant coordinate-based network for shape reconstruction. Motivated by the SE(3)-Transformer (Fuchs et al., 2020), we design a two-level network that uses equivariant attention modules. The first level, acting as an encoder, extracts local features from the point cloud by applying self-attention in local neighborhoods around each point. The second level, a cross-attention occupancy network, takes as input the extracted point features and the coordinates of a query point in space, and outputs the value of the occupancy function at that query point. Even unique objects consist of smaller primitive parts, and subsets of these parts recombine across large collections of objects; this compositional property extends naturally to scenes, which are themselves compositions of objects. Our method performs local shape modeling by leveraging the expressivity of equivariant local attention modules, and it generalizes to novel scenes with novel configurations of objects from classes unseen during training. This property distinguishes our method from similar equivariant works that either use global features (Deng et al., 2021) or per-point features that encode long-range dependencies by using subsampling to expand their receptive field (Chen et al., 2022). Additionally, as we describe in Section 3.3, the Tensor Field framework allows our method to utilize higher-order representations, in contrast to previous works that build on Vector Neurons and are thus constrained to type-0 (scalar) and type-1 (vector) representations. In Section 4, we provide experimental evidence showing how these differences benefit our method in the reconstruction of single objects in arbitrary poses and in the reconstruction of novel scenes.
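The two-level structure can be illustrated with a minimal sketch. The code below is not the paper's architecture: it uses plain dot-product attention over k-nearest-neighbor tokens with relative coordinates, which by construction yields only translation invariance; full SE(3) equivariance additionally requires the tensor-field attention machinery of Fuchs et al. (2020). All function and parameter names here are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def encoder(points, feats, k, Wq, Wk, Wv):
    """Level 1: self-attention in a k-NN neighborhood around each point.
    Coordinates enter only as differences, so features are translation invariant."""
    d2 = ((points[:, None] - points[None]) ** 2).sum(-1)
    nbr = np.argsort(d2, axis=1)[:, :k]                  # (N, k) neighbor indices (self included)
    rel = points[nbr] - points[:, None]                  # (N, k, 3) relative coordinates
    q = feats @ Wq                                       # (N, D) queries
    kv_in = np.concatenate([feats[nbr], rel], axis=-1)   # (N, k, C+3)
    keys, vals = kv_in @ Wk, kv_in @ Wv                  # (N, k, D)
    att = softmax((q[:, None] * keys).sum(-1) / np.sqrt(q.shape[-1]))
    return (att[..., None] * vals).sum(1)                # (N, D) per-point features

def occupancy(query, points, feats, q0, Wk, Wv, w_out):
    """Level 2: cross-attention from a query location to the point tokens.
    The query attends via a fixed token q0; keys/values mix point features
    with coordinates relative to the query."""
    rel = points - query                                 # (N, 3)
    kv_in = np.concatenate([feats, rel], axis=-1)        # (N, D+3)
    keys, vals = kv_in @ Wk, kv_in @ Wv                  # (N, D)
    att = softmax(keys @ q0 / np.sqrt(q0.size), axis=0)  # (N,) attention over points
    return 1.0 / (1.0 + np.exp(-(att @ vals) @ w_out))   # occupancy score in (0, 1)
```

Because every coordinate appears only through differences, translating the point cloud and the query by the same vector leaves the predicted occupancy unchanged; rotation equivariance of the features is what the tensor-field layers add on top of this attention structure.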

Our contributions can be summarized as follows:

• We propose TF-ONet, a novel SE(3)-equivariant, coordinate-based attention network for learning occupancy fields from sparse point clouds, and use it for surface reconstruction.

• Experimentally, we outperform other equivariant coordinate-based networks (Vector Neurons, GraphOnet) as well as non-equivariant networks (Occupancy Networks, Convolutional Occupancy Networks, IFNet, NeuralPull) trained with augmentations.

• The most compelling property of our method is that equivariance and local shape modeling allow our network to produce high-quality reconstructions of novel scenes while being trained only on single, aligned objects. These scenes contain an arbitrary number of objects in random poses. We show a quantitative (5a) and qualitative (6a) performance gap over previous methods on a synthetic dataset of randomly placed objects (the Seismic dataset). We also show qualitative results on the more challenging Matterport3D (Chang et al., 2017), which contains real scenes with unseen object classes.

2. RELATED WORK

In this section, we discuss previous work on surface reconstruction from input point clouds. We focus on methods that reconstruct the surface of an object by using either an occupancy function or an SDF. For oriented point clouds (with known normal vectors), the occupancy function or the SDF can be constructed by classical methods that do not require learning (Alexa et al., 2003; Kazhdan & Hoppe, 2013). These methods tend to fail in the presence of noise, or when the input point cloud is sparse. To overcome such limitations, Mescheder et al. (2019) and Chen & Zhang (2019a) proposed to learn the occupancy function for each input point cloud. Similarly, Park et al. (2019) proposed to learn to infer the SDF of the object's surface from a sparse set of SDF values. A limitation of the above methods is that they use a global feature vector, or code, to represent the whole object (or scene), which limits their ability to generalize to novel scenes or objects. More recent methods



Figure 1: (Above): A scene-level point cloud produced by individual SE(3)-transformations of three sparse object point clouds. (Below): Our equivariant reconstruction. The whole scene is given as input to our network. The network, trained only on single objects in canonical poses and agnostic to the number, position, and orientation of the objects, is able to reconstruct the scene accurately.

