SE(3)-EQUIVARIANT ATTENTION NETWORKS FOR SHAPE RECONSTRUCTION IN FUNCTION SPACE

Abstract

We propose a method for 3D shape reconstruction from unoriented point clouds. Our method consists of a novel SE(3)-equivariant coordinate-based network (TF-ONet) that parametrizes the occupancy field of the shape and respects the inherent symmetries of the problem. In contrast to previous shape reconstruction methods that align the input to a regular grid, we operate directly on the irregular point cloud. Our architecture leverages equivariant attention layers that operate on local tokens. This mechanism enables local shape modelling, a property crucial for scalability to large scenes. Given an unoriented, sparse, noisy point cloud as input, we produce equivariant features for each point. These serve as keys and values for the subsequent equivariant cross-attention blocks that parametrize the occupancy field: by querying an arbitrary point in space, we predict its occupancy score. We show that our method outperforms previous SO(3)-equivariant methods, as well as non-equivariant methods trained on SO(3)-augmented datasets. More importantly, local modelling together with SE(3)-equivariance creates an ideal setting for SE(3) scene reconstruction. We show that, trained only on single, aligned objects and without any pre-segmentation, our network can reconstruct novel scenes containing arbitrarily many objects in random poses without any performance loss.
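To make the decoder interface concrete, the following is a deliberately simplified sketch of a cross-attention occupancy decoder: per-point encoder features act as keys and values, an arbitrary query coordinate is projected to a query vector, and the attended context is mapped to an occupancy score. This toy version is *not* equivariant, and all weight names and dimensions are hypothetical, chosen only to illustrate the query-key-value flow described above, not the actual TF-ONet blocks.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_occupancy(query_xyz, point_feats, W_q, W_k, W_v, w_out):
    """Toy (non-equivariant) cross-attention occupancy decoder.

    query_xyz:   (M, 3) query coordinates in space
    point_feats: (N, d) per-point features from the encoder
    W_q:         (3, d) query projection; W_k, W_v: (d, d); w_out: (d,)
    Returns (M,) occupancy scores in (0, 1).
    """
    Q = query_xyz @ W_q                                       # (M, d) queries
    K = point_feats @ W_k                                     # (N, d) keys
    V = point_feats @ W_v                                     # (N, d) values
    attn = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)   # (M, N) weights
    ctx = attn @ V                                            # (M, d) context
    logits = ctx @ w_out                                      # (M,) scalar logits
    return 1.0 / (1.0 + np.exp(-logits))                      # sigmoid -> occupancy
```

In the actual method the attention layers are equivariant and operate on local tokens; this sketch only fixes the interface: arbitrary query points in, per-query occupancy scores out.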

1. INTRODUCTION

With the advent of range sensors in robotics and medical applications, research in shape reconstruction from point clouds has seen increasing activity (Berger et al., 2017). The performance of classical optimization methods tends to degrade when point clouds become sparser, noisier, unoriented, or untextured. Deep learning methods have proven useful in encoding shape priors and solving the reconstruction problem end to end (Riegler et al., 2017). Many of these deep learning methods operate on meshes (Wang & Zhang, 2022; Gong et al., 2019), voxels (Riegler et al., 2017), and point clouds (Qi et al., 2016). While voxels are easy to manipulate, shape resolution is limited by memory. On the other hand, meshes can guarantee watertight reconstructions, but

* Equal Contribution



Figure 1: (Above): A scene-level point cloud produced by individual SE(3)-transformations of three sparse object point clouds. (Below): Our equivariant reconstruction. The whole scene is given as input to our network. The network, trained only on single objects in canonical poses and agnostic to the number, position, and orientation of the objects, is able to reconstruct the scene accurately.

