SE(3)-EQUIVARIANT ATTENTION NETWORKS FOR SHAPE RECONSTRUCTION IN FUNCTION SPACE

Abstract

We propose a method for 3D shape reconstruction from unoriented point clouds. Our method consists of a novel SE(3)-equivariant coordinate-based network (TF-ONet) that parametrizes the occupancy field of the shape and respects the inherent symmetries of the problem. In contrast to previous shape reconstruction methods that align the input to a regular grid, we operate directly on the irregular point cloud. Our architecture leverages equivariant attention layers that operate on local tokens. This mechanism enables local shape modelling, a crucial property for scalability to large scenes. Given an unoriented, sparse, noisy point cloud as input, we produce equivariant features for each point. These serve as keys and values for the subsequent equivariant cross-attention blocks that parametrize the occupancy field. By querying an arbitrary point in space, we predict its occupancy score. We show that our method outperforms previous SO(3)-equivariant methods, as well as non-equivariant methods trained on SO(3)-augmented datasets. More importantly, local modelling together with SE(3)-equivariance creates an ideal setting for SE(3) scene reconstruction. We show that by training only on single, aligned objects and without any pre-segmentation, we can reconstruct novel scenes containing arbitrarily many objects in random poses without any performance loss.

1. INTRODUCTION

With the advent of range sensors in robotics and in medical applications, research in shape reconstruction from point clouds has seen increasing activity (Berger et al., 2017). The performance of classical optimization methods tends to degrade when point clouds become sparser, noisier, unoriented, or untextured. Deep learning methods have proven useful in encoding shape priors and solving the reconstruction problem end to end (Riegler et al., 2017). Many of these deep learning methods operate on meshes (Wang & Zhang, 2022; Gong et al., 2019), voxels (Riegler et al., 2017), and point clouds (Qi et al., 2016). While voxels are easy to manipulate, shape resolution is limited by memory. On the other hand, meshes can guarantee watertight reconstructions, but they only handle a predefined topology. Point clouds are lightweight in terms of memory, but they discard topology. Recently proposed deep learning methods represent the geometry via a learned occupancy map or a signed distance function (SDF). In particular, the seminal works of Mescheder et al. (2019) and Park et al. (2019) inspired many follow-up works (Chen & Zhang, 2019b; Genova et al., 2020; Sitzmann et al., 2019). Such representations can encode arbitrary topologies with an effectively infinite resolution. According to Kendall, "Shape is the geometry of an object modulo position, orientation, and scale" (Kendall, 1989). While intensive research in the field (Niemeyer & Geiger, 2021; Peng et al., 2020; Niemeyer et al., 2020) has led to increasingly better results, very few of these methods incorporate symmetries as an inductive bias for learning. Most translation-equivariant reconstruction methods build on the convolutional occupancy network (Peng et al., 2020), while most SO(3)-equivariant architectures (Zhu et al., 2021), and their extensions to SE(3) with GraphOnet (Chen et al., 2022), use the equivariant modules from Vector Neurons (Deng et al., 2021).
We propose TF-Onet, a novel SE(3)-equivariant coordinate-based network for shape reconstruction. Motivated by the SE(3)-Transformer (Fuchs et al., 2020), we design a two-level network that uses equivariant attention modules. The first level, acting as an encoder, extracts local features from the point cloud by applying self-attention in local neighborhoods around each point. The second level, a cross-attention occupancy network, takes as input the extracted point features and the coordinates of a query point in space, and outputs the value of the occupancy function at the specified query point. Even unique objects consist of smaller primitive parts, whose subsets are subsequently composed to form large collections of objects. This property extends naturally to scenes that are created by a composition of objects. Our method performs local shape modeling by leveraging the expressivity of equivariant local attention modules and generalizes to novel scenes with novel configurations of objects from classes unseen during training. This property distinguishes our method from similar equivariant works that either use global features (Deng et al., 2021) or per-point features that encode long-range dependencies by using subsampling to expand their receptive field (Chen et al., 2022). Additionally, as we describe in Section 3.3, the use of the Tensor Field framework allows our method to utilize higher-order representations, in contrast to previous works that use Vector Neurons and are thus constrained to type-0 (scalar) and type-1 (vector) representations. In Section 4, we provide experimental evidence showing how these differences benefit our method in the reconstruction of single objects in arbitrary poses and in the reconstruction of novel scenes.
Our contributions can be summarized as follows:
• We propose TF-Onet, a novel SE(3)-equivariant, coordinate-based attention network for learning occupancy fields from sparse point clouds, and use it for surface reconstruction.
• Experimentally, we outperform other equivariant coordinate-based networks (Vector Neurons, GraphOnet) and non-equivariant networks (Occupancy Networks, Convolutional Occupancy Networks, IFNet, NeuralPull) trained with augmentations.
• The most compelling property of our method is that equivariance and local shape modeling allow our network to produce high-quality reconstructions of novel scenes while being trained only on single aligned objects. These scenes contain an arbitrary number of objects in random poses. We show a quantitative (Figure 5a) and qualitative (Figure 6a) performance gap over previous methods on a synthetic dataset of randomly placed objects (Seismic dataset). We also show qualitative results on the more challenging Matterport3D dataset (Chang et al., 2017), which contains real scenes with unseen object classes.

2. RELATED WORK

In this section, we discuss previous work on surface reconstruction from input point clouds. We focus on methods that reconstruct the surface of an object by using either an occupancy function or an SDF. For oriented point clouds (with known normal vectors), the occupancy function or the SDF can be constructed by classical methods that do not require learning (Alexa et al., 2003; Kazhdan & Hoppe, 2013). These methods tend to fail in the presence of noise, or when the input point cloud is sparse. To surpass such limitations, Mescheder et al. (2019) and Chen & Zhang (2019a) proposed to learn the occupancy function for each input point cloud. Similarly, Park et al. (2019) proposed to learn to infer the SDF of the object's surface from a sparse set of SDF values. A limitation of the above methods is that they use a global feature vector (or code) to represent the whole object (or scene), which limits their ability to generalize to novel scenes or objects.

Figure 2: Our method consists of two networks, $E$ (self-attention) and $T$ (cross-attention). First, to each point $\vec{x}_i$ in the point cloud, we assign a simple local feature $f_i$ of type-1 (rotating as a vector), which can be the relative position of the point $\vec{x}_i$ to the centroid of its local neighborhood. Similarly, to each query point $\vec{q} \in \mathbb{R}^3$ we assign a local feature $f_q$. Given a point cloud consisting of pairs $(\vec{x}_i, f_i)$, the network $E$ applies SE(3)-equivariant attention to assign to each point $\vec{x}_i$ a learned feature $f_i^E$. Finally, given the learned feature-augmented point cloud and the query-feature pair $(\vec{q}, f_q)$, the network $T$ applies SE(3)-equivariant cross-attention to output the scalar value of the occupancy function at point $\vec{q}$.

More recent methods leverage the similarities between local patches of different objects by learning either a combination of local and global features (Genova et al., 2020; Erler et al., 2020), or only local ones (Jiang et al., 2020; Chabra et al., 2020; Tretschk et al., 2020; Boulch & Marlet, 2022; Williams et al., 2022; Chen et al., 2022). The composition of these local features allows the reconstruction of scenes containing a variety of different objects. To define the local neighborhoods for which the local features are extracted, Jiang et al. (2020), Chabra et al. (2020), and Tretschk et al. (2020) voxelize the space into a regular grid and learn a feature representation for each voxel, while Boulch & Marlet (2022) and Chen et al. (2022) learn a local feature representation for each input point. In our work, we follow the latter approach by using local attention layers (Bahdanau et al., 2015) that dynamically change the attention weights of each point to its neighbors. Attention layers were popularized with the introduction of the transformer architecture (Vaswani et al., 2017), and were later applied in computer vision tasks such as image classification (Dosovitskiy et al., 2021; Wu et al., 2021).
Specifically for point cloud processing tasks, Pan et al. (2021)

There is a large body of work on incorporating known symmetries into the learning process, tracing back to Fukushima (1980) and LeCun et al. (1989) with the use of convolutional layers to build translation-equivariant networks. More recent works extend this idea to discrete (Cohen & Welling, 2016) and continuous (Weiler et al., 2018b) rotationally invariant architectures. These methods have also been applied beyond Euclidean domains, to spheres (Esteves et al., 2018), graphs (Maron et al., 2019; Brandstetter et al., 2022), meshes (de Haan et al., 2021), and general manifolds (Cohen et al., 2019; Weiler et al., 2021). For transformer architectures, Fuchs et al. (2020) propose an SE(3)-equivariant transformer by using the results of Tensor Field Networks (Thomas et al., 2018). Romero & Cordonnier (2021) proposed a framework for constructing linear self-attention layers that are equivariant to arbitrary discrete groups. Operating on point clouds, an SE(3)-equivariant network (Chen et al., 2021) performs pose estimation and classification using SE(3) convolutions. For the problem of surface reconstruction from point clouds, Convolutional Occupancy Networks (Peng et al., 2020) and later methods (Lionar et al., 2021; Tang et al., 2021; Boulch & Marlet, 2022) use convolutional layers to achieve translation-equivariance. Finally, Vector Neurons (Deng et al., 2021) propose an architecture for SO(3)-equivariant reconstruction from point clouds, extended by GraphOnet (Chen et al., 2022) to the SE(3)-equivariant case.

3. METHOD

3.1. LEARNING THE OCCUPANCY FIELD

Consider a 3D point cloud $P = \{(\vec{x}_i, f_i)\}_{i=1}^{N}$, where $\vec{x}_i \in \mathbb{R}^3$ denotes the spatial location of a point and $f_i \in \mathbb{R}^M$ is an (optional) associated feature. If the features include the normal vectors, the point cloud is called oriented. Otherwise it is called unoriented. We denote the matrix of coordinates by $X = [\vec{x}_1, \cdots, \vec{x}_N] \in \mathbb{R}^{3 \times N}$ and the matrix of features by $F = [f_1, \cdots, f_N] \in \mathbb{R}^{M \times N}$. Given a point cloud, our goal is to infer the shape from which it was sampled. We represent this shape with an occupancy function $o^* : \mathbb{R}^3 \to \{0, 1\}$ whose 1-level set, $\{\vec{x} \in \mathbb{R}^3 \mid o^*(\vec{x}) = 1\}$, encodes the volume occupied by the object and whose boundary encodes the surface of the object. We approximate this function by learning an operator $T$ that acts on an input point cloud and produces a function $\hat{o}_P : \mathbb{R}^3 \to [0, 1]$ that describes the occupancy score $\hat{o}_P(\vec{q})$ of each position $\vec{q} \in \mathbb{R}^3$. Following Mescheder et al. (2019), we parametrize $T$ by a coordinate-based neural network. In other words, given a point cloud $P = (X, F)$ and an arbitrary point $\vec{q}$ in space, we predict $T[X, F](\vec{q}) = \hat{o}_P(\vec{q}) \in [0, 1]$. In contrast to Mescheder et al. (2019), our architecture works directly on the irregular point grid, extracting a local signature to describe the point cloud, and constraining the model to achieve SE(3)-equivariance, thus respecting the symmetries of the problem. It is more common, especially in robotics applications, to receive featureless point clouds (for example, from Lidar sensors). Our paper focuses on sparse, noisy, and featureless (in particular, unoriented) point clouds. Since the performance of $T$ depends on the expressivity of the input features $F$, we first design a feature extractor (or encoder) $E$ that produces a point cloud along with features, $E[X] = (X, F)$. Overall, $\hat{o}_P(\vec{q}) = [T(E(X; \phi))](\vec{q}; \theta)$. Our design of $T$, $E$ is founded on two basic principles: local shape modelling and equivariance.
Objects can be seen as a composition of local parts and local surfaces. At this level, even dissimilar objects may share structure. Local shape modelling leverages this compositionality to learn from multiple shapes in a more effective way. This is particularly helpful for reconstructing novel object classes. On the other hand, SE(3)-equivariance restricts models to produce consistent occupancy maps irrespective of the rigid transformations of the shape. This can further reduce sample complexity. Together, these two properties create an ideal setting for scene reconstruction. Scenes are composed of objects which appear in different orientations and positions. Without leveraging compositionality and equivariance, every new configuration of objects would be seen as a novel scene by the network, and thus learning would be hindered by the combinatorial explosion of these combinations. We discuss how we impose the above properties on T , E in the next sections. We note two more important properties of point cloud processing. First, there is no canonical ordering on the points, so a simultaneous permutation of the columns of X, F describes the same point cloud (permutation equivariance) and second, the number N can vary between point clouds. Thus the input has a set structure, irregularly embedded in a Euclidean domain. Two main architectures that deal with such inputs are Point Cloud Convolutions (Wu et al., 2019) and Attention modules (Yu et al., 2021) . We opt for the latter in our design because point cloud convolutions, when constrained to be SE(3)-equivariant, result in very few free parameters (Thomas et al., 2018) and in particular they eliminate the angular degrees of freedom from learning. Thus the performance is highly dependent on ad-hoc non-linearities (Poulenard & Guibas, 2021) . 
Attention modules on the other hand are inherently non-linear and the attention kernel can be viewed as a modulation to the angular profile of the weights, thus having more degrees of freedom (Fuchs et al., 2020) .

3.2. USING ATTENTION FOR LOCAL SHAPE MODELLING

Preliminaries on Attention: We propose to construct our networks $E$, $T$ as a composition of equivariant attention modules. Given $N_{out}$ query tokens $Q_i \in \mathbb{R}^{d_Q}$ and $N_{in}$ key-value pairs $K \in \mathbb{R}^{d_Q \times N_{in}}$, $V \in \mathbb{R}^{d_V \times N_{in}}$, an attention module can be described by the formula:

$$A(Q_i, K, V) = V\,\mathrm{softmax}(K^T Q_i) = \sum_{j=1}^{N_{in}} \underbrace{\frac{\exp(Q_i^T K_j)}{\sum_{j'=1}^{N_{in}} \exp(Q_i^T K_{j'})}}_{\alpha(Q_i, K_j)} V_j, \quad i \in [N_{out}],$$

where subscripts index columns. For each query token $Q_i$, the output $A$ is a linear combination of the values $V_j$, modulated by the similarity $\alpha(Q_i, K_j)$ of the query $Q_i$ to the corresponding key $K_j$, as imposed by the attention kernel $\alpha$. In self-attention layers the queries, keys and values are functions of the same set of features, while in cross-attention the query features are distinct and only the keys and values are functions of the same features. We design the feature extractor $E$ as a composition of $L$ self-attention layers, and the occupancy operator $T$ as a composition of $L'$ cross-attention layers.

Figure 3: ... $g$ describes the action on both the query field $(\vec{q}, S'(X, q))$ depicted above and the point cloud field $(X, F^E)$ (described in the main text); $T$, $E$ are equivariant, which is equivalent to the diagram being commutative, i.e., $L''_g \circ T = T \circ L'_g$ and $E \circ L_g = L'_g \circ E$.

Self-attention feature extractor $E$: We associate each point with a query token with the same functional dependency $Q = Q(\vec{x}_i, f) = W_Q f$. Moreover, we design the self-attention module so that not all queries have access to all the keys; instead, a query attends only to keys determined by its position. In particular, if $\vec{x}_j \in \mathcal{N}(\vec{x}_i)$, with $\mathcal{N}(\vec{x}_i)$ the $k$ nearest neighbors of $\vec{x}_i$, we assign to the pair $(\vec{x}_i, \vec{x}_j)$ a key-value token. The functional form of keys and values is $V(\vec{x}_j, \vec{x}_i, f) = W_V(\vec{x}_j - \vec{x}_i) f$, $K(\vec{x}_j, \vec{x}_i, f) = W_K(\vec{x}_j - \vec{x}_i) f$. The standard way to incorporate relative positional encoding is via concatenation or addition (Vaswani et al., 2017).
Our functional form $W_K(\vec{x}_j - \vec{x}_i) f$ can be understood as a more general way to encode the positions of the tokens. This generalization will be important in our case when we impose extra constraints to satisfy rotational equivariance. The self-attention module of the encoder $E$ at layer $l \geq 1$ gets a feature-augmented point cloud $(X, F)$ and produces a new feature-augmented point cloud $(X, F')$, where $F'$ is computed as:

$$E^l[X, F]_i = \sum_{\vec{x}_j \in \mathcal{N}(\vec{x}_i)} \frac{\exp\left[(W_Q^l f_i)^T \left(W_K^l(\vec{x}_j - \vec{x}_i) f_j\right)\right]}{\sum_{\vec{x}_{j'} \in \mathcal{N}(\vec{x}_i)} \exp\left[(W_Q^l f_i)^T \left(W_K^l(\vec{x}_{j'} - \vec{x}_i) f_{j'}\right)\right]} \, W_V^l(\vec{x}_j - \vec{x}_i) f_j. \tag{1}$$

Since the input point cloud $X$ includes only the locations of the points without additional features, we design a function $\tilde{E}^0$ that associates hand-designed (not learned) features to the points. In this work, $\tilde{E}^0[X] = S(X)$, where

$$S(X)_i = \vec{x}_i - \frac{1}{|\mathcal{N}(\vec{x}_i)|} \sum_{\vec{x}_j \in \mathcal{N}(\vec{x}_i)} \vec{x}_j$$

is the relative position from the neighborhood's centroid. Then, $E = E^L \circ \cdots \circ E^1 \circ \tilde{E}^0$.

Cross-attention occupancy network $T$: We design the occupancy operator $T$ as a composition of $L'$ cross-attention layers. The input query to $T$ can be any point $\vec{q} \in \mathbb{R}^3$, to which we assign a token $Q(\vec{q}, f_q) = W'_Q f_q$. The output of $T$ is the occupancy value of that query location. The keys and values in $T$ are constructed from the output of the feature extractor, $E[X] = (X, F^E)$ with $F^E := [f_1^E, \ldots, f_N^E]$. In particular, if $\vec{x}_j \in \mathcal{N}_X(\vec{q})$, we create a key-value pair for $(\vec{q}, \vec{x}_j)$ with the form $K(\vec{x}_j, \vec{q}, f_j^E) = W'_K(\vec{x}_j - \vec{q}) f_j^E$ and $V(\vec{x}_j, \vec{q}, f_j^E) = W'_V(\vec{x}_j - \vec{q}) f_j^E$. Given a feature-augmented point $q^l := (\vec{q}, f_q^l)$, the cross-attention module $T^l$ outputs a new feature-augmented point $q^{l+1} = (\vec{q}, f_q^{l+1})$.
These new features are computed as:

$$T^l[X, F^E](q^l) = \sum_{\vec{x}_j \in \mathcal{N}_X(\vec{q})} \frac{\exp\left[(W_Q^{\prime\, l} f_q^l)^T \left(W_K^{\prime\, l}(\vec{x}_j - \vec{q}) f_j^E\right)\right]}{\sum_{\vec{x}_{j'} \in \mathcal{N}_X(\vec{q})} \exp\left[(W_Q^{\prime\, l} f_q^l)^T \left(W_K^{\prime\, l}(\vec{x}_{j'} - \vec{q}) f_{j'}^E\right)\right]} \, W_V^{\prime\, l}(\vec{x}_j - \vec{q}) f_j^E. \tag{2}$$

Since the input query includes only its location $\vec{q}$ and no additional features, we first associate it with a fixed (not learned) feature

$$f_q^1 := S'(X, q) = \vec{q} - \frac{1}{|\mathcal{N}_X(\vec{q})|} \sum_{\vec{x}_j \in \mathcal{N}_X(\vec{q})} \vec{x}_j, \qquad \mathcal{N}_X(\vec{q}) = \mathcal{N}\left(\arg\min_{\vec{x}_i \in X} \|\vec{q} - \vec{x}_i\|_2\right).$$

Then, after passing the feature-augmented point through the remaining layers of $T$, we get the output $(\vec{q}, f_q^{L'+1})$, which corresponds to the occupancy value at point $\vec{q}$. Thus the self-attention and cross-attention modules process the features on the point cloud and the 3D field respectively, without changing the topology and by performing local attention.
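The hand-designed input features above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the implementation used in the paper: the function names are hypothetical, a point is assumed not to belong to its own neighborhood, and neighborhoods are found by brute force. Points are stored as columns of a $3 \times N$ matrix, following the notation of Section 3.1. A plain softmax attention over columns is included for reference; it does not include the position-dependent weights of Eqs. 1 and 2.

```python
import numpy as np

def rotation(axis, angle):
    """Rotation matrix about `axis` by `angle` (Rodrigues' formula)."""
    a = axis / np.linalg.norm(axis)
    K = np.array([[0.0, -a[2], a[1]], [a[2], 0.0, -a[0]], [-a[1], a[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def knn_indices(X, k):
    """Indices of the k nearest neighbors of every column of X (3 x N)."""
    d2 = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)   # (N, N) squared distances
    np.fill_diagonal(d2, np.inf)   # assumption: a point is not its own neighbor
    return np.argsort(d2, axis=1)[:, :k]                      # (N, k)

def centroid_offset_features(X, k):
    """S(X)_i = x_i minus the centroid of its neighborhood (a type-1 feature)."""
    nbrs = knn_indices(X, k)
    return X - X[:, nbrs].mean(axis=2)

def query_features(X, q, k):
    """S'(X, q): offset of q from the neighborhood of its nearest input point."""
    nearest = np.argmin(((X - q[:, None]) ** 2).sum(axis=0))
    nbrs = knn_indices(X, k)[nearest]
    return q - X[:, nbrs].mean(axis=1)

def attention(q, K, V):
    """A(q, K, V) = V softmax(K^T q); columns of K and V index the tokens."""
    logits = K.T @ q
    w = np.exp(logits - logits.max())   # subtract the max for numerical stability
    return V @ (w / w.sum())

# Rigidly moving the input rotates the features and leaves them
# unchanged under translation.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 40))
R = rotation(np.array([1.0, -2.0, 0.5]), 1.1)
t = np.array([[0.3], [-1.0], [2.0]])
F = centroid_offset_features(X, k=6)
F_moved = centroid_offset_features(R @ X + t, k=6)
print(np.allclose(F_moved, R @ F))
```

Because nearest-neighbor sets depend only on pairwise distances, they are preserved by rigid motions (up to ties), which is why the centroid offset rotates like a vector and ignores translations.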

3.3. EQUIVARIANT ATTENTION FOR SHAPE RECONSTRUCTION

Shape reconstruction should be independent of the coordinate system used, including the position of the origin and the orientation of the coordinate axes. One way to capture this geometric consistency is via SE(3)-equivariance, which can be formulated in the language of group theory. In our setup, neither $E$ nor $T$ satisfies this property without additional constraints on $(W_Q, W_K, W_V)$ and $(W'_Q, W'_K, W'_V)$. In this section we formulate and solve these constraints. We will assume familiarity with equivariance, and defer a more extensive discussion to the Appendix.

Equivariance constraints: In our formulation, the equivariance constraint is that the occupancy field should be a type-0 (or scalar) field, i.e., a simultaneous roto-translation of the point cloud and the query should result in an invariant prediction. Formally, for all $(T, R) \in SE(3)$, $\vec{q} \in \mathbb{R}^3$,

$$T[E(RX + \oplus_N T)](R\vec{q} + T) = T[E(X)](\vec{q}). \tag{3}$$

This constraint has to be satisfied from input to output, not at every layer. For intermediate layers we can relax the constraint and output more expressive vector or tensor fields, with pre-specified transformation properties, so that the whole composition of layers results in a scalar field.

Per-layer equivariance constraints: We can think of a feature-augmented point cloud $(X, F)$ as a feature field in $\mathbb{R}^3$ with finite support. Then, both $E$, $T$ process fields in $\mathbb{R}^3$ at each layer. We need to specify how these fields transform under a roto-translation of $X$. Different transformation laws in the layers correspond to different constraints on their weights. We design the layers to transform as

$$E^l(RX + \oplus_N T, \rho_E^l(R) F^l) = \rho_E^{l+1}(R) F^{l+1} \tag{4}$$

$$T^l[E(RX + \oplus_N T)](R\vec{q} + T, \rho_T^l(R) f_q^l) = \rho_T^{l+1}(R) f_q^{l+1} \tag{5}$$

for all $(T, R) \in SE(3)$, where $\rho_E^l(R)$, $\rho_T^l(R)$ are SO(3)-representations, i.e., square invertible matrices satisfying $\rho(R_1 R_2) = \rho(R_1)\rho(R_2)$ for all $R_1, R_2 \in SO(3)$.
The layers $\tilde{E}^0$, $T^0$ have also been designed to admit this transformation law, since

$$\tilde{E}^0(RX + \oplus_N T) = R\,\tilde{E}^0(X), \qquad T^0[E(RX + \oplus_N T)](R\vec{q} + T) = R\,T^0[E(X)](\vec{q}),$$

i.e., $\rho_E^1(R) = \rho_T^1(R) = R$ (proof in Appendix A.10). These per-layer constraints in Eqs. 4, 5 are sufficient to solve Eq. 3, provided that $\rho_T^{L'+1}(R) = I$. These transformation laws have the form of the induced representation of SE(3) via SO(3) (Fulton & Harris, 2013) (Appendix A.8.4).

Per-layer constraints on the weights: Observe that the transformation laws above describe fields whose features stay invariant under a translation of their domains (i.e., the point cloud and $\mathbb{R}^3$), but when this domain rotates they are multiplied by a square matrix $\rho(R)$. Such fields are called $\rho$-fields. It is clear from Eqs. 1, 2 that due to the use of relative positional encoding, each feature is translation invariant, and thus each field is translation equivariant. We focus on rotations next, discussing constraints on $E$ and adapting them to $T$. We prove in the Appendix that to solve Eq. 4, it suffices to satisfy, for all $R \in SO(3)$ and $\vec{x}_i, \vec{x}_j$, $j \neq i$, the following constraints on the weights:

$$\begin{cases} W_Q^l\,\rho_E^l(R) = \rho_E^{l+1}(R)\,W_Q^l \\ W_K^l(R(\vec{x}_j - \vec{x}_i))\,\rho_E^l(R) = \rho_E^{l+1}(R)\,W_K^l(\vec{x}_j - \vec{x}_i) \\ W_V^l(R(\vec{x}_j - \vec{x}_i))\,\rho_E^l(R) = \rho_E^{l+1}(R)\,W_V^l(\vec{x}_j - \vec{x}_i). \end{cases} \tag{6}$$

This reduces the per-layer constraints in Eq. 4 to constraints on the weights of each layer. By solving these constraints, we uncover the free parameters in each layer.
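To make the weight constraint concrete, consider the special case where both the input and output features are single vectors, so that $\rho^l(R) = \rho^{l+1}(R) = R$ and the constraint on a position-dependent weight reads $W(R\vec{\delta})\,R = R\,W(\vec{\delta})$. The sketch below, written in Cartesian coordinates, builds such a weight from three terms (identity, cross-product matrix, and traceless symmetric outer product) and checks the constraint numerically. It is an illustration under assumptions rather than the paper's parametrization: `vector_kernel` and `radial` are hypothetical names, and `radial` is a fixed stand-in for radial functions that would be learned.

```python
import numpy as np

def skew(u):
    """Cross-product matrix [u]_x, so that skew(u) @ v equals np.cross(u, v)."""
    return np.array([[0.0, -u[2], u[1]],
                     [u[2], 0.0, -u[0]],
                     [-u[1], u[0], 0.0]])

def rotation(axis, angle):
    """Rotation matrix about `axis` by `angle` (Rodrigues' formula)."""
    K = skew(axis / np.linalg.norm(axis))
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def vector_kernel(delta, radial):
    """Position-dependent 3x3 weight W(delta) that maps vectors to vectors.

    `radial` (hypothetical) maps the distance ||delta|| to three scalar
    coefficients; only these coefficients are free, the angular part is fixed.
    """
    r = np.linalg.norm(delta)
    u = delta / r
    c0, c1, c2 = radial(r)
    return (c0 * np.eye(3)                               # isotropic term
            + c1 * skew(u)                               # antisymmetric term
            + c2 * (np.outer(u, u) - np.eye(3) / 3.0))   # traceless symmetric term

# Stand-in radial profile (an arbitrary choice for this check).
radial = lambda r: (np.exp(-r), np.sin(r), r ** 2)

R = rotation(np.array([1.0, 2.0, 3.0]), 0.7)
delta = np.array([0.3, -1.2, 0.5])
lhs = vector_kernel(R @ delta, radial) @ R   # W(R delta) rho(R)
rhs = R @ vector_kernel(delta, radial)       # rho(R) W(delta)
print(np.allclose(lhs, rhs))
```

The check works because each of the three angular terms satisfies $M(R\hat{u}) = R\,M(\hat{u})\,R^T$ for rotations $R$, so any combination with coefficients that depend only on the distance inherits the constraint.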

Feature types and irreducibles: To solve Eq. 6 it is necessary to exploit the structure of the matrices $\rho_E^l$, $\rho_T^l$, which is an object of study of representation theory (Appendix A.8). Specifically, according to the Peter-Weyl theorem we can block diagonalize a representation $\rho : SO(3) \to \mathbb{R}^{d \times d}$ as $\rho(R) = Q^T [\oplus_k D^k(R)] Q$, where $Q$ is a change of basis matrix and $D^k : SO(3) \to \mathbb{R}^{(2k+1) \times (2k+1)}$, $k \in \mathbb{N}$, is the $k$-th irreducible representation of SO(3) (i.e., one that cannot be decomposed into smaller ones). The matrices $D^k(R)$ produced by the decomposition are called the $k$-th Wigner-D matrices. If a field decomposes to a single irreducible $D^k$, i.e., $\rho = D^k$ for some $k \in \mathbb{N}$, it is called a type-$k$ field. We already discussed a type-0 (or scalar) field $\rho(R) = D^0(R) = I$, the occupancy field, as well as a type-1 (or vector) field $\rho(R) = D^1(R) = R$ on the input point cloud $(X, S(X))$. In practice, our feature fields $f$ on each point are composed of multiple copies (or multiplicities) of irreducibles of different types that form complex $\rho$-fields, i.e., $f = \oplus_{k, m_k} f_{k, m_k}$, where $k$ indexes the type and $m_k$ its multiplicity.

Layer Parametrization: Given $f^l = \bigoplus_{k \in K} \bigoplus_{m_k} f^l_{k, m_k}$ and $f^{l+1} = \bigoplus_{k' \in K'} \bigoplus_{m_{k'}} f^{l+1}_{k', m_{k'}}$, we denote by $W^{k'k}$ a block of the matrix $W$ that maps a $k$-type to a $k'$-type (there are many such blocks depending on the multiplicities). For clarity, we drop the layer index $l$ from the matrices, since all constraints are similar. By requiring that the attention kernel is a scalar field of the positions, we find in Appendix A.10 the sufficient conditions:

$$\begin{cases} W_Q^{k'k}\,D^k(R) = D^{k'}(R)\,W_Q^{k'k} \\ W_K^{k'k}(R(\vec{x}_j - \vec{x}_i))\,D^k(R) = D^{k'}(R)\,W_K^{k'k}(\vec{x}_j - \vec{x}_i) \\ W_V^{k'k}(R(\vec{x}_j - \vec{x}_i))\,D^k(R) = D^{k'}(R)\,W_V^{k'k}(\vec{x}_j - \vec{x}_i). \end{cases} \tag{7}$$

The solution of the first equation follows from Schur's Lemma (Appendix A.8.1). The next two have been studied in Weiler et al. (2018a) and Thomas et al. (2018).
Finally, the inner product in the attention kernel can zero out some of the parameters in the keys:

$$\begin{cases} W_Q^{kk} = w_Q^{kk}\,I_{2k+1}, \qquad W_Q^{k'k} = 0, \ k \neq k' \\ W_V^{k'k}(\vec{x}_j - \vec{x}_i) = \sum_{J=|k'-k|}^{k'+k} \phi_{J,V}^{k'k}(\|\vec{x}_j - \vec{x}_i\|; \theta_V)\, C_J^{k'k}\!\left(\frac{\vec{x}_j - \vec{x}_i}{\|\vec{x}_j - \vec{x}_i\|}\right) \\ W_K^{k'k}(\vec{x}_j - \vec{x}_i) = \sum_{J=|k'-k|}^{k'+k} \phi_{J,K}^{k'k}(\|\vec{x}_j - \vec{x}_i\|; \theta_K)\, C_J^{k'k}\!\left(\frac{\vec{x}_j - \vec{x}_i}{\|\vec{x}_j - \vec{x}_i\|}\right), \qquad W_K^{k'k} = 0, \ k' \notin K \end{cases} \tag{8}$$

where $C_J^{k'k}(\hat{x}) = \sum_{m=-J}^{J} Y_{Jm}(\hat{x})\, Q_{Jm}^{k'k}$, $\hat{x} = \vec{x}/\|\vec{x}\|$, and $Q_{Jm}^{k'k} \in \mathbb{R}^{(2k'+1) \times (2k+1)}$ are slices from the Clebsch-Gordan matrices $Q^{k'k}$; $Y_J : S^2 \to \mathbb{R}^{2J+1}$ is the $J$-th real-valued spherical harmonic and $Y_{Jm}(\hat{x}) = [Y_J(\hat{x})]_m$ is its $m$-th coordinate. The remaining free parameters are the radial functions $\phi_J$, which we parametrize with small MLPs (see Appendix A.7 for an example of these solutions). The constraints for the cross-attention module $T$ are similar to Eq. 7. However, we can further restrict to type-1 features for the keys in the first layer, i.e., $[W_K]_{i,j} = 0$ for $i \neq 1$, without any loss. The reason is that the input query field is of type-1. In Fig. 3 we visualize the equivariance constraint on $T$, $E$ by using a commutative diagram. For additional expressivity, we perform equivariant multi-headed attention by splitting the channels, i.e., the multiplicities of the irreducible representations. We also use skip connections and equivariant layer normalization as in Thomas et al. (2018), but defer the details to the Appendix. See Fig. 2 for an overview of the method.

Irreducible types for shape reconstruction: In deep networks, the difference between stacked channels and those forming a $\rho$-field is that the latter are equipped with transformation laws to mix channels, providing a geometric meaning to the representations. We select the irreducibles to describe our intuition about the geometric entities we look for.
If we want to learn a feature that behaves like a 3D Euclidean vector under rotation (for example, a normal vector), we need to construct a type-1 feature. If we want a feature that behaves like a symmetric matrix under rotations (such as the inertia matrix), we need a 6-dimensional channel that consists of one type-0 (scalar) channel and five channels that compose a type-2 (traceless symmetric matrix) feature. We incorporate such geometric entities in a distinct way from previous SO(3)-based methods in surface reconstruction, which only handle vector fields (Deng et al., 2021; Zhu et al., 2021; Chen et al., 2022). In particular, type-2 features can be useful in our problem to describe the bending of the local surface in the receptive field of the query. Thus they are useful to infer whether the query is inside the shape. Our intuitions are supported by the experimental performance gain compared to Deng et al. (2021) and Chen et al. (2022), and by the ablation study on the types of representations used in our network, shown in Table 1.
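The split of a symmetric matrix into one scalar and a five-dimensional traceless part can be checked numerically. The sketch below is an illustration under assumptions: `split_symmetric` is a hypothetical helper, and the symmetric matrix is the second-moment matrix of a random point set, chosen only because rotating the points by $R$ transforms it as $M \mapsto R M R^T$.

```python
import numpy as np

def rotation(axis, angle):
    """Rotation matrix about `axis` by `angle` (Rodrigues' formula)."""
    a = axis / np.linalg.norm(axis)
    K = np.array([[0.0, -a[2], a[1]], [a[2], 0.0, -a[0]], [-a[1], a[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def split_symmetric(M):
    """Split a symmetric 3x3 matrix into a scalar and a traceless symmetric part.

    The scalar (mean of the diagonal) carries 1 degree of freedom; the
    traceless symmetric remainder carries the other 5.
    """
    scalar = np.trace(M) / 3.0
    traceless = M - scalar * np.eye(3)
    return scalar, traceless

rng = np.random.default_rng(1)
X = rng.normal(size=(3, 20))            # points as columns
M = X @ X.T                             # symmetric second-moment matrix
R = rotation(np.array([0.0, 1.0, 1.0]), 0.9)

s, T2 = split_symmetric(M)
s_rot, T2_rot = split_symmetric((R @ X) @ (R @ X).T)

print(np.isclose(s, s_rot))             # the scalar part is invariant
print(np.allclose(T2_rot, R @ T2 @ R.T))  # the traceless part rotates into itself
```

The two parts never mix under rotation, which is the sense in which the 6-dimensional channel carries one scalar and one five-dimensional feature with its own transformation law.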

4. EXPERIMENTS

We perform experiments with surface reconstruction from unoriented, sparse, and noisy input point clouds. We show the importance of SE(3)-equivariance by evaluating objects in various poses and positions in space. Additionally, we show how local shape modeling and equivariance allow our method to train only on single aligned objects and generalize to scenes containing multiple objects in arbitrary locations and orientations.

4.1. SINGLE OBJECT RECONSTRUCTION FROM A SPARSE POINT CLOUD

In our first experiment, we train and evaluate our model on sparse point clouds sampled from single objects in ShapeNet (Chang et al., 2015). For each input shape, we first uniformly sample 300 points from the ground truth mesh, and then we add normal noise with zero mean and standard deviation of 0.005. We note that this experimental setting of 300 points is more challenging compared to previous works that used denser point clouds containing 3000 points. First we train and evaluate on the original objects from ShapeNet, where both the training and test data points are in their canonical position (the I/I case). Additionally, we follow 3) equivariant network that subsamples the input point cloud to extract per point features with long range dependencies. We note that in contrast to our method both VNN and GraphOnet can only use up to type-1 equivariant features. In Table 1, we present the F-Score (Tatarchenko et al., 2019), the Chamfer-L1 distance (Fan et al., 2017) and the Intersection over Union (IoU) of the reconstruction, for models trained on the aligned and SO(3)-augmented datasets. We refer to Section A.1 of the Appendix for a more detailed description of these metrics.
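The input corruption described above (300 surface samples with zero-mean Gaussian noise of standard deviation 0.005) might be produced along the following lines. This is a sketch under assumptions, not the paper's data pipeline: the function name and signature are hypothetical, "uniform" is interpreted as uniform with respect to surface area, and the mesh is given as a vertex array plus an integer face array.

```python
import numpy as np

def sample_noisy_points(vertices, faces, n_points=300, noise_std=0.005, rng=None):
    """Sample points uniformly (by area) on a triangle mesh and add Gaussian noise."""
    rng = np.random.default_rng() if rng is None else rng
    tri = vertices[faces]                                    # (F, 3, 3) triangle corners
    cross = np.cross(tri[:, 1] - tri[:, 0], tri[:, 2] - tri[:, 0])
    areas = 0.5 * np.linalg.norm(cross, axis=1)
    # Pick triangles with probability proportional to their area.
    idx = rng.choice(len(faces), size=n_points, p=areas / areas.sum())
    # Uniform barycentric coordinates via the square-root parametrization.
    su = np.sqrt(rng.random(n_points))
    v = rng.random(n_points)
    w0, w1, w2 = 1.0 - su, su * (1.0 - v), su * v
    points = (w0[:, None] * tri[idx, 0]
              + w1[:, None] * tri[idx, 1]
              + w2[:, None] * tri[idx, 2])
    return points + rng.normal(0.0, noise_std, size=points.shape)

# Toy mesh: the unit square in the z = 0 plane, split into two triangles.
vertices = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 1.0, 0.0]])
faces = np.array([[0, 1, 2], [0, 2, 3]])
cloud = sample_noisy_points(vertices, faces, rng=np.random.default_rng(0))
print(cloud.shape)
```

Area-weighted triangle selection keeps the sampling density independent of how finely the mesh is tessellated.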

4.2. SCENE RECONSTRUCTION WITH SINGLE OBJECT TRAINING

In this section, we evaluate the ability of our model to reconstruct novel scenes with many different objects, while trained only on single objects. Due to SE(3)-equivariance, performance is consistent and independent of the pose and the position of the objects. Additionally, our method performs computations in local neighborhoods that usually contain points from a single object. These two factors allow us to reconstruct scenes that contain objects in arbitrary poses and positions, without the need to train on similar scenes, or to segment into separate objects and reconstruct each one. We construct a dataset of synthetic rooms with multiple objects from ShapeNet in arbitrary locations and poses, similarly to the dataset constructed in Peng et al. (2020), but with the addition of random SO(3) rotations on each object. We call this dataset the Seismic dataset. In Figure 5a we show a quantitative comparison between our method (TF-Onet [E:0-2,D:0-2]) and previous methods that lack either the local shape modeling or the equivariance property. Figure 4b shows a qualitative comparison between the reconstructions achieved by these methods, while Figure 6a shows more examples of the reconstructions achieved by our model for scenes from the Seismic dataset. Methods that perform global shape modeling (Onet, VNN) and are trained on single objects fail to generalize to scenes containing multiple objects. On the other hand, methods that perform local shape modeling but are not equivariant to SE(3) transforms (ConvOnet) can reconstruct objects in novel poses, but with reduced quality. Finally, the SE(3)-equivariant GraphOnet, which uses per-point features with long-range dependencies, cannot generalize well to novel scenes when it is only trained on single objects. Our method benefits from both equivariance and local shape modeling, and is able to generalize to novel scenes, achieving quality similar to that on single object reconstruction.
In Figure 6b, we also show examples of reconstructions of realistic scenes captured from the Matterport3D dataset (Chang et al., 2017). These scenes contain between 6000 and 13000 points. While our method has only been trained with aligned single objects from ShapeNet represented as point clouds of 300 points, it successfully reconstructs complicated realistic scenes containing multiple objects in arbitrary positions and poses, even from unseen classes.

5. CONCLUSION

We proposed a novel SE(3)-equivariant coordinate-based model for shape reconstruction, consisting of two attention-based networks. By incorporating equivariance and local shape modeling, our method leverages the compositional structure of objects and scenes. These two properties allow our model to train on single aligned objects and reconstruct novel objects and scenes. We evaluate our method against state-of-the-art SO(3)-equivariant methods and non-equivariant methods trained with augmentations, and it compares favorably in the single shape reconstruction category. Additionally, we show that our model, trained on single aligned objects, can reconstruct novel scenes with quality similar to single-object reconstruction, a task where other methods fail.

A APPENDIX

A.1 EVALUATION METRICS

In Table 1, we present the F-Score, Chamfer-L1 distance and Intersection over Union (IoU) of the reconstructions achieved by various models. For the F-Score and Chamfer-L1 distance, we compute the reconstructed mesh using the Marching Cubes algorithm (Lorensen & Cline, 1987). Then, we uniformly sample 100,000 points from the reconstructed mesh, and compare them with the "ground truth" points sampled similarly from the "ground truth" mesh. The standard deviation of the results for both metrics due to the randomness introduced by the sampling of the 100,000 points is smaller than $10^{-5}$. As defined in Tatarchenko et al. (2019), the F-Score here is the harmonic mean of the precision and the recall of the reconstruction. Precision is the percentage of reconstructed points that lie within a certain distance $\tau$ of the ground truth points, and recall is the percentage of ground truth points that lie within a distance $\tau$ of the reconstructed points. We compute the F-Score with $\tau$ equal to 1% of the side length of the reconstructed volume, and with $\tau$ equal to 2% of the side length of the reconstructed volume. For the IoU metric, we uniformly sample 100,000 points from the whole space, and compare the predicted occupancy to the ground truth occupancy. The IoU is then computed as the ratio of the number of points that are occupied according to both the predicted and the ground truth occupancy to the number of points that are occupied according to either the predicted or the ground truth occupancy.
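The metric definitions above can be illustrated with a minimal NumPy sketch. The function names are hypothetical and not taken from the authors' code. The Chamfer-L1 convention used here (the average of the two directed mean nearest-neighbor distances) is an assumption, since the text does not spell it out, and the brute-force distance matrix is only practical for small point sets (100,000 samples would call for a spatial index such as a KD-tree).

```python
import numpy as np

def nearest_distances(a, b):
    """For each point in a (n, 3), Euclidean distance to its nearest point in b (m, 3)."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1)).min(axis=1)

def f_score(pred_pts, gt_pts, tau):
    """Harmonic mean of precision and recall at distance threshold tau."""
    precision = np.mean(nearest_distances(pred_pts, gt_pts) <= tau)
    recall = np.mean(nearest_distances(gt_pts, pred_pts) <= tau)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def chamfer_l1(pred_pts, gt_pts):
    """Assumed convention: average of the two directed mean nearest-neighbor distances."""
    return 0.5 * (nearest_distances(pred_pts, gt_pts).mean()
                  + nearest_distances(gt_pts, pred_pts).mean())

def iou(pred_occ, gt_occ):
    """Ratio of points occupied in both fields to points occupied in either field."""
    pred_occ = np.asarray(pred_occ, dtype=bool)
    gt_occ = np.asarray(gt_occ, dtype=bool)
    union = np.logical_or(pred_occ, gt_occ).sum()
    if union == 0:
        return 1.0  # assumption: two empty occupancy fields agree perfectly
    return np.logical_and(pred_occ, gt_occ).sum() / union
```

In this sketch `tau` would be set to 1% or 2% of the side length of the reconstructed volume, following the text.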

A.2 MODEL ARCHITECTURE, TRAINING AND TESTING DETAILS

For the self-attention feature extractor network $\mathcal{E}$, we use ten multi-headed SE(3) attention layers, and for the cross-attention occupancy network $\mathcal{T}$ we use another two. All multi-headed SE(3) attention layers use eight heads. For each point $\vec{x}_i$ of the point cloud, we define the neighborhood $\mathcal{N}(\vec{x}_i)$ as its $k$ nearest neighbors, where $k$ is chosen to be 5% of the size of the point cloud (e.g., for 300 points, $k = 15$). Each feature map in the layers of $\mathcal{E}$, $\mathcal{T}$ uses irreducibles up to type-2. We train our model on the ShapeNet (Chang et al., 2015) subset constructed in Choy et al. (2016). We use the Adam optimizer (Kingma & Ba, 2015) with a learning rate that starts at $2 \cdot 10^{-4}$ and linearly decreases to $10^{-5}$. We train for 200,000 iterations using a batch size of 64. During training, we take as input a point cloud of 300 points and as ground truth the occupancy value of 2048 points that are sampled uniformly inside a box that bounds the object. As a training loss, we use the binary cross-entropy loss between the predicted and the ground truth occupancy of the 2048 randomly sampled points. During inference, recalling that the output of our model for one query point lies in the range $[0, 1]$, we classify a query point as occupied if the value of the learned occupancy function is above 0.2. After we query the occupancy of points throughout the space, we reconstruct the object's mesh using the Marching Cubes algorithm (Lorensen & Cline, 1987).
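The training and inference settings above can be summarized in a short NumPy sketch. All function names are hypothetical. The sketch assumes that the linear decay of the learning rate spans the full 200,000 iterations, which the text does not state explicitly.

```python
import numpy as np

def linear_lr(step, total_steps=200_000, lr_start=2e-4, lr_end=1e-5):
    """Learning rate decaying linearly from lr_start to lr_end (assumed over all steps)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return lr_start + frac * (lr_end - lr_start)

def neighborhood_size(num_points, fraction=0.05):
    """Number of nearest neighbors k, chosen as a fraction of the point cloud size."""
    return max(1, int(round(fraction * num_points)))

def bce_loss(pred, target, eps=1e-7):
    """Binary cross-entropy between predicted and ground truth occupancy values."""
    pred = np.clip(np.asarray(pred, dtype=float), eps, 1.0 - eps)
    target = np.asarray(target, dtype=float)
    return float(-(target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred)).mean())

def classify_occupancy(pred, threshold=0.2):
    """A query point counts as occupied if its predicted value is above the threshold."""
    return np.asarray(pred) > threshold
```

For a point cloud of 300 points, `neighborhood_size(300)` gives the value $k = 15$ quoted in the text.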

A.3 EXTRACTING DIFFERENT TYPES OF REPRESENTATIONS AS FEATURES

An important quality of our method is its ability to use different types of representations as its intermediate features. These different types determine the way that the features transform as the input rotates. These transformation laws affect the shape properties that each type of feature can encode. For example, type-0 features can encode properties of the shape that are invariant to the rotation of the shape, type-1 features can encode properties that must rotate as 3D vectors (e.g., the normal vectors at the surface), and type-2 features can encode properties that rotate as symmetric matrices (e.g., the inertia matrix of the shape). A type-$l$ feature consists of a $(2l+1)$-dimensional vector that rotates according to the corresponding Wigner-D matrix $D^l : SO(3) \to \mathbb{R}^{(2l+1) \times (2l+1)}$. In Figure 7 we visualize the different types of features extracted by our encoder for different cases of inputs. The difference between the types of features can be clearly observed in the case of the symmetric sphere, where features of higher type $l$ correspond to higher frequencies on the sphere.

The minimum distances range from around 6% to 20% of the room. The performance of the model is relatively stable across all settings with a maximum performance decrease of 0.025 in IoU.

Figure 9: Intersection over Union (IoU) of the reconstructions from the Seismic dataset (Sec. 4.2) with respect to the minimum distance of any two objects in the scene.

A.6 LIMITATIONS AND FUTURE WORK

A possible limitation of this work is the inherent memory overhead of the attention modules used in the network. In our setting, this overhead is lowered by the locality of the attention operation, but it can be reduced further with additional optimization of the attention modules. The optimization of attention modules is an active research direction that can also benefit our method. An additional possible direction for future work is to extend our method to the problem of scene completion from partial observations. In this task, the model is required to hallucinate and reconstruct unobserved areas of a scene, and thus requires a combination of local and global information about the scene configuration. While a significant methodological extension is required for our method to tackle such a task, we believe that the core principles of equivariance via higher-order representations and local shape modeling can provide useful tools.

A.7 WEIGHT PARAMETRIZATION

Here, we present the parametrization of the weights of the queries, keys, and values that appear as the solutions of Equation 7 (and written analytically in Equation 8). We focus on the specific case where both the input and output feature maps are $\rho$-fields consisting of up to type-2 irreducibles. Then, each feature in the feature maps can be decomposed into

$$f_{in} = \bigoplus_{k=0}^{2} \bigoplus_{m_k=1}^{M_k} f_{in}^{k,m_k}, \qquad f_{out} = \bigoplus_{k'=0}^{2} \bigoplus_{n_{k'}=1}^{N_{k'}} f_{out}^{k',n_{k'}}.$$

For clarity, we start with multiplicities 1, i.e., $M_k = N_{k'} = 1$ (and omit the index of the layer). Then we extend to the general case. The matrix $W_Q$ appearing in the query tokens $Q(\vec{x}_i, f_{in}) = W_Q f_{in}$ has the form:

$$\begin{bmatrix} f_{out}^{0,1} \\ f_{out}^{1,1} \\ f_{out}^{2,1} \end{bmatrix} = \underbrace{\begin{bmatrix} w^{0,0} & 0 & 0 \\ 0 & w^{1,1} I_3 & 0 \\ 0 & 0 & w^{2,2} I_5 \end{bmatrix}}_{W_Q} \begin{bmatrix} f_{in}^{0,1} \\ f_{in}^{1,1} \\ f_{in}^{2,1} \end{bmatrix}$$

Similarly, we present the form of the matrix-valued function $W_K$ that appears in the key tokens as $K(\vec{x}_j, \vec{x}_i, f_{in}) = W_K(\vec{x}_j - \vec{x}_i) f_{in}$. To simplify the notation we use $\vec{x}$ instead of $\vec{x}_j - \vec{x}_i$ and $\hat{x} = \vec{x}/\|\vec{x}\|$. To write $W_K$ in matrix form we first need to define:

$$\Phi^{k',k}_{l\text{-}u}(\|\vec{x}\|) \otimes C^{k',k}_{l\text{-}u}(\hat{x}) := \sum_{i=l}^{u} \underbrace{\phi^{k',k}_{i}(\|\vec{x}\|)}_{\text{learned}} \, \underbrace{C^{k',k}_{i}(\hat{x})}_{\text{fixed}}$$

The functions $\phi^{k',k}_{i} : \mathbb{R}^{+} \to \mathbb{R}$ are parametrized by MLPs. $C^{k',k}_{i}(\hat{x}) \in \mathbb{R}^{(2k'+1) \times (2k+1)}$ are fixed matrices defined in Section 8. Given the definition above we can write $W_K$ as:

$$\underbrace{\begin{bmatrix} \Phi^{0,0}_{0\text{-}0}(\|\vec{x}\|) \otimes C^{0,0}_{0\text{-}0}(\hat{x}) & \Phi^{0,1}_{1\text{-}1}(\|\vec{x}\|) \otimes C^{0,1}_{1\text{-}1}(\hat{x}) & \Phi^{0,2}_{2\text{-}2}(\|\vec{x}\|) \otimes C^{0,2}_{2\text{-}2}(\hat{x}) \\ \Phi^{1,0}_{1\text{-}1}(\|\vec{x}\|) \otimes C^{1,0}_{1\text{-}1}(\hat{x}) & \Phi^{1,1}_{0\text{-}2}(\|\vec{x}\|) \otimes C^{1,1}_{0\text{-}2}(\hat{x}) & \Phi^{1,2}_{1\text{-}3}(\|\vec{x}\|) \otimes C^{1,2}_{1\text{-}3}(\hat{x}) \\ \Phi^{2,0}_{2\text{-}2}(\|\vec{x}\|) \otimes C^{2,0}_{2\text{-}2}(\hat{x}) & \Phi^{2,1}_{1\text{-}3}(\|\vec{x}\|) \otimes C^{2,1}_{1\text{-}3}(\hat{x}) & \Phi^{2,2}_{0\text{-}4}(\|\vec{x}\|) \otimes C^{2,2}_{0\text{-}4}(\hat{x}) \end{bmatrix}}_{W_K(\vec{x})} \begin{bmatrix} f_{in}^{0,1} \\ f_{in}^{1,1} \\ f_{in}^{2,1} \end{bmatrix}$$

The form of the matrices $W_V$ appearing in the value tokens as $V(\vec{x}_j, \vec{x}_i, f_{in}) = W_V(\vec{x}_j - \vec{x}_i) f_{in}$ is similar to $W_K$ above. In the general case, if the input representation contains $M_k$ copies of type-$k$ irreducibles, we need to stack the corresponding blocks of $W_Q$, $W_K$ $M_k$ times in the column dimension. Similarly, if the output representation contains $N_{k'}$ copies of type-$k'$ irreducibles, we need to stack the corresponding blocks $N_{k'}$ times in the row dimension. Specifically,

• for $W_Q$ and $k = k'$:

$$w^{k,k} I_{2k+1} \to \begin{bmatrix} w^{k,k}_{(1,1)} I_{2k+1} & \cdots & w^{k,k}_{(1,M_k)} I_{2k+1} \\ \vdots & \ddots & \vdots \\ w^{k,k}_{(N_{k'},1)} I_{2k+1} & \cdots & w^{k,k}_{(N_{k'},M_k)} I_{2k+1} \end{bmatrix}$$

• for $W_K$ and $l = |k' - k|$, $u = |k' + k|$:

$$\Phi^{k',k}_{l\text{-}u} \otimes C^{k',k}_{l\text{-}u} \to \begin{bmatrix} [\Phi^{k',k}_{l\text{-}u}]_{(1,1)} \otimes C^{k',k}_{l\text{-}u} & \cdots & [\Phi^{k',k}_{l\text{-}u}]_{(1,M_k)} \otimes C^{k',k}_{l\text{-}u} \\ \vdots & \ddots & \vdots \\ [\Phi^{k',k}_{l\text{-}u}]_{(N_{k'},1)} \otimes C^{k',k}_{l\text{-}u} & \cdots & [\Phi^{k',k}_{l\text{-}u}]_{(N_{k'},M_k)} \otimes C^{k',k}_{l\text{-}u} \end{bmatrix}$$

A.8 PRELIMINARIES

We recall some basic notions from group and representation theory, see e.g., Fulton & Harris (2013). A group $(G, \cdot)$ is a set $G$ together with a binary operator "$\cdot$": $G \times G \to G$ that satisfies the following axioms:

• Associativity: $g \cdot (h \cdot f) = (g \cdot h) \cdot f$ for all $g, h, f \in G$.
• Identity: there exists an element $e \in G$ such that $e \cdot g = g \cdot e = g$ for all $g \in G$.
• Inverse: for all $g \in G$, there exists $g^{-1} \in G$ such that $g^{-1} \cdot g = g \cdot g^{-1} = e$.
Given $G$, we say that each group element $g \in G$ acts on the space $X$ via an action $L_g : X \to X$ if the maps $L_g$ satisfy the following two properties:

• If $e$ is the identity element of $G$, then $L_e[x] = x$ for all $x \in X$;
• $L_g \circ L_h = L_{g \cdot h}$ for all $g, h \in G$.

If for any $x, y \in X$ there exists $g \in G$ such that $L_g[x] = y$, then we call $X$ a homogeneous space for the group $G$.

When $X = V$ is a vector space, we can define the group action using a linear group representation $(V, \rho)$, where $\rho : G \to GL(V)$ is a map from the group $G$ to the general linear group $GL(V)$. This means that using the linear operator $\rho(g)$ on $V$, we can define the group action $L_g$ on the vector space $V$ as $L_g[x] = \rho(g)x$ for all $x \in V$, $g \in G$. Then $(V, \rho)$ is a linear group representation if $\rho$ is a group homomorphism, i.e., $\rho(g \cdot h) = \rho(g)\rho(h)$ for all $g, h \in G$, and $\rho(e) = I_V$ is the identity operator over $V$. To simplify the notation, when the vector space $V$ that the group acts on is easily inferred from the context, we will use $\rho$ to denote the representation $(V, \rho)$.

Given a set of actions $L_g : X \to X$ for $g \in G$, and a set of actions $T_g : Y \to Y$ for $g \in G$, we say that a map $f : X \to Y$ is $(G, L, T)$-equivariant if $T_g[f(x)] = f(L_g[x])$ for all $g \in G$, $x \in X$. If $f$ is linear and equivariant (with respect to $(G, L, T)$), then it is called an intertwiner (with respect to $(G, L, T)$).

If there exists a subspace $W \subset V$ such that for all $g \in G$ and $w \in W$ we have $\rho(g)w \in W$, then $W$ is a $G$-invariant subspace of $V$, and $(W, \rho)$ is a subrepresentation of $(V, \rho)$. All representations $(V, \rho)$ have at least two subrepresentations: $(0, \rho)$ and $(V, \rho)$. If a representation has no other subrepresentations, then it is called irreducible. Otherwise it is called reducible. A fundamental result is the following (Fulton & Harris, 2013).

A.8.1 SCHUR'S LEMMA

Let $(V, \rho_V)$, $(W, \rho_W)$ be irreducible representations of $G$ acting on $V$ and $W$, respectively.
• If $V$ and $W$ are not isomorphic, then there are no nontrivial intertwiners between them.
• If $V = W$ are finite-dimensional vector spaces over $\mathbb{C}$, and if $\rho_V = \rho_W$, then all intertwiners are scalar multiples of the identity.

Now we list a number of fundamental examples, see e.g., Fulton & Harris (2013); Hall (2003).

A.8.2 THE TRANSLATION GROUP $(\mathbb{R}^3, +)$:

$\mathbb{R}^3$ equipped with the addition operator "+" is a group that is isomorphic to the group of translations in 3D space.

A.8.3 THE SPECIAL ORTHOGONAL GROUP SO(3):

SO(3) is the group of $3 \times 3$ orthogonal matrices with determinant +1 equipped with multiplication, and is isomorphic to the group of all 3D rotations about the origin. SO(3) is a compact group and, as a result of the Peter-Weyl theorem, its linear representations can be decomposed into a direct sum of finite-dimensional, unitary, irreducible representations. Specifically, a linear representation of SO(3) decomposes as:

$$\rho(g) = Q^T \Big[ \bigoplus_{J \ge 0} D^J(g) \Big] Q \quad \text{for all } g \in SO(3),$$

where $Q$ is a change of basis matrix and, for $J = 0, 1, \ldots$, the $D^J$ are $(2J+1) \times (2J+1)$ matrices known as the Wigner D-matrices. The Wigner D-matrices are the irreducible representations of SO(3). In the context of the features of a neural net, the representations (viewed as vectors) that transform according to $D^J$ are called type-$J$ vectors.

A.8.4 THE SPECIAL EUCLIDEAN GROUP SE(3):

SE(3) is the group of proper rigid transformations of 3D space, and is isomorphic to the semidirect product $(\mathbb{R}^3, +) \rtimes SO(3)$. An element of SE(3) can be represented as $(T, R)$, where $T$ is an element of the group of translations and $R$ is an element of the group SO(3) of 3D rotations. For two elements $(T_1, R_1), (T_2, R_2) \in SE(3)$, the group law is defined as:

$$(T_1, R_1) \cdot (T_2, R_2) = (T_1 + R_1 T_2, R_1 R_2).$$

Since every point in $\mathbb{R}^3$ can be transformed into any other point with a proper rigid transformation, $\mathbb{R}^3$ is a homogeneous space for SE(3).
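The SE(3) group law and its action on points of $\mathbb{R}^3$ can be checked numerically. The following is a minimal NumPy sketch with hypothetical helper names, in which a group element is stored as a pair `(T, R)` as in the text.

```python
import numpy as np

def random_rotation(rng):
    """Sample a rotation matrix (orthogonal, determinant +1) via a QR decomposition."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))   # fix the column signs of the factorization
    if np.linalg.det(q) < 0:      # flip one column to land in SO(3)
        q[:, 0] = -q[:, 0]
    return q

def compose(g1, g2):
    """Group law (T1, R1) . (T2, R2) = (T1 + R1 T2, R1 R2)."""
    (t1, r1), (t2, r2) = g1, g2
    return (t1 + r1 @ t2, r1 @ r2)

def inverse(g):
    """Inverse element: (T, R)^{-1} = (-R^T T, R^T)."""
    t, r = g
    return (-r.T @ t, r.T)

def act(g, x):
    """Action of (T, R) on a point x in R^3: x -> R x + T."""
    t, r = g
    return r @ x + t
```

With these definitions, `act(compose(g1, g2), x)` agrees with `act(g1, act(g2, x))` up to floating-point error, which is the compatibility property required of a group action. The pure translation `(y - x, I)` maps any point `x` to any point `y`, which illustrates why $\mathbb{R}^3$ is a homogeneous space.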
In addition to the action of SE(3) on vectors in $\mathbb{R}^3$, we can also define the action of SE(3) on functions $f : \mathbb{R}^3 \to \mathbb{R}^M$, for any given integer $M > 0$. This action is called the induced representation $\pi = \mathrm{Ind}_{SO(3)}^{SE(3)} \rho$ of SE(3). It acts on $f$ as follows:

$$[\pi((T, R))f](x) = \rho(R) f(R^{-1}(x - T)),$$

where $(T, R) \in SE(3)$, $x \in \mathbb{R}^3$ and $\rho$ is a representation of SO(3). Especially in the context of neural nets, where functions are viewed as feature fields, a function that transforms according to $\pi = \mathrm{Ind}_{SO(3)}^{SE(3)} \rho$ is called a $\rho$-field, and if $\rho$ corresponds to the $l$-th irreducible representation of SO(3), it is also called a field of type-$l$.

A.9 ARCHITECTURE DETAILS

A feature-augmented point cloud $P = (X, F)$, where $X := [\vec{x}_1 \cdots \vec{x}_N] \in \mathbb{R}^{3 \times N}$ and $F := [f_1 \cdots f_N] \in \mathbb{R}^{M \times N}$ for some $M \in \mathbb{N}$, can be associated with a 3D field $f : \mathbb{R}^3 \to \mathbb{R}^M$ of finite support by writing $f(\vec{x}) = \sum_{i=1}^{N} f_i \delta(\vec{x} - \vec{x}_i)$. This association will be useful to define the group action on the feature-augmented point cloud, and will unify the inputs and outputs of $\mathcal{T}$, $\mathcal{E}$ in the sense that both of them process fields in $\mathbb{R}^3$ and output fields in $\mathbb{R}^3$. We will interchangeably use $(X, F)$ and $f$ for the point cloud in the next sections. We will call $f$ the point cloud function when the distinction is not clear.

The main module in our architecture is an SE(3)-equivariant attention block. Each block consists of a multi-head SE(3)-equivariant attention module followed by a skip connection and an equivariant layer normalization step.

A.9.1 MULTI-HEAD SE(3)-EQUIVARIANT ATTENTION MODULE:

This module consists of multiple heads that implement either self-attention (input features for keys, values and queries are the same) or cross-attention (input features for keys and values are the same and different from the inputs for the queries). We will first describe the self-attention variant of the module and then describe the changes that are required for the cross-attention version.

The self-attention SE(3)-equivariant module takes as input a function $f$ (describing the point cloud as discussed above) defined by $f(\vec{x}) = \sum_{i=1}^{N} f_i \delta(\vec{x} - \vec{x}_i)$. Each one of the $f_i$ vectors can be decomposed into irreducible representations of SO(3) appearing with different multiplicities. This means that a single vector $f_i$ can be decomposed as $f_i = \bigoplus_{l \ge 0} \bigoplus_{m_l \ge 0} f^{i}_{l,m_l}$, where $f^{i}_{l,m_l}$ corresponds to the $m_l$-th multiplicity of the irreducible component of type-$l$.

Since we perform self-attention, the keys, values and queries are computed using the same input function $f$. Specifically, for pairs of key and query points $(\vec{x}_j, \vec{x}_i)$, we compute the keys $K(\vec{x}_j, \vec{x}_i, f_j) = W_K(\vec{x}_j - \vec{x}_i) f_j$, the queries $Q(\vec{x}_i, f_i) = W_Q f_i$, and the values $V(\vec{x}_j, \vec{x}_i, f_j) = W_V(\vec{x}_j - \vec{x}_i) f_j$. To ensure equivariance, $W_K$, $W_Q$, $W_V$ must satisfy the conditions described in Section 3.3.

Suppose that for the computed key and query features, the $l$-th irreducible appears with multiplicity $M_l$, and for the computed value features the $k$-th irreducible appears with multiplicity $N_k$. For each irreducible, we split its multiplicities across the different heads of the attention block. Assuming that we have $H$ heads in total, each head $h$ receives keys $K^{(h)}(\vec{x}_j, \vec{x}_i, f_j)$ and queries $Q^{(h)}(\vec{x}_i, f_i)$ containing irreducibles with multiplicities $M_l / H$, and receives values $V^{(h)}(\vec{x}_j, \vec{x}_i, f_j)$ containing irreducibles appearing with multiplicities $N_k / H$.
After this split, the output of the self-attention for each head can be computed as:

$$\mathrm{SA}^{(h)}[X, f](\vec{x}_i) = \sum_{\vec{x}_j \in \mathcal{N}(\vec{x}_i)} \alpha_X\big(Q^{(h)}(\vec{x}_i, f_i), K^{(h)}(\vec{x}_j, \vec{x}_i, f_j)\big) \, V^{(h)}(\vec{x}_j, \vec{x}_i, f_j),$$

where

$$\alpha_X\big(Q(\vec{x}_i, f_i), K(\vec{x}_j, \vec{x}_i, f_j)\big) = \frac{\exp\big[(Q(\vec{x}_i, f_i))^T K(\vec{x}_j, \vec{x}_i, f_j)\big]}{\sum_{\vec{x}_{j'} \in \mathcal{N}(\vec{x}_i)} \exp\big[(Q(\vec{x}_i, f_i))^T K(\vec{x}_{j'}, \vec{x}_i, f_{j'})\big]}.$$

Similar to the keys, values and queries, the output can have irreducibles and multiplicities that differ from the input, and decomposes as $\mathrm{SA}^{(h)}[X, f] = \bigoplus_{k} \bigoplus_{n_k} \mathrm{SA}^{(h)}[X, f]_{k,n_k}$, where $\mathrm{SA}^{(h)}[X, f]_{k,n_k}$ is the $n_k$-th multiplicity of the $k$-th irreducible.

After the application of the self-attention layer, we concatenate the outputs of all the heads to create $\mathrm{SA}[X, f] = \bigoplus_{h} \mathrm{SA}^{(h)}[X, f]$. Then we pass the concatenated output through a linear SE(3)-equivariant layer to obtain $\mathrm{SA}_{out}[X, f] = W_P \, \mathrm{SA}[X, f]$. Since this linear layer also needs to be equivariant, it must follow the same constraints as the query matrix $W_Q$. By Schur's lemma, it follows that $W_P$ can only mix feature vectors that correspond to the same irreducibles.

For implementing cross-attention, we use the same process as with self-attention, with the only difference that the inputs for the queries are different from the inputs for the keys and the values. As a result, the output of the cross-attention for a single head $h$ is computed as:

$$\mathrm{CA}^{(h)}[X, f](f_q, \vec{q}) = \sum_{\vec{x}_j \in \mathcal{N}_X(\vec{q})} \alpha_X\big(Q^{(h)}(\vec{q}, f_q), K^{(h)}(\vec{x}_j, \vec{q}, f_j)\big) \, V^{(h)}(\vec{x}_j, \vec{q}, f_j).$$
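To make the local attention computation concrete, the following NumPy sketch implements a heavily simplified single-head self-attention layer. It is an illustration under stated assumptions, not the architecture described in the text: features are restricted to a single type-1 channel, the weights are the special choices $W_Q = w_q I_3$, $W_K(\vec{x}) = \phi_K(\|\vec{x}\|) I_3$ and $W_V(\vec{x}) = \phi_V(\|\vec{x}\|) I_3$ (only the rotation-invariant part of the kernel), and the radial functions are fixed closed-form functions rather than learned MLPs. Points are stored as rows, i.e., as an $N \times 3$ array, and all names are hypothetical.

```python
import numpy as np

def knn_indices(points, k):
    """Indices of the k nearest neighbors of every point (the point itself included)."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    return np.argsort(d, axis=1)[:, :k]

def self_attention_type1(points, feats, k, w_q=0.7):
    """Single-head local self-attention on one type-1 (3D vector) feature per point."""
    idx = knn_indices(points, k)                      # (N, k) neighbor indices
    rel = points[idx] - points[:, None, :]            # (N, k, 3) offsets x_j - x_i
    r = np.linalg.norm(rel, axis=-1)                  # (N, k) rotation-invariant radii
    queries = w_q * feats                             # W_Q = w_q * I_3
    keys = np.exp(-r)[..., None] * feats[idx]         # W_K(x) = phi_K(|x|) * I_3
    values = (1.0 / (1.0 + r))[..., None] * feats[idx]  # W_V(x) = phi_V(|x|) * I_3
    logits = np.einsum('nc,nkc->nk', queries, keys)   # invariant inner products Q^T K
    logits = logits - logits.max(axis=1, keepdims=True)  # stabilize the softmax
    alpha = np.exp(logits)
    alpha = alpha / alpha.sum(axis=1, keepdims=True)  # softmax over each neighborhood
    return np.einsum('nk,nkc->nc', alpha, values)     # weighted sum of the values
```

For a rotation `R` and translation `T`, one would expect `self_attention_type1(points @ R.T + T, feats @ R.T, k)` to match `self_attention_type1(points, feats, k) @ R.T` up to floating-point error, provided there are no ties among neighbor distances.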

A.9.2 SKIP CONNECTION:

The skip connection concatenates the output features of the multi-headed SE(3)-equivariant module with the features of the input query. To respect the type of each feature, this concatenation must happen between features corresponding to the same irreducible. Suppose that $f_{out}$ is the output of the multi-head attention, decomposing into irreducibles and their multiplicities as $f_{out} = \bigoplus_{k} \bigoplus_{n_k} f^{out}_{k,n_k}$. Similarly, suppose that $f_q$ is the input query, decomposing into irreducibles and their multiplicities as $f_q = \bigoplus_{l} \bigoplus_{m_l} f_{q,(l,m_l)}$. The application of the skip connection gives an output

$$f_{skip} = \bigoplus_{k} \Big[ \Big( \bigoplus_{n_k} f^{out}_{k,n_k} \Big) \oplus \Big( \bigoplus_{m_k} f_{q,(k,m_k)} \Big) \Big].$$

To keep the output at the same types and multiplicities as specified by the user (i.e., to be a $\rho_{out}$-field), we only concatenate multiplicities from the input that exist in the output. Also, after the skip connection we apply an equivariant linear layer between irreducibles of the same type to project the features to the correct dimensions.

A.9.3 EQUIVARIANT LAYER NORM:

Suppose that $f_{l,m}$ denotes the $m$-th type-$l$ irreducible of the input vector $f$. As proposed in Fuchs et al. (2020), for each vector $f_{l,m}$, we apply layer normalization and a nonlinearity on the norm of $f_{l,m}$, leaving its direction unchanged. Thus, the equivariant normalization layer can be written as:

$$\mathrm{EqLayerNorm}(f)_{l,m} = \mathrm{ReLU}\Big( \mathrm{LayerNorm}_m \big( (\|f_{l,m}\|)_m \big) \Big) \frac{f_{l,m}}{\|f_{l,m}\|},$$

where LayerNorm corresponds to layer normalization, proposed in Ba et al. (2016).

We use the SE(3)-equivariant attention block to construct both the network $\mathcal{E}$ that assigns to each point in the point cloud a learned feature, and the network $\mathcal{T}$ that outputs the occupancy value of a query point in space. The network $\mathcal{E}$ consists of ten SE(3)-equivariant self-attention blocks, and $\mathcal{T}$ consists of two SE(3)-equivariant cross-attention blocks. Figure 2 shows a diagram of this architecture. In both networks, we use blocks with eight heads and intermediate representations that contain features up to type-2.
Additionally, for each type we set the multiplicity of the corresponding irreducible to 32. Finally, for the computation of the local neighborhood $\mathcal{N}(\vec{x})$ around each point $\vec{x}$, we use the $k$ nearest neighbors, where $k$ is chosen to be 5% of the size of the point cloud (e.g., for 300 points, $k = 15$). Although it is possible for $\mathcal{T}$ to output a single scalar that corresponds to the occupancy value at a queried point, we observe in the experiments an increase in performance when $\mathcal{T}$ outputs 32 scalar values that we then pass through a simple MLP to get the final occupancy value.

A.10 PROOFS ON EQUIVARIANCE FROM SEC. 3.3

1. Input-output equivariance (Eq. 3): We will formalize the equivariance constraints in the language of group theory. We have a map $\hat{o}$ that takes as input a point cloud $X \in \mathbb{R}^{3 \times N}$ and outputs an occupancy field $\hat{o}(X) : \mathbb{R}^3 \to \mathbb{R}$. First we need to define the actions of SE(3) on the input point cloud and on the occupancy map, which we denote by $L_{in}$, $L_{out}$ respectively. These have the following form for $(T, R) \in SE(3)$:

$$L_{in,(T,R)}[X] = RX + \oplus^N T \tag{9}$$

$$[L_{out,(T,R)} \hat{o}(X)](\vec{q}) = [\hat{o}(X)](R^{-1}(\vec{q} - T)), \tag{10}$$

where $\oplus^N$ on vectors denotes column-wise concatenation. The first equation describes $N$ standard representations of SE(3), and the second the induced representation of SE(3) via SO(3) with $\rho(R) = I$, i.e., the map $\hat{o}$ is a scalar (or type-0) field.

For completeness we show that $L_{in}$ indeed describes an action of SE(3) on $X$. Letting $(0, I)$ be the identity element of SE(3), we can check that $L_{in,(0,I)}[X] = IX + \oplus^N 0 = X$. Also, for any $(T_1, R_1), (T_2, R_2)$ from SE(3), we can check that

$$\begin{aligned} L_{in,(T_2,R_2)}[L_{in,(T_1,R_1)}[X]] &= R_2(R_1 X + \oplus^N T_1) + \oplus^N T_2 \\ &= R_2 R_1 X + R_2(\oplus^N T_1) + \oplus^N T_2 \\ &= R_2 R_1 X + \oplus^N (R_2 T_1 + T_2) \\ &= L_{in,(R_2 T_1 + T_2, R_2 R_1)}[X] = L_{in,[(T_2,R_2) \cdot (T_1,R_1)]}[X]. \end{aligned}$$

Using the vector space isomorphism associating $X \in \mathbb{R}^{3 \times N}$ with the unrolled vector $\mathrm{vec}(X) \in \mathbb{R}^{3N}$, we can view this action as the direct sum of $N$ standard representations.
Moreover, $L_{out}$ is also an action, and in fact an induced representation of SE(3) via SO(3) with $\rho(R) = I$, $R \in SO(3)$, according to the definition in A.8.4. Now that we have the input and output actions, we can define the equivariance constraint for our problem. Informally, a simultaneous roto-translation of the point cloud and the query results in an invariant prediction. Formally,

$$\hat{o}(L_{in,(T,R)}[X]) = L_{out,(T,R)} \hat{o}(X) \iff \tag{11}$$

$$[\hat{o}(L_{in,(T,R)}[X])](\vec{q}) = [L_{out,(T,R)} \hat{o}(X)](\vec{q}), \ \forall \vec{q} \in \mathbb{R}^3 \iff \tag{12}$$

$$[\hat{o}(RX + \oplus^N T)](\vec{q}) = [\hat{o}(X)](R^{-1}(\vec{q} - T)), \ \forall \vec{q} \in \mathbb{R}^3 \iff \tag{13}$$

$$[\hat{o}(RX + \oplus^N T)](R\vec{q} + T) = [\hat{o}(X)](\vec{q}), \ \forall \vec{q} \in \mathbb{R}^3. \tag{14}$$

Now we recall that we have parametrized $\hat{o}$ as a composition of a feature extractor $\mathcal{E}$ and an occupancy network $\mathcal{T}$, i.e., $[\hat{o}(X)](\vec{q}) = [\mathcal{T}(\mathcal{E}(X))](\vec{q})$. Thus we arrive at Eq. 3:

$$[\mathcal{T}(\mathcal{E}(RX + \oplus^N T))](R\vec{q} + T) = [\mathcal{T}(\mathcal{E}(X))](\vec{q}). \tag{15}$$

2. Per-layer equivariance (Eq. 4, 5): Next, we need to prove that the per-layer equivariance constraints in Eq. 4, 5 are sufficient to satisfy the input-output equivariance constraint from Eq. 3, provided that $\rho^{L'+1}_{\mathcal{T}}(R) = I$, $R \in SO(3)$. We recall that the forms of the feature extractor $\mathcal{E}$ and the occupancy network $\mathcal{T}$ are:

$$\mathcal{E}[X] = \mathcal{E}_L \circ \cdots \circ \mathcal{E}_2 \circ \mathcal{E}_1 \circ \tilde{\mathcal{E}}_0[X] \tag{16}$$

$$\tilde{\mathcal{E}}_0[X] = (X, S(X)) \tag{17}$$

$$\mathcal{T}[X, F](\vec{q}) = \mathcal{T}_{L'} \circ \cdots \circ \mathcal{T}_2 \circ \mathcal{T}_1 \circ \mathcal{T}_0[X, F](\vec{q}) \tag{18}$$

$$\mathcal{T}_0[X, F](\vec{q}) = (\vec{q}, S'(X, \vec{q})). \tag{19}$$

All $\mathcal{E}_1, \mathcal{E}_2, \cdots, \mathcal{E}_L$ take as input and produce as output a feature field on the point cloud and satisfy the constraints in Eq. 4. Similarly, all $\mathcal{T}_1, \mathcal{T}_2, \cdots, \mathcal{T}_{L'}$ take as input a feature field on the point cloud and a feature field in $\mathbb{R}^3$, and pass the point cloud feature field unaltered to the next layer. They transform the feature field in $\mathbb{R}^3$ and satisfy the constraints in Eq. 5. Thus, by composing the constraints in Eq. 4, 5 we immediately find:

$$\begin{aligned} \mathcal{E}_L \circ \cdots \circ \mathcal{E}_2 \circ \mathcal{E}_1 (RX + \oplus^N T, \rho^{1}_{\mathcal{E}}(R) F^{1}) &= (RX + \oplus^N T, \rho^{L+1}_{\mathcal{E}}(R) F^{L+1}), \\ \mathcal{T}_{L'} \circ \cdots \circ \mathcal{T}_1 [RX + \oplus^N T, \rho^{L+1}_{\mathcal{E}}(R) F^{L+1}](R\vec{q} + T, \rho^{1}_{\mathcal{T}}(R) f^{1}_q) &= (R\vec{q} + T, \rho^{L'+1}_{\mathcal{T}}(R) f^{L'+1}_q), \end{aligned} \tag{20}$$

where $\rho^{L'+1}_{\mathcal{T}}(R) = I$ as discussed. The equations above show that if every layer transforms an input $\rho$-field to an output $\rho$-field, then the composition also transforms an input $\rho$-field to an output $\rho$-field. Thus, it remains to show that the outputs of the fixed layers $\tilde{\mathcal{E}}_0$, $\mathcal{T}_0$, namely $(X, S(X))$ and $(\vec{q}, S'(X, \vec{q}))$ respectively, are $\rho$-fields as well. Then, the equations in Eq. 20 would apply not only from the first layer to the output, but from the input to the output. In other words, we need to check that the features produced by $S$, $S'$ do not translate when the point cloud translates, but do rotate (under suitable representations) when the point cloud rotates.

2.1. Input point cloud field and input query field are type-1 fields: For the $i$-th point at position $\vec{x}_i$, we construct a feature $f_i$ as input to the first self-attention layer of $\mathcal{E}$. We select the feature $f_i$ as the position of the $i$-th point, $\vec{x}_i$, relative to the centroid of a neighborhood constructed from its $k$ nearest neighbors in the point cloud in Euclidean norm, i.e.,

$$f_i := S(X)_i = \vec{x}_i - \frac{1}{|\mathcal{N}^k_i(X)|} \sum_{j \in \mathcal{N}^k_i(X)} \vec{x}_j, \qquad X = \oplus_{i=1}^{N} [\vec{x}_i],$$

where $\mathcal{N}^k_i(X) = \{ j \in [N] \mid \|\vec{x}_i - \vec{x}_j\|_2 \le \|\vec{x}_i - \vec{x}^{(i)}_k\|_2 \}$ is the neighborhood of the $i$-th point in the point cloud. Also, $(\vec{x}^{(i)}_j)_{j \in [N]}$ is a sorting of the points in the point cloud in increasing Euclidean distance to the $i$-th point, i.e., $\|\vec{x}_i - \vec{x}^{(i)}_1\|_2 \le \cdots \le \|\vec{x}_i - \vec{x}^{(i)}_N\|_2$. In case of ties, we assign numbers randomly. Due to the use of "$\le$" (instead of the strict inequality symbol "$<$") in the definition of the neighborhoods, if there are ties for $\vec{x}^{(i)}_k$, we include all tied points in the neighborhood.
Now we study how the features $f_i$ transform when the points in the point cloud transform via $L_{in,(T,R)}$, discussed in the first step. Since $L_{in,(T,R)}[X] = \oplus_{i=1}^{N} (L_{in,(T,R)}[X]_i) = \oplus_{i=1}^{N} [R\vec{x}_i + T]$, we find:

$$\begin{aligned} S(L_{in,(T,R)}[X])_i &= L_{in,(T,R)}[X]_i - \frac{1}{|\mathcal{N}^k_i(L_{in,(T,R)}[X])|} \sum_{j \in \mathcal{N}^k_i(L_{in,(T,R)}[X])} L_{in,(T,R)}[X]_j \\ &= (R\vec{x}_i + T) - \frac{1}{|\mathcal{N}^k_i(X)|} \sum_{j \in \mathcal{N}^k_i(X)} (R\vec{x}_j + T), \end{aligned}$$

where we used the claim, proved below, that

$$\mathcal{N}^k_i(L_{in,(T,R)}[X]) = \mathcal{N}^k_i(X). \tag{21}$$

Thus, $S(L_{in,(T,R)}[X])_i$ further equals:

$$R \Big( \vec{x}_i - \frac{1}{|\mathcal{N}^k_i(X)|} \sum_{j \in \mathcal{N}^k_i(X)} \vec{x}_j \Big) = R \, S(X)_i, \quad \text{for all } (T, R) \in SE(3).$$

We now prove Eq. 21. When all $\vec{x}_i$ are mapped as $\vec{x}_i \xrightarrow{L_{in,(T,R)}} R\vec{x}_i + T$, the Euclidean distance between any two points is preserved, i.e., $\|(R\vec{x}_i + T) - (R\vec{x}_j + T)\|_2 = \|\vec{x}_i - \vec{x}_j\|_2$. Thus, if before the transformation $\|\vec{x}_i - \vec{x}^{(i)}_k\|_2 = d_k$, then after the transformation we also have $\|L_{in,(T,R)}[X]_i - L_{in,(T,R)}[X]^{(i)}_k\|_2 = d_k$. This is because all nearest neighbors preserve their distances, and thus sorting returns the same indices up to random tie breaking. Thus,

$$\begin{aligned} j \in \mathcal{N}^k_i(X) &\iff \|\vec{x}_i - \vec{x}_j\|_2 \le d_k \\ &\iff \|(R\vec{x}_i + T) - (R\vec{x}_j + T)\|_2 \le d_k \\ &\iff \|L_{in,(T,R)}[X]_i - L_{in,(T,R)}[X]_j\|_2 \le d_k \\ &\iff \|L_{in,(T,R)}[X]_i - L_{in,(T,R)}[X]_j\|_2 \le \|L_{in,(T,R)}[X]_i - L_{in,(T,R)}[X]^{(i)}_k\|_2 \\ &\iff j \in \mathcal{N}^k_i(L_{in,(T,R)}[X]). \end{aligned}$$

The first and fourth equivalences hold because we include all tied neighbors in the neighborhood. Thus, the neighborhood is defined by the distance $d_k$, and not by the identity of the $k$ neighbors.

Thus $f_i = S(X)_i$ transforms according to the standard representation of SO(3) when each point in the point cloud transforms according to the standard representation of SE(3).
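The behavior of the input features $S(X)$ under a roto-translation can be verified numerically. The sketch below is a minimal NumPy illustration with a hypothetical function name. It stores points as rows ($N \times 3$), counts the point itself as its own first neighbor (as in the sorting over $j \in [N]$ above), and includes tied points through the "$\le$" comparison.

```python
import numpy as np

def initial_features(points, k):
    """S(X): offset of each point from the centroid of its k-nearest-neighbor set.

    points: (N, 3) array. The neighborhood of point i holds every j whose distance
    to i is at most the distance to the k-th nearest point (ties included).
    """
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    d_k = np.sort(d, axis=1)[:, k - 1]          # distance to the k-th nearest point
    mask = (d <= d_k[:, None]).astype(float)    # (N, N) neighborhood membership
    centroids = (mask @ points) / mask.sum(axis=1, keepdims=True)
    return points - centroids
```

Applying a rotation `R` and a translation `T` to the rows of `points` should rotate the features by `R` and leave them unaffected by `T`, which is the type-1 transformation law derived above.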
If we view the features as a function on the point cloud extended to the homogeneous space $\mathbb{R}^3 \cong SE(3)/SO(3)$, i.e., $f(\vec{x}) = \sum_{i=1}^{N} f_i \delta(\vec{x} - \vec{x}_i)$, then this function transforms according to the induced representation as:

$$L^{(ind)}_{(T,R)}[f](\vec{x}) = R f(R^{-1}(\vec{x} - T)) = \sum_{i=1}^{N} (R f_i) \, \delta\big(R^{-1}(\vec{x} - T) - \vec{x}_i\big).$$

Recall that functions transforming according to the above law are called type-1 fields. We will keep the name, but use a matrix notation instead of the Dirac notation. By concatenating the features column-wise, we find the map described in the main text as $S$, i.e., $S(RX + \oplus^N T) = \oplus_{i=1}^{N} R f_i = R S(X)$.

Now we turn to the second network, $\mathcal{T}$. For each point $\vec{q} \in \mathbb{R}^3$ whose occupancy value we wish to find, we first construct a feature $f^1_q := S'(X, \vec{q})$ and then use the pair $(\vec{q}, f^1_q)$ as the query to the first cross-attention layer of $\mathcal{T}$. We will show that, when the query $\vec{q}$ and the point cloud $X$ transform according to the standard representation of SE(3), this input feature $f^1_q$ also transforms according to the standard representation of SO(3).

We again construct the feature $f^1_q$ as the relative position between $\vec{q}$ and the centroid of its neighborhood $\mathcal{N}^Q_X(\vec{q})$. We consider $\mathcal{N}^Q_X(\vec{q})$ to be the same as the neighborhood of the point of the point cloud that is closest to $\vec{q}$ in Euclidean distance, and write (if the closest point is defined uniquely) $\mathcal{N}^Q_X(\vec{q}) := \mathcal{N}^k_{\arg\min_{i \in [N]} (\|\vec{q} - \vec{x}_i\|_2)}(X)$. We discuss at the end of this step the case where the nearest neighbors are tied. At the moment, let the unique closest point be $c = \arg\min_{i \in [N]} (\|\vec{q} - \vec{x}_i\|_2)$. Then, the query feature becomes:

$$f^1_q := S'(X, \vec{q}) = \vec{q} - \frac{1}{|\mathcal{N}^Q_X(\vec{q})|} \sum_{j \in \mathcal{N}^Q_X(\vec{q})} \vec{x}_j = \vec{q} - \frac{1}{|\mathcal{N}^k_c(X)|} \sum_{j \in \mathcal{N}^k_c(X)} \vec{x}_j, \qquad X = \oplus_{i=1}^{N} [\vec{x}_i]. \tag{22}$$

The proof that $S'(RX + \oplus^N T, R\vec{q} + T) = R S'(X, \vec{q})$ is the same as the one for $S$ before. Viewed again as a function defined on $\mathbb{R}^3$, the map $\vec{q} \mapsto S'(X, \vec{q})$ constitutes a type-1 field.
The transformation law of this field is depicted in Fig. 3.

2.2. Remark: When there are ties, i.e., the set $\arg\min_{i \in [N]} (\|\vec{q} - \vec{x}_i\|_2)$ contains more than one point, we form all neighborhoods $\mathcal{N}^Q_X(\vec{q}) = \{ \mathcal{N}^k_c(X) \mid c \in \arg\min_{i \in [N]} \|\vec{q} - \vec{x}_i\|_2 \}$. Then, we consider a query token for each pair $(\vec{q}_c, f^c_q)$, where $\vec{q}_c = \vec{q}$ and $f^c_q$ is constructed as in Eq. 22, using the neighborhood $\mathcal{N}^k_c(X) \in \mathcal{N}^Q_X(\vec{q})$. If $f_q$ transforms as a type-1 field, then, after the roto-translation of the point cloud and the query, we could identify the same $c$ as a closest neighbor. However, since those points are equivalent as nearest neighbors, we will process the whole set of fields independently, producing a set of fields in the output. A different order of selection of nearest neighbors after the roto-translation corresponds to a permutation of the set of the output fields. Since attention modules are permutation equivariant, this permutation propagates to the output. We only need to discuss how to combine these outputs on the tokens $\vec{q}_c$ that correspond to the same position $\vec{q}$ in the occupancy field. Since the output is a scalar field (as we prove next), we can take the maximum across the same channels to construct a new scalar field that is also invariant to any permutation. The idea is that taking the maximum corresponds to an "OR" operation, since both the usual non-linearities (such as the sigmoid) and the thresholding operations that follow are increasing functions of their inputs. Thus, when the network predicts that the position of the query is "occupied", even based on one neighborhood, the position is likely to be considered occupied.

3. From per-layer constraints (Eq. 4, 5) to constraints on the weights (Eq. 6): We now focus on each layer of $\mathcal{E}$, $\mathcal{T}$ separately. We need to prove that Eq. 6 provides sufficient conditions on the weights to satisfy the per-layer constraints in Eq. 4, 5.
Every layer is composed of a multi-headed attention layer, a skip connection and a normalization layer. We prove the result for the self-attention and cross-attention layers (focusing on one attention head for simplicity). It will be clear from the proof that a concatenation of the heads and a subsequent linear transformation mixing channels that correspond to the same irreducible preserves equivariance, as does a skip connection that concatenates irreducibles of the same type.

We start with the self-attention layers in $\mathcal{E}$ described in A.9.1. We drop the layer index $l$ for clarity and denote the actions $\rho^{l}_{\mathcal{E}}$, $\rho^{l+1}_{\mathcal{E}}$ by $\rho_{in}$, $\rho_{out}$. At the $i$-th token, we have

$$\mathrm{SA}[X, F]_i = \sum_{j \in \mathcal{N}^k_i(X)} \alpha_X[Q(\vec{x}_i, f_i), K(\vec{x}_j, \vec{x}_i, f_j)] \, \underbrace{W_V(\vec{x}_j - \vec{x}_i) f_j}_{V(\vec{x}_j, \vec{x}_i, f_j)},$$

where the attention kernel $\alpha_X$ takes the form:

$$\alpha_X[Q(\vec{x}_i, f_i), K(\vec{x}_j, \vec{x}_i, f_j)] = \frac{\exp[(Q(\vec{x}_i, f_i))^T K(\vec{x}_j, \vec{x}_i, f_j)]}{\sum_{j' \in \mathcal{N}^k_i(X)} \exp[(Q(\vec{x}_i, f_i))^T K(\vec{x}_{j'}, \vec{x}_i, f_{j'})]} = \frac{\exp[(W_Q f_i)^T (W_K(\vec{x}_j - \vec{x}_i) f_j)]}{\sum_{j' \in \mathcal{N}^k_i(X)} \exp[(W_Q f_i)^T (W_K(\vec{x}_{j'} - \vec{x}_i) f_{j'})]}, \quad i \in [N].$$

Then, we are given

$$\begin{aligned} W_Q \, \rho_{in}(R) &= \rho_{out}(R) \, W_Q \\ W_K(R(\vec{x}_j - \vec{x}_i)) \, \rho_{in}(R) &= \rho_{out}(R) \, W_K(\vec{x}_j - \vec{x}_i) \\ W_V(R(\vec{x}_j - \vec{x}_i)) \, \rho_{in}(R) &= \rho_{out}(R) \, W_V(\vec{x}_j - \vec{x}_i) \end{aligned} \tag{23}$$

and we need to prove:

$$\mathrm{SA}(RX + \oplus^N T, \rho_{in}(R) F) = \rho_{out}(R) \, \mathrm{SA}(X, F).$$

For the attention layer, we have for $X = \oplus_{i=1}^{N} [\vec{x}_i]$ and for each output token $i \in [N]$:

$$\begin{aligned} \mathrm{SA}[RX + \oplus^N T, \rho_{in}(R) F]_i &= \sum_{j \in \mathcal{N}^k_i(RX + \oplus^N T)} \alpha_{RX + \oplus^N T}[Q(R\vec{x}_i + T, \rho_{in}(R) f_i), K(R\vec{x}_j + T, R\vec{x}_i + T, \rho_{in}(R) f_j)] \cdot W_V((R\vec{x}_j + T) - (R\vec{x}_i + T)) \rho_{in}(R) f_j \\ &= \sum_{j \in \mathcal{N}^k_i(X)} \alpha_{RX + \oplus^N T}[W_Q \rho_{in}(R) f_i, W_K(R(\vec{x}_j - \vec{x}_i)) \rho_{in}(R) f_j] \, W_V(R(\vec{x}_j - \vec{x}_i)) \rho_{in}(R) f_j, \end{aligned} \tag{24}$$

where we used Eq. 21 for the invariant neighborhoods, $\mathcal{N}^k_i(RX + \oplus^N T) = \mathcal{N}^k_i(X)$.
Now, using the constraints on the matrices (Eq. 23), we find that each term $j \in \mathcal{N}^k_i(X)$ in the sum above equals

$$\alpha_{RX+\oplus^N T}[\rho_{out}(R) W_Q f_i, \rho_{out}(R) W_K(\vec x_j - \vec x_i) f_j]\, \rho_{out}(R) W_V(\vec x_j - \vec x_i) f_j = \rho_{out}(R)\, \alpha_{RX+\oplus^N T}[\rho_{out}(R) W_Q f_i, \rho_{out}(R) W_K(\vec x_j - \vec x_i) f_j]\, W_V(\vec x_j - \vec x_i) f_j, \tag{25}$$

where in the second equality we used that the attention kernel gives a scalar output. The attention kernel in the last expression transforms as

$$\begin{aligned} \alpha_{RX+\oplus^N T}[\rho_{out}(R) W_Q f_i, \rho_{out}(R) W_K(\vec x_j - \vec x_i) f_j] &= \frac{\exp[(\rho_{out}(R) W_Q f_i)^T (\rho_{out}(R) W_K(\vec x_j - \vec x_i) f_j)]}{\sum_{j'\in\mathcal{N}^k_i(RX+\oplus^N T)} \exp[(\rho_{out}(R) W_Q f_i)^T (\rho_{out}(R) W_K(\vec x_{j'} - \vec x_i) f_{j'})]} \\ &= \frac{\exp[(W_Q f_i)^T (W_K(\vec x_j - \vec x_i) f_j)]}{\sum_{j'\in\mathcal{N}^k_i(X)} \exp[(W_Q f_i)^T (W_K(\vec x_{j'} - \vec x_i) f_{j'})]} = \alpha_X[Q(\vec x_i, f_i), K(\vec x_j, \vec x_i, f_j)], \end{aligned} \tag{26}$$

where we used again the properties of the invariant neighborhoods and that $\rho_{out}$ is a unitary representation, i.e., $\rho_{out}(R)^T \rho_{out}(R) = I$ for all $R \in SO(3)$. Using Eqs. 25, 26, and 24, we find

$$\mathrm{SA}[RX + \oplus^N T, \rho_{in}(R) F]_i = \rho_{out}(R) \sum_{j\in\mathcal{N}^k_i(X)} \alpha_X[Q(\vec x_i, f_i), K(\vec x_j, \vec x_i, f_j)]\, W_V(\vec x_j - \vec x_i) f_j = \rho_{out}(R)\, \mathrm{SA}[X, F]_i.$$

Thus each self-attention layer in $E$ is equivariant. The proof for each cross-attention layer in $T$ is similar. Again, note that if the closest neighbor of the query is not uniquely defined, then, as discussed above, we output the entire set of fields for every equivalent neighborhood. The output is then also a set of fields, and a roto-translation of the input only results in a permutation of these fields. The "max" operation in the output eliminates the permutation, making the output equivariant.

4. From the constraints on the weights (Eq. 6) to their solutions (Eq. 7): Again, we focus on each layer separately as we did in the previous section. The goal is to solve Eq. 23.
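To make the constraints in Eq. 23 concrete before solving them in general, the following is a minimal numerical sketch restricted to a single type-1 channel, where $\rho_{in}(R) = \rho_{out}(R) = R$. It assumes kernels of the form $W(\vec d) = aI + b\,\hat d \hat d^T + c\,[\hat d]_\times$ with constant coefficients (the learned radial profiles are dropped), a scalar query weight, and k-nearest-neighbor neighborhoods that exclude the center point. All function names and coefficient values are hypothetical and chosen only for illustration; this is not the paper's implementation.

```python
import numpy as np

def skew(v):
    """Cross-product matrix [v]_x, so that skew(v) @ u equals np.cross(v, u)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def kernel(d, a, b, c):
    """Assumed type-1 -> type-1 kernel W(d) = a*I + b*u u^T + c*[u]_x, u = d/|d|.
    It satisfies W(R d) R = R W(d) for every rotation R (radial part omitted)."""
    u = d / np.linalg.norm(d)
    return a * np.eye(3) + b * np.outer(u, u) + c * skew(u)

def knn(X, i, k):
    """Indices of the k nearest neighbours of X[i], excluding i itself (assumption)."""
    dist = np.linalg.norm(X - X[i], axis=1)
    return np.argsort(dist)[1:k + 1]

def self_attention(X, F, k=4, wq=0.7, wk=(0.3, -1.1, 0.5), wv=(1.2, 0.4, -0.8)):
    """Single-head local attention on type-1 features F (one 3-vector per point)."""
    out = np.zeros_like(F)
    for i in range(len(X)):
        nbrs = knn(X, i, k)
        query = wq * F[i]  # W_Q = wq * I, a multiple of the identity
        keys = np.stack([kernel(X[j] - X[i], *wk) @ F[j] for j in nbrs])
        vals = np.stack([kernel(X[j] - X[i], *wv) @ F[j] for j in nbrs])
        logits = keys @ query
        alpha = np.exp(logits - logits.max())  # numerically stable softmax
        alpha /= alpha.sum()
        out[i] = alpha @ vals
    return out

def random_rotation(rng):
    """Random 3x3 rotation matrix obtained from a QR decomposition."""
    Q, U = np.linalg.qr(rng.normal(size=(3, 3)))
    Q = Q * np.sign(np.diag(U))  # remove the sign ambiguity of QR
    if np.linalg.det(Q) < 0:     # force determinant +1
        Q[:, 0] *= -1
    return Q

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 3))   # point positions
F = rng.normal(size=(12, 3))   # one type-1 feature per point
R = random_rotation(rng)
T = rng.normal(size=3)

d = X[1] - X[0]
# Weight constraint: W(R d) rho_in(R) = rho_out(R) W(d)
constraint_gap = np.abs(kernel(R @ d, 1.2, 0.4, -0.8) @ R
                        - R @ kernel(d, 1.2, 0.4, -0.8)).max()
# Layer equivariance: SA(RX + T, RF) = R SA(X, F)
layer_gap = np.abs(self_attention(X @ R.T + T, F @ R.T)
                   - self_attention(X, F) @ R.T).max()
print(constraint_gap, layer_gap)  # both should be at floating-point level
```

The three terms of the assumed kernel are rotation-covariant building blocks for the type-1 to type-1 case; the general solution derived next replaces them with spherical harmonics contracted with Clebsch-Gordan coefficients and learned radial functions.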
By the Peter-Weyl theorem, $\rho_{in}$ decomposes into unitary, irreducible representations of SO(3), possibly with multiplicities. Thus,

$$\rho_{in}(R) = Q_{in}^T \Big[\bigoplus_l \bigoplus_{m_l} D_l(R)\Big] Q_{in}, \quad R \in SO(3),$$

where in our case $Q_{in} = I$, $\oplus$ for matrices denotes a concatenation along the diagonal, $l$ indexes the irreducible types, and $m_l$ indexes the multiplicity of the $l$-th irreducible. Then, for each $\vec x$ in the point cloud (where we suppress the index $i$ for clarity), we have $f^{in} = \bigoplus_l \bigoplus_{m_l} f^{in}_{l,m_l}$. We also consider an output field transforming according to $\rho_{out}(R) = Q_{out}^T \big[\bigoplus_k \bigoplus_{n_k} D_k(R)\big] Q_{out}$, with $Q_{out} = I$ in our case, that decomposes as $f^{out} = \bigoplus_k \bigoplus_{n_k} f^{out}_{k,n_k}$. Recall that each of the matrices $W_Q$, $W_K$, $W_V$ is of dimension $\sum_k n_k (2k+1) \times \sum_l m_l (2l+1)$.

1. For $W_Q$, we have for all $R \in SO(3)$:

$$\rho_{out}(R) W_Q = W_Q \rho_{in}(R) \iff \Big[\bigoplus_{k,n_k} D_k(R)\Big] W_Q = W_Q \Big[\bigoplus_{l,m_l} D_l(R)\Big] \iff D_k(R)\, [W_Q]^{k,n_k}_{l,m_l} = [W_Q]^{k,n_k}_{l,m_l}\, D_l(R) \text{ for every block}.$$

By Schur's Lemma (A.8.1), for each block $[W_Q]^{k,n_k}_{l,m_l}$ that multiplies $f^{in}_{l,m_l}$ to create $f^{out}_{k,n_k}$ (after adding all contributions), we have, for some constants $c^{k,n_k}_{l,m_l}$:

$$[W_Q]^{k,n_k}_{l,m_l} = \begin{cases} 0 & \text{if } l \neq k, \\ c^{k,n_k}_{l,m_l}\, I_{2l+1} & \text{if } l = k. \end{cases} \tag{28}$$

Intuitively, the solution above says that the query cannot transform an irreducible type to a different irreducible type (e.g., a scalar to a vector), but can only mix channels that correspond to the same irreducible.

2. For $W_V$, the constraint is that for all $R \in SO(3)$, the following set of equivalent statements holds:

$$W_V(R(\vec x_j - \vec x_i))\, \rho_{in}(R) = \rho_{out}(R)\, W_V(\vec x_j - \vec x_i) \iff W_V(R(\vec x_j - \vec x_i)) \Big[\bigoplus_{l,m_l} D_l(R)\Big] = \Big[\bigoplus_{k,n_k} D_k(R)\Big] W_V(\vec x_j - \vec x_i) \iff [W_V(R(\vec x_j - \vec x_i))]^{k,n_k}_{l,m_l}\, D_l(R) = D_k(R)\, [W_V(\vec x_j - \vec x_i)]^{k,n_k}_{l,m_l}.$$

The solution, as discussed in the main text and solved in Weiler et al. (2018a); Thomas et al. (2018), is:

$$[W_V(\vec x_j - \vec x_i)]^{k,n_k}_{l,m_l} = \sum_{J=|l-k|}^{l+k} \phi^{(k,n_k)}_{V,(J,l,m_l)}(\|\vec x_j - \vec x_i\|; \theta)\, C^{k,l}_J\!\left(\frac{\vec x_j - \vec x_i}{\|\vec x_j - \vec x_i\|}\right),$$

where $C^{k,l}_J(\hat x) = \sum_{m=-J}^{J} Y_{Jm}(\hat x)\, Q^{kl}_{Jm}$, $Q^{kl}_{Jm} \in \mathbb{R}^{(2k+1)\times(2l+1)}$ are the slices from the $Q^{kl}$ Clebsch-Gordan matrices, $Y_J : S^2 \to \mathbb{R}^{2J+1}$ is the $J$-th real spherical harmonic, $Y_{Jm}(\hat x) = [Y_J(\hat x)]_m$ is its $m$-th coordinate, and $\hat x = \vec x / \|\vec x\|$.

3. For $W_K$, we have the same equation as for $W_V$ above:

$$W_K(R(\vec x_j - \vec x_i))\, \rho_{in}(R) = \rho_{out}(R)\, W_K(\vec x_j - \vec x_i),$$

but in addition we can constrain the blocks that transform to an irreducible that does not appear in the input to be zero without impacting the result. The reason is the form of $W_Q$ in Eq. 28. In particular, if $l'$, $k$ are different irreducibles in the input and output respectively, then the blocks $[W_K]^{k,n_k}_{l',m_{l'}}$ that transform a type-$l'$ to a type-$k$ only contribute as terms to the total inner product in the attention kernel as follows. For all $(l, m_l)$:

$$\Big\langle [W_Q]^{k,n_k}_{l,m_l}\, D_l(R) f_l,\; [W_K(\vec x_j - \vec x_i)]^{k,n_k}_{l',m_{l'}}\, D_{l'}(R) f_{l'} \Big\rangle.$$

The above inner product is zero if $l \neq k$ due to Eq. 28. If $k \notin \mathcal{K}$, where $\mathcal{K}$ is the set of irreducibles appearing in the input (i.e., the range of $l$), then the inner product is zero for all $l, m_l$. So, we might as well choose

$$[W_K(\vec x_j - \vec x_i)]^{k,n_k}_{l',m_{l'}} = 0, \quad \text{if } k \notin \mathcal{K},$$

since parametrizing those blocks will not contribute to the result. For the rest of the blocks we have, similarly to before,

$$[W_K(\vec x_j - \vec x_i)]^{k,n_k}_{l,m_l} = \sum_{J=|l-k|}^{l+k} \phi^{(k,n_k)}_{K,(J,l,m_l)}(\|\vec x_j - \vec x_i\|; \theta)\, C^{k,l}_J\!\left(\frac{\vec x_j - \vec x_i}{\|\vec x_j - \vec x_i\|}\right).$$

The parametrization of the weights of the cross-attention module $T$ is similar to Eq. 7. The only difference is that we can further reduce the number of parameters without impacting the result by using only the type-1 features of the keys in the first layer, i.e., $[W_K]^{k,n_k}_{l,m_l} = 0$ for $k \neq 1$. This is due to the same equation involving the inner product between the key and the query we saw above, and the fact that the input query field is of type-1. In Fig. 3, we visualize the equivariance constraint on $T$, $E$ by using a commutative diagram.

6.
Additional Equivariant Layers: Clearly, the concatenation of multiple attention heads with the same irreducible types, as well as skip connection layers as defined in section A.9, preserve equivariance. Since they add the multiplicities of each irreducible type independently, they only change the output representation by increasing the multiplicities of its irreducibles. The subsequent mixing of channels of the same irreducible type by a linear map performed in the multi-headed attention module corresponds to the same operation that the query performs, thus it also preserves equivariance. Finally, the equivariant layer-norm layer also operates on each irreducible type independently. Since the irreducibles are unitary transforms, i.e., $\|f_{l,m_l}\|_2 = \|D_l f_{l,m_l}\|_2$ for all $f_{l,m_l}$ and all $D_l$, any non-linear operation on the norm of a type-$l$ vector produces a type-0 vector. Since $f_{l,m_l} / \|f_{l,m_l}\|_2$ is again a type-$l$ vector, the final result is a type-$l$ vector.

7. Attention as a set operation: In addition to SE(3)-equivariance, our model inherits the properties of the standard attention layers. Thus, it is equivariant to any permutation of the points in the point cloud, and to the order of the queries. Moreover, the number of output tokens can vary. We use this property for scene reconstruction, by reconstructing scenes with a variable number of points (and point clouds) even when training on single objects with a fixed number of points. The input (key-value) tokens can also change during inference, which we can use to adapt the neighborhoods dynamically. This can counteract the reduction of the receptive field that occurs when a point cloud is super-sampled.

8. On independent SO(3) rotations. Our local attention neighborhoods and equivariant reconstructions are particularly important properties when reconstructing scenes. Here the point clouds can be independently placed in arbitrary positions.
It is natural to ask if we can connect the performance of our model in a particular scene to the performance in another scene, where the same point clouds have been independently roto-translated. For a single object, the true occupancy function should be the same under a simultaneous roto-translation of the point cloud and the query. Our equivariant pipeline respects this property by outputting a scalar field. Thus, the performance of our model is consistent independently of the SE(3) transformations of a single point cloud.

Following a similar approach for scenes with multiple objects, one can associate an independent copy of SO(3) acting on each point cloud, i.e., an action $X_i \to L_{r_i}[X_i]$, for all objects $i \in I$. We can ask if there is a transformation of the query $\vec q$ that, after the rotation of the point clouds, results in the same occupancy value. Also, can this be implemented without knowing the segmentation function that assigns the points to their point clouds? A reasonable first approach would be to transform the query $\vec q$ with the rotation matrix $R_i$ used to transform its closest neighbor. Unfortunately, this transformation does not correspond to a group action.

To understand this, consider two unit spheres at positions $(0, 0, 0)$ and $(1, 0, 0)$ in three-dimensional space, and the query point $\vec q = (-10, 0, 0)$. Then, consider the product group action $L_1 = L_{r_1,r_2} = L_{(0,0,\pi),e}$ that rotates the first sphere around the z-axis by 180 degrees, $r_1 = (0, 0, \pi)$, and fixes the second sphere, $r_2 = e$. Next, consider the reverse action $L_2 = L_{(0,0,-\pi),e}$. If we first multiply the corresponding group elements together, we find $L = L_1 \circ L_2 = L_{e,e} = I$. If we then find the nearest sphere and apply the identity action to the query, it will remain at $(-10, 0, 0)$. On the other hand, if we first find the nearest neighbor (in this case, the first sphere) and apply the corresponding action, then the query will rotate around the z-axis by 180 degrees, moving to $(10, 0, 0)$.
If we then apply the second action by finding again the nearest neighbor (now the second sphere), the identity action will be applied and the query point stays at $(10, 0, 0)$. Thus, this transformation does not satisfy the "compatibility" property, and hence it is not a group action.

Even though the conditions are not met for all query points in space, there are subsets of $\mathbb{R}^3$ that are closed under these transformations and for which these transformations indeed form group actions. We will consider equivariance only in those subsets. We construct them geometrically for two point clouds for simplicity. Consider the point clouds $X_1$, $X_2$ rotating around points $c_1$, $c_2$, respectively. We construct the equivariant zone for the point cloud $X_1$. First take the point in $X_2$ that has the maximum distance (in Euclidean norm), say $d_2$, to its center of rotation $c_2$. Form the sphere $S_2$ around $X_2$, with radius $d_2$ and center $c_2$. All of $X_2$ is contained in $S_2$. Connect the centers $c_1$, $c_2$, and denote the point of intersection of this segment with $S_2$ as $p$. Then, draw the segment between $p$ and $c_1$ and name the distance of $c_1$ to the middle point of this segment $D_1$. Every point in the sphere $S_1$ of radius $D_1$ and center $c_1$ is in the equivariant zone of $X_1$, called $Z_1$.

We prove that in the equivariant zone, our model is equivariant for any independent rotation of the point clouds. By construction, every query $\vec q$ in the first equivariant zone $Z_1$ has its closest neighbor in the point cloud $X_1$. This holds for any rotation of any point cloud. Then, by definition, the action on the query is the same rotation $R_{r_1}$ that transformed the points in $X_1$. Since the point cloud and the query both rotate with the same transformation, the query neighborhood constructed within the cross-attention layers is invariant, leading to the same closest neighbor, as we proved for the single-object case.
Now, suppose each point in $X_1$ has its $k$ nearest neighbors in $X_1$, for any rotation of the point clouds; this is reasonable for sufficiently dense or separated point clouds. Then these point cloud neighborhoods are also invariant after the individual rotations. Under these conditions, a direct product action of copies of SO(3) on the point clouds, with a simultaneous action of $R_1$ on the query in the first equivariant zone, is viewed by the attention network as a simultaneous rotation of the point cloud $X_1$ and the query $\vec q$. This is because $\mathcal{N}(\vec q)$ contains only points from $X_1$, so the attention module uses only key-value tokens from points in $X_1$. Further, the neighborhood of the query is invariant to a simultaneous rotation of $X_1$ and $\vec q$. Thus, as we proved in the single-object case, the occupancy value prediction for this transformed query is invariant to the transformation. Hence, for all queries in an equivariant zone, equivariance to the direct product action of SO(3) holds, i.e., for all $r_1, r_2, \dots, r_I \in SO(3)$, and for all $\vec q \in Z_j$, with $j \in [I]$,

$$T[(L_{r_1}[X_1], L_{r_2}[X_2], \dots, L_{r_I}[X_I]), R_j \vec q] = T[(X_1, X_2, \dots, X_I), \vec q].$$
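The two-sphere counterexample above can be reproduced in a few lines. The sketch below is illustrative only: it assumes that the "nearest object" is decided by the distance to the sphere centers (a stand-in for the nearest neighbor in a sampled point cloud) and that each object rotates about its own center. The names `nearest_object` and `move_query` are hypothetical.

```python
import numpy as np

def rot_z(angle):
    """Rotation matrix about the z-axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0],
                     [s, c, 0.0],
                     [0.0, 0.0, 1.0]])

# Centers of the two unit spheres from the example in the text.
CENTERS = np.array([[0.0, 0.0, 0.0],
                    [1.0, 0.0, 0.0]])

def nearest_object(q):
    """Index of the object closest to the query (by center distance; assumption)."""
    return int(np.argmin(np.linalg.norm(CENTERS - q, axis=1)))

def move_query(q, rotations):
    """Rotate the query with the rotation assigned to its nearest object,
    about that object's center."""
    i = nearest_object(q)
    return CENTERS[i] + rotations[i] @ (q - CENTERS[i])

q = np.array([-10.0, 0.0, 0.0])
L1 = [rot_z(np.pi), np.eye(3)]    # rotate sphere 1 by 180 degrees, fix sphere 2
L2 = [rot_z(-np.pi), np.eye(3)]   # the reverse action
L_composed = [A @ B for A, B in zip(L1, L2)]  # product of group elements: identity

q_composed = move_query(q, L_composed)            # act once with the product
q_sequential = move_query(move_query(q, L1), L2)  # act with L1, then with L2

print(q_composed)    # query after applying the composed (identity) element
print(q_sequential)  # query after applying the two elements one after the other
```

Under these assumptions the composed element leaves the query at $(-10, 0, 0)$, while the sequential application ends at $(10, 0, 0)$ up to floating-point error, because the nearest object changes after the first step. This mismatch is the failure of the compatibility property described in the text.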



Figure 1: (Above): A scene-level point cloud produced by individual SE(3)-transformations of three sparse object point clouds. (Below): Our equivariant reconstruction. The whole scene is given as input to our network. The network, trained only on single objects in canonical poses and agnostic to the number, position and orientation of the objects, is able to reconstruct the scene accurately.

used both local and global attention for object detection, and Yu et al. (2021) used a transformer for point cloud completion.

Figure 3: (Up): Input point cloud, input query field (type-1), output occupancy field (type-0) and (Down) their roto-translations. Here $L'_g$ describes the action on both the query field $(\vec q, S'(X, q))$ depicted above and the point cloud field $(X, F_E)$ (described in the main text); $T$, $E$ are equivariant, which is equivalent to the diagram being commutative, i.e., $L''_g \circ T = T \circ L'_g$ and $E \circ L_g = L'_g \circ E$.

Figure 4: (a) Qualitative results for single object surface reconstruction for models trained and evaluated in three modes: training and testing on aligned shapes (I/I), training on aligned shapes and testing on rotated ones (I/SO(3)), training and testing on rotated shapes (SO(3)/SO(3)). (b) Scene reconstructions from the Seismic dataset, using models trained on aligned single objects.

Figure 5: (a) Quantitative results of reconstruction on the Seismic dataset (synthetic scenes) for models trained on single aligned objects. (b) Ablation study on the effect of point cloud density and noise. All models are trained on 300 points with added normal noise of standard deviation 0.005 (black dashed line in the figure) and are evaluated on different sparsity and noise settings.

Table 1: Chamfer-L1 distance, F-Score and IoU achieved by different methods on single object reconstruction from sparse point clouds (300 points) sampled from ShapeNet. We evaluate our method (TF-ONet) on three different versions: (E:0-1,D:0-1) the encoder and the decoder use up to type-1 representations, (E:0-2,D:0-1) the encoder uses up to type-2 and the decoder uses up to type-1 representations, (E:0-2,D:0-2) the encoder and the decoder use up to type-2 representations.
Columns: Chamfer-L1 ↓, F-Score (τ = 1%) ↑, F-Score (τ = 2%) ↑, IoU ↑, each reported for the I/I, I/SO(3) and SO(3)/SO(3) settings.

Figure 6: Examples of scene reconstructions using our method, trained only on aligned single objects from ShapeNet. (a) Reconstructions of synthetic scenes from the Seismic dataset. (b) Reconstructions of realistic scenes from Matterport3D (Chang et al., 2017).

Deng et al. (2021) and evaluate on test data points transformed by random SO(3) rotations. We evaluate models that were trained either on aligned training data (the I/SO(3) case), or on training data augmented by SO(3) rotations (the SO(3)/SO(3) case). In the case of our method (TF-ONet), as shown in Table 1, we experiment with different choices for the type of the intermediate representations used by our encoder and decoder. By adjusting the number of channels we make sure that all of our models have the same number of learnable parameters. We compare with the Occupancy Network

6. ACKNOWLEDGEMENTS

We gratefully acknowledge financial support by the following grants: NSF FRR 2220868, NSF IIS-RI 2212433, NSF TRIPODS 1934960, NSF CPS 2038873, ARL DCIST CRA W911NF-17-2-0181, ARO MURI W911NF-20-1-0080, ONR N00014-17-1-2093, and ONR N00014-22-1-2677.

annex

Figure 7: Visualization of each dimension of a type-l feature extracted by our TF-ONet encoder (with l = 0, 1, 2). A type-l feature corresponds to a (2l + 1)-dimensional vector where each dimension is denoted with index m ∈ {-l, . . . , l}. The rightmost column shows the final reconstruction achieved by our method for the corresponding input.

A.4 RECONSTRUCTIONS OF SCANNED SCENES

We further investigate the generalization of our method to novel scenes by reconstructing real objects of different domains and levels of clutter (Fig. 8). The input is a sparse, unoriented point cloud scanned from scenes consisting of an arbitrary number of randomly placed objects. The model is agnostic to the number of objects in the scene and is trained only on synthetic single objects. The model outputs high-quality reconstructions even in difficult settings with objects very close to each other. We observe small artifacts in very cluttered scenes, localized in regions where objects touch (e.g., the mugs in row 4 of Fig. 8).

A.5 ROBUSTNESS TO CLUTTER

We evaluate the model on the Seismic dataset (Sec. 4.2) which contains scenes with varying levels of clutter. In Fig. 9 we measure the Intersection over Union (IoU) of the reconstruction versus the minimum distance between any two objects in the scene (normalized with respect to the room size).
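As a rough illustration of the two quantities compared here, the sketch below computes an IoU from binary occupancy labels at sampled query points, and a normalized minimum distance between objects. The text does not specify how either quantity is computed, so several choices are assumptions: IoU is evaluated on a shared set of query points, the distance between two objects is the smallest point-to-point distance between their sampled point clouds, and `room_size` is a single scalar length. The function names are hypothetical.

```python
import numpy as np

def occupancy_iou(pred, target):
    """IoU between predicted and ground-truth binary occupancy labels,
    evaluated at the same set of query points."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        return 1.0  # both empty: treated as a perfect match (assumption)
    return float(np.logical_and(pred, target).sum() / union)

def min_object_distance(objects, room_size):
    """Smallest point-to-point distance between any two object point clouds,
    normalized by the room size (assumed to be a scalar length)."""
    best = np.inf
    for a in range(len(objects)):
        for b in range(a + 1, len(objects)):
            diff = objects[a][:, None, :] - objects[b][None, :, :]
            best = min(best, float(np.linalg.norm(diff, axis=-1).min()))
    return best / room_size

# Toy usage: four query points and two single-point "objects".
iou = occupancy_iou([1, 1, 0, 0], [1, 0, 1, 0])
clutter = min_object_distance(
    [np.array([[0.0, 0.0, 0.0]]), np.array([[3.0, 4.0, 0.0]])], room_size=10.0
)
print(iou, clutter)
```

Plotting `iou` against `clutter` over many scenes would give a curve of the kind described for Fig. 9, under the assumptions stated above.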

