ON THE UNIVERSALITY OF ROTATION EQUIVARIANT POINT CLOUD NETWORKS

Abstract

Learning functions on point clouds has applications in many fields, including computer vision, computer graphics, physics, and chemistry. Recently, there has been a growing interest in neural architectures that are invariant or equivariant to all three shape-preserving transformations of point clouds: translation, rotation, and permutation. In this paper, we present a first study of the approximation power of these architectures. We first derive two sufficient conditions for an equivariant architecture to have the universal approximation property, based on a novel characterization of the space of equivariant polynomials. We then use these conditions to show that two recently suggested models (Thomas et al., 2018; Fuchs et al., 2020) are universal, and to devise two other novel universal architectures.

1. INTRODUCTION

Designing neural networks that respect data symmetry is a powerful approach for obtaining efficient deep models. Prominent examples include convolutional networks, which respect the translational invariance of images; graph neural networks, which respect the permutation invariance of graphs (Gilmer et al., 2017; Maron et al., 2019b); networks such as (Zaheer et al., 2017; Qi et al., 2017a), which respect the permutation invariance of sets; and networks which respect 3D rotational symmetries (Cohen et al., 2018; Weiler et al., 2018; Esteves et al., 2018; Worrall & Brostow, 2018; Kondor et al., 2018a). While the expressive power of equivariant models is reduced by design to include only equivariant functions, a desirable property of equivariant networks is universality: the ability to approximate any continuous equivariant function. This is not always the case: while convolutional networks and networks for sets are universal (Yarotsky, 2018; Segol & Lipman, 2019), popular graph neural networks are not (Xu et al., 2019; Morris et al., 2018). In this paper, we consider the universality of networks that respect the symmetries of 3D point clouds: translations, rotations, and permutations. Designing such networks has been a popular paradigm in recent years (Thomas et al., 2018; Fuchs et al., 2020; Poulenard et al., 2019; Zhao et al., 2019). While there have been many works on the universality of permutation invariant networks (Zaheer et al., 2017; Maron et al., 2019c; Keriven & Peyré, 2019), and a recent work discussing the universality of rotation equivariant networks (Bogatskiy et al., 2020), this is the first paper to discuss the universality of networks which combine rotations, permutations, and translations. We start the paper with a general, architecture-agnostic discussion, and derive two sufficient conditions for universality. These conditions are a result of a novel characterization of equivariant polynomials for the symmetry group of interest.
We use these conditions to prove the universality of the prominent Tensor Field Networks (TFN) architecture (Thomas et al., 2018; Fuchs et al., 2020). The following is a weakened and simplified statement of Theorem 2, stated later in the paper:

Theorem (Simplification of Theorem 2). Any continuous equivariant function on point clouds can be approximated uniformly on compact sets by a composition of TFN layers.

We use our general discussion to prove the universality of two additional equivariant models: the first is a simple modification of the TFN architecture which allows for universality using only low dimensional filters. The second is a minimal architecture based on tensor product representations, rather than the more commonly used irreducible representations of SO(3). We discuss the advantages and disadvantages of both approaches. To summarize, the contributions of this paper are: (1) a general approach for proving the universality of rotation equivariant models for point clouds; (2) a proof that two recent equivariant models (Thomas et al., 2018; Fuchs et al., 2020) are universal; (3) two additional simple and novel universal architectures.

2. PREVIOUS WORK

Deep learning on point clouds. (Qi et al., 2017a; Zaheer et al., 2017) were the first to apply neural networks directly to raw point cloud data, using pointwise functions and pooling operations. Many subsequent works used local neighborhood information (Qi et al., 2017b; Wang et al., 2019b; Atzmon et al., 2018). We refer the reader to a recent survey for more details (Guo et al., 2020). In contrast with the aforementioned works, which focused solely on permutation invariance, more related to this paper are works that additionally incorporate invariance to rigid motions. (Thomas et al., 2018) proposed Tensor Field Networks (TFN) and showed their efficacy on physics and chemistry tasks. (Kondor et al., 2018b) also suggested an equivariant model for continuous rotations. (Li et al., 2019) suggested models that are equivariant to discrete subgroups of SO(3). (Poulenard et al., 2019) suggested an invariant model based on spherical harmonics. (Fuchs et al., 2020) followed TFN and added an attention mechanism. Recently, (Zhao et al., 2019) proposed a quaternion equivariant point capsule network that also achieves rotation and translation invariance.

Universal approximation for invariant networks. Understanding the approximation power of invariant models is a popular research goal. Most current results assume that the symmetry group is a permutation group. (Zaheer et al., 2017; Qi et al., 2017a; Segol & Lipman, 2019; Maron et al., 2020; Serviansky et al., 2020) proved universality for several S_n-invariant and equivariant models. (Maron et al., 2019b;a; Keriven & Peyré, 2019; Maehara & NT, 2019) studied the approximation power of high-order graph neural networks.
(Maron et al., 2019c; Ravanbakhsh, 2020) targeted the universality of networks that use high-order representations for permutation groups. (Yarotsky, 2018) provided several theoretical constructions of universal equivariant neural network models based on polynomial invariants, including an SE(2)-equivariant model. In a recent work, (Bogatskiy et al., 2020) presented a universal approximation theorem for networks that are equivariant to several Lie groups, including SO(3). The main difference from our paper is that we prove a universality theorem for a more complex group that, besides rotations, also includes translations and permutations.

3. A FRAMEWORK FOR PROVING UNIVERSALITY

In this section, we describe a framework for proving the universality of equivariant networks. We begin with some mathematical preliminaries:

3.1. MATHEMATICAL SETUP

An action of a group G on a real vector space W is a collection of maps ρ(g) : W → W defined for every g ∈ G, such that ρ(g_1) ∘ ρ(g_2) = ρ(g_1 g_2) for all g_1, g_2 ∈ G, and the identity element of G is mapped to the identity mapping on W. We say ρ is a representation of G if ρ(g) is a linear map for every g ∈ G. As is customary, when it does not cause confusion we often say that W itself is a representation of G. In this paper, we are interested in functions on point clouds. Point clouds are sets of vectors in R^3 arranged as matrices: X = (x_1, …, x_n) ∈ R^{3×n}. Many machine learning tasks on point clouds, such as classification, aim to learn a function which is invariant to rigid motions and relabeling of the points. Put differently, such functions are required to be invariant to the action of G = R^3 ⋊ SO(3) × S_n on R^{3×n} via

ρ_G(t, R, P)(X) = R(X − t 1_n^T) P^T,   (1)

where t ∈ R^3 defines a translation, R is a rotation, and P is a permutation matrix. Equivariant functions are generalizations of invariant functions: if G acts on W_1 via some action ρ_1(g), and on W_2 via some other action ρ_2(g), we say that a function f : W_1 → W_2 is equivariant if f(ρ_1(g)w) = ρ_2(g)f(w) for all w ∈ W_1 and g ∈ G. Invariant functions correspond to the special case where ρ_2(g) is the identity mapping for all g ∈ G. In some machine learning tasks on point clouds, the functions learned are not invariant but rather equivariant. For example, segmentation tasks assign a discrete label to each point. They are invariant to translations and rotations but equivariant to permutations, in the sense that permuting the input causes a corresponding permutation of the output. Another example is predicting a normal for each point of a point cloud. This task is invariant to translations but equivariant to both rotations and permutations.
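The action of equation 1 and the notion of invariance can be checked numerically. The following sketch (with NumPy; the `diameter` function is an illustrative example of ours, not one from the paper) verifies that the largest pairwise distance in a point cloud is invariant to the joint action of translations, rotations, and permutations:

```python
import numpy as np

def rand_rotation(rng):
    # Sample a random 3x3 rotation via QR; rescale so the determinant is +1.
    q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    return q * np.linalg.det(q)

def act(X, t, R, P):
    # The action of equation 1: rho_G(t, R, P)(X) = R (X - t 1_n^T) P^T.
    n = X.shape[1]
    return R @ (X - np.outer(t, np.ones(n))) @ P.T

def diameter(X):
    # An illustrative G-invariant function: the largest pairwise distance.
    d = X[:, :, None] - X[:, None, :]
    return np.sqrt((d ** 2).sum(axis=0)).max()

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 5))
t, R = rng.standard_normal(3), rand_rotation(rng)
P = np.eye(5)[rng.permutation(5)]           # a permutation matrix
assert np.isclose(diameter(act(X, t, R, P)), diameter(X))
```

Any function built from centered pairwise distances is invariant in the same way, which is why translations can be 'ignored' after centering.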
In this paper, we are interested in learning equivariant functions from point clouds into W_T^n, where W_T is some representation of SO(3). The equivariance of these functions is with respect to the action ρ_G on point clouds defined in equation 1, and the action of G on W_T^n defined by applying the rotation action from the left and the permutation action from the right as in equation 1, but 'ignoring' the translation component. Thus, G-equivariant functions will be translation invariant. This formulation of equivariance includes the normal prediction example by taking W_T = R^3, as well as the segmentation case by setting W_T = R with the trivial identity representation. We focus on the harder case of functions into W_T^n which are equivariant to permutations, since it easily implies the easier case of permutation invariant functions into W_T.

Notation. We use the notation N_+ = N ∪ {0} and N_+^* = ∪_{r ∈ N} N_+^r. We set [D] = {1, …, D} and [D]_0 = {0, …, D}.

Proofs. Proofs appear in the appendices, arranged according to sections.

3.2. CONDITIONS FOR UNIVERSALITY

The semi-lifted approach. In general, highly expressive equivariant neural networks can be achieved by using a 'lifted approach', where intermediate features in the network belong to high dimensional representations of the group. In the context of point clouds, where typically n ≫ 3, many papers, e.g., (Thomas et al., 2018; Kondor, 2018; Bogatskiy et al., 2020), use a 'semi-lifted' approach, where hidden layers hold only higher dimensional representations of SO(3), but not high order permutation representations. In this subsection, we propose a strategy for achieving universality with the semi-lifted approach.

We begin with an axiomatic formulation of the semi-lifted approach (see illustration in inset): we assume that our neural networks are composed of two main components. The first component is a family F_feat of parametric continuous G-equivariant functions f_feat which map the original point cloud R^{3×n} to a semi-lifted point cloud W_feat^n = ⊕_{i=1}^n W_feat, where W_feat is a lifted (i.e., high-order) representation of SO(3). The second component is a family of parametric linear SO(3)-equivariant functions F_pool, which map from the high order representation W_feat down to the target representation W_T. Each such SO(3)-equivariant function Λ : W_feat → W_T can be extended to an SO(3) × S_n equivariant function Λ̂ : W_feat^n → W_T^n by applying Λ elementwise. For every positive integer C, these two families of functions induce a family of functions F_C obtained by summing C different compositions of these functions:

F_C(F_feat, F_pool) = { f | f(X) = Σ_{c=1}^C Λ̂_c(g_c(X)), (Λ_c, g_c) ∈ F_pool × F_feat }.   (2)

Conditions for universality. We now describe two conditions that guarantee universality using the semi-lifted approach. The first step is showing, as in (Yarotsky, 2018), that continuous G-equivariant functions C_G(R^{3×n}, W_T^n) can be approximated by G-equivariant polynomials P_G(R^{3×n}, W_T^n).

Lemma 1.
Any continuous G-equivariant function in C_G(R^{3×n}, W_T^n) can be approximated uniformly on compact sets by G-equivariant polynomials in P_G(R^{3×n}, W_T^n).

Universality is now reduced to the approximation of G-equivariant polynomials. Next, we provide two conditions which guarantee that G-equivariant polynomials of degree D can be expressed by function spaces F_C(F_feat, F_pool) as defined in equation 2. The idea behind these conditions is that an explicit characterization of polynomials equivariant to the joint action of translations, rotations, and permutations is challenging. However, it is possible to explicitly characterize polynomials equivariant to translations and permutations (but not rotations). The key observation is that this characterization can be rewritten as a sum of functions into W_feat^n, a high dimensional representation of SO(3) which is equivariant to translations, permutations, and rotations, composed with a linear map which is permutation equivariant (but does not respect rotations). Accordingly, our first condition is that F_feat contains a spanning set of such functions into W_feat^n. We call this condition D-spanning:

Definition 1 (D-spanning). For D ∈ N_+, let F_feat be a subset of C_G(R^{3×n}, W_feat^n). We say that F_feat is D-spanning if there exist f_1, …, f_K ∈ F_feat such that every polynomial p : R^{3×n} → R^n of degree D which is invariant to translations and equivariant to permutations can be written as

p(X) = Σ_{k=1}^K Λ̂_k(f_k(X)),   (3)

where Λ_k : W_feat → R are linear functionals, and Λ̂_k : W_feat^n → R^n are the functions defined by elementwise application of Λ_k.

In Lemma 4 we explicitly construct a D-spanning family of functions. This provides us with a concrete condition which implies D-spanning for other function families as well. The second condition is that F_pool contains all linear SO(3)-equivariant layers. We call this condition linear universality.
Intuitively, taking the Λ_k in equation 3 to be linear and rotation equivariant ensures that the resulting function p will be rotation equivariant and thus fully G-equivariant, and linear universality guarantees the ability to express all such G-equivariant functions.

Definition 2 (Linear universality). We say that a collection F_pool of equivariant linear functionals between two representations W_feat and W_T of SO(3) is linearly universal if it contains all linear SO(3)-equivariant mappings between the two representations.

When these two conditions hold, a rather simple symmetrization argument leads to the following theorem:

Theorem 1. If F_feat is D-spanning and F_pool is linearly universal, then there exists some C(D) ∈ N such that for all C ≥ C(D) the function space F_C(F_feat, F_pool) contains all G-equivariant polynomials of degree ≤ D.

Proof idea. By the D-spanning assumption, there exist f_1, …, f_K ∈ F_feat such that any vector valued polynomial invariant to translations and equivariant to permutations is of the form p(X) = Σ_{k=1}^K Λ̂_k(f_k(X)). While by definition this holds for functions p whose image is R^n, it is easily extended to functions into W_T^n as well. It remains to show that when p is also SO(3)-equivariant, we can choose the Λ_k to be SO(3)-equivariant. This is accomplished by averaging over SO(3).
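The two-component structure of equation 2 can be sketched in a few lines of code. Everything below is an illustrative stand-in of ours: the feature map `g` (a per-point outer product of centered coordinates, flattened) and the random pooling matrices are hypothetical choices, used only to show the shape of the family F_C: a sum of C compositions of a feature map followed by an elementwise linear map.

```python
import numpy as np

def f_C(X, feature_maps, pooling_matrices):
    # F_C of equation 2: sum over C channels of an elementwise linear map
    # (a matrix applied to each column) composed with a feature map.
    assert len(feature_maps) == len(pooling_matrices)
    return sum(Lam @ g(X) for g, Lam in zip(feature_maps, pooling_matrices))

def g(X):
    # Illustrative feature map: center (translation invariance), then take
    # the per-point outer product x_j (x) x_j, flattened to a 9-vector.
    Xc = X - X.mean(axis=1, keepdims=True)
    return np.einsum('ij,kj->ikj', Xc, Xc).reshape(9, -1)

rng = np.random.default_rng(1)
X = rng.standard_normal((3, 4))                       # n = 4 points
Lams = [rng.standard_normal((1, 9)) for _ in range(2)]  # C = 2 channels
out = f_C(X, [g, g], Lams)
assert out.shape == (1, 4)                            # one output per point
```

Note that `g` is permutation equivariant (it acts per point, up to the shared mean) and translation invariant, while the rotation structure is handled entirely by the choice of the pooling maps, exactly as in the semi-lifted strategy.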

3.3. UNIVERSALITY CONDITIONS IN ACTION

In the remainder of the paper, we prove the universality of several G-equivariant architectures, based on the framework discussed in the previous subsection. We discuss two different strategies for achieving universality, which differ mainly in the type of lifted representations of SO(3) they use: (i) the first strategy uses (direct sums of) tensor-product representations; (ii) the second uses (direct sums of) irreducible representations. The main advantage of the first strategy, from the perspective of our methodology, is that achieving the D-spanning property is more straightforward. The advantage of irreducible representations is that they almost automatically guarantee the linear universality property. In Section 4 we discuss universality through tensor product representations, and give an example of a minimal tensor representation network architecture that satisfies universality. In Section 5 we discuss universality through irreducible representations, which is currently the more common strategy. We show that the TFN architecture (Thomas et al., 2018; Fuchs et al., 2020), which follows this strategy, is universal, and describe a simple tweak that achieves universality using only low order filters, though the representations throughout the network are high dimensional.

4. UNIVERSALITY WITH TENSOR REPRESENTATIONS

In this section, we prove universality for models that are based on tensor product representations, as defined below. The main advantage of this approach is that D-spanning is achieved rather easily. The main drawbacks are that its data representation is somewhat redundant and that characterizing the linear equivariant layers is more laborious.

Tensor representations. We begin by defining tensor representations. For k ∈ N_+ denote T_k = R^{3^k}. SO(3) acts on T_k by the tensor product representation, i.e., by applying the matrix Kronecker product k times: ρ_k(R) := R^{⊗k}. The inset illustrates the vector spaces and action for k = 1, 2, 3. With this action, for any i_1, …, i_k ∈ [n], the map from R^{3×n} to T_k defined by

(x_1, …, x_n) ↦ x_{i_1} ⊗ x_{i_2} ⊗ … ⊗ x_{i_k}   (5)

is SO(3)-equivariant.

A D-spanning family. We now show that tensor representations can be used to define a finite set of D-spanning functions. The lifted representation W_feat will be given by W_feat^T = ⊕_{k=0}^D T_k. The D-spanning functions are indexed by vectors r = (r_1, …, r_K), where each r_k is a non-negative integer. Denoting T = ||r||_1, the functions Q^(r) : R^{3×n} → T_T^n, Q^(r) = (Q_j^(r))_{j=1}^n, are defined for fixed j ∈ [n] by

Q_j^(r)(X) = Σ_{i_2,…,i_K=1}^n x_j^{⊗r_1} ⊗ x_{i_2}^{⊗r_2} ⊗ x_{i_3}^{⊗r_3} ⊗ … ⊗ x_{i_K}^{⊗r_K}.

The functions Q_j^(r) are SO(3)-equivariant as they are sums of equivariant functions from equation 5. Thus Q^(r) is SO(3) × S_n equivariant. The motivation behind the definition of these functions is that known characterizations of permutation equivariant polynomials (Segol & Lipman, 2019) tell us that the entries of these tensor valued functions span all permutation equivariant polynomials (see the proof of Lemma 2 for more details).
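The equivariance of the map in equation 5 can be verified numerically for k = 2, using the fact that the Kronecker product of vectors realizes the tensor product and ρ_2(R) = R ⊗ R. A small sanity-check sketch:

```python
import numpy as np

# Check that (x1, x2) -> x1 (x) x2 intertwines the actions: rotating the
# inputs first and then taking the tensor product equals taking the tensor
# product first and then applying rho_2(R) = R (x) R (mixed-product rule).
rng = np.random.default_rng(0)
q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
R = q * np.linalg.det(q)                      # a random rotation matrix
x1, x2 = rng.standard_normal(3), rng.standard_normal(3)

lhs = np.kron(R @ x1, R @ x2)                 # rotate, then map (9-vector)
rhs = np.kron(R, R) @ np.kron(x1, x2)         # map, then apply rho_2(R)
assert np.allclose(lhs, rhs)
```

The same identity, applied factor by factor, gives the equivariance of the k-fold products and hence of the functions Q_j^(r).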
To account for translation invariance, we compose the functions Q^(r) with a centralizing operation and define the set of functions

Q_D = { ι ∘ Q^(r)(X − (1/n) X 1_n 1_n^T) | ||r||_1 ≤ D },

where ι is the natural embedding that takes each T_k into W_feat^T = ⊕_{k=0}^D T_k. In the following lemma, we prove that this set is D-spanning.

Lemma 2. For every D ∈ N_+, the set Q_D is D-spanning.

Proof idea. It is known (Segol & Lipman, 2019, Theorem 2) that polynomials p : R^{3×n} → R^n which are S_n-equivariant are spanned by polynomials of the form p_α = (p_α^j)_{j=1}^n, defined as

p_α^j(X) = Σ_{i_2,…,i_K=1}^n x_j^{α_1} x_{i_2}^{α_2} … x_{i_K}^{α_K},   (8)

where α = (α_1, …, α_K) and each α_k ∈ N_+^3 is a multi-index. We first show that these polynomials can be extracted from Q^(r) and then use them to represent p.

A minimal universal architecture. Once we have shown that Q_D is D-spanning, we can design D-spanning architectures by devising architectures that are able to span all elements of Q_D. As we will now show, the compositional nature of neural networks allows us to do this in a very clean manner. We define a parametric function f(X, V | θ_1, θ_2) which maps R^{3×n} ⊕ T_k^n to R^{3×n} ⊕ T_{k+1}^n as follows: for all j ∈ [n], we have f_j(X, V) = (x_j, Ṽ_j(X, V)), where

Ṽ_j(X, V | θ_1, θ_2) = θ_1 (x_j ⊗ V_j) + θ_2 Σ_i (x_i ⊗ V_i).   (9)

We denote the set of functions (X, V) ↦ f(X, V | θ_1, θ_2) obtained by choosing the parameters θ_1, θ_2 ∈ R by F_min. While in the hidden layers of our network the data is represented using both coordinates (X, V), the input to the network contains only an X coordinate and the output contains only a V coordinate. To this end, we define the functions

ext(X) = (X, 1_n) and π_V(X, V) = V.   (10)

We can achieve D-spanning by composing functions in F_min with these functions and centralizing:

Lemma 3. The function set Q_D is contained in

F_feat = { ι ∘ π_V ∘ f_1 ∘ f_2 ∘ … ∘ f_T ∘ ext(X − (1/n) X 1_n 1_n^T) | f_j ∈ F_min, T ≤ D }.   (11)
Thus F_feat is D-spanning.

Proof idea. The proof is technical and follows by induction on D.

To complete the construction of a universal network, we now need to characterize all linear equivariant functions from W_feat^T to the target representation W_T. In Appendix G we show how this can be done for the trivial representation W_T = R. This characterization gives us a set of linear functions F_pool which, combined with F_feat defined in equation 11, gives us a universal architecture as in Theorem 1 (corresponding to SO(3)-invariant functions). However, the disadvantage of this approach is that the implementation of the linear functions in F_pool is somewhat cumbersome. In the next section we discuss irreducible representations, which give us a systematic way to address linear equivariant mappings into any W_T.

Proving D-spanning for these networks is accomplished via the D-spanning property of tensor representations, through the following lemma:

Lemma 4. If all functions in Q_D can be written as

ι ∘ Q^(r)(X − (1/n) X 1_n 1_n^T) = Σ_{k=1}^K Â_k(f_k(X)),

where f_k ∈ F_feat, A_k : W_feat → W_feat^T, and Â_k : W_feat^n → (W_feat^T)^n is defined by elementwise application of A_k, then F_feat is D-spanning. We note that, as before, the A_k are not necessarily SO(3)-equivariant.

Proof idea. The lemma follows directly from the assumptions.
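The minimal layer of equation 9 is simple enough to implement directly. In the sketch below (an illustration of ours, with tensors in T_k stored as flat vectors of length 3^k so that the tensor product becomes `np.kron`), two layers reproduce a function of the Q^(r) form:

```python
import numpy as np

def f_min(X, V, theta1, theta2):
    # The layer of equation 9. X: (3, n) points; V: list of n flat tensors
    # in T_k. Output V_j = theta1 * (x_j (x) V_j) + theta2 * sum_i (x_i (x) V_i),
    # a list of n flat tensors in T_{k+1}.
    pooled = sum(np.kron(X[:, i], V[i]) for i in range(X.shape[1]))
    return [theta1 * np.kron(X[:, j], V[j]) + theta2 * pooled
            for j in range(X.shape[1])]

# ext initializes V_j = 1 in T_0; pi_V reads off the V coordinate at the end.
X = np.arange(6, dtype=float).reshape(3, 2)
V = [np.ones(1), np.ones(1)]                 # ext(X): V-coordinate is 1_n
V = f_min(X, V, theta1=1.0, theta2=0.0)      # now V_j = x_j, in T_1
V = f_min(X, V, theta1=0.0, theta2=1.0)      # now V_j = sum_i x_i (x) x_i, in T_2
assert V[0].shape == (9,) and np.allclose(V[0], V[1])
```

The second output no longer depends on j, matching the Q^(r) term with r_1 = 0: the θ_2 branch pools over the point cloud, while the θ_1 branch builds the per-point factor x_j^{⊗r_1}.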

5. UNIVERSALITY WITH IRREDUCIBLE REPRESENTATIONS

In this section, we discuss how to achieve universality when using irreducible representations of SO(3). We begin by defining irreducible representations and explaining how they easily achieve linear universality while preserving the D-spanning properties of tensor representations. This discussion can be seen as an interpretation of the choices made in the construction of TFN and similar networks in the literature. We then show that these architectures are indeed universal.

5.1. IRREDUCIBLE REPRESENTATIONS OF SO(3)

In general, any finite-dimensional representation W of a compact group H can be decomposed into irreducible representations: a subspace W_0 ⊆ W is H-invariant if hw ∈ W_0 for all h ∈ H, w ∈ W_0, and a representation W is irreducible if it has no non-trivial invariant subspaces. In the case of SO(3), all irreducible real representations are defined by matrices D^(ℓ)(R), called the real Wigner D-matrices, acting on W_ℓ := R^{2ℓ+1} by matrix multiplication. In particular, the representations for ℓ = 0, 1 are D^(0)(R) = 1 and D^(1)(R) = R.

Linear maps between irreducible representations. As mentioned above, one of the main advantages of using irreducible representations is that there is a very simple characterization of all linear equivariant maps between two direct sums of irreducible representations. We use the notation W_l for direct sums of irreducible representations, where l = (ℓ_1, …, ℓ_K) ∈ N_+^K and W_l = ⊕_{k=1}^K W_{ℓ_k}.

Lemma 5. Let l^(1) = (ℓ_1^(1), …, ℓ_{K_1}^(1)) and l^(2) = (ℓ_1^(2), …, ℓ_{K_2}^(2)). A function Λ = (Λ_1, …, Λ_{K_2}) is a linear equivariant mapping between W_{l^(1)} and W_{l^(2)} if and only if there exists a K_1 × K_2 matrix M with M_ij = 0 whenever ℓ_i^(1) ≠ ℓ_j^(2), such that

Λ_j(V) = Σ_{i=1}^{K_1} M_ij V_i,

where V = (V_i)_{i=1}^{K_1} and V_i ∈ W_{ℓ_i^(1)} for all i = 1, …, K_1.

Proof idea. This lemma is a simple generalization of Schur's lemma, a classical tool in representation theory, which asserts that a non-zero linear map between irreducible representations is a scalar multiple of the identity mapping. Lemma 5 was stated in the complex setting in Kondor (2018). While Schur's lemma, and thus Lemma 5, does not always hold for representations over the reals, we observe here that it holds for real irreducible representations of SO(3) since their dimension is always odd.
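Lemma 5 can be illustrated in the smallest non-trivial case l^(1) = l^(2) = (0, 1), where the Wigner matrices are explicit: D^(0)(R) = 1 and D^(1)(R) = R. There, M has no mixed-order entries, so an equivariant linear map can only rescale each component. A sketch of ours verifying the equivariance:

```python
import numpy as np

def equivariant_map(V0, V1, m00, m11):
    # Lemma 5 for l1 = l2 = (0, 1): M is diagonal (mixed-order entries vanish),
    # so the map just rescales the order-0 and order-1 components.
    return m00 * V0, m11 * V1

rng = np.random.default_rng(0)
q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
R = q * np.linalg.det(q)                     # D1(R) = R; D0(R) = 1
V0, V1 = rng.standard_normal(1), rng.standard_normal(3)

a, b = equivariant_map(V0, R @ V1, 2.0, -0.5)   # rotate the input, then map
c, d = equivariant_map(V0, V1, 2.0, -0.5)       # map, then rotate the output
assert np.allclose(a, c) and np.allclose(b, R @ d)
```

Any map with a non-zero entry between the order-0 and order-1 blocks would fail this check, which is exactly the sparsity pattern that Schur's lemma enforces.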
Clebsch-Gordan decomposition of tensor products. As any finite-dimensional representation of SO(3) can be decomposed into a direct sum of irreducible representations, this is true for tensor representations as well. In particular, the Clebsch-Gordan coefficients provide an explicit formula for decomposing the tensor product of two irreducible representations W_{ℓ_1} and W_{ℓ_2} into a direct sum of irreducible representations. This decomposition can be easily extended to decompose the tensor product W_{l_1} ⊗ W_{l_2} into a direct sum of irreducible representations, where l_1, l_2 are now vectors. In matrix notation, this means there is a unitary linear equivariant mapping U(l_1, l_2) of W_{l_1} ⊗ W_{l_2} onto W_l, where the explicit values of l = l(l_1, l_2) and the matrix U(l_1, l_2) can be inferred directly from the case where ℓ_1 and ℓ_2 are scalars. By repeatedly taking tensor products and applying Clebsch-Gordan decompositions to the result, TFN and similar architectures can achieve the D-spanning property in a manner analogous to tensor representations, and also enjoy linear universality since they maintain irreducible representations throughout the network.
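A concrete instance of this decomposition, which can be written without any Clebsch-Gordan machinery, is ℓ_1 = ℓ_2 = 1: the 9-dimensional product W_1 ⊗ W_1 splits as W_0 ⊕ W_1 ⊕ W_2, realized by the trace, the antisymmetric part, and the symmetric traceless part of a 3×3 tensor. A numerical sketch of ours, checking that each part transforms within itself under rotations:

```python
import numpy as np

def decompose(T):
    # Split a 3x3 tensor (an element of W_1 (x) W_1) into its l = 0, 1, 2
    # components: trace, antisymmetric part, symmetric traceless part.
    tr = np.trace(T) / 3.0
    anti = (T - T.T) / 2.0
    sym0 = (T + T.T) / 2.0 - tr * np.eye(3)
    return tr, anti, sym0

rng = np.random.default_rng(0)
q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
R = q * np.linalg.det(q)
T = rng.standard_normal((3, 3))

tr1, a1, s1 = decompose(R @ T @ R.T)         # rotate the tensor, then split
tr2, a2, s2 = decompose(T)                   # split, then rotate each part
assert np.isclose(tr1, tr2)                  # the l = 0 part is invariant
assert np.allclose(a1, R @ a2 @ R.T)         # the l = 1 part (dimension 3)
assert np.allclose(s1, R @ s2 @ R.T)         # the l = 2 part (dimension 5)
```

The dimension count 1 + 3 + 5 = 9 matches the general rule that W_{ℓ_1} ⊗ W_{ℓ_2} decomposes into W_ℓ for ℓ = |ℓ_1 − ℓ_2|, …, ℓ_1 + ℓ_2.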

5.2. TENSOR FIELD NETWORKS

We now describe the basic layers of the TFN architecture (Thomas et al., 2018), which are based on irreducible representations, and suggest an architecture based on these layers which can approximate G-equivariant maps into any representation W_{l_T}^n, l_T ∈ N_+^*. There are some superficial differences between our description of TFN and the description in the original paper; for more details see Appendix F. We note that the universality of TFN also implies the universality of (Fuchs et al., 2020), which is a generalization of TFN that adds an attention mechanism. Assuming the attention mechanism is not restricted to local neighborhoods, this method is at least as expressive as TFN. TFNs are composed of three types of layers: (i) convolution, (ii) self-interaction, and (iii) nonlinearities. In our architecture, we only use the first two layer types, which we now describe.

Convolution. Convolutional layers involve taking tensor products of a filter and a feature vector to create a new feature vector, and then decomposing into irreducible representations. Unlike in standard CNNs, a filter here depends on the input, and is a function F : R^3 → W_{l_D}, where l_D = [0, 1, …, D]^T. The ℓ-th component of the filter F(x) = (F^(0)(x), …, F^(D)(x)) is given by

F_m^(ℓ)(x) = R^(ℓ)(||x||) Y_m^ℓ(x̂), m = −ℓ, …, ℓ,   (13)

where x̂ = x/||x|| if x ≠ 0 and x̂ = 0 otherwise, Y_m^ℓ are the spherical harmonics, and R^(ℓ) is any polynomial of degree ≤ D. In Appendix F we show that these polynomial functions can be replaced by fully connected networks, since the latter can approximate all polynomials uniformly. The convolution of an input feature V ∈ W_{l_i}^n and a filter F as defined above gives an output feature Ṽ = (Ṽ_a)_{a=1}^n ∈ W_{l_o}^n, where l_o = l(l_f, l_i), given by

Ṽ_a(X, V) = U(l_f, l_i) ( θ_0 V_a + Σ_{b=1}^n F(x_a − x_b) ⊗ V_b ).   (14)

More formally, we will think of a convolutional layer as a function of the form f(X, V) = (X, Ṽ(X, V)).
These functions are defined by a choice of D, a choice of a scalar polynomial R^(ℓ) for each ℓ = 0, …, D, and a choice of the parameter θ_0 ∈ R in equation 14. We denote the set of all such functions f by F_D.

Self-interaction layers. Self-interaction layers are linear functions Λ̂ : W_l^n → W_{l_T}^n obtained from elementwise application of equivariant linear functions Λ : W_l → W_{l_T}. These linear functions can be specified by a choice of a matrix M with the sparsity pattern described in Lemma 5.

Activation functions. TFN, as well as other papers, proposed several activation functions. We find that these layers are not necessary for universality and thus we do not define them here.

Network architecture. For our universality proof, we suggest a simple architecture which depends on two positive integer parameters (C, D): for given D, we define F_feat(D) as the set of functions obtained by 2D recursive convolutions

F_feat(D) = { π_V ∘ f_{2D} ∘ … ∘ f_2 ∘ f_1 ∘ ext(X) | f_j ∈ F_D },

where ext and π_V are defined as in equation 10. The output of a function in F_feat(D) is in W_{l(D)}^n, for some l(D) which depends on D. We then define F_pool(D) to be the self-interaction layers which map W_{l(D)}^n to W_{l_T}^n. This choice of F_feat(D) and F_pool(D), together with a choice of the number of channels C, defines the final network architecture F_{C,D}^{TFN} = F_C(F_feat(D), F_pool(D)) as in equation 2. In the appendix we prove the universality of TFN:

Theorem 2. For all n ∈ N and l_T ∈ N_+^*:
1. For D ∈ N_+, every G-equivariant polynomial p : R^{3×n} → W_{l_T}^n of degree D is in F_{C(D),D}^{TFN}.
2. Every continuous G-equivariant function can be approximated uniformly on compact sets by functions in ∪_{D∈N_+} F_{C(D),D}^{TFN}.

As discussed previously, the linear universality of F_pool is guaranteed. Thus proving Theorem 2 amounts to showing that F_feat(D) is D-spanning. This is done using the sufficient condition for D-spanning given in Lemma 4.

Proof idea.
The proof is rather technical and involved. A useful observation (see Dai & Xu (2013)) used in the proof is that the filters of orders ℓ = 0, 1, …, D defined in equation 13 span all polynomial functions of degree D on R^3. This observation is used to show that all functions in Q_D can be expressed by F_feat(D), and so F_feat(D) is D-spanning, as stated in Lemma 2.

Alternative architecture. The complexity of the TFN network used to construct G-equivariant polynomials of degree D can be reduced using a simple modification of the convolutional layer in equation 14: we add two parameters θ_1, θ_2 ∈ R to the convolutional layer, which is now defined as

Ṽ_a(X, V) = U(l_f, l_i) ( θ_1 Σ_{b=1}^n F(x_a − x_b) ⊗ V_b + θ_2 Σ_{b=1}^n F(x_a − x_b) ⊗ V_a ).

With this simple change, we can show that F_feat(D) is D-spanning even if we only take filters of order 0 and 1 throughout the network. This is shown in Appendix E.
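The structure of the convolution in equation 14 is easiest to see in the simplest case l_i = l_f = 0, where features are scalars, the spherical harmonic Y_0^0 is constant, and the unitary U and the tensor product are trivial. The sketch below (an illustration of ours; the radial polynomial is an arbitrary example) implements this scalar case and checks its rotation invariance and permutation equivariance:

```python
import numpy as np

def conv0(X, V, theta0, radial):
    # Equation 14 with scalar features and a scalar (order-0) filter:
    # V~_a = theta0 * V_a + sum_b radial(||x_a - x_b||) * V_b.
    dist = np.linalg.norm(X[:, :, None] - X[:, None, :], axis=0)  # (n, n)
    return theta0 * V + radial(dist) @ V

R_poly = lambda r: 1.0 + r + r ** 2          # an example degree-2 radial profile

rng = np.random.default_rng(0)
q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
R = q * np.linalg.det(q)
X, V = rng.standard_normal((3, 6)), rng.standard_normal(6)

# Scalar features only see pairwise distances, so the layer is rotation
# invariant, and it commutes with permutations of the points.
assert np.allclose(conv0(R @ X, V, 0.3, R_poly), conv0(X, V, 0.3, R_poly))
perm = rng.permutation(6)
assert np.allclose(conv0(X[:, perm], V[perm], 0.3, R_poly),
                   conv0(X, V, 0.3, R_poly)[perm])
```

Higher-order filters replace the constant Y_0^0 with the vectors Y_m^ℓ(x̂), so the message from b to a carries directional information, and the tensor product followed by U(l_f, l_i) routes it into irreducible components.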

6. CONCLUSION

In this paper, we have presented a new framework for proving the universality of G-equivariant point cloud networks. We used this framework to prove the universality of the TFN model (Thomas et al., 2018; Fuchs et al., 2020), and to devise two additional novel simple universal architectures. In the future we hope to extend these simple constructions to operational G-equivariant networks with universality guarantees and competitive practical performance. Our universal architectures do not require activation functions, and use a single self-interaction layer. In Appendix H we present an experiment indicating that the performance of TFN is not significantly altered by these simplifications. Our architectures also require high order representations, and our experiments show that using increasingly high order representations does indeed improve performance. To date, practical TFN implementations have included a relatively small number of layers, and have not used very high order representations. We believe our theoretical results will inspire interest in stable implementations of larger architectures. On the other hand, an interesting open problem is understanding whether universality can be achieved using only low-dimensional representations. Finally, we believe that the framework developed here will be useful for proving the universality of other G-equivariant models for point clouds, and of other related equivariant models. We note that large parts of our discussion can be easily generalized to symmetry groups of the form G = R^d ⋊ H × S_n acting on R^{d×n}, where H can be any compact topological group.

Lemma B.2. Let G be a compact group, let ρ_1 and ρ_2 be continuous representations of G on the Euclidean spaces W_1 and W_2, and let K ⊆ W_1 be a compact set. Then every equivariant function f : W_1 → W_2 can be approximated uniformly on K by a sequence of equivariant polynomials p_k : W_1 → W_2.
Proof. Let μ be the Haar probability measure associated with the compact group G. Let K_1 denote the compact set obtained as the image of the compact set G × K under the continuous mapping (g, X) ↦ ρ_1(g)X. Using the Stone-Weierstrass theorem, let p_k be a sequence of (not necessarily equivariant) polynomials which approximate f uniformly on K_1. Every degree D polynomial p : W_1 → W_2 induces a G-equivariant function

p̃(X) = ∫_G ρ_2(g^{−1}) p(ρ_1(g)X) dμ(g).

This function p̃ is a degree D polynomial as well: this is because p̃ can be approximated uniformly on K_1 by "Riemann sums" of the form Σ_{j=1}^N w_j ρ_2(g_j^{−1}) p(ρ_1(g_j)X), which are degree D polynomials, and because degree D polynomials are closed in C(K_1). Now, for all X ∈ K_1, continuity of the function g ↦ ρ_2(g^{−1}) implies that the operator norm of ρ_2(g^{−1}) is bounded uniformly by some constant N > 0, and so

|p̃_k(X) − f(X)| = | ∫_G ρ_2(g^{−1}) p_k(ρ_1(g)X) − ρ_2(g^{−1}) f(ρ_1(g)X) dμ(g) |
= | ∫_G ρ_2(g^{−1}) [p_k(ρ_1(g)X) − f(ρ_1(g)X)] dμ(g) |
≤ N ||f − p_k||_{C(K_1)} → 0.

B.2 PROOF OF THEOREM 1

Theorem 1. If F_feat is D-spanning and F_pool is linearly universal, then there exists some C(D) ∈ N such that for all C ≥ C(D) the function space F_C(F_feat, F_pool) contains all G-equivariant polynomials of degree ≤ D.

Proof. By the D-spanning assumption, there exist f_1, …, f_K ∈ F_feat such that any vector valued polynomial p : R^{3×n} → R^n invariant to translations and equivariant to permutations is of the form

p(X) = Σ_{k=1}^K Λ̂_k(f_k(X)),   (17)

where the Λ_k are linear functions to R. If p is a matrix valued polynomial mapping R^{3×n} to W_T^n = R^{t×n} which is invariant to translations and equivariant to permutations, then it is of the form p = (p_ij)_{i∈[t], j∈[n]}, and each p_i = (p_ij)_{j∈[n]} is itself invariant to translations and permutation equivariant. It follows that a matrix valued p can also be written in the form of equation 17, the only difference being that the image of the linear functions Λ_k is now R^t.
Now let $p : \mathbb{R}^{3 \times n} \to W_T^n$ be a $G$-equivariant polynomial of degree $\le D$. It remains to show that we can choose the $\Lambda_k$ to be SO(3)-equivariant. We do this by a symmetrization argument: denote the Haar probability measure on SO(3) by $\nu$, and the actions of SO(3) on $W_{feat}$ and $W_T$ by $\rho_1$ and $\rho_2$ respectively. Denote $p = (p^j)_{j=1}^n$ and $f_k = (f_k^j)_{j=1}^n$. For every $j = 1, \ldots, n$, we use the SO(3)-equivariance of $p^j$ and $f_k^j$ to obtain
$$p^j(X) = \int_{SO(3)} \rho_2(R^{-1}) \circ p^j(RX) \, d\nu(R) = \sum_{k=1}^K \int_{SO(3)} \rho_2(R^{-1}) \circ \Lambda_k \circ f_k^j(RX) \, d\nu(R) = \sum_{k=1}^K \int_{SO(3)} \rho_2(R^{-1}) \circ \Lambda_k \left( \rho_1(R) \circ f_k^j(X) \right) d\nu(R) = \sum_{k=1}^K \tilde{\Lambda}_k \circ f_k^j(X),$$

1. If $r_1 > 0$, we set $r = (r_1 - 1, r_2, \ldots, r_K)$. We know that $\iota \circ Q^{(r)}(\bar{X}) \in \mathcal{F}_{feat}(D-1)$ by the induction hypothesis. So there exist $f_2, \ldots, f_D$ such that $\iota \circ \pi_V \circ f_2 \circ \cdots \circ f_D \circ \mathrm{ext}(\bar{X}) = \iota \circ Q^{(r)}(\bar{X})$. (19) Now choose $f_1 \in \mathcal{F}_{min}$ to be the function whose $V$ coordinate $\tilde{V} = (\tilde{V}_j)_{j=1}^n$ is given by $\tilde{V}_j(X, V) = x_j \otimes V_j$, obtained by setting $\theta_1 = 1, \theta_2 = 0$ in equation 9. Then we have
$$\tilde{V}_j(\bar{X}, Q^{(r)}(\bar{X})) = \sum_{i_2, \ldots, i_K = 1}^n \bar{x}_j \otimes \bar{x}_j^{\otimes(r_1 - 1)} \otimes \bar{x}_{i_2}^{\otimes r_2} \otimes \cdots \otimes \bar{x}_{i_K}^{\otimes r_K} = Q_j^{(\bar{r})}(\bar{X}),$$
and so $\iota \circ \pi_V \circ f_1 \circ f_2 \circ \cdots \circ f_D \circ \mathrm{ext}(X - \frac{1}{n} X 1_n 1_n^T) = \iota \circ Q^{(\bar{r})}(\bar{X})$, (20) and therefore $\iota \circ Q^{(\bar{r})}(X - \frac{1}{n} X 1_n 1_n^T) \in \mathcal{F}_{feat}(D)$.

2. If $r_1 = 0$, we assume without loss of generality that $r_2 > 0$. Set $r = (r_2 - 1, r_3, \ldots, r_K)$. As before, by the induction hypothesis there exist $f_2, \ldots, f_D$ which satisfy equation 19. This time we choose $f_1 \in \mathcal{F}_{min}$ to be the function whose $V$ coordinate $\tilde{V} = (\tilde{V}_j)_{j=1}^n$ is given by $\tilde{V}_j(X, V) = \sum_{i=1}^n x_i \otimes V_i$, obtained by setting $\theta_1 = 0, \theta_2 = 1$ in equation 9. Then we have
$$\tilde{V}_j(\bar{X}, Q^{(r)}(\bar{X})) = \sum_{i_2 = 1}^n \sum_{i_3, \ldots, i_K = 1}^n \bar{x}_{i_2} \otimes \bar{x}_{i_2}^{\otimes(r_2 - 1)} \otimes \bar{x}_{i_3}^{\otimes r_3} \otimes \cdots \otimes \bar{x}_{i_K}^{\otimes r_K} = \sum_{i_2, i_3, \ldots, i_K = 1}^n \bar{x}_{i_2}^{\otimes r_2} \otimes \bar{x}_{i_3}^{\otimes r_3} \otimes \cdots \otimes \bar{x}_{i_K}^{\otimes r_K} = Q_j^{(\bar{r})}(\bar{X}).$$
Thus equation 20 holds, and so again we have that $\iota \circ Q^{(\bar{r})}(X - \frac{1}{n} X 1_n 1_n^T) \in \mathcal{F}_{feat}(D)$. Finally, we prove Lemma 4.

Lemma 4.
If all functions in $\mathcal{Q}_D$ can be written as $\iota \circ Q^{(\bar{r})}(X - \frac{1}{n} X 1_n 1_n^T) = \sum_{k=1}^K \hat{A}_k f_k(X)$, where $f_k \in \mathcal{F}_{feat}$, $A_k : W_{feat} \to W_{feat}^T$ are linear, and $\hat{A}_k : W_{feat}^n \to (W_{feat}^T)^n$ is defined by elementwise application of $A_k$, then $\mathcal{F}_{feat}$ is $D$-spanning.

Proof. If the conditions in Lemma 4 hold, then since $\mathcal{Q}_D$ is $D$-spanning, every translation invariant and permutation equivariant polynomial $p$ of degree $\le D$ can be written as
$$p(X) = \sum_{\bar{r} : |\bar{r}|_1 \le D} \hat{\Lambda}_{\bar{r}} \circ \iota \circ Q^{(\bar{r})}\left(X - \tfrac{1}{n} X 1_n 1_n^T\right) = \sum_{\bar{r} : |\bar{r}|_1 \le D} \hat{\Lambda}_{\bar{r}} \sum_{k=1}^{K_{\bar{r}}} \iota \circ \hat{A}_{k, \bar{r}} f_{k, \bar{r}}(X) = \sum_{\bar{r} : |\bar{r}|_1 \le D} \sum_{k=1}^{K_{\bar{r}}} \hat{\Lambda}_{k, \bar{r}}(f_{k, \bar{r}}(X)),$$
where we denote $\Lambda_{k, \bar{r}} = \Lambda_{\bar{r}} \circ \iota \circ A_{k, \bar{r}}$. Thus we proved that $\mathcal{F}_{feat}$ is $D$-spanning.

where $A_k : Y \to T_T$ are linear functions, $\hat{A}_k : Y^n \to T_T^n$ are induced by elementwise application, and $f_k \in \mathcal{Y}$. This notation is useful because: (i) by Lemma 4 it is sufficient to show that $Q^{(\bar{r})}(\bar{X})$ is in $\langle G_{2D,D}, T_T \rangle$ for all $\bar{r} \in \Sigma_T$ and all $T \le D$, and because (ii) it enables comparison of the expressive power of function spaces $\mathcal{Y}_1, \mathcal{Y}_2$ whose elements map to different spaces $Y_1^n, Y_2^n$, since the elements of $\langle \mathcal{Y}_i, T_T \rangle$, $i = 1, 2$, both map to the same space. In particular, note that if for every $f \in \mathcal{Y}_2$ there is a $g \in \mathcal{Y}_1$ and a linear map $A : Y_1 \to Y_2$ such that $f(X) = \hat{A} \circ g(X)$, then $\langle \mathcal{Y}_2, T_T \rangle \subseteq \langle \mathcal{Y}_1, T_T \rangle$. We now use this abstract discussion to prove some useful results. The first is that, for the purpose of this lemma, we can 'forget about' the multiplication by a unitary matrix in equation 14, used for decomposition into irreducible representations. To see this, denote by $\tilde{G}_{J,D}$ the function space obtained by taking $J$ consecutive convolutions with $D$-filters without multiplying by a unitary matrix in equation 14. Since Kronecker products of unitary matrices are unitary matrices, the elements of $G_{J,D}$ and $\tilde{G}_{J,D}$ differ only by multiplication by a unitary matrix, and thus $\langle \tilde{G}_{J,D}, T_T \rangle \subseteq \langle G_{J,D}, T_T \rangle$ and $\langle G_{J,D}, T_T \rangle \subseteq \langle \tilde{G}_{J,D}, T_T \rangle$, so both sets are equal.
Next, we prove that adding convolutional layers (enlarging $J$) or taking higher-order filters (enlarging $D$) can only increase the expressive power of a network.

Lemma D.2. For all $J, D, T \in \mathbb{N}_+$:
1. $\langle G_{J,D}, T_T \rangle \subseteq \langle G_{J+1,D}, T_T \rangle$.
2. $\langle G_{J,D}, T_T \rangle \subseteq \langle G_{J,D+1}, T_T \rangle$.

Proof. The first claim follows from the fact that every function $f$ in $\langle G_{J,D}, T_T \rangle$ can be identified with a function in $\langle G_{J+1,D}, T_T \rangle$ by taking the $(J+1)$-th convolutional layer in equation 14 with $\theta_0 = 1, F = 0$. The second claim follows from the fact that $D$-filters can be identified with $(D+1)$-filters whose $(D+1)$-th entry is $0$.

The last preliminary lemma we will need is

Lemma D.3. For every $J, D \in \mathbb{N}_+$ and every $t, s \in \mathbb{N}_+$ with $s \le D$, if $p \in \langle G_{J,D}, T_t \rangle$, then the function $q$ defined by $q_a(X) = \sum_{b=1}^n (x_a - x_b)^{\otimes s} \otimes p_b(X)$ is in $\langle G_{J+1,D}, T_{t+s} \rangle$.

Proof. This lemma is based on the fact that the space of $s$-homogeneous polynomials on $\mathbb{R}^3$ is spanned by polynomials of the form $\|x\|^{s-\ell} Y_m^{\ell}(x)$ for $\ell = s, s-2, s-4, \ldots$ (Dai & Xu, 2013). For each such $\ell$, and $s \le D$, these polynomials can be realized by filters $F^{(\ell)}$ by setting $R^{(\ell)}(\|x\|) = \|x\|^s$, so that $F_m^{(\ell)}(x) = \|x\|^s Y_m^{\ell}(x/\|x\|) = \|x\|^{s-\ell} Y_m^{\ell}(x)$. For every $D \in \mathbb{N}$ and $s \le D$, we can construct a $D$-filter $F^{s,D} = (F^{(0)}, \ldots, F^{(D)})$, where $F^{(s)}, F^{(s-2)}, \ldots$ are as defined above and the other filters are zero. Since both the entries of $F^{s,D}(x)$ and the entries of $x^{\otimes s}$ span the space of $s$-homogeneous polynomials on $\mathbb{R}^3$, it follows that there exists a linear mapping $B_s : W_{l_D} \to T_s$ such that $x^{\otimes s} = B_s(F^{s,D}(x))$ for all $x \in \mathbb{R}^3$. Thus, since $p$ can be written as a sum of compositions of linear mappings with functions in $G_{J,D}$ as in equation 22, and similarly $x^{\otimes s}$ is obtained as a linear image of functions in $G_{1,D}$ as in equation 23, we deduce that $\sum_{b=1}^n (x_a - x_b)^{\otimes s} \otimes p_b(X)$ is in $\langle G_{J+1,D}, T_{t+s} \rangle$.

As a final preliminary, we note that $D$-filters can perform an averaging operation by setting $R^{(0)} = 1$ and $\theta_0, R^{(1)}, \ldots, R^{(D)} = 0$ in equation 13 and equation 14. We call this $D$-filter an averaging filter.

We are now ready to prove our claim: we need to show that for every $D, T \in \mathbb{N}_+$ with $T \le D$ and for every $\bar{r} \in \Sigma_T$, the function $Q^{(\bar{r})}$ is in $\langle G_{2D,D}, T_T \rangle$. Note that, due to the inclusion relations in Lemma D.2, it is sufficient to prove this for the case $T = D$. We prove this by induction on $D$. For $D = 0$, the vectors $\bar{r} \in \Sigma_0$ contain only zeros, and so $Q^{(\bar{r})}(\bar{X}) = 1_n = \pi_V \circ \mathrm{ext}(X) \in \langle G_{0,0}, T_0 \rangle$. We now assume the claim is true for all $D'$ with $D > D' \ge 0$, and prove the claim for $D$. We need to show that for every $\bar{r} \in \Sigma_D$ the function $Q^{(\bar{r})}$ is in $\langle G_{2D,D}, T_D \rangle$. We prove this yet again by induction, this time on the value of $r_1$. First assume that $\bar{r} \in \Sigma_D$ with $r_1 = 0$. Denote by $r$ the vector in $\Sigma_{D-1}$ defined by $r = (r_2 - 1, r_3, \ldots, r_K)$. By the induction assumption on $D$, we know that $Q^{(r)}(\bar{X}) \in \langle G_{2(D-1), D-1}, T_{D-1} \rangle$, and so
$$q_a(X) = \sum_{b=1}^n (x_a - x_b) \otimes Q_b^{(r)}(\bar{X}) = \sum_{b=1}^n (x_a - x_b) \otimes \bar{x}_b^{\otimes(r_2-1)} \otimes \sum_{i_3, \ldots, i_K = 1}^n \bar{x}_{i_3}^{\otimes r_3} \otimes \cdots \otimes \bar{x}_{i_K}^{\otimes r_K} = \bar{x}_a \otimes \sum_{b=1}^n Q_b^{(r)}(\bar{X}) - Q^{(\bar{r})}(\bar{X})$$
is in $\langle G_{2D-1, D-1}, T_D \rangle$ by Lemma D.3, which is contained in $\langle G_{2D-1, D}, T_D \rangle$ by Lemma D.2. Since $\bar{x}_a$ has zero mean, while $Q_a^{(\bar{r})}(\bar{X})$ does not depend on $a$ (since $r_1 = 0$), applying an averaging filter to $q_a$ gives us the constant value $-Q_a^{(\bar{r})}(\bar{X})$ in each coordinate $a \in [n]$, and so $Q^{(\bar{r})}(\bar{X})$ is in $\langle G_{2D,D}, T_D \rangle$.

Now assume the claim is true for all $\bar{r} \in \Sigma_D$ which sum to $D$ and whose first coordinate is smaller than some $r_1 \ge 1$; we now prove the claim when the first coordinate of $\bar{r}$ is equal to $r_1$. The vector $r = (r_2, \ldots, r_K)$, obtained from $\bar{r}$ by removing the first coordinate, sums to $D' = D - r_1 < D$, and so by the induction hypothesis on $D$ we know that $Q^{(r)} \in \langle G_{2D', D'}, T_{D'} \rangle$. By Lemma D.3 we obtain a function $q_a \in \langle G_{2D'+1, D}, T_D \rangle \subseteq \langle G_{2D,D}, T_D \rangle$ defined by
$$q_a(X) = \sum_{b=1}^n (x_a - x_b)^{\otimes r_1} \otimes Q_b^{(r)}(\bar{X}) = \sum_{b=1}^n (x_a - x_b)^{\otimes r_1} \otimes \bar{x}_b^{\otimes r_2} \otimes \sum_{i_3, \ldots, i_K = 1}^n \bar{x}_{i_3}^{\otimes r_3} \otimes \cdots \otimes \bar{x}_{i_K}^{\otimes r_K} = Q_a^{(\bar{r})}(\bar{X}) + \text{additional terms},$$
where the additional terms are linear combinations of functions of the form $P_D Q_a^{(r')}(\bar{X})$, where $r' \in \Sigma_D$ with first coordinate $r'_1$ smaller than $r_1$, and $P_D : T_D \to T_D$ is a permutation of the tensor factors. By the induction hypothesis on $r_1$, each such $Q^{(r')}$ is in $\langle G_{2D,D}, T_D \rangle$. It follows that $P_D Q_a^{(r')}(\bar{X})$, $a = 1, \ldots, n$, and thus $Q^{(\bar{r})}(\bar{X})$, are in $\langle G_{2D,D}, T_D \rangle$ as well. This concludes the proof of Theorem 2.
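The fact underlying Lemma D.3, that the products $\|x\|^{s-\ell} Y^{\ell}_m(x)$ for $\ell = s, s-2, \ldots$ span the $s$-homogeneous polynomials on $\mathbb{R}^3$, can be checked numerically for $s = 2$: the six functions $\|x\|^2, xy, yz, xz, x^2 - y^2, 3z^2 - \|x\|^2$ (the real degree-2 solid harmonics together with $\|x\|^2 \cdot Y^0$) span the same 6-dimensional space as the monomials $x^2, y^2, z^2, xy, xz, yz$. A small verification sketch of our own, not code from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
P = rng.standard_normal((20, 3))  # random evaluation points in R^3
x, y, z = P[:, 0], P[:, 1], P[:, 2]
r2 = x**2 + y**2 + z**2

# Monomial basis of 2-homogeneous polynomials on R^3 (dimension 6).
M = np.stack([x**2, y**2, z**2, x*y, x*z, y*z], axis=1)
# ||x||^{2-l} Y^l basis: l=0 gives ||x||^2, l=2 gives the real solid harmonics.
H = np.stack([r2, x*y, y*z, x*z, x**2 - y**2, 3*z**2 - r2], axis=1)

# Both families have full rank 6, and stacking them adds no new directions,
# so they span the same 6-dimensional space of quadratic forms.
assert np.linalg.matrix_rank(M) == 6
assert np.linalg.matrix_rank(H) == 6
assert np.linalg.matrix_rank(np.hstack([M, H])) == 6
```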

E ALTERNATIVE TFN ARCHITECTURE

In this appendix we show that, replacing the standard TFN convolutional layer with the layer defined in equation 15,
$$\tilde{V}_a(X, V) = U^{(l_f, l_i)} \left( \theta_1 \sum_{b=1}^n F(x_a - x_b) \otimes V_b + \theta_2 \sum_{b=1}^n F(x_a - x_b) \otimes V_a \right),$$
we can obtain $D$-spanning networks using $2D$ consecutive convolutions with $1$-filters (that is, filters in $W_{l_1}$, where $l_1 = [0, 1]^T$). Our discussion here is somewhat informal, meant to provide the general ideas without delving into the details as we have done for the standard TFN architecture in the proof of Theorem 2. At the end of our discussion we explain what is necessary to make this argument completely rigorous.

We will only need two fixed filters for our argument here. The first is the $1$-filter $F^{Id} = (F^{(0)}, F^{(1)})$, defined by setting $R^{(0)}(\|x\|) = 0$ and $R^{(1)}(\|x\|) = \|x\|$, to obtain $F^{Id}(x) = \|x\| Y^1(x/\|x\|) = \|x\| \frac{x}{\|x\|} = x$. The second is the filter $F^1$, defined by setting $R^{(0)}(\|x\|) = 1$ and $R^{(1)}(\|x\|) = 0$, so that $F^1(x) = 1$.

We prove our claim by showing that a pair of convolutions with $1$-filters can construct any convolutional layer defined in equation 9 for the $D$-spanning architecture using tensor representations. The claim then follows from the fact that $D$ convolutions of the latter architecture suffice for achieving $D$-spanning, as shown in Lemma 3. Convolutions for tensor representations, defined in equation 9, are composed of two terms: $\tilde{V}_a^{tensor,1}(\bar{X}, V) = \bar{x}_a \otimes V_a$ and $\tilde{V}_a^{tensor,2}(\bar{X}, V) = \sum_{b=1}^n \bar{x}_b \otimes V_b$. To obtain the first term $\tilde{V}_a^{tensor,1}$, we set $\theta_1 = 0, \theta_2 = 1/n, F = F^{Id}$ in equation 15 and obtain (the decomposition into irreducibles of) $\tilde{V}_a^{tensor,1}(\bar{X}, V) = \bar{x}_a \otimes V_a$. Thus this term can in fact be expressed by a single convolution. We can leave this outcome unchanged by a second convolution, defined by setting $\theta_1 = 0, \theta_2 = 1/n, F = F^1$.
To obtain the second term $\tilde{V}_a^{tensor,2}$, we apply a first convolution with $\theta_1 = -1, F = F^{Id}, \theta_2 = 0$, to obtain
$$\sum_{b=1}^n (x_b - x_a) \otimes V_b = \sum_{b=1}^n (\bar{x}_b - \bar{x}_a) \otimes V_b = \tilde{V}_a^{tensor,2}(\bar{X}, V) - \bar{x}_a \otimes \sum_{b=1}^n V_b.$$
By applying an additional averaging filter, defined by setting $\theta_1 = \frac{1}{n}, F = F^1, \theta_2 = 0$, we obtain $\tilde{V}_a^{tensor,2}(\bar{X}, V)$. This concludes our 'informal proof'. Our discussion here has been somewhat inaccurate, since in practice $F^{Id}(x) = (0, x) \in W_0 \oplus W_1$ and $F^1(x) = (1, 0) \in W_0 \oplus W_1$. Moreover, in our proof we have glossed over the multiplication by the unitary matrix used to obtain the decomposition into irreducible representations. However, the ideas discussed here can be used to show that $2D$ convolutions with $1$-filters satisfy the sufficient condition for $D$-spanning defined in Lemma 4. See our treatment of Theorem 2 for more details.
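The two identities driving this construction, $\sum_b (x_a - x_b) = n \bar{x}_a$ and $\sum_b \bar{x}_b = 0$, can be checked directly. A numerical illustration in our own notation (not code from the paper), taking $F^{Id}(x) = x$ and flattening tensor products with `np.kron`:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
X = rng.standard_normal((3, n))           # point cloud, one column per point
V = rng.standard_normal((n, 5))           # a feature vector V_b per point
Xbar = X - X.mean(axis=1, keepdims=True)  # centralized cloud, columns xbar_b

a = 2
# First term: (1/n) * sum_b F_Id(x_a - x_b) (x) V_a  =  xbar_a (x) V_a
first = sum(np.kron(X[:, a] - X[:, b], V[a]) for b in range(n)) / n
assert np.allclose(first, np.kron(Xbar[:, a], V[a]))

# First convolution for the second term:
# sum_b (x_b - x_a) (x) V_b = sum_b xbar_b (x) V_b - xbar_a (x) sum_b V_b
lhs = sum(np.kron(X[:, b] - X[:, a], V[b]) for b in range(n))
rhs = sum(np.kron(Xbar[:, b], V[b]) for b in range(n)) - np.kron(Xbar[:, a], V.sum(axis=0))
assert np.allclose(lhs, rhs)
```

Averaging the left-hand side over $a$ then kills the $\bar{x}_a$ term, as used above.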

F COMPARISON WITH ORIGINAL TFN PAPER

In this appendix we discuss three superficial differences between the presentation of the TFN architecture in Thomas et al. (2018) and our presentation here:
1. We define convolutional layers between features residing in direct sums of irreducible representations, while Thomas et al. (2018) focuses on features which inhabit a single irreducible representation. This difference is non-essential, as direct sums of irreducible representations can be represented as multiple channels, where each feature inhabits a single irreducible representation.
2. The term $\theta_0 V_a$ in equation 14 appears in Fuchs et al. (2020), but does not appear explicitly in Thomas et al. (2018). However, it can be obtained by concatenating the input of a self-interaction layer to its output, and then applying a self-interaction layer.
3. We take the scalar functions $R^{(\ell)}$ to be polynomials, while Thomas et al. (2018) take them to be fully connected networks composed with radial basis functions. Using polynomial scalar bases is convenient for our presentation here, since it enables exact expression of equivariant polynomials. Replacing polynomial bases with fully connected networks, we obtain approximation of equivariant polynomials instead of exact expression. It can be shown that if $p$ is a $G$-equivariant polynomial which can be expressed by some network $F_{C,D}$ defined with filters coming from a polynomial scalar basis, then $p$ can be approximated on a compact set $K$, up to an arbitrary error, by a similar network with scalar functions coming from a sufficiently large fully connected network.

G TENSOR UNIVERSALITY

In this section we show how to construct the complete set $\mathcal{F}_{pool}$ of linear SO(3)-invariant functionals from $W_{feat}^T = \bigoplus_{T=0}^D T_T$ to $\mathbb{R}$. Since each such functional $\Lambda$ is of the form $\Lambda(w_0, \ldots, w_D) = \sum_{T=0}^D \Lambda_T(w_T)$, where each $\Lambda_T$ is SO(3)-invariant, it is sufficient to characterize all linear SO(3)-invariant functionals $\Lambda : T_D \to \mathbb{R}$. It will be convenient to denote $W = \mathbb{R}^3$, so that $W^{\otimes D} \cong \mathbb{R}^{3^D} = T_D$. We achieve our characterization using the bijective correspondence between linear functionals $\Lambda : W^{\otimes D} \to \mathbb{R}$ and multi-linear functions $\hat{\Lambda} : W^D \to \mathbb{R}$: each such $\Lambda$ corresponds to a unique $\hat{\Lambda}$ such that
$$\hat{\Lambda}(e_{i_1}, \ldots, e_{i_D}) = \Lambda(e_{i_1} \otimes \cdots \otimes e_{i_D}), \quad \forall (i_1, \ldots, i_D) \in [3]^D, \qquad (24)$$
where $e_1, e_2, e_3$ denote the standard basis elements of $\mathbb{R}^3$. We define a spanning set of invariant linear functionals on $W^{\otimes D}$ via a corresponding characterization for multi-linear functionals on $W^D$. Specifically, set $K_D = \{k \in \mathbb{N} \mid D - 3k \text{ is even and non-negative}\}$ (including $k = 0$, the case with no determinant factors). For $k \in K_D$ we define a multi-linear functional
$$\hat{\Lambda}_k(w_1, \ldots, w_D) = \det(w_1, w_2, w_3) \times \cdots \times \det(w_{3k-2}, w_{3k-1}, w_{3k}) \times \langle w_{3k+1}, w_{3k+2} \rangle \times \cdots \times \langle w_{D-1}, w_D \rangle, \qquad (25)$$
We note that (i) equation 24 provides a (cumbersome) way to compute all linear invariant functionals $\Lambda_{k,\sigma}$ explicitly, by evaluating the corresponding $\hat{\Lambda}_{k,\sigma}$ on the $3^D$ elements of the standard basis, and (ii) the set $\lambda_D$ is spanning, but is not linearly independent. For example, since $\langle w_1, w_2 \rangle = \langle w_2, w_1 \rangle$, the space of SO(3)-invariant functionals on $T_2 = W^{\otimes 2}$ is one-dimensional, while $|\lambda_2| = 2$.

Proof of Proposition 1. We first show that the bijective correspondence between linear functionals $\Lambda : W^{\otimes D} \to \mathbb{R}$ and multi-linear functions $\hat{\Lambda} : W^D \to \mathbb{R}$ extends to a bijective correspondence between SO(3)-invariant linear/multi-linear functionals. The action of SO(3) on $W^D$ is defined by $\rho(R)(w_1, \ldots, w_D) = (Rw_1, \ldots, Rw_D)$. Multi-linear functionals on $W^D$ invariant to $\rho$ are a subset of the set of polynomials on $W^D$ invariant to $\rho$.
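The invariance of the functionals in equation 25 follows from $\det(Rw_1, Rw_2, Rw_3) = \det(R)\det(w_1, w_2, w_3) = \det(w_1, w_2, w_3)$ and $\langle Rw, Rv \rangle = \langle w, v \rangle$ for $R \in SO(3)$. A quick numerical check of one such functional, for $D = 5$ with one determinant factor and one inner product (an illustration of ours, not code from the paper):

```python
import numpy as np
from scipy.spatial.transform import Rotation

def Lam(ws):
    # \hat{Lambda}_1 for D = 5: det(w1, w2, w3) * <w4, w5>
    return np.linalg.det(np.stack(ws[:3])) * np.dot(ws[3], ws[4])

rng = np.random.default_rng(0)
ws = [rng.standard_normal(3) for _ in range(5)]
R = Rotation.random(random_state=1).as_matrix()  # a Haar-random rotation

# SO(3)-invariance: applying R to every argument leaves the value unchanged.
assert np.isclose(Lam([R @ w for w in ws]), Lam(ws))
```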
It is known (see Kraft & Procesi (2000), page 114) that all such polynomials are algebraically generated by functions of the form $\det(w_{i_1}, w_{i_2}, w_{i_3})$ and $\langle w_{j_1}, w_{j_2} \rangle$, where $i_1, i_2, i_3, j_1, j_2 \in [D]$. Equivalently, SO(3)-invariant polynomials are spanned by linear combinations of polynomials of the form
$$\det(w_{i_1}, w_{i_2}, w_{i_3}) \det(w_{i_4}, w_{i_5}, w_{i_6}) \cdots \langle w_{j_1}, w_{j_2} \rangle \langle w_{j_3}, w_{j_4} \rangle \cdots. \qquad (27)$$
When considering the subset of multi-linear invariant polynomials, we see that they must be spanned by polynomials as in equation 27 in which each of $w_1, \ldots, w_D$ appears exactly once. These precisely correspond to the functions in $\lambda_D$.

H EXPERIMENTS

This section provides an experimental evaluation of different design choices of the TFN architecture, inspired by our theoretical analysis. We study the following questions:
1. The importance of non-linear activations. Our proof shows that non-linear activation functions are not necessary for universality. Here, we empirically test the effect of removing these layers.
2. The importance of high-dimensional irreducible representations. Our theoretical analysis shows that in order to represent or approximate high-degree polynomials, high-order representations should be used. Here, we check whether using high-order representations has practical benefits.
3. The effect of self-interaction layers. Our proof suggests that it is enough to use self-interaction linear layers at the end of the model. We empirically compare this approach with the more common approach of using self-interaction layers after each convolutional layer.

Dataset. We use the QM9 (Ramakrishnan et al., 2014) dataset for our experiments. The dataset contains 134K molecules, with 3D node positions, 5 categorical node features and 4 categorical edge features. The task is molecule property prediction, a regression task.

Framework.
We used PyTorch (Paszke et al., 2017) as the deep learning framework and the Deep Graph Library (DGL) (Wang et al., 2019a) as the graph learning framework. All experiments ran on NVIDIA GV100 GPUs.

Experimental setup. We use the TFN implementation from Fuchs et al. (2020). We trained each model variant for 50 epochs on the homo target variable using an $\ell_1$ loss function and the ADAM optimizer with learning rate $10^{-3}$, and report results on the test set at the final epoch, averaged over two runs. We used the default parameters and data splits from Fuchs et al. (2020).

Architecture. The architecture consists of 4 TFN convolutional layers, each followed by a linear self-interaction layer. We used 16 copies of each irreducible representation. We used norm-based non-linearities as in the original TFN paper (Thomas et al., 2018). These convolutional layers are followed by a max-pooling layer and two fully connected layers with 16d features in the hidden layer, where d is the maximal degree of irreducible representations used.

Figure 1: $\ell_1$ error versus maximal irreducible representation used. It is clear that the error is reduced as higher-order representations are used. Semi-transparent color represents standard deviation.

Results. Table 1 and Figure 1 present the results. The main conclusions are: (1) The experiments show that, at least for this task, using non-linear activations does not improve performance. This result fits our theoretical analysis, which shows that these layers are not needed for universality. (2) Figure 1 presents a plot of error versus the representation degrees used. The plot clearly shows that using high-dimensional representations (up to order 3) improves performance, which also fits our analysis. Using representation orders higher than 3 is significantly more time consuming, and was found to have little effect on the results (as in Fuchs et al. (2020)), though we believe this to be application-dependent.
(3) Using self-interaction layers only at the end of the model is shown to have a marginal negative effect on the results.



Since convolution layers in TFN are not linear, the non-linearities are formally redundant.
By this we mean that the maps $(g, X) \mapsto \rho_j(g)X$, $j = 1, 2$, are jointly continuous.



As a result of Theorem 1 and Lemma 1 we obtain our universality result (see inset for illustration).

Corollary 1. For all $C, D \in \mathbb{N}_+$, let $\mathcal{F}_{C,D}$ denote the function spaces generated by a pair of function spaces which are $D$-spanning and linearly universal as in equation 2. Then any continuous $G$-equivariant function in $C_G(\mathbb{R}^{3 \times n}, W_T^n)$ can be approximated uniformly on compact sets by equivariant functions in $\mathcal{F} = \bigcup_{D \in \mathbb{N}} \mathcal{F}_{C(D), D}$.

and for $(k, \sigma) \in K_D \times S_D$ we define
$$\hat{\Lambda}_{k,\sigma}(w_1, \ldots, w_D) = \hat{\Lambda}_k(w_{\sigma(1)}, \ldots, w_{\sigma(D)}). \qquad (26)$$

Proposition 1. The space of linear invariant functionals from $T_D$ to $\mathbb{R}$ is spanned by the set of linear invariant functionals $\lambda_D = \{\Lambda_{k,\sigma} \mid (k, \sigma) \in K_D \times S_D\}$ induced by the multi-linear functionals $\hat{\Lambda}_{k,\sigma}$ described in equation 25 and equation 26.
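As a concrete worked example of this construction (our own illustration, taking $k = 0$ to be allowed as the degenerate case with no determinant factors), the index sets and generators for $D = 3$ and $D = 4$ are:

```latex
% K_D = { k >= 0 : D - 3k is even and non-negative }
% D = 3:  K_3 = {1}, so, up to permutations sigma in S_3,
\hat{\Lambda}_{1,\sigma}(w_1, w_2, w_3) = \det(w_{\sigma(1)}, w_{\sigma(2)}, w_{\sigma(3)})
% D = 4:  K_4 = {0}, so, up to permutations sigma in S_4,
\hat{\Lambda}_{0,\sigma}(w_1, \ldots, w_4) =
    \langle w_{\sigma(1)}, w_{\sigma(2)} \rangle \,
    \langle w_{\sigma(3)}, w_{\sigma(4)} \rangle
```

For $D = 3$, $k = 0$ is excluded since $3$ is odd, so every invariant multi-linear functional on $T_3$ is a multiple of the determinant; for $D = 4$, $k = 1$ is excluded since $4 - 3 = 1$ is odd.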

$\rho(R)(w_1, \ldots, w_D) = (Rw_1, \ldots, Rw_D)$. The action $R \mapsto R^{\otimes D}$ of SO(3) on $W^{\otimes D}$ is such that the map $(w_1, \ldots, w_D) \mapsto w_1 \otimes w_2 \otimes \cdots \otimes w_D$ is SO(3)-equivariant. It follows that if $\Lambda$ and $\hat{\Lambda}$ satisfy equation 24, then for all $R \in SO(3)$ the same equation holds for the pair $\Lambda \circ R^{\otimes D}$ and $\hat{\Lambda} \circ \rho(R)$. Thus SO(3)-invariance of $\Lambda$ is equivalent to SO(3)-invariance of $\hat{\Lambda}$.
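The compatibility of the two actions can be illustrated numerically (our own sketch, not from the paper): the matrix of $R^{\otimes 3}$ acting on $W^{\otimes 3} \cong \mathbb{R}^{27}$ is a triple Kronecker product, and it intertwines with the map $(w_1, w_2, w_3) \mapsto w_1 \otimes w_2 \otimes w_3$:

```python
import numpy as np
from scipy.spatial.transform import Rotation

rng = np.random.default_rng(0)
w1, w2, w3 = rng.standard_normal((3, 3))        # three vectors in R^3
R = Rotation.random(random_state=0).as_matrix()  # a Haar-random rotation

def tensor(a, b, c):
    # Flattened tensor product a (x) b (x) c (works for matrices and vectors).
    return np.kron(np.kron(a, b), c)

# Equivariance: R^{(x)3} (w1 (x) w2 (x) w3) = (R w1) (x) (R w2) (x) (R w3)
lhs = tensor(R, R, R) @ tensor(w1, w2, w3)
rhs = tensor(R @ w1, R @ w2, R @ w3)
assert np.allclose(lhs, rhs)
```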

Table 1: Results obtained on the QM9 dataset (Ramakrishnan et al., 2014) for different design choices in the TFN architecture. Results are reported for the homo target variable, and are multiplied by $10^3$.


Acknowledgments The authors would like to thank Fabian B. Fuchs for making code available and Taco Cohen for helpful discussion. N.D. is supported by THEORINET Simons award 814643.


Yongheng Zhao, Tolga Birdal, Jan Eric Lenssen, Emanuele Menegatti, Leonidas Guibas, and Federico Tombari. Quaternion equivariant capsule networks for 3d point clouds. arXiv preprint arXiv:1912.12098, 2019.

A NOTATION

We introduce some notation for the proofs in the appendices. We use the shortened notation $\bar{X} = X - \frac{1}{n} X 1_n 1_n^T$ and denote the columns of $\bar{X}$ by $(\bar{x}_1, \ldots, \bar{x}_n)$.

A first step in proving denseness of $G$-equivariant polynomials, and in the proof in the next subsection, is the following simple lemma, which shows that translation invariance can be dealt with simply by centralizing the point cloud. In the following, $\rho_{W_T}$ is some representation of SO(3) on a finite-dimensional real vector space $W_T$; this induces an action of $SO(3) \times S_n$ on $W_T^n$. This is also the action of $G$ which we consider, $\rho_G = \rho_{W_T \times S_n}$, where we have invariance with respect to the translation coordinate. The action of $G$ on $\mathbb{R}^{3 \times n}$ is defined in equation 1.

Lemma B.1. A function $f : \mathbb{R}^{3 \times n} \to W_T^n$ is $G$-equivariant if and only if there exists a function $h$ which is equivariant with respect to the action of $SO(3) \times S_n$ on $\mathbb{R}^{3 \times n}$, and
$$f(X) = h\left(X - \tfrac{1}{n} X 1_n 1_n^T\right). \qquad (16)$$

Proof. Recall that $G$-equivariance means $SO(3) \times S_n$ equivariance together with translation invariance. Thus if $f$ is $G$-equivariant, then equation 16 holds with $h = f$. On the other hand, if $f$ satisfies equation 16, then we claim it is $G$-equivariant. Indeed, for all $(t, R, P) \in G$, the centralizing map removes the translation, so that $f(R X P + t 1_n^T) = h(R \bar{X} P) = \rho_{W_T \times S_n}(R, P) h(\bar{X}) = \rho_{W_T \times S_n}(R, P) f(X)$.

We now prove denseness of $G$-equivariant polynomials in the space of $G$-equivariant continuous functions (Lemma 1).

Proof of Lemma 1. Let $K \subseteq \mathbb{R}^{3 \times n}$ be a compact set. We need to show that continuous $G$-equivariant functions can be approximated uniformly on $K$ by $G$-equivariant polynomials. Let $K_0$ denote the compact set which is the image of $K$ under the centralizing map $X \mapsto X - \frac{1}{n} X 1_n 1_n^T$. By Lemma B.1, it is sufficient to show that every $SO(3) \times S_n$ equivariant continuous function $f$ can be approximated uniformly on $K_0$ by a sequence of $SO(3) \times S_n$ equivariant polynomials $p_k$. The argument is concluded by the general lemma above (Lemma B.2).

where $\tilde{\Lambda}_k$ stands for the equivariant linear function from $W_{feat}$ to $W_T$, defined for $w \in W_{feat}$ by $\tilde{\Lambda}_k(w) = \int_{SO(3)} \rho_2(R^{-1}) \circ \Lambda_k(\rho_1(R) w) \, d\nu(R)$. Thus we have shown that $p$ is in $\mathcal{F}_C(\mathcal{F}_{feat}, \mathcal{F}_{pool})$ for $C = K$, as required.
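The centralizing step in Lemma B.1 can be sanity-checked numerically (a standalone illustration of ours, not code from the paper): the map $X \mapsto \bar{X} = X - \frac{1}{n} X 1_n 1_n^T$ kills translations and commutes with rotations and permutations:

```python
import numpy as np
from scipy.spatial.transform import Rotation

def centralize(X):
    # X - (1/n) X 1_n 1_n^T : subtract the centroid from every column.
    return X - X.mean(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n = 5
X = rng.standard_normal((3, n))                  # point cloud
t = rng.standard_normal((3, 1))                  # translation vector
R = Rotation.random(random_state=0).as_matrix()  # rotation
P = np.eye(n)[rng.permutation(n)]                # permutation matrix

# Translation invariance: centralize(X + t 1^T) = centralize(X).
assert np.allclose(centralize(X + t), centralize(X))
# SO(3) x S_n equivariance: centralize(R X P) = R centralize(X) P.
assert np.allclose(centralize(R @ X @ P), R @ centralize(X) @ P)
```

Hence any $SO(3) \times S_n$ equivariant $h$ composed with centralization is automatically $G$-equivariant, as the lemma asserts.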

C PROOFS FOR SECTION 4

We prove Lemma 2.

Lemma 2. For every $D \in \mathbb{N}_+$, the set $\mathcal{Q}_D$ is $D$-spanning.

Proof. It is known (Segol & Lipman (2019), Theorem 2) that polynomials $p : \mathbb{R}^{3 \times n} \to \mathbb{R}^n$ which are $S_n$-equivariant are spanned by polynomials of the form $p_\alpha = (p_\alpha^j)_{j=1}^n$, defined as
$$p_\alpha^j(X) = x_j^{\alpha_1} \sum_{i_2, \ldots, i_K = 1}^n x_{i_2}^{\alpha_2} \cdots x_{i_K}^{\alpha_K},$$
where $\alpha = (\alpha_1, \ldots, \alpha_K)$, each $\alpha_k \in \mathbb{N}^3$ is a multi-index, and $x^{\alpha}$ denotes the monomial $x(1)^{\alpha(1)} x(2)^{\alpha(2)} x(3)^{\alpha(3)}$. It follows that $S_n$-equivariant polynomials of degree $\le D$ are spanned by polynomials of the form $p_\alpha^j$ with $\sum_{k=1}^K |\alpha_k| \le D$. Denoting $r_k = |\alpha_k|$, the sum of all $r_k$ by $T$, and $\bar{r} = (r_k)_{k=1}^K$, we see that there exists a linear functional $\Lambda_{\alpha, \bar{r}} : T_T \to \mathbb{R}$ such that $p_\alpha^j(X) = \Lambda_{\alpha, \bar{r}}(Q_j^{(\bar{r})}(X))$, where we recall that $Q^{(\bar{r})}$ is defined in equation 6 as
$$Q_j^{(\bar{r})}(X) = \sum_{i_2, \ldots, i_K = 1}^n x_j^{\otimes r_1} \otimes x_{i_2}^{\otimes r_2} \otimes \cdots \otimes x_{i_K}^{\otimes r_K}.$$
Thus polynomials $p = (p^j)_{j=1}^n$ which are of degree $\le D$ and are $S_n$-equivariant can be written as linear images of the functions $\iota_T^{-1} \circ \iota \circ Q^{(\bar{r})}$ with $|\bar{r}|_1 \le D$, where $\iota_T^{-1}$ is the left inverse of the embedding $\iota$. If $p$ is also translation invariant, then $p(X) = p(\bar{X})$, so $p$ is a linear image of the functions $\iota_T^{-1} \circ \iota \circ Q^{(\bar{r})}(\bar{X})$, which proves that $\mathcal{Q}_D$ is $D$-spanning.

We prove Lemma 3.

Lemma 3. The function set $\mathcal{Q}_D$ is contained in (the linear image of) $\mathcal{F}_{feat}$; thus $\mathcal{F}_{feat}$ is $D$-spanning.

Proof. In this proof we make the dependence of $\mathcal{F}_{feat}$ on $D$ explicit and denote it $\mathcal{F}_{feat}(D)$. We prove the claim by induction on $D$. Assume $D = 0$. Then $\mathcal{Q}_0$ contains only the constant function $X \mapsto 1_n \in T_0^n$, and this is precisely the function $\pi_V \circ \mathrm{ext} \in \mathcal{F}_{feat}(0)$. Now assume the claim holds for all $D'$ with $D - 1 \ge D' \ge 0$; we prove the claim for $D$. Choose $\bar{r} = (r_1, \ldots, r_K) \in \Sigma_T$ for some $T \le D$; we need to show that the function $Q^{(\bar{r})}$ is in $\mathcal{F}_{feat}(D)$. Since $\mathcal{F}_{feat}(D-1) \subseteq \mathcal{F}_{feat}(D)$, we know from the induction hypothesis that this is true if $T < D$. Now assume $T = D$. We consider two cases:

D PROOFS FOR SECTION 5

We prove Lemma 5.

Lemma 5. Let $l^{(1)} = (l_1^{(1)}, \ldots, l_{K_1}^{(1)})$ and $l^{(2)} = (l_1^{(2)}, \ldots, l_{K_2}^{(2)})$. A linear map $\Lambda$ is an equivariant mapping between $W_{l^{(1)}}$ and $W_{l^{(2)}}$ if and only if there exist scalars $a_{ij}$ such that the restriction of $\Lambda$ to each summand $W_{l_i^{(1)}}$, followed by projection onto each summand $W_{l_j^{(2)}}$, equals $a_{ij}$ times the identity when $l_i^{(1)} = l_j^{(2)}$ and zero otherwise, for all $i = 1, \ldots, K_1$.

Proof. As mentioned in the main text, this lemma is based on Schur's lemma. This lemma is typically stated for complex representations, but it holds for odd-dimensional real representations as well. We recount the lemma and its proof here for completeness (see also Fulton & Harris (2013)).

Lemma D.1 (Schur's Lemma for SO(3)).
Let $\Lambda : W_{\ell_1} \to W_{\ell_2}$ be a linear equivariant map. If $\ell_1 \ne \ell_2$ then $\Lambda = 0$. Otherwise, $\Lambda$ is a scalar multiple of the identity.

Proof. Let $\Lambda : W_{\ell_1} \to W_{\ell_2}$ be a linear equivariant map. The kernel and image of $\Lambda$ are invariant subspaces of $W_{\ell_1}$ and $W_{\ell_2}$, respectively. It follows that if $\Lambda \ne 0$ then $\Lambda$ is a linear isomorphism, so necessarily $\ell_1 = \ell_2$. Now assume $\ell_1 = \ell_2$. Since the dimension of $W_{\ell_1}$ is odd, $\Lambda$ has a real eigenvalue $\lambda$. The linear function $\Lambda - \lambda I$ is equivariant and has a non-trivial kernel, so $\Lambda - \lambda I = 0$.

We now return to the proof of Lemma 5. Note that each $\Lambda_j : W_{l^{(1)}} \to W_{l_j^{(2)}}$ is linear and SO(3)-equivariant. Next, denote the restrictions of each $\Lambda_j$ to $W_{l_i^{(1)}}$, $i = 1, \ldots, K_1$, by $\Lambda_{ij}$. By considering vectors in $W_{l^{(1)}}$ of the form $(0, \ldots, 0, V_i, 0, \ldots, 0)$, we see that each $\Lambda_{ij} : W_{l_i^{(1)}} \to W_{l_j^{(2)}}$ is linear and SO(3)-equivariant. Thus, by Schur's lemma, $\Lambda_{ij} = 0$ if $l_i^{(1)} \ne l_j^{(2)}$, and $\Lambda_{ij} = a_{ij} I$ for some scalar $a_{ij}$ if $l_i^{(1)} = l_j^{(2)}$.

If $\mathcal{Y}$ is a space of functions from $\mathbb{R}^{3 \times n}$ to $Y^n$, we denote by $\langle \mathcal{Y}, T_T \rangle$ the space of all functions $p : \mathbb{R}^{3 \times n} \to T_T^n$ of the form $p(X) = \sum_{k=1}^K \hat{A}_k \circ f_k(X)$,

