UNIVERSAL APPROXIMATION THEOREM FOR EQUIVARIANT MAPS BY GROUP CNNS

Anonymous

Abstract

Group symmetry is inherent in a wide variety of data distributions. Data processing that preserves symmetry is described by an equivariant map, and enforcing equivariance is often effective for achieving high performance. Convolutional neural networks (CNNs) are known to be equivariant models and have been shown to approximate equivariant maps for some specific groups. However, universal approximation theorems for CNNs have so far been derived separately, with individual techniques for each group and setting. This paper provides a unified method to obtain universal approximation theorems for equivariant maps by CNNs in various settings. As a significant advantage, we can handle non-linear equivariant maps between infinite-dimensional spaces for non-compact groups.

1. INTRODUCTION

Deep neural networks have been widely used as models to approximate underlying functions in various machine learning tasks. The expressive power of fully-connected deep neural networks was first mathematically guaranteed by the universal approximation theorem of Cybenko (1989), which states that any continuous function on a compact domain can be approximated to any precision by an appropriate neural network with sufficient width and depth. Beyond this classical result, several variants of the universal approximation theorem have been investigated under different conditions.

Among the wide variety of deep neural networks, convolutional neural networks (CNNs) have achieved impressive performance in real applications. In particular, almost all state-of-the-art models for image recognition are based on CNNs. These successes are closely related to the property that CNNs commute with translations on pixel coordinates; that is, CNNs preserve the translation symmetry of image data. In general, this kind of property is known as equivariance, which is a generalization of invariance. When a data distribution has some symmetry and the task to be solved relates to that symmetry, data processing is desired to be equivariant with respect to it. In recent years, different types of symmetry have been considered for different tasks, and it has been proven that CNNs can approximate arbitrary equivariant data processing for specific symmetries. These results are mathematically captured as universal approximation theorems for equivariant maps and establish the theoretical validity of the use of CNNs.

In order to handle symmetric structures in a theoretically correct way, we have to carefully consider the structure of the data space on which data distributions are defined. For example, in image recognition tasks, image data are often supposed to have translation symmetry. When an image is acquired, an image sensor with finitely many pixels produces a finite-dimensional vector in a Euclidean space R^d, where d is the number of pixels. However, the finiteness of pixels stems from the limits of the image sensor, and the raw scene behind the image is naturally modeled as an element of R^S with continuous spatial coordinates S, where R^S is the set of functions from S to R. The element of R^S is then regarded as a functional representation of the image data in R^d. In this paper, in order to appropriately formulate data symmetry, we treat both the typical data representation in finite-dimensional settings and the functional representation in infinite-dimensional settings in a unified manner.

1.1. RELATED WORKS

Symmetry and functional representation. Symmetry is mathematically described in terms of groups and has become an essential concept in machine learning. Gordon et al. (2019) point out that, when data symmetry is represented by an infinite group like the translation group, equivariant maps, i.e., symmetry-preserving processing, cannot be captured as maps between finite-dimensional spaces but can be described by maps between infinite-dimensional function spaces. As a related study on symmetry-preserving processing, Finzi et al. (2020) propose group convolution of functional representations and investigate practical computational methods such as discretization and localization.

Universal approximation for continuous maps. The universal approximation theorem, which is the main objective of this paper, is one of the most classical mathematical results on neural networks. It states that a feedforward fully-connected network (FNN) with a single hidden layer containing finitely many neurons can approximate any continuous function on a compact subset of R^d. Cybenko (1989) proved this theorem for the sigmoid activation function. After his work, several researchers generalized the sigmoidal function to larger classes of activation functions, e.g., Barron (1994), Hornik et al. (1989), Funahashi (1989), Kůrková (1992) and Sonoda & Murata (2017). These results concern maps between finite-dimensional vector spaces, but recently Guss & Salakhutdinov (2019) generalized them to continuous maps between infinite-dimensional function spaces.

Equivariant neural networks. The concept of group-invariant neural networks was first introduced by Shawe-Taylor (1989) in the case of permutation groups. In addition to the invariant case, Zaheer et al. (2017a) designed group-equivariant neural networks for permutation groups and obtained excellent results in many applications. Maron et al. (2019a; 2020) consider and develop a theory of equivariant tensor networks for general finite groups. Petersen & Voigtlaender (2020) established a connection between group CNNs, which are equivariant networks, and FNNs for finite groups. However, symmetry is not limited to finite groups. CNNs were designed to be equivariant to translation groups and achieved impressive performance in a wide variety of tasks. Gens & Domingos (2014) proposed architectures based on CNNs that are invariant to more general groups, including affine groups. Motivated by CNNs' experimental success, many researchers have further generalized them using group theory. Kondor & Trivedi (2018) proved that, when a group is compact and the group action is transitive, a neural network constrained by a certain homogeneous structure is equivariant if and only if it is a group CNN.

Universal approximation for equivariant maps. Compared to the vast literature on universal approximation for continuous maps, there are few existing studies on universal approximation for equivariant maps. Sannai et al. (2019), Ravanbakhsh (2020) and Keriven & Peyré (2019) considered equivariant models for finite groups and proved their universal approximation property by reducing it to the results of Maron et al. (2019b). Cohen et al. (2019) considered group convolution on a homogeneous space and proved that a linear equivariant map is always convolution-like. Yarotsky (2018) proved universal approximation theorems for non-linear equivariant maps by CNN-like models when the group is the d-dimensional translation group T(d) = R^d or the 2-dimensional Euclidean group SE(2). However, for more general groups, universal approximation theorems for non-linear equivariant maps have not been obtained.

1.2. PAPER ORGANIZATION AND OUR CONTRIBUTIONS

The paper is organized as follows. In Section 2, we introduce the definition of group equivariant maps and provide the essential property that equivariant maps are in one-to-one correspondence with theoretically tractable maps called generators. In Section 3, we define fully-connected and group convolutional neural networks between function spaces; this formulation is suitable for representing data symmetry. Then, we provide the main theorem, called the conversion theorem, which converts FNNs to CNNs. In Section 4, using the conversion theorem, we derive universal approximation theorems for non-linear equivariant maps by group CNNs. In particular, this is the first universal approximation theorem for equivariant maps in infinite-dimensional settings. We note that finite and infinite groups are handled in a unified manner. In Section 5, we provide concluding remarks and mention future work.

2.1. PRELIMINARIES

We introduce definitions and terminology used in the later discussion.

Functional representation. In this paper, sets denoted by S, T and G are assumed to be locally compact, σ-compact, Hausdorff spaces. For a set S, we denote by R^S the set of all maps from S to R and by ‖·‖_∞ the supremum norm. We call S the index set of R^S. We denote by C(S) the set of all continuous maps from S to R, and by C_0(S) the set of continuous functions from S to R which vanish at infinity. For a Borel space S with a measure µ, we denote by L^1_µ(S) the set of µ-integrable functions from S to R. For a subset B ⊂ S, the restriction map R_B : R^S → R^B is defined by R_B(x) = x|_B, where x ∈ R^S and x|_B is the restriction of the domain of x to B. When S is a finite set, R^S is identified with the finite-dimensional Euclidean space R^{|S|}, where |S| is the cardinality of S. In this sense, R^S for a general set S is a generalization of a Euclidean space. However, R^S itself is often intractable for an infinite set S. In such cases, we instead consider C(S), C_0(S) or L^p(S) as relatively tractable subspaces of R^S.

Group action. We denote the identity element of a group G by 1. We assume that the action of a group G on a set S is continuous, and denote by g • s the left action of g ∈ G on s ∈ S. We call G_s := {g • s | g ∈ G} the orbit of s ∈ S. From the definition, we have S = ∪_{s∈S} G_s. When a subset B ⊂ S is a set of representative elements, one from each orbit, it satisfies the disjoint decomposition S = ⊔_{s∈B} G_s. Then, we call B a base space and define the projection P_B : S → B by mapping s ∈ S to the representative element in B ∩ G_s. When a group G acts on sets S and T, the action of G on the product space S × T is defined by g • (s, t) := (g • s, g • t). When a group G acts on an index set S, the G-translation operators T_g : R^S → R^S for g ∈ G are defined by T_g[x](s) := x(g^{-1} • s), where x ∈ R^S and s ∈ S. We often denote T_g[x] simply by g • x for brevity. Group translation then determines an action of G on R^S.
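As a concrete finite-dimensional illustration of the translation operators T_g[x](s) = x(g^{-1} • s), the following sketch takes G = S = Z_n, the cyclic group acting on itself (the group, signal, and function name are illustrative choices, not part of the paper's formal setup):

```python
import numpy as np

def translate(x, g):
    """G-translation T_g for G = S = Z_n acting on itself:
    T_g[x](s) = x(g^{-1} . s) = x((s - g) mod n)."""
    n = len(x)
    s = np.arange(n)
    return x[(s - g) % n]

x = np.array([1.0, 2.0, 3.0, 4.0])
# Composition law T_g o T_g' = T_{g'g} (Z_n is abelian, so the order is immaterial)
assert np.allclose(translate(translate(x, 1), 2), translate(x, 1 + 2))
```

Note that `translate(x, 1)` cyclically shifts the entries one step to the right, matching the intuition of shifting an image along its pixel coordinates.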

2.2. GROUP EQUIVARIANT MAPS

In this section, we introduce group equivariant maps and show their basic properties. First, we define group equivariance.

Definition 1 (Group Equivariance). Suppose that a group G acts on sets S and T. A map F : R^S → R^T is called G-equivariant when F[g • x] = g • F[x] holds for any g ∈ G and x ∈ R^S.

An example of an equivariant map in image processing is provided in Figure 1. To clarify the degrees of freedom of equivariant maps, we define the generator of an equivariant map.

Definition 2 (Generator). Let B ⊂ T be a base space with respect to the action of G on T. For a G-equivariant map F : R^S → R^T, we call F_B := R_B ∘ F the generator of F.

The following theorem shows that an equivariant map is determined by its generator.

Theorem 3 (Degrees of Freedom of Equivariant Maps). Let a group G act on sets S and T, and let B ⊂ T be a base space. Then, a G-equivariant map F : R^S → R^T is in one-to-one correspondence with its generator F_B.

A detailed version of Theorem 3 is proved in Section A.1.

Figure 1: An example of an equivariant map from RGB images to gray-scale images. An RGB image x is represented by values (i.e., a function) on 2-dimensional spatial coordinates with RGB channels. This corresponds to the case where the index set is S = R^{2×3} = R^6. Similarly, a gray-scale image F[x] after equivariant processing F : R^S → R^T is represented by values on 2-dimensional spatial coordinates with a single gray-scale channel. This corresponds to the case where the index set is T = R^2. In this figure, the group action is the translation action of G = R^2 on the 2-dimensional spatial coordinates.
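The one-to-one correspondence of Theorem 3 can be made concrete in a small finite example. The sketch below assumes G = S = T = Z_4 acting on itself by translation, with an illustrative equivariant map F (a circular convolution with an arbitrary kernel v); since the action on T is transitive, the base space is the singleton B = {0}, and equivariance pins down the whole map via F[x](g) = F_B(g^{-1} • x):

```python
import numpy as np

def translate(x, g):
    # T_g[x](s) = x((s - g) mod n) for the cyclic group Z_n
    n = len(x)
    s = np.arange(n)
    return x[(s - g) % n]

v = np.array([1.0, -2.0, 0.5, 0.0])  # illustrative kernel

def F(x):
    # A Z_4-equivariant map: circular convolution F[x](g) = sum_h v((g - h) mod n) x(h)
    n = len(x)
    return np.array([sum(v[(g - h) % n] * x[h] for h in range(n)) for g in range(n)])

def generator(x):
    # F_B = R_B o F with base space B = {0}
    return F(x)[0]

x = np.array([0.3, -1.0, 2.0, 0.7])
# Reconstruct the whole output of F from its generator alone
reconstructed = np.array([generator(translate(x, -g)) for g in range(4)])
assert np.allclose(reconstructed, F(x))
```

The reconstruction succeeds precisely because F is equivariant; for a non-equivariant map, the generator discards information.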

3.1. FULLY-CONNECTED NEURAL NETWORKS

To define neural networks, we introduce some notions. A map A : R^S → R^T is called a bounded affine map if there exist a bounded linear map W : R^S → R^T and an element b ∈ R^T such that

A[x] = W[x] + b.    (1)

Guss & Salakhutdinov (2019) provide the following lemma, which is useful for handling bounded affine maps.

Lemma 4 (Integral Form, Guss & Salakhutdinov (2019)). Suppose that S and T are locally compact, σ-compact, Hausdorff, measurable spaces. For a bounded linear map W : C(S) → C(T), there exist a Borel regular measure µ on S and a weak-* continuous family of functions {w(t, ·)}_{t∈T} ⊂ L^1_µ(S) such that the following holds for any x ∈ C(S):

W[x](t) = ∫_S w(t, s) x(s) dµ(s).

To use the integral form, we assume in the following that the input and output spaces of A are the classes of continuous maps C(S) and C(T) instead of R^S and R^T, respectively. Using the integral form, a bounded affine map A is represented by

A_{µ,w,b}[x](t) = ∫_S w(t, s) x(s) dµ(s) + b(t).    (2)

In particular, when S and T are finite sets with cardinality d and d′, the function spaces C(S) and C(T) are identified with the finite-dimensional Euclidean spaces R^d and R^{d′}; thus, an affine map A : R^d → R^{d′} is parameterized by a weight matrix W = [w(t, s)]_{s∈[d], t∈[d′]} and a bias vector b = [b(t)]_{t∈[d′]} ∈ R^{d′}, and (2) reduces to the following form, which is often used in the neural network literature:

A[x](t) = Σ_{s=1}^{d} w(t, s) x(s) + b(t).    (3)

A continuous function ρ : R → R induces the activation map α_ρ : C(S) → C(S) defined by α_ρ(x) := ρ ∘ x ∈ C(S) for x ∈ C(S). For brevity, we denote α_ρ by ρ. We can then define fully-connected neural networks in general settings.

Definition 5 (Fully-connected Neural Networks). Let L ∈ N. A fully-connected neural network with L layers is a composition of bounded affine maps (A_1, . . . , A_L) and an activation map ρ, represented by

ϕ := A_L ∘ ρ ∘ A_{L-1} ∘ · · · ∘ ρ ∘ A_1,    (4)

where A_ℓ : C(S_{ℓ-1}) → C(S_ℓ) are affine maps for some sequence of sets {S_ℓ}_{ℓ=0}^{L}. We denote by N_FNN(ρ, L; S_0, S_L) the set of all fully-connected neural networks from C(S_0) to C(S_L) with L layers and activation function ρ. We denote by µ_ϕ the measure of the affine map A_1 in the first layer of a fully-connected neural network ϕ. This measure µ_ϕ is used to state a condition in the main theorem (Theorem 9).
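In the finite-index case, Definition 5 reduces to the familiar matrix form of equations (3) and (4). A minimal sketch for L = 2, with illustrative layer sizes and random weights:

```python
import numpy as np

rng = np.random.default_rng(0)
d, hidden, d_out = 3, 8, 2  # illustrative sizes
W1, b1 = rng.normal(size=(hidden, d)), rng.normal(size=hidden)
W2, b2 = rng.normal(size=(d_out, hidden)), rng.normal(size=d_out)

def affine(x, W, b):
    # Finite-index affine map of eq. (3): A[x](t) = sum_s w(t, s) x(s) + b(t)
    return W @ x + b

def phi(x, rho=np.tanh):
    # phi := A_2 o rho o A_1, eq. (4) with L = 2
    return affine(rho(affine(x, W1, b1)), W2, b2)

y = phi(np.array([1.0, 0.0, -1.0]))
assert y.shape == (d_out,)
```

In the infinite-dimensional setting, the matrix product `W @ x` is replaced by the integral of Lemma 4, discretized in practice as a weighted sum over sample points of S.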

3.2. GROUP CONVOLUTIONAL NEURAL NETWORKS

We introduce the general form of group convolution.

Definition 6 (Group Convolution). Suppose that a group G acts on sets S and T. For a G-invariant measure ν on S, a G-invariant function v : S × T → R and b ∈ C(T), the biased G-convolution C_{ν,v,b} : C(S) → C(T) is defined as

C_{ν,v,b}[x](t) := ∫_S v(t, s) x(s) dν(s) + b(t).    (5)

On the right-hand side, we call the first term the G-convolution and the second term the bias term. In the following, we denote C_{ν,v,b} by C for brevity. When S and T are finite, we note that (5) can also be represented in the form (3).

Definition 6 includes existing definitions of group convolution as follows. When S = T = G, the group G acts on S and T by left translations. Then, (5) without the bias term (i.e., b = 0) is described as

C[x](g) = ∫_G v(g, h) x(h) dν(h) = ∫_G ṽ(h^{-1} g) x(h) dν(h),

where ṽ(g) := v(g, 1). This is a popular definition of group convolution between two functions on G. Further, when S = G × B and T = G × B′, (5) without the bias term is described as

C[x](g, t) = ∫_{G×B} v((g, τ), (h, ς)) x(h, ς) dν(h, ς) = ∫_{G×B} ṽ(h^{-1}g, τ, ς) x(h, ς) dν(h, ς),

where ṽ(g, τ, ς) := v((g, τ), (1, ς)). This coincides with the definition of group convolution in Finzi et al. (2020). We note that Finzi et al. (2020) also propose discretization and localization of the above group convolution for implementation. In conventional convolution used for image recognition, G represents spatial information such as pixel coordinates, B and B′ correspond to channels in consecutive layers ℓ and ℓ+1, respectively, and v corresponds to a filter. In applications, the filter v is expected to have compact support or be short-tailed on G, as with a 3 × 3 filter in discrete convolution. In particular, when v is allowed to be the Dirac delta or highly peaked around a single point in G, the convolution can be interpreted as a 1 × 1 convolution. We then define group convolutional neural networks as follows.

Definition 7 (Group Convolutional Neural Networks). Let L ∈ N. A G-convolutional neural network with L layers is a composition of biased convolutions C_ℓ : C(S_{ℓ-1}) → C(S_ℓ) (ℓ = 1, . . . , L) for some sequence of spaces {S_ℓ}_{ℓ=0}^{L} and an activation map ρ:

Φ := C_L ∘ ρ ∘ C_{L-1} ∘ · · · ∘ ρ ∘ C_1.    (6)

We denote by N_CNN(G, ρ, L; S_0, S_L) the set of all G-convolutional neural networks from C(S_0) to C(S_L) with respect to a group G, with L layers and a fixed activation function ρ. The following proposition is easily verified.

Proposition 8. A G-convolutional neural network is G-equivariant. In particular, each biased G-convolution C_{ν,v,b} is G-equivariant.

Conversely, Cohen et al. (2019) showed that a G-equivariant linear map is represented by some G-convolution without the bias term when G is locally compact and unimodular and the action of the group is transitive (i.e., B consists of a single element).
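Proposition 8 can be checked numerically in the case S = T = G for a small non-abelian group. The sketch below assumes G = S_3 with the counting measure as the invariant measure ν, so that (5) becomes C[x](g) = Σ_h ṽ(h^{-1}g) x(h); the filter and signal values are arbitrary illustrative choices:

```python
import itertools
import numpy as np

# Elements of S_3 as permutation tuples; group law (p*q)(i) = p[q[i]]
G = list(itertools.permutations(range(3)))
idx = {g: i for i, g in enumerate(G)}

def mul(p, q):
    return tuple(p[q[i]] for i in range(3))

def inv(p):
    q = [0] * 3
    for i, pi in enumerate(p):
        q[pi] = i
    return tuple(q)

rng = np.random.default_rng(0)
v = rng.normal(size=len(G))  # filter v~ : G -> R
x = rng.normal(size=len(G))  # signal x : G -> R

def conv(x):
    # C[x](g) = sum_h v~(h^{-1} g) x(h), counting measure on G
    return np.array([sum(v[idx[mul(inv(h), g)]] * x[idx[h]] for h in G) for g in G])

def act(g0, x):
    # left translation: (g0 . x)(h) = x(g0^{-1} h)
    return np.array([x[idx[mul(inv(g0), h)]] for h in G])

g0 = G[3]
# Proposition 8: convolution commutes with the group action
assert np.allclose(conv(act(g0, x)), act(g0, conv(x)))
```

The same check fails for a generic bivariate kernel v(g, h) that is not G-invariant, which is exactly the constraint distinguishing a convolution layer from a fully-connected layer.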

3.3. CONVERSION THEOREM

In this section, we introduce the main theorem (Theorem 9), which is the essential ingredient for obtaining universal approximation theorems for equivariant maps by group CNNs.

Theorem 9 (Conversion Theorem). Suppose that a group G acts on sets S and T. We assume the following condition:

(C1) there exist base spaces B_S ⊂ S, B_T ⊂ T, and two subgroups H_T ⩽ H_S ⩽ G such that S = G/H_S × B_S and T = G/H_T × B_T.

Further, suppose E ⊂ C_0(S) is compact and an FNN ϕ : E → C_0(B_T) with a Lipschitz activation function ρ satisfies

(C2) there exists a G-left-invariant locally finite measure ν on S such that µ_ϕ ≪ ν.

Then, for any ϵ > 0, there exists a CNN Φ : E → C_0(T) with the activation function ρ such that the number of layers of Φ equals that of ϕ and

‖R_{B_T} ∘ Φ − ϕ‖_∞ ≤ ϵ.    (7)

Moreover, for any G-equivariant map F : C_0(S) → C_0(T), the following holds:

‖F|_E − Φ‖_∞ ≤ ‖F_{B_T}|_E − ϕ‖_∞ + ϵ.    (8)

We provide the proof of Theorem 9 in Section B.

Conversion of universal approximation theorems. The conversion theorem converts a universal approximation theorem by FNNs into a universal approximation theorem for equivariant maps by CNNs as follows. Suppose that some universal approximation theorem by FNNs guarantees the existence of an FNN ϕ satisfying ‖F_{B_T}|_E − ϕ‖_∞ ≤ ϵ. Then, Theorem 9 guarantees the existence of a CNN Φ satisfying ‖F|_E − Φ‖_∞ ≤ 2ϵ. In other words, if an FNN can approximate the generator of the target equivariant map on E, then there exists a CNN which approximates the whole equivariant map on E.

Applicable cases. The conversion theorem can be applied to a wide range of group actions. We explain its generality. First, the sets S and T are not limited to finite sets or Euclidean spaces and may be more general topological spaces. Second, the group G may be discrete (in particular, finite) or continuous; moreover, G can be non-compact and non-commutative.
Third, the action of the group G on S and T need not be transitive, and thus the sets can be non-homogeneous spaces. Concrete examples of group actions in the setting where S = T and the actions of G on S and T coincide (symmetric, rotation, translation, Euclidean, scaling and Lorentz groups) are itemized below.

Inapplicable cases. We explain some cases where the conversion theorem cannot be applied. First, as above, we consider the setting where S = T and the actions of G on S and T are the same. We note that, even if the actions of G_1 and G_2 on S satisfy the conditions of the conversion theorem, a common invariant measure for both G_1 and G_2 may not exist; then a group G including G_1 and G_2 as subgroups does not satisfy (C2). For example, there is no common invariant measure for the actions of translation and scaling on a Euclidean space. In particular, the action of the general linear group GL(d) on R^d does not admit a locally finite left-invariant measure, so the conversion theorem cannot be applied to this case. Next, as seen above, our model can handle convolutions on permutation groups, but not on general finite groups; this depends on whether [n] can be represented as a quotient of G, as we will see later. The same limitation applies to tensor representations of permutations, which require a different formulation.

Lastly, we consider the case where the actions of G on S and T differ; here, S and T may or may not be equal. As a representative case, we consider the invariant case: when the stabilizer in T satisfies H_T = G, a G-equivariant map F : C_0(S) → C_0(T) is said to be G-invariant. However, because of the condition H_T ⩽ H_S in (C1), the conversion theorem cannot apply to the invariant case unless H_S = G. This kind of restriction is similar to existing studies, where the invariant case is handled separately from the equivariant case (Keriven & Peyré (2019); Maehara & NT (2019); Sannai et al. (2019)).
In fact, we can show that the inequality (7) never holds for non-trivial invariant cases (i.e., H_S ≠ G and H_T = G) as follows. From H_T = G, we have B_T = T and R_{B_T} = id, and thus (7) reduces to ‖Φ − ϕ‖_∞ ≤ ϵ. Here, ϕ is an FNN, which is not invariant in general, whereas Φ is a CNN, which is invariant. Thus, Φ cannot approximate a non-invariant ϕ within a small error ϵ, which implies that (7) does not hold for small ϵ. However, whether (8) holds in the invariant case is an open problem.

Remarks on conditions (C1) and (C2). We now discuss the conditions (C1) and (C2). In (C1), the subgroup H_S ⩽ G (resp. H_T) represents the stabilizer group of the action of G on S (resp. T). Thus, (C1) requires that the stabilizer group of every point in S (resp. T) is isomorphic to the common subgroup H_S (resp. H_T). When the group action satisfies some moderate conditions, this requirement is known to be satisfied for most points of the set. As a theoretical result, the principal orbit type theorem (cf. Theorem 1.32, Meinrenken (2003)) guarantees that, if the action of G on a manifold S is proper and S/G is connected, there exist a dense subset S′ ⊂ S and a subgroup H_S ⊂ G, called a principal stabilizer, such that the stabilizer group of every point in S′ is isomorphic to H_S. Further, (C1) assumes that the sets S and T have the direct product form of a coset space G/H and a base space B. The case where the base space B consists of a single point is equivalent to the condition that the set is homogeneous; in this sense, (C1) can be regarded as a relaxation of the homogeneity condition. In many practical cases, a set S on which G acts can be regarded as such a direct product. For example, when the action is transitive, the direct product decomposition trivially holds with a base space consisting of a single point.
Even when the set S itself is not rigorously represented in the direct product form, after removing some "small" subset N ⊂ S, the complement S \ N can often be represented in the direct form. For example, when G = O(d) acts on S = R^d by rotation around the origin N = {0}, the complement S \ N has a direct product form, as mentioned above. In applications, removing only the small subset N is expected to be negligible.

Next, we provide some remarks on the condition (C2). Let us consider two representative settings for the set S. The first case is the setting where S is finite. When a G-invariant measure ν has a positive value on every singleton in S, ν satisfies (C2) for an arbitrary measure µ_ϕ on S; in particular, the counting measure on S is invariant and satisfies (C2). The second case is the setting where S is a Euclidean space R^d and µ_ϕ is the Lebesgue measure. Then, (C2) is satisfied by invariant measures on the Euclidean space for various group actions, including translation, rotation, scaling, and the Euclidean group. Here, we give a general method to construct ν in (C2) for a compact-group action. When µ_ϕ is locally finite and continuous with respect to the action of a compact group G, the measure ν := ν_G * µ_ϕ on S for a Haar measure ν_G on G satisfies (C2), where

(ν_G * µ_ϕ)(A) := ∫_G µ_ϕ(g^{-1} • A) dν_G(g).
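The averaging construction ν = ν_G * µ_ϕ can be illustrated in the simplest compact (here, finite) case. The sketch below assumes G = Z_3 acting on S = {0, 1, 2} by cyclic shifts, with the normalized counting measure as the Haar measure on G; the starting measure µ is an arbitrary non-invariant example:

```python
import numpy as np

n = 3
mu = np.array([3.0, 1.0, 0.0])  # illustrative; mu is not Z_3-invariant

def shift_measure(m, g):
    # pushforward of m by g: (g . m)({s}) = m(g^{-1} . {s}) = m({(s - g) mod n})
    return np.array([m[(s - g) % n] for s in range(n)])

# nu(A) := integral over G of mu(g^{-1} . A), Haar measure = uniform weights 1/|G|
nu = sum(shift_measure(mu, g) for g in range(n)) / n

# nu is G-invariant ...
assert all(np.allclose(shift_measure(nu, g), nu) for g in range(n))
# ... and mu << nu: every nu-null singleton is mu-null
assert all(mu[s] == 0.0 for s in range(n) if nu[s] == 0.0)
```

Here the average collapses to the uniform measure on S, which dominates the original µ; this is the finite shadow of the absolute-continuity requirement µ_ϕ ≪ ν in (C2).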

4.1. UNIVERSAL APPROXIMATION THEOREM IN FINITE DIMENSION

We review the universal approximation theorem in finite-dimensional settings. Cybenko (1989) derived the following seminal result.

Theorem 10 (Universal Approximation for Continuous Maps by FNNs, Cybenko (1989)). Let an activation function ρ : R → R be non-constant, bounded and continuous, and let F : R^d → R^{d′} be a continuous map. Then, for any compact E ⊂ R^d and ϵ > 0, there exists a two-layer fully-connected neural network ϕ_E ∈ N_FNN(ρ, 2; [d], [d′]) such that ‖F|_E − ϕ_E‖_∞ < ϵ.

Since C_0(S) = R^{|S|} for a finite set S, we obtain the following theorem by combining Theorem 9 with Theorem 10.

Theorem 11 (Universal Approximation for Equivariant Continuous Maps by CNNs). Let an activation function ρ : R → R be non-constant, bounded and Lipschitz continuous. Suppose that a finite group G acts on finite sets S and T and that (C1) in Theorem 9 holds. Let F : R^{|S|} → R^{|T|} be a G-equivariant continuous map. For any compact set E ⊂ R^{|S|} and ϵ > 0, there exists a two-layer convolutional neural network Φ_E ∈ N_CNN(G, ρ, 2; S, T) such that ‖F|_E − Φ_E‖_∞ < ϵ.

We note that Petersen & Voigtlaender (2020) obtained a result similar to Theorem 11 in the case of finite groups.

Universality of DeepSets. DeepSets is known as an invariant/equivariant model that takes sets as input and is known to be universal for permutation-invariant/equivariant functions (Zaheer et al. (2017b); Ravanbakhsh (2020)). The equivariant model is a stack of affine transformations with weight W = λE + γ1 (where E is the identity matrix and 1 is the all-one matrix) and bias b = c · (1, . . . , 1)^⊤, each followed by an activation function. Here, we prove the universality of DeepSets as a corollary of Theorem 11. First, we cast the equivariant model of DeepSets into our framework by setting S, T, G, H and B as follows: S = T = [n], G = S_n, H = Stab(1) := {σ ∈ S_n | σ(1) = 1} and B = {*}, where {*} is a singleton. Then Stab(1) is a subgroup of G with left coset space G/H = [n]. As a set, S_n/Stab(1) is equal to [n], and the canonical S_n-action on S_n/Stab(1) is equivalent to the permutation action on [n]. Therefore, C(G/H × B) = C([n]) = R^n holds, and the equivariant model of our paper coincides with that of DeepSets.

Theorem 12. For any permutation-equivariant function F : R^n → R^n, any compact set E ⊂ R^n and any ϵ > 0, there is an equivariant model of DeepSets (or, equivalently, our model) Φ_E : E → R^n such that ‖Φ_E(x) − F|_E(x)‖_∞ < ϵ. The proof of Theorem 12 is provided in Section C.
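The equivariance of the DeepSets layer W = λE + γ1 with bias c·(1, . . . , 1)^⊤ can be verified directly. A minimal sketch (parameter values are illustrative):

```python
import numpy as np

n, lam, gam, c = 5, 0.7, -0.3, 0.1  # illustrative parameters

def deepsets_layer(x, rho=np.tanh):
    # (lam*E + gam*1) x + c*(1,...,1)^T  =  lam*x + gam*(sum_i x_i)*1 + c*1,
    # followed by an elementwise activation
    return rho(lam * x + gam * np.sum(x) + c)

rng = np.random.default_rng(1)
x = rng.normal(size=n)
perm = rng.permutation(n)
# Permutation equivariance: applying the layer commutes with permuting entries
assert np.allclose(deepsets_layer(x[perm]), deepsets_layer(x)[perm])
```

Equivariance holds because both the identity part λx and the pooled part γΣ_i x_i are unchanged in structure under any reordering of the entries, which is the S_n-convolution structure identified above.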

4.2. UNIVERSAL APPROXIMATION THEOREM IN INFINITE DIMENSION

Guss & Salakhutdinov (2019) derived a universal approximation theorem for continuous maps by FNNs in infinite-dimensional settings. However, their theorem assumes that the index set S of the input layer and T of the output layer are compact. Combining the conversion theorem with it, we can derive a corresponding universal approximation theorem for equivariant maps with respect to compact groups. However, the compactness condition on S and T is a crucial shortcoming when handling the actions of non-compact groups such as translation or scaling. To overcome this obstacle, we show a novel universal approximation theorem for Lipschitz maps by FNNs.

Theorem 13 (Universal Approximation for Lipschitz Maps by FNNs). Let an activation function ρ : R → R be continuous and non-polynomial. Let S ⊂ R^d and T ⊂ R^{d′} be domains, and let F : C_0(S) → C_0(T) be a Lipschitz map. Then, for any compact E ⊂ C_0(S) and ϵ > 0, there exist N ∈ N and a two-layer fully-connected neural network ϕ_E = A_2 ∘ ρ ∘ A_1 ∈ N_FNN(ρ, 2; S, T) such that A_1[·] = W^{(1)}[·] + b^{(1)} : E → C_0([N]) = R^N, A_2[·] = W^{(2)}[·] + b^{(2)} : R^N → C_0(T), µ_{ϕ_E} is the Lebesgue measure, and ‖F|_E − ϕ_E‖_∞ < ϵ.

We provide the proof of Theorem 13 in the appendix. We note that, unlike the result of Guss & Salakhutdinov (2019), S ⊂ R^d and T ⊂ R^{d′} in Theorem 13 are allowed to be non-compact. Combining Theorem 9 with Theorem 13, we obtain the following theorem.

Theorem 14 (Universal Approximation for Equivariant Lipschitz Maps by CNNs). Let an activation function ρ : R → R be Lipschitz continuous and non-polynomial. Suppose that a group G acts on S ⊂ R^d and T ⊂ R^{d′}, and that (C1) and (C2) in Theorem 9 hold for the Lebesgue measure µ_ϕ. Let F : C_0(S) → C_0(T) be a G-equivariant Lipschitz map. Then, for any compact set E ⊂ C_0(S) and ϵ > 0, there exists a two-layer convolutional neural network Φ_E ∈ N_CNN(G, ρ, 2; S, T) such that ‖F|_E − Φ_E‖_∞ < ϵ.

Lastly, we mention universal approximation theorems for some concrete groups. When the group G is a Euclidean group E(d) or a special Euclidean group SE(d), Theorem 14 shows that group CNNs are universal approximators of G-equivariant maps. Although Yarotsky (2018) showed that group CNNs can approximate SE(2)-equivariant maps, our result for d ≥ 3 has not been shown in existing studies. Since Euclidean groups can represent 3D motion and point clouds, Theorem 14 can provide a theoretical guarantee for 3D data processing with group CNNs. As another example, when the group G is SO^+(d, 1), G acts on the upper half plane H^{d+1}, which has been shown to be suitable for word representations in NLP (Nickel & Kiela (2017)). Since the action of G preserves the distance on H^{d+1}, group convolution with SO^+(d, 1) may be useful for NLP.

5. CONCLUSION

We have considered universal approximation theorems for equivariant maps by group CNNs. To prove the theorems, we showed that an equivariant map is uniquely determined by its generator. Thus, when a fully-connected neural network can approximate the generator, an approximator of the equivariant map itself can be described as a group CNN via the conversion theorem. In this way, universal approximation for equivariant maps by group CNNs is obtained through universal approximation for the generator by FNNs. We have described FNNs and group CNNs in an abstract way. In particular, we provided a novel universal approximation theorem by FNNs in infinite dimensions, where the support of the input functions is unbounded. Using this result, we obtained a universal approximation theorem for equivariant maps for non-compact groups.

We mention future work. In Theorem 14, we assumed the sets S and T to be subspaces of Euclidean spaces. However, in the conversion theorem (Theorem 9), S and T need not be subspaces of Euclidean spaces and may have a more general topological structure. Thus, if there is a universal approximation theorem on non-Euclidean spaces (Courrieu (2005); Kratsios (2019)), we may be able to combine it with the conversion theorem to derive an equivariant version. Next, we note the problem of computational complexity. Although group convolution can be implemented by, e.g., discretization and localization as in Finzi et al. (2020), such implementations cannot be applied to high-dimensional groups due to their high computational cost. To use group CNNs for actual machine-learning problems, effective architectures for practical implementation are required.



Footnotes.
1. A function f on a locally compact space S is said to vanish at infinity if, for any ϵ > 0, there exists a compact subset K ⊂ S such that sup_{s∈S\K} |f(s)| < ϵ.
2. The choice of the base space is not unique in general. However, the topological structure of a base space can be induced from the quotient space S/G.
3. We note that T_g ∘ T_{g′} = T_{g′g}, so the group translation operator is an action of G on R^S from the right.
4. A bivariate G-invariant function v : G × G → R is determined by the univariate function ṽ : G → R because v(g, h) = v(h^{-1}g, h^{-1}h) = v(h^{-1}g, 1) = ṽ(h^{-1}g).
5. H_S and H_T are not assumed to be normal subgroups.
6. µ_ϕ ≪ ν means that µ_ϕ is absolutely continuous with respect to ν.
7. A singleton is a set with exactly one element.
8. The upper half plane is defined by H^{d+1} := {(x_1, . . . , x_{d+1}) ∈ R^{d+1} | x_{d+1} > 0}.
9. A measure µ_ϕ is said to be continuous with respect to the action of a group G if µ_ϕ(g • A) is continuous with respect to g ∈ G for every Borel set A ⊂ S.



• Symmetric Group. The action of G = S_n on S = [n] by permutation has the decomposition [n] = S_n/Stab(1) × {*}, where H_S = Stab(1) is the set of all permutations of [n] that fix 1 ∈ [n] and B_S = {*} is a singleton. Then, the counting measure can be taken as an invariant measure ν.

• Rotation Group. The action of G = O(d) on S = R^d \ {0} by rotation around 0 ∈ R^d has the decomposition R^d \ {0} = O(d)/O(d−1) × R_+. The cases G = SO(d) and S = S^{d−1} have similar decompositions. Then, the Lebesgue measure can be taken as an invariant measure ν.

• Translation Group. The action of G = R^d on S = R^d by translation has the trivial decomposition R^d = R^d/{0} × {*}. Then, the Lebesgue measure can be taken as an invariant measure ν.

• Euclidean Group. The action of G = E(d) on S = R^d by isometries has the decomposition R^d = E(d)/O(d) × {*}. The case G = SE(d) has a similar decomposition. Then, the Lebesgue measure can be taken as an invariant measure ν.

• Scaling Group. The action of G = R_{>0} on S = R^d \ {0} by scalar multiplication has the decomposition R^d \ {0} = R_{>0}/{1} × S^{d−1}. Then, the product measure ν_r × ν_{S^{d−1}} can be taken as an invariant measure ν, where ν_r on R_{>0} is determined by ν_r([a, b]) := log(b/a) and ν_{S^{d−1}} is the uniform measure on S^{d−1}.

• Lorentz Group. The action of G = SO^+(d, 1), a subgroup of the Lorentz group O(d, 1), on the upper half plane S = H^{d+1} by matrix multiplication has the decomposition H^{d+1} = SO^+(d, 1)/SO(d) × {*}. Then, π_#(ν^+) can be taken as a left-invariant measure ν, where ν^+ is a left-invariant measure on SO^+(d, 1), π : SO^+(d, 1) → SO^+(d, 1)/SO(d) is the canonical projection, and π_#(ν^+) is the pushforward measure.
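The invariance of the scaling-group measure ν_r([a, b]) = log(b/a) can be checked with a one-line computation: scaling an interval [a, b] to [ca, cb] leaves log(b/a) unchanged. A minimal numerical sketch:

```python
import math

def nu_r(a, b):
    # the scaling-invariant measure on R_{>0}: nu_r([a, b]) = log(b/a)
    return math.log(b / a)

a, b = 2.0, 5.0
# invariance under the action [a, b] -> [c*a, c*b] for any scale c > 0
for c in (0.5, 3.0, 7.25):
    assert abs(nu_r(c * a, c * b) - nu_r(a, b)) < 1e-12
```

This is the finite-interval form of the density dr/r on R_{>0}, which is the Haar measure of the multiplicative group.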


