ON THE UNIVERSAL APPROXIMATION PROPERTY OF DEEP FULLY CONVOLUTIONAL NEURAL NETWORKS

Abstract

We study the approximation of shift invariant or equivariant functions by deep fully convolutional networks from the dynamical systems perspective. We prove that deep residual fully convolutional networks and their continuous-layer counterparts can achieve universal approximation of these symmetric functions at constant channel width. Moreover, we show that the same can be achieved by non-residual variants with at least 2 channels in each layer and convolutional kernel size of at least 2. In addition, we show that these requirements are necessary, in the sense that networks with fewer channels or smaller kernels fail to be universal approximators.

1. INTRODUCTION

Convolutional neural networks (CNNs) are widely used as fundamental building blocks in modern deep learning architectures, since they can extract key data features with far fewer parameters, lowering both memory requirements and computational cost. When the input data contains spatial structure, such as pictures or videos, this parsimony often does not hurt performance. This is particularly interesting in the case of fully convolutional neural networks (FCNNs) (Long et al., 2015), built by composing convolution, nonlinear activation and summing (averaging) layers, with the last layer being a permutation invariant pooling operator; see Figure 1. A prominent feature of FCNNs is that when the input data indices are shifted (e.g. for pictures, videos, or other higher-dimensional spatial data), the output remains the same. This is called shift invariance. An example application of FCNNs is image classification, where the class label (or probability, under the softmax activation) of an image remains the same when the image is translated (i.e. its pixels are shifted). A variant of the FCNN applies to problems where the output has the same size as the input, e.g. pixel-wise segmentation of images (Badrinarayanan et al., 2017). In this case, simply stacking the fully convolutional layers suffices. We call this type of CNN an equivariant fully convolutional neural network (eq-FCNN), since when the input data indices are shifted, the output data indices shift by the same amount. This is called shift equivariance. It is believed that the success of these convolutional architectures hinges on shift invariance or equivariance, which capture intrinsic structure in spatial data. From an approximation theory viewpoint, this presents a delicate trade-off between expressiveness and invariance: layers cannot be so complex that they break the invariance property, yet should not be so simple that they lose approximation power.
The interaction of invariance and network architectures has been a subject of intense study in recent years. For example, Cohen & Welling (2016c) designed steerable CNNs to handle the motion group for robotics. Deep sets (Zaheer et al., 2017) were proposed to handle permutation invariance and equivariance. Other approaches to building equivariance and shift invariance include parameter sharing (Ravanbakhsh et al., 2017) and the homogeneous space approach (Cohen & Welling, 2016b; Cohen et al., 2019). See Bronstein et al. (2017) for a more recent survey. Among these architectures, the FCNN is perhaps the simplest and most widely used model. Therefore, the study of its theoretical properties is a natural first and fundamental step toward investigating other, more complicated architectures. In this paper, we focus on the expressive power of the FCNN. Mathematically, we ask whether a function F can be approximated by the FCNN (or eq-FCNN) function family in the L^p sense. This is also known as universal approximation in L^p. In the literature, many results on fully connected neural networks can be found, e.g. Lu et al. (2017); Yarotsky (2018a); Shen et al. (2019). However, relatively few results address the approximation of shift invariant functions by fully convolutional networks. An intuitive reason is that the symmetry constraint (shift invariance) hinders unconditional universal approximation. This can also be proved rigorously: in Li et al. (2022b), the authors showed that if a function can be approximated by an invariant function family to arbitrary accuracy, then the function itself must be invariant. As a consequence, when considering the approximation property of the FCNN, we should only consider shift invariant target functions. This introduces new difficulties compared with the fully connected case.
For this reason, many existing results on convolutional network approximation rely on some way of breaking shift invariance, and thus apply to general function classes without symmetry constraints (Oono & Suzuki, 2019). Moreover, results on convolutional networks usually require (at least one) layer to have a large number of channels. In contrast, we establish universal approximation results for fully convolutional networks in which shift invariance is preserved. Moreover, we show that approximation can be achieved by increasing depth at constant channel numbers, with fixed kernel size in each layer. The main result of this paper (Theorem 1) shows that if we choose ReLU as the activation function and the terminal layer is a general pooling operator satisfying mild technical conditions (e.g. max, summation), then convolutional layers with at least 2 channels and kernel size at least 2 can achieve universal approximation of shift invariant functions via repeated stacking (composition). The result is sharp in the sense that neither the size of the convolution kernel nor the channel number can be further reduced while preserving the universal approximation property. To prove the result on FCNNs, we rely on the dynamical systems approach, in which residual neural networks are idealized as continuous-time dynamical systems. This approach was introduced in E (2017) and first used to develop stable architectures (Haber & Ruthotto, 2017) and control-based training algorithms (Li et al., 2018); it has also been popularized in the machine learning literature as neural ODEs (Chen et al., 2018). On the approximation theory front, the dynamical systems approach was used to prove universal approximation of general model architectures through composition (Li et al., 2022a). The work of Li et al. (2022b) extended this result to functions/networks with symmetry constraints, and as a corollary obtained a universal approximation result for residual fully convolutional networks with kernel size equal to the image size. The results in this paper restrict the kernel size in a more practical way, and can handle common architectures used in applications, which typically have kernel sizes ranging from 3 to 7. Moreover, we also establish the sharpness of the requirements on channel numbers and kernel sizes. Restricting the width and kernel size also yields stronger theoretical consequences: once approximation results are established under finite (and minimal) width and kernel size requirements, they imply the universal approximation property for a variety of larger models, simply by showing that those models contain our minimal construction. In summary, the main contributions of this work are as follows:
1. We prove the universal approximation property of both continuous and time-discretized fully convolutional neural networks with residual blocks and kernel size at least 2. This result concerns deep but narrow neural networks with residual blocks. We provide a sufficient condition for the universal approximation property with respect to shift invariance; the result relies neither on the specific choice of nonlinear activation function nor on the choice of the last layer.
2. Further, we prove the universal approximation property of fully convolutional neural networks with ReLU activations having no fewer than two channels in each layer and kernel size at least 2.
3. Finally, we show that the above channel number and kernel size requirements are sharp, since reducing them further loses the universal approximation property.
4. The above three points also hold for the approximation of shift equivariant mappings via eq-FCNN.
The paper is organized as follows. Section 2 introduces the mathematical formulation and the main results. Section 3 introduces the main analytical tool and the bridging result for the positive part of the main theorem, while Sections 4 and 5 prove the sharpness part. The discussion and conclusion are given in Section 6. In Appendix A we provide the shift equivariant results, and the remaining technical proofs are collected in Appendix B.

2. FORMULATION AND MAIN RESULTS

In this section, we introduce the notation and formulation of the approximation problem, and then present our main results. We first recall the definition of convolution. Consider two rank-$d$ tensors $x, y \in X$, where $X := \mathbb{R}^{n_1} \times \cdots \times \mathbb{R}^{n_d}$. We denote by $n = [n_1, \ldots, n_d]$ the data dimensions, and we define the convolution $z = x * y$ by
$$[z]_i = \sum_j [x]_j [y]_{i+j-1},$$
where $i, j$ are multi-indices (beginning with 1) and the index arithmetic uses periodic boundary conditions. Taking $x \in \mathbb{R}^{3 \times 3}$ as an example, we write
$$x = \begin{pmatrix} [x]_{(1,1)} & [x]_{(1,2)} & [x]_{(1,3)} \\ [x]_{(2,1)} & [x]_{(2,2)} & [x]_{(2,3)} \\ [x]_{(3,1)} & [x]_{(3,2)} & [x]_{(3,3)} \end{pmatrix},$$
where $[x]_{(1,4)}$ is identified with $[x]_{(1,1)}$, and similarly for the other indices. Let us also define the translation operator $T_k$ with respect to a multi-index $k$ by $[T_k x]_i = [x]_{i+k}$. The key symmetry condition considered in this paper, shift equivariance, can now be stated as the commuting relationship $T_k(x * y) = x * (T_k y)$. We now introduce the fully convolutional neural network (FCNN) architecture we subsequently study. Let
$$\mathcal{F}_r := \Big\{ \sum_{i=1}^r v_i\, \sigma(w_i * \cdot + b_i \mathbf{1}) : w_i \in X,\ v_i, b_i \in \mathbb{R} \Big\}$$
be a function family representing the possible forms of a convolutional layer with $r$ channels. Here, $\sigma(x) = \max(x, 0)$ is the ReLU function. Let the final layer be a pooling operation $g : X \to \mathbb{R}$ that is Lipschitz and permutation invariant with respect to all coordinates of its input, i.e. the value of $g$ does not depend on the order of its inputs. Examples of such pooling operators include summation $g(x) = \sum_i [x]_i$ and max $g(x) = \max_i [x]_i$. Note that this is stronger than merely requiring $g$ to be shift invariant. In practice, the convolutional kernel is more restricted, say to a kernel size of 3 or 5.
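As a sanity check on these definitions, the following NumPy sketch (ours, purely illustrative; all function names are our own) implements the periodic convolution above on a 2-D grid and verifies the commuting relationship $T_k(x * y) = x * (T_k y)$:

```python
import numpy as np

def circ_conv(x, y):
    """Periodic convolution [z]_i = sum_j [x]_j [y]_{i+j-1} (1-based),
    i.e. [z]_i = sum_j [x]_j [y]_{(i+j) mod n} in 0-based indexing."""
    n1, n2 = x.shape
    z = np.zeros((n1, n2))
    for j1 in range(n1):
        for j2 in range(n2):
            # np.roll(y, (-j1, -j2)) places [y]_{i+j} at position i
            z += x[j1, j2] * np.roll(y, shift=(-j1, -j2), axis=(0, 1))
    return z

def translate(x, k):
    """Translation operator: [T_k x]_i = [x]_{i+k}, periodic wrap-around."""
    return np.roll(x, shift=(-k[0], -k[1]), axis=(0, 1))

rng = np.random.default_rng(0)
x, y = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
k = (1, 3)
lhs = translate(circ_conv(x, y), k)   # T_k(x * y)
rhs = circ_conv(x, translate(y, k))   # x * (T_k y)
print(np.allclose(lhs, rhs))          # True: shift equivariance holds
```

Note that with periodic boundary conditions, translation is simply a circular shift of the index grid, which is why `np.roll` implements $T_k$ exactly.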
To study the effect of kernel size, we define the support of an element $x \in X$ as $\mathrm{supp}(x) := (j_1, j_2, \ldots, j_d)$, where $j_s$ is the minimal number such that $[x]_i = 0$ whenever the multi-index $i$ satisfies $i_s > j_s$ for some $s$. For example, the support of the tensor
$$x = \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \end{pmatrix}$$
is $(3, 2)$. Two remarks on this definition of support are in order. First, an element $x \in X$ with $\mathrm{supp}(x) \le j = (j_1, j_2, \ldots, j_d)$ can be identified with an element of $\mathbb{R}^{j_1 \times j_2 \times \cdots \times j_d}$. Second, a convolution kernel of size $s$ can be regarded as a tensor $w \in X$ with support $\le (s, s, \ldots, s)$. Thus, we may define the convolutional layer family with support up to $\ell$ as
$$\mathcal{F}_{r,\ell} := \Big\{ \sum_{i=1}^r v_i\, \sigma(w_i * \cdot + b_i \mathbf{1}) : w_i \in X,\ \mathrm{supp}(w_i) \le \ell,\ v_i, b_i \in \mathbb{R} \Big\}. \tag{1}$$
With these notations in mind, we now introduce the following hypothesis spaces defining fully convolutional neural networks and their residual variants:
$$\mathrm{CNN}_{r,\ell} = \{ g \circ f_m \circ \cdots \circ f_1 : f_1, \ldots, f_m \in \mathcal{F}_{r,\ell},\ m \ge 1 \}, \tag{2}$$
$$\mathrm{resCNN}_{r,\ell} = \{ g \circ (\mathrm{id} + f_m) \circ \cdots \circ (\mathrm{id} + f_1) : f_1, \ldots, f_m \in \mathcal{F}_{r,\ell},\ m \ge 1 \}. \tag{3}$$
For any family $\mathcal{F}$ of functions $X \to \mathbb{R}$, let us define $\mathcal{F} + \mathbb{R} := \{\varphi + b : \varphi \in \mathcal{F},\ b \in \mathbb{R}\}$. This expands the hypothesis space by adding a constant bias to the original function family $\mathcal{F}$. Observe that all functions in the families $\mathrm{CNN}_{\bullet,\bullet}$ and $\mathrm{resCNN}_{\bullet,\bullet}$ are shift invariant in the following sense.

Definition 1. A function $\varphi : X \to \mathbb{R}$ is called shift invariant if $\varphi(x) = \varphi(T_k x)$ for all $x \in X$ and $k \le n$. A function family $\mathcal{X}$ is called shift invariant if all its members are shift invariant. A function family $\mathcal{X}$ satisfies the shift invariant universal approximation property (shift invariant UAP for short) if
1. the function family $\mathcal{X}$ is shift invariant, and
2. for any shift invariant continuous (or $L^p$) function $\psi$, tolerance $\varepsilon > 0$, compact set $K \subset X$ and $p \in [1, \infty)$, there exists $\varphi \in \mathcal{X}$ such that $\|\psi - \varphi\|_{L^p(K)} \le \varepsilon$.

Correspondingly, we define shift equivariance: a mapping $\varphi : X \to X$ is called shift equivariant if $\varphi(T_k x) = T_k(\varphi(x))$.
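To make the hypothesis spaces concrete, here is a small NumPy sketch (ours; the layer sizes and parameters are illustrative, not from the paper) of a single layer in $\mathcal{F}_{2,2}$ on a $5 \times 5$ input, composed with sum pooling, together with a numerical check that the resulting function is shift invariant:

```python
import numpy as np

rng = np.random.default_rng(0)

def circ_conv(w, x):
    # periodic convolution (w * x)_i = sum_j [w]_j [x]_{i+j} (0-based)
    n1, n2 = x.shape
    z = np.zeros_like(x, dtype=float)
    for j1 in range(n1):
        for j2 in range(n2):
            if w[j1, j2] != 0.0:
                z += w[j1, j2] * np.roll(x, shift=(-j1, -j2), axis=(0, 1))
    return z

def layer(x, ws, vs, bs):
    # f(x) = sum_i v_i ReLU(w_i * x + b_i 1), an element of F_{r,l}
    return sum(v * np.maximum(circ_conv(w, x) + b, 0.0)
               for w, v, b in zip(ws, vs, bs))

# two channels, each kernel supported on a 2x2 corner, so supp(w_i) <= (2, 2)
n = 5
ws = []
for _ in range(2):
    w = np.zeros((n, n)); w[:2, :2] = rng.normal(size=(2, 2)); ws.append(w)
vs, bs = rng.normal(size=2), rng.normal(size=2)

F = lambda z: np.sum(layer(z, ws, vs, bs))   # sum pooling as the terminal g

x = rng.normal(size=(n, n))
shifted = np.roll(x, shift=(2, 1), axis=(0, 1))  # T_k x for k = (-2, -1)
print(np.isclose(F(x), F(shifted)))              # shift invariance of g o f
```

Because each layer commutes with translations and the pooling $g$ is permutation invariant, the printed check succeeds for any choice of shift.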
We will prove the shift equivariant UAP in Appendix A. The main result of this paper is as follows.

Theorem 1 (Universal Approximation Property of CNN). The following statements hold:
1. The residual FCNN hypothesis space $\mathrm{resCNN}_{r,\ell}$ possesses the shift invariant UAP for $r \ge 1$ and $\ell \ge 2$. The non-residual hypothesis space $\mathrm{CNN}_{r,\ell}$ possesses the shift invariant UAP for $r \ge 2$ and $\ell \ge 2$.
2. Kernel size 2 is optimal in the following sense: for $\ell$ with $\min_s \ell_s = 1$, neither $\mathrm{resCNN}_{\infty,\ell} + \mathbb{R}$ nor $\mathrm{CNN}_{\infty,\ell} + \mathbb{R}$ possesses the shift invariant UAP.
3. The channel-width requirement for the non-residual fully convolutional neural network is optimal, in the sense that the function family $\mathrm{CNN}_{1,\infty} + \mathbb{R}$ does not possess the shift invariant UAP.

Notice that, due to the hypothesis space being extended by the added bias, the sharpness results are stronger than merely asserting that $\mathrm{CNN}_{\infty,\ell}$ or $\mathrm{resCNN}_{\infty,\ell}$ does not possess the shift invariant UAP. The reason we establish the sharpness results for $\bullet + \mathbb{R}$ is to ensure that the lack of approximation power does not arise from the fact that the ReLU activation function $\sigma$ has non-negative range. Note that this sign restriction does not affect the positive result, since with at least 2 channels one can produce outputs of any sign. Although this theorem only considers the approximation of shift invariant architectures, similar results can be established for shift equivariant architectures; we discuss this in detail in Appendix A. Furthermore, in this section we restrict the activation function $\sigma$ to be the ReLU function, but this restriction is necessary only for the non-residual case. As we will see in Appendix B.4, for residual FCNNs we can relax the requirement on $\sigma$ to include a large variety of common activation functions.
Theorem 1 indicates the following basic trade-off in the design of deep convolutional neural network architectures: if we increase the depth of the network, then even if each layer is a simple function (in this theorem, 2 channels, each with a kernel of size 2), we can still expect high expressive power. However, the mapping in each layer cannot be degenerate, otherwise it fails to capture information about the input data. The second and third parts of the theorem tell us that this degeneracy may come from either the channel number or the kernel size (the support of the convolution kernel).

2.1. COMPARISON WITH PREVIOUS WORK

We compare this theorem to existing works on the approximation theory of convolutional networks and related architectures. Existing results on the approximation capabilities of convolutional neural networks can be categorized into several classes. One either
• takes the kernel as full-size (the same size as the input), e.g. Li et al. (2022b), which is not often used in practice,
• assumes a sufficiently large channel number in order to adopt kernel learning methods, e.g. Bietti (2021), or
• removes the nonlinear activation function and reduces to a linear approximation problem, e.g. Zhou (2020), then uses complex fully connected terminal layer(s) to achieve approximation.
As a consequence, few, if any, results are available when the kernel size is small (and the channel number is fixed). Indeed, none of the results we are aware of consider situations where both the kernel size and the width are limited. However, this is precisely the case when designing deep (residual) networks, as in the ResNet family, where the primary change is increasing depth. Our result indicates that even though each layer is relatively simple, much more complicated functions can ultimately be approximated via composition. Furthermore, our analytical techniques (especially for the residual case) do not depend on the explicit form of the activation function or the pooling operator in the last layer. Another highlight of our result concerns shift invariance, which is overlooked in some approximation results for convolutional networks. We restrict our attention to the periodic boundary condition, which leads to architectures that are exactly shift invariant or equivariant. This significantly confines the expressive power of the hypothesis spaces. If such a symmetry is not imposed on each layer, then one can achieve universal approximation of general functions, but at the cost of breaking shift equivariance.
For example, Oono & Suzuki (2019) and Okumoto & Suzuki (2021) drop the equivariance constraint and build deep convolutional networks with zero boundary conditions, achieving the universal approximation property for non-symmetric functions. This is because the boundary condition deteriorates the interior equivariance structure once the network is deep enough. Also, the shift invariance considered here is with respect to the pixels (i.e. the input data indices), while other works such as Yang & Wang (2020) build wavelet-like architectures to approximate functions invariant to spatial translation, i.e., functions satisfying f = f(· − k) for k ∈ Z.

2.2. TECHNICAL NOVELTIES IN THE DYNAMICAL SYSTEMS APPROACH

In this paper, we develop the dynamical systems approach to analyze the approximation theory of compositional architectures, first introduced in Li et al. (2022a) without symmetry considerations and subsequently extended to handle functions symmetric with respect to transitive subgroups of the permutation group in Li et al. (2022b). While shift symmetry is covered by that setting, the results of Li et al. (2022b) can only handle the case where the convolution filters have the same size as the input dimension. In contrast, the results here are established for small and constant filter (and channel) sizes. This is an important distinction, as such configurations are precisely those used in most practical applications. On the technical side, the filter size restriction requires developing new arguments to show how arbitrary point sets can be transported under a flow, a key ingredient in the proof of universal approximation through composition (see Section B.4 for a detailed discussion). Furthermore, the restriction on filter sizes also enables us to address new questions, such as minimal size requirements, that cannot be handled by the analysis in Li et al. (2022b). The results and mathematical techniques for these sharpness results are new. Concretely, to provide sharp lower bounds on the filter size and channel number requirements, we develop techniques to extract special features of functions in $\mathrm{CNN}_{1,\bullet} + \mathbb{R}$ and $\mathrm{CNN}_{\bullet,1} + \mathbb{R}$ that lead to the failure of universal approximation. Detailed constructions are found in Sections 4 and 5. The constructions and the corresponding analysis are nontrivial, and we believe the examples will also be useful in analyzing the approximation properties of other architectures.

3. PART 1 OF THEOREM 1: THE DYNAMICAL SYSTEMS APPROACH

The core technique we employ to analyze both CNN and resCNN is the dynamical systems approach, in which we idealize residual networks as continuous-time dynamical systems. In this section, we introduce the key elements of this approach. We first introduce the flow map, also called the Poincaré map, of a time-homogeneous dynamical system.

Definition 2 (Flow Map). Suppose $f : X \to X$ is Lipschitz. We define the flow map associated with $f$ at time horizon $T$ as $\phi(f, T)(x) = z(T)$, where $\dot z(t) = f(z(t))$ with initial data $z(0) = x$.

It follows from Arnold (1973) that the mapping $\phi(f, T)$ is Lipschitz for any real number $T$, and the inverse of $\phi(f, T)$ is $\phi(-f, T)$; hence the flow map is bi-Lipschitz. Based on the flow map, we define the dynamical hypothesis space with convolutional kernels as
$$\mathrm{CODE}_{r,\ell} = \{ g \circ \phi(f_m, t_m) \circ \cdots \circ \phi(f_1, t_1) : f_1, \ldots, f_m \in \mathcal{F}_{r,\ell},\ t_1, \ldots, t_m \in \mathbb{R} \}. \tag{4}$$
The following proposition shows that we can use residual blocks to approximate continuous dynamical systems.

Proposition 1. Suppose that $\mathcal{F}$ is a bi-Lipschitz function family. For given $\Phi = \phi(f_m, t_m) \circ \cdots \circ \phi(f_1, t_1)$, $f_i \in \mathcal{F}$, compact $K \subset X$ and $\varepsilon > 0$, there exists $\tilde\Phi = (\mathrm{id} + s_{m'} g_{m'}) \circ \cdots \circ (\mathrm{id} + s_1 g_1)$, $g_i \in \mathcal{F}$, for some $s_i > 0$, $i = 1, \ldots, m'$, such that $\|\Phi - \tilde\Phi\|_{L^p(K)} \le \varepsilon$.

The following result shows the shift invariant UAP for the continuous hypothesis space; its proof can be found in Appendix A.

Theorem 2. The dynamical hypothesis space $\mathrm{CODE}_{1,2}$ satisfies the shift invariant UAP.

The rough proof strategy is as follows. We reduce the problem to finite point transportation: we need to show that the hypothesis space can transport arbitrary but finitely many points (in different orbits under the action of the translation group) to any other such set of points. This reduction was carried out in the previous work of Li et al. (2022b).
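The flow map and its residual (forward Euler) discretization in Proposition 1 can be illustrated numerically. The sketch below (ours, using a toy linear vector field rather than a convolutional one) composes residual blocks $\mathrm{id} + (T/m) f$ and shows the error against the exact flow shrinking as the number of blocks $m$ grows:

```python
import numpy as np

def flow_euler(f, T, x, m):
    """Approximate the flow map phi(f, T)(x) by m residual blocks,
    i.e. the composition of x -> x + (T/m) f(x), m times (forward Euler)."""
    z = np.array(x, dtype=float)
    for _ in range(m):
        z = z + (T / m) * f(z)
    return z

f = lambda z: -z                      # Lipschitz field; exact flow: exp(-T) x
x0, T = np.array([1.0, -2.0]), 1.0
exact = np.exp(-T) * x0

errs = [np.max(np.abs(flow_euler(f, T, x0, m) - exact)) for m in (10, 100, 1000)]
print(errs)  # errors decrease as the number of residual blocks grows
```

This is the elementary mechanism behind "a convergent time discretization inherits universal approximation properties": deep residual stacks track the continuous-time flow to any desired accuracy on compact sets.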
A key technical difficulty here is that the kernel size is limited, so previously known constructions for point transportation (Li et al. (2022b)) do not apply. Here, we show that by composing more layers we can construct auxiliary mappings that achieve this transportation property. The intuition is that finite-size kernels (satisfying some minimal requirements), when stacked many times, are as good as a full-sized kernel for domain rearrangement, a key enabler of universal approximation through composition. With this theorem in hand, we now prove the first part of Theorem 1.

Proof of the first part of Theorem 1. By the straightforward inclusion relationship, it suffices to show that the function families $\mathrm{resCNN}_{1,2}$ and $\mathrm{CNN}_{2,2}$ have the UAP. For the residual version, it follows from Proposition 1 that if $\mathrm{CODE}_{1,2}$ satisfies the UAP, then so does $\mathrm{resCNN}_{1,2}$; in other words, a convergent time discretization inherits universal approximation properties. Thus, given Theorem 2, it suffices to prove the remaining CNN case. We begin with a weaker result, showing that $\mathrm{CNN}_{3,2}$ satisfies the UAP, by proving that $\mathrm{resCNN}_{1,2} \subset \mathrm{CNN}_{3,2}$. For given $f = v\sigma(w * \cdot + b\mathbf{1})$ with $\sigma = \mathrm{ReLU}$, we write
$$x + f(x) = \sigma(x) + (-1)\sigma(-x) + v\sigma(w * x + b\mathbf{1}).$$
This relation shows that $\mathrm{resCNN}_{1,2} \subset \mathrm{CNN}_{3,2}$, and hence $\mathrm{CNN}_{3,2}$ has the UAP. However, this approach cannot handle the case $\mathrm{CNN}_{2,2}$, since the inclusion $\mathrm{resCNN}_{1,2} \subset \mathrm{CNN}_{2,2}$ does not hold. This calls for a further modification. In the following, we prove that for a given $G \in \mathrm{resCNN}_{1,2}$ and compact $K \subset X$, there exists $H \in \mathrm{CNN}_{2,2}$ such that $H(x) = G(x)$ for all $x \in K$. Suppose that $G = g \circ f_M \circ \cdots \circ f_1$, where $f_i(x) = x + v_i \sigma(w_i * x + b_i \mathbf{1})$. Set $\gamma_i = f_i \circ \cdots \circ f_1$, $\gamma_0 = \mathrm{id}$, and note that each $\gamma_i$ is a Lipschitz mapping. We now choose a sufficiently large real number $R > 0$ such that $|\gamma_i(x)| \le R$ holds for all $i = 0, 1, \ldots, M$ and $x \in K$. This is possible since each $\gamma_i$ is Lipschitz and $K$ is compact.
Define
$$u_0(x) = \sigma(x + R\mathbf{1}), \tag{5}$$
and
$$u_i(x) = \sigma(x) + v_i\, \sigma\Big(w_i * x + \big(b_i - (\textstyle\sum_k [w_i]_k) R\big)\mathbf{1}\Big) \in \mathcal{F}_{2,2} \tag{6}$$
for $i = 1, 2, \ldots, M$. Further, define $\eta_s = u_s \circ \cdots \circ u_1 \circ u_0$. Clearly, $\eta_s$ is (the pre-pooling part of) an element of $\mathrm{CNN}_{2,2}$ for all $s \ge 0$. We now prove by induction that
$$\eta_i(x) = \gamma_i(x) + R\mathbf{1} \quad \text{for } i = 0, 1, \ldots, M. \tag{7}$$
The base case $i = 0$ is immediate from definition (5), since $x + R\mathbf{1} > 0$ for all $x \in K$. Suppose that (7) holds for $i$; then
$$\eta_{i+1}(x) = \sigma(\gamma_i(x) + R\mathbf{1}) + v_i\, \sigma\Big(w_i * (\gamma_i(x) + R\mathbf{1}) + \big(b_i - (\textstyle\sum_k [w_i]_k) R\big)\mathbf{1}\Big) = \gamma_i(x) + R\mathbf{1} + v_i\, \sigma(w_i * \gamma_i(x) + b_i \mathbf{1}) = \gamma_{i+1}(x) + R\mathbf{1}.$$
The first equality uses definition (6), and the second follows from $\gamma_i(x) + R\mathbf{1} \ge 0$ and $w_i * (R\mathbf{1}) = (\sum_k [w_i]_k) R\mathbf{1}$. This proves (7) by induction. Finally, set $u_{M+1}(x) = \sigma(x) - \sigma(R\mathbf{1})$, which removes the shift since $\eta_M(x) \ge 0$; then
$$H(x) := g\big(u_{M+1}(\eta_M(x))\big) = g(\gamma_M(x)) = G(x) \quad \text{for all } x \in K.$$
By construction, $H \in \mathrm{CNN}_{2,2}$, and therefore the UAP holds for $\mathrm{CNN}_{2,2}$.

We remark that the shift equivariance of the dynamical system (and of the resulting flow map) may prompt one to consider the same equation in the quotient space with respect to shift symmetry; see Cohen & Welling (2016a). However, in the case of flow approximation, we found no additional useful tools in the quotient space for analyzing approximation, so this abstraction is not adopted here. We now give a concrete example showing that we cannot directly deduce the UAP from earlier results by a quotient argument. Observe that for the non-symmetric setting, the result in Li et al. (2022a) requires the control family $\mathcal{F}$ to be (restricted) affine invariant. If we directly require this affine invariance in the quotient space, then it reduces to scaling invariance. However, scaling invariance alone cannot induce the UAP; the proof of this is similar to those in Section 4.
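The induction $\eta_i = \gamma_i + R\mathbf{1}$ in the proof above can be checked numerically. The following NumPy sketch (ours, on a 1-D circular input with 2-tap kernels; all parameters are random and illustrative) builds a random residual network $\gamma_M$ and its two-channel non-residual emulation, choosing $R$ larger than every intermediate activation magnitude, as in the proof:

```python
import numpy as np

rng = np.random.default_rng(1)
relu = lambda z: np.maximum(z, 0.0)

def conv2tap(w, x):
    # 1-D periodic convolution with a support-2 kernel:
    # (w * x)_i = w[0] x_i + w[1] x_{i+1}
    return w[0] * x + w[1] * np.roll(x, -1)

n, M = 6, 3
ws = rng.normal(size=(M, 2)); vs = rng.normal(size=M); bs = rng.normal(size=M)
x = rng.uniform(-1.0, 1.0, size=n)

# residual trajectory: gamma_{i+1} = gamma_i + v_i ReLU(w_i * gamma_i + b_i 1)
gammas = [x.copy()]
for i in range(M):
    g = gammas[-1]
    gammas.append(g + vs[i] * relu(conv2tap(ws[i], g) + bs[i]))

# choose R larger than every intermediate magnitude, as in the proof
R = max(np.abs(g).max() for g in gammas) + 1.0

# two-channel non-residual emulation, shifted up by R
eta = relu(x + R)                                   # u_0
for i in range(M):                                  # u_i, two channels each
    eta = relu(eta) + vs[i] * relu(conv2tap(ws[i], eta) + bs[i] - R * ws[i].sum())
recovered = relu(eta) - R                           # u_{M+1} removes the shift

print(np.allclose(recovered, gammas[-1]))           # True
```

The shift by $R$ keeps every intermediate value in the region where ReLU acts as the identity, which is exactly how the identity branch of the residual network is emulated by a plain channel.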

4. PART 2 OF THEOREM 1: SHARPNESS OF THE KERNEL SIZE REQUIREMENT

In this section, we prove the second part of Theorem 1. Consider kernels with support $\ell$ such that $\min_s \ell_s = 1$; without loss of generality we may assume $\ell_1 = 1$. We use the following example to illustrate the main intuition behind this sharpness result: the sum of two univariate functions cannot approximate a generic bivariate function well. As an explicit example, we show that there exists $\varepsilon_0 > 0$ such that $\|xy - f(x) - g(y)\|_{L^p([0,1]^2)} \ge \varepsilon_0$ for every choice of $L^p$ functions $f$ and $g$. Suppose that for some $f, g \in L^p([0,1])$ we have $\|xy - f(x) - g(y)\|_{L^p([0,1]^2)} = \varepsilon$. Define $I = [0, 1/2]^2$ and $p_1 = [0, 0]$, $p_2 = [1/2, 0]$, $p_3 = [0, 1/2]$, $p_4 = [1/2, 1/2]$. For convenience, write $h(x, y) = xy$ and consider the value
$$M = \|h(\cdot + p_1) + h(\cdot + p_4) - h(\cdot + p_2) - h(\cdot + p_3)\|_{L^p(I)}.$$
Direct calculation yields $M = 4^{-(p+1)/p} > 0$, since the expression inside the norm is identically $1/4$ on $I$. However, for the separable function $\hat h(x, y) = f(x) + g(y)$, it holds that $\hat h(\cdot + p_1) + \hat h(\cdot + p_4) - \hat h(\cdot + p_2) - \hat h(\cdot + p_3) = 0$. By the triangle inequality,
$$M \le 0 + \sum_{i=1}^4 \|h(\cdot + p_i) - \hat h(\cdot + p_i)\|_{L^p(I)} \le 4\varepsilon.$$
Therefore $\varepsilon \ge M/4 > 0$, concluding the example. To establish the sharpness result in general, we mimic the example above and introduce the following auxiliary space. For $x \in X$ and an integer $I$, define $x_{I:}$ as the tensor in $X_1 := \mathbb{R}^{n_2 \times \cdots \times n_d}$ such that $[x_{I:}]_{(i_2, \ldots, i_d)} = [x]_{(I, i_2, \ldots, i_d)}$. Define $\mathcal{H}$ as the family of functions $X \to \mathbb{R}$ given by
$$\mathcal{H} := \big\{ g \circ \varphi : X \to \mathbb{R} \ :\ \exists\, \psi : X_1 \to X_1 \text{ such that } [\varphi(x)]_{I:} = \psi(x_{I:}) \text{ for all } I \big\}.$$
We illustrate the function family $\mathcal{H}$ in the $\mathbb{R}^{3 \times 3}$ example: if $F \in \mathcal{H}$, then $F$ has the form
$$F = g\begin{pmatrix} \psi(x_{(1,1)}, x_{(1,2)}, x_{(1,3)}) \\ \psi(x_{(2,1)}, x_{(2,2)}, x_{(2,3)}) \\ \psi(x_{(3,1)}, x_{(3,2)}, x_{(3,3)}) \end{pmatrix}.$$
By the assumption on $\ell$, it is straightforward to deduce that $\mathrm{CNN}_{r,\ell}$, $\mathrm{resCNN}_{r,\ell}$ and $\mathrm{CODE}_{r,\ell}$ are all contained in $\mathcal{H}$: a kernel with $\ell_1 = 1$ never mixes information across the first index, so every layer acts on each slice $x_{I:}$ by one and the same map. It remains to show that $\mathcal{H}$ does not possess the shift invariant UAP.
The idea follows the simple example above: $x$ plays the role of the slice $(x_{(1,1)}, x_{(1,2)}, x_{(1,3)})$, $y$ plays the role of $(x_{(2,1)}, x_{(2,2)}, x_{(2,3)})$, and $f(x) + g(y)$ becomes a general permutation invariant function of the slices. We now carry out this proof.

Proof of the second part of Theorem 1. As discussed above, it suffices to show that $\mathcal{H} + \mathbb{R}$ does not satisfy the UAP. Set
$$F(x) = \prod_{i_1 > i_2} \big( \psi([x]_{i_1:}) - \psi([x]_{i_2:}) \big), \quad \text{where } \psi(y) = \sum_{i'} [y]_{i'},$$
and $K = [0, 1]^n$. We show that there exists a constant $\varepsilon_0 > 0$ such that for all $H \in \mathcal{H} + \mathbb{R}$,
$$\|F - H\|_{L^p(K)} \ge \varepsilon_0. \tag{8}$$
Choose two subregions of $K$: $K_1 = \{x \in K : x_{1:} \gg x_{2:} \gg \cdots \gg x_{n_1:}\}$ and $K_2 = \{x \in K : x_{2:} \gg x_{1:} \gg x_{3:} \gg \cdots \gg x_{n_1:}\}$. Here, for $z_1, z_2 \in X_1$, we say $z_1 \gg z_2$ if $\min_i [z_1]_i \ge \max_i [z_2]_i$. Consider the mapping $\tau$ that swaps the first and second slices (along the first index), that is,
$$[\tau(x)]_{2:} = [x]_{1:}, \quad [\tau(x)]_{1:} = [x]_{2:}, \quad [\tau(x)]_{i:} = [x]_{i:} \text{ for } i \ne 1, 2. \tag{9}$$
Then $\tau(K_1) = K_2$. By the definition of $\mathcal{H}$, we have $(H \circ \tau)(x) = H(x)$ for $x \in K_1$ and $H \in \mathcal{H} + \mathbb{R}$, since $\tau$ merely permutes the slice values entering the permutation invariant pooling. But $F \circ \tau = -F$, since $\tau$ swaps two arguments of the antisymmetric product, which implies
$$2\|F\|_{L^p(K_1)} = \|F - F \circ \tau\|_{L^p(K_1)} \le \|H - H \circ \tau\|_{L^p(K_1)} + \|F - H\|_{L^p(K)} + \|F \circ \tau - H \circ \tau\|_{L^p(K)} \le 2\|F - H\|_{L^p(K)}.$$
In the last step, the first term vanishes on $K_1$, and the remaining two terms are equal since $\tau$ is measure preserving. Since $F$ is nonzero a.e. on $K_1$, we may take $\varepsilon_0 = \|F\|_{L^p(K_1)} > 0$, which proves (8).
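The four-point identity driving the bivariate example is easy to verify numerically. The sketch below (ours) checks on a grid that the corner-shift difference of $h(x, y) = xy$ is identically $1/4$ on $I = [0, 1/2]^2$, while the same difference annihilates any separable function $f(x) + g(y)$:

```python
import numpy as np

# grid on I = [0, 1/2]^2
t = np.linspace(0.0, 0.5, 201)
X, Y = np.meshgrid(t, t, indexing="ij")

h = lambda x, y: x * y
# corner-shift difference h(.+p1) + h(.+p4) - h(.+p2) - h(.+p3)
D = h(X, Y) + h(X + 0.5, Y + 0.5) - h(X + 0.5, Y) - h(X, Y + 0.5)
print(np.allclose(D, 0.25))   # True: identically 1/4 on I

# the same difference annihilates any separable function f(x) + g(y)
f, g = np.sin, np.cos
Ds = (f(X) + g(Y)) + (f(X + 0.5) + g(Y + 0.5)) \
     - (f(X + 0.5) + g(Y)) - (f(X) + g(Y + 0.5))
print(np.allclose(Ds, 0.0))   # True
```

Taking the $L^p(I)$ norm of the constant $1/4$ gives $M = (1/4)\,|I|^{1/p} = 4^{-(p+1)/p}$, matching the value used in the text (e.g. $M = 1/8$ for $p = 2$).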

5. PART 3 OF THEOREM 1: SHARPNESS OF THE CHANNEL NUMBER REQUIREMENT

In this section, we show that an FCNN with only one channel per layer cannot satisfy the shift invariant UAP. The key is the following observation: if $G \in \mathrm{CNN}_{1,\infty} + \mathbb{R}$, then $G$ is continuous and piecewise linear, and there exists $g \in X$ such that for a.e. $x$, the gradient of $G$ is either $0$ or $g$. This assertion follows from a direct calculation of the gradient of $G$.

Proof of the third part of Theorem 1. Based on the above observation, we show that $F(x) = \|x\|$ cannot be approximated by such $G$ on the unit ball $B(0, 1)$. By a change to polar coordinates,
$$\int_{B(0,1)} |F(x) - G(x)|^p\, dx = \int_{\partial B(0,1)} \int_0^1 |F(t\xi) - G(t\xi)|^p\, t^{|n|-1}\, dt\, dS(\xi),$$
where $|n| = n_1 n_2 \cdots n_d$. Consider the hemisphere of directions $\xi \in \partial B(0, 1)$ with $\xi \cdot g < 0$. For such $\xi$, $f(t) := F(t\xi) = t$ is increasing, while $\bar g(t) := G(t\xi)$ is nonincreasing in $t$, since the directional derivative of $G$ along $\xi$ is a.e. either $0$ or $\xi \cdot g < 0$. To proceed, we state and prove the following lemma.

Lemma 1. For $f : [a, b] \to \mathbb{R}$ increasing, we have
$$\inf_{g \text{ decreasing on } [a,b]} \|f - g\|_{L^p([a,b])} = \inf_{g \text{ constant on } [a,b]} \|f - g\|_{L^p([a,b])}.$$

Proof. The "$\le$" direction is obvious, so it suffices to prove "$\ge$". Given any decreasing $g$, define a constant $\bar g$ as follows: if $f(t_0) = g(t_0)$ for some $t_0$, set $\bar g = g(t_0)$; otherwise $f$ and $g$ do not cross, and we set $\bar g = f(b)$ if $g > f$ on $[a, b]$, and $\bar g = f(a)$ if $g < f$ on $[a, b]$. In each case one verifies directly that $|f(t) - \bar g| \le |f(t) - g(t)|$ for all $t \in [a, b]$.

Using this lemma, we can estimate
$$\int_0^1 |f(t) - \bar g(t)|^p\, t^{|n|-1}\, dt \ge \Big(\frac{1}{2}\Big)^{|n|-1} \int_{1/2}^1 |f(t) - \bar g(t)|^p\, dt \ge \Big(\frac{1}{2}\Big)^{|n|-1} \inf_{a \in [1/2, 1]} \int_{1/2}^1 |t - a|^p\, dt = \Big(\frac{1}{2}\Big)^{|n|-1} \inf_{a \in [1/2, 1]} \frac{(1-a)^{p+1} + (a - 1/2)^{p+1}}{p+1} = 2^{-|n|+1} \cdot \frac{2\,(1/4)^{p+1}}{p+1} =: C_p.$$
The last step follows from the fact that the minimization problem attains its infimum at $a = 3/4$. Therefore,
$$\int_{B(0,1)} |F(x) - G(x)|^p\, dx \ge \frac{C_p}{2}\,\alpha,$$
where $\alpha$ is the surface measure of $\partial B(0, 1)$ and the factor $1/2$ accounts for restricting to the hemisphere. This implies the third part of Theorem 1.
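The value of $C_p$ can be confirmed numerically. The sketch below (ours) minimizes the one-dimensional objective over $a \in [1/2, 1]$ on a fine grid and compares the result with the closed form attained at $a = 3/4$:

```python
import numpy as np

p = 3.0
# objective: inf over constants a of the integral of |t - a|^p over [1/2, 1],
# which equals ((1-a)^{p+1} + (a - 1/2)^{p+1}) / (p + 1)
a = np.linspace(0.5, 1.0, 100001)
vals = ((1.0 - a) ** (p + 1) + (a - 0.5) ** (p + 1)) / (p + 1)

a_star = a[np.argmin(vals)]
closed_form = 2.0 * 0.25 ** (p + 1) / (p + 1)   # value at the minimizer a = 3/4

print(a_star)                                   # ~ 0.75
print(np.isclose(vals.min(), closed_form))      # True
```

Setting the derivative to zero gives $(1 - a)^p = (a - 1/2)^p$, i.e. $a = 3/4$, by symmetry of the two penalty terms about the midpoint of $[1/2, 1]$.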

6. CONCLUSION

We provided the first approximation results for deep fully convolutional neural networks with fixed channel number and limited convolution kernel size, and quantified the minimal requirements on these for universal approximation of shift invariant (or equivariant) functions. We proved that the fully convolutional neural network with residual blocks, resCNN_{r,ℓ}, achieves the shift invariant UAP if and only if r ≥ 1 and ℓ ≥ 2; this result does not depend on the specific form of the activation function. For the non-residual version, we proved that CNN_{r,ℓ} has the shift invariant UAP if and only if r ≥ 2 and ℓ ≥ 2, where the "if" part requires σ to be the ReLU function. In addition, the results also hold for the corresponding equivariant versions. The proofs are based on developing tools for dynamical hypothesis spaces, which have the flexibility to handle variable architectures and yield approximation results that highlight the power of function composition. We conclude with some discussion of future directions. In this paper, the shift invariant UAP for CNN_{2,2} was established for ReLU activations. The proof relies on the special structure of ReLU, namely ReLU(x) = x for x > 0, which lets us use a translation to emulate the residual connection; this construction was outlined in the proof of the first part of Theorem 1. It would be of interest to study whether other activations, such as sigmoid or tanh, can also achieve the shift invariant UAP at fixed widths and limited kernel sizes. Further, one may wish to establish explicit approximation rates in terms of depth, and identify suitable function classes that can be efficiently approximated by these invariance/equivariance preserving networks. Finally, one may also consider extending the current theory to handle the up-sampling and down-sampling layers commonly featured in deep architectures.
In addition to the approximation error, it is natural and useful to consider the generalization (statistical) error in the overall analysis of a machine learning model. Compared to shallow and wide models, few generalization results have been established in the deep-but-narrow setting (with more than 3 layers). While the current paper only concerns approximation theory, establishing generalization estimates is nevertheless an important future direction.

In the following appendices, we provide the remaining details of the proofs of the results presented in this paper.

A EQUIVARIANT FULLY CONVOLUTIONAL NEURAL NETWORKS

If we remove the final layer of CNN or resCNN, we obtain a neural network whose output has the same size as the input. This is the original definition of FCNN introduced in (Long et al., 2015), primarily used for pixel-wise image tasks. Correspondingly, the symmetry property changes from shift invariance to shift equivariance. This leads to the following hypothesis spaces, paralleling the shift invariant counterparts in the main paper:
$$\mathrm{eqCNN}_{r,\ell} = \{f_m \circ \cdots \circ f_1 : f_1, \dots, f_m \in \mathcal{F}_{r,\ell},\ m \ge 1\},$$ (13)
$$\mathrm{eqresCNN}_{r,\ell} = \{(\mathrm{id} + f_m) \circ \cdots \circ (\mathrm{id} + f_1) : f_1, \dots, f_m \in \mathcal{F}_{r,\ell},\ m \ge 1\},$$ (14)
$$\mathrm{eqCODE}_{r,\ell} = \{\phi(f_m, t_m) \circ \cdots \circ \phi(f_1, t_1) : f_1, \dots, f_m \in \mathcal{F}_{r,\ell},\ t_1, \dots, t_m \in \mathbb{R},\ m \ge 1\}.$$ (15)
To distinguish them from functions $\varphi : X \to \mathbb{R}$, we use the word "mappings" for functions from $X$ to $X$.

Definition 3. The mapping $\varphi$ is called shift equivariant if $T_k(\varphi(x)) = \varphi(T_k(x))$ for all $x \in X$ and all $k$. The mapping family $\mathcal{X}$ is said to have the shift equivariant UAP if (1) each mapping in $\mathcal{X}$ is shift equivariant, and (2) given any shift equivariant continuous mapping $\varphi$, compact set $K \subseteq X$, and tolerance $\varepsilon > 0$, there exists a mapping $\psi \in \mathcal{X}$ such that $\|\psi - \varphi\|_{L^p(K)} \le \varepsilon$.

Then, the analogous result for equivariant approximation is stated as follows.

Theorem 3. We have the following results. 1. For the fully convolutional neural network with residual blocks, $\mathrm{eqresCNN}_{r,\ell}$ possesses the shift equivariant UAP for $r \ge 1$ and $\ell \ge 2$. For the non-residual version, $\mathrm{eqCNN}_{r,\ell}$ possesses the shift equivariant UAP for $r \ge 2$ and $\ell \ge 2$. 2. The kernel size 2 is optimal in the following sense: for $\ell$ with $\min_s \ell_s = 1$, neither $\mathrm{eqresCNN}_{\infty,\ell} + \mathbb{R}$ nor $\mathrm{eqCNN}_{\infty,\ell} + \mathbb{R}$ possesses the shift equivariant UAP. 3. The channel number of the non-residual fully convolutional neural network is optimal, in the sense that the mapping family $\mathrm{eqCNN}_{1,\infty} + \mathbb{R}$ does not possess the shift equivariant UAP.
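As a quick sanity check on Definition 3, the following sketch (our own illustration; `circ_conv` and `layer` are hypothetical helpers, not from the paper) verifies that a circular convolution layer with kernel size 2 followed by ReLU commutes with the shift operator $T_k$:

```python
import numpy as np

def circ_conv(w, x):
    # Circular (periodic) convolution: (w * x)[i] = sum_j w[j] * x[(i - j) mod n].
    n = len(x)
    return np.array([sum(w[j] * x[(i - j) % n] for j in range(len(w)))
                     for i in range(n)])

def layer(w, b, x):
    # One convolutional layer: sigma(w * x + b) with sigma = ReLU.
    return np.maximum(circ_conv(w, x) + b, 0.0)

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
w = np.array([0.5, -1.0])   # kernel size 2 (the case l = 2)
b, k = 0.1, 3

lhs = layer(w, b, np.roll(x, k))      # shift first, then apply the layer
rhs = np.roll(layer(w, b, x), k)      # apply the layer, then shift
```

Since the elementwise activation trivially commutes with index shifts, equivariance of the whole layer reduces to equivariance of the circular convolution itself.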
To prove Theorem 3, we start with the following proposition, which links the universal approximation property of an invariant function family to that of an equivariant mapping family. A version of this was proved in Li et al. (2022b) in a rather abstract setting for general transitive groups. We provide a more explicit proof in Appendix B.3 for the specific case where only the shift operators $T_k$ are concerned.

Proposition 2 (Equivariant UAP is Sufficient). Suppose $g : X \to \mathbb{R}$ is Lipschitz, permutation invariant, and $g(X) = \mathbb{R}$. If a mapping family $\mathcal{A}$ possesses the shift equivariant UAP, then $\mathcal{B} = \{g \circ \varphi : \varphi \in \mathcal{A}\}$ possesses the shift invariant UAP.

A corollary of this proposition is that, using the first part of Theorem 3, we can derive the first part of Theorem 1, while using the second and third parts of Theorem 1 we can prove the second and third parts of Theorem 3. Finally, it remains to show the following theorem, an equivariant version of Theorem 2; as in the proof procedure described in Appendix B.4, all other results reduce to it.

Theorem 4. The mapping family $\mathrm{eqCODE}_{1,2}$ satisfies the shift equivariant UAP.

Proof of Theorem 2. This follows from the fact that $\{g \circ f : f \in \mathrm{eqCODE}_{r,\ell}\} = \mathrm{CODE}_{r,\ell}$, Theorem 4, and Proposition 2.

Proof of Theorem 3. The first statement follows from Theorem 4 and Proposition 1. Suppose the second statement is false, i.e., $\mathrm{eqCNN}_{\infty,\ell}$ possesses the shift equivariant UAP. Then, since $\{g \circ f : f \in \mathrm{eqCNN}_{\infty,\ell}\} \subset \mathcal{H} + \mathbb{R}$, Proposition 2 implies that $\mathrm{CNN}_{\infty,\ell}$ has the shift invariant UAP, contradicting Theorem 1. The remaining sharpness result can be proved in a similar manner. Thus, to deduce Theorem 3 it remains to prove Theorem 4 and Proposition 2; these proofs are presented in Appendices B.4 and B.3, respectively.

B TECHNICAL DETAILS

In this section, we provide detailed proofs of the results quoted in the main paper and in the previous appendix section. The overall plan of the proof is shown in the following diagram, including the dependencies among the propositions and theorems and their proofs.

B.1 CLOSURE OF A MAPPING FAMILY

We begin with some elementary properties of compositional hypothesis spaces. For a given bi-Lipschitz mapping family $\mathcal{G}$, consider its closure in the following sense:
$$\mathrm{clo}(\mathcal{G}) := \{\varphi \in C(X, X) : \text{for any compact } K \subset X \text{ and } \varepsilon > 0,\ \exists \psi \in \mathcal{G} \text{ such that } \|\psi - \varphi\|_{L^p(K)} \le \varepsilon\}.$$
Then, we have the following lemma.

Lemma 2. If $\mathcal{G}$ is closed under composition, then so is $\mathrm{clo}(\mathcal{G})$.

Proof. Take $\alpha, \beta \in \mathrm{clo}(\mathcal{G})$; by definition it suffices to show that $\alpha \circ \beta \in \mathrm{clo}(\mathcal{G})$. For a given compact $K \subset X$ and $\varepsilon > 0$, there exists $\beta' \in \mathcal{G}$ such that $\|\beta' - \beta\|_{L^p(K)} \le \varepsilon' := \frac{\varepsilon}{2\,\mathrm{Lip}(\alpha)}$. Also, there exists $\alpha' \in \mathcal{G}$ such that $\|\alpha' - \alpha\|_{L^p(\beta'(K))} \le \frac{\varepsilon}{2\,\mathrm{Lip}(\beta'^{-1})}$. Then
$$\|\alpha' \circ \beta' - \alpha \circ \beta\|_{L^p(K)} \le \|\alpha' \circ \beta' - \alpha \circ \beta'\|_{L^p(K)} + \|\alpha \circ \beta' - \alpha \circ \beta\|_{L^p(K)} \le \frac{\varepsilon}{2} + \mathrm{Lip}(\alpha)\,\varepsilon' \le \varepsilon.$$
Since $\alpha' \circ \beta' \in \mathcal{G}$, this shows $\alpha \circ \beta \in \mathrm{clo}(\mathcal{G})$.

Lemma 3. If $\mathrm{clo}(\mathcal{G})$ possesses the shift equivariant UAP, then so does $\mathcal{G}$.

Proof. By definition, for a given shift equivariant mapping $\varphi$, compact set $K \subset X$, and tolerance $\varepsilon > 0$, there exists $\varphi_1 \in \mathrm{clo}(\mathcal{G})$ such that $\|\varphi - \varphi_1\|_{L^p(K)} \le \varepsilon$. From the definition of $\mathrm{clo}(\mathcal{G})$, there exists $\varphi_2 \in \mathcal{G}$ such that $\|\varphi_1 - \varphi_2\|_{L^p(K)} \le \varepsilon$. Hence $\|\varphi - \varphi_2\|_{L^p(K)} \le 2\varepsilon$, which concludes the result.
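The triangle-inequality argument in Lemma 2 can be observed numerically. The sketch below is our own example with the sup norm on a compact interval standing in for the $L^p$ bound; the function choices are ours. It checks that composing an $\varepsilon/2$-approximation of $\alpha$ with an $\varepsilon/(2\,\mathrm{Lip}(\alpha))$-approximation of $\beta$ yields an $\varepsilon$-approximation of $\alpha \circ \beta$:

```python
import numpy as np

eps = 0.1
alpha = np.sin                    # Lip(alpha) = 1 on the real line
beta = lambda x: 2.0 * x

# Perturbed versions within the prescribed tolerances.
alpha_p = lambda x: np.sin(x) + 0.9 * eps / 2   # ||alpha' - alpha||_sup <= eps/2
beta_p = lambda x: 2.0 * x + 0.9 * eps / 2      # ||beta' - beta||_sup <= eps/(2 Lip(alpha))

x = np.linspace(-3.0, 3.0, 2001)
gap = float(np.max(np.abs(alpha_p(beta_p(x)) - alpha(beta(x)))))
# gap <= ||alpha' - alpha||_sup + Lip(alpha) * ||beta' - beta||_sup <= eps
```

The two error budgets of $\varepsilon/2$ each mirror how the lemma splits the composition error into an outer-map term and a Lipschitz-propagated inner-map term.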

B.2 PROOF OF PROPOSITION 1

We first prove that $\Phi = \phi(f, T)$ can be approximated by $\hat\Phi = (\mathrm{id} + (T/n) f)^n$. For given $t > 0$,
$$|\phi(f,t)(x) - x - t f(x)| \le \int_0^t |f(\phi(f,s)(x)) - f(x)|\, ds \le \int_0^t L\,|\phi(f,s)(x) - x|\, ds \le \frac{Lt^2}{2}|f(x)| + \int_0^t L\,|\phi(f,s)(x) - x - s f(x)|\, ds.$$
Therefore, by Gronwall's inequality, we obtain
$$|\phi(f,t)(x) - x - t f(x)| \le C(t) + \int_0^t C(s)\, e^{L(t-s)}\, ds \le \frac{Lt^2}{2}|f(x)| + \frac{Lt^3}{3}|f(x)| \le L\,|f(x)|\, t^2$$
for $t \le \min(\ln 2 / L, 1)$, where $C(t) = \frac{t^2 L}{2}|f(x)|$.

This local estimate can be iterated to obtain the desired approximation result. For given $T$, set $t = T/n$, and define $\Phi_n = (\phi(f,t))^n$ and $\hat\Phi_n = (\mathrm{id} + t f)^n$. Then it holds that
$$|\Phi_n(x) - \hat\Phi_n(x)| \le |\Phi(\Phi_{n-1}(x)) - (\mathrm{id}+tf)(\Phi_{n-1}(x))| + |(\mathrm{id}+tf)(\Phi_{n-1}(x)) - (\mathrm{id}+tf)(\hat\Phi_{n-1}(x))| \le L\,|f(\Phi_{n-1}(x))|\, t^2 + (1+tL)\,|\Phi_{n-1}(x) - \hat\Phi_{n-1}(x)|.$$ (20)
Let $M = \sup_{s \in [0,T]} |f(\phi(f,s)(x))|$. Since $\mathrm{Lip}(\mathrm{id}+tf) \le 1 + tL$, we obtain
$$|\Phi_n(x) - \hat\Phi_n(x)| \le L M t^2 + (1+tL)\,|\Phi_{n-1}(x) - \hat\Phi_{n-1}(x)|.$$ (21)
By the discrete Gronwall inequality, it holds that
$$|\Phi_n(x) - \hat\Phi_n(x)| \le \frac{(1+tL)^n - 1}{(1+tL) - 1}\, L M t^2 \le e^{TL} M t.$$
The above estimates imply that, defining $\mathcal{G} = \{(\mathrm{id} + t_m f_m) \circ \cdots \circ (\mathrm{id} + t_1 f_1) : f_1, \dots, f_m \in \mathcal{F}\}$, we have $\phi(f,t) \in \mathrm{clo}(\mathcal{G})$. By Lemma 2, we conclude the result.
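The estimate above says that a deep residual network with blocks $\mathrm{id} + (T/n)f$ is the forward Euler discretization of the flow, with error $O(t) = O(T/n)$. A small numerical sketch (our own illustration; the field $f$ and the step counts are arbitrary choices):

```python
import numpy as np

def flow(f, T, x, steps=200000):
    # High-resolution reference for the flow map phi(f, T)(x).
    h = T / steps
    for _ in range(steps):
        x = x + h * f(x)
    return x

def res_net(f, T, x, n):
    # n residual blocks (id + (T/n) f)^n, i.e. forward Euler with step t = T/n.
    t = T / n
    for _ in range(n):
        x = x + t * f(x)
    return x

f = np.sin                 # Lipschitz vector field with L = 1
x0, T = 0.3, 1.0
ref = flow(f, T, x0)
errs = [abs(res_net(f, T, x0, n) - ref) for n in (5, 10, 20, 40)]
# halving t roughly halves the error, consistent with the e^{TL} M t bound
```

Doubling the depth $n$ halves the step $t$, and the observed errors shrink accordingly, which is exactly the first-order rate the discrete Gronwall bound predicts.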

B.3 PROOF OF PROPOSITION 2

Proof of Proposition 2. Without loss of generality, we assume that $K = [-a, a]^n$; otherwise we can enlarge $K$. Define $K_1 = \{x \in K : [x]_1 > [x]_i\ \forall i \ne 1\}$. Then it is easy to check that $K = \bigcup_i (T_i K_1)$ up to a measure zero set. Define $\varepsilon' := \frac{\varepsilon}{|n|(1 + \mathrm{Lip}\,g)}$. By the results in Li et al. (2022a, Theorem 3.8), for any $\varepsilon' > 0$ there exists $u$ such that $\|F - g \circ u\|_{L^p(K)} \le \varepsilon'$. Note that $u$ here is not necessarily equivariant, otherwise we would be done. We now attempt to find $f$ by a kind of equivariantization of $u$, as explained below.

Since $u$ is in $L^p$, we consider a compact set $O \subset K_1$ such that $\|u\|_{L^p(K_1 \setminus O)} \le \varepsilon'$. Take a smooth truncation function $\chi \in C^\infty(\mathbb{R}^d)$ with values in $[0,1]$, such that $\chi|_O = 1$ and $\chi|_{K_1^c} = 0$. Then $\tilde u = \chi u$ is a smoothed and truncated version of $u$. For $x \in T_k K_1$ with index $k$, define $f(x) = T_k(\tilde u(T_{-k}(x)))$. Since the different $T_k K_1$ are disjoint, the value of $f$ is well defined on $\bigcup_k T_k K_1$; we set $f(x) = 0$ on the complement of $\bigcup_k T_k K_1$. The truncation function $\chi$ ensures that $f$ vanishes on the boundary of $Q_A$, hence $f$ is continuous, and direct verification shows that $f$ is shift equivariant.

It then suffices to estimate $\|F - g \circ f\|_{L^p}$. Since both $F$ and $g \circ f$ are equivariant, it is natural and helpful to restrict the estimate to $K_1$, since $\|F - g \circ f\|_{L^p(K)} = |n|\,\|F - g \circ f\|_{L^p(K_1)}$. To estimate the error on $K_1$, we first bound $\|u - f\|_{L^p(K_1)}$. Since $u$ and $f|_{K_1} = \tilde u$ coincide on $O$, we have
$$\|u - f\|_{L^p(K_1)} = \|u - \tilde u\|_{L^p(K_1)} \le \|u\|_{L^p(K_1 \setminus O)} \le \varepsilon'.$$
The inequality follows from the fact that $\chi$ takes values in $[0,1]$. Since $g$ is Lipschitz, we have $\|g \circ u - g \circ f\|_{L^p(K_1)} \le \mathrm{Lip}(g)\,\varepsilon'$, yielding $\|F - g \circ f\|_{L^p(K_1)} \le (1 + \mathrm{Lip}\,g)\,\varepsilon'$. We finally have $\|F - g \circ f\|_{L^p(K)} \le (1 + \mathrm{Lip}\,g)\,|n|\,\varepsilon' = \varepsilon$.
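The equivariantization step $f(x) = T_k(\tilde u(T_{-k}(x)))$ can be mimicked in code. The sketch below is our own illustration: we use the cell where the maximum coordinate sits at index 0 as a stand-in fundamental domain, apply an arbitrary map `u` there, and extend by shifting; the resulting map is shift equivariant by construction.

```python
import numpy as np

def equivariantize(u):
    # Extend u, defined on the cell where coordinate 0 is the unique maximum,
    # to a shift equivariant map via f(x) = T_k u(T_{-k} x).
    def f(x):
        k = int(np.argmax(x))          # shift index bringing x into the cell
        return np.roll(u(np.roll(x, -k)), k)
    return f

u = lambda y: y ** 2 - y               # arbitrary continuous map on the cell
f = equivariantize(u)

rng = np.random.default_rng(1)
x = rng.standard_normal(6)             # generically the argmax is unique
k = 2
lhs = f(np.roll(x, k))                 # equivariance: f(T_k x) = T_k f(x)
rhs = np.roll(f(x), k)
```

The smooth truncation $\chi$ in the proof plays no role here; it is only needed to glue the shifted copies continuously across cell boundaries, which this pointwise sketch does not attempt.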

B.4 THE PROOF OF THEOREM 4

In this section we prove the UAP of $\mathrm{eqCODE}_{1,2}$. Here, we relax the constraint that $\sigma = \mathrm{ReLU}$ and instead make the following assumption on $\sigma$, based on the notion of a "well function" from Li et al. (2022a).

Definition 4. We say a Lipschitz function $h : \mathbb{R} \to \mathbb{R}$ is a well function if $\{x \in \mathbb{R} : h(x) = 0\}$ is a bounded (closed) interval.

In this section, we assume that there exists a well function $h$ in the closure of $\mathrm{span}\{v\sigma(w\,\cdot + b) : v, w, b \in \mathbb{R}\}$. Commonly used activation functions, including ReLU, sigmoid and tanh, meet this assumption; see Li et al. (2022a).

For a given continuous function $u : \mathbb{R} \to \mathbb{R}$, define the coordinate zooming mapping $u^{\otimes} : X \to X$ by $[u^{\otimes}(x)]_i = u([x]_i)$. We say a point $x$ is a stabilizer if and only if there exists a non-trivial $k \ne 0$ such that $T_k x = x$. We say a point set $X = \{x_1, x_2, \dots, x_n\}$ is shift distinct if $T_k(x_{i_1}) = x_{i_2}$ for some $i_1, i_2, k$ implies $i_1 = i_2$ and $k = 0$. Notice that if a point set is shift distinct, then for any member $x \in X$, the only $k$ with $T_k(x) = x$ is $k = 0$; this is implied directly by the definition of shift distinctness.

The basic approximation framework is based on the following two properties.

Proposition 3 (Basic Framework). Let $\mathcal{A}$ be a family of mappings $X \to X$ that is closed under composition. Suppose $\mathcal{A}$ satisfies the following two conditions:
1. (Coordinate zooming property) For any continuous function $u$, the mapping $u^{\otimes}$ is in $\mathcal{A}$.
2. (Point matching property) For a given shift distinct point set $x_1, \dots, x_M$, a target point set $y_1, \dots, y_M$, a stabilizer point set $s_1, s_2, \dots, s_N$, and a tolerance $\varepsilon > 0$, there exists a mapping $\varphi \in \mathcal{A}$ such that $|\varphi(x_i) - y_i| \le \varepsilon$ and $|\varphi(s_i)| \le 1$.
Then, $\mathcal{A}$ possesses the shift equivariant UAP.

In words, the point matching property says that we can use mappings in $\mathcal{A}$ to move each $x_i$ to $y_i$ while keeping the stabilizer points bounded.
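For intuition on Definition 4: with $\sigma = \mathrm{ReLU}$, an explicit element of $\mathrm{span}\{v\sigma(w\,\cdot + b)\}$ that is a well function is $h(x) = \sigma(x-1) + \sigma(-x-1)$, whose zero set is exactly the bounded interval $[-1, 1]$. A small check (our own illustration):

```python
import numpy as np

relu = lambda x: np.maximum(x, 0.0)

# h lies in span{ v * sigma(w x + b) } with sigma = ReLU:
# h(x) = relu(x - 1) + relu(-x - 1), which vanishes exactly on [-1, 1].
h = lambda x: relu(x - 1.0) + relu(-x - 1.0)

inside = np.array([-1.0, -0.5, 0.0, 0.7, 1.0])   # points of the zero interval
outside = np.array([-3.0, -1.2, 1.2, 4.0])       # points where h must be positive
```

Analogous two-term combinations of shifted sigmoids or tanh units only vanish in a limit, which is why the assumption asks for a well function in the *closure* of the span.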
We now use this proposition to prove Theorem 4. Consider the closure in the UAP sense, that is,
$$\mathcal{A} := \{\varphi : \text{for all compact } K \subset X \text{ and } \varepsilon > 0, \text{ there exists } \psi \in \mathrm{eqCODE}_{1,2} \text{ with } \|\varphi - \psi\|_{L^p(K)} \le \varepsilon\}.$$

Proposition 4. The following results hold for the mapping family $\mathcal{A}$.
1. $\mathcal{A}$ is closed under composition.
2. Given $w \in X$ with $|\mathrm{supp}\,w| \le 2$ and $b \in \mathbb{R}$, the flow map $\phi(h(w * \cdot + b\mathbf{1}), t)$ is in $\mathcal{A}$.
3. $\mathcal{A}$ satisfies the coordinate zooming property.
4. If $\mathcal{A}$ possesses the shift equivariant UAP, then so does $\mathrm{eqCODE}_{1,2}$.

Suppose now that the point matching property holds for $\mathcal{A}$; we then prove the main theorem (Theorem 4).

Proof of Theorem 4. By the last part of Proposition 4, it suffices to show that $\mathcal{A}$ possesses the shift equivariant UAP. By the third part (together with the first part) and Proposition 3, if $\mathcal{A}$ has the point matching property, then $\mathcal{A}$ has the shift equivariant UAP, which concludes the result.

From the proof, we see that once the point matching property is established, Theorem 4 follows. The proof of the point matching property is the most technical part of this paper; we first give a sketch.

Sketch of the proof of the point matching property. In this sketch, we only consider the case with no stabilizers, i.e., $N = 0$.

Step 1. We first show that if $\mathcal{A}$ has the following point reordering property, then $\mathcal{A}$ has the point matching property: for any shift distinct point set $x_1, \dots, x_M$, we can find a mapping $\varphi \in \mathcal{A}$ such that $[\varphi(x_j)]_i > [\varphi(x_{j'})]_{i'}$ if $j < j'$, or $j = j'$ but $i \prec i'$. Here the partial order $\prec$ is the lexicographic order. For brevity, we say in this case that $\varphi(x_j)$ is ordered.

Step 2. To begin with, we first prove that there exists a mapping $\beta \in \mathcal{A}$ such that $[\beta(x_1)]_i > [\beta(x_j)]_{i'}$ for $j \ne 1$ and any indices $i, i'$.

Step 3. Set $z_j = \beta(x_j)$. Now we are ready for an induction argument. Suppose for $z_2, \dots, z_M$ we have a mapping $\psi \in \mathcal{A}$ fulfilling the point reordering property. We modify it to the mapping $\tilde\psi = \psi \circ u^{\otimes} \in \mathcal{A}$ satisfying the following conditions:
• $\tilde\psi(z_2), \dots, \tilde\psi(z_M)$ are ordered;
• $[\tilde\psi(z_1)]_i > [\tilde\psi(z_j)]_{i'}$ for $j \ne 1$ and all indices $i, i'$.

Step 4. Finally, we modify $\tilde\psi$ to obtain $\varphi$ such that $\varphi(z_j)$ is ordered. This establishes the point reordering property for $\mathcal{A}$. The full proof of Theorem 4 is given in Appendix B.7.
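Step 1 of the sketch rests on building an increasing scalar function $u$ whose coordinate zooming $u^{\otimes}$ sends prescribed ordered coordinate values to prescribed targets. A minimal sketch of this ingredient (our own code; `monotone_transport` is a hypothetical helper built on piecewise linear interpolation):

```python
import numpy as np

def monotone_transport(src, dst):
    # Increasing piecewise linear u with u(src[i]) = dst[i]; both sequences
    # must be strictly increasing, mirroring the ordered coordinates in Step 1.
    src, dst = np.asarray(src, float), np.asarray(dst, float)
    assert np.all(np.diff(src) > 0) and np.all(np.diff(dst) > 0)
    return lambda x: np.interp(x, src, dst)

src = [-1.0, 0.2, 1.5]       # ordered coordinate values (e.g. of phi_x(x_j))
dst = [0.0, 2.0, 5.0]        # ordered target values (e.g. of phi_y(y_j))
u = monotone_transport(src, dst)
vals = u(np.array(src))      # u matches every prescribed point exactly
```

Because both sequences are strictly increasing, a single monotone $u$ suffices; this is precisely why the reordering property makes the subsequent matching step easy.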

B.5 PROOF OF PROPOSITION 3

Proof of Proposition 3. Without loss of generality, we can suppose that $K = [-a, a]^n$; otherwise, we can expand $K$ to a sufficiently large hypercube.

Step 1. Given a scale $\delta > 0$, consider the grid $\delta\mathbb{Z}^n$ of size $\delta$. Let $q \in \mathbb{Z}^n$ be a tensor with all coordinates being integers, and let $\chi_q$ be the indicator of the cube
$$\square_{q,\delta} := \{x : [x]_i \in [[q]_i\delta, ([q]_i + 1)\delta]\}.$$
Since $\varphi$ is in $L^p(K)$, by standard approximation theory $\varphi$ can be approximated by the piecewise constant (and shift equivariant) mapping
$$\varphi_0(x) = \sum_{q : \square_{q,\delta} \subset K} y_q\, \chi_q(x), \qquad y_q = \lambda(\square_{q,\delta})^{-1} \int_{\square_{q,\delta}} \varphi(x)\, dx,$$
where $y_q$ is the local average of $\varphi$ over $\square_{q,\delta}$. Then we have $\|\varphi - \varphi_0\|_{L^p(K)} \le \omega_\varphi(\delta)\,[\lambda(K)]^{1/p} \to 0$ as $\delta \to 0$, where $\omega_\varphi(\delta) := \sup_{|x-y| \le \delta,\ x,y \in K} |\varphi(x) - \varphi(y)|$ is the modulus of continuity of $\varphi$ restricted to $K$, and $\lambda(K)$ is the Lebesgue measure of $K$.

Step 2. Let $q\delta$ be a vertex of $\square_{q,\delta}$. Define $I$ as a maximal subset of $I_0 = \{q : q\delta \in K\}$ such that $\{q\delta : q \in I\}$ is shift distinct. By maximality and the definition of shift distinctness, for each $q \in I_0$ only two situations can occur:
1. there exist a shift operator $T_k$ and $q' \in I$ such that $T_k q' = q$; or
2. $q$ itself is a stabilizer, that is, there exists a shift operator $T_k$ with $k \ne 0$ such that $T_k q = q$.
By the construction of $y_q$, it holds that $T_k y_q = y_{T_k q}$. Given $\varepsilon > 0$, by the point matching property, we can find $f \in \mathcal{A}$ such that
• for $q \in I_0$ that is not a stabilizer, $|f(q\delta) - y_q| \le \varepsilon$;
• for $q \in I_0$ that is a stabilizer, $|f(q\delta)| \le 1$.
For $\alpha \in (0,1)$, define the shrunken cube $\square^{\alpha}_{q,\delta} := \{x \in X : [x]_i \in [[q]_i\delta, ([q]_i + \alpha)\delta]\}$, and define $K^{\alpha} = \bigcup_{\square_{q,\delta} \subset K} \square^{\alpha}_{q,\delta}$, which is a subset of $K$. Given $\beta > 0$, we now use the coordinate zooming property of $\mathcal{A}$ to find $u^{\otimes} \in \mathcal{A}$ such that $u([i\delta, (i+\alpha)\delta]) \subset [i\delta, (i + \frac{\beta}{2n})\delta]$ for $i \in \{[q]_s : s = 1, \dots, n;\ q \in I\}$. To do this, we construct an explicit piecewise linear function $u$, and select $\varepsilon < \frac{\beta}{3n}\delta$.
By the coordinate zooming property, $u^{\otimes}$ is in $\mathcal{A}$. Therefore, we have
$$|f(u^{\otimes}(x)) - y_q| \le 2\varepsilon \quad \text{for } x \in \square^{\alpha}_{q,\delta},$$ (36)
if $q$ is not a stabilizer, and
$$|f(u^{\otimes}(x))| \le 1 + \varepsilon \quad \text{for } x \in \square^{\alpha}_{q,\delta},$$ (37)
if $q$ is a stabilizer. These two estimates (36) and (37) will be useful in the final step.

Step 3. We are ready to estimate the error $\|\varphi - f \circ u^{\otimes}\|_{L^p(K)}$. The estimate is split into three parts:
$$K \setminus K^{\alpha}, \qquad K^{\alpha}_1 = \bigcup_{q \text{ not a stabilizer}} \square^{\alpha}_{q,\delta}, \qquad K^{\alpha}_2 = \bigcup_{q \text{ a stabilizer}} \square^{\alpha}_{q,\delta}.$$ (38)
Notice that $K^{\alpha} = K^{\alpha}_1 \cup K^{\alpha}_2$. For $K^{\alpha}_1$, from (36) at the end of Step 2, we have $\|f \circ u^{\otimes} - \varphi_0\|_{L^\infty(K^{\alpha}_1)} \le 2\varepsilon$, and thus
$$\|f \circ u^{\otimes} - \varphi_0\|_{L^p(K^{\alpha}_1)} \le 2\varepsilon\,[\lambda(K^{\alpha})]^{1/p} \le 2\varepsilon\,[\lambda(K)]^{1/p}.$$ (39)
For $K^{\alpha}_2$, note that if $q$ is a stabilizer, then all points in $\square_{q,\delta}$ are close to a hyperplane $\Gamma_{i,j} := \{x \in X : [x]_i = [x]_j\}$ for some distinct $i, j$; the distance from these points to $\Gamma_{i,j}$ is smaller than $\sqrt{n}\,\delta$. Therefore, the Lebesgue measure of $K^{\alpha}_2$ is smaller than that of the set of all points whose distance to the union of the hyperplanes $\Gamma_{i,j}$ is less than $\sqrt{n}\,\delta$, which is $O(\delta)$. Thus, we have
$$\|f \circ u^{\otimes} - \varphi_0\|_{L^p(K^{\alpha}_2)} \le (1 + \varepsilon + \|\varphi_0\|_{C(K)})\,O(\delta) \le (1 + \varepsilon + \|\varphi\|_{C(K)})\,O(\delta).$$ (40)
The last inequality holds since $\|\varphi_0\|_{C(K)} \le \|\varphi\|_{C(K)}$ by construction. For $K \setminus K^{\alpha}$, we have
$$\|f \circ u^{\otimes} - \varphi_0\|_{L^p(K \setminus K^{\alpha})} \le (\|f\|_{C(K)} + \|\varphi\|_{C(K)})\,[\lambda(K \setminus K^{\alpha})]^{1/p} \le (\|f\|_{C(K)} + \|\varphi\|_{C(K)})\,(1 - \alpha^d)^{1/p}\,[\lambda(K)]^{1/p}.$$
We first choose $\delta$ sufficiently small such that the right hand side of (40) is not greater than $\varepsilon$, and then choose $\alpha$ with $1 - \alpha$ sufficiently small that $(\|f\|_{C(K)} + \|\varphi\|_{C(K)})(1 - \alpha^d)^{1/p} \le \varepsilon$. We then conclude the result since $f \circ u^{\otimes} \in \mathcal{A}$.
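Step 1's piecewise constant approximation can be sanity-checked in one dimension. The sketch below is our own illustration (names are ours): it builds the local-average approximant on a grid of size $\delta$ and observes the error bound by the modulus of continuity, $\|\varphi - \varphi_0\|_{\sup} \le \omega_\varphi(\delta)$.

```python
import numpy as np

def local_average_approx(phi, a, delta, samples=256):
    # phi_0 is constant on each grid cell [q*delta, (q+1)*delta) over [-a, a],
    # equal to the local average y_q of phi on that cell.
    def phi0(x):
        q = np.floor((x + a) / delta)
        left = -a + q * delta
        s = np.linspace(0.0, 1.0, samples)
        return np.array([np.mean(phi(c + delta * s))
                         for c in np.atleast_1d(left)])
    return phi0

phi = np.cos                      # 1-Lipschitz, so omega_phi(delta) <= delta
a = 1.0
errs = []
for delta in (0.5, 0.25, 0.125):
    phi0 = local_average_approx(phi, a, delta)
    x = np.linspace(-a, a - 1e-9, 1000)
    errs.append(float(np.max(np.abs(phi(x) - phi0(x)))))
# sup error decreases with delta and never exceeds omega_phi(delta) <= delta
```

Since every value of $\varphi$ in a cell is within $\omega_\varphi(\delta)$ of the cell average, the sup-norm error is controlled cell by cell, which is the 1-D analogue of the bound in Step 1.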

B.6 PROOF OF PROPOSITION 4

The first part and the last part follow from Lemmas 2 and 3, respectively.

Part 2. We show that for $f, g \in \mathcal{F}_{1,2}$ we have $\phi(f+g, T) \in \mathcal{A}$. From Proposition 1, for a compact $K \subset X$ and tolerance $\varepsilon > 0$, for sufficiently large $n$ and $t = T/n$ we have $\|(\mathrm{id} + tf + tg)^n - \phi(f+g, T)\|_{L^p(K)} \le \varepsilon$.

From classical ODE theory, the dynamics move each $[\alpha(x_j)]_i$ to $[a(t)_j]_i$ with $|[\alpha(x_j)]_i - [a(t)_j]_i| \le C_1(e^{tC_2} - 1)$ for some constants $C_1, C_2$ depending only on $\varepsilon$. We choose $t$ sufficiently small that the right hand side is less than $\varepsilon$. Note that this $\delta$ depends only on $\varepsilon$. We can then choose $\mu = \min(\frac{1}{2}t\delta, \frac{\varepsilon}{2})$, and therefore we at least have $[a(t)_J]_I > [a(t)_1]_{I'}$, while the other orders are preserved since $\varepsilon$ is much larger than $\mu$. This contradicts the minimal choice of $\alpha$. Hence, we conclude the result.

Step 3. Set $z_j = \beta(y_j)$. We now proceed by induction. Suppose for $z_2, \dots, z_M$ we have a mapping $\psi \in \mathcal{A}$ fulfilling the point reordering property. We modify it to the mapping $\tilde\psi = \psi \circ u^{\otimes} \in \mathcal{A}$ satisfying the following conditions:
• $\tilde\psi(z_2), \dots, \tilde\psi(z_M)$ are ordered;
• $[\tilde\psi(z_1)]_i > [\tilde\psi(z_j)]_{i'}$ for $j \ne 1$ and all indices $i, i'$.
Restricting $\psi$ to the line $\mathbb{R}\mathbf{1}$ gives a continuous increasing bijection from $\mathbb{R}\mathbf{1}$ to $\mathbb{R}\mathbf{1}$. We can find $a > 0$ such that $a > \max_{j,i} [\psi(z_j)]_i + 2\,\mathrm{Lip}(\psi)$. Define $u$ so that $u$ fixes all $[z_j]_i$ for $j \ne 1$ and all indices $i$, but sends each $[z_1]_i$ into the interval $[a-1, a+1]$. The mapping $\tilde\psi = \psi \circ u^{\otimes} \in \mathcal{A}$ then satisfies the two conditions above.

Step 4. Set $p_j = \tilde\psi(z_j)$. Similarly, we consider the following quantity. For $\gamma \in \mathcal{A}$ such that
1. $\gamma(p_2), \dots, \gamma(p_M)$ are ordered, and
2. $[\gamma(p_1)]_i > [\gamma(p_j)]_{i'}$ for $j \ne 1$ and all indices $i, i'$,
we define the quantity $L(\gamma) = \#\{(i, i') : i \prec i',\ [\gamma(x_1)]_i < [\gamma(x_1)]_{i'}\}$.
We claim that this quantity is zero. Suppose not; then we can find a pair $(I, I')$ with $I \prec I'$ such that $[\gamma(x_1)]_{I'} - [\gamma(x_1)]_I$ is minimal among all the choices that make this value positive. One can verify that there must be no other index $k$ such that $[\gamma(x_1)]_{I'} > [\gamma(x_1)]_k > [\gamma(x_1)]_I$.



In what follows, we define for multi-indices $i, j$ the partial order $i \ge j$ if $i_s \ge j_s$ for all $s$. In this paper, we always fix $p \in [1, \infty)$.



Figure 1: An illustration of fully convolutional neural network.


Figure 2: The directed acyclic graph representing the dependencies among the results and the section layout.

$$u|_{[(i+\alpha)\delta, (i+1)\delta]}(x) = \frac{x - (i-\alpha)\delta}{1-\alpha} + \Big(i + \frac{\beta}{2n}\Big)\delta$$ (35)

and $v\sigma([\alpha(x_1)]_{I'} + b - \varepsilon) = 1$.

$$v\sigma([a(t)_J]_I + b) = 0 \quad \text{and} \quad v\sigma([a(t)_1]_{I'} + b) > \delta := \min_{r \in [1+\varepsilon,\,1+3\varepsilon]} v\sigma(r).$$

$[\psi(z_j)]_i + 2\,\mathrm{Lip}(\psi)$.



We now estimate the error between $(\mathrm{id} + tf + tg)^n$ and $[(\mathrm{id} + tf) \circ (\mathrm{id} + tg)]^n$.

Since $(\mathrm{id}+tf)\circ(\mathrm{id}+tg)(x) = (\mathrm{id}+tg)(x) + t f(x + t g(x))$, a telescoping decomposition similar to (20) bounds $[(\mathrm{id}+tf)\circ(\mathrm{id}+tg)]^n(x) - (\mathrm{id}+tf+tg)^n(x)$. Combining this with (20) and using the discrete Gronwall inequality again, we deduce that for sufficiently large $n$ the difference can be made arbitrarily small. Finally, by the same estimates as in Proposition 1, $[(\mathrm{id}+tf)\circ(\mathrm{id}+tg)]^n$ approximates $\phi(f+g, T)$ for sufficiently large $n$. Therefore, we have proved that $\phi(f+g, T) \in \mathrm{clo}(\mathcal{A})$. Using a similar approach, we can show that if $f$ is a linear combination of mappings in $\mathcal{F}_{1,2}$, then $\phi(f, T) \in \mathrm{clo}(\mathcal{A})$.

Step 3. Now, for a certain well function $h$, we have proved the claim above. Choose $w$ such that $[w]_{(1,1,\dots,1)} = w$ but $[w]_i = 0$ for other multi-indices $i$. Then the dynamical system is decoupled, in the sense that each coordinate evolves by the one-dimensional dynamics $\dot{x} = v\,h(wx + b)$ for some $v, w, b \in \mathbb{R}$.

For the following proof, we suppose that $I = \{x \in \mathbb{R} : h(x) = 0\}$. Note that both $\mathcal{K}$ and $\mathcal{G}$ are function families consisting of functions from $\mathbb{R}$ to $\mathbb{R}$. The third part of the result comes from the following lemma.

Lemma 4. For any increasing continuous function $u : \mathbb{R} \to \mathbb{R}$, we have $u \in \mathrm{clo}(\mathcal{G})$.

Proof. Given an interval $K \subset \mathbb{R}$ and tolerance $\varepsilon > 0$, we aim to find $\tilde u \in \mathcal{G}$ such that $\|u - \tilde u\|_{L^p(K)} \le \varepsilon$. We choose $x_1 < x_2 < \cdots < x_n$, with $x_1$ and $x_n$ the two endpoints of the interval $K$. Notice that if we can find $\tilde u \in \mathcal{G}$ such that $\tilde u(x_i) = u(x_i)$ for $i = 1, 2, \dots, n$, then by the monotonicity of both $u$ and $\tilde u$, it holds that $\|\tilde u - u\|_{C(K)} \le \omega(K, h)$, where $h = \max_i (x_{i+1} - x_i)$ and $\omega(K, \cdot)$ is the oscillation of $u$ on $K$ over subintervals of the given length. Making $h$ small then yields the desired bound, which implies the lemma.

We prove by induction that there exists $\varphi_n \in \mathcal{G}$ such that $\varphi_n(x_i) = u(x_i)$ for $i \le n$ (48). Set $\varphi_0 = \mathrm{id}$. Choosing suitable $v$ and $b \in \mathbb{R}$, for certain $v = \pm 1$ and certain $t > 0$, we find $\varphi_{n+1} \in \mathcal{G}$ which satisfies (48).

B.7 COMPLETE PROOF OF THEOREM 4

In this section, we complete the proof of Theorem 4. As discussed in the sketch of the proof (Appendix B.4), we first consider the case where there are no stabilizers.

Step 1. We first show that if $\mathcal{A}$ has the following point reordering property, then $\mathcal{A}$ has the point matching property: for any shift distinct point set $x_1, \dots, x_M$, there exists $\varphi \in \mathcal{A}$ such that $\varphi(x_j)$ is ordered. Here the partial order $\prec$ is the lexicographic order.

Without loss of generality we can assume that the $y_j$ are also shift distinct. Suppose there exist $\varphi_x$ and $\varphi_y$ in $\mathcal{A}$ that order the $\varphi_x(x_j)$ and the $\varphi_y(y_j)$, respectively, and let $u$ be a continuous increasing function sending $[\varphi_x(x_j)]_i$ to $[\varphi_y(y_j)]_i$ for all $j$ and all indices $i$. The mapping $\varphi_y^{-1} \circ u^{\otimes} \circ \varphi_x$ is then constructed to satisfy the point matching property.

Step 2. To begin with, we first prove that there exists a mapping $\beta \in \mathcal{A}$ such that $[\beta(x_1)]_i > [\beta(x_j)]_{i'}$ for $j \ne 1$ and any indices $i, i'$. We first show that we can perturb the point set so that all coordinate values become pairwise distinct; in what follows, we say in this case that the $x_j$ are perturbed.

The perturbation argument is based on the following minimality argument. For $\alpha \in \mathcal{A}$, consider the number $E(\alpha)$ of coinciding coordinate pairs of the $\alpha(x_j)$. Suppose $\alpha$ minimizes this quantity; it suffices to show that $E(\alpha) = 0$. Otherwise, we consider a pair $(I, J)$ and $(I', J')$ such that $(I, J) \ne (I', J')$ but $[\alpha(x_J)]_I = [\alpha(x_{J'})]_{I'}$. Since $\alpha(x_J)$ and $\alpha(x_{J'})$ must be shift distinct, no matter whether $J$ and $J'$ are identical, we can deduce that there exist $k$ and a unit multi-index $e = (0, 0, \dots, 1, \dots, 0)$ distinguishing the two points. So without loss of generality, we may assume that $k = 0$ and $e = (1, 0, 0, \dots, 0)$.

Consider the dynamics driven by a field of the form in Part 2 of Proposition 4, where the constant $b$ is chosen to ensure that only the coordinates concerned are moved. Then for sufficiently small $t > 0$, the resulting strict decrease of $E$ leads to a contradiction with minimality. Therefore, there exists $\alpha \in \mathcal{A}$ such that the $\alpha(x_j)$ are perturbed. From now on, we may assume that the $x_j$, $j = 1, 2, \dots, M$, are themselves perturbed, since otherwise we can first apply such a perturbation $\alpha$.

Consider now a second quantity $K(\alpha)$, counting the ordering violations. We choose an $\alpha \in \mathcal{A}$ minimizing this quantity subject to the $\alpha(x_j)$ being perturbed. It now suffices to prove that $K(\alpha) = 0$.
Suppose not. Then there exists $(I, J, I', J')$ for which the corresponding coordinate gap is the smallest among all choices that make this value non-negative. Clearly, no other coordinate value lies between these two. We define a continuous function $v : \mathbb{R} \to \mathbb{R}$ prescribing the motion of each coordinate value, where $\varepsilon$ and $\mu$ are two parameters whose values will be determined later. Consider the dynamics (for short, we only write the equations for the coordinates we are concerned with), and choose certain $b$ and $v \in \mathbb{R}$ accordingly; in this case either $I \prec k$ or $k \prec I'$ must be satisfied, which contradicts the choice of $I$ and $I'$. Therefore, there exists a continuous function $v$ adjusting $[\gamma(x_1)]$ as required. By a similar argument as in Step 2, we can construct a new $\gamma$ satisfying the conditions but with a smaller $L(\gamma)$, contradicting minimality. Therefore, we conclude the result.

Dealing with Stabilizers

We conclude the proof with the situation where there are stabilizers. We prove that there exists a $\zeta \in \mathcal{A}$ such that $[\zeta(x_j)]_i > [\zeta(s_{j'})]_{i'}$ for all possible choices of $(i, j)$ and $(i', j')$. If such a $\zeta$ exists, we can proceed, as we did in Step 3, to find a $\varphi \in \mathcal{A}$ such that (1) $\varphi(x_j)$ is ordered, and (2) $[\varphi(x_j)]_i > [\varphi(z_{j'})]_{i'}$ for all possible choices of $(i, j)$ and $(i', j')$.

We first show that, with this point reordering property with stabilizers, we can prove the point matching property. Compared to what we did in Step 1, it suffices to additionally assign values $o_1, o_2, \dots, o_N$ as the targets of the $s_j$, where each coordinate of $o_j$ is chosen to be $\le 1$. Suppose that for the targets $y_1, \dots, y_M$ and $\tilde o = 0$, such a $\varphi_y \in \mathcal{A}$ can be found. We choose $o_1, \dots, o_N$ around the value $\varphi_y(\tilde o)$ such that $|o_i - \varphi_y(\tilde o)| \le \varepsilon$, and moreover we can assign $v$ at these values such that $v([s_j]_i) = [o_j]_i$. Hence, the requirement of the point matching property can be fulfilled if we choose $\varepsilon < \frac{1}{2}(\mathrm{Lip}\,\varphi_y^{-1})^{-1}$.

Now we prove the existence of $\zeta$. Consider the quantity $H(\zeta)$, counting the pairs with $[\zeta(x_j)]_i \le [\zeta(s_{j'})]_{i'}$, and choose $\zeta \in \mathcal{A}$ to minimize it. We only need to show that $H(\zeta) = 0$; otherwise, we show that we can construct a new $\zeta \in \mathcal{A}$ with a smaller value of $H(\zeta)$.

This construction is similar to what we did in Step 2, in that we only need to find a pair $(I, J, I', J')$ satisfying (52) with no other coordinate value between the two, but now with some unit multi-index $e = (0, \dots, 1, \dots, 0)$. We assert that such pairs can be found. Suppose this assertion does not hold. Since we assume that $H(\zeta) \ne 0$, there exists at least one pair $J$ and $J'$ such that (52) holds for some multi-indices $I$ and $I'$. We choose such $(I, J, I', J')$ to minimize the corresponding gap $\omega$; one then shows that $\omega$ must be zero, so the problem reduces to the case $\omega = 0$. In this case, we start from a pair with $[x_J]_I = [s_{J'}]_{I'}$, and we can show that for every $e = (0, \dots, 1, \dots, 0)$ we have $[x_J]_{I+e} = [s_{J'}]_{I'+e}$. Repeating this procedure, we see that the identity holds for all choices of $e$. Therefore, there exists a shift operator $T_k$ such that $T_k x_J = s_{J'}$, which also leads to a contradiction, since it implies that $x_J$ is a stabilizer.

