UNIVERSAL APPROXIMATION AND MODEL COMPRESSION FOR RADIAL NEURAL NETWORKS

Abstract

We introduce a class of fully-connected neural networks whose activation functions, rather than being pointwise, rescale feature vectors by a function depending only on their norm. We call such networks radial neural networks, extending previous work on rotation-equivariant networks that considers rescaling activations in less generality. We prove universal approximation theorems for radial neural networks, including in the more difficult cases of bounded widths and unbounded domains. Our proof techniques are novel, distinct from those in the pointwise case. Additionally, radial neural networks exhibit a rich group of orthogonal change-of-basis symmetries on the vector space of trainable parameters. Factoring out these symmetries leads to a practical lossless model compression algorithm. Optimization of the compressed model by gradient descent is equivalent to projected gradient descent for the full model.

1. INTRODUCTION

Inspired by biological neural networks, the theory of artificial neural networks has largely focused on pointwise (or "local") nonlinear layers (Rosenblatt, 1958; Cybenko, 1989), in which the same function σ : R → R is applied to each coordinate independently:

R^n → R^n, v = (v_1, ..., v_n) ↦ (σ(v_1), σ(v_2), ..., σ(v_n)). (1.1)

In networks with pointwise nonlinearities, the standard basis vectors in R^n can be interpreted as "neurons" and the nonlinearity as a "neuron activation." Research has generally focused on finding functions σ which lead to more stable training, have less sensitivity to initialization, or are better adapted to certain applications (Ramachandran et al., 2017; Misra, 2019; Milletarí et al., 2018; Clevert et al., 2015; Klambauer et al., 2017). Many σ have been considered, including sigmoid, ReLU, arctangent, ELU, Swish, and others.

However, by setting aside the biological metaphor, it is possible to consider a much broader class of nonlinearities, which are not necessarily pointwise, but instead depend simultaneously on many coordinates. Freedom from the pointwise assumption allows one to design activations that yield expressive function classes with specific advantages. Additionally, certain choices of non-pointwise activations maximize symmetry in the parameter space of the network, leading to compressibility and other desirable properties.

In this paper, we introduce radial neural networks, which employ non-pointwise nonlinearities called radial rescaling activations. Such networks enjoy several provable properties, including high model compressibility, symmetry in optimization, and universal approximation. Radial rescaling activations are defined by rescaling each vector by a scalar that depends only on the norm of the vector:

ρ : R^n → R^n, v ↦ λ(|v|)v, (1.2)

where λ is a scalar-valued function of the norm.
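As a concrete illustration, an activation of the form in Equation 1.2 takes only a few lines of NumPy. This is a minimal sketch (our own, not from any released code); the choice λ(r) = r/(1 + r²), corresponding to the squashing function discussed below, is just one example:

```python
import numpy as np

def radial_rescale(v, lam):
    """Radial rescaling activation rho(v) = lam(|v|) * v (Equation 1.2)."""
    return lam(np.linalg.norm(v)) * v

# Example choice: lam(r) = r / (1 + r^2), the rescaling factor of the
# squashing function h(r) = r^2 / (r^2 + 1).
lam = lambda r: r / (1.0 + r**2)

v = np.array([3.0, 4.0])        # |v| = 5
w = radial_rescale(v, lam)      # w = (5/26) * v, parallel to v
```

Unlike a pointwise σ, every coordinate of the output w depends on every coordinate of the input v through the norm |v|.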
Whereas in the pointwise setting, only the linear layers mix information between different components of the latent features, for radial rescaling all coordinates of the activation output vector are affected by all coordinates of the activation input vector. The inherent geometric symmetry of radial rescalings makes them particularly useful for designing equivariant neural networks (Weiler & Cesa, 2019; Sabour et al., 2017; Weiler et al., 2018a;b). We note that radial neural networks constitute a simple and previously unconsidered type of multilayer radial basis functions network (Broomhead & Lowe, 1988), namely, one where the number of hidden activation neurons (often denoted N) in each layer is equal to one. Indeed, pre-composing Equation 1.2 with a translation and post-composing with a linear map, one obtains a special case of the local linear model extension of a radial basis functions network.

(Figure 1: a pointwise activation σ applied coordinate-wise between weight layers W_{i-1} and W_i, versus a radial rescaling activation ρ = λ(||•||) applied to the whole feature vector.)

In our first set of main results, we prove that radial neural networks are in fact universal approximators. Specifically, we demonstrate that any asymptotically affine function can be approximated by a radial neural network, suggesting potentially good extrapolation behavior. Moreover, this approximation can be done with bounded width. Our approach to proving these results departs markedly from techniques used in the pointwise case. Additionally, our result is not implied by the universality property of radial basis functions networks in general, and differs in significant ways, particularly in the bounded-width property and the approximation of asymptotically affine functions.

In our second set of main results, we exploit parameter space symmetries of radial neural networks to achieve model compression. Using the fact that radial rescaling activations commute with orthogonal transformations, we develop a practical algorithm to systematically factor out orthogonal symmetries via iterated QR decompositions.
This leads to another radial neural network with fewer neurons in each hidden layer. The resulting model compression algorithm is lossless: the compressed network and the original network have the same value of the loss function on any batch of training data. Furthermore, we prove that the loss of the compressed model after one step of gradient descent is equal to the loss of the original model after one step of projected gradient descent. As explained below, projected gradient descent involves zeroing out certain parameter values after each step of gradient descent. Although training the original network may result in a lower loss function after fewer epochs, in many cases the compressed network takes less time per epoch to train and is faster in reaching a local minimum.

To summarize, our main contributions and headline results are:

• Radial rescaling activations are an alternative to pointwise activations: we provide a formalization of radial neural networks, a new class of neural networks.
• Radial neural networks are universal approximators: results include (a) approximation of asymptotically affine functions and (b) bounded-width approximation.
• Radial neural networks are inherently compressible: we give a lossless compression algorithm for such networks and a theorem relating optimization of the original and compressed networks.
• Radial neural networks have practical advantages: we describe experiments verifying all theoretical results and showing that radial networks outperform pointwise networks on a noisy image recovery task.

2. RELATED WORK

Radial rescaling activations. As noted, radial rescaling activations are a special case of the activations used in radial basis functions networks (Broomhead & Lowe, 1988). Radial rescaling functions have the symmetry property of preserving vector directions, and hence exhibit rotation equivariance. Consequently, examples of such functions, such as the squashing nonlinearity and Norm-ReLU, feature in the study of rotationally equivariant neural networks (Weiler & Cesa, 2019; Sabour et al., 2017; Weiler et al., 2018a;b; Jeffreys & Lau, 2021). However, previous works apply the activation only along the channel dimension, and consider the orthogonal group O(n) only for n = 2, 3. In contrast, we apply the activation across the entire hidden layer, and consider O(n)-equivariance where n is the hidden layer dimension. Our constructions echo the vector neurons formalism (Deng et al., 2021), in which the output of a nonlinearity is a vector rather than a scalar.

Universal approximation. Neural networks of arbitrary width and sigmoid activations have long been known to be universal approximators (Cybenko, 1989). Universality can also be achieved by bounded-width networks of arbitrary depth (Lu et al., 2017b), and generalizes to other activations and architectures (Hornik, 1991; Yarotsky, 2022; Ravanbakhsh, 2020; Sonoda & Murata, 2017). While most work has focused on compact domains, some recent work also considers non-compact domains (Kidger & Lyons, 2020; Wang & Qu, 2022), but only for L^p functions, which are less general than asymptotically affine functions. The techniques used for pointwise activations do not generalize to radial rescaling activations, where all activation output coordinates are affected by all input coordinates. Consequently, individual radial neural network approximators of two different functions cannot easily be combined into an approximator of the sum of those functions.
The standard proof of universal approximation for radial basis functions networks requires an unbounded increase in the number of hidden activation neurons, and hence does not apply to the case of radial neural networks (Park & Sandberg, 1991).

Groups and symmetry. Appearances of symmetry in machine learning have generally focused on symmetric input and output spaces. Most prominently, equivariant neural networks incorporate symmetry as an inductive bias and feature weight-sharing constraints based on equivariance. Examples include G-convolutions, steerable CNNs, and Clebsch-Gordan networks (Cohen et al., 2019; Weiler & Cesa, 2019; Cohen & Welling, 2016; Chidester et al., 2018; Kondor & Trivedi, 2018; Bao & Song, 2019; Worrall et al., 2017; Cohen & Welling, 2017; Weiler et al., 2018b; Dieleman et al., 2016; Lang & Weiler, 2021; Ravanbakhsh et al., 2017). By contrast, our approach does not depend on symmetries of the input domain, output space, or feedforward mapping. Instead, we exploit parameter space symmetries and obtain results that apply to domains with no apparent symmetry.

Model compression.

A major goal in machine learning is to find methods to reduce the number of trainable parameters, decrease memory usage, or accelerate inference and training (Cheng et al., 2017; Zhang et al., 2018). Our approach toward this goal differs significantly from most existing methods in that it is based on the inherent symmetry of network parameter spaces. One prior method is weight pruning, which removes redundant weights with little loss in accuracy (Han et al., 2015; Blalock et al., 2020; Karnin, 1990). Pruning can be done during training (Frankle & Carbin, 2018) or at initialization (Lee et al., 2019; Wang et al., 2020). Gradient-based pruning removes weights by estimating the increase in loss resulting from their removal (LeCun et al., 1990; Hassibi & Stork, 1993; Dong et al., 2017; Molchanov et al., 2016). A complementary approach is quantization, which decreases the bit depth of weights (Wu et al., 2016; Howard et al., 2017; Gong et al., 2014). Knowledge distillation identifies a small model mimicking the performance of a larger model (Buciluǎ et al., 2006; Hinton et al., 2015; Ba & Caruana, 2013). Matrix factorization methods replace fully connected layers with lower-rank or sparse factored tensors (Cheng et al., 2015a;b; Tai et al., 2015; Lebedev et al., 2014; Rigamonti et al., 2013; Lu et al., 2017a) and can often be applied before training. Our method involves a type of matrix factorization based on the QR decomposition; however, rather than aiming for rank reduction, we leverage this decomposition to reduce hidden widths via change-of-basis operations on the hidden representations. Closest to our method are lossless compression methods which remove stable neurons in ReLU networks (Serra et al., 2021; 2020) or exploit permutation parameter space symmetry to remove neurons (Sourek et al., 2020); our compression instead follows from the symmetries of the radial rescaling activation.
Finally, the compression results of Jeffreys & Lau (2021), while conceptually similar to ours, are weaker, as the unitary group action is only on disjoint layers, and the results are only stated for the squashing nonlinearity.

(Figure 2: graphs of the radial functions (1) Step-ReLU, (2) the squashing function, and (3) shifted ReLU.)

3. RADIAL NEURAL NETWORKS

In this section, we define radial rescaling functions and radial neural networks. Let h : R → R be a function. For any n ≥ 1, set:

h^(n) : R^n → R^n, h^(n)(v) = h(|v|) v/|v| for v ≠ 0, and h^(n)(0) = 0.

A function ρ : R^n → R^n is called a radial rescaling function if ρ = h^(n) for some piecewise differentiable h : R → R. Hence, ρ sends each input vector to a scalar multiple of itself, and that scalar depends only on the norm of the vector^1. It is easy to show that radial rescaling functions commute with orthogonal transformations.

Example 1. (1) Step-ReLU, where h(r) = r if r ≥ 1 and 0 otherwise. In this case, the radial rescaling function is given by

ρ : R^n → R^n, v ↦ v if |v| ≥ 1; v ↦ 0 if |v| < 1. (3.1)

(2) The squashing function, where h(r) = r^2/(r^2 + 1). (3) Shifted ReLU, where h(r) = max(0, r − b) for r > 0 and b a real number. See Figure 2. We refer to Weiler & Cesa (2019) and the references therein for more examples and discussion of radial functions.

A radial neural network with L layers consists of: positive integers n_i indicating the width of each layer i = 0, 1, ..., L; the trainable parameters, comprising a matrix W_i ∈ R^{n_i × n_{i−1}} of weights and a bias vector b_i ∈ R^{n_i} for each i = 1, ..., L; and a radial rescaling function ρ_i : R^{n_i} → R^{n_i} for each i = 1, ..., L. We refer to the tuple n = (n_0, n_1, ..., n_L) as the widths vector of the neural network. The hidden widths vector is n_hid = (n_1, n_2, ..., n_{L−1}). The feedforward function F : R^{n_0} → R^{n_L} of a radial neural network is defined in the usual way as an iterated composition of affine maps and activations. Explicitly, set F_0 = id_{R^{n_0}}; the partial feedforward functions are:

F_i : R^{n_0} → R^{n_i}, x ↦ ρ_i(W_i F_{i−1}(x) + b_i) for i = 1, ..., L.

Then the feedforward function is F = F_L. Radial neural networks are a special type of radial basis functions network; we explain the connection in Appendix F. Remark 2.
If b_i = 0 for all i, then F(x) = W(μ(x)x), where μ : R^{n_0} → R is a scalar-valued function and W = W_L W_{L−1} ··· W_1 ∈ R^{n_L × n_0} is the product of the weight matrices. If any of the biases are non-zero, then the feedforward function lacks such a simple form.
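The lifting h ↦ h^(n) and the claim that radial rescaling functions commute with orthogonal transformations can be checked numerically. The following sketch (our own illustration) implements the three radial functions of Example 1, taking b = 0.5 for the shifted ReLU:

```python
import numpy as np

def lift(h):
    """h^(n): v -> h(|v|) v / |v| for v != 0, and 0 -> 0."""
    def rho(v):
        r = np.linalg.norm(v)
        return np.zeros_like(v) if r == 0 else (h(r) / r) * v
    return rho

step_relu = lift(lambda r: r if r >= 1 else 0.0)   # (1) Step-ReLU
squashing = lift(lambda r: r**2 / (r**2 + 1))      # (2) squashing function
shifted_relu = lift(lambda r: max(0.0, r - 0.5))   # (3) shifted ReLU, b = 0.5

# Radial rescaling functions commute with orthogonal transformations:
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))       # a random orthogonal matrix
v = rng.normal(size=4)
for rho in (step_relu, squashing, shifted_relu):
    assert np.allclose(rho(Q @ v), Q @ rho(v))
```

The commuting property holds because |Qv| = |v| for any orthogonal Q, so the rescaling factor is unchanged; it is the key ingredient in the compression results of Section 5.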

4. UNIVERSAL APPROXIMATION

We now consider two universal approximation results. The first approximates asymptotically affine functions with a network of unbounded width. The second generalizes to bounded width. Proofs appear in Appendix B.

^1 A function R^n → R that depends only on the norm of a vector is known as a radial function. Radial rescaling functions rescale each vector according to the radial function v ↦ λ(|v|) := h(|v|)/|v|. This explains the connection to Equation 1.2.

(Figure 3: illustration of the construction in the proof of Theorem 3: the balls B_{r_i}(c_i) cover K ⊂ R^n; the map S_2 ∘ ρ ∘ T_2 sends the ball around c_2 to the point c_2 + e_2, and the final affine map Φ sends it to f(c_2) ∈ R^m.)

4.1. APPROXIMATION OF ASYMPTOTICALLY AFFINE FUNCTIONS

A continuous function f : R^n → R^m is asymptotically affine if there exists an affine map L : R^n → R^m such that, for every ϵ > 0, there is a compact subset K of R^n such that |L(x) − f(x)| < ϵ for all x ∈ R^n \ K. In particular, continuous functions with compact support are asymptotically affine. The continuity of f and compactness of K imply that, for any ϵ > 0, there exist c_1, ..., c_N ∈ K and r_1, ..., r_N ∈ (0, 1) such that, first, the union of the balls B_{r_i}(c_i) covers K and, second, for all i, we have f(B_{r_i}(c_i) ∩ K) ⊆ B_ϵ(f(c_i)). Let N(f, K, ϵ) be the minimal^2 choice of N.

Theorem 3 (Universal approximation). Let f : R^n → R^m be an asymptotically affine function. For any ϵ > 0, there exists a compact set K ⊂ R^n and a function F : R^n → R^m such that:

1. F is the feedforward function of a radial neural network with N = N(f, K, ϵ) layers whose hidden widths are (n + 1, n + 2, ..., n + N).

2. For any x ∈ R^n, we have |F(x) − f(x)| < ϵ.

We note that the approximation in Theorem 3 is valid on all of R^n, not only on K. To give an idea of the proof, first fix c_1, ..., c_N ∈ K and r_1, ..., r_N ∈ (0, 1) as above. Let e_1, ..., e_N be orthonormal basis vectors extending R^n to R^{n+N}. For i = 1, ..., N, define affine maps T_i : R^{n+i−1} → R^{n+i} and S_i : R^{n+i} → R^{n+i} by

T_i(z) = z − c_i + h_i e_i,  S_i(z) = z − (1 + h_i^{−1})⟨e_i, z⟩ e_i + c_i + e_i,

where h_i = √(1 − r_i^2) and ⟨e_i, z⟩ is the coefficient of e_i in z. Setting ρ_i to be Step-ReLU (Equation 3.1) on R^{n+i}, these maps are chosen so that the composition S_i ∘ ρ_i ∘ T_i maps the points in B_{r_i}(c_i) to c_i + e_i, while keeping points outside this ball the same.

We now describe a radial neural network with widths (n, n + 1, ..., n + N, m) whose feedforward function approximates f. For i = 1, ..., N, the affine map from layer i − 1 to layer i is given by z ↦ T_i ∘ S_{i−1}(z), with S_0 = id_{R^n}. The activation at each hidden layer is Step-ReLU. Let L be the affine map such that |L − f| < ϵ on R^n \ K. The affine map from layer N to the output layer is Φ ∘ S_N, where Φ : R^{n+N} → R^m is the unique affine map determined by x ↦ L(x) for x ∈ R^n and e_i ↦ f(c_i) − L(c_i). See Figure 3 for an illustration of this construction.

Theorem 3 has the following straightforward corollary:

Corollary 4. Radial neural networks are dense in the space of all continuous functions with respect to the topology of compact convergence, and hence satisfy cc-universality.
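The key one-layer step of this construction can be verified numerically. The sketch below (our own illustration, with n = 2 and a single ball; we take h = √(1 − r²), the choice that makes the Step-ReLU threshold of 1 separate the lifted ball from the lifted complement) checks that S ∘ ρ ∘ T collapses B_r(c) to c + e while fixing points outside the ball:

```python
import numpy as np

n, r = 2, 0.6
h = np.sqrt(1 - r**2)              # lifts the ball strictly inside the unit
                                   # sphere and its complement outside it
c = np.array([0.3, -0.2, 0.0])     # center c in R^n, embedded in R^{n+1}
e = np.array([0.0, 0.0, 1.0])      # the new orthonormal direction e_1

T = lambda z: z - c + h * e
S = lambda z: z - (1 + 1 / h) * np.dot(e, z) * e + c + e
rho = lambda v: v if np.linalg.norm(v) >= 1 else np.zeros_like(v)  # Step-ReLU
layer = lambda z: S(rho(T(z)))

inside = c + np.array([r / 2, 0.0, 0.0])     # a point of B_r(c)
outside = c + np.array([2 * r, 0.0, 0.0])    # a point outside B_r(c)
assert np.allclose(layer(inside), c + e)     # collapsed to c + e_1
assert np.allclose(layer(outside), outside)  # fixed by the layer
```

For z ∈ B_r(c), |T(z)|² = |z − c|² + 1 − r² < 1, so Step-ReLU sends T(z) to 0 and S(0) = c + e; outside the ball, |T(z)| ≥ 1, Step-ReLU acts as the identity, and S undoes T.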

4.2. BOUNDED WIDTH APPROXIMATION

We now turn our attention to a bounded-width universal approximation result.

^2 In many cases, the constant N(f, K, ϵ) can be bounded explicitly. For example, if K is the unit cube in R^n and f is Lipschitz continuous with Lipschitz constant R, then N(f, K, ϵ) ≤ (R√n / (2ϵ))^n.

Theorem 5. Let f : R^n → R^m be an asymptotically affine function. For any ϵ > 0, there exists a compact set K ⊂ R^n and a function F : R^n → R^m such that:

1. F is the feedforward function of a radial neural network with N = N(f, K, ϵ) hidden layers whose widths are all n + m + 1.

2. For any x ∈ R^n, we have |F(x) − f(x)| < ϵ.

The proof, which is more involved than that of Theorem 3, relies on using orthogonal dimensions to represent the domain and the range of f, together with an indicator dimension to distinguish the two. We regard points in R^{n+m+1} as triples (x, y, θ) where x ∈ R^n, y ∈ R^m, and θ ∈ R. The proof of Theorem 5 parallels that of Theorem 3, but instead of mapping the points in B_{r_i}(c_i) to c_i + e_i, we map the points in B_{r_i}((c_i, 0, 0)) to (0, (f(c_i) − L(0))/s, 1), where s is chosen so that different balls do not interfere. The final layer then uses the affine map (x, y, θ) ↦ L(x) + sy, which takes (x, 0, 0) to L(x) and (0, (f(c_i) − L(0))/s, 1) to f(c_i).

We remark on several additional results; see Appendix B for full statements and proofs. The bound of Theorem 5 can be strengthened to max(n, m) + 1 in the case of functions f : K → R^m defined on a compact domain K ⊂ R^n (i.e., ignoring asymptotic behavior). Furthermore, with more layers, it is possible to reduce the bound to max(n, m).

5. MODEL COMPRESSION

In this section, we prove a model compression result. Specifically, we provide an algorithm which, given any radial neural network, computes a different radial neural network with smaller widths. The resulting compressed network has the same feedforward function as the original network, and hence the same value of the loss function on any batch of training data. In other words, our model compression procedure is lossless. Although our algorithm is practical and explicit, it reflects more conceptual phenomena, namely, a change-of-basis action on network parameter spaces.

5.1. PARAMETER SPACE SYMMETRIES

Suppose a fully connected network has L layers and widths given by the tuple n = (n_0, n_1, n_2, ..., n_{L−1}, n_L). In other words, the i-th layer has input width n_{i−1} and output width n_i. The parameter space is defined as the vector space of all possible choices of parameter values. Hence, it is given by the following product of vector spaces:

Param(n) = R^{n_1 × n_0} × R^{n_2 × n_1} × ··· × R^{n_L × n_{L−1}} × (R^{n_1} × R^{n_2} × ··· × R^{n_L})

An element is a pair of tuples (W, b), where W = (W_i ∈ R^{n_i × n_{i−1}})_{i=1}^{L} are the weights and b = (b_i ∈ R^{n_i})_{i=1}^{L} are the biases. To describe certain symmetries of the parameter space, consider the following product of orthogonal groups, with sizes corresponding to the hidden layer widths:

O(n_hid) = O(n_1) × O(n_2) × ··· × O(n_{L−1})

There is a change-of-basis action of this group on the parameter space. Explicitly, a tuple of orthogonal matrices Q = (Q_i) ∈ O(n_hid) transforms the parameter values (W, b) to Q • W := (Q_i W_i Q_{i−1}^{−1}) and Q • b := (Q_i b_i), where we set Q_0 and Q_L to be identity matrices.
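To make the change-of-basis action concrete, the following sketch (our own numerical check, using the squashing activation) verifies that acting by Q ∈ O(n_hid) leaves the feedforward function of a radial network unchanged, precisely because radial rescalings commute with orthogonal maps:

```python
import numpy as np

rng = np.random.default_rng(1)
widths = [3, 5, 4, 2]                    # n = (n_0, n_1, n_2, n_3)
Ws = [rng.normal(size=(m, k)) for k, m in zip(widths, widths[1:])]
bs = [rng.normal(size=m) for m in widths[1:]]

def rho(v):
    # squashing radial rescaling: h(r) = r^2/(r^2+1), i.e. lambda(r) = r/(r^2+1)
    r = np.linalg.norm(v)
    return (r / (r**2 + 1)) * v

def F(Ws, bs, x):
    for W, b in zip(Ws, bs):
        x = rho(W @ x + b)
    return x

# Q = (Q_1, ..., Q_{L-1}) in O(n_hid), with Q_0 = Q_L = identity
Qs = [np.linalg.qr(rng.normal(size=(m, m)))[0] for m in widths[1:-1]]
Qfull = [np.eye(widths[0])] + Qs + [np.eye(widths[-1])]
Ws_new = [Qfull[i + 1] @ Ws[i] @ Qfull[i].T for i in range(len(Ws))]  # Q_i W_i Q_{i-1}^{-1}
bs_new = [Qfull[i + 1] @ bs[i] for i in range(len(bs))]               # Q_i b_i

x = rng.normal(size=widths[0])
assert np.allclose(F(Ws, bs, x), F(Ws_new, bs_new, x))
```

By induction on layers, the transformed network's i-th hidden feature equals Q_i times the original one, and the identity matrices at the input and output layers make the two feedforward functions agree exactly.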

5.2. MODEL COMPRESSION

In order to state the compression result, we first define the reduced widths. Namely, the reduction n^red = (n^red_0, n^red_1, ..., n^red_L) of a widths vector n is defined recursively by setting n^red_0 = n_0, then n^red_i = min(n_i, n^red_{i−1} + 1) for i = 1, ..., L − 1, and finally n^red_L = n_L. We write ρ^red for the corresponding tuple of restrictions, which are all radial rescaling functions. The following result relies on Algorithm 1 below.

Theorem 6. Let (W, b, ρ) be a radial neural network with widths n. Let W^red and b^red be the weights and biases of the compressed network produced by Algorithm 1. The feedforward function of the original network (W, b, ρ) coincides with that of the compressed network (W^red, b^red, ρ^red).

Algorithm 1: QR Model Compression (QR-compress)
  input:  W, b ∈ Param(n)
  output: Q ∈ O(n_hid) and W^red, b^red ∈ Param(n^red)
  Q, W^red, b^red ← [ ], [ ], [ ]                   // initialize output lists
  A_1 ← [b_1 W_1]                                   // matrix of size n_1 × (n_0 + 1)
  for i ← 1 to L − 1 do                             // iterate through layers
      Q_i, R_i ← QR-decomp(A_i, mode = 'complete')  // A_i = Q_i Inc_i R_i
      Append Q_i to Q
      Append first column of R_i to b^red           // reduced bias for layer i
      Append remainder of R_i to W^red              // reduced weights for layer i
      A_{i+1} ← [b_{i+1} W_{i+1} Q_i Inc_i]         // matrix of size n_{i+1} × (n^red_i + 1)
  end
  Append the first column of A_L to b^red           // reduced bias for last layer
  Append the remainder of A_L to W^red              // reduced weights for last layer
  return Q, W^red, b^red

We explain the notation of the algorithm. The inclusion matrix Inc_i ∈ R^{n_i × n^red_i} has ones along the main diagonal and zeros elsewhere. The method QR-decomp with mode = 'complete' computes the complete QR decomposition of the n_i × (1 + n^red_{i−1}) matrix A_i as Q_i Inc_i R_i, where Q_i ∈ O(n_i) and R_i is upper-triangular of size n^red_i × (1 + n^red_{i−1}). The definition of n^red_i implies that either n^red_i = n^red_{i−1} + 1 or n^red_i = n_i.
The matrix R_i is of size n^red_i × n^red_i in the former case and of size n_i × (1 + n^red_{i−1}) in the latter case.

Example 7. Suppose the widths of a radial neural network are (1, 8, 16, 8, 1). Then it has Σ_{i=1}^{4} (n_{i−1} + 1) n_i = 305 trainable parameters. The reduced network has widths (1, 2, 3, 4, 1) and Σ_{i=1}^{4} (n^red_{i−1} + 1) n^red_i = 34 trainable parameters. Another example appears in Figure 4.

We note that the tuple of matrices Q produced by Algorithm 1 does not feature in the statement of Theorem 6, but is important in the proof (which appears in Appendix C). Namely, an induction argument shows that the i-th partial feedforward functions of the original and reduced models are related via the matrices Q_i and Inc_i. A crucial ingredient in the proof is that radial rescaling activations commute with orthogonal transformations.
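Algorithm 1 is short enough to sketch directly in NumPy; this is our own rendering, in which `np.linalg.qr` with `mode='complete'` plays the role of QR-decomp and slicing the factors implements multiplication by Inc_i. On the widths of Example 7, the sketch reproduces the reduced widths (1, 2, 3, 4, 1) and the lossless property of Theorem 6:

```python
import numpy as np

def rho(v):
    """Step-ReLU radial rescaling (Equation 3.1)."""
    return v if np.linalg.norm(v) >= 1 else np.zeros_like(v)

def feedforward(Ws, bs, x):
    for W, b in zip(Ws, bs):
        x = rho(W @ x + b)
    return x

def qr_compress(Ws, bs):
    """Sketch of Algorithm 1 (QR-compress), returning reduced weights/biases."""
    QI = np.eye(Ws[0].shape[1])              # Q_0 Inc_0 = identity on R^{n_0}
    Ws_red, bs_red = [], []
    for W, b in zip(Ws[:-1], bs[:-1]):
        A = np.hstack([b[:, None], W @ QI])          # A_i = [b_i  W_i Q_{i-1} Inc_{i-1}]
        Q, R = np.linalg.qr(A, mode="complete")      # A_i = Q_i Inc_i R_i
        nred = min(A.shape)                          # n^red_i = min(n_i, n^red_{i-1}+1)
        bs_red.append(R[:nred, 0])                   # reduced bias for layer i
        Ws_red.append(R[:nred, 1:])                  # reduced weights for layer i
        QI = Q[:, :nred]                             # Q_i Inc_i, for the next layer
    A = np.hstack([bs[-1][:, None], Ws[-1] @ QI])    # last layer absorbs Q_{L-1} Inc_{L-1}
    bs_red.append(A[:, 0])
    Ws_red.append(A[:, 1:])
    return Ws_red, bs_red

rng = np.random.default_rng(0)
n = [1, 8, 16, 8, 1]                                 # widths from Example 7
Ws = [rng.normal(size=(m, k)) for k, m in zip(n, n[1:])]
bs = [rng.normal(size=m) for m in n[1:]]
Ws_red, bs_red = qr_compress(Ws, bs)

assert [W.shape[1] for W in Ws_red] == [1, 2, 3, 4]  # reduced widths (1, 2, 3, 4, 1)
x = rng.normal(size=1)
assert np.allclose(feedforward(Ws, bs, x), feedforward(Ws_red, bs_red, x))
```

The sign conventions of the QR factors are irrelevant here: any complete factorization A_i = Q_i Inc_i R_i yields the same feedforward function, since the orthogonal factor is absorbed into the next layer and commutes with the radial activation.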

6. PROJECTED GRADIENT DESCENT

The typical use case for model compression algorithms is to produce a smaller version of the fully trained model which can be deployed to make inference more efficient. It is also worth considering whether compression can be used to accelerate training. For example, for some compression algorithms, the compressed and full models have the same feedforward function after a step of gradient descent is applied to each, and so one can compress before training and still reach the same minimum. Unfortunately, in the context of radial neural networks, compressing using Algorithm 1 and then training does not necessarily give the same result as training and then compressing (see Appendix D.6 for a counterexample). However, QR-compress does lead to a precise mathematical relationship between optimization of the two models: the loss of the compressed model after one step of gradient descent is equivalent to the loss of (a transformed version of) the original model after one step of projected gradient descent. Proofs appear in Appendix D.

Under review as a conference paper at ICLR 2023

(Figure 4: compression of a radial network with widths (1, 4, 4, 1): the hidden widths are reduced layer by layer, yielding networks of widths (1, 2, 4, 1) and then (1, 2, 3, 1).)

To state our results, fix widths n and radial rescaling functions ρ as above. The loss function L : Param(n) → R associated to a batch of training data is defined as taking parameter values (W, b) to the sum Σ_j C(F(x_j), y_j), where C is a cost function on the output space, F = F_{(W,b,ρ)} is the feedforward function of the radial neural network with the specified parameters, and (x_j, y_j) ∈ R^{n_0} × R^{n_L} are the data points. Similarly, we have a loss function L^red on the parameter space Param(n^red) with the reduced widths vector.
For any learning rate η > 0, we obtain gradient descent maps:

γ : Param(n) → Param(n); (W, b) ↦ (W, b) − η ∇_{(W,b)} L
γ_red : Param(n^red) → Param(n^red); (V, c) ↦ (V, c) − η ∇_{(V,c)} L^red
γ_proj : Param(n) → Param(n); (W, b) ↦ Proj(γ(W, b))

where the last is the projected gradient descent map on Param(n). The map Proj zeroes out all entries in the bottom-left (n_i − n^red_i) × n^red_{i−1} submatrix of W_i − ∇_{W_i} L, and the bottom n_i − n^red_i entries of b_i − ∇_{b_i} L, for each i. Schematically:

W_i − ∇_{W_i} L = [ * * ; * * ] ↦ [ * * ; 0 * ],  b_i − ∇_{b_i} L = [ * ; * ] ↦ [ * ; 0 ]

To state the following theorem, recall that, applying Algorithm 1 to parameters (W, b), we obtain the reduced model (W^red, b^red) and an orthogonal parameter symmetry Q. We consider, for k ≥ 0, the k-fold composition γ^k = γ ∘ γ ∘ ··· ∘ γ, and similarly for γ_red and γ_proj.

Theorem 8. Let W^red, b^red, Q = QR-compress(W, b) be the outputs of Algorithm 1 applied to (W, b) ∈ Param(n). Set U = Q^{−1} • (W, b) − (W^red, b^red). For any k ≥ 0, we have:

γ^k(W, b) = Q • γ^k(Q^{−1} • (W, b))
γ^k_proj(Q^{−1} • (W, b)) = γ^k_red(W^red, b^red) + U

We conclude that gradient descent with initial values (W, b) is equivalent to gradient descent with initial values Q^{−1} • (W, b), since at any stage we can apply Q^{±1} to move from one to the other (using the action from Section 5.1). Furthermore, projected gradient descent with initial values Q^{−1} • (W, b) is equivalent to gradient descent of the compressed model with initial values (W^red, b^red), up to the fixed difference U.
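The map Proj is simple to implement. The sketch below (our own illustration; the helper names are ours) computes the reduced widths as in Section 5.2 and zeroes the bottom-left block of each weight matrix and the bottom entries of each bias:

```python
import numpy as np

def reduced_widths(n):
    # n^red_0 = n_0; n^red_i = min(n_i, n^red_{i-1} + 1); n^red_L = n_L
    nred = [n[0]]
    for ni in n[1:-1]:
        nred.append(min(ni, nred[-1] + 1))
    nred.append(n[-1])
    return nred

def proj(Ws, bs, nred):
    """Zero the bottom-left (n_i - n^red_i) x n^red_{i-1} block of each W_i
    and the bottom (n_i - n^red_i) entries of each b_i."""
    Ws_out, bs_out = [], []
    for i, (W, b) in enumerate(zip(Ws, bs)):
        W, b = W.copy(), b.copy()
        W[nred[i + 1]:, :nred[i]] = 0.0
        b[nred[i + 1]:] = 0.0
        Ws_out.append(W)
        bs_out.append(b)
    return Ws_out, bs_out

n = [1, 4, 4, 1]                 # widths; the reduction is (1, 2, 3, 1)
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(m, k)) for k, m in zip(n, n[1:])]
bs = [rng.normal(size=m) for m in n[1:]]
nred = reduced_widths(n)
Wp, bp = proj(Ws, bs, nred)
```

For the middle layer (4 × 4 with n^red_2 = 3 and n^red_1 = 2), this zeroes the bottom-left 1 × 2 block and the last bias entry while leaving the remaining entries, including the bottom-right block, untouched, matching the schematic above.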

7. EXPERIMENTS

In addition to our theoretical results, we provide an implementation of Algorithm 1 in order to validate the claims of Theorems 6 and 8 empirically, as well as a demonstration that a radial network outperforms an MLP on a noisy image recovery task. Full experimental details are in Appendix E.

(1) Empirical verification of Theorem 6. We learn the function f(x) = e^{−x^2} from samples using a radial neural network with widths n = (1, 6, 7, 1) and activation the radial shifted sigmoid h(x) = 1/(1 + e^{−x+s}). Applying QR-compress gives a compressed radial neural network with widths n^red = (1, 2, 3, 1). Theorem 6 implies that the respective neural functions F and F^red are equal. Over 10 random initializations, the mean absolute error is negligible up to machine precision: (1/N) Σ_j |F(x_j) − F^red(x_j)| = 1.31 · 10^{−8} ± 4.45 · 10^{−9}.

(2) Empirical verification of Theorem 8. We verified the claim on the same synthetic data as above. See Appendix E for details.

(3) The compressed model trains faster. Our compression method may be applied before training to produce a smaller model class which trains faster without sacrificing accuracy. We demonstrate this in learning the function R^2 → R^2 sending (t_1, t_2) to (e^{−t_1^2}, e^{−t_2^2}) using a radial neural network with widths (2, 16, 64, 128, 16, 2) and activation the radial sigmoid h(r) = 1/(1 + e^{−r}). Applying QR-compress gives a compressed network with widths n^red = (2, 3, 4, 5, 6, 2). We trained both models until the training loss was ≤ 0.01. Over 10 random initializations on our system, the reduced network trained in 15.32 ± 2.53 seconds and the original network trained in 31.24 ± 4.55 seconds.

(4) Noisy image recovery. A Step-ReLU radial network performs better than an otherwise comparable network with pointwise ReLU on a noisy image recovery task. Using samples of MNIST with significant added noise, the network must identify from which original sample the noisy sample derives (see Figure 5).
We observe that the radial network (1) obtains a better fit, (2) converges faster, and (3) generalizes better than the pointwise ReLU network. We hypothesize that the radial nature of the random noise makes radial networks well-adapted to the task. Our data takes n = 3 original MNIST images with the same label and produces m = 100 noisy images for each, with a 240 train / 60 test split. Over 10 trials, each training for 150 epochs, the radial network achieves training loss 0.00256 ± 3.074 · 10^{−1} with accuracy 1 ± 0, while the ReLU MLP has training loss 0.295 ± 2.259 · 10^{−1} with accuracy 0.768 ± 2.199 · 10^{−1}. On the test set, the radial network has loss 0.00266 ± 3.749 · 10^{−4} with accuracy 1 ± 0, while the ReLU MLP has loss 0.305 ± 2.588 · 10^{−1} with accuracy 0.757 ± 2.464 · 10^{−1}. The convergence rates are illustrated in Figure 5, with the radial network outperforming the ReLU MLP; 150 epochs are sufficient for all methods to converge. In another experiment, we sampled 1000 MNIST digits and added spherical noise to each. Using an 800/200 train/test split, we trained both a Step-ReLU and a pointwise ReLU network on the task of recovering the original digit (so there are ten classes total). Both networks had widths vector (784, 785, 786, 10). After 1000 epochs, the Step-ReLU network achieved a training loss of 0.008248 and a test loss of 1.193811, while for pointwise ReLU these were 7.534195 and 1.374735, respectively.

8. CONCLUSIONS AND DISCUSSION

This paper demonstrates that radial neural networks are universal approximators and that their parameter spaces exhibit a rich symmetry group, leading to a model compression algorithm. The results of this work combine to build a theoretical foundation for the use of radial neural networks, and suggest that radial neural networks hold promise for wider practical applicability. Furthermore, this work makes an argument for considering non-pointwise nonlinearities in neural networks.

There are two main limitations of our results, each providing an opportunity for future work. First, our universal approximation constructions currently work only for Step-ReLU radial rescaling activations; it would be desirable to generalize to other activations. Second, Theorem 6 achieves compression only for networks whose widths satisfy n_i > n_{i−1} + 1 for some i; networks without such increasing widths, such as encoders, are not compressible by our method.

Further extensions of this work include the following. First, little is currently known about the stability properties of radial neural networks during training, as well as their sensitivity to initialization. Second, radial rescaling activations provide an extreme case of symmetry; there may be benefits to combining radial and pointwise activations within a single network, for example through 'block' radial rescaling functions. Our techniques may yield weaker compression properties for more general radial basis functions networks; radial neural networks may be the most compressible such networks. Third, radial rescaling activations can be used within convolutional or group-equivariant neural networks. Finally, based on the theoretical advantages and experiments laid out in this paper, future empirical work will further explore applications in which we expect radial networks to outperform alternative methods. Such potential applications include data spaces with circular or distance-based class boundaries.

ETHICS STATEMENT

Our work is primarily focused on theoretical foundations of machine learning; however, it does have a direct application in the form of model compression. Model compression is largely beneficial to the world, since it allows inference to run on smaller systems which use less energy. On the other hand, when models can be run on smaller systems such as smartphones, it is easier to use deep models covertly, for example, for facial recognition and surveillance. This may make abuses of deep learning technology easier to hide.

REPRODUCIBILITY STATEMENT

The theoretical results of this paper, namely Theorem 3, Theorem 5, Theorem 6, and Theorem 8, may be independently verified through their proofs, which we include in their entirety in the appendices, including all necessary definitions, lemmas, and hypotheses in precise and complete mathematical language. The empirical verification of Section 7 may be reproduced using the code included with the supplementary materials. In addition, Algorithm 1 is written in detailed pseudocode, allowing readers to recreate our algorithm in a programming language of their choosing.

B.2 TOPOLOGY

Let K be a compact subset of R^n and let f : K → R^m be a continuous function.

Lemma 9. For any ϵ > 0, there exist c_1, ..., c_N ∈ K and r_1, ..., r_N ∈ (0, 1) such that, first, the union of the balls B_{r_i}(c_i) covers K; and, second, for all i, we have f(B_{r_i}(c_i) ∩ K) ⊆ B_ϵ(f(c_i)).

Proof. The continuity of f implies that, for each c ∈ K, there exists r = r_c such that f(B_{r_c}(c) ∩ K) ⊆ B_ϵ(f(c)). The subsets B_{r_c}(c) ∩ K form an open cover of K. The compactness of K implies that there is a finite subcover. The result follows.

We also prove a variation of Lemma 9 that additionally guarantees that no ball in the cover of K contains the center point of another ball.

Lemma 10. For any ϵ > 0, there exist c_1, ..., c_M ∈ K and r_1, ..., r_M ∈ (0, 1) such that, first, the union of the balls B_{r_i}(c_i) covers K; second, for all i, we have f(B_{r_i}(c_i) ∩ K) ⊆ B_ϵ(f(c_i)); and, third, |c_i − c_j| ≥ r_i for all j ≠ i.

Proof. Because f is continuous on a compact domain, it is uniformly continuous. So there exists r > 0 such that f(B_r(c) ∩ K) ⊆ B_ϵ(f(c)) for every c ∈ K. Because K is compact, it has finite volume, and so does B_{r/2}(K) = ∪_{c∈K} B_{r/2}(c). Hence there exists a finite maximal packing of B_{r/2}(K) by balls of radius r/2, that is, a collection c_1, ..., c_M ∈ B_{r/2}(K) such that, for all i, B_{r/2}(c_i) ⊆ B_{r/2}(K) and, for all j ≠ i, B_{r/2}(c_i) ∩ B_{r/2}(c_j) = ∅. The first condition implies that c_i ∈ K. The second condition implies that |c_i − c_j| ≥ r. Finally, we argue that K ⊆ ∪_{i=1}^M B_r(c_i). To see this, suppose, for a contradiction, that x ∈ K does not belong to ∪_{i=1}^M B_r(c_i). Then B_{r/2}(c_i) ∩ B_{r/2}(x) = ∅ for every i, and x could be added to the packing, which contradicts the fact that the packing was chosen to be maximal. So the union of the balls B_r(c_i) covers K.

We turn our attention to the minimal choices of N and M in Lemmas 9 and 10. Definition 11.
Given f : K → R^m continuous and ϵ > 0, let N(f, K, ϵ) be the minimal choice of N in Lemma 9, and let M(f, K, ϵ) be the minimal choice of M in Lemma 10. Observe that M(f, K, ϵ) ≥ N(f, K, ϵ).

In many cases, it is possible to give explicit bounds for the constants N(f, K, ϵ) and M(f, K, ϵ). As an illustration, we give the argument in the case that K is the closed unit cube in R^n and f : K → R^m is Lipschitz continuous.

Proposition 12. Let K = [0, 1]^n ⊂ R^n be the (closed) unit cube and let f : K → R^m be Lipschitz continuous with Lipschitz constant R. For any ϵ > 0, we have:

N(f, K, ϵ) ≤ ⌈R√n / (2ϵ)⌉^n  and  M(f, K, ϵ) ≤ (Γ(n/2 + 1) / π^{n/2}) (2 + 2R/ϵ)^n.

Proof. For the first inequality, observe that the unit cube can be covered with ⌈R√n/(2ϵ)⌉^n cubes of side length 2ϵ/(R√n). Each such cube is contained in a ball of radius ϵ/R centered at the center of the cube. (In general, a cube of side length a in R^n is contained in a ball of radius a√n/2.) Lipschitz continuity implies that, for all x, x′ ∈ K, if |x − x′| < ϵ/R, then |f(x) − f(x′)| ≤ R|x − x′| < ϵ.

For the second inequality, let r = ϵ/R. Lipschitz continuity implies that, for all x, x′ ∈ K, if |x − x′| < r, then |f(x) − f(x′)| ≤ R|x − x′| < ϵ. The n-dimensional volume of the set of points with distance at most r/2 from the unit cube satisfies vol(B_{r/2}(K)) ≤ (1 + r)^n. The volume of a ball of radius r/2 is vol(B_{r/2}(0)) = (π^{n/2}/Γ(n/2 + 1)) (r/2)^n. Hence any packing of B_{r/2}(K) by balls of radius r/2 consists of at most

vol(B_{r/2}(K)) / vol(B_{r/2}(0)) ≤ (Γ(n/2 + 1) / π^{n/2}) (2 + 2R/ϵ)^n

such balls. So there also exists a maximal packing with at most that many balls. This packing can be used in the proof of Lemma 10, which implies that the quantity above is a bound on M(f, K, ϵ).

We note in passing that any continuously differentiable function f : K → R^m on a compact convex subset K of R^n is Lipschitz continuous. Indeed, the compactness of K implies that there exists R such that |f′(x)| ≤ R for all x ∈ K.
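The two bounds of Proposition 12 are explicit enough to evaluate directly. The following sketch (function names are ours, for illustration) computes both for given n, R, and ϵ using only the standard library:

```python
import math

def covering_bound(n, R, eps):
    # Proposition 12: N(f, K, eps) <= ceil(R * sqrt(n) / (2 * eps)) ** n
    return math.ceil(R * math.sqrt(n) / (2 * eps)) ** n

def packing_bound(n, R, eps):
    # Proposition 12: M(f, K, eps) <= Gamma(n/2 + 1) / pi**(n/2) * (2 + 2R/eps)**n
    return math.gamma(n / 2 + 1) / math.pi ** (n / 2) * (2 + 2 * R / eps) ** n

# For n = 2, R = 1, eps = 0.25: ceil(sqrt(2) * 2) ** 2 = 9 balls suffice for the cover
assert covering_bound(2, 1.0, 0.25) == 9
```

Consistent with the observation M(f, K, ϵ) ≥ N(f, K, ϵ), the packing bound is the larger of the two in this regime.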
Then one can take R to be the Lipschitz constant of f.

B.3 PROOF OF THEOREM 3: UA FOR ASYMPTOTICALLY AFFINE FUNCTIONS

In this section, we restate and prove Theorem 3, which states that radial neural networks are universal approximators of asymptotically affine functions. We recall the definition of such functions:

Definition 13. A function f : R^n → R^m is asymptotically affine if there exists an affine function L : R^n → R^m such that, for all ϵ > 0, there exists a compact set K ⊂ R^n such that |L(x) − f(x)| < ϵ for all x ∈ R^n \ K. We say that L is the limit of f.

Remark 14. An asymptotically linear function is defined in the same way, except that L is taken to be linear (i.e., given by matrix multiplication, without translation). Hence any asymptotically linear function is in particular asymptotically affine, and Theorem 3 applies to asymptotically linear functions as well.

Given an asymptotically affine function f : R^n → R^m and ϵ > 0, let K be a compact set as in Definition 13. We apply Lemma 9 to the restriction f|_K of f to K and produce a minimal constant N = N(f|_K, K, ϵ) as in Definition 11. We write simply N(f, K, ϵ) for this constant.

Theorem 3 (Universal approximation). Let f : R^n → R^m be an asymptotically affine function. For any ϵ > 0, there exists a compact set K ⊂ R^n and a function F : R^n → R^m such that:

1. F is the feedforward function of a radial neural network with N = N(f, K, ϵ) hidden layers whose widths are (n + 1, n + 2, . . . , n + N).

2. For any

x ∈ R^n, we have |F(x) − f(x)| < ϵ.

Proof. By the hypothesis on f, there exists an affine function L : R^n → R^m and a compact set K ⊂ R^n such that |L(x) − f(x)| < ϵ for all x ∈ R^n \ K. Abbreviate N(f, K, ϵ) by N. As in Lemma 9, fix c_1, ..., c_N ∈ K and r_1, ..., r_N ∈ (0, 1) such that, first, the union of the balls B_{r_i}(c_i) covers K and, second, for all i, we have f(B_{r_i}(c_i)) ⊆ B_ϵ(f(c_i)). Let U = ∪_{i=1}^N B_{r_i}(c_i), so that K ⊂ U. Define F : R^n → R^m by:

F(x) = L(x) if x ∉ U, and F(x) = f(c_j) otherwise, where j is the smallest index with x ∈ B_{r_j}(c_j).

If x ∉ U, then |F(x) − f(x)| = |L(x) − f(x)| < ϵ. Hence suppose x ∈ U. Let j be the smallest index such that x ∈ B_{r_j}(c_j). Then F(x) = f(c_j) and, by the choice of r_j, we have |F(x) − f(x)| = |f(c_j) − f(x)| < ϵ.

We proceed to show that F is the feedforward function of a radial neural network. Let e_1, ..., e_N be orthonormal basis vectors extending R^n to R^{n+N}. We regard each R^{n+i−1} as a subspace of R^{n+i} by embedding into the first n + i − 1 coordinates. For i = 1, ..., N, we set h_i = √(1 − r_i²) and define the following affine transformations:

T_i : R^{n+i−1} → R^{n+i},  z ↦ z − c_i + h_i e_i
S_i : R^{n+i} → R^{n+i},  z ↦ z − (1 + h_i^{−1}) ⟨e_i, z⟩ e_i + c_i + e_i

where ⟨e_i, z⟩ is the coefficient of e_i in z. Consider the radial neural network with widths (n, n + 1, ..., n + N, m), whose affine transformations and activations are given by:

• For i = 1, ..., N, the affine transformation from layer i − 1 to layer i is given by z ↦ T_i ∘ S_{i−1}(z), where S_0 = id_{R^n}.

• The activation function at the i-th hidden layer is Step-ReLU on R^{n+i}, that is, ρ_i : R^{n+i} → R^{n+i} with ρ_i(z) = z if |z| ≥ 1 and ρ_i(z) = 0 otherwise.

• The affine transformation from layer N to the output layer is z ↦ Φ_{L,f,c} ∘ S_N(z), where Φ_{L,f,c} is the affine transformation given by:

Φ_{L,f,c} : R^{n+N} → R^m,  x + Σ_{i=1}^N a_i e_i ↦ L(x) + Σ_{i=1}^N a_i (f(c_i) − L(c_i)),

which can be shown to be affine when L is affine.
Indeed, write L(x) = Ax + b, where A ∈ R^{m×n} is a matrix and b ∈ R^m is a vector. Then Φ_{L,f,c} is the composition of the linear map given by the matrix

[A  f(c_1) − L(c_1)  f(c_2) − L(c_2)  ⋯  f(c_N) − L(c_N)] ∈ R^{m×(n+N)}

and translation by b ∈ R^m. Note that we regard each f(c_i) − L(c_i) ∈ R^m as a column vector in the matrix above.

We claim that the feedforward function of the above radial neural network is exactly F. To show this, we first state a lemma, whose (omitted) proof is an elementary computation.

Lemma 3.1. For i = 1, ..., N, the composition S_i ∘ T_i is the embedding R^{n+i−1} ↪ R^{n+i}.

Next, recursively define G_i : R^n → R^{n+i} via G_i = S_i ∘ ρ_i ∘ T_i ∘ G_{i−1}, where G_0 = id_{R^n}. The function G_i admits a direct formulation:

Proposition 3.2. For i = 0, 1, ..., N, we have: G_i(x) = x if x ∉ ∪_{j=1}^i B_{r_j}(c_j), and G_i(x) = c_j + e_j otherwise, where j ≤ i is the smallest index with x ∈ B_{r_j}(c_j).

Proof. We proceed by induction. The base step i = 0 is immediate. For the induction step, assume the claim is true for i − 1, where 0 ≤ i − 1 < N. There are three cases to consider.

Case 1. Suppose x ∉ ∪_{j=1}^i B_{r_j}(c_j). Then in particular x ∉ ∪_{j=1}^{i−1} B_{r_j}(c_j), so the induction hypothesis implies that G_{i−1}(x) = x. Additionally, x ∉ B_{r_i}(c_i), so:

|T_i(x)|² = |x − c_i + h_i e_i|² = |x − c_i|² + h_i² ≥ r_i² + 1 − r_i² = 1.

Using the definition of ρ_i and Lemma 3.1, we compute: G_i(x) = S_i ∘ ρ_i ∘ T_i ∘ G_{i−1}(x) = S_i ∘ ρ_i ∘ T_i(x) = S_i ∘ T_i(x) = x.

Case 2. Suppose x ∈ B_{r_j}(c_j) \ ∪_{k=1}^{j−1} B_{r_k}(c_k) for some j ≤ i − 1. Then the induction hypothesis implies that G_{i−1}(x) = c_j + e_j. We compute:

|T_i(c_j + e_j)|² = |c_j + e_j − c_i + h_i e_i|² > |e_j|² = 1.

Therefore, G_i(x) = S_i ∘ ρ_i ∘ T_i(c_j + e_j) = S_i ∘ T_i(c_j + e_j) = c_j + e_j.

Case 3. Finally, suppose x ∈ B_{r_i}(c_i) \ ∪_{j=1}^{i−1} B_{r_j}(c_j). The induction hypothesis implies that G_{i−1}(x) = x.
Since x ∈ B_{r_i}(c_i), we have:

|T_i(x)|² = |x − c_i + h_i e_i|² = |x − c_i|² + h_i² < r_i² + 1 − r_i² = 1.

Therefore: G_i(x) = S_i ∘ ρ_i ∘ T_i(x) = S_i(0) = c_i + e_i. This completes the proof of the proposition.

Finally, we show that the function F defined at the beginning of the proof is the feedforward function of the above radial neural network. The computation is elementary:

F_feedforward = Φ_{L,f,c} ∘ S_N ∘ ρ_N ∘ T_N ∘ S_{N−1} ∘ ρ_{N−1} ∘ T_{N−1} ∘ ⋯ ∘ S_1 ∘ ρ_1 ∘ T_1 = Φ_{L,f,c} ∘ G_N = F,

where the first equality follows from the definition of the feedforward function, the second from the definition of G_N, and the last from the case i = N of Proposition 3.2 together with the definition of Φ_{L,f,c}. This completes the proof of the theorem.
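The construction in the proof of Theorem 3 is concrete enough to run. The following sketch instantiates it numerically for n = 2, m = 1, with illustrative choices of our own: limit L(x) = x_1, a function f with f(c_1) = [3] and f(c_2) = [4], and a region covered by two balls of radius 0.5:

```python
import numpy as np

# Numerical sanity check of the Theorem 3 construction (choices of f, L,
# centers, and radii are hypothetical, for illustration only).
cs = [np.array([0.0, 0.0]), np.array([1.0, 0.0])]   # centers c_1, c_2
rs = [0.5, 0.5]                                     # radii r_1, r_2
N, n = len(cs), 2
hs = [np.sqrt(1 - r ** 2) for r in rs]              # h_i = sqrt(1 - r_i^2)
L = lambda x: np.array([x[0]])                      # affine limit
f = lambda x: np.array([x[0] + 3.0])                # f(c_1) = [3], f(c_2) = [4]

def embed(v, dim):
    out = np.zeros(dim); out[:len(v)] = v; return out

def step_relu(z):                                   # rho_i on R^{n+i}
    return z if np.linalg.norm(z) >= 1 else np.zeros_like(z)

def T(i, z):                                        # T_i(z) = z - c_i + h_i e_i
    z = embed(z, n + i)
    e = np.eye(n + i)[n + i - 1]
    return z - embed(cs[i - 1], n + i) + hs[i - 1] * e

def S(i, z):                                        # S_i(z) = z - (1 + 1/h_i)<e_i,z> e_i + c_i + e_i
    e = np.eye(n + i)[n + i - 1]
    return z - (1 + 1 / hs[i - 1]) * z[n + i - 1] * e + embed(cs[i - 1], n + i) + e

def Phi(z):                                         # Phi_{L,f,c}
    x, a = z[:n], z[n:]
    return L(x) + sum(a[i] * (f(cs[i]) - L(cs[i])) for i in range(N))

def F(x):                                           # Phi . S_N . rho_N . T_N . ... . S_1 . rho_1 . T_1
    z = np.asarray(x, dtype=float)
    for i in range(1, N + 1):
        z = S(i, step_relu(T(i, z)))
    return Phi(z)

assert np.allclose(F([0.1, 0.0]), f(cs[0]))         # inside first ball: F = f(c_1)
assert np.allclose(F([0.9, 0.0]), f(cs[1]))         # inside second ball: F = f(c_2)
assert np.allclose(F([5.0, 5.0]), L([5.0, 5.0]))    # far from the balls: F = L
```

The three assertions exercise the three cases of Proposition 3.2: a point handled by the first ball, a point handled by the second, and a point outside the union.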

B.4 PROOF OF THEOREM 5: BOUNDED-WIDTH UA FOR ASYMPTOTICALLY AFFINE FUNCTIONS

We restate and prove Theorem 5, which strengthens Theorem 3 by providing a bounded width radial neural network approximation of any asymptotically affine function. Theorem 5. Let f : R n → R m be an asymptotically affine function. For any ϵ > 0, there exists a compact set K ⊂ R n and a function F : R n → R m such that: 1. F is the feedforward function of a radial neural network with N = N (f, K, ϵ) hidden layers whose widths are all n + m + 1.

2. For any

x ∈ R^n, we have |F(x) − f(x)| < ϵ.

Proof. By the hypothesis on f, there exists an affine function L : R^n → R^m and a compact set K ⊂ R^n such that |L(x) − f(x)| < ϵ for all x ∈ R^n \ K. Given ϵ > 0, let N = N(f, K, ϵ) and use Lemma 9 to choose c_1, ..., c_N ∈ K and r_1, ..., r_N ∈ (0, 1) such that the union of the balls B_{r_i}(c_i) covers K and, for all i, we have f(B_{r_i}(c_i)) ⊆ B_ϵ(f(c_i)). Let s be the minimal non-zero value of |f(c_i) − f(c_j)| for i, j ∈ {1, ..., N}, that is, s = min_{i,j : f(c_i) ≠ f(c_j)} |f(c_i) − f(c_j)|.

Using the decomposition R^{n+m+1} ≅ R^n × R^m × R, we write elements of R^{n+m+1} as (x, y, θ), where x ∈ R^n, y ∈ R^m, and θ ∈ R. For i = 1, ..., N, set:

T_i : R^{n+m+1} → R^{n+m+1},  (x, y, θ) ↦ (x − (1 − θ)c_i,  y − θ(f(c_i) − L(0))/s,  (1 − θ)h_i),

where h_i = √(1 − r_i²). Note that T_i is an invertible affine transformation, whose inverse is given by:

T_i^{−1}(x, y, θ) = (x + (θ/h_i)c_i,  y + (1 − θ/h_i)(f(c_i) − L(0))/s,  1 − θ/h_i).

For i = 1, ..., N, define G_i : R^n → R^{n+m+1} via the recursion G_i = T_i^{−1} ∘ ρ ∘ T_i ∘ G_{i−1}, where G_0(x) = (x, 0, 0) is the inclusion R^n → R^{n+m+1}, and ρ : R^{n+m+1} → R^{n+m+1} is Step-ReLU on R^{n+m+1}. We claim that, for x ∈ R^n:

G_i(x) = (x, 0, 0) if x ∉ ∪_{j=1}^i B_{r_j}(c_j), and G_i(x) = (0, (f(c_j) − L(0))/s, 1) otherwise, where j ≤ i is the smallest index with x ∈ B_{r_j}(c_j).

This claim can be verified by a straightforward induction argument, similar to the one given in the proof of Proposition 3.2, using the following key facts:

• For x ∈ R^n, |T_i(x, 0, 0)| = |(x − c_i, 0, h_i)| < 1 if and only if |x − c_i| < r_i.

• T_i^{−1}(0) = (0, (f(c_i) − L(0))/s, 1).

• T_i(0, (f(c_j) − L(0))/s, 1) = (0, (f(c_j) − f(c_i))/s, 0), which, by the choice of s, has norm at least 1 if f(c_j) ≠ f(c_i), and equals 0 if f(c_j) = f(c_i).

Let Φ : R^{n+m+1} → R^m denote the affine map sending (x, y, θ) to L(x) + sy.
It follows that F = Φ ∘ G_N satisfies: F(x) = L(x) if x ∉ ∪_{j=1}^N B_{r_j}(c_j), and F(x) = f(c_j) otherwise, where j is the smallest index with x ∈ B_{r_j}(c_j). By construction, F is the feedforward function of a radial neural network with N hidden layers whose widths are all n + m + 1. Let x ∈ R^n. If x ∈ K, let j be the smallest index such that x ∈ B_{r_j}(c_j). Then F(x) = f(c_j) and, by the choice of r_j, we have |F(x) − f(x)| = |f(c_j) − f(x)| < ϵ. Otherwise, x ∈ R^n \ K, and |F(x) − f(x)| = |L(x) − f(x)| < ϵ.

B.5 ADDITIONAL RESULT: BOUND OF max(n, m) + 1

We state and prove an additional bounded-width result. In contrast to the results above, the theorem below holds only for functions defined on a compact domain, without assumptions about the asymptotic behavior. The proof is an adaptation of the proof of Theorem 5, so we give only a sketch.

Theorem 15. Let f : K → R^m be a continuous function, where K is a compact subset of R^n. For any ϵ > 0, there exists F : R^n → R^m such that:

1. F is the feedforward function of a radial neural network with N(f, K, ϵ) hidden layers whose widths are all max(n, m) + 1.

2. For any x ∈ K, we have |F(x) − f(x)| < ϵ.

Sketch of proof. The construction appearing in the proof of Theorem 5 with L ≡ 0 can be used to produce a radial neural network with N(f, K, ϵ) hidden layers of width n + m + 1 that approximates f on K. (Note that the approximation works only on K, as f is not defined outside of K.) All values in the hidden layers are of the form (x, 0, 0) or (0, y, 1). We can therefore replace (x, y, θ) ∈ R^{n+m+1} by (x + y, θ) ∈ R^{max(n,m)} × R ≅ R^{max(n,m)+1} everywhere, without affecting any statements about the hidden layers. In particular, the transformation T_i becomes:

T_i : R^{max(n,m)+1} → R^{max(n,m)+1},  (x, θ) ↦ (x − (1 − θ)c_i − θ f(c_i)/s,  (1 − θ)h_i).

With this change, the final affine map Φ sends (x, θ) to sx.
From the rest of the proof of Theorem 5 it follows that the feedforward function F of the radial network satisfies |F (x) -f (x)| < ϵ for all x ∈ K.
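The invertibility of the map T_i used in the proof of Theorem 5 can be checked directly. The following sketch (with hypothetical values of c_i, f(c_i) − L(0), s, and h_i) verifies that the stated formula for T_i^{−1} inverts T_i on a sample point (x, y, θ):

```python
import numpy as np

# Check that T_i and the stated T_i^{-1} compose to the identity on (x, y, theta).
c = np.array([0.5, -0.2])              # a hypothetical center c_i in R^n
fc_minus_L0 = np.array([1.0, 2.0, 0.3])  # a stand-in for f(c_i) - L(0) in R^m
s, h = 0.7, 0.6                        # stand-ins for s and h_i = sqrt(1 - r_i^2)

def T(x, y, t):
    return x - (1 - t) * c, y - t * fc_minus_L0 / s, (1 - t) * h

def T_inv(x, y, t):
    return x + (t / h) * c, y + (1 - t / h) * fc_minus_L0 / s, 1 - t / h

x, y, t = np.array([1.0, 2.0]), np.array([0.1, 0.2, 0.3]), 0.4
for got, want in zip(T_inv(*T(x, y, t)), (x, y, t)):
    assert np.allclose(got, want)
```

The same check run with T and T_inv swapped confirms the inverse on the other side, since both maps are affine bijections.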

B.6 ADDITIONAL RESULT: BOUND OF max(n, m)

In this section, we prove a different version of the result of the previous section. Specifically, we reduce the bound on the widths to max(n, m) at the cost of using more layers. Again, we focus on functions defined on a compact domain, without assumptions about their asymptotic behavior. Recall the notation M(f, K, ϵ) from Lemma 10 and Definition 11.

Theorem 16. Let f : K → R^m be a continuous function, where K is a compact subset of R^n for n ≥ 2. For any ϵ > 0, there exists F : R^n → R^m such that:

1. F is the feedforward function of a radial neural network with 2M(f, K, ϵ/2) hidden layers whose widths are all max(n, m).

2. For any

x ∈ K, we have |F(x) − f(x)| < ϵ.

Proof. We first consider the case n = m. Set M = M(f, K, ϵ/2). As in Lemma 10, fix c_1, ..., c_M ∈ K and r_1, ..., r_M ∈ (0, 1) such that, first, the union of the balls B_{r_i}(c_i) covers K; second, for all i, we have f(B_{r_i}(c_i)) ⊆ B_{ϵ/2}(f(c_i)); and, third, |c_i − c_j| ≥ r_i for i ≠ j.

For i = 1, ..., M, set T_i : R^n → R^n, x ↦ (x − c_i)/r_i, and recursively define G_i : R^n → R^n as G_i = T_i^{−1} ∘ ρ ∘ T_i ∘ G_{i−1}, where G_0 = id_{R^n} and ρ : R^n → R^n is Step-ReLU.

Lemma 16.1. For i = 0, 1, ..., M, we have: G_i(x) = x if x ∉ ∪_{j=1}^i B_{r_j}(c_j), and G_i(x) = c_j otherwise, where j ≤ i is the smallest index with x ∈ B_{r_j}(c_j).

We omit the full proof of Lemma 16.1, as it is a standard induction argument similar to Proposition 3.2, relying on the following two facts. First, |T_i(x)| < 1 if and only if x ∈ B_{r_i}(c_i). Second, by the choice of the c_i, we have |c_i − c_j| ≥ r_i for all i ≠ j, which implies that |T_i(c_j)| ≥ 1 for i ≠ j.

Next, perform the following loop over i = 1, ..., M:

• Set P_{i−1} = {c_1, ..., c_M} ∪ {d_1, ..., d_{i−1}}.

• Choose d_i ∈ B_{ϵ/2}(f(c_i)) that is not collinear with any pair of points in P_{i−1}. This is where we use the hypothesis that n ≥ 2.

• Let s_i be the minimum distance between any point on the line through c_i and d_i and any point in P_{i−1} \ {c_i}.

• Let U_i : R^n → R^n be the following affine transformation:

U_i(x) = (x − d_i)/s_i + (1/(2|c_i − d_i|) − 1/s_i) (⟨x − d_i, c_i − d_i⟩ / |c_i − d_i|²) (c_i − d_i).

• Define H_i : R^n → R^n recursively as H_i = U_i^{−1} ∘ ρ ∘ U_i ∘ H_{i−1}, where H_0 = id_{R^n}.

We note that the transformation U_i can also be written as A_i(x − d_i), where A_i is the linear map A_i = (1/s_i) proj_{⟨c_i−d_i⟩^⊥} + (1/(2|c_i−d_i|)) proj_{⟨c_i−d_i⟩}, which involves the projection onto the line spanned by c_i − d_i and the projection onto the orthogonal complement of this line.

Lemma 16.2. For i, j = 1, ...
, M, we have: H_i(c_j) = d_j if j ≤ i, and H_i(c_j) = c_j if j > i.

Proof. It is immediate that U_i(d_i) = 0 and |U_i(c_i)| = 1/2. It is also straightforward to show, using the choice of s_i, that |U_i(p)| ≥ 1 for all p ∈ P_{i−1} \ {c_i}. It follows that U_i^{−1} ∘ ρ ∘ U_i sends c_i to d_i and fixes all other points in P_{i−1}.

Lemma 16.3. For x ∈ K, we have H_M ∘ G_M(x) = d_i, where i is the smallest index with x ∈ B_{r_i}(c_i).

Proof. Let x ∈ K. By Lemma 16.1, we have G_M(x) = c_i, where i is the smallest index with x ∈ B_{r_i}(c_i). (We use the fact that the balls B_{r_i}(c_i) cover K.) By Lemma 16.2, we have H_M(c_i) = d_i for all i. The result follows.

Set F = H_M ∘ G_M. We see that, for x ∈ K:

|F(x) − f(x)| = |d_i − f(x)| ≤ |d_i − f(c_i)| + |f(c_i) − f(x)| < ϵ/2 + ϵ/2 = ϵ,

where i is the smallest index with x ∈ B_{r_i}(c_i).

We now show that F is the feedforward function of a radial neural network with 2M hidden layers, all of width equal to n. Indeed, take the affine transformations and activations as follows:

• For i = 1, ..., M, the affine transformation from layer i − 1 to layer i is given by x ↦ T_i ∘ T_{i−1}^{−1}(x), where T_0 = id_{R^n}.

• For i = 1, ..., M, the affine transformation from layer M + i − 1 to layer M + i is given by x ↦ U_i ∘ U_{i−1}^{−1}(x), where U_0 = T_M.

• The activation at each hidden layer is Step-ReLU on R^n, that is, ρ(x) = x if |x| ≥ 1 and ρ(x) = 0 otherwise.

• The affine transformation from layer 2M to the output layer is U_M^{−1}.

It is immediate from the definitions that the feedforward function of this network is F.

To conclude the proof, we discuss the cases where n ≠ m. Suppose n < m, so that max(n, m) = m. Then we can regard K as a compact subset of R^m and apply the above construction. Suppose n > m, so that max(n, m) = n. Let inc : R^m → R^n be the standard inclusion, and apply the above construction to the function f̃ = inc ∘ f : K → R^n.
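The key properties of the map U_i asserted in Lemma 16.2 can be checked numerically. The following sketch uses hypothetical points c, d and a stand-in value for s_i (the normalization 1/(2|c − d|) is the one needed for |U(c)| = 1/2, as in the proof above):

```python
import numpy as np

# Check U(d) = 0 and |U(c)| = 1/2 for the map U from the proof of Theorem 16,
# with illustrative choices of c, d, and s.
c = np.array([1.0, 0.0])
d = np.array([1.0, 2.0])
s = 0.8                                   # a stand-in for the distance s_i

def U(x):
    u = c - d
    proj = (np.dot(x - d, u) / np.dot(u, u)) * u      # component of x - d along c - d
    return (x - d) / s + (1.0 / (2 * np.linalg.norm(u)) - 1.0 / s) * proj

assert np.allclose(U(d), 0)                           # U sends d to the origin
assert np.isclose(np.linalg.norm(U(c)), 0.5)          # |U(c)| = 1/2 < 1
```

Since |U(c)| < 1, Step-ReLU sends U(c) to 0 and U^{−1} ∘ ρ ∘ U sends c to d, which is exactly the relocation step used in the second half of the network.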

C MODEL COMPRESSION PROOFS

The aim of this appendix is to give a proof of Theorem 6. In order to do so, we first (1) provide background on a relevant version of the QR decomposition, and (2) establish basic properties of radial rescaling activations.

C.1 THE QR DECOMPOSITION

In this section, we recall the QR decomposition and note several relevant facts. For integers n and m, let (R^{n×m})_upper denote the vector space of upper-triangular n × m matrices.

Theorem 17 (QR Decomposition). The following map is surjective:

O(n) × (R^{n×m})_upper → R^{n×m},  (Q, R) ↦ Q • R.

In other words, any matrix can be written as the product of an orthogonal matrix and an upper-triangular matrix. When m ≤ n, the last n − m rows of any matrix in (R^{n×m})_upper are zero, and the top m rows form an upper-triangular m × m matrix. These observations lead to the following "complete" version of the QR decomposition, which coincides with the above result when m ≥ n:

Corollary 18 (Complete QR Decomposition). The following map is surjective:

μ : O(n) × (R^{k×m})_upper → R^{n×m},  (Q, R) ↦ Q • inc • R,

where k = min(n, m) and inc : R^k → R^n is the standard inclusion into the first k coordinates. We make some remarks:

1. There are several algorithms for computing the QR decomposition of a given matrix. One is Gram–Schmidt orthogonalization; another is the method of Householder reflections. The latter has computational complexity O(n²m) for an n × m matrix with n ≥ m. The package numpy includes a function numpy.linalg.qr that computes the QR decomposition of a matrix using Householder reflections.

2. In each iteration of the loop in Algorithm 1, the method QR-decomp with mode = 'complete' takes as input a matrix A_i of size n_i × (n_red_{i−1} + 1), and produces an orthogonal matrix Q_i ∈ O(n_i) and an upper-triangular matrix R_i of size min(n_i, n_red_{i−1} + 1) × (n_red_{i−1} + 1) such that A_i = Q_i • inc_i • R_i. Note that n_red_i = min(n_i, n_red_{i−1} + 1).

3. The QR decomposition is not unique in general; in other words, the map μ is not injective in general. For example, if n > m, each fiber of μ contains a copy of the orthogonal group O(n − m).

4. The QR decomposition is unique (in a certain sense) for invertible square matrices.
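The complete decomposition of Corollary 18 is exactly what numpy.linalg.qr returns with mode='complete'; the following sketch checks the factorization A = Q • inc • R for a tall matrix (n = 5, m = 3):

```python
import numpy as np

# Complete QR decomposition via numpy, as in Corollary 18.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))              # n = 5 > m = 3
Q, R = np.linalg.qr(A, mode='complete')      # Q: 5x5 orthogonal, R: 5x3 upper-triangular
assert np.allclose(Q @ R, A)                 # A = Q . R
assert np.allclose(Q.T @ Q, np.eye(5))       # Q is orthogonal
assert np.allclose(R[3:], 0)                 # the last n - m rows of R vanish

R_top = R[:3]                                # the k x m factor, k = min(n, m)
inc = np.eye(5, 3)                           # standard inclusion R^3 -> R^5 as a matrix
assert np.allclose(Q @ inc @ R_top, A)       # A = Q . inc . R_top
```

The vanishing of the last rows of R is what makes the reduced factor R_top well-defined, and hence what drives the width reduction in Algorithm 1.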
To be precise, let B^+_n be the subset of (R^{n×n})_upper consisting of upper-triangular n × n matrices with positive entries along the diagonal. Both B^+_n and O(n) are subgroups of the general linear group GL_n(R), and the multiplication map O(n) × B^+_n → GL_n(R) is bijective. However, the QR decomposition is not unique for non-invertible square matrices.

C.2 RESCALING FUNCTIONS

We now prove the following basic facts about radial rescaling functions:

Lemma 19. Let ρ = h^{(n)} : R^n → R^n be a radial rescaling function on R^n.

1. The function ρ commutes with any orthogonal transformation of R^n. That is, ρ ∘ Q = Q ∘ ρ for any Q ∈ O(n).

2. If m ≤ n and inc : R^m → R^n is the standard inclusion into the first m coordinates, then h^{(n)} ∘ inc = inc ∘ h^{(m)}.

Proof. Suppose Q ∈ O(n) is an orthogonal transformation of R^n. Since Q is norm-preserving, we have |Qv| = |v| for any v ∈ R^n. Since Q is linear, we have Q(λv) = λQv for any λ ∈ R and v ∈ R^n. Using the definition of ρ = h^{(n)}, we compute:

ρ(Qv) = (h(|Qv|)/|Qv|) Qv = (h(|v|)/|v|) Qv = Q((h(|v|)/|v|) v) = Q(ρ(v)).

The first claim follows. The second claim is an elementary verification. More generally, the restriction of the radial rescaling function ρ to a linear subspace of R^n is a radial rescaling function on that subspace.

Given a tuple of radial rescaling functions ρ = (ρ_i : R^{n_i} → R^{n_i})_{i=1}^L suited to widths n = (n_i)_{i=1}^L, we write ρ_red = (ρ_red_i : R^{n_red_i} → R^{n_red_i}) for the tuple of restrictions suited to the reduced widths n_red, so that ρ_red_i = ρ_i |_{R^{n_red_i}}.

C.3 PROOF OF THEOREM 6

Adopting notation from above and Section 5, we now restate and prove Theorem 6.

Theorem 6. Let (W, b, ρ) be a radial neural network with widths n. Let W_red and b_red be the weights and biases of the compressed network produced by Algorithm 1. The feedforward function of the original network (W, b, ρ) coincides with that of the compressed network (W_red, b_red, ρ_red).

Proof. Let (W_red, b_red, Q) = QR-Compress(W, b) be the output of Algorithm 1, so that Q ∈ O(n_hid) and (W_red, b_red, ρ_red) is a neural network with widths n_red and radial rescaling activations ρ_red_i = ρ_i |_{R^{n_red_i}}. Let F = F_{(W,b,ρ)} and F_red = F_{(W_red,b_red,ρ_red)} denote the two feedforward functions. Additionally, we have the partial feedforward functions F_i and F_red_i. We show by induction that F_i = Q_i ∘ inc_i ∘ F_red_i for any i = 0, 1, ..., L.
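Both claims of Lemma 19 can be spot-checked numerically. The sketch below uses the radial rescaling given by h(r) = tanh(r), i.e. ρ(v) = (tanh(|v|)/|v|) v; the choice of h is ours, for illustration:

```python
import numpy as np

# Numerical check of Lemma 19 for the radial rescaling rho(v) = (tanh(|v|)/|v|) v.
def radial_rescale(v):
    r = np.linalg.norm(v)
    return (np.tanh(r) / r) * v if r > 0 else v

rng = np.random.default_rng(1)
v = rng.standard_normal(4)
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))        # a random orthogonal matrix

# Claim 1: rho commutes with orthogonal transformations.
assert np.allclose(radial_rescale(Q @ v), Q @ radial_rescale(v))

# Claim 2: rho commutes with the standard inclusion R^2 -> R^4.
w = rng.standard_normal(2)
w_inc = np.concatenate([w, np.zeros(2)])
assert np.allclose(radial_rescale(w_inc),
                   np.concatenate([radial_rescale(w), np.zeros(2)]))
```

These two identities are precisely what the induction step in the proof of Theorem 6 below relies on.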
(Continuing conventions from Sections 5.1 and 5.2, we set Q_0 = id_{R^{n_0}}, Q_L = id_{R^{n_L}}, and let inc_i : R^{n_red_i} → R^{n_i} be the inclusion map.) The base step i = 0 is immediate. For the induction step, let x ∈ R^{n_0}. Then:

F_i(x) = ρ_i(W_i • F_{i−1}(x) + b_i)
= ρ_i(W_i • Q_{i−1} • inc_{i−1} • F_red_{i−1}(x) + b_i)
= ρ_i([b_i  W_i • Q_{i−1} • inc_{i−1}] • (1, F_red_{i−1}(x)))
= ρ_i(Q_i • inc_i • [b_red_i  W_red_i] • (1, F_red_{i−1}(x)))
= Q_i • inc_i • ρ_red_i(W_red_i • F_red_{i−1}(x) + b_red_i)
= Q_i • inc_i • F_red_i(x).

The first equality relies on the definition of the partial feedforward function F_i; the second on the induction hypothesis; the fourth on an inspection of Algorithm 1, noting that R_i = [b_red_i  W_red_i]; the fifth on the results of Lemma 19, observing that ρ_i ∘ Q_i = Q_i ∘ ρ_i and ρ_i ∘ inc_i = inc_i ∘ ρ_red_i; and the sixth on the definition of F_red_i. In the case i = L, we have:

F = F_L = Q_L • inc_L • F_red_L = F_red,

since Q_L = inc_L = id_{R^{n_L}} and F_red_L = F_red. The theorem now follows.

The techniques of the above proof can be used to show that the action of the group O(n_hid) of orthogonal change-of-basis symmetries on the parameter space Param(n) leaves the feedforward function unchanged. We do not use this result directly, but state it precisely nonetheless:

Proposition 20. Let (W, b, ρ) be a radial neural network with widths vector n, and suppose g ∈ O(n_hid). Then the original and transformed networks have the same feedforward function: F_{(g•W, g•b, ρ)} = F_{(W,b,ρ)}. In other words, fix parameters (W, b) ∈ Param(n), radial rescaling activations ρ, and g ∈ O(n_hid). Then the radial neural network with parameters (W, b) has the same feedforward function as the radial neural network with transformed parameters (g • W, g • b), where we take radial rescaling activations ρ in both cases.

We remark that Proposition 20 is analogous to the "non-negative homogeneity" (or "positive scaling invariance") of the pointwise ReLU activation function.
In that setting, instead of considering the product of orthogonal groups O(n_hid), one considers the rescaling action of the following subgroup of ∏_{i=1}^{L−1} GL_{n_i}:

G = { g = (g_i) ∈ ∏_{i=1}^{L−1} GL_{n_i} | each g_i is diagonal with positive diagonal entries }.

Note that G is isomorphic to the product ∏_{i=1}^{L−1} R^{n_i}_{>0}, and the action on Param(n) is given by the same formulas as those appearing near the end of Section 5.1. The feedforward function of an MLP with pointwise ReLU activations is invariant under the action of G on Param(n).
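The positive-scaling invariance of pointwise ReLU networks admits the same kind of numerical spot-check. The sketch below rescales the hidden neurons of a bias-free ReLU MLP by positive diagonal matrices g_i and verifies that the feedforward function is unchanged:

```python
import numpy as np

# ReLU(g z) = g ReLU(z) for positive diagonal g, so the rescaling action
# W_i -> g_i W_i g_{i-1}^{-1} leaves the feedforward function invariant.
relu = lambda z: np.maximum(z, 0)

def mlp(Ws, x):                        # bias-free MLP with pointwise ReLU
    for W in Ws[:-1]:
        x = relu(W @ x)
    return Ws[-1] @ x

rng = np.random.default_rng(4)
Ws = [rng.standard_normal(s) for s in [(4, 3), (5, 4), (2, 5)]]
gs = [np.diag(rng.uniform(0.5, 2.0, k)) for k in (4, 5)]   # positive diagonal g_1, g_2
Ws_g = [gs[0] @ Ws[0],
        gs[1] @ Ws[1] @ np.linalg.inv(gs[0]),
        Ws[2] @ np.linalg.inv(gs[1])]

x = rng.standard_normal(3)
assert np.allclose(mlp(Ws, x), mlp(Ws_g, x))
```

The contrast with the radial case is that here the symmetry group is only the positive diagonal subgroup, whereas radial rescaling activations commute with all of O(n_hid).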

D PROJECTED GRADIENT DESCENT PROOFS

In this section, we give a proof of Theorem 8, which relates projected gradient descent for a representation with dimension vector n to (usual) gradient descent for the corresponding reduced representation with dimension vector n_red. The proof requires some set-up and background results.

D.1 GRADIENT DESCENT AND ORTHOGONAL SYMMETRIES

We first prove that gradient descent commutes with loss-invariant orthogonal transformations. This section is general and departs from the specific case of radial neural networks.

D.1.1 SETTING

Let L : V = R^p → R be a smooth function. Semantically, V is the parameter space of a neural network and L is the loss function with respect to a batch of training data. The differential dL_v of L at v ∈ V is a row vector, while the gradient ∇_v L of L at v is a column vector:

dL_v = [∂L/∂x_1 |_v  ⋯  ∂L/∂x_p |_v],  ∇_v L = (∂L/∂x_1 |_v, ..., ∂L/∂x_p |_v)^T.

Hence ∇_v L is the transpose of dL_v, that is, ∇_v L = (dL_v)^T. A step of gradient descent with respect to L at learning rate η > 0 is defined as:

γ = γ_η : V → V,  v ↦ v − η ∇_v L.

We drop η from the notation when it is clear from context. For any k ≥ 0, we denote by γ^k the k-fold composition γ ∘ γ ∘ ⋯ ∘ γ of the gradient descent map γ.

D.1.2 INVARIANT GROUP ACTION

Now suppose ρ : G → GL(V) is an action of a Lie group G on V such that L is G-invariant, i.e., L(ρ(g)(v)) = L(v) for all g ∈ G and v ∈ V. We write simply g • v for ρ(g)(v), and g for ρ(g).

Lemma 21. For any v ∈ V and g ∈ G, we have: ∇_v L = g^T • (∇_{g•v} L).

Proof. The proof is a computation:

∇_v L = (dL_v)^T = (d(L ∘ g)_v)^T = (dL_{g•v} ∘ dg_v)^T = (dL_{g•v} ∘ g)^T = g^T • (dL_{g•v})^T = g^T • (∇_{g•v} L).

The second equality relies on the hypothesis that L ∘ g = L, the third on the chain rule, and the fourth on the fact that dg_v = g since g is a linear map. One can also perform this computation in coordinates, for i = 1, ..., p:

(∇_v L)_i = ∂L/∂x_i |_v = ∂(L ∘ g)/∂x_i |_v = Σ_j (∂L/∂x_j |_{g•v}) (∂g^j/∂x_i |_v) = Σ_j (∇_{g•v} L)_j g^j_i = Σ_j (g^T)_i^j (∇_{g•v} L)_j = (g^T • ∇_{g•v} L)_i.

D.1.3 ORTHOGONAL CASE

Suppose furthermore that the action of G is by orthogonal transformations, so that ρ(g)^T = ρ(g)^{−1} for all g ∈ G. Then Lemma 21 implies that

∇_{g•v} L = g • ∇_v L  (D.1)

for any v ∈ V and g ∈ G. The proof of the following lemma is immediate from Equation D.1, together with the definition of γ. See Figure 6 for an illustration.

Lemma 22. Suppose the action of G on V is by orthogonal transformations, and that L is G-invariant.
Then the action of G commutes with gradient descent (for any learning rate). That is, γ^k(g • v) = g • γ^k(v) for any v ∈ V, g ∈ G, and k ≥ 0.

(Caption of Figure 6: if the loss is invariant with respect to an orthogonal transformation Q of the parameter space, then optimization of the network by gradient descent is also invariant with respect to Q. Note that, in the illustrated example, projected and usual gradient descent match; this is not the case in higher dimensions, as explained in D.6.)

D.2 GRADIENT DESCENT NOTATION AND SET-UP

We now turn our attention back to radial neural networks. In this section, we recall notation from above, and introduce new notation that will be relevant for the formulation and proof of Theorem 8.

D.2.1 MERGING WIDTHS AND BIASES

Let n = (n_0, n_1, n_2, ..., n_{L−1}, n_L) be the widths vector of an MLP. Recall the definition of Param(n) as the parameter space of all possible choices of trainable parameters:

Param(n) = (R^{n_1×n_0} × R^{n_2×n_1} × ⋯ × R^{n_L×n_{L−1}}) × (R^{n_1} × R^{n_2} × ⋯ × R^{n_L}).

We have been denoting an element therein as a pair of tuples (W, b), where W = (W_i ∈ R^{n_i×n_{i−1}})_{i=1}^L are the weights and b = (b_i ∈ R^{n_i})_{i=1}^L are the biases. In this appendix, however, we adopt different notation. Observe that, placing each bias vector as an extra column on the left of the corresponding weight matrix, we obtain matrices:

A_i = [b_i  W_i] ∈ R^{n_i×(1+n_{i−1})}.

Thus, there is an isomorphism:

Param(n) ≅ ∏_{i=1}^L R^{n_i×(n_{i−1}+1)} = R^{n_1×(n_0+1)} × R^{n_2×(n_1+1)} × ⋯ × R^{n_L×(n_{L−1}+1)}.

In this appendix, we regard an element of Param(n) as a tuple of 'merged' matrices A = (A_i ∈ R^{n_i×(1+n_{i−1})})_{i=1}^L. We now define convenient maps to translate between the merged notation and the split notation. For each i, define the extension-by-one map from R^{n_i} to R × R^{n_i} ≅ R^{n_i+1} as follows:

ext_i : R^{n_i} → R^{n_i+1},  v = (v_1, v_2, ..., v_{n_i}) ↦ (1, v_1, v_2, ..., v_{n_i}).  (D.2)

Observe that, for any i and x ∈ R^{n_{i−1}}, we have A_i • ext_{i−1}(x) = W_i x + b_i. Consequently, the i-th partial feedforward function can be defined recursively as:

F_i = ρ_i ∘ A_i ∘ ext_{i−1} ∘ F_{i−1},  (D.3)

where ρ_i : R^{n_i} → R^{n_i} is the activation at the i-th layer, and F_0 is the identity on R^{n_0}.
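The translation between merged and split notation is a one-line identity, checked below on random data:

```python
import numpy as np

# The merged matrix A = [b | W] and the extension map ext satisfy
# A . ext(x) = W x + b.
rng = np.random.default_rng(3)
W = rng.standard_normal((4, 3))
b = rng.standard_normal(4)
A = np.column_stack([b, W])              # bias as the extra first column
x = rng.standard_normal(3)
ext_x = np.concatenate([[1.0], x])       # ext(x) = (1, x_1, ..., x_n)
assert np.allclose(A @ ext_x, W @ x + b)
```

Merging the bias into the weight matrix is what allows Algorithm 2 below to apply a single QR decomposition per layer to both weights and biases at once.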

D.2.2 ORTHOGONAL CHANGE-OF-BASIS ACTION

To describe the orthogonal change-of-basis symmetries of the parameter space in the merged notation, recall the following product of orthogonal groups, with sizes corresponding to the widths of the hidden layers:

O(n_hid) = O(n_1) × O(n_2) × ··· × O(n_{L-1}).

In the merged notation, an element Q = (Q_i)_{i=1}^{L-1} ∈ O(n_hid) transforms A ∈ Param(n) as:

A ↦ Q • A := ( Q_i • A_i • [ 1 0 ; 0 Q_{i-1}^{-1} ] )_{i=1}^L,  (D.4)

where Q_0 = id_{n_0} and Q_L = id_{n_L}.
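The usefulness of this action rests on the fact that radial rescaling functions commute with orthogonal transformations, since |Qv| = |v|. A minimal numerical check (with λ(r) = tanh(r)/r as an arbitrary sample choice of rescaling function):

```python
import numpy as np

def radial_rescale(v):
    # rho(v) = lambda(|v|) v, here with lambda(r) = tanh(r)/r.
    r = np.linalg.norm(v)
    return (np.tanh(r) / r) * v if r > 0 else v

rng = np.random.default_rng(1)
# A random orthogonal matrix, obtained from a QR decomposition.
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))
v = rng.normal(size=4)

# Since |Qv| = |v|, radial rescaling commutes with the orthogonal action.
assert np.allclose(radial_rescale(Q @ v), Q @ radial_rescale(v))
```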

D.2.3 MODEL COMPRESSION ALGORITHM

We now restate Algorithm 1 in the merged notation. We emphasize that Algorithms 1 and 2 are mathematically equivalent; the latter simply uses more compact notation.

Algorithm 2: QR Model Compression (QR-compress)
  input : A ∈ Param(n)
  output : Q ∈ O(n_hid) and V ∈ Param(n^red)
  Q, V ← [ ], [ ]                                  // initialize output matrix lists
  M_1 ← A_1
  for i ← 1 to L - 1 do                            // iterate through layers
      Q_i, R_i ← QR-decomp(M_i, mode = 'complete') // M_i = Q_i • inc_i • R_i
      Append Q_i to Q
      Append R_i to V                              // reduced merged weights for layer i
      M_{i+1} ← A_{i+1} • [ 1 0 ; 0 Q_i • inc_i ]  // transform next layer
  end
  Append M_L to V
  return Q, V

We explain the notation. As noted in Appendix B.1, the symbol '•' denotes composition of maps, or matrix multiplication in the case of linear maps. The standard inclusion inc_i : R^{n^red_i} → R^{n_i} maps into the first n^red_i coordinates; as a matrix, inc_i ∈ R^{n_i×n^red_i} has ones along the main diagonal and zeros elsewhere. The method QR-decomp with mode = 'complete' computes the complete QR decomposition of the n_i × (1 + n^red_{i-1}) matrix M_i as Q_i • inc_i • R_i, where Q_i ∈ O(n_i) and R_i is upper-triangular of size n^red_i × (1 + n^red_{i-1}). The definition of n^red_i implies that either n^red_i = 1 + n^red_{i-1} or n^red_i = n_i; the matrix R_i is square of size n^red_i × n^red_i in the former case, and of size n_i × (1 + n^red_{i-1}) in the latter case.
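The algorithm can be sketched in NumPy as follows (a reconstruction from the pseudocode above, not the paper's released implementation; the radial activation λ(r) = tanh(r)/r is an arbitrary choice, and the helper names are ours). The final assertion checks the lossless-compression property of Theorem 6 on random merged parameters:

```python
import numpy as np

def radial(v):
    # Radial rescaling rho(v) = lambda(|v|) v with lambda(r) = tanh(r)/r.
    r = np.linalg.norm(v)
    return (np.tanh(r) / r) * v if r > 0 else v

def feedforward(A, x):
    # Merged-notation feedforward: F_i = rho(A_i . ext(F_{i-1})).
    h = x
    for Ai in A:
        h = radial(Ai @ np.concatenate([[1.0], h]))
    return h

def qr_compress(A):
    # Sketch of Algorithm 2; A[i] has shape (n_{i+1}, 1 + n_i).
    Qs, V = [], []
    M = A[0]
    for i in range(len(A) - 1):
        Q, R = np.linalg.qr(M, mode='complete')  # M = Q . inc . R_top
        nred = min(M.shape)
        Qs.append(Q)
        V.append(R[:nred, :])                    # reduced merged weights
        # Transform the next layer: A_{i+1} . diag(1, Q . inc).
        M = np.hstack([A[i + 1][:, :1], A[i + 1][:, 1:] @ Q[:, :nred]])
    V.append(M)
    return Qs, V

rng = np.random.default_rng(0)
n = [2, 8, 9, 2]
A = [rng.normal(size=(n[i + 1], 1 + n[i])) for i in range(3)]
Qs, V = qr_compress(A)
x = rng.normal(size=2)
# Lossless: the compressed network has the same feedforward function.
assert np.allclose(feedforward(A, x), feedforward(V, x))
```

The reduced widths here are (2, 3, 4, 2), so the compressed merged matrices have shapes (3, 3), (4, 4), and (2, 5).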

D.2.4 GRADIENT DESCENT DEFINITIONS

As in Section 6, we fix:
• a widths vector n = (n_0, n_1, ..., n_L);
• a tuple ρ = (ρ_1, ..., ρ_L) of radial rescaling activations, where ρ_i : R^{n_i} → R^{n_i} for i = 1, ..., L;
• a batch of training data {(x_j, y_j)} ⊆ R^{n_0} × R^{n_L} = R^{n^red_0} × R^{n^red_L};
• a cost function C : R^{n_L} × R^{n_L} → R.
As a result, we have a loss function on Param(n):

L : Param(n) → R,  L(A) = Σ_j C(F_{(A,ρ)}(x_j), y_j),

where F_{(A,ρ)} is the feedforward function of the radial neural network with (merged) parameters A and activations ρ. We emphasize that the loss function L depends on the batch of training data chosen above; however, for clarity, we omit extra notation indicating this dependency, since the batch of training data is fixed throughout this discussion. Similarly, we have the reduced widths vector n^red = (n^red_0, n^red_1, ..., n^red_L). Let Q, V = QR-compress(A) be the outputs of Algorithm 2 (which is equivalent to Algorithm 1), so that V = (W^red, b^red) ∈ Param(n^red) are the parameters of the compressed model corresponding to the full model with merged parameters A = (W, b), and Q ∈ O(n_hid) is an orthogonal change-of-basis symmetry of the parameter space. Moreover, set T = Q^{-1} • A ∈ Param^int(n), where we use the change-of-basis action from Appendix D.2 and Proposition 23. We have the following rephrasing of Theorem 8.

Theorem 24 (Theorem 8). Let A ∈ Param(n), and let V, Q, T be as above. For any k ≥ 0:
1. γ^k(A) = Q • γ^k(T);
2. γ^k_proj(T) = γ^k_red(V) + T - V. More precisely, the second equality is γ^k_proj(T) = ι(γ^k_red(V)) + T - ι(V),

(Diagram: the spaces Param(n^red), Param^int(n), and Param(n), containing V, T, and W = Q • T respectively; horizontal arrows are +T - V and the action of Q, and vertical arrows are gradient descent on Param(n^red), projected gradient descent on Param(n), and gradient descent on Param(n).)

D.5 PROOF OF THEOREM 8

We begin by explaining the sense in which Param^int(n) interpolates between Param(n) and Param(n^red). One extends Diagram D.5 as follows:

Param(n^red) ⇄ Param^int(n) ⇄ Param(n),

with maps ι_2, q_2 on the left and ι_1, q_1 on the right.

• The map ι_2 : Param(n^red) → Param^int(n) takes B = (B_i) ∈ Param(n^red) and pads each matrix with n_i - n^red_i rows of zeros on the bottom and n_{i-1} - n^red_{i-1} columns of zeros on the right:

B = (B_i)_{i=1}^L ↦ ι_2(B) = ( [ B_i 0 ; 0 0 ] )_{i=1}^L.

It is straightforward to check that ι_2 is a well-defined injective linear map.

• The map q_2 : Param^int(n) → Param(n^red) extracts from T the top-left n^red_i × (1 + n^red_{i-1}) block:

T = ( [ T_i^(1) T_i^(2) ; 0 T_i^(4) ] )_{i=1}^L ↦ q_2(T) = (T_i^(1))_{i=1}^L.

It is straightforward to check that q_2 is a surjective linear map. The transpose of q_2 is the inclusion ι_2.

Lemma 25. We have the following:
1. The inclusion ι : Param(n^red) → Param(n) coincides with the composition ι_1 • ι_2, and commutes with the loss functions: L • ι = L_red.
2. The following diagram commutes: L • ι_1 = L_red • q_2 as functions on Param^int(n).
3. For any T ∈ Param^int(n), we have q_1(∇_{ι_1(T)} L) = ι_2(∇_{q_2(T)} L_red).

Proof. We have the following standard inclusions into the first coordinates, and projections onto the first coordinates, for i = 0, 1, ..., L:

inc_i = inc_{n^red_i, n_i} : R^{n^red_i} → R^{n_i},   inc^+_i = inc_{1+n^red_i, 1+n_i} : R^{1+n^red_i} → R^{1+n_i},
π_i : R^{n_i} → R^{n^red_i},   π^+_i : R^{1+n_i} → R^{1+n^red_i}.

Observe that Param^int(n) is the subspace of Param(n) consisting of those T = (T_1, ..., T_L) ∈ Param(n) such that:

(id_{n_i} - inc_i • π_i) • T_i • inc^+_{i-1} • π^+_{i-1} = 0 for i = 1, ..., L.

By the definition of radial rescaling functions, for each i = 1, . . .
, L, there is a piecewise differentiable function h_i : R → R such that ρ_i = h_i^{(n_i)}. Note that ρ^red_i = h_i^{(n^red_i)}, and h^{(n_i)} • inc_i = inc_i • h^{(n^red_i)}. The identity ι = ι_1 • ι_2 follows directly from the definitions. To prove the commutativity of the first diagram, it is enough to show that, for any X in Param(n^red), the feedforward functions of X and ι(X) coincide. This follows easily from the fact that, for i = 1, ..., L, we have:

π_i • h^{(n_i)} • inc_i = π_i • inc_i • h^{(n^red_i)} = h^{(n^red_i)}.

For the second claim, let T ∈ Param^int(n). It suffices to show that ι_1(T) and q_2(T) have the same feedforward function. Recall the ext_i maps and the formulation of the feedforward function in the merged notation given in Equation D.3. Writing inc^+_{i-1} for the inclusion R^{1+n^red_{i-1}} → R^{1+n_{i-1}}, the key computation is:

inc_i • h^{(n^red_i)} • π_i • T_i • ext_{i-1} • inc_{i-1} = h^{(n_i)} • inc_i • π_i • T_i • inc^+_{i-1} • ext_{i-1} = h^{(n_i)} • T_i • inc^+_{i-1} • ext_{i-1} = h^{(n_i)} • T_i • ext_{i-1} • inc_{i-1},

We identify a choice of parameters (in the merged notation) with the point p = (a, b, c, d, e, f, g, h, i, j) in R^10. To be even more explicit, the weights for the first layer are W_1 = (b, d, f)^T, the bias in the first hidden layer is b_1 = (a, c, e), the weights for the second layer are W_2 = [h i j], and the bias for the output layer is b_2 = g. The action of the orthogonal group O(n) = O(3) on Param(n) ≃ R^10 can be expressed as:

Q ↦ diag(Q, Q, 1, Q),

where the rows and columns are divided according to the partition 3 + 3 + 1 + 3 = 10. Consider the function:

L : Param(n) → R,  p = (a, b, c, d, e, f, g, h, i, j) ↦ h(a + b) + i(c + d) + j(e + f) + g.

By the product rule, we have:

∇_p L = (h, h, i, i, j, j, 1, a + b, c + d, e + f).

One easily checks that L(Q • p) = L(p) and that ∇_{Q•p} L = Q • ∇_p L for any Q ∈ O(3). The interpolating space is the eight-dimensional subspace of Param(n) ≃ R^10 with e = f = 0 (using the notation of Equation D.6). Suppose p′ = (a, b, c, d, 0, 0, g, h, i, j) belongs to the interpolating space.
Then the gradient is ∇_{p′} L = (h, h, i, i, j, j, 1, a + b, c + d, 0), which does not belong to the interpolating space. So one step of usual gradient descent with learning rate η > 0 yields:

γ : p′ = (a, b, c, d, 0, 0, g, h, i, j) ↦ (a - ηh, b - ηh, c - ηi, d - ηi, -ηj, -ηj, g - η, h - η(a + b), i - η(c + d), j).

On the other hand, one step of projected gradient descent yields:

γ_proj : p′ = (a, b, c, d, 0, 0, g, h, i, j) ↦ (a - ηh, b - ηh, c - ηi, d - ηi, 0, 0, g - η, h - η(a + b), i - η(c + d), j).

The two updates differ only in the coordinates e and f, and hence only in the term j(e + f) of the loss. Direct computation shows that the difference between the evaluation of L after one step of projected gradient descent and after one step of usual gradient descent is:

L(γ_proj(p′)) - L(γ(p′)) = 2ηj².
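This example is small enough to check numerically. The sketch below (our own illustration, with arbitrary parameter values) takes one step of each update and verifies that the two resulting losses differ by exactly 2ηj², with projected gradient descent, which freezes e and f at zero, ending at the larger loss:

```python
import numpy as np

def loss(p):
    # L(p) = h(a+b) + i(c+d) + j(e+f) + g in the coordinates of the example.
    a, b, c, d, e, f, g, h, i, j = p
    return h * (a + b) + i * (c + d) + j * (e + f) + g

def grad(p):
    # Gradient computed by the product rule, matching the text.
    a, b, c, d, e, f, g, h, i, j = p
    return np.array([h, h, i, i, j, j, 1.0, a + b, c + d, e + f])

eta = 0.1
# A point of the interpolating space: e = f = 0.
p = np.array([0.3, -0.7, 1.2, 0.5, 0.0, 0.0, 0.4, -1.1, 0.9, 2.0])

step = p - eta * grad(p)     # usual gradient descent
step_proj = step.copy()
step_proj[4:6] = 0.0         # projected: keep e, f at zero

assert np.isclose(loss(step_proj) - loss(step), 2 * eta * p[9] ** 2)
```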

E EXPERIMENTS

As mentioned in Section 7, we provide an implementation of Algorithm 1 in order to (1) empirically validate that our implementation satisfies the claims of Theorems 6 and 8, and (2) quantify real-world performance. Our implementation uses a generalization of radial neural networks, which we explain presently.

E.1 RADIAL NEURAL NETWORKS WITH SHIFTS

In this section, we consider radial neural networks with an extra trainable parameter in each layer that shifts the radial rescaling activation. Adding such parameters allows for more flexibility in the model, and (as shown in Theorem 26) the model compression of Theorem 6 holds for such networks. It is this generalization that we use in our experiments. Let h : R → R be a function. For any n ≥ 1 and any t ∈ R, the corresponding shifted radial rescaling function on R^n is given by:

ρ = h^{(n,t)} : v ↦ (h(|v| - t)/|v|) v if v ≠ 0, and ρ(0) = 0.

A radial neural network with shifts consists of the following data:
1. Hyperparameters: a positive integer L and a widths vector n = (n_0, n_1, n_2, ..., n_L).
2. Trainable parameters: (a) a choice of weights and biases (W, b) ∈ Param(n); (b) a vector of shifts t = (t_1, t_2, ..., t_L) ∈ R^L.
3. Activations: a tuple h = (h_1, ..., h_L) of piecewise differentiable functions R → R. Together with the shifts, we have the shifted radial rescaling activation ρ_i = h_i^{(n_i, t_i)} : R^{n_i} → R^{n_i} in each layer.
The feedforward function of a radial neural network with shifts is defined in the usual recursive way, as in Section 3. The trainable parameters form the vector space Param(n) × R^L, and the loss function for a batch of training data {(x_j, y_j)} ⊂ R^{n_0} × R^{n_L} is defined as:

L : Param(n) × R^L → R,  ((W, b), t) ↦ Σ_j C(F_{(W,b,t,h)}(x_j), y_j),

where F_{(W,b,t,h)} is the feedforward function of the radial neural network with weights W, biases b, shifts t, and radial rescaling activations produced from h. We have the corresponding gradient descent map. We now state a generalization of Theorem 6 for the case of radial neural networks with shifts. We omit a proof, as it uses the same techniques as the proof of Theorem 6. Theorem 26. Let (W, b, t, h) be a radial neural network with shifts and widths vector n. Let W^red and b^red be the weights and biases of the compressed network produced by Algorithm 1.
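The shifted radial rescaling function is straightforward to implement. The sketch below (our own illustration, with a sigmoid standing in for h) checks that it rescales a vector to norm h(|v| - t) while preserving its direction:

```python
import numpy as np

def shifted_radial(v, h, t):
    # Shifted radial rescaling: v -> h(|v| - t) * v / |v|, with rho(0) = 0.
    r = np.linalg.norm(v)
    if r == 0:
        return np.zeros_like(v)
    return (h(r - t) / r) * v

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

v = np.array([3.0, 4.0])                # |v| = 5
w = shifted_radial(v, sigmoid, t=2.0)   # output norm is sigmoid(5 - 2)
assert np.isclose(np.linalg.norm(w), sigmoid(3.0))
```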
The feedforward function of the original network (W, b, t, h) coincides with that of the compressed network (W^red, b^red, t, h). Theorem 8 also generalizes to the setting of radial neural networks with shifts, using projected gradient descent with respect to the subspace Param^int(n) × R^L of Param(n) × R^L.

E.2 IMPLEMENTATION DETAILS

Our implementation is written in Python and uses the QR decomposition routine in NumPy (Harris et al., 2020). We also implement a general class RadNet for radial neural networks using PyTorch (Paszke et al., 2019). For brevity, we write Ŵ for (W, b) and Ŵ^red for (W^red, b^red).

(1) Empirical verification of Theorem 6. We use synthetic data to learn the function f(x) = e^{-x²} with N = 121 samples x_j = -3 + j/20 for 0 ≤ j < 121. We model f_Ŵ as a radial neural network with widths n = (1, 6, 7, 1) and activation the shifted radial sigmoid h(x) = 1/(1 + e^{-x+s}). Applying QR-compress gives a radial neural network f_{Ŵ^red} with widths n^red = (1, 2, 3, 1). Theorem 6 implies that the neural functions of f_Ŵ and f_{Ŵ^red} are equal. Over 10 random initializations of Ŵ, the mean absolute error (1/N) Σ_j |f_Ŵ(x_j) - f_{Ŵ^red}(x_j)| is 1.31 × 10⁻⁸ ± 4.45 × 10⁻⁹. Thus f_Ŵ and f_{Ŵ^red} agree up to machine precision.

(2) Empirical verification of Theorem 8. Adopting the notation from above, the claim is that training f_{Q^{-1} • Ŵ} with objective L by projected gradient descent coincides with training f_{Ŵ^red} with objective L_red by usual gradient descent. We verified this on synthetic data using 3000 epochs at learning rate 0.01. Over 10 random initializations of Ŵ, the loss functions match up to machine precision, with |L - L_red| = 4.02 × 10⁻⁹ ± 7.01 × 10⁻⁹.

(3) Reduced model trains faster.
Due to the relation between projected gradient descent on the full network Ŵ and gradient descent on the reduced network Ŵ^red, our method may be applied before training to produce a smaller model class which trains faster without sacrificing accuracy. We test this hypothesis by learning the function f : R² → R² sending x = (t_1, t_2) to (e^{-t_1²}, e^{-t_2²}), using N = 121² samples (-3 + j/20, -3 + k/20) for 0 ≤ j, k < 121. We model f_Ŵ as a radial neural network with layer widths n = (2, 16, 64, 128, 16, 2) and activation the radial sigmoid h(r) = 1/(1 + e^{-r}). Applying QR-compress gives a radial neural network f_{Ŵ^red} with widths n^red = (2, 3, 4, 5, 6, 2). We trained both models until the training loss was ≤ 0.01. Running on a system with an Intel i5-8257U @ 1.40 GHz and 8 GB of RAM, averaged over 10 random initializations, the reduced network trained in 15.32 ± 2.53 seconds and the original network trained in 31.24 ± 4.55 seconds.

(4) Comparison with a ReLU MLP on noisy image recovery. We show that a Step-ReLU radial network performs better than an otherwise comparable network with pointwise ReLU activations on a noisy image recovery task. Using samples of MNIST with significant added noise, the classification task is to identify from which original sample each noisy sample derives. Specifically, we choose n samples from MNIST, all with the same MNIST label, and produce m noisy samples from each by adding noise. The noise is added by considering each sample as a point in R^784 and adding uniform random noise in a ball around it. The radius of the ball around a given point is the product of the noise level (noise_scale, which is the same for all points) and the minimal distance to another sample point (which varies from point to point). As indicated in Figure 5, when noise_scale = 3 the classification task is difficult for the human eye.
Our data takes n = 3 original MNIST images with the same label and produces m = 100 noisy images for each, with noise_scale = 3. We perform a 240 train / 60 test split of the 300 data points. Both models have three layers with widths (d, d + 1, d + 2, n = 3), where d = 28² = 784; hence, both models have 620,158 trainable parameters. Over 10 trials, each training for 150 epochs at learning rate 0.05 for both models, the radial network achieves training loss 0.00256 ± 3.074 × 10⁻⁴ with accuracy 1 ± 0, while the ReLU MLP has training loss 0.295 ± 2.259 × 10⁻¹ with accuracy 0.768 ± 2.199 × 10⁻¹. On the test set, the radial network has loss 0.00266 ± 3.749 × 10⁻⁴ with accuracy 1 ± 0, while the ReLU MLP has loss 0.305 ± 2.588 × 10⁻¹ with accuracy 0.757 ± 2.464 × 10⁻¹. The convergence rates are illustrated in Figure 5, with the radial network outperforming the ReLU MLP. We note that 150 epochs is sufficient for all methods to converge, although the ReLU MLP does not always converge to zero loss. We observe that the radial network (1) obtains a better fit, (2) converges faster, and (3) generalizes better than the pointwise ReLU MLP. We hypothesize that the radial nature of the random noise makes radial networks well-adapted to the task.
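The ball-noise data generation described above can be sketched as follows (a reconstruction of the procedure, not the authors' code; the helper `noisy_copies` and the uniform-in-ball sampling scheme are our own assumptions):

```python
import numpy as np

def noisy_copies(points, m, noise_scale, rng):
    # For each base point, the ball radius is noise_scale times the
    # distance to the nearest *other* base point.
    n, d = points.shape
    dists = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
    np.fill_diagonal(dists, np.inf)
    radii = noise_scale * dists.min(axis=1)
    samples, labels = [], []
    for k in range(n):
        # Uniform sampling in a d-ball: a random direction times a radius
        # drawn with density proportional to r^(d-1).
        dirs = rng.normal(size=(m, d))
        dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
        r = radii[k] * rng.uniform(size=(m, 1)) ** (1.0 / d)
        samples.append(points[k] + r * dirs)
        labels.extend([k] * m)
    return np.vstack(samples), np.array(labels)

rng = np.random.default_rng(0)
base = rng.normal(size=(3, 10))   # 3 base points in R^10, for illustration
X, y = noisy_copies(base, m=5, noise_scale=0.5, rng=rng)
```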

F RELATION TO RADIAL BASIS FUNCTION NETWORKS

In this appendix, we show that radial neural networks are equivalent to a particular class of multilayer radial basis function networks. This class is obtained by imposing the condition that the so-called 'hidden dimension' at each layer is equal to one; the total number of layers, however, is unconstrained. To our knowledge, the literature contains no universal approximation result for this class of radial basis function networks.

F.1 SINGLE LAYER CASE

We first recall the definition of a radial basis function network. A local linear model extension of a radial basis function network (henceforth abbreviated simply as RBFN) consists of:
• an input dimension n, an output dimension m, and a 'hidden' dimension N;
• for i = 1, ..., N, a matrix W_i ∈ R^{m×n}, a vector b_i ∈ R^n, and a weight a_i ∈ R^m;
• a nonlinear function λ : R → R.

Let F_ℓ be the partial feedforward functions for this RBFN, defined recursively as above. We claim that F_ℓ(x) = W_ℓ • G_{ℓ-1}(x) for any x ∈ R^{n_0} and ℓ = 1, ..., L. We prove this by induction. The base case is ℓ = 1:

F_1(x) = W_1 • ρ_0(F_0(x) + b_0) = W_1 x = W_1 • G_0(x).

For the induction step, take ℓ > 1 and compute:

F_ℓ(x) = W_ℓ • ρ_{ℓ-1}(F_{ℓ-1}(x) + b_{ℓ-1}) = W_ℓ • ρ_{ℓ-1}(W_{ℓ-1} G_{ℓ-2}(x) + b_{ℓ-1}) = W_ℓ • G_{ℓ-1}(x).

The first claim now follows from the case ℓ = L, using the fact that W_{L+1} is the identity. For the second statement, let (W, b, ρ) be a constrained multilayer RBFN with L layers and widths vector (n_0, ..., n_L). Consider the radial neural network with L + 1 layers and the following:
• widths vector (n_0, n_0, n_1, ..., n_{L-1}, n_L); the first two layers have the same dimension;
• weight matrices given by W̃_1 = id_{n_0} and W̃_ℓ = W_{ℓ-1} for ℓ = 2, ..., L + 1;
• bias vectors given by b̃_ℓ = b_{ℓ-1} for ℓ = 1, 2, ..., L, and b̃_{L+1} = 0;
• radial rescaling activations given by ρ̃_ℓ = ρ_{ℓ-1} for ℓ = 1, ..., L, and ρ̃_{L+1} = id_{n_L}.
One uses the recursive definition of the partial feedforward functions to show that, for ℓ = 1, ..., L, we have F_ℓ(x) = W_ℓ • G̃_ℓ(x), where F_ℓ and G̃_ℓ are the partial feedforward functions of the RBFN and the radial neural network, respectively. Then:

G̃_{L+1}(x) = ρ̃_{L+1}(W̃_{L+1} • G̃_L(x) + b̃_{L+1}) = W_L • G̃_L(x) = F_L(x),

so the two feedforward functions coincide.

F.4 CONCLUSIONS

While radial neural networks are equivalent to a certain class of radial basis function networks, we point out differences between our results and the standard theory of radial basis function networks. First, RBFNs generally have only two layers; we consider ones with unbounded depth. Second, to our knowledge, ours is the first universal approximation result such that:
• it uses networks in the subclass of multilayer RBFNs satisfying the constraint that the number of 'hidden neurons' in each layer is equal to 1;
• it approximates functions with networks of bounded width;
• it can be used to approximate asymptotically affine functions, rather than only functions defined on a compact domain.
Our compressibility result may apply to multilayer RBFNs where the number of 'hidden neurons' N_ℓ at each layer is not equal to 1, but we expect the compression to be weaker, and that constrained multilayer RBFNs are in some sense the most compressible type of RBFN.



Footnotes.
(1) See Armenta and Jodoin, The Representation Theory of Neural Networks, arXiv:2007.12213; Dinh, Pascanu, Bengio, and Bengio, Sharp Minima Can Generalize for Deep Nets, ICML 2017; Meng, Zheng, Zhang, Chen, Ye, Ma, Yu, and Liu, G-SGD: Optimizing ReLU Neural Networks in its Positively Scale-Invariant Space, 2019; and Neyshabur, Salakhutdinov, and Srebro, Path-SGD: Path-Normalized Optimization in Deep Neural Networks, NIPS 2015.
(2) Following usual conventions, we regard column vectors as elements of V and row vectors as elements of the dual vector space V*. The differential dL_v of L at v ∈ V is also known as the Jacobian of L at v ∈ V.
(3) In this general formulation, ρ_i can be any piecewise differentiable function; for most of the rest of the paper we will be interested in the case where ρ_i is a radial rescaling function.
(4) For A ∈ Param(n), the neural function of the neural network with affine maps determined by A and identity activation functions is R → R; x ↦ L(W)x. The function L can appear as a loss function for certain batches of training data and cost function on R.
(5) A more general version allows for a different nonlinear function for every i = 1, ..., N.



Figure 1: (Left) Pointwise activations distinguish a specific basis of each hidden layer and treat each coordinate independently, see equation 1.1. (Right) Radial rescaling activations rescale each feature vector by a function of the norm, see equation 1.2.

Figure 2: Examples of different radial rescaling functions on R^1; see Example 1.

Figure 3: Two layers of the radial neural network used in the proof of Theorem 3. (Left) The compact set K is covered with open balls. (Middle) Points close to c_2 (green ball) are mapped to c_2 + e_2; all other points are kept the same. (Right) In the final layer, c_2 + e_2 is mapped to f(c_2).

For a tuple ρ = (ρ_i : R^{n_i} → R^{n_i})_{i=1}^L of radial rescaling functions, we write ρ^red = (ρ^red_i : R^{n^red_i} → R^{n^red_i})_{i=1}^L for the tuple of restrictions.

Figure 4: Model compression in 3 steps. Each layer's width can be iteratively reduced to at most one greater than the previous reduced width. The number of trainable parameters reduces from 33 to 17.

Gradient descent on Param(n) with initial values (W, b) is equivalent to gradient descent on Param(n^red) with initial values (W^red, b^red), since at any stage we can move from one to the other by ±U. Neither Q nor U depends on k.

Figure 5: (Left) Different levels of noise. (Right) Training five Step-ReLU radial networks and five ReLU MLPs on data with n = 3 original images and m = 100 noisy copies of each.

Let F = F_{(W,b,ρ)} denote the feedforward function of the radial neural network with parameters (W, b) and activations ρ. Similarly, let F_red = F_{(W^red, b^red, ρ^red)} denote the feedforward function of the radial neural network with parameters (W^red, b^red) and activations ρ^red.

Figure 6: Illustration of Lemma 22. If loss is invariant with respect to an orthogonal transformation Q of the parameter space, then optimization of the network by gradient descent is also invariant with respect to Q. (Note: in this example, projected and usual gradient descent match; this is not the case in higher dimensions, as explained in D.6.)

where ι : Param(n^red) → Param(n) is the inclusion into the top-left corner in each coordinate. Also, in the statement of Theorem 8, we have U = T - V. We summarize this result in the following diagram. The left horizontal maps indicate the addition of U = T - V, the right horizontal arrows indicate the action of Q, and the vertical maps are various versions of gradient descent. The shaded regions indicate the (smallest) vector space to which the various representations naturally belong.

which updates the entries of W, b, and t. The group O(n_hid) = O(n_1) × ··· × O(n_{L-1}) acts on Param(n) as usual (see Section 5.1), and trivially on R^L. The neural function is unchanged by this action. We conclude that the O(n_hid) action on Param(n) × R^L commutes with the gradient descent map γ.

A ORGANIZATION OF THE APPENDICES

This paper is a contribution to the mathematical foundations of machine learning, and our results are motivated by expanding the applicability and performance of neural networks. At the same time, we give precise mathematical formulations of our results and proofs. The purposes of these appendices are several:
1. To clarify the mathematical conventions and terminology, thus making the paper more accessible.
2. To provide full proofs of the main results.
3. To develop context around various constructions appearing in the main text.
4. To discuss in detail examples, special cases, and generalizations of our results.
5. To specify implementation details for the experiments.
We now give a summary of the contents of the appendices. Appendix B contains proofs of the universal approximation results (Theorems 3 and 5) stated in Section 4 of the main text, as well as proofs of additional bounded-width results. The proofs use notation given in Appendix B.1, and rely on preliminary topological considerations given in Appendix B.2. In Appendix C, we give a proof of the model compression result of Theorem 6, which appears in Section 5. For clarity and background, we begin that appendix with a discussion of the version of the QR decomposition relevant for our purposes (Appendix C.1). We also establish elementary properties of radial rescaling activations (Appendix C.2). The focus of Appendix D is projected gradient descent, elaborating on Section 6. We first prove a result on the interaction of gradient descent and orthogonal transformations (Appendix D.1), before formulating projected gradient descent in more detail (Appendix D.2) and introducing the so-called interpolating space (Appendix D.3). We restate Theorem 8 in more convenient notation (Appendix D.4) before proceeding to the proof (Appendix D.5). Appendix E contains implementation details for the experiments summarized in Section 7.
Several of our implementations use shifted radial rescaling activations, which we formulate in Appendix E.1. Appendix F explains the connection between our constructions and radial basis function networks. While radial neural networks turn out to be a specific type of radial basis function network, our universality results are not implied by those for general radial basis function networks.

B UNIVERSAL APPROXIMATION PROOFS AND ADDITIONAL RESULTS

In this section, we provide full proofs of the universal approximation (UA) results for radial neural networks, as stated in Section 4. In order to do so, we first clarify our notational conventions (Appendix B.1), and collect basic topological results (Appendix B.2).

B.1 NOTATION

Recall that, for a point c in Euclidean space R^n and a positive real number r, we denote the r-ball around c by B_r(c) = {x ∈ R^n : |x - c| < r}. All networks in this section have the Step-ReLU radial rescaling activation function, defined as ρ(v) = v if |v| ≥ 1 and ρ(v) = 0 otherwise. The symbol • denotes the composition of functions. We identify a linear map with its matrix (in the standard bases); in the case of linear maps, the operation • can be identified with matrix multiplication. Recall also that an affine map L : R^n → R^m is one of the form L(x) = Ax + b for a matrix A ∈ R^{m×n} and a vector b ∈ R^m.

We also fix the restrictions ρ^red = (ρ^red_1, ..., ρ^red_L), where ρ^red_i : R^{n^red_i} → R^{n^red_i}. Using the fact that n^red_0 = n_0 and n^red_L = n_L, there is a loss function on Param(n^red):

L_red : Param(n^red) → R,  L_red(B) = Σ_j C(F_{(B,ρ^red)}(x_j), y_j),

where F_{(B,ρ^red)} is the feedforward function of the radial neural network with parameters B ∈ Param(n^red) and activations ρ^red. (Again, technically speaking, the loss function L_red depends on the batch of training data fixed above.) For any learning rate η > 0, we obtain gradient descent maps γ on Param(n) and γ_red on Param(n^red).

In this section, we introduce a subspace Param^int(n) of Param(n) that, as we will later see, interpolates between Param(n) and Param(n^red). It consists of those tuples T = (T_i) such that the bottom-left (n_i - n^red_i) × (1 + n^red_{i-1}) block of T_i is zero for each i. Schematically, the rows of T_i are divided as n^red_i on top and n_i - n^red_i on the bottom, while the columns are divided as 1 + n^red_{i-1} on the left and n_{i-1} - n^red_{i-1} on the right. Let ι_1 : Param^int(n) → Param(n) be the inclusion. The following proposition follows from an elementary analysis of the workings of Algorithm 2 (or, equivalently, Algorithm 1). Proposition 23. Let A ∈ Param(n) and let Q ∈ O(n_hid) be the tuple of orthogonal matrices produced by Algorithm 2. Then Q^{-1} • A belongs to Param^int(n).

Define a map q_1 : Param(n) → Param^int(n)

by taking A ∈ Param(n) and zeroing out the bottom-left (n_i - n^red_i) × (1 + n^red_{i-1}) block of A_i for each i. It is straightforward to check that q_1 is a well-defined, surjective linear map. The transpose of q_1 is the inclusion ι_1. We summarize the situation in the following diagram, and observe that the composition q_1 • ι_1 is the identity on Param^int(n).

D.4 PROJECTED GRADIENT DESCENT AND MODEL COMPRESSION

Recall from Section 6 that the projected gradient descent map on Param(n) is given by:

γ_proj(A) = A - η · Proj(∇_A L),

where A = (W, b) are the merged parameters (Appendix D.2) and, in the notation of the previous section, the map Proj is ι_1 • q_1. To reiterate: while all entries of each weight matrix and each bias vector contribute to the computation of the gradient ∇_A L = ∇_{(W,b)} L, only those not in the bottom-left submatrix get updated under the projected gradient descent map γ_proj.

The key computation in the proof of Lemma 25 also uses the fact that ext_i • inc_i = inc^+_i • ext_i, where inc^+_i denotes the inclusion R^{1+n^red_i} → R^{1+n_i}. Applying this relation successively, starting with the second-to-last layer (i = L - 1) and ending with the first (i = 1), one obtains the result. For the last claim, one computes ∇_T(L • ι_1) in two different ways. The first way uses the fact that ι_1 is a linear map whose transpose is q_1:

∇_T(L • ι_1) = q_1(∇_{ι_1(T)} L).

The second way uses the commutative diagram of the second part of the Lemma, together with the fact that q_2 is a linear map whose transpose is ι_2:

∇_T(L • ι_1) = ∇_T(L_red • q_2) = ι_2(∇_{q_2(T)} L_red).

Proof of Theorem 8. As above, let Q, V = QR-compress(A) be the outputs of Algorithm 1, so that V = (W^red, b^red) ∈ Param(n^red) is the dimensional reduction of the merged parameters A. The action of Q ∈ O(n_hid) on Param(n) is an orthogonal transformation, so the first claim follows from Lemma 22. For the second claim, it suffices to consider the case η = 1; the general case follows similarly. We proceed by induction on k. The base case k = 0 amounts to Theorem 6. For the induction step, set Z^(k) = γ^k_proj(T); then q_2(Z^(k)) = γ^k_red(V). A chain of equalities then establishes the claim for k + 1: the second equality uses the induction hypothesis; the third invokes the definition of γ_proj; the fourth uses the fact that Z^(k) belongs to Param^int(n); the fifth and sixth use Lemma 25 above; and the last uses the definition of γ_red.

D.6 EXAMPLE

We now discuss an example where projected gradient descent does not match usual gradient descent. Let n = (1, 3, 1) be a widths vector.
The space of parameters with this widths vector is 10-dimensional:

Param(n) ≃ R^{3×1} × R^{1×3} × (R^3 × R) ≃ R^10.

We identify a choice of parameters (in the merged notation) with a point of R^10.

The feedforward function of an RBFN is defined as:

x ↦ Σ_{i=1}^N λ(|x + b_i|) (W_i(x + b_i) + a_i).

The integer N is commonly referred to as 'the hidden number of neurons'. This is a bit of a misnomer: really there is only one layer, with input dimension n and output dimension m; the integer N is part of the specification of the activation function. We observe that if N = 1 and a_1 = 0, then the feedforward function is given by:

x ↦ W • ρ(x + b),

where ρ is the radial rescaling function determined by λ. In words, one adds b_1 = b ∈ R^n to the input vector x, applies the activation ρ to obtain a new vector in R^n, and then applies the linear transformation determined by the matrix W_1 = W to obtain the output vector in R^m. Motivated by this observation, we say that an RBFN is constrained if N = 1 and a_1 = 0.

F.2 CONSTRAINED MULTILAYER CASE

Next, we consider the constrained multilayer case of a radial basis function network. Specifically, a constrained multilayer RBFN consists of:
• a widths vector (n_0, ..., n_L), where L is the number of layers;
• a matrix W_ℓ ∈ R^{n_ℓ×n_{ℓ-1}} for ℓ = 1, ..., L;
• a vector b_ℓ ∈ R^{n_ℓ} for ℓ = 0, 1, ..., L - 1;
• a nonlinear function λ_ℓ : R → R for ℓ = 0, 1, ..., L - 1 (equivalently, the corresponding radial rescaling function ρ_ℓ : R^{n_ℓ} → R^{n_ℓ} for ℓ = 0, ..., L - 1).
The feedforward function is defined as follows. For ℓ = 0, ..., L, we recursively define F_ℓ : R^{n_0} → R^{n_ℓ} by setting F_0(x) = x and

F_ℓ(x) = W_ℓ • ρ_{ℓ-1}(F_{ℓ-1}(x) + b_{ℓ-1})

for ℓ = 1, ..., L. The feedforward function is F_L.

F.3 RELATION TO RADIAL NEURAL NETWORKS

We now demonstrate that radial neural networks are equivalent to constrained multilayer RBFNs.

Proposition 27. For any radial neural network, there is a constrained multilayer RBFN with the same feedforward function. Conversely, for any constrained multilayer RBFN, there is a radial neural network with the same feedforward function.

Proof. For the first statement, let (W, b, ρ) be a radial neural network with L layers and widths vector (n_0, ..., n_L). Recall the partial feedforward functions G_ℓ : R^{n_0} → R^{n_ℓ}, defined recursively by setting G_0(x) = x and G_ℓ(x) = ρ_ℓ(W_ℓ G_{ℓ-1}(x) + b_ℓ). The feedforward function is G_L. Consider the constrained multilayer RBFN with L + 1 layers and the following:
• widths vector (n_0, n_1, ..., n_{L-1}, n_L, n_L); the last two layers have the same dimension;
• weight matrices W_ℓ ∈ R^{n_ℓ×n_{ℓ-1}} for ℓ = 1, ..., L, and W_{L+1} = id_{n_L} ∈ R^{n_L×n_L};
• bias vectors b_ℓ ∈ R^{n_ℓ} for ℓ = 1, ..., L, and b_0 = 0 ∈ R^{n_0};
• radial rescaling activations ρ_ℓ : R^{n_ℓ} → R^{n_ℓ} for ℓ = 1, ..., L, and ρ_0 = id_{n_0}.
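The construction above can be checked numerically. The sketch below (our own illustration, with λ(r) = tanh(r)/r as an arbitrary radial function) builds the (L+1)-layer constrained RBFN from a small radial network and verifies that the two feedforward functions agree:

```python
import numpy as np

def radial(v):
    # Radial rescaling rho(v) = lambda(|v|) v with lambda(r) = tanh(r)/r.
    r = np.linalg.norm(v)
    return (np.tanh(r) / r) * v if r > 0 else v

def radial_net(Ws, bs, x):
    # Radial network partials: G_l = rho(W_l G_{l-1} + b_l).
    h = x
    for W, b in zip(Ws, bs):
        h = radial(W @ h + b)
    return h

def constrained_rbfn(Ws, bs, acts, x):
    # Constrained multilayer RBFN: F_l = W_l rho_{l-1}(F_{l-1} + b_{l-1}).
    h = x
    for W, b, rho in zip(Ws, bs, acts):
        h = W @ rho(h + b)
    return h

rng = np.random.default_rng(3)
n = [2, 4, 3]
Ws = [rng.normal(size=(n[i + 1], n[i])) for i in range(2)]
bs = [rng.normal(size=n[i + 1]) for i in range(2)]
x = rng.normal(size=2)

# Build the (L+1)-layer RBFN from the L-layer radial net, as in the proof:
ident = lambda v: v
W_r = Ws + [np.eye(n[-1])]            # W_{L+1} = identity
b_r = [np.zeros(n[0])] + bs           # b_0 = 0
acts = [ident] + [radial] * 2         # rho_0 = identity

assert np.allclose(radial_net(Ws, bs, x), constrained_rbfn(W_r, b_r, acts, x))
```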

