UNIVERSAL APPROXIMATION AND MODEL COMPRESSION FOR RADIAL NEURAL NETWORKS

Abstract

We introduce a class of fully-connected neural networks whose activation functions, rather than being pointwise, rescale feature vectors by a function depending only on their norm. We call such networks radial neural networks, extending previous work on rotation equivariant networks that considers rescaling activations in less generality. We prove universal approximation theorems for radial neural networks, including in the more difficult cases of bounded widths and unbounded domains. Our proof techniques are novel, distinct from those in the pointwise case. Additionally, radial neural networks exhibit a rich group of orthogonal change-of-basis symmetries on the vector space of trainable parameters. Factoring out these symmetries leads to a practical lossless model compression algorithm. Optimization of the compressed model by gradient descent is equivalent to projected gradient descent for the full model.

1. INTRODUCTION

Inspired by biological neural networks, the theory of artificial neural networks has largely focused on pointwise (or "local") nonlinear layers (Rosenblatt, 1958; Cybenko, 1989), in which the same function σ : R → R is applied to each coordinate independently:

    R^n → R^n,   v = (v_1, . . . , v_n) ↦ (σ(v_1), σ(v_2), . . . , σ(v_n)).      (1.1)

In networks with pointwise nonlinearities, the standard basis vectors in R^n can be interpreted as "neurons" and the nonlinearity as a "neuron activation." Research has generally focused on finding functions σ that lead to more stable training, are less sensitive to initialization, or are better adapted to certain applications (Ramachandran et al., 2017; Misra, 2019; Milletarí et al., 2018; Clevert et al., 2015; Klambauer et al., 2017). Many choices of σ have been considered, including sigmoid, ReLU, arctangent, ELU, Swish, and others.

However, by setting aside the biological metaphor, it is possible to consider a much broader class of nonlinearities, which are not necessarily pointwise but instead depend simultaneously on many coordinates. Freedom from the pointwise assumption allows one to design activations that yield expressive function classes with specific advantages. Additionally, certain choices of non-pointwise activations maximize symmetry in the parameter space of the network, leading to compressibility and other desirable properties.

In this paper, we introduce radial neural networks, which employ non-pointwise nonlinearities called radial rescaling activations. Such networks enjoy several provable properties, including high model compressibility, symmetry in optimization, and universal approximation. Radial rescaling activations rescale each vector by a scalar that depends only on the norm of the vector:

    ρ : R^n → R^n,   v ↦ λ(|v|)v,      (1.2)

where λ is a scalar-valued function of the norm.
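As a concrete illustration of equation 1.2, a radial rescaling activation can be sketched in a few lines of NumPy. The function names and the particular choice of λ below (a "shift the norm down by b, clipped at zero" rule) are our own illustrative assumptions, not taken from the paper; any scalar-valued function of the norm defines a valid λ.

```python
import numpy as np

def radial_rescale(v, lam):
    """Radial rescaling activation (equation 1.2): v -> lam(|v|) * v.

    v   : feature vector in R^n
    lam : scalar-valued function of the norm |v|
    """
    return lam(np.linalg.norm(v)) * v

def shifted_norm_lambda(r, b=1.0):
    # Hypothetical example of lam: rescale v so that its norm is reduced
    # by b, clipped at zero. (Our own choice, for illustration only.)
    return max(r - b, 0.0) / r if r > 0 else 0.0

v = np.array([3.0, 4.0])                 # |v| = 5
w = radial_rescale(v, shifted_norm_lambda)
# w points in the same direction as v, with norm |v| - 1 = 4
```

Note that, unlike a pointwise activation, every coordinate of `w` depends on every coordinate of `v` through the norm.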
Whereas in the pointwise setting only the linear layers mix information between different components of the latent features, for radial rescaling all coordinates of the activation output vector depend on all coordinates of the activation input vector. The inherent geometric symmetry of radial rescalings makes them particularly useful for designing equivariant neural networks (Weiler & Cesa, 2019; Sabour et al., 2017; Weiler et al., 2018a;b). We note that radial neural networks constitute a simple and previously unconsidered type of multilayer radial basis function network (Broomhead & Lowe, 1988), namely, one where the number of hidden activation neurons (often denoted N) in each layer is equal to one. Indeed, pre-composing equation 1.2 with a translation and post-composing with a linear map, one obtains a special case of the local linear model extension of a radial basis function network.

In our first set of main results, we prove that radial neural networks are in fact universal approximators. Specifically, we demonstrate that any asymptotically affine function can be approximated with a radial neural network, suggesting potentially good extrapolation behavior. Moreover, this approximation can be done with bounded width. Our approach to proving these results departs markedly from techniques used in the pointwise case. Additionally, our result is not implied by the universality property of radial basis function networks in general, and differs in significant ways, particularly in the bounded-width property and the approximation of asymptotically affine functions.

In our second set of main results, we exploit parameter space symmetries of radial neural networks to achieve model compression. Using the fact that radial rescaling activations commute with orthogonal transformations, we develop a practical algorithm to systematically factor out orthogonal symmetries via iterated QR decompositions.
This leads to another radial neural network with fewer neurons in each hidden layer. The resulting model compression algorithm is lossless: the compressed network and the original network have the same value of the loss function on any batch of training data. Furthermore, we prove that the loss of the compressed model after one step of gradient descent equals the loss of the original model after one step of projected gradient descent. As explained below, projected gradient descent involves zeroing out certain parameter values after each gradient descent step. Although training the original network may reach a lower loss in fewer epochs, in many cases the compressed network takes less time per epoch to train and is faster in reaching a local minimum. To summarize, our main contributions and headline results are:
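The orthogonal symmetry that underlies this compression can be checked directly in a few lines. The following is a toy sketch with our own layer sizes and our own choice of λ, not the paper's implementation: because ρ(Qv) = Qρ(v) for any orthogonal Q, conjugating the hidden layers by orthogonal matrices (W_i ↦ Q_i W_i Q_{i-1}^T) leaves the network function unchanged, and it is this redundancy in parameter space that the QR-based algorithm factors out.

```python
import numpy as np

rng = np.random.default_rng(0)

def radial_rescale(v, lam=lambda r: np.tanh(r) / r if r > 0 else 1.0):
    # lam here is an arbitrary smooth choice for illustration.
    return lam(np.linalg.norm(v)) * v

def forward(weights, x):
    # Toy radial network: alternate linear layers with radial rescaling.
    for W in weights[:-1]:
        x = radial_rescale(W @ x)
    return weights[-1] @ x

def random_orthogonal(n):
    Q, _ = np.linalg.qr(rng.normal(size=(n, n)))
    return Q

# Widths 3 -> 5 -> 5 -> 2 (arbitrary toy sizes).
weights = [rng.normal(size=(5, 3)), rng.normal(size=(5, 5)), rng.normal(size=(2, 5))]
x = rng.normal(size=3)

# Orthogonal change of basis on each hidden layer: W_i -> Q_i W_i Q_{i-1}^T.
Q1, Q2 = random_orthogonal(5), random_orthogonal(5)
transformed = [Q1 @ weights[0], Q2 @ weights[1] @ Q1.T, weights[2] @ Q2.T]

# Since rho(Qv) = Q rho(v), the two networks compute the same function.
assert np.allclose(forward(weights, x), forward(transformed, x))
```

A pointwise activation such as ReLU does not commute with general orthogonal matrices, which is why this symmetry group, and the resulting lossless compression, is specific to radial activations.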

2. RELATED WORK

Radial rescaling activations. As noted, radial rescaling activations are a special case of the activations used in radial basis function networks (Broomhead & Lowe, 1988). Radial rescaling functions have the symmetry property of preserving vector directions, and hence exhibit rotation



Figure 1: (Left) Pointwise activations distinguish a specific basis of each hidden layer and treat each coordinate independently, see equation 1.1. (Right) Radial rescaling activations rescale each feature vector by a function of the norm, see equation 1.2.

• Radial rescaling activations are an alternative to pointwise activations: we provide a formalization of radial neural networks, a new class of neural networks;
• Radial neural networks are universal approximators: results include (a) approximation of asymptotically affine functions and (b) bounded-width approximation;
• Radial neural networks are inherently compressible: we prove a lossless compression algorithm for such networks and a theorem providing the relationship between optimization of the original and compressed networks;
• Radial neural networks have practical advantages: we describe experiments verifying all theoretical results and showing that radial networks outperform pointwise networks on a noisy image recovery task.

