UNIVERSAL APPROXIMATION AND MODEL COMPRESSION FOR RADIAL NEURAL NETWORKS

Abstract

We introduce a class of fully-connected neural networks whose activation functions, rather than being pointwise, rescale feature vectors by a function depending only on their norm. We call such networks radial neural networks, extending previous work on rotation-equivariant networks that considers rescaling activations in less generality. We prove universal approximation theorems for radial neural networks, including in the more difficult cases of bounded widths and unbounded domains. Our proof techniques are novel and distinct from those in the pointwise case. Additionally, radial neural networks exhibit a rich group of orthogonal change-of-basis symmetries on the vector space of trainable parameters. Factoring out these symmetries leads to a practical lossless model compression algorithm. Optimization of the compressed model by gradient descent is equivalent to projected gradient descent for the full model.

1. INTRODUCTION

Inspired by biological neural networks, the theory of artificial neural networks has largely focused on pointwise (or "local") nonlinear layers (Rosenblatt, 1958; Cybenko, 1989), in which the same function σ : R → R is applied to each coordinate independently:

R^n → R^n,  v = (v_1, ..., v_n) ↦ (σ(v_1), σ(v_2), ..., σ(v_n)).  (1.1)

In networks with pointwise nonlinearities, the standard basis vectors in R^n can be interpreted as "neurons" and the nonlinearity as a "neuron activation." Research has generally focused on finding functions σ that lead to more stable training, are less sensitive to initialization, or are better adapted to certain applications (Ramachandran et al., 2017; Misra, 2019; Milletarí et al., 2018; Clevert et al., 2015; Klambauer et al., 2017). Many choices of σ have been considered, including sigmoid, ReLU, arctangent, ELU, Swish, and others.

However, setting aside the biological metaphor makes it possible to consider a much broader class of nonlinearities, which are not necessarily pointwise but instead depend simultaneously on many coordinates. Freedom from the pointwise assumption allows one to design activations that yield expressive function classes with specific advantages. Additionally, certain choices of non-pointwise activations maximize symmetry in the parameter space of the network, leading to compressibility and other desirable properties.

In this paper, we introduce radial neural networks, which employ non-pointwise nonlinearities called radial rescaling activations. Such networks enjoy several provable properties, including high model compressibility, symmetry in optimization, and universal approximation. A radial rescaling activation rescales each vector by a scalar that depends only on the norm of the vector:

ρ : R^n → R^n,  v ↦ λ(|v|) v,  (1.2)

where λ is a scalar-valued function of the norm.
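The contrast between (1.1) and (1.2) can be sketched in a few lines of NumPy. The choice λ = tanh below is purely illustrative (the paper does not prescribe a particular λ); note that the radial activation preserves the direction of v and rescales only its length, whereas the pointwise activation acts on each coordinate separately.

```python
import numpy as np

def pointwise_activation(v, sigma=np.tanh):
    # Eq. (1.1): the same scalar function sigma applied to each coordinate.
    return sigma(v)

def radial_rescale(v, lam=np.tanh):
    # Eq. (1.2): rho(v) = lambda(|v|) * v, where lam is a scalar-valued
    # function of the norm. tanh is an illustrative choice of lam.
    r = np.linalg.norm(v)
    return lam(r) * v

v = np.array([3.0, -4.0])   # |v| = 5
out = radial_rescale(v)     # tanh(5) * v: same direction, rescaled length
```

Every coordinate of `out` depends on every coordinate of `v` through the norm, which is the sense in which radial rescalings are non-pointwise.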
Whereas in the pointwise setting only the linear layers mix information between different components of the latent features, for radial rescaling all coordinates of the activation output vector depend on all coordinates of the activation input vector. The inherent geometric symmetry of radial rescalings makes them particularly useful for designing equivariant neural networks (Weiler & Cesa, 2019; Sabour et al., 2017; Weiler et al., 2018a;b).

