RINGING RELUS: HARMONIC DISTORTION ANALYSIS OF NONLINEAR FEEDFORWARD NETWORKS

Abstract

In this paper, we apply harmonic distortion analysis to understand the effect of nonlinearities in the spectral domain. Each nonlinear layer creates higher-frequency harmonics, a phenomenon we call "blueshift", whose magnitude increases with network depth, thereby increasing the "roughness" of the output landscape. Unlike differential models (such as vanishing gradients or sharpness), this provides a more global view of how network architectures behave across larger areas of their parameter domain. For example, the model predicts that residual connections counter the effect by dampening the corresponding higher-frequency modes. We empirically verify the connection between blueshift and architectural choices, and provide evidence for a connection with trainability.

1. INTRODUCTION

Figure 1: Continuous transition of a loss path between linear feedforward ("linear"), nonlinear feedforward ("ReLU"), and nonlinear residual ("ResNet") regimes. Graph: loss path near initialization of a ResNet56 v2 with LReLUs with negative slope α ∈ [0, 1] and residual branch weight ν ∈ [0, 1]. Left: α = 0, ν = 0. Middle: α = 1, ν = 0. Right: α = 1, ν = 1.

In the past decade, the emergence of practical deep neural networks has arguably had a disruptive impact on applications of machine learning. Depth as such appears to be key to expressive models (Raghu et al., 2017). However, depth also comes with challenges concerning training stability. Theoretical problems include vanishing and exploding gradients (Hochreiter, 1991), chaotic feedforward dynamics (Poole et al., 2016), and decorrelation of gradients (Balduzzi et al., 2017). In practice, a number of "recipes" are widely used, such as specific nonlinearities (Glorot et al., 2011; He et al., 2015), normalization methods such as batch normalization (Ioffe & Szegedy, 2015), shortcut architectures (Srivastava et al., 2015; He et al., 2016a;b), or multi-path architectures with (Huang et al., 2017) and without shortcuts (Szegedy et al., 2016).

Broadly speaking, a key research question is to understand how the shape of the network function, i.e., the map from inputs and parameters to outputs, is affected by architectural choices. Our paper specifically considers the roughness of the weights-to-outputs function ("w-o function") of nonlinear feedforward networks. Motivated by the recent visualizations of Li et al. (2018), which show how depth increases roughness and residual connections smooth the output again, our goal is to provide an analytical explanation of this effect and to study its implications for network design and trainability. To this end, we first formalize "roughness" as the decay rate of the expected power spectrum of a function class.
Our main contribution is then to apply harmonic distortion analysis to nonlinear feedforward networks, which predicts the creation of high-frequency "harmonics" (thereby "blueshifting" the power spectrum) by polynomial nonlinearities with large higher-order coefficients. Based on this model, we discuss how network depth increases blueshift and thus roughness, while shortcut connections, low-degree nonlinearities, and parallel computation paths dampen it. In relation to trainability, we show an analytic link between blueshift and exploding gradients. Unlike such differential models, the spectral view describes a more global behavior of the w-o function over regions of the parameter domain.

Experiments confirm the theoretical predictions: we observe the predicted effects of depth, shortcuts, and parallel computation on blueshift, and we are able to differentiate types of nonlinearities by the decay rate of the coefficients of a polynomial approximation. The findings are in line with the known trainability advantages of the different architectures. We further strengthen the evidence by training a large set of networks with varying amounts of nonlinearity and depth, which shows a clear correlation between blueshift and training problems, as well as a trade-off with expressivity. In summary, our paper explains how network architecture affects roughness, shows a connection to trainability, and thereby provides a new tool for analyzing the design of deep networks.
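The claim that nonlinearity types can be told apart by the decay rate of their polynomial-approximation coefficients can be illustrated with a minimal sketch (ours, not the paper's code; the degree and sample counts are arbitrary choices): the kink in ReLU forces high-order Chebyshev coefficients to decay only polynomially, whereas the analytic tanh has geometrically decaying coefficients.

```python
import numpy as np
from numpy.polynomial import chebyshev as C

# Fit degree-16 Chebyshev expansions to ReLU and tanh on [-1, 1] and
# compare the size of their high-order ("tail") coefficients. Large
# high-order coefficients are what generate strong harmonics.
deg = 16
n_samples = 200
# Chebyshev nodes give a well-conditioned least-squares fit.
x = np.cos(np.pi * (np.arange(n_samples) + 0.5) / n_samples)

c_relu = C.chebfit(x, np.maximum(x, 0.0), deg)
c_tanh = C.chebfit(x, np.tanh(x), deg)

tail_relu = np.max(np.abs(c_relu[6:]))  # ~1/n^2 decay: sizable tail
tail_tanh = np.max(np.abs(c_tanh[6:]))  # geometric decay: tiny tail
print(tail_relu, tail_tanh)
```

In this sketch the ReLU tail is orders of magnitude larger than the tanh tail, consistent with the prediction that low-degree (smooth) nonlinearities produce less blueshift.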

2. RELATED WORK

Vanishing or exploding gradients are a central numerical problem (Hochreiter, 1991; Pascanu et al., 2013; Yang et al., 2019): if the magnitudes of the singular values of the layer Jacobians deviate from one, subspaces are attenuated (|σ| < 1) or amplified (|σ| > 1), potentially cascading exponentially over multiple layers (Pennington et al., 2017a). Formally, the behavior of stacks of matrices and nonlinear functions can be modeled by random matrix theory or Gaussian mean-field approximations (Poole et al., 2016; Pennington et al., 2017b; 2018). The gist is that orthogonal weight matrices are needed at initialization, which is challenging for convolutional architectures. A solution for tanh-networks is given by Xiao et al. (2018); for ReLU, there is a negative result (Pennington et al., 2017b). Using mean-field theory, it can be shown that batch normalization (Ioffe & Szegedy, 2015) leads to exploding gradients at initialization (Yang et al., 2019) (which equalize after a few steps, but that might be too late (Frankle et al., 2020)).

A different route is taken by Balduzzi et al. (2017), who observe an increasing decorrelation of gradients in the input space. Similar to our paper, they show that deeper networks lead to spectral whitening (starting from brown noise); however, their analysis is performed with respect to the inputs x, not the weights W. The scale-space structure shown by our model might give further hints on the mechanisms behind training difficulties. By visualizing random slices of the loss surface, Li et al. (2018) observe that the loss surface of deep feedforward networks transitions from nearly convex to chaotic with increasing depth; our work explains these observations by spectral analysis. Duvenaud et al. (2014) visualize pathologies on the landscape of deep Gaussian processes that model deep nonlinear networks in the wide limit. Fourier analysis of network functions (Candès, 1999) w.r.t. the input (Rahaman et al., 2019; Xu et al., 2019; Xu, 2018; Basri et al., 2019; Yang & Salman, 2019) has been used to show an inductive bias towards low-frequency functions (w.r.t. the input x), as well as a strong anisotropy of this spectrum. Wang et al. (2020) prove under some assumptions that all "bad" local minima of a deep residual network are very shallow.
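The exponential cascade of singular values described above can be sketched numerically (our illustration, not taken from any of the cited works; the width, depth, and gain values are arbitrary): multiplying a vector by a stack of random "Jacobians" leaves its norm roughly stable when the typical singular value is one, but amplifies it exponentially when the gain exceeds one.

```python
import numpy as np

# Product of L random layer "Jacobians" applied to a unit vector.
# Entries are Gaussian with std gain/sqrt(n), so the typical singular
# value of each factor is about `gain`.
rng = np.random.default_rng(0)
n, L = 64, 50

def norm_ratio(gain):
    """Return ||J_L ... J_1 v|| for a random unit vector v."""
    v = rng.standard_normal(n)
    v /= np.linalg.norm(v)
    for _ in range(L):
        J = rng.standard_normal((n, n)) * gain / np.sqrt(n)
        v = J @ v
    return np.linalg.norm(v)

print(norm_ratio(1.0))  # stays within a few orders of magnitude of 1
print(norm_ratio(2.0))  # explodes roughly like 2^L
```

This is the |σ| < 1 vs. |σ| > 1 dichotomy in its simplest form; real networks interleave nonlinearities, which the mean-field analyses cited above account for.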

3. HARMONIC DISTORTION

We now analyze the effect of a nonlinearity by relating the Fourier spectrum of a preactivation with that of its postactivation. Let f denote the preactivation of a single neuron of a neural network consisting of L layers. We use x to denote the input to the whole network and thus to f , W to denote
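The pre- vs. postactivation spectra being related here can be previewed with a minimal numerical sketch (ours, not the paper's code; the tone frequency and sample count are arbitrary): passing a pure cosine through a ReLU turns a single spectral line into a series of higher harmonics.

```python
import numpy as np

# A single spectral line at frequency k, before and after a ReLU.
N = 4096                     # samples over one period
k = 8                        # frequency of the input tone
theta = 2 * np.pi * np.arange(N) / N
pre = np.cos(k * theta)      # preactivation: energy only at k
post = np.maximum(pre, 0.0)  # ReLU postactivation

# One-sided amplitude spectra (Fourier-series coefficient magnitudes).
amp_pre = np.abs(np.fft.rfft(pre)) / N * 2
amp_post = np.abs(np.fft.rfft(post)) / N * 2

print(amp_pre[2 * k])   # ~0: no second harmonic in the input
print(amp_post[2 * k])  # ~2/(3*pi) ≈ 0.21: ReLU created one
```

The postactivation also carries energy at 4k, 6k, and so on (decaying like 1/n²), which is exactly the "blueshift" of the power spectrum under a nonlinearity.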

