TRAINING BATCHNORM AND ONLY BATCHNORM: ON THE EXPRESSIVE POWER OF RANDOM FEATURES IN CNNS

Abstract

A wide variety of deep learning techniques, from style transfer to multi-task learning, rely on training affine transformations of features. Most prominent among these is the popular feature normalization technique BatchNorm, which normalizes activations and then applies a learned affine transform. In this paper, we aim to understand the role and expressive power of affine parameters used to transform features in this way. To isolate the contribution of these parameters from that of the learned features they transform, we investigate the performance achieved when training only these parameters in BatchNorm and freezing all weights at their random initializations. Doing so leads to surprisingly high performance considering the significant limitations that this style of training imposes. For example, sufficiently deep ResNets reach 82% (CIFAR-10) and 32% (ImageNet, top-5) accuracy in this configuration, far higher than when training an equivalent number of randomly chosen parameters elsewhere in the network. BatchNorm achieves this performance in part by naturally learning to disable around a third of the random features. Not only do these results highlight the expressive power of affine parameters in deep learning, but, in a broader sense, they characterize the expressive power of neural networks constructed simply by shifting and rescaling random features.

* Work done while an intern and student researcher at Facebook AI Research.
[1] He et al. (2016) find better accuracy when using BatchNorm before activation rather than after in ResNets.

1. INTRODUCTION

Throughout the literature on deep learning, a wide variety of techniques rely on learning affine transformations of features: multiplying each feature by a learned coefficient γ and adding a learned bias β. This includes everything from multi-task learning (Mudrakarta et al., 2019) to style transfer and generation (e.g., Dumoulin et al., 2017; Huang & Belongie, 2017; Karras et al., 2019). One of the most common uses of these affine parameters is in feature normalization techniques like BatchNorm (Ioffe & Szegedy, 2015).

Despite their practical importance and their presence in nearly all modern neural networks, we know relatively little about the role and expressive power of affine parameters used to transform features in this way. To gain insight into this question, we focus on the γ and β parameters in BatchNorm. BatchNorm is nearly ubiquitous in deep convolutional neural networks (CNNs) for computer vision, meaning these affine parameters are present by default in numerous models that researchers and practitioners train every day. Computing BatchNorm proceeds in two steps during training (see Appendix A for full details). First, each pre-activation [1] is normalized according to the mean and standard deviation across the mini-batch. These normalized pre-activations are then scaled and shifted by a trainable per-feature coefficient γ and bias β.

One fact we do know about γ and β in BatchNorm is that their presence has a meaningful effect on the performance of ResNets, improving accuracy by 0.5% to 2% on CIFAR-10 (Krizhevsky et al., 2009) and 2% on ImageNet (Deng et al., 2009) (Figure 1). These improvements are large enough that, were γ and β proposed as a new technique, they would likely see wide adoption. However, they are small enough that it is difficult to isolate the specific role γ and β play in these improvements.
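The two training-time steps above can be sketched in a few lines of NumPy. This is a minimal illustration of the computation (the function name and test values are ours, not the paper's), omitting the running statistics that BatchNorm also maintains for inference:

```python
import numpy as np

def batchnorm_train(x, gamma, beta, eps=1e-5):
    """Training-mode BatchNorm over a mini-batch.

    x: activations of shape (batch, features).
    gamma, beta: trainable per-feature coefficient and bias, shape (features,).
    """
    # Step 1: normalize each feature by the mini-batch mean and std.
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    # Step 2: scale and shift with the trainable per-feature parameters.
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))
out = batchnorm_train(x, gamma=np.ones(4), beta=np.zeros(4))
# With gamma = 1 and beta = 0, each output feature has zero mean and
# (approximately) unit variance across the batch.
```

In the experiments described below, only `gamma` and `beta` would receive gradient updates; everything upstream of `x` stays frozen at its random initialization.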
More generally, the central challenge of scientifically investigating per-feature affine parameters is distinguishing their contribution from that of the features they transform. In all practical contexts, these affine parameters are trained jointly with the features themselves (as in the case of BatchNorm) or after them (Mudrakarta et al., 2019; Dumoulin et al., 2017). In order to study these parameters in isolation, we instead train them on a network composed entirely of random features. Concretely, we freeze all weights at initialization and train only the γ and β parameters in BatchNorm. Although the networks still retain the same number of features, only a small fraction of parameters (at most 0.6%) are trainable. This experiment forces all learning to take place in γ and β, making it possible to assess the expressive power of a network whose only degrees of freedom are scaling and shifting random features. We emphasize that our goal is scientific in nature: to assess the performance and the mechanisms by which networks use this limited capacity to represent meaningful functions; we neither intend nor expect this experiment to reach SOTA accuracy. We make the following findings:

• When training only γ and β, sufficiently deep networks reach surprisingly high (although non-SOTA) accuracy: 82% on CIFAR-10 and 32% top-5 on ImageNet. This demonstrates the expressive power of the affine BatchNorm parameters.

• Training an equivalent number of randomly selected parameters per channel performs far worse (56% on CIFAR-10 and 4% top-5 on ImageNet). This demonstrates that γ and β have particularly significant expressive power as per-feature coefficients and biases.

• When training only BatchNorm, γ naturally learns to disable between a quarter and half of all channels by converging to values close to zero. This demonstrates that γ and β achieve this accuracy in part by imposing per-feature sparsity.
• When training all parameters, deeper and wider networks have smaller γ values, but few features are outright disabled. This hints at the role γ may play in moderating activations in settings where disabling γ and β leads to lower accuracy (the right parts of the plots in Figure 1).

In summary, we find that γ and β have noteworthy expressive power in their own right and that this expressive power results from their particular position as per-feature coefficients and biases. Beyond offering insights into affine parameters that transform features, this observation has broader implications for our understanding of neural networks composed of random features. By freezing all other parameters at initialization, we are training networks constructed by learning shifts and rescalings of random features. In this light, our results demonstrate that the random features available at initialization provide sufficient raw material to represent high-accuracy functions for image classification. Although prior work considers models with random features and a trainable linear output layer (e.g., Rahimi & Recht, 2009; Jaeger, 2003; Maass et al., 2002), we reveal the expressive power of networks configured such that trainable affine parameters appear after each random feature.
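To see why so few parameters remain trainable in this experiment, a back-of-the-envelope count is helpful: each convolutional layer contributes in·out·k·k frozen weights but only 2·out trainable BatchNorm parameters (one γ and one β per output channel). The sketch below uses a toy layer configuration of our own choosing, not the paper's actual architectures, purely to illustrate the order of magnitude:

```python
def trainable_fraction(layers):
    """Fraction of parameters that are trainable when only BatchNorm's
    per-channel gamma and beta are unfrozen.

    layers: list of (in_channels, out_channels, kernel_size) conv specs,
    each assumed to be followed by a BatchNorm layer.
    """
    conv_params = sum(cin * cout * k * k for cin, cout, k in layers)
    bn_affine = sum(2 * cout for _, cout, _ in layers)  # gamma + beta per channel
    return bn_affine / (conv_params + bn_affine)

# A deep, narrow toy stack (hypothetical widths): one stem conv from RGB,
# then twenty 64-channel 3x3 convs.
layers = [(3, 64, 3)] + [(64, 64, 3)] * 20
frac = trainable_fraction(layers)
print(f"{100 * frac:.2f}% of parameters are trainable")  # well under 1%
```

Wider layers only shrink this fraction further, since conv weights grow quadratically in width while the affine parameters grow linearly, which is consistent with the at-most-0.6% figure quoted above.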

2. RELATED WORK

BatchNorm. BatchNorm makes it possible to train deeper networks (He et al., 2015a) and causes SGD to converge sooner (Ioffe & Szegedy, 2015) . However, the underlying mechanisms by which it



Figure 1: Accuracy when training deep (left) and wide (center) ResNets for CIFAR-10 and deep ResNets for ImageNet (right) as described in Table 1 when all parameters are trainable (blue) and all parameters except γ and β are trainable (purple). Training with γ and β enabled results in accuracy 0.5% to 2% (CIFAR-10) and 2% (ImageNet) higher than with γ and β disabled.

