TRAINING BATCHNORM AND ONLY BATCHNORM: ON THE EXPRESSIVE POWER OF RANDOM FEATURES IN CNNS

Abstract

A wide variety of deep learning techniques, from style transfer to multi-task learning, rely on training affine transformations of features. Most prominent among these is the popular feature normalization technique BatchNorm, which normalizes activations and then applies a learned affine transform. In this paper, we aim to understand the role and expressive power of affine parameters used to transform features in this way. To isolate the contribution of these parameters from that of the learned features they transform, we investigate the performance achieved when training only these parameters in BatchNorm and freezing all weights at their random initializations. Doing so leads to surprisingly high performance considering the significant limitations this style of training imposes. For example, sufficiently deep ResNets reach 82% (CIFAR-10) and 32% (ImageNet, top-5) accuracy in this configuration, far higher than when training an equivalent number of randomly chosen parameters elsewhere in the network. BatchNorm achieves this performance in part by naturally learning to disable around a third of the random features. Not only do these results highlight the expressive power of affine parameters in deep learning, but, in a broader sense, they characterize the expressive power of neural networks constructed simply by shifting and rescaling random features.

* Work done while an intern and student researcher at Facebook AI Research.
1 He et al. (2016) find better accuracy when using BatchNorm before activation rather than after in ResNets.

1. INTRODUCTION

Throughout the literature on deep learning, a wide variety of techniques rely on learning affine transformations of features: multiplying each feature by a learned coefficient γ and adding a learned bias β. These include everything from multi-task learning (Mudrakarta et al., 2019) to style transfer and generation (e.g., Dumoulin et al., 2017; Huang & Belongie, 2017; Karras et al., 2019). Among the most common uses of these affine parameters are feature normalization techniques like BatchNorm (Ioffe & Szegedy, 2015). Despite their practical importance and their presence in nearly all modern neural networks, we know relatively little about the role and expressive power of affine parameters used to transform features in this way.

To gain insight into this question, we focus on the γ and β parameters in BatchNorm. BatchNorm is nearly ubiquitous in deep convolutional neural networks (CNNs) for computer vision, meaning these affine parameters are present by default in numerous models that researchers and practitioners train every day. During training, computing BatchNorm proceeds in two steps (see Appendix A for full details). First, each pre-activation 1 is normalized according to the mean and standard deviation across the mini-batch. These normalized pre-activations are then scaled and shifted by a trainable per-feature coefficient γ and bias β.

One fact we do know about γ and β in BatchNorm is that their presence has a meaningful effect on the performance of ResNets, improving accuracy by 0.5% to 2% on CIFAR-10 (Krizhevsky et al., 2009) and 2% on ImageNet (Deng et al., 2009) (Figure 1). These improvements are large enough
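To make the two-step computation concrete, the following is a minimal NumPy sketch of BatchNorm's training-time forward pass over a mini-batch of per-feature pre-activations. The function name, the epsilon term for numerical stability, and the example shapes are our own illustrative choices, not notation from the paper; full details of the actual operation appear in the paper's Appendix A.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Training-time BatchNorm for a (batch, features) array of pre-activations.

    gamma: (features,) trainable per-feature coefficient
    beta:  (features,) trainable per-feature bias
    """
    # Step 1: normalize each feature by its mean and standard deviation
    # computed across the mini-batch dimension.
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    # Step 2: scale and shift with the learned affine parameters.
    return gamma * x_hat + beta

# The paper's setting corresponds to treating x as fixed random features
# (all weights frozen at initialization) and optimizing only gamma and beta;
# learning gamma = 0 for a feature effectively disables that feature.
x = np.random.randn(128, 4)
gamma, beta = np.ones(4), np.zeros(4)
y = batchnorm_forward(x, gamma, beta)
print(np.allclose(y.mean(axis=0), 0.0, atol=1e-6))  # per-feature means are ~0
```

With γ = 1 and β = 0 the output is simply the normalized pre-activations; the learned affine parameters then move each feature away from zero mean and unit variance as training finds useful.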

