DECONSTRUCTING THE REGULARIZATION OF BATCH-NORM

Abstract

Batch normalization (BatchNorm) has become a standard technique in deep learning. Its popularity is in no small part due to its often positive effect on generalization. Despite this success, the regularization effect of the technique is still poorly understood. This study aims to decompose BatchNorm into separate mechanisms that are much simpler. We identify three effects of BatchNorm and assess their impact directly with ablations and interventions. Our experiments show that preventing explosive growth at the final layer at initialization and during training can recover a large part of BatchNorm's generalization boost. This regularization mechanism can lift accuracy by 2.9% for Resnet-50 on Imagenet without BatchNorm. We show it is linked to other methods like Dropout and recent initializations like Fixup. Surprisingly, this simple mechanism matches the improvement of 0.9% of the more complex Dropout regularization for the state-of-the-art Efficientnet-B8 model on Imagenet. This demonstrates the underrated effectiveness of simple regularizations and sheds light on directions to further improve generalization for deep nets.

1. INTRODUCTION

Deep learning has made remarkable progress in a variety of domains in the last decade. While part of this progress relied on training larger models on larger datasets, it also depended crucially on the development of new training methods. A prominent example of such a development is batch normalization (BatchNorm) (Ioffe and Szegedy, 2015), which has become a standard component of training protocols. For example, state-of-the-art models in image recognition (Szegedy et al., 2017; He et al., 2016; Tan and Le, 2019), object detection (He et al., 2019; Du et al., 2019), and image segmentation (Chen et al., 2017) all use BatchNorm. Despite its prominence, the mechanisms behind BatchNorm's effectiveness are not well understood (Santurkar et al., 2018; Bjorck et al., 2018; Yang et al., 2019). Perhaps at the core of the confusion is that BatchNorm has many effects. It has been correlated with reducing covariate shift (Ioffe and Szegedy, 2015), enabling higher learning rates (Bjorck et al., 2018), improving initialization (Zhang et al., 2019), and improving conditioning (Desjardins et al., 2015), to name a few. These entangled effects make it difficult to study the technique properly.

In this work, we deconstruct some of the effects of BatchNorm in search of much simpler components. The advantage of this approach over previous work is that it goes beyond correlating these effects with BatchNorm: it evaluates their impact separately. The mechanisms we consider in this work are purposefully simple. These simpler mechanisms are easier to understand and, surprisingly, they are competitive even at the level of the state-of-the-art. Our contributions can be summarized as follows:

1. How does normalization help generalization? We isolate and quantify the benefits of the different effects of BatchNorm using additive penalties and ablations. To our knowledge, we are the first to provide empirical evidence that BatchNorm's effect of regularizing against explosive growth at initialization and during training can recover a large part of its generalization boost. Replicating this effect with Fixup initialization and the proposed additive penalty improves accuracy by 2.9% for Resnet-50 without BatchNorm.

2. Links to Fixup and Dropout. We draw novel connections between the regularization on the final layer, Dropout regularization, Fixup initialization, and BatchNorm.

3. Simplicity in regularization. The mechanism we identify can be useful as a standalone regularization. It produces a 0.9% improvement on the Efficientnet-B8 architecture, matching the more complex Dropout regularization.

The effects we evaluate are the implicit regularizing effect on the norms at the final layer, as well as BatchNorm's primary effect of standardizing the intermediate layers. To test these purposefully simple mechanisms, we rely on ablations and additive penalties. The use of additive penalties allows us to disentangle these effects while controlling for the positive effect of BatchNorm on initialization by using the recently proposed Fixup initializer (Zhang et al., 2019).

2. THE REGULARIZATION EFFECTS OF BATCH NORMALIZATION

2.1. REGULARIZING AGAINST EXPLOSIVE GROWTH IN THE FINAL LAYER

First, we characterize the implicit effect of normalization on the final layer. Consider a neural network of the form NN(x) = W Emb(x) with loss L, where x ∈ R^I is the input of the network, W ∈ R^{K×H} is the final weight matrix in the model, and Emb(x) : R^I → R^H is a feature embedding network with L layers. Let us take the common case where Emb(x) = Swish(γ BatchNorm(PreEmb(x)) + β), where PreEmb(x) is the output of a residual network, Swish(z) = zσ(ρz) is the Swish activation (Ramachandran et al., 2017; Elfwing et al., 2018) with scalar parameter ρ (typically denoted β), and BatchNorm parameters γ, β. BatchNorm makes weight decay regularization on γ, β approximately equivalent to an additive penalty on the norm of the feature embedding:

L(NN(x)) + λ‖γ‖² + λ‖β‖² = L(NN(x)) + 4λ E[‖Emb(x)‖²] + O(|ρ|). (1)

See Appendix A for the derivation. This means that the norm of the BatchNorm parameters alone is enough to directly control the norm of the feature embedding. It guarantees that the norm of the feature embedding cannot grow explosively during training as long as these parameters remain small. This regularization effect of BatchNorm can occur even without explicit weight decay, due to the tendency of stochastic gradient descent to favor low-norm parameters (Wilson et al., 2017). This equivalence does not hold without BatchNorm because the activations of the embedding network become an important factor in the norm of the feature embedding (‖γ‖² + ‖β‖² ≠ E[‖γ ⊙ PreEmb(x) + β‖²] in general). Indeed, Balduzzi et al. (2017), Gehring et al. (2017), and Zhang et al. (2019) have shown that the activations of residual networks without BatchNorm tend to explode exponentially in the depth of the network at initialization. This results in an extremely large embedding norm, even though the parameters are relatively small. We confirm experimentally in Section 4.3 that networks without BatchNorm have much larger feature embedding norms.
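The approximate equivalence above can be checked numerically. The following NumPy sketch (ours, not from the paper; all names are illustrative) draws a random minibatch, applies a batch-standardization followed by the Swish affine head with a small ρ, and compares ‖γ‖² + ‖β‖² against 4 E[‖Emb(x)‖²]:

```python
import numpy as np

rng = np.random.default_rng(0)
B, H = 10_000, 64  # batch size and embedding width (arbitrary choices)

def batchnorm(z):
    # Standardize each feature over the batch: zero mean, unit variance.
    return (z - z.mean(axis=0)) / z.std(axis=0)

def swish(z, rho):
    # Swish(z) = z * sigmoid(rho * z); rho -> 0 gives z / 2.
    return z / (1.0 + np.exp(-rho * z))

pre_emb = rng.normal(size=(B, H)) * 3.0  # stand-in for PreEmb(x)
gamma = rng.normal(size=H)               # BatchNorm scale
beta = rng.normal(size=H)                # BatchNorm shift
rho = 1e-3                               # near-linear Swish regime, O(|rho|) small

emb = swish(gamma * batchnorm(pre_emb) + beta, rho)

param_norm = np.sum(gamma**2) + np.sum(beta**2)
emb_penalty = 4.0 * np.mean(np.sum(emb**2, axis=1))

# With BatchNorm, ||gamma||^2 + ||beta||^2 ~= 4 E[||Emb(x)||^2] up to O(|rho|).
print(abs(param_norm - emb_penalty) / param_norm)  # small relative gap
```

The key step is that the standardized activations have zero mean and unit variance per feature, so E[(γ_i z_i + β_i)²] = γ_i² + β_i² exactly; the only slack comes from the Swish nonlinearity, which contributes the O(|ρ|) term.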
Feature Embedding L2 (EL2). We propose to assess the effect of this regularization mechanism by isolating it as the following additive penalty:

R_EL2(NN) = (1/H) E[‖Emb(x)‖²]. (2)
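As a concrete reading of Eq. (2), the penalty is just the mean squared norm of the minibatch embeddings, divided by the embedding width H. A minimal sketch (our illustration; the function name and shapes are assumptions, and in training this term would be added to the loss scaled by a coefficient λ):

```python
import numpy as np

def el2_penalty(emb):
    """Feature Embedding L2 (EL2): (1/H) * E[||Emb(x)||^2].

    emb: array of shape (batch, H) holding Emb(x) for a minibatch.
    """
    H = emb.shape[1]
    per_sample_sq_norm = np.sum(emb**2, axis=1)  # ||Emb(x)||^2 per sample
    return np.mean(per_sample_sq_norm) / H

emb = np.array([[1.0, 2.0], [3.0, 4.0]])  # toy embeddings, H = 2
# squared norms: 5 and 25 -> mean 15 -> divided by H = 7.5
print(el2_penalty(emb))  # 7.5
```

The 1/H factor makes the penalty's scale roughly independent of the embedding width, so a single λ can transfer across architectures.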




