DECONSTRUCTING THE REGULARIZATION OF BATCHNORM

Abstract

Batch normalization (BatchNorm) has become a standard technique in deep learning. Its popularity is in no small part due to its often positive effect on generalization. Despite this success, the regularization effect of the technique is still poorly understood. This study aims to decompose BatchNorm into much simpler, separate mechanisms. We identify three effects of BatchNorm and assess their impact directly with ablations and interventions. Our experiments show that preventing explosive growth at the final layer, at initialization and during training, can recover a large part of BatchNorm's generalization boost. This regularization mechanism can lift accuracy by 2.9% for ResNet-50 on ImageNet without BatchNorm. We show it is linked to other methods such as Dropout and recent initializations such as Fixup. Surprisingly, this simple mechanism matches the 0.9% improvement of the more complex Dropout regularization for the state-of-the-art EfficientNet-B8 model on ImageNet. This demonstrates the underrated effectiveness of simple regularizers and sheds light on directions for further improving generalization in deep networks.

1. INTRODUCTION

Deep learning has made remarkable progress in a variety of domains over the last decade. While part of this progress relied on training larger models on larger datasets, it also depended crucially on the development of new training methods. A prominent example of such a development is batch normalization (BatchNorm) (Ioffe and Szegedy, 2015), which has become a standard component of training protocols. For example, state-of-the-art models in image recognition (Szegedy et al., 2017; He et al., 2016; Tan and Le, 2019), object detection (He et al., 2019; Du et al., 2019), and image segmentation (Chen et al., 2017) all use BatchNorm.

Despite its prominence, the mechanisms behind BatchNorm's effectiveness are not well understood (Santurkar et al., 2018; Bjorck et al., 2018; Yang et al., 2019). Perhaps at the core of the confusion is that BatchNorm has many effects. It has been linked to reducing internal covariate shift (Ioffe and Szegedy, 2015), enabling higher learning rates (Bjorck et al., 2018), improving initialization (Zhang et al., 2019), and improving conditioning (Desjardins et al., 2015), to name a few. These entangled effects make it difficult to study the technique properly.

In this work, we deconstruct some of the effects of BatchNorm in search of much simpler components. The advantage of this approach over previous work is that, rather than merely correlating these effects with BatchNorm, it evaluates their impact separately. The mechanisms we consider in this work are purposefully simple. These simpler mechanisms are easier to understand and, surprisingly, they are competitive even at the level of the state-of-the-art. Our contributions can be summarized as follows:

1. How does normalization help generalization? We isolate and quantify the benefits of the different effects of BatchNorm using additive penalties and ablations. To our knowledge, we are the first to provide empirical evidence that BatchNorm's regularization against explosive growth at initialization and during training can recover a large part of its generalization boost. Replicating this effect with Fixup initialization and the proposed additive penalty improves accuracy by 2.9% for ResNet-50 without BatchNorm.
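The exact form of the additive penalty is not specified in this section. As a minimal NumPy sketch, one plausible way to regularize against explosive growth at the final layer is to add the mean squared norm of the pre-softmax logits to the training loss; the squared-logit form and the coefficient `lam` below are illustrative assumptions, not the authors' exact method.

```python
import numpy as np

def logit_norm_penalty(logits, lam=0.01):
    # Hypothetical additive penalty: discourages explosive growth of the
    # final-layer (pre-softmax) activations by penalizing their squared norm,
    # averaged over the batch. `lam` is an illustrative coefficient.
    return lam * np.mean(np.sum(logits ** 2, axis=1))

def cross_entropy(logits, labels):
    # Numerically stable softmax cross-entropy, averaged over the batch.
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -np.mean(log_probs[np.arange(len(labels)), labels])

# Toy batch: 4 examples, 3 classes. The penalty is simply added to the
# task loss, so any gradient-based optimizer will shrink the logit norms.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 3))
labels = np.array([0, 2, 1, 1])
total_loss = cross_entropy(logits, labels) + logit_norm_penalty(logits)
```

Because the penalty is additive and differentiable, it slots into any existing training loop without architectural changes, which is what makes this mechanism so much simpler to study than BatchNorm itself.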

