BATCH NORMALIZATION EXPLAINED

Abstract

A critically important, ubiquitous, and yet poorly understood ingredient in modern deep networks (DNs) is batch normalization (BN), which centers and normalizes the feature maps. To date, only limited progress has been made understanding why BN boosts DN learning and inference performance; work has focused exclusively on showing that BN smooths a DN's loss landscape. In this paper, we study BN theoretically from the perspective of function approximation; we exploit the fact that most of today's state-of-the-art DNs are continuous piecewise affine (CPA) splines that fit a predictor to the training data via affine mappings defined over a partition of the input space (the so-called "linear regions"). We demonstrate that BN is an unsupervised learning technique that, independent of the DN's weights or gradient-based learning, adapts the geometry of a DN's spline partition to match the data. BN provides a "smart initialization" that boosts the performance of DN learning, because it adapts even a DN initialized with random weights to align its spline partition with the data. We also show that the variation of BN statistics between mini-batches introduces a dropout-like random perturbation to the partition boundaries, and hence to the decision boundary for classification problems. This per-mini-batch perturbation reduces overfitting and improves generalization by increasing the margin between the training samples and the decision boundary.

1. INTRODUCTION

Deep learning has made major impacts in a wide range of applications. Mathematically, a deep (neural) network (DN) maps an input vector $x$ to a sequence of $L$ feature maps $z_\ell$, $\ell = 1, \dots, L$ by successively applying the simple nonlinear transformation (termed a DN layer)

$$z_{\ell+1} = a\!\left(W_\ell z_\ell + c_\ell\right), \quad \ell = 0, \dots, L-1, \tag{1}$$

with $z_0 = x$, $W_\ell$ the weight matrix, $c_\ell$ the bias vector, and $a$ an activation operator that applies a scalar nonlinear activation function $a$ to each element of its vector input. The structure of $W_\ell, c_\ell$ controls the type of layer (e.g., a circulant matrix for a convolutional layer). For regression tasks, the DN prediction is simply $z_L$, while for classification tasks, $z_L$ is often processed through a softmax operator (Goodfellow et al., 2016). The DN parameters $W_\ell, c_\ell$ are learned from a collection of training data samples $X = \{x_i,\ i = 1, \dots, n\}$ (augmented with the corresponding ground-truth labels $y_i$ in supervised settings) by optimizing an objective function (e.g., squared error or cross-entropy). Learning is typically performed via some flavor of stochastic gradient descent (SGD) over randomized mini-batches of training data samples $B \subset X$ (Goodfellow et al., 2016).

While a host of different DN architectures have been developed over the past several years, modern, high-performing DNs nearly universally employ batch normalization (BN) (Ioffe & Szegedy, 2015) to center and normalize the entries of the feature maps using four additional parameters $\mu_\ell, \sigma_\ell, \beta_\ell, \gamma_\ell$. Define $z_{\ell,k}$ as the $k$th entry of the feature map $z_\ell$ of length $D_\ell$, $w_{\ell,k}$ as the $k$th row of the weight matrix $W_\ell$, and $\mu_{\ell,k}, \sigma_{\ell,k}, \beta_{\ell,k}, \gamma_{\ell,k}$ as the $k$th entries of the BN parameter vectors $\mu_\ell, \sigma_\ell, \beta_\ell, \gamma_\ell$, respectively. Then we can write the BN-equipped layer $\ell$ mapping extending (1) as

$$z_{\ell+1,k} = a\!\left(\frac{\langle w_{\ell,k}, z_\ell \rangle - \mu_{\ell,k}}{\sigma_{\ell,k}}\, \gamma_{\ell,k} + \beta_{\ell,k}\right), \quad k = 1, \dots, D_\ell. \tag{2}$$
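As a concrete illustration, the BN-equipped layer mapping (2) can be sketched in a few lines of NumPy. This is a minimal sketch, not the paper's implementation: the leaky-ReLU activation, the Gaussian random weights, and the layer sizes (mini-batch of 32, mapping 8 inputs to 6 units) are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def leaky_relu(u, alpha=0.01):
    # scalar activation a applied elementwise
    return np.where(u > 0, u, alpha * u)

def bn_layer(Z, W, gamma, beta, eps=1e-5):
    # Eq. (2): center and scale each unit's pre-activations by their
    # mini-batch statistics, then apply the learned affine parameters
    # (gamma, beta) and the activation.
    # Z has shape (batch, D_in); W has shape (D_out, D_in).
    U = Z @ W.T                  # pre-activations <w_{l,k}, z_l> for each unit k
    mu = U.mean(axis=0)          # per-unit mini-batch mean  (mu_l)
    sigma = U.std(axis=0) + eps  # per-unit mini-batch std   (sigma_l)
    return leaky_relu((U - mu) / sigma * gamma + beta)

# illustrative sizes: mini-batch of 32, layer mapping 8 -> 6 units
Z = rng.normal(size=(32, 8))
W = rng.normal(size=(6, 8))
out = bn_layer(Z, W, gamma=np.ones(6), beta=np.zeros(6))
print(out.shape)  # (32, 6)
```

Note that, before the activation is applied, each unit's normalized pre-activations have (approximately) zero mean and unit standard deviation over the mini-batch, regardless of the weights.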
The parameters $\mu_\ell, \sigma_\ell$ are computed as the element-wise mean and standard deviation of the pre-activations $W_\ell z_\ell$ over the current mini-batch.

In this paper, we study BN theoretically from a different perspective that provides new insights into how it boosts DN optimization and inference performance. Our perspective is function approximation; we exploit the fact that most of today's state-of-the-art DNs are continuous piecewise affine (CPA) splines that fit a predictor to the training data via affine mappings defined over a partition of the input space (the so-called "linear regions"); see Balestriero & Baraniuk (2018; 2021), Balestriero et al. (2019), and Appendix B for more details.
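To make the CPA view concrete, the following sketch (our own illustrative construction, with a small random leaky-ReLU network; sizes and seed are arbitrary) recovers the affine map $x \mapsto Ax + b$ that the network computes on the partition region containing a given input: once the sign pattern of every unit is fixed, each leaky-ReLU layer acts as a diagonal linear map, so the whole network composes into a single affine mapping on that region.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = 0.01  # leaky-ReLU slope

# small random network: 2D input, two hidden layers of 6 units, scalar output
Ws = [rng.normal(size=(6, 2)), rng.normal(size=(6, 6)), rng.normal(size=(1, 6))]
cs = [rng.normal(size=6), rng.normal(size=6), rng.normal(size=1)]

def forward(x):
    # the layer mapping (1) with leaky-ReLU activation
    z = x
    for W, c in zip(Ws[:-1], cs[:-1]):
        u = W @ z + c
        z = np.where(u > 0, u, alpha * u)
    return Ws[-1] @ z + cs[-1]

def local_affine(x):
    # Track the affine map (A, b) from the input to each layer's features.
    # The sign pattern at x fixes each leaky-ReLU to a diagonal linear map.
    A, b = np.eye(len(x)), np.zeros(len(x))
    for W, c in zip(Ws[:-1], cs[:-1]):
        A, b = W @ A, W @ b + c          # affine pre-activation of this layer
        u = A @ x + b                    # value at x fixes the sign pattern
        D = np.where(u > 0, 1.0, alpha)  # per-unit slope for that pattern
        A, b = D[:, None] * A, D * b
    return Ws[-1] @ A, Ws[-1] @ b + cs[-1]

x = rng.normal(size=2)
A, b = local_affine(x)
print(np.allclose(forward(x), A @ x + b))        # exact at x itself
x2 = x + 1e-6 * rng.normal(size=2)               # stays in the same region
print(np.allclose(forward(x2), A @ x2 + b))      # still the same affine map
```

The boundaries between such regions are precisely the "folded hyperplanes" visualized in Figure 1; BN moves those boundaries toward the data.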

¹Note that the DN bias $c_\ell$ from (1) has been subsumed into $\mu_\ell$ and $\beta_\ell$.

Figure 1: Visualization of the input-space spline partition ("linear regions") of a four-layer DN with a 2D input space, 6 units per layer, leaky-ReLU activation function, and random weights $W_\ell$. The training data samples are denoted with black dots. In each plot, blue lines correspond to folded hyperplanes introduced by the units of the corresponding layer, while gray lines correspond to (folded) hyperplanes introduced by previous layers. Top row: Without BN (i.e., using (1)), the folded hyperplanes are spread throughout the input space, resulting in a spline partition that is agnostic to the data. Bottom row: With BN (i.e., using (2)), the folded hyperplanes are drawn towards the data, resulting in an adaptive spline partition that, even with random weights, minimizes the distance between the partition boundaries and the data and thus increases the density of partition regions around the data.

The parameters $\beta_\ell, \gamma_\ell$ are learned along with $W_\ell$ via SGD.¹ The empirical fact that BN significantly improves both the training speed and generalization performance of a DN in a wide range of tasks has made it ubiquitous, as evidenced by the 40,000 citations of the originating paper (Ioffe & Szegedy, 2015).

Only limited progress has been made to date explaining BN, primarily in the context of optimization. By studying how backpropagation updates the layer weights, LeCun et al. (1998) observed that unnormalized feature maps are constrained to live on a low-dimensional subspace that limits the capacity of gradient-based learning. By slightly altering the BN formula (2), Salimans & Kingma (2016) showed that renormalization via $\sigma_\ell$ smooths the optimization landscape and enables faster training. Similarly, Bjorck et al. (2018), Santurkar et al. (2018), and Kohler et al. (2019) confirmed BN's impact on the gradient distribution and optimization landscape through large-scale experiments. Using mean-field theory, Yang et al. (2019) characterized the gradient statistics of BN in fully connected feed-forward networks with random weights to show that it regularizes the gradients and improves the conditioning of the optimization landscape.

One should not take away from the above analyses that BN's only effect is to smooth the optimization loss surface or stabilize gradients. If this were the case, then BN would be redundant in advanced architectures like residual networks (Li et al., 2017) and mollifying networks (Gulcehre et al., 2016) that have been proven to have improved optimization landscapes (Li et al., 2018; Riedi et al., 2022) and have been coupled with advanced optimization techniques like Adam (Kingma & Ba, 2014). Quite to the contrary, BN significantly improves the performance of even these advanced networks and techniques.

