BATCH NORMALIZATION EXPLAINED

Abstract

A critically important, ubiquitous, and yet poorly understood ingredient in modern deep networks (DNs) is batch normalization (BN), which centers and normalizes the feature maps. To date, only limited progress has been made in understanding why BN boosts DN learning and inference performance; prior work has focused exclusively on showing that BN smooths a DN's loss landscape. In this paper, we study BN theoretically from the perspective of function approximation; we exploit the fact that most of today's state-of-the-art DNs are continuous piecewise affine (CPA) splines that fit a predictor to the training data via affine mappings defined over a partition of the input space (the so-called "linear regions"). We demonstrate that BN is an unsupervised learning technique that, independent of the DN's weights or gradient-based learning, adapts the geometry of a DN's spline partition to match the data. BN provides a "smart initialization" that boosts the performance of DN learning, because it adapts even a DN initialized with random weights to align its spline partition with the data. We also show that the variation of BN statistics between mini-batches introduces a dropout-like random perturbation to the partition boundaries and hence to the decision boundary for classification problems. This per-mini-batch perturbation reduces overfitting and improves generalization by increasing the margin between the training samples and the decision boundary.

1. INTRODUCTION

Deep learning has made major impacts in a wide range of applications. Mathematically, a deep (neural) network (DN) maps an input vector x to a sequence of L feature maps z_ℓ, ℓ = 1, …, L by successively applying the simple nonlinear transformation (termed a DN layer)

z_{ℓ+1} = a(W_ℓ z_ℓ + c_ℓ), ℓ = 0, …, L − 1,   (1)

with z_0 = x, W_ℓ the weight matrix, c_ℓ the bias vector, and a an activation operator that applies a scalar nonlinear activation function a to each element of its vector input. The structure of W_ℓ, c_ℓ controls the type of layer (e.g., a circulant matrix for a convolutional layer). For regression tasks, the DN prediction is simply z_L, while for classification tasks, z_L is often processed through a softmax operator Goodfellow et al. (2016). The DN parameters W_ℓ, c_ℓ are learned from a collection of training data samples X = {x_i, i = 1, …, n} (augmented with the corresponding ground-truth labels y_i in supervised settings) by optimizing an objective function (e.g., squared error or cross-entropy). Learning is typically performed via some flavor of stochastic gradient descent (SGD) over randomized mini-batches of training data samples B ⊂ X Goodfellow et al. (2016).

While a host of different DN architectures have been developed over the past several years, modern, high-performing DNs nearly universally employ batch normalization (BN) Ioffe & Szegedy (2015) to center and normalize the entries of the feature maps using four additional parameters µ_ℓ, σ_ℓ, β_ℓ, γ_ℓ. Define z_{ℓ,k} as the k-th entry of the feature map z_ℓ of length D_ℓ, w_{ℓ,k} as the k-th row of the weight matrix W_ℓ, and µ_{ℓ,k}, σ_{ℓ,k}, β_{ℓ,k}, γ_{ℓ,k} as the k-th entries of the BN parameter vectors µ_ℓ, σ_ℓ, β_ℓ, γ_ℓ, respectively. Then we can write the BN-equipped layer ℓ mapping extending (1) as

z_{ℓ+1,k} = a( γ_{ℓ,k} (⟨w_{ℓ,k}, z_ℓ⟩ − µ_{ℓ,k}) / σ_{ℓ,k} + β_{ℓ,k} ), k = 1, …, D_ℓ.   (2)

The parameters µ_ℓ, σ_ℓ are computed as the element-wise mean and standard deviation of W_ℓ z_ℓ for each mini-batch during training and for the entire training set during testing. The parameters β_ℓ, γ_ℓ are learned by gradient descent along with the weights W_ℓ.
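As a concrete illustration, the BN-equipped layer mapping of (2) can be sketched in a few lines of NumPy. This is a minimal training-mode example: the fully connected layer, the ReLU choice for the activation a, and the small eps added to σ for numerical stability are illustrative assumptions, not specifics from the paper.

```python
import numpy as np

def bn_layer(z, W, gamma, beta, eps=1e-5):
    """One BN-equipped DN layer in training mode.

    z:     mini-batch of feature maps, shape (batch, D_in)
    W:     weight matrix W_l, shape (D_out, D_in)
    gamma: per-unit scale gamma_l, shape (D_out,)
    beta:  per-unit shift beta_l, shape (D_out,)
    """
    pre = z @ W.T                 # pre-activations <w_{l,k}, z_l> for each unit k
    mu = pre.mean(axis=0)         # mini-batch mean mu_l (element-wise over units)
    sigma = pre.std(axis=0)       # mini-batch standard deviation sigma_l
    normalized = (pre - mu) / (sigma + eps)
    # Scale, shift, then apply the activation a (ReLU assumed here).
    return np.maximum(0.0, gamma * normalized + beta)
```

At test time, µ_ℓ and σ_ℓ would instead be fixed statistics computed over the entire training set (in practice, running averages accumulated during training), so the mapping no longer depends on the mini-batch composition.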

