BATCH NORMALIZATION EXPLAINED

Abstract

A critically important, ubiquitous, and yet poorly understood ingredient in modern deep networks (DNs) is batch normalization (BN), which centers and normalizes the feature maps. To date, only limited progress has been made understanding why BN boosts DN learning and inference performance; work has focused exclusively on showing that BN smooths a DN's loss landscape. In this paper, we study BN theoretically from the perspective of function approximation; we exploit the fact that most of today's state-of-the-art DNs are continuous piecewise affine (CPA) splines that fit a predictor to the training data via affine mappings defined over a partition of the input space (the so-called "linear regions"). We demonstrate that BN is an unsupervised learning technique that, independent of the DN's weights or gradient-based learning, adapts the geometry of a DN's spline partition to match the data. BN provides a "smart initialization" that boosts the performance of DN learning, because it adapts even a DN initialized with random weights to align its spline partition with the data. We also show that the variation of BN statistics between mini-batches introduces a dropout-like random perturbation to the partition boundaries and hence the decision boundary for classification problems. This per-mini-batch perturbation reduces overfitting and improves generalization by increasing the margin between the training samples and the decision boundary.

1. INTRODUCTION

Deep learning has made major impacts in a wide range of applications. Mathematically, a deep (neural) network (DN) maps an input vector x to a sequence of L feature maps z_ℓ, ℓ = 1, …, L, by successively applying the simple nonlinear transformation (termed a DN layer)

z_{ℓ+1} = a(W_ℓ z_ℓ + c_ℓ), ℓ = 0, …, L − 1, (1)

with z_0 = x, W_ℓ the weight matrix, c_ℓ the bias vector, and a an activation operator that applies a scalar nonlinear activation function a to each element of its vector input. The structure of W_ℓ, c_ℓ controls the type of layer (e.g., circulant matrix for a convolutional layer). For regression tasks, the DN prediction is simply z_L, while for classification tasks, z_L is often processed through a softmax operator Goodfellow et al. (2016). The DN parameters W_ℓ, c_ℓ are learned from a collection of training data samples X = {x_i, i = 1, …, n} (augmented with the corresponding ground-truth labels y_i in supervised settings) by optimizing an objective function (e.g., squared error or cross-entropy). Learning is typically performed via some flavor of stochastic gradient descent (SGD) over randomized mini-batches of training data samples B ⊂ X Goodfellow et al. (2016).

While a host of different DN architectures have been developed over the past several years, modern, high-performing DNs nearly universally employ batch normalization (BN) Ioffe & Szegedy (2015) to center and normalize the entries of the feature maps using four additional parameters µ_ℓ, σ_ℓ, β_ℓ, γ_ℓ. Define z_{ℓ,k} as the k-th entry of the feature map z_ℓ of length D_ℓ, w_{ℓ,k} as the k-th row of the weight matrix W_ℓ, and µ_{ℓ,k}, σ_{ℓ,k}, β_{ℓ,k}, γ_{ℓ,k} as the k-th entries of the BN parameter vectors µ_ℓ, σ_ℓ, β_ℓ, γ_ℓ, respectively. Then we can write the BN-equipped layer-ℓ mapping extending (1) as

z_{ℓ+1,k} = a( ((⟨w_{ℓ,k}, z_ℓ⟩ − µ_{ℓ,k}) / σ_{ℓ,k}) γ_{ℓ,k} + β_{ℓ,k} ), k = 1, …, D_ℓ. (2)
The parameters µ_ℓ, σ_ℓ are computed as the element-wise mean and standard deviation of W_ℓ z_ℓ for each mini-batch during training and for the entire training set during testing. The parameters β_ℓ, γ_ℓ are learned along with W_ℓ via SGD.

Figure 1: Visualization of the input-space spline partition ("linear regions") of a four-layer DN with 2D input space, 6 units per layer, leaky-ReLU activation function, and random weights W_ℓ. The training data samples are denoted with black dots. In each plot, blue lines correspond to folded hyperplanes introduced by the units of the corresponding layer, while gray lines correspond to (folded) hyperplanes introduced by previous layers. Top row: Without BN (i.e., using (1)), the folded hyperplanes are spread throughout the input space, resulting in a spline partition that is agnostic to the data. Bottom row: With BN (i.e., using (2)), the folded hyperplanes are drawn towards the data, resulting in an adaptive spline partition that, even with random weights, minimizes the distance between the partition boundaries and the data and thus increases the density of partition regions around the data.

The empirical fact that BN significantly improves both the training speed and generalization performance of a DN in a wide range of tasks has made it ubiquitous, as evidenced by the 40,000 citations of the originating paper Ioffe & Szegedy (2015). Only limited progress has been made to date explaining BN, primarily in the context of optimization. By studying how backpropagation updates the layer weights, LeCun et al. (1998) observed that unnormalized feature maps are constrained to live on a low-dimensional subspace that limits the capacity of gradient-based learning. By slightly altering the BN formula (2), Salimans & Kingma (2016) showed that renormalization via σ_ℓ smooths the optimization landscape and enables faster training. Similarly, Bjorck et al. (2018); Santurkar et al.
(2018); Kohler et al. (2019) confirmed BN's impact on the gradient distribution and optimization landscape through large-scale experiments. Using mean field theory, Yang et al. (2019) characterized the gradient statistics of BN in fully connected feed-forward networks with random weights to show that it regularizes the gradients and improves the conditioning of the optimization landscape. One should not take away from the above analyses that BN's only effect is to smooth the optimization loss surface or stabilize gradients. If this were the case, then BN would be redundant in advanced architectures like residual Li et al. (2017) and mollifying Gulcehre et al. (2016) networks, which have been proven to have improved optimization landscapes Li et al. (2018); Riedi et al. (2022) and have been coupled with advanced optimization techniques like Adam Kingma & Ba (2014). Quite to the contrary, BN significantly improves the performance of even these advanced networks and techniques. In this paper, we study BN theoretically from a different perspective that provides new insights into how it boosts DN optimization and inference performance. Our perspective is function approximation; we exploit the fact that most of today's state-of-the-art DNs are continuous piecewise affine (CPA) splines that fit a predictor to the training data via affine mappings defined over a partition of the input space (the so-called "linear regions"); see Balestriero & Baraniuk (2021; 2018); Balestriero et al. (2019) and Appendix B for more details. The key finding of our study is that BN is an unsupervised learning technique that, independent of the DN's weights or gradient-based learning, adapts the geometry of a DN's spline partition to match the data.
Our three main theoretical contributions are as follows:

• BN adapts the layer/DN input-space spline partition to minimize the total least squares (TLS) distance between the spline partition boundaries and the layer/DN inputs, thereby increasing the number of partition regions around the training data and enabling finer approximation (see Figure 1). The BN parameter µ_ℓ translates the boundaries towards the data, while the parameter σ_ℓ folds the boundaries towards the data (see Sections 2 and 3).

• BN's adaptation of the spline partition provides a "smart initialization" that boosts the performance of DN learning, because it adapts even a DN initialized with random weights W_ℓ to align the spline partition to the data (see Section 4).

• BN's statistics vary between mini-batches, which introduces a dropout-like random jitter perturbation to the partition boundaries and hence the decision boundary for classification problems. This jitter reduces overfitting and improves generalization by increasing the margin between the training samples and the decision boundary (see Section 5).

The proofs for our results are provided in the Appendices; the codebase to reproduce all the figures and results of the study will be released upon completion of the review process.

2. SINGLE-LAYER ANALYSIS OF BATCH NORMALIZATION

In this section, we investigate how BN impacts one individual DN layer. Our analysis leverages the identification that DN layers using continuous piecewise linear activation functions a in (1) and (2) are splines that partition their input space into convex polytopal regions. We show that the BN parameter µ ℓ translates the regions such that they concentrate around the training data.

2.1. BATCH NORMALIZATION DETAILS

The BN parameters β_ℓ, γ_ℓ, along with the DN weights W_ℓ, are learned directly through the optimization of the DN's objective function (e.g., squared error or cross-entropy) evaluated at the training data samples X = {x_i, i = 1, …, n} and labels (if available). Current practice performs the optimization using some flavor of stochastic gradient descent (SGD) over randomized mini-batches of training data samples B ⊂ X.

Our first new result is that we can set γ_ℓ = 1 and β_ℓ = 0 with no or negligible impact on DN performance for current architectures, training datasets, and tasks. First, we prove in Appendix D that we can set γ_ℓ = 1 both in theory and in practice.

Proposition 1. The BN parameter γ_ℓ ≠ 0 does not impact the approximation expressivity of a DN, because its value can be absorbed into W_{ℓ+1}, β_ℓ.

Second, we demonstrate numerically in Appendix D that setting β_ℓ = 0 has negligible impact on DN performance. Henceforth, we will assume for our theoretical analysis that γ_ℓ = 1, β_ℓ = 0 and will clarify for each experiment whether or not we enforce these constraints.

Let X_ℓ denote the collection of feature maps z_ℓ at the input to layer ℓ produced by all inputs x in the entire training data set X, and similarly let B_ℓ denote the collection of feature maps z_ℓ at the input to layer ℓ produced by all inputs x in the mini-batch B. For each mini-batch B during training, the BN parameters µ_ℓ, σ_ℓ are calculated directly as the mean and standard deviation of the current mini-batch feature maps B_ℓ:

µ_ℓ ← (1/|B_ℓ|) Σ_{z_ℓ ∈ B_ℓ} W_ℓ z_ℓ, σ_ℓ ← sqrt( (1/|B_ℓ|) Σ_{z_ℓ ∈ B_ℓ} (W_ℓ z_ℓ − µ_ℓ)² ), (3)

where the square on the right-hand side is taken element-wise. After SGD learning is complete, a final fixed "test time" mean µ_ℓ and standard deviation σ_ℓ are computed using the above formulae over all of the training data, i.e., with B_ℓ = X_ℓ. Note that no label information enters into the calculation of µ_ℓ, σ_ℓ.
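As a concrete sketch of (2)-(3), the following minimal numpy implementation (our illustration, not the paper's codebase) computes the mini-batch statistics and the BN-equipped layer output under the simplification γ_ℓ = 1, β_ℓ = 0 of Proposition 1; the leaky-ReLU slope and the small eps stabilizer are our assumptions:

```python
import numpy as np

def bn_layer_forward(W, Z, alpha=0.01, eps=1e-5):
    """One BN-equipped layer, Eq. (2), with gamma = 1 and beta = 0
    (the simplification justified by Proposition 1).

    W: (D_out, D_in) weight matrix; Z: (batch, D_in) mini-batch of layer inputs.
    Returns layer outputs and the per-unit BN statistics (mu, sigma)."""
    H = Z @ W.T                       # pre-activations before normalization
    mu = H.mean(axis=0)               # Eq. (3): element-wise mini-batch mean
    sigma = H.std(axis=0) + eps       # Eq. (3): element-wise mini-batch std
    H_bn = (H - mu) / sigma           # centered and normalized pre-activations
    return np.where(H_bn > 0, H_bn, alpha * H_bn), mu, sigma  # leaky-ReLU a(.)

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 2))            # 6 units, 2D input (as in Figure 1)
Z = rng.normal(loc=3.0, size=(64, 2))  # off-center mini-batch: BN recenters it
out, mu, sigma = bn_layer_forward(W, Z)
```

Note that, exactly as in the text, no label information is used anywhere in the computation of mu and sigma.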

2.2. DEEP NETWORK SPLINE PARTITION (ONE LAYER)

We focus on the lion's share of modern DNs that employ continuous piecewise-linear activation functions a in (1) and (2). To streamline our notation, but without loss of generality, we assume that a consists of exactly two linear pieces that connect at the origin, such as the ubiquitous ReLU (a(u) = max(0, u)), leaky-ReLU (a(u) = max(αu, u), 0 < α < 1), and absolute value (a(u) = max(−u, u)). The extension to more general continuous piecewise-linear activation functions is straightforward Balestriero & Baraniuk (2018; 2021); moreover, the extension to an infinite class of smooth activation functions (including the sigmoid-gated linear unit, among others) follows from a simple probabilistic argument Balestriero & Baraniuk (2019). Inserting pooling operators Goodfellow et al. (2016) between layers does not impact our results (see Appendix B).

A DN layer ℓ equipped with BN and employing such a piecewise-linear activation function is a continuous piecewise-affine (CPA) spline operator defined by a partition Ω_ℓ of the layer's input space R^{D_ℓ} into a collection of convex polytopal regions and a corresponding collection of affine transformations (one for each region) mapping layer inputs z_ℓ to layer outputs z_{ℓ+1}. Here we explain how the partition regions in Ω_ℓ are formed; then in Section 2.3 we begin our investigation of how these regions are transformed by BN. Define the pre-activation of layer ℓ by h_ℓ such that the layer output z_{ℓ+1} = a(h_ℓ); from (2) its k-th entry is given by

h_{ℓ,k} = (⟨w_{ℓ,k}, z_ℓ⟩ − µ_{ℓ,k}) / σ_{ℓ,k}. (4)

Note from (3) that σ_{ℓ,k} > 0 as long as ∥w_{ℓ,k}∥²₂ > 0 and the pre-activations ⟨w_{ℓ,k}, z_ℓ⟩ are not constant over the mini-batch. With typical CPA nonlinearities a, the k-th feature map output z_{ℓ+1,k} = a(h_{ℓ,k}) is linear in h_{ℓ,k} for all inputs whose pre-activation h_{ℓ,k} has the same sign.
The separation between those two linear regimes is formed by the collection of layer inputs z_ℓ that produce pre-activations with h_{ℓ,k} = 0 and hence lie on the (D_ℓ − 1)-dimensional hyperplane

H_{ℓ,k} = {z_ℓ ∈ R^{D_ℓ} : h_{ℓ,k} = 0} = {z_ℓ ∈ R^{D_ℓ} : ⟨w_{ℓ,k}, z_ℓ⟩ = µ_{ℓ,k}}. (5)

Note that H_{ℓ,k} is independent of the value of σ_{ℓ,k}. The boundary ∂Ω_ℓ of the layer's input-space partition Ω_ℓ is obtained simply by combining all of the H_{ℓ,k} into the hyperplane arrangement Zaslavsky (1975)

∂Ω_ℓ = ∪_{k=1}^{D_ℓ} H_{ℓ,k}. (6)

For additional results on the DN spline partition, see Montufar et al. (2014); Raghu et al. (2017); Serra et al. (2018); Balestriero et al. (2019); the only property of interest here is that, for all inputs lying in the same region ω ∈ Ω_ℓ, the layer mapping is a simple affine transformation

z_ℓ = Σ_{ω ∈ Ω_ℓ} (A_ℓ(ω) z_{ℓ−1} + b_ℓ(ω)) 1_{z_{ℓ−1} ∈ ω}

(see Appendix B).

2.3. BATCH NORMALIZATION PARAMETER µ TRANSLATES THE SPLINE PARTITION BOUNDARIES TOWARDS THE TRAINING DATA (PART 1)

With the above background in place, we now demonstrate that the BN parameter µ_ℓ impacts the spline partition Ω_ℓ of the input space of DN layer ℓ by translating its boundaries ∂Ω_ℓ towards the current mini-batch of layer inputs B_ℓ. We begin with some definitions. The Euclidean distance from a point v in layer ℓ's input space to the layer's k-th hyperplane H_{ℓ,k} is easily calculated as (e.g., Eq. 1 in Amaldi & Coniglio (2013))

d(v, H_{ℓ,k}) = |⟨w_{ℓ,k}, v⟩ − µ_{ℓ,k}| / ∥w_{ℓ,k}∥₂, (7)

as long as ∥w_{ℓ,k}∥₂ > 0. Then, the average squared distance between H_{ℓ,k} and a collection of points V in layer ℓ's input space is given by

L_k(µ_{ℓ,k}, V) = (1/|V|) Σ_{v ∈ V} d(v, H_{ℓ,k})², (8)

where we have made explicit the dependency of L_k on µ_{ℓ,k} through H_{ℓ,k}; when µ_{ℓ,k} and σ_{ℓ,k} are the BN statistics (3) computed on V, this quantity equals σ²_{ℓ,k}/∥w_{ℓ,k}∥²₂. Going one step further, the total least squares (TLS) distance (Samuelson, 1942; Golub & Van Loan, 1980) between a collection of points V in layer ℓ's input space and layer ℓ's partition Ω_ℓ is given by

L(µ_ℓ, V) = Σ_{k=1}^{D_ℓ} L_k(µ_{ℓ,k}, V). (9)

In Appendix E.1, we prove that the BN parameter µ_ℓ as computed in (3) is the unique solution of the strictly convex optimization problem of minimizing the average TLS distance (9) between the training data and layer ℓ's hyperplanes H_{ℓ,k}, and hence the spline partition region boundaries ∂Ω_ℓ.

Theorem 1. Consider layer ℓ of a DN as described in (2) and a mini-batch of layer inputs B_ℓ ⊂ X_ℓ. Then the mini-batch statistic µ_ℓ in (3) is the unique minimizer of L(µ_ℓ, B_ℓ), and the test-time statistic µ_ℓ computed over X_ℓ is the unique minimizer of L(µ_ℓ, X_ℓ).

In words, at each layer ℓ of a DN, BN explicitly adapts the input-space partition Ω_ℓ by using µ_ℓ to translate its boundaries H_{ℓ,1}, H_{ℓ,2}, … to minimize the TLS distance to the training data. Figure 1 demonstrates empirically in two dimensions how this translation focuses the layer's spline partition on the data. Moreover, this translation takes on a very special form.
We prove in Appendix E.2 that BN transforms the spline partition boundary ∂Ω_ℓ into a central hyperplane arrangement Stanley et al. (2004). It is worth noting that the above results do not involve any data label information, and so, at least as far as µ and σ are concerned, BN can be interpreted as an unsupervised learning technique.
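Theorem 1 can be checked numerically with a few lines of numpy (our toy sketch, with random weights and a synthetic off-center mini-batch as assumptions): the BN mean of (3) attains a strictly smaller TLS objective (9) than any random perturbation of it.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(6, 2))                     # rows w_k define hyperplanes H_k
Z = rng.normal(loc=[3.0, -2.0], size=(256, 2))  # toy mini-batch, away from origin

def tls(mu):
    """TLS objective, Eq. (9): sum over units of the batch-averaged
    squared point-to-hyperplane distance, Eqs. (7)-(8)."""
    d2 = (Z @ W.T - mu) ** 2 / (W ** 2).sum(axis=1)  # squared Eq. (7), per unit
    return d2.mean(axis=0).sum()

mu_bn = (Z @ W.T).mean(axis=0)                  # BN's mean, Eq. (3)
# Theorem 1: perturbing mu away from the BN mean increases the TLS loss
for _ in range(100):
    assert tls(mu_bn) < tls(mu_bn + rng.normal(scale=0.5, size=6))
```

The strict inequality holds for every random perturbation because L(·, B_ℓ) is strictly convex with the BN mean as its unique minimizer.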

3. MULTIPLE LAYER ANALYSIS OF BATCH NORMALIZATION

We now extend the single-layer analysis of the previous section to the composition of two or more DN layers. We begin by showing that the layers' µ ℓ continue to translate the hyperplanes that comprise the spline partition boundary such that they concentrate around the training data. We then show that the layers' σ ℓ fold those same hyperplanes with the same goal.

3.1. DEEP NETWORK SPLINE PARTITION (MULTIPLE LAYERS)

Taking advantage of the fact that a composition of multiple CPA splines is itself a CPA spline, we now extend the results from Section 2.2 regarding one DN layer to the composition of layers 1, …, ℓ, ℓ > 1, that maps the DN input x to the feature map z_{ℓ+1}. Denote the partition of this mapping by Ω_{|ℓ}, where we introduce the shorthand |ℓ to denote the mapping through layers 1, …, ℓ. Appendix B provides closed-form formulas for the per-region affine mappings. As in Section 2.2, we are primarily interested in the boundary ∂Ω_{|ℓ} of the spline partition Ω_{|ℓ}. Recall that the boundary of the spline partition of a single layer was easily found in (4)-(6) as the arrangement of the hyperplanes formed where the layer's pre-activations equal zero. With multiple layers, the situation is almost the same as in (5):

∂Ω_{|ℓ} = ∪_{j=1}^{ℓ} ∪_{k=1}^{D_j} {x ∈ R^{D_1} : h_{j,k} = 0}; (10)

further details are provided in Appendix B. The salient result of interest to us is that ∂Ω_{|ℓ} is constructed from the hyperplanes H_{j,k} pulled back through the preceding layer(s). This process folds those hyperplanes (a toy depiction is given in Figure 8) based on the preceding layers' partitions, such that each folded H_{j,k} consists of a collection of facets

F_{j,k,ω} = {x ∈ ω : h_{j,k} = 0} = {x ∈ ω : ⟨w_{j,k}, z_j⟩ = µ_{j,k}}, ω ∈ Ω_{|j},

which simplifies (10) to

∂Ω_{|ℓ} = ∪_{j=1}^{ℓ} ∪_{k=1}^{D_j} F_{j,k}, where F_{j,k} ≜ ∪_{ω ∈ Ω_{|j}} F_{j,k,ω}. (11)

Figure 2: (a) H_{2,k} varying µ_{2,k}; (b) F_{2,k} varying µ_{2,k}; (c) F_{2,k} varying σ_1.

3.2. BATCH NORMALIZATION PARAMETER µ TRANSLATES THE SPLINE PARTITION BOUNDARIES TOWARDS THE TRAINING DATA (PART 2)

In the one-layer case, we saw in Theorem 1 that BN independently translates each DN layer's hyperplanes H_{ℓ,k} towards the training data to minimize the TLS distance. In the multilayer case, as we have just seen, those hyperplanes H_{ℓ,k} become folded hyperplanes F_{ℓ,k} (recall (10)). We now demonstrate that the BN parameter µ_ℓ translates the folded hyperplanes F_{ℓ,k}, and thus adapts Ω_{|ℓ}, towards the input-space training data X.
To this end, denote the squared distance from a point x in the DN input space to the folded hyperplane F_{ℓ,k} by

d(x, F_{ℓ,k}) = min_{x′ ∈ F_{ℓ,k}} ∥x − x′∥². (12)

Theorem 2. Consider layer ℓ > 1 of a DN as described in (2) with fixed weight matrices and BN parameters for layers 1 through ℓ − 1 and fixed weights W_ℓ. Let µ_ℓ and µ′_ℓ yield the hyperplanes H_{ℓ,k} and H′_{ℓ,k} and the corresponding folded hyperplanes F_{ℓ,k} and F′_{ℓ,k}. Then we have that

d(z_ℓ(x), H_{ℓ,k}) < d(z_ℓ(x), H′_{ℓ,k}) ⟹ d(x, F_{ℓ,k}) < d(x, F′_{ℓ,k}).

In words, translating a hyperplane closer to z_ℓ in layer ℓ's input space moves the corresponding folded hyperplane closer to the DN input x that produced z_ℓ, which is of particular interest for inputs x_i from the training data X. We also have the following corollary.

Corollary 1. Consider layer ℓ > 1 of a trained BN-equipped DN as described in Theorem 2. Then z_ℓ(x) lies on the hyperplane H_{ℓ,k} for some k if and only if x lies on the corresponding folded hyperplane F_{ℓ,k} in the DN input space; that is, d(z_ℓ(x), H_{ℓ,k}) = 0 ⟺ d(x, F_{ℓ,k}) = 0.

Figure 2(b) illustrates empirically how µ_2 translates the folded hyperplanes F_{2,k} realized by the second layer of a toy DN. The impact of the BN parameter σ_ℓ is also important, albeit less so than µ_ℓ. For completeness, we study how this parameter translates and folds adjacent facets composing the folded hyperplanes F_{ℓ,k} in Figure 2 and Appendix C.

3.3. BATCH NORMALIZATION INCREASES THE DENSITY OF PARTITION REGIONS AROUND THE TRAINING DATA

We now extend the toy DN numerical experiments reported in Figure 1 to more realistic DN architectures and higher-dimensional settings. We focus on three settings, all involving random weights W_ℓ: (i) zero bias c_ℓ = 0 in (1), (ii) random bias c_ℓ in (1), and (iii) BN in (2).

2D Toy Dataset. We continue visualizing the effect of BN in a 2D input space by reproducing the experiment of Figure 1 but with a more realistic DN with 11 layers of width 1024 and training data consisting of 50 samples from a star shape in 2D (see the leftmost plots in Figure 3). For each of the above three settings, Figure 3 visualizes in the 2D input space the concentration of the contribution to the partition boundary (recall (11)) from three specific layers (j = 1, 7, 11). The concentration at each point in the input space corresponds to the number of folded hyperplane facets F_{j,k,ω} passing through an ϵ-ball centered on that point. This concentration calculation can be performed efficiently via the technique of Harman & Lacko (2010).

CIFAR Images. We now consider a high-dimensional input space (CIFAR images) with a Resnet9 architecture (random weights). Since we cannot visualize the spline partition in the 3072-dimensional input space, we present a summary of the same concentration analysis carried out in Figure 3 for the partition boundary by measuring the number of folded hyperplanes passing through an ϵ-ball centered around 100 randomly sampled training images (BN statistics are obtained from the full training set). We report these observations in Figure 4 (left) and again clearly see that, in contrast to zero and random bias initialization, BN focuses the spline partition to lie close to the data points. To quantify the concentration of regions away from the training data, we repeat the same measurement but for 100 iid random Gaussian images that are scaled to the same mean and variance as the CIFAR images.
For this case, we observe in Figure 4 (right) that BN focuses the partition only around the CIFAR images and not around the random ones. This concurs with the low-dimensional experiment in Figure 3.
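The ϵ-ball concentration measurement can be illustrated with a simplified one-layer proxy (our sketch, not the Harman & Lacko (2010) procedure or the paper's Resnet9 experiment; the data offset, dimensions, and ϵ are arbitrary assumptions): count the hyperplanes passing within ϵ of each sample, with and without BN centering.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(loc=2.0, size=(100, 32))   # toy "training data", off the origin
W = rng.normal(size=(512, 32))            # 512 random units

def boundary_hits(bias, eps=0.5):
    """Average number of hyperplanes {x : <w_k, x> = bias_k} passing within
    an eps-ball of each sample, via the point-to-hyperplane distance (7)."""
    d = np.abs(X @ W.T - bias) / np.linalg.norm(W, axis=1)
    return (d < eps).sum(axis=1).mean()

hits_zero_bias = boundary_hits(np.zeros(512))     # setting (i): zero bias
hits_bn = boundary_hits((X @ W.T).mean(axis=0))   # setting (iii): BN's mu, Eq. (3)
# BN centering draws far more partition boundaries near the data
assert hits_bn > hits_zero_bias
```

Because the toy data is offset from the origin, the zero-bias hyperplanes all pass near the origin and rarely intersect the ϵ-balls, whereas the BN-centered hyperplanes cut through the data cloud, mirroring the top/bottom rows of Figure 1.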

4. BENEFIT ONE: BATCH NORMALIZATION IS A SMART INITIALIZATION

Up to this point, our analysis has revealed that BN effects a task-independent, unsupervised learning that aligns a DN's spline partition with the training data independent of any labels. We now demonstrate that BN's utility extends to supervised learning tasks like regression and classification that feature both data and labels. In particular, BN can be viewed as providing a "smart initialization" that expedites SGD-based supervised learning: it reaches a higher-performing solution faster because it provides SGD with a spline partition that is already adapted to the training dataset. This finding provides a novel and complementary understanding of BN, which was previously studied solely when continuously employed during training as a means to better condition a DN's loss landscape. We also show in Figure 5 in the Appendix a similar observation on CIFAR100.

ResNet9 with random weights. In each experiment, the weights and biases are initialized according to one of the three strategies (zero bias, random bias, and BN over the entire training set), and then standard SGD is performed over mini-batches. Importantly, in the BN case, the BN parameters µ_ℓ, σ_ℓ computed for the initialization are frozen for all SGD iterations; this showcases BN's role as a smart initialization. As we see from Figure 5, SGD with BN initialization converges faster and to a higher-performing classifier. Since BN is only used as an initializer here, the performance gain can be attributed to the better positioning of the ResNet's initial spline partition Ω.

5. BENEFIT TWO: BATCH NORMALIZATION INCREASES MARGINS

The BN parameters µ_ℓ, σ_ℓ are re-calculated for each mini-batch B_ℓ and hence can be interpreted as stochastic estimators of the mean and standard deviation of the complete training data set X_ℓ. A direct calculation Von Mises (2014) yields the following result.

Proposition 2. Consider layer ℓ > 1 of a BN-equipped DN as described in (2). Assume that the layer input z_ℓ follows an arbitrary iid distribution with zero mean and diagonal covariance matrix diag(ρ). Then we have that

var(µ_{ℓ,k}) = ⟨w²_{ℓ,k}, ρ⟩ / |B_ℓ| ≤ ∥w²_{ℓ,k}∥ ∥ρ∥ / |B_ℓ|, var(σ²_{ℓ,k}) = (1/|B_ℓ|) ( φ⁴_{ℓ,k} − ⟨w²_{ℓ,k}, ρ⟩² (|B_ℓ| − 3)/(|B_ℓ| − 1) ), (13)

with φ⁴_{ℓ,k} the fourth-order central moment of ⟨w_{ℓ,k}, z_ℓ⟩ and w²_{ℓ,k} the coordinate-wise square.

Consequently, during learning, BN's centering and scaling introduce both additive and multiplicative noise into the feature maps, whose variance increases as the mini-batch size |B_ℓ| decreases. This noise becomes detrimental for small mini-batches, as has been empirically observed in Ioffe (2017). We illustrate this result in Figure 6, where we depict the DN decision boundary realizations from different mini-batches. We also provide in Figure 7 the empirical, analytical, and controlled parameter distributions of BN applied to a Gaussian input with varying mean and variance.

Figure 6 suggests the interpretation that BN noise induces "jitter" in the DN decision boundary. Small amounts of jitter can be beneficial, since the jitter forces the DN to learn a representation with an increased margin around the decision boundary. Large amounts of jitter can be detrimental, since the increased margin around the decision boundary might be too large for the classification task.

Figure 7: Empirical, analytical, and noise-controlled distributions of the BN parameters for W_ℓ z_{ℓ−1} ∼ N([1, 0, −1], diag([1, 3, 0.1])) (ℓ is not relevant in this context). The empirical distributions in black are obtained by repeatedly sampling mini-batches of size 64 from a training set of size 1000. The analytical distributions in green are from (13). The empirical distributions in red are of the noise-controlled BN, where we increase the variances of the BN parameters to match the variances that would result from a virtual mini-batch size (32) that is smaller than the actual size (64), hence producing more regularizing jitter perturbations. The different realizations of µ_ℓ and σ_ℓ for each mini-batch affect the geometry of the DN partition and decision boundary (see Figure 2) for the current mini-batch.

This jitter noise is reminiscent of dropout and other techniques that artificially add noise into the DN during training. We focus here on the effect of BN jitter on DNs for classification problems, but jitter will also improve DN performance on regression problems. To further demonstrate that jitter noise increases the margin between the learned decision boundary and the training set (and hence improves generalization), we conducted an experiment where we fed additional Gaussian additive noise and Chi-squared multiplicative noise into the layer pre-activations of a DN to increase the variances of µ_ℓ and σ_ℓ as desired (as in Figure 7), in addition to the BN-induced noise. For a Resnet9, we observed that increasing these variances by about 15% increased the classification accuracies (averaged over 5 runs) from 93.34% to 93.68% (CIFAR10), from 72.22% to 72.74% (CIFAR100), and from 96.16% to 96.41% (SVHN). Note that this performance boost comes in addition to that obtained from BN's smart initialization (recall Section 4).
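The var(µ_{ℓ,k}) expression in Proposition 2 can be checked with a minimal Monte Carlo sketch (our illustration, not the paper's code; the Gaussian layer input, dimension, and ρ values are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
D, batch, trials = 8, 64, 5000
w = rng.normal(size=D)                     # one unit's weight row w_{l,k}
rho = rng.uniform(0.5, 2.0, size=D)        # diagonal of cov(z), zero-mean z

# empirical variance of the BN mean mu_{l,k} across many mini-batches
mu_samples = np.empty(trials)
for t in range(trials):
    Z = rng.normal(size=(batch, D)) * np.sqrt(rho)  # z: zero mean, cov diag(rho)
    mu_samples[t] = (Z @ w).mean()                  # Eq. (3) for one unit

analytic = (w**2 @ rho) / batch            # Proposition 2: var(mu) = <w^2, rho>/|B|
assert abs(mu_samples.var() - analytic) / analytic < 0.1
```

Halving `batch` doubles the analytical and empirical variances alike, reproducing the text's point that the jitter strength grows as the mini-batch size shrinks.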

6. CONCLUSIONS

In this paper, our theoretical analysis of BN shed light on and explained two novel, crucial ways in which BN boosts DN performance. First, BN provides a "smart initialization" that solves a total least squares (TLS) optimization to adapt the DN input-space spline partition to the data, improving learning and ultimately function approximation. Second, for classification applications, BN introduces a random jitter perturbation to the DN decision boundary that forces the model to learn boundaries with increased margins to the closest training samples. From the results that we derived, one can directly see how to further improve batch normalization: for example, by controlling the standard deviation of the noise on the batch-normalization parameters to further control the decision boundary margin, or by altering the optimization problem that batch normalization minimizes to enforce a specific adaptivity of the DN partition to the dataset at hand. We hope that this work will motivate researchers to further extend BN into task-specific methods leveraging a priori knowledge of the properties one desires to enforce in their DN.

7. REPRODUCIBILITY STATEMENT

The proofs and further derivations of the theoretical results provided throughout the main text are given in the Appendix, along with additional explanatory figures. The codebase to reproduce not only the quantitative experiments (e.g., the performance without learnable γ and β, or the performance of BN used only as an initialization) but also the qualitative ones will be released upon completion of the review process, in the hope of furthering the theoretical study of BN and of improving this eponymous technique. We did not consider any sensitive datasets and focused on standard computer vision benchmarks, namely the CIFAR variants and ImageNet. For the architectures, we used the standard Resnet50 and Resnet9; additional details are provided in the Appendix.

A APPENDIX

We provide in the following appendices details on the Max-Affine Spline formulation of DNs (Sec. B) and the proofs of the various theoretical results from the main text (Sec. E).

B DETAILS ON CONTINUOUS PIECEWISE AFFINE DEEP NETWORKS

The goal of this section is to provide additional details on the formation of the per-region affine mappings of CPA DNs. As mentioned in the main text, any DN that is formed from CPA nonlinearities can itself be expressed as a CPA operator. The per-region affine mappings are thus entirely defined by the states of the DN nonlinearities. For an activation function such as ReLU, leaky-ReLU, or absolute value, the nonlinearity state is completely determined by the sign of the activation input, as it determines which of the two linear mappings is applied to produce the output. Denote this code as q_ℓ(z_{ℓ−1}) ∈ {α, 1}^{D_ℓ}, given by

[q_ℓ(z_{ℓ−1})]_i = α if [W_ℓ z_{ℓ−1} + b_ℓ]_i ≤ 0, and 1 if [W_ℓ z_{ℓ−1} + b_ℓ]_i > 0, (14)

where the pre-activation formula above can be replaced with the one from (2) if BN is employed. For a max-pooling type of nonlinearity, the state corresponds to the argmax obtained in each pooling region; for details on how to generalize the results below to that case, we refer the reader to Balestriero & Baraniuk (2018). Based on the above, the layer input-output mapping can be written as

z_ℓ = Q_ℓ(z_{ℓ−1})(W_ℓ z_{ℓ−1} + b_ℓ), (15)

where Q_ℓ produces a diagonal matrix from the vector q_ℓ, and one has α = 0 for ReLU, α = −1 for absolute value, and α > 0 for leaky-ReLU; see Balestriero & Baraniuk (2018) for additional details. The up-to-layer-ℓ mapping can thus easily be written as z_ℓ = A_{1|ℓ}(x) x + B_{1|ℓ}(x) with the slope and bias parameters

A_{1|ℓ}(x) = Q_ℓ W_ℓ Q_{ℓ−1} W_{ℓ−1} ⋯ Q_1 W_1, B_{1|ℓ}(x) = Σ_{i=1}^{ℓ} (Q_ℓ W_ℓ Q_{ℓ−1} W_{ℓ−1} ⋯ Q_{i+1} W_{i+1}) Q_i b_i,

where for clarity we abbreviate Q_ℓ(z_{ℓ−1}) as Q_ℓ (the i = ℓ term of the sum is simply Q_ℓ b_ℓ). From the above formulation, it is clear that whenever the codes q_ℓ stay the same for different inputs, the layer input-output mapping remains affine. This defines a region ω of the layer-input-space partition Ω_ℓ from (6) as

ω_q^ℓ = {z_{ℓ−1} ∈ R^{D_{ℓ−1}} : q_ℓ(z_{ℓ−1}) = q}, q ∈ {α, 1}^{D_ℓ}. (16)
In a similar way, the up-to-layer-ℓ input-space partition region can be defined. The multilayer region is given by

ω_{1|ℓ}^{q_1,…,q_ℓ} = ∩_{i=1}^{ℓ} {x ∈ R^D : q_i(x) = q_i}, q_i ∈ {α, 1}^{D_i}. (17)

Definition 1 (Layer/DN partition). The layer-ℓ (resp. up-to-layer-ℓ) input-space partition is given by

Ω_ℓ = {ω_q^ℓ, q ∈ {α, 1}^{D_ℓ}} \ ∅, Ω_{1|ℓ} = {ω_{1|ℓ}^{q_1,…,q_ℓ}, q_i ∈ {α, 1}^{D_i} ∀i} \ ∅.

Note that Ω_{1|L} forms the entire DN input-space partition. Both the unit and layer input-space partitions can be rewritten as Power Diagrams, a generalization of Voronoi Diagrams Balestriero et al. (2019). Composing layers then successively refines the previously built input-space partition via a subdivision process to obtain the up-to-layer-ℓ input-space partition Ω_{|ℓ}.
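The per-region affine mapping above can be verified in a few lines of numpy (our toy sketch, with arbitrary dimensions and a leaky-ReLU slope as assumptions): record the codes q_ℓ during a forward pass, accumulate the slope A and bias B, and check that they reproduce the DN output exactly on the input's region.

```python
import numpy as np

rng = np.random.default_rng(4)
dims = [3, 5, 4, 2]                        # input dim D, then three layer widths
Ws = [rng.normal(size=(dims[i + 1], dims[i])) for i in range(3)]
bs = [rng.normal(size=dims[i + 1]) for i in range(3)]
alpha = 0.1                                # leaky-ReLU slope

x = rng.normal(size=dims[0])

# forward pass, recording the code q_l (Eq. 14) and Q_l at every layer
z, Qs = x, []
for W, b in zip(Ws, bs):
    h = W @ z + b
    q = np.where(h > 0, 1.0, alpha)        # code in {alpha, 1}^{D_l}
    Qs.append(np.diag(q))
    z = q * h                              # Eq. (15): z = Q (W z + b)

# accumulate the per-region slope A and bias B: if z = A x + B before a layer,
# then after it z = (Q W A) x + Q (W B + b)
A, B = np.eye(dims[0]), np.zeros(dims[0])
for W, b, Q in zip(Ws, bs, Qs):
    A = Q @ W @ A
    B = Q @ (W @ B + b)

assert np.allclose(z, A @ x + B)           # the DN is affine on x's region
```

Any other input with the same codes q_1, …, q_ℓ, i.e., in the same region of Ω_{1|ℓ}, yields the same (A, B), which is exactly the CPA spline structure exploited throughout the paper.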

C BATCH NORMALIZATION PARAMETER σ FOLDS THE PARTITION REGIONS TOWARDS THE TRAINING DATA

We are now in a position to describe the effect of the BN parameter σ_ℓ, which has no effect on the spline partition of an individual DN layer but comes into play for a composition of two or more layers. In contrast to the hyperplane translation effected by µ_ℓ, σ_ℓ optimizes the dihedral angles between adjacent facets of the folded hyperplanes in subsequent layers in order to swing them closer to the training data. Define $Q_\ell$ as the square diagonal matrix whose $i$-th diagonal entry is determined by the sign of the pre-activation $h_{\ell,i}$:

$$[Q_\ell]_{i,i} = \begin{cases} \alpha/\sigma_{\ell,i}, & h_{\ell,i} < 0 \\ 1/\sigma_{\ell,i}, & h_{\ell,i} \ge 0, \end{cases} \qquad (23)$$

with α = 0 for ReLU, α > 0 for leaky-ReLU, and α = -1 for absolute value (recall the definition of the activation function from Section 2.2; see Balestriero & Baraniuk (2019) for additional nonlinearities). Note that $Q_\ell$ is constant across each region ω of the layer's spline partition Ω_ℓ since, by definition, none of the pre-activations $h_{\ell,i}$ changes sign within ω. We write $Q_\ell(\omega)$ to denote the dependency on the region ω; to compute $Q_\ell(\omega)$ given ω, one merely samples a layer input from ω, computes the pre-activations $h_{\ell,i}$, and applies (23). In Appendix ?? we prove that the BN parameter σ_ℓ adjusts the dihedral folding angles of adjacent facets of each folded hyperplane created by layer ℓ+1 in order to align the facets with the training data. Figure 2(c) illustrates empirically how σ_1 folds the facets of F_{2,k} realized by the second layer of a toy DN. Since the relevant expressions quickly (but predictably) grow in length with the number of layers, to expose the salient points, but without loss of generality, we focus the next theorem on the composition of the first two DN layers (layers ℓ = 1, 2). In this case, two geometric quantities of interest combine to create the input-space spline partition Ω = Ω_{1|2}: layer 1's hyperplanes H_{1,i} and layer 2's folded hyperplanes F_{2,k}. Theorem 3.
Given a 2-layer (ℓ = 1, 2) BN-equipped DN employing a leaky-ReLU or absolute value activation function, consider two adjacent regions ω, ω′ of the spline partition Ω whose boundaries contain the facets F_{2,k,ω}, F_{2,k,ω′} created by folding across the boundaries' shared hyperplane H_{1,i} (see Figure 8). The dihedral angles between these three (partial) hyperplanes are given by

$$\theta\left(F_{2,k,\omega}, H_{1,i}\right) = \arccos\left(\frac{w_{2,k}^\top Q_1(\omega) W_1 w_{1,i}}{\left\|w_{2,k}^\top Q_1(\omega) W_1\right\| \left\|w_{1,i}\right\|}\right), \qquad (24)$$

$$\theta\left(F_{2,k,\omega'}, H_{1,i}\right) = \arccos\left(\frac{w_{2,k}^\top Q_1(\omega') W_1 w_{1,i}}{\left\|w_{2,k}^\top Q_1(\omega') W_1\right\| \left\|w_{1,i}\right\|}\right), \qquad (25)$$

$$\theta\left(F_{2,k,\omega}, F_{2,k,\omega'}\right) = \arccos\left(\frac{w_{2,k}^\top Q_1(\omega) W_1 W_1^\top Q_1(\omega') w_{2,k}}{\left\|w_{2,k}^\top Q_1(\omega) W_1\right\| \left\|w_{2,k}^\top Q_1(\omega') W_1\right\|}\right). \qquad (26)$$

In (24)-(26), $Q_1(\omega)$ and $Q_1(\omega')$ differ in only one diagonal entry, at index $(i,i)$: one matrix takes the value $1/\sigma_{1,i}$ and the other $\alpha/\sigma_{1,i}$, as per (23). Since (8) implies that $\sigma^2_{\ell,i} \propto \min_{\mu_{\ell,i}} \mathcal{L}(\mu_{\ell,i}, B_\ell)$, we have the following two insights, which we state for ℓ = 1. (These results extend straightforwardly to more than two layers and to more complicated piecewise-linear activation functions.) On the one hand, if H_{1,i} is well aligned with the training data B_1, then the TLS error, and hence $\sigma^2_{1,i}$, will be small. For the absolute value activation function, formulae (24) and (25) then tell us that both θ(F_{2,k,ω}, H_{1,i}) ≈ 0 and θ(F_{2,k,ω′}, H_{1,i}) ≈ 0, meaning that F_{2,k,ω} and F_{2,k,ω′} will be folded to closely align with H_{1,i} (and hence with the data). The connection between $\sigma^2_{\ell,i}$ and (24)-(25) lies in the entries of the matrix $Q_1$: σ_{ℓ,i} appears in the denominator of the $(i,i)$ entry and thus controls how strongly the angles are bent. For the ReLU/leaky-ReLU activation function, either F_{2,k,ω} or F_{2,k,ω′} will be folded to closely align with H_{1,i}; the other facet will be unchanged or only mildly folded. On the other hand, if H_{1,i} is not well aligned with the training data B_1, then the TLS error, and hence $\sigma^2_{1,i}$, will be large.
This forces $Q_1(\omega) \approx Q_1(\omega')$ and thus θ(F_{2,k,ω}, F_{2,k,ω′}) ≈ π, meaning that a poorly aligned layer-1 hyperplane H_{1,i} will not appreciably fold intersecting layer-2 facets. Figure 9 illustrates empirically, for a toy DN, how the BN parameter σ_{ℓ,k} measures the quality of the fit of the (folded) hyperplanes to the data in the TLS-error sense. In summary, and extrapolating to the general case, the effect of the BN parameter σ_ℓ is to fold those layer-(ℓ+1) hyperplanes (and also layer-(ℓ+2) and subsequent hyperplanes) that contribute to the spline partition boundary ∂Ω so as to align them with the layer-ℓ hyperplanes that match the data well. Hence, not only µ_ℓ but also σ_ℓ plays a crucial role in aligning the DN spline partition with the training data (recall Figure 1).
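To make the angle formulas concrete, the following NumPy sketch (the random toy weights, σ values, leaky-ReLU slope, and region codes are illustrative assumptions) evaluates the arccos expressions of Theorem 3 and checks the limiting behavior discussed above: when σ_{1,i} is large, Q_1(ω) ≈ Q_1(ω′), so the two facet normals become parallel and the facets barely fold across H_{1,i} (dihedral angle ≈ π in the theorem's convention):

```python
import numpy as np

rng = np.random.default_rng(1)
D1, alpha = 6, 0.2                       # leaky-ReLU slope (assumption)
sigma1 = rng.uniform(0.5, 2.0, D1)       # BN sigma parameters (toy values)
W1 = rng.standard_normal((D1, 2))        # layer-1 weights, 2D input space
w2k = rng.standard_normal(D1)            # row w_{2,k} of layer 2
i = 3                                    # index of the shared hyperplane H_{1,i}

def Q1(signs):
    # Diagonal matrix from (23): 1/sigma or alpha/sigma depending on sign.
    return np.diag(np.where(signs >= 0, 1.0, alpha) / sigma1)

def angle(u, v):
    c = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(c, -1.0, 1.0))

signs = rng.standard_normal(D1)
s_w, s_wp = signs.copy(), signs.copy()
s_w[i], s_wp[i] = 1.0, -1.0              # the two regions differ only in unit i

a = W1.T @ Q1(s_w) @ w2k                 # normal of facet F_{2,k,omega}
b = W1.T @ Q1(s_wp) @ w2k                # normal of facet F_{2,k,omega'}
theta1, theta2, theta3 = angle(a, W1[i]), angle(b, W1[i]), angle(a, b)

# Large sigma_{1,i} (poor TLS fit of H_{1,i}) makes Q1(omega) ~ Q1(omega'),
# so the facet normals become nearly parallel: almost no fold across H_{1,i}.
sigma1[i] = 1e6
assert angle(W1.T @ Q1(s_w) @ w2k, W1.T @ Q1(s_wp) @ w2k) < 1e-2
```

The code measures angles between facet normals, so "no fold" corresponds to an angle near 0 between the normals, i.e., a dihedral angle near π between the facets themselves.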

D ROLE OF THE BN PARAMETERS β, γ AND PROOF OF PROPOSITION 1

Without loss of generality, consider a simple two-layer DN to illustrate. Then it is clear that γ_1 simply rescales the rows of W_2 and the entries of β_1:

$$z_{2,r} = a\left(\frac{\sum_{k=1}^{D_1} [w_{2,r}]_k\, a\left(\frac{\langle w_{1,k}, x\rangle - \mu_{1,k}}{\sigma_{1,k}}\gamma_{1,k} + \beta_{1,k}\right) - \mu_{2,r}}{\sigma_{2,r}}\gamma_{2,r} + \beta_{2,r}\right) \qquad (27)$$

$$= a\left(\frac{\sum_{k=1}^{D_1} \gamma_{1,k} [w_{2,r}]_k\, a\left(\frac{\langle w_{1,k}, x\rangle - \mu_{1,k}}{\sigma_{1,k}} + \frac{\beta_{1,k}}{\gamma_{1,k}}\right) - \mu_{2,r}}{\sigma_{2,r}}\gamma_{2,r} + \beta_{2,r}\right) \qquad (28)$$

$$= a\left(\frac{\sum_{k=1}^{D_1} [w'_{2,r}]_k\, a\left(\frac{\langle w_{1,k}, x\rangle - \mu_{1,k}}{\sigma_{1,k}} + \beta'_{1,k}\right) - \mu_{2,r}}{\sigma_{2,r}}\gamma_{2,r} + \beta_{2,r}\right) \qquad (29)$$

and so γ_1 does not need to be optimized separately. Here, $[w_{2,r}]_k$ denotes the $k$-th entry of the $r$-th row of $W_2$, and we set $[w'_{2,r}]_k = \gamma_{1,k}[w_{2,r}]_k$ and $\beta'_{1,k} = \beta_{1,k}/\gamma_{1,k}$. We obtain (28) because standard activation and pooling functions (e.g., (leaky-)ReLU, absolute value, max-pooling) satisfy $a(cu) = c\,a(u)$ for $c > 0$. This leaves β_ℓ as the only learnable BN parameter that needs to be considered. Now, our theoretical analysis of BN relies on setting β_ℓ = 0, which corresponds to the standard initialization of BN. In this setting, we saw that BN "fits" the partition boundaries to the data samples exactly. Observing the form of (2), it is clear that learning β_ℓ enables the network to "undo" this fitting if needed to better solve the task at hand. However, we have found that in most practical scenarios, fixing β_ℓ = 0 throughout training does not substantially alter final performance. On Imagenet, the top-1 accuracy of a Resnet18 goes from 67.60% to 65.93% under this change, and of a Resnet34 from 68.71% to 67.21%. The drop remains similar even for more complicated architectures: for a Resnet50, the top-1 test accuracy goes from 77.11% to 74.98%. We emphasize that the exact same hyper-parameters are employed in both situations, i.e., the drop could potentially be reduced by tuning the hyper-parameters.
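The reparameterization in (27)-(29) can be verified numerically. The following NumPy sketch (random toy weights, a leaky-ReLU, and positive γ values are illustrative assumptions) folds γ_1 into W_2 and β_1 and checks that the layer-2 pre-activations are unchanged:

```python
import numpy as np

rng = np.random.default_rng(2)
leaky = lambda u: np.where(u > 0, u, 0.01 * u)  # a(cu) = c a(u) for c > 0

x = rng.standard_normal(3)
W1, W2 = rng.standard_normal((5, 3)), rng.standard_normal((4, 5))
mu1 = rng.standard_normal(5)
s1 = rng.uniform(0.5, 2.0, 5)
g1 = rng.uniform(0.5, 2.0, 5)            # gamma_1 > 0 (toy values)
b1 = rng.standard_normal(5)              # beta_1

# Layer-2 pre-activations with gamma_1, beta_1 in place, as in (27).
h = leaky((W1 @ x - mu1) / s1 * g1 + b1)
out = W2 @ h

# Equivalent network with gamma_1 folded into W2' and beta_1', as in (28)-(29).
W2p = W2 * g1                            # columns of W2 rescaled by gamma_1
b1p = b1 / g1                            # beta' = beta / gamma
outp = W2p @ leaky((W1 @ x - mu1) / s1 + b1p)

assert np.allclose(out, outp)
```

Equality of the pre-activations implies equality of the full layer-2 output, since the subsequent BN and activation are applied identically in both parameterizations.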
Obviously these numbers might vary with the optimizer and other hyper-parameters; we used the standard ones for these architectures, since our point is merely that all of our theoretical results relying on β_ℓ = 0, γ_ℓ = 1 still apply to high-performing models. In addition to the above Imagenet results, we provide in Table 1 the classification accuracy on various datasets and with two architectures. This set of results simply demonstrates that the learnable parameters of BN (γ, β) have very little impact on performance.

Returning to the proof of Theorem 1 (Appendix E.1), the first-order condition is obtained as follows:

$$\frac{\partial}{\partial \mu_k} \sum_{z \in Z} \frac{\left(\langle w_{\ell,k}, z\rangle - \mu_k\right)^2}{\|w_{\ell,k}\|_2^2} = -2\sum_{z \in Z} \frac{\langle w_{\ell,k}, z\rangle - \mu_k}{\|w_{\ell,k}\|_2^2} = -2\sum_{z \in Z} \frac{\langle w_{\ell,k}, z\rangle}{\|w_{\ell,k}\|_2^2} + 2\,\mathrm{Card}(Z)\,\frac{\mu_k}{\|w_{\ell,k}\|_2^2}. \qquad (33)\text{-}(35)$$

This first derivative of the total least squares (quadratic) loss is thus a linear function of $\mu_k$, vanishing at the unique point given by

$$-2\sum_{z \in Z} \frac{\langle w_{\ell,k}, z\rangle}{\|w_{\ell,k}\|_2^2} + 2\,\mathrm{Card}(Z)\,\frac{\mu_k}{\|w_{\ell,k}\|_2^2} = 0 \iff \mu_k = \frac{\sum_{z \in Z} \langle w_{\ell,k}, z\rangle}{\mathrm{Card}(Z)},$$

confirming that the per-dimension average of the pre-activation feature maps is indeed the optimum of the optimization problem. One easily verifies that this is a minimum by taking the second derivative of the total least squares loss, which is positive and given by $2\,\mathrm{Card}(Z)/\|w_{\ell,k}\|_2^2$. The same argument applies to each dimension $k$. Inserting this optimal value back into the total least squares loss yields the desired result.
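The TLS derivation above can be checked in a few lines (the random mini-batch and weight row are illustrative assumptions): the BN mean of the pre-activations minimizes the TLS loss, and the minimal loss equals Card(Z) times the BN variance divided by the squared weight norm, as stated in Theorem 1:

```python
import numpy as np

rng = np.random.default_rng(3)
Z = rng.standard_normal((64, 10))        # mini-batch of layer inputs (toy)
w = rng.standard_normal(10)              # one row w_{l,k} of the weights

def tls(mu):
    # Total least squares distance of Z to the hyperplane <w, z> = mu.
    return np.sum((Z @ w - mu) ** 2) / (w @ w)

mu_star = np.mean(Z @ w)                 # BN mean of the pre-activations

# mu_star minimizes the TLS loss over a symmetric grid of candidates ...
grid = mu_star + np.linspace(-1.0, 1.0, 201)
assert np.argmin([tls(m) for m in grid]) == 100   # minimum sits at mu_star

# ... and the minimal value is Card(Z) * var / ||w||^2, the BN variance term.
assert np.isclose(tls(mu_star), len(Z) * np.var(Z @ w) / (w @ w))
```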

E.2 PROOF OF CENTRAL HYPERPLANE ARRANGEMENT

Corollary 2. BN constrains the input-space partition boundary ∂Ω_ℓ of each layer ℓ of a BN-equipped DN to be a central hyperplane arrangement; indeed, the average $\bar{z}_{\ell-1}$ of the layer's training-data inputs belongs to $\bigcap_{k=1}^{D_\ell} H_{\ell,k}$ as long as $\|w_{\ell,k}\| > 0$.

Proof. To prove the desired result, i.e., that there exists a nonempty intersection of all the hyperplanes, we first demonstrate that the layer-input centroid $\bar{z}_{\ell-1}$ belongs to an arbitrary hyperplane, say the $k$-th. It is then immediate that this holds regardless of $k$, so the intersection of all the hyperplanes contains at least $\bar{z}_{\ell-1}$, which is enough to prove the statement. For a point (in our case $\bar{z}_{\ell-1}$) to belong to the $k$-th (unit) hyperplane $H_{\ell,k}$ of layer ℓ, it must belong to the set defining the hyperplane (recall (5)):

$$H_{\ell,k} = \left\{z_{\ell-1} \in \mathbb{R}^{D_{\ell-1}} : \langle w_{\ell,k}, z_{\ell-1}\rangle = [\mu_\ell]_k\right\}.$$

Using the data centroid, we verify that it fulfills the hyperplane equality:

$$\langle w_{\ell,k}, \bar{z}_{\ell-1}\rangle = \left\langle w_{\ell,k}, \frac{\sum_{z \in Z} z}{\mathrm{Card}(Z)}\right\rangle = \frac{\sum_{z \in Z} \langle w_{\ell,k}, z\rangle}{\mathrm{Card}(Z)} = [\mu^*_\ell]_k,$$

where the last quantity is precisely the batch-normalization mean parameter. Recalling the equation of $H_{\ell,k}$, we see that the projection of $\bar{z}_{\ell-1}$ onto $w_{\ell,k}$ equals $[\mu^*_\ell]_k$, the bias of the hyperplane, effectively making $\bar{z}_{\ell-1}$ part of the (batch-normalized) hyperplane $H_{\ell,k}$. Repeating the above for each $k = 1, \dots, D_\ell$, we see that the layer-input centroid belongs to all the unit hyperplanes shifted by the batch-normalization mean parameter; hence $\bar{z}_{\ell-1} \in \bigcap_{k=1}^{D_\ell} H_{\ell,k}$, concluding the proof.
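The corollary can be checked numerically in a few lines (the toy random batch and weights are assumptions): when µ is the BN mean, the centroid of the layer inputs satisfies the hyperplane equation of every unit simultaneously:

```python
import numpy as np

rng = np.random.default_rng(4)
Z = rng.standard_normal((32, 6))         # layer inputs, one mini-batch (toy)
W = rng.standard_normal((8, 6))          # layer weights, rows w_{l,k}

mu = np.mean(Z @ W.T, axis=0)            # BN mean per unit, [mu*_l]_k
centroid = Z.mean(axis=0)                # layer-input centroid zbar_{l-1}

# <w_{l,k}, zbar> = [mu*_l]_k for every unit k simultaneously, so the
# centroid lies on all D_l BN-centered hyperplanes: a central arrangement.
assert np.allclose(W @ centroid, mu)
```

The identity holds exactly, by linearity of the inner product and of the batch average, independent of the weights.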

E.3 PROOF OF THEOREM 2

Proof. Define $x^*$ as the closest point to $x$ in $P_{\ell,k}$: $x^* = \arg\min_{u \in P_{\ell,k}} \|x - u\|_2$. The path from $x$ to $x^*$ is a straight line in the input space, which we define by $l(\theta) = \theta x^* + (1-\theta)x$, $\theta \in [0,1]$, so that $l(0) = x$, our original point, and $l(1)$ is the closest point on the folded hyperplane. In the input space of layer ℓ, this parametric line becomes a continuous piecewise-affine parametric path $z_{\ell-1}(\theta) = (f_{\ell-1} \circ \cdots \circ f_1)(l(\theta))$. By definition, if $P_{\ell,k}$ is brought closer to $x$, then $\exists \theta < 1$ such that $l(\theta) \in P_{\ell,k}$. Equivalently, in the layer input space, $\exists \theta' < 1$ such that $z_{\ell-1}(\theta') \in H_{\ell,k}$. This demonstrates that moving the layer hyperplane so that it intersects the folded path $z_{\ell-1}(\cdot)$ at a point $z_{\ell-1}(\theta')$ with $\theta' < 1$ also reduces the distance in the input space. Now, the BN fitting is greedy and tries to minimize the length of the straight line between $z_{\ell-1}(0)$, i.e., $z_{\ell-1}(x)$, and the hyperplane $H_{\ell,k}$. If the length of this straight line decreases by bringing the hyperplane closer to $z_{\ell-1}(x)$, then the $\theta'$ such that $z_{\ell-1}(\theta') \in H_{\ell,k}$ also decreases, in turn reducing the distance between $x$ and $P_{\ell,k}$ in the DN input space, giving the desired (second) result. Conversely, if $z_{\ell-1}(0) \in H_{\ell,k}$, then the point $x$ lies in the zero-set of the unit, in turn making it belong to the folded hyperplane $P_{\ell,k}$, which corresponds to exactly this set.

E.4 PROOF OF PROPOSITION 3

Proposition 3. Consider an L-layer DN configured to learn a binary classifier from the labeled training data X using a leaky-ReLU activation function, arbitrary weights W_ℓ at all layers, BN at layers 1, . . . , L-1, and layer L configured as in (1) with c_L = 0. Then, for any training mini-batch from X, there will be at least one data point on either side of the decision boundary.

Proof. When using leaky-ReLU, the input to the last layer has, in each dimension, both positive and negative values across the current mini-batch: BN centers each pre-activation dimension, so each dimension has at least one negative value within the mini-batch while at least one other is positive, and leaky-ReLU preserves these signs. Since the last layer is initialized with zero bias, the decision boundary in the last-layer input space is the hyperplane (zero-set) of each output unit. Moreover, being on one side or the other of the decision boundary in the DN input space is equivalent to being on the corresponding side of this linear decision boundary in the last-layer input space. Combining these two facts, we obtain that at initialization there must be at least one sample on each side of the decision boundary.
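The centering step underlying this proof can be illustrated as follows (the random toy pre-activations are an assumption): after subtracting the per-unit batch mean, every unit's pre-activations contain both signs within the mini-batch:

```python
import numpy as np

rng = np.random.default_rng(5)
Z = rng.standard_normal((16, 7)) * 3.0 + 1.0   # pre-BN layer inputs (toy)
W = rng.standard_normal((5, 7))

H = Z @ W.T                    # pre-activations, shape (batch, units)
Hc = H - H.mean(axis=0)        # BN centering, one mean per unit

# After centering, every unit's pre-activations take both signs within the
# mini-batch (unless a unit is constant over the batch), and leaky-ReLU
# preserves those signs into the last layer's input.
assert np.all(Hc.max(axis=0) > 0) and np.all(Hc.min(axis=0) < 0)
```

With a zero-bias last layer, each output unit's zero-set then separates the mini-batch, which is the content of the proposition.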

F DATASET DESCRIPTIONS

MNIST. The MNIST database (Modified National Institute of Standards and Technology database) is a large database of handwritten digits that is commonly used for training various image processing systems and is widely used for training and testing in the field of machine learning. It was created by "re-mixing" the samples from NIST's original datasets. The creators felt that, since NIST's training dataset was taken from American Census Bureau employees while the testing dataset was taken from American high school students, it was not well suited for machine learning experiments. Furthermore, the black-and-white images from NIST were normalized to fit into a 28x28 pixel bounding box and anti-aliased, which introduced grayscale levels. The MNIST database contains 60,000 training images and 10,000 testing images. Half of the training set and half of the test set were taken from NIST's training dataset, while the other halves were taken from NIST's testing dataset. The original creators of the database keep a list of some of the methods tested on it; in their original paper, they use a support-vector machine to obtain an error rate of 0.8%.

SVHN. SVHN is a real-world image dataset for developing machine learning and object recognition algorithms with minimal requirements on data preprocessing and formatting. It can be seen as similar in flavor to MNIST (e.g., the images are of small cropped digits), but it incorporates an order of magnitude more labeled data (over 600,000 digit images) and comes from a significantly harder, unsolved, real-world problem (recognizing digits and numbers in natural scene images). SVHN is obtained from house numbers in Google Street View images.

CIFAR-10. The CIFAR-10 dataset (Canadian Institute For Advanced Research) is a collection of images commonly used to train machine learning and computer vision algorithms, and it is one of the most widely used datasets for machine learning research. It contains 60,000 32x32 color images in 10 classes (airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks), with 6,000 images per class. Since the images in CIFAR-10 are low-resolution (32x32), the dataset allows researchers to quickly try different algorithms to see what works; various kinds of convolutional neural networks tend to be the best at recognizing its images.

CIFAR-100. This dataset is just like CIFAR-10, except that it has 100 classes containing 600 images each, with 500 training images and 100 testing images per class. The 100 classes are grouped into 20 superclasses; each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs).



Note that the DN bias c_ℓ from (1) has been subsumed into µ_ℓ and β_ℓ. At inference time, the BN statistics µ_ℓ, σ_ℓ are fixed to values computed over the entire training set or, more commonly, as an exponential moving average of the training mini-batch values. Our analysis applies to any composition of ℓ DN layers; we focus on the first ℓ layers only for concreteness.
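The exponential-moving-average update of the BN statistics mentioned above can be sketched as follows (the `momentum` convention, in which it weights the new mini-batch, follows common implementations such as PyTorch's BatchNorm; the toy data stream is an assumption):

```python
import numpy as np

def update_running_stats(run_mean, run_var, batch, momentum=0.1):
    # EMA update of BN's inference-time statistics; `momentum` weights the
    # current mini-batch, as in common implementations (e.g., PyTorch).
    m, v = batch.mean(axis=0), batch.var(axis=0)
    new_mean = (1.0 - momentum) * run_mean + momentum * m
    new_var = (1.0 - momentum) * run_var + momentum * v
    return new_mean, new_var

rng = np.random.default_rng(6)
run_mean, run_var = np.zeros(4), np.ones(4)
for _ in range(200):
    batch = rng.standard_normal((32, 4)) + 2.0   # toy stream: mean 2, var 1
    run_mean, run_var = update_running_stats(run_mean, run_var, batch)
# The running statistics hover near the population values (mean 2, var 1).
```

At inference, these running statistics replace the per-mini-batch µ_ℓ, σ_ℓ, removing the stochastic perturbation discussed in the main text.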



Figure 2: Translation and folding effected by the BN parameters µ_ℓ, σ_ℓ on a two-layer DN with 8 units per layer, 2D input space, and random weights. (a) Varying µ_{2,k} translates the layer-2, unit-k hyperplane H_{2,k} (recall (5)), viewed in a 2D slice of its own input space. (b) Translation of the same hyperplane, now viewed in the DN input space, where it becomes the folded hyperplane F_{2,k} (recall (11)). (c) Fixing µ_{2,k} and varying σ_1 folds the next layers' hyperplanes, viewed in the DN input space.

Figure 3: Visualization in the 2D input space of the contribution ∂Ω 1 j to the spline partition boundary Ω 1 from layers j = 1, 7, 11 of an 11-layer DN of width 1024. The training data set X consists of 50 samples from a star-shaped distribution (left plots). We plot the concentration of the folded hyperplane facets in an ϵ-ball around each 2D input space point for the three initialization settings described in the text: zero bias, random bias, and BN. Darker color indicates more partition boundaries crossing through that location. Each plot is normalized by the maximum concentration attained, the value of which is noted.

Figure 5: Image classification using a Resnet50 on Imagenet. Standard data augmentation was used during SGD training but no BN was employed. Instead, the DN was initialized with random weights (blue) or with random weights and BN statistics warmed up across the entire training set (recall Figure 3 and (3)). Each column corresponds to a different learning rate used by SGD, and multiple runs are shown for the rightmost cases. BN's "smart initialization" reaches a higher-performing solution faster because it provides SGD with a spline partition that is already adapted to the training dataset. This finding provides a novel and complementary understanding of BN, which was previously studied solely when continuously employed during training as a means to better condition a DN's loss landscape. We also show in Figure 10 in the Appendix a similar observation on CIFAR100.

This result showcases the importance of DN initialization and how a good initialization alone plays a crucial role in performance, as has also been studied empirically with slope variance constraints Mishkin & Matas (2015); Xie et al. (2017), singular value constraints Jia et al. (2017), or orthogonal constraints Saxe et al. (2013); Bansal et al. (2018).

Figure 6: Realizations of the classification decision boundary on a toy 2D binary classification (red vs green) problem obtained solely by sampling different minibatches and thus observing different realizations of µ ℓ , σ ℓ (recall (3)) at initialization and after training. Each mini-batch produces a different decision boundary depicted in blue. For two different mini-batch sizes |B ℓ | = 16, 256, we change the variance as per Proposition 2. Larger batch sizes clearly produce smaller variability in the decision boundary both at initialization and after training; µ ℓ , σ ℓ distributions are provided in Figure 7.

input and/or feature maps to improve generalization performance Srivastava (2013); Pham et al. (2014); Molchanov et al. (2017); Wang et al. (

Figure 8: Sketch of the situation in Theorem 3 for a two-dimensional input space.

Figure 9: Layer-1 hyperplanes H_{1,k} (left) and layer-4 folded hyperplanes F_{4,k} (right) depicted in the 2D input space of a toy 4-layer DN trained on the data points denoted by black dots. The (folded) hyperplanes are colored based on the corresponding value σ²_{ℓ,k}/‖w_{ℓ,k}‖², which is proportional to the total least squares (TLS) fitting error to the data (blue: small error, close to the data points; green: large error, far from the data points).

Figure 10: Image classification using a Resnet9 on CIFAR100. No BN or data augmentation was used during SGD training. Instead, the DN was initialized with random weights and zero bias (blue), random bias (orange), or via BN across the entire data set as in Figure 3 (green). Each training run was repeated 10 times with learning rate cross-validation, and we plot the average test accuracy (of the best validation-set learning rate) vs. learning epoch. BN's "smart initialization" reaches a higher-performing solution faster because it provides SGD with a spline partition that is already adapted to the training dataset.

Technical report, Tech. Rep.) Waterloo, ON: Centre for Theoretical Neuroscience. doi: 10.13140 . . . , 2017.
Richard Von Mises. Mathematical theory of probability and statistics. Academic Press, 2014.
Zichao Wang, Randall Balestriero, and Richard Baraniuk. A max-affine spline perspective of recurrent neural networks. In International Conference on Learning Representations, 2018.
Di Xie, Jiang Xiong, and Shiliang Pu. All you need is beyond a good init: Exploring better solution for training extremely deep convolutional neural networks with orthonormality and modulation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6176-6185, 2017.
Greg Yang, Jeffrey Pennington, Vinay Rao, Jascha Sohl-Dickstein, and Samuel S. Schoenholz. A mean field theory of batch normalization. arXiv preprint arXiv:1902.08129, 2019.
Thomas Zaslavsky. Facing up to arrangements: Face-count formulas for partitions of space by hyperplanes, volume 154. American Mathematical Soc., 1975.

Test accuracy of various models when employing (yes) or not (no) the BN learnable parameters. As demonstrated in the main paper, these parameters have very little impact on the final test accuracy (no data augmentation is used).

E PROOFS

This section collects the proofs supporting the theoretical claims made in the main part of the paper.

E.1 PROOF OF THEOREM 1

Proof. To prove the theorem, we first demonstrate that the optimum of the total least squares optimization problem is reached at a unique global optimum given by the average of the data, hence corresponding to the batch-normalization mean parameter. We then demonstrate that at this minimum, the value of the total least squares loss is given by the batch-normalization variance parameter. The optimization problem is given by

$$\min_{\mu} \sum_{k=1}^{D_\ell} \sum_{z \in Z} \frac{\left(\langle w_{\ell,k}, z\rangle - \mu_k\right)^2}{\|w_{\ell,k}\|_2^2}.$$

It is clear that this problem can be decomposed into multiple independent optimization problems, one for each dimension of the vector µ, since we are working with an unconstrained optimization problem with a separable sum. We thus focus on a single $\mu_k$ for now. The optimization problem becomes

$$\min_{\mu_k} \sum_{z \in Z} \frac{\left(\langle w_{\ell,k}, z\rangle - \mu_k\right)^2}{\|w_{\ell,k}\|_2^2}.$$

