A SPHERICAL ANALYSIS OF ADAM WITH BATCH NORMALIZATION

Abstract

Paper under double-blind review

Batch Normalization (BN) is a prominent deep learning technique. In spite of its apparent simplicity, its implications for optimization are yet to be fully understood. While previous studies mostly focus on the interaction between BN and stochastic gradient descent (SGD), we develop a geometric perspective which allows us to precisely characterize the relation between BN and Adam. More precisely, we leverage the radial invariance of groups of parameters, such as filters for convolutional neural networks, to translate the optimization steps onto the $L_2$ unit hypersphere. This formulation and the associated geometric interpretation shed new light on the training dynamics. Firstly, we use it to derive the first explicit expression of the effective learning rate for Adam. Then we show that, in the presence of BN layers, performing SGD alone is actually equivalent to a variant of Adam constrained to the unit hypersphere. Finally, our analysis outlines phenomena that previous variants of Adam act on, and we experimentally validate their importance in the optimization process.

1. INTRODUCTION

Figure 1: Illustration of the spherical perspective for SGD. The loss function $L$ of a NN w.r.t. the parameters $x_k \in \mathbb{R}^d$ of a neuron followed by a BN layer is radially invariant. The neuron update $x_k \to x_{k+1}$ in the original space, with velocity $\eta_k \nabla L(x_k)$, corresponds to an update $u_k \to u_{k+1}$ of its projection through an exponential map on the unit hypersphere $S^{d-1}$, with velocity $\eta_k^{e} \nabla L(u_k)$ at order 2 (see details in Section 2.3).

The optimization process of deep neural networks is still poorly understood. Their training involves minimizing a high-dimensional non-convex function, which has been proved to be an NP-hard problem (Blum & Rivest, 1989). Yet, elementary gradient-based methods show good results in practice. Several studies have examined the interaction between BN and SGD (e.g., Hoffer et al., 2018b). None of them studied the interaction between BN and one of the most common adaptive schemes for Neural Networks (NN), Adam (Kingma & Ba, 2015), except van Laarhoven (2017), which tackled it only in the asymptotic regime. In this work, we provide an extensive analysis of the relation between BN and Adam during the whole training procedure. One of the key effects of BN is to make NNs invariant to positive scalings of groups of parameters. The core idea of this paper is precisely to focus on these groups of radially-invariant parameters and to analyze their optimization projected on the $L_2$ unit hypersphere (see Fig. 1), which is topologically equivalent to the quotient manifold of the parameter space by the scaling action. One could directly optimize parameters on the hypersphere, as in Cho & Lee (2017); yet, most optimization methods are still performed successfully in the original parameter space. Here we propose to study an optimization scheme for a given group of radially-invariant parameters through its image scheme on the unit hypersphere.
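As a quick numerical illustration of this radial invariance (a minimal NumPy sketch under our own assumptions, not the paper's code: the `bn` helper below is a simplified batch normalization over the batch dimension, without learned affine parameters), one can check that positively rescaling a neuron's weights leaves its BN-normalized output essentially unchanged:

```python
import numpy as np

def bn(z, eps=1e-5):
    # Simplified batch normalization over the batch dimension (no learned affine).
    return (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)

rng = np.random.default_rng(0)
s = rng.normal(size=(128, 16))   # a batch of inputs
x = rng.normal(size=16)          # weights of one neuron

out = bn(s @ x)                  # neuron followed by BN
out_scaled = bn(s @ (3.7 * x))   # same neuron with positively rescaled weights

print(np.allclose(out, out_scaled, atol=1e-4))
```

The small `eps` in the BN denominator is the only source of discrepancy, which is why the comparison is made up to a tolerance rather than exactly.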
This geometric perspective sheds light on the interaction between normalization layers and Adam, and also outlines an interesting link between standard SGD and a variant of Adam adapted and constrained to the unit hypersphere: AdamG (Cho & Lee, 2017). We believe this kind of analysis is an important step towards a better understanding of the effect of BN on NN optimization. Please note that, although our discussion and experiments focus on BN, our analysis can be applied to any radially-invariant model.

The paper is organized as follows. In Section 2, we introduce our spherical framework to study the optimization of radially-invariant models. We also define a generic optimization scheme that encompasses methods such as SGD with momentum (SGD-M) and Adam. We then derive its image step on the unit hypersphere, leading to definitions and expressions of the effective learning rate and the effective learning direction. This new definition is explicit and has a clear interpretation, whereas the definition of van Laarhoven (2017) is asymptotic and the definitions of Arora et al. (2019) and of Hoffer et al. (2018b) are variational. In Section 3, we leverage the tools of our spherical framework to demonstrate that, in the presence of BN layers, SGD has an adaptive behaviour. Formally, we show that SGD is equivalent to AdamG, a variant of Adam adapted and constrained to the hypersphere, without momentum. In Section 4, we analyze the effective learning direction for Adam. The spherical framework highlights phenomena that previous variants of Adam (Loshchilov & Hutter, 2017; Cho & Lee, 2017) act on. We perform an empirical study of these phenomena and show that they play a significant role in the training of convolutional neural networks (CNNs). In Section 5, these results are put in perspective with related work.
Our main contributions are the following:
• A framework to analyze and compare order-1 optimization schemes of radially-invariant models;
• The first explicit expression of the effective learning rate for Adam;
• The demonstration that, in the presence of BN layers, standard SGD has an adaptive behaviour;
• The identification and study of geometrical phenomena that occur with Adam and significantly impact the training of CNNs with BN.

2. SPHERICAL FRAMEWORK AND EFFECTIVE LEARNING RATE

In this section, we provide background on radial invariance and introduce a generic optimization scheme. Projecting the scheme update on the unit hypersphere leads to the formal definitions of effective learning rate and learning direction. This geometric perspective leads to the first explicit expression of the effective learning rate for Adam. The main notations are summarized in Figure 1 .

2.1. RADIAL INVARIANCE

We consider a family of parametric functions $\phi_x : \mathbb{R}^{\mathrm{in}} \to \mathbb{R}^{\mathrm{out}}$ parameterized by a group of radially-invariant parameters $x \in \mathbb{R}^d \setminus \{0\}$, i.e., $\forall \rho > 0,\ \phi_{\rho x} = \phi_x$ (possible other parameters of $\phi_x$ are omitted for clarity), a dataset $\mathcal{D} \subset \mathbb{R}^{\mathrm{in}} \times \mathbb{R}^{\mathrm{out}}$, a loss function $\ell : \mathbb{R}^{\mathrm{out}} \times \mathbb{R}^{\mathrm{out}} \to \mathbb{R}$ and a training loss function $L : \mathbb{R}^d \to \mathbb{R}$ defined as:

$$L(x) \stackrel{\mathrm{def}}{=} \frac{1}{|\mathcal{D}|} \sum_{(s,t) \in \mathcal{D}} \ell(\phi_x(s), t).$$

It verifies $\forall \rho > 0,\ L(\rho x) = L(x)$. In the context of NNs, the group of radially-invariant parameters $x$ can be the parameters of a single neuron in a linear layer or the parameters of a whole filter in a convolutional layer, followed by BN (see Appendix A for details, and Appendix B for the application to other normalization schemes such as InstanceNorm (Ulyanov et al., 2016), LayerNorm (Ba et al., 2016) or GroupNorm (Wu & He, 2018)).

The quotient of the parameter space by the equivalence relation associated to radial invariance is topologically equivalent to a sphere. We consider here the $L_2$ sphere $S^{d-1} = \{u \in \mathbb{R}^d : \|u\|_2 = 1\}$, whose canonical metric corresponds to angles: $d_{S^{d-1}}(u_1, u_2) = \arccos(\langle u_1, u_2 \rangle)$. This choice of metric is relevant to study NNs since filters in CNNs or neurons in MLPs are applied through scalar products to input data. Besides, normalization in BN layers is also performed using the $L_2$ norm.

Our framework relies on the decomposition of vectors into radial and tangential components. During optimization, we write the radially-invariant parameters at step $k \ge 0$ as $x_k = r_k u_k$, where $r_k = \|x_k\|$ and $u_k = x_k / \|x_k\|$. For any quantity $q_k \in \mathbb{R}^d$ at step $k$, we write $q_k^{\perp} = q_k - \langle q_k, u_k \rangle u_k$ its tangential component relative to the current direction $u_k$.
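The radial/tangential decomposition above can be sketched numerically (a hypothetical NumPy example; the variable names follow the paper's notation, but the vectors are random placeholders standing in for a parameter group and a gradient):

```python
import numpy as np

rng = np.random.default_rng(0)
x_k = rng.normal(size=8)   # a group of radially-invariant parameters at step k
q_k = rng.normal(size=8)   # any quantity at step k, e.g. a gradient

# Radial decomposition x_k = r_k * u_k
r_k = np.linalg.norm(x_k)
u_k = x_k / r_k                        # u_k lies on the unit hypersphere S^{d-1}

# Tangential component q_k^perp = q_k - <q_k, u_k> u_k
q_perp = q_k - np.dot(q_k, u_k) * u_k  # orthogonal to the current direction u_k

# Canonical metric on S^{d-1}: the angle between two unit directions
v = rng.normal(size=8)
v /= np.linalg.norm(v)
d_S = np.arccos(np.clip(np.dot(u_k, v), -1.0, 1.0))
```

The `np.clip` guards against floating-point round-off pushing the scalar product slightly outside $[-1, 1]$ before `arccos`.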

