A SPHERICAL ANALYSIS OF ADAM WITH BATCH NORMALIZATION

Abstract

Paper under double-blind review

Batch Normalization (BN) is a prominent deep learning technique. In spite of its apparent simplicity, its implications for optimization are yet to be fully understood. While previous studies mostly focus on the interaction between BN and stochastic gradient descent (SGD), we develop a geometric perspective which allows us to precisely characterize the relation between BN and Adam. More precisely, we leverage the radial invariance of groups of parameters, such as filters for convolutional neural networks, to translate the optimization steps onto the L2 unit hypersphere. This formulation and the associated geometric interpretation shed new light on the training dynamics. Firstly, we use it to derive the first effective learning rate expression of Adam. Then we show that, in the presence of BN layers, performing SGD alone is actually equivalent to a variant of Adam constrained to the unit hypersphere. Finally, our analysis outlines phenomena that previous variants of Adam act on, and we experimentally validate their importance in the optimization process.

1. INTRODUCTION

Figure 1: Illustration of the spherical perspective for SGD. The loss function L of a NN w.r.t. the parameters x_k ∈ R^d of a neuron followed by a BN layer is radially invariant. The neuron update x_k → x_{k+1} in the original space, with velocity η_k ∇L(x_k), corresponds to an update u_k → u_{k+1} of its projection through an exponential map on the unit hypersphere S^{d-1} with velocity η_k^e ∇L(u_k) at order 2 (see details in Section 2.3).

The optimization process of deep neural networks is still poorly understood. Their training involves minimizing a high-dimensional non-convex function, which has been proved to be an NP-hard problem (Blum & Rivest, 1989). Yet, elementary gradient-based methods show good results in practice. One of the key effects of BN is to make NNs invariant to positive scalings of groups of parameters. The core idea of this paper is precisely to focus on these groups of radially-invariant parameters and to analyze their optimization projected onto the L2 unit hypersphere (see Fig. 1), which is topologically equivalent to the quotient manifold of the parameter space by the scaling action. One could directly optimize the parameters on the hypersphere, as in Cho & Lee (2017); yet most optimization methods are still performed successfully in the original parameter space. Here we propose to study an optimization scheme for a given group of radially-invariant parameters through its image scheme on the unit hypersphere. This geometric perspective sheds light on the interaction between normalization layers and Adam, and also outlines an interesting link between standard SGD and a variant of Adam adapted and constrained to the unit hypersphere: AdamG (Cho & Lee, 2017). We believe this kind of analysis
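The two ingredients of Figure 1 — the radial invariance induced by BN and the projected update on the unit hypersphere via an exponential map — can be sketched in a few lines of NumPy. This is a toy illustration under simplifying assumptions (a single linear neuron, per-batch normalization, no affine BN parameters), not the paper's algorithm; the helper names `bn_neuron_output`, `exp_map` and `spherical_step` are hypothetical.

```python
import numpy as np

def bn_neuron_output(x, inputs):
    """Pre-activation of a neuron followed by batch normalization:
    the batch outputs are mean-centered and variance-normalized, so any
    positive rescaling of the weight vector x cancels out."""
    z = inputs @ x
    return (z - z.mean()) / z.std()

rng = np.random.default_rng(0)
inputs = rng.normal(size=(32, 8))
x = rng.normal(size=8)

# Radial invariance: scaling the weights by any r > 0 leaves the output
# (and hence the loss) unchanged.
assert np.allclose(bn_neuron_output(x, inputs),
                   bn_neuron_output(3.7 * x, inputs))

def exp_map(u, v):
    """Exponential map on the unit hypersphere S^{d-1}: move from the
    point u along the tangent vector v, following the great circle."""
    norm_v = np.linalg.norm(v)
    if norm_v < 1e-12:
        return u
    return np.cos(norm_v) * u + np.sin(norm_v) * (v / norm_v)

def spherical_step(u, grad, lr):
    """One projected step: remove the radial component of the ambient
    gradient (the tangent-space projection at u), then take an
    exponential-map step of size lr along it."""
    tangent = grad - (grad @ u) * u
    return exp_map(u, -lr * tangent)

# Project the neuron onto the sphere and take one step: the iterate
# stays exactly on S^{d-1}, unlike the raw update in the ambient space.
u = x / np.linalg.norm(x)
u_next = spherical_step(u, rng.normal(size=8), lr=0.1)
assert np.isclose(np.linalg.norm(u_next), 1.0)
```

The radial projection step is what makes the two views comparable: an SGD step in the ambient space changes both the norm and the direction of x_k, but only the directional part matters for the loss, and it is that part which the exponential map transports along the sphere.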

