HOW DOES ADAPTIVE OPTIMIZATION IMPACT LOCAL NEURAL NETWORK GEOMETRY?

Abstract

Adaptive optimization methods are well known to achieve superior convergence relative to vanilla gradient methods. The traditional viewpoint in optimization, particularly in convex optimization, explains this improved performance by arguing that, unlike vanilla gradient schemes, adaptive algorithms mimic the behavior of a second-order method by adapting to the global geometry of the loss function. We argue that in the context of neural network optimization, this traditional viewpoint is insufficient. Instead, we advocate for a local trajectory analysis. For iterate trajectories produced by running a generic optimization algorithm OPT, we introduce R_med^OPT, a statistic that is analogous to the condition number of the loss Hessian evaluated at the iterates. Through extensive experiments, we show that adaptive methods such as Adam bias the trajectories towards regions where R_med^Adam is small, where one might expect faster convergence. By contrast, vanilla gradient methods like SGD bias the trajectories towards regions where R_med^SGD is comparatively large. We complement these empirical observations with a theoretical result that provably demonstrates this phenomenon in the simplified setting of a two-layer linear network. We view our findings as evidence of the need for a new explanation of the success of adaptive methods, one that differs from the conventional wisdom.

1. INTRODUCTION

The efficient minimization of a parameterized loss function is a core primitive in statistics, optimization and machine learning. Gradient descent (GD), which iteratively updates a parameter vector with a step along the gradient of the loss function evaluated at that vector, is a simple yet canonical algorithm that has been applied to solve such minimization problems with enormous success. However, in modern machine learning, and especially deep learning, one frequently encounters problems where the loss functions are high dimensional, non-convex and non-smooth. The optimization landscape of such problems is extremely challenging, and in these settings gradient descent often suffers from prohibitively high iteration complexity. To deal with these difficulties and improve optimization efficiency, practitioners have in recent years developed many variants of GD. One prominent class of these GD variants is the family of adaptive algorithms (Duchi et al., 2011; Tieleman et al., 2012; Kingma & Ba, 2015). At a high level, adaptive methods scale the gradient with an adaptively selected preconditioning matrix, which is constructed via a moving average of past gradients. These methods are reminiscent of second-order methods, since they construct approximations to the Hessian of the loss function, while remaining computationally feasible since they eschew full computation of the Hessian. A vast line of empirical work has demonstrated the superiority of adaptive methods over GD for optimizing deep neural networks, especially on Natural Language Processing (NLP) tasks with transformers (Vaswani et al., 2017; Devlin et al., 2019). From a theoretical perspective, adaptive methods are well understood in the traditional context of convex optimization. For instance, Duchi et al.
(2011) show that when the loss function is convex, the Adagrad algorithm yields regret guarantees that are provably as good as those obtained by using the best (diagonal) preconditioner in hindsight. The key mechanism underlying this improved performance is that the loss function has some global geometric property (such as sparsity or a coordinate-wise bounded Lipschitz constant), and the algorithm adapts to this global geometry by adaptively selecting learning rates for the features that are more informative. However, in non-convex optimization, and deep learning in particular, it is highly unclear whether this simple characterization is sufficient to explain the superiority of adaptive methods over GD. Indeed, for large-scale neural networks, global guarantees on the geometric properties of the loss are typically vacuous. For instance, for a 20-layer feedforward neural network, if we scale up the weights in each layer by a factor of 1.5, then the global Lipschitz constant of the network is scaled up by a factor of 1.5^20 > 3000. Hence it only makes sense to study convergence by looking at the local geometry of the loss along the trajectory of the optimization algorithm (Arora et al., 2018). Moreover, the interaction between an optimization algorithm and neural network geometry is highly complex: recent work has shown that the geometric characteristics of the iterates encountered during optimization depend heavily on the choice of optimization algorithm and its hyperparameters (Lewkowycz et al., 2020; Cohen et al., 2021). For instance, Cohen et al.
(2021) demonstrate that when training neural networks with GD, the maximum eigenvalue of the Hessian evaluated at the GD iterates first increases and then plateaus at the level 2/(step size). The viewpoint from convex optimization, where a loss function has some (potentially) non-uniform but fixed underlying geometry that we must adapt to, is thus insufficient for neural networks, since the choice of optimization algorithm can actually interact with and significantly influence the observed geometry. To provide another example of this interactive phenomenon, we consider the following experiment. On the same network training loss f, we run stochastic gradient descent with momentum (SGD+M) and Adam to obtain two different trajectories. We select an iterate x_Adam from the Adam trajectory and an iterate x_SGD from the SGD trajectory such that f(x_Adam) = f(x_SGD). We then run SGD+M twice, once from x_Adam and once from x_SGD. If the underlying geometry of the loss function f were truly fixed, then we would not expect a significant difference in performance between the two runs. However, as shown in Figure 1 (left), running SGD+M from x_Adam achieves lower loss than running it from x_SGD, suggesting that Adam may bias the trajectory towards a region that is more favorable for rapid training. This motivates the following question.

How does adaptive optimization impact the observed geometry of a neural network loss function, relative to SGD (with momentum)?

The remainder of this paper is dedicated to answering this question. To this end, for each iterate in a trajectory produced by running an optimization algorithm OPT, where the Hessian at the tth iterate is H^(t) ∈ R^{d×d}, we define the second-order statistic R_med^OPT(t) in the following fashion.
For the tth iterate in the trajectory, let R_med^OPT(t) be the ratio of the maximum of the absolute entries of the diagonal of H^(t) to the median of the absolute entries of the diagonal of H^(t). Concretely, we define

    R_med^OPT(t) = max_{1 ≤ i ≤ d} |H^(t)_ii| / median_{1 ≤ i ≤ d} |H^(t)_ii|.    (1)

This statistic measures the uniformity of the diagonal of the Hessian, where a smaller value of R_med^OPT(t) implies that the Hessian has a more uniform diagonal. It can also be viewed as a stable^1 variant of the ratio max_{1 ≤ i ≤ d} |H^(t)_ii| / min_{1 ≤ i ≤ d} |H^(t)_ii|.

^1 Consider the case where one parameter has little impact on the loss; the second derivative w.r.t. this parameter is then almost zero, making the ratio max{|H^(t)_ii|}/min{|H^(t)_ii|} blow up. We therefore consider the median, which is more stable.

Figure 1: (left) Training losses of SGD+M starting from x_SGD and x_Adam. (right) The 10th largest value over the median in the diagonal of the loss Hessian (which can be viewed as a variant of R_med^OPT(t) defined in eq. (1)) for Adam and SGD+M. Since the full Hessian is too big, we selected several layers and randomly sampled 200 coordinates per layer for this computation.

Figure 2: Training losses of Adam and SGD+M on the sentence classification task described in Section 4.1.
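As a concrete illustration, the statistic in eq. (1), the k-th-largest variant used in Figure 1, and the instability of the max-over-min ratio discussed in the footnote can all be sketched in a few lines of NumPy. The helper name `r_med` and the toy Hessian diagonal below are ours, not the paper's:

```python
import numpy as np

def r_med(hess_diag, k=1):
    """Ratio of the k-th largest |H_ii| to the median of |H_ii|; k=1 recovers eq. (1)."""
    a = np.sort(np.abs(np.asarray(hess_diag)))[::-1]  # |diagonal entries|, descending
    return a[k - 1] / np.median(a)

# Toy Hessian diagonal: one dominant curvature direction and one nearly flat parameter.
diag = np.array([10.0, 2.0, 1.0, 0.5, 1e-9])

print(r_med(diag))              # max-over-median, as in eq. (1) -> 10.0
print(r_med(diag, k=2))         # "k-th largest over median" variant from Figure 1 -> 2.0
print(diag.max() / diag.min())  # max-over-min blows up because of the flat coordinate
```

Because the median ignores the handful of near-zero second derivatives that flat parameters produce, `r_med` stays at 10 here while the max-over-min ratio explodes to 10^10.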

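The restart experiment from the introduction (running SGD+M from the matched iterates x_Adam and x_SGD) can also be sketched end-to-end. Everything below is an illustrative stand-in: the ill-conditioned quadratic, the hyperparameters, and the choice of matched iterates are ours, and a toy quadratic is not expected to reproduce the Figure 1 (left) finding; the sketch only demonstrates the protocol.

```python
import numpy as np

# Toy stand-in for the training loss f: an ill-conditioned quadratic.
A = np.diag([100.0, 1.0, 0.01])

def loss(x):
    return 0.5 * x @ (A @ x)

def grad(x):
    return A @ x

def sgd_m(x, steps, lr=1e-3, beta=0.9):
    """SGD with heavy-ball momentum; returns the iterate trajectory."""
    traj, v = [x.copy()], np.zeros_like(x)
    for _ in range(steps):
        v = beta * v + grad(x)
        x = x - lr * v
        traj.append(x.copy())
    return traj

def adam(x, steps, lr=1e-2, b1=0.9, b2=0.999, eps=1e-8):
    """Standard Adam with bias correction; returns the iterate trajectory."""
    traj, m, v = [x.copy()], np.zeros_like(x), np.zeros_like(x)
    for t in range(1, steps + 1):
        g = grad(x)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g ** 2
        x = x - lr * (m / (1 - b1 ** t)) / (np.sqrt(v / (1 - b2 ** t)) + eps)
        traj.append(x.copy())
    return traj

x0 = np.random.default_rng(0).standard_normal(3)
traj_sgd, traj_adam = sgd_m(x0, 500), adam(x0, 500)

# Pick x_adam from the Adam trajectory, and the SGD+M iterate whose loss best
# matches it, mimicking the matching condition f(x_Adam) = f(x_SGD) from the text.
x_adam = traj_adam[100]
x_sgd = min(traj_sgd, key=lambda x: abs(loss(x) - loss(x_adam)))

# Restart SGD+M from both matched iterates and compare the final losses.
f_from_adam = loss(sgd_m(x_adam, 500)[-1])
f_from_sgd = loss(sgd_m(x_sgd, 500)[-1])
```

On a real network, the paper's observation is that `f_from_adam` ends up lower than `f_from_sgd`, which is the evidence that Adam steers the trajectory into a more favorable region.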
