ADAPTIVE NORMS FOR DEEP LEARNING WITH REGULARIZED NEWTON METHODS

Abstract

We investigate the use of regularized Newton methods with adaptive norms for optimizing neural networks. This approach can be seen as a second-order counterpart of adaptive gradient methods, which we here show to be interpretable as first-order trust region methods with ellipsoidal constraints. In particular, we prove that the preconditioning matrix used in RMSProp and Adam satisfies the necessary conditions for provable convergence of second-order trust region methods with standard worst-case complexities on general non-convex objectives. Furthermore, we run experiments across different neural architectures and datasets and find that the ellipsoidal constraints consistently outperform their spherical counterpart, both in terms of the number of backpropagations and the asymptotic loss value. Finally, we find comparable performance to state-of-the-art first-order methods in terms of backpropagations, but further advances in hardware are needed to render Newton methods competitive in terms of computational time.

1. INTRODUCTION

We consider finite-sum optimization problems of the form $\min_{w \in \mathbb{R}^d} L(w) := \sum_{i=1}^{n} \ell(f(w, x_i), y_i)$, which typically arise in neural network training, e.g. for empirical risk minimization over a set of data points $(x_i, y_i) \in \mathbb{R}^{in} \times \mathbb{R}^{out}$, $i = 1, \ldots, n$. Here, $\ell : \mathbb{R}^{out} \times \mathbb{R}^{out} \to \mathbb{R}_+$ is a convex loss function and $f : \mathbb{R}^{in} \times \mathbb{R}^d \to \mathbb{R}^{out}$ represents the neural network mapping parameterized by the concatenation of the weight layers $w \in \mathbb{R}^d$, which is non-convex due to its multiplicative nature and potentially non-linear activation functions. We assume that $L$ is lower bounded and twice differentiable, i.e. $L \in C^2(\mathbb{R}^d, \mathbb{R})$, and consider finding a first- and second-order stationary point $\bar{w}$ for which $\|\nabla L(\bar{w})\| \le \epsilon_g$ and $\lambda_{\min}\left(\nabla^2 L(\bar{w})\right) \ge -\epsilon_H$.

In the era of deep neural networks, stochastic gradient descent (SGD) is one of the most widely used training algorithms (Bottou, 2010). What makes SGD so attractive is its simplicity and a per-iteration cost that is independent of the size of the training set ($n$) and scales linearly in the dimensionality ($d$). However, gradient descent is known to be inadequate for optimizing ill-conditioned functions (Nesterov, 2013; Shalev-Shwartz et al., 2017), and thus adaptive gradient methods that employ dynamic, coordinate-wise learning rates based on past gradients, including Adagrad (Duchi et al., 2011), RMSprop (Tieleman & Hinton, 2012) and Adam (Kingma & Ba, 2014), have become a popular alternative, often providing significant speed-ups over SGD.

From a theoretical perspective, Newton methods provide stronger convergence guarantees by appropriately transforming the gradient in ill-conditioned regions according to second-order derivatives. It is precisely this Hessian information that allows regularized Newton methods to enjoy superlinear local convergence as well as to provably escape saddle points (Conn et al., 2000).
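The connection between adaptive gradient methods and ellipsoidal trust regions can be made concrete with a small NumPy sketch. It is a minimal illustration, not the paper's implementation: it compares the standard RMSProp step direction $-g / (\sqrt{v} + \epsilon)$ with the closed-form minimizer of the linear model $g^\top s$ subject to the ellipsoidal constraint $\|D^{1/2} s\|_2 \le r$, where $D = \mathrm{diag}(\sqrt{v} + \epsilon)$; the variable names (`rmsprop_direction`, `trust_region_direction`, `radius`) are illustrative choices, not the paper's notation.

```python
import numpy as np

def rmsprop_direction(g, v, eps=1e-8):
    # RMSProp step direction (learning rate omitted):
    # coordinate-wise rescaling by the running second-moment estimate v
    return -g / (np.sqrt(v) + eps)

def trust_region_direction(g, v, radius, eps=1e-8):
    # Minimize the linear model g^T s subject to ||D^{1/2} s||_2 <= radius,
    # with D = diag(sqrt(v) + eps). The minimizer lies on the boundary
    # along the preconditioned direction -D^{-1} g, rescaled to the radius.
    d = np.sqrt(v) + eps
    s = -g / d                          # preconditioned direction D^{-1} g
    norm = np.sqrt(np.sum(d * s * s))   # ellipsoidal norm ||s||_D
    return radius * s / norm

# Toy gradient and second-moment estimate
g = np.array([0.5, -2.0, 1.0])
v = np.array([0.25, 4.0, 1.0])

s1 = rmsprop_direction(g, v)
s2 = trust_region_direction(g, v, radius=0.1)

# The two directions are parallel (s2 is a positive multiple of s1),
# so the cosine similarity is 1
cos = s1 @ s2 / (np.linalg.norm(s1) * np.linalg.norm(s2))
print(round(cos, 6))  # → 1.0
```

In this view, RMSProp's step is (up to the scalar step length) the solution of a first-order trust-region subproblem whose constraint ellipsoid adapts to the gradient history through $v$, which is the interpretation the abstract refers to.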
While second-order algorithms have a long-standing history even in the realm of neural network training (Hagan & Menhaj, 1994; Becker et al., 1988), they were mostly considered too expensive in computation and memory for practical applications. Yet, the seminal work of Martens (2010) renewed interest in their use in deep learning by proposing efficient Hessian-free methods that access second-order information only via matrix-vector products, which can be computed at the cost of an additional backpropagation (Pearlmutter, 1994; Schraudolph, 2002). Among the class of regularized Newton methods,

