ADAPTIVE NORMS FOR DEEP LEARNING WITH REGULARIZED NEWTON METHODS

Abstract

We investigate the use of regularized Newton methods with adaptive norms for optimizing neural networks. This approach can be seen as a second-order counterpart of adaptive gradient methods, which we here show to be interpretable as first-order trust region methods with ellipsoidal constraints. In particular, we prove that the preconditioning matrix used in RMSProp and Adam satisfies the necessary conditions for provable convergence of second-order trust region methods with standard worst-case complexities on general non-convex objectives. Furthermore, we run experiments across different neural architectures and datasets and find that the ellipsoidal constraints consistently outperform their spherical counterpart, both in the number of backpropagations and in asymptotic loss value. Finally, we find performance comparable to state-of-the-art first-order methods in terms of backpropagations, but further advances in hardware are needed to render Newton methods competitive in terms of computational time.

1. INTRODUCTION

We consider finite-sum optimization problems of the form

$\min_{w \in \mathbb{R}^d} L(w) := \sum_{i=1}^{n} \ell(f(w, x_i), y_i)$,   (1)

which typically arise in neural network training, e.g. for empirical risk minimization over a set of data points $(x_i, y_i) \in \mathbb{R}^{d_{in}} \times \mathbb{R}^{d_{out}}$, $i = 1, \ldots, n$. Here, $\ell : \mathbb{R}^{d_{out}} \times \mathbb{R}^{d_{out}} \to \mathbb{R}_+$ is a convex loss function and $f : \mathbb{R}^{d_{in}} \times \mathbb{R}^d \to \mathbb{R}^{d_{out}}$ represents the neural network mapping parameterized by the concatenation of the weight layers $w \in \mathbb{R}^d$, which is non-convex due to its multiplicative nature and potentially non-linear activation functions. We assume that $L$ is lower bounded and twice differentiable, i.e. $L \in C^2(\mathbb{R}^d, \mathbb{R})$, and consider finding a first- and second-order stationary point $\tilde{w}$ for which $\|\nabla L(\tilde{w})\| \leq \epsilon_g$ and $\lambda_{\min}(\nabla^2 L(\tilde{w})) \geq -\epsilon_H$.

In the era of deep neural networks, stochastic gradient descent (SGD) is one of the most widely used training algorithms (Bottou, 2010). What makes SGD so attractive is its simplicity and a per-iteration cost that is independent of the size of the training set ($n$) and scales linearly in the dimensionality ($d$). However, gradient descent is known to be inadequate for optimizing ill-conditioned functions (Nesterov, 2013; Shalev-Shwartz et al., 2017), and thus adaptive gradient methods that employ dynamic, coordinate-wise learning rates based on past gradients, including Adagrad (Duchi et al., 2011), RMSprop (Tieleman & Hinton, 2012) and Adam (Kingma & Ba, 2014), have become a popular alternative, often providing significant speed-ups over SGD. From a theoretical perspective, Newton methods provide stronger convergence guarantees by appropriately transforming the gradient in ill-conditioned regions according to second-order derivatives. It is precisely this Hessian information that allows regularized Newton methods to enjoy superlinear local convergence as well as to provably escape saddle points (Conn et al., 2000).
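The setup above can be made concrete on a toy instance. The sketch below (all names are illustrative; the linear model stands in for a neural network) builds a finite-sum least-squares objective and evaluates the two stationarity quantities from the text at a solution:

```python
import numpy as np

# Toy instance of the finite-sum objective: a linear "network"
# f(w, x) = <w, x> with squared loss ell(a, y) = 0.5 * (a - y)^2, so
# L(w) = sum_i ell(f(w, x_i), y_i).  All names are illustrative.
rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d)  # realizable targets

def loss(w):
    return 0.5 * np.sum((X @ w - y) ** 2)

def grad(w):
    return X.T @ (X @ w - y)

def hessian(w):
    return X.T @ X  # constant here; a neural net's Hessian depends on w

# Approximate stationarity in the sense of the text: small gradient
# norm, and smallest Hessian eigenvalue bounded below by -eps_H.
w_star = np.linalg.lstsq(X, y, rcond=None)[0]
grad_norm = np.linalg.norm(grad(w_star))
lambda_min = np.linalg.eigvalsh(hessian(w_star)).min()
```

For this convex toy problem the least-squares solution satisfies both conditions with tiny tolerances; for a non-convex network loss, the same two quantities define the approximate stationary points sought above.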
While second-order algorithms have a long-standing history even in the realm of neural network training (Hagan & Menhaj, 1994; Becker et al., 1988), they were mostly considered too computationally and memory expensive for practical applications. Yet, the seminal work of Martens (2010) renewed interest in their use in deep learning by proposing efficient Hessian-free methods that only access second-order information via matrix-vector products, which can be computed at the cost of an additional backpropagation (Pearlmutter, 1994; Schraudolph, 2002). Among the class of regularized Newton methods, trust region (Conn et al., 2000) and cubic regularization algorithms (Cartis et al., 2011) are the most principled approaches in the sense that they yield the strongest convergence guarantees. Recently, stochastic extensions have emerged (Xu et al., 2017b; Yao et al., 2018; Kohler & Lucchi, 2017; Gratton et al., 2017), which suggest their applicability for deep learning. We here propose a simple modification to make TR methods even more suitable for neural network training. Particularly, we build upon the following alternative view on adaptive gradient methods: while gradient descent can be interpreted as a spherically constrained first-order TR method, preconditioned gradient methods, such as Adagrad, can be seen as first-order TR methods with an ellipsoidal trust region constraint. This observation is particularly interesting since spherical constraints are blind to the underlying geometry of the problem, but ellipsoids can adapt to local landscape characteristics, thereby allowing for more suitable steps in regions that are ill-conditioned. We will leverage this analogy and investigate the use of the Adagrad and RMSProp preconditioning matrices as ellipsoidal trust region shapes within a stochastic second-order TR algorithm (Xu et al., 2017a; Yao et al., 2018).
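The Hessian-free idea mentioned above rests on Hessian-vector products. In autodiff frameworks these are computed exactly via Pearlmutter's trick at the cost of one extra backpropagation; the sketch below uses a central difference of gradients as a cheap numerical stand-in on a toy quadratic (where the exact Hessian is known, so the approximation can be checked):

```python
import numpy as np

# A Hessian-vector product Hv gives access to curvature without ever
# forming H.  Autodiff frameworks compute it exactly (Pearlmutter's
# trick); the central difference of gradients below is a numerical
# stand-in, illustrated on L(w) = 0.5 * w^T A w with exact Hessian A.
def hvp(grad_fn, w, v, eps=1e-5):
    """Approximate Hessian-vector product from two gradient calls."""
    return (grad_fn(w + eps * v) - grad_fn(w - eps * v)) / (2 * eps)

rng = np.random.default_rng(1)
B = rng.normal(size=(4, 4))
A = B @ B.T                      # symmetric positive semi-definite
grad_fn = lambda w: A @ w        # gradient of the quadratic

w, v = rng.normal(size=4), rng.normal(size=4)
approx, exact = hvp(grad_fn, w, v), A @ v
hvp_error = np.linalg.norm(approx - exact)
```

Each call to `hvp` costs two gradient evaluations, which mirrors why Hessian-free methods scale: curvature information is obtained at a constant multiple of the cost of backpropagation, independent of $d^2$.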
Since no single ellipsoid fits all objective functions, our main contribution lies in the identification of adequate matrix-induced constraints that lead to provable convergence and significant practical speed-ups for the specific case of deep learning. On the whole, our contribution is threefold:

• We provide a new perspective on adaptive gradient methods that contributes to a better understanding of their inner workings.

• We investigate the first application of ellipsoidal TR methods to deep learning. We show that the RMSProp matrix can directly be applied as a constraint-inducing norm in second-order TR algorithms while preserving all convergence guarantees (Theorem 1).

• Finally, we provide an experimental benchmark across different real-world datasets and architectures (Section 5). We also compare second-order methods to adaptive gradient methods and report results in terms of backpropagations, epochs, and wall-clock time; a comparison we were not able to find in the literature. Our main empirical results demonstrate that ellipsoidal constraints are a very effective modification of the trust region method in the sense that they consistently outperform the spherical TR method, both in the number of backpropagations and in asymptotic loss value, on a variety of tasks.
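To make the core idea concrete, the sketch below illustrates how a diagonal RMSProp-style matrix $M$ induces the ellipsoidal constraint $\|s\|_M = \sqrt{s^\top M s} \leq \Delta$, and how the substitution $s = M^{-1/2}\hat{s}$ reduces the subproblem to a spherical one. This is an assumption-laden illustration, not the paper's solver: we solve the transformed subproblem only up to the Cauchy point, and all function names are ours:

```python
import numpy as np

# Sketch (not the paper's algorithm verbatim): a diagonal RMSProp-style
# matrix M defines the trust region ||s||_M <= Delta; rescaling by
# M^(-1/2) turns it into a standard spherical subproblem.
def rmsprop_diagonal(grad_history, beta=0.9, eps=1e-8):
    """Diagonal of M, built from an exponential average of g^2."""
    v = np.zeros_like(grad_history[0])
    for g in grad_history:
        v = beta * v + (1 - beta) * g ** 2
    return np.sqrt(v) + eps

def ellipsoidal_cauchy_step(g, H, m_diag, delta):
    """Cauchy step for min g^T s + 0.5 s^T H s  s.t.  ||s||_M <= delta."""
    mi = 1.0 / np.sqrt(m_diag)                 # diagonal of M^(-1/2)
    g_hat = mi * g                             # rescaled gradient
    H_hat = (mi[:, None] * H) * mi[None, :]    # M^(-1/2) H M^(-1/2)
    gn = np.linalg.norm(g_hat)
    gHg = g_hat @ H_hat @ g_hat
    t = delta / gn if gHg <= 0 else min(gn ** 2 / gHg, delta / gn)
    return mi * (-t * g_hat)                   # map back to s-space

# Illustrative call on a random quadratic model with Delta = 0.5.
rng = np.random.default_rng(2)
d = 6
g = rng.normal(size=d)
B = rng.normal(size=(d, d))
H = 0.5 * (B + B.T)                            # symmetric, possibly indefinite
m_diag = rmsprop_diagonal([rng.normal(size=d) for _ in range(5)])
s = ellipsoidal_cauchy_step(g, H, m_diag, delta=0.5)
constraint_norm = np.sqrt(np.sum(m_diag * s ** 2))   # ||s||_M
model_decrease = g @ s + 0.5 * s @ H @ s             # should be < 0
```

The step stays inside the ellipsoid and strictly decreases the quadratic model, which is the property the convergence theory builds on; a full TR solver would of course go beyond the Cauchy point.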

2. RELATED WORK

First-order methods The prototypical method for optimizing Eq. (1) is SGD (Robbins & Monro, 1951). The practical success of SGD in non-convex optimization is unquestioned, and theoretical explanations of this phenomenon are starting to appear. Recent findings suggest the ability of this method to escape saddle points and reach local minima in polynomial time, but they either need to artificially add noise to the iterates (Ge et al., 2015; Lee et al., 2016) or make an assumption on the inherent noise of SGD (Daneshmand et al., 2018). For neural networks, a recent line of research proclaims the effectiveness of SGD, but the results come at the cost of strong assumptions such as heavy over-parametrization and Gaussian inputs (Du et al., 2017; Brutzkus & Globerson, 2017; Li & Yuan, 2017; Du & Lee, 2018; Allen-Zhu et al., 2018). Adaptive gradient methods (Duchi et al., 2011; Tieleman & Hinton, 2012; Kingma & Ba, 2014) build on the intuition that larger (smaller) learning rates for smaller (larger) gradient components balance their respective influences, so that the methods behave as if optimizing a more isotropic surface. Such approaches were first suggested for neural nets by LeCun et al. (2012), and convergence guarantees are starting to appear (Ward et al., 2018; Li & Orabona, 2018). However, these are not superior to the $\mathcal{O}(\epsilon_g^{-2})$ worst-case complexity of standard gradient descent (Cartis et al., 2012b).
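The coordinate-wise rescaling intuition above can be sketched in a few lines (hyperparameters and the quadratic test function are illustrative, not from the paper):

```python
import numpy as np

# Minimal RMSProp-style update: each coordinate i gets an effective
# step size lr / (sqrt(v_i) + eps), so coordinates with historically
# large gradients take smaller steps.  Hyperparameters illustrative.
def rmsprop_step(w, g, v, lr=0.01, beta=0.9, eps=1e-8):
    v = beta * v + (1 - beta) * g ** 2   # running second moment
    w = w - lr * g / (np.sqrt(v) + eps)  # coordinate-wise scaling
    return w, v

# On the ill-conditioned quadratic 0.5 * (100 * w0^2 + w1^2), the
# scaling makes the surface behave as if it were more isotropic:
# both coordinates make progress at comparable speed.
scales = np.array([100.0, 1.0])
w, v = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(2000):
    g = scales * w
    w, v = rmsprop_step(w, g, v)
final_loss = 0.5 * np.sum(scales * w ** 2)
```

Plain gradient descent on the same quadratic must use a step size dictated by the stiff coordinate (curvature 100) and therefore crawls along the flat one; the preconditioned update sidesteps this coupling, which is the ill-conditioning argument made above.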

Regularized Newton methods The most principled classes of regularized Newton methods are trust region (TR) and adaptive cubic regularization (ARC) algorithms (Conn et al., 2000; Cartis et al., 2011), which repeatedly optimize a local Taylor model of the objective while making sure that the step does not travel too far, so that the model stays accurate. While the former finds first-order stationary points within $\mathcal{O}(\epsilon_g^{-2})$ iterations, ARC takes at most $\mathcal{O}(\epsilon_g^{-3/2})$. However, simple modifications to the TR framework allow these methods to obtain the same accelerated rate (Curtis et al., 2017). Both methods

