ADAPTIVE NORMS FOR DEEP LEARNING WITH REGULARIZED NEWTON METHODS

Abstract

We investigate the use of regularized Newton methods with adaptive norms for optimizing neural networks. This approach can be seen as a second-order counterpart of adaptive gradient methods, which we here show to be interpretable as first-order trust region methods with ellipsoidal constraints. In particular, we prove that the preconditioning matrix used in RMSProp and Adam satisfies the necessary conditions for provable convergence of second-order trust region methods with standard worst-case complexities on general non-convex objectives. Furthermore, we run experiments across different neural architectures and datasets to find that the ellipsoidal constraints consistently outperform their spherical counterpart both in terms of number of backpropagations and asymptotic loss value. Finally, we find comparable performance to state-of-the-art first-order methods in terms of backpropagations, but further advances in hardware are needed to render Newton methods competitive in terms of computational time.

1. INTRODUCTION

We consider finite-sum optimization problems of the form

min_{w ∈ R^d} L(w) := Σ_{i=1}^n ℓ(f(w, x_i, y_i)),    (1)

which typically arise in neural network training, e.g. for empirical risk minimization over a set of data points (x_i, y_i) ∈ R^{in} × R^{out}, i = 1, ..., n. Here, ℓ : R^{out} × R^{out} → R_+ is a convex loss function and f : R^{in} × R^d → R^{out} represents the neural network mapping parameterized by the concatenation of the weight layers w ∈ R^d, which is non-convex due to its multiplicative nature and potentially non-linear activation functions. We assume that L is lower bounded and twice differentiable, i.e. L ∈ C^2(R^d, R), and consider finding a first- and second-order stationary point ŵ for which ‖∇L(ŵ)‖ ≤ ε_g and λ_min(∇²L(ŵ)) ≥ −ε_H. In the era of deep neural networks, stochastic gradient descent (SGD) is one of the most widely used training algorithms (Bottou, 2010). What makes SGD so attractive is its simplicity and a per-iteration cost that is independent of the size of the training set (n) and scales linearly in the dimensionality (d). However, gradient descent is known to be inadequate for optimizing ill-conditioned functions (Nesterov, 2013; Shalev-Shwartz et al., 2017), and thus adaptive gradient methods that employ dynamic, coordinate-wise learning rates based on past gradients, including Adagrad (Duchi et al., 2011), RMSprop (Tieleman & Hinton, 2012) and Adam (Kingma & Ba, 2014), have become a popular alternative, often providing significant speed-ups over SGD. From a theoretical perspective, Newton methods provide stronger convergence guarantees by appropriately transforming the gradient in ill-conditioned regions according to second-order derivatives. It is precisely this Hessian information that allows regularized Newton methods to enjoy superlinear local convergence as well as to provably escape saddle points (Conn et al., 2000).
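For illustration, the stationarity conditions above translate directly into a numerical check. The sketch below is our own; the helper name and tolerance values are not from the paper:

```python
import numpy as np

def is_stationary(grad, hess, eps_g=1e-3, eps_H=1e-3):
    """Check approximate first- and second-order stationarity:
    ||grad|| <= eps_g and lambda_min(hess) >= -eps_H."""
    first_order = np.linalg.norm(grad) <= eps_g
    second_order = np.linalg.eigvalsh(hess).min() >= -eps_H
    return first_order and second_order

# Quadratic L(w) = 0.5 w^T H w with positive definite H:
# w = 0 is a second-order stationary point (the global minimizer).
H = np.array([[2.0, 0.0], [0.0, 0.5]])
w = np.zeros(2)
assert is_stationary(H @ w, H)
```

At a strict saddle the gradient vanishes but the Hessian has a sufficiently negative eigenvalue, so the second-order condition fails.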
While second-order algorithms have a long-standing history even in the realm of neural network training (Hagan & Menhaj, 1994; Becker et al., 1988), they were mostly considered too expensive in computation and memory for practical applications. Yet, the seminal work of Martens (2010) renewed interest in their use in deep learning by proposing efficient Hessian-free methods that only access second-order information via matrix-vector products, which can be computed at the cost of an additional backpropagation (Pearlmutter, 1994; Schraudolph, 2002). Among the class of regularized Newton methods, trust region (Conn et al., 2000) and cubic regularization algorithms (Cartis et al., 2011) are the most principled approaches in the sense that they yield the strongest convergence guarantees. Recently, stochastic extensions have emerged (Xu et al., 2017b; Yao et al., 2018; Kohler & Lucchi, 2017; Gratton et al., 2017), which suggest their applicability for deep learning. We here propose a simple modification to make TR methods even more suitable for neural network training. Particularly, we build upon the following alternative view on adaptive gradient methods: while gradient descent can be interpreted as a spherically constrained first-order TR method, preconditioned gradient methods, such as Adagrad, can be seen as first-order TR methods with an ellipsoidal trust region constraint. This observation is particularly interesting since spherical constraints are blind to the underlying geometry of the problem, but ellipsoids can adapt to local landscape characteristics, thereby allowing for more suitable steps in regions that are ill-conditioned. We will leverage this analogy and investigate the use of the Adagrad and RMSProp preconditioning matrices as ellipsoidal trust region shapes within a stochastic second-order TR algorithm (Xu et al., 2017a; Yao et al., 2018).
Since no ellipsoid fits all objective functions, our main contribution lies in the identification of adequate matrix-induced constraints that lead to provable convergence and significant practical speed-ups for the specific case of deep learning. On the whole, our contribution is threefold:
• We provide a new perspective on adaptive gradient methods that contributes to a better understanding of their inner workings.
• We investigate the first application of ellipsoidal TR methods for deep learning. We show that the RMSProp matrix can directly be applied as a constraint-inducing norm in second-order TR algorithms while preserving all convergence guarantees (Theorem 1).
• Finally, we provide an experimental benchmark across different real-world datasets and architectures (Section 5). We compare second-order methods also to adaptive gradient methods and show results in terms of backpropagations, epochs, and wall-clock time; a comparison we were not able to find in the literature. Our main empirical results demonstrate that ellipsoidal constraints prove to be a very effective modification of the trust region method in the sense that they consistently outperform the spherical TR method, both in terms of number of backpropagations and asymptotic loss value on a variety of tasks.

2. RELATED WORK

First-order methods The prototypical method for optimizing Eq. (1) is SGD (Robbins & Monro, 1951). The practical success of SGD in non-convex optimization is unquestioned and theoretical explanations of this phenomenon are starting to appear. Recent findings suggest the ability of this method to escape saddle points and reach local minima in polynomial time, but they either need to artificially add noise to the iterates (Ge et al., 2015; Lee et al., 2016) or make an assumption on the inherent noise of SGD (Daneshmand et al., 2018). For neural networks, a recent line of research proclaims the effectiveness of SGD, but the results come at the cost of strong assumptions such as heavy over-parametrization and Gaussian inputs (Du et al., 2017; Brutzkus & Globerson, 2017; Li & Yuan, 2017; Du & Lee, 2018; Allen-Zhu et al., 2018). Adaptive gradient methods (Duchi et al., 2011; Tieleman & Hinton, 2012; Kingma & Ba, 2014) build on the intuition that larger (smaller) learning rates for smaller (larger) gradient components balance their respective influences and thereby the methods behave as if optimizing a more isotropic surface. Such approaches have first been suggested for neural nets by LeCun et al. (2012) and convergence guarantees are starting to appear (Ward et al., 2018; Li & Orabona, 2018). However, these are not superior to the O(ε_g^{-2}) worst-case complexity of standard gradient descent (Cartis et al., 2012b).

Regularized Newton methods

The most principled class of regularized Newton methods are trust region (TR) and adaptive cubic regularization algorithms (ARC) (Conn et al., 2000; Cartis et al., 2011), which repeatedly optimize a local Taylor model of the objective while making sure that the step does not travel too far such that the model stays accurate. While the former finds first-order stationary points within O(ε_g^{-2}) iterations, ARC only takes at most O(ε_g^{-3/2}). However, simple modifications to the TR framework allow these methods to obtain the same accelerated rate (Curtis et al., 2017). Both methods take at most O(ε_H^{-3}) iterations to find an ε_H-approximate second-order stationary point (Cartis et al., 2012a). These rates are optimal for second-order Lipschitz continuous functions (Carmon et al., 2017; Cartis et al., 2012a) and they can be retained even when only sub-sampled gradient and Hessian information is used (Kohler & Lucchi, 2017; Yao et al., 2018; Xu et al., 2017b; Blanchet et al., 2016; Liu et al., 2018). Furthermore, the involved Hessian information can be computed solely based on Hessian-vector products, which can be implemented very efficiently (Pearlmutter, 1994). This makes these methods particularly attractive for deep learning, but the empirical evidence of their applicability is rather limited. We are only aware of the works of Liu et al. (2018) and Xu et al. (2017a), which report promising first results but are by no means fully encompassing.

Gauss-Newton methods An interesting line of research proposes to replace the Hessian by (approximations of) the generalized Gauss-Newton matrix (GGN) within a Levenberg-Marquardt framework¹ (LeCun et al., 2012; Martens, 2010; Martens & Grosse, 2015). As the GGN matrix is always positive semidefinite, these methods cannot leverage negative curvature to escape saddles and hence there exist no second-order convergence guarantees. Furthermore, there are cases in neural networks where the Hessian is better conditioned than the GGN matrix (Mizutani & Dreyfus, 2008).
Nevertheless, the above works report promising preliminary results; most notably, Grosse & Martens (2016) find that K-FAC can be faster than SGD on a small convnet. On the other hand, recent findings report performance at best comparable to SGD on the much larger ResNet architecture (Ma et al., 2019). Moreover, Xu et al. (2017a) report many cases where TR and GGN algorithms perform similarly. This line of work can be seen as complementary to our approach since it is straightforward to replace the Hessian in the TR framework with the GGN matrix. Furthermore, the preconditioners used in Martens (2010) and Chapelle & Erhan (2011), namely diagonal estimates of the empirical Fisher and Fisher matrix, respectively, can directly be used as matrix norms in our ellipsoidal TR framework.

3. AN ALTERNATIVE VIEW ON ADAPTIVE GRADIENT METHODS

Adaptively preconditioned gradient methods update iterates as w_{t+1} = w_t − η_t A_t^{-1/2} g_t, where g_t is a stochastic estimate of ∇L(w_t) and A_t is a positive definite symmetric preconditioning matrix. In Adagrad, A_ada,t is the un-centered second moment matrix of the past gradients, computed as A_ada,t := G_t G_t^T + εI, where ε > 0, I is the d × d identity matrix and G_t = [g_1, g_2, ..., g_t]. Building on the intuition that past gradients may become obsolete in quickly changing non-convex landscapes, RMSprop (and Adam) introduce an exponential weight decay, leading to the preconditioning matrix

A_rms,t := (1 − β) G_t diag(β^t, ..., β^0) G_t^T + εI,  β ∈ (0, 1).    (3)

In order to save computational effort, the diagonal versions diag(A_ada) and diag(A_rms) are more commonly applied in practice, which in turn gives rise to coordinate-wise adaptive stepsizes that are enlarged (reduced) in coordinates that have seen past gradient components of smaller (larger) magnitude.
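As a concrete illustration, the diagonal variants of these preconditioners can be sketched as follows (the function names and the choice ε = 1e-8 are ours, not the paper's):

```python
import numpy as np

def adagrad_precond(grads, eps=1e-8):
    """Diagonal of A_ada: sum of squared past gradients plus eps."""
    return np.sum(np.asarray(grads) ** 2, axis=0) + eps

def rmsprop_precond(grads, beta=0.9, eps=1e-8):
    """Diagonal of A_rms: exponentially weighted average of squared
    gradients plus eps (recursive form of (1-beta) sum beta^{t-k} g_k^2)."""
    a = np.zeros_like(grads[0])
    for g in grads:
        a = beta * a + (1 - beta) * g ** 2
    return a + eps

# Preconditioned update: w <- w - eta * A^{-1/2} g
grads = [np.array([1.0, 0.1]), np.array([0.8, 0.2])]
A_diag = rmsprop_precond(grads)
step = -0.1 * grads[-1] / np.sqrt(A_diag)
```

Coordinates with a history of small gradient components get a small diagonal entry and hence a larger effective stepsize, and vice versa.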

3.1. ADAPTIVE PRECONDITIONING AS ELLIPSOIDAL TRUST REGION

Starting from the fact that adaptive methods employ coordinate-wise stepsizes, one can take a principled view on these methods. Namely, their update steps arise from minimizing a first-order Taylor model of the function L within an ellipsoidal search space around the current iterate w_t, where the diameter of the ellipsoid along a particular coordinate is implicitly given by η_t and ‖g_t‖_{A_t^{-1}}. Correspondingly, vanilla (S)GD optimizes the same first-order model within a spherical constraint. Figure 1 (top) illustrates this effect by showing not only the iterates of GD and Adagrad but also the implicit trust regions within which the local models were optimized at each step.² It is well known that GD struggles to progress towards the minimizer of quadratics along low-curvature directions (see e.g. Goh (2017)). While this effect is negligible for well-conditioned objectives (Fig. 1, left), it leads to a drastic slow-down when the problem is ill-conditioned (Fig. 1, right). Here is precisely where the advantage of adaptive stepsize methods comes into play. As illustrated by the dashed lines, Adagrad's search space is damped along the direction of high curvature (vertical axis) and elongated along the low-curvature direction (horizontal axis). This allows the method to move further horizontally early on and to enter the valley at a smaller distance to the optimizer w* along the low-curvature direction, which accelerates convergence. Let us now formally establish the result that allows us to re-interpret adaptive gradient methods from the trust region perspective introduced above.

Lemma 1 (Preconditioned gradient methods as TR). A preconditioned gradient step

w_{t+1} − w_t = s_t := −η_t A_t^{-1} g_t    (4)

with stepsize η_t > 0, symmetric positive definite preconditioner A_t ∈ R^{d×d} and g_t ≠ 0 minimizes a first-order model around w_t ∈ R^d in an ellipsoid given by A_t, in the sense that

s_t := argmin_{s ∈ R^d} m_t^1(s) = L(w_t) + s^T g_t,  s.t. ‖s‖_{A_t} ≤ η_t ‖g_t‖_{A_t^{-1}}.    (5)
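Lemma 1 can be checked numerically. The sketch below (our own illustration, not from the paper) draws a random positive definite A, verifies that the preconditioned step lands exactly on the boundary of the ellipsoid of radius η_t ‖g_t‖_{A^{-1}}, and confirms that it attains a lower linear-model value than random feasible boundary points:

```python
import numpy as np

rng = np.random.default_rng(0)
d, eta = 5, 0.1
g = rng.normal(size=d)
M = rng.normal(size=(d, d))
A = M @ M.T + np.eye(d)                 # symmetric positive definite
A_inv = np.linalg.inv(A)

s = -eta * A_inv @ g                    # preconditioned gradient step
radius = eta * np.sqrt(g @ A_inv @ g)   # eta * ||g||_{A^{-1}}

# The step lies exactly on the ellipsoid boundary ...
assert np.isclose(np.sqrt(s @ A @ s), radius)

# ... and minimizes the linear model s^T g among feasible points:
# no random boundary point of the ellipsoid does better.
for _ in range(1000):
    u = rng.normal(size=d)
    u = radius * u / np.sqrt(u @ A @ u)  # random point with ||u||_A = radius
    assert s @ g <= u @ g + 1e-12
```

Since the model is linear, the minimizer always sits on the boundary, which is why comparing against boundary points suffices.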
Corollary 1 (RMSprop). The step s_rms,t := −η_t A_rms,t^{-1/2} g_t minimizes a first-order Taylor model around w_t in an ellipsoid given by A_rms,t^{1/2} (Eq. 3), in the sense that

s_rms,t := argmin_{s ∈ R^d} m_t^1(s) = L(w_t) + s^T g_t,  s.t. ‖s‖_{A_rms,t^{1/2}} ≤ η_t ‖g_t‖_{A_rms,t^{-1/2}}.    (6)

Equivalent results can be established for Adam using ḡ_adam,t := (1 − β) Σ_{k=0}^t β^{t−k} g_k, as well as for Adagrad by replacing the matrix A_ada into the constraint in Eq. (6). Of course, the update procedure in Eq. (5) is merely a reinterpretation of the original preconditioned update, and thus the employed trust region radii are defined implicitly by the current gradient and stepsize.

3.2. DIAGONAL VERSUS FULL PRECONDITIONING

A closer look at Figure 1 reveals that the first two problems are perfectly axis-aligned, which makes these objectives particularly attractive for diagonal preconditioning. For comparison, we report another quadratic instance where the Hessian is no longer zero on the off-diagonals (Fig. 1, right). As can be seen, this introduces a tilt in the level sets and reduces the superiority of diagonal Adagrad over plain GD. However, using the full preconditioner A_ada re-establishes the original speed-up. Yet, non-diagonal preconditioning comes at the cost of taking the inverse square root of a large matrix, which is why this approach has remained relatively unexplored (see Agarwal et al. (2018) for an exception). Interestingly, early results by Becker et al. (1988) on the curvature of neural nets report a strong diagonal dominance of the Hessian matrix ∇²L(w). However, the reported numbers are only for tiny networks of at most 256 parameters. We here take a first step towards generalizing these findings to modern-day networks. Furthermore, we contrast the diagonal dominance of real Hessians with the expected behavior of random Wigner matrices.³ For further evidence, we also compare Hessians of Ordinary Least Squares (OLS) problems with random inputs. For this purpose, let δ_A denote the ratio of diagonal to overall mass of a matrix A, i.e. δ_A := Σ_i |A_{i,i}| / Σ_i Σ_j |A_{i,j}|, as in Becker et al. (1988).

Proposition 1 (Diagonal share of Wigner matrix). For a random Gaussian⁴ Wigner matrix W (see Eq. (42)), the diagonal mass of the expected absolute matrix amounts to

δ_{E[|W|]} = 1 / (1 + (d − 1) σ₂/σ₁).

Thus, if we suppose the Hessian at any given point w were a random Wigner matrix, we would expect the share of diagonal mass to fall as O(1/d) as the network grows in size. In the following, we derive a similar result for the large-n limit in the case of OLS Hessians.

Proposition 2 (Diagonal share of OLS Hessian). Let X ∈ R^{d×n} and assume each x_{i,j} is generated i.i.d.
with zero mean and finite second moment σ² > 0. Then the share of diagonal mass of the expected matrix E[|H_ols|] satisfies

δ_{E[|H_ols|]} → √n / (√n + (d − 1) √(2/π))  as n → ∞.

Empirical simulations suggest that this result holds already in small-n settings (see Figure D.2), and finite-n results can likely be derived under assumptions such as Gaussian data. As can be seen in Figure 2 below, even for a practical batch size of n = 32 the diagonal mass δ_H of neural networks stays above both benchmarks, for random inputs as well as with real-world data. These results are in line with Becker et al. (1988) and suggest that full matrix preconditioning might indeed not be worth the additional computational cost. We thus use diagonal preconditioning in all of our experiments in Section 5, but note that further theoretical and empirical elaboration of these findings is needed to assess the Hessian structure and hence the effectiveness of full-matrix preconditioning, which is out of the scope of the work at hand.
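The diagonal-mass ratio δ_A and the two baselines are straightforward to simulate (illustrative code; dimensions, seed, and sample sizes are our own choices):

```python
import numpy as np

def diag_share(A):
    """delta_A: ratio of diagonal absolute mass to total absolute mass."""
    A = np.abs(A)
    return np.trace(A) / A.sum()

rng = np.random.default_rng(0)
d = 200

# Symmetrized Gaussian matrix as a Wigner-style baseline: its diagonal
# share decays on the order of 1/d as the dimension grows.
W = rng.normal(size=(d, d))
W = (W + W.T) / 2

# OLS Hessian X X^T with i.i.d. inputs and a practical batch size n = 32:
# it concentrates noticeably more mass on the diagonal than the Wigner
# baseline, in line with Proposition 2.
n = 32
X = rng.normal(size=(d, n))
H_ols = X @ X.T

print(diag_share(W), diag_share(H_ols))
```

For a diagonal matrix δ_A = 1 by construction, so values close to 1 indicate that diagonal preconditioning captures most of the curvature mass.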

4. SECOND-ORDER TRUST REGION METHODS

Cubic regularization (Nesterov & Polyak, 2006; Cartis et al., 2011) and trust region methods belong to the family of globalized Newton methods. Both frameworks compute parameter updates by optimizing regularized (former) or constrained (latter) second-order Taylor models of the objective L around the current iterate w_t.⁵ In particular, in iteration t the update step of the trust region algorithm is computed as

min_{s ∈ R^d} m_t(s) := L(w_t) + g_t^T s + (1/2) s^T B_t s,  s.t. ‖s‖_{A_t} ≤ Δ_t,    (7)

where Δ_t > 0 and g_t and B_t are either ∇L(w_t) and ∇²L(w_t) or suitable approximations. The matrix A_t induces the shape of the constraint set. So far, the common choice for neural networks is A_t := I, ∀t, which gives rise to spherical trust regions (Xu et al., 2017a; Liu et al., 2018). By solving the constrained problem (7), TR methods overcome the problem that pure Newton steps may be ascent directions, attracted by saddles, or not even computable. Please see Appendix B for more details.

Why ellipsoids? There are many sources of ill-conditioning in neural networks, such as un-centered and correlated inputs (LeCun et al., 2012), saturated hidden units, and different weight scales in different layers (Van Der Smagt & Hirzinger, 1998). While the quadratic term of model (7) accounts for such ill-conditioning to some extent, the spherical constraint is completely blind to the loss surface. Thus, it is advisable to instead measure distances in norms that reflect the underlying geometry (see Chap. 7.7 in Conn et al. (2000)). The ellipsoids we propose allow for longer steps along coordinates that have seen small gradient components in the past, and vice versa. Thereby the TR shape is adaptively adjusted to fit the current region of the loss landscape. This is not only effective when the iterates are in an ill-conditioned neighborhood of a minimizer (Fig. 1), but it also helps to escape elongated plateaus (see the autoencoder in Sec. 5).
Contrary to adaptive first-order methods, the diameter Δ_t is updated directly, depending on whether or not the local Taylor model is an adequate approximation at the current point.
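To make concrete how the ellipsoidal subproblem (7) can be handled, here is a small numerical sketch (our own illustration, not the solver used in the paper): the change of variables s̃ = A^{1/2} s turns the ellipsoidal constraint into a spherical one, which we then solve with a simple eigendecomposition-plus-bisection scheme. The so-called hard case of the TR subproblem is deliberately not handled here.

```python
import numpy as np

def solve_tr_spherical(g, B, delta):
    """Solve min g^T s + 0.5 s^T B s subject to ||s||_2 <= delta by
    bisection on the multiplier lam in s(lam) = -(B + lam I)^{-1} g.
    Dense eigendecomposition; fine for small d, hard case ignored."""
    d = len(g)
    lam_min = np.linalg.eigvalsh(B).min()
    step = lambda lam: -np.linalg.solve(B + lam * np.eye(d), g)
    if lam_min > 0:
        s = step(0.0)                      # unconstrained Newton step
        if np.linalg.norm(s) <= delta:
            return s
    lo = max(0.0, -lam_min) + 1e-12        # ||s(lam)|| decreases in lam
    hi = max(1.0, -lam_min) + 1.0
    while np.linalg.norm(step(hi)) > delta:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if np.linalg.norm(step(mid)) > delta:
            lo = mid
        else:
            hi = mid
    return step(hi)

def solve_tr_ellipsoidal(g, B, A, delta):
    """Reduce ||s||_A <= delta to the spherical case via s_tilde = A^{1/2} s."""
    vals, vecs = np.linalg.eigh(A)
    A_half_inv = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T
    s_tilde = solve_tr_spherical(A_half_inv @ g,
                                 A_half_inv @ B @ A_half_inv, delta)
    return A_half_inv @ s_tilde            # map back to original coordinates
```

The transformed Hessian A^{-1/2} B A^{-1/2} makes explicit how the ellipsoid reshapes the model's curvature before the spherical solver sees it.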

4.1. CONVERGENCE OF ELLIPSOIDAL TRUST REGION METHODS

Inspired by the success of adaptive gradient methods, we investigate the use of their preconditioning matrices as norm-inducing matrices for second-order TR methods. The crucial condition for convergence is that the applied norms do not degenerate during the minimization process, in the sense that the ellipsoids do not flatten out (or blow up) completely along any given direction. The following definition formalizes this intuition.

Definition 1 (Uniformly equivalent norms). The norms ‖w‖_{A_t} := (w^T A_t w)^{1/2} induced by symmetric positive definite matrices A_t are called uniformly equivalent if there exists µ ≥ 1 such that for all w ∈ R^d and all t = 1, 2, ...

(1/µ) ‖w‖_{A_t} ≤ ‖w‖_2 ≤ µ ‖w‖_{A_t}.

We now establish a result which shows that the RMSProp ellipsoid is indeed uniformly equivalent.

Lemma 2 (Uniform equivalence). Suppose ‖g_t‖_2 ≤ L_g for all w_t ∈ R^d, t = 1, 2, ... Then there always exists ε > 0 such that the proposed preconditioning matrices A_rms,t (Eq. 3) are uniformly equivalent, i.e. Def. 1 holds. The same holds for the diagonal variant.

Consequently, the ellipsoids A_rms,t can directly be applied in any convergent TR framework without losing the guarantee of convergence (Conn et al. (2000), Theorem 6.6.8).⁶ In Theorem 1 we extend this result by showing the (to the best of our knowledge) first convergence rate for ellipsoidal TR methods. Interestingly, similar results cannot be established for A_ada,t, which reflects the widely known vanishing-stepsize problem that arises since squared gradients are continuously added to the preconditioning matrix. At least partially, this effect inspired the development of RMSprop (Tieleman & Hinton, 2012) and Adadelta (Zeiler, 2012).
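The intuition behind Lemma 2 can be verified empirically: since the exponential weights (1 − β) Σ_k β^{t−k} sum to at most 1, every eigenvalue of A_rms,t lies in [ε, L² + ε] whenever the gradients are bounded by L, so the induced norms cannot degenerate. A small simulation of this bound (our own sketch; the constants are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, beta, eps, L = 10, 200, 0.9, 1e-4, 5.0

A = np.zeros((d, d))
for t in range(T):
    g = rng.normal(size=d)
    g *= min(1.0, L / np.linalg.norm(g))       # enforce ||g_t|| <= L
    A = beta * A + (1 - beta) * np.outer(g, g)  # recursive A_rms update
    vals = np.linalg.eigvalsh(A + eps * np.eye(d))
    # Eigenvalues stay in [eps, L^2 + eps] at every t, so the induced
    # norms are uniformly equivalent to the Euclidean norm (Def. 1).
    assert vals.min() >= eps - 1e-12
    assert vals.max() <= L**2 + eps + 1e-9
```

By contrast, the Adagrad matrix keeps accumulating squared gradients, so its largest eigenvalue can grow without bound, which is exactly the failure of uniform equivalence noted above.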

4.2. A STOCHASTIC ELLIPSOIDAL TR FRAMEWORK FOR NEURAL NETWORK TRAINING

Since neural network training often constitutes a large-scale learning problem in which the number of datapoints n is high, we here opt for a stochastic TR framework in order to circumvent memory issues and reduce the computational complexity. To obtain convergence without computing full derivative information, we first need to assume sufficiently accurate gradient and Hessian estimates.

Assumption 1 (Sufficiently accurate derivatives). The approximations of the gradient and Hessian at step t satisfy ‖g_t − ∇L(w_t)‖ ≤ δ_g and ‖B_t − ∇²L(w_t)‖ ≤ δ_H, where δ_g ≤ (1 − η) ε_g / 4 and δ_H ≤ min((1 − η) v ε_H / 2, 1), for some 0 < v < 1.

For finite-sum objectives such as Eq. (1), the above condition can be met by random sub-sampling due to classical concentration results for sums of random variables (Xu et al., 2017b; Kohler & Lucchi, 2017; Tripuraneni et al., 2017). Following these references, we assume access to the full function value in each iteration for our theoretical analysis, but we note that convergence can be retained even for fully stochastic trust region methods (Gratton et al., 2017; Chen et al., 2018; Blanchet et al., 2016), and indeed our experiments in Section 5 use sub-sampled function values due to memory constraints. Secondly, we adapt the framework of Yao et al. (2018); Xu et al. (2017b), which allows for cheap inexact subproblem minimization, to the case of iteration-dependent constraint norms (Alg. 1).

Algorithm 1 Stochastic Ellipsoidal Trust Region Method
1: Input: w_0 ∈ R^d, γ > 1, 1 > η > 0, Δ_0 > 0
2: for t = 0, 1, ..., until convergence do
3:   Compute approximations g_t and B_t.
4:   If ‖g_t‖ ≤ ε_g, set g_t := 0.

Given that the adaptive norms induced by A_rms,t satisfy uniform equivalence as shown in Lemma 2, the following theorem establishes an O(max(ε_g^{-2} ε_H^{-1}, ε_H^{-3})) worst-case iteration complexity, which effectively matches that of Yao et al. (2018).

Theorem 1 (Convergence rate of Algorithm 1). Assume that L(w) is second-order smooth with Lipschitz constants L_g and L_H. Furthermore, let Assumptions 1 and 2 hold. Then Algorithm 1 finds an O(ε_g, ε_H) first- and second-order stationary point in at most O(max(ε_g^{-2} ε_H^{-1}, ε_H^{-3})) iterations.

The proof of this statement is a straightforward adaptation of the proof for spherical constraints, taking into account that the guaranteed model decrease changes when the computed step s_t lies outside the trust region. Due to the uniform equivalence established in Lemma 2, the altered diameter of the trust region along any direction, and hence the change factor, is always strictly positive and finite.
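To convey the overall structure of such a trust region loop, the following heavily simplified, deterministic sketch uses a diagonal RMSProp-style ellipsoid and replaces the inexact subproblem solver with a Cauchy-type step along the preconditioned gradient. All names and parameter values here are illustrative, not the paper's:

```python
import numpy as np

def ellipsoidal_tr(loss, grad, hess, w, n_iters=100, delta=1.0,
                   gamma=2.0, eta=0.1, beta=0.9, eps=1e-8):
    """Toy ellipsoidal TR loop with a diagonal RMSProp norm; the TR
    subproblem is approximated by a Cauchy step, not solved exactly."""
    a = np.zeros_like(w)                      # running second-moment estimate
    for _ in range(n_iters):
        g, B = grad(w), hess(w)
        if np.linalg.norm(g) < 1e-12:
            break
        a = beta * a + (1 - beta) * g ** 2
        A = a + eps                            # diagonal ellipsoid matrix
        p = -g / A                             # preconditioned descent direction
        t_max = delta / np.sqrt(p @ (A * p))   # stay inside ||s||_A <= delta
        gBp = p @ B @ p
        t = min(-(g @ p) / gBp, t_max) if gBp > 0 else t_max
        s = t * p
        pred = -(g @ s + 0.5 * s @ B @ s)      # predicted model decrease
        rho = (loss(w) - loss(w + s)) / max(pred, 1e-16)
        if rho > eta:                          # accept step
            w = w + s
            if rho > 0.75:
                delta *= gamma                 # model accurate: widen region
        else:
            delta /= gamma                     # model poor: shrink region
    return w
```

The acceptance test on rho is the TR mechanism the text describes: the diameter Δ_t grows or shrinks depending on how well the quadratic model predicted the actual loss change.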

5. EXPERIMENTS

To validate our claim that ellipsoidal TR methods yield improved performance over spherical ones, we run a set of experiments on two image datasets and three types of network architectures. All methods run on (almost) the same hyperparameters across all experiments (see Table 1 in Appendix B). As depicted in Fig. 3, the ellipsoidal TR methods consistently outperform their spherical counterpart in the sense that they reach full training accuracy substantially faster on all problems. Moreover, their limit points are in all cases lower than those of the uniform method. Interestingly, this makes an actual difference in the image reconstruction quality of autoencoders (see Figure 12), where the spherically constrained TR method struggles to escape a saddle. We thus draw the clear conclusion that the ellipsoidal constraints we propose are to be preferred over spherical ones when training neural nets with second-order methods. More experimental and architectural details are provided in App. C. To put the previous results into context, we also benchmark several state-of-the-art gradient methods. For a fair comparison, we report results in terms of number of backpropagations, epochs and time. All figures can be found in App. C. Our findings are mixed: for small nets such as the MLPs, the TR method with RMSProp ellipsoids is superior in all metrics, even when benchmarked in terms of time. However, while Fig. 9 indicates that ellipsoidal TR methods are slightly superior in terms of backpropagations even for ResNets and autoencoders, a close look at Figs. 10 and 11 reveals that they at best keep pace with first-order methods in terms of epochs and are inferior in terms of time.

6. CONCLUSION

We investigated the use of ellipsoidal trust region constraints for neural networks. We have shown that the RMSProp matrix satisfies the necessary conditions for convergence, and our experimental results demonstrate that ellipsoidal TR methods outperform their spherical counterparts significantly across a large set of experiments. We thus consider the development of further ellipsoids that can potentially adapt even better to the loss landscape, such as (block-)diagonal Hessian approximations (e.g. Bekas et al. (2007)) or approximations of higher-order derivatives, an interesting direction of future research. Interestingly, the gradient method benchmark indicates that the value of Hessian information for neural network training is limited for mainly two reasons: 1) second-order methods rarely yield better limit points, which suggests that saddles and spurious local minima are not a major obstacle in modern-day architectures; 2) the per-iteration time complexity is noticeably lower for first-order methods (Figure 11). The latter observation suggests that advances in distributed second-order algorithms (e.g. Osawa et al. (2018); Dünner et al. (2018)) constitute a promising direction of research towards the goal of a more widespread use of Newton-type methods in deep learning.



Footnotes:
1. This algorithm is a simplified TR method, initially tailored for non-linear least squares problems (Nocedal & Wright, 2006).
2. We only plot every other trust region. Since the models are linear, the minimizer is always on the boundary.
3. Of course, Hessians do not have i.i.d. entries, but the symmetry of Wigner matrices suggests that this baseline is not completely off.
4. The argument naturally extends to any distribution with positive expected absolute values.
5. In the following we only treat TR methods, but we emphasize that the use of matrix-induced norms can directly be transferred to the cubic regularization framework.
6. Note that the assumption of bounded batch gradients, i.e. smooth objectives, is common in the analysis of stochastic algorithms (Allen-Zhu, 2017; Defazio et al., 2014; Schmidt et al., 2017; Duchi et al., 2011).



Figure 1: Top: Iterates and implicit trust regions of GD and Adagrad on quadratic objectives with different condition number κ. Bottom: Average log suboptimality over iterations as well as 90% confidence intervals of 30 runs with random initialization

Figure 2: Diagonal mass of neural network Hessian δ_H relative to δ_{E[|W|]} and δ_{E[|H_ols|]} of corresponding dimensionality for random inputs as well as at random initialization, midway through training, and after reaching 90% training accuracy with RMSProp on CIFAR-10. Mean and 95% confidence interval over 10 independent runs.

Figure 3: Mean and 95% confidence interval of 10 runs. Green dotted line indicates 99% training accuracy.

Set A_t := A_rms,t or A_t := diag(A_rms,t) (see Eq. (3)). Each update step s_t yields at least as much model decrease as the Cauchy and eigen point simultaneously, i.e. m_t(s_t) ≤ m_t(s_t^C) and m_t(s_t) ≤ m_t(s_t^E), where s_t^C and s_t^E are defined in Eq. (28).

