A STRAIGHTFORWARD LINE SEARCH APPROACH ON THE EXPECTED EMPIRICAL LOSS FOR STOCHASTIC DEEP LEARNING PROBLEMS

Abstract

A fundamental challenge in deep learning is that the optimal step sizes for the update steps of stochastic gradient descent are unknown. In traditional optimization, line searches are used to determine good step sizes; in deep learning, however, searching for good step sizes on the expected empirical loss is too costly because the losses are noisy. This empirical work shows that for common deep learning tasks it is possible to approximate the expected empirical loss on vertical cross sections comparatively cheaply. This is achieved by applying traditional one-dimensional function fitting to noisy losses measured on such cross sections. The step to a minimum of the resulting approximation is then used as the step size for the optimization. This approach leads to a robust and straightforward optimization method which performs well across datasets and architectures without the need for hyperparameter tuning.

1. INTRODUCTION AND BACKGROUND

The automatic determination of an optimal learning rate schedule for training models with stochastic gradient descent or similar optimizers is still not solved satisfactorily for standard, and especially for new, deep learning tasks. Frequently, optimization approaches utilize the loss and gradient information of a single batch to perform an update step. However, these approaches focus on the batch loss, whereas the optimal step size should actually be determined on the empirical loss, which is the expected loss over all batches. In classical optimization, line searches are commonly used to determine good step sizes. In deep learning, however, the noisy loss function makes it impractically costly to search for step sizes on the empirical loss. This work empirically revisits the observation that the empirical loss has a simple shape in the direction of noisy gradients. Based on this observation, it is shown that the empirical loss can easily be fitted with low-order polynomials in these directions. This is done by performing a straightforward one-dimensional regression on batch losses sampled along such a direction. It then becomes simple to determine a suitable minimum, and thus a good step size, from the approximated function. This results in a line search on the empirical loss. Compared to directly measuring the empirical loss at several locations, our approach is cost-efficient, since it requires only about 500 sampled losses to approximate a cross section of the loss. From a practical point of view, this is still too expensive to determine the step size for each step. Fortunately, it turns out to be sufficient to estimate a new step size only a few times during training, which, due to the more beneficial update steps, does not require any additional training time. We show that this straightforward optimization approach, called ELF (Empirical Loss Fitting optimizer), performs robustly across datasets and models without the need for hyperparameter tuning.
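To make the fitting step concrete, the following minimal sketch samples noisy losses along a line and fits a low-order polynomial to them, returning the step size at the fit's minimum. This is an illustrative toy with our own function names and a synthetic quadratic loss, not the ELF implementation; the paper fits low-order polynomials, of which the degree-2 choice here is one example.

```python
import numpy as np

def fit_step_size(steps, losses, degree=2):
    """Fit a low-order polynomial to noisy losses sampled along a line
    and return the step size at the polynomial's minimum."""
    poly = np.poly1d(np.polyfit(steps, losses, degree))
    crit = poly.deriv().r                          # critical points of the fit
    crit = crit[np.isreal(crit)].real              # keep real roots only
    crit = crit[(crit >= steps.min()) & (crit <= steps.max())]
    if crit.size == 0:                             # no interior minimum found
        return None
    return float(crit[np.argmin(poly(crit))])

rng = np.random.default_rng(0)
steps = np.linspace(0.0, 2.0, 500)                 # ~500 sampled losses, as in the paper
losses = (steps - 0.7) ** 2 + 0.05 * rng.normal(size=steps.shape)
s_min_fit = fit_step_size(steps, losses)           # close to the true minimum at s = 0.7
```

Even with substantial noise on each individual loss, the regression over a few hundred samples recovers the minimum of the underlying cross section accurately.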
This makes ELF a choice to be considered in order to achieve good results for new deep learning tasks out of the box. In the following, we revisit the fundamentals of optimization in deep learning to make our approach easily understandable. Following Goodfellow et al. (2016), the aim of optimization in deep learning is generally to find a global minimum of the true loss (risk) function $L_{\text{true}}$, which is the expected loss over all elements of the data-generating distribution $p_{\text{data}}$:

$$L_{\text{true}}(\theta) = \mathbb{E}_{(x,y)\sim p_{\text{data}}}\, L(f(x;\theta), y) \tag{1}$$

where $L$ is the loss function for each sample $(x, y)$, $\theta$ are the parameters to optimize, and $f$ is the model function. However, $p_{\text{data}}$ is usually unknown and we need to use an empirical approximation $\hat{p}_{\text{data}}$, which is usually given indirectly by a dataset $T$. Due to the central limit theorem, we can assume $\hat{p}_{\text{data}}$ to be Gaussian. In practice, optimization is performed on the empirical loss $L_{\text{emp}}$:

$$L_{\text{emp}}(\theta) = \mathbb{E}_{(x,y)\sim \hat{p}_{\text{data}}}\, L(f(x;\theta), y) = \frac{1}{|T|} \sum_{(x,y)\in T} L(f(x;\theta), y)$$

An unsolved task is to find a global minimum of $L_{\text{true}}$ by optimizing on $L_{\text{emp}}$ if $|T|$ is finite. Thus, we have to assume that a small value of $L_{\text{emp}}$ also implies a small value of $L_{\text{true}}$. Evaluating $L_{\text{emp}}$ at every step is impractical and expensive; therefore, we approximate it with mini-batches:

$$L_{\text{batch}}(\theta, B) = \frac{1}{|B|} \sum_{(x,y)\in B\subset T} L(f(x;\theta), y)$$

where $B$ denotes a batch. We call the dataset split into batches $T_{\text{batch}}$. We can now reinterpret $L_{\text{emp}}$ as the empirical mean over a list of losses $\mathcal{L}$, which contains the output of $L_{\text{batch}}(\theta, B)$ for each batch $B$:

$$L_{\text{emp}}(\theta) = \frac{1}{|\mathcal{L}|} \sum_{L_{\text{batch}}(\theta,B)\in \mathcal{L}} L_{\text{batch}}(\theta, B)$$

A vertical cross section $l_{\text{emp}}(s)$ of $L_{\text{emp}}(\theta)$ in direction $d$ through the parameter vector $\theta_0$ is given by

$$l_{\text{emp}}(s; \theta_0, d) = L_{\text{emp}}(\theta_0 + s \cdot d)$$

For simplification, we refer to $l$ as a line function or cross section. The step size to the minimum of $l_{\text{emp}}(s)$ is called $s_{\min}$.
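As a toy illustration of these definitions, the quantities above can be written directly in code. This is a hypothetical scalar least-squares problem invented for illustration, not the paper's experimental setup; the names mirror the symbols in the text.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(size=(64, 2))                   # toy dataset T of (x, y) pairs
batches = np.split(data, 8)                       # T split into 8 batches (T_batch)

def sample_loss(theta, x, y):
    return (theta * x - y) ** 2                   # per-sample loss L(f(x; theta), y)

def L_batch(theta, B):
    """Mean loss over one batch B."""
    return np.mean([sample_loss(theta, x, y) for x, y in B])

def L_emp(theta):
    """Empirical loss: mean of L_batch over all batches."""
    return np.mean([L_batch(theta, B) for B in batches])

def l_emp(s, theta0, d):
    """Vertical cross section of L_emp through theta0 in direction d."""
    return L_emp(theta0 + s * d)
```

Since all batches here have equal size, averaging the batch losses reproduces the mean loss over the whole dataset, and `l_emp(0, theta0, d)` coincides with `L_emp(theta0)`.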
However, Mutschler & Zell (2020) empirically suggest that line searches on $L_{\text{batch}}$ are not optimal, since minima of line functions of $L_{\text{batch}}$ are not always good estimators of the minima of line functions of $L_{\text{emp}}$. Thus, it seems more promising to perform a line search on $L_{\text{emp}}$. This is cost-intensive, since we need to determine $L(f(x; \theta_0 + s \cdot d), y)$ for all $(x, y) \in T$ at multiple $s$ of a line function. Probabilistic Line Search (PLS) (Mahsereci & Hennig (2017)) addresses this problem by performing Gaussian process regressions, which result in multiple one-dimensional cubic splines. In addition, a probabilistic belief over the first (= Armijo condition) and second Wolfe condition is introduced to find good update positions. The major drawback of this conceptually appealing but complex method is that for each batch the squared gradients of each input sample have to be computed. This is not supported by default in common deep learning libraries and therefore has to be implemented manually for every layer of the model, which makes its application impractical. The gradient-only line search (GOLS1) (Kafka & Wilke (2019)) pointed out empirically that the noise of directional derivatives in the negative gradient direction is considerably smaller than the noise of the losses. The details as well as the empirical foundation of our approach are explained in the following.



Many direct and indirect line search approaches for deep learning are applied to $L_{\text{batch}}(\theta, B)$ (Mutschler & Zell (2020), Berrada et al. (2019), Rolinek & Martius (2018), Baydin et al. (2017), Vaswani et al. (2019)). Mutschler & Zell (2020) approximate an exact line search, which implies estimating the global minimum of a line function, by using one-dimensional parabolic approximations. The other approaches, directly or indirectly, perform inexact line searches by estimating positions on the line function which fulfill specific conditions, such as the Goldstein, Armijo and Wolfe conditions (Jorge Nocedal (2006)).
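As a reminder of what such inexact line searches do, here is a generic backtracking sketch that enforces the Armijo (sufficient decrease) condition on a deterministic function. This is a textbook construction for illustration, not an implementation of any of the cited methods, and the quadratic test function is our own choice.

```python
import numpy as np

def backtracking_armijo(f, grad_f, theta, d, alpha0=1.0, c=1e-4, rho=0.5):
    """Shrink the step size until the Armijo condition
    f(theta + alpha*d) <= f(theta) + c * alpha * (grad_f(theta) . d) holds."""
    f0 = f(theta)
    slope = float(grad_f(theta) @ d)   # directional derivative; negative for a descent direction
    alpha = alpha0
    while f(theta + alpha * d) > f0 + c * alpha * slope:
        alpha *= rho                   # backtrack: try a smaller step
    return alpha

# toy deterministic quadratic, minimized at the origin
f = lambda th: float(th @ th)
grad_f = lambda th: 2.0 * th
theta = np.array([1.0, 1.0])
d = -grad_f(theta)                     # steepest descent direction
alpha = backtracking_armijo(f, grad_f, theta, d)
```

The condition only asks for "sufficient" decrease, so the accepted step need not be the exact minimizer of the line function; this cheapness is exactly why inexact conditions are popular in classical optimization.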

The authors of GOLS1 argue that they can approximate a line search on $L_{\text{emp}}$ by considering consecutive noisy directional derivatives. Adaptive methods, such as Kingma & Ba (2014), Luo et al. (2019), Reddi et al. (2018), Liu et al. (2019), Tieleman & Hinton (2012), Zeiler (2012) and Robbins & Monro (1951), concentrate more on finding good directions than on optimal step sizes. Thus, they could benefit from line search approaches applied to their estimated directions. Second-order methods, such as Berahas et al. (2019), Schraudolph et al. (2007), Martens & Grosse (2015), Ramamurthy & Duffy (2017) and Botev et al. (2017), tend to find better directions but are generally too expensive for deep learning scenarios. Our approach follows PLS and GOLS1 by performing a line search directly on $L_{\text{emp}}$. We use a regression on multiple $L_{\text{batch}}(\theta_0 + s \cdot d, B)$ values, sampled with different step sizes $s$ and different batches $B$, to estimate a minimum of a line function of $L_{\text{emp}}$ in direction $d$. Consequently, this work is a further step towards efficient steepest-descent line searches on $L_{\text{emp}}$, which show linear convergence on any deterministic function that is twice continuously differentiable, has a relative minimum, and has only positive eigenvalues of the Hessian at the minimum (see Luenberger et al. (1984)).
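The gradient-only idea can be sketched as follows: walk along the search direction until the noisy directional derivative changes sign, which indicates that a minimum of the line function has been passed. This is an illustrative toy with a synthetic noisy derivative rather than real batch gradients, and the function names are ours, not GOLS1's.

```python
import numpy as np

def first_sign_change(directional_derivative, step=0.1, max_steps=100):
    """Advance along the search direction in fixed increments until the
    (noisy) directional derivative becomes non-negative, i.e. until a
    minimum of the line function has been passed."""
    s = 0.0
    for _ in range(max_steps):
        if directional_derivative(s + step) >= 0.0:
            return s + step
        s += step
    return s

rng = np.random.default_rng(2)
# synthetic noisy derivative of the line function l(s) = (s - 0.7)**2
dd = lambda s: 2.0 * (s - 0.7) + 0.01 * rng.normal()
s_crossing = first_sign_change(dd)     # roughly the true minimum at s = 0.7
```

Because the sign of the directional derivative is far less noisy than the loss values themselves, this sign-change criterion tolerates noise that would defeat a direct comparison of loss values.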

