LEARNING THE STEP-SIZE POLICY FOR THE LIMITED-MEMORY BROYDEN-FLETCHER-GOLDFARB-SHANNO ALGORITHM

Abstract

We consider the problem of how to learn a step-size policy for the Limited-Memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm. This is a limited computational memory quasi-Newton method widely used for deterministic unconstrained optimization, but currently avoided in large-scale problems because it requires step sizes to be provided at each iteration. Existing methodologies for step-size selection in L-BFGS rely on heuristic tuning of design parameters and on massive re-evaluations of the objective function and gradient to find appropriate step lengths. We propose a neural network architecture that takes local information of the current iterate as input. The step-size policy is learned from data of similar optimization problems, avoids additional evaluations of the objective function, and guarantees that the output step remains inside a pre-defined interval. The corresponding training procedure is formulated as a stochastic optimization problem using the backpropagation through time algorithm. The performance of the proposed method is evaluated on the training of classifiers for the MNIST database of handwritten digits and for CIFAR-10. The results show that the proposed algorithm outperforms heuristically tuned optimizers such as ADAM, RMSprop, L-BFGS with a backtracking line search, and L-BFGS with a constant step size. The numerical results also show that a learned policy can be used as a warm-start to train new policies for different problems after a few additional training steps, highlighting its potential use in multiple large-scale optimization problems.

1. INTRODUCTION

Consider the unconstrained optimization problem

minimize_x f(x), (1)

where f : R^n → R is an objective function that is differentiable for all x ∈ R^n, with n being the number of decision variables forming x. Let ∇f(x_0) denote the gradient of f evaluated at some x_0 ∈ R^n. A general quasi-Newton algorithm for solving this problem iterates

x_{k+1} = x_k − t_k H_k g_k (2)

from an initial x_0 ∈ R^n until a given stop criterion is met. At the k-th iteration, g_k = ∇f(x_k) is the gradient, H_k is a positive-definite matrix satisfying the secant equation (Nocedal and Wright, 2006, p. 137), and t_k is the step size. In this paper, we develop a policy that learns to suitably determine step sizes t_k when the product H_k g_k is calculated by the Limited-Memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm (Liu and Nocedal, 1989). The main contributions of the paper are:

1. We propose a neural network architecture defining this policy, taking as input local information of the current iterate. In contrast with more standard strategies, this policy is tuning-free and avoids re-evaluations of the objective function and gradient at each step. The training procedure is formulated as a stochastic optimization problem and can be carried out by truncated backpropagation through time (TBPTT).

2. When training classifiers on the MNIST database (LeCun et al., 1998), our approach is competitive against heuristically tuned optimization procedures. Our tests show that the proposed policy not only outperforms competitors such as ADAM and RMSprop in wall-clock time and optimal/final value, but also performs better than L-BFGS with a backtracking line search, which is the gold standard, and L-BFGS with a constant step size, which serves as a baseline.

3. According to subsequent experiments on CIFAR-10 (Krizhevsky et al., 2009), the proposed policy can generalize to different classes of problems after a few additional training steps on examples from these classes.
This indicates that the learned policy may be transferable between distinct types of tasks, opening the way for transfer-learning strategies. This result is a step towards optimization methods that free the designer from tuning control parameters, as motivated in Section 2. The remaining parts of this paper are organized as follows: Section 3 presents the classical L-BFGS algorithm and discusses some methodologies to determine step sizes; Section 4 contains the architecture of the proposed policy, along with discussions of how it was implemented; Section 5 describes the training procedure; and, finally, Section 6 presents experiments using classifiers operating on the MNIST and CIFAR-10 databases. The notation is mainly standard. Scalars are plain lower-case letters, vectors are bold lower-case, and matrices are bold upper-case. The clip function is defined as clip_l^u(y) := min(u, max(l, y)).
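To make the quasi-Newton iteration (2) and the role of the clipped step size concrete, the sketch below implements the generic update x_{k+1} = x_k − t_k H_k g_k in Python. The names `step_policy` and `Hv` are hypothetical stand-ins, not part of the proposed method: `step_policy` plays the role of the learned step-size policy, and `Hv` abstracts the product H_k g_k (the identity is used here, reducing the iteration to gradient descent; L-BFGS would supply its two-loop recursion instead).

```python
import numpy as np

def clip(y, l, u):
    # clip_l^u(y) := min(u, max(l, y)), as in the notation paragraph above
    return min(u, max(l, y))

def quasi_newton(grad, x0, step_policy, Hv=lambda g: g,
                 l=1e-8, u=1.0, max_iters=100, tol=1e-6):
    # Iterates x_{k+1} = x_k - t_k * H_k g_k until the gradient norm is
    # below `tol`. `Hv` is a placeholder for the product H_k g_k and
    # `step_policy` for the learned step-size rule; the clip guarantees
    # the step stays inside the pre-defined interval [l, u].
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        g = grad(x)
        if np.linalg.norm(g) < tol:  # stop criterion
            break
        t = clip(step_policy(x, g), l, u)
        x = x - t * Hv(g)
    return x

# Usage: minimize f(x) = ||x||^2 with a constant-step "policy".
x_star = quasi_newton(grad=lambda x: 2.0 * x,
                      x0=[4.0, -2.0],
                      step_policy=lambda x, g: 0.25)
```

With the identity `Hv` and the constant step 0.25, each iteration halves the iterate on this quadratic, so the loop converges quickly to the minimizer at the origin.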

2. MOTIVATION

Most algorithms used in artificial intelligence and statistics are based on optimization theory, which has contributed widely to the success of machine learning applications over the last decades. However, this two-way bridge does not seem to be fully leveraged in the other direction, that is, to learn how to automate optimization procedures. Indeed, performing satisfactory optimization, or solving learning problems, still relies on the appropriate tuning of the parameters of the chosen algorithm, which are often grouped with the other hyper-parameters of the learning task. Despite the existence of several methodologies to obtain good values for these parameters (Bengio, 2000; Bergstra et al., 2011; Bergstra and Bengio, 2012; Snoek et al., 2015; Daniel et al., 2016; Dong et al., 2018), the search for tuning-free algorithms that perform better than heuristically designed ones is of great interest among practitioners and theoreticians. Besides the generally desirable faster convergence, the ready-to-use nature of such algorithms allows users to focus their attention on other problem-level hyper-parameters while the optimization procedure runs automatically, resulting in a better allocation of time and effort. As recent advances in machine learning have helped automate the solution of countless problems, optimization theory should equally benefit from them, balancing the flow across the bridge. From a wider viewpoint, most optimization problems require the user to select an algorithm and tune it to some extent. Although intuition and knowledge about the problem can speed up this process, trial-and-error methodologies are often employed, which can be time-consuming and inefficient. With that in mind, the concept of learned optimizers has been gathering attention in the last few years; it refers, basically, to optimization policies and routines that were learned by looking at instances of optimization problems, here called tasks.
This idea was introduced by Li and Malik (2016) and Andrychowicz et al. (2016), building upon previous results on "learning to learn" or "meta-learning" (Thrun and Pratt, 1998; Hochreiter et al., 2001). In the former, the authors presented an optimization policy based on a neural network trained by reinforcement learning, taking as input the history of gradient vectors at previous iterations. The latter adopts a long short-term memory (LSTM) network to achieve a similar task, but the learning is done by truncated backpropagation through time after unrolling the proposed optimizer for a certain number of steps. Subsequently, Metz et al. (2019) showed how multilayer perceptrons (MLP), adequately trained using a combined gradient-estimation method, can be faster in wall-clock time than current algorithms of choice. Also within this scenario, Xu et al. (2019) present a reinforcement learning-based methodology to learn an adaptive learning rate. In this paper, instead of completely learning an optimizer from data, we blend these ideas into a classical optimization procedure. The resulting optimizer, composed of L-BFGS combined with the proposed policy, is thus learned in a constrained domain that assures valuable mathematical properties. The idea is to leverage both frameworks, inheriting the theoretical guarantees of optimization theory while learning a policy that rules out the hand-design of parameters.

