GUARANTEES FOR TUNING THE STEP SIZE USING A LEARNING-TO-LEARN APPROACH

Anonymous

Abstract

Learning-to-learn (using optimization algorithms to learn a new optimizer) has successfully trained efficient optimizers in practice. This approach relies on meta-gradient descent on a meta-objective based on the trajectory that the optimizer generates. However, there are few theoretical guarantees on how to avoid meta-gradient explosion/vanishing, or how to train an optimizer with good generalization performance. In this paper we study the learning-to-learn approach on a simple problem: tuning the step size for a quadratic loss. Our results show that although there is a way to design the meta-objective so that the meta-gradient remains polynomially bounded, computing the meta-gradient directly using backpropagation leads to numerical issues that look similar to gradient explosion/vanishing. We also characterize when it is necessary to compute the meta-objective on a separate validation set instead of the original training set. Finally, we verify our results empirically and show that a similar phenomenon arises even for more complicated learned optimizers parametrized by neural networks.

1. INTRODUCTION

Choosing the right optimization algorithm and related hyper-parameters is important for training a deep neural network. Recently, a series of works (e.g., Andrychowicz et al. (2016); Wichrowska et al. (2017)) proposed to use learning algorithms to find a better optimizer. These papers use a learning-to-learn approach: they design a class of possible optimizers (often parametrized by a neural network), and then optimize the parameters of the optimizer (later referred to as meta-parameters) to achieve better performance. We refer to the optimization of the optimizer as the meta optimization problem, and the application of the learned optimizer as the inner optimization problem. The learning-to-learn approach solves the meta optimization problem by defining a meta-objective function based on the trajectory that the inner optimizer generates, and then using backpropagation to compute the meta-gradient (Franceschi et al., 2017).

Although the learning-to-learn approach has shown empirical success, there are very few theoretical guarantees for learned optimizers. In particular, since the optimization of meta-parameters is usually a nonconvex problem, does it have bad local optima? Current ways of optimizing meta-parameters rely on unrolling the trajectory of the inner optimizer, which is very expensive and often leads to exploding/vanishing gradient problems. Is there a way to alleviate these problems? Can we design the meta-objective in a provable way so that the inner optimizer achieves good generalization performance?

In this paper we answer some of these questions in a simple setting, where we use the learning-to-learn approach to tune the step size of the standard gradient descent/stochastic gradient descent algorithm. We will see that even in this simple setting, many of the challenges remain, and we can get better learned optimizers by choosing the right meta-objective function.
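To make the pipeline concrete, the following minimal sketch (our own illustrative example, not a construction from the paper) takes a one-dimensional quadratic inner problem f(w) = ½λw² and computes the meta-gradient of the final loss with respect to the step size η by unrolling the trajectory. For a single scalar meta-parameter, forward accumulation of dw_t/dη gives the same number backpropagation would.

```python
def unrolled_meta_gradient(eta, lam=2.0, w0=1.0, T=20):
    """Inner problem: T steps of gradient descent on f(w) = 0.5*lam*w**2,
    i.e. w_{t+1} = (1 - eta*lam) * w_t.  Meta-objective: the loss of the
    last iterate, M(eta) = f(w_T).  The meta-gradient dM/deta is obtained
    by unrolling the trajectory and accumulating dw_t/deta forward."""
    w, dw = w0, 0.0  # dw tracks the sensitivity dw_t/deta
    for _ in range(T):
        # tuple assignment: the right-hand side uses the *old* (w, dw)
        w, dw = (1 - eta * lam) * w, (1 - eta * lam) * dw - lam * w
    loss = 0.5 * lam * w ** 2
    meta_grad = lam * w * dw  # chain rule: dM/deta = f'(w_T) * dw_T/deta
    return loss, meta_grad

# In closed form M(eta) = f(w_0) * (1 - eta*lam)**(2*T), so the meta-gradient
# scales like (1 - eta*lam)**(2*T - 1): it vanishes exponentially in T when
# |1 - eta*lam| < 1 and explodes when |1 - eta*lam| > 1 -- exactly the
# difficulty with long unrolls discussed below.
```

The unrolled gradient can be checked against a central finite difference of M(η); already at T = 20 with a stable step size, both the final loss and the meta-gradient are tiny.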
Though our results are proved only in this simple setting, we empirically verify them using complicated learned optimizers with neural network parametrizations.

Metz et al. (2019) highlighted several challenges in the meta-optimization for the learning-to-learn approach. First, they observed that the optimal parameters for the learned optimizer (or even just the step size for gradient descent) can depend on the number of training steps t of the inner optimization problem, which was also observed by Wu et al. (2018). Ge et al. (2019) theoretically proved this in a least-squares setting. Because of this, one needs to ensure that the inner training has enough steps (similar to the number of steps that it would take when we apply the learned optimizer). However, when the number of steps is large, the meta-gradient can often explode or vanish, which makes it difficult to solve the meta-optimization problem.
Our first result shows that this is still true in the case of tuning the step size for gradient descent on a simple quadratic objective. In this setting, we show that there is a unique local and global minimizer for the step size, and we also give a simple way to get rid of the gradient explosion/vanishing problem.

Theorem 1 (Informal). For tuning the step size of gradient descent on a quadratic objective, if the meta-objective is the loss of the last iteration, then the meta-gradient can explode/vanish. If the meta-objective is the log of the loss of the last iteration, then the meta-gradient is polynomially bounded. Further, meta-gradient descent with a meta step size of 1/√k (where k is the number of meta-gradient steps) provably converges to the optimal step size for the inner optimizer.

Surprisingly, even though taking the log of the objective solves the gradient explosion/vanishing problem, one cannot simply implement such an algorithm using auto-differentiation tools such as those used in TensorFlow (Abadi et al., 2016). The reason is that even though the meta-gradient is polynomially bounded, if we compute it using the standard back-propagation algorithm, the meta-gradient will be the ratio of two exponentially large/small numbers, which causes numerical issues. Detailed discussion of the first result appears in Section 3 (Theorem 3 and Theorem 4).

The generalization performance of the learned optimizer is another challenge. If one just tries to optimize the performance of the learned optimizer on the training set (we refer to this as the train-by-train approach), then the learned optimizer might overfit. Metz et al. (2019) proposed to use a train-by-validation approach instead, where the meta-objective is defined to be the performance of the learned optimizer on a separate validation set. Our second result considers a simple least squares setting where y = ⟨w*, x⟩ + ξ and ξ ∼ N(0, σ²).
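The numerical issue behind Theorem 1 fits in a few lines. The sketch below (illustrative constants λ = 2, w₀ = 1, not taken from the paper) computes the meta-gradient of log f(w_T) in two ways: the "backprop" route forms M = f(w_T) and dM/dη separately through the unrolled trajectory and then divides, so both numerator and denominator shrink exponentially in T; the closed form of the same quantity, −2Tλ/(1 − ηλ), is polynomially bounded.

```python
LAM, W0 = 2.0, 1.0  # illustrative: f(w) = 0.5 * LAM * w**2, started at W0

def backprop_log_meta_grad(eta, T):
    """d(log f(w_T))/d(eta) the way backprop would get it: compute
    M = f(w_T) and dM/deta through the unrolled trajectory, then divide.
    Both are exponentially small/large in T, so the ratio breaks down
    numerically even though the true answer is polynomially bounded."""
    w, dw = W0, 0.0
    for _ in range(T):
        w, dw = (1 - eta * LAM) * w, (1 - eta * LAM) * dw - LAM * w
    M, dM = 0.5 * LAM * w ** 2, LAM * w * dw
    if M == 0.0:          # underflow: the 0/0 that autodiff would hit
        return float("nan")
    return dM / M

def stable_log_meta_grad(eta, T):
    """Same quantity in closed form: log M = log f(w_0) + 2T*log|1 - eta*LAM|,
    hence d(log M)/d(eta) = -2*T*LAM / (1 - eta*LAM), polynomial in T."""
    return -2 * T * LAM / (1 - eta * LAM)
```

For short horizons the two routes agree to machine precision; for T in the thousands with a stable step size, the unrolled intermediate quantities underflow to 0/0 while the closed form stays finite.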
We show that when the number of samples is small and the noise is large, it is important to use train-by-validation, while when the number of samples is much larger, train-by-train can also learn a good optimizer.

Theorem 2 (Informal). For a simple least squares problem in d dimensions, if the number of samples n is a constant fraction of d (e.g., d/2) and the samples have large noise, then the train-by-train approach performs much worse than train-by-validation. On the other hand, when the number of samples n is large, train-by-train can get close to error dσ²/n, which is optimal.

We discuss the details in Section 4 (Theorem 5 and Theorem 6). In Section 5 we show that these observations also hold empirically for a more complicated learned optimizer parametrized by a neural network.
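The small-sample regime of Theorem 2 can be reproduced in a short simulation. The sketch below (illustrative sizes and noise level, not the constants of the theorem; assumes NumPy) tunes the gradient descent step size over a small grid twice: once by the final training loss (train-by-train) and once on a held-out validation set (train-by-validation), in the regime n = d/2 with large noise.

```python
import numpy as np

def experiment(seed, d=20, n=10, sigma=2.0, T=100):
    """y = <w*, x> + xi with xi ~ N(0, sigma^2); only n = d/2 noisy samples."""
    rng = np.random.default_rng(seed)
    w_star = rng.standard_normal(d) / np.sqrt(d)   # ||w*|| is about 1

    def sample(m):
        X = rng.standard_normal((m, d))
        return X, X @ w_star + sigma * rng.standard_normal(m)

    (Xtr, ytr), (Xva, yva), (Xte, yte) = sample(n), sample(n), sample(500)

    def run_gd(eta):
        """Inner optimizer: T gradient descent steps on the training
        least-squares objective, started from w = 0."""
        w = np.zeros(d)
        for _ in range(T):
            w = w - (eta / n) * (Xtr.T @ (Xtr @ w - ytr))
            if not np.all(np.isfinite(w)):          # step size too large
                break
        return w

    def loss(w, X, y):
        if not np.all(np.isfinite(w)):
            return float("inf")                     # diverged run
        return float(np.mean((X @ w - y) ** 2))

    grid = [0.0, 0.01, 0.02, 0.05, 0.1, 0.2, 0.3]
    ws = {eta: run_gd(eta) for eta in grid}
    eta_tt = min(grid, key=lambda e: loss(ws[e], Xtr, ytr))  # train-by-train
    eta_tv = min(grid, key=lambda e: loss(ws[e], Xva, yva))  # train-by-validation
    return loss(ws[eta_tt], Xte, yte), loss(ws[eta_tv], Xte, yte)
```

Averaged over random seeds, the train-by-train choice drives the training error to (near) zero by interpolating the noise (n < d, so gradient descent can fit the training set exactly) and does worse on fresh test data, while the validation-based choice behaves like early stopping and generalizes better.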

1.2. RELATED WORK

Learning-to-learn for supervised learning. Hochreiter et al. (2001) introduced the application of gradient descent methods to meta-learning. The idea of using a neural network to parametrize an optimizer started with Andrychowicz et al. (2016), which used an LSTM to directly learn the update rule. Before that, the idea of using optimization to tune parameters of optimizers also appeared in Maclaurin et al. (2015). Later, Li & Malik (2016); Bello et al. (2017) applied techniques from reinforcement learning to learn an optimizer. Wichrowska et al. (2017) used a hierarchical RNN as the optimizer. Metz et al. (2019) adopted a small MLP as the optimizer and used dynamic weighting of two gradient estimators to stabilize and speed up the meta-training process.

