NON-GREEDY GRADIENT-BASED HYPERPARAMETER OPTIMIZATION OVER LONG HORIZONS

Abstract

Gradient-based meta-learning has gained widespread popularity in few-shot deep learning, but remains broadly impractical for tasks with long horizons (many gradient steps), due to memory scaling and gradient degradation issues. A common workaround is to learn meta-parameters online, but this introduces greediness which comes with a significant performance drop. In this work, we enable non-greedy meta-learning of hyperparameters over long horizons by sharing hyperparameters that are contiguous in time, and by using the sign of hypergradients rather than their magnitude to indicate convergence. We implement this with forward-mode differentiation, which we extend to the popular momentum-based SGD optimizer. We demonstrate that the hyperparameters of this optimizer can be learned non-greedily without gradient degradation over ∼10^4 inner gradient steps, while requiring only ∼10 outer gradient steps. On CIFAR-10, we outperform greedy and random search methods by nearly 10% for the same computational budget.

1. INTRODUCTION

Deep neural networks have shown tremendous success on a wide range of applications, including classification (He et al., 2016), generative models (Brock et al., 2019), natural language processing (Devlin et al., 2018) and speech recognition (Oord et al., 2016). This success is in part due to effective optimizers such as SGD with momentum or Adam (Kingma & Ba, 2015), which require carefully tuned hyperparameters for each application. In recent years, the deep learning community has compiled a long list of heuristics for tuning such hyperparameters, including: how to best decay the learning rate (Loshchilov & Hutter, 2017), how to scale hyperparameters with the available budget (Li et al., 2020), and how to scale the learning rate with batch size (Goyal et al., 2017). Unfortunately, these heuristics are often dataset-specific and architecture-dependent (Dong et al., 2020), and must constantly evolve to accommodate new optimizers (Loshchilov & Hutter, 2019) or new tools, such as batch normalization, which allows for larger learning rates and smaller weight decay (Ioffe & Szegedy, 2015). With so many ways to choose hyperparameters, the deep learning community risks adopting models based on how much effort went into tuning them, rather than on their methodological insight.

The field of hyperparameter optimization (HPO) aims to find hyperparameters that provide good generalization performance automatically. Unfortunately, existing tools remain unpopular for deep networks, largely owing to their inefficiency. Here we focus on gradient-based HPO, which calculates hypergradients, i.e. the gradient of some generalization loss with respect to each hyperparameter. Gradient-based HPO should become more efficient than black-box methods as the dimensionality of the hyperparameter space increases, since it relies on gradients rather than trial and error.
In practice, however, learning hyperparameters with gradients has only been popular in few-shot learning tasks where the horizon is short. This is because long horizons cause hypergradient degradation and incur a memory cost that makes reverse-mode differentiation prohibitive. Greedy alternatives alleviate both of these issues, but at the cost of optimizing hyperparameters locally instead of globally. Forward-mode differentiation has been shown to offer a memory cost constant in horizon size, but it does not address gradient degradation and only scales to a few hyperparameters, which has limited its use to the greedy setting as well. To the best of our knowledge, this work demonstrates for the first time that gradient-based HPO can be applied to long-horizon problems like CIFAR-10 without being greedy. Specifically, we make the following contributions: (1) we propose to share hyperparameters through time and show that this significantly reduces the variance of hypergradients; (2) we show that the sign of hypergradients is a better indicator of convergence than their magnitude and enables a small number of outer steps; (3) we combine the above in a forward-mode algorithm adapted to modern SGD optimizers with momentum and weight decay; and (4) we show that our method significantly outperforms random search and greedy alternatives under the same computational budget.
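To make contributions (1) and (2) concrete, the following is a minimal sketch of hyperparameter sharing over contiguous inner steps combined with a sign-based outer update. All names, the block layout, and the outer step size are illustrative assumptions on our part, not the paper's implementation:

```python
import numpy as np

T, K = 10000, 10  # inner gradient steps, and number of shared hyperparameter blocks
block = T // K

def expand(shared):
    """Map K shared hyperparameters to T per-step values (contiguous blocks in time)."""
    return np.repeat(shared, block)

def collapse(per_step_hypergrads):
    """Chain rule for sharing: sum the per-step hypergradients within each block."""
    return per_step_hypergrads.reshape(K, block).sum(axis=1)

def sign_outer_step(shared, per_step_hypergrads, outer_lr=0.01):
    """Outer update that uses only the sign of each shared hypergradient,
    taking a fixed-size step regardless of the hypergradient's magnitude."""
    return shared - outer_lr * np.sign(collapse(per_step_hypergrads))
```

Sharing shrinks the outer problem from T hyperparameters to K, while the sign-based step sidesteps the widely varying hypergradient magnitudes that long horizons produce.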

2. RELATED WORK

There are many ways to perform hyperparameter optimization (HPO), including Bayesian optimization (Snoek et al., 2015), reinforcement learning (Zoph & Le, 2017), a mix of the two (Falkner et al., 2018), evolutionary algorithms (Jaderberg et al., 2017) and gradient-based methods (Bengio, 2000). Here we focus on the latter; a broader comparison of HPO methods can be found in (Feurer & Hutter, 2019). Modern work in meta-learning deals with various forms of gradient-based HPO, many examples of which are discussed in the survey of Hospedales et al. (2020). However, meta-learning typically focuses on the few-shot regime where horizons are conveniently short, while in this work we focus on long horizons.

Gradient-based HPO. Using the gradient of some validation loss with respect to the hyperparameters is typically the preferred choice when the underlying optimization is differentiable. This is a type of bilevel optimization (Franceschi et al., 2018) which stems from earlier work on backpropagation through time (Werbos, 1990) and real-time recurrent learning (Williams & Zipser, 1989). Unfortunately, differentiating through optimization is expensive in both time and memory, and most proposed methods are limited to small models and toy datasets (Domke, 2012; Maclaurin et al., 2015; Pedregosa, 2016). Efforts to make the problem more tractable include optimization shortcuts (Fu et al., 2016), truncation (Shaban et al., 2019) and implicit gradients (Rajeswaran et al., 2019; Lorraine et al., 2019). Truncation can be combined with our approach but produces biased gradients (Metz et al., 2019), while implicit differentiation is only applicable to hyperparameters that define the training loss (e.g. augmentation), not to hyperparameters that define how the training loss is minimized (e.g. optimizer hyperparameters).
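In symbols (generic placeholders of our choosing, not necessarily the paper's notation), gradient-based HPO is the bilevel problem of choosing hyperparameters λ that minimize a validation loss at the end of training:

```latex
\lambda^{\star} = \arg\min_{\lambda}\; \mathcal{L}_{\text{val}}\!\left(w_T(\lambda)\right)
\quad \text{s.t.} \quad
w_{t+1} = \Phi(w_t, \lambda), \qquad t = 0, \dots, T-1,
```

where Φ is one step of the inner optimizer. The hypergradient dL_val/dλ = (∂L_val/∂w_T)(dw_T/dλ) can be accumulated backwards through all T steps (reverse mode, with memory growing with T) or forwards alongside training (forward mode, with constant memory but cost growing with the number of hyperparameters).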
Forward-mode differentiation (Williams & Zipser, 1989) offers a memory cost constant in horizon size, but it does not address gradient degradation, which has restricted its use to the greedy setting (Franceschi et al., 2017).
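To illustrate the recurrence that forward mode involves, here is a minimal, self-contained sketch of forward-mode hypergradient computation for momentum SGD on a 1-D quadratic, differentiating the final loss with respect to the learning rate. The toy loss and all names are ours, not the paper's implementation, and weight decay is omitted for brevity:

```python
def train_with_tangent(alpha, beta=0.9, a=2.0, w0=1.0, steps=100):
    """Momentum SGD on L(w) = 0.5 * a * w**2, propagating the forward-mode
    tangents (dw/dalpha, dv/dalpha) alongside the state (w, v)."""
    w, v = w0, 0.0     # weight and momentum buffer
    dw, dv = 0.0, 0.0  # tangents of w and v with respect to alpha
    for _ in range(steps):
        g, dg = a * w, a * dw     # gradient and its tangent (a Hessian-vector product)
        v = beta * v + g          # momentum update
        dv = beta * dv + dg       # differentiated momentum update
        dw = dw - v - alpha * dv  # d/dalpha of the update  w <- w - alpha * v
        w = w - alpha * v
    loss = 0.5 * a * w ** 2
    hypergrad = (a * w) * dw      # chain rule: dL/dw * dw/dalpha
    return loss, hypergrad
```

Only the current tangents are stored, so memory stays constant in the number of inner steps; the price is one tangent recurrence per hyperparameter, which is why forward mode scales poorly to many hyperparameters.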



Figure 1: Our method applied to the SGD optimizer to learn (from left to right) the learning rate schedule α, the momentum β, and the weight decay µ for a WRN-16-1 on CIFAR-10. For each outer step (color), we solve CIFAR-10 from scratch for 50 epochs and update all hyperparameters such that the final weights minimize some validation loss. We use hyperparameter sharing over 10, 50 and 50 epochs for α, β and µ respectively. All hyperparameters are initialized to zero and converge within just 10 outer steps to values that significantly outperform the online greedy alternative (Baydin et al., 2018), and match aggressively hand-tuned baselines for this setting (see Section 5.2).

