NON-GREEDY GRADIENT-BASED HYPERPARAMETER OPTIMIZATION OVER LONG HORIZONS

Abstract

Gradient-based meta-learning has earned widespread popularity in few-shot deep learning, but remains broadly impractical for tasks with long horizons (many gradient steps), due to memory scaling and gradient degradation issues. A common workaround is to learn meta-parameters online, but this introduces greediness, which comes with a significant performance drop. In this work, we enable non-greedy meta-learning of hyperparameters over long horizons by sharing hyperparameters that are contiguous in time, and by using the sign of hypergradients rather than their magnitude to indicate convergence. We implement this with forward-mode differentiation, which we extend to the popular momentum-based SGD optimizer. We demonstrate that the hyperparameters of this optimizer can be learned non-greedily without gradient degradation over ∼10^4 inner gradient steps, requiring only ∼10 outer gradient steps. On CIFAR-10, we outperform greedy and random search methods for the same computational budget by nearly 10%.

1. INTRODUCTION

Deep neural networks have shown tremendous success on a wide range of applications, including classification (He et al., 2016), generative models (Brock et al., 2019), natural language processing (Devlin et al., 2018) and speech recognition (Oord et al., 2016). This success is in part due to effective optimizers such as SGD with momentum or Adam (Kingma & Ba, 2015), which require carefully tuned hyperparameters for each application. In recent years, a long list of heuristics to tune such hyperparameters has been compiled by the deep learning community, including: how to best decay the learning rate (Loshchilov & Hutter, 2017), how to scale hyperparameters with the budget available (Li et al., 2020), and how to scale the learning rate with batch size (Goyal et al., 2017). Unfortunately, these heuristics are often dataset-specific and architecture-dependent (Dong et al., 2020), and must constantly evolve to accommodate new optimizers (Loshchilov & Hutter, 2019) or new tools, like batch normalization, which allows for larger learning rates and smaller weight decay (Ioffe & Szegedy, 2015). With so many ways to choose hyperparameters, the deep learning community is at risk of adopting models based on how much effort went into tuning them, rather than on their methodological insight.

The field of hyperparameter optimization (HPO) aims to find hyperparameters that provide good generalization performance automatically. Unfortunately, existing tools remain unpopular for deep networks, largely owing to their inefficiency. Here we focus on gradient-based HPO, which calculates hypergradients, i.e. the gradient of some generalization loss with respect to each hyperparameter. Gradient-based HPO should be more efficient than black-box methods as the dimensionality of the hyperparameter space increases, since it relies on gradients rather than trial and error.
In practice, however, learning hyperparameters with gradients has only been popular in few-shot learning tasks where the horizon is short. This is because long horizons cause hypergradient degradation and incur a memory cost that makes reverse-mode differentiation prohibitive. Greedy alternatives alleviate both of these issues, but come at the cost of optimizing hyperparameters locally rather than globally. Forward-mode differentiation has been shown to offer a memory cost that is constant in horizon length, but it does not address gradient degradation and only scales to a few hyperparameters, which has limited its use to the greedy setting as well.
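To make the forward-mode idea concrete, the following is a minimal sketch (not the paper's implementation) of computing a hypergradient for a single learning rate η on a toy quadratic inner loss L(w) = ½w². Forward-mode differentiation carries the tangent Z_t = dw_t/dη alongside the weights at every inner step, so memory stays constant in the number of inner steps, in contrast to reverse-mode, which must store the whole trajectory. All names here are illustrative.

```python
def forward_mode_hypergrad(w0, eta, T):
    """Hypergradient dL_val/d(eta) after T inner SGD steps, via forward mode.

    Toy setting: inner loss L(w) = 0.5 * w**2, so grad = w and Hessian = 1;
    validation loss L_val(w) = 0.5 * w**2 as well. Purely illustrative.
    """
    w, Z = w0, 0.0                # Z tracks the tangent dw/deta
    for _ in range(T):
        g = w                     # inner gradient dL/dw
        H = 1.0                   # inner Hessian d2L/dw2
        Z = Z - g - eta * H * Z   # differentiate the update w - eta*g w.r.t. eta
        w = w - eta * g           # plain SGD step
    # chain rule through the final weights: dL_val/deta = (dL_val/dw) * Z
    return w * Z

print(forward_mode_hypergrad(1.0, 0.1, 5))  # negative: raising eta lowers val loss here
```

Note that only `w` and `Z` are kept in memory regardless of `T`; with `d` hyperparameters, forward mode would need `d` such tangents, which is the scaling limitation mentioned above.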

