DEEP POWER LAWS FOR HYPERPARAMETER OPTIMIZATION

Abstract

Hyperparameter optimization is an important subfield of machine learning that focuses on tuning the hyperparameters of a chosen algorithm to achieve peak performance. Recently, a stream of methods has tackled the problem of hyperparameter optimization; however, most of them do not exploit the scaling-law property of learning curves. In this work, we propose Deep Power Law (DPL), a neural network model conditioned to yield predictions that follow a power-law scaling pattern. Using multi-fidelity estimation, our model dynamically decides which configurations to pause and which to train incrementally. We compare our method against 7 state-of-the-art competitors on 3 benchmarks comprising tabular, image, and NLP datasets, covering 57 diverse search spaces. Our method achieves the best results across all benchmarks, obtaining the best any-time performance among all competitors. We open-source our implementation and make our code publicly available.

1. INTRODUCTION

Hyperparameter Optimization (HPO) is a major challenge for the Machine Learning community. Unfortunately, HPO is often infeasible for Deep Learning (DL) methods due to the high cost of evaluating multiple configurations. Recently, gray-box HPO (a.k.a. multi-fidelity HPO) has emerged as a promising paradigm for HPO in DL: poorly-performing hyperparameter configurations are discarded after observing their validation error at the low fidelities of the optimization procedure (Li et al., 2017; Falkner et al., 2018; Awad et al., 2021; Li et al., 2020). The advantage of gray-box HPO compared to online HPO (Chen et al., 2017; Parker-Holder et al., 2020) or meta-gradient HPO (Maclaurin et al., 2015; Franceschi et al., 2017; Lorraine et al., 2020) is the ability to tune all types of hyperparameters.

In recent years, a stream of papers has highlighted that the performance of DL methods is predictable (Hestness et al., 2017); concretely, that the validation error rate is a power-law function of the model size or the dataset size (Rosenfeld et al., 2020; 2021). Such a power-law relationship has subsequently been validated in the domain of NLP, too (Ghorbani et al., 2022). In this paper, we demonstrate that the power-law principle has the potential to be a game-changer in HPO, because we can evaluate hyperparameter configurations in low-budget regimes (e.g. after a few epochs) and then estimate the performance on the full dataset using dataset-specific power law models.

We introduce Deep Power Law (DPL) ensembles, a probabilistic surrogate for Bayesian optimization (BO) that estimates the performance of a hyperparameter configuration at future budgets using ensembles of deep power law functions. Subsequently, a novel flavor of BO dynamically decides which configurations to pause and which to train incrementally by relying on the performance estimates of the surrogate.
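The extrapolation idea above can be sketched with a simple power-law fit. This is an illustrative stand-in, not the paper's DPL surrogate: the learning-curve values are invented for the example, and the fitting routine (a grid over the exponent with a closed-form linear solve) is a deliberate simplification.

```python
import numpy as np

def fit_power_law(budgets, errors):
    """Least-squares fit of y(b) = alpha + beta * b^(-gamma): for each
    gamma on a grid, (alpha, beta) has a closed-form linear solution."""
    best = None
    for gamma in np.linspace(0.05, 3.0, 60):
        X = np.column_stack([np.ones_like(budgets), budgets ** (-gamma)])
        coef, *_ = np.linalg.lstsq(X, errors, rcond=None)
        sse = float(np.sum((errors - X @ coef) ** 2))
        if best is None or sse < best[0]:
            best = (sse, coef[0], coef[1], gamma)
    _, alpha, beta, gamma = best
    return alpha, beta, gamma

# Hypothetical partial learning curve: validation error of one
# hyperparameter configuration after its first five epochs.
budgets = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
val_error = np.array([0.50, 0.38, 0.33, 0.30, 0.285])

alpha, beta, gamma = fit_power_law(budgets, val_error)
# Extrapolate to the full budget (e.g. 50 epochs) to judge whether
# this configuration is worth training further.
predicted_final = alpha + beta * 50.0 ** (-gamma)
```

The extrapolated value lies below the last observed error, which is exactly the kind of estimate a gray-box method can use to rank configurations after only a few epochs.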
We demonstrate that our method achieves the new state of the art in HPO for DL by comparing against 8 strong HPO baselines on 57 datasets of three diverse modalities (tabular, image, and natural language processing). As a result, we believe the proposed method has the potential to finally make HPO for DL feasible in practice. Overall, our contributions can be summarized as follows:
• We introduce a novel probabilistic surrogate for gray-box HPO based on ensembles of deep power law functions.
• We derive a simple mechanism to combine our surrogate with Bayesian optimization.
• Finally, we demonstrate the superiority of our method against the current state of the art in HPO for Deep Learning, with a very large-scale HPO experimental protocol.

2. RELATED WORK

Multi-fidelity HPO relaxes the black-box assumption by assuming access to the learning curve of a hyperparameter configuration. Such a learning curve is the function that maps either training time or dataset size to the validation performance. The early performance of configurations (i.e. the first segment of the learning curve) can be used to discard unpromising configurations without waiting for full convergence. Successive halving (Jamieson & Talwalkar, 2016) is a widely used multi-fidelity method that randomly samples hyperparameter configurations, starts evaluating them, and terminates a fraction of them upon reaching a predefined budget. Afterward, the budget is multiplied by the inverse of the surviving fraction, and the process continues until the maximum budget is reached. Although the method relies only on the last observed value of the learning curve, it is very efficient. In recent years, various flavors of successive halving have been proposed, including Hyperband (Li et al., 2017), which effectively runs successive halving in parallel with different settings. A major improvement to Hyperband is replacing random search with a more efficient sampling strategy (Awad et al., 2021; Falkner et al., 2018). However, the only assumption these methods make about the learning curve is that it improves over time. In contrast, we fit surrogates that exploit a power-law assumption on the curves.

Learning curve prediction is a related topic, where the performance of a configuration is predicted based on a partially observed learning curve. Typically, the assumptions about the learning curve are much stronger than those described above. The prediction is often based on the assumption that the performance increases at the beginning and then flattens towards the end. One way to model this behavior is to define a weighted set of parametric functions (Domhan et al., 2015; Klein et al., 2017).
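The successive halving loop described above can be sketched as follows. The `evaluate(config, budget)` callback is an assumed interface returning the validation error of a configuration after a given training budget; the toy values in the usage example stand in for real training runs.

```python
def successive_halving(configs, evaluate, min_budget=1, eta=2, max_budget=8):
    """Illustrative successive halving (Jamieson & Talwalkar, 2016):
    evaluate all configurations at a small budget, keep the best
    1/eta fraction, multiply the budget by eta, and repeat."""
    budget = min_budget
    while budget <= max_budget and len(configs) > 1:
        scores = {c: evaluate(c, budget) for c in configs}
        keep = max(1, len(configs) // eta)
        configs = sorted(configs, key=scores.get)[:keep]  # lower error = better
        budget *= eta
    return configs[0]

# Toy usage: configurations are plain numbers and the "validation
# error" is a synthetic function minimized at 0.3, improving with budget.
best = successive_halving(
    configs=[0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7],
    evaluate=lambda c, b: (c - 0.3) ** 2 + 1.0 / b,
)
```

Note that the decision to keep or drop a configuration uses only the latest observed value, which is precisely the limitation the power-law surrogates in this paper address.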
Then, the parameters of all functions are determined such that the resulting prediction best matches the observed learning curve. Another approach uses learning curves from already-evaluated configurations and finds an affine transformation that leads to a well-matched learning curve (Chandrashekaran & Lane, 2017). A more data-driven approach is to learn the typical learning curve behavior directly from learning curves across different datasets (Wistuba & Pedapati, 2020). Learning curve prediction algorithms can also be combined with successive halving (Baker et al., 2018). In contrast to this line of research, we fit ensembles of power law surrogates for conducting multi-fidelity HPO with Bayesian optimization.

Scaling laws describe how the performance of deep learning models varies as a function of dataset size or model size. Concretely, Hestness et al. (2017) show empirically, for different data modalities and neural architectures, that a power law relationship holds when growing the dataset. Further work confirms this observation and extends it by demonstrating the power law relationship also with regard to model size (Rosenfeld et al., 2020; 2021; Ghorbani et al., 2022). From a practical angle, Yang et al. (2022) propose to tune hyperparameters on a small-scale model and then transfer them to a large-scale version. In contrast to these papers, we directly use the power law assumption for training surrogates in Bayesian optimization for HPO.
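The affine-transformation idea attributed above to Chandrashekaran & Lane (2017) can be illustrated roughly as follows. The curves here are hypothetical and the plain least-squares fit is a simplification, not the authors' implementation.

```python
import numpy as np

def affine_transfer_prediction(reference_curve, observed_prefix):
    """Find scale a and shift b so that a * reference + b best fits the
    observed prefix of a new learning curve, then use the transformed
    reference curve as the prediction for the remaining budget."""
    t = len(observed_prefix)
    X = np.column_stack([reference_curve[:t], np.ones(t)])
    (a, b), *_ = np.linalg.lstsq(X, observed_prefix, rcond=None)
    return a * reference_curve + b  # full-length predicted curve

# Hypothetical curves: a fully evaluated previous configuration, and
# the first four observations of a new configuration.
ref = np.array([0.60, 0.45, 0.38, 0.34, 0.31, 0.29, 0.28])
new_prefix = np.array([0.50, 0.38, 0.33, 0.30])
pred = affine_transfer_prediction(ref, new_prefix)
```

The transformed reference matches the observed prefix closely and extends it to a full-length forecast, which is what makes the approach usable for early stopping decisions.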

3. PRELIMINARIES

Hyperparameter Optimization (HPO) demands finding the configuration λ ∈ Λ of a Machine Learning method that achieves the lowest validation loss L^(Val) of a model (e.g. a neural network), which is parameterized with θ and trained to minimize the training loss L^(Train):

λ* := arg min_{λ ∈ Λ} L^(Val)(λ, θ*(λ)) ,  s.t.  θ*(λ) := arg min_{θ ∈ Θ} L^(Train)(λ, θ)    (1)

For simplicity, we denote the validation loss as our function of interest f(λ) := L^(Val)(λ, θ*(λ)). The optimal hyperparameter configuration λ* of Equation 1 is found via an HPO policy A.
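The nested structure of the bilevel problem above can be made concrete with a toy example. The quadratic losses and the grid search below are illustrative stand-ins for real training and for an actual HPO policy, not part of the paper's method.

```python
# Inner problem: theta*(lam) = argmin_theta L_train(lam, theta).
# For the toy loss L_train = (theta - lam)^2 the minimizer is theta = lam.
def train(lam):
    return lam

# Toy validation loss: minimized at lam = 0.5 once theta = theta*(lam).
def validation_loss(lam, theta):
    return (lam - 0.5) ** 2 + 0.1 * (theta - lam) ** 2

# Outer problem: a trivial HPO "policy" that searches a small grid of
# hyperparameter values, solving the inner problem for each candidate.
search_space = [0.0, 0.25, 0.5, 0.75, 1.0]
best_lam = min(search_space, key=lambda lam: validation_loss(lam, train(lam)))
```

Every outer evaluation requires solving the inner training problem from scratch, which is why naive black-box HPO is so expensive for deep learning and why the gray-box methods discussed earlier matter.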

Code availability: https://anonymous.4open.science/r/DeepRegret-0F61/

