DEEP POWER LAWS FOR HYPERPARAMETER OPTIMIZATION

Abstract

Hyperparameter optimization is an important subfield of machine learning that focuses on tuning the hyperparameters of a chosen algorithm to achieve peak performance. Recently, a stream of methods has tackled hyperparameter optimization; however, most of them do not exploit the scaling-law property of learning curves. In this work, we propose Deep Power Law (DPL), a neural network model conditioned to yield predictions that follow a power-law scaling pattern. Our model dynamically decides which configurations to pause and train incrementally by making use of multi-fidelity estimation. We compare our method against 7 state-of-the-art competitors on 3 benchmarks covering tabular, image, and NLP datasets across 57 diverse search spaces. Our method achieves the best any-time results across all benchmarks compared to all competitors. We open-source our implementation and make our code publicly available.

1. INTRODUCTION

Hyperparameter Optimization (HPO) is a major challenge for the Machine Learning community. Unfortunately, HPO remains largely impractical for Deep Learning (DL) methods due to the high cost of evaluating multiple configurations. Recently, gray-box HPO (a.k.a. multi-fidelity HPO) has emerged as a promising paradigm for HPO in DL, discarding poorly-performing hyperparameter configurations after observing the validation error at the lower fidelities of the optimization procedure (Li et al., 2017; Falkner et al., 2018; Awad et al., 2021; Li et al., 2020). The advantage of gray-box HPO compared to online HPO (Chen et al., 2017; Parker-Holder et al., 2020) or meta-gradient HPO (Maclaurin et al., 2015; Franceschi et al., 2017; Lorraine et al., 2020) is the ability to tune all types of hyperparameters.

In recent years, a stream of papers has highlighted that the performance of DL methods is predictable (Hestness et al., 2017); concretely, the validation error rate is a power-law function of the model size or dataset size (Rosenfeld et al., 2020; 2021). Such a power-law relationship has subsequently been validated in the domain of NLP, too (Ghorbani et al., 2022). In this paper, we demonstrate that the power-law principle has the potential to be a game-changer in HPO, because we can evaluate hyperparameter configurations in low-budget regimes (e.g., after a few epochs) and then estimate the performance on the full budget using dataset-specific power-law models.

We introduce Deep Power Law (DPL) ensembles, a probabilistic surrogate for Bayesian optimization (BO) that estimates the performance of a hyperparameter configuration at future budgets using ensembles of deep power law functions. Subsequently, a novel flavor of BO dynamically decides which configurations to pause and which to train incrementally, relying on the performance estimations of the surrogate.
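The extrapolation idea behind this approach can be illustrated with a minimal sketch: given validation errors observed during the first few epochs, fit a power law f(b) = a + c·b^(−γ) and predict the error at a larger budget. This is a hypothetical toy fit (grid search over γ with linear least squares for a and c), not the paper's neural DPL surrogate; all function names here are illustrative.

```python
import numpy as np

def fit_power_law(budgets, errors, gammas=np.linspace(0.1, 3.0, 30)):
    """Fit f(b) = a + c * b**(-gamma): grid search over gamma,
    linear least squares for (a, c). Illustrative only."""
    best = None
    for g in gammas:
        # Design matrix for the linear part [a, c] at fixed gamma.
        X = np.column_stack([np.ones_like(budgets), budgets ** (-g)])
        coef, *_ = np.linalg.lstsq(X, errors, rcond=None)
        sse = float(np.sum((errors - X @ coef) ** 2))
        if best is None or sse < best[0]:
            best = (sse, coef[0], coef[1], g)
    _, a, c, g = best
    return lambda b: a + c * b ** (-g)

# Synthetic learning curve: error = 0.05 + 0.5 * b^(-0.7) plus small noise.
rng = np.random.default_rng(0)
b = np.arange(1, 11, dtype=float)
e = 0.05 + 0.5 * b ** (-0.7) + rng.normal(0.0, 1e-3, size=b.size)

f = fit_power_law(b, e)
print(f(100.0))  # extrapolated validation error at budget 100
```

With only ten low-budget observations, the fitted curve extrapolates close to the true asymptote (≈0.05), which is the property the surrogate exploits to rank configurations early.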
We demonstrate that our method achieves a new state of the art in HPO for DL by comparing against 8 strong HPO baselines on 57 datasets spanning three diverse modalities (tabular, image, and natural language processing). As a result, we believe the proposed method has the potential to make HPO for DL practical. Overall, our contributions can be summarized as follows:

• We introduce a novel probabilistic surrogate for gray-box HPO based on ensembles of deep power law functions.
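The pause-and-resume scheduling described above can be sketched as a simple loop: each round, extend the learning curve of the configuration whose extrapolated final error is lowest. This is a hypothetical simplification (greedy selection with a pluggable `extrapolate` function), not the paper's actual BO acquisition rule; all names are illustrative.

```python
def greybox_hpo(configs, extrapolate, train_one_epoch, budget):
    """Sketch of pause-and-resume gray-box HPO. Each round, train one
    more epoch on the configuration with the lowest predicted final
    error (untried configurations are evaluated first). Illustrative
    only; not the paper's acquisition function."""
    curves = {c: [] for c in configs}
    for _ in range(budget):
        untried = [c for c in configs if not curves[c]]
        if untried:
            chosen = untried[0]
        else:
            chosen = min(configs, key=lambda c: extrapolate(curves[c]))
        epoch = len(curves[chosen]) + 1
        curves[chosen].append(train_one_epoch(chosen, epoch))
    best = min(configs, key=lambda c: min(curves[c]))
    return best, curves

# Toy example: each config's error follows cfg + 0.5 * epoch**(-0.7),
# so the config value acts as its asymptotic error.
def train_one_epoch(cfg, epoch):
    return cfg + 0.5 * epoch ** (-0.7)

extrapolate = lambda curve: curve[-1]  # naive: last observed error
best, curves = greybox_hpo([0.10, 0.05, 0.20], extrapolate, train_one_epoch, 12)
print(best)
```

In this toy run the loop quickly concentrates the remaining budget on the configuration with the lowest asymptotic error; replacing the naive `extrapolate` with a power-law fit is what lets the real method make such decisions after only a few epochs.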

Availability: https://anonymous.4open.science/r/DeepRegret-0F61/

