DEEP RANKING ENSEMBLES FOR HYPERPARAMETER OPTIMIZATION

Abstract

Automatically optimizing the hyperparameters of Machine Learning algorithms is one of the primary open questions in AI. Existing work in Hyperparameter Optimization (HPO) trains surrogate models for approximating the response surface of hyperparameters as a regression task. In contrast, we hypothesize that the optimal strategy for training surrogates is to preserve the ranks of the performances of hyperparameter configurations as a Learning to Rank problem. As a result, we present a novel method that meta-learns neural network surrogates optimized for ranking the configurations' performances while modeling their uncertainty via ensembling. In a large-scale experimental protocol comprising 12 baselines, 16 HPO search spaces and 86 datasets/tasks, we demonstrate that our method achieves new state-of-the-art results in HPO.

1. INTRODUCTION

Hyperparameter Optimization (HPO) is a crucial ingredient in training state-of-the-art Machine Learning (ML) algorithms. The three popular families of HPO techniques are Bayesian Optimization (Hutter et al., 2019), Evolutionary Algorithms (Awad et al., 2021a), and Reinforcement Learning (Wu & Frazier, 2019; Jomaa et al., 2019). Among these paradigms, Bayesian Optimization (BO) stands out as the most popular approach for guiding the HPO search. At its core, BO fits a parametric function (called a surrogate) to estimate the evaluated performances (e.g. validation error rates) of a set of hyperparameter configurations. The task of fitting the surrogate to the observed data points is treated as a probabilistic regression, where the common choice for the surrogate is Gaussian Processes (GP) (Snoek et al., 2012). Consequently, BO uses the probabilistic predictions of the configurations' performances for exploring the search space of hyperparameters. For an introduction to BO, we refer the interested reader to Hutter et al. (2019).

In this paper, we highlight that the current BO approach of training surrogates through a regression task is sub-optimal. We furthermore hypothesize that fitting a surrogate to evaluated configurations is instead a learning-to-rank (L2R) problem (Burges et al., 2005). The evaluation criterion for HPO is the performance of the top-ranked configuration. In contrast, the regression loss measures the surrogate's ability to estimate all observed performances and pays no special attention to the top-performing configuration(s). We propose that BO surrogates should be trained to estimate the ranks of the configurations, with a special emphasis on correctly predicting the ranks of the top-performing configurations. Unfortunately, the current BO machinery cannot be naively extended to L2R, because Gaussian Processes (GP) are not directly applicable to ranking.
In this paper, we propose a novel paradigm for training probabilistic surrogates that learn to rank in HPO with neural network ensembles. Our networks are trained to minimize L2R listwise losses (Cao et al., 2007), and the ensemble's uncertainty is modeled by training diverse networks via the Deep Ensemble paradigm (Lakshminarayanan et al., 2017). While there have been a few HPO-related works using flavors of basic ranking losses (Bardenet et al., 2013; Wistuba & Pedapati, 2020; Öztürk et al., 2022), ours is the first systematic treatment of HPO through a methodologically-principled L2R formulation. To achieve state-of-the-art HPO results, we follow the established practice of transfer-learning the ranking surrogates from evaluations on previous datasets (Wistuba & Grabocka, 2021). Furthermore, we boost the transfer quality by using dataset meta-features as an extra source of information (Jomaa et al., 2021a). We conducted large-scale experiments using HPO-B (Pineda Arango et al., 2021), the largest public HPO benchmark, and compared our method against 12 state-of-the-art HPO baselines. We ultimately demonstrate that our method, Deep Ranking Ensembles (DRE), sets the new state-of-the-art in HPO by a statistically-significant margin.

This paper introduces three main technical contributions:

• We introduce a novel neural network BO surrogate (named Deep Ranking Ensembles) optimized with Learning-to-Rank (L2R) losses;
• We propose a new technique for meta-learning our ensemble surrogate from large-scale public meta-datasets;
• Deep Ranking Ensembles achieve the new state-of-the-art in HPO, demonstrated through a very large-scale experimental protocol.

In contrast to prior regression-based surrogates, we train BO surrogates using a learning-to-rank problem definition (Cao et al., 2007).
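The Deep Ensemble paradigm above yields, for each candidate configuration, one ranking score per ensemble member; the spread across members serves as an uncertainty estimate that an acquisition function can exploit. The paper's exact acquisition rule is not specified in this section, so the sketch below uses a generic UCB-style aggregation (mean plus a multiple of the standard deviation) purely for illustration; the function name and `kappa` parameter are our assumptions:

```python
import math
import statistics

def ensemble_ucb(member_scores, kappa=1.0):
    """Pick the candidate configuration to evaluate next.

    member_scores: list of M lists, one per ensemble member, each holding
    ranking scores for the same N candidate configurations.
    Returns the index of the candidate maximizing mean + kappa * std,
    i.e. an exploration/exploitation trade-off over the ensemble spread.
    """
    n = len(member_scores[0])
    best_idx, best_val = 0, -math.inf
    for j in range(n):
        col = [m[j] for m in member_scores]       # scores of candidate j
        mu = sum(col) / len(col)                  # ensemble mean
        sigma = statistics.pstdev(col)            # ensemble disagreement
        val = mu + kappa * sigma
        if val > best_val:
            best_idx, best_val = j, val
    return best_idx
```

With a small `kappa` the rule exploits the candidate the ensemble agrees is best; with a large `kappa` it explores candidates the members disagree on.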

2. RELATED WORK

Transfer HPO refers to the problem of speeding up HPO by transferring knowledge from evaluations of hyperparameter configurations on auxiliary datasets (Wistuba & Grabocka, 2021; Feurer et al., 2015; 2018). For example, the hyperparameters of a Gaussian Process can be meta-learned on previous datasets and then transferred to new tasks (Wang et al., 2021). Similarly, a deep GP's kernel parameters can also be meta-learned across auxiliary tasks (Wistuba & Grabocka, 2021). Another method trains ensembles of GPs weighted proportionally to the similarity between the new task and the auxiliary ones (Wistuba et al., 2016). When performing transfer HPO, it is useful to embed additional information about the dataset. Some approaches use dataset meta-features to warm-initialize the HPO (Feurer et al., 2015; Wistuba et al., 2015), or to condition the surrogate during pre-training (Bardenet et al., 2013). Recent works propose an attention mechanism to train dataset-aware surrogates (Wei et al., 2019), or utilize deep sets to extract meta-features (Jomaa et al., 2021b). Complementing the prior work, we meta-learn ranking surrogates with meta-features.

Learning to Rank (L2R) is a problem definition that demands estimating the rank (a.k.a. relevance, or importance) of an instance within a set (Burges et al., 2005). The primary application domains for L2R are information retrieval (ranking websites in a search engine) (Ai et al., 2018) and e-commerce systems (ranking recommended products or advertisements) (Tang & Wang, 2018; Wu et al., 2018). However, L2R is applicable to diverse problems, from learning distance functions among images in computer vision (Cakir et al., 2019) to ranking financial events (Feng et al., 2021). In this paper, we emphasize the link between HPO and L2R and train neural surrogates for BO with L2R losses.

Learning to Rank for HPO is a strategy for conducting HPO with an L2R optimization approach. Some transfer-learning HPO methods employ ranking objectives within their transfer mechanisms. SCoT uses a surrogate-based ranking mechanism for transferring hyperparameter configurations across datasets (Bardenet et al., 2013). On the other hand, Feurer et al. (2018) use a weighted ensemble of Gaussian Processes with one GP per auxiliary dataset, where the ensemble weights are learned with a pairwise ranking-based loss. Modeling the ranks of the learning

Availability: https://github.com/releaunifreiburg/DeepRankingEnsembles

