ZERO-SHOT TRANSFER LEARNING FOR GRAY-BOX HYPER-PARAMETER OPTIMIZATION

Abstract

Zero-shot hyper-parameter optimization refers to the process of selecting hyper-parameter configurations that are expected to perform well for a given dataset upfront, without access to any observations of the losses of the target response. Existing zero-shot approaches are posed as initialization strategies for Bayesian Optimization and often rely on engineered meta-features to measure dataset similarity, operating under the assumption that the responses of similar datasets behave similarly with respect to the same hyper-parameters. Solutions for zero-shot HPO are embarrassingly parallelizable and can thus vastly reduce the wall-clock time required to learn a single model. We propose a simple HPO model called Gray-box Zero(O)-Shot Initialization (GROSI), a conditional parametric surrogate that learns a universal response model by exploiting the relationship between the hyper-parameters and the dataset meta-features directly. In contrast to existing HPO solutions, we achieve transfer of knowledge without engineered meta-features, but rather through a shared model that is trained simultaneously across all datasets. We design and optimize a novel loss function that allows us to regress from the dataset/hyper-parameter pair onto the response. Experiments on 120 datasets demonstrate the strong performance of GROSI compared to conventional initialization strategies. We also show that by fine-tuning GROSI to the target dataset, we can outperform state-of-the-art sequential HPO algorithms.

1. INTRODUCTION

Within the research community, efforts towards solving the problem of hyper-parameter optimization (HPO) have mainly concentrated on sequential model-based optimization (SMBO), i.e. iteratively fitting a probabilistic response model, typically a Gaussian process (Rasmussen (2003)), to a history of observations of losses of the target response, and suggesting the next hyper-parameters via a policy, the acquisition function, that balances exploration and exploitation by leveraging the uncertainty in the posterior distribution (Jones et al. (1998); Wistuba et al. (2018); Snoek et al. (2012)). However, even when solutions are defined in conjunction with transfer-learning techniques (Bardenet et al. (2013); Wistuba et al. (2016); Feurer et al. (2015)), the performance of SMBO is heavily affected by the choice of the initial hyper-parameters. Furthermore, SMBO is sequential by design, so additional acceleration by parallelization is not possible. In this paper, we present the problem of zero-shot hyper-parameter optimization as a meta-learning objective that exploits dataset information as part of the surrogate model. Instead of treating HPO as a black-box function, operating blindly on the response of the hyper-parameters alone, we treat it as a gray-box function (Whitley et al. (2016)), capturing the relationship between the dataset meta-features and the hyper-parameters to approximate the response model. We propose a novel formulation of HPO as a conditional gray-box function optimization problem, Section 4, that allows us to regress from the dataset/hyper-parameter pair directly onto the response. Driven by the assumption that similar datasets should have similar response approximations, we introduce an additional data-driven similarity regularization objective that penalizes the difference between the predicted responses of similar datasets.
In Section 5, we perform an extensive battery of experiments that highlight the capacity of our universal model to serve as a solution for: (1) zero-shot HPO as a stand-alone task; (2) zero-shot HPO as an initialization strategy for Bayesian Optimization (BO); (3) transferable sequential model-based optimization. A summary of our contributions is:

• a formulation of the zero-shot hyper-parameter optimization problem in which our response model predicts upfront the full set of hyper-parameter configurations to try, without access to observations of losses of the target response;
• a novel multi-task optimization objective that models the inherent similarity between datasets and their respective responses;
• three new meta-datasets with different search spaces and cardinalities that facilitate our experiments and serve as a benchmark for future work;
• an empirical demonstration of the performance of our approach through a battery of experiments that address the aforementioned research aspects, including a comparison against state-of-the-art HPO solutions for transfer learning.

2. RELATED WORK

The most straightforward zero-shot approaches for HPO consist of random search (Bergstra & Bengio (2012)), or simply selecting hyper-parameters that perform well on general tasks (Brazdil et al. (2003)). Recent work has also shown that simply selecting random hyper-parameters from a restricted search space significantly outperforms existing solutions and improves the performance of conventional SMBO approaches (Perrone et al. (2019)). The restricted search space is created by eliminating regions that are far away from the best hyper-parameters of the training tasks. Another prominent direction for zero-shot HPO depends heavily on engineered meta-features, i.e. dataset characteristics (Vanschoren (2018)), to measure the similarity of datasets. Following the assumption that the responses of similar datasets behave similarly with respect to the hyper-parameters, it has been shown that even the simplest meta-features (Bardenet et al. (2013)) improve the performance of single-task BO algorithms (Feurer et al. (2014; 2015)): the target response is initialized with the top-performing hyper-parameters of the dataset's nearest neighbor in the meta-feature space. The shortcomings of engineered meta-features are that they are hard to define (Leite & Brazdil (2005)) and are often selected through trial-and-error or expert domain knowledge. As a remedy, replacing engineered meta-features with learned meta-features (Jomaa et al. (2019)) compensates for such limitations by producing expressive meta-features agnostic to any meta-task, such as HPO. Zero-shot HPO has also been posed as an optimization problem that aims to minimize the meta-loss over a collection of datasets (Wistuba et al. (2015a)) by replacing the discrete minimum function with a differentiable softmin function as an approximation; the resulting initial configurations boost single-task BO without any meta-features. In (Wistuba et al. (2015b)), hyper-parameter combinations are assigned a static ranking based on the cumulative average normalized error, and dataset similarity is estimated based on the relative ranking of these combinations. Winkelmolen et al. (2020) introduce a Bayesian Optimization solution for zero-shot HPO by iteratively fitting a surrogate model over the observed responses of different tasks and selecting the next hyper-parameters and datasets that minimize the aggregated observed loss. Aside from zero-shot HPO, transfer learning is employed by learning better response models (Wistuba et al. (2016)) based on the similarity of the responses. Feurer et al. (2018) propose an ensemble model for BO by building the target response model as a weighted sum of the predictions of base models as well as the target model. In addition to transferable response models, Volpp et al. (2019) design a transferable acquisition function as a policy for hyper-parameter optimization defined in a reinforcement learning framework. As a replacement for the standard Gaussian process, Perrone et al. (2018) train a multi-task adaptive Bayesian linear regression model with a shared feature extractor that provides context information for each independent task. In contrast to the literature, we formulate the problem of zero-shot HPO as a gray-box function optimization problem by designing a universal response model defined over the combined domain of datasets and hyper-parameters. We rely on learned embeddings to estimate the similarities across datasets, and design a novel multi-task optimization objective to regress directly onto the response. This allows us to sidestep the complexity that comes with Bayesian uncertainty, as well as the trouble of engineering similarity measures.

3. HYPER-PARAMETER OPTIMIZATION

Consider a dataset $D = \{x^{(\mathrm{Train})}, y^{(\mathrm{Train})}, x^{(\mathrm{Val})}, y^{(\mathrm{Val})}, x^{(\mathrm{Test})}, y^{(\mathrm{Test})}\}$ for a supervised learning task, with training, validation and test splits of predictors $x \in \mathcal{X}$ and targets $y \in \mathcal{Y}$. We aim at training a parametric approximation of the target, $\hat{y} := f(x; \theta, \lambda) : \mathcal{X} \rightarrow \mathcal{Y}$, where $\theta \in \Theta$ denotes the parameters and $\lambda \in \Lambda$ the hyper-parameters, by minimizing a loss function $\mathcal{L} : \mathcal{Y} \times \mathcal{Y} \rightarrow \mathbb{R}$ as:

$$\lambda^* = \arg\min_{\lambda \in \Lambda} \mathcal{L}\left(y^{(\mathrm{Val})}, f\left(x^{(\mathrm{Val})}; \theta^*, \lambda\right)\right) \;\; \text{s.t.} \;\; \theta^* = \arg\min_{\theta \in \Theta} \mathcal{L}\left(y^{(\mathrm{Train})}, f\left(x^{(\mathrm{Train})}; \theta, \lambda\right)\right) \quad (1)$$

We hereafter denote the validation error as the response $\ell(\lambda) := \mathcal{L}\left(y^{(\mathrm{Val})}, f\left(x^{(\mathrm{Val})}; \theta^*, \lambda\right)\right)$. Unfortunately, a direct optimization of the response $\ell(\lambda)$ in terms of $\lambda$ is not trivial, because $\theta^*$ is itself the result of the inner minimization problem and its gradients with respect to $\lambda$ are not easy to compute. Instead, in order to learn the optimal hyper-parameters $\lambda$, we train a probabilistic surrogate $\hat{\ell}(\lambda; \beta) : \Lambda \times B \rightarrow \mathbb{R}$, parameterized by $\beta \in B$, with $B$ the space of response model parameters, that maximizes the log-likelihood of approximating the response $\ell(\lambda)$ over a set of $K$ evaluations $S := \{(\lambda_1, \ell(\lambda_1)), \ldots, (\lambda_K, \ell(\lambda_K))\}$. We denote by $P$ the probability of estimating the response given a surrogate model. Given the surrogate, the next hyper-parameter to be evaluated, $\lambda^{(\mathrm{next})}$, is computed by maximizing an acquisition function $A$ (e.g. EI (Močkus (1975))) as:

$$\lambda^{(\mathrm{next})} := \arg\max_{\lambda \in \Lambda} A\left(\hat{\ell}(\lambda; \beta^*)\right) \;\; \text{s.t.} \;\; \beta^* := \arg\max_{\beta \in B} \sum_{k=1}^{K} \ln P\left(\ell(\lambda_k), \hat{\ell}(\lambda_k; \beta)\right) \quad (2)$$

4. META-LEARNING OF CONDITIONAL GRAY-BOX SURROGATES

Let us define a collection of $T$ datasets $D^{(1)}, \ldots, D^{(T)}$ and let $\ell^{(t)}(\lambda)$ measure the response of the hyper-parameter $\lambda$ on the $t$-th dataset $D^{(t)}$. Furthermore, assume we have previously evaluated $K^{(t)}$ many hyper-parameters $\lambda_k^{(t)}$, $k \in \{1, \ldots, K^{(t)}\}$, on that particular dataset. We condition the surrogate $\hat{\ell}$ to capture the characteristics of the $t$-th dataset by taking as input the meta-feature representation $\varphi^{(t)}$ of the dataset.
Therefore, a dataset-aware surrogate can be trained using meta-learning over a cumulative objective function $O(\beta)$ as:

$$O(\beta) := \sum_{t=1}^{T} \sum_{k=1}^{K^{(t)}} \left( \ell^{(t)}\left(\lambda_k^{(t)}\right) - \hat{\ell}\left(\lambda_k^{(t)}, \varphi^{(t)}; \beta\right) \right)^2 \quad (3)$$

4.1. THE META-FEATURE EXTRACTOR

Introducing engineered meta-features has had a significant impact on hyper-parameter optimization. However, learning meta-features across datasets of varying schema in a task-agnostic setting provides more representative characteristics than relying on hard-to-tune empirical estimates. The meta-feature extractor is a set-based function (Zaheer et al. (2017)) that presents itself as an extended derivation of the Kolmogorov-Arnold representation theorem (Kůrková (1992)), which states that a multi-variate function $\varphi$ can be defined as an aggregation of univariate functions over single variables, Appendix B. Each supervised (tabular) dataset $D^{(t)} := \left(x^{(t)}, y^{(t)}\right)$ consists of instances $x^{(t)} \in \mathcal{X} \subset \mathbb{R}^{N \times M}$ and targets $y^{(t)} \in \mathcal{Y} \subset \mathbb{R}^{N \times C}$, where $N$, $M$ and $C$ represent the number of instances, predictors and targets, respectively. The dataset can be further represented as a set of smaller components (a set of sets), $D^{(t)} = \left\{\left(x_{i,m}^{(t)}, y_{i,c}^{(t)}\right) \mid m \in \{1, \ldots, M\}, i \in \{1, \ldots, N\}, c \in \{1, \ldots, C\}\right\}$. A tabular dataset composed of columns (predictors, targets) and rows (instances) is thus reduced to single predictor-target pairs instead of instance-target pairs. Based on this representation, a meta-feature extractor parameterized as a neural network (Jomaa et al. (2019)) is formulated in Equation 4. For simplicity of notation, we drop the superscript $(t)$ unless needed.

$$\varphi(D) = h\left( \frac{1}{MC} \sum_{m=1}^{M} \sum_{c=1}^{C} g\left( \frac{1}{N} \sum_{i=1}^{N} f\left(x_{i,m}, y_{i,c}\right) \right) \right) \quad (4)$$

with $f : \mathbb{R}^2 \rightarrow \mathbb{R}^{K_f}$, $g : \mathbb{R}^{K_f} \rightarrow \mathbb{R}^{K_g}$ and $h : \mathbb{R}^{K_g} \rightarrow \mathbb{R}^{K}$ represented by neural networks with $K_f$, $K_g$, and $K$ output units, respectively. This set-based formulation captures the correlation between each variable (predictor) and its assigned target, and is permutation-invariant, i.e.
the output is unaffected by the ordering of the pairs in the set. Other set-based functions, such as (Edwards & Storkey (2016); Lee et al. (2019)), can also be used for meta-feature extraction; however, we focus on this deep-set formulation (Jomaa et al. (2019)) because it has been shown to work well for hyper-parameter optimization.
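For illustration, the extractor of Equation 4 can be sketched in a few lines of Python; random linear maps with tanh activations stand in for the learned networks f, g and h, and all names are our own:

```python
import numpy as np

rng = np.random.default_rng(0)
K_f, K_g, K = 4, 8, 16

# Hypothetical fixed random weights standing in for the MLPs f, g, h.
W_f = rng.normal(size=(2, K_f))
W_g = rng.normal(size=(K_f, K_g))
W_h = rng.normal(size=(K_g, K))

def meta_features(X, Y):
    """phi(D) = h( mean_{m,c} g( mean_i f(x_im, y_ic) ) )  -- Equation 4."""
    N, M = X.shape
    C = Y.shape[1]
    outer = np.zeros(K_g)
    for m in range(M):
        for c in range(C):
            pairs = np.stack([X[:, m], Y[:, c]], axis=1)   # (N, 2) predictor/target pairs
            inner = np.tanh(pairs @ W_f).mean(axis=0)      # (1/N) sum_i f(x_im, y_ic)
            outer += np.tanh(inner @ W_g)                  # g of the instance aggregate
    return np.tanh((outer / (M * C)) @ W_h)                # h of the pair aggregate
```

Permuting the rows of the dataset (or reordering its predictor columns) leaves the output unchanged, which is exactly the permutation invariance the formulation guarantees.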

4.2. THE AUXILIARY DATASET IDENTIFICATION TASK

The dataset identification task, introduced previously as dataset similarity learning (Jomaa et al. (2019)), ensures that the meta-features of similar datasets are colocated in the meta-feature space, providing more expressive and distinct meta-features for every dataset. Let $p_D$ be a joint distribution over dataset pairs $\left(D^{(t)}, D^{(q)}, s\right) \in \mathcal{T} \times \mathcal{T} \times \{0, 1\}$, with $s$ a binary dataset-similarity indicator. We define a classification model $\hat{s} : \mathcal{T} \times \mathcal{T} \rightarrow \mathbb{R}^{+}$ that provides an unnormalized probability estimate for $s$ being 1, as follows:

$$\hat{s}\left(D^{(t)}, D^{(q)}\right) = e^{-\gamma Z\left(\varphi^{(t)}, \varphi^{(q)}\right)} \quad (5)$$

where $Z : \mathbb{R}^{K} \times \mathbb{R}^{K} \rightarrow \mathbb{R}^{+}$ represents any distance metric and $\gamma$ is a tunable hyper-parameter. For simplicity, we use the Euclidean distance to measure the similarity between the extracted meta-features, i.e. $Z\left(\varphi^{(t)}, \varphi^{(q)}\right) = \left\|\varphi^{(t)} - \varphi^{(q)}\right\|$, and set $\gamma = 1$. The classification model is trained by optimizing the negative log-likelihood:

$$P(\beta) := - \mathbb{E}_{(t,q) \sim p_{D^+}} \log \hat{s}\left(D^{(t)}, D^{(q)}\right) - \mathbb{E}_{(t,q) \sim p_{D^-}} \log\left(1 - \hat{s}\left(D^{(t)}, D^{(q)}\right)\right) \quad (6)$$

with $p_{D^+}$ as the distribution of similar datasets, $p_{D^+} = \{(D^{(t)}, D^{(q)}, s) \sim p_D \mid s = 1\}$, and $p_{D^-}$ as the distribution of dissimilar datasets, $p_{D^-} = \{(D^{(t)}, D^{(q)}, s) \sim p_D \mid s = 0\}$. Similar datasets are defined as multi-fidelity subsets (batches) of each dataset.
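A minimal sketch of the similarity classifier of Equation 5 and the identification loss of Equation 6, assuming the meta-features are given as plain vectors (function names are ours):

```python
import math

def s_hat(phi_t, phi_q, gamma=1.0):
    """Equation 5: unnormalized probability that two datasets are similar,
    using the Euclidean distance as Z."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(phi_t, phi_q)))
    return math.exp(-gamma * dist)

def identification_loss(pos_pairs, neg_pairs, eps=1e-12):
    """Equation 6: negative log-likelihood over similar / dissimilar pairs
    (sample averages replaced by sums over the given pairs)."""
    loss = -sum(math.log(s_hat(a, b) + eps) for a, b in pos_pairs)
    loss -= sum(math.log(1.0 - s_hat(a, b) + eps) for a, b in neg_pairs)
    return loss
```

The loss is small when pairs labeled similar are indeed close in meta-feature space and pairs labeled dissimilar are far apart, and large when the labels are swapped.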

4.3. DATA-DRIVEN SIMILARITY REGULARIZATION

Our surrogate differs from prior practices because we do not consider the response to be entirely black-box. Instead, since we know the features and the target values of a dataset even before evaluating any hyper-parameter, we model a gray-box surrogate by exploiting the dataset characteristics $\varphi$ when approximating the response $\ell$. As a result, if the surrogate faces a new dataset that is similar to one of the $T$ datasets from the collection on which it was optimized (i.e. similar meta-features $\varphi$ extracted directly from the dataset), it will estimate a similar response. Moreover, if we know a priori that two datasets are similar by means of the distance of their meta-features, we can explicitly regularize the surrogate to produce similar response estimations for such similar datasets, as:

$$R(\beta) := \sum_{t=1}^{T-1} \sum_{q=t+1}^{T} \sum_{k=1}^{K^{(t)}} \hat{s}\left(D^{(t)}, D^{(q)}\right) \left( \hat{\ell}\left(\lambda_k^{(t)}, \varphi^{(t)}; \beta\right) - \hat{\ell}\left(\lambda_k^{(t)}, \varphi^{(q)}; \beta\right) \right)^2 \quad (7)$$

Overall, we train the surrogate model to estimate the collection of response evaluations and to explicitly capture the dataset similarity by solving the following problem, Equation 8, end-to-end, where $\alpha \in \mathbb{R}$ controls the amount of similarity regularization and $\delta \in \mathbb{R}$ controls the impact of the dataset identification task:

$$\beta^* := \arg\min_{\beta \in B} \; O(\beta) + \alpha R(\beta) + \delta P(\beta) \quad (8)$$

4.4. NETWORK ARCHITECTURE

Our model architecture is divided into two modules, $l := \psi \circ \varphi$: the meta-feature extractor $\varphi$ and the regression head $\psi$. The meta-feature extractor $\varphi$, which maps a dataset to $\mathbb{R}^{K_h}$, is composed of the three functions of Equation 4, namely $f$, $g$ and $h$. The regression head is likewise a composition of two functions, $\psi := \psi_2 \circ \psi_1$. We define by $\psi_1 : \mathbb{R}^{K_h} \times \Lambda \rightarrow \mathbb{R}^{K_{\psi_1}}$ the function that takes as input the meta-feature/hyper-parameter pair, and by $\psi_2 : \mathbb{R}^{K_{\psi_1}} \rightarrow \mathbb{R}$ the function that approximates the response. Finally, let Dense(n) denote one fully connected layer with n neurons, and ResidualBlock(n, m) denote m × Dense(n) with residual connections (Zagoruyko & Komodakis (2016)).
We select the architecture presented in Table 1 based on the best observed average performance on the held-out validation sets across all meta-datasets, Appendix E.1. 
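A minimal sketch of the full training objective of Equation 8, under the simplifying assumptions that all T datasets share one grid of K hyper-parameters and that the similarity weight of Equation 7 is the classifier ŝ of Equation 5 with γ = 1 (names are ours, and the dataset-identification loss P(β) is passed in as a precomputed scalar):

```python
import numpy as np

def grosi_loss(resp_true, resp_pred, phis, alpha=0.5, delta=1.0, ident_nll=0.0):
    """O(beta) + alpha * R(beta) + delta * P(beta)  -- Equation 8 (sketch).

    resp_true / resp_pred: (T, K) responses per dataset and hyper-parameter;
    phis: (T, d) meta-features; ident_nll stands in for Equation 6.
    """
    O = np.sum((resp_true - resp_pred) ** 2)                 # Equation 3
    R = 0.0
    T = len(phis)
    for t in range(T - 1):
        for q in range(t + 1, T):
            sim = np.exp(-np.linalg.norm(phis[t] - phis[q]))  # s_hat, gamma = 1
            R += sim * np.sum((resp_pred[t] - resp_pred[q]) ** 2)
    return O + alpha * R + delta * ident_nll
```

With identical meta-features, the regularizer pushes the predicted response curves of the two datasets toward each other; as the meta-feature distance grows, the penalty fades exponentially.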

5. EXPERIMENTS

Our experiments are designed to answer three research questions:

• Q1: Can we learn a universal response model that provides useful hyper-parameter initializations for unseen datasets, without access to previous observations of hyper-parameters for the dataset itself?
• Q2: Do the proposed suggestions serve as a good initialization strategy for existing SMBO algorithms?
• Q3: Aside from zero-shot HPO, does the performance of our method improve by refitting the response model to the observations of the hyper-parameters for the target dataset, and how does our approach compare to state-of-the-art methods in transfer learning for HPO?

5.1. TRAINING PROTOCOL

In Algorithm 1 we describe the pseudo-code for optimizing our response model via standard meta-learning optimization routines. We use stochastic gradient descent to optimize the inner loop, and the Adam optimizer (Kingma & Ba (2015)) for the outer loop. We set the number of inner iterations to v = 5 and use a learning rate of 0.001 for both optimizers. We use a batch size of 8 tasks, sampled randomly at each iteration. The code is implemented in TensorFlow (Abadi et al. (2016)). The performance of the various optimizers is assessed by measuring the regret, i.e. the distance between an observed response and the optimal response on the response surface. For hyper-parameter optimization, the meta-datasets are provided beforehand; consequently, the optimal response is known. Since we normalize the response surfaces to (0, 1), we report the normalized regret. The reported results represent the average over a 5-fold cross-validation split for each meta-dataset, with 80 meta-train, 16 meta-validation, and 24 meta-test datasets per fold, and one unit of standard deviation.
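Because the full response surface of each meta-dataset is known, the normalized regret reduces to a one-line computation (a sketch; names are ours):

```python
def normalized_regret(observed, surface):
    """Distance of the best observed response from the optimum, after
    normalizing the full response surface to (0, 1)."""
    lo, hi = min(surface), max(surface)
    return (min(observed) - lo) / (hi - lo)
```

A regret of 0 means the optimum of the surface was found; a regret of 0.5 means the best configuration observed so far sits halfway up the normalized surface.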

5.2. META-DATASET

We create three meta-datasets by using 120 datasets chosen from the UCI repository (Asuncion & Newman (2007)). We then create the meta-instances by training a feedforward neural network and reporting the validation accuracy. Each dataset is provided with a predefined split of 60% train, 15% validation, and 25% test instances. We train each configuration for 50 epochs with a learning rate of 0.001. The hyper-parameter search space is described in Table 2. The layout hyper-parameter (Jomaa et al. (2019)) corresponds to the overall shape of the neural network and provides information regarding the number of neurons in each layer; for example, under one layout all the layers in the neural network share the same number of neurons. We introduce an additional layout in which the number of neurons in each layer is successively halved until it reaches the number of neurons in the central layer, and then successively doubled. We also use dropout (Srivastava et al. (2014)) and batch normalization (Ioffe & Szegedy (2015)) as regularization strategies, and stochastic gradient descent (GD), ADAM (Kingma & Ba (2015)) and RMSProp (Tieleman & Hinton (2012)) as optimizers. SeLU (Klambauer et al. (2017)) represents the self-normalizing activation unit. The search space consists of all possible combinations of the hyper-parameters. After removing redundant configurations, the resulting meta-datasets have 256, 288 and 324 unique configurations, respectively. For the purposes of our algorithm, we need access to the datasets used to generate the meta-features. Further details are available in Appendix C.
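The layout hyper-parameter can be illustrated with a small helper that generates layer widths; the layout names and exact shapes below are our own illustrative assumptions, not the entries of Table 2:

```python
def layout_widths(layout, depth, width):
    """Hypothetical helper: per-layer neuron counts for a layout shape.

    'rect':      constant width across all layers;
    'funnel':    width halved at each layer;
    'hourglass': width halved down to the central layer, then doubled back up
                 (the additional layout described in Section 5.2).
    """
    if layout == "rect":
        return [width] * depth
    if layout == "funnel":
        return [max(1, width >> i) for i in range(depth)]
    if layout == "hourglass":
        half = [max(1, width >> i) for i in range((depth + 1) // 2)]
        return half + half[-2 if depth % 2 else -1::-1]
    raise ValueError(f"unknown layout: {layout}")
```

For instance, an hourglass of depth 5 starting at 64 neurons yields [64, 32, 16, 32, 64].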

5.3. BASELINES

We introduce two sets of baselines to evaluate against the different aspects of our approach:

ZERO-SHOT HYPER-PARAMETER OPTIMIZATION

• Random search (Bergstra & Bengio (2012)) is the simplest approach, where the hyper-parameters are selected randomly.
• Average Rank represents the hyper-parameters that on average achieved the highest rank across the meta-train datasets.
• NN-METAFEATURE (Feurer et al. (2015)) refers to the process of selecting the top-performing hyper-parameters of the nearest neighboring dataset based on its meta-features. We use two sets of well-established engineered meta-features, which we refer to as MF1 (Feurer et al. (2015)) and MF2 (Wistuba et al. (2016)), as well as learned meta-features (Jomaa et al. (2019)), which we denote by D2V. Similarity is measured by the Euclidean distance.
• Ellipsoid (Perrone et al. (2019)) is also a random search approach; however, the hyper-parameters are sampled from a restricted hyper-ellipsoid search space that encompasses as many optimal hyper-parameters from the training datasets as possible.
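The NN-METAFEATURE baseline reduces to a nearest-neighbor lookup in meta-feature space; a sketch, with all names ours:

```python
import math

def nn_warm_start(target_mf, train_mfs, train_top_configs):
    """Return the top configurations of the training dataset whose
    meta-features are nearest (Euclidean) to the target's."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    nearest = min(range(len(train_mfs)), key=lambda i: dist(target_mf, train_mfs[i]))
    return train_top_configs[nearest]
```

Swapping `target_mf` and `train_mfs` between engineered (MF1, MF2) and learned (D2V) meta-features changes the neighbor structure but not the lookup itself.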

SEQUENTIAL MODEL-BASED OPTIMIZATION FOR TRANSFER LEARNING

• GP (Rasmussen (2003)) is a standard Gaussian process response model with a Matérn 3/2 kernel and automatic relevance determination, trained independently on each dataset.
• SMFO (Wistuba et al. (2015b)) is a sequential model-free approach that provides a collection of hyper-parameters by minimizing the ranking loss across all the tasks in the meta-train datasets.
• TST-R (Wistuba et al. (2016)) is a two-stage approach in which the parameters of the target response model are adjusted via a kernel-weighted average based on the similarity of the hyper-parameter response between the target dataset and the training datasets. We also evaluate the variant of this approach that relies on meta-features, replacing the engineered meta-features with learned meta-features, denoted TST-D2V.
• RGPE (Feurer et al. (2018)) is an ensemble model that estimates the target response model as a weighted combination of the training datasets' response models and the target model itself. The weights are assigned based on a ranking loss of the respective model.
• ABLR (Perrone et al. (2018)) is a multi-task ensemble of adaptive Bayesian linear regression models in which all tasks share a common feature extractor.
• TAF-R (Wistuba et al. (2018)), unlike the aforementioned algorithms that focus on a transferable response model, learns a transferable acquisition function that selects the next hyper-parameter based on a weighted combination of the expected improvement on the target task and the predicted improvement on the source tasks.
• MetaBO (Volpp et al. (2019)) is another transferable acquisition function, optimized as a policy in a reinforcement learning framework. This approach, however, demands a pre-computed target response model as part of the state representation.
In our approach, we learn a universal response model based on the underlying assumption that the response depends not only on the hyper-parameters, as assumed in black-box optimization techniques, but also on the dataset itself, which casts the problem as gray-box function optimization.

Q1: ZERO-SHOT HPO AS A STAND-ALONE PROBLEM

In Table 3 we report the final normalized regret achieved by the different zero-shot approaches for the first 20 hyper-parameters (Feurer et al. (2015)). Our method provides dataset-conditioned hyper-parameters that perform better than heuristics for small budgets. The use of engineered meta-features to represent datasets for HPO solutions is not reliable, as the results achieved by NN-MF1 and NN-MF2 are no better than random. On the other hand, using the meta-features extracted from the dataset directly, NN-D2V serves as a better approximation. Furthermore, random sampling from the restricted hyper-ellipsoid also outperforms initialization strategies based on meta-features. We obtain the zero-shot hyper-parameters via Algorithm 2; the D2V meta-features are obtained via Algorithm 5.

Table 3: Results on several zero-shot HPO benchmarks. The numbers reported are the average normalized regret for 120 tasks on each meta-dataset, evaluated as the average of a 5-fold cross-validation scheme. We report the best results in bold and underline the second best.
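The selection step of Algorithm 2 amounts to ranking the candidate grid by the surrogate's prediction; a sketch, assuming a loss-valued surrogate already conditioned on the target dataset's meta-features (names are ours):

```python
def zero_shot_hpo(surrogate, candidates, K=20):
    """Algorithm 2 sketch: rank the candidate grid by the predicted response
    and return the K configurations with the lowest predicted loss."""
    return sorted(candidates, key=surrogate)[:K]
```

All K suggestions are produced upfront, so their evaluations can run in parallel, which is the embarrassingly parallel property highlighted in the abstract.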

Q3: SEQUENTIAL GRAY-BOX FUNCTION OPTIMIZATION

The proposed universal response model provides useful hyper-parameters upfront, without access to any observations of losses of the target response. However, by iteratively refitting the model to the history of observations, the response prediction is improved, as depicted in Figure 5 and summarized in Table 4. We refit our algorithm by optimizing Equation 3 on the history of observed losses on the target dataset, Algorithm 4. We evaluate two policies for selecting the next hyper-parameter after refitting: (1) greedily selecting the hyper-parameter with the highest predicted response, GROSI(+1), and (2) selecting the next hyper-parameter randomly from the top 5 hyper-parameters with the highest predicted response, GROSI(+10), which achieves the best regret on average across the three meta-datasets, Appendix E.2. In contrast to the baselines, which select hyper-parameters through an acquisition function that capitalizes on the uncertainty of the posterior samples, we incorporate uncertainty by selecting the next hyper-parameter uniformly at random from the top-k hyper-parameters, and thus introduce a small trade-off between exploration and exploitation. Furthermore, our method outperforms the state-of-the-art transfer-learning approaches for HPO in several cases, while demonstrating in general competitive performance across all three meta-datasets, Table 4. The baselines are warm-started with 20 randomly selected hyper-parameters (Feurer et al. (2015)). For better readability, the uncertainty quantification can be found in Figure 4.

5.5. ABLATION STUDY

Optimizing the regression objective of Equation 3 alone is suboptimal and does not scale across all meta-datasets. We notice that adding the auxiliary dataset identification task, Equation 6, brings a significant improvement, as does the similarity-driven regularization, Equation 7. This reinforces the notion that the responses of similar datasets behave similarly with respect to the hyper-parameters.
Both losses help generate more expressive meta-features: the former more directly, by optimizing the inter- and intra-dataset similarities; the latter indirectly, by penalizing the difference in the predicted responses. We also initialize the meta-feature extractor, φ, by pretraining it independently, Algorithm 5. However, we notice that this generally leads to poor performance, as the model quickly arrives at a local optimum. As an artifact of the meta-dataset, we notice that pretraining GROSI on the Regularization meta-dataset provides a small lift. A small sensitivity analysis can be found in Appendix F.1.
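The GROSI(+X) selection policy described above can be sketched as follows, assuming a loss-valued surrogate already refitted and conditioned on the target dataset (names are ours; X = 1 recovers the greedy GROSI(+1) policy):

```python
import random

def grosi_plus_x(surrogate, candidates, history, X=5, seed=0):
    """Sample uniformly from the X unobserved configurations with the
    lowest predicted loss; the refitting step (Equation 3) is not shown."""
    pool = sorted((lam for lam in candidates if lam not in history), key=surrogate)
    return random.Random(seed).choice(pool[:X])
```

Restricting the random draw to the surrogate's top X keeps exploration cheap while still concentrating evaluations in promising regions.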

A DETAILED PROBLEM SETTING

By a learning task we denote a pair $(p, \ell)$ of an unknown distribution $p$ over pairs $(x, y) \in \mathbb{R}^{M+L}$, with $M, L \in \mathbb{N}$, and a loss $\ell : \mathbb{R}^{L} \times \mathbb{R}^{L} \rightarrow \mathbb{R}$. A function $\hat{y} : \mathbb{R}^{M} \rightarrow \mathbb{R}^{L}$ is called a model for task $p$, and $\ell(\hat{y}; p) := \mathbb{E}_{(x,y) \sim p}\left(\ell(y, \hat{y}(x))\right)$ its (expected) loss. Let $a$ be a learning algorithm that yields, for every sample $D$ of pairs $(x, y)$ from a task $(p, \ell)$ and hyper-parameters $\lambda \in \mathbb{R}^{P}$, a model $\hat{y}$ for the task. We call $\ell(\lambda) := \ell(a(D, \lambda); p)$ the loss (or the response) of hyper-parameters $\lambda$. We say validation loss for the loss estimated on fresh validation data.

Sequential single-task hyper-parameter optimization problem. Given an initial number $K$ of pairs $(\lambda_k, l_k)$ of hyper-parameters and their (validation) losses, and a budget $B \in \mathbb{N}$ of trials, find sequentially $B$ many hyper-parameters $\lambda_{K+1}, \ldots, \lambda_{K+B}$, such that their smallest loss $\min_{k \in 1:K+B} \ell(\lambda_k)$ is minimal among all such sequences. To compute the next guess $\lambda_{k+1}$, the hyper-parameters $\lambda_1, \ldots, \lambda_k$ tried so far and their (validation) losses $l_k := \ell(\lambda_k)$ can be used.

Zero-shot cross-task hyper-parameter optimization problem. Let $p_{\mathrm{task}}$ be an unknown distribution of supervised learning tasks. Given a sample of triples $((p, \ell), \lambda, l)$ of learning tasks $(p, \ell)$, hyper-parameters $\lambda$ and their losses $l$, find for a fresh task $(p, \ell)$ and a budget $B \in \mathbb{N}$ — without any observations of losses of hyper-parameters on this task — a set $\{\lambda_1, \ldots, \lambda_B\}$ of hyper-parameters, such that their smallest loss $\min_{k \in 1:B} \ell(\lambda_k)$ is minimal among all such sets.

Sequential cross-task hyper-parameter optimization problem. Given both (i) a sample of triples $((p, \ell), \lambda, l)$ of learning tasks $(p, \ell)$, hyper-parameters $\lambda$ and their losses $l$, and (ii) a fresh task $(p, \ell)$ and a budget $B \in \mathbb{N}$, find sequentially $B$ many hyper-parameters $\lambda_1, \ldots, \lambda_B$, such that their smallest loss $\min_{k \in 1:B} \ell(\lambda_k)$ is minimal among all such sequences. To compute the next guess $\lambda_{k+1}$, the hyper-parameters $\lambda_1, \ldots, \lambda_k$ tried so far and their (validation) losses $l_k := \ell(\lambda_k)$, as well as all data on other tasks, can be used.
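The objective shared by the zero-shot problem statements above, i.e. the smallest loss among a fixed budget of suggestions chosen without feedback, can be written directly (a sketch; names are ours):

```python
def smallest_loss(response, suggestions, budget):
    """Objective of the zero-shot cross-task problem: the smallest loss among
    the first `budget` suggested hyper-parameters, with no feedback used."""
    return min(response(lam) for lam in suggestions[:budget])
```

The sequential variants differ only in that each suggestion may depend on the losses observed for the previous ones.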

B THE META-FEATURE EXTRACTOR

The meta-feature extractor is a set-based function, represented as an extended derivation of the Kolmogorov-Arnold representation theorem (Kůrková (1992)), which states that a multi-variate function $\varphi$ can be defined as an aggregation of univariate functions over single variables:

$$\varphi(x_1, \ldots, x_M) \approx \sum_{j=0}^{2M} h_j\left( \sum_{m=1}^{M} g_{m,j}(x_m) \right)$$

It is important to note that $\varphi$ is permutation-invariant, i.e. unaffected by any permutation of the input, which allows us to obtain the same output for the same multi-variate data regardless of the order of input. As a simple variant of this formulation (Zaheer et al. (2017)), we can replace the set of functions $h_j$ with a single function $h$, and the functions $g_{m,j}$ with a single function $g$. In this paper, we incorporate the meta-feature extractor as part of the response model, effectively learning a response conditioned on the dataset meta-features directly, such that the approximation is defined as $\hat{\ell}(\lambda^{(t)}, \varphi^{(t)}; \beta)$, with $\varphi^{(t)} = \varphi(D^{(t)})$.
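The simplified variant with shared g and h is the sum-decomposable form popularized by Deep Sets; a two-line sketch with arbitrary fixed univariate functions (the constants are our own) makes the permutation invariance explicit:

```python
import math

def sum_decomposable(xs):
    """h( sum_m g(x_m) ) with g(x) = tanh(1.7 x + 0.3) and h = tanh:
    the output depends only on the multiset of inputs, not their order."""
    return math.tanh(sum(math.tanh(1.7 * x + 0.3) for x in xs))
```

Since summation is commutative, any reordering of the inputs yields exactly the same output.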

D ALGORITHMS

We define $p_T$ as the task distribution representing pairs of datasets and hyper-parameters, i.e. $T^{(t,k)} = (D^{(t)}, \lambda_k) \in \mathcal{T} \times \Lambda$, and $p_D$ as the distribution over datasets as defined in Section 4.2. Algorithm 1 provides the overall optimization framework for GROSI, our approach.

Algorithm 1 Learn GROSI(D)
1: Require: p_D: distribution over datasets; p_T: distribution over tasks
2: Require: lr_inner, lr_outer: learning rates
3: Randomly initialize β ∈ B, the parameters of our response model l
4: while not done do
5:   Set β′ ← β
6:   Sample (D^(t), λ_k^(t)) = T^(t,k) ∼ p_T
7:   for v steps do …

Algorithm 2 Zero-shot HPO
1: Require: target dataset D^(t); response model l; desired number of zero-shot hyper-parameters K
2: H ← arg min^K_{λ∈Λ} l(D^(t), λ)
3: return H

For sequential model-based optimization, a surrogate l is fitted to the observed responses of the unknown function. Several initialization strategies exist to expedite the transfer of information across tasks, Section 5.4. In Algorithm 3, we present the generic pseudo-code for SMBO, which requires an acquisition function a to sample the next iterate from the domain.

Algorithm 3 Sequential Model-based Optimization with Warm-start
1: Require: target dataset D^(t); response model l; desired number of zero-shot hyper-parameters K; number of trials I; acquisition function a
2: Get initial hyper-parameters H_0 ← Zero-shot HPO
3: λ_min ← arg min_{λ∈H_0} l(D^(t), λ)
4: for i = 1 … I
5:   fit l_i to H_{i−1}
6:   λ ← arg max_{λ∈Λ} a(l(D^(t), λ))
7:   H_i ← H_{i−1} ∪ {λ}
8:   if l(D^(t), λ) < l(D^(t), λ_min)
9:     λ_min ← λ
10: return λ_min

In Section 5.4, we propose to initialize our response model on the target dataset and then iteratively tune it to that particular dataset. Initially, we select the top K configurations based on Algorithm 2, our zero-shot approach. Then, via Algorithm 4, we sample uniformly at random from the top X ranking configurations. If X = 1, this represents the greedy policy.
Meta-feature learning from datasets with varying schema was initially proposed in Jomaa et al. (2019). For our approach, we introduce a set-based meta-feature extractor module to handle datasets of varying schema as well; however, we optimize Equation 8 and use the dataset identification task as an auxiliary objective.

Algorithm 4 Learn GROSI(+X)
1: Require: target dataset D^(t); response model l; number of desired zero-shot hyper-parameters K; number of trials I; number of top configurations to choose from X
2: Get initial hyper-parameters H_0 ← Zero-shot HPO
3: λ_min ← arg min_{λ∈H_0} l(D^(t), λ)
4: for i = 1 . . . I do
5:   fit l_i to H_{i-1} by optimizing Equation 3
6:   λ ∼ Uniform(arg min^X_{λ∈Λ\H_{i-1}} l(D^(t), λ))
7:   H_i ← H_{i-1} ∪ {λ}
8:   if l(D^(t), λ) < l(D^(t), λ_min) then
9:     λ_min ← λ
10: return λ_min

To pre-train the meta-feature extractor for the ablation study, Section 5.5, as well as to extract meta-features for NN-D2V and TST-D2V, we follow Algorithm 5, with p_D+ as the distribution of similar datasets, p_D+ = {(D^(t), D^(q), s) ∼ p_D | s = 1}, and p_D- as the distribution of dissimilar datasets, p_D- = {(D^(t), D^(q), s) ∼ p_D | s = 0}. Similar datasets are defined as multi-fidelity subsets (batches) of each dataset.
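The sampling step of Algorithm 4 (line 6) picks uniformly among the X best not-yet-tried configurations under the current surrogate. A minimal sketch, with a toy surrogate standing in for the fitted response model:

```python
import random

def sample_next(response_model, dataset, candidates, history, x=10, rng=random):
    # Algorithm 4, step 6: rank the untried candidates by predicted loss and
    # sample uniformly from the best X; x=1 recovers the greedy policy.
    pool = [lam for lam in candidates if lam not in history]
    pool.sort(key=lambda lam: response_model(dataset, lam))  # stable sort
    return rng.choice(pool[:x])

surrogate = lambda d, lam: abs(lam - 0.5)  # toy stand-in surrogate
grid = [i / 10 for i in range(11)]
# greedy pick once 0.5 has been tried; 0.4 and 0.6 tie, and the stable
# sort breaks the tie in favor of the earlier candidate, 0.4
print(sample_next(surrogate, None, grid, history={0.5}, x=1))  # -> 0.4
```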

F ADDITIONAL EXPERIMENTAL RESULTS

F.1 HYPER-PARAMETER SENSITIVITY ANALYSIS

We optimize our response model by minimizing Equation 8, which combines the dataset identification task, Equation 6, and the similarity-driven regularization task, Equation 7, weighted by the auxiliary coefficients δ and α, respectively. We report below the performance of our universal response model for different auxiliary weights. The results confirm the importance of emphasizing the auxiliary dataset identification task in conjunction with the similarity-driven regularization loss, reinforcing the intuition that similar datasets respond similarly to the same hyper-parameters. The reported results throughout the paper are based on δ = 1 and α = 0.5.
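Under our reading of the text, the combined objective is a weighted sum of the three terms. The sketch below uses placeholder scalar losses, since the actual terms (Equations 6-8) depend on the full model:

```python
def multitask_loss(regression_loss, id_loss, sim_loss, delta=1.0, alpha=0.5):
    # Weighted multi-task objective: response regression plus the dataset
    # identification auxiliary task (weight delta) and the similarity-driven
    # regularizer (weight alpha). This mirrors Equation 8 as we read it;
    # the exact grouping in the paper may differ.
    return regression_loss + delta * id_loss + alpha * sim_loss

# with the paper's reported weights delta = 1 and alpha = 0.5:
print(round(multitask_loss(0.20, 0.10, 0.08), 6))  # -> 0.34
```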

Q1: ZERO-SHOT HPO AS A STAND-ALONE PROBLEM

As a plausibility argument for the usefulness of our zero-shot strategy, we depict in Figure 3 the top 10 hyper-parameters suggested by our approach, as well as by two initialization strategies, on the actual response surface. Our picks are co-located near the different optima of the search space, whereas the hyper-parameters of the other strategies are dispersed.

Q3: SEQUENTIAL GRAY-BOX FUNCTION OPTIMIZATION

We refit the universal response model to the observations of the response on the target dataset by optimizing Equation 3. We depict the improvement achieved over the zero-shot approach in Figure 5.

Figure 5: Average normalized regret for GROSI, our zero-shot approach, and the refitted response models GROSI(+1) and GROSI(+10). We shade in light blue the improvement over zero-shot performance.



For a better understanding of the different problem settings, see Appendix A. The associated code and the meta-dataset described will be made available upon acceptance. Unfortunately, we could not evaluate our approach on some of the published meta-datasets (Schilling et al. (2016)) due to the unavailability of the associated datasets (original predictors and target values) used to generate the meta-instances. We depict the surrogate alongside the response model in F.2 as a plausibility check. For better visualization, some baselines are removed from Figure 2, but are still reported in Table 4.

CONCLUSION

In this paper, we formulate HPO as a gray-box function optimization problem that incorporates an important domain of the response function: the dataset itself. We design a novel universal response model for zero-shot HPO that provides good initial hyper-parameters for unseen datasets in the absence of associated observations of the response. We propose and optimize a novel multi-task objective that estimates the response while learning expressive dataset meta-features. We also reinforce the assumption that similar datasets respond similarly to the same hyper-parameters by introducing a novel similarity-driven regularization technique. As part of future work, we will investigate the impact of our approach within the reinforcement learning framework.



Figure 1: Average normalized regret for single-task sequential HPO methods, a Gaussian process, with different initialization strategies.

Figure 2: Average normalized regret for state-of-the-art transfer learning HPO methods.

8:   Sample D^(q) ∼ p_D \ {D^(t)}
9:   Evaluate gradients G ← ∇_{β′} (O + α R + δ P)
10:  Compute adapted parameters with stochastic gradient descent: β′ ← β′ - lr_inner G
11:  Update β ← β - lr_outer (β - β′)
12: return β

After optimizing our objective via Algorithm 1, we apply Algorithm 2 to obtain the results presented in the tables.

Algorithm 5 Standalone Meta-feature Learning
1: Require: p_D+: distribution over similar datasets; p_D-: distribution over dissimilar datasets
2: Require: lr_φ: learning rate
3: Randomly initialize β ∈ B, the parameters of the meta-feature extractor φ
4: while not done do
5:   Sample (D^(t), D^(q), 1) ∼ p_D+ and (D^(t), D^(r), 0) ∼ p_D- (both samples share D^(t))
6:   Evaluate gradients G of the similarity loss on the sampled triplets
7:   Compute adapted parameters with stochastic gradient descent: β′ ← β′ - lr_φ G
8:   Update β ← β - lr_φ (β′ - β)
9: return β
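The pair construction feeding Algorithm 5 can be sketched as follows. The batch size and toy datasets are our own placeholders: similar pairs are two multi-fidelity batches of the same dataset (label 1), dissimilar pairs are batches of two different datasets (label 0), as the text defines them.

```python
import random

def sample_pair(datasets, similar, batch=4, rng=random):
    # Similar pairs: two batches of the same dataset (s = 1).
    # Dissimilar pairs: batches from two different datasets (s = 0).
    if similar:
        d = rng.choice(sorted(datasets))
        return rng.sample(datasets[d], batch), rng.sample(datasets[d], batch), 1
    d1, d2 = rng.sample(sorted(datasets), 2)
    return rng.sample(datasets[d1], batch), rng.sample(datasets[d2], batch), 0

toy = {"A": list(range(10)), "B": list(range(100, 110))}
a, b, s = sample_pair(toy, similar=True)
print(s)  # -> 1
```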

Figure 3: Top 10 hyper-parameters suggested by our approach for test datasets from the three different meta-datasets. We reduce the dimensionality of each search space to a 2D representation via t-SNE (Liu et al. (2016)). The first row represents the actual response surface. We notice that the true response and the predicted response are similar, and the location of the predicted minima, in green, overlaps with the minima of the actual response.

Figure 4: Average normalized regret for state-of-the-art transfer learning HPO methods, with uncertainty quantification.

The network architecture optimized for every meta-dataset.

Hyper-parameter search space for the meta-datasets. The name of each meta-dataset is inspired by its most prominent hyper-parameter, highlighted in red.

Normalized regret for state-of-the-art transfer learning HPO methods for up to 80 trials after initialization with 20 configurations.

We perform several ablation experiments to analyze the contribution of each objective to the overall performance. The results are detailed in Table 5. Treating zero-shot HPO as a simple regression model by optimizing Equation

Final results for zero-shot HPO for different variations of our model optimized with different objectives. The numbers reported are the average normalized regret after 20 trials.

Summary of the 120 UCI datasets used to generate the meta-datasets.

Final results for different variants of our sequential model optimization policy. The numbers reported are the average normalized regret after 80 trials on the held-out validation sets, after initialization with the exact same 20 configurations suggested by our zero-shot approach.

Final results of our universal response model optimized with different auxiliary weights. The numbers reported are the average normalized regret after 20 trials.

C META-DATASETS

C.1 LAYOUT HYPER-PARAMETER

Below are some examples of the number of neurons per layer for networks with different layout hyper-parameters, given 4 neurons and 5 layers:

• Layout: [4, 4, 4, 4, 4]
• Layout: [4, 8, 16, 32, 64]
• Layout: [64, 32, 16, 8, 4]
• Layout: [4, 8, 16, 8, 4]
• Layout: [16, 8, 4, 8, 16]

The search space consists of all possible combinations of the hyper-parameters. After removing redundant configurations, e.g. any two layouts with a single layer describe the same network, the resulting meta-datasets have 256, 288, and 324 unique configurations, respectively.
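To make the patterns concrete, the five example layouts can be reproduced by a small generator. The shape names ("flat", "diamond", etc.) are our own labels, not terminology from the paper, and the sketch assumes an odd number of layers and a doubling growth factor, as in the examples:

```python
def layout(base, n_layers, shape):
    # Neurons per layer for the example layout patterns above.
    geometric = [base * 2 ** i for i in range(n_layers)]
    half = [base * 2 ** i for i in range(n_layers // 2 + 1)]  # odd n_layers
    if shape == "flat":
        return [base] * n_layers
    if shape == "increasing":
        return geometric
    if shape == "decreasing":
        return geometric[::-1]
    if shape == "diamond":
        return half + half[-2::-1]    # grow to the middle, then shrink
    if shape == "hourglass":
        return half[::-1] + half[1:]  # shrink to the middle, then grow
    raise ValueError(f"unknown shape: {shape}")

print(layout(4, 5, "diamond"))    # -> [4, 8, 16, 8, 4]
print(layout(4, 5, "hourglass"))  # -> [16, 8, 4, 8, 16]
```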

C.2 HYPER-PARAMETER ENCODING

Below is a description of the encodings applied to our hyper-parameters. We also note that the scalar values are normalized to the interval (0, 1).
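For the scalar hyper-parameters, the normalization is presumably a standard min-max rescaling over the search-space bounds. A minimal sketch under that assumption; the bounds are per-hyper-parameter, and the paper's exact handling of the endpoints to keep values strictly inside the open interval (0, 1) is not specified:

```python
def min_max_scale(value, lo, hi):
    # Rescale a raw hyper-parameter value using its search-space bounds.
    # Note: this maps the bounds themselves to 0 and 1 exactly; keeping the
    # result strictly inside (0, 1) would need padded bounds.
    return (value - lo) / (hi - lo)

print(min_max_scale(32, 0, 128))  # -> 0.25
```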

C.3 THE UCI DATASETS

Table 7 is an overview of the UCI datasets used to generate the meta-datasets.

E EXPERIMENTAL DETAILS

E.1 NETWORK ARCHITECTURE

Our model architecture is divided into two modules, l := ψ ∘ φ: the meta-feature extractor φ and the regression head ψ. The meta-feature extractor φ : R^2 → R^{K_h} is composed of three functions, Equation 4, namely f, g and h. The regression head is also composed of two functions, i.e. ψ := ψ_2 ∘ ψ_1. We define ψ_1 : R^{K_h} × Λ → R^{K_{ψ1}} as the function that takes as input the meta-feature/hyper-parameter pair, and ψ_2 : R^{K_{ψ1}} → R as the function that approximates the response. Finally, let Dense(n) denote one fully connected layer with n neurons, and ResidualBlock(n, m) denote m× Dense(n) with residual connections (Zagoruyko & Komodakis (2016)).

To select a single universal response model, we evaluate the validation performance of the three network architectures described in Table 8. We select the architecture with the best average performance across the three meta-datasets, Table 9, which turns out to be Architecture 3. The architectures assign different numbers of trainable variables to the meta-feature extractor and the coupled regression head.

We propose GROSI as a zero-shot HPO solution. However, to emphasize the ability of our surrogate model to quickly adapt to new target datasets, we extend it into a sequential optimization approach. Starting with the proposed zero-shot configurations, we fine-tune our model via Algorithm 4. We select X = 10 based on the best average performance observed on the held-out validation sets, Table 10.
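A shape-level sketch of the composition l := ψ ∘ φ, using NumPy with random placeholder weights; the layer widths, the hyper-parameter dimensionality, and the two-column input to φ are illustrative choices of ours, not the configuration from Table 8:

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(n_in, n_out):
    # Dense(n): one fully connected layer with a ReLU, random toy weights.
    W, b = rng.standard_normal((n_in, n_out)) * 0.1, np.zeros(n_out)
    return lambda x: np.maximum(x @ W + b, 0.0)

def residual_block(n, m):
    # ResidualBlock(n, m): m Dense(n) layers wrapped in a skip connection.
    layers = [dense(n, n) for _ in range(m)]
    def forward(x):
        h = x
        for layer in layers:
            h = layer(h)
        return x + h
    return forward

phi = dense(2, 16)        # stand-in meta-feature extractor
psi1 = dense(16 + 3, 16)  # psi_1 consumes the meta-feature/lambda pair
block = residual_block(16, 2)
psi2 = dense(16, 1)       # psi_2 maps to the scalar response

meta_input, lam = rng.standard_normal(2), rng.standard_normal(3)
response = psi2(block(psi1(np.concatenate([phi(meta_input), lam]))))
print(response.shape)  # -> (1,)
```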

