MULTI-SOURCE UNSUPERVISED HYPERPARAMETER OPTIMIZATION

Abstract

How can we conduct efficient hyperparameter optimization for a completely new task? In this work, we consider a novel setting in which we search for the optimal hyperparameters of a target task of interest using only an unlabeled target task dataset and 'somewhat relevant' labeled source task datasets. In this setting, it is essential to estimate the ground-truth target task objective using only the available information. We propose estimators that unbiasedly approximate this ground-truth objective and enjoy a desirable variance property. Building on these estimators, we provide a general and tractable hyperparameter optimization procedure for our setting. Experimental evaluations demonstrate that the proposed framework broadens the applications of automated hyperparameter optimization.

1. INTRODUCTION

Hyperparameter optimization (HPO) has been a pivotal part of machine learning (ML) and has contributed to achieving good performance in a wide range of tasks (Feurer & Hutter, 2019). It is widely acknowledged that the performance of deep neural networks depends greatly on the configuration of the hyperparameters (Dacrema et al., 2019; Henderson et al., 2018; Lucic et al., 2018). HPO is formulated as a special case of a black-box function optimization problem, where the input is a set of hyperparameters, and the output is a validation score. Among black-box optimization methods, adaptive algorithms such as Bayesian optimization (BO) (Brochu et al., 2010; Shahriari et al., 2015; Frazier, 2018) have shown superior empirical performance compared with traditional algorithms such as grid search or random search (Bergstra & Bengio, 2012).

One critical assumption in HPO is the availability of an accurate validation score. However, in reality, there are many cases where we cannot access the ground-truth of the task of interest (referred to as the target task hereinafter). For example, in display advertising, predicting the effectiveness of each advertisement, i.e., its click-through rate (CTR), is important for showing relevant advertisements (ads) to users. It is therefore necessary to conduct HPO before a new ad campaign starts. However, for new ads that have not yet been displayed to users, one cannot use labeled data to conduct HPO. In this case, the standard HPO procedure is infeasible, as one cannot utilize labeled target task data or the true validation score of the ML model under consideration.

In this work, we address the infeasibility of HPO when the labels of the target task are unavailable. To formulate this situation, we introduce a novel HPO setting called multi-source unsupervised hyperparameter optimization (MSU-HPO). In MSU-HPO, it is assumed that we do not have labeled data for the target task.
However, we do have data for some source tasks whose distributions differ from that of the target task. It is natural to assume access to multiple source tasks in most practical settings. In the display advertising example, several labeled datasets of old ads that have already been deployed are often available, which we can use as labeled source task datasets. To the best of our knowledge, no existing HPO approach can address a situation without labeled target task data, despite its significance and potential applications.

The key problem in MSU-HPO is that the ground-truth objective is inaccessible, so one cannot directly apply the standard HPO procedure. Thus, it is essential to accurately approximate the objective using only the available data. For this purpose, we propose two estimators that enable the evaluation of ML models without labeled target task data. Our estimators are general and can be combined with any common black-box optimization method, such as Gaussian process-based BO (Srinivas et al., 2010; Snoek et al., 2012; Hennig & Schuler, 2012; Contal et al., 2014; Hernández-Lobato et al., 2014; Wang & Jegelka, 2017) and the tree-structured Parzen estimator (Bergstra et al., 2011; 2013). In addition, we show that the proposed estimators unbiasedly approximate the target task objective, and that one of them achieves a desirable variance property by selecting useful source tasks based on a task divergence measure. We also present a general and computationally inexpensive HPO procedure for MSU-HPO building on our estimators. Finally, we demonstrate that our estimators work properly through numerical experiments on synthetic and real-world datasets.

Related Work. A typical HPO setting is to find a better set of hyperparameters using a labeled dataset of the target task of interest.
As faster convergence is an essential performance metric of HPO methods, the research community has moved toward multi-source or transfer settings, in which there are some previously solved related source tasks. By combining the additional source task information with the labeled target task dataset, it has been shown that one can improve the hyperparameter search efficiency and thus reach a better solution with fewer evaluations (Bonilla et al., 2008; Bardenet et al., 2013; Swersky et al., 2013; Yogatama & Mann, 2014; Ramachandran et al., 2018; Springenberg et al., 2016; Poloczek et al., 2017; Wistuba et al., 2018; Feurer et al., 2018; Perrone et al., 2018; 2019; Salinas et al., 2019). A critical difference between these multi-source HPO settings and our MSU-HPO setting is the existence of labels for the target task. Previous studies usually assume that analysts can utilize labeled target data. However, as discussed above, such data are often unavailable, and thus most of these methods are infeasible.

One possible solution to the unavailability of labeled target data is to use warm starting methods (Vanschoren, 2019), which aim to find good initial hyperparameters for the target task. Learning Initialization (LI) finds promising hyperparameters by minimizing the sum of loss functions surrogated by a Gaussian process on each source task (Wistuba et al., 2015). While LI is effective when the source and target tasks are quite similar, it is hard to achieve reasonable performance otherwise. In contrast, DistBO learns the similarity between the source and target tasks with a joint Gaussian process model over hyperparameters and data representations (Law et al., 2019). However, many transfer methods, including DistBO, need abundant hyperparameter evaluations on the source tasks to build a good surrogate of the objective function for each task, as will be confirmed in our experiments.
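To illustrate the warm-starting idea behind LI, the following hypothetical sketch scores each candidate configuration by its average loss across source tasks and takes the minimizer as the starting point for the target task. The per-task loss functions here are toy stand-ins, not the Gaussian-process surrogates of Wistuba et al. (2015).

```python
def li_warm_start(candidates, source_losses):
    # candidates: list of hyperparameter settings
    # source_losses: one loss function per source task (toy stand-ins
    # for the surrogate models fitted on each source task)
    def avg_source_loss(theta):
        return sum(loss(theta) for loss in source_losses) / len(source_losses)
    # The warm start is the candidate minimizing the average source loss.
    return min(candidates, key=avg_source_loss)

# Two toy source tasks whose optima sit at theta = 0.2 and theta = 0.4.
sources = [lambda t: (t - 0.2) ** 2, lambda t: (t - 0.4) ** 2]
theta0 = li_warm_start([i / 10 for i in range(11)], sources)
# The average loss is minimized at the midpoint, theta0 = 0.3.
```

When the target task resembles the sources, theta0 is a sensible initial point; when it does not, this average can be misleading, which is exactly the limitation of LI noted above.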
Another related field is model evaluation under covariate shift, whose objective is to evaluate the performance of ML models on the target task using only a single relevant source dataset (Sugiyama et al., 2007; You et al., 2019; Zhong et al., 2010). These studies build on the importance sampling (IS) method (Elvira et al., 2015; Sugiyama et al., 2007) to obtain an unbiased estimate of the ground-truth model performance. While our proposed methods are also based on IS, a major difference is that we assume multiple source datasets with different distributions. We demonstrate that in the multi-source setting the previous IS method can fail, and we propose an estimator satisfying an optimal variance property. Moreover, as these methods are specific to model evaluation, the connection between IS-based estimation techniques and automated HPO methods has not yet been explored despite its possible broad applications. Consequently, we are the first to empirically evaluate the combination of IS-based unbiased estimation and adaptive HPO.

Contributions. The contributions of this work can be summarized as follows: (i) we formulate a novel and highly practical HPO setting, MSU-HPO; (ii) we propose two unbiased estimators of the ground-truth validation score that are calculable with the available data, and demonstrate that one of them achieves the optimal finite variance among a reasonable class of unbiased estimators; (iii) we describe a flexible and computationally tractable HPO procedure building on the proposed estimators; (iv) we empirically demonstrate that the proposed procedure works favorably in the MSU-HPO setting. Furthermore, our empirical results suggest a new connection between adaptive HPO and IS-based unbiased estimation techniques.
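To make the IS-based model evaluation described above concrete, the following sketch estimates a target-task expectation from source-task samples alone by reweighting each sample with the density ratio w(x) = p_T(x)/p_S(x). The uniform distributions, the known ratio, and the loss are illustrative assumptions, not part of the paper's setup.

```python
import random

def is_estimate(source_samples, density_ratio, loss):
    # Importance sampling estimate of the target-task risk using only
    # source-task samples: E_target[loss] = E_source[w(x) * loss],
    # where w(x) = p_target(x) / p_source(x) is the density ratio.
    weighted = [density_ratio(x) * loss(x, y) for x, y in source_samples]
    return sum(weighted) / len(weighted)

# Toy example: source inputs follow U(0, 2), target inputs follow U(0, 1),
# so the density ratio is 2 on [0, 1) and 0 elsewhere.
rng = random.Random(0)
samples = [(x, x) for x in (rng.uniform(0.0, 2.0) for _ in range(100000))]
w = lambda x: 2.0 if x < 1.0 else 0.0
est = is_estimate(samples, w, lambda x, y: y)
# est approximates 0.5, the target-task mean, despite using only source samples
```

The estimate is unbiased whenever the source distribution covers the support of the target distribution; its variance, however, depends heavily on how well the two distributions overlap, which is what the multi-source estimators in this work exploit.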

2. PROBLEM SETTING

In this section, we formulate MSU-HPO. Let X ⊆ R^d be the d-dimensional input space and Y ⊆ R be the real-valued output space. We use p_T(x, y) to denote the joint probability density function of the input and output variables X ∈ X and Y ∈ Y of the target task. The objective of this work is to find the best set of hyperparameters θ with respect to the target distribution: θ_opt = argmin_{θ ∈ Θ} f_T(θ)
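A minimal sketch of the overall procedure under these definitions: approximate the target objective f_T(θ) from labeled source data via importance weights, then take the arg min over a candidate grid. The two source tasks, the density ratios, and the squared prediction loss below are illustrative assumptions, not the paper's actual estimators.

```python
import random

rng = random.Random(1)

# Two labeled source datasets with different input distributions; the
# target relation is y = 0.5 * x and target inputs follow U(0, 1).
src1 = [(x, 0.5 * x) for x in (rng.uniform(0.0, 1.0) for _ in range(500))]
src2 = [(x, 0.5 * x) for x in (rng.uniform(0.0, 2.0) for _ in range(500))]
ratios = [lambda x: 1.0,                       # U(0,1) already matches the target
          lambda x: 2.0 if x < 1.0 else 0.0]   # density ratio for U(0,2) -> U(0,1)

def estimate_f_T(theta, source_data, source_ratios):
    # Importance-weighted approximation of the target objective f_T(theta):
    # each source sample's loss is reweighted by p_T(x) / p_S(x).
    total, count = 0.0, 0
    for samples, w in zip(source_data, source_ratios):
        for x, y in samples:
            total += w(x) * (theta * x - y) ** 2  # squared prediction loss
            count += 1
    return total / count

grid = [i / 10 for i in range(11)]
best = min(grid, key=lambda th: estimate_f_T(th, [src1, src2], ratios))
# best == 0.5, the true minimizer of f_T for this toy target task
```

In practice the candidate grid would be replaced by an adaptive black-box optimizer such as BO or the tree-structured Parzen estimator, with estimate_f_T standing in for the unavailable ground-truth validation score.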

