OPTIMIZING LARGE-SCALE HYPERPARAMETERS VIA AUTOMATED LEARNING ALGORITHM

Abstract

Modern machine learning algorithms usually involve tuning multiple hyperparameters (from one to thousands), which play a pivotal role in model generalizability. Black-box optimization and gradient-based algorithms are the two dominant approaches to hyperparameter optimization, and they have largely distinct advantages. How to design a new hyperparameter optimization technique inheriting all the benefits of both approaches is still an open problem. To address this challenging problem, in this paper we propose a new hyperparameter optimization method with zeroth-order hyper-gradients (HOZOG). Specifically, we first exactly formulate hyperparameter optimization as an A-based constrained optimization problem, where A is a black-box optimization algorithm (such as the training procedure of a deep neural network). Then, we use the average zeroth-order hyper-gradients to update the hyperparameters. We provide a feasibility analysis of using HOZOG to achieve hyperparameter optimization. Finally, the experimental results on three representative hyperparameter optimization tasks (with hyperparameter sizes from 1 to 1,250) demonstrate the benefits of HOZOG in terms of simplicity, scalability, flexibility, effectiveness and efficiency compared with state-of-the-art hyperparameter optimization methods.

1. INTRODUCTION

Modern machine learning algorithms usually involve tuning multiple hyperparameters, whose number can range from one to thousands. For example, support vector machines (Vapnik, 2013) have a regularization parameter and kernel hyperparameters, and deep neural networks (Krizhevsky et al., 2012) have optimization hyperparameters (e.g., learning rate schedules and momentum) and regularization hyperparameters (e.g., weight decay and dropout rates). The performance of the most prominent algorithms strongly depends on an appropriate setting of these hyperparameters. Traditionally, hyperparameter tuning is treated as the following bi-level optimization problem:

    min_{λ∈R^p} f(λ) = E(w(λ), λ),  s.t.  w(λ) ∈ argmin_{w∈R^d} L(w, λ),    (1)

where w ∈ R^d are the model parameters, λ ∈ R^p are the hyperparameters, the outer objective E[1] represents a proxy of the generalization error w.r.t. the hyperparameters, the inner objective L represents a traditional learning problem (such as a regularized empirical risk minimization problem), and w(λ) are the optimal model parameters of the inner objective L for fixed hyperparameters λ. Note that the number of hyperparameters is normally much smaller than the number of model parameters (i.e., p ≪ d). Choosing appropriate values of the hyperparameters is extremely computationally challenging due to the nested structure of the optimization problem. At the same time, both researchers and practitioners desire hyperparameter optimization methods that are as effective, efficient, scalable, simple and flexible[2] as possible. Classic techniques such as grid search (Gu & Ling, 2015) and random search (Bergstra & Bengio, 2012) have very restricted applicability in modern hyperparameter optimization tasks, because they can only manage a very small number of hyperparameters and cannot guarantee convergence to a solution of (1).
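The bi-level structure of (1) can be made concrete with a toy instance. The sketch below (an illustration, not part of the paper's experiments) takes ridge regression as the inner problem L, so w(λ) has a closed form, and uses validation mean squared error as the outer objective E:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data; the inner problem is ridge regression, so w(lam) is available in closed form.
X_tr, y_tr = rng.normal(size=(50, 5)), rng.normal(size=50)
X_val, y_val = rng.normal(size=(20, 5)), rng.normal(size=20)

def inner_solution(lam):
    """w(lam) = argmin_w L(w, lam) with L = ||X_tr w - y_tr||^2 + lam * ||w||^2."""
    d = X_tr.shape[1]
    return np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(d), X_tr.T @ y_tr)

def f(lam):
    """Outer objective E(w(lam), lam): mean squared error on validation samples."""
    w = inner_solution(lam)
    return np.mean((X_val @ w - y_val) ** 2)

# Each evaluation of f hides a full inner optimization -- the "black-box" view of (1).
print(f(0.1), f(10.0))
```

For general learning problems the inner argmin has no closed form, which is exactly why each evaluation of f is expensive and the nested structure is computationally challenging.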
Table 1 clearly shows that the black-box optimization and gradient-based approaches have largely distinct advantages: the black-box optimization approach is simple, flexible and scalable in terms of model parameters, while the gradient-based approach is effective, efficient and scalable in terms of hyperparameters. Each property of E2S2F is an important criterion for a successful hyperparameter optimization method. To the best of our knowledge, there is still no algorithm satisfying all five properties simultaneously. Designing a hyperparameter optimization method that has the benefits of both approaches is still an open problem. To address this challenging problem, in this paper we propose a new hyperparameter optimization method with zeroth-order hyper-gradients (HOZOG). Specifically, we first exactly formulate hyperparameter optimization as an A-based constrained optimization problem, where A is a black-box optimization algorithm (such as the training procedure of a deep neural network). Then, we use the average zeroth-order hyper-gradients to update the hyperparameters. We provide a feasibility analysis of using HOZOG to achieve hyperparameter optimization. Finally, the experimental results on various hyperparameter optimization problems (with hyperparameter sizes from 1 to 1,250) demonstrate the benefits of HOZOG in terms of E2S2F compared with state-of-the-art hyperparameter optimization methods.
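The core update above can be sketched as an averaged two-point zeroth-order gradient estimator. The following is a minimal illustration, not the authors' implementation: it applies the estimator to a smooth stand-in objective (in real use, f would wrap a full training run of the black-box algorithm A), and the smoothing radius `mu`, the number of directions `q` and the step size are hypothetical choices:

```python
import numpy as np

rng = np.random.default_rng(1)

def zo_hypergradient(f, lam, mu=1e-2, q=10):
    """Average zeroth-order estimate of the hyper-gradient of f at lam.

    Averages the two-point estimator (p / mu) * (f(lam + mu*u) - f(lam)) * u
    over q random unit directions u. Only function values of f are needed,
    so f can be an arbitrary black box.
    """
    p = lam.size
    base = f(lam)
    g = np.zeros(p)
    for _ in range(q):
        u = rng.normal(size=p)
        u /= np.linalg.norm(u)          # uniform direction on the unit sphere
        g += (p / mu) * (f(lam + mu * u) - base) * u
    return g / q

# Smooth stand-in for the bilevel objective f(lam), minimized at lam = (1, 1, 1).
quad = lambda lam: np.sum((lam - 1.0) ** 2)

lam = np.zeros(3)
for _ in range(200):
    lam -= 0.05 * zo_hypergradient(quad, lam)
# lam approaches the minimizer without any first-order information about quad.
```

Because the estimator touches f only through function values, it inherits the simplicity and flexibility of black-box methods, while the hyperparameter update itself proceeds like gradient descent.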

2. HYPERPARAMETER OPTIMIZATION BASED ON ZEROTH-ORDER HYPER-GRADIENTS

In this section, we first give a brief review of black-box optimization and gradient-based algorithms, and then provide our HOZOG algorithm. Finally, we provide the feasibility analysis of HOZOG.

2.1. BRIEF REVIEW OF BLACK-BOX OPTIMIZATION AND GRADIENT-BASED ALGORITHMS

Black-box optimization algorithms: Black-box optimization algorithms view the bi-level objective f as a black-box function. Existing black-box optimization methods (Snoek et al., 2012; Falkner et al., 2018) mainly employ Bayesian optimization (Brochu et al., 2010) to solve (1). The black-box optimization approach has good simplicity and flexibility. However, many references have pointed out that it can only handle from a few to several dozen hyperparameters (Falkner et al., 2018), while the number of hyperparameters in real hyperparameter optimization problems can range from hundreds to thousands. Thus, the black-box optimization approach has weak scalability in terms of the number of hyperparameters.

Gradient-based algorithms: Existing gradient-based algorithms can be divided into two categories (i.e., inexact gradients and exact gradients). The inexact-gradient approach first solves the inner problem approximately, and then estimates the gradient of (1) based on the approximate solution via implicit differentiation (Pedregosa, 2016). Because implicit differentiation involves Hessian matrices of sizes d × d and d × p, where p ≪ d, these methods have poor scalability.
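The implicit-differentiation route can be illustrated on the ridge-regression instance, where everything is computable in closed form. This is a hand-rolled sketch under those assumptions, not code from Pedregosa (2016): the stationarity condition grad_w L(w(λ), λ) = 0 is differentiated w.r.t. λ to obtain dw/dλ, which is then chained with grad_w E.

```python
import numpy as np

rng = np.random.default_rng(2)
X_tr, y_tr = rng.normal(size=(50, 5)), rng.normal(size=50)
X_val, y_val = rng.normal(size=(20, 5)), rng.normal(size=20)

def w_of(lam):
    """Exact inner solution of L(w, lam) = ||X_tr w - y_tr||^2 + lam * ||w||^2."""
    d = X_tr.shape[1]
    return np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(d), X_tr.T @ y_tr)

def f(lam):
    """Outer objective: validation mean squared error at w(lam)."""
    w = w_of(lam)
    return np.mean((X_val @ w - y_val) ** 2)

def implicit_hypergradient(lam):
    """df/dlam via the implicit function theorem.

    Differentiating grad_w L(w(lam), lam) = 0 gives
        dw/dlam = -(X_tr^T X_tr + lam I)^{-1} w(lam),
    and, since E has no direct lam dependence,
        df/dlam = grad_w E . dw/dlam.
    Note the inner Hessian solve: for general models this is a d x d system,
    which is the source of the scalability problem discussed above.
    """
    d = X_tr.shape[1]
    w = w_of(lam)
    grad_w_E = (2.0 / len(y_val)) * X_val.T @ (X_val @ w - y_val)
    dw_dlam = -np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(d), w)
    return grad_w_E @ dw_dlam

# Sanity check against a central finite difference of the black-box f.
lam, eps = 0.5, 1e-5
fd = (f(lam + eps) - f(lam - eps)) / (2 * eps)
assert abs(implicit_hypergradient(lam) - fd) < 1e-6
```

For this p = 1 example the Hessian solve is cheap, but with d in the millions and p in the hundreds the required d × d and d × p systems become the bottleneck, which motivates avoiding them entirely as HOZOG does.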



[1] The choice of the objective function E depends on the specified task. For example, accuracy, AUC or F1 score can be used for binary classification problems, and squared error loss or absolute error loss can be used as the objective E for regression problems on validation samples.
[2] "Effective": good generalization performance. "Efficient": running fast. "Scalable": scalable in terms of the sizes of hyperparameters and model parameters. "Simple": easy to implement. "Flexible": flexible with respect to various learning algorithms.



Table 1: Representative black-box optimization and gradient-based hyperparameter optimization algorithms. ("BB" and "G" abbreviate black-box and gradient respectively; "♣" denotes that the property holds only for a small number of hyperparameters or a medium-sized training set; "Scalable-H" and "Scalable-P" denote scalability in terms of hyperparameters and model parameters respectively.)

For modern hyperparameter tuning tasks, black-box optimization (Snoek et al., 2012; Falkner et al., 2018) and gradient-based algorithms (Maclaurin et al., 2015; Franceschi et al., 2018; 2017) are currently the dominant approaches, due to their advantages in terms of effectiveness, efficiency, scalability, simplicity and flexibility, which are abbreviated as E2S2F in this paper. We provide a brief review of representative black-box optimization and gradient-based hyperparameter optimization algorithms in §2.1, and a detailed comparison of them in terms of the above properties in Table 1.

