FEDHPO-BENCH: A BENCHMARK SUITE FOR FEDERATED HYPERPARAMETER OPTIMIZATION

Abstract

Hyperparameter optimization (HPO) is crucial for machine learning algorithms to achieve satisfactory performance. Its research progress has been boosted by existing HPO benchmarks. Nonetheless, existing benchmarking efforts all focus on HPO for traditional centralized learning and ignore federated learning (FL), a promising paradigm for collaboratively learning models from dispersed data. In this paper, we first identify several unique characteristics of HPO for FL algorithms. Due to this uniqueness, existing HPO benchmarks no longer satisfy the need to compare HPO methods in the FL setting. To facilitate research on HPO in the FL setting, we propose and implement a benchmark suite, FEDHPO-BENCH, that incorporates comprehensive FedHPO problems, enables flexible customization of the function evaluations, and eases continual extension. We also conduct extensive experiments based on FEDHPO-BENCH to provide the community with more insights into FedHPO. FEDHPO-BENCH is open-sourced at https://github.com/FedHPO-Bench/FedHPO-Bench-ICLR23.

1. INTRODUCTION

Most machine learning (ML) algorithms expose many design choices, which can drastically impact the ultimate performance. Hyperparameter optimization (HPO) (Feurer & Hutter, 2019) aims at making the right choices without human intervention. To this end, HPO methods usually attempt to solve min_{λ∈Λ_1×···×Λ_K} f(λ), where each Λ_k corresponds to the candidate choices of a specific hyperparameter, e.g., taking the learning rate from Λ_1 = [0.01, 1.0] and the batch size from Λ_2 = {16, 32, 64}. For each specified λ, f(λ) is the output result (e.g., validation loss) of executing the considered algorithm configured by λ. A solution λ* found for such a problem is expected to lead the considered algorithm to superior generalization performance. Research in this line has been facilitated by HPO benchmarks (Gijsbers et al., 2019; Eggensperger et al., 2021; Pineda-Arango et al., 2021), which prepare many HPO problems so that different HPO methods can be effortlessly compared, encouraging fair, reliable, and reproducible empirical studies. However, existing HPO benchmarks all focus on traditional learning paradigms, where the functions to be optimized correspond to centralized learning tasks. Federated learning (FL) (McMahan et al., 2017; Li et al., 2020a), a privacy-preserving paradigm for collaboratively learning a model from distributed data, has not been considered. Along with the increasing privacy concerns of society as a whole, FL has been gaining more attention from academia and industry. Meanwhile, HPO for FL algorithms (denoted by FedHPO from now on) has been identified as a critical and promising open problem in FL (Kairouz et al., 2019). In this paper, we first elaborate on several differences between FedHPO and traditional HPO (see Section 2.2), which essentially come from FL's distributed nature and the heterogeneity among FL's participants.
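To make the formulation concrete, the following is a minimal sketch of an HPO method (plain random search, used here only as the simplest instance) minimizing f(λ) over a mixed search space like the learning-rate/batch-size example above. The space encoding and function names are our own illustration, not any benchmark's API.

```python
import random

# Hypothetical search-space encoding mirroring the example in the text:
# a continuous learning rate and a discrete batch size.
SEARCH_SPACE = {
    "lr": ("continuous", 0.01, 1.0),
    "batch_size": ("discrete", [16, 32, 64]),
}

def sample_config(space):
    """Draw one configuration lambda from the Cartesian product of domains."""
    config = {}
    for name, spec in space.items():
        if spec[0] == "continuous":
            config[name] = random.uniform(spec[1], spec[2])
        else:
            config[name] = random.choice(spec[1])
    return config

def random_search(objective, space, n_trials=20, seed=0):
    """Minimize f(lambda) by a series of independent function evaluations."""
    random.seed(seed)
    best_config, best_value = None, float("inf")
    for _ in range(n_trials):
        config = sample_config(space)
        value = objective(config)  # one function evaluation f(lambda)
        if value < best_value:
            best_config, best_value = config, value
    return best_config, best_value
```

In a real HPO pipeline, `objective` would train and validate the configured algorithm; here it can be any callable that maps a configuration to a scalar loss.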
These differences make existing HPO benchmarks inappropriate for studying FedHPO and, in particular, unusable for comparing FedHPO methods. Consequently, several recently proposed FedHPO methods (Zhou et al., 2021; Dai et al., 2020; Khodak et al., 2021; Zhang et al., 2021; Guo et al., 2022) have been evaluated on their respective problems and have been neither uniformly implemented in one FL framework nor well benchmarked. Motivated by FedHPO's uniqueness and the successes of existing HPO benchmarks, we propose and implement FEDHPO-BENCH, a dedicated benchmark suite, to facilitate the research and application of FedHPO. FEDHPO-BENCH satisfies the following desiderata:

Comprehensiveness. FL tasks are diverse in terms of domain, model architecture, heterogeneity among participants, etc. The objective functions of their corresponding FedHPO problems are thus likely to be diverse. Hence, FEDHPO-BENCH provides a comprehensive collection of FedHPO problems for drawing unbiased conclusions from comparisons of HPO methods.

Flexibility. Users may have different levels of privacy and fairness concerns, which may correspond to different multi-objective optimization problems. Meanwhile, the execution time of a function evaluation depends on the system condition. Thus, FEDHPO-BENCH allows users to flexibly tailor FedHPO problems to their privacy protection needs, fairness demands, and system conditions.

Extensibility. As a developing field, new FedHPO problems and novel FedHPO methods constantly emerge, and FL's best practice continuously evolves. Thus, we build FEDHPO-BENCH on a popular FL framework, FederatedScope (FS) (Xie et al., 2022), and make it more of a benchmarking tool that can effortlessly incorporate novel ingredients.

To our knowledge, FEDHPO-BENCH is the first FedHPO benchmark. We conduct extensive empirical studies with it to validate its usability and attain more insights into FedHPO.

2. BACKGROUND AND MOTIVATIONS

We first give a brief introduction to the settings of HPO and its related benchmarks. Then we present and explain the uniqueness of FedHPO to show the demand for dedicated FedHPO benchmarks. Due to the space limitation, more discussions about related works are deferred to Appendix B.

2.1. PROBLEM SETTINGS AND EXISTING BENCHMARKS

In the literature (Feurer & Hutter, 2019), HPO is often formulated as solving min_{λ∈Λ_1×···×Λ_K} f(λ), where each Λ_k corresponds to candidate choices of a specific hyperparameter, and their Cartesian product (denoted by ×) constitutes the search space. In practice, such a Λ_k is often bounded and can be continuous (e.g., an interval of real numbers) or discrete (e.g., a set of categories/integers). Each function evaluation at a specified hyperparameter configuration λ means executing the corresponding algorithm accordingly and returning the value of the considered metric (e.g., validation loss) as the result f(λ). HPO methods generally solve such a problem with a series of function evaluations. As a full-fidelity function evaluation is extremely costly, multi-fidelity methods exploit low-fidelity function evaluations, e.g., training for fewer epochs (Swersky et al., 2014; Domhan et al., 2015) or on a subset of data (Klein et al., 2017; Petrak, 2000; Swersky et al., 2013), to approximate the exact result. Thus, it is convenient to treat f as f(λ, b), λ ∈ Λ_1×···×Λ_K, b ∈ B_1×···×B_L, where each B_l corresponds to the possible choices of a specific fidelity dimension. HPO benchmarks (Gijsbers et al., 2019; Eggensperger et al., 2021; Pineda-Arango et al., 2021) have prepared many HPO problems, i.e., various kinds of objective functions, for comparing HPO methods. To evaluate these functions, HPO benchmarks, e.g., HPOBench (Eggensperger et al., 2021), often provide three modes: (1) "Raw" means truly executing the corresponding algorithm; (2) "Tabular" means querying a lookup table, where each entry corresponds to a specific f(λ, b); (3) "Surrogate" means querying a surrogate model that might be trained on the tabular data.
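The "tabular" mode described above can be sketched as follows. This is a minimal illustration in our own naming (it does not reproduce the HPOBench API): f(λ, b) is answered by a lookup table keyed jointly by the hyperparameter configuration and the fidelity.

```python
# Minimal sketch of a "tabular" benchmark: every f(lambda, b) has been
# pre-computed offline, so a function evaluation reduces to a dictionary lookup.
class TabularBenchmark:
    def __init__(self, table):
        # table maps (config_key, fidelity_key) -> recorded metric value
        self.table = table

    @staticmethod
    def _key(d):
        """Turn a dict into a hashable, order-independent key."""
        return tuple(sorted(d.items()))

    def evaluate(self, config, fidelity):
        """Return the pre-recorded f(lambda, b) for this (config, fidelity) pair."""
        return self.table[(self._key(config), self._key(fidelity))]
```

A "surrogate" mode would replace the dictionary lookup with a prediction from a model fitted on these table entries, allowing queries at configurations absent from the table.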

2.2. UNIQUENESS OF FEDERATED HYPERPARAMETER OPTIMIZATION

Generally, traditional HPO methods are applicable to FedHPO problems¹, where, in each trial, the value f(λ, b) is evaluated; that is to say, an accordingly configured FL training course is conducted, as the dashed black box in Figure 1 illustrates. Conceptually, there are N clients, each of which has its specific data, and a server coordinates them to learn a model θ collaboratively by an FL algorithm such as FedAvg (McMahan et al., 2017) or FedOpt (Asad et al., 2020). Such FL algorithms are iterative. In the t-th round, the server broadcasts the global model θ^(t) to sampled clients; then, these clients make local updates and send the updates back; finally, the server aggregates the updates to produce θ^(t+1). After executing the FL algorithm configured by λ for a number of such rounds, e.g., #rounds = T according to the specified fidelity b, the performance, e.g., the best validation loss ever achieved during this FL course, is returned as f(λ, b).
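The round structure described above can be sketched schematically as follows. This is an illustrative skeleton, not FederatedScope's implementation: the client data, the local update rule, and the aggregation (plain averaging, as in FedAvg) are placeholders we chose for simplicity.

```python
import random

# Schematic sketch of one FL training course with the broadcast / local-update /
# aggregate round structure. A full fidelity b would fix n_rounds = T.
def run_fl_course(clients, local_update, n_rounds, init_theta, sample_frac=1.0, seed=0):
    random.seed(seed)
    theta = init_theta                                # global model theta^(0)
    for _ in range(n_rounds):
        k = max(1, int(sample_frac * len(clients)))
        sampled = random.sample(range(len(clients)), k)
        # Broadcast theta^(t); each sampled client computes a local update
        # on its own data and sends the result back.
        updates = [local_update(theta, clients[i]) for i in sampled]
        # The server aggregates (here: plain averaging) into theta^(t+1).
        theta = sum(updates) / len(updates)
    return theta
```

In a FedHPO trial, a wrapper around such a course would apply the configuration λ (e.g., client learning rate) inside `local_update`, track validation performance per round, and return the best value as f(λ, b).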



¹ Despite the various scenarios in the literature, we restrict our discussion about FedHPO to one of the most general FL scenarios, which has been adopted in existing FedHPO works (Khodak et al., 2021; Zhang et al., 2021).

