Published as a conference paper at ICLR 2023

SINGLE-SHOT GENERAL HYPER-PARAMETER OPTIMIZATION FOR FEDERATED LEARNING

Abstract

We address the problem of hyper-parameter optimization (HPO) for federated learning (FL-HPO). We introduce Federated Loss SuRface Aggregation (FLoRA), a general FL-HPO solution framework that can address use cases of tabular data and any Machine Learning (ML) model, including gradient boosting training algorithms, SVMs and neural networks, among others, thereby further expanding the scope of FL-HPO. FLoRA enables single-shot FL-HPO: identifying a single set of good hyper-parameters that are subsequently used in a single FL training. Thus, it enables FL-HPO solutions with minimal additional communication overhead compared to FL training without HPO. Utilizing standard smoothness assumptions, we theoretically characterize the optimality gap of FLoRA for convex and non-convex loss functions, which explicitly accounts for the heterogeneous nature of the parties' local data distributions, a dominant characteristic of FL systems. Our empirical evaluation of FLoRA for multiple FL algorithms on seven OpenML datasets demonstrates significant model accuracy improvements over the baselines and robustness to an increasing number of parties involved in FL-HPO training.

1. INTRODUCTION

Traditional machine learning (ML) approaches require training data to be gathered at a central location where the learning algorithm runs. In real-world scenarios, however, training data is often subject to privacy or regulatory constraints restricting the way data can be shared, used and transmitted. Examples of such regulations include the European General Data Protection Regulation (GDPR), California Consumer Privacy Act (CCPA), Cybersecurity Law of China (CLA) and HIPAA, among others. Federated learning (FL), first proposed in McMahan et al. (2017b), has recently become a popular approach to address privacy concerns by allowing collaborative training of ML models among multiple parties, where each party can keep its data private.

FL-HPO problem. Despite the privacy protection FL brings along, there are many open problems in the FL domain, one of which is hyper-parameter optimization for FL, or FL-HPO (Kairouz et al., 2019; Khodak et al., 2021). Existing FL systems require a user (or all participating parties) to pre-set (agree on) multiple hyper-parameters (HPs) (i) for the model being trained (such as the number of layers for neural networks or the tree depth and number of trees in tree ensembles), (ii) for the FL algorithms, and (iii) for aggregation (if such hyper-parameters exist). Hyper-parameter optimization (HPO) for FL is important because the choice of HPs can have a dramatic impact on model performance, much like in traditional centralized ML (McMahan et al., 2017b). While HPO has been widely studied in the centralized ML setting (Hutter et al., 2019), it comes with unique challenges in the FL setting. First, existing HPO techniques often make use of the entire dataset, which is not available centrally in FL.
Second, they need to train many models for a large number of HP configurations, which is prohibitively expensive in terms of communication and training time in FL settings; training a single model already has a high communication overhead (Kairouz et al., 2019). Third, one important challenge that has not been adequately explored in the FL-HPO literature is support for tabular data, which is widely used in enterprise settings, such as financial services and other traditional industries that prefer traditional models with some explainability (Ludwig et al., 2020).

Although a few approaches have been recently proposed for FL-HPO, they focus on handling HPO using personalization techniques (Khodak et al., 2021) and neural networks (Khodak et al., 2020). To the best of our knowledge, there is no FL-HPO approach to train non-neural-network models, such as gradient boosted decision trees (Friedman, 2001) (e.g., XGBoost (Chen & Guestrin, 2016)), that are particularly common in the enterprise setting, even though there are existing FL algorithms for such models (Li et al., 2020; Ong et al., 2020). This leads to our motivating question: Can we develop an FL-HPO scheme that performs HPO for any ML model in an FL environment without significantly increasing the already-high communication overhead of FL?

In this paper, we address the aforementioned challenges of FL-HPO and our motivating question. We focus on the problem where the model HPs are shared by all parties, and we seek a set of HPs and train a single model that is used by all parties. Our motivating question leads to four further requirements that make the problem challenging:
(C1) To perform FL-HPO with any ML model, we cannot make any assumption that two models with different HPs can perform some "weight-sharing", allowing our solution to be applied beyond fixed-architecture neural networks.
(C2) To be general across ML models, we do not assume the ability to perform "multi-fidelity" HPO to reduce the communication overhead of FL-HPO (see discussion in Appendix A.1).
(C3) To avoid increasing the FL communication overhead, we seek to perform "single-shot" FL-HPO, where we can perform FL-HPO while requiring only a single FL model training.
(C4) To be applicable to FL with data heterogeneity, we cannot assume that parties have independent and identically distributed (IID) data.

Contributions. Given the above FL-HPO problem setting, we make the following contributions:
• (§2) We present a novel framework, Federated Loss SuRface Aggregation (FLoRA), that leverages meta-learning techniques enabling asynchronous local HPOs on each party to perform single-shot HPO for the global FL-HPO problem.
• (§2.3) We provide theoretical guarantees for the set of HPs selected by FLoRA, covering both IID and non-IID cases regardless of the convexity of the loss function. To the best of our knowledge, this is the first rigorous theoretical analysis of the FL-HPO problem and also the first optimality gap constructed in terms of the estimated loss given a target distribution.
• (§3) We evaluate FLoRA on the FL-HPO of histogram-based gradient boosted decision trees (HGB), Support Vector Machines (SVM) and multi-layered perceptrons (MLP) on seven classification datasets from OpenML (Vanschoren et al., 2013), highlighting (i) its performance relative to baselines, and (ii) the effect of data heterogeneity.

Figure 1: Communication overhead savings of FLoRA compared to "multi-shot" FL-HPO for the same level of performance. We consider two FL-HPO problems on the Electricity dataset (with HGB and MLP). We use the relative regret (defined in §3) of each scheme as the performance metric (lower is better), where a regret of 1 denotes the performance of the single-shot baseline while a regret of 0 implies optimal performance.
FLoRA and the single-shot baseline require a single federated model training.

In Figure 1, we present a snapshot of our empirical results, which highlights the communication overhead reduction we achieve from FLoRA while producing higher quality models. As a baseline, we directly adopt an existing centralized HPO scheme that requires federated training of multiple models and term this a "multi-shot" FL-HPO baseline. The figure also shows a "single-shot" baseline that uses curated HPs (described in §3); FLoRA is also single-shot. Figure 1 shows that the "multi-shot" approach requires a significantly larger number of FL model trainings (39 for MLP and 24 for HGB), and hence more communication, to find an HP that matches the performance of the HP found by FLoRA, highlighting the efficiency and effectiveness of FLoRA. This result demonstrates that FLoRA is an FL-HPO scheme that works with any ML model (HGB, MLP, etc.), providing competitive performance without significantly increasing the communication overhead of FL, requiring only a single FL model training.

1.1. RELATED WORK

Performance optimization of FL systems. One of the main challenges in FL is achieving high accuracy with low communication overhead. FedAvg (McMahan et al., 2017a) is a predominant FL algorithm and several optimization schemes build on it. Initially, communication optimizations included performing multiple stochastic gradient descent (SGD) local iterations at the clients and randomly selecting a small subset of the clients to compute and send updates to the server. Subsequently, compression techniques were used to minimize the size of model updates to the server. The accuracy and communication overhead of these techniques are sensitive to their HPs (McMahan et al., 2017a).

Need for FL-HPO of tabular data models. As most existing FL-HPO approaches focus on SGD-based algorithms and neural networks, one major limitation they share is that they do not support tree-based models, such as gradient boosted trees (Friedman, 2001), a popular model for enterprise settings. These models provide explainability for predictions, which is required for financial and healthcare FL use-cases. As laid out in a policy paper by the OECD, numerous regulations of member countries govern the use of analytics and data (OECD, 2021): GDPR, for example, requires decision-making models for financial services and insurance to be explainable, which is mostly achieved using traditional models such as decision tree variants (Goodman & Flaxman, 2017). Outside consumer finance, governance rules require explainability of portfolio and risk management for auditing purposes (Gensler & Bailey, 2020). Again, DNNs (deep neural networks) are not satisfactory from a current regulatory point of view and, thus, the financial services and insurance sectors rely on more explainable models (such as tree-based ones), also in federation.

This paper. Our framework improves on the above approaches in several ways, summarized also in Table 1. (1) It is more general, as it can tune multiple HPs and is applicable to non-SGD training settings such as gradient boosting trees. This is achieved by treating FL-HPO as a black-box HPO problem (as opposed to grey-box HPO, where we can leverage techniques such as weight-sharing and multi-fidelity), which has been addressed in the centralized HPO literature using grid search, random search (Bergstra & Bengio, 2012) and Bayesian Optimization approaches (Shahriari et al., 2016).
The key challenge is the requirement to perform computationally intensive evaluations of a large number of HP configurations, where each evaluation involves training a model and scoring it on a validation dataset.

2. FL-HPO WITH FLoRA

In centralized HPO, given a model class M with HPs θ ∈ Θ, a training algorithm A, a training dataset D and a holdout/validation dataset D̄, the problem is

min_{θ∈Θ} L(A(M, θ, D), D̄). (2.1)

In the FL setting with p parties, each party i ∈ [p] holds a private training dataset D_i, with D denoting the aggregated dataset. The HPs decompose into global HPs θ_G ∈ Θ_G shared by all parties and per-party local HPs θ_L^(i) ∈ Θ_L, i ∈ [p], with Θ = Θ_G × Θ_L. FL systems usually include an aggregator, which introduces an additional set of HPs φ ∈ Φ. Finally, we have an FL algorithm F : (M, φ, θ_G, {θ_L^(i)}_{i∈[p]}, A, D) → m ∈ M, which takes as input all the relevant HPs and per-party datasets and generates a model. In this case, the FL-HPO problem can be stated in the two following ways, depending on the desired goals:

(i) Ideally, for a global holdout/validation dataset D̄ (possibly from the same distribution as the aggregated dataset D), the target problem is

min_{φ∈Φ, θ_G∈Θ_G, θ_L^(i)∈Θ_L, i∈[p]} L( F(M, φ, θ_G, {θ_L^(i)}_{i∈[p]}, A, D), D̄ ). (2.3)

(ii) An alternative target problem would involve per-party holdout datasets D̄_i, i ∈ [p], as follows:

min_{φ∈Φ, θ_G∈Θ_G, θ_L^(i)∈Θ_L, i∈[p]} Agg( { L( F(M, φ, θ_G, {θ_L^(i)}_{i∈[p]}, A, D), D̄_i ), i ∈ [p] } ), (2.4)

where Agg : R^p → R is some aggregation function (such as the average or maximum) that scalarizes the p per-party predictive losses. Contrasting problem (2.1) to problems (2.3) & (2.4), we can see that FL-HPO is significantly more complicated than the centralized HPO problem. In the ensuing presentation, we focus on problem (2.3), although our proposed single-shot FL-HPO scheme can be applied and evaluated for problem (2.4).
We simplify the FL-HPO problem in the following ways: (i) we assume that there is no personalization, so there are no per-party local HPs θ_L^(i), i ∈ [p]; (ii) we only focus on the model class HPs θ_G, deferring HPO for the aggregator HPs φ to future work, as many of them are set based on the communication and computational resources available in the FL system and cannot be directly optimized with regard to some predictive performance metric; and (iii) we assume there is a global holdout/validation set D̄ which is only used to evaluate the final global model's performance but cannot be accessed during the HPO process; parties can only access their own private training (D_i) and validation (D̄_i) sets. Hence, for a fixed aggregator HP φ, the problem we study is

min_{θ_G∈Θ_G} L( F(M, φ, θ_G, A, D), D̄ ). (2.5)

This problem appears similar to the centralized HPO problem (2.1). However, note that the main challenges in (2.5) are (i) the need for a federated training for each set of HPs θ_G, and (ii) the need to evaluate the trained model on the global validation set D̄ (which is usually not available in the usual FL-HPO setting). Hence it is not practical (from a communication overhead and functional perspective) to apply existing off-the-shelf HPO schemes to problem (2.5). In the subsequent discussion, for simplicity, we use θ to denote the global HPs, dropping the "G" subscript.

2.1. LEVERAGING LOCAL HPOS

While it is possible, but extremely expensive, to apply off-the-shelf HPO solvers (such as Bayesian Optimization (BO) (Shahriari et al., 2016), Hyperopt (Bergstra et al., 2011), etc.) to problem (2.5), we wish to understand how we can leverage local and asynchronous HPOs in each of the parties. We begin with a simple but intuitive hypothesis underlying various meta-learning schemes for HPO (Vanschoren, 2018; Wistuba et al., 2018; Ram, 2022): if a HP configuration θ has good performance for all parties independently, then θ is a strong candidate for federated training.

Algorithm 1 FL-HPO with FLoRA
1: Input: Θ, M, A, F, {(D_i, D̄_i)}_{i∈[p]}, T
2: for each party P_i, i ∈ [p] do
3:   Run HPO to generate T (HP, loss) pairs
       E^(i) = {(θ_t^(i), L_t^(i)), t ∈ [T]},  (2.6)
     where θ_t^(i) ∈ Θ and L_t^(i) := L(A(M, θ_t^(i), D_i), D̄_i)
4: end for
5: Collect E = {E^(i), i ∈ [p]} at the aggregator
6: Generate the aggregated loss surface ℓ̂ : Θ → R from E
7: Recommend θ̂ ∈ arg min_{θ∈Θ} ℓ̂(θ)  (2.7)
8: Perform federated training with θ̂ to obtain the final model m̂ ∈ M

With this hypothesis, we present our proposed FLoRA in Algorithm 1. In this scheme, we allow each party to perform HPO locally and asynchronously with some adaptive HPO scheme such as BO (line 3). At each party i ∈ [p], we collect all the attempted T HPs θ_t^(i), t ∈ [T] = {1, 2, . . . , T}, and their corresponding predictive losses L_t^(i) into a set E^(i) (line 3, equation (2.6)). These per-party sets of (HP, loss) pairs E^(i) are then collected at the aggregator (line 5). This operation has at most O(pT) communication overhead (note that the number of HPs is usually much smaller than the number of columns or rows in the per-party datasets). These sets are then used to generate an aggregated loss surface ℓ̂ : Θ → R (line 6), which is used to make the final single-shot HP recommendation θ̂ ∈ Θ (line 7) for the federated training that creates the final model m̂ ∈ M (line 8). We discuss the generation of the aggregated loss surface in detail in §2.2. Before that, we briefly discuss the motivation behind some of our choices in Algorithm 1.

Remarks.
Using adaptive HPO schemes instead of non-adaptive schemes (such as random search or grid search) allows us to approximate the local loss surface more accurately (and with more certainty) in regions of the HP space where the local performance is favorable, instead of trying to approximate the loss surface well over the complete HP space. This has advantages both in terms of computational efficiency and loss surface approximation. Moreover, each party executes HPO asynchronously, without coordination with HPO results from other parties or with the aggregator. This is in line with our objective to minimize communication overhead. Although there could be strategies that involve coordination between parties, they could involve many rounds of communication. Our experimental results show that this approach is effective on the datasets we evaluated.
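The flow of Algorithm 1 can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: `local_loss` is a hypothetical stand-in for a party's train-and-validate step (line 3, with random search standing in for an adaptive HPO scheme), and `build_surface` stands in for the loss-surface construction of §2.2.

```python
import random

def flora_recommend(parties, hp_candidates, T, local_loss, build_surface):
    # Lines 2-4: each party independently evaluates T candidate HPs.
    evals = []
    for party in parties:
        tried = random.sample(hp_candidates, T)
        evals.append([(theta, local_loss(theta, party)) for theta in tried])
    # Lines 5-6: the aggregator pools the (HP, loss) pairs and builds
    # the aggregated loss surface l_hat : Theta -> R.
    l_hat = build_surface(evals)
    # Line 7: recommend the candidate minimizing l_hat; the single FL
    # training (line 8) would then use this HP.
    return min(hp_candidates, key=l_hat)
```

With `build_surface` returning, say, the per-party maximum (the MPLM surface of §2.2), the recommended HP is the one that does well for all parties simultaneously, matching the hypothesis above.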

2.2. LOSS SURFACE AGGREGATION

Given the sets of (HP, loss) pairs E^(i) = {(θ_t^(i), L_t^(i)), t ∈ [T]}, i ∈ [p], at the aggregator, we wish to construct a loss surface ℓ̂ : Θ → R that best emulates the (relative) performance loss ℓ(θ) we would observe when training the model on D. Based on our hypothesis, we want the loss surface to be such that ℓ̂(θ) is relatively low only if θ has a low loss for all parties simultaneously. However, because of the asynchronous and adaptive nature of the local HPOs, for any HP θ ∈ Θ, we would not have the corresponding losses from all the parties. For that reason, we model the loss surfaces using regressors that map any HP to its corresponding loss. We present four ways of constructing such loss surfaces, summarized in Table 2.

The most straightforward way to construct such a loss surface is to merge all the per-party sets E^(i) into a single set E = ∪_{i∈[p]} E^(i) and use it to train a regressor f : Θ → R (such as a Random Forest regressor (Breiman, 2001)), using the HPs θ as the covariates and the corresponding losses as the dependent variable. Then we can define the loss surface as this single global model, or SGM: ℓ̂(θ) := f(θ). However, this loss surface is extremely optimistic, assigning a low loss to an HP if it had a low loss estimate on any one of the parties, making it unsuitable in the presence of data heterogeneity. We can leverage uncertainty quantification u : Θ → R_+ around the regressor predictions to get a loss surface ℓ̂(θ) := f(θ) + α u(θ) for some α > 0, the single global model with uncertainty, or SGM+U. This improves the robustness of SGM by penalizing parts of the HP space that were not well explored by all parties' local HPOs.

Table 2: Candidate loss surfaces. Here f : Θ → R is a regressor trained on the merged set E, u quantifies the uncertainty around the predictions of f, α > 0 is a constant, and f_i : Θ → R for any i ∈ [p] is the per-party loss surface generated using party i's loss pairs E^(i).

Surface   ℓ̂(θ) :=                   Optimism   Handles non-IID
SGM       f(θ)                       High       No
SGM+U     f(θ) + α · u(θ)            Medium     Partial
MPLM      max_{i∈[p]} f_i(θ)         Low        Yes
APLM      (1/p) Σ_{i∈[p]} f_i(θ)     Medium     Partial

Instead of merging the per-party sets E^(i), we can also use them to train per-party local regressor models f_i : Θ → R and use their ensemble as the loss surface. One way is to use the average of the per-party local models, or APLM, as the loss surface: ℓ̂(θ) := (1/p) Σ_{i∈[p]} f_i(θ). This is less optimistic than SGM and provides some level of robustness in the presence of non-IID heterogeneous per-party distributions, since it assigns a low loss to an HP only if its average across all per-party regressors is low, which implies that most parties observed a relatively low loss around this HP. An even more robust loss surface is the maximum of the per-party local models, or MPLM: ℓ̂(θ) := max_{i∈[p]} f_i(θ), which assigns a low loss to an HP only if it has a low loss estimate across all parties, making it extremely capable of handling data heterogeneity (as we also highlight in our empirical evaluations). We discuss these loss surfaces in detail in Appendix B. In §2.3, we theoretically quantify the performance guarantees for MPLM and APLM, and in §3, we empirically evaluate all these loss surfaces.
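The four surfaces in Table 2 can be sketched with scikit-learn random-forest regressors, which the text mentions as one possible choice of regressor. This is an illustrative sketch, not the paper's code: HPs are assumed to be encoded as numeric vectors, and using the per-tree standard deviation as the uncertainty u(θ) in SGM+U is our assumption, not a choice made in the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fit_surfaces(per_party_pairs):
    """per_party_pairs: one list of (theta, loss) pairs per party."""
    # SGM: a single regressor f trained on the merged pairs E.
    X = np.vstack([th for pairs in per_party_pairs for th, _ in pairs])
    y = np.array([l for pairs in per_party_pairs for _, l in pairs])
    f = RandomForestRegressor(random_state=0).fit(X, y)
    # Per-party regressors f_i for APLM / MPLM.
    fs = [RandomForestRegressor(random_state=0).fit(
              np.vstack([th for th, _ in pairs]),
              np.array([l for _, l in pairs]))
          for pairs in per_party_pairs]

    def sgm(theta):
        return f.predict([theta])[0]

    def sgm_u(theta, alpha=1.0):
        # u(theta) taken as the std of the per-tree predictions (assumption).
        preds = [t.predict([theta])[0] for t in f.estimators_]
        return np.mean(preds) + alpha * np.std(preds)

    def aplm(theta):
        return np.mean([fi.predict([theta])[0] for fi in fs])

    def mplm(theta):
        return max(fi.predict([theta])[0] for fi in fs)

    return sgm, sgm_u, aplm, mplm
```

Note that MPLM is always at least as pessimistic as APLM at every θ, since the maximum of the per-party predictions dominates their mean.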

2.3. OPTIMALITY ANALYSIS

We now rigorously analyze the sub-optimality of the HP selected by FLoRA. We are interested in providing a bound for the following optimality gap:

G := ℓ̃(θ̂, D̄) − ℓ̃(θ*, D̄), where θ* ∈ arg min_{θ∈Θ} ℓ̃(θ, D̄). (2.8)

Here ℓ̃(θ, D̄) is an estimate of the true loss ℓ(θ, D) := E_{(x,y)∼D} L(A(θ, D), (x, y)) (see Definition C.1) given some validation set D̄ sampled from D; this is the model performance metric at evaluation and/or inference time. Recall that the HP θ̂ selected by FLoRA is defined as in (2.7), and θ* denotes the optimal HP given by ℓ̃ for a desired data distribution D we want to learn. We present our main results in Theorem 2.1. Informally, we show how to bound the optimality gap by picking the 'worst-case' HP setting that maximizes the combination of the Wasserstein distances between the local data distributions and the actual quality of the local HPO approximation across parties. The more precise theorem statement and its proof, with a formal discussion of the technical definitions and assumptions, can be found in Appendix C.

Theorem 2.1. Suppose that the loss estimate ℓ̃ and the unified loss surface ℓ̂ are Lipschitz continuous. Consider the optimality gap G defined in (2.8), where θ̂ is selected by FLoRA with each party i ∈ [p] collecting T (HP, loss) pairs {(θ_t^(i), L_t^(i))}_{t∈[T]} during the local HPO run. For a desired data distribution D = Σ_{i=1}^p w_i D_i, where {D_i}_{i∈[p]} are the parties' local data distributions and w_i ∈ [0, 1] for all i ∈ [p], we have

G ≤ max_{θ∈Θ} Σ_{i∈[p]} C_α [ C_β Σ_{j∈[p], j≠i} w_j W_1(D_j, D_i) + C_{L,L_i} min_{t∈[T]} d(θ, θ_t^(i)) + δ_i ]. (2.9)

In particular, when all parties have i.i.d. local data distributions, (2.9) reduces to

G ≤ max_{θ∈Θ} Σ_{i∈[p]} C_α [ C_{L,L_i} min_{t∈[T]} d(θ, θ_t^(i)) + δ_i ].
Here C_α, C_β and C_{L,L_i} are constants related to the unified loss surface and the Lipschitz constants, W_1(·,·) and d(·,·) are distance metrics defined over the data distributions and the hyper-parameter space Θ, respectively, and δ_i is the maximum per-sample training error for the local loss surface ℓ_i, i.e., δ_i = max_t |L_t^(i) − ℓ_i(θ_t^(i))|.

There are several interesting observations from Theorem 2.1: (i) The first term in our bound (2.9) characterizes the errors incurred by the parties' data heterogeneity, measured via the 1-Wasserstein distance (Villani, 2021), the magnitude of non-IIDness in an FL system; we can see it vanish under the IID setting. (ii) The last two terms measure the quality of the local HPO approximation, which can be reduced if a good loss surface is selected. For example, if we use non-parametric regression models as the loss surfaces, the per-sample training error can be arbitrarily small (that is, δ_i ≈ 0), but at the cost of increasing the Lipschitz constant L_i of ℓ_i. (iii) The min_{t∈[T]} d(θ, θ_t^(i)) term indicates that the optimality gap depends only on the HP trials θ_t^(i) that are closest to the optimal HP setting. (iv) If we assume each party's training dataset D_i is of size n_i, sampled as D_i ∼ D_i^{n_i}, we can view w_i = n_i/n, where n = Σ_{i=1}^p n_i, i.e., with probability w_i the desired data distribution D is sampled from D_i.

Now we compare our theoretical results with existing analyses such as Khodak et al. (2021) and He et al. (2020). Among many differences in the FL-HPO problem setting, there are two key points we want to emphasize: (i) Theorem 2.1 presents the first optimality gap in terms of loss function value for the single-shot FL-HPO setting, and it can be applied to both algorithmic and model-architecture HPs, while existing works either lack theoretical guarantees or establish a weaker optimality gap measured by the regret defined for an online setting and are only applicable to single-HP optimization.
(ii) We only make a mild Lipschitz assumption on the loss function, and we do not require any assumptions on its convexity, the parties' local data distributions, or the training algorithms, while existing works usually require convexity and certain restrictions on the ML training algorithm to obtain their convergence guarantees.
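For intuition on the heterogeneity term W_1(D_j, D_i) in (2.9): for two one-dimensional empirical distributions with the same number of samples, the 1-Wasserstein distance reduces to the mean absolute difference between the sorted samples. A toy illustration (not part of the paper's analysis):

```python
def w1_empirical(xs, ys):
    # 1-Wasserstein distance between two equal-size empirical
    # distributions on the real line: mean absolute difference of the
    # order statistics (sorted samples).
    assert len(xs) == len(ys)
    xs, ys = sorted(xs), sorted(ys)
    return sum(abs(a - b) for a, b in zip(xs, ys)) / len(xs)
```

For example, shifting every sample by a constant c yields a distance of exactly |c|, matching the intuition that the bound grows with how far apart the parties' data distributions sit.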

3. EMPIRICAL EVALUATION

In this section, we evaluate FLoRA with different loss surfaces for FL-HPO problems on a variety of ML models: histogram-based gradient boosted decision trees (HGB) (Friedman, 2001), Support Vector Machines (SVM) with the RBF kernel, and multi-layered perceptrons (MLP), using their respective scikit-learn implementations (Pedregosa et al., 2011) on OpenML (Vanschoren et al., 2013) classification problems. First, we fix the number of parties p = 3 and compare FLoRA to a baseline on 7 datasets. Then we study the effect of data heterogeneity on the performance of FLoRA. Finally, we evaluate FLoRA with different parameter choices, in particular the number of local HPO rounds and the communication overhead in the aggregation of the per-party (HP, loss) pairs. More comprehensive experimental results and FLoRA's performance on real FL systems can be found in Appendix D.

Baselines. To appropriately evaluate our proposed single-shot FL-HPO scheme, we need to select a meaningful single-shot baseline. For this, we choose the default HP configuration of scikit-learn, for two main reasons: (i) the default HP configuration in scikit-learn is set manually based on expert prior knowledge and extensive empirical evaluation, and (ii) these defaults are also used in the Auto-Sklearn package (Feurer et al., 2015; 2020), one of the leading open-source AutoML Python packages, which maintains a carefully selected portfolio of default configurations. While there are some existing schemes for FL-HPO, we are unable to compare FLoRA to them; see Table 1 for a detailed comparison.

Implementation and evaluation metric. We emulate the final FL training (Algorithm 1, line 8) with a centralized training using the pooled data.
We chose this implementation because we want to evaluate the final performance of any HP configuration (baseline or recommended by FLoRA) in a statistically robust manner with multiple train/validation splits (for example, via 10-fold cross-validation) instead of evaluating the performance on a single train/validation split. This form of evaluation is extremely expensive and generally not feasible in a real FL system, but it allows us to evaluate how the performance of our single-shot HP recommendation fares against that of the best-possible HP found via a full-scale centralized HPO. In all datasets, we consider the balanced accuracy as the metric we wish to maximize. For the local per-party HPOs (as well as the centralized HPO we execute to compute the regret), we maximize the 10-fold cross-validated balanced accuracy. For Tables 3-4, we report the relative regret, computed as (a* − a) / (a* − b), where a* is the best metric obtained via the centralized HPO, b is the result of the above single-shot baseline, and a is the result of the HP recommended by FLoRA. The baseline has a relative regret of 1, and smaller values imply better performance; a value larger than 1 implies that the recommended HP performs worse than the single-shot baseline.

Comparison to single-shot baseline. We first compare FLoRA with the baseline across different datasets, ML models and FLoRA loss surfaces, summarized in Table 3, with the individual results detailed in Appendix D.3. For each method, we report the aggregate performance over all considered datasets in terms of (i) the inter-quartile range, (ii) wins/ties/losses of FLoRA w.r.t. the single-shot baseline, and (iii) a one-sided Wilcoxon signed-rank test of statistical significance, with the null hypothesis that the median of the difference between the single-shot baseline and FLoRA is positive against the alternative that the difference is negative (implying FLoRA improves over the baseline).
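The relative regret above is a one-liner; a small sketch (for higher-is-better metrics such as balanced accuracy, as used in the paper):

```python
def relative_regret(a_star, a, b):
    """Relative regret (a* - a) / (a* - b).

    a_star: best metric found by full centralized HPO,
    b:      metric of the single-shot baseline,
    a:      metric of the evaluated HP (e.g. FLoRA's recommendation).
    """
    return (a_star - a) / (a_star - b)
```

By construction the baseline scores exactly 1, matching the centralized optimum scores 0, and anything worse than the baseline exceeds 1.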
Finally, we report an "Overall" performance, further aggregated across all ML models.

Effect of data heterogeneity. In the second set of experiments, we study the effect of increasing the number of parties in the FL-HPO problem. For each dataset, we increase the number of parties p up until each party has at least 100 training samples. We present the relative regrets in Table 4. It also displays γ_p := (1 − min_{i∈[p]} L^(i)) / (1 − max_{i∈[p]} L^(i)), where L^(i) = min_{t∈[T]} L_t^(i) is the minimum loss observed during the local HPO at party i. This ratio γ_p is always at least 1 and quantifies the inter-party data heterogeneity; precisely, γ_p ∼ 1 + O(max_{i,j∈[p]} W_1(D_i, D_j)) (Appendix C.5). The results indicate that, with a low or moderate increase in γ_p (EEG Eye State, Electricity for moderate p), the proposed scheme is able to achieve low relative regret. However, with a significant increase in γ_p (Pollen, Electricity with p = 50, 100, and EEG Eye State with p = 50), the relative regret increases as well (even exceeding 1 in a few cases). We also simulate a different form of data heterogeneity based on the MNIST dataset and present the results in Table 5. In particular, there are 4 parties in total, with half of the parties having a 4 times higher probability of holding even digits while the other half hold more odd digits. In the most challenging cases, MPLM (the most pessimistic loss surface) has the most graceful degradation in relative regret compared to the remaining loss surfaces.

Communication savings over multi-shot. Figure 2 presents the communication savings of FLoRA compared to the "multi-shot" FL-HPO baseline for the same level of relative regret across datasets.

Effect of different choices in FLoRA. In this set of experiments, we consider FLoRA with the APLM loss surface and ablate the effect of different choices in FLoRA on 2 datasets each for SVM and MLP. First, we study the impact of the thoroughness of the per-party local HPOs, quantified by the number of HPO rounds T, in Figure 3a. The results indicate that for very small T (< 20) the relative regret of FLoRA can be very high.
However, after that point, the relative regret converges to its best possible value. We present the results for other loss surfaces in Appendix D.6.

Figure 2: Communication savings of FLoRA compared to "multi-shot" FL-HPO for the same level of relative regret (lower is better). Each pair of markers connected by a dashed line corresponds to a dataset, labeled D1-D7 for ease of visualization. See Table 6 in Appendix D.1 for the dataset names.

We also study the effect of the communication overhead of FLoRA for a fixed level of local HPO thoroughness. We assume that each party performs T = 100 rounds of local asynchronous HPO. However, instead of sending all T (HP, loss) pairs, we consider sending only the T' < T "best" (HP, loss) pairs, that is, the pairs with the T' lowest losses. Changing the value of T' trades off the communication overhead of the FLoRA step where the aggregator collects the per-party loss pairs (Algorithm 1, line 5). The results for this study are presented in Figure 3b and indicate that, for very small T', the relative regret can be very high. However, for a moderately high value of T' < T, FLoRA converges to its best possible performance. Results and discussions on other loss surfaces are in Appendix D.7.
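The T'-pruning step above amounts to a simple selection at each party before the pairs are shipped to the aggregator; a minimal sketch:

```python
def best_pairs(pairs, t_prime):
    # Keep only the t_prime (HP, loss) pairs with the lowest losses,
    # reducing the O(pT) communication of Algorithm 1, line 5,
    # to O(p * t_prime).
    return sorted(pairs, key=lambda p: p[1])[:t_prime]
```

Each party applies this to its set E^(i); the aggregator then builds the loss surface from the pruned pairs exactly as before.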

4. CONCLUSION AND FUTURE WORK

Effective selection of HPs in FL settings is a challenging problem. In this paper, we introduced FLoRA, a single-shot FL-HPO algorithm that can be applied to any ML model. We provided a theoretical analysis which bounds the optimality gap incurred by the HP selected by FLoRA. Our experiments show that FLoRA can effectively select HPs that outperform the baseline with just a single FL training. As future work, we wish to extend FLoRA to the Combined Algorithm Selection and HPO (CASH) problem (Thornton et al., 2012; Feurer et al., 2015) with or without fixed ML pipeline architecture (Baudart et al., 2021; Hirzel et al., 2022; Katz et al., 2020; Marinescu et al., 2021) , especially in the presence of computational and fairness constraints (Liu et al., 2020; Ram et al., 2020) , and understand the various theoretical trade-offs in FL-HPO (Ram et al., 2023) . One limitation of FLoRA is that it cannot handle HPs that are inactive during local HPO. These include aggregator specific and some FL training specific HPs. It is unlikely that such HPs can be handled in single-shot FL-HPO without any additional information or structure. As future work, we wish to extend FLoRA to handle such HPs in "few-shot" FL-HPO, potentially in conjunction with multi-fidelity HP evaluations.

REPRODUCIBILITY STATEMENT

The code and instructions to reproduce our numerical results can be found in the supplemental materials. The formal definitions, assumptions and proofs for Theorem 2.1 can be found in Appendix C. We provide a description of the datasets used in our experiments in Appendix D.1. The original datasets can also be found in the supplemental materials.

A ADDITIONAL DISCUSSION ON RELATED WORKS

In this section, we provide further discussion of related work and how our work compares against it.

A.1 MULTI-FIDELITY HPO AND FL-HPO

In centralized HPO, the evaluation of a single HP configuration θ entails training the ML model with the provided HP and training data, running inference with the trained model on some held-out samples (validation data) to get predictions, and evaluating the quality of these predictions against the available ground truth. The most computationally expensive part of this process is the model training. Multi-fidelity HPO seeks to reduce the overall computational cost of HPO with cheap evaluations that reduce the model training cost (Swersky et al., 2014; Klein et al., 2017; Li et al., 2018; Falkner et al., 2018). For models trained via some form of (stochastic) gradient descent, a cheap evaluation of a HP configuration is obtained by computing the predictive performance of the model trained with a limited number of gradient descent iterations; the number of iterations is the notion of budget. Under the assumption that more budget does not degrade the performance of any given HP configuration, multi-fidelity HPO solvers adaptively allocate more budget to more promising HP configurations while discarding low-performing HP configurations cheaply by evaluating them with a low budget.
For models not trained by gradient descent, such as decision-tree-based models, nearest-neighbor models, or some kernel machines, a commonly used notion of budget is the training set size: cheap evaluations are performed by training the model for any given HP configuration on a smaller training set, under the assumption that smaller training sets speed up training, which holds for most ML models. One significant distinction between the two notions of budget: with the iteration budget, we are able to progressively train the models for the high-performing HP configurations (albeit with additional storage for checkpointing the models after each evaluation with every budget allocation). With the training-size budget, we usually have to train the models from scratch for each budget allocation; for example, there is no standard way of progressively updating a decision tree trained on 100 samples with a new training-size allocation of 500 samples.
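The adaptive budget-allocation loop described above can be sketched as successive halving; the `train_and_eval` callable and the configurations below are hypothetical stand-ins, not part of any particular HPO library:

```python
def successive_halving(configs, train_and_eval, min_budget=1, eta=2, max_budget=8):
    """Adaptively allocate budget (e.g. SGD iterations or training-set size)
    to HP configurations: evaluate all survivors cheaply, keep the best 1/eta
    fraction, and multiply the budget by eta for the next round.

    `train_and_eval(config, budget)` is a hypothetical stand-in returning the
    validation loss of `config` trained with the given budget.
    """
    survivors = list(configs)
    budget = min_budget
    while len(survivors) > 1 and budget <= max_budget:
        # Cheap evaluation of every surviving configuration at this budget.
        scored = sorted(survivors, key=lambda c: train_and_eval(c, budget))
        # Discard low performers cheaply; promote the rest with more budget.
        survivors = scored[: max(1, len(scored) // eta)]
        budget *= eta
    return survivors[0]
```

With a toy loss such as `abs(config - 3) + 1.0 / budget`, the loop converges to the configuration with the lowest asymptotic loss while spending most budget only on the final survivors.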

A.1.1 MULTI-FIDELITY IN FL-HPO

In FL-HPO, one of the main bottlenecks is the communication overhead, and a cheap evaluation in multi-fidelity FL-HPO would require us to control the number of communication rounds in a particular FL training. For gradient-descent-based training schemes, the number of iterations is a commonly used notion of budget (Khodak et al., 2021), since it is a good surrogate for the number of communication rounds allocated to the (cheap) FL training. Hence, multi-fidelity FL-HPO is useful for neural networks. However, the training set size as a notion of budget (used for other ML models) does not necessarily control the number of communication rounds in FL-HPO. For example, when training a decision tree of depth 5 in a FL setting (Ong et al., 2020), one would require around 5-6 rounds of communication regardless of the per-party training set sizes; reducing the training set size on each party would not reduce the number of communication rounds in the FL training. Hence, the training set size is not a useful notion of budget for multi-fidelity FL-HPO. For this reason, we focus on the FL-HPO problem where we cannot make use of multi-fidelity HP evaluations. This does not even consider the checkpointing overhead present in multi-fidelity FL-HPO with the iteration budget. Precise communication overheads of different schemes. We would like to highlight that, even for a multi-fidelity FL-HPO scheme, the communication overhead is significantly smaller than that of multi-shot HPO, but still higher than that of FLoRA for moderately high N.

B ADDITIONAL DISCUSSIONS ON LOSS SURFACES

Single global model (SGM). We merge all the sets E = ∪_{i∈[p]} E^{(i)} and use the merged set as training data for a regressor f: Θ → R, which treats the HPs θ ∈ Θ as covariates and the corresponding loss as the dependent variable. For example, we can train a random forest regressor (Breiman, 2001) on this training set E. Then we can define the loss surface ℓ(θ) := f(θ). While this loss surface is simple to obtain, it may not handle non-IID party data distributions well: it is actually overly optimistic. Under the assumption that every party generates unique HPs during the local HPO, this single global loss surface would assign a low loss to any HP θ that has a low loss at any one of the parties. This implies that this loss surface would end up recommending HPs that have low loss at just one of the parties, but not necessarily at all parties.

Single global model with uncertainty (SGM+U). Given the merged set E = ∪_{i∈[p]} E^{(i)}, we can train a regressor that provides uncertainty quantification around its predictions (such as a Gaussian process regressor (Williams & Rasmussen, 2006)) as f: Θ → R, u: Θ → R₊, where f(θ) is the mean prediction of the model at θ ∈ Θ while u(θ) quantifies the uncertainty around this prediction f(θ). We define the loss surface as ℓ(θ) := f(θ) + α · u(θ) for some α > 0. The uncertainty function u depends on the regressor used to model the loss surface. Gaussian process regressors naturally provide uncertainty estimates for predictions. For random forest regressors, uncertainty estimates can be derived from the variance of the individual tree predictions, and we use this in our experiments. In all our experiments α = 1, which means we consider a loss value that is a single standard deviation away from the predicted loss.
This loss surface still prefers HPs that have a low loss in just one of the parties, but it penalizes a HP if the model estimates high uncertainty around it. Usually, high uncertainty around a HP arises either because the training set E does not have many samples around this HP (implying that many parties did not view the region containing this HP as one with low loss), or because there are multiple samples in the region around this HP but the parties do not collectively agree that this is a promising region for HPs. Hence SGM+U is more desirable than SGM, giving us a loss surface that estimates low loss for HPs that multiple parties simultaneously consider promising.

Maximum of per-party local models (MPLM). We can train one regressor f^{(i)}: Θ → R, i ∈ [p], on each per-party set E^{(i)}. Given these, we can construct the loss surface as ℓ(θ) := max_{i∈[p]} f^{(i)}(θ). This is a much more pessimistic loss surface, assigning a low loss to a HP only if it has a low loss estimate across all parties.

Average of per-party local models (APLM). A less pessimistic version of MPLM constructs the loss surface as the average of the per-party regressors f^{(i)}, i ∈ [p], instead of the maximum: ℓ(θ) := (1/p) Σ_{i=1}^{p} f^{(i)}(θ). This is also less optimistic than SGM, since it assigns a low loss to a HP only if its average across all per-party regressors is low, which implies that all parties observed a relatively low loss around this HP. Intuitively, we expect loss surfaces such as SGM+U and APLM to be the most promising, while the extremely optimistic SGM and the extremely pessimistic MPLM should be relatively less promising, with MPLM superior to SGM. However, MPLM is also the most robust to data heterogeneity, as evidenced by our empirical evaluations.
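The four aggregation strategies above can be sketched in a few lines. This is an illustrative toy with 1-D hyper-parameters: the paper uses random forest or Gaussian process regressors, whereas here a 1-nearest-neighbour predictor is a hypothetical stand-in, and the SGM+U uncertainty is approximated by the disagreement between per-party regressors rather than by model-internal uncertainty:

```python
def nn_regressor(pairs):
    """Stand-in for the per-party regressor f^(i): a 1-nearest-neighbour
    predictor fit on (HP, loss) pairs with scalar HPs."""
    def f(theta):
        return min(pairs, key=lambda p: abs(p[0] - theta))[1]
    return f

def build_loss_surfaces(per_party_pairs):
    # SGM: one regressor on the merged set E = union of all E^(i).
    merged = [p for pairs in per_party_pairs for p in pairs]
    sgm = nn_regressor(merged)
    # Per-party regressors f^(i) for the remaining surfaces.
    local = [nn_regressor(pairs) for pairs in per_party_pairs]

    def sgm_u(theta):
        # Mean prediction plus an ad-hoc uncertainty term (alpha = 1),
        # here the spread of the per-party predictions.
        preds = [f(theta) for f in local]
        mean = sum(preds) / len(preds)
        var = sum((p - mean) ** 2 for p in preds) / len(preds)
        return mean + var ** 0.5

    def mplm(theta):  # pessimistic: worst party decides
        return max(f(theta) for f in local)

    def aplm(theta):  # averaged across parties
        return sum(f(theta) for f in local) / len(local)

    return sgm, sgm_u, mplm, aplm
```

With two parties that disagree about a HP (one observed loss 0.1 there, the other 0.8), SGM can report the optimistic 0.1, MPLM reports the pessimistic 0.8, and APLM reports the average 0.45, mirroring the qualitative discussion above.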

C THEORETICAL ANALYSIS

In this section, we provide a detailed and rigorous proof of Theorem 2.1. We first formally define the notation used throughout this section in Section C.1, then discuss the Lipschitz smoothness assumptions (and a possible relaxation) required to obtain our optimality guarantee in Section C.2. Finally, we establish the sub-optimality of the HP selected by FLoRA in Section C.3.

C.1 NOTATION AND DEFINITIONS

Here D is the data distribution of the test set. Let ℓ̃(θ, D) be an estimate of the loss defined in (C.1) given some validation (holdout) set sampled from D; this is the model performance metric at evaluation and/or inference time. We assume the parties' training sets are collected before the federated learning, so that D is fixed and unchanged during the HPO and FL processes; in other words, we do not consider the streaming-data/online setting. The aggregated loss surface used by FLoRA takes the general form

ℓ(θ) = Σ_{i=1}^{p} α_i(θ) · ℓ_i(θ). (C.2)

In particular: i) if α_i(θ) = 1/p, ∀i ∈ [p], θ ∈ Θ, this reduces to the APLM loss surface; ii) if α_i(θ) = I{ℓ_i(θ) = max_{j∈[p]} ℓ_j(θ)}, this reduces to the MPLM loss surface (assuming all ℓ_j(θ) are unique).

We formalize the distance metric used in our analysis to evaluate the distance between two given data distributions using the 1-Wasserstein distance (Villani, 2021).

Definition C.3 (1-Wasserstein distance (Villani, 2021)). For two distributions µ, ν with bounded support, the 1-Wasserstein distance is defined as

W₁(µ, ν) := sup_{f∈F₁} [E_{x∼µ} f(x) − E_{x∼ν} f(x)], (C.3)

where F₁ = {f : f is continuous, Lipschitz(f) ≤ 1}.

We now define a distance metric d: Θ × Θ → R₊ used in the rest of our analysis. Assuming we have m HPs, if Θ ⊂ R^m, then various distances are available, such as ‖θ − θ′‖_ρ (the ρ-norm). The more general case is where we have R continuous/real HPs, I integer HPs, and C categorical HPs, with m = R + I + C.
In that case, Θ ⊂ R^R × Z^I × C^C, and any θ = (θ_R, θ_Z, θ_C) ∈ Θ_R × Θ_Z × Θ_C, where θ_R ∈ Θ_R, θ_Z ∈ Θ_Z, θ_C ∈ Θ_C respectively. For each categorical HP k ∈ [C] with categories ξ_kj, j ∈ [n_k], define a graph G_k = (V_k, E_k) where

• there is a node N_kj in G_k for each category ξ_kj, j ∈ [n_k], and V_k = {N_k1, ..., N_kn_k};
• there is an undirected edge (N_kj, N_kj′) for each pair j, j′ ∈ [n_k], so E_k = {(N_kj, N_kj′) : j, j′ ∈ [n_k]}.

Given the per-categorical-HP graphs G_k, k ∈ [C], we define the graph Cartesian product G = □_{k∈[C]} G_k with G = (V, E) such that

• V = {N_(j₁, j₂, ..., j_C) : (ξ_1j₁, ξ_2j₂, ..., ξ_Cj_C) ∈ Θ_C, j_k ∈ [n_k] ∀k ∈ [C]};
• E = {(N_(j₁,...,j_C), N_(j′₁,...,j′_C)) : ∃t ∈ [C] such that ∀k ≠ t, ξ_kj_k = ξ_kj′_k, and (N_tj_t, N_tj′_t) ∈ E_t}.

Then, for any θ_C, θ′_C ∈ Θ_C with corresponding nodes N, N′ ∈ V, (Oh et al., 2019, Theorem 2.2.1) says that the length of the shortest path between N and N′ in G is a distance. We can consider this distance as d_C: Θ_C × Θ_C → R₊. Of course, there are other ways of defining distances in the categorical space. We can then define a distance d: (Θ_R × Θ_Z × Θ_C) × (Θ_R × Θ_Z × Θ_C) → R₊ between two HPs θ, θ′ as

d(θ, θ′) = d_{R,Z}((θ_R, θ_Z), (θ′_R, θ′_Z)) + d_C(θ_C, θ′_C). (C.4)

Proposition C.4. Given distance metrics d_{R,Z} and d_C, the function d: Θ × Θ → R₊ defined in (C.4) is a valid distance metric.
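Since each per-HP graph is complete, every differing categorical coordinate contributes exactly one edge to the shortest path in the Cartesian product, so d_C reduces to a Hamming distance. A minimal sketch of the mixed distance in (C.4) follows; the Euclidean choice for d_{R,Z} (ρ = 2) is one illustrative option, not prescribed by the paper:

```python
import math

def d_categorical(theta_c, theta_c_prime):
    """Shortest-path distance in the Cartesian product of per-HP complete
    graphs: each coordinate where the categories differ costs one edge,
    i.e. the Hamming distance over the categorical block."""
    return sum(a != b for a, b in zip(theta_c, theta_c_prime))

def d_mixed(theta, theta_prime):
    """d(theta, theta') = d_RZ + d_C as in (C.4); here d_RZ is the Euclidean
    norm over the continuous/integer block (illustrative choice of rho = 2)."""
    (num, cat), (num_p, cat_p) = theta, theta_prime
    d_rz = math.sqrt(sum((a - b) ** 2 for a, b in zip(num, num_p)))
    return d_rz + d_categorical(cat, cat_p)
```

For example, two HPs whose numeric blocks are (0.0, 3) and (3.0, 7) and whose categorical blocks differ in one coordinate are at distance 5 + 1 = 6.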

C.2 LIPSCHITZ CONTINUITY ASSUMPTIONS

To facilitate our analysis, we make the following Lipschitz smoothness assumptions regarding the loss function ℓ̃ and the per-party loss surfaces ℓ_i.

Assumption C.5 (Lipschitz smoothness). For a fixed data distribution D and ∀θ, θ′ ∈ Θ̃ ⊂ Θ, we have

|ℓ̃(θ, D) − ℓ̃(θ′, D)| ≤ L(D) · d(θ, θ′), (C.5)
|ℓ_i(θ) − ℓ_i(θ′)| ≤ L_i · d(θ, θ′), (C.6)

where d(·, ·) is the distance metric (see (C.4)) defined over the hyper-parameter search space Θ. For a fixed set of hyper-parameters θ ∈ Θ̃ ⊂ Θ and some data distributions D and D′, we have

|ℓ̃(θ, D) − ℓ̃(θ, D′)| ≤ β(θ) · W₁(D, D′). (C.7)

Remark. Note that we explicitly use a Θ̃ ⊂ Θ to highlight that we need Lipschitz smoothness only in some particular parts of the HP space. In fact, our analysis only requires Lipschitz smoothness at θ̂ (the HP selected by FLoRA), θ* (the optimal HP), and a HP space containing these two HPs and the set of HPs tried in the local HPO runs, i.e., {θ_t^{(i)}}_{t∈[T]}, which is usually much smaller than the entire HP search space. Moreover, the above Lipschitz smoothness assumption w.r.t. a general HP space, which could be a combination of continuous and discrete variables, may be strong. We show later in this section that it can be relaxed to a milder assumption based on the modulus of continuity without significantly affecting our main results. For simplicity, we can always assume that L(D) ≤ L, ∀D, and β(θ) ≤ β, ∀θ. For a more general treatment, we can consider the notion of a modulus of continuity in the form of an increasing real-valued function ω: R₊ → R₊ with lim_{t→0} ω(t) = ω(0) = 0. We say that the estimated loss ℓ̃(θ, D) and the loss surface ℓ_i(θ) admit ω̃_D and ω̄ as a modulus of continuity (respectively) if

|ℓ̃(θ, D) − ℓ̃(θ′, D)| ≤ ω̃_D(d(θ, θ′)), (C.8)
|ℓ_i(θ) − ℓ_i(θ′)| ≤ ω̄(d(θ, θ′)). (C.9)

If we further assume ω̃_D and ω̄ to be concave, then these functions are sublinear:

ω̃_D(t) ≤ Ã_D · t + B̃_D, (C.10)
ω̄(t) ≤ Ā · t + B̄. (C.11)

These conditions give us (indirectly) guarantees similar in spirit to Lipschitz continuity, but in a more rigorous way.

C.3 PROOF OF THEOREM 2.1

In this section, we provide detailed proofs of the result stated in Theorem 2.1 in Section 2.3. We are interested in bounding the optimality gap defined in (2.8), which we restate as

G := ℓ̃(θ̂, D) − ℓ̃(θ*, D), where θ* ∈ argmin_{θ∈Θ} ℓ̃(θ, D). (C.12)

Note that this is the optimality gap for the output of FLoRA in terms of the estimated loss ℓ̃. We (re)state our main result in the following theorem (in a more precise manner than Theorem 2.1).

Theorem C.6. Consider the optimality gap defined in (2.8), where θ̂ is selected by FLoRA with each party i ∈ [p] collecting T (HP, loss) pairs {(θ_t^{(i)}, L_t^{(i)})}_{t∈[T]}. Then

G ≤ 2 max_{θ∈Θ̃} Σ_{i∈[p]} α_i(θ) [ β(θ) Σ_{j∈[p], j≠i} w_j W₁(D_j, D_i) + (L(D_i) + L_i) min_{t∈[T]} d(θ, θ_t^{(i)}) + δ_i ], (C.13)

where δ_i is the maximum per-sample training error for the local loss surface ℓ_i, i.e., δ_i = max_t |L_t^{(i)} − ℓ_i(θ_t^{(i)})|. In particular, when all parties have i.i.d. local data distributions, (C.13) reduces to

G ≤ 2 max_{θ∈Θ̃} Σ_{i=1}^{p} α_i(θ) [ (L(D_i) + L_i) min_{t∈[T]} d(θ, θ_t^{(i)}) + δ_i ].

We quantify the relationship between ℓ_i(θ) and the estimated loss function ℓ̃(θ, D_i) as follows:

|ℓ_i(θ) − ℓ̃(θ, D_i)| := ε_i(θ, T). (C.14)

Proposition C.7. Let θ̂ and θ* be the two HPs defined in (2.7) and (C.12), respectively, let {D_i}_{i∈[p]} be the parties' local data distributions, and let D be the target (global) data distribution we want to learn. For a given HP space Θ̃ ⊂ Θ such that θ̂, θ* ∈ Θ̃, we have

ℓ̃(θ̂, D) − ℓ̃(θ*, D) ≤ 2 max_{θ∈Θ̃} Σ_{i∈[p]} α_i(θ) [ β(θ) W₁(D, D_i) + ε_i(θ, T) ]. (C.15)

Proof. From the definitions of θ̂ and θ*, we obtain

ℓ̃(θ̂, D) − ℓ̃(θ*, D) = [ℓ̃(θ̂, D) − ℓ(θ̂)] + [ℓ(θ̂) − ℓ(θ*)] + [ℓ(θ*) − ℓ̃(θ*, D)] ≤ 2 max_{θ∈Θ̃⊂Θ} |ℓ̃(θ, D) − ℓ(θ)|,

where the inequality follows from the fact that ℓ(θ̂) − ℓ(θ*) ≤ 0.
Moreover, observe that for any θ ∈ Θ̃ ⊂ Θ, by the definition of ℓ(θ) in (C.2), we have

|ℓ̃(θ, D) − ℓ(θ)| = |ℓ̃(θ, D) − Σ_{i∈[p]} α_i(θ) · ℓ_i(θ)|
= |ℓ̃(θ, D) − Σ_{i∈[p]} α_i(θ) · ℓ̃(θ, D_i) + Σ_{i∈[p]} α_i(θ) · ℓ̃(θ, D_i) − Σ_{i∈[p]} α_i(θ) · ℓ_i(θ)|
≤ Σ_{i∈[p]} α_i(θ) |ℓ̃(θ, D) − ℓ̃(θ, D_i)| + Σ_{i∈[p]} α_i(θ) |ℓ̃(θ, D_i) − ℓ_i(θ)|
≤ Σ_{i∈[p]} α_i(θ) β(θ) W₁(D, D_i) + Σ_{i∈[p]} α_i(θ) ε_i(θ, T),

where the last inequality follows from assumption (C.7) and definition (C.14). We next bound W₁(D, D_i).

Proposition C.8. Assume the global distribution is the mixture D = Σ_{i∈[p]} w_i D_i with weights w_i ∈ [0, 1], ∀i ∈ [p], summing to 1. Then

W₁(D, D_i) ≤ Σ_{j∈[p], j≠i} w_j W₁(D_j, D_i). (C.16)

In particular, when the D_i, i ∈ [p], are i.i.d. data distributions, i.e., all parties in the federated learning system possess i.i.d. local data distributions so that W₁(D_j, D_i) = 0 ∀i, j ∈ [p], then Σ_{j∈[p], j≠i} w_j W₁(D_j, D_i) = 0 and therefore W₁(D, D_i) = 0, ∀i ∈ [p].

Proof. By the definition of the 1-Wasserstein distance in (C.3) and the fact that D = Σ_{i∈[p]} w_i D_i, we obtain

W₁(D, D_i) = sup_{f∈F₁} [E_{(x,y)∼D} f(x, y) − E_{(x_i,y_i)∼D_i} f(x_i, y_i)]
= sup_{f∈F₁} [Σ_{j∈[p]} w_j E_{(x_j,y_j)∼D_j} f(x_j, y_j) − E_{(x_i,y_i)∼D_i} f(x_i, y_i)]
= sup_{f∈F₁} Σ_{j∈[p], j≠i} w_j [E_{(x_j,y_j)∼D_j} f(x_j, y_j) − E_{(x_i,y_i)∼D_i} f(x_i, y_i)]
≤ Σ_{j∈[p], j≠i} w_j sup_{f∈F₁} [E_{(x_j,y_j)∼D_j} f(x_j, y_j) − E_{(x_i,y_i)∼D_i} f(x_i, y_i)]
≤ Σ_{j∈[p], j≠i} w_j W₁(D_j, D_i).

Proposition C.9. For any party i ∈ [p], with the (HP, loss) pairs {(θ_t^{(i)}, L_t^{(i)})}_{t∈[T]} collected during the local HPO run at party i, for any θ ∈ Θ̃ ⊂ Θ, we have

ε_i(θ, T) ≤ (L(D_i) + L_i) min_{t∈[T]} d(θ, θ_t^{(i)}) + δ_i, (C.17)

where δ_i = max_t |L_t^{(i)} − ℓ_i(θ_t^{(i)})| is the maximum per-sample training error for the local loss surface ℓ_i.

Proof. By the definition of ε_i(θ, T),

ε_i(θ, T) = |ℓ̃(θ, D_i) − ℓ_i(θ)|
≤ |ℓ̃(θ, D_i) − ℓ̃(θ_t^{(i)}, D_i)| (smoothness of ℓ̃)
+ |ℓ̃(θ_t^{(i)}, D_i) − ℓ_i(θ_t^{(i)})| (modeling error)
+ |ℓ_i(θ_t^{(i)}) − ℓ_i(θ)| (smoothness of ℓ_i),

where θ_t^{(i)}, t ∈ [T], is any one of the HPs tried during the local HPO run at party i ∈ [p]. First note that

|ℓ̃(θ_t^{(i)}, D_i) − ℓ_i(θ_t^{(i)})| = |L_t^{(i)} − ℓ_i(θ_t^{(i)})| ≤ max_t |L_t^{(i)} − ℓ_i(θ_t^{(i)})| ≤ δ_i.

In view of (C.5) and (C.6), we have

ε_i(θ, T) ≤ L(D_i) d(θ, θ_t^{(i)}) + δ_i + L_i d(θ, θ_t^{(i)}),

and minimizing over t ∈ [T] immediately implies the result in (C.17).

In view of the results established in Propositions C.7-C.9, we immediately obtain the optimality guarantee presented in Theorem C.6 (and hence Theorem 2.1). The following proposition characterizes ε_i(θ, T) when we relax the Lipschitz continuity assumption with respect to the HP space Θ (required by Proposition C.9) to only the modulus of continuity defined at the end of Section C.2.

Proposition C.10. Assume that the estimated loss ℓ̃(θ, D_i) and the loss surface ℓ_i(θ) admit concave functions ω̃_{D_i} and ω̄_i, respectively, as a modulus of continuity with respect to θ ∈ Θ for each party i ∈ [p]. Then, for any party i ∈ [p], with the set of (HP, loss) pairs {(θ_t^{(i)}, L_t^{(i)})}_{t∈[T]} collected during the local HPO run at party i, for any θ ∈ Θ̃ ⊂ Θ, there exist Ã_{D_i}, Ā_i, B̃_{D_i}, B̄_i ≥ 0 such that

ε_i(θ, T) ≤ (Ã_{D_i} + Ā_i) min_{t∈[T]} d(θ, θ_t^{(i)}) + B̃_{D_i} + B̄_i + δ_i, (C.18)

where δ_i = max_t |L_t^{(i)} − ℓ_i(θ_t^{(i)})| is the maximum per-sample training error for the local loss surface ℓ_i.

Proof. By the definition of ε_i(θ, T),

ε_i(θ, T) = |ℓ̃(θ, D_i) − ℓ_i(θ)|
≤ |ℓ̃(θ, D_i) − ℓ̃(θ_t^{(i)}, D_i)| (smoothness of ℓ̃)
+ |ℓ̃(θ_t^{(i)}, D_i) − ℓ_i(θ_t^{(i)})| (modeling error)
+ |ℓ_i(θ_t^{(i)}) − ℓ_i(θ)| (smoothness of ℓ_i),

where θ_t^{(i)}, t ∈ [T], is any one of the HPs tried during the local HPO run at party i ∈ [p]. First note that

|ℓ̃(θ_t^{(i)}, D_i) − ℓ_i(θ_t^{(i)})| = |L_t^{(i)} − ℓ_i(θ_t^{(i)})| ≤ max_t |L_t^{(i)} − ℓ_i(θ_t^{(i)})| ≤ δ_i.

In view of (C.8) and (C.9), we have

ε_i(θ, T) ≤ ω̃_{D_i}(d(θ, θ_t^{(i)})) + δ_i + ω̄_i(d(θ, θ_t^{(i)})) ≤ δ_i + min_{t∈[T]} [ω̃_{D_i}(d(θ, θ_t^{(i)})) + ω̄_i(d(θ, θ_t^{(i)}))].

Concavity of a function ω: [0, ∞) → [0, ∞) with ω(0) = 0 implies that there exist A, B ≥ 0 such that ω(t) ≤ At + B. Using this, we can find Ã_{D_i}, Ā_i, B̃_{D_i}, B̄_i ≥ 0 which allow us to simplify the above to

ε_i(θ, T) ≤ δ_i + (Ã_{D_i} + Ā_i) min_{t∈[T]} d(θ, θ_t^{(i)}) + (B̃_{D_i} + B̄_i).
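The sublinearity step invoked above is a standard fact, and one way to make it explicit is the following. For a concave ω: [0, ∞) → [0, ∞) with ω(0) = 0, fix any t₀ > 0; concavity gives, for all t ≥ 0,

```latex
\omega(t) \;\le\; \omega(t_0) + \omega'_+(t_0)\,(t - t_0),
```

so ω(t) ≤ At + B holds with A = ω′₊(t₀) and B = ω(t₀) − ω′₊(t₀) t₀, where ω′₊ denotes the right derivative (which exists everywhere for concave functions). Moreover B ≥ 0, since the chord slope ω(t₀)/t₀ = (ω(t₀) − ω(0))/t₀ dominates ω′₊(t₀) by concavity.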

C.4 RELATIVE REGRETS

As a byproduct, we can also bound the relative regret used in our experiments.

Corollary C.11. Let θ̂ and θ* be defined in (2.7) and (C.12), and let θ̃ and θ_b be the hyper-parameters selected by centralized HPO and some baseline hyper-parameters, respectively. Then, for a given data distribution D, the relative regret is bounded as

[ℓ̃(θ̂, D) − ℓ̃(θ̃, D)] / [ℓ̃(θ_b, D) − ℓ̃(θ̃, D)]
≤ 2 max_{θ∈Θ̃} Σ_{i=1}^{p} α_i(θ) [ β(θ) Σ_{j∈[p], j≠i} w_j W₁(D_j, D_i) + (L(D_i) + L_i) min_{t∈[T]} d(θ, θ_t^{(i)}) + δ_i ] / [ℓ̃(θ_b, D) − ℓ̃(θ̃, D)]. (C.19)

Proof. By the definition of the relative regret, we have

[ℓ̃(θ̂, D) − ℓ̃(θ̃, D)] / [ℓ̃(θ_b, D) − ℓ̃(θ̃, D)] ≤ [ℓ̃(θ̂, D) − ℓ̃(θ*, D)] / [ℓ̃(θ_b, D) − ℓ̃(θ̃, D)],

where the inequality follows from the fact that θ* is the minimizer of ℓ̃(θ, D), so ℓ̃(θ̃, D) ≥ ℓ̃(θ*, D). The result in (C.19) then follows from the bound on the numerator in Theorem C.6.
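The relative-regret metric itself is a simple ratio; a minimal sketch (the function name is ours):

```python
def relative_regret(loss_flora, loss_central, loss_baseline):
    """Relative regret as used in the experiments: 0 means the selected HP
    matches centralized HPO, 1 means it is no better than the baseline HPs,
    and values above 1 mean it is worse than the baseline."""
    return (loss_flora - loss_central) / (loss_baseline - loss_central)
```

For instance, if centralized HPO reaches loss 0.1, the baseline HPs reach 0.3, and FLoRA reaches 0.2, the relative regret is (0.2 − 0.1)/(0.3 − 0.1) = 0.5.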

C.5 QUANTIFYING DATA HETEROGENEITY IN FL-HPO

As defined in §3, we use γ_p as a surrogate to quantify the heterogeneity between the per-party data distributions. Here we motivate this choice. By definition,

γ_p = [1 − min_{i∈[p]} min_{t∈[T]} L_t^{(i)}] / [1 − max_{i∈[p]} min_{t∈[T]} L_t^{(i)}]
= [1 − min_{t∈[T]} L_t^{(î)}] / [1 − min_{t∈[T]} L_t^{(ī)}], where î := argmin_{i∈[p]} min_{t∈[T]} L_t^{(i)} and ī := argmax_{i∈[p]} min_{t∈[T]} L_t^{(i)},
= [1 − min_{t∈[T]} L_t^{(ī)} + min_{t∈[T]} L_t^{(ī)} − min_{t∈[T]} L_t^{(î)}] / [1 − min_{t∈[T]} L_t^{(ī)}]
= 1 + [min_{t∈[T]} L_t^{(ī)} − min_{t∈[T]} L_t^{(î)}] / [1 − min_{t∈[T]} L_t^{(ī)}]
≈ 1 + [ℓ̃(θ^{(ī)}, D_ī) − ℓ̃(θ^{(î)}, D_î)] / [1 − min_{t∈[T]} L_t^{(ī)}], where θ^{(î)}, θ^{(ī)} are the best HPs seen at parties î, ī respectively,
= 1 + [ℓ̃(θ^{(ī)}, D_ī) − ℓ̃(θ^{(î)}, D_ī) + ℓ̃(θ^{(î)}, D_ī) − ℓ̃(θ^{(î)}, D_î)] / [1 − min_{t∈[T]} L_t^{(ī)}]
≤ 1 + β · W₁(D_î, D_ī) / [1 − min_{t∈[T]} L_t^{(ī)}]
≤ 1 + β · max_{i,j∈[p]} W₁(D_i, D_j) / [1 − min_{t∈[T]} L_t^{(ī)}]
≈ 1 + O(max_{i,j∈[p]} W₁(D_i, D_j)),

where the first inequality uses the fact that θ^{(ī)} is the best HP seen at party ī (so the first difference is non-positive) together with assumption (C.7). This implies that γ_p is closely tied to the maximum 1-Wasserstein distance between any pair of per-party distributions.
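Computing γ_p from the per-party local HPO traces is straightforward; a minimal sketch (function name is ours):

```python
def gamma_p(per_party_losses):
    """Heterogeneity surrogate gamma_p from per-party loss traces.

    `per_party_losses[i]` is the list of losses L_t^(i) observed during the
    local HPO at party i. The ratio compares the best accuracy (1 - loss)
    across parties to the worst best-accuracy, and equals 1 when all parties
    achieve the same best local loss.
    """
    best = [min(losses) for losses in per_party_losses]  # min_t L_t^(i)
    return (1 - min(best)) / (1 - max(best))
```

For example, two parties whose best local losses are 0.1 and 0.2 give γ_p = 0.9/0.8 = 1.125.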

D EXPERIMENTAL SETTING D.1 DATASET DETAILS

For our evaluation of single-shot HPO, we consider 7 binary classification datasets of varying sizes and characteristics from OpenML (Vanschoren et al., 2013), chosen such that there is significant room for improvement over the single-shot baseline performance. We consider datasets which have at least 3% potential improvement in balanced accuracy for gradient boosted decision trees. Note that this only ensures room for improvement for HGB, while highlighting cases with no room for improvement for SVM and MLP, as we see in our results. The details of the binary classification datasets used in our evaluation are reported in Table 6. We report the 10-fold cross-validated balanced accuracy of the default HP configuration on each of the datasets with centralized training. The "Gap" column in the results for all datasets and models in §D.3 denotes the difference between the best 10-fold cross-validated balanced accuracy obtained via centralized HPO and the 10-fold cross-validated balanced accuracy of the default HP configuration.

D.2 SEARCH SPACE

We use the search space definition used in the NeurIPS 2020 Black-box Optimization Challenge (https://bbochallenge.com/), described in detail in the API documentation.

D.2.1 HISTOGRAM BASED GRADIENT BOOSTED TREES

Given this format for defining the HPO search space, we use the following precise search space for the HistGradientBoostingClassifier in scikit-learn:

api_config = {
    "max_iter": {"type": "int", "space": "linear", "range": (10, 200)},
    "learning_rate": {"type": "real", "space": "log", "range": (1e-3, 1.0)},
    "min_samples_leaf": {"type": "int", "space": "linear", "range": (1, 40)},
    "l2_regularization": {"type": "real", "space": "log", "range": (1e-4, 1.0)},
}

The HP configuration we consider for the single-shot baseline described in §3 is as follows:

config = {
    "max_iter": 100,
    "learning_rate": 0.1,
    "min_samples_leaf": 20,
    "l2_regularization": 0,
}
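A minimal random sampler for this search-space format could look as follows; the `api_config` dict mirrors the one above, while the sampler itself is our illustrative sketch, not the challenge's official code:

```python
import math
import random

def sample_config(api_config, rng=random.Random(0)):
    """Draw one HP configuration from a bbo-challenge-style search space,
    sampling log-space parameters uniformly on the log scale and rounding
    int-typed parameters to the nearest integer."""
    config = {}
    for name, spec in api_config.items():
        lo, hi = spec["range"]
        if spec["space"] == "log":
            value = math.exp(rng.uniform(math.log(lo), math.log(hi)))
        else:
            value = rng.uniform(lo, hi)
        config[name] = round(value) if spec["type"] == "int" else value
    return config
```

Each call returns a dict like `config` above with every value inside its declared range, which is all a local HPO loop needs to evaluate HP configurations.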

D.2.2 KERNEL SVM WITH RBF KERNEL

For SVC(kernel="rbf") in scikit-learn, we use the following search space:

D.3 DETAILED RESULTS

For each dataset and model, we report:
• the minimum accuracy of the best local HP across all parties, "PMin" := min_{i∈[p]} max_t (1 − ℓ̃(θ_t^{(i)}, D_i)),
• the maximum accuracy of the best local HP across all parties, "PMax" := max_{i∈[p]} max_t (1 − ℓ̃(θ_t^{(i)}, D_i)),
• γ_p = PMax / PMin, and finally
• the regret for each of the considered loss surfaces in FLoRA.

For each of the three methods, we also report the aggregate performance over all considered datasets in terms of mean ± standard deviation ("mean±std"), inter-quartile range ("IQR"), Wins/Ties/Losses of FLoRA with respect to the single-shot baseline ("W/T/L"), and a one-sided Wilcoxon Signed Rank Test of statistical significance ("WSRT") with the null hypothesis that the median of the difference between the single-shot baseline and FLoRA is positive, against the alternative that the difference is negative (implying that FLoRA improves over the baseline). These aggregate metrics are collected in Table 10, along with a set of final aggregate metrics across all datasets and methods.

HGB. The results in Table 7 indicate that, in almost all cases, with all loss surfaces, FLoRA is able to improve upon the baseline to varying degrees (there is only one case, Sonar, where SGM performs worse than the baseline). On average (across the datasets), SGM+U, MPLM and APLM perform better than SGM, as we expected. MPLM performs better than SGM both in terms of average and standard deviation. Looking at the individual datasets, we see that, for datasets with low γ_p (EEG eye state, Electricity), all the proposed loss surfaces have low relative regret, indicating that the problem is easier, as expected. For datasets with high γ_p (Heart statlog, Oil spill), the relative regret of all loss surfaces is higher (but still much smaller than 1), indicating that our proposed single-shot scheme can show improvement even in cases where there is significant difference in the per-party losses (and hence datasets).

SVM. For SVM we continue with the datasets selected using HGB (datasets with a "Gap" of at least 3%).
Of the 7 datasets (Table 6), we skip Electricity because it takes a prohibitively long time to train an SVM on this dataset even with a single HP. So we consider 6 datasets in this evaluation and present the corresponding results in Table 8. Of the 6, note that 2 of these datasets (Pollen, Heart statlog) have a really small "Gap" (highlighted in red in Table 8). Moreover, 2 of the datasets (Heart statlog, Oil spill) also have a really high γ_p, indicating a high level of heterogeneity between the per-party distributions (again highlighted in red). In this case, there are a couple of datasets (Oil spill and Pollen) where FLoRA is unable to show any improvement over the single-shot baseline (see underlined entries in Table 8), but both these cases either have a small or moderate "Gap" and/or a high γ_p. Moreover, in one case, MPLM incurs a regret of 6.8, but this is a case with a really high γ_p = 1.14: MPLM rejects any HP that has a low score in even one of the parties, and in that process can end up discarding HPs that perform well overall.

MLP. We consider all 7 datasets for the evaluation of FLoRA on FL-HPO for MLP HPs and present the results in Table 9. As with SVM, there are a few datasets with a small room for improvement ("Gap") and/or high γ_p, again highlighted in red in Table 9. In some of these cases, FLoRA is unable to improve upon the single-shot baseline (Pollen, EEG eye state). Beyond these hard cases, FLoRA is again able to show significant improvement over the single-shot baseline, with APLM performing the best.

Aggregate. The results for all the methods and datasets are aggregated in Table 10. All FLoRA loss surfaces show strong performance with respect to the single-shot baseline, with significantly more wins than losses, and 3rd-quartile regret values less than 1 (indicating improvement over the baseline). All FLoRA loss surfaces have a p-value of less than 0.05, indicating that we can reject the null hypothesis.
Overall, APLM shows the best performance over all loss surfaces, both in terms of Wins/Ties/Losses over the baseline as well as in terms of the Wilcoxon Signed Rank Test, with the highest statistic and a p-value close to 10^-3. APLM also has a significantly lower 3rd quartile than all other loss surfaces. MPLM appears to have the worst performance, but much of that is attributable to the really high regrets of 6.8 and 2.84 it received for SVM on Heart statlog and Pollen (both hard cases, as discussed earlier). Otherwise, MPLM performs second best for FL-HPO with both HGB and MLP.

D.4 COMMUNICATION SAVINGS OF FLoRA OVER MULTI-SHOT FL-HPO

Here we compare the communication savings one would obtain from FLoRA relative to a multi-shot FL-HPO scheme. FLoRA is a single-shot FL-HPO scheme and hence uses a single federated model training (Algorithm 1, line 8). In contrast to a single-shot solution, we can try to solve the FL-HPO problem (2.5) in a manner that allows multiple federated model trainings. In that case, we could use existing HPO schemes such as HyperOpt (Bergstra et al., 2011) for FL-HPO, where each HP proposal is evaluated with a complete federated model training. While this multi-shot FL-HPO scheme would have a very high communication overhead, we expect it to find HPs and corresponding models with better predictive performance than single-shot schemes. In fact, given a sufficient number of federated model trainings, we expect this multi-shot FL-HPO scheme to achieve 0 relative regret. With such a baseline, we perform a comparison with FLoRA where we note the number of federated model trainings required by multi-shot FL-HPO to reach the predictive performance (in terms of relative regret) of FLoRA. Since each federated training incurs a large amount of communication overhead, a larger number of federated model trainings implies a larger communication overhead.
This number of federated model trainings serves as a surrogate for the communication gain of FLoRA over the multi-shot baseline to achieve the same level of performance. We present these results for FL-HPO of HGB (Figure 5), SVM (Figure 6) and MLP (Figure 7). In each of these figures, we present the communication savings for all four loss surfaces, and each pair of markers connected by a dashed line corresponds to a single dataset label from D1-D7. The names (and details) of the datasets corresponding to these labels are provided in Table 6. The results for HGB in Figure 5 indicate that, on most datasets, across all loss surfaces, FLoRA can provide almost 10× savings, and up to 20-30× in a few cases. Overall, SGM performs the worst. The results for SVM in Figure 6 indicate that the communication savings of FLoRA are much more spread out. Note that, as mentioned in Appendix D.3, the Electricity dataset (D2) is not considered in the SVM FL-HPO since it takes prohibitively long to execute the centralized HPO. Furthermore, we have also excluded from each plot the datasets where the relative regret of FLoRA is greater than 1 (see Table 8), since in that case FLoRA is not even able to improve upon the single-shot baseline. There are more cases in this set of results where the savings from FLoRA are less than 10×, but there are also cases where FLoRA provides close to 100× savings. Overall, APLM provides the most robust performance in terms of savings with a savings IQR of 7.25-29.75×, while MPLM provides the worst performance with a savings IQR of 1.75-9.75×. SGM and SGM+U provide a similar wide-ranging savings IQR of 2.25-80×. The results for MLP are presented in Figure 7. In this case, the communication savings from FLoRA again range from less than 10× to over 90×. APLM again provides the most robust performance with a savings IQR of 5.5-16×. SGM+U and MPLM have the same savings IQR of 3-13×.
SGM has the most wide-ranging savings IQR of 2-33×. Overall, the results across all loss surfaces, datasets and ML models indicate that FLoRA with any loss surface leads to significant communication savings over multi-shot FL-HPO. Among the loss surfaces, APLM provides some of the best savings while being the most robust across ML models and datasets. SGM+U also provides better savings than SGM. MPLM has the least favorable savings because of its pessimism. However, as shown in the other experiments, MPLM provides strong performance relative to the remaining loss surfaces especially when the FL-HPO problem is particularly challenging (as with the increasing number of parties discussed in Appendix D.5 or the non-IID per-party data distributions highlighted in Appendix D.8).

D.5 EFFECT OF INCREASING THE NUMBER OF PARTIES

We now study the effect of increasing the number of parties in the FL-HPO problem when training an HGB model on 3 datasets. For each dataset, we increase the number of parties p up until each party has at least 100 training samples. We present the relative regrets for all the loss surfaces alongside γ_p := (1 − min_{i∈[p]} L^{(i)}) / (1 − max_{i∈[p]} L^{(i)}), where L^{(i)} = min_{t∈[T]} L_t^{(i)} is the minimum loss observed during the local asynchronous HPO at party i. This ratio γ_p is always at least 1 and highlights the difference in the observed performances across the parties. A ratio close to 1 indicates that all the parties have relatively similar performances on their respective training data, while a ratio much higher than 1 indicates a significant discrepancy between the per-party performances, implicitly indicating differences in the per-party data distributions. We notice that increasing the number of parties does not have a significant effect on γ_p for the Electricity dataset until p = 100, while γ_p increases significantly for the Pollen dataset much earlier (making the problem harder). For EEG eye state, the increase in γ_p with increasing p is moderate until p = 50. The results indicate that, with a low or moderate increase in γ_p (EEG eye state, and Electricity for moderate p), the proposed scheme is able to achieve low relative regret: the increase in the number of parties does not directly imply degradation in performance. However, with a significant increase in γ_p (Pollen, Electricity with p = 50, 100, and EEG eye state with p = 50), we see a significant increase in the relative regret (eventually going over 1 in a few cases). In this challenging case, MPLM (the most pessimistic loss surface) shows the most graceful degradation in relative regret compared to the remaining loss surfaces.
D.6 EFFECT OF T

In this experiment, we report additional results studying the effect of the "thoroughness" of the local HPO runs (in terms of the number of HPO rounds T) on the overall performance of FLoRA for all the loss surfaces in Table 12. In almost all cases, FLoRA does not require a large T to gather enough information about the local HPO loss surface to reach its best possible performance.

D.7 EFFECT OF COMMUNICATION OVERHEAD

While the previous experiment studied the effect of the thoroughness of the local HPO runs on the performance of FLoRA, here we consider a subtly different setup. We assume that each party performs T = 100 rounds of local asynchronous HPO. However, instead of sending all T (HP, loss) pairs, we consider sending only the T' < T "best" (HP, loss) pairs, that is, the pairs with the T' lowest losses. Changing the value of T' trades off the communication overhead of the FLoRA step where the aggregator collects the per-party (HP, loss) pairs (Algorithm 1, line 5) against the information available to the aggregator. We consider 2 datasets each for 2 of the methods (SVM, MLP) and all the loss surfaces for FLoRA, and report all the results in Table 13.
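The pair-selection step described above amounts to sorting by loss and truncating; a minimal sketch (function name ours):

```python
def best_pairs(hp_loss_pairs, t_prime):
    """Keep only the t_prime (HP, loss) pairs with the lowest losses.

    Sending fewer pairs shrinks the per-party upload to the aggregator at the
    cost of giving it a sparser view of the local loss surface.
    """
    return sorted(hp_loss_pairs, key=lambda pair: pair[1])[:t_prime]

pairs = [({"depth": 3}, 0.31), ({"depth": 7}, 0.18), ({"depth": 5}, 0.25)]
# The two lowest-loss pairs, best first.
print(best_pairs(pairs, 2))  # -> [({'depth': 7}, 0.18), ({'depth': 5}, 0.25)]
```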

D.8 EFFECT OF NON-IID PARTY DATA DISTRIBUTION

In this section, we simulate different degrees of non-IID-ness in the parties' local data distributions based on the MNIST dataset. In particular, for "imb1" there are 4 parties in total; half of the parties sample even digits with 4 times higher probability, while the other half similarly favor odd digits. In the "imb2" case, the parties' local data distributions are more heterogeneous: there are 4 parties in the FL system, with the first 3 parties each holding 2 distinct class labels and the last party holding the remaining 4 class labels, i.e., each party only accesses a unique subset of class labels. For both simulated imbalanced datasets, all parties have 500 training images and 2500 testing images. We again use balanced accuracy as the target performance metric, and relative regret as the metric to measure the performance of FLoRA. We summarize the results in Table 14. First of all, FLoRA improves over the baseline HPs in all imbalance scenarios for both MLP and SVM models. Secondly, we can see that overall MPLM (the most pessimistic loss surface) is the most robust loss surface, always obtaining improvements over the baseline HPs in all scenarios. More specifically, when the degree of non-IID-ness is mild (the "imb1" case), all loss surfaces improve over the baseline HPs. In the "imb2" case, when the degree of non-IID-ness is more intense, the performance of FLoRA with the SGM loss surface drops significantly. This is also supported by our intuition and theoretical analysis.
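A minimal sketch of an "imb2"-style label partition, where each party only sees a unique subset of class labels (the function name and the toy label stream are illustrative; the paper's exact sampling procedure may differ):

```python
import random

def imb2_partition(labels, party_label_sets):
    """Assign sample indices to parties by class label ('imb2'-style split):
    each party only receives samples whose label is in its label subset."""
    parts = [[] for _ in party_label_sets]
    for idx, y in enumerate(labels):
        for p, label_set in enumerate(party_label_sets):
            if y in label_set:
                parts[p].append(idx)
    return parts

# 4 parties: the first 3 hold 2 labels each, the last holds the remaining 4.
party_labels = [{0, 1}, {2, 3}, {4, 5}, {6, 7, 8, 9}]
random.seed(0)
labels = [random.randrange(10) for _ in range(1000)]  # stand-in for MNIST labels
parts = imb2_partition(labels, party_labels)
print([len(p) for p in parts])  # per-party sample counts
```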

D.9 FEDERATED LEARNING TESTBED EVALUATION

We now conduct experiments for the histogram-boosted tree model in an FL testbed, utilizing the IBM FL library (Ludwig et al., 2020; Ong et al., 2020). More specifically, we reserved 40% of Oil Spill and Electricity and 20% of EEG Eye State as a global hold-out set used only for evaluating the final FL model performance. Each party randomly sampled from the rest of the original dataset to obtain its own training dataset. We use the same HP search space as in Appendix D.2. We report the balanced accuracy of any HP (baseline or recommended by FLoRA) on a single train/test split. Given balanced accuracy as the evaluation metric, we utilize (1 − balanced accuracy) as the loss L^{(i)}_t in Algorithm 1. Each party runs HPO to generate T = 500 (HP, loss) pairs and uses those pairs to generate the loss surface, either collaboratively or on its own, according to the different aggregation procedures described earlier.



The HP search-space configuration format is from https://github.com/rdturnermtl/bbo_challenge_starter_kit/#configuration-space. We download the MNIST dataset from http://yann.lecun.com/exdb/mnist/.



In the centralized ML setting, we consider a model class M and its corresponding learning algorithm A, parameterized collectively by HPs θ ∈ Θ; given a training set D, we can learn a single model A(M, θ, D) → m ∈ M. Given some predictive loss L(m, D′) of any model m scored on some holdout set D′, the centralized HPO problem can be stated as

min_{θ∈Θ} L(A(M, θ, D), D′). (2.1)

In the most general FL setting, we have p parties P_1, ..., P_p, each with its private local training dataset D_i, i ∈ [p] = {1, 2, ..., p}. Let D = ∪_{i=1}^p D_i denote the aggregated training dataset and D = {D_i}_{i∈[p]} denote the set of per-party datasets. Each model class (and corresponding learning algorithm) is parameterized by global HPs θ_G ∈ Θ_G shared by all parties and per-party local HPs θ

Figure 3: Effect of different choices on FLoRA with the APLM loss surface for different methods and datasets. More results and other loss surfaces are presented in Appendices D.6 and D.7.

For ease of exposition, we present results with HGB with 2 loss surfaces, MPLM and APLM. More results are presented in Appendix D.4. Similar to Figure 1, we report the number of FL model trainings required for multi-shot FL-HPO to match the relative regret achieved by FLoRA, with the single-shot baseline performance (relative regret of 1 with 1 FL model training) shown as a reference. In aggregate, FLoRA with APLM achieves a median savings of 8×, 15× and 10× over the multi-shot baseline for HGB, SVM and MLP respectively.

FLoRA can still provide communication savings. Consider a simplistic setup where each FL training requires C communication rounds. For single-shot FLoRA, the overall communication overhead is O(C), since we need a single FL training plus some additional communication that is much smaller than the FL training communication overhead (see §2.1, Figures 4c and 3b). If multi-shot FL-HPO tries N HP configurations, then the communication overhead is O(C · N) (see Figure 4b). For a multi-fidelity FL-HPO that leverages some form of successive halving/elimination, it can be shown that the total communication overhead for attempting N HP configurations is O(C · log N), which is still larger than the O(C) overhead of single-shot FLoRA.
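The back-of-the-envelope comparison above can be sketched numerically, assuming the simplistic cost model in the text (C rounds per FL training; the constants and the base-2 logarithm are illustrative choices, not from the paper):

```python
import math

def overhead(c_rounds, n_configs, scheme):
    """Order-of-magnitude communication rounds for each FL-HPO scheme
    under a simplistic model of C rounds per FL training."""
    if scheme == "single_shot":    # FLoRA: a single FL training
        return c_rounds
    if scheme == "multi_shot":     # one FL training per HP configuration
        return c_rounds * n_configs
    if scheme == "multi_fidelity": # successive halving over N configurations
        return c_rounds * max(1, math.ceil(math.log2(n_configs)))
    raise ValueError(scheme)

C, N = 100, 64
print(overhead(C, N, "single_shot"))     # -> 100
print(overhead(C, N, "multi_shot"))      # -> 6400
print(overhead(C, N, "multi_fidelity"))  # -> 600
```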

Figure 4: Visualizations of the communication in FL training and different FL-HPO setups. (4a) A single FL training involves multiple communication rounds. (4b) Multi-shot FL-HPO requires multiple FL trainings, one for each of the different HPs θ_G evaluated. (4c) FL-HPO with FLoRA requires a single FL training for the selected HP θ*.

3 and a byproduct of the result presented in Section C.4.

C.1 TECHNICAL DEFINITIONS

Definition C.1 (Loss functions). For a given set of parties' data D = {D_i}_{i∈[p]} and any θ ∈ Θ, the true target loss (any predictive performance metric, such as the training loss) can be expressed as

ℓ(θ, D) := E_{(x,y)∼D} [ L(A(θ, D), (x, y)) ], (C.1)

where A(θ, D) is the trained model and the expectation measures its test performance.

Now we are ready to provide a more general definition of the unified loss surface constructed by FLoRA.

Definition C.2 (Unified loss surface). Given the local loss surfaces ℓ̂_i : Θ → R for each party i ∈ [p], generated from the T (HP, loss) pairs {(θ^{(i)}_t, L^{(i)}_t)}_{t∈[T]}, and a weight vector α := (α_1, ..., α_p) with α_i(θ) ∈ [0, 1] for all i ∈ [p] and Σ_{i=1}^p α_i(θ) = 1, we can define the global loss surface ℓ̂ : Θ → R as

ℓ̂(θ) := Σ_{i=1}^p α_i(θ) ℓ̂_i(θ).
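A minimal numerical sketch of such a weighted aggregation of per-party loss surfaces follows; the uniform-average and worst-case variants loosely correspond to averaging and pessimistic aggregation respectively (this mapping is our reading, not the paper's exact definitions of the named surfaces):

```python
def unified_loss(local_surfaces, theta, mode="average"):
    """Aggregate per-party loss surfaces l_i into a single surface
    l_hat(theta) = sum_i alpha_i(theta) * l_i(theta).

    mode="average": uniform weights alpha_i = 1/p.
    mode="max": all weight on the worst party (pessimistic aggregation).
    """
    losses = [surface(theta) for surface in local_surfaces]
    if mode == "average":
        return sum(losses) / len(losses)
    if mode == "max":
        return max(losses)
    raise ValueError(mode)

# Two toy quadratic per-party surfaces with different optima.
surfaces = [lambda t: (t - 1.0) ** 2, lambda t: (t + 1.0) ** 2]
print(unified_loss(surfaces, 0.0, "average"))  # -> 1.0
print(unified_loss(surfaces, 1.0, "max"))      # -> 4.0
```

Note that both variants are instances of Definition C.2: "max" simply places weight 1 on the argmax party for each θ.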

θ^{(i)}_t, t ∈ [T], during the local HPO run. For a desired data distribution D = Σ_{i=1}^p w_i D_i, where {D_i}_{i∈[p]} are the parties' local data distributions and w_i ∈ [0, 1] for all i ∈ [p], we have

where θ^{(i)}_t, t ∈ [T], is any one of the HPs tried during the local HPO run on party i ∈ [p].

Figure 5: Communication overhead of multi-shot HPO compared to FLoRA to match the relative regret of FLoRA for FL-HPO with Histogram-based Gradient Boosted Decision Trees (HGB). The dataset indices are based on Table 6.

Figure 6: Communication overhead of multi-shot HPO compared to FLoRA to match the relative regret of FLoRA for FL-HPO with Support Vector Machines (SVM). The dataset indices are based on Table 6.

Figure 7: Communication overhead of multi-shot HPO compared to FLoRA to match the relative regret of FLoRA for FL-HPO with Multi-layered Perceptrons (MLP). The dataset indices are based on Table 6.

In the distributed FL setting, this problem is exacerbated because validation sets are local to the parties and each FL training/scoring evaluation is communication intensive. Therefore a brute-force application of centralized black-box HPO approaches, selecting HPs in an outer loop and proceeding with FL training evaluations, is not feasible. (2) It yields minimal HPO communication overhead. This is achieved by building a loss surface from local asynchronous HPO at the parties, which yields a single optimized HP configuration used to train a global model with a single FL training. (3) It is the first that theoretically characterizes the optimality gap in an

Positioning of our proposed framework FLoRA against existing literature. SS: single-shot. MF: multi-fidelity. WS: weight-sharing. θG: global model HPs. φ: aggregator HPs. See §2. †: FedHPO-B is a benchmarking suite and not an algorithm, and the properties correspond to the problems in the suite.

Loss surfaces: f : Θ → R is the global loss surface.

Comparison of different loss surfaces (the 4 rightmost columns) for FLoRA relative to the baseline for single-shot 3-party FL-HPO in terms of the relative regret (lower is better).

Effect of increasing the number of parties on FLoRA with different loss surfaces for HGB.

Effect of data heterogeneity on FLoRA with MNIST.

denote the continuous, integer and categorical HPs in θ, respectively. Distances over R^R × Z^I are readily available, such as the ρ-norm. Let d_{R,Z} : (Θ_R × Θ_Z) × (Θ_R × Θ_Z) → R_+ be some such distance. To define distances over categorical spaces, there are techniques such as the one described by Oh et al. (2019): assume that each of the C categorical HPs θ_{C,k}, k ∈ [C], has n_k categories {ξ_{k1}, ξ_{k2}, ..., ξ_{kn_k}}. Then we define a complete undirected graph
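As a rough illustration of a distance over such a mixed HP space, the sketch below combines a ρ-norm over the numeric coordinates with a simple Hamming term for the categorical coordinates; the Hamming term is a plain stand-in, not the graph-based categorical distance of Oh et al. (2019):

```python
def mixed_distance(theta1, theta2, rho=2):
    """Distance over a mixed HP space: rho-norm over the continuous/integer
    coordinates plus a Hamming count over the categorical coordinates.
    Each theta is a pair (numeric_tuple, categorical_tuple)."""
    num1, cat1 = theta1
    num2, cat2 = theta2
    d_num = sum(abs(a - b) ** rho for a, b in zip(num1, num2)) ** (1.0 / rho)
    d_cat = sum(a != b for a, b in zip(cat1, cat2))  # Hamming stand-in
    return d_num + d_cat

a = ((0.1, 3), ("relu",))
b = ((0.1, 7), ("tanh",))
# Numeric part contributes |3 - 7| = 4, categorical mismatch contributes 1.
print(mixed_distance(a, b))  # -> 5.0
```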

We now dive into each term in (C.15) to provide tight bounds for W_1(D, D_i) and ε_i(θ, T) in the following propositions.

OpenML binary classification dataset details

HGB

SVM HPs, since the local HPOs on these disparate distributions did not concentrate on the same region of the HP space, thereby incurring a high MPLM loss in almost all regions of the HP space on which some local HPO focused. Other than these expected hard cases, FLoRA is able to improve upon the baseline in most cases, and achieves optimal performance (zero regret) in a few cases (EEG Eye State, Heart Statlog).

Aggregate Table

Effect of increasing the number of parties on FLoRA with all 4 loss surfaces for HGB.

A complete version of Table 4 (in Section 3). It also displays γ_p.

Effect of T.

Effect of the number of best (HP, loss) pairs T' < T sent to the aggregator by each party after doing local HPO with T = 100.

Effect of non-IID-ness among the parties' local data distributions.

Performance of FLoRA with the IBM-FL system in terms of the balanced accuracy on a holdout test set (higher is better). The baseline is still the default HP configuration of HistGradientBoostingClassifier in scikit-learn. Once the loss surface is generated, the aggregator uses Hyperopt (Bergstra et al., 2011) to select the best HP candidate and trains a federated XGBoost model via the IBM FL library using the selected HPs. Table 15 summarizes the experimental results for 3 datasets, indicating that FLoRA can significantly improve over the baseline in the IBM FL testbed.

ACKNOWLEDGEMENTS

We would like to thank Martin Wistuba for his input during the initial discussions on this research thread. We would also like to thank the organizers of the NFFL workshop at NeurIPS'21 for allowing us to present an initial version of this work to a wider audience (Zhou et al., 2021) .

api_config = {
    "C": {"type": "real", "space": "log", "range": (0.01, 1000.0)},
    "gamma": {"type": "real", "space": "log", "range": (1e-5, 10.0)},
    "tol": {"type": "real", "space": "log", "range": (1e-5, 1e-1)},
}

The single-shot baseline we consider for SVC from Auto-sklearn (Feurer et al., 2015) is:

config = {
    "C": 1.0,
    "gamma": 0.1,
    "tol": 1e-3,
}
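Given such an api_config-style search space, local HPO must draw candidate configurations from it; a minimal sampler respecting the log/linear space annotations might look like the following (`sample_config` is our illustration, not part of the bbo_challenge API):

```python
import math
import random

def sample_config(api_config, rng=random):
    """Draw one HP configuration from an api_config-style search space,
    sampling 'log'-space parameters uniformly in the log domain."""
    cfg = {}
    for name, spec in api_config.items():
        lo, hi = spec["range"]
        if spec["space"] == "log":
            val = math.exp(rng.uniform(math.log(lo), math.log(hi)))
        else:
            val = rng.uniform(lo, hi)
        # Integer HPs are rounded to the nearest integer.
        cfg[name] = round(val) if spec["type"] == "int" else val
    return cfg

api_config = {
    "C": {"type": "real", "space": "log", "range": (0.01, 1000.0)},
    "gamma": {"type": "real", "space": "log", "range": (1e-5, 10.0)},
    "tol": {"type": "real", "space": "log", "range": (1e-5, 1e-1)},
}
random.seed(0)
print(sample_config(api_config))  # one random configuration within the ranges
```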

D.2.3 MULTI-LAYERED PERCEPTRONS

For the MLPClassifier(solver="adam") from scikit-learn, we consider both architectural HPs such as hidden_layer_sizes as well as optimizer parameters such as alpha and learning_rate_init for the Adam optimizer (Kingma & Ba, 2015). We consider the following search space:

api_config = {
    "hidden_layer_sizes": {"type": "int", "space": "linear", "range": (50, 200)},
    "alpha": {"type": "real", "space": "log", "range": (1e-5, 1e1)},
    "learning_rate_init": {"type": "real", "space": "log", "range": (1e-5, 1e-1)},
}

We utilize the following single-shot baseline:

config = {
    "hidden_layer_sizes": 100,
    "alpha": 1e-4,
    "learning_rate_init": 1e-3,
}

We fix the remaining HPs of MLPClassifier to the values used by Auto-sklearn: activation="relu", early_stopping=True, shuffle=True, batch_size="auto", tol=1e-4, validation_fraction=0.1, beta_1=0.9, beta_2=0.999, epsilon=1e-8.

D.3 DETAILED RESULTS OF COMPARISON AGAINST BASELINES

Here we present the relevant details and the performance of FLoRA on the FL-HPO of (i) histogram-based gradient boosted trees (HGB) in Table 7, (ii) nonlinear support vector machines (SVM) in Table 8, and (iii) multi-layered perceptrons (MLP) in Table 9. We use the search spaces and the single-shot baselines presented in §D.2. We utilize all 7 datasets for each of the methods, except for the Electricity dataset with SVM because of the infeasible amount of time taken by SVM on this dataset. For each setup, we report the following:
• the performance of the single-shot baseline ("SSBaseline"),
• the best centralized HPO performance ("Best"),
• the available "Gap" for improvement,
• the minimum accuracy of the best local HP across all parties, "PMin" := min_{i∈[p]} max_{t∈[T]} (1 − ℓ̃(θ^{(i)}_t)).
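The relative-regret numbers reported throughout can be read off a simple normalization; the sketch below is our reading, consistent with the reference points in the text (regret 1 for the single-shot baseline, regret 0 for the best centralized HPO performance), rather than a definition quoted from the paper:

```python
def relative_regret(acc, acc_best, acc_baseline):
    """Relative regret of an HP configuration: 0 for matching the best
    centralized-HPO accuracy, 1 for matching the single-shot baseline,
    and > 1 for doing worse than the baseline."""
    return (acc_best - acc) / (acc_best - acc_baseline)

print(relative_regret(1.0, 1.0, 0.5))   # best accuracy     -> 0.0
print(relative_regret(0.5, 1.0, 0.5))   # baseline accuracy -> 1.0
print(relative_regret(0.75, 1.0, 0.5))  # halfway           -> 0.5
```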

