SINGLE-SHOT GENERAL HYPER-PARAMETER OPTIMIZATION FOR FEDERATED LEARNING

Abstract

We address the problem of hyper-parameter optimization (HPO) for federated learning (FL-HPO). We introduce Federated Loss SuRface Aggregation (FLoRA), a general FL-HPO solution framework that can address use cases of tabular data and any Machine Learning (ML) model, including gradient boosting training algorithms, SVMs, and neural networks, among others, thereby further expanding the scope of FL-HPO. FLoRA enables single-shot FL-HPO: identifying a single set of good hyper-parameters that are subsequently used in a single FL training. Thus, it enables FL-HPO solutions with minimal additional communication overhead compared to FL training without HPO. Utilizing standard smoothness assumptions, we theoretically characterize the optimality gap of FLoRA for both convex and non-convex loss functions, which explicitly accounts for the heterogeneous nature of the parties' local data distributions, a dominant characteristic of FL systems. Our empirical evaluation of FLoRA for multiple FL algorithms on seven OpenML datasets demonstrates significant model accuracy improvements over the baselines, and robustness to an increasing number of parties involved in FL-HPO training.

1. INTRODUCTION

Traditional machine learning (ML) approaches require training data to be gathered at a central location where the learning algorithm runs. In real-world scenarios, however, training data is often subject to privacy or regulatory constraints restricting the way data can be shared, used, and transmitted. Examples of such regulations include the European General Data Protection Regulation (GDPR), California Consumer Privacy Act (CCPA), Cybersecurity Law of China (CLA), and HIPAA, among others. Federated learning (FL), first proposed in McMahan et al. (2017b), has recently become a popular approach to address privacy concerns by allowing collaborative training of ML models among multiple parties where each party can keep its data private.

FL-HPO problem. Despite the privacy protection FL brings, there are many open problems in the FL domain, one of which is hyper-parameter optimization for FL, or FL-HPO (Kairouz et al., 2019; Khodak et al., 2021). Existing FL systems require a user (or all participating parties) to pre-set (agree on) multiple hyper-parameters (HPs) (i) for the model being trained (such as the number of layers for neural networks, or the tree depth and number of trees in tree ensembles), (ii) for the FL algorithms, and (iii) for aggregation (if such hyper-parameters exist). HPO for FL is important because the choice of HPs can have a dramatic impact on model performance, much like in traditional centralized ML (McMahan et al., 2017b).

While HPO has been widely studied in the centralized ML setting (Hutter et al., 2019), it comes with unique challenges in the FL setting. First, existing HPO techniques often make use of the entire dataset, which is not available centrally in FL. Second, they need to train many models for a large number of HP configurations, which is prohibitively expensive in terms of communication and training time in FL settings; training a single model already has a high communication overhead (Kairouz et al., 2019). Third, one important challenge that has not been adequately explored in the FL-HPO literature is support for tabular data, which is widely used in enterprise settings such as financial services and other traditional industries that prefer traditional models with some explainability (Ludwig et al., 2020). Although a few approaches have recently been proposed for FL-HPO, they focus on handling HPO using personalization techniques (Khodak et al., 2021) and neural networks (Khodak et al., 2020). To the best of our knowledge, there is no FL-HPO approach for training non-neural-network models, such as gradient boosted decision trees (Friedman, 2001) (e.g., XGBoost (Chen & Guestrin, 2016)), which are particularly common in the enterprise setting, even though there are existing FL algorithms for such models (Li et al., 2020; Ong et al., 2020).

This leads to our motivating question: Can we develop an FL-HPO scheme that performs HPO for any ML model in an FL environment without significantly increasing the already-high communication overhead of FL?

In this paper, we address the aforementioned challenges of FL-HPO and our motivating question. We focus on the problem where the model HPs are shared by all parties, and we seek a set of HPs and train a single model that is used by all parties. Our motivating question leads to four further requirements that make the problem challenging:
(C1) To perform FL-HPO with any ML model, we cannot assume that two models with different HPs can perform some "weight-sharing"; this allows our solution to be applied beyond fixed-architecture neural networks.
(C2) To be general across ML models, we do not assume the ability to perform "multi-fidelity" HPO to reduce the communication overhead of FL-HPO (see discussion in Appendix A.1).
(C3) To avoid increasing the FL communication overhead, we seek to perform "single-shot" FL-HPO, where we can perform FL-HPO while requiring only a single FL model training.
(C4) To be applicable to FL with data heterogeneity, we cannot assume that parties have independent and identically distributed (IID) data.

Contributions. Given the above FL-HPO problem setting, we make the following contributions:

• (§2) We present a novel framework, Federated Loss SuRface Aggregation (FLoRA), that leverages meta-learning techniques and asynchronous local HPO on each party to perform single-shot HPO for the global FL-HPO problem.
• (§2.3) We provide theoretical guarantees for the set of HPs selected by FLoRA, covering both IID and non-IID cases regardless of the convexity of the loss function. To the best of our knowledge, this is the first rigorous theoretical analysis for the FL-HPO problem and also the first optimality gap constructed in terms of the estimated loss given a target distribution.
• (§3) We evaluate FLoRA on the FL-HPO of Histogram-based Gradient Boosted Decision Trees (HGB), Support Vector Machines (SVM), and Multi-layered Perceptrons (MLP) on seven classification datasets from OpenML (Vanschoren et al., 2013), highlighting (i) its performance relative to baselines, and (ii) the effect of data heterogeneity.

In Figure 1, we present a snapshot of our empirical results, which highlights the communication overhead reduction we achieve with FLoRA while producing higher-quality models. As a baseline, we directly adopt an existing centralized HPO scheme that requires federated training of multiple models, and term this a "multi-shot" FL-HPO baseline. The figure also shows a "single-shot" baseline that uses curated HPs (described in §3); FLoRA is likewise single-shot. Figure 1 shows that the "multi-shot" approach requires a significantly larger number of FL model trainings (39 for MLP and 24 for HGB), and hence more communication, to find an HP that matches the performance of the HP found by FLoRA, highlighting the efficiency and effectiveness of FLoRA. This result demonstrates that FLoRA is an FL-HPO scheme that works with any ML model (HGB, MLP, etc.), providing competitive performance without significantly increasing the communication overhead of FL, since it requires only a single FL model training.

1.1. RELATED WORK

Performance optimization of FL systems. One of the main challenges in FL is achieving high accuracy with low communication overhead. FedAvg (McMahan et al., 2017a) is a predominant FL algorithm, and several optimization schemes build on it. Initially, communication optimizations

Figure 1: Communication overhead savings of FLoRA compared to "multi-shot" FL-HPO for the same level of performance. We consider two FL-HPO problems on the Electricity dataset (with HGB and MLP). We use the relative regret (defined in §3) of each scheme as the performance metric (lower is better), where a regret of 1 denotes the performance of the single-shot baseline while a regret of 0 implies optimal performance. FLoRA and the single-shot baseline require a single federated model training.
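
The relative regret used as the metric in Figure 1 is defined in §3. As a minimal sketch, one normalization consistent with the two anchor points stated for this metric (1 for the single-shot baseline, 0 for optimal performance) is shown below. The function name, the argument names, and the assumption that the underlying score is higher-is-better (e.g., accuracy) are ours for illustration and may differ from the exact definition in §3.

```python
def relative_regret(score: float, baseline_score: float, best_score: float) -> float:
    """Normalized gap between a scheme's score and the best score.

    Assumed form (illustrative, not taken from the paper):
        (best_score - score) / (best_score - baseline_score)
    so that the single-shot baseline maps to 1 and the best score maps to 0.
    Scores are assumed to be higher-is-better.
    """
    gap = best_score - baseline_score
    if gap <= 0:
        # The normalization is undefined if the baseline is already optimal.
        raise ValueError("best_score must be strictly greater than baseline_score")
    return (best_score - score) / gap


# Hypothetical scores, chosen only to illustrate the scale of the metric.
baseline, best = 0.80, 0.90
regret_of_baseline = relative_regret(baseline, baseline, best)  # baseline itself
regret_of_best = relative_regret(best, baseline, best)          # optimal HP
regret_midway = relative_regret(0.85, baseline, best)           # halfway in between
```

Under this assumed form, a scheme that beats the single-shot baseline has regret below 1, and a scheme that does worse than the baseline has regret above 1.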

