FEDX: FEDERATED LEARNING FOR COMPOSITIONAL PAIRWISE RISK OPTIMIZATION

Abstract

In this paper, we tackle a novel federated learning (FL) problem of optimizing a family of compositional pairwise risks, to which no existing FL algorithms are applicable. In particular, the objective has the form $\min_{w} \frac{1}{|S_1|}\sum_{z\in S_1} f\big(\frac{1}{|S_2|}\sum_{z'\in S_2} \ell(w; z, z')\big)$, where two sets of data $S_1, S_2$ are distributed over multiple machines, $\ell(\cdot;\cdot,\cdot)$ is a pairwise loss that only depends on the prediction outputs of the input data pairs $(z, z')$, and $f(\cdot)$ is a possibly non-linear, non-convex function. This problem has important applications in machine learning, e.g., AUROC maximization with a pairwise loss and partial AUROC maximization with a compositional loss. The challenges in designing an FL algorithm lie in the non-decomposability of the objective over multiple machines and the interdependency between different machines. We propose two provable FL algorithms (FedX) for handling linear and non-linear $f$, respectively. To address the challenges, we decouple the components of the gradient into two types, namely active parts and lazy parts: the active parts depend on local data and are computed with the local model, while the lazy parts depend on other machines and are communicated/computed based on historical models and samples. We develop a novel theoretical analysis to combat the latency of the lazy parts and the interdependency between the local model parameters and the data involved in computing local gradient estimators. We establish both iteration and communication complexities and show that using historical samples and models for computing the lazy parts does not degrade the complexities. We conduct empirical studies of FedX for deep AUROC and partial AUROC maximization, and demonstrate their performance compared with several baselines.

1. INTRODUCTION

This work is motivated by solving the following optimization problem arising in many ML applications in a federated learning (FL) setting:
$$\min_{w\in\mathbb{R}^d} \frac{1}{|S_1|}\sum_{z\in S_1} f\Big(\underbrace{\frac{1}{|S_2|}\sum_{z'\in S_2} \ell(w; z, z')}_{g(w;\,z,\,S_2)}\Big),$$
where $S_1$ and $S_2$ denote two sets of data points that are distributed over many machines, $w$ denotes the model parameter of a prediction function $h_w(\cdot)\in\mathbb{R}^{d_o}$, $f(\cdot)$ is a deterministic function that could be linear or non-linear (possibly non-convex), and $\ell(w; z, z') = \ell(h_w(z), h_w(z'))$ denotes a pairwise loss that only depends on the prediction outputs of the input data $z, z'$. We refer to the above problem as the compositional pairwise risk (CPR) minimization problem. When $f$ is a linear function, the above problem is the classic pairwise loss minimization problem, which has applications in AUROC (AUC) maximization (Gao et al., 2013; Zhao et al., 2011; Gao & Zhou, 2015; Calders & Jaroszewicz, 2007; Charoenphakdee et al., 2019; Yang et al., 2021b), bipartite ranking (Cohen et al., 1997; Clémençon et al., 2008; Kotlowski et al., 2011; Dembczynski et al., 2012), and distance metric learning (Radenović et al., 2016; Wu et al., 2017; Yang et al., 2021b). When $f$ is a non-linear function, the above problem is a special case of the finite-sum coupled compositional optimization problem (Wang & Yang, 2022a), which has found applications in optimizing various performance measures, such as partial AUC maximization (Zhu et al., 2022), average precision maximization (Qi et al., 2021; Wang et al., 2022), NDCG maximization (Qiu et al., 2022), p-norm push optimization (Rudin, 2009; Wang & Yang, 2022a), and distance metric learning (Sohn, 2016). We provide details of some examples of CPR minimization applications in Appendix A.
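To make the objective concrete, the following sketch evaluates a CPR risk directly from prediction scores. This is our own illustration: the squared-hinge pairwise surrogate and the choice f(g) = log(1 + g) are assumptions, picked as common instances for AUC-style problems, not definitions taken from this paper.

```python
import numpy as np

def pairwise_loss(score_pos, score_neg, margin=1.0):
    # Squared-hinge pairwise surrogate, one common choice in AUC maximization.
    return max(0.0, margin - (score_pos - score_neg)) ** 2

def cpr_objective(scores1, scores2, f=lambda g: g):
    # CPR: (1/|S1|) sum_{z in S1} f( (1/|S2|) sum_{z' in S2} loss(z, z') )
    # scores1, scores2: prediction outputs h_w(z) for z in S1 and z' in S2.
    total = 0.0
    for s in scores1:
        g = np.mean([pairwise_loss(s, sp) for sp in scores2])  # inner average g(w; z, S2)
        total += f(g)                                          # outer (possibly non-linear) f
    return total / len(scores1)

# Linear f recovers the classic pairwise risk; a non-linear f (as in, e.g.,
# partial AUC surrogates) composes over the inner average.
risk_linear = cpr_objective([2.0, 1.5], [0.5, 1.0])
risk_nonlin = cpr_objective([2.0, 1.5], [0.5, 1.0], f=lambda g: np.log(1.0 + g))
```

Note that the inner average over `scores2` is exactly the quantity that becomes unavailable locally once $S_2$ is scattered across machines.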
This is in sharp contrast with most existing studies on FL algorithms (Yang, 2013; Konečný et al., 2016; McMahan et al., 2017; Kairouz et al., 2021; Smith et al., 2018; Stich, 2018; Yu et al., 2019a;b; Khaled et al., 2020; Woodworth et al., 2020b;a; Karimireddy et al., 2020b; 2021; Haddadpour et al., 2019), which focus on the following empirical risk minimization (ERM) problem with the data set $S$ distributed over different machines: $\min_{w\in\mathbb{R}^d} \frac{1}{|S|}\sum_{z\in S} \ell(w; z)$. The major differences between CPR and ERM are that (1) the ERM objective is decomposable over training data, while the CPR objective is not; and (2) the data-dependent losses in ERM are decoupled between different data points, whereas the data-dependent loss in CPR couples different training data points. These differences pose a significant challenge for optimizing CPR in the FL setting, where the training data are distributed across different machines and are prohibited from being moved to a central server. In particular, the gradient of CPR cannot be written as a sum of local gradients at individual machines that depend only on the local data of those machines. Instead, the gradient of CPR at each machine depends not only on local data but also on data in other machines. As a result, the design of communication-efficient FL algorithms for optimizing CPR is much more complicated than that for ERM. In addition, the presence of a non-linear function $f$ makes the algorithm design and analysis even more challenging than with linear $f$: there are two levels of coupling in CPR with non-linear $f$, one at the pairwise loss $\ell(h_w(z), h_w(z'))$ and another at the non-linear risk $f(g(w; z, S_2))$, which makes stochastic gradient estimation considerably more delicate.
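The non-decomposability can be checked numerically with a toy example of our own construction (the scores and the two-machine split are hypothetical): for ERM the global risk equals the average of per-machine risks, but for a pairwise CPR it does not, because cross-machine pairs are never seen by any single machine.

```python
import numpy as np

# Hypothetical prediction scores held by two machines (positive/negative sets).
pos = {"m1": [2.0], "m2": [1.2]}
neg = {"m1": [0.9], "m2": [1.5]}

def pair_risk(P, N):
    # Linear-f CPR with 0-1 pairwise loss: fraction of mis-ranked (pos, neg) pairs.
    return float(np.mean([[float(p <= n) for n in N] for p in P]))

# Global pairwise risk over ALL pairs, including cross-machine ones:
all_pos = pos["m1"] + pos["m2"]
all_neg = neg["m1"] + neg["m2"]
global_risk = pair_risk(all_pos, all_neg)   # 1 mis-ranked pair out of 4 -> 0.25

# ERM-style aggregation: each machine uses only its OWN pairs, then average:
local_avg = np.mean([pair_risk(pos[k], neg[k]) for k in ("m1", "m2")])  # -> 0.5

# The two disagree, so the pairwise risk is not decomposable over machines.
```

An ERM objective aggregated the same way would match exactly, which is why local-SGD-style reasoning transfers to ERM but not to CPR.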
Although CPR can be optimized by existing algorithms in a centralized learning setting (Wang et al., 2017; Ghadimi et al., 2020; Hu et al., 2020; Wang & Yang, 2022a; Qi et al., 2021; Wang et al., 2022; Zhu et al., 2022; Chen et al., 2021), extending these algorithms to the FL setting is non-trivial. This is different from extending centralized algorithms for ERM to the FL setting. In the design and analysis of FL algorithms for ERM, the individual machines compute local gradients, update local models, and communicate periodically to average models. The rationale of local FL algorithms for ERM is that, as long as the gap error between local models and the averaged model is kept on par with the noise in the stochastic gradients by controlling the communication frequency, the convergence of local FL algorithms is not sacrificed and they enjoy the parallel speed-up of using multiple machines. However, this rationale is not sufficient for developing FL algorithms for CPR optimization, due to the challenges mentioned above. To address these challenges, we propose two novel FL algorithms named FedX1 and FedX2 for optimizing CPR with linear and non-linear $f$, respectively. The main innovation in the algorithm design lies in decoupling the gradient of the objective into two types of components: active parts and lazy parts. The active parts depend on data in local machines, and the lazy parts depend on data in other machines. We estimate the active parts using the local data and the local model, and estimate the lazy parts using information with delayed communications from other machines, computed at historical models in the previous round. In terms of analysis, the challenge is that the model used in computing the stochastic gradient estimator depends on the (historical) samples used for computing the lazy parts at the current iteration, which is only exacerbated in the presence of the non-linear function $f$.
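The active/lazy split can be sketched as follows. This is a conceptual toy under our own simplifying assumptions, a linear scorer h_w(z) = w·z and the squared pairwise loss (h(z) − h(z′) − 1)², chosen so that the inner average over remote data reduces to three communicable statistics; it is not the authors' exact FedX update rule.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
w = rng.normal(size=d)                # current local model on this machine
z_local = rng.normal(size=(4, d))     # fresh local data from S1 on this machine
zp_remote = rng.normal(size=(5, d))   # data from S2 held on OTHER machines

# Lazy parts: statistics of the remote S2 data, communicated at the previous
# round and therefore computed with a stale historical model w_hist.
w_hist = w + 0.01 * rng.normal(size=d)
m = zp_remote.mean(axis=0)                                    # mean of z'
s = float(np.mean(zp_remote @ w_hist))                        # mean of h(z')
q = (zp_remote * (zp_remote @ w_hist)[:, None]).mean(axis=0)  # mean of h(z') z'

def local_gradient(w, z_batch):
    # For each fresh local z (active part, current model), the gradient of
    # (1/|S2|) sum_{z'} (w.z - w.z' - 1)^2 equals 2(a(z - m) - s z + q) with
    # a = w.z - 1, where m, s, q are statistics of S2. FedX-style, we plug in
    # the stale lazy statistics above instead of recomputing them at w.
    grad = np.zeros_like(w)
    for z in z_batch:
        a = float(z @ w) - 1.0                   # active score term
        grad += 2.0 * (a * (z - m) - s * z + q)  # lazy m, s, q close the pair
    return grad / len(z_batch)

g = local_gradient(w, z_local)   # plug into any local SGD-style update
```

The analysis question the paper addresses is precisely how much the staleness of `m`, `s`, `q` (the latency error of the lazy parts) costs relative to ordinary stochastic-gradient noise.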
We develop a novel analysis that allows us to transfer the error of the gradient estimator into the latency error of the lazy parts and the gap error between local models and the global model. Hence, the rationale is that, as long as the latency error of the lazy parts and the gap error between local models and the global model are on par with the noise in the stochastic gradient estimator, we are able to achieve convergence and linear speed-up. The main contributions of this work are summarized as follows:
• We propose two novel communication-efficient algorithms, FedX1 and FedX2, for optimizing CPR with linear and non-linear $f$, respectively. Besides communicating local models, the proposed algorithms only need to communicate local prediction outputs periodically.
• We perform novel technical analysis to prove the convergence of both algorithms. We show that both algorithms enjoy parallel speed-up in terms of the iteration complexity, and a lower-order communication complexity.
• We conduct empirical studies on two tasks, federated deep partial AUC optimization with a compositional loss and federated deep AUC optimization with a pairwise loss, and demonstrate the advantages of the proposed algorithms over several baselines.

