FEDX: FEDERATED LEARNING FOR COMPOSITIONAL PAIRWISE RISK OPTIMIZATION

Abstract

In this paper, we tackle a novel federated learning (FL) problem of optimizing a family of compositional pairwise risks, to which no existing FL algorithms are applicable. In particular, the objective has the form $\frac{1}{|S_1|}\sum_{z\in S_1} f\big(\frac{1}{|S_2|}\sum_{z'\in S_2}\ell(w; z, z')\big)$, where two sets of data $S_1, S_2$ are distributed over multiple machines, $\ell(\cdot;\cdot,\cdot)$ is a pairwise loss that depends only on the prediction outputs of the input data pair $(z, z')$, and $f(\cdot)$ is a possibly non-linear, non-convex function. This problem has important applications in machine learning, e.g., AUROC maximization with a pairwise loss and partial AUROC maximization with a compositional loss. The challenges in designing an FL algorithm lie in the non-decomposability of the objective over multiple machines and the interdependency between different machines. We propose two provable FL algorithms (FedX) for handling linear and non-linear $f$, respectively. To address the challenges, we decouple the gradient's components into two types, namely active parts and lazy parts: the active parts depend on local data and are computed with the local model, while the lazy parts depend on other machines and are communicated/computed based on historical models and samples. We develop a novel theoretical analysis to combat the latency of the lazy parts and the interdependency between the local model parameters and the data involved in computing local gradient estimators. We establish both iteration and communication complexities, and show that using historical samples and models for computing the lazy parts does not degrade the complexities. We conduct empirical studies of FedX for deep AUROC and partial AUROC maximization, and demonstrate its performance compared with several baselines.

1. INTRODUCTION

This work is motivated by solving the following optimization problem, arising in many ML applications, in a federated learning (FL) setting:
$$\min_{w\in\mathbb{R}^d} \; \frac{1}{|S_1|}\sum_{z\in S_1} f\Big(\underbrace{\frac{1}{|S_2|}\sum_{z'\in S_2}\ell(w; z, z')}_{g(w;\,z,\,S_2)}\Big),$$
where $S_1$ and $S_2$ denote two sets of data points that are distributed over many machines, $w$ denotes the model parameter of a prediction function $h_w(\cdot)\in\mathbb{R}^{d_o}$, $f(\cdot)$ is a deterministic function that could be linear or non-linear (possibly non-convex), and $\ell(w; z, z') = \ell(h_w(z), h_w(z'))$ denotes a pairwise loss that depends only on the prediction outputs of the input data $z, z'$. We refer to the above problem as the compositional pairwise risk (CPR) minimization problem. When $f$ is a linear function, the above problem is the classic pairwise loss minimization problem, which has applications in AUROC (AUC) maximization (Gao et al., 2013; Zhao et al., 2011; Gao & Zhou, 2015; Calders & Jaroszewicz, 2007; Charoenphakdee et al., 2019; Yang et al., 2021b), bipartite ranking (Cohen et al., 1997; Clémençon et al., 2008; Kotlowski et al., 2011; Dembczynski et al., 2012), and distance metric learning (Radenović et al., 2016; Wu et al., 2017; Yang et al., 2021b). When $f$ is a non-linear function, the above problem is a special case of the finite-sum coupled compositional optimization problem (Wang & Yang, 2022a), which has found applications in various performance measure optimization problems such as partial AUC maximization (Zhu et al., 2022), average precision maximization (Qi et al., 2021; Wang et al., 2022), NDCG maximization (Qiu et al., 2022), and p-norm
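For intuition, the CPR objective above can be sketched in plain Python. This is only an illustrative evaluation of the objective on raw scores, not the FedX algorithm itself; the squared-hinge pairwise surrogate and the function names are assumptions chosen for concreteness (a squared-hinge loss is one common choice in AUC maximization):

```python
import numpy as np

def cpr_objective(scores_s1, scores_s2, f=lambda g: g):
    """Compositional pairwise risk: (1/|S1|) * sum_{z in S1} f(g(w; z, S2)).

    scores_s1: model scores h_w(z) for z in S_1 (e.g., positive examples)
    scores_s2: model scores h_w(z') for z' in S_2 (e.g., negative examples)
    f: outer function; the identity recovers the classic pairwise loss,
       while a nonlinear f gives the compositional case (e.g., partial AUC).
    """
    # Illustrative pairwise surrogate: depends only on the two score outputs.
    def pairwise_loss(s1, s2, margin=1.0):
        return max(0.0, margin - (s1 - s2)) ** 2

    total = 0.0
    for s1 in scores_s1:
        # Inner average g(w; z, S_2) over the second data set.
        g = np.mean([pairwise_loss(s1, s2) for s2 in scores_s2])
        total += f(g)  # outer (possibly non-linear) function f
    return total / len(scores_s1)
```

In the federated setting the difficulty is precisely that the inner average $g(w; z, S_2)$ mixes data residing on different machines, which is what the active/lazy decomposition of FedX is designed to handle.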

