HANDLING COVARIATE SHIFTS IN FEDERATED LEARNING WITH GENERALIZATION GUARANTEES

Abstract

Covariate shift across clients is a major challenge for federated learning (FL). This work studies the generalization properties of FL under intra-client and inter-client covariate shifts. To this end, we propose Federated Importance-weighteD Empirical risk Minimization (FIDEM) to optimize a global FL model, along with new variants of density ratio matching methods, aiming to handle covariate shifts. These methods trade off some level of privacy for improved overall generalization performance. We theoretically show that FIDEM achieves a smaller generalization error than classical empirical risk minimization under certain settings. Experimental results demonstrate the superiority of FIDEM over federated averaging (McMahan et al., 2017) and other baselines, which would open the door to studying FL under distribution shifts more systematically.

1. INTRODUCTION

Federated learning (FL) (Li et al., 2020; Kairouz et al., 2021; Wang et al., 2021) is an efficient and powerful paradigm for collaboratively training a shared machine learning model among multiple clients, such as hospitals and cellphones, without sharing local data. The existing FL literature mainly focuses on training a model under the classical empirical risk minimization (ERM) paradigm of learning theory, implicitly assuming that the training and test data distributions of each client are the same. However, this stylized setup overlooks the specific requirements of each client. Statistical heterogeneity is a major challenge for FL, which has mainly been studied in terms of non-identical data distributions across clients, i.e., inter-client distribution shifts (Li et al., 2020; Kairouz et al., 2021; Wang et al., 2021). Even for a single client, the distribution shift between training and test data, i.e., intra-client distribution shift, has been a major challenge for decades (Wang & Deng, 2018; Kouw & Loog, 2019, and references therein). For instance, scarce disease data for training and test in a local hospital can be different. To adequately address the statistical heterogeneity challenge in FL, we need to handle both intra-client and inter-client distribution shifts under stringent requirements on privacy and communication costs.

We focus on the overall generalization performance across multiple clients by considering both intra-client and inter-client distribution shifts. There are three major challenges in tackling this problem: 1) how to modify classical ERM to obtain an unbiased estimate of an overall true risk minimizer under intra-client and inter-client distribution shifts; 2) how to develop an efficient density ratio estimation method under the stringent privacy requirements of FL; 3) whether there are theoretical guarantees for the modified ERM under the improved density ratio method in FL. We aim to address the above challenges in our new paradigm for FL.
For simplicity of description, in our problem setting we focus on covariate shift, which is the most commonly used and studied type of distribution shift in theory and practice (Sugiyama et al., 2007; Kanamori et al., 2009; Kato & Teshima, 2021; Uehara et al., 2020; Tripuraneni et al., 2021; Zhou & Levine, 2021).¹ To be specific, for any client $k$, covariate shift assumes the conditional distribution $p_k^{tr}(y|x) = p_k^{te}(y|x) := p(y|x)$ remains the same, while the marginal distributions $p_k^{tr}(x)$ and $p_k^{te}(x)$ can be arbitrarily different, which gives rise to intra-client and inter-client covariate shifts. Handling covariate shift is a challenging issue, especially in federated settings (Kairouz et al., 2021). To this end, motivated by Sugiyama et al. (2007) under the classical covariate shift setting, we propose Federated Importance-weighteD Empirical risk Minimization (FIDEM), which accounts for covariate shifts across multiple clients in FL. We show that the global model learned under intra/inter-client covariate shifts is still unbiased in terms of minimizing the overall true risk, i.e., FIDEM is consistent in FL. To handle covariate shifts accurately, we propose a histogram-based density ratio matching method (DRM) under both intra/inter-client distribution shifts. Our method unifies well-known DRMs in FL and is of independent interest to the distribution shift community for ratio estimation (Zadrozny, 2004; Huang et al., 2006; Sugiyama et al., 2007; Kanamori et al., 2009; Sugiyama et al., 2012; Zhang et al., 2020; Kato & Teshima, 2021). To fully eliminate any privacy risks, we introduce another variant of FIDEM, termed Federated Independent Importance-weighteD Empirical risk Minimization (FIIDEM). It does not require any form of data sharing among clients and preserves the same level of privacy and the same communication costs as the baseline federated averaging (FedAvg) (McMahan et al., 2017). An overview of FIDEM is shown in Fig. 1.

1.1. TECHNICAL CHALLENGES AND CONTRIBUTIONS

Learning on multiple clients in FL under covariate shifts via importance-weighted ERM is challenging due to multiple data owners with their own learning objectives, multiple potential but unpredictable train/test shift scenarios, privacy, and communication costs (Kairouz et al., 2021). To be specific: 1) It is non-trivial to control privacy leakage to other clients while estimating ratios and to relax the requirement of having perfect estimates of the supremum over true ratios, which is a key step for non-negative BD (nnBD) DRM. Our work is the first step towards handling intra/inter-client distribution shifts in FL. 2) It is challenging to obtain per-client generalization bounds for a general nnBD DRM with multiple clients and imperfect estimates of the supremum due to intra/inter-client couplings in the ratios. Note that, even with access to perfect estimates of density ratios, it is still unclear whether importance-weighted ERM results in smaller excess risk compared to classical ERM. Our work makes an initial attempt by providing an affirmative answer for ridge regression. 3) While well-established benchmarks for multi-client FL have been used, they are usually designed so that each client's test samples are drawn uniformly from a set of classes. We believe this might not be the case in real-world applications, and we therefore design realistic experimental settings in our work. To address these technical challenges, we:
• Algorithmically propose an intuitive framework to minimize the average test error in FL, design efficient mechanisms to control privacy leakage while estimating ratios (FIDEM) along with a privacy-preserving and communication-efficient variant (FIIDEM), and improve nnBD DRM under FL without requiring perfect knowledge of the supremum over true ratios.
• Theoretically establish generalization guarantees for general nnBD DRM with multiple clients under imperfect estimates of the supremum, which unifies a number of DRMs, and show the benefits of importance weighting in terms of excess risk, decoupled from density ratio estimation, through a bias-variance decomposition.
• Experimentally demonstrate more than 16% overall test accuracy improvement over existing FL baselines when training ResNet-18 (He et al., 2016) on CIFAR10 (Krizhevsky) in federated settings with challenging imbalanced data distribution shifts across clients.
In conclusion, we expand the concept and application scope of FL to a general setting under intra/inter-client covariate shifts, provide an in-depth theoretical understanding of learning with FIDEM via a general DRM, and experimentally validate the utility of the proposed framework. We hope that our work opens the door to a new FL paradigm.

1.2. RELATED WORK

In this section, we provide a summary of related work; see Appendix B for a complete discussion. The current FL literature largely focuses on minimizing the empirical risk under the assumption that training and test data distributions are the same for each client (Li et al., 2020; Kairouz et al., 2021; Wang et al., 2021). In contrast, we focus on learning under both intra-client and inter-client covariate shifts. Communication-efficient, robust, and secure aggregations can be viewed as complementary technologies, which can be used along with FIDEM to improve its scalability and security while addressing overall generalization. Shimodaira (2000) introduced covariate shift, where the training and test input distributions differ while the conditional distribution of the output variable given the input variable remains unchanged. Importance-weighted ERM is widely used to improve generalization performance under covariate shift (Zadrozny, 2004; Sugiyama & Müller, 2005; Huang et al., 2006; Sugiyama et al., 2007; Kanamori et al., 2009; Sugiyama et al., 2012; Fang et al., 2020; Zhang et al., 2020; Kato & Teshima, 2021). Sugiyama et al. (2012) proposed a Bregman divergence-based DRM, which unifies various DRMs. Kato & Teshima (2021) proposed a non-negative Bregman divergence-based DRM when using deep neural networks for density ratio estimation. Our work largely differs from Kato & Teshima (2021) in our problem setting that allows multiple clients, our algorithm design to estimate different ratios across clients while controlling privacy leakage, and our theoretical analyses showing the benefit of importance weighting in generalization.

2. COVARIATE SHIFT AND FIDEM FOR FL

We first present the problem setting under intra/inter-client covariate shifts, and then describe the proposed FIDEM as an unbiased estimate in terms of minimizing the overall true risk.²

2.1. PROBLEM SETTING

Let $\mathcal{X} \subseteq \mathbb{R}^{d_x}$ be a compact metric space, $\mathcal{Y} \subseteq \mathbb{R}^{d_y}$, and $K$ be the number of clients in an FL setting. Let $S_k = \{(x_{k,i}^{tr}, y_{k,i}^{tr})\}_{i=1}^{n_k^{tr}}$ denote the training set of client $k$ with $n_k^{tr}$ samples drawn i.i.d. from an unknown probability distribution $p_k^{tr}$ on $\mathcal{X} \times \mathcal{Y}$.³ The test data of client $k$ are drawn from another unknown probability distribution $p_k^{te}$ on $\mathcal{X} \times \mathcal{Y}$. Under the covariate shift setting (Sugiyama et al., 2007; Kanamori et al., 2009; Kato & Teshima, 2021; Uehara et al., 2020; Tripuraneni et al., 2021; Zhou & Levine, 2021), the conditional distribution $p_k^{tr}(y|x) = p_k^{te}(y|x) := p(y|x)$ is assumed to be the same for all $k$, while $p_k^{tr}(x)$ and $p_k^{te}(x)$ can be arbitrarily different, which gives rise to intra-client and inter-client covariate shifts. We consider supervised learning, where the goal is to find a hypothesis $h_w : \mathcal{X} \to \mathcal{Y}$, parameterized by $w \in \mathbb{R}^d$, e.g., the weights and biases of a neural network, such that $h_w(x)$ (for short, $h(x)$) is a good approximation of the label $y \in \mathcal{Y}$ corresponding to a new sample $x \in \mathcal{X}$. Let $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$ denote a loss function. In our FL setting, the true (expected) risk of client $k$ is given by $R_k(h_w) = \mathbb{E}_{(x,y) \sim p_k^{te}(x,y)}[\ell(h_w(x), y)]$.
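To make the covariate shift assumption concrete, the following is a minimal toy simulation (our own construction, not from the paper) of intra-client covariate shift for a single client: the labelling rule $p(y|x)$ is shared by train and test, while the input marginals differ.

```python
import numpy as np

# Toy sketch of covariate shift: a shared conditional p(y|x) applied to
# two different input marginals (Gaussians with shifted means).
rng = np.random.default_rng(0)

def label(x):
    # Shared decision rule p(y|x): identical on train and test inputs.
    return (x.sum(axis=1) > 0).astype(int)

x_tr = rng.normal(loc=-0.5, scale=1.0, size=(1000, 2))  # training marginal
x_te = rng.normal(loc=+0.5, scale=1.0, size=(1000, 2))  # shifted test marginal
y_tr, y_te = label(x_tr), label(x_te)
```

Only the marginals move; any given $x$ receives the same label under both distributions, which is exactly the covariate shift assumption above.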

2.2. FIDEM FOR FL UNDER COVARIATE SHIFT

For a scenario with $K$ clients, we first focus on minimizing $R_l$ ($l \in [K]$) under intra/inter-client covariate shifts, i.e., $p_k^{tr}(x) \neq p_l^{te}(x)$ for all $k$. We then formulate FIDEM to minimize the average test error over $K$ clients under covariate shifts by optimizing a global model in our FL setting.

FIDEM for one client. Under $p_k^{tr}(x) \neq p_l^{te}(x)$ for all $k$, FIDEM focusing on minimizing $R_l$ is given by:
$$\min_{w \in \mathbb{R}^d} \; \sum_{k=1}^{K} \frac{1}{n_k^{tr}} \sum_{i=1}^{n_k^{tr}} \frac{p_l^{te}(x_{k,i}^{tr})}{p_k^{tr}(x_{k,i}^{tr})} \, \ell(h_w(x_{k,i}^{tr}), y_{k,i}^{tr}). \quad (2.1)$$
In Appendix C, we elaborate on four special cases of the above scenario, i.e., $p_k^{tr}(x) \neq p_l^{te}(x)$ for all $k$, focusing on one client under various covariate shifts, and formulate their FIDEMs.

Proposition 1. Let $l \in [K]$. FIDEM in Eq. (2.1) is consistent, i.e., the learned function converges in probability to the optimal function in terms of minimizing $R_l$.

See Appendix C for the proof. Proposition 1 implies that, under intra/inter-client covariate shifts, FIDEM outputs an unbiased estimate of a true risk minimizer of client $l$. In Appendix C.1, we show the usefulness of importance weighting under inter-client covariate shifts without intra-client covariate shifts, which is a special and important case of our setting. Building on Eq. (2.1), which aims to minimize $R_l$, we now formulate FIDEM to minimize the average test error over all clients and explain its costs and benefits for federated settings.

FIDEM for $K$ clients. Let $w$ be the global model. For $K$ clients under intra/inter-client covariate shifts, FIDEM minimizes the average test error over all clients and is formulated as:
$$\min_{w \in \mathbb{R}^d} \; F(w) := \sum_{k=1}^{K} F_k(w), \quad \text{(FIDEM)}$$
where
$$F_k(w) = \frac{1}{n_k^{tr}} \sum_{i=1}^{n_k^{tr}} \sum_{l=1}^{K} \frac{p_l^{te}(x_{k,i}^{tr})}{p_k^{tr}(x_{k,i}^{tr})} \, \ell(h_w(x_{k,i}^{tr}), y_{k,i}^{tr}). \quad (2.2)$$
Each client requires an estimate of a ratio of the form of a sum of test densities over its own training density. We emphasize that $F_k(w)$ should not be viewed as the local loss function of client $k$.
Our formulation FIDEM is meant to minimize the overall test error over all clients given intra/inter-client covariate shifts. To solve FIDEM, we employ the stochastic gradient descent (SGD) algorithm for $T$ iterations starting from an initial parameter $w_0$:
$$w_{t+1} = w_t - \eta_t \sum_{k=1}^{K} g_k(w_t),$$
where $\eta_t > 0$ is the step size, $g_k(w_t)$ is an unbiased estimate of $\nabla_w F_k(w_t)$, and $w_T$ is the output. Under no covariate shift, both FIDEM and classical ERM result in the same solution, which is a minimizer of the overall empirical risk. The main difference arises under intra-client and inter-client covariate shifts. In those challenging settings, FIDEM's solution is an unbiased estimate of a minimizer of the overall true risk, while the solution of ERM minimizes the overall empirical risk.
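The update above can be sketched in a few lines. The following is a minimal illustration (our own simplification) of FIDEM-style SGD for a linear model with squared loss; the ratio values are made-up placeholders, since their estimation is the subject of Section 3.

```python
import numpy as np

# FIDEM-style SGD sketch: each client contributes an importance-weighted
# gradient, and the server sums them.  `ratios[k][i]` stands in for the
# estimated r_k(x_{k,i}) = sum_l p_l^te(x) / p_k^tr(x).
rng = np.random.default_rng(1)
K, n, d = 3, 50, 5
X = [rng.normal(size=(n, d)) for _ in range(K)]           # client inputs
y = [x @ np.ones(d) + 0.1 * rng.normal(size=n) for x in X]  # noisy labels
ratios = [np.abs(rng.normal(1.0, 0.2, size=n)) for _ in range(K)]  # placeholders

def weighted_grad(w, Xk, yk, rk):
    # Unbiased estimate of grad F_k: importance-weighted squared-loss gradient.
    resid = Xk @ w - yk
    return Xk.T @ (rk * resid) / len(yk)

w = np.zeros(d)
eta = 0.05
for t in range(200):
    w = w - eta * sum(weighted_grad(w, X[k], y[k], ratios[k]) for k in range(K))
```

With ratios near one, the iterate converges to (a weighted version of) the least-squares solution; only the per-sample weights distinguish this from vanilla FedAvg-style SGD.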

2.3. PRIVACY, COMMUNICATION, AND COMPUTATION IN FL

Privacy and communication efficiency are major concerns in FL (Kairouz et al., 2021). We elaborate on them and introduce another variant of FIDEM with the same guarantees and costs as FedAvg.

Communication/computational costs and security benefits. Compared to classical ERM, the communication/computational overhead of FIDEM is negligible.⁴ To solve FIDEM, client $k$ should compute an unbiased estimate of the weighted gradient $\nabla_w F_k(w_t)$, which requires a single backward pass at a single parameter $w = w_t$. Hence, given the ratios, there is no extra computational/communication overhead compared to classical ERM. Clients compute the ratios in parallel. In Appendix E, we provide a concrete example and show that the number of communication bits needed during training in standard FL is usually many orders of magnitude larger than the size of the samples shared for estimating the ratios. To further reduce the communication costs of density ratio estimation and gradient aggregation, compression methods such as quantization, sparsification, and local updating rules can be used along with FIDEM on the fly. More importantly, due to importance weighting, $g_k(w)$ can be arbitrarily different from an unbiased stochastic gradient of classical ERM for client $k$, i.e., $\frac{1}{n_k^{tr}} \sum_{i=1}^{n_k^{tr}} \nabla_w \ell(h_w(x_{k,i}^{tr}), y_{k,i}^{tr})$. The formulation FIDEM makes it impossible for an adversary to apply a gradient inversion attack and obtain private training data of clients (Zhu et al., 2019). In particular, the attacker cannot recover the vanilla (stochastic) gradients and reconstruct data unless the attacker has perfect knowledge of the ratio $r_k(x) = \sum_{l=1}^{K} p_l^{te}(x) / p_k^{tr}(x)$.

Privacy. Given $\{r_k(x)\}_{k=1}^{K}$, FIDEM efficiently minimizes the overall test error over all clients in a privacy-preserving manner. To estimate those ratios, if clients can tolerate some level of privacy leakage, they send unlabelled samples $x_{l,j}^{te}$ for $l \in [K]$ and $j \in [n^{te}]$ from their test distributions.
To control privacy leakage to other clients, we propose that the server randomly shuffle these unlabelled samples before broadcasting them to clients. In Appendix Q, we discuss an alternative to sending original unlabelled samples and its limitations. To fully eliminate any privacy risks compared to classical ERM, clients may opt to minimize the following surrogate objective, which we name Federated Independent Importance-weighteD Empirical risk Minimization (FIIDEM):
$$\min_{w} \; F(w) := \sum_{k=1}^{K} \frac{1}{n_k^{tr}} \sum_{i=1}^{n_k^{tr}} \frac{p_k^{te}(x_{k,i}^{tr})}{p_k^{tr}(x_{k,i}^{tr})} \, \ell(h_w(x_{k,i}^{tr}), y_{k,i}^{tr}). \quad \text{(FIIDEM)}$$
The formulation FIIDEM preserves the same level of privacy and the same communication costs as classical ERM, e.g., FedAvg. However, to exploit the entire data distributed among all clients and achieve the optimal global model in terms of overall test error, clients need to compromise some level of privacy and share unlabelled test samples with the server. Hence, in this paper, we focus on the original objective in FIDEM.

3. RATIO ESTIMATION FOR FL UNDER COVARIATE SHIFT

To solve FIDEM, client $k$ should have access to an accurate estimate of the ratio
$$r_k(x) = \frac{\sum_{l=1}^{K} p_l^{te}(x)}{p_k^{tr}(x)}. \quad (3.1)$$
Ratio estimation is a key step for importance weighting (Sugiyama et al., 2007; 2012). The discrepancy between the true ratio $r_k^*$ for client $k$ in Eq. (3.1) and the estimate $r_k$ produced by our ratio model can be measured by $\mathbb{E}_{p_k^{tr}}[\mathrm{BD}_f(r_k^*(x) \,\|\, r_k(x))]$, where the Bregman divergence (BD) associated with a strictly convex $f$ leads to BD-based DRMs (Kato & Teshima, 2021; Kiryo et al., 2017):

Definition 1 (Bregman, 1967). Let $\mathcal{B}_f \subset [0, \infty)$ and $f : \mathcal{B}_f \to \mathbb{R}$ be a strictly convex function with bounded gradient. The BD associated with $f$ from $z^*$ to $z$ is given by $\mathrm{BD}_f(z^* \,\|\, z) = f(z^*) - f(z) - \nabla f(z)(z^* - z)$.

Note that $\mathrm{BD}_f(z^* \,\|\, z)$ is a convex function w.r.t. $z^*$; however, it is not necessarily convex w.r.t. $z$. Motivated by Kato & Teshima (2021) and Kiryo et al. (2017), we propose a new histogram-based DRM (HDRM) for FL with multiple clients. HDRM overcomes the over-fitting issue (Kiryo et al., 2017; Kato & Teshima, 2021) while providing an estimate of the upper bound $\bar{r}_k = \sup_{x \in \mathcal{X}^{tr}} r_k^*(x)$, which is a key step for non-negative BD (nnBD) DRM. We now extend nnBD DRM to FL settings.

3.1. EXTENSION OF NNBD DRM TO FL

We assume that $p_k^{tr}(x^{tr}) > 0$ for $k \in [K]$ and all $x^{tr} \in \mathcal{X}^{tr} \subseteq \mathcal{X}$ with $\mathcal{X}^{te} \subseteq \mathcal{X}^{tr}$, i.e., we need a common data domain with strictly positive training density, which is a common assumption (Kanamori et al., 2009; Kato & Teshima, 2021). Let $\mathcal{H}_r \subset \{r : \mathcal{X} \to \mathcal{B}_f\}$ denote a hypothesis class for our ratios $r_k$, e.g., neural networks with a given architecture. Our goal is to estimate $r_k$ by minimizing the discrepancy $\mathbb{E}_{p_k^{tr}}[\mathrm{BD}_f(r_k^*(x) \,\|\, r_k(x))]$. However, the empirical estimate of this objective can diverge for flexible models, since the term $-\frac{1}{n^{te}} \sum_{j=1}^{n^{te}} \sum_{l=1}^{K} \nabla f(r_k(x_{l,j}^{te}))$ diverges if there is no lower bound on it (Kiryo et al., 2017; Kato & Teshima, 2021). To resolve this issue in FL, we consider the non-negative BD (nnBD) DRM for client $k$, i.e., $\min_{r_k \in \mathcal{H}_r} \hat{E}^+_f(r_k)$, where
$$\hat{E}^+_f(r_k) = \mathrm{ReLU}\Big( \frac{1}{n_k^{tr}} \sum_{i=1}^{n_k^{tr}} \ell_1(r_k(x_{k,i}^{tr})) - \frac{C_k}{n^{te}} \sum_{j=1}^{n^{te}} \sum_{l=1}^{K} \ell_1(r_k(x_{l,j}^{te})) \Big) + \frac{1}{n^{te}} \sum_{j=1}^{n^{te}} \sum_{l=1}^{K} \ell_2(r_k(x_{l,j}^{te})), \quad (3.2)$$
$\mathrm{ReLU}(z) = \max\{0, z\}$, $0 < C_k < \frac{1}{\bar{r}_k}$, $\bar{r}_k = \sup_{x \in \mathcal{X}^{tr}} r_k^*(x)$, $\ell_1(z) = \nabla f(z) z - f(z)$, and $\ell_2(z) = C_k(\nabla f(z) z - f(z)) - \nabla f(z)$. Intuitively, ReLU is used for non-negativity, and $0 < C_k < \frac{1}{\bar{r}_k}$ acts as a regularization parameter. Substituting different $f$'s into Eq. (3.2) leads to different variants of nnBD, which cover previous work (Basu et al., 1998; Hastie et al., 2001; Gretton et al., 2009; Nguyen et al., 2010; Kato et al., 2019). We provide explicit expressions of those variants for client $k$ in Appendix H. In this work, we focus on $f(z) = \frac{(z-1)^2}{2}$, leading to the well-known least-squares importance fitting (LSIF) variant of nnBD for client $k$.
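For concreteness, the empirical objective in Eq. (3.2) specialized to LSIF can be sketched as follows (a simplified illustration in our own notation; `r` stands in for the learned ratio model). For $f(z) = (z-1)^2/2$ we have $\nabla f(z) = z - 1$, so $\ell_1(z) = (z^2-1)/2$ and $\ell_2(z) = C_k \ell_1(z) - (z-1)$.

```python
import numpy as np

# LSIF specialization of the nnBD objective (3.2): f(z) = (z-1)^2 / 2.

def l1(z):
    # l1(z) = grad f(z) * z - f(z) = (z^2 - 1) / 2 for LSIF.
    return (z ** 2 - 1.0) / 2.0

def l2(z, C_k):
    # l2(z) = C_k * l1(z) - grad f(z).
    return C_k * l1(z) - (z - 1.0)

def nnbd_objective(r, x_tr_k, x_te_all, C_k):
    # x_te_all: list of K arrays, each holding n_te unlabelled test samples.
    n_te = len(x_te_all[0])
    tr_term = np.mean(l1(r(x_tr_k)))
    te_l1 = sum(np.sum(l1(r(x_l))) for x_l in x_te_all) / n_te
    te_l2 = sum(np.sum(l2(r(x_l), C_k)) for x_l in x_te_all) / n_te
    # ReLU clamps the bracketed term at zero for non-negativity.
    return max(0.0, tr_term - C_k * te_l1) + te_l2
```

As a sanity check, for the constant model $r(x) \equiv 1$ (no shift), $\ell_1(1) = \ell_2(1) = 0$ and the objective vanishes.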

3.2. ESTIMATION OF THE UPPER BOUND r k

Estimating $\bar{r}_k = \sup_{x \in \mathcal{X}^{tr}} r_k^*(x)$ is a key step for nnBD DRM. For a single train and test distribution, it has been shown that overestimating $\bar{r}$ leads to significant performance degradation (Kato & Teshima, 2021, Section 5). To estimate $\bar{r}_k$, we first partition $\mathcal{X}^{tr}$ into $M$ bins. For each bin $B_m$, if there exists some $x_{k,i}^{tr} \in B_m$, then we define
$$\bar{r}_{k,m} := \sum_{l=1}^{K} \frac{\Pr\{X_l^{te} \in B_m\}}{\Pr\{X_k^{tr} \in B_m\}} \simeq \frac{\frac{1}{n^{te}} \sum_{j=1}^{n^{te}} \sum_{l=1}^{K} \mathbb{1}(x_{l,j}^{te} \in B_m)}{\frac{1}{n_k^{tr}} \sum_{i=1}^{n_k^{tr}} \mathbb{1}(x_{k,i}^{tr} \in B_m)}$$
for $m \in [M]$; otherwise, $\bar{r}_{k,m} = 0$. Finally, we propose to use $C_k = \frac{1}{\bar{r}_k}$, where $\bar{r}_k = \max\{\bar{r}_{k,1}, \cdots, \bar{r}_{k,M}\}$. Convergence of $\bar{r}_k$ to the true supremum is established in Appendix G. Furthermore, for high-dimensional data, an efficient implementation of HDRM using k-means clustering is provided in Appendix G.

In HDRM, the $K$ clients estimate their ratios in parallel. To be specific, clients first share unlabelled test samples with the server. The server returns the randomly shuffled pool of samples to all clients. The clients then compute the $C_k$'s in parallel. Given the $C_k$'s, clients estimate their corresponding ratios in parallel. To handle high-dimensional data samples and deep ratio estimation models, we adopt a variant of SGD. For client $k$, we divide the unlabelled samples $\{x_{k,i}^{tr}\}_{i=1}^{n_k^{tr}}$ and $\{x_{l,j}^{te}\}_{j=1}^{n^{te}}$ for $l \in [K]$ into $N_k$ batches $\{x_{k,n,i}^{tr}\}_{i=1}^{B_k^{tr}}$ and $\{x_{k,n,j}^{te}\}_{j=1}^{B_k^{te}}$ for $n \in [N_k]$. Client $k$ first computes
$$\frac{1}{B_k^{tr}} \sum_{i=1}^{B_k^{tr}} \ell_1(r_k(x_{k,n,i}^{tr})) - \frac{K C_k}{B_k^{te}} \sum_{j=1}^{B_k^{te}} \ell_1(r_k(x_{k,n,j}^{te})).$$
If this term becomes negative, we apply a gradient ascent step to increase it. We may also apply 1-norm or 2-norm regularization. The details of the HDRM algorithm are shown in Algorithm 1 in Appendix A.
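The histogram step above is simple to implement. The following sketch (for scalar inputs; bin edges and data are illustrative, not from the paper) estimates $\bar{r}_{k,m}$ per bin from empirical frequencies and returns $C_k = 1/\max_m \bar{r}_{k,m}$:

```python
import numpy as np

# HDRM histogram step: per-bin empirical density ratios, then C_k.

def hdrm_c_k(x_tr_k, x_te_all, edges):
    # x_te_all: list of K arrays of unlabelled test samples, n_te each.
    n_tr, n_te = len(x_tr_k), len(x_te_all[0])
    tr_counts, _ = np.histogram(x_tr_k, bins=edges)
    te_counts = sum(np.histogram(x_l, bins=edges)[0] for x_l in x_te_all)
    rbar = np.zeros(len(tr_counts))
    occupied = tr_counts > 0  # bins holding at least one training sample
    rbar[occupied] = (te_counts[occupied] / n_te) / (tr_counts[occupied] / n_tr)
    return 1.0 / rbar.max()

rng = np.random.default_rng(2)
x_tr = rng.uniform(0.0, 1.0, 500)
x_te_all = [rng.uniform(0.0, 1.0, 500) for _ in range(3)]
C_k = hdrm_c_k(x_tr, x_te_all, edges=np.linspace(0.0, 1.0, 11))
```

With $K = 3$ identical uniform distributions, each $\bar{r}_{k,m}$ concentrates around $3$ (the sum over $l$), so $C_k$ comes out near $1/3$.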

4. THEORETICAL GUARANTEES

To address learning on multiple clients in FL, it is essential to obtain per-client generalization bounds for a general nnBD DRM with imperfect estimates of the $\bar{r}_k$'s. Moreover, even with access to perfect estimates of the density ratios, the usefulness of importance weighting is still unclear. In this section, we first study high-probability generalization guarantees of nnBD DRM under an imperfect estimate of $\bar{r}_k$ in terms of BD risk. We then show the benefit of importance weighting in terms of excess risk through a refined bias-variance decomposition for a ridge regression problem. Theorem 1, Lemma 1, and Theorem 2 are proved in Appendix I, Appendix L, and Appendix M, respectively.

4.1. GENERALIZATION ERROR IN TERMS OF BD RISK

We establish a high-probability bound on the generalization error of nnBD DRM with an arbitrary $f$ for client $k$ in terms of the BD risk
$$E_f(r_k) = \tilde{E}_k[\ell_1(r_k(x))] + \sum_{l=1}^{K} \mathbb{E}_{p_l^{te}}[\ell_2(r_k(x))], \quad (4.1)$$
where $\tilde{E}_k := \mathbb{E}_{p_k^{tr}} - C_k \sum_{l=1}^{K} \mathbb{E}_{p_l^{te}}$. Our bound for client $k$ depends on the Rademacher complexity (Shalev-Shwartz & Ben-David, 2014) of the hypothesis class $\mathcal{H}_r$, with $\ell_1(z) = \nabla f(z) z - f(z)$ and $\ell_2(z) = C_k(\nabla f(z) z - f(z)) - \nabla f(z)$.

Assumption 1 (Basic assumptions on $\ell_1$ and $\ell_2$). We assume: 1) $\sup_{z \in \mathcal{B}_f} \max_{i \in \{1,2\}} |\ell_i(z)| < \infty$; 2) $\ell_1$ is $L_1$-Lipschitz and $\ell_2$ is $L_2$-Lipschitz on $\mathcal{X}$; 3) $\inf_{r \in \mathcal{H}_r} \tilde{E}_k[\ell_1(r_k(x))] > 0$ for $k \in [K]$.

The first two assumptions are satisfied if $\inf\{z \mid z \in \mathcal{B}_f\} > 0$ for commonly used loss functions, e.g., unnormalized Kullback-Leibler and logistic regression. The third assumption is mild and commonly used in the DRM literature (Kiryo et al., 2017; Lu et al., 2020; Kato & Teshima, 2021).

Theorem 1 (Generalization error bound for client $k$). Let $f$ be a strictly convex function with bounded gradient. Denote $\Delta_\ell := \sup_{z \in \mathcal{B}_f} \max_{i \in \{1,2\}} |\ell_i(z)|$, $\hat{r}_k := \arg\min_{r_k \in \mathcal{H}_r} \hat{E}^+_f(r_k)$, and $r_k^* := \arg\min_{r_k \in \mathcal{H}_r} E_f(r_k)$, where $\hat{E}^+_f$ and $E_f$ are defined in Eqs. (3.2) and (4.1), respectively. Suppose that $\ell_1$ and $\ell_2$ satisfy Assumption 1. Then for any $0 < \delta < 1$, with probability at least $1 - \delta$:
$$E_f(\hat{r}_k) - E_f(r_k^*) \lesssim \mathfrak{R}_{n_k^{tr}}^{p_k^{tr}}(\mathcal{H}_r) + C_k \sum_{l=1}^{K} \mathfrak{R}_{n^{te}}^{p_l^{te}}(\mathcal{H}_r) + \sqrt{\Upsilon \log \tfrac{1}{\delta}} + K C_k \Delta_\ell \exp\big(-\tfrac{1}{\Upsilon}\big), \quad (4.2)$$
where $\Upsilon = \Delta_\ell^2 \big(1/n_k^{tr} + C_k^2 K / n^{te}\big)$.

Remark 1. Theorem 1 provides generalization guarantees for a general nnBD DRM in a federated setting under a strictly convex $f$ with bounded gradient. We make the following remarks. 1) Our results are general enough to cover various ratio models. For example, in Corollary 1 of Appendix J, we consider neural networks of depth $L$ with bounded Frobenius norms $\|W_i\|_F \leq \Delta_{W_i}$ and establish explicit generalization error bounds for client $k$ of order $\mathcal{O}\big(\sqrt{L} \prod_{i=1}^{L} \Delta_{W_i} (1/\sqrt{n_k^{tr}} + K/\sqrt{n^{te}}) + \sqrt{\Upsilon \log \tfrac{1}{\delta}} + K C_k \Delta_\ell \exp(-\tfrac{1}{\Upsilon})\big)$.
2) If the additional error due to the estimation of $\bar{r}_k$ with HDRM in Section 3 using $M$ bins is considered, it leads to $\mathcal{O}\big(K \Delta_\ell \big(\tfrac{1}{M} + \sqrt{\tfrac{M}{n_k^{tr}}}\big)\big)$ under mild assumptions. Refer to Appendix K for details. 3) Our error bound increases with $K$ due to the structure of the BD risk. Note that $K$ is of constant order. Our goal is to show that nnBD DRM is guaranteed to generalize in a general federated setting.

4.2. EXCESS RISK AND BENEFIT OF FIDEM

In this section, we aim to demonstrate the benefit of importance weighting in terms of excess risk through a bias-variance decomposition. We consider the classical least squares problem, a good starting point for understanding the superiority of FIDEM over ERM with generalization guarantees. We consider the single-client setting $K = 1$ for ease of description; our results can be extended to the multi-client setting. Let $(x, y)$ denote the (test) data sampled from an unknown probability measure $\rho$. The least squares problem is to estimate the true parameter $\theta_*$, which is assumed to be the unique solution minimizing the population risk in a Hilbert space $\mathcal{H}$: $L(\theta_*) = \min_{\theta \in \mathcal{H}} L(\theta)$, where $L(\theta) := \frac{1}{2} \mathbb{E}_{(x,y) \sim \rho}\big[(y - \theta^\top x)^2\big]$. Moreover, $L(\theta_*) = \sigma_\epsilon^2$ corresponds to the noise level. For an estimate $\hat{\theta}$ found by a learning algorithm such as ridge regression, its performance is measured by the expected excess risk $\mathbb{E}[L(\hat{\theta})] - L(\theta_*)$, where the expectation is over the random noise, the randomness of the algorithm, and the training data. In the following, we consider two settings: the random-design setting and the fixed-design setting, where the training data matrix is random and given, respectively.

Bias-variance decomposition. We need the following noise assumption for our proof.

Assumption 2 (Bounded variance; Dhillon et al., 2013; Zou et al., 2021). Let $\epsilon := y - \theta_*^\top x$. We assume that $\mathbb{E}[\epsilon] = 0$ and $\mathbb{E}[\epsilon^2] = \sigma_\epsilon^2$.

We have the following lemma on the bias-variance decomposition of the ridge regression FIDEM estimate in the random-design setting.

Lemma 1. Let $X \in \mathbb{R}^{n \times d}$ be the training data matrix. Let $W = \mathrm{diag}(w_1, \ldots, w_n)$ with $w_i = p^{te}(x_i)/p^{tr}(x_i)$ for $i \in [n]$, and let $\hat{\theta}$ be the regularized least squares estimate with importance weighting:
$$\hat{\theta} = \arg\min_{\theta} \sum_{i=1}^{n} w_i (\theta^\top x_i - y_i)^2 + \lambda \|\theta\|_2^2,$$
where $\lambda$ is the regularization parameter.
Denote by $\theta_*$ the true parameter. Then the excess risk can be decomposed into the bias $\mathsf{B}$ and the variance $\mathsf{V}$:
$$\mathbb{E}[L(\hat{\theta})] - L(\theta_*) = \mathsf{B} + \mathsf{V},$$
with
$$\mathsf{B} := \lambda^2 \, \mathbb{E}\big[\theta_*^\top \Sigma_{W,\lambda}^{-1} \Sigma_{te} \Sigma_{W,\lambda}^{-1} \theta_*\big], \qquad \mathsf{V} := \sigma_\epsilon^2 \, \mathbb{E}\big[\mathrm{tr}\big(\Sigma_{W,\lambda}^{-1} X^\top W^2 X \Sigma_{W,\lambda}^{-1} \Sigma_{te}\big)\big],$$
where $\Sigma_{W,\lambda} := X^\top W X + \lambda I$ and $\Sigma_{te} = \mathbb{E}_x[x x^\top]$. Note that the expectation is taken over the randomness of the training data matrix $X$ and the label noise.

Remark 2. Our results in Lemma 1 hold under the fixed-design setting, where the training data are given (Dhillon et al., 2013; Hsu et al., 2012), by omitting the expectations from $\mathsf{B}$ and $\mathsf{V}$.

One-hot case. To theoretically prove that FIDEM outperforms ERM in non-trivial settings, we start from the one-hot case, along the lines of Zou et al. (2021), and show rigorously under which level of covariate shift the excess risk of FIDEM is always smaller than that of classical ERM. To be specific, in the one-hot case, every training data point $x$ is sampled from the set of natural basis vectors $\{e_1, e_2, \ldots, e_d\}$ according to the data distribution given by $\Pr\{x = e_i\} = \lambda_i$, where $0 < \lambda_i \leq 1$ and $\sum_i \lambda_i = 1$. The class of one-hot least squares instances is characterized by the following problem set: $\{(\theta_*; \lambda_1, \ldots, \lambda_d) : \theta_* \in \mathcal{H}, \sum_i \lambda_i = 1\}$. It is not difficult to show that the population second moment matrix is $\Sigma_{tr} = \mathbb{E}[x_i x_i^\top] = \mathrm{diag}(\lambda_1, \ldots, \lambda_d)$ for $i \in [n]$. Similarly, we assume that each test data point follows the same scheme but with different probabilities $\Pr\{x = e_i\} = \lambda_i'$, and hence $\Sigma_{te} = \mathrm{diag}(\lambda_1', \ldots, \lambda_d')$. This is a relatively simple setting that admits covariate shift. Take $\{\mu_1, \mu_2, \ldots, \mu_d\}$ as the eigenvalues of $X^\top X$. Since $x_i$ can only take values in the natural basis, the eigenvalue $\mu_i$ can be understood as the number of training data points that equal $e_i$.
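The fixed-design quantities of Lemma 1 (Remark 2) can be evaluated directly from the closed forms; the following numerical sketch uses illustrative data, stand-in ratios, and an identity test second moment (all placeholders of ours, not from the paper).

```python
import numpy as np

# Fixed-design sketch of Lemma 1: with the training matrix X given,
#   B = lam^2 * theta*^T S^{-1} Sigma_te S^{-1} theta*,
#   V = sigma_eps^2 * tr(S^{-1} X^T W^2 X S^{-1} Sigma_te),
# with S = Sigma_{W,lam} = X^T W X + lam I.
rng = np.random.default_rng(3)
n, d, lam, sigma2 = 200, 4, 1.0, 0.25
X = rng.normal(size=(n, d))                # fixed training design
theta_star = np.ones(d)                    # true parameter
w = np.abs(rng.normal(1.0, 0.2, size=n))   # stand-in ratios p_te / p_tr
W = np.diag(w)
Sigma_te = np.eye(d)                       # stand-in test second moment

S = X.T @ W @ X + lam * np.eye(d)
S_inv = np.linalg.inv(S)
B = lam ** 2 * theta_star @ S_inv @ Sigma_te @ S_inv @ theta_star
V = sigma2 * np.trace(S_inv @ X.T @ W ** 2 @ X @ S_inv @ Sigma_te)
excess_risk = B + V
```

Both terms are non-negative by construction, and with $n \gg d$ the bias term is tiny: $S \approx n I$, so $\mathsf{B} = \mathcal{O}(\lambda^2 d / n^2)$.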
For notational simplicity, we rearrange the order of the training data in decreasing order of the ratio, such that the $i$-th sample $x_i$ corresponds to the ratio $w_i$ being exactly the $i$-th largest value.

Theorem 2. Let $\hat{\theta}$ be the FIDEM estimate, $\hat{\theta}_v$ the classical ERM estimate, and $\xi_i := \frac{\lambda}{\lambda + \mu_i}$. Under the fixed-design setting in the one-hot case, the label noise assumption, and the data correlation assumption, if the ratio $w_i := p^{te}(x_i)/p^{tr}(x_i)$ satisfies
$$\frac{\lambda_i'}{\lambda_i} - 1 \;\leq\; w_i \;\leq\; \xi_i \frac{\lambda_i}{\lambda_i'}, \quad (4.3)$$
then we have $R(\hat{\theta}) \leq R(\hat{\theta}_v)$.

Remark 3. We have the following remarks. 1) The condition (4.3) is equivalent to $\frac{\lambda_i'}{\lambda_i} \in \big(0, \frac{1 + \sqrt{1 + 4\xi_i}}{2}\big]$, which requires the training and test data to behave similarly in terms of eigenvalues, avoiding significant differences under distribution shifts for learnability. Other metrics, e.g., similarity of eigenvectors (Tripuraneni et al., 2021), also coincide with the spirit of our assumption. 2) The ratio matrix is $W \in \mathbb{R}^{n \times n}$. However, we only need its top $d$ eigenvalues, i.e., the top-$d$ ratios. In particular, the last $n - d$ ratios have no effect on the final excess risk. This makes our algorithm robust to noise and domain shift. 3) For the special case of taking the ratio as $w_i := \frac{\lambda_i'}{\lambda_i}$, we have
$$\mathsf{B}(\hat{\theta}) = \lambda^2 \sum_{i=1}^{d} \frac{[(\theta_*)_i]^2 \lambda_i'}{[\mu_i w_i + \lambda]^2} = \lambda^2 \sum_{i=1}^{d} \frac{[(\theta_*)_i]^2 \, \lambda_i^2/\lambda_i'}{\big[\mu_i + \frac{\lambda_i}{\lambda_i'} \lambda\big]^2},$$
which implies that the ratio can be regarded as an implicit regularization (Zou et al., 2021).
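The equivalence claimed in Remark 3(1) — that condition (4.3) admits some feasible $w_i$ exactly when $t := \lambda_i'/\lambda_i$ satisfies $t \leq \frac{1 + \sqrt{1 + 4\xi_i}}{2}$ — follows from $t - 1 \leq \xi_i/t \Leftrightarrow t^2 - t - \xi_i \leq 0$ for $t > 0$, and can be checked numerically on a grid (the grid is illustrative):

```python
import numpy as np

# Check: the interval (4.3) is non-empty iff t <= (1 + sqrt(1 + 4 xi)) / 2,
# writing t = lambda'_i / lambda_i and xi = lambda / (lambda + mu_i).

def interval_nonempty(t, xi):
    # Lower end of (4.3) does not exceed the upper end: t - 1 <= xi / t.
    return (t - 1.0) <= xi / t

def t_threshold(xi):
    # Positive root of t^2 - t - xi = 0.
    return (1.0 + np.sqrt(1.0 + 4.0 * xi)) / 2.0

checks = []
for xi in np.linspace(0.05, 0.95, 10):
    for t in np.linspace(0.1, 2.0, 50):
        checks.append(interval_nonempty(t, xi) == (t <= t_threshold(xi)))
all_equivalent = all(checks)
```

Every grid point agrees with the closed-form threshold, matching the interval stated in Remark 3.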

5. EXPERIMENTAL EVALUATION

In this section, we illustrate conditions under which FIDEM is favored over both federated averaging without ratio estimation (FedAvg) (McMahan et al., 2017) and FIIDEM. In all experiments, we use a LeNet (LeCun et al., 1989) with cross-entropy loss and compute standard deviations over 5 independent executions. Due to the page limit, the implementation details are in Appendix O.

Target shift. We consider the case of target shift, where the label distribution p(y) changes but the conditional distribution p(x|y) remains invariant. We split the 10-class Fashion MNIST dataset among 5 clients and simulate a target shift by including different fractions of examples from each class across the training and test data. We further consider the separable case in order to compute the exact ratio for FIDEM and FIIDEM in closed form. The specific distribution and the construction of the ratio can be found in Appendix O.1. The results in Table 1 illustrate that FIIDEM can outperform FedAvg on average while preserving the same level of privacy. By relaxing privacy slightly, the proposed FIDEM improves on FedAvg uniformly across all clients. Even though the proportions of the classes have been artificially created, we believe that this demonstrates a realistic scenario where clients have different fractions of samples per class. Additional experiments using larger models on the CIFAR10 dataset under a challenging target shift setting can be found in Appendix O.1, where FIDEM is observed to improve uniformly over FedAvg.

Covariate shift. We now focus on covariate shift, where p(x) undergoes a shift while p(y|x) remains unchanged. For this setting, we extend the Colored MNIST dataset of Arjovsky et al. (2019) to the multi-client setting. The dataset is constructed by first assigning a binary label 0 to digits from 0-4 and label 1 to digits from 5-9. The label is then flipped with probability 0.25 to make the dataset non-separable.
A spurious correlation is introduced by coloring the digits according to their assigned labels and then flipping the colors with a different probability for each distribution (see Appendix O.2). For this experimental setup, we introduce an idealized scheme, referred to as Grayscale, which ignores the color and thus the spurious correlation, providing an upper bound. FIDEM outperforms both baselines in terms of average accuracy even in a two-client setting. FIDEM is also close to the Grayscale upper bound, which by construction ignores the spurious correlations.
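The label/color construction just described can be sketched as follows (the per-client color-flip probability `p_color` is the only client-specific knob; the `digits` array is a stand-in for the true MNIST digit labels):

```python
import numpy as np

# Colored-MNIST-style construction: binarize the digit (0-4 -> 0,
# 5-9 -> 1), flip the label with probability 0.25, then set a color
# equal to the label and flip it with client-specific probability.
rng = np.random.default_rng(4)

def make_client_data(digits, p_color):
    y = (digits >= 5).astype(int)                             # binarized label
    y = np.where(rng.random(len(y)) < 0.25, 1 - y, y)         # 25% label noise
    color = np.where(rng.random(len(y)) < p_color, 1 - y, y)  # spurious cue
    return y, color

digits = rng.integers(0, 10, size=10000)
y, color = make_client_data(digits, p_color=0.1)
```

With `p_color = 0.1`, the color agrees with the (noisy) label about 90% of the time, so a model that latches onto color will generalize poorly to a client whose flip probability differs.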

6. CONCLUSIONS AND FUTURE WORK

In this work, we focus on FL under both intra-client and inter-client distribution shifts and propose FIDEM to improve the overall generalization performance. We establish high-probability generalization guarantees for a general DRM method in a federated setting. We further show the benefit of importance weighting in terms of excess risk through a bias-variance decomposition in a ridge regression problem. Our theoretical guarantees indicate how FIDEM can provably solve a learning task under distribution shifts. We experimentally evaluate FIDEM under both label shift and covariate shift. Our experimental results validate that, under certain covariate and target shifts, the proposed method can learn the task, while baselines such as vanilla federated averaging fail to do so. We anticipate our methods to be applicable to learning from, e.g., medical data, where there might be arbitrary skews in the distributions. In addition, we believe our study can further encourage the investigation of distribution shifts in FL, as this is a critical subject for learning across clients.

Notation:

We use E[·] to denote the expectation and ∥·∥ to represent the Euclidean norm of a vector. We use lower-case bold font to denote vectors. Sets and scalars are represented by calligraphic and standard fonts, respectively. We use [n] to denote {1, · · · , n} for an integer n. We use ≲ to ignore terms up to constants and logarithmic factors.

The appendix is organized as follows:

• Definition of Rademacher complexity, examples of f for BD-based DRM, and the steps of HDRM in Algorithm 1 are provided in Appendix A.
• Complete related work is provided in Appendix B.
• FIDEM with a focus on minimizing R_1 is presented in Appendix C.
• Details of density ratio estimation are provided in Appendix D.
• Communication costs of FIDEM and FIIDEM are analyzed in Appendix E.
• UKL, LR, and PU variants of nnBD are provided in Appendix F.
• Convergence of r̂ and k-means clustering for HDRM are provided in Appendix G.
• UKL, LR, and PU variants of nnBD for multiple clients are provided in Appendix H.
• The proof of the core Theorem 1 is given in Appendix I.
• Generalization bounds for multi-layer perceptrons and multiple clients are established in Appendix J.
• The additional error due to the estimation of r_k with HDRM is analyzed in Appendix K.
• Lemma 1 is proved in Appendix L.
• Theorem 2 is proved in Appendix M.
• A counterexample under which FIDEM cannot outperform ERM is provided in Appendix N.
• Additional experimental details are included in Appendix O.
• The computational complexity of Algorithm 1 is analyzed in Appendix P.
• The limitations of our work are described in Appendix Q.

A DEFINITIONS AND ALGORITHM 1

An example of f for the PU variant of BD-based DRM is f(z) = C log(1 − z) + Cz(log(z) − log(1 − z)) with z ∈ (0, 1) and C ≤ r̄.

Algorithm 1 (HDRM):

Input: samples {{x^tr_{k,i}}_{i=1}^{n^tr_k}}_{k=1}^K and {{x^te_{l,j}}_{j=1}^{n^te}}_{l=1}^K, learning rate α, regularization Λ(r), and regularization coefficient λ.
Output: ratio model parameters {θ_{r_k}}_{k=1}^K.

1: for k = 1 to K (in parallel) do
2:   Send n^te samples to the server;
3: Server randomly shuffles and broadcasts the samples {{x^te_{l,j}}_{j=1}^{n^te}}_{l=1}^K to the clients;
4: for k = 1 to K (in parallel) do
5:   Create M bins and compute r̂_{k,m} = [ (1/n^te) Σ_{j=1}^{n^te} Σ_{l=1}^K 1(x^te_{l,j} ∈ B_m) ] / [ (1/n^tr_k) Σ_{i=1}^{n^tr_k} 1(x^tr_{k,i} ∈ B_m) ];
6:   Estimate C_k = 1 / max{r̂_{k,1}, · · · , r̂_{k,M}};
7: for t = 1 to T do
8:   for k = 1 to K (in parallel) do
9:     for n = 1 to N_k do
10:      if (1/B^tr_k) Σ_{i=1}^{B^tr_k} ℓ_1(r_k(x^tr_{k,n,i})) − (K C_k / B^te_k) Σ_{j=1}^{B^te_k} ℓ_1(r_k(x^te_{k,n,j})) ≥ 0 then
11:        g_k = −∇_{θ_r} [ (1/B^tr_k) Σ_{i=1}^{B^tr_k} ℓ_1(r_k(x^tr_{k,n,i})) − (K C_k / B^te_k) Σ_{j=1}^{B^te_k} ℓ_1(r_k(x^te_{k,n,j})) + (K/B^te_k) Σ_{j=1}^{B^te_k} ℓ_2(r_k(x^te_{k,n,j})) + (λ/2) Λ(r_k) ];
12:      else
13:        g_k = ∇_{θ_r} [ (1/B^tr_k) Σ_{i=1}^{B^tr_k} ℓ_1(r_k(x^tr_{k,n,i})) − (K C_k / B^te_k) Σ_{j=1}^{B^te_k} ℓ_1(r_k(x^te_{k,n,j})) + (λ/2) Λ(r_k) ];

The two-sided Rademacher complexity used throughout is

R^p_n(H) = E_S E_σ [ sup_{r∈H} (1/n) Σ_{i=1}^n σ_i r(x_i) ],

where {σ_i}_{i=1}^n are Rademacher variables uniformly chosen from {−1, 1}.
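The bin-counting step of Algorithm 1 (computing r̂_{k,m} and C_k) can be sketched as follows for one-dimensional inputs. `histogram_ratio_and_C` is an illustrative helper under the assumption of axis-aligned histogram bins, not the paper's implementation.

```python
import numpy as np

def histogram_ratio_and_C(x_tr, x_te_all, bin_edges):
    """Estimate r_hat_{k,m} on each bin as the ratio of test to train
    frequencies, then set C_k = 1 / max_m r_hat_{k,m} (Algorithm 1, lines 5-6)."""
    tr_counts, _ = np.histogram(x_tr, bins=bin_edges)
    te_counts, _ = np.histogram(x_te_all, bins=bin_edges)
    mask = tr_counts > 0  # only bins containing at least one train sample
    r_hat = (te_counts[mask] / len(x_te_all)) / (tr_counts[mask] / len(x_tr))
    return r_hat, 1.0 / r_hat.max()

# toy 1-D example: two bins over [0, 1]
r_hat, C_k = histogram_ratio_and_C(
    np.array([0.1, 0.2, 0.6, 0.7]),      # train samples of client k
    np.array([0.1, 0.6, 0.65, 0.8]),     # pooled test samples of all clients
    np.array([0.0, 0.5, 1.0]))
```

For the multi-client case, `x_te_all` is the concatenation of the shuffled test batches broadcast by the server.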

B RELATED WORK

Federated learning. One well-known method in FL is FedAvg (McMahan et al., 2017). FedAvg and its variants are extensively studied in optimization with a focus on communication efficiency and partial participation of clients while preserving privacy. Indeed, a host of techniques, such as gradient quantization, sparsification, and local updating rules, have been proposed to improve communication efficiency in FL (Alistarh et al., 2017). Furthermore, robust and secure aggregation schemes have also been proposed to provide robustness against training-time attacks launched by an adversary, and to compute aggregated values without being able to inspect the clients' local models and data, respectively (Li et al., 2020; Kairouz et al., 2021; Wang et al., 2021). Taken together, these works largely focus on minimizing the empirical risk in the optimization objective, under the assumption of identical training/test data distributions for each client. Differences across clients are handled using personalization methods based on heuristics, which currently lack statistical learning-theoretic support (Smith et al., 2017; Khodak et al., 2019; Li et al., 2021b). In contrast, we focus on learning and overall generalization performance under both intra-client and inter-client distribution shifts. Communication-efficient, robust, and secure aggregations can be viewed as complementary technologies, which can be used along with our proposed FIDEM method to improve the generalization performance. In our setting, clients can also all participate in every training iteration, as in cross-silo FL. We note that (Hanzely et al., 2020; Gasanov et al., 2022)

Importance-weighted ERM and density ratio matching.
Density ratio estimation is an important step in various machine learning problems such as learning under covariate shift, learning under noisy labels, anomaly detection, two-sample testing, causal inference, change-point detection, and classification from positive and unlabelled data (Qin, 1998; Shimodaira, 2000; Cheng & Chu, 2004; Keziou & Leoni-Aubin, 2005; Sugiyama et al., 2007; Kawahara & Sugiyama, 2009; Smola et al., 2009; Hido et al., 2011; Kanamori et al., 2011; Sugiyama et al., 2011; Yamada et al., 2011; Reddi et al., 2015; Liu & Tao, 2015; Kato et al., 2019; Fang et al., 2020; Uehara et al., 2020; Zhang et al., 2020; Kato & Teshima, 2021). In particular, covariate shift has been observed in real-world applications including brain-computer interfacing, emotion recognition, human activity recognition, spam filtering, and speaker identification (Bickel & Scheffer, 2007; Li et al., 2010; Yamada et al., 2010; Hachiya et al., 2012; Jirayucharoensak et al., 2014). Shimodaira (2000) introduced covariate shift, where the train and test input distributions are different while the conditional distribution of the output variable given the input variable remains unchanged. Importance-weighted ERM is widely used to improve generalization performance under covariate shift (Zadrozny, 2004; Sugiyama & Müller, 2005; Huang et al., 2006; Sugiyama et al., 2007; Kanamori et al., 2009; Sugiyama et al., 2012; Fang et al., 2020; Zhang et al., 2020; Kato & Teshima, 2021). The premise behind such shifts is that data is frequently biased, and this results in distribution shifts that can be estimated by assuming some (unlabelled) knowledge of the target distribution. The following two categories of domain adaptation methods are most closely related to our work: a) sample-based and b) feature-based methods.
In feature-based methods, the goal is to find a transformation that maps the source samples to target samples (Ganin et al., 2016; Bousmalis et al., 2017; Das & Lee, 2018; Damodaran et al., 2018). Contrary to feature-based methods, sample-based methods aim at minimizing the target risk through data in the source domain, and importance weighting is often used in sample-based methods (Shimodaira, 2000; Jiang & Zhai, 2007; Baktashmotlagh et al., 2014). However, domain adaptation has mainly focused on adapting to a single target distribution, rather than the overall generalization performance on multiple clients, which is addressed in this paper.

Statistical generalization and excess risk bounds. Understanding the generalization performance of learning algorithms is an essential topic in modern machine learning. Typical techniques to establish generalization guarantees include uniform convergence via Rademacher complexity (Bartlett, 1998) and its variants (Bartlett et al., 2005), bias-variance decomposition (Geman et al., 1992; Adlam & Pennington, 2020), PAC-Bayes (McAllester, 1999), and stability-based analysis (Bousquet & Elisseeff, 2002; Shalev-Shwartz et al., 2010). Our work employs the first two techniques to analyze our density ratio estimation method in a federated setting and to establish generalization guarantees for FIDEM, respectively. Rademacher complexity has been used in FL to obtain theoretical guarantees on the centralized model (Mohri et al., 2019) and the personalized model (Mansour et al., 2020). Mohri et al. (2019) considered a scenario where a single target distribution is modeled as an unknown mixture of multiple domain distributions and obtained a global model by minimizing the worst-case loss. This is different from our setting, where we consider multiple test distributions for clients and focus on the overall test error. Mansour et al. (2020) studied personalization under the same training/test data distribution assumption for each client, which is also different from our setting. Bias-variance decomposition provides a relatively refined characterization of generalization error (or excess risk): a large bias indicates that a model is not flexible enough to learn from the data, while a high variance indicates that the model performs unstably. Bias-variance decomposition is typically studied in two settings, the fixed and the random design setting, categorized by whether the (training) data are fixed or random. This technique has been extensively applied to least squares (Hsu et al., 2012; Dieuleveut et al., 2017), the analysis of SGD (Jain et al., 2018; Zou et al., 2021), and double descent (Adlam & Pennington, 2020). Information-theoretic bounds on the generalization error and privacy leakage in federated settings were established in (Yagli et al., 2020). Under partial participation of clients, Yuan et al. (2021) proposed a framework that distinguishes the performance gap due to unseen client data from the performance gap due to unseen client distributions. Still, these works study FL under the same training/test data distribution assumption for each client.

C FIDEM WITH A FOCUS ON MINIMIZING R_1

Without loss of generality and for simplicity of notation, in this section we set l = 1. We consider four typical scenarios under various distribution shifts and formulate their FIDEM with a focus on minimizing R_1. The details of these scenarios are summarized in Table 4.

Remark 4. Covariate shift (as well as its assumption) is the most commonly used and studied setting among distribution shifts, in both theory and practice (Sugiyama et al., 2007; Kanamori et al., 2009; Kato & Teshima, 2021; Uehara et al., 2020; Tripuraneni et al., 2021; Zhou & Levine, 2021). Handling covariate shift is a challenging issue, especially in federated settings (Kairouz et al., 2021).

No intra-client covariate shift: (No-CS) For simplicity of exposition, we assume that there are only 2 clients, but our results directly extend to multiple clients. This scenario assumes p^tr_k(x) = p^te_k(x) for k = 1, 2. Client 1 aims to learn h_w assuming p^tr_1(x)/p^tr_2(x) is given. We consider the following FIDEM, which is proved to be consistent in terms of minimizing R_1:

min_{w∈R^d} (1/n^tr_1) Σ_{i=1}^{n^tr_1} ℓ(h_w(x^tr_{1,i}), y^tr_{1,i}) + (1/n^tr_2) Σ_{i=1}^{n^tr_2} [p^tr_1(x^tr_{2,i}) / p^tr_2(x^tr_{2,i})] ℓ(h_w(x^tr_{2,i}), y^tr_{2,i}).   (C.1)

Covariate shift only for client 1: (CS on one) We now consider covariate shift only for client 1, i.e., p^tr_1(x) ≠ p^te_1(x) and p^tr_2(x) = p^te_2(x). We consider the following FIDEM:

min_{w∈R^d} (1/n^tr_1) Σ_{i=1}^{n^tr_1} [p^te_1(x^tr_{1,i}) / p^tr_1(x^tr_{1,i})] ℓ(h_w(x^tr_{1,i}), y^tr_{1,i}) + (1/n^tr_2) Σ_{i=1}^{n^tr_2} [p^te_1(x^tr_{2,i}) / p^tr_2(x^tr_{2,i})] ℓ(h_w(x^tr_{2,i}), y^tr_{2,i}).   (C.2)

Covariate shift for both clients: (CS on both) We assume p^tr_1(x) ≠ p^te_1(x) and p^tr_2(x) ≠ p^te_2(x), i.e., covariate shift for both clients. The corresponding FIDEM is the same as Eq. (C.2).

Multiple clients: (CS on multi.) Finally, we consider a general scenario with K clients.
We allow both intra-client and inter-client covariate shifts via the following FIDEM:

min_{w∈R^d} Σ_{k=1}^K (λ_k / n^tr_k) Σ_{i=1}^{n^tr_k} [p^te_1(x^tr_{k,i}) / p^tr_k(x^tr_{k,i})] ℓ(h_w(x^tr_{k,i}), y^tr_{k,i})   (C.3)

where Σ_{k=1}^K λ_k = 1 and λ_k ≥ 0.

Proposition 2. Let l ∈ [K]. In the above settings, FIDEM defined in Eqs. (C.1), (C.2), and (C.3) is consistent, i.e., the learned function converges in probability to the optimal function in terms of minimizing R_1.

Proposition 2 implies that, under various settings, FIDEM outputs an unbiased estimate of a minimizer of the true risk.

Proof. For the scenario without intra-client covariate shift, the second term of FIDEM in Eq. (C.1) satisfies

(1/n^tr_2) Σ_{i=1}^{n^tr_2} [p^tr_1(x^tr_{2,i}) / p^tr_2(x^tr_{2,i})] ℓ(h_w(x^tr_{2,i}), y^tr_{2,i})
→ E_{p^tr_2(x,y)} [ (p^tr_1(x)/p^tr_2(x)) ℓ(h_w(x), y) ]   as n^tr_2 → ∞
= E_{p(y|x)} ∫_X (p^tr_1(x)/p^tr_2(x)) ℓ(h_w(x), y) p^tr_2(x) dx
= E_{p(y|x)} ∫_X p^tr_1(x) ℓ(h_w(x), y) dx
= E_{p(y|x)} ∫_X p^te_1(x) ℓ(h_w(x), y) dx
= E_{p^te_1(x,y)} [ℓ(h_w(x), y)] = R_1(h_w).

For the scenario with covariate shift only for client 1 or for both clients, FIDEM in Eq. (C.2) admits

(1/n^tr_2) Σ_{i=1}^{n^tr_2} [p^te_1(x^tr_{2,i}) / p^tr_2(x^tr_{2,i})] ℓ(h_w(x^tr_{2,i}), y^tr_{2,i})
→ E_{p^tr_2(x,y)} [ (p^te_1(x)/p^tr_2(x)) ℓ(h_w(x), y) ]   as n^tr_2 → ∞
= E_{p(y|x)} ∫_X (p^te_1(x)/p^tr_2(x)) ℓ(h_w(x), y) p^tr_2(x) dx
= E_{p(y|x)} ∫_X p^te_1(x) ℓ(h_w(x), y) dx
= E_{p^te_1(x,y)} [ℓ(h_w(x), y)] = R_1(h_w).

We note that p^te_1(x)/p^tr_2(x) = [p^te_1(x)/p^tr_1(x)] · [p^tr_1(x)/p^tr_2(x)], which is the product of the ratios due to intra-client covariate shift on client 1 and inter-client covariate shift. For multiple clients, let k ∈ [K]. Similarly, we have

(1/n^tr_k) Σ_{i=1}^{n^tr_k} [p^te_1(x^tr_{k,i}) / p^tr_k(x^tr_{k,i})] ℓ(h_w(x^tr_{k,i}), y^tr_{k,i}) → R_1(h_w)   as n^tr_k → ∞.

Then we have

Σ_{k=1}^K (λ_k / n^tr_k) Σ_{i=1}^{n^tr_k} [p^te_1(x^tr_{k,i}) / p^tr_k(x^tr_{k,i})] ℓ(h_w(x^tr_{k,i}), y^tr_{k,i}) → R_1(h_w)   as n^tr_1, · · · , n^tr_K → ∞.
The consistency of FIDEM, i.e., convergence in probability, follows immediately from standard arguments, e.g., (Shimodaira, 2000, Section 3) and (Sugiyama et al., 2007, Section 2.2), using the law of large numbers. ■

Note that to solve Eq. (C.3), client 1 needs to estimate p^te_1(x)/p^tr_k(x) for all clients k with λ_k > 0 in (C.3).

Remark 5. Scaling Σ_{k=1}^K λ_k does not affect the optimal parameters in Eq. (C.3). For notational simplicity, we set λ_k = 1 for k ∈ [K].
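A minimal sketch of evaluating the λ-weighted, importance-weighted objective of Eq. (C.3), assuming the per-sample losses and density ratios have already been computed (e.g., ratios estimated via HDRM). `fidem_objective` is an illustrative name, not code from the paper.

```python
import numpy as np

def fidem_objective(losses, ratios, lams):
    """Eq. (C.3): sum over clients k of
    (lambda_k / n_k) * sum_i ratio_k[i] * loss_k[i].
    losses[k][i] = loss of h_w on sample i of client k;
    ratios[k][i] = p_te_1(x) / p_tr_k(x) evaluated at that sample."""
    total = 0.0
    for lam, lk, rk in zip(lams, losses, ratios):
        total += lam * np.mean(np.asarray(rk) * np.asarray(lk))
    return total

# toy example: 2 clients, 2 samples each
obj = fidem_objective(losses=[[1.0, 1.0], [2.0, 2.0]],
                      ratios=[[1.0, 1.0], [0.5, 0.5]],
                      lams=[0.5, 0.5])
```

In federated training, each client evaluates its own weighted term locally and only (gradients of) these terms are aggregated.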

C.1 NO INTRA-CLIENT SHIFT

In this section, we consider the important special case of the setting described in Section 2.1 with no intra-client covariate shift but with inter-client covariate shift. For simplicity, we consider two clients with train/test distributions P and Q, whose densities are denoted by p and q, respectively. We also suppose that we have samples z ∼ P and z′ ∼ Q to learn from, and the goal is to find an unbiased estimate of the overall risk with the smallest variance. In this setting, the classical ERM (FedAvg) objective ℓ(z, θ) + ℓ(z′, θ) is an unbiased estimate of the overall risk L(θ) = E_P[ℓ(z, θ)] + E_Q[ℓ(z′, θ)]. The objective of FIDEM, i.e., (1/2)(L̂_P(θ) + L̂_Q(θ)) with

L̂_P(θ) = (1 + q(z)/p(z)) ℓ(z, θ)   and   L̂_Q(θ) = (1 + p(z′)/q(z′)) ℓ(z′, θ),

is also an unbiased estimate of the overall risk L(θ). We now show that our method (FIDEM) has a smaller variance than FedAvg under certain conditions. Let E_P[(ℓ(z, θ) − E_P[ℓ(z, θ)])²] = σ²_P and E_Q[(ℓ(z′, θ) − E_Q[ℓ(z′, θ)])²] = σ²_Q. For FedAvg, the variance is given by

E_{P,Q}[(ℓ(z, θ) + ℓ(z′, θ) − L(θ))²] = σ²_P + σ²_Q.

For FIDEM, the variance is given by

E_{P,Q}[((1/2)(L̂_P(θ) + L̂_Q(θ)) − L(θ))²] = V_P + V_Q,

where V_P = (1/4) E_P[(L̂_P(θ) − L(θ))²] and V_Q = (1/4) E_Q[(L̂_Q(θ) − L(θ))²]. We now expand each term. We can show that

V_P = (1/4) E_P[((1 + q(z)/p(z)) ℓ(z, θ) − E_P[ℓ(z, θ)] − E_Q[ℓ(z′, θ)])²] = (σ²_P + σ̃²_P)/4,

where

σ̃²_P = E_P[((q(z)/p(z)) ℓ(z, θ) − E_Q[ℓ(z′, θ)])²] + 2 E_P[(ℓ(z, θ) − E_P[ℓ(z, θ)])((q(z)/p(z)) ℓ(z, θ) − E_Q[ℓ(z′, θ)])].

Similarly, we have

V_Q = (1/4) E_Q[((1 + p(z′)/q(z′)) ℓ(z′, θ) − E_P[ℓ(z, θ)] − E_Q[ℓ(z′, θ)])²] = (σ²_Q + σ̃²_Q)/4,

where

σ̃²_Q = E_Q[((p(z′)/q(z′)) ℓ(z′, θ) − E_P[ℓ(z, θ)])²] + 2 E_Q[(ℓ(z′, θ) − E_Q[ℓ(z′, θ)])((p(z′)/q(z′)) ℓ(z′, θ) − E_P[ℓ(z, θ)])].

We note that if σ̃²_P + σ̃²_Q ≤ 3(σ²_P + σ²_Q), then FIDEM has smaller variance than FedAvg, i.e., V_P + V_Q ≤ σ²_P + σ²_Q.
The exact condition depends on the loss and the densities. For the more general and practical case with both intra-client and inter-client covariate shifts, we show in Section 4.2 that FIDEM results in smaller excess risk than FedAvg through a refined bias-variance decomposition. Given two distributions, the case of no intra-client shift is a special case in which FedAvg is indeed an unbiased estimate of the overall risk. However, this unbiasedness breaks as soon as a single client has different train and test distributions, which is very common in theory and practice. Note that FIDEM is an unbiased estimate of the overall risk in a general FL setting without requiring any prior knowledge of, or assumptions on, the potential covariate shifts.
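The unbiasedness of the FIDEM estimator above can be checked numerically. The following Monte-Carlo sketch uses toy discrete densities p and q (our assumption, chosen so the ratios are exact) and a loss that depends only on the sample, i.e., ℓ(z, θ) = z for a fixed θ.

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.8, 0.2])   # toy train/test density of client 1 over {0, 1}
q = np.array([0.3, 0.7])   # toy train/test density of client 2 over {0, 1}

# overall risk L = E_P[loss] + E_Q[loss] with loss(z) = z
L_true = p[1] + q[1]       # = 0.9

n = 200_000
z = rng.choice(2, size=n, p=p)    # samples z  ~ P
zp = rng.choice(2, size=n, p=q)   # samples z' ~ Q

# FIDEM estimator: 0.5 * [(1 + q/p) loss(z) + (1 + p/q) loss(z')]
est = 0.5 * ((1 + q[z] / p[z]) * z + (1 + p[zp] / q[zp]) * zp)
```

Averaging `est` over the draws recovers `L_true` up to Monte-Carlo noise, illustrating the unbiasedness claim; the variance comparison against the FedAvg estimator `z + zp` depends on the densities, as stated in the text.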

D RATIO ESTIMATION

D.1 NNBD DRM FOR A SINGLE CLIENT

For simplicity, we first focus on the problem of estimating r(x) = p^te(x)/p^tr(x), and then extend our consideration to the estimation of r_k(x) in Eq. (3.1). Let r* denote the true density ratio. Our goal is to estimate r* by optimizing our ratio model r, where the discrepancy between r and r* is measured by E_{p^tr}[BD_f(r*(x) ∥ r(x))]. We note that

E_{p^tr}[BD_f(r*(x) ∥ r(x))] = E_f(r) + E_{p^tr}[f(r*(x))],   where   E_f(r) = E_{p^tr}[∇f(r(x)) r(x) − f(r(x))] − E_{p^te}[∇f(r(x))].

Since E_{p^tr}[f(r*(x))] is constant w.r.t. r, it suffices to minimize E_f(r), whose empirical counterpart is

Ê_f(r) = (1/n^tr) Σ_{i=1}^{n^tr} [∇f(r(x^tr_i)) r(x^tr_i) − f(r(x^tr_i))] − (1/n^te) Σ_{j=1}^{n^te} ∇f(r(x^te_j)).   (D.1)

Sugiyama et al. (2012) showed that BD-based DRM unifies well-known density ratio estimation methods by substituting an appropriate f in (D.1). However, solving BD-based DRM with highly flexible models such as neural networks typically leads to over-fitting (Kato & Teshima, 2021; Kiryo et al., 2017). In particular, Kato & Teshima (2021) called this issue "train-loss hacking": the term −(1/n^te) Σ_{j=1}^{n^te} ∇f(r(x^te_j)) in (D.1) diverges if it has no lower bound, and even when a lower bound exists, the model r tends to take the largest possible values of its output range at the points {x^te_j}_{j=1}^{n^te}. To resolve this issue, Kato & Teshima (2021) proposed non-negative BD (nnBD) DRM, i.e., min_{r∈H_r} Ê+_f(r) where

Ê+_f(r) = ReLU( (1/n^tr) Σ_{i=1}^{n^tr} ℓ_1(r(x^tr_i)) − (C/n^te) Σ_{j=1}^{n^te} ℓ_1(r(x^te_j)) ) + (1/n^te) Σ_{j=1}^{n^te} ℓ_2(r(x^te_j)),   (D.2)

ReLU(z) = max{0, z}, 0 < C < 1/r̄, r̄ = sup_{x∈X^tr} r*(x), ℓ_1(z) = ∇f(z) z − f(z), and ℓ_2(z) = C(∇f(z) z − f(z)) − ∇f(z). Substituting f(z) = (z−1)²/2 into (D.2), the least-squares importance fitting (LSIF) variant of nnBD is given by

Ê+_LSIF(r) = ReLU( (1/2n^tr) Σ_{i=1}^{n^tr} r²(x^tr_i) − (C/2n^te) Σ_{j=1}^{n^te} r²(x^te_j) ) − (1/n^te) Σ_{j=1}^{n^te} ( r(x^te_j) − (C/2) r²(x^te_j) ).
In Appendix F, we show explicit expressions for the unnormalized Kullback-Leibler (UKL), logistic regression (LR), and positive and unlabeled learning (PU) variants of nnBD. Estimating r̄ = sup_{x∈X^tr} r*(x) is a key step for density ratio estimation, as underestimating C leads to significant performance degradation (Kato & Teshima, 2021, Section 5). Following the bin-based construction, we partition X^tr into M bins where, for each bin B_m containing some x^tr_i, we define r̂_m as the ratio of the empirical test and train frequencies of B_m.

We now turn to the estimation of r_k. The empirical BD risk of client k is

Ê_f(r_k) = (1/n^tr_k) Σ_{i=1}^{n^tr_k} [∇f(r_k(x^tr_{k,i})) r_k(x^tr_{k,i}) − f(r_k(x^tr_{k,i}))] − (1/n^te) Σ_{j=1}^{n^te} Σ_{l=1}^K ∇f(r_k(x^te_{l,j})).

The nnBD DRM problem for client k is min_{r_k∈H_r} Ê+_f(r_k) where

Ê+_f(r_k) = ReLU(Ŝ_{1,ℓ_1}) + (1/n^te) Σ_{j=1}^{n^te} Σ_{l=1}^K ℓ_2(r_k(x^te_{l,j})),   (D.3)

Ŝ_{1,ℓ_1} = (1/n^tr_k) Σ_{i=1}^{n^tr_k} ℓ_1(r_k(x^tr_{k,i})) − (C_k/n^te) Σ_{j=1}^{n^te} Σ_{l=1}^K ℓ_1(r_k(x^te_{l,j})),

0 < C_k < 1/r̄_k, and r̄_k = sup_{x∈X^tr} r*_k(x). Substituting f(z) = (z−1)²/2 into (D.3), the LSIF variant of nnBD for client k is given by min_{r_k∈H_r} Ê+_LSIF(r_k) where

Ê+_LSIF(r_k) = ReLU(Ŝ_LSIF) − (1/n^te) Σ_{j=1}^{n^te} Σ_{l=1}^K ( r_k(x^te_{l,j}) − (C_k/2) r²_k(x^te_{l,j}) ),   (D.4)

and Ŝ_LSIF = (1/2n^tr_k) Σ_{i=1}^{n^tr_k} r²_k(x^tr_{k,i}) − (C_k/2n^te) Σ_{j=1}^{n^te} Σ_{l=1}^K r²_k(x^te_{l,j}). We provide explicit expressions for the UKL, LR, and PU variants of nnBD for client k in Appendix H.
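A minimal sketch of the single-client LSIF objective in (D.2), with the ReLU clipping that guards against train-loss hacking. `nnbd_lsif_risk` is an illustrative helper; its inputs are assumed to be precomputed model outputs r(x) on the train and test samples.

```python
import numpy as np

def nnbd_lsif_risk(r_tr, r_te, C):
    """Non-negative LSIF risk, Eq. (D.2) with f(z) = (z-1)^2 / 2:
    the bracket that can go negative under flexible models is clipped
    by ReLU; the remaining test term is left untouched."""
    bracket = 0.5 * np.mean(r_tr ** 2) - 0.5 * C * np.mean(r_te ** 2)
    return max(bracket, 0.0) - np.mean(r_te - 0.5 * C * r_te ** 2)

risk = nnbd_lsif_risk(np.array([1.0, 1.0]), np.array([2.0, 0.0]), C=0.25)
```

Without the `max(bracket, 0.0)` clipping, a flexible model could drive the bracket arbitrarily negative by inflating its outputs on the test points, which is exactly the failure mode the nnBD correction prevents.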

Our goal is to estimate r̄_k = sup_{x∈X^tr} Σ_{l=1}^K p^te_l(x)/p^tr_k(x). For the HDRM method, we first partition X^tr into M bins where, for each bin B_m containing some x^tr_{k,i}, we define

r̂_{k,m} := [ (1/n^te) Σ_{j=1}^{n^te} Σ_{l=1}^K 1(x^te_{l,j} ∈ B_m) ] / [ (1/n^tr_k) Σ_{i=1}^{n^tr_k} 1(x^tr_{k,i} ∈ B_m) ],

and then estimate C_k = 1/max{r̂_{k,1}, · · · , r̂_{k,M}} as in Algorithm 1.

D.2 BD-BASED DRM FOR FL

Our goal is to estimate r_k by minimizing the discrepancy E_{p^tr_k}[BD_f(r*_k(x) ∥ r_k(x))], which is equivalent to min_{r_k∈H_r} E_f(r_k) where

E_f(r_k) = E_{p^tr_k}[∇f(r_k(x)) r_k(x) − f(r_k(x))] − Σ_{l=1}^K E_{p^te_l}[∇f(r_k(x))],   (D.5)

since E_{p^tr_k}[BD_f(r*_k(x) ∥ r_k(x))] = E_f(r_k) + E_{p^tr_k}[f(r*_k(x))] and E_{p^tr_k}[f(r*_k(x))] is constant w.r.t. r_k. Let {x^tr_{k,i}}_{i=1}^{n^tr_k} and {x^te_{l,j}}_{j=1}^{n^te} denote unlabelled samples drawn i.i.d. from the distributions p^tr_k and p^te_l, respectively, for l ∈ [K]. A natural way to solve min_{r_k∈H_r} E_f(r_k) is to substitute empirical averages into Eq. (D.5) (Sugiyama et al., 2012), leading to the BD-based DRM for FL: min_{r_k∈H_r} Ê_f(r_k) where

Ê_f(r_k) = (1/n^tr_k) Σ_{i=1}^{n^tr_k} [∇f(r_k(x^tr_{k,i})) r_k(x^tr_{k,i}) − f(r_k(x^tr_{k,i}))] − (1/n^te) Σ_{j=1}^{n^te} Σ_{l=1}^K ∇f(r_k(x^te_{l,j})).

As for communication, the number of communication bits needed during training in standard federated learning is usually many orders of magnitude larger than the size of the samples shared for estimating the ratios. To further reduce the communication costs of density ratio estimation and gradient aggregation, compression methods such as quantization, sparsification, and local updating rules can be used along with FIDEM on the fly (Alistarh et al., 2017).
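The empirical federated BD risk Ê_f(r_k) above can be sketched as follows, assuming the model outputs r_k(x) on client k's train samples and on each client's test samples are precomputed. `bd_empirical_risk` is an illustrative name, and f is passed in generically together with its derivative.

```python
import numpy as np

def bd_empirical_risk(r_tr_k, r_te_per_client, f, fprime):
    """Federated BD-based DRM objective for client k:
    train term over client k's own samples, test term over the pooled
    test outputs of all K clients, normalized by n_te (per-client count)."""
    n_te = len(r_te_per_client[0])  # same test-batch size for every client
    train_term = np.mean(fprime(r_tr_k) * r_tr_k - f(r_tr_k))
    test_term = sum(fprime(np.asarray(r)).sum() for r in r_te_per_client) / n_te
    return train_term - test_term

# LSIF instance: f(z) = (z - 1)^2 / 2, f'(z) = z - 1
f = lambda z: 0.5 * (z - 1) ** 2
fp = lambda z: z - 1
risk = bd_empirical_risk(np.array([2.0]),
                         [np.array([1.0]), np.array([3.0])], f, fp)
```

The same function applied with the UKL, LR, or PU choices of f recovers the corresponding variants listed in Appendices F and H (before the non-negative correction).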

E COMMUNICATION COSTS AND FIIDEM

Alternatively, to eliminate any privacy risks, clients may minimize the following surrogate objective, which we name FIIDEM:

min_w F̃(w) := Σ_{k=1}^K F̃_k(w)   (E.1)

where F̃_k(w) = (1/n^tr_k) Σ_{i=1}^{n^tr_k} [p^te_k(x^tr_{k,i}) / p^tr_k(x^tr_{k,i})] ℓ(h_w(x^tr_{k,i}), y^tr_{k,i}). Privacy risks are eliminated by solving Eq. (E.1). However, to exploit the entire data distributed among all clients and achieve the optimal global model in terms of overall test error, clients need to compromise some level of privacy and share unlabelled samples generated from their test distributions with the server. Hence, in this paper, we focus on the original objective F(w) in FIDEM, which is different from F̃(w).

F VARIANTS OF NNBD

In this section, we show explicit expressions for the unnormalized Kullback-Leibler (UKL), logistic regression (LR), and positive and unlabeled learning (PU) variants of nnBD. Substituting f(z) = z log(z) − z into Eq. (D.2), we have ℓ_1(z) = z and ℓ_2(z) = zC − log(z), and the UKL variant of nnBD is given by

Ê+_UKL(r) = ReLU( (1/n^tr) Σ_{i=1}^{n^tr} r(x^tr_i) − (C/n^te) Σ_{j=1}^{n^te} r(x^te_j) ) − (1/n^te) Σ_{j=1}^{n^te} ( log(r(x^te_j)) − C r(x^te_j) ).   (F.1)

Substituting f(z) = z log(z) − (z+1) log(z+1) into Eq. (D.2), we have ℓ_1(z) = log(z+1) and ℓ_2(z) = C log(z+1) + log((z+1)/z), and the LR (BKL) variant of nnBD is given by

Ê+_LR(r) = ReLU( (1/n^tr) Σ_{i=1}^{n^tr} log(r(x^tr_i)+1) − (C/n^te) Σ_{j=1}^{n^te} log(r(x^te_j)+1) ) − (1/n^te) Σ_{j=1}^{n^te} ( log(r(x^te_j)/(r(x^te_j)+1)) − C log(r(x^te_j)+1) ).   (F.2)

Substituting f(z) = C log(1−z) + Cz(log(z) − log(1−z)) into Eq. (D.2), we have ℓ_1(z) = −C log(1−z) and ℓ_2(z) = −C log(z) + (C − C²) log(1−z), and the PU variant of nnBD is given by

Ê+_PU(r) = ReLU( −(C/n^tr) Σ_{i=1}^{n^tr} log(1−r(x^tr_i)) + (C²/n^te) Σ_{j=1}^{n^te} log(1−r(x^te_j)) ) − (1/n^te) Σ_{j=1}^{n^te} ( C log(r(x^te_j)) − (C−C²) log(1−r(x^te_j)) ).   (F.3)

G CONVERGENCE OF r̂ AND k-MEANS CLUSTERING FOR HDRM

k-means clustering for HDRM. We note that partitioning the space and counting the number of samples in each bin is not necessarily an easy task when the data is high dimensional. In practice, one simple method is to cluster the train and test samples using an efficient implementation of k-means clustering with M clusters and count the number of train and test samples in each cluster (Lloyd, 1982). To estimate the ratios, we need a batch of samples from the test distribution of each client in addition to a batch of samples from the train distribution of each estimating client. The running time of Lloyd's algorithm with M clusters is O(n d_x M), where n is the total number of samples of dimension d_x.

H UKL, LR, AND PU VARIANTS OF NNBD FOR MULTIPLE CLIENTS

In this section, we provide explicit expressions for the UKL, LR, and PU variants of nnBD for client k. The UKL variant of nnBD for client k is given by min_{r_k∈H_r} Ê+_UKL(r_k) where

Ê+_UKL(r_k) = ReLU( (1/n^tr_k) Σ_{i=1}^{n^tr_k} r_k(x^tr_{k,i}) − (C_k/n^te) Σ_{j=1}^{n^te} Σ_{l=1}^K r_k(x^te_{l,j}) ) − (1/n^te) Σ_{j=1}^{n^te} Σ_{l=1}^K ( log(r_k(x^te_{l,j})) − C_k r_k(x^te_{l,j}) ).   (H.1)

The LR variant of nnBD for client k is given by min_{r_k∈H_r} Ê+_LR(r_k) where

Ê+_LR(r_k) = ReLU( (1/n^tr_k) Σ_{i=1}^{n^tr_k} log(r_k(x^tr_{k,i})+1) − (C_k/n^te) Σ_{j=1}^{n^te} Σ_{l=1}^K log(r_k(x^te_{l,j})+1) ) − (1/n^te) Σ_{j=1}^{n^te} Σ_{l=1}^K ( log(r_k(x^te_{l,j})/(r_k(x^te_{l,j})+1)) − C_k log(r_k(x^te_{l,j})+1) ).   (H.2)

The PU variant of nnBD for client k is given by min_{r_k∈H_r} Ê+_PU(r_k) where

Ê+_PU(r_k) = ReLU( −(C_k/n^tr_k) Σ_{i=1}^{n^tr_k} log(1−r_k(x^tr_{k,i})) + (C²_k/n^te) Σ_{j=1}^{n^te} Σ_{l=1}^K log(1−r_k(x^te_{l,j})) ) − (1/n^te) Σ_{j=1}^{n^te} Σ_{l=1}^K ( C_k log(r_k(x^te_{l,j})) − (C_k−C²_k) log(1−r_k(x^te_{l,j})) ).
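The k-means alternative to axis-aligned bins described above can be sketched as follows. This is a toy, self-contained Lloyd implementation with farthest-first initialization (our choice, for determinism in the example), not the efficient implementation the text refers to; `kmeans_bin_ratios` is an illustrative name.

```python
import numpy as np

def kmeans_bin_ratios(x_tr, x_te, M, iters=20):
    """Cluster the pooled train/test samples with Lloyd's algorithm into
    M clusters, then estimate a per-cluster density ratio as
    (test frequency) / (train frequency)."""
    pooled = np.vstack([x_tr, x_te]).astype(float)
    centers = [pooled[0]]
    for _ in range(M - 1):  # farthest-first initialization
        d = np.min([((pooled - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(pooled[d.argmax()])
    centers = np.array(centers)
    for _ in range(iters):  # Lloyd iterations
        assign = ((pooled[:, None, :] - centers[None]) ** 2).sum(-1).argmin(1)
        for m in range(M):
            if np.any(assign == m):
                centers[m] = pooled[assign == m].mean(0)
    a_tr = ((x_tr[:, None, :] - centers[None]) ** 2).sum(-1).argmin(1)
    a_te = ((x_te[:, None, :] - centers[None]) ** 2).sum(-1).argmin(1)
    return np.array([np.mean(a_te == m) / max(np.mean(a_tr == m), 1e-12)
                     for m in range(M)])

# toy 2-D shift: train mass near the origin, test mass near (5, 5)
rng = np.random.default_rng(0)
x_tr = np.vstack([rng.normal(0, 0.1, (90, 2)), rng.normal(5, 0.1, (10, 2))])
x_te = np.vstack([rng.normal(0, 0.1, (10, 2)), rng.normal(5, 0.1, (90, 2))])
ratios = kmeans_bin_ratios(x_tr, x_te, M=2)
```

The maximum of the returned ratios then plays the role of r̂_k in setting C_k, exactly as with histogram bins.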
(H.3)

I PROOF OF THEOREM 1

In this section, we prove Theorem 1, which establishes an upper bound on the generalization error of nnBD DRM (the HDRM method with an arbitrary f) for client k in terms of the BD risk, holding with high probability, along the lines of (Kiryo et al., 2017; Lu et al., 2020; Kato & Teshima, 2021). Recall that client k's goal is to estimate the ratio

r_k(x) = Σ_{l=1}^K p^te_l(x) / p^tr_k(x).   (I.1)

For client k, the BD risk is given by

E_f(r_k) = Ẽ_k[ℓ_1(r_k(x))] + Σ_{l=1}^K E_{p^te_l}[ℓ_2(r_k(x))]   (I.2)

where Ẽ_k := E_{p^tr_k} − C_k Σ_{l=1}^K E_{p^te_l}, 0 < C_k < 1/r̄_k, r̄_k = sup_{x∈X^tr} Σ_{l=1}^K p^te_l(x)/p^tr_k(x), ℓ_1(z) = ∇f(z) z − f(z), and ℓ_2(z) = C_k(∇f(z) z − f(z)) − ∇f(z). We note that the definition of C_k implies p̃_k = p^tr_k − C_k Σ_{l=1}^K p^te_l > 0. Recall that f : B_f → R is a strictly convex function with bounded gradient ∇f, where B_f ⊂ [0, ∞), and H_r ⊂ {r : X → B_f} denotes the hypothesis class for our model r. The nnBD DRM problem for client k is min_{r_k∈H_r} Ê+_f(r_k) where

Ê+_f(r_k) = ReLU( (Ê_{p^tr_k} − C_k Σ_{l=1}^K Ê_{p^te_l})[ℓ_1(r_k(x))] ) + Σ_{l=1}^K Ê_{p^te_l}[ℓ_2(r_k(x))]   (I.3)

with Ê_{p^tr_k} the sample average over {x^tr_{k,i}}_{i=1}^{n^tr_k} and Ê_{p^te_l} the sample average over {x^te_{l,j}}_{j=1}^{n^te}. In the following, we denote Ê_k := Ê_{p^tr_k} − C_k Σ_{l=1}^K Ê_{p^te_l} for notational simplicity. Let r̂_k := arg min_{r_k∈H_r} Ê+_f(r_k) and r*_k := arg min_{r_k∈H_r} E_f(r_k). We first decompose the generalization error into maximal deviation and bias terms:

E_f(r̂_k) − E_f(r*_k) ≤ [E_f(r̂_k) − Ê+_f(r̂_k)] + [Ê+_f(r̂_k) − E_f(r*_k)]
≤ [E_f(r̂_k) − Ê+_f(r̂_k)] + [Ê+_f(r*_k) − E_f(r*_k)]
≤ 2 sup_{r_k∈H_r} |E_f(r_k) − Ê+_f(r_k)|
≤ 2 sup_{r_k∈H_r} |Ê+_f(r_k) − E[Ê+_f(r_k)]| + 2 sup_{r_k∈H_r} |E[Ê+_f(r_k)] − E_f(r_k)|   (I.4)

where the second inequality holds since r̂_k := arg min_{r_k∈H_r} Ê+_f(r_k). The first term on the RHS of (I.4) is the maximal deviation and the second term is the bias.
In the following two lemmas, we upper bound the maximal deviation sup_{r_k∈H_r} |Ê+_f(r_k) − E[Ê+_f(r_k)]| and the bias sup_{r_k∈H_r} |E[Ê+_f(r_k)] − E_f(r_k)|, respectively.

Lemma 3 (Maximal deviation bound). Denote ∆_ℓ := sup_{z∈B_f} max_{i∈{1,2}} |ℓ_i(z)|. Then, for any 0 < δ < 1, with probability at least 1 − δ, the maximal deviation term is upper bounded as

sup_{r_k∈H_r} |Ê+_f(r_k) − E[Ê+_f(r_k)]| ≤ 4L_1 R^{p^tr_k}_{n^tr_k}(H_r) + 4(C_k L_1 + L_2) Σ_{l=1}^K R^{p^te_l}_{n^te}(H_r) + ∆_ℓ √( 2( 1/n^tr_k + K(1+C_k)²/n^te ) log(1/δ) ).   (I.5)

Proof. Denote Φ(S_k) := sup_{r_k∈H_r} |Ê+_f(r_k) − E[Ê+_f(r_k)]| with S_k = {x^tr_{k,1}, · · · , x^tr_{k,n^tr_k}, x^te_{1,1}, · · · , x^te_{K,n^te}}. Let S^{(i)}_k be obtained by replacing element i of the set S_k by an independent data point taking values in X^tr. We now measure the absolute value of the difference caused by changing one data point in the maximal deviation term, i.e., |Φ(S_k) − Φ(S^{(i)}_k)|. If the changed point is sampled from p^tr_k, this difference is upper bounded by 2∆_ℓ/n^tr_k. If the changed point is sampled from p^te_l, this difference is upper bounded by 2∆_ℓ(C_k+1)/n^te for l = 1, · · · , K. Applying McDiarmid's inequality (McDiarmid et al., 1989), with probability at least 1 − δ, we have

sup_{r_k∈H_r} |Ê+_f(r_k) − E[Ê+_f(r_k)]| ≤ E[ sup_{r_k∈H_r} |Ê+_f(r_k) − E[Ê+_f(r_k)]| ] + ∆_ℓ √( 2( 1/n^tr_k + K(1+C_k)²/n^te ) log(1/δ) ).

In the following, we establish an upper bound on the expected maximal deviation E[sup_{r_k∈H_r} |Ê+_f(r_k) − E[Ê+_f(r_k)]|] by generalizing the symmetrization argument in (Kiryo et al., 2017; Lu et al., 2020), followed by applying Talagrand's contraction lemma for the two-sided Rademacher complexity. Let m ∈ [M] and N_m ∈ Z_+ for M ∈ Z_+. Let g_m : R → R be an L_{g_m}-Lipschitz function.
Let p_{m,p} denote a probability distribution over X^tr. Suppose that {x_i}_{i=1}^{n_{m,p}} are drawn i.i.d. from p_{m,p} for p ∈ [N_m] and m ∈ [M]. Let ℓ^{(m,p)} : B_f → R_+ be an L_{m,p}-Lipschitz function and C̃_{m,p} a constant for all m, p. Consider the following stochastic process:

R̃_k(r_k) := Σ_{m=1}^M g_m( Σ_{p=1}^{N_m} C̃_{m,p} Ê_{m,p}[ℓ^{(m,p)}(r_k(x))] )

where Ê_{m,p} denotes the sample average over {x_i}_{i=1}^{n_{m,p}}. In the rest of the proof, we show that

E[ sup_{r_k∈H_r} |R̃_k(r_k) − E[R̃_k(r_k)]| ] ≤ 4 Σ_{m=1}^M Σ_{p=1}^{N_m} L_{g_m} |C̃_{m,p}| L_{m,p} R^{p_{m,p}}_{n_{m,p}}(H_r).   (I.6)

To prove (I.6), we consider a continuous extension of ℓ^{(m,p)} at the origin. Such an extension does not change R̃_k(r_k) since ℓ^{(m,p)} takes values only on B_f. If B_f = (z_1, z_2) for some 0 ≤ z_1 < z_2, then for any z ∈ [0, z_1] we define ℓ^{(m,p)}(z) = lim_{z↓z_1} ℓ^{(m,p)}(z), where the limit exists since ℓ^{(m,p)} is uniformly continuous due to its Lipschitz continuity. Then ℓ^{(m,p)} is L_{m,p}-Lipschitz on [0, z_2]. Let {x̃_i}_{i=1}^{n_{m,p}} be an independent copy of {x_i}_{i=1}^{n_{m,p}} and denote δ_R := E[sup_{r_k∈H_r} |R̃_k(r_k) − E[R̃_k(r_k)]|]. Then

δ_R ≤ Σ_{m=1}^M L_{g_m} Σ_{p=1}^{N_m} |C̃_{m,p}| E Ẽ[ sup_{r_k∈H_r} ( Ê_{m,p}[ℓ^{(m,p)}(r_k(x))] − Ê′_{m,p}[ℓ^{(m,p)}(r_k(x))] ) ]
= Σ_{m=1}^M L_{g_m} Σ_{p=1}^{N_m} |C̃_{m,p}| E Ẽ[ sup_{r_k∈H_r} ( Ê_{m,p}[ℓ^{(m,p)}(r_k(x)) − ℓ^{(m,p)}(0)] − Ê′_{m,p}[ℓ^{(m,p)}(r_k(x)) − ℓ^{(m,p)}(0)] ) ]
≤ 4 Σ_{m=1}^M L_{g_m} Σ_{p=1}^{N_m} |C̃_{m,p}| E[ sup_{r_k∈H_r} Ê_{m,p}[σ_{m,p}(ℓ^{(m,p)}(r_k(x)) − ℓ^{(m,p)}(0))] ]
≤ 4 Σ_{m=1}^M L_{g_m} Σ_{p=1}^{N_m} |C̃_{m,p}| L_{m,p} R^{p_{m,p}}_{n_{m,p}}(H_r)   (I.7)

where σ_{m,p} are Rademacher variables uniformly chosen from {−1, 1}, Ẽ and Ê′_{m,p} denote the expectation and the sample average over the distribution p_{m,p} and the independent copy {x̃_i}_{i=1}^{n_{m,p}}, respectively, the first inequality holds by the Lipschitz continuity of g_m, and the last inequality is obtained by applying Talagrand's contraction lemma for the two-sided Rademacher complexity (Ledoux & Talagrand, 1991; Bartlett & Mendelson, 2002).
Applying (I.6), we can show that

E[ sup_{r_k ∈ H_r} |Ê⁺_f(r_k) − E[Ê⁺_f(r_k)]| ] ≤ 4 L_1 R^{p^tr_k}_{n^tr_k}(H_r) + 4 (C_k L_1 + L_2) Σ_{l=1}^K R^{p^te_l}_{n^te}(H_r),

which completes the proof. ■

Next we find an upper bound on the bias sup_{r_k ∈ H_r} |E[Ê⁺_f(r_k)] − E_f(r_k)|.

Lemma 4 (Bias bound). Denote Δ_ℓ := sup_{z ∈ B_f} max_{i ∈ {1,2}} |ℓ_i(z)|. Assume inf_{r_k ∈ H_r} E[Ê_k[ℓ_1(r_k(x))]] > 0 for k ∈ [K]. Then the bias term is upper bounded by

sup_{r_k ∈ H_r} |E[Ê⁺_f(r_k)] − E_f(r_k)| ≤ (1 + K C_k) Δ_ℓ exp( −2η_k² / (Δ_ℓ²/n^tr_k + K C_k² Δ_ℓ²/n^te) )   (I.8)

for some constant η_k > 0.

Proof. Let Ê_k := Ê_{p^tr_k} − C_k Σ_{l=1}^K Ê_{p^te_l}. We first note that

|E[Ê⁺_f(r_k)] − E_f(r_k)| = |E[Ê⁺_f(r_k) − Ê_f(r_k)]|
= |E[ ReLU(Ê_k[ℓ_1(r_k(x))]) − Ê_k[ℓ_1(r_k(x))] ]|
≤ E| ReLU(Ê_k[ℓ_1(r_k(x))]) − Ê_k[ℓ_1(r_k(x))] |
= E[ 1( ReLU(Ê_k[ℓ_1(r_k(x))]) ≠ Ê_k[ℓ_1(r_k(x))] ) · ( ReLU(Ê_k[ℓ_1(r_k(x))]) − Ê_k[ℓ_1(r_k(x))] ) ]
≤ E[ 1( ReLU(Ê_k[ℓ_1(r_k(x))]) ≠ Ê_k[ℓ_1(r_k(x))] ) ] · sup_{z : |z| ≤ (1+KC_k)Δ_ℓ} ( ReLU(z) − z ),

where the first inequality holds due to Jensen's inequality. We note that |Ê_k[ℓ_1(r_k(x))]| ≤ (1 + K C_k) Δ_ℓ, which implies sup_{z : |z| ≤ (1+KC_k)Δ_ℓ} ( ReLU(z) − z ) ≤ (1 + K C_k) Δ_ℓ. By the assumption inf_{r_k ∈ H_r} E[Ê_k[ℓ_1(r_k(x))]] > 0, there exists an η_k > 0 such that E[Ê_k[ℓ_1(r_k(x))]] ≥ η_k for all r_k ∈ H_r. Writing R̃eLU(z) := ReLU(z) − z, we then have

E[ 1( ReLU(Ê_k[ℓ_1(r_k(x))]) ≠ Ê_k[ℓ_1(r_k(x))] ) ] = Pr( Ê_k[ℓ_1(r_k(x))] ∈ supp(R̃eLU) ) = Pr( Ê_k[ℓ_1(r_k(x))] < 0 ) ≤ Pr( Ê_k[ℓ_1(r_k(x))] < E[Ê_k[ℓ_1(r_k(x))]] − η_k ).

Denote Φ(S_k) := Ê_k[ℓ_1(r_k(x))] with S_k = {x^tr_{k,1}, …, x^tr_{k,n^tr_k}, x^te_{1,1}, …, x^te_{K,n^te}}, and let S^(i)_k be obtained by replacing element i of S_k by an independent data point taking values in X^tr. The absolute difference |Φ(S_k) − Φ(S^(i)_k)| caused by changing one data point is upper bounded by Δ_ℓ/n^tr_k if the changed point is sampled from p^tr_k, and by Δ_ℓ C_k/n^te if the changed point is sampled from p^te_l for l = 1, …, K. McDiarmid's inequality (McDiarmid, 1989) then implies the bound.
Finally,

Pr( Ê_k[ℓ_1(r_k(x))] < E[Ê_k[ℓ_1(r_k(x))]] − η_k ) ≤ exp( −2η_k² / (Δ_ℓ²/n^tr_k + K C_k² Δ_ℓ²/n^te) ),

which completes the proof. ■

Substituting the upper bounds (I.5) and (I.8) into (I.4), with probability at least 1 − δ, we have

E_f(r_k) − E_f(r*_k) ≤ 8 L_1 R^{p^tr_k}_{n^tr_k}(H_r) + 8 (C_k L_1 + L_2) Σ_{l=1}^K R^{p^te_l}_{n^te}(H_r) + Ψ(δ, Δ_ℓ, n^tr_k, n^te)   (I.9)

where Ψ = Δ_ℓ √( 8 (1/n^tr_k + K(1+C_k)²/n^te) log(1/δ) ) + 2 (1 + K C_k) Δ_ℓ exp( −2η_k² / (Δ_ℓ²/n^tr_k + K C_k² Δ_ℓ²/n^te) ) for some constant η_k > 0. This completes the proof.

By restricting the function class for density ratios and substituting an upper bound on its Rademacher complexity, we can obtain explicit generalization error bounds in terms of n^tr_k and n^te in special cases. As an example, the following corollary establishes a generalization bound for multi-layer perceptron density ratio models in terms of the Frobenius norms of the weight matrices.

Example J.1 (Complexity of the multi-layer perceptron class (Golowich et al., 2018)). Assume that the distribution p has bounded support, S_p := sup_{x ∈ supp(p)} ∥x∥ < ∞. Let H be the class of real-valued neural networks of depth L over the domain X^tr, and let W_i denote weight matrix i of the network. Suppose that each weight matrix has bounded Frobenius norm, ∥W_i∥_F ≤ Δ_{W_i} for i ∈ [L], and that the activation ϕ, applied element-wise, is 1-Lipschitz and positive-homogeneous, i.e., ϕ(αz) = αϕ(z). Then we have


R^p_n(H) ≤ S_p (√(2 L log 2) + 1) Π_{i=1}^L Δ_{W_i} / √n.

Remark 6. To control the upper bounds Δ_{W_i} for i ∈ [L], it is natural to exploit sparsity of the weights, e.g., (Golowich et al., 2018, Section 4) and (Hanin & Rolnick, 2019). We consider a special network architecture where diag(W_i) is close to a 1-sparse unit vector for i ∈ [L], which implies that the matrices W_i are almost rank-1. Then ∥W_i∥_F is upper bounded by 1 for i ∈ [L].

Corollary 1 (Generalization error bound under Example J.1). For Example J.1 and the loss functions described in Theorem 1, with probability at least 1 − δ, we have

E_f(r_k) − E_f(r*_k) ≤ K^tr_k / √(n^tr_k) + Σ_{l=1}^K K^te_l / √(n^te) + Ψ(δ, Δ_ℓ, n^tr_k, n^te)

where K^tr_k = O(L_1 S_{p^tr_k} √L Π_{i=1}^L Δ_{W_i}), K^te_l = O(max{L_1, L_2} S_{p^te_l} √L Π_{i=1}^L Δ_{W_i}), and Ψ = Δ_ℓ √( 8 (1/n^tr_k + K(1+C_k)²/n^te) log(1/δ) ) + 2 (1 + K C_k) Δ_ℓ exp( −2η_k² / (Δ_ℓ²/n^tr_k + K C_k² Δ_ℓ²/n^te) ) for some constant η_k > 0.

Finally, we apply a union bound and obtain a global generalization error bound that holds for all clients:

Corollary 2 (Generalization error bound for multiple clients). Let 0 < δ_k < 1 for k ∈ [K], and let K^tr = max_{k ∈ [K]} K^tr_k. For Example J.1 and the loss functions described in Theorem 1, with probability at least 1 − Σ_{k=1}^K δ_k, we have

max_{k ∈ [K]} { E_f(r_k) − E_f(r*_k) } ≤ K^tr / √(n^tr) + Σ_{l=1}^K K^te_l / √(n^te) + Ψ(δ, Δ_ℓ, n^tr, n^te)

where Ψ = Δ_ℓ √( 8 (1/n^tr + K(1+C)²/n^te) log(1/δ) ) + 2 (1 + K C) Δ_ℓ exp( −2η² / (Δ_ℓ²/n^tr + K C² Δ_ℓ²/n^te) ), C = max_{k ∈ [K]} C_k, n^tr = min_{k ∈ [K]} n^tr_k, δ = min_{k ∈ [K]} δ_k, and η = min_{k ∈ [K]} η_k.

These rates match the optimal minimax rates, for example, for a density estimation problem in which the density belongs to the Hölder function class (Tsybakov, 2008, Section 2) with a sufficiently large β in the sense of Definition 1.2 of Tsybakov (2008).
The Ω(1/√n) lower bounds are obtained for important problems including nonparametric regression, estimation of functionals, nonparametric testing, and finding a linear combination of M functions that is as close as possible to the target data-generating function (Nemirovski, 1998, Section 5.3).

K ADDITIONAL ERROR DUE TO ESTIMATION OF r k

In this section, we consider a practical scenario where we only have access to an imperfect estimate of r̄_k = sup_{x ∈ X^tr} r*_k(x), which is needed to find C_k in Eq. (3.2). In particular, we quantify the additional error incurred when using Ĉ_k = 1/r̂_k, where r̂_k is obtained by HDRM in Section 3. The nnBD DRM problem for client k using Ĉ_k is min_{r_k ∈ H_r} Ê⁺_f(r_k) where

Ê⁺_f(r_k) = ReLU( (Ê_{p^tr_k} − Ĉ_k Σ_{l=1}^K Ê_{p^te_l})[ℓ_1(r_k(x))] ) + Σ_{l=1}^K Ê_{p^te_l}[ℓ_2(r_k(x))].   (K.1)

Along the lines of the proof of Lemma 3, we can show that the maximal deviation term using Ĉ_k is upper bounded with probability at least 1 − δ:

sup_{r_k ∈ H_r} |Ê⁺_f(r_k) − E[Ê⁺_f(r_k)]| ≤ 4 L_1 R^{p^tr_k}_{n^tr_k}(H_r) + 4 (Ĉ_k L_1 + L_2) Σ_{l=1}^K R^{p^te_l}_{n^te}(H_r) + Δ_ℓ √( 2 (1/n^tr_k + K(1 + Ĉ_k)²/n^te) log(1/δ) ).   (K.2)

Under a perfect estimate of r̄_k = sup_{x ∈ X^tr} r*_k(x) with C_k = 1/r̄_k, the nnBD DRM problem for client k is min_{r_k ∈ H_r} Ê⁺⁺_f(r_k) where

Ê⁺⁺_f(r_k) = ReLU( (Ê_{p^tr_k} − C_k Σ_{l=1}^K Ê_{p^te_l})[ℓ_1(r_k(x))] ) + Σ_{l=1}^K Ê_{p^te_l}[ℓ_2(r_k(x))].   (K.3)

Applying the triangle inequality, we first decompose the bias term:

sup_{r_k ∈ H_r} |E[Ê⁺_f(r_k)] − E_f(r_k)| ≤ sup_{r_k ∈ H_r} |E[Ê⁺_f(r_k) − Ê⁺⁺_f(r_k)]| + sup_{r_k ∈ H_r} |E[Ê⁺⁺_f(r_k)] − E_f(r_k)|.   (K.4)

An upper bound on sup_{r_k ∈ H_r} |E[Ê⁺⁺_f(r_k)] − E_f(r_k)| is established as in the proof of Lemma 4:

sup_{r_k ∈ H_r} |E[Ê⁺⁺_f(r_k)] − E_f(r_k)| ≤ (1 + K C_k) Δ_ℓ exp( −2η_k² / (Δ_ℓ²/n^tr_k + K C_k² Δ_ℓ²/n^te) ).

Substituting Eq. (K.1) and Eq. (K.3) into |E[Ê⁺_f(r_k) − Ê⁺⁺_f(r_k)]|, we have

|E[Ê⁺_f(r_k) − Ê⁺⁺_f(r_k)]| = |E[ ReLU( (Ê_{p^tr_k} − Ĉ_k Σ_{l=1}^K Ê_{p^te_l})[ℓ_1(r_k(x))] ) − ReLU( (Ê_{p^tr_k} − C_k Σ_{l=1}^K Ê_{p^te_l})[ℓ_1(r_k(x))] ) ]|,

which together with |ReLU(a) − ReLU(b)| ≤ |a − b| is used to establish the following upper bound:

E| Ê⁺_f(r_k) − Ê⁺⁺_f(r_k) | ≤ K Δ_ℓ |Ĉ_k − C_k|.   (K.5)

Let m* = arg max_{m ∈ [M]} r̂_{k,m}.
We note that, by the construction of HDRM, there is a constant lower bound on the numerator of r̂_k, i.e., (1/n^te) Σ_{j=1}^{n^te} Σ_{l=1}^K 1(x^te_{l,j} ∈ B_{m*}) ≥ c. Moreover (Wasserman, 2006, Section 6), the mean squared error of the histogram density estimate satisfies E|p̂^tr_k(x; M) − p^tr_k(x)|² = O(L_k²/M² + M/n^tr_k). Putting this together with the constant lower bound on the numerator of r̂_k and applying Jensen's inequality, we have

E[|Ĉ_k − C_k|] ≲ 1/M + √(M/n^tr_k).

L PROOF OF LEMMA 1

We first note that E[L(θ̂)] − L(θ*) = E∥θ̂ − θ*∥²_{Σ_te} = Bias + Variance. We first find the expression for θ̂ by considering the ridge regression problem, assuming p_te(x)/p_tr(x) is given. The FIDEM problem with Tikhonov regularization is

θ̂ = arg min_θ Σ_{i=1}^n w_i (θ^⊤ x_i − y_i)² + λ∥θ∥²_2,

where w_i = p_te(x_i)/p_tr(x_i) and λ is the regularization parameter. This is a reweighted least squares problem. The objective above is strongly convex and differentiable. Applying the first-order optimality condition, the unique minimizer is

θ̂ = (X^⊤ W X + λ I_d)^{−1} X^⊤ W y   (L.1)

where W = diag(w_1, …, w_n). Substituting y = X θ* + ϵ into (L.1), we note that

θ̂ = (X^⊤ W X + λ I_d)^{−1} X^⊤ W X θ* + (X^⊤ W X + λ I_d)^{−1} X^⊤ W ϵ

and E_{X,ϵ}[θ̂] = E_X[(X^⊤ W X + λ I_d)^{−1} X^⊤ W X θ*]. We now characterize the bias B(θ̂) and variance V(θ̂) terms when the model estimate is given by (L.1). Let ∥x∥²_A := x^⊤ A x. Substituting the expression for θ̂ into R(θ̂), the excess risk can be decomposed into a bias term and a variance term as follows:

R(θ̂) = E_{X,ϵ,x,ϵ_te}[(y − θ̂^⊤ x)² − (y − θ*^⊤ x)²]
= E_{X,ϵ,x,ϵ_te}[ (y − θ*^⊤ x + (θ* − θ̂)^⊤ x)² − (y − θ*^⊤ x)² ]
= E_{X,ϵ,x}[ ((θ* − θ̂)^⊤ x)² ]
= E_{X,ϵ}[ ∥θ* − θ̂∥²_{Σ_te} ] = B + V,

where the bias is given by

B = E_{X,ϵ}[ ∥(X^⊤ W X + λ I_d)^{−1} X^⊤ W X θ* − θ*∥²_{Σ_te} ] = E_X[ ∥(X^⊤ W X + λ I_d)^{−1} λ θ*∥²_{Σ_te} ] = λ² θ*^⊤ E_X[ Δ_{W,λ} Σ_te Δ_{W,λ} ] θ*

with Δ_{W,λ} = (X^⊤ W X + λ I_d)^{−1}, and the variance is given by

V = E_{X,ϵ}[ ∥(X^⊤ W X + λ I_d)^{−1} X^⊤ W ϵ∥²_{Σ_te} ] = σ²_ϵ E_X[ tr(Φ_V) ],

where Φ_V = (X^⊤ W X + λ I_d)^{−1} X^⊤ W² X (X^⊤ W X + λ I_d)^{−1} Σ_te.
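As a sanity check, the closed form (L.1) can be verified numerically on synthetic data; the weights w_i below are randomly generated placeholders standing in for p_te(x_i)/p_tr(x_i), and all sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 200, 5, 0.1
X = rng.normal(size=(n, d))
theta_star = rng.normal(size=d)
y = X @ theta_star + 0.1 * rng.normal(size=n)
w = rng.uniform(0.5, 2.0, size=n)      # stand-in importance weights p_te/p_tr
W = np.diag(w)

# Closed-form minimizer of sum_i w_i (theta^T x_i - y_i)^2 + lam * ||theta||_2^2
theta_hat = np.linalg.solve(X.T @ W @ X + lam * np.eye(d), X.T @ W @ y)

# First-order optimality: the gradient of the objective vanishes at theta_hat.
grad = 2 * X.T @ W @ (X @ theta_hat - y) + 2 * lam * theta_hat
assert np.allclose(grad, 0, atol=1e-8)
```

Since the weighted objective is strongly convex, a vanishing gradient certifies that (L.1) is indeed the unique minimizer.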

M PROOF OF THEOREM 2

In the one-hot case, it is clear that X^⊤ X = Σ_{i=1}^n x_i x_i^⊤ and X^⊤ W X = Σ_{i=1}^n w_i x_i x_i^⊤ are diagonal matrices. For the bias in the one-hot setting, we have

B(θ̂) = λ² θ*^⊤ (X^⊤ W X + λ I)^{−1} Σ_te (X^⊤ W X + λ I)^{−1} θ* = λ² Σ_{i=1}^d [(θ*)_i]² λ'_i / (λ_i(X^⊤ W X) + λ)² = λ² Σ_{i=1}^d [(θ*)_i]² λ'_i / (µ_i w_i + λ)²,

where the equalities hold because all matrices involved, including X^⊤ X, X^⊤ W X, and Σ_te, are diagonal. Accordingly, we have λ_i(X^⊤ W X) = λ_i(X^⊤ X) λ_i(W) for i ∈ [d]. For the classical ERM, the bias is

B(θ_v) = λ² Σ_{i=1}^d [(θ*)_i]² λ_i / (µ_i + λ)²,

where λ_i is an eigenvalue of Σ_tr. To achieve B(θ̂) ⩽ B(θ_v), we have to make some assumptions on the relationship between λ_i, λ'_i, and w_i. Our analysis of the error bound requires

λ'_i / (µ_i w_i + λ)² ⩽ λ_i / (µ_i + λ)²  ⇔  (µ_i + λ)/(µ_i w_i + λ) ⩽ √(λ_i/λ'_i),   (M.1)

which holds whenever

w_i ⩾ √(λ'_i/λ_i) − 1,   (M.2)

where we use the inequality (a + c)/(b + c) ⩽ a/b + 1 for any a, b, c > 0. For the vanilla ERM, the variance is

V(θ_v) = σ²_ϵ Σ_{i=1}^d λ_i µ_i / (µ_i + λ)².

For FIDEM, the variance is

V(θ̂) = σ²_ϵ tr( (X^⊤ W X + λ I)^{−1} X^⊤ W² X (X^⊤ W X + λ I)^{−1} Σ_te ) = σ²_ϵ Σ_{i=1}^d λ'_i λ_i(X^⊤ W² X) / (λ_i(X^⊤ W X) + λ)² = σ²_ϵ Σ_{i=1}^d λ'_i µ_i w_i² / (µ_i w_i + λ)².

We note that V(θ̂) ≤ V(θ_v) can be achieved by

λ_i µ_i / (µ_i + λ)² ≥ λ'_i µ_i w_i² / (µ_i w_i + λ)².

This can be obtained by

(µ_i + λ/w_i)/(µ_i + λ) ≥ √(λ'_i/λ_i),   (M.3)

which implies w_i ⩽ ξ_i √(λ_i/λ'_i). Combining Eqs. (M.2) and (M.3), the proof is complete.

N WHEN FIDEM CANNOT OUTPERFORM ERM

In this section, we provide a counterexample showing that, in certain cases, FIDEM cannot provably outperform ERM.

Proposition 3. Consider the same setting as Theorem 2, i.e., the fixed-design setting and the label noise assumption, and suppose that λ'_i/λ_i ⩾ max{ξ, 1 − ξ}. If the ratio satisfies

w_i ⩽ min{ 1 / ((√(λ'_i/λ_i) − 1) ξ + 1), (λ'_i/λ_i + λ/µ_i) / (λ'_i/λ_i − λ/µ_i) },   (N.1)

then R(θ_v) ⩽ R(θ̂).

Proof. According to Eq. (M.1), B(θ_v) ⩽ B(θ̂) holds when (µ_i + λ)/(µ_i w_i + λ) ⩾ √(λ_i/λ'_i), which is equivalent to

w_i ⩽ (λ'_i/λ_i + λ/µ_i) / (λ'_i/λ_i − λ/µ_i).   (N.2)

According to Eq. (M.3), V(θ_v) ⩽ V(θ̂) holds when (µ_i + λ/w_i)/(µ_i + λ) ≤ √(λ'_i/λ_i), which is equivalent to

w_i ⩽ 1 / ((√(λ'_i/λ_i) − 1) ξ + 1).   (N.3)

Combining Eqs. (N.2) and (N.3), the proof is complete. To validate the condition in Eq. (N.1), we require each term on the RHS to be nonnegative. That implies λ'_i/λ_i ⩾ max{ξ, 1 − ξ}, which is exactly the condition in Proposition 3. ■

Batch normalization in ResNet-18 is handled by averaging the statistics on the server and subsequently broadcasting them to the workers. A learning rate of 0.0001 and a weight decay of 0.0001 are used. We report the best iterate in terms of average test accuracy after 20,000 iterations. All reported means and standard deviations are computed over 5 independent runs, except for CIFAR10, which uses 3 independent runs. For target shift, the randomisation is also over the realization of the class distributions, to ensure that the conclusions are not due to the particularities of the sub-sampled images. All experiments are carried out on an internal cluster using one GPU. (See Table 5 for the full train/test class proportions of clients 1–100.) Table 7 shows that FIDEM uniformly improves the accuracy over FedAvg on this difficult target shift instance.
We additionally include a two-client setting in Table 10, with the associated distribution described in Table 11. To model a scenario closer to real-world FL, we consider a setting with 100 clients on CIFAR10 under challenging distribution shifts and partial participation of clients, which is a requirement for cross-device FL (Kairouz et al., 2021; Wang et al., 2021). We sub-sample 5 clients uniformly at random at every round for 200,000 iterations. The target distribution is described in Table 5 and experimental results can be found in Table 6. We observe that FIDEM uniformly improves the test accuracy compared with FedAvg and that the gap is especially large for the worst-performing clients. To compute the exact ratio r(x), we assume that the distributions are separable.

Definition 3 (Separability). A distribution over X × Y is separable if there exists a partition (X_i)_{i=1}^m of X such that p(y_i | X_i) = 1 for some y_i ∈ Y and all i ∈ [m]. We denote the associated deterministic label assignment by g : X → Y.

Proposition 4 provides a way to compute the ratio r(x) when the labels are available and the shift is known.

O.2 COVARIATE SHIFT

The color flipping probabilities used to generate each of the colored MNIST datasets for the covariate shift experiment can be found in Table 12. We consider an asymmetric client setup where, in addition, client 1 has 40 times fewer training examples than client 2.

O.3 VERIFYING ASSUMPTIONS

Consider the two datasets used for the main experiments in Table 1 and Table 2. We verify in Figure 2 that the eigenvalues of the training and test distributions for each client satisfy λ'_i/λ_i ∈ (0, (1 + √(1 + 4ξ_i))/2), as required in Theorem 2.

                     FIDEM          FIIDEM         FedAvg
Average accuracy     0.82 ± 0.00    0.76 ± 0.01    0.76 ± 0.01
Client 1 accuracy    0.89 ± 0.01    0.80 ± 0.02    0.94 ± 0.00
Client 2 accuracy    0.74 ± 0.01    0.71 ± 0.02    0.58 ± 0.01

P COMPUTATIONAL COMPLEXITY OF ALGORITHM 1

We note that clients compute the ratios in parallel, where each client needs to estimate one ratio. To estimate density ratios for FIDEM, clients only need to send a few unlabelled test samples once. The server shuffles those samples and broadcasts the shuffled version to the clients, also only once. Compared to FedAvg, the additional computational cost per client is O(T N_k), where T is the number of iterations for Algorithm 1 to converge and N_k is the number of batches for ratio estimation. Compared to the FedAvg baseline, the additional computation of FIDEM is negligible but leads to substantial improvements in overall generalization under challenging distribution shifts.

Q LIMITATIONS

In this paper, we focus on settings where ratio estimation is required once, prior to model training. Handling distribution shifts in complex non-stationary settings, where ratio estimation is an ongoing process, is an interesting problem for future work. In addition, various personalization methods have been proposed to improve fairness in terms of uniformity of model performance across clients (Li et al., 2021a;b). To meet the specific requirements of each client, our global model can be combined with a personalized model on each client. Developing new variants of FIDEM with a focus on fairness is an interesting problem for future work.

The sudden increase in the ratio for the lowest eigenvalues in Figure 2 is most likely due to numerical error.

Table 13: Estimating the ratio upper bound with k-means clustering. We consider the target shift setup, for which a tight upper bound is known, and construct a single-client variant for simplicity. We specifically consider MNIST with label distributions during training and testing of q_tr ∝ (1/20, 1/20, 1/20, 1/20, 1/20, 1, 1, 1, 1, 1)^⊤ and q_te ∝ (1, 1, 1, 1, 1, 1/20, 1/20, 1/20, 1/20, 1/20)^⊤, respectively. The table shows the estimated upper bound on the ratio (r̂) for a range of clustering sizes. A reasonable estimate of the true maximal ratio of 20 is obtained for a wide range of clustering sizes. Whereas naively binning the space can be problematic due to division by zero, the clustering approach is less prone to this issue as long as #(clusters) ≪ #(datapoints).
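A minimal single-client version of this clustering estimator can be sketched on synthetic 1-D data with a known maximal ratio of 9; the data generation, sample sizes, and the tiny k-means routine below are all illustrative stand-ins:

```python
import numpy as np

def kmeans_1d(x, iters=50):
    # Tiny 1-D k-means with k=2; centers initialized at the extremes for determinism.
    centers = np.array([x.min(), x.max()])
    for _ in range(iters):
        labels = np.abs(x[:, None] - centers[None, :]).argmin(axis=1)
        centers = np.array([x[labels == j].mean() for j in range(2)])
    return labels

rng = np.random.default_rng(0)
n = 4000
y_tr = rng.random(n) < 0.1             # train: minority class with probability 0.1
y_te = rng.random(n) < 0.9             # test: same class with probability 0.9
x_tr = np.where(y_tr, 5.0, 0.0) + 0.1 * rng.normal(size=n)
x_te = np.where(y_te, 5.0, 0.0) + 0.1 * rng.normal(size=n)

# Cluster the pooled data, then compare per-cluster test/train frequencies.
labels = kmeans_1d(np.concatenate([x_tr, x_te]))
lab_tr, lab_te = labels[:n], labels[n:]
r_hat = max((lab_te == j).mean() / (lab_tr == j).mean() for j in range(2))
# r_hat should land near the true maximal ratio 0.9 / 0.1 = 9.
```

As in the table's discussion, each cluster plays the role of a bin, and the estimate stays stable as long as every cluster contains a reasonable number of training points.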



Our results can be extended to other typical distribution shifts, e.g., target shift (Azizzadenesheli, 2022). We provide experimental results on target shift in Section 5. Notations are provided in Appendix A. For notational simplicity, we use the same notation for probability distributions and density functions. The analyses of computational and communication overheads are provided in Appendices P and E, respectively. For notational simplicity, we overload ℓ(z, θ) to denote the loss of model θ on example z. A total of 1000 images is shown to be sufficient to learn density ratios on CIFAR10 (Kato & Teshima, 2021, Section 5.1). Fashion MNIST is provided under the MIT license.



Figure 1: An overview of FIDEM. Marginal train and test distributions of clients can be arbitrarily different, leading to intra-client and inter-client covariate shifts. To control privacy leakage, the server randomly shuffles unlabelled test samples and broadcasts them to the clients.

Zhang et al. (2020) proposed a one-step approach that jointly learns the predictive model and the corresponding weights in one optimization problem. Sugiyama et al. (2012) proposed a Bregman divergence-based DRM, which unifies various DRMs. Kato & Teshima (2021) proposed a non-negative Bregman divergence-based DRM to resolve the overfitting issue when using deep neural networks for density ratio estimation. While this line of work focuses on DRM with a single pair of train and test distributions, we consider a federated setting with multiple clients in this paper. Domain adaptation. Distribution shifts between a source and a target domain have been a prominent problem in machine learning for several decades (Kouw & Loog, 2019; Wang & Deng, 2018).

for m ∈ [M]; otherwise, r̂_m = 0. Finally, we propose to use C ≤ 1/r̂ where r̂ = max{r̂_1, …, r̂_M}. Convergence of r̂ to r̄ is established in Appendix G. Now, suppose there are K clients, where each client provides n^te unlabelled test samples to the pool of samples. Our goal is to estimate r_k in Eq. (3.1) for k = 1, …, K. The BD-based DRM problem for client k is given by min_{r_k ∈ H_r} Ê_f(r_k) where Ê_f

∈ B_m) for m ∈ [M]; otherwise, r̂_{k,m} = 0. Finally, we propose to use Ĉ_k = 1/r̂_k where r̂_k = max{r̂_{k,1}, …, r̂_{k,M}}.
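As an illustration of this histogram construction, the following sketch estimates per-bin ratios and the resulting constant on 1-D synthetic data; the bin count, supports, and densities are hypothetical, chosen so that the true maximal ratio is 1.5 and hence C ≈ 0.67:

```python
import numpy as np

def hdrm_constant(x_tr, x_te_pool, M=20, lo=0.0, hi=1.0):
    """Histogram sketch: per-bin density ratio estimates r_m ~ p_te(B_m)/p_tr(B_m),
    then C = 1 / max_m r_m, following the construction described above."""
    edges = np.linspace(lo, hi, M + 1)
    p_tr, _ = np.histogram(x_tr, bins=edges, density=True)
    p_te, _ = np.histogram(x_te_pool, bins=edges, density=True)
    mask = p_tr > 0                    # the ratio is defined only where p_tr > 0
    return 1.0 / (p_te[mask] / p_tr[mask]).max()

rng = np.random.default_rng(1)
x_tr = rng.uniform(0, 1, 5000)         # train density is 1 on [0, 1]
x_te = rng.beta(2, 2, 5000)            # test density 6x(1-x) peaks at 1.5
C = hdrm_constant(x_tr, x_te)
# C should be close to 1 / 1.5, i.e., about 0.67.
```

This also makes the failure mode discussed later concrete: a bin with zero training mass would force a division by zero, which is why the ratio is only evaluated where p_tr > 0 (or, as in Table 13, why clustering is preferable to naive binning).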

Following a symmetrization argument (Vapnik, 1999), an upper bound on the symmetrized process can be established via Rademacher complexity:

δ_R ≤ E Ẽ[ sup_{r_k ∈ H_r} | Σ_{m=1}^M g_m( Σ_{p=1}^{N_m} C̃_{m,p} Ê_{m,p}[ℓ^{(m,p)}(r_k(x))] ) − Σ_{m=1}^M g_m( Σ_{p=1}^{N_m} C̃_{m,p} Ê'_{m,p}[ℓ^{(m,p)}(r_k(x̃))] ) | ]

If the changed point is sampled from p^tr_k, then the absolute value of the difference caused in the maximal deviation term is upper bounded by Δ_ℓ/n^tr_k. If the changed point is sampled from p^te_l, then the absolute value of the difference caused in the maximal deviation term is upper bounded by

Our generalization error bound for client k depends on the Rademacher complexity of the hypothesis class for our density ratio model H_r ⊂ {r : X → B_f} w.r.t. client k's train distribution p^tr_k and all clients' test distributions p^te_l for l ∈ [K].

(x, y) and p_tr(x, y) are both separable. Then the ratio can be computed from the associated label y := g(x) as follows. Due to separability, p_te(y|x) = p_tr(y|x). So

r(x) := p_te(x)/p_tr(x) = p_te(x) p_te(y|x) / (p_tr(x) p_tr(y|x)) = p_te(x, y)/p_tr(x, y).   (O.2)

It follows that

p_te(x, y)/p_tr(x, y) = p_te(x|y) p_te(y) / (p_tr(x|y) p_tr(y)).   (O.3)

Using the definition of the target shift assumption, p_te(x|y) = p_tr(x|y), the conditional distributions cancel and we obtain r(x) = p_te(y)/p_tr(y), which is the claim. ■
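Concretely, for the MNIST target shift setup used elsewhere in this appendix (five classes down-weighted by 1/20 at train time and the complementary five at test time), Proposition 4 reduces the ratio to a lookup of label marginals:

```python
import numpy as np

# Label marginals q_tr, q_te of the target shift setup (proportions, then normalized).
q_tr = np.array([1 / 20] * 5 + [1.0] * 5)
q_te = np.array([1.0] * 5 + [1 / 20] * 5)
q_tr, q_te = q_tr / q_tr.sum(), q_te / q_te.sum()

# Under separability and target shift, r(x) = q_te(g(x)) / q_tr(g(x)),
# so the ratio takes only as many distinct values as there are classes.
r_per_class = q_te / q_tr
# The maximal ratio is 20, matching the upper bound targeted in Table 13.
```

This is why, in the clustering experiment, a good partition of the input space recovers the maximal ratio of 20: each well-formed cluster corresponds to one class, and the per-cluster frequency ratio approximates q_te(y)/q_tr(y).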

Figure 2: The squared ratios of eigenvalues, ordered in descending order, are all below 1, thus satisfying λ'_i/λ_i ∈ (0, (1 + √(1 + 4ξ_i))/2) in Theorem 2.

considered C as a hyper-parameter, which can be tuned. However, obtaining an efficient estimate of r k is desirable, in particular when training a deep model. Here we propose a histogram-based method for estimation of r k .

of the hypothesis class for our density ratio model H_r ⊂ {r : X → B_f} w.r.t. client k's train distribution p^tr_k and all clients' test distributions p^te_l for l ∈ [K]. Let R^p_n(H) denote the Rademacher complexity of a function class H w.r.t. a distribution p, formally defined in Appendix A. We first make the following assumptions on ℓ_1

Fashion MNIST with label shift across two clients, where each client receives different fractions of examples from each class. In this case, FIDEM achieves a better average accuracy than the baselines.

A challenging binary classification task on Colored MNIST with covariate shift across two clients. FIDEM is close to the idealised baseline that ignores the spurious correlation (Grayscale).

Examples of f for BD-based methods (Sugiyama et al., 2012; Kato & Teshima, 2021): LSIF = least-squares importance fitting, LR = logistic regression, BKL = binary Kullback-Leibler, UKL = unnormalized Kullback-Leibler, KLIEP = Kullback-Leibler importance estimation procedure, KMM = kernel mean matching, PULogLoss = positive and unlabeled learning with log loss.

Histogram-based density ratio matching. Loops are executed in parallel on each client. Let n ∈ Z⁺, let p be a distribution, let S = {x_1, …, x_n} be i.i.d. random variables drawn from p, and let H be a function class. The Rademacher complexity of H w.r.t. p is given by

R^p_n(H) := E_{S,σ}[ sup_{h ∈ H} | (1/n) Σ_{i=1}^n σ_i h(x_i) | ],

where σ_1, …, σ_n are i.i.d. Rademacher variables, uniform on {−1, 1}.
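For intuition, the (two-sided) empirical Rademacher complexity can be estimated by Monte Carlo for a simple class; for the norm-bounded linear class {x ↦ w·x : ∥w∥₂ ≤ B} the inner supremum has the closed form B ∥Σᵢ σᵢ xᵢ∥₂ / n, which the sketch below exploits (all sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, B, trials = 500, 10, 1.0, 2000
X = rng.normal(size=(n, d))                        # n samples from p
sigma = rng.choice([-1.0, 1.0], size=(trials, n))  # Rademacher signs

# sup_{||w|| <= B} |(1/n) sum_i sigma_i w.x_i| = B * ||sum_i sigma_i x_i||_2 / n,
# by Cauchy-Schwarz, so each trial's supremum is computed exactly.
sup_vals = B * np.linalg.norm(sigma @ X, axis=1) / n
rad = sup_vals.mean()
# rad decays like O(B * sqrt(d) / sqrt(n)); here roughly 0.14.
```

The O(1/√n) decay visible here is exactly the source of the 1/√(n^tr_k) and 1/√(n^te) terms in Corollaries 1 and 2.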

Details of scenarios described in Section 2.

r. Let {x^tr_i}_{i=1}^{n^tr} and {x^te_j}_{j=1}^{n^te} denote unlabelled samples drawn i.i.d. from distributions p_tr and p_te, respectively. Let H_r ⊂ {r : X → B_f} denote a hypothesis class for our model r. Using an empirical approximation of E_f(r*(x) ∥ r(x)), Sugiyama et al. (2012) formulated the BD-based DRM problem as min_{r ∈ H_r} Ê_f(r) where

We note that sup_{x ∈ B} r*(x) ≤ sup_{x ∈ B} p_te(x) / inf_{x ∈ B} p_tr(x) and p_te(x^te)/p_tr(x^tr) ≤ sup_{x ∈ B} p_te(x) / inf_{x ∈ B} p_tr(x).

The communication overhead to estimate ratios is negligible compared to the communication costs for sharing high-dimensional stochastic gradients over the course of training. Consider the example of CIFAR10, consisting of 32 by 32 images with 3 channels, each value represented by 8 bits. If one shares 1000 unlabelled images, the communication amounts to sharing roughly 3 × 10^6 values each with 8 bits, i.e., about 25 × 10^6 total communication bits or 3.1 MB. In contrast, during training, the network size alone easily surpasses this (e.g., the common ResNet-18 has 11 million parameters, each represented by a 32-bit floating point number). Standard training of ResNet-18 requires 8 × 10^4 iterations and aggregations, which amounts to 2.816 × 10^13 total communicated bits per client, i.e., 3.5 TB during training.
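The arithmetic above can be checked in a few lines, using the sizes assumed in the text:

```python
# 1000 CIFAR10 images: 32x32 pixels, 3 channels, 8 bits per value.
ratio_bits = 1000 * 32 * 32 * 3 * 8
mb = ratio_bits / 8 / 1e6                 # about 3.07 MB, shared once

# ResNet-18 gradients: 11M parameters, 32 bits each, 8e4 aggregation rounds.
train_bits = 11_000_000 * 32 * 80_000
tb = train_bits / 8 / 1e12                # about 3.52 TB per client during training
```

The ratio-estimation payload is thus roughly six orders of magnitude smaller than the gradient traffic, which is the point of the comparison.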

r̂ AND k-MEANS CLUSTERING. Lemma 2. If n^tr_k, n^te, and M go to infinity with sup_m Vol(B_m) → 0, then r̂_k → r̄_k.

Proof. Let x ∈ X^tr. Note that when n^tr_k, n^te, and M go to infinity, the numerator and denominator of r̂_k become

Please note that our density ratio in Eq. (3.1) has the form of a sum of test densities over the client's own train density. So even if one or a few of the ratios are poorly estimated, this will not impact the entire ratio in Eq. (3.1) through nested estimation errors: the error propagates additively rather than multiplicatively.

that is achieved when {x^te_{l,j}}_{j=1}^{n^te} are distributed uniformly across the M bins. Let x ∈ X and let p̂_tr(x ∈ B_{m*}) meet its lower bound, which leads to the maximum deviation from C_k ≤ Ĉ_k. Assuming p^tr_k(x) is L_k-Lipschitz with sup_{x ∈ X} p^tr_k(x) < ∞, the mean squared error of a histogram-based density estimate with M bins is upper bounded by

Fashion MNIST (Xiao et al., 2017), and CIFAR10 (Krizhevsky et al., 2009). MNIST consists of images depicting handwritten digits from 0 to 9. The resolution of each image is 28 × 28. The dataset includes 60,000 images for training. Similarly, Fashion MNIST includes grayscale images of clothing at resolution 28 × 28. The training set consists of 60,000 examples and the test set of 10,000 examples. CIFAR10 consists of colored images with a resolution of 32 × 32. The training set contains 50,000 examples, while the test set contains 10,000 examples. For all experiments we use the cross-entropy loss. The stochastic gradients for each of the clients are computed with a batch size of 64 and aggregated on the server, which uses the Adam optimizer. Experiments on MNIST and Fashion MNIST use a LeNet (LeCun et al., 1998), a learning rate of 0.001, no weight decay, and run for 5,000 iterations. For CIFAR10 experiments we use the larger ResNet-18 (He et al., 2016).

CIFAR10 target shift distribution across 100 clients, where groups of 10 clients share the same distribution.

Average, worst-case, and best-case client accuracies of the CIFAR10 target shift experiment across 100 clients, where 5 randomly sampled clients participate in every round of training. For the target shift experiments on Fashion MNIST in Table 1, we summarize the number of data points for each dataset split in Table 9. A similar distribution across clients is used for the additional experiments for FIDEM and FedAvg on CIFAR10 (Table 8). CIFAR10 differs from Fashion MNIST in the number of examples due to its training set being smaller. The results for CIFAR10 in Table

Target shift on CIFAR10 with ResNet-18.

                    FIDEM              FedAvg
Average accuracy    0.6004 ± 0.0076    0.4426 ± 0.0291
Client 1 accuracy   0.6714 ± 0.0153    0.3984 ± 0.1497
Client 2 accuracy   0.8196 ± 0.0962    0.7307 ± 0.1533
Client 3 accuracy   0.5412 ± 0.0776    0.3333 ± 0.2251
Client 4 accuracy   0.5087 ± 0.0827    0.3030 ± 0.1106
Client 5 accuracy   0.4610 ± 0.0508    0.4476 ± 0.3649

CIFAR10 target shift distribution.

Proposition 4. Assume that the distributions p_te

Fashion MNIST target shift distribution.

Fashion MNIST with target shift across two clients.

Two-client Fashion MNIST. The number of samples for each class across the different datasets.

For covariate shift, the datasets for each of the clients are constructed using different probabilities.


To estimate {r_k(x)}_{k=1}^K, clients need to send unlabelled samples x^te_{l,j} for l ∈ [K] and j ∈ [n^te] from their test distributions. We note that, instead of their true samples, clients can alternatively send samples generated by a generative model (Goodfellow et al., 2020). Note that training GANs may be computationally expensive due to the required computational resources and the availability of representative samples. However, we propose to use GANs as an alternative method, with clear caveats, only when 1) clients have sufficient computational resources and 2) they are unwilling to share unlabelled data with the server. As a partial mitigation of privacy risks, we introduced FIIDEM. FIIDEM does not require any data sharing among clients and does not require any GAN training. In this paper, we focus on FIDEM since it outputs an unbiased estimate of a minimizer of the overall true risk and enables us to theoretically show the benefit of importance weighting for generalization. One particular challenge in real-world cross-device FL is to estimate ratios on real-world datasets such as WILDS (Koh et al., 2021) and LEAF (Caldas et al., 2019). WILDS has mostly been used for domain generalization, where the setting is not similar to ours; we would still have to decide on an arbitrary test/train split. LEAF mainly captures inter-client distribution shifts and settings where different clients have different numbers of examples over thousands of clients. This work is not about experimental scalability to thousands of clients, given our single-GPU simulated setup. While we anticipate that efficient ratio estimation will improve over time, our FIDEM and FIIDEM formulations, along with improved ratio estimates, provide reasonable solutions for learning an effective global model in real-world cross-device FL under covariate shifts.

