GENERALIZATION BOUNDS FOR FEDERATED LEARNING: FAST RATES, UNPARTICIPATING CLIENTS AND UNBOUNDED LOSSES

Abstract

In federated learning, the underlying data distributions may differ across clients. This paper provides a theoretical analysis of the generalization error of federated learning that captures both the heterogeneity and the relatedness of the distributions. In particular, we assume that the heterogeneous distributions are sampled from a meta-distribution. In this two-level distribution framework, we characterize the generalization error not only for clients participating in the training but also for unparticipating clients. We first show that the generalization error for unparticipating clients can be bounded by the participating generalization error and the participation gap caused by client sampling. We further establish fast learning bounds of order $O(\frac{1}{mn} + \frac{1}{m})$ for unparticipating clients, where $m$ is the number of clients and $n$ is the sample size at each client. To our knowledge, the obtained fast bounds are state-of-the-art in the two-level distribution framework. Moreover, previous theoretical results mostly require the loss function to be bounded. We derive convergence bounds of order $O(\frac{1}{\sqrt{mn}} + \frac{1}{\sqrt{m}})$ under unbounded assumptions, including sub-exponential and sub-Weibull losses.

1. INTRODUCTION

In federated learning, a common model is trained through the collaboration of participating clients holding local data samples (McMahan et al., 2017). Typically, the underlying distributions vary across clients since the data-generating processes are affected by the local environment. Federated learning is called heterogeneous when the local distributions differ (Wang et al., 2021). Most existing experimental and theoretical results focus on the convergence of optimization on training datasets (Li et al., 2020b; Karimireddy et al., 2020; Mitra et al., 2021; Mishchenko et al., 2022; Yun et al., 2022). The generalization error, which is more natural and important in machine learning, has not been carefully examined in heterogeneous federated learning. As a key performance indicator of a machine learning model, the generalization error measures the performance of a trained model by its population risk under the corresponding distribution. However, existing generalization results are generally derived for clients participating in the training, which only captures the performance of the learned model on distributions seen during training (Mohri et al., 2019; Chen et al., 2021; Masiha et al., 2021). In practice, the probability that a client participates in federated training is affected by many factors, such as the reliability of network connections or the availability of the client. The realistic participation ratio may be low, and many clients never have a chance to participate in the training process (Kairouz et al., 2021; Li et al., 2020a; Yuan et al., 2021). Although the training process runs only on participating clients, the trained model will be used by both unparticipating and participating clients.
Since the data distributions of unparticipating clients differ from those of participating clients, it is natural and pressing to ask the following question: would the unparticipating clients benefit from the model trained by participating clients? To answer this question theoretically, we take the participation gap into account in the analysis of generalization error, which is generally ignored by existing works. In addition to the ignored participation gap, existing theoretical results on the generalization error of heterogeneous federated learning have, to our knowledge, two more limitations. First, all previous learning rates in probability form are of order $O(\frac{1}{\sqrt{mn}})$, where $m$ is the number of clients and $n$ is the sample size at each client (Mohri et al., 2019). We note that faster rates of order $O(\frac{1}{mn})$ are derived in (Chen et al., 2021). However, their learning rates are in expectation form. Faster learning rates in probability form have not been derived, even for participating clients only. Guarantees in expectation form reflect the average performance of the model over randomly sampled training datasets. Bounds in probability form, which we focus on in this paper, reflect the performance for a single draw of the datasets (Klochkov & Zhivotovskiy, 2021; Kanade et al., 2022; Sefidgaran et al., 2022a). Second, most previous generalization bounds are derived by assuming that the loss function is bounded. However, a variety of learning problems do not satisfy this assumption. These include regression problems where unbounded noise is added to labels (Kuchibhotla & Patra, 2022; Kuchibhotla & Chakrabortty, 2018; Zhang & Zhou, 2018), clustering tasks with heavy-tailed distributions (Paul et al., 2021; Vellal et al., 2022), domain adaptation, and so on. Notable exceptions in this direction include (Barnes et al., 2022) and (Sefidgaran et al., 2022b).
However, their results are established under the assumption that local clients are homogeneous, which is highly restrictive in the general federated scenario. In this paper, we assume that the data distributions of participating and unparticipating clients are drawn from a meta-distribution $P$. We argue that this assumption is reasonable in practice. For instance, in cross-device federated learning, the total number of clients is generally large and it is natural to assume that there exists a meta-distribution (Reisizadeh et al., 2020; Wang et al., 2021). In this learning scenario, we assume that the total number of clients is $M$. Among these $M$ clients, only $m$ clients have a chance to participate in the training phase, which means that the training process only involves the $m$ distributions $\{D_i\}_{i=1}^m$. Note that the total number $M$ and the number of unparticipating clients/distributions are generally larger than $m$ (Hu et al., 2022; Xu & Wang, 2020; Yang et al., 2020). Practically, the model is trained on datasets $\{S_i\}_{i=1}^m$, where $S_i$ is the dataset located at client $i$ and is sampled from $D_i$. This two-level framework not only captures the heterogeneity of clients' distributions but also reflects the relatedness of the distributions. It allows us to characterize the generalization performance on both participating and unparticipating distributions. A similar framework has been used in recent literature (Yuan et al., 2021; Reisizadeh et al., 2020; Wang et al., 2021). However, these works mainly focus on optimization performance or only provide experimental results on generalization. The objective of this work is to provide theoretical results on the generalization error in this framework. Our contributions are summarized as follows.
• We provide a systematic analysis of the generalization error of federated learning in the two-level framework, which captures the participation gap missed by existing works. This two-level framework captures both the heterogeneity and the relatedness of clients' distributions. Moreover, all learning bounds presented in this paper are in probability form instead of expectation form.
• We derive fast learning rates in the empirical risk minimization setting. The unparticipating error is bounded by two terms. One is the participating error. The other is the participation gap resulting from clients missing in the training. Our participating bounds and unparticipating bounds are of order $O(\frac{1}{mn})$ and $O(\frac{1}{mn} + \frac{1}{m})$, respectively.
• We study learning bounds for unbounded loss functions, including sub-Gaussian, sub-exponential, and heavy-tailed losses. Small-ball methods and concentration inequalities for unbounded random variables are used in the unbounded setting. Our bounds are comparable with the existing results under bounded assumptions.

The rest of the paper is organized as follows. In Section 2, we describe the two-level distribution framework and provide basic theoretical results in this framework. In Section 3, we derive fast generalization bounds. In Section 4, we go beyond the bounded assumption and provide generalization bounds for unbounded losses such as sub-exponential and sub-Weibull losses. In Section 5, we discuss related work on the generalization analysis of heterogeneous federated learning. Finally, we conclude this paper in Section 6. All proofs are postponed to the appendix.

2. TWO-LEVEL DISTRIBUTION FRAMEWORK

Let $\mathcal{X}$ denote the input space and $\mathcal{Y} \subset \mathbb{R}$ the output space. For simplicity, we denote by $Z = (X, Y)$ the random variable with support $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$. Let $\mathcal{D}$ denote the set of all probability distributions on $\mathcal{Z}$ and let $P$ be a meta-distribution on $\mathcal{D}$. The assumption of a meta-distribution is reasonable especially in the cross-device federated learning scenario, where the local devices may be a large population of mobile phones. As shown in Figure 1, in this two-level distribution framework, we assume the total number of clients is $M$ (possibly infinite) and the number of clients participating in training is $m$. It is worth emphasizing that $M$ is generally larger than $m$ owing to unreliable network connections. We denote by $D_i$ the distribution associated with client $i$ and assume that $\{D_1, \dots, D_m\}$ are independently sampled from $\mathcal{D}$ according to $P$. The data sample $S_i = \{Z_i^j\}_{j=1}^n$ located on participating client $i$ consists of $n$ i.i.d. realizations of $Z$ following $D_i$. The global model is trained on $\{S_i\}_{i=1}^m$ and will be used by all $M$ clients. The two-level distribution framework allows us to measure the performance of the global model with respect to the clients' distribution $P$, which quantifies both the participation gap (caused by client sampling) and the participating error (caused by data sampling from participating distributions).

Throughout the paper we denote $\mathcal{F} = \{z \mapsto \ell(h, z) : h \in \mathcal{H}\}$. Moreover, we use $Z_i = (X_i, Y_i)$ to represent the random variables across the two-level framework. That is, $\mathbb{E}[Z_i] = \mathbb{E}_{D_i \sim P}[\mathbb{E}_{Z_i \sim D_i}[Z_i]]$. Let the hypothesis space $\mathcal{H}$ be a family of real-valued functions defined on $\mathcal{X}$. The loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$ is a non-negative function. We define the population risk by
\[ L_P(h) = \mathbb{E}_{D_i \sim P}\big[\mathbb{E}_{Z \sim D_i}[\ell(h(X), Y)]\big], \]
where $h \in \mathcal{H}$ represents the global hypothesis shared by all local clients. The population risk minimizer $h^*$ associated with $L_P(h)$ is defined as $h^* = \arg\min_{h \in \mathcal{H}} L_P(h)$.
However, it is impossible to minimize $L_P(h)$ directly because the exact meta-distribution and the clients' local distributions are unknown. We have access to only a finite number of clients and finite training data at each client. The population risk is therefore usually optimized through the empirical risk minimization (ERM) objective
\[ L_S(h) = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{n} \sum_{j=1}^{n} \ell(h(X_i^j), Y_i^j), \]
where $(X_i^j, Y_i^j)$ represents the $j$-th training data point at the $i$-th participating client. For simplicity, we denote by $Z_i^j = (X_i^j, Y_i^j)$ the data point and let $S_i = \{Z_i^j\}_{j=1}^n$. We denote by $\hat{h}$ the empirical risk minimizer. As in Appendix A, the semi-empirical distribution is $D = \frac{1}{m}\sum_{i=1}^m D_i$, the corresponding semi-empirical risk is $L_D(h) = \frac{1}{m}\sum_{i=1}^m \mathbb{E}_{Z \sim D_i}[\ell(h, Z)]$, and $\hat{h}^* = \arg\min_{h \in \mathcal{H}} L_D(h)$ is the semi-empirical risk minimizer.

The semi-excess risk for participating clients is defined as $L_D(\hat{h}) - L_D(\hat{h}^*)$. It indicates the performance of the learned model $\hat{h}$ on unseen data drawn from the semi-empirical distribution $D$. The excess risk for unparticipating clients is defined as $L_P(\hat{h}) - L_P(h^*)$. It indicates the performance of the learned model $\hat{h}$ on unseen clients distributed according to $P$. Note that the excess risk $L_P(\hat{h}) - L_P(h^*)$ is defined across the two-level distribution framework. In our analysis, every upper bound on the excess risk $L_P(\hat{h}) - L_P(h^*)$ involves the semi-excess risk $L_D(\hat{h}) - L_D(\hat{h}^*)$ or an upper bound on it. To understand this framework better, we present our basic results on excess risk as follows.

Definition 1 (VC dimension). Let $(\mathcal{X}, \mathcal{H})$ be a set system that consists of a set $\mathcal{X}$ and a class $\mathcal{H}$ of subsets of $\mathcal{X}$. The set system $(\mathcal{X}, \mathcal{H})$ shatters a set $A$ if each subset of $A$ can be expressed as $A \cap h$ for some $h \in \mathcal{H}$. The VC-dimension of $\mathcal{H}$ is the size of the largest set shattered by $\mathcal{H}$.

Definition 2 (VC subgraph of real-valued function). The subgraph of a function $h \in \mathcal{H}$, $h : \mathcal{X} \to \mathbb{R}$, is the subset of $\mathcal{X} \times \mathbb{R}$ given by $\{(x, t) : t < h(x)\}$. Then the VC-dimension of the function class $\mathcal{F}$ is defined as the VC-dimension of the set of subgraphs of functions in $\mathcal{H}$.

Theorem 1 (Generalization error for unparticipating clients). Let $\mathcal{F}$ be a family of functions related to the hypothesis space $\mathcal{H}$: $\mathcal{F} = \{z \mapsto \ell(h, z) : h \in \mathcal{H}\}$. Suppose $\mathcal{F}$ is a VC-subgraph class with VC dimension $d$. If the loss function $\ell$ is bounded by $b$, then with probability at least $1 - 2\delta$,
\[ L_P(\hat{h}) - L_P(h^*) \le c_1 b \sqrt{\frac{d}{mn}} + b \sqrt{\frac{\ln(1/\delta)}{2mn}} + c_2 b \sqrt{\frac{d}{m}} + b \sqrt{\frac{\ln(1/\delta)}{2m}}, \]
where $c_1$ and $c_2$ are constants.

Remark 1. Assume the total number of clients is $M$ and $P$ is a concrete meta-distribution on the $M$ different clients' distributions. The global model $\hat{h}$ is trained on $m$ participating clients. The excess risk measures the average performance of $\hat{h}$ on all $M$ clients, which include participating and unparticipating clients. Theorem 1 indicates that increasing the number of participating clients $m$ decreases the excess risk $L_P(\hat{h}) - L_P(h^*)$. In cross-device federated learning, the number of participating clients $m$ may be large enough that the excess risk approaches zero. Based on this discussion, we can give a positive answer to the question asked in the Introduction. That is, from the perspective of average performance, unparticipating clients would benefit from the model trained by participating clients.

Remark 2. Our theoretical results show that, under some assumptions, we are able to bound the excess risk $L_P(\hat{h}) - L_P(h^*)$. When every client is completely different, $L_P(h^*)$ will be large. Thus, the generalization error for unparticipating clients $L_P(\hat{h})$ will be large. This observation indicates that we cannot expect one common model to work well when heterogeneity is high. We provide experimental results on EMNIST (Cohen et al., 2017) and synthetic data in the appendix. Although Theorem 1 is derived for VC classes, experiments with neural networks support our theory.
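To make the two parts of the Theorem 1 bound concrete, the following minimal sketch evaluates its right-hand side numerically. It assumes the square-root form $c_1 b\sqrt{d/(mn)} + b\sqrt{\ln(1/\delta)/(2mn)} + c_2 b\sqrt{d/m} + b\sqrt{\ln(1/\delta)/(2m)}$; the function name `theorem1_terms` and the values $c_1 = c_2 = 1$ are illustrative assumptions, since the theorem leaves the constants unspecified.

```python
import math


def theorem1_terms(m, n, d, b=1.0, delta=0.05, c1=1.0, c2=1.0):
    """Return (participating term, participation-gap term) of the Theorem 1 bound.

    c1 and c2 are placeholder values; the theorem only states that they are constants.
    """
    if m <= 0 or n <= 0 or d <= 0 or not 0.0 < delta < 1.0:
        raise ValueError("need m, n, d > 0 and 0 < delta < 1")
    log_term = math.log(1.0 / delta)
    # Terms that shrink with the total number of samples mn (data sampling).
    participating = c1 * b * math.sqrt(d / (m * n)) + b * math.sqrt(log_term / (2 * m * n))
    # Terms that shrink only with the number of clients m (client sampling).
    gap = c2 * b * math.sqrt(d / m) + b * math.sqrt(log_term / (2 * m))
    return participating, gap


for m in (10, 100, 1000):
    part, gap = theorem1_terms(m=m, n=50, d=10)
    print(f"m={m:5d}  participating={part:.4f}  gap={gap:.4f}  total={part + gap:.4f}")
```

Under these placeholder constants the gap term dominates whenever $n > 1$, which matches the reading in Remark 1 that adding participating clients, rather than adding samples per client, is what drives the excess risk for unparticipating clients down.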

3. FAST LEARNING RATES IN TWO-LEVEL DISTRIBUTION FRAMEWORK

In this section, we present fast learning rates in our two-level distribution framework. Recall that $\hat{h}$ is the empirical risk minimizer and $h^*$ is the population risk minimizer. Our goal is to bound the semi-excess risk for participating clients $L_D(\hat{h}) - L_D(\hat{h}^*)$ and the excess risk for unparticipating clients $L_P(\hat{h}) - L_P(h^*)$. To get faster learning rates in our two-level distribution framework, we start by making some assumptions on the loss function $\ell$, the hypothesis space $\mathcal{H}$, the semi-empirical distribution $D$, and the meta-distribution $P$.

Assumption 1. The loss function $\ell$ is $L$-Lipschitz in its first argument: $|\ell(y_1, y) - \ell(y_2, y)| \le L |y_1 - y_2|$.

Definition 3 (Bernstein condition). Let $\mu$ be a distribution supported on $\mathcal{X} \times \mathcal{Y}$ and let $\ell$ be a loss function with domain $\mathcal{Y} \times \mathcal{Y}$. The tuple $(\mu, \ell, \mathcal{H}, h^*)$ satisfies the $(\beta, B)$-Bernstein condition with parameter $B > 0$ if the following holds for any $h \in \mathcal{H}$:
\[ \mathbb{E}\big[(h(X) - h^*(X))^2\big] \le B \big(\mathbb{E}[\ell(h(X), Y) - \ell(h^*(X), Y)]\big)^{\beta}. \]

It is well known that fast learning rates require extra assumptions. The Bernstein condition is widely used to obtain fast learning rates in the learning theory community (Xu & Zeevi, 2020; van Erven et al., 2015; Wu et al., 2022). We emphasize that it is not too restrictive. For example, it is directly implied by the boundedness of the functions under any probability distribution (Bartlett et al., 2004). Moreover, regression problems with a strictly convex loss function satisfy the Bernstein condition if the function class is convex (Lecué & Mendelson, 2013). Other examples include excess risk functions with the minimizer of the population risk when the loss function is strongly convex and Lipschitz (Klochkov & Zhivotovskiy, 2021).

Assumption 2. Theoretical analyses in our two-level distribution framework involve different types of Bernstein conditions:
(a) The tuple $(D, \ell, \mathcal{H}, \hat{h}^*)$ satisfies the Bernstein condition with parameters $B' \ge 1$, $0 < \beta' \le 1$.
That is, for any $h \in \mathcal{H}$,
\[ \frac{1}{m} \sum_{i=1}^{m} \mathbb{E}\big[(h(X_i^1) - \hat{h}^*(X_i^1))^2\big] \le B' \big(L_D(h) - L_D(\hat{h}^*)\big)^{\beta'}. \]
(b) The tuple $(P, \ell, \mathcal{H}, h^*)$ satisfies the Bernstein condition with parameters $B'' \ge 1$, $0 < \beta'' \le 1$. That is, for any $h \in \mathcal{H}$,
\[ \mathbb{E}_{D_i \sim P}\big[\mathbb{E}_{X \sim D_i}[(h(X) - h^*(X))^2]\big] \le B'' \big(L_P(h) - L_P(h^*)\big)^{\beta''}. \]

For our purposes, we need both (a) and (b) in Assumption 2 to hold. A typical example satisfying Assumption 2 is the quadratic loss with a convex function class $\mathcal{H}$. We provide some examples satisfying Assumption 2 in the appendix. For more details, we refer to (Xu & Zeevi, 2020; Wu et al., 2022; van Erven et al., 2015).

Assumption 3 (Uniform entropy number). Let $\mathcal{H}$ be a family of bounded functions with uniform entropy number $\log N(\epsilon, \mathcal{H}, \|\cdot\|_2)$. Assume that there exist positive numbers $\gamma$, $d$ and $p$ such that $\log N(\epsilon, \mathcal{H}, \|\cdot\|_2) \le d \log^p(\gamma/\epsilon)$ for any $0 < \epsilon \le \gamma$.

Assumption 3 is a mild assumption if the function classes are bounded. We list some popular function classes satisfying Assumption 3:
(a) If the VC-dimension of $\mathcal{H}$ is finite, then $\mathcal{H}$ satisfies Assumption 3. For instance, the function class of k-means methods has finite VC dimension. For more details, we refer the reader to (Devroye et al., 2013).
(b) For $\epsilon \in (0, 1)$, the unit Euclidean ball $B \subset \mathbb{R}^d$ satisfies Assumption 3.
(c) If $\mathcal{H}$ is an RKHS with kernel $k$ and the rank of $k$ is $d$, then $\mathcal{H}$ satisfies Assumption 3.
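As a worked illustration of why the quadratic loss with a convex class is a typical example, the following standard argument (a sketch for a generic distribution $\mu$, not taken from the paper's appendix) shows that the Bernstein condition of Definition 3 holds with $B = 1$ and $\beta = 1$. The setup in the comments is an assumption made for this illustration.

```latex
% Assumptions for this sketch: \ell(y', y) = (y' - y)^2, \mathcal{H} is convex,
% and h^* = \arg\min_{h \in \mathcal{H}} \mathbb{E}_{\mu}[(h(X) - Y)^2].
\begin{align*}
\mathbb{E}\big[\ell(h(X), Y) - \ell(h^*(X), Y)\big]
  &= \mathbb{E}\big[(h(X) - h^*(X))^2\big]
     + 2\,\mathbb{E}\big[(h(X) - h^*(X))(h^*(X) - Y)\big] \\
  &\ge \mathbb{E}\big[(h(X) - h^*(X))^2\big].
\end{align*}
% The inequality uses first-order optimality of h^* over the convex class:
% h^* + t(h - h^*) \in \mathcal{H} for t \in [0, 1], so the derivative of the risk
% at t = 0 is non-negative, i.e. \mathbb{E}[(h(X) - h^*(X))(h^*(X) - Y)] \ge 0.
```

Applying the same argument with $\mu = D$ and minimizer $\hat{h}^*$, or across the two-level expectation with minimizer $h^*$, is how one would expect parts (a) and (b) of Assumption 2 to be checked in this special case.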

3.1. FAST LEARNING RATES FOR PARTICIPATING CLIENTS

In this subsection, we provide fast learning rates for participating clients that hold with high probability. To obtain faster convergence rates, we focus on the semi-excess risk $L_D(\hat{h}) - L_D(\hat{h}^*)$.

Theorem 2 (Semi-excess risk for participating clients). Let $\mathcal{F}$ be a family of functions bounded by $b$. Under Assumptions 1, 3 and part (a) of Assumption 2, when $mn \ge c\,d \log^p(mn)$, with probability at least $1 - \delta$,
\[ L_D(\hat{h}) - L_D(\hat{h}^*) \le c_1 \left(\frac{\log^p(mn)}{mn}\right)^{\frac{1}{2-\beta'}} + c_2 \left(\frac{\log(1/\delta)}{mn}\right)^{\frac{1}{2-\beta'}}, \]
where $c_1$ and $c_2$ are constants depending on $\gamma, p, L, \beta'$ and $B', b, \beta'$, respectively.

Remark 3. Theorem 2 shows that the convergence rate of the semi-empirical excess risk ranges from $O(\frac{1}{\sqrt{mn}})$ to the faster order $O(\frac{1}{mn})$, which correspond to $\beta' = 0$ and $\beta' = 1$, respectively. It indicates that, under the Bernstein condition, the semi-empirical excess risk converges faster when we increase the number of clients $m$ or the size $n$ of the local datasets. We emphasize that our bounds in Theorem 2 are in high-probability form, which is more demanding and challenging to obtain than the previous results in expectation form (Chen et al., 2021; Fallah et al., 2021). The learning bounds in Theorem 2 are derived for the empirical risk minimizer $\hat{h}$. For an inexact minimizer, the proof technique and the final bounds only involve an extra optimization term. For more details about the optimization error, we refer to (Wang et al., 2021; Su et al., 2021; Khaled et al., 2019).
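The interpolation between slow and fast rates in Remark 3 can be checked numerically. The sketch below evaluates the right-hand side of Theorem 2 for several values of $\beta'$; the function name `theorem2_rate`, the default $p = 1$ and the constants $c_1 = c_2 = 1$ are assumptions chosen for illustration, and $\beta' = 0$ is included only as the limiting case mentioned in Remark 3.

```python
import math


def theorem2_rate(mn, beta, p=1.0, delta=0.05, c1=1.0, c2=1.0):
    """Right-hand side of the Theorem 2 bound for a total sample size mn.

    c1, c2 and p are placeholder values, not constants given by the theorem.
    """
    if mn <= 1 or not 0.0 <= beta <= 1.0 or not 0.0 < delta < 1.0:
        raise ValueError("need mn > 1, 0 <= beta <= 1 and 0 < delta < 1")
    exponent = 1.0 / (2.0 - beta)  # 1/2 for beta = 0, 1 for beta = 1
    complexity = (math.log(mn) ** p / mn) ** exponent
    confidence = (math.log(1.0 / delta) / mn) ** exponent
    return c1 * complexity + c2 * confidence


for beta in (0.0, 0.5, 1.0):
    print(f"beta'={beta:.1f}  bound at mn=1e4: {theorem2_rate(10**4, beta):.6f}")
```

With these placeholder values the bound shrinks as $\beta'$ grows, reflecting the exponent $\frac{1}{2-\beta'}$ moving from $\frac{1}{2}$ toward $1$.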

3.2. FAST LEARNING RATES FOR UNPARTICIPATING CLIENTS

In this subsection, we provide fast learning rates for unparticipating clients that hold with high probability. To the best of our knowledge, this is the first result derived for unparticipating clients in heterogeneous federated learning.

Theorem 3. Let $\mathcal{F}$ be a family of functions bounded by $b$. Under Assumptions 1, 3 and part (b) of Assumption 2, when $m \ge c\,d \log^p(m)$, for any $\delta > 0$, with probability at least $1 - \delta$,
\[ L_P(\hat{h}) - L_P(h^*) \le c_0 \big(L_D(\hat{h}) - L_D(\hat{h}^*)\big) + c_1 \left(\frac{\log^p m}{m}\right)^{\frac{1}{2-\beta''}} + c_2 \left(\frac{\log(1/\delta)}{m}\right)^{\frac{1}{2-\beta''}}, \]
where $c_0 = \frac{K}{K - \beta''}$, and $c_1$ and $c_2$ are constants depending on $\gamma, p, L, \beta''$ and $B'', b, \beta''$, respectively.

Remark 4. Theorem 3 is developed across the two-level distribution framework, which brings extra challenges to the analysis. The upper bound derived in Theorem 3 includes the semi-empirical excess risk term $L_D(\hat{h}) - L_D(\hat{h}^*)$, which is an outcome of the excess risk decomposition across the two-level framework. Recall that $\beta'$ and $\beta''$ are the constants defined in Assumption 2. When $\beta' = 1$ and $\beta'' = 1$, the excess risk is of order $O(\frac{1}{mn} + \frac{1}{m})$ with high probability. To establish Theorem 3, we must construct a sub-root function that links the expected local Rademacher complexity associated with the meta-distribution $P$ to the uniform entropy number. However, the conventional Dudley integral bounds are built under empirical constraints (Boucheron et al., 2013). We tackle this challenge by extending the techniques developed in (Lei et al., 2016) to our two-level distribution framework.
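The order $O(\frac{1}{mn} + \frac{1}{m})$ can be seen exactly in a toy instance of the two-level framework. The sketch below is an illustration under assumptions that are not part of the paper: client means are drawn as $\mu_i \sim N(0, \sigma_P^2)$ (playing the role of $P$), local data as $Z \sim N(\mu_i, \sigma^2)$, the hypotheses are constants $h \in \mathbb{R}$, and the loss is $(h - z)^2$. In that case $h^* = 0$, the ERM solution is the grand mean, and the expected excess risk equals $\sigma_P^2/m + \sigma^2/(mn)$. The function names are hypothetical.

```python
import numpy as np


def simulate_excess_risk(m, n, sigma_meta=1.0, sigma_local=1.0, trials=4000, seed=0):
    """Monte Carlo estimate of E[L_P(h_hat) - L_P(h*)] in the toy Gaussian model."""
    rng = np.random.default_rng(seed)
    # Level 1: sample m client distributions, represented by their means mu_i.
    mu = rng.normal(0.0, sigma_meta, size=(trials, m))
    # Level 2: sample n data points at each participating client.
    z = mu[:, :, None] + rng.normal(0.0, sigma_local, size=(trials, m, n))
    # ERM over constant hypotheses with squared loss is the grand mean.
    h_hat = z.mean(axis=(1, 2))
    # Since h* = 0 here, the excess risk L_P(h_hat) - L_P(h*) equals h_hat**2.
    return float(np.mean(h_hat ** 2))


def predicted_excess_risk(m, n, sigma_meta=1.0, sigma_local=1.0):
    """Closed form: participation gap sigma_meta^2/m plus data term sigma_local^2/(mn)."""
    return sigma_meta ** 2 / m + sigma_local ** 2 / (m * n)


for m in (5, 20, 80):
    est = simulate_excess_risk(m=m, n=10)
    print(f"m={m:3d}  simulated={est:.4f}  predicted={predicted_excess_risk(m, 10):.4f}")
```

In this toy model, increasing $n$ with $m$ fixed cannot push the excess risk below $\sigma_P^2/m$, which is the participation gap that the $\frac{1}{m}$ term accounts for.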

4. LEARNING RATES FOR SUB-WEIBULL LOSSES

In this section, we provide generalization error bounds for unbounded losses in the two-level distribution framework. In particular, we focus on loss functions satisfying the sub-Weibull condition.

Definition 4 (Sub-Weibull random variables). A random variable $X$ is said to be sub-Weibull if there is a constant $\|X\|_{\psi_\alpha} < \infty$ such that
\[ \mathbb{P}(|X| \ge t) \le 2 \exp\big(-t^{\alpha} / \|X\|_{\psi_\alpha}^{\alpha}\big) \quad \text{for all } t \ge 0. \]

Sub-Gaussian and sub-exponential random variables are two special cases of sub-Weibull random variables, which correspond to $\alpha = 2$ and $\alpha = 1$, respectively. The learning rates derived in the two-level framework for sub-exponential losses are deferred to the appendices. In the following, we use the small-ball method to establish learning rates for more heavy-tailed losses, where two-sided concentration inequalities may fail to hold. Our aim is to establish generalization bounds for unbounded losses that have heavier tails than sub-exponential distributions. Since two-sided inequalities for the empirical process fail to hold when the losses are heavy-tailed, the analysis of heavy-tailed losses requires a new method to relate the empirical risk and the population risk. We establish generalization bounds for heterogeneous federated learning by extending the small-ball method from the i.i.d. setting to our two-level distribution framework. We consider the quadratic loss function in this section. The extension to general losses can be achieved by using the techniques presented in (Mendelson, 2018).

In what follows, we denote by $\|h\|_{L_2(\mu)}$ the norm of the Banach space $L_2(\mathcal{X}, \mu)$. Recall that $D$ is the semi-empirical distribution and $P$ is the meta-distribution. In particular, we have
\[ \|h\|_{L_2(D)} = \Big(\frac{1}{m} \sum_{i=1}^{m} \mathbb{E}_{X \sim D_i}[h(X)^2]\Big)^{1/2} \quad \text{and} \quad \|h\|_{L_2(P)} = \big(\mathbb{E}_{D_i \sim P} \mathbb{E}_{X \sim D_i}[h(X)^2]\big)^{1/2}. \]
For the sake of clear exposition, we first introduce the small-ball condition.

Assumption 4 (Small-ball condition). Let $\mathcal{H} \subset L_2(D)$ be a closed and convex class of functions and $\mathcal{H} - \mathcal{H} := \{h - h' : h, h' \in \mathcal{H}\}$.
(a) Let $Q_{mn}(\tau) = \inf_{h \in \mathcal{H} - \mathcal{H}} \mathbb{P}\big(|h(X_i^1)| \ge \tau \|h\|_{L_2(D)}\big)$, where $X_i^1$ represents the random sample at the $i$-th participating client. There is a $\tau \ge 0$ for which $Q_{mn}(\tau) > 0$.
(b) Let $Q_m(\tau) = \inf_{h \in \mathcal{H} - \mathcal{H}} \mathbb{P}\big(|\mathbb{E}[h(X_i^1)]| \ge \tau \|h\|_{L_2(P)}\big)$, where $X_i^1$ represents the random sample at the $i$-th participating client. There is a $\tau \ge 0$ for which $Q_m(\tau) > 0$.

The small-ball condition in Assumption 4 has been assumed for i.i.d. and dependent data-generating processes. To obtain high-probability theoretical guarantees, concentration techniques are widely used in the analysis of generalization error (Boucheron et al., 2013). Intuitively, the empirical risk concentrates around the population risk with high probability only when the loss function has well-behaved moments. However, this condition may fail to hold for heavy-tailed losses (Mendelson, 2015). Assumption 4 first appears in the work of (Mendelson, 2015). Losses with any sort of moment equivalence satisfy the small-ball condition, which is weaker than a concentration condition and can be used to model heavy-tailedness. For example, even the weak condition $\|h\|_{L_2(P)} \le c \|h\|_{L_1(P)}$ yields a nontrivial small-ball estimate. Moreover, the equivalence between higher-order moments and the second-order moment, such as $\|h\|_{L_p(P)} \le c \|h\|_{L_2(P)}$, also leads to the small-ball condition (Lecué & Mendelson, 2016). Based on these observations, condition (b) of Assumption 4 is generally implied when we consider each local distribution $D_i$ as a random variable drawn according to the client distribution $P$. Let us discuss condition (a) of Assumption 4 in more detail. Note that this condition only requires $\mathcal{H} \subset L_2(D)$ with high probability, where $D$ is the semi-empirical distribution. This requirement is not too restrictive since the elements of $D$ are sampled i.i.d. from $P$. To our knowledge, this is the first time that the small-ball condition is used under a heterogeneous data-generating assumption.
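To give a feel for the quantity inside the infimum in Assumption 4, the following sketch estimates $\mathbb{P}(|h(X)| \ge \tau \|h\|_{L_2})$ by Monte Carlo for a single fixed function. It is only an illustration: the function name `small_ball_probability`, the choice of a standard Gaussian and a Student-t distribution with 3 degrees of freedom for $h(X)$, and the use of the empirical second moment in place of the true $L_2$ norm are all assumptions, and no infimum over $\mathcal{H} - \mathcal{H}$ is taken.

```python
import numpy as np


def small_ball_probability(samples, tau):
    """Monte Carlo estimate of P(|h(X)| >= tau * ||h||_{L2}) from draws of h(X)."""
    samples = np.asarray(samples, dtype=float)
    # Empirical L2 norm of h(X); a stand-in for the population norm.
    l2_norm = np.sqrt(np.mean(samples ** 2))
    return float(np.mean(np.abs(samples) >= tau * l2_norm))


rng = np.random.default_rng(0)
gaussian_values = rng.normal(size=200_000)         # light-tailed h(X)
heavy_values = rng.standard_t(df=3, size=200_000)  # heavy-tailed h(X), finite variance

tau = 0.5
print("Gaussian:", small_ball_probability(gaussian_values, tau))
print("Student-t (df=3):", small_ball_probability(heavy_values, tau))
```

Both estimates stay bounded away from zero, even though the Student-t variable has no finite fourth moment. This is the sense in which a small-ball condition can hold for heavy-tailed variables where two-sided concentration is unavailable.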

4.1. LEARNING RATES FOR PARTICIPATING CLIENTS WITH SMALL-BALL CONDITION

We first describe the basic idea of the generalization analysis for participating clients. Recall that $\hat{h}^*$ is the minimizer of the semi-empirical risk $L_D(h)$ in $\mathcal{H}$. In this subsection we focus on the measure $\|\hat{h} - \hat{h}^*\|_{L_2(D)}^2$, which represents the distance between $\hat{h}$ and $\hat{h}^*$ with respect to the semi-empirical distribution $D$. For the quadratic loss and every $h \in \mathcal{H}$, we have
\[ L_S(h) - L_S(\hat{h}^*) = \frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} \Big[(h(X_i^j) - Y_i^j)^2 - (\hat{h}^*(X_i^j) - Y_i^j)^2\Big] \quad (1) \]
\[ = \frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} (h - \hat{h}^*)^2(X_i^j) + \frac{2}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} \xi_i^j (h - \hat{h}^*)(X_i^j), \quad (2) \]
where $\xi_i^j = \hat{h}^*(X_i^j) - Y_i^j$. Since $\hat{h}$ is the minimizer of the empirical risk $L_S(h)$, we have $L_S(\hat{h}) - L_S(\hat{h}^*) \le 0$. If, on some event, $\|h - \hat{h}^*\|_{L_2(D)}$ is large, then the sum of the two terms in (2) is larger than 0 with high probability. It follows that, with high probability, $\|\hat{h} - \hat{h}^*\|_{L_2(D)}$ is small since $L_S(\hat{h}) - L_S(\hat{h}^*) \le 0$. Let $\{(X_i^j, Y_i^j)\}_{(i,j)=(1,1)}^{(m,n)}$ be the global data samples, whose elements $\{(X_i^j, Y_i^j)\}_{j=1}^{n}$ are i.i.d. random pairs at the $i$-th client. The analysis of the first term in (2) involves the following definition of Rademacher complexity.

Definition 5. We define $\mathcal{H} - \mathcal{H} = \{h - h' : h, h' \in \mathcal{H}\}$ and denote by $B_2^m$ the $L_2(D)$ unit ball centered at $\hat{h}^*$, that is, $B_2^m = \{h \in \mathcal{H} : \|h - \hat{h}^*\|_{L_2(D)} \le 1\}$. For every $\eta > 0$, define
\[ \omega_{mn}(\eta) := \inf\Big\{ s > 0 : \mathbb{E} \sup_{h \in (\mathcal{H} - \mathcal{H}) \cap s B_2^m} \frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} \sigma_i^j h(X_i^j) \le \eta s \Big\}, \]
where $\sigma_i^j$ are Rademacher random variables.

The quantity $\omega_{mn}(\eta)$ measures the Rademacher complexity of the localized function set $(\mathcal{H} - \mathcal{H}) \cap s B_2^m$. Note that $\omega_{mn}(\eta)$ depends only on the hypothesis class $\mathcal{H}$ and on the global input samples, which are drawn from the semi-empirical distribution $D$.

Theorem 4. Fix $\tau > 0$ for which $Q_{mn}(2\tau) > 0$ and set $\eta < \tau^2 Q_{mn}(2\tau)/32$. Assume that, for all $h \in \mathcal{H} - \hat{h}^*$, every random variable $V_i^j = \xi_i^j h(X_i^j) - \mathbb{E}[\xi_i^j h(X_i^j)]$ is sub-Weibull.
For sufficiently large $mn$, with probability at least $1 - \delta_{mn} - \exp(-mn\,Q_{mn}^2(2\tau)/2)$ one has
\[ \|\hat{h} - \hat{h}^*\|_{L_2(D)} \le 2 \max\big\{\omega_{mn}(\tau Q_{mn}(2\tau)/16),\ (mn)^{-\frac{1}{4} + \iota}\big\}, \]
where $0 < \iota < \frac{1}{4}$ and
\[ \delta_{mn} = \exp\left\{-\left(\frac{c_1 \eta^2 (mn)^{4\iota}}{\frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} \|V_i^j\|_{\psi_\alpha}^2} \wedge \frac{c_2 \eta^{\alpha} (mn)^{\alpha(1/2 + 2\iota)}}{\max_{(1,1) \le (i,j) \le (m,n)} \|V_i^j\|_{\psi_\alpha}^{\alpha}}\right)\right\}. \]

Remark 5. To the best of our knowledge, Theorem 4 provides the first result on the generalization error of heterogeneous federated learning with heavy-tailed losses. It suggests that both the hypothesis 'size' and the noise level play important roles in the generalization error of heterogeneous learning problems. In Theorem 4, the expression of $\delta_{mn}$ depends on the tail of the sub-Weibull random variables $V_i^j$. Specifically, the heavier the tail of $V_i^j$, the larger $\delta_{mn}$ will be. Note that in addition to $\|V_i^j\|_{\psi_\alpha}$, $\delta_{mn}$ also depends on $(mn)^{4\iota}$. That is, the larger $mn$ and $\iota$ are, the smaller $\delta_{mn}$ will be. To ensure that Theorem 4 holds with high probability, we must ensure that $\delta_{mn}$ is small enough. Therefore, when $mn$ is fixed, the heavier the tail of $V_i^j$, the larger $\iota$ should be. It can be seen from Theorem 4 that the larger $\iota$ is, the slower the convergence rate of $\|\hat{h} - \hat{h}^*\|_{L_2(D)}$ is.

Published as a conference paper at ICLR 2023

Corollary 1. Under the same conditions as Theorem 4, for a convex function class $\mathcal{H}$ and sufficiently large $mn$, with probability at least $1 - \delta_{mn} - \exp(-mn\,Q_{mn}^2(2\tau)/2)$ one has
\[ L_D(\hat{h}) - L_D(\hat{h}^*) \le \Big(2 + \frac{\tau^2}{4} Q_{mn}(2\tau)\Big) \max\big(\omega_{mn}^2(\tau Q_{mn}(2\tau)/16),\ (mn)^{-\frac{1}{2} + \iota}\big), \]
where $0 < \iota < \frac{1}{2}$.

Remark 6. Corollary 1 provides the convergence rate of the semi-empirical excess risk for sub-Weibull losses. Compared to Theorem 2, it shows that the convergence rate of the excess risk is slower than $O(\frac{1}{\sqrt{mn}})$.
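The trade-off described in Remark 5 can be explored numerically. The sketch below evaluates the failure probability $\delta_{mn}$ from Theorem 4, reading its expression as the exponential of minus the smaller of a Gaussian-type term and a Weibull-type term. The function name `delta_mn`, the constants $c_1 = c_2 = 1$, and the idea of passing the sub-Weibull norms $\|V_i^j\|_{\psi_\alpha}$ as an array are assumptions made for illustration; in practice these norms are not observed.

```python
import numpy as np


def delta_mn(psi_norms, eta, iota, alpha, c1=1.0, c2=1.0):
    """Evaluate delta_mn for given sub-Weibull norms ||V_i^j||_{psi_alpha}.

    psi_norms: array with one norm per sample (any shape, mn entries in total).
    c1, c2: placeholder constants, not specified by the theorem.
    """
    psi = np.asarray(psi_norms, dtype=float)
    mn = psi.size
    # First term: c1 * eta^2 * (mn)^(4 iota) divided by the average squared norm.
    gaussian_part = c1 * eta ** 2 * mn ** (4 * iota) / np.mean(psi ** 2)
    # Second term: c2 * eta^alpha * (mn)^(alpha (1/2 + 2 iota)) divided by the largest norm^alpha.
    weibull_part = c2 * eta ** alpha * mn ** (alpha * (0.5 + 2 * iota)) / np.max(psi) ** alpha
    return float(np.exp(-min(gaussian_part, weibull_part)))


norms = np.ones(10_000)  # hypothetical unit norms for mn = 10^4 samples
for iota in (0.05, 0.10, 0.20):
    print(f"iota={iota:.2f}  delta_mn={delta_mn(norms, eta=0.5, iota=iota, alpha=1.0):.3e}")
```

Consistent with Remark 5, a larger $\iota$ makes $\delta_{mn}$ smaller at the price of the slower rate $(mn)^{-\frac{1}{4}+\iota}$, and scaling the norms up (heavier tails) makes $\delta_{mn}$ larger.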

4.2. LEARNING RATES FOR UNPARTICIPATING CLIENTS WITH SMALL-BALL CONDITION

In the analysis of the generalization error for unparticipating clients, we focus on the measure $\|\hat{h}^* - h^*\|_{L_2(P)}^2$, which represents the distance between $\hat{h}^*$ and $h^*$ with respect to the meta-distribution $P$. The analysis of the generalization error for unparticipating clients follows a similar path to the previous analysis. Let $\{(X_i, Y_i)\}_{i=1}^{m}$ be a dataset whose elements are sampled across the two-level framework, that is, $\mathbb{E}[(X_i, Y_i)] = \mathbb{E}_{D_i \sim P} \mathbb{E}_{(X_i, Y_i) \sim D_i}[(X_i, Y_i)]$. We present the corresponding definitions of the Rademacher complexity terms in the following.

Definition 6. We define $\mathcal{H} - \mathcal{H} = \{h - h' : h, h' \in \mathcal{H}\}$ and denote by $B_2$ the $L_2(P)$ unit ball centered at $h^*$. For every $\eta > 0$, define
\[ \omega_m(\eta) := \inf\Big\{ s > 0 : \mathbb{E} \sup_{h \in (\mathcal{H} - \mathcal{H}) \cap s B_2} \frac{1}{m} \sum_{i=1}^{m} \sigma_i h(X_i) \le \eta s \Big\}, \]
where $\sigma_i$ are Rademacher random variables.

The quantity $\omega_m(\eta)$ measures the localized complexity of $(\mathcal{H} - \mathcal{H}) \cap s B_2$.

Theorem 5. Fix $\tau > 0$ for which $Q_m(2\tau) > 0$ and set $\eta < \tau^2 Q_m(2\tau)/32$. Assume that, for all $h \in \mathcal{H} - h^*$, the random variable $V_i = \mathbb{E}[\xi_i^1 h(X_i^1)] - \mathbb{E}[\xi_i h(X_i)]$ is sub-Weibull. For sufficiently large $m$, with probability at least $1 - \delta_m - \exp(-m\,Q_m^2(2\tau)/2)$ one has
\[ \|\hat{h}^* - h^*\|_{L_2(P)} \le 2 \max\big\{\omega_m(\tau Q_m(2\tau)/16),\ m^{-\frac{1}{4} + \iota}\big\}, \]
where $0 < \iota < \frac{1}{4}$ and
\[ \delta_m = \exp\left\{-\left(\frac{c_1 \eta^2 m^{4\iota}}{\frac{1}{m} \sum_{i=1}^{m} \|V_i\|_{\psi_\alpha}^2} \wedge \frac{c_2 \eta^{\alpha} m^{\alpha(1/2 + 2\iota)}}{\max_{1 \le i \le m} \|V_i\|_{\psi_\alpha}^{\alpha}}\right)\right\}. \]

Remark 7. Theorem 5 provides the first result on the generalization error of unparticipating clients in heterogeneous federated learning with heavy-tailed losses.

Corollary 2. Assume that, for all $h \in \mathcal{H} - h^*$, the random variable $V_i' = \mathbb{E}[h^2(X_i^j)] - \mathbb{E}[h^2(X_i)]$ is sub-Weibull and the noise $h^*(X_i) - Y_i$ is independent of $X_i$.
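Under the same conditions as Theorem 5, for $0 < \eta < 1$ and sufficiently large $mn$, with probability at least $1 - \delta' - \exp(-mn\,Q_{mn}^2(2\tau)/2) - \exp(-m\,Q_m^2(2\tau)/2)$ one has
\[ L_P(\hat{h}) - L_P(h^*) \le c_0 \max\Big(\omega_{mn}^2\Big(\frac{\tau Q_{mn}(2\tau)}{16}\Big),\ (mn)^{-\frac{1}{2} + \iota}\Big) + 2 \max\Big(\omega_m^2\Big(\frac{\tau Q_m(2\tau)}{16}\Big),\ m^{-\frac{1}{2} + \iota}\Big), \]
where $c_0 = \frac{2}{1 - \eta}$, $0 < \iota < \frac{1}{2}$ and
\[ \delta' = \delta_{mn} + \delta_m + \exp\left\{-\left(\frac{c_1 \eta^2 m^{4\iota}}{\frac{1}{m} \sum_{i=1}^{m} \|V_i'\|_{\psi_\alpha}^2} \wedge \frac{c_2 \eta^{\alpha} m^{\alpha(1/2 + 2\iota)}}{\max_{1 \le i \le m} \|V_i'\|_{\psi_\alpha}^{\alpha}}\right)\right\}. \]

Remark 8. Corollary 2 provides the convergence rate of the excess risk for sub-Weibull losses. Compared to Theorem 3, it shows that the convergence rate of the excess risk is slower than $O(\frac{1}{\sqrt{mn}} + \frac{1}{\sqrt{m}})$.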

5. RELATED WORK

Generalization Error for Heterogeneous Federated Learning. Several attempts have been made to analyze the generalization error of heterogeneous federated learning. We compare our results with the most related works in Table 1. Complexity-based bounds for participating clients are derived in the work of (Mohri et al., 2019), who present high-probability slow rates of order $O(\frac{1}{\sqrt{mn}})$ for bounded losses. Fast-rate bounds are obtained based on stability tools in (Chen et al., 2021; Fallah et al., 2021). However, their results are in expectation form.

Table 1
Reference | Loss | Condition | Participating | Unparticipating | Form
… | … | … | $O(\frac{1}{\sqrt{mn}})$ | $O(\frac{1}{\sqrt{mn}} + \frac{1}{\sqrt{m}})$ | Pro
Our Results | Bounded | Bernstein Con | $O(\frac{1}{mn})$ | $O(\frac{1}{mn} + \frac{1}{m})$ | Pro
Our Results | Sub-Weibull | Small-ball | $O((mn)^{\frac{2\iota - 1}{2}})$ | $O((mn)^{\frac{2\iota - 1}{2}} + m^{\frac{2\iota - 1}{2}})$ | Pro

Among the existing theoretical works, different measurements are used to model the heterogeneity of local distributions. These measurements include gradient dissimilarity and parameter dissimilarity of local optimal models. Here we argue that it is more natural to make an assumption from the perspective of the data-generating process. Therefore, in this paper, we assume that the local distributions are sampled from a higher-level meta-distribution. A similar two-level distribution framework has been used in the analysis of meta-learning. However, the learning scenarios and objectives of federated learning are different from those of meta-learning. The goal of meta-learning is to choose an optimal hypothesis space $\mathcal{H}$ from a family $\mathbb{H}$ of hypothesis spaces. Ideally, the chosen hypothesis space $\mathcal{H}$ should contain a good hypothesis $h \in \mathcal{H}$ for each distribution $D_i$ sampled from the meta-distribution $P$. In this paper, we focus on the performance of the common model $h$ trained by participating clients. The performance of the common model is measured by the population risk with respect to the meta-distribution $P$. Another line of research closely related to heterogeneous federated learning is domain adaptation/generalization.
In this line, possibly the results in (Li et al., 2022) are most relevant to ours. Generalization error for Unbounded losses. The unbounded assumption brings two major challenges to complexity-based generalization analysis. One is that the two-side concentration inequalities do not hold when the losses are heavy-tailed. The other is that the standard techniques used to upper bound the complexity of hypothesis space are developed for bounded losses. The straightforward way to avoid these two challenges is to assume there exists an envelope function with respect to the underlying distribution and hypothesis class (Adamczak, 2008; Lecué & Mendelson, 2012) . Small-ball method is first proposed to replace the concentration tools for empirical process in (Mendelson, 2015) and further developed in (Mendelson, 2018) . Inspired by the small-ball method, Offset Rademacher complexity-based method provides another replacement for two-side concentration inequality (Liang et al., 2015) . However, most existing generalization bounds for unbounded losses are derived in the i.i.d setting. Roy et al. (2021) extend the small-ball method in the dependent data setting. In this paper, we focus on the heterogeneous federated learning scenario with unbounded losses, where the samples are independent but non-identically distributed.

6. CONCLUSION

We present a systematic generalization analysis of heterogeneous distributed learning. Our analysis captures the generalization performance of the learned model on both participating and unparticipating clients. To our knowledge, this is the first theoretical analysis under the assumption that the local distributions are sampled from a meta-distribution. We recover the current state-of-the-art guarantees without assuming bounded losses. Moreover, under the empirical risk minimization setting, we derive fast generalization rates in our two-level distribution setting.

A SETTING, DEFINITIONS

We denote the population risk by
$$L_P(h) = \mathbb{E}_{D_i\sim P}\big[\mathbb{E}_{Z\sim D_i}[\ell(h(X), Y)]\big],$$
where $h \in \mathcal H$ represents the global hypothesis shared by all local clients and $\ell : \mathcal Y \times \mathcal Y \to \mathbb R_+$ is the non-negative loss function. The population risk minimizer $h^*$ is defined as $h^* = \arg\min_{h\in\mathcal H} L_P(h)$. In practice, the global objective function is often optimized in the form of the empirical risk minimization (ERM) objective, which is defined as
$$L_S(h) = \frac1m\sum_{i=1}^m \frac1n\sum_{j=1}^n \ell(h(X_i^j), Y_i^j),$$
where $(X_i^j, Y_i^j)$ represents the $j$-th training data point at the $i$-th participating client. We also use $Z_i^j = (X_i^j, Y_i^j)$ to denote the data point. Let $S_i = \{Z_i^j\}_{j=1}^n$.

The semi-empirical distribution $D$ is defined as $D = \frac1m\sum_{i=1}^m D_i$. Moreover, the corresponding semi-empirical risk $L_D(h)$ is defined as
$$L_D(h) = \frac1m\sum_{i=1}^m \mathbb{E}_{Z\sim D_i}[\ell(h, Z)].$$
We denote by $\hat h^*$ the semi-empirical risk minimizer $\hat h^* = \arg\min_{h\in\mathcal H} L_D(h)$. The semi-excess risk for participating clients is defined as $L_D(\hat h) - L_D(\hat h^*)$. The semi-excess risk indicates the performance of the learned model $\hat h$ on the participating clients. The excess risk for unparticipating clients is defined as $L_P(\hat h) - L_P(h^*)$. The excess risk indicates the performance of the learned model $\hat h$ on the unparticipating clients.

In the following, we write $\mathcal F = \{z \mapsto \ell(h, z) : h \in \mathcal H\}$. Moreover, we use $Z_i = (X_i, Y_i)$ to represent the random variables across the two-level framework. That is, $\mathbb{E}[Z_i] = \mathbb{E}_{D_i\sim P}[\mathbb{E}_{Z_i\sim D_i}[Z_i]]$.
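The three risks defined above can be made concrete on a toy instance. The sketch below is only an illustration under assumptions that are not in the paper: hypotheses are constants $h \in \mathbb R$, the loss is the square loss, each local distribution is $D_i = N(\mu_i, 1)$, and the meta-distribution draws $\mu_i \sim N(0, 1)$. Under these assumptions $L_D$ and $L_P$ have closed forms, $h^* = 0$, and $\hat h^*$ is the mean of the $\mu_i$.

```python
import random

def sq_loss(h, z):
    return (h - z) ** 2

def empirical_risk(h, samples):
    """L_S(h): mean over clients of the mean local loss."""
    return sum(sum(sq_loss(h, z) for z in s) / len(s) for s in samples) / len(samples)

def semi_empirical_risk(h, mus):
    """L_D(h) for D_i = N(mu_i, 1): E[(h - Z)^2] = (h - mu_i)^2 + 1."""
    return sum((h - mu) ** 2 + 1.0 for mu in mus) / len(mus)

def population_risk(h):
    """L_P(h) when mu_i ~ N(0, 1): E[(h - mu)^2] + 1 = h^2 + 2."""
    return h ** 2 + 2.0

def simulate(m=20, n=50, seed=0):
    rng = random.Random(seed)
    mus = [rng.gauss(0.0, 1.0) for _ in range(m)]            # D_1..D_m drawn from P
    samples = [[rng.gauss(mu, 1.0) for _ in range(n)] for mu in mus]
    h_hat = sum(sum(s) for s in samples) / (m * n)            # ERM for the square loss
    h_hat_star = sum(mus) / m                                 # minimizer of L_D
    semi_excess = semi_empirical_risk(h_hat, mus) - semi_empirical_risk(h_hat_star, mus)
    excess = population_risk(h_hat) - population_risk(0.0)    # h* = 0 in this toy model
    return semi_excess, excess
```

Calling `simulate()` with larger `n` but fixed `m` shrinks the semi-excess risk while the excess risk stays limited by the number of clients, which is the participation-gap effect the paper analyzes.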

B PROOF OF THEOREM 1

B.1 PROOF OF THEOREM 1

We first show that the semi-excess risk $L_D(\hat h) - L_D(\hat h^*)$ can be upper bounded by the supremum of the empirical process indexed by $\mathcal H$:
$$L_D(\hat h) - L_D(\hat h^*) \le L_D(\hat h) - L_S(\hat h) + L_S(\hat h^*) - L_D(\hat h^*) \le 2\sup_{h\in\mathcal H}|L_D(h) - L_S(h)|,$$
where the first inequality follows from the fact that $\hat h$ is the empirical minimizer, so $L_S(\hat h) \le L_S(\hat h^*)$. Next, we decompose the excess risk $L_P(\hat h) - L_P(h^*)$ in the two-level framework:
$$L_P(\hat h) - L_P(h^*) \le 2\sup_{h\in\mathcal H}|L_P(h) - L_D(h)| + 2\sup_{h\in\mathcal H}|L_D(h) - L_S(h)|.$$
In detail:
$$\begin{aligned} L_P(\hat h) - L_P(h^*) &\le L_P(\hat h) - L_P(h^*) - \big(L_S(\hat h) - L_S(h^*)\big) \\ &= L_P(\hat h) - L_D(\hat h) + L_D(h^*) - L_P(h^*) + L_D(\hat h) - L_S(\hat h) + L_S(h^*) - L_D(h^*) \\ &\le 2\sup_{h\in\mathcal H}|L_P(h) - L_D(h)| + 2\sup_{h\in\mathcal H}|L_D(h) - L_S(h)|, \end{aligned}$$
where the first inequality uses $L_S(\hat h) - L_S(h^*) \le 0$ since $\hat h$ minimizes $L_S$ over $\mathcal H$. It remains to bound the two suprema $\sup_{h\in\mathcal H}|L_D(h) - L_S(h)|$ and $\sup_{h\in\mathcal H}|L_P(h) - L_D(h)|$.

Let $\sigma = \{\sigma_i^j\}_{i\in[m], j\in[n]}$ be a collection of independent Rademacher variables, that is, random variables taking values uniformly in $\{+1, -1\}$. We define the generalized Rademacher complexity $R_{mn}(\mathcal F)$ used in heterogeneous federated learning as follows:
$$R_{mn}(\mathcal F) = \mathbb{E}_\sigma\Big[\sup_{h\in\mathcal H}\frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n \sigma_i^j\,\ell(h, Z_i^j)\Big].$$

Lemma 1 (Generalization error for participating clients). Let $\mathcal F$ be a family of functions related to the hypothesis space $\mathcal H$: $\mathcal F = \{z \mapsto \ell(h, z) : h \in \mathcal H\}$. Assume that the loss function $\ell$ is bounded by $M$. Then, for any $\delta \in (0, 1)$, with probability at least $1 - \delta$, we have
$$\sup_{h\in\mathcal H}|L_D(h) - L_S(h)| \le 2R_{mn}(\mathcal F) + 3M\sqrt{\frac{\ln(1/\delta)}{2mn}}.$$

Lemma 1 provides high-probability theoretical guarantees for participating clients under the bounded assumption. In contrast to the classical i.i.d. setting in learning theory, the result in Lemma 1 holds under an independent but non-identically distributed setting. Since each local distribution $D_i$ is sampled independently from $P$, we regard the local population risks $\{\mathbb{E}_{Z\sim D_i}[\ell(h, Z)]\}_{i\in[m]}$ as a collection of i.i.d. random variables.
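Let $\sigma = \{\sigma_i\}_{i\in[m]}$ be a collection of independent Rademacher variables. The Rademacher complexity used in the analysis of the participation gap is defined as
$$R_m(\mathcal F) = \mathbb{E}_{D_1,\dots,D_m,\sigma}\Big[\sup_{h\in\mathcal H}\frac1m\sum_{i=1}^m \sigma_i\,\mathbb{E}_{Z\sim D_i}[\ell(h, Z)]\Big].$$

Lemma 2 (Participation gap). Let $\mathcal F$ be a family of functions related to the hypothesis space $\mathcal H$: $\mathcal F = \{z \mapsto \ell(h, z) : h \in \mathcal H\}$. Assume that the loss function $\ell$ is bounded by $b$. Then, for any $\delta \in (0, 1)$, with probability at least $1 - \delta$, we have
$$\sup_{h\in\mathcal H}|L_P(h) - L_D(h)| \le 2R_m(\mathcal F) + 3b\sqrt{\frac{\ln(1/\delta)}{2m}}.$$

Proof of Lemma 1. Let $\Phi(S) = \sup_{h\in\mathcal H}|L_D(h) - L_S(h)|$, and let $S'$ be obtained from $S$ by replacing a single data point $Z_s^k$ with $Z_s'^k$. Then
$$\begin{aligned} \Phi(S') - \Phi(S) &= \sup_{h\in\mathcal H}|L_D(h) - L_{S'}(h)| - \sup_{h\in\mathcal H}|L_D(h) - L_S(h)| \\ &\le \sup_{h\in\mathcal H}\big|\{L_D(h) - L_{S'}(h)\} - \{L_D(h) - L_S(h)\}\big| \\ &= \sup_{h\in\mathcal H}|L_S(h) - L_{S'}(h)| = \frac{1}{mn}\sup_{h\in\mathcal H}\big|\ell(h, Z_s^k) - \ell(h, Z_s'^k)\big|. \end{aligned}$$
Assuming that the loss function $\ell$ is bounded by $b$ (the constant denoted $M$ in Lemma 1), we have $|\Phi(S') - \Phi(S)| \le \frac{b}{mn}$. Applying McDiarmid's inequality, for any fixed $\delta \in (0, 1)$, with probability at least $1 - \delta$,
$$\Phi(S) \le \mathbb{E}_{S_1,\dots,S_m}[\Phi(S)] + b\sqrt{\frac{\ln(2/\delta)}{2mn}}.$$
Next we consider the expectation of $\Phi(S)$. By symmetrization, we have
$$\begin{aligned} \mathbb{E}_{S_1,\dots,S_m}[\Phi(S)] &= \mathbb{E}_{S_1,\dots,S_m}\Big[\sup_{h\in\mathcal H}\big(L_D(h) - L_S(h)\big)\Big] \\ &= \mathbb{E}_{S_1,\dots,S_m}\Big[\sup_{h\in\mathcal H}\frac1m\sum_{i=1}^m\Big(\mathbb{E}_{Z\sim D_i}[\ell(h, Z)] - \frac1n\sum_{j=1}^n \ell(h, Z_i^j)\Big)\Big] \\ &= \mathbb{E}_{S_1,\dots,S_m}\Big[\sup_{h\in\mathcal H}\frac1m\sum_{i=1}^m \mathbb{E}_{S_1',\dots,S_m'}\Big[\frac1n\sum_{j=1}^n\big(\ell(h, Z_i'^j) - \ell(h, Z_i^j)\big)\Big]\Big] \\ &\le \mathbb{E}_{S, S'}\Big[\sup_{h\in\mathcal H}\frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n\big(\ell(h, Z_i'^j) - \ell(h, Z_i^j)\big)\Big] \\ &= \mathbb{E}_{S, S', \sigma}\Big[\sup_{h\in\mathcal H}\frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n \sigma_i^j\big(\ell(h, Z_i'^j) - \ell(h, Z_i^j)\big)\Big] \\ &\le \mathbb{E}_{S', \sigma}\Big[\sup_{h\in\mathcal H}\frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n \sigma_i^j\,\ell(h, Z_i'^j)\Big] + \mathbb{E}_{S, \sigma}\Big[\sup_{h\in\mathcal H}\frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n -\sigma_i^j\,\ell(h, Z_i^j)\Big] \\ &= 2\,\mathbb{E}_{S, \sigma}\Big[\sup_{h\in\mathcal H}\frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n \sigma_i^j\,\ell(h, Z_i^j)\Big] = 2\,\mathbb{E}_{S_1,\dots,S_m}[R_{mn}(\mathcal F)], \end{aligned}$$
where $\mathbb{E}_{S,S'}$ abbreviates the expectation over $S_1,\dots,S_m, S_1',\dots,S_m'$ and
$$R_{mn}(\mathcal F) = \mathbb{E}_\sigma\Big[\sup_{h\in\mathcal H}\frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n \sigma_i^j\,\ell(h, Z_i^j)\Big].$$
Replacing one data point of $S = S_1 \cup \dots \cup S_m$ makes $R_{mn}(\mathcal F)$ vary by at most $\frac{b}{mn}$, so applying McDiarmid's inequality again, with probability at least $1 - \delta$ we have $\mathbb{E}_{S_1,\dots,S_m}[R_{mn}(\mathcal F)] \le R_{mn}(\mathcal F) + b\sqrt{\frac{\ln(2/\delta)}{2mn}}$.

Proof of Lemma 2. Let $\Phi(D) = \sup_{h\in\mathcal H}\{L_P(h) - L_D(h)\}$, and let $D'$ be obtained from $D$ by replacing one local distribution $D_k$ with $D_k'$. From the definition of $D'$ and $D$, we have
$$\begin{aligned} \Phi(D') - \Phi(D) &= \sup_{h\in\mathcal H}\{L_P(h) - L_{D'}(h)\} - \sup_{h\in\mathcal H}\{L_P(h) - L_D(h)\} \\ &\le \sup_{h\in\mathcal H}\big\{\{L_P(h) - L_{D'}(h)\} - \{L_P(h) - L_D(h)\}\big\} \\ &= \sup_{h\in\mathcal H}\{L_D(h) - L_{D'}(h)\} = \sup_{h\in\mathcal H}\frac1m\Big(\mathbb{E}_{Z\sim D_k}[\ell(h, Z)] - \mathbb{E}_{Z\sim D_k'}[\ell(h, Z)]\Big). \end{aligned}$$
Assuming that the loss function $\ell$ is bounded by $b$, we have $|\Phi(D') - \Phi(D)| \le \frac{b}{m}$. According to McDiarmid's inequality, for all $\delta \in (0, 1)$, with probability at least $1 - \delta$, we have
$$\Phi(D) \le \mathbb{E}_{D_1,\dots,D_m}[\Phi(D)] + b\sqrt{\frac{\ln(2/\delta)}{2m}}.$$
Now we deal with the expectation of $\Phi(D)$. By symmetrization, we have
$$\begin{aligned} \mathbb{E}_{D_1,\dots,D_m}[\Phi(D)] &= \mathbb{E}_{D_1,\dots,D_m}\Big[\sup_{h\in\mathcal H}\big(L_P(h) - L_D(h)\big)\Big] \\ &= \mathbb{E}_{D_1,\dots,D_m}\Big[\sup_{h\in\mathcal H}\big(\mathbb{E}_{D'\sim P^m}[L_{D'}(h)] - L_D(h)\big)\Big] \\ &\le \mathbb{E}_{D, D'}\Big[\sup_{h\in\mathcal H}\big(L_{D'}(h) - L_D(h)\big)\Big] \\ &= \mathbb{E}_{D, D'}\Big[\sup_{h\in\mathcal H}\frac1m\sum_{i=1}^m\big(L_{D_i'}(h) - L_{D_i}(h)\big)\Big] \\ &= \mathbb{E}_{D, D', \sigma}\Big[\sup_{h\in\mathcal H}\frac1m\sum_{i=1}^m \sigma_i\big(L_{D_i'}(h) - L_{D_i}(h)\big)\Big] \\ &\le \mathbb{E}_{D', \sigma}\Big[\sup_{h\in\mathcal H}\frac1m\sum_{i=1}^m \sigma_i L_{D_i'}(h)\Big] + \mathbb{E}_{D, \sigma}\Big[\sup_{h\in\mathcal H}\frac1m\sum_{i=1}^m -\sigma_i L_{D_i}(h)\Big] \\ &= 2\,\mathbb{E}_{D, \sigma}\Big[\sup_{h\in\mathcal H}\frac1m\sum_{i=1}^m \sigma_i L_{D_i}(h)\Big]. \end{aligned}$$
Thus, we have
$$\begin{aligned} \mathbb{E}_{D_1,\dots,D_m}[\Phi(D)] &\le 2\,\mathbb{E}_{D, \sigma}\Big[\sup_{h\in\mathcal H}\frac1m\sum_{i=1}^m \sigma_i\,\mathbb{E}_{Z_i\sim D_i}[\ell(h, Z_i)]\Big] \\ &\le 2\,\mathbb{E}_{Z_1,\dots,Z_m,\sigma}\Big[\sup_{h\in\mathcal H}\frac1m\sum_{i=1}^m \sigma_i\,\ell(h, Z_i)\Big] = 2\,\mathbb{E}_{Z_1,\dots,Z_m}[R_m(L)], \end{aligned}$$
where $R_m(L) = \mathbb{E}_\sigma\big[\sup_{h\in\mathcal H}\frac1m\sum_{i=1}^m \sigma_i\,\ell(h, Z_i)\big]$. Replacing one point of $Z = \{Z_i\}_{i=1}^m$ makes $R_m(L)$ vary by at most $\frac{b}{m}$, so applying McDiarmid's inequality again, with probability at least $1 - \delta$ we have
$$\mathbb{E}_{Z_1,\dots,Z_m}[R_m(L)] \le R_m(L) + b\sqrt{\frac{\ln(2/\delta)}{2m}}.$$

The goal of Theorem 1 is to present theoretical bounds on the excess risk $L_P(\hat h) - L_P(h^*)$. Under the entropy condition $\log N(\epsilon, \mathcal F, \|\cdot\|_2) \le cd\log(1/\epsilon)$, where $F$ is the envelope function of $\mathcal F$, Dudley's theorem (Dudley, 1978; van der Vaart & Wellner, 1996) gives
$$R_{mn}(\mathcal F) \le cb\sqrt{\frac{d}{mn}}, \qquad R_m(\mathcal F) \le cb\sqrt{\frac{d}{m}}.$$
Combining the results of Lemma 1 and Lemma 2 leads to the final result.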
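The Rademacher complexity $R_{mn}(\mathcal F)$ that drives Lemma 1 can be estimated numerically when the hypothesis class is finite. The following is a minimal sketch, not part of the paper: `loss_table` is a hypothetical input holding the loss of each hypothesis on each of the $mn$ pooled data points, and the expectation over $\sigma$ is replaced by a Monte Carlo average.

```python
import random

def empirical_rademacher(loss_table, num_draws=2000, seed=0):
    """Monte Carlo estimate of E_sigma sup_h (1/N) sum_t sigma_t * loss(h, z_t).

    loss_table[k][t] is the loss of hypothesis k on the t-th pooled data point,
    with the mn points from all clients flattened into one list of length N.
    """
    rng = random.Random(seed)
    num_points = len(loss_table[0])
    total = 0.0
    for _ in range(num_draws):
        sigma = [rng.choice((-1, 1)) for _ in range(num_points)]
        # Supremum over the (finite) hypothesis class for this sign pattern.
        total += max(sum(s * l for s, l in zip(sigma, row)) for row in loss_table) / num_points
    return total / num_draws
```

For a richer class the supremum would have to be computed by optimization rather than enumeration; the finite case is used here only because it can be evaluated exactly for each sign pattern.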

C PROOFS OF FAST RATES WITH BOUNDED LOSSES

Covering numbers can be used to give a tighter estimate of the size of the hypothesis class. Here we provide the definitions of the covering number and the uniform entropy number.

Definition 7 (Covering number). Let $(\mathcal G, \rho)$ be a metric space and $\mathcal F \subseteq \mathcal G$. For any $\epsilon > 0$, $\mathcal F_\epsilon$ is an $\epsilon$-cover of $\mathcal F$ with respect to $\rho$ if for all $f \in \mathcal F$, we can find $f' \in \mathcal F_\epsilon$ such that $\rho(f, f') \le \epsilon$. The covering number $N(\epsilon, \mathcal F, \rho)$ is defined as the minimum size of an $\epsilon$-cover:
$$N(\epsilon, \mathcal F, \rho) := \min\{|\mathcal F_\epsilon| : \mathcal F_\epsilon \text{ is an } \epsilon\text{-cover of } \mathcal F \text{ w.r.t. } \rho\}.$$

Definition 8 (Uniform entropy number). The entropy number is defined as the logarithm of the covering number. Let $(\mathcal G, \rho)$ be a normed space with $\rho(f, f') = \|f - f'\|$. Let $F$ be an envelope function of $\mathcal F$ such that $|f(Z)| \le F(Z)$ for all $Z$ and $f$. We further define the uniform entropy number of $\mathcal F$ as
$$\log N(\epsilon, \mathcal F, \|\cdot\|_2) = \sup_Q \log N(\epsilon, \mathcal F, \|\cdot\|_{L_2(Q)}),$$
where the supremum is taken over all probability measures $Q$ with $0 < QF^2 < \infty$.

Definition 9. A function $\varphi(r) : [0, \infty) \to [0, \infty)$ is a sub-root function if it is nondecreasing and $r \mapsto \varphi(r)/\sqrt r$ is nonincreasing for $r > 0$.

C.1 EXAMPLES SATISFYING ASSUMPTION 2

C.1.1 EXAMPLE 1

In this subsection, we show that when the hypothesis class contains $h^*$ and $\hat h^*$, Assumption 2 holds for the square loss. First,
$$\mathbb{E}_{(X,Y)\sim D}\big[(\ell(h(X), Y) - \ell(\hat h^*(X), Y))^2\big] = \frac1m\sum_{i=1}^m \mathbb{E}\Big[\big((h(X_i^1) - Y_i^1)^2 - (\hat h^*(X_i^1) - Y_i^1)^2\big)^2\Big] \le \frac{L^2}{m}\sum_{i=1}^m \mathbb{E}\big[(h(X_i^1) - \hat h^*(X_i^1))^2\big].$$
Next we show that
$$\mathbb{E}_{(X,Y)\sim D}[\ell(h(X), Y) - \ell(\hat h^*(X), Y)] = \frac1m\sum_{i=1}^m \mathbb{E}\big[(h(X_i^1) - Y_i^1)^2 - (\hat h^*(X_i^1) - Y_i^1)^2\big] = \frac1m\sum_{i=1}^m \mathbb{E}\big[(h(X_i^1) - \hat h^*(X_i^1))^2\big].$$
Thus, if $\hat h^* \in \mathcal H$, we have
$$\mathbb{E}_{(X,Y)\sim D}\big[(\ell(h(X), Y) - \ell(\hat h^*(X), Y))^2\big] \le \frac{L^2}{m}\sum_{i=1}^m \mathbb{E}\big[(h(X_i^1) - \hat h^*(X_i^1))^2\big] = L^2\,\mathbb{E}_{(X,Y)\sim D}[\ell(h(X), Y) - \ell(\hat h^*(X), Y)].$$
Thus, the tuple $(D, \ell, \mathcal H, \hat h^*)$ satisfies the Bernstein condition with parameter $B_2 = 1$, in the sense that
$$\frac1m\sum_{i=1}^m \mathbb{E}\big[(h(X_i^1) - \hat h^*(X_i^1))^2\big] \le B_2\,\mathbb{E}_{(X,Y)\sim D}[\ell(h(X), Y) - \ell(\hat h^*(X), Y)].$$
Similarly, we can prove that
$$\mathbb{E}_{(X,Y)\sim P}\big[(\ell(h(X), Y) - \ell(h^*(X), Y))^2\big] \le L^2\,\mathbb{E}\big[(h(X_i) - h^*(X_i))^2\big] = L^2\,\mathbb{E}_{(X,Y)\sim P}[\ell(h(X), Y) - \ell(h^*(X), Y)].$$
Thus, the tuple $(P, \ell, \mathcal H, h^*)$ satisfies the Bernstein condition with parameter $B_1 = 1$, in the sense that
$$\mathbb{E}\big[(h(X_i) - h^*(X_i))^2\big] \le B_1\,\mathbb{E}_{(X,Y)\sim P}[\ell(h(X), Y) - \ell(h^*(X), Y)].$$
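The two facts used in Example 1 can be checked exactly on a small discrete model. This is a sketch under assumptions made only for illustration: $X$ is uniform on a finite set, $Y = h^*(X) + \varepsilon$ with mean-zero noise $\varepsilon$ uniform on a symmetric finite set, and the constant $L$ is taken as $\sup |h(x) + h^*(x) - 2y|$. The function `check_bernstein` and its arguments are hypothetical names.

```python
from itertools import product

def check_bernstein(h, h_star, xs=(0, 1, 2), noises=(-1.0, 1.0)):
    """Return (E[l_h - l_h*], E[(l_h - l_h*)^2], E[(h - h*)^2], L) for the square loss."""
    excess = 0.0   # E[l_h - l_{h*}]
    second = 0.0   # E[(l_h - l_{h*})^2]
    dist = 0.0     # E[(h - h*)^2]
    lip = 0.0      # L = sup |h(x) + h*(x) - 2y|
    p = 1.0 / (len(xs) * len(noises))
    for x, eps in product(xs, noises):
        y = h_star(x) + eps
        diff = (h(x) - y) ** 2 - (h_star(x) - y) ** 2
        excess += p * diff
        second += p * diff ** 2
        dist += p * (h(x) - h_star(x)) ** 2
        lip = max(lip, abs(h(x) + h_star(x) - 2 * y))
    return excess, second, dist, lip
```

Because $\ell_h - \ell_{h^*} = (h - h^*)(h + h^* - 2Y)$, the second moment is at most $L^2\,\mathbb{E}(h - h^*)^2$, and the mean-zero noise makes the excess loss equal to $\mathbb{E}(h - h^*)^2$, which is the Bernstein condition with parameter 1.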

C.1.2 EXAMPLE 2

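In this example, we show that even when the hypothesis class does not include $h^*$ and $\hat h^*$, Assumption 2 still holds. Let $\mathcal F$ be a nonempty, closed and convex subset of a Hilbert space with inner product $\langle f, g\rangle = \mathbb{E}_{(X,Y)\sim D}(f(X, Y)g(X, Y))$, and let $\hat h^*$ be the projection of $Y$ onto $\mathcal F$. By the definition of the projection we have $\langle Y - \hat h^*(X), h(X) - \hat h^*(X)\rangle \le 0$. Hence
$$\begin{aligned} \mathbb{E}_{(X,Y)\sim D}[\ell(h(X), Y) - \ell(\hat h^*(X), Y)] &= \mathbb{E}_{(X,Y)\sim D}\big[(Y - h(X))^2 - (Y - \hat h^*(X))^2\big] \\ &= \|h(X) - \hat h^*(X)\|_{L_2}^2 + 2\langle Y - \hat h^*(X), \hat h^*(X) - h(X)\rangle \\ &\ge \frac1m\sum_{i=1}^m \mathbb{E}\big[(h(X_i^1) - \hat h^*(X_i^1))^2\big]. \end{aligned}$$
Thus, for regression problems with the square loss function, where $\hat h^* \notin \mathcal H$, we have
$$\mathbb{E}_{(X,Y)\sim D}\big[(\ell(h(X), Y) - \ell(\hat h^*(X), Y))^2\big] \le \frac{L^2}{m}\sum_{i=1}^m \mathbb{E}\big[(h(X_i^1) - \hat h^*(X_i^1))^2\big] \le L^2\,\mathbb{E}_{(X,Y)\sim D}[\ell(h(X), Y) - \ell(\hat h^*(X), Y)].$$
Similarly, we can prove that
$$\mathbb{E}_{(X,Y)\sim P}\big[(\ell(h(X), Y) - \ell(h^*(X), Y))^2\big] \le L^2\,\mathbb{E}\big[(h(X_i) - h^*(X_i))^2\big] \le L^2\,\mathbb{E}_{(X,Y)\sim P}[\ell(h(X), Y) - \ell(h^*(X), Y)].$$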

C.2 PROOF OF THEOREM 2

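Lemma 3 (Theorem 2.1 in (Bartlett et al., 2005)). Let $\mathcal F$ be a family of functions satisfying $\forall f \in \mathcal F, \forall Z \in \mathcal Z$, $|f(Z)| \le b$. If $\sup_{f\in\mathcal F}\frac1m\sum_{i=1}^m \mathbb{E}[f(Z_i)^2] \le r$, then, for any $\delta > 0$ and any $\alpha > 0$, with probability at least $1 - \delta$,
$$\sup_{f\in\mathcal F}\frac1m\sum_{i=1}^m\big[\mathbb{E}[f(Z_i)] - f(Z_i)\big] \le 2(1+\alpha)\mathbb{E}[R_m(\mathcal F)] + \sqrt{\frac{2r\ln(1/\delta)}{m}} + \Big(\frac13 + \frac1\alpha\Big)\frac{b\ln(1/\delta)}{m},$$
where $R_m(\mathcal F) = \mathbb{E}_\sigma\big[\sup_{f\in\mathcal F}\frac1m\sum_{i=1}^m \sigma_i f(Z_i)\big]$.

Lemma 4 (Theorem 1 in (Yousefi et al., 2018)). Let $\mathcal F$ be a family of functions satisfying $\forall f \in \mathcal F, \forall Z \in \mathcal Z$, $|f(Z)| \le b$. If $\sup_{f\in\mathcal F}\frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n \mathbb{E}[f(Z_i^j)^2] \le r$, then, for any $\delta > 0$ and any $\alpha > 0$, with probability at least $1 - \delta$,
$$\sup_{f\in\mathcal F}\frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n\big(\mathbb{E}[f(Z_i^j)] - f(Z_i^j)\big) \le 2(1+\alpha)\mathbb{E}[R_{mn}(\mathcal F)] + \sqrt{\frac{8r\ln(1/\delta)}{mn}} + \Big(1 + \frac2\alpha\Big)\frac{4b\ln(1/\delta)}{mn},$$
where $R_{mn}(\mathcal F) = \mathbb{E}_\sigma\big[\sup_{f\in\mathcal F}\frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n \sigma_i^j f(Z_i^j)\big]$.

Lemma 5 (Lemma B.1 in (Yousefi et al., 2018)). Let $c_1, c_2 > 0$ and $s > q > 0$. Then the equation $x^s - c_1x^q - c_2 = 0$ has a unique positive solution $x_0$ satisfying
$$x_0 \le \Big[c_1^{\frac{s}{s-q}} + \frac{sc_2}{s-q}\Big]^{\frac1s}.$$
Moreover, for any $x \ge x_0$, we have $x^s \ge c_1x^q + c_2$.

Lemma 6 (Corollary 1 in Lei et al. (2016)). Let $\mathcal F$ be a function class with $\sup_{f\in\mathcal F}\|f\|_\infty \le b$. Assume that there exist three positive numbers $\gamma, d, p$ such that $\log N(\epsilon, \mathcal F, \|\cdot\|_2) \le d\log^p(\gamma/\epsilon)$ for any $0 < \epsilon \le \gamma$. Then for any $0 < r \le \gamma^2$ and $n \ge \gamma^{-2}$ there holds
$$\mathbb{E}R_n\{f \in \mathcal F : Pf^2 \le r\} \le c(b, p, \gamma)\min\Big\{\sqrt{\frac{dr\log^p(2\gamma r^{-1/2})}{n}} + \frac{d\log^p(2\gamma r^{-1/2})}{n},\ \frac{d\log^p(2\gamma n^{1/2})}{n} + \sqrt{\frac{rd\log^p(2\gamma n^{1/2})}{n}}\Big\}.$$

Proof of Theorem 2. First we define $\hat{\mathcal F}^* := \{f : (X, Y) \mapsto \ell(h(X), Y) - \ell(\hat h^*(X), Y),\ h \in \mathcal H\}$.

Step One: By Assumption 2 (a), for any $f \in \hat{\mathcal F}^*$ we derive that
$$\begin{aligned} \frac1m\sum_{i=1}^m \mathbb{E}\big[f^2(X_i^1, Y_i^1)\big] &= \frac1m\sum_{i=1}^m \mathbb{E}\big[(\ell(h(X_i^1), Y_i^1) - \ell(\hat h^*(X_i^1), Y_i^1))^2\big] \\ &\le \frac{L^2}{m}\sum_{i=1}^m \mathbb{E}\big[(h(X_i^1) - \hat h^*(X_i^1))^2\big] \\ &\le B'L^2\Big(\frac1m\sum_{i=1}^m \mathbb{E}\big[\ell(h(X_i^1), Y_i^1) - \ell(\hat h^*(X_i^1), Y_i^1)\big]\Big)^\beta. \end{aligned}$$
Let $V(f) := \frac1m\sum_{i=1}^m \mathbb{E}\big[(\ell(h(X_i^1), Y_i^1) - \ell(\hat h^*(X_i^1), Y_i^1))^2\big]$ and $B = B'L^2$.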
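Consider the function class $\mathcal G_r$ associated with $\hat{\mathcal F}^*$ and $r \ge 0$:
$$\mathcal G_r := \Big\{g := \frac{rf}{\max(r, V(f))},\ f \in \hat{\mathcal F}^*\Big\}.$$
We denote
$$V_r^+ = \sup_{g\in\mathcal G_r}\frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n\big(\mathbb{E}[g(Z_i^j)] - g(Z_i^j)\big).$$
Let $K > 1$, $0 < \beta \le 1$ and $B \ge 1$. We first prove that, if $V_r^+ \le \frac{r^{1/\beta}}{BK}$, then for all $f \in \hat{\mathcal F}^*$,
$$\frac1m\sum_{i=1}^m \mathbb{E}[f(Z_i^1)] \le \frac{K}{(mn)(K-\beta)}\sum_{i=1}^m\sum_{j=1}^n f(Z_i^j) + \frac{r^{1/\beta}}{K}. \tag{4}$$
If $V(f) \le r$, then $g = f$. With the assumption $V_r^+ \le \frac{r^{1/\beta}}{BK}$, it follows that
$$\begin{aligned} \frac1m\sum_{i=1}^m \mathbb{E}[f(Z_i^1)] &\le \frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n f(Z_i^j) + \frac{r^{1/\beta}}{BK} && (5) \\ &\le \frac{K}{(mn)(K-\beta)}\sum_{i=1}^m\sum_{j=1}^n f(Z_i^j) + \frac{r^{1/\beta}}{K}. && (6) \end{aligned}$$
If $V(f) \ge r$, then $g = rf/V(f)$. With the assumption $V_r^+ \le \frac{r^{1/\beta}}{BK}$, it follows that
$$\begin{aligned} \frac1m\sum_{i=1}^m \mathbb{E}[f(Z_i^1)] &\le \frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n f(Z_i^j) + \frac{r^{\frac1\beta - 1}V(f)}{BK} && (7) \\ &\le \frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n f(Z_i^j) + \frac{r^{\frac1\beta - 1}}{K}\Big(\frac1m\sum_{i=1}^m \mathbb{E}[f(Z_i^1)]\Big)^\beta && (8) \\ &\le \frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n f(Z_i^j) + \frac{\beta}{K}\cdot\frac1m\sum_{i=1}^m \mathbb{E}[f(Z_i^1)] + \frac{(1-\beta)r^{\frac1\beta}}{K}, && (9) \end{aligned}$$
where the second inequality follows from the Bernstein condition and the third inequality follows from Lemma 5. The obtained inequality can be rewritten as
$$\begin{aligned} \frac1m\sum_{i=1}^m \mathbb{E}[f(Z_i^1)] &\le \frac{K}{(mn)(K-\beta)}\sum_{i=1}^m\sum_{j=1}^n f(Z_i^j) + \frac{(1-\beta)r^{\frac1\beta}}{K-\beta} && (10) \\ &\le \frac{K}{(mn)(K-\beta)}\sum_{i=1}^m\sum_{j=1}^n f(Z_i^j) + \frac{r^{\frac1\beta}}{K}, && (11) \end{aligned}$$
where the second inequality is due to $\frac{b-c}{a-c} \le \frac{b}{a}$ for $c < b \le a$.

Step Two: By the construction of $\mathcal G_r$, it can be verified that $\frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n \mathbb{E}[g(Z_i^j)^2] \le r$. In detail: if $V(f) \le r$, we have $g = f$. It follows that
$$\frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n \mathbb{E}[g(Z_i^j)^2] = \frac1m\sum_{i=1}^m \mathbb{E}[g(Z_i^1)^2] = \frac1m\sum_{i=1}^m \mathbb{E}[f(Z_i^1)^2] \le V(f) \le r. \tag{12}$$
If $V(f) > r$, we have $g = rf/V(f)$. It follows that
$$\frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n \mathbb{E}[g(Z_i^j)^2] = \frac1m\sum_{i=1}^m \mathbb{E}[g(Z_i^1)^2] = \frac{r^2}{m[V(f)]^2}\sum_{i=1}^m \mathbb{E}[f(Z_i^1)^2] \le \frac{r^2}{V(f)} \le r. \tag{13}$$
Combining the boundedness of $\mathcal G_r$ and Lemma 4, for all $\delta > 0$ and $\alpha > 0$, with probability at least $1 - \delta$ we have
$$\sup_{g\in\mathcal G_r}\frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n\big(\mathbb{E}[g(Z_i^j)] - g(Z_i^j)\big) \le 2(1+\alpha)\mathbb{E}[R_{mn}(\mathcal G_r)] + \sqrt{\frac{8r\ln(1/\delta)}{mn}} + \Big(1 + \frac2\alpha\Big)\frac{4b\ln(1/\delta)}{mn},$$
where $R_{mn}(\mathcal G_r) = \mathbb{E}_\sigma\big[\sup_{g\in\mathcal G_r}\frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n \sigma_i^j g(Z_i^j)\big]$.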
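Next we apply the "peeling" technique to bound $\mathbb{E}[R_{mn}(\mathcal G_r)]$. Given $\lambda > 1$, let $\hat{\mathcal F}^*(u, v) := \{f \in \hat{\mathcal F}^* : u \le V(f) \le v\}$, and let $k$ index the last shell $[r\lambda^k, r\lambda^{k+1}]$ needed to cover $[r, Bb^\beta]$. Then
$$\begin{aligned} \mathbb{E}[R_{mn}(\mathcal G_r)] &= \mathbb{E}_{S,\sigma}\Big[\sup_{g\in\mathcal G_r}\frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n \sigma_i^j g(Z_i^j)\Big] = \mathbb{E}_{S,\sigma}\Big[\sup_{f\in\hat{\mathcal F}^*}\frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n \frac{r}{\max(r, V(f))}\sigma_i^j f(Z_i^j)\Big] \\ &\le \mathbb{E}_{S,\sigma}\Big[\sup_{f\in\hat{\mathcal F}^*(0,r)}\frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n \sigma_i^j f(Z_i^j)\Big] + \mathbb{E}_{S,\sigma}\Big[\sup_{f\in\hat{\mathcal F}^*(r,Bb^\beta)}\frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n \frac{r}{V(f)}\sigma_i^j f(Z_i^j)\Big] \\ &\le \mathbb{E}_{S,\sigma}\Big[\sup_{f\in\hat{\mathcal F}^*(0,r)}\frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n \sigma_i^j f(Z_i^j)\Big] + \sum_{l=0}^k \lambda^{-l}\,\mathbb{E}_{S,\sigma}\Big[\sup_{f\in\hat{\mathcal F}^*(r\lambda^l, r\lambda^{l+1})}\frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n \sigma_i^j f(Z_i^j)\Big] \\ &\le \mathbb{E}_{S,\sigma}\Big[\sup_{f\in\hat{\mathcal F}^*(0,r)}\frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n \sigma_i^j f(Z_i^j)\Big] + \sum_{l=0}^k \lambda^{-l}\,\mathbb{E}_{S,\sigma}\Big[\sup_{f\in\hat{\mathcal F}^*(0, r\lambda^{l+1})}\frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n \sigma_i^j f(Z_i^j)\Big] \\ &\le \frac{\psi(r)}{B} + \sum_{l=0}^k \lambda^{-l}\frac{\psi(r\lambda^{l+1})}{B}. \end{aligned}$$
By the property of sub-root functions, $\psi(\theta r) \le \theta^{\frac12}\psi(r)$ for any $\theta \ge 1$. Then,
$$\mathbb{E}[R_{mn}(\mathcal G_r)] \le \frac{\psi(r)}{B}\Big(1 + \sqrt\lambda\sum_{l=0}^k \lambda^{-\frac l2}\Big) \le \frac{\psi(r)}{B}\Big(1 + \frac{\lambda}{\sqrt\lambda - 1}\Big).$$
Taking $\lambda = 4$, it follows that $\mathbb{E}[R_{mn}(\mathcal G_r)] \le 5\psi(r)/B$. Combining this with the property $\psi(r) \le \sqrt{r/r^*}\,\psi(r^*) = \sqrt{rr^*}$ for $r \ge r^*$, the following inequality holds with probability at least $1 - \delta$, for all $\delta > 0$:
$$\sup_{g\in\mathcal G_r}\frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n\big(\mathbb{E}[g(Z_i^j)] - g(Z_i^j)\big) \le \frac{10(1+\alpha)}{B}\sqrt{rr^*} + \sqrt{\frac{8r\ln(1/\delta)}{mn}} + \Big(1 + \frac2\alpha\Big)\frac{4b\ln(1/\delta)}{mn},$$
where $r^*$ is the fixed point of the sub-root function $\psi(r)$.

Step Three: Recall that the condition under which we obtain inequality (4) is
$$V_r^+ = \sup_{g\in\mathcal G_r}\frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n\big(\mathbb{E}[g(Z_i^j)] - g(Z_i^j)\big) \le \frac{r^{1/\beta}}{BK}.$$
We denote
$$A = \frac{10(1+\alpha)}{B}\sqrt{r^*} + \sqrt{\frac{8\ln(1/\delta)}{mn}}, \qquad C = \Big(1 + \frac2\alpha\Big)\frac{4b\ln(1/\delta)}{mn}.$$
Next, we need to solve $A\sqrt r + C \le \frac{r^{1/\beta}}{BK}$. Assume $r_0$ is the positive solution of $A\sqrt r + C = \frac{r^{1/\beta}}{BK}$, that is, $r_0^{1/\beta} - ABKr_0^{\frac12} - BKC = 0$. Then by Lemma 5, we have
$$\begin{aligned} r_0^{\frac1\beta} &\le (ABK)^{\frac{2}{2-\beta}} + \frac{2BKC}{2-\beta} \\ &\le (BK)^{\frac{2}{2-\beta}}2^{\frac{\beta}{2-\beta}}\Big[\Big(\frac{10(1+\alpha)}{B}\Big)^{\frac{2}{2-\beta}}(r^*)^{\frac{1}{2-\beta}} + \Big(\frac{8\ln(1/\delta)}{mn}\Big)^{\frac{1}{2-\beta}}\Big] + \Big(1 + \frac2\alpha\Big)\frac{8BKb\ln(1/\delta)}{(2-\beta)mn}, \end{aligned}$$
where the second inequality follows from $(x+y)^p \le 2^{p-1}(x^p + y^p)$ for $x, y \ge 0$, $p \ge 1$. If $r^* \le r_0$, we can take $r = r_0$. Then we have $V_{r_0}^+ \le A\sqrt{r_0} + C = \frac{r_0^{1/\beta}}{BK}$.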
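Combining with inequality (4), we get
$$\begin{aligned} \frac1m\sum_{i=1}^m \mathbb{E}[f(Z_i^1)] &\le \frac{K}{(mn)(K-\beta)}\sum_{i=1}^m\sum_{j=1}^n f(Z_i^j) + \frac{r_0^{1/\beta}}{K} \\ &\le \frac{K}{(mn)(K-\beta)}\sum_{i=1}^m\sum_{j=1}^n f(Z_i^j) + (2K)^{\frac{\beta}{2-\beta}}(10(1+\alpha))^{\frac{2}{2-\beta}}(r^*)^{\frac{1}{2-\beta}} \\ &\quad + \Big(\frac{2^{\beta+3}B^2K^\beta\ln(1/\delta)}{mn}\Big)^{\frac{1}{2-\beta}} + \frac{(1 + (2/\alpha))8Bb\ln(1/\delta)}{(2-\beta)mn}. \end{aligned}$$
If $r^* > r_0$, we can take $r = r^*$. Then we have $V_{r^*}^+ \le A\sqrt{r^*} + C \le \frac{(r^*)^{1/\beta}}{BK}$. Combining with inequality (4), we get
$$\frac1m\sum_{i=1}^m \mathbb{E}[f(Z_i^1)] \le \frac{K}{(mn)(K-\beta)}\sum_{i=1}^m\sum_{j=1}^n f(Z_i^j) + \frac{(r^*)^{\frac1\beta}}{K}.$$

Step Four: Note that $\psi(r)$ is set to satisfy the condition $\psi(r) \ge B\,\mathbb{E}[R_{mn}(\hat{\mathcal F}^*(0, r))]$, where $B = B_2L^2$. Clearly $\mathcal G_r \subset \{f \in \hat{\mathcal F}^* : T_{mn}(f) \le r\}$, where $T_{mn}(f) = \frac1m\sum_{i=1}^m \mathbb{E}_{Z\sim D_i}[f(Z)^2]$. Thus,
$$\mathbb{E}[R_{mn}(\mathcal G_r)] = \mathbb{E}_{S_1,\dots,S_m,\sigma}\Big[\sup_{g\in\mathcal G_r}\frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n \sigma_i^j g(Z_i^j)\Big] \le \mathbb{E}_{S_1,\dots,S_m,\sigma}\Big[\sup_{f\in\hat{\mathcal F}^*,\ T_{mn}(f)\le r}\frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n \sigma_i^j f(X_i^j, Y_i^j)\Big] = R_{mn}(\hat{\mathcal F}^*, r).$$
Lemma 6 implies that the sub-root function can be chosen as
$$\psi(r) := c\Big(\frac{d\log^p(2\gamma(mn)^{1/2})}{mn} + \sqrt{\frac{rd\log^p(2\gamma(mn)^{1/2})}{mn}}\Big).$$
In a similar way to Lei et al. (2016), let $r^*_{mn}$ be its fixed point; then
$$r^*_{mn} = c\Big(\frac{d\log^p(2\gamma(mn)^{1/2})}{mn} + \sqrt{\frac{r^*_{mn}d\log^p(2\gamma(mn)^{1/2})}{mn}}\Big).$$
Solving this equality gives $r^*_{mn} \le cd(mn)^{-1}\log^p(mn)$. Combining the fact that $\frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n\big(\ell(\hat h(X_i^j), Y_i^j) - \ell(\hat h^*(X_i^j), Y_i^j)\big) \le 0$ with the result of Step Three, we get
$$\begin{aligned} L_D(\hat h) - L_D(\hat h^*) &= \frac1m\sum_{i=1}^m \mathbb{E}\big[\ell(\hat h(X_i^1), Y_i^1) - \ell(\hat h^*(X_i^1), Y_i^1)\big] \\ &\le (2K)^{\frac{\beta}{2-\beta}}(10(1+\alpha))^{\frac{2}{2-\beta}}\max\big((r^*)^{\frac{1}{2-\beta}}, (r^*)^{\frac1\beta}\big) + \Big(\frac{2^{\beta+3}B^2K^\beta\ln(1/\delta)}{mn}\Big)^{\frac{1}{2-\beta}} + \frac{(1 + (2/\alpha))8Bb\ln(1/\delta)}{(2-\beta)mn}. \end{aligned}$$
When $mn \ge cd\log^p(mn)$, we have $\max\big((r^*)^{\frac{1}{2-\beta}}, (r^*)^{\frac1\beta}\big) = (r^*)^{\frac{1}{2-\beta}}$. Thus,
$$L_D(\hat h) - L_D(\hat h^*) \le c_1\Big(\frac{\log^p(mn)}{mn}\Big)^{\frac{1}{2-\beta'}} + c_2\Big(\frac{\log(1/\delta)}{mn}\Big)^{\frac{1}{2-\beta'}}.$$

C.3 PROOF OF THEOREM 3

Proof. First we define $\mathcal F^* := \{f : (X, Y) \mapsto \ell(h(X), Y) - \ell(h^*(X), Y),\ h \in \mathcal H\}$.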
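Step One: By Assumption 2 (b), for any $f \in \mathcal F^*$, we have
$$\begin{aligned} \mathbb{E}_{D_i\sim P}\Big[\big(\mathbb{E}_{(X,Y)\sim D_i}[f(X, Y)]\big)^2\Big] &= \mathbb{E}_{D_i\sim P}\Big[\big(\mathbb{E}_{(X,Y)\sim D_i}[\ell(h(X), Y) - \ell(h^*(X), Y)]\big)^2\Big] \\ &\le \mathbb{E}_{D_i\sim P}\mathbb{E}_{(X,Y)\sim D_i}\big[(\ell(h(X), Y) - \ell(h^*(X), Y))^2\big] \\ &= \mathbb{E}\big[(\ell(h(X_1), Y_1) - \ell(h^*(X_1), Y_1))^2\big] \\ &\le L^2\,\mathbb{E}\big[(h(X_1) - h^*(X_1))^2\big] \le B''L^2\big(L_P(h) - L_P(h^*)\big)^{\beta''}. \end{aligned}$$
Let $V(f) := \mathbb{E}_{D_i\sim P}\mathbb{E}_{(X,Y)\sim D_i}\big[(\ell(h(X), Y) - \ell(h^*(X), Y))^2\big]$ and $B = B''L^2$. Consider the function class $\mathcal G_r$ associated with $\mathcal F^*$ and $r \ge 0$:
$$\mathcal G_r := \Big\{g := \frac{rf}{\max(r, V(f))},\ f \in \mathcal F^*\Big\}.$$
We denote
$$V_r^+ = \sup_{g\in\mathcal G_r}\frac1m\sum_{i=1}^m\big(\mathbb{E}[g(Z_i)] - \mathbb{E}[g(Z_i^1)]\big).$$
Let $K > 1$, $0 < \beta \le 1$ and $B \ge 1$. We first prove that, if $V_r^+ \le \frac{r^{1/\beta}}{BK}$, then for all $f \in \mathcal F^*$,
$$\mathbb{E}[f(Z_1)] \le \frac{K}{m(K-\beta)}\sum_{i=1}^m \mathbb{E}[f(Z_i^1)] + \frac{r^{1/\beta}}{K}. \tag{15}$$
If $V(f) \le r$, then $g = f$. With the assumption $V_r^+ \le \frac{r^{1/\beta}}{BK}$, it follows that
$$\begin{aligned} \mathbb{E}[f(Z_1)] &\le \frac1m\sum_{i=1}^m \mathbb{E}[f(Z_i^1)] + \frac{r^{1/\beta}}{BK} && (16) \\ &\le \frac{K}{m(K-\beta)}\sum_{i=1}^m \mathbb{E}[f(Z_i^1)] + \frac{r^{1/\beta}}{K}. && (17) \end{aligned}$$
If $V(f) \ge r$, then $g = rf/V(f)$. With the assumption $V_r^+ \le \frac{r^{1/\beta}}{BK}$, it follows that
$$\begin{aligned} \mathbb{E}[f(Z_1)] &\le \frac1m\sum_{i=1}^m \mathbb{E}[f(Z_i^1)] + \frac{r^{\frac1\beta - 1}V(f)}{BK} && (18) \\ &\le \frac1m\sum_{i=1}^m \mathbb{E}[f(Z_i^1)] + \frac{r^{\frac1\beta - 1}}{K}\big(\mathbb{E}[f(Z_1)]\big)^\beta && (19) \\ &\le \frac1m\sum_{i=1}^m \mathbb{E}[f(Z_i^1)] + \frac{\beta}{K}\mathbb{E}[f(Z_1)] + \frac{(1-\beta)r^{\frac1\beta}}{K}, && (20) \end{aligned}$$
where the second inequality follows from the Bernstein condition and the third inequality follows from Lemma 5. The obtained inequality can be rewritten as
$$\begin{aligned} \mathbb{E}[f(Z_1)] &\le \frac{K}{m(K-\beta)}\sum_{i=1}^m \mathbb{E}[f(Z_i^1)] + \frac{(1-\beta)r^{\frac1\beta}}{K-\beta} && (21) \\ &\le \frac{K}{m(K-\beta)}\sum_{i=1}^m \mathbb{E}[f(Z_i^1)] + \frac{r^{\frac1\beta}}{K}. && (22) \end{aligned}$$

Step Two: By the construction of $\mathcal G_r$, it can be verified that $\frac1m\sum_{i=1}^m \mathbb{E}_{D_i\sim P}\big[(\mathbb{E}_{Z_i^1\sim D_i}[g(Z_i^1)])^2\big] \le r$. In detail: first, we have
$$\frac1m\sum_{i=1}^m \mathbb{E}_{D_i\sim P}\Big[\big(\mathbb{E}_{Z_i^1\sim D_i}[g(Z_i^1)]\big)^2\Big] \le \frac1m\sum_{i=1}^m \mathbb{E}_{D_i\sim P}\big[\mathbb{E}_{Z_i^1\sim D_i}[g(Z_i^1)^2]\big] = \frac1m\sum_{i=1}^m \mathbb{E}[g(Z_i)^2]. \tag{23}$$
If $V(f) \le r$, we have $g = f$. It follows that
$$\frac1m\sum_{i=1}^m \mathbb{E}[g(Z_i)^2] = \mathbb{E}[g(Z_i)^2] = \mathbb{E}[f(Z_i)^2] \le V(f) \le r. \tag{24}$$
If $V(f) > r$, we have $g = rf/V(f)$. It follows that
$$\frac1m\sum_{i=1}^m \mathbb{E}[g(Z_i)^2] = \mathbb{E}[g(Z_i)^2] = \frac{r^2}{[V(f)]^2}\mathbb{E}[f(Z_i)^2] \le \frac{r^2}{V(f)} \le r. \tag{25}$$
Combining the boundedness of $\mathcal G_r$ and Lemma 3, for all $\delta > 0$ and $\alpha > 0$, with probability at least $1 - \delta$ we have
$$\sup_{g\in\mathcal G_r}\frac1m\sum_{i=1}^m\big(\mathbb{E}[g(Z_i)] - \mathbb{E}[g(Z_i^1)]\big) \le 2(1+\alpha)\,\mathbb{E}_{D_1,\dots,D_m,\sigma}\Big[\sup_{g\in\mathcal G_r}\frac1m\sum_{i=1}^m \sigma_i\,\mathbb{E}_{Z_i^1\sim D_i}[g(Z_i^1)]\Big] + \sqrt{\frac{2r\ln(1/\delta)}{m}} + \Big(\frac13 + \frac1\alpha\Big)\frac{b\ln(1/\delta)}{m}.$$
Thus, we get
$$\sup_{g\in\mathcal G_r}\frac1m\sum_{i=1}^m\big(\mathbb{E}[g(Z_i)] - \mathbb{E}[g(Z_i^1)]\big) \le 2(1+\alpha)\mathbb{E}[R_m(\mathcal G_r)] + \sqrt{\frac{2r\ln(1/\delta)}{m}} + \Big(\frac13 + \frac1\alpha\Big)\frac{b\ln(1/\delta)}{m},$$
where $R_m(\mathcal G_r) = \mathbb{E}_\sigma\big[\sup_{g\in\mathcal G_r}\frac1m\sum_{i=1}^m \sigma_i g(Z_i)\big]$. By the "peeling" technique and steps similar to those in the proof of Theorem 2, for all $\delta > 0$, with probability at least $1 - \delta$,
$$\sup_{g\in\mathcal G_r}\frac1m\sum_{i=1}^m\big(\mathbb{E}[g(Z_i)] - \mathbb{E}[g(Z_i^1)]\big) \le \frac{10(1+\alpha)}{B}\sqrt{rr^*} + \sqrt{\frac{2r\ln(1/\delta)}{m}} + \Big(\frac13 + \frac1\alpha\Big)\frac{b\ln(1/\delta)}{m},$$
where $r^*$ is the fixed point of the sub-root function $\psi(r)$.

Step Three: By Lemma 5, we have
$$r_0^{\frac1\beta} \le (BK)^{\frac{2}{2-\beta}}2^{\frac{\beta}{2-\beta}}\Big[\Big(\frac{10(1+\alpha)}{B}\Big)^{\frac{2}{2-\beta}}(r^*)^{\frac{1}{2-\beta}} + \Big(\frac{2\ln(1/\delta)}{m}\Big)^{\frac{1}{2-\beta}}\Big] + \Big(\frac13 + \frac1\alpha\Big)\frac{2BKb\ln(1/\delta)}{(2-\beta)m}.$$
Let $\mathcal F$ be the loss function class. We consider the functional $T(f) := Pf^2$ here. The structural result on covering numbers implies that
$$\log N(\epsilon, \mathcal F, \|\cdot\|_2) \le \log N(\epsilon/L, \mathcal H, \|\cdot\|_2) \le d\log^p(\gamma L/\epsilon).$$
Lemma 6 implies that the sub-root function can be chosen as
$$\psi(r) := c\Big(\frac{d\log^p(2\gamma m^{1/2})}{m} + \sqrt{\frac{rd\log^p(2\gamma m^{1/2})}{m}}\Big).$$
In a similar way to Lei et al. (2016), let $r^*_m$ be its fixed point; then
$$r^*_m = c\Big(\frac{d\log^p(2\gamma m^{1/2})}{m} + \sqrt{\frac{r^*_m d\log^p(2\gamma m^{1/2})}{m}}\Big).$$
Solving this equality gives $r^*_m \le cdm^{-1}\log^p(m)$. In a similar way to the proof of Theorem 2, we have
$$\mathbb{E}[f(Z_1)] \le \frac{K}{m(K-\beta'')}\sum_{i=1}^m \mathbb{E}[f(Z_i^1)] + c_1\Big(\frac{\log^p(m)}{m}\Big)^{\frac{1}{2-\beta''}} + c_2\Big(\frac{\log(1/\delta)}{m}\Big)^{\frac{1}{2-\beta''}}. \tag{26}$$
That is,
$$L_P(\hat h) - L_P(h^*) \le \frac{K}{K-\beta''}\big(L_D(\hat h) - L_D(h^*)\big) + c_1\Big(\frac{\log^p(m)}{m}\Big)^{\frac{1}{2-\beta''}} + c_2\Big(\frac{\log(1/\delta)}{m}\Big)^{\frac{1}{2-\beta''}}.$$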
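Using the fact that
$$L_D(\hat h) - L_D(h^*) = L_D(\hat h^*) - L_D(h^*) + L_D(\hat h) - L_D(\hat h^*) \le L_D(\hat h) - L_D(\hat h^*),$$
which holds because $\hat h^*$ minimizes $L_D$ and hence $L_D(\hat h^*) - L_D(h^*) \le 0$, we get
$$L_P(\hat h) - L_P(h^*) \le \frac{K}{K-\beta''}\big(L_D(\hat h) - L_D(\hat h^*)\big) + c_1\Big(\frac{\log^p(m)}{m}\Big)^{\frac{1}{2-\beta''}} + c_2\Big(\frac{\log(1/\delta)}{m}\Big)^{\frac{1}{2-\beta''}}.$$
Combining with Theorem 2, we complete the proof.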
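The fixed point $r^*$ of the sub-root function chosen in the proofs of Theorems 2 and 3 can be computed directly. The sketch below is an illustration, not the authors' procedure: it abbreviates the quantity $d\log^p(2\gamma N^{1/2})/N$ by a single hypothetical parameter `a` (with $N = mn$ or $N = m$), iterates $r \leftarrow \psi(r)$, and compares the result with the closed form obtained by solving the quadratic in $\sqrt r$.

```python
import math

def psi(r, a, c=1.0):
    """Sub-root function of the form psi(r) = c * (a + sqrt(r * a))."""
    return c * (a + math.sqrt(r * a))

def fixed_point(a, c=1.0, iters=200):
    """Iterate r <- psi(r); for this psi the iteration converges to the
    unique positive solution of psi(r) = r."""
    r = 1.0
    for _ in range(iters):
        r = psi(r, a, c)
    return r

def closed_form(a, c=1.0):
    """Solve x^2 - c*sqrt(a)*x - c*a = 0 for x = sqrt(r) and return r = x^2."""
    x = (c * math.sqrt(a) + math.sqrt(c * c * a + 4.0 * c * a)) / 2.0
    return x * x
```

The closed form shows that $r^*$ is a constant multiple of `a`, which matches the bounds $r^*_{mn} \le cd(mn)^{-1}\log^p(mn)$ and $r^*_m \le cdm^{-1}\log^p(m)$ stated in the proofs.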

D PROOFS OF THE RESULTS WITH SUB-EXPONENTIAL LOSSES

D.1 LEARNING RATES FOR SUB-EXPONENTIAL LOSSES

We state our results on the convergence rate of the generalization error for sub-exponential losses. First, we consider the participating clients. Let $S = \{Z_i^j\}_{(i,j)=(1,1)}^{(m,n)}$ be the global data samples, whose subsets $S_i = \{Z_i^j\}_{j=1}^n$ consist of i.i.d. random variables at the $i$-th client.

Theorem 6 (Participating error for sub-exponential losses). Suppose the $Z_i^j$ take values in a Banach space $(\mathcal Z, \|\cdot\|)$ and each $\|Z_i^j\|$ is sub-exponentially distributed. Let $\mathcal F = \{Z \mapsto \ell(h, Z) : h \in \mathcal H\}$ be such that, for all $f \in \mathcal F$ and all $z, z' \in \mathcal Z$, $|f(z) - f(z')| \le L\|z - z'\|$. For any $\delta > 0$, if $mn \ge \ln(1/\delta) \ge \ln 2$, then with probability at least $1 - \delta$, we have
$$\sup_{h\in\mathcal H}|L_D(h) - L_S(h)| \le 2\mathbb{E}[R_{mn}(\mathcal F)] + \max_{i\in[m]} 16eL\,\big\|\|Z_i^1\|\big\|_{\psi_1}\sqrt{\frac{2\ln(1/\delta)}{mn}},$$
where $R_{mn}(\mathcal F) = \mathbb{E}_\sigma\big[\sup_{f\in\mathcal F}\frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n \sigma_i^j f(Z_i^j)\big]$.

Remark 9. Theorem 6 can be used to bound the semi-excess risk, since the latter is controlled by the uniform deviation over $h \in \mathcal H$. It is worth emphasizing that the bound derived in Theorem 6 includes a Rademacher complexity term and the term $\big\|\|Z_i^1\|\big\|_{\psi_1}$, which measures the tails of the input data samples $\{Z_i^j\}_{(i,j)=(1,1)}^{(m,n)}$. Intuitively, in regression problems, as the noise added to the labels increases, the participating error is expected to increase as well. This phenomenon is ignored under the previous bounded assumption on losses.

Example 1 (Linear regression with unbounded loss). Let $\mathcal Z = (\mathcal X, \mathbb R)$, where $\mathcal X$ is a Hilbert space with norm $\|\cdot\|_H$. We denote by $(X_i^j, Y_i^j)$ sub-exponential random variables in $\mathcal X$ and $\mathbb R$ respectively. Let the loss function $\ell$ be a 1-Lipschitz function (absolute loss or Huber loss) and $\mathcal F = \{(x, y) \mapsto f(x, y) = \ell(\langle w, x\rangle - y) : \|w\|_H \le L\}$. If $mn \ge \ln(1/\delta) \ge \ln 2$, then with probability at least $1 - \delta$,
$$\sup_{h\in\mathcal H}|L_D(h) - L_S(h)| \le \frac{4}{\sqrt{mn}}\Big(\max_{i\in[m]} L\,\big\|\|X_i^1\|\big\|_{\psi_1} + \max_{i\in[m]}\big\|\|Y_i^1\|\big\|_{\psi_1}\Big)\Big(1 + 6e\sqrt{\ln(\tfrac1\delta)}\Big).$$

Theorem 7 (Participation gap with unbounded loss). Under the same conditions as Theorem 6 and Example 1, we have
$$\sup_{h\in\mathcal H}|L_P(h) - L_D(h)| \le \frac{8}{\sqrt m}\Big(L\,\big\|\|X_1\|\big\|_{\psi_1} + \big\|\|Y_1\|\big\|_{\psi_1}\Big)\Big(1 + 3e\sqrt{\ln(\tfrac1\delta)}\Big),$$
where $X_1$ is a random vector whose expectation is taken across the two-level distribution. That is, $\mathbb{E}[\|X_1\|] = \mathbb{E}_{D_i\sim P}\mathbb{E}_{X_1\sim D_i}\|X_1\|$. Similarly, $\mathbb{E}[\|Y_1\|] = \mathbb{E}_{D_i\sim P}\mathbb{E}_{Y_1\sim D_i}\|Y_1\|$.

Remark 10. Combining the results of Theorem 6 and Theorem 7, it can be shown that the upper bound on the excess risk is of order $O(\frac{1}{\sqrt{mn}} + \frac{1}{\sqrt m})$. Although this bound is derived under the unbounded assumption, its order is comparable with the basic results derived in Theorem 1. Note that the upper bound in Theorem 7 includes terms such as $\|X_1\|$ and $\|Y_1\|$, whose underlying distributions span our two-level framework. This reflects that the participation gap captures the generalization error caused by client sampling.

Lemma 7 (Theorem 3.1 in Maurer & Pontil (2021)). Let $f : \mathcal X^n \to \mathbb R$ and let $X = (X_1, \dots, X_n)$ be a random vector whose elements are independent and take values in a space $\mathcal X$. Then for any $t \ge 0$,
$$\mathbb{P}\big(f(X) - \mathbb{E}f(X') > t\big) \le \exp\Bigg(\frac{-t^2}{4e^2\big\|\sum_k \|f_k(X)\|_{\psi_1}^2\big\|_\infty + 2e\max_k\big\|\|f_k(X)\|_{\psi_1}\big\|_\infty t}\Bigg).$$

PROOF OF THEOREM 6

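Proof. We first define a vector space $\mathcal B = \{g : \mathcal F \to \mathbb R : \sup_{f\in\mathcal F}|g(f)| < \infty\}$. By definition, $\mathcal B$ is a normed space with norm $\|g\|_{\mathcal B} = \sup_{f\in\mathcal F}|g(f)|$. For each $Z_i^j \in \mathcal Z$, we define $\hat Z_i^j(f)$ by $(mn)^{-1}\big(f(Z_i^j) - \mathbb{E}[f(Z_i'^j)]\big)$. Thus, $\mathbb{E}[\hat Z_i^j] \equiv 0$, and
$$\Big\|\sum_i\sum_j \hat Z_i^j\Big\|_{\mathcal B} = \sup_{f\in\mathcal F}\Big|\frac{1}{mn}\sum_i\sum_j\big(f(Z_i^j) - \mathbb{E}[f(Z_i'^j)]\big)\Big|.$$
From Lemma 7, we have
$$\Big\|\sum_i\sum_j \hat Z_i^j\Big\|_{\mathcal B} - \mathbb{E}\Big[\Big\|\sum_i\sum_j \hat Z_i^j\Big\|_{\mathcal B}\Big] \le \Big\|\sum_i\sum_j \hat Z_i^j - \mathbb{E}\Big[\sum_i\sum_j \hat Z_i^j\Big]\Big\|_{\mathcal B} \le \max_{i\in[m]} 8e\sqrt{mn}\,\big\|\|\hat Z_i^1\|_{\mathcal B}\big\|_{\psi_1}\sqrt{2\ln(1/\delta)}.$$
Observe that
$$\begin{aligned} \big\|\|\hat Z_i^j\|_{\mathcal B}\big\|_{\psi_1} &= \frac{1}{mn}\Big\|\sup_{f\in\mathcal F}\big|\mathbb{E}\big[f(Z_i^j) - f(Z_i'^j) \mid Z\big]\big|\Big\|_{\psi_1} \le \frac{1}{mn}\Big\|\sup_{f\in\mathcal F}\mathbb{E}\big[|f(Z_i^j) - f(Z_i'^j)| \mid Z\big]\Big\|_{\psi_1} \\ &\le \frac{L}{mn}\Big\|\mathbb{E}\big[\|Z_i^j - Z_i'^j\| \mid Z\big]\Big\|_{\psi_1} \le \frac{2L}{mn}\big\|\|Z_i^1\|\big\|_{\psi_1}. \end{aligned}$$

Published as a conference paper at ICLR 2023

Therefore, we get $\max_{i\in[m]}\big\|\|\hat Z_i^1\|_{\mathcal B}\big\|_{\psi_1} \le \max_{i\in[m]}\frac{2L}{mn}\big\|\|Z_i^1\|\big\|_{\psi_1}$, and
$$\Big\|\sum_i\sum_j \hat Z_i^j\Big\|_{\mathcal B} - \mathbb{E}\Big[\Big\|\sum_i\sum_j \hat Z_i^j\Big\|_{\mathcal B}\Big] \le \max_{i\in[m]} 16eL\,\big\|\|Z_i^1\|\big\|_{\psi_1}\sqrt{\frac{2\ln(1/\delta)}{mn}}.$$
By symmetrization, we have
$$\mathbb{E}\Big[\Big\|\sum_i\sum_j \hat Z_i^j\Big\|_{\mathcal B}\Big] \le 2\,\mathbb{E}_{S_1,\dots,S_m}[R_{mn}(\mathcal F)] = 2\,\mathbb{E}_{S_1,\dots,S_m}\Big[\mathbb{E}_\sigma\Big[\sup_{f\in\mathcal F}\frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n \sigma_i^j f(Z_i^j) \,\Big|\, Z\Big]\Big].$$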

PROOF OF EXAMPLE 1

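Proof. We have
$$\begin{aligned} R_{mn}(\mathcal F) &= \mathbb{E}_\sigma\Big[\sup_{f\in\mathcal F}\frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n \sigma_i^j f(Z_i^j)\Big] = \mathbb{E}_\sigma\Big[\sup_{\|w\|_H\le L}\frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n \sigma_i^j\,\ell(\langle w, X_i^j\rangle - Y_i^j)\Big] \\ &\le \mathbb{E}_\sigma\Big[\sup_{\|w\|_H\le L}\frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n \langle w, \sigma_i^j X_i^j\rangle\Big] + \mathbb{E}_\sigma\Big[\Big|\frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n \sigma_i^j Y_i^j\Big|\Big] \\ &= \mathbb{E}_\sigma\Big[\sup_{\|w\|_H\le L}\frac{1}{mn}\Big\langle w, \sum_{i=1}^m\sum_{j=1}^n \sigma_i^j X_i^j\Big\rangle\Big] + \mathbb{E}_\sigma\Big[\Big|\frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n \sigma_i^j Y_i^j\Big|\Big] \\ &\le \frac{L}{mn}\mathbb{E}_\sigma\Big[\Big\|\sum_{i=1}^m\sum_{j=1}^n \sigma_i^j X_i^j\Big\|_H\Big] + \mathbb{E}_\sigma\Big[\Big|\frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n \sigma_i^j Y_i^j\Big|\Big]. \end{aligned}$$
Next, using Jensen's inequality we can see that
$$\mathbb{E}_\sigma\Big[\Big\|\sum_{i=1}^m\sum_{j=1}^n \sigma_i^j X_i^j\Big\|_H\Big] = \mathbb{E}_\sigma\Big[\Big(\Big\|\sum_{i=1}^m\sum_{j=1}^n \sigma_i^j X_i^j\Big\|_H^2\Big)^{\frac12}\Big] \le \Big(\mathbb{E}_\sigma\Big[\Big\|\sum_{i=1}^m\sum_{j=1}^n \sigma_i^j X_i^j\Big\|_H^2\Big]\Big)^{\frac12}.$$
Since $\sigma_1^1, \dots, \sigma_m^n$ are independent, we have
$$\begin{aligned} \mathbb{E}_\sigma\Big\|\sum_{i=1}^m\sum_{j=1}^n \sigma_i^j X_i^j\Big\|_H^2 &= \mathbb{E}_\sigma\Big[\sum_{(i,j)=(1,1)}^{(m,n)}\sum_{(s,k)=(1,1)}^{(m,n)} \sigma_i^j\sigma_s^k\langle X_i^j, X_s^k\rangle\Big] \\ &= \mathbb{E}_\sigma\Big[\sum_{(i,j)\ne(s,k)} \sigma_i^j\sigma_s^k\langle X_i^j, X_s^k\rangle\Big] + \mathbb{E}_\sigma\Big[\sum_{(i,j)=(1,1)}^{(m,n)} (\sigma_i^j)^2\langle X_i^j, X_i^j\rangle\Big] \\ &= \mathbb{E}_\sigma\Big[\sum_{(i,j)=(1,1)}^{(m,n)} (\sigma_i^j)^2\langle X_i^j, X_i^j\rangle\Big] = \sum_{(i,j)}\|X_i^j\|_H^2. \end{aligned}$$
Thus, we have
$$\mathbb{E}_\sigma\Big[\Big\|\sum_{i=1}^m\sum_{j=1}^n \sigma_i^j X_i^j\Big\|_H\Big] \le \sqrt{\sum_{(i,j)}\|X_i^j\|_H^2}.$$
Similarly, we have
$$\mathbb{E}_\sigma\Big[\Big|\sum_{i=1}^m\sum_{j=1}^n \sigma_i^j Y_i^j\Big|\Big] \le \sqrt{\sum_{(i,j)}|Y_i^j|^2}.$$
Therefore,
$$R_{mn}(\mathcal F) \le \frac{L}{mn}\sqrt{\sum_{(i,j)}\|X_i^j\|_H^2} + \frac{1}{mn}\sqrt{\sum_{(i,j)}|Y_i^j|^2}.$$
Since $X_1^1, \dots, X_m^n$ are independent and $\|\cdot\|_2 \le 2\|\cdot\|_{\psi_1}$, we get
$$\mathbb{E}[R_{mn}(\mathcal F)] \le \max_{i\in[m]}\frac{2L}{\sqrt{mn}}\big\|\|X_i^1\|\big\|_{\psi_1} + \max_{i\in[m]}\frac{1}{\sqrt{mn}}\big\|\|Y_i^1\|\big\|_{\psi_1}.$$

PROOF OF THEOREM 7

Proof. We first define a vector space $\mathcal B = \{g : \mathcal F \to \mathbb R : \sup_{f\in\mathcal F}|g(f)| < \infty\}$. By definition, $\mathcal B$ is a normed space with norm $\|g\|_{\mathcal B} = \sup_{f\in\mathcal F}|g(f)|$. Let $f(D_i) = \mathbb{E}_{Z_i^1\sim D_i}[f(Z_i^1)]$ and $\mathbb{E}[f(D_i)] = \mathbb{E}_{D_i\sim P}[\mathbb{E}_{Z_i^1\sim D_i}[f(Z_i^1)]]$. For each $D_i$, we define $\hat Z_i$ by $(1/m)\big(f(D_i) - \mathbb{E}[f(D_i)]\big)$. Thus, $\mathbb{E}[\hat Z_i] \equiv 0$, and
$$\Big\|\sum_i \hat Z_i\Big\|_{\mathcal B} = \sup_{f\in\mathcal F}\Big|\frac1m\sum_i\big(f(D_i) - \mathbb{E}[f(D_i')]\big)\Big|.$$
From Lemma 7, we have
$$\Big\|\sum_i \hat Z_i\Big\|_{\mathcal B} - \mathbb{E}\Big\|\sum_i \hat Z_i\Big\|_{\mathcal B} \le \Big\|\sum_i \hat Z_i - \mathbb{E}\sum_i \hat Z_i\Big\|_{\mathcal B} \le 8e\sqrt m\,\big\|\|\hat Z_i\|_{\mathcal B}\big\|_{\psi_1}\sqrt{2\ln(1/\delta)}.$$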
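Observe that
$$\begin{aligned} \big\|\|\hat Z_i\|_{\mathcal B}\big\|_{\psi_1} &= \frac1m\Big\|\sup_{f\in\mathcal F}\big|\mathbb{E}[f(D_i) - f(D_i') \mid D]\big|\Big\|_{\psi_1} \le \frac1m\Big\|\sup_{f\in\mathcal F}\mathbb{E}\big[|f(D_i) - f(D_i')| \mid D\big]\Big\|_{\psi_1} \\ &\le \frac1m\Big\|\mathbb{E}\Big[\sup_{f\in\mathcal F}|f(D_i) - f(D_i')| \,\Big|\, D\Big]\Big\|_{\psi_1} \le \frac1m\Big\|\sup_{f\in\mathcal F}|f(D_i) - f(D_i')|\Big\|_{\psi_1}. \end{aligned}$$
Moreover,
$$\mathbb{E}\sup_{f\in\mathcal F}|f(D_i) - f(D_i')|^p \le \mathbb{E}\sup_{f\in\mathcal F}|f(Z_i) - f(Z_i')|^p \le \mathbb{E}\big|L\|Z_i - Z_i'\|\big|^p.$$
Therefore $\big\|\sup_{f\in\mathcal F}|f(D_i) - f(D_i')|\big\|_p \le \big\|L\|Z_i - Z_i'\|\big\|_p$ and $\big\|\sup_{f\in\mathcal F}|f(D_i) - f(D_i')|\big\|_{\psi_1} \le \big\|L\|Z_i - Z_i'\|\big\|_{\psi_1}$. Then we get $\big\|\|\hat Z_i\|_{\mathcal B}\big\|_{\psi_1} \le \frac{2L}{m}\big\|\|Z_i\|\big\|_{\psi_1}$, and
$$\Big\|\sum_i \hat Z_i\Big\|_{\mathcal B} - \mathbb{E}\Big\|\sum_i \hat Z_i\Big\|_{\mathcal B} \le 16eL\,\big\|\|Z_i\|\big\|_{\psi_1}\sqrt{\frac{2\ln(1/\delta)}{m}}.$$
By symmetrization, we have
$$\begin{aligned} \mathbb{E}\Big\|\sum_i \hat Z_i\Big\|_{\mathcal B} &\le 2\,\mathbb{E}_{D_1,\dots,D_m}\Big[\mathbb{E}_\sigma\Big[\sup_{f\in\mathcal F}\frac1m\sum_{i=1}^m \sigma_i f(D_i) \,\Big|\, D\Big]\Big] \le 2\,\mathbb{E}_{Z_1,\dots,Z_m}\Big[\mathbb{E}_\sigma\Big[\sup_{f\in\mathcal F}\frac1m\sum_{i=1}^m \sigma_i f(Z_i) \,\Big|\, Z\Big]\Big] \\ &= 2\,\mathbb{E}_{Z_1,\dots,Z_m}[R_m(\mathcal F)] \le \frac{8}{\sqrt m}\Big(L\,\big\|\|X_1\|\big\|_{\psi_1} + \big\|\|Y_1\|\big\|_{\psi_1}\Big). \end{aligned}$$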
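The Jensen step in the proof of Example 1, $\mathbb{E}_\sigma\|\sum \sigma_i^j X_i^j\|_H \le \sqrt{\sum \|X_i^j\|_H^2}$, can be verified exactly for a handful of vectors by enumerating every sign pattern. The sketch below assumes, for illustration only, that the Hilbert space is $\mathbb R^d$ with the Euclidean norm and that the pooled data points are passed as a flat list of tuples.

```python
from itertools import product
import math

def expected_norm(vectors):
    """Exact E_sigma || sum_t sigma_t x_t ||_2, enumerating all 2^n sign patterns."""
    n = len(vectors)
    dim = len(vectors[0])
    total = 0.0
    for signs in product((-1, 1), repeat=n):
        s = [sum(sg * v[k] for sg, v in zip(signs, vectors)) for k in range(dim)]
        total += math.sqrt(sum(c * c for c in s))
    return total / 2 ** n

def jensen_bound(vectors):
    """sqrt(sum_t ||x_t||^2), the right-hand side obtained after Jensen's inequality."""
    return math.sqrt(sum(c * c for v in vectors for c in v))
```

Enumeration costs $2^n$ evaluations, so this check is only practical for small inputs; with a single vector the two sides coincide.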

E PROOFS OF SMALL-BALL BASED METHOD

Definition 10. Let $\mathcal H \subset L_2(D)$ be a closed and convex class of functions and $\mathcal H - \mathcal H := \{h - h' : h, h' \in \mathcal H\}$.

1. Let $Q_{\mathcal H}(\tau) = \inf_{h\in\mathcal H}\mathbb{P}\big(|h(X_i^1)| \ge \tau\|h\|_{L_2(D)}\big)$, where $X_i^1$ represents the random sample at the $i$-th participating client.

2. Let $Q_{\mathcal H}(\tau, P) = \inf_{h\in\mathcal H}\mathbb{P}\big(|\mathbb{E}[h(X_i^1)]| \ge \tau\|h\|_{L_2(P)}\big)$, where $X_i^1$ represents the random sample at the $i$-th participating client.

Definition 11. We denote by $B_2^m$ the $L_2(D)$ unit ball centered at $\hat h^*$, that is, $B_2^m = \{h \in \mathcal H : \|h - \hat h^*\|_{L_2(D)} \le 1\}$. For every $\eta > 0$, define
$$\omega_{mn}(\mathcal H, \eta) := \inf\Big\{s > 0 : \mathbb{E}\sup_{h\in\mathcal H\cap sB_2^m}\Big|\frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n \sigma_i^j h(X_i^j)\Big| \le \eta s\Big\},$$
where the $\sigma_i^j$ are Rademacher random variables.

Definition 12. We denote by $B_2$ the $L_2(P)$ unit ball centered at $h^*$. For every $\eta > 0$, define
$$\omega_m(\mathcal H, \eta) := \inf\Big\{s > 0 : \mathbb{E}\sup_{h\in\mathcal H\cap sB_2}\Big|\frac1m\sum_{i=1}^m \sigma_i h(X_i)\Big| \le \eta s\Big\},$$
where the $\sigma_i$ are Rademacher random variables.

Lemma 8 (Theorem 1 in (Zhang & Wei, 2022)). If $\{X_i\}_{i=1}^n$ are independent centered random variables such that $\|X_i\|_{\psi_\alpha} < \infty$ for all $1 \le i \le n$ and some $0 < \alpha < 1$, then for any weight vector $w = (w_1, \dots, w_n) \in \mathbb R^n$, the following bound holds:
$$\mathbb{P}\Big(\Big|\sum_{i=1}^n w_iX_i\Big| \ge t\Big) \le 2\exp\Big\{-\Big(\frac{c_1t^2}{\sum_{i=1}^n w_i^2\|X_i\|_{\psi_\alpha}^2} \wedge \frac{c_2t^\alpha}{\max_{1\le i\le n}|w_i|^\alpha\|X_i\|_{\psi_\alpha}^\alpha}\Big)\Big\}.$$
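The small-ball quantity $Q_{\mathcal H}(\tau)$ of Definition 10 can be estimated by simulation in simple cases. The following sketch rests on assumptions chosen only for illustration: the class is $\{h_w(x) = wx\}$, and $X \sim N(0, 1)$, so that $\|h_w\|_{L_2} = |w|$ and the probability $\mathbb{P}(|h_w(X)| \ge \tau\|h_w\|_{L_2})$ is the same for every $w \ne 0$ and equals $\mathrm{erfc}(\tau/\sqrt2)$.

```python
import math
import random

def small_ball_prob(w, tau, num_samples=20000, seed=0):
    """Monte Carlo estimate of P(|h_w(X)| >= tau * ||h_w||_{L2}) for h_w(x) = w * x,
    with X ~ N(0, 1), so that ||h_w||_{L2} = |w|."""
    rng = random.Random(seed)
    norm = abs(w)
    hits = sum(1 for _ in range(num_samples)
               if abs(w * rng.gauss(0.0, 1.0)) >= tau * norm)
    return hits / num_samples

def exact_small_ball(tau):
    """For a standard Gaussian, P(|X| >= tau) = erfc(tau / sqrt(2))."""
    return math.erfc(tau / math.sqrt(2.0))
```

Because the probability does not depend on $w$ in this toy class, the infimum over the class coincides with the value for any single $w$; for heavier-tailed inputs the probability stays bounded away from zero even when higher moments do not exist, which is what makes the small-ball condition usable for unbounded losses.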

E.1 PROOF OF THEOREM 4

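Proof. Step One:

$$L_S(h) - L_S(\hat h^*) = \frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n \Big[(h(X_i^j) - Y_i^j)^2 - (\hat h^*(X_i^j) - Y_i^j)^2\Big] \tag{27}$$

$$= \frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n (h - \hat h^*)^2(X_i^j) + \frac{2}{mn}\sum_{i=1}^m\sum_{j=1}^n \xi_i^j (h - \hat h^*)(X_i^j). \tag{28}$$

The second term of the RHS of (28) is determined by the underlying semi-empirical distribution $D$ and the hypothesis space $\mathcal H$, therefore we focus on the first term in the following. For any $h \in \mathcal H$ and $u > 0$,

$$\big|\{(i,j) : |h(X_i^j)| \ge u\}\big| = \sum_{i=1}^m\sum_{j=1}^n \mathbf 1_{\{|h(X_i^j)| \ge u\}}.$$

Also,

$$\frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n \mathbf 1_{\{|h(X_i^j)| \ge u\}} = \frac{1}{m}\sum_{i=1}^m \mathbb E\,\mathbf 1_{\{|h(X_i^1)| \ge 2u\}} + \frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n \mathbf 1_{\{|h(X_i^j)| \ge u\}} - \frac{1}{m}\sum_{i=1}^m \mathbb E_{X_i^1\sim D_i}\mathbf 1_{\{|h(X_i^1)| \ge 2u\}}.$$

Let $\phi_u : \mathbb R \to [0,1]$ be the function

$$\phi_u(t) = \begin{cases} 0, & t \le u, \\ \frac{t}{u} - 1, & u \le t \le 2u, \\ 1, & t \ge 2u. \end{cases}$$

Note that for every $t \in \mathbb R$, $\phi_u(t) \ge \mathbf 1_{\{t \ge 2u\}}$ and $\phi_u(t) \le \mathbf 1_{\{t \ge u\}}$. Thus,

$$\frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n \mathbf 1_{\{|h(X_i^j)| \ge u\}} \ge \frac{1}{m}\sum_{i=1}^m \mathbb E\,\mathbf 1_{\{|h(X_i^1)| \ge 2u\}} + \frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n \phi_u(|h(X_i^j)|) - \frac{1}{m}\sum_{i=1}^m \mathbb E\,\phi_u(|h(X_i^1)|)$$

$$\ge \inf_{h\in\mathcal H}\mathbb P(|h(X_i^1)| \ge 2u) - \sup_{h\in\mathcal H}\Big(\frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n \phi_u(|h(X_i^j)|) - \frac{1}{m}\sum_{i=1}^m \mathbb E_{X\sim D_i}[\phi_u(|h(X)|)]\Big).$$

Since the function $\phi_u(t)$ is bounded by 1, using McDiarmid's inequality, we get that, for every $\delta > 0$, with probability at least $1 - 2\exp(-2\delta^2)$,

$$\sup_{h\in\mathcal H}\Big(\frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n \phi_u(|h(X_i^j)|) - \frac{1}{m}\sum_{i=1}^m \mathbb E\,\phi_u(|h(X_i^1)|)\Big) \le \mathbb E\sup_{h\in\mathcal H}\Big(\frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n \phi_u(|h(X_i^j)|) - \frac{1}{m}\sum_{i=1}^m \mathbb E\,\phi_u(|h(X_i^1)|)\Big) + \frac{\delta}{\sqrt{mn}}.$$

By the Lipschitz property of $\phi_u(|t|)$ and the symmetrization theorem, we have

$$\mathbb E\sup_{h\in\mathcal H}\Big(\frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n \phi_u(|h(X_i^j)|) - \frac{1}{m}\sum_{i=1}^m \mathbb E\,\phi_u(|h(X_i^1)|)\Big) \le \frac{4}{u}\,\mathbb E\sup_{h\in\mathcal H}\frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n \sigma_i^j h(X_i^j).$$

Therefore, for every $h \in \mathcal H$, it follows that with probability at least $1 - 2\exp(-2\delta^2)$, we have

$$\frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n \mathbf 1_{\{|h(X_i^j)| \ge u\}} \ge \inf_{h\in\mathcal H}\mathbb P(|h(X_i^1)| \ge 2u) - \frac{4}{u}\,\mathbb E\sup_{h\in\mathcal H}\frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n \sigma_i^j h(X_i^j) - \frac{\delta}{\sqrt{mn}}. \tag{30}$$

Step Two: The first term on the RHS can be bounded by the small-ball condition. Let $\mathcal H^* = \mathcal H - \hat h^*$. We first prove that $\mathcal H^*$ is star-shaped around 0. For every $h - \hat h^* \in \mathcal H^*$ and $0 \le \lambda \le 1$, we have $\lambda(h - \hat h^*) = \lambda h + (1 - \lambda)\hat h^* - \hat h^*$.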
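Since $\mathcal H$ is convex, it follows that $\lambda h + (1 - \lambda)\hat h^* \in \mathcal H$. Then the claim follows because $\lambda(h - \hat h^*) \in \mathcal H - \hat h^*$. Assume that there exists $\tau > 0$ for which $Q_{\mathcal H^*}(2\tau) > 0$. Then for every $s \ge \omega(\mathcal H^*, \tau Q_{\mathcal H^*}(2\tau)/16)$, we have

$$\mathbb E\sup_{h\in\mathcal H^*\cap sB_2^m}\frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n \sigma_i^j h(X_i^j) \le \frac{\tau Q_{\mathcal H^*}(2\tau)}{16}\,s.$$

Let $\mathcal G$ be a function class associated with $\mathcal H^*$,

$$\mathcal G = \Big\{\frac{h}{s} : h \in \mathcal H^*\cap sB_2^m\Big\} \subset B_2^m,$$

where $B_2^m$ is the unit ball with respect to $L_2(D)$ and $D$ is the semi-empirical distribution. Then

$$\mathbb E\sup_{g\in\mathcal G}\frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n \sigma_i^j g(X_i^j) = \mathbb E\sup_{h\in\mathcal H^*\cap sB_2^m}\frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n \sigma_i^j \frac{h(X_i^j)}{s} \le \frac{\tau Q_{\mathcal H^*}(2\tau)}{16} \le \frac{\tau Q_{\mathcal G}(2\tau)}{16},$$

where the last inequality follows from $Q_{\mathcal G}(2\tau) \ge Q_{\mathcal H^*}(2\tau)$. By equation (30) applied to the function class $\mathcal G$, it follows that with probability at least $1 - 2\exp(-2\delta^2)$,

$$\frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n \mathbf 1_{\{|g(X_i^j)| \ge u\}} \ge \inf_{g\in\mathcal G}\mathbb P(|g(X_i^1)| \ge 2u) - \frac{4}{u}\,\mathbb E\sup_{g\in\mathcal G}\frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n \sigma_i^j g(X_i^j) - \frac{\delta}{\sqrt{mn}}$$

$$\ge Q_{\mathcal G}(2u) - \frac{4}{u}\,\mathbb E\sup_{g\in\mathcal G}\frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n \sigma_i^j g(X_i^j) - \frac{\delta}{\sqrt{mn}} \ge Q_{\mathcal G}(2u) - \frac{4}{u}\cdot\frac{\tau Q_{\mathcal G}(2\tau)}{16} - \frac{\delta}{\sqrt{mn}}.$$

Now, setting $u = \tau$ and $\delta = \sqrt{mn}\,Q_{\mathcal G}(2\tau)/2$, so that $2\delta^2 = mn\,Q_{\mathcal G}^2(2\tau)/2$, it follows that with probability at least $1 - 2\exp(-mn\,Q_{\mathcal G}^2(2\tau)/2)$,

$$\frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n \mathbf 1_{\{|g(X_i^j)| \ge \tau\}} \ge Q_{\mathcal G}(2\tau) - \frac{Q_{\mathcal G}(2\tau)}{4} - \frac{Q_{\mathcal G}(2\tau)}{2} = \frac{Q_{\mathcal G}(2\tau)}{4}.$$

Using the condition $Q_{\mathcal G}(2\tau) \ge Q_{\mathcal H^*}(2\tau)$, for every $s \ge \omega(\mathcal H^*, \tau Q_{\mathcal H^*}(2\tau)/16)$, it follows that with probability at least $1 - 2\exp(-mn\,Q_{\mathcal H^*}^2(2\tau)/2)$, we get

$$\inf_{g\in\mathcal G}\big|\{(i,j) : |g(X_i^j)| \ge \tau\}\big| = \inf_{g\in\mathcal G}\sum_{i=1}^m\sum_{j=1}^n \mathbf 1_{\{|g(X_i^j)| \ge \tau\}} \ge \frac{mn\,Q_{\mathcal G}(2\tau)}{4} \ge \frac{mn\,Q_{\mathcal H^*}(2\tau)}{4}. \tag{31}$$

For every $h - \hat h^* \in \mathcal H^*$ that satisfies $\|h - \hat h^*\|_{L_2(D)} \ge s$, since $\mathcal H - \hat h^*$ is star-shaped around 0, we have $\big(s/\|h - \hat h^*\|_{L_2(D)}\big)(h - \hat h^*) \in \mathcal H^*\cap sB_2^m$. Thus, $(h - \hat h^*)/\|h - \hat h^*\|_{L_2(D)} \in \mathcal G$.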
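Combining with equation (31), if $s \ge \omega(\mathcal H^*, \tau Q_{\mathcal H^*}(2\tau)/16)$, then for every $h \in \mathcal H$ that satisfies $\|h - \hat h^*\|_{L_2(D)} \ge s$, it follows that with probability at least $1 - 2\exp(-mn\,Q_{\mathcal H^*}^2(2\tau)/2)$, one has

$$\big|\{(i,j) : |(h - \hat h^*)(X_i^j)| > \tau\|h - \hat h^*\|_{L_2(D)}\}\big| \ge \frac{mn\,Q_{\mathcal H^*}(2\tau)}{4}.$$

Therefore, on that event, we get

$$\frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n (h - \hat h^*)^2(X_i^j) > \frac{\tau^2}{4}Q_{\mathcal H^*}(2\tau)\|h - \hat h^*\|_{L_2(D)}^2.$$

Note that $\mathcal H^* \subset \mathcal H - \mathcal H$, so the same conclusion holds with $Q_{\mathcal H-\mathcal H}(2\tau)$ replacing the larger $Q_{\mathcal H^*}(2\tau)$. That is, if $s \ge \omega(\mathcal H^*, \tau Q_{\mathcal H-\mathcal H}(2\tau)/16)$, then for every $h \in \mathcal H$ that satisfies $\|h - \hat h^*\|_{L_2(D)} \ge s$, it follows that with probability at least $1 - 2\exp(-mn\,Q_{\mathcal H-\mathcal H}^2(2\tau)/2)$, one has

$$\frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n (h - \hat h^*)^2(X_i^j) > \frac{\tau^2}{4}Q_{\mathcal H-\mathcal H}(2\tau)\|h - \hat h^*\|_{L_2(D)}^2.$$

Step Three: Combining equation (2), with the same conditions we get

$$L_S(h) - L_S(\hat h^*) \ge \frac{2}{mn}\sum_{i=1}^m\sum_{j=1}^n \xi_i^j(h - \hat h^*)(X_i^j) + \frac{\tau^2}{4}Q_{\mathcal H-\mathcal H}(2\tau)\|h - \hat h^*\|_{L_2(D)}^2.$$

Since $\frac{1}{m}\sum_{i=1}^m \mathbb E\big[\xi(h - \hat h^*)(X_i^1)\big] \ge 0$, we have

$$\frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n \xi_i^j(h - \hat h^*)(X_i^j) \ge \frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n \xi_i^j(h - \hat h^*)(X_i^j) - \frac{1}{m}\sum_{i=1}^m \mathbb E_{X\sim D_i}\big[\xi(h - \hat h^*)(X)\big].$$

According to Lemma 7, we have

$$\mathbb P\Big(\frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n \xi_i^j(h - \hat h^*)(X_i^j) - \frac{1}{m}\sum_{i=1}^m \mathbb E_{X\sim D_i}\big[\xi(h - \hat h^*)(X)\big] \ge \eta\|h - \hat h^*\|_{L_2(D)}^2\Big)$$

$$\le 2\exp\Big\{-\Big(\frac{c_1\eta^2(mn)\|h - \hat h^*\|_{L_2(D)}^4}{\frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n \|V_i^j\|_{\psi_\alpha}^2} \wedge \frac{c_2\eta^\alpha(mn)^\alpha\|h - \hat h^*\|_{L_2(D)}^{2\alpha}}{\max_{(1,1)\le(i,j)\le(m,n)}\|V_i^j\|_{\psi_\alpha}^\alpha}\Big)\Big\},$$

where $V_i^j = \xi_i^j(h - \hat h^*)(X_i^j) - \mathbb E\big[\xi_i^j(h - \hat h^*)(X_i^j)\big]$. We denote $C_1, C_2$ by

$$C_1 = \frac{\frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n \|V_i^j\|_{\psi_\alpha}^2}{c_1}, \qquad C_2 = \frac{\max_{(1,1)\le(i,j)\le(m,n)}\|V_i^j\|_{\psi_\alpha}^\alpha}{c_2}.$$

Then we have

$$\mathbb P\Big(\frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n \xi_i^j(h - \hat h^*)(X_i^j) - \frac{1}{m}\sum_{i=1}^m \mathbb E_{X\sim D_i}\big[\xi(h - \hat h^*)(X)\big] \ge \eta\|h - \hat h^*\|_{L_2(D)}^2\Big) \le 2\exp\Big\{-\Big(\frac{\eta^2(mn)\|h - \hat h^*\|_{L_2(D)}^4}{C_1} \wedge \frac{\eta^\alpha(mn)^\alpha\|h - \hat h^*\|_{L_2(D)}^{2\alpha}}{C_2}\Big)\Big\}.$$

To make sure that the probability tends to zero as $mn$ increases, we could choose $\|h - \hat h^*\|_{L_2(D)} \ge \kappa = (mn)^{-\frac14+\iota}$, where $0 < \iota < \frac14$.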
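Then with probability at least $1 - \delta$, where

$$\delta = 2\exp\Big\{-\Big(\frac{\eta^2(mn)^{4\iota}}{C_1} \wedge \frac{\eta^\alpha(mn)^{\frac{\alpha(1+4\iota)}{2}}}{C_2}\Big)\Big\},$$

we have

$$\frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n \xi_i^j(h - \hat h^*)(X_i^j) - \frac{1}{m}\sum_{i=1}^m \mathbb E_{X\sim D_i}\big[\xi(h - \hat h^*)(X)\big] \le \eta\|h - \hat h^*\|_{L_2(D)}^2.$$

Combining (2) and (34), if $\|h - \hat h^*\|_{L_2(D)} \ge \max(\kappa, s)$, it follows that with probability at least $1 - \delta - 2\exp(-mn\,Q_{\mathcal H^*}^2(2\tau)/2)$,

$$L_S(h) - L_S(\hat h^*) \ge \|h - \hat h^*\|_{L_2(D)}^2\Big(\frac{\tau^2}{4}Q_{\mathcal H-\mathcal H}(2\tau) - 4\eta\Big).$$

Taking $\eta < \tau^2 Q_{\mathcal H-\mathcal H}(2\tau)/16$, we get $L_S(h) - L_S(\hat h^*) \ge 0$. On the same event, we have

$$\|\hat h - \hat h^*\|_{L_2(D)} \le \max(\kappa, s) = \max\big((mn)^{-\frac14+\iota},\, \omega(\mathcal H - \mathcal H, \tau Q_{\mathcal H-\mathcal H}(2\tau)/16)\big).$$

PROOF OF COROLLARY 1

Proof. Let $s = \omega(\mathcal H - \mathcal H, \tau Q_{\mathcal H-\mathcal H}(2\tau)/16)$ and $\kappa = (mn)^{-\frac14+\iota}$, where $0 < \iota < \frac14$. By Theorem 4, we have $\|\hat h - \hat h^*\|_{L_2(D)} \le \max(\kappa, s)$. For the quadratic loss we have

$$L_D(h) - L_D(\hat h^*) = \frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n \mathbb E\Big[(h(X_i^j) - Y_i^j)^2 - (\hat h^*(X_i^j) - Y_i^j)^2\Big] \tag{35}$$

$$= \frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n \mathbb E\big[(h - \hat h^*)^2(X_i^j)\big] + \frac{2}{mn}\sum_{i=1}^m\sum_{j=1}^n \mathbb E\big[\xi_i^j(h - \hat h^*)(X_i^j)\big].$$

Note that for $h \in \mathcal H$ either

$$L_D(h) - L_D(\hat h^*) \le \frac{2}{mn}\sum_{i=1}^m\sum_{j=1}^n \mathbb E\big[(h - \hat h^*)^2(X_i^j)\big],$$

or

$$L_D(h) - L_D(\hat h^*) \le \frac{4}{mn}\sum_{i=1}^m\sum_{j=1}^n \mathbb E\big[\xi_i^j(h - \hat h^*)(X_i^j)\big].$$

For the first case, we have $L_D(\hat h) - L_D(\hat h^*) \le 2\|\hat h - \hat h^*\|_{L_2(D)}^2 \le 2\max(\kappa^2, s^2)$. For the second case,

$$L_S(h) - L_S(\hat h^*) = \frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n \Big[(h(X_i^j) - Y_i^j)^2 - (\hat h^*(X_i^j) - Y_i^j)^2\Big] = \frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n (h - \hat h^*)^2(X_i^j) + \frac{2}{mn}\sum_{i=1}^m\sum_{j=1}^n \xi_i^j(h - \hat h^*)(X_i^j)$$

$$\ge \frac{2}{mn}\sum_{i=1}^m\sum_{j=1}^n \mathbb E\big[\xi_i^j(h - \hat h^*)(X_i^j)\big] + \Big(\frac{2}{mn}\sum_{i=1}^m\sum_{j=1}^n \xi_i^j(h - \hat h^*)(X_i^j) - \frac{2}{mn}\sum_{i=1}^m\sum_{j=1}^n \mathbb E\big[\xi_i^j(h - \hat h^*)(X_i^j)\big]\Big).$$

For a convex function class $\mathcal H$, we have $\mathbb E[\xi_i^j(h - \hat h^*)(X_i^j)] \ge 0$.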
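On the same condition that inequality (34) holds, taking $h = \hat h$ we get

$$L_S(\hat h) - L_S(\hat h^*) \ge \frac{2}{mn}\sum_{i=1}^m\sum_{j=1}^n \mathbb E\big[\xi_i^j(\hat h - \hat h^*)(X_i^j)\big] - \Big|\frac{2}{mn}\sum_{i=1}^m\sum_{j=1}^n \xi_i^j(\hat h - \hat h^*)(X_i^j) - \frac{2}{mn}\sum_{i=1}^m\sum_{j=1}^n \mathbb E\big[\xi_i^j(\hat h - \hat h^*)(X_i^j)\big]\Big|$$

$$\ge \frac12\big(L_D(\hat h) - L_D(\hat h^*)\big) - \frac{\tau^2 Q_{\mathcal H-\mathcal H}(2\tau)}{8}\|\hat h - \hat h^*\|_{L_2(D)}^2 \ge \frac12\big(L_D(\hat h) - L_D(\hat h^*)\big) - \frac{\tau^2 Q_{\mathcal H-\mathcal H}(2\tau)}{8}\max(\kappa^2, s^2).$$

Since $L_S(\hat h) - L_S(\hat h^*) \le 0$, it follows that

$$L_D(\hat h) - L_D(\hat h^*) \le \frac{\tau^2 Q_{\mathcal H-\mathcal H}(2\tau)}{4}\max(\kappa^2, s^2).$$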
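The ramp function $\phi_u$ from Step One of the proof of Theorem 4 is used for two properties: it is sandwiched between the indicators $\mathbf 1_{\{t \ge 2u\}}$ and $\mathbf 1_{\{t \ge u\}}$, and it is Lipschitz with constant $1/u$, which is what allows the contraction step after symmetrization. The sketch below is only a numerical sanity check of these two properties on a grid; the function names are hypothetical and not from the paper.

```python
def phi(u, t):
    """Ramp function phi_u(t): 0 for t <= u, linear on [u, 2u], 1 for t >= 2u."""
    if t <= u:
        return 0.0
    if t >= 2.0 * u:
        return 1.0
    return t / u - 1.0


def indicator(condition):
    """Return 1.0 if the condition holds and 0.0 otherwise."""
    return 1.0 if condition else 0.0


def check_ramp_properties(u, grid):
    """Check the indicator sandwich and the 1/u Lipschitz bound on a grid of points."""
    sandwich_ok = all(
        indicator(t >= 2.0 * u) <= phi(u, t) <= indicator(t >= u) for t in grid
    )
    lipschitz_ok = all(
        abs(phi(u, a) - phi(u, b)) <= abs(a - b) / u + 1e-12
        for a in grid
        for b in grid
    )
    return sandwich_ok, lipschitz_ok
```

For instance, `check_ramp_properties(0.5, [k / 20.0 for k in range(-20, 61)])` evaluates both properties for $u = 0.5$ on a grid covering $[-1, 3]$.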

E.2 PROOF OF THEOREM 5

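Proof. For quadratic loss, we have

$$L_D(h) - L_D(h^*) = \frac{1}{m}\sum_{i=1}^m \mathbb E\Big[(h(X_i^1) - Y_i^1)^2 - (h^*(X_i^1) - Y_i^1)^2\Big] \tag{37}$$

$$= \frac{1}{m}\sum_{i=1}^m \mathbb E\big[(h - h^*)^2(X_i^1)\big] + \frac{2}{m}\sum_{i=1}^m \mathbb E\big[\xi_i^1(h - h^*)(X_i^1)\big].$$

Step One: For any $h \in \mathcal H$ and $u > 0$,

$$\big|\{i : |\mathbb E[h(X_i^1)]| \ge u\}\big| = \sum_{i=1}^m \mathbf 1_{\{|\mathbb E[h(X_i^1)]| \ge u\}}.$$

Also,

$$\frac{1}{m}\sum_{i=1}^m \mathbf 1_{\{|\mathbb E[h(X_i^1)]| \ge u\}} = \mathbb E_{D_i\sim P}\mathbf 1_{\{|\mathbb E[h(X_i^1)]| \ge 2u\}} + \frac{1}{m}\sum_{i=1}^m \mathbf 1_{\{|\mathbb E[h(X_i^1)]| \ge u\}} - \mathbb E_{D_i\sim P}\mathbf 1_{\{|\mathbb E[h(X_i^1)]| \ge 2u\}}.$$

Let $\phi_u : \mathbb R \to [0,1]$ be the function

$$\phi_u(t) = \begin{cases} 0, & t \le u, \\ \frac{t}{u} - 1, & u \le t \le 2u, \\ 1, & t \ge 2u. \end{cases}$$

Note that for every $t \in \mathbb R$, $\phi_u(t) \ge \mathbf 1_{\{t \ge 2u\}}$ and $\phi_u(t) \le \mathbf 1_{\{t \ge u\}}$. Thus,

$$\frac{1}{m}\sum_{i=1}^m \mathbf 1_{\{|\mathbb E[h(X_i^1)]| \ge u\}} \ge \mathbb E_{D_i\sim P}\mathbf 1_{\{|\mathbb E[h(X_i^1)]| \ge 2u\}} + \frac{1}{m}\sum_{i=1}^m \phi_u(|\mathbb E[h(X_i^1)]|) - \mathbb E_{D_i\sim P}\phi_u(|\mathbb E[h(X_i^1)]|)$$

$$\ge \inf_{h\in\mathcal H}\mathbb P(|\mathbb E[h(X_i^1)]| \ge 2u) - \sup_{h\in\mathcal H}\Big(\frac{1}{m}\sum_{i=1}^m \phi_u(|\mathbb E[h(X_i^1)]|) - \mathbb E_{D_i\sim P}\phi_u(|\mathbb E[h(X_i^1)]|)\Big).$$

Since the function $\phi_u(t)$ is bounded by 1, using McDiarmid's inequality, we get that, for every $\delta > 0$, with probability at least $1 - 2\exp(-2\delta^2)$,

$$\sup_{h\in\mathcal H}\Big(\frac{1}{m}\sum_{i=1}^m \phi_u(|\mathbb E[h(X_i^1)]|) - \mathbb E_{D_i\sim P}\phi_u(|\mathbb E[h(X_i^1)]|)\Big) \le \mathbb E\sup_{h\in\mathcal H}\Big(\frac{1}{m}\sum_{i=1}^m \phi_u(|\mathbb E[h(X_i^1)]|) - \mathbb E_{D_i\sim P}\phi_u(|\mathbb E[h(X_i^1)]|)\Big) + \frac{\delta}{\sqrt m}.$$

By the Lipschitz property of $\phi_u(|t|)$ and the symmetrization theorem, we have

$$\mathbb E\sup_{h\in\mathcal H}\Big(\frac{1}{m}\sum_{i=1}^m \phi_u(|\mathbb E[h(X_i^1)]|) - \mathbb E_{D_i\sim P}\phi_u(|\mathbb E[h(X_i^1)]|)\Big) \le \frac{4}{u}\,\mathbb E\sup_{h\in\mathcal H}\frac{1}{m}\sum_{i=1}^m \sigma_i h(X_i),$$

where $X_i$ is the random sample drawn through the two-level distribution framework. Therefore, for every $h \in \mathcal H$, it follows that with probability at least $1 - 2\exp(-2\delta^2)$, we have

$$\frac{1}{m}\sum_{i=1}^m \mathbf 1_{\{|\mathbb E[h(X_i^1)]| \ge u\}} \ge \inf_{h\in\mathcal H}\mathbb P(|\mathbb E[h(X_i^1)]| \ge 2u) - \frac{4}{u}\,\mathbb E\sup_{h\in\mathcal H}\frac{1}{m}\sum_{i=1}^m \sigma_i h(X_i) - \frac{\delta}{\sqrt m}. \tag{40}$$

The first term on the RHS can be bounded by the small-ball condition. Let $\mathcal H^* = \mathcal H - h^*$; we have proved that $\mathcal H^*$ is star-shaped around 0. Assume that there exists $\tau > 0$ for which $Q_{\mathcal H^*}(2\tau, P) > 0$. Then for every $s \ge \omega_m(\mathcal H^*, \tau Q_{\mathcal H^*}(2\tau, P)/16)$, we have

$$\mathbb E\sup_{h\in\mathcal H^*\cap sB_m}\frac{1}{m}\sum_{i=1}^m \sigma_i h(X_i) \le \frac{\tau Q_{\mathcal H^*}(2\tau, P)}{16}\,s.$$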
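Let $\mathcal G$ be a function class associated with $\mathcal H^*$,

$$\mathcal G = \Big\{\frac{h}{s} : h \in \mathcal H^*\cap sB_m\Big\} \subset B_m,$$

where $B_m$ is the unit ball with respect to $L_2(P)$ and $P$ is the population distribution. Then

$$\mathbb E\sup_{g\in\mathcal G}\frac{1}{m}\sum_{i=1}^m \sigma_i g(X_i) = \mathbb E\sup_{h\in\mathcal H^*\cap sB_m}\frac{1}{m}\sum_{i=1}^m \sigma_i\frac{h(X_i)}{s} \le \frac{\tau Q_{\mathcal H^*}(2\tau, P)}{16} \le \frac{\tau Q_{\mathcal G}(2\tau, P)}{16},$$

where the last inequality follows from $Q_{\mathcal G}(2\tau, P) \ge Q_{\mathcal H^*}(2\tau, P)$. By applying inequality (40) to the function class $\mathcal G$, it follows that with probability at least $1 - 2\exp(-2\delta^2)$,

$$\frac{1}{m}\sum_{i=1}^m \mathbf 1_{\{|\mathbb E[g(X_i^1)]| \ge u\}} \ge \inf_{g\in\mathcal G}\mathbb P(|\mathbb E[g(X_i^1)]| \ge 2u) - \frac{4}{u}\,\mathbb E\sup_{g\in\mathcal G}\frac{1}{m}\sum_{i=1}^m \sigma_i g(X_i) - \frac{\delta}{\sqrt m}$$

$$\ge Q_{\mathcal G}(2u, P) - \frac{4}{u}\,\mathbb E\sup_{g\in\mathcal G}\frac{1}{m}\sum_{i=1}^m \sigma_i g(X_i) - \frac{\delta}{\sqrt m} \ge Q_{\mathcal G}(2u, P) - \frac{4}{u}\cdot\frac{\tau Q_{\mathcal G}(2\tau, P)}{16} - \frac{\delta}{\sqrt m}.$$

Now, setting $u = \tau$ and $\delta = \sqrt m\,Q_{\mathcal G}(2\tau, P)/2$, so that $2\delta^2 = m\,Q_{\mathcal G}^2(2\tau, P)/2$, it follows that with probability at least $1 - 2\exp(-m\,Q_{\mathcal G}^2(2\tau, P)/2)$,

$$\frac{1}{m}\sum_{i=1}^m \mathbf 1_{\{|\mathbb E[g(X_i^1)]| \ge \tau\}} \ge Q_{\mathcal G}(2\tau, P) - \frac{Q_{\mathcal G}(2\tau, P)}{4} - \frac{Q_{\mathcal G}(2\tau, P)}{2} = \frac{Q_{\mathcal G}(2\tau, P)}{4}.$$

Using the condition $Q_{\mathcal G}(2\tau, P) \ge Q_{\mathcal H^*}(2\tau, P)$, for every $s \ge \omega_m(\mathcal H^*, \tau Q_{\mathcal H^*}(2\tau, P)/16)$, it follows that with probability at least $1 - 2\exp(-m\,Q_{\mathcal H^*}^2(2\tau, P)/2)$, we get

$$\inf_{g\in\mathcal G}\big|\{i : |\mathbb E[g(X_i^1)]| \ge \tau\}\big| = \inf_{g\in\mathcal G}\sum_{i=1}^m \mathbf 1_{\{|\mathbb E[g(X_i^1)]| \ge \tau\}} \ge \frac{m\,Q_{\mathcal G}(2\tau, P)}{4} \ge \frac{m\,Q_{\mathcal H^*}(2\tau, P)}{4}. \tag{41}$$

For every $h - h^* \in \mathcal H^*$ that satisfies $\|h - h^*\|_{L_2(P)} \ge s$, since $\mathcal H - h^*$ is star-shaped around 0, we have $\big(s/\|h - h^*\|_{L_2(P)}\big)(h - h^*) \in \mathcal H^*\cap sB_m$. Thus, $(h - h^*)/\|h - h^*\|_{L_2(P)} \in \mathcal G$. Combining with equation (41), if $s \ge \omega_m(\mathcal H^*, \tau Q_{\mathcal H^*}(2\tau, P)/16)$, then for every $h \in \mathcal H$ that satisfies $\|h - h^*\|_{L_2(P)} \ge s$, it follows that with probability at least $1 - 2\exp(-m\,Q_{\mathcal H^*}^2(2\tau, P)/2)$, one has

$$\big|\{i : |\mathbb E[(h - h^*)(X_i^1)]| > \tau\|h - h^*\|_{L_2(P)}\}\big| \ge \frac{m\,Q_{\mathcal H^*}(2\tau, P)}{4}.$$

Therefore, on that event, we get

$$\frac{1}{m}\sum_{i=1}^m \mathbb E\big[(h - h^*)^2(X_i^1)\big] \ge \frac{1}{m}\sum_{i=1}^m \big(\mathbb E[(h - h^*)(X_i^1)]\big)^2 \ge \frac{\tau^2}{4}Q_{\mathcal H^*}(2\tau, P)\|h - h^*\|_{L_2(P)}^2.$$

With the same conditions we get

$$L_D(h) - L_D(h^*) \ge \frac{2}{m}\sum_{i=1}^m \mathbb E\big[\xi_i^1(h - h^*)(X_i^1)\big] + \frac{\tau^2}{4}Q_{\mathcal H-\mathcal H}(2\tau, P)\|h - h^*\|_{L_2(P)}^2.$$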
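Since $\mathbb E_{D_i\sim P}\mathbb E_{(X_i^1,Y_i^1)\sim D_i}\big[\xi_i^1(h - h^*)(X_i^1)\big] \ge 0$, we have

$$\frac{1}{m}\sum_{i=1}^m \mathbb E_{(X_i^1,Y_i^1)\sim D_i}\big[\xi_i^1(h - h^*)(X_i^1)\big] \ge \frac{1}{m}\sum_{i=1}^m \mathbb E_{(X_i^1,Y_i^1)\sim D_i}\big[\xi_i^1(h - h^*)(X_i^1)\big] - \mathbb E_{D_i\sim P}\mathbb E_{(X_i,Y_i)\sim D_i}\big[\xi_i^1(h - h^*)(X_i^1)\big].$$

According to Lemma 7, we have

$$\mathbb P\Big(\frac{1}{m}\sum_{i=1}^m \mathbb E_{(X_i^1,Y_i^1)\sim D_i}\big[\xi_i^1(h - h^*)(X_i^1)\big] - \mathbb E_{D_i\sim P}\mathbb E_{(X_i,Y_i)\sim D_i}\big[\xi(h - h^*)(X_i)\big] \ge \eta\|h - h^*\|_{L_2(P)}^2\Big)$$

$$\le 2\exp\Big\{-\Big(\frac{c_1\eta^2 m\|h - h^*\|_{L_2(P)}^4}{\frac{1}{m}\sum_{i=1}^m \|V_i\|_{\psi_\alpha}^2} \wedge \frac{c_2\eta^\alpha m^\alpha\|h - h^*\|_{L_2(P)}^{2\alpha}}{\max_{1\le i\le m}\|V_i\|_{\psi_\alpha}^\alpha}\Big)\Big\},$$

where $V_i = \mathbb E_{(X_i^1,Y_i^1)\sim D_i}\big[\xi_i^1(h - h^*)(X_i^1)\big] - \mathbb E_{D_i\sim P}\mathbb E_{(X_i,Y_i)\sim D_i}\big[\xi(h - h^*)(X_i)\big]$. We denote $C_1, C_2$ by

$$C_1 = \frac{\frac{1}{m}\sum_{i=1}^m \|V_i\|_{\psi_\alpha}^2}{c_1}, \qquad C_2 = \frac{\max_{1\le i\le m}\|V_i\|_{\psi_\alpha}^\alpha}{c_2}.$$

Then we have

$$\mathbb P\Big(\frac{1}{m}\sum_{i=1}^m \mathbb E_{(X_i^1,Y_i^1)\sim D_i}\big[\xi_i^1(h - h^*)(X_i^1)\big] - \mathbb E_{D_i\sim P}\mathbb E_{(X_i,Y_i)\sim D_i}\big[\xi(h - h^*)(X_i)\big] \ge \eta\|h - h^*\|_{L_2(P)}^2\Big) \le 2\exp\Big\{-\Big(\frac{\eta^2 m\|h - h^*\|_{L_2(P)}^4}{C_1} \wedge \frac{\eta^\alpha m^\alpha\|h - h^*\|_{L_2(P)}^{2\alpha}}{C_2}\Big)\Big\}.$$

To make sure that the probability tends to zero as $m$ increases, we could choose $\|h - h^*\|_{L_2(P)} \ge \kappa = m^{-\frac14+\iota}$, where $0 < \iota < \frac14$. If $\|h - h^*\|_{L_2(P)} \ge \max(\kappa, s)$, it follows that with probability at least $1 - \delta_m - 2\exp(-m\,Q_{\mathcal H^*}^2(2\tau, P)/2)$,

$$L_D(h) - L_D(h^*) \ge \|h - h^*\|_{L_2(P)}^2\Big(\frac{\tau^2}{4}Q_{\mathcal H-\mathcal H}(2\tau, P) - 4\eta\Big).$$

PROOF OF COROLLARY 2

Proof. Since the excess risk is defined across our two-level framework, the steps in the proof of Corollary 1 cannot be applied directly to derive Corollary 2. The key step to derive Corollary 2 is to bound $\|\hat h - h^*\|_{L_2(P)}^2$. First, this term can be decomposed as

$$\|\hat h - h^*\|_{L_2(P)}^2 \le 2\|\hat h - \hat h^*\|_{L_2(P)}^2 + 2\|\hat h^* - h^*\|_{L_2(P)}^2.$$

Note that $\|\hat h^* - h^*\|_{L_2(P)}^2$ has been bounded by Theorem 5. To bound $\|\hat h - \hat h^*\|_{L_2(P)}^2$, we use the following decomposition:

$$\|\hat h - \hat h^*\|_{L_2(P)}^2 = \|\hat h - \hat h^*\|_{L_2(P)}^2 - \|\hat h - \hat h^*\|_{L_2(D)}^2 + \|\hat h - \hat h^*\|_{L_2(D)}^2.$$

Note that $\|\hat h - \hat h^*\|_{L_2(D)}^2$ has been bounded by Theorem 4.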
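According to Lemma 7, we have

$$\mathbb P\Big(\frac{1}{m}\sum_{i=1}^m \mathbb E_{(X_i,Y_i)\sim D_i}\big[(h - \hat h^*)^2(X_i)\big] - \mathbb E_{D_i\sim P}\mathbb E_{(X_i,Y_i)\sim D_i}\big[(h - \hat h^*)^2(X_i)\big] \ge \eta\|h - \hat h^*\|_{L_2(P)}^2\Big)$$

$$\le 2\exp\Big\{-\Big(\frac{c_1\eta^2 m\|h - \hat h^*\|_{L_2(P)}^4}{\frac{1}{m}\sum_{i=1}^m \|V_i\|_{\psi_\alpha}^2} \wedge \frac{c_2\eta^\alpha m^\alpha\|h - \hat h^*\|_{L_2(P)}^{2\alpha}}{\max_{1\le i\le m}\|V_i\|_{\psi_\alpha}^\alpha}\Big)\Big\},$$

where $V_i = \mathbb E_{(X_i,Y_i)\sim D_i}\big[(h - \hat h^*)^2(X_i)\big] - \mathbb E_{D_i\sim P}\mathbb E_{(X_i,Y_i)\sim D_i}\big[(h - \hat h^*)^2(X_i)\big]$. We denote $C_1, C_2$ by

$$C_1 = \frac{\frac{1}{m}\sum_{i=1}^m \|V_i\|_{\psi_\alpha}^2}{c_1}, \qquad C_2 = \frac{\max_{1\le i\le m}\|V_i\|_{\psi_\alpha}^\alpha}{c_2}.$$

Then we have

$$\mathbb P\Big(\frac{1}{m}\sum_{i=1}^m \mathbb E_{(X_i,Y_i)\sim D_i}\big[(h - \hat h^*)^2(X_i)\big] - \mathbb E_{D_i\sim P}\mathbb E_{(X_i,Y_i)\sim D_i}\big[(h - \hat h^*)^2(X_i)\big] \ge \eta\|h - \hat h^*\|_{L_2(P)}^2\Big) \le 2\exp\Big\{-\Big(\frac{\eta^2 m\|h - \hat h^*\|_{L_2(P)}^4}{C_1} \wedge \frac{\eta^\alpha m^\alpha\|h - \hat h^*\|_{L_2(P)}^{2\alpha}}{C_2}\Big)\Big\}.$$

To make sure that the probability tends to zero as $m$ increases, we could choose $\|h - \hat h^*\|_{L_2(P)} \ge m^{-\frac14+\iota}$, where $0 < \iota < \frac14$. By following steps similar to the proof of Theorem 5, it can be proved that with probability at least

$$1 - \exp\Big\{-\Big(\frac{c_1\eta^2 m^{4\iota}}{\frac{1}{m}\sum_{i=1}^m \|V_i\|_{\psi_\alpha}^2} \wedge \frac{c_2\eta^\alpha m^{\frac{\alpha(1+4\iota)}{2}}}{\max_{1\le i\le m}\|V_i\|_{\psi_\alpha}^\alpha}\Big)\Big\},$$

one has

$$\|\hat h - \hat h^*\|_{L_2(P)}^2 - \|\hat h - \hat h^*\|_{L_2(D)}^2 \le \eta\|\hat h - \hat h^*\|_{L_2(P)}^2.$$

Thus,

$$\|\hat h - \hat h^*\|_{L_2(P)}^2 \le \frac{1}{1-\eta}\|\hat h - \hat h^*\|_{L_2(D)}^2.$$

Moreover,

$$\|\hat h - h^*\|_{L_2(P)}^2 \le \frac{2}{1-\eta}\|\hat h - \hat h^*\|_{L_2(D)}^2 + 2\|\hat h^* - h^*\|_{L_2(P)}^2.$$

If $\xi_i = h^*(X_i) - Y_i$ is independent of $X_i$, then $\mathbb E[\xi_i(h - h^*)(X_i)] = 0$, and

$$L_P(h) - L_P(h^*) = \frac{1}{m}\sum_{i=1}^m \mathbb E\Big[(h(X_i) - Y_i)^2 - (h^*(X_i) - Y_i)^2\Big] \tag{45}$$

$$= \frac{1}{m}\sum_{i=1}^m \mathbb E\big[(h - h^*)^2(X_i)\big] + \frac{2}{m}\sum_{i=1}^m \mathbb E\big[\xi_i(h - h^*)(X_i)\big] \tag{46}$$

$$= \frac{1}{m}\sum_{i=1}^m \mathbb E\big[(h - h^*)^2(X_i)\big].$$

Thus,

$$L_P(\hat h) - L_P(h^*) = \frac{1}{m}\sum_{i=1}^m \mathbb E\big[(\hat h - h^*)^2(X_i)\big] \tag{48}$$

$$= \|\hat h - h^*\|_{L_2(P)}^2 \le \frac{2}{1-\eta}\|\hat h - \hat h^*\|_{L_2(D)}^2 + 2\|\hat h^* - h^*\|_{L_2(P)}^2. \tag{50}$$

Combining inequality (50) with Theorem 4 and Theorem 5, we complete the proof.

F EXPERIMENTAL RESULTS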

F.1 CONVOLUTIONAL NEURAL NETWORKS FOR EMNIST TASK

To check the validity of our theory for over-parameterized models, we train a convolutional neural network on the EMNIST task (Cohen et al., 2017). In particular, we use FedAdam (Reddi et al., 2020) with server momentum 0.9. The participating and unparticipating clients are split based on the methods proposed in (Yuan et al., 2021). We set the unparticipating rate to 0.2. Our experiments are based on TensorFlow Federated (TFF) (Alex & Krzys, 2019).
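For readers unfamiliar with FedAdam, the server-side update of Reddi et al. (2020) can be sketched in a few lines of plain Python. This is not the TFF implementation used in the experiments: only the first-moment coefficient `beta1 = 0.9` comes from the text above, while the server learning rate, `beta2`, the adaptivity constant `tau`, and the function names are assumptions chosen for illustration.

```python
import math


def init_server_state(dim, tau=1e-3):
    """Initial first and second moments; the second moment starts at tau**2."""
    return {"m": [0.0] * dim, "v": [tau * tau] * dim}


def fedadam_server_step(params, client_deltas, state,
                        server_lr=0.1, beta1=0.9, beta2=0.99, tau=1e-3):
    """One FedAdam server update from the deltas of the sampled clients.

    client_deltas holds one vector per participating client, each equal to
    (locally trained parameters) minus (current server parameters).
    """
    dim = len(params)
    num_clients = len(client_deltas)
    # Average the client updates to form a pseudo-gradient.
    avg_delta = [sum(d[k] for d in client_deltas) / num_clients for k in range(dim)]
    # Adam-style moment estimates maintained on the server.
    m = [beta1 * mk + (1.0 - beta1) * dk for mk, dk in zip(state["m"], avg_delta)]
    v = [beta2 * vk + (1.0 - beta2) * dk * dk for vk, dk in zip(state["v"], avg_delta)]
    new_params = [
        p + server_lr * mk / (math.sqrt(vk) + tau)
        for p, mk, vk in zip(params, m, v)
    ]
    return new_params, {"m": m, "v": v}
```

Only the clients sampled in a round contribute to `client_deltas`, which is where the participating versus unparticipating distinction enters the training loop.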

F.2 FEDERATED LINEAR REGRESSION WITH SYNTHETIC DATA

In Figure 4, we show the results of numerical experiments based on the linear regression model. We first describe our linear regression setting as follows. For client $i \in [m]$, the dataset is given as $S_i = \{(X_i^j, Y_i^j)\}_{j=1}^n$ with $n$ samples. Let $d$ be the dimensionality of the input space. We focus on the setting

$$Y_i^j \mid X_i^j, \theta_i \sim \mathcal N\big((X_i^j)^\top\theta_i,\ \sigma_i^2\big), \quad \forall j = 1, \dots, n,$$

where $\sigma_i^2$ is a noise parameter. In our experiments, we set $\sigma_i = 0.05$. For excess risk, we fix $n = 20$. For semi-excess risk, we fix $m = 40$.
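The synthetic setting above can be simulated with a short script. The sketch below is a minimal illustration under stated assumptions: the text fixes only the conditional model for $Y_i^j$ and $\sigma_i = 0.05$, so the Gaussian inputs, the Gaussian meta-distribution over $\theta_i$ (centre `theta_center`, scale `theta_spread`), and all function names are hypothetical choices rather than the authors' configuration.

```python
import random


def make_clients(m, n, d, sigma=0.05, theta_spread=0.1, seed=0):
    """Generate m client datasets with n samples each in dimension d.

    Assumed (not from the paper): X ~ N(0, I_d) and
    theta_i ~ N(theta_center, theta_spread^2 I_d), a simple meta-distribution.
    """
    rng = random.Random(seed)
    theta_center = [1.0] * d
    clients = []
    for _ in range(m):
        theta_i = [c + theta_spread * rng.gauss(0.0, 1.0) for c in theta_center]
        xs = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
        # Y | X, theta_i ~ N(X^T theta_i, sigma^2), as in the setting above.
        ys = [
            sum(a * b for a, b in zip(x, theta_i)) + sigma * rng.gauss(0.0, 1.0)
            for x in xs
        ]
        clients.append({"theta": theta_i, "X": xs, "Y": ys})
    return clients


def empirical_risk(theta, clients):
    """Average squared loss of the linear predictor theta over all client samples."""
    total, count = 0.0, 0
    for client in clients:
        for x, y in zip(client["X"], client["Y"]):
            residual = sum(a * b for a, b in zip(x, theta)) - y
            total += residual * residual
            count += 1
    return total / count
```

Calling `make_clients(40, 20, 5)` mirrors the values $m = 40$ and $n = 20$ mentioned in the text; drawing a fresh set of clients with a different seed gives held-out client distributions on which an unparticipating risk can be evaluated.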



The definition of uniform entropy number is provided in appendix C.



Figure 1: Illustration of the participation gap and participation error.

j=1 denotes the local training set at the $i$-th participating client and $S = S_1 \cup \dots \cup S_m$ represents the global training set across all participating clients. The empirical risk minimizer $\hat h$ conditioned on the dataset $S$ is defined as $\hat h = \arg\min_{h\in\mathcal H} L_S(h)$. To analyze the generalization in our two-level framework, we further define the semi-empirical distribution $D$ and the corresponding semi-empirical risk $L_D(h)$ by $D = \frac{1}{m}\sum_{i=1}^m D_i$ and $L_D(h) = \frac{1}{m}\sum_{i=1}^m \mathbb E_{Z\sim D_i}[\ell(h, Z)]$. We extend the previous definitions and denote by $\hat h^*$ the semi-empirical risk minimizer $\hat h^* = \arg\min_{h\in\mathcal H} L_D(h)$.

j=1 denotes the local training set at the $i$-th participating client and $S = S_1 \cup \dots \cup S_m$ is the global training set. The empirical risk minimizer $\hat h$ is defined as $\hat h = \arg\min_{h\in\mathcal H} L_S(h)$.

vary at most $b/(mn)$, then applying McDiarmid's inequality again, with probability at least $1 - \delta$ we have

$$\mathbb E_{S_1,\dots,S_n}[R_{mn}(\mathcal F)] \le \hat R_{mn}(\mathcal F) + b\sqrt{\frac{\ln(2/\delta)}{2mn}}.$$

PROOF OF LEMMA 2

Proof of Lemma 2. We create a distribution set $D'$ from $D$ by changing one client distribution. Without loss of generality, we assume the $k$-th distribution $D_k$ in $D$ is changed into $D_k'$, then we define $\Phi(D)$ by $\Phi(D) = \sup_{h\in\mathcal H}\big(L_P(h) - L_D(h)\big)$.

and denote k as the smallest integer such that rλ k+1 ≥ Bb β . ThenE[Rmn(Gr)] = E

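Taking $\eta < \tau^2 Q_{\mathcal H-\mathcal H}(2\tau, P)/16$, we get $L_D(h) - L_D(h^*) \ge 0$. On the same event, we have $\|\hat h^* - h^*\|_{L_2(P)} \le \max(\kappa, s) = \max\big(m^{-\frac14+\iota},\, \omega_m(\mathcal H - \mathcal H, \tau Q_{\mathcal H-\mathcal H}(2\tau, P)/16)\big)$.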

Figure 2: Generalization error versus the number of participating clients.

In Figure 2, we show how the generalization errors for unparticipating clients and participating clients converge when we increase the number of participating clients $m$. Here we fix $n = 100$. It can be seen that the convergence rate of participating error is slower than that of unparticipating error. This phenomenon matches our theoretical results in Theorem 1 well. In our results, the convergence rate of unparticipating error is of order $O\big(\frac{1}{\sqrt{mn}} + \frac{1}{\sqrt m}\big)$. Compared to the convergence rate of participating error, which is of order $O\big(\frac{1}{\sqrt{mn}}\big)$, unparticipating error is expected to have a faster convergence rate. In Figure 3, we show how the generalization errors for unparticipating clients and participating clients converge when we increase $n$. Here we fix $m = 20$. It can be seen that the convergence rates

Figure 4: Excess risk versus m and Semi-excess risk versus n.

Generalization Bounds for Heterogeneous Federated Learning. SC, Pro, and Exp denote Strong convexity, In probability, and In expectation. Sub-expon denotes sub-exponential.

Fast Learning rates for participating clients
3.2 Fast Learning rates for unparticipating clients
Learning rates for participating clients with small-ball condition
4.2 Learning rates for unparticipating clients with small-ball condition
Convolutional Neural Networks for EMNIST Task
F.2 Federated Linear Regression with Synthetic Data

is the minimizer of L S (h). Note that the first term sup h∈H |L P (h) -L D (h)| in the upper bound of excess risk quantifies the participation gap. The second term sup h∈H |L D (h) -L S (h)| quantifies the generalization error for participating clients. Based on these observations, we first provide theoretical bounds on the term sup h∈H |L D (h) -L S (h)|. Then we move to the participation gap sup h∈H |L P

participation gap in Lemma 2 quantifies the error caused by missing clients during training. The Rademacher complexity term R m (F) indicates the 'size' of the hypothesis class, which can be further bounded by the covering number or VC dimension of the chosen hypothesis class.
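As an illustration of how a Rademacher complexity term measures the size of a hypothesis class, the sketch below estimates the empirical Rademacher complexity of a norm-bounded linear class by Monte Carlo and compares it with the standard closed-form upper bound. The linear class $\{x \mapsto \langle w, x\rangle : \|w\|_2 \le B\}$ and the function names are assumptions made for this example; the paper's bounds are stated for general classes.

```python
import math
import random


def rademacher_linear_class(points, radius, num_draws=2000, seed=0):
    """Monte Carlo estimate of E_sigma sup_{||w|| <= radius} (1/m) sum_i sigma_i <w, x_i>.

    For a norm-bounded linear class the supremum has the closed form
    (radius / m) * || sum_i sigma_i x_i ||_2, which is averaged over sign draws.
    """
    rng = random.Random(seed)
    m = len(points)
    dim = len(points[0])
    total = 0.0
    for _ in range(num_draws):
        signs = [rng.choice((-1.0, 1.0)) for _ in range(m)]
        signed_sum = [sum(s * x[k] for s, x in zip(signs, points)) for k in range(dim)]
        total += radius * math.sqrt(sum(c * c for c in signed_sum)) / m
    return total / num_draws


def rademacher_upper_bound(points, radius):
    """Jensen upper bound (radius / m) * sqrt(sum_i ||x_i||^2), of order 1/sqrt(m)."""
    m = len(points)
    return radius * math.sqrt(sum(c * c for x in points for c in x)) / m
```

Both quantities shrink at the rate $1/\sqrt m$ as the number of points grows, which is the behaviour that drives the $1/\sqrt m$ term attributed to client sampling.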

), it has been shown that excess risk can be bounded by participation gap sup h∈H {|L P (h) -L D (h)|} and participating error sup h∈H {|L D (h) -L S (h)|} . To show the explicit form of the Rademacher complexity terms in Lemma 1 and Lemma 2, we take the VC class as an example.

Proof. We use the subgraph dimension (also known as pseudo-dimension) and Dudley's theorem to bound the empirical Rademacher complexity R

(42)

Published as a conference paper at ICLR 2023

Note that $\mathcal H^* \subset \mathcal H - \mathcal H$, so the same conclusion holds with $Q_{\mathcal H-\mathcal H}(2\tau, P)$ replacing the larger $Q_{\mathcal H^*}(2\tau, P)$. That is, if $s \ge \omega_m(\mathcal H^*, \tau Q_{\mathcal H-\mathcal H}(2\tau, P)/16)$, then for every $h \in \mathcal H$ that satisfies $\|h - h^*\|_{L_2(P)} \ge s$, it follows that with probability at least $1 - 2\exp(-m\,Q_{\mathcal H-\mathcal H}^2(2\tau, P)/2)$, one has

ACKNOWLEDGEMENTS

We appreciate all the anonymous reviewers for their insightful and constructive comments, especially one reviewer's suggestion to add more discussion and explanation. This work is supported by National Natural Science Foundation of China NO. 62076234; the Beijing Natural Science Foundation No. 4222029; the Intelligent Social Governance Interdisciplinary Platform, Major Innovation & Planning Interdisciplinary Platform for the "Double-First Class" Initiative, Renmin University of China; the Beijing Outstanding Young Scientist Program NO.BJJWZYJH012019100020098; the Beijing Key Laboratory of Big Data Management and Analysis Methods, Gaoling School of Artificial Intelligence, Renmin University of China, Beijing 100872, China; the Public Computing Cloud, Renmin University of China; the Fundamental Research Funds for the Central Universities, and the Research Funds of Renmin University of China NO. 2021030199; the Huawei-Renmin University joint program on Information Retrieval; the Unicom Innovation Ecological Cooperation Plan; and the CCF-Huawei Populus Grove Fund.

