HARNESSING CLIENT DRIFT WITH DECOUPLED GRADIENT DISSIMILARITY

Abstract

The performance of Federated Learning (FL) typically suffers from client drift caused by heterogeneous data, where data distributions vary across clients. Recent studies show that the gradient dissimilarity between clients, induced by the discrepancy in data distributions, causes the client drift. Thus, existing methods mainly focus on correcting the gradients. However, it is challenging to identify which clients should (or should not) be corrected. This challenge raises a series of questions: Will local training, without gradient correction, contribute to the server model's generalization on other clients' distributions? When does the generalization contribution hold? How can the challenge be addressed when it fails? To answer these questions, we analyze the generalization contribution of local training and conclude that it is bounded by the conditional Wasserstein distance between clients' distributions. Thus, the key to promoting the generalization contribution is to leverage similar conditional distributions for local training. As collecting data distributions can cause privacy leakage, we propose decoupling the deep model, i.e., splitting it into a high-level model and a low-level one, to harness client drift. High-level models are trained on shared feature distributions, which promotes the generalization contribution and alleviates gradient dissimilarity. Experimental results demonstrate that FL with decoupled gradient dissimilarity is robust to data heterogeneity.

1. INTRODUCTION

To protect data privacy while cooperatively training machine learning models across personal users and organizations, Federated Learning (FL) (Brendan McMahan et al., 2016) has been widely adopted as a powerful framework in recent years. In the FL framework, many clients train models without communicating private data. Federated Averaging (FedAvg) was proposed to make FL practical in low-bandwidth and low-computing-resource environments. However, when data distributions between clients are severely heterogeneous (Non-Independent and Identically Distributed, Non-IID), the convergence rate and the generalization performance of FL are much worse than those of centralized training, which collects all the data (Li et al., 2020a; Karimireddy et al., 2020; Kairouz et al., 2019). The FL community has found, both theoretically and empirically, that the "client drift" caused by heterogeneous data is the main bottleneck of FedAvg (Li et al., 2020a; Karimireddy et al., 2020; Kairouz et al., 2019; Wang et al., 2020a). Client drift means that, after several training epochs on private datasets, the local models on different clients move far away from each other. Recent convergence analyses of FedAvg (Li et al., 2020a; Reddi et al., 2021; Woodworth et al., 2020) show that the degree of client drift is linearly upper bounded by the gradient dissimilarity. Therefore, most existing works (Karimireddy et al., 2020; Wang et al., 2020a) focus on gradient correction techniques to accelerate the convergence of local training. However, how to correct the gradients during local training is still an open problem (Kairouz et al., 2019; Woodworth et al., 2020; Karimireddy et al., 2020), especially for achieving better generalization ability. The challenge lies in the lack of a criterion for identifying which clients should (or should not) be corrected.
[Figure caption: The high-level model uses $h_m$ and samples $\hat{h}$ from a shared distribution $H^r$ as inputs for forward and backward propagation.]
This challenge raises a fundamental question in FL systems: can local training on a specific client m contribute to the generalization performance of the server model when evaluated on other clients' distributions? Moreover, it is unclear under which conditions local training leads to a generalization contribution. A further question is how to handle the conditions under which local training cannot contribute to the server model's generalizability to other clients. To answer these questions, we formulate the objective of local training in FL systems as a generalization contribution problem. The generalization contribution measures how much local training on one client improves the server model's generalization performance on other clients' distributions. Specifically, we evaluate the generalization performance of a server model locally trained on one client using other clients' data distributions. Our theoretical analysis shows that the generalization contribution of local training is bounded by the conditional Wasserstein distance between clients' distributions. This implies that even if the marginal distributions on different clients are the same, this is insufficient to guarantee the generalization performance of local training. Therefore, the key to promoting the generalization contribution is to leverage the same or similar conditional distributions for local training. However, collecting data to construct identical distributions shared across clients is forbidden due to privacy concerns. To avoid privacy leakage, we propose decoupling a deep neural network into a low-level model and a high-level one, i.e., a feature extractor network and a classifier network. Consequently, we can construct a shared identical distribution in the feature space.
Namely, on each client, we estimate the feature distribution obtained by the low-level network and send the estimated distribution to the server. After aggregating the received distributions, the server sends the aggregated distribution and the server model to clients simultaneously. Theoretically, we show that introducing such a simple decoupling strategy promotes the generalization contribution and alleviates gradient dissimilarity. Our extensive experimental results demonstrate the effectiveness of our method, where we consider the global test accuracy on four datasets under various FL settings following previous works (He et al., 2020b; Li et al., 2020a; Wang et al., 2020a). Our main contributions include: (1) We theoretically show that the generalization contribution from clients during training is bounded by the conditional Wasserstein distance between clients' distributions, answering the question of when local training on one client can contribute to the generalization performance of server models on other clients' distributions. (2) We are the first to theoretically propose that sharing similar features between clients can improve the generalization contribution from local training and significantly reduce the gradient dissimilarity. (3) We experimentally validate the gradient dissimilarity reduction and the benefits of our method on generalization performance.

2. RELATED WORKS

We review FL algorithms aiming to address the Non-IID problem and introduce other works related to measuring client contribution and decoupled training. Due to limited space, we leave a more detailed discussion of the literature review in Appendix C.

2.1. ADDRESSING NON-IID PROBLEM IN FL

Model Regularization focuses on calibrating the local models so that they do not move excessively far away from the server model. A number of works like FedProx (Li et al., 2020a), FedDyn (Acar et al., 2021), SCAFFOLD (Karimireddy et al., 2020) and FedIR (Hsu et al., 2020) add a regularizer on the local-global model difference. MOON (Li et al., 2021b) adds a local-global contrastive loss to learn similar representations between clients.

Reducing Gradient Variance tries to correct the directions of local updates at clients via other gradient information. Methods of this kind aim to accelerate and stabilize convergence, like FedNova (Wang et al., 2020a), FedAvgM (Hsu et al., 2019), FedAdaGrad, FedYogi, and FedAdam (Reddi et al., 2021). Our Theorem 4.2 provides a new angle on reducing gradient variance.

Personalized Federated Learning aims to make clients optimize different personal models to learn knowledge from other clients and adapt to their own datasets (Tan et al., 2022). The knowledge transfer of personalization is mainly implemented by introducing personalized parameters (Liang et al., 2020; Thapa et al., 2020; Li et al., 2021a), or by knowledge distillation on shared local features or extra datasets (He et al., 2020a; Lin et al., 2020; Li & Wang, 2019). However, because they favor optimizing local objective functions, personalized federated models do not reach generic performance (evaluated on a global test dataset) comparable to normal FL (Chen & Chao, 2021). Some works (Collins et al., 2021; Arivazhagan et al., 2019) also propose to share feature representations for personalized FL.

2.2. MEASURING CONTRIBUTION FROM CLIENTS

Clients are only willing to participate in FL training when given enough rewards. Thus, it is important to measure their contributions to the model performance (Yu et al., 2020; Ng et al., 2020; Liu et al., 2022; Sim et al., 2020). Some works (Yuan et al., 2022) propose to experimentally measure the performance gaps on unseen client distributions. Data Shapley (Ghorbani & Zou, 2019; Sim et al., 2020; Liu et al., 2022) is proposed to measure the generalization performance gain of client participation. Precisely, these works measure the generalization performance gap with or without some clients that never join the whole process of FL. In contrast, we aim to understand the contribution of clients at each communication round. Consequently, our theoretical conclusion guides a modification of data distributions that cannot provide a generalization contribution, so that they can improve the generalization performance of the trained model.

2.3. SPLIT TRAINING

Some works propose Split FL (SFL) to utilize split training to accelerate federated learning (Oh et al., 2022; Thapa et al., 2020) . In SFL, the model is split into client-side and server-side parts. At each communication round, the client only downloads the client-side model from the server, and conducts forward propagation, and sends the hidden features to the server to compute the loss and conduct backward propagation. These methods aim to accelerate the training speed of FL on the client side and cannot support local updates. In addition, sending all raw features could introduce a high risk of data leakage. Thus, we omit the comparisons to these methods.

2.4. PRIVACY CONCERNS

There are many other works (Luo et al., 2021; Chang et al., 2019; Li & Wang, 2019; Bistritz et al., 2020; He et al., 2020a; Liang et al., 2020; Thapa et al., 2020; Oh et al., 2022) that propose to share the hidden features to the server or other clients. Different from them, our decoupling strategy shares the parameters of the estimated feature distributions instead of the raw features, avoiding privacy leakage. We demystify the differences between our method and others in Appendix C.

3. PRELIMINARIES

3.1. PROBLEM DEFINITION

Suppose we have a set of clients $\mathcal{M} = \{1, 2, \cdots, M\}$ with $M$ being the total number of participating clients. FL aims to make these clients, each with its own data distribution $D_m$, cooperatively learn a machine learning model parameterized by $\theta \in \mathbb{R}^d$. Suppose there are $C$ classes in all datasets $\cup_{m \in \mathcal{M}} D_m$, indexed by $[C]$. A sample in $D_m$ is denoted by $(x, y) \in \mathcal{X} \times [C]$, where $x$ is a model input in the space $\mathcal{X}$ and $y$ is its corresponding label. The model is denoted by $\rho(\theta; x): \mathcal{X} \to \mathbb{R}^C$. Formally, the global optimization problem of FL can be formulated as (McMahan et al., 2017; Li et al., 2020a):

$$\min_{\theta \in \mathbb{R}^d} F(\theta) := \sum_{m=1}^{M} p_m F_m(\theta) = \sum_{m=1}^{M} p_m \mathbb{E}_{(x,y) \sim D_m} f(\theta; x, y), \quad (1)$$

where $F_m(\theta) = \mathbb{E}_{(x,y) \sim D_m} f(\theta; x, y)$ is the local objective of client $m$ and $p_m$ is its weight. FedAvg (McMahan et al., 2017) proposes to utilize local updates. Specifically, at each round $r$, the server sends the global model $\theta^{r-1}$ to a randomly chosen subset of clients $S^r \subseteq \mathcal{M}$. Then, all selected clients conduct some iterations of updates to obtain new client models $\{\theta^r_m\}$, which are sent back to the server. Finally, the server averages the local models according to the dataset sizes of the clients to obtain a new global model $\theta^r$.
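The server-side averaging step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each model is a flat list of floats and that the weight of client m is its sample count divided by the total, which is one reading of "according to the dataset size". The function name `fedavg_aggregate` is hypothetical.

```python
from typing import List, Sequence


def fedavg_aggregate(client_models: Sequence[Sequence[float]],
                     num_samples: Sequence[int]) -> List[float]:
    """Average client parameter vectors with weights p_m = n_m / N (assumed)."""
    if not client_models or len(client_models) != len(num_samples):
        raise ValueError("need exactly one sample count per client model")
    total = sum(num_samples)
    dim = len(client_models[0])
    averaged = [0.0] * dim
    for theta_m, n_m in zip(client_models, num_samples):
        p_m = n_m / total  # weight proportional to the local dataset size
        for j in range(dim):
            averaged[j] += p_m * theta_m[j]
    return averaged


# Two clients holding 30 and 10 samples, so the weights are 0.75 and 0.25.
theta_new = fedavg_aggregate([[1.0, 2.0], [5.0, -2.0]], [30, 10])
print(theta_new)  # [2.0, 1.0]
```

In a real system the parameters would be tensors and only the clients in the sampled subset would be averaged, but the weighting logic is the same.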

3.2. GENERALIZATION QUANTIFICATION

Besides defining the metric for the training procedure, we also introduce a metric for the testing phase. Specifically, we define a criterion for measuring the generalization performance of a given deep model. Built upon margin theory (Koltchinskii & Panchenko, 2002; Elsayed et al., 2018), for a given model $\rho(\theta; \cdot)$ parameterized by $\theta$, we use the worst-case margin to measure the generalizability on the data distribution $D$:

Definition 1. (Worst-case margin.) Given a distribution $D$, the worst-case margin of model $\rho(\theta; \cdot)$ is defined as

$$W_d(\rho(\theta), D) = \mathbb{E}_{(x,y) \sim D} \inf_{\arg\max_i \rho(\theta; x')_i \neq y} d(x', x),$$

with $d$ being a specific distance, where $\arg\max_i \rho(\theta; x')_i \neq y$ means that $\rho(\theta; x')$ misclassifies $x'$.

This definition measures the expected distance from a sample $x$ with label $y$ to the nearest input $x'$ that the model $\rho$ misclassifies, i.e., the largest radius around $x$ within which the prediction remains correct. Thus, a smaller margin means a higher possibility of misclassifying the data $x$, and we can use the worst-case margin to quantify the generalization performance of a given model $\rho$ on a data distribution $D$ under a specific distance. Moreover, the margin is always non-negative. If the margin is equal to zero, the model misclassifies almost all samples of the given distribution.
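To make Definition 1 concrete, the sketch below evaluates the empirical worst-case margin for a deliberately simple case: one-dimensional inputs, the distance d(x, x') = |x - x'|, and a binary threshold classifier. These choices and the function name `worst_case_margin_1d` are assumptions made for illustration; the paper applies the definition to deep models.

```python
from typing import Sequence, Tuple


def worst_case_margin_1d(samples: Sequence[Tuple[float, int]],
                         threshold: float) -> float:
    """Empirical worst-case margin of the rule: predict 1 if x > threshold else 0.

    For a correctly classified x, the closest misclassified input lies at the
    threshold, so the infimum distance is |x - threshold|. For a misclassified
    x, the point x' = x already qualifies, so the infimum is 0.
    """
    total = 0.0
    for x, y in samples:
        predicted = 1 if x > threshold else 0
        if predicted == y:
            total += abs(x - threshold)
    return total / len(samples)


data = [(0.0, 0), (1.0, 0), (3.0, 1), (4.0, 1)]
print(worst_case_margin_1d(data, threshold=2.0))  # 1.5

# Flipping every label makes each sample misclassified, so the margin is zero.
flipped = [(x, 1 - y) for x, y in data]
print(worst_case_margin_1d(flipped, threshold=2.0))  # 0.0
```

The second call matches the observation above that a zero margin corresponds to a model that misclassifies almost all samples.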

4. DECOUPLED TRAINING AGAINST DATA HETEROGENEITY

This section formulates the generalization contribution in FL systems and the decoupling of gradient dissimilarity.

4.1. GENERALIZATION CONTRIBUTION

Although Eq. 1 quantifies the performance of model $\rho$ with parameter $\theta$, it focuses more on the training distribution. In FL, we cooperatively train machine learning models because of a belief that introducing more clients contributes to the performance of the server model. Given client $m$, we quantify this "belief", i.e., the generalization contribution, in FL systems as follows:

$$\mathbb{E}_{\Delta: L(D_m)} W_d(\rho(\theta + \Delta), D \setminus D_m), \quad (2)$$

where $\Delta$ is a pseudo gradient obtained by applying a learning algorithm $L(\cdot)$ to a distribution $D_m$, $W_d$ is the quantification of generalization, and $D \setminus D_m$ is the data distribution of all clients except client $m$. Eq. 2 depicts the contribution of client $m$ to the generalization ability. Intuitively, we prefer clients whose generalization contribution can be lower bounded.

Definition 2. The conditional Wasserstein distance $C_d(D, D')$ between the distributions $D$ and $D'$ is

$$C_d(D, D') = \frac{1}{2} \mathbb{E}_{(\cdot, y) \sim D} \inf_{J \in \mathcal{J}(D|y, D'|y)} \mathbb{E}_{(x, x') \sim J} d(x, x') + \frac{1}{2} \mathbb{E}_{(\cdot, y) \sim D'} \inf_{J \in \mathcal{J}(D|y, D'|y)} \mathbb{E}_{(x, x') \sim J} d(x, x'),$$

where $\mathcal{J}(D|y, D'|y)$ is the set of joint distributions whose marginals are the conditional distributions $D|y$ and $D'|y$.

Built upon Definitions 1 and 2 and Eq. 2, we are ready to state the following theorem (proof in Appendix B.1).

Theorem 4.1. With the pseudo gradient $\Delta$ obtained by $L(D_m)$, the generalization contribution is lower bounded:

$$\mathbb{E}_{\Delta: L(D_m)} W_d(\rho(\theta + \Delta), D \setminus D_m) \geq \mathbb{E}_{\Delta: L(D_m)} W_d(\rho(\theta + \Delta), \tilde{D}_m) - \left| \mathbb{E}_{\Delta: L(D_m)} \left[ W_d(\rho(\theta + \Delta), D_m) - W_d(\rho(\theta + \Delta), \tilde{D}_m) \right] \right| - 2 C_d(D_m, D \setminus D_m),$$

where $\tilde{D}_m$ represents the dataset sampled from $D_m$.

Remark 1. Theorem 4.1 implies that three terms are related to the generalization contribution. The first and second terms are intuitive, showing that the generalization contribution of a distribution $D_m$ is expected to be large on, and similar to, that on a training dataset $\tilde{D}_m$. The last term is also intuitive: it implies that promoting the generalization performance requires constructing similar conditional distributions. Both Definition 2 and Theorem 4.1 use distributions conditioned on the label $y$, so we write the feature distribution $H|y$ as $H$ for brevity in the rest of the paper.

Based on this theoretical analysis, a straightforward way to obtain higher generalization performance is to train all client models on similar distributions. However, collecting data to construct such a distribution is forbidden in FL due to privacy concerns. To address this challenge, we propose decoupling a deep neural network into a feature extractor network $\varphi_{\theta_{low}}$ parameterized by $\theta_{low} \in \mathbb{R}^{d_l}$ and a classifier network parameterized by $\theta_{high} \in \mathbb{R}^{d_h}$, and training the classifier network on similar conditional distributions with less discrepancy. In what follows, we show that such a decoupling strategy can reduce the gradient dissimilarity, in addition to promoting generalization performance.
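The role of the conditional distributions in Definition 2 can be illustrated numerically. The sketch below is a simplified, assumption-laden version: inputs are one-dimensional, d(x, x') = |x - x'|, distributions are replaced by finite samples, and the per-label optimal transport cost is computed from the empirical CDFs (a standard identity for the 1-D Wasserstein-1 distance). The helper names are hypothetical.

```python
import bisect
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Sample = Tuple[float, int]  # (input x, label y)


def wasserstein_1d(a: Sequence[float], b: Sequence[float]) -> float:
    """W1 between two 1-D empirical distributions: integral of |F_a - F_b|."""
    sa, sb = sorted(a), sorted(b)
    points = sorted(set(sa) | set(sb))
    total = 0.0
    for left, right in zip(points, points[1:]):
        cdf_a = bisect.bisect_right(sa, left) / len(sa)
        cdf_b = bisect.bisect_right(sb, left) / len(sb)
        total += abs(cdf_a - cdf_b) * (right - left)
    return total


def group_by_label(data: Sequence[Sample]) -> Dict[int, List[float]]:
    groups: Dict[int, List[float]] = defaultdict(list)
    for x, y in data:
        groups[y].append(x)
    return groups


def conditional_wasserstein(data_p: Sequence[Sample],
                            data_q: Sequence[Sample]) -> float:
    """Empirical C_d: label-wise W1, averaged under both label distributions."""
    gp, gq = group_by_label(data_p), group_by_label(data_q)
    if set(gp) != set(gq):
        raise ValueError("both datasets must contain the same labels")
    per_label = {y: wasserstein_1d(gp[y], gq[y]) for y in gp}
    expect_p = sum(len(gp[y]) / len(data_p) * per_label[y] for y in per_label)
    expect_q = sum(len(gq[y]) / len(data_q) * per_label[y] for y in per_label)
    return 0.5 * expect_p + 0.5 * expect_q


# Same marginal over x ({0, 1}) and over y, but swapped conditionals.
client_a = [(0.0, 0), (1.0, 1)]
client_b = [(1.0, 0), (0.0, 1)]
print(conditional_wasserstein(client_a, client_a))  # 0.0
print(conditional_wasserstein(client_a, client_b))  # 1.0
```

The two clients in the example have identical marginal distributions yet a non-zero conditional distance, which is the situation the analysis above warns about.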

4.2. DECOUPLED GRADIENT DISSIMILARITY

The gradient dissimilarity in FL results from heterogeneous data, i.e., the data distribution on client $m$, $D_m$, is different from that on client $k$, $D_k$ (Karimireddy et al., 2020; Li et al., 2020a). The commonly used quantitative measure of gradient dissimilarity is the inter-client gradient variance (CGV).

Definition 3. Inter-client Gradient Variance (CGV) (Kairouz et al., 2019; Karimireddy et al., 2020; Woodworth et al., 2020; Koloskova et al., 2020):

$$\mathrm{CGV}(F, \theta) = \mathbb{E}_{(x,y) \sim D_m} \| \nabla f_m(\theta; x, y) - \nabla F(\theta) \|^2.$$

CGV is usually assumed to be upper bounded (Kairouz et al., 2019; Woodworth et al., 2020; Lian et al., 2017), i.e., $\mathrm{CGV}(F, \theta) = \mathbb{E}_{(x,y) \sim D_m} \| \nabla f_m(\theta; x, y) - \nabla F(\theta) \|^2 \leq \sigma^2$ with a constant $\sigma$.

A smaller bound on the gradient dissimilarity benefits the theoretical convergence rate (Woodworth et al., 2020). Specifically, lower gradient dissimilarity leads to a faster convergence rate (Karimireddy et al., 2020; Li et al., 2020a; Woodworth et al., 2020). This means that the decoupling strategy can also benefit the convergence rate if it reduces the gradient dissimilarity.

Now, we are ready to demonstrate how to reduce the gradient dissimilarity CGV with our decoupling strategy. Writing $\nabla f_m(\theta; x, y)$ as $\left( \nabla_{\theta_{low}} f_m(\theta; x, y), \nabla_{\theta_{high}} f_m(\theta; x, y) \right)$, the CGV can be divided into two terms corresponding to the two parts of $\theta$ (see Appendix B.2 for details):

$$\mathrm{CGV}(F, \theta) = \mathbb{E}_{(x,y) \sim D_m} \| \nabla f_m(\theta; x, y) - \nabla F(\theta) \|^2 = \mathbb{E}_{(x,y) \sim D_m} \left[ \| \nabla_{\theta_{low}} f_m(\theta; x, y) - \nabla_{\theta_{low}} F(\theta) \|^2 + \| \nabla_{\theta_{high}} f_m(\theta; x, y) - \nabla_{\theta_{high}} F(\theta) \|^2 \right].$$

According to the chain rule for the gradients of a deep model, the high-level part of the gradients calculated with the raw data and labels $(x, y) \sim D_m$ is equal to the gradients calculated with the hidden features and labels $(h = \varphi_{\theta_{low}}(x), y)$ (proof in Appendix B.2):

$$\nabla_{\theta_{high}} f_m(\theta; x, y) = \nabla_{\theta_{high}} f_m(\theta; h, y), \qquad \nabla_{\theta_{high}} F(\theta) = \sum_{m=1}^{M} p_m \mathbb{E}_{(x,y) \sim D_m} \nabla_{\theta_{high}} f(\theta; h, y),$$

in which $f_m(\theta; h, y)$ is computed by forwarding $h = \varphi_{\theta_{low}}(x)$ through the high-level model without the low-level part. We propose to let all clients share a global feature distribution $H$ which approximates the features of all clients. Client $m$ samples $\hat{h} \sim H$ and uses $h_m = \varphi_{\theta_{low}}(x)|_{(x,y) \sim D_m}$ to train its classifier network, so the objective function becomes:

$$\min_{\theta \in \mathbb{R}^d} \hat{F}(\theta) := \sum_{m=1}^{M} \hat{p}_m \mathbb{E}_{(x,y) \sim D_m,\, \hat{h} \sim H} \hat{f}(\theta; x, \hat{h}, y) \triangleq \sum_{m=1}^{M} \hat{p}_m \mathbb{E}_{(x,y) \sim D_m,\, \hat{h} \sim H} \left[ f(\theta; \varphi_{\theta_{low}}(x), y) + f(\theta; \hat{h}, y) \right]. \quad (6)$$

Here, $\hat{p}_m = \frac{n_m + \hat{n}_m}{N + \hat{N}}$, with $n_m$ and $\hat{n}_m$ being the sampling sizes of $(x, y) \sim D_m$ and $\hat{h} \sim H$ respectively, and $\hat{N} = \sum_{m=1}^{M} \hat{n}_m$. Now, we are ready to state the following theorem on reducing gradient dissimilarity by sampling features from the same distribution (proof in Appendix B.3).

Theorem 4.2. Under the gradient variance measure CGV (Definition 3), with $\hat{n}_m$ satisfying $\frac{n_m}{n_m + \hat{n}_m} = \frac{N}{N + \hat{N}}$, the objective function $\hat{F}(\theta)$ yields a tighter bound on the gradient dissimilarity, i.e.,

$$\mathrm{CGV}(\hat{F}, \theta) = \mathbb{E}_{(x,y) \sim D_m} \left[ \| \nabla_{\theta_{low}} f_m(\theta; x, y) - \nabla_{\theta_{low}} F(\theta) \|^2 + \frac{N^2}{(N + \hat{N})^2} \| \nabla_{\theta_{high}} f_m(\theta; x, y) - \nabla_{\theta_{high}} F(\theta) \|^2 \right] \leq \mathrm{CGV}(F, \theta).$$

Remark 2. Theorem 4.2 shows that the high-level gradient dissimilarity can be reduced to $\frac{N^2}{(N + \hat{N})^2}$ times its original value by sampling the same features across clients. Hence, estimating and sharing feature distributions is the key to promoting the generalization contribution and reducing the gradient dissimilarity. Note that letting $\hat{N} \to \infty$ would eliminate the high-level dissimilarity. However, two reasons make it impractical to sample infinitely many features $\hat{h}$. First, the distribution is estimated using limited samples, leading to biased estimates. Second, infinite sampling would dramatically increase the computational cost. We set $\hat{N} = N$ in our experiments.
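The scaling factor in Theorem 4.2 can be checked on a toy example. The sketch below is a simplified model, not the proof: it assumes two equally weighted clients, treats each client's high-level gradient as a fixed vector, and assumes the gradient computed on features sampled from the shared distribution is identical on every client. Under those assumptions, mixing in shared samples shrinks the squared deviation from the mean gradient by the square of the mixing ratio.

```python
from typing import List, Sequence


def squared_distance(u: Sequence[float], v: Sequence[float]) -> float:
    return sum((a - b) ** 2 for a, b in zip(u, v))


def mean_vector(vectors: Sequence[Sequence[float]]) -> List[float]:
    count = len(vectors)
    return [sum(vec[j] for vec in vectors) / count for j in range(len(vectors[0]))]


def mixed_gradient(local_grad: Sequence[float], shared_grad: Sequence[float],
                   n_local: int, n_shared: int) -> List[float]:
    """Sample-weighted mix of the gradient on local data and on shared features."""
    c = n_local / (n_local + n_shared)  # equals N / (N + N_hat) by assumption
    return [c * g + (1.0 - c) * s for g, s in zip(local_grad, shared_grad)]


local_grads = [[2.0, 0.0], [0.0, 2.0]]   # heterogeneous high-level gradients
shared_grad = [1.0, -1.0]                # same on every client (assumption)

before = squared_distance(local_grads[0], mean_vector(local_grads))
mixed = [mixed_gradient(g, shared_grad, 100, 100) for g in local_grads]
after = squared_distance(mixed[0], mean_vector(mixed))

print(before, after, after / before)  # 2.0 0.5 0.25
```

With as many shared samples as local ones, the ratio is 1/4, which agrees with the factor $N^2/(N+\hat{N})^2$ for $\hat{N} = N$.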

4.3. TRAINING PROCEDURE

Algorithm 1 Framework of our method.
for each communication round $r$ do
    for each client $m \in S^r$ in parallel do
        $\theta^{r+1}_{m,E-1}, H^{r+1}_m \leftarrow$ ClientUpdate($m$, $\theta^r$, $H^r$).
    end for
    $\theta^{r+1} \leftarrow \sum_{m=1}^{M} \hat{p}_m \theta^{r+1}_{m,E-1}$.
    Update $H^{r+1}$ using $\{H^{r+1}_m \mid m \in S^r\}$.
end for

ClientUpdate($m$, $\theta$, $H$):
for each local iteration $t$ with $t = 0, \cdots, T-1$ do
    Sample raw data $(x, y) \sim D_m$ and $\hat{h} \sim H|y$.
    $\theta_{m,t+1} \leftarrow \theta_{m,t} - \eta_{m,t} \nabla_\theta \hat{f}(\theta; x, \hat{h}, y)$, i.e., Eq. 6.
    Update $H_m$ using $h_m = \varphi_{\theta_{low}}(x)$.
end for
Return $\theta$ and $H_m$ to the server.

The training procedure of the proposed decoupling strategy is simple to implement. It requires only two extra steps compared with the vanilla FedAvg method: a) estimating and broadcasting a global distribution $H$; b) performing local training with both the local data $(x, y)$ and the hidden features $(\hat{h} \sim H|y, y)$. Moreover, sampling $\hat{h} \sim H|y$ has two additional advantages. First, directly sharing the raw hidden features may raise privacy concerns, because the raw data may be reconstructed by feature inversion methods (Zhao et al., 2021). One can use different distribution approximation methods to estimate $\{h_m \mid m \in \mathcal{M}\}$ and thereby avoid exposing the raw data. Second, the hidden features usually have much higher dimensions than the raw data (Lin et al., 2021), so communicating and storing them between clients and servers may not be practical. Transmitting the parameters of $H$ consumes less communication than transmitting the hidden features $\{h_m \mid m \in \mathcal{M}\}$.

Following previous work (Kendall & Gal, 2017), we simply assume a Gaussian distribution to approximate the feature distributions. Namely, on the client side, we use a Gaussian distribution $\mathcal{N}(\mu_m, \sigma_m)$ parameterized by $\mu_m$ and $\sigma_m$ to approximate the feature distribution on client $m$. On the server side, another Gaussian distribution $\mathcal{N}(\mu_g, \sigma_g)$ estimates the global feature distribution. As shown in Figure 1 and Algorithm 1, during local training, clients update $\mu_m$ and $\sigma_m$ using the real features $h_m$ following a moving average strategy which is widely used in the literature (Ioffe & Szegedy, 2015; Wang et al., 2021):

$$\mu_m^{(t+1)} = \beta_m \mu_m^{(t)} + (1 - \beta_m) \times \mathrm{mean}(h_m), \qquad \sigma_m^{(t+1)} = \beta_m \sigma_m^{(t)} + (1 - \beta_m) \times \mathrm{variance}(h_m),$$

where $t$ is the iteration of the local training and $\beta_m$ is the momentum coefficient. On the server side, $\mu_g$ and $\sigma_g$ are aggregated as:

$$\mu_g = \frac{1}{|S^r|} \sum_{i \in S^r} \mu_i^T, \qquad \sigma_g = \frac{1}{|S^r|} \sum_{i \in S^r} \sigma_i^T,$$

where $T$ stands for the maximum iteration of local training. We perform the second step of the proposed decoupling strategy by optimizing the designed objective function, i.e., Eq. 6. Based on the above analysis, the decoupling strategy benefits both the generalization contribution (Theorem 4.1) and the convergence rate (Theorem 4.2).
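The feature-distribution bookkeeping described above (client-side moving averages, server-side averaging, and sampling) can be sketched as follows. This is a minimal illustration under several assumptions that the text does not fix: features are plain lists of floats, the Gaussian is diagonal, the second parameter tracks a per-dimension variance, the initial values are 0 and 1, and the per-label conditioning is omitted. The class and function names are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


class GaussianFeatureEstimator:
    """Client-side running estimate of per-dimension feature mean and variance."""

    def __init__(self, dim: int, momentum: float = 0.9) -> None:
        self.mean = [0.0] * dim   # assumed initial value
        self.var = [1.0] * dim    # assumed initial value
        self.momentum = momentum  # beta_m in the moving-average update

    def update(self, features: Sequence[Sequence[float]]) -> None:
        """Moving-average update from one batch of real features h_m."""
        n = len(features)
        beta = self.momentum
        for j in range(len(self.mean)):
            column = [h[j] for h in features]
            batch_mean = sum(column) / n
            batch_var = sum((v - batch_mean) ** 2 for v in column) / n
            self.mean[j] = beta * self.mean[j] + (1.0 - beta) * batch_mean
            self.var[j] = beta * self.var[j] + (1.0 - beta) * batch_var


def aggregate(estimators: Sequence[GaussianFeatureEstimator]
              ) -> Tuple[List[float], List[float]]:
    """Server-side average of the parameters reported by the selected clients."""
    k = len(estimators)
    dim = len(estimators[0].mean)
    mean = [sum(e.mean[j] for e in estimators) / k for j in range(dim)]
    var = [sum(e.var[j] for e in estimators) / k for j in range(dim)]
    return mean, var


def sample_features(mean: Sequence[float], var: Sequence[float],
                    count: int, rng: random.Random) -> List[List[float]]:
    """Draw shared features from the aggregated diagonal Gaussian."""
    return [[rng.gauss(m, v ** 0.5) for m, v in zip(mean, var)]
            for _ in range(count)]


client_a = GaussianFeatureEstimator(dim=1, momentum=0.5)
client_b = GaussianFeatureEstimator(dim=1, momentum=0.5)
client_a.update([[1.0], [3.0]])  # batch mean 2.0, batch variance 1.0
client_b.update([[5.0], [7.0]])  # batch mean 6.0, batch variance 1.0
global_mean, global_var = aggregate([client_a, client_b])
print(global_mean, global_var)  # [2.0] [1.0]
shared = sample_features(global_mean, global_var, count=4, rng=random.Random(0))
```

Only the two parameter vectors leave each client, which is the communication saving the text refers to; a per-class version would keep one estimator per label.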

5. EXPERIMENTS

5.1. EXPERIMENT SETUP

Federated Datasets and Models. We verify our method with four datasets commonly used in the FL community, i.e., CIFAR-10 (Krizhevsky & Hinton, 2009), FMNIST (Xiao et al., 2017), SVHN (Netzer et al., 2011), and CIFAR-100 (Krizhevsky & Hinton, 2009). We use the Latent Dirichlet Allocation (LDA) partition method to simulate the Non-IID data distribution, which is the most widely used partition method in FL (He et al., 2020b; Li et al., 2021b; Luo et al., 2021). We conduct experiments with two different Non-IID degrees, a = 0.1 and a = 0.05. Some additional experimental results are shown in Appendix D.3.

Baselines and Metrics. We choose the classical FL algorithm, FedAvg (McMahan et al., 2017), and recent effective FL algorithms proposed to address the client drift problem, including FedProx (Li et al., 2020a), SCAFFOLD (Karimireddy et al., 2020), and FedNova (Wang et al., 2020a), as our baselines. The detailed hyper-parameters of all experiments are reported in Appendix D. We use two metrics: the best accuracy, and the number of communication rounds needed to achieve a target accuracy, which is set to the best accuracy of FedAvg. We also measure the weight divergence (Karimireddy et al., 2020), $\frac{1}{|S^r|} \sum_{i \in S^r} \| \bar{\theta} - \theta_i \|$, as it reflects the effect on gradient dissimilarity reduction.
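The two ingredients of this setup that are easy to get wrong, the Dirichlet-based label partition and the weight divergence metric, can be sketched as below. This is a generic illustration rather than the authors' code: it assumes integer labels, draws per-class client proportions from a symmetric Dirichlet with concentration `alpha` (the role played by a above), and takes the reference model in the divergence to be the plain average of the client models. All function names are hypothetical.

```python
import math
import random
from typing import Dict, List, Sequence


def dirichlet_partition(labels: Sequence[int], num_clients: int,
                        alpha: float, seed: int = 0) -> List[List[int]]:
    """Split sample indices across clients with Dirichlet(alpha) class proportions."""
    rng = random.Random(seed)
    client_indices: List[List[int]] = [[] for _ in range(num_clients)]
    by_class: Dict[int, List[int]] = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    for y in sorted(by_class):
        indices = by_class[y]
        rng.shuffle(indices)
        # Normalized Gamma(alpha, 1) draws form a Dirichlet(alpha) sample.
        weights = [rng.gammavariate(alpha, 1.0) for _ in range(num_clients)]
        total = sum(weights)
        if total == 0.0:  # guard against numerical underflow for tiny alpha
            weights, total = [1.0] * num_clients, float(num_clients)
        start, cumulative = 0, 0.0
        for m in range(num_clients):
            cumulative += weights[m] / total
            if m == num_clients - 1:
                end = len(indices)  # last client takes the remainder
            else:
                end = min(len(indices), round(cumulative * len(indices)))
            end = max(end, start)
            client_indices[m].extend(indices[start:end])
            start = end
    return client_indices


def weight_divergence(client_models: Sequence[Sequence[float]]) -> float:
    """Mean L2 distance between each client model and the average model."""
    k = len(client_models)
    dim = len(client_models[0])
    mean = [sum(theta[j] for theta in client_models) / k for j in range(dim)]
    distances = [math.sqrt(sum((theta[j] - mean[j]) ** 2 for j in range(dim)))
                 for theta in client_models]
    return sum(distances) / k


labels = [i % 10 for i in range(1000)]
parts = dirichlet_partition(labels, num_clients=10, alpha=0.1)
print([len(p) for p in parts])
print(weight_divergence([[0.0, 0.0], [6.0, 8.0]]))  # 5.0
```

Smaller values of `alpha` concentrate each class on fewer clients, which corresponds to the more severe Non-IID setting.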

5.2. EXPERIMENTAL RESULTS

Basic FL setting. As shown in Table 1, using the classical FL training setting, i.e., a = 0.1, E = 5 and M = 10, our method achieves much higher generalization performance than other methods on CIFAR-10, FMNIST and SVHN. We also find that, for CIFAR-100, the performance of our method is similar to that of FedProx. We conjecture that this is because CIFAR-100 has more classes than the other datasets. Thus, a more powerful feature estimation approach, instead of a simple Gaussian assumption, is a promising direction for enhancing the performance.

Impacts of Non-IID Degree. As shown in Table 1, for all datasets with a high Non-IID degree (a = 0.05), our method obtains larger performance gains than with a lower Non-IID degree (a = 0.1). For example, we obtain 92.37% test accuracy on SVHN with a = 0.1, higher than FedNova by 3.89%. When the Non-IID degree increases to a = 0.05, we obtain 90.25% test accuracy, higher than FedNova by 6.14%. For CIFAR-100, our method shows benefits when a = 0.05, demonstrating that it can withstand more severe data heterogeneity.

Different Number of Clients. We also show the results of the 100-client FL setting.

At the initial training stage, the weight divergence is similar for different methods. During this stage, the low-level model is still unstable and the feature estimation is not accurate. After about 500 communication rounds, our method begins to show lower weight divergence than the others, indicating that it converges faster than other methods.

Convergence Speed. Figure 2 (a) shows that our method can accelerate the convergence of FL. We also compare the communication rounds that different algorithms need to attain the target accuracy in Table 2. The results show that our method can improve the convergence speed. The failure cases may be due to the large number of categories in the dataset.

5.3. ABLATION STUDY

To verify the impact of the depth of gradient decoupling, we conduct experiments by splitting at different layers, including the 5th, 9th, 13th and 17th layers. Table 3 demonstrates that our method obtains benefits when decoupling at low or middle layers. Decoupling at the 17th layer decreases the performance, which is consistent with our conclusion in Sec. 4.2. Specifically, decoupling at a very high layer may not be enough to resist gradient dissimilarity, leading to weak mitigation of data heterogeneity. Interestingly, according to Theorem 4.2, decoupling at the 5th layer should diminish more gradient dissimilarity than decoupling at the 9th and 13th layers, but it does not show performance gains. We conjecture that this is due to the difficulty of distribution estimation, since biased estimation leads to a poor generalization contribution. As other works (Lin et al., 2021) indicate, features at the lower levels are usually richer and larger than those at the higher levels. Thus, estimating the lower-level features is much more difficult than estimating the higher-level ones. In this section, we provide further experimental support for our method. All experimental results in this section are obtained on CIFAR-10 with ResNet-18, a = 0.1, E = 1 and M = 10. Further experimental results are shown in Appendix D.3 due to limited space.

5.4. DISCUSSION

Our method only guarantees the reduction of high-level gradient dissimilarity without considering the low-level part. We experimentally find that the low-level weight divergence shrinks faster than the high-level one. We show the layer-wise weight divergence in Figure 3: the divergence of 10 selected layers in Figure 3 (a), and that of the different stages of ResNet-18 in Figure 3 (b). As we aim to demonstrate the divergence trend, we normalize each line by its maximum value. The results show that the low-level divergence shrinks faster than the high-level divergence. This suggests that reducing the high-level gradient dissimilarity is more important than reducing the low-level one.

We also run FedAvg for many communication rounds, with and without learning rate decay. We show the results of the first 5000 rounds in Figure 8 (a) in Appendix D. FedAvg without learning rate decay achieves only 86.96% accuracy, and FedAvg with learning rate decay achieves only 82.65% accuracy. The results show that longer training cannot close the generalization performance gap between FedAvg and centralized training, encouraging us to develop new optimization schemes to improve the performance of FL.

6. LIMITATIONS

Estimation of Feature Distribution. In this work, we only use a Gaussian distribution to estimate the feature distribution. This may limit the performance of the framework, although it works well in our experiments. Future works may exploit better feature estimators like generative models (Goodfellow et al., 2014; Karras et al., 2019) to sample higher-quality features.

Extra Communication and Calculation Cost. Our method only needs to communicate the parameters of the estimated feature distribution, which are much smaller than all features of the clients. Quantization or sparsification methods can be used to further reduce the communication cost. Furthermore, our method doubles the calculation cost of the forward and backward passes of the high-level model. Thus, reducing more gradient dissimilarity incurs more calculation cost. This is a trade-off that needs to be studied further in the future.

7. CONCLUSION

In this paper, we raise a series of fundamental questions related to measuring the generalization contribution of local training from the clients. Then, we theoretically show the relationship of this generalization contribution with the conditional Wasserstein distance between clients' distributions. The theoretical conclusion inspires us to propose decoupling gradient dissimilarity, which greatly reduces the gradient dissimilarity by training with a shared feature distribution without privacy concerns. We theoretically verify the gradient dissimilarity reduction and experimentally validate our methods' benefits on generalization performance. Our work opens a new view of promoting FL performance from a generalization perspective.

ETHIC STATEMENT

This paper does not raise any ethical concerns. This study does not involve any human subjects, practices to data set releases, potentially harmful insights, methodologies and applications, potential conflicts of interest and sponsorship, discrimination/bias/fairness concerns, privacy and security issues, legal compliance, and research integrity issues.

REPRODUCIBILITY STATEMENT

To ensure reproducibility, we have listed all hyper-parameters, hardware and software of all experiments in Appendix D. Due to privacy concerns, we will upload an anonymous link to the code and instructions during the rebuttal so that it is only visible to reviewers. All explanations of assumptions can be found in Sections 3 and 4, and the complete proofs of Theorems 4.1 and 4.2 can be found in Appendix B.

APPENDIX

A BROADER IMPACT

Measuring Client Contribution During Local Training. As discussed in Section 2, current works mainly focus on measuring the generalization contribution of clients participating in the whole training process. We consider measuring this contribution during each communication round, which opens a new angle on the convergence analysis of FL. Future works may close the generalization gap between FL and centralized training with all datasets.

Relationship between Privacy and Performance. We analyze the relationship between the shared features and the raw data in Section 4.2 and the Appendix. However, we do not deeply investigate how sharing features or the parameters of the estimated feature distribution threatens the privacy of the private raw data. Sharing features at a lower level may reduce gradient dissimilarity and improve the generalization performance of FL, yet it leads to higher risks to data privacy. Future works may consider characterizing the trade-off between data privacy and generalization performance when sharing features.

Connections of our work to knowledge distillation and domain generalization. The approximation of features generated from the client data and low-level models can be seen as a kind of knowledge distillation from other clients. More in-depth analyses of this problem would be an exciting direction, which we leave to future work. Domain generalization is also an exciting connection to federated learning. It is interesting to connect the measurement of client contribution to domain generalization.

B PROOF

B.1 BOUNDED GENERALIZATION CONTRIBUTION

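Given client $m$, we quantify the generalization contribution in FL systems as follows:

$$\mathbb{E}_{\Delta:\mathcal{L}(\mathcal{D}_m)} W_d(\rho(\theta+\Delta), \mathcal{D}\setminus\mathcal{D}_m),$$

where $\Delta$ is a pseudo gradient obtained by applying a learning algorithm $\mathcal{L}(\mathcal{D}_m)$ to a distribution $\mathcal{D}_m$, $W_d$ is the quantification of generalization, and $\mathcal{D}\setminus\mathcal{D}_m$ denotes the distribution of all clients except for client $m$.

Theorem B.1. With the pseudo gradient $\Delta$ obtained by $\mathcal{L}(\mathcal{D}_m)$, the generalization contribution is lower bounded:

$$\mathbb{E}_{\Delta:\mathcal{L}(\mathcal{D}_m)} W_d(\rho(\theta+\Delta), \mathcal{D}\setminus\mathcal{D}_m) \ge \mathbb{E}_{\Delta:\mathcal{L}(\mathcal{D}_m)} W_d(\rho(\theta+\Delta), \widetilde{\mathcal{D}}_m) - \left|\mathbb{E}_{\Delta:\mathcal{L}(\mathcal{D}_m)}\left[W_d(\rho(\theta+\Delta), \mathcal{D}_m) - W_d(\rho(\theta+\Delta), \widetilde{\mathcal{D}}_m)\right]\right| - 2C_d(\mathcal{D}_m, \mathcal{D}\setminus\mathcal{D}_m),$$

where $\widetilde{\mathcal{D}}_m$ represents the dataset sampled from $\mathcal{D}_m$.

Proof. To derive the lower bound, we decompose the conditional quantification of generalization, i.e., $W_d(\rho(\theta+\Delta), \mathcal{D}\setminus\mathcal{D}_m)$:

$$\begin{aligned}
W_d(\rho(\theta+\Delta), \mathcal{D}\setminus\mathcal{D}_m) ={}& W_d(\rho(\theta+\Delta), \mathcal{D}\setminus\mathcal{D}_m) - W_d(\rho(\theta+\Delta), \mathcal{D}_m) \\
&+ W_d(\rho(\theta+\Delta), \mathcal{D}_m) - W_d(\rho(\theta+\Delta), \widetilde{\mathcal{D}}_m) \\
&+ W_d(\rho(\theta+\Delta), \widetilde{\mathcal{D}}_m),
\end{aligned}$$

where we denote $\rho$ as $\rho(\theta+\Delta)$ for brevity and $\widetilde{\mathcal{D}}_m$ stands for the dataset sampled from $\mathcal{D}_m$. Built upon the decomposition, we have:

$$\begin{aligned}
\mathbb{E}_{\Delta:\mathcal{L}(\mathcal{D}_m)} W_d(\rho(\theta+\Delta), \mathcal{D}\setminus\mathcal{D}_m) \ge{}& \mathbb{E}_{\Delta:\mathcal{L}(\mathcal{D}_m)} W_d(\rho(\theta+\Delta), \widetilde{\mathcal{D}}_m) \\
&- \left|\mathbb{E}_{\Delta:\mathcal{L}(\mathcal{D}_m)}\left[W_d(\rho(\theta+\Delta), \mathcal{D}_m) - W_d(\rho(\theta+\Delta), \widetilde{\mathcal{D}}_m)\right]\right| \\
&- \left|\mathbb{E}_{\Delta:\mathcal{L}(\mathcal{D}_m)}\left[W_d(\rho(\theta+\Delta), \mathcal{D}\setminus\mathcal{D}_m) - W_d(\rho(\theta+\Delta), \mathcal{D}_m)\right]\right|.
\end{aligned} \tag{11}$$

The first term in Eq. 11 represents the empirical generalization performance. The second term in Eq. 11 is the performance gap between the model trained on the sampled dataset and that trained on the distribution; a rigorous analysis can be found in (Montasser et al., 2019). Note that the first two terms are independent of the distribution $\mathcal{D}\setminus\mathcal{D}_m$, so the focus of the generalization contribution is mainly on the last term, i.e., $\left|\mathbb{E}_{\Delta:\mathcal{L}(\mathcal{D}_m)}\left[W_d(\rho(\theta+\Delta), \mathcal{D}\setminus\mathcal{D}_m) - W_d(\rho(\theta+\Delta), \mathcal{D}_m)\right]\right|$. The proof is relatively straightforward once we derive upper bounds of $W_d(\rho(\theta+\Delta), \mathcal{D}_m)$ and $W_d(\rho(\theta+\Delta), \mathcal{D}\setminus\mathcal{D}_m)$.

For $W_d(\rho(\theta+\Delta), \mathcal{D}_m)$, we have:

$$\begin{aligned}
W_d(\rho(\theta+\Delta), \mathcal{D}_m) &= \mathbb{E}_{(\cdot|y)\sim\mathcal{D}_m}\mathbb{E}_{x\sim\mathcal{D}_m|y} \inf_{\arg\max_i \rho(\theta;x')_i \neq y} d(x, x') \\
&= \mathbb{E}_{(\cdot|y)\sim\mathcal{D}_m}\mathbb{E}_{(x,x'')\sim J_y} \inf_{\arg\max_i \rho(\theta;x')_i \neq y} d(x, x') \\
&\le \mathbb{E}_{(\cdot|y)\sim\mathcal{D}_m}\mathbb{E}_{(x,x'')\sim J_y} \inf_{\arg\max_i \rho(\theta;x')_i \neq y} \left[d(x', x'') + d(x, x'')\right] \\
&= \mathbb{E}_{(\cdot|y)\sim\mathcal{D}_m}\mathbb{E}_{(x,x'')\sim J_y} \inf_{\arg\max_i \rho(\theta;x')_i \neq y} d(x', x'') + \mathbb{E}_{(\cdot|y)\sim\mathcal{D}_m}\mathbb{E}_{(x,x'')\sim J_y} d(x, x'') \\
&= \mathbb{E}_{(\cdot|y)\sim\mathcal{D}_m}\mathbb{E}_{x''\sim\mathcal{D}\setminus\mathcal{D}_m|y} \inf_{\arg\max_i \rho(\theta;x')_i \neq y} d(x', x'') + \mathbb{E}_{(\cdot|y)\sim\mathcal{D}_m}\mathbb{E}_{(x,x'')\sim J_y} d(x, x''),
\end{aligned}$$

where $J_y$ stands for the optimal transport between the conditional distributions $\mathcal{D}_m|y$ and $\mathcal{D}\setminus\mathcal{D}_m|y$. Similarly, we have:

$$W_d(\rho(\theta+\Delta), \mathcal{D}\setminus\mathcal{D}_m) \le \mathbb{E}_{(\cdot|y)\sim\mathcal{D}\setminus\mathcal{D}_m}\mathbb{E}_{x''\sim\mathcal{D}_m|y} \inf_{\arg\max_i \rho(\theta;x')_i \neq y} d(x', x'') + \mathbb{E}_{(\cdot|y)\sim\mathcal{D}\setminus\mathcal{D}_m}\mathbb{E}_{(x,x'')\sim J_y} d(x, x'').$$

Combining these two inequalities, we have:

$$\left|W_d(\rho(\theta+\Delta), \mathcal{D}_m) - W_d(\rho(\theta+\Delta), \mathcal{D}\setminus\mathcal{D}_m)\right| \le 2C_d(\mathcal{D}_m, \mathcal{D}\setminus\mathcal{D}_m) + \max\left\{\delta(\mathcal{D}_m, \mathcal{D}\setminus\mathcal{D}_m), \gamma(\mathcal{D}_m, \mathcal{D}\setminus\mathcal{D}_m)\right\}, \tag{12}$$

where

$$\delta(\mathcal{D}_m, \mathcal{D}\setminus\mathcal{D}_m) = \mathbb{E}_{(\cdot|y)\sim\mathcal{D}_m}\mathbb{E}_{x''\sim\mathcal{D}\setminus\mathcal{D}_m|y} \inf_{\arg\max_i \rho(\theta;x')_i \neq y} d(x', x'') - \mathbb{E}_{(\cdot|y)\sim\mathcal{D}\setminus\mathcal{D}_m}\mathbb{E}_{x''\sim\mathcal{D}\setminus\mathcal{D}_m|y} \inf_{\arg\max_i \rho(\theta;x')_i \neq y} d(x', x''),$$

and

$$\gamma(\mathcal{D}_m, \mathcal{D}\setminus\mathcal{D}_m) = \mathbb{E}_{(\cdot|y)\sim\mathcal{D}\setminus\mathcal{D}_m}\mathbb{E}_{x''\sim\mathcal{D}_m|y} \inf_{\arg\max_i \rho(\theta;x')_i \neq y} d(x', x'') - \mathbb{E}_{(\cdot|y)\sim\mathcal{D}_m}\mathbb{E}_{x''\sim\mathcal{D}_m|y} \inf_{\arg\max_i \rho(\theta;x')_i \neq y} d(x', x'').$$

The upper bound is straightforward. For example, if the label distributions are the same, i.e., $y \sim \mathcal{D}\setminus\mathcal{D}_m$ is distributed identically to $y \sim \mathcal{D}_m$, then $\delta = \gamma = 0$ and we have:

$$\left|W_d(\rho(\theta+\Delta), \mathcal{D}_m) - W_d(\rho(\theta+\Delta), \mathcal{D}\setminus\mathcal{D}_m)\right| \le 2C_d(\mathcal{D}_m, \mathcal{D}\setminus\mathcal{D}_m).$$

According to Eq. 12, the last term in Eq. 11 is bounded:

$$\left|\mathbb{E}_{\Delta:\mathcal{L}(\mathcal{D}_m)}\left[W_d(\rho(\theta+\Delta), \mathcal{D}\setminus\mathcal{D}_m) - W_d(\rho(\theta+\Delta), \mathcal{D}_m)\right]\right| \le \mathbb{E}_{\Delta:\mathcal{L}(\mathcal{D}_m)}\left|W_d(\rho(\theta+\Delta), \mathcal{D}\setminus\mathcal{D}_m) - W_d(\rho(\theta+\Delta), \mathcal{D}_m)\right|,$$

which is further upper bounded by the conditional Wasserstein distance when the label distributions are not the same:

$$\mathbb{E}_{\Delta:\mathcal{L}(\mathcal{D}_m)}\left|W_d(\rho(\theta+\Delta), \mathcal{D}\setminus\mathcal{D}_m) - W_d(\rho(\theta+\Delta), \mathcal{D}_m)\right| \le 2C_d(\mathcal{D}_m, \mathcal{D}\setminus\mathcal{D}_m) + \max\left\{\delta(\mathcal{D}_m, \mathcal{D}\setminus\mathcal{D}_m), \gamma(\mathcal{D}_m, \mathcal{D}\setminus\mathcal{D}_m)\right\}.$$

Thus, the label distribution has an additional impact on the bound. If the label distributions are the same, then we have

$$\left|\mathbb{E}_{\Delta:\mathcal{L}(\mathcal{D}_m)}\left[W_d(\rho(\theta+\Delta), \mathcal{D}\setminus\mathcal{D}_m) - W_d(\rho(\theta+\Delta), \mathcal{D}_m)\right]\right| \le 2C_d(\mathcal{D}_m, \mathcal{D}\setminus\mathcal{D}_m),$$

which completes the proof.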
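To make the role of the conditional Wasserstein term more concrete, the following is a minimal sketch of an empirical, per-label Wasserstein-1 distance between two labelled samples. It is an illustration only and rests on assumptions that are not in the paper: features are scalars, the metric is $d(x, x') = |x - x'|$, both samples contain the same number of points for each shared label, and labels are weighted by their frequency in the client's sample. The function names are hypothetical.

```python
from collections import defaultdict


def wasserstein_1d(a, b):
    """Empirical Wasserstein-1 distance between two equal-size scalar samples.

    For equal-size samples on the real line, the optimal coupling matches
    the sorted values, so the distance is the mean absolute difference.
    """
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


def conditional_wasserstein(client, others):
    """Label-weighted average of per-label Wasserstein-1 distances.

    `client` and `others` are lists of (feature, label) pairs. Labels are
    weighted by their frequency in `client`; labels that do not occur in
    `others` are skipped (an assumption of this sketch).
    """
    by_label_client = defaultdict(list)
    by_label_others = defaultdict(list)
    for x, y in client:
        by_label_client[y].append(x)
    for x, y in others:
        by_label_others[y].append(x)

    total, weight = 0.0, 0
    for y, xs in by_label_client.items():
        if y in by_label_others:
            total += len(xs) * wasserstein_1d(xs, by_label_others[y])
            weight += len(xs)
    return total / weight if weight else 0.0
```

With identical conditional distributions the returned value is zero, which corresponds to the regime in which the bound places no penalty on the contribution of local training.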

B.2 JUSTIFICATION OF DECOUPLING GRADIENT VARIANCE

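The derivation of Equation 4. Because $\nabla f_m = [\nabla_{\theta_{low}} f_m, \nabla_{\theta_{high}} f_m] \in \mathbb{R}^{d}$, with $\nabla_{\theta_{low}} f_m \in \mathbb{R}^{d_l}$ and $\nabla_{\theta_{high}} f_m \in \mathbb{R}^{d_h}$ (so that $d = d_l + d_h$), we have

$$\begin{aligned}
&\mathbb{E}_{(x,y)\sim\mathcal{D}_m} \|\nabla f_m(\theta; x, y) - \nabla F(\theta)\|^2 \\
&= \mathbb{E}_{(x,y)\sim\mathcal{D}_m} \sum_{i=1}^{d} \left(\nabla f_m(\theta; x, y)^{(i)} - \nabla F(\theta)^{(i)}\right)^2 \\
&= \mathbb{E}_{(x,y)\sim\mathcal{D}_m} \left[\sum_{i=1}^{d_l} \left(\nabla f_m(\theta; x, y)^{(i)} - \nabla F(\theta)^{(i)}\right)^2 + \sum_{i=d_l+1}^{d_l+d_h} \left(\nabla f_m(\theta; x, y)^{(i)} - \nabla F(\theta)^{(i)}\right)^2\right] \\
&= \mathbb{E}_{(x,y)\sim\mathcal{D}_m} \left[\|\nabla_{\theta_{low}} f_m(\theta; x, y) - \nabla_{\theta_{low}} F(\theta)\|^2 + \|\nabla_{\theta_{high}} f_m(\theta; x, y) - \nabla_{\theta_{high}} F(\theta)\|^2\right].
\end{aligned} \tag{16}$$

The derivation of Equation 5. Assume a multi-layer neural network that consists of $n$ linear layers, each of which is followed by an activation function, and let the loss function be $\mathrm{CE}(\cdot)$. The forward function can be formulated as:

$$f(\theta, x) = \mathrm{CE}\left(\tau_n(\theta_n \tau_{n-1}(\theta_{n-1} \tau_{n-2}(\cdots \tau_1(\theta_1 x))))\right),$$

in which $\theta_l$, $\tau_l$, and $z_l$ are the weight, activation function, and output of the $l$-th layer, respectively, with $z_{l+1} = \theta_{l+1} \tau_l(z_l)$. Then the gradient with respect to the weight that receives $\tau_l(z_l)$ as input is:

$$\begin{aligned}
g_{l+1} = \frac{\partial f}{\partial \theta_{l+1}} &= \frac{\partial f}{\partial \tau_n(z_n)} \frac{\partial \tau_n(z_n)}{\partial z_n} \frac{\partial z_n}{\partial \tau_{n-1}(z_{n-1})} \frac{\partial \tau_{n-1}(z_{n-1})}{\partial z_{n-1}} \frac{\partial z_{n-1}}{\partial \tau_{n-2}(z_{n-2})} \cdots \frac{\partial \tau_{l+1}(z_{l+1})}{\partial z_{l+1}} \frac{\partial z_{l+1}}{\partial \theta_{l+1}} \\
&= \frac{\partial f}{\partial \tau_n(z_n)} \tau'_n(z_n) \theta_n \tau'_{n-1}(z_{n-1}) \theta_{n-1} \cdots \tau'_{l+1}(z_{l+1}) \tau_l(z_l) \\
&= \frac{\partial f}{\partial \tau_n(z_n)} \left(\prod_{i=l+2}^{n} \tau'_i(z_i) \theta_i\right) \tau'_{l+1}(z_{l+1}) \tau_l(z_l).
\end{aligned} \tag{18}$$

Thus, the gradient depends on the earlier layers only through the hidden feature $z_l$: if we directly input $z_l$, the gradient is independent of the data, hidden features, and weights that precede it.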
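The decomposition in Equation 4 is simply the statement that a squared Euclidean norm splits over disjoint coordinate blocks. The short sketch below checks this numerically for a single sample; the helper names and the way the flat gradient vector is split at index `d_low` are assumptions made for illustration.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def decoupled_dissimilarity(local_grad, global_grad, d_low):
    """Split the squared gradient gap into low-level and high-level parts.

    The first `d_low` coordinates are treated as the low-level parameters
    and the remaining ones as the high-level parameters.
    """
    low = squared_distance(local_grad[:d_low], global_grad[:d_low])
    high = squared_distance(local_grad[d_low:], global_grad[d_low:])
    return low, high
```

Summing the two returned values recovers `squared_distance(local_grad, global_grad)`, so reducing either block reduces the total dissimilarity.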

B.3 PROOF OF THEOREM 4.2

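We restate the optimization goal of using the private raw data $(x, y)$ of clients and the shared hidden features $\hat{h} \sim \mathcal{H}|y$ as follows:

$$\min_{\theta\in\mathbb{R}^d} \hat{F}(\theta) := \sum_{m=1}^{M} p_m \mathbb{E}_{(x,y)\sim\mathcal{D}_m,\, \hat{h}\sim\mathcal{H}} \hat{f}(\theta; x, \hat{h}, y) = \sum_{m=1}^{M} p_m \mathbb{E}_{(x,y)\sim\mathcal{D}_m,\, \hat{h}\sim\mathcal{H}} \left[f(\theta; x, y) + f(\theta; \hat{h}, y)\right].$$

Theorem B.2. Under the gradient variance measure CGV (Definition 3), with $\hat{n}_m$ satisfying $\frac{n_m}{n_m + \hat{n}_m} = \frac{N}{N + \hat{N}}$, the objective function $\hat{F}(\theta)$ causes a tighter bounded gradient dissimilarity, i.e.,

$$\mathrm{CGV}(\hat{F}, \theta) = \mathbb{E}_{(x,y)\sim\mathcal{D}_m} \left[\|\nabla_{\theta_{low}} f_m(\theta; x, y) - \nabla_{\theta_{low}} F(\theta)\|^2 + \frac{N^2}{(N + \hat{N})^2} \|\nabla_{\theta_{high}} f_m(\theta; x, y) - \nabla_{\theta_{high}} F(\theta)\|^2\right] \le \mathrm{CGV}(F, \theta).$$

Proof.

\begin{align}
\mathrm{CGV}(\hat{F}, \theta) &= \mathbb{E}_{(x,y)\sim\mathcal{D}_m,\, \hat{h}\sim\mathcal{H}} \|\nabla \hat{f}_m(\theta; x, \hat{h}, y) - \nabla \hat{F}(\theta)\|^2 \tag{22} \\
&= \mathbb{E}_{(x,y)\sim\mathcal{D}_m} \left[\|\nabla_{\theta_{low}} f_m(\theta; x, y) - \nabla_{\theta_{low}} F(\theta)\|^2\right] + \mathbb{E}_{(x,y)\sim\mathcal{D}_m,\, \hat{h}\sim\mathcal{H}} \left[\|\nabla_{\theta_{high}} f_m(\theta; x, y) + \nabla_{\theta_{high}} f_m(\theta; \hat{h}, y) - \nabla_{\theta_{high}} \hat{F}(\theta)\|^2\right]. \tag{23}
\end{align}

On the $m$-th client, the number of samples of $(x, y)$ is $n_m$ and the number of samples of $\hat{h}_m$ is $\hat{n}_m$. Then the high-level gradient variance becomes:

$$\begin{aligned}
&\mathbb{E}_{(x,y)\sim\mathcal{D}_m,\, \hat{h}\sim\mathcal{H}} \left\|\frac{n_m}{n_m + \hat{n}_m} \nabla_{\theta_{high}} f_m(\theta; x, y) + \frac{\hat{n}_m}{n_m + \hat{n}_m} \nabla_{\theta_{high}} f_m(\theta; \hat{h}, y) - \nabla_{\theta_{high}} \hat{F}(\theta)\right\|^2 \\
&= \mathbb{E}_{(x,y)\sim\mathcal{D}_m,\, \hat{h}\sim\mathcal{H}} \left\|\frac{n_m}{n_m + \hat{n}_m} \nabla_{\theta_{high}} f_m(\theta; x, y) + \frac{\hat{n}_m}{n_m + \hat{n}_m} \nabla_{\theta_{high}} f_m(\theta; \hat{h}, y) - \sum_{m=1}^{M} \frac{n_m + \hat{n}_m}{N + \hat{N}} \left(\frac{n_m}{n_m + \hat{n}_m} \nabla_{\theta_{high}} f_m(\theta; x, y) + \frac{\hat{n}_m}{n_m + \hat{n}_m} \nabla_{\theta_{high}} f_m(\theta; \hat{h}, y)\right)\right\|^2 \\
&= \mathbb{E}_{(x,y)\sim\mathcal{D}_m} \left\|\frac{n_m}{n_m + \hat{n}_m} \nabla_{\theta_{high}} f_m(\theta; x, y) - \sum_{m=1}^{M} \frac{n_m}{N + \hat{N}} \nabla_{\theta_{high}} f_m(\theta; x, y)\right\|^2 \\
&= \frac{N^2}{(N + \hat{N})^2} \mathbb{E}_{(x,y)\sim\mathcal{D}_m} \left\|\nabla_{\theta_{high}} f_m(\theta; x, y) - \sum_{m=1}^{M} \frac{n_m}{N} \nabla_{\theta_{high}} f_m(\theta; x, y)\right\|^2.
\end{aligned} \tag{24}$$

Combining Equations 24 and 22, we obtain

$$\mathrm{CGV}(\hat{F}, \theta) = \mathbb{E}_{(x,y)\sim\mathcal{D}_m} \left[\|\nabla_{\theta_{low}} f_m(\theta; x, y) - \nabla_{\theta_{low}} F(\theta)\|^2 + \frac{N^2}{(N + \hat{N})^2} \|\nabla_{\theta_{high}} f_m(\theta; x, y) - \nabla_{\theta_{high}} F(\theta)\|^2\right],$$

which completes the proof.

Regarding convergence, there have been many analyses of FedAvg from a gradient dissimilarity viewpoint (Woodworth et al., 2020; Lian et al., 2017; Karimireddy et al., 2020). Specifically, the convergence rate is upper bounded by many factors, among which the gradient dissimilarity plays a crucial role. In this work, we propose a novel approach, inspired by the generalization view, to reduce the gradient dissimilarity; we thus provide a tighter bound on the convergence rate. This is consistent with our experiments; see Table 2.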
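The scaling factor $N^2/(N+\hat{N})^2$ in Theorem B.2 can be checked on a toy example. The sketch below is an illustration under simplifying assumptions that are not taken from the paper: high-level gradients are scalars, every client sees the same gradient on the shared features (because the features come from the common distribution), the number of shared samples per client satisfies the theorem's ratio condition, and the dissimilarity is averaged over clients with weights $n_m/N$. The function name and its arguments are hypothetical.

```python
def high_level_dissimilarity(client_grads, client_sizes, shared_grad, n_shared_total):
    """Compare gradient dissimilarity with and without shared features.

    Returns (plain, mixed, factor), where `plain` is the weighted squared
    deviation of the private-data gradients, `mixed` is the same quantity
    after each client blends in the shared-feature gradient, and `factor`
    is N^2 / (N + N_hat)^2.
    """
    n_total = sum(client_sizes)
    # Under the theorem's condition, every client uses the same blend ratio.
    ratio = n_total / (n_total + n_shared_total)
    weights = [n / n_total for n in client_sizes]

    mixed_grads = [ratio * g + (1.0 - ratio) * shared_grad for g in client_grads]
    plain_mean = sum(w * g for w, g in zip(weights, client_grads))
    mixed_mean = sum(w * g for w, g in zip(weights, mixed_grads))

    plain = sum(w * (g - plain_mean) ** 2 for w, g in zip(weights, client_grads))
    mixed = sum(w * (g - mixed_mean) ** 2 for w, g in zip(weights, mixed_grads))
    return plain, mixed, ratio ** 2
```

In this toy setting the shared-feature gradient cancels between each client and the global average, so `mixed` equals `factor * plain`, mirroring the last step of the proof.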

C MORE RELATED WORK

C.1 ADDRESSING NON-IID PROBLEM IN FL

The convergence and generalization performance of Federated Learning (FL) (McMahan et al., 2017) suffer from the heterogeneous data distributions across clients (Zhao et al., 2018; Li et al., 2020b; Kairouz et al., 2019). A severe divergence exists between the local objective functions of clients, making the local models of FL diverge (Li et al., 2020a; Karimireddy et al., 2020); this is called client drift. Although researchers have designed many new optimization methods to address this problem, it remains open: the performance of federated learning under severe Non-IID data distributions is still far behind that of centralized training. The previous methods that address the Non-IID problem can be classified into the following directions.

Model Regularization focuses on calibrating the local models so that they do not move excessively far away from the server model. A number of works (Li et al., 2020a; Acar et al., 2021; Karimireddy et al., 2020) add a regularizer on the local-global model difference. FedProx (Li et al., 2020a) adds a penalty on the L2 distance between the local models and the server model. SCAFFOLD (Karimireddy et al., 2020) utilizes history information to correct the local updates of clients. FedDyn (Acar et al., 2021) dynamically updates the risk objective to ensure that the device optima are asymptotically consistent. FedIR (Hsu et al., 2020) applies importance weights to the clients' local objectives to obtain an unbiased estimator of the loss. MOON (Li et al., 2021b) adds a local-global contrastive loss to learn similar representations across clients. CCVR (Luo et al., 2021) transmits the statistics of logits and the label information of data samples to calibrate the classifier.

Reducing Gradient Variance tries to correct the local update directions of clients using other gradient information. Methods of this kind (Wang et al., 2020a; Hsu et al., 2019; Reddi et al., 2021) aim to accelerate and stabilize convergence. FedNova (Wang et al., 2020a) normalizes the local updates to eliminate the inconsistency between the local and global optimization objective functions. FedAvgM (Hsu et al., 2019) exploits the history updates of the server model to rectify clients' updates. FEDOPT (Reddi et al., 2021) proposes a unified framework of FL: it treats the clients' updates as the gradients in centralized training, so that optimization methods from centralized training generalize to FL. FedAdaGrad and FedAdam are the FL versions of AdaGrad and Adam.

Sharing Features. Personalized Federated Learning aims to let clients optimize different personal models that learn knowledge from other clients while adapting to their own datasets (Tan et al., 2022). The knowledge transfer of personalization is mainly implemented by introducing personalized parameters (Liang et al., 2020; Thapa et al., 2020; Li et al., 2021a) or by knowledge distillation (He et al., 2020a; Lin et al., 2020; Li & Wang, 2019; Bistritz et al., 2020) on shared local features or extra datasets. Because they favor optimizing local objective functions, however, personalized federated models do not achieve generic performance (evaluated on a global test dataset) comparable to that of normal FL (Chen & Chao, 2021). Our main goal is to learn a better generic model. Thus, we omit comparisons to personalized FL algorithms. Apart from Personalized Federated Learning, some other works propose to share features to improve federated learning. Cronus (Chang et al., 2019) proposes sharing the logits to defend against poisoning attacks. CCVR (Luo et al., 2021) transmits the logit statistics of data samples to calibrate the last layer of federated models. CCVR (Luo et al., 2021) also shares the parameters of the local feature distribution. However, we do not need to share the number of samples of each label with the server, which protects the privacy of the clients' label distributions. Moreover, our method acts as a framework for exploiting shared features to reduce gradient dissimilarity. The feature estimator does not need to be a Gaussian distribution of local features. One may utilize other estimators, or even features of some extra datasets rather than the private ones.

Sharing Data. The original cause of client drift is data heterogeneity. Some researchers find that sharing a part of the private data can significantly improve the convergence speed and generalization performance (Zhao et al., 2018), yet it sacrifices the privacy of clients' data. Thus, to both reduce data heterogeneity and protect data privacy, a series of works (Hardt & Rothblum, 2010; Hardt et al., 2012; Chatalic et al., 2021; Johnson et al., 2018; Cai et al., 2021) add noise to the data so that it can be shared with some degree of privacy guarantee. Some other works focus on sharing a part of synthetic data (Jeong et al., 2018; Long et al., 2021; Goetz & Tewari, 2020; Hao et al., 2021) or data statistics (Shin et al., 2020; Yoon et al., 2021), rather than raw data, to help reduce data heterogeneity. FedDF (Lin et al., 2020) utilizes other data and conducts knowledge distillation on these data to transfer knowledge between the server and client models. The core idea of FedDF is to fine-tune the aggregated model via knowledge distillation with the newly shared data.

C.2 MEASURING CONTRIBUTION FROM CLIENTS

Generalization Contribution. Clients are only willing to participate in FL training when given enough rewards. Thus, it is important to measure their contributions to the model performance (Yu et al., 2020; Ng et al., 2020; Liu et al., 2022; Sim et al., 2020). Several works (Yuan et al., 2022; Yu et al., 2020; Ng et al., 2020; Liu et al., 2022; Sim et al., 2020) have been proposed to measure the generalization contribution of clients in FL. Yuan et al. (2022) propose to experimentally measure the performance gaps on unseen client distributions. Data Shapley (Ghorbani & Zou, 2019; Yu et al., 2020) is proposed to measure the generalization performance gain of client participation. Liu et al. (2022) improve the calculation efficiency of Data Shapley. Other work proposes to measure the contribution by learning-based methods (Zhan et al., 2020). Our questions differ from these works. Precisely, these works measure the generalization performance gap with or without some clients that never join the collaborative training. In contrast, we hope to understand the contribution of clients at each communication round. Based on this understanding, we can further improve FL training and obtain better generalization performance. It has been empirically verified that a large number of selected clients introduces new challenges to the optimization and generalization of FL (Charles et al., 2021), although some theoretical works show the benefits from it (Yang et al., 2020). This encourages us to understand what happens during local training and aggregation.

Client Selection. Several works (Cho et al., 2020; Goetz et al., 2019; Ribero & Vikalo, 2020; Lai et al., 2021) propose new algorithms to select clients strategically rather than randomly. However, these methods only consider the hardware resources or the local generalization ability. How local training affects the global generalization ability has not been explored.

Table 4.
Work | Shared Thing | Low-level Model | Objective
(Chatalic et al., 2021; Cai et al., 2021) | Raw Data With Noise | Shared | Others
(Long et al., 2021; Hao et al., 2021) | Params. of Data Generator | Shared | Global Model Performance
(Yoon et al., 2021; Shin et al., 2020) | STAT. of Raw Data | Shared | Global Model Performance
(Luo et al., 2021) | STAT. of Logits, Label Distribution | Shared | Global Model Performance
(Chang et al., 2019) | Hidden Features | Shared | Defend Poisoning Attack
(Li & Wang, 2019; Bistritz et al., 2020) | Logits | Private | Personalized FL
(He et al., 2020a; Liang et al., 2020) | Hidden Features | Private | Personalized FL
(Thapa et al., 2020; Oh et al., 2022) | Hidden | |

C.3 SPLIT TRAINING

To train neural networks efficiently, split training has been proposed as an alternative to end-to-end training; it breaks the forward, backward, or model-updating dependency between the layers of a neural network. To break the backward dependency on subsequent layers, hidden features can be forwarded to another loss function to obtain Local Error Signals (Marquez et al., 2018; Nøkland & Eidnes, 2019; Löwe et al., 2019; Wang et al., 2020b; Zhuang et al., 2021). How to design a suitable local error remains an open problem. Some works propose to utilize extra modules to synthesize gradients (Jaderberg et al., 2017), so that the backward passes and updates of different layers can be decoupled. Features Replay (Huo et al., 2018) reloads the history features of the preceding layers into the next layers. By reusing the history features, the computation on different layers can be conducted asynchronously. Some works propose Split FL (SFL), which utilizes split training to accelerate federated learning (Oh et al., 2022; Thapa et al., 2020). In SFL, the model is split into client-side and server-side parts. At each communication round, the client only downloads the client-side model from the server, conducts forward propagation, and sends the hidden features to the server, which computes the loss and performs backward propagation. This method aims to accelerate FL training on the client side and cannot support local updates. In addition, sending all raw features could introduce a high data privacy risk. Thus, we omit comparisons to these methods. We demystify different FL algorithms related to the shared features in Table 4.

D DETAILS OF EXPERIMENT CONFIGURATION AND ADDITIONAL EXPERIMENTS

D.1 HARDWARE AND SOFTWARE CONFIGURATION

We conduct experiments using a GeForce RTX 2080 Ti GPU and an Intel(R) Xeon(R) Gold 5115 CPU @ 2.40GHz. The operating system is Ubuntu 16.04.6 LTS. The PyTorch version is 1.8.1. The CUDA version is 10.2.

D.2 HYPER-PARAMETERS

The learning rate configuration is listed in Table 5. We report the best results and their learning rates (grid search in {0.0001, 0.001, 0.01, 0.1, 0.3}). For all experiments, we use SGD as the optimizer, with a batch size of 128 and a weight decay of 0.0001. Note that we set the momentum to 0 for baselines, as we find that a momentum of 0.9 may harm the convergence and performance of FedAvg in severe Non-IID situations. We also report the best test accuracy of baselines trained with a momentum of 0.9 in Table 6. Client-side momentum in FL training does not always yield better convergence, because the momentum introduces larger local updates and thus increases the client drift, which is also observed in a recent benchmark (He et al., 2021). Server-side momentum (Hsu et al., 2019) may improve the performance. The compared algorithms, including FedAvg, FedProx, SCAFFOLD, and FedNova, do not use server-side momentum. For fair comparison, we do not use server-side momentum for any algorithm. For K = 10 and K = 100, the maximum communication round is 1000; for K = 10 and E = 5, the maximum communication round is 400 (because E = 5 increases the computation cost). The number of clients selected per round is 5 for K = 10 and 10 for K = 100.

Table 5: Learning rate of all experiments.

More Results of the Layer-wise Divergence. We conduct more experiments on the layer divergence of FedAvg with different datasets, including FMNIST, SVHN, and CIFAR-100, training with ResNet-18 and ResNet-50. As Figures 3 and 9 show, the divergence of the low-level model shrinks faster than that of the high-level model. Thus, reducing the high-level gradient dissimilarity is more crucial than reducing the low-level one.



A similar definition is used in the literature (Franceschi et al., 2018).

The pseudo gradient at round $r$ is calculated as $\Delta_r = \theta^{r-1}_{T} - \theta^{r-1}_{0}$ with the maximum local iterations $T$.

We reuse $f$ here for brevity; the input of $f$ can be the input $x$ or the hidden feature $h = \varphi_{\theta_{low}}(x)$.

Due to the high instability of training with severe data heterogeneity, we show the actual test accuracy as semitransparent lines and the smoothed test accuracy as opaque lines for better visualization. We also provide more convergence figures in Appendix D.



y) is the local objective function of client $m$ with $f(\theta; x, y) = \mathrm{CE}(\rho(\theta; x), y)$, CE denotes the cross-entropy loss, $p_m > 0$ and $\sum_{m=1}^{M} p_m = 1$. Usually, $p_m$ is set as $\frac{n_m}{N}$, where $n_m$ denotes the number of samples on client $m$ and $N = \sum_{m=1}^{M} n_m$. The clients usually have a low communication bandwidth, causing extremely long training time. To address this issue, the classical FL algorithm FedAvg

Here, $d_l$ and $d_h$ represent the dimensions of the parameters $\theta_{low}$ and $\theta_{high}$, respectively.

Specifically, client $m$ can estimate its own hidden feature distribution as $\mathcal{H}_m$ using the local hidden features $h = \varphi_{\theta_{low}}(x)|_{(x,y)\sim\mathcal{D}_m}$ and send $\mathcal{H}_m$ to the server for the global distribution approximation. Then, the server aggregates the received distributions to obtain the global feature distribution $\mathcal{H}$ and broadcasts it, similar to the model averaging in FedAvg. Finally, the classifier networks of all clients perform local training on both the local hidden features $h|_{(x,y)\sim\mathcal{D}_m}$ and the shared $\mathcal{H}$.
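The estimate-then-aggregate step described above can be sketched as follows. This is a minimal illustration and not the paper's implementation: it assumes scalar hidden features, a Gaussian estimator summarized by a mean and a variance, no conditioning on the label, and an aggregation rule that pools the first and second moments weighted by sample counts. The text does not specify the aggregation rule, so moment pooling is a choice made for this sketch, and the function names are hypothetical.

```python
def local_stats(features):
    """Client side: summarize local hidden features by (count, mean, variance)."""
    n = len(features)
    mean = sum(features) / n
    var = sum((h - mean) ** 2 for h in features) / n
    return n, mean, var


def aggregate_stats(client_stats):
    """Server side: pool per-client (count, mean, variance) summaries.

    Pooling the first and second moments with sample-count weights yields
    the mean and variance of the union of all client features, without
    the server ever seeing an individual feature.
    """
    total = sum(n for n, _, _ in client_stats)
    mean = sum(n * mu for n, mu, _ in client_stats) / total
    second_moment = sum(n * (var + mu ** 2) for n, mu, var in client_stats) / total
    return total, mean, second_moment - mean ** 2
```

Each client would send only its three summary numbers, and the server would broadcast the pooled pair (mean, variance), from which clients can sample shared features for high-level training.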

server input: initial $\theta^0$, maximum communication round $R$
client $m$'s input: local iterations $T$
Initialization: server distributes the initial model $\theta^0$ and the initial global $\mathcal{H}^0$ to all clients.
Server_Executes:
  for each round $r = 0, 1, \cdots, R$ do
    server samples a set of clients $S_r \subseteq \{1, ..., M\}$.
    server communicates $\theta^r$ and $\mathcal{H}^r$ to all clients $m \in S_r$.

Figure 2: CIFAR10 with a = 0.1, E = 1, M = 10.

Figure 3: Layer divergence of FedAvg.

21) Theorem B.2. Under the gradient variance measure CGV (Definition 3), with nm satisfying nm nm+nm = N N + N , the objective function F (θ) causes a tighter bounded gradient dissimilarity, i.e., the CGV( F

Best test accuracy (%) of all experimental results.

Communication Round to attain the target accuracy.

Our method works well with all datasets, demonstrating excellent scalability with more clients.

Splitting at different layers.

Demystifying different FL algorithms related to the sharing data and features.

"STAT." means statistic information, like mean or standard deviation, "Feat." means hidden features, "Params." means parameters.

D.3 ADDITIONAL EXPERIMENTS

Training with Longer Time. To demonstrate the difficulty of optimizing FedAvg in a heterogeneous-data environment, we show the results of training for 10000 rounds in Figure 8 (a). During these 10000 rounds, the highest test accuracy of FedAvg with a fixed learning rate is 88.5%, and that of FedAvg with a decayed learning rate is 82.65%. Note that we decay the learning rate exponentially at each communication round, with rate 0.997. Even after 2000 rounds, the learning rate is only about 0.0025 times the original learning rate.

Sharing Estimated Parameters with Noise. To enhance the security of the shared feature distribution, we add noise $\epsilon \sim \mathcal{N}(0, \mu_\epsilon)$ to $\sigma_m$ and $\mu_m$. A larger $\mu_\epsilon$ enhances the degree of privacy. We show the results of our method with different $\mu_\epsilon$ in Figure 8 (b) and Table 7. The results show that, even under high perturbation of the estimated parameters, our method attains both high privacy and generalization gains.

More Experiments on Real-world Datasets. To verify the effect of our method on real-world FL datasets, we conduct experiments with Federated EMNIST (FEMNIST) (Caldas et al., 2018; He et al., 2020b), which has 3400 users, 671585 training samples, and 77483 testing samples. We sample 20 clients per round and conduct local training for 10 epochs. We search the learning rate in {0.01, 0.05, 0.1} and find that 0.05 is the best for all algorithms. Figure 10 and Table 8 show that our method converges faster and attains better generalization performance than other methods. Note that SCAFFOLD is not included in these experiments, as simulating 3400 clients with few machines places a very high requirement on it (storing the control variates).
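The effect of the per-round exponential learning-rate decay mentioned under "Training with Longer Time" can be checked with a one-line computation. The sketch below assumes the schedule multiplies the learning rate by a constant rate once per communication round; the function name is hypothetical.

```python
def decayed_lr_factor(rate, rounds):
    """Multiplicative factor applied to the initial learning rate.

    Assumes the learning rate is multiplied by `rate` once per
    communication round, i.e. lr_r = lr_0 * rate ** r.
    """
    return rate ** rounds


# Fraction of the initial learning rate that remains after 2000 rounds
# when the per-round decay rate is 0.997.
remaining = decayed_lr_factor(0.997, 2000)
```

Because the factor shrinks geometrically, most of the 10000 rounds are run with a learning rate that is orders of magnitude below its initial value, which is one reason a decayed schedule can stall under severe heterogeneity.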

