TOWARDS UNDERSTANDING AND MITIGATING DIMENSIONAL COLLAPSE IN HETEROGENEOUS FEDERATED LEARNING

Abstract

Federated learning aims to train models collaboratively across different clients without sharing data for privacy considerations. However, one major challenge for this learning paradigm is the data heterogeneity problem, which refers to the discrepancies between the local data distributions among various clients. To tackle this problem, we first study how data heterogeneity affects the representations of the globally aggregated models. Interestingly, we find that heterogeneous data results in the global model suffering from severe dimensional collapse, in which representations tend to reside in a lower-dimensional space instead of the ambient space. Moreover, we observe a similar phenomenon on models locally trained on each client and deduce that the dimensional collapse on the global model is inherited from local models. In addition, we theoretically analyze the gradient flow dynamics to shed light on how data heterogeneity results in dimensional collapse for local models. To remedy this problem caused by the data heterogeneity, we propose FEDDECORR, a novel method that can effectively mitigate dimensional collapse in federated learning. Specifically, FEDDECORR applies a regularization term during local training that encourages different dimensions of representations to be uncorrelated. FEDDECORR, which is implementation-friendly and computationally efficient, yields consistent improvements over baselines on standard benchmark datasets. Code: https://github.com/bytedance/FedDecorr.

1. INTRODUCTION

With the rapid development of deep learning and the availability of large amounts of data, concerns regarding data privacy have been attracting increasing attention from industry and academia. To address this concern, McMahan et al. (2017) propose Federated Learning, a decentralized training paradigm enabling collaborative training across different clients without sharing data. One major challenge in federated learning is the potential discrepancies in the distributions of local training data among clients, which is known as the data heterogeneity problem. In particular, this paper focuses on the heterogeneity of label distributions (see Fig. 1(a) for an example). Such discrepancies can result in drastic disagreements between the local optima of the clients and the desired global optimum, which may lead to severe performance degradation of the global model. Previous works attempting to tackle this challenge mainly focus on the model parameters, either during local training (Li et al., 2020; Karimireddy et al., 2020) or global aggregation (Wang et al., 2020b). However, these methods usually result in an excessive computation burden or high communication costs (Li et al., 2021a) because deep neural networks are typically heavily over-parameterized. In contrast, in this work, we focus on the representation space of the model and study the impact of data heterogeneity. To commence, we study how heterogeneous data affects the global model in federated learning in Sec. 3.1. Specifically, we compare representations produced by global models trained under different degrees of data heterogeneity. Since the singular values of the covariance matrix provide a comprehensive characterization of the distribution of high-dimensional embeddings, we use them to study the representations output by each global model. Interestingly, we find that as the degree of data heterogeneity increases, more singular values tend to evolve towards zero.
This observation suggests that stronger data heterogeneity causes the trained global model to suffer from more severe dimensional collapse, whereby representations are biased towards residing in a lower-dimensional space (or manifold). A graphical illustration of how heterogeneous training data affect output representations is shown in Fig. 1(b-c). Our observations suggest that dimensional collapse might be one of the key reasons why federated learning methods struggle under data heterogeneity. Essentially, dimensional collapse is a form of oversimplification in terms of the model, where the representation space is not being fully utilized to discriminate diverse data of different classes. Given the observations made on the global model, we conjecture that the dimensional collapse of the global model is inherited from models locally trained on various clients. This is because the global model is a result of the aggregation of local models. To validate our conjecture, we further visualize the local models in terms of the singular values of representation covariance matrices in Sec. 3.2. Similar to the visualization on the global model, we observe dimensional collapse on representations produced by local models. With this observation, we establish the connection between dimensional collapse of the global model and local models. To further understand the dimensional collapse on local models, we analyze the gradient flow dynamics of local training in Sec. 3.3. Interestingly, we show theoretically that heterogeneous data drive the weight matrices of the local models to be biased towards being low-rank, which further results in dimensional collapse of the representations. Inspired by the observation that dimensional collapse of the global model stems from local models, we consider mitigating dimensional collapse during local training in Sec. 4. In particular, we propose a novel federated learning method termed FEDDECORR.
FEDDECORR adds a regularization term during local training to encourage the Frobenius norm of the correlation matrix of representations to be small. We show theoretically and empirically that this proposed regularization term can effectively mitigate dimensional collapse (see Fig. 1(d) for an example). Next, in Sec. 5, through extensive experiments on standard benchmark datasets including CIFAR10, CIFAR100, and TinyImageNet, we show that FEDDECORR consistently improves over baseline federated learning methods. In addition, we find that FEDDECORR yields more dramatic improvements in more challenging federated learning setups such as stronger heterogeneity or a larger number of clients. Lastly, FEDDECORR has extremely low computation overhead and can be built on top of any existing federated learning baseline method, which makes it widely applicable. Our contributions are summarized as follows. First, we discover through experiments that stronger data heterogeneity in federated learning leads to greater dimensional collapse for global and local models. Second, we develop a theoretical understanding of the dynamics behind our empirical discovery that connects data heterogeneity and dimensional collapse. Third, based on the motivation of mitigating dimensional collapse, we propose a novel method called FEDDECORR, which yields consistent improvements while being implementation-friendly and computationally efficient.

[...] suffers from unstable and slow convergence, resulting in performance degradation. To tackle this challenge, previous works either improve local training (Li et al., 2021b; 2020; Karimireddy et al., 2020; Acar et al., 2021; Al-Shedivat et al., 2020; Wang et al., 2021) or global aggregation (Wang et al., 2020b; Hsu et al., 2019; Luo et al., 2021; Wang et al., 2020a; Lin et al., 2020; Reddi et al., 2020).
Most of these methods focus on the model parameter space, which may result in high computation or communication cost due to deep neural networks being overparameterized. Li et al. (2021b) focus on model representations and use a contrastive loss to maximize agreements between representations of local models and the global model. However, one drawback of Li et al. (2021b) is that it requires additional forward passes during training, which almost doubles the training cost. In this work, based on our study of how data heterogeneity affects model representations, we propose an effective yet highly efficient method to handle heterogeneous data. Another research trend is in personalized federated learning (Arivazhagan et al., 2019; Li et al., 2021c; Fallah et al., 2020; T Dinh et al., 2020; Hanzely et al., 2020; Huang et al., 2021; Zhang et al., 2020), which aims to train personalized local models for each client. In this work, however, we focus on the typical setting that aims to train one global model for all clients.

Dimensional Collapse. Dimensional collapse of representations has been studied in metric learning (Roth et al., 2020), self-supervised learning (Jing et al., 2021), and class incremental learning (Shi et al., 2022). In this work, we focus on federated learning and discover that stronger data heterogeneity causes a higher degree of dimensional collapse for locally trained models. To the best of our knowledge, this work is the first to discover and analyze dimensional collapse of representations in federated learning.

Gradient Flow Dynamics. Arora et al. (2018; 2019) introduce the gradient flow dynamics framework to analyze the dynamics of multi-layer linear neural networks under the ℓ2-loss and find that deeper neural networks are biased towards low-rank solutions during optimization. Following their works, Jing et al.
(2021) find two factors that cause dimensional collapse in self-supervised learning, namely strong data augmentation and implicit regularization from depth. In contrast, we focus on federated learning with the cross-entropy loss. More importantly, our analysis focuses on dimensional collapse caused by data heterogeneity in federated learning instead of the depth of neural networks.

Feature Decorrelation. Feature decorrelation has been used for different purposes, such as preventing mode collapse in self-supervised learning (Bardes et al., 2021; Zbontar et al., 2021; Hua et al., 2021), boosting generalization (Cogswell et al., 2015; Huang et al., 2018; Xiong et al., 2016), and improving class incremental learning (Shi et al., 2022). We instead apply feature decorrelation to counter the undesired dimensional collapse caused by data heterogeneity in federated learning.

3. DIMENSIONAL COLLAPSE CAUSED BY DATA HETEROGENEITY

[...] covariance matrix $\Sigma = \frac{1}{N}\sum_{i=1}^{N}(z_i - \bar{z})(z_i - \bar{z})^\top$ of the representations over the $N$ test data points in CIFAR100. Here $z_i$ is the i-th test data point and $\bar{z} = \frac{1}{N}\sum_{i=1}^{N} z_i$ is their average. Finally, we apply the singular value decomposition (SVD) on each of the covariance matrices and visualize the top 100 singular values in Fig. 2(a). If we define a small value τ as the threshold for a singular value to be significant (e.g., log τ = -2), we observe that for the homogeneous setting, almost all the singular values are significant, i.e., they surpass τ. However, as α decreases, the number of singular values exceeding τ monotonically decreases. This implies that with stronger heterogeneity among local training data, the representation vectors produced by the trained global model tend to reside in a lower-dimensional space, corresponding to more severe dimensional collapse.
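The spectrum analysis described above can be sketched in a few lines of NumPy. This is only an illustrative sketch on synthetic data: the function names, the synthetic "full" and "collapsed" representations, and the choice of a base-10 logarithm for the threshold are assumptions made here, not details taken from the paper.

```python
import numpy as np

def covariance_singular_values(z):
    """Singular values (descending) of the covariance matrix of representations.

    z: array of shape (N, d), one representation vector per row.
    """
    z_centered = z - z.mean(axis=0, keepdims=True)
    cov = z_centered.T @ z_centered / z.shape[0]  # (d, d), 1/N normalization
    return np.linalg.svd(cov, compute_uv=False)

def count_significant(singular_values, log10_tau=-2.0):
    """Count singular values above tau (base-10 log is an assumption of this sketch)."""
    return int(np.sum(singular_values > 10.0 ** log10_tau))

rng = np.random.default_rng(0)
N, d = 2000, 32
# Synthetic stand-ins for model representations (hypothetical data).
full = rng.normal(size=(N, d))                                  # uses all d dimensions
collapsed = rng.normal(size=(N, 4)) @ rng.normal(size=(4, d))   # confined to 4 dimensions

print("significant dims (full):     ", count_significant(covariance_singular_values(full)))
print("significant dims (collapsed):", count_significant(covariance_singular_values(collapsed)))
```

In this toy setting the collapsed representations have far fewer singular values above the threshold, which is the signature the section looks for as α decreases.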

3.2. EMPIRICAL OBSERVATIONS ON LOCAL MODELS

Since the global model is obtained by aggregating locally trained models on each client, we conjecture that the dimensional collapse observed on the global model stems from the dimensional collapse of local models. To further validate our conjecture, we continue to study whether increasing data heterogeneity will also lead to more severe dimensional collapse on locally trained models. Specifically, for different α's, we visualize the locally trained model of one client (visualizations on local models of other clients are similar and are provided in Appendix E). Following the same procedure as in Sec. 3.1, we plot the singular values of covariance matrices of representations produced by the local models. We observe from Fig. 2(b) that locally trained models demonstrate the same trend as the global models, namely that the presence of stronger data heterogeneity causes more severe dimensional collapse. These experiments corroborate that the global model inherits the adverse dimensional collapse phenomenon from the local models.

3.3. A THEORETICAL EXPLANATION FOR DIMENSIONAL COLLAPSE

Based on the empirical observations in Sec. 3.1 and Sec. 3.2, we now develop a theoretical understanding to explain why heterogeneous training data causes dimensional collapse for the learned representations. Since we have established that the dimensional collapse of the global model stems from local models, we focus on studying local models in this section. Without loss of generality, we study local training of one arbitrary client. Specifically, we first analyze the gradient flow dynamics of the model weights during local training. This analysis shows how heterogeneous local training data drives the model weights towards being low-rank, which leads to dimensional collapse for the representations.

3.3.1. SETUPS AND NOTATIONS

We denote the number of training samples as $N$, the dimension of input data as $d_{\mathrm{in}}$, and the total number of classes as $C$. The i-th sample is denoted as $X_i \in \mathbb{R}^{d_{\mathrm{in}}}$, and its corresponding one-hot encoded label is $y_i \in \mathbb{R}^{C}$. The collection of all $N$ training samples is denoted as $X = [X_1, X_2, \ldots, X_N] \in \mathbb{R}^{d_{\mathrm{in}} \times N}$. For simplicity in exposition, we follow Arora et al. (2018; 2019) and Jing et al. (2021) and analyze linear neural networks (without nonlinear activation layers). We consider an $(L+1)$-layer (where $L \geq 1$) linear neural network trained using the cross-entropy loss under gradient flow (i.e., gradient descent with an infinitesimally small learning rate). The weight matrix of the i-th layer ($i \in [L+1]$) at the optimization time step $t$ is denoted as $W_i(t)$. The dynamics can be expressed as

$$\dot{W}_i(t) = -\frac{\partial}{\partial W_i}\,\ell\big(W_1(t), \ldots, W_{L+1}(t)\big), \tag{1}$$

where $\ell$ denotes the cross-entropy loss. In addition, at the optimization time step $t$ and given the input data $X_i$, we denote $z_i(t) \in \mathbb{R}^{d}$ as the output representation vector ($d$ being the dimension of the representations) and $\gamma_i(t) \in \mathbb{R}^{C}$ as the output softmax probability vector. We have

$$\gamma_i(t) = \mathrm{softmax}\big(W_{L+1}(t)\, z_i(t)\big) = \mathrm{softmax}\big(W_{L+1}(t)\, W_L(t) \cdots W_1(t)\, X_i\big). \tag{2}$$

We define $\mu_c = N_c / N$, where $N_c$ is the number of data samples belonging to class $c$. We denote $e_c$ as the $C$-dimensional one-hot vector where only the c-th entry is 1 (and the others are 0). In addition, let $\bar{\gamma}_c(t) = \frac{1}{N_c}\sum_{i=1}^{N} \gamma_i(t)\,\mathbb{1}\{y_i = e_c\}$ and $\bar{X}_c = \frac{1}{N_c}\sum_{i=1}^{N} X_i\,\mathbb{1}\{y_i = e_c\}$.

3.3.2. ANALYSIS ON GRADIENT FLOW DYNAMICS

Since our goal is to analyze model representations $z_i(t)$, we focus on weight matrices that directly produce representations (i.e., the first $L$ layers). We denote the product of the weight matrices of the first $L$ layers as $\Pi(t) = W_L(t)\, W_{L-1}(t) \cdots W_1(t)$ and analyze the behavior of $\Pi(t)$ under the gradient flow dynamics. In particular, we derive the following result for the singular values of $\Pi(t)$.

Theorem 1 (Informal). Assume that the mild conditions stated in Appendix A.3 hold, and let $\sigma_k(t)$ for $k \in [d]$ be the k-th largest singular value of $\Pi(t)$. Then,

$$\dot{\sigma}_k(t) = N L\, (\sigma_k(t))^{2 - \frac{2}{L}} \sqrt{(\sigma_k(t))^{\frac{2}{L}} + M}\; (u_{L+1,k}(t))^\top G(t)\, v_k(t), \tag{3}$$

where $u_{L+1,k}(t)$ is the k-th left singular vector of $W_{L+1}(t)$, $v_k(t)$ is the k-th right singular vector of $\Pi(t)$, $M$ is a constant, and $G(t)$ is defined as

$$G(t) = \sum_{c=1}^{C} \mu_c \big(e_c - \bar{\gamma}_c(t)\big)\, \bar{X}_c^\top, \tag{4}$$

where $\mu_c$, $e_c$, $\bar{\gamma}_c(t)$, and $\bar{X}_c$ are defined after Eqn. (2).

The proof of the precise version of Theorem 1 is provided in Appendix A. Based on Theorem 1, we are able to explain why greater data heterogeneity causes $\Pi(t)$ to be biased towards becoming lower-rank. Note that strong data heterogeneity causes the local training data of one client to be highly imbalanced in terms of the number of data samples per class (recall Fig. 1(a)). This implies that $\mu_c$, which is the proportion of the class $c$ data, will be close to 0 for some classes. Next, based on the definition of $G(t)$ in Eqn. (4), more $\mu_c$'s being close to 0 leads to $G(t)$ being biased towards a low-rank matrix. If this is so, the term $(u_{L+1,k}(t))^\top G(t)\, v_k(t)$ in Eqn. (3) will only be significant (large in magnitude) for fewer values of $k$. This is because $u_{L+1,k}(t)$ and $v_k(t)$ are both singular vectors, which are orthogonal among different $k$'s. This further leads to $\dot{\sigma}_k(t)$ on the left-hand side of Eqn. (3), which is the rate of change of $\sigma_k$, being small for most of the $k$'s throughout training.
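To make the low-rank argument about $G(t)$ concrete, the following NumPy sketch builds $G$ from its definition in Eqn. (4) for one balanced and one extremely imbalanced label assignment and compares the ranks. Everything in it is an illustrative assumption: the inputs are random, the softmax outputs come from a random linear map standing in for $W_{L+1}(t)\Pi(t)$, and the imbalanced case uses the extreme where $\mu_c$ is exactly 0 for all but two classes (the text only requires $\mu_c$ to be close to 0).

```python
import numpy as np

def softmax(logits):
    """Column-wise softmax for logits of shape (C, N)."""
    shifted = logits - logits.max(axis=0, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=0, keepdims=True)

def build_G(X, labels, gamma, num_classes):
    """G = sum_c mu_c (e_c - gamma_bar_c) X_bar_c^T, following Eqn. (4).

    X: (d_in, N) inputs, labels: (N,) integer classes, gamma: (C, N) softmax outputs.
    """
    d_in, N = X.shape
    G = np.zeros((num_classes, d_in))
    for c in range(num_classes):
        mask = labels == c
        n_c = int(mask.sum())
        if n_c == 0:
            continue  # mu_c = 0: class c contributes nothing to G
        mu_c = n_c / N
        e_c = np.eye(num_classes)[c]
        gamma_bar_c = gamma[:, mask].mean(axis=1)
        x_bar_c = X[:, mask].mean(axis=1)
        G += mu_c * np.outer(e_c - gamma_bar_c, x_bar_c)
    return G

rng = np.random.default_rng(0)
C, d_in, N = 10, 20, 1000
X = rng.normal(size=(d_in, N))              # hypothetical inputs
W = 0.1 * rng.normal(size=(C, d_in))        # stand-in for W_{L+1}(t) Pi(t)
gamma = softmax(W @ X)

balanced = rng.integers(0, C, size=N)       # all classes present
imbalanced = rng.integers(0, 2, size=N)     # only classes 0 and 1 present

rank_balanced = np.linalg.matrix_rank(build_G(X, balanced, gamma, C))
rank_imbalanced = np.linalg.matrix_rank(build_G(X, imbalanced, gamma, C))
print("rank of G (balanced):  ", rank_balanced)
print("rank of G (imbalanced):", rank_imbalanced)
```

Because $G$ is a sum of one rank-one term per class with $\mu_c > 0$, its rank cannot exceed the number of classes present on the client, which is why the imbalanced case yields a much lower rank.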
These observations imply that only relatively few singular values of $\Pi(t)$ will increase significantly after training. Furthermore, $\Pi(t)$ being biased towards being low-rank will directly lead to dimensional collapse for the representations. To see this, we simply write the covariance matrix of the representations in terms of $\Pi(t)$ as

$$\Sigma(t) = \frac{1}{N}\sum_{i=1}^{N} \big(z_i(t) - \bar{z}(t)\big)\big(z_i(t) - \bar{z}(t)\big)^\top = \Pi(t)\left[\frac{1}{N}\sum_{i=1}^{N} (X_i - \bar{X})(X_i - \bar{X})^\top\right]\Pi(t)^\top. \tag{5}$$

From Eqn. (5), we observe that if $\Pi(t)$ evolves to being a lower-rank matrix, $\Sigma(t)$ will also tend to be lower-rank, which corresponds to the stronger dimensional collapse observed in Fig. 2(b).

[...]

$$L_{\mathrm{singular}}(w, X) = \frac{1}{d}\sum_{i=1}^{d}\Big(\lambda_i - \frac{1}{d}\sum_{j=1}^{d}\lambda_j\Big)^2,$$

where $\lambda_i$ is the i-th singular value of the covariance matrix of the representations. Essentially, $L_{\mathrm{singular}}$ penalizes the variance among the singular values, thus discouraging the tail singular values from collapsing to 0 and mitigating dimensional collapse. However, this regularization term is not practical as it requires calculating all the singular values, which is computationally expensive. Therefore, to derive a computationally cheap training objective, we first apply the z-score normalization on all the representation vectors $z_i$ as follows: $\hat{z}_i = (z_i - \bar{z}) / \sqrt{\mathrm{Var}(z)}$. This results in the covariance matrix of $\hat{z}_i$ being equal to its correlation matrix (i.e., the matrix of correlation coefficients). The following proposition suggests a more convenient cost function to regularize.

Proposition 1. For a d-by-d correlation matrix $K$ with singular values $(\lambda_1, \ldots, \lambda_d)$, we have:

$$\sum_{i=1}^{d}\Big(\lambda_i - \frac{1}{d}\sum_{j=1}^{d}\lambda_j\Big)^2 = \lVert K \rVert_F^2 - d.$$

The proof of Proposition 1 can be found in Appendix B. This proposition suggests that regularizing the Frobenius norm of the correlation matrix $\lVert K \rVert_F$ achieves the same effect as minimizing $L_{\mathrm{singular}}$. In contrast to the singular values, $\lVert K \rVert_F$ can be computed efficiently.
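Proposition 1 is easy to sanity-check numerically. The sketch below, on synthetic correlated features (an assumption of this example, not data from the paper), z-scores each dimension, forms the correlation matrix $K$, and compares both sides of the identity.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 500, 16
# Hypothetical representations with correlated dimensions.
z = rng.normal(size=(N, d)) @ rng.normal(size=(d, d))

# z-score normalization per dimension (population standard deviation).
z_hat = (z - z.mean(axis=0)) / z.std(axis=0)
K = z_hat.T @ z_hat / N            # correlation matrix, unit diagonal

lam = np.linalg.svd(K, compute_uv=False)
lhs = np.sum((lam - lam.mean()) ** 2)          # spread of the singular values
rhs = np.linalg.norm(K, "fro") ** 2 - d        # Frobenius-norm form

print(lhs, rhs)
```

The identity holds because $K$ is symmetric positive semi-definite with trace $d$, so its singular values average to 1 and their squares sum to $\lVert K \rVert_F^2$.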
To leverage this proposition, we propose a novel method, FEDDECORR, which regularizes the Frobenius norm of the correlation matrix of the representation vectors during local training on each client. Formally, the proposed regularization term is defined as:

$$L_{\mathrm{FedDecorr}}(w, X) = \frac{1}{d^2}\lVert K \rVert_F^2,$$

where $w$ denotes the model parameters and $K$ is the correlation matrix of the representations. The overall objective of each local client is

$$\min_{w}\; \ell(w, X, y) + \beta\, L_{\mathrm{FedDecorr}}(w, X), \tag{9}$$

where $\ell$ is the cross-entropy loss and $\beta$ is the regularization coefficient of FEDDECORR. The pseudocode of our method is provided in Appendix G. To visualize the effectiveness of $L_{\mathrm{FedDecorr}}$ in mitigating dimensional collapse, we now revisit the experiments of Fig. 2 and apply $L_{\mathrm{FedDecorr}}$ under the heterogeneous setting where α ∈ {0.01, 0.05}. We plot our results in Fig. 3 [...]

Table 2: TinyImageNet Experiments. We run with α ∈ {0.05, 0.1, 0.5, ∞} and report the test accuracy (%). All results are (re)produced by us and are averaged over 3 runs (mean ± std is reported). Bold font highlights the highest accuracy in each column.

Datasets:

We adopt three popular benchmark datasets, namely CIFAR10, CIFAR100, and TinyImageNet. CIFAR10 and CIFAR100 both have 50,000 training samples and 10,000 test samples, and the size of each image is 32 × 32. TinyImageNet contains 200 classes, with 100,000 training samples and 10,000 testing samples, and each image is 64 × 64. The method generating local data for each client was introduced in Sec. 3.1.

Implementation Details:

Our code is based on the code of Li et al. (2021b). For all experiments, we use MobileNetV2 (Sandler et al., 2018). We run 100 communication rounds for all experiments on the CIFAR10/100 datasets and 50 communication rounds on the TinyImageNet dataset. We conduct local training for 10 epochs in each communication round using the SGD optimizer with a learning rate of 0.01, an SGD momentum of 0.9, and a batch size of 64. The weight decay is set to $10^{-5}$ for CIFAR10 and $10^{-4}$ for CIFAR100 and TinyImageNet. We apply the data augmentation of Cubuk et al. (2018) in all CIFAR100 and TinyImageNet experiments. The β of FEDDECORR (i.e., β in Eqn. (9)) is tuned to be 0.1. The details of tuning hyper-parameters for other federated learning methods are described in Appendix F.
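As a rough illustration of how the regularized local objective in Eqn. (9) could be evaluated on a batch with β = 0.1, here is a minimal NumPy sketch. It follows the formula $\frac{1}{d^2}\lVert K \rVert_F^2$ as written in the text and is not the authors' implementation (their pseudocode is in Appendix G); the function names, the `eps` stabilizer, and the synthetic batches are assumptions of this example. In actual training, the same computation would be expressed in an automatic-differentiation framework so that gradients flow into the model.

```python
import numpy as np

def feddecorr_loss(z, eps=1e-8):
    """(1/d^2) * ||K||_F^2, with K the correlation matrix of batch representations.

    z: array of shape (N, d). eps is a small stabilizer added here (an assumption).
    """
    n, d = z.shape
    z_hat = (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)  # z-score per dimension
    K = z_hat.T @ z_hat / n                                      # (d, d) correlation matrix
    return np.sum(K ** 2) / d ** 2

def local_objective(ce_loss, z, beta=0.1):
    """Cross-entropy value plus beta times the decorrelation regularizer (Eqn. (9))."""
    return ce_loss + beta * feddecorr_loss(z)

rng = np.random.default_rng(0)
N, d = 4096, 16
z_decorrelated = rng.normal(size=(N, d))                            # nearly uncorrelated dims
z_collapsed = rng.normal(size=(N, 1)) * np.linspace(1.0, 2.0, d)    # all dims perfectly correlated

loss_decorrelated = feddecorr_loss(z_decorrelated)
loss_collapsed = feddecorr_loss(z_collapsed)
print(loss_decorrelated, loss_collapsed)
```

The regularizer is smallest (close to $1/d$) when the dimensions are uncorrelated and approaches 1 when every dimension is a rescaled copy of the same signal, so minimizing it pushes representations away from the collapsed regime.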

5.2. FEDDECORR SIGNIFICANTLY IMPROVES BASELINE METHODS

To validate the effectiveness of our method, we apply FEDDECORR to four baselines, namely FedAvg (McMahan et al., 2017), FedAvgM (Hsu et al., 2019), FedProx (Li et al., 2020), and MOON (Li et al., 2021b). We partition the three benchmark datasets (CIFAR10, CIFAR100, and TinyImageNet) into 10 clients with α ∈ {0.05, 0.1, 0.5, ∞}. Since α = ∞ is the homogeneous setting where local models should be free from the pitfall of excessive dimensional collapse, we only expect FEDDECORR to perform on par with the baselines in this setting. We display the CIFAR10/100 results in Tab. 1 and the TinyImageNet results in Tab. 2. We observe that for all of the heterogeneous settings on all datasets, the highest accuracies are achieved by adding FEDDECORR on top of a certain baseline method. In particular, in the strongly heterogeneous settings where α ∈ {0.05, 0.1}, adding FEDDECORR yields significant improvements of around 2% ∼ 9% over baseline methods on all datasets. On the other hand, for the less heterogeneous setting of α = 0.5, the problem of dimensional collapse is less pronounced as discussed in Sec. 3, leading to smaller improvements from FEDDECORR. Such a decrease in improvements is a general trend and is also observed on FedProx, FedAvgM, and MOON. In addition, surprisingly, in the homogeneous setting of α = ∞, FEDDECORR still produces improvements of around 2% on the TinyImageNet dataset. We conjecture that this is because TinyImageNet is much more complicated than the CIFAR datasets, and some other factors besides label heterogeneity may cause undesirable dimensional collapse in the federated learning setup. Therefore, federated learning on TinyImageNet can benefit from FEDDECORR even in the homogeneous setting. To further demonstrate the advantages of FEDDECORR, we apply it on FedAvg and plot how the test accuracy of the global model evolves throughout federated learning in Fig. 4.
In this figure, if we set a certain value of the testing accuracy as a threshold, we see that adding FEDDECORR significantly reduces the number of communication rounds needed to achieve the given threshold. This further shows that FEDDECORR not only improves the final performance, but also greatly boosts the communication efficiency in federated learning.

5.3. ABLATION STUDY ON THE NUMBER OF CLIENTS

Next, we study whether the improvements brought by FEDDECORR are preserved as the number of clients increases. We partition the TinyImageNet dataset into 10, 20, 30, 50, and 100 clients according to different α's, and then run FedAvg with and without FEDDECORR. For the experiments with 10, 20, and 30 clients, we run 50 communication rounds. For the experiments with 50 and 100 clients, we randomly select 20% of the total clients to participate in federated learning in each round and run 100 communication rounds. Results are shown in Tab. 3. From this table, we see that the performance improvements resulting from FEDDECORR increase from around 3% ∼ 5% to around 7% ∼ 10% with the growth in the number of clients. Therefore, interestingly, we show through experiments that the improvements brought by FEDDECORR can be even more pronounced under the more challenging settings with more clients. Moreover, our experimental results under random client participation show that the improvements from FEDDECORR are robust to such uncertainties. These experiments demonstrate the potential of FEDDECORR to be applied to real-world federated learning settings with massive numbers of clients and random client participation.

Figure 5: Ablation study on β (y-axis: accuracy). We apply FEDDECORR with different choices of β on FedAvg.

5.4. ABLATION STUDY ON THE REGULARIZATION COEFFICIENT β

Next, we study FEDDECORR's robustness to the β in Eqn. (9) by varying it in the set {0.01, 0.05, 0.1, 0.2, 0.3}. We partition the CIFAR10 and TinyImageNet datasets into 10 clients with α equal to 0.05 and 0.1 to simulate the heterogeneous setting. Results are shown in Fig. 5. We observe that, in general, when β increases, the performance of FEDDECORR first increases, then plateaus, and finally decreases slightly. These results show that FEDDECORR is relatively insensitive to the choice of β, which implies FEDDECORR is an easy-to-tune federated learning method. In addition, among all experimental setups, setting β to be 0.1 consistently produces (almost) the best results. Therefore, we recommend β = 0.1 when having no prior information about the dataset.

Lastly, we ablate on the number of local epochs per communication round. We set the number of local epochs E to be in the set {1, 5, 10, 20}. We run experiments with and without FEDDECORR, and we use the CIFAR100 and TinyImageNet datasets with α being 0.05 and 0.1 for this ablation study. Results are shown in Tab. 4, in which one observes that with increasing E, FedAvg performance first increases and then decreases. This is because when E is too small, the local training cannot converge properly in each communication round. On the other hand, when E is too large, the model parameters of local clients might be driven too far from the global optimum. Nevertheless, FEDDECORR consistently improves over the baselines across different choices of local epochs E.

5.6. ADDITIONAL EMPIRICAL ANALYSES

We present more empirical analyses in Appendix C. These include comparing FEDDECORR with other baselines (Appendix C.4) and other decorrelation methods (Appendix C.2), experiments on other model architectures (Appendix C.3) and another type of data heterogeneity (Appendix C.5), and discussing the computational advantage of FEDDECORR (Appendix C.1).

6. CONCLUSION

In this work, we study representations of trained models under federated learning in which the data held by clients are heterogeneous. Through extensive empirical observations and theoretical analyses, we show that stronger data heterogeneity results in more severe dimensional collapse for both global and local representations. Motivated by this, we propose FEDDECORR, a novel method to mitigate dimensional collapse during local training, thus improving federated learning under the heterogeneous data setting. Extensive experiments on benchmark datasets show that FEDDECORR yields consistent improvements over existing baseline methods.

A PROOF OF THEOREM 1 IN MAIN PAPER

A.1 NOTATIONS REVISITED

Here, for the reader's convenience, we summarize the notations used in both the main text and this appendix.

Notation: Explanation
$N$: Number of training data points.
$C$: Total number of classes.
$X$: The collection of the $N$ training samples, $X \in \mathbb{R}^{d_{\mathrm{in}} \times N}$.
$y$: The collection of one-hot labels of the $N$ training samples, $y \in \mathbb{R}^{C \times N}$.
$\gamma$: The collection of model output softmax vectors given all $N$ input data, $\gamma \in \mathbb{R}^{C \times N}$.
$W_i(t)$: The i-th layer weight matrix at the t-th optimization step.
$\Pi(t)$: The product of the weight matrices of the first $L$ layers: $\Pi(t) = W_L(t) \cdots W_1(t)$.
$\sigma_{l,k}$: The k-th singular value of $W_l$.
$u_{l,k}$: The k-th left singular vector of $W_l$.
$v_{l,k}$: The k-th right singular vector of $W_l$.
$\sigma_k$: The k-th singular value of $\Pi$.
$u_k$: The k-th left singular vector of $\Pi$.

Here, we elaborate two useful lemmas from Arora et al. (2019; 2018). The first lemma is adopted from Arora et al. (2019):

Lemma 1. Assuming the weight matrix $W$ evolves under gradient descent dynamics with infinitesimally small learning rate, the k-th singular value of this matrix (denoted as $\sigma_k$) evolves as

$$\dot{\sigma}_k(t) = (u_k(t))^\top \dot{W}(t)\, v_k(t), \tag{10}$$

where $u_k(t)$ and $v_k(t)$ are the k-th left and right singular vectors of $W(t)$, respectively.

Proof. By performing an SVD on $W(t)$, we have $W(t) = U(t) S(t) V(t)^\top$. Therefore, by the chain rule in differentiation, we have:

$$\dot{W}(t) = \dot{U}(t) S(t) V(t)^\top + U(t) \dot{S}(t) V(t)^\top + U(t) S(t) \dot{V}(t)^\top. \tag{11}$$

Next, for both sides of the above equation, we left multiply by $U(t)^\top$ and right multiply by $V(t)$:

$$U(t)^\top \dot{W}(t) V(t) = U(t)^\top \dot{U}(t) S(t) + \dot{S}(t) + S(t) (\dot{V}(t))^\top V(t). \tag{12}$$

Since $S(t)$ is a diagonal matrix, we consider the k-th diagonal entry of $S(t)$, namely $\sigma_k(t)$:

$$(u_k(t))^\top \dot{W}(t)\, v_k(t) = (u_k(t))^\top \dot{u}_k(t)\, \sigma_k(t) + \dot{\sigma}_k(t) + \sigma_k(t)\, (\dot{v}_k(t))^\top v_k(t). \tag{13}$$

Since $u_k(t)$ and $v_k(t)$ are unit vectors and are evolving in time with infinitesimal rate, we have $(u_k(t))^\top \dot{u}_k(t) = 0$ and $(\dot{v}_k(t))^\top v_k(t) = 0$. Next, Eqn. (13) can be simplified as

$$\dot{\sigma}_k(t) = (u_k(t))^\top \dot{W}(t)\, v_k(t). \tag{14}$$

The proof is thus complete.

The second lemma is adopted from Arora et al. (2018).

Lemma 2. Consider $L$ consecutive linear layers in a neural network characterized by weight matrices $W_1, W_2, \ldots, W_L$. We denote $\Pi = W_L W_{L-1} \cdots W_1$. We further denote $W_j(t)$ as the weight matrix $W_j$ after the t-th gradient descent optimization step. Correspondingly, the initialization of $W_j$ is $W_j(0)$.
Assume that we have $W_j(0)(W_j(0))^\top = (W_{j+1}(0))^\top W_{j+1}(0)$ for any $j \in [L-1]$ at initialization. Then, under the gradient descent dynamics, $\Pi(t)$ satisfies

$$\dot{\Pi}(t) = -\sum_{j=1}^{L} \big[\Pi(t)\Pi(t)^\top\big]^{\frac{L-j}{L}}\, \frac{\partial \ell(\Pi(t))}{\partial \Pi}\, \big[\Pi(t)^\top \Pi(t)\big]^{\frac{j-1}{L}}, \tag{15}$$

where $[\cdot]^{\frac{L-j}{L}}$ and $[\cdot]^{\frac{j-1}{L}}$ are fractional power operators defined over positive semi-definite matrices.

Proof. Here, we first define some additional notation. Given any square matrices (or possibly scalars) $A_1, A_2, \ldots, A_m$, we denote $\mathrm{diag}(A_1, A_2, \ldots, A_m)$ to be the block diagonal matrix

$$\mathrm{diag}(A_1, A_2, \ldots, A_m) = \begin{pmatrix} A_1 & 0 & \cdots & 0 \\ 0 & A_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & A_m \end{pmatrix}.$$

Here, we first consider the dynamics of an arbitrary $W_j$ where $j \in [L-1]$. By the chain rule, we have

$$\dot{W}_j(t) = -\frac{\partial \ell(W_1(t), \ldots, W_{L+1}(t))}{\partial W_j(t)} = -\big(W_{j+1}(t)^\top \cdots W_L(t)^\top\big)\, \frac{\partial \ell(\Pi(t))}{\partial \Pi}\, \big(W_1(t)^\top \cdots W_{j-1}(t)^\top\big). \tag{16}$$

Given Eqn. (16), we right multiply $\dot{W}_j(t)$ by $(W_j(t))^\top$ and we left multiply $\dot{W}_{j+1}(t)$ by $(W_{j+1}(t))^\top$, which yields

$$\dot{W}_j(t)(W_j(t))^\top = (W_{j+1}(t))^\top \dot{W}_{j+1}(t). \tag{17}$$

Applying the same trick on $W_j(t)^\top$ and $W_{j+1}(t)^\top$ yields

$$W_j(t)(\dot{W}_j(t))^\top = (\dot{W}_{j+1}(t))^\top W_{j+1}(t). \tag{18}$$

Adding Eqns. (17) and (18) on both sides yields

$$\dot{W}_j(t)(W_j(t))^\top + W_j(t)(\dot{W}_j(t))^\top = W_{j+1}(t)^\top \dot{W}_{j+1}(t) + (\dot{W}_{j+1}(t))^\top W_{j+1}(t). \tag{19}$$

Next, by the chain rule for differentiation, Eqn. (19) directly implies that

$$\frac{d\big(W_j(t) W_j(t)^\top\big)}{dt} = \frac{d\big(W_{j+1}(t)^\top W_{j+1}(t)\big)}{dt}. \tag{20}$$

Since we have assumed that $W_j(0) W_j(0)^\top = W_{j+1}(0)^\top W_{j+1}(0)$, we can conclude that

$$W_j(t) W_j(t)^\top = W_{j+1}(t)^\top W_{j+1}(t). \tag{21}$$

Next, we apply an SVD on $W_j(t)$ and $W_{j+1}(t)$ in Eqn. (21). This yields

$$U_j(t) S_j(t) S_j^\top(t) U_j^\top(t) = V_{j+1}(t) S_{j+1}^\top(t) S_{j+1}(t) V_{j+1}^\top(t). \tag{22}$$

Based on Eqn. (22) and given the uniqueness property of SVD, we know:

$$S_j(t) S_j(t)^\top = S_{j+1}^\top(t) S_{j+1}(t) = \mathrm{diag}(\rho_1 I_{d_1}, \rho_2 I_{d_2}, \ldots, \rho_m I_{d_m}), \tag{23}$$

$$U_j(t) = V_{j+1}(t)\, \mathrm{diag}(O_{j,1}, O_{j,2}, \ldots, O_{j,m}). \tag{24}$$

Given Eqn. (24), next, we study $W_{j+1}(t) W_j(t) W_j^\top(t) W_{j+1}^\top(t)$ for any $j \in [L-1]$:

$$\begin{aligned} W_{j+1}(t) W_j(t) W_j^\top(t) W_{j+1}^\top(t) &= U_{j+1} S_{j+1} V_{j+1}^\top U_j S_j S_j^\top U_j^\top V_{j+1} S_{j+1}^\top U_{j+1}^\top \\ &= U_{j+1} S_{j+1}\, \mathrm{diag}(O_{j,1}, \ldots, O_{j,m})\, S_j S_j^\top\, \mathrm{diag}(O_{j,1}^\top, \ldots, O_{j,m}^\top)\, S_{j+1}^\top U_{j+1}^\top && \text{(plugging in (24))} \\ &= U_{j+1} S_{j+1} S_j S_j^\top S_{j+1}^\top U_{j+1}^\top && \text{($S_j S_j^\top$ commutes with $\mathrm{diag}(O_{j,1}, \ldots, O_{j,m})$)} \\ &= U_{j+1}\, \mathrm{diag}(\rho_1^2 I_{d_1}, \rho_2^2 I_{d_2}, \ldots, \rho_m^2 I_{d_m})\, U_{j+1}^\top. \end{aligned} \tag{25}$$

Similarly, it holds that

$$W_j^\top(t) W_{j+1}^\top(t) W_{j+1}(t) W_j(t) = V_j\, \mathrm{diag}(\rho_1^2 I_{d_1}, \rho_2^2 I_{d_2}, \ldots, \rho_m^2 I_{d_m})\, V_j^\top. \tag{26}$$

Next, by induction and Eqn. (25),

$$W_L(t) \cdots W_j(t) W_j(t)^\top \cdots W_L(t)^\top = U_L\, \mathrm{diag}(\rho_1^{L-j+1} I_{d_1}, \rho_2^{L-j+1} I_{d_2}, \ldots, \rho_m^{L-j+1} I_{d_m})\, U_L^\top, \tag{27}$$

and by induction and Eqn. (26), it holds that

$$W_1^\top(t) \cdots W_j^\top(t) W_j(t) \cdots W_1(t) = V_1\, \mathrm{diag}(\rho_1^{j} I_{d_1}, \rho_2^{j} I_{d_2}, \ldots, \rho_m^{j} I_{d_m})\, V_1^\top. \tag{28}$$

From Eqn. (27), we know that for any $j \in [L-1]$,

$$\begin{aligned} \Pi(t)\Pi(t)^\top &= W_L(t) \cdots W_1(t) W_1(t)^\top \cdots W_L(t)^\top \\ &= U_L\, \mathrm{diag}(\rho_1^{L} I_{d_1}, \rho_2^{L} I_{d_2}, \ldots, \rho_m^{L} I_{d_m})\, U_L^\top \\ &= \big[U_L\, \mathrm{diag}(\rho_1^{L-j} I_{d_1}, \rho_2^{L-j} I_{d_2}, \ldots, \rho_m^{L-j} I_{d_m})\, U_L^\top\big]^{\frac{L}{L-j}} \\ &= \big[W_L(t) \cdots W_{j+1}(t) W_{j+1}(t)^\top \cdots W_L(t)^\top\big]^{\frac{L}{L-j}}. \end{aligned} \tag{29}$$

Similarly, from Eqn. (28), we know that for any $2 \leq j \leq L-1$,

$$\begin{aligned} \Pi(t)^\top \Pi(t) &= W_1(t)^\top \cdots W_L(t)^\top W_L(t) \cdots W_1(t) \\ &= V_1\, \mathrm{diag}(\rho_1^{L} I_{d_1}, \rho_2^{L} I_{d_2}, \ldots, \rho_m^{L} I_{d_m})\, V_1^\top \\ &= \big[V_1\, \mathrm{diag}(\rho_1^{j-1} I_{d_1}, \rho_2^{j-1} I_{d_2}, \ldots, \rho_m^{j-1} I_{d_m})\, V_1^\top\big]^{\frac{L}{j-1}} \\ &= \big[W_1^\top(t) \cdots W_{j-1}^\top(t) W_{j-1}(t) \cdots W_1(t)\big]^{\frac{L}{j-1}}. \end{aligned} \tag{30}$$

With everything derived above, we now study the dynamics of $\Pi(t)$ as follows:

$$\begin{aligned} \dot{\Pi}(t) &= \sum_{j=1}^{L} \big[W_L(t) \cdots W_{j+1}(t)\big]\, \dot{W}_j(t)\, \big[W_{j-1}(t) \cdots W_1(t)\big] && \text{(differential chain rule)} \\ &= -\sum_{j=1}^{L} \big[W_L(t) \cdots W_{j+1}(t) W_{j+1}(t)^\top \cdots W_L(t)^\top\big]\, \frac{\partial \ell(\Pi(t))}{\partial \Pi}\, \big[W_1^\top(t) \cdots W_{j-1}^\top(t) W_{j-1}(t) \cdots W_1(t)\big] && \text{(plugging in (16))} \\ &= -\sum_{j=1}^{L} \big[\Pi(t)\Pi(t)^\top\big]^{\frac{L-j}{L}}\, \frac{\partial \ell(\Pi(t))}{\partial \Pi}\, \big[\Pi(t)^\top \Pi(t)\big]^{\frac{j-1}{L}} && \text{(plugging in (29) and (30))}. \end{aligned} \tag{31}$$

This completes the proof.
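Lemma 1 above lends itself to a quick numerical sanity check. The sketch below treats a random matrix `D` as the time derivative of a random matrix `W0` (both are hypothetical stand-ins, not quantities from the paper) and compares the lemma's prediction $u_k^\top \dot{W} v_k$ with a central finite-difference estimate of the derivative of each singular value. It assumes the singular values of `W0` are distinct and non-zero, which holds generically for random matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
W0 = rng.normal(size=(5, 4))   # W(t) at t = 0 (hypothetical)
D = rng.normal(size=(5, 4))    # plays the role of dW/dt at t = 0 (hypothetical)

# Prediction from Lemma 1: d(sigma_k)/dt = u_k^T (dW/dt) v_k.
U, S, Vt = np.linalg.svd(W0, full_matrices=False)
predicted = np.array([U[:, k] @ D @ Vt[k, :] for k in range(len(S))])

# Central finite difference along the path W(t) = W0 + t * D.
h = 1e-6
s_plus = np.linalg.svd(W0 + h * D, compute_uv=False)
s_minus = np.linalg.svd(W0 - h * D, compute_uv=False)
numerical = (s_plus - s_minus) / (2 * h)

print(np.max(np.abs(predicted - numerical)))
```

The sign ambiguity of singular vectors does not matter here, since flipping the sign of $u_k$ forces the same flip on $v_k$ and leaves $u_k^\top \dot{W} v_k$ unchanged.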
A.3 ASSUMPTIONS

Assumption 1. We assume that the initial values of the weight matrices satisfy $W_{i+1}^\top(0) W_{i+1}(0) = W_i(0) W_i^\top(0)$ for any $i \in [L-1]$.

Assumption 2. We assume $|u_k(t)^\top v_{L+1,k'}(t)| = \mathbb{1}\{k = k'\}$ holds for all $t$, where $u_k(t)$ is the $k$-th left singular vector of $\Pi(t)$ and $v_{L+1,k'}(t)$ is the $k'$-th right singular vector of $W_{L+1}(t)$.

Remark: Assumption 1 can be achieved in practice by proper random initialization. For Assumption 2, Ji & Telgarsky (2018) proved that, under some assumptions, gradient descent optimization drives consecutive layers of linear networks to satisfy it. We also provide empirical evidence in Fig. 6 to corroborate that this assumption approximately holds.
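To make Assumption 1 concrete, the following minimal sketch builds one initialization that satisfies the balancedness condition and measures how far an ordinary Gaussian initialization is from it. The construction (square layers set to a scaled orthogonal matrix) and the helper names `scaled_orthogonal_init` and `balance_gap` are illustrative assumptions, not the initialization used in the paper.

```python
import numpy as np

def scaled_orthogonal_init(num_layers, d, scale=0.5, seed=0):
    """Return num_layers square matrices W_i = scale * Q_i with Q_i orthogonal.

    Hypothetical construction: W_{i+1}^T W_{i+1} = scale^2 * I = W_i W_i^T,
    so the condition of Assumption 1 holds for every consecutive pair.
    """
    rng = np.random.default_rng(seed)
    weights = []
    for _ in range(num_layers):
        # QR of a Gaussian matrix yields an orthogonal factor q.
        q, _r = np.linalg.qr(rng.standard_normal((d, d)))
        weights.append(scale * q)
    return weights

def balance_gap(weights):
    """Largest Frobenius norm of W_{i+1}^T W_{i+1} - W_i W_i^T over consecutive layers."""
    return max(
        float(np.linalg.norm(w_next.T @ w_next - w @ w.T))
        for w, w_next in zip(weights[:-1], weights[1:])
    )

balanced = scaled_orthogonal_init(num_layers=3, d=5)

rng = np.random.default_rng(1)
gaussian = [rng.standard_normal((5, 5)) for _ in range(3)]

print("gap with scaled orthogonal init:", balance_gap(balanced))
print("gap with plain Gaussian init:   ", balance_gap(gaussian))
```

The first gap is zero up to floating-point error, while the second is not, which illustrates that the assumption constrains the initialization rather than holding automatically.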

A.4 PROOF OF THEOREM 1 IN THE MAIN TEXT

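Theorem 1 (formally stated). Let $\sigma_k(t)$ for $k \in [d]$ be the $k$-th largest singular value of $\Pi(t)$. Then, under Assumptions 1 and 2, we have

$$
\dot{\sigma}_k(t) = N L\, (\sigma_k(t))^{2-\frac{2}{L}} \sqrt{(\sigma_k(t))^{\frac{2}{L}} + M}\; (u_{L+1,k}(t))^\top G(t)\, v_k(t),
$$

where $u_{L+1,k}(t)$ is the $k$-th left singular vector of $W_{L+1}(t)$, $v_k(t)$ is the $k$-th right singular vector of $\Pi(t)$, $M$ is a constant, and $G(t)$ is defined as

$$
G(t) = \sum_{c=1}^{C} \mu_c\, (e_c - \bar{\gamma}_c(t))\, \bar{X}_c^\top.
$$

Proof. Recall that for $(L+1)$-layer linear neural networks, given the $i$-th training sample $X_i \in \mathbb{R}^d$, we have

$$
\gamma_i(t) = \operatorname{softmax}(W_{L+1}(t) z_i(t)) = \operatorname{softmax}(W_{L+1}(t)\Pi(t) X_i),
$$

and the loss is the standard cross-entropy loss defined as follows:

$$
\ell(\Pi(t), W_{L+1}(t)) = \sum_{i=1}^{N} -y_i^\top \log \gamma_i(t).
$$

By the chain rule, we can derive the gradients of $\ell$ with respect to $W_{L+1}$ and $\Pi$, which are respectively

$$
\frac{\partial \ell(\Pi(t), W_{L+1}(t))}{\partial W_{L+1}} = -(y - \gamma(t)) X^\top \Pi(t)^\top,
$$

and

$$
\frac{\partial \ell(\Pi(t), W_{L+1}(t))}{\partial \Pi} = -W_{L+1}(t)^\top (y - \gamma(t)) X^\top.
$$

Next, under the gradient descent dynamics, the dynamics on $W_{L+1}$ satisfies

$$
\dot{W}_{L+1}(t) = -\frac{\partial \ell(\Pi(t), W_{L+1}(t))}{\partial W_{L+1}} = (y - \gamma(t)) X^\top \Pi(t)^\top,
\tag{38}
$$

while the dynamics on $\Pi$ requires invoking Lemma 2, which allows us to write

$$
\begin{aligned}
\dot{\Pi}(t)
&= -\sum_{j=1}^{L} \left[ \Pi(t)\Pi(t)^\top \right]^{\frac{L-j}{L}} \frac{\partial \ell(\Pi(t))}{\partial \Pi} \left[ \Pi(t)^\top\Pi(t) \right]^{\frac{j-1}{L}} \\
&= \sum_{j=1}^{L} \left[ \Pi(t)\Pi(t)^\top \right]^{\frac{L-j}{L}} W_{L+1}(t)^\top (y - \gamma(t)) X^\top \left[ \Pi(t)^\top\Pi(t) \right]^{\frac{j-1}{L}}.
\end{aligned}
\tag{39}
$$

Next, we invoke Lemma 1 on Eqn. (39) and Eqn. (38), respectively, yielding

$$
\begin{aligned}
\dot{\sigma}_k(t)
&= (u_k(t))^\top \dot{\Pi}(t)\, (v_k(t)) \\
&= \sum_{j=1}^{L} u_k(t)^\top \left[ \Pi(t)\Pi(t)^\top \right]^{\frac{L-j}{L}} W_{L+1}(t)^\top (y - \gamma(t)) X^\top \left[ \Pi(t)^\top\Pi(t) \right]^{\frac{j-1}{L}} v_k(t) \\
&= L\, (\sigma_k(t))^{2-\frac{2}{L}}\, u_k(t)^\top W_{L+1}(t)^\top (y - \gamma(t)) X^\top v_k(t) && \text{(SVD on $\Pi(t)$)} \\
&= L\, (\sigma_k(t))^{2-\frac{2}{L}} \sum_{k'} \sigma_{L+1,k'}\, u_k(t)^\top v_{L+1,k'}(t)\, (u_{L+1,k'}(t))^\top (y - \gamma(t)) X^\top v_k(t) && \text{(SVD on $W_{L+1}(t)$)} \\
&= L\, (\sigma_k(t))^{2-\frac{2}{L}}\, \sigma_{L+1,k}\, (u_{L+1,k}(t))^\top (y - \gamma(t)) X^\top v_k(t) && \text{(Assumption 2)},
\end{aligned}
\tag{40}
$$

and

$$
\begin{aligned}
\dot{\sigma}_{L+1,k}(t)
&= u_{L+1,k}(t)^\top (y - \gamma(t)) X^\top \Pi(t)^\top v_{L+1,k}(t) \\
&= \sum_{k'} \sigma_{k'}\, u_{L+1,k}(t)^\top (y - \gamma(t)) X^\top v_{k'}(t)\, u_{k'}^\top v_{L+1,k}(t) \\
&= \sigma_k\, u_{L+1,k}(t)^\top (y - \gamma(t)) X^\top v_k(t) && \text{(Assumption 2)}.
\end{aligned}
\tag{41}
$$

Combining Eqns. (40) and (41), we have:

$$
\frac{1}{L}\, (\dot{\sigma}_k(t))\, (\sigma_k(t))^{\frac{2}{L}-1} = \sigma_{L+1,k}(t)\, (\dot{\sigma}_{L+1,k}(t)).
$$

Next, we integrate both sides, which yields

$$
(\sigma_{L+1,k}(t))^2 = (\sigma_k(t))^{\frac{2}{L}} + M,
\tag{43}
$$

where $M$ is a constant. By Eqn. (43), Eqn. (40) can be rewritten as

$$
\dot{\sigma}_k(t) = L\, (\sigma_k(t))^{2-\frac{2}{L}} \sqrt{(\sigma_k(t))^{\frac{2}{L}} + M}\; (u_{L+1,k}(t))^\top (y - \gamma(t)) X^\top v_k(t).
\tag{44}
$$

Finally, notice that $(y - \gamma(t)) X^\top$ can be rewritten as

$$
(y - \gamma(t)) X^\top = \sum_{i=1}^{N} (y_i - \gamma_i(t)) X_i^\top = N \sum_{c=1}^{C} \mu_c\, (e_c - \bar{\gamma}_c(t))\, \bar{X}_c^\top.
\tag{45}
$$

We further substitute Eqn. (45) into Eqn. (44) and obtain

$$
\dot{\sigma}_k(t) = N L\, (\sigma_k(t))^{2-\frac{2}{L}} \sqrt{(\sigma_k(t))^{\frac{2}{L}} + M}\; (u_{L+1,k}(t))^\top G(t)\, v_k(t),
$$

where $G(t)$ is defined as

$$
G(t) = \sum_{c=1}^{C} \mu_c\, (e_c - \bar{\gamma}_c(t))\, \bar{X}_c^\top.
$$

This completes the proof.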
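The integration step in the proof says that $(\sigma_{L+1,k}(t))^2 - (\sigma_k(t))^{2/L}$ stays constant along the trajectory. The sketch below checks this numerically on a scalar caricature of Eqns. (40) and (41), in which the shared factor $(u_{L+1,k})^\top (y - \gamma) X^\top v_k$ is replaced by an arbitrary scalar function `g(t)`. The choice of `g`, the initial values, and the Runge-Kutta integrator are assumptions made purely for illustration.

```python
def simulate(L=3, sigma0=0.5, s0=1.0, steps=2000, dt=1e-4):
    """Integrate the scalar pair of ODEs with classical RK4.

        d sigma / dt = L * sigma^(2 - 2/L) * s * g(t)   (scalar form of Eqn. (40))
        d s     / dt = sigma * g(t)                     (scalar form of Eqn. (41))

    Returns the values of s^2 - sigma^(2/L) at every step.
    """
    def g(t):
        # Hypothetical stand-in for the common gradient factor.
        return 1.0 / (1.0 + t)

    def rhs(t, sigma, s):
        return L * sigma ** (2 - 2 / L) * s * g(t), sigma * g(t)

    t, sigma, s = 0.0, sigma0, s0
    invariants = [s ** 2 - sigma ** (2 / L)]
    for _ in range(steps):
        k1 = rhs(t, sigma, s)
        k2 = rhs(t + dt / 2, sigma + dt / 2 * k1[0], s + dt / 2 * k1[1])
        k3 = rhs(t + dt / 2, sigma + dt / 2 * k2[0], s + dt / 2 * k2[1])
        k4 = rhs(t + dt, sigma + dt * k3[0], s + dt * k3[1])
        sigma += dt / 6 * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0])
        s += dt / 6 * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1])
        t += dt
        invariants.append(s ** 2 - sigma ** (2 / L))
    return invariants

invariants = simulate()
drift = max(abs(m - invariants[0]) for m in invariants)
print("initial value of s^2 - sigma^(2/L):", invariants[0])
print("largest drift along the trajectory:", drift)
```

The drift stays at the level of numerical error, consistent with the constant $M$ in the proof being fixed by the initial singular values.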

B PROOF OF PROPOSITION 1 IN THE MAIN PAPER

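Proposition 1 (restated). For a $d$-by-$d$ correlation matrix $K$ with singular values $(\lambda_1, \dots, \lambda_d)$, we have:

$$
\sum_{i=1}^{d} \Big( \lambda_i - \frac{1}{d} \sum_{j=1}^{d} \lambda_j \Big)^2 = \|K\|_F^2 - d.
$$

Proof. Given a $d$-by-$d$ correlation matrix $K$, since the diagonal entries of $K$ are all 1, we have

$$
\sum_{i=1}^{d} \lambda_i = \operatorname{tr}(K) = d.
\tag{49}
$$

This is because for any symmetric positive semi-definite matrix, the singular values coincide with the eigenvalues, so their sum equals the trace of the matrix. Next, for the left-hand side of Eqn. (7), we have:

$$
\begin{aligned}
\sum_{i=1}^{d} \Big( \lambda_i - \frac{1}{d} \sum_{j=1}^{d} \lambda_j \Big)^2
&= \sum_{i=1}^{d} (\lambda_i - 1)^2 && \text{(Plug-in Eqn. (49))} \\
&= \sum_{i=1}^{d} \lambda_i^2 - 2 \sum_{i=1}^{d} \lambda_i + d \\
&= \sum_{i=1}^{d} \lambda_i^2 - d && \text{(Plug-in Eqn. (49))}.
\end{aligned}
$$

Next, for the right-hand side of Eqn. (7), we have:

$$
\begin{aligned}
\|K\|_F^2 - d
&= \operatorname{tr}(K^\top K) - d \\
&= \operatorname{tr}(U S V^\top V S^\top U^\top) - d && \text{(Apply SVD on $K$)} \\
&= \operatorname{tr}(U S S^\top U^\top) - d \\
&= \sum_{i=1}^{d} \lambda_i^2 - d.
\end{aligned}
$$

Therefore, we have shown that the left-hand side of Eqn. (7) equals its right-hand side.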
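As a quick sanity check of Proposition 1, the sketch below evaluates both sides of the identity on an empirical correlation matrix. The synthetic data, the sizes, and the variable names are assumptions chosen for illustration; only the identity itself comes from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic correlated features: 200 samples, d = 8 dimensions (illustrative sizes).
Z = rng.standard_normal((200, 8)) @ rng.standard_normal((8, 8))
K = np.corrcoef(Z, rowvar=False)  # d-by-d correlation matrix with unit diagonal
d = K.shape[0]

# Left-hand side: sum of squared deviations of the singular values from their mean.
lam = np.linalg.svd(K, compute_uv=False)
lhs = float(np.sum((lam - lam.mean()) ** 2))

# Right-hand side: squared Frobenius norm minus d.
rhs = float(np.linalg.norm(K, "fro") ** 2 - d)

print("left-hand side: ", lhs)
print("right-hand side:", rhs)
```

The two printed values agree up to floating-point error, which is why penalizing $\|K\|_F^2$ is equivalent to penalizing the spread of the singular values of $K$.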

C ADDITIONAL EMPIRICAL ANALYSES

C.1 COMPUTATIONAL EFFICIENCY

We demonstrate FEDDECORR's advantage over some of its competitors in terms of computational efficiency. We compare FEDDECORR with other methods that also apply additional regularization terms during local training, such as FedProx and MOON. We partition CIFAR10, CIFAR100, and TinyImageNet into 10 clients with α = 0.5 and report the total computation time required for one round of training for FedAvg, FedProx, MOON, and FEDDECORR. Results are shown in Tab. 5. All results are produced with an NVIDIA Tesla V100 GPU. We see that FEDDECORR incurs a negligible computation overhead on top of the naïve FedAvg, while FedProx and MOON introduce about 0.5 ∼ 1× additional computation cost.

Table 9: TinyImageNet Experiments. We run with α ∈ {0.05, 0.1, 0.5, ∞} and report the test accuracies (%). All results are (re)produced by us and are averaged over 3 runs (mean ± std is reported). Bold font highlights the highest accuracy in each column. We add results of Scaffold and FedNova compared to Tab. 2 in the main paper.

C.4 COMPARISON WITH OTHER FEDERATED LEARNING BASELINES

In this section, we compare FEDDECORR with two other baselines, namely Scaffold (Karimireddy et al., 2020) and FedNova (Wang et al., 2020b). We use the same experimental setups as in the main paper to implement these two baselines. Results on CIFAR10/100 and TinyImageNet are shown in Tab. 8 and Tab. 9, respectively. As shown in the tables, across various datasets and degrees of heterogeneity, adding FEDDECORR on top of a baseline method can outperform the baseline when there is some heterogeneity across the agents, i.e., α < ∞.

[Figure panels: ResNet32, ResNet18; y-axis: log singular values]

D ADDITIONAL VISUALIZATIONS ON GLOBAL MODELS

In this section, we provide additional visualizations on global models with different federated learning methods, model architectures, and datasets. Through our extensive experimental results, we demonstrate that dimensional collapse is a general problem under heterogeneous data in federated learning.

D.1 VISUALIZATION ON GLOBAL MODELS OF OTHER FEDERATED LEARNING METHODS

In the main text, we have shown that global models produced by FedAvg (McMahan et al., 2017) suffer stronger dimensional collapse with increasing data heterogeneity. To further show that this dimensional collapse phenomenon is a general problem in federated learning, we visualize global models produced by other federated learning methods such as FedAvg with server momentum (Hsu et al., 2019), FedProx (Li et al., 2020), and MOON (Li et al., 2021b). Specifically, we follow the same procedure as in the main text and plot the singular values of the covariance matrices of representations. Results are shown in Fig. 7. From the figure, one can see that all three of these methods exhibit dimensional collapse similar to that of FedAvg.
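The visualization procedure referred to above (singular values of the covariance matrix of representations, sorted in descending order and shown on a log scale) can be sketched as follows. The function name `log_singular_values`, the `floor` parameter, the natural logarithm, and the synthetic representations are assumptions for illustration; in the paper the representations come from trained models evaluated on a test set.

```python
import numpy as np

def log_singular_values(reps, floor=1e-12):
    """Log singular values of the covariance matrix of representations.

    reps: array of shape (num_samples, d).
    floor: hypothetical lower clip that avoids log(0) for collapsed dimensions.
    """
    cov = np.cov(reps, rowvar=False)          # d-by-d covariance matrix
    s = np.linalg.svd(cov, compute_uv=False)  # singular values, descending
    return np.log(np.maximum(s, floor))

rng = np.random.default_rng(0)

# Representations that use all 16 dimensions.
full = rng.standard_normal((1000, 16))
# Representations confined to a 4-dimensional subspace of the 16-dimensional space.
collapsed = rng.standard_normal((1000, 4)) @ rng.standard_normal((4, 16))

log_sv_full = log_singular_values(full)
log_sv_collapsed = log_singular_values(collapsed)

# Count singular values that are not numerically zero.
threshold = np.log(1e-6)
print("non-negligible singular values (full):     ", int(np.sum(log_sv_full > threshold)))
print("non-negligible singular values (collapsed):", int(np.sum(log_sv_collapsed > threshold)))
```

Plotting each returned array against the index k gives curves of the same kind as those in the figures: a collapsed representation shows a tail of singular values that drops sharply towards zero.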

D.2 VISUALIZATION ON GLOBAL MODELS OF OTHER MODEL ARCHITECTURES

In the main text, we have shown the dimensional collapse on global models caused by data heterogeneity with MobileNetV2. In this section, we perform a similar visualization based on other 



Figure 1: (a) illustrates data heterogeneity in terms of number of samples per class. (b), (c), (d) show representations (normalized to the unit sphere) of global models trained under homogeneous data, heterogeneous data, and heterogeneous data with FEDDECORR, respectively. Only (c) suffers dimensional collapse. (b), (c), (d) are produced with ResNet20 on CIFAR10. Best viewed in color.

Figure 2: Data heterogeneity causes dimensional collapse on (a) global models and (b) local models. We plot the singular values of the covariance matrix of representations in descending order. The x-axis (k) is the index of singular values and the y-axis is the logarithm of the singular values.

, and the $N$ one-hot encoded training labels are denoted as $y = [y_1, y_2, \dots, y_N] \in \mathbb{R}^{C \times N}$.

Figure 3: FEDDECORR effectively mitigates dimensional collapse for (a-b) local models and (c-d) global models. For each heterogeneity parameter α ∈ {0.01, 0.05}, we apply FEDDECORR and plot the singular values of the representation covariance matrix. The x-axis (k) is the index of singular values. With FEDDECORR, the tail singular values are prevented from dropping to 0 too rapidly.

Figure 4: Test accuracy (%) at each communication round. Results are averaged over 3 runs. Shaded areas denote one standard deviation above and below the mean.

$v_k$: The $k$-th right singular vector of $\Pi$.
$N_c$: Number of samples of class $c$.
$\mu_c$: The proportion of class $c$ samples w.r.t. the whole training samples: $\mu_c = \frac{N_c}{N}$.
$e_c$: The $C$-dimensional one-hot vector where only the $c$-th entry is 1.
$\bar{X}_c$: Mean vector of the training examples in class $c$: $\bar{X}_c = \frac{1}{N_c} \sum_{i=1}^{N} X_i \mathbb{1}\{y_i = e_c\}$.
$\bar{\gamma}_c$: Mean output softmax vector given samples in class $c$: $\bar{\gamma}_c = \frac{1}{N_c} \sum_{i=1}^{N} \gamma_i \mathbb{1}\{y_i = e_c\}$.

A.2 TWO LEMMAS

where $\sqrt{\rho_1}, \dots, \sqrt{\rho_m}$ represent the $m$ distinct singular values satisfying $\rho_1 > \rho_2 > \dots > \rho_m \ge 0$, and $I_{d_r}$ for any $r \in [m]$ is the identity matrix of size $d_r \times d_r$. Since Eqn. (23) holds for any $j$, we know by induction that the set of values of the $\rho$'s is the same across all layers $j \in [L]$. In addition, there exist orthogonal matrices $O_{j,r} \in \mathbb{R}^{d_r \times d_r}$ for any $r \in [m]$ such that

Figure 6: Alignment effects between the singular spaces of $W_{L+1}(t)$ and $\Pi(t)$. We train a 3-layer linear neural network on the MNIST dataset and visualize the models at 3, 5, 7, and 9 training epochs, respectively. In each figure, the pixel in the $k'$-th row and $k$-th column is the value of $|u_k(t)^\top v_{L+1,k'}(t)|$. Darker colors denote values close to 1 while lighter colors denote values close to 0. From the figures, we empirically observe that $|u_k(t)^\top v_{L+1,k'}(t)| = \mathbb{1}\{k = k'\}$ approximately holds.

where $M$ is a constant. By Eqn. (43), Eqn. (40) can be rewritten as

Figure 7: Data heterogeneity causes similar dimensional collapse on other federated learning methods such as FedAvgM (Hsu et al., 2019), FedProx (Li et al., 2020), and MOON (Li et al., 2021b). The x-axis (k) is the index of singular values.

Figure 8: Data heterogeneity causes similar dimensional collapse on other model architectures during federated learning. The x-axis (k) is the index of singular values.

Figure 10: Heterogeneous local training data cause dimensional collapse. For each of the clients, given the four models trained under different degrees of heterogeneity, we plot the singular values of the covariance matrix of representations in descending order (the results of client 1 are shown in main text Fig. 2(b)). Representations are computed over the CIFAR100 test set. The x-axis (k) is the index of singular values and the y-axis is the logarithm of the singular values.

CIFAR10/100 Experiments. We run experiments under various degrees of heterogeneity (α ∈ {0.05, 0.1, 0.5, ∞}) and report the test accuracy (%). All results are (re)produced by us and are averaged over 3 runs (mean ± std). Bold font highlights the highest accuracy in each column.

Method | CIFAR10, α=0.05 | CIFAR10, α=0.1 | CIFAR10, α=0.5 | CIFAR10, α=∞ | CIFAR100, α=0.05 | CIFAR100, α=0.1 | CIFAR100, α=0.5 | CIFAR100, α=∞
FedAvg | 64.85±2.01 | 76.28±1.22 | 89.84±0.13 | 92.39±0.26 | 59.87±0.25 | 66.46±0.16 | 71.69±0.15 | 74.54±0.15
+ FEDDECORR | 73.06±0.81 | 80.60±0.91 | 89.84±0.05 | 92.19±0.10 | 61.53±0.11 | 67.12±0.09 | 71.91±0.04 | 73.87±0.18

Ablation study on the number of clients. Based on TinyImageNet, we run experiments with different numbers of clients and different amounts of data heterogeneity.

ON THE NUMBER OF LOCAL EPOCHS

Ablation study on local epochs. Experiments with different numbers of local epochs E.

The efficiency advantage of FEDDECORR arises mainly because it only involves calculating the Frobenius norm of a matrix, which is extremely cheap. Indeed, this regularization operates on the output representation vectors of the model, without requiring parameter-wise regularization as in FedProx or extra forward passes as in MOON.

Comparison of computation times. We report the total computation times (in minutes) for one round of training on the three datasets for FedAvg, FedProx, MOON, and FEDDECORR. Here, FEDDECORR stands for applying FEDDECORR to FedAvg.

C.2 COMPARISON WITH OTHER DECORRELATION METHODS

Some decorrelation regularizations such as DeCov (Cogswell et al., 2015) and Structured-DeCov (Xiong et al., 2016) were proposed to improve the generalization capabilities in standard classification tasks. Both of these methods operate directly on the covariance matrix of the representations instead of the correlation matrix, as our proposed method FEDDECORR does. To compare FEDDECORR with the existing decorrelation methods, we follow the same procedure as in FEDDECORR and apply DeCov and Structured-DeCov during local training. Our experiments are based on TinyImageNet and FedAvg. TinyImageNet is partitioned into 10 clients according to various α's. Results are shown in Tab. 6. Surprisingly, we see that unlike FEDDECORR, which steadily improves the baseline, adding DeCov or Structured-DeCov degrades the performance in federated learning. We conjecture that this is because directly regularizing the covariance matrix is highly unstable, leading to undesired modification of the representations. This experiment shows that our design choice of regularizing the correlation matrix instead of the covariance matrix is of paramount importance.
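One concrete difference between covariance-based and correlation-based penalties, offered here only as an illustration and not as the authors' explanation of the results, is their sensitivity to the scale of the representations. In the sketch below, `decov_style_penalty` is an assumed form of a covariance penalty (half the squared off-diagonal mass of the covariance matrix) and `correlation_penalty` is an assumed correlation-based counterpart; neither is taken from the released code.

```python
import numpy as np

def decov_style_penalty(z):
    """Assumed covariance penalty: 0.5 * (||C||_F^2 - ||diag(C)||_2^2)."""
    c = np.cov(z, rowvar=False)
    return 0.5 * float(np.sum(c ** 2) - np.sum(np.diag(c) ** 2))

def correlation_penalty(z):
    """Assumed correlation penalty: ||K||_F^2 / d^2 for the correlation matrix K."""
    k = np.corrcoef(z, rowvar=False)
    return float(np.sum(k ** 2)) / k.shape[0] ** 2

rng = np.random.default_rng(0)
z = rng.standard_normal((512, 6)) @ rng.standard_normal((6, 6))  # synthetic representations

cov_ratio = decov_style_penalty(10.0 * z) / decov_style_penalty(z)
corr_small = correlation_penalty(z)
corr_large = correlation_penalty(10.0 * z)

print("covariance penalty ratio after scaling by 10:", cov_ratio)
print("correlation penalty before and after scaling:", corr_small, corr_large)
```

Scaling the representations by 10 multiplies the covariance penalty by 10^4 while leaving the correlation penalty unchanged, so the effective strength of a covariance penalty depends on the magnitude of the representations.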

Comparison with other decorrelation methods. Based on FedAvg and the TinyImageNet dataset, we use different decorrelation regularizers in local training.

C.3 EXPERIMENTS ON OTHER MODEL ARCHITECTURES

In this section, we demonstrate the effectiveness of our method across different model architectures. Here, besides the MobileNetV2 used in the main paper, we also experiment on ResNet18 and ResNet32. Note that ResNet18 is the wider ResNet, whose representation dimension is 512, and ResNet32 is the narrower ResNet, whose representation dimension is 64. The coefficient of the FedDecorr objective is set to 0.1, which the paper suggests as a good universal value of β. The heterogeneity parameter α is set to 0.05 and we use the CIFAR10 dataset. Our results are shown in Tab. 7. As can be seen, FedDecorr yields consistent improvements across different neural network architectures. One interesting phenomenon is that the improvements brought about by FedDecorr are much larger on wider networks (e.g., MobileNetV2, ResNet18) than on narrower ones (e.g., ResNet32). We conjecture that this is because the dimension of the ambient space of wider networks is clearly higher than that of narrower networks. Therefore, relatively speaking, the dimensional collapse caused by data heterogeneity will be more severe for wider networks.

Effectiveness of FEDDECORR on other model architectures.

CIFAR10/100 Experiments. We run experiments under various degrees of heterogeneity (α ∈ {0.05, 0.1, 0.5, ∞}) and report the test accuracy (%). All results are (re)produced by us and are averaged over 3 runs (mean ± std). Bold font highlights the highest accuracy in each column. We add results of Scaffold and FedNova compared to Tab. 1 in the main paper.

FEDDECORR yields noticeable and consistent improvements under another type of data heterogeneity.

ACKNOWLEDGEMENTS

The authors would like to thank anonymous reviewers for the constructive feedback. Yujun Shi and Vincent Tan are supported by Singapore Ministry of Education Tier 1 grants (Grant Number: A-0009042-01-00, A-8000189-01-00, A-8000980-00-00) and a Singapore National Research Foundation (NRF) Fellowship (Grant Number: A-0005077-01-00). Jian Liang is supported by National Natural Science Foundation of China (Grant No. 62276256) and Beijing Nova Program under Grant Z211100002121108.

REPRODUCIBILITY STATEMENT

All source code has been released at https://github.com/bytedance/FedDecorr. Pseudo-code of FEDDECORR is provided in Appendix G. We introduced all the implementation details of baselines and our method in Sec. 5.1. In addition, the proofs of Theorem 1 and Proposition 1 are provided in Appendix A and Appendix B, respectively. All assumptions are stated and discussed in the proof.

F HYPERPARAMETERS OF OTHER FEDERATED LEARNING METHODS

The regularization coefficient $\mu$ of FedProx (Li et al., 2020) is tuned across $\{10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}\}$ and is selected to be $\mu = 10^{-3}$; the regularization coefficient $\mu$ of MOON (Li et al., 2021b) is tuned across $\{0.1, 1.0, 5.0, 10.0\}$ and is selected to be $\mu = 1.0$; the server momentum $\rho$ of FedAvgM (Hsu et al., 2019) is tuned across $\{0.1, 0.5, 0.9\}$ and is selected to be $\rho = 0.5$.

G PSEUDO-CODE OF FEDDECORR

Here, we provide PyTorch-style pseudo-code for FEDDECORR in Alg. 1. All FEDDECORR-specific components are highlighted in blue. As indicated in the pseudo-code, the only additional operation of FEDDECORR is adding a regularization term $L_{\text{FedDecorr}}(w, X)$ defined in Eqn. (8). This shows that FEDDECORR is an extremely convenient plug-and-play federated learning method.
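Since Alg. 1 and Eqn. (8) are not reproduced in this appendix text, the following NumPy sketch only illustrates what a regularizer of this kind could look like: it standardizes each representation dimension over the batch, forms the correlation matrix $K$, and returns a normalized squared Frobenius norm. The function name `feddecorr_loss`, the $1/d^2$ normalization, and the `eps` stabilizer are assumptions; the released repository remains the authoritative implementation.

```python
import numpy as np

def feddecorr_loss(z, eps=1e-8):
    """Sketch of a correlation-based regularizer on a batch of representations.

    z: array of shape (batch_size, d).
    Returns ||K||_F^2 / d^2, where K is the d-by-d correlation matrix of z.
    The normalization and eps are assumptions made for this illustration.
    """
    n, d = z.shape
    z = z - z.mean(axis=0, keepdims=True)         # center every dimension
    z = z / (z.std(axis=0, keepdims=True) + eps)  # scale every dimension to unit variance
    k = z.T @ z / n                               # correlation matrix
    return float(np.sum(k ** 2)) / d ** 2

rng = np.random.default_rng(0)

# Nearly uncorrelated dimensions: K is close to the identity.
spread = rng.standard_normal((4096, 4))
# Fully collapsed batch: every dimension is a multiple of one scalar signal.
collapsed = rng.standard_normal((256, 1)) * np.array([[1.0, -2.0, 0.5, 3.0]])

loss_spread = feddecorr_loss(spread)
loss_collapsed = feddecorr_loss(collapsed)
print("loss on spread representations:   ", loss_spread)
print("loss on collapsed representations:", loss_collapsed)

# During local training the term would be added to the task loss with a
# coefficient beta (0.1 is the value quoted in Appendix C.3), e.g.
# total_loss = task_loss + beta * feddecorr_loss(z)
```

With this normalization the value lies between roughly 1/d for uncorrelated dimensions and 1 for a fully collapsed batch, which makes the penalty easy to interpret across architectures with different representation widths.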

H STABILITY OF FEDDECORR REGULARIZATION LOSS

In this section, we first split CIFAR10 into 10 clients with α = 0.5. Then, we plot how the FedDecorr loss evolves within 10 local epochs for all 10 clients in Fig. 11. All training configurations are the same as in the main paper. From the results, one can observe that the optimization process of the FedDecorr loss is stable.

