FEDBN: FEDERATED LEARNING ON NON-IID FEATURES VIA LOCAL BATCH NORMALIZATION

Abstract

The emerging paradigm of federated learning (FL) strives to enable collaborative training of deep models on the network edge without centrally aggregating raw data, thereby improving data privacy. In most cases, the assumption of independent and identically distributed (iid) samples across local clients does not hold in federated learning setups. Under this setting, neural network training performance may vary significantly according to the data distribution, and convergence may even be harmed. Most prior work has focused on differences in label distributions or on client drift. Unlike those settings, we address an important FL problem in which local clients store samples whose feature distributions differ from those of other clients (e.g., different scanners/sensors in medical imaging, or different scenery distributions in autonomous driving, such as highway vs. city), which we denote as feature shift non-iid. In this work, we propose an effective method that uses local batch normalization to alleviate the feature shift before averaging models. The resulting scheme, called FedBN, outperforms both classical FedAvg and the state-of-the-art method for non-iid data (FedProx) in our extensive experiments. These empirical results are supported by a convergence analysis showing, in a simplified setting, that FedBN has a faster convergence rate than FedAvg.

1. INTRODUCTION

Federated learning (FL) has gained popularity for various applications involving learning from distributed data. In FL, a cloud server (the "server") communicates with distributed data sources (the "clients"), while the clients hold their data separately. A major challenge in FL is the statistical heterogeneity of training data among the clients (Kairouz et al., 2019; Li et al., 2020b). It has been shown that standard federated methods such as FedAvg (McMahan et al., 2017), which are not designed with particular care for non-iid data, suffer significant performance degradation or even divergence when deployed over non-iid samples (Karimireddy et al., 2019; Li et al., 2018; 2020a). Recent studies have attempted to address the problem of FL on non-iid data. Most variants of FedAvg primarily tackle the issues of stability, client drift, and heterogeneous label distributions over clients (Li et al., 2020b; Karimireddy et al., 2019; Zhao et al., 2018). Instead, we focus on shift in the feature space, which has not yet been explored in the literature. Specifically, we consider local data that deviates in terms of its distribution in feature space, and identify this scenario as feature shift. This type of non-iid data is a critical problem in many real-world scenarios, typically in cases where the local devices are responsible for heterogeneity in the feature distributions. For example, in cancer diagnosis tasks, medical radiology images collected in different hospitals have uniformly distributed labels (i.e., the cancer types treated are quite similar across hospitals), but the image appearance can vary considerably due to the different imaging machines and protocols used, e.g., different intensity and contrast. In this example, each hospital is a client, and the hospitals aim to collaboratively train a cancer detection model without sharing privacy-sensitive data.
Tackling non-iid data with feature shift has been explored in classical centralized training in the context of domain adaptation. Here, an approach that is effective in practice is utilizing Batch Normalization (BN) (Ioffe & Szegedy, 2015): recent work has proposed BN as a tool to mitigate domain shift in domain adaptation tasks, with promising results (Li et al., 2016; Liu et al., 2020; Chang et al., 2019). Inspired by this, this paper proposes to apply BN to feature-shift FL. To illustrate the idea, we present a toy example showing how BN may help harmonize local feature distributions.

Observation of BN in an FL Toy Example: We consider a simple non-convex learning problem: we generate data x, y ∈ R with y = cos(w_true · x) + ε, where x is drawn iid from a Gaussian distribution and ε is zero-mean Gaussian noise, and we consider models of the form f_w(x) = cos(wx) with model parameter w ∈ R. Local data deviates in the variance of x. First, we illustrate that local batch normalization harmonizes local data distributions. We consider a simplified form of BN that normalizes the input by scaling it with γ, i.e., the local empirical standard deviation, and a setting with 2 clients. As Fig. 1 shows, the local squared loss surfaces are very different between the two clients; thus, averaging the models does not lead to a good model. However, when applying local BN, the local training error surfaces become similar, and averaging the models can be beneficial. To further illustrate the impact of BN, we plot the error surface for one client with respect to both the model parameter w ∈ R and the BN parameter γ ∈ R in Fig. 2. The figure shows that for an optimal weight w*_1, changing γ deteriorates the model quality; similarly, for a given optimal BN parameter γ*_1, changing w deteriorates the quality. In particular, the average model w̄ = (w*_1 + w*_2)/2 with average BN parameters γ̄ = (γ*_1 + γ*_2)/2 has a high generalization error.
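This toy setting can be sketched numerically. The client variances, noise level, and w_true below are illustrative choices, and "BN" here is just the simplified rescaling by the local empirical standard deviation described above:

```python
import numpy as np

rng = np.random.default_rng(0)
w_true = 2.0  # illustrative ground-truth parameter

# Two clients whose inputs differ only in scale (feature shift).
x1 = rng.normal(0.0, 1.0, 5000)   # client 1: std 1
x2 = rng.normal(0.0, 3.0, 5000)   # client 2: std 3
y1 = np.cos(w_true * x1) + rng.normal(0, 0.05, x1.size)
y2 = np.cos(w_true * x2) + rng.normal(0, 0.05, x2.size)

def local_loss(w, x, y):
    """Squared loss of f_w(x) = cos(w x) on one client's data."""
    return np.mean((np.cos(w * x) - y) ** 2)

# Simplified local BN: rescale inputs by the local empirical std (gamma).
g1, g2 = x1.std(), x2.std()
xn1, xn2 = x1 / g1, x2 / g2

# After local normalization the input distributions coincide, so the
# clients now see comparable feature statistics.
assert abs(xn1.std() - 1.0) < 0.05 and abs(xn2.std() - 1.0) < 0.05
```

With `local_loss` one can plot the per-client error surfaces over a grid of `w` before and after normalization, mimicking Fig. 1.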
At the same time, the average model w̄ with local BN parameter γ*_1 performs very well. Motivated by this insight, this paper proposes a novel federated learning method, called FedBN, for addressing non-iid training data, which keeps the client BN layers updated locally, without communicating them to, or aggregating them at, the server. In practice, we simply update the non-BN layers using FedAvg, without modifying any optimization or aggregation scheme. This approach has zero additional parameters to tune, requires minimal additional computational resources, and can be easily applied to arbitrary neural network architectures with BN layers in FL. Beyond the benefit shown in the toy example, we also show a benefit in convergence speed by theoretically analyzing the convergence of FedBN in the over-parameterized regime. In addition, we conduct extensive experiments on a benchmark and three real-world datasets. Compared to classical FedAvg, as well as the state-of-the-art method for non-iid data (FedProx), our method FedBN demonstrates significant practical improvements.

2. RELATED WORK

Techniques for Non-IID Challenges in Federated Learning: The most widely known aggregation strategy in FL, FedAvg (McMahan et al., 2017), often suffers when data is heterogeneous across local clients. Empirical work addressing non-iid issues mainly focuses on label distribution skew, where a non-iid dataset is formed by partitioning a "flat" existing dataset based on the labels. FedProx (Li et al., 2020b), a recent framework, tackles heterogeneity by allowing partial information aggregation and adding a proximal term to FedAvg. Zhao et al. (2018) assume that a subset of the data is globally shared among all clients. FedMA (Wang et al., 2020) proposes an aggregation strategy for non-iid data partitions that constructs the shared global model in a layer-wise manner. However, so far there have been only limited attempts to consider non-iid data induced by feature shift, which is common in medical data collected with different equipment and in natural images collected in varying noisy environments. Very recently, FedRobust (Reisizadeh et al., 2020) assumes data follows an affine distribution shift and tackles this problem by learning the affine transformation; this hampers generalization when the affine transformation cannot be estimated explicitly. Concurrently to our work, SiloBN (Andreux et al., 2020) empirically shows that having local clients keep some untrainable BN parameters can improve robustness to data heterogeneity, but provides no theoretical analysis of the approach; FedBN instead keeps all BN parameters strictly local. Recently, an orthogonal line of work on the non-iid problem has been proposed that focuses on improving the optimization mechanism (Reddi et al., 2020; Zhang et al., 2020).

Batch Normalization in Deep Neural Networks: Batch Normalization (Ioffe & Szegedy, 2015) is an indispensable component of many deep neural networks and has shown its value in neural network training.
Relevant literature has uncovered a number of benefits conferred by batch normalization. Santurkar et al. (2018) showed that BN makes the optimization landscape significantly smoother. Luo et al. (2018) investigated an explicit regularization form of BN that improves the robustness of optimization. Morcos et al. (2018) suggested that BN implicitly discourages reliance on single directions, thus improving model generalizability. Li et al. (2018) took advantage of BN to tackle the domain adaptation problem. However, the role BN plays in federated learning, especially for non-iid training, remains unexplored to date.

3. PRELIMINARY

Non-IID Data in Federated Learning: We introduce the concept of feature shift in federated learning as a novel category of clients' non-iid data distributions. The categories of non-iid data considered so far, according to Kairouz et al. (2019) and Hsieh et al. (2019), can be described by the joint probability between features x and labels y on each client. We can rewrite P_i(x, y) as P_i(y|x)P_i(x) or P_i(x|y)P_i(y). We define feature shift as the case that covers: 1) covariate shift: the marginal distributions P_i(x) vary across clients, even if P_i(y|x) is the same for all clients; and 2) concept shift: the conditional distributions P_i(x|y) vary across clients while P(y) is the same.

Federated Averaging (FedAvg): We build our algorithm on FedAvg, introduced by McMahan et al. (2017), which is the most popular and easiest-to-implement federated learning strategy, in which clients collaboratively send updates of locally trained models to a global server. Each client runs a local copy of the global model on its local data. The global model's weights are then updated with an average of the local clients' updates and deployed back to the clients. This builds upon previous distributed learning work by not only supplying local models but also performing training locally on each device. Hence, FedAvg potentially empowers clients (especially clients with small datasets) to collaboratively learn a shared prediction model while keeping all training data local. Although FedAvg has shown success on classical federated learning tasks, it suffers from slow convergence and low accuracy in most non-iid settings (Li et al., 2020b; 2019).
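The covariate-shift case above can be simulated in a few lines. The shift and scale values below are illustrative stand-ins for, e.g., scanner intensity and contrast differences; the labeling rule is a hypothetical shared concept:

```python
import numpy as np

rng = np.random.default_rng(42)

def make_client(shift, scale, n=1000):
    """Covariate shift: P_i(x) differs per client while the label
    distribution is identical. 'shift'/'scale' mimic client-specific
    appearance (e.g., different scanners), applied on top of a shared
    underlying concept."""
    y = rng.integers(0, 2, n)                 # identical label distribution
    x = rng.normal(y.astype(float), 1.0, n)   # shared concept P(x-content|y)
    return scale * x + shift, y               # client-specific appearance

(xa, ya), (xb, yb) = make_client(0.0, 1.0), make_client(5.0, 3.0)

# Labels are (near-)identically distributed; features are not.
assert abs(ya.mean() - yb.mean()) < 0.1
assert xb.mean() - xa.mean() > 1.0
```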

4.1. PROPOSED METHOD -FEDBN

We propose an efficient and effective learning strategy denoted FedBN. Similar to FedAvg, FedBN performs local updates and averages local models. However, FedBN assumes local models have BN layers and excludes their parameters from the averaging step. We present the full algorithm in Appendix C. This simple modification results in significant empirical improvements in non-iid settings. We provide an explanation for these improvements in a simplified scenario, in which we show that FedBN improves the convergence rate under feature shift.
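As a minimal sketch of this modification, the server-side step might look as follows. The dict-of-arrays model representation and the `'bn'` naming convention are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def fedbn_aggregate(client_models):
    """Average all parameters across clients EXCEPT batch-norm layers,
    which stay local: the core modification FedBN makes to FedAvg.
    Models are dicts mapping parameter names to numpy arrays; any name
    containing 'bn' is treated as a BN parameter (naming is illustrative)."""
    shared = {}
    for name in client_models[0]:
        if 'bn' in name:
            continue  # BN params are never sent to the server
        shared[name] = np.mean([m[name] for m in client_models], axis=0)
    # Broadcast: each client takes the shared weights, keeps its own BN.
    for m in client_models:
        m.update({k: v.copy() for k, v in shared.items()})
    return client_models

clients = [
    {'conv.weight': np.ones((2, 2)) * i, 'bn.gamma': np.full(2, float(i))}
    for i in range(3)
]
clients = fedbn_aggregate(clients)
assert np.allclose(clients[0]['conv.weight'], 1.0)             # (0+1+2)/3
assert [c['bn.gamma'][0] for c in clients] == [0.0, 1.0, 2.0]  # untouched
```

Note that setting the `'bn'` filter to match nothing recovers plain FedAvg, which is why FedBN drops into existing FedAvg pipelines with no new hyperparameters.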

4.2. PROBLEM SETUP

We assume N ∈ N clients jointly train for T ∈ N epochs and communicate after every E ∈ N local iterations; each client i holds M local samples {(x^i_j, y^i_j)}_{j=1}^M, following the setup of Li et al. (2019). To be more precise, we make the following assumption.

Assumption 4.1 (Data Distribution). For each client i ∈ [N], the inputs x^i_j are centered (E[x^i] = 0) with covariance matrix S_i = E[x^i (x^i)^⊤], where S_i is independent of the label y and may differ across i ∈ [N] (e.g., the S_i are not all identity matrices), and for each index pair p ≠ q, x_p ≠ κ · x_q for all κ ∈ R \ {0}.

With Assumption 4.1, the normalization of the first layer for client i is v_k^⊤ x^i / ‖v_k‖_{S_i}, where ‖v‖_{S_i} := (v^⊤ S_i v)^{1/2}. FedBN with client-specific BN parameters trains a model f* : R^d → R parameterized by (V, γ, c) ∈ R^{m×d} × R^{m×N} × R^m, i.e.,

f*(x; V, γ, c) = (1/√m) Σ_{k=1}^m c_k Σ_{i=1}^N σ( γ_{k,i} · v_k^⊤ x / ‖v_k‖_{S_i} ) · 1{x ∈ client i},   (1)

where γ collects the scaling parameters of BN, σ(s) = max{s, 0} is the ReLU activation function, and c collects the top-layer parameters of the network. Here, we omit learning the shift parameter of BN foot_0. FedAvg instead trains a function f : R^d → R, which is a special case of Eq. 1 with γ_{k,i} = γ_k for all i ∈ [N]. We take a random initialization of the parameters (Salimans & Kingma, 2016) in our analysis:

v_k(0) ∼ N(0, α² I), c_k ∼ U{−1, 1}, and γ_k = γ_{k,i} = ‖v_k(0)‖₂ / α,   (2)

where α² controls the magnitude of v_k at initialization. The initialization of the BN parameters γ_k and γ_{k,i} is independent of α. The parameters of the network f*(x; V, γ, c) are obtained by minimizing the empirical risk with respect to the squared loss using gradient descent:

L(f*) = (1/(N M)) Σ_{i=1}^N Σ_{j=1}^M ( f*(x^i_j) − y^i_j )².   (3)
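The network f* and the initialization above can be sketched directly in numpy; the dimensions, covariance choices, and test input below are illustrative, not part of the analysis:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, N = 5, 64, 3     # input dim, hidden width, number of clients (illustrative)
alpha = 2.0

V = rng.normal(0, alpha, (m, d))           # v_k(0) ~ N(0, alpha^2 I)
c = rng.choice([-1.0, 1.0], m)             # c_k ~ U{-1, 1}
# gamma_{k,i} = ||v_k(0)||_2 / alpha at initialization, shared across clients
gamma = np.tile((np.linalg.norm(V, axis=1) / alpha)[:, None], (1, N))

def f_star(x, i, S):
    """FedBN two-layer network evaluated on a sample x held by client i,
    where S[i] is that client's input covariance matrix."""
    # ||v_k||_{S_i} = sqrt(v_k^T S_i v_k): the per-client BN normalizer
    norms = np.sqrt(np.einsum('kd,de,ke->k', V, S[i], V))
    pre = gamma[:, i] * (V @ x) / norms
    return (c * np.maximum(pre, 0.0)).sum() / np.sqrt(m)

S = [np.eye(d) * (i + 1) for i in range(N)]  # distinct covariances S_i
out = f_star(rng.normal(0, 1, d), 0, S)
assert np.isfinite(out)
```

Training each `gamma[:, i]` locally while sharing `V` and `c` is exactly the FedBN split of the parameter vector.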

4.3. CONVERGENCE ANALYSIS

Here we study the trajectories of the predictions of the FedAvg network f and the FedBN network f* through the neural tangent kernel (NTK) introduced by Jacot et al. (2018). Recent machine learning theory (Arora et al., 2019; Du et al., 2018; Allen-Zhu et al., 2019; van den Brand et al., 2020; Dukler et al., 2020) has shown that for finite-width over-parameterized networks, the convergence rate is controlled by the least eigenvalue of the induced kernel during the training evolution. To simplify tracing the optimization dynamics, we consider the case where the number of local updates E is 1. Following Dukler et al. (2020), we can decompose the NTK into a magnitude component G(t) and a direction component V(t)/α²:

df/dt = −Λ(t)(f(t) − y), where Λ(t) := V(t)/α² + G(t).

The specific forms of V(t) and G(t) are given in Appendix B.1. Let λ_min(A) denote the minimal eigenvalue of a matrix A. The matrices V(t) and G(t) are positive semi-definite, since they can be viewed as covariance matrices. This gives λ_min(Λ(t)) ≥ max{ λ_min(V(t))/α², λ_min(G(t)) }. By the NTK analysis, the convergence rate is controlled by λ_min(Λ(t)); hence, for α > 1, convergence is dominated by G(t). Let Λ(t) and Λ*(t) denote the evolution dynamics of FedAvg and FedBN, respectively, and let G(t) and G*(t) denote the magnitude components of those dynamics. For the convergence analysis, we use auxiliary versions of the Gram matrices, defined as follows.

Definition 4.2. Given sample points {x_p}_{p=1}^{NM}, we define the auxiliary Gram matrices G^∞ ∈ R^{NM×NM} and G^{*∞} ∈ R^{NM×NM} as

G^∞_pq := E_{v∼N(0, α²I)}[ σ(v^⊤ x_p) σ(v^⊤ x_q) ],   (FedAvg) (4)
G^{*∞}_pq := E_{v∼N(0, α²I)}[ σ(v^⊤ x_p) σ(v^⊤ x_q) ] · 1{i_p = i_q},   (FedBN) (5)

where i_p denotes the client holding sample x_p. Given Assumption 4.1, we use the key results in Dukler et al. (2020) to show that G^∞ is positive definite. Further, we show that G^{*∞} is positive definite.
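Definition 4.2 can be probed numerically: a Monte Carlo estimate of G^∞, with G^{*∞} obtained by zeroing cross-client entries, already exhibits the eigenvalue ordering used below. The dimensions and sample counts are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
d, M, N, alpha = 4, 3, 2, 1.5
X = rng.normal(0, 1, (N * M, d))     # N*M samples; sample p belongs to client p // M
client = np.arange(N * M) // M

# Monte Carlo estimate of the auxiliary Gram matrices (Definition 4.2):
#   G_inf[p,q]  = E_v[ sigma(v.x_p) sigma(v.x_q) ],  v ~ N(0, alpha^2 I)
#   G*_inf[p,q] = same, masked to same-client pairs  1{i_p = i_q}
Vs = rng.normal(0, alpha, (20000, d))
A = np.maximum(Vs @ X.T, 0.0)        # sigma(v.x_p) for every draw and sample
G = A.T @ A / Vs.shape[0]
G_star = G * (client[:, None] == client[None, :])

# Keeping only the diagonal blocks can never lower the least eigenvalue,
# which is the mechanism behind the FedBN vs. FedAvg rate comparison.
lmin = np.linalg.eigvalsh(G)[0]
lmin_star = np.linalg.eigvalsh(G_star)[0]
assert lmin_star >= lmin - 1e-10
```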
We use the fact that in an over-parameterized neural network the distance between G(t) and its auxiliary version is small, so that G(t) remains positive definite. Based on our formulation, the convergence rate of FedAvg (Theorem 4.4) can be derived from Dukler et al. (2020) by considering non-identical covariance matrices. We derive the convergence rate of FedBN in Corollary 4.5. Our key result comparing the convergence rates of FedAvg and FedBN culminates in Corollary 4.6.

Theorem 4.4 (G-dominated convergence for FedAvg, Dukler et al. (2020)). Suppose the network is initialized as in (2) with α > 1 and trained using gradient descent, and that Assumption 4.1 holds. Suppose the loss is the square loss with targets y satisfying ‖y‖_∞ = O(1). If m = Ω( max{ N⁴M⁴ log(NM/δ) / (α⁴μ₀⁴), N²M² log(NM/δ) / μ₀² } ), then with probability 1 − δ:
1. For iterations t = 0, 1, ..., the evolution matrix Λ(t) satisfies λ_min(Λ(t)) ≥ μ₀/2.
2. Training with gradient descent of step size η = O(1/‖Λ(t)‖) converges linearly as ‖f(t) − y‖₂² ≤ (1 − ημ₀/2)^t ‖f(0) − y‖₂².

Following the key ideas in Dukler et al. (2020), we further characterize the convergence of FedBN.

Proof sketch

The key is to show λ_min(G^∞) ≤ λ_min(G^{*∞}). Comparing equations (4) and (5), G^{*∞} keeps only the M × M block matrices on the diagonal of G^∞. Let G^∞_i be the i-th M × M diagonal block of G^∞. By linear algebra, λ_min(G^∞_i) ≥ λ_min(G^∞) for i ∈ [N]. Since G^{*∞} = diag(G^∞_1, ..., G^∞_N), we have λ_min(G^{*∞}) = min_{i∈[N]} { λ_min(G^∞_i) }. Therefore, λ_min(G^{*∞}) ≥ λ_min(G^∞).
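The linear-algebra step in this argument (restricting a symmetric PSD matrix to its diagonal blocks can only raise the least eigenvalue, by Cauchy eigenvalue interlacing) can be checked numerically on a random PSD matrix; block sizes here are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(7)
N, M = 4, 5                        # N diagonal blocks of size M x M
B = rng.normal(size=(N * M, N * M))
G = B @ B.T                        # symmetric PSD, plays the role of G_inf

# Keep only the M x M diagonal blocks -> plays the role of G*_inf
idx = np.arange(N * M) // M
G_star = G * (idx[:, None] == idx[None, :])

# Cauchy interlacing: every principal submatrix G_i satisfies
# lambda_min(G_i) >= lambda_min(G); the block-diagonal matrix's least
# eigenvalue is the minimum over its blocks, hence >= lambda_min(G).
assert np.linalg.eigvalsh(G_star)[0] >= np.linalg.eigvalsh(G)[0] - 1e-10
```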

5. EXPERIMENTS

In this section, we demonstrate that using local BN parameters is beneficial in the presence of feature shift across clients with heterogeneous data. Our local parameter sharing strategy, FedBN, achieves more robust and faster convergence on feature-shift non-iid datasets and obtains better model performance than alternative methods. We show this on both benchmark and large real-world datasets.

5.1. BENCHMARK EXPERIMENTS

Settings: We perform an extensive empirical analysis using a benchmark digits classification task containing different data sources with feature shift, where each dataset comes from a different domain. Data from different domains have heterogeneous appearance but share the same labels and label distribution. Specifically, we use the following five datasets: SVHN (Netzer et al., 2011), USPS (Hull, 1994), SynthDigits (Ganin & Lempitsky, 2015), MNIST-M (Ganin & Lempitsky, 2015), and MNIST (LeCun et al., 1998). To match the setup in Section 4, we truncate the sample size of the five datasets to that of the smallest by random sampling, resulting in 7438 training samples per dataset foot_2. Testing samples are held out and kept the same for all experiments on this benchmark. Our classification model is a convolutional neural network in which BN layers are added after each feature extraction layer (i.e., both convolutional and fully-connected); the architecture is detailed in Appendix D.2. For model training, we use the cross-entropy loss and the SGD optimizer with a learning rate of 10^-2. Unless specified otherwise, our default setting for local update epochs is E = 1, and the default amount of data at each client is 10% foot_3 of the dataset's original size. In the default non-iid setting, the FL system contains five clients, each exclusively owning data sampled from one of the five datasets. More details are listed in Appendix D.2.

Analysis of Local Updating Epochs: Aggregating at different frequencies may affect learning behavior. Although our theory and the default setting of the other experiments take E = 1, we demonstrate that FedBN is effective when E > 1. In Fig. 4 (a), we explore E = 1, 4, 8, 16 and compare FedBN to the FedAvg baseline. As expected, an inverse relationship between the local updating epochs E and testing accuracy is observed for both FedBN and FedAvg in Fig. 4 (a).
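The communication schedule just described (E local epochs per round, then averaging) can be sketched on a stand-in model. The linear least-squares model, client count, learning rate, and round count below are illustrative assumptions, not the paper's CNN setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def run_round(w_global, datasets, E, lr=0.01):
    """One communication round: each client runs E local full-batch
    gradient epochs on a linear least-squares model, then the server
    averages the local weights (FedAvg-style; a linear model is used
    only to illustrate the E-epochs-then-average schedule)."""
    local_ws = []
    for X, y in datasets:
        w = w_global.copy()
        for _ in range(E):
            w -= lr * 2 * X.T @ (X @ w - y) / len(y)
        local_ws.append(w)
    return np.mean(local_ws, axis=0)

d = 3
# Two clients whose feature scales differ (a crude feature shift).
datasets = [(rng.normal(0, s, (50, d)), rng.normal(0, 1, 50)) for s in (1, 3)]
w = np.zeros(d)
for _ in range(20):          # 20 communication rounds with E = 4
    w = run_round(w, datasets, E=4)
assert np.all(np.isfinite(w))
```

Varying `E` here reproduces the trade-off studied in Fig. 4 (a): larger `E` means fewer synchronizations for the same compute budget.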
Zooming into the final testing accuracy, FedBN's accuracy stably exceeds that of FedAvg across the various values of E.

Effects of Statistical Heterogeneity:

A salient question arises: up to what degree of feature-shift heterogeneity is FedBN superior to FedAvg? To answer this, we simulate federated settings with varying heterogeneity as described below. We parcel each dataset into 10 subsets, one for each client, with an equal number of data samples and the same label distribution. We treat clients generated from the same dataset as iid clients, and clients generated from different datasets as non-iid clients. We start by including one client from each dataset in the FL system. Then we simultaneously add one more client from each dataset while keeping the existing clients, repeating this n times for n ∈ {1, ..., 9} foot_5. For each setting, we train models from scratch. More clients correspond to less heterogeneity. We show the testing accuracy under the different levels of heterogeneity in Fig. 4 (c), including a comparison with FedAvg, which is designed for iid FL. FedBN achieves substantially higher testing accuracy than FedAvg at all levels of heterogeneity.
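The iid parceling described above is essentially a stratified split; a minimal sketch follows, with illustrative class and sample counts:

```python
import numpy as np

rng = np.random.default_rng(0)

def stratified_subsets(labels, n_subsets=10):
    """Split one dataset into n_subsets equal parts with identical label
    distributions; subsets drawn from the same dataset then act as iid
    clients, while subsets from different datasets are non-iid clients."""
    parts = [[] for _ in range(n_subsets)]
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        for s, chunk in enumerate(np.array_split(idx, n_subsets)):
            parts[s].extend(chunk.tolist())
    return parts

labels = np.repeat(np.arange(5), 200)   # 5 classes, 200 samples each
parts = stratified_subsets(labels)
assert all(len(p) == 100 for p in parts)
# every subset carries the same label distribution (20 per class)
assert all(np.bincount(labels[p]).tolist() == [20] * 5 for p in parts)
```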

Figure 5: Performance on benchmark experiments

Comparison with State-of-the-Art: To further validate our method, we compare FedBN with one of the current state-of-the-art methods for non-iid FL, FedProx (Li et al., 2020b), which also shares the benefit of easy adaptation to current FL frameworks in practice. We also include training on SingleSet (i.e., each client training a model on its local data only) and FedAvg as baselines. For each strategy, we split off an independent testing dataset on each client and report the accuracy on these testing datasets. We repeat each experiment in 5 trials with different random seeds. The mean and standard deviation of the accuracy on each dataset over the trials are shown in Fig. 5 foot_6. From the results, we make the following observations: (1) FedBN achieves the highest accuracy, consistently outperforming the state-of-the-art and baseline methods; (2) FedBN achieves its most significant improvements on SVHN, whose image appearance is very different from the others (i.e., it presents a more obvious feature shift); (3) FedBN shows a smaller variance in error over multiple runs, indicating its stability.

5.2. EXPERIMENTS ON REAL-WORLD DATASETS

To better understand how our proposed algorithm can be beneficial in real-world feature-shift non-iid settings, we extensively validate the effectiveness of FedBN in comparison with other methods on three real-world datasets: (1) image classification on Office-Caltech10 (Gong et al., 2012); (2) image classification on DomainNet; and (3) a medical application on ABIDE I, for which we include four medical institutions (NYU, USM, UM, UCLA; each viewed as a client) that collected functional brain images using different imaging equipment and protocols, and perform binary classification between autism spectrum disorder patients and healthy control subjects. Office-Caltech10 contains ten categories of objects. DomainNet contains 345 object categories; we use the ten most common classes to form a sub-dataset for our experiments. Our classification models adopt the AlexNet (Krizhevsky et al., 2012) architecture with BN added after each convolutional and fully-connected layer. Before being fed into the network, all images are resized to 256 × 256 × 3. For ABIDE I, each instance is represented as a 5995-dimensional vector through brain connectome computation. We use a three-layer fully-connected neural network as the classifier, with hidden layers of size 16 and two BN layers after the first two fully-connected layers. As for the benchmark above, we perform 5 repeated runs for each experiment.

Results and Analysis:

The experimental results are shown in Table 1 in the form of mean (std). On Office-Caltech10, FedBN significantly outperforms the state-of-the-art method FedProx and improves mean accuracy by at least 6% over all alternative methods. On DomainNet, FedBN achieves superior accuracy on most of the datasets. Interestingly, we find that the alternative FL methods achieve results comparable to SingleSet except on Quickdraw, where FedBN outperforms them by over 10%. Surprisingly, for the above two tasks, the alternative FL strategies are ineffective on feature-shift non-iid data, performing even worse than training on single-client data for most clients. On ABIDE I, FedBN excels by a non-negligible margin on three clients in terms of mean testing accuracy. These results are inspiring and raise the hope of deploying FedBN in the healthcare field, where data are often limited, isolated, and heterogeneous in features.

6. CONCLUSION AND DISCUSSION

This work proposes a novel federated learning aggregation method called FedBN that keeps the local Batch Normalization parameters out of synchronization with the global model, thereby mitigating feature shift in non-IID data. We provide convergence guarantees for FedBN in realistic federated settings under the over-parameterized neural network regime, while also accounting for practical issues. In our experiments, evaluation across a suite of federated datasets demonstrates that FedBN can significantly improve convergence behavior and model performance on non-IID data. We also demonstrate the effectiveness of FedBN in scenarios where a new client with an unknown domain joins the FL system (see Appendix G). FedBN is independent of the communication and aggregation strategy and can thus in practice readily be combined with different optimization algorithms, communication schemes, and aggregation techniques. The theoretical analysis of such combinations is an interesting direction for future work. We also note that since FedBN makes only lightweight modifications to FedAvg and has much flexibility to be combined with other strategies, it can easily be integrated into existing tool-kits/systems, such as PySyft (Ryffel et al., 2018), Google TFF (Google, 2020), Flower (Beutel et al., 2020), dlplatform (Kamp & Adilova, 2020), and FedML (He et al., 2020) foot_8. We believe that FedBN can improve a wide range of applications such as healthcare (Rieke et al., 2020) and autonomous driving (Kamp et al., 2018).

B.1 EVOLUTION DYNAMICS

Notation: λ_min(A) denotes the minimal eigenvalue of a matrix A; G^∞ is the expectation of G(t); G^{*∞} is the expectation of G^{*}(t). We derive the evolution dynamics for training with f (and analogously f*). Since the parameters are updated using gradient descent, the optimization dynamics of the parameters are

dv_k/dt = −∂L/∂v_k,   dγ_k/dt = −∂L/∂γ_k.

Let f_p = f(x^{i_p}_p).
Then, the dynamics of the prediction for the p-th data point, held by client i_p, are

∂f_p/∂t = Σ_{k=1}^m [ (∂f_p/∂v_k)^⊤ (dv_k/dt) + (∂f_p/∂γ_k)(dγ_k/dt) ] = −Σ_{k=1}^m (∂f_p/∂v_k)^⊤ (∂L/∂v_k) − Σ_{k=1}^m (∂f_p/∂γ_k)(∂L/∂γ_k) =: −T^v_p − T^γ_p.

The gradients of f_p and L with respect to v_k and γ_k are computed as

∂f_p/∂v_k(t) = (1/√m) c_k · ( γ_k(t) / ‖v_k(t)‖_{S_{i_p}} ) · x^⊥_{p,k}(t) 1_{pk}(t),
∂L/∂v_k(t) = (1/√m) Σ_{q=1}^{NM} (f_q(t) − y_q) c_k · ( γ_k(t) / ‖v_k(t)‖_{S_{i_q}} ) · x^⊥_{q,k}(t) 1_{qk}(t),
∂f_p/∂γ_k(t) = (1/√m) c_k / ‖v_k(t)‖_{S_{i_p}} · σ( v_k(t)^⊤ x_p ),
∂L/∂γ_k(t) = (1/√m) Σ_{q=1}^{NM} (f_q(t) − y_q) c_k / ‖v_k(t)‖_{S_{i_q}} · σ( v_k(t)^⊤ x_q ),

where x^⊥_{p,k}(t) := ( I − S_{i_p} v_k(t) v_k(t)^⊤ / ‖v_k(t)‖²_{S_{i_p}} ) x_p and 1_{pk}(t) := 1{ v_k(t)^⊤ x_p ≥ 0 }. We define the Gram matrices V(t) and G(t) as

V_pq(t) = (1/m) Σ_{k=1}^m ( α c_k γ_k(t) )² ‖v_k(t)‖⁻¹_{S_{i_p}} ‖v_k(t)‖⁻¹_{S_{i_q}} ⟨ x^⊥_{p,k}(t), x^⊥_{q,k}(t) ⟩ 1_{pk}(t) 1_{qk}(t),   (6)
G_pq(t) = (1/m) Σ_{k=1}^m c_k² ‖v_k(t)‖⁻¹_{S_{i_p}} ‖v_k(t)‖⁻¹_{S_{i_q}} σ( v_k(t)^⊤ x_p ) σ( v_k(t)^⊤ x_q ).   (7)

It follows that

T^v_p(t) = Σ_{q=1}^{NM} ( V_pq(t) / α² )( f_q(t) − y_q ),   T^γ_p(t) = Σ_{q=1}^{NM} G_pq(t)( f_q(t) − y_q ).

Let f = (f_1, ..., f_{NM})^⊤ = ( f(x_1), ..., f(x_{NM}) )^⊤. The full evolution dynamics are given by

df/dt = −Λ(t)( f(t) − y ), where Λ(t) := V(t)/α² + G(t).

Similarly, we compute the Gram matrices V*(t) and G*(t) for FedBN with f* as

V*_pq(t) = (1/m) Σ_{k=1}^m ( α c_k )² γ_{k,i_p}(t) γ_{k,i_q}(t) ‖v_k(t)‖⁻¹_{S_{i_p}} ‖v_k(t)‖⁻¹_{S_{i_q}} ⟨ x^⊥_{p,k}(t), x^⊥_{q,k}(t) ⟩ 1_{pk}(t) 1_{qk}(t),   (8)
G*_pq(t) = (1/m) Σ_{k=1}^m c_k² ‖v_k(t)‖⁻¹_{S_{i_p}} ‖v_k(t)‖⁻¹_{S_{i_q}} σ( v_k(t)^⊤ x_p ) σ( v_k(t)^⊤ x_q ) 1{ i_p = i_q }.   (9)

Thus, the full evolution dynamics of FedBN are

df*/dt = −Λ*(t)( f*(t) − y ), where Λ*(t) := V*(t)/α² + G*(t).

B.2 PROOF OF LEMMA 4.3

Dukler et al. (2020) proved that the matrix G^∞ is strictly positive definite. In their proof, G^∞ is the covariance matrix of the functionals φ_p, defined as φ_p(v) := σ( v^⊤ x_p ), over the Hilbert space V of L²( N(0, α² I) ).
The strict positive definiteness of G^∞ is equivalent to the statement that φ_1, ..., φ_{NM} are linearly independent, i.e., that

c_1 φ_1 + c_2 φ_2 + ... + c_{NM} φ_{NM} = 0   (10)

in V holds only for c_p = 0 for all p. Let G^∞_i denote the i-th M × M diagonal block of G^∞. Then G^{*∞} = diag( G^∞_1, ..., G^∞_N ). To prove that G^{*∞} is strictly positive definite, we show that each G^∞_i is positive definite. Define φ*_{j,i}(v) := σ( v^⊤ x_j ) 1{ j ∈ client i }, j = 1, ..., M. We show that

c_1 φ*_{1,i} + c_2 φ*_{2,i} + ... + c_M φ*_{M,i} = 0   (11)

holds only for c_j = 0 for all j ∈ [M]. Suppose there exist c_1, ..., c_M, not all identically 0, satisfying (11). Take these as the coefficients for client i and set the coefficients for all other clients to 0. Then we obtain a sequence of coefficients, not all zero, satisfying (10), which contradicts the strict positive definiteness of G^∞. This implies that each G^∞_i is strictly positive definite; in particular, its eigenvalues are positive. Since the eigenvalues of G^{*∞} are exactly the union of the eigenvalues of the G^∞_i, λ_min(G^{*∞}) is positive and thus G^{*∞} is strictly positive definite.

B.3 PROOF OF COROLLARY 4.6

To compare the convergence rates of FedAvg and FedBN when E = 1, we compare the exponential factors in the convergence rates, which are (1 − ημ₀/2) and (1 − ημ*₀/2) for FedAvg and FedBN, respectively. This reduces to comparing μ₀ = λ_min(G^∞) and μ*₀ = λ_min(G^{*∞}). Comparing equations (7) and (9), G^{*∞} keeps the M × M block matrices on the diagonal of G^∞:

G^∞ = [ G^∞_1, G^∞_{1,2}, ..., G^∞_{1,N} ; G^∞_{1,2}, G^∞_2, ..., G^∞_{2,N} ; ... ; G^∞_{1,N}, G^∞_{2,N}, ..., G^∞_N ],
G^{*∞} = diag( G^∞_1, G^∞_2, ..., G^∞_N ),

where G^∞_i is the i-th M × M diagonal block of G^∞. By linear algebra, λ_min(G^∞_i) ≥ λ_min(G^∞) for all i ∈ [N]. Since the eigenvalues of G^{*∞} are exactly the union of the eigenvalues of the G^∞_i, we have λ_min(G^{*∞}) = min_{i∈[N]} { λ_min(G^∞_i) } ≥ λ_min(G^∞). Thus, (1 − ημ₀/2) ≥ (1 − ημ*₀/2), and we conclude that the convergence rate of FedBN is at least as fast as that of FedAvg.

C FEDBN ALGORITHM

We describe the detailed algorithm of our proposed FedBN in the following Algorithm 1: Algorithm 1 Federated Learning using FedBN. Notations: the client is indexed by k, the neural network layer is indexed by l, and the initialized model parameters are w

Dataset:

The study was carried out using resting-state fMRI (rs-fMRI) data from the Autism Brain Imaging Data Exchange dataset (ABIDE I preprocessed, (Di Martino et al., 2014)). ABIDE is a consortium that provides previously collected rs-fMRI data of ASD subjects and matched controls for the purpose of data sharing within the scientific community. We downloaded Region-of-Interest (ROI) fMRI series of the four largest sites (UM, NYU, USM, UCLA, viewed as clients) from the preprocessed ABIDE dataset, preprocessed with the Configurable Pipeline for the Analysis of Connectomes (CPAC) and parcellated by the Harvard-Oxford (HO) atlas. Skipping subjects with missing files results in 88, 167, 52, and 63 subjects for UM, NYU, USM, and UCLA, respectively. Due to a lack of sufficient data, we used sliding windows (with window size 32 and stride 1) to truncate the raw fMRI time sequences. The compositions of the four sites are shown in Table 10. The number of overlapping windows is the dataset size of a client.
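The sliding-window truncation can be sketched as follows; the series length and ROI count here are illustrative placeholders, not ABIDE's actual dimensions:

```python
import numpy as np

def sliding_windows(series, window=32, stride=1):
    """Cut an ROI time series (T x n_rois) into overlapping windows,
    as done for ABIDE I to enlarge small client datasets; each window
    becomes one training instance for that client."""
    T = series.shape[0]
    return np.stack([series[s:s + window]
                     for s in range(0, T - window + 1, stride)])

ts = np.random.default_rng(0).normal(size=(100, 111))  # illustrative T, n_rois
wins = sliding_windows(ts)
assert wins.shape == (100 - 32 + 1, 32, 111)           # 69 overlapping windows
```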

F SYNTHETIC DATA EXPERIMENT

Settings: We generate data from two pairs of multivariate Gaussian distributions. For one pair, samples (x, 0) and (x, 1) are drawn from N(−1, Σ₁) and N(1, Σ₁) respectively, with covariance Σ₁ ∈ R^{10×10}. For the other pair, samples (x̃, 0) and (x̃, 1) are drawn from N(−1, Σ₂) and N(1, Σ₂) respectively, with covariance Σ₂ ∈ R^{10×10}. Specifically, we design Σ₁ as the identity matrix, while Σ₂ differs from Σ₁ by having non-zero values on off-diagonal entries. We train a two-layer neural network with 100 hidden neurons for 600 steps using the cross-entropy loss and an SGD optimizer with a 1 × 10⁻⁵ learning rate. Denote by W_k and b_k the in-connection weights and bias term of neuron k. We initialize the model parameters with W_k ∼ N(0, α² I), b_k ∼ N(0, α²), where α = 10.

Results: The aim of the synthetic experiments is to study the behavior of FedBN in a controlled setup. Both FedAvg and FedBN achieve 100% accuracy on the binary classification task. Fig. 9 compares the training loss curves over steps for FedAvg and FedBN, showing that FedBN obtains significantly faster convergence than FedAvg.
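The data generation for this synthetic setup can be sketched as follows; the off-diagonal value used for Σ₂ and the per-class sample count are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 10, 500  # feature dimension from the paper; n is illustrative

def make_pair(cov):
    """One client's data: class 0 ~ N(-1, cov), class 1 ~ N(+1, cov),
    where the mean is the all-(+/-1)s vector in R^10."""
    x0 = rng.multivariate_normal(-np.ones(d), cov, n)
    x1 = rng.multivariate_normal(np.ones(d), cov, n)
    return np.vstack([x0, x1]), np.repeat([0, 1], n)

cov1 = np.eye(d)                                        # Sigma_1: identity
cov2 = np.eye(d) + 0.4 * (np.ones((d, d)) - np.eye(d))  # Sigma_2: correlated
(Xa, ya), (Xb, yb) = make_pair(cov1), make_pair(cov2)

assert Xa.shape == (2 * n, d) and ya.mean() == 0.5
```

The two clients thus share P(y) and the class means but differ in feature covariance, a controlled instance of the feature shift studied in the paper.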



Footnotes: (1) We omit centering neurons as well as learning the shift parameter of BN in the neural network analysis, because of the assumption that x is zero-mean and the two-layer network setting (Kohler et al., 2019; Salimans & Kingma, 2016). (2) This data preprocessing is intended to strictly control unrelated factors (e.g., imbalanced sample numbers across clients), so that the experimental findings more clearly reflect the effect of local BN. Results without truncating are reported in Appendix E.2. (3) Choosing a 10% fraction as the default is based on (i) considering it a typical setting to present the general efficacy of our method; and (ii) matching the literature, where the client size is usually around 100 to 1000 data points (McMahan et al., 2017; Li et al., 2019; Hsu et al., 2019), a similar scale to our 10% setting in terms of absolute sample numbers. Detailed statistics and FedAvg results are presented in Appendix E.2. (4) Namely, each increment contains five non-iid clients. Detailed statistics are shown in Appendix E.2. (5) More details about the datasets and training process are listed in Appendix D.3 and D.4. (6) The implementations on dlplatform and Flower are available, and the implementation on FedML is to appear soon.



Figure 1: Training error on local datasets for two clients, respectively with and without BN; BN harmonizes the loss surface.

Lemma 4.3. Fix points $\{x_p\}_{p=1}^{NM}$ satisfying Assumption 4.1. Then the Gram matrices $G^{\infty}$ and $G^{*\infty}$ defined as in (4) and (5) are strictly positive definite. Let the least eigenvalues be $\lambda_{\min}(G^{\infty}) =: \mu_0$ and $\lambda_{\min}(G^{*\infty}) =: \mu_0^*$, where $\mu_0, \mu_0^* > 0$.

Proof sketch. The main idea follows Du et al. (2018) and Dukler et al. (2020): given the points $\{x_p\}_{p=1}^{NM}$, the matrices $G^{\infty}$ and $G^{*\infty}$ can be shown to be covariance matrices of linearly independent operators. Details of the proof are given in Appendix B.2.

Figure 3: Convergence of the training loss of FedBN and FedAvg on the digits classification datasets. FedBN exhibits faster and more robust convergence.

Corollary 4.6 (Convergence rate comparison between FedAvg and FedBN). For the G-dominated convergence, the convergence rate of FedBN is faster than that of FedAvg.

Figure 4: Analytical experimental results on: (a) Analysis on different local updating epochs. FedBN consistently outperforms FedAvg in testing accuracy. (b) Model performance over varying dataset size on local clients. (c) Testing accuracy on different levels of heterogeneity.

Dataset Size: We vary the data amount for each client from 100% down to 1% of its original dataset size, in order to observe FedBN's behavior across different data capacities at each client. The results in Fig. 4(b) present the accuracy of FedBN and SingleSet. Testing accuracy starts to drop significantly when each local client is attributed only 20% of its original data amount. The improvement margin gained from FedBN increases as local dataset sizes decrease. The results indicate that FedBN effectively benefits from collaborative training on distributed data, especially when each client holds only a small amount of non-iid data.

Input: local update pace E and total optimization rounds T.
1: for each round t = 1, 2, . . . , T do
2:   for each user k and each layer l do

DATASET AND TRAINING DETAILS

Here we describe the real-world medical datasets, the preprocessing, and training details.
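The per-layer aggregation step sketched in the algorithm fragment above can be written as follows. This is a minimal sketch that assumes BN modules carry 'bn' in their state-dict keys, an illustrative naming convention of ours, not something mandated by the paper:

```python
import copy
import torch
import torch.nn as nn

class ClientNet(nn.Module):
    """Toy client model: one linear layer followed by a BN layer."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 4)
        self.bn = nn.BatchNorm1d(4)

    def forward(self, x):
        return self.bn(self.fc(x))

def fedbn_aggregate(client_models):
    """Average all parameters across clients EXCEPT BN layers,
    which each client keeps local (the core idea of FedBN)."""
    avg_state = copy.deepcopy(client_models[0].state_dict())
    for key in avg_state:
        if 'bn' in key:  # skip BN affine params and running statistics
            continue
        avg_state[key] = torch.stack(
            [m.state_dict()[key].float() for m in client_models]).mean(dim=0)
    for m in client_models:  # broadcast averaged non-BN layers back
        state = m.state_dict()
        for key, val in avg_state.items():
            if 'bn' not in key:
                state[key] = val
        m.load_state_dict(state)

clients = [ClientNet(), ClientNet()]
fedbn_aggregate(clients)
```

After aggregation, the `fc` layers are identical across clients while each client's `bn` layer (affine parameters and running statistics) is left untouched.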

Figure 8: Test set accuracy curve (average of 5 datasets) of using different local updating epochs E and batch size B for FedBN.

Figure 9: Training loss on synthetic data. Data in client 1 is generated from Diagonal Gaussian, client 2 is generated from combination of Diagonal Gaussian and Full Gaussian.

T/E communication rounds over the T epochs. For simplicity, we assume all clients have M ∈ N training examples (a difference in the number of training examples can be accounted for by weighted averaging (McMahan et al., 2017)) for a regression task, i.e., each client i ∈ [N] ([N] = {1, . . . , N}) has training examples $\{(x_j^i, y_j^i) \in \mathbb{R}^d \times \mathbb{R} : j \in [M]\}$. Furthermore, we assume a two-layer neural network with ReLU activations trained by gradient descent. Let $v_k \in \mathbb{R}^d$ denote the parameters of the first layer, where k ∈ [m] and m is the width of the hidden layer. Let $\|v\|_S = \sqrt{v^\top S v}$ denote the induced vector norm for a positive definite matrix S.

We report results on three different real-world datasets in the format mean(std) over 5 trial runs. For Office-Caltech 10, A, C, D, W are abbreviations for Amazon, Caltech, DSLR, and WebCam; for DomainNet, C, I, P, Q, R, S are abbreviations for Clipart, Infograph, Painting, Quickdraw, Real, and Sketch. For ABIDE, we list the abbreviations for the clients (i.e., medical institutions).

Datasets and Setup: (1) We conduct the classification task on natural images from Office-Caltech10, which has four data sources composed of Office-31 (Saenko et al., 2010) (three data sources) and the Caltech-256 dataset (Griffin et al., 2007) (one data source), acquired using different camera devices or in different real environments with various backgrounds. Each client joining the FL system is assigned data from one of the four data sources; thus the data is non-iid across clients. (2) Our second dataset is DomainNet, which contains natural images coming from six different data sources: Clipart, Infograph, Painting, Quickdraw, Real, and Sketch. Similar to (1), each client contains iid data from one of the data sources, but clients with different data sources have different feature distributions.

A few interesting directions for future work include analyzing what types of differences in local data can benefit from FedBN and exploring the limits of FedBN. Moreover, privacy is an essential concern in FL. BN parameters that are invisible to the server in FedBN should make attacks on local data more challenging; it would be interesting to quantify the privacy-preservation improvement of FedBN.

APPENDIX

Roadmap of the Appendix. The Appendix is organized as follows. We list the notation table in Section A. We provide the theoretical proof of convergence in Section B. The algorithm of FedBN is described in Section C. The details of the experimental settings are in Section D, and additional results on benchmark datasets are in Section E. We show the experiment on synthetic data in Section F. We demonstrate the ability of generalizing FedBN to test on a new client in Section G.

Notations used in the paper.

In this section, we calculate the evolution dynamics Λ(t) for training with function f and Λ

Data summary of the datasets used in our study.

Training Process: For all the strategies, we set the batch size to 100. The total number of local training epochs is 50, with learning rate 10^-2 and the SGD optimizer. The local update epoch for each client is E = 1. We selected the best parameter µ = 0.2 in FedProx through grid search.

E.3 COMPARE FEDBN WITH CENTRALIZED TRAINING

To better understand the significance of the numbers reported in our main text, we compare FedBN with centralized training, which pools all training data into one center. We present the testing accuracy on each digit dataset in Table 12. FedBN, federated learning with client-specific BN layers, achieves performance comparable to the vanilla centralized training strategy.

Testing accuracy on each testing set in the format mean(std) over 5 trial runs.

E.4 DIFFERENT COMBINATIONS OF E AND B

In this section, we show different combinations of local update epochs E and batch size B. Specifically, E ∈ {1, 4, 16} and B ∈ {10, 50, ∞}, where ∞ denotes full-batch learning. Following the setting of the original FedAvg paper (McMahan et al., 2017), we present comparisons between FedBN and FedAvg for each combination of E and B in Table 13. The results consistently show that FedBN outperforms FedAvg and is robust to batch size selection. Further, we depict the test set accuracy vs. local epochs under different combinations of E and B in Figure 8.

Test sets accuracy using different combinations of batch size B and local update epoch E on benchmark experiment with the default non-iid setting.

It is not too surprising that SingleSet can be best when a local client has a lot of data. It is also possible to keep the datasets at their original (unequal) sizes by allowing clients with less data to sample with repetition. In this way, all clients use the same batch size and the same number of local iterations per epoch. We add results for such a setting with 10% and full original data size in Table 15 and Table 16 respectively. It is observed that FedBN still consistently outperforms the other methods. Testing accuracy of each client when clients' training samples are unequal, using 10% of the original data. The numbers of training samples for each client are denoted under their names.
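The repeat-sampling scheme described above can be sketched as follows; the function and variable names are ours and purely illustrative:

```python
import random

def equalize_by_resampling(client_datasets, seed=0):
    """Oversample smaller clients (sampling with replacement) so every
    client ends up with as many examples as the largest one, giving all
    clients the same number of local iterations per epoch."""
    rng = random.Random(seed)
    target = max(len(d) for d in client_datasets)
    out = []
    for data in client_datasets:
        extra = [rng.choice(data) for _ in range(target - len(data))]
        out.append(list(data) + extra)
    return out

# toy example: three clients with 3, 2, and 1 examples respectively
clients = equalize_by_resampling([[1, 2, 3], [4, 5], [6]])
```

Every returned client list now has length 3; the largest client is untouched, and the smaller ones are padded with resampled copies of their own data.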

Testing accuracy of each client when clients' training samples are unequal, using full-size data. The numbers of training samples for each client are denoted under their names.

G TRANSFER LEARNING AND TESTING ON UNKNOWN-DOMAIN CLIENTS

In this section, we discuss out-of-domain generalization of FedBN and provide solutions for the following two scenarios: 1) transferring FedBN to a new unknown-domain client during training; 2) testing on an unknown-domain client. If a new center from another domain joins training, we can transfer the non-BN layer parameters of the global model to this new center. The new center will compute its own mean and variance statistics and learn the corresponding local BN parameters. Testing the global model on a new client with unknown statistics outside the federation requires allowing access to local BN parameters at testing time (though BN layers are not aggregated at the global server during training). In this way, the new client can use the averaged trainable BN parameters learned at the existing FL clients, and compute the (mean, variance) statistics on its own data. Such a solution is also in line with recent literature, e.g., SiloBN (Andreux et al., 2020). We conduct the experiment with this solution for FedBN and compare its performance with FedAvg and FedProx. Specifically, we use the digits classification task and treat the two unseen datasets from Morpho-MNIST (Castro et al., 2019), Morpho-global and Morpho-local, as two new clients. The new clients contain substantially perturbed digits: Morpho-global contains thinning and thickening versions of MNIST digits, while Morpho-local changes MNIST by swelling and fractures. The results are listed in Table 17. The results from the three methods are generally comparable in this challenging setting, with FedBN presenting slightly higher overall average accuracy. Generalizing the global model to unseen-domain clients.
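The test-time solution for a new client, keeping the averaged trainable BN parameters fixed while recomputing (mean, variance) on the client's own data, can be sketched as below. This is our illustrative PyTorch code, not the authors' released implementation:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def adapt_bn_stats(model, batches):
    """Re-estimate BN running (mean, var) on a new client's unlabeled data,
    keeping all trained weights (including averaged BN affine params) fixed."""
    for m in model.modules():
        if isinstance(m, nn.modules.batchnorm._BatchNorm):
            m.reset_running_stats()  # discard the old statistics
            m.momentum = None        # use a cumulative moving average instead
    model.train()                    # BN updates its stats only in train mode
    for x in batches:                # one forward pass per batch, no labels needed
        model(x)
    model.eval()                     # freeze the newly estimated statistics
```

Because gradients are disabled and only forward passes run, the trainable weights (and BN affine parameters) stay exactly as averaged; only the normalization statistics change to match the new domain.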


We show image examples from the five benchmark datasets and their pixel value histograms, which clearly present the heterogeneous appearances and shifted distributions. Clients formed from the five benchmark datasets are thus viewed as non-iid.

D.2 MODEL ARCHITECTURE AND TRAINING DETAILS ON BENCHMARK

We illustrate our model architecture and the training details of the digits classification experiments in this section.

Model Architecture. For our benchmark experiment, we use a six-layer Convolutional Neural Network (CNN); its details are listed in Table 3.

Training Details. We give detailed settings for the experiments conducted in Section 5.1: (1) convergence rate (Table 4), (2) analysis of local update epochs (Table 5), (3) analysis of local dataset size (Table 6), (4) effects of statistical heterogeneity (Table 7), and (5) comparison with the state of the art (Table 8). Each table describes the number of clients, samples, and local update epochs. During training, we use the SGD optimizer with learning rate 10^-2 and cross-entropy loss; we set the batch size to 32 and the number of training epochs to 300. For the hyper-parameter µ, we use the best value µ = 10^-2 found by grid search from the default settings in FedProx (Li et al., 2020b). For convolutional layers (Conv2D), we list parameters in the sequence of input and output dimension, kernel size, stride, and padding. For max pooling layers (MaxPool2D), we list kernel and stride. For fully connected layers (FC), we list input and output dimension. For BatchNormalization layers (BN), we list the channel dimension.
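As an illustration of the architecture's overall shape, a CNN with a BN layer after each convolutional and fully connected block might look like the sketch below. Layer widths and kernel sizes here are placeholders of ours; the exact configuration is the one in Table 3:

```python
import torch
import torch.nn as nn

class DigitsCNN(nn.Module):
    """Illustrative CNN with BN after each conv/fc block.
    Layer sizes are placeholders, not the exact Table 3 configuration."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(64), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 64, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(64), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 128, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(128), nn.ReLU(),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 7 * 7, 2048), nn.BatchNorm1d(2048), nn.ReLU(),
            nn.Linear(2048, 512), nn.BatchNorm1d(512), nn.ReLU(),
            nn.Linear(512, num_classes),
        )

    def forward(self, x):  # expects 3x28x28 digit images
        return self.classifier(self.features(x))
```

Under FedBN, all `BatchNorm2d`/`BatchNorm1d` layers in such a model would stay local to each client, while the conv and fc weights are averaged on the server.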

Datasets

Training Details. Office-Caltech10 selects 10 common objects from Office-31 (Saenko et al., 2010) and Caltech-256 (Griffin et al., 2007). There are four different data sources, one from Caltech-256 and three from Office-31, namely Amazon (images collected from an online shopping website), DSLR, and Webcam (images captured in an office environment using a digital SLR camera and a web camera). We first reshape input images in the two datasets into 256×256×3; then for the training process, we use cross-entropy loss and the SGD optimizer with a learning rate of 10^-2, the batch size is set to 32, and the number of training epochs is 300. When comparing with FedProx, we set µ to 10^-2, tuned from the default settings. The data sample numbers are kept the same across clients, matching the size of the smallest dataset.

E.2 DETAILED STATISTICS OF FIGURE 5

In Figure 5, we compare the accuracy of FedBN and alternative methods. We show the detailed accuracy in the following tables.

