FEDFA: FEDERATED FEATURE AUGMENTATION

Abstract

Federated learning is a distributed paradigm that allows multiple parties to collaboratively train deep models without exchanging raw data. However, the data distribution among clients is naturally non-i.i.d., which leads to severe degradation of the learnt model. The primary goal of this paper is to develop a robust federated learning algorithm that addresses feature shift in clients' samples, which can be caused by various factors, e.g., acquisition differences in medical imaging. To reach this goal, we propose FEDFA, which tackles federated learning from a distinct perspective of federated feature augmentation. FEDFA is based on a major insight that each client's data distribution can be characterized by statistics (i.e., mean and standard deviation) of latent features, and that these local statistics can be manipulated globally, i.e., based on information in the entire federation, to let clients have a better sense of the underlying distribution and therefore alleviate local data bias. Based on this insight, we propose to augment each local feature statistic probabilistically based on a normal distribution, whose mean is the original statistic and whose variance quantifies the augmentation scope. Key to our approach is the determination of a meaningful Gaussian variance, which is accomplished by taking into account not only the biased data of each individual client, but also the underlying feature statistics characterized by all participating clients. We offer both theoretical and empirical justifications to verify the effectiveness of FEDFA. Our code is available at https://github.com/tfzhou/FedFA.

1. INTRODUCTION

Federated learning (FL) (Konečnỳ et al., 2016) is an emerging collaborative training framework that enables training on decentralized data residing on devices like mobile phones. It comes with the promise of training centralized models using local data points such that the privacy of participating devices is preserved, and has attracted significant attention in critical fields like healthcare and finance. Since data come from different users, it is inevitable that the data of each user have a different underlying distribution, incurring large heterogeneity (non-iid-ness) among users' data. In this work, we focus on feature shift (Li et al., 2020b), which is common in many real-world cases, like medical data acquired from different medical devices or natural images collected in diverse environments. While the problem of feature shift has been studied in classical centralized learning tasks like domain generalization, little is understood about how to tackle it in federated learning; (Li et al., 2020b; Reisizadeh et al., 2020; Jiang et al., 2022; Liu et al., 2020a) are rare exceptions. FEDROBUST (Reisizadeh et al., 2020) and FEDBN (Li et al., 2020b) solve the problem through client-dependent learning, by either fitting the shift with a client-specific affine distribution or learning unique BN parameters for each client. However, these algorithms may still suffer from significant local dataset bias. Other works (Qu et al., 2022; Jiang et al., 2022; Caldarola et al., 2022) learn robust models by adopting Sharpness Aware Minimization (SAM) (Foret et al., 2021) as the local optimizer, which, however, doubles the computational cost compared to SGD or Adam. In addition to model optimization, FEDHARMO (Jiang et al., 2022) has investigated specialized image normalization techniques to mitigate feature shift in medical domains.
Despite this progress, an alternative avenue, data augmentation, remains largely unexplored in federated learning, even though it has been extensively studied in the centralized setting to impose regularization and improve generalizability (Zhou et al., 2021; Zhang et al., 2018). While seemingly straightforward, it is non-trivial to perform effective data augmentation in federated learning, because users have no direct access to the external data of other users. Simply applying conventional augmentation techniques to each client is sub-optimal since, without injecting global information, augmented samples will most likely still suffer from local dataset bias. To address this, FEDMIX (Yoon et al., 2021) generalizes MIXUP (Zhang et al., 2018) to federated learning by mixing averaged data across clients. The method performs augmentation at the input level, which is inherently too weak to create complicated and meaningful semantic transformations, e.g., make-bespectacled. Moreover, allowing the exchange of averaged data raises certain privacy concerns. In this work, we introduce a novel federation-aware augmentation technique, called FEDFA, into federated learning. FEDFA is based on the insight that statistics of latent features can capture essential domain-aware characteristics (Huang & Belongie, 2017; Zhou et al., 2021; Li et al., 2022a; b; 2021a), and thus can be treated as "features of the participating client". Accordingly, we argue that the problem of feature shift in FL, whether it is the shift of each local data distribution from the underlying distribution, local distribution differences among clients, or even test-time distribution shift, can be interpreted as a shift of feature statistics. This motivates us to directly address local feature statistic shift by incorporating universal statistics characterized by all participants in the federation.
FEDFA instantiates this idea by augmenting the feature statistics of each sample online during local model training, so as to make the model robust to certain changes of "features of the participating client". Concretely, we model the augmentation procedure probabilistically via a multivariate Gaussian distribution. The Gaussian mean is fixed to the original statistic, and the variance reflects the potential local distribution shift. In this manner, novel statistics can be effortlessly synthesized by drawing samples from the Gaussian distribution. For effective augmentation, we determine a reasonable variance based not only on the variances of feature statistics within each client, but also on the universal variances characterized by all participating clients. The augmentation in FEDFA allows each local model to be trained over samples drawn from more diverse feature distributions, facilitating the alleviation of local distribution shift and client-invariant representation learning, eventually contributing to a better global model. FEDFA is a conceptually simple but surprisingly effective method. It is non-parametric, requires negligible additional computation and communication costs, and can be seamlessly incorporated into arbitrary CNN architectures. We offer both theoretical and empirical insights. Theoretically, we show that FEDFA implicitly introduces regularization to local model learning by regularizing the gradients of latent representations, weighted by variances of feature statistics estimated from the entire federation. Empirically, we demonstrate that FEDFA (1) works favorably with extremely small local datasets; (2) shows remarkable generalization performance to unseen test clients outside of the federation; (3) outperforms traditional data augmentation techniques by solid margins, and complements them quite well in the federated learning setup.

2.1. PRELIMINARY: FEDERATED LEARNING

We assume a standard federated learning setup with a server that can transmit and receive messages from M client devices. Each client m ∈ [M] has access to N_m training instances {(x_i, y_i)}_{i=1}^{N_m}, in the form of images x_i ∈ X and corresponding labels y_i ∈ Y, drawn i.i.d. from a device-indexed joint distribution, i.e., (x_i, y_i) ~ P_m(x, y). The goal of standard federated learning is to learn a deep neural network f(w_g, w_h) = g(w_g) ∘ h(w_h), where h : X → Z is a feature extractor with K convolutional stages, h = h^K ∘ h^{K-1} ∘ ... ∘ h^1, and g : Z → Y is a classifier. To learn the network parameters w = {w_g, w_h}, empirical risk minimization (ERM) is widely used:

L^ERM(w) = (1/M) Σ_{m∈[M]} L_m^ERM(w),  where  L_m^ERM(w) = E_{(x_i,y_i)~P_m}[ℓ(g ∘ h(x_i), y_i; w)].

Here the global objective L^ERM decomposes into a sum of device-level empirical losses (i.e., {L_m^ERM}_m), each computed from a per-sample loss function ℓ. Due to the separation of clients' data, L^ERM(w) cannot be minimized directly. FEDAVG (McMahan et al., 2017) approximates its solution by alternating between parallel local training on each client and server-side averaging of the resulting model parameters.
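The server-side half of this procedure can be sketched in a few lines. The following is a minimal NumPy stand-in for one FEDAVG aggregation round, assuming each client uploads its parameters as a dict of arrays; the function and variable names are illustrative, not from the paper.

```python
import numpy as np

def fedavg_aggregate(client_params, client_sizes):
    """Server-side FedAvg step: size-weighted average of client parameter dicts."""
    total = sum(client_sizes)
    agg = {}
    for name in client_params[0]:
        # each client contributes proportionally to its local dataset size N_m
        agg[name] = sum(
            (n / total) * p[name] for p, n in zip(client_params, client_sizes)
        )
    return agg

# toy example: two clients sharing a single-layer model
c1 = {"w": np.array([1.0, 2.0])}
c2 = {"w": np.array([3.0, 4.0])}
global_w = fedavg_aggregate([c1, c2], client_sizes=[10, 30])
# client 2 holds 3/4 of the data, so its weights dominate: w -> [2.5, 3.5]
```

In a full round, the server would then broadcast `global_w` back to all clients for the next epoch of local training.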

2.2. MOTIVATION

While the ERM-based formulation has achieved great success, it is straightforward to see that the solution strongly depends on how well each approximated local distribution P̂_m mimics the underlying universal distribution P. In real-world federated learning, however, in all but trivial cases each P̂_m exhibits a unique distribution shift from P, which causes not only inconsistency between local and global empirical losses (Acar et al., 2021; Wang et al., 2020), but also generalization issues (Yuan et al., 2022). In this work, we circumvent this issue by fitting to each local dataset a richer distribution (instead of the delta distribution) in the vicinal region of each sample (x_i, y_i), so as to estimate a more informed risk. This is precisely the principle behind vicinal risk minimization (VRM) (Chapelle et al., 2000). Particularly, for a data point (x_i, y_i), a vicinity distribution V_m(x̂_i, ŷ_i | x_i, y_i) is defined, from which novel virtual samples can be generated to enlarge the support of the local data distribution. In this way, we obtain an improved approximation of P_m as P_m^v = (1/N_m) Σ_{i=1}^{N_m} V_m(x̂_i, ŷ_i | x_i, y_i). In centralized learning scenarios, various successful instances of V_m, e.g., MIXUP (Zhang et al., 2018) and CUTMIX (Yun et al., 2019), have been developed. Simply applying them to local clients, though allowing for performance improvements (see Table 6), is suboptimal since, without injecting any global information, P_m^v only provides a better approximation to the local distribution P_m, rather than to the true distribution P. We solve this by introducing a dedicated method, FEDFA, to estimate a more reasonable V_m in federated learning.
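To make the VRM idea concrete, here is a minimal NumPy sketch of MIXUP as one instance of a vicinity distribution V_m: a virtual sample is a convex combination of two real samples. The function name and the one-hot label encoding are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Draw one virtual sample from the mixup vicinity of (x1, y1) and (x2, y2)."""
    rng = rng or np.random.default_rng(0)
    lam = rng.beta(alpha, alpha)            # mixing coefficient in [0, 1]
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

# toy usage with one-hot labels
x1, y1 = np.ones(3), np.array([1.0, 0.0])
x2, y2 = np.zeros(3), np.array([0.0, 1.0])
x_virtual, y_virtual = mixup(x1, y1, x2, y2)
```

As the text notes, applied per-client such a V_m only sharpens the approximation of the local P_m; it injects no information about the global distribution P.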

2.3. FEDFA: FEDERATED FEATURE AUGMENTATION

FEDFA belongs to the family of label-preserving feature augmentation (Xie et al., 2020). During training, it estimates a vicinity distribution V^k_m at each layer h^k to augment the hidden features in client m. Consider X^k_m ∈ R^{B×C×H×W} as the intermediate feature representation of a mini-batch of B images, with spatial size (H × W) and channel number C, and Y_m as the corresponding labels. V^k_m is label-preserving in the sense that V^k_m(X̂^k_m, Ŷ_m | X^k_m, Y_m) = V^k_m(X̂^k_m | X^k_m) δ(Ŷ_m = Y_m). FEDFA implements V^k_m by perturbing the channel-wise statistics of X^k_m, i.e., the mean and standard deviation computed over the spatial dimensions:

μ^k_m = (1/HW) Σ_{h=1}^{H} Σ_{w=1}^{W} X^{k,(h,w)}_m ∈ R^{B×C},   σ^k_m = sqrt( (1/HW) Σ_{h=1}^{H} Σ_{w=1}^{W} (X^{k,(h,w)}_m - μ^k_m)² ) ∈ R^{B×C},

where X^{k,(h,w)}_m ∈ R^{B×C} represents the features at spatial location (h, w). As abstracts of latent features, these statistics carry domain-specific information (e.g., style). They are instrumental to image generation (Huang & Belongie, 2017), and have recently been used for data augmentation in image recognition (Li et al., 2021a). In heterogeneous federated learning scenarios, the feature statistics among local clients are inconsistent and exhibit uncertain shifts from the statistics of the true distribution. Our method explicitly captures such shifts via probabilistic modeling. Concretely, instead of representing each feature X^k_m with the deterministic statistics {μ^k_m, σ^k_m}, we hypothesize that the feature is conditioned on probabilistic statistics {μ̂^k_m, σ̂^k_m}, sampled around the original statistics from multivariate Gaussian distributions, i.e., μ̂^k_m ~ N(μ^k_m, Σ̂²_{μ^k_m}) and σ̂^k_m ~ N(σ^k_m, Σ̂²_{σ^k_m}), where each Gaussian's center corresponds to the original statistic and the variance is expected to capture the potential feature statistic shift from the true distribution. Our core goal is thus to estimate proper variances Σ̂²_{μ^k_m} and Σ̂²_{σ^k_m} for reasonable and informative augmentation.

Client-specific Statistic Variances.
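The channel-wise statistics above (Eq. 2) reduce a (B, C, H, W) feature map to per-sample, per-channel mean and standard deviation. The following NumPy sketch stands in for the equivalent tensor operation; the small epsilon is an illustrative numerical guard, not part of the paper's formula.

```python
import numpy as np

def channel_stats(x):
    """Channel-wise feature statistics (Eq. 2) for features x of shape (B, C, H, W)."""
    mu = x.mean(axis=(2, 3))                    # spatial mean    -> (B, C)
    sigma = np.sqrt(x.var(axis=(2, 3)) + 1e-6)  # spatial std dev -> (B, C)
    return mu, sigma

x = np.random.default_rng(0).normal(size=(4, 3, 8, 8))  # B=4, C=3, H=W=8
mu, sigma = channel_stats(x)                             # each of shape (4, 3)
```

In PyTorch the same reduction would be `x.mean(dim=(2, 3))` and `x.var(dim=(2, 3))` over a feature tensor.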
On the client side, we compute client-specific variances of the feature statistics within each mini-batch:

Σ²_{μ^k_m} = (1/B) Σ_{b=1}^{B} (μ^k_m - E[μ^k_m])² ∈ R^C,   Σ²_{σ^k_m} = (1/B) Σ_{b=1}^{B} (σ^k_m - E[σ^k_m])² ∈ R^C,

where each entry is the variance of feature statistics in a particular channel, and its magnitude manifests how that channel may change in the feature space.

Client-sharing Statistic Variances. The client-specific variances are computed solely from the data of each individual client, and are thus likely biased due to local dataset bias. To solve this, we further estimate client-sharing feature statistic variances that take the information of all clients into account. Particularly, we maintain a momentum version of the feature statistics for each client, estimated online during training:

μ̄^k_m ← α μ̄^k_m + (1 - α) (1/B) Σ_{b=1}^{B} μ^k_m ∈ R^C,   σ̄^k_m ← α σ̄^k_m + (1 - α) (1/B) Σ_{b=1}^{B} σ^k_m ∈ R^C,

where μ̄^k_m and σ̄^k_m are the momentum-updated feature statistics of layer h^k in client m, initialized as C-dimensional all-zero and all-one vectors, respectively, and α is a momentum coefficient. We use the same α for both updates, and found no benefit in setting them differently. In each communication round, these accumulated local feature statistics are sent to the server along with the model parameters. Let μ̄^k = [μ̄^k_1, ..., μ̄^k_M] ∈ R^{M×C} and σ̄^k = [σ̄^k_1, ..., σ̄^k_M] ∈ R^{M×C} denote the collections of accumulated feature statistics of all clients; the client-sharing statistic variances are then determined on the server side by:

Σ²_{μ^k} = (1/M) Σ_{m=1}^{M} (μ̄^k_m - E[μ̄^k])² ∈ R^C,   Σ²_{σ^k} = (1/M) Σ_{m=1}^{M} (σ̄^k_m - E[σ̄^k])² ∈ R^C.

In addition, it is intuitive that some channels are more likely to change than others, and it is favorable to highlight these channels to enable a sufficient and reasonable exploration of the space of feature statistics.
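The three variance computations above are all simple reductions. A minimal NumPy sketch, with illustrative function names and a default α chosen for illustration:

```python
import numpy as np

def client_specific_var(mu, sigma):
    """Eq. 3: per-channel variances of batch statistics mu, sigma of shape (B, C)."""
    return mu.var(axis=0), sigma.var(axis=0)           # each -> (C,)

def momentum_update(running, batch_stat, alpha=0.99):
    """Eq. 4: momentum update of accumulated statistics; running: (C,), batch_stat: (B, C)."""
    return alpha * running + (1 - alpha) * batch_stat.mean(axis=0)

def client_sharing_var(all_mu, all_sigma):
    """Eq. 5 (server side): variances over M stacked client statistics of shape (M, C)."""
    return all_mu.var(axis=0), all_sigma.var(axis=0)   # each -> (C,)
```

The client-specific variances (Eq. 3) vary per mini-batch, while the client-sharing variances (Eq. 5) are refreshed only once per communication round from the uploaded momentum statistics.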
To this end, we modulate the client-sharing estimations with a Student's t-distribution (Student, 1908; Van der Maaten & Hinton, 2008) with one degree of freedom, converting the variances into probabilities. The t-distribution has heavier tails than alternatives such as the Gaussian distribution, allowing it to highlight the channels with larger statistic variance while avoiding overly penalizing the others. Formally, denote Σ^{2,(j)}_{μ^k} and Σ^{2,(j)}_{σ^k} as the shared variances of the j-th channel in Σ²_{μ^k} and Σ²_{σ^k} (Eq. 5), respectively. They are modulated by the t-distribution as follows:

γ^(j)_{μ^k} = C (1 + 1/Σ^{2,(j)}_{μ^k})^{-1} / Σ_{c=1}^{C} (1 + 1/Σ^{2,(c)}_{μ^k})^{-1} ∈ R,   γ^(j)_{σ^k} = C (1 + 1/Σ^{2,(j)}_{σ^k})^{-1} / Σ_{c=1}^{C} (1 + 1/Σ^{2,(c)}_{σ^k})^{-1} ∈ R,

where γ^(j)_{μ^k} and γ^(j)_{σ^k} refer to the modulated variances of the j-th channel. Applying Eq. 6 to each channel separately yields γ_{μ^k} = [γ^(1)_{μ^k}, ..., γ^(C)_{μ^k}] ∈ R^C and γ_{σ^k} = [γ^(1)_{σ^k}, ..., γ^(C)_{σ^k}] ∈ R^C as the modulated statistic variances of all feature channels at layer h^k. In this way, channels with large values in Σ²_{μ^k} (or Σ²_{σ^k}) are assigned much higher importance in γ_{μ^k} (or γ_{σ^k}) than other channels, allowing for more extensive augmentation along those directions.

Adaptive Variance Fusion. The modulated client-sharing estimations {γ_{μ^k}, γ_{σ^k}} quantify the distribution differences among clients: larger values imply potentially more significant changes of the corresponding channels in the true feature statistic space. Therefore, for each client, we weight the client-specific statistic variances {Σ²_{μ^k_m}, Σ²_{σ^k_m}} by {γ_{μ^k}, γ_{σ^k}}, so that each client has a sense of such differences.
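Eq. 6 is a per-channel reweighting: the heavy-tailed kernel (1 + 1/v)^{-1} = v/(1+v) grows with the variance v but saturates, so dominant channels are highlighted without the rest collapsing to zero. A NumPy sketch, where the small epsilon guarding against division by zero is an illustrative addition:

```python
import numpy as np

def t_modulate(var):
    """Eq. 6: Student-t modulation of shared variances var of shape (C,)."""
    w = (1.0 + 1.0 / (var + 1e-12)) ** -1   # heavy-tailed kernel v/(1+v), per channel
    return var.shape[0] * w / w.sum()       # gamma: (C,), normalized to mean 1

gamma = t_modulate(np.array([0.1, 1.0, 10.0]))
# larger variances -> larger gamma, and gamma averages to 1 across channels
```

The mean-1 normalization makes the subsequent residual fusion (γ + 1) a roughly scale-preserving reweighting of the client-specific variances.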
To avoid overly modifying the client-specific statistic variances, we add a residual connection for the fusion, yielding an estimation of the Gaussian ranges:

Σ̂²_{μ^k_m} = (γ_{μ^k} + 1) ⊙ Σ²_{μ^k_m} ∈ R^C,   Σ̂²_{σ^k_m} = (γ_{σ^k} + 1) ⊙ Σ²_{σ^k_m} ∈ R^C,

where ⊙ denotes the Hadamard product.

Implementation of Feature Augmentation. After establishing the Gaussian distributions, we synthesize a novel feature X̂^k_m in the vicinity of X^k_m by re-normalizing it with the sampled statistics:

X̂^k_m = σ̂^k_m ⊙ (X^k_m - μ^k_m) / σ^k_m + μ̂^k_m,   where   μ̂^k_m = μ^k_m + ε_μ ⊙ Σ̂_{μ^k_m},   σ̂^k_m = σ^k_m + ε_σ ⊙ Σ̂_{σ^k_m},

and ε_μ ~ N(0, 1) and ε_σ ~ N(0, 1) follow the standard Gaussian distribution. The proposed federated feature augmentation (FFA) operation in Eq. 8 is a plug-and-play layer, i.e., it can be inserted at arbitrary layers in the feature extractor h. In our implementation, we add an FFA layer after each convolutional stage of the network. During training, we follow the stochastic learning strategy (Verma et al., 2019; Zhou et al., 2021; Li et al., 2022b), activating each FFA layer with probability p. This allows for more diverse augmentation from iteration to iteration (based on the activated FFA layers). At test time, no augmentation is applied. In Appendix A, we provide detailed descriptions of FEDFA in Algorithm 1 and FFA in Algorithm 2.
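Putting the pieces together, the following is a minimal NumPy sketch of one FFA forward pass, assuming the modulated coefficients γ have already been received from the server; function names, the epsilon constant, and the NumPy stand-in for a torch tensor are all illustrative assumptions.

```python
import numpy as np

def ffa_forward(x, gamma_mu, gamma_sigma, p=0.5, rng=None):
    """One FFA pass over features x of shape (B, C, H, W); gamma_*: (C,)."""
    rng = rng or np.random.default_rng(0)
    if rng.random() > p:                        # stochastic activation with prob. p
        return x
    mu = x.mean(axis=(2, 3), keepdims=True)                 # (B, C, 1, 1)
    sigma = np.sqrt(x.var(axis=(2, 3), keepdims=True) + 1e-6)
    # client-specific variances over the batch, fused with shared gamma (Eqs. 3, 7)
    var_mu = (gamma_mu + 1) * mu.var(axis=0).squeeze()      # (C,)
    var_sigma = (gamma_sigma + 1) * sigma.var(axis=0).squeeze()
    # sample novel statistics around the originals (Eq. 9)
    eps_mu = rng.standard_normal(mu.shape[:2])              # (B, C)
    eps_sigma = rng.standard_normal(mu.shape[:2])
    mu_new = mu + (eps_mu * np.sqrt(var_mu))[..., None, None]
    sigma_new = sigma + (eps_sigma * np.sqrt(var_sigma))[..., None, None]
    # re-normalize the feature with the sampled statistics (Eq. 8)
    return sigma_new * (x - mu) / sigma + mu_new
```

At test time one would simply skip the layer (equivalently, set p = 0), matching the text's "no augmentation is applied".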

3. THEORETICAL INSIGHTS

In this section, we provide mathematical analysis to gain deeper insights into FEDFA. To begin with, we show that FEDFA is a noise injection process (Bishop, 1995; Camuto et al., 2020; Lim et al., 2022) that injects federation-aware noise into latent features.

Lemma 1. Consider client m ∈ [M]. For a batch-wise latent feature X^k_m at layer k, its augmentation in FEDFA (cf. Eq. 8) follows a noising process X̂^k_m = X^k_m + e^k_m, with the noise e^k_m taking the form:

e^k_m = ε_σ ⊙ Σ̂_{σ^k_m} ⊙ X̄^k_m + ε_μ ⊙ Σ̂_{μ^k_m},

where ε_μ ~ N(0, 1), ε_σ ~ N(0, 1), and X̄^k_m = (X^k_m - μ^k_m) / σ^k_m.

Based on Lemma 1, we can identify the federation-aware implicit regularization effects of FEDFA.

Theorem 1. In FEDFA, the loss function L^FEDFA_m of client m can be expressed as L^FEDFA_m = L^ERM_m + L^REG_m, where L^ERM_m is the standard ERM loss and L^REG_m is a regularization term:

L^ERM_m = E_{(X_m, Y_m)~P_m} [ ℓ(g(h^{1:K}(X_m)), Y_m) ],
L^REG_m = E_{Z~K} E_{(X_m, Y_m)~P_m} [ ∇_{h^{1:K}(X_m)} ℓ(g(h^{1:K}(X_m)), Y_m) · Σ_{z∈Z} J_z(X_m) e^z_m ],

where J_z denotes the Jacobian of layer z (see Proposition 1 in the Appendix for its explicit expression). Theorem 1 implies that FEDFA implicitly introduces regularization to local client learning by regularizing the gradients of latent representations (i.e., ∇_{h^{1:K}(X_m)} ℓ(g(h^{1:K}(X_m)), Y_m)), weighted by the federation-aware noise of Lemma 1, i.e., Σ_{z∈Z} J_z(X_m) e^z_m.
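Lemma 1 is easy to verify numerically: augmenting via Eqs. 8-9 and adding the noise e^k_m directly must produce identical features. A NumPy check, with illustrative scalar values standing in for the fused standard deviations Σ̂ of Eq. 7:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 3, 4, 4))                 # a batch of latent features
mu = x.mean(axis=(2, 3), keepdims=True)
sigma = x.std(axis=(2, 3), keepdims=True)
Sig_mu, Sig_sigma = 0.3, 0.2                      # stand-ins for fused std ranges (Eq. 7)
eps_mu = rng.standard_normal(mu.shape)
eps_sigma = rng.standard_normal(sigma.shape)

# path 1: sample novel statistics, then re-normalize (Eqs. 8-9)
x_aug = (sigma + eps_sigma * Sig_sigma) * (x - mu) / sigma + (mu + eps_mu * Sig_mu)

# path 2: add the Lemma-1 noise directly to the original feature
x_bar = (x - mu) / sigma
e = eps_sigma * Sig_sigma * x_bar + eps_mu * Sig_mu
assert np.allclose(x_aug, x + e)                  # the two paths coincide
```

The algebra behind the assertion is exactly the proof of Lemma 1: expanding path 1 recovers σ ⊙ X̄ + μ = X plus the noise term e.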

4.1. SETUP

Datasets. We conduct extensive experiments on five datasets: Office-Caltech 10 (Gong et al., 2012), DomainNet (Peng et al., 2019) and ProstateMRI (Liu et al., 2020b) for validating FEDFA under feature-shift non-IID, as well as the larger-scale datasets CIFAR-10 (Krizhevsky & Hinton, 2009) and EMNIST (Cohen et al., 2017) for label-distribution and data-size heterogeneity, respectively. Baselines. For comprehensive evaluation, we compare FEDFA against several state-of-the-art federated learning techniques, including FEDAVG (McMahan et al., 2017), FEDAVGM (Hsu et al., 2019), FEDPROX (Li et al., 2020a), FEDSAM (Qu et al., 2022), FEDBN (Li et al., 2020b), FEDROBUST (Reisizadeh et al., 2020), and FEDMIX (Yoon et al., 2021). Moreover, on ProstateMRI we compare with FEDHARMO (Jiang et al., 2022), which is specially designed for medical imaging. To gain more insights into FEDFA, we develop two baselines: FEDFA-R(ANDOM) and FEDFA-C(LIENT). FEDFA-R randomly perturbs feature statistics based on a Gaussian distribution with the same standard deviation for all channels, i.e., Σ_{μ^k_m} = Σ_{σ^k_m} = λ, where λ = 0.5. FEDFA-C performs augmentation based only on client-specific variances, i.e., Eq. 7 turns into Σ̂²_{μ^k_m} = Σ²_{μ^k_m}, Σ̂²_{σ^k_m} = Σ²_{σ^k_m}. Metrics. Following convention, we use top-1 accuracy for image classification and the Dice coefficient for medical image segmentation. We report the performance only for the global model. Implementation Details. We use PyTorch to implement FEDFA and all baselines. Following FEDBN (Li et al., 2020b), we adopt AlexNet (Krizhevsky et al., 2017) on Office-Caltech 10 and DomainNet, and U-Net on ProstateMRI. We first present the overall results on the five benchmarks, i.e., Office-Caltech 10 and DomainNet in Table 1 and Fig. 1, ProstateMRI in Table 2 and Fig. 2, and EMNIST and CIFAR-10 in Table 3. Results on ProstateMRI. FEDFA shows leading performance with extremely small local datasets.


In some practical scenarios like healthcare, the size of the local dataset can be very small, which poses a challenge for federated learning. To examine the performance of federated learning algorithms in this scenario, we build mini-ProstateMRI by randomly sampling only 1/6 of the training samples in each client for training. Results are summarized in Table 2. FEDFA outperforms FEDAVG by a significant margin (i.e., 3.0%), and it even performs better than FEDHARMO, which is specifically designed for medical scenarios. In addition, Fig. 2 shows how the performance of each method varies with the size of the local dataset: we train all methods with different fractions (i.e., 1/6, 2/6, 3/6, 4/6, 5/6, 1) of the training samples. FEDFA shows promising performance in all cases. With data augmentation at its core, FEDFA is expected to enforce regularization on neural network learning, which should improve generalization capability. To verify this, we perform experiments on federated domain generalization based on the leave-one-client-out strategy, i.e., training on M - 1 distributed clients and testing on the held-out, non-participating client. The results are presented in Table 4. As seen, FEDFA achieves leading generalization performance on most unseen clients. For example, it yields consistent improvements over FEDMIX, i.e., 1.7% on Office-Caltech 10, 1.6% on DomainNet, and 1.3% on ProstateMRI, in terms of average performance. Despite the improved performance, we find by comparing with the results reported in Tables 1 and 2 that current federated learning algorithms still encounter a significant participation gap (Yuan et al., 2022), i.e., the performance difference between participating and non-participating clients, which is a critical issue to be tackled in future work.

4.4. DIAGNOSTIC EXPERIMENT

We conduct a set of ablative experiments to enable a deeper understanding of FEDFA. Comparison with Conventional Data Augmentation. We first compare FEDFA with representative data augmentation techniques, including feature-level methods such as MIXSTYLE (Zhou et al., 2021) and MOEX (Li et al., 2021a). The results are presented in Table 6. We find that i) all four techniques yield non-trivial improvements over FEDAVG, and some of them (e.g., MOEX) even outperform well-designed federated learning algorithms (compare with Tables 1-2); ii) by accounting for global feature statistics, FEDFA surpasses all of them, yielding 2.9%/1.9%/1.0% improvements over the second-best results on Office/DomainNet/ProstateMRI, respectively; iii) combining FEDFA with these techniques allows further performance uplift, verifying their complementary roles. Adaptive Variance Fusion. Next, we examine the effect of adaptive variance fusion in Eqs. 6-7. We design a baseline, "Direct Fusion", that directly combines the client-specific and client-sharing statistic variances as Σ̂²_{μ^k_m} = (Σ²_{μ^k} + 1) ⊙ Σ²_{μ^k_m}, Σ̂²_{σ^k_m} = (Σ²_{σ^k} + 1) ⊙ Σ²_{σ^k_m}. We find from Table 7 that this baseline suffers severe performance degradation across all three benchmarks. A possible reason is that the two types of variances are largely mismatched in scale, so the simple fusion strategy can drastically change the client-specific statistic variances, which is harmful for local model learning. Hyper-parameter Analysis. FEDFA includes only two hyper-parameters: the momentum coefficient α in Eq. 4 and the probability p of applying feature statistic augmentation during training. As shown in Fig. 3, (1) the model is overall robust to α. Notably, it yields promising performance even at α = 0, where the model uses only the feature statistics of the last mini-batch in each local epoch to compute the client-sharing statistic variances. This result reveals that FEDFA is insensitive to errors in the client-sharing statistic variances.
(2) For the probability p, we see that FEDFA significantly improves the baseline (i.e., p = 0), even with a small probability (e.g., p = 0.1). The best performance is reached at p = 0.5.

4.5. COMPLEXITY ANALYSIS

Computation and memory costs. FEDFA involves only a few basic matrix operations, thus incurring negligible extra computation cost. Compared to FEDAVG, it requires 4 Σ_{k=1}^{K} C_k additional GPU memory to store the four statistic vectors (μ̄^k_m, σ̄^k_m, γ_{μ^k}, γ_{σ^k}) at each of the K FFA layers, where C_k is the number of feature channels at layer k. In practice these costs are very minor, e.g., 18 KB/15.5 KB for AlexNet/U-Net. For comparison, FEDMIX requires 2× more GPU memory than FEDAVG. The low computation/memory costs make FEDFA favorable for edge devices. Communication cost. In each round, FEDFA incurs additional communication, since each client sends the momentum feature statistics μ̄^k_m and σ̄^k_m to the server, and the server sends the client-sharing statistic variances γ_{μ^k} and γ_{σ^k} back to the clients, at each layer k. Thus, for K layers in total, the extra communication cost for each client is c_e = 4 Σ_{k=1}^{K} C_k, where the factor of 4 accounts for the two statistic vectors the server receives and the two it sends. Denoting by c_m the cost of exchanging model parameters in FEDAVG, we have in general c_e ≪ c_m (e.g., c_e = 18 KB vs. c_m = 99 MB for AlexNet); hence, the extra communication burden of FEDFA is almost negligible.
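The 18 KB figure follows directly from c_e = 4 Σ_k C_k. A back-of-envelope check, assuming the standard AlexNet convolutional stage widths and float32 statistics (both assumptions of this sketch, not stated in the text):

```python
# Extra per-client communication: four C_k-dimensional statistic vectors per FFA layer.
channels = [64, 192, 384, 256, 256]    # assumed AlexNet conv output channels
n_floats = 4 * sum(channels)           # c_e = 4 * sum_k C_k statistic entries
extra_kb = n_floats * 4 / 1024         # float32 -> bytes -> KB
print(f"{extra_kb:.0f} KB")            # matches the 18 KB quoted in the text
```

Against c_m ≈ 99 MB of model parameters per round, this overhead is four orders of magnitude smaller.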

5. RELATED WORK

Federated Learning. Recent years have witnessed tremendous progress in federated learning (Konečnỳ et al., 2016), which opens the door for privacy-preserving deep learning (Shokri & Shmatikov, 2015), i.e., training a global model on distributed datasets without disclosing private data. FEDAVG (McMahan et al., 2017) is a milestone; it trains local models independently on multiple clients and then periodically averages the resulting model updates via a central server. However, FEDAVG is designed for i.i.d. data, and loses statistical accuracy or even diverges when deployed over non-i.i.d. client samples. To address this issue, numerous efforts have been devoted to handling heterogeneous federated environments, for example, by adding a dynamic regularizer to local objectives in FEDPROX (Li et al., 2020a) and FEDDYN (Acar et al., 2021), correcting client drift through variance reduction in SCAFFOLD (Karimireddy et al., 2020), adaptive server optimization in FEDOPT (Reddi et al., 2021), local batch normalization in FEDBN (Li et al., 2020b), or training a perturbed loss in FEDSAM (Qu et al., 2022) and FEDROBUST (Reisizadeh et al., 2020). FEDMIX (Yoon et al., 2021) is, to our knowledge, the only existing method that approaches federated learning through data augmentation. It adapts the well-known MIXUP algorithm (Zhang et al., 2018) from centralized learning to the federated scenario. Nevertheless, FEDMIX requires exchanging local data (or an averaged version) across clients for data interpolation, thereby raising privacy concerns. In addition, FEDMIX operates at the input level, while our approach focuses on latent feature statistic augmentation. Since deeper representations tend to disentangle the underlying factors of variation better (Bengio et al., 2013), traversing the latent space will potentially expose our method to more realistic samples.
This is supported by the fact that FEDFA achieves consistent performance improvements over FEDMIX in diverse scenarios. Data Augmentation. Data augmentation has a long and rich history in machine learning. Early studies (Schölkopf et al., 1996; Kukačka et al., 2017) focus on label-preserving transformations to impose regularization via data, alleviating overfitting and improving generalization. For image data, techniques like random horizontal flipping and cropping are commonly used for training advanced neural networks (He et al., 2016). In addition, there is a recent trend of label-perturbing augmentation, e.g., MIXUP (Zhang et al., 2018) and CUTMIX (Yun et al., 2019). Separate from these input-level augmentation techniques are feature augmentation methods (Verma et al., 2019; Li et al., 2021a; 2022b; Zhou et al., 2021) that perform augmentation in the latent feature space. These data augmentation techniques have shown great success in learning domain-invariant models in the centralized setup. Our method is an instance of label-preserving feature augmentation, designed for federated learning. It is inspired by recent efforts on implicit feature augmentation (Li et al., 2021a; 2022b; Zhou et al., 2021) that synthesize samples of novel domains by manipulating instance-level feature statistics. In these works, feature statistics are treated as 'features' that capture essential domain-specific characteristics. In FEDFA, we estimate appropriate variances of feature statistics from a federated perspective, and draw novel statistics probabilistically from a distribution centered on the old statistics and spanning the estimated variances. FEDFA avoids mixing statistics of instances from different clients, as done in FEDMIX (Yoon et al., 2021), and thus better preserves data privacy.

6. CONCLUSION

This work addresses federated learning from a unique perspective of feature augmentation, yielding a new algorithm, FEDFA, that shows strong performance across various federated learning scenarios. FEDFA is based on a Gaussian modeling of feature statistic augmentation, where the Gaussian variances are estimated in a federated manner, based on both the local feature statistic distribution within each client and the universal feature statistic distribution across clients. We identify the implicit federation-aware regularization effects of FEDFA through theoretical analysis, and confirm its empirical superiority across a suite of federated learning benchmarks.

7. REPRODUCIBILITY

Throughout the paper we have provided details facilitating reproduction of our empirical results. All our experiments were run on a single GPU (an NVIDIA GeForce RTX 2080 Ti with 11 GB memory), and can thus be reproduced by researchers with computational constraints as well. The source code has been made publicly available at https://github.com/tfzhou/FedFA. For the theoretical results, all assumptions, proofs and relevant discussions are provided in the Appendix. This appendix provides theoretical proofs, additional results and experimental details for the main paper. It is organized in five sections: • §A summarizes the algorithms of FEDFA in Algorithm 1 and FFA in Algorithm 2; • §B presents detailed theoretical analysis and proofs of our approach; • §C shows additional ablative experiments; • §D provides a more detailed analysis of the extra communication cost required by FEDFA; • §E describes experimental details and more results.

A DETAILED ALGORITHM

In Algorithm 1, we illustrate the detailed training procedure of FEDFA. It is consistent with algorithms such as FEDAVG (McMahan et al., 2017). In each communication round, the client performs local training of the feature extractor h(w_h) and classifier g(w_g). We append an FFA layer (Algorithm 2) after each convolutional stage h^k. Each client additionally maintains a pair of momentum feature statistics {μ̄_m, σ̄_m}, which is updated in a momentum manner during training. The parameters from local training (i.e., w = {w_h, w_g}, omitted in Algorithm 1), along with the momentum feature statistics, are sent to the server for model aggregation and for the computation of the client-sharing statistic variances, which are distributed back to the clients for the next round of local training. The core steps are:

  // client-side local training
  for each layer k ∈ [K]:
      X^k_m = h^k(X̂^{k-1}_m)                                  // run the k-th stage of the feature extractor h
      X̂^k_m, μ̄^k_m, σ̄^k_m = FFA(X^k_m, μ̄^k_m, σ̄^k_m)       // FFA layer (Algorithm 2)
  Ŷ = g(X̂^K_m)                                                // run classifier g to get predictions
  run loss computation and backward optimization
  // server-side computation of client-sharing statistic variances
  Σ²_{μ^k} = (1/M) Σ_{m=1}^{M} (μ̄^k_m - E[μ̄^k])²             // Eq. 5
  Σ²_{σ^k} = (1/M) Σ_{m=1}^{M} (σ̄^k_m - E[σ̄^k])²
  for each channel j ∈ [C]:                                    // adaptive fusion coefficients (Eq. 6)
      γ^(j)_{μ^k} = C (1 + 1/Σ^{2,(j)}_{μ^k})^{-1} / Σ_{c=1}^{C} (1 + 1/Σ^{2,(c)}_{μ^k})^{-1}
      γ^(j)_{σ^k} = C (1 + 1/Σ^{2,(j)}_{σ^k})^{-1} / Σ_{c=1}^{C} (1 + 1/Σ^{2,(c)}_{σ^k})^{-1}
  return γ_{μ^k}, γ_{σ^k}

B THEORETICAL INSIGHTS

In this section, we provide mathematical analysis to understand FEDFA. We begin by interpreting FEDFA as a noise injection process (Bishop, 1995; Camuto et al., 2020; Lim et al., 2022), which is a case of VRM (§2.1), and show that FEDFA injects federation-aware noise into latent representations (§B.1).

2:  μ_m^k = (1/HW) Σ_{h=1}^{H} Σ_{w=1}^{W} X_m^{k,(h,w)}    ▷ compute channel-wise feature statistics (Eq. 2)
3:  σ_m^k = sqrt((1/HW) Σ_{h=1}^{H} Σ_{w=1}^{W} (X_m^{k,(h,w)} − μ_m^k)²)
4:  Σ²_{μ_m^k} = (1/B) Σ_{b=1}^{B} (μ_m^k − E[μ_m^k])²    ▷ compute client-specific statistic variances (Eq. 3)
5:  Σ²_{σ_m^k} = (1/B) Σ_{b=1}^{B} (σ_m^k − E[σ_m^k])²
6:  Σ̂²_{μ_m^k} = (γ_{μ^k} + 1) Σ²_{μ_m^k}    ▷ adaptive variance fusion (Eq. 7)
7:  Σ̂²_{σ_m^k} = (γ_{σ^k} + 1) Σ²_{σ_m^k}
8:  μ̃_m^k = μ_m^k + ε_μ Σ̂_{μ_m^k}    ▷ sampling novel feature statistics (Eq. 9)
9:  σ̃_m^k = σ_m^k + ε_σ Σ̂_{σ_m^k}
10: X̃_m^k = σ̃_m^k (X_m^k − μ_m^k)/σ_m^k + μ̃_m^k    ▷ transform the original feature based on the novel statistics (Eq. 8)
11: μ̂_m^k ← α μ̂_m^k + (1 − α)(1/B) Σ_{b=1}^{B} μ_m^k    ▷ momentum updating of feature statistics (Eq. 5)
12: σ̂_m^k ← α σ̂_m^k + (1 − α)(1/B) Σ_{b=1}^{B} σ_m^k

A popular choice of e^k is isotropic Gaussian noise (Camuto et al., 2020), i.e., e^k ∼ N(0, σ²I), where I is an identity matrix and σ is a scalar controlling the amplitude of e^k. To avoid over-perturbation that may cause model collapse, σ is typically set to a small value. Despite its simplicity, this strategy is a highly effective regularizer for tackling domain generalization (Li et al., 2021b) and adversarial samples (Lecuyer et al., 2019; Cohen et al., 2019). However, as shown in Table 5, its performance (see FEDFA-R) is only marginally better, and sometimes worse, than FEDAVG in FL.

Federation-Aware Noise Injection. From Eq. 9, we can clearly see that the feature statistic augmentation in our approach follows the noise injection process above. Next we show that this eventually results in features perturbed under a federation-aware noising process. Lemma 1.
Consider client m. For a batch-wise latent feature X_m^k at the k-th layer, its augmentation in FEDFA follows a noising process X̃_m^k = X_m^k + e_m^k, with the noise e_m^k taking the form:

e_m^k = ε_σ Σ̂_{σ_m^k} X̄_m^k + ε_μ Σ̂_{μ_m^k},  where ε_μ ∼ N(0, 1), ε_σ ∼ N(0, 1), X̄_m^k = (X_m^k − μ_m^k)/σ_m^k.

Proof of Lemma 1. We can easily prove this by substituting Eq. 9 into Eq. 8:

X̃_m^k = σ̃_m^k X̄_m^k + μ̃_m^k = (σ_m^k + ε_σ Σ̂_{σ_m^k}) X̄_m^k + μ_m^k + ε_μ Σ̂_{μ_m^k} = X_m^k + ε_σ Σ̂_{σ_m^k} X̄_m^k + ε_μ Σ̂_{μ_m^k}.

Compared with the Gaussian noise injections (Camuto et al., 2020; Cohen et al., 2019; Lecuyer et al., 2019), the noise term e_m^k in FEDFA shows several desirable properties: it is 1) data-dependent, adaptively determined based on the normalized input feature X̄_m^k; 2) channel-independent, allowing for more extensive exploration along different directions in the feature space; and 3) most importantly, federation-aware, i.e., its strength is controlled by the statistic variances Σ̂_{μ_m^k} and Σ̂_{σ_m^k} (cf. Eq. 7), which carry universal statistic information of all participating clients.
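The identity in the proof can be checked numerically. The sketch below uses hypothetical scalar values for a single feature entry and its statistics; it only verifies the algebra of Lemma 1 (that the statistic-level augmentation equals an additive noise injection), not the full tensor-valued augmentation.

```python
import random

random.seed(0)
# hypothetical scalar feature value and its client statistics
x, mu, sigma = 2.5, 1.0, 0.5          # feature entry, channel mean, channel std
var_mu, var_sigma = 0.09, 0.04        # fused statistic variances (cf. Eq. 7)
eps_mu, eps_sigma = random.gauss(0, 1), random.gauss(0, 1)

x_hat = (x - mu) / sigma              # normalized feature
mu_new = mu + eps_mu * var_mu ** 0.5      # sampled novel statistics (Eq. 9)
sigma_new = sigma + eps_sigma * var_sigma ** 0.5

# Eq. 8: transform with the novel statistics
x_aug = sigma_new * x_hat + mu_new
# Lemma 1: the same augmentation written as additive noise injection
noise = eps_sigma * var_sigma ** 0.5 * x_hat + eps_mu * var_mu ** 0.5
assert abs(x_aug - (x + noise)) < 1e-12
```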

B.2 FEDERATION-AWARE IMPLICIT REGULARIZATION IN FEDFA

Next we show that, with noise injections, FEDFA imposes federation-aware implicit regularization on local client training. By this, we mean regularization imposed implicitly by the stochastic learning strategy, without explicit modification of the loss, where the regularization effect is driven by the federation-aware noise (Lemma 1). Recall the deep neural network f defined in §2.1: f = g ∘ h, where h = h_K ∘ h_{K−1} ∘ ⋯ ∘ h_1 is a K-layer CNN feature extractor and g is a classifier. Given a batch of samples X_m with labels Y_m, the latent representation at the k-th layer is computed as X_m^k = h_k ∘ h_{k−1} ∘ ⋯ ∘ h_1(X_m), or, in a simpler form, X_m^k = h_{1:k}(X_m). Note that we only add noises to layers in h, but not to g. Concretely, in each mini-batch training step, FEDFA follows a stochastic optimization strategy that randomly selects a subset of layers from {h_k}_{k=1}^{K} and adds noises to them. For simplicity, we denote by K = {1, …, K} the index set of all layers in h, by Z ⊆ K the subset of selected layer indexes, and by E = {e_m^z}_{z∈Z} the corresponding set of noises. Then, the loss function can be decomposed as L^FEDFA_m = L^ERM_m + L^REG_m, where L^ERM_m is the standard ERM loss and L^REG_m is the regularization term:

L^ERM_m = E_{(X_m,Y_m)∼P_m} ℓ(g(h_{1:K}(X_m)), Y_m),
L^REG_m = E_{Z∼K} E_{(X_m,Y_m)∼P_m} [∇_{h_{1:K}(X_m)} ℓ(g(h_{1:K}(X_m)), Y_m)^⊤ Σ_{z∈Z} J_z(X_m) e_m^z],

where J_z denotes the Jacobian of layer z (defined in Proposition 1). Theorem 1 implies that FEDFA implicitly introduces regularization to local client learning by regularizing the gradients of latent representations (i.e., ∇_{h_{1:K}(X_m)} ℓ(g(h_{1:K}(X_m)), Y_m)), weighted by the federation-aware noises Σ_{z∈Z} J_z(X_m) e_m^z. In the remainder of this section, we prove Theorem 1. For the sake of analysis, we first marginalize the effect of the noising process. We do so by defining an accumulated noise ê_m^K in the final layer K, which originates from the forward propagation of all noises in E.
We compute the accumulated noise following (Camuto et al., 2020), which examines Gaussian noise injection into every latent layer of a neural network. Formally, the accumulated noise on the final convolutional layer K can be expressed as follows.

Proposition 1. Consider a K-layer neural network in which a random noise e_m^z is added to the activation of each layer z ∈ Z. Assume the Hessians, of the form ∇² h_{1:k}(X_m)|_{h_{1:n}(X_m)}, where k, n index layers, are finite. Then the accumulated noise ê_m^K can be approximated as:

ê_m^K = Σ_{z∈Z} J_z(X_m) e_m^z + O(β),

where J_z ∈ R^{C_K × C_z} denotes the Jacobian of layer z, i.e., J_z(X)_{i,j} = ∂h_{1:K}(X_m)_i / ∂h_{1:z}(X_m)_j, with C_K and C_z the numbers of neurons in layers K and z, respectively, and O(β) represents higher-order terms in E that tend to zero in the limit of small noises.

Proof of Proposition 1. Starting with layer 1 as the first convolutional layer, the accumulated noise on layer K can be approximated through recursion. If K = 1, the accumulated noise is simply ê_m^K = e_m^1. For K = 2, we apply Taylor's theorem to h_2(X_m^1 + e_m^1) around the output feature X_m^1 of h_1. If we assume that all values in the Hessian of h_2(X_m^1) are finite, the following approximation holds:

h_2(X_m^1 + e_m^1) = h_2(X_m^1) + (∂h_2(X_m^1)/∂X_m^1) e_m^1 + O(κ_1),

where O(κ_1) denotes asymptotically dominated higher-order terms given the small noise. In this special case of K = 2, we obtain the accumulated noise ê_m^K = (∂h_2(X_m^1)/∂X_m^1) e_m^1 + O(κ_1) + ě_m^2. The noise consists of two components: (∂h_2(X_m^1)/∂X_m^1) e_m^1 + O(κ_1) is the noise propagated from h_1, while ě_m^2 = e_m^2 is the noise added to h_2 if the layer is activated; otherwise, ě_m^2 = 0. Note that Eq. 20 can be generalized to an arbitrary layer.
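The first-order accumulation argument can be illustrated numerically: for a small injected noise, the gap between the noisy forward pass and the Jacobian-based approximation is of second order in the noise. The two scalar "layers" below are toy stand-ins of our own choosing, not the networks used in the paper.

```python
import math

# two hypothetical smooth "layers" acting on a scalar activation
h1 = math.tanh
h2 = lambda y: y ** 3

x = 0.7
e1 = 1e-4                      # small noise injected at layer 1
a1 = h1(x)                     # clean activation of layer 1
jac = 3 * a1 ** 2              # Jacobian dh2/da1 at the clean activation

exact = h2(a1 + e1)                 # noisy forward pass
first_order = h2(a1) + jac * e1     # Proposition 1 approximation
# the discrepancy consists only of O(e1^2) terms, negligible for small noise
assert abs(exact - first_order) < 1e-6
```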
Repeating this process for each layer z ∈ Z, and assuming that all Hessians of the form ∇² h_k(X_m)|_{h_n(X_m)}, ∀k < n, are finite, we obtain the accumulated noise for layer K as:

ê_m^K = (Σ_{z∈Z\K} (∂h_{1:K}(X_m)/∂h_{1:z}(X_m)) e_m^z + O(β)) + ě_m^K = Σ_{z∈Z} (∂h_{1:K}(X_m)/∂h_{1:z}(X_m)) e_m^z + O(β),

where ě_m^K = e_m^K is the noise added to h_K if layer K is activated; otherwise, ě_m^K = 0. Denoting ∂h_{1:K}(X_m)/∂h_{1:z}(X_m) as the Jacobian J_z(X_m) completes the proof.

Based on Proposition 1, we provide a linear approximation of the loss for samples (X_m, Y_m) as:

ℓ(g(h_{1:K}(X_m) + ê_m^K), Y_m) ≈ ℓ(g(h_{1:K}(X_m)), Y_m) + ∇_{h_{1:K}(X_m)} ℓ(g(h_{1:K}(X_m)), Y_m)^⊤ ê_m^K = ℓ(g(h_{1:K}(X_m)), Y_m) + ∇_{h_{1:K}(X_m)} ℓ(g(h_{1:K}(X_m)), Y_m)^⊤ Σ_{z∈Z} J_z(X_m) e_m^z,

in which the higher-order terms from Proposition 1 are neglected. Based on Eq. 23, we further approximate the local training objective L^FEDFA_m in client m and derive the regularization term as follows.

C ADDITIONAL ABLATIVE EXPERIMENTS

In this section, we study the sensitivity of FEDFA to the set of eligible layers to which FFA is applied. For notation, we use {1} to denote that FFA is applied to the 1st convolutional stage; {1, 2} to denote that FFA is applied to both the 1st and 2nd convolutional stages; and so forth. The results are shown in Table 8. We observe that i) our default design (using five layers) always shows the best performance on the three datasets (Office, DomainNet and ProstateMRI); we conjecture that this is due to its potential to beget more comprehensive augmentation; ii) applying FFA to only one particular layer brings minor gains over FEDAVG; but iii) as more layers are added, the performance tends to improve. This implies that our approach benefits from the inherent complementarity of features in different network layers.
L^FEDFA_m = E_{Z∼K} L^Z_m
          = E_{Z∼K} E_{(X_m,Y_m)∼P_m} [ℓ(g(h_{1:K}(X_m) + ê_m^K), Y_m)]
          = E_{Z∼K} E_{(X_m,Y_m)∼P_m} [ℓ(g(h_{1:K}(X_m)), Y_m) + ∇_{h_{1:K}(X_m)} ℓ(g(h_{1:K}(X_m)), Y_m)^⊤ Σ_{z∈Z} J_z(X_m) e_m^z]
          = E_{(X_m,Y_m)∼P_m} ℓ(g(h_{1:K}(X_m)), Y_m)   (= L^ERM_m)
          + E_{Z∼K} E_{(X_m,Y_m)∼P_m} [∇_{h_{1:K}(X_m)} ℓ(g(h_{1:K}(X_m)), Y_m)^⊤ Σ_{z∈Z} J_z(X_m) e_m^z]   (= L^REG_m).

D ANALYSIS OF ADDITIONAL COMMUNICATION COST IN FEDFA

In each round, FEDFA incurs additional communication cost, since it requires sending 1) from client to server the momentum feature statistics μ̂_m^k and σ̂_m^k, and 2) from server to client the client-sharing fusion coefficients γ_{μ^k} and γ_{σ^k}, at each layer k. Thus, for K layers in total, the extra communication cost for each client is c_e = 4 Σ_{k=1}^{K} C_k, where the factor of 4 accounts for the server receiving and sending two statistic vectors each. As presented in Table 10 and Table 14, we append one FFA layer after each convolutional stage of the feature extractors in AlexNet and U-Net. Hence, the total additional communication costs for AlexNet and U-Net are given in Eq. 25. It should be noted that these additional costs are minor in comparison with the cost required for exchanging model parameters, which is 2×49.5 MB for AlexNet and 2×29.6 MB for U-Net.
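Assuming float32 statistics (4 bytes per value), the cost c_e can be reproduced with a few lines; the channel lists follow the per-stage dimensions of the AlexNet and U-Net feature extractors reported in Tables 10 and 14.

```python
def extra_cost_kb(channels, bytes_per_float=4):
    """Per-client extra communication per round in KB: two statistic vectors
    of C_k floats, each sent both up and down (factor 4), assuming float32."""
    return 4 * sum(channels) * bytes_per_float / 1024

alexnet = [64, 192, 384, 256, 256]   # channels of the five FFA layers
unet = [32, 64, 128, 256, 512]
assert extra_cost_kb(alexnet) == 18.0    # matches the 18 KB in Eq. 25
assert extra_cost_kb(unet) == 15.5       # matches the 15.5 KB in Eq. 25
```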



FEDAVG is a leading algorithm to address this. It starts with client training of all the clients in parallel, with each client optimizing L^ERM_m independently. After local client training, FEDAVG performs model aggregation to average all client models into an updated global model, which is distributed back to the clients for the next round of client training. The client training objective in FEDAVG is equivalent to empirically approximating the local distribution P_m by a finite number N_m of examples, i.e., P^e_m(x, y) = (1/N_m) Σ_{i=1}^{N_m} δ(x = x_i, y = y_i), where δ(x = x_i, y = y_i) is a Dirac mass centered at (x_i, y_i).
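For reference, a minimal sketch of the FEDAVG aggregation step; the dict-of-lists parameter representation and function name are our own simplifications, not a production implementation.

```python
def fedavg_aggregate(client_weights, client_sizes):
    """Sample-size-weighted average of client model parameters (FedAvg).
    client_weights: list of dicts {param_name: list of floats};
    client_sizes: number of local training samples per client."""
    total = sum(client_sizes)
    agg = {}
    for name in client_weights[0]:
        agg[name] = [
            sum(w[name][i] * n for w, n in zip(client_weights, client_sizes)) / total
            for i in range(len(client_weights[0][name]))
        ]
    return agg
```

For example, two equally sized clients with parameters [1.0, 2.0] and [3.0, 6.0] aggregate to [2.0, 4.0].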

Σ²_{μ_m^k} and Σ²_{σ_m^k} denote the variances of the feature mean μ_m^k and standard deviation σ_m^k that are specific to each client. Each value in Σ²_{μ_m^k} or Σ²_{σ_m^k}

Figure 1: Test accuracy versus communication rounds on Office-Caltech 10.

Figure 2: Segmentation performance w.r.t local data size (i.e., fraction of training samples over the whole training set).

Figure 3: Hyper-parameter analysis for α and p.

Algorithm 1: FEDFA, federated training phase. (We omit the parameter updating procedure, which is exactly the same as in FEDAVG.) Input: number of clients M; number of communication rounds T; neural network f = g ∘ h; X_m^0 denotes the collection of training images in the corresponding client. Output: γ_{μ^k}, γ_{σ^k}.
1: for t = 1, 2, …, T do
2:   for each client m ∈ [M] do
3:     μ̂_m = 0, σ̂_m = 1    ▷ initialize averaged feature statistics for the client

B.1 FEDFA AS FEDERATION-AWARE NOISE INJECTION

Noise Injection in Neural Networks. Let x be a training sample and x^k its latent representation at the k-th layer, with no noise injected. Then x^k can be noised under a process x̃^k = x^k + e^k, where e^k is an additive noise drawn from a probability distribution and x̃^k is the noised representation.
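A minimal sketch of this isotropic Gaussian noise injection on a flat feature vector; the helper name and list representation are illustrative assumptions, not part of the method.

```python
import random

def gaussian_noise_inject(x, sigma=0.1, rng=random):
    """Isotropic Gaussian noise injection: x_tilde = x + e, e ~ N(0, sigma^2 I).
    sigma is typically kept small to avoid over-perturbation."""
    return [xi + rng.gauss(0.0, sigma) for xi in x]
```

With sigma = 0 the representation is returned unchanged, recovering the noise-free forward pass.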

The loss function of client m in FEDFA can be equivalently written as L^FEDFA_m = E_{Z∼K} L^Z_m, where L^Z_m is the standard loss function L^ERM_m (cf. Eq. 1) imposed by adding noises to the layers in Z. In the remainder, we relate the loss function L^FEDFA_m to the original ERM loss L^ERM_m as well as a regularization term conditioned on E. Theorem 1. In FEDFA, the loss function L^FEDFA_m of client m can be expressed as:

AlexNet: 4 × (64 + 192 + 384 + 256 + 256) × 4 / 1024 = 18 KB; U-Net: 4 × (32 + 64 + 128 + 256 + 512) × 4 / 1024 = 15.5 KB. (25)

Image classification performance on Office-Caltech 10 and DomainNet test sets. Top-1 accuracy (%) is reported. Office-Caltech 10 has four clients: A(mazon), C(altech), D(SLR), and W(ebcam), while DomainNet has six: C(lipart), I(nfograph), P(ainting), Q(uickdraw), R(eal), and S(ketch). See §4.2 for details. We employ AlexNet on Office-Caltech 10 and DomainNet, using the SGD optimizer with learning rate 0.01 and batch size 32. Following FEDHARMO (Jiang et al., 2022), we employ U-Net (Ronneberger et al., 2015) on ProstateMRI, using Adam as the optimizer with learning rate 1e-4 and batch size 16. The communication rounds are 400 for Office-Caltech 10 and DomainNet, and 500 for ProstateMRI, with the number of local updates

Medical image segmentation accuracy on mini-ProstateMRI test(Liu et al., 2020b)  with small-size local datasets. Dice score (%) is reported. The dataset consists of data from six medical institutions: B(IDMC), H(K), I(2CVB), (B)M(C), R(UNMC) and U(CL). The number in the bracket denotes the number of training samples in each client. See §4.2 for more details.

Performance on CIFAR-10 and EMNIST. In addition to feature-shift non-i.i.d., FEDFA shows consistent improvements under label distribution heterogeneity (CIFAR-10) and data size heterogeneity (EMNIST). As shown in Table 3, on CIFAR-10, FEDFA surpasses the second-best method,

Comparison of generalization performance on unseen test clients on the three benchmarks (§4.2). Federated learning systems are dynamic: novel clients may enter the system after model training, quite possibly with test-time distribution shift. However, most prior federated learning algorithms focus only on improving model performance on the participating clients, while neglecting model generalizability to unseen, non-participating clients. Distribution shift often occurs during deployment, so it is essential to evaluate the generalizability of federated learning algorithms.

Efficacy of FEDFA over FEDFA-C and FEDFA-R.

Efficacy of FEDFA against augmentation techniques.

Effectiveness of adaptive variance fusion.

Algorithm 2: description of FFA for the k-th layer in client m. Input: original feature X_m^k ∈ R^{B×C×H×W}; momentum α = 0.99; probability p = 0.5; client-sharing fusion coefficients γ_{μ^k} ∈ R^C and γ_{σ^k} ∈ R^C downloaded from the server; accumulated feature statistics μ̂_m^k ∈ R^C and σ̂_m^k ∈ R^C.
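Under the same notation, a simplified pure-Python sketch of one FFA forward pass might look as follows. It treats features as nested lists x[b][c] of H·W values, folds Eqs. 2-3 and 7-9 together, and omits the momentum update of μ̂, σ̂; the epsilon term and interfaces are illustrative assumptions, not the released code.

```python
import random

def ffa_layer(x, gamma_mu, gamma_sigma, rng=random, p=0.5):
    """Sketch of one FFA layer on x[b][c] = list of H*W values.
    gamma_mu / gamma_sigma: client-sharing fusion coefficients (Eq. 6)."""
    if rng.random() > p:                       # FFA is applied stochastically
        return x
    B, C = len(x), len(x[0])
    # channel-wise feature statistics per sample (Eq. 2)
    mu = [[sum(ch) / len(ch) for ch in xb] for xb in x]
    sd = [[(sum((v - mu[b][c]) ** 2 for v in x[b][c]) / len(x[b][c])) ** 0.5
           for c in range(C)] for b in range(B)]
    # client-specific statistic variances over the batch (Eq. 3)
    var = lambda col: sum((s - sum(col) / B) ** 2 for s in col) / B
    var_mu = [var([mu[b][c] for b in range(B)]) for c in range(C)]
    var_sd = [var([sd[b][c] for b in range(B)]) for c in range(C)]
    out = []
    for b in range(B):
        rows = []
        for c in range(C):
            # adaptive variance fusion (Eq. 7) and statistic sampling (Eq. 9)
            s_mu = ((gamma_mu[c] + 1) * var_mu[c]) ** 0.5
            s_sd = ((gamma_sigma[c] + 1) * var_sd[c]) ** 0.5
            mu_new = mu[b][c] + rng.gauss(0, 1) * s_mu
            sd_new = sd[b][c] + rng.gauss(0, 1) * s_sd
            eps = 1e-6          # numerical safety term, our assumption
            # transform with the novel statistics (Eq. 8)
            rows.append([sd_new * (v - mu[b][c]) / (sd[b][c] + eps) + mu_new
                         for v in x[b][c]])
        out.append(rows)
    return out
```

With a degenerate rng that always draws zero, the layer reduces to (approximately) the identity, which matches the algorithm: zero sampled perturbations leave the original statistics in place.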

A summary of key experimental configuration for each dataset.

Performance for different sets of eligible layers to apply FFA.

Network architecture of AlexNet for the Office-Caltech 10 and DomainNet experiments. For a convolutional layer (Conv2D), we list the parameters in the sequence of input and output dimensions, kernel size, stride, and padding. For a max pooling layer (MaxPool2D), we list the kernel size and stride. For a fully connected layer (FC), we list the input and output dimensions. For a Batch Normalization layer (BN), we list the channel dimension. Note that FFA denotes the proposed feature augmentation layer, and we list the dimension of its input feature.

Numbers of samples in the training, validation, and testing sets of each client in Office-Caltech 10 used in our experiments.

Numbers of samples in the training, validation, and testing sets of each client in DomainNet used in our experiments.

Numbers of samples in the training, validation, and testing sets of each client in ProstateMRI used in our experiments.

8. ACKNOWLEDGMENTS

We would like to thank the anonymous referees for their valuable comments, which helped improve the paper. This study was partly supported by a grant from Varian Research, Switzerland.

E.3 EXPERIMENTAL DETAILS FOR MEDICAL IMAGE SEGMENTATION

Network Architecture. For medical image segmentation on ProstateMRI (Liu et al., 2020b), we use a vanilla U-Net architecture, as presented in Table 14 and Table 15.

Training Details. Following FEDHARMO, we use a combination of the standard cross-entropy loss and the Dice loss to train the network, using the Adam optimizer with learning rate 1e-4, batch size 16, and weight decay 1e-4. No data augmentation techniques are applied. The dataset splits of ProstateMRI used in our experiments are summarized in Table 13.

Additional Results. Table 16 provides detailed performance statistics of different methods on ProstateMRI w.r.t. different fractions (1/6, 2/6, 3/6, 4/6, 5/6, 1) of training samples used in each client. The table corresponds to the plot in Fig. 2.
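A hedged sketch of the combined cross-entropy + Dice objective on flattened binary masks; this is a simplification of the actual 2D segmentation loss, and the epsilon smoothing term and equal weighting of the two terms are our assumptions.

```python
import math

def ce_dice_loss(probs, target, eps=1e-6):
    """Combined binary cross-entropy + Dice loss sketch.
    probs: predicted foreground probabilities (flat list);
    target: binary ground-truth labels of the same length."""
    ce = -sum(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
              for p, t in zip(probs, target)) / len(probs)
    inter = sum(p * t for p, t in zip(probs, target))
    dice = (2 * inter + eps) / (sum(probs) + sum(target) + eps)
    return ce + (1 - dice)      # lower is better; 1 - dice is the Dice loss
```

A perfect prediction drives the loss toward zero, while less confident predictions are penalized by both terms.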

