ADVERSARIAL COLLABORATIVE LEARNING ON NON-IID FEATURES

Abstract

Federated Learning (FL) has been a popular approach to enable collaborative learning among multiple parties without exchanging raw data. However, the model performance of FL may degrade substantially due to non-IID data. While many FL algorithms focus on non-IID labels, FL on non-IID features has largely been overlooked. Different from typical FL approaches, this paper proposes a new learning concept called ADCOL (Adversarial Collaborative Learning) for non-IID features. Instead of adopting the widely used model-averaging scheme, ADCOL conducts training in an adversarial way: the server aims to train a discriminator to distinguish the representations of the parties, while the parties aim to generate a common representation distribution. Our experiments show that ADCOL achieves better performance than state-of-the-art FL algorithms on non-IID features.

Under review as a conference paper at ICLR 2023

Therefore, we need a fundamentally new approach to address the technical challenges of non-IID features. In this paper, we move beyond the model-averaging scheme used in FL and propose a novel learning concept called adversarial collaborative learning. While the feature distribution of each party is different, we aim to extract a common representation distribution that is sufficient for the prediction task. Instead of averaging the local models, we apply adversarial learning to match the representation distributions of the different parties. Specifically, the server trains a discriminator to distinguish the local representations by party ID, while the parties train their base encoders such that the generated representations cannot be distinguished by the discriminator. Besides the base encoder, each party trains a predictor for local personalization, which ensures that the generated representation is meaningful for the prediction task. Our experiments show that ADCOL outperforms state-of-the-art FL algorithms on three real-world tasks.
More importantly, ADCOL points to a promising research direction for collaborative learning. For example, it would be interesting to generalize ADCOL to other settings besides feature skew in a communication-efficient way. We use P_i(x, y) to denote the data distribution of party i, where x is the feature vector and y is the label. Following existing studies (Kairouz et al., 2019; Hsieh et al., 2020), we can categorize non-IID data in FL into the following four classes: (1) non-IID labels: the marginal distribution P_i(y) varies across parties. (2) non-IID features: the marginal distribution P_i(x) varies across parties. (3) concept drift: the conditional distribution P_i(y|x) or P_i(x|y) varies across parties. (4) quantity skew: the amount of data varies across parties. In this paper, we focus on non-IID features, which widely exist in reality. For example, the distributions of images collected by different camera devices may vary due to differences in equipment and environment. Non-IID data is a key challenge in FL, and many studies try to improve the performance of FL under non-IID data. However, most existing approaches (

1. INTRODUCTION

Deep learning is data hungry. While data are often dispersed across multiple parties (e.g., mobile devices, hospitals) in reality, they cannot be transferred to a central server for training due to privacy concerns and data regulations. Collaborative learning among multiple parties without exchanging raw data has thus become an important research topic. Federated learning (FL) (McMahan et al., 2016; Kairouz et al., 2019; Li et al., 2019b;a) has been a popular form of such collaborative learning. A basic FL framework is FedAvg (McMahan et al., 2016), which uses a model-averaging scheme. In each round, the parties update their local models and send them to the server. The server averages all local models to update the global model, which is sent back to the parties as the new local model for the next round. FedAvg has been widely used due to its effectiveness and simplicity, and most existing FL approaches are designed based on it. However, as shown in many existing studies (Hsu et al., 2019; Li et al., 2020; 2021a), the performance of FedAvg and similar algorithms may be significantly degraded by non-IID data among parties. While many studies try to improve FedAvg on non-IID data, most of them (Li et al., 2020; Wang et al., 2020b; Karimireddy et al., 2020; Acar et al., 2021; Li et al., 2021b; Wang et al., 2020a) focus on the label-imbalance setting, where the parties have different label distributions. In their experiments, they usually simulate the federated setting by partitioning a dataset into unbalanced subsets according to labels. As summarized in Hsieh et al. (2020) and Kairouz et al. (2019), besides label distribution skew, feature imbalance is also an important case of non-IID data. In the feature-imbalance setting, the feature distribution P_i(x) varies across parties. This setting widely exists in reality; e.g., people have different stroke widths and slants when writing the same word.
Another practical example is that images collected by different cameras have different intensity and contrast. However, compared with non-IID labels, FL on non-IID features has been much less explored. Most existing studies on non-IID data are still based on the model-averaging scheme (Li et al., 2020; Collins et al., 2021; Li et al., 2021b; Fallah et al., 2020), which implicitly assumes that the local knowledge P_i(y|x) is common across parties and is therefore not applicable to the non-IID feature setting. For example, FedRep (Collins et al., 2021) learns a common base encoder among parties, which outputs very different representation distributions across parties in the non-IID feature case, even for data from the same class. Such a model-sharing design fails to achieve good model accuracy in application scenarios with non-IID features. Instead of partially averaging the local models, we propose a fundamentally new training framework based on adversarial learning that does not average the models at all.

2.4. ADVERSARIAL LEARNING FOR DISTRIBUTION MATCHING

Adversarial learning has been successful for distribution matching (e.g., domain adaptation (Tzeng et al., 2017), GANs (Goodfellow et al., 2014)). The basic idea is to train a discriminator whose feedback encourages the matched distributions to become indistinguishable. Peng et al. (2019b) propose FADA, which applies adversarial learning in the federated setting to study the federated domain adaptation problem; its setting and goal differ from ours. For more discussion on the relation and differences between our approach and FADA, please refer to Appendix C. More recently, Zhang et al. (2021a) proposed FedUFO, where each party trains a discriminator to apply feature and objective consistency constraints to address the non-IID data issue. However, during the local training stage, FedUFO needs to transfer each local model to all the other parties, which incurs massive communication overhead. Moreover, FedUFO focuses on the non-IID label setting in its experiments.

3. THE PROPOSED METHOD

3.1. PROBLEM STATEMENT

Suppose there are N parties, where party i has a local dataset D_i = {x, y}. The feature distributions P_i(x) differ among parties, while the label distributions P_i(y) are the same or similar. The parties conduct collaborative learning over D ≜ ∪_{i∈[N]} D_i with the help of a central server, without exchanging the raw data. As in typical personalized FL, the goal of each party is to train a machine learning model that has good accuracy on its local test dataset.

3.2. MOTIVATION

Problem of Model Averaging on non-IID Features. Most existing studies (Li et al., 2020; Karimireddy et al., 2020; Li et al., 2021b; Dinh et al., 2020; Collins et al., 2021) are still based on FedAvg to address non-IID data. However, they are not suitable in our setting. In the model-averaging scheme, the server averages the local models into a global model, which essentially tries to learn a common P(y|x). In their experiments, these studies usually partition a dataset horizontally across parties to simulate the federated setting, where the parties indeed follow the same P(y|x) (with different P_i(y)), so the global model is helpful. However, in our setting, for parties i and j, since P_i(x) ≠ P_j(x) and P_i(y) = P_j(y), P_i(y|x) and P_j(y|x) are different. Averaging the local models does not directly help the learning of the local knowledge P_i(y|x). Instead of averaging and learning a global model, we propose to learn a common representation distribution to address non-IID features. Although the feature distributions P_i(x) differ among parties, the parties share the same task y. Thus, we aim to extract the underlying task-specific representation z for the task y from multiple parties. Specifically, we decompose the local objective P_i(y|x) into two parts: P_i(z|x) and P_i(y|z). The first part learns the oracle representation for the task, and the second part predicts the label from the representation. The second part can easily be achieved by training a predictor head with the representations as inputs. For the first part, we assume that there exists an oracle optimal representation distribution P*(z) for the prediction of y. Then, the ideal objective of party i can be formulated as min_{θ_i} E_{x∼D_i} ℓ_KL( P_i(x) P_i(z|x; θ_i) || P*(z) ), where ℓ_KL is the KL divergence loss and θ_i parameterizes the base encoder that generates the representation. In practice, P*(z) is unknown.
However, it has the following two properties: (1) P*(z) is the same for every party; (2) P*(z) is able to predict y. Thus, we approximate the objective from two aspects: (1) to ensure that the representation absorbs the knowledge of multiple parties, the parties aim to map their local data into a common representation distribution P(z); (2) we ensure that the generated P(z) contains the information necessary to predict y by training a predictor on the representation. We introduce the details of the training procedure in Section 3.4.
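In display form, the decomposition and the ideal local objective above can be restated as follows (the integral over x denotes the representation distribution induced by the encoder; this is a rewriting of the inline formulas, not an additional assumption):

```latex
% Decomposition of the local predictive distribution of party i:
P_i(y \mid x) = \int P_i(y \mid z)\, P_i(z \mid x;\, \theta_i)\, dz .

% Ideal objective: match the induced representation distribution to the
% oracle distribution P^*(z) under the KL divergence.
\min_{\theta_i}\ \ell_{\mathrm{KL}}\!\Big( \textstyle\int P_i(x)\, P_i(z \mid x;\, \theta_i)\, dx \ \Big\|\ P^*(z) \Big)
```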

3.3. MODEL ARCHITECTURE

There are two kinds of models in ADCOL: the local models trained by the parties and the discriminator trained by the server. As ADCOL works from the perspective of representations, the architecture of the local model is similar to existing studies on self-supervised representation learning (Chen et al., 2020; Chen & He, 2021). The local model has three components: a base encoder, a projection head, and a predictor. The base encoder (e.g., ResNet-50) extracts representation vectors from the inputs. Like SimCLR (Chen et al., 2020) and SimSiam (Chen & He, 2021), an additional projection head maps the representation to a space with a fixed dimension. The final predictor outputs probabilities for each class. For ease of presentation, we use F(·) to denote the whole model and G(·) to denote the model before the predictor (i.e., G(x) is the mapped representation vector of input x). For the discriminator, we simply use an MLP.
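The composition F = predictor ∘ G can be sketched in PyTorch (the paper's implementation framework). The module names, the toy MLP encoder, and all dimensions here are illustrative placeholders rather than the authors' code; the paper uses ResNet-50 as the base encoder.

```python
import torch
import torch.nn as nn

class LocalModel(nn.Module):
    """Sketch of an ADCOL local model: base encoder + projection head + predictor."""
    def __init__(self, feat_dim=32, proj_dim=16, num_classes=10):
        super().__init__()
        # Toy encoder standing in for ResNet-50, to keep the sketch self-contained.
        self.encoder = nn.Sequential(nn.Flatten(),
                                     nn.Linear(28 * 28, feat_dim), nn.ReLU())
        self.projection = nn.Linear(feat_dim, proj_dim)    # projection head
        self.predictor = nn.Linear(proj_dim, num_classes)  # prediction head

    def represent(self, x):
        # G(x): the representation that is sent to the server's discriminator.
        return self.projection(self.encoder(x))

    def forward(self, x):
        # F(x): class scores for the local prediction task.
        return self.predictor(self.represent(x))

model = LocalModel()
x = torch.randn(4, 1, 28, 28)
z = model.represent(x)   # G(x), shape (4, 16)
logits = model(x)        # F(x), shape (4, 10)
```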

3.4. THE OVERALL FRAMEWORK

The overall framework is shown in Figure 1 and Algorithm 1. There are four steps in each round: (1) the server sends the discriminator to the parties; (2) the parties update their local models; (3) the parties send representations to the server; (4) the server updates the discriminator.

Step 1. The server sends the discriminator to the parties (line 4 of Algorithm 1).

Step 2. The parties update their models using their local datasets (lines 10-17 of Algorithm 1). In addition to the objective of minimizing the cross-entropy loss ℓ_CE on the local dataset, ADCOL introduces a regularization term that maximizes the probability that the discriminator cannot distinguish the local representations. For each input x, ADCOL feeds the representation G(x) to the discriminator. ADCOL expects the discriminator to output the uniform probability vector [1/N, …, 1/N] (i.e., the probability of each party is 1/N), so that it cannot tell which party the representation comes from. Thus, ADCOL uses the Kullback-Leibler (KL) divergence to measure the difference between the output of the discriminator D(G(x)) and the uniform target. The final loss of an input (x, y) is computed as

ℓ = ℓ_CE(F(x), y) + µ · ℓ_KL([1/N, …, 1/N] || D(G(x)))   (2)

where µ is a hyper-parameter controlling the weight of the KL divergence loss, ℓ_CE is the cross-entropy loss, and ℓ_KL is the KL divergence loss. Each party minimizes its local empirical risk E_{(x,y)∼D_i} ℓ(x, y; D) to update its local model, where ℓ(·) is given in Equation 2.

Step 3. After local training, the parties feed their data into the local models and transfer the representations to the server (line 5 of Algorithm 1).

Step 4. The server updates the discriminator using the received representations (lines 6-9 of Algorithm 1).
Specifically, the server builds a training set D_R = {R, I}, where the features are the representations and the labels are the IDs of the parties the representations come from. The server minimizes the empirical risk E_{(R,I)∼D_R} ℓ_CE(D(R), I) on this training set to update the discriminator.
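The per-input loss of Equation 2 and the server-side discriminator update (Step 4) can be sketched as follows in PyTorch. This is a hedged illustration: the function names (`local_loss`, `discriminator_step`) are ours, and we assume the discriminator outputs logits that are turned into probabilities with a softmax.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def local_loss(logits, y, disc_logits, mu=1.0):
    """Equation 2: l = l_CE(F(x), y) + mu * l_KL([1/N]^N || D(G(x)))."""
    n_parties = disc_logits.shape[1]
    ce = F.cross_entropy(logits, y)
    # F.kl_div(input, target) computes KL(target || input) with log-prob input,
    # so the uniform vector is the target and D's log-probabilities the input.
    uniform = torch.full_like(disc_logits, 1.0 / n_parties)
    kl = F.kl_div(F.log_softmax(disc_logits, dim=1), uniform,
                  reduction="batchmean")
    return ce + mu * kl

def discriminator_step(disc, opt, reps, party_ids):
    """Step 4: the server trains the discriminator to predict party IDs."""
    opt.zero_grad()
    loss = F.cross_entropy(disc(reps), party_ids)
    loss.backward()
    opt.step()
    return loss.item()

# Toy usage: 3 parties, 8-dimensional representations, 5 classes.
disc = nn.Linear(8, 3)
opt = torch.optim.SGD(disc.parameters(), lr=0.1)
reps = torch.randn(6, 8)
ids = torch.tensor([0, 1, 2, 0, 1, 2])
d_loss = discriminator_step(disc, opt, reps, ids)
loss = local_loss(torch.randn(6, 5), torch.tensor([0, 1, 2, 3, 4, 0]),
                  disc(reps), mu=1.0)
```

Note that when the discriminator already outputs the uniform vector, the KL term vanishes and only the cross-entropy term remains, matching the intent of Equation 2.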

4.1. CONVERGENCE OF ADCOL

As shown in Equation 2, the local loss has two parts: the cross-entropy loss, which updates the whole network F, and the KL divergence loss, which updates the representation generator G. Ideally, to minimize ℓ, each part should reach its minimum. Since the cross-entropy part is the same as in FedAvg, we focus on the effect of the KL divergence loss; for simplicity, we ignore the cross-entropy loss in our theoretical analysis.

[Algorithm 1: the ADCOL training procedure. The server collects R ← {(R_i, i)}_{i=1}^N from PartyLocalTraining and updates D by cross-entropy on the party IDs (D ← D − η∇ℓ); each party computes ℓ ← ℓ_CE(F_i(x), y) + µ · ℓ_KL([1/N, …, 1/N] || D(G_i(x))), updates F_i ← F_i − η∇ℓ, and returns G_i(x_i) to the server.]

The local objective of party i is:

min_{G_i} E_{x∼D_i} ℓ_KL([1/N, …, 1/N] || D(G_i(x))).   (3)

The objective of the discriminator is:

max_D Σ_{i=1}^N E_{x∼D_i} log(D_i(G_i(x))),   (4)

where D_i(·) is the i-th entry of the prediction vector D(·) (i.e., the probability of class i). We analyze the convergence of the training process following existing studies on GANs (Goodfellow et al., 2014; Tran et al., 2019). In Theorem 4.1, we derive the optimal discriminator for the objective in Equation 4. In Theorem 4.2, we derive the optimal distributions of the local representations that minimize the local objective in Equation 3, given the optimal discriminator from Theorem 4.1. Last, in Theorem 4.3, we show that the distributions of the local representations converge to the optimal solution given in Theorem 4.2. All proofs are available in Appendix A.

Theorem 4.1. We use P_{G_i} to denote the distribution of the representations generated by party i, and P_{G_i}(z) the probability of representation z under P_{G_i}. Then, the optimal discriminator D* for Equation 4 is

D*_k(z) = P_{G_k}(z) / Σ_{i=1}^N P_{G_i}(z).   (5)

Theorem 4.2.
Given the optimal discriminator D* from Equation 5, the global minimum of Equation 3 is achieved if and only if P_{G_1} = P_{G_2} = … = P_{G_N}.

Theorems 4.1 and 4.2 show that to achieve the minima of the objectives of the local parties and the discriminator, the parties must generate the same representation distribution, which matches the goal of ADCOL. In Theorem 4.2, we assume that D can reach D*, as in existing GAN studies (Goodfellow et al., 2014; Tran et al., 2019); for a detailed analysis, please refer to Appendix A.

Theorem 4.3. Suppose P*_G is the optimal solution given in Theorem 4.2. If G_i (∀i ∈ [1, N]) and D have enough capacity, and P_{G_i} is updated to minimize the local objective (i.e., Equation 3) given the optimal discriminator D* from Equation 5, then P_{G_i} converges to P*_G.

The above theorem provides insight into the convergence of the training. In practice, we optimize the parameters θ of the local networks rather than P_{G_i} itself, which is reasonable given the strong empirical performance reported in Goodfellow et al. (2014). Note that there are collapsing solutions to Equations 3 and 4: the representations of each party can simply be constant vectors, which achieves the global minimum of Equation 3. Thus, the cross-entropy loss in Equation 2 is necessary to ensure that the generated representations are meaningful.
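Theorems 4.1 and 4.2 can be illustrated numerically on a discrete representation space. The toy distributions below are hypothetical; the sketch only evaluates the closed-form optimal discriminator of Equation 5.

```python
import numpy as np

def optimal_discriminator(densities):
    """Theorem 4.1: D*_k(z) = P_{G_k}(z) / sum_i P_{G_i}(z), on a discrete grid."""
    densities = np.asarray(densities, dtype=float)  # shape (N, num_z)
    return densities / densities.sum(axis=0, keepdims=True)

# Identical representation distributions (the optimum of Theorem 4.2):
# the optimal discriminator is forced to the uniform vector [1/N, ..., 1/N].
p = np.array([0.2, 0.5, 0.3])
d_same = optimal_discriminator([p, p])   # every entry is 0.5 (N = 2)

# Distinct distributions: the optimal discriminator deviates from uniform,
# so the KL term in the parties' objective (Equation 3) is non-zero.
q = np.array([0.6, 0.2, 0.2])
d_diff = optimal_discriminator([p, q])
```

When the two distributions coincide, the parties' KL objective attains its minimum of zero; any gap between them shows up as a non-uniform optimal discriminator.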

4.2. COMMUNICATION SIZE

For simplicity, our analysis assumes that all parties participate in each round; it is straightforward to extend the analysis to party sampling. We use S_L to denote the size of the local model. The communication size per round of FedAvg is then 2N·S_L: the server sends the model to all parties, and each party sends its local model back. We use n to denote the total number of examples (i.e., n = Σ_{i=1}^N |D_i|), d the dimension of the representations, and S_D the size of the discriminator. Suppose each float value takes four bytes. In each round, the communication size of ADCOL is 4nd + N·S_D: the parties send representations to the server, and the server sends the discriminator to the parties. Although the communication costs of ADCOL and FedAvg depend on the specific settings, we find that ADCOL is usually more communication-efficient than FedAvg in the experimental settings of existing studies. For example, in the experimental setting of FedAvg [5], a simple MLP (199,210 parameters) is used for classification on MNIST with 100 parties. For FedAvg, the communication size per round is 159.4 MB. For ADCOL, with representation dimension d = 128 and a 2-layer MLP discriminator with 128 hidden units (S_D = 68,628 B), the communication size per round is 37.6 MB. There may exist extreme cases in which ADCOL has a larger per-round communication size than FedAvg, e.g., when the number of parties is small, the number of samples is large, and the model is small. However, in such cases it is usually impractical to conduct FL anyway, as local training may already achieve satisfactory performance. Analogous to party sampling in FedAvg, we propose representation sampling to further reduce the communication size: in each round, each party randomly samples a subset of its representations and sends them to the server.
As we will show in Appendix B.11, representation sampling can effectively reduce the communication size with tolerable accuracy loss.
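The numbers quoted above can be reproduced with a short back-of-the-envelope script, assuming (as the paper does) 4-byte floats, n = 60,000 MNIST training examples, and the stated model and discriminator sizes:

```python
BYTES_PER_FLOAT = 4

def fedavg_round_bytes(n_parties, model_params):
    # One model from the server to every party and one model back: 2 * N * S_L.
    return 2 * n_parties * model_params * BYTES_PER_FLOAT

def adcol_round_bytes(n_samples, rep_dim, n_parties, disc_bytes):
    # Representations up (4 * n * d bytes) plus the discriminator down (N * S_D).
    return BYTES_PER_FLOAT * n_samples * rep_dim + n_parties * disc_bytes

fedavg = fedavg_round_bytes(n_parties=100, model_params=199_210)
adcol = adcol_round_bytes(n_samples=60_000, rep_dim=128,
                          n_parties=100, disc_bytes=68_628)
print(f"FedAvg per round: {fedavg / 1e6:.1f} MB")  # 159.4 MB
print(f"ADCOL per round:  {adcol / 1e6:.1f} MB")   # 37.6 MB
```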

4.3. PRIVACY

While representation sharing is used in ADCOL and other collaborative learning studies (He et al., 2020; Peng et al., 2019b; Vepakomma et al., 2018), one possible concern is that representations may leak more information than models. Many existing studies (Shokri et al., 2017; Nasr et al., 2019) infer sensitive information from exchanged gradients/models, and there are also reconstruction attacks on the outputs of a model (Salem et al., 2020). While existing work has shown that the mutual information between the input data and the final representation is small (Shwartz-Ziv & Tishby, 2017), to the best of our knowledge it is still unclear whether sharing models is more private than sharing representations; this is an interesting future direction. To enhance the privacy guarantee, Differential Privacy (DP) (Dwork et al., 2014) can be applied to protect the transferred messages, including the representations. For more details, please refer to Appendix B.12.

5.1. EXPERIMENTAL SETUP

Baselines. We compare ADCOL with seven baselines: SOLO (i.e., each party trains the model individually without collaborative learning), FedAvg (McMahan et al., 2016), FedBN (Li et al., 2021c), PartialFed (Sun et al., 2021), FedProx (Li et al., 2020), Per-FedAvg (Fallah et al., 2020), and FedRep (Collins et al., 2021). FedBN is the state-of-the-art FL approach for non-IID features. PartialFed is a personalized FL approach for the cross-domain setting that is also applicable to the non-IID feature setting. FedProx is a popular FL approach for non-IID data. Per-FedAvg and FedRep are two state-of-the-art personalized FL approaches. FedUFO is not open-sourced and requires all-to-all communication of local models between every pair of parties during local training, which leads to prohibitively high communication cost; for example, in our experimental setting on the Digits task, FedUFO has 77 times the communication cost of ours. Thus, we omit experiments with FedUFO. Like FedAvg (McMahan et al., 2016), we use a weighted average according to the data volume of each party for all baselines. By default, we do not apply representation sampling or differential privacy in ADCOL.

Models. All approaches use the same local model architecture for a fair comparison. The architecture of the local model is similar to SimSiam (Chen & He, 2021) and has the following three components: (1) Base encoder: ResNet-50 (He et al., 2016). (2) Projection head: a 3-layer MLP with BN applied to each fully connected layer; the input dimension is 4096, and the dimensions of the hidden and output layers are 2048. (3) Predictor: a 2-layer MLP with BN applied to its hidden layer; the input dimension is 2048 and the hidden-layer dimension is 512. The discriminator is a 3-layer MLP with input dimension 2048, hidden dimensions 512, and output dimension equal to the number of parties.

Datasets. We use the same tasks as in the study of FedBN.
There are three tasks in our experiments: (1) Digits: five digit data sources from different domains: MNIST (LeCun et al., 1998), SVHN (Netzer et al., 2011), USPS (Hull, 1994), SynthDigits (Ganin & Lempitsky, 2015), and MNIST-M (Ganin & Lempitsky, 2015). (2) Office-Caltech-10 (Gong et al., 2012): four data sources acquired with different camera devices or in different real environments with various backgrounds: Amazon, Caltech, DSLR, and WebCam. (3) DomainNet (Peng et al., 2019a): natural images from six data sources with different image styles: Clipart, Infograph, Painting, Quickdraw, Real, and Sketch. The first task is a synthetic task combining different digit datasets; the second and third are real-world datasets naturally generated in a federated setting. For each task, the different datasets have heterogeneous features but share the same label distribution, which naturally forms the non-IID feature setting (Li et al., 2021a). Due to the page limit, we present only some experimental results on Digits in the main paper; for more results and details, please refer to Appendix B.

Setup. By default, the number of parties equals the number of data sources, and each party holds the data from one source. For each dataset, we randomly split off 1/5 of the original dataset as the test set and use the remainder for training. The number of local epochs is 10 by default for all FL approaches; the number of epochs is 300 for SOLO. For ADCOL and FedProx, we tune µ ∈ {10, 1, 0.1, 0.01, 0.001} and report the best results. For FedRep, we tune β (i.e., the step size for the second batch of training) from {0.001, 0.01} and report the best results, and we use the prediction layers as the shared representation. We use PyTorch (Paszke et al., 2019) to implement all approaches.
We use the SGD optimizer for training with a learning rate of 0.01, weight decay 10^-5, and momentum 0.9. The batch size is 64, 32, and 32 for Digits, Office-Caltech-10, and DomainNet, respectively. We run the experiments on a server with 8× NVIDIA GeForce RTX 3090, a server with 4× NVIDIA A100, and a cluster with 45× NVIDIA GeForce RTX 2080 Ti.

Table 1: Comparison of top-1 test accuracy among different approaches on Digits. We run FL approaches for 100 rounds (all approaches have converged). We run three trials and report the mean and standard deviation. Besides the test accuracy on each party, we also report the mean accuracy over all parties, denoted "AVG".
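For concreteness, the discriminator and optimizer described above can be instantiated as follows. This is a sketch with the stated dimensions and hyper-parameters, not the authors' released code; `num_parties` is set to 5 for the Digits task.

```python
import torch
import torch.nn as nn

num_parties = 5  # the five data sources of the Digits task

# 3-layer MLP discriminator: input 2048, hidden layers 512, output = #parties.
discriminator = nn.Sequential(
    nn.Linear(2048, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, num_parties),
)

# SGD with the stated hyper-parameters: lr 0.01, weight decay 1e-5, momentum 0.9.
optimizer = torch.optim.SGD(discriminator.parameters(),
                            lr=0.01, weight_decay=1e-5, momentum=0.9)

reps = torch.randn(64, 2048)   # a batch of received representations
logits = discriminator(reps)   # one score per party
```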

5.2. OVERALL COMPARISON

Table 1 reports the test accuracy of the different approaches on Digits. We make the following observations. First, ADCOL is more effective than the other approaches: it achieves the best test accuracy on most datasets and outperforms the other approaches by more than 2% accuracy on average. Second, while the parties may not benefit from FL approaches in some cases (e.g., Caltech-10), ADCOL always achieves better accuracy than SOLO, which demonstrates its robustness. Last, the personalized FL approaches (i.e., Per-FedAvg and FedRep) perform poorly in the non-IID feature setting, even worse than SOLO. For the results on the other tasks, please refer to Appendix B.2.

5.3. COMMUNICATION EFFICIENCY

To show the communication efficiency of ADCOL, following existing studies (Karimireddy et al., 2020; Lin et al., 2020), we compare the number of communication rounds and the communication size each approach needs to reach the same target performance. The results on Digits are shown in Table 2. No approach consistently outperforms the others in the number of communication rounds. However, the communication size of ADCOL is always much smaller than that of the other approaches: ADCOL saves at least 10× the communication cost needed to achieve the same accuracy as FedAvg, and the saving can be up to 34× on Digits. The results demonstrate that ADCOL is much more communication-efficient than the other FL approaches. For the results on the other tasks, please refer to Appendix B.2.

Scalability and Heterogeneity

We adopt the same approach as Li et al. (2021c) to study the effect of the number of parties and of heterogeneity. We divide each dataset randomly and equally into ten parts and allocate each part to one party. Parties from the same dataset are treated as IID, while parties from different datasets are treated as non-IID. We add two parties from each dataset at a time, resulting in N ∈ {10, 20, 30, 40, 50} parties. Moreover, the degree of heterogeneity decreases as the number of parties increases, since the number of IID parties grows. The test accuracies are reported in Figure 2a. The accuracy of all approaches improves slightly as the number of parties increases, due to the reduced heterogeneity and the increased total amount of data. For any number of parties, ADCOL consistently outperforms the baselines. Although the number of classes the discriminator must distinguish grows with N, ADCOL still shows good and stable performance.

Effect of Local Dataset Size

We vary the percentage of the original local dataset used by each party from 20% to 100%. The results are shown in Figure 2b. The improvement of ADCOL is more significant when the local dataset is small. If the local dataset is large, each party can already achieve satisfactory accuracy with SOLO; the accuracies of all approaches are close when the percentage is 100%, and collaborative learning is not necessary in such a case.

Effect of µ. We vary µ ∈ {0, 0.1, 1, 10} and report the accuracy of ADCOL in Figure 2c. ADCOL achieves the best accuracy when µ = 1. If µ is too small, the KL divergence term of Equation 2 has little effect on local training, and the goal of learning a common representation distribution may not be achieved. If µ is too large, the cross-entropy term of Equation 2 has little effect on local training, and the representations may not be useful for classification at all (e.g., all representations collapse to a constant vector); ADCOL with µ = 10 may even be worse than SOLO (i.e., µ = 0). Thus, an appropriate µ is important in ADCOL. In our experiments, we find that µ = 1 is a good default choice.

6. CONCLUSION

In this paper, we propose ADCOL, a novel collaborative learning approach for non-IID features. Unlike most previous studies, which perform model averaging, ADCOL trains the models adversarially between the parties and the server from the perspective of representation distributions: the parties aim to learn a common representation distribution, while the server aims to distinguish the representations by party ID. Our experiments on three real-world tasks show that ADCOL achieves higher accuracy than state-of-the-art federated learning approaches on non-IID features. ADCOL shows that it is possible to incorporate global knowledge into the parties in an adversarial way instead of by model averaging, which is a fundamentally new and potentially powerful direction for federated learning. In future work, we are interested in extending ADCOL to more federated settings and in advanced techniques for efficient and privacy-preserving representation sharing.

Reproducibility Statement

We have provided the experimental details in Section 5.1 and Appendix B.1 for reproducibility. Moreover, we will make the code publicly available.

A THEORETICAL ANALYSIS

Theorem 4.1. We use P_{G_i} to denote the distribution of the representations generated by party i, and P_{G_i}(z) the probability of representation z under P_{G_i}. Then, the optimal discriminator D* for Equation 4 of the main paper is

D*_k(z) = P_{G_k}(z) / Σ_{i=1}^N P_{G_i}(z).   (7)

Proof. From the view of the distribution of the representations z, we can reformulate Equation 4: the objective is to maximize

Σ_{i=1}^N ∫_z P_{G_i}(z) log(D_i(z)) dz.   (8)

Let V(D) = Σ_{i=1}^N P_{G_i}(z) log(D_i(z)). Maximizing Equation 8 with respect to D is equivalent to maximizing V(D) with respect to D for any given z. Note that Σ_{i=1}^N D_i(z) = 1. Let F(D) = V(D) + λ(1 − Σ_{i=1}^N D_i(z)). We have

∂F(D)/∂D_i(z) = P_{G_i}(z)/D_i(z) − λ.   (9)

Setting ∂F(D)/∂D_i(z) = 0 for all i ∈ [1, N], we have

P_{G_1}(z)/D_1(z) = P_{G_2}(z)/D_2(z) = … = P_{G_N}(z)/D_N(z) = λ.

Thus, V(D) achieves its maximum when D*_k(z) = P_{G_k}(z) / Σ_{i=1}^N P_{G_i}(z).

Note that the discriminator uses SGD with the cross-entropy loss to update its model, as shown in Lines 6-9 of Algorithm 1. If the discriminator is a linear function, it converges to the global optimum since the loss function is convex. If the discriminator is a neural network with non-linear activations, whether SGD finds a global minimum is a classical optimization problem, orthogonal to our study. Given the evidence of the power of deep learning with SGD from existing studies (Du et al., 2019; Zhou et al., 2019; Zou et al., 2018; Choromanska et al., 2015; Dauphin et al., 2014), we assume that D can reach D*, as in existing GAN studies (Goodfellow et al., 2014; Tran et al., 2019). We also empirically show that the discriminator converges to the optimum in Appendix B.15.

Theorem 4.2. Given the optimal discriminator D* from Equation 7, the global minimum of Equation 3 of the main paper is achieved if and only if

P_{G_1} = P_{G_2} = … = P_{G_N}.   (12)

Proof.
From Equation 3 of the main paper, the local objective of party k is to minimize

W(G_k) = −E_{x∼D_k} (1/N) Σ_{i=1}^N log(N · D*_i(G_k(x)))
       = −E_{x∼D_k} (1/N) Σ_{i=1}^N log( N · P_{G_i}(G_k(x)) / Σ_{j=1}^N P_{G_j}(G_k(x)) )
       = −E_{x∼D_k} (1/N) Σ_{i=1}^N ( log N + log( P_{G_i}(G_k(x)) / Σ_{j=1}^N P_{G_j}(G_k(x)) ) ).   (13)

B.2 CALTECH-10 AND DOMAINNET

Tables 4 and 5 show the test accuracy of the different approaches on Caltech-10 and DomainNet, respectively. ADCOL still outperforms the other approaches in most cases. Tables 6 and 7 show the communication efficiency of ADCOL on Caltech-10 and DomainNet: ADCOL is much more communication-efficient than the other approaches, with a saving of up to 300×.

B.3 EXPLANATION OF THE EXPERIMENTAL RESULTS BY FID

We observe that there is a correlation between FID and the performance gain of ADCOL compared with local training. Generally, with a higher FID (i.e., a more imbalanced feature distribution), a party gains more from our approach. The relative improvement in accuracy of ADCOL against local training is 15.4%, 6.8%, and 17.2% on Digits, Caltech-10, and DomainNet, respectively. The improvement is positively related to the FID of each task. If FID is small, the representation distribution of the local dataset is close to that of the global dataset, so local training may already learn a good representation and the improvement of ADCOL is limited.
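For reference, the Frechet distance underlying FID can be sketched as follows. This is a hypothetical minimal implementation assuming the representations have already been extracted (by the pretrained Inception v3 in our setup); it is not the exact code behind the reported numbers:

```python
import numpy as np
from scipy import linalg

def fid(feats_a, feats_b):
    """Frechet distance between two sets of representation vectors, each
    modeled as a Gaussian: ||mu_a - mu_b||^2 + Tr(Sa + Sb - 2 (Sa Sb)^(1/2))."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean = linalg.sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):  # discard tiny numerical imaginary parts
        covmean = covmean.real
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))
```

Two identical representation sets give a distance of (numerically) zero, and the distance grows as the two feature distributions drift apart, matching the intuition above that a larger FID indicates a local distribution farther from the global one.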

B.4 TRAINING CURVES

The training curves of different approaches on Digits are shown in Figure 5. We can observe that ADCOL is much more communication-efficient than the other approaches: it converges with a much smaller communication size.

B.5 PARTY SAMPLING

Party sampling is a technique usually used in the cross-device setting, where a fraction of parties is sampled to participate in federated learning in each round. Here we set the sample fraction to 0.4 on Digits and choose FedAvg and FedBN as the baselines. The training curves are shown in Figure 6. We can observe that all approaches exhibit unstable accuracy during training due to sampling. Moreover, FedAvg and FedBN achieve very poor accuracy, which shows that existing federated learning approaches cannot support party sampling well on non-IID features. ADCOL significantly outperforms the other approaches. We increase the number of parties to 100 (i.e., divide each dataset into 20 subsets) and vary the sampling rate over {0.1, 0.2, 0.5, 1}. We run all approaches for 200 rounds. The results are shown in Table 8. We can observe that when the sampling rate decreases, the performance of all approaches decreases. Moreover, the training is more unstable when the sampling rate is smaller. However, ADCOL still significantly outperforms FedAvg and FedBN. It remains a challenging task to develop effective algorithms for the cross-device setting with a low sampling rate.
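The party-sampling procedure above amounts to drawing a random subset of party IDs each round. A minimal sketch (function and parameter names are ours, not from the paper's code):

```python
import random

def sample_parties(num_parties, fraction, rng=None):
    """Pick the subset of parties that participate in one communication round."""
    rng = rng or random.Random()
    k = max(1, int(num_parties * fraction))
    return sorted(rng.sample(range(num_parties), k))
```

With 10 parties and a sample fraction of 0.4, four distinct parties are selected per round; the server only aggregates (or, in ADCOL, only collects representations from) that subset.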

B.6 STUDY ON THE DISCRIMINATOR

One natural question is how to increase the information contained in the discriminator to improve the performance of ADCOL. We have tried two approaches. Changing the Model Architecture One approach is to increase the capacity of the discriminator. We change the model architecture to ResNet-50. The results are shown in Table 9. The performance of ADCOL cannot be improved by increasing the capacity of the discriminator. Using Multiple Discriminators Another approach is to keep the discriminators from the most recent $N_d$ rounds and average their regularization losses:
$$\ell = \frac{1}{N_d} \sum_{i=1}^{N_d} \ell_{KL}\left(\left[\tfrac{1}{N}\right]^N \,\middle\|\, D_i(G(x))\right),$$
where $D_i$ is the discriminator trained in round $\max(1, t - i)$. The results are shown in Table 10. ADCOL cannot benefit from more discriminators; when the number of discriminators is larger, the accuracy of ADCOL is even worse. It is future work to investigate how to integrate more useful information into the discriminator.
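The averaged regularizer over the last $N_d$ discriminators can be sketched as below, treating each discriminator's softmax output as a length-$N$ probability vector over party IDs (a simplified numpy sketch; the function names are ours):

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence between two discrete probability distributions."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def multi_discriminator_loss(disc_outputs):
    """Average KL([1/N]^N || D_i(G(x))) over the N_d kept discriminators.

    disc_outputs: list of N_d softmax vectors of length N (one entry per party),
    each produced by the discriminator kept from an earlier round."""
    n = len(disc_outputs[0])
    uniform = np.full(n, 1.0 / n)
    return float(np.mean([kl(uniform, d) for d in disc_outputs]))
```

The loss is zero when every kept discriminator is already fooled (uniform outputs) and grows when any of them can still tell the parties apart.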

B.7 DIMENSION OF REPRESENTATIONS

Same as SimSiam (Chen & He, 2021), we set the dimension of representations (i.e., the output dimension of the projection head and the input dimension of the discriminator) to 2048 by default. As shown in Table 11, we report the performance of ADCOL with varying representation dimensions. ADCOL benefits from a larger representation dimension, where the representations are more informative: the mean accuracy improves by about 5% when increasing the dimension from 512 to 2048.

B.8 NON-IID LABELS

We test the performance of ADCOL in non-IID label settings. Specifically, we sample p_k ∼ Dir_N(0.5) and allocate a p_{k,j} proportion of the instances of class k to party j, where Dir(0.5) is the Dirichlet distribution with concentration parameter 0.5. The results are shown in Table 12. ADCOL cannot achieve better performance than FedAvg and FedBN. Intuitively, the task-specific representations of images from different classes should be very different. If the label distribution varies across parties, the representation distribution naturally also varies a lot. The intuition of ADCOL, which aims to learn a common representation distribution, is not appropriate in non-IID label settings.
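The Dirichlet-based label split above is a common construction and can be sketched as follows (function name is ours; this is an illustrative sketch, not the paper's code):

```python
import numpy as np

def dirichlet_partition(labels, num_parties, alpha=0.5, seed=0):
    """Allocate a p_{k,j} ~ Dir(alpha) proportion of the instances of each
    class k to party j, producing a non-IID label split."""
    rng = np.random.default_rng(seed)
    parties = [[] for _ in range(num_parties)]
    for k in np.unique(labels):
        idx = np.flatnonzero(labels == k)
        rng.shuffle(idx)
        p = rng.dirichlet(alpha * np.ones(num_parties))
        # Split the class-k indices according to the sampled proportions
        cuts = (np.cumsum(p)[:-1] * len(idx)).astype(int)
        for j, part in enumerate(np.split(idx, cuts)):
            parties[j].extend(part.tolist())
    return parties
```

A smaller alpha makes the per-class proportions more extreme, i.e., the label distributions across parties more skewed.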

B.9 COMPUTATION OVERHEAD

As shown in Table 13, the training time of ADCOL is larger than that of the other approaches. ADCOL requires training a discriminator on the server side, while the other approaches only need to average the models on the server side. However, in practice, the server usually has much more powerful computation resources than the parties. Thus, the computation overhead on the server side is affordable.

B.10 SHARING THE PREDICTOR LAYERS

The parties only send the representations to the server in ADCOL. While ADCOL aims to learn a common representation distribution over z, an interesting extension is to share the predictor layers between the parties and the server, which ideally helps regularize p(y|z). The results are shown in Table 14. We can observe that ADCOL without sharing the predictor layers is generally more effective than sharing them. In practice, the distribution p(y|x_i) is not exactly the same across parties. Thus, it is not necessary to regularize p(y|z) among the parties. Leaving each party to fine-tune its own predictor layers is better able to capture the personalized local distribution.

B.11 REPRESENTATION SAMPLING

Here we apply the representation sampling technique and vary the sampling rate over {20%, 60%, 80%, 100%}. The final accuracy and the communication efficiency are shown in Table 15. We can observe that the communication cost of ADCOL can be significantly reduced with representation sampling. Moreover, there is little accuracy loss when the sampling rate is larger than 60%.
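Representation sampling simply transmits a random subset of the local representations each round. A minimal sketch (ours, not the paper's code):

```python
import numpy as np

def sample_representations(reps, rate, rng):
    """Keep a `rate` fraction of the rows of the local representation matrix
    before sending it to the server."""
    n = max(1, int(len(reps) * rate))
    idx = rng.choice(len(reps), size=n, replace=False)
    return reps[idx]
```

With a rate of 0.6, a party with 10 representations sends only 6 rows per round, cutting the upstream communication proportionally.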

B.12 DIFFERENTIAL PRIVACY

We consider two popular threat models in existing FL studies: 1) The server is trusted and the parties are honest-but-curious (Geyer et al., 2017); we need to protect the messages that are sent from the server to the parties. 2) The server and the parties are honest-but-curious; we need to protect all the transferred messages (Wei et al., 2020; Truex et al., 2020). Trusted Server In this setting, we do not need to protect the representations sent from the parties to the server. We need to protect the classification model sent from the server to the parties. Thus, when training the classification model on the server side, we apply DP-SGD (Abadi et al., 2016) to add Gaussian noise to the gradients during training to satisfy (ε, δ)-DP with the same default parameters. We keep ε fixed and ensure that δ ≤ 10^{-2} to compare the accuracy of DP-ADCOL with the non-private version, as shown in Table 16. We can observe that DP-ADCOL achieves an accuracy very close to the non-private version with a budget of 5. There are two reasons that DP works well in ADCOL: 1) DP-SGD works well with the discriminator as it is a shallow model (Tramer & Boneh, 2021), a simple 2-layer MLP. If we increase the number of layers of the discriminator, the accuracy of DP-ADCOL decreases, as shown in the last row of Table 16. 2) The discriminator needs a small number of steps to update, so the accumulated privacy loss is small. For FedAvg, it is not easy to apply record-level DP. Existing studies (Geyer et al., 2017; McMahan et al., 2017) clip the local model updates to provide party-level DP, which is stricter than record-level DP. We conduct simple experiments and find that the accuracy of DP-FedBN is low with party-level DP: about 66.3% given a budget of 5. Honest-but-curious Server In this setting, the messages sent from the parties to the server should also be protected.
We apply local differential privacy with sampling in ADCOL to provide rigorous privacy guarantees. Specifically, in each round, we sample and normalize the representations and add noise drawn from Lap(0, 1/ϵ) before sending them to the server, where Lap(0, 1/ϵ) is the Laplace distribution with mean 0 and scale 1/ϵ. Then, in each round, the transferred representations satisfy ϵ-differential privacy (Lyu et al., 2020). Due to parallel composition, the privacy loss does not accumulate across rounds. To achieve the same level of privacy guarantee as DP-ADCOL for FedBN, we implement DP-FedBN by clipping and adding Laplace noise to the communicated model updates (Kairouz et al., 2019). For DP-FedBN, we try two methods: 1) without party sampling, where the privacy loss accumulates across rounds; and 2) party sampling without replacement, where we set the sampling fraction per round to 0.2 and the privacy loss does not accumulate within every five rounds. The results are shown in Table 17. We can observe that the accuracy of DP-ADCOL is very close to the non-private version with a modest privacy budget (i.e., 10). Moreover, DP-ADCOL achieves a better accuracy than DP-FedBN at the same privacy level.

B.13 NUMBER OF LOCAL EPOCHS

We vary the number of local epochs E ∈ {1, 2, 5, 10, 20} and report the results in Figure 7. We run all approaches for 100 rounds. If the number of local epochs is too small, the local update is small in each round and the convergence speed is slow. Thus, the accuracy of all approaches is relatively low after running for 100 rounds with a small number of local epochs. ADCOL still consistently outperforms the other approaches across different numbers of local epochs.
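The local DP mechanism of Appendix B.12 above (normalize the representation, then add per-coordinate Laplace noise with scale 1/ϵ) can be sketched as follows; this is a simplified illustration of ours, not the paper's implementation:

```python
import numpy as np

def privatize_representation(z, epsilon, rng):
    """L2-normalize a representation and add Lap(0, 1/epsilon) noise per
    coordinate before it is sent to the server (sketch of Appendix B.12)."""
    z = z / (np.linalg.norm(z) + 1e-12)
    return z + rng.laplace(loc=0.0, scale=1.0 / epsilon, size=z.shape)
```

A larger ϵ means less noise (weaker privacy); because only the sampled, noised representations leave the party each round, parallel composition keeps the per-round budget from accumulating.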

B.14 PARTY SAMPLING WITH A FIXED NUMBER OF SELECTED PARTIES

One practical concern is that the output dimension of the discriminator is fixed to the number of participating parties, which may not handle the case where the number of parties is extremely large or changes over time. To address this concern, we propose to apply party sampling with a fixed number of selected parties each round. The output dimension of the discriminator is the same as the number of parties participating in each round. The selected parties first update their models locally without the regularization term we introduced. Next, the parties send their representations to the server, which updates the discriminator and sends it back to the parties. Then, the same parties update their models again with the regularization term using the discriminator. After that, we move on to the next round and sample new parties again. We have conducted experiments on Digits with 50 parties. In each round, we randomly drop 5 parties for one round to simulate the scenario where the number of parties changes over time (i.e., the selected 5 parties leave FL for the current round and join FL again in the next round). After dropping 5 parties, we randomly select 5 parties to participate in FL in the current round. The output dimension of the discriminator is set to 5. We run ADCOL, FedAvg, and FedBN for 100 rounds and the results are shown in Table 18. We can observe that FedAvg and FedBN achieve poor accuracy in such a scenario. ADCOL significantly outperforms these two approaches.

B.15 CONVERGENCE OF THE DISCRIMINATOR

We empirically study whether the discriminator converges to the optimum. Besides using an MLP as the discriminator in our experiments, to compare convex and non-convex loss functions, we also try a linear function as the discriminator by removing the non-linear activations in the MLP. The training curves are shown in Figure 8 and the accuracy of using a linear function is shown in

C DISCUSSION

We can consider each party as a domain, and studies on multi-domain learning are also potentially applicable. We have compared ADCOL with the most related FL study in the multi-domain setting (i.e., PartialFed (Sun et al., 2021)). Besides PartialFed, we also discuss the relation between ADCOL and studies on domain adaptation and domain generalization below. Relation to Domain Adaptation Domain adaptation aims to train a model on a source domain (or multiple source domains) that achieves good accuracy on a target domain. A classic and popular approach in domain adaptation is adversarial training, i.e., training a discriminator to encourage domain-invariant features (Ganin et al., 2016; Peng et al., 2019b). Peng et al. (2019b) proposed FADA, which extends domain adaptation to the federated setting. One connection between our approach and domain adaptation is that each party can be viewed as a source domain, and the target domain is the unknown oracle optimum (like domain generalization, introduced in Section 3.2 of the main paper). Then, our approach extracts domain-invariant features from multiple source domains, which are used to regularize the training. To highlight the differences between our approach and domain adaptation techniques, we compare our approach with the federated domain adaptation study FADA (Peng et al., 2019b) and summarize the main differences: (1) Setting: FADA aims to train a model on multiple source domains that achieves good accuracy on a target domain. Our study aims to train a personalized model for each party that achieves good accuracy on its local data. (2) Discriminator: FADA uses multiple discriminators, where each discriminator performs binary classification for one source-target domain pair. Our study uses a single discriminator for multi-class classification among all parties. Moreover, we have provided a theoretical analysis of the convergence properties.
(3) Framework: FADA uses adversarial training to generate domain-invariant and domain-specific features. Our study uses adversarial training to regularize the local training in federated learning. Intuitively, we cannot directly compare ADCOL and FADA in the experiments since the settings are different: in our experiments, there is no single target domain for FADA to test on. One method is to treat each party as a target domain and apply FADA N times, where N is the number of parties. However, the computation and communication overhead is significant. Moreover, such an approach does not utilize the labels of the target dataset. We have compared ADCOL and FADA using the above method, and the results are shown in Table 21. ADCOL significantly outperforms FADA. Moreover, the test accuracy of FADA is even lower than local training in many cases since it does not exploit the labels of the target dataset. Relation to Domain Generalization While the motivation of ADCOL is intuitive, it can also be explained from the perspective of domain generalization (Muandet et al., 2013). In domain generalization, the goal is to extract knowledge from multiple source domains and apply it to an unseen target domain. Considering each party as a source domain and the target domain as the oracle optimal representation space, we aim to extract the domain-invariant representation distribution and use it to regularize the local training. Existing domain generalization techniques are designed for the centralized setting and usually require access to the raw data of multiple source domains (Li et al., 2018a; b; Liu et al., 2018). There is one work (Liu et al., 2021) that studies domain generalization in the federated setting; it is designed for medical image segmentation via episodic learning in the frequency space. In this paper, we aim to design a general collaborative learning framework based on adversarial learning.
Limitations ADCOL is a collaborative learning method for non-IID features. As shown in Appendix B.8, the performance of ADCOL is poor compared with federated learning approaches in the non-IID label setting. Note that ADCOL aims to learn a common representation distribution. Intuitively, the task-specific representations of images from different classes should be very different and can be easily classified by a small MLP. Thus, if the label distribution varies across parties, the representation distribution naturally also varies a lot across parties. The current objective of ADCOL does not fit the non-IID label setting. As shown in Section 3.5 of the main paper, the communication size of ADCOL is related to the number of examples. If the number of examples is very large and the model is small, the communication cost of ADCOL will be larger than that of other federated learning approaches. However, local training can usually achieve satisfactory performance if the dataset size is very large; in such cases, besides ADCOL, existing federated learning approaches may not help either.

Insights and Future Work

The key insights from ADCOL are (1) a GAN-style training scheme and (2) regularization from the view of the representation distribution. Since ADCOL imposes no requirement on the vanilla local training algorithm, it can also be extended to self-supervised federated learning, where the cross-entropy loss is replaced by a self-supervised loss (e.g., the contrastive loss (Chen et al., 2020; Chen & He, 2021)). Moreover, while ADCOL currently only targets non-IID feature settings, the adversarial collaborative training scheme can potentially be applied to other data settings by modifying the objectives of local training and server training. There are many research opportunities based on the findings of this paper.



Note that G is a part of F and the two losses are not independent of each other. For simplicity, we only analyze the KL divergence loss to study its effect.



Figure 1: The ADCOL framework.

Algorithm 1: The ADCOL algorithm. Input: number of communication rounds T, number of parties N, number of local epochs E, learning rate η, hyper-parameter µ. Output: the local models F_i (i ∈ [N]). In each round t = 1, ..., T, the server sends the discriminator D to every party i ∈ [N] in parallel; PartyLocalTraining(i, D) runs E local epochs over each batch b = {x, y} of D_i; the server then updates the discriminator.
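To make the server-side step (Lines 6-9 of Algorithm 1) concrete, here is a hedged sketch that fits a simple linear-softmax discriminator to classify representations by party ID with the cross-entropy loss and full-batch gradient descent. This is an illustrative stand-in, not the paper's implementation (which uses an MLP trained with mini-batch SGD):

```python
import numpy as np

def softmax(logits):
    # Numerically stable row-wise softmax
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def train_discriminator(reps, party_ids, num_parties, lr=0.5, steps=200):
    """Fit a linear-softmax discriminator that classifies representations
    by party ID via cross-entropy (full-batch gradient descent)."""
    dim = reps.shape[1]
    W = np.zeros((dim, num_parties))
    b = np.zeros(num_parties)
    onehot = np.eye(num_parties)[party_ids]
    for _ in range(steps):
        probs = softmax(reps @ W + b)
        grad = probs - onehot          # d(cross-entropy)/d(logits)
        W -= lr * reps.T @ grad / len(reps)
        b -= lr * grad.mean(axis=0)
    return W, b
```

On well-separated representation clusters this reaches near-perfect party classification; in ADCOL, the parties then update their encoders so that precisely this classification becomes hard, pushing their representation distributions together.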

Figure 2: Effect of different factors. We run three trials and report the mean accuracy across parties and its standard deviation.

Figure 3: The label distributions of each task. The value in each cell of row i and column j represents the percentage of samples with class j in Party i.

Figure 4: The feature distributions of each task.

Figure 5: The training curves of different approaches on Digit.

Figure 6: The training curves with party sampling (sample fraction = 0.4). We report the mean test accuracy across all parties.

Figure 8: The training curves of different discriminators.

The communication round and size of each approach to achieve the same target performance as the minimum converged accuracy among FedAvg, FedBN, PartialFed, FedProx, and ADCOL, as shown in Table 1 (i.e., 94.1% on MNIST). We use a slash cell to indicate that the approach (i.e., Per-FedAvg and FedRep) cannot reach the target performance within 100 rounds/30 GB. The speedup is computed by dividing the communication size of FedAvg by the communication size of ADCOL.

The statistics of all studied datasets.

The comparison of top-1 test accuracy among different approaches on Caltech-10.

The comparison of top-1 test accuracy among different approaches on DomainNet.

The communication round and communication cost of each approach to achieve the same target performance on Caltech-10.

The communication round and communication cost of each approach to achieve the same target performance on DomainNet.

The performance of different approaches varying the sampling rate. We run all approaches for 200 rounds and report the final mean accuracy and standard deviation with three runs.

ADCOL with different discriminator architectures.
ResNet-50: 1% ± 0.5%, 55.6% ± 0.8%, 96.0% ± 0.3%, 73.6% ± 0.5%, 76.5% ± 0.5%, 79.4% ± 0.4%
MLP: 94.7% ± 0.6%, 58.2% ± 1.0%, 95.4% ± 0.2%, 76.0% ± 0.3%, 76.7% ± 0.8%, 80.2% ± 0.5%

ADCOL with different number of discriminators.

The test accuracy of ADCOL with different representation dimensions.

The test accuracy of different approaches on non-IID label settings.

The total training time of running all approaches for 100 rounds.

The comparison between sharing the predictor layers and not sharing the predictor layers.

The privacy-accuracy tradeoff of DP-ADCOL in the trusted server setting. It is promising to apply DP in ADCOL thanks to the representation-sharing scheme.

Comparison between DP-ADCOL and DP-FedBN under the same privacy level.

The mean test accuracy and standard deviation across parties when applying party sampling with a fixed number of selected parties each round. The output dimension of the discriminator in ADCOL is set to the number of selected parties each round.

We can observe that both the MLP and the linear function achieve the optimum (i.e., zero training loss) with SGD. Moreover, ADCOL with a linear function as the discriminator still achieves better performance than the other baselines from Table 1 of the main paper.

B.16 STUDY ON THE LOCAL MODEL ARCHITECTURE

Instead of using ResNet-50, we try a different local model to investigate the robustness of our approach. We use the same model as the experiments in FedBN for the Digits task, which is a six-layer convolutional neural network. We use the input of the last fully-connected layer as the representation. The results are shown in Table 20. From the table, we can observe that ADCOL outperforms FedBN, which further verifies the effectiveness of ADCOL.

ADCOL with MLP or linear function as the discriminator.

Comparison between ADCOL and FedBN using a CNN as local model.


To minimize Equation 13, we need to maximize $\log\Big(\frac{P_{G_i}(G_k(x))}{\sum_{j=1}^N P_{G_j}(G_k(x))}\Big)$. Note that $\sum_{j=1}^N D^*_j(G_k(x)) = 1$. Similar to the proof of Theorem 4.1, Equation 13 achieves its minimum when $P_{G_1} = P_{G_2} = \cdots = P_{G_N}$, which concludes the proof.

Theorem 4.3. Suppose $P^*_G$ is the optimal solution shown in Theorem 4.2. If $G_i$ ($\forall i \in [1, N]$) and $D$ have enough capacity, and $P_{G_i}$ is updated to minimize the local objective (i.e., Equation 3 of the main paper) given the optimal discriminator $D^*$ from Equation 7, then $P_{G_i}$ converges to $P^*_G$.

Proof. In Equation 13, consider $W(G_k) = U(P_{G_i})$ as a function of $P_{G_i}$. Since the second derivative of $U$ with respect to $P_{G_i}$ is non-negative, $U(P_{G_i})$ is convex in $P_{G_i}$. Therefore, with sufficiently small updates of $P_{G_i}$, $P_{G_i}$ converges to $P^*_G$, concluding the proof.

B.1 ADDITIONAL EXPERIMENTAL DETAILS

In each experiment, like FedBN (Li et al., 2021c), to remove the effect of quantity skew, we truncate the size of all datasets to the smallest one among them with random sampling. For Digits, we resize all images to 28 × 28 × 3 and normalize them with mean 0.5 and standard deviation 0.5 for each channel. For Office-Caltech-10, we resize all images to 64 × 64 × 3 with random horizontal flips and random rotations. For DomainNet, we resize all images to 64 × 64 × 3 with random horizontal flips and random rotations. Like FedBN, we take Digits as the benchmark task for most studies. The statistics of all the datasets are shown in Table 3. To quantitatively demonstrate the feature imbalance, we use FID (Heusel et al., 2017) to measure the difference between the feature distributions of different parties. Specifically, it measures the Frechet distance between the representation distributions of different datasets, where the representations are generated by an Inception v3 model pretrained on ImageNet. FID is 0 when two datasets are the same. For each task, we compute the FID between each subset and the whole dataset formed by merging all subsets. With the FID values of each subset, we report the mean value and the standard deviation. From Table 3, we can observe that there indeed exists feature imbalance in each task. We also show the label distributions in Figure 3. The portion of samples of each class is close to 0.1; the label distribution is balanced among the parties. We train a ResNet-50 on all datasets (i.e., parties) of a task and extract the feature distributions of each dataset. Then, we use t-SNE to visualize the representations, as shown in Figure 4. We can observe that the feature distribution of each party is different. There are two major differences between our experimental setup and the setup in FedBN. (1) The model architecture is different. Our paper adopts ResNet-50 for all datasets, while FedBN uses a simple CNN for Digits and AlexNet for Office and DomainNet.
We adopt ResNet since we need to

