DEPTHFL: DEPTHWISE FEDERATED LEARNING FOR HETEROGENEOUS CLIENTS

Abstract

Federated learning trains a global model without collecting private local data from clients. Because clients must instead repeatedly upload locally updated weights or gradients, they need sufficient computation and communication resources to participate in learning, but in reality their resources are heterogeneous. To enable resource-constrained clients to train smaller local models, width-scaling techniques have been used, which prune the channels of a global model. Unfortunately, width scaling suffers from parameter mismatches between channels when they are aggregated, leading to a lower accuracy than simply excluding resource-constrained clients from training. This paper proposes a new approach based on depth scaling, called DepthFL, to solve this issue. DepthFL defines local models of different depths by pruning the deepest layers off the global model, and allocates them to clients depending on their resources. Since many clients do not have enough resources to train deep local models, deep layers would be partially trained with insufficient data, unlike shallow layers that are fully trained. DepthFL alleviates this problem by mutual self-distillation of knowledge among the classifiers of various depths within a local model. Our experiments show that depth-scaled local models build a better global model than width-scaled ones, and that self-distillation is highly effective in training data-insufficient deep layers.

1. INTRODUCTION

Federated learning is a type of distributed learning. It trains a shared global model by aggregating locally updated model parameters without direct access to the data held by clients. It is particularly suitable for training a model with on-device private data, such as next-word prediction or on-device item ranking (Bonawitz et al., 2019). Generally, federated learning requires client devices to have enough computing power to train a deep model, as well as enough communication resources to exchange the model parameters with the server. However, the computation and communication capabilities of clients are quite diverse, often changing dynamically depending on their current loads, which can make the clients with the smallest resources a bottleneck for federated learning.

To handle this issue, it would be appropriate for clients to have different-sized local models depending on their available resources. However, it is unclear how we can create local models of different sizes without affecting the convergence of the global model and its performance. There are various methods that prune a single global model to create heterogeneous local models, such as HeteroFL (Diao et al., 2021), FjORD (Horváth et al., 2021), and Split-Mix (Hong et al., 2022). They create a local model as a subset of the global model by pruning channels, that is, width-based scaling. HeteroFL was a cornerstone work showing that a global model can be divided by width into different local models while still being aggregated into a global model successfully. However, we observed several issues with width scaling. We evaluated HeteroFL against exclusive federated learning, which simply excludes from training those clients that do not have enough resources to train a given global model (see Section 4.2). The result shows that the global model of HeteroFL achieves a tangibly lower accuracy than the models of exclusive learning, due to parameter mismatch of channels when they are aggregated.
In this paper, we propose a different approach to making local models, called DepthFL. DepthFL divides a global model based on depth rather than width. We construct a global model that has several classifiers of different depths. Then, we prune the highest-level layers of the global model to create local models with different depths, and thus with different numbers of classifiers. We found that this depth-based scaling outperforms exclusive learning in most cases, unlike HeteroFL, since training a local model directly supervises its sub-classifiers as well as its output classifier, obviating parameter mismatch of sub-classifiers when aggregated. This means that depth scaling allows resource-constrained clients to participate in and contribute to learning, at the small overhead of separate classifiers. We analyzed the root cause of this difference between depth scaling and width scaling.

There is one issue in depth scaling, though. Only a few clients have local models that include deep classifiers, while many clients have only shallow classifiers in their local models. This would leave deep classifiers partially trained with a limited amount of data, so their accuracy might be inferior to that of fully trained shallow classifiers. That is, resource-constrained clients cannot train deep classifiers, so deep classifiers may not be general enough to cover the unseen data of those clients. To mitigate this issue, we make the deep classifiers of a local model learn from its shallow classifiers by knowledge distillation. This is similar to self distillation (Zhang et al., 2019), except that the direction of distillation is reversed. In fact, we make the classifiers collaborate with each other as in deep mutual learning (Zhang et al., 2018). So, a client not only trains the classifiers in its local model using its data, but also has them distill each other's knowledge at the same time.
Our evaluation shows that deep classifiers can learn from shallow classifiers trained with otherwise unseen data, and that both kinds of classifiers actually help each other to improve the overall performance of the global model. We also analyzed the fundamental reason for the effectiveness of knowledge distillation in DepthFL, using an evaluation in a general teacher-student environment. Recently, InclusiveFL (Liu et al., 2022) proposed a kind of depth-scaled method with a similar intuition to ours, yet with less elaboration. We show that depth scaling alone in InclusiveFL, without the companion objective of sub-classifiers, cannot effectively solve parameter mismatches, and that the performance of its deep classifiers is much lower due to the absence of self distillation. We employed the latest federated learning algorithm, FedDyn (Acar et al., 2021), as the optimizer for the proposed method. Although DepthFL is a framework that focuses on resource heterogeneity of clients, we verify that the FedDyn optimizer, which focuses on data heterogeneity, is applicable to DepthFL seamlessly. Our experiments show that DepthFL works well under both resource heterogeneity and data heterogeneity.

Contributions To summarize, our contributions are threefold:
• We present a new depth scaling method to create heterogeneous local models with sub-classifiers, which are directly supervised during training and thus suffer no parameter mismatches when aggregated.
• We show that knowledge distillation among the classifiers in a local model can effectively train deep classifiers that can see only a limited amount of data.
• We perform a comprehensive evaluation of the difference between depth scaling and width scaling using exclusive learning, and of the effectiveness of knowledge distillation in DepthFL.

2. RELATED WORK

2.1 FEDERATED LEARNING

Based on FedAvg (McMahan et al., 2017), the standard federated learning method, there have been many studies addressing various problems of federated learning (Li et al., 2020a; Kairouz et al., 2019). One main research field deals with efficient learning algorithms that consider the non-IID distribution of data among clients (Karimireddy et al., 2020; Li et al., 2021b; Wang et al., 2020; Acar et al., 2021; Li et al., 2020b). For example, FedDyn (Acar et al., 2021) performs exact minimization by making the local objective align with the global objective, which enables fast and stable convergence. There are also studies on heterogeneous, resource-constrained clients in federated learning. One popular method to create heterogeneous local models is to prune the channels of a global model, i.e., width scaling (Diao et al., 2021; Li et al., 2021a; Niu et al., 2020; Horváth et al., 2021; Hong et al., 2022; Zhu et al., 2022). Building on general layer-pruning methods (Chen & Zhao, 2019; Sajjad et al., 2023) that create a shallow model, a recent work (Liu et al., 2022) prunes the layers of a global model to tackle system heterogeneity, but it only partially exploits the benefits of depth scaling and knowledge distillation, unlike DepthFL (see Appendix A.3). Another direction is utilizing knowledge distillation to reduce the burden on clients (He et al., 2020) or to aggregate heterogeneous local models (Zhang & Yuan, 2021; Lin et al., 2020). Separately, there is research on reducing the communication cost by compressing the local models (Rothchild et al., 2020; Haddadpour et al., 2021; Reisizadeh et al., 2020). There are also research directions on personalized federated learning to handle client heterogeneity (Li et al., 2021c; Zhang et al., 2021; Hanzely et al., 2020; T. Dinh et al., 2020).

2.2. KNOWLEDGE DISTILLATION

Knowledge distillation (Hinton et al., 2015) was introduced as a way of training a small student model with the help of a big teacher model. It was initially thought that the role of knowledge distillation is to transfer high-quality similarity information among categories, but research is still ongoing to understand its impact, such as its relationship with label smoothing regularization (Yuan et al., 2020; Tang et al., 2020). Many variants of knowledge distillation have been proposed (Huang et al., 2021; Park et al., 2019; Zhang et al., 2018; 2019). Deep mutual learning (Zhang et al., 2018) allows student models to mutually distill each other's knowledge without a cumbersome teacher model. Self distillation (Zhang et al., 2019) distills knowledge within a multi-exit network (Teerapittayanon et al., 2016), from its deep layers to its shallow layers, for higher performance or for faster and more accurate inference (Phuong & Lampert, 2019). DepthFL also employs self distillation within a local model, yet in the opposite direction, mainly to distill from fully trained shallow layers to partially trained deep layers.

(Lee et al., 2015). Each classifier shares the blocks that create feature maps, and an additional bottleneck layer is attached to each block so that the local models and the global model have several classifiers inside.

3. DEPTHFL METHOD

The local model parameters of a client with resource capability d_k are W_l^{d_k} = W_g[:d_k], and the model has d_k classifiers inside. The overall model structure can be found in Figure 1. DSN can provide integrated direct supervision to the sub-classifiers in addition to the output classifier. DSN exploits this for better performance, but DepthFL exploits it for consistent training of each sub-classifier across all local models that include it, so that the global model can fully exhibit its performance. This way of creating heterogeneous local models is likely to make deep classifiers learn from only part of the data, because only resource-rich clients can train local models with deep classifiers. Shallow classifiers, on the other hand, are likely to be fully trained, since many clients can train local models containing them. To mitigate the variance in performance across classifiers, it is natural to use their ensemble for inference. One interesting question is whether partially trained deep classifiers can really contribute to learning when many clients cannot afford a local model that contains them. In our experimental results in Table 4, deep classifiers perform similarly to or even worse than shallow classifiers unless they receive help, in the form of the self distillation described below.
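As a sketch of this construction (shapes and names are illustrative assumptions, not the paper's actual code), a depth-scaled local model is simply the prefix of the global model's blocks, W_g[:d_k], together with the bottleneck classifiers attached to those blocks:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical global model: 4 blocks, each paired with a bottleneck classifier.
global_model = [
    {"block": rng.normal(size=(8, 8)), "classifier": rng.normal(size=(8, 10))}
    for _ in range(4)
]

def make_local_model(global_model, d_k):
    """Depth scaling: keep the first d_k blocks (W_g[:d_k]) and the
    bottleneck classifiers attached to them."""
    return [{"block": layer["block"].copy(),
             "classifier": layer["classifier"].copy()}
            for layer in global_model[:d_k]]

# A client with resource capability d_k = 2 gets two blocks, hence
# two internal classifiers.
local = make_local_model(global_model, d_k=2)
```

Because every local model is a prefix of the same global parameters, aggregating a sub-classifier's weights across clients averages parameters that were trained under the same supervision, which is what avoids the mismatch problem of width scaling.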

3.2. SELF DISTILLATION

To mitigate the low performance of deep classifiers, DepthFL utilizes the concept of self distillation. The classifiers inside a local model are trained with the cross-entropy loss against the labels, as well as the KL loss against the outputs of the other classifiers, as depicted in Figure 1. Previously, self distillation made shallow classifiers imitate deep classifiers (Zhang et al., 2019; Phuong & Lampert, 2019). In contrast, DepthFL makes deep classifiers imitate shallow classifiers, or more precisely, learn collaboratively. The local objective function of client k is as follows:

L_k = Σ_{i=1}^{d_k} L_ce^i + (1 / (d_k − 1)) Σ_{i=1}^{d_k} Σ_{j=1, j≠i}^{d_k} D_KL(p_j ∥ p_i)    (1)

where L_ce^i is the cross-entropy loss of the i-th classifier, and p_i is the softmax output of the i-th classifier's logits. Although FjORD (Horváth et al., 2021) already utilized self distillation, its motivation there was to let the bigger-capacity supermodel teach the width-scaled submodel. In contrast, the purpose of self distillation in DepthFL is that depth-scaled submodels help train bigger-capacity supermodels. Also, unlike distillation between width-scaled models, which must run a forward pass through the teacher supermodel and the student submodel independently, distillation between depth-scaled models incurs no overhead except for the bottleneck layers, because the models share feature maps. We use FedDyn (Acar et al., 2021) instead of FedAvg (McMahan et al., 2017) as the default optimizer for the local objective in Eq. (1), to achieve fast convergence even under data heterogeneity among clients. When applying dynamic regularization, we replace the client's local objective with the heterogeneous local objective of Eq. (1). Also, the ∇L_k(θ_k^t) and h values required for dynamic regularization are used and updated in consideration of the heterogeneity of the local models. The penalized local objective function of client k is therefore as follows.
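As an illustration, the objective in Eq. (1) can be sketched in plain numpy (a simplified stand-in for the actual training code; the function and variable names are ours):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_div(p, q, eps=1e-12):
    # Batch-mean KL(p || q) between two categorical distributions.
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1).mean()

def local_objective(logits_list, labels):
    """Eq. (1): sum of cross-entropy losses over the d_k internal
    classifiers, plus the pairwise mutual-KL term scaled by 1/(d_k - 1)."""
    d_k = len(logits_list)
    probs = [softmax(z) for z in logits_list]
    ce = sum(-np.log(p[np.arange(len(labels)), labels] + 1e-12).mean()
             for p in probs)
    if d_k == 1:
        return ce  # a single-classifier local model has no distillation term
    mutual = sum(kl_div(probs[j], probs[i])
                 for i in range(d_k) for j in range(d_k) if j != i)
    return ce + mutual / (d_k - 1)
```

Note that each classifier serves as both teacher (its output appears as p_j for the others) and student, which is the mutual-learning aspect of the objective.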
L'_k(θ̃) = L_k(θ̃) − ⟨∇L_k(θ̃_k^t), θ̃⟩ + (α/2) ∥θ̃ − θ̃^t∥²    (2)

where θ̃ is the local model parameters, ∇L_k(θ̃_k^t) is the gradient of the local objective function in the previous round, and θ̃^t is the part of the current global model's parameters corresponding to the local model parameters θ̃. To handle the case where the complexity d_k of a client changes dynamically depending on its current loads, ∇L_k(θ_k^t) is stored in the same shape as the entire global model parameters, and only the subset corresponding to the current local model parameters is used for actual training. The value of ∇L_k(θ_k^t) is updated as in the following equation:

∇L_k(θ_k^{t+1})[:d_k] ← ∇L_k(θ_k^t)[:d_k] − α(θ̃_k^{t+1} − θ̃^t)    (3)

Algorithm 1: DepthFL
Initialization: θ^0, h^0 = 0, ∇L_k(θ_k^0) = 0
Server executes:
  for round t = 0, 1, ..., T−1 do
    P_t ← random clients
    θ^{t+1} ← 0;  h^{t+1} ← h^t
    for each client k ∈ P_t, in parallel do
      θ̃^t ← θ^t[:d_k]
      θ̃_k^{t+1} ← ClientUpdate(k, θ̃^t)
      h^{t+1}[:d_k] ← h^{t+1}[:d_k] − α (1/m)(θ̃_k^{t+1} − θ^t[:d_k])
      θ^{t+1}[:d_k] ← θ^{t+1}[:d_k] + θ̃_k^{t+1}
    end
    for each resource capability d_i do
      θ^{t+1}[d_i] ← (1/|P_t^{d_k ≥ d_i}|) θ^{t+1}[d_i] − (1/α) h^{t+1}[d_i]
    end
  end
ClientUpdate(k, θ̃^t):
  θ̃_k^{t+1} ← θ̃^t
  ∇L_k(θ̃_k^t) ← ∇L_k(θ_k^{latest updated})[:d_k]
  for local epoch e = 1, 2, ..., E do
    for each mini-batch b do
      θ̃_k^{t+1} ← θ̃_k^{t+1} − η ∇L'_k(θ̃_k^{t+1}; b)
    end
  end
  ∇L_k(θ_k^{t+1})[:d_k] ← ∇L_k(θ_k^t)[:d_k] − α(θ̃_k^{t+1} − θ̃^t)
  return θ̃_k^{t+1}

We also evaluate DepthFL without self-distillation and with FedAvg instead of FedDyn, which we call DepthFL(FedAvg), by comparing it with its corresponding exclusive learning. Finally, we evaluate HeteroFL by comparing it with its corresponding exclusive learning. For a fair comparison, exclusive learning is trained and tested in the same way as the corresponding scaled method. For example, when comparing DepthFL with its exclusive learning, the local objective functions of both include the self distillation and regularization terms. Also, both are tested with the ensemble inference of the internal classifiers.
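The client-side dynamic regularization of Eqs. (2)-(3) and the per-depth aggregation on the server can be sketched as follows, treating parameters as flat vectors and depth segments for simplicity (an illustrative format of ours, not the paper's implementation):

```python
import numpy as np

def penalized_loss(theta, base_loss, grad_prev, theta_global, alpha):
    """Eq. (2): L'_k(θ) = L_k(θ) - <∇L_k(θ_k^t), θ> + (α/2)·||θ - θ^t||²."""
    return (base_loss
            - float(grad_prev @ theta)
            + 0.5 * alpha * float(np.sum((theta - theta_global) ** 2)))

def update_grad_state(grad_state, d_k, theta_new, theta_global, alpha):
    """Eq. (3): the gradient state is kept at full global size; only the
    slice [:d_k] matching the client's current local model is updated."""
    grad_state = grad_state.copy()
    grad_state[:d_k] -= alpha * (theta_new - theta_global)
    return grad_state

def server_aggregate(client_updates, h, alpha, num_segments):
    """Per-depth aggregation sketch: segment i of the global parameters is
    averaged only over participating clients whose depth d_k covers it,
    then corrected by -h_i/α as in FedDyn. `client_updates` is a list of
    (d_k, segments) pairs -- an illustrative format."""
    new_global = []
    for i in range(num_segments):
        contribs = [segs[i] for d_k, segs in client_updates if d_k > i]
        new_global.append(np.mean(contribs, axis=0) - h[i] / alpha)
    return new_global
```

The key point the sketch captures is that each depth segment is averaged over a different denominator (the number of clients deep enough to have trained it), so shallow segments see many contributions while deep segments see few.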
It should be noted that we cannot compare the accuracy of DepthFL and HeteroFL directly, since their local models have different sizes; only comparing each with its corresponding exclusive learning is meaningful.

Table 1: (Number of parameters) / [# of MACs] of local models according to the division method.
Model    Method             a = W_l^1        b = W_l^2       c = W_l^3       d(W_g) = W_l^4
ConvNet  HeteroFL (Width)   99.0 K [4.11 M]  391 K [15.3 M]  877 K [33.6 M]

Table 2 shows the results comparing HeteroFL, DepthFL(FedAvg), and DepthFL to their corresponding exclusive learning. In the case of HeteroFL, exclusive learning with b as the global model (pruning half of the channels) and with only 75% of the clients participating shows a tangibly better accuracy than HeteroFL on CIFAR-100. This means that although clients with insufficient resources can participate in learning in the HeteroFL framework, heterogeneous local models appear to deteriorate the global model and produce a worse result. DepthFL(FedAvg) shows a better result than any exclusive learning for MNIST and CIFAR-100, but not for Tiny ImageNet. On the other hand, DepthFL performs better than any exclusive learning on all datasets. This result indicates that DepthFL can better train the global model using heterogeneous local models. One question is why HeteroFL shows a tangibly lower performance than exclusive learning, and why this is not the case for DepthFL(FedAvg). To answer these questions, we measured the performance of each global sub-model of HeteroFL separately, by running 1/4, 2/4, 3/4, and 4/4 of the channels of the global model.
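The ensemble inference used when testing DepthFL and its exclusive learning can be sketched as follows (a minimal numpy version of ours; the real evaluation operates on the model's classifier outputs):

```python
import numpy as np

def ensemble_predict(logits_list):
    """Ensemble inference over internal classifiers: average the softmax
    outputs of all classifiers and take the per-example argmax."""
    probs = []
    for z in logits_list:
        e = np.exp(z - z.max(axis=-1, keepdims=True))
        probs.append(e / e.sum(axis=-1, keepdims=True))
    return np.stack(probs).mean(axis=0).argmax(axis=-1)
```

Averaging probabilities rather than logits keeps each classifier's vote on an equal scale, so a confident shallow classifier can outweigh an uncertain deep one.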
We also measured the same for InclusiveFL (Liu et al., 2022) without momentum distillation and for DepthFL(FedAvg), i.e., the performance of the classifiers at 1/4, 2/4, 3/4, and 4/4 of the global model. Table 3 shows the results, compared directly to the exclusive learning result of Classifier 1/4. Table 3 shows that SHeteroFL differs from HeteroFL in that the accuracy of its global sub-models is similar to or higher than that of exclusive learning, as in DepthFL. This can also explain the better performance of DepthFL(FedAvg) over its exclusive learning. That is, when a client trains its 2/4 Classifier, it also trains the 1/4 Classifier with the companion objective, as explained in Section 3.1. InclusiveFL has no companion objective, showing a worse accuracy in

Knowledge Distillation

To analyze the impact of mutual self-distillation on the performance of the global model, we turn self distillation on and off and measure the accuracy of the global model. We experiment with both IID and non-IID data distributions, whose results are depicted in Table 4. Regardless of the data distribution, self distillation plays a key role in enhancing the accuracy of deep classifiers. That is, deep classifiers tend to perform worse than shallow classifiers when self distillation is off, yet they perform similarly or better in most cases when it is on. Table 4 also shows that self distillation improves the accuracy of shallow classifiers as well, which is encouraging because resource-constrained clients must do inference using shallow classifiers.

Maximum & Dynamic Complexity

We evaluate why self-distillation enhances the accuracy of deep classifiers. Our conjecture was that fully supervised shallow classifiers can train undersupervised deep classifiers that were partially trained with a limited amount of data. For this evaluation, we constructed a Maximum experimental environment where all clients can train the full global model (i.e., the d model in Table 1). We performed mutual self-distillation in the Maximum environment and measured the accuracy as in Table 5, which includes the previous IID result of Table 4 (marked Fixed) for comparison. We can see that the impact of self-distillation on deep classifiers in Maximum is tangibly smaller than in Fixed. This means that when resources are heterogeneous, self-distillation achieves additional performance gains by transmitting the domain knowledge of shallow classifiers to deep classifiers. We also constructed a Dynamic experimental environment where the resources of the clients change randomly in every round, yet the average amount of resources of the clients participating in each round is the same as in Fixed. So, unlike Fixed, deeper layers in Dynamic can learn from the data of every client as in Maximum, albeit less frequently. In Table 5, we can see that the impact of self-distillation on deep classifiers in Dynamic is similar to Maximum. This indicates that the lower accuracy of deep classifiers in Fixed is indeed due to the unseen data of resource-constrained clients, and that self-distillation can alleviate the problem.
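A minimal sketch of how per-round client depths could be drawn in the Dynamic environment (the uniform sampling scheme here is an assumption on our part, chosen so that on average 25% of clients land on each depth, matching the Fixed setting):

```python
import numpy as np

def sample_round_depths(num_selected, max_depth=4, seed=None):
    """Dynamic complexity sketch: redraw each participating client's
    depth d_k uniformly from {1, ..., max_depth} every round, so the
    expected resource mix per round matches the Fixed setting."""
    rng = np.random.default_rng(seed)
    return rng.integers(1, max_depth + 1, size=num_selected)
```

Over many rounds, every client eventually trains the deep layers at least occasionally, which is why deep classifiers in Dynamic can see the full data distribution.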

4.4. UNDERSTANDING SELF-DISTILLATION EFFECT OF DEPTHFL

This section attempts to understand the fundamental reason behind the effectiveness of self-distillation in DepthFL. For this, we evaluate knowledge distillation in a general teacher-student setup in a centralized learning environment, except that the student model is trained with insufficient data (like the partially trained deep layers of DepthFL). We first evaluate whether the existing analysis methods of knowledge distillation in (Tang et al., 2020), namely label smoothing (LS), gradient rescaling (KD-pt), and domain knowledge of class relationships (KD-sim), can explain the effectiveness of self-distillation in DepthFL (see A.6 for a detailed explanation). The Resnet18 model is used as the student, while the Resnet101 model, fully trained with all data, is used as the teacher. Additionally, we analyze the effect of knowledge distillation by a poor teacher (PKD), which is trained with only 25% of the data, as well as by a light teacher (LKD), which is trained with all data but is smaller than the student model (like the fully trained shallow layers of DepthFL). As the light teacher for LKD, we use the ×0.25 depth-scaled local model a = W_l^1 of Resnet18. We perform these experiments while varying the amount of data the student learns, as in Table 6. When the student model learns all the data (100%), most effects of knowledge distillation can be explained by the three existing effects. However, when the student model learns only part of it (50% or 25%), the effect is not fully explainable by the composition of those three effects, nor by PKD. On the other hand, LKD is quite effective, achieving high accuracy even at 50% and 25%. This means that even if the light teacher is smaller than the student, it can still transfer the class relationships for each input instance, which effectively helps the generalization of the data-insufficient student model.
This additional role appears to be the reason why self-distillation in DepthFL is effective, especially for data-insufficient deep layers.

4.5. ROBUSTNESS TEST

The performance of DepthFL would inevitably be different depending on the distribution of each client's resource capability. To evaluate the robustness of DepthFL, we changed the distribution of resource capabilities and measured the performance of the classifiers, as in Figure 2 . As expected, as the ratio of resource-constrained clients increases, the performance of deep classifiers gets lower, even seriously when self distillation is off. With self distillation on, however, even if the deepest classifier is trained only by 10% of the clients, the performance drop is small, due to the help of other classifiers. Since shallow classifiers will be used for fast inference while an ensemble model will be used for high performance, deep classifiers do not always have to be better than shallow ones. Appendix A.5.1 explains and evaluates whether deep classifiers are really needed for DepthFL.

5. CONCLUSION

Using FedAvg (McMahan et al., 2017) as the optimizer instead of FedDyn, we performed the same experiments as in Table 4 and Table 5, whose results are shown in Table 9 and Table 10, respectively. Even with FedAvg as the optimizer, the overall trend of the experimental results is almost the same, but the overall accuracy is lower. Also, the impact of self-distillation is reduced compared to when FedDyn is used.

As in the Dynamic case in Table 5, when the client resources change dynamically over time so that all clients can train deep layers at least occasionally, deep layers can achieve higher performance. In reality, the performance of classifiers of different depths would inevitably differ depending on various factors, such as the distribution of client resources, the nature of the learning task, the amount of data each client has, and the size of the global model and sub-models. We tested whether deep classifiers are really necessary by performing an experiment that excludes deep classifiers, whose result is in Figure 4. For example, DepthFL (exclude #4 Classifier) reduces the depth of the global model to 3/4, so that the 25% of clients whose resources could train Classifier 4/4 train only up to Classifier 3/4. We conducted the experiment with big and small global models (Resnet18 and ConvNet) in two learning environments (Fixed and Dynamic) on CIFAR-100. We can see that when the size of the global model is small relative to the task, and when the client resources change dynamically so that the deep classifiers can see more data, deep classifiers become more important for high performance. This paper presents a general framework that can flexibly deal with such diverse situations.

A.5.2 NON-IID DEGREE AND NUMBER OF CLIENTS

We performed ablation study for diverse non-IID degrees and for different number of clients, whose results are in Table 11 and Table 12 , respectively. They show a similar behavior as previously.

A.6 PARTIAL KNOWLEDGE DISTILLATION METHODS

The experiment in Section 4.4 was conducted as follows. In the KD-pt method, we synthesize the teacher distribution ρ^pt as ρ^pt_i = p_t if i = t, and (1 − p_t)/(K − 1) otherwise, where p_t is the prediction on the ground-truth class from the teacher's probability distribution. In the KD-sim method, we synthesize the teacher distribution ρ^sim as the softmax over the cosine similarities between the teacher's last logit layer's weights: ρ^sim = softmax(relu(ŵ_t Ŵ^T)^α / β), where Ŵ ∈ R^{K×d} is the normalized logit layer weights, ŵ_t is the t-th row of Ŵ corresponding to the ground truth, and α, β are hyper-parameters controlling the resolution of the cosine similarities.
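The two synthesized teacher distributions can be sketched directly from the formulas above (the α and β defaults here are arbitrary placeholders, not the paper's tuned values):

```python
import numpy as np

def rho_pt(p_t, t, K):
    """KD-pt: place the teacher's ground-truth probability p_t on class t
    and spread the remaining mass uniformly over the other K-1 classes."""
    rho = np.full(K, (1.0 - p_t) / (K - 1))
    rho[t] = p_t
    return rho

def rho_sim(W, t, alpha=2.0, beta=0.5):
    """KD-sim: softmax over rectified cosine similarities between the logit
    layer's weight row for the true class t and all class rows; alpha and
    beta control the resolution of the similarities."""
    W_hat = W / np.linalg.norm(W, axis=1, keepdims=True)
    scores = np.maximum(W_hat[t] @ W_hat.T, 0.0) ** alpha / beta
    e = np.exp(scores - scores.max())
    return e / e.sum()
```

Both functions return a valid probability distribution over the K classes, so either can be substituted for the real teacher output in a standard distillation loss.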

A.7 LEARNING CURVES

The learning curves of Table 2 for comparing HeteroFL, DepthFL(FedAvg), and DepthFL with their corresponding exclusive learning are in Figure 5 . The learning curves of Table 4 for the ablation study of knowledge distillation are in Figure 6 . The x-axis is the communication round, and the y-axis is the moving average of the test accuracy. 



Figure 1: Global model (ResNet) parameter W g has 4 kinds of local models as its subsets, distributed to m = 6 heterogeneous clients. Each local model has multiple classifiers at different depths.

Figure 2: Top-1 accuracy of four classifiers for four resource distribution ratios on CIFAR-100.

We presented DepthFL, a new federated learning framework that considers the resource heterogeneity of clients. DepthFL creates local models of different sizes by scaling the depth of the global model, and allocates them to clients depending on their available resources. During local training, a client trains several classifiers within its local model and, at the same time, has them distill their knowledge to each other. As a result, both deep classifiers trained with limited data and shallow classifiers trained by most clients help one another to build the global model, with no parameter mismatch. We also thoroughly evaluated depth scaling against width scaling, as well as self-distillation in DepthFL.

Figure 3: Learning curves for DepthFL and Split-Mix

Figure 4: Learning curves for DepthFL without deep classifiers

Figure 5: Comparative experiments with exclusive learning

Each client can contribute to the training of the global model by selecting a local model suitable for its resource amount, in a fixed or dynamic way. Since the highest-level layers are entirely pruned, an additional bottleneck layer is needed for a local model to become an independent classifier. This means that the global model should have a different classifier at each depth. The structure of a global model that satisfies these conditions can be found in Deeply-Supervised Nets (DSN).

Datasets and models We used the MNIST, CIFAR-100, and Tiny ImageNet datasets for the image classification task, and the WikiText-2 dataset for the masked language modeling task. A CNN composed of 4 convolution layers, Resnet18, and Resnet34 were used for MNIST, CIFAR-100, and Tiny ImageNet, respectively. A transformer model was used for WikiText-2. We create four local models with DepthFL, and four local models with HeteroFL by dividing the channels into four equal parts following its division method. Table 1 depicts the model size and number of MACs for the four local models of HeteroFL and DepthFL. For example, the smallest model a has only Classifier 1/4 in DepthFL, while having 1/4 of the channels in HeteroFL. For inference, DepthFL uses the ensemble of all internal classifiers, while HeteroFL uses the global model with all channels. The average communication overhead of depth-scaled clients would be lower than that of width-scaled clients, but the opposite holds for the average computation overhead. Also, depth scaling is less fine-grained than width scaling, since depth scaling is more dependent on the layer structure of the global model, and it may not be effective if the size of the bottleneck layer is too large. Despite these limitations, depth-scaled local models do not seriously affect the performance of the global model when aggregated, unlike width-scaled local models, as will be explained shortly.

Default Settings Unless otherwise stated, the same number of clients is allocated to each of the four different local models (i.e., 25% of the clients are allocated to each of the a, b, c, and d local models in Table 1). Also, the data is distributed in an IID manner, FedDyn is used as the optimizer, and a randomly sampled 10% of the 100 clients participate in each communication round.
When the data is distributed in a non-IID manner, as in FedMA (Wang et al., 2020), a Dirichlet distribution p_c ∼ Dir_K(β = 0.5) is used to allocate a p_{c,k} fraction of the data samples belonging to class c to client k.

4.2 COMPARISON WITH EXCLUSIVE LEARNING

We first evaluate whether it is beneficial for resource-constrained clients to participate in learning with depth-scaled local models even if they cannot accommodate a given global model. For this evaluation, we compare the accuracy of DepthFL with that of its exclusive learning, which excludes from learning those clients that do not have enough resources to run a given global model. In DepthFL, d is the global model, but we still allow the 75% of clients whose resources cannot accommodate d to participate in learning. In exclusive learning, however, if d is the global model, we allow only the 25% of clients who can run d to participate in learning. Similarly, if c is the global model, we allow the 50% of clients who can run c (including those who can run d but now run c, since c is the global model) to participate in learning.
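The Dirichlet-based non-IID partition described in the settings above can be sketched as follows (a minimal numpy version; the function name and interface are ours):

```python
import numpy as np

def dirichlet_partition(labels, num_clients, beta=0.5, seed=0):
    """For each class c, draw p_c ~ Dir(beta, ..., beta) over clients and
    give client k a p_{c,k} fraction of the class-c samples."""
    rng = np.random.default_rng(seed)
    parts = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        p = rng.dirichlet([beta] * num_clients)
        # Convert the fractions into cut points over the shuffled indices.
        cuts = (np.cumsum(p)[:-1] * len(idx)).astype(int)
        for k, chunk in enumerate(np.split(idx, cuts)):
            parts[k].extend(int(i) for i in chunk)
    return parts
```

Smaller β makes the partition more skewed (each client dominated by a few classes), while large β approaches an IID split.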



Table 2: Accuracy of the global model compared to exclusive learning. "100% (a) exclusive learning" means that the global model and every local model equal the a = W_l^1 model, and 100% of the clients participate in learning. Likewise, "25% (d) exclusive learning" means that the global model and every local model equal the d(W_g) = W_l^4 model, and only 25% of the clients participate in learning.

Table 3: Accuracy of global sub-models compared to exclusive learning on CIFAR-100.



Table 4: Accuracy of the global model with/without self distillation for both IID and non-IID data.

Table 5: Ablation study of self distillation according to the resource complexity d_k distribution of the clients. Fixed complexity means that a client's complexity d_k does not change from its initial value. Dynamic complexity means that a client's d_k changes randomly every round. Maximum is the situation where all clients have sufficient resources, so all d_k values are maximal. The table also shows that DepthFL with self-distillation works well for the transformer model on WikiText-2.

Top-1 accuracy of the student on CIFAR-100 with a few partial knowledge distillation methods.

Accuracy of the FedAvg global model with/without self-distillation for both IID and non-IID data.

Ablation study of self-distillation according to the resource complexity (d_k) distribution.

Table 4 showed that the performance of the deepest classifiers is similar to or slightly lower than that of shallow classifiers, even with self-distillation. So, one might question whether deep classifiers are really needed for resource-heterogeneous federated learning, since they are likely to have fewer clients to train them, although they allow more generalization of the model. That is, if most clients have insufficient resources, deeper layers might not help; otherwise, they might contribute to enhancing the performance of the global model. On the other hand, as in the Dynamic case in Table
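The Fixed, Dynamic, and Maximum complexity distributions compared in this ablation can be mimicked by a small sampler. This is a sketch under our own assumptions (names and the depth set {1, 2, 3, 4} for the four classifiers a–d are ours):

```python
import random

def sample_complexities(num_clients, depths=(1, 2, 3, 4), mode="fixed",
                        fixed=None, rng=random):
    """Per-round resource complexities d_k for each client.
    - 'fixed':   reuse the initial assignment every round
    - 'dynamic': resample each client's d_k uniformly every round
    - 'maximum': every client trains the deepest model"""
    if mode == "maximum":
        return [max(depths)] * num_clients
    if mode == "fixed":
        return list(fixed)
    return [rng.choice(depths) for _ in range(num_clients)]
```

Under Dynamic, every client eventually trains the deeper classifiers in some rounds, which is consistent with the observation that deep classifiers see more data in that setting.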

Ablation study: data distribution.

Ablation study: number of clients.

ACKNOWLEDGMENTS

This work was supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2021-0-00180, 25%) and (No. 2021-0-00136, 25%), by the ITRC (Information Technology Research Center) support program (IITP-2021-0-01835, 25%) supervised by the IITP, and by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. RS-2023-00208245, 25%).

A APPENDIX

A.1 LIMITATION OF SHETEROFL

As mentioned in Section 4.2, although SHeteroFL incurs the additional computation overhead of locally training all possible sub-models, the performance of its global model, which leverages all these sub-models, can improve. However, we want to check whether the additional training of sub-models of different widths is always helpful for the global model. We compared the performance of FedAvg, SHeteroFL, and DepthFL (FedAvg) in the Maximum case introduced in Section 4.3, where all clients have enough resources to train the global model. Table 7 shows the experimental result on CIFAR-100 using ResNet18 as the global model. SHeteroFL does not show better performance than FedAvg despite its additional training of width-scaled sub-models. On the other hand, DepthFL could improve the performance of the global model by additionally training depth-scaled sub-models, which is in line with the result in Lee et al. (2015). In other words, although SHeteroFL could alleviate the problem of HeteroFL, DepthFL is a better approach to heterogeneous federated learning since it can better enhance the performance of the global model with less computation cost.

The accuracy of DepthFL and Split-Mix in the Fixed and Dynamic cases is shown in Figure 3. In Fixed, the learning curves of DepthFL and Split-Mix are almost the same. Since the number of parameters of the local models of DepthFL is much smaller than that of Split-Mix, DepthFL is more efficient in terms of communication overhead. In Dynamic, DepthFL shows faster convergence and higher accuracy than Split-Mix. The reason is that Split-Mix lets a client train multiple base models alternately, so there is no big difference between Fixed and Dynamic, whereas in DepthFL the deep classifiers can train on more data in Dynamic, as explained in Section 4.3.

A.3 COMPARISON WITH INCLUSIVEFL (LIU ET AL., 2022)

Recently, InclusiveFL (Liu et al., 2022) proposed a kind of depth-scaling method for heterogeneous federated learning. However, there are two major differences from DepthFL. First, unlike DepthFL, InclusiveFL trains only the last classifier of the local model without training the sub-models. As we saw in Table 3, training sub-models during local training has a decisive impact on the performance of the global sub-models, and hence on the overall performance (its ensemble result is 63.79, lower than that of DepthFL).

B HYPERPARAMETERS

Most of the hyperparameters used in our experiments are depicted in Table 13. We set both the α and β hyperparameters to 0.3 for the KD-sim (Tang et al., 2020) experiments. In the masked language modeling task with the transformer model, we randomly selected 15% of the input tokens. Among the selected tokens, 80% were changed to [mask] tokens, 10% to random tokens, and 10% remained unchanged.
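The 15% / 80-10-10 masking scheme described above can be sketched as follows. This is a minimal sketch with our own assumptions (the `MASK_ID` value and function name are hypothetical; unselected positions get a label of -100 so they can be ignored by the loss):

```python
import random

MASK_ID = 103  # hypothetical [mask] token id

def mask_tokens(tokens, vocab_size, rng=random):
    """Select 15% of positions for the MLM loss; of those, replace
    80% with [mask], 10% with a random token, and leave 10% unchanged."""
    tokens = list(tokens)
    labels = [-100] * len(tokens)  # -100: position not selected for the loss
    for i in range(len(tokens)):
        if rng.random() < 0.15:
            labels[i] = tokens[i]  # remember the original token as the target
            r = rng.random()
            if r < 0.8:
                tokens[i] = MASK_ID
            elif r < 0.9:
                tokens[i] = rng.randrange(vocab_size)
            # else: keep the original token at this position
    return tokens, labels
```

Keeping 10% of the selected tokens unchanged forces the model to produce useful representations even for positions that look unmodified, rather than only for [mask] tokens.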

