DEPTHFL: DEPTHWISE FEDERATED LEARNING FOR HETEROGENEOUS CLIENTS

Abstract

Federated learning trains a global model without collecting private local data from clients. Since clients must instead repeatedly upload locally-updated weights or gradients, they need sufficient computation and communication resources to participate in learning, but in reality their resources are heterogeneous. To enable resource-constrained clients to train smaller local models, width scaling techniques have been used, which prune the channels of a global model. Unfortunately, width scaling suffers from parameter mismatches of channels when aggregating them, leading to a lower accuracy than simply excluding resource-constrained clients from training. This paper proposes a new approach based on depth scaling, called DepthFL, to solve this issue. DepthFL defines local models of different depths by pruning the deepest layers off the global model, and allocates them to clients depending on their resources. Since many clients do not have enough resources to train deep local models, deep layers would be only partially trained with insufficient data, unlike shallow layers that are fully trained. DepthFL alleviates this problem by mutual self-distillation of knowledge among the classifiers of various depths within a local model. Our experiments show that depth-scaled local models build a better global model than width-scaled ones, and that self-distillation is highly effective in training data-insufficient deep layers.

1. INTRODUCTION

Federated learning is a type of distributed learning. It trains a shared global model by aggregating locally-updated model parameters without direct access to the data held by clients. It is particularly suitable for training a model with on-device private data, such as next-word prediction or on-device item ranking (Bonawitz et al., 2019). Generally, federated learning requires client devices to have enough computing power to train a deep model as well as enough communication resources to exchange the model parameters with the server. However, the computation and communication capability of each client is quite diverse, often changing dynamically depending on its current load, which can make the clients with the smallest resources become a bottleneck for federated learning. To handle this issue, it would be appropriate for clients to have different-sized local models depending on their available resources. However, it is unclear how we can create local models of different sizes without affecting the convergence of the global model and its performance.

There are various methods to prune a single global model to create heterogeneous local models, such as HeteroFL (Diao et al., 2021), FjORD (Horváth et al., 2021), and Split-Mix (Hong et al., 2022). They create a local model as a subset of the global model by pruning channels, that is, width-based scaling. HeteroFL was a cornerstone work that could make different local models by dividing a global model based on width, yet still produce a global model successfully. However, we observed some issues with width scaling. We evaluated HeteroFL against exclusive federated learning, which simply excludes from training those clients that do not have enough resources to train the given global model (see Section 4.2). The result shows that the global model of HeteroFL achieves a tangibly lower accuracy than the models of exclusive learning, due to parameter mismatch of channels when they are aggregated.
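The parameter-mismatch problem of width scaling can be illustrated with a minimal sketch in NumPy. The setup and names below are hypothetical, not HeteroFL's actual implementation: each client trains a top-left sub-matrix of a global weight, and the server averages each entry only over the clients whose pruned sub-model contains it, so inner and outer channels end up trained by different client populations.

```python
import numpy as np

# Hypothetical sketch of width-scaled aggregation (not HeteroFL's actual code).
# Each client trains a top-left slice of the global weight matrix; on aggregation,
# every entry is averaged only over the clients whose sub-model covers it.

def aggregate_width_scaled(global_shape, client_updates):
    """client_updates: list of 2-D arrays, each a top-left slice of the global weight."""
    total = np.zeros(global_shape)
    count = np.zeros(global_shape)
    for w in client_updates:
        r, c = w.shape
        total[:r, :c] += w
        count[:r, :c] += 1
    # Entries covered by no client stay zero here (in practice they keep old values).
    return np.divide(total, count, out=np.zeros(global_shape), where=count > 0)

# Two clients: one trains the full 4x4 weight, one only a 2x2 sub-matrix.
full = np.ones((4, 4))
small = 3 * np.ones((2, 2))
agg = aggregate_width_scaled((4, 4), [full, small])
# The inner 2x2 block averages both clients' updates, while the outer channels
# reflect only the large client: the mismatch the paper attributes to width scaling.
```

Note how the same logical channel receives statistics from different subsets of clients depending on its position, which is the mismatch that depth scaling avoids.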
In this paper, we propose a different approach to making local models, called DepthFL. DepthFL divides a global model based on depth rather than width. We construct a global model that has several classifiers of different depths. Then, we prune the highest-level layers of the global model to create local models with different depths, and thus with different numbers of classifiers. We found that this depth-based scaling performs better than exclusive learning in most cases, unlike HeteroFL, since training a local model directly supervises its sub-classifiers as well as its output classifier, obviating parameter mismatch of sub-classifiers when aggregated. This means that depth scaling allows resource-constrained clients to participate in and contribute to learning, albeit with a small overhead of separate classifiers. We analyzed the root cause of this difference between depth scaling and width scaling.

There is one issue in depth scaling, though. Only a few clients' local models include deep classifiers, while many clients have only shallow classifiers in their local model. This would leave deep classifiers partially trained with only a limited amount of data, so their accuracy might be inferior to that of fully-trained shallow classifiers. That is, resource-constrained clients cannot train deep classifiers, so deep classifiers cannot become general enough to cover the unseen data of those clients. To mitigate this issue, we make the deep classifiers of a local model learn from its shallow classifiers by knowledge distillation. This is similar to self-distillation (Zhang et al., 2019), except that the direction of distillation is reversed. In fact, we make the classifiers collaborate with each other as in deep mutual learning (Zhang et al., 2018): a client not only trains the classifiers in its local model using its data, but also has them distill each other's knowledge at the same time.
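The mutual-distillation idea can be sketched as a per-client objective: every classifier in the local model receives a supervised cross-entropy loss, plus a KL term that distills from each of the other classifiers, as in deep mutual learning. This is a hypothetical loss form inferred from the description, not the paper's exact objective; all names are illustrative.

```python
import numpy as np

# Sketch of mutual self-distillation among classifiers of different depths
# (hypothetical loss form based on the description, not the paper's exact one).

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def kl(p, q, eps=1e-12):
    """Per-example KL divergence KL(p || q) over the class axis."""
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

def local_objective(logits_per_classifier, labels):
    """logits_per_classifier: list of (batch, classes) arrays, one per classifier depth."""
    probs = [softmax(z) for z in logits_per_classifier]
    n = len(probs)
    loss = 0.0
    for i in range(n):
        # Supervised cross-entropy for every classifier, shallow or deep.
        loss += -np.mean(np.log(probs[i][np.arange(len(labels)), labels] + 1e-12))
        # Each classifier also distills from every other classifier's prediction.
        for j in range(n):
            if i != j:
                loss += np.mean(kl(probs[j], probs[i])) / (n - 1)
    return loss

# Toy example: two classifiers, two examples, three classes.
logits = [np.array([[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]]),
          np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])]
labels = np.array([0, 1])
loss = local_objective(logits, labels)  # positive scalar
```

A local model of depth k would simply call this with its k classifiers' logits, so shallow classifiers trained on many clients' data can pull the rarely trained deep classifiers along.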
Our evaluation shows that deep classifiers can learn from shallow classifiers trained with otherwise unseen data, and that both kinds of classifiers actually help each other to improve the overall performance of the global model. We also analyzed the fundamental reason for the effectiveness of knowledge distillation in DepthFL, using an evaluation in a general teacher-student environment. Recently, InclusiveFL (Liu et al., 2022) proposed a kind of depth-scaled method with a similar intuition to ours, yet with less elaboration. We show that depth scaling alone in InclusiveFL, without the companion objective of sub-classifiers, cannot effectively solve parameter mismatches, and that the accuracy of its deep classifiers is much lower due to the absence of self-distillation. We employed the latest federated learning algorithm FedDyn (Acar et al., 2021) as the optimizer for the proposed method. Although DepthFL is a framework that focuses on resource heterogeneity of clients, we verify whether the FedDyn optimizer, which focuses on data heterogeneity, is seamlessly applicable to DepthFL. Our experiments show that DepthFL works well under both resource heterogeneity and data heterogeneity.

Contributions To summarize, our contributions are threefold:
• We present a new depth scaling method to create heterogeneous local models with sub-classifiers, which are directly supervised during training, thus avoiding parameter mismatches when aggregated.
• We show that knowledge distillation among the classifiers in a local model can effectively train deep classifiers that can see only a limited amount of data.
• We perform a comprehensive evaluation of the difference between depth scaling and width scaling using exclusive learning, and of the effectiveness of knowledge distillation in DepthFL.
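The depth-based local model construction described in the introduction can be sketched as follows. The structure and names here are hypothetical: a local model keeps a prefix of the global model's backbone blocks together with the sub-classifiers attached to those blocks, so every parameter a client holds is directly supervised.

```python
# Hypothetical sketch of depth scaling: a local model is a prefix of the global
# model's blocks, plus the classifiers attached to those blocks (names illustrative).

GLOBAL_BLOCKS = ["block1", "block2", "block3", "block4"]       # backbone layers
CLASSIFIERS = {1: "clf1", 2: "clf2", 3: "clf3", 4: "clf4"}     # one per depth

def make_local_model(depth):
    """Prune the deepest layers: keep blocks 1..depth and their classifiers."""
    blocks = GLOBAL_BLOCKS[:depth]
    clfs = [CLASSIFIERS[d] for d in range(1, depth + 1)]
    return blocks, clfs

# A resource-constrained client receives a shallow prefix with fewer classifiers;
# a well-resourced client receives the full model with all four classifiers.
blocks, clfs = make_local_model(2)
```

Because every block a client trains sits below at least one classifier it also trains, aggregation averages parameters that were all directly supervised, unlike the partially covered channels of width scaling.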

2. RELATED WORK

2.1 FEDERATED LEARNING

Based on FedAvg (McMahan et al., 2017), the standard federated learning method, there have been many studies addressing various problems of federated learning (Li et al., 2020a; Kairouz et al., 2019). One main line of research deals with efficient learning algorithms considering the non-IID distribution of data among clients (Karimireddy et al., 2020; Li et al., 2021b; Wang et al., 2020; Acar et al., 2021; Li et al., 2020b). For example, FedDyn (Acar et al., 2021) performs exact minimization by making the local objective align with the global objective, which enables fast and stable convergence. There are also studies dealing with heterogeneous, resource-constrained clients in federated learning. One popular method to create heterogeneous local models is to prune the channels of a global model

