DEPTHFL: DEPTHWISE FEDERATED LEARNING FOR HETEROGENEOUS CLIENTS

Abstract

Federated learning trains a global model without collecting clients' private local data. Since clients must instead repeatedly upload locally-updated weights or gradients, they need sufficient computation and communication resources to participate in learning, but in practice their resources are heterogeneous. To enable resource-constrained clients to train smaller local models, width scaling techniques have been used, which prune the channels of the global model. Unfortunately, width scaling suffers from parameter mismatches between channels during aggregation, leading to lower accuracy than simply excluding resource-constrained clients from training. This paper proposes a new approach based on depth scaling, called DepthFL, to solve this issue. DepthFL defines local models of different depths by pruning the deepest layers off the global model, and allocates them to clients depending on their resources. Since many clients lack the resources to train deep local models, the deep layers would be only partially trained with insufficient data, unlike the shallow layers that are fully trained. DepthFL alleviates this problem by mutual self-distillation of knowledge among the classifiers of various depths within a local model. Our experiments show that depth-scaled local models build a better global model than width-scaled ones, and that self-distillation is highly effective in training data-insufficient deep layers.
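The mutual self-distillation among classifiers of different depths can be sketched as a KL-based consistency loss over the classifiers' softmax outputs, where each classifier is distilled toward every other classifier within the same local model. This is a minimal NumPy illustration of the idea under our own assumptions (function names and the exact pairwise averaging are hypothetical; the paper's precise loss may differ):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_div(p, q, eps=1e-12):
    # KL(p || q), averaged over the batch dimension
    return np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1))

def mutual_distillation_loss(logits_list):
    """Average pairwise KL divergence among the classifiers' predictions:
    classifier j acts as teacher for classifier i, for every ordered pair.
    `logits_list` holds one (batch, classes) array per attached classifier."""
    probs = [softmax(l) for l in logits_list]
    n = len(probs)
    loss, pairs = 0.0, 0
    for i in range(n):
        for j in range(n):
            if i != j:
                loss += kl_div(probs[j], probs[i])
                pairs += 1
    return loss / pairs
```

When all classifiers agree, the loss is zero; disagreement between shallow and deep classifiers produces a positive penalty that transfers knowledge in both directions.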

1. INTRODUCTION

Federated learning is a type of distributed learning: it trains a shared global model by aggregating locally-updated model parameters without direct access to the data held by clients. It is particularly suitable for training a model with on-device private data, such as next-word prediction or on-device item ranking (Bonawitz et al., 2019). Generally, federated learning demands that client devices have enough computing power to train a deep model as well as enough communication resources to exchange the model parameters with the server. However, the computation and communication capability of each client is quite diverse, often changing dynamically with its current load, which can make the clients with the fewest resources a bottleneck for federated learning. To handle this issue, it would be appropriate for each client to train a different-sized local model depending on its available resources. However, it is unclear how to create local models of different sizes without affecting the convergence of the global model and its performance.

There are various methods that prune a single global model to create heterogeneous local models, such as HeteroFL (Diao et al., 2021), FjORD (Horváth et al., 2021), and Split-Mix (Hong et al., 2022). They create a local model as a subset of the global model by pruning channels, that is, width-based scaling. HeteroFL was a cornerstone work that could create different local models by dividing a global model based on width, while still producing a global model successfully. However, we observed some issues with width scaling. We evaluated HeteroFL compared to exclusive federated learning, which
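To make the width-scaling setup concrete, the following minimal NumPy sketch (our own illustration, not HeteroFL's actual implementation) shows how a small client trains only the top-left sub-matrix of a global weight matrix, and how the server averages each entry over only the clients that hold it. Entries outside the shared region receive updates from fewer clients, which is the root of the aggregation mismatch discussed above:

```python
import numpy as np

def width_slice(W, ratio):
    """A width-scaled local model keeps only the first `ratio` fraction of the
    input/output channels, i.e. the top-left sub-matrix of the global weights."""
    r, c = int(W.shape[0] * ratio), int(W.shape[1] * ratio)
    return W[:r, :c].copy()

def aggregate(global_shape, local_weights):
    """Average each entry over only the clients whose sub-model contains it."""
    total = np.zeros(global_shape)
    count = np.zeros(global_shape)
    for w in local_weights:
        r, c = w.shape
        total[:r, :c] += w
        count[:r, :c] += 1
    count[count == 0] = 1  # entries held by no client stay zero
    return total / count

W = np.ones((4, 4))                  # global weight matrix
small = width_slice(W, 0.5) + 1.0    # small client's update (2x2 sub-matrix)
large = W + 3.0                      # large client's update (full 4x4)
agg = aggregate(W.shape, [small, large])
# Top-left entries average both clients; the rest come from the large client alone.
```

Note that the overlapping channels are averaged across clients that trained them on different effective sub-networks, while the remaining channels are shaped by the large clients only.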

