PERFEDMASK: PERSONALIZED FEDERATED LEARNING WITH OPTIMIZED MASKING VECTORS

Abstract

Recently, various personalized federated learning (FL) algorithms have been proposed to tackle data heterogeneity. To mitigate device heterogeneity, a common approach is to use masking. In this paper, we first show that using random masking can lead to a bias in the obtained solution of the learning model. To address this issue, we propose a personalized FL algorithm with optimized masking vectors called PerFedMask. In particular, PerFedMask enables each device to obtain its optimized masking vector based on its computational capability before training. Fine-tuning is performed after training. PerFedMask is a generalization of a recently proposed personalized FL algorithm, FedBABU (Oh et al., 2022). PerFedMask can be combined with other FL algorithms including HeteroFL (Diao et al., 2021) and Split-Mix FL (Hong et al., 2022). Results based on the CIFAR-10 and CIFAR-100 datasets show that the proposed PerFedMask algorithm provides a higher test accuracy after fine-tuning and a lower average number of trainable parameters when compared with six existing state-of-the-art FL algorithms in the literature.

1. INTRODUCTION

Federated learning (FL) is a distributed artificial intelligence (AI) framework, which allows multiple edge devices to train a single model collaboratively (Konečnỳ et al., 2015; McMahan et al., 2017) . The model is trained under the orchestration of a central server. In a typical FL algorithm, each communication round includes the following steps: (1) the edge devices download the latest model from the server to be used as their local model; (2) each device performs multiple local update iterations for updating the local model based on its local dataset; (3) the devices upload their updated local models to the server; (4) the server computes the new model by aggregating the local models. In practical systems, the devices may have diverse and limited computation, communication, and storage capabilities. Moreover, the local datasets available to the devices may be different in size, and contain non-independent and identically distributed (non-IID) data samples across the devices. Under these heterogeneous settings, the performance of the conventional FL algorithms can degrade (Wang et al., 2020; Li et al., 2021) . To handle the case when the data is non-IID, some works (Li et al., 2020a; Karimireddy et al., 2020) have introduced new optimization frameworks to obtain a more stable global model for the devices. Another approach to address the data heterogeneity issue is by designing a personalized model for each device (Arivazhagan et al., 2019; Fallah et al., 2020; Collins et al., 2021; Oh et al., 2022) . In personalized FL algorithms, instead of obtaining a single model for all the devices, an initial model is obtained. This initial model can then be personalized for each device using its local data samples. To overcome the computation limitation of the heterogeneous devices, one common approach is to use masking vectors. 
Masking vectors can be used to train only a sub-network of the learning model for each device based on the computational capability of that device. Masking vectors can be combined with pruning and freezing methods. Pruning methods utilize masking vectors to keep the important parameters of the learning model and remove those which are unimportant from the model architecture. However, leveraging pruning in FL may incur additional communication overhead (Babakniya et al., 2022; Bibikar et al., 2022) . Moreover, it results in different model architectures for the devices (Guo et al., 2016) . This may lead to accuracy loss, particularly when data heterogeneity exists in the system (Hong et al., 2022) . In the freezing methods, the masking vectors are used to freeze some parts of the learning model for each device. Unlike pruning, the masked parameters are not removed but are frozen during local updates. Hence, a more stable FL algorithm is obtained without changing the learning model architecture. Sidahmed et al. (2021) and Pfeiffer et al. (2022) have shown that freezing methods can reduce the computational and communication resources required for training the learning model in FL. However, the aforementioned works use heuristic approaches for designing the masking vectors, and do not provide a theoretical analysis for their choice. Also, the aforementioned works do not address the data heterogeneity issue in their proposed algorithms. In this work, we aim to answer the following question: By exploiting freezing method in FL, what is a systematic approach to determine the masking vectors which can improve the final test accuracy in a setting with data and device heterogeneities? We first show that using the masking vectors to freeze the model parameters for the devices may lead to a bias in the convergence bound. This bias can hinder the success of employing masking vectors to tackle the device heterogeneity issue in FL. 
Using the insights from our analysis, we propose PerFedMask, Personalized Federated Learning with Optimized Masking Vectors (see Fig. 1 ). Specifically, by decoupling the learning model into a global model and a local head model, we first freeze the local head model for all the devices. Then, we freeze a portion of the global model for each device based on its computational capability. In our work, the masking vectors are determined before training through minimizing the bias term in the convergence bound. The bias can be mitigated by this approach. However, it may not be eliminated completely. Thus, after training of the global model, the frozen parameters of the local head model can assist to fine-tune the entire learning model for each device. We demonstrate empirically the effectiveness of PerFedMask under the heterogeneous settings when compared with six existing state-of-the-art FL algorithms. PerFedMask has several distinct advantages: (1) PerFedMask generalizes the recently proposed personalized FL algorithm, FedBABU (Oh et al., 2022) . In particular, FedBABU is a special case of PerFedMask when all the devices have the same computational capability. (2) PerFedMask is flexible. Since PerFedMask does not change the model architecture, it can be combined with other FL algorithms such as HeteroFL (Diao et al., 2021) and Split-Mix FL (Hong et al., 2022) to further improve the performance. (3) PerFedMask can address the objective inconsistency problem, which arises due to different number of local update iterations. Unlike FedNova (Wang et al., 2020) which requires the modification of device optimizers to tackle this issue, in PerFedMask, we consider the same number of local update iterations for all the devices, while adjusting the required number of computations for those devices with lower computational capabilities.

3. PROBLEM SETTING

We consider one server and $N$ edge devices. Each device $n \in [N] = \{1, 2, \ldots, N\}$ has its own set of local data samples $\mathcal{D}_n$. In a supervised learning setting, each device aims to find a learning model $\theta_n \in \mathbb{R}^{d_\theta}$ for predicting the true label $y_n$ given the input $x_n$, $(x_n, y_n) \in \mathcal{D}_n$, where $d_\theta$ denotes the dimension of the learning model. Let $f_n(\theta_n)$ represent the expected loss over the data distribution of device $n$. We have $f_n(\theta_n) = \mathbb{E}_{(x_n, y_n) \sim p_n} \mathcal{L}(\theta_n; x_n, y_n)$, where $\mathcal{L}(\theta_n; x_n, y_n)$ is the loss function that measures the prediction error of $\theta_n$ over data samples $(x_n, y_n) \in \mathcal{D}_n$, and $p_n$ is the distribution over $\mathcal{D}_n$. Formally, the optimization problem $\min_{\theta} \frac{1}{N} \sum_{n=1}^{N} f_n(\theta)$ is solved in conventional FL. All our results can easily be extended to the weighted averaging case, where the devices have local datasets of different sizes. In this work, we study both data and device heterogeneity issues using model personalization and masking vectors.

Model Personalization. In the heterogeneous data setting, the probability distribution $p_n$ varies across the devices. Unlike conventional FL problems, to obtain a more personalized solution for each device, the learning models $\theta_n$, $n \in [N]$, are not equal to each other. Similar to FedBABU (Oh et al., 2022), the learning model of each device $n$ is decoupled into a global model $w_g \in \mathbb{R}^{d_w}$ and a local head model $\phi_n$, with $\phi_g$ denoting the corresponding global head model, and the objective function is $F(\{w_g, \phi_g\}) = \frac{1}{N} \sum_{n=1}^{N} f_n(\{w_g, \phi_n\})$ throughout all the communication rounds, where the operator $\{\cdot, \cdot\}$ denotes the concatenation of the learning model parameters. Since $\phi_g$ and $\phi_n$ are never updated during training, with some abuse of notation, we consider the objective function as $F(w_g) = \frac{1}{N} \sum_{n=1}^{N} f_n(w_g)$. After convergence of $w_g$, each device $n$ obtains its personalized local head model $\phi_n$ by fine-tuning the learning model $\theta_n$ using its local data samples.

Masking Vectors. In the heterogeneous device setting, devices vary in their computational and communication capabilities. We consider that in each communication round, the devices perform $\tau$ local update iterations.
When the same local model is deployed on all the devices, some devices with limited computational capability are not able to complete $\tau$ local update iterations and send their final local models to the server for aggregation in a timely manner. To address the device heterogeneity issue, masking vectors are used to freeze a device-specific part of the local model parameters. A masking vector $m_n \in \{0, 1\}^{d_w}$ is selected for each device $n \in [N]$ based on its computational capability. During local update iterations, each device $n$ only updates those parameters in the global model that correspond to non-zero values of the masking vector $m_n$. Other parameters, which correspond to the elements of $m_n$ with zero values, are frozen during local updates. Note that in FedBABU (Oh et al., 2022), all the elements of vector $m_n$ are equal to one for all devices $n \in [N]$. Let $w_n^i(t)$ denote the local model of device $n$ at the beginning of local update iteration $i$ in communication round $t$. At initialization (i.e., $i = 1$), we set $w_n^1(t) = w_g(t)$. At local update iteration $i \geq 1$, the local model of device $n$ is updated using SGD as follows:

$$w_n^{i+1}(t) \leftarrow w_n^i(t) - \eta(t)\, m_n \odot \nabla f_n(w_n^i(t), b_n^i(t)), \quad i = 1, \ldots, \tau, \tag{1}$$

where $\odot$ denotes the element-wise product, $\eta(t)$ is the learning rate, and $b_n^i(t)$ is the local batch of samples chosen uniformly at random from the local dataset $\mathcal{D}_n$. After performing $\tau$ local update iterations, each device $n$ sends its final local model, i.e., $w_n^{\tau+1}(t)$, to the server. We have

$$w_n^{\tau+1}(t) = w_g(t) - \eta(t)\, m_n \odot \sum_{i=1}^{\tau} \nabla f_n(w_n^i(t), b_n^i(t)). \tag{2}$$

Algorithm 2 in Appendix B describes the DeviceLocalUpdate function based on (1) and (2). In the aggregation step, we consider that the server aggregates the received final local models by taking the masking vectors of the devices into account.
The global model for the next communication round can thus be determined through stable aggregation of unfrozen parameters, as follows:

$$w_g(t+1) = \sum_{n \in \mathcal{N}(t)} k_n \odot w_n^{\tau+1}(t), \tag{3}$$

where $(k_n)_l = \frac{(m_n)_l}{\sum_{n' \in \mathcal{N}(t)} (m_{n'})_l}$ denotes the $l$-th element of vector $k_n$, and $\mathcal{N}(t)$ denotes the set of devices participating in training in communication round $t$. Specifically, the server selects a fraction $c$ of the $N$ devices at random as participating devices in each communication round. For each device $n$, vector $k_n$ is obtained as a normalized masking vector using the vectors $m_n$, $n \in \mathcal{N}(t)$. Using $k_n$ in (3) means that, for each parameter, the server averages only over the participating devices that actually updated that parameter.
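The aggregation rule in (3) can be illustrated with a small sketch. The code below is not taken from the paper: the function names `normalized_masks` and `aggregate` are hypothetical, models are plain Python lists, and it assumes that every parameter is trained by at least one participating device (otherwise the denominator of $(k_n)_l$ would be zero, a case the text does not discuss).

```python
from typing import List, Sequence

Vector = List[float]


def normalized_masks(masks: Sequence[Sequence[int]]) -> List[Vector]:
    """Compute k_n with (k_n)_l = (m_n)_l / sum_{n'} (m_{n'})_l, as in eqn. (3)."""
    dim = len(masks[0])
    # Number of participating devices that train each coordinate l.
    counts = [sum(m[l] for m in masks) for l in range(dim)]
    # Assumption (not stated in the text): every coordinate is trained by
    # at least one participating device, so no denominator is zero.
    if any(c == 0 for c in counts):
        raise ValueError("a parameter is frozen on every participating device")
    return [[m[l] / counts[l] for l in range(dim)] for m in masks]


def aggregate(local_models: Sequence[Vector],
              masks: Sequence[Sequence[int]]) -> Vector:
    """Masked aggregation: w_g(t+1) = sum_n k_n ⊙ w_n^{tau+1}(t)."""
    k = normalized_masks(masks)
    dim = len(masks[0])
    return [sum(k[n][l] * local_models[n][l] for n in range(len(masks)))
            for l in range(dim)]


# Two devices, three parameters; device 1 freezes the middle parameter,
# so its value there is ignored and device 0 alone determines it.
masks = [[1, 1, 1], [1, 0, 1]]
local_models = [[1.0, 2.0, 3.0], [3.0, 0.0, 5.0]]
new_global = aggregate(local_models, masks)
print(new_global)  # [2.0, 2.0, 4.0]
```

In this toy case the first and third parameters are plain averages over both devices, while the second parameter is taken from device 0 only, which is the behavior the normalized masking vectors are meant to produce.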

4. THEORETICAL RESULTS ON THE CONVERGENCE RATE OF THE GLOBAL MODEL

In this section, we analyze the convergence rate of the global model when masking vectors are used by the devices. Without loss of generality, we focus on non-convex loss functions. We also present the convergence results for strongly convex loss functions in Appendix C. In both cases, we assume that the loss functions are smooth. For simplicity, we obtain our convergence results for the full device participation scenario (i.e., $c = 1$, $\mathcal{N}(t) = [N]$ for all $t$). The analysis relies on the following assumptions, which are commonly used for obtaining the convergence rate of different FL algorithms in the literature (Li et al., 2020b; Reddi et al., 2021; Amiri et al., 2022).

Assumption 1. The function $f_n(w)$, $n \in [N]$, is $L$-smooth and satisfies:

$$\left\| \nabla f_n(w_n^i(t)) \right\|^2 \leq 2L \left( f_n(w_n^i(t)) - f_n^* \right), \quad n \in [N],\ i = 1, \ldots, \tau,\ \forall t, \tag{4}$$

where $f_n^*$ denotes the minimum value of $f_n(w)$.

Assumption 2. $\nabla f_n(w_n^i(t), b_n^i(t))$ is an unbiased stochastic gradient of function $f_n(w)$. The variance of the masked stochastic gradients is bounded for each device $n \in [N]$. We have

$$\mathbb{E} \left\| k_n \odot \nabla f_n(w_n^i(t), b_n^i(t)) - k_n \odot \nabla f_n(w_n^i(t)) \right\|^2 \leq \xi_n^2, \quad n \in [N],\ i = 1, \ldots, \tau,\ \forall t. \tag{5}$$

Assumption 3. The expected squared $l_2$-norm of the masked stochastic gradients for all the devices is uniformly bounded. We have

$$\mathbb{E} \left\| m_n \odot \nabla f_n(w_n^i(t), b_n^i(t)) \right\|^2 \leq G^2, \quad n \in [N],\ i = 1, \ldots, \tau,\ \forall t. \tag{6}$$

When the masking vectors are determined based on the computational capability of the devices, we define the term $\gamma_n = \max_l (k_n)_l$ to quantify the degree of device heterogeneity in the network. Note that in the full device participation scenario, $\frac{1}{N} \leq \gamma_n \leq 1$, $n \in [N]$. In addition, $\gamma_n$ is inversely proportional to the minimum non-zero element of the vector $\sum_{n'=1}^{N} m_n \odot m_{n'}$. Hence, a larger $\gamma_n$ implies a higher degree of device heterogeneity for device $n \in [N]$. We first present the following lemma, which is derived using Theorem 3 in Fang et al. (1994).
From Lemma 1, we can quantify the impact of freezing the parameters by masking vectors on the convergence bound.

Lemma 1. The following inequality holds for any vectors $x$ and $z \in \mathbb{R}^d$, for which there exists $Q > 0$ satisfying $\left| \min_l (x \odot z)_l \right| \leq Q$, and for any vector $y \in \mathbb{R}^d$:

$$\langle x, y \odot z \rangle \leq \max_l (y)_l \, \langle x, z \rangle + Q \left( d \max_l (y)_l - \sum_{l=1}^{d} (y)_l \right), \tag{7}$$

where $\langle \cdot, \cdot \rangle$ denotes the inner product operator in $\mathbb{R}^d$.

We use Lemma 1 to prove the following theorem concerning the device heterogeneity effect on the FL convergence bound. Devices with lower computational capability partially train the global model due to the zero-valued elements in their masking vectors. We show that employing the masking vectors to address the device heterogeneity issue in FL leads to a bias term in the convergence bound. However, it does not affect the convergence rate.

Theorem 1. Under Assumptions 1-3, and for smooth and non-convex loss functions, if the total number of communication rounds $T$ is pre-defined and the learning rate $\eta(t)$ is small enough such that $\eta(t) = \eta \leq \frac{1}{L N^2 \tau}$, we have

$$\frac{1}{T} \sum_{t=1}^{T} \mathbb{E} \left\| \nabla F(w_g(t)) \right\|^2 \leq \frac{2}{\eta \tau T} \left( F(w_g(1)) - F^* \right) + L N \tau \eta \sum_{n=1}^{N} \xi_n^2 + 2 \Psi \sum_{n=1}^{N} \left( d_w \gamma_n - \sum_{l=1}^{d_w} (k_n)_l \right) + \frac{L^2 \eta^2 G^2 (\tau - 1)(2\tau - 1)}{6}, \tag{8}$$

where $\Psi$ is a constant satisfying $\max_l \left( \nabla f_n(w_n^i(t)) \odot \nabla F(w_g(t)) \right)_l \leq \Psi$ for all $n \in [N]$, $i = 1, \ldots, \tau$, and $t = 1, \ldots, T$. $F^* = F(w^*)$, where $w^*$ is the global optimal point. $L$, $\xi_n^2$, and $G$ are constants defined in Assumptions 1-3.

The proofs of Lemma 1 and Theorem 1 can be found in Appendices D and E, respectively.

Remark 1. By employing the masking vectors in FL, the term $\sum_{n=1}^{N} \left( d_w \gamma_n - \sum_{l=1}^{d_w} (k_n)_l \right)$ appears on the right-hand side of (8). Since this term does not scale with the number of communication rounds $T$, it is considered as a bias term, which remains as a residual in the convergence bound.
Hence, those FL algorithms, which use masking vectors to reduce the computational and communication costs for the devices, may converge to a local minimum of the objective function $F(w_g)$. In PerFedMask, we design the masking vectors by minimizing $\sum_{n=1}^{N} \left( d_w \gamma_n - \sum_{l=1}^{d_w} (k_n)_l \right)$ to mitigate the performance degradation due to this bias term. In Appendix F, we present a simple example to gain insight regarding the selection of masking vectors by minimizing the bias term.
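To make the bias term concrete, the following sketch evaluates $\sum_{n}(d_w \gamma_n - \sum_{l}(k_n)_l)$ for two hand-picked mask configurations under full device participation. It is an illustration only and is not the example of Appendix F: the function name `bias_term` and the toy masks are hypothetical, and the code assumes that every parameter is trained by at least one device.

```python
from typing import Sequence


def bias_term(masks: Sequence[Sequence[int]]) -> float:
    """Evaluate sum_n ( d_w * gamma_n - sum_l (k_n)_l ), gamma_n = max_l (k_n)_l."""
    d_w = len(masks[0])
    # counts[l] = number of devices that train parameter l.
    counts = [sum(m[l] for m in masks) for l in range(d_w)]
    # Assumption: no parameter is frozen on all devices.
    if any(c == 0 for c in counts):
        raise ValueError("a parameter is frozen on every device")
    total = 0.0
    for m in masks:
        k = [m[l] / counts[l] for l in range(d_w)]  # normalized masking vector
        gamma = max(k)                              # heterogeneity measure
        total += d_w * gamma - sum(k)
    return total


# Homogeneous devices: both train all four parameters, so k_n = 1/N everywhere.
print(bias_term([[1, 1, 1, 1], [1, 1, 1, 1]]))  # 0.0

# Heterogeneous devices: device 1 freezes the last two parameters.
print(bias_term([[1, 1, 1, 1], [1, 1, 0, 0]]))  # 2.0
```

The homogeneous case gives a zero bias term, in line with the observation that $\gamma_n = \frac{1}{N}$ when all devices update all parameters, while freezing parameters on one device makes the term strictly positive in this example.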

5. OUR PROPOSED ALGORITHM

In this section, we propose a novel algorithm called PerFedMask, which aims to mitigate the performance degradation caused by the bias described in Remark 1 through: (1) systematically designing the masking vectors via an optimization framework, and (2) fine-tuning the local head models.

Algorithm 1 (excerpt):
        [...]
        for each device $n \in \mathcal{N}(t)$ in parallel do
            $w_n^{\tau+1}(t)$ = DeviceLocalUpdate($w_g(t)$, $\phi_g$, $m_n$, $f_n$, $\mathcal{D}_n$, $\eta(t)$).   # Local updates using eqn. (1)
        end for
        $w_g(t+1) = \sum_{n \in \mathcal{N}(t)} k_n \odot w_n^{\tau+1}(t)$.   # Aggregation at the server using eqn. (3)
        Update $\eta(t)$.
    end for
    "Client Operation"
    for each device $n \in [N]$ in parallel do
        Fine-tune the learning model $\theta_n := \{w_g(T+1), \phi_n\}$ using training data samples $\mathcal{D}_n$.   # Fine-tuning
    end for

First, each device determines the maximum number of parameters which can be trained during local update iterations based on its computational capability. These values are sent to the server. The server then determines the masking vector for each device before training. Let $\psi_n$ denote the maximum number of parameters that can be trained by device $n \in [N]$, where $\psi_n \leq d_w$. Given $\psi_n$, the server determines the masking vector $m_n$ for each device $n$ before training by minimizing the bias term. Here, we present the formulation of layer-wise masking, which decides whether or not to freeze all the parameters in each layer of the global model. We adopt layer-wise masking since most of the current machine learning frameworks such as PyTorch (PyTorch, 2022) and TensorFlow (Abadi et al., 2016) run at the granularity of a full tensor, with no APIs which can provide parameter freezing at a finer granularity. Also, by considering layer-wise masking, the number of optimization variables can be reduced. Thus, the complexity of obtaining the masking vectors can also be reduced accordingly. Let $\Lambda$ and $|\Lambda|$ denote the set and number of layers in the global model, respectively.
Let $\pi_j$ and $|\pi_j|$ denote the set and number of parameters in layer $j \in \Lambda$, respectively. We define $\bar{m}_n \in \{0, 1\}^{|\Lambda|}$ as the layer-wise masking vector. If $(\bar{m}_n)_j = 1$, then all the elements $(m_n)_l$, $l \in \pi_j$, are equal to one. In PerFedMask, the following optimization problem is solved by the server to obtain the masking vectors for the devices:

$$\mathcal{P}^{\mathrm{mask}}: \quad \underset{\bar{m}_n,\, \epsilon_n,\, n \in [N]}{\text{minimize}} \quad \sum_{n=1}^{N} \left( d_w \max_{j \in \Lambda} (\bar{k}_n)_j - \sum_{j' \in \Lambda} |\pi_{j'}| (\bar{k}_n)_{j'} + \epsilon_n \right)$$

subject to

$$(\bar{k}_n)_j = \frac{(\bar{m}_n)_j}{\sum_{n'=1}^{N} (\bar{m}_{n'})_j}, \quad j \in \Lambda,\ n \in [N], \tag{9a}$$

$$\sum_{j \in \Lambda} |\pi_j| (\bar{m}_n)_j = \psi_n - \epsilon_n, \quad n \in [N], \tag{9b}$$

$$(\bar{m}_n)_j \in \{0, 1\}, \quad j \in \Lambda,\ n \in [N], \tag{9c}$$

$$\epsilon_n \geq 0, \quad n \in [N], \tag{9d}$$

where $\epsilon_n$ is a slack variable, which ensures that no more than $\psi_n$ parameters are trained by device $n$ under layer-wise masking. Problem $\mathcal{P}^{\mathrm{mask}}$ is a mixed-integer nonlinear program, which is NP-hard and difficult to solve. In Appendix H, we show how to obtain a close-to-optimal solution by using successive convex approximation (Shen et al., 2016) [...]

[...] (Oh et al., 2022), we fix the product of the local epochs $E$ and the maximum number of communication rounds $T$ to 320. After $T$ communication rounds, we choose the model with the maximum validation accuracy and perform 5 local update iterations for fine-tuning the learning model using each device's local training data samples. We perform the experiments using the PyTorch library (PyTorch, 2022) in Python 3.7. We apply layer-wise masking in our experiments.

Baselines. We compare the performance of our proposed algorithm, PerFedMask, with the following FL algorithms: FedBABU (Oh et al., 2022) and FedProx (Li et al., 2020a), which have been proposed to tackle data heterogeneity; FedNova (Wang et al., 2020), which has been proposed to address the objective inconsistency problem; HeteroFL (Diao et al., 2021) and Split-Mix FL (Hong et al., 2022), which have been proposed to tackle device heterogeneity in the non-IID data settings.

Performance Metrics. We consider the test accuracy as one of the performance metrics.
Fine-tuning steps are performed for the PerFedMask and FedBABU algorithms to personalize the learning model and obtain the local head model for each device. For a fair comparison, the obtained learning model has also been fine-tuned for each device in the other FL algorithms. Thus, we report the test accuracy before and after fine-tuning. We also assess the average number of floating-point operations (FLOPs) in each communication round for both the forward and backward propagation paths to show the required computation for the algorithms. We use PyPAPI (PyPAPI, 2017) to obtain the number of FLOPs. Since the average number of trainable parameters is equal to the average number of parameters transmitted from the devices to the server in each communication round, we report this number to show the communication cost of each FL algorithm.

6.2. BENCHMARK EXPERIMENTS

We consider a heterogeneous device setting, where half of the devices (i.e., devices with maximum computational capability) perform four local epochs (i.e., $E = 4$). Due to their limited computational capability, the remaining devices perform two local epochs for updating all the parameters. Using PerFedMask, instead of considering different local epochs for the devices, we address the device heterogeneity issue by freezing some parts of the global model for those devices with lower computational capability. For a fair comparison, the number of frozen parameters is selected such that the considered algorithms have the same number of FLOPs for each device.

Performance Comparison with the Baselines. Table 1 shows the obtained test accuracy after fine-tuning and the number of trainable parameters for the CIFAR-10 and CIFAR-100 datasets. Our observations are as follows: (1) our proposed algorithm, PerFedMask, has comparable performance [...]

Combining PerFedMask with HeteroFL and Split-Mix FL. Table 2 shows the performance results for PerFedMask and its combination with the HeteroFL and Split-Mix FL algorithms. Table 2 also shows the performance results for the FedBABU, HeteroFL, and Split-Mix FL algorithms. The number of FLOPs for forward and backward indicates the number of required computations in the forward and backward propagation paths, respectively. In general, backpropagation dominates the computational cost during training of a learning model (Xu et al., 2022). In PerFedMask, by using masking vectors, there is no need to compute the partial derivative of the objective function with respect to the frozen parameters. Although PerFedMask reduces the number of trainable parameters and the backward FLOPs, it can still achieve higher test accuracy compared to other algorithms including FedBABU. Since PerFedMask does not change the architecture of the learning model, it can easily be combined with other FL algorithms.
Table 2 shows that combining PerFedMask with Split-Mix FL and HeteroFL algorithms can further reduce the number of FLOPs in the backward path and the number of trainable parameters. This combination provides a higher test accuracy after fine-tuning than Split-Mix FL and HeteroFL algorithms. 

6.3. ABLATION STUDIES

Effect of Fine-Tuning Steps. We investigate the impact of the number of fine-tuning steps on the final test accuracy of PerFedMask and FedBABU. Fine-tuning steps equal to zero means the test accuracy is obtained before fine-tuning. Also, since the number of batches for each device's training dataset is equal to 9, increasing the fine-tuning steps by one leads to 9 more local update iterations. Results from Table 3 show that, similar to FedBABU, PerFedMask can achieve better accuracy with a small number of fine-tuning steps. This characteristic is important when fine-tuning is restricted or costly for the devices.

Effect of Increasing the Number of Devices with Maximum Computational Capability. We investigate the impact of increasing $\nu$ on the performance of PerFedMask in Table 4. More devices in the network are able to train the entire global model as $\nu$ increases. This leads to an increase in the number of trainable parameters and the number of backward FLOPs in PerFedMask. We can observe that by increasing $\nu$, the test accuracy before fine-tuning is improved. Note that PerFedMask can provide a comparable test accuracy after fine-tuning even for $\nu = 0.2$, when compared with the case in which all devices have the maximum computational capability (i.e., $\nu = 1$).

Effect of Masking Vectors Design. In Table 5, we consider three masking approaches: in sequential masking, the layers are masked sequentially; in random masking, the layers are masked randomly; and in optimized masking, the layers are masked by solving problem $\mathcal{P}^{\mathrm{mask}}$. Optimized masking minimizes the bias described in Remark 1. As shown in Table 5, optimized masking can provide lower training loss and higher training and test accuracies. These results are consistent with our theoretical analysis. As shown in Table 5, PerFedMask enhances FL performance before fine-tuning by employing optimized masking vectors. The final test accuracy can then be improved by fine-tuning.

Published as a conference paper at ICLR 2023

7. CONCLUSION

In this work, we proposed a flexible and easy-to-implement personalized FL algorithm called PerFedMask. We provided theoretical and empirical grounds to justify the utility of PerFedMask in heterogeneous data and device settings. In particular, PerFedMask employs (1) optimized masking vectors obtained by minimizing the bias term in the convergence bound, and (2) fine-tuning. The masking vectors are exploited to freeze some parts of the global model for each device based on its computational capability. Fine-tuning is performed by each device after training to improve the final test accuracy. When compared with some existing state-of-the-art FL algorithms, PerFedMask can achieve higher test accuracy. It can also decrease the average number of trainable parameters and the average number of FLOPs in each communication round without changing the learning model architecture. A future direction is to consider a freezing priority for different layers in the neural network architecture. For example, Frankle et al. (2021) have shown that it is important to keep the batch normalization layers in convolutional networks among the trainable parameters. Also, the approach to masking vector design in this work can be adopted in pruning methods.

A RELATED WORK

FL Algorithms with non-IID Data. In FedAvg (McMahan et al., 2017) , the edge devices perform multiple local update iterations on their local models before sending them to the server for aggregation. In this way, the communication cost can be reduced. When there is data heterogeneity, conventional FL algorithms such as FedAvg may have slow and unstable convergence (Karimireddy et al., 2020) . To handle the case when the data is non-IID, some works (Li et al., 2020a; Karimireddy et al., 2020) have introduced new optimization frameworks to obtain a more stable model for the devices. Although the obtained model may still perform well on average across the devices, some of the devices may not be satisfied with the obtained final accuracy (Collins et al., 2021) . To address this issue, more personalized models are required for the devices. The personalized models can be obtained by using different methods such as meta-learning (Fallah et al., 2020) , multi-task learning (Marfoq et al., 2021) , clustering (Mansour et al., 2020) , and decomposition. In the decomposition method, the learning model is decomposed into a global model and a device-specific head model (Arivazhagan et al., 2019; Collins et al., 2021) . In each communication round, the global model is updated by the devices and is aggregated by the server, whereas the head model is updated by each device but is not transmitted to the server. Oh et al. (2022) proposed FedBABU using the decomposition method. To improve the personalization ability, in FedBABU, the local head model parameters are frozen during training. After convergence to a global model, each device updates the local head model parameters during the fine-tuning steps by using its local data samples. FL Algorithms with Heterogeneous Devices. Given the disparities in devices' hardware, it is crucial to address device heterogeneity in FL. 
In general, this problem can be tackled by reducing the computational complexity of model training based on the devices' hardware capabilities. Lin et al. (2020) ; Afonin & Karimireddy (2022) proposed FL algorithms using knowledge distillation, which aim to generate compact models by transferring knowledge of a large model to smaller ones. Pruning method, where a compact model for each device can be obtained by removing the parameters with little impact on the performance of the original learning model, is another approach to accommodate device heterogeneity in FL. Due to the dynamic sparse training, pruning method may suffer from instability in convergence for finding the sparse masking vectors, which are used for training the sparse sub-networks at the devices. This issue can be resolved by consistent mask adjustment procedure at the devices, at the expense of additional communication overhead (Jiang et al., 2022; Babakniya et al., 2022; Bibikar et al., 2022) . The ordered dropout technique has been proposed in Horvath et al. (2021) to dynamically adapt the model size used by each device based on its capabilities. Another pruning method that does not require additional communication cost is to use a fixed masking vector for each device before training based on its computational capability. This would result in training of the heterogeneous local models for the devices. In this regard, HeteroFL (Diao et al., 2021) aims to facilitate efficient training and stable aggregation of devices' local models. However, the available local data samples at the devices may not be used efficiently in HeteroFL. In particular, HeteroFL considers different model architectures as the local models for the devices. Thereby, the available data samples at each device can be used only for training a specific model architecture. 
In Split-Mix FL (Hong et al., 2022), based on its computational capability, each device randomly selects some of the base models in each communication round to train them. After training, by using an ensemble learning approach, each device mixes the selected base models to construct its desired model size. The potential drawback of this approach is that none of the devices train the original learning model. Another line of research is related to the objective inconsistency problem in FL. Due to the different sizes of the devices' local datasets and their different computational capabilities, some of the devices may finish their local update iterations faster. To prevent the devices from being idle, Wang et al. (2020) proposed to let those faster devices continue their local update iterations until the slowest device finishes its local update. FedNova is proposed by Wang et al. (2020) to resolve the objective inconsistency problem due to the different number of local update iterations. Another approach is to consider the same number of local update iterations for all the devices. However, to enable all the devices to finish their local update iterations in a timely manner, the required number of computations should be adjusted for those devices with lower computational capabilities. Our proposed algorithm as well as other FL algorithms which aim to adjust the required computations for each device based on its computational capability can be employed to address the objective inconsistency problem.

Algorithm 2 (DeviceLocalUpdate), excerpt:
            [...]
            for each batch $b \in B$ do
                $w^{i+1} \leftarrow w^i - \eta\, m \odot \nabla f(w^i, b)$.
                $i := i + 1$.
            end for
        end for
        Return $w^i$
    end function
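The masked update at the core of DeviceLocalUpdate can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the signature of `device_local_update`, the use of plain lists, and the user-supplied `grad_fn` are assumptions, and the outer loop is assumed to run over local epochs since the opening steps of the algorithm are not shown. The sketch computes the full gradient and then masks it, whereas a practical implementation would skip the gradient computation for frozen parameters altogether.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def device_local_update(
    w_global: Sequence[float],
    mask: Sequence[int],
    grad_fn: Callable[[Vector, object], Vector],
    batches: Sequence[object],
    lr: float,
    epochs: int = 1,
) -> Vector:
    """Run masked SGD steps w <- w - lr * mask ⊙ grad(w, b) over all batches."""
    w = list(w_global)  # start from the global model, as in w_n^1(t) = w_g(t)
    for _ in range(epochs):  # assumption: the outer loop iterates over epochs
        for batch in batches:
            g = grad_fn(w, batch)
            # Coordinates with mask value 0 are frozen and keep their value.
            w = [wi - lr * mi * gi for wi, mi, gi in zip(w, mask, g)]
    return w


def toy_grad(w: Vector, target: Sequence[float]) -> Vector:
    """Gradient of the toy loss 0.5 * ||w - target||^2 (hypothetical example)."""
    return [wi - ti for wi, ti in zip(w, target)]


# The middle parameter is frozen, so it stays at its global value.
updated = device_local_update(
    w_global=[0.0, 0.0, 0.0],
    mask=[1, 0, 1],
    grad_fn=toy_grad,
    batches=[[1.0, 1.0, 1.0]],
    lr=0.5,
)
print(updated)  # [0.5, 0.0, 0.5]
```

Because the mask multiplies every step, the returned model differs from the global model only on the unfrozen coordinates, which is what the accumulated form of the update expresses.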

C CONVERGENCE ANALYSIS FOR STRONGLY CONVEX LOSS FUNCTIONS

To show the convergence of PerFedMask for smooth and strongly convex loss functions, in addition to Assumptions 1-3, we make the following assumption:

Assumption 4. $f_n(w)$, $n \in [N]$, is $\mu$-strongly convex and satisfies:

$$f_n(v) \geq f_n(w) + (v - w)^{T} \nabla f_n(w) + \frac{\mu}{2} \| v - w \|^2, \quad \forall v, w,\ n \in [N]. \tag{10}$$

To quantify the degree of data heterogeneity at each device $n \in [N]$, the term $\Gamma_n = f_n(w^*) - f_n^*$ is defined (Li et al., 2020b). Let $\delta(t) = \mathbb{E} \| w_g(t) - w^* \|^2$. We first prove the following useful lemma.

Lemma 2. Under Assumptions 1-4, if the learning rate is small enough, i.e., $\eta(t) \leq \frac{1}{L(N\tau + 1)}$, for all $t = 1, \ldots, T$, we have

$$\delta(t+1) \leq \left( 1 - q_0 \eta(t) \right) \delta(t) + q_1 \eta(t) + q_2 \eta^2(t), \tag{11}$$

where

$$q_0 \triangleq \frac{1}{2} \mu \tau \sum_{n=1}^{N} \gamma_n, \tag{12}$$

$$q_1 \triangleq 2 \tau \Upsilon \sum_{n=1}^{N} \left( \gamma_n - \frac{1}{N} \right) + 2 \tau \Omega \sum_{n=1}^{N} \left( d_w \gamma_n - \sum_{l=1}^{d_w} (k_n)_l \right), \tag{13}$$

$$q_2 \triangleq \frac{G^2 \tau (\tau - 1)(2\tau - 1)}{6} (2 + \mu) \sum_{n=1}^{N} \gamma_n + 2 L \tau \sum_{n=1}^{N} \gamma_n \left( 2 + N \tau \gamma_n \right) \Gamma_n + N \tau^2 \sum_{n=1}^{N} \xi_n^2, \tag{14}$$

and $\Omega$ and $\Upsilon$ should satisfy $\min_l \left( (w^* - w_n^i(t)) \odot \nabla f_n(w_n^i(t)) \right)_l \leq \Omega$ for all $n \in [N]$, $i = 1, \ldots, \tau$, $t = 1, \ldots, T$, and $\min_{n \in [N]} \left( f_n(w_g(t)) - f_n(w^*) \right) \leq \Upsilon$, $t = 1, \ldots, T$, respectively.

Proof. See Appendix N.

Using Lemma 2, we can state the following theorem for the convergence rate of smooth and strongly convex loss functions:

Theorem 2. Given Assumptions 1-4, if we choose $\kappa = \frac{2L}{q_0}(N\tau + 1)$ and the learning rate $\eta(t) = \frac{2}{q_0 (t + \kappa)}$, under the full device participation scenario (i.e., $c = 1$), after $T$ communication rounds, we have

$$\mathbb{E} F(w_g(T)) - F^* \leq \frac{L q_1}{q_0} + \frac{L}{2(T + \kappa)} \left( \frac{4 q_2}{q_0^2} + (\kappa + 1)\, \delta(1) \right). \tag{15}$$

Proof. See Appendix O.

Remark 2. The first term on the right-hand side of (15) appears in the convergence bound due to the device heterogeneity. In particular, Theorem 2 shows that for $q_1 \neq 0$, the FL algorithm converges to a local optimal solution at the rate of $O(1/T)$. This convergence rate is similar to the results presented in Li et al. (2020b) and Amiri et al. (2022), where device heterogeneity has not been considered.
Hence, using the masking vectors in FL do not degrade the convergence rate of FL. Remark 3. The result in Theorem 2 shows that when q 1 → 0, the FL algorithm converges to the global optimal solution for the smooth and strongly convex functions. Based on (13), it is straightforward to verify that without device heterogeneity in the network (i.e., when all the devices have the maximum computational capability and can update all the parameters of the global model), all the elements of vector k n , n ∈ [N ] are equal to 1 N . Hence, in this case, we have γ n = 1 N , and q 1 = 0. In general, based on (13), one way to reduce the bias incurred by the device heterogeneity is to design the masking vectors by minimizing q 1 . For example, one can search for the masking vectors, which minimize N n=1 d w γ n - dw l=1 (k n ) l .
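The recursion in Lemma 2 and the step size in Theorem 2 can be explored numerically. The following is a minimal sketch, not part of the paper's analysis: it iterates the Lemma 2 recursion with equality (the worst case) and records the envelope $\beta(t)/(t+\kappa)$ used in the induction argument of Appendix O. The function name `simulate_lemma2` and the constants `q0`, `q1`, `q2`, `kappa`, and `delta1` are hypothetical illustrative values; they are not derived from $L$, $N$, $\tau$, or any experiment.

```python
def simulate_lemma2(q0, q1, q2, kappa, delta1, T):
    """Iterate delta(t+1) = (1 - q0*eta)*delta(t) + q1*eta + q2*eta^2
    with eta(t) = 2 / (q0 * (t + kappa)).

    Returns a list of pairs (delta(t), beta(t) / (t + kappa)) for t = 1..T,
    where beta(t) = max(2*q1/q0*(t + kappa) + 4*q2/q0**2, (kappa + 1)*delta1).
    """
    delta = delta1
    history = []
    for t in range(1, T + 1):
        beta = max(2 * q1 / q0 * (t + kappa) + 4 * q2 / q0**2,
                   (kappa + 1) * delta1)
        history.append((delta, beta / (t + kappa)))
        eta = 2.0 / (q0 * (t + kappa))          # diminishing step size
        delta = (1 - q0 * eta) * delta + q1 * eta + q2 * eta**2
    return history


# Illustrative (assumed) constants only.
history = simulate_lemma2(q0=0.5, q1=0.1, q2=1.0, kappa=10.0, delta1=5.0, T=1000)
```

In this sketch every recorded `delta(t)` stays below its envelope, and the envelope itself flattens toward $2q_1/q_0$ as $t$ grows, which mirrors the nonvanishing first term in (15) when $q_1 \ne 0$.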

D PROOF OF LEMMA 1

Given vectors $x$, $y$, and $z$, we form diagonal matrices $X$, $Y$, and $Z$, respectively. Note that we can write $\langle x, y \odot z\rangle$ as the trace of the product of matrices $X$, $Y$, and $Z$, i.e., $\langle x, y \odot z\rangle = \mathrm{Tr}(XYZ)$. By using Theorem 3 in Fang et al. (1994), we have the following inequality:
$$\mathrm{Tr}(XYZ) \le \lambda_1(Y)\,\mathrm{Tr}(XZ) - \lambda_d(XZ)\bigl(d\lambda_1(Y) - \mathrm{Tr}(Y)\bigr),$$
where $\lambda_1(Y)$ and $\lambda_d(XZ)$ are the largest eigenvalue of matrix $Y$ and the smallest eigenvalue of matrix $XZ$, respectively. Since the considered matrices are diagonal, we have $\lambda_1(Y) = \max_l (y)_l$ and $\lambda_d(XZ) = \min_l (x \odot z)_l$. Hence, we have
$$\langle x, y \odot z\rangle \le \max_l (y)_l\,\langle x, z\rangle - \min_l (x \odot z)_l\Bigl(d\max_l (y)_l - \sum_{l=1}^{d}(y)_l\Bigr). \tag{17}$$
Since $d\max_l (y)_l - \sum_{l=1}^{d}(y)_l \ge 0$, by considering $\min_l (x \odot z)_l \ge -Q$, Lemma 1 is proved using inequality (17).
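Inequality (17) can be sanity-checked on random vectors. The snippet below is an illustrative sketch only: the helper name `lemma1_gap`, the vector dimension, and the Gaussian sampling are assumptions made for the example. It computes the right-hand side of (17) minus the left-hand side, which should never be negative.

```python
import numpy as np


def lemma1_gap(x, y, z):
    """Return RHS - LHS of <x, y*z> <= max(y)<x, z> - min(x*z) (d*max(y) - sum(y))."""
    d = x.size
    lhs = np.dot(x, y * z)                       # <x, y ⊙ z>
    rhs = y.max() * np.dot(x, z) - (x * z).min() * (d * y.max() - y.sum())
    return rhs - lhs


rng = np.random.default_rng(0)
# Each draw yields three vectors (x, y, z) of dimension 8.
gaps = [lemma1_gap(*rng.normal(size=(3, 8))) for _ in range(1000)]
```

A nonnegative minimum of `gaps` (up to floating-point tolerance) is consistent with the diagonal-matrix argument in the proof.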

E PROOF OF THEOREM 1

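Considering the smoothness of $f_n(w)$, $n \in [N]$, in each communication round $t \ge 1$, we have
$$\mathbb{E}F(w_g(t+1)) \le \mathbb{E}F(w_g(t)) + \mathbb{E}\langle w_g(t+1) - w_g(t), \nabla F(w_g(t))\rangle + \frac{L}{2}\mathbb{E}\|w_g(t+1) - w_g(t)\|^2. \tag{18}$$

We first find an upper bound for $\|w_g(t+1) - w_g(t)\|^2$ as follows:
$$\mathbb{E}\|w_g(t+1) - w_g(t)\|^2 \overset{(a)}{=} \eta^2(t)\,\mathbb{E}\Bigl\|\sum_{n=1}^{N} k_n \odot \sum_{i=1}^{\tau}\nabla f_n(w_n^i(t), b_n^i(t))\Bigr\|^2$$
$$\overset{(b)}{=} \eta^2(t)\underbrace{\mathbb{E}\Bigl\|\sum_{n=1}^{N}\sum_{i=1}^{\tau}\bigl(k_n \odot \nabla f_n(w_n^i(t), b_n^i(t)) - k_n \odot \nabla f_n(w_n^i(t))\bigr)\Bigr\|^2}_{A_1} + \eta^2(t)\underbrace{\Bigl\|\sum_{n=1}^{N}\sum_{i=1}^{\tau} k_n \odot \nabla f_n(w_n^i(t))\Bigr\|^2}_{A_2}, \tag{19}$$
where equality (a) results from (2) and (3). Equality (b) is obtained via the basic equality $\mathbb{E}\|z\|^2 = \mathbb{E}\|z - \mathbb{E}z\|^2 + \|\mathbb{E}z\|^2$ for any random vector $z$. By using Assumption 2, we can obtain an upper bound of $A_1$ as follows:
$$A_1 = \mathbb{E}\Bigl\|\sum_{n=1}^{N}\sum_{i=1}^{\tau}\bigl(k_n \odot \nabla f_n(w_n^i(t), b_n^i(t)) - k_n \odot \nabla f_n(w_n^i(t))\bigr)\Bigr\|^2 \le N\tau\sum_{n=1}^{N}\sum_{i=1}^{\tau}\mathbb{E}\bigl\|k_n \odot \nabla f_n(w_n^i(t), b_n^i(t)) - k_n \odot \nabla f_n(w_n^i(t))\bigr\|^2 \le N\tau^2\sum_{n=1}^{N}\xi_n^2. \tag{20}$$

By considering the convexity of $\|\cdot\|^2$ and by using $\gamma_n = \max_l (k_n)_l$, we can obtain an upper bound of $A_2$ as follows:
$$A_2 = \Bigl\|\sum_{n=1}^{N}\sum_{i=1}^{\tau} k_n \odot \nabla f_n(w_n^i(t))\Bigr\|^2 \le N\tau\sum_{n=1}^{N}\sum_{i=1}^{\tau}\bigl\|k_n \odot \nabla f_n(w_n^i(t))\bigr\|^2 \le N\tau\sum_{n=1}^{N}\sum_{i=1}^{\tau}\gamma_n^2\bigl\|\nabla f_n(w_n^i(t))\bigr\|^2. \tag{21}$$

By combining (19), (20), and (21), we have the following inequality:
$$\mathbb{E}\|w_g(t+1) - w_g(t)\|^2 \le N\tau^2\eta^2(t)\sum_{n=1}^{N}\xi_n^2 + N\tau\eta^2(t)\sum_{n=1}^{N}\sum_{i=1}^{\tau}\gamma_n^2\bigl\|\nabla f_n(w_n^i(t))\bigr\|^2. \tag{22}$$

Now, we aim to obtain an upper bound of $\mathbb{E}\langle w_g(t+1) - w_g(t), \nabla F(w_g(t))\rangle$. We have
$$\mathbb{E}\langle w_g(t+1) - w_g(t), \nabla F(w_g(t))\rangle \overset{(a)}{=} \mathbb{E}\Bigl\langle -\eta(t)\sum_{n=1}^{N}\sum_{i=1}^{\tau} k_n \odot \nabla f_n(w_n^i(t), b_n^i(t)),\ \nabla F(w_g(t))\Bigr\rangle$$
$$\overset{(b)}{=} \eta(t)\,\mathbb{E}\Bigl\langle \sum_{n=1}^{N}\sum_{i=1}^{\tau} k_n \odot \nabla f_n(w_n^i(t)),\ -\nabla F(w_g(t))\Bigr\rangle$$
$$\overset{(c)}{\le} \eta(t)\,\mathbb{E}\sum_{n=1}^{N}\sum_{i=1}^{\tau}(-\gamma_n)\bigl\langle \nabla f_n(w_n^i(t)),\ \nabla F(w_g(t))\bigr\rangle + \eta(t)\tau\Psi\sum_{n=1}^{N}\Bigl(d_w\gamma_n - \sum_{l=1}^{d_w}(k_n)_l\Bigr)$$
$$\overset{(d)}{\le} -\eta(t)\sum_{i=1}^{\tau}\mathbb{E}\Bigl\langle \frac{1}{N}\sum_{n=1}^{N}\nabla f_n(w_n^i(t)),\ \nabla F(w_g(t))\Bigr\rangle + \eta(t)\tau\Psi\sum_{n=1}^{N}\Bigl(d_w\gamma_n - \sum_{l=1}^{d_w}(k_n)_l\Bigr), \tag{23}$$
where equality (a) results from (2) and (3). Equality (b) follows from $\mathbb{E}\nabla f_n(w_n^i(t), b_n^i(t)) = \nabla f_n(w_n^i(t))$. Inequality (c) holds by using Lemma 1. Inequality (d) follows from $\gamma_n \ge \frac{1}{N}$.

To find an upper bound for $-\mathbb{E}\bigl\langle \frac{1}{N}\sum_{n=1}^{N}\nabla f_n(w_n^i(t)), \nabla F(w_g(t))\bigr\rangle$, we first represent it as follows:
$$-\mathbb{E}\Bigl\langle \frac{1}{N}\sum_{n=1}^{N}\nabla f_n(w_n^i(t)),\ \nabla F(w_g(t))\Bigr\rangle = \frac{1}{2}\mathbb{E}\Bigl\|\frac{1}{N}\sum_{n=1}^{N}\bigl(\nabla f_n(w_n^i(t)) - \nabla f_n(w_g(t))\bigr)\Bigr\|^2 - \frac{1}{2}\mathbb{E}\Bigl\|\frac{1}{N}\sum_{n=1}^{N}\nabla f_n(w_n^i(t))\Bigr\|^2 - \frac{1}{2}\mathbb{E}\|\nabla F(w_g(t))\|^2. \tag{24}$$

The first term on the right-hand side of (24) is bounded as follows:
$$\mathbb{E}\Bigl\|\frac{1}{N}\sum_{n=1}^{N}\bigl(\nabla f_n(w_n^i(t)) - \nabla f_n(w_g(t))\bigr)\Bigr\|^2 \overset{(a)}{\le} \frac{1}{N}\sum_{n=1}^{N}\mathbb{E}\bigl\|\nabla f_n(w_g(t)) - \nabla f_n(w_n^i(t))\bigr\|^2 \overset{(b)}{\le} \frac{L^2}{N}\sum_{n=1}^{N}\mathbb{E}\bigl\|w_g(t) - w_n^i(t)\bigr\|^2, \tag{25}$$
where inequality (a) results from the convexity of $\|\cdot\|^2$. Inequality (b) results from Assumption 1. Now, we aim to bound $\mathbb{E}\|w_g(t) - w_n^i(t)\|^2$ for $i = 2, \ldots, \tau$. By using (1), we have
$$\mathbb{E}\bigl\|w_g(t) - w_n^i(t)\bigr\|^2 = \mathbb{E}\Bigl\|\eta(t)\,m_n \odot \sum_{j=1}^{i-1}\nabla f_n(w_n^j(t), b_n^j(t))\Bigr\|^2 \le \eta^2(t)(i-1)\sum_{j=1}^{i-1}\mathbb{E}\bigl\|m_n \odot \nabla f_n(w_n^j(t), b_n^j(t))\bigr\|^2 \le \eta^2(t)(i-1)^2G^2, \tag{26}$$
where the last inequality results from Assumption 3. By combining (25) and (26), we have
$$\mathbb{E}\Bigl\|\frac{1}{N}\sum_{n=1}^{N}\bigl(\nabla f_n(w_n^i(t)) - \nabla f_n(w_g(t))\bigr)\Bigr\|^2 \le L^2\eta^2(t)(i-1)^2G^2. \tag{27}$$

By combining (18) and (22)-(27), we have
$$\mathbb{E}F(w_g(t+1)) \le \mathbb{E}F(w_g(t)) + \frac{L}{2}N\tau^2\eta^2(t)\sum_{n=1}^{N}\xi_n^2 + \eta(t)\tau\Psi\sum_{n=1}^{N}\Bigl(d_w\gamma_n - \sum_{l=1}^{d_w}(k_n)_l\Bigr) - \frac{\eta(t)\tau}{2}\mathbb{E}\|\nabla F(w_g(t))\|^2 + L^2\eta^3(t)G^2\frac{\tau(\tau-1)(2\tau-1)}{12} - \frac{\eta(t)}{2}\sum_{n=1}^{N}\sum_{i=1}^{\tau}\Bigl(\frac{1}{N} - LN\tau\gamma_n^2\eta(t)\Bigr)\bigl\|\nabla f_n(w_n^i(t))\bigr\|^2. \tag{28}$$

Since $\eta(t) = \eta \le \frac{1}{LN^2\tau}$, we have $-\frac{\eta(t)}{2}\sum_{n=1}^{N}\sum_{i=1}^{\tau}\bigl(\frac{1}{N} - LN\tau\gamma_n^2\eta(t)\bigr)\|\nabla f_n(w_n^i(t))\|^2 \le 0$. By rearranging the terms in (28), we obtain
$$\mathbb{E}\|\nabla F(w_g(t))\|^2 \le \frac{2}{\eta\tau}\bigl(\mathbb{E}F(w_g(t)) - \mathbb{E}F(w_g(t+1))\bigr) + LN\tau\eta\sum_{n=1}^{N}\xi_n^2 + 2\Psi\sum_{n=1}^{N}\Bigl(d_w\gamma_n - \sum_{l=1}^{d_w}(k_n)_l\Bigr) + L^2\eta^2G^2\frac{(\tau-1)(2\tau-1)}{6}. \tag{29}$$

Finally, we multiply both sides of (29) by $\frac{1}{T}$ and sum over $t = 1, \ldots, T$. Then, Theorem 1 is concluded by considering that the first term on the right-hand side of (29) is a telescoping series.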
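We have
$$\frac{2}{\eta\tau T}\sum_{t=1}^{T}\bigl(\mathbb{E}F(w_g(t)) - \mathbb{E}F(w_g(t+1))\bigr) = \frac{2}{\eta\tau T}\bigl(F(w_g(1)) - \mathbb{E}F(w_g(T+1))\bigr) \le \frac{2}{\eta\tau T}\bigl(F(w_g(1)) - F^*\bigr),$$
where the last inequality is obtained by considering that $\mathbb{E}F(w_g(t+1)) \ge F^*$.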
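For reference, the averaging step can be written out explicitly. The display below is a worked restatement obtained only by averaging (29) over $t = 1, \ldots, T$ and applying the telescoping bound above. It is expected to coincide with the statement of Theorem 1 in the main text, which is not reproduced in this appendix.

```latex
\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\|\nabla F(w_g(t))\|^2
  \le \frac{2\bigl(F(w_g(1)) - F^*\bigr)}{\eta\tau T}
  + LN\tau\eta\sum_{n=1}^{N}\xi_n^2
  + 2\Psi\sum_{n=1}^{N}\Bigl(d_w\gamma_n - \sum_{l=1}^{d_w}(k_n)_l\Bigr)
  + L^2\eta^2G^2\,\frac{(\tau-1)(2\tau-1)}{6}.
```

Only the first term decays with $T$. The third term is the bias attributed to device heterogeneity, and it vanishes when $d_w\gamma_n = \sum_{l}(k_n)_l$ for every device.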

F SELECTION OF MASKING VECTORS IN A TOY EXAMPLE

In a heterogeneous device setting, devices with lower computational capability train the learning model only partially. Those devices use masking vectors to freeze a portion of the learning model during training based on their computational capability. As shown in Theorem 1, employing masking vectors leads to a bias term in the convergence bound. In this section, we demonstrate how the bias value is impacted by the design of the masking vectors using a simple example.

We consider three devices. One device has the maximum computational capability, while different scenarios are considered for the computational capability of the other two devices. The devices aim to train a model with four parameters. Hence, their masking vectors have four elements. Fig. 2 shows the considered scenarios and the possible selections of masking vectors, which may lead to different bias values. For each scenario and for each possible selection, Table 6 shows vector $k_n$ and the value of $\gamma_n$ for each device as well as the obtained bias value.

For this example, the results in Table 6 illustrate that the bias value is minimized by freezing the same parameters for the two devices with lower computational capability when those two devices together cannot cover every parameter (Figs. 2(a) and 2(b)). However, when each parameter can be trained by at least one of those less capable devices, the bias value is minimized by freezing distinct parameters for the devices (Figs. 2(c)-2(f)). These results are compatible with the empirical results obtained by Pfeiffer et al. (2022) and Yang et al. (2022), where random selection is used by the devices to prevent freezing of the same parameters.
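The toy example can be reproduced in a few lines under two stated assumptions that go beyond what this appendix spells out: (i) $(k_n)_l = (m_n)_l / \sum_{n'} (m_{n'})_l$, i.e., each parameter is averaged over the devices that train it (consistent with all entries being $1/N$ when no device masks anything), and (ii) the "bias value" is the sum $\sum_n \bigl(d_w\gamma_n - \sum_l (k_n)_l\bigr)$ with $\gamma_n = \max_l (k_n)_l$. The function name `bias_value` and the specific mask layouts are hypothetical.

```python
import numpy as np


def bias_value(masks):
    """Bias term sum_n (d_w * gamma_n - sum_l (k_n)_l) for binary masks of shape (N, d_w).

    Assumes every parameter is trained by at least one device, so that the
    column sums used to normalise the masks are nonzero.
    """
    m = np.asarray(masks, dtype=float)
    k = m / m.sum(axis=0)            # assumed definition of k_n (see lead-in)
    gamma = k.max(axis=1)            # gamma_n = max_l (k_n)_l
    d_w = m.shape[1]
    return float(np.sum(d_w * gamma - k.sum(axis=1)))


full = [1, 1, 1, 1]                  # device 3 trains all four parameters

# Devices 1 and 2 each train one parameter (the setting of Fig. 2(a)).
same_a = bias_value([[1, 0, 0, 0], [1, 0, 0, 0], full])
distinct_a = bias_value([[1, 0, 0, 0], [0, 1, 0, 0], full])

# Devices 1 and 2 each train two parameters (the setting of Fig. 2(d)).
same_d = bias_value([[1, 1, 0, 0], [1, 1, 0, 0], full])
distinct_d = bias_value([[1, 1, 0, 0], [0, 0, 1, 1], full])

no_masking = bias_value([full, full, full])
```

Under these assumptions, training the same parameter gives the smaller value in the first setting, training distinct parameters gives the smaller value in the second, and the value is zero when no device masks anything, which matches the qualitative pattern described above.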

J COMPARISON OF TRAINING CURVES

This section includes some of the training curves for our experiments. Fig. 4 illustrates the training loss over communication rounds for PerFedMask and the baseline algorithms on the CIFAR-100 and DomainNet datasets. In Fig. 4, we have considered the full device participation scenario. The difference between the training loss of PerFedMask and that of FedBABU is the bias caused by the device heterogeneity. Although we have minimized this bias by solving $P_{\text{mask}}$, it has not been completely eliminated. Fine-tuning after training helps to improve the performance of PerFedMask. For $c = 0.1$, Fig. 5 shows the evolution of the validation accuracy over communication rounds for the PerFedMask, FedBABU, FedAvg, and Split-Mix FL algorithms on the CIFAR-10 and CIFAR-100 datasets.

L OF PERFEDMASK WITH HETEROFL AND SPLIT-MIX FL

holds for $\delta(1)$. Now, we assume that the inequality holds for $t$. We show that it also holds for $t + 1$. From Lemma 2, we have
$$\delta(t+1) \le (1 - q_0\eta(t))\,\delta(t) + q_1\eta(t) + q_2\eta^2(t)$$
$$\le \Bigl(1 - \frac{2}{t+\kappa}\Bigr)\frac{\beta(t)}{t+\kappa} + \frac{2q_1}{q_0(t+\kappa)} + \frac{4q_2}{q_0^2(t+\kappa)^2}$$
$$= \frac{t+\kappa-1}{(t+\kappa)^2}\beta(t) + \Bigl[\frac{2q_1}{q_0(t+\kappa)} + \frac{4q_2}{q_0^2(t+\kappa)^2} - \frac{\beta(t)}{(t+\kappa)^2}\Bigr]$$
$$\le \frac{t+\kappa-1}{(t+\kappa)^2}\beta(t) \le \frac{t+\kappa-1}{(t+\kappa)^2-1}\beta(t) \le \frac{\beta(t)}{t+\kappa+1}.$$

Finally, by the $L$-smoothness assumption for $F$, we have
$$\mathbb{E}F(w_g(T)) - F^* \le \frac{L}{2}\delta(T) \le \frac{L}{2}\,\frac{\beta(T)}{T+\kappa}. \tag{51}$$

From the definition of $\beta(t)$, we have $\beta(T) \le \frac{2q_1}{q_0}(T+\kappa) + \frac{4q_2}{q_0^2} + (\kappa+1)\,\delta(1)$. Combining this with (51) completes the proof of Theorem 2.

P PROOF OF LEMMA 3

First, we define two vectors $x$ and $y \in \mathbb{R}^N$, where $(x)_n = \gamma_n\bigl(1 - L\eta(t)(N\tau\gamma_n + 1)\bigr)$ and $(y)_n = f_n(w_g(t)) - f_n(w^*)$, $n \in [N]$, respectively. Let $X$ and $Y$, respectively, denote the corresponding diagonal matrices of vectors $x$ and $y$. By using Theorem 3 in Fang et al. (1994), we have the following inequality:
$$\mathrm{Tr}(XY) \ge \lambda_N(X)\,\mathrm{Tr}(Y) + \lambda_N(Y)\bigl(\mathrm{Tr}(X) - N\lambda_N(X)\bigr), \tag{52}$$
where $\lambda_N(X)$ and $\lambda_N(Y)$ are the smallest eigenvalues of matrices $X$ and $Y$, respectively. Since $X$ and $Y$ are diagonal matrices, we have $\lambda_N(X) = \min_{n\in[N]} (x)_n$ and $\lambda_N(Y) = \min_{n\in[N]} (y)_n$. By considering $\eta(t) \le \frac{1}{L(N\tau+1)}$, all the elements of vector $x$, including $\lambda_N(X)$, are nonnegative. Also, $(x)_n$ is a quadratic function with respect to $\gamma_n$. For $\frac{1}{N} \le \gamma_n \le 1$, by considering that $\eta(t) \le \frac{1}{L(N\tau+1)}$, it can be verified that the minimum value of $(x)_n$ is obtained at $\gamma_n = \frac{1}{N}$. Thus, $\lambda_N(X)$ is lower bounded as $\lambda_N(X) \ge \frac{1}{N}\bigl(1 - L\eta(t)(\tau + 1)\bigr)$. Hence, we have $-2\eta(t)\tau(1 - L\eta(t)) \cdots$ where for inequality (a), we use (52). We also use the fact that $-2\eta(t)\tau(1 - L\eta(t)) \le 0$. Note that since $\eta(t) \le \frac{1}{L(N\tau+1)}$, $\forall t$, the quadratic function $-2\eta(t)\tau(1 - L\eta(t))$ is always nonpositive. Inequality (b) results from the definition of $F(w)$. For inequality (c), since $F(w_g(t)) \ge F^*$, we use the fact that $-2\eta(t)\tau(1 - L\eta(t))\,\lambda_N(X)\,N\bigl(F(w_g(t)) - F^*\bigr) \le 0$. Finally, Lemma 3 is concluded by using the obtained lower bound for $\lambda_N(X)$ and by rearranging the terms in (53).



Using the techniques presented in Li et al. (2020b) and Karimireddy et al. (2020), the extension to the general case (i.e., $c \le 1$) would be straightforward.
Using curves similar to those in the figure in Appendix G, each device can determine its maximum number of trainable parameters.
Random layer-wise masking has been considered in Sidahmed et al. (2021) and Pfeiffer et al. (2022).
We also use AlexNet on the DomainNet dataset (Li et al., 2021) and provide the results in Appendix I to show the performance under the feature non-IID configuration.
We show some of the training curves in Appendix J. Also, in Appendix K, we present additional statistics for some of the results in Table 1.
We also investigate the effect of freezing the local head models in Appendix M.
Reduction percentage is the percentage change in a value compared to its maximum value.



Figure 1: Illustration of an FL system using PerFedMask. The model is decoupled into a global model and a local head model. The local head model remains unchanged during training. The devices collaboratively train the global model. Some parts of the global model can be frozen for the devices during local updates using the optimized masking vectors. After training, a personalized model is obtained for each device by fine-tuning.

, we decouple each learning model $\theta_n$ into a global representation model $w_g \in \mathbb{R}^{d_w}$ and a device-specific head model $\phi_n \in \mathbb{R}^{d_\phi}$, where $d_w$ and $d_\phi$ denote the dimensions of the global model and local head model, respectively. We have $d_w + d_\phi = d_\theta$. To further improve accuracy after training, we freeze the local head model parameters during training. In particular, before training, $\phi_g$ is initialized randomly by the server to be used for initialization of all the devices' local head models. We have $\phi_n = \phi_g$, $n \in [N]$. $\phi_n$ remains unchanged during training until convergence is reached for the global model $w_g$. The global model is obtained by minimizing the objective function

Training Procedure of PerFedMask
1: Input: Local datasets $D_n$; maximum number of trainable parameters $\psi_n$, $n \in [N]$; the number of local epochs $E$; the number of local batches $B$; participation ratio $c$.
2: Initialize the learning rate $\eta(1)$ and initialize randomly $\theta_g(1) := \{w_g(1), \phi_g\}$. # Model initialization
3: "Server Operation"
4: Select masking vector $m_n$ for each device $n$ by solving $P_{\text{mask}}$. # Optimized masking using (9a)-(9d)
5: for each communication round $t \in \{1, \ldots, T\}$ do
6: $N(t)$ ← Random subset of $\max(cN, 1)$ devices. # Selection of participating devices
7:

Published as a conference paper at ICLR 2023

B LOCAL UPDATES AT DEVICES

Algorithm 2: Local Update Function
1: function DeviceLocalUpdate($w_g$, $\phi_g$, $m$, $f$, $D$, $\eta$) # Local update iterations for each device
2: $B$ ← Split local data samples $D$ into $B$ local batches.
3: $i := 1$, and $w^i$ ← $w_g$.
4: for each local epoch $e \in \{1, \ldots, E\}$ do # The total number of local update iterations $\tau = E \times B$
5:
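The remaining steps of the local update are not shown here. A minimal sketch of a masked local update that is consistent with the form used in the convergence proof, namely $w_n^{i+1} = w_n^i - \eta\, m_n \odot \nabla f_n(w_n^i, b_n^i)$, is given below. The least-squares loss, the helper `grad_fn`, the random data, and all parameter values are hypothetical and chosen only so that the example is self-contained. The frozen head model $\phi_g$ is omitted.

```python
import numpy as np


def device_local_update(w_g, mask, grad_fn, batches, lr, num_epochs):
    """Run num_epochs passes of masked SGD starting from the global model w_g.

    Entries of `mask` equal to 0 mark frozen parameters: their gradient is
    zeroed out, so they keep the value received from the server.
    """
    w = np.array(w_g, dtype=float)               # local copy of the global model
    for _ in range(num_epochs):
        for batch in batches:
            w -= lr * mask * grad_fn(w, batch)   # element-wise masked step
    return w


def grad_fn(w, batch):
    """Gradient of a least-squares loss on one batch (illustrative stand-in)."""
    X, y = batch
    return X.T @ (X @ w - y) / len(y)


rng = np.random.default_rng(0)
batches = [(rng.normal(size=(5, 4)), rng.normal(size=5)) for _ in range(3)]
w_g = np.zeros(4)
mask = np.array([1.0, 1.0, 0.0, 0.0])            # train the first two parameters only
w_new = device_local_update(w_g, mask, grad_fn, batches, lr=0.1, num_epochs=2)
```

With this mask, the last two entries of `w_new` stay at their initial values, while the first two are updated over $\tau = E \times B = 6$ local iterations.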

Figure 2: Illustration of masking vectors selection in a network with three devices. Device 3 has the maximum computational capability to train ×1 parameters of the learning model. We consider different scenarios for computational capability of devices 1 and 2. (a) Devices 1 and 2 can train ×0.25 parameters of the learning model. (b) Devices 1 and 2 can train ×0.25 and ×0.5 parameters of the learning model, respectively. (c) Devices 1 and 2 can train ×0.25 and ×0.75 parameters of the learning model, respectively. (d) Devices 1 and 2 can train ×0.5 parameters of the learning model. (e) Devices 1 and 2 can train ×0.5 and ×0.75 parameters of the learning model, respectively. (f) Devices 1 and 2 can train ×0.75 parameters of the learning model. (g) All three devices have the maximum computational capability.

Figure 4: Training loss evolution over communication rounds for (a) CIFAR-100 and (b) DomainNet datasets. c is set to 1.

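$$\cdots\,\bigl(1 - L\eta(t)(N\tau\gamma_n + 1)\bigr)\bigl(f_n(w_g(t)) - f_n(w^*)\bigr)$$
$$\overset{(a)}{\le} -2\eta(t)\tau(1 - L\eta(t))\,\lambda_N(X)\sum_{n=1}^{N}\bigl(f_n(w_g(t)) - f_n(w^*)\bigr) - 2\eta(t)\tau(1 - L\eta(t))\min_{n\in[N]}\bigl(f_n(w_g(t)) - f_n(w^*)\bigr)\bigl(\mathrm{Tr}(X) - N\lambda_N(X)\bigr)$$
$$\overset{(b)}{\le} -2\eta(t)\tau(1 - L\eta(t))\,\lambda_N(X)\,N\bigl(F(w_g(t)) - F^*\bigr) + 2\eta(t)\tau(1 - L\eta(t))\,\Upsilon\bigl(\mathrm{Tr}(X) - N\lambda_N(X)\bigr)$$
$$\cdots\,L\eta(t)(N\tau\gamma_n + 1)) - (1 - L\eta(t)(\tau + 1)),$$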

Datasets and Model Architectures. We conduct our experiments on CIFAR-10 and CIFAR-100 image classification tasks. Our experiments are performed with ResNet (PreResNet18) (He et al., 2016) for CIFAR-10, and with MobileNet (Howard et al., 2017) for CIFAR-100. In both cases, we set the number of devices to 100. The data samples are uniformly divided among the devices. Each device has 450 training data samples, 50 validation data samples, and 100 test data samples. The batch size is set to 50. To enable non-IID data partitioning among the devices, we distribute 3 and 10 classes per device for the CIFAR-10 and CIFAR-100 datasets, respectively. The same classes are considered in the training, validation, and test datasets.

Implementation Details. We denote the fraction of devices with the maximum computational capability by $\nu$. That is, $\nu$ represents the ratio of devices that can update the entire global model during the local update iterations. Those devices are able to train ×1 parameters of the global model. Since the remaining devices have lower computational capability, they should mask some parts of the global model during the local update iterations based on their capabilities. Unless stated otherwise, we set $\nu = 0.5$. For all the experiments, the learning rate starts at 0.1 and is decayed by a factor of 0.1 in communication rounds $t \in \{\frac{1}{2}T, \frac{3}{4}T\}$. Similar to FedBABU
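The learning-rate schedule described above can be sketched as a step function. This is an illustrative reading rather than the authors' code: the function name `learning_rate` is hypothetical, and it is an assumption that the decay takes effect from the milestone round onward (i.e., at $t \ge \frac{1}{2}T$ and $t \ge \frac{3}{4}T$).

```python
def learning_rate(t, T, base_lr=0.1, decay=0.1):
    """Step schedule: start at base_lr and multiply by `decay` at T/2 and 3T/4.

    Assumes the decay applies from the milestone round onward (t >= milestone).
    """
    lr = base_lr
    for milestone in (T / 2, 3 * T / 4):
        if t >= milestone:
            lr *= decay
    return lr


# Example with a hypothetical T = 100 communication rounds.
schedule = [learning_rate(t, T=100) for t in range(1, 101)]
```

With $T = 100$, this gives three plateaus: the initial rate for rounds 1-49, one tenth of it for rounds 50-74, and one hundredth of it from round 75 onward.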

Test accuracy after fine-tuning and number of trainable parameters of PerFedMask and the baseline algorithms for CIFAR-10 and CIFAR-100 datasets

FedBABU and outperforms the other baselines in terms of test accuracy after fine-tuning. (2) By increasing the number of devices participating in FL (i.e., by increasing $c$), a higher test accuracy can be achieved. (3) PerFedMask, HeteroFL, and Split-Mix FL can provide a lower number of trainable parameters. Split-Mix FL has the lowest number of trainable parameters because it trains several low-width base models instead of the original learning model. (4) A different number of local update iterations across devices may lead to the objective inconsistency problem. The PerFedMask, HeteroFL, and Split-Mix FL algorithms can address this problem by decreasing the number of FLOPs for the less capable devices. Thus, the same number of local update iterations can be considered for all the devices. FedNova also tackles this problem by modifying the optimizer. However, the other algorithms suffer from the objective inconsistency problem, which may degrade their performance.

Performance comparison on CIFAR-10 dataset when c = 1. Results for CIFAR-100 dataset can be found in Appendix L.

Performance according to fine-tuning steps when c = 1

Results of increasing ν for CIFAR-100 dataset when c = 1.

Results of different approaches for masking vectors design for CIFAR-100 dataset when c = 1 and ν = 0.2.

Performance comparison on CIFAR-100 dataset when c = 1

EFFECT OF FREEZING THE LOCAL HEAD MODELS DURING TRAINING

In this section, we investigate the impact of freezing the local head models on the test accuracy of PerFedMask. In particular, we compare the test accuracy of PerFedMask with (w/) and without (w/o) freezing the local head models. In the w/o head freezing scenario, the learning model is not decoupled into the global model and the local head model. Table 9 shows that the test accuracy of PerFedMask is improved by keeping the local head models frozen during training.

Test accuracy of PerFedMask with and without freezing the local head models

ACKNOWLEDGMENTS

This work was supported in part by Rogers Communications Canada Inc., Natural Sciences and Engineering Research Council of Canada (NSERC), and Public Safety Canada (NS-5001-22170).


Table 6: $k_n$, $\gamma_n$, and the bias value for the possible masking vector selections of the scenarios shown in Fig. 2. The selection with the minimum bias value is chosen based on Remark 1.

Columns: Scenario; Possible selections; $k_n$; $\gamma_n$; Bias value.

The results in Table 6 also show that by increasing the computational capability of the devices, the bias value can be decreased. In the extreme case that all three devices have the maximum computational capability (i.e., Fig. 2(g)), the bias value is zero.

G EFFECT OF MASKING RATE ON THE COMPUTATIONS

By increasing the masking rate (i.e., $1 - \frac{\psi_n}{d_w}$), each device can reduce the number of trainable parameters and the number of FLOPs based on its computational capability. Fig. 3 shows the reduction percentage that can be obtained for the number of trainable parameters and the number of FLOPs versus the masking rate. For the results in Fig. 3, we have considered that a device performs each local update iteration on a batch containing 50 data samples of the CIFAR-10 and CIFAR-100 datasets using ResNet (PreResNet18) and MobileNet, respectively.

H SOLVING PROBLEM P MASK BY SCA

In this section, we first transform the non-convex constraints in optimization problem $P_{\text{mask}}$ into convex functions or a difference of two convex functions. Then, by using the successive convex approximation (SCA) method, we can obtain a suboptimal solution for problem $P_{\text{mask}}$ in polynomial time. We first rewrite constraint (9a) in the form of the following inequalities: Since ( mn ) j is a binary variable, the non-convex constraint (31b) can be expressed as the following convex constraint: The non-convex constraint (31a) can be rewritten as follows: Equality ( kn) 2 can be used to express the left-hand side of (33) as a difference of two convex functions. We have Next, we relax the binary constraint (9c) in the form of the difference of two convex functions as follows: Finally, we define the following functions: where vector Mj = (( mn ′ ) j , n ′ ∈ [N ]).

Algorithm 3 describes the SCA algorithm for solving problem $P_{\text{mask}}$. Let $i$ denote the iteration index. In Line 1, we initialize the maximum number of iterations $i_{\max}$. In Line 2, the decision variables m(1) n and k(1) n , n ∈ [N ], are initialized in iteration $i = 1$ with a feasible solution of problem $P_{\text{mask}}$. In Line 4, the optimal solution of problem $P_{\text{mask-SCA}}$ (i.e., m * n and k * n , n ∈ [N ]) is determined. In Line 5, using m * n and k * n , we update m(i+1) to be used for obtaining the first-order approximations Θ(( mn ) j ) and θ(( kn ) j , Mj ) of functions Θ(( mn ) j ) and ϑ(( kn ) j , Mj ), respectively. The iteration index is updated in Line 6. The steps within Lines 3 to 7 are repeated until the algorithm converges to a solution or $i_{\max}$ is reached. The convex problem $P_{\text{mask-SCA}}$, which is solved in each iteration $i$, is as follows: 32), (35b), and (9d),

6: Set i := i + 1.
7: Until i = i max or mn and kn, n ∈ [N ] converge.
8: Return mopt n := m(i) n and kopt

I PERFORMANCE COMPARISON FOR DOMAINNET DATASET

Different from the class non-IID configuration considered for the CIFAR-10 and CIFAR-100 datasets, in this section, we evaluate the performance of our proposed algorithm under the feature non-IID configuration. We perform our experiment with AlexNet on the DomainNet dataset. The dataset contains images of six distinct domains: Clipart, Infograph, Painting, Quickdraw, Real, and Sketch. We consider 30 devices in the network, and split each domain among 5 devices. We set the number of communication rounds and the learning rate to 100 and 0.01, respectively. Table 7 shows the obtained test accuracy after fine-tuning and the number of trainable parameters for PerFedMask compared to the other baseline algorithms. The results in Table 7 indicate that PerFedMask can achieve a higher test accuracy than the HeteroFL and Split-Mix FL algorithms, while its number of trainable parameters is much lower than that of the FedBABU, FedProx, FedNova, and FedAvg algorithms.

By considering the convexity of $\|\cdot\|^2$ and by using Assumption 1, we can bound $B_1$ as follows: Next, we aim to bound $B_2$. We have We first obtain an upper bound of $C_1$. We have where inequality (a) results from the triangle and Hölder's inequalities. Inequality (b) results from the inequality of arithmetic and geometric means (AM-GM). For inequality (c), we use Assumption 1. Next, we obtain an upper bound of $C_2$ by using Lemma 1. In particular, by considering $x = w^* - w_n^i(t)$, $y = k_n$, and $z = \nabla f_n(w_n^i(t))$ in Lemma 1, we can bound $C_2$ as follows: where the last inequality results from Assumption 4. Now, we focus on bounding $D$. Considering the inequality $\|x + y\|^2 \le 2\|x\|^2 + 2\|y\|^2$ for any $x, y \in \mathbb{R}^d$, and by replacing $x$ with $w_n^i(t) - w^*$ and $y$ with $w_g(t) - w_n^i(t)$, we have By combining (38)-(43), we obtain By rearranging the terms in (44), we have Now, we aim to bound $f_n(w_n^i(t)) - f_n(w^*)$ as follows: where inequality (a) results from the convexity of $f_n(w)$, inequality (b) is obtained by using the Cauchy-Schwarz and AM-GM inequalities, and inequality (c) is due to Assumption 1. By combining (26), (45), and (46), we have Lemma 3. For η(t) ≤ Proof. See Appendix P. Using Lemma 3, we can simplify (47) as follows:

O PROOF OF THEOREM 2

First, through induction, we show that for a diminishing step size $\eta(t) = \frac{2}{q_0(t+\kappa)}$, we have $\delta(t) \le \frac{\beta(t)}{t+\kappa}$, where $\beta(t) = \max\bigl\{\frac{2q_1}{q_0}(t+\kappa) + \frac{4q_2}{q_0^2},\ (\kappa+1)\,\delta(1)\bigr\}$. Note that the considered $\eta(t)$ satisfies the condition mentioned in Lemma 2. Moreover, the definition of $\beta(t)$ ensures that the inequality

