FEDERATED SEMI-SUPERVISED LEARNING WITH DUAL REGULATOR

Abstract

Federated learning emerges as a powerful method to learn from decentralized heterogeneous data while protecting data privacy. Federated semi-supervised learning (FSSL) is even more practical and challenging, where only a fraction of data can be labeled due to high annotation costs. Existing FSSL methods, however, assume independent and identically distributed (IID) labeled data across clients and a consistent class distribution between labeled and unlabeled data within a client. In this work, we propose a novel FSSL framework with a dual regulator, FedDure, to optimize and customize model training according to the specific data distributions of clients. FedDure lifts the previous assumption with a coarse-grained regulator (C-reg) and a fine-grained regulator (F-reg): C-reg regularizes the updating of the local model by tracking the learning effect on the labeled data distribution; F-reg learns an adaptive weighting scheme tailored for unlabeled instances in each client. We further formulate client model training as bi-level optimization that adaptively optimizes the model in each client with the two regulators. Theoretically, we show the convergence guarantee of the dual regulator. Empirically, we demonstrate that FedDure is superior to existing methods across a wide range of settings, notably by more than 12% on the CIFAR-10 and CINIC-10 datasets.

1. INTRODUCTION

Federated learning (FL) is an emerging privacy-preserving machine learning technique (McMahan et al., 2017), where multiple clients collaboratively learn a model under the coordination of a central server without exchanging private data. Edge devices like mobile phones have generated and stored a large amount of private data. Centralizing these data could lead to data privacy issues (Voigt & Von dem Bussche, 2017). Federated learning is a decentralized learning paradigm to leverage these data and has empowered a wide range of applications in many industries, including healthcare (Kaissis et al., 2020; Li et al., 2019), consumer products (Hard et al., 2018; Niu et al., 2020), and public security (Zhuang et al., 2022). The majority of existing FL works (McMahan et al., 2017; Wang et al., 2020; Li et al., 2021a) assume that the private data in clients are fully labeled, but this assumption is unrealistic in real-world federated applications, as annotating data is time-consuming, laborious, and expensive. To remedy this issue, federated semi-supervised learning (FSSL) is proposed to improve model performance with a large amount of unlabeled data on each client. In particular, prior work (Jeong et al., 2021) has achieved competitive performance by exploring inter-client mutual knowledge via an inter-client consistency loss. However, such methods usually focus on mitigating heterogeneous data distribution across clients (External Imbalance) and assume that the class distribution between the labeled and unlabeled data is consistent. These assumptions enforce strict requirements on data annotation and would not be practical in many real-world applications. A general case is that labeled data and unlabeled data have different data distributions (Internal Imbalance).
For example, the photo gallery on a mobile phone contains many more unlabeled images and irrelevant samples than the ones that can be labeled manually for an image classification task (Yang et al., 2011). Besides, these existing methods require sharing additional information among clients, which could impose potential privacy risks. Specifically, they transmit models among clients to provide consistency regularization. However, inter-client interactions might open a loophole for unauthorized infringement of privacy (Chen et al., 2019); many reverse-engineering techniques (Yin et al., 2020) can even recover client data from the exchanged models, threatening users' data privacy and security (Yonetani et al., 2017; Wu et al., 2018). In this paper, we propose a flexible federated semi-supervised learning framework with a dual regulator, termed FedDure, to handle the internal and external imbalance problems while accounting for privacy. FedDure explores adaptive regulators to flexibly update the model parameters in clients; it dynamically adjusts the dual regulators to optimize model training in each client according to the learning process and outcome on the client's specific data distribution. Our framework includes two novel components: a coarse-grained regulator (C-reg) and a fine-grained regulator (F-reg). On the one hand, C-reg regularizes the updating of the local model by tracking the learning effect on the labeled data distribution. It mitigates the distribution mismatch between labeled and unlabeled data to prevent corrupted pseudo labels and maintain generalization ability. On the other hand, F-reg learns an adaptive weighting scheme tailored for each client; it automatically assigns a soft weight to each unlabeled instance to measure its contribution to training. This scheme adjusts the instance-level weights to strengthen (or weaken) an instance's influence according to the feedback of F-reg on the labeled data.
FedDure utilizes a bi-level optimization strategy to alternately update the local model and the dual regulators in clients. We theoretically show that C-reg and F-reg converge with guarantees and empirically demonstrate that FedDure outperforms other methods on various benchmarks. The main contributions of this work are three-fold. (1) We address a more practical scenario of FSSL, where data distribution differs not only across clients (external imbalance) but also between labeled and unlabeled data within a client (internal imbalance). (2) We propose a flexible FSSL framework with a dual regulator (FedDure), which designs adaptive regulators to flexibly update the local model according to the learning processes and outcomes on the specific data distribution of each client. It does not require sharing additional information among clients. (3) We theoretically analyze and present the convergence of the dual regulator and empirically demonstrate that FedDure is superior to state-of-the-art FSSL approaches across multiple benchmarks.

2. RELATED WORK

Federated Learning (FL) is an emerging distributed training technique that trains models on decentralized clients and aggregates model updates in a central server (Yang et al., 2019). It protects data privacy as raw data are always kept locally. FedAvg (McMahan et al., 2017) is a pioneering work that aggregates model updates by weighted averaging. Statistical heterogeneity is an important challenge of FL in real-world scenarios, where the data distribution is inconsistent among clients (Li et al., 2020a). A plethora of works have been proposed to address this challenge with approaches like extra data sharing, regularization, new aggregation mechanisms, and personalization (Zhao et al., 2018; Li et al., 2020b; Zhuang et al., 2020; Li et al., 2021b; Gao et al., 2022). However, these approaches commonly consider only supervised learning settings and cannot simply be applied to scenarios where only a small portion of data is labeled. Numerous studies focus on purely unsupervised federated learning, but they are either application-specific or only learn generic visual representations (Zhuang et al., 2021a;b); they do not effectively leverage the small fraction of labeled data that could exist in real-world applications. Our work primarily focuses on federated semi-supervised learning, where a small fraction of data has labels in each client. Semi-Supervised Learning aims to utilize unlabeled data for performance improvements and is usually divided into two popular branches: pseudo labeling and consistency regularization. Pseudo-labeling methods (Lee et al., 2013; Zou et al., 2022; Pham et al., 2021) usually generate artificial labels for unlabeled data from the model trained on labeled data and apply the filtered high-confidence labels as supervised signals for unlabeled data training. MPL (Pham et al., 2021) extends knowledge distillation and meta-learning to SSL by optimizing the teacher model with feedback from the student model.
Consistency regularization (Lee et al., 2022; Tarvainen & Valpola, 2017) regularizes the outputs of different perturbed versions of the same input to be consistent. Many works (Sohn et al., 2020; Zhang et al., 2021a; Lee et al., 2022) apply data augmentation as a perturbation strategy for pursuing outcome consistency. FixMatch (Sohn et al., 2020) follows UDA and brings the idea of pseudo labels to model training, with unlabeled samples filtered by a fixed threshold. Federated Semi-Supervised Learning (FSSL) considers learning models from decentralized clients where a small amount of labeled data resides on either clients or the server (Jin et al., 2020). FSSL scenarios can be classified into three categories: (1) Labels-at-Server assumes that clients have purely unlabeled data and the server contains some labeled data (Lin et al., 2021; He et al., 2021; Zhang et al., 2021b; Diao et al., 2021); (2) Labels-at-Clients considers that each client has mostly unlabeled data and a small amount of labeled data (Jeong et al., 2021); (3) Labels-at-Partial-Clients assumes that the majority of clients contain fully unlabeled data while the remaining clients have fully labeled data (Zhang et al., 2021b; Lin et al., 2021; Liang et al., 2022). Labels-at-Clients has been largely overlooked; the prior work (Jeong et al., 2021) proposes an inter-client consistency loss, but it shares extra information among clients and bypasses the internal class imbalance issue. This work introduces the dual regulator to address the issue, with no extra information shared among clients. Class Imbalance Methods are concerned with dataset resampling (Chawla et al., 2002; Buda et al., 2018) or loss reweighting (Ling & Sheng, 2008; Sun et al., 2007; Cui et al., 2019) for gradient calculation. In the centralized learning setting, many methods (Chawla et al., 2002; Liu et al., 2008) focus on resampling from the minority class for a balanced class-wise distribution.
Important examples receive more attention and are assigned larger weights than others to accelerate the optimization of networks. The importance of an example is quantified by its loss (Zhao & Zhang, 2015) or its uncertainty (Chang et al., 2017). Besides, loss reweighting methods usually formulate the weighting scheme based on certain prior knowledge. Typical examples include AdaBoost (Freund & Schapire, 1997) and the focal loss (Lin et al., 2017), which focus on hard examples and assign larger weights to them.
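As a concrete illustration of loss reweighting, the focal loss mentioned above can be sketched in a few lines. This is a generic sketch of the published technique, not part of this paper's method; the value of gamma is the commonly used default, an assumption here.

```python
import numpy as np

def focal_loss(p_true, gamma=2.0):
    """Focal loss for the predicted probability of the true class:
    the (1 - p)^gamma factor down-weights easy, well-classified examples
    so that hard examples dominate the gradient."""
    return -((1.0 - p_true) ** gamma) * np.log(p_true + 1e-12)
```

With gamma = 0 this reduces to ordinary cross-entropy; larger gamma shifts the weighting further toward hard examples.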

3. METHOD

This section first defines the problem and introduces a novel framework with dual regulators (FedDure). Using the dual regulator, we then build a bi-level optimization strategy for federated semi-supervised learning.

3.1. PROBLEM DEFINITION

We focus on Federated Semi-Supervised Learning (FSSL) with external and internal class imbalance, where each client's data is partially labeled. We assume there are $K$ clients, denoted $\{C_1, \ldots, C_K\}$. Federated learning aims to train a generalized global model $f_g$ with parameters $\theta_g$. It coordinates decentralized clients to train their local models $F_l = \{f_{l,1}, \ldots, f_{l,K}\}$ with parameters $\{\theta_{l,1}, \ldots, \theta_{l,K}\}$, where each client is only allowed to access its own private dataset. In the standard semi-supervised setting, the dataset contains a labeled set $D^s = \{x_i, y_i\}_{i=1}^{N^s}$ and an unlabeled set $D^u = \{u_i\}_{i=1}^{N^u}$, where $N^s \ll N^u$. Under FSSL, the private dataset $D_k$ of each client $C_k$ contains $N_k^s$ labeled instances $D_k^s = \{x_{i,k}, y_{i,k}\}_{i=1}^{N_k^s}$ and $N_k^u$ unlabeled instances $D_k^u = \{u_{i,k}\}_{i=1}^{N_k^u}$. In this work, we primarily focus on image datasets. For an unlabeled image $u_k$ in client $C_k$, we compute the corresponding pseudo label $\hat{y}_k$ as

$$\hat{y}_k = \arg\max\big(f_{l,k}(T_w(u_k); \theta_{l,k})\big), \tag{1}$$

where $T_w(u_k)$ is the weakly-augmented version of $u_k$. The pseudo-labeled dataset in client $C_k$ is denoted $\hat{D}_k^u = \{u_{i,k}, \hat{y}_{i,k}\}_{i=1}^{N_k^u}$. For simplicity of notation, we omit the subscript $k$ in the parameters hereafter.
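Pseudo-label generation in Eqn. 1 can be sketched as follows. The roll-based weak augmentation and the logits function are illustrative assumptions standing in for the paper's actual augmentation pipeline and local network.

```python
import numpy as np

def weak_augment(u, rng):
    # Hypothetical weak augmentation T_w: a small random horizontal shift.
    return np.roll(u, rng.integers(-2, 3), axis=-1)

def pseudo_label(local_logits, u, rng):
    """Eqn. 1: the hard pseudo label is the argmax of the local model's
    output on the weakly augmented view of an unlabeled image."""
    return int(np.argmax(local_logits(weak_augment(u, rng))))
```

In the paper the same local model $f_l$ that is being trained produces these labels, so their quality improves over the course of local training.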

3.2. DUAL REGULATOR

In this section, we present federated semi-supervised learning with dual regulator, termed FedDure. FedDure adaptively adjusts gradient updates in each client according to the class distribution characteristics with two regulators, a coarse-grained regulator (C-reg) and a fine-grained regulator (F-reg). Figure 1 depicts the optimization process with these two regulators. We first introduce the regulators in this section and present the optimization process in Section 3.3.

Coarse-grained Regulator (C-reg). C-reg dynamically regulates the importance of local training on the unlabeled data by quantifying the learning effect using labeled data. We define C-reg as $f_d$ with parameters $\phi$. Intuitively, C-reg and the local model are collaboratively optimized during the local training process in each client. On the one hand, C-reg is trained using pseudo labels generated by the local model in a supervised manner. On the other hand, the parameters of the local model are further rectified according to the feedback from C-reg. Specifically, at training iteration $t$, the coarse-grained regulator searches its optimal parameters $\phi^*$ by minimizing the cross-entropy loss on unlabeled data with pseudo labels:

$$\phi^* = \arg\min_{\phi} \mathbb{E}_u \big[ \mathcal{L}_{ce}\big(\hat{y}, f_d(T_s(u); \phi^{(t)})\big) \big], \quad \text{w.r.t.} \ \hat{y} = \arg\max\big(f_l(T_w(u); \theta_l)\big), \tag{2}$$

where $T_s(u)$ is the strongly-augmented version of the unlabeled image $u$. The optimal parameters $\phi^*$ are related to the local model's parameters $\theta_l$ via the generated pseudo labels, and we denote this relationship as $\phi^*(\theta_l)$. In practice, exploring the optimal parameters by fully training the coarse-grained regulator requires heavy computation, so we design a strategy to approximate $\phi^*$. Conceptually, we approximate $\phi^*$ by performing one gradient step from the parameters at training iteration $t$ (i.e., $\phi^{(t)}$).
We establish the meta-learning process between $\phi$ and $\theta_l$ as follows:

$$\phi^*(\theta_l^{(t)}) \approx \phi^{(t+1)}(\theta_l^{(t)}) = \phi^{(t)} - \eta_s \nabla_{\phi^{(t)}} \mathbb{E}_u \big[ \mathcal{L}_{ce}\big(\hat{y}, f_d(T_s(u); \phi^{(t)})\big) \big], \tag{3}$$

where the dependence on $\theta_l^{(t)}$ enters through the pseudo label $\hat{y}$. Practically, we update C-reg by utilizing the updated fine-grained regulator to measure the adaptive weight for each unlabeled instance, where the updated fine-grained regulator is obtained based on one gradient step of C-reg ($\phi^{-}$ in Eqn. 6). The formulation to optimize C-reg is as follows:

$$\phi^{(t+1)} = \phi^{(t)} - \eta_s H(w^{(t+1)}; \phi^{(t)}) \nabla_{\phi^{(t)}} \mathbb{E}_u \big[ \mathcal{L}_{ce}\big(\hat{y}, f_d(T_s(u); \phi^{(t)})\big) \big], \tag{4}$$

where $H(w^{(t+1)}; \phi^{(t)}) = f_w\big(f_d(T_s(u); \phi^{(t)}); w^{(t+1)}\big)$, $f_w$ is the fine-grained regulator (F-reg), and $w^{(t+1)}$ is the parameters of F-reg updated by Eqn. 7, which we present in the following subsection. Next, we quantify the learning effect of the local model with C-reg on labeled samples by computing the entropy difference $d^{(t+1)}$ of C-reg between training iterations $t$ and $t+1$:

$$d^{(t+1)} = \mathbb{E}_{x,y} \big[ \mathcal{L}_{ce}\big(y, f_d(x; \phi^{(t)})\big) - \mathcal{L}_{ce}\big(y, f_d(x; \phi^{(t+1)})\big) \big]. \tag{5}$$

The learning effect is further used as reward information to optimize the local model by regulating the importance of local training on unlabeled data. In particular, the cross-entropy difference signifies the generalization gap for the C-reg updated by the pseudo labels from the local model. We provide a detailed derivation of the optimization of the local model with C-reg in Appendix A.4.

Fine-grained Regulator (F-reg). F-reg regulates the importance of each unlabeled instance in local training. It addresses the challenge that $f_d$ is substantially hindered by corrupted labels or class imbalance, especially in the early training rounds. This challenge could further negatively impact the optimization of the local model $f_l$. Furthermore, previous methods usually utilize a fixed threshold to filter noisy pseudo labels in all clients.
It is infeasible for clients with heterogeneous data when the skewed data distribution across clients is unknown. To tackle the challenge, we construct F-reg $f_w$ parameterized by $w$. It learns an adaptive weighting scheme tailored for each client according to the unlabeled data distribution characteristics. A unique weight is generated for each unlabeled image to measure the contribution of the image to the overall performance. We first perform one gradient step update of C-reg $\phi$ to associate F-reg and C-reg:

$$\phi^{-}(w^{(t)}) = \phi^{(t)} - \eta_s H(w^{(t)}; \phi^{(t)}) \nabla_{\phi^{(t)}} \mathbb{E}_u \big[ \mathcal{L}_{ce}\big(\hat{y}, f_d(T_s(u); \phi^{(t)})\big) \big], \tag{6}$$

where the one-gradient-step C-reg $\phi^{-}$ depends on the F-reg parameters $w^{(t)}$. Next, we optimize F-reg in local training iteration $t$ as follows:

$$w^* \approx w^{(t+1)} = w^{(t)} - \eta_w \nabla_{w^{(t)}} \mathbb{E}_{x,y} \big[ \mathcal{L}_{ce}\big(y, f_d(x; \phi^{-}(w^{(t)}))\big) \big], \tag{7}$$

where $f_d(x; \phi^{-}(w^{(t)}))$ is the output of $f_d$ on labeled data. We then introduce a re-weighting scheme that calculates a unique weight $m_i$ for the $i$-th unlabeled sample:

$$m_i = f_w\big(f_l(T_s(u_i); \theta_l^{(t)}); w^{(t+1)}\big). \tag{8}$$

Note that $m_i$ is a scalar that re-weights the importance of the corresponding unlabeled image.
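The per-instance weight of Eqn. 8 can be sketched with a single linear layer plus sigmoid standing in for the F-reg MLP; this stand-in, and the bounding of the weight to (0, 1), are illustrative assumptions rather than the paper's exact architecture.

```python
import numpy as np

def f_reg_weight(logits, w):
    """Eqn. 8 sketch: map the local model's logits on a strongly augmented
    unlabeled image to a scalar weight m_i in (0, 1). The real F-reg is an
    MLP; a linear layer + sigmoid is used here for brevity."""
    return 1.0 / (1.0 + np.exp(-float(logits @ w)))
```

Because the weight is produced from the local model's own logits, confidently (or poorly) predicted instances can be up- or down-weighted per client, replacing a globally fixed confidence threshold.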

3.3. BI-LEVEL OPTIMIZATION

In this section, we present the optimization processes for the dual regulator and the local model $\theta_l$. We alternately train the two regulators, approximating a gradient-based bi-level optimization procedure (Finn et al., 2017; Liu et al., 2018). Then, we update the local model with fixed C-reg and F-reg.

Update F-reg. Firstly, we obtain the one-gradient-step update $\phi^{-}$ of C-reg using Eqn. 6. After that, the supervised loss $\mathcal{L}_{ce}\big(y, f_d(x; \phi^{-}(w^{(t)}))\big)$ guides the update of F-reg with Eqn. 7.

Update C-reg. After updating the parameters of F-reg, we update C-reg by Eqn. 4, regarding the local model $\theta_l^{(t)}$ and the updated F-reg $w^{(t+1)}$ as fixed parameters.

Update Local Model with F-reg. We use the updated F-reg $w^{(t+1)}$ to calculate a unique weight $m_i$ for the $i$-th unlabeled sample with Eqn. 8. The gradient is formulated as:

$$g_u^{(t)} = \mathbb{E}_u \big[ m \cdot \nabla_{\theta_l^{(t)}} \mathcal{L}_{ce}\big(\hat{y}, f_l(T_s(u); \theta_l^{(t)})\big) \big]. \tag{9}$$

Update Local Model with C-reg. We then use C-reg to calculate the entropy difference $d^{(t+1)}$ using Eqn. 5. The entropy difference $d^{(t+1)}$ is adopted as a reward coefficient to adjust the gradient update of the local model on unlabeled data:

$$g_d^{(t)} = d^{(t+1)} \cdot \nabla_{\theta_l^{(t)}} \mathbb{E}_u \big[ \mathcal{L}_{ce}\big(\hat{y}, f_l(T_s(u); \theta_l^{(t)})\big) \big], \tag{10}$$

where the learning process can be derived by the meta-learning strategy shown in Appendix A.4.

Update Local Model with Supervised Loss. Besides, we compute the gradient of the local model on labeled data as follows:

$$g_s^{(t)} = \nabla_{\theta_l^{(t)}} \mathbb{E}_{x,y} \big[ \mathcal{L}_{ce}\big(y, f_l(x; \theta_l^{(t)})\big) \big]. \tag{11}$$

On this basis, we update the local model's parameters with the gradients computed in Eqns. 9, 10, and 11:

$$\theta_l^{(t+1)} = \theta_l^{(t)} - \eta \big( g_s^{(t)} + g_u^{(t)} + g_d^{(t)} \big), \tag{12}$$

where $\eta$ denotes the learning rate of the local model. Finally, after $T$ local iterations, the local model is returned to the central server.
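The three gradient terms and their combination in Eqn. 12 can be sketched for a toy linear softmax classifier. The linear model is an illustrative stand-in for the local network, and the F-reg weights and C-reg reward are passed in as precomputed scalars.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def ce_grad(theta, x, y):
    # Cross-entropy gradient for a linear softmax classifier theta @ x.
    p = softmax(theta @ x)
    p[y] -= 1.0
    return np.outer(p, x)

def local_step(theta, labeled, pseudo, m, d, lr):
    """Eqns. 9-12 sketch: g_s from labeled pairs, g_u weighted per instance
    by the F-reg weights m (Eqn. 9), g_d scaled by the C-reg entropy
    difference d (Eqn. 10), then one combined update (Eqn. 12)."""
    g_s = np.mean([ce_grad(theta, x, y) for x, y in labeled], axis=0)
    g_u = np.mean([mi * ce_grad(theta, u, yh)
                   for (u, yh), mi in zip(pseudo, m)], axis=0)
    g_d = d * np.mean([ce_grad(theta, u, yh) for u, yh in pseudo], axis=0)
    return theta - lr * (g_s + g_u + g_d)
```

Note how a negative reward d (C-reg got worse on labeled data) flips the sign of the unsupervised contribution, pushing the local model away from pseudo labels that hurt generalization.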
The server updates the global model $\theta_g^{(r+1)}$ by weighted averaging of the parameters of the received local models in the current round, and round $r+1$ is conducted by sending $\theta_g^{(r+1)}$ to the randomly selected clients as initialization. Alg. 1 presents the overall optimization process.

Algorithm 1 Federated Semi-supervised Learning with Dual Regulator (FedDure)
Require: $K$: number of clients; $S$: number of selected clients in each round; $R$: number of training rounds; $T$: number of local iterations
1: RunServer
2: Initialize $\theta_g$ and $w$ for each client
3: for each round $r$ from $0$ to $R-1$ do
4:   Randomly select $\{C_k\}_{k=1}^{S}$ from $K$ clients
5:   for each $k \in [1, S]$ in parallel do
6:     $\theta_{l,k}^{(r+1)} \leftarrow$ RunClient($\theta_g^{(r)}$)
7:   end for
8:   $\theta_g^{(r+1)} \leftarrow \frac{1}{\sum_{k=1}^{S}(|D_k^s| + |D_k^u|)} \sum_{k=1}^{S} (|D_k^s| + |D_k^u|) \cdot \theta_{l,k}^{(r+1)}$  ▷ Aggregation
9: end for
10: return $\theta_g^{(R)}$
11: RunClient($\theta_g^{(r)}$)
12: $\theta_l^{(0)} \leftarrow \theta_g^{(r)}$; $\phi^{(0)} \leftarrow \theta_g^{(r)}$
13: for each local iteration $t$ from $0$ to $T-1$ do
14:   for minibatches from $D_k^u$ and $D_k^s$ do
15:     $\hat{y}_{i,k} \leftarrow$ generate pseudo labels for unlabeled data with Eqn. 1; $\hat{D}_k^u \leftarrow \{u_{i,k}, \hat{y}_{i,k}\}_{i=1}^{N_k^u}$
16:     Update the fine-grained regulator $w$ with Eqn. 7
17:     Compute the instance weight $m_i$ with Eqn. 8
18:     Update the coarse-grained regulator $\phi$ with Eqn. 4
19:     Compute the entropy difference $d$ with Eqn. 5
20:     Compute the local model's gradients $g_u^{(t)}$, $g_d^{(t)}$, $g_s^{(t)}$ following Eqns. 9, 10, and 11
21:     Update the local model $\theta_l$ with Eqn. 12
22:   end for
23: end for
24: return $\theta_l^{(T)}$
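The aggregation step (line 8 of Alg. 1) is FedAvg-style weighting by each client's total data count; a minimal sketch, with parameters represented as flat arrays:

```python
import numpy as np

def aggregate(client_params, client_sizes):
    """Weighted average of the selected clients' parameters, each weighted
    by |D_k^s| + |D_k^u| (its labeled + unlabeled data count)."""
    total = float(sum(client_sizes))
    return sum((n / total) * p for p, n in zip(client_params, client_sizes))
```

Weighting by the total (labeled plus unlabeled) size lets clients that hold large unlabeled pools, not just heavily annotated ones, contribute proportionally to the global model.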

3.4. CONVERGENCE OF OPTIMIZATION PROCESS

In this section, we further analyze the convergence of the coarse-grained and fine-grained regulators and derive the following theorems. The proofs are provided in Appendix A.5.

Theorem 1. Suppose the supervised loss $\mathcal{L}_{ce}\big(y, f_d(x; \phi^{(t+1)}(\theta_l^{(t)}))\big)$ is $L$-Lipschitz and has $\rho$-bounded gradients, and $\mathcal{L}_{ce}\big(\hat{y}, f_d(T_s(u); \phi^{(t)})\big)$ has $\rho$-bounded gradients and is twice differentiable with Hessian bounded by $B$. Let the learning rate $\eta_s = \min\{1, \frac{e}{T}\}$ for some constant $e > 0$, and $\eta = \min\{\frac{1}{L}, \frac{c}{\sqrt{T}}\}$ for some $c > 0$ such that $\frac{\sqrt{T}}{c} \ge L$. Then the optimization of the local model using the coarse-grained regulator achieves:

$$\min_{0 \le t \le T} \mathbb{E}\Big[ \big\| \nabla_{\theta_l} \mathcal{L}_{ce}\big(y, f_d(x; \phi^{(t+1)}(\theta_l^{(t)}))\big) \big\|_2^2 \Big] \le O\Big(\frac{c}{\sqrt{T}}\Big). \tag{13}$$

Theorem 2. Suppose the supervised and unsupervised loss functions are Lipschitz-smooth with constant $L$ and have $\rho$-bounded gradients. $H(\cdot)$ is differentiable with an $\epsilon$-bounded gradient and twice differentiable with its Hessian bounded by $B$. Let the learning rate $\eta_s = \min\{1, \frac{k}{T}\}$ for some constant $k > 0$ such that $\frac{k}{T} < 1$, and $\eta_w = \min\{\frac{1}{L}, \frac{c}{\sqrt{T}}\}$ for some $c > 0$ such that $\frac{\sqrt{T}}{c} \ge L$. Then the optimization of the coarse-grained regulator achieves:

$$\lim_{t \to \infty} \mathbb{E}\Big[ \big\| H(w^{(t+1)}; \phi^{(t)}) \nabla_{\phi^{(t)}} \mathbb{E}_u \big[ \mathcal{L}_{ce}\big(\hat{y}, f_d(T_s(u); \phi^{(t)})\big) \big] \big\|_2^2 \Big] = 0. \tag{14}$$

4. EXPERIMENTS

In this section, we demonstrate the effectiveness and robustness of our method through comprehensive experiments on three benchmark datasets with multiple data settings. More details and additional experiments can be found in the supplementary material.

4.1. EXPERIMENTAL SETUP

Datasets. We conduct comprehensive experiments on three image classification datasets: CIFAR-10 (Krizhevsky et al., 2009), Fashion-MNIST (Xiao et al., 2017), and CINIC-10 (Darlow et al., 2018). We describe the simulation of external and internal imbalance below and present more details of these datasets in Appendix A.1.

Data Heterogeneity. We construct three data heterogeneity settings with different data distributions. We denote each setting as (A, B), where A and B are the data distributions of labeled and unlabeled data, respectively. The settings are as follows: (1) (IID, IID) means both labeled and unlabeled data are IID. By default, we use 5 instances per class to build the labeled dataset for each client; the remaining instances of each class are divided evenly among the $K$ clients to build the unlabeled datasets. (2) (IID, DIR) means labeled data is the same as in (IID, IID), but the unlabeled data is constructed with a Dirichlet distribution to simulate data heterogeneity, where each client only contains a subset of classes. (3) (DIR, DIR) constructs both labeled and unlabeled data with Dirichlet distributions. It simulates external and internal class imbalance, where the class distributions across clients and within a client are different. Specifically, we allocate 500 labeled instances per class to 100 clients using a Dirichlet process; the rest of the instances are also divided among clients with another Dirichlet distribution. Figure 3 compares the data distribution of FedMatch (Batch NonIID) (Jeong et al., 2021) and ours. Our (DIR, DIR) setting presents class imbalance both across clients (external imbalance) and between labeled and unlabeled data within a client (internal imbalance).

Implementation Detail. We use the Adam optimizer with momentum 0.9, batch size 10, and a learning rate of 0.0005 for $\eta_s$, $\eta$, and $\eta_w$.
Unless otherwise specified, our default settings are local iterations $T = 1$, $S = 5$ selected clients in each round, and $K = 100$ clients. For the DIR data configuration, we use a Dirichlet distribution Dir($\gamma$) to generate the DIR data for all clients, where $\gamma = 0.5$ for all three datasets. We adopt the ResNet-9 network as the backbone architecture for local models and the coarse-grained regulator, while an MLP is utilized for the fine-grained regulator. More details are given in Appendix A.2.

Baselines. We compare our FedDure with the following state-of-the-art methods. FedAvg-SL (McMahan et al., 2017) and FedProx-SL (Li et al., 2020b) denote the supervised algorithms with the corresponding FL methods, with fully labeled data in each client. FedAvg+UDA, FedProx+UDA, FedAvg+FixMatch, and FedProx+FixMatch are naive combinations of semi-supervised methods (Sohn et al., 2020; Xie et al., 2020) and FL. Although both labeled and unlabeled data are utilized in these methods, they need to specify a predefined threshold on pseudo labels across decentralized clients. FedMatch (Jeong et al., 2021) adopts an inter-client consistency loss and a disjoint loss for model training, reflecting state-of-the-art performance in FSSL. Note that we set equal hyper-parameters for FedDure and the other state-of-the-art methods in all experiments.
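The Dirichlet-based (DIR) splits described above can be simulated with a common partition recipe; details such as the per-class shuffling and the index bookkeeping are assumptions of this sketch, not taken from the paper's code.

```python
import numpy as np

def dirichlet_partition(labels, n_clients, gamma, seed=0):
    """For each class, draw client proportions from Dir(gamma) and split that
    class's instance indices accordingly. Smaller gamma yields a more skewed
    (heterogeneous) per-client class distribution."""
    rng = np.random.default_rng(seed)
    clients = [[] for _ in range(n_clients)]
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        props = rng.dirichlet(np.full(n_clients, gamma))
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for k, part in enumerate(np.split(idx, cuts)):
            clients[k].extend(part.tolist())
    return clients
```

Running the same recipe independently on the labeled and unlabeled pools yields the (DIR, DIR) setting, where the two per-client class distributions also disagree with each other.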

4.2. PERFORMANCE COMPARISON

Table 1 reports the overall results of FedDure and other state-of-the-art methods on the three datasets, where all results are averaged over 3 independent random trials. Our FedDure achieves state-of-the-art FSSL performance on all datasets and data settings. (IID, IID) setting: compared with the naive-combination FSSL methods and FedMatch, FedDure significantly outperforms them on all three datasets. In particular, on CINIC-10, a more difficult dataset with a larger amount of unlabeled samples, we observe that the other methods hit a performance bottleneck and fall below their results on CIFAR-10, which has fewer unlabeled samples. This phenomenon verifies that FedDure alleviates the influence of massive unlabeled data and prevents performance degradation when the imbalance between labeled and unlabeled data increases rapidly. (IID, DIR) setting: FedDure is only slightly affected by the weak class mismatch on unlabeled data, and it delivers a performance boost of 16.17% over FedMatch on CIFAR-10. It also achieves competitive performance compared to the supervised method FedAvg-SL on Fashion-MNIST. (DIR, DIR) setting: to simulate the federated semi-supervised scenario in real-world applications, we formulate a severe scenario where labeled and unlabeled data come from different distributions and suffer from extreme imbalance. Under this setting, FedDure significantly outperforms the others by at least 10% on CIFAR-10 and CINIC-10. In particular, we observe that the performance of the other approaches degrades sharply and is even inferior to FedAvg*, which is trained only on the labeled data. That is to say, extra unlabeled data might even have a negative effect on model performance due to the distribution mismatch between labeled and unlabeled data.
Therefore, these quantitative results demonstrate that our method is well suited to this real-world scenario, since the dual regulator effectively and flexibly provides real-time feedback for local model updating.

4.3. ABLATION STUDY

Effectiveness of Components. To measure the importance of the proposed components in our FedDure, we conduct ablation studies with the following variants. (1) baseline: the naive combination of FedAvg and FixMatch. (2) Ours w/o C-reg: this variant removes C-reg (i.e., $g_d$ in Eqn. 12) and updates F-reg with the local model. (3) Ours w/o F-reg: this variant replaces the dynamic weighting (i.e., $g_u$ in Eqn. 12) with a fixed threshold that filters low-confidence pseudo labels. Table 2 shows that adopting C-reg improves the performance from 54.79% to 57.73% under the (DIR, DIR) setting on CIFAR-10. F-reg further delivers a remarkable performance boost under almost all data settings on CIFAR-10 and F-MNIST. These evaluations verify the effectiveness of our components: the local model can flexibly optimize its parameters according to the complementary feedback from the coarse- and fine-grained regulators.

Number of Labeled Data per Client. We evaluate the performance of our method when tuning the proportion of labeled instances in each client in {2%, 4%, 10%, 15%, 20%}. As illustrated in Figures 3(a) and 3(b), our framework gains steady performance improvements as the number of labeled data increases in both data settings. Interestingly, after the labeling ratio reaches 10%, the performance of the baseline is essentially unchanged in (IID, DIR), while we find substantial performance fluctuations for FedMatch and the baseline in (DIR, DIR). This phenomenon shows that our regulators can more properly extract valuable knowledge from labeled instances with imbalanced distributions to help local model optimization.

Number of Selected Clients per Round. We also investigate the performance impact of the number of selected clients per round, which varies in {2, 5, 10, 20}. As illustrated in Figures 3(c) and 3(d), significant improvements can be achieved by increasing the number of selected clients.
However, the impact on performance becomes limited once the number of selected clients reaches a certain amount. We argue that although the number of selected clients is positively correlated with overall performance, our framework can fully explore the underlying knowledge of each client to promote overall performance improvement in the central server. In this case, when there are enough clients, our method has learned sufficiently comprehensive knowledge that the performance becomes saturated.

Impacts of Data Heterogeneity. As illustrated in Figure 4, our FedDure is the only method that is robust against different levels of internal data imbalance characterized by the Dirichlet distribution. FedMatch and the baseline (FedAvg+FixMatch) suffer from rapid performance degradation under higher data heterogeneity (smaller Dirichlet coefficient). These results demonstrate that FedDure is more flexible and can alleviate diverse inductive biases across clients when accounting for severe data heterogeneity in real-world applications.

5. CONCLUSION

In this paper, we introduce a new federated semi-supervised learning framework with dual regulator, FedDure, to address the challenge of external and internal imbalance of data distribution. In particular, we propose a coarse-grained regulator to regularize the gradient update in client model training and a fine-grained regulator to learn an adaptive weighting scheme over unlabeled instances for the gradient update. Furthermore, we formulate the learning process in each client as bi-level optimization that adaptively and dynamically optimizes the local model with the two regulators. Theoretically, we show the convergence guarantee of the regulators. Empirically, extensive experiments demonstrate the significance and effectiveness of FedDure. In the future, we will consider designing and integrating other client selection strategies for FSSL. Future work also involves extending our method from image classification to more computer vision tasks.

A APPENDIX

A.1 DATASET

CIFAR-10 is a dataset containing 60,000 RGB images of 32×32 resolution for image classification tasks. We follow the official split with 50,000 training images and 10,000 test images. The dataset contains 10 categories: "airplane", "automobile", "bird", "cat", "deer", "dog", "frog", "horse", "ship", "truck". CINIC-10 is a drop-in replacement for CIFAR-10 and extends it with downsampled ImageNet images. CINIC-10 contains 270,000 images equally split into three subsets: train, validation, and test. Each subset has the same 10 categories as CIFAR-10, with 9,000 images per class. In our FedDure training, we use the training subset, allocated to $K$ clients by uniform or Dirichlet distributions, while the global model is evaluated on the test subset.

A.4 DERIVATION OF COARSE-GRAINED REGULATOR

Our C-reg tries to correct the pseudo labels generated by the local model, and the local model updates its parameters using feedback from the coarse-grained regulator (C-reg). The cross-entropy difference is applied to quantify the learning effect; it can be derived by a gradient-based meta-learning method. We now present the derivation, which theoretically verifies the effectiveness of our coarse-grained regulator. We first formulate the one-step update of the regulator $\phi^{(t)}$ using the sampled soft pseudo label $\hat{y} \sim f_l(T_w(u); \theta_l)$ generated by the local model:

$$\phi^{(t+1)}(\theta_l^{(t)}) = \mathbb{E}_{\hat{y} \sim f_l(T_w(u); \theta_l)} \big[ \phi^{(t)} - \eta_s \nabla_{\phi^{(t)}} \mathcal{L}_{ce}\big(\hat{y}, f_d(T_s(u); \phi^{(t)})\big) \big]. \tag{15}$$

Then, the CE loss on labeled samples, $\mathcal{L}_{ce}\big(y, f_d(x; \phi^{(t+1)}(\theta_l^{(t)}))\big)$, is utilized to characterize the quality of the pseudo labels from the local model. Since $\phi^{(t+1)}$ has a dependency on $\theta_l^{(t)}$,

$$\frac{\partial \mathcal{L}}{\partial \theta_l^{(t)}} = \frac{\partial}{\partial \theta_l^{(t)}} \mathcal{L}_{ce}\big(y, f_d(x; \phi^{(t+1)}(\theta_l^{(t)}))\big) = \frac{\partial \phi^{(t+1)}}{\partial \theta_l^{(t)}} \cdot \frac{\partial}{\partial \phi^{(t+1)}} \mathcal{L}_{ce}\big(y, f_d(x; \phi^{(t+1)})\big). \tag{16}$$

We focus on the first factor in Equation 16. The dependency of $\phi^{(t+1)}$ on $\theta_l^{(t)}$ enters only through the sampled soft predictions $\hat{y}$ of the local model $f_l$. Therefore, we utilize REINFORCE to obtain

$$\begin{aligned} \frac{\partial \phi^{(t+1)}}{\partial \theta_l^{(t)}} &= \frac{\partial}{\partial \theta_l^{(t)}} \mathbb{E}_{\hat{y} \sim f_l(T_w(u); \theta_l)} \big[ \phi^{(t)} - \eta_s \nabla_{\phi^{(t)}} \mathcal{L}_{ce}\big(\hat{y}, f_d(T_s(u); \phi^{(t)})\big) \big] \\ &= -\eta_s \cdot \frac{\partial}{\partial \theta_l^{(t)}} \frac{\partial}{\partial \phi^{(t)}} \mathbb{E}_{\hat{y} \sim f_l(T_w(u); \theta_l)} \big[ \mathcal{L}_{ce}\big(\hat{y}, f_d(T_s(u); \phi^{(t)})\big) \big] \\ &= -\eta_s \cdot \mathbb{E}_{\hat{y} \sim f_l(T_w(u); \theta_l)} \Big[ \frac{\partial}{\partial \phi^{(t)}} \mathcal{L}_{ce}\big(\hat{y}, f_d(T_s(u); \phi^{(t)})\big) \cdot \frac{\partial}{\partial \theta_l^{(t)}} \log P\big(\hat{y} \mid u; \theta_l^{(t)}\big) \Big]. \end{aligned} \tag{17}$$

A Monte Carlo approximation is then applied to the expectation in Equation 17, using a single sample $\hat{y}$. Finally, we rewrite Equation 16 according to Equation 17 as

$$\frac{\partial \mathcal{L}}{\partial \theta_l^{(t)}} = -\eta_s \, \frac{\partial}{\partial \phi^{(t+1)}} \mathcal{L}_{ce}\big(y, f_d(x; \phi^{(t+1)})\big) \cdot \frac{\partial}{\partial \phi^{(t)}} \mathcal{L}_{ce}\big(\hat{y}, f_d(T_s(u); \phi^{(t)})\big) \cdot \frac{\partial}{\partial \theta_l^{(t)}} \log P\big(\hat{y} \mid u; \theta_l^{(t)}\big). \tag{18}$$

Due to the heavy computation cost, we apply a first-order Taylor expansion to approximate the first two factors.
Given that ϕ ϕ ϕ (t+1) = ϕ ϕ ϕ (t) -η s ∂ ∂ϕ ϕ ϕ (t) L ce ŷ, f d T s (u); ϕ ϕ ϕ (t) , we achieve L ce y, f d (x; ϕ ϕ ϕ (t) -L ce y, f d (x; ϕ ϕ ϕ (t+1) = ∂ ∂ϕ ϕ ϕ (t+1) L ce y, f d (x; ϕ ϕ ϕ (t+1) • (ϕ ϕ ϕ (t) -ϕ ϕ ϕ (t+1) ) = η s • ∂ ∂ϕ ϕ ϕ (t+1) L ce y, f d (x; ϕ ϕ ϕ (t+1) • ∂ ∂ϕ ϕ ϕ (t) L ce ŷ, f d T s (u); ϕ ϕ ϕ (t) Now, we rewrite the Equation 18 as follows, ∂L ∂θ (t) l = -L ce y, f d (x; ϕ ϕ ϕ (t) -L ce y, f d (x; ϕ ϕ ϕ (t+1) • ∂ ∂θ θ θ (t) l log P ŷ ŷ ŷ|u u u; θ θ θ (t) l = L ce y, f d (x; ϕ ϕ ϕ (t) -L ce y, f d (x; ϕ ϕ ϕ (t+1) • ∂ ∂θ θ θ (t) l L ce ŷ ŷ ŷ, f l T s (u); θ θ θ (t) l (20) The Equation 10is equal to Equation 20 and the difference between the coarse-grained regulator characterizes learning difficulty from the perspective of the feedback and learning process of the regulator on labeled samples.
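The first-order approximation in Equation 19 can be checked numerically. The sketch below is illustrative NumPy code, not from the paper; the linear-softmax model and all constants are assumptions. It performs the one-step regulator update of Equation 15 with a single sampled pseudo label and compares the labeled CE difference against the inner product of gradients scaled by the step size:

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def ce_and_grad(phi, x, onehot):
    # cross-entropy loss and its gradient for a linear-softmax model
    p = softmax(x @ phi)
    loss = -np.mean(np.sum(onehot * np.log(p + 1e-12), axis=1))
    grad = x.T @ (p - onehot) / len(x)
    return loss, grad

D, C = 6, 4
phi_t = rng.normal(size=(D, C))
x_lab = rng.normal(size=(32, D))
y = np.eye(C)[rng.integers(0, C, 32)]
x_unl = rng.normal(size=(32, D))
y_hat = np.eye(C)[rng.integers(0, C, 32)]   # stands in for sampled pseudo labels

eta_s = 1e-3
_, g_u = ce_and_grad(phi_t, x_unl, y_hat)   # grad of unlabeled CE at phi^(t)
phi_t1 = phi_t - eta_s * g_u                # one-step regulator update (Eq. 15)

l_before, _ = ce_and_grad(phi_t, x_lab, y)  # labeled CE before the update
l_after, g_l = ce_and_grad(phi_t1, x_lab, y)

lhs = l_before - l_after                    # CE difference (the reward in Eq. 20)
rhs = eta_s * np.sum(g_l * g_u)             # first-order term from Eq. 19
```

For small step sizes the two quantities agree to first order, which is exactly why the cheap CE difference can stand in for the expensive meta-gradient in Equation 20.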

A.5 CONVERGENCE OF OPTIMIZATION PROCESS

Our FedDure involves a bi-level optimization, so we demonstrate the convergence of its objectives theoretically.

Theorem 1. Suppose the supervised loss $L_{ce}(y, f_d(x; \phi^{(t+1)}(\theta_l^{(t)})))$ is Lipschitz-smooth with constant $L$ and has $\rho$-bounded gradients, and $L_{ce}(\hat{y}, f_d(T_s(u); \phi^{(t)}))$ has $\rho$-bounded gradients and is twice differentiable with Hessian bounded by $B$. Let the learning rate $\eta_s = \min\{1, \frac{e}{T}\}$ for a constant $e > 0$, and $\eta = \min\{\frac{1}{L}, \frac{c}{\sqrt{T}}\}$ for some $c > 0$ such that $\frac{\sqrt{T}}{c} \geq L$. Then the optimization of the local model using the coarse-grained regulator achieves

$$\min_{0 \leq t \leq T} \mathbb{E}\left[\left\|\nabla_{\theta_l} L_{ce}\left(y, f_d(x; \phi^{(t+1)}(\theta_l^{(t)}))\right)\right\|_2^2\right] \leq O\left(\frac{c}{\sqrt{T}}\right).$$

Proof. The CE loss $L_{ce}(y, f_d(x; \phi^{(t+1)}(\theta_l^{(t)})))$ is minimized to update the local model $\theta_l^{(t)}$. Let $\mathcal{L}(\phi^{(t+1)}; \theta^{(t)}) = L_{ce}(y, f_d(x; \phi^{(t+1)}(\theta_l^{(t)})))$, so the update of $\theta_l$ in each step is

$$\theta^{(t+1)} = \theta^{(t)} - \eta \nabla_{\theta} \mathcal{L}(\phi^{(t+1)}; \theta^{(t)}).$$

In the coarse-grained optimization process, the update from $\theta^{(t)}$ to $\theta^{(t+1)}$ gives

$$\mathcal{L}(\phi^{(t+2)}; \theta^{(t+1)}) - \mathcal{L}(\phi^{(t+1)}; \theta^{(t)}) = \left\{\mathcal{L}(\phi^{(t+2)}; \theta^{(t+1)}) - \mathcal{L}(\phi^{(t+1)}; \theta^{(t+1)})\right\} + \left\{\mathcal{L}(\phi^{(t+1)}; \theta^{(t+1)}) - \mathcal{L}(\phi^{(t+1)}; \theta^{(t)})\right\}.$$

For the first term, since $\phi^{(t+1)} = \phi^{(t)} - \eta_s \nabla_{\phi^{(t)}} \mathbb{E}_u\left[L_{ce}(\hat{y}, f_d(T_s(u); \phi^{(t)}))\right]$, let $V_{coarse}(\phi^{(t)}) = \mathbb{E}_u\left[L_{ce}(\hat{y}, f_d(T_s(u); \phi^{(t)}))\right]$; we achieve

$$\begin{aligned}
\mathcal{L}(\phi^{(t+2)}; \theta^{(t+1)}) - \mathcal{L}(\phi^{(t+1)}; \theta^{(t+1)}) &\leq \left\langle \nabla_{\phi^{(t+1)}} V_{coarse}(\phi^{(t+1)}), -\eta_s \nabla_{\phi^{(t+1)}} V_{coarse}(\phi^{(t+1)}) \right\rangle + \frac{L}{2}\left\|\phi^{(t+2)} - \phi^{(t+1)}\right\|_2^2 \\
&= \left\langle \nabla_{\phi^{(t+1)}} V_{coarse}(\phi^{(t+1)}), -\eta_s \nabla_{\phi^{(t+1)}} V_{coarse}(\phi^{(t+1)}) \right\rangle + \frac{L}{2}\left\|-\eta_s \nabla_{\phi^{(t+1)}} V_{coarse}(\phi^{(t+1)})\right\|_2^2 \\
&= \left(-\eta_s + \frac{\eta_s^2 L}{2}\right)\left\|\nabla_{\phi^{(t+1)}} V_{coarse}(\phi^{(t+1)})\right\|_2^2. \tag{24}
\end{aligned}$$

Since $\|\nabla_{\phi^{(t+1)}} V_{coarse}(\phi^{(t+1)})\| \leq \rho$, Equation 24 satisfies

$$\mathcal{L}(\phi^{(t+2)}; \theta^{(t+1)}) - \mathcal{L}(\phi^{(t+1)}; \theta^{(t+1)}) \leq \left(-\eta_s + \frac{\eta_s^2 L}{2}\right)\rho^2.$$

For the second term, we have

$$\mathcal{L}(\phi^{(t+1)}; \theta^{(t+1)}) - \mathcal{L}(\phi^{(t+1)}; \theta^{(t)}) \leq \left\langle \nabla_{\theta^{(t)}} \mathcal{L}(\phi^{(t+1)}; \theta^{(t)}), \theta^{(t+1)} - \theta^{(t)} \right\rangle + \frac{L}{2}\left\|\theta^{(t+1)} - \theta^{(t)}\right\|_2^2 = \left(-\eta + \frac{\eta^2 L}{2}\right)\left\|\nabla_{\theta^{(t)}} \mathcal{L}(\phi^{(t+1)}; \theta^{(t)})\right\|_2^2.$$

Summing both sides over $T$ iterations yields

$$\sum_{t=1}^{T}\left(\eta - \frac{\eta^2 L}{2}\right)\left\|\nabla_{\theta^{(t)}} \mathcal{L}(\phi^{(t+1)}; \theta^{(t)})\right\|_2^2 \leq \sum_{t=1}^{T}\left(-\eta_s + \frac{\eta_s^2 L}{2}\right)\rho^2 - \mathcal{L}(\phi^{(T+2)}; \theta^{(T+1)}) + \mathcal{L}(\phi^{(2)}; \theta^{(1)}) \leq \sum_{t=1}^{T}\left(-\eta_s + \frac{\eta_s^2 L}{2}\right)\rho^2 + \mathcal{L}(\phi^{(2)}; \theta^{(1)}).$$

Furthermore, since $\eta \leq \frac{1}{L}$ implies $\eta - \frac{\eta^2 L}{2} \geq \frac{\eta}{2}$, we deduce that

$$\begin{aligned}
\min_{t} \mathbb{E}\left[\left\|\nabla_{\theta^{(t)}} \mathcal{L}(\phi^{(t+1)}; \theta^{(t)})\right\|_2^2\right] &\leq \frac{1}{\sum_{t=1}^{T}\left(\eta - \frac{\eta^2 L}{2}\right)}\left[\sum_{t=1}^{T}\left(-\eta_s + \frac{\eta_s^2 L}{2}\right)\rho^2 + \mathcal{L}(\phi^{(2)}; \theta^{(1)})\right] \\
&\leq \frac{1}{\sum_{t=1}^{T}\eta}\left[\sum_{t=1}^{T}\left(-2\eta_s + \eta_s^2 L\right)\rho^2 + 2\mathcal{L}(\phi^{(2)}; \theta^{(1)})\right] \\
&= \frac{\left(-2\eta_s + \eta_s^2 L\right)\rho^2}{\eta} + \frac{2\mathcal{L}(\phi^{(2)}; \theta^{(1)})}{T\eta} \\
&= \eta_s\left(\eta_s L - 2\right)\rho^2 \max\left\{L, \frac{\sqrt{T}}{c}\right\} + \frac{2\mathcal{L}(\phi^{(2)}; \theta^{(1)})}{T}\max\left\{L, \frac{\sqrt{T}}{c}\right\} \\
&\leq (L - 2)\rho^2 \min\left\{1, \frac{e}{T}\right\}\max\left\{L, \frac{\sqrt{T}}{c}\right\} + \frac{2\mathcal{L}(\phi^{(2)}; \theta^{(1)})}{T}\max\left\{L, \frac{\sqrt{T}}{c}\right\} \\
&\leq \frac{e(L-2)\rho^2}{c\sqrt{T}} + \frac{2\mathcal{L}(\phi^{(2)}; \theta^{(1)})}{c\sqrt{T}} = O\left(\frac{c}{\sqrt{T}}\right). \tag{29}
\end{aligned}$$

The proof is completed.

Let $\phi$ be optimized with the loss $L_{fine}(\phi^{(t)}; w^{(t+1)}) = H(w^{(t+1)}; \phi^{(t)}) L_{ce}(\hat{y}, f_d(T_s(u); \phi^{(t)}))$, where $H(w^{(t+1)}; \phi^{(t)}) = f_w(f_d(T_s(u); \phi^{(t)}); w^{(t+1)})$ indicates the adaptive weight for samples $u$.

Theorem 2. Suppose the supervised and unsupervised loss functions are Lipschitz-smooth with constant $L$ and have $\rho$-bounded gradients. $H(\cdot)$ is differentiable with an $\epsilon$-bounded gradient and twice differentiable with its Hessian bounded by $B$. Let the learning rate $\eta_s = \min\{1, \frac{k}{T}\}$ for a constant $k > 0$ such that $\frac{k}{T} < 1$, and $\eta_w = \min\{\frac{1}{L}, \frac{c}{\sqrt{T}}\}$ for a constant $c > 0$ such that $\frac{\sqrt{T}}{c} \geq L$. Then the method achieves $\lim_{t \to \infty} \mathbb{E}[\|\nabla_{\phi^{(t)}} L_{fine}(\phi^{(t)}; w^{(t+1)})\|_2^2] = 0$.

Proof. The update from $w^{(t)}$ to $w^{(t+1)}$ is

$$w^{(t+1)} = w^{(t)} - \eta_w \nabla_{w^{(t)}} \underbrace{L_{ce}\left(y, f_w(f_d(x; \phi^{(t)}); w^{(t)})\right)}_{:=\, L_s(x; \phi^{(t)}, w^{(t)})},$$

and the update of $\phi$ is

$$\phi^{(t+1)} = \phi^{(t)} - \eta_s \underbrace{H(w^{(t+1)}; \phi^{(t)}) \nabla_{\phi^{(t)}} L_{ce}\left(\hat{y}, f_d(T_s(u); \phi^{(t)})\right)}_{:=\, \nabla_{\phi} L_u(u; \phi^{(t)}, w^{(t+1)})}. \tag{32}$$

Observe that

$$L_{fine}(\phi^{(t+1)}; w^{(t+2)}) - L_{fine}(\phi^{(t)}; w^{(t+1)}) = \left\{L_{fine}(\phi^{(t+1)}; w^{(t+2)}) - L_{fine}(\phi^{(t+1)}; w^{(t+1)})\right\} + \left\{L_{fine}(\phi^{(t+1)}; w^{(t+1)}) - L_{fine}(\phi^{(t)}; w^{(t+1)})\right\}.$$

For the first term, we have

$$L_{fine}(\phi^{(t+1)}; w^{(t+2)}) - L_{fine}(\phi^{(t+1)}; w^{(t+1)}) = \left[H(w^{(t+2)}; \phi^{(t+1)}) - H(w^{(t+1)}; \phi^{(t+1)})\right] L_{ce}\left(\hat{y}, f_d(T_s(u); \phi^{(t+1)})\right).$$

For the first factor, we achieve

$$\begin{aligned}
H(w^{(t+2)}; \phi^{(t+1)}) - H(w^{(t+1)}; \phi^{(t+1)}) &\leq \left\langle \nabla_{w^{(t+1)}} H(w^{(t+1)}; \phi^{(t+1)}), w^{(t+2)} - w^{(t+1)} \right\rangle + \frac{L}{2}\left\|w^{(t+2)} - w^{(t+1)}\right\|_2^2 \\
&= \left\langle \nabla_{w^{(t+1)}} H(w^{(t+1)}; \phi^{(t+1)}), -\eta_w \nabla_{w^{(t+1)}} L_s(x; \phi^{(t+1)}, w^{(t+1)}) \right\rangle + \frac{L}{2}\left\|-\eta_w \nabla_{w^{(t+1)}} L_s(x; \phi^{(t+1)}, w^{(t+1)})\right\|_2^2 \\
&= -\eta_w \cdot \nabla_{w^{(t+1)}} H(w^{(t+1)}; \phi^{(t+1)}) \cdot \nabla_{w^{(t+1)}} L_s(x; \phi^{(t+1)}, w^{(t+1)}) + \frac{\eta_w^2 L}{2}\left\|\nabla_{w^{(t+1)}} L_s(x; \phi^{(t+1)}, w^{(t+1)})\right\|_2^2. \tag{35}
\end{aligned}$$

For the second term, we have

$$\begin{aligned}
L_{fine}(\phi^{(t+1)}; w^{(t+1)}) - L_{fine}(\phi^{(t)}; w^{(t+1)}) &\leq \left\langle \nabla_{\phi^{(t)}} L_{fine}(\phi^{(t)}; w^{(t+1)}), \phi^{(t+1)} - \phi^{(t)} \right\rangle + \frac{L}{2}\left\|\phi^{(t+1)} - \phi^{(t)}\right\|_2^2 \\
&= \left\langle \nabla_{\phi^{(t)}} L_{fine}(\phi^{(t)}; w^{(t+1)}), -\eta_s \nabla_{\phi} L_u(u; \phi^{(t)}, w^{(t+1)}) \right\rangle + \frac{L}{2}\left\|-\eta_s \nabla_{\phi} L_u(u; \phi^{(t)}, w^{(t+1)})\right\|_2^2 \\
&= -\eta_s\left\|\nabla_{\phi^{(t)}} L_u(u; \phi^{(t)}, w^{(t+1)})\right\|_2^2 + \frac{\eta_s^2 L}{2}\left\|\nabla_{\phi^{(t)}} L_u(u; \phi^{(t)}, w^{(t+1)})\right\|_2^2. \tag{36}
\end{aligned}$$

Combining these two terms, we have

$$\begin{aligned}
L_{fine}(\phi^{(t+1)}; w^{(t+2)}) - L_{fine}(\phi^{(t)}; w^{(t+1)}) \leq\ & -\eta_s\left\|\nabla_{\phi^{(t)}} L_u(u; \phi^{(t)}, w^{(t+1)})\right\|_2^2 + \frac{\eta_s^2 L}{2}\left\|\nabla_{\phi^{(t)}} L_u(u; \phi^{(t)}, w^{(t+1)})\right\|_2^2 \\
&+ L_{ce}\left(\hat{y}, f_d(T_s(u); \phi^{(t+1)})\right) \cdot \Big\{-\eta_w \cdot \nabla_{w^{(t+1)}} H(w^{(t+1)}; \phi^{(t+1)}) \cdot \nabla_{w^{(t+1)}} L_s(x; \phi^{(t+1)}, w^{(t+1)}) \\
&\quad + \frac{\eta_w^2 L}{2}\left\|\nabla_{w^{(t+1)}} L_s(x; \phi^{(t+1)}, w^{(t+1)})\right\|_2^2\Big\}. \tag{37}
\end{aligned}$$

Rearranging the inequality, we obtain

$$\begin{aligned}
&\eta_w \cdot \nabla_{w^{(t+1)}} H(w^{(t+1)}; \phi^{(t+1)}) \cdot \nabla_{w^{(t+1)}} L_s(x; \phi^{(t+1)}, w^{(t+1)}) \cdot L_{ce}\left(\hat{y}, f_d(T_s(u); \phi^{(t+1)})\right) + \eta_s\left\|\nabla_{\phi^{(t)}} L_u(u; \phi^{(t)}, w^{(t+1)})\right\|_2^2 \\
&\leq \frac{\eta_s^2 L}{2}\left\|\nabla_{\phi^{(t)}} L_u(u; \phi^{(t)}, w^{(t+1)})\right\|_2^2 + \frac{\eta_w^2 L}{2}\left\|\nabla_{w^{(t+1)}} L_s(x; \phi^{(t+1)}, w^{(t+1)})\right\|_2^2 \cdot L_{ce}\left(\hat{y}, f_d(T_s(u); \phi^{(t+1)})\right) \\
&\quad + L_{fine}(\phi^{(t)}; w^{(t+1)}) - L_{fine}(\phi^{(t+1)}; w^{(t+2)}). \tag{38}
\end{aligned}$$

Summing up the inequalities over $T$ iterations on both sides (Equations 39 and 40) and letting $T \to \infty$, we use the facts that $L_{ce}(\hat{y}, f_d(T_s(u); \phi^{(t+1)}))$ is bounded for a limited number of unlabeled samples, the supervised loss has $\rho$-bounded gradients, and $H(\cdot)$ is differentiable with an $\epsilon$-bounded gradient. Therefore we deduce that

$$\lim_{T \to \infty} \sum_{t=1}^{T} \eta_s\left\|\nabla_{\phi^{(t)}} L_u(u; \phi^{(t)}, w^{(t+1)})\right\|_2^2 < \infty. \tag{41}$$

According to Lemma A.5 of Mairal (2013), to prove $\lim_{t \to \infty} \mathbb{E}[\|\nabla_{\phi^{(t)}} L_u(u; \phi^{(t)}, w^{(t+1)})\|_2^2] = 0$, it suffices to show that $\mathbb{E}[\|\nabla_{\phi^{(t+1)}} L_u(u; \phi^{(t+1)}, w^{(t+2)})\|_2^2] - \mathbb{E}[\|\nabla_{\phi^{(t)}} L_u(u; \phi^{(t)}, w^{(t+1)})\|_2^2] \leq Q\eta_s$, where $Q$ is a constant. Considering Equation 41, we have

$$\begin{aligned}
&\mathbb{E}\left[\left\|\nabla_{\phi^{(t+1)}} L_u(u; \phi^{(t+1)}, w^{(t+2)})\right\|_2^2\right] - \mathbb{E}\left[\left\|\nabla_{\phi^{(t)}} L_u(u; \phi^{(t)}, w^{(t+1)})\right\|_2^2\right] \\
&\leq \mathbb{E}\left[\left(\left\|\nabla_{\phi^{(t+1)}} L_u(u; \phi^{(t+1)}, w^{(t+2)})\right\|_2 + \left\|\nabla_{\phi^{(t)}} L_u(u; \phi^{(t)}, w^{(t+1)})\right\|_2\right) \cdot \left(\left\|\nabla_{\phi^{(t+1)}} L_u(u; \phi^{(t+1)}, w^{(t+2)})\right\|_2 - \left\|\nabla_{\phi^{(t)}} L_u(u; \phi^{(t)}, w^{(t+1)})\right\|_2\right)\right] \\
&\leq \mathbb{E}\left[\left\|\nabla_{\phi^{(t+1)}} L_u(u; \phi^{(t+1)}, w^{(t+2)}) + \nabla_{\phi^{(t)}} L_u(u; \phi^{(t)}, w^{(t+1)})\right\|_2 \cdot \left\|\nabla_{\phi^{(t+1)}} L_u(u; \phi^{(t+1)}, w^{(t+2)}) - \nabla_{\phi^{(t)}} L_u(u; \phi^{(t)}, w^{(t+1)})\right\|_2\right] \\
&\leq 2L\rho\,\eta_s\eta_w\, \mathbb{E}\left[\left\|\nabla_{\phi^{(t)}} L_u(u; \phi^{(t)}, w^{(t+1)})\right\|_2^2 + \left\|\nabla_{w^{(t)}} L_s(x; \phi^{(t)}, w^{(t)})\right\|_2^2\right]^{1/2} \leq 2\sqrt{2}L\rho^2\eta_s\eta_w. \tag{43}
\end{aligned}$$

Thus, the claim is proved. Since $L_{fine}(\phi^{(t)}; w^{(t+1)}) = L_u(u; \phi^{(t)}, w^{(t+1)})$, the proof is completed.
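As a small numeric illustration of the learning-rate schedules used in the theorems (the constants `L_smooth`, `e_const`, and `c_const` below are arbitrary choices for illustration, not values from the paper), the following sketch evaluates $\eta_s = \min\{1, e/T\}$, $\eta = \min\{1/L, c/\sqrt{T}\}$ and the $O(c/\sqrt{T})$ guarantee, confirming the bound shrinks as the number of rounds grows:

```python
import numpy as np

# Illustrative constants; the theorem only requires e, c > 0 and sqrt(T)/c >= L.
L_smooth, e_const, c_const = 2.0, 0.5, 1.0

def step_sizes(T):
    eta_s = min(1.0, e_const / T)                     # regulator rate, min{1, e/T}
    eta = min(1.0 / L_smooth, c_const / np.sqrt(T))   # model rate, min{1/L, c/sqrt(T)}
    return eta_s, eta

def gradient_bound(T):
    return c_const / np.sqrt(T)                       # the O(c/sqrt(T)) guarantee

rounds = [16, 64, 256, 1024]
for T in rounds:
    eta_s, eta = step_sizes(T)
    assert np.sqrt(T) / c_const >= L_smooth           # the theorem's condition holds here
bounds = [gradient_bound(T) for T in rounds]
```

Both step sizes decay with $T$, and the guaranteed minimum gradient norm decreases monotonically, matching the $O(c/\sqrt{T})$ rate stated above.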






Figure 1: Illustration of Federated Semi-Supervised Learning Framework with Dual Regulator (FedDure). FedDure updates the coarse-grained regulator (C-reg) and the fine-grained regulator (F-reg) to adaptively regularize the model training in each client: C-reg dynamically regulates the importance of local training on the unlabeled data by reflecting the overall learning effect on labeled data; F-reg regulates the performance contribution of each unlabeled sample.

Figure 2: Comparison of data distribution between FedMatch (Jeong et al., 2021) and our (DIR, DIR) setting: (a) and (b) are labeled and unlabeled data distribution used in FedMatch, respectively; our data distribution in (c) and (d) present external imbalance across clients and internal imbalance between labeled and unlabeled data inside a client.
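Splits like the (DIR, DIR) setting above are typically generated with per-class Dirichlet sampling. Below is a minimal NumPy sketch of this standard procedure; the function name `dirichlet_partition`, the concentration `alpha = 0.5`, and the toy labels are illustrative assumptions, not the paper's exact protocol:

```python
import numpy as np

rng = np.random.default_rng(42)

def dirichlet_partition(labels, n_clients, alpha):
    """Split sample indices across clients with a per-class Dirichlet draw.

    Smaller alpha gives more skewed (DIR) partitions; large alpha approaches uniform.
    """
    n_classes = int(labels.max()) + 1
    client_idx = [[] for _ in range(n_clients)]
    for c in range(n_classes):
        idx = np.where(labels == c)[0]
        rng.shuffle(idx)
        # proportions of class c assigned to each client
        props = rng.dirichlet(alpha * np.ones(n_clients))
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for k, part in enumerate(np.split(idx, cuts)):
            client_idx[k].extend(part.tolist())
    return client_idx

# toy labels standing in for 10-class annotations (e.g., CIFAR-10)
labels = rng.integers(0, 10, size=5000)
clients = dirichlet_partition(labels, n_clients=10, alpha=0.5)
```

Drawing separate Dirichlet proportions for the labeled and unlabeled pools of the same client produces the internal imbalance depicted in panels (c) and (d).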

Figure 3: Analysis of the impact of the number of labeled data and the impact of the number of selected clients. (a) and (b) illustrate that FedDure consistently outperforms FedMatch and Baseline (FedAvg-Fixmatch) on different percentages of labeled data. (c) and (d) show that FedDure scales with increasing numbers of selected clients on CIFAR-10 and Fashion-MNIST datasets.

Fashion-MNIST consists of 60,000 training examples and 10,000 test examples. Each example is a 28 x 28 gray-scale image associated with a label from 10 classes. We train on the divided training examples and evaluate on the test set.

A.2 IMPLEMENTATION DETAILS

Network architecture. Following the conventional methods (Jeong et al., 2021), we employ the official ResNet-9 as the local model on all three datasets, i.e., CIFAR-10, CINIC-10, and Fashion-MNIST. Our coarse-grained regulator is a deep copy of the initial local model. Our fine-grained regulator is an MLP, which contains one fully connected layer with 128 filters followed by a Sigmoid function to normalize the output.

Naive combination methods. For all reimplemented SSL algorithms, FixMatch (Sohn et al., 2020) and UDA (Xie et al., 2020), we fix the confidence threshold to 0.85 for all FixMatch and FedMatch methods following their official implementations. To build federated semi-supervised baselines, plain FedAvg and FedProx are separately equipped with these SSL methods. For data augmentation, we apply the same strong (RandAugment (Cubuk et al., 2020)) and weak (flip-and-shift) augmentations for the unsupervised loss on unlabeled data.
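The FixMatch-style unsupervised objective with the 0.85 confidence threshold can be sketched as follows. This is a minimal NumPy illustration with random logits standing in for network outputs on the weak and strong views; `fixmatch_unsup_loss` and its arguments are illustrative names, not the official implementation:

```python
import numpy as np

rng = np.random.default_rng(7)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fixmatch_unsup_loss(logits_weak, logits_strong, tau=0.85):
    """CE between hard pseudo labels from the weak view and strong-view predictions,
    keeping only instances whose weak-view confidence exceeds the threshold tau."""
    probs_weak = softmax(logits_weak)
    conf = probs_weak.max(axis=1)
    pseudo = probs_weak.argmax(axis=1)
    mask = conf >= tau
    if not mask.any():
        return 0.0, mask
    probs_strong = softmax(logits_strong)[mask]
    nll = -np.log(probs_strong[np.arange(mask.sum()), pseudo[mask]] + 1e-12)
    return float(nll.mean()), mask

logits_w = rng.normal(scale=3.0, size=(64, 10))             # weak-view logits
logits_s = logits_w + rng.normal(scale=0.5, size=(64, 10))  # strong view: perturbed
loss, mask = fixmatch_unsup_loss(logits_w, logits_s)
```

The threshold masks out low-confidence pseudo labels, which is the fixed filtering step that FedDure's F-reg replaces with learned per-instance weights.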

Figure 5: Visualization analysis in terms of different local iterations, where the total number of local iterations or training rounds is fixed.


Performance comparison of our proposed FedDure with state-of-the-art methods on three data heterogeneity settings per dataset. FedDure outperforms all other methods in all settings.

| Method | CIFAR-10 | CIFAR-10 | CIFAR-10 | Fashion-MNIST | Fashion-MNIST | Fashion-MNIST | CINIC-10 | CINIC-10 | CINIC-10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | ± 0.55 | 66.78 ± 0.87 | 59.55 ± 0.61 | 89.53 ± 0.23 | 88.35 ± 0.02 | 87.32 ± 0.84 | 68.13 ± 0.96 | 58.67 ± 1.04 | 52.09 ± 0.20 |
| FedAvg+UDA | 47.47 ± 0.67 | 43.89 ± 0.15 | 35.52 ± 0.52 | 86.20 ± 0.75 | 85.35 ± 0.62 | 81.07 ± 0.56 | 42.25 ± 0.31 | 39.93 ± 0.57 | 29.27 ± 0.09 |
| FedProx+UDA | 46.49 ± 0.74 | 42.82 ± 0.79 | 37.38 ± 0.52 | 84.78 ± 0.43 | 84.50 ± 0.34 | 82.94 ± 0.39 | 41.81 ± 0.94 | 39.40 ± 0.18 | 33.26 ± 0.98 |
| FedAvg+Fixmatch | 46.71 ± 2.49 | 46.67 ± 0.56 | 39.95 ± 1.85 | 86.46 ± 0.39 | 85.42 ± 0.19 | 81.07 ± 0.56 | 40.40 ± 0.61 | 39.66 ± 1.01 | 31.99 ± 0.31 |
| FedProx+Fixmatch | 47.60 ± 1.05 | 43.39 ± 0.71 | 41.85 ± 1.32 | 86.31 ± 0.28 | 85.18 ± 0.79 | 83.68 ± 0.78 | 41.46 ± 0.35 | 40.02 ± 0.61 | 32.21 ± 1.03 |
| FedMatch | 51.52 ± 0.50 | 51.59 ± 0.32 | 45.56 ± 0.91 | 85.71 ± 0.21 | 85.55 ± 0.09 | 85.13 ± 0.15 | 43.73 ± 1.15 | 41.82 ± 0.23 | 35.27 ± 0.35 |
| FedDure (Ours) | 67.69 ± 0.23 | 66.85 ± 0.65 | 57.73 ± 0.31 | 88.69 ± 0.16 | 88.21 ± 0.07 | 86.96 ± 0.12 | 56.36 ± 0.29 | 55.10 ± 0.25 | 46.43 ± 0.13 |

Quantitative analysis of components of FedDure on CIFAR-10 and Fashion-MNIST datasets.

| Method | CIFAR-10 | CIFAR-10 | Fashion-MNIST | Fashion-MNIST |
| --- | --- | --- | --- | --- |
| Baseline | 39.95 ± 1.85 | 46.67 ± 0.56 | 81.07 ± 0.56 | 85.42 ± 0.19 |
| w/o C-reg | 54.79 ± 0.59 | 64.98 ± 0.58 | 86.18 ± 0.13 | 87.45 ± 0.23 |
| w/o F-reg | 56.79 ± 0.63 | 66.75 ± 0.23 | 86.79 ± 0.17 | 88.14 ± 0.12 |
| FedDure | 57.73 ± 0.31 | 66.85 ± 0.65 | 86.96 ± 0.12 | 88.21 ± 0.07 |

6. REPRODUCIBILITY STATEMENT

We provide the datasets, experimental settings, and implementation details in Section 4.1. More details of the experimental setup are provided in Appendix A.1 and A.2. In addition, we summarize our proposed FedDure in Algorithm 1. The source code will be released in the future.

