FEDPROP: CROSS-CLIENT LABEL PROPAGATION FOR FEDERATED SEMI-SUPERVISED LEARNING

Abstract

Federated learning (FL) allows multiple clients to jointly train a machine learning model in such a way that no client has to share its data with any other participating party. In the supervised setting, where all client data is fully labeled, FL has been widely adopted for learning tasks that require data privacy. However, it remains an open research question how best to perform federated learning in a semi-supervised setting, where clients possess data that is only partially labeled or even completely unlabeled. In this work, we propose a new method, FedProp, that follows a manifold-based approach to semi-supervised learning (SSL). It estimates the data manifold jointly from the data of multiple clients and computes pseudo-labels using cross-client label propagation. To ensure that clients do not have to share their data with anyone, FedProp employs two cryptographically secure yet highly efficient protocols: secure Hamming distance computation and secure summation. Experiments on three standard benchmarks show that FedProp achieves higher classification accuracy than previous federated SSL methods. Furthermore, as a pseudo-label-based technique, FedProp is complementary to other federated SSL approaches, in particular consistency-based ones. We demonstrate experimentally that combining the two yields further accuracy gains.

1. INTRODUCTION

Federated Learning (FL) is a machine learning paradigm in which multiple clients, each holding their own data, cooperate to jointly train a model. Training is coordinated by a central server, which, however, must not have direct access to client data. Typically this is not because the server is viewed as a hostile party, but rather to comply with external privacy and legal constraints that require client data to remain stored on-device. FL has received considerable interest in recent years, as it allows models to be trained on valuable data that would otherwise be inaccessible. To date, the vast majority of FL research has focused on the supervised setting, in which client data is fully labeled. However, in many real-world settings this is not the case. For instance, in cross-device FL, smartphone users are unlikely to annotate more than a handful of the photos on their devices, and in cross-silo settings, labeling medical imaging data may be both costly and time-consuming. As such, in recent years there has been growing interest in learning from partly labeled data in a federated setting.

In this work we propose FedProp, a method for semi-supervised learning (SSL) in the federated setting that follows a manifold-based approach to pseudo-labeling client data. During each training round, FedProp leverages the data of multiple clients to obtain an estimate of the data manifold, which it then uses to compute pseudo-labels for clients' unlabeled data via label propagation. Using these pseudo-labels, clients then train in a supervised manner for the remainder of the round. The motivation for this approach is that the more data is available, the more densely the manifold is sampled, and therefore the better the resulting estimates and pseudo-labels will be. It is therefore crucial to combine information from multiple clients, rather than treating each client's data in isolation.
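To make the manifold-based pseudo-labeling step concrete, the following sketch shows plain (non-federated, plaintext) label propagation in the style of Zhou et al. (2004), on which such approaches build. The kNN graph construction and the hyperparameters (k, alpha) are illustrative choices, not FedProp's exact configuration:

```python
import numpy as np

def propagate_labels(X, y, n_labeled, k=3, alpha=0.99):
    """Graph-based label propagation (in the style of Zhou et al., 2004).

    X: (n, d) features; the first n_labeled rows are labeled with y.
    Returns hard pseudo-labels for all n points.
    """
    n = len(X)
    # Pairwise squared distances -> Gaussian affinities.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2)
    np.fill_diagonal(W, 0.0)
    # Keep only the k strongest edges per node, then symmetrize.
    keep = np.argsort(-W, axis=1)[:, :k]
    mask = np.zeros_like(W, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=1)
    W = np.where(mask | mask.T, W, 0.0)
    # Symmetrically normalized affinity S = D^{-1/2} W D^{-1/2}.
    d = W.sum(1)
    Dinv = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    S = Dinv @ W @ Dinv
    # One-hot label matrix; unlabeled rows are all-zero.
    n_classes = int(y.max()) + 1
    Y = np.zeros((n, n_classes))
    Y[np.arange(n_labeled), y[:n_labeled]] = 1.0
    # Closed-form solution of the propagation fixed point.
    F = np.linalg.solve(np.eye(n) - alpha * S, Y)
    return F.argmax(1)
```

Pooling the data of many clients before building the graph is exactly what makes the estimate better, which is why FedProp performs this computation across clients rather than per client.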
The key challenge lies in how to perform such cross-client pseudo-labeling, whose steps would normally require the participating parties to share their data in order to estimate the manifold and run label propagation. Our main contribution is the CrossClientLP subroutine of FedProp, which we propose to address this challenge. It uses locality-sensitive hashing and secure Hamming distance computation to efficiently estimate the cross-client data manifold. It then distributes the label propagation computation across clients and aggregates the results using secure summation. CrossClientLP preserves privacy in the sense that it does not require clients to share their data with anyone else. At the same time, it adds only limited communication and computation overhead relative to popular federated learning methods such as FederatedAveraging.

Our experiments show that FedProp outperforms all existing methods for federated semi-supervised learning, as well as a range of natural baselines, in the standard CIFAR-10 setup. Going beyond prior work, we also evaluate FedProp on the more challenging CIFAR-100 and Mini-ImageNet datasets, where we likewise observe substantial improvements in accuracy. Moreover, as a method for pseudo-labeling unlabeled data, FedProp is orthogonal to other approaches to federated SSL, in particular those based on consistency regularization. We demonstrate this empirically by combining FedProp with such an approach and observing that this often leads to further accuracy gains.
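As a plaintext illustration of CrossClientLP's first building block: random-hyperplane locality-sensitive hashing turns feature vectors into binary codes whose Hamming distance tracks the angle between the original vectors (Charikar, 2002). The code length and the shared seed below are illustrative assumptions; in FedProp the Hamming distances are computed under a secure protocol, so the codes themselves are never revealed, whereas here they are compared in the clear:

```python
import numpy as np

def lsh_codes(X, n_bits=64, seed=0):
    # Each bit is the sign of a projection onto a random hyperplane.
    # All parties must use the same projection matrix (same seed) so
    # that codes computed by different clients are comparable.
    rng = np.random.default_rng(seed)
    H = rng.standard_normal((X.shape[1], n_bits))
    return (X @ H > 0).astype(np.uint8)

def hamming(a, b):
    # Number of differing bits; in expectation proportional to the
    # angle between the two underlying feature vectors.
    return int((a != b).sum())
```

Because Hamming distances between short binary codes are cheap to compute, even under a secure two-party protocol, this keeps the overhead of estimating cross-client neighborhoods low.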

2. RELATED WORK

Semi-supervised Learning. Semi-supervised learning (SSL) is a classical and well-studied problem in machine learning in which the goal is to leverage both labeled and unlabeled training examples to improve performance on a task; see (Chapelle et al., 2006) for a full overview. In recent years there has been a great deal of interest in applying deep learning techniques to SSL. Broadly speaking, such semi-supervised deep learning approaches fall into two groups. The first group consists of methods that add an unsupervised loss term to the objective function. In particular, many of these methods introduce some form of consistency regularization (Sajjadi et al., 2016), which encourages the model to produce similar outputs for similar inputs; examples include (Tarvainen & Valpola, 2017; Berthelot et al., 2019; Xie et al., 2020). The second group consists of methods that exploit unlabeled data by computing pseudo-labels for unlabeled points and then training on these in a supervised fashion, for instance (Lee, 2013; Shi et al., 2018; Iscen et al., 2019; Rizve et al., 2021). Combinations of both approaches are also possible (Iscen et al., 2019; Sohn et al., 2020).

Federated Learning. Federated learning (FL) (McMahan et al., 2017) was originally proposed for learning on private, fully labeled data split across multiple clients; for a survey of recent developments in the field see (Kairouz et al., 2021). A number of recent works address federated learning in the absence of fully labeled data. When only unlabeled data is available, methods for cluster analysis and dimensionality reduction have been proposed (Dennis et al., 2021; Grammenos et al., 2020). Federated self-supervised learning (Zhuang et al., 2022; Makhija et al., 2022) instead aims to learn representations of unlabeled data that can later be fine-tuned for other tasks.
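The second group's core operation is easy to state. The sketch below shows the simplest variant, confidence-thresholded pseudo-labeling from network predictions (in the spirit of Lee, 2013); the threshold value is an illustrative assumption:

```python
import numpy as np

def confidence_pseudo_labels(probs, threshold=0.95):
    # probs: (n, n_classes) softmax outputs for unlabeled points.
    # Keep only points the network is confident about; return their
    # indices and predicted classes, to be trained on as if labeled.
    confident = probs.max(axis=1) >= threshold
    return np.flatnonzero(confident), probs[confident].argmax(axis=1)
```

Graph-based alternatives such as (Iscen et al., 2019), and FedProp itself, instead derive pseudo-labels from label propagation over a neighborhood graph, which takes the structure of the unlabeled data into account rather than relying on per-point predictions alone.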
However, all of these settings are different from the task of semi-supervised learning, in which the goal is to directly learn better classifiers from labeled and unlabeled client data.

Semi-supervised Federated Learning. For semi-supervised FL, several works follow a consistency-based approach. Jeong et al. (2021) propose inter-client consistency and parameter decomposition to separately learn from labeled and unlabeled data. Long et al. (2020) apply consistency locally through client-based teacher models. Zhang et al. (2021) and Diao et al. (2022) focus on an alternative setting in which the server has access to labeled data. In this setting, Zhang et al. (2021) combine local consistency with grouping of client updates to reduce gradient diversity, while Diao et al. (2022) combine consistency, through strong data augmentation, with pseudo-labeling unlabeled client data. Other methods focus exclusively on pseudo-labeling: Albaseer et al. (2020) and Lin et al. (2021) both use network predictions to assign pseudo-labels, while Presotto et al. (2022) develop a specialized method for human activity recognition which uses label propagation locally on each client to pseudo-label incoming data. Unlike in classical SSL, none of the above methods fully make use of the knowledge gained from estimating the data manifold, because they exploit interactions between data points at most locally within each client. In contrast, FedProp uses securely computed cross-client interactions, thereby obtaining a better estimate of the data manifold.

3. PRELIMINARIES AND BACKGROUND

We assume a federated classification setting with m clients coordinated by a central server. Each client j possesses partly labeled data (X_j^l, Y_j^l, X_j^u), where X_j^l are examples with known labels Y_j^l and X_j^u are unlabeled examples.
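FedProp's secure summation can be illustrated in a few lines with pairwise additive masking: every pair of clients agrees on a random mask that one adds and the other subtracts, so individual messages look random on their own while the masks cancel in the server's sum. This plaintext sketch simulates the pairwise agreement locally rather than via a real key exchange:

```python
import numpy as np

def masked_updates(updates, seed=0):
    # updates: list of m same-shaped client vectors to be summed.
    # For each pair (i, j) with i < j, draw a shared random mask r;
    # client i sends update_i + r, client j sends update_j - r.
    # The server sees only masked vectors, but their sum is exact.
    rng = np.random.default_rng(seed)
    m, d = len(updates), len(updates[0])
    masked = [u.astype(float).copy() for u in updates]
    for i in range(m):
        for j in range(i + 1, m):
            r = rng.standard_normal(d)  # shared secret between i and j
            masked[i] += r
            masked[j] -= r
    return masked
```

In FedProp this primitive lets the server aggregate per-client pieces of the label propagation computation without learning any individual client's contribution.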

