FEDPROP: CROSS-CLIENT LABEL PROPAGATION FOR FEDERATED SEMI-SUPERVISED LEARNING

Abstract

Federated learning (FL) allows multiple clients to jointly train a machine learning model in such a way that no client has to share its data with any other participating party. In the supervised setting, where all client data is fully labeled, FL has been widely adopted for learning tasks that require data privacy. However, how best to perform federated learning in a semi-supervised setting, where clients possess data that is only partially labeled or even completely unlabeled, remains an open research question. In this work, we propose a new method, FedProp, that follows a manifold-based approach to semi-supervised learning (SSL). It estimates the data manifold jointly from the data of multiple clients and computes pseudo-labels using cross-client label propagation. To ensure that no client has to share its data with anyone, FedProp employs two cryptographically secure yet highly efficient protocols: secure Hamming distance computation and secure summation. Experiments on three standard benchmarks show that FedProp achieves higher classification accuracy than previous federated SSL methods. Furthermore, as a pseudo-label-based technique, FedProp is complementary to other federated SSL approaches, in particular consistency-based ones. We demonstrate experimentally that further accuracy gains are possible by combining both.

1. INTRODUCTION

Federated Learning (FL) is a machine learning paradigm in which multiple clients, each holding their own data, cooperate to jointly train a model. Training is coordinated by a central server, which, however, must not have direct access to client data. Typically this is not because the server is viewed as a hostile party, but rather to comply with external privacy and legal constraints that require client data to remain stored on-device. FL has received abundant interest in recent years because it allows models to be trained on valuable data that would otherwise be inaccessible. To date, the vast majority of FL research has focused on the supervised setting, in which client data is fully labeled. However, in many real-world settings this is not the case. For instance, in cross-device FL, smartphone users are unlikely to be interested in annotating more than a handful of the photos on their devices, while in a cross-silo setting, labeling medical imaging data may be both costly and time-consuming. Consequently, there has been growing interest in recent years in learning from partly labeled data in a federated setting.

In this work we propose FedProp, a method for semi-supervised learning (SSL) in the federated setting that follows a manifold-based approach to pseudo-labeling client data. During each training round, FedProp leverages the data of multiple clients to obtain an estimate of the data manifold, which it then uses to compute pseudo-labels for clients' unlabeled data via label propagation. Using these pseudo-labels, clients then train in a supervised manner for the remainder of the round. The motivation for this approach is that the more data is available, the more densely the manifold is sampled, and therefore the better the resulting estimates and pseudo-labels. Thus, it is of crucial importance to combine information from multiple clients, rather than treating each client's data in isolation.
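To make the manifold-based pseudo-labeling step concrete, the following is a minimal, non-federated sketch of graph-based label propagation in the style of Zhou et al.: pooled features are connected in a k-nearest-neighbor affinity graph, and labels diffuse from the labeled points to the unlabeled ones in closed form. This is an illustrative baseline only; the function name, hyperparameters, and plain NumPy setup are our own choices, and FedProp performs the corresponding computation across clients without pooling raw data.

```python
import numpy as np

def propagate_labels(X, y, n_classes, k=10, alpha=0.99):
    """Graph-based label propagation over pooled features X.

    y holds class indices for labeled points and -1 for unlabeled ones.
    Returns a pseudo-label (class index) for every point.
    """
    n = X.shape[0]
    # Pairwise squared Euclidean distances and a Gaussian affinity.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    sigma2 = np.median(d2) + 1e-12
    W = np.exp(-d2 / sigma2)
    np.fill_diagonal(W, 0.0)
    # Keep only the k strongest edges per row, then symmetrize.
    drop = np.argsort(-W, axis=1)[:, k:]
    for i in range(n):
        W[i, drop[i]] = 0.0
    W = np.maximum(W, W.T)
    # Symmetrically normalized affinity S = D^{-1/2} W D^{-1/2}.
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(1) + 1e-12)
    S = W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    # One-hot label matrix; unlabeled rows are all zero.
    Y = np.zeros((n, n_classes))
    labeled = y >= 0
    Y[np.arange(n)[labeled], y[labeled]] = 1.0
    # Closed-form diffusion F = (I - alpha * S)^{-1} Y.
    F = np.linalg.solve(np.eye(n) - alpha * S, Y)
    return F.argmax(1)
```

The closed-form solve makes the density argument visible: each additional point adds a node to the graph, so denser sampling of the manifold yields stronger within-class connectivity and more reliable pseudo-labels.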
The key challenge lies in how to perform such cross-client pseudo-labeling, whose steps would normally require participating parties to share their data in order to estimate the manifold and run label propagation. Our main contribution lies in the CrossClientLP subroutine of FedProp, which we propose to address this challenge. It uses locality-sensitive hashing and secure Hamming distance computation


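The locality-sensitive-hashing ingredient can be illustrated with random-hyperplane (SimHash-style) codes: each feature vector is reduced to a short binary code whose Hamming distance to another code approximates the angle between the original vectors. The sketch below compares codes in the clear for clarity; in FedProp the comparison would instead happen under the secure Hamming distance protocol, which is not shown here. Function names and the shared-seed construction are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def lsh_codes(X, n_bits=256, seed=0):
    """Random-hyperplane LSH: the sign pattern of random projections.

    Clients using the same seed draw the same hyperplanes, so their
    codes are directly comparable without exchanging raw features.
    """
    rng = np.random.default_rng(seed)
    H = rng.normal(size=(X.shape[1], n_bits))  # shared random hyperplanes
    return (X @ H > 0).astype(np.uint8)

def hamming(a, b):
    """Number of differing bits between two binary codes."""
    return int((a != b).sum())
```

Because the expected fraction of differing bits equals the angle between the vectors divided by pi, nearby points on the manifold map to codes with small Hamming distance, which is exactly the quantity the secure protocol needs to evaluate.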