FEDERATED LEARNING WITH OPENSET NOISY LABELS

Abstract

Federated learning is a learning paradigm that allows a central server to learn from different data sources while keeping the data local and private. Without control over or monitoring of the local data collection process, it is highly likely that the locally available training labels are noisy, just as in a centralized data collection effort. Moreover, different clients may hold samples from different label spaces. The noisy label space is likely to differ from the unobservable clean label space, resulting in openset noisy labels. In this work, we study the challenge of federated learning from clients with openset noisy labels. We observe that many existing solutions in the noisy-label literature, e.g., loss correction, cannot achieve their originally claimed effect in local training. A central contribution of this work is an approach that communicates globally and randomly selected "contrastive labels" among clients to prevent local models from individually memorizing the openset noise patterns. Randomized label generation is applied during label sharing to facilitate access to the contrastive labels while ensuring differential privacy (DP). Both the DP guarantee and the effectiveness of our approach are established theoretically. Compared with several baseline methods, our solution demonstrates its effectiveness on several public benchmarks and real-world datasets under different noise ratios and noise models.

1. INTRODUCTION

With the development of distributed computation, federated learning (FL) has emerged as a powerful learning paradigm for its ability to train with data from multiple clients under strong data privacy protection (McMahan et al., 2017; Kairouz et al., 2021; Yang et al., 2019). With each of the distributed clients having a different collection and annotation process, their observed data distributions are likely to be highly heterogeneous and noisy. This paper aims to provide solutions for a practical FL setting where not only do the clients' training labels carry different noise rates, but the observed label spaces at these clients also differ, even though their underlying clean labels are drawn from the same label space. For example, in a global medical system, the causes (labels) of disease are annotated and reported by doctors, and these labels are potentially noisy due to differences in the doctors' training backgrounds (Ng et al., 2021). When certain causes and cases can only be found in data clients from country A but not country B, the observed noisy label classes in country A will then differ from those in country B. We say such a federated learning system has an openset noise problem if the observed label spaces differ across clients.

We observe that the above openset label noise poses significant challenges if we apply existing learning-with-noisy-labels solutions locally at each client. For instance, a good number of these existing solutions operate on centralized training data and rely on the design of robust loss functions (Natarajan et al., 2013; Patrini et al., 2017; Ghosh et al., 2017; Zhang & Sabuncu, 2018; Feng et al., 2021; Wei & Liu, 2021; Zhu et al., 2021a). Implementing these approaches often requires assumptions that are likely to be violated if we directly employ these centralized solutions in a federated learning setting.
For example, loss correction is a popular design of robust loss functions (Patrini et al., 2017; Natarajan et al., 2013; Liu & Tao, 2015; Scott, 2015; Jiang et al., 2022), where the key step is to correctly estimate the label noise transition matrix (Bae et al., 2022; Zhang et al., 2021b; Zhu et al., 2021b; 2022). When ground-truth labels are not available, correctly estimating the label noise transition matrix requires observing the full label space. In FL, where the transition matrix is often estimated only with the local openset noisy labels, existing estimators
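To make the dependence on the transition matrix concrete, the following minimal sketch illustrates forward loss correction in the spirit of Patrini et al. (2017). All names and the toy symmetric-noise matrix are our own illustrative choices, not the paper's method: the model's clean-class posterior is pushed through an estimated transition matrix `T` (with `T[i, j]` approximating the probability that clean class `i` is observed as noisy class `j`) before the cross-entropy is computed. Note that this construction implicitly assumes the full (square) label space is observed; under openset noise, a client cannot even form this matrix correctly from its local labels.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the class logits.
    e = np.exp(z - z.max())
    return e / e.sum()

def forward_corrected_loss(logits, noisy_label, T):
    """Forward loss correction: score the prediction against the
    noisy label after mapping the clean posterior through T."""
    p_clean = softmax(logits)      # model's estimate of the clean posterior
    p_noisy = T.T @ p_clean        # implied distribution over noisy labels
    return -np.log(p_noisy[noisy_label] + 1e-12)

# Toy 3-class example with symmetric 20% label noise:
# each clean class keeps its label w.p. 0.8, flips to each other w.p. 0.1.
T = np.full((3, 3), 0.1)
np.fill_diagonal(T, 0.8)

logits = np.array([2.0, 0.5, -1.0])
loss = forward_corrected_loss(logits, noisy_label=0, T=T)
```

Because `T` must cover every class that can appear as a noisy label, a client whose local noisy label space is a strict subset (or superset) of the clean one cannot estimate a valid square `T` locally, which is exactly the failure mode discussed above.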

