THE BEST OF BOTH WORLDS: ACCURATE GLOBAL AND PERSONALIZED MODELS THROUGH FEDERATED LEARNING WITH DATA-FREE HYPER-KNOWLEDGE DISTILLATION

Abstract

Heterogeneity of data distributed across clients limits the performance of global models trained through federated learning, especially in settings with highly imbalanced class distributions in local datasets. In recent years, personalized federated learning (pFL) has emerged as a potential solution to the challenges presented by heterogeneous data. However, existing pFL methods typically enhance the performance of local models at the expense of the global model's accuracy. We propose FedHKD (Federated Hyper-Knowledge Distillation), a novel FL algorithm in which clients rely on knowledge distillation (KD) to train local models. In particular, each client extracts and sends to the server the means of local data representations and the corresponding soft predictions -- information that we refer to as "hyper-knowledge". The server aggregates this information and broadcasts it to the clients in support of local training. Notably, unlike other KD-based pFL methods, FedHKD neither relies on a public dataset nor deploys a generative model at the server. We analyze the convergence of FedHKD and conduct extensive experiments on visual datasets in a variety of scenarios, demonstrating that FedHKD provides significant improvements in both personalized and global model performance compared to state-of-the-art FL methods designed for heterogeneous data settings.

1. INTRODUCTION

Federated learning (FL), a communication-efficient and privacy-preserving alternative to training on centrally aggregated data, relies on collaboration between clients who own local data to train a global machine learning model. A central server coordinates the training without violating clients' privacy -- the server has no access to the clients' local data. The first such scheme, Federated Averaging (FedAvg) (McMahan et al., 2017), alternates between two steps: (1) randomly selected client devices initialize their local models with the global model received from the server, and proceed to train on local data; (2) the server collects local model updates and aggregates them via weighted averaging to form a new global model. As analytically shown in (McMahan et al., 2017), FedAvg is guaranteed to converge when the client data is independent and identically distributed (iid). A major problem in FL systems emerges when the clients' data is heterogeneous (Kairouz et al., 2021). This is a common setting in practice since the data owned by clients participating in federated learning is likely to have originated from different distributions. In such settings, the FL procedure may converge slowly and the resulting global model may perform poorly on the local data of an individual client. To address this challenge, a number of FL methods aiming to enable learning on non-iid data have recently been proposed (Karimireddy et al., 2020; Li et al., 2020; 2021a; Acar et al., 2021; Liu et al., 2021; Yoon et al., 2021; Chen & Vikalo, 2022). Unfortunately, these methods struggle to train a global model that performs well when the clients' data distributions differ significantly. Difficulties of learning on non-iid data, as well as the heterogeneity of the clients' resources (e.g., compute, communication, memory, power), have motivated a variety of personalized FL (pFL) techniques.
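The server-side aggregation in step (2) of FedAvg can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name and the representation of model updates as dictionaries of NumPy arrays are assumptions made for clarity, and client updates are weighted by local dataset size as in McMahan et al. (2017).

```python
# Minimal sketch of FedAvg's server-side weighted averaging (step 2).
# Hypothetical helper: model parameters are dicts mapping layer names
# to NumPy arrays; weights are proportional to local dataset sizes.
import numpy as np

def fedavg_aggregate(client_weights, client_sizes):
    """Return the size-weighted average of the clients' model parameters."""
    total = sum(client_sizes)
    global_weights = {}
    for name in client_weights[0]:
        global_weights[name] = sum(
            (n / total) * w[name] for w, n in zip(client_weights, client_sizes)
        )
    return global_weights

# Example: two clients, one holding three times as much data as the other.
clients = [{"layer1": np.ones(2)}, {"layer1": np.zeros(2)}]
sizes = [3, 1]
agg = fedavg_aggregate(clients, sizes)  # layer1 -> [0.75, 0.75]
```

The new global model `agg` would then be broadcast back to the clients, who reinitialize their local models with it before the next round of local training (step 1).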

