FEDERATED REPRESENTATION LEARNING VIA MAXIMAL CODING RATE REDUCTION

Anonymous

Abstract

We propose a federated methodology to learn low-dimensional representations from a dataset that is distributed among several clients. In particular, we move away from the commonly-used cross-entropy loss in federated learning, and seek to learn shared low-dimensional representations of the data in a decentralized manner via the principle of maximal coding rate reduction (MCR²). Our proposed method, which we refer to as FLOW, utilizes MCR² as the objective of choice, hence resulting in representations that are both between-class discriminative and within-class compressible. We theoretically show that our distributed algorithm achieves a first-order stationary point. Moreover, we demonstrate, via numerical experiments, the utility of the learned low-dimensional representations.

1. INTRODUCTION

Federated Learning (FL) has become the tool of choice when seeking to learn from distributed data. As opposed to a centralized setting where data are concentrated in a single node, FL allows datasets to be distributed among a set of clients. This subtle difference plays an important role in practice, where data collection has moved to the edge (e.g., cellphones, cameras, sensors, etc.), and centralizing all the available data might not be possible due to privacy constraints and hardware limitations. Moreover, under the FL paradigm, clients are required to train on their local datasets, which, unlike the centralized setting, successfully exploits the available computing resources at the edge (i.e., at each client). The key challenges in FL include dealing with (i) data imbalances between clients, (ii) unreliable connections between the server and the clients, (iii) a large number of clients participating in the communication, and (iv) objective mismatch between clients. A vast amount of successful work has been done to deal with challenges (i), (ii), and (iii). However, the often-overlooked challenge of objective mismatch plays a fundamental role in any distributed problem. For a client to participate in a collaborative training process (as opposed to training on its own private dataset), there must be a motivation: each client should see itself improved by taking part in the collaboration. Recent work has shown that even in the case of convex losses, FL converges to a stationary point of a mismatched optimization problem. This implies that there are cases where certain clients own the majority of the data (or even of certain classes), yet see their individual performance curtailed by the collaborative approach. When optimizing the average of the losses over the clients, the solution to the optimization problem generally differs from the solutions of the individual per-client optimization problems.
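To make the mismatch concrete, the following toy sketch uses quadratic stand-ins for two clients' local losses (all numbers are illustrative, not from the paper): the minimizer of the averaged objective sits far from both clients' own minimizers.

```python
import numpy as np

# Hypothetical setup: client 1's local loss is minimized at theta = 0,
# client 2's at theta = 4. Standard FL minimizes the unweighted average.
theta = np.linspace(-1, 5, 601)

f1 = (theta - 0.0) ** 2          # client 1's local objective
f2 = (theta - 4.0) ** 2          # client 2's local objective
avg = 0.5 * (f1 + f2)            # the global FL objective

theta_star = theta[np.argmin(avg)]
print(theta_star)                # ~2.0: far from both per-client minimizers
print(f1[np.argmin(avg)])        # ~4.0: client 1's loss at theta* vs. 0 at its own optimum
```

The global solution splits the difference between the two clients, so each client pays a nonzero local loss even though each could reach zero loss by training alone; this is exactly the incentive problem the text describes.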
Objective mismatch becomes a particularly difficult problem in FL given the privacy limitations, which prevent the central server from curtailing this undesirable effect. Moreover, given that in standard FL, the central server possesses no data, and that no proxies of data structures should be shared, a centralized solution cannot be implemented. In order to resolve the objective mismatch issue, several approaches have been proposed. However, most such approaches rely on obtaining more trustworthy gradients in the clients, at the expense of either more communication rounds, or more expensive communications. In this work, we propose an alternative representation learning-based approach to resolve objective mismatch, where low-dimensional representations of the data are learned in a distributed manner. We specifically bridge two seemingly disconnected fields, namely federated representation learning and rate distortion theory. We leverage rate distortion theory to propose a principled way of optimizing the coding rate of the data between the clients, which does not require sharing data between clients, and can be implemented in the standard FL setting, i.e., by sharing the weights of the underlying backbone (i.e., feature extractor) parameterizations. Our approach is collaborative in that all clients are individually rewarded by participating in the common optimization objective, and follows the FL paradigm, in which only gradients of the objective function with respect to the backbone parameters (or equivalently, the backbone parameters themselves) are shared between the clients and the central server. Related Work. Several studies have been conducted in the context of FL to show the problem of objective mismatch, by proposing modifications in the FL algorithm (Yang et al., 2019), adding constraints to the optimization problem (Shen et al., 2021), or even including extra rounds of communication (Mitra et al., 2021).
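As a rough illustration of this exchange pattern, the sketch below runs a FedAvg-style loop on synthetic least-squares data, in which only the model parameters travel between clients and server. All names, data, and hyperparameters here are illustrative stand-ins, not the paper's algorithm.

```python
import numpy as np

def local_update(theta, data, lr=0.1, steps=5):
    """Hypothetical local training: a few gradient steps on a least-squares loss."""
    X, y = data
    for _ in range(steps):
        grad = X.T @ (X @ theta - y) / len(y)
        theta = theta - lr * grad
    return theta

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])             # shared ground truth across clients
clients = []
for _ in range(4):
    X = rng.normal(size=(20, 3))
    y = X @ w_true + 0.1 * rng.normal(size=20)  # each client's private dataset
    clients.append((X, y))

theta_global = np.zeros(3)
for _ in range(10):                              # communication rounds
    # Each client trains locally; raw samples never leave the client.
    local_models = [local_update(theta_global, d) for d in clients]
    # The server aggregates by (unweighted) parameter averaging.
    theta_global = np.mean(local_models, axis=0)
```

Note that the server only ever sees parameter vectors, never data, which is the privacy constraint the text emphasizes.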
As opposed to these methods, we propose to tackle the problem by introducing a common loss that is in all clients' self-interest to minimize. Another line of research seeks to learn personalized FL solutions by partitioning the set of learnable parameters into two parts, a common part, called the backbone, and a personalized part, called the head, to be used for individual downstream tasks. Often referred to as personalized FL, this area of research is interested in learning models utilizing a common backbone that is collaboratively learned among all clients, while personalizing the head to each individual client's task or data distribution (e.g., Collins et al., 2021; Oh et al., 2021; Chen & Chao, 2021; Silva et al., 2022; Collins et al., 2022; Liang et al., 2022). We, on the other hand, are interested in learning representations in a principled and interpretable way, as opposed to converging to a solution without any guarantees on its behavior. In the context of information theory, rate distortion theory has been used to provide theoretical (Altug et al., 2013; Unal & Wagner, 2017; Mahmood & Wagner, 2022) and empirical (Ma et al., 2007; Wagner & Ballé, 2021) results on the tradeoff between the compression rate of a random variable and its reconstruction error. However, most such solutions are centralized. Contributions. We summarize our key contributions as follows: 1. We introduce a theoretically-grounded federated representation learning objective, referred to as the maximal coding rate reduction (MCR²), that seeks to minimize the number of bits needed to compress random representations up to a bounded reconstruction error. 2. We demonstrate that obtaining low-dimensional representations using our proposed method, which we refer to as FLOW, entails an objective that is naturally collaborative, i.e., all clients have a motivation to participate in the learning process.

2. BACKGROUND

FEDERATED LEARNING

Consider a federated learning (FL) setup with a central server and N clients. For any positive integer M, let [M] denote the set {1, . . . , M} containing the positive integers up to (and including) M. Each client n ∈ [N] is assumed to host a local dataset of labeled samples, denoted by D_n = {(x_i^n, y_i^n)}_{i=1}^{|D_n|}, where x_i^n ∈ R^D and y_i^n ∈ [K], ∀i ∈ [|D_n|], ∀n ∈ [N]. Focusing on a set of parameters θ ∈ Θ, we assume that the n-th client intends to minimize a local objective, denoted by f_n(D_n; θ), given its local dataset D_n. In many cases, such as for the cross-entropy (CE) loss, this local objective can be decomposed as an empirical average of the per-sample losses, i.e.,

f_n(D_n; θ) = (1/|D_n|) Σ_{i=1}^{|D_n|} ℓ(h_θ(x_i^n), y_i^n),     (1)

where h_θ : R^D → [K] is a parameterized model that maps each input sample x to its predicted label h_θ(x), and ℓ : [K] × [K] → R denotes a per-sample loss function. The global objective in the FL setup is to find a single set of parameters θ* that minimizes the average of the per-client objectives, i.e.,

θ* = arg min_{θ∈Θ} (1/N) Σ_{n=1}^{N} f_n(D_n; θ).     (2)
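Under this notation, the global objective can be sketched numerically as follows; a linear softmax model and synthetic data stand in for h_θ and the datasets D_n, so everything below is illustrative rather than the paper's setup.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def f_n(theta, X, y):
    """Per-client objective: (1/|D_n|) * sum_i CE(h_theta(x_i), y_i)."""
    probs = softmax(X @ theta)                  # h_theta maps R^D to a distribution over [K]
    return -np.mean(np.log(probs[np.arange(len(y)), y]))

def global_objective(theta, datasets):
    """The FL objective: (1/N) * sum_n f_n(D_n; theta)."""
    return np.mean([f_n(theta, X, y) for X, y in datasets])

rng = np.random.default_rng(1)
D, K, N = 5, 3, 4
datasets = [(rng.normal(size=(30, D)), rng.integers(0, K, size=30)) for _ in range(N)]
theta = np.zeros((D, K))
print(global_objective(theta, datasets))        # log(K) ≈ 1.0986 at uniform predictions
```

At θ = 0 the model predicts the uniform distribution over the K classes, so every per-client cross-entropy equals log K, which makes the printed value easy to sanity-check against equation (2).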

