FEDERATED REPRESENTATION LEARNING VIA MAXIMAL CODING RATE REDUCTION Anonymous

Abstract

We propose a federated methodology to learn low-dimensional representations from a dataset that is distributed among several clients. In particular, we move away from the commonly-used cross-entropy loss in federated learning, and seek to learn shared low-dimensional representations of the data in a decentralized manner via the principle of maximal coding rate reduction (MCR^2). Our proposed method, which we refer to as FLOW, utilizes MCR^2 as the objective of choice, hence resulting in representations that are both between-class discriminative and within-class compressible. We theoretically show that our distributed algorithm achieves a first-order stationary point. Moreover, we demonstrate, via numerical experiments, the utility of the learned low-dimensional representations.

1. INTRODUCTION

Federated Learning (FL) has become the tool of choice when seeking to learn from distributed data. As opposed to a centralized setting where data are concentrated in a single node, FL allows datasets to be distributed among a set of clients. This subtle difference plays an important role in practice, where data collection has moved to the edge (e.g., cellphones, cameras, sensors, etc.), and centralizing all the available data might not be possible due to privacy constraints and hardware limitations. Moreover, under the FL paradigm, clients are required to train on their local datasets, which, unlike the centralized setting, successfully exploits the computing resources available at the edge (i.e., at each client). The key challenges in FL include dealing with (i) data imbalances between clients, (ii) unreliable connections between the server and the clients, (iii) a large number of clients participating in the communication, and (iv) objective mismatch between clients. A vast amount of successful work has been done to deal with challenges (i), (ii), and (iii). However, the often-overlooked challenge of objective mismatch plays a fundamental role in any distributed problem. For a client to participate in a collaborative training process (as opposed to training on its own private dataset), there must be a motivation: each client should see itself improved by taking part in the collaboration. Recent work has shown that even in the case of convex losses, FL converges to a stationary point of a mismatched optimization problem. This implies that there are cases where certain clients own the majority of the data (or even of certain classes), and see their individual performance curtailed by the collaborative approach. When optimizing the average of the losses over the clients, the solution to the optimization problem generally differs from the solutions of the individual per-client optimization problems.
Objective mismatch becomes a particularly difficult problem in FL given the privacy limitations, which prevent the central server from curtailing this undesirable effect. Moreover, given that in standard FL the central server possesses no data, and that no proxies of data structures should be shared, a centralized solution cannot be implemented. In order to resolve the objective mismatch issue, several approaches have been proposed. However, most such approaches rely on obtaining more trustworthy gradients at the clients, at the expense of either more communication rounds or more expensive communications. In this work, we propose an alternative representation-learning-based approach to resolve objective mismatch, in which low-dimensional representations of the data are learned in a distributed manner. We specifically bridge two seemingly disconnected fields, namely federated representation learning and rate-distortion theory. We leverage rate-distortion theory to propose a principled way of optimizing the coding rate of the data across the clients, which does not require sharing data between clients, and can be implemented in the standard FL setting, i.e., by sharing the weights of the underlying backbone (i.e., feature extractor) parameterizations. Our approach is collaborative in that all clients are individually rewarded by participating in the common optimization objective, and it follows the FL paradigm, in which only gradients of the objective function with respect to the backbone parameters (or, equivalently, the backbone parameters themselves) are shared between the clients and the central server. Related Work. Several studies have been conducted in the context of FL to expose the problem of objective mismatch, proposing modifications to the FL algorithm (Yang et al., 2019), adding constraints to the optimization problem (Shen et al., 2021), or including extra rounds of communication (Mitra et al., 2021).
As opposed to these methods, we propose to tackle the problem by introducing a common loss that it is in all clients' self-interest to minimize. Another line of research seeks to learn personalized FL solutions by partitioning the set of learnable parameters into two parts: a common part, called the backbone, and a personalized part, called the head, to be used for individual downstream tasks. Often referred to as personalized FL, this area of research is interested in learning models with a common backbone that is collaboratively learned among all clients, while personalizing the head to each individual agent's task or data distribution (Liang et al., 2020; Collins et al., 2021; Oh et al., 2021; Chen & Chao, 2021; Silva et al., 2022; Collins et al., 2022; Chen et al., 2022). We, on the other hand, are interested in learning representations in a principled and interpretable way, as opposed to converging to a solution without any guarantees on its behavior. In the context of information theory, rate-distortion theory has been used to provide theoretical (Altug et al., 2013; Unal & Wagner, 2017; Mahmood & Wagner, 2022) and empirical (Ma et al., 2007; Wagner & Ballé, 2021) results on the tradeoff between the compression rate of a random variable and its reconstruction error. However, most such solutions are centralized. Contributions. We summarize our key contributions as follows:
1. We introduce a theoretically-grounded federated representation learning objective, based on the maximal coding rate reduction (MCR^2), that seeks to minimize the number of bits needed to compress random representations up to a bounded reconstruction error.
2. We demonstrate that obtaining low-dimensional representations using our proposed method, which we refer to as FLOW, entails an objective that is naturally collaborative, i.e., all clients have a motivation to participate in the learning process.

2. BACKGROUND

2.1 FEDERATED LEARNING

Consider a federated learning (FL) setup with a central server and N clients. For any positive integer M, let [M] denote the set {1, . . . , M} containing the positive integers up to (and including) M. Each client n ∈ [N] is assumed to host a local dataset of labeled samples, denoted by D_n = {(x^n_i, y^n_i)}_{i=1}^{|D_n|}, where x^n_i ∈ R^D and y^n_i ∈ [K], ∀i ∈ [|D_n|], ∀n ∈ [N]. Focusing on a set of parameters θ ∈ Θ, we assume that the n-th client intends to minimize a local objective, denoted by f_n(D_n; θ), given its local dataset D_n. In many cases, such as the cross-entropy (CE) loss, this local objective can be decomposed as an empirical average of per-sample losses, i.e.,

f_n(D_n; θ) = (1/|D_n|) Σ_{i=1}^{|D_n|} ℓ(h_θ(x^n_i), y^n_i), (1)

where h_θ : R^D → [K] is a parameterized model that maps each input sample x to its predicted label h_θ(x), and ℓ : [K] × [K] → R denotes a per-sample loss function. The global objective in the FL setup is to find a single set of parameters θ* that minimizes the average of the per-client objectives, i.e.,

θ* = arg min_{θ∈Θ} (1/N) Σ_{n=1}^N f_n(D_n; θ). (2)

It is assumed that the clients in an FL setup cannot share their local datasets with each other. This implies that the optimization problem in (2) needs to be solved in a distributed manner. To that end, we assume that each client n ∈ [N] maintains a local set of parameters θ^n_t ∈ Θ over a series of time steps t ∈ [T]. Each client performs τ local updates using stochastic gradient descent (SGD), and the local parameters are sent to a central server every τ time steps, so that the server can average the clients' parameters and broadcast the resulting aggregated parameters back to the clients to replace their local models.
More precisely, denoting the learning rate by η, and letting ∇̃_θ represent the stochastic gradient with respect to the model parameters, the sequential parameter updates are given by

θ^n_{t+1} = θ^n_t − η ∇̃_θ f_n(D_n; θ^n_t),   if t mod τ ≠ 0,
θ^n_{t+1} = (1/N) Σ_{m=1}^N θ^m_t,           otherwise. (3)

This forms the basis of the FedAvg algorithm (McMahan et al., 2017).
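As an illustration, the update rule above can be simulated with a small numpy sketch. The quadratic local losses below are hypothetical stand-ins for the f_n (they are not the paper's training objective); the point is only the structure of local steps plus periodic server averaging.

```python
import numpy as np

def fedavg(grads, thetas_init, eta=0.1, tau=5, T=20):
    """Toy FedAvg: each client runs local gradient steps on its own loss;
    every tau steps the server averages the local parameters and broadcasts
    the result back. `grads` is a list of per-client gradient functions."""
    thetas = [np.array(th, dtype=float) for th in thetas_init]
    N = len(thetas)
    for t in range(1, T + 1):
        # local SGD step at every client
        thetas = [th - eta * g(th) for th, g in zip(thetas, grads)]
        if t % tau == 0:  # aggregation round: average and broadcast
            avg = sum(thetas) / N
            thetas = [avg.copy() for _ in range(N)]
    return thetas

# Two clients with hypothetical quadratic losses f_n = 0.5*||theta - c_n||^2,
# so each local gradient is theta - c_n; the average loss is minimized at
# the mean of the c_n.
c = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
grads = [lambda th, cn=cn: th - cn for cn in c]
out = fedavg(grads, [np.zeros(2), np.zeros(2)], eta=0.2, tau=2, T=200)
```

After the final aggregation round, all clients hold the same parameters, close to the minimizer of the average loss.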

2.1.1. PERSONALIZED FEDERATED LEARNING

Leveraging the representation learning paradigm (Bengio et al., 2013; Oord et al., 2018; Chen et al., 2020), the parameterized model h_θ : R^D → [K] can be decomposed into two components, namely, (i) a backbone h_ϕ : R^D → R^d, parameterized by a set of parameters ϕ ∈ Φ, that maps each input sample x ∈ R^D to a low-dimensional representation z = h_ϕ(x) ∈ R^d, where we assume that d ≪ D, and (ii) a head h_ψ : R^d → [K], parameterized by a set of parameters ψ ∈ Ψ, that maps the representation z ∈ R^d to the predicted class h_ψ(z) = h_ψ(h_ϕ(x)) = h_θ(x) ∈ [K]. This implies that the set of end-to-end model parameters is given by θ = (ϕ, ψ), with the corresponding parameter space being decomposed as Θ = Φ × Ψ. Such a decomposition can then be used to train a shared backbone for all the clients using the FL procedure, while the training process for the head can be personalized and local to each client. In particular, for the n-th client, assume that the local objective f_n(D_n; θ) can be decomposed into an objective on the backbone parameters, denoted by f_{n,ϕ}(D_n; ϕ), and a separate objective on the head parameters, denoted by f_{n,ψ}(D̃_{n,ϕ}; ψ), where D̃_{n,ϕ} = {(z^n_i, y^n_i)}_{i=1}^{|D_n|} = {(h_ϕ(x^n_i), y^n_i)}_{i=1}^{|D_n|}, i.e., the dataset D_n with each input sample x^n_i replaced by its low-dimensional representation z^n_i = h_ϕ(x^n_i). Then, the global backbone objective is a variation of (2), where the end-to-end objectives are replaced by their backbone counterparts, i.e.,

ϕ* = arg min_{ϕ∈Φ} (1/N) Σ_{n=1}^N f_{n,ϕ}(D_n; ϕ). (5)

Similarly to (3), in order to derive the optimal backbone parameters ϕ* using SGD, the backbone parameters at each client n ∈ [N] can be sequentially updated as

ϕ^n_{t+1} = ϕ^n_t − η ∇̃_ϕ f_{n,ϕ}(D_n; ϕ^n_t),   if t mod τ ≠ 0,
ϕ^n_{t+1} = (1/N) Σ_{m=1}^N ϕ^m_t,              otherwise. (6)
Once the optimal backbone parameters ϕ* are derived, each client n ∈ [N] can freeze its backbone and train its personalized head parameters ψ_n based on its local dataset D̃_{n,ϕ*}, i.e.,

ψ*_n = arg min_{ψ∈Ψ} f_{n,ψ}(D̃_{n,ϕ*}; ψ). (7)

2.2 RATE-DISTORTION THEORY AND MAXIMAL CODING RATE REDUCTION

Among the many ways to define the backbone objective f_ϕ(D; ϕ) to learn low-dimensional representations for a given dataset D (see, e.g., Chen et al. (2020); Grill et al. (2020); Wang & Isola (2020); Zbontar et al. (2021); Bardes et al. (2021); Lezama et al. (2018)), the maximal coding rate reduction (MCR^2, in short) has recently been proposed by Yu et al. (2020) as a theoretically-grounded way of training low-dimensional representations based on rate-distortion theory (Cover & Thomas, 2006). Consider an i.i.d. sequence {z_i}_{i∈[M]} of M random variables following a distribution p(z), z ∈ Z, and a distortion function ω : Z × Z → R_+. For a given Ω ≥ 0, the rate-distortion function is defined as the infimum rate r for which there exist an encoding function g_enc : Z^M → [2^{Mr}] and a decoding function g_dec : [2^{Mr}] → Z^M such that

lim_{M→∞} (1/M) Σ_{i=1}^M E[ω(z_i, ẑ_i)] ≤ Ω, (8)

where the sequence {ẑ_i}_{i∈[M]} = (g_dec ∘ g_enc)({z_i}_{i∈[M]}) denotes the reconstruction of the original sequence {z_i}_{i∈[M]} at the decoder output. Intuitively, the rate-distortion function represents the minimum number of bits required to compress a given random variable such that the decompression error is upper-bounded by the constant Ω. In general, deriving the rate-distortion function is challenging, as it entails computing mutual information terms between the input sequence and the reconstructed sequence. However, for a finite sample drawn from a zero-mean multivariate Gaussian distribution with a squared-error distortion measure, the rate-distortion function has a closed-form expression. In particular, letting Z = [z_1 . . . z_M] ∈ R^{d×M} denote the matrix containing a set of M d-dimensional samples, for a squared-error distortion of ϵ^2, the rate-distortion function is given by

((M + d)/2) log det(I + (d/(M ϵ^2)) Z Z^T), (9)

where I denotes the d × d identity matrix (Ma et al., 2007). Quite interestingly, the rate-distortion function, when normalized by the number of samples, can be viewed as a measure of compactness of the given samples in R^d. Assuming M ≫ d, this leads to the coding rate R(Z, ϵ), defined as

R(Z, ϵ) := (1/2) log det(I + (d/(M ϵ^2)) Z Z^T). (10)

The coding rate in (10) can be leveraged in a representation learning setup, where the z_i's are the representations produced by the backbone h_ϕ. For the representations to be useful, the representations within one class should be as compact as possible, whereas the entire set of representations should be as diverse as possible. For a given class k ∈ [K], let Π_k ∈ R^{M×M} be a diagonal binary matrix whose i-th diagonal element is 1 if and only if the i-th sample belongs to class k. Then, the average per-class coding rate given the partitioning Π = {Π_k}_{k∈[K]} can be written as

R_c(Z, ϵ | Π) := (1/(2M)) Σ_{k∈[K]} tr(Π_k) log det(I + (d/(tr(Π_k) ϵ^2)) Z Π_k Z^T), (11)

where tr(·) represents the trace operation. The principle of maximal coding rate reduction (MCR^2) proposed by Yu et al. (2020) defines the backbone objective f_ϕ(D; ϕ) as the difference between the average per-class coding rate R_c(Z, ϵ | Π) in (11) and the average coding rate over the entire dataset, R(Z, ϵ) in (10). More precisely,

f_ϕ(D; ϕ) = −ΔR(Z(D; ϕ)) = R_c(Z(D; ϕ), ϵ | Π) − R(Z(D; ϕ), ϵ), (12)

where the dependence of the representations Z on the dataset D and the set of backbone parameters ϕ is made explicit.
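A minimal numpy sketch of the coding-rate quantities in (10)-(12), assuming representations are stored column-wise in a d × M matrix (an illustration, not the paper's implementation):

```python
import numpy as np

def coding_rate(Z, eps):
    """R(Z, eps) = 1/2 * logdet(I + d/(M*eps^2) Z Z^T), as in eq. (10)."""
    d, M = Z.shape
    return 0.5 * np.linalg.slogdet(np.eye(d) + d / (M * eps**2) * Z @ Z.T)[1]

def per_class_coding_rate(Z, eps, labels):
    """R_c(Z, eps | Pi), as in eq. (11): per-class rates weighted by class size."""
    d, M = Z.shape
    total = 0.0
    for k in np.unique(labels):
        Zk = Z[:, labels == k]           # Z Pi_k collapsed to the class columns
        mk = Zk.shape[1]                 # tr(Pi_k)
        total += (mk / (2 * M)) * np.linalg.slogdet(
            np.eye(d) + d / (mk * eps**2) * Zk @ Zk.T)[1]
    return total

def mcr2(Z, eps, labels):
    """Coding rate reduction Delta R = R - R_c (the quantity to maximize)."""
    return coding_rate(Z, eps) - per_class_coding_rate(Z, eps, labels)

# two classes on orthogonal axes vs. all samples collapsed onto one axis
Z_orth = np.array([[1.0, 1.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0, 1.0]])
Z_flat = np.array([[1.0, 1.0, 1.0, 1.0],
                   [0.0, 0.0, 0.0, 0.0]])
labels = np.array([0, 0, 1, 1])
dr_orth = mcr2(Z_orth, 0.5, labels)
dr_flat = mcr2(Z_flat, 0.5, labels)
```

As expected, orthogonal per-class subspaces achieve a strictly positive coding rate reduction, while collapsed representations achieve none.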

3. PROPOSED METHOD

Learning a low-dimensional representation can be posed as a collaborative objective, where each client in the network benefits from the collaboration. In federated learning, the dataset D is distributed among a set of clients, i.e., D = ∪_{n∈[N]} D_n, where D_n is the dataset located at the n-th client. We leverage the MCR^2 principle to introduce the global objective of our proposed FL method, which we refer to as Federated LOW-dimensional representation learning, or FLOW, as follows:

min_ϕ f_ϕ(D; ϕ) := Σ_{k∈[K]} (|M_k|/(2M)) log det( I + (d/(|M_k| ϵ^2)) Σ_{n∈[N]} Σ_{m∈D_n∩M_k} h_ϕ(x_m) h_ϕ(x_m)^T )
                   − (1/2) log det( I + (d/(M ϵ^2)) Σ_{n∈[N]} Σ_{m∈D_n} h_ϕ(x_m) h_ϕ(x_m)^T ), (13)

where, for a given class k ∈ [K], M_k denotes the set of samples that belong to the k-th class. Note that in (13), we have made the dependency of the objective function on ϕ explicit, that is, z_m = h_ϕ(x_m). It is worth noting that the objectives f_ϕ(D; ϕ) in (12) and (13) are equivalent, since Z = [z_1 . . . z_M] = [h_ϕ(x_1) . . . h_ϕ(x_M)], ZZ^T = Σ_{m∈[M]} z_m z_m^T, and the partition matrix Π_k has its m-th diagonal element equal to one if and only if the m-th sample belongs to M_k. Therefore, learning low-dimensional representations in a distributed manner is equivalent to solving (13). Note that, as opposed to common FL implementations, our approach optimizes a common objective, as opposed to a summation over different objectives. However, this comes at a cost: the objective in (13) is not separable, i.e., it does not immediately follow that each client can take local gradient descent steps. In what follows, we will demonstrate interesting properties of problem (13), namely (i) that it is in each client's self-interest to obtain a collaborative solution, and (ii) that a solution to problem (13) can be found in a distributed manner without clients needing to share their local datasets with each other.
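The equivalence between (12) and (13) can be checked numerically: since ZZ^T = Σ_m z_m z_m^T, the Gram sums in (13) decompose over clients, so the objective can be assembled from per-client sums of outer products. The sketch below uses synthetic Gaussian "representations" and a hypothetical three-client split (all names and sizes are illustrative) and evaluates the objective both ways:

```python
import numpy as np

rng = np.random.default_rng(0)
d, eps = 4, 0.5
# three hypothetical clients, each holding representations from two classes
clients = [{k: rng.normal(size=(d, int(rng.integers(3, 6)))) for k in (0, 1)}
           for _ in range(3)]
logdet = lambda A: np.linalg.slogdet(A)[1]

# (12): pool all representations and compute R_c - R centrally
Z_all = np.concatenate([c[k] for c in clients for k in (0, 1)], axis=1)
M = Z_all.shape[1]
R = 0.5 * logdet(np.eye(d) + d / (M * eps**2) * Z_all @ Z_all.T)
Rc = 0.0
for k in (0, 1):
    Zk = np.concatenate([c[k] for c in clients], axis=1)
    mk = Zk.shape[1]
    Rc += (mk / (2 * M)) * logdet(np.eye(d) + d / (mk * eps**2) * Zk @ Zk.T)
central = Rc - R

# (13): each client only contributes per-class sums of outer products z z^T
S = {k: sum(c[k] @ c[k].T for c in clients) for k in (0, 1)}
m = {k: sum(c[k].shape[1] for c in clients) for k in (0, 1)}
M2 = sum(m.values())
R2 = 0.5 * logdet(np.eye(d) + d / (M2 * eps**2) * sum(S.values()))
Rc2 = sum((m[k] / (2 * M2)) * logdet(np.eye(d) + d / (m[k] * eps**2) * S[k])
          for k in (0, 1))
federated = Rc2 - R2
```

The two evaluations agree up to floating-point error, confirming that the global objective decomposes over the clients' second moments.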

3.1. MOTIVATION

Learning low-dimensional representations is a collaborative objective, and it is in each client's self-interest to obtain a better representation. The choice of maximizing the coding rate reduction is well motivated by the properties of the solution of problem (13), as shown in the following theorem.

Theorem 1 (informal; see Yu et al. (2020, Theorem 2.1) for the precise conditions on the embedding dimension d and the coding precision ϵ). Let ϕ* denote a solution of problem (13). Then:
• The optimal subspaces associated with each class are orthogonal, even for data held by different clients, i.e., h_ϕ*(x_m)^T h_ϕ*(x_{m′}) = 0 for any m ∈ M_k, m′ ∈ M_{k′} with k ≠ k′; and
• each class covariance Z*_k = Σ_{m∈M_k} h_ϕ*(x_m) h_ϕ*(x_m)^T achieves its maximal dimension, rank(Z*_k) = d_k, and the largest d_k − 1 singular values of Z*_k are equal.

Proof. The proof follows from (Yu et al., 2020, Theorem 2.1), noting that problem (13) is equivalent to optimizing the centralized objective (12). A similar proof can also be found in (Chan et al., 2022, Theorem 1).

Theorem 1 is important because it shows that the benefits of our method are two-fold: (i) the solution of the problem yields representations that are orthogonal between classes, even for data coming from different clients, and (ii) the obtained representations for each class are maximally diverse. Theorem 1 is notable given that we do not share data between clients, yet we are still able to learn representations that are orthogonal between classes. That is to say, if two samples x ∈ R^D and x′ ∈ R^D belong to different classes, their corresponding low-dimensional representations z and z′ will be orthogonal, regardless of which client owns each datum. What is more, the subspace associated with each class is maximal across clients, which translates into a rich and diverse representation, even in low dimensions. Note that if clients were to solve the problem individually, there would be two undesirable properties.
First, even if the representations of samples of different classes for a given client are orthogonal, that orthogonality might be violated when we move across clients, since there is no guarantee that per-class subspaces are aligned across clients. Therefore, having a common representation is a desirable property, as it enforces orthogonality between samples that do not co-exist at the same client. Second, the fact that each class subspace achieves its maximal dimension makes the representations more diverse, grouping similar samples together. Again, this property is desirable, and collaboration is in each client's best interest. Note that these are properties of the centralized approach of Yu et al. (2020), which our proposed method inherits and maintains in the distributed setting.

3.2. ALGORITHM CONSTRUCTION

The optimization problem in (13) is non-separable between clients; that is to say, the global objective is not equal to a summation, or an average, of individual objectives. Given that a closed-form solution for ϕ cannot be obtained in practice, we turn to an iterative SGD-based procedure. In short, at each round t, each client n updates its local copy of the model using its own data to minimize its local MCR^2 objective (i.e., to maximize its coding rate reduction), as follows:

ϕ^n_{t+1} = ϕ^n_t − η ∇_ϕ f_ϕ(D_n; ϕ^n_t), (14)

with η being a non-negative step size. Every τ rounds, the clients communicate their backbone parameters back to the central server, whose job is to average the received backbone parameters and broadcast the result. Notice that this framework has two advantages: (i) clients do not need to share any of their private data, and (ii) the computation is done at the edge, on the clients. Moreover, averaging the models between the clients can be done utilizing homomorphic encryption (HE), preventing the central server from inspecting individual clients' parameters (or gradients). An overview of our proposed method can be found in Algorithm 1.
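As a toy illustration of privacy-preserving aggregation, the sketch below uses a simplified additive-masking (secure-aggregation) scheme in place of full homomorphic encryption: clients add pairwise cancelling random masks, so the server only sees masked parameters yet recovers the exact average. This is a didactic stand-in, not a cryptographically secure protocol:

```python
import numpy as np

def masked_average(params, rng):
    """Toy secure aggregation: client n adds the masks it shares with
    higher-indexed clients and subtracts those shared with lower-indexed
    ones; summed over clients, all masks cancel, so the server recovers
    the exact average without seeing any raw parameter vector."""
    N = len(params)
    masks = [[rng.normal(size=params[0].shape) for _ in range(N)]
             for _ in range(N)]
    masked = []
    for n, p in enumerate(params):
        noise = sum(masks[n][j] for j in range(n + 1, N)) \
              - sum(masks[j][n] for j in range(n))
        masked.append(p + noise)          # what the server actually receives
    return sum(masked) / N

rng = np.random.default_rng(1)
params = [rng.normal(size=3) for _ in range(4)]  # four clients' parameters
avg = masked_average(params, rng)
```

The recovered `avg` equals the plain average of the clients' parameters, while each transmitted vector is individually noise-like.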

3.3. CONVERGENCE OF FLOW

In this section, we analyze the convergence of FLOW (cf. Algorithm 1). To do so, we require the following assumption.

Assumption 1. The MCR^2 loss is G-smooth with respect to the parameters ϕ, i.e.,

‖∇_ϕ f_ϕ(D_n; ϕ_1) − ∇_ϕ f_ϕ(D_n; ϕ_2)‖ ≤ G ‖ϕ_1 − ϕ_2‖. (15)

Assumption 1 is a standard assumption for learning problems; it imposes Lipschitz continuity of the gradient of the loss with respect to the parameters ϕ. For neural network parameterizations, this is a mild assumption, given the smoothness of common non-linearities and the linearity of the filters.

Theorem 2. Consider the iterates generated by Algorithm 1 with step size η ≤ 1/G. Under Assumption 1, if the client gradients are unbiased estimates of ∇_ϕ f_ϕ(D; ϕ), i.e., E_{D_n}[∇_ϕ f_ϕ(D_n; ϕ)] = ∇_ϕ f_ϕ(D; ϕ), and the variance of the gradient estimates is bounded, i.e., E[‖∇_ϕ f_ϕ(D_n; ϕ) − ∇_ϕ f_ϕ(D; ϕ)‖^2] ≤ σ^2, then

(1/T) Σ_{t=1}^T ‖∇_ϕ f_ϕ(D; ϕ_t)‖^2 ≤ (G/T) (f_ϕ(D; ϕ_0) − f_ϕ(D; ϕ_T)) + σ^2/(2N). (16)

Proof. See Appendix A.

If the datasets D_n are composed of samples that are sufficiently similar, the individual gradients taken at each client can be modeled as unbiased estimates of the gradient taken over the whole dataset, i.e., E_{D_n}[∇_ϕ f_ϕ(D_n; ϕ)] = ∇_ϕ f_ϕ(D; ϕ). Theorem 2 provides a standard convergence result for a non-convex loss, indicating that the average of the squared gradient norms does not diverge. This implies that the gradient norm vanishes on average, i.e., the iterates of the algorithm approach a first-order stationary point. We can also provide a proof of convergence of our algorithm in the case in which the client distributions are not uniform.

Theorem 3. Consider the iterates generated by Algorithm 1 with step size η ≤ 1/G. Under Assumption 1, if the client gradients are biased estimates of ∇_ϕ f_ϕ(D; ϕ), i.e., E[∇_ϕ f_ϕ(D_n; ϕ)] = ∇_ϕ f_ϕ(D; ϕ) + μ_n, with |μ_n^T ∇_ϕ f_ϕ(D; ϕ)| ≤ δ, and E[‖∇_ϕ f_ϕ(D; ϕ) − ∇_ϕ f_ϕ(D_n; ϕ)‖^2] ≤ δ^2 + σ^2, then

(1/T) Σ_{t=1}^T ‖∇_ϕ f_ϕ(D; ϕ_t)‖^2 ≤ (G/T) (f_ϕ(D; ϕ_0) − f_ϕ(D; ϕ_T)) + σ^2/(2N) + δ. (17)

Proof. See Appendix B.

Theorem 3 provides a convergence result for Algorithm 1 in the case of non-uniform clients. We model the non-uniformity of the client distributions by introducing a discrepancy vector μ_n for each client n. Notice that the key difference between Theorems 2 and 3 is the presence of δ, a bound on the discrepancy between the local and global gradients. The consequence of such a dissimilarity is mild, as we still obtain a convergent sequence.
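For intuition, results of this type hinge on the standard descent lemma implied by Assumption 1. A sketch of the key step, under the unbiasedness and bounded-variance conditions of Theorem 2 (the constants and symbols follow the statements above):

```latex
% Descent lemma under G-smoothness (Assumption 1), for one update step:
f_\phi(D;\phi_{t+1})
  \le f_\phi(D;\phi_t)
      + \langle \nabla_\phi f_\phi(D;\phi_t),\, \phi_{t+1}-\phi_t \rangle
      + \tfrac{G}{2}\,\|\phi_{t+1}-\phi_t\|^2 .
% Substituting \phi_{t+1} = \phi_t - \eta \nabla_\phi f_\phi(D_n;\phi_t),
% taking expectations under unbiasedness and bounded variance, and using
% \eta \le 1/G gives
\mathbb{E}\big[f_\phi(D;\phi_{t+1})\big]
  \le f_\phi(D;\phi_t)
      - \tfrac{\eta}{2}\,\|\nabla_\phi f_\phi(D;\phi_t)\|^2
      + \tfrac{\eta^2 G}{2}\,\sigma^2 .
% Summing over t = 0,\dots,T-1 and telescoping the function values yields
% a bound of the form stated in Theorem 2.
```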

4. EXPERIMENTS

We run Algorithm 1 in two federated learning settings, with N = 50 and N = 100 agents; in both cases, we use full participation, i.e., all agents take part in every communication round. We use CIFAR-10 as the dataset and ResNet-18 as the parameterization, and the low-dimensional representation has dimension d = 128. To model the mismatch between agents, we distributed the samples of each class according to a Dirichlet prior with α = 5, a partitioning scheme widely used in the literature (Shen et al., 2021; Hsu et al., 2019; Acar et al., 2021). In all cases, we train for 500 epochs with a learning rate of 0.3, a batch size of 500 samples, and 5 local epochs per agent.
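The Dirichlet partitioning used above can be sketched as follows; the toy labels stand in for CIFAR-10 labels, and the helper name, client count, and seed are illustrative:

```python
import numpy as np

def dirichlet_partition(labels, n_clients, alpha, rng):
    """Split sample indices among clients with per-class Dirichlet
    proportions. Small alpha yields highly skewed clients; large alpha
    approaches a uniform split. A common FL benchmark partitioning."""
    client_idx = [[] for _ in range(n_clients)]
    for k in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == k))
        # draw this class's client shares, then cut the index list accordingly
        props = rng.dirichlet(alpha * np.ones(n_clients))
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for n, part in enumerate(np.split(idx, cuts)):
            client_idx[n].extend(part.tolist())
    return client_idx

rng = np.random.default_rng(0)
labels = np.repeat(np.arange(10), 100)   # toy stand-in: 10 classes, 100 each
parts = dirichlet_partition(labels, n_clients=5, alpha=5.0, rng=rng)
```

Every sample is assigned to exactly one client, with per-class proportions drawn from Dir(α·1).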

4.1. LEARNING CURVES

In Figure 1, we plot the learning curves for the MCR^2 loss, as well as the R and R_c losses. It can be seen that, in all cases, the centralized MCR^2 parameterization outperforms the federated one. This is expected, as distributing the dataset tends to have a negative effect on performance. The number of agents also affects the loss: the parameterization achieves better performance with N = 50 than with N = 100. This is related to the bias of the local gradients, which grows as the number of clients increases. In all, Figure 1 shows that the MCR^2 loss can be learned in a distributed manner.

4.2. ORTHOGONALITY OF REPRESENTATIONS

Figure 2 shows the cosine similarities between all the elements of the dataset. After training, we obtained the low-dimensional representation of each sample and computed the pairwise cosine similarities between them. To plot the samples, we ordered them so that the first 10000 samples belong to the first class, and so on. As expected from Theorem 1, samples of different classes tend to be orthogonal to each other, while samples of the same class are maximally diverse. Consistent with the worse loss values observed in Figure 1, we can visually verify that the orthogonality between samples degrades as the number of clients increases. Nevertheless, for the most part, we obtain orthogonal representations for the samples, as predicted by Theorems 1-3. As opposed to the centralized case, in our federated learning procedure, samples of different agents are never shared, which adds merit to Figure 2. The value of using MCR^2 as a loss becomes clear when compared to the representations learned with the cross-entropy loss. To obtain the latter, we train a centralized architecture (i.e., ResNet-18) with 128 features before the fully connected layer. Figure 2 shows that orthogonal representations are not obtained unless explicitly enforced. Moreover, the block-diagonal elements of the cross-entropy matrix are darker, meaning the similarities are closer to 1. This comes as no surprise, as the sole objective of the cross-entropy loss is to separate samples of different classes, whereas the MCR^2 loss also seeks diverse representations, allowing samples of the same class to have different alignments. Finally, Figure 3 shows the distribution of the eigenvalues of the per-class matrices Z_k Z_k^T (equivalently, the singular values of Z_k) for different classes in the centralized and federated cases.
Again, we see that our proposed approach can lead to similar distributions of the principal components of the learned representation subspaces, where each class ends up occupying a low-dimensional subspace, even though each client does not have direct access to the data samples hosted by other clients.
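The cosine-similarity matrices of Figure 2 can be reproduced in miniature. For toy representations in which the two classes occupy orthogonal coordinate subspaces (a synthetic stand-in for trained features), the matrix is exactly block-diagonal, while within-class similarities need not equal 1:

```python
import numpy as np

def cosine_similarity_matrix(Z):
    """Pairwise cosine similarities between the columns of Z (d x M)."""
    Zn = Z / np.linalg.norm(Z, axis=0, keepdims=True)
    return Zn.T @ Zn

rng = np.random.default_rng(0)
Z = np.zeros((4, 6))
Z[:2, :3] = rng.normal(size=(2, 3))   # class 0 spans the first two coordinates
Z[2:, 3:] = rng.normal(size=(2, 3))   # class 1 spans the last two coordinates
C = cosine_similarity_matrix(Z)
# cross-class entries are exactly zero (orthogonal subspaces), while
# within-class entries vary, reflecting the diversity MCR^2 preserves
```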

5. CONCLUSION

In this paper, we introduced a principled procedure to learn low-dimensional representations in a distributed manner. In the context of federated learning, we introduced a collaborative loss based on the principle of maximal coding rate reduction (MCR^2), which individually benefits all participating agents, so that collaboration is in each agent's self-interest. We refer to our federated low-dimensional representation learning algorithm as FLOW. Theoretically, we showed that (i) the solution of FLOW generates orthogonal representations for samples of different classes and maximizes the dimension of each class subspace, and (ii) under mild conditions, FLOW converges to a first-order stationary point. Empirically, we compared our method to its centralized counterpart, validating the claims put forward.



Footnote: Since the MCR^2 backbone objective in (12) is monotonically decreasing as the representations Z are scaled up, in practice the representations need to be constrained, e.g., to the unit hypersphere S^{d−1}, or the Frobenius norm of the per-class representations should be bounded by the number of per-class samples.
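The unit-sphere constraint mentioned in the remark above amounts to a simple column-wise normalization of the representation matrix; a minimal sketch (the small `eps` guard against division by zero is an implementation detail we add here):

```python
import numpy as np

def project_to_sphere(Z, eps=1e-12):
    """Normalize each representation (column of Z) to unit Euclidean norm,
    constraining it to the unit hypersphere S^{d-1} so the coding-rate
    objective cannot be improved by merely scaling the features."""
    return Z / np.maximum(np.linalg.norm(Z, axis=0, keepdims=True), eps)

Z = np.array([[3.0, 0.0],
              [4.0, 2.0]])
Zs = project_to_sphere(Z)
```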



Consider a set of dimensions {d k } K k=1 such that with rank(Z * k ) ≤ d k . If the embedding space is large enough, i.e., d ≥ K k=1 d k , and the coding precision is high enough, i.e. ϵ 4 < min k∈[K] |M k |d 2 M d 2 j then:

Figure 1: Learning curves for MCR 2 in Federated and Centralized settings for CIFAR-10.

Figure 2: Orthogonality of the low dimensional representation.

Federated Learning 50 agents.

Federated Learning 100 agents.

Figure 3: Singular values, in decreasing order, of the subspaces associated with each class.

Algorithm 1 FLOW: Federated LOW-dimensional Representation Learning
1: Set coding precision ϵ, step size η, embedding space dimensionality d, aggregation period τ.
2: for t = 1, . . . , T do
3:    if t mod τ ≠ 0 then
4:       Client n does: update model locally, ϕ^n_{t+1} = ϕ^n_t − η ∇_ϕ f_ϕ(D_n; ϕ^n_t), with f_ϕ given in (12).
5:    else
6:       Server does: average models, ϕ_{t+1} = (1/N) Σ_{n=1}^N ϕ^n_t, and broadcast ϕ_{t+1} to all clients.
7:    end if
8: end for

