SHARE YOUR REPRESENTATION ONLY: GUARANTEED IMPROVEMENT OF THE PRIVACY-UTILITY TRADEOFF IN FEDERATED LEARNING

Abstract

Repeated parameter sharing in federated learning causes significant information leakage about private data, thus defeating its main purpose: data privacy. Mitigating the risk of this information leakage, using state-of-the-art differentially private algorithms, also does not come for free: randomized mechanisms can prevent models from learning even the useful representation functions, especially when there is more disagreement between local models on the classification functions (due to data heterogeneity). In this paper, we consider a representation federated learning objective that encourages various parties to collaboratively refine the consensus part of the model, with differential privacy guarantees, while separately allowing sufficient freedom for local personalization (without releasing it). We prove that in the linear representation setting, although the objective is non-convex, our proposed new algorithm CENTAUR converges at a linear rate to a ball centered around the global optimal solution, and the radius of the ball is proportional to the reciprocal of the privacy budget. With this novel utility analysis, we improve the SOTA utility-privacy trade-off for this problem by a factor of √d, where d is the input dimension. We empirically evaluate our method on the image classification task on CIFAR10, CIFAR100, and EMNIST, and observe a significant performance improvement over the prior work under the same small privacy budget. The code can be found in this link.

1. INTRODUCTION

In federated learning (FL), multiple parties cooperate to learn a model under the orchestration of a central server while keeping the data local. However, this paradigm alone is insufficient to provide rigorous privacy guarantees, even when local parties only share partial information (e.g. gradients) about their data. An adversary (e.g. one of the parties) can infer whether a particular record is in the training data set of other parties (Nasr et al., 2019), or even precisely reconstruct their training data (Zhu et al., 2019). To formally mitigate these privacy risks, we need to guarantee that any information shared between the parties during the training phase has bounded information leakage about the local data. This can be achieved using FL under differential privacy (DP) guarantees. FL and DP are relatively well-studied separately. However, their challenges multiply when conducting FL under a DP constraint in real-world settings where the data distributions can vary substantially across the clients (Li et al., 2020b; Acar et al., 2020; Shen et al., 2022). A direct consequence of such data heterogeneity is that the optimal local models might vary significantly across clients and differ drastically from the global solution, which results in large local gradients (Jiang et al., 2019). However, these large signals leak information about the local training data and cannot be communicated as such when we need to guarantee DP. We must clip gradient values (usually by a small threshold (De et al., 2022)) before sending them to the server, to bound the sensitivity of the gradient function with respect to changes in the training data (Abadi et al., 2016). As the local per-sample gradients (due to data heterogeneity) tend to be large even at the global optimum, clipping per-example gradients by a small threshold and then randomizing them results in a high error in the overall gradient computation, thus degrading the accuracy of the model learned via FL.
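The clipping bias described above can be seen in a few lines. The following toy illustration (our own; not part of any cited algorithm) compares the true average gradient with the average of clipped gradients when per-client gradients are large and heterogeneous:

```python
import numpy as np

def clipped_mean(grads, zeta):
    # Average of per-sample gradients after norm clipping at threshold zeta.
    clipped = [g * min(1.0, zeta / np.linalg.norm(g)) for g in grads]
    return np.mean(clipped, axis=0)

rng = np.random.default_rng(0)
# Heterogeneous clients: large gradients pointing in different directions.
grads = [rng.normal(loc=c, scale=0.1, size=4) * 10 for c in (-1.0, 1.0, 2.0)]
true_mean = np.mean(grads, axis=0)
biased = clipped_mean(grads, zeta=1.0)  # a small threshold shrinks everything
# The clipped mean has norm at most zeta, while the true mean can be much larger,
# so the bias does not vanish even if the true mean is the signal we want.
```

With a small threshold the clipped average is confined to a ball of radius ζ regardless of the true gradient magnitude, which is the non-diminishing bias discussed in the text.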
Contributions. In this work, we identify an important bottleneck for achieving high utility in FL under a tight privacy budget: there exists a magnified conflict between learning the representation function and the classification head when we clip gradients to bound their sensitivity (which is required for achieving DP). This conflict causes slow convergence of the representation function and disproportional scaling of the local gradients, and consequently leads to an inevitable utility drop in DP FL. To address this issue, we observe that in many FL classification scenarios, participants have minimal disagreement on data representations (Bengio et al., 2013; Chen et al., 2020; Collins et al., 2021), but possibly have very different classifier heads (e.g., the last layer of the neural network). Therefore, instead of solving the standard classification problem, we borrow ideas from the model personalization literature and view the neural network model as a composition of a representation extractor and a small classifier head, and optimize these two components in different manners. In the proposed scheme, CENTAUR, we train a single differentially private global representation extractor while allowing each participant to have a different personalized classifier head. Such a decomposition has been considered in prior works such as (Collins et al., 2021) and (Singhal et al., 2021), but only in a non-DP setting, and in (Jain et al., 2021), but only for the linear embedding case. Due to the low heterogeneity in data representation (compared to the whole model), the DP-learned representation in our new scheme outperforms prior schemes that perform DP optimization over the entire model. In the setting where both the representation function and the classifier heads are linear w.r.t.
their parameters, we prove a novel utility-privacy trade-off for an instance of CENTAUR, yielding a significant O(√d) improvement over previous art, where d is the input dimension (Corollary 5.1). A major algorithmic novelty of our proposed approach is a cross-validation scheme for boosting the success probability of the classic noisy power method for privacy-preserving spectral analysis. We present strong empirical evidence for the superior performance of CENTAUR over the prior work, under the small DP budget of (1, 10⁻⁵), in a variety of data-heterogeneity settings on the benchmark datasets CIFAR10, CIFAR100, and EMNIST. Our method outperforms the prior work in all settings. Moreover, we showcase that CENTAUR uniformly enjoys a better utility-privacy trade-off over its competitors on the CIFAR10 dataset across different privacy budgets ϵ (Figure 1). Importantly, CENTAUR outperforms local stand-alone training even with ϵ = 0.5, thus justifying the benefit of collaborative learning compared to stand-alone training for a larger range of privacy budgets.

1.1 RELATED WORK

Federated learning with differential privacy has been extensively studied since its emergence (Shokri & Shmatikov, 2015; McMahan et al., 2017a). Without any trusted central party, the local DP model requires each client to randomize its messages before sending them to other (malicious) parties. Consequently, the trade-off between local DP and accuracy is significantly worse than that in the centralized setting, and requires a huge amount of data for learning even simple statistics (Duchi et al., 2014; Erlingsson et al., 2014; Ding et al., 2017). By using secure aggregation protocols, recent works (McMahan et al., 2017b; Agarwal et al., 2018; Levy et al., 2021; Kairouz et al., 2021) study user-level DP under the billboard model to enable utility. We also focus on this user-level DP setting.
Model personalization approaches (Smith et al., 2017; Fallah et al., 2020; Li et al., 2020b; Arivazhagan et al., 2019; Collins et al., 2021; Pillutla et al., 2022) enable each client to learn a different (while related) model, thus alleviating the model drifting issue due to data heterogeneity. Recent works further investigate whether model personalization approaches enable an improved privacy-accuracy trade-off for federated learning. Hu et al. (2021) propose a private federated multi-task learning algorithm by adding task-specific regularization to each client's optimization objective. However, the regularization has limited ability to deal with data heterogeneity. Bietti et al. (2022) propose the PPSGD algorithm that enables training additive personalized models with user-level DP. However, their generalization guarantees crucially rely on the convexity of the loss functions. On the contrary, we study the convergence of the CENTAUR algorithm under more general non-convex objectives. The closest work to our approach in the literature is Jain et al. (2021), who also propose differentially private learning of a shared low-dimensional linear representation with individualized classification heads. However, their algorithm relies on solving the least-squares problem exactly on the server side, by performing SVD on a perturbed noisy representation matrix. Hence, the generalization guarantee of their algorithm intrinsically has an expensive d^{1.5} dependency on the data dimension d. By contrast, we perform noisy gradient descent on the server side and improve upon this error dependency by a factor of √d. Their algorithm is also limited to the linear representation learning problem, unlike our CENTAUR algorithm, which enables training multiple layers of shared representations. Our work builds on the FedRep algorithm (Collins et al., 2021), which also relies on learning shared representations between clients but does not consider privacy.
In contrast, our work provides a novel private and federated representation learning framework. Moreover, a major distinguishing ingredient of our algorithm is the initialization procedure, which requires performing differentially private SVD on the data matrix. We use the noisy power method (Hardt & Price, 2014) as a crucial tool to obtain a utility guarantee that holds with constant probability. We then perform cross-validation to boost the success probability arbitrarily close to 1 (inspired by Liang et al. (2014)).

2. NOTATIONS AND BACKGROUND ON PRIVACY

Notations. We denote the clipping operation by clip(x; ζ) := x · min{1, ζ/∥x∥}, and denote the Gaussian mechanism as GM_{ζ,σ}({x_i}_{i=1}^s) := (1/s)(Σ_{i=1}^s clip(x_i; ζ) + σζW), where W ∼ N(0, I). We define Rényi Differential Privacy (RDP) on a dataset space D equipped with a distance d as follows.
Definition 2.1 (RDP). For measures ν, ν′ over the same space with ν ≪ ν′, their Rényi divergence is R_α(ν, ν′) = (1/(α−1)) log E_α(ν, ν′), where E_α(ν, ν′) = ∫ (dν/dν′)^α dν′. A randomized algorithm M : D → Θ satisfies (α, ϵ)-RDP if, for all D, D′ ∈ D with d(D, D′) ≤ 1, we have R_α(M(D), M(D′)) ≤ ϵ.
User-level RDP. Let D be the space of all n-tuples of local datasets {S_i}_{i=1}^n, where each local dataset consists of m data points, i.e. D = {{S_i}_{i=1}^n | S_i = {z_{ij}}_{j=1}^m}. The distance d is the Hamming distance at the dataset level, i.e. d({S_i}_{i=1}^n, {S′_i}_{i=1}^n) = Σ_{i=1}^n 1_{S_i ≠ S′_i}. We refer to the privacy guarantee recovered by this choice of dataset space as user-level RDP. In Appendix A.3, we further describe the Gaussian mechanism and the composition of RDP, the standard notion of Differential Privacy (DP), and the conversion lemma from RDP to DP.
Threat Models. We aim to protect the privacy of each user against potentially adversarial other clients, i.e., any eavesdropper will not be able to tell whether one user has participated in the collaborative learning procedure, given the information released during the training phase. We establish user-level RDP guarantees under the billboard model, which is a communication protocol that is particularly compatible with algorithms in the federated setting and has been adopted in many previous works (Jain et al., 2021; Hu et al., 2021; Bietti et al., 2022). In this model, a trusted server (either a trusted party or one that uses cryptographic techniques like multiparty computation) aggregates information subject to a DP constraint, which is then shared as public messages with all the n users.
Then, each user computes its own personalized model based on the public messages and its own private data (Hsu et al., 2014). Our RDP guarantees only hold for releasing the shared representations b^{T_g} in Algorithm 1; the guarantees would be equivalent to joint-(R)DP guarantees (Kearns et al., 2014) if all the personalized models w^{T_l}_i of individual users were additionally released as outputs of Algorithm 2.
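As a concrete reference for the notation above, here is a minimal sketch of the clipping operation clip(x; ζ) and the Gaussian mechanism GM_{ζ,σ} (the NumPy realization and function names are ours):

```python
import numpy as np

def clip(x, zeta):
    # clip(x; zeta) = x * min{1, zeta / ||x||}
    norm = np.linalg.norm(x)
    return x if norm == 0 else x * min(1.0, zeta / norm)

def gaussian_mechanism(xs, zeta, sigma, rng=None):
    # GM_{zeta,sigma}({x_i}) = (1/s) * (sum_i clip(x_i; zeta) + sigma * zeta * W),
    # where W ~ N(0, I). Sensitivity w.r.t. one x_i is bounded by zeta via clipping.
    rng = np.random.default_rng() if rng is None else rng
    s = len(xs)
    total = np.sum([clip(x, zeta) for x in xs], axis=0)
    noise = sigma * zeta * rng.standard_normal(np.shape(xs[0]))
    return (total + noise) / s
```

With σ = 0 this reduces to the mean of the clipped inputs; the noise scale σζ is tied to the clipping threshold so that the mechanism's privacy level depends only on σ.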

3. PROBLEM FORMULATION

Consider a Federated Learning (FL) setting where there are n agents interacting with a central server and the i-th agent holds m_i local data points S_i = {(x_ij, y_ij)}_{j=1}^{m_i}. Here (x_ij, y_ij) ∈ R^d × R denotes a pair of an input vector and the corresponding label. Suppose that each local dataset S_i is sampled from an underlying joint distribution p_i(x, y) of the input and response pair. We consider the data-heterogeneous setting where potentially p_i ≠ p_j for i ≠ j. We further consider a hypothesis model f(•; θ) : R^d → R which maps an input x to the predicted label y. Here θ ∈ R^q is the trainable parameter of the model f. Let ℓ : R × R → R be a loss function that penalizes the error of the prediction f(x; θ) given the true label y. The local objective on client i is defined as l(θ; S_i) := (1/m_i) Σ_{j=1}^{m_i} ℓ(f(x_ij; θ), y_ij). The standard FL formulation seeks a global consensus solution, whose objective is defined as arg min_θ L(θ) := (1/n) Σ_{i=1}^n l(θ; S_i). While this formulation is reasonable when the local data distributions are similar, the obtained global solution may be far from the local optima arg min_θ l(θ; S_i) under diverse local data distributions, a phenomenon known as statistical heterogeneity in the FL literature (Li et al., 2020a; Wang et al., 2019; Malinovskiy et al., 2020; Mitra et al., 2021; Charles & Konečnỳ, 2021; Acar et al., 2020; Karimireddy et al., 2020). Such a (potentially significant) mismatch between local and global optimal solutions limits the incentive for collaboration and causes extra difficulties when DP constraints are imposed (Remark 3.1). These considerations motivate us to search for personalized solutions that can be learned in a federated fashion, with less utility loss due to the DP constraint. Federated representation learning with differential privacy.
It is now well-documented that in some common and real-world FL tasks, such as image classification and word prediction, clients have minimal disagreement on data representations (Chen et al., 2020; Collins et al., 2021). Based on this observation, a more reasonable alternative to the FL objective in equation 2 should focus on learning the data representation, which is the information that is agreed on among most parties, while also allowing each client to personalize its learning on information that the other clients disagree on. To formalize this, suppose that the variable θ ∈ R^q can be partitioned into a pair [w, b] ∈ R^{q_1} × R^{q_2} with q = q_1 + q_2, and that the parameterized model admits the composition f = h ∘ ϕ, where ϕ : R^d → R^k is the representation extractor that maps d-dimensional data points to a lower-dimensional space of size k, and h : R^k → R is a classifier head that maps from the lower-dimensional subspace to the space of labels. An important example is the bottom and top layers of a neural network model. We use w and b to denote the parameters that determine h and ϕ, respectively. With the above notation, we consider the following federated representation learning (FRL) problem

min_{b∈B} (1/n) Σ_{i=1}^n min_{w_i} { l([w_i, b]; S_i) := (1/m_i) Σ_{j=1}^{m_i} ℓ(h(ϕ(x_ij; b); w_i), y_ij) },

where we maintain a single global representation extraction function ϕ(•; b) subject to the constraint b ∈ B ⊆ R^{q_2}, while allowing each client to use its personalized classification head h(•; w_i) locally. The constraint B is included so that equation 3 also covers the linear case studied in Section 5. The choice of the FRL formulation in equation 3 entails considerations from both the DP and optimization perspectives: From the DP standpoint, the phenomenon of statistical heterogeneity introduces additional difficulties for federated learning under a DP constraint (see Remark 3.1 below).
If the clients collaborate to train only a shared representation function, then the aforementioned disadvantages can be alleviated. From the optimization standpoint, we typically have k ≪ d, i.e. the dimension of the extracted features is much smaller than that of the original input. Hence, for a fixed representation function ϕ(•; b), the client-specific heads h(•; w_i) are in general easy to optimize locally, as the number of parameters, k, is typically small.

Remark 3.1 (Statistical heterogeneity makes the DP guarantee harder to establish). To establish DP guarantees for gradient-based methods, e.g. DP-SGD, a common choice is the Gaussian mechanism, which comprises a gradient clipping step and a noise injection step. It is empirically observed that to achieve a better privacy-utility trade-off, a small clipping threshold is preferred, since it limits the large variance due to the injected noise (De et al., 2022). Moreover, the effect of the bias (due to clipping) subsides as the per-sample gradient norm diminishes during centralized training, a phenomenon known as benign overfitting in deep learning (Bartlett et al., 2020; Li et al., 2021; Bartlett et al., 2021). However, due to the phenomenon of distribution shift, the local (per-sample) gradients in the standard FL setting (described in equation 2) remain large even at the global optimal solution, and hence setting a small (per-sample) gradient clipping threshold will result in a large and non-diminishing bias in the overall gradient computation. In contrast, for tasks where the representation extracting functions are approximately homogeneous, the local and global optima of the FRL formulation 3 are close, and hence the gradients w.r.t. the representation function vanish at the optimum, which is amenable to a small clipping threshold.

Algorithm 1 SERVER procedure of CENTAUR
1: procedure SERVER(b^0, p_g, T_g, η_g, σ_g, ζ_g)
2:   // b^0 is obtained from the INITIALIZATION procedure.
3:   for t ← 0 to T_g − 1 do
4:     Sample a set C^t of active clients using Poisson sampling with parameter p_g.
5:     Broadcast the current global representation parameter b^t to the active clients.
6:     Receive the local update directions {g^t_i}_{i∈C^t} from the CLIENT procedures.
7:     Compute the update direction g^t = GM_{ζ_g,σ_g}({g^t_i}_{i∈C^t}).
8:     Update the global representation function b^{t+1} := AGGREGATION(b^t, g^t, η_g).
9:     // The AGGREGATION procedure depends on the feasible set B in equation 3.
10:  return b^{T_g}.

Algorithm 2 CLIENT procedure of CENTAUR in the general case (for client i)
1: procedure CLIENT(b^t, m, T_l, η_l)
2:   [Phase 1: Update the local classifier head.] w^{t+1}_i := arg min_w l([b^t, w]; S_i).
3:   [Phase 2: Update the local representation.] Set b^{t,0}_i := b^t.
4:   for s ← 0 to T_l − 1 do
5:     Sample a mini-batch S^s_i of size m from S_i.
6:     b^{t,s+1}_i := b^{t,s}_i − η_l · ∂_b l([b^{t,s}_i, w^{t+1}_i]; S^s_i).
7:   [Phase 3: Summarize the local update direction.] return g^t_i := b^{t,T_l}_i − b^t.
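To make the SERVER/CLIENT interaction of Algorithms 1 and 2 concrete, the following is a toy single-process simulation (our own simplification: the local objective is a plain least-squares model with no classifier head, and all function names are ours):

```python
import numpy as np

def gaussian_mechanism(gs, zeta, sigma, rng):
    # Clip each local update direction to norm zeta, average, and add noise.
    clipped = [g * min(1.0, zeta / max(np.linalg.norm(g), 1e-12)) for g in gs]
    noise = sigma * zeta * rng.standard_normal(gs[0].shape)
    return (np.sum(clipped, axis=0) + noise) / len(gs)

def client(b_t, S_i, eta_l=0.1, T_l=5):
    # Phase 2: local gradient steps on b starting from the global consensus;
    # Phase 3: return the local update direction b_i^{t,T_l} - b^t.
    X, y = S_i
    b = b_t.copy()
    for _ in range(T_l):
        b -= eta_l * X.T @ (X @ b - y) / len(y)  # grad of 0.5*||Xb - y||^2 / m
    return b - b_t

def server(b0, datasets, T_g=50, eta_g=1.0, p_g=1.0, sigma_g=0.0,
           zeta_g=10.0, seed=0):
    rng = np.random.default_rng(seed)
    b = b0.copy()
    for _ in range(T_g):
        active = [S for S in datasets if rng.random() < p_g]  # Poisson sampling
        if not active:
            continue
        g = gaussian_mechanism([client(b, S) for S in active], zeta_g, sigma_g, rng)
        b = b + eta_g * g  # AGGREGATION for the unconstrained case B = R^{q_2}
    return b
```

With σ_g = 0 and p_g = 1 this reduces to a FedAvg-style update on the shared representation; increasing σ_g trades utility for privacy exactly as in the Gaussian mechanism above.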

4. DIFFERENTIAL PRIVATE FEDERATED REPRESENTATION LEARNING

In this section we present the proposed CENTAUR method to solve the FRL problem in equation 3. The SERVER procedure (Algorithm 1) takes the following quantities as inputs: b^0 denotes the initializer for the parameter of the global representation function, obtained from a procedure INITIALIZATION; p_g denotes the portion of the clients that participate in training per global communication round; T_g denotes the total number of global communication rounds; η_g denotes the global update step size; (σ_g, ζ_g) stand for the noise multiplier and the clipping threshold of the Gaussian mechanism (which ensures user-level RDP). Note that in this section, we consider random INITIALIZATION over the unconstrained space B = R^{q_2}, and the procedure AGGREGATION(b^t, g^t, η_g) = b^t + η_g · g^t. Under these configurations, SERVER follows the standard FL protocol: after broadcasting the current global representation function to the active clients, it aggregates the information returned from the CLIENT procedures to update the global representation function. The CLIENT procedure (Algorithm 2) takes the following quantities as inputs: b^t denotes the parameter of the global representation function received from the server; m denotes the number of local data points used as a mini-batch to update the local representation function; T_l denotes the number of local update iterations; η_l denotes the local update step size. CLIENT can be divided into three phases: 1. After receiving the current global parameter b^t of the representation function from the server, the client updates the local classifier head to the minimizer of the local objective, w^{t+1}_i = arg min_w l([b^t, w]; S_i). This is possible since the local objective l usually admits a very simple structure, e.g. it is convex w.r.t. w once the representation function is fixed. In practice, we would run multiple SGD epochs on w to approximately minimize l with b^t fixed.
This is computationally cheap since the dimension of w is much smaller than that of the whole variable θ = [w, b]. 2. Once the local classifier head is updated to w^{t+1}_i, the client optimizes its local representation function with multiple SGD steps, starting from the current global consensus b^t. 3. Finally, each client calculates its local update direction for the representation as the difference between its latest local representation function and the previous global consensus, b^{t,T_l}_i − b^t. Remark 4.1 (Privacy guarantee). By the composition theorem for the subsampled Gaussian mechanism (Mironov et al., 2019), we prove that Algorithm 1 and Algorithm 2 satisfy user-level (α, ϵ)-Rényi differential privacy for ϵ = T_g · S_α(p_g, σ_g), where S_α(p_g, σ_g) = R_α(N(0, σ_g²), (1 − p_g) · N(0, σ_g²) + p_g · N(1, σ_g²)). In the special case of full gradient descent with p_g = 1, we have ϵ = α · T_g / (2σ_g²).
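Remark 4.1 can be made operational for the full-participation case p_g = 1 with a few lines of accounting. The RDP-to-DP conversion below is the standard one referenced in Appendix A.3; the function names and the grid of Rényi orders are our own choices:

```python
import math

def rdp_full_participation(alpha, T_g, sigma_g):
    # Remark 4.1, special case p_g = 1: eps = alpha * T_g / (2 * sigma_g^2).
    return alpha * T_g / (2.0 * sigma_g ** 2)

def rdp_to_dp(alpha, eps_rdp, delta):
    # Standard conversion: (alpha, eps)-RDP implies
    # (eps + log(1/delta)/(alpha - 1), delta)-DP.
    return eps_rdp + math.log(1.0 / delta) / (alpha - 1)

def best_dp_epsilon(T_g, sigma_g, delta, alphas=range(2, 256)):
    # Optimize the conversion over a grid of Renyi orders alpha.
    return min(rdp_to_dp(a, rdp_full_participation(a, T_g, sigma_g), delta)
               for a in alphas)
```

For example, T_g = 100 rounds with noise multiplier σ_g = 50 yields ϵ_dp ≈ 1 at δ = 10⁻⁵, in the regime of the small budgets studied in the experiments.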

5. GUARANTEED IMPROVEMENT OF THE UTILITY-PRIVACY TRADE-OFF

In the previous section, we presented CENTAUR for the general FRL problem (3). Due to the lack of structural information, for a general non-convex optimization problem we cannot expect any utility guarantee beyond convergence to a stationary point. In this section, we consider a specific instance of the FRL problem where both the representation function ϕ and the local classifiers h are linear w.r.t. their parameters b and w_i. This model has been commonly used in the analysis of representation learning (Collins et al., 2021; 2022), meta learning (Tripuraneni et al., 2021; Du et al., 2020; Thekumparampil et al., 2021; Sun et al., 2021), model personalization (Jain et al., 2021), and multi-task learning (Maurer et al., 2016). For this (still non-convex) instance, we prove that CENTAUR converges at a linear rate to a ball centered around the global minimizer, where the size of the ball depends on the required RDP parameters (α, ϵ). Let ϵ_a be the utility and let ϵ_dp be the DP parameter (CENTAUR is an (ϵ_dp, δ)-DP mechanism). We obtain the improved privacy-utility trade-off ϵ_a · ϵ_dp ≥ O(d/n), which is O(√d) better than the current SOTA result, ϵ_a · ϵ_dp ≥ O(d^{1.5}/n), by Jain et al. (2021). In the following, we first review objective (3) in the linear representation setting, and then analyse CENTAUR to show the improved utility-privacy trade-off.

Federated Linear Representation Learning (LRL). Recall the problem formulation in Section 3. For simplicity, we assume m_i = m for all i ∈ {1, . . . , n} for some constant m. Consider the LRL setting, where given the input x_ij the response y_ij ∈ R satisfies y_ij = w*_i^⊤ B*^⊤ x_ij. Here B* ∈ R^{d×k} is a column-orthonormal global representation matrix and w*_i ∈ R^k is an agent-specific optimal local linear classifier. In terms of notation, B corresponds to the parameter b in the general formulation (3), but is capitalized since it is now regarded as a matrix.
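For concreteness, synthetic data satisfying this generative model (and the input distribution of Assumption 5.1) can be produced as follows (a sketch with our own naming):

```python
import numpy as np

def make_lrl_data(n, m, d, k, rng):
    # Ground-truth column-orthonormal representation B* in R^{d x k},
    # obtained from the QR factorization of a Gaussian matrix.
    B_star, _ = np.linalg.qr(rng.standard_normal((d, k)))
    # Agent-specific optimal heads w_i* in R^k (rows of W*).
    W_star = rng.standard_normal((n, k))
    datasets = []
    for i in range(n):
        X = rng.standard_normal((m, d))  # zero-mean, I_d-covariance inputs
        y = X @ B_star @ W_star[i]       # y_ij = w_i*^T B*^T x_ij
        datasets.append((X, y))
    return B_star, W_star, datasets
```

This generator is useful for sanity-checking implementations of the L-FRL instance below on noiseless data before moving to real datasets.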
The feasible domain B is the set of column-orthonormal matrices O_{d,k} = {B ∈ R^{d×k} | B^⊤B = I_k}. Given a local dataset S (either S_i or a subset of it), define

l([w, B]; S) = (1/|S|) Σ_{(x,y)∈S} ½ (w^⊤B^⊤x − y)².

Our goal is to recover the ground-truth representation matrix B* using the collection of local datasets {S_i}_{i=1}^n, via solving the following linear instance of the FRL problem (equation 3)

min_{B^⊤B=I_k} (1/n) Σ_{i=1}^n min_{w_i} l([w_i, B]; S_i)    (L-FRL)

in a federated and DP manner. Here, equation L-FRL is an instance of the general FRL problem (3) with h and ϕ set to linear functions and ℓ set to the least-squares loss. Note that, despite the (relatively) simple structure of equation L-FRL, it is still non-convex w.r.t. the variable B. Methodology for the LRL case. To establish our novel privacy and utility guarantees, we need to specify INITIALIZATION and AGGREGATION in the SERVER procedure, and we also need to slightly modify the CLIENT procedure, as elaborated below. 1. To obtain the novel guarantee of converging to the global minimizer in the LRL case, the initializer B^0 needs to be within a constant distance of the global optimum B*, which requires a more sophisticated procedure than simple random initialization. We show that this requirement can be ensured by running a modified instance of the private power method (PPM) of Hardt & Price (2014), but the utility guarantee only holds with a constant probability (Lemma F.1). A key contribution of our work is a novel cross-validation scheme that boosts the success probability of the PPM at a small extra cost. Our scheme only takes as input the outputs of independent PPM trials, and hence can be treated as post-processing, which is free of privacy risk (Lemma F.3). The proposed INITIALIZATION procedure is presented in Algorithm 3. The analyses are deferred to Appendix F. 2. As discussed earlier, the feasible domain B is the set of column-orthonormal matrices. In order to ensure the feasibility of B^{t+1}, we set AGGREGATION(B^t, G^t, η_g) = QR(B^t − η_g · G^t), where QR(•) denotes the QR decomposition and only returns the Q matrix. 3. We make a small modification in line 2, where two subsets S^{t,1}_i and S^{t,2}_i of the local dataset are sampled without replacement from S_i and are used to replace S_i in Phases 1 and 2. This change is required to establish our novel utility result, which ensures that the clipping threshold of the Gaussian mechanism (line 7 of the SERVER procedure) is never reached with high probability (Lemma C.1).

Algorithm 3 INITIALIZATION procedure for CENTAUR in the LRL case
1: procedure INITIALIZATION(T_0, ϵ_i, L_0, σ_0, ζ_0)
2:   Run T_0 independent copies of PPM(L_0, σ_0, ζ_0) to obtain T_0 candidates {B_c}_{c=1}^{T_0}.
3:   // PPM stands for private power method and is presented in Algorithm 5 in the appendix.
4:   // Boost this probability to 1 − n^{−k} via cross-validation.
5:   Find ĉ ∈ [T_0] such that for at least half of c ∈ [T_0], s_i(B_c^⊤ B_ĉ) ≥ 1 − 2ϵ_i², ∀i ∈ [k].
6:   // s_i(•) denotes the i-th singular value of a matrix.
7:   return B_ĉ.

Algorithm 4 CLIENT procedure for CENTAUR in the LRL case (for client i)
1: procedure CLIENT(B^t, m)
2:   Sample without replacement two subsets S^{t,1}_i and S^{t,2}_i from S_i, both of cardinality m.
3:   [Phase 1:] Update the local head w^{t+1}_i = arg min_{w∈R^k} l([w, B^t]; S^{t,1}_i).
4:   [Phase 2:] Compute the local gradient of the representation G^t_i = ∂_B l([w^{t+1}_i, B^t]; S^{t,2}_i).
5:   [Phase 3:] return G^t_i.
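Under the least-squares loss l([w, B]; S) defined above, a minimal sketch of the modified CLIENT step (Algorithm 4) and the QR-based AGGREGATION might look as follows (the NumPy realization and names are ours):

```python
import numpy as np

def client_lrl(B_t, X, y, m, rng):
    # Sample two disjoint mini-batches of size m without replacement.
    idx = rng.permutation(len(y))
    i1, i2 = idx[:m], idx[m:2 * m]
    # Phase 1: w^{t+1} = argmin_w l([w, B^t]; S^{t,1}), a k-dimensional
    # least-squares problem in the projected features X B^t.
    w, *_ = np.linalg.lstsq(X[i1] @ B_t, y[i1], rcond=None)
    # Phase 2: gradient of l w.r.t. B on the second split. For
    # l = (1/|S|) sum 0.5*(w^T B^T x - y)^2, dl/dB = (1/|S|) sum (x^T B w - y) x w^T.
    resid = X[i2] @ B_t @ w - y[i2]
    G = X[i2].T @ np.outer(resid, w) / m
    return G  # Phase 3: the server clips, averages, and noises these.

def aggregate_qr(B_t, G_t, eta_g):
    # AGGREGATION: retract onto column-orthonormal matrices via QR (keep Q only).
    Q, _ = np.linalg.qr(B_t - eta_g * G_t)
    return Q
```

Note that on noiseless data the gradient vanishes at the ground truth, consistent with the claim that a small clipping threshold suffices near the optimum.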

5.1. ANALYSIS OF CENTAUR IN THE LRL SETTING

Use s_i(A) to denote the i-th largest singular value of a matrix A. Let W* ∈ R^{n×k} be the collection of the local optimal classifier heads with W*_{i,:} = w*_i. We use s_i as a shorthand for s_i(W*/√n) and use κ = s_1/s_k to denote the condition number of the problem. We choose the scaling 1/√n so that s_i remains meaningful as n → ∞. We make the following assumptions. Assumption 5.1. x_ij is zero-mean, I_d-covariance, and 1-subgaussian (defined in Appendix A.1). Assumption 5.2. There exists a constant µ > 0 such that max_{i∈[n]} ∥w*_i∥_2 ≤ µ√k s_k. Assumption 5.3. The number of local data points is sufficiently large: m ≥ c_{ld} max(k², k⁴d/n). Here c_{ld} hides the dependence on κ, µ, and the log factors. These assumptions are standard in the literature (Collins et al., 2021; Jain et al., 2021). An elaborated discussion is provided in Appendix K. While Problem (L-FRL) is non-convex, we show that the consensus representation B^t converges at a linear rate to a ball centered around the ground-truth solution B*. The size of the ball is controlled by the noise multiplier σ_g, which will be the free parameter that controls the utility-privacy trade-off. High level idea. For Problem (L-FRL), we find an initial point close to the ground-truth solution via the method of moments. Given this strong initialization, CENTAUR converges linearly to the vicinity of the ground truth, since it can be interpreted as an inexact gradient descent method with a fast-decreasing bias term. One caveat that requires particular attention is that the clipping step in the Gaussian mechanism (applied to the gradients returned in line 5 of Algorithm 4) would destroy the above interpretation if the threshold parameter ζ_g were set too small. To resolve this, we set ζ_g to be a high-probability upper bound of ∥G^t_i∥_F, so that the clipping step only takes effect with negligible probability.
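The principal angle distance dist(B, B*) = ∥(I_d − BB^⊤)B*∥_2, used throughout the utility analysis, can be computed directly (a sketch; naming ours):

```python
import numpy as np

def principal_angle_dist(B, B_star):
    # dist(B, B*) = ||(I_d - B B^T) B*||_2: the spectral norm of the component
    # of B* orthogonal to the column span of B. Always in [0, 1] for
    # column-orthonormal B and B*.
    d = B.shape[0]
    return np.linalg.norm((np.eye(d) - B @ B.T) @ B_star, 2)
```

It equals 0 when the column span of B contains that of B*, and 1 when some direction of B* is orthogonal to the span of B.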
In the utility analysis of the LRL case, we use the principal angle distance to measure the convergence of the column-orthonormal variable B towards the ground truth B*. We also refer to this quantity as the utility of the algorithm. Let B_⊥ be the orthogonal complement of B. We define dist(B, B*) := ∥(I_d − BB^⊤)B*∥_2 = ∥B_⊥^⊤B*∥_2 ≤ 1. To simplify the presentation of the result, we make the following assumptions: the dimension of the input is sufficiently large, d ≥ κ⁸k log n, and the number of clients is sufficiently large, n ≥ m. The proof of the following theorem can be found in Appendix C. Theorem 5.1 (Utility Analysis). Consider the instance of CENTAUR for the LRL setting with its CLIENT and INITIALIZATION procedures defined in Algorithms 4 and 3, respectively. Suppose that the matrix B^0 returned by INITIALIZATION satisfies dist(B^0, B*) ≤ ϵ_0 = 0.2, and suppose that the mini-batch size parameter m satisfies m ≥ c_m max{κ²k² log n, k²d/n}. Set the clipping threshold of the Gaussian mechanism ζ_g = c_ζ µ²k s_k² √(dk log n), the global step size η_g = 1/(4s_1²), and the number of global rounds T_g = c_T κ² log(κη_gζ_gσ_g d/n). Assume that the noise multiplier in the Gaussian mechanism is sufficiently small: σ_g ≤ c_σ nκ⁴/(µ²d√k log n). Let c_m, c_ζ, c_T, c_σ, c_p and c_d be universal constants. Then, with probability at least 1 − c_p mT_g · n^{−k}, dist(B^{T_g}, B*) ≤ c_d κ²η_gσ_gζ_g √d/n = c_d σ_g µ²k^{1.5} d/n. Since the SERVER procedure remains exactly the same as Algorithm 1 in the LRL case, the main body (everything after INITIALIZATION) of the resulting CENTAUR instance has the same privacy guarantee as described in Remark 4.1. However, we still need to account for the privacy leakage of the INITIALIZATION procedure in Algorithm 3, as it is data-dependent. This is deferred to Appendix F, where we show that Algorithm 3 is an (α, ϵ_init)-RDP mechanism, with ϵ_init defined in Corollary F.1.
Combining this fact with the RDP analysis for the main body leads to the following privacy guarantee for CENTAUR in the LRL case (see Appendix A.4). Theorem 5.2 (Privacy Bound). Consider the instance of CENTAUR with its CLIENT procedure defined in Algorithm 4 and its INITIALIZATION procedure defined in Algorithm 3. Suppose that the INITIALIZATION procedure is an (α, ϵ_init)-RDP mechanism (proved in Corollary F.1). Let σ_g, the noise multiplier in the CLIENT procedure, be a free parameter that controls the privacy-utility trade-off. By setting the inputs to SERVER as p_g = 1 and T_g = c_T κ² log(κη_gζ_gσ_g d/n), the instance of CENTAUR under consideration is an (α, ϵ_init + ϵ_rdp/2)-RDP mechanism, where ϵ_rdp = 4αT_g/σ_g². Moreover, when σ_g = Õ(n/(k⁴µ²d)), we have ϵ_init ≤ ϵ_rdp/2, in which case CENTAUR is an (α, ϵ_rdp)-RDP mechanism and is also an (ϵ_dp, δ)-DP mechanism with ϵ_dp = 2√(T_g log(1/δ))/σ_g. Overall Utility-Privacy Trade-off. We now combine the utility and privacy analyses of CENTAUR in the LRL setting to obtain the overall utility-privacy trade-off, in the following sense: according to Theorem 5.1, to achieve a high utility, i.e. a small ϵ_a, we need to choose a small noise multiplier σ_g, while Theorem 5.2 states that the smaller σ_g is, the larger the privacy cost. Corollary 5.1. Use ϵ_a to denote a target utility, i.e. dist(B^T, B*) ≤ ϵ_a where B^T is the output of CENTAUR, and use ϵ_dp to denote a privacy budget, i.e. CENTAUR is an (ϵ_dp, δ)-DP mechanism. Suppose that ϵ_dp ≥ c′_t µ²d√k/(κ³n), which is a restriction due to the requirement on σ_g in Theorem 5.1. Under Assumptions 5.1 to 5.3, CENTAUR outputs a solution that provably achieves the ϵ_a utility within the ϵ_dp budget, under the condition that the tuple (ϵ_a, ϵ_dp) satisfies c_t κk^{1.5}µ²d/n ≤ ϵ_a · ϵ_dp, where c_t and c′_t hide the constants and log terms.
When focusing on the input dimension d and the number of clients n and treating other factors as constants, the restriction on ϵ_dp and the trade-off for the tuple (ϵ_a, ϵ_dp) simplify to ϵ_dp ≥ Θ(d/n) and Θ(d/n) ≤ ϵ_a · ϵ_dp. Recall that in the previous SOTA result for the LRL setting (Jain et al., 2021), the restriction on the DP budget is ϵ_dp ≥ Θ(d^{1.5}/n) (point iii in Assumption 4.1 therein) and the utility-privacy trade-off can be interpreted as Θ(d^{1.5}/n) ≤ ϵ_a · ϵ_dp (Lemma 4.4 therein). Hence, we obtain a Θ(√d) improvement in both regards, which means that CENTAUR delivers utility-privacy guarantees for a much wider range of combinations of (ϵ_a, ϵ_dp). An elaborated discussion of this result can be found in Appendix J.
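The principal angle distance used throughout this analysis can be computed directly. The following NumPy sketch (our own illustration, not the paper's released code) checks that the two expressions in the definition, ∥(I_d − BB^⊤)B*∥_2 and ∥B_⊥^⊤ B*∥_2, agree for random orthonormal B and B*.

```python
import numpy as np

def principal_angle_dist(B, B_star):
    """dist(B, B*) = ||(I_d - B B^T) B*||_2 for column-orthonormal B, B*."""
    d = B.shape[0]
    return np.linalg.norm((np.eye(d) - B @ B.T) @ B_star, ord=2)

rng = np.random.default_rng(0)
d, k = 20, 3
# Orthonormal bases via QR of random Gaussian matrices (B^T B = I_k).
B, _ = np.linalg.qr(rng.standard_normal((d, k)))
B_star, _ = np.linalg.qr(rng.standard_normal((d, k)))

# Equivalent form ||B_perp^T B*||_2, where B_perp spans the orthogonal
# complement of col(B): the last d - k left singular vectors of B.
U, _, _ = np.linalg.svd(B, full_matrices=True)
B_perp = U[:, k:]
alt = np.linalg.norm(B_perp.T @ B_star, ord=2)
```

Both forms take values in [0, 1] and vanish exactly when the two column spaces coincide, matching the bound dist(B, B*) ≤ 1 stated above.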

6. EXPERIMENTS

In this section, we present empirical results that show the significant advantage of the proposed CENTAUR over prior art. Four baselines are included: Stand-alone-no-FL, which stands for local stand-alone training; DP-FedAvg-fb, which stands for DP-FedAvg with local fine-tuning (Yu et al., 2020); PPSGD, proposed by Bietti et al. (2022); and PMTL-ft, which stands for PMTL proposed by Hu et al. (2021) with local fine-tuning. Note that Stand-alone-no-FL does not involve any global communication, so no privacy mechanism is added to its implementation. This makes Stand-alone-no-FL a strong baseline, as the utility of all the differentially private competing methods is affected by gradient clipping and noise injection, especially when the DP budget is small, e.g. ϵ_dp = 1. Another advantage of the Stand-alone-no-FL setting is that the local stand-alone models are highly flexible, i.e. the model on one client can be completely different from the one on another. On the contrary, the models of all other non-local methods share a common representation part, which takes up the major portion of the whole model. We focus on the task of image classification and conduct experiments on three representative datasets, namely CIFAR10, CIFAR100, and EMNIST. In terms of network architecture, we use LeNet for CIFAR10/CIFAR100 and an MLP for EMNIST, as is common in the federated learning literature; the details are discussed in the Appendix. In terms of data augmentation, we do not perform any data augmentation for training the representation, as we observe that classic data augmentation for DP training leads to worse accuracy, as also reported in De et al. (2022). We also tried a new type of data augmentation suggested by De et al. (2022), which did not consistently improve the classification (validation) accuracy in our experiments either. CENTAUR Has the Best Privacy-Utility Trade-off.
We first present the utility (testing accuracy) of models trained with CENTAUR and the baseline algorithms under a fixed small privacy budget ϵ_dp = 1, for a variety of heterogeneous FL settings. To simulate the data-heterogeneity phenomenon ubiquitous in federated learning research, we follow the data allocation scheme of Collins et al. (2021): specifically, we first split the original dataset into a training part (90%) and a validation part (10%), and then allocate the training part equally to n clients while ensuring that each client has data from at most S classes. In Table 1, we observe that, under this small privacy budget, our proposed CENTAUR enjoys better performance than all the included baseline algorithms. Importantly, CENTAUR is the only method that consistently outperforms the strong local-only baseline, which justifies the choice of collaborative learning as opposed to local stand-alone training. Finally, we further demonstrate that CENTAUR enables a superior privacy-utility trade-off uniformly across different privacy budgets ϵ, for the EMNIST setting (n = 2000, S = 5) in Figure 1.
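To make the heterogeneous allocation concrete, here is a simplified Python sketch of an S-classes-per-client split; the exact shard construction in Collins et al. (2021) may differ, and the function name and shard logic are our own.

```python
import numpy as np

def allocate_heterogeneous(labels, n_clients, S, seed=0):
    """Toy sketch: give each client samples from at most S classes.

    Each client draws S distinct classes; each class's indices are then
    partitioned among the clients that drew it.
    """
    rng = np.random.default_rng(seed)
    classes = np.unique(labels)
    client_classes = [rng.choice(classes, size=S, replace=False)
                      for _ in range(n_clients)]
    client_idx = [[] for _ in range(n_clients)]
    for c in classes:
        owners = [i for i in range(n_clients) if c in client_classes[i]]
        if not owners:
            continue  # a class no client drew stays unallocated
        idx = rng.permutation(np.where(labels == c)[0])
        for part, i in zip(np.array_split(idx, len(owners)), owners):
            client_idx[i].extend(part.tolist())
    return client_idx

labels = np.repeat(np.arange(10), 100)  # toy dataset: 10 classes, 100 each
parts = allocate_heterogeneous(labels, n_clients=20, S=2)
```

Smaller S makes the local label distributions more disjoint, i.e. more heterogeneous, which is exactly the regime where local gradients disagree most.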

Conclusion.

In this work, we point out that statistical heterogeneity, one of the major challenges of federated learning, introduces extra difficulty when DP constraints are imposed. To alleviate this difficulty, we consider federated representation learning, where only the representation function is globally shared and trained. We provide a rigorous guarantee for the utility-privacy trade-off of the proposed CENTAUR method in the linear representation setting, which is O(√d) better than the SOTA result. We also empirically show that CENTAUR provides better utility uniformly on several vision datasets under various data-heterogeneous settings.

I_k = B^⊤[B* B*_⊥][B* B*_⊥]^⊤ B = B^⊤ B* B*^⊤ B + B^⊤ B*_⊥ B*_⊥^⊤ B, and hence s_min²(B^⊤ B*) = s_min(B^⊤ B* B*^⊤ B) = min_{∥a∥=1} a^⊤ B^⊤ B* B*^⊤ B a = min_{∥a∥=1} a^⊤ (I_k − B^⊤ B*_⊥ B*_⊥^⊤ B) a = 1 − s_max(B^⊤ B*_⊥ B*_⊥^⊤ B) = 1 − (dist(B*, B))². (5)
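Equation (5) can be verified numerically. The sketch below (our own illustration) draws random orthonormal B and B* and checks s_min²(B^⊤B*) = 1 − (dist(B*, B))².

```python
import numpy as np

rng = np.random.default_rng(2)
d, k = 12, 3
B, _ = np.linalg.qr(rng.standard_normal((d, k)))
B_star, _ = np.linalg.qr(rng.standard_normal((d, k)))

# Orthonormal basis of the complement of col(B*): last d - k left
# singular vectors of B_star.
U, _, _ = np.linalg.svd(B_star, full_matrices=True)
B_star_perp = U[:, k:]

dist = np.linalg.norm(B_star_perp.T @ B, ord=2)          # dist(B*, B)
s_min = np.linalg.svd(B.T @ B_star, compute_uv=False).min()
identity_gap = abs(s_min**2 - (1 - dist**2))             # Eq. (5)
```

The identity holds exactly (up to floating point) because [B* B*_⊥] is orthogonal, so B^⊤B*B*^⊤B and B^⊤B*_⊥B*_⊥^⊤B sum to I_k.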

A.3 MORE ON PRIVACY PRELIMINARIES

Lemma A.1 (Gaussian Mechanism of RDP). Let R_α be the Rényi divergence defined in Definition 2.1. We have R_α(N(0, σ²), N(µ, σ²)) = αµ²/(2σ²), where N(µ, σ²) denotes the Gaussian distribution with mean µ and variance σ². Lemma A.2 (Composition of RDP). Recall the definition of D in Definition 2.1 and let R_1 and R_2 be some abstract spaces. Let M_1 : D → R_1 and M_2 : D × R_1 → R_2 be (α, ϵ_1)-RDP and (α, ϵ_2)-RDP respectively. Then the mechanism defined as (X, Y), where X ∼ M_1(S) and Y ∼ M_2(S, X), satisfies (α, ϵ_1 + ϵ_2)-RDP. If M is an (α, ϵ)-RDP mechanism, it also satisfies (ϵ + log(1/δ)/(α − 1), δ)-differential privacy for any 0 < δ < 1. A.4 PROOF OF THEOREM 5.2 Proof. Recall the definition of the Gaussian mechanism GM_{ζ,σ}({x_i}_{i=1}^s) := (1/s)(Σ_{i=1}^s clip(x_i; ζ) + σζW), where W ∼ N(0, I). Lemma A.1 states that GM_{ζ_g,σ_g} is an (α, 2α/σ_g²)-RDP mechanism for α ≥ 1 (the sensitivity is 2ζ_g/n while the variance of the noise is (σ_g ζ_g/n)²). Using the composition of RDP again over all the iterates t ∈ [T], we obtain that Algorithm 1 is an (α, ϵ_init + 2αT_g/σ_g²)-RDP mechanism.
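The accounting in the proof above can be sketched numerically. The snippet below is a minimal illustration (not Opacus, and not the paper's code) of Lemmas A.1 and A.2: per-round RDP of 2α/σ², linear composition over T_g rounds, and conversion to (ϵ, δ)-DP optimized over the Rényi order α.

```python
import math

def rdp_per_round(alpha, sigma):
    # Lemma A.1 with sensitivity 2*zeta/n and noise std sigma*zeta/n:
    # alpha * (2 zeta/n)^2 / (2 (sigma zeta/n)^2) = 2 alpha / sigma^2.
    return 2.0 * alpha / sigma**2

def rdp_to_dp(alpha, eps_rdp, delta):
    # RDP -> (eps, delta)-DP conversion from Lemma A.2.
    return eps_rdp + math.log(1.0 / delta) / (alpha - 1.0)

def dp_epsilon(sigma, T, delta, alphas=range(2, 256)):
    # Compose over T rounds (Lemma A.2), then take the best order alpha.
    return min(rdp_to_dp(a, T * rdp_per_round(a, sigma), delta)
               for a in alphas)

eps = dp_epsilon(sigma=20.0, T=200, delta=1e-5)
```

As expected from Theorem 5.2, the resulting ϵ_dp decreases monotonically as the noise multiplier σ grows.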

B DETAILS ON EXPERIMENTS SETUP AND MORE RESULTS

Models. We use LeNet-5 for the datasets CIFAR10 and CIFAR100. LeNet-5 consists of two convolutional layers with (64, 64) channels and two hidden fully-connected layers. For CIFAR10, the number of hidden neurons is (384, 32), while for CIFAR100, the number of hidden neurons is (128, 32). We use ReLU for activation. No batch normalization or dropout layer is used. We use an MLP for the experiments on EMNIST. It consists of three hidden layers of size (256, 128, 16). We use ReLU for activation. No batch normalization or dropout layer is used. Hyperparameters. All of our experiments are conducted in the fully participating setting, i.e. p_c = 1. According to our experiments, the following hyperparameters are the most important for the performance of CENTAUR: the clipping threshold of the Gaussian mechanism ζ_g, the global step size η_g, the local step size η_l, and the number of global rounds T_g. For CIFAR10, to reproduce the utility-privacy trade-off presented in Figure 1, we grid-search the clipping threshold ζ_g over the set {0.01, 0.02, 0.04, 0.06} for every combination of privacy budget and baseline. The resulting optimal clipping thresholds are listed in Table 2. For the other parameters, we set η_l = 0.01 and T_g = 200 uniformly. To reproduce the utility results in Table 1, for CIFAR10, we uniformly set ζ_g = 0.01, η_l = 0.01, η_g = 1, T_g = 200; for CIFAR100, we uniformly set ζ_g = 0.02, η_l = 0.01, T_g = 100, η_g = 1. Note that once the privacy budget ϵ_dp is given, we use the privacy engine from the Opacus package to determine the noise multiplier σ_g, given T_g. For EMNIST, we uniformly set ζ_g = 0.25, η_l = 0.01, T_g = 40, η_g = 1. There are also some algorithm-specific parameters: for PPSGD, we set the step size for the local correction to η = 0.1 and the ratio between the global and the local step size to α = 0.1. For PMTL-ft, we set the regularization parameter λ to 1.
For baselines that require local fine-tuning, we perform 15 local epochs to fine-tune the local head with a fixed step size of 0.01. About data augmentation. In the non-DP setting, data augmentation usually improves the testing accuracy on CV tasks significantly. However, in the DP setting, as reported in the previous work of De et al. (2022), directly applying data augmentation leads to inferior performance. In the same work, the authors proposed an alternative version of data augmentation that improves the testing accuracy on various CV tasks in the centralized DP training setting. We tried their strategy in the federated representation learning setting under consideration; however, it does not improve the utility in our case. On the other hand, since the fine-tuning of the local classification head does not require DP protection (recall that in CENTAUR, the head is kept private), we employed standard data augmentation in this phase (optimizing over the local classification head), which improves the testing accuracy of CENTAUR. We also tried the same technique for the fine-tuning part of the other baselines, but it actually leads to worse performance. Hence, in the reported results, data augmentation is only used for the fine-tuning of the local classification head in CENTAUR and is not used in any other case.
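The clipping threshold ζ_g and noise multiplier σ_g discussed above enter through the clip-then-noise aggregation GM_{ζ,σ} defined in Appendix A.4. A minimal NumPy sketch of that mechanism follows (an illustration only; the actual experiments calibrate σ_g with the Opacus privacy engine):

```python
import numpy as np

def clip(x, zeta):
    """Project x onto the L2 ball of radius zeta (per-client clipping)."""
    nrm = np.linalg.norm(x)
    return x if nrm <= zeta else x * (zeta / nrm)

def gaussian_mechanism(updates, zeta, sigma, rng):
    """GM_{zeta,sigma}: sum of clipped client updates plus sigma*zeta-scaled
    Gaussian noise, averaged over the s participating clients, following
    the definition in Appendix A.4."""
    s = len(updates)
    clipped_sum = sum(clip(u, zeta) for u in updates)
    noise = sigma * zeta * rng.standard_normal(updates[0].shape)
    return (clipped_sum + noise) / s

rng = np.random.default_rng(1)
updates = [rng.standard_normal(5) for _ in range(100)]
out = gaussian_mechanism(updates, zeta=0.1, sigma=1.0, rng=rng)
```

This makes explicit why large heterogeneous gradients are costly: any update with norm above ζ_g is shrunk before aggregation, while the injected noise scales with ζ_g itself.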

B.1 MORE EMPIRICAL RESULTS

To study the performance of different baselines given a larger communication budget, i.e. a larger T_g, we conduct additional experiments on CIFAR10 and report the results in Table 3. We observe that CENTAUR has the best performance among all the included methods, uniformly across all configurations. In Figure 3, we further show the testing accuracy (utility) vs. the privacy cost ϵ_dp during training. We observe that CENTAUR quickly converges to a high utility and consistently outperforms the included baselines.

C UTILITY ANALYSIS OF THE CENTAUR INSTANCE FOR THE LRL CASE

To present the utility analysis of the CENTAUR instance for the LRL case, Problem (L-FRL) is equivalently formulated as a standard matrix sensing problem. By setting the clipping threshold ζ_g to a high-probability upper bound on the norm of the local gradient G_i^t (see Lemma C.1), we show that CENTAUR can be regarded as an inexact gradient descent method. Given that the mini-batch size parameter m is sufficiently large and that the initializer B^0 is sufficiently close to the ground truth B*, we establish a high-probability one-step contraction lemma that controls the utility dist(B^t, B*), which directly leads to the main utility Theorem 5.1. Matrix Sensing Formulation. Consider a reparameterization of the local classifier, w_i = √n v_i. Problem (L-FRL) can then be written as min_{B^⊤B=I_k} (1/n) Σ_{i=1}^n min_{v_i∈R^k} (1/m) Σ_{j=1}^m (1/2)(√n ⟨v_i, B^⊤ x_ij⟩ − y_ij)². Further, denote by V ∈ R^{n×k} the collection of local classifiers, V_{i,:} = v_i. The collection of optimal local classifier heads W* can also be rescaled as W* = √n V*, and the responses y_ij ∈ R satisfy y_ij = √n ⟨B*^⊤ x_ij, V*_{i,:}⟩. Define the rank-1 matrices A_ij = x_ij e^{(i)⊤} ∈ R^{d×n} (with e^{(i)} ∈ R^n) and define the operators A_i : R^{d×n} → R^m and A : R^{d×n} → R^N as A_i(X) = {√n ⟨A_ij, X⟩}_{j∈[m]} ∈ R^m, A(X) = {A_i(X)}_{i∈[n]} ∈ R^N, where we use N = nm to denote the total number of data points globally. Note that (1/√N)A is an isometric operator in expectation w.r.t. the randomness of {x_ij}, i.e. for any X ∈ R^{d×n}, E_{x_ij}[⟨(1/√N)A(X), (1/√N)A(X)⟩] = (1/N) Σ_{i=1}^n Σ_{j=1}^m n e^{(i)⊤} X^⊤ E_{x_ij}[x_ij x_ij^⊤] X e^{(i)} = (1/N) Σ_{i=1}^n Σ_{j=1}^m n ∥X_{:,i}∥² = ∥X∥_F², where we use Assumption 5.1 in the second equality. With the notations defined above, we can rewrite Problem (L-FRL) as a standard matrix sensing problem with the operator A: min_{B^⊤B=I_k} min_{V∈R^{n×k}} F(B, V; A) = (1/(2N)) ∥A(BV^⊤) − A(B*V*^⊤)∥².
(MSP) Since CENTAUR only uses a portion of the data points from S_i to compute the local gradient G_i^t (see line 2 in Algorithm 4), it is useful to define the operators A_i^{t,1} and A_i^{t,2} that correspond to S_i^{t,1} and S_i^{t,2} respectively, and their globally aggregated versions A^{t,1} and A^{t,2}: A_i^{t,l}(X) = {√n ⟨x_ij e^{(i)⊤}, X⟩}_{j∈S_i^{t,l}} ∈ R^m, A^{t,l}(X) = {A_i^{t,l}(X)}_{i∈[n]} ∈ R^N, l = 1, 2, where we denote N := mn. Clippings are inactive with high probability. The following lemma shows that, by properly setting the clipping threshold ζ_g, the clipping step of the Gaussian mechanism in Algorithm 4 has no effect with high probability. Lemma C.1. Consider the LRL setting. Under Assumptions 5.1 and 5.2, we have, with probability at least 1 − mn^{−k}, ∥G_i^t∥_F ≤ ζ_g := c_ζ µ² k s_k² √(dk log n), where G_i^t is computed in line 4 of Algorithm 4 and c_ζ is some universal constant. Proof. The detailed expression of G_i^t in line 4 of Algorithm 4 can be calculated as follows: G_i^t = ∂_B l([w_i^{t+1}, B^t]; S_i^{t,2}) = (1/m) Σ_{(x_ij, y_ij)∈S_i^{t,2}} (⟨B^{t⊤} x_ij, w_i^{t+1}⟩ − y_ij) x_ij w_i^{t+1⊤}. Using the triangle inequality of the matrix norm, ζ is a high-probability upper bound on ∥G_i^t∥_F if the inequality ∥(⟨B^{t⊤} x_ij, w_i^{t+1}⟩ − y_ij) x_ij w_i^{t+1⊤}∥_F ≤ ζ (11) holds with high probability. In the following, we show that the inequalities |⟨B^{t⊤} x_ij, w_i^{t+1}⟩| ≤ ζ_1, |y_ij| ≤ ζ_y, and ∥x_ij∥_2 ≤ ζ_x each hold with high probability. Choice of ζ_y. Recall that in Assumption 5.1, we assume that x_ij is a sub-Gaussian random vector with ∥x_ij∥_{ψ2} = 1. Using the definition of a sub-Gaussian random vector, we have P{|y_ij| ≥ ζ_y} ≤ 2 exp(−c_s ζ_y²/∥w_i*∥²) ≤ exp(−k log n), with the choice ζ_y = µ√k s_k · √((k log n + log 2)/c_s) = O(µ s_k k √(log n)), since ∥w_i*∥_2 ≤ µ√k s_k. Here c_s is some constant and we recall that s_k is a shorthand for s_k(W*/√n).

Published as a conference paper at ICLR 2023

Choice of ζ_x. Recall that x_ij is a sub-Gaussian random vector with ∥x_ij∥_{ψ2} = 1, and therefore, with probability at least 1 − δ, ∥x_ij∥_2 ≤ 4√d + 2√(log(1/δ)). Hence, by taking δ = exp(−k log n), we have ζ_x = 4√d + 2√(k log n) = O(√d). Choice of ζ_w. We can show that ζ_w = 2µ√k s_k is a high-probability upper bound on ∥w_i^{t+1}∥_2. Proving this bound requires a detailed analysis of FedRep and is discussed later in equation 63.

Choice of ζ 1

The following event is conditioned on the event ∥w_i^{t+1}∥_2 ≤ 2µ√k s_k. We then bound the probability of both events happening simultaneously using the union bound. Since x_ij is a sub-Gaussian random vector with ∥x_ij∥_{ψ2} = 1, using the definition of a sub-Gaussian random vector, we have P{|⟨B^t w_i^{t+1}, x_ij⟩| ≥ ζ_1} ≤ 2 exp(−c_s ζ_1²/∥w_i^{t+1}∥²) ≤ exp(−k log n), with the choice ζ_1 = 2µ√k s_k · √((k log n + log 2)/c_s) = O(µ s_k k √(log n)), since ∥w_i^{t+1}∥_2 ≤ 2µ√k s_k. Here c_s is some constant and we recall that s_k is a shorthand for s_k(W*/√n). Using the union bound, we have that, with probability at least 1 − 2 exp(−k log n), the upper bound ζ_1 is valid. The idea behind the proof. To present the intuition of the utility analysis of CENTAUR, define V(B; A) = argmin_{V∈R^{n×k}} F(B, V; A), where A is some matrix sensing operator and F is defined in Problem (MSP). Under the event that the clipping step of the Gaussian mechanism in Algorithm 4 has no effect, the average gradient G^t := (1/n) Σ_{i=1}^n G_i^t admits the following compact form: G^t = ∂_B F(B^t, V(B^t; A^{t,1}); A^{t,2}). Suppose that A^{t,1} ≃ A^{t,2} ≃ A (recall that all these linear operators are comprised of i.i.d. data points x_ij and are hence similar when m is large). Further define the objective G(B; A) = min_{V∈R^{n×k}} F(B, V; A). We have G^t ≃ ∇G(B^t; A), since ∇G(B^t; A) = ∂_B F(B^t, V(B^t; A); A) + J_V(B^t)^⊤ ∂_V F(B^t, V(B^t; A); A) = ∂_B F(B^t, V(B^t; A); A), where J_V(B) denotes the Jacobian matrix of V(B; A) with respect to B and the second equality holds due to the optimality of V(B^t; A). Consequently, conditioned on the event that all the clipping operations are inactive, CENTAUR behaves similarly to noisy gradient descent on the objective G(B; A) (up to the difference between A^{t,1}, A^{t,2}, and A).
In the following, we show that, while the objective G(B; A) is non-convex globally, CENTAUR converges locally within a region around the underlying ground truth. One-step contraction. To present our theory, we first establish the following properties satisfied by the operators A^{t,1} and A^{t,2} (defined in equation 9). Recall that N = mn = Σ_{i=1}^n |S_i^{t,1}|. The proofs are deferred to Appendix E. Lemma C.2. Under Assumption 5.1, the linear operator A^{t,1} satisfies the following property with probability at least 1 − exp(−c_1 k log n): sup_{V∈R^{n×k}, ∥V∥_F=1} |(1/N) ⟨A^{t,1}(B^t V^⊤), A^{t,1}(B^t V^⊤)⟩ − 1| ≤ δ^{(1)}. Here, the factor δ^{(1)} = √(k log n)/√m. Lemma C.3. Under Assumption 5.1, the linear operator A^{t,1} satisfies the following property with probability at least 1 − exp(−c_2 k log n): for V_1, V_2 ∈ R^{n×k}, sup_{∥V_1∥_F=∥V_2∥_F=1} |(1/N) ⟨A^{t,1}((B^t B^{t⊤} − I_d) B* V_1^⊤), A^{t,1}(B^t V_2^⊤)⟩| ≤ δ^{(2)} dist(B^t, B*). Here, the factor δ^{(2)} = √(k log n)/√m. Lemma C.4. Under Assumptions 5.1 and 5.2, the linear operator A^{t,2} satisfies the following property with probability at least 1 − exp(−c_3 k log n): for a ∈ R^d, b ∈ R^k, sup_{∥a∥=∥b∥=1} |(1/N) ⟨A^{t,2}(B^t V^{t+1⊤} − B* V*^⊤), A^{t,2}(ab^⊤ V^{t+1⊤})⟩| ≤ δ^{(3)} s_1² k dist(B^t, B*). Here, the factor δ^{(3)} = 4(√d + √(k log n))/(√(mn) κ²). We now present the one-step contraction lemma of CENTAUR in the LRL setting. The proof can be found in Appendix D.1. Lemma C.5 (One-step contraction). Consider the instance of CENTAUR for the LRL setting with its CLIENT and INITIALIZATION procedures defined in Algorithms 4 and 3 respectively. Suppose that the matrix B^0 returned by INITIALIZATION satisfies dist(B^0, B*) ≤ ϵ_0 = 0.2, and suppose that the mini-batch size parameter satisfies m ≥ c_m max{κ² k² log n, (k² d + k³ log n)/n}, for some universal constant c_m. Set the clipping threshold of the Gaussian mechanism ζ_g according to Lemma C.1 and set the global step size η_g = 1/(4s_1²).
Suppose that the level of manually injected noise is sufficiently small: for some universal constant c_σ, it satisfies σ_g ≤ (c_σ n)/(µ² (√d + √(k log n))) · min{1/(k² log n), κ^4/√(dk log n)}. (18) Then the following one-step contraction holds for a single iteration of CENTAUR: dist(B*, B^{t+1}) ≤ dist(B*, B^t) √(1 − E_0/(8κ²)) + 3C_N η_g σ_g ζ_g √d/n, (19) with probability at least 1 − c_p m n^{−k}, where c_p is some universal constant. Moreover, with the same probability, we also have dist(B*, B^t) ≤ dist(B*, B^0). (20) Remark C.1. The lower bound on the mini-batch size parameter m is derived so as to satisfy the following inequalities: max{δ^{(2)} √k/(1 − δ^{(1)}), (δ^{(2)})² k/(1 − δ^{(1)})², δ^{(3)} k} ≤ s_k² (1 − ϵ_0²)/(36 s_1²), which is required to establish the above one-step contraction lemma. The upper bound on the noise multiplier σ_g is derived so as to satisfy the following inequalities: η_g σ_g ζ_g/n ≤ min{(1 − ϵ_0²)/(4C_N κ² √(k log n)), 8κ² ϵ_0/(3C_N √d E_0)}. Proof of Theorem 5.1. Denote E_0 = 1 − ϵ_0². Using the recursion (19), we have dist(B*, B^t) ≤ (1 − E_0/(8κ²))^{t/2} dist(B*, B^0) + 3C_N η_g σ_g ζ_g √d/(n (1 − √(1 − E_0/(8κ²)))) ≤ (1 − E_0/(8κ²))^{t/2} dist(B*, B^0) + 48C_N κ² η_g σ_g ζ_g √d/(n E_0). With the choice of T_g specified in the theorem, we have dist(B*, B^t) ≤ c_d κ² η_g σ_g ζ_g √d/n for some universal constant c_d. By plugging in the choices of η_g and ζ_g, we obtain the simplified bound dist(B*, B^t) ≤ c̃_d σ_g µ² k^{1.5} d/n, where c̃_d hides constants and log terms.
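The geometric decay plus noise floor in the proof above can be illustrated by iterating the recursion of Lemma C.5 numerically. This is a sketch with abstract scalars standing in for subspace distances, not a simulation of the algorithm itself.

```python
import math

def contraction_trajectory(dist0, kappa, eps0, noise, T):
    """Iterate the one-step contraction of Lemma C.5:
    dist_{t+1} <= sqrt(1 - E0/(8 kappa^2)) * dist_t + noise,
    with E0 = 1 - eps0^2. Returns the sequence of upper bounds."""
    E0 = 1.0 - eps0**2
    rho = math.sqrt(1.0 - E0 / (8.0 * kappa**2))
    traj = [dist0]
    for _ in range(T):
        traj.append(rho * traj[-1] + noise)
    return traj

traj = contraction_trajectory(dist0=0.2, kappa=2.0, eps0=0.2,
                              noise=1e-4, T=500)
E0 = 1 - 0.2**2
rho = math.sqrt(1 - E0 / (8 * 2.0**2))
floor = 1e-4 / (1 - rho)  # limit of the recursion: noise / (1 - rho)
```

The iterates decay linearly to the noise floor noise/(1 − ρ), mirroring the final bound: the residual error is proportional to the injected-noise term, i.e. to σ_g.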

D PROOF OF THE ONE-STEP CONTRACTION LEMMA D.1 PROOF OF LEMMA C.5

Proof. The following discussion is conditioned on the event that all the clipping operations are inactive, whose probability is shown in Lemma C.1 to be at least 1 − 5n^{−k}. Using the union bound to combine the following result with Lemma C.1 yields the claim. Recall that the average gradient is G^t = (1/n) Σ_{i=1}^n G_i^t, and denote Q^t = B^t V^{t+1⊤} − B* V*^⊤. Denote the un-normalized server iterate B̃^{t+1} = B^t − η_g G^t + (η_g σ_g ζ_g/n) W^t, where W^t denotes the noise added by the Gaussian mechanism in the t-th global round. Recall that C^t = {1, . . . , n} given p_g = 1. Since the clipping operations are inactive, we can decompose B̃^{t+1} = B^t − η_g Q^t V^{t+1} + η_g (Q^t V^{t+1} − G^t) + (η_g σ_g ζ_g/n) W^t. Note that B*_⊥^⊤ Q^t = B*_⊥^⊤ B^t V^{t+1⊤}, and denote the QR decomposition B̃^{t+1} = B^{t+1} R^{t+1}. We have B*_⊥^⊤ B^{t+1} = B*_⊥^⊤ (B^t − η_g Q^t V^{t+1} + η_g (Q^t V^{t+1} − G^t) + (η_g σ_g ζ_g/n) W^t)(R^{t+1})^{−1} = B*_⊥^⊤ B^t (I_k − η_g V^{t+1⊤} V^{t+1})(R^{t+1})^{−1} + η_g B*_⊥^⊤ (Q^t V^{t+1} − G^t)(R^{t+1})^{−1} + (η_g σ_g ζ_g/n) B*_⊥^⊤ W^t (R^{t+1})^{−1}. Recall the definition dist(B*, B^t) = ∥B*_⊥^⊤ B^t∥. We bound dist(B*, B^{t+1}) ≤ dist(B*, B^t) ∥I_k − η_g V^{t+1⊤} V^{t+1}∥ ∥(R^{t+1})^{−1}∥ + η_g ∥Q^t V^{t+1} − G^t∥ ∥(R^{t+1})^{−1}∥ + (η_g σ_g ζ_g/n) ∥B*_⊥^⊤ W^t∥ ∥(R^{t+1})^{−1}∥. (26) In the following, we show that the factor ∥I_k − η_g V^{t+1⊤} V^{t+1}∥ ∥(R^{t+1})^{−1}∥ < 1, which yields a contraction in the principal angle distance, and we treat the remaining two terms as controllable noise, given sufficiently small constants (δ^{(1)}, δ^{(2)}, δ^{(3)}) (see Remark C.1) and a sufficiently small noise multiplier σ_g (see equation 18). Bound on ∥I_k − η_g V^{t+1⊤} V^{t+1}∥. Recall that η_g = 1/(4s_1²). Using Lemma D.5, we have ∥I_k − η_g V^{t+1⊤} V^{t+1}∥ ≤ 1 − η_g (E_0 s_k² − (2δ^{(2)} s_1² √k/(1 − δ^{(1)})) dist(B*, B^t)) ≤ 1 − η_g E_0 s_k² · 0.75, (27) where we use the following inequality from Remark C.1: (2δ^{(2)} s_1² √k/(1 − δ^{(1)})) dist(B*, B^t) ≤ 2δ^{(2)} s_1² √k/(1 − δ^{(1)}) ≤ E_0 s_k²/4. Bound on ∥(R^{t+1})^{−1}∥.
With the choice of m stated in the lemma, the tuple (δ^{(1)}, δ^{(2)}, δ^{(3)}) satisfies the requirements in Remark C.1. Using Lemma D.7, we have, with probability at least 1 − 4n^{−k}, ∥(R^{t+1})^{−1}∥ ≤ 1/√(1 − η_g s_k² E_0/2). (29) Combining equation 27 and equation 29, we have the contraction ∥I_k − η_g V^{t+1⊤} V^{t+1}∥ ∥(R^{t+1})^{−1}∥ ≤ (1 − η_g s_k² E_0 · 0.75)/√(1 − η_g s_k² E_0 · 0.5) < 1. We now bound the last two terms of equation 26. Bound on ∥Q^t V^{t+1} − G^t∥. Using Lemma D.6, we have ∥Q^t V^{t+1} − G^t∥ ≤ δ^{(3)} s_1² k dist(B*, B^t) ≤ E_0 s_k² dist(B*, B^t) · 0.25, where we use the condition in Remark C.1 that δ^{(3)} s_1² k ≤ E_0 s_k²/4. Bound on ∥B*_⊥^⊤ W^t∥. Due to the rotational invariance of independent Gaussian random variables, every entry of B*_⊥^⊤ W^t ∈ R^{(d−k)×k} is distributed as N(0, 1). Using Lemma I.1, we have, with probability at least 1 − n^{−k}, ∥B*_⊥^⊤ W^t∥ ≤ C_N (√(d − k) + √k + √(ln(k log n))) ≤ 3C_N √d, where we assume d = Ω(log log n) to simplify the bound. Final Result. Combining the above bounds, we conclude that the following one-step contraction holds under the assumptions stated in the lemma: dist(B*, B^{t+1}) ≤ dist(B*, B^t) (1 − η_g E_0 s_k² (0.75 − 0.25))/√(1 − η_g E_0 s_k² · 0.5) + 3C_N η_g σ_g ζ_g √d/n ≤ dist(B*, B^t) √(1 − η_g E_0 s_k² · 0.5) + 3C_N η_g σ_g ζ_g √d/n. D.2 LEMMAS FOR THE UTILITY ANALYSIS Lemma D.1. Use vec(·) and ⊗ to denote the standard vectorization operation and the Kronecker product respectively. Recall that V^{t+1} := argmin_{V∈R^{n×k}} F(B^t, V; A^{t,1}). We have vec(V^{t+1}) = vec((B* V*^⊤)^⊤ B^t) − f^t, where f^t = ((B^{t⊤} B* ⊗ I_d) − (H^{tt})^{−1} H^{t*}) vec(V*) = (H^{tt})^{−1} (H^{tt} (B^{t⊤} B* ⊗ I_d) − H^{t*}) vec(V*). (33) Here we denote H^{t*} = (1/n) Σ_{i=1}^n (1/m) Σ_{j∈S_i^{t,1}} vec(A_ij^⊤ B^t) vec(A_ij^⊤ B*)^⊤ ∈ R^{nk×nk}, H^{tt} = (1/n) Σ_{i=1}^n (1/m) Σ_{j∈S_i^{t,1}} vec(A_ij^⊤ B^t) vec(A_ij^⊤ B^t)^⊤ ∈ R^{nk×nk}. (34, 35) We use F^t ∈ R^{n×k} to denote the matrix satisfying f^t = vec(F^t). Proof. Recall that N = mn.
For simplicity of notation, we use the collection {A_l}, l = 1, . . . , N, to denote {A_ij}, i ∈ [n], j ∈ S_i^{t,1} (there exists a one-to-one mapping between the indices l and (i, j)). Compute that, for any B ∈ R^{d×k} and V ∈ R^{n×k}, vec(∂_V F(B, V)) = (1/N) Σ_{l=1}^N (vec(A_l^⊤ B)^⊤ vec(V) − vec(A_l^⊤ B*)^⊤ vec(V*)) vec(A_l^⊤ B) = ((1/N) Σ_{l=1}^N vec(A_l^⊤ B) vec(A_l^⊤ B)^⊤) vec(V) − ((1/N) Σ_{l=1}^N vec(A_l^⊤ B) vec(A_l^⊤ B*)^⊤) vec(V*). Since V^{t+1} = argmin_{V∈R^{n×k}} F(B^t, V; A^{t,1}), we have ∂_V F(B^t, V^{t+1}; A^{t,1}) = 0 and hence vec(V^{t+1}) = ((1/N) Σ_{l=1}^N vec(A_l^⊤ B^t) vec(A_l^⊤ B^t)^⊤)^{−1} ((1/N) Σ_{l=1}^N vec(A_l^⊤ B^t) vec(A_l^⊤ B*)^⊤) vec(V*) = (H^{tt})^{−1} H^{t*} vec(V*), where we denote H^{t*} = (1/N) Σ_{l=1}^N vec(A_l^⊤ B^t) vec(A_l^⊤ B*)^⊤ ∈ R^{nk×nk} and H^{tt} = (1/N) Σ_{l=1}^N vec(A_l^⊤ B^t) vec(A_l^⊤ B^t)^⊤ ∈ R^{nk×nk}. Use ⊗ to denote the Kronecker product of matrices. Recall that (B^⊤ ⊗ A) vec(X) = vec(AXB). We have vec((B* V*^⊤)^⊤ B^t) = vec(V* B*^⊤ B^t) = (B^{t⊤} B* ⊗ I_d) vec(V*). Lemma D.2. Recall the definition of H^{tt} in equation 35 and that s_min denotes the minimum singular value of a matrix. Suppose that the matrix sensing operator A^{t,1} satisfies Condition C.2 with constant δ^{(1)}. We can bound s_min(H^{tt}) ≥ 1 − δ^{(1)}. Proof. From the definition of s_min, we have s_min(H^{tt}) = min_{P∈R^{n×k}, ∥P∥_F=1} vec(P)^⊤ H^{tt} vec(P) = min_{P∈R^{n×k}, ∥P∥_F=1} (1/N) ⟨A^{t,1}(B^t P^⊤), A^{t,1}(B^t P^⊤)⟩ ≥ 1 − δ^{(1)}. Lemma D.3. Recall the definitions of H^{tt} and H^{t*} in equations 35 and 34, and recall that dist(B^t, B*) is the principal angle distance between the current variable B^t and the ground truth B*. Suppose that the matrix sensing operator A^{t,1} satisfies Condition C.3 with constant δ^{(2)}. We can bound ∥H^{tt} (B^{t⊤} B* ⊗ I_d) − H^{t*}∥_2 ≤ δ^{(2)} dist(B^t, B*). Proof. Recall that N = mn. For simplicity of notation, we use the collection {A_l}, l = 1, . . .
, N, to denote {A_ij}, i ∈ [n], j ∈ S_i^{t,1} (there exists a one-to-one mapping between the indices l and (i, j)). For arbitrary W ∈ R^{n×k} and P ∈ R^{n×k} with ∥W∥_F = ∥P∥_F = 1, we have vec(W)^⊤ H^{tt} (B^{t⊤} B* ⊗ I_d) vec(P) = (1/N) Σ_{l=1}^N vec(W)^⊤ vec(A_l^⊤ B^t) vec(A_l^⊤ B^t)^⊤ vec(P B*^⊤ B^t) = (1/N) Σ_{l=1}^N ⟨A_l, B^t W^⊤⟩ ⟨A_l, B^t B^{t⊤} B* P^⊤⟩ = (1/N) ⟨A^{t,1}(B^t W^⊤), A^{t,1}(B^t B^{t⊤} B* P^⊤)⟩, where we use (B^⊤ ⊗ A) vec(X) = vec(AXB) in the first equality. Similarly, we can compute vec(W)^⊤ H^{t*} vec(P) = (1/N) ⟨A^{t,1}(B^t W^⊤), A^{t,1}(B* P^⊤)⟩, and hence vec(W)^⊤ (H^{tt} (B^{t⊤} B* ⊗ I_d) − H^{t*}) vec(P) = (1/N) ⟨A^{t,1}(B^t W^⊤), A^{t,1}((B^t B^{t⊤} − I_d) B* P^⊤)⟩. (40) Using Condition C.3 and the definition of s_max, we have the result. Lemma D.4. Recall the definitions of F^t and f^t in equation 33, and recall that dist(B^t, B*) is the principal angle distance between the current variable B^t and the ground truth B*. Suppose that the matrix sensing operator A satisfies Conditions C.2 and C.3 with constants δ^{(1)} and δ^{(2)} respectively. We can bound ∥F^t∥_F = ∥f^t∥_2 ≤ (δ^{(2)} s_1 √k/(1 − δ^{(1)})) dist(B*, B^t), and ∥V^{t+1}∥_F ≤ √k s_1 + (δ^{(2)} s_1 √k/(1 − δ^{(1)})) dist(B*, B^t) ≤ s_1 √k (1 − δ^{(1)} + δ^{(2)})/(1 − δ^{(1)}). Proof. The bound on ∥F^t∥_F is a direct consequence of Lemmas D.1 to D.3 and the fact that matrix norms are sub-multiplicative. The bound on ∥V^{t+1}∥_F is due to the fact that matrix norms are sub-additive. Lemma D.5. Suppose that the matrix sensing operator A^{t,1} satisfies Conditions C.2 and C.3 with constants δ^{(1)} and δ^{(2)} respectively. Further, suppose that max{δ^{(2)} √k/(1 − δ^{(1)}), (δ^{(2)})² k/(1 − δ^{(1)})²} ≤ s_k² E_0/(36 s_1²). For a sufficiently small step size η_g ≤ 1/(4s_1²), we have I_k − η_g V^{t+1⊤} V^{t+1} ≽ 0. Moreover, we can bound ∥I_k − η_g V^{t+1⊤} V^{t+1}∥_2 ≤ 1 − η_g (E_0 s_k² − (2δ^{(2)} s_1² √k/(1 − δ^{(1)})) dist(B*, B^t)). Proof.
We first show that the matrix I_k − η_g V^{t+1⊤} V^{t+1} is positive semi-definite for a sufficiently small step size η_g. Note that, following the idea of the seminal work of Jain et al. (2013), the update V^{t+1} = V(B^t; A^{t,1}) can be regarded as a noisy power iteration, as detailed in Lemma D.1. This allows us to compute ∥V^{t+1⊤} V^{t+1}∥_2 = ∥V* B*^⊤ B^t − F^t∥_2² ≤ 2∥V* B*^⊤ B^t∥_2² + 2∥F^t∥_2² ≤ 2s_1² + 2((δ^{(2)} s_1 √k/(1 − δ^{(1)})) dist(B*, B^t))² ≤ 4s_1², where we use Lemma D.4 in the first inequality and δ^{(2)} √k/(1 − δ^{(1)}) ≤ 1 in the last. Consequently, for η_g ≤ 1/(4s_1²), I_k − η_g V^{t+1⊤} V^{t+1} ≽ 0. Given the positive semi-definiteness of the matrix I_k − η_g V^{t+1⊤} V^{t+1}, we can bound ∥I_k − η_g V^{t+1⊤} V^{t+1}∥_2 ≤ 1 − η_g s_min(V^{t+1⊤} V^{t+1}). Using Lemma D.1 again, we have V^{t+1⊤} V^{t+1} = (V* B*^⊤ B^t − F^t)^⊤ (V* B*^⊤ B^t − F^t) = B^{t⊤} B* V*^⊤ V* B*^⊤ B^t − F^{t⊤} V* B*^⊤ B^t − B^{t⊤} B* V*^⊤ F^t + F^{t⊤} F^t. (44) Note that F^{t⊤} F^t is PSD, which makes a nonnegative contribution to s_min(V^{t+1⊤} V^{t+1}), and hence s_min(V^{t+1⊤} V^{t+1}) ≥ s_min(B^{t⊤} B* V*^⊤ V* B*^⊤ B^t − F^{t⊤} V* B*^⊤ B^t − B^{t⊤} B* V*^⊤ F^t) ≥ s_min(B^{t⊤} B* V*^⊤ V* B*^⊤ B^t) − 2 s_max(F^{t⊤} V* B*^⊤ B^t) ≥ s_min²(B^{t⊤} B*) s_min²(V*) − 2∥F^t∥ s_max(V*). To bound the first term of equation 44, recall that s_min²(B^{t⊤} B*) = 1 − (dist(B*, B^t))² from equation 5, and that dist(B*, B^t) ≤ dist(B*, B^0) from the induction; we thus have s_min²(B^{t⊤} B*) ≥ E_0. To bound the last term of equation 44, we use Lemma D.4 to obtain ∥F^t∥_2 ≤ ∥F^t∥_F ≤ (δ^{(2)} s_1 √k/(1 − δ^{(1)})) dist(B*, B^t). Combining the above results, we have ∥I_k − η_g V^{t+1⊤} V^{t+1}∥_2 ≤ 1 − η_g (E_0 s_k² − (2δ^{(2)} s_1² √k/(1 − δ^{(1)})) dist(B*, B^t)). Lemma D.6. Recall that V^{t+1} = V(B^t; A^{t,1}) (see the definition of V(·; A) in
equation 15), Q^t = B^t V^{t+1⊤} − B* V*^⊤, and G^t = (1/n) Σ_{i=1}^n G_i^t is the global average of the local gradients; recall also that dist(B^t, B*) is the principal angle distance between the current variable B^t and the ground truth B*. Suppose that the matrix sensing operator A^{t,2} satisfies Condition C.4 with constant δ^{(3)}. We have ∥G^t − Q^t V^{t+1}∥_2 ≤ δ^{(3)} s_1² k dist(B*, B^t). Proof. Recall that N = mn. For simplicity of notation, we use the collection {A_l}, l = 1, . . . , N, to denote {A_ij}, i ∈ [n], j ∈ S_i^{t,2} (there exists a one-to-one mapping between the indices l and (i, j)). With this notation, we can compactly write G^t as G^t = (1/N) Σ_{l=1}^N ⟨A_l, Q^t⟩ A_l V^{t+1}. From the definition of s_max, for any a ∈ R^d with ∥a∥_2 = 1 and any b ∈ R^k with ∥b∥_2 = 1, we have ∥(1/N) Σ_{l=1}^N ⟨A_l, Q^t⟩ A_l V^{t+1} − Q^t V^{t+1}∥_2 = max_{∥a∥_2=∥b∥_2=1} a^⊤ ((1/N) Σ_{l=1}^N ⟨A_l, Q^t⟩ A_l V^{t+1} − Q^t V^{t+1}) b. We obtain the result from Condition C.4: |(1/N) ⟨A^{t,2}(B^t V^{t+1⊤} − B* V*^⊤), A^{t,2}(ab^⊤ V^{t+1⊤})⟩| ≤ δ^{(3)} s_1² k dist(B^t, B*). Lemma D.7. Recall that B̃^{t+1} = B^{t+1} R^{t+1} is the QR decomposition of B̃^{t+1}. Denote E_0 = 1 − ϵ_0² and σ = ζ_g σ_g η_g/n. Suppose that Conditions C.2 to C.4 are satisfied with constants δ^{(1)}, δ^{(2)}, and δ^{(3)}. Further, suppose that max{δ^{(2)} √k/(1 − δ^{(1)}), (δ^{(2)})² k/(1 − δ^{(1)})², δ^{(3)} k} ≤ s_k² E_0/(36 s_1²) and that the level of manually injected noise is sufficiently small: σ ≤ E_0/(4C_N κ² √(k log n)). We have, with probability at least 1 − 4 exp(−k log n), ∥(R^{t+1})^{−1}∥_2 ≤ 1/√(1 − η_g s_k² E_0/2). Proof. We focus on bounding s_min(R^{t+1}). Recall that G^t = ∂_B F(B^t, V^{t+1}; A^{t,2}) with V^{t+1} = V(B^t; A^{t,1}) (see the definition of V(·; A) in equation 15), and recall that B̃^{t+1} := B^t − η_g G^t + σ W^t. Compute R^{t+1⊤} R^{t+1} = B̃^{t+1⊤} B̃^{t+1} = I_k + η_g² G^{t⊤} G^t + σ² W^{t⊤} W^t − η_g B^{t⊤} G^t − η_g G^{t⊤} B^t + σ B^{t⊤} W^t + σ W^{t⊤} B^t − η_g σ G^{t⊤} W^t − η_g σ W^{t⊤} G^t.
Therefore, we have s_min(R^{t+1⊤} R^{t+1}) ≥ 1 − 2η_g s_max(B^{t⊤} G^t) − 2σ s_max(B^{t⊤} W^t) − 2η_g σ s_max(G^{t⊤} W^t). (50) We now bound the last three terms on the R.H.S. of the above inequality. 1. s_max(B^{t⊤} G^t): To bound s_max(B^{t⊤} G^t), compute B^{t⊤} G^t = B^{t⊤} Q^t V^{t+1} + B^{t⊤} (G^t − Q^t V^{t+1}), (51) where we recall the definition Q^t = B^t V^{t+1⊤} − B* V*^⊤. The spectral norm of the second term in equation 51 is bounded by Lemma D.6, so we focus on the spectral norm of the first term, B^{t⊤} Q^t V^{t+1}. Recall the noisy power iteration interpretation of V^{t+1} in Lemma D.1. We can write B^{t⊤} Q^t V^{t+1} = B^{t⊤} (B^t (B^{t⊤} B* V*^⊤ − F^{t⊤}) − B* V*^⊤) V^{t+1} = −F^{t⊤} V^{t+1} = −F^{t⊤} (V* B*^⊤ B^t − F^t), where we use B^{t⊤} (B^t B^{t⊤} − I_d) B* V*^⊤ = 0. Consequently, we can bound s_max(B^{t⊤} G^t) as follows: s_max(B^{t⊤} G^t) ≤ s_1 ∥F^t∥_2 + ∥F^t∥_2² + ∥G^t − Q^t V^{t+1}∥_2 ≤ δ^{(2)} s_1² √k/(1 − δ^{(1)}) + (δ^{(2)})² s_1² k/(1 − δ^{(1)})² + δ^{(3)} s_1² k ≤ s_k² E_0/12, where we use the assumptions that max{δ^{(2)} √k/(1 − δ^{(1)}), (δ^{(2)})² k/(1 − δ^{(1)})², δ^{(3)} k} ≤ s_k² E_0/(36 s_1²) and dist(B*, B^t) ≤ 1 for the last inequality. 2. s_max(B^{t⊤} W^t): Due to the rotational invariance of independent Gaussian random variables, every entry of B^{t⊤} W^t ∈ R^{k×k} is distributed as N(0, 1). According to Theorem 4.4.5 in Vershynin (2018), with probability at least 1 − 2 exp(−k log n), we have the bound ∥B^{t⊤} W^t∥ ≤ C_N √(k log n)/(48√2) for some universal constant C_N. 3. s_max(G^{t⊤} W^t): Let G^t = U_G S_G V_G^⊤ be the compact singular value decomposition of G^t, such that U_G ∈ R^{d×k} and U_G^⊤ U_G = I_k. We can bound s_max(G^{t⊤} W^t) ≤ ∥S_G∥ ∥U_G^⊤ W^t∥. Due to the rotational invariance of independent Gaussian random variables, every entry of U_G^⊤ W^t ∈ R^{k×k} is distributed as N(0, 1), and hence, with probability at least 1 − 2 exp(−k log n), we have the bound ∥U_G^⊤ W^t∥ ≤ C_N √(k log n)/(48√2) for some universal constant C_N.
We now focus on bounding ∥S G ∥ 2 = ∥G t ∥ 2 . Note that G t = Q t V t+1 + G t -Q t V t+1 , where the spectral norm of second term in Eq. equation 54 is bounded by Lemma D.6. Recall the noisy power interpretation of V t+1 in Lemma D.1. We can write Q t V t+1 = B t (B t ⊤ B * V * ⊤ -F t ⊤ ) -B * V * ⊤ V t+1 = (B t B t ⊤ -I d )B * V * ⊤ -B t F t ⊤ V * B * ⊤ B t -F t , and therefore we can bound ∥Q t V t+1 ∥ 2 ≤ (s 1 + ∥F t ∥ 2 ) 2 ≤ s 1 + δ (2) s 1 √ k 1 -δ (1) dist(B * , B t ) 2 ≤ 4s 2 1 , since we assume that δ (2) √ k 1-δ (1) ≤ 1. Combining the above three points, we have with probability at least 1 -4 exp(-k log n) s min (R t+1 ⊤ R t+1 ) ≥ 1 -η g s 2 k E 0 /6 -σ • C N k log n/6 -η g σ • C N k log n • s 2 1 /6. (56) Hence, if we choose σ such that (recall that η g ≤ 1/4s 2 1 ) σ • C N k log n ≤ η g s 2 k E 0 and η g σ • C N k log n • s 2 1 ≤ η g s 2 k E 0 ⇒ σ ≤ E 0 4C N κ 2 √ k log n , we have s min (R t+1 ⊤ R t+1 ) ≥ 1 -η g s 2 k E 0 /2 ⇒ s max ((R t+1 ) -1 ) ≤ 1 1 -η g s 2 k E 0 /2 . ( ) E ESTABLISH LEMMAS C.2 TO C.4 FOR THE LRL CASE E.1 PROOF OF LEMMA C.2 For the simplicity of the notations, we omit the superscript of A t,1 i and A t,1 . Moreover, recall that v i = V i,: ∈ R k denotes the ith row of the matrix V . While we can directly use an ϵ-net argument to establish the desired property on the set of matrices V ∈ R n×k , ∥V ∥ F = 1, it leads to a suboptimal bound since the size of the ϵ-net is O(( 2 ϵ +1) nk ). In the following, we show that by exploiting the special structure of the operator A, i.e. V is row-wise separable in A(B t V ⊤ ), we can reduce the size of the ϵ-net to O(( 2 ϵ + 1) k ): Compute that ⟨ 1 √ N A(B t V ⊤ ), 1 √ N A(B t V ⊤ )⟩ = 1 n n i=1 1 m ⟨A i (B t v i e (i) ⊤ ), A i (B t v i e (i) ⊤ )⟩. 
If for any v ∈ R^k with ∥v∥_2 = 1 we have (1/m) ⟨A_i(B^t v e^{(i)⊤}), A_i(B^t v e^{(i)⊤})⟩ = n (1 ± O(δ^{(1)})), then we can show

⟨(1/√N) A(B^t V^⊤), (1/√N) A(B^t V^⊤)⟩ = (1/n) Σ_{i=1}^n ∥v_i∥_2^2 · (1/m) ⟨A_i(B^t (v_i/∥v_i∥_2) e^{(i)⊤}), A_i(B^t (v_i/∥v_i∥_2) e^{(i)⊤})⟩ = Σ_{i=1}^n ∥v_i∥_2^2 (1 ± O(δ^{(1)})) = 1 ± O(δ^{(1)}), for all V ∈ R^{n×k} with ∥V∥_F = 1,

which is the desired result (note that ⟨B^t V^⊤, B^t V^⊤⟩ = ∥B^t V^⊤∥_F^2 = 1). We now establish that

(1/m) ⟨A_i(B^t v e^{(i)⊤}), A_i(B^t v e^{(i)⊤})⟩ = (1/m) Σ_{j=1}^m n (x_ij^⊤ B^t v)^2 = n (1 ± O(δ^{(1)}))

holds for any v ∈ R^k with ∥v∥_2 = 1. Let S^{k−1} be the unit sphere in the k-dimensional Euclidean space and let N_k be a 1/4-net of cardinality 9^k (see Corollary 4.2.13 in Vershynin (2018)). Note that

(1/m) Σ_{j=1}^m n (x_ij^⊤ B^t v)^2 − n = n v^⊤ ( (B^t)^⊤ ((1/m) Σ_{j=1}^m x_ij x_ij^⊤) B^t − I_k ) v,

and we have

sup_{v ∈ S^{k−1}} v^⊤ ( (B^t)^⊤ ((1/m) Σ_{j=1}^m x_ij x_ij^⊤) B^t − I_k ) v ≤ 2 sup_{v ∈ N_k} v^⊤ ( (B^t)^⊤ ((1/m) Σ_{j=1}^m x_ij x_ij^⊤) B^t − I_k ) v,

where we use Lemma 4.4.1 in Vershynin (2018). In the following, we prove

sup_{v ∈ N_k} v^⊤ ( (B^t)^⊤ ((1/m) Σ_{j=1}^m x_ij x_ij^⊤) B^t − I_k ) v ≤ δ^{(1)} / 2,

so that (1/m) Σ_{j=1}^m n (x_ij^⊤ B^t v)^2 = n (1 + O(δ^{(1)})).

For any fixed index i, denote Z_ij = n (x_ij^⊤ B^t v)^2 and note that E_{x_ij}[Z_ij] = n. Since the x_ij's are independent subgaussian vectors, the Z_ij's are independent subexponential variables with ∥Z_ij∥_{ψ1} ≤ n ∥B^t v∥_2^2 = n. Using the centering property (see Exercise 2.7.10 of Vershynin (2018)) and Bernstein's inequality (see Theorem 2.8.1 of Vershynin (2018)) for zero-mean subexponential variables, we can bound, for every τ ∈ (0, 1),

Pr{ |(1/m) Σ_{j=1}^m Z_ij − n| ≥ nτ } ≤ 2 exp( −c min{ m^2 n^2 τ^2 / Σ_{j=1}^m ∥Z_ij∥_{ψ1}^2, m n τ / max_{j ∈ [m]} ∥Z_ij∥_{ψ1} } ) = 2 exp( −c min{ m τ^2, m τ } ) = 2 exp(−c m τ^2),

where c > 0 is an absolute constant. Using the union bound over N_k and setting τ = √(k log n) / √m, we obtain δ^{(1)} ≤ √(k log n) / √m with probability at least 1 − exp(−c′ k log n) for some constant c′. Similarly, we can show that (1/m) Σ_{j=1}^m n (x_ij^⊤ B^t v)^2 = n (1 − O(δ^{(1)})). We therefore have the result.
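The isometry-in-expectation property driving this proof — E[(x^⊤ B^t v)^2] = ∥B^t v∥^2 = 1 for isotropic x and a unit vector B^t v, with deviations of order 1/√m — can be checked with a small Monte Carlo experiment (synthetic Gaussian data; the dimensions below are chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, m = 30, 4, 200_000

B, _ = np.linalg.qr(rng.standard_normal((d, k)))  # column-orthonormal B^t
v = rng.standard_normal(k)
v /= np.linalg.norm(v)                            # unit v, so ||B v||_2 = 1

x = rng.standard_normal((m, d))                   # isotropic (sub-Gaussian) rows
vals = (x @ (B @ v)) ** 2
emp = vals.mean()

# E[(x^T B v)^2] = ||B v||^2 = 1; the empirical mean deviates at rate 1/sqrt(m).
assert abs(emp - 1.0) < 0.05
```

This is exactly the quantity the ϵ-net plus Bernstein argument controls uniformly over all unit v.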

E.2 PROOF OF LEMMA C.3

Recall that v_i ∈ R^k denotes the i-th row of the matrix V and denote W = (B^t (B^t)^⊤ − I_d) B^*. Following a similar argument as in the proof of Lemma C.2, it suffices to show that for any v_1, v_2 ∈ R^k with ∥v_1∥ = ∥v_2∥ = 1 we have

(1/m) ⟨A_i(W v_1 e^{(i)⊤}), A_i(B^t v_2 e^{(i)⊤})⟩ = ±O(n δ^{(2)})

(note that ⟨W v_1 e^{(i)⊤}, B^t v_2 e^{(i)⊤}⟩ = 0). Let S^{k−1} be the unit sphere in the k-dimensional Euclidean space and let N_k be a 1/4-net of cardinality 9^k (see Corollary 4.2.13 in Vershynin (2018)). Note that

(1/m) ⟨A_i(W v_1 e^{(i)⊤}), A_i(B^t v_2 e^{(i)⊤})⟩ = n v_1^⊤ ( W^⊤ ((1/m) Σ_{j=1}^m x_ij x_ij^⊤) B^t ) v_2.

Moreover, according to Exercise 4.4.3 in Vershynin (2018), we have

sup_{v_1, v_2 ∈ S^{k−1}} v_1^⊤ ( W^⊤ ((1/m) Σ_{j=1}^m x_ij x_ij^⊤) B^t ) v_2 ≤ 2 sup_{v_1, v_2 ∈ N_k} v_1^⊤ ( W^⊤ ((1/m) Σ_{j=1}^m x_ij x_ij^⊤) B^t ) v_2.

In the following, we prove that the quantity on the right-hand side is bounded by δ^{(2)} / 2, so that we have the desired result. For any fixed index i, denote Z_ij = n (x_ij^⊤ W v_1)(x_ij^⊤ B^t v_2) and note that E_{x_ij}[Z_ij] = 0. Since the x_ij's are independent subgaussian vectors, the Z_ij's are independent subexponential variables, and we can bound ∥Z_ij∥_{ψ1} ≤ n ∥W v_1∥_2 ∥B^t v_2∥_2 = n · dist(B^t, B^*). Using the centering property (see Exercise 2.7.10 of Vershynin (2018)) and Bernstein's inequality (see Theorem 2.8.1 of Vershynin (2018)) for zero-mean subexponential variables, we can bound, for every τ ≥ 0,

Pr{ |(1/m) Σ_{j=1}^m Z_ij| ≥ nτ } ≤ 2 exp( −c min{ m^2 n^2 τ^2 / Σ_{j=1}^m ∥Z_ij∥_{ψ1}^2, m n τ / max_{j ∈ [m]} ∥Z_ij∥_{ψ1} } ) = 2 exp( −c min{ m τ^2 / dist(B^t, B^*)^2, m τ / dist(B^t, B^*) } ),

where c > 0 is an absolute constant. Using the union bound over N_k^2, setting δ^{(2)} = √(k log n) / √m, and setting τ = dist(B^t, B^*) δ^{(2)}, we conclude that Condition C.3 is satisfied with parameter δ^{(2)} with probability at least 1 − exp(−c′ k log n) for some constant c′.

E.3 PROOF OF LEMMA C.4

Denote Q^t = B^t (V^{t+1})^⊤ − B^* (V^*)^⊤ and recall that V_{i,:} ∈ R^k denotes the i-th row of the matrix V. For simplicity, denote W = (1/N) Σ_{i=1}^n Σ_{j=1}^m ⟨A_ij, Q^t⟩ A_ij V^{t+1} and note that E_{x_ij}[W] = Q^t V^{t+1}. We use an ϵ-net argument over the set of vectors a ∈ R^d, b ∈ R^k with ∥a∥ = ∥b∥ = 1. Let S^{k−1} and S^{d−1} be the unit spheres in the k- and d-dimensional Euclidean spaces, and let N_k and N_d be 1/4-nets of cardinality 9^k and 9^d, respectively (see Corollary 4.2.13 in Vershynin (2018)). Note that (1/N) ⟨A(Q^t), A(a b^⊤ (V^{t+1})^⊤)⟩ = a^⊤ W b. Moreover, according to Exercise 4.4.3 in Vershynin (2018), we have

sup_{a ∈ S^{d−1}, b ∈ S^{k−1}} a^⊤ (W − Q^t V^{t+1}) b ≤ 2 sup_{a ∈ N_d, b ∈ N_k} a^⊤ (W − Q^t V^{t+1}) b.

In the following, we prove that the quantity on the right-hand side is bounded by δ^{(3)} s_1^2 k · dist(B^t, B^*) / 2, so that we have the desired result. We first bound ∥Q^t e^{(i)}∥_2 and ∥V^{t+1}_{i,:}∥_2, as they are used in the concentration argument.

Bound on ∥V^{t+1}_{i,:}∥_2. Using Lemma D.1, we can write V^{t+1} = V^* (B^*)^⊤ B^t − F^t and therefore V^{t+1}_{i,:} = (B^t)^⊤ B^* V^*_{i,:} − F^t_{i,:}. Using Assumption 5.2, we have ∥V^*_{i,:}∥_2 ≤ µ √k s_k / √n. Further, we can compute

F^t_{i,:} = ( (1/m) Σ_{j=1}^m (B^t)^⊤ x_ij x_ij^⊤ B^t )^{−1} (1/m) Σ_{j=1}^m (B^t)^⊤ x_ij x_ij^⊤ (B^t (B^t)^⊤ − I_d) B^* V^*_{i,:}.

Using the variational formulation of the spectral norm and Conditions C.2 and C.3 (note that δ^{(1)} = δ^{(2)} = √(k log n) / √m), we have ∥F^t_{i,:}∥_2 ≤ (δ^{(2)} dist(B^t, B^*) / (1 − δ^{(1)})) · µ √k s_k / √n. Therefore, we obtain ∥V^{t+1}_{i,:}∥_2 ≤ 2 µ √k s_k / √n. The high-probability upper bound on w_i^t in Lemma C.1 can be derived from the above inequality by noting that w_i^t = √n V^t_{i,:}.

Bound on ∥Q^t e^{(i)}∥_2. Recall the definition Q^t = B^t (V^{t+1})^⊤ − B^* (V^*)^⊤. Using Lemma D.1, we obtain

Q^t = B^t (V^* (B^*)^⊤ B^t − F^t)^⊤ − B^* (V^*)^⊤ = (B^t (B^t)^⊤ − I_d) B^* (V^*)^⊤ − B^t (F^t)^⊤.

We can bound

∥Q^t e^{(i)}∥_2 ≤ ∥(B^t (B^t)^⊤ − I_d) B^*∥_2 ∥v_i^*∥_2 + ∥F^t_{i,:}∥_2 ≤ 2 dist(B^t, B^*) µ √k s_k / √n.

Denote Z_ij = n (x_ij^⊤ Q^t e^{(i)}) (x_ij^⊤ a b^⊤ V^{t+1}_{i,:}) and note that (1/n) Σ_{i=1}^n E_{x_ij}[Z_ij] = a^⊤ Q^t V^{t+1} b. Since the x_ij's are independent subgaussian vectors, the Z_ij's are independent subexponential variables, and we can bound

∥Z_ij∥_{ψ1} ≤ n ∥Q^t e^{(i)}∥_2 ∥a b^⊤ V^{t+1}_{i,:}∥_2 ≤ 4 k µ^2 s_k^2 dist(B^t, B^*).

Using the centering property (see Exercise 2.7.10 of Vershynin (2018)) and Bernstein's inequality (see Theorem 2.8.1 of Vershynin (2018)) for zero-mean subexponential variables, we can bound, for every τ ≥ 0,

Pr{ |(1/(nm)) Σ_{i=1}^n Σ_{j=1}^m Z_ij − a^⊤ (Q^t V^{t+1}) b| ≥ τ } ≤ 2 exp( −c min{ m^2 n^2 τ^2 / Σ_{i,j} ∥Z_ij∥_{ψ1}^2, m n τ / max_{i,j} ∥Z_ij∥_{ψ1} } ) = 2 exp( −c min{ m n τ^2 / (4 k µ^2 s_k^2 dist(B^t, B^*))^2, m n τ / (4 k µ^2 s_k^2 dist(B^t, B^*)) } ).

Set τ = δ^{(3)} s_1^2 k · dist(B^t, B^*). We have that when δ^{(3)} = 4 (√d + √(k log n)) / (√(mn) κ^2),

Pr{ |(1/(nm)) Σ_{i=1}^n Σ_{j=1}^m Z_ij − a^⊤ (Q^t V^{t+1}) b| ≥ τ } ≤ exp(−c (d + k log n)).

Using the union bound over N_k × N_d, we have that with probability at least 1 − exp(−c″ k log n), Condition C.4 holds with δ^{(3)} = 4 (√d + √(k log n)) / (√(mn) κ^2).

F ANALYSIS OF THE INITIALIZATION PROCEDURE

In this section, we present the privacy and utility guarantees of the INITIALIZATION procedure (Algorithm 3) when the local data points x_ij are Gaussian. Recall that to establish the utility guarantee for CENTAUR in the LRL setting, a column orthonormal initialization B^0 ∈ R^{d×k} with initial error dist(B^0, B^*) ≤ ϵ_0 = 0.2 is required. While such a requirement can be ensured by the private power method (PPM, presented in Algorithm 5), its utility guarantee only holds with a constant probability, e.g. 0.99. A key contribution of our work is a cross-validation type scheme that boosts the success probability of PPM to 1 − O(n^{−k}) at a small extra cost of O(k log n) trials. Note that boosting the success probability comes at a small cost in utility, and hence we need a higher target accuracy ϵ_i = 0.01 for PPM compared to ϵ_0 = 0.2. The most important novelty of our selection scheme is that it only takes as input the results of O(k log n) independent PPM runs; it can hence be treated as post-processing and incurs no additional privacy leakage.

F.1 UTILITY AND PRIVACY GUARANTEES OF THE PPM PROCEDURE

In this section, we present the guarantees for the PPM procedure, under the additional assumption that the x_ij are Gaussian. One can prove that the choice of ζ_0 described in Lemma F.1 is a high-probability upper bound on ∥Y_i^l∥_F (see Lemma G.1 in the appendix). Therefore, with the same high probability, the clipping operation in the Gaussian mechanism does not take effect. Conditioned on this event (clipping has no effect), to establish the utility guarantee of Algorithm 5, we view PPM (Hardt & Price, 2014) as a specific instance of the perturbed power method presented in Algorithm 6. Provided that the level of perturbation is sufficiently small, we can exploit the analysis from (Hardt & Price, 2014) to prove the following lemma.

Lemma F.1 (Utility Guarantee of a Single PPM Trial). Consider the LRL setting. Suppose that Assumptions 5.1 and 5.2 hold and that the x_ij are Gaussian. Let ϵ_i = 0.01 be the target accuracy. For PPM presented in Algorithm 5, set

ζ_0 = c′_0 µ^2 k^{1.5} s_k^2 d^{0.5} log n · (√(log n) + √k),  L = c′_L (s_k^2 + Σ_{j=1}^k s_j^2) / s_k^2 · log(kd / ϵ_i).

Suppose that n is sufficiently large, so that there exists an m̃_0 such that

mn log^6(mn) ≥ c′_1 d · log^6 d · k^3 · µ^2 · Σ_{i=1}^k s_i^2 / s_k^2.  ( )

Choose the noise multiplier σ_0 = n s_k^2 / (c′_2 ζ_0 k √d √(log L)). Then, with probability at least 0.99, we have dist(B^0, B^*) ≤ ϵ_i.

Remark F.1. If we focus on the dependence on the problem dimension d and the number of clients n, treat the other quantities (e.g. the rank k) as constants, and ignore logarithmic terms, the requirement is m ≥ Θ(d/n), which is the same as the requirement on m in Lemma C.5.

The following RDP guarantee of PPM can be established using the Gaussian mechanism of RDP and the RDP composition lemma.

Lemma F.2 (Privacy Guarantee). Consider PPM presented in Algorithm 5, with the number of communication rounds L and the noise multiplier σ_0 set according to Lemma F.1.
We have that PPM is an (α, ϵ′_init)-RDP mechanism with

ϵ′_init = αL · (c′_2 ζ_0 k √d √(log L))^2 / (n^2 s_k^4) = Õ(α κ^2 k^7 µ^4 d^2 / n^2),

where Õ hides the constants and the logarithmic terms and we treat (s_k^2 + Σ_{j=1}^k s_j^2) / s_k^2 = O(k).

F.2 BOOST THE SUCCESS PROBABILITY WITH CROSS-VALIDATION

The following lemma shows that if the output of PPM has utility at least ϵ_i (e.g. ϵ_i = 0.01) with probability p (e.g. p = 0.99), then any candidate that passes the test (68) has utility no worse than ϵ_0 (e.g. ϵ_0 = 0.2) with high probability (at least 1 − δ). The proof is provided in Appendix H.

Lemma F.3. Use ϵ_0 to denote the target accuracy of CENTAUR in the LRL setting and use ϵ_i to denote the accuracy of a single PPM trial. Recall that p = 0.99 is the success probability of PPM. Use B_{0,c} to denote the output of PPM in the c-th trial and set T_0 = 8p log(1/δ) in the INITIALIZATION procedure (Algorithm 3). Then, with probability at least 1 − δ, there exists an element ĉ ∈ {1, …, T_0} such that for at least half of the indices c ∈ {1, …, T_0},

s_i(B_{0,c}^⊤ B_{0,ĉ}) ≥ 1 − 2 ϵ_i^2, ∀ i ∈ [k].  (68)

Moreover, B_{0,ĉ} must satisfy dist(B_{0,ĉ}, B^*) ≤ ϵ_0 for any sufficiently small ϵ_i ∈ [0, 1] such that √(1 − ϵ_0^2) + 1 − √(1 − ϵ_i^2) + ϵ_i < 1 − 2 ϵ_i^2. One valid example is ϵ_i = 0.01 and ϵ_0 = 0.2, which is the choice used in our previous discussions.

Corollary F.1. Consider the INITIALIZATION procedure presented in Algorithm 3. By setting T_0 = c′_T k log n, ϵ_i = 0.01, and setting L_0, σ_0, and ζ_0 according to Lemma F.1, we have with probability at least 1 − n^{−k} that the output B_ĉ satisfies dist(B_ĉ, B^*) ≤ ϵ_0 = 0.2. Moreover, the INITIALIZATION procedure is an (α, ϵ_init)-RDP mechanism with ϵ_init = ϵ′_init · T_0 = Õ(α κ^2 k^8 µ^4 d^2 / n^2).

The idea of boosting the success probability is inspired by Algorithm 5 of (Liang et al., 2014). The major improvement of our approach is that, given the outputs of O(k log n) PPM trials, it no longer requires access to the dataset and can hence be treated as post-processing, whereas (Liang et al., 2014) requires an extra data-dependent SVD operation, which would defeat the purpose of DP protection in the first place.
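The selection rule of Lemma F.3 can be prototyped directly: it only compares the T_0 candidate subspaces with each other and never touches the data. Below is a minimal sketch (with made-up candidates — most near a ground-truth subspace, a few failures — and illustrative dimensions) of the pairwise test s_min(B_c^⊤ B_ĉ) ≥ 1 − 2ϵ_i^2 and the majority vote:

```python
import numpy as np

def passes(Bc, Bhat, eps_i=0.01):
    # Test (68): every singular value of Bc^T Bhat is at least 1 - 2*eps_i^2.
    s = np.linalg.svd(Bc.T @ Bhat, compute_uv=False)
    return s.min() >= 1.0 - 2.0 * eps_i ** 2

def select(candidates):
    # Return the first candidate that agrees with at least half of all trials.
    T0 = len(candidates)
    for Bhat in candidates:
        votes = sum(passes(Bc, Bhat) for Bc in candidates)
        if votes >= T0 / 2:
            return Bhat
    return None

rng = np.random.default_rng(2)
d, k = 40, 3
B_star, _ = np.linalg.qr(rng.standard_normal((d, k)))

cands = []
for c in range(9):
    if c < 7:   # "successful" trials: tiny perturbations of B*
        M = B_star + 1e-4 * rng.standard_normal((d, k))
    else:       # failed trials: random subspaces
        M = rng.standard_normal((d, k))
    Q, _ = np.linalg.qr(M)
    cands.append(Q)

B_sel = select(cands)
# The selected subspace is close to B*: dist = ||(I - B* B*^T) B_sel||_2.
dist = np.linalg.norm((np.eye(d) - B_star @ B_star.T) @ B_sel, 2)
assert dist < 0.2
```

Because `select` consumes only the outputs of the trials, running it is post-processing and consumes no additional privacy budget, exactly as argued above.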

G GUARANTEES FOR THE PPM PROCEDURE

In this section, we present the analysis establishing the guarantees for the PPM procedure. We first show that the choice of ζ_0 described in Lemma F.1 is a high-probability upper bound on ∥Y_i^l∥_F; therefore, with the same high probability, the clipping operation in the Gaussian mechanism does not take effect.

Lemma G.1. Consider the LRL setting. Under Assumptions 5.1 and 5.2, we have with probability at least 1 − 3 (mn)^{−100} that

∥Y_i^l∥_F ≤ ζ_0 = c_0 µ^2 k^{1.5} s_k^2 (log n)(√(log n) + √d)(√(log n) + √k),

where the last equality holds because, by definition, we have B^* = U. We now perform a change of variables and denote v_ij = V^⊤ x_ij, where V = [U, U′] is an orthogonal matrix extending U. Therefore, we can equivalently rewrite equation 87 as the following inequality:

∥E[(U^⊤ Z_ij^l X^{l−1})^⊤ U^⊤ Z_ij^l X^{l−1}]∥  (88)
≤ ∥(V^⊤ X^{l−1})^⊤ E[ (w_i^{*⊤}(v_ij[1], …, v_ij[k])^⊤)^4 (Σ_{a=1}^k v_ij[a]^2) v_ij v_ij^⊤ ] V^⊤ X^{l−1}∥,

where v_ij[a] denotes the a-th coordinate of the k-dimensional vector v_ij, which is also Gaussian. Due to the isotropy of the Gaussian, it suffices to compute the expectation assuming w_i^* ∝ ∥w_i∥ e_1, where e_1 = (1, 0, …, 0)^⊤. Then, following the proof of (Tripuraneni et al., 2021, Lemma 5), by combinatorics we have

E[ (w_i^{*⊤}(v_ij[1], …, v_ij[k])^⊤)^4 (Σ_{a=1}^k v_ij[a]^2) v_ij v_ij^⊤ ]  (90)
= ∥w_i^*∥^4 E[ (v_ij[1])^4 (Σ_{a=1}^k v_ij[a]^2) v_ij v_ij^⊤ ] = O(∥w_i^*∥^4 k).

Therefore, plugging the above equation into equation 89, we obtain

∥E[(U^⊤ Z_ij^l X^{l−1})^⊤ U^⊤ Z_ij^l X^{l−1}]∥ ≤ O(k ∥w_i^*∥^4) ≤ O(k µ^2 k s_k^2 Σ_{i=1}^k s_i^2),

and the resulting sample size requirement can be stated, for some constant c_1, as

mn log^6(mn) ≥ c_1 d · log^6 d · k^3 · µ^2 · Σ_{i=1}^k s_i^2 / s_k^2.  ( )

Control of the terms related to P_2^l. Recall that P_2^l ∼ N(0, σ^2)^{d×k} with σ = σ_0 ζ_0 / n. Using Lemma I.2, we have with probability 1 − 2 e^{−x}:

• max_{l ∈ [L]} ∥P_2^l∥ ≤ C_N σ (√d + √p + ln(2L + x));
• max_{l ∈ [L]} ∥U^⊤ P_2^l∥ ≤ C_N σ (2√p + ln(2L + x)).

Consequently, we can bound the terms related to P_2^l by choosing σ_0 sufficiently small, so that the following inequalities hold with x = 100 log n:

10 C_N σ (√d + √k + ln(2L + x)) ≤ ϵ_a s_k^2,  10 C_N σ (2√k + ln(2L + x)) ≤ s_k^2 / (τ √(dk)).

To simplify the above inequalities, suppose that ϵ_a is a constant and neglect the log log terms. We obtain, for some constant c_2,

c_2 σ_0 ζ_0 / n ≤ s_k^2 / (τ k √d √(log L)).  ( )

Having established equation 77 for both P_1^l and P_2^l, we can then use Theorem G.1 to obtain the target result.

H PROOF OF LEMMA F.3

Proof. In the following, we first show that (1) the index ĉ exists with a high probability, and we then show that (2) any candidate B_ĉ that passes the test equation 68 has utility no worse than b, i.e. dist(B_ĉ, B^*) ≤ b.

For two successful candidates B_{c_1} and B_{c_2}, i.e. dist(B_{c_i}, B^*) ≤ a for i = 1, 2, we compute

s_min(B_{c_1}^⊤ B_{c_2}) = s_min(B_{c_1}^⊤ (B^* (B^*)^⊤ + B^*_⊥ (B^*_⊥)^⊤) B_{c_2})
≥ s_min(B_{c_1}^⊤ B^* (B^*)^⊤ B_{c_2}) − s_max(B_{c_1}^⊤ B^*_⊥ (B^*_⊥)^⊤ B_{c_2})
≥ s_min(B_{c_1}^⊤ B^*) s_min((B^*)^⊤ B_{c_2}) − s_max(B_{c_1}^⊤ B^*_⊥) s_max((B^*_⊥)^⊤ B_{c_2})
≥ 1 − 2 a^2.  (98)

Define the binomial random variable X_c = 1{B_c is successful}. We have E[X_c] = P{B_c is successful} ≥ p. Using the concentration of binomial random variables, we have

P{ Σ_{c=1}^{T_0} X_c ≤ T_0 · E[X_c] − t } ≤ exp(−t^2 / (2 T_0 p)).
(99) Therefore, with the choice t = T_0 / 2 and T_0 ≥ 8p log(1/δ), we have with probability at least 1 − δ that at least half of the outputs of the T_0 independent PPM runs are successful. Consequently, there exist at least T_0 / 2 pairs B_{c_1}, B_{c_2} such that equation 98 holds, which shows the existence of ĉ.

Utility of B_ĉ. We now show that any candidate B_ĉ that passes the test equation 68 must satisfy dist(B_ĉ, B^*) ≤ b. We argue by contradiction. Suppose that there exists a candidate B̂ that passes the test but has dist(B̂, B^*) > b. Then there exist x̂ ∈ R^k and ŷ ∈ R^k with ∥x̂∥_2 = ∥ŷ∥_2 = 1, achieving the minimum singular value of B̂^⊤ B^*, such that ⟨B̂ x̂, B^* ŷ⟩ ≤ √(1 − b^2). Let B_i be a successful candidate, i.e. dist(B_i, B^*) ≤ a (note that, with high probability, successful candidates form the majority, according to the discussion above). We have

s_min(B̂^⊤ B_i) = s_min(B̂^⊤ (B^* (B^*)^⊤ + B^*_⊥ (B^*_⊥)^⊤) B_i) ≤ s_min(B̂^⊤ B^* (B^*)^⊤ B_i) + s_max(B̂^⊤ B^*_⊥ (B^*_⊥)^⊤ B_i).

For the second term, we have s_max(B̂^⊤ B^*_⊥ (B^*_⊥)^⊤ B_i) ≤ s_max(B̂^⊤ B^*_⊥) · s_max((B^*_⊥)^⊤ B_i) ≤ 1 · a = a. To bound the first term, recall the variational formulation of the minimum singular value, s_min(A) = min_{∥x∥ = ∥y∥ = 1} x^⊤ A y, and hence

s_min(B̂^⊤ B^* (B^*)^⊤ B_i) ≤ x̂^⊤ B̂^⊤ B^* (B^*)^⊤ B_i ŷ = x̂^⊤ B̂^⊤ B^* ŷ − x̂^⊤ B̂^⊤ B^* (ŷ − (B^*)^⊤ B_i ŷ) ≤ √(1 − b^2) + ∥I_k − (B^*)^⊤ B_i∥_2 ≤ √(1 − b^2) + 1 − √(1 − a^2),

where we recall the definitions of x̂ and ŷ in the paragraph above. Combining the above bounds, we obtain

s_min(B̂^⊤ B_i) ≤ √(1 − b^2) + 1 − √(1 − a^2) + a,

which is strictly smaller than 1 − 2a^2 for a sufficiently small a and a sufficiently large b, e.g. a = 0.01 and b = 0.2. In other words, B̂ fails the test (68). This is a contradiction, and hence any candidate that passes the test (68) must satisfy dist(B_ĉ, B^*) ≤ b.
To see the equivalence to Lemma 4.6 in (Jain et al., 2021) , note that in order for the convergence analysis of the main procedure to hold, the initializer U init in (Jain et al., 2021) should satisfy ∥U ⊤ ⊥ U init ∥ F = O(1). To achieve this, the R.H.S. of Lemma 4.6 in (Jain et al., 2021) should be bounded by a constant, which means that m = Ω(d/n).



A similar problem is considered in (Jain et al., 2021), but there the measurement y_ij suffers from an extra white noise with variance σ_F. In our paper, we consider the noiseless case, and hence when comparing with (Jain et al., 2021) we set σ_F = 0 in their results for a fair comparison.

The intuition behind this requirement is that our convergence analysis requires the iterates to stay within a ball of constant radius (measured in the principal angle distance) centered around the ground truth. Adding a large noise would break this argument. Similar requirements are made in Tripuraneni et al. (2021).

Note that Jain et al. (2021) use the Frobenius norm (instead of the spectral norm) of B_⊥^⊤ B^* ∈ R^{d×k} as the optimality metric. However, since rank(B_⊥^⊤ B^*) ≤ k, we can always bound ∥B_⊥^⊤ B^*∥_F ≤ √k · dist(B, B^*). With this extra factor, Theorem 5.1 depends quadratically on k, the same as Lemma 4.4 in (Jain et al., 2021), while the dependence on d is substantially reduced, from d^{1.5} to d.

We consider this rescaling of w_i so that the corresponding linear operator in equation 8 is an isometric operator in expectation (see more discussion below).

The choice of ϵ_i should satisfy the condition stated in Lemma F.3; ϵ_i = 0.01 is one valid example.



Figure 1: Privacy utility trade-off for models trained under CENTAUR and other algorithms on CIFAR10 (500 clients, 5 shards per user). Error bar denotes the std. across 3 runs.

Figure 2: The t-th global round of the CENTAUR algorithm, where clients keep their classification head w t i secret while updating shared representation b t → b t+1 based on perturbed gradients g t i

Definition A.1 (DP). Let Θ be an abstract output space. A randomized algorithm M : D → Θ is (ϵ, δ)-differentially private if for all D, D′ ∈ D with d(D, D′) ≤ 1 and for all subsets S ⊆ Θ of the range, the algorithm M satisfies: Pr{M(D) ∈ S} ≤ exp(ϵ) Pr{M(D′) ∈ S} + δ. Theorem A.1 (Conversion from RDP to DP).
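The standard RDP-to-DP conversion (from Mironov's Rényi differential privacy paper, which we take to be the form intended by Theorem A.1) reads: if M is (α, ϵ)-RDP, then M is (ϵ + log(1/δ)/(α − 1), δ)-DP for every δ ∈ (0, 1). A one-line helper makes the bookkeeping concrete; the α and ϵ values below are illustrative, not taken from the paper's experiments:

```python
import math

def rdp_to_dp(alpha, eps_rdp, delta):
    """Standard conversion: (alpha, eps)-RDP implies
    (eps + log(1/delta)/(alpha - 1), delta)-DP."""
    return eps_rdp + math.log(1.0 / delta) / (alpha - 1.0)

# Example with the delta = 1e-5 used throughout the experiments.
eps_dp = rdp_to_dp(alpha=10.0, eps_rdp=0.1, delta=1e-5)
assert abs(eps_dp - (0.1 + math.log(1e5) / 9.0)) < 1e-12
```

In practice one evaluates this conversion over a grid of α values and reports the smallest resulting ϵ_dp.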

Figure 3: Testing accuracy vs ϵ dp during training.

and ∥w t+1 i ∥ ≤ ζ w hold jointly with probability at least 1 -5n -k , which together with (ζ y + ζ 1 )ζ x ζ w ≤ ζ g and the union bound leads to the result of the lemma.

Algorithm 6 NPM: Noisy Power Method (adapted from Hardt & Price (2014))
1: procedure NPM(A, L)  // A is the target matrix
2:   Choose X^0 ∈ R^{d×k} to be a random column orthonormal matrix
3:   for l = 1 to L do
4:     Let X^l = QR(Y^l), where Y^l = A X^{l−1} + G^l  // G^l is some perturbation matrix
5: // QR(·) denotes the QR decomposition and only returns the Q matrix.
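A runnable sketch of the NPM loop on a synthetic target matrix follows; the dimensions, eigengap, and noise level are illustrative, not the DP-calibrated choices of Algorithm 5:

```python
import numpy as np

def npm(A, L, k, sigma, rng):
    """Noisy power method: X_l = QR(A @ X_{l-1} + G_l), with Gaussian G_l."""
    d = A.shape[0]
    X, _ = np.linalg.qr(rng.standard_normal((d, k)))
    for _ in range(L):
        Y = A @ X + sigma * rng.standard_normal((d, k))
        X, _ = np.linalg.qr(Y)   # keep only the Q factor
    return X

rng = np.random.default_rng(3)
d, k = 60, 4
U, _ = np.linalg.qr(rng.standard_normal((d, k)))
# PSD target with a clear eigengap between the top-k and remaining spectrum.
A = U @ np.diag([10.0, 9.0, 8.0, 7.0]) @ U.T + 0.1 * np.eye(d)

X = npm(A, L=50, k=k, sigma=1e-3, rng=rng)
# Principal-angle distance dist(X, U) = ||(I - U U^T) X||_2 should be small.
dist = np.linalg.norm((np.eye(d) - U @ U.T) @ X, 2)
assert dist < 0.05
```

The recovered subspace error stalls at a floor proportional to the per-step noise over the eigengap, which is exactly the tradeoff Theorem G.1 quantifies.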

By combining the norm bound and the matrix variance bound above and using the matrix Bernstein inequality (Tripuraneni et al., 2021, Lemma 31), we obtain a bound on ∥U^⊤ (M^l − A) X^{l−1}∥ of order log^3(mn) · log^3 d · O(…). To ensure that equation 77 holds, it then suffices to take mn sufficiently large so that this quantity is at most ϵ (s_k(A) − s_{k+1}(A)) (equation 94).

dist(B_ĉ, B^*) ≤ b.

Existence of ĉ. Suppose that both B_{c_1} and B_{c_2} are successful, i.e. dist(B_{c_i}, B^*) ≤ a for i = 1, 2, or equivalently s_min(B_{c_i}^⊤ B^*) ≥ √(1 − a^2) for i = 1, 2. Recall that B^* (B^*)^⊤ + B^*_⊥ (B^*_⊥)^⊤ = I_d. Compute that

Testing accuracy (%) on CIFAR10/CIFAR100/EMNIST under various data allocation settings. No data augmentation is used for training the representations. n stands for the number of clients and S stands for the number of classes per client. The δ parameter of DP is fixed to 10^{−5}, a common choice in the literature. The privacy budget of DP is fixed to the small value ϵ_dp = 1 for the results in this table.

Nilesh Tripuraneni, Chi Jin, and Michael Jordan. Provable meta-learning of linear representations. In International Conference on Machine Learning, pp. 10434-10443. PMLR, 2021.

Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge University Press, 2018.

Clipping threshold ζ_g used to reproduce the results in Figure 1.

Testing accuracy (%) on CIFAR10 under various data allocation settings given a larger communication budget T_g = 400. No data augmentation is used for training the representations. n stands for the number of clients and S stands for the number of classes per client. The δ parameter of DP is fixed to 10^{−5}, a common choice in the literature. The privacy budget of DP is fixed to the small value ϵ_dp = 1 for the results in this table.

)^2 and note that E_{x_ij}[Z_ij] = n. Since the x_ij's are independent subgaussian vectors, the Z_ij's are independent subexponential variables. Recall that ∥·∥_{ψ2} and ∥·∥_{ψ1} denote the subgaussian norm and the subexponential norm, respectively, and that ∥XY∥_{ψ1} ≤ ∥X∥_{ψ2} ∥Y∥_{ψ2} for subgaussian random variables X and Y (see Lemma 2.2.7 in Vershynin (2018)). Therefore, we can bound ∥Z_ij∥_{ψ1} ≤ n ∥B^t v∥_2^2 = n. Using the centering property (see Exercise 2.7.10 of Vershynin (2018)) and Bernstein's inequality (see Theorem 2.8.1 of Vershynin (2018)) for zero-mean subexponential variables, we can bound, for every τ ≥ 0,

B^*). Using the centering property (see Exercise 2.7.10 of Vershynin (2018)) and Bernstein's inequality (see Theorem 2.8.1 of Vershynin (2018)) for zero-mean subexponential variables, we can bound, for every τ ≥ 0,

Algorithm 5 PPM: Private Power Method (adapted from Hardt & Price (2014))
1: procedure PPM(L, σ_0, ζ_0, m̃_0)
2:   Choose X^0 ∈ R^{d×k} to be a random column orthonormal matrix
3:   for l = 1 to L do
4:     for i = 1 to n do
5:       Sample without replacement a subset S_i^{l,0} from S_i with cardinality m̃_0
6:       Denote M_i^l = (1/m̃_0) Σ_{j ∈ S_i^{l,0}} y_ij^2 x_ij x_ij^⊤
7:       Compute Y_i^l := M_i^l X^{l−1}
8:     Let X^l = QR(Y^l), where Y^l = GM_{ζ_0,σ_0}({Y_i^l}_{i=1}^n)
// QR(·) denotes the QR decomposition and only returns the Q matrix.

ACKNOWLEDGEMENT

The work of Zebang Shen was supported by NSF-CPS-1837253. Hamed Hassani acknowledges the support from the NSF Institute for CORE Emerging Methods in Data Science (EnCORE), under award CCF-2217058. The research of Reza Shokri is supported by Google PDPO faculty research award, Intel within the www.private-ai.org center, Meta faculty research award, the NUS Early Career Research Award (NUS ECRA award number NUS ECRA FY19 P16), and the National Research Foundation, Singapore under its Strategic Capability Research Centres Funding Initiative. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of National Research Foundation, Singapore. In addition, Zebang Shen thanks Prof. Hui Qian from Zhejiang University for providing the computational resource, who is supported by National Key Research and Development Program of China under Grant 2020AAA0107400. The authors would like to thank Hongyan Chang for helpful discussions on earlier stages of this paper.


where Y_i^l is computed in line 7 of Algorithm 5 and c_0 is some universal constant.

Proof. The detailed expression of Y_i^l in line 7 of Algorithm 5 can be written out explicitly. Using the triangle inequality of the matrix norm, ζ is a high-probability upper bound on ∥Y_i^l∥_F if the corresponding inequality holds with high probability. To bound |y_ij|^2, recall that in Assumption 5.1 we assume that x_ij is a sub-Gaussian random vector with ∥x_ij∥_{ψ2} = 1; using the definition of a sub-Gaussian random vector, the bound follows with the choice τ = µ…. Here c_s is some constant and we recall that s_k is a shorthand for s_k(W^* / √n). To bound ∥x_ij∥, recall again that x_ij is a sub-Gaussian random vector with ∥x_ij∥_{ψ2} = 1, and therefore the bound holds with probability at least 1 − δ; we take δ = exp(−100 log n). To bound ∥x_ij^⊤ X^{l−1}∥, note that due to the rotational invariance of the Gaussian random vector x_ij (recall that X^{l−1} is a column orthonormal matrix), the ℓ_2 norm ∥x_ij^⊤ X^{l−1}∥ is distributed like the ℓ_2 norm of a Gaussian random vector drawn from N(0, I_k). Using the union bound and the fact that ζ_x ζ_y^2 ζ ≤ ζ_0 leads to the conclusion.

Conditioned on the above event (clipping takes no effect), to establish the utility guarantee of Algorithm 5, we view PPM as a specific instance of the noisy power method (NPM) presented in Algorithm 6 (Hardt & Price, 2014), where the target matrix is A = 2Γ + trace(Γ) I_d with Γ = B^* (V^*)^⊤ V^* (B^*)^⊤, and the perturbation matrix G^l = P_1^l + P_2^l is the sum of the noise matrix added by the Gaussian mechanism, P_2^l = (σ_0 ζ_0 / n) W^l, and the error matrix P_1^l = (M^l − A) X^{l−1}. One can check that with these choices we recover line 8 of Algorithm 5. Provided that the level of perturbation is sufficiently small, we can exploit the following analysis of NPM from (Hardt & Price, 2014).

Theorem G.1 (Adapted from Corollary 1.1 of Hardt & Price (2014)). Consider the noisy power method (NPM) presented in Algorithm 6. Let U ∈ R^{d×k} represent the top-k eigenvectors of the input matrix A ∈ R^{d×d}. Suppose that the perturbation matrix G^l satisfies, for all l ∈ {1, …, L}, the stated bounds for some fixed parameter τ and ϵ < 1/2. Then, with all but 1/τ + e^{−Ω(d)} probability, there exists an L = O(…) such that the output is ϵ-accurate. Here C > 0 is a constant defined in Lemma I.3.

To prove that the perturbation matrix G^l satisfies the conditions required by the above theorem, we bound M^l − A and (σ_0 ζ_0 / n) W^l individually, both with high probability.

Proof of Lemma F.1. Recall that A = 2Γ + trace(Γ) I_d, and note that rank(Γ) = k and that its singular values are s_1^2, …, s_k^2. In this proof, we show that for both matrices P_1^l = (M^l − A) X^{l−1} and P_2^l = (σ_0 ζ_0 / n) W^l the inequalities in equation 77 hold for all l ∈ {1, …, L}, which is a sufficient condition for Theorem G.1 to apply.

Control of the terms related to P_1^l. 1. We first bound ∥P_1^l∥, using the independence between M^l and X^{l−1}. To bound the norm term E[∥(M^l − A) X^{l−1}∥], we begin by controlling the relevant norms; by the proof of Lemma G.1, the required bounds hold with probability at least 1 − δ. We then compute an upper bound on the matrix variance term: due to the isotropy of the Gaussian, (Tripuraneni et al., 2021, Lemma 5) applies, and plugging equation 80 into equation 79 proves the desired inequality. Combining both the norm bound and the matrix variance bound and using the matrix Bernstein inequality (Tripuraneni et al., 2021, Lemma 31) gives the bound on ∥P_1^l∥. 2. We then proceed to bound the term ∥U^⊤ P_1^l∥ and compute an upper bound on the corresponding matrix variance term.

I PRELIMINARY ON MATRIX CONCENTRATION INEQUALITIES

Lemma I.1 (Theorem 4.4.5 in Vershynin (2018)). Let G_1, …, G_L ∼ N(0, σ^2)^{d×n}. There exists a constant C_N such that with probability at least 1 − e^{−x}, …

Lemma I.2 (Lemma A.2 of Hardt & Price (2014)). Let U ∈ R^{d×p} be a matrix with orthonormal columns and let G_1, …, G_L ∼ N(0, σ^2)^{d×p} with 0 ≤ p ≤ d. There exists a constant C_N such that with probability at least 1 − e^{−x}, …

Lemma I.3 (Minimum Singular Value of a Square Gaussian Matrix; Theorem 1.2 of Rudelson & Vershynin (2008)). Let A ∈ R^{k×k} be a Gaussian random matrix, i.e. A_ij ∼ N(0, 1). Then, for every ϵ > 0, …

• Suppose that the left-hand side c_t κ k^{1.5} µ^2 d / n is fixed. Then, for a target accuracy ϵ_a, we cannot establish the theoretical guarantee that CENTAUR achieves the accuracy ϵ_a within a DP budget of ϵ_dp ≤ c_t κ k^{1.5} µ^2 d / (n ϵ_a) (this is natural, since a smaller DP budget ϵ_dp requires a larger noise multiplier σ_g, which jeopardizes the convergence analysis of CENTAUR). We emphasize, however, that we are not ruling out the possibility that such a DP budget ϵ_dp is achievable: the privacy guarantee we establish is only an upper bound, and we are not establishing a lower bound.

• Now suppose that all factors other than the number of clients n are fixed. Corollary 5.1 implies that for a sufficiently large n, i.e. n ≥ c_t κ k^{1.5} µ^2 d / (ϵ_a · ϵ_dp), we can establish the guarantee that the output of CENTAUR achieves an ϵ_a utility within an ϵ_dp budget. This interpretation also clarifies the benefit of a better dependence on d: a better dependence on d means that a smaller n suffices to achieve the same utility-privacy guarantee.

K DISCUSSION ON THE REQUIRED ASSUMPTIONS

In this section, we show that the requirements we made in Assumptions 5.1 to 5.3 are similar to the assumptions made in (Collins et al., 2021) and (Jain et al., 2021).

Discussion on Assumption 5.1. We note that our Assumption 5.1 is the same as Assumption 1 in (Collins et al., 2021), and is similar to point (i) of Assumption 4.1 in (Jain et al., 2021), where x_ij is assumed to be exactly Gaussian.

Discussion on Assumption 5.2. We note that our Assumption 5.2 is the same as the definition of the incoherence parameter µ in (Jain et al., 2021) (the parameter λ_k therein is equivalent to σ_k^2 in our paper), and is similar to Assumption 3 in (Collins et al., 2021), where the incoherence parameter µ as well as σ_k is assumed to be 1.

Discussion on Assumption 5.3. We focus on the dependence on the parameters d and n while treating log terms and the other parameters, such as the rank k and the incoherence parameter µ, as constants. In this case, Assumption 5.3 simplifies to m = Ω(d/n). We note that, under this setting, our Assumption 5.3 is the same as requirement (12) in (Collins et al., 2021) and Lemma 4.6 in (Jain et al., 2021). The equivalence to requirement (12) in (Collins et al., 2021) is straightforward.

