PRIVATE AND EFFICIENT META-LEARNING WITH LOW RANK AND SPARSE DECOMPOSITION

Abstract

Meta-learning is critical for a variety of practical ML systems, such as personalized recommendation systems, that must generalize to new tasks despite having only a small number of task-specific training points. Existing meta-learning techniques follow one of two complementary approaches: learning a low-dimensional representation of points shared by all tasks, or task-specific fine-tuning of a global model trained on all tasks. In this work, we propose a novel meta-learning framework that combines both techniques to handle a large number of data-starved tasks. Our framework models network weights as a sum of low-rank and sparse matrices. This lets us capture information from multiple domains jointly in the low-rank part while still allowing task-specific personalization through the sparse part. We instantiate and study the framework in the linear setting, where the problem reduces to estimating the sum of a rank-$r$ matrix and a $k$-column-sparse matrix from a small number of linear measurements. We propose an alternating minimization method with hard thresholding, AMHT-LRS, to learn the low-rank and sparse parts effectively and efficiently. For the realizable, Gaussian data setting, we show that AMHT-LRS solves the problem efficiently with a nearly optimal number of samples. We extend AMHT-LRS to preserve the privacy of each individual user in the dataset while still ensuring strong generalization with a nearly optimal number of samples. Finally, on multiple datasets, we demonstrate that the framework allows personalized models to obtain superior performance in the data-scarce regime.

1. INTRODUCTION

Typical real-world settings, such as multi-user or enterprise personalization, have a long tail of tasks with only a small amount of training data. Meta-learning addresses this problem by learning a "learner" that extracts key information or representations from a large number of training tasks and can be applied to new tasks despite a limited number of task-specific training points. Most existing meta-learning approaches fall into one of two categories: 1) Neighborhood Models: these methods learn a global model, which is then "fine-tuned" to specific tasks (Guo et al., 2020; Howard & Ruder, 2018; Zaken et al., 2021); 2) Representation Learning: these methods learn a low-dimensional representation of points, which can be used to train task-specific linear learners (Javed & White, 2019; Raghu et al., 2019; Lee et al., 2019; Bertinetto et al., 2018; Hu et al., 2021). In particular, task-specific fine-tuning has demonstrated exceptional results across many natural language tasks (Devlin et al., 2018; Liu et al., 2019; Yang et al., 2019; Lan et al., 2019). However, such fine-tuned models update all parameters, so the parameter footprint of each fine-tuned model is the same as that of the original model. This implies that fine-tuning large models, such as a standard BERT model (Devlin et al., 2018) with about 110M parameters, for thousands or millions of tasks would be challenging even from a storage point of view. One potential approach to handling the large number of parameters is to fine-tune only the last layer, but empirical findings suggest that such solutions can be significantly less accurate than fine-tuning the entire model (Chen et al., 2020a; Salman et al., 2020). Moreover, representation learning based approaches impose strict restrictions, for example that each task's parameters must lie in a low-dimensional subspace, which tends to hurt performance in general (Sec. 3). In this work, we propose and study the LRS framework, which combines both of the above complementary approaches.
That is, LRS restricts the model parameters $\Theta^{(i)}$ for the $i$-th task to $\Theta^{(i)} := U \cdot W^{(i)} + B^{(i)}$, where the first term denotes applying a low-dimensional linear operator to $W^{(i)}$, while $B^{(i)}$ is restricted to be sparse. In other words, the first term represents the parameters of each task in a low-dimensional subspace, while the second term allows task-specific fine-tuning of only a few parameters. Note that methods that only allow fine-tuning of batch-norm statistics, or of the final few layers, can also be thought of as "sparse" fine-tuning but with a fixed set of parameters. In contrast, we allow the tunable parameters to be selected from any part of the network. The framework allows collaboration among different tasks so as to learn an accurate model despite the lack of data per task. Similarly, the sparse part allows task-specific personalization of an arbitrary but small set of weights. Finally, the framework allows tasks/users either to contribute to a central model using privacy-preserving billboard models (see Section 2.2), or to learn their parameters locally without contributing to the central model.

While the framework applies more generally, to make the exposition easy to follow, we instantiate it in the case of linear models. In particular, suppose the goal is to learn a linear model for the $i$-th task that is parameterized by $\theta^{(i)}$, i.e., the expected prediction for the data point $x \in \mathbb{R}^d$ is given by $\langle x, \theta^{(i)} \rangle$. We model $\theta^{(i)}$ as $\theta^{(i)} := U^\star w^{\star(i)} + b^{\star(i)}$ for all tasks $1 \le i \le t$, where $U^\star \in \mathbb{R}^{d \times r}$ (with $r \ll d$) is a tall orthonormal matrix that captures the task representation and is shared across all the tasks. Note that estimation of the low-rank and sparse parts is similar to robust PCA (Netrapalli et al., 2014), which is widely studied in the structured matrix estimation literature. However, in that setting, the matrices are fully observed and the goal is to separate out the low-rank and the sparse part.
In comparison, in the framework proposed in this work, we only get a few linear measurements of the underlying matrix, which makes the estimation problem significantly more challenging. We address the challenge of estimating $U^\star$ along with $w^{\star(i)}, b^{\star(i)}$ (for each task) using a simple alternating-minimization-style iterative technique. Our method, AMHT-LRS, alternately estimates the global parameters $U$ and the task-specific parameters $w^{(i)}$ and $b^{(i)}$, the latter independently for each task. To ensure sparsity of $b^{(i)}$, we use an iterative hard-thresholding style estimator (Jain et al., 2014). In general, even estimating $U^\star$ is an NP-hard problem (Thekumparampil et al., 2021). One of the main contributions of the paper is a novel analysis showing that AMHT-LRS efficiently converges to the optimal solution in the realizable setting, assuming the data is generated from a Gaussian distribution. Formally, consider $t$ $d$-dimensional linear regression tasks (indexed by $i \in [t]$), each provided with $m$ samples $\{(x^{(i)}_j, y^{(i)}_j)\}_{j=1}^m$ such that

$$x^{(i)}_j \sim \mathcal{N}(0, I_d) \quad \text{and} \quad y^{(i)}_j \mid x^{(i)}_j = \langle x^{(i)}_j, U^\star w^{\star(i)} + b^{\star(i)} \rangle + z^{(i)}_j \quad \text{for all } i \in [t],\ j \in [m], \tag{1}$$

where $z^{(i)}_j \sim_{iid} \mathcal{N}(0, \sigma^2)$. Below, we state our main result informally in the noiseless setting ($\sigma = 0$):

Theorem (Informal, Noiseless setting). Suppose we are given $m \cdot t$ samples from $t$ linear regression tasks of dimension $d$ as in equation 1. The goal is to learn a new regression task's parameters using $m$ samples, i.e., to learn the shared rank-$r$ parameter matrix $U^\star$ along with the task-specific $w^\star, b^\star$. Then, AMHT-LRS with a total of $m \cdot t = \Omega(k d r^4)$ samples and $m = \Omega(\max(k, r^3))$ samples per task can recover all the parameters exactly, in time nearly linear in $m \cdot t$.

That is, AMHT-LRS is able to learn the underlying model exactly as long as the total number of tasks is large enough, and the number of samples per task scales only linearly in the sparsity of $b^\star$ and cubically in the rank of $U^\star$.
Assuming $r, k \ll d$, the additional parameter overhead per task is small, allowing efficient deployment of such models in production. Finally, using the billboard model of $(\varepsilon, \delta)$ differential privacy (DP) (Jain et al., 2021; Chien et al., 2021; Kearns et al., 2014), we extend AMHT-LRS to preserve the privacy of each individual. Furthermore, with a sample complexity similar to that in the above theorem, albeit with a slightly worse dependence on $r$, we can guarantee strong generalization error up to a standard error term due to privacy.
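To make the data model in equation 1 concrete, the following is a minimal NumPy sketch that samples tasks whose regressors have the low rank plus sparse form. The function name `make_lrs_tasks`, its default sizes, and the choice of Gaussian task coefficients and Gaussian non-zero sparse entries are illustrative assumptions, not part of the paper's setup.

```python
import numpy as np

def make_lrs_tasks(t=50, m=30, d=40, r=2, k=3, sigma=0.0, seed=0):
    """Sample t regression tasks with theta_i = U w_i + b_i, b_i being k-sparse.

    All names and defaults are hypothetical; only the model form follows eq. (1).
    """
    rng = np.random.default_rng(seed)
    # Shared orthonormal representation U (d x r), obtained from a QR factorization.
    U, _ = np.linalg.qr(rng.standard_normal((d, r)))
    W = rng.standard_normal((t, r))          # row i holds the task coefficients w_i
    B = np.zeros((d, t))                     # column i holds the sparse vector b_i
    for i in range(t):
        support = rng.choice(d, size=k, replace=False)
        B[support, i] = rng.standard_normal(k)
    X = rng.standard_normal((t, m, d))       # covariates x_j^(i) ~ N(0, I_d)
    Theta = U @ W.T + B                      # d x t matrix of task regressors
    noise = sigma * rng.standard_normal((t, m))
    Y = np.einsum("tmd,dt->tm", X, Theta) + noise
    return X, Y, U, W, B
```

Each task stores only the `r` entries of `w_i` and the `k` non-zero entries of `b_i` on top of the shared `U`, which is the per-task overhead discussed above.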

Summary of our Contributions:

• We propose a theoretical framework for combining the meta-learning approaches of representation learning and neighborhood models. Our model non-trivially generalizes the guarantees of Thekumparampil et al. (2021); Tripuraneni et al. (2021); Boursier et al. (2022) to the setting where the parameter matrix admits a low rank plus sparse decomposition.
• We propose an efficient method, AMHT-LRS, for the above problem and provide rigorous total sample complexity and per-task sample complexity bounds that are nearly optimal (Theorem 1).
• We provide a DP variant of AMHT-LRS that guarantees user-level privacy. At a high level, we show that under $(\varepsilon, \delta)$-DP, one can obtain the same generalization error as the non-private version, with an additional error term due to the privacy budget (see Theorem 3).
• We demonstrate experiments on synthetic data and the MovieLens dataset using linear models (Sec. 3) and toy neural nets (Appendix A); apart from showing the advantage of our framework, they also show the limitations of using only representation learning or only a single model, among other baselines.

Technical Challenges (Linear Models): Denote $(W^\star)^T = [w^{\star(1)}, \ldots, w^{\star(t)}] \in \mathbb{R}^{r \times t}$ and $B^\star = [b^{\star(1)}, \ldots, b^{\star(t)}] \in \mathbb{R}^{d \times t}$; hence the matrix of optimal regressors is given by $U^\star (W^\star)^T + B^\star$. Several technically novel steps are required to recover the model parameters in (1), which combines both the representation and neighborhood models. In the general setting, even the task-specific $r$-dimensional regression coefficients $\{w^{\star(i)}\}_{i \in [t]}$ are unknown, and therefore we face the additional challenge of learning $\{w^{\star(i)}\}_{i \in [t]}$ jointly with the shared representation matrix $U^\star$ and the task-specific sparse parameter vectors $\{b^{\star(i)}\}_{i \in [t]}$. Note that in the AM framework proposed in Thekumparampil et al. (2021), the authors consider only the representation model and not the neighborhood model (which is a simpler case of the LRS setting). Similarly, in Netrapalli et al. (2014), the authors design an AM algorithm for reconstructing the low-rank and sparse components of a matrix when the matrix itself is provided as input. However, in our setting, we only observe linear measurements of the individual columns of the parameter matrix. Therefore, informally speaking, our analysis faces the key challenge of combining both sets of complementary techniques from Thekumparampil et al. (2021) and Netrapalli et al. (2014). This leads to the analysis of several crucial steps in each iteration: 1) We track the incoherence of several intermediate matrices corresponding to the latest estimates $W^{(\ell)}, U^{(\ell)}$ of $W^\star, U^\star$. 2) We also track the $L_{2,\infty}$ norm of the matrix $(I - U^\star (U^\star)^T) U^{(\ell)}$ to make progress on learning $B^\star$. The second step is the most technically involved component of our analysis.

Organization: In Sec. 2, we introduce the general low rank + sparse (LRS) framework. In Sec. 2.1, we provide theoretical guarantees for the canonical linear model in the LRS framework, and we provide a differentially private version in Sec. 2.2. We analyze a special setting in Appendix B as a warm-up (Rmk. 3), while detailed proofs are deferred to Appendices C and D. In Sec. 3 and Appendix A, we provide experimental results on synthetic and real datasets. In Appendix E, we discuss how to obtain a good initialization for our methods in the realizable setting via the Method of Moments.

2. LRS FRAMEWORK FOR META-LEARNING/PERSONALIZATION

Notations: $[m]$ denotes the set $\{1, 2, \ldots, m\}$. For a matrix $A$, $A_i$ denotes the $i$-th row of $A$. For a vector $x$, $x_i$ denotes the $i$-th element of $x$. We sometimes use $x_j$ to denote an indexed vector; in this case $x_{j,i}$ denotes the $i$-th element of $x_j$. $\|\cdot\|_2$ denotes the Euclidean norm of a vector and the operator norm of a matrix. $\|\cdot\|_\infty$ and $\|\cdot\|_0$ denote the $\ell_\infty$ and $\ell_0$ norms of a vector, respectively. $\|\cdot\|_{2,\infty}$ and $\|\cdot\|_F$ denote the $L_{2,\infty}$ and Frobenius norms of a matrix, respectively. For a sparse vector $v \in \mathbb{R}^d$, we define the support $\mathrm{supp}(v) \subseteq [d]$ to be the set of indices such that $v_i \neq 0$ for all $i \in \mathrm{supp}(v)$ and $v_i = 0$ otherwise. We use $I$ to denote the identity matrix. $O(\cdot)$ notation subsumes logarithmic factors. $X^{(i)} \in \mathbb{R}^{m \times d}$ denotes the matrix of covariates for the $i$-th task, such that $X^{(i)}_j = (x^{(i)}_j)^T$. Similarly, we write $y^{(i)}, z^{(i)} \in \mathbb{R}^m$ to denote the task-specific response vector and noise vector, respectively.

Let $x \in \mathbb{R}^d$ be the input point, and let $y = f(x; \Theta)$ be the predicted label using a DNN $f$ with parameters $\Theta$. Let there be $t$ tasks/domains. Then, the goal is to personalize $\Theta$ to $\Theta^{(i)}$ for each task $i \in [t]$ such that a) $\Theta^{(i)}$ does not over-fit despite the small number of data points labelled as task $i$, b) $\{\Theta^{(i)}, 1 \le i \le t\}$ can be stored and used for fast inference even for large $t$, say $t \ge 1$M, and c) the framework allows enough flexibility for different tasks/domains/users to contribute their data at different levels of privacy risk. Our method LRS attempts to address all three requirements using a simple low-rank + sparse approach: we model each $\Theta^{(i)}$ as $\Theta^{(i)} := U \cdot W^{(i)} + B^{(i)}$, where $U \cdot W^{(i)}$ denotes a linear operation on $W^{(i)}$ in an $r$-dimensional basis specified by $U$ (which can be very large, e.g., for standard BERT with $\sim$110M parameters), and $B^{(i)}$ is a $k$-sparse matrix. That is, we represent $\Theta^{(i)}$ as a combination of a small number of parameter matrices represented by $U$, along with a sparse set of weights that can be fine-tuned arbitrarily for a given task.

Note that, due to the low-dimensional and sparse representation, LRS should require a relatively small number of points per task. In the next section, we formally prove this claim for a simple linear setting with Gaussian data. Furthermore, for each task the additional number of parameters is relatively small, assuming $r$ and $k$ are small, which implies that the memory cost can be controlled. The latency cost also remains the same as that of the baseline model, assuming we can explicitly compute $\Theta^{(i)}$ on the fly. Finally, the model allows tasks/domains/users to contribute data to learn $U$ with differential privacy (see Section 2.2), but it also admits domains/tasks that are not inclined to share data and can simply fine-tune their model privately by learning $W^{(i)}$ and $B^{(i)}$ in isolation.

Comparison with LORA: The LORA framework was proposed in Hu et al. (2021) for meta-learning with a large number of tasks at scale. Although the authors demonstrate promising experimental results, LORA only allows a central model (in a low-dimensional manifold) and does not incorporate sparse fine-tuning. Hence, the low-rank fine-tuning proposed in LORA becomes ineffective when the output dimension is small (say 1). This limitation of LORA is demonstrated in detailed experiments on the MovieLens 1M dataset (a standard recommendation dataset) in Sec. 3. Finally, LORA does not have any theoretical guarantees even in simple settings.

Comparison with representation learning style models: A recent line of work (Thekumparampil et al., 2021; Du et al., 2020; Tripuraneni et al., 2021; Boursier et al., 2022; Jain et al., 2021) proposes a model similar to LRS but specific to representation learning. These papers only consider a low-rank representation of the task parameters, with no sparse fine-tuning. Moreover, the algorithms proposed in these papers have not been explored at scale and have mostly been applied to vanilla linear models (without considering privacy constraints or extensions to more complex models). Even from a theoretical point of view, these methods do not apply in our case due to the additional non-convex sparsity constraint.

Comparison with Prompt-based and Batch-norm Fine-tuning: Another popular approach for personalization is prompt-based or batch-norm based fine-tuning (Wang et al., 2022; Liu et al., 2021; Lester et al., 2021); this usually involves a task-based feature embedding concatenated with the covariate. Note that in a linear model, such an approach only leads to an additional scalar bias, which can be easily modeled in our framework; thus our framework is richer and more expressive with a smaller number of parameters. We compare against such techniques in our experiments and demonstrate their limitations on both real and synthetic datasets (see Sec. 3).

Private Meta-learning: Model personalization is a key application of meta-learning, where we wish to have a personalized model for each user, i.e., each user represents a task. Due to the sensitivity of user data, we want to preserve the privacy of each user, for which we use user-level $(\varepsilon, \delta)$-DP as the privacy notion (see Definition 1). In this setting, each user $i \in [t]$ holds a set of data samples $D^{(i)} = \{x^{(i)}_j, 1 \le j \le m\}$. Furthermore, users interact via a central algorithm that maintains the common representation matrix $U$, which is guaranteed to be DP w.r.t. all the data samples of any single user. The central algorithm publishes the current $U$ to all the users (i.e., on a billboard) and obtains further updates from the users. Prior works (Jain et al., 2021; Chien et al., 2021; Thakkar et al., 2019) have shown that such a billboard mechanism allows for significantly more accurate privacy-preserving methods while ensuring user-level privacy. In particular, it allows $U$ to be learned effectively, while each user keeps a part of the model that is personal to them, e.g., the $W^{(i)}, B^{(i)}$'s in our context. See (Jain et al., 2021, Section 3) for more details about the billboard model in the personalization setting. Such a model of private computation is typically called the billboard model of DP, which in turn is a subclass of joint DP (Kearns et al., 2014). In Definition 1, we define the notion of neighborhood w.r.t. the addition or removal of a single user (i.e., the addition or removal of all the data samples $D^{(i)}$ of any user $i \in [t]$). In the literature (Dwork & Roth, 2014), this definition is referred to as user-level DP.

2.1. LINEAR LRS: ALGORITHM AND ANALYSIS

In this section, we describe our LRS framework for the linear setting, provide an efficient algorithm for parameter estimation, and provide a rigorous analysis under the realizable setting with Gaussian data. We then extend our framework, algorithm and analysis to allow user-level differential privacy. Consider the linear LRS model introduced in Section 1, where we have $t$ $d$-dimensional linear regression tasks (indexed by $i \in [t]$), each provided with $m$ samples $\{(x^{(i)}_j, y^{(i)}_j)\}_{j=1}^m$ such that the $j$-th sample for the $i$-th task, $(x^{(i)}_j, y^{(i)}_j)$, is generated independently according to eq. 1. The problem thus reduces to designing statistically and computationally efficient algorithms to estimate the common representation learning parameter $U^\star$ as well as the task-specific parameters $\{w^{\star(i)}\}_{i \in [t]}$, $\{b^{\star(i)}\}_{i \in [t]}$. The ERM for this model, assuming squared loss, is given by:

$$\text{(LRS)} \quad \text{minimize } \mathcal{L}(U, W, B) = \sum_{i \in [t]} \sum_{j \in [m]} \frac{1}{2} \Big( y^{(i)}_j - \langle x^{(i)}_j, U w^{(i)} + b^{(i)} \rangle \Big)^2 \quad \text{s.t. } U^T U = I,\ \|b^{(i)}\|_0 \le k\ \forall i \in [t] \text{ and } \|B_i\|_0 \le \zeta\ \forall i \in [d], \tag{2}$$

where $U \in \mathbb{R}^{d \times r}$, $W = [w^{(1)}\ w^{(2)} \ldots w^{(t)}]^T \in \mathbb{R}^{t \times r}$ stores the task-specific coefficients, and $B = [b^{(1)}\ b^{(2)} \ldots b^{(t)}] \in \mathbb{R}^{d \times t}$ stores the task-specific sparse vectors for fine-tuning. Note that LRS is non-convex due to: a) the bilinearity of $U, W$, and b) the non-convexity of the $\ell_0$ norm constraint.

We propose AMHT-LRS, which handles the non-convexity in the objective and the constraint set by carefully combining alternating minimization over $U$, $w$ and $b$ with hard thresholding to ensure sparsity of $b$. Let $\mathrm{HT} : \mathbb{R}^d \times \mathbb{R} \to \mathbb{R}^d$ be a hard thresholding function that takes a vector $v \in \mathbb{R}^d$ and a parameter $\Delta$ as input and returns a vector $v' \in \mathbb{R}^d$ such that $v'_i = v_i$ if $|v_i| > \Delta$ and $v'_i = 0$ otherwise. Let $U^{+(\ell-1)}$, $\{w^{(i,\ell-1)}\}_{i \in [t]}$ and $\{b^{(i,\ell-1)}\}_{i \in [t]}$ be the latest iterates at the beginning of the $\ell$-th iteration.
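The squared-loss objective of (LRS) and the hard thresholding operator HT defined above can be sketched in NumPy as follows. The array layout (tasks stacked along the first axis of `X` and `Y`) and the function names are assumptions made for illustration only.

```python
import numpy as np

def hard_threshold(v, delta):
    """HT(v, delta): keep entries with |v_i| > delta and set the rest to zero."""
    return np.where(np.abs(v) > delta, v, 0.0)

def lrs_loss(X, Y, U, W, B):
    """Squared loss of (LRS): sum over i, j of 0.5 * (y_j^(i) - <x_j^(i), U w_i + b_i>)^2.

    Assumed shapes: X (t, m, d), Y (t, m), U (d, r), W (t, r), B (d, t).
    """
    Theta = U @ W.T + B                               # column i is U w_i + b_i
    residual = Y - np.einsum("tmd,dt->tm", X, Theta)  # per-sample prediction errors
    return 0.5 * np.sum(residual ** 2)
```

Note that `hard_threshold` uses a strict inequality, so an entry whose magnitude equals the threshold is set to zero, matching the definition of HT.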
First, for each task $i \in [t]$, given estimates $U^{+(\ell-1)}$, $w^{(i,\ell-1)}$, we can update $b^{(i,\ell)}$ by solving the following problem:

$$\operatorname*{argmin}_{b \in \mathbb{R}^d} \big\| X^{(i)} (U^{+(\ell-1)} w^{(i,\ell-1)} + b) - y^{(i)} \big\|_2^2 \quad \text{such that } \|b\|_0 \le k. \tag{3}$$

While the problem is non-convex, we can still apply a projected gradient descent algorithm, which reduces to iterative hard thresholding. In particular, we use Algorithm 2 for the $i$-th task, where in each iteration we run a gradient descent step on the parameter vector estimate $b$ (of $b^{\star(i)}$) and subsequently apply the $\mathrm{HT}(\cdot, \Delta)$ function, where $\Delta > 0$ is set appropriately. Next, given estimates $U^{+(\ell-1)}$, $b^{(i,\ell)}$, we can update $w^{(i,\ell)}$ by solving the following task-specific optimization problem for each $i \in [t]$:

$$\operatorname*{argmin}_{w \in \mathbb{R}^r} \big\| X^{(i)} (U^{+(\ell-1)} w + b^{(i,\ell)}) - y^{(i)} \big\|_2^2. \tag{4}$$

Subsequently, given the updated estimates of the task-specific parameters $\{w^{(i,\ell)}\}_{i \in [t]}$ and $\{b^{(i,\ell)}\}_{i \in [t]}$, we update $U^{+(\ell)}$ (the estimate of the shared representation matrix) using

$$\operatorname*{argmin}_{U \in \mathbb{R}^{d \times r}} \sum_{i \in [t]} \big\| X^{(i)} (U w^{(i,\ell)} + b^{(i,\ell)}) - y^{(i)} \big\|_2^2, \tag{5}$$

followed by a QR decomposition of the solution. Note that the above two problems ((4) and (5)) can be solved using standard least squares regression methods. Finally, we must ensure independence of the estimates (which are random variables themselves) from the data that is used in a particular update. We can ensure such independence by using a fresh batch of samples in every iteration.

Analysis: As in prior works, we are interested in the few-shot learning regime, where there are only a few samples per task. From an information-theoretic viewpoint, we expect the number of samples per task to scale linearly with the sparsity $k$ and the rank $r$, and logarithmically with the dimension $d$. On the other hand, $U^\star$ has $dr$ parameters, and therefore the total number of samples across all tasks is expected to scale linearly with $dr$, which implies that we want the number of tasks $t$ to scale linearly with the dimension $d$.
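The three alternating updates described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: it replaces the adaptive threshold of Algorithm 2 with a plain top-k projection, reuses the same samples in every iteration instead of fresh batches, and rotates each `w_i` by the QR factor `R` so that the product `U w_i` is unchanged by the orthonormalization. All function names are hypothetical.

```python
import numpy as np

def top_k(v, k):
    """Keep the k largest-magnitude entries of v (stand-in for HT with a tuned threshold)."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def update_b(X, y, v, b, k, n_steps=50):
    """Iterative hard thresholding for min_b ||X (v + b) - y||^2 s.t. ||b||_0 <= k."""
    m = X.shape[0]
    for _ in range(n_steps):
        grad = X.T @ (X @ (v + b) - y) / m
        b = top_k(b - grad, k)
    return b

def update_w(X, y, U, b):
    """Least squares for w with U and b held fixed."""
    w, *_ = np.linalg.lstsq(X @ U, y - X @ b, rcond=None)
    return w

def update_U(X, Y, W, B):
    """Least squares for vec(U) with all w_i, b_i held fixed, followed by QR."""
    t, m, d = X.shape
    r = W.shape[1]
    A = np.zeros((d * r, d * r))
    V = np.zeros((d, r))
    for i in range(t):
        A += np.kron(np.outer(W[i], W[i]), X[i].T @ X[i])
        V += np.outer(X[i].T @ (Y[i] - X[i] @ B[:, i]), W[i])
    vec_U = np.linalg.solve(A, V.flatten(order="F"))      # column-major vec
    return np.linalg.qr(vec_U.reshape((d, r), order="F"))  # returns (Q, R)

def amht_lrs(X, Y, U, W, B, k, n_iters=10):
    """Alternate the b, w and U updates starting from the iterates U, W, B."""
    t = X.shape[0]
    U, W, B = U.copy(), W.copy(), B.copy()
    for _ in range(n_iters):
        for i in range(t):
            B[:, i] = update_b(X[i], Y[i], U @ W[i], B[:, i], k)
            W[i] = update_w(X[i], Y[i], U, B[:, i])
        U, R = update_U(X, Y, W, B)
        W = W @ R.T   # keep U w_i unchanged after orthonormalization (sketch-only step)
    return U, W, B
```

The `update_U` step relies on the identity `X U w = (w^T kron X) vec(U)` with column-major vectorization, which is why `flatten` and `reshape` both use `order="F"`.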
Algorithm 1 AMHT-LRS
Require: Data $\{(x^{(i)}_j \in \mathbb{R}^d, y^{(i)}_j \in \mathbb{R})\}_{j=1}^m$ for all $i \in [t]$, column sparsity $k$ of $B$, $\|\Delta(U^{+(0)}, U^\star)\|_F \le \mathsf{B}$, $\max_i \|b^{(i,0)} - b^{\star(i)}\|_\infty \le \gamma^{(0)}$, parameter $\epsilon > 0$.
1: for $\ell = 1, 2, \ldots$ do
2: Set $T^{(\ell)} = \Omega\big(\log(\gamma^{(\ell-1)} / \epsilon)\big)$
3: for $i = 1, 2, \ldots, t$ do
4: $b^{(i,\ell)} \leftarrow$ OptimizeSparseVector$\big((X^{(i)}, y^{(i)}),\ v = U^{+(\ell-1)} w^{(i,\ell-1)},\ \alpha = O(c_4^{-1} \mathsf{B} / \sqrt{k}),\ \beta = O(c_5^{-1} \mathsf{B}),\ \gamma = \gamma^{(\ell-1)},\ T = T^{(\ell)}\big)$ for suitable constants $c_4, c_5 > 0$.
5: $w^{(i,\ell)} = \big((X^{(i)} U^{+(\ell-1)})^T (X^{(i)} U^{+(\ell-1)})\big)^{-1} (X^{(i)} U^{+(\ell-1)})^T (y^{(i)} - X^{(i)} b^{(i,\ell)})$
6: end for
7: Set $A := \sum_{i \in [t]} \big( w^{(i,\ell)} (w^{(i,\ell)})^T \otimes \sum_{j=1}^m x^{(i)}_j (x^{(i)}_j)^T \big)$ and $V := \sum_{i \in [t]} (X^{(i)})^T \big( y^{(i)} - X^{(i)} b^{(i,\ell)} \big) (w^{(i,\ell)})^T$. Compute $U^{(\ell)} = \mathrm{vec}^{-1}_{d \times r}(A^{-1} \mathrm{vec}(V))$ and $U^{+(\ell)} \leftarrow \mathrm{QR}(U^{(\ell)})$
8: $\gamma^{(\ell)} \leftarrow (c_3)^{\ell-1} \mathsf{B}$ for a suitable constant $c_3 < 1$.
9: end for
10: Return $w^{(\ell)}$, $U^{+(\ell)}$ and $\{b^{(i,\ell)}\}_{i \in [t]}$.

Note that if the sparse vectors $\{b^{\star(i)}\}_{i \in [t]}$ have the same support (or a high overlap between their supports), then the model parameters might not be uniquely identifiable. This is because, in that case, the matrix $B^\star$ can be represented as a low-rank matrix. To establish identifiability of $U^\star$ and the sparse vectors $b^\star$, we make the following assumption:

Assumption 1 (A1). Consider the matrix $B^\star \in \mathbb{R}^{d \times t}$ whose $i$-th column is the vector $b^{\star(i)}$. Then each row of $B^\star$ is $\zeta$-sparse, i.e., $\|B^\star_i\|_0 \le \zeta$ for all $i \in [d]$, and each column is $k$-sparse.

Note that the orthonormal matrix $U^\star$ cannot have extremely sparse columns; otherwise it would be information-theoretically impossible to separate the columns of $U^\star$ from $b^\star$. Moreover, similar to Tripuraneni et al. (2021), we need to ensure that each task contributes to learning the underlying representation $U^\star$. These properties can be ensured by the standard incoherence assumptions (Tripuraneni et al., 2021; Collins et al., 2021; Netrapalli et al., 2014), and therefore we make the following assumption:

Assumption 2 (A2). Let $\lambda^\star_1$ and $\lambda^\star_r$ be the largest and smallest eigenvalues of the task diversity matrix $(r/t)(W^\star)^T W^\star \in \mathbb{R}^{r \times r}$.
We assume that W ∈ R t×r and the representation matrix U ∈ R d×r ) 2 . Suppose Algorithm 1 is initialized with U +(0) such that (I -U (U ) T )U +(0) F = O λ r λ 1 and U +(0) 2,∞ = O( µ r/d), and is run for L = O(1) iterations. Then, with high probability, the outputs U +(L) , {b (i,L) } i∈[t] satisfy: (I -U (U ) T )U +(L) F = O(1)σS µ λ r , b (i,L) -b (i) ∞ ≤ O(1)σS √ k , i ∈ [t], where S = µ r 3 d mt + r 3 mλ r + k m provided the total number of samples satisfies: m = Ω k + r 2 µ λ 1 λ r 2 + σ 2 r 3 λ r , mt = Ω r 3 dµ r(µ ) 4 (λ r ) 2 k + µ λ 1 λ r 2 + σ 2 1 + 1 λ r For a new task, modified AMHT-LRS (Alg. 6 in Appendix D) has the following generalization bound: L(U, w, b) -L(U , w , b ) = O σ 2 S 2 + k + r m . Note that the per-task sample complexity of our method roughly scales as m = (r 3 + k), which is information theoretically optimal in k and is roughly r 2 factor larger. Total sample complexity scales as mt = kdr 4 , which is roughly kr 3 multiplicative factor larger than the information theoretic bound. Note that typically r and k are considered to be small, so the additional factors are small, but we leave further investigation into obtaining tighter bounds for future work. Finally, the generalization error scales as σ 2 (r + k)/m which is nearly optimal. Note that, ignoring meta-learning, and directly optimizing the single-task error would lead to significantly larger error of σ 2 d/m. Remark 1 (Runtime and Memory). The run-time of Algorithm 1 is dominated by the update for U +( ) . For each iteration , Step 8 has a time complexity of O((dr) 3 + (mt)(dr) 2 ); however in practice, a gradient descent step for the update of U ( ) can bring down the time complexity to O(mtdr). Moreover, the memory usage of Algorithm 1 is O((dr) 2 + tr 2 ). Remark 2 (Initialization). Note that Algorithm 1 has local convergence properties as described in Theorem 1. In practice, typically we use random initialization for U +(0) . 
However, similar to the representation learning framework in Tripuraneni et al. (2021) , we can use the Method of Moments to obtain a good initialization. See Appendix E for more details. Remark 3 (Special Settings). In the setting where for each task, we just need to learn a single central model for all tasks and sparse fine-tune the weights for each task i.e. w (i) = 1 for all i ∈ [t] is fixed, AMHT-LRS obtains global convergence guarantees (Theorem 4 in Appendix B). Moreover, if the central model U is also frozen, then the task-based sparse fine-tuning reduces to standard compressed sensing. In the realizable setting, our framework recovers the standard generalization error of σ 2 k/m in compressed sensing (Jain & Kar, 2017) 2021) cannot capture the sparse fine-tuning in each task (with potentially arbitrary magnitude). However, since our framework combines neighborhood and representation models, our sample complexity guarantees are sub-optimal by only a factor of r 2 (generalization error is optimal) when restricted to the special case of representation model (Thekumparampil et al., 2021) . 
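Remark 1 notes that, in practice, a gradient descent step can replace the exact least squares update of the shared matrix to reduce the cost to O(mtdr). A minimal sketch of such a step is given below; the step size `lr`, the averaging by `t * m`, and the function name are assumptions of this sketch and are not specified in the text.

```python
import numpy as np

def grad_step_U(X, Y, U, W, B, lr):
    """One gradient step on U for the squared loss, followed by QR re-orthonormalization.

    Assumed shapes: X (t, m, d), Y (t, m), U (d, r), W (t, r), B (d, t).
    """
    t, m, d = X.shape
    G = np.zeros_like(U)
    for i in range(t):
        # Residual of task i under the current parameters, shape (m,).
        resid = X[i] @ (U @ W[i] + B[:, i]) - Y[i]
        # Gradient contribution X_i^T resid w_i^T; costs O(m d + d r) per task.
        G += np.outer(X[i].T @ resid, W[i])
    Q, _ = np.linalg.qr(U - lr * G / (t * m))
    return Q
```

Unlike the exact update, this step never forms the (dr) x (dr) matrix, so its memory footprint is that of the data plus a single d x r gradient.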

2.2. PRIVATE LINEAR

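Algorithm 2 OptimizeSparseVector
Require: the arguments passed in Line 4 of Algorithm 1, namely data $(X^{(i)}, y^{(i)})$, vector $v$, parameters $\alpha, \beta, \gamma$ and iteration count $T$; constants $c > 0$, $0 < c_1 < 1/2$.
1: for $j = 1, 2, \ldots, T$ do
2: $c \leftarrow b - \frac{1}{m} (X^{(i)})^T (X^{(i)} b + X^{(i)} v - y^{(i)})$
3: $\Delta \leftarrow \alpha + c_1 \gamma + \frac{\beta}{\sqrt{k}}$ and $b \leftarrow \mathrm{HT}(c, \Delta)$
4: $\gamma \leftarrow 2 c_1 \gamma + 2\big(\alpha + \frac{c_1}{\sqrt{k}} \beta\big)$
5: end for
6: Return vector $b$.

In this section, we provide a user-level DP variant of Algorithm 1 in the billboard model. We obtain DP for the computation of each $U^{(\ell)}$ by perturbing the covariance matrix $A$ and the linear term $V$ in the algorithm with Gaussian noise, to ensure that the contribution of any single user is protected. We start by introducing the function $\mathrm{clip} : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ that takes as input a scalar $x$ and a parameter $\rho$, and returns $\mathrm{clip}(x, \rho) = x \cdot \min\{1, \rho / |x|\}$. We extend the definition of clip to vectors and matrices by using $\mathrm{clip}(v, \rho) = v \cdot \min\{1, \rho / \|v\|_2\}$ for a vector $v$ and $\mathrm{clip}(A, \rho) = A \cdot \min\{1, \rho / \|A\|_F\}$ for a matrix $A$. In order to ensure that Algorithm 1 is private, for input parameters $A_1, A_2, A_3, A_w$, we first clip the covariates and responses: for all $i \in [t]$, $j \in [m]$, we will have $\tilde{x}^{(i)}_j \leftarrow \mathrm{clip}(x^{(i)}_j, A_1)$, $\tilde{y}^{(i)}_j \leftarrow \mathrm{clip}(y^{(i)}_j, A_2)$, $(\tilde{x}^{(i)}_j)^T b^{(i,\ell)} \leftarrow \mathrm{clip}\big((x^{(i)}_j)^T b^{(i,\ell)}, A_3\big)$ and $\tilde{w}^{(i,\ell)} \leftarrow \mathrm{clip}(w^{(i,\ell)}, A_w)$. Now, we can modify Line 7 in Algorithm 1 as follows (let $L$ be the number of iterations of Alg. 1):

$$A := \frac{1}{mt} \Big( \sum_{i \in [t]} \tilde{w}^{(i,\ell)} (\tilde{w}^{(i,\ell)})^T \otimes \sum_{j=1}^m \tilde{x}^{(i)}_j (\tilde{x}^{(i)}_j)^T + N^{(1)} \Big) \tag{7}$$

$$V := \frac{1}{mt} \Big( \sum_{i \in [t]} \sum_{j \in [m]} \tilde{x}^{(i)}_j \big( \tilde{y}^{(i)}_j - (\tilde{x}^{(i)}_j)^T b^{(i,\ell)} \big) (\tilde{w}^{(i,\ell)})^T + N^{(2)} \Big) \tag{8}$$

where, for some $\sigma_{DP} > 0$, each entry of $N^{(1)}$ is independently generated from $\mathcal{N}(0, m^2 \cdot A_1^4 \cdot A_w^4 \cdot L \cdot \sigma_{DP}^2)$; similarly, each entry of $N^{(2)}$ is independently generated from $\mathcal{N}(0, m^2 \cdot A_1^2 (A_2 + A_3)^2 A_w^2 \cdot L \cdot \sigma_{DP}^2)$. We are now ready to state our main result:

Theorem 2. Algorithm 1 (with the modifications mentioned in equation 7 and equation 8) satisfies $\sigma_{DP}^{-2}$-zCDP and correspondingly satisfies $(\varepsilon, \delta)$-differential privacy in the billboard model, when we set the noise multiplier $\sigma_{DP} \ge 2 \varepsilon^{-1} \sqrt{\log(1/\delta) + \varepsilon}$.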
Furthermore, if ε ≤ log(1/δ), then σ DP ≥ ε -1 8 log(1/δ) suffices to ensure (ε, δ)-differential privacy. Next, we characterize the generalization properties of modified AMHT-LRS: Theorem 3. Consider the LRS problem equation 2 with all parameters m, t, ζ obeying the bounds stated in Theorem 1 and furthermore, t = Ω( (rd) 3/2 √ log(1/δ)+ µ ). Suppose we run AMHT-LRS (Step 7 in Alg. 1 replaced with 7 and 8) for L = O(1) iterations with A 1 = O( √ d), A 2 = O( µ λ r + (max i b (i) 2 )), A 3 = O λ r µ λ 1 , A w = O( µ λ r ). Then, with high probability, generalization error for a new task satisfies: L(U, w, b) -L(U , w , b ) = O σ 2 S 2 + dr 2 (log(1/δ) + )(λ r µ ) 2 2 t 2 • (κ 2 + r 2 d 2 ) where S = µ r 3 d mt + r 3 mλ r + k m , η = O t -1 µ r 2 d 3/2 1 + λ r λ 1 + max i∈[t] b (i) 2 √ µ λ r σ DP and κ = 1 + λ r λ 1 + max i∈[t] b (i) 2 √ µ λ r . Note that the modified AMHT-LRS ensures ( , δ)-differential privacy without any assumptions. However Thm. 3 still has good generalization properties; moreover, the per-task sample complexity guarantee m still only needs to scale polylogarithmically with the dimension d. In other words, our algorithm can ensure good generalization along with privacy in data-starved settings as long as the number of tasks is large -scales as ∼ d 3/2 / . Similarly, generalization error for a new task has two terms: the first has a standard dependence on noise σ 2 and the second has a scaling of d 3 ( t) -2 which is standard in private linear regression and private meta-learning (Smith et al., 2017; Jain et al., 2021) . Detailed proofs of our main results namely Theorems 1,3 are delegated to Appendix C, D.
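The clipping and Gaussian perturbation used in the private update of the shared matrix can be sketched as follows. This is an illustrative sketch under two assumptions that the text does not settle: the noise is added to the un-normalized sums before dividing by `m * t`, and the clipped inner product is formed from the unclipped covariates. The function names are hypothetical, and the sketch makes no privacy claim by itself.

```python
import numpy as np

def clip_norm(v, rho):
    """clip(v, rho) = v * min(1, rho / ||v||), with the Euclidean/Frobenius norm."""
    norm = np.sqrt(np.sum(np.square(v)))
    return v if norm <= rho else v * (rho / norm)

def private_statistics(X, Y, W, B, A1, A2, A3, Aw, L, sigma_dp, rng):
    """Clipped and noised versions of A and V for one update of U.

    Assumed shapes: X (t, m, d), Y (t, m), W (t, r), B (d, t).
    """
    t, m, d = X.shape
    r = W.shape[1]
    A = np.zeros((d * r, d * r))
    V = np.zeros((d, r))
    for i in range(t):
        w = clip_norm(W[i], Aw)
        Xc = np.stack([clip_norm(x, A1) for x in X[i]])   # clip each covariate vector
        yc = np.clip(Y[i], -A2, A2)                       # scalar clip of each response
        # Assumption: the inner product uses the unclipped covariates before clipping.
        xb = np.clip(X[i] @ B[:, i], -A3, A3)
        A += np.kron(np.outer(w, w), Xc.T @ Xc)
        V += np.outer(Xc.T @ (yc - xb), w)
    # Per-entry standard deviations matching the stated noise variances.
    std_A = m * A1**2 * Aw**2 * np.sqrt(L) * sigma_dp
    std_V = m * A1 * (A2 + A3) * Aw * np.sqrt(L) * sigma_dp
    # Assumption: noise is added to the un-normalized sums, then everything is scaled.
    A = (A + std_A * rng.standard_normal(A.shape)) / (m * t)
    V = (V + std_V * rng.standard_normal(V.shape)) / (m * t)
    return A, V
```

With `sigma_dp = 0` and clipping bounds large enough to be inactive, solving `A vec(U) = vec(V)` reduces to the non-private least squares update of Algorithm 1, since the common factor `1 / (m * t)` cancels.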

3. EMPIRICAL RESULTS

In this section, we conduct an empirical study of AMHT-LRS with the following two goals: a) to demonstrate that personalization with AMHT-LRS improves accuracy for tasks with a small number of points, and b) to show that, for a fixed budget of parameters, AMHT-LRS is significantly more accurate than existing baselines. For simplicity, we fix the model class to be linear and consider the following baselines: 1) Single Model ($u_{central}$): learns a single model for all tasks, 2) Full Fine-tuning ($u_{indv}$): a separate model for each task, i.e., standard fine-tuning, 3) ...

a. Synthetic Data: ... $\{(x^{(i)}_j, y^{(i)}_j)\}_{j \in [m]}$ where $x^{(i)}_j \sim \mathcal{N}(0, I_{d \times d})$, $y^{(i)}_j = \langle x^{(i)}_j, u^\star w^{\star(i)} + b^{\star(i)} \rangle$. We select $d = 150$ and set $k$ and $\zeta$, the column and row sparsity levels of $\{b^{\star(i)}\}_{i \in [t]}$, to 10 and 5, respectively. We sample $u^\star$ uniformly from the unit sphere; the non-zero elements of $\{b^{\star(i)}\}_{i \in [t]}$ and the $w^{\star(i)}$ are sampled i.i.d. from $\mathcal{N}(0, 1)$, with the indices of the zeros selected randomly.

b. MovieLens Data: The MovieLens 1M dataset comprises 1M ratings by 6K users of 4K movies. Each user in the MovieLens dataset is associated with demographic data, namely gender, age group, and occupation. We partition the users into 241 disjoint clusters, where each cluster represents a unique combination of the demographic data. Each user group thus represents a "task" in the language of our paper. We partition the data into training and validation sets in the following way: for each task, we randomly choose 20% of the movies rated by at least one user from that task, and put all ratings made by users from that task for the chosen movies into the validation set. The remaining ratings belong to the training set. Based on the ratings in the training set only, we fit a matrix of rank 50 to the ratings matrix and obtain a 50-dimensional embedding of each movie. This ensures that there is no data leakage while creating the embeddings.
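The per-task validation split described above can be sketched as follows. The input format (parallel arrays of task identifiers and movie identifiers, one entry per rating) and the function name are assumptions for illustration; the sketch does not load the actual MovieLens files.

```python
import numpy as np

def split_by_task_and_movie(task_ids, movie_ids, val_fraction=0.2, seed=0):
    """Return a boolean mask that marks the validation ratings.

    For each task, a random val_fraction of the movies rated within that task is
    held out, and every rating of those movies from that task goes to validation.
    """
    rng = np.random.default_rng(seed)
    task_ids = np.asarray(task_ids)
    movie_ids = np.asarray(movie_ids)
    is_val = np.zeros(task_ids.shape[0], dtype=bool)
    for task in np.unique(task_ids):
        in_task = task_ids == task
        movies = np.unique(movie_ids[in_task])            # movies rated in this task
        n_val = int(round(val_fraction * movies.size))
        held_out = rng.choice(movies, size=n_val, replace=False)
        is_val |= in_task & np.isin(movie_ids, held_out)
    return is_val
```

Because a movie is held out per task rather than globally, the same movie can appear in the training set of one task and in the validation set of another; the embeddings are then fit on the ratings where the mask is `False`.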
For each task, the samples consist of (movie embedding, average rating) tuples; the response is the average rating of the movie given by users in that task. The number of samples per task varies from 22 to 3070, so many clusters are clearly data-starved. We use the training data to learn the different models mentioned earlier (with some hyper-parameter tuning) and use them to predict the ratings in the validation data.

Empirical Observations on Synthetic Data: Figure 2 shows that a single model can perform poorly, and that a fully fine-tuned model per task can also be highly inaccurate, since the scarcity of data per task can lead to over-fitting. Low-rank representation learning and prompt tuning based techniques also do not perform well, due to a lack of modeling power. In contrast, our method exactly recovers the underlying parameters, as predicted by Theorem 1, and provides 5 orders of magnitude better RMSE.

Empirical Observations on MovieLens Data: The overall average validation RMSE for AMHT-LRS and the baselines is shown in Fig. 1a against the percentage of fine-tunable parameters used by the model. With the single model as reference, in the linear rank-1 case, the representation learning and prompt learning based baselines have 1 and 50 additional parameters per task, respectively; they are unable to personalize well. In contrast, with only 10% (= 5) additional parameters per task, AMHT-LRS has a smaller RMSE than the fully fine-tuned model, which requires 241x more parameters. However, for data-starved clusters/tasks (samples < 100), we observe that the fully fine-tuned approach starts to overfit. In contrast, our method outperforms the other baselines for both data-starved and data-surplus tasks.
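The split into data-starved and data-surplus tasks used above can be computed with a small helper such as the one below. This is a hedged sketch: the name `rmse_by_regime` is hypothetical, and pooling squared errors over all validation points in a group (rather than averaging per-task RMSEs) is an assumption, since the text does not specify the aggregation. The threshold of 100 training samples follows the text.

```python
import numpy as np

def rmse_by_regime(y_true, y_pred, n_train, threshold=100):
    """Validation RMSE overall and split by training-set size per task.

    y_true, y_pred: sequences of per-task 1-D arrays of validation targets
    and predictions. n_train: number of training samples of each task.
    Tasks with n_train < threshold are counted as data-starved.
    """
    def pooled_rmse(task_ids):
        if not task_ids:
            return float("nan")
        err = np.concatenate(
            [np.asarray(y_true[i], dtype=float) - np.asarray(y_pred[i], dtype=float)
             for i in task_ids]
        )
        return float(np.sqrt(np.mean(err ** 2)))

    all_ids = list(range(len(n_train)))
    starved = [i for i in all_ids if n_train[i] < threshold]
    surplus = [i for i in all_ids if n_train[i] >= threshold]
    return {
        "all": pooled_rmse(all_ids),
        "data_starved": pooled_rmse(starved),
        "data_surplus": pooled_rmse(surplus),
    }
```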

4. CONCLUSIONS

We presented a powerful theoretical framework to study meta-learning and to develop novel algorithms. In particular, our framework combines representation learning and neighborhood fine-tuning based approaches for meta-learning. We proposed the AMHT-LRS method, which combines alternating minimization, popular in representation learning, with hard thresholding based methods. We rigorously proved that AMHT-LRS is statistically and computationally efficient, and is able to generalize to new tasks with only O(r + k) samples, where r is the representation learning dimension and k is the number of fine-tuning weights. Finally, we extended our result to ensure that the privacy of each task is preserved despite sharing information across tasks. Extending our framework to non-realizable and adversarial settings is a critical future direction.



Hu et al. (2021): LORA (Low Rank Adaptation of Large Language Models) was proposed by Hu et al. (

Definition 1 (Differential Privacy; Dwork et al. (2006b;a); Bun & Steinke (2016)). A randomized algorithm $\mathcal{A}$ is (ε, δ)-differentially private if for any pair of data sets $D$ and $D'$ that differ in one user (i.e., $|D \,\triangle\, D'| = 1$), and for all $S$ in the output range of $\mathcal{A}$, we have $\Pr[\mathcal{A}(D) \in S] \le e^{\epsilon} \cdot \Pr[\mathcal{A}(D') \in S] + \delta$, where the probability is over the randomness of $\mathcal{A}$. Similarly, an algorithm $\mathcal{A}$ is ρ-zero Concentrated DP (zCDP) if $D_{\alpha}(\mathcal{A}(D) \,\|\, \mathcal{A}(D')) \le \alpha\rho$, where $D_{\alpha}$ is the Rényi divergence of order α.
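The two notions in Definition 1 are linked by two standard facts from Bun & Steinke (2016): the Gaussian mechanism with $\ell_2$ sensitivity $\Delta$ and noise $\mathcal{N}(0, \sigma^2 I)$ is $\frac{\Delta^2}{2\sigma^2}$-zCDP, and ρ-zCDP implies $(\rho + 2\sqrt{\rho\log(1/\delta)}, \delta)$-DP. The short calculation below shows why a noise level of the form $\epsilon^{-1}\sqrt{8\log(1/\delta)}$ is enough when ε ≤ log(1/δ). It assumes unit sensitivity ($\Delta = 1$) and a single release, so it is only an illustration of the conversion; the paper's proof may track sensitivities and constants differently.

```latex
\begin{align*}
% Assumption: \Delta = 1 and \sigma_{DP} = \epsilon^{-1}\sqrt{8\log(1/\delta)}.
\rho &= \frac{\Delta^2}{2\sigma_{DP}^2} = \frac{\epsilon^2}{16\log(1/\delta)}, \\
\rho + 2\sqrt{\rho\,\log(1/\delta)}
  &= \frac{\epsilon^2}{16\log(1/\delta)} + \frac{\epsilon}{2}
  \;\le\; \frac{\epsilon}{16} + \frac{\epsilon}{2}
  \;<\; \epsilon
  \qquad \text{whenever } \epsilon \le \log(1/\delta).
\end{align*}
```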

are µ -incoherent i.e. ||W || 2,∞ ≤ µ λ r and ||U || 2,∞ ≤ µ r d . Theorem 1. Consider the LRS problem equation 2 with t linear regression tasks and samples obtained by equation 1. Let model parameters satisfy assumptions A1, A2. Let the row sparsity of B satisfy ζ = O t(r 2 µ ) -1 λ r λ 1 , and let k = O d • ( λ r λ 1

[Chapter 7]. Remark 4 (Sample complexity comparison). Note that the theoretical guarantees in representation models studied in Tripuraneni et al. (2021); Thekumparampil et al. (2021); Collins et al. (

LRS: PRIVACY PRESERVING META-LEARNING Algorithm 2 OPTIMIZE SPARSE VECTOR Require: Data (X, y) ∈ R m×d × R m where we minimize ||y -X(v + b )|| 2 such that ||b || 0 ≤ k. Estimate v (of v ) and initialization b (of b ). Iterations T, parameters α, β, γ > 0 and suitable constants
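The objective named in the header of Algorithm 2, minimizing $\|y - X(v + b)\|_2$ subject to $\|b\|_0 \le k$ for a fixed estimate $v$, can be illustrated with plain iterative hard thresholding. The steps of Algorithm 2 and the roles of α, β, γ are not shown in this excerpt, so the sketch below is a standard stand-in rather than the authors' procedure; the function names, the zero initialization, the step size $1/\|X\|_2^2$, and the iteration count are assumptions.

```python
import numpy as np

def hard_threshold(v, k):
    """Keep the k largest-magnitude entries of v and set the rest to zero."""
    out = np.zeros_like(v)
    k = min(k, v.size)
    if k > 0:
        idx = np.argpartition(np.abs(v), -k)[-k:]
        out[idx] = v[idx]
    return out

def iht_sparse_offset(X, y, v, k, n_iters=100, step=None):
    """Estimate a k-sparse b so that y is approximately X @ (v + b).

    Runs gradient descent on 0.5 * ||r - X b||^2 with r = y - X v,
    projecting onto k-sparse vectors after every step.
    """
    if step is None:
        # Inverse of the squared spectral norm of X (a conservative choice).
        step = 1.0 / np.linalg.norm(X, 2) ** 2
    r = y - X @ v
    b = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (X @ b - r)
        b = hard_threshold(b - step * grad, k)
    return b
```

With well-conditioned Gaussian data and no noise, this recovers the sparse offset to high precision; in harder regimes (few samples relative to the dimension) plain hard thresholding with this step size can stall, which is one reason refined variants exist.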

[Figure 1 panels: (b) RMSE for data-starved tasks; (c) RMSE for data-surplus tasks]

Figure 1: RMSE of AMHT-LRS on MovieLens data decreases as the number of fine-tunable parameters increases. Note that AMHT-LRS outperforms the other baselines for both data-starved and data-surplus tasks.

Figure 2: RMSE of AMHT-LRS on synthetic data decreases as the number of fine-tunable parameters increases.

Also, the above approaches are not restricted to linear models and can be extended to complex model classes such as neural networks (see Appendix A for an extension to 3-layer neural network architectures). We conduct experiments on two datasets: a. Synthetic dataset: here, for each task i ∈ [t], we generate m = 100 samples {(x

