PRIVATE AND EFFICIENT META-LEARNING WITH LOW RANK AND SPARSE DECOMPOSITION

Abstract

Meta-learning is critical for a variety of practical ML systems, such as personalized recommendation systems, that must generalize to new tasks despite having only a small number of task-specific training points. Existing meta-learning techniques follow two complementary approaches: learning a low-dimensional representation of points shared across all tasks, or task-specific fine-tuning of a global model trained on all the tasks. In this work, we propose a novel meta-learning framework that combines both techniques to handle a large number of data-starved tasks. Our framework models network weights as a sum of low-rank and sparse matrices. This allows us to capture shared information from multiple tasks in the low-rank part while still allowing task-specific personalization via the sparse part. We instantiate and study the framework in the linear setting, where the problem reduces to estimating the sum of a rank-r matrix and a k-column-sparse matrix from a small number of linear measurements. We propose an alternating minimization method with hard thresholding, AMHT-LRS, to learn the low-rank and sparse parts effectively and efficiently. For the realizable Gaussian data setting, we show that AMHT-LRS indeed solves the problem efficiently with nearly optimal sample complexity. We further extend AMHT-LRS to preserve the privacy of each individual user in the dataset, while still ensuring strong generalization with a nearly optimal number of samples. Finally, on multiple datasets, we demonstrate that the framework allows personalized models to obtain superior performance in the data-scarce regime.

1. INTRODUCTION

Typical real-world settings, such as multi-user or enterprise personalization, have a long tail of tasks with small amounts of training data. Meta-learning addresses this problem by learning a "learner" that extracts key information/representations from a large number of training tasks and can be applied to new tasks despite a limited number of task-specific training points. Most existing meta-learning approaches fall into two categories: 1) Neighborhood models: these methods learn a global model that is then "fine-tuned" to specific tasks (Guo et al., 2020; Howard & Ruder, 2018; Zaken et al., 2021); 2) Representation learning: these methods learn a low-dimensional representation of points that can be used to train task-specific linear learners (Javed & White, 2019; Raghu et al., 2019; Lee et al., 2019; Bertinetto et al., 2018; Hu et al., 2021). In particular, task-specific fine-tuning has demonstrated exceptional results across many natural language tasks (Devlin et al., 2018; Liu et al., 2019; Yang et al., 2019; Lan et al., 2019). However, such fine-tuned models update all parameters, so each fine-tuned model's parameter footprint equals that of the original model. This implies that fine-tuning large models, such as a standard BERT model (Devlin et al., 2018) with about 110M parameters, for thousands or millions of tasks would be quite challenging even from a storage point of view. One potential approach to handle the large number of parameters is to fine-tune only the last layer, but empirical findings suggest that such solutions can be significantly less accurate than fine-tuning the entire model (Chen et al., 2020a; Salman et al., 2020). Moreover, representation-learning-based approaches impose strict restrictions, such as requiring each task's parameters to lie in a low-dimensional subspace, which tend to hurt performance in general (Sec. 3). In this work, we propose and study the LRS framework, which combines both of the above-mentioned complementary approaches.
That is, LRS restricts the model parameters Θ^(i) for the i-th task to Θ^(i) := U · W^(i) + B^(i), where the first term denotes applying a low-dimensional linear operator on
W^(i), while B^(i) is restricted to be sparse. That is, the first term represents the parameters of each task in a low-dimensional subspace, while the second term allows task-specific fine-tuning, but only of a few parameters. Note that methods that allow fine-tuning only of batch-norm statistics, or only of the final few layers, can also be viewed as "sparse" fine-tuning, but with a fixed set of parameters. In contrast, we allow the tunable parameters to be selected from any part of the network. The framework enables collaboration among different tasks so as to learn an accurate model despite the lack of per-task data. Similarly, the presence of the sparse part allows task-specific personalization of an arbitrary but small set of weights. Finally, the framework allows tasks/users to either contribute to a central model using privacy-preserving billboard models (see Section 2.2), or to learn their parameters only locally without contributing to the central model.

While the framework applies more generally, to keep the exposition easy to follow, we instantiate it for linear models. In particular, suppose the goal is to learn a linear model for the i-th task parameterized by θ^(i), i.e., the expected prediction for a data point x ∈ R^d is ⟨x, θ^(i)⟩. We model θ^(i) := U w^(i) + b^(i) for all tasks 1 ≤ i ≤ t, where U ∈ R^{d×r} (with r ≪ d) is a tall orthonormal matrix that captures the task representation and is shared across all the tasks, w^(i) ∈ R^r holds the task-specific low-dimensional coefficients, and b^(i) ∈ R^d is the sparse task-specific part.

Note that estimating the low-rank and sparse parts is reminiscent of robust PCA (Netrapalli et al., 2014), which is widely studied in the structured matrix estimation literature. However, in that setting the matrices are fully observed and the goal is to separate the low-rank part from the sparse part.
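To make the linear instantiation concrete, the following is a minimal sketch of the parameterization θ^(i) = U w^(i) + b^(i): a shared orthonormal U, low-dimensional per-task coefficients, and a hard-thresholded k-sparse correction. The dimensions and the `task_parameters` helper are illustrative, not the paper's notation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, t, k = 50, 3, 8, 4  # ambient dim, rank, #tasks, sparsity (illustrative)

# Shared representation: a tall orthonormal matrix U in R^{d x r}.
U, _ = np.linalg.qr(rng.standard_normal((d, r)))

def task_parameters(w, b_dense, k):
    """theta^(i) = U w^(i) + b^(i), with b^(i) kept k-sparse by hard thresholding."""
    b = np.zeros_like(b_dense)
    top = np.argsort(np.abs(b_dense))[-k:]  # indices of the k largest-magnitude entries
    b[top] = b_dense[top]
    return U @ w + b, b

# One task: low-rank coefficients w^(i) plus a sparse task-specific correction b^(i).
w_i = rng.standard_normal(r)
theta_i, b_i = task_parameters(w_i, rng.standard_normal(d), k)

x = rng.standard_normal(d)
prediction = x @ theta_i  # expected label is <x, theta^(i)>
```

Only the r coefficients of w^(i) and the k non-zero entries of b^(i) are stored per task, which is the source of the small per-task parameter overhead discussed below.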
In comparison, in the framework proposed in this work, we only obtain a few linear measurements of the underlying matrix, which makes the estimation problem significantly more challenging. We address the challenge of estimating U along with w^(i), b^(i) (for each task) using a simple alternating-minimization-style iterative technique. Our method, AMHT-LRS, alternately estimates the global parameter U as well as the task-specific parameters w^(i) and b^(i) independently for each task. To ensure sparsity of b^(i), we use an iterative hard-thresholding-style estimator (Jain et al., 2014). In general, even estimating U is an NP-hard problem (Thekumparampil et al., 2021). One of the main contributions of this paper is a novel analysis showing that AMHT-LRS efficiently converges to the optimal solution in the realizable setting, assuming the data is generated from a Gaussian distribution. Formally, consider t d-dimensional linear regression tasks (indexed by i ∈ [t]), each provided with m samples {(x_j^(i), y_j^(i))}_{j=1}^m such that

y_j^(i) = ⟨x_j^(i), U* w*^(i) + b*^(i)⟩ + z_j^(i),   (1)

where z_j^(i) ∼_iid N(0, σ^2). Below, we state our main result informally in the noiseless setting (σ = 0):

Theorem (Informal, noiseless setting). Suppose we are given m · t samples from t linear regression tasks of dimension d as in equation 1. The goal is to learn a new regression task's parameters using m samples, i.e., to learn the shared rank-r parameter matrix U* along with the task-specific w*, b*. Then, AMHT-LRS with a total of m · t = Ω(kdr^4) samples and m = Ω(max(k, r^3)) samples per task recovers all the parameters exactly, in time nearly linear in m · t.

That is, AMHT-LRS learns the underlying model exactly as long as the total number of tasks is large enough, while the number of per-task samples scales only linearly in the sparsity of b^(i) and cubically in the rank of U*. Assuming r, k ≪ d, the additional parameter overhead per task is small, allowing efficient deployment of such models in production.
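The alternating scheme described above can be sketched numerically as follows. This is a simplified illustration, not AMHT-LRS itself: the step sizes, initialization, and thresholding schedule of the actual algorithm are not reproduced, and the joint least-squares update of U via vec(U) is one plausible choice rather than the paper's update rule.

```python
import numpy as np

def hard_threshold(v, k):
    """Keep the k largest-magnitude entries of v; zero out the rest."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def amht_lrs_sketch(X, y, r, k, n_iters=25):
    """Illustrative alternating minimization with hard thresholding.

    X: list of (m, d) design matrices, y: list of (m,) responses, one per task.
    """
    t, d = len(X), X[0].shape[1]
    rng = np.random.default_rng(0)
    U, _ = np.linalg.qr(rng.standard_normal((d, r)))
    W = np.zeros((r, t))  # per-task low-rank coefficients w^(i)
    B = np.zeros((d, t))  # per-task sparse corrections b^(i)
    for _ in range(n_iters):
        for i in range(t):
            # w^(i): least squares over the r-dimensional projected features X^(i) U
            W[:, i] = np.linalg.lstsq(X[i] @ U, y[i] - X[i] @ B[:, i], rcond=None)[0]
            # b^(i): gradient step on the residual, then hard-threshold to k entries
            res = y[i] - X[i] @ (U @ W[:, i] + B[:, i])
            B[:, i] = hard_threshold(B[:, i] + X[i].T @ res / len(res), k)
        # U: joint least squares over all tasks via vec(U), using the identity
        # <x, U w> = <kron(w, x), vec(U)> for column-major vec
        A = np.vstack([np.kron(W[:, i].reshape(1, -1), X[i]) for i in range(t)])
        rhs = np.concatenate([y[i] - X[i] @ B[:, i] for i in range(t)])
        vec_u = np.linalg.lstsq(A, rhs, rcond=None)[0]
        U, _ = np.linalg.qr(vec_u.reshape(d, r, order="F"))
    return U, W, B
```

Note that each iteration touches every sample a constant number of times, which is consistent with the nearly linear-in-(m · t) running time claimed in the theorem.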
Finally, using the billboard model of (ε, δ)-differential privacy (DP) (Jain et al., 2021; Chien et al., 2021; Kearns et al., 2014), we extend AMHT-LRS to preserve the privacy of each individual. Furthermore, for a sample complexity similar to that in the above theorem, albeit with a slightly worse dependence on r, we can guarantee strong generalization error up to a standard error term due to privacy.
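As a hedged illustration of the billboard pattern, one server-side aggregation step might look like the sketch below: each task clips its contribution to the shared parameter update, the server releases only the noisy average, and the task-specific w^(i), b^(i) never leave the client. The function name and the `clip_norm`/`noise_mult` knobs are hypothetical, not the paper's calibrated DP constants.

```python
import numpy as np

def billboard_dp_aggregate(contributions, clip_norm, noise_mult, rng):
    """Clip-and-noise aggregation of per-task contributions to shared parameters.

    contributions: list of equally shaped arrays (one per task);
    clip_norm bounds each task's influence; noise_mult scales the Gaussian noise.
    """
    clipped = []
    for g in contributions:
        norm = np.linalg.norm(g)
        # Rescale so each task's contribution has norm at most clip_norm.
        clipped.append(g * min(1.0, clip_norm / max(norm, 1e-12)))
    # Gaussian noise calibrated to the clipping bound (scale is illustrative).
    noise = rng.standard_normal(contributions[0].shape) * noise_mult * clip_norm
    return (sum(clipped) + noise) / len(contributions)
```

The released (noisy) shared parameters play the role of the "billboard": every task reads them, but no task's individual data can be reconstructed from them beyond the DP guarantee.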

Summary of our Contributions:

• We propose a theoretical framework for combining the meta-learning approaches of representation learning and neighborhood models. Our model non-trivially generalizes the guarantees of Thekumparampil et al. (2021); Tripuraneni et al. (2021); Boursier et al. (2022) to the setting where the parameter matrix admits a low-rank plus sparse decomposition.
• We propose an efficient method, AMHT-LRS, for the above problem and provide rigorous total and per-task sample complexity bounds that are nearly optimal (Theorem 1).

