FEW-SHOT LEARNING VIA LEARNING THE REPRESENTATION, PROVABLY

Abstract

This paper studies few-shot learning via representation learning, where one uses T source tasks with n_1 data per task to learn a representation in order to reduce the sample complexity of a target task for which there are only n_2 (≪ n_1) data. Specifically, we focus on the setting where there exists a good common representation between the source and target tasks, and our goal is to understand how much sample size reduction is possible. First, we study the setting where this common representation is low-dimensional and provide a risk bound of Õ(dk/(n_1 T) + k/n_2) on the target task for the linear representation class; here d is the ambient input dimension and k (≪ d) is the dimension of the representation. This result bypasses the Ω(1/T) barrier under the i.i.d. task assumption, and captures the desired property that all n_1 T samples from the source tasks can be pooled together for representation learning. We further extend this result to handle a general representation function class and obtain a similar result. Next, we consider the setting where the common representation may be high-dimensional but is capacity-constrained (say in norm); here, we again demonstrate the advantage of representation learning in both high-dimensional linear regression and neural networks, and show that representation learning can fully utilize all n_1 T samples from the source tasks.

1. INTRODUCTION

A popular scheme for few-shot learning, i.e., learning in a data-scarce environment, is representation learning, where one first learns a feature extractor, or representation (e.g., the last layer of a convolutional neural network), from different but related source tasks, and then uses a simple predictor (usually a linear function) on top of this representation in the target task. The hope is that the learned representation captures the common structure across tasks, which makes a linear predictor sufficient for the target task. If the learned representation is good enough, a few samples may suffice for learning the target task, far fewer than the number required to learn it from scratch. While representation learning has achieved tremendous success in a variety of applications (Bengio et al., 2013), its theoretical study is limited. In existing theoretical work, the most natural algorithm is to explicitly look for the representation that, combined with a (different) linear predictor on top for each task, achieves the smallest cumulative training error on the source tasks. Of course, the representation found is not guaranteed to be useful for the target task unless one makes some assumptions that characterize the connections between different tasks. Existing work often imposes a probabilistic assumption about the connection between tasks: each task is sampled i.i.d. from an underlying distribution. Under this assumption, Maurer et al. (2016) showed an Õ(1/√T + 1/√n_2) risk bound on the target task, where T is the number of source tasks, n_1 is the number of samples per source task, and n_2 is the number of samples from the target task. (Here we only focus on the dependence on T, n_1 and n_2; Maurer et al. (2016) only considered n_1 = n_2, but their approach does not give a better result even when n_1 > n_2.) Unsatisfactorily, this bound necessarily requires the number of tasks T to be large, and it does not improve as the number of samples per source task, n_1, increases.
Intuitively, one should expect more data to help, and therefore an ideal bound would be 1/√(n_1 T) + 1/√n_2 (or 1/(n_1 T) + 1/n_2 in the realizable case), because n_1 T is the total number of training data points from the source tasks, which can potentially be pooled to learn the representation. Unfortunately, as pointed out by Maurer et al. (2016), there exists an example satisfying the i.i.d. task assumption for which Ω(1/√T) is unavoidable (or Ω(1/T) in the realizable setting). This means that the i.i.d. assumption alone is not sufficient if we want to take advantage of a large number of samples per task. Therefore, a natural question is: what connections between tasks enable representation learning to utilize all source data?

In this paper, we obtain the first set of results that fully utilize the n_1 T data from the source tasks. We replace the i.i.d. assumption over tasks with natural structural conditions on the input distributions and linear predictors. These conditions require that the target task is, in a certain sense, "covered" by the source tasks, which in turn gives rise to the desired guarantees.

First, we study the setting where there exists a common well-specified low-dimensional representation for the source and target tasks, and obtain an Õ(dk/(n_1 T) + k/n_2) risk bound on the target task, where d is the ambient input dimension, k (≪ d) is the dimension of the representation, and n_2 is the number of data from the target task. Note that this improves over the d/n_2 rate of learning the target task without representation learning. The term dk/(n_1 T) indicates that we can fully exploit all n_1 T data in the source tasks to learn the representation. We further extend this result to handle a general representation function class and obtain an Õ(C(Φ)/(n_1 T) + k/n_2) risk bound on the target task, where Φ is the representation function class and C(Φ) is a certain complexity measure of Φ.
Second, we study the setting where there exists a common linear high-dimensional representation for the source and target tasks, and obtain an Õ( R√Tr(Σ)/√(n_1 T) + R√‖Σ‖_2/√n_2 ) rate, where R is a normalized nuclear-norm bound on the linear predictors and Σ is the covariance matrix of the raw features. This also improves over the baseline rate for the case without representation learning. We further extend this result to two-layer neural networks with ReLU activation. Again, our results indicate that we can fully exploit all n_1 T source data. A technical insight coming out of our analysis is that any capacity-controlled method that achieves low test error on the source tasks must also achieve low test error on the target task, by virtue of being forced to learn a good representation. Our result on high-dimensional representations shows that the capacity control for representation learning does not have to be through explicit low dimensionality.

Organization. The rest of the paper is organized as follows. We review related work in Section 2. In Section 3, we formally describe the setting we consider. In Section 4, we present our main result for low-dimensional linear representation learning. A generalization to nonlinear representation classes is given in Section 5. In Section 6, we present our main result for high-dimensional linear representation learning. In Section 7, we present our result for representation learning in neural networks. We conclude in Section 8 and defer most proofs to the appendices.

2. RELATED WORK

The idea of multitask representation learning dates back at least to Caruana (1997); Thrun and Pratt (1998); Baxter (2000). Empirically, representation learning has shown its great power in various domains; see Bengio et al. (2013) for a survey. In particular, representation learning is widely adopted for few-shot learning tasks (Sun et al., 2017; Goyal et al., 2019). Representation learning is also closely connected to meta-learning (Schaul and Schmidhuber, 2010). Recent work by Raghu et al. (2019) empirically suggested that the effectiveness of the popular meta-learning algorithm Model-Agnostic Meta-Learning (MAML) is due to its ability to learn a useful representation. The scheme we analyze in this paper is closely related to those of Lee et al. (2019); Bertinetto et al. (2018) for meta-learning.

On the theoretical side, Baxter (2000) performed the first theoretical analysis and gave sample complexity bounds using covering numbers. Maurer et al. (2016) and follow-up work analyzed the benefit of representation learning for reducing the sample complexity of the target task. They assumed every task is drawn i.i.d. from an underlying distribution and obtained an Õ(1/√T) rate. As pointed out in Maurer et al. (2016), the 1/√T dependence is not improvable even if n_1 → ∞, because 1/√T is the rate of concentration for the distribution over tasks. The concurrent work of Tripuraneni et al. (2020) studies low-dimensional linear representation learning and obtains a similar result to ours in this case, but they assume isotropic inputs for all tasks, which is a special case of our setting. Furthermore, we also provide results for high-dimensional linear representations, general non-linear representations, and two-layer neural networks. Tripuraneni et al. (2020) also give a computationally efficient algorithm for standard Gaussian inputs and a lower bound for subspace recovery in the low-dimensional linear setting. Another recent line of theoretical work analyzed gradient-based meta-learning methods (Denevi et al., 2019; Finn et al., 2019; Khodak et al., 2019) and showed guarantees for convex losses using tools from online convex optimization. Lastly, we remark that there are analyses of other representation learning schemes (Arora et al., 2019; McNamara and Balcan, 2017; Galanti et al., 2016; Alquier et al., 2016; Denevi et al., 2018).

3. NOTATION AND SETUP

Notation. Let [n] = {1, 2, . . . , n}. We use ‖·‖ or ‖·‖_2 to denote the ℓ_2 norm of a vector or the spectral norm of a matrix. Denote by ‖·‖_F and ‖·‖_* the Frobenius norm and the nuclear norm of a matrix, respectively. Let ⟨·, ·⟩ be the Euclidean inner product between vectors or matrices. Denote by I the identity matrix. Let N(µ, σ²)/N(µ, Σ) be the one-dimensional/multi-dimensional Gaussian distribution, and χ²(m) the chi-squared distribution with m degrees of freedom. For a matrix A ∈ R^{m×n}, let σ_i(A) be its i-th largest singular value. Let span(A) be the subspace of R^m spanned by the columns of A, i.e., span(A) = {Av | v ∈ R^n}. Denote P_A = A(A^⊤A)^†A^⊤ ∈ R^{m×m}, which is the projection matrix onto span(A); here † stands for the Moore-Penrose pseudoinverse. Note that 0 ⪯ P_A ⪯ I and P_A² = P_A. We also define P_A^⊥ = I − P_A, which is the projection matrix onto span(A)^⊥, the orthogonal complement of span(A) in R^m. For a positive semidefinite (psd) matrix B, denote by λ_max(B) and λ_min(B) its largest and smallest eigenvalues, respectively; let B^{1/2} be the psd matrix such that (B^{1/2})² = B.

Problem Setup. Suppose that there are T source tasks. Each task t ∈ [T] is associated with a distribution µ_t over the joint data space X × Y, where X is the input space and Y is the output space. In this paper we consider X ⊆ R^d and Y ⊆ R. For each source task t ∈ [T] we have access to n_1 i.i.d. samples (x_{t,1}, y_{t,1}), . . . , (x_{t,n_1}, y_{t,n_1}) from µ_t. For convenience, we express these n_1 samples collectively as an input matrix X_t ∈ R^{n_1×d} and an output vector y_t ∈ R^{n_1}.
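As a quick sanity check on the projection notation above, the following NumPy sketch (illustrative only) verifies that P_A = A(A^⊤A)^†A^⊤ is indeed a projection: idempotent, symmetric, satisfying 0 ⪯ P_A ⪯ I, and fixing every vector in span(A).

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))

# Projection onto span(A): P_A = A (A^T A)^† A^T.
P = A @ np.linalg.pinv(A.T @ A) @ A.T

# P_A is idempotent and symmetric.
assert np.allclose(P @ P, P)
assert np.allclose(P, P.T)

# Eigenvalues lie in [0, 1], i.e. 0 ⪯ P_A ⪯ I.
eigs = np.linalg.eigvalsh(P)
assert eigs.min() > -1e-9 and eigs.max() < 1 + 1e-9

# P_A fixes span(A), and P_A^⊥ = I − P_A annihilates it.
v = A @ rng.standard_normal(3)
assert np.allclose(P @ v, v)
assert np.allclose((np.eye(5) - P) @ v, 0)
```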


Multitask learning tries to learn prediction functions for all the T source tasks simultaneously in the hope of discovering some underlying common property of these tasks. The common property we consider in this paper is a representation, which is a function φ : X → Z that maps an input to some feature space Z ⊆ R^k. We restrict the representation function to be in some function class Φ, e.g., neural networks. We use different linear predictors on top of a common representation function φ to model the input-output relations in the different source tasks. Namely, for each task t ∈ [T], we set the prediction function to be x ↦ ⟨w_t, φ(x)⟩ (w_t ∈ R^k). Therefore, using the training samples from the T source tasks, we can solve the following optimization problem to learn the representation:

minimize_{φ∈Φ, w_1,...,w_T ∈ R^k}  (1/(2n_1 T)) Σ_{t=1}^T Σ_{i=1}^{n_1} (y_{t,i} − ⟨w_t, φ(x_{t,i})⟩)².   (1)

We overload the notation to allow φ to apply to all the samples in a data matrix simultaneously, i.e., φ(X_t) = [φ(x_{t,1}), . . . , φ(x_{t,n_1})]^⊤ ∈ R^{n_1×k}. Then (1) can be rewritten as

minimize_{φ∈Φ, w_1,...,w_T ∈ R^k}  (1/(2n_1 T)) Σ_{t=1}^T ‖y_t − φ(X_t) w_t‖².   (2)

Let φ̂ ∈ Φ be the representation function obtained by solving (2). Now we retain this representation and apply it to future (target) tasks. For a target task specified by a distribution µ_{T+1} over X × Y, suppose we receive n_2 i.i.d. samples, collected as X_{T+1} ∈ R^{n_2×d} and y_{T+1} ∈ R^{n_2}. We further train a linear predictor on top of φ̂ for this task:

minimize_{w_{T+1} ∈ R^k}  (1/(2n_2)) ‖y_{T+1} − φ̂(X_{T+1}) w_{T+1}‖².   (3)

Let ŵ_{T+1} be the returned solution. We are interested in whether our learned predictor x ↦ ⟨ŵ_{T+1}, φ̂(x)⟩ works well on average for the target task, i.e., we want the population loss

L_{µ_{T+1}}(φ̂, ŵ_{T+1}) = E_{(x,y)∼µ_{T+1}} [½ (y − ⟨ŵ_{T+1}, φ̂(x)⟩)²]

to be small.
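To make objective (1)/(2) concrete, here is a minimal NumPy sketch that evaluates the averaged source-task loss for a candidate representation and per-task predictors. The names `source_loss` and the linear choice of φ are our own illustrative assumptions, not from the paper.

```python
import numpy as np

def source_loss(phi, W, Xs, ys):
    """Objective (1): averaged squared loss over the T source tasks.
    phi: callable mapping an (n, d) input matrix to (n, k) features.
    W:   (k, T) matrix whose columns are the per-task predictors w_t.
    Xs:  list of T arrays of shape (n1, d); ys: list of T arrays of shape (n1,)."""
    T = len(Xs)
    n1 = Xs[0].shape[0]
    total = 0.0
    for t in range(T):
        resid = ys[t] - phi(Xs[t]) @ W[:, t]
        total += 0.5 * resid @ resid
    return total / (n1 * T)
```

For instance, with a linear representation φ(x) = B^⊤x and labels generated noiselessly from the same (B, W), the loss evaluates to zero.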
In particular, we are interested in the few-shot learning setting, where the number of samples n_2 from the target task is small, much smaller than the number of samples required for learning the target task from scratch.

Data assumption. In order for the above learning procedure to make sense, we assume that there exist a ground-truth optimal representation function φ* ∈ Φ and specializations w*_1, . . . , w*_{T+1} ∈ R^k for all the tasks such that for each task t ∈ [T+1], we have E_{(x,y)∼µ_t}[y|x] = ⟨w*_t, φ*(x)⟩. More specifically, we assume (x, y) ∼ µ_t can be generated by

y = ⟨w*_t, φ*(x)⟩ + z,  x ∼ p_t,  z ∼ N(0, σ²),   (4)

where x and z are independent. Our goal is to bound the excess risk of our learned model on the target task, i.e., how much our learned model (φ̂, ŵ_{T+1}) performs worse than the optimal model (φ*, w*_{T+1}) on the target task:

ER(φ̂, ŵ_{T+1}) = L_{µ_{T+1}}(φ̂, ŵ_{T+1}) − L_{µ_{T+1}}(φ*, w*_{T+1}) = ½ E_{x∼p_{T+1}}[(⟨ŵ_{T+1}, φ̂(x)⟩ − ⟨w*_{T+1}, φ*(x)⟩)²].

Here we have used the relation (4). Oftentimes we are interested in the average performance on a random target task (i.e., w*_{T+1} is random). In that case we look at the expected excess risk E_{w*_{T+1}}[ER(φ̂, ŵ_{T+1})].
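The excess risk ER above can be estimated by Monte Carlo over a large sample from the target input distribution. The sketch below specializes to the linear representations of the next section for concreteness; the name `excess_risk` and the population-sample argument are our own illustrative choices.

```python
import numpy as np

def excess_risk(B_hat, w_hat, B_star, w_star, X_pop):
    """Monte Carlo estimate of ER = 1/2 E_x[(<w_hat, B_hat^T x> - <w_star, B_star^T x>)^2],
    specialized to linear representations phi(x) = B^T x.
    X_pop: (n, d) array of many draws from the target input distribution p_{T+1}."""
    diff = X_pop @ (B_hat @ w_hat - B_star @ w_star)
    return 0.5 * np.mean(diff ** 2)
```

For isotropic inputs this estimator concentrates around ½‖B̂ŵ − B*w*‖², which is the closed-form excess risk in that case.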

4. LOW-DIMENSIONAL LINEAR REPRESENTATIONS

In this section, we consider the case where the representation is a linear map from the original input space R^d to a low-dimensional space R^k (k ≪ d). Namely, we let the representation function class be Φ = {x ↦ B^⊤x | B ∈ R^{d×k}}. Then the optimization problem (2) for learning the representation can be written as:

(B̂, Ŵ) ← argmin_{B ∈ R^{d×k}, W = [w_1,...,w_T] ∈ R^{k×T}}  (1/(2n_1 T)) Σ_{t=1}^T ‖y_t − X_t B w_t‖².   (6)

The inputs from the T source tasks, X_1, . . . , X_T ∈ R^{n_1×d}, can be written in the form of a linear operator X : R^{d×T} → R^{n_1×T}, where X(Θ) = [X_1 θ_1, . . . , X_T θ_T] for all Θ = [θ_1, . . . , θ_T] ∈ R^{d×T}. With this notation, (6) can be rewritten as

(B̂, Ŵ) ← argmin_{B ∈ R^{d×k}, W ∈ R^{k×T}}  (1/(2n_1 T)) ‖Y − X(BW)‖_F²,  where Y = [y_1, . . . , y_T] ∈ R^{n_1×T}.   (7)

With the learned representation B̂ from (7), for the target task, we further find a linear function on top of the representation:

ŵ_{T+1} ← argmin_{w ∈ R^k}  (1/(2n_2)) ‖y_{T+1} − X_{T+1} B̂ w‖².   (8)

As described in Section 3, we assume that all T+1 tasks share a common ground-truth representation specified by a matrix B* ∈ R^{d×k} such that a sample (x, y) ∼ µ_t satisfies x ∼ p_t and y = (w*_t)^⊤(B*)^⊤x + z, where z ∼ N(0, σ²) is independent of x. Here w*_t ∈ R^k, and we assume ‖w*_t‖ = Θ(1) for all t ∈ [T+1]. Denote W* = [w*_1, . . . , w*_T] ∈ R^{k×T}. Then we can write Y = X(B*W*) + Z, where the noise matrix Z has i.i.d. N(0, σ²) entries.

Assume E_{x∼p_t}[x] = 0 and let Σ_t = E_{x∼p_t}[xx^⊤] for all t ∈ [T+1]. Note that a sample x ∼ p_t can be generated as x = Σ_t^{1/2} x̄ for x̄ ∼ p̄_t such that E_{x̄∼p̄_t}[x̄] = 0 and E_{x̄∼p̄_t}[x̄x̄^⊤] = I. (p̄_t is called the whitening of p_t.) In this section we make the following assumptions on the input distributions p_1, . . . , p_{T+1}.

Assumption 4.1 (subgaussian input). There exists ρ > 0 such that, for all t ∈ [T+1], the random vector x̄ ∼ p̄_t is ρ²-subgaussian.

Assumption 4.2 (covariance dominance). There exists c > 0 such that Σ_t ⪰ c·Σ_{T+1} for all t ∈ [T].
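Problem (6) is jointly non-convex in (B, W) but is a least-squares problem in each factor separately, so a natural heuristic is alternating least squares. The sketch below is our own illustration (the paper's analysis only requires some minimizer of (6), not this particular solver); each half-step is an exact least-squares solve, so the objective is non-increasing, but in general only a stationary point is guaranteed.

```python
import numpy as np

def learn_linear_rep(Xs, ys, k, n_iters=50, seed=0):
    """Alternating least squares heuristic for problem (6):
    min_{B, W} (1 / 2 n1 T) sum_t ||y_t - X_t B w_t||^2.
    Xs: list of T arrays (n1, d); ys: list of T arrays (n1,)."""
    rng = np.random.default_rng(seed)
    T, d = len(Xs), Xs[0].shape[1]
    B = rng.standard_normal((d, k)) / np.sqrt(d)
    W = np.zeros((k, T))
    for _ in range(n_iters):
        # w_t-step: independent per-task least squares with B fixed.
        for t in range(T):
            W[:, t] = np.linalg.lstsq(Xs[t] @ B, ys[t], rcond=None)[0]
        # B-step: least squares in vec(B); the row for sample (t, i) is
        # kron(x_{t,i}, w_t), since x^T B w = B.flatten() @ kron(x, w)
        # in C (row-major) order.
        A = np.vstack([np.kron(Xs[t], W[:, t][None, :]) for t in range(T)])
        B = np.linalg.lstsq(A, np.concatenate(ys), rcond=None)[0].reshape(d, k)
    # Final refit of the per-task predictors against the last B.
    for t in range(T):
        W[:, t] = np.linalg.lstsq(Xs[t] @ B, ys[t], rcond=None)[0]
    return B, W
```

On small noiseless instances generated from a planted (B*, W*), this heuristic typically drives the training residual close to zero, though no global guarantee is claimed here.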
Assumption 4.1 is a standard assumption in statistical learning used to obtain the probabilistic tail bounds in our proof. It may be replaced with other moment or boundedness conditions if we adopt different tail bounds in the analysis. Assumption 4.2 says that every direction spanned by Σ_{T+1} should also be spanned by Σ_t (t ∈ [T]), and the parameter c quantifies how "easy" it is for Σ_t to cover Σ_{T+1}. Intuitively, the larger c is, the easier it is to cover the target domain using the source domains, and we will indeed see that the risk is proportional to 1/c. We remark that we do not necessarily need Σ_t ⪰ c·Σ_{T+1} for all t ∈ [T]; as long as this holds for a constant fraction of the t's, our result remains valid.

We also make the following assumption that characterizes the diversity of the source tasks.

Assumption 4.3 (diverse source tasks). The matrix W* = [w*_1, . . . , w*_T] ∈ R^{k×T} satisfies σ_k²(W*) ≥ Ω(T/k).

Recall that ‖w*_t‖ = Θ(1), which implies Σ_{j=1}^k σ_j²(W*) = ‖W*‖_F² = Θ(T). Thus, Assumption 4.3 is equivalent to saying that σ_1(W*)/σ_k(W*) = O(1). Roughly speaking, this means that {w*_t}_{t∈[T]} cover all directions in R^k. As an example, Assumption 4.3 is satisfied with high probability when the w*_t's are sampled i.i.d. from N(0, Σ) with λ_max(Σ)/λ_min(Σ) = O(1).

Finally, we make the following assumption on the distribution of the target task.

Assumption 4.4 (distribution of target task). Assume that w*_{T+1} follows a distribution ν such that E_{w∼ν}[ww^⊤] ⪯ O(1/k)·I.

Since we assume ‖w*_{T+1}‖ = Θ(1), the assumption E_{w∼ν}[ww^⊤] ⪯ O(1/k)·I means that the distribution of w*_{T+1} does not align with any direction significantly more than average. It is useful to think of the uniform distribution on the unit sphere as an example, though we allow a much more general class of distributions. This is also compatible with Assumption 4.3, which says that the w*_t's cover all directions.
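Assumptions 4.2 and 4.3 can be checked numerically for given covariances and task matrices. The two helpers below are our own illustrative utilities: the first computes the largest valid c in Σ_t ⪰ c·Σ_{T+1} (assuming Σ_{T+1} ≻ 0), the second the condition ratio σ_1(W*)/σ_k(W*) that Assumption 4.3 requires to be O(1).

```python
import numpy as np

def dominance_constant(cov_t, cov_target):
    """Largest c such that cov_t ⪰ c * cov_target (Assumption 4.2),
    assuming cov_target is positive definite:
    c = lambda_min(cov_target^{-1/2} cov_t cov_target^{-1/2})."""
    evals, evecs = np.linalg.eigh(cov_target)
    inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    return np.linalg.eigvalsh(inv_sqrt @ cov_t @ inv_sqrt).min()

def diversity_ratio(W):
    """sigma_1(W*) / sigma_k(W*) for a (k, T) task matrix (Assumption 4.3);
    an O(1) value means the source tasks are diverse."""
    s = np.linalg.svd(W, compute_uv=False)
    return s[0] / s[-1]
```

For example, `dominance_constant(2*I, I)` returns 2, and an orthonormal W* attains the ideal diversity ratio of 1.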
Assumption 4.4 can be removed at the cost of a slightly worse risk bound; see Remark 4.1. Our main result in this section is the following theorem.

Theorem 4.1 (main theorem for linear representations). Fix a failure probability δ ∈ (0, 1). Under Assumptions 4.1, 4.2, 4.3 and 4.4, further assume 2k ≤ min{d, T} and that the sample sizes in the source and target tasks satisfy n_1 ≳ ρ⁴(d + log(T/δ)), n_2 ≳ ρ⁴(k + log(1/δ)), and cn_1 ≥ n_2. Define κ = max_{t∈[T]} λ_max(Σ_t) / min_{t∈[T]} λ_min(Σ_t). Then with probability at least 1 − δ over the samples, the expected excess risk of the learned predictor x ↦ ŵ_{T+1}^⊤ B̂^⊤ x on the target task satisfies

E_{w*_{T+1}∼ν}[ER(B̂, ŵ_{T+1})] ≲ σ² ( kd log(κn_1)/(cn_1 T) + (k + log(1/δ))/n_2 ).   (9)

The proof of Theorem 4.1 is in Appendix A. Theorem 4.1 shows that it is possible to learn the target task using only O(k) samples via learning a good representation from the source tasks, which is better than the baseline O(d) sample complexity for linear regression, thus demonstrating the benefit of representation learning. It also shows that all n_1 T samples from the source tasks can be pooled together, bypassing the Ω(1/T) barrier under the i.i.d. task assumption.

Remark 4.1 (deterministic target task). We can drop Assumption 4.4 and easily obtain the following excess risk bound for any deterministic w*_{T+1} by slightly modifying the proof of Theorem 4.1:

ER(B̂, ŵ_{T+1}) ≲ σ² ( k²d log(κn_1)/(cn_1 T) + k²/(cn_1) + (k + log(1/δ))/n_2 ),

which is only at most k times larger than the bound in (9).
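To get a feel for the rates in Theorem 4.1, here is a back-of-the-envelope comparison with hypothetical problem sizes (our own illustrative numbers; constants and log factors dropped).

```python
# Hypothetical problem sizes (illustrative only; constants and log factors dropped).
d, k, T = 1000, 10, 50       # ambient dim, representation dim, number of source tasks
n1, n2, c = 2000, 100, 1.0   # samples per source task, target samples, coverage constant

rep_bound = k * d / (c * n1 * T) + k / n2   # shape of the bound in (9)
scratch_bound = d / n2                      # linear regression on the target task alone

assert rep_bound < scratch_bound  # -> rep_bound = 0.2, scratch_bound = 10.0
```

With these numbers, pooling the n_1 T = 100,000 source samples makes the representation term kd/(cn_1 T) as small as the k/n_2 target term, a 50x improvement over the d/n_2 baseline.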

5. GENERAL LOW-DIMENSIONAL REPRESENTATIONS

Now we return to the general case described in Section 3, where we allow a general representation function class Φ. We still assume that the representation is of low dimension k as in Section 4, and we assume that the inputs from all tasks follow the same distribution, i.e., p_1 = · · · = p_{T+1} = p, but each task t still has its own specialization function w*_t (c.f. (4)). We overload the notation from Section 4 and use X to represent the collection of all the training inputs from the T source tasks, X_1, . . . , X_T ∈ R^{n_1×d}; we can think of X as a third-order tensor of dimension n_1 × d × T.

To characterize the complexity of the representation function class Φ, we need the standard definition of Gaussian width.

Definition 5.1 (Gaussian width). Given a set K ⊂ R^m, the Gaussian width of K is defined as G(K) := E_{z∼N(0,I)} [sup_{v∈K} ⟨v, z⟩].

We will measure the complexity of Φ using the Gaussian width of the following set that depends on the input data X:

F_X(Φ) = { A = [a_1, . . . , a_T] ∈ R^{n_1×T} : ‖A‖_F = 1, ∃ φ, φ′ ∈ Φ s.t. a_t ∈ span([φ(X_t), φ′(X_t)]), ∀t ∈ [T] }.   (10)

We also need the following definition.

Definition 5.2 (covariance between two representations). Given a distribution q over R^d and two representation functions φ, φ′ ∈ Φ, define the covariance between φ and φ′ with respect to q to be Σ_q(φ, φ′) = E_{x∼q}[φ(x) φ′(x)^⊤] ∈ R^{k×k}. Also define the symmetric covariance as

Λ_q(φ, φ′) = [ Σ_q(φ, φ), Σ_q(φ, φ′) ; Σ_q(φ′, φ), Σ_q(φ′, φ′) ] ∈ R^{2k×2k}.

It is easy to verify that Λ_q(φ, φ′) ⪰ 0 for any φ, φ′ and q, as shown in the proof of Lemma B.2.

We make the following assumptions on the input distribution p, which ensure concentration properties of the representation covariances.

Assumption 5.1 (point-wise concentration of covariance). For δ ∈ (0, 1), there exists a number N_point(Φ, p, δ) such that if n ≥ N_point(Φ, p, δ), then for any given φ, φ′ ∈ Φ, n i.i.d.
samples of p will, with probability at least 1 − δ, satisfy 0.9 Λ_p(φ, φ′) ⪯ Λ_p̂(φ, φ′) ⪯ 1.1 Λ_p(φ, φ′), where p̂ is the empirical distribution over the n samples.

Assumption 5.2 (uniform concentration of covariance). For δ ∈ (0, 1), there exists a number N_unif(Φ, p, δ) such that if n ≥ N_unif(Φ, p, δ), then n i.i.d. samples of p will, with probability at least 1 − δ, satisfy 0.9 Λ_p(φ, φ′) ⪯ Λ_p̂(φ, φ′) ⪯ 1.1 Λ_p(φ, φ′) for all φ, φ′ ∈ Φ, where p̂ is the empirical distribution over the n samples.

Assumptions 5.1 and 5.2 are conditions on the representation function class Φ and the input distribution p that ensure concentration of empirical covariances to their population counterparts. Typically, we expect N_unif(Φ, p, δ) ≫ N_point(Φ, p, δ), since uniform concentration is a stronger requirement. In Section 4, we have essentially shown that for linear representations and subgaussian input distributions, N_unif(Φ, p, δ) = Õ(d) and N_point(Φ, p, δ) = Õ(k) (see Claims A.1 and A.2).

Our main theorem in this section is the following:

Theorem 5.1 (main theorem for general representations). Fix a failure probability δ ∈ (0, 1). Suppose n_1 ≥ N_unif(Φ, p, δ/(3T)) and n_2 ≥ N_point(Φ, p, δ/3). Under Assumptions 4.3 and 4.4, with probability at least 1 − δ over the samples, the expected excess risk of the learned predictor x ↦ ŵ_{T+1}^⊤ φ̂(x) on the target task satisfies

E_{w*_{T+1}∼ν}[ER(φ̂, ŵ_{T+1})] ≲ σ² ( (G(F_X(Φ))² + log(1/δ))/(n_1 T) + (k + log(1/δ))/n_2 ).   (11)

Theorem 5.1 is very similar to Theorem 4.1 in terms of the result and the assumptions made. In the bound (11), the complexity of Φ is captured by the Gaussian width of the data-dependent set F_X(Φ) defined in (10). Data-dependent complexity measures are ubiquitous in generalization theory, one of the most notable examples being Rademacher complexity. A similar complexity measure also appeared in existing representation learning theory (Maurer et al., 2016).
Usually, for specific examples, we can apply concentration bounds to remove the data dependency, as in our result for linear representations (Theorem 4.1). Our assumptions on the linear specialization functions w*_t are the same as in Theorem 4.1. The probabilistic assumption on w*_{T+1} can also be removed at the cost of an additional factor of k in the bound; see Remark 4.1. We defer the full proof of Theorem 5.1 to Appendix B.

6. HIGH-DIMENSIONAL LINEAR REPRESENTATIONS

In this section, we consider the case where the representation is a general linear map without an explicit dimensionality constraint, and we will prove a norm-based result by exploiting the intrinsic dimension of the representation. Such a generalization is desirable since in many applications the representation dimension is not restricted.

Without loss of generality, we let the representation function class be Φ = {x ↦ B^⊤x | B ∈ R^{d×T}}. We note that a dimension-T representation is sufficient for learning T source tasks, and any choice of dimension greater than T would not change our argument. We use the same notation as Section 4 unless otherwise specified. In this section we additionally assume that all tasks have the same input covariance:

Assumption 6.1. The input distributions in all tasks satisfy Σ_1 = · · · = Σ_{T+1} = Σ.

Note that each task t still has its own specialization function w*_t (c.f. (4)). We remark that there are many interesting and nontrivial scenarios under Assumption 6.1; for example, consider the case where the inputs in each task are all images from ImageNet and each task asks whether the image is from a specific class.

Since we do not have a dimensionality constraint, we modify (7) by adding norm regularization:

(B̂, Ŵ) ← argmin_{B ∈ R^{d×T}, W ∈ R^{T×T}}  (1/(2n_1)) ‖Y − X(BW)‖_F² + (λ/2) ‖W‖_F² + (λ/2) ‖B‖_F².   (12)

For the target task, we also modify (8) by adding a norm constraint:

ŵ_{T+1} ← argmin_{‖w‖ ≤ r}  (1/(2n_2)) ‖X_{T+1} B̂ w − y_{T+1}‖².   (13)

We will specify the choices of regularization, i.e., λ and r, in Theorem 6.1. Similar to Section 4, the source task data relation is denoted as Y = X(Θ*) + Z, where Θ* ∈ R^{d×T} is the ground truth and Z has i.i.d. N(0, σ²) entries. Suppose that the target task data satisfy y_{T+1} = X_{T+1} θ*_{T+1} + z_{T+1} ∈ R^{n_2}. Similar to the setting in Section 4, we assume the target task data are subgaussian as in Assumption 4.1.

Published as a conference paper at ICLR 2021

Theorem 6.1 (main theorem for high-dimensional representations). Fix a failure probability δ ∈ (0, 1). Under Assumptions 4.1 and 6.1, further assume n_1 ≥ n_2 and let R̄ = ‖Θ*‖_*. Set r = √(2R̄/T), R = R̄/√T, and λ as specified in Lemma C.2. Let the target task model θ*_{T+1} be coherent with the source task models Θ* in the sense that θ*_{T+1} ∼ ν = N(0, Θ*(Θ*)^⊤/T).
Then with probability at least 1 − δ over the samples, the expected excess risk of the learned predictor x ↦ ŵ_{T+1}^⊤ B̂^⊤ x on the target task satisfies

E_{θ*_{T+1}∼ν}[ER(B̂, ŵ_{T+1})] ≤ σR · Õ( √Tr(Σ)/√(n_1 T) + √‖Σ‖_2/√n_2 ) + ζ_{n_1,n_2},   (14)

where ζ_{n_1,n_2} := ρ⁴R² · Õ( Tr(Σ)/n_1 + ‖Σ‖_2/n_2 ) collects lower-order terms due to the randomness of the input data. Here Õ hides logarithmic factors.

The proof of Theorem 6.1 is given in Appendix C. Note that ‖Θ*‖_F = √T when each θ*_t is of unit norm. Thus R = ‖Θ*‖_*/√T should generally be regarded as O(1) for a well-behaved Θ* that is nearly low-dimensional. In this regime, Theorem 6.1 indicates that we are able to exploit all n_1 T samples from the source tasks, similar to Theorem 4.1. With a good representation, the sample complexity on the target task can also improve over learning the target task from scratch. Consider the baseline of regular ridge regression directly applied to the target task data:

θ̂ ← argmin_{‖θ‖ ≤ ‖θ*_{T+1}‖}  (1/(2n_2)) ‖X_{T+1} θ − y_{T+1}‖².   (15)

Its standard excess risk bound in fixed design is ER(θ̂) ≲ σ ‖θ*_{T+1}‖_2 √(Tr(Σ)/n_2) (see, e.g., Hsu et al. (2012)). Taking the expectation over θ*_{T+1} ∼ ν = N(0, Θ*(Θ*)^⊤/T), we obtain

E_{θ*_{T+1}∼ν}[ER(θ̂)] ≲ σ (‖Θ*‖_F/√T) √(Tr(Σ)/n_2).   (16)

Compared with (16), our bound (14) is an improvement as long as ‖Θ*‖_*²/‖Θ*‖_F² ≪ Tr(Σ)/‖Σ‖_2. The left-hand side ‖Θ*‖_*²/‖Θ*‖_F² is always at most the rank of Θ*, and we call it the intrinsic rank. Hence we see that we can gain from representation learning if the source predictors are intrinsically low-dimensional. To intuitively understand how this is achieved, we note that a representation B reweighs linear combinations of the features according to their "importance" on the T source tasks. We make an analogy with a simple case of feature selection. Suppose we have learned a representation vector b where b_i scales with the importance of the i-th feature, i.e., the representation is φ(x) = x ⊙ b (entry-wise product).
Then ridge regression on the target task data (X, y),

minimize_{‖w‖ ≤ r}  (1/(2n_2)) ‖X diag(b) w − y‖_2²,

is equivalent to

minimize_{‖diag(b)^{-1} v‖ ≤ r}  (1/(2n_2)) ‖X v − y‖_2².

From the above equation, we see that the features with large |b_i| (those that were useful on the source tasks) will be used more heavily than the ones with small |b_i|, due to the reweighed ℓ_2 constraint. Thus the important features are learned from the source tasks, and the coefficients are learned from the target task.

Remark 6.1 (the non-convex landscape). Although the optimization problem (12) is non-convex, its structure allows us to apply existing landscape analyses of matrix factorization problems (Haeffele et al., 2014) and to show that it has the nice properties of no strict saddles and no bad local minima. Therefore, randomly initialized gradient descent or perturbed gradient descent is guaranteed to converge to a global minimum of (12) (Ge et al., 2015; Lee et al., 2016; Jin et al., 2017).

Remark 6.2 (multi-class problems). When both source and target have multi-class labels instead of independent tasks, using the quadratic loss on one-hot labels, our results apply similarly and attain an excess risk of the form σR · Õ( √Tr(Σ)/√n_1 + √‖Σ‖_2/√n_2 ) plus lower-order terms (see, e.g., Lee et al. (2020)). Notice that the result is independent of the number of classes.
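The reweighing argument above can be checked numerically. Below is a minimal sketch using the unconstrained least-squares version for simplicity (the text's version carries the reweighed ℓ_2 constraint); the importance vector `b` is hypothetical, standing in for a learned diagonal representation.

```python
import numpy as np

rng = np.random.default_rng(0)
n2, d = 40, 6
X = rng.standard_normal((n2, d))
y = rng.standard_normal(n2)
b = rng.uniform(0.5, 2.0, size=d)  # hypothetical learned feature importances

# Substituting v = diag(b) w maps one problem onto the other:
# min_w ||X diag(b) w - y||^2  <=>  min_v ||X v - y||^2
# (with the constraint ball reweighed by diag(b)^{-1} in the constrained version).
w = np.linalg.lstsq(X @ np.diag(b), y, rcond=None)[0]
v = np.linalg.lstsq(X, y, rcond=None)[0]
assert np.allclose(np.diag(b) @ w, v)  # the two solutions coincide under v = diag(b) w
```

With the norm constraint active, the same substitution shows that directions with small |b_i| are penalized more heavily, which is exactly the feature-selection effect described above.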

7. NEURAL NETWORKS

In this section, we show that we can provably learn good representations in a neural network. Consider a two-layer ReLU neural network f_{B,w}(x) = w^⊤(B^⊤x)_+, where w ∈ R^d, B ∈ R^{d_0×d} and x ∈ R^{d_0}. Here (·)_+ is the ReLU activation (z)_+ = max{0, z}, applied element-wise. Namely, we let the representation function class be Φ = {x ↦ (B^⊤x)_+ | B ∈ R^{d_0×d}}. On the source tasks we use the square loss with a weight decay regularizer:

(B̂, Ŵ) ← argmin_{B ∈ R^{d_0×d}, W = [w_1,...,w_T] ∈ R^{d×T}}  (1/(2n_1 T)) Σ_{t=1}^T ‖y_t − (X_t B)_+ w_t‖² + (λ/2) ‖B‖_F² + (λ/2) ‖W‖_F².   (17)

On the target task, we simply re-train the output layer while fixing the hidden layer weights:

ŵ_{T+1} ← argmin_{‖w‖ ≤ r}  (1/(2n_2)) ‖y_{T+1} − (X_{T+1} B̂)_+ w‖².   (18)

Assumption 7.1. All tasks share the same input distribution: p_1 = · · · = p_{T+1} = p.

We redefine Σ to be the covariance operator of the feature induced by the ReLU, i.e., it is a kernel defined by Σ(u, v) = E_{x∼p}[(u^⊤x)_+ (v^⊤x)_+] for u, v on the unit sphere S^{d_0−1} ⊂ R^{d_0}.

Assumption 7.2 (teacher network). Assume for the source tasks that y_t = (X_t B*)_+ w*_t + z_t is generated by a teacher network with parameters B* ∈ R^{d_0×d}, W* = [w*_1, . . . , w*_T] ∈ R^{d×T}, and noise term z_t ∼ N(0, σ²I).

A standard lifting of the neural network is f_{α_t}(x) = ⟨α_t, φ(x)⟩, where φ(x) : S^{d_0−1} → R, φ(x)(b) = (b^⊤x)_+ is the feature map; i.e., for each task, α_t(b_i/‖b_i‖) = ‖b_i‖ W_{i,t} and α_t is zero elsewhere. We assume the α_{T+1} that describes the target function follows a Gaussian process ν with covariance function K(b, b′) = Σ_{t=1}^T α_t(b) α_t(b′).

Theorem 7.1. Fix a failure probability δ ∈ (0, 1). Under Assumptions 4.1, 7.1 and 7.2, let n_1 ≥ n_2 and R = (½‖B*‖_F² + ½‖W*‖_F²)/√T. Let the target task model f_{α_{T+1}} = ⟨α_{T+1}, φ(x)⟩ be coherent with the source task models in the sense that α_{T+1} ∼ ν. Set r² = (‖B*‖_F² + ‖W*‖_F²)/T.
Then with probability at least 1 − δ over the samples, the expected excess risk of the learned predictor x ↦ ŵ_{T+1}^⊤(B̂^⊤x)_+ on the target task satisfies

E_{α_{T+1}∼ν}[ER(f_{B̂, ŵ_{T+1}})] ≤ σR · Õ( √Tr(Σ)/√(n_1 T) + √‖Σ‖_2/√n_2 ) + ζ_{n_1,n_2},   (19)

where ζ_{n_1,n_2} := ρ⁴R² · Õ( Tr(Σ)/n_1 + ‖Σ‖_2/n_2 ) collects lower-order terms due to the randomness of the input data.

To highlight the advantage of representation learning, we compare to training a neural network with weight decay directly on the target task:

(B̃, w̃) ← argmin_{B, w : ‖Bw‖ ≤ R̃}  (1/(2n_2)) ‖y_{T+1} − (X_{T+1} B)_+ w‖².   (20)

The error of this baseline method in fixed design is

E[ER(f_{B̃, w̃})] ≲ σR̃ √(Tr(Σ)/n_2).   (21)

We see that (19) is always smaller than (21), since n_1 T ≥ n_2. See Appendix D for the proof of Theorem 7.1 and the calculation of (21).
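A minimal sketch of the scheme in (17)-(18): freeze a learned first layer B̂ and refit only the output layer on the target data. We skip the source-task training and weight decay here and plug in a given B̂; the names `relu_features` and `retrain_head` are ours, and the norm constraint ‖w‖ ≤ r is approximated by crude norm clipping rather than an exact constrained solver.

```python
import numpy as np

def relu_features(X, B):
    """The representation phi(x) = (B^T x)_+ of the two-layer network."""
    return np.maximum(X @ B, 0.0)

def retrain_head(X_target, y_target, B_hat, r):
    """(18)-style target step: least squares on frozen ReLU features,
    with the constraint ||w|| <= r approximated by clipping
    (illustrative sketch, not an exact projection onto the feasible set)."""
    Phi = relu_features(X_target, B_hat)
    w, *_ = np.linalg.lstsq(Phi, y_target, rcond=None)
    nrm = np.linalg.norm(w)
    return w if nrm <= r else w * (r / nrm)
```

Given data generated by a teacher network as in Assumption 7.2 (noiseless, with B̂ = B* and a generous radius r), the retrained head reproduces the teacher's outputs exactly.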

8. CONCLUSION

We gave the first statistical analysis showing that representation learning can fully exploit all data points from source tasks to enable few-shot learning on a target task. Results of this type were shown for both low-dimensional and high-dimensional representation function classes. There are many important directions to pursue in representation learning and few-shot learning. Our results in Sections 6 and 7 indicate that explicit low dimensionality is not necessary: norm-based capacity control also forces the classifier to learn good representations. Further questions include whether this is a general phenomenon in all deep learning models, whether other forms of capacity control can be applied, and how to optimize so as to attain good representations.

A PROOF OF THEOREM 4.1

We first prove several claims and then combine them to finish the proof of Theorem 4.1. We will use technical lemmas proved in Section A.1.

Claim A.1 (covariance concentration of source tasks). Suppose n_1 ≳ ρ⁴(d + log(T/δ)) for δ ∈ (0, 1). Then with probability at least 1 − δ/10 over the inputs X_1, ..., X_T in the source tasks, we have 0.9Σ_t ⪯ (1/n_1) X_t^T X_t ⪯ 1.1Σ_t for all t ∈ [T].

Proof. According to our assumption on p_t, we can write X_t = X̄_t Σ_t^{1/2}, where X̄_t ∈ R^{n_1×d} and the rows of X̄_t hold i.i.d. samples of p̄_t. Since p̄_t satisfies the conditions in Lemma A.6, we know from Lemma A.6 that with probability at least 1 − δ/(10T), 0.9I ⪯ (1/n_1) X̄_t^T X̄_t ⪯ 1.1I, which implies 0.9Σ_t ⪯ (1/n_1) Σ_t^{1/2} X̄_t^T X̄_t Σ_t^{1/2} = (1/n_1) X_t^T X_t ⪯ 1.1Σ_t. The proof is finished by taking a union bound over all t ∈ [T].

Claim A.2 (covariance concentration of target task). Suppose n_2 ≳ ρ⁴(k + log(1/δ)) for δ ∈ (0, 1). Then for any given matrix B ∈ R^{d×2k} that is independent of X_{T+1}, with probability at least 1 − δ/10 over X_{T+1} we have 0.9 B^T Σ_{T+1} B ⪯ (1/n_2) B^T X_{T+1}^T X_{T+1} B ⪯ 1.1 B^T Σ_{T+1} B.

Proof. According to our assumption on p_{T+1}, we can write X_{T+1} = X̄_{T+1} Σ_{T+1}^{1/2}, where X̄_{T+1} ∈ R^{n_2×d} and the rows of X̄_{T+1} hold i.i.d.
samples of pT +1 . We take the SVD of Σ 1/2 T +1 B: Σ 1/2 T +1 B = U DV , where U ∈ R d×2k has orthonormal columns. Now we look at the matrix XT +1 U ∈ R n2×2k . It is easy to see that the rows of XT +1 U are i.i.d. 2k-dimensional random vectors with zero mean, identity covariance, and are ρ 2 -subgaussian. Therefore, applying Lemma A.6, with probability at least 1 -δ 10 we have 0.9I 1 n 2 U X T +1 XT +1 U 1.1I, which implies 0.9V DDV 1 n 2 V DU X T +1 XT +1 U DV 1.1V DDV . Since 1 n2 V DU X T +1 XT +1 U DV = 1 n2 B Σ 1/2 T +1 X T +1 XT +1 Σ 1/2 T +1 B = 1 n2 B X T +1 X T +1 B and V DDV = V DU U DV = B Σ T +1 B, the above inequality becomes 0.9B Σ T +1 B 1 n 2 B X T +1 X T +1 B 1.1B Σ T +1 B. Claim A.3 (guarantee on source training data). Under the setting of Theorem 4.1, with probability at least 1 -δ 5 we have X ( B Ŵ -B * W * ) 2 F σ 2 (kT + kd log(κn 1 ) + log(1/δ)) . Proof. We assume that ( 22) is true, which happens with probability at least 1 -δ 10 according to Claim A.1. Let Θ = B Ŵ and Θ * = B * W * . From the optimality of B and Ŵ for ( 7 ) we have Y -X ( Θ) 2 F ≤ Y -X (Θ * ) 2 F . Plugging in Y = X (Θ * ) + Z, this becomes X ( Θ -Θ * ) 2 F ≤ 2 Z, X ( Θ -Θ * ) . ( ) Let ∆ = Θ -Θ * . Since rank(∆) ≤ 2k, we can write ∆ = V R = [V r 1 , • • • , V r T ] where V ∈ O d,2k and R = [r 1 , • • • , r T ] ∈ R 2k×T . Here O d1,d2 (d 1 ≥ d 2 ) is the set of orthonormal d 1 × d 2 matrices (i.e., the columns are orthonormal). For each t ∈ [T ] we further write X t V = U t Q t where U t ∈ O n1,2k and Q t ∈ R 2k×2k . Then we have Z, X (∆) = T t=1 z t X t V r t = T t=1 z t U t Q t r t ≤ T t=1 U t z t • Q t r t ≤ T t=1 U t z t 2 • T t=1 Q t r t 2 = T t=1 U t z t 2 • T t=1 U t Q t r t 2 = T t=1 U t z t 2 • T t=1 X t V r t 2 = T t=1 U t z t 2 • X (∆) F . Next we give a high-probability upper bound on T t=1 U t z t 2 using the randomness in Z. Since U t 's depend on V which depends on Z, we will need an -net argument to cover all possible V ∈ O d,2k . 
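The two-sided spectral concentration asserted in Claims A.1 and A.2 (0.9Σ ⪯ empirical second moment ⪯ 1.1Σ once n ≳ ρ⁴d) is easy to sanity-check numerically; a small sketch with Gaussian inputs and toy dimensions of my own choosing:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 6, 50000                        # n >> d, mimicking n >~ rho^4 (d + log(1/delta))
A = rng.normal(size=(d, d))
Sigma = A @ A.T / d + np.eye(d)        # an arbitrary well-conditioned covariance
L = np.linalg.cholesky(Sigma)
X = rng.normal(size=(n, d)) @ L.T      # rows ~ N(0, Sigma): Gaussian, hence O(1)-subgaussian

S = X.T @ X / n                        # empirical second moment (1/n) X^T X
# 0.9 Sigma <= S <= 1.1 Sigma  iff every eigenvalue of the whitened matrix lies in [0.9, 1.1]
Li = np.linalg.inv(L)
eigs = np.linalg.eigvalsh(Li @ S @ Li.T)
```

With these sizes the whitened eigenvalues land well inside [0.9, 1.1]; shrinking n toward d makes the sandwich fail, which is exactly the sample-size requirement in the claims.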
First, for any fixed V ∈ O d,2k , we let X t V = Ūt Qt where Ūt ∈ O n,2k . The Ūt 's defined in this way are independent of Z. Since Z has i.i.d. N (0, σ 2 ) entries, we know that σ -2 T t=1 Ū t z t 2 is distributed as χ 2 (2kT ). Using the standard tail bound for χ 2 random variables, we know that with probability at least 1 -δ over Z, σ -2 T t=1 Ū t z t 2 kT + log(1/δ ). Therefore, using the same argument in (26) we know that with probability at least 1 -δ ,  Z, X ( V R) σ kT + log(1/δ ) X ( V R) F . Now, -δ |N |, Z, X ( V R) σ kT + log(1/δ ) X ( V R) F , ∀ V ∈ N . ( ) Choosing δ = δ 20( 6 √ 2k ) 2kd , we know that (27) holds with probability at least 1 -δ 20 . We will use ( 22), ( 25) and ( 27) to complete the proof of the claim. This is done in the following steps: 1. Upper bounding Z F . Since σ -2 Z 2 F ∼ χ 2 (n 1 T ), we know that with probability at least 1 -δ 20 , Z 2 F σ 2 (n 1 T + log(1/δ)). 2. Upper bounding ∆ F . From (25) we have X (∆) 2 F ≤ 2 Z F X (∆) F , which implies X (∆) F ≤ 2 Z F σ n 1 T + log(1/δ). On the other hand, letting the t-th column of ∆ be δ t , we have X (∆) 2 F = T t=1 X t δ t 2 = T t=1 δ t X t X t δ t ≥ 0.9n 1 T t=1 δ t Σ t δ t (using (22)) ≥ 0.9n 1 T t=1 λ min (Σ t ) δ t 2 ≥ 0.9n 1 λ ∆ 2 F , where λ = min t∈[T ] λ min (Σ t ). Hence we obtain ∆ 2 F X (∆) 2 F n 1 λ σ 2 (n 1 T + log(1/δ)) n 1 λ . 3. Applying the -net N . Let V ∈ N such that V -V F ≤ . Then we have X (V R -V R) 2 F = T t=1 X t (V -V )r t 2 ≤ T t=1 X t 2 V -V 2 r t 2 ≤ T t=1 1.1n 1 λ max (Σ t ) 2 r t 2 (using (22)) ≤ 1.1n 1 λ 2 R 2 F ( λ = max t∈[T ] λmax(Σt)) = 1.1n 1 λ 2 ∆ 2 F ( ∆ F = V R F = R F ) n 1 λ 2 • σ 2 (n 1 T + log(1/δ)) n 1 λ = κ 2 σ 2 (n 1 T + log(1/δ)). ( ) 4. Finishing the proof. 
We have the following chain of inequalities: 1 2 X (∆) 2 F ≤ Z, X (∆) (using (25)) = Z, X ( V R) + Z, X (V R -V R) σ kT + log(1/δ ) X ( V R) F + Z F X (V R -V R) F (using (27)) ≤ σ kT + log(1/δ ) X (V R) F + X (V R -V R) F + σ n 1 T + log(1/δ) X (V R -V R) F (using (28)) σ kT + log(1/δ ) X (V R) F + σ n 1 T + log(1/δ ) X (V R -V R) F (using k < n 1 and δ < δ) σ kT + log(1/δ ) X (∆) F + σ n 1 T + log(1/δ ) • κ 2 σ 2 (n 1 T + log(1/δ)) (using (29)) ≤ σ kT + log(1/δ ) X (∆) F + σ 2 √ κ(n 1 T + log(1/δ )). Finally, we let = k √ κn1 , and recall δ = δ 20( 6 √ 2k ) 2kd . Then the above inequality implies X (∆) F max σ kT + log(1/δ ), σ 2 √ κ(n 1 T + log(1/δ )) = max σ kT + log(1/δ ), σ k n 1 (n 1 T + log(1/δ )) ≤ max σ kT + log(1/δ ), σ kT + log(1/δ )) (using k < n 1 ) = σ kT + log(1/δ ) σ kT + kd log k + log 1 δ ≤ σ kT + kd log(κn 1 ) + log 1 δ . The high-probability events we have used in the proof are ( 22), ( 27) and ( 28). By a union bound, the failure probability is at most δ 10 + δ 20 + δ 20 = δ 5 . Therefore the proof is completed. Claim A.4 (Guarantee on target training data). Under the setting of Theorem 4.1, with probability at least 1 -2δ 5 , we have 1 n 2 P ⊥ X T +1 B X T +1 B * 2 F σ 2 kT + kd log(κn 1 ) + log 1 δ cn 1 • σ 2 k (W * ) . Proof. We suppose that the high-probability events in Claims A.1, A.2 and A.3 happen, which holds with probability at least 1 -2δ 5 . Here we instantiate Claim A.2 using B = [ B, B * ] ∈ R d×2k . From the optimality of B and Ŵ in (6) we know X t B ŵt = P Xt B y t = P Xt B (X t B * w * t + z t ) for each t ∈ [T ]. 
Then we have σ 2 (kT + kd log(κn 1 ) + log(1/δ)) X ( B Ŵ -B * W * ) 2 F (from (24)) = T t=1 X t B ŵt -X t B * ŵ * t 2 = T t=1 P Xt B (X t B * w * t + z t ) -X t B * ŵ * t 2 = T t=1 -P ⊥ Xt B X t B * w * t + P Xt B z t 2 = T t=1 -P ⊥ Xt B X t B * w * t 2 + P Xt B z t 2 (the cross term is 0) ≥ T t=1 P ⊥ Xt B X t B * w * t 2 ≥ 0.9n 1 T t=1 P ⊥ Σ 1/2 t B Σ 1/2 t B * w * t 2 (using ( 22) and Lemma A.7) ≥ 0.9cn 1 T t=1 P ⊥ Σ 1/2 T +1 B Σ 1/2 T +1 B * w * t 2 (using Assumption 4.2 and Lemma A.7) = 0.9cn 1 P ⊥ Σ 1/2 T +1 B Σ 1/2 T +1 B * W * 2 F ≥ 0.9cn 1 P ⊥ Σ 1/2 T +1 B Σ 1/2 T +1 B * 2 F • σ 2 k (W * ). Next, we write B = [ B, B * ] I 0 =: BA and B * = [ B, B * ] 0 I =: BC. Recall that we have 1 n2 B X T +1 X T +1 B 1.1B Σ T +1 B from Claim A.2. Then using Lemma A.7 we can obtain 1.1 P ⊥ Σ 1/2 T +1 BA Σ 1/2 T +1 BC 2 F ≥ 1 n 2 P ⊥ X T +1 BA X T +1 BC 2 F , i.e., 1.1 P ⊥ Σ 1/2 T +1 B Σ 1/2 T +1 B * 2 F ≥ 1 n 2 P ⊥ X T +1 B X T +1 B * 2 F . Therefore we get σ 2 (kT + kd log(κn 1 ) + log(1/δ)) 0.9cn 1 1.1n 2 P ⊥ X T +1 B X T +1 B * 2 F • σ 2 k (W * ), completing the proof. Proof of Theorem 4.1. We will use all the high-probability events in Claims A.1, A.2, A.3 and A.4. Here we instantiate Claim A.2 using B = [ B, B * ] ∈ R d×2k . The success probability is at least 1 -4δ 5 . For the target task, the excess risk of our learned linear predictor x → ( B ŵT +1 ) x is ER( B, ŵT +1 ) = 1 2 E x∼p T +1 x ( B ŵT +1 -B * w * T +1 ) 2 = 1 2 ( B ŵT +1 -B * w * T +1 ) Σ T +1 ( B ŵT +1 -B * w * T +1 ). Applying Claim A.2 with B = [ B, B * ], we have 0.9B Σ T +1 B 1 n 2 B X T +1 X T +1 B, which implies 0.9v B Σ T +1 Bv ≤ 1 n2 vB X T +1 X T +1 Bv for v = ŵT +1 w * T +1 . This becomes ( B ŵT +1 -B * w * T +1 ) Σ T +1 ( B ŵT +1 -B * w * T +1 ) ≤ 1 0.9n 2 ( B ŵT +1 -B * w * T +1 ) X T +1 X T +1 ( B ŵT +1 -B * w * T +1 ). Therefore we have ER( B, ŵT +1 ) ≤ 1 1.8n 2 ( B ŵT +1 -B * w * T +1 ) X T +1 X T +1 ( B ŵT +1 -B * w * T +1 ) = 1 1.8n 2 X T +1 ( B ŵT +1 -B * w * T +1 ) 2 . 
From the optimality of ŵT +1 in (8) we know X T +1 B ŵT +1 = P X T +1 B y T +1 = P X T +1 B (X T +1 B * w * T +1 + z T +1 ). It follows that ER( B, ŵT +1 ) 1 n 2 P X T +1 B (X T +1 B * w * T +1 + z T +1 ) -X T +1 B * w * T +1 2 F = 1 n 2 -P ⊥ X T +1 B X T +1 B * w * T +1 + P X T +1 B z T +1 2 F = 1 n 2 P ⊥ X T +1 B X T +1 B * w * T +1 2 F + 1 n 2 P X T +1 B z T +1 2 F . Recall that w * T +1 ∼ ν and E w∼ν [ww ] ≤ O( 1 k ). Taking expectation over w * T +1 ∼ ν and denoting Σ = E w∼ν [ww ], we obtain E w * T +1 ∼ν [ER( B, ŵT +1 )] 1 n 2 E w * T +1 ∼ν P ⊥ X T +1 B X T +1 B * w * T +1 2 F + 1 n 2 P X T +1 B z T +1 2 F = 1 n 2 E w * T +1 ∼ν Tr P ⊥ X T +1 B X T +1 B * w * T +1 w * T +1 P ⊥ X T +1 B X T +1 B * + 1 n 2 P X T +1 B z T +1 2 F = 1 n 2 Tr P ⊥ X T +1 B X T +1 B * Σ P ⊥ X T +1 B X T +1 B * + 1 n 2 P X T +1 B z T +1 2 F = 1 n 2 P ⊥ X T +1 B X T +1 B * Σ 1/2 2 F + 1 n 2 P X T +1 B z T +1 2 F ≤ 1 n 2 P ⊥ X T +1 B X T +1 B * 2 F Σ 1/2 2 + 1 n 2 P X T +1 B z T +1 2 F 1 n 2 k P ⊥ X T +1 B X T +1 B * 2 F + 1 n 2 P X T +1 B z T +1 2 F (using Σ 1 k ) 1 k • σ 2 (kT + kd log(κn 1 ) + log(1/δ)) cn 1 • σ 2 k (W * ) + 1 n 2 P X T +1 B z T +1 2 F (using Claim A.4) σ 2 (kT + kd log(κn 1 ) + log(1/δ)) cn 1 T + 1 n 2 P X T +1 B z T +1 2 F . (using σ 2 k (W * ) T k ) For the second term above, notice that 1 σ 2 P X T +1 B z T +1 2 F ∼ χ 2 (k) , and thus with probability at least 1 -δ 5 we have 1 σ 2 P X T +1 B z T +1 2 F k + log 1 δ . Therefore we obtain the final bound E w * T +1 ∼ν [ER( B, ŵT +1 )] σ 2 (kT + kd log(κn 1 ) + log(1/δ)) cn 1 T + σ 2 (k + log 1 δ ) n 2 = σ 2 kd log(κn 1 ) cn 1 T + k cn 1 + log 1 δ cn 1 T + k + log 1 δ n 2 σ 2 kd log(κn 1 ) cn 1 T + k + log 1 δ n 2 , where the last inequality is due to cn 1 ≥ n 2 . A.1 TECHNICAL LEMMAS Lemma A.5. Let O d1,d2 = {V ∈ R d1×d2 | V V = I} (d 1 ≥ d 2 ) , and ∈ (0, 1). 
Then there exists a subset N ⊂ O_{d_1,d_2} that is an ε-net of O_{d_1,d_2} in Frobenius norm with |N| ≤ (6√d_2/ε)^{d_1 d_2}, i.e., for any V ∈ O_{d_1,d_2} there exists V' ∈ N such that ‖V − V'‖_F ≤ ε.

Proof. For any V ∈ O_{d_1,d_2}, each column of V has unit ℓ_2 norm. It is well known that there exists an ε/(2√d_2)-net (in ℓ_2 norm) of the unit sphere in R^{d_1} with size at most (6√d_2/ε)^{d_1}. Using this net to cover all the columns, we obtain a set N̄ ⊂ R^{d_1×d_2} that is an ε/2-net of O_{d_1,d_2} in Frobenius norm with |N̄| ≤ (6√d_2/ε)^{d_1 d_2}. Finally, we need to transform N̄ into an ε-net N that is a subset of O_{d_1,d_2}. This can be done by projecting each point of N̄ onto O_{d_1,d_2}. Namely, for each V̄ ∈ N̄, let P(V̄) be its closest point in O_{d_1,d_2} (in Frobenius norm), and define N = {P(V̄) | V̄ ∈ N̄}. Then |N| ≤ |N̄| ≤ (6√d_2/ε)^{d_1 d_2}, and N is an ε-net of O_{d_1,d_2}: for any V ∈ O_{d_1,d_2} there exists V̄ ∈ N̄ such that ‖V − V̄‖_F ≤ ε/2, which implies P(V̄) ∈ N and ‖V − P(V̄)‖_F ≤ ‖V − V̄‖_F + ‖V̄ − P(V̄)‖_F ≤ ‖V − V̄‖_F + ‖V̄ − V‖_F = 2‖V − V̄‖_F ≤ ε.

Lemma A.6. Let a_1, ..., a_n be i.i.d. d-dimensional random vectors such that E[a_i] = 0, E[a_i a_i^T] = I, and a_i is ρ²-sub-gaussian. For δ ∈ (0, 1), suppose n ≳ ρ⁴(d + log(1/δ)). Then with probability at least 1 − δ we have 0.9I ⪯ (1/n) Σ_{i=1}^n a_i a_i^T ⪯ 1.1I.

Proof. Let A = (1/n) Σ_{i=1}^n a_i a_i^T − I. Then it suffices to show ‖A‖ ≤ 0.1 with probability at least 1 − δ. We use a standard ε-net argument for the unit sphere S^{d−1} = {v ∈ R^d : ‖v‖ = 1}. First, consider any fixed v ∈ S^{d−1}. We have v^T A v = (1/n) Σ_{i=1}^n [(v^T a_i)² − 1]. From our assumptions on a_i we know that v^T a_i has mean 0 and variance 1 and is ρ²-sub-gaussian. (Note that we must have ρ ≥ 1.) Therefore (v^T a_i)² − 1 is zero-mean and 16ρ²-sub-exponential. By the Bernstein inequality for sub-exponential random variables, we have for any ε > 0, Pr[|v^T A v| > ε] ≤ 2 exp( −(n/2) min{ ε²/(16ρ²)², ε/(16ρ²) } ). Next, take a 1/5-net N ⊂ S^{d−1} of S^{d−1} with size |N| ≤ e^{O(d)}. By a union bound over all v ∈ N, we have Pr[ max_{v∈N} |v^T A v| > ε ] ≤ 2|N| exp( −(n/2) min{ ε²/(16ρ²)², ε/(16ρ²) } ) ≤ exp( O(d) − (n/2) min{ ε²/(16ρ²)², ε/(16ρ²) } ).
Plugging in = 1 20 and noticing ρ > 1, the above inequality becomes Pr max v∈N |v Av| > 1 20 ≤ exp O(d) - n 2 • (1/20) 2 (16ρ 2 ) 2 ≤ δ, where the last inequality is due to n ρ 4 (d + log(1/δ)). Therefore, with probability at least 1 -δ we have max v∈N |v Av| ≤ 1 20 . Suppose this indeed happens. Next, for any u ∈ S d-1 , there exists u ∈ N such that uu ≤ 1 5 . Then we have u Au ≤ (u ) Au + 2 (u -u ) Au + (u -u ) A(u -u ) ≤ 1 20 + 2 u -u • A • u + u -u 2 • A ≤ 1 20 + 2 • 1 5 • A • 1 + 1 5 2 • A ≤ 1 20 + 1 2 A . Taking a supreme over u ∈ S d-1 , we obtain A ≤ 1 20 + 1 2 A , i.e., A ≤ 1 10 . Lemma A.7. If two matrices A 1 and A 2 (with the same number of columns) satisfy A 1 A 1 A 2 A 2 , then for any matrix B (of compatible dimensions), we have A 1 P ⊥ A1B A 1 A 2 P ⊥ A2B A 2 . As a consequence, for any matrices B and B (of compatible dimensions), we have P ⊥ A1B A 1 B 2 F ≥ P ⊥ A2B A 2 B 2 F . Proof. For the first part of the lemma, it suffices to show the following for any vector v: v A 1 P ⊥ A1B A 1 v ≥ v A 2 P ⊥ A2B A 2 v, which is equivalent to min w A 1 Bw -A 1 v 2 2 ≥ min w A 2 Bw -A 2 v 2 2 . Let w * ∈ arg min w A 1 Bw -A 1 v 2 2 .

Then we have min_w ‖A_1 B w − A_1 v‖_2² = ‖A_1 B w* − A_1 v‖_2² = (Bw* − v)^T A_1^T A_1 (Bw* − v) ≥ (Bw* − v)^T A_2^T A_2 (Bw* − v) = ‖A_2 B w* − A_2 v‖_2² ≥ min_w ‖A_2 B w − A_2 v‖_2², finishing the proof of the first part. For the second part, from A_1^T P⊥_{A_1 B} A_1 ⪰ A_2^T P⊥_{A_2 B} A_2 we know (B')^T A_1^T P⊥_{A_1 B} A_1 B' ⪰ (B')^T A_2^T P⊥_{A_2 B} A_2 B'. Taking the trace on both sides, we obtain ‖P⊥_{A_1 B} A_1 B'‖_F² ≥ ‖P⊥_{A_2 B} A_2 B'‖_F², which finishes the proof.
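Lemma A.7's monotonicity of projection residuals can be checked numerically. Here the hypothesis A_1^T A_1 ⪰ A_2^T A_2 is realized by stacking extra rows onto A_2; all sizes and matrices are hypothetical toy choices:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, k = 9, 6, 2
A2 = rng.normal(size=(n, d))
E = rng.normal(size=(4, d))
A1 = np.vstack([A2, E])            # A1^T A1 = A2^T A2 + E^T E  >=  A2^T A2 in PSD order
B = rng.normal(size=(d, k))
Bp = rng.normal(size=(d, k))       # plays the role of B' in the lemma

def resid_sq(A, B, Bp):
    """|| P^perp_{AB} A Bp ||_F^2: squared residual of A Bp after projecting onto col(A B)."""
    Q, _ = np.linalg.qr(A @ B)     # orthonormal basis of col(A B)
    M = A @ Bp
    return np.linalg.norm(M - Q @ (Q.T @ M)) ** 2

lhs, rhs = resid_sq(A1, B, Bp), resid_sq(A2, B, Bp)
# Lemma A.7 predicts lhs >= rhs for every choice of B, Bp
```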

B PROOF OF THEOREM 5.1

Here we first prove an important intermediate result on the in-sample risk, which explains how the Gaussian width of F X (Φ) arises. Claim B.1 (analogue of Claim A.3). Let φ and ŵ1 , . . . , ŵT be the optimal solution to (2). Then with probability at least 1 -δ we have T t=1 φ(X t ) ŵt -φ * (X t )w * t 2 σ 2 G(F X (Φ)) 2 + log 1 δ . Proof. By the optimality of φ and ŵ1 , . . . , ŵT for (2), we know T t=1 y t -φ(X t ) ŵt 2 ≤ T t=1 y t -φ * (X t )w * t 2 . Plugging in y t = φ * (X t )w * t + z t (z t ∼ N (0, I) is independent of X t ), we get T t=1 φ * (X t )w * t + z t -φ(X t ) ŵt 2 ≤ T t=1 z t 2 , which gives T t=1 φ(X t ) ŵt -φ * (X t )w * t 2 ≤ 2 t t=1 z t , φ(X t ) ŵt -φ * (X t )w * t . Denote Z = [z 1 , • • • , z T ] ∈ R n1×T and A = [a 1 , • • • , a T ] ∈ R n1×T where a t = φ(X t ) ŵt - φ * (X t )w * t . Then the above inequality reads A 2 F ≤ 2 Z, A . Notice that A A F ∈ F X (Φ) (c.f. (10)). It follows that A F ≤ 2 Z, A A F ≤ 2 sup Ā∈F X (Φ) Z, Ā . By definition, we have E Z sup Ā∈F X (Φ) σ -1 Z, Ā = G(F X (Φ)) . Furthermore, since the function Z → sup Ā∈F X (Φ) Z, Ā is 1-Lipschitz in Frobenius norm , by the standard Gaussian concentration inequality, we have with probability at least 1 -δ, sup Ā∈F X (Φ) σ -1 Z, Ā ≤ E sup Ā∈F X (Φ) σ -1 Z, Ā + log 1 δ = G(F X (Φ)) + log 1 δ . Then the proof is completed using (30). The proof is conditioned on several high-probability events, each happening with probability at least 1 -Ω(δ). By a union bound at the end, the final success probability is also at least 1 -Ω(δ). We can always rescale δ by a constant factor such that the final probability is at least 1 -δ. Therefore, we will not carefully track the constants before δ in the proof. All the δ's should be understood as Ω(δ). We use the following notion of representation divergence. Definition B.1 (divergence between two representations). 
Given a distribution q over R d and two representation functions φ, φ ∈ Φ, the divergence between φ and φ with respect to q is defined as D q (φ, φ ) = Σ q (φ , φ ) -Σ q (φ , φ) (Σ q (φ, φ)) † Σ q (φ, φ ) ∈ R k×k . It is easy to verify D q (φ, φ ) 0, D q (φ, φ) = 0 for any φ, φ and q. See Lemma B.2's proof. The next lemma shows a relation between (symmetric) covariance and divergence. Lemma B.2. Suppose that two representation functions φ, φ ∈ Φ and two distributions q, q over R d satisfy Λ q (φ, φ ) α • Λ q (φ, φ ) for some α > 0. Then it must hold that D q (φ, φ ) α • D q (φ, φ ). Proof. Fix any v ∈ R k . We will prove v D q (φ, φ )v ≥ α • v D q (φ, φ )v, which will complete the proof of the lemma. We define a quadratic function f : R k → R as f (w) = [w , -v ]Λ q (φ, φ ) w -v . According to Definition 5.2, we can write f (w) = w Σ q (φ, φ)w -2w Σ q (φ, φ )v + v Σ q (φ , φ )v = E x∼q w φ(x) -v φ (x) 2 . Therefore we have f (w) ≥ 0 for any w ∈ R k .foot_5 This means that f must have a global minimizer in R k . Since f is convex, taking its gradient ∇f (w) = 2Σ q (φ, φ)w -2Σ q (φ, φ )v and setting the gradient to 0, we obtain a global minimzer w * = (Σ q (φ, φ)) † Σ q (φ, φ )v. Plugging this into the definition of f , we obtainfoot_6  min w∈R k f (w) = f (w * ) = v D q (φ, φ )v. Similarly, letting g(w) = [w , -v ]Λ q (φ, φ ) w -v , we have min w∈R k g(w) = v D q (φ, φ )v. From Λ q (φ, φ ) α • Λ q (φ, φ ) we know f (w) ≥ αg(w) for any w ∈ R k . Recall that w * ∈ arg min w∈R k f (w). We have αv D q (φ, φ )v = α min w∈R k g(w) ≤ αg(w * ) ≤ f (w * ) = min w∈R k f (w) = v D q (φ, φ )v. This finishes the proof. Claim B.3 (analogue of Claim A.4). Under the setting of Theorem 5.1, with probability at least 1 -δ we have 1 n 2 P ⊥ φ(X T +1 ) φ * (X T +1 ) 2 F σ 2 G(F X (Φ)) 2 + log 1 δ n 1 σ 2 k (W * ) . Proof. We continue to use the notation from Claim B.1 and its proof. Let pt be the empirical distribution over the samples in X t (t ∈ [T +1]). 
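The divergence of Definition B.1 is a Schur complement of the joint second-moment matrix of (φ(x), φ'(x)), which is why D_q(φ, φ') ⪰ 0 and D_q(φ, φ) = 0. A small numpy check using linear representations on an empirical distribution (toy sizes of my choosing):

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, k = 400, 7, 3
X = rng.normal(size=(n, d))
P, Pp = rng.normal(size=(d, k)), rng.normal(size=(d, k))
F, Fp = X @ P, X @ Pp      # phi(x) = P^T x and phi'(x) = Pp^T x evaluated on the n samples

S = F.T @ F / n            # Sigma_q(phi, phi)
C = F.T @ Fp / n           # Sigma_q(phi, phi')
Sp = Fp.T @ Fp / n         # Sigma_q(phi', phi')

D = Sp - C.T @ np.linalg.pinv(S) @ C        # D_q(phi, phi'), a Schur complement
D_self = S - S @ np.linalg.pinv(S) @ S      # D_q(phi, phi): should vanish
eigs = np.linalg.eigvalsh(D)
```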
According to Assumptions 5.1 and 5.2 as well as the setting in Theorem 5.1, we know that the followings are satisfied with probability at least 1 -δ: 0.9Λ p (φ, φ ) Λ pt (φ, φ ) 1.1Λ p (φ, φ ), ∀φ, φ ∈ Φ, ∀t ∈ [T ], 0.9Λ p ( φ, φ * ) Λ pT +1 ( φ, φ * ) 1.1Λ p ( φ, φ * ). Notice that φ and φ * are independent of the samples from the target task, so n 2 ≥ N point (Φ, p, δ 3 ) is sufficient for the second inequality above to hold with high probability. Using Lemma B.2, we know that (32) implies 0.9D p (φ, φ ) D pt (φ, φ ) 1.1D p (φ, φ ), ∀φ, φ ∈ Φ, ∀t ∈ [T ], 0.9D p ( φ, φ * ) D pT +1 ( φ, φ * ) 1.1D p ( φ, φ * ). ( ) By the optimality of φ and ŵ1 , . . . , ŵT for (2), we know φ(X t ) ŵt = P φ(Xt) y t = P φ(Xt) (φ * (X t )w * t + z t ). Then we have the following chain of inequalities: σ 2 G(F X (Φ)) 2 + log 1 δ T t=1 φ(X t ) ŵt -φ * (X t )w * t 2 (Claim B.1) = T t=1 P φ(Xt) (φ * (X t )w * t + z t ) -φ * (X t )w * t 2 = T t=1 -P ⊥ φ(Xt) φ * (X t )w * t + P φ(Xt) z t 2 = T t=1 P ⊥ φ(Xt) φ * (X t )w * t 2 + P φ(Xt) z t 2 (cross term is 0) ≥ T t=1 P ⊥ φ(Xt) φ * (X t )w * t 2 = T t=1 (w * t ) φ * (X t ) I -φ(X t ) φ(X t ) φ(X t ) † φ(X t ) φ * (X t )w * t = n 1 T t=1 (w * t ) D pt ( φ, φ * )w * t ≥ 0.9n 1 T t=1 (w * t ) D p ( φ, φ * )w * t ((33)) = 0.9n 1 D p ( φ, φ * ) 1/2 W * 2 F ≥ 0.9n 1 D p ( φ, φ * ) 1/2 2 F σ 2 k (W * ) = 0.9n 1 Tr D p ( φ, φ * ) σ 2 k (W * ) ≥ 0.9n 1 1.1 Tr D pT +1 ( φ, φ * ) σ 2 k (W * ) ((33)) = 0.9n 1 1.1n 2 P ⊥ φ(X T +1 ) φ * (X T +1 ) 2 F σ 2 k (W * ), completing the proof. Now we can finish the proof of Theorem 5.1. Proof of Theorem 5.1. 
The excess risk is bounded as ER( φ, ŵT +1 ) = 1 2 E x∼p ŵ T +1 φ(x) -(w * T +1 ) φ * (x) 2 = 1 2 ŵT +1 -w * T +1 Λ p ( φ, φ * ) ŵT +1 -w * T +1 ŵT +1 -w * T +1 Λ pT +1 ( φ, φ * ) ŵT +1 -w * T +1 ((32)) = 1 n 2 φ (X T +1 ) ŵT +1 -φ * (X T +1 ) w * T +1 2 = 1 n 2 -P ⊥ φ(X T +1 ) φ * (X T +1 ) w * T +1 + P φ(X T +1 ) z T +1 2 = 1 n 2 P ⊥ φ(X T +1 ) φ * (X T +1 ) w * T +1 2 + P φ(X T +1 ) z T +1 2 1 n 2 P ⊥ φ(X T +1 ) φ * (X T +1 ) w * T +1 2 + σ 2 (k + log 1 δ ) n 2 . (using χ 2 tail bound) Taking expectation over w * T +1 ∼ ν, we get E w * T +1 ∼ν [ER( φ, ŵT +1 )] 1 n 2 P ⊥ φ(X T +1 ) φ * (X T +1 ) 2 F E w∼ν [ww ] + σ 2 (k + log 1 δ ) n 2 1 kn 2 P ⊥ φ(X T +1 ) φ * (X T +1 ) 2 F + σ 2 (k + log 1 δ ) n 2 1 k • σ 2 G(F X (Φ)) 2 + log 1 δ n 1 σ 2 k (W * ) + σ 2 (k + log 1 δ ) n 2 (Claim B.3) σ 2 G(F X (Φ)) 2 + log 1 δ n 1 T + σ 2 (k + log 1 δ ) n 2 , (σ k (W * ) T k ) finishing the proof. C PROOF OF THEOREM 6.1 C.1 PROOF SKETCH OF THEOREM 6.1 Let R = Θ * * . Recall B and Ŵ are derived from Eqn. ( 12) and let Θ := B Ŵ . We first note that the constraint set { w 2 i ≤ R/T, B 2 F ≤ R} ensures W 2 F ≤ R and W B * ≤ R at global minimum. On the other hand, our constraint for W, B is also expressive enough to attain any Θ that satisfies Θ * ≤ R. See reference e.g. Srebro and Shraibman (2005) . Therefore at global minimum Ŵ F ≤ √ R, B F ≤ √ R and Θ * ≤ R. For the ease of proof, we introduce the following auxiliary functions and parameters. Write L 1 (W ) = 1 2 Σ 1/2 Θ * -Σ 1/2 BW 2 F , L λ 1 (W ) =L 1 (W ) + λ 2 W 2 F , W λ 1 ← arg min W {L λ 1 (W )}, L 2 (w) = 1 2 Σ 1/2 θ * T +1 -Σ 1/2 Bw 2 , w λ 2 ← arg min w {L 2 (w) + λ/2 w 2 }, L1 (W ) = 1 2n 1 X (Θ * -BW ) 2 , Lλ 1 (W ) = L1 (W ) + λ 2 W 2 F W λ 1 ← arg min W { Lλ 1 (W )}, L2 (w) = 1 2n 2 X T +1 θ * T +1 -X T +1 Bw 2 , w2 ← arg min w≤r { L2 (w)}. We define terms ic,1 and ic,2 that will be used to bound intrinsic dimension concentration error in the input signal. 
Namely with high probability, Σ 1/2 Θ -1/n 1 T t=1 X t θ t 2 ≤ ic,1 Θ * , and similarly Σ 1/2 Bv -1 n2 X Bv 2 ≤ ic,2 v 2 . Additionally we use ee,i , i ∈ {1, 2} to bound the estimation error (for fixed design) incurred when using noisy label y T +1 and Y . The choice of ee,i , and ic,i are respectively justified in Lemma C.5, Claim C.4, Lemma C.10 and Claim C.11, along with some more detailed descriptions. Proof of Theorem 6.1. E θ * ∼ν ER( B, ŵT +1 ) = E θ * ∼ν L 2 ( ŵT +1 ) E θ * ∼ν L2 ( ŵT +1 ) + 2 ic,2 r 2 (Claim C.11) E θ * ∼ν L2 ( w2 ) + 2 ee,2 r + 2 ic,2 r 2 (Lemma C.4) ≤ E θ * ∼ν L2 (w λ 2 ) + 2 ee,2 r + 2 ic,2 r 2 (Definition of w2 ) E θ * ∼ν L 2 (w λ 2 ) + 2 ee,2 r + 2 ic,2 r 2 (Claim C.11) = 1 T L 1 (W λ 1 ) + 2 ee,2 r + 2 ic,2 r 2 (Claim C.3) λR T + 2 ee,2 r + 2 ic,2 r 2 (Lemma C.2) 2 ee,1 R + 2 ic,1 R 2 T + 2 ee,2 r + 2 ic,2 r 2 . (Choices of λ) Each step is with high probability 1 -δ/10 over the randomness of X or X T +1 . Therefore overall by union bound, with probability 1 -δ, by plugging in the values of ic,i and ee,i we have: E θ * ∼ν ER( B, ŵT +1 ) ≤ σR √ T Õ Tr(Σ) √ T n 1 + Σ √ n 2 + ρ 4 R 2 T Õ TrΣ n 1 + Σ n 2 . Notice a term Σ /n 1 is absorbed by Σ /n 2 since we assume n 1 ≥ n 2 . Published as a conference paper at ICLR 2021 Claim C.1 (guarantee with source regularization). 1 n 1 X (Θ * -Θ) 2 F + λ Θ * ≤ 3λ Θ * * ≤ 3λR, and B 2 F ≤ 3R, Ŵ 2 F ≤ 3R for any λ ≥ 2 n X * (Z) 2 . Here X * is the adjoint operator of X such that X * (Z) = T i=1 X t z t e t . Proof. With the optimality of Θ we have: 1 2n 1 X ( Θ -Θ * ) -Z 2 F + λ Θ * ≤ 1 2n 1 Z 2 F + λ Θ * * , Let ∆ = Θ -Θ * . Therefore 1 2n 1 X (∆) 2 F ≤λ( Θ * * -Θ * ) + 1 n 1 ∆, X * (Z) ≤λ Θ * * + 1 n 1 Θ * * • X * (Z) + 1 n 1 Θ * • X * (Z) -λ Θ * ≤λ Θ * * + λ/2 Θ * * + λ/2 Θ * -λ Θ * (Let λ ≥ 2 n1 X * (Z) ) = 3 2 λ Θ * * - 1 2 λ Θ * . Therefore 1 2n1 X (∆) 2 F + λ 2 Θ * ≤ 3 2 λ Θ * * , and clearly both terms satisfy 1 n1 X (∆) 2 F ≤ 3λ Θ * * and Θ * ≤ 3 Θ * * . Lemma C.2 (source task concentration). 
For a fixed δ > 0, let λ = 2ε_{ee,1} + 2ε_{ic,1} R. Then with probability 1 − δ/10 we have L_1^λ(W_1^λ) ≲ λR and ‖W_1^λ‖_F ≲ √R.

Proof of Lemma C.2. ‖W_1^λ‖_F² < (2/λ) L_1^λ(W_1^λ) ≤ (2/λ) L_1^λ(Ŵ) (definition of W_1^λ) = (2/λ) [ ½‖Σ^{1/2}(Θ* − Θ̂)‖_F² + (λ/2)‖Ŵ‖_F² ] ≤ (2/λ) [ ½( (1/√n_1)‖X(Θ* − Θ̂)‖_F + O(ε_{ic,1})R )² + (λ/2)‖Ŵ‖_F² ] ≤ (2/λ) [ (1/n_1)‖X(Θ* − B̂Ŵ^λ)‖_F² + (λ/2)‖Ŵ^λ‖_F² + O(ε_{ic,1}² R²) ] ≤ (2/λ) [ 6λR + O(ε_{ic,1}² R²) ] (from Claim C.1) = (2/λ) O(λR) = O(R). Thus both results have been shown.
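The analysis in this section rests on the variational characterization ‖Θ‖_* = min_{BW=Θ} (‖B‖_F² + ‖W‖_F²)/2 (Srebro and Shraibman, 2005), which is how the Frobenius-norm constraints on B and W control the nuclear norm of Θ = BW. A short numpy check that a balanced SVD factorization attains this value while an unbalanced one only upper-bounds it:

```python
import numpy as np

rng = np.random.default_rng(5)
d, k, T = 6, 3, 8
Theta = rng.normal(size=(d, k)) @ rng.normal(size=(k, T))   # a rank-k matrix
nuc = np.linalg.norm(Theta, ord='nuc')                      # nuclear norm ||Theta||_*

U, s, Vt = np.linalg.svd(Theta, full_matrices=False)
# balanced factorization: both factors carry sqrt of each singular value
Bb, Wb = U * np.sqrt(s), np.sqrt(s)[:, None] * Vt
cost_bal = 0.5 * (np.linalg.norm(Bb) ** 2 + np.linalg.norm(Wb) ** 2)

# an unbalanced factorization of the same Theta is only more expensive (AM-GM per direction)
Bu, Wu = U * s, Vt
cost_unb = 0.5 * (np.linalg.norm(Bu) ** 2 + np.linalg.norm(Wu) ** 2)
```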

Claim C.3 (Source and Target Connections

). E θ * ∼ν L 2 (w λ 2 ) =L 1 (W λ 1 ) Proof of Claim C.3. w λ 2 = ( B Σ B + λI) -1 B Σθ * T +1 =: S λ θ * T +1 , where S λ := ( B Σ B + λI) -1 B Σ. E θ * ∼ν L 2 (w λ 2 ) = E θ * ∼ν Σ 2 (I -S λ )θ * T +1 2 = 1 T Σ 2 (I -S λ )Θ * T +1 2 = 1 T L 1 (W λ ). Lemma C.4 (Estimation Error for Target Task). L2 ( ŵ) -L2 ( w) ≤ R √ T n 2 σ(log 1/δ) 3/2 log(n 2 ) Σ =: 2 ee,2 r. Proof of Lemma C.4. With the definition of ŵ we write the basic inequality: 1 2n 2 X T +1 ( B ŵ -θ * ) -z T +1 2 F ≤ 1 2n 2 X T +1 ( B w -θ * ) -z T +1 2 F , Therefore by rearranging we get: 1 2n 2 X T +1 B( ŵ -w) 2 F ≤ 1 n 2 ŵ -w, B X T +1 z T +1 ≤ R/T n 2 B X T +1 z T +1 2 F R √ T n 2 σ log 2/3 (1/δ) log(n 2 ) Σ (Claim C.6) =O(r 2 ee,2 ) C.2 TECHNICAL LEMMAS This section includes the technical details for several parts: bounding the noise term from basic inequality; and intrinsic dimension concentration for both source and target tasks. Lemma C.5 (Regularizer Estimation). For X ∈ R n×d drawn from distribution p with covariance matrix Σ, and noise Z ∼ N (0, σ 2 I n ), with high probability 1 -δ, we have 2 ee,1 := 1 n X Z 2 ≤ 1 √ n σ log 1 δ 3/2 log(T + n) T Σ + Tr(Σ). Proof. We use matrix Bernstein with intrinsic dimension to bound λ (See Theorem 7.3.1 in Tropp et al. (2015) ). Write Finally from Hanson-Wright inequality, the upper bound on each term is A = 1 √ n X Z = 1 √ n T t=1 X z t e t =: T t=1 S t . E X,Z [AA ] = E X T t=1 1 n 1 X E Z [z t z t ]X =σ 2 T Σ E X,Z [A A] = T t=1 1 n e E X,Z z t XX z t e t = T t=1 1 n E X,Z [z t XX z t ]e t e t S t 2 ≤ Xz t 2 ≤ σ 2 Tr(Σ) + σ 2 Σ log 1 δ + σ 2 Σ F log 1 δ with probability 1 -δ. Thus using Σ F ≤ Tr(Σ), S t ≤ σ (1 + log 1 δ )Tr(Σ) + Σ log 1 δ =: L. Then from intrinsic matrix bernstein (Theorem 7.3.1 in Tropp et al. (2015) ), with probability 1 -δ we have, A ≤ O(σ log 1 δ v log(d Σ ) + σ log 1 δ L log(d Σ )), which gives A ≤ σ log 1 δ T Σ log(T + n) + log 1 δ Tr(Σ) log(T + n) + log 1 δ σL log(T + n) σ log 1 δ 3/2 log(T + n) T Σ + Tr(Σ). 
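Lemma C.5 controls the operator norm of the rescaled noise correlation (1/√n) X^T Z at the scale σ√((T‖Σ‖ + Tr(Σ)) log(T+n)), up to log(1/δ) factors. A Monte Carlo sketch comparing the empirical norm against that scale; the sizes, covariance, and the generous factor of 2 in the comparison are my choices:

```python
import numpy as np

rng = np.random.default_rng(6)
n, d, T, sigma = 300, 5, 8, 0.7
A = rng.normal(size=(d, d))
Sigma = A @ A.T / d + 0.5 * np.eye(d)                       # input covariance
L = np.linalg.cholesky(Sigma)
bound = sigma * np.sqrt((T * np.linalg.eigvalsh(Sigma)[-1] + np.trace(Sigma))
                        * np.log(T + n))                    # the T||Sigma|| + Tr(Sigma) scale

vals = []
for _ in range(100):
    X = rng.normal(size=(n, d)) @ L.T                       # shared design, rows ~ N(0, Sigma)
    Z = sigma * rng.normal(size=(n, T))                     # per-task Gaussian noise
    vals.append(np.linalg.norm(X.T @ Z, 2) / np.sqrt(n))    # operator norm of n^{-1/2} X^T Z
mean_norm = np.mean(vals)
```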
Claim C.6 (target noise concentration). For a fixed δ > 0, with probability 1 -δ/10, 2 ee,2 := 1 √ n2 B X T +1 z 2 ≤ O(log 2/3 (1/δ) log(n 2 ) Tr( B Σ B)) ≤ Õ( Σ 2 R). Proof. The first inequality directly follows from Lemma C.5. Meanwhile Tr( B Σ B) = Σ, B B ≤ Σ 2 B B * Σ 2 R. This finishes the proof. Definition C.7. The sub-gaussian norm of some vector y is defined as:  y ψ2 := sup x∈S n-1 y, x ψ2 , where the radius r(T ) := sup x∈T x 2 . Lemma C.10 (intrinsic dimension concentration). Let X, X t , t ∈ [T ] be n × d matrix whose rows x are independent, isotropic and sub-gaussian random vectors in R d that satisfy Assumption 4.1, and the whitening distribution is with sub-gaussian norm C 1 ρ, where E[x] = 0 and E[xx ] = Σ. For a fixed δ > 0, and any v ∈ R d , we have Σ 1/2 v 2 ≤ 1 √ n Xv 2 + Cρ 2 √ n Tr(Σ) + log(2/δ) Σ v 2 . For any Θ ∈ R d×T , we further have Σ 1/2 Θ F ≤ 1 √ n T t=1 X t θ t 2 2 + ic,1 Θ * , where ic,1 := 2Cρ 2 √ n Tr(Σ) + log(2/δ) Σ , with probability 1 -δ. Proof. We use Theorem C.9. Let T = {v : |Σ -1/2 v| 2 ≤ 1}. Let x = Σ 1/2 z, X = ZΣ 1/2 . Then γ(T ) = Tr(Σ), r(T ) = Σ 1/2 . We note with probability 1 -δ,  sup v =1 1 √ n Xv 2 -Σ 1/2 v 2 = sup v∈T 1 √ n Z v 2 -v 2 ≤ Cρ 2 √ n γ ≥ Σ 1/2 Θ 2 F - 2Cρ 2 √ n Tr(Σ) + log(2/δ) Σ Θ * Σ 1/2 Θ F . = 1 2 β 2 + 1 2 W 2 F ≤ R. Thus the network given by α t φ(x) has the same network outputs and regularizer values. Thus γ ≤ γ d . Finally, we show that γ d ≤ γ d . Let bj for j ∈ [d] be the support of the optimal measure of (39). Define β j = α( bj ) 2 , B = BD β where B is a matrix whose rows are bj , and W such that W jt = α t ( bj )/ α( bj ) . We verify that the network values agree Finally by our construction β j = W e j , so the regularizer values agree. Thus γ d = γ d . 
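The rescaling step above (moving per-neuron scale between the two layers via D_β, then balancing with AM-GM) can be illustrated directly: ReLU's positive homogeneity keeps the network function fixed, while the balanced scaling minimizes the weight-decay penalty among all such rescalings. All sizes are toy choices of mine:

```python
import numpy as np

rng = np.random.default_rng(7)
d0, d, n = 5, 4, 12
relu = lambda z: np.maximum(z, 0.0)
X = rng.normal(size=(n, d0))
B = rng.normal(size=(d0, d)) * rng.uniform(0.2, 3.0, size=d)  # deliberately unbalanced neurons
W = rng.normal(size=(d, 3))
out = relu(X @ B) @ W

# positive homogeneity: (c * u)_+ = c * (u)_+ for c > 0, so scale moves between layers freely
beta = np.linalg.norm(B, axis=0)
Bn, Wn = B / beta, W * beta[:, None]          # incoming weights rescaled to unit norm
out_n = relu(X @ Bn) @ Wn

# per-neuron AM-GM balancing: equalize ||B e_j|| and ||row_j(W)|| to minimize weight decay
s = np.sqrt(np.linalg.norm(Wn, axis=1))       # since ||Bn e_j|| = 1, balance at s_j^2 = ||row_j(Wn)||
Bb, Wb = Bn * s, Wn / s[:, None]
out_b = relu(X @ Bb) @ Wb

def reg(Bm, Wm):
    return np.linalg.norm(Bm) ** 2 + np.linalg.norm(Wm) ** 2
```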
Finally, we note that the regularizer can be expressed in a variational form. With these in place, Equation (39) can be expressed as Equation (12), with B constrained to be a diagonal operator and with the lifted features φ(x_{it}) in place of the inputs x_{it}.

Proof of Theorem 7.1. The global minimizer of Equation (39) with d = ∞ may have infinite support, so the corresponding value may not be achieved by minimizing (17). However, Theorem 6.1 only requires that we obtain a learner network whose regularized loss is at most the regularized loss of the teacher network. Since the teacher network has d neurons, this value is attainable by (17). Thus the finite-size network does not need to attain the global minimum of (39) for Claim C.1 to apply. Since Theorem 6.1 has no dependence (even in the logarithmic terms) on the input dimension of the data, it can be applied when the inputs are the infinite-dimensional feature vectors φ(x). The only part of the proof of Theorem 6.1 specific to the nuclear norm is that its dual norm is the operator norm. In Lemma C.5 we had an upper bound on ‖(1/n) X^T Z‖_2. Since we now use the ‖·‖_{2,1} norm, we must instead upper bound ‖(1/n) X^T Z‖_{2,∞}, the dual of the (2,1)-norm. Note that ‖A‖_{2,∞} ≤ ‖A‖_2, so the upper bound in Lemma C.5 still applies. Thus, Theorem 7.1 follows from Theorem 6.1.

Proof of (21). The test error of (20) is given by E[ER(f_{B̂,ŵ})] ≳ σ · (1/(2√n)) (‖B*_{T+1}‖² + ‖w*_{T+1}‖_2²) · E_{x_i ∼ p, z ∼ N(0,σ²I)}[‖Φ(X)^T z‖_∞],



Footnotes:
- We use the ℓ_2 loss throughout this paper.
- A random vector x is called ρ²-sub-gaussian if for any fixed unit vector v of the same dimension, the random variable v^T x is ρ²-sub-gaussian, i.e., E[exp(s · v^T(x − E[x]))] ≤ exp(s²ρ²/2) for all s ∈ R.
- Note that Assumption 4.2 is a significant generalization of the identically distributed isotropic assumption used in the concurrent work of Tripuraneni et al. (2020): they require Σ_1 = Σ_2 = ··· = Σ_{T+1} = I.
- Wei et al. (2019) show that (17) can be minimized with polynomial iteration complexity using perturbed gradient descent, though potentially exponential width is required.
- There exists an ε/(2√d_2)-net (in ℓ_2 norm) of the unit sphere in R^{d_1} with size (6√d_2/ε)^{d_1}; this net is used to cover the columns in the proof of Lemma A.5.
- Note that we have proved Λ_q(φ, φ') ⪰ 0.
- Note that (31) implies D_q(φ, φ') ⪰ 0.



We use the standard O(·), Ω(·) and Θ(·) notation to hide universal constant factors. We also write a ≲ b (or b ≳ a) to indicate a = O(b), and a ≫ b (or b ≪ a) to mean that a ≥ C · b for a sufficiently large universal constant C > 0.

Tr(Σ)I n . Therefore the matrix variance statistic of the sum v(A) satisfies: v(A) = σ 2 max{T Σ , Tr(Σ)}. Denote V = diag([T Σ, Tr(Σ)I]) and its intrinsic dimension d Σ = tr(V )/ V . Tr(V ) = σ 2 (T + n)Tr(Σ), and V 2 ≥ σ 2 Tr(Σ). Therefore d Σ ≤ T + n.

)where S n-1 denotes the unit Euclidean sphere in R n . Definition C.8. Let T ⊂ R d be a bounded set, and g be a standard normal random vector in R d , i.e., g ∼ N (0, I d ).Then the quantitiesw(T ) := E sup x∈T g, x , and γ(T ) := E sup x∈T | g, x | (35)are called the Gaussian width of T and the Gaussian complexity of T , respectively. Theorem C.9 (Restated Matrix deviation inequality from Vershynin (2017)). Let A be an m × n matrix whose rows a i are independent, isotropic and sub-gaussian random vectors in R n . Let T ⊂ R n be a fixed bounded set. ThenE sup x∈T Ax 2 -√ m x 2 | ≤ Cρ 2 γ(T ),(36)where K = max i A i ψ2 is the maximal sub-gaussian norm of the rows of A. A high-probability version states as follows. With probability 1 -δ, sup x∈T Ax 2 -√ m x 2 | ≤ Cρ 2 [γ(T ) + log(2/δ)r(T )],

Xv 2 -Σ 1/2 v 2 ≤ Cρ 2 √ n Tr(Σ) + log(2/δ) Σ , ∀ v = 1.Then by homogeneity of v, for arbitrary v, we haveNotice when n C 2 ρ 4 (Tr(Σ) + Σ log 1/δ), term I ≤ 0.1 √ λ. Therefore | Σ 1/2 v 2 -1 √ Xv 2 | ≤ 0.1 √ λ v . Write Θ = U DV , where D = diag(σ 1 , σ 2 , • • • , σ T ).


$$e_t^\top f_{B,W}(x) = e_t^\top W^\top D_\beta (\bar B^\top x)_+ = \sum_j W_{jt}\,\beta_j\,(\bar b_j^\top x)_+ = \alpha_t^\top \phi(x).$$

$$\|\alpha\|_{2,1} = \min_{\beta, w:\ \alpha_t(\bar b) = \beta(\bar b)\, w_t(\bar b)} \int \beta(\bar b)\,\|w(\bar b)\|_2\, d\bar b \quad \text{and} \quad \|W\|_F^2 = \sum_t \int w_t(\bar b)^2\, d\bar b.$$

via the basic inequality (cf. the proofs of Claims C.2 and C.4). By the matrix Bernstein inequality (cf. Lemma C.5 or Wei et al. (2019)),

Informally, if $\alpha \in \mathbb{R}^{D \times T}$ with $D$ potentially infinite, $\|\alpha\|_{2,1} = \min_{\alpha = \mathrm{diag}(\beta) W} \frac{1}{2}\big(\|\beta\|_2^2 + \|W\|_F^2\big)$, which by AM-GM equals the sum of the $\ell_2$ norms of the rows of $\alpha$.
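The finite-dimensional identity can be checked directly: for $\alpha \in \mathbb{R}^{D \times T}$, the balanced factorization $\alpha = \mathrm{diag}(\beta) W$ with $\beta_j^2 = \|\alpha^\top e_j\|_2$ makes the AM-GM bound tight. A sketch (illustrative sizes; variable names are local to the example):

```python
import numpy as np

rng = np.random.default_rng(4)
D, T = 6, 3
alpha = rng.standard_normal((D, T))

row_norms = np.linalg.norm(alpha, axis=1)
l21 = row_norms.sum()                        # ||alpha||_{2,1} = sum of row l2 norms

# Balanced factorization alpha = diag(beta) @ W with beta_j^2 = ||alpha_j||_2,
# so that row j of W also has norm beta_j.
beta = np.sqrt(row_norms)
W = alpha / beta[:, None]
reg = 0.5 * (np.sum(beta**2) + np.linalg.norm(W, 'fro')**2)

assert np.allclose(np.diag(beta) @ W, alpha)
assert np.isclose(reg, l21)                  # AM-GM is tight at the balanced point
```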

from Lemma A.5 we know that there exists an $\epsilon$-net $\mathcal{N}$ of $O_{d,2k}$ in the Frobenius norm such that $\mathcal{N} \subset O_{d,2k}$ and $|\mathcal{N}| \le (6\sqrt{2k}/\epsilon)^{2dk}$.
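For intuition on such covering bounds, the $d = 2$ case of a sphere net can be made explicit: an angular grid gives an $\epsilon$-net of the circle $S^1$ whose size sits well below the standard volumetric bound $(3/\epsilon)^d$. A sketch (illustrative $\epsilon$):

```python
import numpy as np

eps = 0.2
# Angular grid on S^1: spacing chosen so every arc (hence chord) gap is <= eps.
m = int(np.ceil(2 * np.pi / eps))
angles = 2 * np.pi * np.arange(m) / m
net = np.stack([np.cos(angles), np.sin(angles)], axis=1)   # net points on S^1

# Check: every unit vector is within eps of some net point.
test_angles = np.random.default_rng(5).uniform(0, 2 * np.pi, 2000)
pts = np.stack([np.cos(test_angles), np.sin(test_angles)], axis=1)
dists = np.linalg.norm(pts[:, None, :] - net[None, :, :], axis=2).min(axis=1)

assert dists.max() <= eps                    # it is an eps-net
assert len(net) <= (3 / eps) ** 2            # within the generic (3/eps)^d bound for d = 2
```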


ACKNOWLEDGMENTS

SSD acknowledges support of the National Science Foundation (Grant No. DMS-1638352) and the Infosys Membership. JDL acknowledges support of the ARO under MURI Award W911NF-11-1-0303, the Sloan Research Fellowship, and NSF CCF 2002272. WH is supported by NSF, ONR, the Simons Foundation, the Schmidt Foundation, Amazon Research, DARPA and SRC. QL is supported by NSF #2030859 and the Computing Research Association for the CIFellows Project. The authors also acknowledge the generous support of the Institute for Advanced Study during the Theoretical Machine Learning program, where SSD, WH, JDL, and QL were participants.

APPENDIX

$\sqrt{\|\Sigma\| \log(1/\delta)}$, and $C$ is a universal constant.

Proof. This result directly follows from Lemma C.10 by replacing $X$ with $XB$. Notice that the sub-gaussian norm of the whitened distribution of $B^\top x$ remains the same as that of $x$.

D PROOF OF THEOREM 7.1

First, we describe a standard lifting of neural networks to infinite-dimensional linear regression (Wei et al., 2019; Rosset et al., 2007; Bengio et al., 2006). Define the infinite feature vector $\phi(x)$ with coordinates $\phi(x)_{\bar b} = (\bar b^\top x)_+$ for every $\bar b \in S^{d_0 - 1}$. Let $\alpha_t$ be a signed measure on $S^{d_0 - 1}$. The inner product notation denotes integration:
$$\alpha_t^\top \phi(x) = \int_{S^{d_0 - 1}} (\bar b^\top x)_+\, d\alpha_t(\bar b),$$
where $\alpha(u) = [\alpha_1(u), \ldots, \alpha_T(u)]$. The regularizer corresponds to a group $\ell_1$ regularizer on the vector measure $\alpha$.

Proposition D.1. Let $\gamma_d$ be the value of Equation (17) when the network has $d$ neurons and $\gamma_\infty$ be the value of Equation (39). Then $\gamma_\infty \le \gamma_d$.

Proof. Let $B, W$ be solutions to Equation (17). Let $\bar B = B D_\beta^{-1}$, where $D_\beta$ is a diagonal matrix whose entries are $\beta_j = \|B e_j\|_2$. The network satisfies $f_{B,W}(x) = W^\top D_\beta (\bar B^\top x)_+$. Due to the regularizer, and using the AM-GM inequality, at optimality $\beta_j = \|W^\top e_j\|_2$. Next, we verify that the two regularizer values are the same. Let $\bar w_j$ be the $j$-th row vector of $W$. We have
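The rescaling step in this proof (positive homogeneity of the ReLU plus AM-GM balancing) can be checked numerically. The sketch below (illustrative sizes; `relu`, `c`, etc. are local names) normalizes the first-layer columns, verifies the network output is unchanged, and confirms that balanced scales do not increase the Frobenius regularizer:

```python
import numpy as np

rng = np.random.default_rng(6)
d0, width, T = 4, 5, 3
B = rng.standard_normal((d0, width))         # first layer (columns b_j)
W = rng.standard_normal((width, T))          # second layer (rows w_j)
x = rng.standard_normal(d0)

relu = lambda z: np.maximum(z, 0.0)
f = W.T @ relu(B.T @ x)                      # f_{B,W}(x) = W^T (B^T x)_+

# Normalize neurons: beta_j = ||B e_j||_2, Bbar = B D_beta^{-1}.
beta = np.linalg.norm(B, axis=0)
Bbar = B / beta
# Positive homogeneity of ReLU: (B^T x)_+ = D_beta (Bbar^T x)_+
f_lifted = W.T @ (beta * relu(Bbar.T @ x))
assert np.allclose(f, f_lifted)

# Balanced per-neuron rescaling c_j = sqrt(||w_j|| / beta_j) leaves f invariant
# and achieves equality in AM-GM for (||B||_F^2 + ||W||_F^2) / 2.
wnorm = np.linalg.norm(W, axis=1)
c = np.sqrt(wnorm / beta)
B2, W2 = B * c, W / c[:, None]
f2 = W2.T @ relu(B2.T @ x)
assert np.allclose(f, f2)
reg_balanced = 0.5 * (np.linalg.norm(B2, 'fro')**2 + np.linalg.norm(W2, 'fro')**2)
reg_original = 0.5 * (np.linalg.norm(B, 'fro')**2 + np.linalg.norm(W, 'fro')**2)
assert reg_balanced <= reg_original + 1e-9   # balancing never increases the regularizer
```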

