PREDICTING WHAT YOU ALREADY KNOW HELPS: PROVABLE SELF-SUPERVISED LEARNING

Abstract

Self-supervised representation learning solves auxiliary prediction tasks (known as pretext tasks) that do not require labeled data, in order to learn semantic representations. These pretext tasks are created solely from the input features, such as predicting a missing image patch, recovering the color channels of an image from context, or predicting missing words in text; yet predicting this known information helps in learning representations effective for downstream prediction tasks. This paper posits a mechanism based on approximate conditional independence to formalize how solving certain pretext tasks can learn representations that provably decrease the sample complexity of downstream supervised tasks. Formally, we quantify how the approximate independence between the components of the pretext task (conditional on the label and latent variables) allows us to learn representations that can solve the downstream task with drastically reduced sample complexity by simply training a linear layer on top of the learned representation.

1. INTRODUCTION

Self-supervised learning has revitalized machine learning in computer vision, language modeling, and control (Jing & Tian, 2020; Kolesnikov et al., 2019; Devlin et al., 2018; Wang & Gupta, 2015; Jang et al., 2018, and references therein). Training a model with auxiliary tasks based only on input features reduces the extensive costs of data collection and semantic annotation for downstream tasks. It is also known to improve the adversarial robustness of models (Hendrycks et al., 2019; Carmon et al., 2019; Chen et al., 2020a). Self-supervised learning creates pseudo labels solely based on input features, and solves auxiliary prediction tasks (pretext tasks) in a supervised manner. However, the underlying principles of self-supervised learning remain mysterious, since it is a priori unclear why predicting what we already know should help. We thus raise the following questions: What conceptual connection between pretext and downstream tasks ensures good representations? What is a good way to quantify it?

As a thought experiment, consider a simple downstream task of classifying desert, forest, and sea images. A meaningful pretext task is to predict the background color of images (known as image colorization (Zhang et al., 2016)). Denote X_1, X_2, Y to be the input image, color channel, and the downstream label respectively. Given knowledge of the label Y, one can plausibly predict the background X_2 without knowing much else about X_1. In other words, X_2 is approximately independent of X_1 conditional on the label Y. Consider another task of inpainting (Pathak et al., 2016) the front of a building (X_2) from the rest (X_1). While knowing the label "building" (Y) is not sufficient for successful inpainting, adding latent variables Z such as architectural style, location, window positions, etc., will ensure that the variation in X_2 given Y, Z is small.
We can mathematically interpret this as X_1 being approximately conditionally independent of X_2 given Y, Z. In the above settings with conditional independence, the only way to solve the pretext task from X_1 is to first implicitly predict Y and then predict X_2 from Y. Even without labeled data, the information of Y is hidden in the prediction for X_2.

Contributions. We propose a mechanism based on approximate conditional independence (ACI) to explain why solving pretext tasks created from known information can learn representations that provably reduce downstream sample complexity. For instance, the learned representation will only require Õ(k) samples to solve a k-way supervised task under conditional independence (CI). Under ACI (quantified by the norm of a certain partial covariance matrix), we show similar sample complexity improvements. We verify our main theorem (Theorem 4.2) using simulations, check that the pretext task helps when CI is approximately satisfied in the text domain, and demonstrate on a real-world image dataset that a pretext-task-based linear model outperforms or is comparable to many baselines.

1.1. RELATED WORK

Self-supervised learning (SSL) methods in practice: There has been a flurry of self-supervised methods lately. One class of methods reconstructs images from corrupted or incomplete versions of them, such as denoising auto-encoders (Vincent et al., 2008), image inpainting (Pathak et al., 2016), and split-brain autoencoders (Zhang et al., 2017). Pretext tasks are also created using visual common sense, including predicting rotation angle (Gidaris et al., 2018), relative patch position (Doersch et al., 2015), recovering color channels (Zhang et al., 2016), solving jigsaw puzzles (Noroozi & Favaro, 2016), and discriminating images created from distortion (Dosovitskiy et al., 2015). We refer to the above procedures as reconstruction-based SSL. Another popular paradigm is contrastive learning (Chen et al., 2020b; c). The idea is to learn representations that bring similar data points closer while pushing randomly selected points further away (Wang & Gupta, 2015; Logeswaran & Lee, 2018; Arora et al., 2019), or to maximize a contrastive-based mutual information lower bound between different views (Hjelm et al., 2018; Oord et al., 2018; Tian et al., 2019). A popular approach in the text domain is language modeling, where models like BERT and GPT create auxiliary next-word prediction tasks (Devlin et al., 2018; Radford et al., 2018). The natural ordering or topology of data is also exploited in video-based (Wei et al., 2018; Misra et al., 2016; Fernando et al., 2017), graph-based (Yang et al., 2020; Hu et al., 2019), and map-based (Zhang et al., 2019) self-supervised learning. For instance, the pretext task may be to determine the correct temporal order of video frames as in (Misra et al., 2016). Theory for self-supervised learning: Our work initiates a theoretical understanding of reconstruction-based SSL. Most related to our work is the recent theoretical analysis of contrastive learning. Arora et al.
(2019) show guarantees for representations from contrastive learning on linear classification tasks using a class conditional independence assumption, but do not handle approximate conditional independence. Recently, Tosh et al. (2020a) showed that contrastive learning representations can linearly recover any continuous function of the underlying topic posterior under a topic modeling assumption for text. While their assumption bears some similarity to ours, the independent sampling of words that they exploit is a strong assumption that does not generalize to other domains such as images. More recently, concurrent work by Tosh et al. (2020b) shows guarantees for contrastive learning, but not reconstruction-based SSL, under a multi-view redundancy assumption that is very similar to our CI assumption. Wang & Isola (2020) theoretically study contrastive learning on the hypersphere through intuitive properties like alignment and uniformity of representations; however, no theoretical connection is made to downstream tasks. There is a mutual information maximization view of contrastive learning, but Tschannen et al. (2019) point out issues with it. Previous attempts to explain methods based on negative sampling (Mikolov et al., 2013) use the theory of noise contrastive estimation (Gutmann & Hyvärinen, 2010; Ma & Collins, 2018); however, the guarantees are only asymptotic and not for downstream tasks. CI is also used in sufficient dimension reduction (Fukumizu et al., 2004; 2009). CI and redundancy assumptions on multiple views (Kakade & Foster, 2007; Ando & Zhang, 2007) are used to analyze a canonical-correlation based dimension reduction algorithm. Finally, Alain & Bengio (2014) and Vincent (2011) provide theoretical analyses of denoising auto-encoders.

1.2. OVERVIEW OF RESULTS:

Section 2 introduces notation, the setup, and the self-supervised learning procedure considered in this work. In Section 3, we analyze downstream sample complexity under CI. Section 4 presents our main results under relaxed conditions: ACI with latent variables, finite samples for both the pretext and downstream tasks, various function classes, and both regression and classification downstream tasks. Experiments verifying our theoretical findings are in Section 5.

2.1. NOTATION

We use lower case symbols (x) to denote scalar quantities, bold lower case symbols (x) for vectors, capital letters (X) for random variables, and bold capital letters (X) for matrices. P_X denotes the probability law of random variable X, and the space of square-integrable functions with respect to probability measure P is denoted by L^2(P). We use standard O(·) notation to hide universal constant factors and Õ(·) to additionally hide log factors. ‖·‖ stands for the ℓ_2-norm for vectors and the Frobenius norm for matrices.

Linear conditional expectation. E_L[Y|X] denotes the prediction of Y from X with linear regression: E_L[Y|X = x] := W*x + b*, where (W*, b*) := argmin_{W,b} E[‖Y − WX − b‖^2]. In other words, E_L[Y|X] is the best linear predictor of Y given X. Similarly, E[Y|X] ≡ argmin_f E[‖Y − f(X)‖^2] is the best (unrestricted) predictor of Y given X.

(Partial) covariance matrix. For random variables X, Y, we denote by Σ_{XY} the covariance matrix of X and Y. For simplicity we mostly assume E[X] = 0 and E[Y] = 0, so we do not distinguish between E[XY^T] and Σ_{XY}. The partial covariance matrix between X and Y given Z is

Σ_{XY|Z} := cov(X − E_L[X|Z], Y − E_L[Y|Z]) ≡ Σ_{XY} − Σ_{XZ} Σ_{ZZ}^{-1} Σ_{ZY}.  (1)

Sub-gaussian random vectors. A random vector X ∈ R^d is ρ^2-sub-gaussian if for every fixed unit vector v ∈ R^d, the variable v^T X is ρ^2-sub-gaussian, i.e., E[e^{s v^T (X − E[X])}] ≤ e^{s^2 ρ^2 / 2} for all s ∈ R.
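The identity in Equation 1 — the partial covariance matrix equals the covariance of the residuals after linearly regressing out Z — can be verified numerically. Below is a minimal sketch with illustrative (hypothetical) linear-Gaussian data; `cov` is a helper function, not notation from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
# Hypothetical linear-Gaussian data: Z drives both X and Y.
Z = rng.normal(size=(n, 2))
X = Z @ rng.normal(size=(2, 3)) + 0.5 * rng.normal(size=(n, 3))
Y = Z @ rng.normal(size=(2, 2)) + 0.5 * rng.normal(size=(n, 2))

def cov(A, B):
    # Empirical cross-covariance (rows are samples).
    A = A - A.mean(0)
    B = B - B.mean(0)
    return A.T @ B / len(A)

# Closed form from Equation 1: Sigma_XY - Sigma_XZ Sigma_ZZ^{-1} Sigma_ZY.
S_closed = cov(X, Y) - cov(X, Z) @ np.linalg.inv(cov(Z, Z)) @ cov(Z, Y)

# Definition: covariance of residuals of the best linear predictors given Z.
bx = np.linalg.lstsq(Z, X, rcond=None)[0]
by = np.linalg.lstsq(Z, Y, rcond=None)[0]
S_resid = cov(X - Z @ bx, Y - Z @ by)
```

The two matrices agree up to the small discrepancy from omitting an intercept in the least-squares fits.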

2.2. SETUP AND METHODOLOGY

We denote by X_1 the input variable, X_2 the target random variable for the pretext task, and Y the label for the downstream task, with X_1 ∈ X_1 ⊂ R^{d_1}, X_2 ∈ X_2 ⊂ R^{d_2} and Y ∈ Y ⊂ R^k. If Y is finite with |Y| = k, we take Y ⊂ R^k to be the one-hot encoding of the labels. P_{X_1 X_2 Y} denotes the joint distribution over X_1 × X_2 × Y; P_{X_1 Y} and P_{X_1} denote the corresponding marginal distributions. Our proposed self-supervised learning procedure is as follows:

Step 1 (pretext task): Learn a representation ψ(x_1) via ψ := argmin_{g ∈ H} E[‖X_2 − g(X_1)‖^2_F], where the function class H differs across the settings we specify and discuss later.

Step 2 (downstream task): Perform linear regression of Y on ψ(X_1), i.e., f(x_1) := (W*)^T ψ(x_1), where W* ← argmin_W E_{X_1, Y}[‖Y − W^T ψ(X_1)‖^2]. That is, we learn f(·) = E_L[Y|ψ(·)].

Performance of the learned representation on the downstream task depends on the following quantities.

Approximation error. We measure this for a learned representation ψ by learning a linear function on top of it for the downstream task. Denote e_apx(ψ) = min_W E[‖f*(X_1) − W^T ψ(X_1)‖^2], where f*(x_1) = E[Y|X_1 = x_1] is the optimal predictor for the task. This measures how well ψ can do when given infinite samples for the downstream task.

Estimation error. We measure the sample complexity of ψ on the downstream task, assuming access to n_2 i.i.d. samples (x_1^{(1)}, y^{(1)}), …, (x_1^{(n_2)}, y^{(n_2)}) drawn from P_{X_1 Y}. We express the n_2 samples collectively as X_1^{down} ∈ R^{n_2 × d_1}, Y ∈ R^{n_2 × k}, and overload notation to write ψ(X_1^{down}) = [ψ(x_1^{(1)}) | ψ(x_1^{(2)}) | … | ψ(x_1^{(n_2)})]^T ∈ R^{n_2 × d_2}. We perform linear regression on the learned representation ψ and are interested in the excess risk, which measures generalization.
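The two-step procedure above can be sketched end-to-end. The following is a minimal illustration, not the paper's experimental setup: all constants and the data generator are hypothetical, H is taken to be linear maps, and both steps reduce to least squares:

```python
import numpy as np

rng = np.random.default_rng(1)
k, d1, d2 = 3, 20, 10
n1, n2 = 5000, 500  # pretext (unlabeled) and downstream (labeled) sample sizes
M1 = rng.normal(size=(k, d1))
M2 = rng.normal(size=(k, d2))

def sample(n):
    # Y is a one-hot label; X1 and X2 are conditionally independent given Y.
    Y = np.eye(k)[rng.integers(k, size=n)]
    X1 = Y @ M1 + rng.normal(size=(n, d1))
    X2 = Y @ M2 + rng.normal(size=(n, d2))
    return X1, X2, Y

# Step 1 (pretext): psi(x1) = B^T x1, learned by regressing X2 on X1.
X1_pre, X2_pre, _ = sample(n1)
B = np.linalg.lstsq(X1_pre, X2_pre, rcond=None)[0]

# Step 2 (downstream): linear regression of Y on psi(X1) with n2 labels.
X1_down, _, Y_down = sample(n2)
W = np.linalg.lstsq(X1_down @ B, Y_down, rcond=None)[0]

# Evaluate: predict the class by the argmax of the k linear outputs.
X1_test, _, Y_test = sample(2000)
acc = ((X1_test @ B @ W).argmax(1) == Y_test.argmax(1)).mean()
print(f"downstream accuracy: {acc:.3f}")
```

The downstream step fits only a linear layer on top of the fixed representation, mirroring the procedure analyzed in the rest of the paper.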
Ŵ ← argmin_W (1/(2 n_2)) ‖Y − ψ(X_1^{down}) W‖^2_F;   ER_ψ(Ŵ) := (1/2) E‖f*(X_1) − Ŵ^T ψ(X_1)‖^2_2.

3. GUARANTEED RECOVERY WITH CONDITIONAL INDEPENDENCE

In this section, we focus on the case when the input X_1 and pretext target X_2 are conditionally independent (CI) given the downstream label Y. While this is a strong assumption that is rarely satisfied in practice, it helps us understand the role of CI with clean results and builds up to our main results with ACI with latent variables in Section 4. As a warm-up, we show how CI helps when (X_1, X_2, Y) are jointly Gaussian, to give a flavor of the results to follow. We then analyze general random variables under two settings: (a) when the function class used for ψ is universal, and (b) when ψ is restricted to be a linear function of given features. For now we assume access to a large amount of unlabeled data so that the optimal ψ* is learned perfectly; this is relaxed in Section 4. The general recipe for the results is as follows:

1. Find a closed-form expression for the optimal solution ψ* of the pretext task.
2. Use conditional independence to argue that e_apx(ψ*) is small.
3. Exploit the low rank structure of ψ* to show small estimation error on downstream tasks.

Data assumption. Suppose Y = f*(X_1) + N, where f* = E[Y|X_1] and hence E[N] = 0. We assume N is σ^2-subgaussian. For simplicity, we assume non-degeneracy: Σ_{X_i X_i} and Σ_{YY} are full rank.

3.1. WARM-UP: JOINTLY GAUSSIAN VARIABLES

We assume X_1, X_2, Y are jointly Gaussian, so the optimal regression functions are all linear, i.e., E[Y|X_1] = E_L[Y|X_1]. We also assume the data is centered: E[X_i] = 0 and E[Y] = 0 (non-centered data can be handled by learning an intercept). All relationships between the random variables can then be captured by (partial) covariance matrices, so it is easy to quantify the CI property and establish necessary and sufficient conditions that make X_2 a reasonable pretext task.

Assumption 3.1 (Jointly Gaussian). X_1, X_2, Y are jointly Gaussian.

Assumption 3.2 (Conditional independence). X_1 ⊥ X_2 | Y.

Claim 3.1 (Closed-form solution). Under Assumption 3.1, the representation that minimizes the population pretext risk is

ψ*(x_1) := E_L[X_2|X_1 = x_1] = Σ_{X_2 X_1} Σ_{X_1 X_1}^{-1} x_1.  (2)

Our target is f*(x_1) := E_L[Y|X_1 = x_1] = Σ_{Y X_1} Σ_{X_1 X_1}^{-1} x_1, and our prediction for the downstream task with representation ψ* will be g(·) := E_L[Y|ψ*(X_1)]. Recall from Equation 1 that the partial covariance matrix between X_1 and X_2 given Y is Σ_{X_1 X_2 | Y} ≡ Σ_{X_1 X_2} − Σ_{X_1 Y} Σ_{YY}^{-1} Σ_{Y X_2}, which captures the correlation between X_1 and X_2 given Y. For jointly Gaussian random variables, CI is equivalent to Σ_{X_1 X_2 | Y} = 0. We first analyze the approximation error based on this partial covariance matrix.

Lemma 3.2 (Approximation error). Under Assumptions 3.1 and 3.2, if Σ_{X_2 Y} has rank k, then e_apx(ψ*) = 0.

Remark 3.1. Σ_{X_2 Y} being full column rank implies that E[X_2|Y] has rank k, i.e., X_2 depends on all directions of Y and thus captures all directions of information in Y. This is a necessary assumption for X_2 to be a reasonable pretext task for predicting Y. e_apx(ψ*) = 0 means f* is linear in ψ*; therefore ψ* selects d_2 out of d_1 features that are sufficient to predict Y.
Next we consider the estimation error that characterizes the number of samples needed to learn a prediction function f(x_1) = Ŵ^T ψ*(x_1) that generalizes.

Theorem 3.3 (Estimation error). Fix a failure probability δ ∈ (0, 1). Under Assumptions 3.1 and 3.2, if n_2 ≳ k + log(1/δ), the excess risk of the learned predictor x_1 → Ŵ^T ψ*(x_1) on the target task satisfies, with probability at least 1 − δ,

ER_{ψ*}(Ŵ) ≤ O( Tr(Σ_{YY|X_1}) (k + log(k/δ)) / n_2 ).

Here Σ_{YY|X_1} ≡ Σ_{YY} − Σ_{Y X_1} Σ_{X_1 X_1}^{-1} Σ_{X_1 Y} captures the noise level: it is the covariance matrix of the residual term Y − f*(X_1) = Y − Σ_{Y X_1} Σ_{X_1 X_1}^{-1} X_1. Compared to directly using X_1 to predict Y, self-supervised learning reduces the sample complexity from Õ(d_1) to Õ(k). We generalize these results to a weaker form of CI.

Assumption 3.3 (Conditional independence given latent variables). There exists some latent variable Z ∈ R^m such that X_1 ⊥ X_2 | Ȳ and Σ_{X_2 Ȳ} is of rank k + m, where Ȳ = [Y, Z].

This assumption lets us introduce reasonable latent variables that capture the information between X_1 and X_2 beyond Y. Σ_{X_2 Ȳ} being full rank says that all directions of Ȳ are needed to predict X_2, and therefore Z is not redundant. For instance, when Z = X_1 the assumption is trivially true, but then Z is not the minimal latent information we want to add. Note that it implicitly requires d_2 ≥ k + m.

Corollary 3.4. Under Assumptions 3.1 and 3.3, the approximation error e_apx(ψ*) is 0. Under CI with latent variables, Theorem 3.3 generalizes with k replaced by k + m.
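Claim 3.1 and Lemma 3.2 can be sanity-checked with exact population covariances. The sketch below builds a jointly Gaussian model X_1 = A Y + N_1, X_2 = C Y + N_2 with independent noises (so CI given Y holds) and verifies that f* = Σ_{X_2 Y}^† Σ_{YY} ψ*, i.e., the target is exactly linear in the pretext representation; the matrices A, C are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)
k, d1, d2 = 2, 6, 5
A = rng.normal(size=(d1, k))  # X1 = A Y + N1
C = rng.normal(size=(d2, k))  # X2 = C Y + N2; independent noises => CI given Y

# Population covariances for Y ~ N(0, I_k) and unit-variance noises.
S_x1x1 = A @ A.T + np.eye(d1)
S_x2x1 = C @ A.T        # cross term vanishes because the noises are independent
S_yx1 = A.T
S_x2y = C               # and Sigma_YY = I_k

# Claim 3.1: psi*(x1) = Sigma_{X2 X1} Sigma_{X1 X1}^{-1} x1 (as a matrix map).
Psi = S_x2x1 @ np.linalg.inv(S_x1x1)
# Target: f*(x1) = Sigma_{Y X1} Sigma_{X1 X1}^{-1} x1 (as a matrix map).
F = S_yx1 @ np.linalg.inv(S_x1x1)

# Lemma 3.2: with Sigma_{X2 Y} of rank k, f* = Sigma_{X2 Y}^† Sigma_YY psi*.
W = np.linalg.pinv(S_x2y)  # Sigma_YY = I here, so the prefactor is the pinv
```

Since C has full column rank k, pinv(C) C = I_k and the two linear maps coincide exactly.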

3.2. GENERAL RANDOM VARIABLES

Next we move to the general setting where the variables need not be Gaussian.

Assumption 3.4. Let X_1 ∈ R^{d_1}, X_2 ∈ R^{d_2} be random variables from some unknown distribution. Let the label Y ∈ Y be a discrete random variable with k = |Y| < d_2. We assume conditional independence: X_1 ⊥ X_2 | Y.

Here Y can be interpreted as a multi-class label where k is the number of classes. For regression problems, one can think of Y as discretized values of continuous labels. We do not specify the dimension of Y since Y can be arbitrarily encoded; the results only depend on k and the variance of Y (conditional on the input X_1).

Universal function class. Suppose we learn the optimal ψ* among all measurable functions. The optimal function in this case is the conditional expectation: ψ*(x_1) = E[X_2|X_1 = x_1]. We now show that CI implies ψ* is good for downstream tasks, which is not a priori clear.

Lemma 3.5 (Approximation error). Suppose random variables X_1, X_2, Y satisfy Assumption 3.4, and the matrix A ∈ R^{Y × d_2} with A_{y,:} := E[X_2|Y = y] is of rank k = |Y|. Then e_apx(ψ*) = 0.

This tells us that although f* could be nonlinear in x_1, it is guaranteed to be linear in ψ*(x_1). Note that Y does not have to be linear in X_2. We provide a simple example for intuition:

Example 3.1. Let Y ∈ {−1, 1} be a binary label, and let X_1, X_2 be 2-mixture Gaussian random variables with X_1 ∼ N(Y µ_1, I), X_2 ∼ N(Y µ_2, I). In this example, X_1 ⊥ X_2 | Y. Although f* = E[Y|X_1] is not linear in x_1, E[Y|ψ*] is linear: ψ*(x_1) = P(Y = 1|X_1 = x_1) µ_2 − P(Y = −1|X_1 = x_1) µ_2 and f*(x_1) = P(Y = 1|X_1 = x_1) − P(Y = −1|X_1 = x_1) ≡ µ_2^T ψ*(x_1) / ‖µ_2‖^2.

Given that ψ* is good for downstream tasks, we now consider the sample complexity. We will need the representation to have some nice concentration properties, and make an assumption about the whitened data ψ*(X_1) to ignore scaling factors. Assumption 3.5.
We assume the whitened feature variable U := Σ_ψ^{-1/2} ψ(X_1) is a ρ^2-sub-gaussian random vector, where Σ_ψ = E[ψ(X_1) ψ(X_1)^T]. We note that all bounded random variables are sub-gaussian.

Theorem 3.6 (General conditional independence). Fix a failure probability δ ∈ (0, 1). Under the same assumptions as Lemma 3.5 and Assumption 3.5 for ψ*, if additionally n_2 ≳ ρ^4 (k + log(1/δ)), then the excess risk of the learned predictor x_1 → Ŵ^T ψ*(x_1) on the downstream task satisfies

ER_{ψ*}(Ŵ) ≤ O( (k + log(k/δ)) σ^2 / n_2 ).

Function class induced by feature maps. Given a feature map φ_1 : X_1 → R^{D_1}, we consider the function class H_1 = {ψ : X_1 → R^{d_2} | ∃ B ∈ R^{d_2 × D_1}, ψ(x_1) = B φ_1(x_1)}.

Claim 3.7 (Closed-form solution). The optimal function in H_1 is ψ*(x_1) = Σ_{X_2 φ_1} Σ_{φ_1 φ_1}^{-1} φ_1(x_1), where Σ_{X_2 φ_1} := Σ_{X_2 φ_1(X_1)} and Σ_{φ_1 φ_1} := Σ_{φ_1(X_1) φ_1(X_1)}.

We again show the benefit of CI, this time by comparing the performance of ψ* to the original features φ_1. Since ψ* is linear in φ_1, it cannot have smaller approximation error than φ_1. However, CI ensures that ψ* has the same approximation error as φ_1 while enjoying much better sample complexity.
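Example 3.1 above can be verified numerically: for the two-component mixture, the posterior P(Y = 1 | X_1 = x_1) is a sigmoid in 2 µ_1^T x_1, so f* is nonlinear in x_1 yet exactly linear in ψ*. A small sketch with arbitrary illustrative means:

```python
import numpy as np

rng = np.random.default_rng(3)
mu1 = rng.normal(size=4)  # mean of X1 given Y = 1 (illustrative)
mu2 = rng.normal(size=3)  # mean of X2 given Y = 1 (illustrative)

def posterior_p(x1):
    # P(Y=1 | X1=x1) for the mixture N(+mu1, I) vs N(-mu1, I), uniform prior:
    # the log-likelihood ratio is 2 mu1^T x1, giving a sigmoid.
    return 1.0 / (1.0 + np.exp(-2.0 * mu1 @ x1))

x1 = rng.normal(size=4) + mu1                # an arbitrary test point
p = posterior_p(x1)
psi = p * mu2 + (1 - p) * (-mu2)             # psi*(x1) = E[X2 | X1 = x1]
f_star = p - (1 - p)                         # f*(x1) = E[Y | X1 = x1]

# f* is linear in psi*: f*(x1) = mu2^T psi*(x1) / ||mu2||^2.
lin = mu2 @ psi / (mu2 @ mu2)
```

Here `psi` collapses to (2p − 1) µ_2, so projecting onto µ_2 recovers f* exactly.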

Lemma 3.8 (Approximation error). If Assumption 3.4 is satisfied, and the matrix A ∈ R^{Y × d_2} with A_{y,:} := E[X_2|Y = y] is of rank k = |Y|, then e_apx(ψ*) = e_apx(φ_1).

We additionally need an assumption on the residual a(x_1) := E[Y|X_1 = x_1] − E_L[Y|φ_1(x_1)].

Assumption 3.6 (Bounded approximation error; Condition 3 in Hsu et al. (2012)). Almost surely, ‖Σ_{φ_1 φ_1}^{-1/2} φ_1(X_1) a(X_1)^T‖_F ≤ b_0 √k.

Theorem 3.9 (CI with approximation error). Fix a failure probability δ ∈ (0, 1). Under the same assumptions as Lemma 3.8, Assumption 3.5 for ψ*, and Assumption 3.6, if n_2 ≳ ρ^4 (k + log(1/δ)), then the excess risk of the learned predictor x_1 → Ŵ^T ψ*(x_1) on the downstream task satisfies

ER_{ψ*}(Ŵ) ≤ e_apx(φ_1) + O( (k + log(k/δ)) σ^2 / n_2 ).

Theorem 3.9 also holds with Assumption 3.3 instead of exact CI, if we replace k by km. Therefore, with SSL the label requirement is reduced from the complexity of H to O(k) (or O(km)).

Remark 3.2. Since X_1 ⊥ X_2 | Y ensures X_1 ⊥ h(X_2) | Y for any deterministic function h, we could replace X_2 by h(X_2) and all results would still hold. This is especially useful when d_2 < km.

4. BEYOND CONDITIONAL INDEPENDENCE

In the previous section we focused on the case where exact CI is satisfied. A weaker but more practical assumption is that Y captures some portion of the dependence between X_1 and X_2, but not all. A warm-up result for jointly Gaussian variables is deferred to Appendix C.1, where ACI is quantified by a partial covariance matrix. In this section, we generalize the result from the linear function space to arbitrary function spaces, and introduce the appropriate quantities that measure ACI.

4.1. LEARNABILITY WITH GENERAL FUNCTION SPACE

We state the main result with finite samples for both the pretext and downstream tasks. Let X_1^{pre} = [x_1^{(1,pre)}, …, x_1^{(n_1,pre)}]^T ∈ R^{n_1 × d_1} and X_2 = [x_2^{(1)}, …, x_2^{(n_1)}]^T ∈ R^{n_1 × d_2} be the training data for the pretext task, where each (x_1^{(i,pre)}, x_2^{(i)}) is sampled from P_{X_1 X_2}. We consider two types of function spaces: H ∈ {H_1, H_u}. Recall that H_1 = {ψ : X_1 → R^{d_2} | ∃ B ∈ R^{d_2 × D_1}, ψ(x_1) = B φ_1(x_1)} is induced by the feature map φ_1 : X_1 → R^{D_1}, and H_u is a function space with universal approximation power (e.g., deep networks) that ensures ψ* = E[X_2|X_1] ∈ H_u. We learn a representation from H using the n_1 samples: ψ := argmin_{f ∈ H} (1/n_1) ‖X_2 − f(X_1^{pre})‖^2_F. For the downstream task we similarly define X_1^{down} ∈ R^{n_2 × d_1} and Y ∈ R^{n_2 × k}, and learn a linear classifier on ψ(X_1^{down}): Ŵ ← argmin_W (1/(2 n_2)) ‖Y − ψ(X_1^{down}) W‖^2_F, with ER_ψ(Ŵ) := E_{X_1} ‖f*_H(X_1) − Ŵ^T ψ(X_1)‖^2_2. Here f*_H = E_L[Y|φ_1(X_1)] when H = H_1, and f*_H = f* when H = H_u.

Assumption 4.1 (Correlation between X_2 and Y, Z). Suppose there exists a latent variable Z ∈ Z, |Z| = m, such that Σ_{φ_ȳ X_2} is full column rank and ‖Σ_{Y φ_ȳ} Σ_{X_2 φ_ȳ}^†‖_2 = 1/β, where A^† denotes the pseudo-inverse and φ_ȳ is the one-hot embedding of Ȳ = [Y, Z].

Definition 4.1 (Approximate conditional independence with function space H). 1. For H = H_1, define ε_CI := ‖Σ_{φ_1 φ_1}^{-1/2} Σ_{φ_1 X_2 | φ_ȳ}‖_F. 2. For H = H_u, define ε_CI^2 := E_{X_1}[‖E[X_2|X_1] − E_Ȳ[E[X_2|Ȳ]|X_1]‖^2].

Exact CI ensures ε_CI = 0 in both cases. We present a unified analysis in the appendix showing that ε_CI for the second case coincides with that for the first, with covariance operators in place of matrices.

Assumption 4.2 (Bounded approximation error). There exists b_0 ≥ 0 such that ‖Σ_{φ_1 φ_1}^{-1/2} φ_1(X_1) a(X_1)^T‖_F ≤ b_0 √k almost surely.

Theorem 4.2. Fix δ ∈ (0, 1). Under Assumptions 4.1 and 4.2 for ψ and ψ*, and Assumption 3.5 for non-universal feature maps, suppose n_1, n_2 ≳ ρ^4 (d_2 + log(1/δ)) and we learn the pretext task such that E‖ψ(X_1) − ψ*(X_1)‖^2_F ≤ ε_pre^2.
Then the generalization error for the downstream task satisfies, with probability 1 − δ,

ER_ψ(Ŵ) ≤ O( σ^2 (d_2 + log(d_2/δ)) / n_2 + ε_CI^2 / β^2 + ε_pre^2 / β^2 ).

We defer the proof to the appendix. The proof technique is similar to that of Section 3. The difference is that ψ(X_1^{down}) ∈ R^{n_2 × d_2} is now only approximately low rank (low rank plus small norm), where the low rank part consists of the high-signal features implicitly coming from Y, Z that are useful for the downstream task; the remaining part comes from ε_CI and ε_pre. Again, by selecting the top km (the dimension of φ_ȳ) features we can further improve the sample complexity:

Remark 4.1. By applying PCA on ψ(X_1^{down}) and keeping only the top km principal components, we can improve the bound in Theorem 4.2 to

ER_ψ(Ŵ) ≤ O( σ^2 (km + log(km/δ)) / n_2 + ε_CI^2 / β^2 + ε_pre^2 / β^2 ).  (5)

We take a closer look at the different sources of error in (5): 1) the noise term Y − f*(X_1) with noise level σ^2; 2) ε_CI, which measures the approximate CI; and 3) ε_pre, the error from not learning the pretext task exactly. The first term is optimal up to log factors, since we perform linear regression on km-dimensional features. The second and third terms are irreducible, because f* is not exactly linear in ψ while we keep ψ as a fixed feature map and learn a linear function on top of it. It is therefore important to fine-tune when sufficient downstream labeled data is available; we leave this for future work. Compared to traditional supervised learning, learning f*_H requires sample complexity scaling with the (Rademacher/Gaussian) complexity of H (see, e.g., Bartlett & Mendelson (2002); Shalev-Shwartz & Ben-David (2014)), which is very large for complicated models such as deep networks. In Section D, we present a similar result for the cross-entropy loss.

5. EXPERIMENTS

In this section, we empirically verify our claim that SSL performs well when ACI is satisfied.

Simulations. With synthetic data, we verify how the excess risk (ER) scales with the cardinality/feature dimension k of Y and with the ACI measure ε_CI of Definition 4.1. We consider mixture-of-Gaussians data and conduct experiments with both the linear function space (H_1 with φ_1 the identity map) and the universal function space H_u. We sample the label Y uniformly from {1, …, k}. For the i-th class, the centers µ_{1i} ∈ R^{d_1} and µ_{2i} ∈ R^{d_2} are sampled uniformly from [0, 10). Given Y = i and α ∈ [0, 1], let X_1 ∼ N(µ_{1i}, I), X̃_2 ∼ N(µ_{2i}, I), and X_2 = (1 − α) X̃_2 + α X_1. Thus α is a correlation coefficient: α = 0 ensures X_2 is CI with X_1 given Y, while at α = 1, X_2 fully depends on X_1 (if d_1 ≠ d_2, we append zeros or truncate X_1 to fit accordingly). We first conduct experiments with the linear function class: we learn a linear representation ψ with n_1 samples and the linear prediction of Y from ψ with n_2 samples. We set d_1 = 50, d_2 = 40, n_1 = 4000, n_2 = 1000, and measure ER with mean squared error (MSE). As shown in Figure 1(a)(b), the MSE of learning with ψ(X_1) scales linearly with k as indicated by Theorem 3.9, and scales linearly with the ε_CI associated with the linear function class as indicated by Theorem 4.2. Next we move to the general function class, i.e., ψ* = E[X_2|X_1], which has a closed-form solution (see Example 3.1). We use the same parameter settings as above. As a baseline, we use kernel regression to predict Y from X_1 (with the RBF kernel, which also has universal approximation power). As shown in Figure 1(c)(d), the phenomenon mirrors what we observe in the linear function class setting, verifying Theorem 3.6 and Theorem 4.2 with H_u respectively.

NLP task. We consider the setting where both X_1 and X_2 are sentences, and perform experiments by enforcing CI with and without latent variables.
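The simulations' data-generating process described above can be sketched directly from that description; α interpolates between exact CI (α = 0) and full dependence on X_1 (α = 1). This is a minimal reimplementation sketch, not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(5)

def make_data(n, k, d1, d2, alpha):
    # Class centers sampled uniformly from [0, 10), as described above.
    mu1 = rng.uniform(0, 10, size=(k, d1))
    mu2 = rng.uniform(0, 10, size=(k, d2))
    y = rng.integers(k, size=n)
    X1 = mu1[y] + rng.normal(size=(n, d1))
    X2_tilde = mu2[y] + rng.normal(size=(n, d2))
    # Pad with zeros or truncate X1 to d2 columns before mixing it into X2.
    if d1 >= d2:
        X1_fit = X1[:, :d2]
    else:
        X1_fit = np.pad(X1, ((0, 0), (0, d2 - d1)))
    X2 = (1 - alpha) * X2_tilde + alpha * X1_fit
    return X1, X2, y

# alpha = 0: X2 is exactly CI with X1 given Y; alpha = 1: full dependence.
X1, X2, y = make_data(1000, 5, 50, 40, alpha=0.0)
```

Sweeping α then traces out the ACI axis of Figure 1, with the pretext and downstream regressions run on top of the generated samples.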
The downstream task is sentiment analysis on the Stanford Sentiment Treebank (SST) dataset (Socher et al., 2013), where inputs are movie reviews and the label set Y is {±1}. We use the representation class H_1, with features φ_1 being the bag-of-words representation (D_1 = 13848). For X_2 we use a d_2 = 300 dimensional embedding of the sentence, namely the mean of the (random Gaussian) word vectors for the words in the sentence. For the SSL data we consider two settings: (a) enforce CI with the labels Y; (b) enforce CI with extra latent variables, for which we use the fine-grained version of SST with label set Ȳ = {1, 2, 3, 4, 5}. We test the learned ψ on the SST binary task with linear regression and linear classification; results are presented in Figure 2. We observe that in both settings ψ outperforms φ_1, especially in the small-sample-size regime. Also, exact CI is better than CI with extra latent variables, as suggested by the theory.

6. CONCLUSION

In this work, we theoretically quantify how an approximate conditional independence assumption connecting pretext and downstream task data distributions yields sample complexity benefits for self-supervised learning on downstream tasks. Our theoretical findings are supported by experiments on simulated data as well as on real CV and NLP tasks. We note that approximate CI is only a sufficient condition for a useful pretext task, and leave the investigation of other mechanisms by which pretext tasks help downstream tasks for future work.

A SOME USEFUL FACTS

A.1 RELATION OF INVERSE COVARIANCE MATRIX AND PARTIAL CORRELATION

For the joint distribution of variables X = (X_1, X_2) and Y, write the covariance matrix in block form:

Σ = [ Σ_XX  Σ_XY ; Σ_YX  Σ_YY ],  where the blocks expand as  [ Σ_{X_1 X_1}  Σ_{X_1 X_2}  Σ_{X_1 Y} ; Σ_{X_2 X_1}  Σ_{X_2 X_2}  Σ_{X_2 Y} ; Σ_{Y X_1}  Σ_{Y X_2}  Σ_{YY} ].

Its inverse satisfies Σ^{-1} = [ A  ρ ; ρ^T  B ], where A^{-1} = Σ_XX − Σ_XY Σ_YY^{-1} Σ_YX ≡ cov(X − E_L[X|Y], X − E_L[X|Y]) := Σ_{XX·Y}, the partial covariance matrix of X given Y.

A.2 RELATION TO CONDITIONAL INDEPENDENCE

Fact A.1. When X_1 ⊥ X_2 | Y, the partial covariance between X_1 and X_2 given Y is 0:

Σ_{X_1 X_2 · Y} := cov(X_1 − E_L[X_1|Y], X_2 − E_L[X_2|Y]) ≡ Σ_{X_1 X_2} − Σ_{X_1 Y} Σ_{YY}^{-1} Σ_{Y X_2} = 0.

The derivation relies on the following:

Lemma A.1 (Conditional independence; adapted from Huang (2010)). For random variables X_1, X_2 and a random variable Y with finitely many values, conditional independence X_1 ⊥ X_2 | Y is equivalent to: sup_{f ∈ N_1, g ∈ N_2} E[f(X_1) g(X_2) | Y] = 0, where N_i = {f : R^{d_i} → R : E[f(X_i)|Y] = 0}, i = 1, 2.

Proof of Lemma C.5. Notice that for an arbitrary function f, E[f(X)|Y] = E_L[f(X)|φ_y(Y)] with φ_y the one-hot encoding of the discrete variable Y. Therefore, for any feature maps, conditional independence ensures

Σ_{φ_1(X_1) φ_2(X_2) | Y} := cov(φ_1(X_1) − E_L[φ_1(X_1)|φ_y(Y)], φ_2(X_2) − E_L[φ_2(X_2)|φ_y(Y)]) = E[φ̃_1(X_1) φ̃_2(X_2)^T] = 0,

where φ̃_1(X_1) = φ_1(X_1) − E[φ_1(X_1)|φ_y(Y)] has mean zero given Y, and likewise for φ̃_2(X_2). This finishes the proof of Lemma C.5.

A.3 TECHNICAL FACTS FOR MATRIX CONCENTRATION

We include a covariance concentration result adapted from Claim A.2 in Du et al. (2020):

Claim A.2 (Covariance concentration for Gaussian variables). Let X = [x_1, x_2, …, x_n]^T ∈ R^{n × d} where each x_i ∼ N(0, Σ_X). Suppose n ≳ k + log(1/δ) for δ ∈ (0, 1). Then for any given matrix B ∈ R^{d × m} that is of rank k and independent of X, with probability at least 1 − δ/10 over X we have

0.9 B^T Σ_X B ⪯ (1/n) B^T X^T X B ⪯ 1.1 B^T Σ_X B.

We will also use Claim A.3 from Du et al. (2020) for concentration of sub-gaussian random vectors.

Claim A.3 (Covariance concentration for sub-gaussian variables). Let X = [x_1, x_2, …, x_n]^T ∈ R^{n × d} where each row x_i is a mean-zero ρ^2-sub-gaussian random vector with covariance Σ_X. Suppose n ≳ ρ^4 (k + log(1/δ)) for δ ∈ (0, 1). Then for any given matrix B ∈ R^{d × m} that is of rank k and independent of X, with probability at least 1 − δ/10 over X we have

0.9 B^T Σ_X B ⪯ (1/n) B^T X^T X B ⪯ 1.1 B^T Σ_X B.

Claim A.4. Let Z ∈ R^{n × k} be a matrix with rows sampled i.i.d. from the Gaussian distribution N(0, Σ_Z), and let P ∈ R^{n × n} be a fixed projection onto a space of dimension d. Then for a fixed δ ∈ (0, 1), with probability at least 1 − δ,

‖P Z‖^2_F ≲ Tr(Σ_Z) (d + log(k/δ)).

Proof of Claim A.4. The t-th column of Z is an n-dimensional vector with i.i.d. N(0, Σ_{tt}) entries. We have ‖P Z‖^2_F = Σ_{t=1}^k ‖P z_t‖^2 = Σ_{t=1}^k z_t^T P z_t. Each term satisfies Σ_{tt}^{-1} ‖P z_t‖^2 ∼ χ^2(d), and therefore with probability at least 1 − δ' over z_t, Σ_{tt}^{-1} ‖P z_t‖^2 ≲ d + log(1/δ'). Taking δ' = δ/k and a union bound over t ∈ [k], we get ‖P Z‖^2_F ≲ Tr(Σ_Z) (d + log(k/δ)).

Theorem A.5 (Hanson-Wright inequality). Let X = (X_1, X_2, …, X_n) ∈ R^n be a random vector with independent components X_i which satisfy E[X_i] = 0 and ‖X_i‖_{ψ_2} ≤ K. Let A be an n × n matrix. Then, for every t ≥ 0,

P( |X^T A X − E[X^T A X]| > t ) ≤ 2 exp( −c min( t^2 / (K^4 ‖A‖^2_F), t / (K^2 ‖A‖) ) ).

Theorem A.6 (Vector Bernstein inequality (Theorem 12 in Gross (2011))). Let X_1, …, X_m be independent zero-mean vector-valued random variables. Let N = ‖Σ_{i=1}^m X_i‖_2.
Then P[N ≥ √V + t] ≤ exp(−t²/(4V)), where V = Σ_i E‖X_i‖² and t ≤ V / max_i ‖X_i‖.

Lemma A.7. Let Z ∈ R^{n×k} be a matrix whose rows are n independent mean-zero (conditional on P) σ-sub-Gaussian random vectors, and let P ∈ R^{n×n} be a projection onto a space of dimension d. Then with probability 1 − δ,

‖P^T Z‖² ≲ σ²(d + log(d/δ)).

Proof of Lemma A.7. Write P = UU^T with U = [u_1, ..., u_d] ∈ R^{n×d} an orthogonal matrix, U^T U = I. Then

‖P^T Z‖_F² = ‖U^T Z‖_F² = Σ_{j=1}^d ‖u_j^T Z‖² = Σ_{j=1}^d ‖Σ_{i=1}^n u_{ji} z_i‖²,

where each z_i ∈ R^k (the i-th row of Z) is a centered independent σ-sub-Gaussian random vector. To use the vector Bernstein inequality, fix j and set X := Σ_{i=1}^n X_i with X_i := u_{ji} z_i. Each X_i is zero-mean: E[X_i] = E[u_{ji} E[z_i | u_{ji}]] = E[u_{ji} · 0] = 0. Moreover,

V := Σ_i E‖X_i‖²₂ = Σ_i E_{u_{ji}}[ u_{ji}² E[‖z_i‖²₂ | u_{ji}] ] ≤ σ² Σ_i E[u_{ji}²] = σ².

Therefore, by the vector Bernstein inequality, with probability at least 1 − δ/d, ‖X‖ ≤ σ(1 + √log(d/δ)). Taking a union bound over j ∈ [d], we get ‖P^T Z‖² = Σ_{j=1}^d ‖u_j^T Z‖² ≲ σ²(d + log(d/δ)) with probability 1 − δ.

Corollary A.8. Let Z ∈ R^{n×k} be a matrix whose rows are n independent samples of centered (conditioned on P) multinomial random vectors with probabilities (p_1, p_2, ..., p_k) (the p_t may differ across rows). Let P ∈ R^{n×n} be a projection onto a space of dimension d (possibly dependent on Z). Then with probability 1 − δ, ‖P^T Z‖² ≲ d + log(d/δ).

B OMITTED PROOFS WITH CONDITIONAL INDEPENDENCE

Proof of Lemma 3.2. Conditional independence gives cov(X_1, X_2 | Y) = Σ_{X1X2} − Σ_{X1Y} Σ_{YY}^{-1} Σ_{YX2} = 0. Plugging this into the expression for E_L[X_2|X_1], we get

ψ(x_1) := E_L[X_2 | X_1 = x_1] = Σ_{X2X1} Σ_{X1X1}^{-1} x_1 = Σ_{X2Y} Σ_{YY}^{-1} Σ_{YX1} Σ_{X1X1}^{-1} x_1 = Σ_{X2Y} Σ_{YY}^{-1} E_L[Y | X_1 = x_1].

Therefore, as long as Σ_{X2Y} is of rank k, it has a left inverse, and we get

E_L[Y | X_1 = x_1] = Σ_{YY} Σ_{X2Y}^† ψ(x_1).

Hence there is no approximation error in using ψ to predict Y.

Proof of Corollary 3.4. Let the selector operator S_y be the mapping such that S_y Ȳ = Y; we overload it as the matrix that ensures S_y Σ_{ȲX} = Σ_{YX} for any random variable X. From Lemma 3.2 there exists W such that E_L[Ȳ | X_1] = W E_L[X_2 | X_1]; applying S_y gives E_L[Y | X_1] = (S_y W) E_L[X_2 | X_1].

Proof of Theorem 3.3. Since N is mean-zero, f*(X_1) = E[Y | X_1] = (A*)^T X_1, and E_L[Y | X_1 = x_1] = Σ_{YY} Σ_{X2Y}^† ψ(x_1). Let W* = Σ_{YX2}^† Σ_{YY}. First we have the basic inequality

(1/2n₂) ‖Y − ψ(X_1) Ŵ‖_F² ≤ (1/2n₂) ‖Y − X_1 A*‖_F² = (1/2n₂) ‖Y − ψ(X_1) W*‖_F².

Therefore

‖ψ(X_1) W* − ψ(X_1) Ŵ‖² ≤ 2 ⟨N, ψ(X_1) W* − ψ(X_1) Ŵ⟩ = 2 ⟨P_{ψ(X_1)} N, ψ(X_1) W* − ψ(X_1) Ŵ⟩ ≤ 2 ‖P_{ψ(X_1)} N‖_F ‖ψ(X_1) W* − ψ(X_1) Ŵ‖_F

⇒ ‖ψ(X_1) W* − ψ(X_1) Ŵ‖ ≤ 2 ‖P_{ψ(X_1)} N‖_F ≲ √( Tr(Σ_{YY|X1})(k + log(k/δ)) )   (from Claim A.4),

where the last inequality uses the fact that each row of N follows the Gaussian distribution N(0, Σ_{YY|X1}). Therefore

(1/n₂) ‖ψ(X_1) W* − ψ(X_1) Ŵ‖_F² ≲ Tr(Σ_{YY|X1})(k + log(k/δ)) / n₂.

Next we concentrate (1/n₂) X_1^T X_1 to Σ_{X1X1}. Suppose E_L[X_2 | X_1] = B^T X_1, i.e., ψ(x_1) = B^T x_1 and ψ(X_1) = X_1 B. With Claim A.2, (1/n₂) ψ(X_1)^T ψ(X_1) = (1/n₂) B^T X_1^T X_1 B satisfies

0.9 B^T Σ_{X1X1} B ⪯ (1/n₂) ψ(X_1)^T ψ(X_1) ⪯ 1.1 B^T Σ_{X1X1} B.

Therefore we also have

E ‖(W* − Ŵ)^T ψ(x_1)‖² = ‖Σ_{X1X1}^{1/2} B (W* − Ŵ)‖_F² ≤ (1/(0.9 n₂)) ‖ψ(X_1) W* − ψ(X_1) Ŵ‖_F² ≲ Tr(Σ_{YY|X1})(k + log(k/δ)) / n₂.
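The identity in Lemma 3.2 can be checked numerically at the population level. The sketch below uses an illustrative generative model of our choosing (not the paper's experiments): Y ~ N(0, I_k), X_1 = M_1 Y + noise, X_2 = M_2 Y + noise with independent noises, which makes X_1 ⊥ X_2 | Y, and verifies that the Bayes linear predictor of Y from X_1 is a linear function of ψ:

```python
import numpy as np

# Population-level check of Lemma 3.2 under exact conditional independence.
rng = np.random.default_rng(1)
k, d1, d2 = 3, 8, 6
M1 = rng.standard_normal((d1, k))
M2 = rng.standard_normal((d2, k))   # generic M2 has full column rank k
s1 = 0.5

S_YY   = np.eye(k)
S_X1Y  = M1                          # Cov(X1, Y)
S_X2Y  = M2                          # Cov(X2, Y)
S_X1X1 = M1 @ M1.T + s1**2 * np.eye(d1)
S_X2X1 = M2 @ M1.T                   # CI => Sigma_{X2X1} = Sigma_{X2Y} Sigma_{YY}^{-1} Sigma_{YX1}

# pretext representation: psi(x1) = Sigma_{X2X1} Sigma_{X1X1}^{-1} x1
Psi = S_X2X1 @ np.linalg.inv(S_X1X1)
# Bayes linear predictor of Y from X1
F = S_X1Y.T @ np.linalg.inv(S_X1X1)
# Lemma 3.2: F = Sigma_{YY} Sigma_{X2Y}^dagger Psi, i.e. predicting Y from psi is linear
W = S_YY @ np.linalg.pinv(S_X2Y)
assert np.allclose(F, W @ Psi)
```

The assertion holds because Σ_{X2Y}^† Σ_{X2Y} = I_k whenever Σ_{X2Y} has full column rank k, exactly the non-degeneracy condition in the lemma.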

B.1 OMITTED PROOF FOR GENERAL RANDOM VARIABLES

Proof of Lemma 3.5. Let the representation function ψ be defined as

ψ(x_1) := E[X_2 | X_1 = x_1] = E[ E[X_2 | X_1, Y] | X_1 = x_1 ] = E[ E[X_2 | Y] | X_1 = x_1 ]   (uses CI)
= Σ_y P(Y = y | X_1 = x_1) E[X_2 | Y = y] = A^T f(x_1),

where f : R^{d1} → Δ_{|𝒴|} satisfies f(x_1)_y = P(Y = y | X_1 = x_1), and A ∈ R^{|𝒴|×d2} satisfies A_{y,:} = E[X_2 | Y = y]^T. Here Δ_d denotes the simplex of dimension d, which represents the discrete probability distributions over a support of size d. Let B = (A^T)^† ∈ R^{|𝒴|×d2} be the pseudo-inverse of A^T; from our assumption that A is of rank |𝒴| we get B A^T = I. Therefore f(x_1) = B ψ(x_1) for all x_1. Next we have

E[Y | X_1 = x_1] = Σ_y P(Y = y | X_1 = x_1) · y = Ŷ f(x_1) = (Ŷ B) ψ(x_1),

where we denote by Ŷ ∈ R^{k×|𝒴|} the matrix with columns Ŷ_{:,y} = y spanning the whole support 𝒴. Therefore setting W* = Ŷ B finishes the proof.

Proof of Theorem 3.6. With Lemma 3.5 we know e_apx = 0, and therefore W* ψ(X_1) ≡ f*(X_1). Next, from the basic inequality and the same argument as in Theorem 3.3, we have

‖ψ(X_1) W* − ψ(X_1) Ŵ‖ ≤ 2 ‖P_{ψ(X_1)} N‖_F.

Here N is a random noise matrix whose rows are independent samples from a centered distribution with E[‖N_i‖² | X_1] ≤ σ², and P_{ψ(X_1)} is a projection onto a space of dimension c. From Lemma A.7 we have

‖f*(X_1) − ψ(X_1) Ŵ‖ ≲ σ √(c + log(c/δ)).

Next, with Claim A.3, when n₂ ≳ ρ⁴(c + log(1/δ)), since W* − Ŵ ∈ R^{d2×k},

0.9 (W* − Ŵ)^T Σ_ψ (W* − Ŵ) ⪯ (1/n₂) (W* − Ŵ)^T [ Σ_i ψ(x_1^{(i)}) ψ(x_1^{(i)})^T ] (W* − Ŵ) ⪯ 1.1 (W* − Ŵ)^T Σ_ψ (W* − Ŵ),

and therefore we conclude that E ‖Ŵ^T ψ(X_1) − f*(X_1)‖² ≲ σ² (c + log(c/δ)) / n₂.
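For discrete Y the decomposition ψ(x_1) = A^T f(x_1) can be verified exactly on a small finite-support example. All quantities below (priors, conditional tables, dimensions) are invented for illustration:

```python
import numpy as np

# Finite-support check of Lemma 3.5: under CI, psi(x1) = E[X2|X1=x1] equals
# A^T f(x1), with f(x1)_y = P(Y=y|X1=x1) and A[y] = E[X2|Y=y]; when A has
# full row rank, f (hence E[Y|X1]) is a linear function of psi.
rng = np.random.default_rng(2)
ny, nx1, d2 = 3, 5, 4
prior = np.array([0.2, 0.5, 0.3])                    # P(Y = y)
P_x1_given_y = rng.dirichlet(np.ones(nx1), size=ny)  # ny x nx1 table
A = rng.standard_normal((ny, d2))                    # row A[y] = E[X2 | Y = y]

joint = prior[:, None] * P_x1_given_y                # P(Y = y, X1 = x1)
f = (joint / joint.sum(axis=0, keepdims=True)).T     # nx1 x ny posterior f(x1)

psi = f @ A      # psi(x1) = sum_y P(y|x1) E[X2|y], one row per value of x1
# A has full row rank ny (generic, d2 >= ny), so A A^dagger = I and f is
# recovered from psi by a fixed linear map:
assert np.allclose(f, psi @ np.linalg.pinv(A))
```

Since E[Y|X_1] is a fixed linear image of f, the same pseudo-inverse argument gives the matrix W* of the lemma.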

B.2 OMITTED PROOF OF LINEAR MODEL WITH APPROXIMATION ERROR

Proof of Theorem 3.9. First we note that Y = f*(X_1) + N, where E[N | X_1] = 0, but Y − (A*)^T X_1 is not necessarily mean-zero; this is where the additional difficulty lies. Write the approximation error term a(X_1) := f*(X_1) − (A*)^T X_1, namely Y = a(X_1) + (A*)^T X_1 + N. Also, (A*)^T X_1 ≡ (W*)^T ψ(X_1) under conditional independence. Second, from the first-order optimality (KKT) condition defining A*, we know that E[a(X_1) X_1^T] = 0. Recall Ŵ = argmin_W ‖Y − ψ(X_1) W‖_F². We have the basic inequality

(1/2n₂) ‖Y − ψ(X_1) Ŵ‖_F² ≤ (1/2n₂) ‖Y − X_1 A*‖_F² = (1/2n₂) ‖Y − ψ(X_1) W*‖_F²,

i.e., (1/2n₂) ‖ψ(X_1) W* + a(X_1) + N − ψ(X_1) Ŵ‖_F² ≤ (1/2n₂) ‖a(X_1) + N‖_F². Therefore

(1/2n₂) ‖ψ(X_1)(W* − Ŵ)‖² ≤ −(1/n₂) ⟨a(X_1) + N, ψ(X_1)(W* − Ŵ)⟩ = −(1/n₂) ⟨a(X_1), ψ(X_1)(W* − Ŵ)⟩ − (1/n₂) ⟨N, ψ(X_1)(W* − Ŵ)⟩.   (9)

With Assumption 3.6 and the concentration 0.9 Σ_{X1} ⪯ (1/n₂) X_1^T X_1 ⪯ 1.1 Σ_{X1}, we have

(1/√n₂) ‖a(X_1)^T X_1 Σ_{X1}^{-1/2}‖_F ≤ 1.1 b₀ √k.   (10)

Denote ψ(X_1) = X_1 B, where B = Σ_{X1}^{-1} Σ_{X1X2} is of rank k under exact CI since Σ_{X1X2} = Σ_{X1Y} Σ_{YY}^{-1} Σ_{YX2}. We have

(1/n₂) ⟨a(X_1), ψ(X_1)(W* − Ŵ)⟩ = (1/n₂) ⟨a(X_1), X_1 B(W* − Ŵ)⟩ = (1/n₂) ⟨Σ_{X1}^{-1/2} X_1^T a(X_1), Σ_{X1}^{1/2} B(W* − Ŵ)⟩ ≲ √(k/n₂) ‖Σ_{X1}^{1/2} B(W* − Ŵ)‖_F   (from Ineq. (10)).

Back to Eqn. (9), we get

(1/2n₂) ‖ψ(X_1)(W* − Ŵ)‖_F² ≲ ( √k/n₂ + (1/n₂) ‖P_{X_1} N‖_F ) ‖X_1 B(W* − Ŵ)‖_F

⇒ (1/√n₂) ‖ψ(X_1)(W* − Ŵ)‖_F ≲ √( (k + log(k/δ)) / n₂ ).

Finally, by concentration we transfer the result from the empirical loss to the excess risk and get

E ‖(W* − Ŵ)^T ψ(X_1)‖² ≲ (k + log(k/δ)) / n₂.

B.3 ARGUMENT ON DENOISING AUTO-ENCODER OR CONTEXT ENCODER

Remark B.1. Since X_1 ⊥ X_2 | Y ensures X_1 ⊥ h(X_2) | Y for any deterministic function h, we can replace X_2 by h(X_2) and all results still hold. Therefore, in practice, we could use h(ψ(X_1)) instead of ψ(X_1) for the downstream task. Specifically, for a denoising auto-encoder or context encoder, one can take h to be the inverse of the decoder D (h = D^{-1}) and use D^{-1} ∘ ψ ≡ E, the encoder function, as the representation for downstream tasks, which is what is more commonly done in practice.

This section expands on what we claim in Remark B.1. For the context encoder, the reconstruction loss seeks the encoder E* and decoder D* that achieve

min_E min_D E ‖X_2 − D(E(X_1))‖²,   (11)

where X_2 is the masked part we want to recover and X_1 is the remainder. Naively applying our theorem suggests using D*(E*(·)) as the representation, while in practice one instead uses only the encoder part E*(·). We argue that our theory also supports this practical usage if we view the problem differently. Consider the pretext task of predicting (D*)^{-1}(X_2) instead of X_2 directly, namely

Ē ← argmin_E E ‖(D*)^{-1}(X_2) − E(X_1)‖²,   (12)

for which we should indeed use Ē(X_1) as the representation. On one hand, when X_1 ⊥ X_2 | Y, we also have X_1 ⊥ (D*)^{-1}(X_2) | Y, since (D*)^{-1}(X_2) is a deterministic function of X_2, so all our theory applies. On the other hand, optimizing (11) or (12) gives similar results. Let E* = argmin_E E[‖X_2 − D*(E(X_1))‖²] and suppose E ‖X_2 − D*(E*(X_1))‖² ≤ ε. Then, for the pretext task in (12), we have

E ‖(D*)^{-1}(X_2) − E*(X_1)‖² = E ‖(D*)^{-1}(X_2) − (D*)^{-1} ∘ D*(E*(X_1))‖² ≤ ‖(D*)^{-1}‖²_Lip · E ‖X_2 − D*(E*(X_1))‖² ≤ L² ε,

where L := ‖(D*)^{-1}‖_Lip is the Lipschitz constant of (D*)^{-1}. That is to say, in practice we optimize (11) and obtain a good representation E*(X_1) such that ε_pre ≤ L√ε, which thus performs well for downstream tasks. (Recall that ε_pre, defined in Theorem 4.2, measures how well we have learned the pretext task.)
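The Lipschitz step above is elementary and can be sanity-checked with a linear decoder. The toy snippet below (random invertible matrix standing in for D*, a random vector standing in for the encoder output; all values are our own illustrative choices) verifies ‖D^{-1}(X_2) − E(X_1)‖ ≤ ‖D^{-1}‖₂ · ‖X_2 − D(E(X_1))‖:

```python
import numpy as np

# Toy check of the Lipschitz argument for using the encoder as representation.
rng = np.random.default_rng(6)
d = 5
D = rng.standard_normal((d, d)) + 3 * np.eye(d)   # invertible linear "decoder"
Ex1 = rng.standard_normal(d)                      # stand-in for encoder output E(X1)
X2 = D @ Ex1 + 1e-3 * rng.standard_normal(d)      # near-perfect reconstruction

eps = np.linalg.norm(X2 - D @ Ex1)                # reconstruction error
L = np.linalg.norm(np.linalg.inv(D), 2)           # Lipschitz constant of D^{-1}
gap = np.linalg.norm(np.linalg.inv(D) @ X2 - Ex1) # ||D^{-1}(X2) - E(X1)||
assert gap <= L * eps + 1e-12
```

The inequality is exact linear algebra here (gap = ‖D^{-1}(X_2 − D E(X_1))‖), which is precisely the step used to relate (11) and (12).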

C OMITTED PROOFS BEYOND CONDITIONAL INDEPENDENCE

C.1 WARM-UP: JOINTLY GAUSSIAN VARIABLES

As before, for simplicity we assume all data is centered in this case.

Assumption C.1 (approximate conditional independence given latent variables). Assume there exists some latent variable Z ∈ R^m such that ‖Σ_{X1}^{-1/2} Σ_{X1,X2|Ȳ}‖_F ≤ ε_CI, σ_{k+m}(Σ_{ȲȲ}^† Σ_{ȲX2}) = β > 0, and Σ_{X2,Ȳ} is of rank k + m, where Ȳ = [Y, Z].

When X_1 is not exactly CI of X_2 given Y and Z, the approximation error depends on the norm ‖Σ_{X1}^{-1/2} Σ_{X1,X2|Ȳ}‖₂. Let Ŵ be the solution from Equation 2.

Theorem C.1. Under Assumption C.1 with constants ε_CI and β, the excess risk satisfies

ER_{ψ*}[Ŵ] := E ‖Ŵ^T ψ*(X_1) − f*(X_1)‖_F² ≲ ε_CI²/β² + Tr(Σ_{YY|X1}) (d₂ + log(d₂/δ)) / n₂.

Proof of Theorem C.1. Let V := f*(X_1) ≡ X_1 Σ_{X1X1}^{-1} Σ_{X1Y} be our target direction, and denote the optimal representation matrix by Ψ := ψ(X_1) ≡ X_1 A, where A := Σ_{X1X1}^{-1} Σ_{X1X2}. Next we make use of the conditional covariance matrix Σ_{X1X2|Ȳ} := Σ_{X1X2} − Σ_{X1Ȳ} Σ_{ȲȲ}^{-1} Σ_{ȲX2}, and plug it into the definition of Ψ:

Ψ = X_1 Σ_{X1X1}^{-1} Σ_{X1Ȳ} Σ_{ȲȲ}^{-1} Σ_{ȲX2} + X_1 Σ_{X1X1}^{-1} Σ_{X1X2|Ȳ} =: L + E,

where L := X_1 Σ_{X1X1}^{-1} Σ_{X1Ȳ} Σ_{ȲȲ}^{-1} Σ_{ȲX2} and E := X_1 Σ_{X1X1}^{-1} Σ_{X1X2|Ȳ}. We analyze these two terms respectively. For L, we note that span(V) ⊆ span(L): indeed, L Σ_{X2Ȳ}^† Σ_{ȲȲ} = X_1 Σ_{X1X1}^{-1} Σ_{X1Ȳ}, and right-multiplying by the selector matrix S_Y gives L Σ_{X2Ȳ}^† Σ_{ȲY} = X_1 Σ_{X1X1}^{-1} Σ_{X1Y}, i.e., L W̄ = V, where W̄ := Σ_{X2Ȳ}^† Σ_{ȲY}. From our assumption that σ_{k+m}(Σ_{ȲȲ}^† Σ_{ȲX2}) = β, we have ‖W̄‖₂ ≤ ‖Σ_{X2Ȳ}^† Σ_{ȲȲ}‖₂ ≤ 1/β. (Alternatively, one could directly define 1/β as ‖W̄‖₂.)

By concentration, E = X_1 Σ_{X1X1}^{-1} Σ_{X1X2|Ȳ} concentrates to Σ_{X1X1}^{-1/2} Σ_{X1X2|Ȳ}. Specifically, when n₂ ≳ k + log(1/δ), (1/√n₂) ‖E‖_F ≤ 1.1 ‖Σ_{X1X1}^{-1/2} Σ_{X1X2|Ȳ}‖_F ≤ 1.1 ε_CI (using Claim A.2). Together we have ‖E W̄‖_F ≲ √n₂ ε_CI/β.

Let Ŵ = argmin_W ‖Y − Ψ W‖². We note that Y = N + V = N + Ψ W̄ − E W̄, where V is our target direction and N is random noise (each row of N has covariance matrix Σ_{YY|X1}).
From the basic inequality, we have

‖Ψ Ŵ − Y‖_F² ≤ ‖Ψ W̄ − Y‖_F² = ‖N − E W̄‖_F²
⇒ ‖Ψ Ŵ − V − E W̄‖² ≤ 2 ⟨Ψ Ŵ − V − E W̄, N − E W̄⟩
⇒ ‖Ψ Ŵ − V − E W̄‖ ≤ 2 ( ‖P_{[Ψ,E,V]} N‖ + ‖E W̄‖ )
⇒ ‖Ψ Ŵ − V‖ ≲ ‖E‖_F ‖W̄‖ + √( (d₂ + log(1/δ)) Tr(Σ_{YY|X1}) )   (from Lemma A.7)
≲ √n₂ ε_CI/β + √( (d₂ + log(1/δ)) Tr(Σ_{YY|X1}) ).   (from Assumption C.1)

Next, by the same procedure that concentrates (1/n₂) X_1^T X_1 to Σ_{X1X1} with Claim A.2, we easily get

ER[Ŵ] := E ‖Ŵ^T ψ(X_1) − f*(X_1)‖² ≲ ε_CI²/β² + Tr(Σ_{YY|X1}) (d₂ + log(1/δ)) / n₂.

C.2 TECHNICAL FACTS

Lemma C.2 (approximation error of PCA). Let A = L + E, where L has rank r smaller than the dimensions of A, ‖E‖₂ ≤ ε, and σ_r(A) = β. Then ‖sin Θ(A, L)‖₂ ≤ ε/β.

Proof. We use the Davis-Kahan sin-θ theorem. First note that ‖A − L‖ = ‖E‖ ≤ ε. From Davis-Kahan we get

‖sin Θ(A, L)‖₂ ≤ ‖E‖₂ / (σ_r(A) − σ_{r+1}(L)) = ‖E‖₂ / σ_r(A) ≤ ε/β,

since σ_{r+1}(L) = 0.
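Lemma C.2 is easy to probe numerically. The sketch below (random low-rank L and a small perturbation E, all sizes our own illustrative choices) compares the principal-angle distance between the top-r singular subspaces of A and L against the bound ε/σ_r(A):

```python
import numpy as np

def sin_theta(U, V):
    """Sine of the largest principal angle between the column spans of U and V."""
    Qu, _ = np.linalg.qr(U)
    Qv, _ = np.linalg.qr(V)
    # sin Theta = || (I - Qv Qv^T) Qu ||_2
    return np.linalg.norm(Qu - Qv @ (Qv.T @ Qu), 2)

rng = np.random.default_rng(3)
n, m, r = 40, 30, 4
L = rng.standard_normal((n, r)) @ rng.standard_normal((r, m))  # rank-r signal
E = 1e-3 * rng.standard_normal((n, m))                         # small perturbation
A = L + E

Ua = np.linalg.svd(A)[0][:, :r]     # top-r left singular subspace of A
Ul = np.linalg.svd(L)[0][:, :r]     # top-r left singular subspace of L
eps = np.linalg.norm(E, 2)
beta = np.linalg.svd(A, compute_uv=False)[r - 1]  # sigma_r(A)
assert sin_theta(Ua, Ul) <= eps / beta + 1e-12    # Lemma C.2's bound holds
```

Here σ_{r+1}(L) = 0 exactly, so the denominator in the Davis-Kahan bound reduces to σ_r(A), matching the lemma.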

C.3 MEASURING CONDITIONAL DEPENDENCE WITH CROSS-COVARIANCE OPERATOR

L²(P_X) denotes the Hilbert space of square-integrable functions with respect to the measure P_X, the marginal distribution of X. We are interested in function classes H_x ⊂ L²(P_X) induced by feature maps:

Definition C.3 (general and universal feature map). We denote by φ : X → F a feature map from a compact input space X to a feature space F, a Hilbert space with inner product ⟨φ(x), φ(x')⟩_F. The associated function class is H_x = { h : X → R | ∃ w ∈ F, h(x) = ⟨w, φ(x)⟩_F, ∀x ∈ X }. We call φ universal if the induced H_x is dense in L²(P_X).

The linear model is the special case where the feature map φ = Id is the identity mapping and the inner product is over Euclidean space. A feature map with higher-order polynomials correspondingly incorporates higher-order moments (Fukumizu et al., 2004; Gretton et al., 2005). For a discrete variable Y we overload φ as the one-hot embedding.

Remark C.1. For continuous data, any universal kernel such as the Gaussian or RBF kernel induces the universal feature map that we require (Micchelli et al., 2006). A two-layer neural network with infinite width also satisfies it: for X ⊂ R^d, φ_NN(x) : S^{d−1} × R → R with φ_NN(x)[w, b] = σ(w^T x + b) (Barron, 1993).

When there is no ambiguity, we overload φ₁ as the random variable φ₁(X_1) over the domain F₁, and H₁ as the function class over X₁. Next we characterize CI using the cross-covariance operator.

Definition C.4 (cross-covariance operator). For random variables X ∈ X, Y ∈ Y with joint distribution P : X × Y → R and associated feature maps φ_x and φ_y, we denote by

C_{φxφy} = E[φ_x(X) ⊗ φ_y(Y)] = ∫_{X×Y} φ_x(x) ⊗ φ_y(y) dP(x, y)

the (un-centered) cross-covariance operator. Similarly, we denote by C_{Xφy} = E[X ⊗ φ_y(Y)] : F_y → X. To understand what C_{φxφy} is, note that it has the same shape as φ_x(x) ⊗ φ_y(y) for each individual x ∈ X, y ∈ Y.
It can equivalently be viewed as an operator C_{φxφy} : F_y → F_x with

C_{φxφy} f = ∫_{X×Y} ⟨φ_y(y), f⟩ φ_x(x) dP(x, y), ∀f ∈ F_y.

For any f ∈ H_x and g ∈ H_y, it satisfies ⟨f, C_{φxφy} g⟩_{H_x} = E_{XY}[f(X) g(Y)] (Baker, 1973; Fukumizu et al., 2004). CI ensures C_{φ1X2|φy} = 0 for arbitrary φ₁:

Lemma C.5. With the one-hot encoding map φ_y and arbitrary φ₁, X_1 ⊥ X_2 | Y ensures

C_{φ1X2|φy} := C_{φ1X2} − C_{φ1φy} C_{φyφy}^{-1} C_{φyX2} = 0.

A more complete discussion of the cross-covariance operator and CI can be found in Fukumizu et al. (2004). Also, recall that an operator C : F_y → F_x is Hilbert-Schmidt (HS) (Reed, 2012) if for complete orthonormal systems (CONSs) {ζ_j} of F_x and {η_i} of F_y, ‖C‖²_HS := Σ_{i,j} ⟨ζ_j, C η_i⟩²_{F_x} < ∞. The Hilbert-Schmidt norm generalizes the Frobenius norm from matrices to operators, and we will later use ‖C_{φ1X2|φy}‖ to quantify approximate CI. We note that covariance operators (Fukumizu et al., 2009; 2004; Baker, 1973) are commonly used to capture conditional dependence of random variables. In this work, we utilize the covariance operator to quantify the performance of the algorithm even when the algorithm is not a kernel method.
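With linear (identity) feature maps, the conditional cross-covariance of Lemma C.5 reduces to the matrix Σ_{X1X2} − Σ_{X1Y} Σ_{YY}^{-1} Σ_{YX2}, which can be estimated from samples. The sketch below uses synthetic data of our own choosing to show that this norm is near zero under CI and clearly nonzero when a direct X_1 → X_2 path is added:

```python
import numpy as np

def partial_cross_cov_norm(X1, X2, Y, reg=1e-8):
    """Frobenius norm of the empirical (linear) conditional cross-covariance
    Sigma_{X1 X2 | Y} = Sigma_{X1X2} - Sigma_{X1Y} Sigma_{YY}^{-1} Sigma_{YX2}.
    Values near zero indicate approximate (linear) CI given Y."""
    X1, X2, Y = (M - M.mean(0) for M in (X1, X2, Y))
    n = X1.shape[0]
    S12 = X1.T @ X2 / n
    S1y = X1.T @ Y / n
    Sy2 = Y.T @ X2 / n
    Syy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    return np.linalg.norm(S12 - S1y @ np.linalg.solve(Syy, Sy2))

rng = np.random.default_rng(4)
n, k = 50000, 2
Y = rng.standard_normal((n, k))
X1 = Y @ rng.standard_normal((k, 6)) + 0.3 * rng.standard_normal((n, 6))
X2_ci = Y @ rng.standard_normal((k, 5)) + 0.3 * rng.standard_normal((n, 5))
X2_dep = X2_ci + X1[:, :5]          # a direct X1 -> X2 path breaks CI
assert partial_cross_cov_norm(X1, X2_ci, Y) < partial_cross_cov_norm(X1, X2_dep, Y)
```

Replacing the identity maps with richer feature maps φ₁, φ_y turns the same computation into an estimate of the operator norm used in the theory.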

C.4 OMITTED PROOF IN GENERAL SETTING

Claim C.6. For a feature map φ₁ with the universal property, we have

ψ*(X_1) := E[X_2 | X_1] = E_L[X_2 | φ₁] = C_{X2φ1} C_{φ1φ1}^{-1} φ₁(X_1),
f*(X_1) := E[Y | X_1] = E_L[Y | φ₁] = C_{Yφ1} C_{φ1φ1}^{-1} φ₁(X_1).

For general feature maps, we instead have

ψ*(X_1) := argmin_{f ∈ H₁^{d2}} E_{X1X2} ‖X_2 − f(X_1)‖²₂ = C_{X2φ1} C_{φ1φ1}^{-1} φ₁(X_1),
f*(X_1) := argmin_{f ∈ H₁^k} E_{X1Y} ‖Y − f(X_1)‖²₂ = C_{Yφ1} C_{φ1φ1}^{-1} φ₁(X_1).

To prove Claim C.6, we first show the following lemma.

Lemma C.7. Let φ : X → F_x be a universal feature map. Then for a random variable Y ∈ Y we have E[Y | X] = E_L[Y | φ(X)].

Proof of Lemma C.7. Denote E[Y | X = x] =: f(x). Since H_x is dense in L²(P_X), there exists a linear functional w on F_x with ⟨w, φ(x)⟩ = f(x) almost everywhere (or a sequence of such functionals converging to f in L²(P_X)). The result then follows directly from the universal property of φ.

Proof of Claim C.6. We want to show that for random variables Y, X, where X is associated with a universal feature map φ_x, we have E[Y | X] = C_{Yφx} C_{φxφx}^{-1} φ_x(X). First, from Lemma C.7, we have E[Y | X] = E_L[Y | φ_x(X)]. Next, write A* : F_x → Y for the linear operator that satisfies E[Y | X] = A* φ_x(X), i.e.,

A* = argmin_A E[‖Y − A φ_x(X)‖²].

From the stationarity condition we have A* E_X[φ_x(X) ⊗ φ_x(X)] = E_{XY}[Y ⊗ φ_x(X)], i.e., A* = C_{Yφx} C_{φxφx}^{-1}, simply from the definition of the cross-covariance operator C.

Claim C.8. ‖C_{φ1φ1}^{-1/2} C_{φ1X2|φȳ}‖²_HS = E_{X1}[ ‖E[X_2 | X_1] − E_Ȳ[ E[X_2 | Ȳ] | X_1 ]‖² ] = ε_CI².

Proof. Expanding the operator in L²(P_{X1}),

‖C_{φ1φ1}^{-1/2} C_{φ1X2|φȳ}‖²_HS = ∫_{X1} ‖ ∫_{X2} x_2 ( p_{X1X2}(x_1, x_2)/p_{X1}(x_1) − p_{X1⊥X2|Ȳ}(x_1, x_2)/p_{X1}(x_1) ) dp_{x2} ‖² dp_{x1} = E_{X1}[ ‖E[X_2 | X_1] − E_Ȳ[ E[X_2 | Ȳ] | X_1 ]‖² ].

C.5 OMITTED PROOF FOR MAIN RESULTS

We first prove a simpler version without approximation error.

Theorem C.9. Fix δ ∈ (0, 1). Under Assumptions 4.1 and 3.5, suppose there is no approximation error, i.e., there exists a linear operator A such that f*(X_1) ≡ A φ₁(X_1); suppose n₁, n₂ ≳ ρ⁴(d₂ + log(1/δ)); and suppose we learn the pretext task such that

E ‖ψ(X_1) − ψ*(X_1)‖_F² ≤ ε_pre².

Then with probability 1 − δ we achieve generalization for the downstream task:

E ‖f*_{H1}(X_1) − Ŵ^T ψ(X_1)‖² ≤ O( σ² (d₂ + log(d₂/δ)) / n₂ + ε_CI²/β² + ε_pre²/β² ).

Proof of Theorem C.9. We follow a procedure similar to Theorem C.1. In the setting with no approximation error, f* = f*_{H1}, and the residual term N := Y − f*(X_1) is a mean-zero random variable with E[‖N‖² | X_1] ≲ σ² according to our data assumption in Section 3; N = Y − f*(X_1^{down}) collects the n₂ samples of the noise terms. We write Y ∈ R^{d3}. For a classification task, Y ∈ {e_i, i ∈ [k]} ⊂ R^k (i.e., d₃ = k) is a one-hot encoded random variable; for a regression problem, Y may be encoded otherwise. For instance, in the yearbook dataset, Y ranges from 1905 to 2013 and represents the year the photo was taken. We want to note that our result is general for both cases: the bound does not depend on d₃, only on the variance of N.

Let Ψ*, L, E, V be defined as follows. Let V = f*(X_1^{down}) ≡ f*_{H1}(X_1^{down}) ≡ φ(X_1^{down}) C_{φ1φ1}^{-1} C_{φ1Y} be our target direction. Denote the optimal representation matrix by

Ψ* := ψ*(X_1^{down}) = φ(X_1^{down}) C_{φ1φ1}^{-1} C_{φ1X2} = φ(X_1^{down}) C_{φ1φ1}^{-1} C_{φ1φȳ} C_{φȳφȳ}^{-1} C_{φȳX2} + φ(X_1^{down}) C_{φ1φ1}^{-1} C_{φ1X2|φȳ} =: L + E,

where L = φ(X_1^{down}) C_{φ1φ1}^{-1} C_{φ1φȳ} C_{φȳφȳ}^{-1} C_{φȳX2} and E = φ(X_1^{down}) C_{φ1φ1}^{-1} C_{φ1X2|φȳ}. In this proof, we denote by S_Y the matrix such that S_Y φ_ȳ = Y. Specifically, if Y is of dimension d₃, S_Y is of size d₃ × |𝒴||𝒵|; hence S_Y C_{φȳA} = C_{YA} for any random variable A.
Therefore, similarly, we have L C_{X2φȳ}^† C_{φȳφȳ} S_Y^T = L C_{X2φȳ}^† C_{φȳY} = L W̄ = V, where W̄ := C_{X2φȳ}^† C_{φȳY} satisfies ‖W̄‖₂ ≤ 1/β. Therefore span(V) ⊆ span(L), since we have assumed that C_{X2φȳ}^† C_{φȳY} is of full rank. On the other hand, E = φ(X_1^{down}) C_{φ1φ1}^{-1} C_{φ1X2|φȳ} concentrates to C_{φ1φ1}^{-1/2} C_{φ1X2|φȳ}. Specifically, when n₂ ≳ d₂ + log(1/δ), (1/√n₂) ‖E‖_F ≤ 1.1 ‖C_{φ1φ1}^{-1/2} C_{φ1X2|φȳ}‖_F ≤ 1.1 ε_CI (using Claim A.3). Together we have ‖E W̄‖_F ≲ √n₂ ε_CI/β.

We also introduce the error from not learning ψ* exactly: E_pre := Ψ* − Ψ = ψ*(X_1^{down}) − ψ(X_1^{down}). With proper concentration and our assumption E ‖ψ(X_1) − ψ*(X_1)‖² ≤ ε_pre², we have (1/√n₂) ‖ψ(X_1^{down}) − ψ*(X_1^{down})‖ ≤ 1.1 ε_pre. Also, the noise term after projection satisfies ‖P_{[Ψ,E,V]} N‖ ≲ σ √(d₂ + log(d₂/δ)) by Lemma A.7. Therefore Ψ = Ψ* − E_pre = L + E − E_pre. Recall that Ŵ = argmin_W ‖ψ(X_1^{down}) W − Y‖_F². With exactly the same procedure as in Theorem C.1, we also get

‖Ψ Ŵ − V‖ ≤ 2 ‖E W̄‖ + 2 ‖E_pre W̄‖ + ‖P_{[Ψ,E,V,E_pre]} N‖ ≲ √n₂ (ε_CI + ε_pre)/β + σ √(d₂ + log(d₂/δ)).

With the proper concentration we also get

E ‖Ŵ^T ψ(X_1) − f*_{H1}(X_1)‖² ≲ (ε_CI² + ε_pre²)/β² + σ² (d₂ + log(d₂/δ)) / n₂.

Next we move on to the proof of our main result, Theorem 4.2, where approximation error occurs.

Proof of Theorem 4.2. The proof is a combination of Theorem 3.9 and Theorem C.9; we follow the notation of Theorem C.9. Now the only difference is that an additional term a(X_1^{down}) is included in Y:

Y = N + f*(X_1^{down}) = N + Ψ* W̄ + a(X_1^{down}) = N + (Ψ + E_pre) W̄ + a(X_1^{down}) = Ψ W̄ + ( N + E_pre W̄ + a(X_1^{down}) ).

Rearranging (1/2n₂) ‖Y − Ψ Ŵ‖_F² ≤ (1/2n₂) ‖Y − Ψ W̄‖_F² gives

(1/2n₂) ‖Ψ(W̄ − Ŵ) + (N + E_pre W̄ + a(X_1^{down}))‖_F² ≤ (1/2n₂) ‖N + E_pre W̄ + a(X_1^{down})‖_F²   (15)
⇒ (1/2n₂) ‖Ψ(W̄ − Ŵ)‖_F² ≤ (1/n₂) |⟨Ψ(W̄ − Ŵ), N + E_pre W̄ + a(X_1^{down})⟩|.   (16)

Then, with a similar procedure as in the proof of Theorem 3.9, and writing Ψ = φ(X_1^{down}) B, we have

(1/n₂) ⟨Ψ(W̄ − Ŵ), a(X_1^{down})⟩ = (1/n₂) ⟨B(W̄ − Ŵ), φ(X_1^{down})^T a(X_1^{down})⟩
= (1/n₂) ⟨C_{φ1}^{1/2} B(W̄ − Ŵ), C_{φ1}^{-1/2} φ(X_1^{down})^T a(X_1^{down})⟩
≲ √(d₂/n₂) ‖C_{φ1}^{1/2} B(W̄ − Ŵ)‖_F ≤ 1.1 (1/√n₂) √(d₂/n₂) ‖φ(X_1^{down}) B(W̄ − Ŵ)‖_F = 1.1 (√d₂/n₂) ‖Ψ(W̄ − Ŵ)‖_F.

Plugging back into (16), we get

(1/2n₂) ‖Ψ(W̄ − Ŵ)‖_F² ≲ ( √d₂/n₂ + (1/n₂)(‖E_pre W̄‖_F + ‖P_Ψ N‖_F) ) ‖Ψ(W̄ − Ŵ)‖_F
⇒ (1/√n₂) ‖Ψ(W̄ − Ŵ)‖_F ≲ (1/√n₂) ( √d₂ + ‖E_pre W̄‖ + σ √(d₂ + log(d₂/δ)) )
⇒ (1/√n₂) ‖Ψ Ŵ − f*_{H1}(X_1^{down})‖_F ≲ σ √( (d₂ + log(d₂/δ)) / n₂ ) + (ε_CI + ε_pre)/β,

where the last step also accounts for the term (1/√n₂)‖E W̄‖ ≲ ε_CI/β. Finally, by concentrating (1/n₂) Ψ^T Ψ to E[ψ(X_1) ψ(X_1)^T], we get

E ‖Ŵ^T ψ(X_1) − f*_{H1}(X_1)‖²₂ ≲ σ² (d₂ + log(d₂/δ)) / n₂ + (ε_CI² + ε_pre²)/β²,

with probability 1 − δ.

D THEORETICAL ANALYSIS FOR CLASSIFICATION TASKS

D.1 CLASSIFICATION TASKS

We now consider the benefit of learning ψ from a class H₁ for the linear classification task with label set 𝒴 = [k]. The performance of a classifier is measured using the standard logistic loss.

Definition D.1. For a task with 𝒴 = [k], the classification loss for a predictor f : X₁ → R^k is

ℓ_clf(f) = E[ ℓ_log(f(X_1), Y) ], where ℓ_log(ŷ, y) = −log( e^{ŷ_y} / Σ_{y'} e^{ŷ_{y'}} ).

The loss for a representation ψ : X₁ → R^{d2} and linear classifier W ∈ R^{k×d2} is denoted by ℓ_clf(Wψ). We note that ℓ_log is 1-Lipschitz in its first argument. The result also holds for the hinge loss ℓ_hinge(ŷ, y) = (1 − ŷ_y + max_{y'≠y} ŷ_{y'})₊, which is also 1-Lipschitz, in place of ℓ_log.

We assume that the optimal regressor f*_{H1} for the one-hot encoding also does well on linear classification.

Assumption D.1. The best regressor for one-hot encodings in H₁ does well on classification, i.e., ℓ_clf(γ f*_{H1}) ≤ ε_{one-hot} is small for some scalar γ.

Remark D.1. Note that if H₁ is universal, then f*_{H1}(x_1) = E[Y | X_1 = x_1], and we know that f*_{H1} is the Bayes-optimal predictor for classification. In general, one can potentially predict the label by taking argmax_{i∈[k]} f*_{H1}(x_1)_i; the scalar γ captures the margin of the predictor f*_{H1}.

We now show that the classifier Ŵ obtained from linear regression on the one-hot encoding with the learned representation ψ will also be good for linear classification.

Theorem D.2. For a fixed δ ∈ (0, 1), under the same setting as Theorem 4.2 and Assumption D.1, with probability 1 − δ we have

ℓ_clf(γ Ŵ^T ψ) ≤ O( γ √( σ² (d₂ + log(d₂/δ)) / n₂ + ε_CI²/β² + ε_pre²/β² ) ) + ε_{one-hot}.

Proof of Theorem D.2. We simply follow the sequence of steps

ℓ_clf(γ Ŵ^T ψ) = E[ ℓ_log(γ Ŵ^T ψ(X_1), Y) ]
≤(a) E[ ℓ_log(γ f*_{H1}(X_1), Y) ] + γ E‖Ŵ^T ψ(X_1) − f*_{H1}(X_1)‖
≤(b) ε_{one-hot} + γ √( E‖Ŵ^T ψ(X_1) − f*_{H1}(X_1)‖² ) = ε_{one-hot} + γ √(ER_ψ[Ŵ]),

where (a) follows because ℓ_log is 1-Lipschitz and (b) follows from Assumption D.1 and Jensen's inequality. Plugging in Theorem 4.2 completes the proof.

E FOUR DIFFERENT WAYS TO USE CI

In this section we propose four different ways to use conditional independence to prove zero approximation error, i.e.:

Claim E.1 (informal). When conditional independence is satisfied (X_1 ⊥ X_2 | Y) and some non-degeneracy condition holds, there exists a matrix W such that E[Y | X_1] = W E[X_2 | X_1].

We note that, for simplicity, most of the results are presented for the jointly Gaussian case, where everything is captured by the linear conditional expectation E_L[Y | X_1] and the covariance matrices. To generalize the results to other random variables, it suffices to replace X_1, X_2, Y by φ₁(X_1), φ₂(X_2), φ_y(Y) in the same arguments.

E.1 INVERSE COVARIANCE MATRIX

Write Σ for the covariance matrix of the joint distribution P_{X1X2Y}:

Σ = [ Σ_XX  Σ_XY ; Σ_YX  Σ_YY ],   Σ^{-1} = [ A  ρ ; ρ^T  B ],

where A ∈ R^{(d1+d2)×(d1+d2)}, ρ ∈ R^{(d1+d2)×k}, B ∈ R^{k×k}. Furthermore, partition

ρ = [ ρ₁ ; ρ₂ ],   A = [ A₁₁  A₁₂ ; A₂₁  A₂₂ ],

with ρ_i ∈ R^{di×k} for i = 1, 2 and A_ij ∈ R^{di×dj} for i, j ∈ {1, 2}.

Claim E.2. When conditional independence is satisfied, A is a block diagonal matrix, i.e., A₁₂ and A₂₁ are zero matrices.

Lemma E.3. With ρ̃_i := ρ_i B^{-1/2} for i ∈ {1, 2}, we have

E[X_1 | X_2] = (A₁₁ − ρ̃₁ρ̃₁^T)^{-1} (ρ̃₁ρ̃₂^T − A₁₂) X_2,   (17)
E[X_2 | X_1] = (A₂₂ − ρ̃₂ρ̃₂^T)^{-1} (ρ̃₂ρ̃₁^T − A₂₁) X_1,   (18)
E[Y | X] = −B^{-1} (ρ₁^T X_1 + ρ₂^T X_2) = −B^{-1/2} (ρ̃₁^T X_1 + ρ̃₂^T X_2).   (19)

Also, in the rank-one case k = 1 (by Sherman-Morrison),

(A₁₁ − ρ̃₁ρ̃₁^T)^{-1} ρ̃₁ρ̃₂^T = (1 / (1 − ρ̃₁^T A₁₁^{-1} ρ̃₁)) A₁₁^{-1} ρ̃₁ρ̃₂^T,
(A₂₂ − ρ̃₂ρ̃₂^T)^{-1} ρ̃₂ρ̃₁^T = (1 / (1 − ρ̃₂^T A₂₂^{-1} ρ̃₂)) A₂₂^{-1} ρ̃₂ρ̃₁^T.

Proof. We know that E[X_1 | X_2] = Σ₁₂ Σ₂₂^{-1} X_2 and E[X_2 | X_1] = Σ₂₁ Σ₁₁^{-1} X_1, where Σ_XX = [ Σ₁₁  Σ₁₂ ; Σ₂₁  Σ₂₂ ]. First, using Σ Σ^{-1} = I, we get the following identities:

Σ_XX A + Σ_XY ρ^T = I,   (20)
Σ_YX A + Σ_YY ρ^T = 0,   (21)
Σ_XX ρ + Σ_XY B = 0,   (22)
Σ_YX ρ + Σ_YY B = I.   (23)

From Equation (22) we get Σ_XY = −Σ_XX ρ B^{-1}, and plugging this into Equation (20),

Σ_XX A − Σ_XX ρ B^{-1} ρ^T = I ⇒ Σ_XX = (A − ρ B^{-1} ρ^T)^{-1} = (A − ρ̃ρ̃^T)^{-1}
⇒ [ Σ₁₁  Σ₁₂ ; Σ₂₁  Σ₂₂ ] = [ A₁₁ − ρ̃₁ρ̃₁^T   A₁₂ − ρ̃₁ρ̃₂^T ; A₂₁ − ρ̃₂ρ̃₁^T   A₂₂ − ρ̃₂ρ̃₂^T ]^{-1}.

We now make use of the expression for the inverse of a block matrix via the Schur complement M/α := δ − γ α^{-1} β: if M = [ α  β ; γ  δ ], then

M^{-1} = [ α^{-1} + α^{-1} β (M/α)^{-1} γ α^{-1}   −α^{-1} β (M/α)^{-1} ; −(M/α)^{-1} γ α^{-1}   (M/α)^{-1} ].

For M = A − ρ̃ρ̃^T, we have Σ_XX = M^{-1}, and thus

Σ₁₂ Σ₂₂^{-1} = −α^{-1} β (M/α)^{-1} · ((M/α)^{-1})^{-1} = −α^{-1} β = (A₁₁ − ρ̃₁ρ̃₁^T)^{-1} (ρ̃₁ρ̃₂^T − A₁₂).

This proves Equation (17), and Equation (18) can be proved similarly. For Equation (19), we know that E[Y | X = (X_1, X_2)] = Σ_YX Σ_XX^{-1} X.
By using Equation (22) we get Σ_XY = −Σ_XX ρ B^{-1}, and thus

E[Y | X = (X_1, X_2)] = Σ_YX Σ_XX^{-1} X = −B^{-1} ρ^T Σ_XX Σ_XX^{-1} X = −B^{-1} ρ^T X = −B^{-1} (ρ₁^T X_1 + ρ₂^T X_2) = −B^{-1/2} (ρ̃₁^T X_1 + ρ̃₂^T X_2).

For the second part (k = 1), we use the fact that (I − a b^T)^{-1} = I + a b^T / (1 − b^T a). Thus

(A₁₁ − ρ̃₁ρ̃₁^T)^{-1} ρ̃₁ρ̃₂^T = (I − A₁₁^{-1} ρ̃₁ρ̃₁^T)^{-1} A₁₁^{-1} ρ̃₁ρ̃₂^T
= ( I + A₁₁^{-1} ρ̃₁ρ̃₁^T / (1 − ρ̃₁^T A₁₁^{-1} ρ̃₁) ) A₁₁^{-1} ρ̃₁ρ̃₂^T
= A₁₁^{-1} ( I + ρ̃₁ρ̃₁^T A₁₁^{-1} / (1 − ρ̃₁^T A₁₁^{-1} ρ̃₁) ) ρ̃₁ρ̃₂^T
= A₁₁^{-1} ( ρ̃₁ρ̃₂^T + (ρ̃₁^T A₁₁^{-1} ρ̃₁ / (1 − ρ̃₁^T A₁₁^{-1} ρ̃₁)) ρ̃₁ρ̃₂^T )
= A₁₁^{-1} ρ̃₁ρ̃₂^T ( 1 + ρ̃₁^T A₁₁^{-1} ρ̃₁ / (1 − ρ̃₁^T A₁₁^{-1} ρ̃₁) )
= (1 / (1 − ρ̃₁^T A₁₁^{-1} ρ̃₁)) A₁₁^{-1} ρ̃₁ρ̃₂^T.

The other statement can be proved similarly.

Claim E.4. Under conditional independence (A₁₂ = A₂₁ = 0),

E[X_2 | X_1] = (A₂₂ − ρ̃₂ρ̃₂^T)^{-1} ρ̃₂ρ̃₁^T X_1,
E[Y | X_1] = −B^{-1/2} ρ̃₁^T X_1 − B^{-1/2} ρ̃₂^T E[X_2 | X_1].

Since (for k = 1) E[X_2 | X_1] is proportional to A₂₂^{-1} ρ̃₂ times the scalar ρ̃₁^T X_1, both terms above are linear in ρ̃₁^T X_1; therefore E[Y | X_1] is in the same direction as (i.e., a linear function of) E[X_2 | X_1].
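Claim E.2 can be verified numerically for the jointly Gaussian case. The sketch below builds a joint covariance from an illustrative latent-variable model of our choosing (which enforces X_1 ⊥ X_2 | Y) and checks that the X_1–X_2 block of the precision matrix vanishes:

```python
import numpy as np

# Check Claim E.2: under CI given Y, the precision matrix Sigma^{-1} of the
# jointly Gaussian (X1, X2, Y) has a zero X1-X2 block (A12 = A21^T = 0).
rng = np.random.default_rng(5)
k, d1, d2 = 2, 4, 3
M1 = rng.standard_normal((d1, k))
M2 = rng.standard_normal((d2, k))
S11 = M1 @ M1.T + 0.4 * np.eye(d1)
S22 = M2 @ M2.T + 0.6 * np.eye(d2)
S12 = M1 @ M2.T                      # CI given Y  =>  cov(X1, X2 | Y) = 0
Sigma = np.block([[S11,   S12,  M1],
                  [S12.T, S22,  M2],
                  [M1.T,  M2.T, np.eye(k)]])
Prec = np.linalg.inv(Sigma)
A12 = Prec[:d1, d1:d1 + d2]          # the X1-X2 block of Sigma^{-1}
assert np.abs(A12).max() < 1e-10
```

This is the Gaussian graphical-model fact that zero precision blocks encode conditional independence given all remaining variables; here conditioning on Y already renders X_1 and X_2 independent.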

E.2 CLOSED FORM OF LINEAR CONDITIONAL EXPECTATION

Refer to Claim 3.1 and the proof of Lemma 3.2; this is the simplest of the arguments used in the paper.

E.3 FROM LAW OF ITERATED EXPECTATION

By the law of iterated expectation,

E_L[X_2 | X_1] = E_L[ E_L[X_2 | X_1, Y] | X_1 ] = E[ [Σ_{X2X1}, Σ_{X2Y}] [ Σ_{X1X1}  Σ_{X1Y} ; Σ_{YX1}  Σ_{YY} ]^{-1} [X_1 ; Y] | X_1 ] = A X_1 + B E_L[Y | X_1].

Using the block matrix inverse,

A = (Σ_{X2X1} − Σ_{X2Y} Σ_{YY}^{-1} Σ_{YX1}) (Σ_{X1X1} − Σ_{X1Y} Σ_{YY}^{-1} Σ_{YX1})^{-1} = Σ_{X2X1|Y} (Σ_{X1X1|Y})^{-1} ∈ R^{d2×d1},
B = Σ_{X2Y|X1} (Σ_{YY|X1})^{-1} ∈ R^{d2×k}.

Therefore, in general (without the conditional independence assumption), our learned representation is ψ(x_1) = A x_1 + B f*(x_1), where f*(·) := E_L[Y | X_1 = ·]. It is easy to see that to learn f* from the representation ψ, we need A to have some benign property, such as a light tail in its eigenspectrum, and B needs to be of full column rank. Notice that under conditional independence, Σ_{X2X1|Y} = 0 and hence A = 0. Therefore we can easily learn f* from ψ as long as X_2 has enough information about Y that Σ_{X2Y|X1} has rank equal to the dimension of Y.

E.4 FROM E[X_2 | X_1, Y] = E[X_2 | Y]

Proof. Let the representation function ψ be defined as ψ(·) := E[X_2 | X_1 = ·], and use the law of iterated expectation:

ψ(x_1) = E[ E[X_2 | X_1, Y] | X_1 = x_1 ] = E[ E[X_2 | Y] | X_1 = x_1 ]   (uses the assumption),

after which the argument in the proof of Lemma 3.5 applies verbatim.

F MORE ON THE EXPERIMENTS

In this section, we describe more experimental results.

Simulations. Following Theorem 4.2, we know that the excess risk (ER) is also controlled by (1) the number of samples for the pretext task (n₁) and (2) the number of samples for the downstream task (n₂), besides k and ε_CI as discussed in the main text. In this simulation, we enforce strict conditional independence and explore how the ER varies with n₁ and n₂. We generate the data the same way as in the main text and keep α = 0, k = 2, d₁ = 50 and d₂ = 40. We restrict the function class to linear models; hence ψ is the linear model predicting X₂ from X₁ on the pretext dataset. We use mean squared error (MSE) as the metric, since it is the empirical version of the ER. As shown in Figure 3, ψ consistently outperforms X₁ in predicting Y with a linear model learned from the given downstream dataset, and the ER does scale linearly with 1/n₂, as indicated by our analysis.

Computer vision task. We test whether learning from ψ is more effective than learning directly from X₁ in a realistic setting (without enforcing conditional independence). Specifically, we test on the Yearbook dataset (Ginosar et al., 2015) and try to predict the date when each portrait was taken (denoted Y_D), which ranges from 1905 to 2013. We resize all portraits to 128 by 128 pixels, crop out the center 64 by 64 pixels (the face) and treat it as X₂, and treat the outer rim as X₁, as shown in Figure 4. For ψ, we learn to predict X₂ from X₁ with standard image inpainting techniques (Pathak et al., 2016), using the full training set (without labels). After that, we fix the learned ψ and train a linear model to predict Y_D from ψ using a smaller labeled set. Besides a linear model on X₁, another strong baseline that we compare with is a ResNet18 (He et al., 2016) predicting Y_D from X₁.
With the full training set, this model achieves a mean absolute difference of 6.89, close to what the state of the art can achieve (Ginosar et al., 2015). ResNet18 has a similar number of parameters as our generator, and is hence roughly in the same function class. We show the MSE results in Figure 4: learning from ψ is more effective than learning from X₁ or X₂ directly, with a linear model as well as with ResNet18. Practitioners usually fine-tune ψ on the downstream task, which usually leads to even more competitive performance (Pathak et al., 2016).

Following the same procedure, we also try to predict gender Y_G. We normalize the labels (Y_G, Y_D) to unit variance and confine ourselves to the linear function class; that is, instead of using a context encoder to inpaint X₂ from X₁, we confine ψ to be a linear function. As shown on the left of Figure 5, the MSE of predicting gender is higher than that of predicting dates. We find that ‖Σ_{X1X1}^{-1/2} Σ_{X1X2|Y_G}‖_F = 9.32, while ‖Σ_{X1X1}^{-1/2} Σ_{X1X2|Y_D}‖_F = 8.15. Moreover, as shown on the right of Figure 5, conditioning on Y_D cancels out more of the spectrum than conditioning on Y_G. In this case, we conjecture that, unlike Y_D, Y_G does not capture much of the dependence between X₁ and X₂. As a result, ε_CI is larger and the downstream performance is worse, as we expected.
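The simulation pipeline described above can be sketched end-to-end in a few lines. The snippet below is a minimal illustrative version with a latent-variable model, dimensions, and noise scales of our own choosing (not the paper's exact configuration): the pretext map is learned from unlabeled (X₁, X₂) pairs, projected to its top-k directions (justified since ψ* has rank k under CI), and a linear head is then fit on a small labeled set and compared against regressing on raw X₁:

```python
import numpy as np

# Minimal CI simulation: pretext task X1 -> X2 on n1 unlabeled pairs, then a
# linear downstream head on psi(X1) with only n2 labels, vs. raw X1 baseline.
rng = np.random.default_rng(7)
k, d1, d2 = 2, 50, 40
n1, n2, ntest = 4000, 55, 4000
M1 = rng.standard_normal((d1, k))
M2 = rng.standard_normal((d2, k))
w = rng.standard_normal(k)

def sample(n):
    Z = rng.standard_normal((n, k))                      # shared latent
    X1 = Z @ M1.T + 0.5 * rng.standard_normal((n, d1))
    X2 = Z @ M2.T + 0.5 * rng.standard_normal((n, d2))   # X1 _|_ X2 given Z
    Y = Z @ w + 0.1 * rng.standard_normal(n)
    return X1, X2, Y

X1p, X2p, _ = sample(n1)                         # pretext data, labels unused
B = np.linalg.lstsq(X1p, X2p, rcond=None)[0]     # psi_hat(x1) = B^T x1
_, _, Vt = np.linalg.svd(X1p @ B, full_matrices=False)
proj = Vt[:k].T                                  # top-k directions of psi_hat

X1d, _, Yd = sample(n2)                          # small labeled set
X1t, _, Yt = sample(ntest)
head = np.linalg.lstsq(X1d @ B @ proj, Yd, rcond=None)[0]   # head on psi
w_raw = np.linalg.lstsq(X1d, Yd, rcond=None)[0]             # baseline on raw X1

mse_psi = np.mean((X1t @ B @ proj @ head - Yt) ** 2)
mse_raw = np.mean((X1t @ w_raw - Yt) ** 2)
assert mse_psi < mse_raw    # the representation needs far fewer labels
```

With n₂ barely above d₁, ordinary regression on raw X₁ suffers large variance, while the head on the (effectively k-dimensional) representation estimates only a handful of coefficients; this is the sample-complexity gap the theory predicts.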



Footnotes: (1) d₃ = k and Y ≡ φ_y(Y) (one-hot encoding) refers to the multi-class classification task; d₃ = 1 refers to regression. (2) Ratings {1, 2} correspond to y = −1 and {4, 5} correspond to y = 1. (3) σ_k(A) denotes the k-th singular value of A, and A^† is the pseudo-inverse of A.



Figure 1: Left two: how MSE scales with k (the dimension of Y) and ε_CI (approximate CI, Assumption 4.1) for the linear function class. Right two: how MSE scales with k and with ψ* for the non-linear function class. The mean of 30 trials is shown as a solid line and one standard error as a shaded region.

Figure 2: Performance on SST of baseline φ 1 (x 1 ), i.e. bag-of-words, and learned ψ(x 1 ) for the two settings. Left: Classification accuracy, Right: Regression MSE.




Figure 3: Left: MSE of using ψ to predict Y versus using X₁ directly; using ψ consistently outperforms using X₁. Right: MSE of ψ learned with different n₁; the MSE scales with 1/n₂ as indicated by our analysis. Simulations are repeated 100 times, with the mean shown as a solid line and one standard error as a shaded region.

Figure 4: Left: example of X₂ (in the red box of the 1st row), X₁ (outside the red box of the 1st row), the input to the inpainting task (2nd row), and ψ(X₁) (3rd row, in the red box); in this example Y = 1967. Middle: mean squared error comparison for yearbook regression predicting dates. Right: mean absolute error comparison for yearbook regression predicting dates. Experiments are repeated 10 times, with the mean shown as a solid line and one standard error as a shaded region.

Figure 5: Left: mean squared error comparison between predicting gender and predicting date. Right: comparison of the spectrum of the covariance conditioned on gender versus conditioned on date.



