NO DOUBLE DESCENT IN PCA: TRAINING AND PRE-TRAINING IN HIGH DIMENSIONS

Abstract

With the recent body of work on overparameterized models, the gap between theory and practice in contemporary machine learning is shrinking. While many of the present state-of-the-art models have an encoder-decoder architecture, there is little theoretical work for this model structure. To improve our understanding in this direction, we consider linear encoder-decoder models, specifically PCA with linear regression on data from a low-dimensional manifold. We present an analysis for fundamental guarantees of the risk and asymptotic results for isotropic data when the model is trained in a supervised manner. The results are also verified in simulations and compared with experiments from real-world genetics data. Furthermore, we extend our analysis to the popular setting where parts of the model are pre-trained in an unsupervised manner by pre-training the PCA encoder with subsequent supervised training of the linear regression. We show that the overall risk depends on the estimates of the eigenvectors in the encoder and present a sample complexity requirement through a concentration bound. The results highlight that using more pre-training data decreases the overall risk only if it improves the eigenvector estimates. Therefore, we stress that the eigenvalue distribution determines whether more pre-training data is useful or not.

1. INTRODUCTION

Many recent success stories of deep learning employ an encoder-decoder structure, where parts of the model are pre-trained in an unsupervised or self-supervised way. Examples can be found in computer vision (Caron et al., 2020; Chen et al., 2020; Goyal et al., 2021), natural language processing (Vaswani et al., 2017; Devlin et al., 2019; Raffel et al., 2020) or multi-modal models (Ramesh et al., 2021; Alayrac et al., 2022). Understanding the properties of this model structure might shed light on how to reliably build large-scale models.

We add to the theoretical understanding of encoder-decoder based models by studying a model consisting of PCA and a linear regression head. We analyse this model for the supervised case and for the case where unsupervised pre-training is followed by supervised linear regression. Our model can be viewed as a simplified, linear example of a large pre-trained deep neural network in combination with linear probing (Devlin et al., 2019; Schneider et al., 2019). While linear models do not reveal the whole picture, they are studied as a tractable first step towards deeper understanding. Indeed, research on linear models has previously provided important insights into relevant mechanisms (Saxe et al., 2014; Lampinen & Ganguli, 2019; Arora et al., 2019; Gidel et al., 2019; Pesme et al., 2021).

We utilize data generated from a low-dimensional manifold, similar to Goldt et al. (2020). This is motivated by the manifold hypothesis (Fefferman et al., 2016), which states that real-world high-dimensional data often have an underlying low-dimensional representation. Our PCA encoder can exploit this data structure effectively. While we keep the low-dimensional data structure fixed, we vary the number of features relative to the number of training data points, which allows us to analyse what is often referred to as overparameterization, i.e. more data features or model parameters than training samples (Belkin et al., 2019).
We do not consider parameter count, since for our model the number of parameters, i.e. the linear regressors, stays fixed due to the PCA encoding. Instead we analyse high-dimensional settings. Studying overparameterization gives theoretical justification of the success of modern large-scale neural networks such as Szegedy et al. (2016); Dosovitskiy et al. (2021).

Theoretical grounding is exceeded by the empirical success of machine learning and specifically deep learning methods through new model structures (Krizhevsky et al., 2017; He et al., 2016; Vaswani et al., 2017) or training methods (Erhan et al., 2010; Ioffe & Szegedy, 2015; Ba et al., 2016). In recent years our theoretical understanding grew, e.g. through the analysis of implicit regularization (Gunasekar et al., 2017; Chizat & Bach, 2020; Smith et al., 2021). Experimental work has also contributed to our understanding (Keskar et al., 2017; Zhang et al., 2017).

The goal of this paper is to extend our understanding of the successful encoder-decoder model structure through theoretical analysis of PCA-regression and by extensive numerical simulations. We generalize results for linear regression and combine classical analysis of overparameterization with pre-training of model components. Our contributions can be summarized as:

• In the supervised case, we provide theoretical guarantees for the risk and parameter norm of the PCA-regression model. For isotropic data we extend the results to the limit where the number of data points $n$ and features $m$ tend to infinity such that $m/n \to \gamma$.
• Through simulations, we confirm our theory for isotropic data and explain the model behavior on data from a low-dimensional manifold. Using genetics data, we validate our findings in a high-dimensional real-world example.
• We extend our analysis to the popular scenario of unsupervised pre-training of the encoder and show that the correct estimation of feature covariance eigenvectors is crucial for low risk. These estimates are highly dependent on the data structure through the eigenvalue decay rate. The results provide a link to known asymptotic results by Xu & Hsu (2019). We challenge the common wisdom that more pre-training data improves the overall risk and show that this is the case only if it improves the estimate of the eigenvectors in the encoder, which is, e.g., the case in data with rapidly decaying eigenvalues.

2. RELATED WORK

Overparameterization

The study of overparameterized models offers a natural route to gain theoretical understanding when it comes to the successes of large models with good generalization properties (Neyshabur et al., 2015; Zhang et al., 2017). The double descent was discovered and analysed in early works (Krogh & Hertz, 1991; Geman et al., 1992; Opper, 1995), but the framing as 'double descent' (Belkin et al., 2019) boosted research in this direction, even if generalization of large models was already studied before (Bartlett & Mendelson, 2002; Dziugaite & Roy, 2017; Belkin et al., 2018; Advani et al., 2020). We add to the understanding of machine learning models by analysing the neglected class of encoder-decoder models with the PCA-regression model.

Analysis of pre-training

The introduction of pre-training of neural networks was a paradigm shift for deep learning. Empirical work (Erhan et al., 2010; Raghu et al., 2019) but also theoretical work such as for sample complexity (Tripuraneni et al., 2020; Du et al., 2021) or the out-of-distribution risk (Kumar et al., 2022) tried to understand the mechanisms. For unsupervised pre-training, contrastive methods were studied (Wang & Isola, 2020; Von Kügelgen et al., 2021). Encoder-decoder based autoencoders are analysed for training dynamics (Nguyen et al., 2019; 2021) or overparameterization (Radhakrishnan et al., 2019; 2020; Buhai et al., 2020; Zhang et al., 2020). In contrast, we study pre-trained PCA encoders and relate the risk to the covariance estimation of the encoder.

Latent variable data generator

We generate data via a linear latent variable data generator based on a low-dimensional manifold. The hidden manifold model (Goldt et al., 2020) and random feature model (Rahimi & Recht, 2007) present similar but nonlinear models. Goldt et al. (2022); Hu & Lu (2022) showed that these nonlinear models are asymptotically equivalent to linear Gaussian models under assumptions such as that the latent dimension $d \to \infty$. In contrast, we keep this dimension fixed. Asymptotic generalization results for this data generator are presented in Gerace et al. (2020); Mei & Montanari (2022). In contrast to our work, where we exploit the low-dimensional structure with the PCA-regression model, they use ridge or logistic regression and therefore do not exploit this structure.

PCA-regression model

Using PCA (Jolliffe, 1982) is common; discussions focus on the choice of principal components (Breiman & Freedman, 1983) or its use for high-dimensional data (Lee et al., 2012). PCA-regression is investigated in Xu & Hsu (2019) for general but fully known covariances in the asymptotic regime. Wu & Xu (2020) extend it by showing that the misalignment of true and estimated eigenvectors affects the risk. Huang et al. (2022) use misalignment bounds (Loukas, 2017) to remove the known covariance assumption and obtain non-asymptotic risk bounds. Our work fills the gaps by providing asymptotic results for isotropic data. We generalize Loukas (2017) to obtain a sample complexity for the covariance estimation in the PCA, which is the missing piece to quantify when the results from Xu & Hsu (2019) can be used in practice with pre-training. It turns out that the data covariance structure is crucial, as Wainwright (2019) points out.

3. PROBLEM FORMULATION

Data generator

We generate a data set $\{x_i, y_i\}_{i=1}^{n}$ according to a latent variable data generator

$$x_i = D z_i + e_i, \tag{1}$$
$$y_i = \theta^\top z_i + \varepsilon_i, \tag{2}$$

by mapping the latent variable $z_i \in \mathbb{R}^d$ with $D \in \mathbb{R}^{m \times d}$ into the observed features $x_i \in \mathbb{R}^m$ and with $\theta \in \mathbb{R}^d$ into the observed outputs $y_i \in \mathbb{R}$. We create $D$ randomly such that $\|D\|_F^2 = d c^2$ with $c$ as correction factor to control the signal-to-noise ratio (SNR), defined in (27). Similarly, to control the outcome-noise-ratio we create $\theta$ such that $\mathbb{E}\,\|\theta^\top z\|_2^2 = r_\theta^2$. Feature noise $e_i \sim \mathcal{N}(0, I_m)$ and output noise $\varepsilon_i \sim \mathcal{N}(0, \sigma_y^2)$ are added. The latent variables are generated such that the singular values of the features have an exponential decay controlled by the decay rate $\alpha \geq 0$ according to

$$z_i \sim \mathcal{N}(0, \lambda_i^2 I_d) \quad \text{with} \quad \lambda_i^2 = \exp(-i\alpha). \tag{3}$$

Our theoretical results do not specifically require an exponential decay of the eigenvalues or a specific rate. However, fast decaying eigenvalues occur in many real-world examples, see Appendix B. We distinguish between two data generators:

1. Isotropic data. This is a special case of (1), (2) with $d = m$, $D = I_m$, $\alpha = 0$ and $e = 0$ to generate isotropic features. It allows us to rewrite the data generator as

$$y_i = \theta^\top x_i + \varepsilon_i \quad \text{with} \quad x_i \sim \mathcal{N}(0, I_m). \tag{4}$$

2. Latent variable data. We distinguish between 1) $\alpha = 0$, leading to an isotropic but low-dimensional signal, and 2) $\alpha > 0$, which has dominant but rapidly decaying eigenvalues of the feature covariance matrix. The latter data generator is motivated by the fact that many real-world data sets have a low-dimensional signal manifold with rapidly decaying eigenvalues.

Note that while our latent variable data generator is similar to the latent space model from Hastie et al. (2022), we use our PCA-regression model instead of direct regression from features to outputs. A graphical model of our data generator is provided in Figure 1 and details are in Appendix C.
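In our numerical results we compare with a model which directly regresses the outcomes from the features, referred to as the direct regression model.

[Figure 1: Graphical model of the data generator (nodes $x_i$, $z_i$, $y_i$) and of the PCA-regression model (nodes $x_i$, $\hat z_i$, $\hat y_i$; edges labelled PCA and lin. reg.).]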
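The latent variable data generator described above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function name `make_latent_data`, the construction of $D$ as a scaled orthonormal matrix, the random choice of $\theta$, and the reading of (3) as "coordinate $i$ of $z$ has variance $\lambda_i^2$" are all assumptions made for the example.

```python
import numpy as np

def make_latent_data(n, m, d, alpha=0.25, rho_x=1.0, sigma_y=0.0, r_theta=1.0, seed=0):
    """Sample (X, y) from x = D z + e, y = theta^T z + eps (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Assumed reading of (3): coordinate i of z has variance lambda_i^2 = exp(-i * alpha).
    lam2 = np.exp(-alpha * np.arange(1, d + 1))
    Z = rng.standard_normal((n, d)) * np.sqrt(lam2)
    # Assumption: D = c * Q with orthonormal columns Q, so that ||D||_F^2 = d * c^2.
    Q, _ = np.linalg.qr(rng.standard_normal((m, d)))
    # c is chosen so that E||D z||^2 / E||e||^2 = rho_x for e ~ N(0, I_m).
    c = np.sqrt(rho_x * m / lam2.sum())
    D = c * Q
    # Assumption: theta is a random direction rescaled so that E (theta^T z)^2 = r_theta^2.
    theta = rng.standard_normal(d)
    theta *= r_theta / np.sqrt(np.sum(theta**2 * lam2))
    X = Z @ D.T + rng.standard_normal((n, m))        # features with isotropic noise
    y = Z @ theta + sigma_y * rng.standard_normal(n)  # outputs with Gaussian noise
    return X, y, Z, D, theta

X, y, Z, D, theta = make_latent_data(n=50, m=30, d=5)
```

Setting `alpha=0` in this sketch gives the isotropic low-dimensional signal, while `alpha > 0` gives the decaying-eigenvalue case.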

4. ANALYZING THE SUPERVISED CASE

Training the complete PCA-regression model in a supervised way represents a situation commonly encountered in high-dimensional real-world applications. Examples of using this model are in exploratory statistical research (Massy, 1965), econometrics (Geweke, 1996), genetics (Wang & Abbott, 2008), robotics (Vijayakumar & Schaal, 2000) and many more.

4.1. THEORETICAL ANALYSIS

For our analysis, we are interested in closed form solutions for the risk and parameter norm in order to obtain fundamental guarantees for the PCA-regression model. We decompose our solution into bias and variance terms similar to classic decompositions and interpret the results.

Bias-variance decomposition

We stack all outputs in the vector $y \in \mathbb{R}^n$ and all estimated latent variables as rows in the matrix $\hat Z \in \mathbb{R}^{n \times \hat d}$. The solution to the unregularized linear regression yields

$$\hat\theta = (\hat Z^\top \hat Z)^+ \hat Z^\top y = (\hat V^\top X^\top X \hat V)^+ \hat V^\top X^\top y,$$

where $(\cdot)^+$ denotes the Moore-Penrose pseudoinverse. We can rewrite our data generator directly from features to outputs as $y = X\beta + \epsilon$ with $\beta^\top = \theta^\top D^+ \in \mathbb{R}^m$ and $\epsilon_i \sim \mathcal{N}(0, \sigma_\epsilon^2)$, where $\sigma_\epsilon^2 = \sigma_y^2 + \|\beta\|_2^2$. Following Appendix D.2 the solution becomes

$$\hat\theta = (\hat\Sigma^\top \hat\Sigma)^+ \hat\Sigma^\top \hat U^\top (X\beta + \epsilon) = \hat V^\top \beta + \hat\Sigma^+ \hat U^\top \epsilon.$$

Lemma 1. Let the feature sample covariance be $\hat C = \frac{1}{n} X^\top X$ and the true covariance be $C$. Define the orthogonal projectors $\Phi = \hat V \hat V^\top$ and $\Pi = I_m - \Phi$, where $\Phi$ is the projection onto the column space of the $\hat d$ first right singular vectors of $X$. Then, the risk of the PCA-regression model $R(\hat\theta) = \mathbb{E}_{(x_0, y_0)} (y_0 - \hat y(x_0))^2$ and the parameter norm $\|\hat\theta\|_2^2 = \hat\theta^\top \hat\theta$ are given by

$$\mathbb{E}_\epsilon R(\hat\theta) = \beta^\top \Pi C \Pi \beta + \frac{\sigma_\epsilon^2}{n} \mathrm{Tr}(\hat V^\top C \hat V \hat V^\top \hat C^+ \hat V) + \sigma_\epsilon^2,$$
$$\mathbb{E}_\epsilon \|\hat\theta\|_2^2 = \beta^\top \Phi \beta + \frac{\sigma_\epsilon^2}{n} \mathrm{Tr}(\hat V^\top \hat C^+ \hat V).$$

The proofs are in Appendices D.3, D.5. In both equations, the variance (second) term is controlled by the estimated singular vectors $\hat V$, which project the covariances $C$, $\hat C$ to a $\hat d$-dimensional subspace and therefore contain less noise. Hence, we expect the variance term to decrease steadily for larger $\gamma$, so that the PCA-regression model avoids the "interpolation peak" at $\gamma = 1$ which linear regression has. The results generalize Lemma 1 in Hastie et al. (2022) for the risk of direct regression models, since we obtain the same form when choosing $\hat d = m$, i.e. no dimensionality reduction.

Asymptotics for isotropic features

Using results from random matrix theory and Lemma 1, we derive asymptotics for the risk and parameter norm in the case of isotropic features $C = I_m$.

Theorem 1. Assume isotropic features $C = I_m$, which implies $d = m$, and choose constant $\hat d$. Then, as $m, n \to \infty$ such that $\frac{m}{n} \to \gamma$, the expected risk and parameter norm satisfy almost surely

$$\mathbb{E}_\epsilon R(\hat\theta) \to \sigma_\epsilon^2 \frac{m}{n} \int_{s}^{\infty} \frac{1}{s}\, dF_\gamma(s) + \sigma_\epsilon^2 + \begin{cases} \beta^\top \beta \left(1 - \min(\hat d, m)/m\right) & \text{for } \gamma < 1, \\ \beta^\top \beta \left(1 - \min(\hat d, n)/m\right) & \text{for } \gamma > 1, \end{cases}$$
$$\mathbb{E}_\epsilon \|\hat\theta\|_2^2 \to \sigma_\epsilon^2 \frac{m}{n} \int_{s}^{\infty} \frac{1}{s}\, dF_\gamma(s) + \begin{cases} \beta^\top \beta \, \min(\hat d, m)/m & \text{for } \gamma < 1, \\ \beta^\top \beta \, \min(\hat d, n)/m & \text{for } \gamma > 1, \end{cases}$$

with $F_\gamma$ the Marčenko-Pastur law (Marčenko & Pastur, 1967) and the lower integration limit $s$ the value in $\mathbb{R}$ that satisfies $\frac{\hat d}{m} = \int_{s}^{\infty} dF_\gamma$. In both equations, the first term represents the variance and the last one the bias. The proofs are in Appendices D.4, D.6. Again, we obtain the same risk when choosing $\hat d = m$ as for direct regression models on isotropic data, see Theorem 1 in Hastie et al. (2022). Contrary to direct regression, the PCA-model will always have a bias term since $\hat d < m, n$ in general.
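As a sanity check of the closed-form solution $\hat\theta = \hat V^\top \beta + \hat\Sigma^+ \hat U^\top \epsilon$, the following NumPy sketch fits PCA-regression through a truncated SVD and compares the fitted parameters with the closed form. The helper name `pca_regression` and the isotropic toy data are assumptions chosen for illustration only.

```python
import numpy as np

def pca_regression(X, y, d_hat):
    """Project X onto its top d_hat right singular vectors, then solve least squares."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    V_d = Vt[:d_hat].T                      # m x d_hat, estimated principal directions
    Z_hat = X @ V_d                         # n x d_hat, estimated latent variables
    theta_hat = np.linalg.pinv(Z_hat) @ y   # unregularized least-squares solution
    return theta_hat, V_d, U[:, :d_hat], s[:d_hat]

# Toy data (assumption): isotropic Gaussian features in the high-dimensional regime m > n.
rng = np.random.default_rng(0)
n, m, d_hat, sigma_eps = 200, 300, 10, 0.5
X = rng.standard_normal((n, m))
beta = rng.standard_normal(m) / np.sqrt(m)
eps = sigma_eps * rng.standard_normal(n)
y = X @ beta + eps

theta_hat, V_d, U_d, s_d = pca_regression(X, y, d_hat)
# Closed form restricted to the d_hat retained components.
theta_closed = V_d.T @ beta + (U_d.T @ eps) / s_d
max_gap = np.max(np.abs(theta_hat - theta_closed))
```

In this sketch `max_gap` is at the level of floating-point error, which reflects that the least-squares fit on the projected features and the closed form are the same estimator.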

4.2. NUMERICAL RESULTS

In this section we give numerical results for the different data generators and compare these results with those from the analysis above. We compare our PCA-regression model with 1) the learnt direct regression model and 2) a model that always predicts zero, whose risk we denote as the null risk.

Isotropic features

We generate $n = 400$ data points for training and testing according to our isotropic data generator (4), implying $d = m$, with $\sigma_\varepsilon^2 = 1$ and $r_\theta^2 = 1$. Each sample has $m = \gamma n$ features where we vary $\gamma \in [0.3, 20]$, i.e. from low-dimensional ($\gamma < 1$) to high-dimensional ($\gamma > 1$) features. We compute risk $R(\hat\theta)$ and parameter norm $\|\hat\theta\|_2^2$ as in the definition of Lemma 1 and average over 200 realizations. The results are compared with analytical solutions from Theorem 1. Figure 2 depicts the results for different values of $\hat d$; we can make several observations:

1) The numerical results, i.e. the '×' marks, and the analytical solutions, i.e. the solid lines, align closely and therefore support our theoretical analysis. The expected decrease of the variance term and a nonzero bias term for all $\gamma$ can be observed in Figure 10, where we show the bias-variance decomposition according to Theorem 1.

2) For sufficiently large $\hat d$ the results of the PCA model match the direct regression results in the limit of small and large $\gamma$. For isotropic data, every singular direction is equally important and the PCA requires sufficiently many components, i.e. large $\hat d$, to achieve reasonable results.

3) The PCA-regression model does not suffer from the singularity at $\gamma = 1$, as we predicted from Lemma 1. The PCA alleviates the bad conditioning of the matrix $X^\top X$ which has to be inverted for the least squares solution. Below we will see that ridge regression has a similar effect.

4) The parameter norm decreases steadily for larger $\gamma$. We observe this for all models, which implies that we obtain smooth solutions which are beneficial to avoid overfitting.
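The absence of the interpolation peak can be illustrated with a small simulation at $\gamma = 1$. The sketch below is not the experiment behind Figure 2: it uses smaller sizes ($n = m = 100$), fewer realizations, and the hypothetical helper names `risk_isotropic` and `compare_at_gamma`. For isotropic test features the risk of any linear predictor with coefficient vector $\hat\beta$ equals $\|\beta - \hat\beta\|_2^2 + \sigma_\varepsilon^2$, so no test set needs to be sampled.

```python
import numpy as np

def risk_isotropic(beta, beta_hat, sigma_eps):
    """Risk E(y0 - x0^T beta_hat)^2 for x0 ~ N(0, I_m) and y0 = x0^T beta + noise."""
    return np.sum((beta - beta_hat) ** 2) + sigma_eps**2

def compare_at_gamma(n, m, d_hat, sigma_eps=1.0, trials=20, seed=0):
    """Average risk of PCA-regression and min-norm direct regression (illustrative sizes)."""
    rng = np.random.default_rng(seed)
    pca, direct = [], []
    for _ in range(trials):
        beta = rng.standard_normal(m)
        beta /= np.linalg.norm(beta)            # r_theta^2 = 1
        X = rng.standard_normal((n, m))
        y = X @ beta + sigma_eps * rng.standard_normal(n)
        # Direct regression: minimum-norm least squares on all m features.
        b_direct = np.linalg.pinv(X) @ y
        # PCA-regression: least squares on the top d_hat principal directions,
        # mapped back to feature space for the risk computation.
        _, _, Vt = np.linalg.svd(X, full_matrices=False)
        V_d = Vt[:d_hat].T
        b_pca = V_d @ (np.linalg.pinv(X @ V_d) @ y)
        pca.append(risk_isotropic(beta, b_pca, sigma_eps))
        direct.append(risk_isotropic(beta, b_direct, sigma_eps))
    return float(np.mean(pca)), float(np.mean(direct))

# gamma = m / n = 1 is where direct regression has its interpolation peak.
risk_pca, risk_direct = compare_at_gamma(n=100, m=100, d_hat=20)
```

In this setting the direct regression risk is dominated by the smallest singular values of $X$, whereas the PCA-regression risk stays bounded because only the $\hat d$ largest singular values are inverted.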

Latent variable data

We use the latent variable data generator with $d = 20$, $r_\theta^2 = 1$, $\sigma_y^2 = 0$, feature SNR $\rho_x = 1$ and $\theta$ as in (33) to generate $n = 400$ training and testing data points and average over 200 realizations. The results for the risks are depicted in Figure 3 for eigenvalue decay of $\alpha = 0$ (left) and for $\alpha = 0.25$ (right). Corresponding plots for the parameter norm are in Figure 11. We observe for $\alpha = 0$ (left plot) that if $\hat d \geq d$, then the PCA-regression model approaches the direct regression results for small and large $\gamma$. The plots for $\hat d = 20$ and $\hat d = 40$ overlap since both are at least as large as $d$ and capture all information. However, for misspecified models with $\hat d < d$ the solution obtained for the PCA-regression is suboptimal. Following Lemma 1, by choosing $\hat d < d$ we remove important eigendirections and therefore observe an increased risk. Similar conclusions can be drawn for the results for data with $\alpha > 0$ (right plot), but with less penalty on the risk for suboptimal $\hat d$.

For the genetics example, Figure 4 shows the median results over 100 realizations for different latent dimensions $\hat d$. We observe a qualitative resemblance to the results for the latent variable model in Figure 3: 1) the PCA-regression risk decreases monotonically with increasing $\gamma$ and 2) higher values of $\hat d$ reach the lowest overall risk. In contrast to the latent variable results, the PCA-regression does not reach the same level as the direct regression for larger $\gamma$. However, this is reasonable since 1) the eigenvalue distribution in the genetics example is heavy tailed (see Figure 16), which implies that the true latent dimension would be much larger, and 2) the relationship between genotypes and phenotypes may not be linear in nature.

5. PRE-TRAINING THE PCA ENCODER

So far, we analysed the case when the complete model is trained in a supervised manner. Now we extend to the popular case of pre-training parts of the model in an unsupervised way. In this context we can view our model as a simple, linear version of large pre-trained neural networks with linear probing. Our analysis therefore yields insights into their behavior. The pre-training extension requires a generalization of our theory because we deal with different data sets of varying size.

5.1. GENERALISATION OF PROBLEM FORMULATION

First, we pre-train the PCA on a so-called pre-training data set $\{x_i\}_{i=1}^{n_p}$ without output values $y_i$. It can therefore only be used for unsupervised pre-training. Second, we train only the linear regression head on the PCA features with the training data set $\{x_i, y_i\}_{i=1}^{n}$. Note that the number of samples $n_p$ in the pre-training data set differs from the number of samples $n$ in the training data set.

Data generator

In this section, we focus on the latent variable data generator. We change our feature generation from (1) to simplify the theoretical analysis. We orthogonalise the signal $z$ (generated by (3)) and the noise $e$ by introducing $D_\perp$ such that $D^\top D_\perp = 0$. We use the following feature generator for both the training and pre-training data set features:

$$x_i = D z_i + D_\perp e_i. \tag{13}$$

Model

The model is the same as in the supervised case. Since the PCA is performed on the pre-training data set, we rename the first $\hat d$ estimated eigenvectors of the feature covariance matrix from the pre-training data set to $\hat H$. We do so to distinguish it from the eigenvectors $\hat V$ estimated using the training data set as in the supervised case. Hence, in the pre-training case we obtain

$$\hat z_i = \hat H^\top x_i. \tag{14}$$
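The two-stage procedure (unsupervised PCA on the pre-training set, then supervised regression on the training set) can be sketched as follows. All helper names (`pretrain_encoder`, `fit_head`, `predict`, `sample`) and the specific sizes are assumptions for illustration; the construction of $D$ and $D_\perp$ from one QR decomposition is one possible way to satisfy $D^\top D_\perp = 0$.

```python
import numpy as np

def pretrain_encoder(X_pre, d_hat):
    """Top d_hat eigenvectors of the pre-training sample covariance (columns of H_hat)."""
    C_hat = X_pre.T @ X_pre / X_pre.shape[0]
    _, eigvec = np.linalg.eigh(C_hat)        # eigenvalues in ascending order
    return eigvec[:, ::-1][:, :d_hat]        # reorder to descending, keep d_hat columns

def fit_head(X_train, y_train, H_hat):
    """Least-squares regression head on the encoded features z_hat = H_hat^T x."""
    theta_hat, *_ = np.linalg.lstsq(X_train @ H_hat, y_train, rcond=None)
    return theta_hat

def predict(X, H_hat, theta_hat):
    return X @ H_hat @ theta_hat

# Toy orthogonal generator x = D z + D_perp e (sizes and noise level are assumptions).
rng = np.random.default_rng(0)
m, d, n_pre, n, alpha = 50, 5, 2000, 100, 0.25
lam2 = np.exp(-alpha * np.arange(1, d + 1))
Q, _ = np.linalg.qr(rng.standard_normal((m, m)))
c = np.sqrt((m - d) / lam2.sum())            # feature SNR rho_x = 1
D, D_perp = c * Q[:, :d], Q[:, d:]           # D^T D_perp = 0 by construction
theta = rng.standard_normal(d)
theta /= np.sqrt(np.sum(theta**2 * lam2))    # E (theta^T z)^2 = 1

def sample(num):
    Z = rng.standard_normal((num, d)) * np.sqrt(lam2)
    E = rng.standard_normal((num, m - d))
    return Z @ D.T + E @ D_perp.T, Z @ theta + 0.1 * rng.standard_normal(num)

X_pre, _ = sample(n_pre)                     # labels of the pre-training set are discarded
X_train, y_train = sample(n)
X_test, y_test = sample(1000)

H_hat = pretrain_encoder(X_pre, d_hat=d)
theta_hat = fit_head(X_train, y_train, H_hat)
test_mse = np.mean((y_test - predict(X_test, H_hat, theta_hat)) ** 2)
```

Varying `n_pre` relative to `n` in this sketch gives a simple way to explore how the encoder quality affects the downstream risk.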

5.2. THEORETICAL ANALYSIS

As in Section 4.1 we want to establish a connection to the complete training risk as fundamental model guarantee. Since we extend the setting to pre-train the encoder on a different data set, we have to deal with sample complexities for the estimation of eigenvectors in the PCA. With the orthogonal feature generator (13), we recover the true latent variables from features

$$z_i = D^+ (x_i - D_\perp e_i) = D^+ x_i. \tag{15}$$

Comparing it with the projection from the model in (14), we notice that the estimated latent space depends on how well $\hat H^\top$ estimates $D^+$. Hence, the risk analysis problem in the case with pre-training turns into a sample complexity problem of the eigenvectors. Note that $D^+ = D^\top / c^2$ with correction factor $c$ for SNR control, see Appendix C.

Estimation of eigenvectors

The sample complexity of eigenvectors is thoroughly studied by Loukas (2017). Here, we review some of their results and adapt them to our setting. The PCA loss of encoding $x$ into the (estimated) latent space is given by

$$L(D) = \mathbb{E}\left[\|x\|_2^2 - \|D^+ x\|_2^2\right] = \sum_{i=d+1}^{m} s_i, \tag{16}$$
$$L(\hat H) = \mathbb{E}\left[\|x\|_2^2 - \|\hat H^\top x\|_2^2\right] = \sum_{i=1}^{m} s_i - \sum_{i=1}^{\hat d} \sum_{j=1}^{m} (\hat h_i^\top h_j)^2 s_j. \tag{17}$$

Here, $s_i$ is the $i$th eigenvalue and $h_i$ is the $i$th eigenvector of the true feature covariance matrix. The difference of the PCA losses quantifies how well a sample $x$ is projected with the estimated eigenvectors $\hat H$ into the latent space compared to a projection with the true eigenvectors $(D^+)^\top$.

Lemma 2. Define the loss of projecting a sample $x$ with $D$ or $\hat H$ as in (16), (17). Then, we can write the loss difference as $L(\hat H) - L(D) = \mathbb{E}\left[\|z\|_2^2 - \|\hat z\|_2^2\right]$ and formulate it as

$$L(\hat H) - L(D) = \sum_{i=1}^{\min(d, \hat d)} \sum_{j=1}^{m} (\hat h_i^\top h_j)^2 (s_i - s_j) + \underbrace{\sum_{i=\hat d}^{d} s_i}_{=0 \text{ for } \hat d \geq d} + \underbrace{\sum_{i=d}^{\hat d} \sum_{j=1}^{m} (\hat h_i^\top h_j)^2 s_j}_{=0 \text{ for } \hat d \leq d}. \tag{18}$$

The result indicates that if we have perfect encoding ($\hat d = d$), then only the first term remains. If also all eigenvalues are equal, then there is no loss difference and the estimation of the direction of eigenvectors $\hat H$ does not matter since we are dealing with the isotropic case. However, for more natural scenarios such as exponentially decaying eigenvalues, the eigenvalue difference is nonzero and correct estimation of the eigenvectors $\hat H$ is crucial for a small loss difference. If we are dealing with imperfect encoding, there is either an additional term due to misalignment of the estimated eigenvalues ($\hat d < d$) or due to encoding of noise ($\hat d > d$). The proof is in Appendix F.1.

Theorem 2. Let $t > 0$ be real. Using Corollary 4.1 from Loukas (2017) with $k_j^2 = s_j (s_j + \mathrm{Tr}(C))$ from Corollary 4.3 in Loukas (2017), we obtain the concentration inequality

$$P\left(L(\hat H) - L(D) > t\right) \leq \frac{4}{t\, n_p} \left[ \sum_{i=1}^{\min(d, \hat d)} \sum_{j=i+1}^{m} \frac{k_j^2}{|s_i - s_j|} + \sum_{i=\hat d}^{d} \sum_{j=1}^{m} \frac{k_j^2 s_i}{(s_i - s_j)^2} + \sum_{i=d}^{\hat d} \sum_{j=1}^{m} \frac{k_j^2 s_j}{(s_i - s_j)^2} \right]. \tag{19}$$

Under review as a conference paper at ICLR 2023

This theorem states that, in addition to the implications of Lemma 2, there are two main scenarios where we obtain a lower right-hand side and therefore a tighter bound: 1) when the feature covariance matrix has rapidly decaying eigenvalues, i.e. large $|s_i - s_j| \geq 0$, since $j > i$, or 2) when we have access to more pre-training samples $n_p$. The proof is in Appendix F.2.

Connection to the risk

We define the risk between features $x$ and outcomes $y$ in the same way as for the supervised case, see Lemma 1. The goal is to obtain asymptotic results for the risk for different eigenvalue decays including our latent variable data generator. Xu & Hsu (2019) present asymptotic results for polynomial and more general eigenvalue decays in the PCA-regression model. However, their analysis relies on the assumption that the eigenvectors are fully known, i.e. $\hat H^\top = D^+$, which is an unrealistic scenario. A solution to resolve this condition is to estimate the eigenvectors $\hat H$ from unlabeled data $\{x_i\}$. But it is unclear under what conditions the estimate is sufficiently good. The eigenvector estimation is precisely what is done during the pre-training step. Theorem 2 provides a sample complexity for the eigenvector estimation quality and therefore provides the missing condition when the results from Xu & Hsu (2019) hold in practice. Choosing $t$ sufficiently small, we can quantify how many samples are necessary for the estimated eigenvectors to be close to the true ones. Hence, we provide conditions when the asymptotic risk results from Xu & Hsu (2019) can be used in practice. However, if we do not have access to sufficiently many pre-training data samples, then we know that our estimated eigenvectors $\hat H$ are misaligned. These eigenvectors will project the features into a misaligned latent space $\hat z$. Finally, we perform linear regression from this misaligned space.
Quantifying the additional error on the overall risk for misaligned linear regression is an open problem.
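The PCA loss difference $L(\hat H) - L(D)$ can be evaluated numerically when the true covariance is known. The sketch below assumes, for simplicity, that the true eigenvectors are the standard basis (so $C = \mathrm{diag}(s)$), that $\hat d = d$, and that the eigenvalues consist of a scaled exponentially decaying signal block plus a unit-variance noise block; the helper names `pca_loss` and `loss_difference` are hypothetical. It uses the identity $L(H) = \mathrm{Tr}(C) - \mathrm{Tr}(H^\top C H)$ for an orthonormal $H$, which is the matrix form of the double sum in the loss $L(\hat H)$.

```python
import numpy as np

def pca_loss(H, C):
    """L(H) = E[||x||^2 - ||H^T x||^2] = Tr(C) - Tr(H^T C H) for zero-mean x, Cov(x) = C."""
    return np.trace(C) - np.trace(H.T @ C @ H)

def loss_difference(n_pre, s, d_hat, rng):
    """L(H_hat) - L(H_true) with H_hat estimated from n_pre Gaussian samples."""
    m = s.size
    # Assumption: the true eigenvectors are the standard basis, so C = diag(s).
    X = rng.standard_normal((n_pre, m)) * np.sqrt(s)
    _, vec = np.linalg.eigh(X.T @ X / n_pre)   # ascending eigenvalue order
    H_hat = vec[:, ::-1][:, :d_hat]
    C = np.diag(s)
    H_true = np.eye(m)[:, :d_hat]
    return pca_loss(H_hat, C) - pca_loss(H_true, C)

rng = np.random.default_rng(0)
m, d, alpha = 50, 5, 0.25
lam2 = np.exp(-alpha * np.arange(1, d + 1))
c2 = (m - d) / lam2.sum()                      # scaling for feature SNR 1 (assumption)
s = np.concatenate([c2 * lam2, np.ones(m - d)])  # eigenvalues in descending order

diff_small = np.mean([loss_difference(50, s, d, rng) for _ in range(5)])
diff_large = np.mean([loss_difference(5000, s, d, rng) for _ in range(5)])
```

With decaying eigenvalues, the loss difference in this sketch shrinks as the number of pre-training samples grows, in line with the $1/n_p$ factor in Theorem 2; the difference is non-negative because no $\hat d$-dimensional orthonormal projection captures more variance than the top $\hat d$ true eigenvectors.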

5.3. NUMERICAL RESULTS

We present numerical results when using pre-training. We denote the relation of pre-training samples to training samples as $\mu = n_p / n$. In Figure 5 the risk for data with two different eigenvalue decay rates is depicted. We make three main observations: 1) Horizontally for $\mu = \text{const.}$, in both plots the risk decreases similar to the supervised case in Figure 3 and therefore follows Lemma 1. 2) Vertically for $\gamma = \text{const.}$, in the right plot ($\alpha = 0.25$) the risk decreases as expected from Theorem 2 when using more pre-training samples. The effect is most significant for large overparameterization $\gamma$. 3) Vertically for $\gamma = \text{const.}$, in the left plot ($\alpha = 0$), we notice that more pre-training data does not decrease the risk. Since we have perfect encoding $\hat d = d$ and two blocks of constant eigenvalues, the eigenvector estimation is by Lemma 2 almost perfect and barely improves with more pre-training, see Theorem 2. Therefore, using more pre-training data does not improve the overall risk either. The observation supports our finding that more pre-training data decreases the risk only if it improves the eigenvector estimation. Hence, the eigenvalue distribution is crucial for the necessity of pre-training.

Figure 6 shows horizontal slices of Figure 5 (right) and compares them with fully supervised models. We notice that all pre-trained models outperform the fully supervised models for $\gamma > 1$. Interestingly, in the results for $\mu = 1$ (blue '×') we use the same amount of data to learn the PCA, $n_p = n$, as in the fully supervised case (black triangles). While for the pre-trained model we use a different data set of the same size to learn the regression, we use the exact same data in the supervised case.

6. CONCLUSION

Limitations

Our proofs in the supervised case rely on random matrix theory for which we present asymptotic results for isotropic data. However, it is not trivial to find solutions for the general case, including our latent variable data generator, which requires more research. Similarly, it is an open question how to obtain a closed form solution for the complete risk in the scenario with pre-training based on eigenvector alignment, which leads to our sample complexity bounds. Furthermore, while we observe key phenomena for our real-world example, the data there is not approximately on a low-dimensional manifold, unlike in our latent variable data generator, and hence not fully comparable.

Supervised case

Our theoretical analysis generalizes the results for linear regression (Hastie et al., 2022), which is a special case of PCA-regression without dimensionality reduction ($\hat d = m$). In the non-asymptotic regime, Huang et al. (2022) describe similar results and hence they independently support our theory. Selecting the correct latent dimension $\hat d$ for data from a low-dimensional manifold is crucial for the risk, as Lemma 2 suggests. This is in line with the discovery of latent factors in variational autoencoders from the disentanglement literature (Higgins et al., 2017; Kumar et al., 2018). While our results that PCA mitigates the "interpolation peak" due to its regularizing behavior may not surprise, they provide formal guarantees for the performance of a commonly used model on real-world data structures. Practitioners can now rely on these fundamental guarantees for model development, but more research is needed for general data structures.

Pre-training

Our results from Figure 5, that a certain decay rate of the data covariance eigenvalues is necessary for pre-training to have its expected effect (more pre-training data is better), may be surprising at first. However, from Theorem 2 it becomes clear that more pre-training data only helps to improve the eigenvector estimation. If, however, the eigenvectors are already estimated perfectly, such as for two blocks of isotropic data (e.g. latent variable data with decay rate $\alpha = 0$), then using more pre-training data has no effect. Hence, we provide a fundamental insight into the mechanisms of pre-training which highlights that we have to be aware of the data structure instead of following the general philosophy of adding more pre-training data. Our results provide the missing link to Xu & Hsu (2019), clarifying when their asymptotic generalisation results can be used in practice. We believe that our simple PCA-regression model is suitable for extensive studies of pre-training phenomena. Therefore, this study lays the groundwork for future research and opens up many questions.

A ADDITIONAL RELATED WORK

This section complements the related work in Section 2.

Overparameterization: additional related work

While the double descent has been observed in deep and state-of-the-art models (d'Ascoli et al., 2020; Nakkiran et al., 2021), most theoretical studies focus on simple models. Examples are found for linear regression (Bartlett et al., 2020; Muthukumar et al., 2020; Hastie et al., 2022), ensembles (LeJeune et al., 2020; Loureiro et al., 2021), classification (Gerace et al., 2020; Wang et al., 2021; Deng et al., 2022), random features (Belkin et al., 2019; Mei & Montanari, 2022) or small neural networks trained using gradient descent (Goldt et al., 2019; Advani et al., 2020).

PCA-regression in applications

The PCA-regression model is also known as principal component regression (PCR) (Xu & Hsu, 2019) or PCA-OLS (Huang et al., 2022). The number of chosen principal components or eigenvectors $\hat d$ is subject to model selection, see for example Xu & Hsu (2019) for an analysis if the true feature covariance matrix is fully known. Selecting $\hat d$ is a crucial step. While we do not specify how to select $\hat d$, we discuss the implication of model misspecification with $\hat d \neq d$.

When using a supervised setup as in Section 4, there are plenty of examples of the use of PCA-regression models. Early work uses the de-correlating property of PCA for the inputs in small-scale examples (Massy, 1965). Tran et al. (2018) use 10 years of data from Seoul to analyse the impact of air pollution on the health of the population using PCA-regression. Wang & Abbott (2008) make use of PCA-regression for genetic association to determine genetic variants of human diseases with a large number of features and few samples. Metwally (2008) uses the model for spectrophotometry.

When using pre-training as in Section 5, the PCA-regression model is a simplified, linear surrogate for large, nonlinear encoder-decoder models. Examples in this setting are the transformer-based BERT model (Devlin et al., 2019) or DALL-E (Ramesh et al., 2021). In these models, parts of the model are pre-trained on a large corpus of unlabeled data. The pre-trained model can then be used by other developers to fine-tune or adapt the last layer, see Kumar et al. (2022).

B EIGENVALUE DISTRIBUTION OF REAL-WORLD DATA SETS

In Figure 7 we plot the eigenvalue distribution of four real-world data sets. Each of them has a low number of significant eigenvalues with a sharp exponential decay. For some data sets, e.g. Steel Plates Fault, there is even a low-dimensional data embedding visible up to about eigenvalue 12. All data sets except MNIST were downloaded from the UCI Machine Learning Repository (Dua & Graff, 2017) through the OpenML interface.

[Figure 7: Eigenvalue distributions of four real-world data sets. Bottom left: Breastcancer data set from UCI (Dua & Graff, 2017). Bottom right: Steel-plates-fault data set provided by Semeion, Research of Sciences of Communication, Via Sersale 117, 00128, Rome, Italy.]

C DETAILS ON THE DATA GENERATOR

In this appendix we concentrate, without loss of generality, on the latent variable data generator with orthogonal features introduced in (2) and (13). In matrix notation, collecting the feature vectors of all samples in rows, we can write the data generator as

$$X = Z D^\top + E D_\perp^\top, \qquad y = Z\theta + \varepsilon.$$

Singular value and eigenvalue decomposition

Approximating the data matrix with an estimated singular value decomposition and reducing the rank to $\hat d$ yields

$$X = \hat U \hat\Sigma \hat V^\top,$$

with estimated singular values $\hat\Sigma = \operatorname{diag}(\hat\sigma_1, \dots, \hat\sigma_{\hat d})$. Similarly, we can define the eigenvalue decomposition of the sample covariance matrix as

$$\hat C = \frac{1}{n} X^\top X = \frac{1}{n} \hat V \hat\Sigma^\top \hat\Sigma \hat V^\top = \hat V \hat S \hat V^\top,$$

with estimated eigenvalue matrix $\hat S = \operatorname{diag}(\hat s_1, \dots, \hat s_{\hat d})$.
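The feature part of the generator and the link between the two decompositions can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the scale `c`, the unit latent variances, and the choice of taking $D$ and $D_\perp$ as column blocks of the orthogonal factor of an $m \times m$ Gaussian matrix are hypothetical choices made for this example, not the settings used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d = 200, 50, 5           # samples, features, latent dimension
c = 2.0                        # hypothetical feature scale (assumption)
lam = np.ones(d)               # hypothetical latent variances (diagonal of Lambda)

# Orthogonalise a Gaussian matrix; splitting Q into D and D_perp is an assumption.
Q, _ = np.linalg.qr(rng.standard_normal((m, m)))
D, D_perp = c * Q[:, :d], Q[:, d:]

Z = rng.standard_normal((n, d)) * np.sqrt(lam)    # latent variables, one row per sample
E = rng.standard_normal((n, m - d))               # feature noise
X = Z @ D.T + E @ D_perp.T                        # X = Z D^T + E D_perp^T

# Singular values of X versus eigenvalues of the sample covariance matrix.
sigma = np.linalg.svd(X, compute_uv=False)
C_hat = X.T @ X / n
s_hat = np.sort(np.linalg.eigvalsh(C_hat))[::-1]  # descending order
```

In this sketch the leading sample eigenvalues satisfy `s_hat[i] == sigma[i]**2 / n` up to rounding, which is the relation between $\hat S$ and $\hat\Sigma$ stated above.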

Covariance matrices

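The feature covariance matrix can be written as

$$C = \mathbb{E}\left[X^\top X\right] = \begin{bmatrix} D & D_\perp \end{bmatrix} \mathbb{E}\begin{bmatrix} Z^\top Z & Z^\top E \\ E^\top Z & E^\top E \end{bmatrix} \begin{bmatrix} D^\top \\ D_\perp^\top \end{bmatrix} = V S V^\top,$$

where $V := [D \;\; D_\perp]$ are the true eigenvectors; compare with the sample eigenvectors denoted by $\hat V$. The eigenvalue matrix $S$ can be written as

$$S = \operatorname{diag}(s_1, \dots, s_m) = \operatorname{diag}(\lambda_1, \dots, \lambda_d, 1, \dots, 1) = \begin{bmatrix} \Lambda & 0 \\ 0 & I_{m-d} \end{bmatrix}.$$

Signal-to-noise ratio control

For the orthogonal latent variable data generator we can compute the SNR $\rho_x$ of the features as

$$\rho_x = \frac{\mathbb{E}\|Dz\|_2^2}{\mathbb{E}\|D_\perp e\|_2^2} = \frac{\operatorname{Tr}(D\Lambda D^\top)}{\operatorname{Tr}(D_\perp D_\perp^\top)} = \frac{\operatorname{Tr}(c^2\Lambda)}{m-d},$$

since $\operatorname{Tr}(D_\perp D_\perp^\top) = \operatorname{Tr}(I_{m-d})$ and $D^\top D = c^2 I_d$. Here $c$ is a correction factor which controls the SNR. We define it as

$$c = \sqrt{\frac{\rho_x (m-d)}{d}}\,\sqrt{\frac{d}{\operatorname{Tr}(\Lambda)}}.$$

If non-orthogonal noise is used, then the term $\rho_x (m-d)/d$ in the first factor reduces to $\rho_x m/d$. In the same way, we can compute the SNR $\rho_y$ of the outputs as

$$\rho_y = \frac{\mathbb{E}\|\theta^\top z\|_2^2}{\mathbb{E}\|\varepsilon\|_2^2} = \frac{\operatorname{Tr}(\theta^\top \Lambda \theta)}{\sigma_y^2} = \frac{r_\theta^2}{\sigma_y^2},$$

with $r_\theta^2 = 1$ usually.

Implementation details for data generation

For our latent variable orthogonal feature generator, we generate the matrices $D$ and $D_\perp$ by first sampling an auxiliary random variable $A \in \mathbb{R}^{m \times d}$ and then orthogonalizing it with a QR-decomposition

$$A_{i,j} \sim \mathcal{N}(0, 1), \qquad QR = A.$$

Here we used $X = \hat U \hat\Sigma \hat V^\top$. Now we combine the singular value matrices. We indicate the dimensions of the combined matrices. Note that $\hat\Sigma_d$ and $\hat\Sigma$ are of different sizes.

$$\hat\theta = \operatorname{diag}\!\left(\tfrac{1}{\hat\sigma_1^2}, \dots, \tfrac{1}{\hat\sigma_d^2}\right)_{d \times d} \begin{bmatrix} \operatorname{diag}(\hat\sigma_1^2, \dots, \hat\sigma_d^2) & 0 \end{bmatrix}_{d \times m} \hat V^\top \beta + \begin{bmatrix} \operatorname{diag}\!\left(\tfrac{1}{\hat\sigma_1}, \dots, \tfrac{1}{\hat\sigma_d}\right) & 0 \end{bmatrix}_{d \times n} \hat U^\top \epsilon \tag{44}$$

$$= \begin{bmatrix} I_d & 0 \end{bmatrix} \hat V^\top \beta + \begin{bmatrix} \hat\Sigma_{dd}^{-1} & 0 \end{bmatrix} \hat U^\top \epsilon \tag{45}$$

Summarizing the matrices by truncating $\hat V^\top$ and $\hat U^\top$ yields the following solution for the regression parameter estimate:

$$\hat\theta = \hat V_d^\top \beta + \hat\Sigma_{dd}^{-1} \hat U_d^\top \epsilon \tag{46}$$

D.3 PARAMETER NORM

Here, we prove the parameter norm part of Lemma 1. Proof. Below, $\hat V$ and $\hat U$ denote the truncated matrices $\hat V_d$ and $\hat U_d$. In order to evaluate the parameter norm $\|\hat\theta\|_2^2 = \hat\theta^\top \hat\theta$, we consider

$$\hat\theta^\top \hat\theta = \beta^\top \hat V \hat V^\top \beta + \operatorname{Tr}\!\left(\epsilon^\top \hat U \hat\Sigma_{dd}^{-1} \hat\Sigma_{dd}^{-1} \hat U^\top \epsilon\right) + 2\beta^\top \hat V \hat\Sigma_{dd}^{-1} \hat U^\top \epsilon \tag{47}$$

where the second term is scalar and hence equal to its trace.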
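Now, we can make use of the cyclic property of the trace. Furthermore, define $\Phi := \hat V \hat V^\top$ as an orthogonal projector:

$$\hat\theta^\top \hat\theta = \beta^\top \Phi \beta + \operatorname{Tr}\!\left(\hat U^\top \epsilon \epsilon^\top \hat U \hat\Sigma_{dd}^{-2}\right) + 2\beta^\top \hat V \hat\Sigma_{dd}^{-1} \hat U^\top \epsilon \tag{48}$$

Note that the properties of an orthogonal projector, $\Phi^\top = \Phi$ and $\Phi\Phi = \Phi$, hold for our definition. We take the expectation with respect to the noise; the cross term vanishes because the noise has zero mean:

$$\mathbb{E}_\epsilon\, \hat\theta^\top \hat\theta = \beta^\top \Phi \beta + \operatorname{Tr}\!\left(\hat U^\top\, \mathbb{E}_\epsilon\!\left[\epsilon \epsilon^\top\right] \hat U \hat\Sigma_{dd}^{-2}\right) \tag{49}$$

$$= \beta^\top \Phi \beta + \sigma_\epsilon^2 \operatorname{Tr}\!\left(\hat U^\top \hat U \hat\Sigma_{dd}^{-2}\right) \tag{50}$$

$$= \beta^\top \Phi \beta + \sigma_\epsilon^2 \operatorname{Tr}\!\left(\hat\Sigma_{dd}^{-2}\right).$$

Using (39) we can write

$$\mathbb{E}_\epsilon\, \hat\theta^\top \hat\theta = \beta^\top \Phi \beta + \frac{\sigma_\epsilon^2}{n} \operatorname{Tr}\!\left(\hat V^\top \hat C^+ \hat V\right).$$

The second term uses the sample covariance matrix $\hat C$ projected down onto the $\hat d$-dimensional eigenvector space using $\hat V$.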
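As a quick numerical sanity check of (46) and of the trace identity used in the last step, the following sketch compares the closed-form estimate with a least-squares fit on the projected features. The sizes, the noise level and the random seed are arbitrary values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d_hat = 50, 30, 5                  # illustrative sizes only
X = rng.standard_normal((n, m))
beta = rng.standard_normal(m)
eps = 0.1 * rng.standard_normal(n)
y = X @ beta + eps

# Truncated SVD: keep the first d_hat singular triplets.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_d, s_d, V_d = U[:, :d_hat], s[:d_hat], Vt[:d_hat].T

# Least squares on the latent features Z_hat = X V_d ...
Z_hat = X @ V_d
theta_lstsq = np.linalg.lstsq(Z_hat, y, rcond=None)[0]
# ... versus the closed form theta = V_d^T beta + Sigma_dd^{-1} U_d^T eps.
theta_closed = V_d.T @ beta + (U_d.T @ eps) / s_d

# Tr(Sigma_dd^{-2}) versus (1/n) Tr(V_d^T C_hat^+ V_d).
C_hat = X.T @ X / n
trace_sigma = np.sum(1.0 / s_d ** 2)
trace_cov = np.trace(V_d.T @ np.linalg.pinv(C_hat) @ V_d) / n
```

Both pairs agree up to floating-point error, so the variance part of the parameter norm can be computed from either the singular values or the projected pseudoinverse of the sample covariance.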

D.4 LIMITING PARAMETER NORM FOR ISOTROPIC FEATURES

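Here, we prove the parameter norm part of Theorem 1. Proof. We can analyze the two terms in (52) independently in the limit of $m, n \to \infty$ such that $m/n \to \gamma \in (0, \infty)$ almost surely. Furthermore, we assume isotropic features $\operatorname{Cov}(x_i) = C = I_m$.

First term

With the definition of our orthogonal projector we can write

$$\beta^\top \Phi \beta = \beta^\top \hat V \hat V^\top \beta. \tag{53}$$

We can write $\hat V^\top$ with the SVD definition as $\hat V^\top = \hat\Sigma_d^{+} \hat U^\top X$, which yields

$$\beta^\top \Phi \beta = \beta^\top X^\top \hat U \hat\Sigma_d^{+\top} \hat\Sigma_d^{+} \hat U^\top X \beta. \tag{54}$$

For the special case of i.i.d. matrix entries $x_i \sim \mathcal{N}(0, 1)$ we have by rotational invariance that the distributions of $X$ and $XP$ are equal for any orthogonal $P \in \mathbb{R}^{m \times m}$:

$$\beta^\top \Phi \beta = \beta^\top P^\top X^\top \hat U \hat\Sigma_d^{+\top} \hat\Sigma_d^{+} \hat U^\top X P \beta.$$

Choose $P$ such that $P\beta = \|\beta\|_2\, e_i$ with $e_i$ as the $i$-th standard basis vector and then average over $i = 1, \dots, m$:

$$\beta^\top \Phi \beta = \beta^\top \beta \operatorname{Tr}\!\left(X^\top \hat U \hat\Sigma_d^{+\top} \hat\Sigma_d^{+} \hat U^\top X\right)/m. \tag{56}$$

Using again the definition $X = \hat U \hat\Sigma \hat V^\top$ yields

$$\beta^\top \Phi \beta = \beta^\top \beta \operatorname{Tr}\!\left(\hat V \hat\Sigma^\top \hat U^\top \hat U \hat\Sigma_d^{+\top} \hat\Sigma_d^{+} \hat U^\top \hat U \hat\Sigma \hat V^\top\right)/m \tag{57}$$

$$= \beta^\top \beta \operatorname{Tr}\!\left(\hat V \hat\Sigma^\top \hat\Sigma_d^{+\top} \hat\Sigma_d^{+} \hat\Sigma \hat V^\top\right)/m. \tag{58}$$

Using the same arguments as for the linear regression parameter solution by combining the singular value matrices, we obtain

$$\beta^\top \Phi \beta = \beta^\top \beta \operatorname{Tr}\!\left(\hat V \hat V^\top\right)/m. \tag{59}$$

Here we again identify our orthogonal projector:

$$\beta^\top \Phi \beta = \beta^\top \beta \operatorname{Tr}(\Phi)/m.$$

Since $\Phi$ is symmetric positive semidefinite and since the columns of $\hat V$ are orthonormal, all nonzero eigenvalues of $\Phi$ are equal to one, so that $\operatorname{Tr}(\Phi) = \operatorname{rank}(\Phi)$, yielding

$$\beta^\top \Phi \beta = \beta^\top \beta \operatorname{rank}(\Phi)/m.$$

For $m/n \to \gamma$ we have to distinguish between $\gamma < 1$ and $\gamma > 1$. Therefore, we obtain the final version for the first term of the limiting parameter norm:

$$\beta^\top \Phi \beta = \begin{cases} \beta^\top \beta\, \min(\hat d, m)/m & \text{for } \gamma < 1 \\ \beta^\top \beta\, \min(\hat d, n)/m & \text{for } \gamma > 1 \end{cases} \tag{62}$$

Checking the result by considering all principal components, i.e. choosing $\hat d = m$ (with $m > n$ for $\gamma > 1$), we obtain

$$\beta^\top \Phi \beta = \begin{cases} \beta^\top \beta & \text{for } \gamma < 1 \\ \beta^\top \beta\, \frac{1}{\gamma} & \text{for } \gamma > 1 \end{cases}$$

which is the same result as for the case of direct regression between $X$ and $y$.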
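The first-term result can be illustrated with a small Monte Carlo experiment. The sketch below is not part of the proof: it fixes an arbitrary $\beta$, draws isotropic Gaussian data repeatedly, and averages $\beta^\top \Phi \beta / \beta^\top \beta$, which should be close to $\hat d / m$ for $\gamma < 1$. The helper name `projector`, the sizes and the number of repetitions are choices made only for this example.

```python
import numpy as np

def projector(X, d_hat):
    """Orthogonal projector onto the leading right singular vectors of X."""
    _, s, Vt = np.linalg.svd(X, full_matrices=False)
    k = min(d_hat, int(np.sum(s > 1e-10 * s[0])))   # cannot exceed the rank of X
    V_d = Vt[:k].T
    return V_d @ V_d.T

rng = np.random.default_rng(1)
n, m, d_hat = 60, 40, 10                 # gamma = m / n < 1
beta = rng.standard_normal(m)

ratios = []
for _ in range(400):
    Phi = projector(rng.standard_normal((n, m)), d_hat)
    ratios.append(beta @ Phi @ beta / (beta @ beta))
mean_ratio = float(np.mean(ratios))      # expected to be close to d_hat / m

# Overparameterized case with all components: n < m, so rank(Phi) = n.
n_over = 20
Phi_over = projector(rng.standard_normal((n_over, m)), m)
trace_over = float(np.trace(Phi_over))   # equals n_over, i.e. Tr(Phi)/m = 1/gamma
```

The second part shows why the case distinction is needed: with $\hat d = m > n$ the projector can only have rank $n$, which gives the factor $1/\gamma$.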
Second term

For the second term of the parameter norm we can write the trace as the sum over the eigenvalues $\hat s_i$ of $\hat C$, but limited to the first $\hat d$ eigenvalues due to the projection using $\hat V$:

$$\frac{\sigma_\epsilon^2}{n} \operatorname{Tr}\!\left(\hat V^\top \hat C^+ \hat V\right) = \sigma_\epsilon^2\, \frac{1}{n} \sum_{i=1}^{\hat d} \frac{1}{\hat s_i} \tag{63}$$

$$= \sigma_\epsilon^2\, \frac{m}{n} \int_{s_f}^{\infty} \frac{1}{s}\, dF_{\hat C}(s) \tag{64}$$

where the summation is rewritten as an integral over the spectral measure $F_{\hat C}$ of $\hat C$ and $s_f$ is the $\hat d$-th largest eigenvalue of $\hat C$. We know that in the limit $m, n \to \infty$ the spectral measure will almost surely converge to the Marčenko-Pastur distribution $F_\gamma$, which describes the distribution of the eigenvalues of $\hat C$:

$$\frac{\sigma_\epsilon^2}{n} \operatorname{Tr}\!\left(\hat V^\top \hat C^+ \hat V\right) \to \sigma_\epsilon^2\, \frac{m}{n} \int_{s_f}^{\infty} \frac{1}{s}\, dF_\gamma(s).$$

There are now two steps to solve this integral. First, we need to find the lower integral bound $s_f$ and second, solve the integral itself. For $s_f = -\infty$, one can use the closed-form solution of the Stieltjes transform $f(z)$ of the Marčenko-Pastur distribution and evaluate it at $z = 0$. However, there is no known closed-form solution for general $s_f$. We therefore solve this part numerically.

Step 1, obtain the lower bound $s_f$: We can view the spectral measure $F_{\hat C}$ as a series of $m$ impulses at $\hat s_i$ with magnitude $1/m$ because the sum is normalized to 1. Since we only consider the $\hat d$ largest eigenvalues, we know that their sum is $\hat d/m$, see Figure 8a. This sum is the same as the integral from $s_f$ over the Marčenko-Pastur distribution, see Figure 8b. Therefore, we can find the lower integral bound $s_f$ by solving

$$\frac{\hat d}{m} = \int_{s_f}^{\infty} dF_\gamma(s) \tag{66}$$

$$= \int_{s_f}^{s_+} \frac{1}{2\pi} \frac{\sqrt{(s_+ - s)(s - s_-)}}{\gamma s}\, ds$$

for $s_f$ numerically, with $s_\pm = (1 \pm \sqrt{\gamma})^2$, where $s_-$ is the lowest and $s_+$ the highest eigenvalue. Note that $s \in [s_-, s_+]$.

Step 2, solve the integral of interest: Now we can solve the integral in (65) numerically from $s_f$ to the upper bound $s_+$. Therefore, we obtain a solution for the second term which is not based on data but on the properties of our data matrix, especially $\gamma$ and $\hat d$. This concludes the full proof for the asymptotics of the parameter norm.
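The two numerical steps can be sketched as follows. This is a minimal illustration, not the implementation used for the figures: it uses a plain midpoint rule and bisection (both choices of this example), covers the case $\gamma < 1$, and the function names are hypothetical. For $\gamma > 1$ the continuous part of the distribution only carries mass $1/\gamma$, so only fractions up to that value could be matched.

```python
import numpy as np

def mp_density(s, gamma):
    """Continuous part of the Marchenko-Pastur density for ratio gamma = m / n."""
    s_minus, s_plus = (1 - np.sqrt(gamma)) ** 2, (1 + np.sqrt(gamma)) ** 2
    inside = np.clip((s_plus - s) * (s - s_minus), 0.0, None)
    return np.sqrt(inside) / (2 * np.pi * gamma * s)

def integrate(f, a, b, num=20000):
    """Midpoint rule on [a, b]; simple but adequate for this sketch."""
    if b <= a:
        return 0.0
    edges = np.linspace(a, b, num + 1)
    mid = 0.5 * (edges[:-1] + edges[1:])
    return float(np.sum(f(mid)) * (b - a) / num)

def lower_bound(frac, gamma, tol=1e-8):
    """Step 1: bisection for s_f such that the mass on [s_f, s_plus] equals frac."""
    lo, hi = (1 - np.sqrt(gamma)) ** 2, (1 + np.sqrt(gamma)) ** 2
    s_plus = hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if integrate(lambda s: mp_density(s, gamma), mid, s_plus) > frac:
            lo = mid      # too much mass above mid: raise the threshold
        else:
            hi = mid
    return 0.5 * (lo + hi)

def variance_term(frac, gamma, sigma2=1.0):
    """Step 2: sigma^2 * gamma * integral of (1/s) dF_gamma over [s_f, s_plus]."""
    s_plus = (1 + np.sqrt(gamma)) ** 2
    s_f = lower_bound(frac, gamma)
    return sigma2 * gamma * integrate(lambda s: mp_density(s, gamma) / s, s_f, s_plus)

gamma = 0.3
s_f_half = lower_bound(0.5, gamma)       # threshold for d_hat / m = 0.5
var_half = variance_term(0.5, gamma)
var_full = variance_term(1.0, gamma)     # all components
```

With all components kept (`frac = 1.0`) the sketch reproduces the known value $\sigma_\epsilon^2 \gamma / (1 - \gamma)$ of the direct regression variance for $\gamma < 1$, which is a convenient check of the integration.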

D.5 RISK

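Here, we prove the risk part of Lemma 1. Proof. We define the risk as the expectation over the mean squared error, and then use $y_0 = \beta^\top x_0 + \epsilon$, $\hat y(x_0) = \hat\theta^\top \hat z$ and $\hat z = \hat V^\top x_0$ to obtain

$$R(\hat\theta) = \mathbb{E}_{(x_0, y_0)} \left(y_0 - \hat y(x_0)\right)^2 \tag{68}$$

$$= \mathbb{E}_{x_0} \left(\beta^\top x_0 + \epsilon - \hat y(x_0)\right)^2 \tag{69}$$

$$= \mathbb{E}_{x_0} \left(\beta^\top x_0 + \epsilon - \hat\theta^\top \hat z\right)^2 \tag{70}$$

$$= \mathbb{E}_{x_0} \left(\beta^\top x_0 + \epsilon - \hat\theta^\top \hat V^\top x_0\right)^2 \tag{71}$$

$$= \mathbb{E}_{x_0} \left((\beta - \hat V \hat\theta)^\top x_0 + \epsilon\right)^2 \tag{72}$$

$$= (\beta - \hat V \hat\theta)^\top C (\beta - \hat V \hat\theta) + \epsilon \epsilon^\top.$$

For simplicity we first rephrase the term in the bracket using the solution of the regression parameter estimation in (46). We re-use our orthogonal projector $\Phi = \hat V \hat V^\top$ and define another orthogonal projector $\Pi = I_m - \Phi$ to obtain

$$\beta - \hat V \hat\theta = \beta - \hat V \left(\hat V^\top \beta + \hat\Sigma_{dd}^{-1} \hat U^\top \epsilon\right) \tag{74}$$

$$= \beta - \hat V \hat V^\top \beta - \hat V \hat\Sigma_{dd}^{-1} \hat U^\top \epsilon \tag{75}$$

$$= (I_m - \Phi)\beta - \hat V \hat\Sigma_{dd}^{-1} \hat U^\top \epsilon \tag{76}$$

$$= \Pi\beta - \hat V \hat\Sigma_{dd}^{-1} \hat U^\top \epsilon.$$

Now we use this expression to take the expectation of the risk with respect to the noise. This yields

$$\mathbb{E}_\epsilon R(\hat\theta) = \beta^\top \Pi C \Pi \beta + \mathbb{E}_\epsilon \operatorname{Tr}\!\left(\epsilon^\top \hat U \hat\Sigma_{dd}^{-1} \hat V^\top C \hat V \hat\Sigma_{dd}^{-1} \hat U^\top \epsilon\right) + \mathbb{E}_\epsilon\, \epsilon \epsilon^\top.$$

Here we made use of the trace since the expression is scalar. Hence, we can use the cyclic property of the trace and pull the expectation inside:

$$= \beta^\top \Pi C \Pi \beta + \operatorname{Tr}\!\left(\hat V^\top C \hat V \hat\Sigma_{dd}^{-1} \hat U^\top\, \mathbb{E}_\epsilon\!\left[\epsilon \epsilon^\top\right] \hat U \hat\Sigma_{dd}^{-1}\right) + \mathbb{E}_\epsilon\, \epsilon \epsilon^\top. \tag{79}$$

With $\mathbb{E}_\epsilon\, \epsilon \epsilon^\top = \sigma_\epsilon^2 I_n$ for the training noise, variance $\sigma_\epsilon^2$ for the test noise, and $\hat U^\top \hat U = I$, this becomes

$$= \beta^\top \Pi C \Pi \beta + \sigma_\epsilon^2 \operatorname{Tr}\!\left(\hat V^\top C \hat V \hat\Sigma_{dd}^{-2}\right) + \sigma_\epsilon^2. \tag{80}$$

Using (39) for $\hat\Sigma_{dd}^{-2}$ gives

$$= \beta^\top \Pi C \Pi \beta + \frac{\sigma_\epsilon^2}{n} \operatorname{Tr}\!\left(\hat V^\top C \hat V \hat V^\top \hat C^+ \hat V\right) + \sigma_\epsilon^2.$$

Again, similarly to the parameter norm, the second term here uses the covariance matrices projected onto the $\hat d$-dimensional eigenvector space.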
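The risk expression can be checked against a Monte Carlo average over the training noise. The sketch below is an illustration with an arbitrary test covariance `C`, arbitrary sizes and Gaussian noise; none of these values come from the experiments in the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, d_hat, sigma = 100, 20, 5, 0.5            # illustrative values
A = rng.standard_normal((m, m))
C = A @ A.T / m + np.eye(m)                     # arbitrary positive definite covariance
X = rng.standard_normal((n, m)) @ np.linalg.cholesky(C).T   # rows with covariance C
beta = rng.standard_normal(m)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_d, s_d, V_d = U[:, :d_hat], s[:d_hat], Vt[:d_hat].T
Pi = np.eye(m) - V_d @ V_d.T                    # projector onto the discarded directions
C_hat = X.T @ X / n

# Closed-form terms of the expected risk.
bias = beta @ Pi @ C @ Pi @ beta
variance = sigma ** 2 / n * np.trace(
    V_d.T @ C @ V_d @ V_d.T @ np.linalg.pinv(C_hat) @ V_d)
risk_formula = bias + variance + sigma ** 2

# Monte Carlo over the training noise, using the closed-form estimate (46).
eps = sigma * rng.standard_normal((20000, n))   # one noise vector per row
theta_hat = beta @ V_d + (eps @ U_d) / s_d      # one estimate per row
err = beta - theta_hat @ V_d.T                  # rows: beta - V theta_hat
risk_mc = float(np.mean(np.einsum("ij,jk,ik->i", err, C, err))) + sigma ** 2
```

The Monte Carlo average and the closed form agree up to sampling error, and the variance term equals $\sigma_\epsilon^2 \operatorname{Tr}(\hat V^\top C \hat V \hat\Sigma_{dd}^{-2})$ exactly.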

D.6 LIMITING RISK FOR ISOTROPIC FEATURES

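Here, we prove the risk part of Theorem 1. Proof. Since we use isotropic features, we have $C = I_m$. Similar to the limiting parameter norm, we split the analysis for the first two parts of (81).

First term: limiting bias

Using isotropic features and the definition of the orthogonal projector, we have

$$\beta^\top \Pi C \Pi \beta = \beta^\top \Pi \beta \tag{82}$$

$$= \beta^\top (I_m - \hat V \hat V^\top) \beta.$$

Now we can use the same arguments as for the first term in the parameter norm. Namely, rewrite $\hat V^\top = \hat\Sigma_d^{+} \hat U^\top X$ in terms of $X$, assume $x_i \sim \mathcal{N}(0, 1)$, and use that by rotational invariance the distributions of $X$ and $XP$ are equal, where $P$ is any orthogonal matrix. Then we choose $P\beta = \|\beta\|_2\, e_i$ and average over all $i = 1, \dots, m$:

$$= \beta^\top \left(I_m - P^\top X^\top \hat U \hat\Sigma_d^{+\top} \hat\Sigma_d^{+} \hat U^\top X P\right) \beta \tag{84}$$

$$= \beta^\top \beta \left(1 - \operatorname{Tr}\!\left(X^\top \hat U \hat\Sigma_d^{+\top} \hat\Sigma_d^{+} \hat U^\top X\right)/m\right) \tag{85}$$

$$= \beta^\top \beta \left(1 - \operatorname{Tr}(\hat V \hat V^\top)/m\right) \tag{86}$$

$$= \beta^\top \beta \left(1 - \operatorname{Tr}(\Phi)/m\right) \tag{87}$$

$$= \beta^\top \beta \left(1 - \operatorname{rank}(\Phi)/m\right).$$

In the limit of $m, n \to \infty$ we have $m/n \to \gamma$ almost surely. We therefore obtain the final solution for the limiting bias as

$$\beta^\top \Pi C \Pi \beta = \begin{cases} \beta^\top \beta \left(1 - \min(\hat d, m)/m\right) & \text{for } \gamma < 1 \\ \beta^\top \beta \left(1 - \min(\hat d, n)/m\right) & \text{for } \gamma > 1 \end{cases}$$

Again, checking the result by considering all principal components, i.e. choosing $\hat d = m$ (with $m > n$ for $\gamma > 1$), we obtain

$$\beta^\top \Pi C \Pi \beta = \begin{cases} 0 & \text{for } \gamma < 1 \\ \beta^\top \beta \left(1 - \frac{1}{\gamma}\right) & \text{for } \gamma > 1 \end{cases}$$

which is the same result as for the case of direct regression between $X$ and $y$.

Second term: limiting variance

Using isotropic features and $\hat V^\top \hat V = I$, we have

$$\frac{\sigma_\epsilon^2}{n} \operatorname{Tr}\!\left(\hat V^\top C \hat V \hat V^\top \hat C^+ \hat V\right) = \frac{\sigma_\epsilon^2}{n} \operatorname{Tr}\!\left(\hat V^\top \hat C^+ \hat V\right).$$

This is the same form as the second term for the parameter norm and therefore yields the same numerical solution by solving

$$\frac{\sigma_\epsilon^2}{n} \operatorname{Tr}\!\left(\hat V^\top C \hat V \hat V^\top \hat C^+ \hat V\right) \to \sigma_\epsilon^2\, \frac{m}{n} \int_{s_f}^{\infty} \frac{1}{s}\, dF_\gamma(s).$$

E ADDITIONAL NUMERICAL RESULTS FOR SUPERVISED CASE

E.1 ISOTROPIC DATA: BIAS-VARIANCE DECOMPOSITION

In Figure 9 we extend Figure 2. We additionally show the results for the PCA-regression model with $\hat d = m$, which corresponds to a PCA without compression and therefore a direct regression between input $x$ and output $y$.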
We compare the analytical results from Theorem 1 for the PCA-regression model to 1) the direct regression model from simulations (solid line) and 2) the analytical solution for direct regression of isotropic data from Hastie et al. (2022). The authors give the solution for the risk in their Lemma 1 and we derive the parameter norm in the same way:

$$\mathbb{E}_\epsilon R(\hat\theta) \to \sigma_\epsilon^2 + \begin{cases} 0 + \sigma_\epsilon^2 \frac{\gamma}{1-\gamma} & \text{for } \gamma < 1 \\ \beta^\top \beta \left(1 - \frac{1}{\gamma}\right) + \sigma_\epsilon^2 \frac{1}{\gamma - 1} & \text{for } \gamma > 1 \end{cases}, \qquad \mathbb{E}_\epsilon \|\hat\theta\|_2^2 \to \begin{cases} \beta^\top \beta + \sigma_\epsilon^2 \frac{\gamma}{1-\gamma} & \text{for } \gamma < 1 \\ \beta^\top \beta \frac{1}{\gamma} + \sigma_\epsilon^2 \frac{1}{\gamma - 1} & \text{for } \gamma > 1 \end{cases}$$

We see in Figure 9 that the theory from Hastie et al. (2022) for direct regression, the numerical results for direct regression and our PCA-regression results without compression ($\hat d = m$) match and therefore further support our theory.

Figure 9: Supervised results on isotropic data: analysis. We compare analytical solutions from Theorem 1 for the PCA-regression without compression ($\hat d = m$) with 1) the analytical solution for direct regression and 2) simulations for direct regression.

Complementary to Figure 2, we can decompose the risk and parameter norm according to Theorem 1 into a bias and a variance term. The results for this decomposition are shown in Figure 10. We can see that the bias term is nonzero for all $\gamma$ and increases for larger $\gamma$. Further, we observe a decrease of the variance term. In contrast, in the direct regression model, the variance term reaches a peak at $\gamma = 1$ and therefore forms the classical bias-variance trade-off for $\gamma < 1$.
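For reference, the closed-form limits for direct regression quoted in this subsection are easy to evaluate. The helper below is a small sketch; the function name and the default values $\beta^\top \beta = 1$ and $\sigma_\epsilon^2 = 1$ are choices made for this example.

```python
def direct_regression_limits(gamma, beta_norm2=1.0, sigma2=1.0):
    """Limiting risk and parameter norm of direct regression on isotropic data."""
    if gamma <= 0 or gamma == 1:
        raise ValueError("gamma must be positive and different from 1")
    if gamma < 1:
        variance = sigma2 * gamma / (1.0 - gamma)
        risk = sigma2 + variance                      # no bias term below gamma = 1
        norm = beta_norm2 + variance
    else:
        variance = sigma2 / (gamma - 1.0)
        risk = sigma2 + beta_norm2 * (1.0 - 1.0 / gamma) + variance
        norm = beta_norm2 / gamma + variance
    return risk, norm

risk_under, norm_under = direct_regression_limits(0.5)   # underparameterized
risk_over, norm_over = direct_regression_limits(2.0)     # overparameterized
```

Both the risk and the parameter norm diverge as $\gamma \to 1$ from either side, which is the interpolation peak that the variance term of direct regression shows in Figure 10.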

E.2 LATENT VARIABLE DATA

Complementary result for parameter norm Complementary to the results of the risk for α = 0 and α = 0.25 in Figure 3, we show the results for the parameter norm in Figure 11. We observe that, similarly to the isotropic case, the parameter norm decreases monotonically for larger γ. This indicates a simpler solution for larger γ also in the latent variable data case.

E.3 CONNECTION TO RIDGE REGRESSION

In the main paper, we focused on the unregularized linear regression problem. In this part we compare the PCA-regression model with the regularized Ridge regression solution with λ as the Ridge parameter: θ = (X ⊤ X + λI m ) + X ⊤ y. (94) In the first part, we look at isotropic data, where we have an analytical solution. In the second part, we compare numerical simulations for the latent variable data generator.

E.3.1 ISOTROPIC DATA

For isotropic data we can compare the results from Theorem 1 for the asymptotic risk in the PCA-regression model with Ridge regression. Corollary 6 in Hastie et al. (2022) provides asymptotic results of Ridge regression for isotropic data. For completeness we state the asymptotic Ridge result:

$$\mathbb{E}_\epsilon R(\hat\theta_\lambda) \to \beta^\top \beta\, \lambda^2 m'(-\lambda) + \sigma_\epsilon^2 \gamma \left(m(-\lambda) - \lambda m'(-\lambda)\right) + \sigma_\epsilon^2,$$

with

$$m(z) = \frac{1 - \gamma - z - \sqrt{(1 - \gamma - z)^2 - 4\gamma z}}{2\gamma z}$$

and $m'(z)$ as the derivative w.r.t. $z$. The optimal Ridge regularization is achieved at $\lambda^* = \sigma_\epsilon^2 \gamma / \beta^\top \beta$, which then yields the optimal risk

$$\mathbb{E}_\epsilon R(\hat\theta_{\lambda^*}) \to \sigma_\epsilon^2 \gamma\, m(-\lambda^*) + \sigma_\epsilon^2.$$

Note that the optimal Ridge regularization strength is a function of $\gamma$ and monotonically increases with $\gamma$. Optimal regularization is not achieved by a single regularization value.
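The asymptotic Ridge expressions above can be evaluated directly. In the following sketch the derivative $m'(z)$ is approximated by a central finite difference, which is a shortcut of this example rather than part of the cited result; the function names and the default values $\beta^\top \beta = \sigma_\epsilon^2 = 1$ are likewise hypothetical.

```python
import math

def m_stieltjes(z, gamma):
    """m(z) as stated above, evaluated for z < 0."""
    a = 1.0 - gamma - z
    return (a - math.sqrt(a * a - 4.0 * gamma * z)) / (2.0 * gamma * z)

def m_prime(z, gamma, h=1e-6):
    """Central finite-difference approximation of m'(z) (numerical shortcut)."""
    return (m_stieltjes(z + h, gamma) - m_stieltjes(z - h, gamma)) / (2.0 * h)

def ridge_risk(lam, gamma, beta_norm2=1.0, sigma2=1.0):
    """Asymptotic Ridge risk: bias + variance + irreducible noise."""
    m0, m1 = m_stieltjes(-lam, gamma), m_prime(-lam, gamma)
    bias = beta_norm2 * lam ** 2 * m1
    variance = sigma2 * gamma * (m0 - lam * m1)
    return bias + variance + sigma2

def optimal_ridge(gamma, beta_norm2=1.0, sigma2=1.0):
    """Optimal penalty lambda* and the corresponding asymptotic risk."""
    lam_star = sigma2 * gamma / beta_norm2
    return lam_star, sigma2 * gamma * m_stieltjes(-lam_star, gamma) + sigma2

lam_star, risk_star = optimal_ridge(gamma=2.0)
risk_at_star = ridge_risk(lam_star, gamma=2.0)
```

Evaluating `ridge_risk` at `lam_star` reproduces `risk_star`, and penalties smaller or larger than `lam_star` give a higher risk, consistent with the statement that the optimum depends on $\gamma$.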

E.3.2 LATENT VARIABLE DATA GENERATOR

In this setting, we use the latent variable data generator without eigenvalue decay (α = 0) and compare our PCA-regression model without model misspecification, i.e. $\hat d = d$, to solutions using different Ridge parameters. We rely on numerical solutions in this section. The results are in Figure 15. While for isotropic data the comparison between both models shows qualitatively different behavior, here for the latent variable data the results indicate a clear connection between both models for large values of λ. This observation is theoretically justified by the known relationship of both methods on the eigenvalues. Ridge regression lifts all eigenvalues $S$ of the features by a value of λ:

$$X^\top X + \lambda I_m = V (S + \lambda I_m) V^\top.$$

Here, $V$ are the true, non-truncated eigenvectors. In contrast, the PCA-regression model cuts the eigenvalues off at a threshold chosen by $\hat d$, which is clear in (7). The main difference is that with Ridge regression there is a smooth change of the risk controlled by the Ridge parameter, whereas with PCA-regression there is a hard cut-off. Figure 15 may imply that the optimal Ridge penalty is at λ → ∞, as it avoids the interpolation peak at no additional cost. Previous studies have concluded that finite Ridge regularization is better. Gerace et al. (2020) use the hidden manifold model from Goldt et al. (2020), and Mei & Montanari (2022) use the random features model by Rahimi & Recht (2007), which can be seen as a two-layer neural network. Both studies conclude that finite λ achieves optimal regularization. However, we are working with linearly separable data, which is closer to the latent space model in Hastie et al. (2022). The difference to our setup is that for us both the data generating process and the PCA-regression model have a low-dimensional embedding. The conclusion in Hastie et al. (2022) that the best Ridge regularization is λ = 0 and achieved in γ > 1 may hold for us as well but is difficult to prove with our numerical results in Figure 15. An optimal penalty at λ → ∞ for Ridge regularized problems was also observed in previous studies with setups different from ours. Both Mignacco et al. (2020) and Loureiro et al. (2021) study the classification of high-dimensional (isotropic) Gaussian mixtures with balanced data from each mixture and show that large λ are necessary to reach the Bayes-optimal performance. Thrampoulidis et al. (2020) study a similar model for the classification of Gaussian mixtures as well as a multinomial logit model, where they identify that the class averaging algorithm, which is equal to Ridge regression with λ = ∞, performs optimally in some settings.
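The contrast between the smooth Ridge shrinkage and the hard PCA cut-off can be made concrete with the standard filter-factor view of both estimators. This view is textbook material and is used here only as an illustration; it is not taken from the derivations in this paper, and the sizes and the penalty are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, d_hat, lam = 80, 30, 5, 10.0       # illustrative values
X = rng.standard_normal((n, m))
y = rng.standard_normal(n)
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Ridge lifts every eigenvalue of X^T X by lambda.
eig_plain = np.sort(np.linalg.eigvalsh(X.T @ X))
eig_ridge = np.sort(np.linalg.eigvalsh(X.T @ X + lam * np.eye(m)))

# Filter factors: smooth shrinkage for Ridge, hard 0/1 cut-off for PCA-regression.
ridge_filter = s ** 2 / (s ** 2 + lam)
pca_filter = (np.arange(len(s)) < d_hat).astype(float)

def filtered_solution(f):
    """Regression vector in feature space for given filter factors f."""
    return Vt.T @ (f * (U.T @ y) / s)

# Reference solutions computed directly.
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ y)
V_d = Vt[:d_hat].T
beta_pca = V_d @ np.linalg.lstsq(X @ V_d, y, rcond=None)[0]
```

Both direct solutions coincide with their filtered counterparts, so the two methods differ only in how they weight the singular directions of $X$.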

E.4 REAL-WORLD EXAMPLE: GENETICS

Background The Diverse MAGIC Wheat data set (http://mtweb.cs.ucl.ac.uk/mus/www/MAGICdiverse/) is based on 16 founding wheat varieties which were listed between 1935 and 2004. These varieties were interbred to obtain new wheat varieties. From the resulting wheat types, the genomes of a total of 502 wheat types were sequenced. Each genome sequence consists of ≈ 1.1 M single nucleotide polymorphisms. Furthermore, phenotypes of the 502 wheat types were analysed, see Scott et al. (2021).

Data processing

The genotypes consist of binary features. The binary variables represent equality or difference to a reference genotype. The phenotypes are real-valued variables. We choose the phenotype column named 'HET 2' in this example. Missing values for both genotype and phenotype are replaced with the mean value of the variable. We select a subset of genotypes as inputs uniformly at random to obtain the necessary m features. Then, we normalize both genotype and phenotype by z-transformation.

Data analysis In Figure 16 we plot the eigenvalue distribution for the Diverse MAGIC Wheat data set, similar to the ones in Appendix B. We observe that the eigenvalue distribution is heavy tailed. It does not depict a clear example of a low-dimensional latent manifold. Therefore, using the PCA-regression model will discard some useful information, similar to the isotropic case.
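The processing steps described in this subsection can be sketched as below. The function `preprocess` and the toy arrays are hypothetical stand-ins for the real genotype and phenotype tables; leaving constant columns at zero after centering is an assumption of this example, since the text does not say how such columns are treated.

```python
import numpy as np

def preprocess(genotypes, phenotype, m, rng):
    """Mean-impute missing values, draw m random features, and z-transform."""
    G = np.asarray(genotypes, dtype=float)
    p = np.asarray(phenotype, dtype=float)
    # Replace missing values by the mean of the respective variable.
    G = np.where(np.isnan(G), np.nanmean(G, axis=0), G)
    p = np.where(np.isnan(p), np.nanmean(p), p)
    # Select m genotype columns uniformly at random without replacement.
    cols = rng.choice(G.shape[1], size=m, replace=False)
    G = G[:, cols]
    # z-transformation; constant columns stay at zero (assumption).
    std = G.std(axis=0)
    std[std == 0] = 1.0
    G = (G - G.mean(axis=0)) / std
    p = (p - p.mean()) / p.std()
    return G, p

# Toy stand-in data: 50 samples, 200 binary features, some values missing.
rng = np.random.default_rng(4)
G_raw = rng.integers(0, 2, size=(50, 200)).astype(float)
G_raw[rng.random(G_raw.shape) < 0.05] = np.nan
p_raw = rng.standard_normal(50)
p_raw[::10] = np.nan
G_std, p_std = preprocess(G_raw, p_raw, m=20, rng=rng)
```

Varying `m` in such a routine changes the number of features and therefore the degree of overparameterization for a fixed number of samples.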

F PROOFS FOR THE CASE WITH PRE-TRAINING

This appendix derives and proves the results of Section 5.2. First, we derive the results for the estimation of the eigenvectors with the PCA loss in F.1. Then, we derive the concentration inequality for the PCA loss in F.2.

F.1 ESTIMATION OF EIGENVECTORS

Here, we prove Lemma 2 for the loss difference for the projection into the true or estimated latent space from (18). We will look at the losses induced by the two projections separately. Proof. First, define the loss for the projection into the true latent space as in (16):

$$L(D) = \mathbb{E}\left[\|x\|_2^2 - \|D^+ x\|_2^2\right].$$

We can write the second term as follows, where we use the cyclic property of the trace on scalars and apply the expectation to $xx^\top$:

$$\mathbb{E}\|D^+ x\|_2^2 = \mathbb{E}\left[x^\top (D^+)^\top D^+ x\right] \tag{99}$$

$$= \mathbb{E}\operatorname{Tr}\!\left(x^\top (D^+)^\top D^+ x\right) \tag{100}$$

$$= \operatorname{Tr}\!\left(D^+ V S V^\top (D^+)^\top\right). \tag{101}$$

Since $V = [D \;\; D_\perp]$ as defined in (24), we obtain

$$\mathbb{E}\|D^+ x\|_2^2 = \operatorname{Tr}\!\left(\begin{bmatrix} I_d & 0_{d \times (m-d)} \end{bmatrix} S \begin{bmatrix} I_d \\ 0_{(m-d) \times d} \end{bmatrix}\right) \tag{102}$$

$$= \operatorname{Tr}(\Lambda) = \sum_{i=1}^{d} s_i.$$

Hence, with $\mathbb{E}\|x\|_2^2 = \operatorname{Tr}(C) = \sum_{i=1}^{m} s_i$, the loss $L(D)$ becomes

$$L(D) = \sum_{i=1}^{m} s_i - \sum_{i=1}^{d} s_i = \sum_{i=d+1}^{m} s_i.$$

Second, define the loss for the projection into the estimated latent space as in (17):

$$L(\hat H) = \mathbb{E}\left[\|x\|_2^2 - \|\hat H^\top x\|_2^2\right].$$

Again, we can write the second term as follows by using the same arguments as for the first term:

$$\mathbb{E}\|\hat H^\top x\|_2^2 = \operatorname{Tr}\!\left(\hat H^\top V S V^\top \hat H\right) = \sum_{i=1}^{\hat d} \sum_{j=1}^{m} (\hat h_i^\top v_j)^2 s_j.$$

The last equality holds by switching to vector notation, where factors can be combined into squared terms. Hence, the loss $L(\hat H)$ becomes

$$L(\hat H) = \sum_{i=1}^{m} s_i - \sum_{i=1}^{\hat d} \sum_{j=1}^{m} (\hat h_i^\top v_j)^2 s_j.$$

$$P\left(\sum_{i \neq j} w_{ij} (\hat h_i^\top v_j)^2 > t\right) \leq \sum_{i \neq j} \frac{4 w_{ij} k_j^2}{n_p\, t\, (s_i - s_j)^2},$$

where $k_j^2 = \mathbb{E}\|xx^\top v_j\|_2^2 - s_j^2$ and $w_{ij} \neq 0$ when $s_i \neq s_j$ and $\operatorname{sgn}(s_i - s_j)\, 2\hat s_i > \operatorname{sgn}(s_i - s_j)(s_i + s_j)$. In accordance with this corollary, we define the weights $w_{ij}$ and the constants $k_j^2$ following Loukas (2017). This corollary holds for our latent variable data generator. This concludes the proof.
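The two PCA losses derived in F.1 can be verified numerically for a known covariance matrix. The sketch below assumes orthonormal true eigenvectors, so that the pseudoinverse of $D$ is its transpose, a hypothetical spectrum, Gaussian pre-training samples, and $\hat d = d$; all of these are choices made only for this example.

```python
import numpy as np

rng = np.random.default_rng(5)
m, d, n_p = 30, 5, 200                   # features, latent dimension, pre-training samples

# True covariance C = V S V^T with orthonormal V and a hypothetical spectrum.
V, _ = np.linalg.qr(rng.standard_normal((m, m)))
s = np.concatenate([np.linspace(10.0, 6.0, d), np.ones(m - d)])
C = (V * s) @ V.T

# Pre-training samples x ~ N(0, C) and eigenvectors of their sample covariance.
X_p = (rng.standard_normal((n_p, m)) * np.sqrt(s)) @ V.T
C_hat = X_p.T @ X_p / n_p
_, eigvec = np.linalg.eigh(C_hat)        # eigenvalues in ascending order
H_hat = eigvec[:, ::-1][:, :d]           # top-d estimated eigenvectors

def pca_loss(B):
    """E||x||^2 - E||B^T x||^2 for orthonormal columns B and x with covariance C."""
    return float(np.trace(C) - np.trace(B.T @ C @ B))

loss_true = pca_loss(V[:, :d])           # L(D): sum of the discarded eigenvalues
loss_hat = pca_loss(H_hat)               # L(H_hat)

# Double-sum form: sum_i s_i - sum_i sum_j (h_i^T v_j)^2 s_j.
overlap = (H_hat.T @ V) ** 2             # entry (i, j) is (h_i^T v_j)^2
loss_hat_sum = float(s.sum() - (overlap * s).sum())
```

The estimated projection can never have a smaller loss than the true one when $\hat d = d$, and the gap `loss_hat - loss_true` is the quantity that the concentration bound controls.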

G ADDITIONAL NUMERICAL RESULTS FOR THE CASE WITH PRE-TRAINING

While in Section 5.3 we concentrate on the well-specified case with $\hat d = d$, here we show the effect of model misspecification. Specifically, for the same data generator with d = 20 we choose $\hat d \in \{15, 20, 40\}$. The results for the full risk over all µ are in Figure 17 and slices of this figure are in Figure 18. We observe a behavior under model misspecification that is qualitatively similar to the fully supervised case, see Section 4.2 or Figure 3: For $\hat d < d$, the risk is high and does not decrease significantly for larger γ. For $\hat d \geq d$, the risk decreases as expected. Therefore, from our observations the conclusions for model misspecification from the supervised case translate to the case with pre-training.



http://mtweb.cs.ucl.ac.uk/mus/www/MAGICdiverse/



Figure 1: Problem formulation. Left: Data generator. Right: PCA and linear regression model.

Figure 2: Supervised results on isotropic data: analysis vs simulation. Solid lines: analytical solutions (Theorem 1); '×': avg. simulation results; '•': null risk. Left: Risk. Right: Parameter norm.

Figure 3: Supervised risk on latent variable data: simulation. Left: Risk of models for data generated with feature covariance eigenvalue decay of α = 0. Right: Results with α = 0.25.

Figure 4: Supervised risk for real-world example. Diverse MAGIC wheat genetics data set.

$\mu = n_p / n$ with $\mu \geq 1$, as we could also use the training data set for pre-training. We choose d = 20, $r_\theta^2 = 1$, $\sigma_y^2 = 0$, $\rho_x = 1$ and focus on $\hat d = d$, as the effect of misspecified models is the same as without pre-training and is elaborated in Section 4.2. Experiments to confirm this behavior for pre-training are in Appendix G. We generate n = 200 training samples and $n_p = n\mu$ pre-training samples by varying µ ∈ [1, 10] and average the computed risk over 100 realizations.

Figure 5: Pre-training risk on latent variable data: simulation. On the x-axis we increase the number of features m and therefore the degree of overparameterization γ. On the y-axis we increase the amount of pre-training data $n_p$ compared to training data n. Left: Risk for latent variable data generated with feature covariance eigenvalue decay of α = 0. Right: Same setup but for α = 0.25.

Figure 6: Pre-training risk for different µ: simulation. Comparing horizontal slices of Figure 5 (right, α = 0.25) for pre-trained models with different amounts of pre-training data µ to 1) a fully supervised direct regression model and 2) a fully supervised PCA-regression model, comparable to Figure 3.

Figure 7: Eigenvalue distribution of real-world data sets. Top left: Distribution for MNIST digit 0 of the test data set (LeCun et al., 2010). Notice that many of the 784 eigenvalues are almost zero. Top right: Complete features of the spambase data set from UCI (Dua & Graff, 2017). Bottom left: Breastcancer data set from UCI (Dua & Graff, 2017). Bottom right: Steel-plates-fault data set provided by Semeion, Research of Sciences of Communication, Via Sersale 117, 00128, Rome, Italy.

Figure 8: Visualization of steps for variance term derivation. (a) Spectral measure impulses and lower integral bound $s_f$. (b) Marčenko-Pastur distribution (Marčenko & Pastur, 1967) for γ = 0.3 with specific lower integration bound. The area under the distribution from that threshold is equal to $\hat d / m$.

Figure 10: Bias-variance decomposition of supervised results on isotropic data: analysis. Same as Figure 2 but decomposed in the bias and variance terms of Theorem 1. In all plots the direct regression results are given as comparison. Left: Risk. Right: Parameter norm. Top: Bias terms. Bottom: Variance terms.

Figure 11: Supervised parameter norm on latent variable data: simulation. Left: Parameter norm of models for data generated with α = 0. Right: Equivalent results with α = 0.25. This figure complements Figure 3.

Figure 14 visualizes the comparison of the analytical solutions. The lowest risk for all γ is obtained for the optimal Ridge regression solution. Comparing the solutions from Ridge regularization with different λ with the solution from PCA-regression with different choices of $\hat d$ shows qualitatively different behavior for isotropic data. While Ridge regression smooths out the interpolation peak of direct regression with well-tuned λ, for PCA-regression we require a sufficiently large number of eigenvectors, i.e. large $\hat d$, to obtain a risk lower than the null risk. Overall, optimally tuned Ridge regression outperforms PCA-regression for all γ.

Figure 14: PCA-regression comparison with ridge regression for isotropic data: analysis. Solid lines depict the Ridge regularized models. Note that the red solid line with very low λ is equal to the unregularized direct regression. Dashed lines depict the analytical PCA-regression model results.


Figure 15: PCA-regression comparison with ridge regression for latent variable data: simulation. We highlight the similarity of the results obtained with a large ridge parameter to our PCA-regression model. Left: Risk. Right: Parameter norm.

Figure 16: Eigenvalue distribution of the Diverse MAGIC Wheat genetics data set.

Combining both solutions, the loss difference yields

$$L(\hat H) - L(D) = \sum_{i=1}^{d} s_i - \sum_{i=1}^{\hat d} \sum_{j=1}^{m} (\hat h_i^\top v_j)^2 s_j. \tag{109}$$

We can multiply the first term by $1 = \|v_j\|_2^2 = v_j^\top I v_j = v_j^\top \hat h_i \hat h_i^\top v_j = \|\hat h_i^\top v_j\|_2^2$ and combine terms, but have to be careful due to the different end indices of the sums:

$$L(\hat H) - L(D) = \dots\, (\hat h_i^\top v_j)^2 (s_i - s_j) + \sum_{i=d}^{\hat d} \sum_{j=1}^{m} (\hat h_i^\top v_j)^2 s_j \quad \text{for } \hat d > d.$$

Here, we prove the concentration inequality presented in Theorem 2. Proof. Writing the loss difference (18) with the same argument as in its derivation, by including the factor $\|v_j\|_2^2 = 1$ in the second summation, yields $s_i - s_j \geq 0$ for $j > i$ and $s_i - s_j \leq 0$ for $j < i$. Therefore, we can upper bound it by removing the negative indices from the first summation:

$$L(\hat H) - L(D) \leq \dots\, (\hat h_i^\top v_j)^2 |s_i - s_j| + \dots\, (\hat h_i^\top v_j)^2 s_j. \tag{114}$$

We simplify by denoting the three terms as

$$L(\hat H) - L(D) = a + b + c. \tag{115}$$

Now we define the probability that this upper bound on the loss difference is larger than a chosen real $t$. We can upper bound this expression by applying the union bound

$$P(a + b + c > t) \leq P(a > t) + P(b > t) + P(c > t). \tag{116}$$

Recall Corollary 4.1 from Loukas (2017): We have that for any weights $w_{ij}$ and real $t > 0$ that

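$$w_{ij} = \begin{cases} |s_i - s_j| & \text{for } i \leq \min(d, \hat d),\ j > i \\ s_i & \text{for } \hat d \leq i \leq d,\ \forall j \\ s_j & \text{for } d \leq i \leq \hat d,\ \forall j \\ 0 & \text{otherwise} \end{cases}$$

and $k_j^2 = s_j (s_j + \operatorname{Tr}(C))$ from Corollary 4.3 in Loukas (2017).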

Figure 17: Pre-training risk on latent variable data: simulation. We use the latent variable data generator with d = 20. In these simulations we show the effect of model misspecification. Left: Risk for data with feature covariance eigenvalue decay of α = 0. Right: Same setup but for α = 0.25. Top row: $\hat d < d$. Middle row: $\hat d = d$. This is a repetition of Figure 5. Bottom row: $\hat d > d$.


Figure 18: Pre-training risk for different horizontal slices of Figure 17: simulation. Left: Risk for data with feature covariance eigenvalue decay of α = 0. Right: Same setup but for α = 0.25. Top row: $\hat d < d$. Middle row: $\hat d = d$. Bottom row: $\hat d > d$. The middle right figure is a repetition of Figure 6.

availability

Code for reproducibility is attached as a Jupyter notebook in the supplementary material and will be published online upon acceptance; all simulation parameters are explained in detail in the paper and copied in the code. All of our numerical simulations are run on Intel Core i7-6850K CPUs @ 3.60GHz in a matter of minutes. The computationally heaviest experiment is pre-training with large µ (see Figure 5), which takes about 15 minutes per run for the fine-grained grid that we show in the paper. Averaging over multiple runs for more accurate results increases the computational cost linearly.

annex

where the SNR-correction factor $c$ is defined in (27). In order to ensure that $\mathbb{E}\|\theta^\top z\|_2^2 = r_\theta^2$, we generate θ as

D PROOFS FOR THE SUPERVISED CASE

In this appendix we detail and prove the results shown in Section 4.1. After stating some notation and definitions in D.1, we first show the result for the linear regression solution in D.2. Subsequently, we derive the result for the parameter norm in D.3 and its asymptotics in the isotropic case in D.4. Finally, the result for the risk is proven in D.5 and its asymptotics in the isotropic case are derived in D.6. For simplicity of notation, we replace $\hat d$ by $d$ in the indices; it is clear from context that we use $\hat d$ for the estimated latent space and $d$ for the true one.

D.1 GENERAL

Note on notation in main text

While in the main paper, for the derivation of the linear regression solution in (8), we denote for simplicity the estimated singular value matrix of the data $X$ as $\hat\Sigma$, here we are more precise. We distinguish between the case of γ > 1 with m > n (left) and the case of γ < 1 with m < n (right).

When truncating with $\hat d \ll \min(m, n)$ singular values, we obtain in both cases

In (8) we write $\hat\Sigma$ instead of $\hat\Sigma_d$ and $\hat\Sigma^{-1}$ instead of $(\hat\Sigma_d^\top \hat\Sigma_d)^{-1} \hat\Sigma_d^\top$ in order not to overload notation and to simplify reading without harming the results. In the following we will use the notation with subscript in order to highlight the zero rows or columns. Further, below we denote the square matrix with only the first $\hat d$ singular values on the diagonal as $\hat\Sigma_{dd}$ to indicate its dimensions.

Sample covariance matrix

We can define the feature sample covariance matrix $\hat C \in \mathbb{R}^{m \times m}$ and its Moore-Penrose pseudoinverse as

Using the definition of the truncated SVD, we can rewrite the sample covariance as

The above formulation also implies the following, which will be useful

We consider the unregularized linear regression solution between the latent variables $\hat Z$ and the outcome $y$:

with

Varying noise levels In Figure 12 we show the results for risk and parameter norm for different values of $\sigma_y^2$. In the main text we only present results for $\sigma_y^2 = 0$, but the results here show that our conclusions also hold for additive output noise. Of course, the associated risk increases with the noise level but, interestingly, the found solution has the same parameter norm.

Details of phenomenon for α > 0 In Figure 3 one could observe for α > 0 that for a certain range of γ > 1, the optimal PCA-regression model is worse than the direct regression model. We investigated the details of this observation in the following ways, which we visualize in Figure 13:

1. We increased the range of γ to higher values. Doing so, we can observe that the results for direct regression (dashed line) and the PCA-regression model (×-marks) converge again.
2. We conducted more experiments with a wider range of values for α. For all tested values, the same phenomenon is visible.
3. Analysing the uncertainty of our risk estimates over the 200 averaged simulations, we note that the difference lies within the standard deviation of our risk estimates. This originates from the limited number of test samples (=400) used to estimate the risk.

Therefore, we conclude that this phenomenon is an artifact of our experimental setup.

