NO DOUBLE DESCENT IN PCA: TRAINING AND PRE-TRAINING IN HIGH DIMENSIONS

Abstract

With the recent body of work on overparameterized models, the gap between theory and practice in contemporary machine learning is shrinking. While many of the present state-of-the-art models have an encoder-decoder architecture, there is little theoretical work on this model structure. To improve our understanding in this direction, we consider linear encoder-decoder models, specifically PCA combined with linear regression on data from a low-dimensional manifold. We present an analysis of fundamental guarantees for the risk and asymptotic results for isotropic data when the model is trained in a supervised manner. The results are verified in simulations and compared with experiments on real-world genetics data. Furthermore, we extend our analysis to the popular setting in which parts of the model are pre-trained in an unsupervised manner: the PCA encoder is pre-trained, followed by supervised training of the linear regression head. We show that the overall risk depends on the encoder's estimates of the covariance eigenvectors and derive a sample complexity requirement through a concentration bound. The results highlight that using more pre-training data decreases the overall risk only if it improves the eigenvector estimates. Therefore, we stress that the eigenvalue distribution determines whether more pre-training data is useful.

1. INTRODUCTION

Many recent success stories of deep learning employ an encoder-decoder structure, where parts of the model are pre-trained in an unsupervised or self-supervised way. Examples can be found in computer vision (Caron et al., 2020; Chen et al., 2020; Goyal et al., 2021), natural language processing (Vaswani et al., 2017; Devlin et al., 2019; Raffel et al., 2020) and multi-modal models (Ramesh et al., 2021; Alayrac et al., 2022). Understanding the properties of this model structure might shed light on how to reliably build large-scale models. We add to the theoretical understanding of encoder-decoder based models by studying a model consisting of PCA and a linear regression head. We analyse this model for the supervised case and for the case where unsupervised pre-training is followed by supervised linear regression. Our model can be viewed as a simplified, linear example of a large pre-trained deep neural network in combination with linear probing (Devlin et al., 2019; Schneider et al., 2019). While linear models do not reveal the whole picture, they are studied as a tractable first step towards deeper understanding. Indeed, research on linear models has previously provided important insights into relevant mechanisms (Saxe et al., 2014; Lampinen & Ganguli, 2019; Arora et al., 2019; Gidel et al., 2019; Pesme et al., 2021). We utilize data generated from a low-dimensional manifold, similar to Goldt et al. (2020). This is motivated by the manifold hypothesis (Fefferman et al., 2016), which states that real-world high-dimensional data often have an underlying low-dimensional representation. Our PCA encoder can exploit this data structure effectively. While we keep the low-dimensional data structure fixed, we vary the number of features with respect to the number of training data points, which allows us to analyse what is often referred to as overparameterization, i.e. more data features or model parameters than training samples (Belkin et al., 2019).
We do not consider parameter count, since for our model the number of parameters, i.e. the linear regressors, stays fixed due to the PCA encoding. Instead, we analyse high-dimensional settings. Studying overparameterization gives theoretical justification for the success of modern large-scale neural networks such as Szegedy et al. (2016); Dosovitskiy et al. (2021). Theoretical grounding is exceeded by the empirical success of machine learning, and specifically deep learning methods, through new model structures (Krizhevsky et al., 2017; He et al., 2016; Vaswani et al., 2017) or training methods (Erhan et al., 2010; Ioffe & Szegedy, 2015; Ba et al., 2016). In recent years our theoretical understanding has grown, e.g. through the analysis of implicit regularization (Gunasekar et al., 2017; Chizat & Bach, 2020; Smith et al., 2021), but experimental work has also contributed to our understanding (Keskar et al., 2017; Zhang et al., 2017). The goal of this paper is to extend our understanding of the successful encoder-decoder model structure through a theoretical analysis of PCA-regression and extensive numerical simulations. We generalize results for linear regression and combine classical analysis of overparameterization with pre-training of model components. Our contributions can be summarized as:

• In the supervised case, we provide theoretical guarantees for the risk and parameter norm of the PCA-regression model. For isotropic data we extend the results to the limit where the number of data points n and features m tend to infinity such that m/n → γ.

• Through simulations, we confirm our theory for isotropic data and explain the model behavior on data from a low-dimensional manifold. Using genetics data, we validate our findings in a high-dimensional real-world example.

• We extend our analysis to the popular scenario of unsupervised pre-training of the encoder and show that the correct estimation of the feature covariance eigenvectors is crucial for low risk. These estimates depend strongly on the data structure through the eigenvalue decay rate. The results provide a link to known asymptotic results by Xu & Hsu (2019). We challenge the common wisdom that more pre-training data improves the overall risk and show that this is the case only if it improves the estimate of the eigenvectors in the encoder, which holds e.g. for data with rapidly decaying eigenvalues.
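The two training regimes studied in this paper can be illustrated with a minimal toy sketch. The snippet below is our own illustrative setup, not the paper's experimental configuration: the latent dimension, sample sizes, noise level, and the synthetic generator x = Fz + noise are assumptions chosen for demonstration. It pre-trains a PCA encoder on unlabeled data and then fits a linear regression head on the encoded supervised data.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 5, 100          # latent dimension, ambient feature dimension (assumed values)
n_pre, n_train, n_test = 500, 50, 200

# Toy linear latent-variable data: x = F z + noise, y = theta^T z + noise
F = rng.normal(size=(m, d))
theta = rng.normal(size=d)

def sample(n, noise=0.1):
    z = rng.normal(size=(n, d))
    x = z @ F.T + noise * rng.normal(size=(n, m))
    y = z @ theta + noise * rng.normal(size=n)
    return x, y

# 1) Unsupervised pre-training: estimate the top-k covariance eigenvectors via PCA
x_pre, _ = sample(n_pre)
x_pre = x_pre - x_pre.mean(axis=0)
_, _, vt = np.linalg.svd(x_pre, full_matrices=False)
k = d
encoder = vt[:k].T                      # m x k PCA encoder

# 2) Supervised training of the linear regression head on encoded features
x_tr, y_tr = sample(n_train)
w, *_ = np.linalg.lstsq(x_tr @ encoder, y_tr, rcond=None)

# Test risk (mean squared error)
x_te, y_te = sample(n_test)
risk = np.mean((x_te @ encoder @ w - y_te) ** 2)
print(f"test MSE: {risk:.3f}")
```

Because the signal lies in a d-dimensional subspace and the noise is small, the PCA encoder recovers the relevant directions and the regression head only has to fit k = d coefficients, regardless of how large m grows.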

2. RELATED WORK

Overparameterization The study of overparameterized models offers a natural route to theoretical understanding of the success of large models with good generalization properties (Neyshabur et al., 2015; Zhang et al., 2017). The double descent phenomenon was discovered and analysed in early works (Krogh & Hertz, 1991; Geman et al., 1992; Opper, 1995), but the framing as 'double descent' (Belkin et al., 2019) boosted research in this direction, even though generalization of large models had been studied before (Bartlett & Mendelson, 2002; Dziugaite & Roy, 2017; Belkin et al., 2018; Advani et al., 2020). We add to the understanding of machine learning models by analysing the neglected class of encoder-decoder models via the PCA-regression model.

Analysis of pre-training

The introduction of pre-training of neural networks was a paradigm shift for deep learning. Empirical work (Erhan et al., 2010; Raghu et al., 2019) as well as theoretical work, such as on sample complexity (Tripuraneni et al., 2020; Du et al., 2021) or the out-of-distribution risk (Kumar et al., 2022), has tried to understand the underlying mechanisms. For unsupervised pre-training, contrastive methods were studied (Wang & Isola, 2020; Von Kügelgen et al., 2021). Encoder-decoder based autoencoders have been analysed with respect to training dynamics (Nguyen et al., 2019; 2021) or overparameterization (Radhakrishnan et al., 2019; 2020; Buhai et al., 2020; Zhang et al., 2020). In contrast, we study pre-trained PCA encoders and relate the risk to the covariance estimation of the encoder.




We generate data via a linear latent variable data generator based on a low-dimensional manifold. The hidden manifold model (Goldt et al., 2020) and random feature model (Rahimi & Recht, 2007) present similar but nonlinear models. Goldt et al. (2022); Hu & Lu (2022) showed that these nonlinear models are asymptotically equivalent to linear Gaussian models under assumptions such as the latent dimension d → ∞. In contrast, we keep this dimension fixed. Asymptotic generalization results for this data generator are presented in Gerace et al. (2020); Mei & Montanari (2022). Different from our work, where we exploit the low-dimensional structure with the PCA-regression model, they do not use this information, relying on Ridge or logistic regression instead.

PCA-regression model Using PCA (Jolliffe, 1982) is common; discussions focus on the choice of principal components (Breiman & Freedman, 1983) or its use for high-dimensional data (Lee et al., 2012). PCA-regression is investigated in Xu & Hsu (2019) for general but fully known covariances in the asymptotic regime. Wu & Xu (2020) extend it by showing that the misalignment of true
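The role of the eigenvalue decay rate in estimating covariance eigenvectors can be made concrete with a small numerical sketch. The example below is our own illustration, not taken from the paper (the covariance spectra, sample size, and misalignment measure are assumptions): for a diagonal covariance whose true top eigenvector is e_1, rapidly decaying eigenvalues give a well-separated eigengap and an accurate estimate from empirical data, while a slowly decaying spectrum leaves the top eigenvector poorly identified from the same number of samples.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 50, 200  # feature dimension, number of pre-training samples (assumed values)

def top_eigvec_error(eigvals):
    """Estimate the top eigenvector of a diagonal covariance from n samples
    and return 1 - |<v_hat, e_1>|, i.e. the misalignment with the truth."""
    x = rng.normal(size=(n, m)) * np.sqrt(eigvals)  # columns scaled to the variances
    cov_hat = x.T @ x / n                           # empirical covariance
    _, vecs = np.linalg.eigh(cov_hat)               # eigenvalues in ascending order
    v_hat = vecs[:, -1]                             # estimated top eigenvector
    return 1.0 - abs(v_hat[0])                      # true top eigenvector is e_1

fast = 2.0 ** -np.arange(m)                # rapidly decaying eigenvalues
slow = 1.0 / (1.0 + 0.01 * np.arange(m))   # slowly decaying eigenvalues

err_fast = top_eigvec_error(fast)
err_slow = top_eigvec_error(slow)
print(f"misalignment, fast decay: {err_fast:.4f}")
print(f"misalignment, slow decay: {err_slow:.4f}")
```

With the fast-decaying spectrum the eigengap is large relative to the sampling noise, so the estimated top eigenvector aligns closely with e_1; with the slow spectrum the leading eigenvalues are nearly degenerate and the estimate is close to an arbitrary direction within the leading subspace.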

