OPTIMAL REGULARIZATION CAN MITIGATE DOUBLE DESCENT

Abstract

Recent empirical and theoretical studies have shown that many learning algorithms, from linear regression to neural networks, can have test performance that is non-monotonic in quantities such as the sample size and model size. This striking phenomenon, often referred to as "double descent", has raised the question of whether we need to re-think our current understanding of generalization. In this work, we study whether the double-descent phenomenon can be avoided by using optimal regularization. Theoretically, we prove that for certain linear regression models with isotropic data distribution, optimally-tuned ℓ2 regularization achieves monotonic test performance as we grow either the sample size or the model size. We also demonstrate empirically that optimally-tuned ℓ2 regularization can mitigate double descent for more general models, including neural networks. Our results suggest that it may also be informative to study the test risk scalings of various algorithms in the context of appropriately tuned regularization.

1. INTRODUCTION

Recent works have demonstrated a ubiquitous "double descent" phenomenon present in a range of machine learning models, including decision trees, random features, linear regression, and deep neural networks (Opper, 1995; 2001; Advani & Saxe, 2017; Spigler et al., 2018; Belkin et al., 2018; Geiger et al., 2019b; Nakkiran et al., 2020; Belkin et al., 2019; Hastie et al., 2019; Bartlett et al., 2019; Muthukumar et al., 2019; Bibas et al., 2019; Mitra, 2019; Mei & Montanari, 2019; Liang & Rakhlin, 2018; Liang et al., 2019; Xu & Hsu, 2019; Dereziński et al., 2019; Lampinen & Ganguli, 2018; Deng et al., 2019; Nakkiran, 2019). The phenomenon is that models exhibit a peak of high test risk when they are just barely able to fit the train set, that is, to interpolate. For example, as we increase the size of models, test risk first decreases, then increases to a peak around when the effective model size is close to the training data size, and then decreases again in the overparameterized regime. Also surprising is that Nakkiran et al. (2020) observe a double descent as we increase the sample size, i.e., for a fixed model, training with more data can hurt test performance.

These striking observations highlight a potential gap in our understanding of generalization and an opportunity for improved methods. Ideally, we seek learning algorithms which robustly improve performance as the data or model size grows and do not exhibit such unexpected non-monotonic behaviors. In other words, we aim to improve the test performance in situations which would otherwise exhibit high test risk due to double descent. Here, a natural strategy is to use a regularizer and tune its strength on a validation set. This motivates the central question of this work: When does optimally-tuned regularization mitigate or remove the double-descent phenomenon? A further motivation is the fact that double descent is largely observed for unregularized or under-regularized models in practice.
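The tuning strategy just mentioned can be sketched concretely. The snippet below fits ridge regression on random ReLU features over a grid of penalties and keeps the one with the lowest validation error; the feature map, problem sizes, and grid are illustrative placeholders, not the experimental setup used later in the paper.

```python
import numpy as np

def tune_ridge_on_validation(Z_tr, y_tr, Z_val, y_val, lambdas):
    """Return the (lambda, weights) pair with the lowest validation MSE."""
    best_lam, best_w, best_err = None, None, np.inf
    d = Z_tr.shape[1]
    for lam in lambdas:
        w = np.linalg.solve(Z_tr.T @ Z_tr + lam * np.eye(d), Z_tr.T @ y_tr)
        err = float(np.mean((Z_val @ w - y_val) ** 2))
        if err < best_err:
            best_lam, best_w, best_err = lam, w, err
    return best_lam, best_w

# Random ReLU features phi(x) = max(xW, 0): an illustrative stand-in for a
# random-feature model; all sizes and the noise level are placeholders.
rng = np.random.default_rng(0)
D, d, n = 20, 200, 100
W = rng.standard_normal((D, d)) / np.sqrt(D)
beta = rng.standard_normal(D) / np.sqrt(D)

def phi(X):
    """Random ReLU feature map (illustrative)."""
    return np.maximum(X @ W, 0.0)

X_tr, X_val = rng.standard_normal((n, D)), rng.standard_normal((n, D))
y_tr = X_tr @ beta + 0.1 * rng.standard_normal(n)
y_val = X_val @ beta + 0.1 * rng.standard_normal(n)

lam_hat, w_hat = tune_ridge_on_validation(phi(X_tr), y_tr, phi(X_val), y_val,
                                          lambdas=10.0 ** np.arange(-4, 3))
```

By construction, the selected penalty achieves validation error no worse than any other point on the grid; in practice one would use a finer grid or cross-validation.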
As an example, Figure 1 shows a simple linear ridge regression setting in which the unregularized estimator exhibits double descent, but an optimally-tuned regularizer has monotonic test performance.

Figure 1: Test Risk vs. Num. Samples for Isotropic Ridge Regression in d = 500 dimensions. Unregularized regression is non-monotonic in samples, but optimally-regularized regression (λ = λ_opt) is monotonic. In this setting, the optimal regularizer λ_opt does not depend on the number of samples n (Lemma 2), but this is not always true; see Figure 2.

Our Contributions: We study this question from both a theoretical and empirical perspective. Theoretically, we start with the setting of high-dimensional linear regression. Linear regression is a sensible starting point to study these questions, since it already exhibits many of the qualitative features of double descent in more complex models (e.g. Belkin et al. (2019); Hastie et al. (2019), and further related works in Section 1.1). Our work shows that optimally-tuned ridge regression can achieve both sample-wise monotonicity and model-size-wise monotonicity under certain assumptions. Concretely, we show:

1. Sample-wise monotonicity: In the setting of well-specified linear regression with isotropic features/covariates (Figure 1), we prove that optimally-tuned ridge regression yields monotonic test performance with increasing samples. That is, more data never hurts for optimally-tuned ridge regression. (See Theorem 1.)

2. Model-wise monotonicity:

We consider a setting where the input/covariate lives in a high-dimensional ambient space with isotropic covariance. Given a fixed model size d (which might be much smaller than the ambient dimension), we consider the family of models which first project the input to a random d-dimensional subspace, and then compute a linear function in this projected "feature space." (This is nearly identical to models of double descent considered in Hastie et al. (2019, Section 5.1).) We prove that in this setting, as we grow the model size, optimally-tuned ridge regression over the projected features has monotone test performance. That is, with optimal regularization, bigger models are always better or the same. (See Theorem 3.)

3. Monotonicity in the real world:

We also demonstrate several richer empirical settings where optimal ℓ2 regularization induces monotonicity, including random feature classifiers and convolutional neural networks. This suggests that the mitigating effect of optimal regularization may hold more generally in broad machine learning contexts. (See Section 5.)
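As a self-contained illustration of the sample-wise claim in contribution 1 above (a sketch, not the paper's exact experiment), the following Monte-Carlo snippet estimates the test risk of ridge regression with isotropic Gaussian covariates; the dimension, noise level, and the fixed penalty used in the test below are placeholder choices.

```python
import numpy as np

def ridge_test_risk(n, d=50, sigma=0.5, lam=0.0, trials=50, seed=0):
    """Monte-Carlo estimate of excess test risk for ridge regression
    with isotropic Gaussian covariates and a fixed ground truth."""
    rng = np.random.default_rng(seed)
    beta = np.ones(d) / np.sqrt(d)               # ||beta|| = 1
    risks = []
    for _ in range(trials):
        X = rng.standard_normal((n, d))
        y = X @ beta + sigma * rng.standard_normal(n)
        if lam == 0.0:
            beta_hat = np.linalg.pinv(X) @ y     # min-norm least squares
        else:
            beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
        # For isotropic x, E[(x^T(beta_hat - beta))^2] = ||beta_hat - beta||^2.
        risks.append(float(np.sum((beta_hat - beta) ** 2)))
    return float(np.mean(risks))
```

Sweeping n for lam=0.0 reproduces the sample-wise peak near n = d, while a fixed positive penalty flattens it, in line with Figure 1.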
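The model-wise setting of contribution 2 above can likewise be sketched by projecting inputs to a random d-dimensional subspace before ridge regression; again, all sizes and the penalty are illustrative, and the test-risk estimate uses fresh test points rather than a closed form.

```python
import numpy as np

def projected_ridge_risk(d, D=100, n=40, sigma=0.5, lam=0.0, trials=40, seed=0):
    """Test risk of ridge regression on a random d-dimensional projection
    of D-dimensional isotropic inputs, estimated on fresh test points."""
    rng = np.random.default_rng(seed)
    beta = np.ones(D) / np.sqrt(D)                    # ||beta|| = 1
    risks = []
    for _ in range(trials):
        P = rng.standard_normal((D, d)) / np.sqrt(D)  # random projection
        X = rng.standard_normal((n, D))
        y = X @ beta + sigma * rng.standard_normal(n)
        Z = X @ P                                     # projected "features"
        if lam == 0.0:
            w = np.linalg.pinv(Z) @ y                 # min-norm least squares
        else:
            w = np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ y)
        X_test = rng.standard_normal((2000, D))
        risks.append(float(np.mean((X_test @ P @ w - X_test @ beta) ** 2)))
    return float(np.mean(risks))
```

Without regularization the risk peaks when the model size d is close to the sample size n; with a positive penalty the peak is suppressed.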

A few remarks are in order:

Problem-specific vs Minimax and Bayesian. It is worth noting that our results hold for all linear ground truths, rather than only for the worst-case ground truth or a random ground truth. Indeed, the minimax optimal estimator and the Bayes optimal estimator are both trivially sample-wise and model-wise monotonic with respect to the minimax risk and the Bayes risk, respectively. However, they do not guarantee monotonicity of the risk itself for a given fixed problem. In particular, there exist minimax optimal estimators which are not sample-monotonic in the sense we desire.

Universal vs Asymptotic. We also remark that our analysis is not only non-asymptotic but also works for all possible input dimensions, model sizes, and sample sizes. To our knowledge, the results herein are the first non-asymptotic sample-wise and model-wise monotonicity results for linear regression. (See the discussion of related works Hastie et al. (2019); Mei & Montanari (2019) for related results in the asymptotic setting.) Our work reveals aspects of the problem that were not

