STRONG INDUCTIVE BIASES PROVABLY PREVENT HARMLESS INTERPOLATION

Abstract

Classical wisdom suggests that estimators should avoid fitting noise to achieve good generalization. In contrast, modern overparameterized models can yield small test error despite interpolating noise, a phenomenon often called "benign overfitting" or "harmless interpolation". This paper argues that the degree to which interpolation is harmless hinges upon the strength of an estimator's inductive bias, i.e., how heavily the estimator favors solutions with a certain structure: while strong inductive biases prevent harmless interpolation, weak inductive biases can even require fitting noise to generalize well. Our main theoretical result establishes tight non-asymptotic bounds for high-dimensional kernel regression that reflect this phenomenon for convolutional kernels, where the filter size regulates the strength of the inductive bias. We further provide empirical evidence of the same behavior for deep neural networks with varying filter sizes and rotational invariance.

1. INTRODUCTION

According to classical wisdom (see, e.g., Hastie et al. (2001)), an estimator that fits noise suffers from "overfitting" and cannot generalize well. A typical remedy is to prevent interpolation, that is, to stop the estimator from achieving zero training error and thereby fit less noise. For example, one can use ridge regularization, or early stopping for iterative algorithms, to obtain a model whose training error is close to the noise level. However, large overparameterized models such as neural networks seem to behave differently: even on noisy data, they may achieve optimal test performance at convergence, after interpolating the training data (Nakkiran et al., 2021; Belkin et al., 2019a), a phenomenon referred to as harmless interpolation (Muthukumar et al., 2020) or benign overfitting (Bartlett et al., 2020), and often discussed in the context of double descent (Belkin et al., 2019a). To date, we lack a general understanding of when interpolation is harmless for overparameterized models.

In this paper, we argue that the strength of an estimator's inductive bias critically influences whether it exhibits harmless interpolation. An estimator with a strong inductive bias heavily favors "simple" solutions that structurally align with the ground truth (such as sparsity or rotational invariance). Based on well-established high-probability recovery results for sparse linear regression (Tibshirani, 1996; Candes, 2008; Donoho & Elad, 2006), we expect models with a stronger inductive bias to generalize better than ones with a weaker inductive bias, particularly from noiseless data. In contrast, the effects of inductive bias are much less studied for interpolators of noisy data. Recently, Donhauser et al. (2022) provided a first rigorous analysis of the effects of inductive bias strength on the generalization performance of linear max-ℓp-margin/min-ℓp-norm interpolators.
In particular, the authors prove that a stronger inductive bias (small p → 1) not only enhances a model's ability to generalize on noiseless data, but also increases its sensitivity to noise, eventually harming generalization when interpolating noisy data. Their result therefore suggests that interpolation might not be harmless when the inductive bias is too strong. In this paper, we confirm this hypothesis and show that strong inductive biases indeed prevent harmless interpolation, while also moving beyond sparse linear models. As a first example, we consider data where the true labels depend nonlinearly only on input features in a local neighborhood, and vary the strength of the inductive bias via the filter size of convolutional kernels or shallow convolutional neural networks; small filter sizes encourage functions that depend nonlinearly only on local neighborhoods of the input features. As a second example, we investigate classification of rotationally invariant data, where we encourage different degrees of rotational invariance in neural networks. In particular,

• we prove a phase transition between harmless and harmful interpolation that occurs when varying the strength of the inductive bias via the filter size of convolutional kernels for kernel regression in the high-dimensional setting (Theorem 1);

• we further show that, for a weak inductive bias, not only is interpolation harmless, but partially fitting the observation noise is in fact necessary (Theorem 2);

• we show the same phase transition experimentally for neural networks with two common inductive biases: varying convolutional filter size, and rotational invariance enforced via data augmentation (Section 4).

From a practical perspective, empirical evidence suggests that large neural networks do not necessarily benefit from early stopping.
Our results match those observations for typical networks with a weak inductive bias; however, we caution that strongly structured models must avoid interpolation, even if they are highly overparameterized.
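The phenomenon of harmless interpolation under a weak inductive bias can be reproduced in a few lines. The following is a minimal illustrative sketch, not the setting analyzed in this paper: a spiked linear regression problem (one high-variance signal direction plus many low-variance directions; all dimensions and noise levels below are arbitrary choices) in which the minimum-ℓ2-norm interpolator fits the noisy labels exactly, yet the fitted noise is absorbed by the many low-variance directions and barely affects the test risk.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 100, 2000           # n samples, heavily overparameterized
lam1, eps = 25.0, 0.01     # one strong "signal" direction, many weak directions
sigma = 0.5                # label noise level

# Spiked design: the first coordinate carries the signal, the tail can absorb noise.
X = np.hstack([
    np.sqrt(lam1) * rng.standard_normal((n, 1)),
    np.sqrt(eps) * rng.standard_normal((n, d - 1)),
])
theta_star = np.zeros(d)
theta_star[0] = 1.0
y = X @ theta_star + sigma * rng.standard_normal(n)

# Minimum-l2-norm interpolator (a weak inductive bias): theta = X^T (X X^T)^{-1} y.
theta_hat = X.T @ np.linalg.solve(X @ X.T, y)

# It interpolates the noisy labels exactly ...
train_resid = float(np.max(np.abs(X @ theta_hat - y)))

# ... yet its excess risk (theta_hat - theta*)^T Sigma (theta_hat - theta*) stays
# small, because the fitted noise is spread over the many low-variance directions.
Sigma_diag = np.concatenate([[lam1], np.full(d - 1, eps)])
delta = theta_hat - theta_star
excess_risk = float(delta @ (Sigma_diag * delta))
null_risk = lam1  # risk of the trivial zero predictor

print(f"train residual: {train_resid:.2e}")
print(f"excess risk: {excess_risk:.4f} (null predictor: {null_risk:.1f})")
```

With a strongly spiked covariance as above, the interpolator's excess risk is orders of magnitude below the null risk despite zero training error; shrinking the spike (or strengthening the bias, as in our theory) breaks this behavior.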

2. RELATED WORK

We now discuss three groups of related work and explain why their theoretical results cannot reflect the phase transition between harmless and harmful interpolation for high-dimensional kernel learning.

Low-dimensional kernel learning: Many recent works (Bietti et al., 2021; Favero et al., 2021; Bietti, 2022; Cagnetta et al., 2022) prove statistical rates for kernel regression with convolutional kernels in low-dimensional settings, but crucially rely on ridge regularization. In general, one cannot expect harmless interpolation for such kernels in the low-dimensional regime (Rakhlin & Zhai, 2019; Mallinar et al., 2022; Buchholz, 2022); positive results exist only for very specific adaptive spiked kernels (Belkin et al., 2019b). Furthermore, techniques developed for low-dimensional settings (see, e.g., Schölkopf et al. (2018)) usually suffer from a curse of dimensionality, that is, the bounds become vacuous in high-dimensional settings where the input dimension grows with the number of samples.

High-dimensional kernel learning: One line of research (Liang et al., 2020; McRae et al., 2022; Liang & Rakhlin, 2020; Liu et al., 2021) tackles high-dimensional kernel learning and proves non-asymptotic bounds using advanced high-dimensional random matrix concentration tools from El Karoui (2010). However, those results heavily rely on a bounded Hilbert norm assumption. This assumption is natural in the low-dimensional regime, but misleading in the high-dimensional regime, as pointed out in Donhauser et al. (2021b). Another line of research (Ghorbani et al., 2021; 2020; Mei et al., 2021; Ghosh et al., 2022; Misiakiewicz & Mei, 2021; Mei et al., 2022) asymptotically characterizes the precise risk of kernel regression estimators in specific settings with access to a kernel's eigenfunctions and eigenvalues. However, these asymptotic results are insufficient to investigate how varying the filter size of a convolutional kernel affects the risk of a kernel regression estimator.
In contrast to both lines of research, we prove matching non-asymptotic upper and lower bounds for high-dimensional kernel learning that precisely capture the phase transition described in Section 3.2.

Overfitting of structured interpolators: Several works question the generality of harmless interpolation for models that incorporate strong structural assumptions. Examples include structure enforced via data augmentation (Nishi et al., 2021), adversarial training (Rice et al., 2020; Kamath et al., 2021; Sanyal et al., 2021; Donhauser et al., 2021a), neural network architectures (Li et al., 2021), pruning-based sparsity (Chang et al., 2021), and sparse linear models (Wang et al., 2022; Muthukumar et al., 2020; Chatterji & Long, 2022). In this paper, we continue this line of research and offer a new theoretical perspective to characterize when interpolation can be expected to be harmless.

3. THEORETICAL RESULTS

For convolutional kernels, a small filter size induces a strong bias towards estimators that depend nonlinearly on the input features only via small patches. This section analyzes the effect of filter size (as an example inductive bias) on the degree of harmless interpolation for kernel ridge regression.
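To make the role of the filter size concrete, the sketch below implements one common construction of a convolutional kernel: an average of patch-wise RBF kernels over all cyclic patches of size q (an illustrative choice; the kernel family, dimensions, and target below are assumptions for this demo, not necessarily the exact objects analyzed in our theory). On a noiseless target that depends only on one local patch, the ridgeless kernel interpolator with a small filter size (strong, well-aligned bias) generalizes markedly better than the one with a global filter (weak bias).

```python
import numpy as np

def conv_rbf_kernel(A, B, q):
    """Convolutional kernel: average of RBF kernels over all cyclic patches of
    size q. A small q biases estimators toward functions that depend
    nonlinearly only on local patches of the input."""
    d = A.shape[1]
    K = np.zeros((A.shape[0], B.shape[0]))
    for s in range(d):  # cyclic patch starting at coordinate s
        idx = [(s + j) % d for j in range(q)]
        P, Q = A[:, idx], B[:, idx]
        sq = (P**2).sum(1)[:, None] + (Q**2).sum(1)[None, :] - 2 * P @ Q.T
        K += np.exp(-sq / (2 * q))  # bandwidth scaled with the patch size
    return K / d

rng = np.random.default_rng(0)
n, n_test, d = 200, 500, 20
X, X_te = rng.standard_normal((n, d)), rng.standard_normal((n_test, d))
f = lambda Z: Z[:, 0] * Z[:, 1]        # target depends on one local patch only
y, y_te = f(X), f(X_te)                # noiseless labels

mses = {}
for q in (2, d):                       # strong (local) vs weak (global) bias
    K = conv_rbf_kernel(X, X, q)
    alpha = np.linalg.solve(K, y)      # ridgeless: the kernel interpolator
    mses[q] = float(np.mean((conv_rbf_kernel(X_te, X, q) @ alpha - y_te) ** 2))
    print(f"filter size q={q:2d}: test MSE = {mses[q]:.3f}")
```

On noiseless data, the ordering favors the small filter; the point of Theorem 1 is that with label noise this trend can reverse, since the strongly biased interpolator is more sensitive to the noise it is forced to fit.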

