MITIGATING DEEP DOUBLE DESCENT BY CONCATENATING INPUTS

Abstract

The double descent curve is one of the most intriguing properties of deep neural networks. It contrasts the classical bias-variance curve with the behavior of modern neural networks, and occurs where the number of samples nears the number of parameters. In this work, we explore the connection between the double descent phenomenon and the number of samples in the deep neural network setting. In particular, we propose a construction which augments the existing dataset by artificially increasing the number of samples. This construction empirically mitigates the double descent curve in this setting. We reproduce existing work on deep double descent, and observe a smooth descent into the overparameterized region for our construction. This occurs both with respect to the model size and with respect to the number of epochs.

1. INTRODUCTION

Underparameterization and overparameterization are at the heart of understanding modern neural networks. The traditional notion of underparameterization and overparameterization led to the classic U-shaped generalization error curve (Hastie et al., 2001; Geman et al., 1992), where generalization would worsen when the model had either too few (underparameterized) or too many (overparameterized) parameters. Correspondingly, it was expected that an underparameterized model would underfit and fail to identify more complex and informative patterns, and an overparameterized model would overfit and identify non-informative patterns. This view no longer holds for modern neural networks. It is widely accepted that neural networks are vastly overparameterized, yet generalize well. There is strong evidence that increasing the number of parameters leads to better generalization (Zagoruyko & Komodakis, 2016; Huang et al., 2017; Larsson et al., 2016), and models are often trained to achieve zero training loss (Salakhutdinov, 2017) while still improving in generalization error, whereas the traditional view would predict overfitting. To bridge the gap, Belkin et al. (2018a) proposed the double descent curve, where the underparameterized region follows the U-shaped curve, and the overparameterized region smoothly decreases in generalization error as the number of parameters increases further. This results in a peak in generalization error near which, counter-intuitively, fewer samples can decrease the error. There is extensive experimental evidence of the double descent curve in deep learning (Nakkiran et al., 2019; Yang et al., 2020), as well as in models such as random forests and one layer neural networks (Belkin et al., 2018a; Ba et al., 2020).
One recurring theme in the definition of overparameterization and underparameterization lies in the number of neural network parameters relative to the number of samples (Belkin et al., 2018a; Nakkiran et al., 2019; Ba et al., 2020; Bibas et al., 2019; Muthukumar et al., 2019; Hastie et al., 2019). At a high level, a greater number of parameters than samples is generally considered overparameterization, and fewer is considered underparameterization. However, this leads to the question "What is a sample?" In this paper, we revisit the fundamental underpinnings of overparameterization and underparameterization, and stress test what it means to be overparameterized or underparameterized, through extensive experiments with a cleverly constructed input. We artificially augment existing datasets by simply stacking every combination of inputs, and show that this mitigates the double descent curve in the deep neural network setting. We humbly hypothesize that in deep neural networks we can, perhaps, artificially increase the number of samples without increasing the information contained in the dataset, and, by implicitly changing the classification pipeline, mitigate the double descent curve. In particular, our contributions are the following:
• We propose a simple construction to artificially augment existing datasets of size O(n) by stacking inputs, producing a dataset of size O(n²).
• We demonstrate that the construction has no impact on the double descent curve in the linear regression case.
• We show experimentally that these results on the double descent curve do not extend to neural networks. Concretely, we reproduce results from recent landmark papers, and present the difference in behavior with respect to the double descent curve.
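The stacking construction above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's reference implementation; in particular, the labeling of each stacked pair (here, the label of the first component) is an assumption, and the paper's exact scheme may differ:

```python
import numpy as np

def stack_pairs(X, y):
    """Augment a dataset of n samples into n^2 samples by concatenating
    every ordered pair of inputs along the feature axis.

    NOTE: assigning each pair the label of its *first* component is an
    assumption made for this sketch, not necessarily the paper's scheme.
    """
    n = X.shape[0]
    # Index arrays enumerating every ordered pair (i, j); n^2 pairs in total.
    idx_i, idx_j = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    idx_i, idx_j = idx_i.ravel(), idx_j.ravel()
    # Concatenate the two inputs of each pair along the feature axis.
    X_stacked = np.concatenate([X[idx_i], X[idx_j]], axis=1)
    y_stacked = y[idx_i]  # assumed: label of the first input
    return X_stacked, y_stacked

# Toy dataset: 3 samples with 4 features each.
X = np.arange(12, dtype=float).reshape(3, 4)
y = np.array([0, 1, 2])
Xs, ys = stack_pairs(X, y)
print(Xs.shape)  # (9, 8): n^2 = 9 samples, doubled feature dimension
```

The same idea applies to image inputs by stacking along the channel or height axis instead of a flat feature axis.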

2. RELATED WORKS

The double descent curve was proposed recently in (Belkin et al., 2018a), where the authors define overparameterization and underparameterization as the proportion of parameters to samples. The authors explain the phenomenon through the capacity of the model class. With more parameters in the overparameterized region, there is larger "capacity" (i.e., the model class contains more candidates), and thus the class may contain better, simpler models by Occam's razor. The interpolation region is suggested to arise when the model capacity is just capable of fitting the data nearly perfectly by overfitting on non-informative features, resulting in higher test error. Experiments included a one layer neural network, random forests, and others. The double descent curve is also observed in deep neural networks (Nakkiran et al., 2019), with the additional observation of epoch-wise double descent. Experimentation is amplified by label noise. Motivated by the observation of unimodal variance (Neal et al., 2018), Yang et al. (2020) also decompose the risk into bias and variance, and posit that the double descent curve arises because the bell-shaped variance curve rises faster than the bias decreases.
There is substantial theoretical work on double descent, particularly in the least squares regression setting. Advani & Saxe (2017) analyse this linear setting and prove the existence of the interpolation region, where the number of parameters equals the number of samples, in the asymptotic limit where samples and parameters tend to infinity. Hastie et al. (2019) follow a similar line of work, and prove that regularization reduces the peak in the interpolation region. Belkin et al. (2019b) requires only finitely many samples, assuming the features and target are jointly Gaussian. Other papers with a similar setup include (Bartlett et al., 2019; Muthukumar et al., 2019; Bibas et al., 2019; Mitra, 2019; Mei & Montanari, 2019). Ba et al. (2020) analyse the least squares regression setting for two layer linear neural networks in the asymptotic regime, where the double descent curve is present when only the second layer is optimized. There is also work proving that optimally tuned ℓ2-norm regularization mitigates the double descent curve for certain linear regression models with isotropic data distribution (Nakkiran, 2019). This setting has also been studied with respect to the variance in the parameter space (Bartlett et al., 2019). Multiple descent has also been studied; in particular, there is work showing, in the linear regression setting, that multiple descent curves can be directly designed by the user (Chen et al., 2020). Additionally, there is supporting evidence of double descent from the sample-wise perspective (Nakkiran et al., 2020). There is other work in this area, including studying the double descent curve for least squares in random feature models (Belkin et al., 2019a; d'Ascoli et al., 2020; Ghorbani et al., 2019), leveraging the Neural Tangent Kernel to argue that for a certain number of parameters the output of the neural network diverges (Geiger et al., 2020), characterizing double descent in non-linear settings (Caron & Chretien, 2020), kernel learning (Belkin et al., 2018b; Liang et al., 2019), and connections to other fields (Geiger et al., 2019). Lastly, we note that, in the deep neural network setting, models can be trained to zero training loss even with random labels (Zhang et al., 2016).
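The interpolation peak in the least squares setting discussed above can be reproduced in a few lines. The following is a self-contained illustration, not an experiment from any of the cited papers; the sample sizes, feature counts, and noise level are arbitrary choices, and minimum-norm least squares (via the pseudoinverse) stands in for the estimators analysed in those works:

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d_max = 40, 1000, 120
# Ground-truth linear model with Gaussian features and noisy labels.
beta = rng.normal(size=d_max) / np.sqrt(d_max)
X_tr = rng.normal(size=(n_train, d_max))
X_te = rng.normal(size=(n_test, d_max))
y_tr = X_tr @ beta + 0.5 * rng.normal(size=n_train)
y_te = X_te @ beta

errors = {}
for p in [10, 20, 40, 80, 120]:  # p = number of features the model uses
    # Minimum-norm least squares fit on the first p features.
    w = np.linalg.pinv(X_tr[:, :p]) @ y_tr
    errors[p] = float(np.mean((X_te[:, :p] @ w - y_te) ** 2))

# Test error is expected to peak near the interpolation threshold
# p = n_train = 40, then descend again as p grows past it.
for p, e in errors.items():
    print(p, e)
```

Adding a small ridge penalty in place of the plain pseudoinverse suppresses the peak, consistent with the regularization results of Hastie et al. (2019) and Nakkiran (2019).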

