MITIGATING DEEP DOUBLE DESCENT BY CONCATENATING INPUTS

Abstract

The double descent curve is one of the most intriguing properties of deep neural networks. It contrasts the classical bias-variance curve with the behavior of modern neural networks, occurring where the number of samples nears the number of parameters. In this work, we explore the connection between the double descent phenomenon and the number of samples in the deep neural network setting. In particular, we propose a construction that augments the existing dataset by artificially increasing the number of samples. This construction empirically mitigates the double descent curve in this setting. We reproduce existing work on deep double descent, and observe a smooth descent into the overparameterized region for our construction. This holds both with respect to model size and with respect to the number of training epochs.

1. INTRODUCTION

Underparameterization and overparameterization are at the heart of understanding modern neural networks. The traditional notions of underparameterization and overparameterization led to the classic U-shaped generalization error curve (Hastie et al., 2001; Geman et al., 1992), where generalization worsens when the model has either too few (underparameterized) or too many (overparameterized) parameters. Correspondingly, an underparameterized model was expected to underfit and fail to identify more complex and informative patterns, while an overparameterized model was expected to overfit and identify non-informative patterns.

This view no longer holds for modern neural networks. It is widely accepted that neural networks are vastly overparameterized, yet generalize well. There is strong evidence that increasing the number of parameters leads to better generalization (Zagoruyko & Komodakis, 2016; Huang et al., 2017; Larsson et al., 2016), and models are often trained to zero training loss (Salakhutdinov, 2017) while still improving in generalization error, whereas the traditional view would predict overfitting. To bridge this gap, Belkin et al. (2018a) proposed the double descent curve, in which the underparameterized region follows the U-shaped curve, while in the overparameterized region the generalization error decreases smoothly as the number of parameters grows further. This results in a peak in generalization error near the interpolation threshold, where, counter-intuitively, fewer samples can decrease the error. There is extensive experimental evidence of the double descent curve in deep learning (Nakkiran et al., 2019; Yang et al., 2020), as well as in models such as random forests and one-layer neural networks (Belkin et al., 2018a; Ba et al., 2020).

One recurring theme in the definitions of overparameterization and underparameterization is the number of neural network parameters relative to the number of samples (Belkin et al., 2018a; Nakkiran et al., 2019; Ba et al., 2020; Bibas et al., 2019; Muthukumar et al., 2019; Hastie et al., 2019). At a high level, having more parameters than samples is generally considered overparameterization, and having fewer is considered underparameterization. However, this leads to the question: what is a sample? In this paper, we revisit the fundamental underpinnings of overparameterization and underparameterization, and stress-test what it means to be overparameterized or underparameterized through extensive experiments with a cleverly constructed input. We artificially augment existing datasets by simply stacking every combination of inputs, and show that this mitigates the double descent curve in the deep neural network setting; a minimal sketch of the construction follows.
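To make the stacking construction concrete, the following Python sketch builds an augmented dataset by concatenating every (ordered) pair of inputs along the last axis. The helper name concat_augment, the optional subsampling of pairs, and the choice to keep the first element's label are illustrative assumptions: this section does not specify how the stacked pairs are labeled or stored.

import numpy as np

def concat_augment(X, y, max_pairs=None, seed=0):
    # Sketch: every ordered pair of inputs is stacked into one wider sample,
    # growing an N-sample dataset toward N^2 samples.
    # NOTE: keeping the first element's label is an illustrative assumption.
    n = len(X)
    rng = np.random.default_rng(seed)
    pairs = [(i, j) for i in range(n) for j in range(n)]
    if max_pairs is not None:  # subsample pairs for tractability (assumption)
        idx = rng.choice(len(pairs), size=max_pairs, replace=False)
        pairs = [pairs[k] for k in idx]
    X_aug = np.stack([np.concatenate([X[i], X[j]], axis=-1) for i, j in pairs])
    y_aug = np.array([y[i] for i, _ in pairs])  # assumed labeling (see note)
    return X_aug, y_aug

# Toy usage: 100 "images" of shape (8, 8) become up to 100^2 stacked samples.
X = np.random.rand(100, 8, 8)
y = np.random.randint(0, 10, size=100)
X_aug, y_aug = concat_augment(X, y, max_pairs=1000)
print(X_aug.shape, y_aug.shape)  # (1000, 8, 16) (1000,)

In this toy usage, an N-sample dataset grows toward N^2 stacked samples, which is the sense in which the construction artificially increases the number of samples relative to the number of parameters.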

