THE FINAL ASCENT: WHEN BIGGER MODELS GENERALIZE WORSE ON NOISY-LABELED DATA

Abstract

Increasing the size of overparameterized neural networks has been shown to improve their generalization performance. However, real-world datasets often contain a significant fraction of noisy labels, which can drastically harm the performance of the models trained on them. In this work, we study how neural networks' test loss changes with model size when the training set contains noisy labels. We show that under a sufficiently large noise-to-sample-size ratio, generalization error eventually increases with model size. First, we provide a theoretical analysis of random feature regression and show that this phenomenon occurs as the variance of the generalization loss experiences a second ascent under a large noise-to-sample-size ratio. Then, we present extensive empirical evidence confirming that our theoretical results hold for neural networks. Furthermore, we empirically observe that the adverse effect of network size is more pronounced when robust training methods are employed to learn from noisy-labeled data. Our results have important practical implications: First, larger models should be employed with extra care, particularly when trained on smaller datasets or using robust learning methods. Second, a large sample size can alleviate the effect of noisy labels and allow larger models to achieve superior performance even under noise.

1. INTRODUCTION

Modern neural networks of ever-increasing size, with billions of parameters, have achieved unprecedented success in various tasks. However, real-world training datasets are usually unlabeled, and the commonly used crowd-sourcing and automatic-labeling techniques can introduce many noisy labels (Krishna et al., 2016). Over-parameterized models can easily overfit the noisy labels and suffer a drastic drop in their generalization performance (Zhang et al., 2016). This phenomenon has inspired a recent body of work on dealing with high levels of noisy labels (Han et al., 2018; Zhang & Sabuncu, 2018; Jiang et al., 2018; Mirzasoleiman et al., 2020; Liu et al., 2020; Li et al., 2020). However, the effect of model size on neural networks' generalization performance on noisy data has been overlooked, and the following important question has remained unanswered: can large models trained on noisy-labeled data be trusted and safely used?

Contrary to the classical view of the bias-variance trade-off, increasing the size of overparameterized neural networks only improves generalization (Neyshabur et al., 2014). To explain this, Belkin et al. (2019) proposed the double descent phenomenon, suggesting that the test error follows a U-shaped curve until the training set can be fit exactly, but then begins to descend again and reaches its minimum in the overparameterized regime. This has further been investigated by a body of recent work (see Section 2) focusing on confirming or reproducing the double descent curve. Here, we instead show that label noise can change the above picture and introduce a final ascent to the monotonically decreasing loss curve in the overparameterized regime. Specifically, we analyze the generalization performance of models of increasing size, in terms of both network width and density, under varying sample sizes and levels of noisy labels.
We find that, when the noise-to-sample-size ratio is sufficiently large, increasing the width or density of the model beyond a certain point only hurts the generalization performance. We first provide theoretical evidence from the random feature regression studied in prior work (Yang et al., 2020) and show that the test generalization loss can be decomposed into a decreasing bias, a unimodal noise-independent variance, and an increasing noise-dependent variance. The noise-dependent variance is more pronounced when the noise-to-sample-size ratio is large, which leads to a second ascent in the total variance (cf. Figures 1 and 2) and an ascent in the test loss. Interestingly, our analysis also demonstrates that under a large noise-to-sample-size ratio, reducing model density by keeping a randomly selected fraction of weights can improve generalization. Our analysis complements the double descent phenomenon by providing a complete picture of the generalization curve vs. model size under various levels of noisy labels.

Through extensive experiments, we corroborate our theoretical results and show their validity for neural networks by demonstrating that (1) sufficiently large label noise can lead to a final ascent in the test loss, and (2) a sufficiently large sample size can eliminate this final ascent. In addition, we study the effect of model size on the performance of state-of-the-art methods for robust learning against noisy labels. We show that, perhaps surprisingly, the adverse effect of larger models can be observed even under a smaller noise-to-sample-size ratio when robust methods are employed. Finally, we take a closer look at the smoothness of the learned networks trained with noisy labels. We show that noisy labels can turn the previously suggested negative correlation between network size and model complexity into a positive one.
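The qualitative effect of this decomposition can be probed in a toy simulation. The following is a minimal NumPy sketch, not the paper's exact setup: the ReLU random-feature model, dimensions, widths, and noise level are all illustrative assumptions. It fits minimum-norm least squares on random features with noisy training labels and records the clean test loss as width grows:

```python
import numpy as np

def random_feature_test_losses(n_train=200, n_test=1000, d=20,
                               widths=(10, 50, 200, 1000),
                               noise_std=1.0, seed=0):
    """Min-norm least squares on random ReLU features, sweeping width N.

    Training labels carry additive noise; test loss is measured against
    the clean signal, a miniature noise-to-sample-size experiment.
    """
    rng = np.random.default_rng(seed)
    beta = rng.normal(size=d) / np.sqrt(d)          # ground-truth linear target
    X_tr = rng.normal(size=(n_train, d))
    X_te = rng.normal(size=(n_test, d))
    y_tr = X_tr @ beta + noise_std * rng.normal(size=n_train)  # noisy labels
    y_te = X_te @ beta                              # clean test targets
    losses = {}
    for N in widths:
        W = rng.normal(size=(d, N)) / np.sqrt(d)    # frozen random first layer
        Z_tr = np.maximum(X_tr @ W, 0.0)            # ReLU random features
        Z_te = np.maximum(X_te @ W, 0.0)
        # lstsq returns the minimum-norm solution when N > n_train
        a, *_ = np.linalg.lstsq(Z_tr, y_tr, rcond=None)
        losses[N] = float(np.mean((Z_te @ a - y_te) ** 2))
    return losses

losses = random_feature_test_losses()
for N, loss in sorted(losses.items()):
    print(f"width={N:5d}  test MSE={loss:.3f}")
```

Sweeping `noise_std` up and `n_train` down in this toy should make the ascent at large widths more or less pronounced, qualitatively matching the noise-to-sample-size story; exact curves depend on the random seed.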
Our work is a step toward understanding the complicated overfitting and generalization behavior of modern machine learning models trained on large real-world data. In practice, our results also have several important implications: First, reducing width or dropping a fraction of (even randomly selected) weights can alleviate the effect of noisy labels. Second, when training large models, a larger sample size can effectively counter the effect of noisy labels and even remove the final ascent. This explains why the harm of larger models on noisy-labeled data has not been observed by prior work (Arpit et al., 2017). Finally, large models should be used with extra care even on large datasets when training with robust methods, as larger width or density can hurt the performance of robust learning algorithms.
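The first implication, dropping a random fraction of weights, can be illustrated in the same spirit. This hedged sketch (the masking scheme, dimensions, and noise level are our own illustrative choices, not the paper's protocol) zeroes out a random subset of first-layer weights in a random-feature model and compares the clean test loss across densities:

```python
import numpy as np

def masked_feature_test_loss(density, n_train=150, n_test=2000, d=20,
                             width=2000, noise_std=1.0, seed=0):
    """Keep a random fraction `density` of first-layer weights and
    measure the clean test MSE of the min-norm random-feature fit."""
    rng = np.random.default_rng(seed)
    beta = rng.normal(size=d) / np.sqrt(d)          # ground-truth linear target
    X_tr = rng.normal(size=(n_train, d))
    X_te = rng.normal(size=(n_test, d))
    y_tr = X_tr @ beta + noise_std * rng.normal(size=n_train)  # noisy labels
    y_te = X_te @ beta                              # clean test targets
    W = rng.normal(size=(d, width)) / np.sqrt(d)
    W *= rng.random(size=W.shape) < density         # random sparsity mask
    Z_tr = np.maximum(X_tr @ W, 0.0)                # ReLU random features
    Z_te = np.maximum(X_te @ W, 0.0)
    a, *_ = np.linalg.lstsq(Z_tr, y_tr, rcond=None)
    return float(np.mean((Z_te @ a - y_te) ** 2))

for rho in (1.0, 0.5, 0.1):
    print(f"density={rho:.1f}  test MSE={masked_feature_test_loss(rho):.3f}")
```

In this sketch the total weight count is fixed, so density varies independently of width, mirroring the width-vs-density distinction drawn in the analysis.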

2. RELATION TO PRIOR WORK

A bulk of recent work has studied the double descent phenomenon by analyzing the error of linear regression (Hastie et al., 2019; Belkin et al., 2020; Derezinski et al., 2020), random feature regression (Mei & Montanari, 2019; Adlam & Pennington, 2020a; Yang et al., 2020; d'Ascoli et al., 2020), or even NTK regression (Adlam & Pennington, 2020b), showing that these models can capture some important features of double descent. The primary focus of these works has been on the behavior of the total test error. A few of them decompose the test error into bias and variance (Mei & Montanari, 2019; Yang et al., 2020). Most relevant to our analysis are (Adlam & Pennington, 2020a; d'Ascoli et al., 2020), which decompose the variance into several sources. However, we take a different asymptotic limit than theirs: given n training examples and d input dimensions, we let the ratio n/d tend to infinity instead of holding it constant. The benefit of this limit is that we can derive closed-form expressions for the different terms in the test error and therefore obtain more interpretable results. Our setting is particularly comparable to that of Yang et al. (2020), which analyzed the noiseless case in the same asymptotic limit. Our result shows that label noise contributes a monotonically increasing term to the variance, in addition to the bias and noise-independent variance terms derived in (Yang et al., 2020). Furthermore, we include model density in the analysis by adding another layer and show that density plays a different role than width. In the noiseless setting, our Theorem 2 reproduces the empirical observation made by Golubeva et al. (2020) that increasing width while keeping the number of parameters fixed by reducing density improves generalization (see Appendix A.7), which separates the benefit of width from the effect of model capacity. The role of label noise in double descent has been discussed before in both empirical (Nakkiran et al., 2021) and theoretical (Adlam & Pennington, 2020a) studies.
However, those works focus on settings where double descent still holds and only highlight that label noise exacerbates the peak of the test loss at the interpolation threshold. In contrast, we find that label noise can fundamentally change the monotonicity of the loss curve by adding a final ascent. Furthermore, we show that this ascent can be removed by using a sufficiently large sample size.

