THE FINAL ASCENT: WHEN BIGGER MODELS GENERALIZE WORSE ON NOISY-LABELED DATA

Abstract

Increasing the size of overparameterized neural networks has been shown to improve their generalization performance. However, real-world datasets often contain a significant fraction of noisy labels, which can drastically harm the performance of models trained on them. In this work, we study how the test loss of neural networks changes with model size when the training set contains noisy labels. We show that under a sufficiently large noise-to-sample-size ratio, the generalization error eventually increases with model size. First, we provide a theoretical analysis of random feature regression and show that this phenomenon occurs because the variance of the generalization loss experiences a second ascent under a large noise-to-sample-size ratio. Then, we present extensive empirical evidence confirming that our theoretical results hold for neural networks. Furthermore, we empirically observe that the adverse effect of network size is more pronounced when robust training methods are employed to learn from noisy-labeled data. Our results have important practical implications. First, larger models should be employed with extra care, particularly when trained on small datasets or with robust learning methods. Second, a large sample size can alleviate the effect of noisy labels and allow larger models to achieve superior performance even under noise.

1. INTRODUCTION

Modern neural networks of ever-increasing size, with billions of parameters, have achieved unprecedented success on a variety of tasks. However, real-world training data is often initially unlabeled, and the crowd-sourcing and automatic-labeling techniques commonly used to annotate it can introduce a large number of noisy labels (Krishna et al., 2016). Over-parameterized models can easily overfit the noisy labels and suffer a drastic drop in their generalization performance (Zhang et al., 2016). This phenomenon has inspired a recent body of work on learning under high levels of label noise (Han et al., 2018; Zhang & Sabuncu, 2018; Jiang et al., 2018; Mirzasoleiman et al., 2020; Liu et al., 2020; Li et al., 2020). However, the effect of model size on neural networks' generalization performance on noisy data has been overlooked, and the following important question has remained unanswered: can large models trained on noisy-labeled data be trusted and safely used?

Contrary to the classical bias-variance trade-off, increasing the size of overparameterized neural networks only improves generalization (Neyshabur et al., 2014). To explain this, Belkin et al. (2019) proposed the double descent phenomenon, suggesting that the test error follows a U-shaped curve until the training set can be fit exactly, but then begins to descend again and reaches its minimum in the overparameterized regime. This has been further investigated by a body of recent work (see Section 2) focused on confirming or reproducing the double descent curve.

Here, we instead show that label noise can change the above picture and introduce a final ascent to the monotonically decreasing loss curve in the overparameterized regime. Specifically, we analyze the generalization performance of models of increasing size, in terms of both network width and density, under varying sample sizes and levels of label noise. We find that, when the noise-to-sample-size ratio is sufficiently large, increasing the width or density of the model beyond a certain point only hurts generalization. We first provide theoretical evidence from the random feature regression setting studied in prior work (Yang et al., 2020) and show that the test generalization loss can be decomposed into a decreasing bias,
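
To make the setting concrete, the following Python sketch fits random (ReLU) feature regression of increasing width to a small training set whose labels are partially corrupted, and reports the test loss at each width. It is an illustrative toy under assumed choices, not our exact theoretical construction: the linear teacher, the Gaussian corruption model, and the particular widths and noise fraction (noise_frac) are all assumptions made for demonstration.

    import numpy as np

    rng = np.random.default_rng(0)

    def final_ascent_sketch(n_train=100, n_test=2000, d=20,
                            widths=(10, 50, 100, 500, 2000),
                            noise_frac=0.4):
        # Linear teacher generating the clean labels.
        beta = rng.standard_normal(d) / np.sqrt(d)
        X_tr = rng.standard_normal((n_train, d))
        X_te = rng.standard_normal((n_test, d))
        y_tr, y_te = X_tr @ beta, X_te @ beta
        # Corrupt a fraction of the training labels with additive noise.
        noisy = rng.random(n_train) < noise_frac
        y_tr = y_tr + noisy * rng.standard_normal(n_train)
        for width in widths:
            # Fixed random first layer; only the output layer is fit.
            W = rng.standard_normal((d, width)) / np.sqrt(d)
            F_tr = np.maximum(X_tr @ W, 0.0)
            F_te = np.maximum(X_te @ W, 0.0)
            # Min-norm least-squares fit on the random features
            # (interpolates the noisy labels once width > n_train).
            a, *_ = np.linalg.lstsq(F_tr, y_tr, rcond=None)
            mse = np.mean((F_te @ a - y_te) ** 2)
            print(f"width={width:5d}  test MSE={mse:.4f}")

    final_ascent_sketch()

When noise_frac is large relative to n_train, the test loss past the interpolation threshold need not keep decreasing with width; this noise-dominated regime is the one our analysis below characterizes.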

