HETEROSKEDASTIC AND IMBALANCED DEEP LEARNING WITH ADAPTIVE REGULARIZATION

Abstract

Real-world large-scale datasets are heteroskedastic and imbalanced: labels have varying levels of uncertainty and label distributions are long-tailed. Heteroskedasticity and imbalance challenge deep learning algorithms due to the difficulty of distinguishing among mislabeled, ambiguous, and rare examples. Addressing heteroskedasticity and imbalance simultaneously is under-explored. We propose a data-dependent regularization technique for heteroskedastic datasets that regularizes different regions of the input space differently. Inspired by the theoretical derivation of the optimal regularization strength in a one-dimensional nonparametric classification setting, our approach adaptively regularizes the data points in higher-uncertainty, lower-density regions more heavily. We test our method on several benchmark tasks, including a real-world heteroskedastic and imbalanced dataset, WebVision. Our experiments corroborate our theory and demonstrate a significant improvement over other methods in noise-robust deep learning.

1. INTRODUCTION

In real-world machine learning applications, even well-curated training datasets have various types of heterogeneity. Two main types of heterogeneity are: (1) data imbalance: the input or label distribution often has a long-tailed density, and (2) heteroskedasticity: the labels given inputs have varying levels of uncertainty across subsets of the data, stemming from sources such as the intrinsic ambiguity of the data or annotation errors. Many deep learning algorithms have been proposed for imbalanced datasets (e.g., see (Wang et al., 2017; Cao et al., 2019; Cui et al., 2019; Liu et al., 2019) and the references therein). However, heteroskedasticity, a classical notion studied extensively in the statistics community (Pintore et al., 2006; Wang et al., 2013; Tibshirani et al., 2014), has so far been under-explored in deep learning. This paper focuses on addressing heteroskedasticity and its interaction with data imbalance in deep learning.

Heteroskedasticity is often studied in regression analysis and refers to the property that the distribution of the error varies across inputs. In this work, we mostly focus on classification, though the developed technique also applies to regression. Here, heteroskedasticity reflects how the uncertainty in the conditional distribution y | x, or the entropy of y | x, varies as a function of x. Real-world datasets are often heteroskedastic. For example, Li et al. (2017) show that the WebVision dataset has a varying number of ambiguous or truly noisy examples across classes. Conversely, we consider a dataset to be homoskedastic if every example is mislabeled with a fixed probability, as assumed by many prior theoretical and empirical works on label corruption (Ghosh et al., 2017; Han et al., 2018; Jiang et al., 2018; Mirzasoleiman et al., 2020).
We note that varying uncertainty in y | x can come from at least two sources: the intrinsic semantic ambiguity of the input, and the (data-dependent) mislabeling introduced by the annotation process. Our approach can handle both types of noisy examples in a unified way, but for the sake of comparison with past methods, we call them "ambiguous examples" and "mislabeled examples" respectively, and refer to both of them as "noisy examples".
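To make the core idea concrete, the following is a minimal sketch of the adaptive weighting principle described above: examples in higher-uncertainty, lower-density regions receive heavier regularization. The specific functional form (uncertainty divided by density, normalized to a mean of one) is illustrative only and is not the formula derived in the paper; the per-example `uncertainty` and `density` estimates are assumed to come from some upstream procedure.

```python
import numpy as np

def adaptive_reg_weights(uncertainty, density, eps=1e-8):
    """Per-example regularization weights (illustrative form, not the
    paper's derived rule): weight grows with estimated label uncertainty
    and shrinks with estimated local data density, then is normalized so
    the average weight equals 1 (preserving overall regularization budget)."""
    w = np.asarray(uncertainty) / (np.asarray(density) + eps)
    return w / w.mean()

# Toy example: four points spanning the uncertainty/density grid.
u = np.array([0.1, 0.1, 0.9, 0.9])  # estimated uncertainty of y | x
d = np.array([1.0, 0.2, 1.0, 0.2])  # estimated local density of x
w = adaptive_reg_weights(u, d)
# The high-uncertainty, low-density point gets the largest weight.
```

In training, such weights would scale a per-example regularizer (e.g., a penalty on the model's local sensitivity at each input) inside the loss, rather than applying one global regularization strength.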



Code available at https://github.com/kaidic/HAR.
See Figure 4 of Li et al. (2017); the number of votes for each example indicates the level of uncertainty of that example.

