WHY RESAMPLING OUTPERFORMS REWEIGHTING FOR CORRECTING SAMPLING BIAS WITH STOCHASTIC GRADIENTS

Abstract

A data set sampled from a certain population is biased if the subgroups of the population are sampled at proportions that are significantly different from their underlying proportions. Training machine learning models on biased data sets requires correction techniques to compensate for the bias. We consider two commonly-used techniques, resampling and reweighting, that rebalance the proportions of the subgroups to maintain the desired objective function. Though statistically equivalent, it has been observed that resampling outperforms reweighting when combined with stochastic gradient algorithms. By analyzing illustrative examples, we explain the reason behind this phenomenon using tools from dynamical stability and stochastic asymptotics. We also present experiments from regression, classification, and off-policy prediction to demonstrate that this is a general phenomenon. We argue that it is imperative to consider the objective function design and the optimization algorithm together while addressing the sampling bias.

1. INTRODUCTION

A data set sampled from a certain population is called biased if the subgroups of the population are sampled at proportions that are significantly different from their underlying population proportions. Applying machine learning algorithms naively to biased training data can raise serious concerns and lead to controversial results (Sweeney, 2013; Kay et al., 2015; Menon et al., 2020). In many domains, such as demographic surveys, fraud detection, identification of rare diseases, and natural disaster prediction, a model trained on biased data tends to favor the oversampled subgroups, achieving high accuracy there while sacrificing performance on the undersampled subgroups. Although one can diversify and balance during the data collection process, it is often hard or impossible to eliminate the sampling bias entirely due to historical and operational issues. In order to mitigate the biases and discrimination against the undersampled subgroups, a common technique is to preprocess the data set by compensating for the mismatch between the population proportions and the sampling proportions. Among various approaches, two commonly-used choices are reweighting and resampling. In reweighting, one multiplies the loss of each sample by a ratio equal to its population proportion over its sampling proportion. In resampling, on the other hand, one corrects the proportion mismatch by either generating new samples for the undersampled subgroups or selecting a subset of samples from the oversampled subgroups. Both methods result in statistically equivalent models in terms of the loss function (see details in Section 2). However, it has been observed in practice that resampling often outperforms reweighting significantly, for example in boosting algorithms for classification (Galar et al., 2011; Seiffert et al., 2008) and off-policy prediction in reinforcement learning (Schlegel et al., 2019). The obvious question is why.

Main contributions.
Our main contribution is to provide an answer to this question: resampling outperforms reweighting because of the stochastic gradient-type algorithms used for training. To the best of our knowledge, our explanation is the first quantitative theoretical analysis of this phenomenon. With stochastic gradient descent (SGD) being the dominant method for model training, our analysis builds on recent developments in the understanding of SGD. We show, via simple and explicitly analyzable examples, why resampling generates the expected results while reweighting performs undesirably. Our theoretical analysis takes two points of view, one from dynamical stability and the other from stochastic asymptotics. In addition to the theoretical analysis, we present experiments from three distinct categories (classification, regression, and off-policy prediction) to demonstrate that resampling outperforms reweighting in practice; this empirical study illustrates that it is a quite general phenomenon when models are trained with stochastic gradient-type algorithms. Our theoretical analysis and experiments show clearly that adjusting only the loss function is not sufficient to fix the biased-data problem: the output can be disastrous if one overlooks the optimization algorithm used in training. In fact, recent work has shown that objective function design and the optimization algorithm are closely related; for example, optimization algorithms such as SGD play a key role in the generalizability of deep neural networks. Therefore, in order to address the biased-data issue, we advocate considering data, model, and optimization as an integrated system.

Related work. In a broader scope, resampling and reweighting can be considered as instances of preprocessing the training data to tackle biases of machine learning algorithms.
Though there are many well-developed resampling (Mani & Zhang, 2003; He & Garcia, 2009; Maciejewski & Stefanowski, 2011) and reweighting (Kumar et al., 2010; Malisiewicz et al., 2011; Chang et al., 2017) techniques, we focus only on the reweighting approaches that do not change the optimization problem. It is well known that training algorithms on disparate data can lead to algorithmic discrimination (Bolukbasi et al., 2016; Caliskan et al., 2017), and over the years there have been growing efforts to mitigate such biases; see, for example, (Amini et al., 2019; Kamiran & Calders, 2012; Calmon et al., 2017; Zhao et al., 2019; López et al., 2013). We also refer to (Guo et al., 2017; He & Ma, 2013; Krawczyk, 2016) for a comprehensive review of this growing research field. Our approach to understanding the dynamics of resampling and reweighting under SGD is based on tools from numerical analysis for stochastic systems. Connections between numerical analysis and stochastic algorithms have been developing rapidly in recent years. The dynamical stability perspective was used in (Wu et al., 2018) to show the impact of learning rate and batch size on minima selection. The stochastic differential equation (SDE) approach for approximating stochastic optimization methods can be traced through the line of work (Li et al., 2017; 2019; Rotskoff & Vanden-Eijnden, 2018; Shi et al., 2019), to mention just a few.
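The statistical equivalence of the two corrections can be seen in a small numerical sketch. The code below (with made-up proportions and per-group loss values chosen purely for illustration) estimates the same population loss once by reweighting a biased sample and once by resampling it back to the population proportions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: group 1 is 90% of the population but only 50% of
# the biased sample; group 2 is 10% of the population but 50% of the sample.
pop = np.array([0.9, 0.1])    # population proportions a_1, a_2
samp = np.array([0.5, 0.5])   # sampling proportions

groups = rng.choice(2, size=100_000, p=samp)   # biased draws
losses = np.where(groups == 0, 1.0, 3.0)       # illustrative per-group losses

# Reweighting: multiply each sample's loss by (population / sampling) ratio.
w = (pop / samp)[groups]
reweighted = np.mean(w * losses)

# Resampling: redraw indices with probability proportional to the weights,
# so that group frequencies match the population proportions.
idx = rng.choice(len(groups), size=len(groups), p=w / w.sum())
resampled = np.mean(losses[idx])

print(reweighted, resampled)  # both ≈ 0.9·1.0 + 0.1·3.0 = 1.2
```

Both estimators target the same population loss; the difference studied in this paper only appears once they are combined with stochastic gradient updates.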

2. PROBLEM SETUP

Let us consider a population comprised of two groups, where a proportion a_1 of the population belongs to the first group and the rest, with proportion a_2 = 1 - a_1, belongs to the second (i.e., a_1, a_2 > 0 and a_1 + a_2 = 1). In what follows, we shall call a_1 and a_2 the population proportions. Consider an optimization problem for this population over a parameter θ. For simplicity, we assume that each individual from the first group experiences a loss function V_1(θ), while each individual from the second group has a loss function of type V_2(θ). Here the loss function V_1(θ) is assumed to be identical across all members of the first group, and likewise for V_2(θ) across the second group; it is possible, however, to extend the formulation to allow for loss function variation within each group. Based on this setup, the minimization problem over the whole population is to find

θ* = argmin_θ V(θ),   where   V(θ) ≡ a_1 V_1(θ) + a_2 V_2(θ).   (1)

For a given set Ω of N individuals sampled uniformly from the population, the empirical minimization problem is

θ* = argmin_θ (1/N) Σ_{r∈Ω} V_{i_r}(θ),   (2)

where i_r ∈ {1, 2} denotes the group to which individual r belongs. As N grows, the empirical loss in (2) is consistent with the population loss in (1), since approximately an a_1 fraction of the samples come from the first group and an a_2 fraction from the second.
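A minimal concrete instance of (1) may help fix ideas. The quadratic group losses and proportions below are assumptions for illustration only (they are not from the paper): take V_1(θ) = (θ − 1)², V_2(θ) = (θ + 1)², a_1 = 0.8, a_2 = 0.2, so the population minimizer is θ* = a_1 − a_2 = 0.6. Running SGD with groups drawn at their population proportions (i.e., an unbiased or correctly resampled stream) recovers it:

```python
import numpy as np

rng = np.random.default_rng(1)

a1, a2 = 0.8, 0.2                     # population proportions (assumed)
grads = (lambda t: 2.0 * (t - 1.0),   # ∇V_1 for V_1(θ) = (θ - 1)^2
         lambda t: 2.0 * (t + 1.0))   # ∇V_2 for V_2(θ) = (θ + 1)^2
theta_star = a1 - a2                  # minimizer of a1*V1 + a2*V2, here 0.6

theta, lr = 0.0, 0.05
tail = []                             # late iterates, averaged to reduce SGD noise
for k in range(5000):
    i = rng.choice(2, p=[a1, a2])     # draw a group at its population rate
    theta -= lr * grads[i](theta)     # single-sample gradient step
    if k >= 3000:
        tail.append(theta)

theta_bar = float(np.mean(tail))
print(theta_bar)  # ≈ theta_star = 0.6
```

The paper's central question concerns the biased regime: when the stream is drawn at the wrong proportions, correcting it by resampling versus by reweighting the gradients leads to different SGD dynamics even though both target the same objective (1).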

