SIMPLICITY BIAS LEADS TO AMPLIFIED PERFORMANCE DISPARITIES

Abstract

The simple idea that not all things are equally difficult has surprising implications when applied in a fairness context. In this work we explore how "difficulty" is model-specific, such that different models find different parts of a dataset challenging. When difficulty correlates with group information, we term this difficulty disparity. Drawing a connection with recent work exploring the inductive bias towards simplicity of SGD-trained models, we show that when such a disparity exists, it is further amplified by commonly-used models. We quantify this amplification factor across a range of settings aiming towards a fuller understanding of the role of model bias. We also present a challenge to the simplifying assumption that "fixing" a dataset is sufficient to ensure unbiased performance.

1. INTRODUCTION

Without actually training, understanding what a model will find challenging is far from trivial. A dataset may be hard for one model but not for another (Wolpert & Macready, 1997). For one model, two classes may be easily separable, while for another they may be hard to distinguish. It follows naturally that "difficulty" is a function of both data and model, so we cannot properly account for difficulty by analyzing the dataset alone. In the context of fairness in machine learning, for a given task, a data-model pair may be more difficult for one social group than another, leading to disparate impact (Barocas & Selbst, 2016). For example, Buolamwini & Gebru's 2018 audit of commercial image recognition systems finds that they exhibit worse accuracy for darker-skinned women than for any other group.

Typically, accuracy disparity of this kind is attributed to either under-representation of certain groups or spurious correlations between group information and the variable of interest. In this work, we further show that trained models can find certain groups harder than others, even with perfectly balanced data and in the absence of correlations between group labels and class labels. Crucially, group difficulty is not always predictable from a dataset audit, providing key evidence for the necessity of a complementary post-training model audit.

Having identified model-specific disparities in the post-dataset-audit setting, we turn to the role of the model itself. We show that the implicit bias of certain model classes towards simple functions (Arpit et al., 2017; Kalimeris et al., 2019; Rahaman et al., 2019; Valle-Perez et al., 2019; Shah et al., 2020) further amplifies disparity: when a model finds one group easier than another, its bias towards the easy group leads to greater-than-expected performance disparity after training (see fig. 1). We show that difficulty amplification is highly sensitive to model architecture, training time, and parameter count.
Seemingly innocuous design decisions, such as whether to use early stopping, can have a significant impact on the amount of amplification and consequently the performance disparity.

Contributions

1. We identify difficulty disparity, a pervasive phenomenon that persists in the post-dataset-audit setting: using data with perfect representation and without spurious correlations.
2. We introduce the difficulty amplification factor to quantify how much a model exacerbates difficulty disparity.
3. We empirically evaluate how choices including model architecture, training time, and parameter count impact difficulty amplification.

Paper structure. In § 2 we provide background and related work. In § 3 we evaluate the variability of dataset difficulty across models. In § 4 we formalize difficulty disparity and amplification and design a synthetic task to isolate it. In § 5 we show how factors such as model architecture, scale, training time, and regularization impact amplification. After a real-world example using Dollar Street in § 6, we discuss the fairness implications of our work in § 7 before concluding in § 8.
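To make the second contribution concrete, a minimal sketch of the amplification computation is given below. The function names and the exact ratio-based definition are illustrative assumptions, not the paper's formal treatment (which appears in § 4): the idea is simply to compare the disparity observed after joint training against the disparity estimated from training on each group alone.

```python
# Hypothetical sketch of a difficulty-amplification computation. The
# ratio-based definition here is an illustrative assumption, not the
# paper's formal definition from Section 4.

def disparity(acc_easy: float, acc_hard: float) -> float:
    """Accuracy gap between the easier and the harder group."""
    return acc_easy - acc_hard

def amplification_factor(est_easy: float, est_hard: float,
                         obs_easy: float, obs_hard: float) -> float:
    """Ratio of the observed disparity (joint training) to the estimated
    disparity (per-group training). Values > 1 indicate amplification."""
    d_est = disparity(est_easy, est_hard)
    d_obs = disparity(obs_easy, obs_hard)
    return d_obs / d_est

# Per-group training suggests a 5-point accuracy gap; joint training
# yields a 15-point gap: the model amplified the disparity threefold.
factor = amplification_factor(0.95, 0.90, 0.94, 0.79)
print(round(factor, 2))  # 3.0
```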

2. BACKGROUND AND RELATED WORK

2.1. BIASED DATASETS AND BIASED MODELS

Bias in ML systems arises from many sources. At the most basic level, a dataset itself is biased if certain groups are under-represented (Stock & Cisse, 2018; Hendricks et al., 2018; Yang et al., 2020; Menon et al., 2021). Proposals to rectify under-representation include actively collecting more data for marginalized groups (Dutta et al., 2020), under/oversampling or reweighting during training (Byrd & Lipton, 2019; Sagawa et al., 2020; Idrissi et al., 2022; Arjovsky et al., 2022), and optimizing for worst-group (as opposed to average) accuracy (Sagawa et al., 2019). Other recent work has suggested fine-tuning on an explicitly balanced set (Kirichenko et al., 2022). Alternatively, datasets can reinforce harmful associations (Goyal et al., 2022b), both due to sampling error and by inadvertently capturing an undesirable association that is present in society. These associations are often bucketed as "spurious correlations" (Muthukumar et al., 2018; Wang et al., 2019; Sagawa et al., 2020) or "shortcuts" (Geirhos et al., 2020), and a large body of fairness work seeks to train models that learn the true function invariant to the spuriously correlated feature. Both under-representation and spurious correlations dominate the fairness literature landscape, though in both cases the onus is squarely on the data.
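The reweighting family of mitigations mentioned above can be sketched as follows. This is a generic inverse-frequency scheme under our own assumed group encoding, not the specific method of any of the cited papers: each example is weighted so that every group contributes the same total weight to the training loss.

```python
import numpy as np

# Sketch of inverse-frequency reweighting for under-represented groups.
# The integer group encoding is an illustrative assumption; the cited
# works differ in how weights are defined and applied.

def group_weights(groups: np.ndarray) -> np.ndarray:
    """Per-example weights so each group contributes equal total weight."""
    values, counts = np.unique(groups, return_counts=True)
    # weight for group g: N / (num_groups * count_g)
    per_group = {g: len(groups) / (len(values) * c)
                 for g, c in zip(values, counts)}
    return np.array([per_group[g] for g in groups])

groups = np.array([0, 0, 0, 1])   # group 1 is under-represented
w = group_weights(groups)
# After reweighting, each group's total weight is equal (2.0 each):
print(w[groups == 0].sum(), w[groups == 1].sum())
```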

2.2. BEYOND SPURIOUS CORRELATIONS

There is increasing focus on the model itself, independent of the role of data (Hooker, 2021), of which bias amplification is a prime example (Zhao et al., 2017; Wang & Russakovsky, 2021). Here, a small correlation in the training set is amplified into a larger correlation at test time: models do not merely replicate the bias in the data, but exacerbate it. In empirical experiments evaluating bias amplification, Hall et al. (2022) suggest that amplification is greater when group membership is easier to identify than class membership.
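The bias-amplification measurement described above can be sketched as a simple co-occurrence comparison, in the spirit of Zhao et al. (2017): how strongly does a group co-occur with a class in the model's predictions versus in the training labels? The variable names and the toy data below are illustrative assumptions, not the cited papers' exact metric.

```python
# Minimal sketch of a bias-amplification measurement: compare how
# strongly a group co-occurs with a class in the training labels versus
# in the model's predictions. Names and toy data are illustrative.

def cooccurrence_rate(classes, groups, cls, grp):
    """Empirical P(group = grp | class = cls) over the given labels."""
    in_class = [g for c, g in zip(classes, groups) if c == cls]
    return sum(g == grp for g in in_class) / len(in_class)

def bias_amplification(train_classes, train_groups,
                       pred_classes, pred_groups, cls, grp):
    """Positive values: predictions correlate group with class more
    strongly than the training data did."""
    return (cooccurrence_rate(pred_classes, pred_groups, cls, grp)
            - cooccurrence_rate(train_classes, train_groups, cls, grp))

# Training labels pair class "cooking" with group "A" 60% of the time;
# the model's predictions push that association to 80%.
train_c, train_g = ["cooking"] * 10, ["A"] * 6 + ["B"] * 4
pred_c, pred_g = ["cooking"] * 10, ["A"] * 8 + ["B"] * 2
amp = bias_amplification(train_c, train_g, pred_c, pred_g, "cooking", "A")
print(round(amp, 2))  # 0.2
```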




Figure 1: What is difficulty amplification? (a) Consider a binary classification of circles and triangles. Above y = 0 (light gray background) we have a simple group which is linearly separable. Below y = 0 (dark gray) we have a more complex group with a non-linear decision boundary. (b) Illustration of test accuracy when training on the simple group only (light gray) and on the complex group only (dark gray). As expected, we obtain better accuracy on the simple group. (c) However, when training on both groups at once, our model exacerbates the difference: the observed accuracy disparity d (height of pink area) exceeds the accuracy disparity d̂ estimated from individual-group training (height of green area). When d > d̂, we call this difficulty amplification.
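A Figure-1-style synthetic task can be generated along the following lines. This is a hypothetical sketch: the paper's actual construction appears in § 4, and the particular non-linear boundary used here (a sine curve) is our own illustrative choice. Points above y = 0 form the simple, linearly separable group; points below form the complex group.

```python
import numpy as np

# Hypothetical generator for a Figure-1-style two-group task. The sine
# boundary for the complex group is an illustrative assumption, not the
# construction from the paper's Section 4.

def make_two_group_task(n: int, rng: np.random.Generator):
    """Return points, binary class labels, and a simple-group mask."""
    x = rng.uniform(-1.0, 1.0, size=(n, 2))
    simple = x[:, 1] > 0                      # group: above / below y = 0
    labels = np.empty(n, dtype=int)
    # simple group: class given by a linear boundary (x > 0)
    labels[simple] = (x[simple, 0] > 0).astype(int)
    # complex group: class given by a non-linear (sine) boundary
    hard = ~simple
    labels[hard] = (x[hard, 1] + 0.5
                    > 0.3 * np.sin(6 * x[hard, 0])).astype(int)
    return x, labels, simple

x, y, simple = make_two_group_task(1000, np.random.default_rng(0))
print(simple.mean())  # roughly half the points fall in each group
```

Training a model on each group alone, then on both together, and comparing the resulting per-group accuracy gaps reproduces the d versus d̂ comparison illustrated in panels (b) and (c).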

