QUANTIFYING AND MITIGATING THE IMPACT OF LABEL ERRORS ON MODEL DISPARITY METRICS

Abstract

Errors in labels obtained via human annotation adversely affect a model's performance. Existing approaches propose ways to mitigate the effect of label error on a model's downstream accuracy, yet little is known about its impact on a model's disparity metrics. Here we study the effect of label error on a model's disparity metrics. We empirically characterize how varying levels of label error, in both training and test data, affect these disparity metrics. We find that group calibration and other metrics are sensitive to train-time and test-time label error, particularly for minority groups. This disparate effect persists even for models trained with noise-aware algorithms. To mitigate the impact of training-time label error, we present an approach to estimate the influence of a training input's label on a model's group disparity metric. We empirically assess the proposed approach on a variety of datasets and find significant improvement, compared to alternative approaches, in identifying training inputs that improve a model's disparity metric. We complement the approach with an automatic relabel-and-finetune scheme that produces updated models with, provably, improved group calibration error.

1. INTRODUCTION

Label error (noise), i.e., mistakes associated with the label assigned to a data point, is a pervasive problem in machine learning (Northcutt et al., 2021). For example, 30 percent of a random 1000 samples from the Google Emotions dataset (Demszky et al., 2020) had label errors (Chen, 2022). Similarly, an analysis of the MS COCO dataset found that up to 37 percent (273,834 errors) of all annotations are erroneous (Murdoch, 2022). Yet, little is known about the effect of label error on a model's group-based disparity metrics like equalized odds (Hardt et al., 2016), group calibration (Pleiss et al., 2017), and false positive rate (Barocas et al., 2019). It is now common practice to conduct 'fairness' audits (see: Buolamwini and Gebru, 2018; Raji and Buolamwini, 2019; Bakalar et al., 2021) of a model's predictions to identify data subgroups where the model underperforms. Label error in the test data used to conduct a fairness audit renders the results unreliable. Similarly, label error in the training data, especially if the error is systematically more prevalent in certain groups, can lead to models that associate erroneous labels with such groups. The reliability of a fairness audit rests on the assumption that labels are accurate; yet, the sensitivity of a model's disparity metrics to label error is still poorly understood. To this end, we ask: what is the effect of label error on a model's disparity metrics? We address this high-level question in two parts, considering label error in the test data and in the training data, and make two broad contributions.

Empirical Sensitivity Tests. We assess the sensitivity of model disparity metrics to label error with a label-flipping experiment. First, we iteratively flip the labels of samples in the test set, for a fixed model, and then measure the corresponding change in the model's disparity metrics compared to an unflipped test set.
Second, we fix the test set used for the fairness audit but flip the labels of a proportion of the training samples. We then measure the change in the disparity metrics for a model trained on the data with flipped labels. We perform these tests across several dataset and model combinations.

Training-Point Influence on Disparity Metrics. We propose an approach, based on a modification of the influence of a training example on a test example's loss, to identify training points whose labels have an undue effect on any disparity metric of interest on the test set. We empirically assess the proposed approach on a variety of datasets and find a 10-40% improvement, compared to alternative approaches that focus solely on the model's loss, in identifying training inputs that improve a model's disparity metric.
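The two sensitivity tests above can be sketched end-to-end on synthetic data. This is a minimal illustration, not the paper's experimental code: the false-positive-rate gap stands in for whichever disparity metric is audited, and the data, flip fraction, and helper names are ours.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)


def flip_labels(y, fraction, rng):
    """Flip a random fraction of binary labels (0 <-> 1)."""
    y = y.copy()
    idx = rng.choice(len(y), size=int(fraction * len(y)), replace=False)
    y[idx] = 1 - y[idx]
    return y


def fpr_gap(model, X, y, groups):
    """Absolute difference in false positive rate between the two groups."""
    preds = model.predict(X)
    fprs = []
    for g in (0, 1):
        mask = (groups == g) & (y == 0)  # negatives within group g
        fprs.append(preds[mask].mean())
    return abs(fprs[0] - fprs[1])


# Synthetic data: features, a binary group identifier, a binary label.
X = rng.normal(size=(2000, 5))
groups = rng.integers(0, 2, size=2000)
y = (X[:, 0] + 0.5 * groups + rng.normal(scale=0.5, size=2000) > 0).astype(int)

X_tr, X_te = X[:1000], X[1000:]
y_tr, y_te = y[:1000], y[1000:]
g_tr, g_te = groups[:1000], groups[1000:]

model = LogisticRegression().fit(X_tr, y_tr)
baseline = fpr_gap(model, X_te, y_te, g_te)

# Test-time sensitivity: same model, corrupted audit labels.
test_gap = fpr_gap(model, X_te, flip_labels(y_te, 0.2, rng), g_te)

# Train-time sensitivity: retrain on corrupted labels, clean audit set.
noisy_model = LogisticRegression().fit(X_tr, flip_labels(y_tr, 0.2, rng))
train_gap = fpr_gap(noisy_model, X_te, y_te, g_te)
```

Comparing `test_gap` and `train_gap` against `baseline` over a sweep of flip fractions (and over group-conditional flipping) gives the sensitivity curves described above.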

2. SETUP & BACKGROUND

In this section, we discuss notation and set the stage for our contributions by discussing the disparity metrics that we focus on. We also provide an overview of the datasets and models used in the experimental portions of the paper.

Overview of Notation. We consider prediction problems, i.e., settings where the task is to learn a mapping, θ : X × A → Y, where X ⊆ R^d is the feature space, Y = {0, 1} is the output space, and A is a group identifier that partitions the population into disjoint sets, e.g., race or gender. We represent the tuple (x_i, a_i, y_i) as z_i; consequently, the n training points can be written as {z_i}_{i=1}^n. Throughout this work, we only consider learning via empirical risk minimization (ERM), which corresponds to: θ̂ := argmin_{θ ∈ Θ} (1/n) Σ_{i=1}^n ℓ(z_i, θ). Similar to Koh and Liang (2017), we assume that the ERM objective is twice-differentiable and strictly convex in the parameters. We focus on binary classification tasks; however, our analysis can be easily generalized.

Disparity Metrics. We define a group disparity metric to be a function, GD, that gives a performance score given a model's probabilistic predictions (θ outputs the probability of belonging to the positive class) and 'ground-truth' labels. We consider the following metrics (we refer readers to the Appendix for a detailed overview of these metrics):
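As one concrete instance of such a function GD, group calibration can be computed as the expected calibration error (ECE) within each group. The sketch below assumes binary labels and a fixed ten-bin partition of the probability interval; it is an illustration of the metric, not the paper's implementation.

```python
import numpy as np


def ece(probs, labels, n_bins=10):
    """Expected calibration error: bin predictions by confidence and
    average |mean confidence - empirical accuracy|, weighted by bin mass."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    err = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        if hi < 1.0:
            mask = (probs >= lo) & (probs < hi)
        else:  # include 1.0 in the last bin
            mask = (probs >= lo) & (probs <= hi)
        if mask.any():
            err += mask.mean() * abs(probs[mask].mean() - labels[mask].mean())
    return err


def group_calibration(probs, labels, groups):
    """ECE computed separately for each group; a fairness audit
    compares these per-group values."""
    return {g: ece(probs[groups == g], labels[groups == g])
            for g in np.unique(groups)}
```

Flipping test labels changes `labels[mask].mean()` inside each confidence bin, which is why group calibration is directly sensitive to test-time label error, and more so for a minority group whose bins contain fewer points.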



Group-based disparity metrics like subgroup calibration, false positive rate, false negative rate, equalized odds, and equal opportunity are more often known, colloquially, as fairness metrics in the literature. We use the term group-based disparity metrics in this work. We refer readers to the longer version of this work on arXiv. Code to replicate our findings is available at: https://github.com/adebayoj/influencedisparity
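For a model meeting the ERM assumptions above (e.g., an L2-regularized logistic regression), the influence of upweighting each training point on a differentiable disparity metric GD can be approximated as -∇GD(θ̂)ᵀ H⁻¹ ∇ℓ(z_i, θ̂), in the spirit of Koh and Liang (2017). The sketch below is illustrative rather than the paper's exact method: `metric_grad` (the gradient of the chosen disparity metric with respect to the parameters) is assumed to be supplied, and all function names are ours.

```python
import numpy as np


def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))


def fit_logreg(X, y, l2=1e-2, iters=500, lr=0.5):
    """Plain gradient descent on the L2-regularized logistic loss."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = sigmoid(X @ w)
        w -= lr * (X.T @ (p - y) / len(y) + l2 * w)
    return w


def hessian(X, w, l2=1e-2):
    """Hessian of the regularized logistic loss at w."""
    p = sigmoid(X @ w)
    return (X * (p * (1 - p))[:, None]).T @ X / len(X) + l2 * np.eye(X.shape[1])


def loss_grad(x, y, w):
    """Per-example gradient of the logistic loss."""
    return (sigmoid(x @ w) - y) * x


def influence_on_metric(metric_grad, X_tr, y_tr, w, l2=1e-2):
    """Approximate effect of upweighting each training point z_i on a
    differentiable disparity metric: -grad(GD)^T H^{-1} grad(loss(z_i))."""
    H_inv_g = np.linalg.solve(hessian(X_tr, w, l2), metric_grad)
    return np.array([-loss_grad(x, y, w) @ H_inv_g
                     for x, y in zip(X_tr, y_tr)])
```

Ranking training points by this score surfaces candidates whose labels most hurt the audited metric; relabeling the top-ranked points and fine-tuning is the spirit of the relabel-and-finetune scheme mentioned in the abstract.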



Figure 1: A schematic of the test-time and train-time empirical sensitivity tests. Here we show the model training and fairness audit pipeline. Our proposed sensitivity tests capture the effect of label error, at both stages, on the disparity metric. In the test-time sensitivity test, we flip the labels of a portion of the test data and compare the resulting disparity metric (group calibration, for example) to that of a standard model evaluated on unflipped test labels. In the train-time sensitivity test, we flip the labels of a portion of the training set and measure the change in the disparity metric relative to a standard model.

