ON THE IMPACT OF MACHINE LEARNING RANDOMNESS ON GROUP FAIRNESS

Anonymous

Abstract

Statistical measures for group fairness in machine learning reflect the gap in performance of algorithms across different groups. These measures, however, exhibit high variance between different training instances, which makes them unreliable for the empirical evaluation of fairness. What causes this variance, and how can we reduce it? We investigate the impact of different sources of randomness in machine learning on group fairness. We show that the variance in group fairness measures is mainly due to the high volatility of the learning process on under-represented groups, which itself is largely caused by the stochasticity of data order during training. Based on these findings, we show how to manipulate group-level accuracy (i.e., model fairness) by changing the data order, with high efficiency and negligible impact on the overall predictive power of the model.

1. INTRODUCTION

Machine learning models have been shown to manifest and escalate historical prejudices and biases present in their training data (Crawford, 2013; Barocas & Selbst, 2016; Zhao et al., 2017; Abbasi et al., 2019). Understanding these biases and the ethical considerations that follow has led to the rise of fair machine learning research (Chouldechova & Roth, 2018; Caton & Haas, 2020; Mehrabi et al., 2021). Recent work in fair deep learning has observed a trend of high variance in fairness measures across multiple training runs (Qian et al., 2021; Amir et al., 2021; Sellam et al., 2021; Soares et al., 2022), usually attributed to non-determinism in training. These results have challenged the legitimacy of existing claims in the literature (Soares et al., 2022), and have even disputed the effectiveness of various bias mitigation techniques (Amir et al., 2021; Sellam et al., 2021). Thus, reliably extracting fairness trends of a model requires accounting for this high variance to avoid lottery winners (see Figure 1). The naive solution of executing a large number of identical training runs to capture the overall variance creates a huge computational demand, and the increased resource requirements discourage the examination of biases in several rapidly growing forefronts of machine learning research. Therefore, understanding the actual cause of the high variance in fairness measures is critical.

To the best of our knowledge, we are the first to study fairness variance beyond trivially executing multiple identical training runs. More specifically, we show the following:

• We show that the trends of fairness variance observed in the literature are dominated by random data reshuffling during training, which causes high fairness variance between epochs even within a single training run, while the non-determinism in weight initialization has minimal influence.
• We extract an empirical relationship between group representation and instability in group-level performance, highlighting the higher vulnerability of minority groups to changes in model behavior. Our results attribute the high fairness variance to lower prediction stability for under-represented subgroups.

2. EXPERIMENTAL SETUP

Datasets and Models. We conduct our investigation on the ACSIncome and ACSEmployment tasks of the Folktables dataset (Ding et al., 2021), and on binary classification of the 'smiling' label in the CelebA dataset (Liu et al., 2015), with gender (Female vs. Male) as the sensitive attribute for all datasets. For CelebA, input features are obtained by passing the input image through a pre-trained and frozen ResNet-50 backbone and extracting the output feature vector after the final average pooling layer. More details on the datasets are provided in Appendix A. Unless specified otherwise, we train a feed-forward network on the input features, with a single hidden layer of 64 neurons and ReLU activation, using the cross-entropy (CE) loss objective for a total of 300 epochs at batch size 128. We include additional experiments varying the training hyperparameters, i.e., batch size, learning rate, and model architecture, in Appendix E.4. We use a train-val-test split of 0.7 : 0.1 : 0.2 and maintain the same split throughout all our experiments, i.e., we do not consider potential non-determinism introduced by changing train-val-test splits. All our evaluations are performed on the test split. We focus primarily on the ACSIncome dataset in the main text; additional experiments on CelebA and ACSEmployment are included in the appendix.

Fairness Metrics and Variance. Fairness in machine learning has been interpreted into a multitude of definitions. In this paper, we focus on average odds (AO), empirically interpreted as the average disparity between the true positive rates and false positive rates of the groups. We also include additional results for equalized opportunity (EOpp) and disparate impact (DI) in Appendix F.3. For model predictions R, true labels Y, and sensitive attribute A, average odds is defined as

AverageOdds := (1/2) Σ_{y ∈ {0,1}} |P(R = 1 | Y = y, A = 0) − P(R = 1 | Y = y, A = 1)|.
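The average odds definition above can be computed directly from model outputs; a minimal NumPy sketch follows (the function name and array-based interface are our own, not the paper's code):

```python
import numpy as np

def average_odds(preds, labels, groups):
    """Average odds: mean absolute gap between the two groups
    (A = 0 vs. A = 1) in true positive rate (y = 1) and false
    positive rate (y = 0), as defined in the text."""
    preds, labels, groups = map(np.asarray, (preds, labels, groups))
    gap = 0.0
    for y in (0, 1):
        rates = []
        for a in (0, 1):
            mask = (labels == y) & (groups == a)
            # P(R = 1 | Y = y, A = a), estimated on the test split
            rates.append(preds[mask].mean())
        gap += abs(rates[0] - rates[1])
    return gap / 2.0
```

For example, if group 0 has TPR 1.0 and FPR 0.0 while group 1 has TPR 0.5 and FPR 0.5, the function returns 0.5, i.e., the mean of the two absolute gaps.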

Non-Determinism in Model Training

In this work, we focus on randomness due to stochasticity in the training algorithm, and we set manual seeds at various intermediate locations in our code to control this randomness. We refer to the seed set right before building the neural model as the weight initialization seed.
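The key point is that the two sources of randomness can be seeded independently, so each can be varied while the other is held fixed. A framework-agnostic NumPy sketch of this decoupling (the actual experiments would use the training framework's own RNGs; function names here are ours):

```python
import numpy as np

def init_weights(weight_seed, shape):
    # Seed set right before building the model: controls only the
    # initial weights, regardless of how the data is later shuffled.
    rng = np.random.default_rng(weight_seed)
    return rng.normal(scale=0.1, size=shape)

def epoch_order(data_seed, epoch, n):
    # Seed controlling only the reshuffling of the n training points;
    # a fresh permutation is drawn for every epoch.
    rng = np.random.default_rng((data_seed, epoch))
    return rng.permutation(n)
```

Fixing `weight_seed` while varying `data_seed` (or vice versa) isolates the contribution of each source to the fairness variance.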



Figure 1: (a) Fairness (average odds) has high variance across identical runs due to non-determinism in training. This variance persists even with several state-of-the-art bias mitigation algorithms (Reweighing (Kamiran & Calders, 2012); Equalized Odds Loss (Fukuchi et al., 2020); FairBatch (Roh et al., 2020)); because the ranges intersect, a reliable comparison between these methods can only be made after capturing the overall model behavior. (b) The performance score (F1 score), however, has a significantly smaller range of variance and does not face similar issues.

At the heart of our work is the study of fairness variance across model checkpoints. Unless otherwise specified, variance across multiple training runs refers to the variance across the final checkpoints at epoch 300 of each training run. Similarly, variance across epochs refers to the variance within a single training run across the checkpoints at the end of every epoch over the last 200 epochs of training (i.e., from epoch 100 to epoch 300). We make this choice because the model has converged to stable accuracy scores before epoch 100 (refer to the training curve in Appendix A for more details).
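The two notions of variance differ only in which set of checkpoints the fairness measure is evaluated on; a small sketch of the across-epochs case, with the burn-in window described above (names and interface are ours):

```python
import numpy as np

def epoch_variance(ao_by_epoch, burn_in=100):
    """Variance of a fairness measure (e.g., average odds) across the
    per-epoch checkpoints of a single training run, discarding the
    first `burn_in` epochs, before which the model is assumed not yet
    converged (epoch 100 in the text)."""
    tail = np.asarray(ao_by_epoch, dtype=float)[burn_in:]
    return tail.var()
```

The across-runs variance is the same computation applied to one final-checkpoint score per run; the proxy claim of the paper is that the former tracks the latter at a fraction of the cost.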

• We demonstrate an immediate dominance of the data order on model fairness. A model's fairness is predictable based on only the most recent training points, irrespective of preceding model behavior.
• Based on this information, we propose to use the fairness variance across epochs as a proxy to study the changing model fairness across multiple training runs, thus reducing the computational requirements by a significant margin.
• Finally, we manipulate group-level performances (i.e., model fairness) by changing the data order, with a relatively minor impact on the overall accuracy. This manipulation can improve fairness as well as reverse the effects of several bias mitigation algorithms within a single training epoch.
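Because the most recent training points dominate group-level behavior, a data-order manipulation can be as simple as controlling which group's examples appear last in the epoch. An illustrative sketch (our own simplification, not the paper's exact procedure):

```python
import numpy as np

def biased_order(groups, favored_group, rng):
    """Return an index order for one training epoch that places the
    favored group's examples in the final batches, exploiting the
    dominance of the most recent training points on group-level
    performance. Within each part, the order stays shuffled, so the
    overall accuracy is only mildly affected."""
    groups = np.asarray(groups)
    idx_other = np.flatnonzero(groups != favored_group)
    idx_fav = np.flatnonzero(groups == favored_group)
    rng.shuffle(idx_other)
    rng.shuffle(idx_fav)
    return np.concatenate([idx_other, idx_fav])
```

Training the final epoch in this order would be expected, under the paper's findings, to shift group-level accuracy toward `favored_group`; placing the under-represented group last is the fairness-improving direction, and the reverse ordering the fairness-degrading one.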

