MEDFAIR: BENCHMARKING FAIRNESS FOR MEDICAL IMAGING

Abstract

A multitude of work has shown that machine learning-based medical diagnosis systems can be biased against certain subgroups of people. This has motivated a growing number of bias mitigation algorithms that aim to address fairness issues in machine learning. However, it is difficult to compare their effectiveness in medical imaging for two reasons. First, there is little consensus on the criteria to assess fairness. Second, existing bias mitigation algorithms are developed under different settings, e.g., datasets, model selection strategies, backbones, and fairness metrics, making a direct comparison and evaluation based on existing results impossible. In this work, we introduce MEDFAIR, a framework to benchmark the fairness of machine learning models for medical imaging. MEDFAIR covers eleven algorithms from various categories, ten datasets from different imaging modalities, and three model selection criteria. Through extensive experiments, we find that the under-studied issue of model selection criterion can have a significant impact on fairness outcomes; while in contrast, state-of-the-art bias mitigation algorithms do not significantly improve fairness outcomes over empirical risk minimization (ERM) in both in-distribution and out-of-distribution settings. We evaluate fairness from various perspectives and make recommendations for different medical application scenarios that require different ethical principles. Our framework provides a reproducible and easy-to-use entry point for the development and evaluation of future bias mitigation algorithms in deep learning. Code is available at https://github.com/ys-zong/MEDFAIR.

1. INTRODUCTION

Machine learning-enabled automatic diagnosis with medical imaging is becoming a vital part of the current healthcare system (Lee et al., 2017). However, machine learning (ML) models have been found to exhibit systematic bias, in the form of worse performance, against certain groups of people defined by race, gender, age, and even health insurance type (Obermeyer et al., 2019; Larrazabal et al., 2020; Spencer et al., 2013; Seyyed-Kalantari et al., 2021). The bias also exists in models trained on different types of medical data, such as chest X-rays (Seyyed-Kalantari et al., 2020), CT scans (Zhou et al., 2021), skin dermatology images (Kinyanjui et al., 2020), etc. A biased decision-making system is socially and ethically detrimental, especially in life-changing scenarios such as healthcare. This has motivated a growing body of work to understand bias and pursue fairness in machine learning and computer vision (Mehrabi et al., 2021; Louppe et al., 2017; Tartaglione et al., 2021; Wang et al., 2020). Informally, given an observed input x (e.g., a skin dermatology image), a sensitive attribute s (e.g., male or female), and a target y (e.g., benign or malignant), the goal of a diagnosis model is to learn a meaningful mapping from x to y. However, ML models may amplify biases and confounding factors related to the sensitive attribute s that already exist in the training data. Examples include data imbalance (e.g., over 90% of individuals in UK Biobank (Sudlow et al., 2015) are of European ancestry), attribute-class imbalance (e.g., in age-related macular degeneration (AMD) datasets, subgroups of older people contain more pathology examples than those of younger people (Farsiu et al., 2014)), and label noise (e.g., Zhang et al. (2022) find that label noise in the CheXpert dataset (Irvin et al., 2019) is much higher in some subgroups than in others).
Bias mitigation algorithms therefore aim to help diagnosis algorithms learn predictive models that are robust to confounding factors related to sensitive attribute s (Mehrabi et al., 2021) . Given the importance of ensuring fairness in medical applications and the special characteristics of medical data, we argue that a systematic and rigorous benchmark is needed to evaluate the bias mitigation algorithms for medical imaging. However, a straightforward comparison of algorithmic fairness for medical imaging is difficult, as there is no consensus on a single metric for fairness of medical imaging models. Group fairness (Dwork et al., 2012; Verma & Rubin, 2018 ) is a popular and intuitive definition adopted by many debiasing algorithms, which optimises for equal performance among subgroups. However, this can lead to a trade-off of increasing fairness by decreasing the performance of the advantaged group, reducing overall utility substantially. Doing so may violate the ethical principles of beneficence and non-maleficence (Beauchamp, 2003) , especially for some medical applications where all subgroups need to be protected. There are also other fairness definitions, including individual fairness (Dwork et al., 2012) , minimax fairness (Diana et al., 2021) , counterfactual fairness (Kusner et al., 2017) , etc. It is thus important to consider which definition should be used for evaluations. In addition to the use of differing evaluation metrics, different experimental designs used by existing studies prevent direct comparisons between algorithms based on the existing literature. Most obviously, each study tends to use different datasets to evaluate their debiasing algorithms, preventing direct comparisons of results. 
Furthermore, many bias mitigation studies focus on evaluating tabular data with low-capacity models (Madras et al., 2018; Zhao et al., 2019; Diana et al., 2021), and recent analysis has shown that their conclusions do not generalise to the high-capacity deep networks used for the analysis of image data (Zietlow et al., 2022). A crucial but less obvious issue is the choice of model selection strategy for hyper-parameter search and early stopping. Individual bias mitigation studies are divergent or vague about their model selection criteria, leading to inconsistent comparisons even when the same datasets are used. Finally, given the effort required to collect and annotate medical imaging data, models are usually deployed in a different domain than the one used for data collection (e.g., data collected at hospital A is used to train a model deployed at hospital B). While the maintenance of prediction quality across datasets has been well studied, it is unclear whether fairness achieved within one dataset (in-distribution) holds under dataset shift (out-of-distribution). To address these challenges, we provide the first comprehensive fairness benchmark for medical imaging, MEDFAIR. We conduct extensive experiments across eleven algorithms, ten datasets, four sensitive attributes, and three model selection strategies to assess bias mitigation algorithms in both in-distribution and out-of-distribution settings. We report multiple evaluation metrics and conduct rigorous statistical tests to determine whether any algorithm is significantly better than the others. Having trained over 7,000 models using 6,800 GPU-hours, we make the following observations:
• Bias widely exists in ERM models trained on different modalities, reflected in the predictive performance gap between subgroups across multiple metrics.
• Model selection strategies can play an important role in improving worst-case performance. Algorithms should be compared under the same model selection criteria.
• The state-of-the-art methods do not outperform ERM with statistical significance in either the in-distribution or the out-of-distribution setting.
These results show the importance of a large benchmark suite such as MEDFAIR to evaluate progress in the field and to guide practical decisions about the selection of bias mitigation algorithms for deployment. MEDFAIR is released as a reproducible and easy-to-use codebase in which every experiment in this study can be run with a single command. Detailed documentation is provided so that researchers can extend the framework and evaluate the fairness of their own algorithms and datasets, and we will actively maintain the codebase to incorporate more algorithms, datasets, model selection strategies, etc. We hope our codebase can accelerate the development of bias mitigation algorithms and guide the deployment of ML models in clinical scenarios.

2. FAIRNESS IN MEDICINE

2.1. PROBLEM FORMULATION

We focus on evaluating the fairness of binary classification of medical images. Given an image, we predict its diagnosis label in a way that is not confounded by any sensitive attribute (age, sex, race, etc.), so that the trained model is fair and not biased against a certain subgroup of people. Formally, let {D_i}, i = 1, ..., I, be a set of domains, where I is the total number of domains. A domain can represent a dataset collected from a particular imaging modality, hospital, population, etc. Consider a domain D = (X, Y, S) to be a distribution over input samples x ∈ R^d in input space X, corresponding binary labels y ∈ {0, 1} in label space Y, and sensitive attributes s ∈ {0, 1, ..., m-1} with m classes in sensitive space S. We train a model h ∈ H to output the prediction ŷ ∈ {0, 1}, i.e., h : X → Y, where H is the hypothesis class of models. Note that each dataset D_i may have several sensitive attributes at the same time, e.g., metadata for both patients' age and sex; we consider one sensitive attribute at a time. In-distribution Given a domain D_i, assume the input samples X_i, their labels Y_i, and the sensitive attributes S_i are independently and identically distributed (iid) according to a joint probability distribution P_i(X_i, Y_i, S_i). We define evaluation in which training and testing use the same domain D_i as in-distribution, i.e., the training and testing sets come from the same distribution. We train models for each combination of algorithms × datasets × sensitive attributes. Out-of-distribution In clinical scenarios, due to the lack of training data, it is common to deploy a model trained on the original dataset to new hospitals/populations with different data distributions.
We define training on one domain D_i and testing on another, unseen domain D_j as the out-of-distribution setting, where D_i and D_j may have different distributions P_i(X_i, Y_i, S_i) and P_j(X_j, Y_j, S_j). In this case, we assume that domains D_i and D_j share the same input space (e.g., X-ray imaging), diagnosis labels, and sensitive attributes, but differ in their joint distributions due to collection at different locations or with different imaging protocols. We evaluate whether bias mitigation algorithms are robust to distribution shift by directly taking the model selected in the in-distribution setting of domain D_i and testing it on domain D_j.

2.2. FAIRNESS DEFINITION IN MEDICINE

Here we consider the two most salient fairness definitions for healthcare: group fairness and Max-Min fairness. We argue that the appropriate definition depends on the specific clinical application. Group Fairness Metrics based on group fairness usually aim to achieve parity of predictive performance across protected subgroups. For resource allocation problems that can be considered a zero-sum game due to limited resources, e.g., prioritising which patients should be sent to a limited number of intensive care units (ICUs), it is important to consider group fairness to reduce the disparity among subgroups (related discussions in Hellman (2008); Barocas & Selbst (2016)). We measure the gap in diagnosis AUC between the advantaged and disadvantaged subgroups as an indicator of group fairness. This is in line with the "separability" criterion (Chen et al., 2021; Dwork et al., 2012) that algorithm scores should be conditionally independent of the sensitive attribute given the diagnostic label (i.e., Ŷ ⊥ S | Y), which is also adopted by Gardner et al. (2019) and Fong et al. (2021). On the other hand, Zietlow et al. (2022) find that for high-capacity models in computer vision, such parity is typically achieved by worsening the performance of the advantaged group rather than improving that of the disadvantaged group, a phenomenon termed leveling down in philosophy that has drawn numerous criticisms (Christiano & Braynen, 2008; Brown, 2003; Doran, 2001). Worse, practical implementations often end up worsening the performance of both subgroups (Zietlow et al., 2022), making them Pareto inefficient and violating both the beneficence and non-maleficence principles (Beauchamp, 2003). Thus, we argue that group fairness alone is not sufficient to analyse the trade-off between fairness and utility.
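The group-fairness indicator described above, the gap between the maximum and minimum subgroup AUC, is straightforward to compute; a minimal stdlib-only sketch (function names are ours, with AUC computed via its rank-sum formulation) might look like:

```python
from collections import defaultdict

def auc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) formulation:
    the probability that a random positive is scored above a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def auc_gap(scores, labels, groups):
    """Group-fairness indicator used in the paper: max minus min subgroup AUC.
    Returns the gap together with the per-subgroup AUCs."""
    by_group = defaultdict(lambda: ([], []))
    for s, y, g in zip(scores, labels, groups):
        by_group[g][0].append(s)
        by_group[g][1].append(y)
    aucs = {g: auc(sc, ys) for g, (sc, ys) in by_group.items()}
    return max(aucs.values()) - min(aucs.values()), aucs
```

A gap of zero corresponds to perfect separability-style parity; note that, as discussed above, a small gap says nothing about how it was achieved.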

Max-Min Fairness

Max-Min fairness (Lahoti et al., 2020) is another definition, following the Rawlsian max-min principle (Rawls, 2001); it is also studied as minimax group fairness (Diana et al., 2021) or minimax Pareto fairness (Martinez et al., 2020). Instead of seeking to equalize error rates among subgroups, it treats the model with the lower worst-case error rate as the fairer one. It may be a more appropriate definition than group fairness for some medical applications such as diagnosis, as it better satisfies the beneficence and non-maleficence principles (Beauchamp, 2003; Chen et al., 2018; Ustun et al., 2019), i.e., do the best and do no harm. Formally, for a model h in the hypothesis class H, denote by U_s(h) a utility function for subgroup s. A model h* is considered Max-Min fair if it maximizes (Max-) the utility of the worst-case (Min) group:

h* = argmax_{h ∈ H} min_{s ∈ S} U_s(h).   (1)

In practice, it is hard to quantify the maximum optimal utility, so we treat a model h_k as fairer than another model h_t if min_{s ∈ S} U_s(h_k) > min_{s ∈ S} U_s(h_t). We measure both group fairness and Max-Min fairness to give a more comprehensive evaluation of fairness in medical applications.
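The practical comparison in Eq. (1) reduces to comparing worst-case subgroup utilities; a minimal sketch (function names are ours, with each model's utilities given as a subgroup-to-utility dict):

```python
def worst_case_utility(model_utilities):
    """min_s U_s(h): utility of the worst-off subgroup for one model."""
    return min(model_utilities.values())

def maxmin_fairer(utilities_k, utilities_t):
    """True iff model h_k is Max-Min fairer than model h_t, i.e.
    min_s U_s(h_k) > min_s U_s(h_t)."""
    return worst_case_utility(utilities_k) > worst_case_utility(utilities_t)
```

For example, a model with subgroup AUCs {0.8, 0.7} is Max-Min fairer than one with {0.9, 0.6}, even though the latter has the higher overall best case.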

3. MEDFAIR

We implement MEDFAIR, a reproducible and easy-to-use codebase for benchmarking fairness of machine learning algorithms in medical imaging. Our benchmark comprises large-scale experiments spanning ten datasets, eleven algorithms, up to three sensitive attributes per dataset, and three model selection criteria, where all experiments can be run with a single command. We provide source code and detailed documentation, allowing other researchers to reproduce the results and easily incorporate other datasets and algorithms.

3.1. DATASETS

Ten datasets are included in MEDFAIR to evaluate the algorithms comprehensively: CheXpert (Irvin et al., 2019), MIMIC-CXR (Johnson et al., 2019), PAPILA (Kovalyk et al., 2022), HAM10000 (Tschandl et al., 2018), Fitzpatrick17k (Groh et al., 2021), OL3I (Chaves et al., 2021), COVID-CT-MD (Afshar et al., 2021), OCT (Farsiu et al., 2014), ADNI 1.5T, and ADNI 3T (Petersen et al., 2010). All are publicly available to ensure reproducibility. We consider five important aspects during dataset selection. Imaging modalities. We select datasets covering various 2D and 3D imaging modalities, including X-ray, fundus photography, computed tomography (CT), magnetic resonance imaging (MRI), spectral-domain optical coherence tomography (SD-OCT), and skin dermatology images. Potential sources of bias. We include datasets that may introduce bias from different sources, including label noise, data/class imbalance, spurious correlation, etc. Note that each dataset may contain more than one source of bias.

Sensitive attributes.

The selected datasets contain attributes that are commonly treated as sensitive and may be subject to bias in clinical practice, including age, sex, race, and skin type.

3.2. ALGORITHMS

• Adversarial Learning
- Learning Adversarially Fair and Transferable Representations (LAFTR) (Madras et al., 2018) de-biases the representation by minimizing the ability to recognize sensitive attributes.
- Conditional learning of Fair representations (CFair) (Zhao et al., 2019) enforces balanced error rates and conditional alignment of representations during training.
- Learning Not to Learn (LNL) (Kim et al., 2019) unlearns bias information iteratively by minimizing the mutual information between the feature representation and the bias.
• Disentanglement
- Entangle and Disentangle (EnD) (Tartaglione et al., 2021) disentangles confounders by inserting an "information bottleneck" while still passing through useful information.
- Orthogonal Disentangled Representations (ODR) (Sarhan et al., 2020) disentangles the useful and sensitive representations by enforcing orthogonality constraints for independence.
• Domain Generalization (DG)
- Group Distributionally Robust Optimization (GroupDRO) (Sagawa et al., 2019) minimizes the worst-case training loss with increased regularization.
- Stochastic Weight Averaging Densely (SWAD) (Cha et al., 2021), a state-of-the-art method in DG, aims to find robust flat minima via a dense stochastic weight sampling strategy.
- Sharpness-Aware Minimization (SAM) (Foret et al., 2020) seeks parameters that lie in neighborhoods with uniformly low loss during optimization.
The hyper-parameter tuning strategy is described in Appendix B.2.2.

3.3. MODEL SELECTION

The trade-off between fairness and utility has been widely noted (Kleinberg et al., 2016; Zhang et al., 2022), making hyper-parameter selection criteria particularly difficult to define, given the multi-objective nature of optimising for potentially conflicting fairness and utility. Previous work differs greatly in model selection: some use conventional utility-based selection strategies, e.g., overall validation loss, while others give no explicit specification. We provide a summary of model selection strategies across the literature in Table A1. To investigate the influence of model selection strategies on final performance, we study three prevalent selection strategies in MEDFAIR. Overall Performance-based Selection This is one of the most basic and common strategies for model selection. It picks the model with the smallest loss value or highest accuracy/AUC on the validation set over all sub-populations. However, this strategy tends to select the model that performs best on the majority group in order to achieve the best overall performance, potentially leading to a large performance gap among subgroups, as illustrated by the red pentagon on the right side of Figure 2 (note that it is not necessarily Pareto optimal). Minimax Pareto-based Selection The Pareto front is a classic concept in multi-objective optimisation (Mas-Colell et al., 1995) that has been utilized in fair machine learning to study the trade-off among subgroup accuracies (Martinez et al., 2020). Intuitively, for a model on the Pareto front, no group can achieve better performance without hurting the performance of another group. In other words, the front defines the set of best achievable trade-offs among subgroups (without introducing unnecessary harm). Based on this definition, we select the model that lies on the Pareto front and achieves the best worst-case AUC (the red star in the middle top of Figure 2). We present a formal definition of minimax Pareto selection in Appendix B.4.
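Minimax Pareto selection can be sketched as follows, assuming each hyper-parameter run is summarised by a tuple of per-subgroup validation AUCs (function names are ours):

```python
def dominates(u, v):
    """u Pareto-dominates v: no worse on any subgroup, strictly better on one."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))

def minimax_pareto_select(candidates):
    """candidates: list of per-subgroup AUC tuples, one per hyper-parameter run.
    Returns the index of the Pareto-optimal candidate with the best
    worst-case subgroup AUC."""
    front = [i for i, u in enumerate(candidates)
             if not any(dominates(v, u)
                        for j, v in enumerate(candidates) if j != i)]
    return max(front, key=lambda i: min(candidates[i]))
```

For instance, among runs {(0.9, 0.6), (0.8, 0.8), (0.7, 0.7)}, the last is dominated by the second; of the two front members, (0.8, 0.8) has the better worst case and is selected.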

DTO-based Selection

Distance to optimal (DTO) (Han et al., 2021) is calculated by the normalized Euclidean distance between the performance of the current model and the optimal utopia point. Here, we construct the utopia point by taking the maximum AUC value of each subgroup among all models. The DTO strategy selects the model that has the smallest distance to the utopia point (the red hexagon in Figure 2 ).
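DTO-based selection admits an equally short sketch (function name ours; the per-axis normalisation is omitted here on the assumption that subgroup AUCs already share a common scale):

```python
def dto_select(candidates):
    """candidates: list of per-subgroup AUC tuples, one per hyper-parameter run.
    The utopia point takes the best AUC per subgroup over all models; the
    selected model is the one with the smallest Euclidean distance to it."""
    utopia = [max(c[g] for c in candidates) for g in range(len(candidates[0]))]

    def dist(c):
        return sum((u - a) ** 2 for u, a in zip(utopia, c)) ** 0.5

    return min(range(len(candidates)), key=lambda i: dist(candidates[i]))
```

Unlike minimax Pareto selection, DTO rewards closeness to the best observed performance on every subgroup simultaneously rather than the single worst case.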

3.4. EVALUATION AND IMPLEMENTATION

We apply the bias mitigation algorithms to medical image classification tasks and evaluate fairness based on the performance of the different subgroups defined by the sensitive attributes. The sensitive attributes are assumed to be available during training (for algorithms that require them). We consider two settings to evaluate fairness for medical imaging: in-distribution and out-of-distribution.

Evaluation Metrics

We use the area under the receiver operating characteristic curve (AUC) as the main metric, as it is commonly used for binary medical classification. We evaluate the algorithms from three aspects: (1) utility: overall AUC across all subgroups; (2) group fairness: the AUC gap between the subgroups with maximum and minimum AUC; (3) Max-Min fairness: the AUC of the worst-case group. In addition, we report the values of binary cross entropy (BCE), expected calibration error (ECE), false positive rate (FPR), false negative rate (FNR), and true positive rate (TPR) at 80% true negative rate (TNR) for each subgroup, as well as Equalized Odds (EqOdd). We provide detailed explanations of these metrics in Appendix B.3.
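Of the reported metrics, ECE is the least standard to implement; a minimal equal-width-binning sketch for the binary case (function name ours; here "accuracy" per bin is the empirical positive rate, matching the use of the positive-class probability as the confidence score) might look like:

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """ECE for binary classification: partition predicted positive-class
    probabilities into equal-width bins, then take the sample-weighted
    average gap between mean confidence and empirical positive rate."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    ece, n = 0.0, len(probs)
    for b in bins:
        if b:
            conf = sum(p for p, _ in b) / len(b)
            acc = sum(y for _, y in b) / len(b)
            ece += len(b) / n * abs(conf - acc)
    return ece
```

Computed per subgroup, this exposes cases where a model is well calibrated overall yet systematically over- or under-confident for one group.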

Statistical Tests

Prior work has empirically evaluated bias mitigation algorithms and occasionally claimed that some algorithm works well based on results from a couple of datasets. We note that to make a stronger conclusion that would be more useful to practitioners, e.g., 'algorithm A works better than B for medical imaging' (i.e., A is better in general, rather than better for dataset C specifically), one needs to evaluate performance across several datasets and perform significance tests that check for consistently good performance that cannot be explained by overfitting to a single dataset. This is where the MEDFAIR benchmark suite comes in. To rigorously compare the relative performance of different algorithms, we perform the Friedman test (Friedman, 1937) followed by the Nemenyi post-hoc test (Nemenyi, 1963) for both settings to identify whether any algorithm is significantly better than the others, following the authoritative guide of Demšar (2006). We first calculate the relative ranks of all algorithms on each dataset and sensitive attribute separately, and then take the average ranks for the Nemenyi test if significance is detected by the Friedman test. We consider a p-value below 0.05 to be statistically significant. The testing results are visualized with Critical Difference (CD) diagrams (Demšar, 2006). In CD diagrams, methods connected by a horizontal line belong to the same group, meaning they are not significantly different at the given p-value, while methods in different groups (not connected by the same line) differ with statistical significance. Implementation Details We adopt 2D and 3D ResNet-18 backbones (He et al., 2016; Hara et al., 2018) for 2D and 3D datasets, respectively. This lightweight backbone is used to avoid overfitting, as some datasets are small, and to remain consistent with the backbone used in the original literature (Kim et al., 2019; Wang et al., 2020; Tartaglione et al., 2021; Sarhan et al., 2020).
Binary cross entropy loss is used as the main objective. To account for randomness, for each experiment we report the mean and standard deviation over three separate runs with three randomly selected seeds. Further implementation details for all datasets and algorithms can be found in Appendix B.
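The rank-then-test procedure used for the statistical comparison can be sketched as follows (function names are ours; the q_alpha value must be taken from the Studentized range statistic table reproduced in Demšar (2006)):

```python
def average_ranks(scores):
    """scores[d][a]: performance of algorithm a on dataset d (higher = better).
    Returns the mean rank per algorithm (rank 1 = best), averaging ties."""
    n_alg = len(scores[0])
    ranks = [0.0] * n_alg
    for row in scores:
        order = sorted(range(n_alg), key=lambda a: -row[a])
        pos = 0
        while pos < n_alg:
            end = pos
            while end + 1 < n_alg and row[order[end + 1]] == row[order[pos]]:
                end += 1  # extend over a run of tied scores
            mean_rank = (pos + end) / 2 + 1
            for k in range(pos, end + 1):
                ranks[order[k]] += mean_rank
            pos = end + 1
    return [r / len(scores) for r in ranks]

def nemenyi_cd(n_alg, n_datasets, q_alpha):
    """Critical difference from Demšar (2006): two algorithms differ
    significantly if their average ranks differ by more than CD."""
    return q_alpha * (n_alg * (n_alg + 1) / (6 * n_datasets)) ** 0.5
```

Pairs of algorithms whose average ranks differ by less than the CD end up connected by the same horizontal bar in the CD diagram.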

4.1. BIAS WIDELY EXISTS IN ML MODELS TRAINED IN DIFFERENT MODALITIES AND TASKS

First, we train ERM on different datasets and sensitive attributes, and select models using the regular overall performance-based strategy. For each dataset and sensitive attribute, we calculate the maximum and minimum AUC and underdiagnosis rate among subgroups, using FNR for the malignant label and FPR for the "No Finding" label as the underdiagnosis rate. As shown in Figure 3, most points lie off the equality line, showing that performance gaps exist widely. This confirms a problem that has been widely discussed (Seyyed-Kalantari et al., 2021) but, until now, never systematically quantified for deep learning across a comprehensive variety of modalities, diagnosis tasks, and sensitive attributes.
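The per-subgroup underdiagnosis rate used in this comparison can be sketched as follows (function name and argument convention are ours):

```python
def underdiagnosis_rates(preds, labels, groups, positive_means_disease=True):
    """Per-subgroup underdiagnosis rate as used in Sec. 4.1: FNR when the
    positive label denotes disease (e.g. malignant), FPR when the positive
    label is 'No Finding' (a false positive then wrongly clears a patient)."""
    rates = {}
    for g in set(groups):
        idx = [i for i, gg in enumerate(groups) if gg == g]
        if positive_means_disease:  # FNR: diseased cases predicted healthy
            pos = [i for i in idx if labels[i] == 1]
            rates[g] = sum(preds[i] == 0 for i in pos) / len(pos)
        else:                       # FPR on 'No Finding': sick but cleared
            neg = [i for i in idx if labels[i] == 0]
            rates[g] = sum(preds[i] == 1 for i in neg) / len(neg)
    return rates
```

Comparing the maximum and minimum values of this dict across subgroups yields the gap plotted against the equality line in Figure 3.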

4.2. MODEL SELECTION SIGNIFICANTLY INFLUENCES WORST-CASE GROUP PERFORMANCE

We study the impact of model selection strategies on ERM using our three metrics of interest: the AUC of the worst-case group, the AUC gap, and the overall AUC. We first conduct a hyper-parameter sweep for ERM on all the datasets, and then compute the metrics and the relative ranks of the three model selection strategies. The results, including statistical significance tests, are summarised in Figure 4, with the raw data in Table A8. Each sub-plot corresponds to a metric of interest (worst-case AUC, AUC gap, overall AUC), and the average rank of each selection strategy (Pareto, DTO, Overall) is shown on a line; selection strategies not connected by the bold bars have significantly different performance. The results show that for the worst-case AUC metric (left), the Pareto-optimal model selection strategy has the highest average rank of around 1.5, which is statistically significantly better than the overall-AUC model selection strategy's average rank of around 2.5. Meanwhile, in terms of the overall AUC metric (right), the Pareto selection strategy is not significantly worse than the alternatives.

4.3. BIAS MITIGATION ALGORITHMS DO NOT SIGNIFICANTLY OUTPERFORM ERM

SWAD is the highest-ranked method for the worst- and overall-AUC metrics, but it is still not significantly better than ERM. We next ask whether any of the purpose-designed bias mitigation algorithms is significantly better than ERM, and which algorithm is best overall. To answer these questions, we evaluate the performance of all methods using the Pareto model selection strategy. We report the Nemenyi post-hoc test results on worst-group AUC, AUC gap, and overall AUC in Figure 5 for the in-distribution (top row) and out-of-distribution (bottom row) settings, with raw data in Tables A9 and A10. For in-distribution, while there are some significant performance differences, no method outperforms ERM significantly on any metric: ERM is always in the highest-ranked group of algorithms without significant differences.
The conclusion is the same for out-of-distribution testing, and some methods that rank higher than ERM in the in-distribution setting perform worse than ERM when deployed to an unseen domain, suggesting that preserving fairness across domain shift is challenging. It is worth noting that some methods do perform consistently well, e.g., SWAD ranks clearly first for worst-case and overall AUC in both settings, and could thus be a promising method for promoting fairness. However, from a statistical significance point of view, SWAD is still not significantly better, despite the fact that we use a much larger sample size (number of datasets) than most previous fairness studies. This suggests that many studies do not use enough datasets to justify their desired claims. Our benchmark suite provides the largest collection of medical imaging datasets for fairness to date, and thus the best platform for future research to evaluate the efficacy of any method in a rigorous statistical way.

5. DISCUSSION

Source of bias There are multiple confounding effects that can lead to bias, rather than any single easy-to-isolate factor. As summarised in Table 1 and discussed further in Appendices A.4 and E, these include both measurable and unmeasurable factors, spanning imbalance in subgroup size, imbalance in subgroup disease prevalence, differences in imaging protocols/time/location, spurious correlations, intrinsic differences in the difficulty of diagnosis for different subgroups, unintentional bias from human labellers, etc. It is difficult or even impossible to disentangle all of these factors, which makes it hard for algorithms that specifically optimise for one particular factor to succeed. Failure of the bias mitigation algorithms Although most bias mitigation algorithms are not consistently effective across our benchmark suite, we are certainly not trying to disparage them. Their inconsistency is understandable: some were not originally designed for medical imaging, which has characteristics distinct from those of natural images or tabular data, and more work may be necessary to design medical-imaging-specific solutions. More fundamentally, different algorithms may succeed when addressing solely the specific confounding factors they are designed to compensate for, but fail when presented with other confounders or a mixture of multiple confounders. For example, resampling specifically targets data imbalance, while disentanglement focuses more on removing spurious correlations; but real datasets may simultaneously contain other potential sources of bias, such as label noise. This may explain why SWAD is the most consistently high-ranked algorithm: it optimises a general notion of robustness without any specific assumption about confounders or sensitive attributes, and thus may be broadly beneficial across different confounding factors.

Relation of domain generalization and fairness

The aim of domain generalization (DG) algorithms is to maintain stable performance on unseen sub-populations, while fairness-promoting algorithms try to ensure that no known sub-population is poorly treated. Despite this difference, they share an eventual goal: being robust to changes in distribution across sub-populations. As shown in Section 4, some domain generalization methods, such as SWAD, consistently improve the performance of all subgroups, and thus overall utility. However, we also notice that they may enlarge the performance gap among subgroups. This raises the question: is a systematically better algorithm (i.e., one that improves Max-Min fairness) fairer if it increases disparity (i.e., does not satisfy group fairness)? This question goes beyond machine learning and depends on the application scenario. We suggest that a relevant differentiator may be between diagnosis and zero-sum resource allocation problems, where Max-Min and group fairness, respectively, could be prioritised. Are the evaluations enough for now? Although we have tried our best to include a diverse set of algorithms and datasets in our benchmark, it is certainly not exhaustive. There are methods that promote fairness from other perspectives, e.g., self-supervised learning may be more robust (Liu et al., 2021; Azizi et al., 2022). Also, datasets from other medical data modalities (e.g., cardiology, digital pathology) should be added. Beyond image classification, other important tasks in medical imaging, such as segmentation, regression, and detection, remain underexplored. We will keep our codebase alive and actively incorporate more algorithms, datasets, and even other tasks in the future.

REPRODUCIBILITY STATEMENT

We report the data preprocessing in Appendix B.1. 

A RELATED WORK

In Appendix A, we present a broad review of the model selection strategies adopted by the existing literature, current bias mitigation algorithms, existing fairness benchmarks, and domain generalization methods and their relationship to fairness.

A.1 MODEL SELECTION STRATEGY IN FAIRNESS

Considering the trade-off between fairness and utility, how to select an appropriate model during hyper-parameter search remains an important problem. We broadly review and summarize the model selection strategies of recent work in Table A1. N/A means that a paper does not explicitly specify its model selection strategy (although one may be implemented in its open-source code). The model selection strategies differ greatly across the existing literature, making direct comparison infeasible. Hence, we investigate the influence of model selection strategies in our study.

A.2 BIAS MITIGATION ALGORITHMS

According to the stage at which bias mitigation methods are introduced, they can be classified into three main categories: pre-processing, in-processing, and post-processing. Pre-processing methods aim to curate the data to remove potential bias before learning a model (Khodadadian et al., 2021), while post-processing methods adjust the predictions of a trained model according to the sensitive attributes (Pleiss et al., 2017). In this paper, we focus on benchmarking in-processing methods, which aim to mitigate bias during model training. Below, we review four popular categories of bias mitigation algorithms. Subgroup Rebalancing For imbalanced data, the synthetic minority oversampling technique (SMOTE) (Chawla et al., 2002) is a classic resampling method that over-samples the minority class and under-samples the majority class. Recent work found that simple data balancing can effectively improve worst-case group accuracy (Idrissi et al., 2022). Domain-Independence Wang et al. (2020) develop a domain-independent training strategy that applies different classification heads to different subgroups. Royer & Lampert (2015) propose classifier adaptation strategies at prediction time to reduce error rates when the test-domain distribution differs. Adversarial Learning There are two major categories of adversarial learning: (1) playing a minimax game in which the classification head tries to achieve the best classification performance while minimizing the discriminator's ability to predict the sensitive attributes (Zhang et al., 2018; Kim et al., 2019); after training, the sensitive attributes are expected to be indistinct in the representation; (2) enforcing fairness constraints (e.g., group fairness) on the representation during training, with the representation later used for downstream tasks (Xie et al., 2017; Madras et al., 2018; Zhao et al., 2019).
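The first adversarial formulation above is a minimax game; a deliberately minimal sketch of the combined objective (function name ours; a real implementation applies this over network features with a gradient-reversal layer or alternating updates, not over scalar logits):

```python
import math

def adversarial_debiasing_loss(task_logit, y, adv_logit, s, lam=1.0):
    """Minimax objective sketch: the classifier minimises the task loss while
    *maximising* the adversary's loss at predicting the sensitive attribute s,
    implemented here by subtracting it, as a gradient-reversal layer would do.
    lam trades off utility against sensitive-attribute removal."""
    def bce(logit, t):
        # binary cross entropy for a raw logit and binary target t
        return math.log(1 + math.exp(-logit if t else logit))

    return bce(task_logit, y) - lam * bce(adv_logit, s)
```

With lam = 0 this reduces to plain ERM; larger lam pushes the representation towards being uninformative about s at some cost to task performance.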
Disentanglement Disentanglement methods (Tartaglione et al., 2021; Sarhan et al., 2020; Creager et al., 2019; Lee et al., 2021) isolate independent factors of variation into separate, independent components of a representation vector. In other words, these methods disentangle the sensitive attributes from task-relevant information at the representation level. Classification can then be performed on the task-specific component alone, so that the sensitive attributes and other unobserved factors have no impact. In our benchmark, we select typical in-processing algorithms from these categories to provide a comprehensive comparison.
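One simple way to encourage such disentanglement, sketched here as an illustration of the general idea rather than any specific benchmarked method, is to penalize the cross-covariance between a task branch and a sensitive-attribute branch of the representation:

```python
import torch

def cross_covariance_penalty(z_task, z_sens):
    """Batch estimate of the cross-covariance between the task-specific
    and sensitive-attribute-specific parts of the representation.
    Driving it to zero decorrelates the two branches."""
    z_t = z_task - z_task.mean(dim=0)
    z_s = z_sens - z_sens.mean(dim=0)
    cov = z_t.T @ z_s / (z_task.shape[0] - 1)  # cross-covariance matrix
    return (cov ** 2).mean()
```

In practice such a penalty would be added to the task loss with a tunable coefficient, and the downstream classifier would read only `z_task`.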

A.3 FAIRNESS BENCHMARKS

There are efforts in the general machine learning community to benchmark the performance of fairness-aware algorithms and inspect their effectiveness from different aspects. AIF360 (Bellamy et al., 2019) implements a wide range of fairness metrics and debiasing algorithms and is available in both Python and R. Fairlearn (Bird et al., 2020) also provides debiasing algorithms and fairness metrics to evaluate ML models. Friedler et al. (2019) benchmark a series of fairness-aware machine learning techniques, but only study traditional machine learning models. For deep learning methods, Reddy et al. (2021) benchmark algorithms from the perspective of representation learning on tabular and synthetic datasets, and find they can successfully remove spurious correlations. Locatello et al. (2019) specifically benchmark disentangled fair representation learning methods on 3D shape datasets, suggesting disentanglement can be a useful property for encouraging fairness. In the area of computational medicine, machine learning models have been found to demonstrate a systematic bias toward a wide range of attributes, such as race, gender, age, and even health insurance type (Obermeyer et al., 2019; Larrazabal et al., 2020; Spencer et al., 2013; Seyyed-Kalantari et al., 2021). The bias also exists across different types of medical data, such as chest X-rays (Seyyed-Kalantari et al., 2020), CT scans (Zhou et al., 2021), skin dermatology images (Kinyanjui et al., 2020), health records (Obermeyer et al., 2019), etc. The most relevant work to ours is Zhang et al. (2022), which compares a series of algorithms on chest X-ray images and finds that no method outperforms simple data balancing. However, it is unclear whether this conclusion generalizes to other medical imaging modalities, and whether the selection of methods is comprehensive.
In contrast, we benchmark a wider range of algorithms on different data modalities, study the ultimately more significant issue of model selection, and provide further analysis of the cause of the bias and the explanation of the effective algorithms. To the best of our knowledge, we are the first to provide a comprehensive benchmark for a wide range of algorithms and datasets, and a comprehensive analysis of different model selection criteria.

A.4 SOURCE OF BIAS

Data imbalance across subgroups is one of the most common sources of bias in medical imaging, as many biomedical datasets lack demographic diversity. For example, many datasets, such as UK Biobank and The Cancer Genome Atlas (TCGA) (Sudlow et al., 2015; Liu et al., 2018), are developed with individuals predominantly of European ancestry. Likewise, more samples overall will come from older people if a dataset is collected to study a disease that occurs more often in older people, e.g., datasets for age-related macular degeneration (AMD) (Farsiu et al., 2014). When one subgroup has more samples than another, the machine learning model can be expected to have different prediction accuracy for different subgroups.

Class imbalance can occur along with data imbalance, where one subgroup has more samples of some classes while another subgroup has more samples of other classes. For example, in the AMD dataset, subgroups of older people contain more pathology examples than subgroups of younger people. Class imbalance is also inherent to rare diseases, which are generally due to genetic mutations occurring in a very limited number of people (Lee et al., 2020), e.g., 10 in 1,000,000. In such cases the class imbalance is severe, and there will never be enough data (even worldwide) for a balanced representation in the training set.

Spurious correlation can be undesirably learned by machine learning models, e.g., from imaging devices. For example, a model can learn to classify skin diseases by looking at the markings placed by dermatologists near the lesions, instead of the lesions themselves (Winkler et al., 2019). Spurious correlation is also related to data and class imbalance, because spurious correlations may occur more frequently in a subgroup with few examples when the model simply memorizes all its data points, i.e., overfitting. Moreover, class imbalance itself may lead to a spurious correlation.
For example, the model may use age-related features to predict pathology in a dataset where most of the older patients are unhealthy.

Label noise can also be a source of bias in medical imaging datasets. As labeling medical images is labor-intensive and time-consuming, some large-scale datasets are labeled by automatic tools (Irvin et al., 2019; Johnson et al., 2019), which are not perfectly accurate and thus introduce noise. Zhang et al. (2022) recruit a board-certified radiologist to relabel a subset of the chest X-ray dataset CheXpert, and find that label noise is much higher in some subgroups than in others.

Inherent characteristics of the data of certain subgroups can lead to different performance across subgroups even if the dataset is balanced, i.e., the task for some subgroups is inherently difficult even for humans. For example, in skin dermatology images, lesions are usually harder to recognize on darker skin than on lighter skin due to the lower contrast (Wen et al., 2021). Thus, even with balanced datasets, a trained ML model can still give lower accuracy for patients with darker skin. Considering this, measures beyond algorithms should be adopted to promote fairness, such as collecting more representative samples and improving imaging devices.

In summary, bias usually does not come from a single source, and different sources of bias usually correlate with each other, leading to the unfairness of the machine learning model.

A.5 DOMAIN GENERALIZATION AND FAIRNESS

Domain generalization (DG) algorithms aim to maintain good performance on unseen subpopulations, while fairness-promoting algorithms try to ensure that no known subpopulation is poorly treated. Though they differ in detail, their eventual goal is the same: being robust to distribution changes across subpopulations, as also discussed by Creager et al. (2020). Hence, in this work, we also explore fairness from the perspective of domain generalization. One line of DG work treats it as a robust optimization problem (Ben-Tal et al., 2009), where the goal is to minimize the worst-case loss over subgroups of the training set. Duchi et al. (2016) propose to minimize the worst-case loss over constructed distributional uncertainty sets with Distributionally Robust Optimization (DRO). GroupDRO (Sagawa et al., 2019) extends this idea by adding increased regularization to overparameterized networks and achieves good worst-case performance. Another line of work focuses on finding flat minima in the loss landscape during optimization. As flat minima are considered to generalize better across domains (Hochreiter & Schmidhuber, 1997), several methods optimize for flatness (Izmailov et al., 2018; Cha et al., 2021; Keskar et al., 2016; Foret et al., 2020). Stochastic weight averaging (SWA) (Izmailov et al., 2018) finds flat minima by averaging model weights every K epochs during training. Stochastic weight averaging densely (SWAD) (Cha et al., 2021) adopts a similar weight ensemble strategy but samples weights densely, i.e., at every iteration; it also searches for the start and end iterations for averaging based on the validation loss to avoid overfitting. Sharpness-aware minimization (SAM) (Foret et al., 2020) finds flat minima by seeking parameters that lie in neighborhoods of uniformly low loss. We select three popular methods among them for our benchmark: GroupDRO, SAM, and SWAD.
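The weight-averaging idea behind SWA/SWAD can be sketched in a few lines. Below is a hedged, framework-agnostic version operating on a dictionary of weights (a simplification of what utilities such as PyTorch's `torch.optim.swa_utils` do):

```python
def update_swa(swa_state, model_state, n_averaged):
    """Running average of model weights: after each update, the SWA weights
    equal the mean of all weight snapshots seen so far."""
    for name, w in model_state.items():
        swa_state[name] = (swa_state[name] * n_averaged + w) / (n_averaged + 1)
    return swa_state, n_averaged + 1
```

SWA would call this every K epochs, while SWAD calls it at every iteration within the averaging window it selects from the validation loss.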

B IMPLEMENTATION AND EVALUATION DETAILS

B.1 DATA

B.1.1 DATASET

We summarize the statistics of the subgroups and class labels in Tables A2 to A5. The percentages outside the brackets are the proportion of each subgroup in the dataset, and the percentages in the brackets are the proportion labeled unhealthy (class label). The datasets we use are all publicly available, but we cannot directly include the data in our benchmark; we therefore provide the access links in Table A6.

CheXpert We first merge the ethnicity labels (Gichoya et al., 2022) with the original data (links in Table A6), dropping images without sensitive attribute labels. The "No Finding" label is used for training and testing. We use all of the available frontal and lateral images, and images of the same patient are not shared across the train/validation/test splits.

MIMIC-CXR

The race data is available via the MIMIC-IV (Johnson et al., 2020) dataset, which is also deposited in the PhysioNet database (Goldberger et al., 2000). We merge it into the original MIMIC-CXR metadata based on "subject ID". The other preprocessing steps are similar to those for CheXpert.

PAPILA We exclude the "suspect" label class and use images labeled glaucomatous and non-glaucomatous for the binary classification task. The dataset contains right-eye and left-eye images of the same patient. We split train/validation/test in a proportion of 70/10/20, and images of the same patient are not shared across the splits.

HAM10000 We map the 7 diagnostic labels into binary labels, i.e., benign and malignant, following Maron et al. (2019). Benign contains basal cell carcinoma (bcc), benign keratosis-like lesions (solar lentigines / seborrheic keratoses and lichen-planus like keratoses, bkl), dermatofibroma (df), melanocytic nevi (nv), and vascular lesions (angiomas, angiokeratomas, pyogenic granulomas and hemorrhage, vasc). Malignant contains actinic keratoses and intraepithelial carcinoma / Bowen's disease (akiec), and melanoma (mel). We discard images whose sensitive attributes are not recorded, resulting in 9,948 images in total.

Fitzpatrick17k We map the three partition labels into binary labels, i.e., benign and malignant. We treat "non-neoplastic" and "benign" as the benign label, and "malignant" as the malignant label. Fitzpatrick skin type labels are used as sensitive attributes.

OL3I The Opportunistic L3 computed tomography slices for Ischemic heart disease risk assessment (OL3I) dataset provides 8,139 axial computed tomography (CT) slices at the third lumbar vertebrae (L3) level. We design the task to predict whether an individual will be diagnosed with ischemic heart disease one year after the scan according to the provided labels (i.e., prognosis). Sex and age are treated as sensitive attributes.
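The race-label merge for MIMIC-CXR described above amounts to a join on the subject identifier. A minimal pandas sketch follows; the miniature tables and column names are illustrative and should be checked against the actual metadata files:

```python
import pandas as pd

# Hypothetical miniature versions of the two metadata tables.
cxr_meta = pd.DataFrame({"subject_id": [1, 2, 3], "dicom_id": ["a", "b", "c"]})
demographics = pd.DataFrame({"subject_id": [1, 2], "race": ["WHITE", "BLACK"]})

# An inner join drops subjects without a recorded sensitive attribute,
# mirroring the "drop images without sensitive attribute labels" step.
merged = cxr_meta.merge(demographics, on="subject_id", how="inner")
```

Subject 3 is dropped because it has no demographic record, leaving only images with sensitive attribute labels.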
OCT

B.2.1 IMPLEMENTATION DETAILS

The experiments are conducted on Scientific Linux release 7.9 with one NVIDIA A100-SXM-80GB GPU. We trained over 7,000 models using ∼0.77 GPU years. The implementation is based on Python 3.9 and PyTorch 1.10, and we adapt the source code released by the original authors to our framework. We use ResNet-18 for 2D images and 3D ResNet-18 for 3D images as the backbone network for all experiments unless otherwise specified. For 2D datasets, we resize the images to 256×256 and apply standard data augmentation, i.e., random cropping to 224×224, random horizontal flipping, and random rotation of up to 15 degrees. The backbone network is initialized with ImageNet (Deng et al., 2009) pretrained weights, and images are normalized with the ImageNet mean and standard deviation. For 3D datasets, we resize the 3D images according to the original imaging characteristics as described in B.1.2. The backbone network is initialized with Kinetics (Carreira & Zisserman, 2017) pretrained weights, and images are normalized with the Kinetics mean and standard deviation. Dataset-specific preprocessing and hyper-parameters can be found below.
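The ImageNet normalization step, for instance, can be written directly in PyTorch; the constants are the standard ImageNet statistics, and torchvision's `transforms.Normalize` performs the same computation:

```python
import torch

IMAGENET_MEAN = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
IMAGENET_STD = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)

def normalize(img):
    """img: float tensor of shape (3, H, W) with values in [0, 1].
    Returns the channel-wise standardized image."""
    return (img - IMAGENET_MEAN) / IMAGENET_STD
```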

B.2.2 HYPER-PARAMETERS

To achieve optimal performance for each algorithm and ensure fair comparisons, we perform a Bayesian hyper-parameter search for each algorithm and each combination of dataset and sensitive attribute using the Weights & Biases machine learning platform (Biewald, 2020). We use batch sizes of 1024 and 8 for 2D and 3D images, respectively. The SGD optimizer is used for all methods, and we apply early stopping if the validation worst-case AUC does not improve for 5 epochs. Empirically, we find the best learning rate for all methods is around 1e-4, except SAM, which requires a higher learning rate in the range [1e-1, 1e-3] depending on the dataset.
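The early-stopping rule can be sketched as follows; this is an illustrative helper on a history of per-epoch worst-case validation AUCs, not MEDFAIR's exact implementation:

```python
def should_stop(worst_auc_history, patience=5):
    """Stop when the validation worst-case AUC has not improved over the
    best value seen before the last `patience` epochs."""
    if len(worst_auc_history) <= patience:
        return False
    best_before = max(worst_auc_history[:-patience])
    return all(auc <= best_before for auc in worst_auc_history[-patience:])
```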

B.2.3 METHODS

We summarize the benchmarked methods in Table A7, showing their categories, whether they can accept multiple sensitive attributes for training, and whether they require sensitive attribute information for training and testing.

B.3 METRICS

We explain the metrics used in this study in detail below.

AUC The area under the receiver operating characteristic curve (AUROC) is a standard metric for binary classification whose value is not affected by class-label imbalance. We use the name AUC in our text for simplicity. We measure the average AUC and the AUC of each subgroup, and pay particular attention to the AUC gap and the worst-case AUC to evaluate group fairness and max-min fairness.

ECE The expected calibration error (ECE) (Guo et al., 2017; Nixon et al., 2019) is an indicator of group sufficiency (Castelnovo et al., 2022). A high ECE value may result in a different optimal decision threshold.

BCE Binary cross entropy (BCE) is the objective function we optimize for the classification task.

TPR, TNR, FPR, FNR True positives (TP) and true negatives (TN) are samples that are actually positive (negative) and correctly predicted positive (negative); false positives (FP) and false negatives (FN) are samples that are actually negative (positive) but wrongly predicted positive (negative). We then define the true positive rate (TPR), true negative rate (TNR), false positive rate (FPR), and false negative rate (FNR) as

TPR = TP / (TP + FN) = 1 − FNR, (3)
TNR = TN / (TN + FP) = 1 − FPR. (4)

We report the overall FPR and FNR and their values for each subgroup. The threshold is selected based on the F1 score following Seyyed-Kalantari et al. (2021). We also report the TPR at 80% TNR, which indicates the true positive rate at a given desirable true negative rate.

EqOdd Equalized odds is a widely used group fairness metric requiring the true positive and false positive rates to be equalized across subgroups. Denote the input, label, and sensitive attribute as x, y, s, and the prediction and output probability as ŷ, p. Following Reddy et al. (2021), we define equality of opportunity w.r.t. y = 0 and y = 1, i.e., EqOpp0 and EqOpp1, as

EqOpp0 = 1 − |p(ŷ = 1|y = 0, s = group-0) − p(ŷ = 1|y = 0, s = group-1)|, (5)
EqOpp1 = 1 − |p(ŷ = 1|y = 1, s = group-0) − p(ŷ = 1|y = 1, s = group-1)|. (6)

Then, EqOdd = 0.5 × (EqOpp0 + EqOpp1). (7)
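These quantities are straightforward to compute from binary predictions. A small self-contained sketch for two subgroups (illustrative helper functions, not MEDFAIR's implementation):

```python
def rates(y_true, y_pred):
    """Return (TPR, FPR) from binary labels and binary predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp / (tp + fn), fp / (fp + tn)

def eq_odd(y_true, y_pred, s):
    """EqOdd = 0.5 * (EqOpp0 + EqOpp1) for a binary sensitive attribute s."""
    g0 = [(t, p) for t, p, a in zip(y_true, y_pred, s) if a == 0]
    g1 = [(t, p) for t, p, a in zip(y_true, y_pred, s) if a == 1]
    tpr0, fpr0 = rates(*zip(*g0))
    tpr1, fpr1 = rates(*zip(*g1))
    eqopp0 = 1 - abs(fpr0 - fpr1)  # equality of opportunity w.r.t. y = 0
    eqopp1 = 1 - abs(tpr0 - tpr1)  # equality of opportunity w.r.t. y = 1
    return 0.5 * (eqopp0 + eqopp1)
```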

B.4 MINIMAX PARETO SELECTION

Following Martinez et al. (2020) and the dominance definition of Miettinen (2008), we give a formal definition of Pareto optimality with respect to AUC.

Dominant vector

A vector t′ ∈ R^k is said to dominate t ∈ R^k (denoted t′ ≻ t) if t′_i ≥ t_i for all i = 1, ..., k and t′ ≠ t; t′ weakly dominates t (t′ ⪰ t) if t′_i ≥ t_i for all i.

Dominant Classifier Given a set of group-specific metric functions T(h), i.e., AUC in our case, a model h′ dominates h′′ (h′ ≻ h′′) if T(h′) ≻ T(h′′); likewise, h′ ⪰ h′′ if T(h′) ⪰ T(h′′).

Pareto Optimality Given a set of models H and group-specific metric functions T(h), the Pareto front is P_{S,H} = {h ∈ H : ∄ h′ ∈ H such that h′ ≻ h}. We call a model h a Pareto optimal solution iff h ∈ P_{S,H}. Finally, a model h* is a minimax Pareto fair classifier if it maximizes the worst-group AUC among all Pareto front models.

For example, in Figure 2, each data point represents a different hyper-parameter combination for one algorithm, and the red points are models lying on the Pareto front. As the figure shows, the Pareto optimal points cannot improve the AUC of group 0 without hurting the AUC of group 1, and vice versa, indicating the best achievable trade-off between the groups. The worst-case group is group 1, as the best AUC achieved for group 1 is lower than that for group 0. In this case, we select the model that achieves the best AUC for group 1 (the red star point), the disadvantaged group, to make the selection as fair as possible.
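The minimax Pareto selection above can be sketched in a few lines, treating each candidate model as its tuple of per-group AUCs (illustrative code, not MEDFAIR's exact implementation):

```python
def pareto_front(points):
    """Keep the non-dominated points: a point is dominated if some other
    point is >= in every coordinate and differs in at least one."""
    front = []
    for p in points:
        dominated = any(
            all(qi >= pi for qi, pi in zip(q, p)) and q != p
            for q in points
        )
        if not dominated:
            front.append(p)
    return front

def minimax_pareto_select(points):
    """Among Pareto-optimal points, pick the one whose worst-group AUC is highest."""
    return max(pareto_front(points), key=min)
```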



Our framework can be easily extended to non-binary classification.



Figure 1: Components of MEDFAIR benchmark.

Figure 2: Illustration of three different model selection strategies. Each data point represents a different hyper-parameter combination for one algorithm, where the red points are the models lying on the Pareto front.

Minimax Pareto Selection The concept of Pareto optimality was proposed by Mas-Colell et al. (1995) and utilized in fair machine learning to study the trade-off among subgroup accuracies (Martinez et al., 2020). Intuitively, for a model on the Pareto front, no group can achieve better performance without hurting the performance of other groups. In other words, the front defines the set of best achievable trade-offs among subgroups (without introducing unnecessary harm). Based on this definition, we select the model that lies on the Pareto front and achieves the best worst-case AUC (the red star in the middle top of Figure 2). We present a formal definition of minimax Pareto selection in Appendix B.4.

Figure 3: The AUC (left) and underdiagnosis rates (right) for the advantaged and disadvantaged subgroups across each dataset and sensitive attribute, when training with ERM. Most points are off the blue equality line, showing that bias widely exists in conventional ERM-trained models.

The following hyper-parameter space is searched (20 runs for each method per dataset × sensitive attribute), where [ ] denotes a continuous value range and { } denotes discrete values:

ERM/Resampling/DomainInd Learning rate lr ∈ [1e-3, 1e-5].
LAFTR Learning rate lr ∈ [1e-3, 1e-4]. Adversarial coefficient η ∈ [0.01, 5].
CFair Learning rate lr ∈ [1e-3, 1e-4]. Adversarial coefficient η ∈ [0.01, 5].
LNL Learning rate lr ∈ [1e-3, 1e-4]. Adversarial coefficient η ∈ [0.01, 5].
EnD Learning rate lr ∈ [1e-3, 1e-4]. Entangling term coefficient α ∈ [0.01, 5]. Disentangling term coefficient β ∈ [0.01, 5].
ODR Learning rate lr ∈ [1e-3, 1e-4]. Entropy weight λ_E ∈ [0.01, 5]. Orthogonal-disentangled loss coefficient λ_OD ∈ [0.01, 5]. KL divergence loss coefficient γ_OD ∈ [0.01, 5]. Entropy gamma γ_E ∈ [0.1, 5].
GroupDRO Learning rate lr ∈ [1e-3, 1e-4]. Group adjustment η ∈ [0.01, 5]. Weight decay L2 ∈ {1e-1, 1e-2, 1e-3, 1e-4, 1e-5}.
SWAD Learning rate lr ∈ [1e-3, 1e-4]. Starting epoch E_s ∈ {3, 5, 7, 9}. Tolerance epochs E_t ∈ {3, 5, 7, 9}. Tolerance ratio T_r ∈ [0.01, 0.3].
SAM Learning rate lr ∈ [1e-1, 1e-4]. Neighborhood size ρ ∈ [0.01, 5].
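Continuous hyper-parameters such as the learning rate are typically sampled log-uniformly over such ranges. The paper's Bayesian search is run through Weights & Biases, so the plain random-search sketch below is only an illustrative stand-in, and the configuration keys are hypothetical:

```python
import math
import random

def log_uniform(lo, hi):
    """Sample uniformly in log space, appropriate for scale parameters like lr."""
    return math.exp(random.uniform(math.log(lo), math.log(hi)))

def sample_groupdro_config():
    # Mirrors the GroupDRO space above; key names are illustrative.
    return {
        "lr": log_uniform(1e-4, 1e-3),
        "group_adjustment": log_uniform(0.01, 5),
        "weight_decay": random.choice([1e-1, 1e-2, 1e-3, 1e-4, 1e-5]),
    }
```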

Per-subgroup BCE, ECE, TPR @80% TNR, and error-rate values for Grp. 0 and Grp. 1 across the benchmarked methods; see Tables A11 to A13 for the complete results.

Published as a conference paper at ICLR 2023

Detailed statistics of the datasets. "# images/scans" are the actual numbers used in this study after removing samples with missing sensitive attributes. For potential bias, LN, CI, DI, and SC represent label noise, class imbalance, data imbalance, and spurious correlation, respectively.

Table 1 lists the basic dataset information; more detailed statistics are provided in Appendix B.2, and the hyper-parameter space in Appendix B.2.2. All of the datasets we use are publicly available, and we provide the download links in Table A6. Source code and documentation are available at https://ys-zong.github.io/MEDFAIR/. Running all the experiments required ∼0.77 NVIDIA A100-SXM-80GB GPU years.

ACKNOWLEDGMENT

Yongshuo Zong is supported by United Kingdom Research and Innovation (grant EP/S02431X/1), UKRI Centre for Doctoral Training in Biomedical AI at the University of Edinburgh, School of Informatics. For the purpose of open access, the author has applied a Creative Commons Attribution (CC BY) licence to any Author Accepted Manuscript version arising. We thank Dr. Maria Valdés Hernández for the discussion of data preprocessing. We acknowledge the data resource providers in Appendix F.

Summary of model selection strategies of fairness-aware methods and benchmarks.

The statistics of the subgroups and class labels. The percentages outside the brackets are the proportion of each subgroup, and the percentages in the brackets are the proportion labeled unhealthy (class label).

The statistics of the subgroups and class labels. The percentages outside the brackets are the proportion of each subgroup, and the percentages in the brackets are the proportion labeled unhealthy (class label). Age group 0 and age group 1 range from 0-60 and 60+, except for the OCT dataset, whose age groups are 55-75 and 75+.

Data splitting for experiments: unless otherwise specified, we randomly split the whole dataset into training/validation/testing sets with a proportion of 80/10/10 for 2D datasets and 70/10/20 for 3D datasets.

The statistics of the Fitzpatrick17k dataset.

The statistics of the subgroups and class labels. The percentages outside the brackets are the proportion of each subgroup, and the percentages in the brackets are the proportion labeled unhealthy (class label). Age group 0 and age group 1 range from 0-60 and 60+.

We design the task to predict if patients have Alzheimer's disease (AD). MRI scans from ADNI have been preprocessed. We resize the height and width of scans of both 1.5T and 3T to the same size of 224 × 224 × 144 to reduce the variance for cross-domain testing. Random cropping of the size of 196 × 196 × 128 is used for training.

A list of methods used in the benchmark. SA is short for Sensitive Attributes. Y and N represent Yes and No, respectively.


Results of the in-distribution evaluation.

Results of the out-of-distribution evaluation. In the dataset column, the dataset in the first row is the training domain, and the second row is the testing domain.

Results of other metrics for HAM10000, CheXpert, and ADNI 1.5T dataset

Results of other metrics for MIMIC-CXR, OCT, and Fitzpatrick17k dataset

Results of other metrics for COVID-CT-MD, PAPILA, and OL3I dataset.

Results of other metrics in out-of-distribution setting. In the dataset column, the dataset in the first row is the training domain, and the second row is the testing domain.

Results of other metrics in out-of-distribution setting. In the dataset column, the dataset in the first row is the training domain, and the second row is the testing domain.

D INCORPORATE NEW DATASETS AND ALGORITHMS IN MEDFAIR

We implement MEDFAIR using the PyTorch framework. We show example pseudo-code to demonstrate how to incorporate new datasets and algorithms. Detailed documentation can be found at https://ys-zong.github.io/MEDFAIR/.

D.1 ADDING NEW DATASETS

We implement a base dataset class BaseDataset; a new dataset can be added by creating a new file and inheriting it.
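A new-dataset skeleton might look like the following. This is illustrative pseudo-code in the spirit of the documentation; the real BaseDataset interface may differ:

```python
class BaseDataset:
    """Simplified stand-in for MEDFAIR's BaseDataset."""
    def __init__(self, records):
        # records: list of (image_path, label, sensitive_attribute) tuples
        self.records = records

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        raise NotImplementedError

class MyNewDataset(BaseDataset):
    def load_image(self, path):
        # Dataset-specific loading/preprocessing would go here;
        # returning the path is a placeholder.
        return path

    def __getitem__(self, idx):
        path, label, sens = self.records[idx]
        return self.load_image(path), label, sens
```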

D.2 ADDING NEW ALGORITHMS

We implement a base algorithm class BaseNet, which contains the basic configuration and the regular training/validation/testing loop; a new algorithm can be added by inheriting it and re-implementing the training loop, loss, etc. if needed.
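A hedged PyTorch sketch of this pattern follows; the class and method names are illustrative simplifications, not MEDFAIR's actual implementation:

```python
import torch

class BaseNet:
    """Simplified stand-in for MEDFAIR's BaseNet trainer."""
    def __init__(self, model, lr=1e-4):
        self.model = model
        self.optimizer = torch.optim.SGD(model.parameters(), lr=lr)
        self.criterion = torch.nn.BCEWithLogitsLoss()

    def compute_loss(self, x, y, sens):
        return self.criterion(self.model(x).squeeze(-1), y)

    def train_step(self, x, y, sens):
        self.optimizer.zero_grad()
        loss = self.compute_loss(x, y, sens)
        loss.backward()
        self.optimizer.step()
        return loss.item()

class MyNewAlgorithm(BaseNet):
    def compute_loss(self, x, y, sens):
        # Override with an algorithm-specific objective,
        # e.g. the task loss plus a fairness regularizer.
        base = super().compute_loss(x, y, sens)
        penalty = torch.tensor(0.0)  # placeholder regularizer
        return base + penalty
```

Only `compute_loss` (or the full `train_step`) needs to change per algorithm; the surrounding loop is inherited.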

The statistics of the intersectional subgroup MIMIC-CXR dataset.

The performance of different methods on MIMIC-CXR dataset.

C ADDITIONAL RESULTS

Table A8 presents the results of ERM under the different model selection strategies behind Figure 4. Table A9 presents in-distribution results of the maximum and minimum AUC and the gap between them for all datasets, as well as their average ranks. Table A10 presents the out-of-distribution results. The highest maximum and minimum AUC values and the smallest performance gap are in bold. We also report the BCE, ECE, TPR @80% TNR, FPR, FNR, and EqOdd (for binary attributes) of each subgroup for a complete evaluation, in the in-distribution setting in Tables A11 to A13 and in the out-of-distribution setting in Tables A14 and A15.

E ANALYSIS OF SOURCE OF BIAS

We take two datasets, HAM10000 and MIMIC-CXR, as case studies to analyze the source of bias. As shown in Table A2, we observe severe data and class imbalance as direct sources of bias for both datasets. Thus, we utilize resampling strategies to explicitly mitigate the imbalance. We use three types of resampling that upsample the minority subgroup, class, or both, so that all groups appear with equal chances during training, i.e., subgroup resampling, class resampling, and subgroup-and-class resampling, respectively.

For the HAM10000 dataset, we take the cartesian product of age and sex subgroups to construct 8 intersectional subgroups, whose statistics are shown in Table A16 (excluding age 0-20, as it has too few samples). The results of ERM, the resampling strategies, and the best-performing SWAD method are shown in Table A17. The worst-performing subgroup is "40-60 Male", which is neither the subgroup with the fewest images nor the most class-imbalanced one. Moreover, on intersectional subgroups, the different resampling strategies perform similarly to ERM and sometimes even worse, and all of them are worse than SWAD. In other words, there are sources of bias beyond the observable data/class imbalance, which explains why the resampling methods do not lead to better performance. For example, skin type can be a potential source of bias (it is not recorded in the metadata), as this is a dermatology dataset and lesions on darker skin are intrinsically more difficult to diagnose than those on lighter skin. Overall, these sources are difficult or impossible to disentangle given the metadata, as the sensitive attributes of different datasets can be correlated in different ways. Therefore, methods that specifically optimize for one particular factor are usually less effective.
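The three resampling strategies amount to inverse-frequency sampling weights over the subgroup, the class, or their combination; a minimal sketch (illustrative helpers, e.g. for use with a weighted sampler):

```python
from collections import Counter

def inverse_frequency_weights(keys):
    """Per-sample weights inversely proportional to group size, so every
    group is expected to appear equally often under weighted sampling."""
    counts = Counter(keys)
    return [1.0 / counts[k] for k in keys]

def resampling_weights(subgroups, labels, mode):
    if mode == "subgroup":
        keys = list(subgroups)
    elif mode == "class":
        keys = list(labels)
    else:  # "subgroup_and_class"
        keys = list(zip(subgroups, labels))
    return inverse_frequency_weights(keys)
```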
Following a similar procedure, we further analyze the intersectional subgroups of the MIMIC-CXR dataset, where we obtain 20 subgroups by taking the cartesian product of the age, sex, and race subgroups. As can be seen from Table A18, the number of images and the label values ("No Finding") vary greatly among subgroups. The worst-performing subgroup is "60-80 non-White Female", which is imbalanced in both data and class, but again is neither the subgroup with the fewest images nor the most class-imbalanced one. From the results in Table A19, we do observe a slight performance increase for the worst-case subgroup after resampling the minority subgroup, class, or both, but also decreases in overall performance. A possible reason is that label noise in some subgroups is more severe than in others, as identified by Zhang et al. (2022), and the upsampled subgroups may contain more noisy labels that worsen the overall performance. To summarize, most medical imaging datasets contain multiple sources of bias rather than a single one; these sources are correlated in different ways and are difficult or impossible to fully disentangle for separate analysis. The mixture of multiple confounders makes algorithms that specifically optimize for one particular factor (e.g., data imbalance) fail to succeed, i.e., they do not outperform ERM. This, to some extent, explains why the domain generalization method SWAD is the most consistently high-ranked algorithm: it does not make a specific assumption about the source of bias, and increasing the general notion of robustness may be beneficial against various kinds of confounders.

