PRACTICAL APPROACHES FOR FAIR LEARNING WITH MULTITYPE AND MULTIVARIATE SENSITIVE ATTRIBUTES Anonymous

Abstract

It is important to guarantee that machine learning algorithms deployed in the real world do not result in unfairness or unintended social consequences. Fair ML has largely focused on the protection of single attributes in the simpler setting where both attributes and target outcomes are binary. However, practical application in many real-world problems entails the simultaneous protection of multiple sensitive attributes, which are often not simply binary, but continuous or categorical. To address this more challenging task, we introduce FairCOCCO, a fairness measure built on cross-covariance operators on reproducing kernel Hilbert spaces. This leads to two practical tools: first, the FairCOCCO Score, a normalized metric that can quantify fairness in settings with single or multiple sensitive attributes of arbitrary type; and second, a regularization term that can be incorporated into arbitrary learning objectives to obtain fair predictors. These contributions address crucial gaps in the algorithmic fairness literature, and we empirically demonstrate consistent improvements against state-of-the-art techniques in balancing predictive power and fairness on real-world datasets.

1. INTRODUCTION

There is a clear need for scalable and practical methods that can be easily incorporated into machine learning (ML) operations to ensure that deployed models do not inadvertently disadvantage one group over another. The ML community has responded with a number of methods designed to ensure that predictive models are fair, under a variety of definitions that we explore later (Caton & Haas, 2020). Perhaps due to the archetypal fairness example, an investigation into the COMPAS software that found racial discrimination in the assessment of recidivism risk (Angwin et al., 2016), most of the focus has been on single, binary variables: in that case, race was treated as an indicator of whether an individual was black or white. This, combined with a discrete target, allows easy analysis of fairness criteria such as demographic parity and equalized odds (Barocas & Selbst, 2016; Hardt et al., 2016) through the rates of outcomes in the confusion matrices of the subgroups. In many practical applications, however, we have multiple attributes that we would like to protect, for example both race and sex; indeed, U.S. federal law protects groups from discrimination based on nine protected classes (EEOC, 2021). Algorithms deployed in the real world therefore need to be capable of protecting multiple attributes both jointly (e.g. 'black woman') and individually (e.g. 'black' and 'woman'). This is non-trivial and cannot be achieved simply by introducing separate fairness conditions for each attribute: such an approach fails to provide joint protection of the sensitive attributes, and it complicates matters by introducing additional hyperparameters that need to be traded off against each other. Matters are further complicated by the fact that many sensitive attributes (e.g. age) and outcomes (e.g. credit limit) take on continuous values, for which calculated rates do not make sense.
Existing methods simply discretise these into categorical bins, which leads to several issues in practice: it entails thresholding and data sparsity effects while discarding ordering information. As we shall see in Section 4, this approach is unlikely to be optimal in delivering discriminative yet fair predictors.

Contributions and Outline. Consequently, we introduce two practical tools to the community, which we hope can be used to more easily incorporate fairness into a standard ML pipeline: (1) Fairness metric. We introduce the FairCOCCO Score, a flexible normalized metric that can quantify the level of independence-based fairness in tasks with multitype and multivariate sensitive attributes by employing the cross-covariance operator on reproducing kernel Hilbert spaces (RKHS); and (2) Fairness regulariser. Based on the FairCOCCO Score, we construct a fairness regulariser that can easily be added to arbitrary learning objectives for fairness-aware learning. In what follows, we introduce current notions of fairness alongside contemporary methods for fair learning (Section 2), before presenting our contributions and explaining how they fill crucial gaps in the literature (Section 3). With that established, we illustrate the practical advantages of FairCOCCO in a series of demonstrations on multiple real-world datasets across a variety of modalities, quantitatively demonstrating consistent improvements over state-of-the-art techniques (Section 4). We conclude with a discussion of future work and societal implications (Section 5).

2. BACKGROUND

Table 1 : Popular definitions of fairness. Defined in terms of (conditional) independence requirements.

Definition   Requirement
FTU          (A ⊥⊥ Ŷ) | X \ A
DP           A ⊥⊥ Ŷ
EO           (A ⊥⊥ Ŷ) | Y
CAL          (A ⊥⊥ Y) | Ŷ

Fairness Notions. Let d_X, d_Y, d_A be the dimensions of the measurable spaces X ⊂ R^{d_X}, Y ⊂ R^{d_Y} and A ⊂ R^{d_A}, respectively. We introduce the random variable X defined on X to denote the features; Y and A are defined similarly and denote the target and the sensitive attribute(s) that we want to protect (e.g. gender or race). Note that A can be part of X, i.e. with a slight abuse of notation, we can write A ⊂ X. We are mainly concerned with quantifying group fairness, which requires that protected groups (e.g. black applicants) be treated similarly to advantaged groups (e.g. white applicants) (Caton & Haas, 2020). In Table 1, we highlight four popular definitions and how each captures a different aspect of fairness. Fairness through unawareness (FTU) (Grgic-Hlaca et al., 2016) prohibits the algorithm from using sensitive attributes explicitly in making predictions. While straightforward to implement, this method ignores the indirect discriminatory effect of proxy covariates that are correlated with A, e.g. "redlining" (Avery et al., 2009). Demographic parity (DP) (Barocas & Selbst, 2016; Zafar et al., 2017) accounts for indirect discrimination by requiring statistical independence between predictions and attributes, Ŷ ⊥⊥ A. Evidently, this strict notion sacrifices predictive utility by ignoring all correlations between Y and A, thereby precluding the optimal predictor. Dwork et al. (2012), most notably, argue that this approach permits laziness, which can hurt fairness in the long run. To address some of these concerns, Hardt et al. (2016) introduced equalized odds (EO), requiring that predictions Ŷ and attributes A be independent given the true outcome Y, i.e. Ŷ ⊥⊥ A | Y. This approach recognizes that sensitive attributes have predictive value, but only allows A to influence Ŷ to the extent allowed for by the true outcome Y.
For binary predictions and sensitive attributes, a metric known as the difference in equal opportunity (DEO) captures how predictions differ across group memberships:

DEO = |P(Ŷ = 1 | A = 1, Y = 1) - P(Ŷ = 1 | A = 0, Y = 1)|

Additional notions of fairness include calibration (CAL) (Kleinberg et al., 2016), which requires that predictions be calibrated between subgroups, i.e. Y ⊥⊥ A | Ŷ. For a comprehensive review of fairness notions, we defer to §3 in Caton & Haas (2020). In the remaining sections, we illustrate our proposed methods using the framework of EO, but this is without loss of generality, as our method is compatible with any dependency-based fairness measure. It is important to note that there is no universal measure of fairness; the correct notion depends on the ethical, legal and technical context.
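For concreteness, DEO can be computed directly from empirical rates. Below is a minimal sketch (the helper name `deo` is ours, not a library function):

```python
import numpy as np

def deo(y_hat, y, a):
    """DEO = |P(Y_hat=1 | A=1, Y=1) - P(Y_hat=1 | A=0, Y=1)| for binary arrays."""
    y_hat, y, a = (np.asarray(v) for v in (y_hat, y, a))
    tpr1 = y_hat[(a == 1) & (y == 1)].mean()  # true-positive rate in group A=1
    tpr0 = y_hat[(a == 0) & (y == 1)].mean()  # true-positive rate in group A=0
    return abs(tpr1 - tpr0)

# A predictor whose true-positive rate differs between groups:
y     = np.array([1, 1, 1, 1, 0, 0, 0, 0])
a     = np.array([1, 1, 0, 0, 1, 1, 0, 0])
y_hat = np.array([1, 1, 1, 0, 0, 0, 0, 0])
print(deo(y_hat, y, a))  # 0.5: TPR is 1.0 for A=1 but only 0.5 for A=0
```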

2.1. RELATED WORKS

Technical approaches to algorithmic fairness can be categorized into three main types: prior to modelling (pre-processing), during modelling (in-processing) or after modelling (post-processing) (del Barrio et al., 2020). The work herein falls into the category of in-processing techniques, which achieve fairness by incorporating either constraints or regularisers. Most such works consider settings with a single, binary label and attribute (Kamishima et al., 2012; Goel et al., 2018; Jiang et al., 2020; Donini et al., 2018), where fairness quantification is straightforward: one compares rates of outcomes between subgroups. However, settings involving continuous variables are significantly more challenging (Bergsma, 2004). Recent efforts in fair regression (where only outcomes are continuous) (Agarwal et al., 2019; Chzhen et al., 2020) relax the assumption of discrete outcomes, but the joint protection of multiple sensitive attributes of arbitrary type remains open. Put formally, the prediction should be jointly independent of (i.e. fair with respect to) multiple sensitive attributes while also being independent of each individual attribute. Fortunately, the latter is already implied by the former due to the decomposition property:

Ŷ ⊥⊥ (A_1, . . . , A_{d_A}) | Y ⟹ Ŷ ⊥⊥ A_i | Y ∀ i ∈ {1, . . . , d_A}   (1)

However, we cannot naively extend existing methods to protect multiple attributes by introducing separate conditions on each attribute, since the converse of (1) does not hold in general. In other words, while this naive approach can ensure individual protection, it does not guarantee protection of all attributes simultaneously. A related stream of research investigates intersectional fairness (Kearns et al., 2018; Foulds et al., 2020), which models the combinatorial intersection of various subgroups. However, it only considers discrete attributes and outcomes, and one notion of fairness (DP). Table 2 provides an overview and comparison of related works.
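A small numeric illustration of why the converse of (1) fails: with an XOR construction, predictions can look independent of each sensitive attribute marginally while being fully determined by the attributes jointly. The sketch below uses synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
a1 = rng.integers(0, 2, 100_000)   # first binary sensitive attribute
a2 = rng.integers(0, 2, 100_000)   # second binary sensitive attribute
y_hat = a1 ^ a2                    # prediction = XOR of the two attributes

# Marginally, y_hat looks fair with respect to each attribute alone...
print(abs(np.corrcoef(y_hat, a1)[0, 1]))  # ~0
print(abs(np.corrcoef(y_hat, a2)[0, 1]))  # ~0

# ...but the joint subgroup (a1=1, a2=1) is treated completely differently:
joint = (a1 == 1) & (a2 == 1)
print(y_hat.mean())         # ~0.5 overall positive rate
print(y_hat[joint].mean())  # 0.0 in the joint subgroup
```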

3. EVALUATING AND LEARNING FAIRNESS

In this section, we introduce FairCOCCO, a strong fairness measure from which we develop a metric and regulariser for fair learning. It applies kernel measures to quantify and control the level of dependence between algorithm predictions and protected attributes, such that the fairness requirements in Table 1 can be enforced.

3.1. MEASURE: KERNEL-BASED FAIRNESS

We propose a measure based on the conditional cross-covariance operator in a reproducing kernel Hilbert space (RKHS). An RKHS H_Y is a Hilbert space of functions in which each point evaluation f(y), for any y ∈ Y and f ∈ H_Y, is a bounded linear functional. Distributions of variables can be embedded into the RKHS through kernels, where inference of higher-order moments and dependence between distributions can be performed (Bach & Jordan, 2002; Gretton et al., 2005).

Unconditional fairness. We start by describing how operators in the RKHS can be used to evaluate fairness in the unconditional case (DP), by quantifying the reliance of model predictions Ŷ on sensitive attributes A. The cross-covariance operator (CCO) Σ_ŶA : H_A → H_Y is the unique, bounded operator that satisfies the relation:

⟨g, Σ_ŶA f⟩_{H_Y} = E_ŶA[g(Ŷ)f(A)] - E_Ŷ[g(Ŷ)] E_A[f(A)],   (2)

for all f ∈ H_A and g ∈ H_Y. Intuitively, the Σ_ŶA operator extends the covariance matrix defined on Euclidean spaces to represent higher (possibly infinite) order covariance between Ŷ and A through the kernel mappings g(Ŷ) and f(A). Additionally, we can obtain a normalized operator, the normalized cross-covariance operator (NOCCO) V_ŶA (Baker, 1973):

V_ŶA = Σ_ŶŶ^{-1/2} Σ_ŶA Σ_AA^{-1/2},   (3)

where Σ_ŶŶ and Σ_AA are defined analogously to (2). This normalization is analogous to the relationship between covariance and correlation: it disentangles the influence of the marginals while retaining the same dependence information. Intuitively, we have obtained a strong measure of dependence between predictions and sensitive attributes by leveraging the RKHS to represent higher-order moments.

Conditional fairness.
For many notions of fairness (e.g. EO and CAL), we also require a measure of conditional fairness. We frame the discussion around EO, where the prediction should be independent of the sensitive attribute given the true outcome, Ŷ ⊥⊥ A | Y; it is straightforward to adapt this to CAL by exchanging the roles of the variables. We can derive a normalized conditional cross-covariance operator (COCCO), V_ŶA|Y, by manipulating (3):

V_ŶA|Y = V_ŶA - V_ŶY V_YA   (4)

where V_ŶY and V_YA are defined analogously to (3). In line with the intuition established previously, this operator measures higher-order partial correlation through the function transformations f(A) ∀ f ∈ H_A and g(Ŷ), h(Y) ∀ g, h ∈ H_Y. We round off this discussion by characterizing the relation between the V_ŶA|Y operator and conditional fairness.

Lemma 3.1 (COCCO and Conditional Fairness (Fukumizu et al., 2007)). Denote Ä ≜ (A, Y) and the product of kernels k_Ä ≜ k_A k_Y, and further assume that k_Ä is a characteristic kernel. Then:

V_ŶÄ|Y = 0 ⟺ Ŷ ⊥⊥ A | Y   (5)

Note that Ä denotes the extended variable set; for ease of notation, we write V_ŶA|Y in place of V_ŶÄ|Y from this point onward. Equations (3) and (4) give us a way to measure unconditional and conditional fairness, respectively, with lower values indicating higher levels of fairness. Additionally, we note that (3) can be viewed as a special case of (4) with Y = ∅, i.e. V_ŶA = 0 ⟺ Ŷ ⊥⊥ A.
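In the linear-kernel special case, the structure of (4) mirrors the classical partial correlation. The following sketch (synthetic data, our naming) constructs a predictor that depends on A only through Y: its marginal correlation with A is large, yet the partial correlation given Y vanishes:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000
a = rng.normal(size=n)                 # sensitive attribute
y = a + rng.normal(size=n)             # true outcome, correlated with a
y_hat = y + 0.1 * rng.normal(size=n)   # prediction uses a only through y

corr = lambda u, v: np.corrcoef(u, v)[0, 1]
r_pa, r_py, r_ya = corr(y_hat, a), corr(y_hat, y), corr(y, a)

# Linear analogue of V_{YhatA|Y} = V_{YhatA} - V_{YhatY} V_{YA}:
partial = (r_pa - r_py * r_ya) / np.sqrt((1 - r_py**2) * (1 - r_ya**2))
print(abs(r_pa))     # large marginal correlation with the attribute
print(abs(partial))  # ~0: nothing remains once Y is conditioned on
```

Such a predictor satisfies EO while violating DP, matching the discussion of the two notions above.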

3.2. METRIC: FAIRCOCCO SCORE

Having described a kernel-based measure of fairness, we propose a fairness metric that is applicable to conditional and unconditional fairness, as well as to settings with multiple sensitive attributes of arbitrary (continuous or discrete) type. Many metrics (e.g. DEO (Hardt et al., 2016) to evaluate EO, and DI (Feldman et al., 2015) to evaluate DP) have been proposed for binary fairness settings. However, their utility is limited to classification tasks with single binary sensitive attributes. This is insufficient in real-world conditions, where there often exist many sensitive attributes that can be discrete or continuous. To address these challenges, we propose the FairCOCCO Score, which can evaluate fairness with respect to several attributes of mixed type and for both continuous and discrete outcomes. We start by summarizing the information contained in V_ŶA into a single statistic using the squared Hilbert-Schmidt (HS) norm (Bach & Jordan, 2002):

I = ||V_ŶA||²_HS   (6)

This scalar value can be estimated from samples analytically, and we provide the complete closed-form expression in Appendix A. By Lemma 3.1, we know that ||V_ŶA||²_HS = 0 ⟺ Ŷ ⊥⊥ A; thus, values closer to zero indicate higher levels of fairness. However, while (6) is non-negative, it can be arbitrarily large, which makes it hard to interpret and to compare across tasks. To address this, we propose the normalized FairCOCCO Score:

Definition 3.2 (FairCOCCO Score).

FairCOCCO Score (unconditional) = ||V_ŶA||²_HS / (||V_ŶŶ||_HS ||V_AA||_HS)   (7)

FairCOCCO Score (conditional) = ||V_ŶÄ|Y||²_HS / (||V_ŶŶ|Y||_HS ||V_ÄÄ|Y||_HS)   (8)

Both scores take values in [0, 1], where values closer to 0 indicate higher levels of fairness, and vice versa. This normalization scheme is derived from the Cauchy-Schwarz inequality and can be understood as taking into account the (conditional) variance within each variable (c.f. the relationship between covariance and correlation).
In Appendix A, we derive the metric and its conditional counterpart, and additionally demonstrate how the measure (6) can be used to perform (conditional) independence testing for additional transparency and interpretability. The FairCOCCO Score can be used to measure any of the independence-based notions of fairness. In particular, to make the connection with Table 1 clear, the statistics for the different notions can be expressed as:

I_EO = ||V_ŶA|Y||²_HS,   I_CAL = ||V_YA|Ŷ||²_HS,   I_DP = ||V_ŶA||²_HS   (9)

3.3. LEARNING: FAIRCOCCO LEARNING

Now that we have established the FairCOCCO Score, which can be used to detect (un-)fairness, we move on to how it can be employed to obtain fair predictors. We focus on a standard supervised learning setup with the task of learning the map X × A → Y, subject to a given fairness condition F. Given a batch D of N training triplets {(X_i, Y_i, A_i)}_{i=1}^N, a learning function f_θ(·) with learnable parameters θ ∈ Θ, a training loss L, and I_F denoting one of the fairness statistics from (9), which takes the batch and learning function and returns the corresponding score, we arrive at a constrained optimization problem:

min_{θ∈Θ} (1/N) Σ_{i=1}^N L(f_θ(x_i), y_i)   subject to   I_F(D, f_θ) = 0   (10)

Practically speaking, this can be relaxed via a Lagrangian to obtain an unconstrained optimization problem that can be solved significantly more easily:

min_{θ∈Θ} (1/N) Σ_{i=1}^N L(f_θ(x_i), y_i) + λ I_F(D, f_θ)   (11)

The summary statistic (6), and therefore I_F(D, f_θ), is differentiable and can thus be employed as a regulariser in any gradient-based method, with λ > 0 a hyperparameter that determines the fairness-performance trade-off: a higher λ guarantees higher fairness, but this typically leads to lower predictive performance. Consequently, this measure can be used to quantify and enforce fairness notions by controlling the dependence between Ŷ, A and Y.
We term this regularization scheme FairCOCCO Learning.
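As an illustrative sketch of the Lagrangian relaxation above, consider the linear-kernel special case, where the cross-covariance penalty reduces to the squared covariance between predictions and the attribute. The code below (synthetic data and helper names are ours; the paper's regulariser uses the full kernel statistic) trains a linear model by gradient descent on the penalized objective:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
a = rng.normal(size=n)                     # continuous sensitive attribute
x = np.c_[a + 0.5 * rng.normal(size=n),    # proxy feature correlated with a
          rng.normal(size=n)]              # neutral feature
y = x @ np.array([1.0, 1.0]) + 0.1 * rng.normal(size=n)
a_c = a - a.mean()

def fit(lam, steps=5000, lr=0.005):
    """Gradient descent on MSE + lam * cov(y_hat, a)^2 (linear-kernel sketch)."""
    w = np.zeros(2)
    for _ in range(steps):
        y_hat = x @ w
        grad_mse = 2 * x.T @ (y_hat - y) / n
        c = (y_hat - y_hat.mean()) @ a_c / n       # empirical cov(y_hat, a)
        grad_fair = 2 * c * (x.T @ a_c) / n        # gradient of c**2 w.r.t. w
        w -= lr * (grad_mse + lam * grad_fair)
    return w

def fairness_cov(w):
    y_hat = x @ w
    return abs((y_hat - y_hat.mean()) @ a_c / n)

cov_unfair = fairness_cov(fit(0.0))
cov_fair = fairness_cov(fit(50.0))
print(cov_unfair, cov_fair)  # the penalty drives cov(y_hat, a) toward 0
```

Increasing λ shrinks the dependence between predictions and the sensitive attribute at some cost in MSE, mirroring the fairness-performance trade-off described above.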

3.4. IN SUMMARY

The proposed kernel fairness measure provides a non-parametric and strong characterization of fairness. The kernel mappings allow both multivariate continuous and discrete variables to be embedded into the RKHS, from which we infer higher-order dependencies, and thus fairness effects. This enables the evaluation of multivariate, multitype fairness problems as commonly encountered in the real world. Additionally, the proposed metric and regularization method are compatible with all dependency-based notions of fairness (as in Table 1), giving practitioners more flexibility in choosing the appropriate definition for their scenario. In our experiments, we use a Gaussian kernel k(X_i, X_j) = exp(-||X_i - X_j||² / (2σ²)) ∀ i, j ∈ N, where the bandwidth parameter σ is selected with the median heuristic, σ = median{||x_i - x_j||, ∀ i ≠ j ∈ N} (Schölkopf et al., 2002). As the calculation of (9) comprises a matrix inversion, the computational complexity scales with the number of samples as O(N³). We improve the scaling with training samples in two ways: (1) by employing a low-rank Cholesky decomposition of the Gram matrix (of rank r), resulting in O(r²N) complexity (Harbrecht et al., 2012), and (2) by estimating the regulariser on mini-batches. We empirically investigate the effect of these relaxations on fairness estimation in Appendix C.2 and demonstrate that they lead to strong results in real-world experiments.
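To make the computation concrete, the following minimal sketch estimates the unconditional FairCOCCO Score from samples, following the closed form derived in Appendix A (Gaussian kernel with the median heuristic, centered Gram matrices, and proxy matrices R = G(G + εNI)⁻¹). The function names are ours, and none of the low-rank speedups are applied:

```python
import numpy as np

def centered_gram(v):
    """Centered Gaussian-kernel Gram matrix with the median-heuristic bandwidth."""
    v = np.asarray(v, dtype=float).reshape(len(v), -1)
    d2 = ((v[:, None, :] - v[None, :, :]) ** 2).sum(-1)
    sigma = np.median(np.sqrt(d2[d2 > 0]))
    g = np.exp(-d2 / (2 * sigma ** 2))
    h = np.eye(len(v)) - 1.0 / len(v)    # centering matrix
    return h @ g @ h

def proxy(g, eps=1e-4):
    """Proxy Gram matrix R = G (G + eps * N * I)^-1, as in Appendix A."""
    n = len(g)
    return g @ np.linalg.inv(g + eps * n * np.eye(n))

def faircocco_score(y_hat, a):
    """Unconditional FairCOCCO Score in [0, 1]; values near 0 indicate fairness."""
    r_y, r_a = proxy(centered_gram(y_hat)), proxy(centered_gram(a))
    num = np.trace(r_y @ r_a)
    den = np.sqrt(np.trace(r_y @ r_y) * np.trace(r_a @ r_a))
    return num / den

rng = np.random.default_rng(0)
a = rng.normal(size=200)
s_ind = faircocco_score(rng.normal(size=200), a)             # independent prediction
s_dep = faircocco_score(a + 0.01 * rng.normal(size=200), a)  # near-copy of a
print(s_ind, s_dep)  # s_ind is small; s_dep is close to 1
```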

4. EXPERIMENTAL DEMONSTRATION

We now turn our attention to how our proposed method works in practice. We perform experiments within the EO framework, since it is usually considered the most challenging, and it covers the middle ground between the strict DP and lenient FTU definitions. However, we reiterate that our method is framework-agnostic and attach further results under alternative definitions in Appendix C.1. There are a number of areas that require empirical demonstration, and so we proceed as follows:

1. First, in Section 4.1, we employ standard real-world benchmarks to compare against existing methods on single binary attributes and outcomes, obtaining competitive (and usually superior) predictive performance while consistently producing the best DEO score.
2. Then, in Section 4.2, we apply FairCOCCO to real data with multiple attributes and continuous outcomes. To the best of our knowledge, no other method naturally extends to this setting, and FairCOCCO now sets a strong benchmark for future work.
3. Finally, in Section 4.3, we consider the more complicated setting of fair learning on image data and time series, demonstrating important applications of our method to sepsis treatment and facial recognition.

In the interest of limited space, we attach additional results in Appendix C. Specifically, we include experiments on:

4. Different notions of fairness: evaluating the accuracy-fairness trade-off under different definitions of fairness (specifically DP and CAL);
5. Statistical testing: demonstrating the FairCOCCO Score as a test statistic for stronger fairness transparency;
6. Sensitivity analysis: evaluating the performance of our method with varying numbers of sensitive attributes.

Benchmarks.
We compare against state-of-the-art fairness methods, including classic baselines (Zafar et al., 2017; Hardt et al., 2016; Donini et al., 2018) and more recent methods that adopt a stronger fairness quantification: FACL (Mary et al., 2019) and FARMI (Steinberg et al., 2020b), which leverage MCC and MI, respectively. Datasets. Following the experiment design in recent works (Hardt et al., 2016; Donini et al., 2018), we employ 9 real-world datasets from the UCI machine learning repository (Dua & Graff, 2017). Specifically, we consider 4 datasets containing single sensitive attributes and binary outcomes, and 5 datasets with multiple sensitive attributes and outcomes of arbitrary type.

4.1. BINARY ATTRIBUTES AND OUTCOMES

While the focus of this work is on introducing practical methods for fairness in multitype, multivariate settings, we first show that FairCOCCO is also competitive with state-of-the-art methods on problems with binary sensitive attributes and outcomes. We reproduce benchmarks based on UCI's Drugs, German, Adult and COMPAS datasets. We compare against representative methods in the literature as well as a standard (unfair) neural network (NN). For the strong fairness methods, specifically our method, FACL and FARMI, we employ the same NN as the underlying predictive model to ensure comparability. We report our results in Table 3. FairCOCCO achieves higher levels of fairness (lower DEO) while maintaining strong predictive accuracy on all datasets except Drugs. We note that FACL (Mary et al., 2019) is specifically tailored to settings with a binary sensitive attribute and outcome, whereas our method applies more generally to settings with multitype, multivariate sensitive attributes.

4.2. CONTINUOUS ATTRIBUTES AND OUTCOMES

Next, we illustrate the main contributions of our work by demonstrating that FairCOCCO can protect fairness in settings involving multiple sensitive attributes and outcomes of arbitrary type. We employ the Crimes and Communities (C&C), Credit Card, KDD-Census, Law School and Students datasets from the UCI repository. We start by looking at the protection of single continuous attributes, before examining the joint protection of multiple sensitive attributes.

Single continuous attribute. We compare our method against our closest competitors, FACL and FARMI. While FACL does not support multiple attributes, it is applicable to the protection of a single continuous variable. FARMI is only compatible with discrete sensitive attributes; we thus binarise the sensitive attributes at the median during training. We take the C&C and Students datasets and use the protected attributes racePctBlack and age, respectively. We plot performance versus fairness by varying the fairness penalty in Figure 1. Notably, FairCOCCO obtains a better trade-off between fairness and MSE than both methods (optimum desideratum at the origin).

Multiple (arbitrary type) attributes. Going one step further, we evaluate the concurrent protection of multiple sensitive attributes. While this is natural for FairCOCCO, to the best of our knowledge there are no existing methods that can jointly protect multiple sensitive attributes of arbitrary type. To enable an adequate comparison, we adapt FACL and FARMI by including a separate regularization term for each attribute; in contrast, the FairCOCCO regularization is applied directly and jointly to all sensitive attributes. Previously, we showed that the protection of individual fairness effects does not guarantee protection of joint fairness. To that end, we analyze both joint fairness effects and protection w.r.t. individual attributes. In Table 4, we evaluate the joint fairness (Joint) and the fairness on individual attributes (e.g. racePctBlack, racePctAsian, racePctHisp on C&C). To evaluate individual fairness, we also calculate the DEO by binarising the attributes at the median during evaluation. We first note that the FairCOCCO and DEO scores are highly correlated in their respective estimations of unfairness. However, the key result we wish to highlight is that not only does FairCOCCO successfully minimize joint unfairness, it also consistently minimizes the level of unfairness for each sensitive attribute. The same cannot be said for FARMI and FACL, where the joint fairness outcomes are inadequate, as the protection granted to one attribute is traded off to the detriment of the others. To better investigate the sensitivity of our method to the number of sensitive attributes, we report the performance-fairness trade-off with varying numbers of protected attributes in Appendix C.4.

4.3. BEYOND TABULAR DATA

CelebA facial attributes recognition. In this section, we highlight that FairCOCCO can be applied beyond tabular data by experimenting on the CelebA dataset (Liu et al., 2015). The CelebA dataset contains images of celebrity faces, where each face is associated with binary attributes, including gender. We follow the experimental design in (Chuang & Mroueh, 2021) and form binary classification tasks using the attributes attractive, smile and wavy hair, treating gender as the sensitive attribute. We fine-tune a ResNet-18 (He et al., 2016) with two additional hidden layers to perform the classification task. We report the results in Table 5, noting similar improvements in fairness with little decrease in accuracy, especially when classifying attractive and wavy hair.

Sepsis treatment. Finally, we emphasize that FairCOCCO is not limited to the standard supervised learning setup and demonstrate how our approach can be applied to learning fairer policies in a time-series setting. We employ the MIMIC-III ICU database (Johnson et al., 2016a), containing data routinely collected from adult patients in the United States. We analyze the decisions made by clinicians to treat sepsis, using a patient cohort fulfilling the Sepsis-3 criteria, as delineated by Komorowski et al. (2018). For each patient, we have relevant physiological parameters recorded at a 4-hour resolution, together with static demographic context. The task is to predict the clinical intervention to treat sepsis by learning from clinicians' actions; for this, we have access to a binary variable corresponding to clinical interventions targeting sepsis. Ground-truth treatment outcomes are computed from SOFA scores (measuring sequential organ failure) and lactate levels (correlated with the severity of sepsis) in the subsequent time step, and we consider gender as the sensitive attribute. For the complete problem setup, refer to Appendix B.2.
Table 6 indicates that FairCOCCO successfully reduces bias contained in the expert demonstrations and achieves the best predictive and fairness performance compared to FACL and FARMI.

5. DISCUSSION

In this work, we proposed FairCOCCO, a kernel-based fairness measure that strongly quantifies the level of unfairness in the presence of multiple sensitive attributes of mixed type. Specifically, we introduced a normalized fairness metric (the FairCOCCO Score), applicable to different problem settings and dependency-based fairness notions, and a fairness regularization scheme. Through our experiments, we empirically demonstrated a superior fairness-prediction trade-off and protection of both joint and individual fairness outcomes. Limitations and future works. The main limitation of our work is computational complexity: the matrix operations required to kernelise the data and embed it in the RKHS have complexity O(N³). We propose two directions to alleviate this (low-rank approximation and mini-batch evaluation), which empirically do not noticeably impact performance. Future work should consider speeding up kernel operations using methods such as those proposed in Zhang et al. (2012). Additionally, while our regulariser can be applied to any model trained with gradient-based methods, future work should extend our approach to be compatible with powerful decision-tree-based algorithms.

A MORE ON FAIRCOCCO

A.1 CLOSED-FORM EXPRESSION

We introduced covariance operators on RKHSs, which can be used to quantify unconditional fairness, V_ŶA, and conditional fairness, V_ŶÄ|Y. FairCOCCO is based on the Hilbert-Schmidt (HS) norm of these covariance operators. An operator A : H_1 → H_2 is called HS if, for complete orthonormal systems {φ_i} of H_1 and {ψ_j} of H_2, the sum Σ_{i,j} ⟨ψ_j, Aφ_i⟩² is finite (Reed & Simon, 1980). Thus, for an HS operator A, the HS norm ||A||_HS is defined by ||A||²_HS = Σ_{i,j} ⟨ψ_j, Aφ_i⟩². Provided that V_ŶÄ|Y and V_ŶA are HS operators, the FairCOCCO measures can be expressed as:

||V_ŶÄ|Y||²_HS   (conditional fairness measure)
||V_ŶA||²_HS   (unconditional fairness measure)

The umlaut on A represents the extended variable set, i.e. Ä = (A, Y). Here, we briefly flesh out the closed-form expressions of the empirical estimators; more details can be found in (Fukumizu et al., 2007; Gretton et al., 2005). Let G_Y be the centered Gram matrix, such that:

G_{Y,ij} = ⟨k_Y(·, Y_i) - m^(N)_Y, k_Y(·, Y_j) - m^(N)_Y⟩_{H_Y}

We choose a Gaussian RBF kernel, k(Y_i, Y_j) = exp(-||Y_i - Y_j||²/(2σ²)) ∀ i, j ∈ N, with the bandwidth σ selected by the median heuristic, σ = median{||Y_i - Y_j||, ∀ i ≠ j ∈ N}. Additionally, m^(N)_Y = (1/N) Σ_{i=1}^N k_Y(·, Y_i) is the empirical mean; G_A and G_Ŷ are defined similarly. Based on this, the proxy Gram matrix R_Y can be defined as follows:

R_Y = G_Y (G_Y + εN I_N)^{-1}

where ε = 10⁻⁴ is a regularization constant, used in the same way as in Bach & Jordan (2002), I_N is the identity matrix, and R_Ŷ, R_A are defined similarly. The empirical estimator of ||V̂_ŶÄ|Y||²_HS can then be computed:

Î = ||V̂_ŶÄ|Y||²_HS = Tr[R_Ŷ R_Ä - 2 R_Ŷ R_Ä R_Y + R_Ŷ R_Y R_Ä R_Y]   (12)

The unconditional fairness score can similarly be estimated empirically (note that the unconditional case does not use extended variables):

Î = ||V̂_ŶA||²_HS = Tr[R_Ŷ R_A]   (14)

Choice of Kernels.
While, in general, kernel dependence measures depend not only on the variable distributions but also on the choice of kernel, Fukumizu et al. (2007) showed that, in the limit of infinite data and under richness assumptions on the RKHS, the estimates converge to a kernel-independent value. We employ a Gaussian RBF (a characteristic kernel) in our experiments.

On the computational complexity. For our experiments, we use a Gaussian RBF kernel, k(X_i, X_j) = exp(-||X_i - X_j||²/(2σ²)) ∀ i, j ∈ N, where σ is the tuneable bandwidth parameter, selected via the median heuristic of Schölkopf et al. (2002), σ = median{||x_i - x_j||, ∀ i ≠ j ∈ N}. As the calculation of (9) comprises a matrix inversion, the computational complexity scales with the number of samples as O(N³). We improve the scaling with training samples in two ways: (1) by employing a low-rank Cholesky decomposition of the Gram matrix (of rank r), resulting in O(r²N) complexity (Harbrecht et al., 2012), and (2) by estimating the regulariser on mini-batches. We empirically demonstrate that these relaxations lead to strong results in real-world experiments.
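The low-rank step can be sketched with a greedy pivoted Cholesky factorization, which touches only one column of the Gram matrix per selected pivot. This is a generic sketch of the idea (helper names ours), not the authors' implementation:

```python
import numpy as np

def pivoted_cholesky(diag, col, n, rank, tol=1e-10):
    """Greedy pivoted Cholesky: returns L (n x k, k <= rank) with L @ L.T ~ K.

    diag(i) -> K[i, i]; col(i) -> K[:, i]; the full K is never required.
    """
    d = np.array([diag(i) for i in range(n)], dtype=float)
    L = np.zeros((n, rank))
    chosen = []
    for k in range(rank):
        i = int(np.argmax(d))
        if d[i] < tol:              # residual exhausted early
            return L[:, :k]
        chosen.append(i)
        residual_col = col(i) - L[:, :k] @ L[i, :k]
        L[:, k] = residual_col / np.sqrt(d[i])
        d -= L[:, k] ** 2
        d[chosen] = 0.0             # guard against numerical re-selection
    return L

# Rank-30 approximation of a 300 x 300 Gaussian Gram matrix:
rng = np.random.default_rng(0)
x = rng.normal(size=300)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 2)
L = pivoted_cholesky(lambda i: 1.0, lambda i: K[:, i], len(x), rank=30)
err = np.abs(K - L @ L.T).max()
print(L.shape, err)  # error is tiny: Gaussian Gram matrices are near low-rank
```

Because each of the r iterations costs O(rN), the factorization runs in O(r²N), matching the complexity quoted above.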

A.2 FAIRCOCCO SCORE

Here, we derive the FairCOCCO Score from the underlying measure using the Cauchy-Schwarz inequality. The FairCOCCO Scores for unconditional and conditional fairness can be written as:

FairCOCCO Score (unconditional) = ||V_ŶA||²_HS / (||V_ŶŶ||_HS ||V_AA||_HS)

FairCOCCO Score (conditional) = ||V_ŶÄ|Y||²_HS / (||R_Ŷ - R_Ŷ R_Y||_HS ||R_Ä - R_Ä R_Y||_HS)

Starting with the unconditional version, we know from (14) and the Cauchy-Schwarz inequality for the inner product ⟨·, ·⟩ that:

|||V̂_ŶA||²_HS| = |Tr[R_Ŷ R_A]| = |⟨R_Ŷ^T, R_A⟩| ≤ ||R_Ŷ||_HS ||R_A||_HS = √(Tr[R_Ŷ^T R_Ŷ]) √(Tr[R_A^T R_A]) = ||V̂_ŶŶ||_HS ||V̂_AA||_HS

By this inequality, the unconditional FairCOCCO Score lies in [-1, 1]. As the score is also non-negative, it takes values in [0, 1], where 0 indicates perfect fairness (as indicated by Lemma 3.1). By contrast, the score takes the value 1 iff the Gram matrices R_Ŷ and R_A are linearly dependent (i.e. perfectly unfair). The derivation and interpretation for the conditional case are analogous:

|||V̂_ŶA|Y||²_HS| = |Tr[R_Ŷ R_A - 2 R_Ŷ R_A R_Y + R_Ŷ R_Y R_A R_Y]| = |Tr[(R_Ŷ - R_Ŷ R_Y)(R_A - R_A R_Y)]| = |⟨(R_Ŷ - R_Ŷ R_Y)^T, (R_A - R_A R_Y)⟩| ≤ ||R_Ŷ - R_Ŷ R_Y||_HS ||R_A - R_A R_Y||_HS

Here, R_Ŷ - R_Ŷ R_Y is related to the conditional covariance operator V_ŶŶ|Y, which captures the conditional covariance of Ŷ given Y. See (Fukumizu et al., 2007; 2009; Baker, 1973) for more details.

B EXPERIMENTAL DETAILS

Adult (Kohavi, 1996). The task on the Adult dataset is to classify whether an individual's income exceeds $50K/year based on census data. There are 48,842 training instances and 14 attributes, 4 of which are sensitive (age, race, sex, native-country). Here, the sensitive attribute is chosen to be sex, which can be either female or male.

Drug Consumption (Drugs) (Mirkes, 2015). The classification problem is whether an individual consumed drugs, based on personality traits.
The dataset contains 1885 respondents and 12 personality measurements. Respondents are questioned on their use of 18 drugs, including a fictitious drug, Semeron, to identify over-claimers. Here, we focus on Heroin use and drop the respondents who claimed to have used the fictitious drug Semeron.

Demographic Parity. DP requires statistical independence between predictions and sensitive attributes. Disparate impact (DI) is a metric frequently used to evaluate DP (Feldman et al., 2015):

DI = P(Ŷ = 1 | A = 1) / P(Ŷ = 1 | A = 0),

where A = 1 and A = 0 denote the discriminated and non-discriminated groups, respectively. The US Equal Employment Opportunity Commission recommends that DI should not fall below 80%, commonly known as the 80%-rule. DI closer to 1 corresponds to lower levels of disparate impact across population subgroups. We show the performance of FairCOCCO for DP in Tables 9 and 10, demonstrating superior performance on a benchmark of binary classification tasks as well as protection of multiple sensitive attributes in regression settings.
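The DI metric and the 80%-rule check can be computed directly from binary predictions and group membership. A minimal sketch (function name is ours):

```python
import numpy as np

def disparate_impact(y_pred, a):
    """DI = P(Yhat=1 | A=1) / P(Yhat=1 | A=0),
    where A=1 marks the discriminated group and A=0 the non-discriminated group."""
    y_pred = np.asarray(y_pred, dtype=float)
    a = np.asarray(a)
    p1 = y_pred[a == 1].mean()  # positive rate in the discriminated group
    p0 = y_pred[a == 0].mean()  # positive rate in the non-discriminated group
    return p1 / p0

def satisfies_80_rule(y_pred, a):
    """EEOC 80%-rule: disparate impact should be at least 0.8."""
    return disparate_impact(y_pred, a) >= 0.8
```

A DI of 1 means both groups receive positive predictions at the same rate; values below 0.8 flag a potential disparate-impact violation.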

C.2 FAIRCOCCO ESTIMATION

In this section, we provide additional results on the convergence of FairCOCCO Score estimation as a function of batch size, similar to the experiment performed in the main paper. We show convergence on the Adult and German datasets in Figure 2. We note that while the convergence of the estimate depends on properties of the individual datasets, the estimation of the FairCOCCO Score stabilizes at batch sizes > 256.
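The estimator whose convergence we study can be sketched as follows. This is a minimal illustration assuming the empirical normalised Gram form R = G_c (G_c + Nε I)^{-1} from Fukumizu et al. (2007); the function names and the default regularisation constant ε are our own assumptions, not the paper's exact implementation.

```python
import numpy as np

def rbf_gram(x, sigma=1.0):
    """Gaussian RBF Gram matrix on 1-D data."""
    x = np.ravel(np.asarray(x, dtype=float))
    sq = (x[:, None] - x[None, :]) ** 2
    return np.exp(-sq / (2.0 * sigma ** 2))

def normalised_gram(G, eps=1e-3):
    """Empirical normalised Gram matrix R = Gc (Gc + N*eps*I)^{-1},
    with Gc the centred Gram matrix (Fukumizu et al., 2007).
    eps is a hypothetical default for the regularisation constant."""
    n = G.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n  # centring matrix
    Gc = H @ G @ H
    return Gc @ np.linalg.inv(Gc + n * eps * np.eye(n))

def faircocco_score(G_pred, G_attr, eps=1e-3):
    """Unconditional FairCOCCO Score:
    Tr[R_Yhat R_A] / sqrt(Tr[R_Yhat^2] Tr[R_A^2]), in [0, 1]."""
    R_y = normalised_gram(G_pred, eps)
    R_a = normalised_gram(G_attr, eps)
    num = np.trace(R_y @ R_a)
    den = np.sqrt(np.trace(R_y @ R_y) * np.trace(R_a @ R_a))
    return float(num / den)

def minibatch_score(y_pred, attr, batch_size=256, n_batches=10, seed=0):
    """Avoid the O(N^3) inversion by averaging the score over mini-batches."""
    rng = np.random.default_rng(seed)
    n = len(y_pred)
    scores = []
    for _ in range(n_batches):
        idx = rng.choice(n, size=min(batch_size, n), replace=False)
        scores.append(faircocco_score(rbf_gram(y_pred[idx]), rbf_gram(attr[idx])))
    return float(np.mean(scores))
```

When predictions and attribute are identical, the Cauchy-Schwarz bound is attained and the score is exactly 1; for independent variables the score approaches 0 as the sample grows.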

C.3 STATISTICAL TESTING

We demonstrate how the proposed fairness measures can be employed as test statistics to perform statistical tests, resulting in stronger guarantees and transparency (Fukumizu et al., 2007; Gretton et al., 2005). We highlight that while other fairness measures (MI and MCC) can also be developed into test statistics, the empirical estimation of these measures involves multiple levels of approximation, and it is unclear whether the approximated statistics retain the theoretical properties. Figure 3 shows the distributions of predictions with fairness regularization. Notably, EO only requires statistical independence between predictions and sensitive attributes given the true outcome, whereas DP enforces "strict" independence between predictions and attributes. As the null distribution is not known (Fukumizu et al., 2007), permutation testing is performed. Table 13 reveals the accuracy-fairness trade-offs and p-values under different regularization strengths. The p-values indicate the probability of observing a test statistic at least as extreme under the null hypothesis of (conditional) independence. As expected, stronger fairness regularization leads to lower levels of unfairness as measured by DI and DEO, as well as stronger guarantees in statistical tests. For example, at λ = 2.0, the p-value of 0.90 means we cannot reject the null hypothesis that predictions are conditionally independent of sensitive attributes (under EO), and the p-value of 0.27 means we cannot reject independence of predictions and attributes (under DP).
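The permutation-testing procedure can be sketched as follows. This is a generic illustration with a simple correlation statistic standing in for the FairCOCCO statistic; shuffling the sensitive attribute destroys any dependence on the predictions, which simulates draws from the null distribution.

```python
import numpy as np

def permutation_pvalue(stat_fn, y_pred, a, n_perm=200, seed=0):
    """Approximate the unknown null distribution of a dependence statistic
    by recomputing it on permuted copies of the sensitive attribute.
    Returns the (bias-corrected) fraction of permuted statistics at least
    as large as the observed one."""
    rng = np.random.default_rng(seed)
    observed = stat_fn(y_pred, a)
    null = [stat_fn(y_pred, rng.permutation(a)) for _ in range(n_perm)]
    return (1 + sum(s >= observed for s in null)) / (1 + n_perm)

def abs_corr(y, a):
    """Placeholder dependence statistic: absolute Pearson correlation."""
    return abs(np.corrcoef(y, a)[0, 1])
```

A small p-value rejects the null hypothesis of independence (unfair predictions); a large p-value means independence cannot be rejected at the chosen level.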



Weak and strong, as defined by Daudin (1980), refer to the strength of characterizations of dependence.
We make mild assumptions on the involved RKHSs, assuming they are separable and square integrable (Gretton et al., 2005), and employ characteristic kernels, e.g. Gaussian and Laplacian kernels.
We re-ran the available implementation in our own pipeline, reporting the best results between our re-runs and the originally reported scores.
www.uniformguidelines.com



Figure 1: Fairness accuracy trade-off. Crimes and Communities (top) and Students (bottom). Optimum desiderata at the origin, where both MSE and unfairness are minimized.


Figure 2: Estimation of FairCOCCO Score. (a) Adult dataset, (b) German dataset.

Figure 3: Visualizing FairCOCCO regularization. (Top) distribution of predictions for label 1 of different group memberships under EO. (Bottom) distribution of predictions for different group memberships under DP. Predictions are produced by regularized logistic regression model with λ = 0, λ = 0.5, λ = 1.0, respectively, across each row.

Figure 4: Fairness-accuracy trade-off. (left) C&C dataset with four sensitive attributes; (middle) students dataset with two sensitive attributes; (right) drugs dataset with three sensitive attributes.

Overview of related work for fairness-aware learning. Comparison made on the method of fairness estimation, the underlying model class and the following desiderata: (1) supports continuous outcomes; (2) supports continuous attributes; (3) protects multiple attributes; (4) compatible with all dependency-based notions of fairness (as in Table 1).

discretise continuous variables, but such approaches introduce unwanted threshold effects, discard order information and require sufficient sample coverage in each bin.

Let H_Y denote the RKHS on Y, with positive definite kernel k_Y; k_A and H_A are defined similarly. Formally, the problem of interest is quantifying the conditional fairness between Ŷ and A given Y on finite samples.

Performance in binary setting. Accuracy (ACC) and DEO on benchmark datasets. NN is an unregularised neural network, on top of which the regularizers from competitor methods and FairCOCCO are applied. Best results are emboldened.

Steinberg et al. (2020b): ACC 0.88 ± 0.01, DEO 0.03 ± 0.01 | ACC 0.71 ± 0.10, DEO 0.09 ± 0.14 | ACC 0.79 ± 0.05, DEO 0.04 ± 0.02 | ACC 0.80, DEO 0.10
FairCOCCO: ACC 0.89 ± 0.01, DEO 0.00 ± 0.01 | ACC 0.74 ± 0.03, DEO 0.02 ± 0.09 | ACC 0.80 ± 0.06, DEO 0.02 ± 0.01 | ACC 0.83, DEO 0.04

The benchmark datasets have different numbers of samples (ranging from 649 to 299285) and different feature counts (ranging from 10 to 128), giving a better understanding of our method's performance profile.

Protection of multiple attributes. Investigation on joint fairness effects and fairness protection with respect to individual sensitive attributes on array of benchmarks. Lowest MSE/ACC, FairCOCCO and DEO scores are emboldened.

Facial attribute recognition. Accuracy (ACC) and DEO on three separate classification tasks -attractive, smile, and wavy hair. Best results are emboldened.

Sepsis treatment. Accuracy (ACC), DEO and FairCOCCO score on learning fair sepsis treatment policies; the best results are emboldened.

Performance in binary setting. Accuracy (ACC) and DI under DP. NN is an unregularised neural network that is used as base learner; the best results are emboldened.

Protection of multiple attributes. Level of protection provided to individual attributes when all attributes are simultaneously protected under DP. Lowest MSE & FairCOCCO scores are emboldened. (left) C&C dataset, (right) Students dataset.

Calibration. CAL requires conditional independence between the target and sensitive attributes given predictions. As the conditioning variable is continuous, we report the FairCOCCO score on the same experiments. We see in Tables 11 and 12 that FairCOCCO achieves superior fairness and predictive outcomes under different definitions of fairness when compared to other methods.

Performance in binary setting. Accuracy (ACC) and FairCOCCO (COCCO) under CAL; the best results are emboldened.

Protection of multiple attributes. Level of protection provided to individual attributes when all attributes are simultaneously protected under CAL. Lowest MSE and FairCOCCO score are emboldened. (left) C&C dataset, (right) Students dataset.

Statistical testing. Accuracy-fairness trade-offs under different fairness notions and corresponding test of statistical significance. (left) EO setting, (right) DP setting.

ETHICS AND REPRODUCIBILITY STATEMENT

Ethics statement. We caution against using our proposed methods as a certificate of fairness. As Corbett-Davies et al. (2017) rightfully emphasize, fairness measures do not rule out unfair practices. Additionally, future work should focus on interpretable fairness quantification that sheds insight on the root causes of unfairness, allowing them to be eliminated through procedural changes rather than solely in prediction tasks. Lastly, we encourage a more lively discourse on the philosophical implications of ML methods for justice and fairness (Kuppler et al., 2021), which is critical to Fair ML deployment.

Reproducibility statement. We detail exact implementation details, including dataset preprocessing, implementation of benchmark methods, architecture design, hyperparameter tuning, and evaluation methods in Section 3, Section 4 and Appendix B. We will release code upon acceptance of the paper for the camera-ready version.

South German Credit (German) (Hoffman, 1994). The German dataset contains 1000 instances with 20 predictor variables describing a debtor's financial history and demographic information, which are used to predict binary credit risk (i.e. complied with credit contract or not). The sensitive attribute is a binary variable indicating whether the debtor is of foreign nationality.

COMPAS (Angwin et al., 2016). COMPAS is commercial software commonly used by judges and parole officers to score a criminal defendant's likelihood of recidivism. The dataset contains 6172 instances with 10 features. The outcome is a binary variable corresponding to whether violent recidivism occurred (is_violent_recid) and the sensitive attribute is race, which is binarised into "Caucasian" and "Non-Caucasian" defendants.

Communities and Crime (C&C) (Redmond, 2009). C&C contains socio-economic data from the 1990 US Census, law enforcement data from the 1990 US LEMAS survey and crime data from the 1995 FBI UCR. It contains 1994 instances of communities with 128 attributes.
The outcome of the regression problem is the crime rate within each community, ViolentCrimesPerPop, which is a continuous value. There are three sensitive attributes, corresponding to ethnic proportions in the community: racePctBlack, racePctWhite, racePctAsian.

Student Performance (Students) (Cortez, 2014). The Students dataset predicts academic performance in the last year of high school. There are 649 instances with 33 attributes, including past academic information and student demographics. The response variable is a continuous variable corresponding to the final grade, and the sensitive attributes are age (continuous value from 15-22) and sex ('F'-female, 'M'-male).

B.2 TIME SERIES TASK

The data used to develop and evaluate our experiment on fair imitation learning is extracted from the MIMIC-III ICU database (Johnson et al., 2016a), based on the Sepsis-3 cohort defined by Komorowski et al. (2018).

Discrimination in Healthcare. Sepsis is one of the leading causes of mortality in intensive care units (Singer et al., 2016), and while efforts have been made to provide clinical guidelines for treatment, physicians at the bedside largely rely on experience, giving rise to possible variation in treatment. Prejudice in healthcare has been reported in many instances: for example, healthcare professionals are more likely to downplay women's health concerns (Rogers & Ballantyne, 2008), and racial biases affect the pain assessment and treatment prescribed (Hoffman et al., 2016). Thus, it is critical, when learning to imitate an expert policy, that no underlying prejudices leak into the learned policy.

Problem Setup. We have access to a set of expert trajectories D = {τ_1, ..., τ_N}, where each trajectory is a sequence of state-action pairs {(s_1, a_1), ..., (s_T, a_T)}. The time-varying state space is modelled with a Markov Decision Process (MDP), i.e. at every time step t, the agent observes the current state s_t and takes action a_t.

Data. We obtain data from MIMIC-III and use the pre-processing scripts provided by Komorowski et al. (2018) to extract patients satisfying the Sepsis-3 criteria. For each patient, we have relevant physiological parameters, including demographics, lab values, vital signs and intake/output events. Data are aggregated into 4-hour windows.

State Space. The pre-processing yields a 45 × 1 feature vector for each patient at each time step, as summarized in Table 8. We consider gender as the sensitive attribute.

Action Space.
We define a binary action for medical intervention based on intravenous (IV) fluid and maximum vasopressor (VP) dosage in a given 4-hour window, where a_t = 1 represents either or both interventions being taken, and a_t = 0 indicates no action taken.

Treatment Outcome. The ground-truth treatment outcome at each time step is evaluated using SOFA (measuring organ failure) and arterial lactate levels (higher in septic patients). Specifically, the treatment outcome penalizes high SOFA scores and increases in SOFA and lactate levels from the previous time step (Raghu et al., 2017).

Behavioral Cloning. Our proposed framework works with any imitation learning algorithm as long as the predictions of action rewards are differentiable. Here, we focus on behavioral cloning. The expert's demonstrations D are divided into i.i.d. state-action pairs, and we train a neural network as described in the experimental setup to predict posterior action probabilities.
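The action binarisation above can be sketched as follows. This is a minimal illustration; treating any strictly positive dose as an intervention is our assumption, as the paper does not state an explicit dose threshold.

```python
import numpy as np

def binarise_action(iv_fluid, max_vaso):
    """a_t = 1 if IV fluids and/or vasopressors were administered in the
    4-hour window, a_t = 0 otherwise.
    Assumption: any strictly positive dose counts as an intervention."""
    iv = np.asarray(iv_fluid, dtype=float)
    vp = np.asarray(max_vaso, dtype=float)
    return ((iv > 0) | (vp > 0)).astype(int)
```

Applied per 4-hour window, this yields the binary action sequence {a_1, ..., a_T} for each patient trajectory.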

C ADDITIONAL EXPERIMENTS

In this section, we provide additional results to comprehensively evaluate our proposed methods, specifically:

1. DP and CAL: While the main paper investigates fairness using EO, Appendix C.1 demonstrates the application of FairCOCCO using the DP and CAL notions of fairness, highlighting FairCOCCO's compatibility with fairness definitions other than EO. We perform the same experiments on (1) binary classification tasks and (2) a regression task with multiple sensitive attributes, using the procedures described in the experimental setup.
2. FairCOCCO estimation: Appendix C.2 provides additional results on the convergence of FairCOCCO Score estimation as a function of batch size.
3. Statistical testing: Appendix C.3 demonstrates how the proposed measures can be employed as test statistics.

