FERMI: FAIR EMPIRICAL RISK MINIMIZATION VIA EXPONENTIAL RÉNYI MUTUAL INFORMATION

Abstract

Several notions of fairness, such as demographic parity and equal opportunity, are defined based on statistical independence between a predicted target and a sensitive attribute. In machine learning applications, however, the data distribution is unknown to the learner, and statistical independence is not verifiable. Hence, the learner can only resort to empirical evaluation of the degree of fairness violation. Many notions of fairness violation are defined as a divergence or distance between the joint distribution of the target and sensitive attributes and the Kronecker product of their marginals, e.g., Rényi correlation, mutual information, and L∞ distance. In this paper, we propose another notion of fairness violation, called exponential Rényi mutual information (ERMI), between the sensitive attributes and the predicted target. We show that ERMI is a strong notion of fairness violation in the sense that it provides an upper bound guarantee on all of the aforementioned notions. We also propose a fair empirical risk minimization framework with ERMI regularization, called FERMI. Whereas existing in-processing fairness algorithms are deterministic, we provide a stochastic optimization method for solving FERMI that is amenable to large-scale problems, in addition to a batch (deterministic) method. Both of our proposed algorithms come with theoretical convergence guarantees. Our experiments show that FERMI achieves the most favorable tradeoffs between fairness violation and accuracy on test data across different problem setups, even when fairness violation is measured in notions other than ERMI.

1. INTRODUCTION

Ensuring that decisions made using machine learning algorithms are fair to different subgroups is of utmost importance. Without any mitigation strategy, machine learning algorithms may discriminate against certain subgroups based on sensitive attributes, such as gender or race, even if such discrimination is absent in the training data (Datta et al., 2015; Sweeney, 2013; Bolukbasi et al., 2016; Angwin et al., 2016; du Pin Calmon et al., 2017b; Feldman et al., 2015; Hardt et al., 2016; Fish et al., 2016; Woodworth et al., 2017; Zafar et al., 2017; Bechavod & Ligett, 2017; Kearns et al., 2018). To remedy such discrimination issues, several notions of algorithmic fairness have been proposed in the literature. A learning machine satisfies demographic parity if the predicted target is independent of the sensitive attributes (Dwork et al., 2012). Promoting demographic parity can lead to poor performance, especially if the true outcome is not independent of the sensitive attributes. To remedy this, Hardt et al. (2016) proposed equalized odds, which requires the predicted target to be conditionally independent of the sensitive attributes given the true label. A further relaxation of this notion is equal opportunity, which is satisfied if the predicted target is conditionally independent of the sensitive attributes given that the true label is in an advantaged class (Hardt et al., 2016). Note that the inherent assumption in such conditional notions is that the true labels are unbiased; these notions may therefore amplify biases that exist in the targets/labels of the training data (e.g., data collection bias). Tackling such bias is beyond the scope of this work. In practice, the learner cannot empirically verify independence of random variables, and hence cannot verify demographic parity, equalized odds, or equal opportunity.
This has led the machine learning community to define several notions of fairness violation that quantify the degree of dependence between random variables, e.g., L∞ distance for demographic parity/equalized odds (Dwork et al., 2012; Hardt et al., 2016), mutual information (Kamishima et al., 2011; Rezaei et al., 2020; Steinberg et al., 2020; Zhang et al., 2018; Cho et al., 2020), Pearson correlation (Zafar et al., 2017), false positive/negative rates (Bechavod & Ligett, 2017), the Hilbert-Schmidt independence criterion (HSIC) (Pérez-Suay et al., 2017), and Rényi correlation (Baharlouei et al., 2020; Grari et al., 2020; 2019), to name a few. In this paper, we define yet another notion of fairness violation, called exponential Rényi mutual information (ERMI). We show that ERMI is easy to compute empirically, and we prove that it provides an upper bound on existing notions of fairness violation such as demographic parity, equalized odds, and equal opportunity. Given a notion of fairness violation, it is still not straightforward to train an algorithm that satisfies a fairness violation constraint (Cotter et al., 2019). Fairness-promoting machine learning algorithms can be categorized into three main classes: pre-processing, post-processing, and in-processing methods. Pre-processing algorithms (Feldman et al., 2015; Zemel et al., 2013; du Pin Calmon et al., 2017b) transform the biased data features to a new space in which the labels and sensitive attributes are statistically independent; this transform is oblivious to the training procedure. Post-processing approaches (Hardt et al., 2016; Pleiss et al., 2017) mitigate the discrimination of the classifier by altering the final decision, e.g., by changing the thresholds on soft labels or by reassigning labels to impose notions of fairness. In-processing approaches focus on the training procedure and impose notions of fairness as constraints or regularization terms in the optimization procedure.
Several regularization-based methods have been proposed in the literature to impose measures of fairness on decision trees (Kamiran et al., 2010; Raff et al., 2018; Aghaei et al., 2019), support vector machines (Donini et al., 2018), neural networks (Grari et al., 2020), and (logistic) regression models (Zafar et al., 2017; Berk et al., 2017; Taskesen et al., 2020; Chzhen & Schreuder, 2020; Baharlouei et al., 2020; Jiang et al., 2020; Grari et al., 2019). To the best of our knowledge, existing in-processing methods are all deterministic, making them impractical for large-scale problems. Furthermore, most in-processing methods (with the exception of (Baharlouei et al., 2020)) are designed for problems in which the sensitive attribute and/or the target is binary. In this paper, we introduce a new fair empirical risk minimization framework with ERMI regularization, which we call FERMI. We provide novel batch and stochastic gradient-based methods with convergence guarantees for solving FERMI, and demonstrate their effectiveness in multiple numerical experiments, including a large-scale problem and a problem with both non-binary sensitive attributes and non-binary targets. We show that FERMI can be used to achieve the most favorable tradeoffs between performance and fairness, even when fairness violation is measured in notions other than ERMI.

2. (Z, 𝒵)-FAIRNESS: A GENERAL NOTION OF FAIRNESS

We consider a learner who trains a model to predict a target Y, e.g., whether or not to extend a loan, supported on a set 𝒴 which can be discrete or continuous. The prediction is made using a set of features X, e.g., financial history features, length of credit, and amount of debt. We also assume that there is a set of discrete sensitive attributes S, e.g., race and sex, supported on 𝒮, associated with each sample. Further, let A ⊆ 𝒴 denote an advantaged outcome class, e.g., the outcome where a loan is extended. Next, we present the main fairness notion considered in this paper, which generalizes several existing ones.

Definition 1 ((Z, 𝒵)-fairness). Given a random variable Z, let 𝒵 be a subset of the values that Z can take. We say that a learning machine satisfies (Z, 𝒵)-fairness if, for every z ∈ 𝒵, Ŷ is conditionally independent of S given Z = z. More precisely,

$$p_{\hat{Y},S|Z}(\hat{y}, s \mid z) = p_{\hat{Y}|Z}(\hat{y} \mid z)\, p_{S|Z}(s \mid z) \qquad \forall\, \hat{y} \in \mathcal{Y},\; s \in \mathcal{S},\; z \in \mathcal{Z}. \tag{1}$$

Notice that (Z, 𝒵)-fairness recovers several important existing notions of fairness as special cases:
1. (Z, 𝒵)-fairness recovers demographic parity (Dwork et al., 2012) if Z = 0 and 𝒵 = {0}. In this case, conditioning on Z has no effect, and hence (0, {0})-fairness is equivalent to independence between Ŷ and S, i.e., demographic parity (see Definition 7, Appendix A).
2. (Z, 𝒵)-fairness recovers equalized odds (Hardt et al., 2016) if Z = Y and 𝒵 = 𝒴. In this case, Z ∈ 𝒵 is trivially satisfied and can be dropped. Hence, conditioning on Z is equivalent to conditioning on Y, which recovers the equalized odds notion of fairness, i.e., conditional independence of Ŷ and S given Y (see Definition 8, Appendix A).
3. (Z, 𝒵)-fairness recovers equal opportunity (Hardt et al., 2016) if Z = Y and 𝒵 = A. This is similar to the previous case with 𝒴 replaced by A (see Definition 9, Appendix A).

Demographic parity amounts to requiring equality of outcomes across sensitive groups.
However, it can result in poor performance of the learned model, particularly if the true outcome Y is not independent of S. Equalized odds and equal opportunity remedy this issue by relaxing the independence constraint (Hardt et al., 2016). In the binary classification setting with binary sensitive attributes (e.g., male or female), equalized odds requires equality of both the false negative and false positive rates of the classifier across sensitive groups. Equal opportunity is a further relaxation of the equalized odds criterion which, in the binary setting, requires only equality of false negative rates across sensitive groups. For example, this could be used to enforce that a face recognition system does not falsely classify people of one race as criminals more often than people of other races. Note that verifying (Z, 𝒵)-fairness requires access to the joint distribution of the random variables (Z, Ŷ, S). This joint distribution is unavailable to the learner in the context of machine learning, so the learner must resort to empirical estimation of the amount by which independence is violated. In the next section, we propose exponential Rényi mutual information as a notion of the violation of (Z, 𝒵)-fairness, and show that it is a stronger notion than several existing fairness violation notions, including the demographic parity L∞ distance (Kearns et al., 2018), the equalized odds L∞ distance (Hardt et al., 2016), and the equal opportunity L∞ distance (Hardt et al., 2016).
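To make the L∞ notion of fairness violation concrete, the following is a minimal sketch (our own illustration, not from the paper) of the empirical demographic parity L∞ gap, i.e., the largest deviation between the empirical joint distribution of (Ŷ, S) and the product of its marginals. The function name and toy data are hypothetical:

```python
from collections import Counter

def dp_linf_violation(y_hat, s):
    """Empirical L-infinity demographic parity violation:
    max over (yhat, s) of |p(yhat, s) - p(yhat) * p(s)|,
    where all probabilities are empirical frequencies."""
    n = len(y_hat)
    joint = Counter(zip(y_hat, s))
    p_y = Counter(y_hat)
    p_s = Counter(s)
    return max(
        abs(joint[(a, b)] / n - (p_y[a] / n) * (p_s[b] / n))
        for a in p_y for b in p_s
    )

# Predictions independent of the sensitive attribute -> zero violation.
print(dp_linf_violation([0, 1, 0, 1], [0, 0, 1, 1]))  # 0.0
# Predictions perfectly aligned with the sensitive attribute -> 0.25.
print(dp_linf_violation([1, 1, 0, 0], [1, 1, 0, 0]))  # 0.25
```

A model satisfying demographic parity drives this quantity to zero; Section 3 shows that controlling ERMI controls this gap as well.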

3. EXPONENTIAL RÉNYI MUTUAL INFORMATION

In this section, we define ERMI and show that several existing notions of fairness violation (which are mostly instances of f-divergences or distance metrics between the joint probability distribution and the Kronecker product of the marginals) are upper bounded by ERMI, implying that ERMI is a stronger notion of fairness violation. In particular, this means that if ERMI is small, we automatically obtain guarantees that all of these other notions of fairness violation are small as well. We present all definitions and results for the general (Z, 𝒵)-fairness notion, which requires carefully extending several existing notions to the conditional case. These definitions and results simplify significantly when Z = 0 and 𝒵 = {0}, which eliminates all conditional expectations.

Definition 2 (ERMI — exponential Rényi mutual information). We define the exponential Rényi mutual information between Ŷ and S given Z ∈ 𝒵 as

$$D_R(\hat{Y}; S \mid Z \in \mathcal{Z}) := \mathbb{E}_{Z,\hat{Y},S}\left[ \frac{p_{\hat{Y},S|Z}(\hat{Y}, S \mid Z)}{p_{\hat{Y}|Z}(\hat{Y} \mid Z)\, p_{S|Z}(S \mid Z)} \,\Big|\, Z \in \mathcal{Z} \right] - 1.$$

In Appendix B, we unravel this definition for the special cases of interest corresponding to the existing notions of fairness. We also discuss that ERMI is the χ²-divergence (which is an f-divergence) between the joint distribution, p_{Ŷ,S|Z}, and the Kronecker product of marginals, p_{Ŷ|Z} ⊗ p_{S|Z}. In particular, ERMI is non-negative, and zero if and only if (Z, 𝒵)-fairness is satisfied. Hence, ERMI is a valid notion of fairness violation.

Definition 3 (Rényi mutual information (Rényi, 1961)). Let the Rényi mutual information of order α > 1 between random variables Ŷ and S given Z ∈ 𝒵 be defined as

$$I_\alpha(\hat{Y}; S \mid Z \in \mathcal{Z}) := \frac{1}{\alpha - 1} \log \mathbb{E}_{Z,\hat{Y},S}\left[ \left( \frac{p_{\hat{Y},S|Z}(\hat{Y}, S \mid Z)}{p_{\hat{Y}|Z}(\hat{Y} \mid Z)\, p_{S|Z}(S \mid Z)} \right)^{\alpha - 1} \Big|\, Z \in \mathcal{Z} \right],$$

which generalizes the Shannon mutual information

$$I_1(\hat{Y}; S \mid Z \in \mathcal{Z}) := \mathbb{E}_{Z,\hat{Y},S}\left[ \log \frac{p_{\hat{Y},S|Z}(\hat{Y}, S \mid Z)}{p_{\hat{Y}|Z}(\hat{Y} \mid Z)\, p_{S|Z}(S \mid Z)} \,\Big|\, Z \in \mathcal{Z} \right],$$

and recovers it in the limit: $\lim_{\alpha \to 1^+} I_\alpha(\hat{Y}; S \mid Z \in \mathcal{Z}) = I_1(\hat{Y}; S \mid Z \in \mathcal{Z})$.
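In the unconditional case (Z = 0, 𝒵 = {0}) with finite alphabets, ERMI is simply the χ²-divergence between a joint probability table and the outer product of its marginals. The sketch below (our own illustration, assuming NumPy is available; the function name is hypothetical) computes it directly:

```python
import numpy as np

def ermi(joint):
    """ERMI = chi^2 divergence between the joint p(yhat, s) and the
    product of its marginals: sum_{y,s} p(y,s)^2 / (p(y) p(s)) - 1."""
    joint = np.asarray(joint, dtype=float)
    p_y = joint.sum(axis=1)   # marginal of the prediction
    p_s = joint.sum(axis=0)   # marginal of the sensitive attribute
    return float((joint**2 / np.outer(p_y, p_s)).sum() - 1.0)

# Independent joint (outer product of marginals) -> ERMI = 0.
indep = np.outer([0.3, 0.7], [0.5, 0.5])
print(ermi(indep))                      # ~0.0
# Perfectly correlated joint -> strictly positive ERMI.
print(ermi([[0.5, 0.0], [0.0, 0.5]]))   # 1.0
```

This non-negativity, with equality exactly at independence, is what makes ERMI a valid fairness violation measure.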
Note that I_α(Ŷ; S | Z ∈ 𝒵) ≥ 0, with equality if and only if (Z, 𝒵)-fairness is satisfied.

Theorem 1 (ERMI is stronger than Shannon mutual information). We have

$$0 \le I_1(\hat{Y}; S \mid Z \in \mathcal{Z}) \le I_2(\hat{Y}; S \mid Z \in \mathcal{Z}) \le e^{I_2(\hat{Y}; S \mid Z \in \mathcal{Z})} - 1 = D_R(\hat{Y}; S \mid Z \in \mathcal{Z}).$$

All proofs are relegated to the appendix. Theorem 1 establishes that ERMI is a stronger notion of fairness violation in the sense that driving it to zero also drives the Shannon mutual information to zero. It also shows that ERMI is exponentially related to the Rényi mutual information of order 2.

Definition 4 (Rényi correlation (Hirschfeld, 1935; Gebelein, 1941; Rényi, 1959)). Let ℱ and 𝒢 be the sets of measurable functions such that, for random variables Ŷ and S, $\mathbb{E}_{\hat{Y}}\{f(\hat{Y}; z)\} = \mathbb{E}_S\{g(S; z)\} = 0$ and $\mathbb{E}_{\hat{Y}}\{f(\hat{Y}; z)^2\} = \mathbb{E}_S\{g(S; z)^2\} = 1$ for all z ∈ 𝒵. The Rényi correlation is

$$\rho_R(\hat{Y}, S \mid Z \in \mathcal{Z}) := \sup_{f \in \mathcal{F},\, g \in \mathcal{G}} \mathbb{E}_{Z,\hat{Y},S}\left[ f(\hat{Y}; Z)\, g(S; Z) \,\big|\, Z \in \mathcal{Z} \right].$$

Rényi correlation generalizes the Pearson correlation coefficient

$$\rho(\hat{Y}, S \mid Z \in \mathcal{Z}) := \mathbb{E}_Z\left[ \frac{\mathbb{E}_{\hat{Y},S}\{\hat{Y} S \mid Z\}}{\sqrt{\mathbb{E}_{\hat{Y}}\{\hat{Y}^2 \mid Z\}\, \mathbb{E}_S\{S^2 \mid Z\}}} \,\Big|\, Z \in \mathcal{Z} \right] \tag{7}$$

to capture nonlinear dependencies between the random variables, by taking functions of the random variables that maximize the Pearson correlation coefficient between them. Moreover, ρ_R(Ŷ, S | Z ∈ 𝒵) ≥ 0 with equality if and only if (Z, 𝒵)-fairness is satisfied. Due to these favorable properties, Rényi correlation has gained popularity as a measure of fairness violation (Baharlouei et al., 2020; Grari et al., 2020).

Theorem 2 (ERMI is stronger than Rényi correlation). We have

$$0 \le |\rho(\hat{Y}, S \mid Z \in \mathcal{Z})| \le \rho_R(\hat{Y}, S \mid Z \in \mathcal{Z}) \le \sqrt{D_R(\hat{Y}; S \mid Z \in \mathcal{Z})},$$

and if |𝒮| = 2, then $\sqrt{D_R(\hat{Y}; S \mid Z \in \mathcal{Z})} = \rho_R(\hat{Y}, S \mid Z \in \mathcal{Z})$.

Next, we turn to another popular notion of fairness violation and establish a similar relationship.

Definition 5 (L_q fairness violation). For q ≥ 1, we define the L_q fairness violation as

$$L_q(\hat{Y}, S \mid Z \in \mathcal{Z}) := \mathbb{E}_Z\left[ \left( \sum_{s \in \mathcal{S}} \int_{\hat{y} \in \mathcal{Y}} \left| p_{\hat{Y},S|Z}(\hat{y}, s \mid Z) - p_{\hat{Y}|Z}(\hat{y} \mid Z)\, p_{S|Z}(s \mid Z) \right|^q d\hat{y} \right)^{\frac{1}{q}} \Big|\, Z \in \mathcal{Z} \right].$$
Note that L_q(Ŷ, S | Z ∈ 𝒵) = 0 if and only if (Z, 𝒵)-fairness is satisfied. In particular, the L∞ fairness violation recovers the demographic parity violation (Kearns et al., 2018, Definition 2.1) if we let Z = 0 and 𝒵 = {0}, and it recovers the equal opportunity violation (Hardt et al., 2016) if we let Z = Y and 𝒵 = A. L_q fairness violation generalizes these notions by considering the L_q norm of the difference between the joint distribution p_{Ŷ,S} and the Kronecker product of the marginal distributions p_Ŷ ⊗ p_S.

Theorem 3 (ERMI is stronger than L∞ fairness violation). Let Ŷ be a discrete or continuous random variable, and let S be a discrete random variable supported on a finite set. Then for any q ≥ 1,

$$0 \le L_q(\hat{Y}, S \mid Z \in \mathcal{Z}) \le \sqrt{D_R(\hat{Y}; S \mid Z \in \mathcal{Z})}.$$

Theorem 3 says that if a method controls the ERMI value to impose fairness, then the L∞ demographic parity violation (Kearns et al., 2018), the L∞ equal opportunity violation (Hardt et al., 2016), and the L∞ equalized odds violation (Hardt et al., 2016) are also guaranteed to be bounded.
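The dominance relations of Theorems 1 and 3 can be checked numerically on random joint distributions (unconditional case). The script below is our own sanity check, assuming NumPy; it verifies that ERMI upper bounds Shannon mutual information and that the square root of ERMI upper bounds the L∞ violation:

```python
import numpy as np

rng = np.random.default_rng(0)

def ermi(j):
    p_y, p_s = j.sum(axis=1), j.sum(axis=0)
    return float((j**2 / np.outer(p_y, p_s)).sum() - 1.0)

def shannon_mi(j):
    p_y, p_s = j.sum(axis=1), j.sum(axis=0)
    prod = np.outer(p_y, p_s)
    mask = j > 0
    return float((j[mask] * np.log(j[mask] / prod[mask])).sum())

def linf(j):
    p_y, p_s = j.sum(axis=1), j.sum(axis=0)
    return float(np.abs(j - np.outer(p_y, p_s)).max())

ok = True
for _ in range(1000):
    j = rng.random((3, 4))
    j /= j.sum()                       # random joint distribution
    d = ermi(j)
    ok = ok and shannon_mi(j) <= d + 1e-12   # Theorem 1
    ok = ok and linf(j) <= d**0.5 + 1e-12    # Theorem 3
print(ok)  # True
```

Both inequalities hold on every random trial, consistent with the theorems (the first follows from log(1 + x) ≤ x, the second from Cauchy-Schwarz).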

4. FAIR RISK MINIMIZATION VIA ERMI

Our goal is to train a model that balances fairness and accuracy objectives. To this end, we introduce the fair risk minimization through exponential Rényi mutual information framework defined below:

Definition 6 (FRMI — fair risk minimization through exponential Rényi mutual information). To balance fairness and accuracy, we consider the learning objective

$$\min_\theta\; \mathbb{E}_{X,Y,S}\left[ \ell(X, Y; \theta) \right] + \lambda\, D_R\big(\hat{Y}_\theta(X); S\big), \tag{11}$$

where ℓ denotes the loss function, such as the L₂ loss or cross-entropy loss; λ > 0 is a scalar balancing the accuracy and fairness objectives; D_R(Ŷ_θ(X); S) is the notion of ERMI given in Eq. (17); and Ŷ_θ(X) is the output of the learned model (e.g., the output of a classification or regression task, or the cluster index in a clustering task).

While Ŷ_θ(X) inherently depends on X and θ, in the rest of this paper we sometimes leave the dependence of Ŷ on X and/or θ implicit for brevity of notation. Notice that we have also left the dependence of the loss ℓ on the predicted outcome Ŷ implicit. FRMI is the objective we should solve if demographic parity is the desired fairness notion; if we are instead interested in equalized odds or equal opportunity, then D_R(Ŷ; S) should be replaced by D_R(Ŷ; S | Z ∈ 𝒵) for the appropriate (Z, 𝒵) pair, per the discussion in Section 3. Since the theory is very similar in both cases, we stick with FRMI as defined in Definition 6. In practice, the true joint distribution of (X, S, Y, Ŷ) is unknown, and we only have N samples at our disposal, making it impossible to solve FRMI exactly. Hence, we turn to a fair empirical risk minimization via exponential Rényi mutual information (FERMI) approach. While it is natural to estimate E_{X,Y,S}[ℓ(X, Y; θ)] through the empirical risk, the estimation of D_R(Ŷ; S) in the objective in Eq. (11) is not as straightforward. In what follows, we propose two approaches for estimating D_R(Ŷ; S).
These two approaches result in two different algorithms for balancing fairness and accuracy; we discuss the benefits and shortcomings of each.

4.1. FERMI VIA EMPIRICAL ESTIMATION OF THE PROBABILITY DISTRIBUTIONS

Let {x_i, s_i, y_i, ŷ_i}_{i∈[N]} denote the features, sensitive attributes, targets, and the predictions of the model parameterized by θ for samples i ∈ [N]. A natural approach to estimating the objective in Eq. (11) and learning the parameter θ is to solve

$$\min_\theta\; \left\{ \frac{1}{N} \sum_{i \in [N]} \ell(x_i, y_i; \theta) + \lambda\, \hat{D}_R\big(\hat{Y}_\theta; S\big) \right\}. \tag{12}$$

Here,

$$\hat{D}_R(\hat{Y}_\theta; S) := \sum_{s \in \mathcal{S}} \int_{\hat{y} \in \mathcal{Y}} \frac{\hat{p}_{\hat{Y},S}(\hat{y}, s)^2}{\hat{p}_{\hat{Y}}(\hat{y})\, \hat{p}_S(s)}\, d\hat{y} - 1$$

is an empirical estimate of D_R(Ŷ_θ; S), where p̂_S(s), p̂_Ŷ(ŷ), and p̂_{Ŷ,S}(ŷ, s) are the empirical estimates of the corresponding probability functions. To make the objective differentiable with respect to θ, we make the following assumption:

Assumption 1. Assume the sensitive attribute is a deterministic function of the features, i.e., S = f_s(X); this trivially holds if the sensitive attribute is available as part of the features. Further, assume the soft conditional density

$$p_{\hat{Y}_\theta, S}(\hat{y}, s) = \mathbb{E}_X\left[ \mathbb{1}\{f_s(X) = s\}\, \frac{e^{-\tau \ell(X, \hat{y}; \theta)}}{\int_{y \in \mathcal{Y}} e^{-\tau \ell(X, y; \theta)}\, dy} \right], \tag{13}$$

where τ > 0 controls the softness of the decision, and τ → ∞ recovers the hard decision made by choosing the ŷ that minimizes the loss function.

Assumption 1 is a generalization of the assumption in RFI (Baharlouei et al., 2020), and is natural in problem instances where the decisions made by the learning algorithm are soft decisions. In particular, logistic regression, and neural networks with a soft-max layer and cross-entropy loss, satisfy Assumption 1. We can further derive p_{Ŷ|S}, p_Ŷ, and p_S from Eq. (13). Under this assumption, we have the following empirical estimate of the joint distribution of the predicted target and the sensitive attribute:

$$\hat{p}_{\hat{Y}_\theta, S}(\hat{y}, s) := \frac{1}{N} \sum_{i \in [N]} \mathbb{1}\{s_i = s\}\, \frac{e^{-\tau \ell(x_i, \hat{y}; \theta)}}{\int_{y \in \mathcal{Y}} e^{-\tau \ell(x_i, y; \theta)}\, dy}, \tag{14}$$

which can be marginalized to obtain p̂_S(s) and p̂_Ŷ(·) as well. These empirical probabilities make D̂_R(Ŷ_θ; S) a differentiable function of θ.
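For cross-entropy loss with τ = 1, the weight e^{−τℓ(x, ŷ; θ)} in Eq. (14), normalized over classes, is exactly the soft-max output probability, so the plug-in joint estimate reduces to averaging soft-max probabilities within each sensitive group. The sketch below is our own illustration of this special case (assuming NumPy; `soft_joint`, the temperature handling, and the toy data are hypothetical):

```python
import numpy as np

def soft_joint(logits, s, num_groups, tau=1.0):
    """Plug-in estimate of p(yhat = j, s): average, over the samples in
    each sensitive group, of the (temperature-scaled) soft-max output.
    `logits` has shape (N, m); `s` holds group indices in {0..k-1}."""
    z = tau * logits
    z = z - z.max(axis=1, keepdims=True)                 # numerical stability
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    n, m = probs.shape
    joint = np.zeros((m, num_groups))
    for i in range(n):
        joint[:, s[i]] += probs[i] / n                   # Eq. (14)-style average
    return joint  # rows: predicted class, columns: sensitive group

logits = np.array([[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
s = np.array([0, 0, 1])
joint = soft_joint(logits, s, num_groups=2)
print(np.isclose(joint.sum(), 1.0))  # True: a valid joint distribution
```

Because each soft-max row is a probability vector, the resulting table always sums to one, and its column marginals recover the empirical group frequencies.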
Notice that these empirical quantities converge to the true distributions for finite 𝒮 and 𝒴; however, the sample complexity required for their estimation scales linearly with |𝒮| and |𝒴|, which implies exponential scaling with the number of sensitive attributes. The stochastic algorithm presented in the next section aims to remedy this potential problem. To solve Eq. (12), one can apply gradient descent with the dynamics

$$\theta^{t+1} = \theta^t - \eta\, \nabla_\theta \left\{ \frac{1}{N} \sum_{i \in [N]} \ell(x_i, y_i; \theta^t) + \lambda\, \hat{D}_R\big(\hat{Y}_{\theta^t}; S\big) \right\}, \tag{15}$$

where η > 0 is the learning rate (step-size). Note that when the sensitive attribute is binary, our algorithm coincides with the one in (Baharlouei et al., 2020), since in that case D̂_R(Ŷ_{θ^t}; S) is equivalent to the Rényi correlation ρ_R(Ŷ, S) by Theorem 8. However, in the non-binary case, our algorithm differs from (Baharlouei et al., 2020) in general, and we show in Sec. 5 that it achieves a more favorable fairness-accuracy tradeoff curve. Under standard assumptions on the loss function and the learning rate, one can show that the dynamics in Eq. (15) find an ε-stationary point (i.e., a point at which the norm of the gradient is smaller than ε) in O(ε⁻²) iterations (Nesterov, 2013).

Theorem 4 (informal statement). Gradient descent (Eq. (15)) converges to the set of ε-first-order stationary points of the FERMI objective (Eq. (12)) in O(ε⁻²) iterations (gradient evaluations).

While this algorithm achieves the optimal rate of first-order methods for general smooth non-convex optimization problems, the empirical ERMI term in the objective in Eq. (12) is a biased estimator of the ERMI in Eq. (11). This bias makes the optimization problem in Eq. (12) unsuitable for stochastic methods. For example, in the extreme case where only one sample is available to the learner for updating the objective at each step, D̂_R(·; ·) can be severely biased due to the nonlinearities in its definition.
In the next subsection, we propose another approach for estimating D_R(·; ·) that yields an unbiased estimator, which is amenable to stochastic optimization.

Algorithm 1: Two-Time-Scale SGDA for FERMI
1: Input: θ⁰ ∈ R^{d_θ}, W⁰ ∈ 𝒲 ⊂ R^{k×m}, step-sizes (η_θ, η_w), mini-batch size M, fairness parameter λ ≥ 0, number of iterations T.
2: for t = 0, 1, …, T do
3:   Draw a mini-batch B of data points {(x_i, y_i)}_{i∈B}
4:   Set θ^{t+1} ← θ^t − (η_θ / |B|) Σ_{i∈B} [ ∇_θ ℓ(x_i, y_i, θ^t) − 2λ ∇_θ vec(ŷ_i(θ^t) ŷ_i(θ^t)^T)^T vec((W^t)^T W^t) + 2λ ∇_θ vec(ŷ_i(θ^t) s_i^T)^T vec((W^t)^T P_s^{−1/2}) ]
5:   Set W^{t+1} ← Π_𝒲 [ W^t + (η_w / |B|) Σ_{i∈B} ( −2λ W^t ŷ_i(θ^t) ŷ_i(θ^t)^T + 2λ P_s^{−1/2} s_i ŷ_i(θ^t)^T ) ]
6: end for
7: Pick t̂ uniformly at random from {1, …, T}
8: Return: θ^{t̂}.

4.2. STOCHASTIC FERMI

In order to solve the population-level objective in Eq. (11) using stochastic methods (such as stochastic gradient descent), one needs an unbiased estimate of the objective, i.e., of E_{X,Y,S}[ℓ(X, Y; θ)] + λ D_R(Ŷ_θ(X); S). Clearly, the empirical average (1/|B|) Σ_{i∈B} ℓ(x_i, y_i; θ) is an unbiased estimator of E_{X,Y,S}[ℓ(X, Y; θ)], where B ⊆ [N] is a batch of data points. Thus, to develop a stochastic algorithm, we need an unbiased estimator of D_R(Ŷ_θ(X); S) given a batch of data points B. The following theorem provides such an estimator. Let Ŷ ∈ {0, 1}^m and S ∈ {0, 1}^k denote the one-hot encoded versions of Ŷ and S, respectively, and define

$$P_{\hat{y}} = \operatorname{diag}\big(p_{\hat{Y}}(1), \ldots, p_{\hat{Y}}(m)\big), \qquad P_{\hat{y},s} = \begin{pmatrix} p_{\hat{Y},S}(1,1) & \cdots & p_{\hat{Y},S}(1,k) \\ \vdots & \ddots & \vdots \\ p_{\hat{Y},S}(m,1) & \cdots & p_{\hat{Y},S}(m,k) \end{pmatrix}, \qquad P_s = \operatorname{diag}\big(p_S(1), \ldots, p_S(k)\big).$$

Theorem 5. In the above notation,

$$D_R(\hat{Y}; S) = \max_{W \in \mathbb{R}^{k \times m}} \left\{ -\operatorname{Tr}\big(W P_{\hat{y}} W^T\big) + 2 \operatorname{Tr}\big(W P_{\hat{y},s}\, P_s^{-1/2}\big) - 1 \right\}.$$

Theorem 5 implies that Eq. (11) can be re-written as

$$\min_\theta \max_{W \in \mathbb{R}^{k \times m}}\; \mathbb{E}\left[ \ell(X, Y; \theta) + \lambda\left( -\operatorname{Tr}\big(W \hat{\mathbf{y}} \hat{\mathbf{y}}^T W^T\big) + 2 \operatorname{Tr}\big(W \hat{\mathbf{y}} \mathbf{s}^T P_s^{-1/2}\big) - 1 \right) \right],$$

where ŷ and s denote the one-hot encodings of Ŷ and S. Hence, given a batch of data points B, we obtain an unbiased estimator of the above objective via the empirical average

$$\frac{1}{|B|} \sum_{i \in B} \left[ \ell(x_i, y_i; \theta) + \lambda\left( -\operatorname{Tr}\big(W \hat{y}_i \hat{y}_i^T W^T\big) + 2 \operatorname{Tr}\big(W \hat{y}_i s_i^T P_s^{-1/2}\big) - 1 \right) \right].$$

This observation leads to the stochastic algorithm presented in Algorithm 1. Notice that Algorithm 1 assumes that P_s is known. This assumption is practical, since the distribution of sensitive attributes (such as male vs. female) is known in many applications (or can be estimated accurately from the training data). The convergence rate of Algorithm 1 is analyzed in Theorem 6.

Theorem 6 (informal statement). Algorithm 1 converges to the set of ε-first-order stationary points of the FERMI objective (cf. Eq. (12)) in O(ε⁻⁴) iterations (stochastic gradient evaluations).

The formal statement of this theorem can be found in Theorem 11 in Appendix D.
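The variational form in Theorem 5 can be sanity-checked numerically: the inner problem is a concave quadratic in W, with closed-form maximizer W* = P_s^{−1/2} P_{ŷ,s}^T P_ŷ^{−1} (obtained by setting the gradient in W to zero), and plugging W* back in recovers ERMI. The check below is our own (assuming NumPy); the dimensions m, k and the random joint distribution are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
m, k = 3, 4
P_ys = rng.random((m, k))
P_ys /= P_ys.sum()                         # joint p(yhat, s)
P_y = np.diag(P_ys.sum(axis=1))            # diag of marginal p(yhat)
P_s = np.diag(P_ys.sum(axis=0))            # diag of marginal p(s)
P_s_inv_half = np.diag(1.0 / np.sqrt(np.diag(P_s)))

def objective(W):
    # inner max objective: -Tr(W P_y W^T) + 2 Tr(W P_ys P_s^{-1/2}) - 1
    return (-np.trace(W @ P_y @ W.T)
            + 2 * np.trace(W @ P_ys @ P_s_inv_half) - 1)

# First-order optimal W*: P_s^{-1/2} P_ys^T P_y^{-1}
W_star = P_s_inv_half @ P_ys.T @ np.linalg.inv(P_y)

# Direct ERMI: chi^2 divergence between joint and product of marginals
ermi_direct = (P_ys**2 / np.outer(np.diag(P_y), np.diag(P_s))).sum() - 1

print(np.isclose(objective(W_star), ermi_direct))  # True
```

Since the objective is linear in the per-sample terms ŷ_i ŷ_i^T and ŷ_i s_i^T, mini-batch averages of these quantities give the unbiased gradient estimates that Algorithm 1 exploits.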
Notice that while this algorithm has a slower convergence rate than the batch algorithm, each of its iterations is computationally cheap, making it amenable to large-scale problems. Note also that a faster convergence rate of O(ε⁻³) could be obtained by using the (more complicated) SREDA method of (Luo et al., 2020) instead of SGDA to solve the FERMI objective; we omit the details here. In the next section, we numerically evaluate the performance of the algorithms described in this section.
5. NUMERICAL EXPERIMENTS

5.1. BINARY FAIR CLASSIFICATION WITH A BINARY SENSITIVE ATTRIBUTE

We start with the experimental setup for binary classification with a binary sensitive attribute, which is a common setup among most existing baseline methods. Per Theorem 2, in this binary classification case ERMI is equivalent to Rényi correlation (Baharlouei et al., 2020), and per the discussion in Sec. 4.1, our batch algorithm is exactly the same as RFI (Baharlouei et al., 2020). While we solve FERMI to impose an ERMI regularizer, we still measure fairness violation via popular notions, such as the conditional demographic parity L∞ violation (Definition 10), the conditional equal opportunity L∞ violation (Definition 11), and the conditional equalized odds violation. In Fig. 1, we report fairness violation vs. error for the German Credit and Adult datasets. As can be seen, for all three popular notions of fairness, FERMI achieves the best tradeoff between fairness and error probability on the test data. This could be partly due to the smoothness of the FERMI optimization problem, and partly due to the fact that ERMI upper bounds the other fairness notions, as discussed in Sec. 3. We discuss this in more detail in Section 6.

Under review as a conference paper at ICLR 2021

Figure 2: Comparison between FERMI and RFI (Baharlouei et al., 2020). FERMI achieves a better fairness vs. performance tradeoff. Moreover, due to computationally expensive operations such as singular value decomposition (SVD), RFI scales poorly with the cardinality of the sensitive features and target classes.

5.2. NON-BINARY FAIR CLASSIFICATION WITH A NON-BINARY SENSITIVE ATTRIBUTE

Next, we consider a general classification problem with |𝒮| > 2. We use the Communities and Crime dataset, which has 18 binary sensitive attributes in total; we pick {7, 10, 14, 18} of these sensitive attributes for different experiments, corresponding to |𝒮| ∈ {2⁷, 2¹⁰, 2¹⁴, 2¹⁸}. We discretize the target into three classes, {high, medium, low} (ternary classification). The only baseline we are aware of that can handle non-binary classification with non-binary sensitive attributes is RFI (Baharlouei et al., 2020). The results are presented in Fig. 2, where we use the conditional demographic parity L∞ violation (Definition 10) and the conditional equal opportunity L∞ violation (Definition 11) as the fairness violation notions. As can be seen, FERMI achieves a better tradeoff curve than RFI. It is noteworthy that the per-iteration complexity of FERMI is far less than that of RFI, which requires computing a singular value decomposition at each iteration. Moreover, the convergence rate for FERMI given in Theorem 4 guarantees O(ε⁻²) convergence vs. O(ε⁻⁴) for RFI (Baharlouei et al., 2020, Theorem 4.1); empirically, we observed that FERMI converges ∼10× faster on this problem instance. Finally, we also show that the conditional demographic parity L∞ violation (Definition 10) and the square root of ERMI are approximately linearly related, which further justifies the use of the ERMI regularizer in the FERMI framework. In the next experiment, we consider the Color MNIST dataset (Li & Vasconcelos, 2019), where MNIST digits are colored with different colors drawn from a Gaussian distribution with variance σ around a certain average color. It is shown in (Li & Vasconcelos, 2019) that as σ → 0, a convolutional network model overfits significantly to the color on this dataset, and hence does not generalize to a regular black-and-white test set.
Our goal in this experiment is to show that FERMI can promote independence between the predicted target and color (which we use as the sensitive attribute within FERMI) to improve generalization in this setup. This also allows us to examine the scaling of stochastic FERMI when used in convolutional neural networks. We consider σ = 0, where the test performance is the lowest due to overfitting.

5.3. LARGE-SCALE CLASSIFICATION WITH FERMI

The results of this experiment are presented in Fig. 4. As expected, FERMI learns representations that depend less on the color, leading to better generalization. It is noteworthy that the test error achieved by FERMI for σ = 0 is 22.6%, compared to 23.3% obtained using REPAIR (Li & Vasconcelos, 2019) for σ = 0.1. Decreasing σ below 0.05, the test error of REPAIR sharply rises above 50%. We were unable to run REPAIR for σ = 0.

Figure 4: The application of FERMI to the Color MNIST dataset (Li & Vasconcelos, 2019). As expected, FERMI achieves a tradeoff between demographic parity L∞ violation and train error on the colored training samples. On test data, however, reducing the demographic parity violation translates to less dependence on color, which avoids overfitting and decreases the test error. Finally, as λ increases, the gap between train error and test error shrinks significantly, which shows that FERMI can avoid overfitting. For stochastic FERMI, we use a mini-batch of size 512 and achieve a speedup of 100× per iteration. Along with scalability, we also observe that the stochastic variant (Algorithm 1) has better generalization performance (lower test error).

6. DISCUSSION & CONCLUDING REMARKS

In this paper, we proposed a new notion of fairness violation, called exponential Rényi mutual information (ERMI). We showed that ERMI is a strong notion of fairness violation that provides guarantees on several other popular notions, namely Pearson correlation, Rényi correlation, Shannon mutual information, Rényi mutual information, and L_q distance violation. We proposed a fair empirical risk minimization framework with an ERMI regularizer to balance performance and fairness, and called it FERMI. Additionally, we showed that FERMI can be solved efficiently for non-binary sensitive attributes and non-binary target variables. We proposed batch and stochastic algorithms for solving FERMI with convergence guarantees for smooth losses. In particular, Algorithm 1 is unique among existing fair learning algorithms in that it is stochastic, making it much more practical for large-scale problems (as demonstrated in Sec. 5.3). This is made possible by Theorem 5, which leads to an unbiased estimator of the gradient of ERMI. It is not at all clear whether replacing ERMI with another regularizer, such as Rényi correlation or Shannon mutual information, in Eq. (12) would be amenable to stochastic optimization. From an experimental perspective, we showed that FERMI leads to better fairness-accuracy tradeoffs than the existing baselines. There are several possible explanations for the superior empirical performance of FERMI. One possible reason is that the objective in Eq. (12) is easier to optimize than the objectives of competing in-processing methods: ERMI is smooth, and in the discrete case it equals the trace of a matrix (see Theorem 8), which is easy to compute. Contrast this with the larger computational overhead of Rényi correlation, for example, which requires finding the second-largest singular value of a matrix.
Furthermore, the sample complexity of estimating Rényi mutual information of order 2 (and consequently of ERMI) scales as Θ(√|S|), compared to Θ(|S|/log|S|) for Shannon mutual information (Acharya et al., 2014). Another possible explanation is that ERMI is a stronger notion of fairness violation than all of the most widely used notions, as shown in Sec. 3, which might lead to better generalization. Together, these facts suggest that ERMI serves as an efficient and easily optimizable proxy for these other notions, leading to better practical performance regardless of which fairness measure is used. We leave it as future work to rigorously understand which of these (or other) factors are responsible for the favorable performance tradeoffs observed with FERMI. Finally, on the Color MNIST experiment with neural network function approximation, we observed that stochastic FERMI outperforms batch FERMI. In this case, we suspect that the randomness in stochastic FERMI may contribute to its convergence to a local minimum with better generalization performance than batch FERMI (see (Kleinberg et al., 2018) and the references therein).

A EXISTING NOTIONS OF FAIRNESS

Let $Y$, $\hat{Y}$, $\mathcal{A}$, and $S$ denote the true target, the predicted target, the advantaged outcome class, and the sensitive attribute, respectively. We review three major notions of fairness.

Definition 7 (demographic parity (Dwork et al., 2012)). We say that a learning machine satisfies demographic parity if $\hat{Y}$ is independent of $S$.

Definition 8 (equalized odds (Hardt et al., 2016)). We say that a learning machine satisfies equalized odds if $\hat{Y}$ is conditionally independent of $S$ given $Y$.

Definition 9 (equal opportunity (Hardt et al., 2016)). We say that a learning machine satisfies equal opportunity with respect to $\mathcal{A}$ if $\hat{Y}$ is conditionally independent of $S$ given $Y = y$ for all $y \in \mathcal{A}$.

Notice that equal opportunity as defined here generalizes the definition in (Hardt et al., 2016): it recovers equalized odds if $\mathcal{A} = \mathcal{Y}$, and it recovers the equal opportunity of (Hardt et al., 2016) for $\mathcal{A} = \{1\}$ in binary classification.

B PROPERTIES AND SPECIAL CASES OF ERMI

Notice that ERMI is in fact the $\chi^2$-divergence between the conditional joint distribution, $p_{\hat{Y},S}$, and the Kronecker product of conditional marginals, $p_{\hat{Y}} \otimes p_S$, where the conditioning is on $Z \in \mathcal{Z}$. Further, $\chi^2$-divergence is an $f$-divergence with $f(t) = (t-1)^2$; see (Csiszár & Shields, 2004, Section 4) for a discussion. As an immediate result of this observation and well-known properties of $f$-divergences, we can state the following property of ERMI:

Remark 7. $D_R(\hat{Y}; S \mid Z \in \mathcal{Z}) \geq 0$, with equality if and only if for all $z \in \mathcal{Z}$, $\hat{Y}$ and $S$ are conditionally independent given $Z = z$.

To further clarify the definition of ERMI, especially as it relates to demographic parity, equalized odds, and equal opportunity, we unravel the definition explicitly in a few special cases. First, let $Z = 0$ and $\mathcal{Z} = \{0\}$. In this case, $Z \in \mathcal{Z}$ trivially holds, and conditioning on $Z$ has no effect, resulting in
$$
D_R(\hat{Y}; S) := D_R(\hat{Y}; S \mid Z \in \mathcal{Z})\big|_{Z=0,\,\mathcal{Z}=\{0\}}
= \mathbb{E}_{\hat{Y},S}\left[\frac{p_{\hat{Y},S}(\hat{Y},S)}{p_{\hat{Y}}(\hat{Y})\, p_S(S)}\right] - 1
= \sum_{s \in \mathcal{S}} \int_{\hat{y} \in \mathcal{Y}} \frac{p_{\hat{Y},S}(\hat{y},s) - p_{\hat{Y}}(\hat{y})\, p_S(s)}{p_{\hat{Y}}(\hat{y})\, p_S(s)}\, p_{\hat{Y},S}(\hat{y},s)\, d\hat{y}.
$$
$D_R(\hat{Y}; S)$ is the notion of ERMI that should be used when the desired notion of fairness is demographic parity. In particular, $D_R(\hat{Y}; S) = 0$ implies that the $\chi^2$-divergence between $p_{\hat{Y},S}$ and the Kronecker product of marginals $p_{\hat{Y}} \otimes p_S$ is zero, which in turn implies that $\hat{Y}$ and $S$ are independent, i.e., demographic parity. We note that when $\hat{Y}$ and $S$ are discrete, this special case ($Z = 0$, $\mathcal{Z} = \{0\}$) of ERMI is referred to as $\chi^2$-information in (du Pin Calmon et al., 2017a). Finally, we consider $Z = Y$ and $\mathcal{Z} = \mathcal{A}$. In this case, we have
$$
D_R^{\mathcal{A}}(\hat{Y}; S \mid Y) := D_R(\hat{Y}; S \mid Z \in \mathcal{Z})\big|_{Z=Y,\,\mathcal{Z}=\mathcal{A}}
= \mathbb{E}_{Y,\hat{Y},S}\left[\frac{p_{\hat{Y},S|Y}(\hat{Y},S\mid Y)}{p_{\hat{Y}|Y}(\hat{Y}\mid Y)\, p_{S|Y}(S\mid Y)} \,\Big|\, Y \in \mathcal{A}\right] - 1
= \sum_{s \in \mathcal{S}} \int_{y \in \mathcal{A}} \int_{\hat{y} \in \mathcal{Y}} \frac{p_{\hat{Y},S|Y}(\hat{y},s\mid y)^2}{p_{\hat{Y}|Y}(\hat{y}\mid y)\, p_{S|Y}(s\mid y)}\, p_Y^{\mathcal{A}}(y)\, d\hat{y}\, dy \; - \; 1.
$$

C RELATIONS BETWEEN ERMI AND OTHER FAIRNESS VIOLATION NOTIONS

Proof of Theorem 1. We prove the (in)equalities one by one:
• $0 \leq I_S(\hat{Y}; S \mid Z \in \mathcal{Z})$. This is well known, and a proof can be found in any information theory textbook (Cover & Thomas, 1991).
• $I_1(\hat{Y}; S \mid Z \in \mathcal{Z}) \leq I_2(\hat{Y}; S \mid Z \in \mathcal{Z})$. This is a known property of Rényi mutual information, but we provide a proof for completeness in Lemma 1.
• $I_2(\hat{Y}; S \mid Z \in \mathcal{Z}) \leq e^{I_2(\hat{Y}; S \mid Z \in \mathcal{Z})} - 1$. This follows from the fact that $x \leq e^x - 1$.
• $e^{I_2(\hat{Y}; S \mid Z \in \mathcal{Z})} - 1 = D_R(\hat{Y}; S \mid Z \in \mathcal{Z})$. This follows from simple algebraic manipulation.

Lemma 1. Let $\hat{Y}, S, Z$ be discrete or continuous random variables. Then:
(a) For any $\alpha, \beta \in [1, \infty]$ with $\beta > \alpha$, $I_\beta(\hat{Y}; S \mid Z \in \mathcal{Z}) \geq I_\alpha(\hat{Y}; S \mid Z \in \mathcal{Z})$.
(b) $\lim_{\alpha \to 1^+} I_\alpha(\hat{Y}; S \mid Z \in \mathcal{Z}) = I_1(\hat{Y}; S) := \mathbb{E}_Z\left[ D_{KL}\big(p_{\hat{Y},S|Z} \,\|\, p_{\hat{Y}|Z} \otimes p_{S|Z}\big) \,\big|\, Z \in \mathcal{Z} \right]$, where $I_1(\cdot\,;\cdot)$ denotes Shannon mutual information and $D_{KL}$ is the Kullback-Leibler divergence (relative entropy).
(c) For all $\alpha \in [1, \infty]$, $I_\alpha(\hat{Y}; S \mid Z \in \mathcal{Z}) \geq 0$, with equality if and only if for all $z \in \mathcal{Z}$, $\hat{Y}$ and $S$ are conditionally independent given $z$.

Proof. (a) First assume $0 < \alpha < \beta < \infty$ with $\alpha, \beta \neq 1$. Define $a = \alpha - 1$ and $b = \beta - 1$. The function $\phi(t) = t^{b/a}$ is convex for all $t \geq 0$, so by Jensen's inequality we have
$$
\frac{1}{b} \log \mathbb{E}\left[ \left( \frac{p(\hat{Y},S \mid Z)}{p(\hat{Y} \mid Z)\, p(S \mid Z)} \right)^{b} \,\Big|\, Z \in \mathcal{Z} \right]
\geq \frac{1}{b} \log \left( \mathbb{E}\left[ \left( \frac{p(\hat{Y},S \mid Z)}{p(\hat{Y} \mid Z)\, p(S \mid Z)} \right)^{a} \,\Big|\, Z \in \mathcal{Z} \right] \right)^{b/a}
= \frac{1}{a} \log \mathbb{E}\left[ \left( \frac{p(\hat{Y},S \mid Z)}{p(\hat{Y} \mid Z)\, p(S \mid Z)} \right)^{a} \,\Big|\, Z \in \mathcal{Z} \right].
$$
Now suppose $\alpha = 1$. Then by the monotonicity for $\alpha \neq 1$ proved above, we have $I_1(\hat{Y}; S) = \lim_{\alpha \to 1^-} I_\alpha(\hat{Y}; S) = \sup_{\alpha \in (0,1)} I_\alpha(\hat{Y}; S) \leq \inf_{\alpha > 1} I_\alpha(\hat{Y}; S)$. Also, $I_\infty(\hat{Y}; S) = \lim_{\alpha \to \infty} I_\alpha(\hat{Y}; S) = \sup_{\alpha > 0} I_\alpha(\hat{Y}; S)$.
(b) This is a standard property of the cumulant generating function (see (Dembo & Zeitouni, 2009)).
(c) It is straightforward to observe that independence implies that Rényi mutual information vanishes. Conversely, if Rényi mutual information vanishes, then part (a) implies that Shannon mutual information also vanishes, which implies the desired conditional independence.
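The chain of inequalities in Theorem 1 can be checked numerically in the discrete, unconditional case ($Z = 0$, $\mathcal{Z} = \{0\}$). The sketch below uses a hypothetical joint pmf (not from the paper) and verifies $0 \le I_1 \le I_2 \le e^{I_2} - 1 = D_R$:

```python
import numpy as np

# Hypothetical joint pmf of (Y_hat, S): rows index Y_hat, columns index S.
p = np.array([[0.20, 0.10, 0.10],
              [0.15, 0.25, 0.20]])
assert np.isclose(p.sum(), 1.0)

py, ps = p.sum(axis=1), p.sum(axis=0)   # marginals
ratio = p / np.outer(py, ps)            # likelihood ratio p / (p_yhat * p_s)

i1 = (p * np.log(ratio)).sum()          # Shannon mutual information I_1
i2 = np.log((p * ratio).sum())          # Renyi mutual information of order 2
ermi = np.exp(i2) - 1.0                 # ERMI = e^{I_2} - 1

assert 0.0 <= i1 <= i2 <= ermi          # the chain from Theorem 1
# ERMI also equals the chi^2-divergence form E[p/(p*q)] - 1:
assert np.isclose(ermi, (p ** 2 / np.outer(py, ps)).sum() - 1.0)
```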
Proof of Theorem 2. The proof is completed using the following pieces:
• $0 \leq |\rho(\hat{Y}, S \mid Z \in \mathcal{Z})| \leq \rho_R(\hat{Y}, S \mid Z \in \mathcal{Z})$. This is immediate from the definition of $\rho_R(\hat{Y}, S \mid Z \in \mathcal{Z})$.
• $\rho_R(\hat{Y}, S \mid Z \in \mathcal{Z}) \leq D_R(\hat{Y}; S \mid Z \in \mathcal{Z})$. This follows from Theorem 8.
• Notice that if $|\mathcal{S}| = 2$, Theorem 8 implies that $D_R(\hat{Y}; S \mid Z \in \mathcal{Z}) = \rho_R(\hat{Y}, S \mid Z \in \mathcal{Z})$.

Theorem 8. Suppose that $\mathcal{S} = [k]$. Let the $k \times k$ matrix $P = \{P_{ij}\}_{i,j \in [k]}$ be defined by
$$
P_{ij} := \frac{1}{\sqrt{p_S(i)\, p_S(j)}} \int_{\hat{y} \in \mathcal{Y}} \frac{p_{\hat{Y},S}(\hat{y}, i)\, p_{\hat{Y},S}(\hat{y}, j)}{p_{\hat{Y}}(\hat{y})}\, d\hat{y}.
$$
Let $1 = \sigma_1 \geq \sigma_2 \geq \ldots \geq \sigma_k \geq 0$ be the eigenvalues of $P$. Then,
$$
\rho_R(\hat{Y}, S) = \sigma_2, \quad (24)
$$
$$
D_R(\hat{Y}; S) = \mathrm{Tr}(P) - 1 = \sum_{i=2}^{k} \sigma_i. \quad (25)
$$
Proof. Eq. (24) is proved in (Witsenhausen, 1975, Section 3). To prove Eq. (25), notice that
$$
\mathrm{Tr}(P) = \sum_{i \in [k]} P_{ii} = \sum_{i \in [k]} \frac{1}{p_S(i)} \int_{\hat{y} \in \mathcal{Y}} \frac{p_{\hat{Y},S}(\hat{y}, i)^2}{p_{\hat{Y}}(\hat{y})}\, d\hat{y} = \mathbb{E}_{\hat{Y},S}\left[\frac{p_{\hat{Y},S}(\hat{Y},S)}{p_{\hat{Y}}(\hat{Y})\, p_S(S)}\right] = 1 + D_R(\hat{Y}; S),
$$
which completes the proof.

Proof of Theorem 3. It suffices to prove the inequality for $L_1$, as $L_q$ is bounded above by $L_1$ for all $q \geq 1$. The proof for the case $Z = 0$ and $\mathcal{Z} = \{0\}$ follows from the following inequalities:
$$
L_1(\hat{Y}, S) = \sum_{s \in \mathcal{S}} \int_{\hat{y} \in \mathcal{Y}} \left| p_{\hat{Y},S}(\hat{y}, s) - p_{\hat{Y}}(\hat{y})\, p_S(s) \right| d\hat{y}
\leq \sqrt{ \sum_{s \in \mathcal{S}} \int_{\hat{y} \in \mathcal{Y}} \frac{\left( p_{\hat{Y},S}(\hat{y}, s) - p_{\hat{Y}}(\hat{y})\, p_S(s) \right)^2}{p_{\hat{Y}}(\hat{y})\, p_S(s)}\, d\hat{y} } \quad (29)
= \sqrt{D_R(\hat{Y}; S)},
$$
where Eq. (29) follows from the Cauchy-Schwarz inequality, and the last equality follows from Lemma 2. The extension to general $Z$ and $\mathcal{Z}$ is immediate by observing that $\rho(\hat{Y}, S \mid Z \in \mathcal{Z}) = \mathbb{E}_Z\left[ \rho(\hat{Y}, S \mid Z) \mid Z \in \mathcal{Z} \right]$, $\rho_R(\hat{Y}, S \mid Z \in \mathcal{Z}) = \mathbb{E}_Z\left[ \rho_R(\hat{Y}, S \mid Z) \mid Z \in \mathcal{Z} \right]$, and $D_R(\hat{Y}; S \mid Z \in \mathcal{Z}) = \mathbb{E}_Z\left[ D_R(\hat{Y}; S \mid Z) \mid Z \in \mathcal{Z} \right]$.

Lemma 2. We have
$$
D_R(\hat{Y}; S) = \sum_{s \in \mathcal{S}} \int_{\hat{y} \in \mathcal{Y}} \frac{\left( p_{\hat{Y},S}(\hat{y}, s) - p_{\hat{Y}}(\hat{y})\, p_S(s) \right)^2}{p_{\hat{Y}}(\hat{y})\, p_S(s)}\, d\hat{y}.
$$
Proof. The proof follows from the following identities:
$$
\sum_{s \in \mathcal{S}} \int_{\hat{y} \in \mathcal{Y}} \frac{\left( p_{\hat{Y},S}(\hat{y}, s) - p_{\hat{Y}}(\hat{y})\, p_S(s) \right)^2}{p_{\hat{Y}}(\hat{y})\, p_S(s)}\, d\hat{y}
= \sum_{s \in \mathcal{S}} \int_{\hat{y} \in \mathcal{Y}} \frac{p_{\hat{Y},S}(\hat{y}, s)^2}{p_{\hat{Y}}(\hat{y})\, p_S(s)}\, d\hat{y} - 2 \sum_{s \in \mathcal{S}} \int_{\hat{y} \in \mathcal{Y}} p_{\hat{Y},S}(\hat{y}, s)\, d\hat{y} + \sum_{s \in \mathcal{S}} \int_{\hat{y} \in \mathcal{Y}} p_{\hat{Y}}(\hat{y})\, p_S(s)\, d\hat{y} \quad (32)
= \mathbb{E}\left[\frac{p_{\hat{Y},S}(\hat{Y},S)}{p_{\hat{Y}}(\hat{Y})\, p_S(S)}\right] - 1 \quad (33)
= D_R(\hat{Y}; S).
$$
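Theorem 8's trace identity is easy to sanity-check numerically in the fully discrete case. The sketch below (hypothetical joint table, not the paper's data) forms the matrix $P$, compares $\mathrm{Tr}(P) - 1$ with the $\chi^2$ form of ERMI from Lemma 2, and checks that the top eigenvalue of $P$ is 1:

```python
import numpy as np

# Hypothetical joint pmf: rows index Y_hat (discrete), columns index S with k = 3.
p = np.array([[0.20, 0.10, 0.10],
              [0.15, 0.25, 0.20]])
py, ps = p.sum(axis=1), p.sum(axis=0)
k = p.shape[1]

# P_ij = (1 / sqrt(ps_i ps_j)) * sum_y p(y, i) p(y, j) / py(y)
P = np.zeros((k, k))
for i in range(k):
    for j in range(k):
        P[i, j] = (p[:, i] * p[:, j] / py).sum() / np.sqrt(ps[i] * ps[j])

ermi_trace = np.trace(P) - 1.0
ermi_chi2 = ((p - np.outer(py, ps)) ** 2 / np.outer(py, ps)).sum()
assert np.isclose(ermi_trace, ermi_chi2)      # Tr(P) - 1 equals the chi^2 form

# The largest eigenvalue of P is 1; the remaining ones sum to ERMI.
eig = np.sort(np.linalg.eigvalsh(P))[::-1]
assert np.isclose(eig[0], 1.0)
assert np.isclose(eig[1:].sum(), ermi_trace)
```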
Next, we present some alternative fairness definitions and show that they are also upper bounded by ERMI.

Definition 10 (conditional demographic parity $L_\infty$ violation). Given a predictor $\hat{Y}$ supported on $\mathcal{Y}$ and a discrete sensitive attribute $S$ supported on a finite set $\mathcal{S}$, we define the conditional demographic parity violation by
$$
dp(\hat{Y} \mid S) := \sup_{\hat{y} \in \mathcal{Y}} \max_{s \in \mathcal{S}} \left| p_{\hat{Y}|S}(\hat{y} \mid s) - p_{\hat{Y}}(\hat{y}) \right|.
$$
First, we show that $dp(\hat{Y} \mid S)$ is a reasonable notion of fairness violation.

Theorem 9 (ERMI is stronger than conditional demographic parity $L_\infty$ violation). Let $\hat{Y}$ be a discrete or continuous random variable supported on $\mathcal{Y}$, and let $S$ be a discrete random variable supported on a finite set $\mathcal{S}$. Denote $p_S^{\min} := \min_{s \in \mathcal{S}} p_S(s) > 0$. Then,
$$
0 \leq dp(\hat{Y} \mid S) \leq \frac{1}{p_S^{\min}} \sqrt{D_R(\hat{Y}; S)}.
$$
Proof. The proof follows from the following (in)equalities:
$$
dp(\hat{Y} \mid S)^2 = \sup_{\hat{y} \in \mathcal{Y}} \max_{s \in \mathcal{S}} \left( p_{\hat{Y}|S}(\hat{y} \mid s) - p_{\hat{Y}}(\hat{y}) \right)^2 \quad (37)
$$
$$
\leq \frac{1}{(p_S^{\min})^2} \sup_{\hat{y} \in \mathcal{Y}} \max_{s \in \mathcal{S}} \left( p_{\hat{Y},S}(\hat{y}, s) - p_{\hat{Y}}(\hat{y})\, p_S(s) \right)^2 \quad (38)
$$
$$
\leq \frac{1}{(p_S^{\min})^2} \sum_{s \in \mathcal{S}} \int_{\hat{y} \in \mathcal{Y}} \left( p_{\hat{Y},S}(\hat{y}, s) - p_{\hat{Y}}(\hat{y})\, p_S(s) \right)^2 d\hat{y} \quad (39)
$$
$$
\leq \frac{1}{(p_S^{\min})^2} D_R(\hat{Y}; S), \quad (40)
$$
where Eq. (40) follows from Theorem 3. Taking square roots completes the proof.

Definition 11 (conditional equal opportunity $L_\infty$ violation (Hardt et al., 2016)). Let $Y, \hat{Y}$ take values in $\mathcal{Y}$ and let $\mathcal{A} \subseteq \mathcal{Y}$ be a compact subset denoting the advantaged outcomes (for example, the decision "to interview" an individual, or classifying an individual as "low risk" for financial purposes). We define the conditional equal opportunity $L_\infty$ violation $eo(\hat{Y} \mid S, Y \in \mathcal{A})$ accordingly (see Theorem 10), and show that
$$
0 \leq eo(\hat{Y} \mid S, Y \in \mathcal{A}) \leq \frac{1}{p_{S|\mathcal{A}}^{\min}} \sqrt{D_R(\hat{Y}; S \mid Y \in \mathcal{A})}.
$$
Proof. Notice that the same argument as in the proof of Theorem 9 gives, for all $y \in \mathcal{A}$,
$$
0 \leq \sup_{\hat{y} \in \mathcal{Y}} \max_{s \in \mathcal{S}} \left| p_{\hat{Y}|S,Y}(\hat{y} \mid s, y) - p_{\hat{Y}|Y}(\hat{y} \mid y) \right| =: eo(\hat{Y} \mid S, Y = y)
\leq \frac{1}{p_{S|Y}^{\min}(y)} \sqrt{D_R(\hat{Y}; S \mid Y = y)}
\leq \frac{1}{p_{S|\mathcal{A}}^{\min}} \sqrt{D_R(\hat{Y}; S \mid Y = y)}.
$$
Hence,
$$
eo(\hat{Y} \mid S, Y \in \mathcal{A}) = \mathbb{E}_Y\left[ eo(\hat{Y} \mid S, Y) \mid Y \in \mathcal{A} \right]
\leq \frac{1}{p_{S|\mathcal{A}}^{\min}} \mathbb{E}_Y\left[ \sqrt{D_R(\hat{Y}; S \mid Y)} \,\Big|\, Y \in \mathcal{A} \right]
\leq \frac{1}{p_{S|\mathcal{A}}^{\min}} \sqrt{ \mathbb{E}_Y\left[ D_R(\hat{Y}; S \mid Y) \mid Y \in \mathcal{A} \right] }
= \frac{1}{p_{S|\mathcal{A}}^{\min}} \sqrt{ D_R(\hat{Y}; S \mid Y \in \mathcal{A}) },
$$
where the last inequality follows from Jensen's inequality. This completes the proof.
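The bound of Theorem 9 can be illustrated numerically for a discrete predictor. The sketch below (hypothetical joint table, our own illustrative numbers) computes the $L_\infty$ demographic parity violation and checks it against $\sqrt{D_R(\hat{Y};S)} / p_S^{\min}$:

```python
import numpy as np

# Hypothetical joint pmf of (Y_hat, S): rows index Y_hat, columns index S.
p = np.array([[0.20, 0.10, 0.10],
              [0.15, 0.25, 0.20]])
py, ps = p.sum(axis=1), p.sum(axis=0)

# Columns of p / ps are the conditionals p(Y_hat | S = s).
p_yhat_given_s = p / ps
dp = np.abs(p_yhat_given_s - py[:, None]).max()   # L_inf demographic parity violation

# ERMI in its chi^2 form (Lemma 2).
ermi = ((p - np.outer(py, ps)) ** 2 / np.outer(py, ps)).sum()

# Theorem 9: dp <= sqrt(ERMI) / p_S^min.
assert 0.0 <= dp <= np.sqrt(ermi) / ps.min()
```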

D STOCHASTIC FERMI

Proof of Theorem 5. Let $W^* \in \arg\max_{W \in \mathbb{R}^{k \times m}} \left\{ -\mathrm{Tr}(W P_{\hat{y}} W^T) + 2\,\mathrm{Tr}(W P_{\hat{y},s} P_s^{-1/2}) \right\}$. We will compute $W^*$ and plug it into the RHS of Eq. (16) to show the equality in Eq. (16). Setting the derivative of the expression on the RHS to zero leads to
$$
-2 W P_{\hat{y}} + 2 P_s^{-1/2} P_{\hat{y},s}^T = 0 \implies W^* = P_s^{-1/2} P_{\hat{y},s}^T P_{\hat{y}}^{-1}.
$$
Plugging in this expression for $W^*$, we have
$$
\max_{W \in \mathbb{R}^{k \times m}} \left\{ -\mathrm{Tr}(W P_{\hat{y}} W^T) + 2\,\mathrm{Tr}(W P_{\hat{y},s} P_s^{-1/2}) \right\}
= -\mathrm{Tr}(P_s^{-1/2} P_{\hat{y},s}^T P_{\hat{y}}^{-1} P_{\hat{y}} P_{\hat{y}}^{-1} P_{\hat{y},s} P_s^{-1/2}) + 2\,\mathrm{Tr}(P_s^{-1/2} P_{\hat{y},s}^T P_{\hat{y}}^{-1} P_{\hat{y},s} P_s^{-1/2})
= \mathrm{Tr}(P_s^{-1/2} P_{\hat{y},s}^T P_{\hat{y}}^{-1} P_{\hat{y},s} P_s^{-1/2})
= \mathrm{Tr}(P_s^{-1} P_{\hat{y},s}^T P_{\hat{y}}^{-1} P_{\hat{y},s}).
$$
Writing out the matrix multiplication explicitly in the last expression, we have $P_s^{-1} P_{\hat{y},s}^T P_{\hat{y}}^{-1} P_{\hat{y},s} = U V^T$, where $U_{i,j} = p_S(i)^{-1}\, p_{\hat{Y},S}(j, i)$ and $V_{i,j} = p_{\hat{Y}}(j)^{-1}\, p_{\hat{Y},S}(j, i)$ for $i \in [k]$, $j \in [m]$. Hence,
$$
\max_{W \in \mathbb{R}^{k \times m}} \left\{ -\mathrm{Tr}(W P_{\hat{y}} W^T) + 2\,\mathrm{Tr}(W P_{\hat{y},s} P_s^{-1/2}) \right\}
= \mathrm{Tr}(U V^T) = \sum_{i \in [k]} \sum_{j \in [m]} \frac{p_{\hat{Y},S}(j, i)^2}{p_S(i)\, p_{\hat{Y}}(j)} = 1 + D_R(\hat{Y}; S),
$$
so that the maximum, minus one, equals $D_R(\hat{Y}; S)$, which completes the proof.

Next, we move to the statement and proof of the precise version of Theorem 6. We first recall some basic definitions.

Definition 12. A function $f$ is $\beta$-smooth if for all $u, u'$ we have $\|\nabla f(u) - \nabla f(u')\| \leq \beta \|u - u'\|$.

Definition 13. A point $\theta$ is an $\epsilon$-stationary point of a differentiable function $\Phi$ if $\|\nabla \Phi(\theta)\| \leq \epsilon$.

Assumption 2.
• $\ell$ is twice differentiable, $L_\ell$-Lipschitz, and $\beta_\ell$-smooth in $\theta$.
• $\|\nabla_\theta P_{\hat{y}}\|_2 := \|\nabla_\theta \mathrm{vec}(P_{\hat{y}})\|_2 \leq L_y$ and $\max_{l \in [m]} \|\nabla_\theta (P_{\hat{y}})_{l,l}\|_2 \leq L_y$.
• $\max_{l \in [m]} \|\nabla^2_{\theta\theta} (P_{\hat{y}})_{l,l}\|_2 \leq \beta_y$.
• $\|\nabla_\theta P_{\hat{y},s}^T\|_2 := \|\nabla_\theta \mathrm{vec}(P_{\hat{y},s}^T)\|_2 \leq L_{ys}$ and $\max_{l \in [m],\, j \in [k]} \|\nabla_\theta (P_{\hat{y},s})_{l,j}\|_2 \leq L_{ys}$.
• $\max_{l \in [m],\, j \in [k]} \|\nabla^2_{\theta\theta} (P_{\hat{y},s})_{l,j}\|_2 \leq \beta_{ys}$.
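The variational characterization in Theorem 5 can be verified numerically in the discrete case. The sketch below uses a hypothetical joint pmf (our own numbers); $P_{\hat y}$ and $P_s$ are the diagonal marginal matrices and $P_{\hat y,s}$ the joint table, following the theorem's notation. Note that the unconstrained maximum evaluates to $D_R(\hat{Y};S) + 1$; the constant $-1$ in the FERMI objective accounts for this:

```python
import numpy as np

# Hypothetical joint pmf: rows index Y_hat (m = 2), columns index S (k = 3).
p = np.array([[0.20, 0.10, 0.10],
              [0.15, 0.25, 0.20]])
m, k = p.shape
py, ps = p.sum(axis=1), p.sum(axis=0)

P_y = np.diag(py)                       # m x m diagonal marginal of Y_hat
P_s_inv_sqrt = np.diag(ps ** -0.5)      # k x k, P_s^{-1/2}
P_ys = p                                # m x k joint table

# Closed-form maximizer of -Tr(W P_y W^T) + 2 Tr(W P_ys P_s^{-1/2}) over W in R^{k x m}.
W_star = P_s_inv_sqrt @ P_ys.T @ np.linalg.inv(P_y)
obj = (-np.trace(W_star @ P_y @ W_star.T)
       + 2.0 * np.trace(W_star @ P_ys @ P_s_inv_sqrt))

ermi = (p ** 2 / np.outer(py, ps)).sum() - 1.0     # D_R(Y_hat; S)
assert np.isclose(obj, 1.0 + ermi)                 # max = D_R + 1
```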

Denote

$\Delta_\Phi := \Phi(\theta_0) - \min_\theta \Phi(\theta)$, where $\Phi(\theta) := \max_{W \in \mathcal{W}} f(\theta, W)$. In Algorithm 1, choose the stepsizes as $\eta_\theta = \Theta(1/(\kappa^2 \beta))$ and $\eta_W = \Theta(1/\beta)$, and the mini-batch size as $M = \Theta(\max\{1, \kappa \sigma^2 \epsilon^{-2}\})$. Then under Assumption 2, the iteration complexity of Algorithm 1 to return an $\epsilon$-stationary point of $f$ is bounded by
$$
O\left( \frac{\kappa^2 \beta \Delta_\Phi + \kappa \beta^2 D^2}{\epsilon^2} \right),
$$
which gives the total stochastic gradient complexity
$$
O\left( \frac{\kappa^2 \beta \Delta_\Phi + \kappa \beta^2 D^2}{\epsilon^2} \max\left\{1, \frac{\kappa \sigma^2}{\epsilon^2}\right\} \right),
$$
where
$$
\beta = \beta_\ell + 8\lambda D^2 \beta_y + \frac{4\lambda}{\sqrt{p_s^{\min}}} \sqrt{m}\, k^{3/2} D \beta_{ys} + 2\lambda + 4\lambda\left( D L_y + \frac{L_{ys}}{\sqrt{p_s^{\min}}} \right), \quad \mu = 2\lambda p_y^{\min}, \quad \kappa = \beta/\mu,
$$
$$
\sigma^2 = 2\left( L_\ell + 2\lambda L_y D^2 + \frac{4\lambda D \sqrt{mk}\, L_{ys}}{\sqrt{p_s^{\min}}} \right)^2 + 2\left( 2\lambda D + \frac{2\sqrt{mk}}{\sqrt{p_s^{\min}}} \right)^2.
$$
The theorem follows from (Lin et al., 2020, Theorem 4.5) combined with the following technical lemmas. We assume Assumption 2 holds for the remainder of the proof of Theorem 11.

Lemma 4. Let
$$
f(\theta, W) = \frac{1}{N} \sum_{i \in [N]} \ell(x_i, y_i; \theta) + \lambda\left( -\mathrm{Tr}(W P_{\hat{y}} W^T) + 2\,\mathrm{Tr}(W P_{\hat{y},s} P_s^{-1/2}) - 1 \right) := \frac{1}{N} \sum_{i \in [N]} g(\theta, W, x_i, y_i).
$$
Then:
1. $f$ is $\beta$-smooth, with $\beta$ as defined above.
2. $f(\theta, \cdot)$ is $2\lambda p_y^{\min}$-strongly concave for all $\theta$.
3. $\|W^*\|_F \leq D$, where $D$ is as defined in Theorem 11 and $W^*$ denotes any maximizer of $f(\theta, W)$.

Proof. By Assumption 2, $g$ is twice continuously differentiable. Hence, for part 1, it suffices to upper bound the spectral norm of the second derivative of $g(\cdot, \cdot, z)$ by $\beta$ for all $z = (x, y)$, where we vectorize and then differentiate with respect to $w := \mathrm{vec}(W)$ and/or $\theta$, so that the resulting first and second derivatives are always vectors or matrices (not tensors). Notice that
$$
g(\theta, w, z) = \ell(z, \theta) - \lambda w^T (P_{\hat{y}} \otimes I) w + 2\lambda\, \mathrm{vec}(W)^T \mathrm{vec}(P_s^{-1/2} P_{\hat{y},s}^T) - \lambda
$$
and
$$
\nabla^2 g(\theta, w, z) = \begin{pmatrix} \nabla^2_{\theta\theta} g(\theta, w, z) & \nabla^2_{\theta w} g(\theta, w, z) \\ \nabla^2_{w\theta} g(\theta, w, z) & \nabla^2_{ww} g(\theta, w, z) \end{pmatrix}.
$$
Further, by the definition of the operator norm, we have
$$
\|\nabla^2 g(\theta, w, z)\|_2 \leq \|\nabla^2_{\theta\theta} g(\theta, w, z)\|_2 + 2 \|\nabla^2_{\theta w} g(\theta, w, z)\|_2 + \|\nabla^2_{ww} g(\theta, w, z)\|_2.
$$
Now we vectorize all matrices and compute the derivatives of $g$ with respect to $\theta$ and $\mathrm{vec}(W)$:
$$
\nabla_\theta g(\theta, w, z) = \nabla_\theta \ell(z, \theta) - 2\lambda \sum_{l \in [m],\, i \in [k]} W_{i,l}^2\, \nabla_\theta (P_{\hat{y}})_{l,l} + 2\lambda \sum_{j \in [m],\, i \in [k]} W_{i,j}\, \big(\nabla_\theta (P_{\hat{y},s})_{j,i}\big)\, (P_s^{-1/2})_{i,i}; \quad (44)
$$
$$
\nabla_w g(\theta, w, z) = -2\lambda W P_{\hat{y}} + 2\lambda P_s^{-1/2} P_{\hat{y},s}^T. \quad (45)
$$
Differentiating again yields:
$$
\nabla^2_{ww} g(\theta, w, z) = -2\lambda P_{\hat{y}} \otimes I_k;
$$
$$
\nabla^2_{w\theta} g(\theta, w, z) = \frac{\partial}{\partial \theta} \frac{\partial g(\theta, w, z)}{\partial w} = -2\lambda (I_m \otimes W) \nabla_\theta P_{\hat{y}} + 2\lambda (I_m \otimes P_s^{-1/2}) \nabla_\theta \mathrm{vec}(P_{\hat{y},s}^T);
$$
$$
\nabla^2_{\theta\theta} g(\theta, w, z) = \nabla^2_\theta \ell(z, \theta) - 2\lambda \sum_{l \in [m],\, i \in [k]} W_{i,l}^2\, \nabla^2_{\theta\theta} (P_{\hat{y}})_{l,l} + 2\lambda \sum_{j \in [m],\, i \in [k]} W_{i,j}\, \big(\nabla^2_{\theta\theta} (P_{\hat{y},s})_{j,i}\big)\, (P_s^{-1/2})_{i,i}.
$$
Then, to establish part 1, use Assumption 2, Clairaut's theorem, the definitions of the matrices and the fact that their entries are in $[0, 1]$, the relations $\|AB\|_2 \leq \|A\|_2 \|B\|_2$ and $\|\mathrm{vec}(W)\|_1 \leq \sqrt{mk}\, \|\mathrm{vec}(W)\|_2 = \sqrt{mk}\, \|W\|_F$, and the fact that $\|A \otimes B\|_2 = \|A\|_2 \|B\|_2$, to bound the spectral norm of each second derivative above. The strong concavity statement follows by noticing that $\nabla^2_{ww} g(\theta, W) \preceq -\mu I$ iff $P_{\hat{y}} \succeq \frac{\mu}{2\lambda} I$ iff $\min_{i \in [m]} p_{\hat{y}}(i) \geq \frac{\mu}{2\lambda}$. Part 3 follows from the expression for $W^*$ in the proof of Theorem 5.

Lemma 5. Consider $f$ and $g$ as defined above. Then we have
$$
\mathbb{E}_z\big[\nabla g(\theta, W, z)\big] = \nabla f(\theta, W),
$$
$$
\mathbb{E}_z\big\|\nabla g(\theta, W, z) - \nabla f(\theta, W)\big\|_2^2 \leq 2\left( L_\ell + 2\lambda L_y D^2 + \frac{4\lambda D \sqrt{mk}\, L_{ys}}{\sqrt{p_s^{\min}}} \right)^2 + 2\left( 2\lambda D + \frac{2\sqrt{mk}}{\sqrt{p_s^{\min}}} \right)^2,
$$
where both expectations are with respect to the empirical distribution on $\{z_i\}_{i \in [N]}$.

Proof. The first statement is obvious. The second follows from Eq. (44) in the proof of Lemma 4, since
$$
\mathbb{E}_z\big\|\nabla g(\theta, W, z) - \nabla f(\theta, W)\big\|_2^2
= \frac{1}{N} \sum_{i=1}^{N} \|\nabla g(\theta, W, z_i)\|_2^2 - \frac{1}{N^2} \sum_{i,j=1}^{N} \big\langle \nabla g(\theta, W, z_i), \nabla g(\theta, W, z_j) \big\rangle
\leq 2 \sup_{z_i} \|\nabla g(\theta, W, z_i)\|_2^2
\leq 2 \sup_z \left( \|\nabla_\theta g(\theta, W, z)\|^2 + \|\nabla_w g(\theta, W, z)\|^2 \right)
$$
$$
\leq 2 \sup_z \left( \left\| \nabla_\theta \ell(z, \theta) - 2\lambda \sum_{l \in [m],\, i \in [k]} W_{i,l}^2\, \nabla_\theta (P_{\hat{y}})_{l,l} + 2\lambda \sum_{j \in [m],\, i \in [k]} W_{i,j}\, \big(\nabla_\theta (P_{\hat{y},s})_{j,i}\big)\, (P_s^{-1/2})_{i,i} \right\|_2^2 + \left\| -2\lambda W P_{\hat{y}} + 2\lambda P_s^{-1/2} P_{\hat{y},s}^T \right\|_2^2 \right).
$$

E EXPERIMENTAL DETAILS

We perform the experiments in Sections 5.1 and 5.2 with a linear model (with softmax activation). The model parameters are estimated using the algorithm described in Section 4.1. In Section 5.2, the dataset is cleaned and processed as described in (Kearns et al., 2018).

The trade-off curves for FERMI are generated by sweeping over values of $\lambda$ in $[0, 100]$, the learning rate $\eta$ in $[0.0005, 0.01]$, and the number of iterations $T$ in $[50, 200]$.

For the experiments in Section 5.3, we create the synthetic color MNIST dataset as described in (Li & Vasconcelos, 2019). We set the value $\sigma = 0$. In Figure 4, we compare the performance of the stochastic solver (Section 4.2) against the GD algorithm (Section 4.1). We use a mini-batch of size 512 with the stochastic solver. The color MNIST data has 60,000 training samples, so using the stochastic solver gives a speedup of around 100x per iteration, and an overall speedup of around 40x. We present our results on two neural network architectures, namely LeNet-5 (Lecun et al., 1998) and a multi-layer perceptron (MLP). The MLP has two hidden layers (with 300 and 100 nodes) and an output layer with ten nodes; a ReLU activation follows each hidden layer, and a softmax activation follows the output layer.

Some general advice for tuning $\lambda$: a larger value of $\lambda$ generally translates to better fairness, but one must be careful not to use a very large value, as it can lead to poor generalization performance of the model. The optimal values of $\lambda$, $\eta$, and $T$ depend largely on the data and the intended application. We recommend starting with $\lambda \approx 10$. In Appendix E.2, we observe the effect of changing $\lambda$ on the model accuracy and fairness for the COMPAS dataset.
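A minimal sketch (not the paper's implementation) of the two-timescale stochastic descent-ascent updates behind Algorithm 1, for a linear softmax model with the discrete ERMI surrogate $-\mathrm{Tr}(W \hat{P}_{\hat y} W^T) + 2\,\mathrm{Tr}(W \hat{P}_{\hat y,s} \hat{P}_s^{-1/2})$ estimated on each mini-batch. The synthetic data, step sizes, batch size, and iteration count are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative synthetic data where the sensitive attribute s correlates with y.
N, d, m, k = 2000, 5, 2, 2                 # samples, features, classes, groups
s = rng.integers(0, k, size=N)
y = (rng.random(N) < 0.3 + 0.4 * s).astype(int)
X = rng.normal(size=(N, d)) + 0.5 * s[:, None]

theta = np.zeros((d, m))                   # primal: linear softmax parameters
W = np.zeros((k, m))                       # dual variable of the ERMI surrogate
lam, eta_th, eta_w, B = 2.0, 0.1, 0.1, 128

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

for t in range(300):
    idx = rng.integers(0, N, size=B)
    xb, yb, sb = X[idx], y[idx], s[idx]
    O = softmax(xb @ theta)                            # predicted probabilities
    py = O.mean(axis=0)                                # batch estimate of P_yhat (diagonal)
    ps = np.clip(np.bincount(sb, minlength=k) / B, 1e-6, None)   # estimate of P_s
    P_ys = np.stack([O[sb == i].sum(axis=0) / B for i in range(k)], axis=1)  # m x k

    # Ascent on W: grad_W = lam * (-2 W P_y + 2 P_s^{-1/2} P_ys^T).
    W += eta_w * lam * (-2 * W * py[None, :] + 2 * (ps[:, None] ** -0.5) * P_ys.T)

    # Descent on theta: cross-entropy gradient plus regularizer gradient through softmax.
    g_ce = O.copy(); g_ce[np.arange(B), yb] -= 1.0; g_ce /= B
    wtw_diag = (W * W).sum(axis=0)                     # diagonal of W^T W
    g_O = lam / B * (-wtw_diag[None, :] + 2 * W[sb] / np.sqrt(ps[sb])[:, None])
    g_logits = O * (g_O - (g_O * O).sum(axis=1, keepdims=True))
    theta -= eta_th * xb.T @ (g_ce + g_logits)

O_all = softmax(X @ theta)
assert np.isfinite(theta).all() and np.isfinite(W).all()
assert np.allclose(O_all.sum(axis=1), 1.0)             # valid probability outputs
```

In practice the per-batch estimates of $P_{\hat y}$, $P_s$, and $P_{\hat y,s}$ are what make the gradient estimator unbiased (Theorem 5), which is the property that distinguishes the stochastic solver from the batch one.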

E.2 EFFECT OF HYPER-PARAMETER TUNING ON THE ACCURACY-FAIRNESS TRADE-OFF

We run the ERMI algorithm in the binary case on the COMPAS dataset to investigate the effect of hyperparameter tuning on its accuracy-fairness trade-off. As can be observed in Fig. 5, increasing $\lambda$ from 0 to 1000 slightly increases the test error (left axis, red curves), while the fairness violation (right axis, green curves) decreases. Moreover, for both notions of fairness (demographic parity, solid curves; equality of opportunity, dashed curves), the trade-off between test error and fairness follows a similar pattern. To measure fairness violation, we use the demographic parity violation and the equality of opportunity violation defined in Section 5 for the solid and dashed curves, respectively.

E.3 DATASETS DESCRIPTION

All of the following datasets are publicly available at the UCI repository.

German Credit Dataset. The German Credit dataset consists of 20 features (13 categorical and 7 numerical) regarding the social and economic status of 1,000 customers. The assigned task is to classify customers as good or bad credit risks. Without imposing fairness, the demographic parity violation of the trained model is larger than 20%. We chose the first 800 customers as the training data, and the last 200 customers as the test data. The sensitive attributes are gender and marital status.

Adult Dataset. The Adult dataset contains census information about individuals, including education, gender, and capital gain. The assigned classification task is to predict whether a person earns over 50k annually. The train and test sets are two separate files consisting of 32,000 and 16,000 samples, respectively. We consider gender and race as the sensitive attributes (for the experiments involving one sensitive attribute, we chose gender). Learning a logistic regression model on the training dataset (without imposing fairness) shows that only 3 features out of 14 have larger weights than the gender attribute. Note that removing the sensitive attribute (gender) and retraining the model does not eliminate the bias of the classifier; the optimal logistic regression classifier in this case is still highly biased. For the clustering task, we chose 5 continuous features (capital-gain, age, fnlwgt, capital-loss, hours-per-week) and 10,000 samples to cluster. The sensitive attribute of each individual is gender.

Communities and Crime Dataset. The dataset is cleaned and processed as described in (Kearns et al., 2018). Briefly, each record in this dataset summarizes aggregate socioeconomic information about both the citizens and the police force in a particular U.S. community, and the problem is to predict whether the community has a high rate of violent crime.
COMPAS Dataset. Correctional Offender Management Profiling for Alternative Sanctions (COMPAS) is a well-known algorithm that is widely used by judges to estimate the likelihood of recidivism. The algorithm has been observed to be highly biased against black defendants. The dataset contains the features used by the COMPAS algorithm, along with the score assigned by the algorithm within two years of the decision.



Note that a similar relationship with the TV norm could be established as well.

In this section, we present all results in the context of $Z = 0$ and $\mathcal{Z} = \{0\}$, leaving off all conditional expectations for clarity of presentation. The results can be generalized to general $(Z, \mathcal{Z})$, as we have done for the algorithms used in our empirical experiments.

Dataset sources:
German Credit: https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)
Adult: https://archive.ics.uci.edu/ml/datasets/adult
Communities and Crime: http://archive.ics.uci.edu/ml/datasets/communities+and+crime
COMPAS: https://www.kaggle.com/danofer/compass



Figure 1: Tradeoff of fairness violation vs test error for different baselines on German Credit and Adult datasets. The desired operation point is the lower left corner where both fairness violation and test error are small. FERMI achieves the best fairness vs performance tradeoff across all baselines.

Figure 3: Color MNIST.

Under review as a conference paper at ICLR 2021

Next, we consider $Z = Y$ and $\mathcal{Z} = \mathcal{Y}$. In this case, $Z \in \mathcal{Z}$ is trivially satisfied, and hence
$$
D_R(\hat{Y}; S \mid Y) := D_R(\hat{Y}; S \mid Z \in \mathcal{Z})\big|_{Z=Y,\,\mathcal{Z}=\mathcal{Y}}
= \mathbb{E}_{Y,\hat{Y},S}\left[\frac{p_{\hat{Y},S|Y}(\hat{Y},S\mid Y)}{p_{\hat{Y}|Y}(\hat{Y}\mid Y)\, p_{S|Y}(S\mid Y)}\right] - 1
= \sum_{s\in\mathcal{S}} \int_{y\in\mathcal{Y}} \int_{\hat{y}\in\mathcal{Y}} \frac{p_{\hat{Y},S|Y}(\hat{y},s\mid y) - p_{\hat{Y}|Y}(\hat{y}\mid y)\, p_{S|Y}(s\mid y)}{p_{\hat{Y}|Y}(\hat{y}\mid y)\, p_{S|Y}(s\mid y)}\, p_{Y,\hat{Y},S}(y,\hat{y},s)\, d\hat{y}\, dy
= \sum_{s\in\mathcal{S}} \int_{y\in\mathcal{Y}} \int_{\hat{y}\in\mathcal{Y}} \frac{p_{\hat{Y},S|Y}(\hat{y},s\mid y)^2}{p_{\hat{Y}|Y}(\hat{y}\mid y)\, p_{S|Y}(s\mid y)}\, p_Y(y)\, d\hat{y}\, dy \; - \; 1. \quad (18)
$$
$D_R(\hat{Y}; S \mid Y)$ should be used when the desired notion of fairness is equalized odds. In particular, $D_R(\hat{Y}; S \mid Y) = 0$ directly implies the conditional independence of $\hat{Y}$ and $S$ given $Y$.

This quantity, $D_R^{\mathcal{A}}(\hat{Y}; S \mid Y)$, is what should be used when the desired notion of fairness is equal opportunity. It can be further simplified when the advantaged class is a singleton (which is the case in binary classification). If $Z = Y$ and $\mathcal{Z} = \{y\}$, then
$$
D_R(\hat{Y}; S \mid Y = y) := D_R^{\{y\}}(\hat{Y}; S \mid Y)
= \sum_{s\in\mathcal{S}} \int_{\hat{y}\in\mathcal{Y}} \frac{p_{\hat{Y},S|Y}(\hat{y},s\mid y) - p_{\hat{Y}|Y}(\hat{y}\mid y)\, p_{S|Y}(s\mid y)}{p_{\hat{Y}|Y}(\hat{y}\mid y)\, p_{S|Y}(s\mid y)}\, p_{\hat{Y},S|Y}(\hat{y},s\mid y)\, d\hat{y}
= \sum_{s\in\mathcal{S}} \int_{\hat{y}\in\mathcal{Y}} \frac{p_{\hat{Y},S|Y}(\hat{y},s\mid y)^2}{p_{\hat{Y}|Y}(\hat{y}\mid y)\, p_{S|Y}(s\mid y)}\, d\hat{y} \; - \; 1. \quad (21)
$$
Finally, we note that we use the notation $D_R(\hat{Y}; S \mid Y)$ and $D_R(\hat{Y}; S \mid Y = y)$ to be consistent with the definition of conditional mutual information in (Cover & Thomas, 1991).
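For discrete variables, these conditional versions of ERMI are just weighted averages of per-slice $\chi^2$ divergences. A small sketch with a hypothetical joint table of $(Y, \hat{Y}, S)$ (our own illustrative numbers):

```python
import numpy as np

# Hypothetical joint pmf p(y, yhat, s): axis 0 = Y, axis 1 = Y_hat, axis 2 = S.
p = np.array([[[0.10, 0.05], [0.05, 0.10]],
              [[0.20, 0.10], [0.10, 0.30]]])
assert np.isclose(p.sum(), 1.0)
p_y = p.sum(axis=(1, 2))                       # marginal of Y

def ermi_slice(joint_2d):
    """chi^2 divergence between a 2-d joint pmf and the product of its marginals."""
    a, b = joint_2d.sum(axis=1), joint_2d.sum(axis=0)
    return ((joint_2d - np.outer(a, b)) ** 2 / np.outer(a, b)).sum()

# Equalized-odds ERMI (Eq. (18)): D_R(Y_hat; S | Y) = E_Y[ D_R(Y_hat; S | Y = y) ].
per_y = np.array([ermi_slice(p[yv] / p_y[yv]) for yv in range(2)])
d_eo = (p_y * per_y).sum()

# Equal-opportunity ERMI conditions only on the advantaged class, e.g. A = {1} (Eq. (21)).
d_eopp = per_y[1]
assert d_eo >= 0 and d_eopp >= 0               # Remark 7 applies slice-wise
```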


$dp(\hat{Y} \mid S) = 0$ if and only if $\hat{Y}$ and $S$ are independent.

Proof. By definition, $dp(\hat{Y} \mid S) = 0$ iff for all $\hat{y} \in \mathcal{Y}$ and $s \in \mathcal{S}$, $p_{\hat{Y}|S}(\hat{y} \mid s) = p_{\hat{Y}}(\hat{y})$, which holds iff $\hat{Y}$ and $S$ are independent (since we always assume $p_S(s) > 0$ for all $s \in \mathcal{S}$).

Theorem 11 (precise version of Theorem 6). Denote
$$
f(\theta, W) = \frac{1}{N} \sum_{i \in [N]} \ell(x_i, y_i; \theta) + \lambda\left( -\mathrm{Tr}(W P_{\hat{y}} W^T) + 2\,\mathrm{Tr}(W P_{\hat{y},s} P_s^{-1/2}) - 1 \right).
$$
Set $\mathcal{W} := B_F(0, 2D) \subset \mathbb{R}^{k \times m}$ (the Frobenius-norm ball of radius $2D$), $D :=$

Then use Assumption 2 and basic norm inequalities to bound the norm of each term.

In our experiments, the model's output is of the form $O = \mathrm{softmax}(Wx + b)$. The model outputs are treated as conditional probabilities $p(\hat{y} = i \mid x) = O_i$, which are then used to estimate the ERMI regularizer. We encode the true class label $Y$ and the sensitive attribute $S$ using one-hot encoding. We define $\ell(\cdot)$ as the cross-entropy loss between the one-hot encoded class label $Y$ and the predicted output vector $O$.

Figure 5: Tradeoff of fairness violation vs. test error for the ERMI algorithm on the COMPAS dataset. The solid and dashed curves correspond to the ERMI algorithm under the demographic parity and equality of opportunity notions, respectively. The left axis shows the effect of changing $\lambda$ on the test error (red curves), while the right axis shows how the fairness of the model (measured by demographic parity or equality of opportunity violation) depends on $\lambda$.

Definition 11 defines the conditional equal opportunity $L_\infty$ violation of $\hat{Y}$ with respect to the sensitive attribute $S$ and the advantaged outcome set $\mathcal{A}$ by
$$
eo(\hat{Y} \mid S, Y \in \mathcal{A}) := \mathbb{E}_Y\left[ \sup_{\hat{y} \in \mathcal{Y}} \max_{s \in \mathcal{S}} \left| p_{\hat{Y}|S,Y}(\hat{y} \mid s, Y) - p_{\hat{Y}|Y}(\hat{y} \mid Y) \right| \,\Big|\, Y \in \mathcal{A} \right].
$$

Theorem 10 (ERMI is stronger than generalized equal opportunity Kolmogorov violation; alternative definition). Let $\hat{Y}, Y$ be discrete or continuous random variables supported on $\mathcal{Y}$, and let $S$ be a discrete random variable supported on a finite set $\mathcal{S}$. Let $\mathcal{A} \subseteq \mathcal{Y}$ be a compact subset of $\mathcal{Y}$. Denote $p_{S|\mathcal{A}}^{\min} := \min_{s \in \mathcal{S},\, y \in \mathcal{A}} p_{S|Y}(s \mid y)$. Then,
$$
0 \leq eo(\hat{Y} \mid S, Y \in \mathcal{A}) \leq \frac{1}{p_{S|\mathcal{A}}^{\min}} \sqrt{D_R(\hat{Y}; S \mid Y \in \mathcal{A})}.
$$

