EVALUATING FAIRNESS WITHOUT SENSITIVE ATTRIBUTES: A FRAMEWORK USING ONLY AUXILIARY MODELS

Anonymous authors
Paper under double-blind review

Abstract

Although the volume of literature and public attention on machine learning fairness has grown significantly in recent years, in practice even tasks as basic as measuring fairness, the first step in studying and promoting fairness, can be challenging. This is because the sensitive attributes are often unavailable in a machine learning system due to privacy regulations. The straightforward solution is to use auxiliary models to predict the missing sensitive attributes. However, our theoretical analyses show that the estimation error of the directly measured fairness metrics is proportional to the error rates of the auxiliary models' predictions. Existing works that attempt to reduce the estimation error often require strong assumptions, e.g., access to the ground-truth sensitive attributes in a subset of samples, i.i.d. auxiliary training data and target data, or some form of conditional independence. In this paper, we drop those assumptions and propose a framework that uses only off-the-shelf auxiliary models. The main challenge is how to reduce the negative impact of imperfectly predicted sensitive attributes on the fairness metrics without knowing the ground-truth sensitive attribute values. Inspired by the noisy label learning literature, we first derive a closed-form relationship between the directly measured fairness metrics and their corresponding ground-truth metrics. We then estimate some key statistics (most importantly, the transition matrix from the noisy label literature), which we use, together with the derived relationship, to calibrate the fairness metrics. Our framework can be applied to all popular group fairness definitions as well as multi-class classifiers and multi-category sensitive attributes.
In addition, we theoretically prove an upper bound on the estimation error of our calibrated metrics and show that our method can substantially decrease the estimation error, especially when auxiliary models are inaccurate or the target model is highly biased. Experiments on COMPAS and CelebA validate our theoretical analyses and show that our method can measure fairness significantly more accurately than baselines under favorable circumstances.

Footnote 1: For example, if the target dataset contains features about user information (name, location, interests, etc.), then our method is applicable as long as the auxiliary model can take any one of those features as input and predict sensitive attributes, e.g., predicting race from name.

1. INTRODUCTION

Despite the extensive literature on machine learning fairness (Corbett-Davies & Goel, 2018), in practice even measuring fairness, the first step in studying and mitigating unfairness, can be challenging, as it requires access to the sensitive attributes of samples, which are often unavailable due to privacy regulations (Andrus et al., 2021; Holstein et al., 2019; Veale & Binns, 2017). This is a problem the industry is facing, and it significantly slows down the progress of studying and promoting fairness. Existing methods to estimate fairness without access to ground-truth sensitive attributes mostly fall into two categories. First, some methods assume access to the ground-truth sensitive attributes on a subset of samples, or that such labels can be collected if unavailable, e.g., YouTube asks its creators to voluntarily provide their demographic information (Wojcicki, 2021). But this either requires labeling resources or depends on volunteering willingness, and the resulting measured fairness can be inaccurate due to sampling bias. Second, many works assume there exists an auxiliary dataset that can be used to train models to predict the missing sensitive attributes on the target dataset (i.e., the dataset on which we want to measure fairness), e.g., Meta (Alao et al., 2021) and others (Elliott et al., 2009; Awasthi et al., 2021; Diana et al., 2022). However, they often need to assume that the auxiliary dataset and the target dataset are i.i.d., along with some form of conditional independence, which is not realistic. In addition, since the auxiliary dataset also contains sensitive information (i.e., the sensitive labels), it may become increasingly difficult to obtain such training data from open-source projects given today's increasingly stringent privacy regulations. Note that, similar to our work, some researchers also draw insight from the noisy label literature (Lamy et al., 2019; Celis et al., 2021; Awasthi et al., 2020).
But they assume the noise on sensitive attributes follows assumptions such as conditional independence or known transition probabilities. Furthermore, their goal is to mitigate bias rather than to estimate the fairness disparity. We emphasize the value of estimating fairness because the metric is vital in reporting and studying fairness in real-world systems. In this work, we drop many commonly made assumptions, i.e., 1) access to labeling resources, 2) access to the auxiliary model's training data, 3) i.i.d. data, and 4) conditional independence. Instead, we rely only on off-the-shelf auxiliary models, which can be easily obtained via various open-source projects (without their training data). The requirements on the auxiliary model are also flexible: we do not need the auxiliary model's input to share exactly the same feature set as the target data; we only need the auxiliary model's input features to have some overlap with the target dataset's features (see Footnote 1). Our contributions are summarized as follows.
• We theoretically show that directly using auxiliary models to estimate fairness (by predicting the missing sensitive attributes) leads to a fairness metric whose estimation error is proportional to the prediction error of the auxiliary models and the true fairness disparity (Theorem 1, Corollary 1).
• Motivated by the above finding, we propose a general framework (Figure 1, Algorithm 1) to calibrate the noisy fairness metrics using auxiliary models only. The framework is based on a derived closed-form relationship between the directly estimated noisy fairness metrics and their corresponding ground-truth metrics (Theorem 2) in terms of two key statistics: the transition matrix and the clean prior probability, both well studied in the noisy label literature. To estimate them, our framework can leverage any existing estimator; we show an example by adapting HOC (Zhu et al., 2021b) (Algorithm 2).
The estimator only assumes that auxiliary models are informative and that different auxiliary models make i.i.d. predictions.
• We prove an upper bound on the error of our estimation (Theorem 3), and show that, in a simplified case, our estimated fairness metrics are guaranteed to be closer to the true metrics than the uncalibrated noisy metrics when auxiliary models are inaccurate or the target model is biased (Corollary 2).
• Experiments on COMPAS and CelebA consolidate our theoretical findings and show that our calibrated fairness metrics are significantly more accurate than baselines under favorable circumstances.

2. PRELIMINARIES

Consider a K-class classification problem with target dataset D° := {(x_n, y_n) | n ∈ [N]}, where N is the number of instances, x_n is the feature, and y_n is the label. Denote by X the feature space, Y = [K] := {1, 2, ..., K} the label space, and (X, Y) the random variables of (x_n, y_n), ∀n. The target model f : X → [K] maps X to a predicted label class f(X) ∈ [K]. We aim at measuring group fairness conditioned on a sensitive attribute A ∈ [M] := {1, 2, ..., M}, which is unavailable in D°. Denote the dataset with ground-truth sensitive attributes by D := {(x_n, y_n, a_n) | n ∈ [N]} and the joint distribution of (X, Y, A) by D. The task is to estimate the fairness metrics of f on D° without sensitive attributes such that the resulting metrics are as close as possible to the fairness metrics evaluated on D (with ground-truth A). See Appendix A.1 for a summary of notations. We consider three group fairness definitions (Wang et al., 2020; Cotter et al., 2019) and their corresponding measurable metrics: demographic parity (DP) (Calders et al., 2009; Chouldechova, 2017), equalized odds (EOd) (Woodworth et al., 2017), and equalized opportunity (EOp) (Hardt et al., 2016).

Fairness Definitions. To save space, all discussions in the main paper are specific to DP. We include the complete derivations for EOd and EOp in the Appendix. The DP metric is defined as follows.

Definition 1 (Demographic Parity). The demographic parity metric of f on D conditioned on A, denoted ∆_DP(D, f), measures the disparity of the prediction rates P(f(X) = k | A = a) across sensitive groups a ∈ [M].

Matrix-form Metrics. Define the M × K matrix H whose k-th column is H[:, k] := [P(f(X) = k | A = 1), ..., P(f(X) = k | A = M)]^⊤ and whose a-th row is H[a]. Denote by ψ(H[a], H[a']) := ∥H[a] − H[a']∥_1 / col(H) the normalized disparity between groups a and a', where col(H) = K is the number of columns of H. The matrix-form metric ∆(D, f) = Ψ(H) (Definition 2) aggregates ψ(H[a], H[a']) over all pairs of sensitive groups, and unifies Definitions 1, 4, and 5 (in Appendix A.2). Next, we study how the fairness metrics can be evaluated without A.

Using Auxiliary Models Directly. A direct way to measure fairness is to approximate A with an auxiliary model g : X → [M] (Ghazimatin et al., 2022; Awasthi et al., 2021; Chen et al., 2019) and obtain Ã := g(X).
Note the input of g can be any subset of the features X; we write the input of g as X just for notational simplicity. In practice, there may be C auxiliary models, denoted by the set G := {g_1, ..., g_C}. The noisy sensitive attributes are denoted by Ã^c := g_c(X), ∀c ∈ [C], and the corresponding target dataset with Ã is D̃ := {(x_n, y_n, (ã^1_n, ..., ã^C_n)) | n ∈ [N]}, with its distribution denoted by D̃. Similarly, by replacing A with Ã in H, we can compute H̃, the corresponding matrix-form fairness metric estimated by the auxiliary model g (or G if multiple auxiliary models are used). Both notations g and G are used interchangeably in the remainder of the paper. Define the directly measured noisy fairness metric of f on D̃ as follows.

Definition 3 (Noisy Group Fairness Metric). The noisy group fairness of model f on data distribution (X, Y, Ã) ∼ D̃, directly estimated using g, is ∆̃(D̃, f) = Ψ(H̃).

From the above definitions, if we can calibrate the direct noisy estimate H̃ back to the ground-truth fairness matrix H, the estimation error will be greatly reduced. We defer more details to Theorem 2.

Transition Matrix. The relationship between H̃ and H largely depends on the relationship between Ã and A, because it is the single changing variable. Define the matrix T to be the transition probability from A to Ã, whose (a, ã)-th element is T[a, ã] := P(Ã = ã | A = a). Similarly, denote by T_k the local transition matrix conditioned on f(X) = k, whose (a, ã)-th element is T_k[a, ã] := P(Ã = ã | f(X) = k, A = a). Note T can be seen as a global transition matrix obtained by weighted averaging of the T_k's. Many prior works (Awasthi et al., 2021; Prost et al., 2021; Fogliato et al., 2020) assume Ã and f(X) are conditionally independent given A. We drop this assumption in our theoretical framework. We further define the clean (i.e.,
ground-truth) prior probability of A as p := [P(A = 1), ..., P(A = M)]^⊤ and the noisy prior probability of Ã as p̃ := [P(Ã = 1), ..., P(Ã = M)]^⊤.
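To make these definitions concrete, here is a minimal numpy sketch (all data synthetic, all rates hypothetical) computing the clean and noisy priors p and p̃ and the global and local transition matrices from samples where the ground-truth A happens to be known:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 10_000, 2, 2

a = rng.integers(1, M + 1, size=N)                # true sensitive attribute A in {1, 2}
# hypothetical noisy auxiliary prediction: flip A with group-dependent error rates
flip = rng.random(N) < np.where(a == 1, 0.3, 0.1)
a_tilde = np.where(flip, 3 - a, a)                # noisy attribute A~ = g(X)
f_x = rng.integers(1, K + 1, size=N)              # target model predictions f(X)

p = np.array([(a == m).mean() for m in (1, 2)])            # clean prior P(A = m)
p_tilde = np.array([(a_tilde == m).mean() for m in (1, 2)])  # noisy prior P(A~ = m)

# global transition matrix T[a, a~] = P(A~ = a~ | A = a)
T = np.array([[((a == i) & (a_tilde == j)).sum() / (a == i).sum()
               for j in (1, 2)] for i in (1, 2)])

# local transition matrix conditioned on f(X) = k
def local_T(k):
    mask = f_x == k
    return np.array([[((a == i) & (a_tilde == j) & mask).sum()
                      / ((a == i) & mask).sum() for j in (1, 2)] for i in (1, 2)])

print(np.round(T, 2))  # rows sum to 1; off-diagonals approximate the flip rates
```

Since the simulated flips here are independent of f(X), the local matrices `local_T(k)` approximately equal the global `T`; the paper's setting is precisely the one where this need not hold.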

3. KEY INSIGHT: WHY DO WE NEED CALIBRATION?

Now we study the error of the directly measured noisy fairness metrics and motivate the necessity of calibration.

Estimation Error Analysis. Intuitively, the estimation error of the directly measured noisy fairness metrics depends on the error of the auxiliary model g. Recall that p, p̃, T, and T_k are the clean prior, noisy prior, global transition matrix, and local transition matrix defined in Sec. 2. Denote by Λ_p and Λ_p̃ the square diagonal matrices constructed from p and p̃. We formally prove the upper bound of the estimation error of the directly measured metrics in Theorem 1 (see Appendix B.1 for the proof).

Theorem 1 (Error Upper Bound of Noisy Metrics). Denote by Err_raw := |∆̃_DP(D̃, f) − ∆_DP(D, f)| the estimation error of the directly measured noisy fairness metrics. Its upper bound is:

Err_raw ≤ (2/K) Σ_{k∈[K]} [ h̄_k · ∥Λ_p̃ (T^{−1} T_k − I) Λ_p̃^{−1}∥_1 (cond. indep. violation) + δ_k · ∥Λ_p T_k Λ_p̃^{−1} − I∥_1 (error of g) ],

where h̄_k := (1/M) Σ_{a∈[M]} H[a, k] and δ_k := max_{a∈[M]} |H[a, k] − h̄_k|.

[Figure 1: Overview of the proposed calibration framework (panel: Auxiliary Model).]

Theorem 1 reveals that the estimation error of the directly measured metric depends on:
• h̄_k: the average confidence of f(X) on class k over all sensitive groups. For example, if f is a crime prediction model and A is race, a biased f (Angwin et al., 2016) may predict crime (k = 1) rates of 0.1, 0.2, and 0.6 for the different races; then h̄_1 = (0.1 + 0.2 + 0.6)/3 = 0.3, an approximation (unweighted by sample size) of the average crime rate over the entire population. The term depends on D and f, and is independent of any estimation algorithm.
• δ_k: the maximum disparity between the confidence of f(X) on class k and the average confidence h̄_k across all sensitive groups. Using the same example, δ_1 = max(|0.1 − 0.3|, |0.2 − 0.3|, |0.6 − 0.3|) = 0.3. It is an approximation of the underlying fairness disparity, and a larger δ_k indicates f is more biased on D. This term also depends on D and f (i.e., the true fairness disparity), and is independent of any estimation algorithm.
• Conditional independence violation: this term depends on the auxiliary model g's prediction Ã through the transition matrices (T and T_k) and the noisy prior probability p̃. The term goes to 0 when T = T_k, which implies Ã and f(X) are independent conditioned on A. This is the common assumption made in prior work (Awasthi et al., 2021; Prost et al., 2021; Fogliato et al., 2020), and this term measures how much the conditional independence assumption is violated.
• Error of g: similarly, this term depends on the auxiliary model g. It goes to 0 when T_k = I, which implies the error rate of g's predictions is 0, i.e., g is perfectly accurate. It measures the impact of g's error on the fairness estimation error.
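The h̄_k and δ_k terms can be computed directly from H; a short sketch reproducing the crime-prediction example above (numbers are the hypothetical rates from the text):

```python
import numpy as np

# H[a, k] = P(f(X) = k | A = a); column k = 1 (index 0) is the "crime" class.
H = np.array([[0.1, 0.9],
              [0.2, 0.8],
              [0.6, 0.4]])

h_bar = H.mean(axis=0)                  # h_bar_k: average confidence per class
delta = np.abs(H - h_bar).max(axis=0)   # delta_k: max disparity from the average

print(round(float(h_bar[0]), 2), round(float(delta[0]), 2))  # → 0.3 0.3
```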
To help better understand the upper bound, we consider a simplified case where f is a binary classifier and A is a binary variable. We further assume conditional independence, which removes the corresponding violation term in Theorem 1 (see Appendix A.3 for the formal definition of conditional independence). Corollary 1 summarizes the result.

Corollary 1. For a binary classifier f and a binary sensitive attribute A ∈ {1, 2}, when (Ã ⊥⊥ f(X) | A) holds, Theorem 1 simplifies to Err_raw ≤ 2δ(e_1 + e_2), where e_1 and e_2 are transition probabilities from noisy attributes to clean attributes, i.e., e_1 = P(A = 1 | Ã = 2), e_2 = P(A = 2 | Ã = 1), and δ = |P(f(X) = 1 | A = 1) − P(f(X) = 1 | A = 2)| / 2.

Why Calibrate? Corollary 1 clearly shows the estimation error of the directly measured fairness is proportional to the true underlying disparity between sensitive groups (i.e., δ) and the auxiliary model's error rates (i.e., e_1 and e_2). In other words, the uncalibrated metrics can be highly inaccurate when f is highly biased or g performs poorly. Both cases are practical: by the time we want to measure f's fairness, it has typically already raised fairness-related concerns, so the fairness disparity is not negligible; moreover, the auxiliary model g is usually not highly accurate due to distribution shift. Hence, in those cases we should calibrate the metrics to obtain more accurate measurements.
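As a sanity check on Corollary 1, the following population-level sketch (all probabilities hypothetical) constructs the noisy DP disparity under conditional independence and compares its error against the 2δ(e_1 + e_2) bound. In this binary conditionally independent setting, the noisy disparity equals (1 − e_1 − e_2) times the true one, so the bound is attained exactly:

```python
import numpy as np

pA = np.array([0.6, 0.4])                  # P(A = 1), P(A = 2) (hypothetical)
h = np.array([0.7, 0.3])                   # P(f(X) = 1 | A = a)
U = np.array([[0.8, 0.2], [0.1, 0.9]])     # U[a, a~] = P(A~ = a~ | A = a)

true_dp = abs(h[0] - h[1])                 # ground-truth DP disparity
delta = true_dp / 2

p_tilde = pA @ U                           # P(A~ = a~)
joint = pA[:, None] * U                    # P(A = a, A~ = a~)
# conditional independence lets us mix h through the joint of (A, A~):
h_tilde = (h @ joint) / p_tilde            # P(f(X) = 1 | A~ = a~)
noisy_dp = abs(h_tilde[0] - h_tilde[1])

e1 = joint[0, 1] / p_tilde[1]              # P(A = 1 | A~ = 2)
e2 = joint[1, 0] / p_tilde[0]              # P(A = 2 | A~ = 1)

err = abs(noisy_dp - true_dp)
bound = 2 * delta * (e1 + e2)
print(err <= bound + 1e-12)                # → True: the Corollary 1 bound holds
```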

4. METHODOLOGY

In this section, we introduce our calibration framework and algorithm (Sec. 4.1), prove the error upper bounds for our calibration method (Sec. 4.2), and elaborate on the key steps of our algorithm (Sec. 4.3).

4.1. PROPOSED FRAMEWORK

With a given auxiliary model g that labels sensitive attributes, we can characterize the relationship between the true disparity and the noisy disparity. We have the following theorem for DP; see Appendix B.2 for the results with respect to EOd and EOp and their proofs.

Theorem 2 (Closed-form Relationship (DP)). The closed-form relationship between the true fairness vector H[:, k] and the corresponding directly measured noisy fairness vector H̃[:, k] is:

H[:, k] = (T_k^⊤ Λ_p)^{−1} Λ_p̃ H̃[:, k], ∀k ∈ [K].

Framework Overview. Theorem 2 reveals that the noisy disparity and the corresponding true disparity are related through three key statistics: the noisy prior p̃, the clean prior p, and the local transition matrix T_k. Ideally, if we could obtain their ground-truth values, we could calibrate the noisy fairness vectors back to their corresponding ground-truth vectors (and therefore recover perfectly accurate fairness metrics) using the closed form in Theorem 2. Hence, the most important step is to estimate T_k, p, and p̃ without knowing the ground-truth values of A. Once we have those estimated key statistics, we simply plug them into the above equation as the calibration step. Figure 1 shows an overview of our framework.
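A minimal numpy sketch of the Theorem 2 calibration step (ground-truth statistics hypothetical; for simplicity the same T_k is shared across classes, so the global and local transition matrices coincide):

```python
import numpy as np

M, K = 2, 2
p = np.array([0.6, 0.4])                    # clean prior P(A = a) (hypothetical)
H = np.array([[0.7, 0.3],                   # H[a, k] = P(f(X) = k | A = a)
              [0.3, 0.7]])
T_k = np.array([[0.8, 0.2], [0.1, 0.9]])    # local transition P(A~ | f(X) = k, A)

# forward direction: the noisy quantities an auxiliary model would induce
p_tilde = p @ T_k                           # valid here since T_k is shared for all k
H_tilde = np.empty_like(H)
for k in range(K):
    H_tilde[:, k] = np.linalg.inv(np.diag(p_tilde)) @ T_k.T @ np.diag(p) @ H[:, k]

# calibration (Theorem 2): H[:, k] = (T_k^T Λ_p)^{-1} Λ_p~ H~[:, k]
H_cal = np.empty_like(H)
for k in range(K):
    H_cal[:, k] = np.linalg.inv(T_k.T @ np.diag(p)) @ np.diag(p_tilde) @ H_tilde[:, k]

print(np.allclose(H_cal, H))  # → True: calibration exactly inverts the corruption
```

Here the forward map Λ_p̃ H̃[:, k] = T_k^⊤ Λ_p H[:, k] is built explicitly and then inverted, so recovery is exact; in practice the statistics must be estimated, which is what Sec. 4.2 analyzes.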

Algorithm.

We summarize our framework in Algorithm 1. In Line 4, we use sample means in the uncalibrated form to estimate H̃ as H̃[ã, k] = P(f(X) = k | Ã = ã) ≈ Σ_{n∈[N]} 1(f(x_n) = k, ã_n = ã) / Σ_{n∈[N]} 1(ã_n = ã) and p̃ as p̃[ã] = P(Ã = ã) ≈ (1/N) Σ_{n∈[N]} 1(ã_n = ã), ∀ã ∈ [M]. In Line 6, we plug in an existing transition matrix and prior probability estimator to estimate T_k and p, with only the mild adaptation introduced in Sec. 4.3. Note that although we choose a specific estimator, our framework is flexible and compatible with any StatEstimator proposed in the noisy label literature (Liu & Chen, 2017; Zhu et al., 2021b; 2022).
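The Line 4 sample-mean estimates, as a short sketch on synthetic predictions (attribute and class values 1-indexed as in the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, K = 5000, 2, 2
a_tilde = rng.integers(1, M + 1, size=N)   # noisy attributes from one auxiliary model
f_x = rng.integers(1, K + 1, size=N)       # target model predictions

# noisy prior: p~[a] = P(A~ = a) ≈ (1/N) Σ_n 1(a~_n = a)
p_tilde = np.array([(a_tilde == m).mean() for m in range(1, M + 1)])

# noisy fairness matrix: H~[a, k] = P(f(X) = k | A~ = a)
H_tilde = np.zeros((M, K))
for m in range(1, M + 1):
    for k in range(1, K + 1):
        # ≈ #{f(x_n) = k and a~_n = m} / #{a~_n = m}
        H_tilde[m - 1, k - 1] = ((f_x == k) & (a_tilde == m)).sum() / (a_tilde == m).sum()

print(np.allclose(H_tilde.sum(axis=1), 1.0))  # → True: each row is a distribution
```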

4.2. ESTIMATION ERROR ANALYSIS

We theoretically analyze the estimation error of our calibrated metrics in a similar way as in Sec. 3. The derivation is based on the local estimates T̂_k; the derivation for the global estimate T̂ is similar. Denote by ∆̂_DP(D̃, f) the calibrated DP disparity evaluated on our calibrated fairness matrix Ĥ. We have:

Theorem 3 (Error Upper Bound of Calibrated Metrics). Denote the estimation error of the calibrated fairness metrics by Err_cal := |∆̂_DP(D̃, f) − ∆_DP(D, f)|. Its upper bound is:

Err_cal ≤ (2/K) Σ_{k∈[K]} ∥Λ_p̂^{−1}∥_1 ∥Λ_p̃ H̃[:, k]∥_∞ ε(T̂_k, p̂), where ε(T̂_k, p̂) := ∥Λ_p̂^{−1} Λ_p − I∥_1 ∥T_k T̂_k^{−1}∥_1 + ∥I − T_k T̂_k^{−1}∥_1.

Theorem 3 shows that the upper bound of the estimation error mainly depends on the quality of the estimates T̂_k and p̂ through the two terms in ε(T̂_k, p̂): ∥Λ_p̂^{−1} Λ_p − I∥_1 ∥T_k T̂_k^{−1}∥_1 and ∥I − T_k T̂_k^{−1}∥_1. When the estimates are perfect, i.e., T̂_k = T_k and p̂ = p, both terms go to 0 because Λ_p̂^{−1} Λ_p = I and T_k T̂_k^{−1} = I. We now compare the above error upper bound with the exact error (not its upper bound) in the setting of Corollary 1, and summarize the result in Corollary 2.

Corollary 2. When the assumptions in Corollary 1 hold and additionally p = [0.5, 0.5]^⊤, the proposed calibration method is guaranteed to be more accurate than the uncalibrated measurement, i.e., Err_cal ≤ Err_raw, if ε(T̂_k, p̂) ≤ γ := max_{k'∈{1,2}} (e_1 + e_2) / (1 + ∥H[:, k']∥_1 / ∆_DP(D, f)), ∀k ∈ {1, 2}.

Algorithm 1 Fairness calibration framework (DP)
1: Input: A set of auxiliary models G = {g_1, ..., g_C}. Target dataset D°. Target model f. Transition matrix and prior probability estimator StatEstimator.
2: ã^c_n ← g_c(x_n), ∀c ∈ [C], n ∈ [N] # Predict sensitive attributes using all g ∈ G
3: D̃ ← {(x_n, y_n, (ã^1_n, ..., ã^C_n)) | n ∈ [N]} # Build the noisy target dataset
4: Estimate H̃ and p̃ on D̃ by sample means (Sec. 4.1)
5: …
6: {T̂_1, ..., T̂_K}, p̂ ← StatEstimator(D̃, f) # Estimate the key statistics (Sec. 4.3)
7: ∀k ∈ [K]: Ĥ[:, k] ← (T̂_k^⊤ Λ_p̂)^{−1} Λ_p̃ H̃[:, k] # Calibrate each fairness vector with Theorem 2
8: ∆̂(D̃, f) ← Ψ(Ĥ) # Calculate the final fairness metric as in Definition 2
9: Output: The calibrated fairness metric ∆̂(D̃, f)

Corollary 2 shows that when the error ε(T̂_k, p̂) induced by inaccurate T̂_k and p̂ is below the threshold γ, our method is guaranteed to yield a smaller estimation error than the uncalibrated measurement under the considered setting. The threshold implies that adopting our method rather than the uncalibrated measurement can be greatly beneficial when e_1 and e_2 are high (i.e., g is inaccurate) or when the normalized (true) fairness disparity ∆_DP(D, f) / ∥H[:, k']∥_1 is high (i.e., f is highly biased).
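The Corollary 2 threshold is cheap to evaluate; a quick sketch with hypothetical values of H, e_1, and e_2:

```python
import numpy as np

# H[a, k] = P(f(X) = k | A = a); hypothetical binary example with p = [0.5, 0.5]
H = np.array([[0.7, 0.3], [0.3, 0.7]])
e1, e2 = 0.25, 0.10                        # P(A = 1 | A~ = 2), P(A = 2 | A~ = 1)
dp = abs(H[0, 0] - H[1, 0])                # true DP disparity = 0.4

col_norms = np.abs(H).sum(axis=0)          # ||H[:, k']||_1 for each class k'
gamma = max((e1 + e2) / (1 + n / dp) for n in col_norms)
print(round(gamma, 4))  # → 0.1: calibration wins whenever ε(T̂_k, p̂) ≤ 0.1
```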

4.3. ESTIMATING KEY STATISTICS

As mentioned, our framework can plug in any existing estimator of the transition matrix and prior probability. We choose HOC (Zhu et al., 2021b) because it is free of training. Some methods (Liu & Tao, 2015; Scott, 2015; Patrini et al., 2017) require extra training with target data and auxiliary model outputs, which introduces extra cost; moreover, training brings a practical challenge in hyperparameter tuning given that we have no ground-truth sensitive attributes. HOC decodes T_k by checking both the agreements and disagreements among noisy attributes (auxiliary model predictions). See more details in Appendix C.1. For a successful decoding, HOC makes the following assumptions:

Assumption 1 (HOC: Informativeness). The noisy attributes given by each classifier g are informative, i.e., ∀k ∈ [K]: 1) T_k is non-singular and 2) either T_k[a, a] > P(Ã = a | f(X) = k) or T_k[a, a] > T_k[a, a'], ∀a' ≠ a.

Assumption 2 (HOC: Independence). Given three auxiliary models, the noisy attributes predicted by them are independent and identically distributed (i.i.d.), i.e., g_1(X), g_2(X), and g_3(X) are i.i.d.

Assumption 1 is the prerequisite for a feasible and unique estimate of T_k (Zhu et al., 2021b): the non-singularity ensures the matrix inverse in Theorem 2 exists, and the constraints on T_k[a, a] describe the worst tolerable performance of g. When M = 2, the constraints simplify to T_k[1, 2] + T_k[2, 1] < 1 (Liu & Chen, 2017; Liu & Guo, 2020). If this assumption is violated, there may exist more than one feasible estimate of T_k, making the problem unidentifiable. Assumption 2 ensures the two additional auxiliary models provide more information than using only one classifier. Note that Liu (2022) proved that three is the sufficient and necessary number of auxiliary models to provide enough information to identify T_k. If Assumption 2 is violated, we would still get an estimate, but it may be inaccurate.

Adapting HOC.
Algorithm 2 shows how we adapt HOC as the StatEstimator (Algorithm 1, Line 6), namely HOCFair. The original HOC uses one auxiliary model and simulates the other two based on a clusterability assumption (Zhu et al., 2021b), which assumes x_n and its 2-nearest neighbors share the same true sensitive attribute; their noisy attributes can therefore be used to simulate the output of additional auxiliary models. If this assumption does not hold (Zhu et al., 2022), we can directly use more auxiliary models. With a sufficient number of noisy attributes, we randomly select three of them for every sample (Line 6), and then approximate T_k with T̂_k (Line 8). In our experiments, we test both using one auxiliary model and using multiple auxiliary models.

Algorithm 2 HOCFair: estimating the key statistics (adapted from HOC)
1: Input: Noisy target dataset D̃. Target model f.
2: if the 2-NN simulation (clusterability) is used then
3-4: D̃ ← {(x_n, y_n, (ã^1_n, ..., ã^{3C}_n)) | n ∈ [N]} ← Get2NN(D̃) # Augment each sample with its 2-NN's noisy attributes
5: end if
6: {(ã^1_n, ã^2_n, ã^3_n) | n ∈ [N]} ← Sample(D̃) # Randomly sample 3 noisy attributes for each instance
7: (T̂, p̂) ← HOC({(ã^1_n, ã^2_n, ã^3_n) | n ∈ [N]}) # Use HOC to get global estimates T̂ ≈ T and p̂ ≈ p
8: (T̂_k, −) ← HOC({(ã^1_n, ã^2_n, ã^3_n) | n ∈ [N], f(x_n) = k}), ∀k ∈ [K] # Get local estimates T̂_k ≈ T_k
9: Output: {T̂_1, ..., T̂_K}, p̂ # Return the estimated statistics
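The outer data flow of Algorithm 2 (Lines 6-8) can be sketched as below. `naive_stat_estimator` is a deliberately simplified majority-vote stand-in, NOT the real HOC solver of Zhu et al. (2021b); it exists only to show how triplets are sampled per instance and partitioned by f(x_n) = k:

```python
import numpy as np

def naive_stat_estimator(triplets, M):
    # Simplified stand-in for HOC: majority vote over the 3 noisy attributes
    # serves as a proxy for the clean attribute (values are 1-indexed).
    maj = np.array([np.bincount(row, minlength=M + 1).argmax() for row in triplets])
    p_hat = np.array([(maj == a).mean() for a in range(1, M + 1)])
    T_hat = np.array([[((maj == a) & (triplets[:, 0] == t)).sum()
                       / max((maj == a).sum(), 1)
                       for t in range(1, M + 1)] for a in range(1, M + 1)])
    return T_hat, p_hat

def hocfair(A_tilde, f_x, M, K, rng):
    """A_tilde: (N, C) noisy attributes from C >= 3 auxiliary models."""
    N, C = A_tilde.shape
    cols = np.stack([rng.choice(C, size=3, replace=False) for _ in range(N)])
    triplets = A_tilde[np.arange(N)[:, None], cols]        # Line 6: 3 attrs/sample
    T_hat, p_hat = naive_stat_estimator(triplets, M)       # Line 7: global estimates
    T_hat_k = [naive_stat_estimator(triplets[f_x == k], M)[0]
               for k in range(1, K + 1)]                   # Line 8: local estimates
    return T_hat_k, T_hat, p_hat
```

Swapping `naive_stat_estimator` for a real HOC implementation recovers the HOCFair of the paper; the sampling and partitioning logic is unchanged.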

5. EXPERIMENTS

5.1. EXPERIMENT SETUP

We test the performance of our method on two real-world datasets: COMPAS (Angwin et al., 2016) and CelebA (Liu et al., 2015). We report results on all three group fairness metrics (DP, EOd, and EOp), whose true disparities (estimated using the ground-truth sensitive attributes) are denoted by ∆_DP(D, f), ∆_EOd(D, f), and ∆_EOp(D, f), respectively. We train the target model f on the dataset without using A, and use auxiliary models downloaded from open-source projects. The detailed settings are as follows.
• COMPAS (Angwin et al., 2016): recidivism prediction data. Feature X: tabular data. Label Y: recidivism within two years (binary). Sensitive attribute A: race (black and non-black). Target models f (trained by us): decision tree, random forest, boosting, SVM, logit model, and neural network (accuracy range 66%-70% for all models). Three auxiliary models (g_1, g_2, g_3): racial classifiers taking name as input (Sood & Laohaprapanon, 2018) (average accuracy 68.85%).
• CelebA (Liu et al., 2015): face dataset. Feature X: facial images. Label Y: smiling or not (binary). Sensitive attribute A: gender (male and female). Target model f: ResNet18 (He et al., 2016) (accuracy 90.75%, trained by us). We use only one auxiliary model (g_1): a gender classifier that takes facial images as input (Serengil & Ozpinar, 2021) (accuracy 92.55%). We then use clusterability to simulate the other two auxiliary models, as in Line 3 of Algorithm 2.

Practical Estimates of T_k: Local vs. Global. According to Theorem 3, when the T_k's are accurately estimated, we should always rely on the local estimates (Line 8 of Algorithm 2) to achieve zero calibration error. However, in practice, each time we estimate a local T_k, the estimator introduces some error in T̂_k (discussed in Sec. 4.3), and the matrix inversion in Theorem 2 can amplify that error each time, leading to a large overall error in the metric.
One heuristic is to use a single global transition matrix T̂, estimated once on the full dataset D̃ (Line 7 of Algorithm 2), to replace all T_k's. Intuitively, T can be viewed as the weighted average of all T_k's, which stabilizes the estimation error (variance reduction) relative to T̂_k. Admittedly, the averaging introduces bias, since the equation in Theorem 2 no longer holds exactly when T_k is replaced with T. The justification is that the error introduced by violating the equality may be smaller than the error introduced by severely inaccurate estimates of the T_k's. Therefore, we offer two options for estimating T_k in practice: local estimates T̂_k ≈ T_k and global estimates T̂_k ≈ T̂. Although it is hard to guarantee which option is better in reality, we report experimental results using both options and provide insights for choosing between them in Sec. 5.2.

Method. We test our proposed framework with global estimates T̂_k ≈ T̂ (Global) and local estimates T̂_k ≈ T_k (Local). We compare with two baselines: the directly estimated metric without any calibration (Base), and Soft (Chen et al., 2019), which also uses only auxiliary models and calibrates the measured fairness by re-weighting the metric with the soft predicted probabilities from the auxiliary model.

Evaluation Metric. Let ∆(D, f) be the ground-truth fairness metric. For a given estimated metric E, we define three estimation errors: Raw Error(E) := |E − ∆(D, f)|, Normalized Error(E) := Raw Error(E) / ∆(D, f), and Improvement(E) := 1 − Raw Error(E) / Raw Error(Base), where Base is the directly measured metric.
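The three evaluation metrics as code, with hypothetical measurement values:

```python
def raw_error(est, true):          # Raw Error(E) := |E - Δ(D, f)|
    return abs(est - true)

def normalized_error(est, true):   # Raw Error(E) / Δ(D, f)
    return raw_error(est, true) / true

def improvement(est, base, true):  # 1 - Raw Error(E) / Raw Error(Base)
    return 1 - raw_error(est, true) / raw_error(base, true)

# hypothetical numbers: true disparity, uncalibrated (Base), and calibrated metric
true_dp, base_dp, cal_dp = 0.40, 0.27, 0.36
print(round(improvement(cal_dp, base_dp, true_dp), 3))  # → 0.692
```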

COMPAS Results.

Table 1 reports the normalized error on COMPAS (see Table 7 in Appendix D.1 for the other two evaluation metrics). There are two main observations. First, our calibrated metrics outperform baselines by a large margin on all three fairness definitions. Compared to Base, our metrics are 39.6%-88.2% more accurate (Improvement). As pointed out by Corollary 2, this is because the target models f are highly biased (Table 6) and the auxiliary models g are inaccurate (accuracy 68.9%). As a result, Base has a large normalized error (40%-60%). Second, Global outperforms Local: with inaccurate auxiliary models, Assumptions 1-2 of the HOC estimator may not hold on the local datasets, inducing large estimation errors in the local estimates.

CelebA Results. Table 2 reports the normalized error on CelebA, where each row represents using a different pre-trained model to generate the feature representations used to simulate the other two auxiliary models (see Table 8 in Appendix D.2 for the full results). We have two observations. First, although our method still outperforms the baselines most of the time, the margin is smaller, and Soft outperforms us when estimating EOp. As before, this is because the conditions in Corollary 2 do not hold: f is barely biased in EOd and EOp (Table 8) and g is accurate (accuracy 92.6%). As a result, Base has only a moderate normalized error in DP (15.3%) and small normalized errors in EOd (4.1%) and EOp (2.8%). Given the highly accurate Base, the benefit of adopting our method is outweighed by the estimation error introduced by calibration (mostly by the key statistics estimator). Second, contrary to COMPAS, Local outperforms Global. This is because the auxiliary models are now accurate, so Assumption 1 always holds and Assumption 2 is also likely to hold when g_2 and g_3 are well simulated. Consequently, Local is estimated accurately, while Global induces extra error by violating the equality in Theorem 2.
Finally, even though our method underperforms on EOp, its raw error remains acceptable (less than 0.01, as shown in Table 8 in Appendix D.2).

Ablation Study. To better understand when our method gives a clear advantage under different qualities of g, we run an ablation study on CelebA. We randomly flip the predicted sensitive attributes to bring down g's accuracy and report the results in Table 3 (see Appendix D.2 for the full results).

Table 3: Normalized error on CelebA when adding noise by randomly flipping the predicted attributes to bring down the performance of the auxiliary models. Each row represents the noise magnitude and accuracy of the auxiliary models, e.g., "[0.2, 0.0] (82.44%)" means T[1,2] = 0.2, T[2,1] = 0.0, and accuracy is 82.44%.

When g becomes less accurate, our method outperforms the baselines, which validates Corollary 2. In addition, Local still outperforms Global. This is because we add random noise following Assumption 2, and therefore the estimation error of Local does not increase significantly.
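The noise-injection step of this ablation can be sketched as follows; the flip routine and the synthetic predictions below are illustrative stand-ins, not our released code:

```python
import numpy as np

def flip_attributes(a_pred, t12, t21, rng):
    """Randomly flip binary attributes (coded 1/2) so that P(1 -> 2) = t12 and
    P(2 -> 1) = t21, i.e., inject extra noise T[1,2] = t12, T[2,1] = t21."""
    out = np.asarray(a_pred).copy()
    u = rng.random(out.shape)
    flip_to_2 = (out == 1) & (u < t12)
    flip_to_1 = (out == 2) & (u < t21)
    out[flip_to_2] = 2
    out[flip_to_1] = 1
    return out

rng = np.random.default_rng(0)
a = rng.integers(1, 3, size=100_000)               # synthetic predictions in {1, 2}
noisy = flip_attributes(a, t12=0.2, t21=0.0, rng=rng)
rate = ((a == 1) & (noisy == 2)).sum() / (a == 1).sum()
print(rate)  # close to 0.2 by construction
```

With `t21=0.0`, no attribute coded 2 is ever flipped, matching the "[0.2, 0.0]" row of the ablation.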


Takeaways. Our experimental results imply two takeaways: 1) our calibration method gives a clear advantage when the error rates of g are moderate to high (e.g., error ≥ 15%) or f is highly biased (e.g., fairness disparity ≥ 0.1), and 2) Local is preferable when the auxiliary model is accurate, and Global otherwise. In practice, with no ground-truth A available, we can roughly estimate the auxiliary models' accuracy range from the estimated transition matrix $\hat{T}$.
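The rough accuracy estimate mentioned above can be sketched as follows, assuming an estimated transition matrix and prior are available (the helper name and numbers are illustrative):

```python
import numpy as np

def aux_accuracy_from_T(T_hat, p_hat):
    """Rough accuracy of an auxiliary model implied by an (estimated) transition
    matrix T_hat[a, a~] = P(A~ = a~ | A = a) and prior p_hat[a] = P(A = a):
    accuracy = sum_a p_hat[a] * T_hat[a, a]."""
    return float(np.asarray(p_hat) @ np.diag(np.asarray(T_hat)))

# illustrative values: 10% and 30% flip rates, prior (0.6, 0.4)
T_hat = [[0.9, 0.1],
         [0.3, 0.7]]
print(aux_accuracy_from_T(T_hat, [0.6, 0.4]))  # 0.6*0.9 + 0.4*0.7 = 0.82
```

Even a coarse estimate of this quantity is enough to decide between Local and Global per the takeaway above.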

6. RELATED WORK

Fairness with Imperfect Sensitive Attributes. The closest work to ours is (Chen et al., 2019), which also assumes only auxiliary models. However, it is only applicable to demographic disparity; we compare with it in our experiments. In addition, other works focus on how to train auxiliary models from a given auxiliary dataset (Awasthi et al., 2021; Diana et al., 2022). For example, Awasthi et al. (2021) propose an active learning scheme and assume there exists an auxiliary dataset that is i.i.d. with the target dataset. In our work, we need neither the auxiliary dataset nor the assumption that the auxiliary model's training set is i.i.d. with the target dataset. Lamy et al. (2019) also draw insights from the noisy label literature; however, their attribute noise is assumed to come from the mutually contaminated distributions model rather than from an auxiliary model. Furthermore, Prost et al. (2021) and Fogliato et al. (2020) theoretically study the error gap in estimating fairness, but they do not propose any calibration method. There are other parallel works that aim to mitigate bias without estimating it (Hashimoto et al., 2018; Lahoti et al., 2020; Wang et al., 2020; Yan et al., 2020). We emphasize the value of estimating fairness because the metrics themselves are vital for reporting and studying fairness in real-world systems.

Noisy Label Learning.

Label noise may come from various sources, e.g., human annotation errors (Xiao et al., 2015; Wei et al., 2022; Agarwal et al., 2016) and model prediction errors (Lee et al., 2013; Berthelot et al., 2019; Zhu et al., 2021a), and can be characterized by a transition matrix over labels (Liu, 2022; Bae et al., 2022; Yang et al., 2021). Applying the noise transition matrix to ensure fairness is an emerging direction (Wang et al., 2021; Liu & Wang, 2021; Lamy et al., 2019). There exist two lines of work for estimating the transition matrix. The first line relies on anchor points (samples belonging to a class with high certainty) or their approximations (Liu & Tao, 2015; Scott, 2015; Patrini et al., 2017; Xia et al., 2019; Northcutt et al., 2021). These works require training a neural network on $(X, \tilde{A} := g(X))$. The second line of work, which we leverage, is data-centric (Liu & Chen, 2017; Liu et al., 2020; Zhu et al., 2021b; 2022) and training-free. The main idea is to check the agreements among multiple noisy attributes, as discussed in Section 4.3.

7. LIMITATION AND FUTURE WORK

We point out two limitations of our work. First, our method shows limited improvement over directly measured metrics when the auxiliary models are highly accurate and the true fairness disparity is small. Second, our theoretical guarantee on the superiority of our method (Corollary 2) is proven only in a simplified case. One direction for future work is to apply the same method to unfairness mitigation algorithms in addition to evaluating fairness.

ETHICS STATEMENT

Our goal is to better study and promote fairness. Without a reliable estimation method, given the increasingly stringent privacy regulations, it would be difficult for academia and industry to measure, detect, and mitigate bias in many real-world scenarios. However, we caution readers that no estimation algorithm is perfect. Theoretically, in our framework, if the transition matrix is perfectly estimated, then our method can measure fairness with 100% accuracy. However, if Assumptions 1-2 required by our estimator in Algorithm 2 do not hold, our calibrated metrics might have a non-negligible error and could therefore be misleading. In addition, the example we use to explain terms in Theorem 1 is based on conclusions from (Angwin et al., 2016); we do not hold any biased opinion on the crime rate across racial groups. Furthermore, we are fully aware that many sensitive attributes are not binary, e.g., race and gender. We use binary sensitive attributes in experiments because 1) existing works have shown that bias exists in COMPAS between the race "black" and others, and 2) the ground-truth gender attribute in CelebA is binary. Finally, all the data and models we use are from open-source projects, and the bias measured on them does not reflect our opinions about those projects.

Definition 5 (Equalized Opportunity). The equalized opportunity metric of $f$ on $D$ conditioned on $A$ is:
$$\Delta_{\mathrm{EOp}}(D, f) = \frac{1}{M(M-1)} \sum_{a, a' \in [M]} \big| P(f(X) = 1 \mid Y = 1, A = a) - P(f(X) = 1 \mid Y = 1, A = a') \big|.$$

Matrix-form Metrics. To unify the three fairness metrics in a general form, we represent them with a matrix $H$. Each column of $H$ denotes the probability needed for evaluating fairness with respect to the classifier prediction $f(X)$. For DP, $H[:, k]$ denotes the following column vector:
$$H[:, k] := [P(f(X) = k \mid A = 1), \cdots, P(f(X) = k \mid A = M)]^\top.$$
Similarly for EOd and EOp, let $k \otimes y := K(k - 1) + y$ be the 1-d flattened index representing the 2-d coordinate in $f(X) \times Y$; $H[:, k \otimes y]$ is defined as the following column vector:
$$H[:, k \otimes y] := [P(f(X) = k \mid Y = y, A = 1), \cdots, P(f(X) = k \mid Y = y, A = M)]^\top.$$
The sizes of $H$ for DP, EOd, and EOp are $M \times K$, $M \times K^2$, and $M \times 1$, respectively. The noise transition matrix related to EOd and EOp is $T_{k \otimes y}$, whose $(a, \tilde{a})$-th element is $T_{k \otimes y}[a, \tilde{a}] := P(\tilde{A} = \tilde{a} \mid f(X) = k, Y = y, A = a)$.
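An empirical estimate of the DP fairness matrix $H$ can be sketched as follows (the helper and the toy data are illustrative):

```python
import numpy as np

def fairness_matrix_dp(f_pred, a, M, K):
    """Empirical DP fairness matrix with H[a-1, k-1] = P(f(X) = k | A = a),
    for sensitive attributes a in {1..M} and predicted classes k in {1..K}."""
    f_pred, a = np.asarray(f_pred), np.asarray(a)
    H = np.zeros((M, K))
    for ai in range(1, M + 1):
        group = f_pred[a == ai]            # predictions within attribute group ai
        for k in range(1, K + 1):
            H[ai - 1, k - 1] = np.mean(group == k)
    return H

f_pred = np.array([1, 1, 2, 2, 1, 2])      # toy predictions of a binary classifier
a      = np.array([1, 1, 1, 2, 2, 2])      # toy sensitive attributes
H = fairness_matrix_dp(f_pred, a, M=2, K=2)
print(H)  # each row is a distribution over predicted classes, so rows sum to 1
```

For EOd/EOp, the same routine would be applied within each slice $Y = y$ to populate the $k \otimes y$ columns.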

A.3 COMMON CONDITIONAL INDEPENDENCE ASSUMPTION IN THE LITERATURE

We present below a common conditional independence assumption in the literature (Awasthi et al., 2021; Prost et al., 2021; Fogliato et al., 2020). Note our framework successfully drops this assumption.

Assumption 3 (Conditional Independence). $\tilde{A}$ and $f(X)$ are conditionally independent given $A$ (and $Y$ for EOd, EOp):

- DP: $P(\tilde{A} = \tilde{a} \mid f(X) = k, A = a) = P(\tilde{A} = \tilde{a} \mid A = a), \forall a, \tilde{a} \in [M], k \in [K]$ (i.e., $\tilde{A} \perp\!\!\!\perp f(X) \mid A$).

- DP:
$$\mathrm{Err}^{\mathrm{raw}}_{\mathrm{DP}} \le \frac{2}{K} \sum_{k \in [K]} \Big( \underbrace{\bar{h}_k \big\| \Lambda_{\tilde{p}} (T^{-1} T_k - I) \Lambda_{\tilde{p}}^{-1} \big\|_1}_{\text{cond. indep. violation}} + \underbrace{\delta_k \big\| \Lambda_{p} T_k \Lambda_{\tilde{p}}^{-1} - I \big\|_1}_{\text{error of } g} \Big),$$
where $\bar{h}_k := \frac{1}{M} \sum_{a \in [M]} H[a, k]$ and $\delta_k := \max_{a \in [M]} |H[a, k] - \bar{h}_k|$.

- EOd:
$$\mathrm{Err}^{\mathrm{raw}}_{\mathrm{EOd}} \le \frac{2}{K^2} \sum_{k, y \in [K]} \Big( \underbrace{\bar{h}_{k \otimes y} \big\| \Lambda_{\tilde{p}_y} (T_y^{-1} T_{k \otimes y} - I) \Lambda_{\tilde{p}_y}^{-1} \big\|_1}_{\text{cond. indep. violation}} + \underbrace{\delta_{k \otimes y} \big\| \Lambda_{p_y} T_{k \otimes y} \Lambda_{\tilde{p}_y}^{-1} - I \big\|_1}_{\text{error of } g} \Big),$$
where $\bar{h}_{k \otimes y} := \frac{1}{M} \sum_{a \in [M]} H[a, k \otimes y]$ and $\delta_{k \otimes y} := \max_{a \in [M]} |H[a, k \otimes y] - \bar{h}_{k \otimes y}|$.

- EOp: the result is obtained by letting $k = 1$ and $y = 1$, i.e.,
$$\mathrm{Err}^{\mathrm{raw}}_{\mathrm{EOp}} \le 2 \Big( \bar{h}_{1 \otimes 1} \big\| \Lambda_{\tilde{p}_1} (T_1^{-1} T_{1 \otimes 1} - I) \Lambda_{\tilde{p}_1}^{-1} \big\|_1 + \delta_{1 \otimes 1} \big\| \Lambda_{p_1} T_{1 \otimes 1} \Lambda_{\tilde{p}_1}^{-1} - I \big\|_1 \Big).$$

Proof. The following proof builds on the relationship derived in the proof of Theorem 2; we encourage readers to check Appendix B.2 first. Recall $T_y[a, a'] := P(\tilde{A} = a' \mid A = a, Y = y)$. Note $\Lambda_{\tilde{p}_y} \mathbf{1} = T_y^\top \Lambda_{p_y} \mathbf{1}$, equivalently $(T_y^\top)^{-1} \Lambda_{\tilde{p}_y} \mathbf{1} = \Lambda_{p_y} \mathbf{1}$. Denote $H[:, k \otimes y] = \bar{h}_{k \otimes y} \mathbf{1} + v_{k \otimes y}$, where $\bar{h}_{k \otimes y} := \frac{1}{M} \sum_{a \in [M]} P(f(X) = k \mid A = a, Y = y)$. We have
$$\Lambda_{p_y} H[:, k \otimes y] = \bar{h}_{k \otimes y} \Lambda_{p_y} \mathbf{1} + \Lambda_{p_y} v_{k \otimes y} = \bar{h}_{k \otimes y} (T_y^\top)^{-1} \Lambda_{\tilde{p}_y} \mathbf{1} + \Lambda_{p_y} v_{k \otimes y}.$$
We further have
$$\tilde{H}[:, k \otimes y] = \big( \Lambda_{\tilde{p}_y}^{-1} T_{k \otimes y}^\top \Lambda_{p_y} - I \big) H[:, k \otimes y] + H[:, k \otimes y] = \bar{h}_{k \otimes y} \Lambda_{\tilde{p}_y}^{-1} \big( T_{k \otimes y}^\top (T_y^\top)^{-1} - I \big) \Lambda_{\tilde{p}_y} \mathbf{1} + \big( \Lambda_{\tilde{p}_y}^{-1} T_{k \otimes y}^\top \Lambda_{p_y} - I \big) v_{k \otimes y} + H[:, k \otimes y].$$
Noting $|A| - |B| \le |A + B| \le |A| + |B|$, we have $\big| |A + B| - |B| \big| \le |A|$. Therefore,
$$\Big| \big| (e_{\tilde{a}} - e_{\tilde{a}'})^\top \tilde{H}[:, k \otimes y] \big| - \big| (e_{\tilde{a}} - e_{\tilde{a}'})^\top H[:, k \otimes y] \big| \Big| \le \underbrace{\bar{h}_{k \otimes y} \Big| (e_{\tilde{a}} - e_{\tilde{a}'})^\top \Lambda_{\tilde{p}_y}^{-1} \big( T_y^{-1} T_{k \otimes y} - I \big)^\top \Lambda_{\tilde{p}_y} \mathbf{1} \Big|}_{\text{Term 1}} + \underbrace{\Big| (e_{\tilde{a}} - e_{\tilde{a}'})^\top \big( \Lambda_{\tilde{p}_y}^{-1} T_{k \otimes y}^\top \Lambda_{p_y} - I \big) v_{k \otimes y} \Big|}_{\text{Term 2}}.$$
Term 1 and Term 2 can be upper bounded as follows.

Term 1: With Hölder's inequality, we have
$$\text{Term 1} \le \bar{h}_{k \otimes y} \| e_{\tilde{a}} - e_{\tilde{a}'} \|_1 \big\| \Lambda_{\tilde{p}_y}^{-1} (T_y^{-1} T_{k \otimes y} - I)^\top \Lambda_{\tilde{p}_y} \mathbf{1} \big\|_\infty \le 2 \bar{h}_{k \otimes y} \big\| \Lambda_{\tilde{p}_y}^{-1} (T_y^{-1} T_{k \otimes y} - I)^\top \Lambda_{\tilde{p}_y} \big\|_\infty = 2 \bar{h}_{k \otimes y} \big\| \Lambda_{\tilde{p}_y} (T_y^{-1} T_{k \otimes y} - I) \Lambda_{\tilde{p}_y}^{-1} \big\|_1.$$

Term 2: Denote $\delta_{k \otimes y} := \max_{a \in [M]} |H[a, k \otimes y] - \bar{h}_{k \otimes y}|$, the largest absolute offset from the mean. With Hölder's inequality, we have
$$\text{Term 2} \le \| e_{\tilde{a}} - e_{\tilde{a}'} \|_1 \big\| (\Lambda_{\tilde{p}_y}^{-1} T_{k \otimes y}^\top \Lambda_{p_y} - I) v_{k \otimes y} \big\|_\infty \le 2 \delta_{k \otimes y} \big\| \Lambda_{\tilde{p}_y}^{-1} T_{k \otimes y}^\top \Lambda_{p_y} - I \big\|_\infty = 2 \delta_{k \otimes y} \big\| \Lambda_{p_y} T_{k \otimes y} \Lambda_{\tilde{p}_y}^{-1} - I \big\|_1.$$

Wrap-up: Denote by $\tilde{\Delta}^{\tilde{a}, \tilde{a}'}_{k \otimes y} := |\tilde{H}[\tilde{a}, k \otimes y] - \tilde{H}[\tilde{a}', k \otimes y]|$ the noisy disparity and $\Delta^{\tilde{a}, \tilde{a}'}_{k \otimes y} := |H[\tilde{a}, k \otimes y] - H[\tilde{a}', k \otimes y]|$ the clean disparity between attributes $\tilde{a}$ and $\tilde{a}'$ in the case when $f(X) = k$ and $Y = y$. We have
$$\big| \Delta_{\mathrm{EOd}}(\tilde{D}, f) - \Delta_{\mathrm{EOd}}(D, f) \big| \le \frac{1}{M(M-1)K^2} \sum_{\tilde{a}, \tilde{a}' \in [M],\, k, y \in [K]} \big| \tilde{\Delta}^{\tilde{a}, \tilde{a}'}_{k \otimes y} - \Delta^{\tilde{a}, \tilde{a}'}_{k \otimes y} \big| \le \frac{2}{K^2} \sum_{k, y \in [K]} \Big( \bar{h}_{k \otimes y} \big\| \Lambda_{\tilde{p}_y} (T_y^{-1} T_{k \otimes y} - I) \Lambda_{\tilde{p}_y}^{-1} \big\|_1 + \delta_{k \otimes y} \big\| \Lambda_{p_y} T_{k \otimes y} \Lambda_{\tilde{p}_y}^{-1} - I \big\|_1 \Big).$$
The result for DP is obtained by dropping the dependence on $Y = y$, and the result for EOp by letting $k = 1$ and $y = 1$.
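The DP bound of Theorem 1 can be evaluated numerically as a sanity check. The sketch below assumes the placement of the clean prior $p$ versus the noisy prior $\tilde{p}$ that the proof uses; all inputs are illustrative:

```python
import numpy as np

def dp_noisy_error_bound(H, T, T_k_list, p, p_tilde):
    """Evaluate the DP bound of Theorem 1:
    (2/K) * sum_k [ hbar_k * ||Lam_ptilde (T^{-1} T_k - I) Lam_ptilde^{-1}||_1
                    + delta_k * ||Lam_p T_k Lam_ptilde^{-1} - I||_1 ],
    with ||.||_1 the induced matrix 1-norm (max absolute column sum)."""
    H = np.asarray(H, float)
    M, K = H.shape
    Lp = np.diag(np.asarray(p, float))
    Lpt = np.diag(np.asarray(p_tilde, float))
    Lpt_inv = np.linalg.inv(Lpt)
    T_inv = np.linalg.inv(np.asarray(T, float))
    I = np.eye(M)
    total = 0.0
    for k in range(K):
        hbar = H[:, k].mean()
        delta = np.max(np.abs(H[:, k] - hbar))
        term1 = hbar * np.linalg.norm(Lpt @ (T_inv @ T_k_list[k] - I) @ Lpt_inv, 1)
        term2 = delta * np.linalg.norm(Lp @ T_k_list[k] @ Lpt_inv - I, 1)
        total += term1 + term2
    return 2.0 * total / K

# illustrative inputs: all T_k equal the global T, so the violation term vanishes
H = np.array([[0.6, 0.4],
              [0.3, 0.7]])
T = np.array([[0.85, 0.15],
              [0.10, 0.90]])
p = np.array([0.5, 0.5])
p_tilde = T.T @ p                 # noisy prior induced by T
print(dp_noisy_error_bound(H, T, [T, T], p, p_tilde))  # only the "error of g" term remains
```

With $T_k = T$ and $T = I$, $p = \tilde{p}$, the bound evaluates to zero, matching the noiseless case.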

B.2 FULL VERSION OF THEOREM 2 AND ITS PROOF

Recall $p$, $\tilde{p}$, $T$, and $T_k$ are the clean prior, noisy prior, global transition matrix, and local transition matrices defined in Sec. 2. Denote by $\Lambda_{p}$ and $\Lambda_{\tilde{p}}$ the square diagonal matrices constructed from $p$ and $\tilde{p}$.

Theorem 2 (Closed-form relationship (DP, EOd, EOp)). The relationship between the true fairness vector $h_u$ and the corresponding noisy fairness vector $\tilde{h}_u$ writes as
$$h_u = (T_u^\top \Lambda_{p_u})^{-1} \Lambda_{\tilde{p}_u} \tilde{h}_u, \quad \forall u \in \{\mathrm{DP}, \mathrm{EOd}, \mathrm{EOp}\},$$
where $\Lambda_{p_u}$ and $\Lambda_{\tilde{p}_u}$ denote the square diagonal matrices constructed from $p_u$ and $\tilde{p}_u$, and $u$ unifies the different fairness metrics. Particularly,

- DP ($\forall k \in [K]$): $p_{\mathrm{DP}} := [P(A = 1), \cdots, P(A = M)]^\top$, $\tilde{p}_{\mathrm{DP}} := [P(\tilde{A} = 1), \cdots, P(\tilde{A} = M)]^\top$, $T_{\mathrm{DP}} := T_k$ with $T_k[a, \tilde{a}] := P(\tilde{A} = \tilde{a} \mid f(X) = k, A = a)$, $h_{\mathrm{DP}} := H[:, k] := [P(f(X) = k \mid A = 1), \cdots, P(f(X) = k \mid A = M)]^\top$, and $\tilde{h}_{\mathrm{DP}} := \tilde{H}[:, k] := [P(f(X) = k \mid \tilde{A} = 1), \cdots, P(f(X) = k \mid \tilde{A} = M)]^\top$.
- EOd and EOp ($\forall k, y \in [K]$, $u \in \{\mathrm{EOd}, \mathrm{EOp}\}$): $k \otimes y := K(k - 1) + y$, $p_u := p_y := [P(A = 1 \mid Y = y), \cdots, P(A = M \mid Y = y)]^\top$, $\tilde{p}_u := \tilde{p}_y := [P(\tilde{A} = 1 \mid Y = y), \cdots, P(\tilde{A} = M \mid Y = y)]^\top$, $T_u := T_{k \otimes y}$ with $T_{k \otimes y}[a, \tilde{a}] := P(\tilde{A} = \tilde{a} \mid f(X) = k, Y = y, A = a)$, $h_u := H[:, k \otimes y]$, and $\tilde{h}_u := \tilde{H}[:, k \otimes y]$.

Proof. We first prove the theorem for DP, then for EOd and EOp.

Proof for DP. In DP, each element of $\tilde{h}_{\mathrm{DP}}$ satisfies:
$$P(f(X) = k \mid \tilde{A} = \tilde{a}) = \sum_{a \in [M]} \frac{P(f(X) = k, \tilde{A} = \tilde{a}, A = a)}{P(\tilde{A} = \tilde{a})} = \sum_{a \in [M]} \frac{P(\tilde{A} = \tilde{a} \mid f(X) = k, A = a) \cdot P(A = a) \cdot P(f(X) = k \mid A = a)}{P(\tilde{A} = \tilde{a})}.$$
Recall $T_k$ is the attribute noise transition matrix when $f(X) = k$, with $(a, \tilde{a})$-th element $T_k[a, \tilde{a}] := P(\tilde{A} = \tilde{a} \mid f(X) = k, A = a)$. Recall $p := [P(A = 1), \cdots, P(A = M)]^\top$ and $\tilde{p} := [P(\tilde{A} = 1), \cdots, P(\tilde{A} = M)]^\top$ are the clean and noisy prior probabilities, respectively.
The above equation can be written in matrix form as $\tilde{H}[:, k] = \Lambda_{\tilde{p}}^{-1} T_k^\top \Lambda_{p} H[:, k]$, which is equivalent to $H[:, k] = (T_k^\top \Lambda_{p})^{-1} \Lambda_{\tilde{p}} \tilde{H}[:, k]$.

Proof for EOd, EOp. In EOd or EOp, each element of $\tilde{h}_u$ satisfies:
$$P(f(X) = k \mid Y = y, \tilde{A} = \tilde{a}) = \frac{P(f(X) = k, Y = y, \tilde{A} = \tilde{a})}{P(Y = y, \tilde{A} = \tilde{a})} = \sum_{a \in [M]} \frac{P(f(X) = k, Y = y, \tilde{A} = \tilde{a}, A = a)}{P(Y = y, \tilde{A} = \tilde{a})} = \sum_{a \in [M]} \frac{P(\tilde{A} = \tilde{a} \mid f(X) = k, Y = y, A = a) \cdot P(Y = y, A = a) \cdot P(f(X) = k \mid Y = y, A = a)}{P(Y = y, \tilde{A} = \tilde{a})}.$$
The above equation can be written in matrix form as $\tilde{H}[:, k \otimes y] = \Lambda_{\tilde{p}_y}^{-1} T_{k \otimes y}^\top \Lambda_{p_y} H[:, k \otimes y]$, which is equivalent to $H[:, k \otimes y] = (T_{k \otimes y}^\top \Lambda_{p_y})^{-1} \Lambda_{\tilde{p}_y} \tilde{H}[:, k \otimes y]$.

Wrap-up. We conclude the proof by unifying the above two results with $u$.
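The closed-form relationship for DP can be sanity-checked numerically: corrupt a clean fairness vector with the forward model from the proof, then invert it with the Theorem 2 formula (all values below are hypothetical):

```python
import numpy as np

# Forward model from the proof: h_tilde = Lam_ptilde^{-1} T_k^T Lam_p h.
# Theorem 2 inverts it: h = (T_k^T Lam_p)^{-1} Lam_ptilde h_tilde.
p       = np.array([0.6, 0.4])            # clean prior P(A = a)
p_tilde = np.array([0.55, 0.45])          # noisy prior P(A~ = a~)
h       = np.array([0.7, 0.4])            # clean H[:, k] = P(f(X) = k | A = a)
T_k     = np.array([[0.8, 0.2],           # T_k[a, a~] = P(A~ = a~ | f(X) = k, A = a)
                    [0.1, 0.9]])

Lam_p, Lam_pt = np.diag(p), np.diag(p_tilde)
h_tilde = np.linalg.solve(Lam_pt, T_k.T @ Lam_p @ h)        # noisy measurement
h_cal   = np.linalg.solve(T_k.T @ Lam_p, Lam_pt @ h_tilde)  # calibrated vector
print(np.allclose(h_cal, h))  # True: the calibration exactly inverts the corruption
```

This is purely a linear-algebra check of the identity; in practice $T_k$ and the priors must themselves be estimated, which is where the calibration error of Theorem 3 enters.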

B.3 PROOF FOR COROLLARY 1

Proof. When the conditional independence (Assumption 3) $P(\tilde{A} = a' \mid A = a, Y = y) = P(\tilde{A} = a' \mid A = a, f(X) = k, Y = y), \forall a', a \in [M]$ holds, we have $T_y = T_{k \otimes y}$ and Term-1 in Theorem 1 can be dropped. For Term-2, to get a tight bound in this specific case, we apply Hölder's inequality with the $\ell_\infty$ norm on $e_{\tilde{a}} - e_{\tilde{a}'}$:
$$\big| (e_{\tilde{a}} - e_{\tilde{a}'})^\top (\Lambda_{\tilde{p}_y}^{-1} T_{k \otimes y}^\top \Lambda_{p_y} - I) v_{k \otimes y} \big| \le \| e_{\tilde{a}} - e_{\tilde{a}'} \|_\infty \big\| (\Lambda_{\tilde{p}_y}^{-1} T_{k \otimes y}^\top \Lambda_{p_y} - I) v_{k \otimes y} \big\|_1 = \big\| (\Lambda_{\tilde{p}_y}^{-1} T_{k \otimes y}^\top \Lambda_{p_y} - I) v_{k \otimes y} \big\|_1 \le K \delta_{k \otimes y} \big\| \Lambda_{\tilde{p}_y}^{-1} T_{k \otimes y}^\top \Lambda_{p_y} - I \big\|_1 = K \delta_{k \otimes y} \big\| \Lambda_{p_y} T_{k \otimes y} \Lambda_{\tilde{p}_y}^{-1} - I \big\|_\infty.$$
Therefore,
$$\big| \Delta_{\mathrm{EOd}}(\tilde{D}, f) - \Delta_{\mathrm{EOd}}(D, f) \big| \le \frac{1}{K} \sum_{k, y \in [K]} \delta_{k \otimes y} \big\| \Lambda_{p_y} T_{k \otimes y} \Lambda_{\tilde{p}_y}^{-1} - I \big\|_\infty = \frac{1}{K} \sum_{k, y \in [K]} \delta_{k \otimes y} \big\| \Lambda_{p_y} T_y \Lambda_{\tilde{p}_y}^{-1} - I \big\|_\infty = \frac{1}{K} \sum_{k, y \in [K]} \delta_{k \otimes y} \big\| \check{T}_y - I \big\|_\infty,$$
where $\check{T}_y[a, \tilde{a}] = P(A = a \mid \tilde{A} = \tilde{a}, Y = y)$.

Special binary case in DP. In addition to the conditional independence, when the sensitive attribute and the label class are both binary, considering DP we have
$$\big| \Delta_{\mathrm{DP}}(\tilde{D}, f) - \Delta_{\mathrm{DP}}(D, f) \big| \le 2 \delta_k \big\| \check{T} - I \big\|_\infty,$$
where $\check{T}[a, \tilde{a}] = P(A = a \mid \tilde{A} = \tilde{a})$. Let $\check{T}[1, 2] = e_1$ and $\check{T}[2, 1] = e_2$; then
$$\check{T} = \begin{pmatrix} 1 - e_2 & e_1 \\ e_2 & 1 - e_1 \end{pmatrix}$$
and $\big| \Delta_{\mathrm{DP}}(\tilde{D}, f) - \Delta_{\mathrm{DP}}(D, f) \big| \le 2 \delta_k (e_1 + e_2)$. Note the equality in the above inequality always holds. To prove it, first note that under Assumption 3,
$$P(f(X) = k \mid \tilde{A} = \tilde{a}) = \sum_{a \in [M]} \frac{P(\tilde{A} = \tilde{a} \mid A = a) \cdot P(A = a) \cdot P(f(X) = k \mid A = a)}{P(\tilde{A} = \tilde{a})} = \sum_{a \in [M]} P(A = a \mid \tilde{A} = \tilde{a}) \cdot P(f(X) = k \mid A = a),$$
i.e., $\tilde{H}[:, k] = \check{T}^\top H[:, k]$. Denote $H[:, 1] = [h, h']^\top$. We have ($\tilde{a} \ne \tilde{a}'$)
$$\big| (e_{\tilde{a}} - e_{\tilde{a}'})^\top \tilde{H}[:, 1] \big| = |h - h'| \cdot |1 - e_1 - e_2|,$$

and

$$\big| (e_{\tilde{a}} - e_{\tilde{a}'})^\top H[:, 1] \big| = |h - h'|.$$
Therefore, letting $\tilde{a} = 1$, $\tilde{a}' = 2$, we have
$$\big| \Delta_{\mathrm{DP}}(\tilde{D}, f) - \Delta_{\mathrm{DP}}(D, f) \big| = \frac{1}{2} \sum_{k \in \{1, 2\}} \Big| \big| (e_1 - e_2)^\top \tilde{H}[:, k] \big| - \big| (e_1 - e_2)^\top H[:, k] \big| \Big| = \Big| \big| (e_1 - e_2)^\top \tilde{H}[:, 1] \big| - \big| (e_1 - e_2)^\top H[:, 1] \big| \Big| = |h - h'| \cdot (e_1 + e_2) = 2 \delta \cdot (e_1 + e_2),$$
where $\delta = |P(f(X) = 1 \mid A = 1) - P(f(X) = 1 \mid A = 2)|/2$. Therefore, the equality holds.

- DP:
$$\mathrm{Err}^{\mathrm{cal}}_{\mathrm{DP}} \le \frac{2}{K} \sum_{k \in [K]} \big\| \Lambda_{p}^{-1} \big\|_1 \big\| \Lambda_{p} H[:, k] \big\|_\infty \, \varepsilon(\hat{T}_k, \hat{p}),$$
where $\varepsilon(\hat{T}_k, \hat{p}) := \| \Lambda_{p}^{-1} \Lambda_{\hat{p}} - I \|_1 \| T_k \hat{T}_k^{-1} \|_1 + \| I - T_k \hat{T}_k^{-1} \|_1$ is the error induced by calibration.
- EOd:
$$\mathrm{Err}^{\mathrm{cal}}_{\mathrm{EOd}} \le \frac{2}{K^2} \sum_{k, y \in [K]} \big\| \Lambda_{p_y}^{-1} \big\|_1 \big\| \Lambda_{p_y} H[:, k \otimes y] \big\|_\infty \, \varepsilon(\hat{T}_{k \otimes y}, \hat{p}_y),$$
where $\varepsilon(\hat{T}_{k \otimes y}, \hat{p}_y) := \| \Lambda_{p_y}^{-1} \Lambda_{\hat{p}_y} - I \|_1 \| T_{k \otimes y} \hat{T}_{k \otimes y}^{-1} \|_1 + \| I - T_{k \otimes y} \hat{T}_{k \otimes y}^{-1} \|_1$ is the error induced by calibration.
- EOp: the result is obtained by letting $k = 1$ and $y = 1$.

Proof. We prove the result for EOd; consider the case when $f(X) = k$ and $Y = y$. For ease of notation, we use $\hat{T}$ to denote the estimated local transition matrix (i.e., $\hat{T}_{k \otimes y}$), and denote the noisy and clean fairness vectors with respect to $f(X) = k$ and $Y = y$ by $\tilde{h}$ and $h$. The error can be decomposed as
$$(e_a - e_{a'})^\top \Lambda_{\hat{p}_y}^{-1} (\hat{T}^\top)^{-1} \Lambda_{\tilde{p}_y} \tilde{h} - (e_a - e_{a'})^\top \Lambda_{p_y}^{-1} (T_{k \otimes y}^\top)^{-1} \Lambda_{\tilde{p}_y} \tilde{h} = \underbrace{(e_a - e_{a'})^\top (\Lambda_{\hat{p}_y}^{-1} - \Lambda_{p_y}^{-1}) (\hat{T}^\top)^{-1} \Lambda_{\tilde{p}_y} \tilde{h}}_{\text{Term-1}} + \underbrace{(e_a - e_{a'})^\top \Lambda_{p_y}^{-1} (\hat{T}^\top)^{-1} \Lambda_{\tilde{p}_y} \tilde{h} - (e_a - e_{a'})^\top \Lambda_{p_y}^{-1} (T_{k \otimes y}^\top)^{-1} \Lambda_{\tilde{p}_y} \tilde{h}}_{\text{Term-2}}.$$
We now upper bound each term. Term-1:
$$\big| (e_a - e_{a'})^\top (\Lambda_{\hat{p}_y}^{-1} - \Lambda_{p_y}^{-1}) (\hat{T}^\top)^{-1} \Lambda_{\tilde{p}_y} \tilde{h} \big| \overset{(a)}{=} \big| (e_a - e_{a'})^\top (\Lambda_{\hat{p}_y}^{-1} - \Lambda_{p_y}^{-1}) (T_{k \otimes y} \hat{T}^{-1})^\top \Lambda_{p_y} H[:, k \otimes y] \big| \overset{(b)}{=} \big| (e_a - e_{a'})^\top (\Lambda_{p_y}^{-1} \Lambda_{\hat{p}_y} - I) \Lambda_{p_y}^{-1} T_\delta^\top \Lambda_{p_y} H[:, k \otimes y] \big| \le 2 \big\| \Lambda_{p_y}^{-1} \big\|_\infty \big\| \Lambda_{p_y} H[:, k \otimes y] \big\|_\infty \big\| \Lambda_{p_y}^{-1} \Lambda_{\hat{p}_y} - I \big\|_\infty \| T_\delta \|_1,$$
where equality (a) holds because $\Lambda_{\tilde{p}_y} \tilde{h} = T_{k \otimes y}^\top \Lambda_{p_y} H[:, k \otimes y]$, and equality (b) holds because we denote the error matrix by $T_\delta$, i.e.,
$$\hat{T} = T_\delta^{-1} T_{k \otimes y} \Leftrightarrow T_\delta = T_{k \otimes y} \hat{T}^{-1}.$$
Term-2: Before proceeding, we introduce the Woodbury matrix identity:
$$(A + UCV)^{-1} = A^{-1} - A^{-1} U (C^{-1} + V A^{-1} U)^{-1} V A^{-1}.$$
Let $A := T_{k \otimes y}^\top$, $C := I$, $V := I$, $U := \hat{T}^\top - T_{k \otimes y}^\top$. By the Woodbury matrix identity, we have
$$(\hat{T}^\top)^{-1} = \big( T_{k \otimes y}^\top + (\hat{T}^\top - T_{k \otimes y}^\top) \big)^{-1} = (T_{k \otimes y}^\top)^{-1} - (T_{k \otimes y}^\top)^{-1} (\hat{T}^\top - T_{k \otimes y}^\top) \big( I + (T_{k \otimes y}^\top)^{-1} (\hat{T}^\top - T_{k \otimes y}^\top) \big)^{-1} (T_{k \otimes y}^\top)^{-1}.$$
Term-2 can then be upper bounded as:
$$\Big| (e_a - e_{a'})^\top \Lambda_{p_y}^{-1} (\hat{T}^\top)^{-1} \Lambda_{\tilde{p}_y} \tilde{h} - (e_a - e_{a'})^\top \Lambda_{p_y}^{-1} (T_{k \otimes y}^\top)^{-1} \Lambda_{\tilde{p}_y} \tilde{h} \Big| \overset{(a)}{=} \Big| (e_a - e_{a'})^\top \Lambda_{p_y}^{-1} (T_{k \otimes y}^\top)^{-1} (\hat{T}^\top - T_{k \otimes y}^\top) \big( I + (T_{k \otimes y}^\top)^{-1} (\hat{T}^\top - T_{k \otimes y}^\top) \big)^{-1} (T_{k \otimes y}^\top)^{-1} \Lambda_{\tilde{p}_y} \tilde{h} \Big|$$
$$\overset{(b)}{\le} 2 \big\| \Lambda_{p_y}^{-1} \big\|_\infty \Big\| \Big( I - \big( I + (T_{k \otimes y}^\top)^{-1} (\hat{T}^\top - T_{k \otimes y}^\top) \big)^{-1} \Big) (T_{k \otimes y}^\top)^{-1} \Lambda_{\tilde{p}_y} \tilde{h} \Big\|_\infty \overset{(c)}{=} 2 \big\| \Lambda_{p_y}^{-1} \big\|_\infty \Big\| \big( I - T_{k \otimes y} \hat{T}^{-1} \big)^\top (T_{k \otimes y}^\top)^{-1} \Lambda_{\tilde{p}_y} \tilde{h} \Big\|_\infty \le 2 \big\| \Lambda_{p_y}^{-1} \big\|_\infty \| I - T_\delta \|_1 \big\| (T_{k \otimes y}^\top)^{-1} \Lambda_{\tilde{p}_y} \tilde{h} \big\|_\infty \overset{(d)}{=} 2 \big\| \Lambda_{p_y}^{-1} \big\|_\infty \| I - T_\delta \|_1 \big\| \Lambda_{p_y} H[:, k \otimes y] \big\|_\infty,$$
where the key steps are:
- (a): Woodbury identity.
- (b): Hölder's inequality.
- (c): $\hat{T} = T_\delta^{-1} T_{k \otimes y}$ and the triangle inequality.
- (d): $\tilde{H}[:, k \otimes y] = \Lambda_{\tilde{p}_y}^{-1} T_{k \otimes y}^\top \Lambda_{p_y} H[:, k \otimes y] \Leftrightarrow (T_{k \otimes y}^\top)^{-1} \Lambda_{\tilde{p}_y} \tilde{H}[:, k \otimes y] = \Lambda_{p_y} H[:, k \otimes y]$.

Wrap-up: Combining the upper bounds of Term-1 and Term-2, we have (recovering full notation)
$$\Big| (e_a - e_{a'})^\top \Lambda_{\hat{p}_y}^{-1} (\hat{T}^\top)^{-1} \Lambda_{\tilde{p}_y} \tilde{h} - (e_a - e_{a'})^\top \Lambda_{p_y}^{-1} (T_{k \otimes y}^\top)^{-1} \Lambda_{\tilde{p}_y} \tilde{h} \Big| \le 2 \big\| \Lambda_{p_y}^{-1} \big\|_\infty \big\| \Lambda_{p_y} H[:, k \otimes y] \big\|_\infty \Big( \big\| \Lambda_{p_y}^{-1} \Lambda_{\hat{p}_y} - I \big\|_\infty \big\| T_{k \otimes y} \hat{T}_{k \otimes y}^{-1} \big\|_1 + \big\| I - T_{k \otimes y} \hat{T}_{k \otimes y}^{-1} \big\|_1 \Big).$$
Denote by $\hat{\Delta}^{\tilde{a}, \tilde{a}'}_{k \otimes y} := |\hat{H}[\tilde{a}, k \otimes y] - \hat{H}[\tilde{a}', k \otimes y]|$ the calibrated disparity and $\Delta^{\tilde{a}, \tilde{a}'}_{k \otimes y} := |H[\tilde{a}, k \otimes y] - H[\tilde{a}', k \otimes y]|$ the clean disparity between attributes $\tilde{a}$ and $\tilde{a}'$ in the case when $f(X) = k$ and $Y = y$. We have
$$\big| \hat{\Delta}_{\mathrm{EOd}}(\tilde{D}, f) - \Delta_{\mathrm{EOd}}(D, f) \big| \le \frac{1}{M(M-1)K^2} \sum_{\tilde{a}, \tilde{a}' \in [M],\, k, y \in [K]} \big| \hat{\Delta}^{\tilde{a}, \tilde{a}'}_{k \otimes y} - \Delta^{\tilde{a}, \tilde{a}'}_{k \otimes y} \big| \le \frac{2}{K^2} \sum_{k, y \in [K]} \big\| \Lambda_{p_y}^{-1} \big\|_\infty \big\| \Lambda_{p_y} H[:, k \otimes y] \big\|_\infty \Big( \big\| \Lambda_{p_y}^{-1} \Lambda_{\hat{p}_y} - I \big\|_\infty \big\| T_{k \otimes y} \hat{T}_{k \otimes y}^{-1} \big\|_1 + \big\| I - T_{k \otimes y} \hat{T}_{k \otimes y}^{-1} \big\|_1 \Big).$$
The above inequality generalizes to DP by dropping the dependency on $y$, and to EOp by requiring $k = 1$ and $y = 1$.

B.5 PROOF FOR COROLLARY 2

Proof. Consider DP. Denote by $H[:, k = 1] = [h, h']^\top$. We know $\delta = |h - h'|/2 = \Delta_{\mathrm{DP}}(D, f)/2$. Suppose $p \le 1/2$; then $\| \Lambda_p^{-1} \|_\infty = 1/p$ and $\| \Lambda_p H[:, k] \|_\infty = \max(ph, (1 - p)h')$. Recall $\varepsilon(\hat{T}_k, \hat{p}) := \| \Lambda_p^{-1} \Lambda_{\hat{p}} - I \|_1 \| T_k \hat{T}_k^{-1} \|_1 + \| I - T_k \hat{T}_k^{-1} \|_1$. By requiring the error upper bound in Theorem 3 to be less than the exact error in Corollary 1, we have (when $k = 1$)
$$2 \| \Lambda_p^{-1} \|_\infty \| \Lambda_p H[:, k] \|_\infty \, \varepsilon(\hat{T}_k, \hat{p}) \le 2 \delta (e_1 + e_2) \Leftrightarrow \varepsilon(\hat{T}_k, \hat{p}) \le \frac{\delta (e_1 + e_2)}{\| \Lambda_p^{-1} \|_\infty \| \Lambda_p H[:, k] \|_\infty} \Leftrightarrow \varepsilon(\hat{T}_k, \hat{p}) \le \frac{\delta (e_1 + e_2)}{\max(h, (1 - p) h' / p)}.$$
If $p = 1/2$, noting $\max(h, h') = (|h + h'| + |h - h'|)/2$, we further have (when $k = 1$)
$$\varepsilon(\hat{T}_k, \hat{p}) \le \frac{|h - h'| \cdot (e_1 + e_2)}{|h - h'| + |h + h'|} = \frac{e_1 + e_2}{1 + \frac{h + h'}{|h - h'|}} = \frac{e_1 + e_2}{1 + \frac{h + h'}{\Delta_{\mathrm{DP}}(D, f)}}.$$
To make the above hold for all $k \in \{1, 2\}$, we have
$$\varepsilon(\hat{T}_k, \hat{p}) \le \max_{k' \in \{1, 2\}} \frac{e_1 + e_2}{1 + \frac{\| H[:, k'] \|_1}{\Delta_{\mathrm{DP}}(D, f)}}, \quad \forall k \in \{1, 2\}.$$
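The exact binary-DP relation from Corollary 1, which the comparison above relies on, can be checked numerically (the values of $h$, $h'$, $e_1$, $e_2$ below are hypothetical):

```python
import numpy as np

# Check: with T_check[a, a~] = P(A = a | A~ = a~), e1 = T_check[1, 2],
# e2 = T_check[2, 1], the noisy DP disparity equals |1 - e1 - e2| times
# the clean one.
h, h_prime = 0.7, 0.4                     # P(f=1 | A=1), P(f=1 | A=2)
e1, e2 = 0.15, 0.10
T_check = np.array([[1 - e2, e1],
                    [e2, 1 - e1]])
H_clean = np.array([h, h_prime])
H_noisy = T_check.T @ H_clean             # H_tilde[:, 1] = T_check^T H[:, 1]
clean_disp = abs(h - h_prime)
noisy_disp = abs(H_noisy[0] - H_noisy[1])
print(noisy_disp, clean_disp * abs(1 - e1 - e2))  # both 0.225 up to float round-off
```

The noisy disparity always under-reports the clean one by the factor $|1 - e_1 - e_2|$, which is exactly the gap the calibration must recover.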

C MORE DISCUSSIONS ON TRANSITION MATRIX ESTIMATORS

C.1 HOC

HOC (Zhu et al., 2021b) relies on checking the agreements and disagreements among three noisy attributes of one feature. For example, given a three-tuple $(\tilde{a}^1_n, \tilde{a}^2_n, \tilde{a}^3_n)$, each noisy attribute may agree or disagree with the others. This consensus pattern encodes the information of the noise transition matrix $T$. Suppose $(\tilde{a}^1_n, \tilde{a}^2_n, \tilde{a}^3_n)$ are drawn from random variables $(\tilde{A}^1, \tilde{A}^2, \tilde{A}^3)$ satisfying Assumption 2. Denote $e_1 = P(\tilde{A}^1 = 2 \mid A^1 = 1) = P(\tilde{A}^2 = 2 \mid A^2 = 1) = P(\tilde{A}^3 = 2 \mid A^3 = 1)$ and $e_2 = P(\tilde{A}^1 = 1 \mid A^1 = 2) = P(\tilde{A}^2 = 1 \mid A^2 = 2) = P(\tilde{A}^3 = 1 \mid A^3 = 2)$. Note $A^1 = A^2 = A^3$. We have:

- First-order equations:
$$P(\tilde{A}^1 = 1) = P(A^1 = 1)(1 - e_1) + P(A^1 = 2) e_2, \qquad P(\tilde{A}^1 = 2) = P(A^1 = 1) e_1 + P(A^1 = 2)(1 - e_2).$$
- Second-order equations:
$$P(\tilde{A}^1 = 1, \tilde{A}^2 = 1) = P(\tilde{A}^1 = 1, \tilde{A}^2 = 1 \mid A^1 = 1) P(A^1 = 1) + P(\tilde{A}^1 = 1, \tilde{A}^2 = 1 \mid A^1 = 2) P(A^1 = 2) = (1 - e_1)^2 P(A^1 = 1) + e_2^2 P(A^1 = 2).$$
Similarly,
$$P(\tilde{A}^1 = 1, \tilde{A}^2 = 2) = P(\tilde{A}^1 = 2, \tilde{A}^2 = 1) = (1 - e_1) e_1 P(A^1 = 1) + e_2 (1 - e_2) P(A^1 = 2), \qquad P(\tilde{A}^1 = 2, \tilde{A}^2 = 2) = e_1^2 P(A^1 = 1) + (1 - e_2)^2 P(A^1 = 2).$$
- Third-order equations:
$$P(\tilde{A}^1 = 1, \tilde{A}^2 = 1, \tilde{A}^3 = 1) = (1 - e_1)^3 P(A^1 = 1) + e_2^3 P(A^1 = 2),$$
$$P(\tilde{A}^1 = 1, \tilde{A}^2 = 1, \tilde{A}^3 = 2) = P(\tilde{A}^1 = 1, \tilde{A}^2 = 2, \tilde{A}^3 = 1) = P(\tilde{A}^1 = 2, \tilde{A}^2 = 1, \tilde{A}^3 = 1) = (1 - e_1)^2 e_1 P(A^1 = 1) + (1 - e_2) e_2^2 P(A^1 = 2),$$
$$P(\tilde{A}^1 = 1, \tilde{A}^2 = 2, \tilde{A}^3 = 2) = P(\tilde{A}^1 = 2, \tilde{A}^2 = 1, \tilde{A}^3 = 2) = P(\tilde{A}^1 = 2, \tilde{A}^2 = 2, \tilde{A}^3 = 1) = (1 - e_1) e_1^2 P(A^1 = 1) + (1 - e_2)^2 e_2 P(A^1 = 2),$$
$$P(\tilde{A}^1 = 2, \tilde{A}^2 = 2, \tilde{A}^3 = 2) = e_1^3 P(A^1 = 1) + (1 - e_2)^3 P(A^1 = 2).$$

With the above equations, we can count the empirical frequency of each pattern (the left-hand sides) as $(\hat{c}^{[1]}, \hat{c}^{[2]}, \hat{c}^{[3]})$ and solve the equations.
See the key steps summarized in Algorithm 3.

Algorithm 3 Key Steps of HOC
1: Input: A set of three-tuples $\{(\tilde{a}^1_n, \tilde{a}^2_n, \tilde{a}^3_n) \mid n \in [N]\}$
2: $(\hat{c}^{[1]}, \hat{c}^{[2]}, \hat{c}^{[3]}) \leftarrow \mathrm{CountFreq}(\{(\tilde{a}^1_n, \tilde{a}^2_n, \tilde{a}^3_n) \mid n \in [N]\})$  // Count 1st-, 2nd-, and 3rd-order patterns
3: Find $\hat{T}$ such that it matches the counts $(\hat{c}^{[1]}, \hat{c}^{[2]}, \hat{c}^{[3]})$  // Solve the equations
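A toy, training-free version of this consensus idea can be sketched as follows; the grid-search solver below is a simplified stand-in for the actual HOC estimator, which solves the equations more cleverly:

```python
import numpy as np
from itertools import product

def hoc_toy(a1, a2, a3, grid=41):
    """Toy stand-in for HOC: grid-search (e1, e2, p1) so that a two-point
    mixture model reproduces the observed 1st/2nd/3rd-order consensus
    frequencies of three conditionally independent noisy copies in {1, 2}."""
    a = np.stack([np.asarray(a1), np.asarray(a2), np.asarray(a3)])
    obs1 = (a[0] == 1).mean()                    # P(A~1 = 1)
    obs2 = ((a[0] == 1) & (a[1] == 1)).mean()    # P(A~1 = 1, A~2 = 1)
    obs3 = (a == 1).all(axis=0).mean()           # P(all three = 1)
    best, best_err = None, float("inf")
    for e1, e2, p1 in product(np.linspace(0.0, 0.45, grid),
                              np.linspace(0.0, 0.45, grid),
                              np.linspace(0.05, 0.95, grid)):
        q, r = 1.0 - e1, e2                      # P(A~=1 | A=1), P(A~=1 | A=2)
        err = (abs(p1 * q + (1 - p1) * r - obs1)
               + abs(p1 * q**2 + (1 - p1) * r**2 - obs2)
               + abs(p1 * q**3 + (1 - p1) * r**3 - obs3))
        if err < best_err:
            best, best_err = (e1, e2, p1), err
    return best

rng = np.random.default_rng(0)
A = np.where(rng.random(50_000) < 0.6, 1, 2)     # clean attribute, P(A=1) = 0.6

def noisy_copy(e1, e2):
    u = rng.random(A.shape)
    out = A.copy()
    out[(A == 1) & (u < e1)] = 2
    out[(A == 2) & (u < e2)] = 1
    return out

e1_hat, e2_hat, p1_hat = hoc_toy(noisy_copy(0.2, 0.1),
                                 noisy_copy(0.2, 0.1),
                                 noisy_copy(0.2, 0.1))
print(e1_hat, e2_hat, p1_hat)   # close to the true (0.2, 0.1, 0.6)
```

The estimate is recovered without any ground-truth attributes, illustrating why consensus statistics alone suffice in the binary case.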

C.2 OTHER ESTIMATORS THAT REQUIRE TRAINING

The other estimators in the noisy label literature mainly focus on training a new model to fit the noisy data distribution. The intuition is that the new model can distinguish between true and wrong attributes; in other words, its predictions are believed to be close to the true attributes. This is useful when the noise in the attributes is random. However, this intuition hardly holds in our setting, since we would need to train a new model to learn the noisy attributes given by an auxiliary model, which are deterministic. One caveat of this approach is that the new model is likely to fit the auxiliary model when both the capacity of the new model and the amount of data are sufficient, leading to a trivial transition matrix estimate equal to the identity, i.e., $\hat{T} = I$. In this case, the performance is close to Base. We reproduce Northcutt et al. (2021) following the setting in Table 2 and summarize the results in Table 5, which verifies that the performance of this kind of approach is close to Base.

D.1 MORE TABLES FOR THE COMPAS DATASET

We have two tables in this subsection.
- Table 6 shows the raw disparities measured on the COMPAS dataset.

When our method is better. Table 1, DP in Table 2, and Table 3 all show our method is significantly better than both baselines when the noise rates of g are moderate to high (e.g., ≥ 15%) or f is biased (e.g., ≥ 0.1). This observation is also consistent with our result in Corollary 2. In other cases, when both the noise rate and the original disparity are low, our calibration may not outperform direct measurement without calibration, e.g., EOp in Table 2. However, the raw error of EOp is sufficiently small (< 0.01) for all approaches, indicating the absolute performance of our method is not bad even when it fails to beat the others.



We only assume it for the purpose of demonstrating a less complicated theoretical result; we do not need this assumption in our proposed algorithm later.



the normalized $\ell_1$ distance between two rows of $H$, where $\mathrm{col}(H)$ is the number of columns in $H$. Denote by
$$\Psi(H) := \frac{1}{M(M-1)} \sum_{a, a' \in [M]} \psi(H[a], H[a']).$$
We define the following disparity as a general statistical group fairness metric on distribution $\mathcal{D}$ (Chen et al., 2022):

Definition 2 (Group Fairness Metric). The group fairness of model $f$ on data distribution $(X, Y, A) \sim \mathcal{D}$ writes as $\Delta(\mathcal{D}, f) = \Psi(H)$.
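Definition 2 can be sketched as follows (toy binary example, where $\Psi(H)$ reduces to the usual DP disparity):

```python
import numpy as np

def psi(row_a, row_b):
    # normalized l1 distance between two rows of H (normalized by col(H))
    row_a, row_b = np.asarray(row_a), np.asarray(row_b)
    return np.abs(row_a - row_b).sum() / row_a.size

def group_fairness(H):
    """Definition 2: Psi(H) = sum over ordered pairs a != a' of
    psi(H[a], H[a']) divided by M(M - 1)."""
    H = np.asarray(H)
    M = H.shape[0]
    total = sum(psi(H[a], H[b]) for a in range(M) for b in range(M) if a != b)
    return total / (M * (M - 1))

H = np.array([[0.7, 0.3],    # P(f=1 | A=1), P(f=2 | A=1)
              [0.4, 0.6]])   # P(f=1 | A=2), P(f=2 | A=2)
print(group_fairness(H))     # about 0.3, the usual DP disparity |0.7 - 0.4|
```

The same routine applies unchanged to EOd/EOp once H is built with the $k \otimes y$ columns.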

Figure 1: Overview of our framework for estimating fairness without sensitive attributes, given only auxiliary models g. Step 1 (a): estimate the noisy fairness matrix. Step 2 (b): calibrate the fairness matrix using the estimated transition matrix and prior probability.

the error induced by calibration. With a perfect estimator, i.e., $\hat{T}_k = T_k$ and $\hat{p}_k = p_k, \forall k \in [K]$, we have $\mathrm{Err}^{\mathrm{cal}} = 0$.

StatEstimator: HOCFair
1: Input: Noisy dataset $\tilde{D}$. Target model $f$.
2: $C \leftarrow$ #Attribute($\tilde{D}$)  // Get the number of noisy attributes (i.e., the number of auxiliary models)
3: if $C < 3$ then  // Get the 2-nearest-neighbors of $x_n$ and save their attributes as $x_n$'s attributes
4:

- EOd / EOp: $P(\tilde{A} = \tilde{a} \mid f(X) = k, Y = y, A = a) = P(\tilde{A} = \tilde{a} \mid Y = y, A = a), \forall a, \tilde{a} \in [M], k, y \in [K]$ (i.e., $\tilde{A} \perp\!\!\!\perp f(X) \mid Y, A$).

B PROOFS

B.1 FULL VERSION OF THEOREM 1 AND ITS PROOF

Denote by $T_y$ the attribute noise transition matrix with respect to label $y$, whose $(a, \tilde{a})$-th element is $T_y[a, \tilde{a}] := P(\tilde{A} = \tilde{a} \mid A = a, Y = y)$. Note it is different from $T_k$. Denote by $T_{k \otimes y}$ the attribute noise transition matrix when $f(X) = k$ and $Y = y$, whose $(a, \tilde{a})$-th element is $T_{k \otimes y}[a, \tilde{a}] := P(\tilde{A} = \tilde{a} \mid f(X) = k, Y = y, A = a)$. Denote by $p_y := [P(A = 1 \mid Y = y), \cdots, P(A = M \mid Y = y)]^\top$ and $\tilde{p}_y := [P(\tilde{A} = 1 \mid Y = y), \cdots, P(\tilde{A} = M \mid Y = y)]^\top$ the clean and noisy prior probabilities, respectively.

Theorem 1 (Error Upper Bound of Noisy Metrics). Denote by $\mathrm{Err}^{\mathrm{raw}}_u := |\Delta_u(\tilde{D}, f) - \Delta_u(D, f)|$ the estimation error of the directly measured noisy fairness metrics. Its upper bound is:


B.4 PROOF FOR THEOREM 3

Theorem 3 (Error upper bound of calibrated metrics). Denote the error of the calibrated fairness metrics by $\mathrm{Err}^{\mathrm{cal}}_u := |\hat{\Delta}_u(\tilde{D}, f) - \Delta_u(D, f)|$. It can be upper bounded as follows. • DP:

To unify different fairness metrics, we define the matrix H as an intermediate variable. Each column of H denotes the probability needed for evaluating fairness with respect to the classifier prediction f(X).

Normalized estimation error on COMPAS. Each row represents a different target model f .

Normalized error on CelebA. Each row represents a different pre-trained model used to generate the feature representations from which we simulate the other two auxiliary models g 2 , g 3 (Line 3, Algorithm 2). Base and Soft are computed on g 1 and remain unchanged since they are independent of the feature representations. The ground-truth fairness metrics are DP: 0.13, EOd: 0.03, EOp: 0.05.

Normalized error (×100) of a learning-centric estimator.

D.1 MORE TABLES FOR THE COMPAS DATASET

Table 7 is the full version of Table 1.

D.2 MORE TABLES FOR THE CELEBA DATASET

We have three tables in this subsection.
• Table 8 is the full version of Table 2.
• Table 9 is the full version of Table 3.
• Table 10 is similar to Table 9, but the error metric is changed to the Improvement defined in Section 5.1.

D.3 MORE DISCUSSIONS

We cross-reference the different tables and summarize several takeaway messages. Note that some of them have already been introduced in the main paper.

Uncalibrated measurement is sensitive to noise rates and raw disparity. The target models f trained on the COMPAS dataset are usually biased (details in Table 6) and the auxiliary models g are inaccurate (68.85% accuracy in binary classification), whereas on CelebA f is almost unbiased in EOd and EOp (details in Table 8) and g is accurate (92.55% accuracy). As a result, all three types of directly measured fairness metrics (Base) have large normalized errors (∼40-60%) on COMPAS, as in Table 1, but only a moderate normalized error in DP (15.33%) and small normalized errors in EOd (4.11%) and EOp (2.82%) on CelebA, as in Table 2. This is consistent with our results in Theorem 1 and Corollary 1.

Local vs. Global. Table 1 also shows that our Global method works consistently better than the Local method, while Table 2 shows the reverse. Intuitively, when the auxiliary models are highly inaccurate (68.85% accuracy), Assumptions 1-2 for implementing HOC may not hold well in every local dataset, inducing large estimation errors in the local estimates and unstable calibrations. On the contrary, when the auxiliary models are accurate (92.55% accuracy in Table 2), Assumption 1 always holds and most instances satisfy Assumption 2 if we carefully choose the other two auxiliary models g2 and g3 given g1; Local then outperforms, since it achieves zero error when both assumptions hold perfectly, whereas Global induces extra error due to approximation. Note that Table 3 shows Local remains statistically better than Global even when the noise rate is high. This is because the extra random flipping follows Assumption 2, so the estimation error of Local does not deteriorate significantly. Therefore, we prefer Local when the original auxiliary model is accurate, and Global to stabilize the calibration otherwise.
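For intuition, the transition-matrix calibration that both the Local and Global variants build on can be sketched numerically. This is a minimal illustration rather than the paper's exact derivation: it assumes the noisy group-conditional rates are a posterior-weighted mixture of the clean ones (i.e., a conditional-independence-style simplification that the paper's actual framework is designed to relax), and the function and argument names are ours.

```python
import numpy as np

def calibrate_fairness_matrix(H_noisy, T, p_clean):
    """Recover a clean fairness matrix H from its noisy counterpart.

    Illustrative sketch: with T[a, a~] = P(A~ = a~ | A = a) and clean
    prior p_clean[a], Bayes' rule gives the posterior
    Q[a~, a] = P(A = a | A~ = a~).  If the noisy group-conditional
    rates mix the clean ones, H_noisy = Q @ H, so H = Q^{-1} @ H_noisy.
    """
    # Joint P(A = a, A~ = a~), shape (M, M): rows index a, columns a~.
    joint = p_clean[:, None] * T
    # Posterior Q[a~, a] = P(A = a | A~ = a~); rows sum to 1.
    Q = (joint / joint.sum(axis=0, keepdims=True)).T
    # Invert the mixing to recover the clean (M, K) fairness matrix.
    return np.linalg.solve(Q, H_noisy)
```

In this simplified picture, the quality of the calibration is driven entirely by how well T and p_clean are estimated, which is why HOC-style estimation of the transition matrix is the central step.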

Normalized Error on CelebA with different noise rates

Disparity mitigation with our calibration framework. Results are averaged over the last 5 epochs. DP is the considered fairness metric. Base: direct mitigation with noisy sensitive attributes. Facenet, Facenet 512, etc.: pre-trained models used to generate the feature representations that simulate the other two auxiliary models.

Appendix

The Appendix is organized as follows.
• Section A presents a summary of notations, more fairness definitions, and a clear statement of the assumption that is common in the literature. Note our framework does not rely on this assumption.
• Section B presents the full versions of our theorems (for DP, EOd, and EOp), corollaries, and the corresponding proofs.
• Section C shows how HOC works and analyzes why other learning-centric methods from the noisy label literature may not work in our setting.
• Section D presents more experimental results and takeaways.
The code and data for reproducing our experiments will be released after acceptance.

A MORE DEFINITIONS AND ASSUMPTIONS

A.1 SUMMARY OF NOTATIONS

• g: Auxiliary models for generating noisy sensitive attributes
• X, Y, A, Ã := g(X): Random variables of the feature, label, ground-truth sensitive attribute, and noisy sensitive attribute
• x_n, y_n, a_n: The n-th feature, label, and ground-truth sensitive attribute in a dataset
• N, K, M: The number of instances, label classes, and categories of sensitive attributes
• Distribution of D and D̃
• u ∈ {DP, EOd, EOp}: A unified notation for the group fairness definitions
• True, (direct) noisy, and calibrated group fairness metrics, defined both on data distributions and on datasets
• H: Fairness matrix, together with its a-th row, k-th column, and (a, k)-th element
• H̃: Noisy fairness matrix with respect to Ã
• T, T[a, ã] := P(Ã = ã | A = a): Global noise transition matrix
• Clean prior probability
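As a concrete illustration of this notation, a small simulation can generate noisy attributes Ã from ground-truth attributes through a row-stochastic transition matrix T. All sizes and probabilities below are hypothetical, not values from the paper's experiments.

```python
import numpy as np

# Hypothetical instantiation of the notation above.
N, K, M = 1000, 2, 2        # instances, label classes, attribute categories
rng = np.random.default_rng(0)

a = rng.integers(0, M, size=N)   # ground-truth sensitive attributes a_n
# Global noise transition matrix T[a, a~] = P(A~ = a~ | A = a).
T = np.array([[0.9, 0.1],
              [0.2, 0.8]])
# Noisy attributes A~ = g(X), simulated here by flipping each a_n via T.
a_noisy = np.array([rng.choice(M, p=T[a_n]) for a_n in a])

assert np.allclose(T.sum(axis=1), 1.0)  # each row of T is a distribution
```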

A.2 MORE FAIRNESS DEFINITIONS

We present the full version of the fairness definitions, and the corresponding matrix forms, for DP, EOd, and EOp.

Fairness Definitions. We consider three group fairness definitions (Wang et al., 2020; Cotter et al., 2019) and their corresponding measurable metrics: demographic parity (DP) (Calders et al., 2009; Chouldechova, 2017), equalized odds (EOd) (Woodworth et al., 2017), and equalized opportunity (EOp) (Hardt et al., 2016).

Definition 1 (Demographic Parity). The demographic parity metric of f on D conditioned on A is:

Definition 4 (Equalized Odds). The equalized odds metric of f on D conditioned on A is:

Multi-category sensitive attributes. We experiment with three categories of sensitive attributes: black, white, and others, and report the results in Table 11. Note that EOp is not defined in the case with more than two categories. Table 11 shows that our proposed framework with global estimates is consistently and significantly better than the baselines, which is also consistent with the results in Table 1.

Disparity mitigation. To make the metric differentiable, we use a relaxed measure (Madras et al., 2018; Wang et al., 2022) as follows:

where f_xn[k] is the model's prediction probability on class k, and N_ã is the number of samples that have noisy attribute ã. The standard method of multipliers (Boyd et al., 2011) is employed to train with the fairness constraints. We train the model for 20 epochs with a batch size of 256. Table 12 reports the accuracy and DP disparity on the test data, averaged over the last 5 epochs of training. From the table, we conclude that, with any of the selected pre-trained models, mitigation based on our calibration results significantly outperforms direct mitigation with noisy attributes, in terms of both accuracy improvement and disparity mitigation.
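A rough sketch of such a relaxed DP measure is below. The function and argument names are ours, and the exact aggregation over groups and classes is an assumption; the sketch only illustrates the idea of averaging soft predictions within each (possibly noisy) attribute group.

```python
import numpy as np

def dp_disparity(probs, attrs, M):
    """Relaxed (differentiable) demographic-parity disparity: a sketch.

    probs[n, k] plays the role of f_{x_n}[k], the model's prediction
    probability on class k; attrs[n] is the (possibly noisy) sensitive
    attribute of sample n.  The per-group soft rate averages probs over
    the N_a~ samples in each group; the disparity reported here is the
    largest gap between a group rate and the overall rate (the paper's
    exact aggregation may differ).
    """
    overall = probs.mean(axis=0)              # overall soft rate per class
    gaps = [np.abs(probs[attrs == a].mean(axis=0) - overall).max()
            for a in range(M) if np.any(attrs == a)]
    return max(gaps)
```

Because the group averages are linear in the prediction probabilities, a measure of this form stays differentiable in the model parameters and can be used as a constraint in the method of multipliers.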

