

Abstract

Multicalibration is a desirable fairness criterion that constrains calibration error among flexibly-defined groups in the data while maintaining overall calibration. However, when outcome probabilities are correlated with group membership, multicalibrated models can exhibit a higher percent calibration error among groups with lower base rates than among groups with higher base rates. As a result, it remains possible for a decision-maker to learn to trust or distrust model predictions for specific groups. To alleviate this, we propose proportional multicalibration, a criterion that constrains the percent calibration error among groups and within prediction bins. We prove that satisfying proportional multicalibration bounds a model's multicalibration as well as its differential calibration, a stronger fairness criterion inspired by the fairness notion of sufficiency. We provide an efficient algorithm for post-processing risk prediction models for proportional multicalibration and evaluate it empirically. We conduct simulation studies and investigate a real-world application of PMC post-processing to the prediction of emergency department patient admissions. We observe that proportional multicalibration is a promising criterion for controlling multiple measures of calibration fairness of a model over intersectional groups simultaneously, with virtually no cost in terms of classification performance.

1. INTRODUCTION

Today, machine learning (ML) models have an impact on outcome disparities across sectors (health, lending, criminal justice) due to their widespread use in decision-making. When applied in clinical decision-making, ML models help care providers decide whom to prioritize to receive finite and time-sensitive resources among a population of potentially very ill patients. These resources include hospital beds (Barak-Corren et al., 2021a; Dinh & Berendsen Russell, 2021), organ transplants (Schnellinger et al., 2021), specialty treatment programs (Henry et al., 2015; Obermeyer et al., 2019), and, recently, ventilators and other breathing support tools to manage the COVID-19 pandemic (Riviello et al., 2022). In scenarios like these, decision makers typically rely on risk prediction models to be calibrated. Calibration measures the extent to which a model's risk scores, R, match the observed probability of the event, y. Perfect calibration implies that P(y|R = r) = r for all values of r. Calibration allows the risk scores to be used to rank patients in order of priority and informs care providers about the urgency of treatment. However, models that are not equally calibrated among subgroups defined by different sensitive attributes (race, ethnicity, gender, income, etc.) may lead to systematic denial of resources to marginalized groups (e.g. (Obermeyer et al., 2019; Ashana et al., 2021; Roberts, 2011; Zelnick et al., 2021; Ku et al., 2021)). Just this scenario was observed by Obermeyer et al. (2019), who analyzed a large health system algorithm used to enroll high-risk patients into care management programs and showed that, at a given risk score, Black patients exhibited significantly poorer health than white patients. To address equity in calibration, Hébert-Johnson et al. (2018) proposed a fairness measure called multicalibration (MC), which asks that calibration be satisfied simultaneously over many flexibly-defined subgroups.
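The calibration condition above can be estimated from samples by binning the risk scores and comparing mean outcomes to mean predictions per bin. Below is a minimal sketch; the function name and binning scheme are our own, not from the paper.

```python
import numpy as np

def calibration_error_by_bin(y, r, n_bins=10):
    """Per-bin calibration error |E[y | R in bin] - E[R | R in bin]|."""
    bins = np.minimum((r * n_bins).astype(int), n_bins - 1)
    errs = {}
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            errs[b] = abs(y[mask].mean() - r[mask].mean())
    return errs

# A perfectly calibrated toy model: y is drawn with probability exactly r.
rng = np.random.default_rng(0)
r = rng.uniform(size=100_000)
y = (rng.uniform(size=r.size) < r).astype(float)
errs = calibration_error_by_bin(y, r)
```

For a perfectly calibrated score, every per-bin error should shrink toward zero as the sample grows.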
Remarkably, MC can be satisfied efficiently by post-processing risk scores without negatively impacting the generalization error of a model, unlike other fairness concepts like demographic parity (Foulds & Pan, 2020) and equalized odds (Hardt et al., 2016). This has motivated the use of MC in practical settings (e.g. Barda et al. (2021)) and has spurred several extensions (Kim et al., 2019; Jung et al., 2021; Gupta et al., 2021; Gopalan et al., 2022). If we bin our risk predictions, the MC criterion specifies that, for every group within each bin, the absolute difference between the mean observed outcome and the mean of the predictions should be small. As Barocas et al. (2019) note, equity in calibration embeds the fairness notion called sufficiency, which states: for a given risk prediction, the expected outcome should be independent of group membership. Starting from this notion, we can assess the conditions under which MC satisfies sufficiency. In this work, we derive a fairness criterion directly from sufficiency, dubbed differential calibration for its relation to differential fairness (Foulds et al., 2019b). We show that satisfying differential calibration can ensure that a model is equally "trustworthy" among groups in the data. By equally "trustworthy", we mean that a decision maker cannot reasonably come to distrust the model's risk predictions for specific groups, which may help prevent differences in decision-making between demographic groups, given the same risk prediction. By relating sufficiency to MC, we describe a shortcoming of MC that can occur when the outcome probabilities are strongly tied to group membership. Under this condition, the amount of calibration error relative to the expected outcome can be unequal between groups. This inequality hampers the ability of MC to (approximately) guarantee sufficiency, and thus to guarantee equity in trustworthiness for the decision maker.
We propose a simple variant of MC called proportional multicalibration (PMC) that ensures that the proportion of calibration error within each bin and group is small. We prove that PMC bounds both multicalibration and differential calibration. We show that PMC can be satisfied with an efficient post-processing method, similarly to MC.

1.1. OUR CONTRIBUTIONS

In this manuscript, we formally analyze the connection of MC to the fairness notion of sufficiency. To do so, we introduce differential calibration (DC), a sufficiency measure that constrains ratios of population risk between pairs of groups within prediction bins. We describe how DC, like sufficiency, provides a sense of equal trustworthiness from the point of view of the decision maker. With this definition, we prove the following. First, models that are α-multicalibrated satisfy $\left(\ln\frac{r_{\min}+\alpha}{r_{\min}-\alpha}\right)$-DC, where $r_{\min}$ is the minimum expected risk prediction among categories defined by subgroups and prediction intervals. We illustrate the meaning of this bound, which is that the proportion of calibration error in multicalibrated models may scale inversely with the outcome probability. Based on these observations, we propose a variation of MC, PMC, that controls the percentage error by group and risk stratum (Definition 5). We show that models satisfying α-PMC are $\left(\frac{\alpha}{1-\alpha}\right)$-multicalibrated and $\left(\ln\frac{1+\alpha}{1-\alpha}\right)$-differentially calibrated. Proportionally multicalibrated models thereby obtain robust fairness guarantees that are less dependent on population risk categories. Furthermore, we define an efficient algorithm for learning predictors satisfying α-PMC. Finally, we investigate the application of these methods to predicting patient admissions in the emergency department, a real-world resource allocation task, and show that post-processing for PMC results in models that are accurate, multicalibrated, and differentially calibrated.

2.1. PRELIMINARIES

We consider the task of training a risk prediction model for a population of individuals with outcomes y ∈ {0, 1} and features x ∈ X. Let D be the joint distribution from which individual samples (y, x) are drawn. We assume the outcomes y are random samples from underlying independent Bernoulli distributions with parameters p*(x) ∈ [0, 1]. Individuals can be further grouped into collections of subsets C ⊆ 2^X, such that S ∈ C is the subset of individuals belonging to S, and x ∈ S indicates that individual x belongs to group S. We denote our risk prediction model as R(x) : X → [0, 1]. To consider calibration in practice, the risk predictions are typically discretized and considered within intervals. The coarseness of these intervals is parameterized by a partitioning parameter, λ ∈ (0, 1]. The λ-discretization of [0, 1] is denoted by a set of intervals $\Lambda_\lambda = \{I_j\}_{j=0}^{1/\lambda - 1}$, where $I_j = [j\lambda, (j+1)\lambda)$. For brevity, some proofs in the following sections are given in Appendix A.2.4.
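The λ-discretization above is straightforward to implement; a minimal sketch (helper names are ours, and we route predictions of exactly 1.0 into the last interval):

```python
def lambda_discretization(lam):
    """Intervals I_j = [j*lam, (j+1)*lam) covering [0, 1)."""
    n = int(round(1 / lam))
    return [(j * lam, (j + 1) * lam) for j in range(n)]

def interval_index(r, lam):
    """Index j such that a prediction r falls in I_j (r = 1.0 maps to the last bin)."""
    n = int(round(1 / lam))
    return min(int(r / lam), n - 1)
```

For example, λ = 0.1 yields the ten intervals [0, 0.1), [0.1, 0.2), ..., [0.9, 1.0).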

2.2. MULTICALIBRATION

MC (Hébert-Johnson et al., 2018) guarantees that the calibration error for any group from a collection of subsets, C, will not exceed a user-defined threshold over the range of risk scores.

Definition 1 (α-calibration (Hébert-Johnson et al., 2018)). Let S ⊆ X. For α ∈ [0, 1], R is α-calibrated with respect to S if there exists some S' ⊆ S with |S'| ≥ (1 − α)|S| such that for all r ∈ [0, 1], $|E_D[y \mid R = r, x \in S'] - r| \le \alpha$.

Definition 2 (α-MC (Hébert-Johnson et al., 2018)). Let C ⊆ 2^X be a collection of subsets of X and α ∈ [0, 1]. A predictor R is α-multicalibrated on C if, for all S ∈ C, R is α-calibrated with respect to S.

We note that, according to Definition 1, a model need only be calibrated over a sufficiently large subset (S') of each group in order to satisfy the definition. This relaxation is used to maintain a satisfactory definition of MC when working with discretized predictions. For simplicity, we conduct most of our analysis using the continuous versions of fairness definitions like Definition 2 (see Appendix A.2.1 for an extended discussion). MC is one of few approaches to achieving fairness that does not require a significant trade-off between a model's generalization error and the improvement in fairness it provides (Hébert-Johnson et al., 2018). As Hébert-Johnson et al. (2018) show, this is because achieving multicalibration is not at odds with achieving accuracy in expectation for the population as a whole. This separates calibration fairness from other fairness constraints like demographic parity and equalized odds (Hardt et al., 2016), both of which may degrade the performance of the model on specific groups (Chouldechova, 2017; Pleiss et al., 2017). In clinical settings, such trade-offs may be difficult or impossible to justify. In addition to its alignment with accuracy in expectation, Hébert-Johnson et al. (2018) propose an efficient post-processing algorithm for MC similar to boosting.
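An empirical analogue of the MC criterion takes the worst absolute calibration error over all (group, interval) categories. A minimal sketch, assuming a λ-discretization of the scores (the function name is ours; the paper's formal empirical loss is in its Appendix A.2.2):

```python
import numpy as np

def mc_loss(y, r, groups, lam=0.1):
    """Empirical MC loss: max over (group, interval) of |mean(y) - mean(r)|."""
    n_bins = int(round(1 / lam))
    bin_idx = np.minimum((r / lam).astype(int), n_bins - 1)
    worst = 0.0
    for g in np.unique(groups):
        for b in range(n_bins):
            m = (groups == g) & (bin_idx == b)
            if m.sum() > 0:
                worst = max(worst, abs(y[m].mean() - r[m].mean()))
    return worst

# Toy example: group 0 has observed rate 0.25 but predictions of 0.3,
# group 1 is perfectly calibrated at 0.75.
y = np.array([1, 0, 0, 0, 1, 1, 1, 0], dtype=float)
r = np.array([0.3] * 4 + [0.75] * 4)
g = np.array([0] * 4 + [1] * 4)
loss = mc_loss(y, r, g)
```

Here the worst category is group 0's, with absolute error 0.05.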
We discuss additional extensions to MC in Appendix A.1.

2.3. SUFFICIENCY AND DIFFERENTIAL CALIBRATION

MC provides a sense of fairness by approximating calibration by group, which is perfectly satisfied when $P_D(y \mid R = r, x \in S) = r$ for all S ∈ C and r ∈ [0, 1]. Calibration by group is closely related to the sufficiency fairness criterion (Barocas et al., 2019). Sufficiency is the condition where the outcome probability is independent of C conditioned on the risk score. In the binary group setting (C = {S_i, S_j}), sufficiency can be expressed as $P_D(y \mid R, x \in S_i) = P_D(y \mid R, x \in S_j)$, or equivalently $\frac{P_D(y \mid R, x \in S_i)}{P_D(y \mid R, x \in S_j)} = 1$. Unlike calibration by group, sufficiency does not stipulate that the risk scores be calibrated, yet from a fairness perspective, sufficiency and calibration-by-group are equivalent (Barocas et al., 2019). Consider that one can easily transform a model satisfying sufficiency into one that is calibrated-by-group with a single function f(R) → [0, 1], for example via Platt scaling (Barocas et al., 2019). In both cases, the sense of fairness stems from the desire for R to capture everything about group membership that is relevant to predicting y. Under sufficiency, the risk score is equally informative of the outcome, regardless of group membership. In this sense, a model satisfying sufficiency provides equally trustworthy risk predictions to a decision maker, regardless of the groups to which an individual belongs. Below, we define an approximate measure of sufficiency that constrains pairwise differentials between groups and accommodates binned predictions:

Definition 3 (ε-Differential Calibration). Let C ⊆ 2^X be a collection of subsets of X. A model R(x) is ε-differentially calibrated with respect to C if, for all pairs (S_i, S_j) ∈ C × C for which $P_D(S_i), P_D(S_j) > 0$, and for all r ∈ [0, 1], $e^{-\varepsilon} \le \frac{E_D[y \mid R = r, x \in S_i]}{E_D[y \mid R = r, x \in S_j]} \le e^{\varepsilon}$.

By inspection we see that ε in ε-DC measures the extent to which R satisfies sufficiency.
That is, when $P(y \mid R = r, x \in S_i) \approx P(y \mid R = r, x \in S_j)$ for all pairs (S_i, S_j), ε ≈ 0. ε-DC says that, within any bin of risk scores, the outcome y is at most $e^{\varepsilon}$ times as likely among one group as among another, and at least $e^{-\varepsilon}$ times as likely. Definition 3 fits into the general definition of a differential fairness measure proposed by Foulds et al. (2019a), although that definition was previously used to define demographic parity criteria (Foulds & Pan, 2020). We describe the relation in more detail in Appendix A.1.1, including the connection of Definition 3's constraint to differential privacy (Dwork & Lei, 2009) and pufferfish privacy (Kifer & Machanavajjhala, 2014).
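An empirical ε-DC estimate can be computed as the largest absolute log-ratio of group outcome rates within any risk interval. A minimal sketch under our own naming, with a small floor on rates to keep the ratio finite:

```python
import numpy as np

def dc_loss(y, r, groups, lam=0.1, min_rate=1e-6):
    """Empirical DC loss: max over intervals and group pairs of |ln(ybar_i / ybar_j)|."""
    n_bins = int(round(1 / lam))
    bin_idx = np.minimum((r / lam).astype(int), n_bins - 1)
    eps = 0.0
    for b in range(n_bins):
        rates = []
        for g in np.unique(groups):
            m = (groups == g) & (bin_idx == b)
            if m.sum() > 0:
                rates.append(max(y[m].mean(), min_rate))
        if len(rates) >= 2:
            eps = max(eps, abs(np.log(max(rates) / min(rates))))
    return eps

# Two groups share a risk bin but have outcome rates 0.2 and 0.4,
# so y is twice as likely in one group: epsilon = ln 2.
y = np.array([1, 0, 0, 0, 0, 1, 1, 0, 0, 0], dtype=float)
r = np.full(10, 0.3)
g = np.array([0] * 5 + [1] * 5)
eps = dc_loss(y, r, g)
```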

2.4. THE DIFFERENTIAL CALIBRATION OF MULTICALIBRATED MODELS IS LIMITED BY LOW-RISK GROUPS

At a basic level, the forms of MC and sufficiency differ: MC constrains absolute differences between groups across prediction bins, whereas sufficiency constrains pairwise differentials between groups. To reconcile MC with DC/sufficiency more formally, we pose the following question: if a model satisfies α-MC, what, if anything, does this imply about the ε-DC of the model? (In Appendix A.3, Theorem 4, we answer the inverse question.) We now show that multicalibrated models have bounded DC, but that this bound is limited by small values of R.

Theorem 1. Let R(x) be a model satisfying α-MC on a collection C. Then R(x) is $\left(\ln\frac{r_{\min}+\alpha}{r_{\min}-\alpha}\right)$-differentially calibrated, where $r_{\min}$ is the minimum expected risk prediction among categories defined by subgroups and prediction intervals.

Proof. By Definition 2, for every category, $r - \alpha \le E_D[y \mid R = r, x \in S] \le r + \alpha$. Substituting these bounds into the differential in Definition 3, we observe that the lower bound on ε-DC for R(x) is given by $\frac{r+\alpha}{r-\alpha} \le e^{\varepsilon}$. The maximum of the left-hand side for a fixed α occurs at the smallest value of r; therefore R(x) satisfies $\ln\frac{r_{\min}+\alpha}{r_{\min}-\alpha} \le \varepsilon$. By switching the numerator and denominator we obtain the minimum differential and the left-hand-side constraint from Definition 3, i.e. $e^{-\varepsilon} \le \frac{r_{\min}-\alpha}{r_{\min}+\alpha}$. Thus R(x) is $\left(\ln\frac{r_{\min}+\alpha}{r_{\min}-\alpha}\right)$-differentially calibrated.

Theorem 1 illustrates the important point that, in terms of percentage error, MC does not provide equal protection to groups with different risk profiles. Imagine a model satisfying (0.05)-MC for groups S ∈ C. Consider individuals receiving model predictions R(x) = 0.9. MC guarantees that, for any category {x : x ∈ S, R(x) = 0.9}, the expected outcome probability is at least 0.9 − α = 0.85 and at most 0.9 + α = 0.95. This bounds the percent error among groups with this prediction to about 6%. In contrast, consider individuals for whom R(x) = 0.3; each group may have a true outcome probability as low as 0.25, which is an error of 20%, about 3.4x higher than the percent error in the higher-risk group.
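The arithmetic in the worked example above can be checked directly; a small sketch (variable names are ours):

```python
from math import log

alpha = 0.05
# High-risk category: R(x) = 0.9, so the true probability lies in [0.85, 0.95].
pct_err_high = alpha / (0.9 - alpha)   # worst-case percent error, ~5.9%
# Low-risk category: R(x) = 0.3, so the true probability may be as low as 0.25.
pct_err_low = alpha / (0.3 - alpha)    # worst-case percent error, 20%
ratio = pct_err_low / pct_err_high     # ~3.4x worse for the low-risk category
# Theorem 1's DC bound at r_min = 0.25: ln((r_min + alpha) / (r_min - alpha))
dc_bound = log((0.25 + alpha) / (0.25 - alpha))
```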

3. PROPORTIONAL MULTICALIBRATION

We are motivated to define a measure that is efficiently learnable like MC (Definition 2) but better aligned with the multiplicative interpretation of sufficiency, like DC (Definition 3). To do so, we define PMC, a variant of MC that constrains the proportional calibration error of a model among subgroups and risk strata. In this section, we show that bounding a model's PMC is enough to meaningfully bound both DC and MC. Furthermore, we provide an efficient algorithm for satisfying PMC based on a simple extension of MC/multiaccuracy boosting (Kim et al., 2019). We begin by defining proportional calibration, which expresses calibration error as a percentage of the outcome probability among a group.

Definition 4 (α-Proportional Calibration). Let S ⊆ X. For α > 0, R is α-proportionally calibrated with respect to S if there exists some S' ⊆ S with |S'| ≥ (1 − α)|S| such that for all r ∈ [0, 1], $|E_D[y \mid R = r, x \in S'] - r| \le \alpha\, E_D[y \mid R = r, x \in S']$.

Proportional multicalibration is then defined by requiring that Definition 4 be satisfied over a collection of groups:

Definition 5 (α-Proportional Multicalibration). Let C ⊆ 2^X be a collection of subsets of X and α > 0. A predictor R is α-proportionally multicalibrated on C if, for all S ∈ C, R is α-proportionally calibrated with respect to S.

We also define a discretized version of PMC that is useful for implementing the measure in Algorithm 1 and for measuring PMC in our experiments.

Definition 6 ((α, λ)-Proportional Multicalibration). A model R(x) is (α, λ)-proportionally multicalibrated with respect to a collection of subsets C if, for all S ∈ C and I ∈ Λ_λ satisfying $P_D(R(x) \in I \mid x \in S) \ge \alpha\lambda$, $\frac{|E_D[y \mid R \in I, x \in S] - E_D[R \mid R \in I, x \in S]|}{E_D[y \mid R \in I, x \in S]} \le \alpha$.

Note that, in practice, we must ensure $E_D[y \mid R \in I, x \in S] > 0$ for Definition 6 to be well defined; we later introduce a lower bound, ρ, to prevent the outcome probability from being too small.
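An empirical analogue of Definition 6 divides each category's calibration error by its outcome prevalence and takes the worst case. A minimal sketch under our own naming, using the ρ floor just mentioned:

```python
import numpy as np

def pmc_loss(y, r, groups, lam=0.1, rho=0.01):
    """Empirical (alpha, lambda)-PMC loss: max over (group, interval) of
    |mean(y) - mean(r)| / mean(y), skipping categories with mean(y) below rho."""
    n_bins = int(round(1 / lam))
    bin_idx = np.minimum((r / lam).astype(int), n_bins - 1)
    worst = 0.0
    for g in np.unique(groups):
        for b in range(n_bins):
            m = (groups == g) & (bin_idx == b)
            if m.sum() > 0 and y[m].mean() >= rho:
                worst = max(worst, abs(y[m].mean() - r[m].mean()) / y[m].mean())
    return worst
```

For the earlier toy category with observed rate 0.25 and predictions of 0.3, the proportional error is 0.05/0.25 = 20%, even though the absolute error is only 0.05.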
In Appendix A.2.1, we show that (α, λ)-PMC meaningfully bounds α-PMC under different discretizations, such that we can minimize (α, λ)-PMC to achieve low α-PMC.

Comparison to Differential Calibration Rather than constraining the differentials of prediction- and group-specific outcomes among all pairs of subgroups in C × C as in DC (Definition 3), PMC constrains the relative error of each group in C. In practical terms, this makes PMC more efficient to calculate than DC by a factor of O(|C|) steps. In addition, PMC does not require additional assumptions about the overall calibration of a model in order to imply guarantees of MC, since PMC directly constrains calibration rather than constraining sufficiency alone.

Theorem 2. Let R(x) be a model satisfying α-PMC on a collection C. Then R(x) is $\left(\ln\frac{1+\alpha}{1-\alpha}\right)$-differentially calibrated.

Proof. Let $p^*_S = E_D[y \mid R = r, x \in S]$ for a group S ∈ C. If R(x) satisfies α-PMC (Definition 5), then for every S, $r/(1+\alpha) \le p^*_S \le r/(1-\alpha)$. For any pair of groups, the differential is therefore bounded by $\frac{p^*_{S_i}}{p^*_{S_j}} \le \frac{r/(1-\alpha)}{r/(1+\alpha)} = \frac{1+\alpha}{1-\alpha}$, so $\varepsilon \le \ln\frac{1+\alpha}{1-\alpha}$.

Theorem 2 demonstrates that α-proportionally multicalibrated models satisfy a straightforward notion of differential fairness that depends monotonically on α alone. The relationship between PMC and DC is contrasted with the relationship between MC and DC in Fig. 3, left panel. The figure illustrates how MC's sensitivity to small risk categories limits its DC.

Comparison to Multicalibration Rather than constraining the absolute difference between risk predictions and the outcome as in MC, PMC requires that the calibration error be a small fraction of the expected risk in each category (S, I). In this sense, it provides a stronger protection than MC by requiring calibration error to be a small fraction of the outcome probability regardless of the risk group. We would argue that this is also better aligned with the notion of fairness appropriate to risk prediction contexts.
Under MC, the underlying probability of an outcome within a group affects the fairness protection that group receives (i.e., the percentage error that Definition 2 allows). Because the underlying probabilities of many clinically relevant outcomes vary significantly among subpopulations, multicalibrated models may systematically permit higher percentage error for specific risk groups. The difference in relative calibration error among populations with different risk profiles also translates into weaker sufficiency guarantees, as demonstrated in Theorem 1. In contrast, PMC provides a fairness guarantee that is less dependent on subpopulation risks. In the following theorem, we show that MC is also constrained when a model satisfies PMC.

Algorithm 1 Proportional Multicalibration Post-processing
Require: predictor R(x); collection C ⊆ 2^X such that for all S ∈ C, P_D(S) ≥ γ; parameters α, λ, γ, ρ > 0; data $D = \{(y, x)_i\}_{i=0}^{N} \sim \mathcal{D}$
1: function PMC(R, C, D, α, λ, γ, ρ)
2:   repeat
3:     {(y, x)} ← sample D
4:     for S ∈ C, I ∈ Λ_λ such that P_D(R ∈ I, x ∈ S) ≥ αλγ do
5:       S_r ← S ∩ {x : R(x) ∈ I}
6:       r̄ ← (1/|S_r|) Σ_{x ∈ S_r} R(x)    ▷ average group prediction
7:       ȳ ← (1/|S_r|) Σ_{x ∈ S_r} y(x)    ▷ average subgroup risk
8:       if ȳ ≤ ρ then continue
9:       ∆r ← ȳ − r̄
10:      if |∆r| ≥ αȳ then
11:        R(x) ← R(x) + ∆r for all x ∈ S_r
12:        R(x) ← squash(R(x), [0, 1])    ▷ squash updates to [0, 1]
13:   until no updates to R(x)
14:   return R

Theorem 3. Let R(x) be a model satisfying α-PMC on a collection C. Then R(x) is $\left(\frac{\alpha}{1-\alpha}\right)$-multicalibrated on C.

The proof of Theorem 3 is given in Appendix A.2.4. This theorem implies that a proportionally multicalibrated model with sufficiently low α will satisfy a similarly low value of MC, whereas a model with 0.5-PMC (i.e., 50% calibration error) or higher will not have a meaningful MC guarantee. We further discuss and illustrate the bounds given by Theorems 1 to 4 in Appendix A.2.

3.1. LEARNING PROPORTIONALLY MULTICALIBRATED PREDICTORS

So far we have demonstrated that models satisfying PMC exhibit desirable guarantees relative to two previously defined measures of fair calibration, but we have not yet considered whether PMC can be learned. Here, we answer in the affirmative by proposing Algorithm 1 to satisfy PMC. Algorithm 1 is a direct extension of MCBoost (Pfisterer et al., 2021), which works by checking for calibration errors among groups and prediction intervals that violate the user threshold and adjusting these predictions towards the target. PMC post-processing differs in two main ways: first, it updates whenever the calibration error is not within αȳ for each category, as opposed to simply within α. Second, it ignores updates for categories with low outcome probability (less than ρ). Next, we prove that PMC post-processing learns an (α, λ)-PMC model in a polynomial number of steps.

Proposition 1. Define α, λ, γ, ρ > 0. Let C ⊆ 2^X be a collection of subsets of X such that, for all S ∈ C, P_D(S) > γ. Let R(x) be a risk prediction model to be post-processed. For all (S, I) ∈ C × Λ_λ, let $E[y \mid R \in I, x \in S] > \rho$. Then there exists an algorithm that satisfies (α, λ)-PMC with respect to C in $O\!\left(\frac{|C|}{\alpha^3 \lambda^2 \rho^2 \gamma}\right)$ steps.

We analyze Algorithm 1 and show that it satisfies Proposition 1 in Appendix A.2.4. The more stringent PMC threshold requires an additional $O(1/\rho^2)$ steps relative to MC post-processing, where ρ > 0 is a lower bound on the expected outcome within a category (S, I).
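Algorithm 1 can be sketched compactly in Python. This is a minimal illustration under our own naming and data layout (arrays of outcomes, scores, and group labels), not the authors' implementation; in particular we approximate the category-mass condition with an empirical frequency check.

```python
import numpy as np

def pmc_postprocess(r, y, groups, alpha=0.1, lam=0.1, gamma=0.05, rho=0.01,
                    max_iters=1000):
    """Sketch of Algorithm 1: repeatedly shift the predictions of any
    (group, interval) category whose error exceeds alpha * (mean outcome)."""
    r = r.copy().astype(float)
    n_bins = int(round(1 / lam))
    group_ids = np.unique(groups)
    for _ in range(max_iters):
        updated = False
        bin_idx = np.minimum((r / lam).astype(int), n_bins - 1)
        for g in group_ids:
            for b in range(n_bins):
                m = (groups == g) & (bin_idx == b)
                # empirical stand-in for P(R in I, x in S) >= alpha*lam*gamma
                if m.mean() < alpha * lam * gamma:
                    continue
                ybar, rbar = y[m].mean(), r[m].mean()
                if ybar <= rho:          # skip very low-risk categories
                    continue
                delta = ybar - rbar
                if abs(delta) >= alpha * ybar:
                    r[m] = np.clip(r[m] + delta, 0.0, 1.0)  # "squash" to [0, 1]
                    updated = True
        if not updated:                  # no category violated the threshold
            break
    return r
```

Each pass shifts every violating category's predictions toward its observed outcome rate, so a category with constant predictions converges in a single update.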

4. EXPERIMENTS

In our first set of experiments, we study MC and PMC in simulated population data to understand and validate the analysis in the previous sections. In the second, we compare the performance of several model treatments on a real-world hospital admission task, using an implementation of Algorithm 1. We make use of empirical versions of our fairness definitions, which we refer to as MC loss, PMC loss, and DC loss. In short, these measures calculate the maximum (proportional) calibration error or pairwise calibration differential among subgroups and risk categories in the data sample. Due to space constraints, the formal definitions are given in Appendix A.2.2 (Definitions 14 to 16).

Simulation study We simulate data from α-multicalibrated models. For simplicity, we specify a data structure with a one-to-one correspondence between subset and model-estimated risk, such that for all x ∈ S, R(x) = R(x|x ∈ S) = R(S). Therefore all information for predicting the outcome based on the features in x is contained in the attributes A that define subgroup S. The outcome probability is specified as $p^*_i = P_D(y \mid x \in S_i) = 0.2 + 0.01(i - 1)$ for $i = 1, \dots, N_S$, where $N_S$ is the number of subsets S defined by A, indexed by i in order of increasing $p^*$. For each group, $R_i = R(S_i) = R(x \mid x \in S_i) = p^*_i - \Delta_i$. We randomly select $\Delta_i$ for one group to be ±α and, for the remaining groups, $\Delta_i = \pm\delta$, where δ ∼ Uniform(min = 0, max = α). In all cases, the sign of $\Delta_i$ is determined by a random draw from a Bernoulli distribution. For these simulations we set $N_S = 61$ and α = 0.1, such that $p^*_i \in [0.2, 0.8]$ and $R_i \in [0.1, 0.9]$. We generate $N_{sim} = 1000$ simulated datasets, with n = 1000 observations per group, and for each $S_i$ we calculate the ratio of the absolute mean error to $p^*_i$, i.e., the PMC loss function for this data-generating mechanism.
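The simulation's data-generating mechanism can be sketched as follows. This is our own illustrative reconstruction of the setup described above (function name and seed handling are assumptions), not the authors' experiment code.

```python
import numpy as np

def simulate_mc_model(n_groups=61, alpha=0.1, n_per_group=1000, seed=0):
    """Group i has outcome probability p*_i = 0.2 + 0.01*(i-1) and receives a
    constant prediction R_i = p*_i - Delta_i, with one group at |Delta| = alpha."""
    rng = np.random.default_rng(seed)
    i = np.arange(1, n_groups + 1)
    p_star = 0.2 + 0.01 * (i - 1)
    delta = rng.uniform(0, alpha, size=n_groups)
    delta[rng.integers(n_groups)] = alpha        # one group at the boundary
    sign = rng.choice([-1, 1], size=n_groups)    # Bernoulli draw for the sign
    r_pred = p_star - sign * delta
    y = (rng.uniform(size=(n_groups, n_per_group)) < p_star[:, None]).astype(float)
    return p_star, r_pred, y

p_star, r_pred, y = simulate_mc_model()
# Per-group PMC loss for this mechanism: |absolute mean error| / p*_i
pmc_per_group = np.abs(y.mean(axis=1) - r_pred) / p_star
```

By construction every group satisfies |R_i − p*_i| ≤ α, i.e. the scores are α-multicalibrated, yet the proportional error varies with p*_i.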
We also simulate three specific scenarios in which: 1) $|\Delta_i|$ is equal for all groups (Fixed); 2) $|\Delta_i|$ increases with increasing $p^*_i$; and 3) $|\Delta_i|$ decreases with increasing $p^*_i$, with α = 0.1 in each case. These scenarios correspond to α being determined by all groups, by the group with the lowest outcome probability, and by the group with the highest outcome probability, respectively.

Hospital admission Next, we test PMC alongside other methods in an application to predicting inpatient hospital admission for patients visiting the emergency department (ED). The burden of overcrowding and long wait times in EDs is significantly higher among non-white, non-Hispanic patients and socio-economically marginalized patients (James et al., 2005; McDonald et al., 2020). Recent work has demonstrated risk prediction models that can expedite patient visits by predicting patient admission at an early stage of a visit with a high degree of certainty (AUC ≥ 0.9 across three large care centers) (Barak-Corren et al., 2017b;a; 2021b;a). Our goal is to ensure that no group of patients will be over- or under-prioritized relative to another by these models, which could exacerbate the treatment and outcome disparities that currently exist. We construct a prediction task similar to previous studies but using a new data resource: the MIMIC-IV-ED repository (Johnson et al., 2021). The overall intersectional demographic statistics for these data are given in Table 1, in which we observe stark differences in admission rates by demographic group and gender, suggesting that a proportional measure of calibration could be appropriate for this task. We trained and evaluated logistic regression (LR) and random forest (RF) models of patient admission, with and without post-processing for MC (Pfisterer et al., 2021) or PMC. For each of the parameter settings given in Table 2, we conducted 100 repeat experiments with different shuffles of the data.
Comparisons are reported on a test set of 20% of the data for each trial. Additional experiment details are available in Appendix A.6 and code for the experiments is available here: https://github.com/by1tTZ4IsQkAO80F/pmc.

5. RESULTS

Fig. 1 shows the PMC loss of α-multicalibrated models under the scenarios described in Section 4. α-PMC constrains the ratio of the absolute mean error (AME) to the outcome prevalence for categories defined by a risk interval (R(x) ∈ I) and a subset within a collection of subsets (x ∈ S, S ∈ C). Without the proportionality factor $E_D[y \mid R = r, x \in S]^{-1}$, α-multicalibrated models allow the permitted error (or privacy loss) to depend on group prevalence, which is unfair to groups with lower outcome prevalence. Results on the hospital admission prediction task are summarized in Fig. 2 and Tables 3 and 4. PMC post-processing has a negligible effect on predictive performance (<0.1% ∆AUROC for LR and RF) while reducing DC loss by 27% for LR and RF models and reducing PMC loss by 40% and 79%, respectively. In the case of RF models, PMC post-processing reduces MC loss by 23%, a significantly larger improvement than MC post-processing itself (19%, p=9e-26).

Sensitivity Analysis We note that PMC post-processing has a lower tolerance for error than MC post-processing for a given value of α, since ȳ ≤ 1. Thus a natural question is whether MC can match the performance of PMC simply by specifying a smaller α. Further, if α can be made small enough, the calibration error $|E_D[R \mid R \in I, x \in S] - E_D[y \mid R \in I, x \in S]|$ in every category will be small compared to the outcome prevalence, $E_D[y \mid R \in I, x \in S]$, meaning MC post-processing could satisfy the desired proportional guarantees of PMC. However, achieving this via MC post-processing may require a large number of unnecessary updates for high-risk intervals, since the DC and PMC of multicalibrated models are limited by low-risk groups (Theorem 1). Furthermore, the number of steps in MC post-processing (and PMC post-processing) scales as an inverse high-order polynomial of α (cf. Thm. 2 of Hébert-Johnson et al. (2018)). We shed light on this question in three ways.
First, we quantify in Table 3 how often each post-processing algorithm gives the best loss for each metric and trial, counting the number of trials for which a given method produced the best model according to a given metric over all parameter configurations in Table 2. PMC post-processing (Algorithm 1) achieves the best fairness the highest percentage of the time according to DC loss (63%), MC loss (70%), and PMC loss (72%), while MC-post-processed models achieve the best AUROC in 88% of cases. This provides strong evidence that, over a large range of α values, PMC post-processing is beneficial compared to MC post-processing. Next, we empirically compare MC and PMC post-processing by the number of steps required for each to reach its best performance in Fig. 8 and Table 4, which report the number of steps/updates each method takes and their empirical running times. We observe that PMC typically requires a larger number of updates to achieve its best performance on MC loss (about 2x the wall-clock time and number of updates). In contrast, MC post-processing requires an average of 5x more updates to achieve its best performance on PMC loss, due to its dependence on very small values of α. Finally, in Appendix A.6, we examine detailed performance comparisons of MC and PMC post-processing over values of α and group definitions in Figs. 5 to 7. We observe that, while low values of α for MC post-processing improve its PMC loss performance, PMC post-processing is always able to outperform it for some value of α, and to do so in fewer steps.

6. DISCUSSION AND CONCLUSION

In this paper we have analyzed multicalibration through the lens of sufficiency and differential calibration to reveal the sensitivity of this metric to correlations between outcome rates and group membership. We have proposed a measure, PMC, that alleviates this sensitivity and attempts to capture the "best of both worlds" of MC and DC. PMC provides equivalent percentage calibration protections to groups regardless of their risk profiles, and in so doing bounds a model's differential calibration. We provide an efficient algorithm for learning PMC predictors by post-processing a given risk prediction model. On a real-world and clinically relevant task (admission prediction), we have shown that post-processing LR and RF models with PMC leads to better performance across all three fairness metrics, with little to no impact on predictive performance. Our preliminary analysis suggests PMC can be a valuable metric for training fair algorithms in resource-allocation contexts. Future work could extend this analysis on both the theoretical and practical sides. On the theoretical side, the generalization properties of the PMC measure should be established and its sample complexity quantified, as Rose (2018) has done for related estimators. Relevant extensions of MC include multiaccuracy (Kim et al., 2019), low-degree multicalibration (Gopalan et al., 2022), and extensions to conformal prediction and online learning (Jung et al., 2021; Gupta et al., 2021). Noting that multicalibration is a guarantee over mean predictions on a collection of groups C, Jung et al. (2021) propose to extend multicalibration to higher-order moments (e.g., variances), which allows one to estimate a confidence interval for the calibration error of each category. Gupta et al. (2021) extend this idea and generalize it to the online learning context, in which an adversary chooses a sequence of examples for which one wishes to quantify the uncertainty of different statistics of the predictions.
Recent work has also utilized higher-order moments to "interpolate" between the guarantees provided by multiaccuracy, which only requires accuracy in expectation for groups in C, and multicalibration, which requires accuracy in expectation at each prediction interval (Kim et al., 2019). Like proportional multicalibration (Definition 5), definitions of multicalibration for higher-order moments provide additional criteria for quantifying model performance over many groups; in general, however, much of the focus of that work is on statistics for uncertainty estimation. Like these works, one may view our proposal for proportional multicalibration as an alternative definition of what it means to be multicalibrated. The key difference is that proportional multicalibration measures the degree to which multicalibration depends on differences in outcome prevalence between groups, and in doing so provides guarantees of pufferfish privacy and differential calibration. Dwork et al. (2019) study the relation of fair rankings to multicalibration and, in a similar vein to differential fairness measures, formulate a fairness measure for group rankings using the relations between pairs of groups. However, those definitions are specific to the ranking relation between the groups, whereas differential calibration cares only about the outcome differential (conditioned on model predictions) between pairs of groups.

A.1.1 DIFFERENTIAL FAIRNESS

DF was explicitly defined to be consistent with the social-theoretical framework of intersectionality. This framework dates back at least to the social movements of the '60s and '70s (Collins & Bilge, 2020) and was brought into the academic mainstream by pioneering work from legal scholar Kimberlé Crenshaw (Crenshaw, 1989; 1991) and sociologist Patricia Hill Collins (Collins, 1990). Central to intersectionality is the view that hierarchies of power and oppression are structural elements fundamental to our society.
Through an intersectional lens, these power structures are viewed as interacting and co-constituted, inextricably related to one another. To capture this viewpoint, DF (Foulds et al., 2019a) constrains the differential of a general data mechanism among all pairs of groups, where groups are explicitly defined as the intersections of protected attributes in A.

Definition 7 (ε-differential fairness (Foulds et al., 2019a)). Let Θ denote a set of distributions and let x ∼ θ for θ ∈ Θ. A mechanism M(x) is ε-differentially fair with respect to (C, Θ) if, for all θ ∈ Θ with x ∼ θ, all m ∈ Range(M), and all (S_i, S_j) ∈ C × C where P(S_i|θ) > 0 and P(S_j|θ) > 0,

e^{−ε} ≤ P_{M,θ}(M(x) = m | S_i, θ) / P_{M,θ}(M(x) = m | S_j, θ) ≤ e^{ε}.

Definition 8 (Pufferfish Privacy). Let the collection of subsets C represent sets of secrets. A mechanism M(x) is ε-pufferfish private (Kifer & Machanavajjhala, 2014) with respect to (C, Θ) if, for all θ ∈ Θ with x ∼ θ, and for all secret pairs (S_i, S_j) ∈ C × C and y ∈ Range(M),

e^{−ε} ≤ P_{M,θ}(M(x) = y | S_i, θ) / P_{M,θ}(M(x) = y | S_j, θ) ≤ e^{ε}, (4)

when S_i and S_j are such that P(S_i|θ) > 0, P(S_j|θ) > 0.

Note on pufferfish and differential privacy. Although Eq. (4) is notable in its similarity to differential privacy (Dwork & Lei, 2009), they differ in important ways. Differential privacy aims to limit the amount of information learned about any one individual in a database by computations performed on the data (e.g., M(x)). Pufferfish privacy only limits information learned about the group membership of individuals as defined by C. Kifer & Machanavajjhala (2014) describe in detail the conditions under which these privacy frameworks are equivalent.

Efficiency Property. Foulds et al. (2019a) also define a useful property of ε-differential fairness that allows guarantees for higher-order (i.e., marginal) groups to be met for free; the property is given below.

Definition 9 (Efficiency Property (Foulds et al., 2019a)).
Let M(x) be an ε-differentially fair mechanism with respect to (C, Θ). Let the collection of subsets C group individuals according to the Cartesian product of attributes A ⊆ A. Let G be any collection of subsets that groups individuals by the Cartesian product of attributes in A′, where A′ ⊂ A and A′ ≠ ∅. Then M(x) is ε-differentially fair with respect to (G, Θ).

The authors call this the "intersectionality property", although its implication is the reverse: if a model satisfies ε-DF for the low-level (i.e., intersectional) groups in C, then it satisfies ε-DF for every higher-level (i.e., marginal) group. For example, if a model is ε-differentially fair for intersectional groupings of individuals by race and sex, then it is ε-DF for the higher-level race and sex groupings as well. Whereas the number of intersections grows exponentially as additional attributes are protected (Kearns et al., 2018), the number of total possible subgroupings grows at a larger combinatorial rate: for p protected attributes, we have Σ_{k=1}^{p} (p choose k) m_a^k groups, where m_a is the number of levels of attribute a.

Limitations. To date, analysis of DF for predictive modeling has been limited to defining R(x) as the mechanism, which is akin to asking for demographic parity. Under demographic parity, one requires that model predictions be independent of group membership entirely, and this limits its utility as a fairness notion. Although a model satisfying demographic parity can be desirable when the outcome should be unrelated to C (Foulds & Pan, 2020), it can be unfair if important risk factors for the outcome are associated with demographics (Hardt et al., 2016). For example, if the underlying rates of an illness vary demographically, requiring demographic parity can result in healthier patients from one group being admitted more often than patients who urgently need care.
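To make Definition 7 concrete, the empirical ε of a binned risk score, treated as the mechanism M(x), can be estimated from data. The following sketch is not from the paper; the function name, equal-width binning, and the handling of empty cells are our own assumptions.

```python
import numpy as np

def empirical_epsilon_df(preds, groups, n_bins=10):
    """Smallest epsilon such that, for every prediction bin m and every pair
    of groups (Si, Sj) with support, P(M=m|Si)/P(M=m|Sj) <= e^epsilon,
    treating the binned risk score as the mechanism M(x)."""
    preds = np.asarray(preds, dtype=float)
    groups = np.asarray(groups)
    bins = np.clip((preds * n_bins).astype(int), 0, n_bins - 1)
    eps = 0.0
    for m in range(n_bins):
        # P(M(x) = m | S) for each group S, dropping empty cells
        p = np.array([np.mean(bins[groups == g] == m)
                      for g in np.unique(groups)])
        p = p[p > 0]
        if p.size > 1:
            eps = max(eps, float(np.log(p.max() / p.min())))
    return eps
```

A mechanism whose binned output distribution is identical across groups yields ε = 0, matching the intuition that a decision maker then learns nothing about group membership from the score.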

A.2 EXTENDED THEORETICAL ANALYSIS

Illustrating Relationships between Definitions. Fig. 3 shows how the definitions of MC, DC, and PMC are related. In each subplot, the x and y coordinates map the guarantee from one metric (x axis) to the implied guarantee in the other metric (y axis). The right panel of Fig. 3 illustrates this relation in comparison to the DC-MC relationship described in Appendix A.3, Theorem 4. At small values of ε and α, and when the model is perfectly calibrated overall, α-PMC and ε-DC behave similarly. However, given δ > 0, ε-differentially calibrated models suffer from higher MC error than proportionally multicalibrated models when α-PMC < 0.3. The right panel also illustrates that the feasible range of α for α-PMC is 0 < α < 0.5, past which it does not provide a meaningful α-MC guarantee. The steeper relation between α-PMC and MC may have advantages or disadvantages, depending on context. It suggests that, by optimizing for α-PMC, small improvements to this measure can result in relatively large improvements to MC; conversely, ε-DC models that are well calibrated may satisfy a lower value of α-MC over a larger range of ε.
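The mappings plotted in Fig. 3 follow directly from the bounds stated in this paper: α-PMC implies (α/(1−α))-MC; (ε, λ)-DC plus δ-calibration implies (1 − e^{−ε} + δ)-MC; and α-MC implies ln((r_min + α)/(r_min − α))-DC. A minimal sketch, with our own helper names:

```python
import math

def mc_from_pmc(alpha):
    """Theorem 3: an alpha-PMC model is (alpha/(1 - alpha))-multicalibrated."""
    assert 0 <= alpha < 1
    return alpha / (1 - alpha)

def mc_from_dc(eps, delta):
    """Theorem 4: an (eps, lam)-DC, delta-calibrated model is
    (1 - e^{-eps} + delta)-multicalibrated."""
    return 1 - math.exp(-eps) + delta

def dc_from_mc(alpha, r_min):
    """An alpha-MC model is ln((r_min + alpha)/(r_min - alpha))-differentially
    calibrated; requires alpha < r_min."""
    assert alpha < r_min
    return math.log((r_min + alpha) / (r_min - alpha))
```

Note that `mc_from_pmc(0.5)` returns 1.0, the vacuous MC bound, which is the feasibility boundary α < 0.5 discussed above; `dc_from_mc` blows up as r_min approaches α, which is the sensitivity of MC's privacy guarantee to low-risk groups shown in the left panel.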

A.2.1 DISCRETIZATION

To clarify and simplify our analysis, we work mainly with the continuous versions of multicalibration and proportional multicalibration, under the assumption that minimizing the discretized versions (i.e., binning R(x)) will translate to low values of the continuous versions. In this section we provide detailed bounds on the continuous versions of PMC and DC that are implied by the discretized versions. First, we formally define two different discretization schemes. The first, λ-discretization, defines equally spaced bins on the interval [0, 1], as follows.

Definition 10 (λ-discretization). Let λ ∈ [0, 1]. The λ-discretization of [0, 1] is denoted by a set of intervals, Λ_λ = {I_j}_{j=0}^{1/λ−1}, where I_j = [jλ, (j + 1)λ).

For ensuring multiplicative closeness under PMC, it can be useful to instead discretize the prediction bins so that the bins are equally spaced on a log scale. We define such a discretization below.

Definition 11 ((λ, ρ)-geometric discretization). Let λ ∈ [0, 1], ρ ∈ [0, 1]. The (λ, ρ)-geometric discretization of [0, 1] is denoted by a set of intervals, Λ^ρ_λ = {I_j}_{j=0}^{1/λ−1}, where I_j = [ρ^{1−jλ}, ρ^{1−jλ−λ}).

Hébert-Johnson et al. (2018) define a discretized version of MC in which R(x) is binned according to a discretization parameter, λ:

Definition 12 ((α, λ)-multicalibration). Let C ⊆ 2^X be a collection of subsets of X. For any α, λ > 0, a predictor R is (α, λ)-multicalibrated on C if, for all I ∈ Λ_λ and S ∈ C where P_D(R ∈ I | x ∈ S) ≥ αλ,

|E_D[y | R ∈ I, x ∈ S] − E_D[R | R ∈ I, x ∈ S]| ≤ α.

Hébert-Johnson et al. (2018) establish that (α, λ)-multicalibrated models are at most (α + λ)-multicalibrated. In an analogous fashion, we show below that (α, λ)-PMC implies (α + λ/ρ)-PMC for bins defined by a λ-discretization. When using a (λ, ρ)-geometric discretization, (α, λ)-PMC implies (αρ^{−λ} + ρ^{−λ} − 1)-PMC, which can be a tighter bound than the former.

Claim 1. Define ρ, α, λ > 0 and let C ⊆ 2^X be a collection of subsets of X.
Let E_D[y | R ∈ I, x ∈ S] ≥ ρ for all S ∈ C and I ∈ Λ_λ. Let R(x) be a model satisfying (α, λ)-proportional multicalibration. Then R(x) is at most (α + λ/ρ)-proportionally multicalibrated.

Proof. By Definition 6, R(x) satisfies

|E_D[y | R ∈ I, x ∈ S] − E_D[R | R ∈ I, x ∈ S]| / E_D[y | R ∈ I, x ∈ S] ≤ α

for categories (S, I) ∈ C × Λ_λ satisfying P_D(R(x) ∈ I | x ∈ S) ≥ αλ. Given 1/λ bins,

The term λ/ρ can be potentially large when ρ < λ. One way to avoid this issue is to make the change in R(x) between bins scale with R(x) using Definition 11. What makes Definition 11 different from λ-discretization is that the intervals are a multiplicative, rather than additive, distance apart. Hence, for a given r ∈ [0, 1], a model satisfying (α, λ)-PMC can have its prediction shift by at most a factor of ρ^{−λ}. This leads us to the following claim.

Claim 2. Define ρ, α, λ > 0 and let C ⊆ 2^X be a collection of subsets of X. Let E_D[y | R ∈ I, x ∈ S] ≥ ρ for all S ∈ C and I ∈ Λ_λ. Let R(x) be a model satisfying (α, λ)-proportional multicalibration. Given a (λ, ρ)-geometric discretization, R(x) is at most (αρ^{−λ} + ρ^{−λ} − 1)-proportionally multicalibrated.

Proof. By Definition 6, R(x) satisfies

|E_D[y | R ∈ I, x ∈ S] − E_D[R | R ∈ I, x ∈ S]| / E_D[y | R ∈ I, x ∈ S] ≤ α

δ ≤ (rρ^{−λ} − p*)/p*.

Substituting r = αp* + p* yields δ ≤ αρ^{−λ} + ρ^{−λ} − 1.

We illustrate the relationship between (α, λ)-PMC and α-PMC given a geometric discretization in Fig. 4, which quantifies the relationship for different settings of λ and ρ.
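The two discretization schemes in Definitions 10 and 11 are straightforward to construct; here is a sketch with hypothetical helper names:

```python
import numpy as np

def lambda_bins(lam):
    """Definition 10: equal-width intervals I_j = [j*lam, (j+1)*lam)."""
    edges = np.arange(0.0, 1.0 + lam, lam)
    return list(zip(edges[:-1], edges[1:]))

def geometric_bins(lam, rho):
    """Definition 11: I_j = [rho^(1 - j*lam), rho^(1 - j*lam - lam)); the
    edges of consecutive bins differ by a constant factor rho^(-lam)."""
    j = np.arange(int(round(1.0 / lam)))
    lo = rho ** (1 - j * lam)
    hi = rho ** (1 - j * lam - lam)
    return list(zip(lo, hi))
```

The geometric bins start at ρ (j = 0) and end at 1 (j = 1/λ − 1), and every bin's upper edge is exactly ρ^{−λ} times its lower edge, which is the "multiplicative distance" property used in Claim 2.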

A.2.2 ADDITIONAL DEFINITIONS

Definition 13 ((ε, λ)-Differential Calibration). Let C ⊆ 2^X be a collection of subsets of X. A model R(x) is (ε, λ)-differentially calibrated with respect to C if, for all pairs (S_i, S_j) ∈ C × C for which P_D(S_i), P_D(S_j) > 0, and for any I ∈ Λ_λ,

e^{−ε} ≤ E_D[y | R ∈ I, x ∈ S_i] / E_D[y | R ∈ I, x ∈ S_j] ≤ e^{ε}.

α ≤ p* − r ≤ r/(1 − δ) − r = rδ/(1 − δ) ≤ δ/(1 − δ).

We can also solve for the lower bound on α-MC from Definition 2 for the case when p* < r. This yields

α ≤ r − p* ≤ r − r/(1 + δ) = rδ/(1 + δ) ≤ δ/(1 + δ).

For any δ > 0, δ/(1 − δ) > δ/(1 + δ). Therefore the first case (p* > r) limits the multicalibration of R(x).

Proposition 1. Define α, λ, γ, ρ > 0. Let C ⊆ 2^X be a collection of subsets of X such that, for all S ∈ C, P_D(S) > γ. Let R(x) be a risk prediction model to be post-processed. For all (S, I) ∈ C × Λ_λ, let E[y | R ∈ I, x ∈ S] > ρ. There exists an algorithm that satisfies (α, λ)-PMC with respect to C in O(|C| / (α³λ²ρ²γ)) steps.

Proof. We show that Algorithm 1 converges using a potential function argument (Bansal & Gupta, 2019), similar to the proof techniques for the MC boosting algorithms in Hébert-Johnson et al. (2018); Kim et al. (2019). Let p*_i be the underlying risk, R_i our initial model, and R′_i our updated prediction model for individual i ∈ S_r, where S_r = {x | x ∈ S, R(x) ∈ I} and (S, I) ∈ C × Λ_λ. We use p*, R, and R′ without subscripts to denote these values over S_r. We cannot easily construct a potential argument using progress towards (α, λ)-PMC, since its derivative is undefined at E_D[y | R ∈ I, x ∈ S] = 0. Instead, we analyze progress in terms of the squared 2-norm at each step:

‖p* − R‖² − ‖p* − R′‖² = Σ_{i∈S_r} (p*_i − R_i)² − Σ_{i∈S_r} (p*_i − squash(R_i + Δr))²
≥ Σ_{i∈S_r} [(p*_i − R_i)² − (p*_i − (R_i + Δr))²]
= Σ_{i∈S_r} [2p*_i Δr − 2R_i Δr − Δr²]
= 2Δr Σ_{i∈S_r} (p*_i − R_i) − |S_r| Δr².

From Algorithm 1 we have

Δr = (1/|S_r|) Σ_{i∈S_r} (p*_i − R_i).

Substituting into Eq.
(7) gives

‖p* − R‖² − ‖p* − R′‖² ≥ |S_r| Δr².

We know that |S_r| ≥ αλγN, and that the smallest update Δr is αρ. Thus,

‖p* − R‖² − ‖p* − R′‖² ≥ α³ρ²λγN.

Since our initial loss, ‖p* − R‖², is at most N, Algorithm 1 converges in at most O(1/(α³ρ²λγ)) updates for category S_r. To understand the total number of steps, including those without updates, we consider the worst case, in which only a single category S_r is updated in each cycle of the for loop (if no updates are made, the algorithm exits). Since each repeat consists of at most |C|/λ loop iterations, this results in O(|C|/(α³λ²ρ²γ)) total steps.
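Algorithm 1 itself is not reproduced in this excerpt, so the following is only a simplified sketch of the update it analyzes: Δr is the mean residual over a category, and updated predictions are squashed back into [0, 1]. For brevity the category masks are fixed and the labels y stand in for the true risk p*; both are simplifications relative to the actual algorithm.

```python
import numpy as np

def pmc_postprocess(R, y, categories, alpha=0.05, max_iters=1000):
    """Simplified PMC post-processing sketch.

    categories: list of boolean masks, one per (group S, prediction
    interval I) category. In the real algorithm the masks are recomputed
    from R after each update; here they are fixed."""
    R = np.asarray(R, dtype=float).copy()
    y = np.asarray(y, dtype=float)
    for _ in range(max_iters):
        updated = False
        for mask in categories:
            if not mask.any():
                continue
            ybar = y[mask].mean()
            dr = ybar - R[mask].mean()                 # mean residual Delta r
            if ybar > 0 and abs(dr) / ybar > alpha:    # PMC violation
                R[mask] = np.clip(R[mask] + dr, 0.0, 1.0)  # squash to [0, 1]
                updated = True
        if not updated:   # every category within tolerance alpha: done
            break
    return R
```

Each pass shifts every violating category's mean prediction onto its observed outcome rate, mirroring the potential-function argument above: each update strictly decreases the squared distance to the targets.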

A.3 DIFFERENTIALLY CALIBRATED MODELS ARE MULTICALIBRATED

Here we show that, under the assumption that a model is globally calibrated (satisfies δ-calibration), models satisfying ε-DC are also multicalibrated.

Theorem 4. Let R(x) be a model satisfying (ε, λ)-DC and δ-calibration. Then R(x) is (1 − e^{−ε} + δ, λ)-multicalibrated.

Proof. From Eq. (2) we observe that ε is bounded by the two groups with the largest and smallest group- and prediction-specific probabilities of the outcome. Let I_M be the risk stratum maximizing

We will look at the first case. Let p*_r = P_D(y | R ∈ I_M). Due to δ-calibration, p*_r − δ ≤ r ≤ p*_r + δ. Then

α ≤ p_n − r ≤ p_n − (p*_r − δ) ≤ p_n − p_d + δ ≤ p_n(1 − e^{−ε}) + δ ≤ 1 − e^{−ε} + δ.

Above we have used the facts that r ≥ p*_r − δ, p*_r ≥ p_d, p_d ≥ e^{−ε} p_n, and p_n ≤ 1. The second scenario is complementary and produces the identical bound.

Theorem 4 formally describes how δ-calibration controls the baseline calibration error contribution to α-MC, while ε-DC limits the deviation around this value by constraining the (log) maximum and minimum risk within each category.

A.4 MULTICALIBRATED MODELS SATISFY INTERSECTIONAL GUARANTEES

In contrast to DF, MC (Hebert-Johnson et al., 2018) was not designed to explicitly incorporate the principles of intersectionality. However, we show that it provides an efficiency property identical to that of DF in the theorem below. Given an individual's attributes x = (x_1, . . . , x_d), it will be useful to refer to subsets we wish to protect, e.g., demographic identifiers. To do so, we define A = {A_1, . . . , A_p}, p ≤ d, such that A_1 is the set of values taken by attribute x_1.

Theorem 5. Let the collection of subsets C ⊆ 2^X define groups of individuals according to the Cartesian product of attributes A ⊆ A. Let G ⊆ 2^X be any collection of subsets that groups individuals by the Cartesian product of attributes in A′, where A′ ⊂ A and A′ ≠ ∅. If R(x) satisfies α-MC on C, then R(x) is α-multicalibrated on G.

In proving Theorem 5, we will make use of the following lemma.

Lemma 6. The α-MC criterion can be rewritten as: for a collection of subsets C ⊆ 2^X, α ∈ [0, 1], and r ∈ [0, 1],

max_{c∈C} E_D[y | R(x) = r, x ∈ c] ≤ r + α and min_{c∈C} E_D[y | R(x) = r, x ∈ c] ≥ r − α.

Proof. The lemma follows from Definition 2, and simply restates it as a constraint on the maximum and minimum expected risk among groups at each prediction level.

Proof of Theorem 5. We use the same argument as Foulds et al. (2019a) in proving this property for DF. Define Q as the Cartesian product of the protected attributes included in A, but not A′. Then for any (y, x) ∼ D,

max_{g∈G} E_D[y | R(x) = r, x ∈ g] = max_{g∈G} Σ_{q∈Q} E_D[y | R(x) = r, x ∈ g ∩ q] P[x ∈ q | x ∈ g]   (8)
≤ max_{g∈G} Σ_{q∈Q} max_{q′∈Q} E_D[y | R(x) = r, x ∈ g ∩ q′] P[x ∈ q | x ∈ g]   (9)
= max_{g∈G} max_{q′∈Q} E_D[y | R(x) = r, x ∈ g ∩ q′]   (10)
= max_{c∈C} E_D[y | R(x) = r, x ∈ c].   (11)

Moving from (8) to (9) follows from substituting the maximum value of E_D[y | R(x) = r, x] for observations in the intersection of subsets in G and Q, which is the upper limit of the expression in (8).
Moving from (9) to (10) follows from recognizing that the sum of P[x ∈ q | x ∈ g] over all subsets in Q is 1. Finally, the last equality follows from recognizing that the intersections of subsets in G and Q that attain the maximum in (10) must define a subset of C. Applying the same argument, we can show that

min_{g∈G} E_D[y | R(x) = r, x ∈ g] ≥ min_{c∈C} E_D[y | R(x) = r, x ∈ c].

Substituting into Lemma 6,

max_{g∈G} E_D[y | R(x) = r, x ∈ g] ≤ r + α and min_{g∈G} E_D[y | R(x) = r, x ∈ g] ≥ r − α,

or |E_D[y | R(x) = r, x ∈ g] − r| ≤ α for all g ∈ G. Therefore R(x) is α-multicalibrated with respect to G.

As a concrete example, imagine we have the protected attributes A = {race ∈ {B, W}, gender ∈ {M, F}}. According to Theorem 5, C would contain four sets: {(B, M), (B, F), (W, M), (W, F)}. In contrast, there are eight possible sets in G: {(B, M), (B, F), (W, M), (W, F), (B, *), (W, *), (*, M), (*, F)}, where the wildcard indicates a match to either attribute. As noted in Appendix A.1.1, the efficiency property is useful because the number of possible sets in G grows at a large combinatorial rate as additional attributes are added; meanwhile C grows at a slower, yet still exponential, rate. For an intuition for why this property holds, consider that the maximum calibration error over two subgroups is at least as large as the maximum calibration error of those groups combined; e.g., the maximum calibration error in a higher-order group such as (B, *) will be covered by the maximum calibration error in either (B, M) or (B, F).
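The counts in the example above (four intersectional sets in C, eight total subgroupings in G) can be checked by enumeration; here is a sketch using a '*' wildcard for marginalized attributes, with our own helper name:

```python
from itertools import combinations, product

def all_subgroups(attrs):
    """Enumerate every subgroup defined by the Cartesian product of any
    nonempty subset of protected attributes; '*' marks a wildcarded
    (marginalized) attribute."""
    names = list(attrs)
    groups = set()
    for k in range(1, len(names) + 1):
        for chosen in combinations(names, k):
            for levels in product(*(attrs[a] for a in chosen)):
                assignment = dict(zip(chosen, levels))
                groups.add(tuple(assignment.get(a, '*') for a in names))
    return groups

A = {'race': ['B', 'W'], 'gender': ['M', 'F']}
G = all_subgroups(A)                    # marginal + intersectional sets
C = {g for g in G if '*' not in g}      # fully intersectional sets only
```

For p attributes with m_a levels each, |G| matches the combinatorial count Σ_{k=1}^{p} (p choose k) m_a^k quoted in Appendix A.1.1, while |C| is the exponential count of full intersections.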

A.5 ADDITIONAL EXPERIMENT DETAILS

Models were trained on a heterogeneous computing cluster. Each training instance was limited to a single core and 4 GB of RAM. We conducted a full parameter sweep of the parameters specified in Table 2. A single trial consisted of a method, a parameter setting from Table 2, and a random seed. Over 100 random seeds, the data was shuffled and split 75%/25% into train/test sets. Results in the manuscript are summarized over these test sets.

Data. We make use of data from the MIMIC-IV-ED repository, version 1.0, to train admission risk prediction models (Johnson et al., 2021). This resource contains more than 440,000 ED admissions from Beth Israel Deaconess Medical Center between 2011 and 2019. We preprocessed these data to construct an admission prediction task in which our model delivers a risk-of-admission estimate for each ED visitor after their first visit to triage, during which vitals are taken. Additional historical data for the patient were also included (e.g., number of previous visits and admissions). A list of features is given in Table 5.

A.6 ADDITIONAL RESULTS

Table 2 lists a few parameters that may affect the performance of post-processing for both MC and PMC. Of particular interest when comparing MC versus PMC post-processing are the parameter α, which controls how stringent the calibration error must be across categories for the algorithm to terminate, and the group definition (A), which selects which features of the data will be used to assess and optimize fairness. We look at the performance of MC and PMC postprocessing over values of α and group definitions in Figs. 5 to 7. Finally, we empirically compare MC- and PMC-postprocessing by the number of steps required for each to reach its best performance in Fig. 8 and Table 4. From Fig. 5, it is clear that post-processing has a minimal effect on AUROC in all cases; note that the differences disappear if we round to two decimal places. When post-processing with RF, we do note a relationship between lower values of α and a very slight decrease in performance, particularly for MC-postprocessing. Figs. 6 and 7 show performance between methods on MC loss and PMC loss, respectively. In terms of MC loss, PMC-postprocessing tends to produce models with the lowest loss at α values greater than 0.01. Lower values of α do not help MC-postprocessing in most cases, suggesting that these smaller updates may be overfitting to the post-processing data. In terms of PMC loss (Fig. 7), we observe that performance of MC-postprocessing is highly sensitive to the value of α. For smaller values of α, MC-postprocessing is able to achieve decent performance on these metrics, although in all cases, PMC-postprocessing generates a model with a better median loss value at some configuration of α. We assess how many steps/updates MC and PMC take for different values of α in Fig. 8, and summarize empirical measures of running time in Table 4. On the figure, we annotate the point at which each post-processing algorithm achieves the lowest median value of PMC loss across trials. Fig.
8 validates that PMC-postprocessing is more efficient than MC-postprocessing at producing models with low PMC loss, on average requiring 4.0x fewer updates to achieve its lowest loss on test. From Table 4 we observe that PMC typically requires a larger number of updates to achieve its best performance on MC loss (about 2x the wall-clock time and number of updates), whereas MC-postprocessing requires a larger number of updates to achieve its best performance on PMC loss and DC loss, due to its dependence on very small values of α. We accompany these results with the caveat that they are based on performance on one real-world task, and wall-clock time measurements are influenced by the heterogeneous cluster environment; future work could focus on a larger empirical comparison.

Rows are different ML base models, and columns are different attributes used to define C. The color denotes the post-processing method. Each result is annotated with the median PMC loss for that method and parameter combination.
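The MC, PMC, and DC losses compared in this section have direct empirical analogs; the following sketch (our own function names, a simple λ-binning, and a `min_frac` support filter standing in for the αλN threshold) shows one way to compute them from held-out predictions:

```python
import numpy as np

def category_masks(R, groups, lam=0.1, min_frac=0.01):
    """Boolean masks for each (group, prediction interval) category with
    enough support; lam is the bin width of the lambda-discretization."""
    n_bins = int(round(1.0 / lam))
    bins = np.clip((np.asarray(R) / lam).astype(int), 0, n_bins - 1)
    masks = []
    for g in np.unique(groups):
        for b in np.unique(bins):
            m = (groups == g) & (bins == b)
            if m.sum() >= min_frac * len(R):
                masks.append(m)
    return masks

def mc_loss(R, y, masks):
    # worst absolute gap between mean label and mean prediction per category
    return max(abs(y[m].mean() - R[m].mean()) for m in masks)

def pmc_loss(R, y, masks):
    # the same gap, normalized by each category's outcome rate
    return max(abs(y[m].mean() - R[m].mean()) / y[m].mean()
               for m in masks if y[m].mean() > 0)

def dc_loss(R, y, groups, lam=0.1):
    # largest log-ratio of outcome rates between two groups in the same bin
    n_bins = int(round(1.0 / lam))
    bins = np.clip((np.asarray(R) / lam).astype(int), 0, n_bins - 1)
    worst = 0.0
    for b in np.unique(bins):
        rates = [y[(groups == g) & (bins == b)].mean()
                 for g in np.unique(groups)
                 if ((groups == g) & (bins == b)).any()]
        rates = [r for r in rates if r > 0]
        if len(rates) > 1:
            worst = max(worst, float(np.log(max(rates) / min(rates))))
    return worst
```

This mirrors the structure of Definitions 14-16: MC loss is the worst additive gap per category, PMC loss divides that gap by the category's base rate, and DC loss is the worst log outcome-rate ratio between groups within a prediction bin.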



Let R(x) be a model satisfying α-MC on a collection of subsets C ⊆ 2^X. Let r_min = min_{S∈C} E_D[R | R = r, x ∈ S] be the minimum expected risk prediction among S ∈ C and r ∈ [0, 1]. Then R(x) is ln((r_min + α)/(r_min − α))-differentially calibrated.

Proof. Let r = E_D[R | R = r, x ∈ S] and p* = E_D[y | R = r, x ∈ S]. α-MC guarantees that r − α ≤ p* ≤ r + α for all groups S ∈ C and predictions r ∈ [0, 1]. Plugging these lower and upper bounds into Eq.

Figure 2: A comparison of LR and RF models, with and without MC and PMC post-processing, on the hospital admission task. From left to right, trained models are compared in terms of test set AUROC, MC loss, PMC loss, and DC loss. Points represent the median performance over 100 shuffled train/test splits with bootstrapped 99% confidence intervals. We test for significant differences between post-processing methods using two-sided Wilcoxon rank-sum tests with Bonferroni correction. ns: p <= 1; **: 1e-03 < p <= 1e-02; ***: 1e-04 < p <= 1e-03; ****: p <= 1e-04.

Additional extensions of PMC could establish a bound on the accuracy of PMC-postprocessed models, in a similar vein to work by Kim et al. (2019) and Hébert-Johnson et al. (2018). On the empirical side, future work should benchmark PMC on a larger set of real-world problems and explore use cases in more depth.

Figure 3: A comparison of ε-DC, α-MC, and α-PMC in terms of their parameters α and ε. In both panels, the x value is a given value of one metric for a model, and the y axis is the implied value of the other metric, according to Theorems 3 and 4. The left filled area denotes the dependence of the privacy/DC of α-multicalibrated models on the minimum risk interval, r_min ∈ [0.01, 1.0]. The right filled area denotes the dependence of the MC of ε-differentially calibrated models on their overall calibration, δ ∈ [0.0, 0.5]. α-PMC does not have these sensitivities.

the subset where P_D(R(x) ∈ I | x ∈ S) < αλ has a size of at most α|S|. Therefore there is a subset |S′| ≥ (1 − α)|S| on which, for all I ∈ Λ_λ, α-PMC (Definition 5) is satisfied. Let δ be the constraint on δ-PMC. Let p* = E_D[y | R = r, x ∈ S] and r = E_D[R | R = r, x ∈ S]. Consider the case r > p* and let α = (r − p*)/p*. λ-discretization shifts r by at most λ, so

δ ≤ (r + λ − p*)/p*.

Substituting r ≤ αp* + p* yields δ ≤ α + λ/p*. Plugging in ρ as the minimum of p*, we complete the proof.

Figure 4: Relationship between (α, λ)-PMC and α-PMC given a geometric discretization, illustrated for (α, λ)-PMC = 0.1 and various values of ρ and λ.

≤ α for categories (S, I) ∈ C × Λ_λ satisfying P_D(R(x) ∈ I | x ∈ S) ≥ αλ. Given 1/λ bins, the subset where P_D(R(x) ∈ I | x ∈ S) < αλ has a size of at most α|S|. Therefore there is a subset |S′| ≥ (1 − α)|S| on which, for all I ∈ Λ_λ, α-PMC (Definition 5) is satisfied. Let δ be the constraint on δ-PMC. Let p* = E_D[y | R = r, x ∈ S] and r = E_D[R | R = r, x ∈ S]. Consider the case r > p* and let α = (r − p*)/p*, i.e., the tight bound. (λ, ρ)-geometric discretization shifts r by at most a factor of ρ^{−λ}. This implies

A.2.3 LOSS FUNCTIONS

The following loss functions are empirical analogs of the definitions of MC, PMC, and DC, and are used in the experiment section to measure performance.

Definition 14 (MC loss). Let D = {(y, x)_i}_{i=1}^{N} ∼ D, and let α, λ, γ > 0. Define a collection of subsets C ⊆ 2^X such that for all S ∈ C, |S| ≥ γN. Let S_I = {x : R(x) ∈ I, x ∈ S} for (S, I) ∈ C × Λ_λ. Define the collection S containing all S_I satisfying |S_I| ≥ αλN. The MC loss of a model R(x) on D is

max_{S_I ∈ S} (1/|S_I|) |Σ_{i∈S_I} (y_i − R_i)|.

Definition 15 (PMC loss). Let D = {(y, x)_i}_{i=1}^{N} ∼ D, and let α, λ, γ, ρ > 0. Define a collection of subsets C ⊆ 2^X such that for all S ∈ C, |S| ≥ γN. Let S_I = {x : R(x) ∈ I, x ∈ S} for (S, I) ∈ C × Λ_λ. Define the collection S containing all S_I satisfying |S_I| ≥ αλN, with (1/|S_I|) Σ_{i∈S_I} y_i ≥ ρ. The PMC loss of a model R(x) on D is

max_{S_I ∈ S} |Σ_{i∈S_I} y_i − Σ_{i∈S_I} R_i| / Σ_{i∈S_I} y_i.

Definition 16 (DC loss). Let D = {(y, x)_i}_{i=1}^{N} ∼ D, and let α, λ, γ > 0. Define a collection of subsets C ⊆ 2^X such that for all S ∈ C, |S| ≥ γN. Given a risk model R(x) and prediction intervals I, let S_I = {x : R(x) ∈ I, x ∈ S} for (S, I) ∈ C × Λ_λ. Define the collection S containing all S_I satisfying |S_I| ≥ αλN. The DC loss of a model R(x) on D is

max_{I ∈ Λ_λ} max_{S_I, S′_I ∈ S} ln( ȳ_{S_I} / ȳ_{S′_I} ), where ȳ_{S_I} = (1/|S_I|) Σ_{i∈S_I} y_i.

Theorem 3. Let R(x) be a model satisfying α-PMC on a collection C. Then R(x) is (α/(1 − α))-multicalibrated on C.

Proof. To distinguish the parameters, let R(x) be a model satisfying δ-PMC. Let r = E_D[R | R = r, x ∈ S] and p* = E_D[y | R = r, x ∈ S]. Then r/(1 + δ) ≤ p* ≤ r/(1 − δ). We solve for the upper bound on α-MC from Definition 2 for the case when p* > r. This yields

(ε, λ)-DC, and let p_n = max_{S∈C} P_D(y | R ∈ I_M, x ∈ S) and p_d = min_{S∈C} P_D(y | R ∈ I_M, x ∈ S). These groups determine the upper and lower bounds of ε as e^{−ε} ≤ p_d/p_n and p_n/p_d ≤ e^{ε}. We note that p_d ≤ P_D(y | R ∈ I_M) ≤ p_n, since P(y | R ∈ I_M) = (1/N) Σ_{S∈C} |S| P_D(y | R ∈ I_M, x ∈ S), and p_n and p_d are the extreme values of P(y | R ∈ I_M, x ∈ S) among S. So, α-MC is bounded by the group outcome rate that most deviates from the predicted value, which is either p_n or p_d. Let r = E_D[R | R ∈ I_M]. There are then two scenarios to consider:

1. α ≤ |p_n − r| = p_n − r when r ≤ (p_n + p_d)/2; and
2. α ≤ |p_d − r| = r − p_d when r ≥ (p_n + p_d)/2.

Figure 5: AUROC test performance versus α across experiment settings. Rows are different ML base models, and columns are different attributes used to define C. The color denotes the postprocessing method.

Admission prevalence (Admissions/Total (%)) among patients in the MIMIC-IV-ED data repository, stratified by the intersection of ethnoracial group and gender.

Parameters for the hospital admission prediction experiment.

The number of times each postprocessing method achieved the best score among all methods, out of 100 trials.



Features used in the hospital admission task.

Code availability

Code for the experiments is available here: https://github.com/by1tTZ4IsQkAO80F/pmc. Code is licensed under GNU Public License v3.0.

A APPENDIX

In this section, we include additional comparisons to related work, additional definitions, proofs to the theorems in the main text, and additional experimental details. The code to reproduce the figures and experiments is available here: https://github.com/by1tTZ4IsQkAO80F/pmc.

A.1 RELATED WORK

Definitions of Fairness. There are myriad ways to measure fairness, covered in more detail in other works (Barocas et al., 2019; Chouldechova & Roth, 2018; Castelnovo et al., 2021). We briefly review three notions here. The first, demographic parity, requires the model's predictions to be independent of patient demographics (A). Although a model satisfying demographic parity can be desirable when the outcome should be unrelated to sensitive attributes (Foulds & Pan, 2020), it can be unfair if important risk factors for the outcome are associated with those attributes (Hardt et al., 2016). For example, it may be more fair to admit socially marginalized patients to a hospital at a higher rate if they are assessed as less able to manage their care at home. Furthermore, if the underlying rates of illness vary demographically, requiring demographic parity can result in healthier patients from one group being admitted more often than patients who urgently need care.

When the base rates of admission are expected to differ demographically, we can instead ask that the model's errors be balanced across groups. One such notion is equalized odds, which states that, for a given Y, the model's predictions should be independent of A. Satisfying equalized odds is equivalent to having equal FPR and FNR for every group in A.

When the model is used for patient risk stratification, as in the target use case in this paper, it is important to consider a model's calibration for each demographic group in the data. Because risk prediction models influence who is prioritized for care, an unfairly calibrated model can systematically under-predict risk for certain demographic groups and result in under-allocation of patient care to those groups. Thus, guaranteeing group-wise calibration via an approach such as multicalibration also guarantees fair patient prioritization for health care provision.
In some contexts, risk predictions are not directly interpreted but only used to rank patients, which in some settings is sufficient for resource allocation. Authors have proposed various ways of measuring the fairness of model rankings, for example by comparing AUROC between groups (Kallus et al., 2020).

Approaches to Fairness. Many approaches to achieving fairness guarantees according to demographic parity, equalized odds, and their relaxations have been proposed (Dwork et al., 2012; Hardt et al., 2016; Berk et al., 2017; Jiang & Nachum, 2019; Kearns et al., 2018). When choosing an approach, it is important to carefully weigh the relative impact of false positives, false negatives, and miscalibration on patient outcomes, which differ by use case. When group base rates differ (i.e., group-specific positivity rates), equalized odds and calibration by group cannot both be satisfied (Kleinberg et al., 2016). Instead, one can often satisfy multicalibration while satisfying relaxations of equalized odds such as equalized accuracy, where Accuracy = µ·TPR + (1 − µ)·(1 − FPR) for a group with base rate µ. However, doing so requires degrading the performance of the model on specific groups (Chouldechova, 2017; Pleiss et al., 2017), which is unethical in our context.

As mentioned in the introduction, we are also motivated to utilize approaches to fairness that 1) dovetail well with intersectionality theory, and 2) provide privacy guarantees. Most work in the computer science/machine learning space does not engage with the broader literature on sociocultural concepts like intersectionality, which we see as a gap that makes adoption in real-world settings difficult (Hanna et al., 2020). One exception is differential fairness (Foulds et al., 2019a), a measure designed with intersectionality in mind.
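The identity Accuracy = µ·TPR + (1 − µ)·(1 − FPR) used above can be checked numerically; here is a small sketch with a hypothetical helper:

```python
import numpy as np

def accuracy_decomposition(y, yhat):
    """Return (accuracy, mu*TPR + (1 - mu)*(1 - FPR)), where mu is the
    base rate P(y = 1); the two values should coincide."""
    y, yhat = np.asarray(y), np.asarray(yhat)
    mu = y.mean()
    tpr = (yhat[y == 1] == 1).mean() if (y == 1).any() else 0.0
    fpr = (yhat[y == 0] == 1).mean() if (y == 0).any() else 0.0
    acc = (y == yhat).mean()
    return acc, mu * tpr + (1 - mu) * (1 - fpr)
```

Because the decomposition weights TPR and FPR by the group base rate µ, equalizing accuracy across groups with different µ forces different trade-offs between error types per group, which is the tension described above.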
In addition to being a definition of fairness that provides equal protection to groups defined by intersections of protected attributes, models satisfying ε-differential fairness also satisfy ε-pufferfish privacy. This privacy guarantee is very desirable in risk prediction contexts because it limits the extent to which the model reveals sensitive information that could influence a decision maker's interpretation of the model's recommendation. However, prior work on differential fairness has been limited to using it to control for demographic parity, which is not an appropriate fairness measure for our use case (Foulds & Pan, 2020).

