

Abstract

Multicalibration is a desirable fairness criterion that constrains calibration error among flexibly-defined groups in the data while maintaining overall calibration. However, when outcome probabilities are correlated with group membership, multicalibrated models can exhibit a higher percent calibration error among groups with lower base rates than among groups with higher base rates. As a result, it remains possible for a decision-maker to learn to trust or distrust model predictions for specific groups. To alleviate this, we propose proportional multicalibration, a criterion that constrains the percent calibration error among groups and within prediction bins. We prove that satisfying proportional multicalibration bounds a model's multicalibration as well as its differential calibration, a stronger fairness criterion inspired by the fairness notion of sufficiency. We provide an efficient algorithm for post-processing risk prediction models for proportional multicalibration and evaluate it empirically. We conduct simulation studies and investigate a real-world application of PMC post-processing to the prediction of emergency department patient admissions. We observe that proportional multicalibration is a promising criterion for controlling simultaneous measures of a model's calibration fairness over intersectional groups with virtually no cost in terms of classification performance.

1. INTRODUCTION

Today, machine learning (ML) models have an impact on outcome disparities across sectors (health, lending, criminal justice) due to their widespread use in decision-making. When applied in clinical decision-making, ML models help care providers decide whom to prioritize to receive finite and time-sensitive resources among a population of potentially very ill patients. These resources include hospital beds (Barak-Corren et al., 2021a; Dinh & Berendsen Russell, 2021), organ transplants (Schnellinger et al., 2021), specialty treatment programs (Henry et al., 2015; Obermeyer et al., 2019), and, recently, ventilators and other breathing support tools to manage the COVID-19 pandemic (Riviello et al., 2022). In scenarios like these, decision makers typically rely on risk prediction models to be calibrated. Calibration measures the extent to which a model's risk scores, R, match the observed probability of the event, y. Perfect calibration implies that P(y | R = r) = r for all values of r. Calibration allows the risk scores to be used to rank patients in order of priority and informs care providers about the urgency of treatment. However, models that are not equally calibrated among subgroups defined by different sensitive attributes (race, ethnicity, gender, income, etc.) may lead to systematic denial of resources to marginalized groups (e.g. (Obermeyer et al., 2019; Ashana et al., 2021; Roberts, 2011; Zelnick et al., 2021; Ku et al., 2021)). Just this scenario was observed by Obermeyer et al. (2019), who analyzed a large health system algorithm used to enroll high-risk patients into care management programs and showed that, at a given risk score, Black patients exhibited significantly poorer health than white patients. To address equity in calibration, Hebert-Johnson et al. (2018) proposed a fairness measure called multicalibration (MC), which asks that calibration be satisfied simultaneously over many flexibly-defined subgroups.
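To make the calibration definition above concrete, the condition P(y | R = r) = r is typically checked empirically over discretized risk bins. The following is a minimal sketch of a binned calibration-error estimate; the function name and the 0.1 bin width are our own illustrative choices, not definitions from this paper:

```python
import numpy as np

def binned_calibration_error(y, r, lam=0.1):
    """Max over bins of |mean outcome - mean risk score| within each bin
    of width lam (an illustrative, not canonical, estimator)."""
    bins = np.floor(r / lam).astype(int)
    errs = []
    for b in np.unique(bins):
        mask = bins == b
        errs.append(abs(y[mask].mean() - r[mask].mean()))
    return max(errs)

# A well-calibrated score: y is Bernoulli with success probability r,
# so observed event rates should track predicted risk in every bin.
rng = np.random.default_rng(0)
r = rng.uniform(0, 1, 100_000)
y = (rng.uniform(0, 1, 100_000) < r).astype(float)
print(binned_calibration_error(y, r))  # small: close to 0 up to sampling noise
```

A miscalibrated model (e.g. one that systematically over-predicts risk) would instead show bins where the observed rate falls well below the mean score.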
Remarkably, MC can be satisfied efficiently by post-processing risk scores without negatively impacting the generalization error of a model, unlike other fairness concepts like demographic parity (Foulds & Pan, 2020) and equalized odds (Hardt et al., 2016). This has motivated the use of MC in practical settings (e.g. Barda et al. (2021)) and has spurred several extensions (Kim et al., 2019; Jung et al., 2021; Gupta et al., 2021; Gopalan et al., 2022). As Barocas et al. (2019) note, equity in calibration embeds the fairness notion called sufficiency, which states: for a given risk prediction, the expected outcome should be independent of group membership. Starting from this notion, we can assess the conditions under which MC satisfies sufficiency. In this work, we derive a fairness criterion directly from sufficiency, dubbed differential calibration for its relation to differential fairness (Foulds et al., 2019b). We show that satisfying differential calibration can ensure that a model is equally "trustworthy" among groups in the data. By equally "trustworthy", we mean that a decision maker cannot reasonably come to distrust the model's risk predictions for specific groups, which may help prevent differences in decision-making between demographic groups, given the same risk prediction. By relating sufficiency to MC, we describe a shortcoming of MC that can occur when the outcome probabilities are strongly tied to group membership. Under this condition, the amount of calibration error relative to the expected outcome can be unequal between groups. This inequality hampers the ability of MC to (approximately) guarantee sufficiency, and thus to guarantee equity in trustworthiness for the decision maker. We propose a simple variant of MC called proportional multicalibration (PMC) that ensures that the proportion of calibration error within each bin and group is small.
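The shortcoming of MC described above is, at heart, simple arithmetic: a fixed absolute calibration error is a much larger fraction of a low base rate than of a high one. A toy illustration with hypothetical numbers (the base rates below are invented for exposition, not taken from any dataset in this paper):

```python
# Two groups share the same absolute calibration error within a bin,
# so both satisfy alpha-multicalibration for alpha = 0.05 ...
alpha = 0.05

base_rate_A = 0.50  # E[y | bin, group A]: higher-base-rate group
base_rate_B = 0.08  # E[y | bin, group B]: lower-base-rate group

# ... but the *percent* calibration error differs sharply between them.
pct_err_A = alpha / base_rate_A  # 0.10  -> 10% relative error
pct_err_B = alpha / base_rate_B  # 0.625 -> 62.5% relative error

print(pct_err_A, pct_err_B)
```

A decision maker observing group B could thus learn that the model's scores are far less reliable for that group, even though the model is multicalibrated; PMC instead constrains these relative quantities directly.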
We prove that PMC bounds both multicalibration and differential calibration. We show that PMC can be satisfied with an efficient post-processing method, similarly to MC.

1.1. OUR CONTRIBUTIONS

In this manuscript, we formally analyze the connection of MC to the fairness notion of sufficiency. To do so, we introduce differential calibration (DC), a sufficiency measure that constrains ratios of population risk between pairs of groups within prediction bins. We describe how DC, like sufficiency, provides a sense of equal trustworthiness from the point of view of the decision maker. With this definition, we prove the following. First, models that are $\alpha$-multicalibrated satisfy $\left(\ln \frac{r_{\min}+\alpha}{r_{\min}-\alpha}\right)$-DC, where $r_{\min}$ is the minimum expected risk prediction among categories defined by subgroups and prediction intervals. We illustrate the meaning of this bound, which is that the proportion of calibration error in multicalibrated models may scale inversely with the outcome probability. Based on these observations, we propose a variation of MC, PMC, that controls the percent error by group and risk stratum (Definition 5). We show that models satisfying $\alpha$-PMC are $\left(\frac{\alpha}{1-\alpha}\right)$-multicalibrated and $\left(\ln \frac{1+\alpha}{1-\alpha}\right)$-differentially calibrated. Proportionally multicalibrated models thereby obtain robust fairness guarantees that are less dependent on population risk categories. Furthermore, we define an efficient algorithm for learning predictors satisfying $\alpha$-PMC. Finally, we investigate the application of these methods to predicting patient admissions in the emergency department, a real-world resource allocation task, and show that post-processing for PMC results in models that are accurate, multicalibrated, and differentially calibrated.
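The contrast between the two guarantees above can be seen by evaluating the bounds numerically: the DC level implied by $\alpha$-MC blows up as $r_{\min}$ shrinks, while the DC level implied by $\alpha$-PMC does not depend on $r_{\min}$ at all. A sketch (the function names are ours, and the values of $\alpha$ and $r_{\min}$ are arbitrary illustrative choices):

```python
import math

def dc_bound_from_mc(alpha, r_min):
    """DC level implied by alpha-multicalibration; requires r_min > alpha."""
    return math.log((r_min + alpha) / (r_min - alpha))

def dc_bound_from_pmc(alpha):
    """DC level implied by alpha-PMC: independent of r_min."""
    return math.log((1 + alpha) / (1 - alpha))

alpha = 0.05
for r_min in (0.5, 0.2, 0.1):
    # ~0.20, ~0.51, ~1.10: the MC-based guarantee degrades for rare outcomes
    print(r_min, dc_bound_from_mc(alpha, r_min))
print(dc_bound_from_pmc(alpha))  # ~0.10, regardless of r_min
```

In other words, for outcomes that are rare in some group-by-bin category, MC alone permits large ratios of expected outcome between groups, whereas PMC caps that ratio uniformly.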

2.1. PRELIMINARIES

We consider the task of training a risk prediction model for a population of individuals with outcomes, y ∈ {0, 1}, and features, x ∈ X. Let D be the joint distribution from which individual samples (y, x) are drawn. We assume the outcomes y are random samples from underlying independent Bernoulli distributions, denoted as p^*(x) ∈ [0, 1]. Individuals can be further grouped into collections of subsets, C ⊆ 2^X, such that S ∈ C is the subset of individuals belonging to S, and x ∈ S indicates that individual x belongs to group S. We denote our risk prediction model as R(x) : X → [0, 1]. In order to consider calibration in practice, the risk predictions are typically discretized and considered within intervals. The coarseness of this interval is parameterized by a partitioning parameter, λ ∈ (0, 1]. The λ-discretization of [0, 1]

