

Abstract

Multicalibration is a desirable fairness criterion that constrains calibration error among flexibly defined groups in the data while maintaining overall calibration. However, when outcome probabilities are correlated with group membership, multicalibrated models can exhibit a higher percent calibration error among groups with lower base rates than among groups with higher base rates. As a result, it remains possible for a decision-maker to learn to trust or distrust model predictions for specific groups. To alleviate this, we propose proportional multicalibration, a criterion that constrains the percent calibration error among groups and within prediction bins. We prove that satisfying proportional multicalibration bounds a model's multicalibration as well as its differential calibration, a stronger fairness criterion inspired by the fairness notion of sufficiency. We provide an efficient algorithm for post-processing risk prediction models for proportional multicalibration and evaluate it empirically. We conduct simulation studies and investigate a real-world application of PMC post-processing to the prediction of emergency department patient admissions. We observe that proportional multicalibration is a promising criterion for simultaneously controlling multiple measures of calibration fairness of a model over intersectional groups, with virtually no cost in terms of classification performance.

1. INTRODUCTION

Today, machine learning (ML) models have an impact on outcome disparities across sectors (health, lending, criminal justice) due to their widespread use in decision-making. When applied in clinical decision-making, ML models help care providers decide whom to prioritize to receive finite and time-sensitive resources among a population of potentially very ill patients. These resources include hospital beds (Barak-Corren et al., 2021a; Dinh & Berendsen Russell, 2021), organ transplants (Schnellinger et al., 2021), specialty treatment programs (Henry et al., 2015; Obermeyer et al., 2019), and, recently, ventilator and other breathing support tools to manage the COVID-19 pandemic (Riviello et al., 2022). In scenarios like these, decision makers typically rely on risk prediction models to be calibrated. Calibration measures the extent to which a model's risk scores, R, match the observed probability of the event, y. Perfect calibration implies that P(y|R = r) = r for all values of r. Calibration allows the risk scores to be used to rank patients in order of priority and informs care providers about the urgency of treatment. However, models that are not equally calibrated among subgroups defined by different sensitive attributes (race, ethnicity, gender, income, etc.) may lead to systematic denial of resources to marginalized groups (e.g., Obermeyer et al., 2019; Ashana et al., 2021; Roberts, 2011; Zelnick et al., 2021; Ku et al., 2021). Just this scenario was observed by Obermeyer et al. (2019), who analyzed a large health system algorithm used to enroll high-risk patients into care management programs and showed that, at a given risk score, Black patients exhibited significantly poorer health than white patients. To address equity in calibration, Hebert-Johnson et al. (2018) proposed a fairness measure called multicalibration (MC), which asks that calibration be satisfied simultaneously over many flexibly defined subgroups.
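In practice, calibration of this kind is assessed empirically by binning risk scores and comparing the mean predicted risk to the observed event rate within each bin; the multicalibration perspective simply repeats this check within each subgroup. A minimal sketch of such a per-group check follows (the function name, bin count, and synthetic data are illustrative assumptions, not the paper's algorithm):

```python
import numpy as np

def binned_calibration_error(r, y, groups, n_bins=10):
    """Mean absolute calibration error per group, over equal-width risk bins.

    r:      array of risk scores in [0, 1]
    y:      array of binary outcomes (0/1)
    groups: array of group labels, one per example
    """
    # Assign each score to one of n_bins equal-width bins on [0, 1].
    bins = np.clip((np.asarray(r) * n_bins).astype(int), 0, n_bins - 1)
    r, y, groups = np.asarray(r), np.asarray(y), np.asarray(groups)
    errors = {}
    for g in np.unique(groups):
        in_g = groups == g
        per_bin = []
        for b in range(n_bins):
            mask = in_g & (bins == b)
            if mask.sum() == 0:
                continue  # skip empty (group, bin) cells
            # |mean predicted risk - observed event rate| in this bin
            per_bin.append(abs(r[mask].mean() - y[mask].mean()))
        errors[g] = float(np.mean(per_bin))
    return errors

# Synthetic, well-calibrated scores: y ~ Bernoulli(r), so per-group
# errors should be small (driven only by finite-sample noise).
rng = np.random.default_rng(0)
r = rng.uniform(0, 1, 20000)
y = rng.binomial(1, r)
groups = rng.integers(0, 2, 20000)
errors = binned_calibration_error(r, y, groups)
```

On well-calibrated synthetic data like the above, each group's error shrinks toward zero as the sample grows; a multicalibration violation would show up as a persistently large error for some group.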
Remarkably, MC can be satisfied efficiently by post-processing risk scores without negatively impacting the generalization error of a model, unlike other fairness concepts such as demographic parity (Foulds & Pan, 2020) and equalized odds (Hardt et al., 2016). This has motivated the use of MC in practical settings (e.g., Barda et al., 2021) and has spurred several extensions (Kim et al., 2019; Jung et al., 2021; Gupta et al., 2021; Gopalan et al., 2022). If we bin our risk predictions,

