SIMULATING ENVIRONMENTS FOR EVALUATING SCARCE RESOURCE ALLOCATION POLICIES

Anonymous authors
Paper under double-blind review

Abstract

Consider the sequential decision problem of allocating a limited supply of resources to a pool of potential recipients. This scarce resource allocation problem arises in a variety of settings characterized by "hard-to-make" tradeoffs, such as assigning organs to transplant patients or rationing ventilators in overstretched ICUs. Assisting human judgement in these choices are dynamic allocation policies that prescribe how to match available assets to an evolving pool of beneficiaries, such as clinical guidelines that stipulate selection criteria on the basis of recipient and organ attributes. However, while such policies have received increasing attention in recent years, a key challenge lies in pre-deployment evaluation: How might allocation policies behave in the real world? In particular, in addition to conventional backtesting, it is crucial that policies be evaluated on a variety of possible scenarios and sensitivities, such as distributions of recipients and organs that may diverge from historic patterns. In this work, we present AllSim, an open-source framework for performing data-driven simulation of scarce resource allocation policies for pre-deployment evaluation. Simulation environments are modular (i.e. parameterized componentwise), learnable (i.e. on historical data), and customizable (i.e. to unseen conditions), and, upon interaction with a policy, output a dataset of simulated outcomes for analysis and benchmarking. Compared to existing work, we believe this approach takes a step towards more methodical evaluation of scarce resource allocation policies.

1. INTRODUCTION

The distribution of organs for transplant is a prototypical example of the scarce resource allocation problem, one with salient "life-or-death" consequences that places significant pressure on decision-makers to make implicit but difficult trade-offs. To make the task more manageable, dynamic allocation policies assist human judgement in these choices by prescribing how to match each available unit of resource to a potential beneficiary. For instance, the United Network for Organ Sharing (UNOS) stipulates policies for organ allocation according to weighted organ- and patient-specific criteria, such as time on the waiting list, severity of illness, human leukocyte antigen matching, prognostic information, and other considerations [1, 2, 3]. Likewise, in the machine learning community, a variety of data-driven algorithms have been proposed as drop-in dynamic allocation policies, leveraging modern techniques for estimating treatment effects, predicting survival times, and accounting for organ scarcity, and these often demonstrate substantial improvements in life expectancy when evaluated on a backtested basis [4, 5, 6]. Is such demonstrated backtested performance sufficiently convincing for practitioners to adopt these allocation strategies? In many cases the answer is no, since there is still no standardised way in which this backtesting is actually undertaken [7]. The evaluation methods that do exist share common challenges: First, when testing a target policy different from the actual policy used to generate the data, any offline evaluation method is immediately biased away from the true open-loop data-generating process [8]. Second, the evaluation methods themselves impute predicted outcomes, often with simple linear models [9, 10], which are not flexible enough to properly test more flexible machine learning methods.
The compounding effect of these limitations leads clinicians to find the results unconvincing [11], consequently limiting the use of these potentially very beneficial systems. Historical data is also not the only benchmark against which candidate policies should be compared. In fact, it is crucial that policies be tested on a variety of scenarios and sensitivities for more robust evaluation, based on plausible possible futures, not just conditions that have been seen before. Real-world conditions may change, after all, meaning historical data may no longer be representative of the current environment. To this end, we desire a system that allows us to test policies in such counterfactual scenarios, for which we need counterfactual machine learning models [12] (cf. Sect. 3.4 and Appendix F), thereby more systematically evaluating and benchmarking candidate policies before they are put into practice, as well as continuously updating existing policies based on anticipated or unanticipated changes to the environment. We propose to evaluate policies using machine learning, as we illustrate in Figure 1. As a motivating example, consider the impact of COVID-19 on the availability of organs for transplant, which led in many cases to a sudden and severe drop in supply [13]. Is it possible to reasonably measure the performance of allocation policies against such a hypothetical event a priori? Not with current data-based methods, since the historical data contain no example of such an event. Conversely, with our simulated model we can intervene on the distribution of organs, reducing policy roll-out and testing to a simple task. With this in mind, we wish to highlight that accurate evaluation of such allocation policies is equally, if not more, important than their development. After all, there is little point in developing policies that cannot be shown to be beneficial and are therefore never used.
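To make this kind of intervention concrete, the sketch below shows one way a simulator might expose the organ-supply distribution as a parameter that can be intervened on. The function name `simulate_organ_arrivals`, its arguments, and the Poisson-arrival assumption are all illustrative choices of ours, not AllSim's actual API; the point is only that a supply shock becomes a one-line change to an environment parameter rather than something that must already exist in historical data.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_organ_arrivals(rate_per_day, n_days, shock_start=None, shock_factor=1.0):
    """Sample daily organ arrival counts from a Poisson process.

    `shock_start` and `shock_factor` are hypothetical parameters that
    emulate a supply shock (e.g. a pandemic-induced drop in donations).
    """
    counts = []
    for day in range(n_days):
        rate = rate_per_day
        if shock_start is not None and day >= shock_start:
            rate *= shock_factor  # intervene on the supply distribution
        counts.append(int(rng.poisson(rate)))
    return counts

# Baseline supply vs. a counterfactual 60% drop halfway through the year.
baseline = simulate_organ_arrivals(rate_per_day=3.0, n_days=365)
shocked = simulate_organ_arrivals(rate_per_day=3.0, n_days=365,
                                  shock_start=180, shock_factor=0.4)
```

A policy can then be rolled out against both arrival streams and its outcomes compared, which is precisely the a-priori stress test that purely retrospective backtesting cannot provide.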
Worse, it may be that current evaluation techniques are causing researchers to optimise for performance that ends up being actively detrimental in real-world deployment. And so, we now consider what exactly an ideal evaluation method may consist of.

Desiderata. We argue that a good solution should satisfy three key criteria: The environment must be (1) modular, in the sense that it is composed of parameterized components to allow flexible user interaction; (2) learnable, in the sense that it is grounded in the characteristics of real-world data; and (3) customizable, in the sense that it can test policies in previously unseen environment conditions. In particular, performance estimates should be unbiased when target policies different from the data-generating policies are evaluated. Note that our objective is not limited to estimating simple aggregate descriptive statistics (such as mean performance): We may also be interested in evaluating more general impacts of policy deployment, such as whether it inadvertently discriminates based on gender or ethnicity, or whether patients suffering from a particular disease are inadvertently disadvantaged.

Contributions

In this work, we present AllSim (Allocation Simulator), a general-purpose open-source framework for performing data-driven simulation of scarce resource allocation policies for pre-deployment evaluation. We use modular environment mechanisms to capture a range of environment conditions (e.g. varying arrival rates, sudden shocks, etc.), provide for componentwise parameters to be learned from historical data, and allow users to further configure parameters for stress testing and sensitivity analysis. Potential outcomes are evaluated using unbiased causal effect methods: Upon interaction with a policy, AllSim outputs a batch dataset detailing all of the simulated outcomes, allowing users to draw their own conclusions about the effectiveness of a policy. Compared to existing work, we believe this simulation framework takes a step towards more methodical evaluation of scarce resource allocation policies.
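The interaction between an environment and a policy can be pictured as follows. This is a minimal self-contained sketch of the simulation loop under our own simplifying assumptions: the covariate samplers, the sickest-first example policy, and the linear stand-in outcome model are all hypothetical placeholders for the learned, componentwise-parameterized mechanisms AllSim provides.

```python
import random

random.seed(0)

def sample_patient():
    # Hypothetical recipient covariates (illustrative only).
    return {"age": random.randint(18, 80), "severity": random.random()}

def sample_organ():
    # Hypothetical organ covariates (illustrative only).
    return {"donor_age": random.randint(18, 80), "quality": random.random()}

def sickest_first_policy(waitlist, organ):
    # Example policy: allocate the organ to the most severe patient.
    return max(range(len(waitlist)), key=lambda i: waitlist[i]["severity"])

def run_simulation(policy, n_steps=100):
    waitlist, log = [], []
    for t in range(n_steps):
        waitlist.append(sample_patient())       # patient arrival
        if random.random() < 0.5:               # stochastic organ arrival
            organ = sample_organ()
            patient = waitlist.pop(policy(waitlist, organ))
            # Stand-in outcome model; AllSim would use one learned from data.
            outcome = 10.0 - 5.0 * patient["severity"] + 2.0 * organ["quality"]
            log.append({**patient, **organ, "time": t, "outcome": outcome})
    return log  # batch dataset of simulated allocations and outcomes

log = run_simulation(sickest_first_policy)
```

The returned `log` plays the role of the batch dataset described above: one row per allocation, joining recipient covariates, organ covariates, and the simulated outcome, left to the user to analyse.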

2. SIMULATING ENVIRONMENTS

In this section, we explain the design choices we made when building AllSim in light of the three criteria listed under "Desiderata" above. Specifically, we require a method for evaluating a policy not only in terms of the expected outcome (e.g. recipient life-years), but also with respect to various demographics, aetiologies, resource waste, etc. We address this by outputting a synthetic dataset in which recipient covariates are matched with organ covariates and an outcome. An excerpt of such a dataset can be found in Appendix A.
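Because the output is an ordinary tabular dataset, subgroup analyses of the kind motivated in the desiderata (e.g. checking whether a policy inadvertently disadvantages a demographic or aetiology) reduce to standard grouped summaries. The records, field names, and values below are fabricated purely for illustration; an actual AllSim output would contain the full covariate and outcome columns described above.

```python
from collections import defaultdict

# Hypothetical simulated-outcome records, shaped like an AllSim output excerpt.
records = [
    {"sex": "F", "aetiology": "cirrhosis", "life_years": 7.2},
    {"sex": "M", "aetiology": "cirrhosis", "life_years": 6.8},
    {"sex": "F", "aetiology": "hepatitis", "life_years": 5.1},
    {"sex": "M", "aetiology": "hepatitis", "life_years": 4.9},
]

def mean_outcome_by(records, key):
    """Average the simulated outcome within each subgroup of `key`."""
    groups = defaultdict(list)
    for r in records:
        groups[r[key]].append(r["life_years"])
    return {k: sum(v) / len(v) for k, v in groups.items()}

by_sex = mean_outcome_by(records, "sex")
by_aetiology = mean_outcome_by(records, "aetiology")
```

Comparing such subgroup summaries across candidate policies, or across intervened environment conditions, is the intended mode of use: the framework produces the data, and the user decides which comparisons matter.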

