EXPLOITING STRUCTURED DATA FOR LEARNING CONTAGIOUS DISEASES UNDER INCOMPLETE TESTING

Abstract

One way machine learning algorithms can help control the spread of an infectious disease is by building models that predict who is likely to become infected, making those individuals good candidates for preemptive interventions. In this work we ask: can we build reliable infection prediction models when the observed data is collected under limited and biased testing that prioritizes symptomatic individuals? Our analysis suggests that when the infection is highly contagious, incomplete testing might be sufficient to achieve good out-of-sample prediction error. Guided by this insight, we develop an algorithm that predicts infections and show that it outperforms baselines on simulated data. We apply our model to data from a large hospital to predict Clostridioides difficile infections, a communicable disease that is characterized by asymptomatic (i.e., untested) carriers. Using a proxy instead of the unobserved untested-infected state, we show that our model outperforms benchmarks in predicting infections.

1. INTRODUCTION

Preemptively identifying individuals at high risk of contracting a contagious infection is important both for guiding treatment decisions that mitigate symptoms and for preventing further spread of the contagion. In this paper, we study how to build individual-level predictive models for contagious infections while explicitly addressing the challenges inherent to contagious diseases.

Building accurate infection prediction models is hindered by two main factors. First, contagious infections defy the usual i.i.d. assumption central to most machine learning methods, because an individual's infection state is not independent of their contacts' infection states. Previous work has often relied on expert knowledge to construct exposure proxies (Wiens et al., 2012; Oh et al., 2018). It is then assumed that, conditional on the exposure proxy and individual characteristics, individual outcomes are independent of one another. Second, the observed data is biased due to incomplete testing. We use the term "incomplete testing" to describe the scenario where only a small, biased subset of infected individuals gets tested. Such a scenario is ubiquitous in the context of contagious infections for several reasons. While many individuals carry the pathogen, only a fraction display symptoms. Even with unlimited testing resources, symptomatic individuals are far more likely to get tested, which leads to biased data collection in which individuals predisposed to displaying symptoms are over-represented. Incomplete testing makes learning accurate models difficult because the collected labels are missing not at random, leading to biased, inconsistent estimates.

In this work, we treat non-independence of outcomes as a blessing rather than a curse. Our proposed approach leverages the fact that an individual's infection state provides useful information about their contacts' true infection states.
This information is used to generate pseudo-labels for untested individuals, mitigating issues due to incomplete testing. The key idea behind our approach is that highly structured patterns of contagion transmission can serve as a complementary signal to identify even untested carriers. The stronger that signal, the smaller the impact of incomplete testing.

Our contributions can be summarized as follows: (1) We identify two properties of the collected data that can be exploited to mitigate the effects of incomplete testing. (2) We propose an algorithm that leverages this insight to predict the probability of an untested individual carrying the disease. (3) We empirically evaluate the effectiveness of our method on both simulated data and real data for a common healthcare-associated infection. We show that predictions from our model can be used to

