EXPLOITING STRUCTURED DATA FOR LEARNING CONTAGIOUS DISEASES UNDER INCOMPLETE TESTING

Abstract

One of the ways that machine learning algorithms can help control the spread of an infectious disease is by building models that predict who is likely to become infected and is therefore a good candidate for preemptive intervention. In this work we ask: can we build reliable infection prediction models when the observed data is collected under limited and biased testing that prioritizes symptomatic individuals? Our analysis suggests that when the infection is highly contagious, incomplete testing might be sufficient to achieve good out-of-sample prediction error. Guided by this insight, we develop an algorithm that predicts infections, and show that it outperforms baselines on simulated data. We apply our model to data from a large hospital to predict Clostridioides difficile infections, a communicable disease that is characterized by asymptomatic (i.e., untested) carriers. Using a proxy instead of the unobserved untested-infected state, we show that our model outperforms benchmarks in predicting infections.

1. INTRODUCTION

Preemptively identifying individuals at high risk of contracting a contagious infection is important for guiding treatment decisions to mitigate symptoms and for preventing further spread of the contagion. In this paper, we study how to build individual-level predictive models for contagious infections while explicitly addressing the challenges inherent to contagious diseases.

Building accurate infection prediction models is hindered by two main factors. First, contagious infections defy the usual i.i.d. assumption central to most machine learning methods, because an individual's infection state is not independent of their contacts' infection states. Previous work has often relied on expert knowledge to construct exposure proxies (Wiens et al., 2012; Oh et al., 2018). It is then assumed that, conditional on the exposure proxy and individual characteristics, individual outcomes are independent of one another. Second, the observed data is biased due to incomplete testing. We use the term "incomplete testing" to describe the scenario where only a small, biased subset of infected individuals gets tested. Such a scenario is ubiquitous in the context of contagious infections for several reasons. While many individuals carry the pathogen, only a fraction display symptoms. Even in the presence of unlimited testing resources, symptomatic individuals are far more likely to get tested, leading to biased data collection in which individuals predisposed to displaying symptoms are over-represented. Incomplete testing makes learning accurate models difficult, since the collected labels are missing not at random, leading to biased, inconsistent estimates.

In this work, we treat non-independence of outcomes as a blessing rather than a curse. Our proposed approach leverages the fact that an individual's infection state provides useful information about their contacts' true infection states.
This information is used to generate pseudo-labels for untested individuals, mitigating issues due to incomplete testing. The key idea behind our approach is that highly structured patterns of contagion transmission can serve as a complementary signal to identify even untested carriers. The stronger that signal is, the less impact incomplete testing will have.

Our contributions can be summarized as follows: (1) We identify two properties of the collected data that can be exploited to mitigate the effects of incomplete testing. (2) We propose an algorithm that leverages that insight to predict the probability of an untested individual carrying the disease. (3) We empirically evaluate the effectiveness of our method on both simulated data and real data for a common healthcare-associated infection. We show that predictions from our model can be used to

2. RELATED WORK

Infectious disease modeling. Modeling the transmission of infectious diseases has been extensively studied in the epidemiology literature using SIS/SIR models and several other variants (Kermack & McKendrick, 1927). These epidemiological models focus on the aggregate levels of infections in a community, which is distinct from our approach, where we focus on predicting individual-level infections. In the machine learning literature, previous work has relied on proxies for exposure, e.g., the prevalence of a disease in a community (Wiens et al., 2012; Oh et al., 2018), and implicitly assumes that, conditional on the exposure proxy and individual characteristics, individual outcomes are independent of one another. Similar to our approach, Fan et al. (2016) and Makar et al. (2018) take into account structured data, namely contact networks, to compute infection estimates.
We differ from these approaches in that (1) we do not make parametric assumptions about the joint distribution of the observed or latent variables, and instead use nonparametric models (neural networks) to model the infection states, (2) we do not assume all infections will become symptomatic, as is done in Fan et al. (2016), and (3) unlike the approach taken by Makar et al. (2018), we model time-evolving sequences of infections, taking into account the exposure states of potential asymptomatic carriers.

Semi-supervised learning. Our proposed approach relies on transductive reasoning to generate labels for untested individuals. In that respect, it is closely related to semi-supervised learning methods such as pseudo-labeling (Lee, 2003) and self-training (Robinson et al., 2020). However, in traditional pseudo-labeling, the transductive power comes from the fact that points similar to each other in the input space should have similar outputs. Here, the rich structure in the data allows for more: we can construct pseudo-labels for untested individuals not just by relying on their similarity to other labeled instances, but also by using the observed infection states of their contacts. Our empirical results and analysis are similar in spirit to concepts presented in the semi-supervised literature, specifically the cluster assumption, which we discuss at length later (Seeger, 2000; Rigollet, 2007).

Graph Neural Networks. Our proposed approach incorporates knowledge of the contact network. In that respect, it is similar to Graph Neural Networks (GNNs), which utilize relational data to generate prediction estimates (Zhou et al., 2018). GNNs fall into two categories: transductive methods, which cannot generalize to new communities (e.g., Kipf & Welling, 2017), and inductive methods, which can be used to generate estimates for previously unseen graphs (e.g., Hamilton et al., 2017).
Our work is similar to the latter category with an important distinction: our approach leverages unlabeled data, giving more accurate and robust estimates. Our work can be viewed as combining the strengths of semi-supervised learning and GNNs to address limited testing. In addition, our approach augments the strengths of those two approaches with ideas from domain shift and causal inference, such as importance weighting (Cortes et al., 2010), to address biased testing.
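
To make the importance-weighting idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes a single discrete covariate (a hypothetical `symptomatic` flag) and estimates the probability of being tested within each group by empirical frequencies; the function name `importance_weights` and the toy data are illustrative assumptions.

```python
import numpy as np

def importance_weights(group, tested):
    """Weights P(O=1) / P(O=1 | group) for tested individuals, 0 for untested.

    group  : integer array, a discrete covariate (hypothetical, 1 = symptomatic)
    tested : 0/1 array, the observation state (1 = tested)
    """
    group = np.asarray(group)
    tested = np.asarray(tested, dtype=float)
    p_tested = tested.mean()                    # marginal P(O=1)
    weights = np.zeros_like(tested)
    for g in np.unique(group):
        mask = group == g
        p_tested_given_g = tested[mask].mean()  # empirical P(O=1 | group=g)
        if p_tested_given_g == 0:
            # Without any tested member, the group cannot be reweighted.
            raise ValueError("group %r is never tested" % g)
        weights[mask] = tested[mask] * p_tested / p_tested_given_g
    return weights

# Toy data: 4 symptomatic individuals (3 tested), 6 asymptomatic (1 tested).
symptomatic = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
tested      = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])
w = importance_weights(symptomatic, tested)

# Share of symptomatic individuals among the tested, before and after reweighting.
naive = symptomatic[tested == 1].mean()
reweighted = np.average(symptomatic, weights=w)
```

In this toy example the naive share of symptomatic individuals among those tested is 0.75, while the reweighted share matches the population share of 0.4, which is the sense in which weighting can counteract testing that favors symptomatic individuals.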

3. PROBLEM SETTING

Setup. Let $y^t \in \{0, 1\}$ denote an individual's true infection state at time $t$, with $y^t = 0$ if the individual is not infected and $y^t = 1$ if they are. We use $x^t \in \mathcal{X}^t$ to denote a vector of the individual's features at time $t$, and define $J^t_i$ to be the set of indices of individual $i$'s contacts at time $t$. We assume that contact indices are known, i.e., that the contact network is observed. Let $e^t_i \in \mathbb{R}_{\geq 0}$ denote $i$'s exposure state, with $e^t_i = \sum_{j \in J^t_i} y^t_j$. The exposure state is fully observed only when all of $i$'s contacts have been tested; otherwise it is either partially observed or unobserved. Define $x^t = x^t \,\|\, e^t$, where $\|$ is the concatenation operator, i.e., $x^t \in \mathcal{X}^t \times \mathbb{R}_{\geq 0}$. Let $o^t \in \{0, 1\}$ denote the observation state, with $o^t = 1$ if an individual's label is observed, i.e., if the individual has been tested for the infection. We use the superscript $:t$ to denote variables from time $0$ up to and including $t$, e.g., $x^{:t} = [x^0, \ldots, x^s, \ldots, x^t]$. Throughout, we use capital letters to denote variables, and small letters to denote their values. We use $P(X^t, O^t, Y^{t+1})$ to denote the unknown distribution over the full joint. Under biased testing, we have that $P(X^t \mid O^t = 1) \neq P(X^t \mid O^t = 0) \neq P(X^t)$. We assume that $0 < P(O^t = o \mid X^t = x) < 1$ for all $x \in \mathcal{X}$ and $o \in \{0, 1\}$. This is the same as the overlap assumption
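
The exposure definition above can be illustrated with a small sketch. This is an assumption-laden toy, not the paper's code: the function name `observed_exposure`, the adjacency-list representation of the contact sets, and the example data are hypothetical. It computes the part of each individual's exposure that is visible from tested contacts only, and flags whether the exposure is fully observed; how untested contacts are handled by the proposed method is not shown here.

```python
import numpy as np

def observed_exposure(contacts, y, o):
    """Exposure from tested contacts only, plus a 'fully observed' flag.

    contacts : list of lists, contacts[i] holds the indices of i's contacts
    y        : 0/1 array of infection states (only usable where o == 1)
    o        : 0/1 array of observation states (1 = tested)
    """
    y = np.asarray(y)
    o = np.asarray(o)
    n = len(contacts)
    e_obs = np.zeros(n)
    fully_observed = np.zeros(n, dtype=bool)
    for i, contact_idx in enumerate(contacts):
        idx = np.asarray(contact_idx, dtype=int)
        e_obs[i] = np.sum(y[idx] * o[idx])          # count tested, infected contacts
        fully_observed[i] = bool(np.all(o[idx] == 1))  # every contact was tested
    return e_obs, fully_observed

# Toy network of 4 individuals; individual 2 is infected but untested.
contacts = [[1, 2], [0], [0, 3], [2]]
y_true = np.array([0, 1, 1, 0])
o = np.array([1, 1, 0, 1])

e_obs, fully_observed = observed_exposure(contacts, y_true, o)
e_true = np.array([y_true[c].sum() for c in contacts])  # exposure under full testing

# Concatenate features with the exposure, mirroring the x || e construction.
x = np.zeros((4, 2))  # placeholder features
x_aug = np.concatenate([x, e_obs[:, None]], axis=1)
```

Here individuals 0 and 3 each have the untested carrier (individual 2) as a contact, so their observed exposure undercounts the true exposure and their `fully_observed` flag is false.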

