FINDING PATIENT ZERO: LEARNING CONTAGION SOURCE WITH GRAPH NEURAL NETWORKS

Abstract

Locating the source of an epidemic, or patient zero (P0), can provide critical insights into the infection's transmission course and allow efficient resource allocation. Existing methods use graph-theoretic centrality measures and expensive message-passing algorithms, requiring knowledge of the underlying dynamics and its parameters. In this paper, we revisit this problem using graph neural networks (GNNs) to learn P0. We establish a theoretical limit for the identification of P0 in a class of epidemic models. We evaluate our method against different epidemic models on both synthetic networks and a real-world contact network, considering a disease with the history and characteristics of COVID-19. We observe that GNNs can identify P0 close to the theoretical bound on accuracy, without explicit input of the dynamics or its parameters. In addition, the GNN is over 100 times faster than classic methods for inference on arbitrary graph topologies. Our theoretical bound also shows that the epidemic is like a ticking clock, emphasizing the importance of early contact tracing. We find a maximum time after which accurate recovery of the source becomes difficult, regardless of the algorithm used.

1. INTRODUCTION

The ability to quickly identify the origin of an outbreak, or "finding patient zero", is critically important in the effort to contain an emerging epidemic. Identifying early transmission chains and reconstructing the possible paths of diffusion of the virus can be the difference between stopping an outbreak in its infancy and letting an epidemic unfold and affect a large share of a population. Hence, solving this problem would be instrumental in informing and guiding the contact tracing efforts carried out by public health authorities, allowing for resource allocation that maximizes the probability of early containment of the outbreak.

Disease spreading is modeled as a contagion process on a network of human-to-human interactions (Stroock & Varadhan, 2007; Pastor-Satorras et al., 2015), where infected individuals transmit the virus by infecting, with a certain probability, their direct contacts. In general, contagion processes can capture a wide range of phenomena, from rumor propagation on social media to virus spreading over cyber-physical networks (Centola & Macy, 2007; Baronchelli, 2018; Wang et al., 2013; Mishra & Keshri, 2013). Therefore, learning the source of a contagion process would also have a broader impact on various domains, from detecting sources of fake news to defending against malware attacks.

Learning the index case, or patient zero (P0), is a difficult problem. In this paper, we model disease spreading as a contagion process (chains of transmissions) over a graph. The evolution of an outbreak is noisy and highly dependent on the graph structure and disease dynamics. In addition, in real-world epidemics, there is often a delay between the start of the outbreak and the start of epidemic surveillance and contact tracing. Hence, we might only observe the state of the graph at some intermediate times, without access to the complete chains of transmission.
Furthermore, because the process is stochastic, the same source node might lead to different epidemic spreading trajectories. Finally, learning P0 from noisy observations of graph snapshots is computationally intractable, and the complexity grows exponentially with the size of the graph (Shah & Zaman, 2011).

Our goal is to provide fresh perspectives on the problem of finding patient zero using graph neural networks (GNNs) (Gilmer et al., 2017). First, we conduct a rigorous analysis of learning P0 based on the graph structure and the disease dynamics, allowing us to find conditions for identifying P0 accurately. We test our theoretical results on a set of epidemic simulations on synthetic graphs commonly used in the literature (Erdös et al., 1959; Albert & Barabási, 2002). We also evaluate our method on a realistic co-location network for the greater Boston area, finding performance similar to that on the synthetic data. While collecting labeled data to train a GNN to find P0 may not be possible, training a GNN using simulations on real contact-tracing data can provide a fast method for inferring P0 and help with planning and resource allocation. To the best of our knowledge, our work is the first to tackle the patient zero problem with deep learning and to test the approach on a realistic contact network.

In summary, we make the following contributions:

• We find upper bounds on the accuracy of finding patient zero in graphs with cycles, independent of the inference algorithm used.
• We show that beyond a certain time scale the inference becomes difficult, highlighting the importance of swift and early contact tracing.
• We demonstrate the superiority of GNNs over state-of-the-art message-passing algorithms in terms of speed and accuracy. Most importantly, our method is model agnostic and does not require the epidemic parameters to be known.
• We validate our theoretical findings using extensive experiments for different epidemic dynamics and graph structures, including a real-world co-location graph of the COVID-19 outbreak.
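To make the setting of this section concrete, the following is a minimal sketch of how a contagion process on a graph can produce the (snapshot, source) pairs that a learned model would be trained on. It is an illustration under stated assumptions, not the simulation pipeline used in the paper: the discrete-time SIR update rule, the Erdős–Rényi-style random graph, the parameter values (`beta`, `gamma`, `steps`), and all function names are hypothetical choices made only for this example.

```python
import random


def random_graph(n, p, rng):
    """Build an Erdős–Rényi-style undirected graph as an adjacency list.

    Assumption for illustration: each pair of nodes is linked with probability p.
    """
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                adj[i].add(j)
                adj[j].add(i)
    return adj


def simulate_sir(adj, source, beta, gamma, steps, rng):
    """Run a discrete-time SIR process from `source`; return the final snapshot.

    States: 'S' susceptible, 'I' infected, 'R' recovered.
    beta  = per-contact, per-step infection probability (assumed value).
    gamma = per-step recovery probability (assumed value).
    """
    state = {v: "S" for v in adj}
    state[source] = "I"
    for _ in range(steps):
        infected = [v for v, s in state.items() if s == "I"]
        newly_infected = set()
        # Each infected node tries to infect each susceptible direct contact.
        for v in infected:
            for u in adj[v]:
                if state[u] == "S" and rng.random() < beta:
                    newly_infected.add(u)
        # Nodes that were infected at the start of the step may recover.
        for v in infected:
            if rng.random() < gamma:
                state[v] = "R"
        for u in newly_infected:
            state[u] = "I"
    return state


def make_sample(adj, beta, gamma, steps, rng):
    """Draw a random source (P0) and return (snapshot, source) as one labeled sample."""
    source = rng.choice(sorted(adj))
    snapshot = simulate_sir(adj, source, beta, gamma, steps, rng)
    return snapshot, source


rng = random.Random(0)
graph = random_graph(n=50, p=0.1, rng=rng)
snapshot, p0 = make_sample(graph, beta=0.3, gamma=0.1, steps=5, rng=rng)
```

Only `snapshot` would be visible to an inference method, while `p0` is the label to recover. Because the process is stochastic, calling `make_sample` repeatedly with the same source yields different snapshots, which is one reason the problem is hard.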

2. RELATED WORK

Learning contagion dynamics. Learning forward dynamics of contagion processes on a graph is a well-studied problem area. For instance, Rodriguez et al. (2011) and Du et al. (2013) proposed scalable algorithms to estimate the parameters of the underlying diffusion network, a problem known as network inference. Deep learning has led to novel neural network models that can learn forward dynamics of various processes, including neural Hawkes processes (Mei & Eisner, 2017) and Markov decision process-based reinforcement learning (Li et al., 2018a). Learning forward contagion dynamics has also been intensively studied in epidemiology (Pastor-Satorras & Vespignani, 2001; Vynnycky & White, 2010), social science (Matsubara et al., 2012), and cyber-security (Prakash et al., 2012). In contrast, research on learning the reverse dynamics of contagion processes is rather scarce. Influence maximization (Kempe et al., 2003), for instance, finds a small set of individuals that can effectively spread information in a graph, but only maximizes the number of affected nodes in the infinite time limit. Our problem is more difficult, as we care not just about the number of infected nodes but also about which nodes were infected.

Most work on learning the dynamics of a contagion process (Rodriguez et al., 2011; Mei & Eisner, 2017; Li et al., 2018a) has focused on inferring the forward dynamics of the diffusion. In epidemiology, for example, Pastor-Satorras & Vespignani (2001) studied learning the temporal dynamics of diseases spreading on mobility networks. The problem of learning the reverse dynamics and identifying diffusion sources has been largely overlooked due to the aforementioned challenges. Two of the most notable exceptions are "rumor centrality" (Shah & Zaman, 2011) for contagion processes on trees and dynamic message passing (DMP) on graphs (Lokhov et al., 2014), but both require the parameters of the spreading dynamics as input.

In order to find patient zero, we aim to learn the reverse dynamics of contagion processes. Shah & Zaman (2011) were among the first to formalize the problem on trees in the context of modeling rumor spreading in a network. Prakash et al. (2012) and Vosoughi et al. (2017) studied similar problems for detecting viruses in computer networks. More recent advances proposed a dynamic message passing algorithm (Lokhov et al., 2014) and belief propagation (Altarelli et al., 2014) to estimate the source of an epidemic outbreak. Fairly recently, Fanti & Viswanath (2017) reduced the deanonymization of Bitcoin to the source identification problem in an epidemic and analyzed its dynamical properties. On the theoretical side, Shah & Zaman (2011) and Wang et al. (2014) analyzed the quality of the maximum likelihood estimator and rumor centrality, but only for the simple SI model on trees. Antulov-Fantulin et al. (2015) found detectability limits for patient zero in the SIR model using exact analytical methods and Monte Carlo estimators. Khim & Loh (2016) and Bubeck et al. (2017) proved that, over a regular tree, it is possible to construct a confidence set for the predicted diffusion source nodes whose size is independent of the number of infected nodes. Our work provides fresh perspectives on the patient zero problem on general graphs, based on the recent development of graph neural networks.

