MULTI-RESOLUTION MODELING OF A DISCRETE STOCHASTIC PROCESS IDENTIFIES CAUSES OF CANCER

Abstract

Detection of cancer-causing mutations within the vast and mostly unexplored human genome is a major challenge. Doing so requires modeling the background mutation rate, a highly non-stationary stochastic process, across regions of interest varying in size from one to millions of positions. Here, we present the split-Poisson-Gamma (SPG) distribution, an extension of the classical Poisson-Gamma formulation, to model a discrete stochastic process at multiple resolutions. We demonstrate that the probability model has a closed-form posterior, enabling efficient and accurate linear-time prediction over any length scale after the parameters of the model have been inferred a single time. We apply our framework to model mutation rates in tumors and show that model parameters can be accurately inferred from high-dimensional epigenetic data using a convolutional neural network, Gaussian process, and maximum-likelihood estimation. Our method is both more accurate and more efficient than existing models over a large range of length scales. We demonstrate the usefulness of multi-resolution modeling by detecting genomic elements that drive tumor emergence and are of vastly differing sizes.

1. INTRODUCTION

Numerous domains involve modeling highly non-stationary, discrete-time, integer-valued stochastic processes in which event counts vary dramatically over time or space. An important open problem of this nature in biology is understanding the stochastic process by which mutations arise across the genome. This is central to identifying mutations that drive cancer emergence (Lawrence et al., 2013). Tumor drivers confer a growth advantage on cells by altering the function of a genomic element such as a gene or regulatory feature (e.g., a promoter). Drivers are identifiable because they recur across tumors, but there are two major challenges to detecting such recurrence. First, driver mutations are rare, and their signal is hidden by the thousands of passenger mutations that passively and stochastically accumulate in tumors (Stratton et al., 2009; Martincorena & Campbell, 2015). Second, because functional elements vary dramatically in size (genes: 10^3-10^6 bases; regulatory elements: 10^1-10^3 bases; and single positions), driver mutations accumulate across regions whose sizes span many orders of magnitude. Accurately predicting the stochastic accumulation of passenger mutations at multiple scales is necessary to reveal the subtle recurrence of driver mutations across the genome. Here, we introduce the split-Poisson-Gamma (SPG) process, an extension of the Poisson-Gamma distribution, to efficiently model a non-stationary discrete stochastic process at numerous length scales. The model first approximates quasi-stationary regional rate parameters within small windows; it then projects these estimates to arbitrary regions in linear time (10-15 minutes for genome-wide inference). This approach contrasts with existing efforts that model fixed regions and require computationally expensive retraining (e.g., over 5 hours) to predict over multiple scales of interest (Nik-Zainal et al., 2016; Martincorena et al., 2017).
We apply our framework to model cancer-specific mutation patterns (fig. 1). We perform data-driven training of our model's parameters and show that it more accurately captures mutation patterns than existing methods on simulated and real data. We demonstrate the power of our multi-resolution approach by identifying drivers across functional elements: genes, regulatory features, and single base mutations. Despite the method having no knowledge of genome structure, it detects nearly all gene drivers present in over 5% of samples while making no false discoveries and detects all previously characterized regulatory drivers. Detected events also include novel candidate drivers, providing promising targets for future investigation.
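The two-step idea described above (estimate a regional rate once, then project it to regions of any size) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that the rate of one window follows a Gamma distribution with hypothetical shape and scale values, and that each position receives a fixed, hypothetical share of that rate. Under those assumptions, Poisson thinning gives a negative binomial count for any sub-region, and a single prefix sum over the shares lets every region be queried without refitting. The names `nb_pmf`, `tail_prob`, `region_share`, and all numeric values are illustrative.

```python
import math
from itertools import accumulate

def nb_pmf(k, shape, p):
    """Negative binomial pmf: the Poisson-Gamma marginal with Gamma shape `shape`."""
    log_pmf = (math.lgamma(k + shape) - math.lgamma(shape) - math.lgamma(k + 1)
               + shape * math.log(p) + k * math.log1p(-p))
    return math.exp(log_pmf)

def tail_prob(k_obs, shape, scale, share):
    """P(count >= k_obs) in a sub-region holding `share` (0 < share <= 1) of the window rate."""
    p = 1.0 / (1.0 + scale * share)  # thinning rescales the Gamma scale by `share`
    return max(0.0, 1.0 - sum(nb_pmf(k, shape, p) for k in range(k_obs)))

# Hypothetical window: rate ~ Gamma(shape=4, scale=2.5), so 10 expected mutations.
shape, scale = 4.0, 2.5
# Hypothetical per-position shares of the window rate (they sum to 1).
weights = [0.05, 0.10, 0.40, 0.25, 0.20]
prefix = [0.0] + list(accumulate(weights))  # one linear pass, reused for every query

def region_share(start, end):
    """Share of the window rate carried by positions [start, end)."""
    return prefix[end] - prefix[start]

# Query a single position, a mid-sized element, and the whole window.
for start, end, k_obs in [(2, 3, 12), (1, 4, 12), (0, 5, 12)]:
    s = region_share(start, end)
    print(f"[{start},{end}) expected={shape * scale * s:.2f} "
          f"P(K>={k_obs})={tail_prob(k_obs, shape, scale, s):.3g}")
```

The same fitted `shape` and `scale` serve all three queries; only the share changes. A small tail probability for a region would flag more mutations than the background model expects, which is the kind of deviation used to nominate candidate drivers.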

1.1. PREVIOUS WORK

Numerous methods exist for modeling stationary stochastic processes (Lindsey, 2004). Far fewer exist for non-stationary processes because they are difficult to capture with the covariance functions of parametric models (Risser, 2016). Non-stationary kernels have been introduced for Gaussian processes (Paciorek & Schervish, 2004), but these may not be tractable on large datasets due to their computational complexity. More recently, Poisson-Gamma models have been developed for dynamical systems (Schein et al., 2016; Guo et al., 2018), but these methods focus on learning relationships between count variables, not on predicting counts from continuous covariates. In the particular case of modeling mutation patterns across the cancer genome, numerous computational methods exist to model mutation rates within well-understood genomic contexts such as genes (Lawrence et al., 2013; Martincorena et al., 2017; Wadi et al., 2017; Mularoni et al., 2016; Juul et al.). These models account for less than 4% of the genome (Rheinbay et al., 2020) and are not applicable in non-coding regions, where the majority of mutations occur (Gloss & Dinger, 2018). A handful of methods to model genome-wide mutation rates have been introduced (Polak et al., 2015; Nik-Zainal et al., 2016; Bertl et al., 2018). However, they operate on a single length scale or set of regions and require computationally expensive retraining to predict over each new length scale. Several methods rely on Poisson or binomial regression; however, previous work has extensively documented that mutation count data are over-dispersed, leading these models to underestimate variance and yield numerous false-positive driver predictions (Lochovsky et al., 2015; Martincorena et al., 2017; Juul et al., 2019). Negative binomial regression has recently been used to account for over-dispersion (Nik-Zainal et al., 2016) and to perform genome-wide mutation modeling and driver detection. However, the resolution was coarse, and the approach found only a few highly recurrent driver mutations.
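The over-dispersion argument can be made concrete with the standard Poisson-Gamma mixture. This is a textbook derivation rather than the paper's own notation; the symbols K (count), λ (rate), α (Gamma shape), and θ (Gamma scale) are chosen here for illustration.

```latex
\begin{align}
K \mid \lambda &\sim \mathrm{Poisson}(\lambda), \qquad
\lambda \sim \mathrm{Gamma}(\alpha, \theta), \\
\Pr(K = k)
  &= \int_0^\infty \frac{\lambda^{k} e^{-\lambda}}{k!}\,
     \frac{\lambda^{\alpha-1} e^{-\lambda/\theta}}{\Gamma(\alpha)\,\theta^{\alpha}}\, d\lambda
   = \frac{\Gamma(k+\alpha)}{k!\,\Gamma(\alpha)}
     \left(\frac{1}{1+\theta}\right)^{\alpha}
     \left(\frac{\theta}{1+\theta}\right)^{k}, \\
\mathbb{E}[K] &= \alpha\theta, \qquad
\mathrm{Var}[K] = \mathbb{E}[\lambda] + \mathrm{Var}[\lambda]
                = \alpha\theta\,(1+\theta).
\end{align}
```

The marginal is negative binomial, and its variance exceeds the mean by the factor (1 + θ). A pure Poisson model with the same mean fixes the variance at αθ, so it assigns too little probability to large counts; this is the mechanism behind the false-positive driver predictions noted above.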

1.2. OUR CONTRIBUTIONS

This work makes three key contributions: 1) we introduce an extension of the Poisson-Gamma distribution to model non-stationary discrete stochastic processes at an arbitrary length scale without retraining; 2) we apply the framework to capture cancer-specific mutation rates with unprecedented accuracy, resolution, and efficiency; and 3) we perform a multi-scale search for cancer driver mutations genome-wide, including the first-ever base-resolution scan of the whole genome. This search yields



Figure 1: Non-stationary stochastic process modeling predicts mutation patterns and identifies cancer-specific driver mutations. Biological processes are shown in blue; data processing is shown in orange. a. Areas of the genome have varying epigenetic states (e.g. accessibility for transcription) depending on the tissue type. b. These epigenetic states set different mutation rates in different tissues. c. Our model takes these epigenetic tracks as input to estimate the regional mutation density across the genome (95% confidence interval in orange). d. Regional rate parameters and sequence context are integrated via the split-Poisson-Gamma (SPG) distribution to provide arbitrary-resolution mutation count estimates. Deviations between the estimated and observed mutation rates identify mutations that are associated with cancers in different tissues. e. The split-Poisson-Gamma (SPG) model plate diagram (squares: inferred parameters; grey: observed input data).

