PROPER SCORING RULES FOR SURVIVAL ANALYSIS

Anonymous

Abstract

Survival analysis is the problem of estimating probability distributions for future events, which can be seen as a problem in uncertainty quantification. Although there are fundamental theories on strictly proper scoring rules for uncertainty quantification, little is known about those for survival analysis. In this paper, we investigate extensions of four major strictly proper scoring rules for survival analysis. Through the extensions, we discuss and clarify the assumptions arising from the discretization of the estimation of probability distributions. We also discuss the relationship between the existing algorithms and extended scoring rules, and we propose new algorithms based on our extensions of the scoring rules for survival analysis.

1. INTRODUCTION

The theory of scoring rules is fundamental to statistical analysis and is widely used in uncertainty quantification (see, e.g., Mura et al. (2008); Parmigiani & Inoue (2009); Benedetti (2010); Schlag et al. (2015)). Suppose that there is a random variable $Y$ whose cumulative distribution function (CDF) is $F_Y$. Given an estimate $\hat{F}_Y$ of $F_Y$ and a single sample $y$ obtained from $Y$, a scoring rule $S(\hat{F}_Y, y)$ is a function that returns an evaluation score for $\hat{F}_Y$ based on $y$. Since $\hat{F}_Y$ is a CDF and $y$ is a single sample of $Y$, it is not straightforward to choose an appropriate scoring rule $S(\hat{F}_Y, y)$. The theory of scoring rules suggests strictly proper scoring rules, which can be used to recover the true probability distribution $F_Y$ by optimizing the scoring rules. This theory shows that there are infinitely many strictly proper scoring rules; examples include the pinball loss, the logarithmic score, the Brier score, and the ranked probability score (see, e.g., Gneiting & Raftery (2007) for the definitions of these scoring rules).

Survival analysis, which is also known as time-to-event analysis, can be seen as a problem in uncertainty quantification. Despite the long history of research on survival analysis (see, e.g., Wang et al. (2019) for a comprehensive survey), little is known about strictly proper scoring rules for survival analysis. Therefore, this paper investigates extensions of these scoring rules for survival analysis.

Survival analysis is the problem of estimating probability distributions for future events. In healthcare applications, an event usually corresponds to an undesirable event for a patient (e.g., a death or the onset of disease). The time between a well-defined starting point and the occurrence of an event is called the survival time or event time. Survival analysis has important applications in many fields such as credit scoring (Dirick et al., 2017) and fraud detection (Zheng et al., 2019) as well as healthcare.
Although we discuss survival analysis in the context of healthcare applications, the extended scoring rules can be used in any other application.

Datasets for survival analysis are censored, which means that events of interest might not be observed for a number of data points. This may be due to either the limited observation time window or missing traces caused by other irrelevant events. In this paper, we consider only right-censored data, which is a widely studied problem setting in survival analysis. The exact event time of a right-censored data point is unknown; we know only that the event had not happened up to a certain time for the data point. The time between a well-defined starting point and the last observation time of a right-censored data point is called the censoring time.

One of the classical methods for survival analysis is the Kaplan-Meier estimator (Kaplan & Meier, 1958). It is a non-parametric method for estimating the probability distribution of survival times as a survival function κ(t), where the value κ(t) represents the survival rate at time t (i.e., the fraction of patients who are still alive at time t). By definition, κ(0) = 1 and κ(t) is a monotonically non-increasing function.

Since many applications require an estimate of the survival function for each patient rather than the overall survival function κ(t) for all patients, many algorithms have been proposed. In particular, many neural network models have been proposed (e.g., Lee et al., 2018; Avati et al., 2019; Ren et al., 2019; Kamran & Wiens, 2021; Tjandra et al., 2021). A problem with these neural network models is that most of them, with the exception of Rindt et al. (2022), are not based on the theory of scoring rules. Since we cannot directly use a known scoring rule due to censoring in survival analysis, the state-of-the-art neural network models for survival analysis use their own custom loss functions instead.
Even though these custom loss functions can be seen as variants of known scoring rules, they are not proven to be strictly proper for survival analysis in terms of the theory of scoring rules.

We review variants of scoring rules used in survival analysis with respect to the four major strictly proper scoring rules.

• Pinball loss. Portnoy's estimator (Portnoy, 2003), which is a variant of the pinball loss, has been used in quantile regression-based survival analysis (Portnoy, 2003; Neocleous et al., 2006; Pearce et al., 2022). However, it is unknown whether Portnoy's estimator is proper.
• Logarithmic score. Rindt et al. (2022) proved that a variant of the logarithmic score is strictly proper for survival analysis. This variant has been used in the loss function of many neural network models (e.g., Lee et al., 2018; Avati et al., 2019; Ren et al., 2019; Kamran & Wiens, 2021; Kvamme & Borgan, 2021; Tjandra et al., 2021). However, most of them use this variant as only one part of their loss functions, and these loss functions are used without a proof of properness.
• Brier score. The IPCW Brier score (Graf et al., 1999) and the integrated Brier score (Graf et al., 1999) are widely used in survival analysis (e.g., Kvamme et al., 2019; Haider et al., 2020; Han et al., 2021; Zhong et al., 2021) as variants of the Brier score. However, Rindt et al. (2022) show that neither of them is proper in terms of the theory of scoring rules.
• Ranked probability score. Variants of the ranked probability score have been proposed in Avati et al. (2019) and Kamran & Wiens (2021), but Rindt et al. (2022) show that they are not proper in terms of the theory of scoring rules.

Our contributions. We analyze survival analysis through the lens of the theory of scoring rules. First, we prove that Portnoy's estimator, which is an extension of the pinball loss, is proper under certain conditions.
This result underpins the grid-search algorithm (Portnoy, 2003; Neocleous et al., 2006) and the CQRNN algorithm (Pearce et al., 2022), which is based on the expectation maximization (EM) algorithm. Second, we give another proof for an extension of the logarithmic score. This scoring rule has already been proven to be strictly proper by Rindt et al. (2022), but our proof clarifies the assumption that is implicit in that proof. Third, we show that there are two other proper scoring rules for survival analysis under certain conditions by extending the Brier score and the ranked probability score. Using these extended scoring rules, we construct two new algorithms based on the EM algorithm.
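To make the notion of properness concrete, the following is a minimal sketch for the ordinary, uncensored pinball loss only; it is not Portnoy's estimator and not one of the extensions studied in this paper. It illustrates the standard fact that the average pinball loss at level τ is minimized near the τ-quantile of the data. The function names `pinball_loss` and `best_quantile`, the exponential distribution, the sample size, and the candidate grid are all assumptions chosen for illustration.

```python
import numpy as np


def pinball_loss(q_hat, y, tau):
    """Pinball loss of predicting q_hat as the tau-quantile, given outcome(s) y."""
    diff = np.asarray(y, dtype=float) - q_hat
    # tau * diff applies when y >= q_hat, (tau - 1) * diff when y < q_hat.
    return np.maximum(tau * diff, (tau - 1.0) * diff)


def best_quantile(samples, tau, candidates):
    """Return the candidate with the smallest average pinball loss on samples."""
    losses = [pinball_loss(c, samples, tau).mean() for c in candidates]
    return candidates[int(np.argmin(losses))]


# Hypothetical uncensored event times drawn from an exponential distribution.
rng = np.random.default_rng(0)
samples = rng.exponential(scale=1.0, size=20000)

# Search a grid of candidate medians (tau = 0.5).
candidates = np.linspace(0.0, 5.0, 501)
q_star = best_quantile(samples, 0.5, candidates)

# The median of Exp(1) is log(2); q_star should lie close to it.
true_median = np.log(2.0)
print(q_star, true_median)
```

With censored data the outcome y is not always observed, so this loss cannot be evaluated directly; that gap is what the extended scoring rules discussed in this paper are meant to address.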

2. RELATED WORK

Survival analysis has traditionally been studied under the proportional hazard assumption. Its seminal work is the Cox model (Cox, 1972), and many other prediction models have been proposed under this strong assumption. See, e.g., Wang et al. (2019) for a comprehensive survey of the prediction models based on this assumption. Since we do not require the theory of scoring rules under this assumption, we consider survival analysis without this assumption. Note that most of the state-of-the-art neural network models for survival analysis do not use this assumption.

Regarding evaluation metrics for survival analysis, the concordance index (C-index) (Harrell et al., 1982) has been widely used under the proportional hazard assumption. Some variants of the C-index (Antolini et al., 2005; Uno et al., 2011) have been proposed for survival analysis without the proportional hazard assumption. However, they are proven not to be proper in terms of the theory of

