PROPER SCORING RULES FOR SURVIVAL ANALYSIS Anonymous

Abstract

Survival analysis is the problem of estimating probability distributions for future events, which can be seen as a problem in uncertainty quantification. Although there are fundamental theories on strictly proper scoring rules for uncertainty quantification, little is known about those for survival analysis. In this paper, we investigate extensions of four major strictly proper scoring rules for survival analysis. Through the extensions, we discuss and clarify the assumptions arising from the discretization of the estimation of probability distributions. We also discuss the relationship between the existing algorithms and extended scoring rules, and we propose new algorithms based on our extensions of the scoring rules for survival analysis.

1. INTRODUCTION

The theory of scoring rules is a fundamental theory in statistical analysis, and it is widely used in uncertainty quantification (see, e.g., Mura et al. (2008); Parmigiani & Inoue (2009); Benedetti (2010); Schlag et al. (2015)). Suppose that there is a random variable $Y$ whose cumulative distribution function (CDF) is $F_Y$. Given an estimate $\hat{F}_Y$ of $F_Y$ and a single sample $y$ obtained from $Y$, a scoring rule $S(\hat{F}_Y, y)$ is a function that returns an evaluation score for $\hat{F}_Y$ based on $y$. Since $\hat{F}_Y$ is a CDF and $y$ is a single sample of $Y$, it is not straightforward to choose an appropriate scoring rule $S(\hat{F}_Y, y)$. The theory of scoring rules suggests strictly proper scoring rules, which can be used to recover the true probability distribution $F_Y$ by optimizing them. This theory shows that there are infinitely many strictly proper scoring rules; examples include the pinball loss, the logarithmic score, the Brier score, and the ranked probability score (see, e.g., Gneiting & Raftery (2007) for the definitions of these scoring rules).

Survival analysis, which is also known as time-to-event analysis, can be seen as a problem in uncertainty quantification. Despite the long history of research on survival analysis (see, e.g., Wang et al. (2019) for a comprehensive survey), little is known about strictly proper scoring rules for survival analysis. Therefore, this paper investigates extensions of these scoring rules for survival analysis.

Survival analysis is the problem of estimating probability distributions for future events. In healthcare applications, an event usually corresponds to an undesirable event for a patient (e.g., a death or the onset of disease). The time between a well-defined starting point and the occurrence of an event is called the survival time or event time. Survival analysis has important applications in many fields such as credit scoring (Dirick et al., 2017) and fraud detection (Zheng et al., 2019) as well as healthcare. Although we discuss survival analysis in the context of healthcare applications, the extended scoring rules can be used in any other application.

Datasets for survival analysis are censored, which means that events of interest might not be observed for a number of data points. This may be due to either the limited observation time window or missing traces caused by other irrelevant events. In this paper, we consider only right-censored data, which is a widely studied problem setting in survival analysis. The exact event time of a right-censored data point is unknown; we know only that the event had not happened up to a certain time for the data point. The time between a well-defined starting point and the last observation time of a right-censored data point is called the censoring time.

One of the classical methods for survival analysis is the Kaplan-Meier estimator (Kaplan & Meier, 1958). It is a non-parametric method for estimating the probability distribution of survival times as a survival function $\kappa(t)$, where the value $\kappa(t)$ represents the survival rate at time $t$ (i.e., the ratio of the patients who survived at time $t$). By definition, $\kappa(0) = 1$ and $\kappa(t)$ is a monotonically decreasing function. Since many applications require an estimate of the survival function for each patient rather than the overall survival function $\kappa(t)$ for all patients, many algorithms have been proposed. In particular, many neural network models have been proposed (e.g., Lee et al., 2018; Avati et al., 2019; Ren et al., 2019; Kamran & Wiens, 2021; Tjandra et al., 2021). A problem with these neural network models is that most of them are not based on the theory of scoring rules; the model of Rindt et al. (2022) is an exception. Since we cannot directly use a known scoring rule due to censoring in survival analysis, the state-of-the-art neural network models for survival analysis use their own custom loss functions instead. Even though these custom loss functions can be seen as variants of known scoring rules, they have not been proven to be strictly proper for survival analysis in terms of the theory of scoring rules.

We review variants of scoring rules used in survival analysis with respect to the four major strictly proper scoring rules.

• Pinball loss. Portnoy's estimator (Portnoy, 2003), which is a variant of the pinball loss, has been used in quantile regression-based survival analysis (Portnoy, 2003; Neocleous et al., 2006; Pearce et al., 2022). However, it is unknown whether Portnoy's estimator is proper.
• Logarithmic score. Rindt et al. (2022) proved that a variant of the logarithmic score is strictly proper for survival analysis. This variant has been used in the loss function of many neural network models (e.g., Lee et al., 2018; Avati et al., 2019; Ren et al., 2019; Kamran & Wiens, 2021; Kvamme & Borgan, 2021; Tjandra et al., 2021). However, most of them use this variant as only one part of the loss function, and the resulting loss functions are used without a proof of properness.
• Brier score. The IPCW Brier score (Graf et al., 1999) and the integrated Brier score (Graf et al., 1999) are widely used in survival analysis (e.g., Kvamme et al., 2019; Haider et al., 2020; Han et al., 2021; Zhong et al., 2021) as variants of the Brier score. However, Rindt et al. (2022) show that neither of them is proper in terms of the theory of scoring rules.
• Ranked probability score. Variants of the ranked probability score have been proposed by Avati et al. (2019) and Kamran & Wiens (2021), but Rindt et al. (2022) show that they are not proper in terms of the theory of scoring rules.

Our contributions. We analyze survival analysis through the lens of the theory of scoring rules. First, we prove that Portnoy's estimator, which is an extension of the pinball loss, is proper under certain conditions. This result underpins the grid-search algorithm (Portnoy, 2003; Neocleous et al., 2006) and the CQRNN algorithm (Pearce et al., 2022), which is based on the expectation-maximization (EM) algorithm. Second, we give another proof for an extension of the logarithmic score. This scoring rule has already been proven to be strictly proper by Rindt et al. (2022), but our proof clarifies an implicit assumption in their proof. Third, we show that there are two other proper scoring rules for survival analysis under certain conditions by extending the Brier score and the ranked probability score. Using these extended scoring rules, we construct two new algorithms based on the EM algorithm.

2. RELATED WORK

Survival analysis has traditionally been studied under the proportional hazard assumption. Its seminal work is the Cox model (Cox, 1972), and many other prediction models have been proposed under this strong assumption. See, e.g., Wang et al. (2019) for a comprehensive survey of the prediction models based on this assumption. Since we do not require the theory of scoring rules under this assumption, we consider survival analysis without this assumption. Note that most of the state-of-the-art neural network models for survival analysis do not use this assumption.

Regarding evaluation metrics for survival analysis, the concordance index (C-index) (Harrell et al., 1982) has been widely used under the proportional hazard assumption. Some variants of the C-index (Antolini et al., 2005; Uno et al., 2011) have been proposed for survival analysis without the proportional hazard assumption. However, they have been proven not to be proper in terms of the theory of scoring rules (Blanche et al., 2018; Rindt et al., 2022). Therefore, we do not use these variants of the C-index in this paper. We also note that Sonabend et al. (2022) discuss the problems of using these variants of the C-index in survival analysis.

[Figure 1: (a) Quantile regression, with quantile levels $\tau_1, \ldots, \tau_4$ on the $F(t)$ axis and $z_{\max}$ on the time axis. (b) Distribution regression, with grid points $\zeta_1, \ldots, \zeta_5$ on the time axis. Axes: time $t$ (horizontal), $F(t)$ (vertical).]

3. PRELIMINARIES

We define notation here before showing the extensions of the scoring rules for survival analysis. Unless otherwise stated, we consider a single patient $x$, and let $T$ and $C$ be random variables for the event time and censoring time of this patient, respectively. Let $t \sim T$ and $c \sim C$ be samples obtained from $T$ and $C$, respectively. We assume that $t$ and $c$ are positive real values (i.e., $t \in \mathbb{R}_+$ and $c \in \mathbb{R}_+$). In survival analysis, we can observe only the minimum $z = \min\{t, c\}$, and we use $\delta = \mathbb{1}(t \le c)$ to indicate whether $z$ represents the true event time (i.e., $\delta = 1$ means $z$ is uncensored, and $z = t$) or the censoring time (i.e., $\delta = 0$ means $z$ is censored, and $z = c$). In this paper, a pair of samples $(t, c)$ is often represented as a pair of values $(z, \delta)$ to emphasize that we can observe only one of $t$ and $c$. We assume that there exists $z_{\max} > 0$ such that $0 < z \le z_{\max}$, which means that we have prior knowledge that $z$ is at most $z_{\max}$.

Let $F(t)$ be the CDF of $T$, which is defined as $F(t) = \Pr(T \le t)$. By the definition of $F(t)$, we have $F(0) = 0$, and we can represent the probability that the true event time is between $t_1$ and $t_2$ by $\Pr(t_1 < T \le t_2) = F(t_2) - F(t_1)$. Survival analysis is the problem of computing an estimate $\hat{F}(t)$ of the true CDF $F(t)$. For simplicity, we assume that both $\hat{F}(t)$ and $F(t)$ are monotonically increasing continuous functions. This means that $F(t_1) < F(t_2)$ holds if and only if $0 \le t_1 < t_2 < \infty$. This assumption enables us to calculate $F(t)$ for any time $0 \le t < \infty$ and to calculate $F^{-1}(\tau)$ for any quantile level $0 \le \tau \le 1$.

When we estimate $F(t)$ by using a neural network, we usually discretize $p = F(t)$ along the $p$-axis or the $t$-axis as shown in Fig. 1. In quantile regression-based survival analysis, $p = F(t)$ is discretized along the $p$-axis, $F^{-1}(\tau_i)$ is estimated for $0 = \tau_0 < \tau_1 < \cdots < \tau_{B-1} < \tau_B = 1$, and we assume that $F^{-1}(\tau_0) = 0$ and $F^{-1}(\tau_B) = z_{\max}$. In distribution regression-based survival analysis, $p = F(t)$ is discretized along the $t$-axis, $F(\zeta_i)$ is estimated for $0 = \zeta_0 < \zeta_1 < \cdots < \zeta_{B-1} < \zeta_B = z_{\max}$, and we assume that $F(\zeta_0) = 0$ and $F(\zeta_B) = 1$.

Throughout this paper, we assume that the censoring time and the event time are independent of each other given the feature vector of patient $x$. This assumption is widely used in survival analysis, and it is stated as follows.

Assumption 3.1. $T \perp\!\!\!\perp C \mid X$.

Note that the Kaplan-Meier estimator (Kaplan & Meier, 1958), which is a classical non-parametric method for survival analysis, uses this assumption. D-calibration (Haider et al., 2020), which is one of the widely used metrics in survival analysis, also uses this assumption. Examples of other, stronger assumptions used in survival analysis (e.g., unconditionally random right censoring) can be found in (Peng, 2021).

We briefly review the theory of scoring rules for uncertainty quantification. Let $Y$ be a random variable, and let $F_Y(y)$ be its CDF, which is defined as $F_Y(y) = \Pr(Y \le y)$. A scoring rule is a function $S(\hat{F}_Y, y)$ that returns a real value for inputs $\hat{F}_Y$ and $y$, where $\hat{F}_Y$ is an estimate of $F_Y$ and $y$ is a sample obtained from $Y$. In this paper, we consider negatively-oriented scoring rules. Therefore, the inequality $S(\hat{F}_1, y) < S(\hat{F}_2, y)$ means that $\hat{F}_1$ is a better estimate than $\hat{F}_2$. We can interpret the scoring rule $S(\hat{F}_Y, y)$ as a penalty function for the misestimation of $\hat{F}_Y$ for a sample $y$.

The proper and strictly proper scoring rules are defined by using the expected score of a scoring rule, which can be written as

$$S(\hat{F}_Y; Y) = \mathbb{E}_{y \sim Y}[S(\hat{F}_Y, y)].$$

A scoring rule $S$ is proper if $S(\hat{F}_Y; Y) \ge S(F_Y; Y)$ holds for any estimate $\hat{F}_Y$, and it is strictly proper if, in addition, the equality holds only when $\hat{F}_Y = F_Y$ (see, e.g., Gneiting & Raftery (2007)).

Now we extend these definitions of the proper and strictly proper scoring rules for survival analysis. In survival analysis, the inputs of a scoring rule $S(\hat{F}, (z, \delta))$ are changed from $\hat{F}_Y$ and $y$ to $\hat{F}$ and $(z, \delta)$. The proper and strictly proper scoring rules are defined by using

$$S(\hat{F}; T, C) = \mathbb{E}_{(t,c) \sim (T,C)}[S(\hat{F}, (z, \delta))].$$

Now we investigate the extensions of the scoring rules for survival analysis. In Sec. 4.1, we consider quantile regression and survival analysis based on quantile regression. In Secs. 4.2-4.4, we consider distribution regression and survival analysis based on distribution regression.

4.1. EXTENSION OF PINBALL LOSS

We first review quantile regression (Koenker & Bassett, 1978; Koenker & Hallock, 2001). Let $Y$ be a real-valued random variable and $F_Y$ be its CDF. In quantile regression, we estimate the $\tau$-th quantile of $Y$, which can be written as $F_Y^{-1}(\tau) = \inf\{y \mid F_Y(y) \ge \tau\}$. The pinball loss (Koenker & Bassett, 1978), which is also known as the check function, is a widely used scoring rule. The pinball loss for an estimate $\hat{F}_Y$ of $F_Y$ and a quantile level $\tau$ is defined as

$$S_{\text{Pinball}}(\hat{F}_Y, y; \tau) = \rho_\tau(\hat{F}_Y^{-1}(\tau), y) = \begin{cases} (1 - \tau)(\hat{F}_Y^{-1}(\tau) - y) & \text{if } \hat{F}_Y^{-1}(\tau) \ge y, \\ \tau(y - \hat{F}_Y^{-1}(\tau)) & \text{if } \hat{F}_Y^{-1}(\tau) < y. \end{cases} \tag{1}$$

Note that the pinball loss with $\tau = 0.5$ is equivalent, up to a constant factor of $1/2$, to the mean absolute error (MAE), and it can be used to estimate the median (i.e., the 0.5-th quantile) of $Y$. This means that the pinball loss is a generalization of MAE for any quantile level $\tau \in [0, 1]$. Note also that we include the quantile level $\tau$ in the notation $S_{\text{Pinball}}(\hat{F}_Y, y; \tau)$ to clarify that this scoring rule receives $\tau$ as an input.

It is known that the pinball loss is strictly proper (see, e.g., Gneiting & Raftery (2007)), which means that we have

$$\mathbb{E}_{y \sim Y}[S_{\text{Pinball}}(\hat{F}_Y, y; \tau)] \ge \mathbb{E}_{y \sim Y}[S_{\text{Pinball}}(F_Y, y; \tau)],$$

and the equality holds only when $\hat{F}_Y^{-1}(\tau) = F_Y^{-1}(\tau)$ by Definition 4.2. Therefore, quantile regression can be formulated as the problem of computing $\arg\min_{\hat{F}_Y} \mathbb{E}_{y \sim Y}[S_{\text{Pinball}}(\hat{F}_Y, y; \tau)]$.

As an extension of the pinball loss for quantile regression-based survival analysis, Portnoy (2003) proposed Portnoy's estimator, which is defined as

$$S_{\text{Portnoy}}(\hat{F}, (z, \delta); w, \tau) = \begin{cases} \rho_\tau(\hat{F}^{-1}(\tau), z) & \text{if } \delta = 1, \\ w \rho_\tau(\hat{F}^{-1}(\tau), z) + (1 - w) \rho_\tau(\hat{F}^{-1}(\tau), z_\infty) & \text{if } \delta = 0, \end{cases} \tag{2}$$

where $\rho_\tau$ is the pinball loss defined in Eq. (1), $w$ is a weight parameter to control the balance between the two pinball loss terms, and $z_\infty$ is any constant such that $z_\infty > z_{\max}$.
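The two losses above can be written down directly. The following is a minimal Python sketch, not the implementation of any cited work: the function names and the argument `q_hat` (standing for the estimated quantile $\hat{F}^{-1}(\tau)$) are illustrative assumptions, and the weight `w` is taken as given.

```python
def pinball_loss(q_hat: float, y: float, tau: float) -> float:
    """Pinball loss rho_tau(q_hat, y) of Eq. (1).

    q_hat is an assumed name for the estimated tau-th quantile.
    """
    if q_hat >= y:
        return (1.0 - tau) * (q_hat - y)
    return tau * (y - q_hat)


def portnoy_loss(q_hat: float, z: float, delta: int,
                 w: float, tau: float, z_inf: float) -> float:
    """Portnoy's estimator for one observation (z, delta).

    z_inf is any constant larger than z_max; w is assumed to be supplied
    by the caller (the text explains why it cannot be computed exactly).
    """
    if delta == 1:
        # Uncensored: ordinary pinball loss at the observed event time.
        return pinball_loss(q_hat, z, tau)
    # Censored: split the mass between the censoring time and z_inf.
    return (w * pinball_loss(q_hat, z, tau)
            + (1.0 - w) * pinball_loss(q_hat, z_inf, tau))


# Example: a censored observation at z = 1 with w = 0.25.
example = portnoy_loss(q_hat=2.0, z=1.0, delta=0, w=0.25, tau=0.5, z_inf=10.0)
```

In the example, the censored sample contributes a small penalty from the term at $z$ and a larger one from the term at $z_\infty$, which pushes the estimated quantile upward.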
In Portnoy's estimator, we can set an arbitrary constant $0 \le w \le 1$ for the parameter $w$ if $\tau_c > \tau$, where $\tau_c = \Pr(t \le c) = F(c)$, but we have to set $w = \Pr(F(c) < F(t) \le \tau \mid t > c) = (\tau - \tau_c)/(1 - \tau_c)$ otherwise (i.e., if $\tau_c \le \tau$). Since we do not know the true value $\tau_c = F(c)$, we have to resolve this problem to use this estimator. Before showing how to resolve this problem, we prove that this estimator is proper under the condition that $w$ is correct. Note that this is the first result for quantile regression-based survival analysis in terms of the theory of scoring rules.

Theorem 4.5. Portnoy's estimator is proper under the condition that $w$ is correct.

Proof. The proof is given in Appendix A.1.

This theorem means that the crucial part of Portnoy's estimator is to set an appropriate value for $w$, and it ensures that we can recover the true probability distribution $F^{-1}$ by minimizing Eq. (2) if $w$ is correct.

Now we discuss how to set the parameter $w$ in Portnoy's estimator. First, we emphasize that we cannot avoid the dependence on $F(c)$ in the definition of any of the scoring rules for survival analysis due to the discretization of $F$. Even if we know the true value $F^{-1}(\tau_i)$ for all $\{\tau_i\}_{i=0}^{B}$, we cannot compute $F(c)$ because $c$ is not always contained in $\{F^{-1}(\tau_i)\}_{i=0}^{B}$. The best we can do is to find quantile levels $\tau_i$ and $\tau_{i+1}$ such that $F^{-1}(\tau_i) < c \le F^{-1}(\tau_{i+1})$ by using the assumption that $F$ is a monotonically increasing function. Note that, even if we could find such $\tau_i$ and $\tau_{i+1}$, we would not be able to calculate some important probabilities such as $\Pr(c < t \le F^{-1}(\tau_{i+1})) = \tau_{i+1} - F(c)$. Therefore, we usually mitigate this problem by using a large $B$, which enables us to assume, for example, $F^{-1}(\tau_{i+1}) - F^{-1}(\tau_i) \approx 0$ for all $i$.

Even if we use a large $B$ to assume that we can find the quantile level $\tau_c$ such that $c \approx F^{-1}(\tau_c)$ for any $c$, the problem that we do not know the true $F^{-1}$ remains.
One approach to tackling this problem is the grid-search algorithm (Portnoy, 2003; Neocleous et al., 2006). In this algorithm, we use a sufficiently large $B$, and we compute the estimate $\hat{F}^{-1}(\tau_i)$ of $F^{-1}(\tau_i)$ in increasing order of $i = 0, 1, \ldots, B$. Suppose that we have estimated $\{\hat{F}^{-1}(\tau_i)\}_{i=0}^{j-1}$ and we are going to estimate $\hat{F}^{-1}(\tau_j)$. The key idea of this algorithm is that we can find $\hat{\tau}_c \in \{\tau_i\}_{i=0}^{j-1}$ such that $c \approx \hat{F}^{-1}(\hat{\tau}_c)$ if $\tau_c = F(c) < \tau_j$. If we can find such $\hat{\tau}_c$, we estimate $w$ by using $\hat{\tau}_c \approx \tau_c$. If we cannot find such $\hat{\tau}_c$, this algorithm assumes that $\tau_c > \tau_j$ and we use an arbitrary constant $0 \le w \le 1$. Portnoy (2003) discusses how this algorithm is analogous to the Kaplan-Meier estimator, and the theoretical analysis in (Portnoy, 2003; Neocleous et al., 2006) proves that Portnoy's estimator combined with linear regression can recover the true probability distribution $F$.

As another approach, Pearce et al. (2022) propose the CQRNN algorithm, which combines a neural network and the EM algorithm. Unlike the grid-search algorithm, this algorithm estimates $\{\hat{F}^{-1}(\tau_i)\}_{i=0}^{B}$ simultaneously by using a neural network. This algorithm starts with an arbitrary initial estimate $\hat{F}$, and the parameter $w$ is estimated by using $\hat{F}$. Then, this algorithm updates $\hat{F}$ by using the estimate $\hat{w}$ of $w$, and it repeats this alternating estimation of $\hat{F}$ and $\hat{w}$ until these values converge. According to Pearce et al. (2022), this EM algorithm can be implemented for "free", which means that we can implement it easily in the computation of the loss function of a neural network training algorithm and we do not need to construct two separate neural network models for estimating $\hat{F}$ and $\hat{w}$. The experimental evaluation in (Pearce et al., 2022) shows that the CQRNN algorithm performs the best among the quantile regression-based survival analysis models.
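Both approaches need an estimate of $\tau_c$ before the weight can be set. The sketch below illustrates one way to do this under stated assumptions: it locates $c$ among the currently estimated quantiles by nearest distance (an illustrative choice, not necessarily the rule used in the cited algorithms) and then applies the weight rule from the text. The function names and the `default_w` argument are hypothetical.

```python
from typing import Sequence


def estimate_tau_c(levels: Sequence[float],
                   est_quantiles: Sequence[float],
                   c: float) -> float:
    """Return the grid level tau_i whose estimated quantile is closest to c.

    levels[i] = tau_i and est_quantiles[i] = estimated F^{-1}(tau_i).
    Nearest-distance matching is an assumption made for this sketch.
    """
    best = min(range(len(levels)), key=lambda i: abs(est_quantiles[i] - c))
    return levels[best]


def portnoy_weight(tau: float, tau_c: float, default_w: float = 0.5) -> float:
    """Weight w of Portnoy's estimator for quantile level tau.

    If tau_c > tau, any constant in [0, 1] is allowed (default_w here);
    otherwise w = (tau - tau_c) / (1 - tau_c). Assumes tau_c < 1.
    """
    if tau_c > tau:
        return default_w
    return (tau - tau_c) / (1.0 - tau_c)


levels = [0.0, 0.25, 0.5, 0.75, 1.0]
est_quantiles = [0.0, 1.0, 2.0, 4.0, 10.0]
tau_c_hat = estimate_tau_c(levels, est_quantiles, c=2.2)
w_hat = portnoy_weight(tau=0.75, tau_c=tau_c_hat)
```

In an EM-style loop, `est_quantiles` would be the current model output, `w_hat` would be recomputed from it, and the model would then be updated with the loss evaluated at `w_hat`.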

4.2. EXTENSION OF LOGARITHMIC SCORE

While we estimate $\{\hat{F}^{-1}(\tau_i)\}_{i=0}^{B}$ in quantile regression, we now consider distribution regression, in which we estimate $\{\hat{F}(\zeta_i)\}_{i=0}^{B}$. For distribution regression, the logarithmic score (Good, 1952) is known as one of the strictly proper scoring rules, and it is defined as

$$S_{\text{log}}(\hat{F}, y; \{\zeta_i\}_{i=0}^{B}) = -\sum_{i=0}^{B-1} \mathbb{1}(\zeta_i < y \le \zeta_{i+1}) \log(\hat{F}(\zeta_{i+1}) - \hat{F}(\zeta_i)) = -\sum_{i=0}^{B-1} \mathbb{1}(\zeta_i < y \le \zeta_{i+1}) \log \hat{f}_i, \tag{3}$$

where $\hat{f}_i = \hat{F}(\zeta_{i+1}) - \hat{F}(\zeta_i)$ for $i = 0, 1, \ldots, B - 1$. We extend this logarithmic score for distribution regression-based survival analysis as

$$S_{\text{Cen-log}}(\hat{F}, (z, \delta); w, \{\zeta_i\}_{i=0}^{B}) = -\sum_{i=0}^{B-1} \mathbb{1}(\zeta_i < z \le \zeta_{i+1}) \left\{ \delta \log \hat{f}_i + (1 - \delta)\left(w \log \hat{f}_i + (1 - w) \log(1 - \hat{F}(\zeta_{i+1}))\right) \right\}, \tag{4}$$

where $w = \Pr(c < t \le \zeta_{i+1} \mid t > c) = (F(\zeta_{i+1}) - F(c))/(1 - F(c))$. If $\delta = 1$, this scoring rule is equal to Eq. (3). Similar to Portnoy's estimator, we cannot set the parameter $w$ of this scoring rule because we do not know $F(\zeta_{i+1})$ and $F(c)$. Even though we do not know the correct $w$, we prove that this scoring rule is strictly proper if the parameter $w$ is correct.

Theorem 4.6. The scoring rule $S_{\text{Cen-log}}(\hat{F}, (z, \delta); w, \{\zeta_i\}_{i=0}^{B})$ is strictly proper if $w$ is correct.

Proof. The proof is given in Appendix A.2.

Similar to Portnoy's estimator, we can use both the grid-search algorithm and an EM algorithm similar to the CQRNN algorithm to estimate $w$. In addition, we show another, simpler approach that uses the observation that $w \approx 0$ if $B$ is large. If $B$ is large, $1 - F(c)$ is usually much larger than $F(\zeta_{i+1}) - F(c)$ (see Fig. 2(a)), and hence we have $w = (F(\zeta_{i+1}) - F(c))/(1 - F(c)) \approx 0$. Therefore, we obtain a simpler variant of $S_{\text{Cen-log}}$ by setting $w = 0$:

$$S_{\text{Cen-simple-log}}(\hat{F}, (z, \delta); \{\zeta_i\}_{i=0}^{B}) = -\sum_{i=0}^{B-1} \mathbb{1}(\zeta_i < z \le \zeta_{i+1}) \left\{ \delta \log \hat{f}_i + (1 - \delta) \log(1 - \hat{F}(\zeta_{i+1})) \right\}. \tag{5}$$

Furthermore, by increasing $B$ to infinity (i.e., $B \to \infty$), we obtain the continuous version of this scoring rule:

$$S_{\text{Cen-cont-log}}(\hat{F}, (z, \delta)) = -\delta \log \frac{d\hat{F}}{dt}(z) - (1 - \delta) \log(1 - \hat{F}(z)),$$

which is equal to the extension of the logarithmic score that is proven to be strictly proper in (Rindt et al., 2022). This clarifies that the proof in (Rindt et al., 2022) implicitly assumes that $B$ is sufficiently large.
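The discretized censored logarithmic score can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions: `F_hat` is a list holding the estimated CDF values on the grid `zeta` (with `F_hat[0] = 0` and `F_hat[-1] = 1`), and the function and helper names are hypothetical. Setting `w = 0` gives the simpler variant.

```python
import math
from typing import Sequence


def _weighted_log(weight: float, p: float) -> float:
    """Return weight * log(p), skipping the term when the weight is zero."""
    return 0.0 if weight == 0.0 else weight * math.log(p)


def cen_log_score(F_hat: Sequence[float], z: float, delta: int,
                  w: float, zeta: Sequence[float]) -> float:
    """Censored logarithmic score for one observation (z, delta).

    With w = 0 this reduces to the simple variant. Note that a sample
    censored in the last bin needs w = 1, because 1 - F_hat[-1] = 0.
    """
    for i in range(len(zeta) - 1):
        if zeta[i] < z <= zeta[i + 1]:
            f_i = F_hat[i + 1] - F_hat[i]
            if delta == 1:
                return -math.log(f_i)
            tail = 1.0 - F_hat[i + 1]  # estimated mass after this bin
            return -(_weighted_log(w, f_i) + _weighted_log(1.0 - w, tail))
    raise ValueError("z must lie in (zeta[0], zeta[-1]]")


zeta = [0.0, 1.0, 2.0, 3.0, 4.0]
F_hat = [0.0, 0.1, 0.4, 0.8, 1.0]
uncensored = cen_log_score(F_hat, z=1.5, delta=1, w=0.0, zeta=zeta)
censored_simple = cen_log_score(F_hat, z=1.5, delta=0, w=0.0, zeta=zeta)
```

The uncensored score depends only on the mass of the bin containing $z$, while the censored score with $w = 0$ depends only on the estimated mass after that bin.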

4.3. EXTENSION OF BRIER SCORE

In distribution regression, the Brier score (Brier, 1950) is also known as a strictly proper scoring rule, which is defined as

$$S_{\text{Brier}}(\hat{F}, y; \{\zeta_i\}_{i=0}^{B}) = \sum_{i=0}^{B-1} \left(\mathbb{1}(\zeta_i < y \le \zeta_{i+1}) - \hat{f}_i\right)^2. \tag{6}$$

We extend this Brier score for distribution regression-based survival analysis as

$$S_{\text{Cen-Brier}}(\hat{F}, (z, \delta); \{w_i\}_{i=0}^{B-1}, \{\zeta_i\}_{i=0}^{B}) = \sum_{i=0}^{B-1} \left( w_i (1 - \hat{f}_i)^2 + (1 - w_i) \hat{f}_i^2 \right), \tag{7}$$

where

$$w_i = \begin{cases} 0 & \text{if } \zeta_{i+1} < z, \\ 1 & \text{if } \delta = 1 \text{ and } \zeta_i < z = t \le \zeta_{i+1}, \\ 0 & \text{if } \delta = 1 \text{ and } z = t \le \zeta_i, \\ (F(\zeta_{i+1}) - F(c))/(1 - F(c)) & \text{if } \delta = 0 \text{ and } \zeta_i < z = c \le \zeta_{i+1}, \\ f_i/(1 - F(c)) & \text{if } \delta = 0 \text{ and } z = c \le \zeta_i, \end{cases}$$

and $f_i = F(\zeta_{i+1}) - F(\zeta_i)$. If $\delta = 1$, it is easy to see that Eq. (7) is equal to Eq. (6). We prove that this scoring rule is strictly proper if the set of parameters $\{w_i\}_{i=0}^{B-1}$ is correct.

Theorem 4.7. The scoring rule $S_{\text{Cen-Brier}}(\hat{F}, (z, \delta); \{w_i\}_{i=0}^{B-1}, \{\zeta_i\}_{i=0}^{B})$ is strictly proper if $w_i$ is correct for all $i$.

Proof. The proof is given in Appendix A.3.

We can use an EM algorithm similar to the CQRNN algorithm to estimate $\{w_i\}_{i=0}^{B-1}$. However, unlike Portnoy's estimator and the extension of the logarithmic score, we cannot use the grid-search algorithm in this extension of the Brier score because $w_i$ depends on $f_j$ such that $i < j$. Note that each $w_i$ in this scoring rule is close to zero if $B$ is large and $\delta = 0$. However, since the weights $w_i$ are designed to satisfy $\sum_i w_i = 1$, we cannot use the approximation $w_i \approx 0$ for this scoring rule.

[Figure 2: The quantities $1 - F(c)$ and $F(\zeta_{i+1}) - F(c)$ for $S_{\text{Cen-log}}$ (panel (a)) and the quantities $1 - F(c)$ and $F(\zeta) - F(c)$ for $S_{\text{Cen-Binary-Brier}}$ (panel (b)), shown against time $t$.]
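The weights and the score can be sketched as follows. This is a minimal illustration, not the authors' code. It assumes a reading of the weight definition in which bins that end before $z$ receive weight zero and, for a censored sample, the remaining weights are the conditional bin probabilities given $t > c$, so that they sum to one as the text requires. The true CDF values `F` and `F_c` are inputs here; in practice they are unknown and would be replaced by current estimates inside an EM-style loop. All names are hypothetical.

```python
from typing import List, Sequence


def cen_brier_weights(F: Sequence[float], F_c: float, z: float,
                      delta: int, zeta: Sequence[float]) -> List[float]:
    """Weights w_i for one observation; F[i] = F(zeta_i), F_c = F(c)."""
    weights = []
    for i in range(len(zeta) - 1):
        in_bin = zeta[i] < z <= zeta[i + 1]
        if delta == 1:
            weights.append(1.0 if in_bin else 0.0)
        elif zeta[i + 1] < z:
            weights.append(0.0)  # bin ends before the censoring time
        elif in_bin:
            weights.append((F[i + 1] - F_c) / (1.0 - F_c))
        else:
            weights.append((F[i + 1] - F[i]) / (1.0 - F_c))
    return weights


def cen_brier_score(F_hat: Sequence[float], weights: Sequence[float]) -> float:
    """Sum over bins of w_i (1 - f_i)^2 + (1 - w_i) f_i^2 for estimated f_i."""
    total = 0.0
    for i, w_i in enumerate(weights):
        f_i = F_hat[i + 1] - F_hat[i]
        total += w_i * (1.0 - f_i) ** 2 + (1.0 - w_i) * f_i ** 2
    return total


zeta = [0.0, 1.0, 2.0, 3.0, 4.0]
F = [0.0, 0.1, 0.4, 0.8, 1.0]
w_cens = cen_brier_weights(F, F_c=0.25, z=1.5, delta=0, zeta=zeta)
w_event = cen_brier_weights(F, F_c=0.25, z=1.5, delta=1, zeta=zeta)
score_event = cen_brier_score(F, w_event)
```

For the uncensored sample the weights are an indicator vector, so the score coincides with the ordinary Brier score; for the censored sample the weights spread over the bin containing $c$ and all later bins.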

4.4. EXTENSION OF RANKED PROBABILITY SCORE

The ranked probability score (RPS) is also known as a strictly proper scoring rule (see, e.g., Gneiting & Raftery (2007)). It is defined as

$$S_{\text{RPS}}(\hat{F}, y) = \sum_{i=1}^{B} S_{\text{Binary-Brier}}(\hat{F}, y; \zeta_i), \tag{8}$$

where $S_{\text{Binary-Brier}}$ is the binary version of $S_{\text{Brier}}$ (Eq. (6)) with a single threshold $\zeta$:

$$S_{\text{Binary-Brier}}(\hat{F}, y; \zeta) = \left(\mathbb{1}(y \le \zeta) - \hat{F}(\zeta)\right)^2. \tag{9}$$

We extend this scoring rule for survival analysis:

$$S_{\text{Cen-RPS}}(\hat{F}, (z, \delta); \{w_i\}_{i=1}^{B-1}, \{\zeta_i\}_{i=1}^{B-1}) = \sum_{i=1}^{B-1} S_{\text{Cen-Binary-Brier}}(\hat{F}, (z, \delta); w_i, \zeta_i), \tag{10}$$

where $S_{\text{Cen-Binary-Brier}}$ is the binary version of $S_{\text{Cen-Brier}}$ (Eq. (7)) with a single threshold $\zeta$:

$$S_{\text{Cen-Binary-Brier}}(\hat{F}, (z, \delta); w, \zeta) = \begin{cases} \hat{F}(\zeta)^2 & \text{if } z > \zeta, \\ (1 - \hat{F}(\zeta))^2 & \text{if } \delta = 1 \text{ and } z = t \le \zeta, \\ w(1 - \hat{F}(\zeta))^2 + (1 - w)\hat{F}(\zeta)^2 & \text{if } \delta = 0 \text{ and } z = c \le \zeta, \end{cases}$$

where $w = (F(\zeta) - F(c))/(1 - F(c))$. Since this scoring rule is just the sum of binary versions of the Brier score, it is straightforward to prove the following theorem.

Theorem 4.8. The scoring rule $S_{\text{Cen-RPS}}(\hat{F}, (z, \delta); \{w_i\}_{i=1}^{B-1}, \{\zeta_i\}_{i=1}^{B-1})$ is strictly proper if $w_i$ is correct for all $i$.

Note that the scoring rule $S_{\text{Cen-Binary-Brier}}$ is analogous to Portnoy's estimator. The scoring rule $S_{\text{Cen-Binary-Brier}}$ is designed to estimate $F(\zeta)$, where $\zeta$ is an input, and we use $F(c)$ and $\zeta$ to set $w$, whereas Portnoy's estimator is designed to estimate $F^{-1}(\tau)$, where $\tau$ is an input, and we use $F(c)$ and $\tau$ to set $w$. As these two scoring rules are similar, we can use both the grid-search algorithm and an EM algorithm similar to the CQRNN algorithm for $S_{\text{Cen-RPS}}$.

Unlike $S_{\text{Cen-log}}$ defined in Eq. (4), the parameter $w$ of the scoring rule $S_{\text{Cen-Binary-Brier}}$ is usually not close to zero, because $\zeta$ and $c$ are usually not close to each other, as shown in Fig. 2(b). We note that the parameter $w$ of Portnoy's estimator is also not close to zero for a similar reason.
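As a concrete illustration, the censored binary Brier score and its sum over the interior thresholds can be sketched as follows. The function names and the indexing convention (`ws[i]` is the weight for threshold `zeta[i]`, with the entries at indices 0 and B unused) are assumptions made for this sketch; the weights are taken as given.

```python
from typing import Sequence


def cen_binary_brier(F_hat_zeta: float, z: float, delta: int,
                     w: float, zeta: float) -> float:
    """Censored binary Brier score at a single threshold zeta.

    F_hat_zeta is the estimated CDF value at the threshold.
    """
    if z > zeta:
        # The event is known to occur after zeta, whether or not z is censored.
        return F_hat_zeta ** 2
    if delta == 1:
        return (1.0 - F_hat_zeta) ** 2
    return w * (1.0 - F_hat_zeta) ** 2 + (1.0 - w) * F_hat_zeta ** 2


def cen_rps(F_hat: Sequence[float], z: float, delta: int,
            ws: Sequence[float], zeta: Sequence[float]) -> float:
    """Sum of censored binary Brier scores over thresholds zeta_1..zeta_{B-1}."""
    return sum(
        cen_binary_brier(F_hat[i], z, delta, ws[i], zeta[i])
        for i in range(1, len(zeta) - 1)
    )


zeta = [0.0, 1.0, 2.0, 3.0]
F_hat = [0.0, 0.2, 0.7, 1.0]
ws = [0.0, 0.0, 0.5, 0.0]  # only ws[2] matters for z = 1.5
rps_event = cen_rps(F_hat, z=1.5, delta=1, ws=ws, zeta=zeta)
rps_cens = cen_rps(F_hat, z=1.5, delta=0, ws=ws, zeta=zeta)
```

Each threshold is scored independently, which is why a grid search over thresholds is possible for this rule.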

5. EVALUATION METRICS FOR SURVIVAL ANALYSIS

While we have discussed the extensions of the scoring rules as loss functions, strictly proper scoring rules can also be used as evaluation metrics. If we use a nonproper scoring rule $S_{\text{nonproper}}$ as an evaluation metric, a neural network model can find $\hat{F}$ such that

$$\mathbb{E}_{(t,c) \sim (T,C)}[S_{\text{nonproper}}(\hat{F}, (z, \delta))] < \mathbb{E}_{(t,c) \sim (T,C)}[S_{\text{nonproper}}(F, (z, \delta))]$$

by using $S_{\text{nonproper}}$ for the loss function. This suggests that we should avoid nonproper scoring rules as evaluation metrics, because we may obtain an over-optimized estimate $\hat{F}$, which has a lower score than the true $F$ in terms of the evaluation metric $S_{\text{nonproper}}$.

Among the extensions of the scoring rules for survival analysis, we can use only $S_{\text{Cen-simple-log}}$ (Eq. (5)) as an evaluation metric for survival analysis, because the other scoring rules depend on the parameter $w$ or $\{w_i\}_{i=1}^{B-1}$. Note that the scoring rule $S_{\text{Cen-simple-log}}$ is valid only if $B$ is sufficiently large. In Appendix B, we conducted experiments on choosing an appropriate $B$, and the results suggested using $B > 16$.

Regarding calibration metrics for survival analysis, while D-calibration (Haider et al., 2020) is widely used, we propose another metric for calibration, KM-calibration. We define this metric as

$$d_{\text{KM-cal}}(\kappa, \hat{F}_{\text{avg}}; \{\zeta_i\}_{i=0}^{B}) = d_{\text{KL}}(\kappa \,\|\, 1 - \hat{F}_{\text{avg}}; \{\zeta_i\}_{i=0}^{B}) = \sum_{i=0}^{B-1} (p_i \log p_i - p_i \log q_i),$$

where $\kappa$ is the survival function estimated by using the Kaplan-Meier estimator (Kaplan & Meier, 1958), $\hat{F}_{\text{avg}}$ is the average of the estimated CDFs of all patients, $p_i = \kappa(\zeta_i) - \kappa(\zeta_{i+1})$, and $q_i = (1 - \hat{F}_{\text{avg}}(\zeta_i)) - (1 - \hat{F}_{\text{avg}}(\zeta_{i+1}))$, so that $p_i$ and $q_i$ are the probability masses assigned to the interval $(\zeta_i, \zeta_{i+1}]$. (In this computation, we assume that $\kappa(\zeta_B) = 0$.) This metric is the Kullback-Leibler divergence between $\kappa(t)$ and the average of the estimated survival functions $1 - \hat{F}_{\text{avg}}(t)$. It is based on the observation that the model's predicted number of events within any time interval should be similar to the observed number (Goldstein et al., 2020).

We note here that calibration is particularly important for survival analysis, especially in healthcare applications. If a prediction model is miscalibrated, its predictions are pessimistic or optimistic as a whole compared with the actual outcomes. If medical doctors were to make decisions on patient treatments on the basis of such a miscalibrated prediction model, the treatments could be harmful for patients because of the pessimistic or optimistic predictions. Calster et al. (2019) extensively discuss the importance of calibration in survival analysis.
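KM-calibration can be sketched in plain Python. The sketch below is illustrative only: it uses a direct, quadratic-time Kaplan-Meier computation, it assumes that $p_i$ and $q_i$ denote the non-negative probability masses of the interval $(\zeta_i, \zeta_{i+1}]$, and the function names are hypothetical. Bins with $p_i = 0$ contribute nothing; a bin with $p_i > 0$ and $q_i = 0$ raises an error, reflecting an infinite divergence.

```python
import math
from typing import List, Sequence


def kaplan_meier(zs: Sequence[float], deltas: Sequence[int],
                 grid: Sequence[float]) -> List[float]:
    """Kaplan-Meier survival estimates kappa(g) for each grid point g."""
    event_times = sorted({z for z, d in zip(zs, deltas) if d == 1})
    kappa = []
    for g in grid:
        surv = 1.0
        for u in event_times:
            if u > g:
                break
            at_risk = sum(1 for z in zs if z >= u)
            events = sum(1 for z, d in zip(zs, deltas) if z == u and d == 1)
            surv *= 1.0 - events / at_risk
        kappa.append(surv)
    return kappa


def km_calibration(kappa: Sequence[float], F_avg: Sequence[float]) -> float:
    """KL divergence between binned KM masses and binned average-CDF masses."""
    kappa = list(kappa)
    kappa[-1] = 0.0  # the text assumes kappa(zeta_B) = 0
    total = 0.0
    for i in range(len(kappa) - 1):
        p = kappa[i] - kappa[i + 1]
        q = F_avg[i + 1] - F_avg[i]
        if p > 0.0:
            total += p * (math.log(p) - math.log(q))
    return total


grid = [0.0, 1.0, 2.0, 3.0, 4.0]
kappa = kaplan_meier(zs=[1.0, 2.0, 3.0, 4.0], deltas=[1, 0, 1, 1], grid=grid)
perfect = km_calibration(kappa, [1.0 - k for k in kappa])
uniform = km_calibration(kappa, [0.0, 0.25, 0.5, 0.75, 1.0])
```

A model whose average predicted CDF matches the Kaplan-Meier curve on the grid scores zero, and any mismatch in the binned masses gives a positive value.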

6. EXPERIMENTS

In our experiments, we used three datasets for survival analysis from packages in R (R Core Team, 2016): the flchain dataset (Dispenzieri et al., 2012), which was obtained from the 'survival' package and contains 7874 data points (69.9% of which are censored); the prostateSurvival dataset (Lu-Yao et al., 2009), which was obtained from the 'asaur' package and contains 14294 data points (71.7% of which are censored); and the support dataset (Knaus et al., 1995), which was obtained from the 'casebase' package and contains 9104 data points (31.9% of which are censored).

We compared the prediction performances of the extended scoring rules: $S_{\text{Cen-log}}$ (Eq. (4)), $S_{\text{Cen-Brier}}$ (Eq. (7)), $S_{\text{Cen-RPS}}$ (Eq. (10)), and $S_{\text{Portnoy}}$ (Eq. (2)). We used a neural network model with $B = 32$ to estimate $\hat{F}$, and we combined it with the EM algorithm to estimate $w$ or $\{w_i\}_{i=0}^{B-1}$. This means that, for $S_{\text{Portnoy}}$, we used the CQRNN algorithm (Pearce et al., 2022), which is the state-of-the-art model for quantile regression-based survival analysis. We used $S_{\text{Cen-simple-log}}$ (Eq. (5)) as a metric for discrimination performance. We used D-calibration (Haider et al., 2020) and KM-calibration as metrics for calibration performance, where we used 20 bins of equal length for D-calibration.

Table 1 shows the results, and each number shows the mean and standard deviation of the measurements of five-fold cross validation. These results showed that $S_{\text{Cen-log}}$ and $S_{\text{Cen-Brier}}$ performed the best. Note that the former is almost equal to the variant of the logarithmic score used in (Rindt et al., 2022), and that the latter is our new extension of the Brier score. Compared to these two scoring rules, the prediction performances of $S_{\text{Cen-RPS}}$ and $S_{\text{Portnoy}}$ were worse than expected, and their estimates did not seem to be close to the true probability distribution, even though we prove that they are conditionally proper scoring rules. A likely explanation is that the estimate $\hat{w}$ of the parameter $w$ obtained by the EM algorithm was not accurate enough for $S_{\text{Cen-RPS}}$ and $S_{\text{Portnoy}}$ to converge to the true probability distribution. As we illustrate in Figure 2, the parameter $w$ of $S_{\text{Cen-RPS}}$ (and $S_{\text{Portnoy}}$) is usually not close to zero, unlike that of $S_{\text{Cen-log}}$, and this fact indicates that it was difficult to find the correct $w$ for these two scoring rules.

7. CONCLUSION

We have discussed the extensions of the four scoring rules for survival analysis, and we have proved that these extensions are proper if the parameter $w$ or $\{w_i\}_{i=0}^{B-1}$ is correct. By using these scoring rules, we presented neural network models combined with the EM algorithm to estimate the parameter, and our experiments showed that the models with $S_{\text{Cen-log}}$ and $S_{\text{Cen-Brier}}$ performed the best. In addition, we clarified the hidden assumption in the proof of the variant of the logarithmic score for survival analysis (Rindt et al., 2022), which suggests using a sufficiently large $B$ when this variant is used as an evaluation metric. We believe that our approach of extending scoring rules for survival analysis can be used for many other known strictly proper scoring rules.

A PROOFS OF THEOREMS

We give proofs of the theorems, which are omitted from the main body of this paper.

A.1 PORTNOY'S ESTIMATOR

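We show a proof of Theorem 4.5.

Proof. We consider a fixed $c \sim C$, and we prove
\[
\mathbb{E}_{t \sim T \mid C=c}\big[S_{\mathrm{Portnoy}}(\hat{F}, (z,\delta); w, \tau)\big] \ge \mathbb{E}_{t \sim T \mid C=c}\big[S_{\mathrm{Portnoy}}(F, (z,\delta); w, \tau)\big] \tag{11}
\]
for these four cases separately.

• Case 1: $c \le \min\{F^{-1}(\tau), \hat{F}^{-1}(\tau)\}$.
• Case 2: $\max\{F^{-1}(\tau), \hat{F}^{-1}(\tau)\} < c$.
• Case 3: $F^{-1}(\tau) < c \le \hat{F}^{-1}(\tau)$.
• Case 4: $\hat{F}^{-1}(\tau) < c \le F^{-1}(\tau)$.

Note that, if Inequality (11) holds for any $c \sim C$, we can marginalize the inequality with respect to $C$, and we can prove
\[
\mathbb{E}_{t \sim T,\, c \sim C}\big[S_{\mathrm{Portnoy}}(\hat{F}, (z,\delta); w, \tau)\big] \ge \mathbb{E}_{t \sim T,\, c \sim C}\big[S_{\mathrm{Portnoy}}(F, (z,\delta); w, \tau)\big],
\]
which means that $S_{\mathrm{Portnoy}}(\hat{F}, (z,\delta); w, \tau)$ is proper. Therefore, we prove Inequality (11) for the four cases.

Case 1. We prove the case for $c \le \min\{F^{-1}(\tau), \hat{F}^{-1}(\tau)\}$. This means that $\tau_c \le \tau$ and $w = (\tau - \tau_c)/(1 - \tau_c)$. Hence, we have
\[
\begin{aligned}
S_{\mathrm{Portnoy}}(\hat{F}, (z,\delta); w, \tau)
&= \begin{cases}
\rho_\tau(\hat{F}^{-1}(\tau), t) & \text{if } t \le c, \\
w\rho_\tau(\hat{F}^{-1}(\tau), c) + (1-w)\rho_\tau(\hat{F}^{-1}(\tau), z_\infty) & \text{if } t > c,
\end{cases} \\
&= \begin{cases}
(1-\tau)(\hat{F}^{-1}(\tau) - t) & \text{if } t \le c, \\
-\tau_c(1-\tau)(\hat{F}^{-1}(\tau) - t)/(1-\tau_c) & \text{if } t > c.
\end{cases}
\end{aligned}
\]
By Assumption 3.1, we have $\Pr(t \le c \mid C=c) = \Pr(t \le c) = \tau_c$ and $\Pr(t > c \mid C=c) = 1 - \tau_c$. Hence, we have
\[
\begin{aligned}
\mathbb{E}_{t \sim T \mid C=c}\big[S_{\mathrm{Portnoy}}(\hat{F}, (z,\delta); w, \tau)\big]
&= \Pr(t \le c \mid C=c)(1-\tau)\hat{F}^{-1}(\tau) - (1-\tau)\,\mathbb{E}_{t \sim T \mid C=c,\, t \le c}[t] \\
&\quad - \Pr(t > c \mid C=c)\,\tau_c(1-\tau)\hat{F}^{-1}(\tau)/(1-\tau_c) + \frac{\tau_c(1-\tau)}{1-\tau_c}\,\mathbb{E}_{t \sim T \mid C=c,\, t > c}[t] \\
&= -(1-\tau)\,\mathbb{E}_{t \sim T \mid C=c,\, t \le c}[t] + \frac{\tau_c(1-\tau)}{1-\tau_c}\,\mathbb{E}_{t \sim T \mid C=c,\, t > c}[t].
\end{aligned}
\]
Since this value is the same for $S_{\mathrm{Portnoy}}(\hat{F}, (z,\delta); w, \tau)$ and $S_{\mathrm{Portnoy}}(F, (z,\delta); w, \tau)$, we have
\[
\mathbb{E}_{t \sim T \mid C=c}\big[S_{\mathrm{Portnoy}}(\hat{F}, (z,\delta); w, \tau)\big] = \mathbb{E}_{t \sim T \mid C=c}\big[S_{\mathrm{Portnoy}}(F, (z,\delta); w, \tau)\big].
\]

Case 2. We prove the case for $\max\{F^{-1}(\tau), \hat{F}^{-1}(\tau)\} < c$. If $F^{-1}(\tau) \le \hat{F}^{-1}(\tau) < c$, then we have
\[
\begin{aligned}
S_{\mathrm{Portnoy}}(\hat{F}, (z,\delta); w, \tau)
&= \begin{cases}
\rho_\tau(\hat{F}^{-1}(\tau), t) & \text{if } t \le c, \\
w\rho_\tau(\hat{F}^{-1}(\tau), c) + (1-w)\rho_\tau(\hat{F}^{-1}(\tau), z_\infty) & \text{if } t > c,
\end{cases} \\
&= \begin{cases}
(1-\tau)(\hat{F}^{-1}(\tau) - t) & \text{if } t \le \hat{F}^{-1}(\tau), \\
-\tau(\hat{F}^{-1}(\tau) - t) & \text{if } \hat{F}^{-1}(\tau) < t \le c, \\
-w\tau(\hat{F}^{-1}(\tau) - t) - (1-w)\tau(\hat{F}^{-1}(\tau) - t) & \text{if } t > c,
\end{cases} \\
&\ge \begin{cases}
(1-\tau)(\hat{F}^{-1}(\tau) - t) & \text{if } t \le F^{-1}(\tau), \\
-\tau(\hat{F}^{-1}(\tau) - t) & \text{if } F^{-1}(\tau) < t \le \hat{F}^{-1}(\tau), \\
-\tau(\hat{F}^{-1}(\tau) - t) & \text{if } \hat{F}^{-1}(\tau) < t \le c, \\
-\tau(\hat{F}^{-1}(\tau) - t) & \text{if } t > c,
\end{cases} \\
&= \begin{cases}
(1-\tau)(\hat{F}^{-1}(\tau) - t) & \text{if } t \le F^{-1}(\tau), \\
-\tau(\hat{F}^{-1}(\tau) - t) & \text{if } F^{-1}(\tau) < t,
\end{cases}
\end{aligned}
\]
where we used $-w\tau(\hat{F}^{-1}(\tau) - t) \le w(1-\tau)(\hat{F}^{-1}(\tau) - t)$ when $F^{-1}(\tau) < t \le \hat{F}^{-1}(\tau)$ and $w \ge 0$ for the inequality.

If $\hat{F}^{-1}(\tau) \le F^{-1}(\tau) < c$, then we have
\[
\begin{aligned}
S_{\mathrm{Portnoy}}(\hat{F}, (z,\delta); w, \tau)
&= \begin{cases}
\rho_\tau(\hat{F}^{-1}(\tau), t) & \text{if } t \le c, \\
w\rho_\tau(\hat{F}^{-1}(\tau), c) + (1-w)\rho_\tau(\hat{F}^{-1}(\tau), z_\infty) & \text{if } t > c,
\end{cases} \\
&= \begin{cases}
(1-\tau)(\hat{F}^{-1}(\tau) - t) & \text{if } t \le \hat{F}^{-1}(\tau), \\
-\tau(\hat{F}^{-1}(\tau) - t) & \text{if } \hat{F}^{-1}(\tau) < t \le c, \\
-w\tau(\hat{F}^{-1}(\tau) - t) - (1-w)\tau(\hat{F}^{-1}(\tau) - t) & \text{if } t > c,
\end{cases} \\
&> \begin{cases}
(1-\tau)(\hat{F}^{-1}(\tau) - t) & \text{if } t \le \hat{F}^{-1}(\tau), \\
(1-\tau)(\hat{F}^{-1}(\tau) - t) & \text{if } \hat{F}^{-1}(\tau) < t \le F^{-1}(\tau), \\
-\tau(\hat{F}^{-1}(\tau) - t) & \text{if } F^{-1}(\tau) < t \le c, \\
-\tau(\hat{F}^{-1}(\tau) - t) & \text{if } t > c,
\end{cases} \\
&= \begin{cases}
(1-\tau)(\hat{F}^{-1}(\tau) - t) & \text{if } t \le F^{-1}(\tau), \\
-\tau(\hat{F}^{-1}(\tau) - t) & \text{if } F^{-1}(\tau) < t,
\end{cases}
\end{aligned}
\]
where we used $-w\tau(\hat{F}^{-1}(\tau) - t) > w(1-\tau)(\hat{F}^{-1}(\tau) - t)$ when $\hat{F}^{-1}(\tau) < t \le F^{-1}(\tau)$ and $w \ge 0$ for the inequality.

By Assumption 3.1, we have $\Pr(t \le F^{-1}(\tau) \mid C=c) = \Pr(t \le F^{-1}(\tau)) = \tau$ and $\Pr(F^{-1}(\tau) < t \mid C=c) = 1 - \tau$. Hence, we have
\[
\begin{aligned}
\mathbb{E}_{t \sim T \mid C=c}\big[S_{\mathrm{Portnoy}}(\hat{F}, (z,\delta); w, \tau)\big]
&\ge \Pr(t \le F^{-1}(\tau) \mid C=c)(1-\tau)\hat{F}^{-1}(\tau) - (1-\tau)\,\mathbb{E}_{t \sim T \mid C=c,\, t \le F^{-1}(\tau)}[t] \\
&\quad - \Pr(F^{-1}(\tau) < t \mid C=c)\,\tau\hat{F}^{-1}(\tau) + \tau\,\mathbb{E}_{t \sim T \mid C=c,\, F^{-1}(\tau) < t}[t] \\
&= -(1-\tau)\,\mathbb{E}_{t \sim T \mid C=c,\, t \le F^{-1}(\tau)}[t] + \tau\,\mathbb{E}_{t \sim T \mid C=c,\, F^{-1}(\tau) < t}[t].
\end{aligned}
\]
By using a similar argument, we have
\[
\mathbb{E}_{t \sim T \mid C=c}\big[S_{\mathrm{Portnoy}}(F, (z,\delta); w, \tau)\big] = -(1-\tau)\,\mathbb{E}_{t \sim T \mid C=c,\, t \le F^{-1}(\tau)}[t] + \tau\,\mathbb{E}_{t \sim T \mid C=c,\, F^{-1}(\tau) < t}[t].
\]
Note that this equation holds with equality. Hence, we have
\[
\mathbb{E}_{t \sim T \mid C=c}\big[S_{\mathrm{Portnoy}}(\hat{F}, (z,\delta); w, \tau)\big] \ge \mathbb{E}_{t \sim T \mid C=c}\big[S_{\mathrm{Portnoy}}(F, (z,\delta); w, \tau)\big].
\]

Under review as a conference paper at ICLR 2023

Case 3. We prove the case for $F^{-1}(\tau) < c \le \hat{F}^{-1}(\tau)$. We have
\[
\begin{aligned}
S_{\mathrm{Portnoy}}(\hat{F}, (z,\delta); w, \tau)
&= \begin{cases}
\rho_\tau(\hat{F}^{-1}(\tau), t) & \text{if } t \le c, \\
w\rho_\tau(\hat{F}^{-1}(\tau), c) + (1-w)\rho_\tau(\hat{F}^{-1}(\tau), z_\infty) & \text{if } t > c,
\end{cases} \\
&= \begin{cases}
(1-\tau)(\hat{F}^{-1}(\tau) - t) & \text{if } t \le c, \\
w(1-\tau)(\hat{F}^{-1}(\tau) - c) - (1-w)\tau(\hat{F}^{-1}(\tau) - c) & \text{if } t > c,
\end{cases} \\
&\ge \begin{cases}
(1-\tau)(\hat{F}^{-1}(\tau) - t) & \text{if } t \le F^{-1}(\tau), \\
-\tau(\hat{F}^{-1}(\tau) - t) & \text{if } F^{-1}(\tau) < t \le c, \\
-\tau(\hat{F}^{-1}(\tau) - c) & \text{if } t > c,
\end{cases}
\end{aligned}
\]
where we used $-w\tau(\hat{F}^{-1}(\tau) - t) \le w(1-\tau)(\hat{F}^{-1}(\tau) - t)$ when $F^{-1}(\tau) < t \le c \le \hat{F}^{-1}(\tau)$ and $w \ge 0$, and $w(1-\tau)(\hat{F}^{-1}(\tau) - c) > -w\tau(\hat{F}^{-1}(\tau) - c)$ when $\hat{F}^{-1}(\tau) > t > c$ and $w \ge 0$ for the inequality.

By Assumption 3.1, we have $\Pr(t \le F^{-1}(\tau) \mid C=c) = \Pr(t \le F^{-1}(\tau)) = \tau$, $\Pr(F^{-1}(\tau) < t \le c \mid C=c) = \tau_c - \tau$, and $\Pr(t > c \mid C=c) = 1 - \tau_c$. Hence, we have
\[
\begin{aligned}
\mathbb{E}_{t \sim T \mid C=c}\big[S_{\mathrm{Portnoy}}(\hat{F}, (z,\delta); w, \tau)\big]
&\ge \Pr(t \le F^{-1}(\tau) \mid C=c)(1-\tau)\hat{F}^{-1}(\tau) - (1-\tau)\,\mathbb{E}_{t \sim T \mid C=c,\, t \le F^{-1}(\tau)}[t] \\
&\quad - \Pr(F^{-1}(\tau) < t \le c \mid C=c)\,\tau\hat{F}^{-1}(\tau) + \tau\,\mathbb{E}_{t \sim T \mid C=c,\, F^{-1}(\tau) < t \le c}[t] \\
&\quad - \Pr(t > c \mid C=c)\,\tau\hat{F}^{-1}(\tau) + \tau c \\
&= -(1-\tau)\,\mathbb{E}_{t \sim T \mid C=c,\, t \le F^{-1}(\tau)}[t] + \tau\,\mathbb{E}_{t \sim T \mid C=c,\, F^{-1}(\tau) < t \le c}[t] + \tau c.
\end{aligned}
\]
By using a similar argument, we have
\[
S_{\mathrm{Portnoy}}(F, (z,\delta); w, \tau) = \begin{cases}
(1-\tau)(F^{-1}(\tau) - t) & \text{if } t \le F^{-1}(\tau), \\
-\tau(F^{-1}(\tau) - t) & \text{if } F^{-1}(\tau) < t \le c, \\
-\tau(F^{-1}(\tau) - c) & \text{if } t > c,
\end{cases}
\]
and so we have
\[
\mathbb{E}_{t \sim T \mid C=c}\big[S_{\mathrm{Portnoy}}(F, (z,\delta); w, \tau)\big] = -(1-\tau)\,\mathbb{E}_{t \sim T \mid C=c,\, t \le F^{-1}(\tau)}[t] + \tau\,\mathbb{E}_{t \sim T \mid C=c,\, F^{-1}(\tau) < t \le c}[t] + \tau c.
\]
Note that this equation holds with equality. Hence, we have
\[
\mathbb{E}_{t \sim T \mid C=c}\big[S_{\mathrm{Portnoy}}(\hat{F}, (z,\delta); w, \tau)\big] \ge \mathbb{E}_{t \sim T \mid C=c}\big[S_{\mathrm{Portnoy}}(F, (z,\delta); w, \tau)\big].
\]

Case 4. We prove the case for $\hat{F}^{-1}(\tau) < c \le F^{-1}(\tau)$. We have
\[
\begin{aligned}
S_{\mathrm{Portnoy}}(\hat{F}, (z,\delta); w, \tau)
&= \begin{cases}
\rho_\tau(\hat{F}^{-1}(\tau), t) & \text{if } t \le c, \\
w\rho_\tau(\hat{F}^{-1}(\tau), c) + (1-w)\rho_\tau(\hat{F}^{-1}(\tau), z_\infty) & \text{if } t > c,
\end{cases} \\
&= \begin{cases}
(1-\tau)(\hat{F}^{-1}(\tau) - t) & \text{if } t \le \hat{F}^{-1}(\tau), \\
-\tau(\hat{F}^{-1}(\tau) - t) & \text{if } \hat{F}^{-1}(\tau) < t \le c, \\
-w\tau(\hat{F}^{-1}(\tau) - c) - (1-w)\tau(\hat{F}^{-1}(\tau) - c) & \text{if } t > c,
\end{cases} \\
&> \begin{cases}
(1-\tau)(\hat{F}^{-1}(\tau) - t) & \text{if } t \le \hat{F}^{-1}(\tau), \\
(1-\tau)(\hat{F}^{-1}(\tau) - t) & \text{if } \hat{F}^{-1}(\tau) < t \le c, \\
w(1-\tau)(\hat{F}^{-1}(\tau) - c) - (1-w)\tau(\hat{F}^{-1}(\tau) - c) & \text{if } t > c,
\end{cases} \\
&= \begin{cases}
(1-\tau)(\hat{F}^{-1}(\tau) - t) & \text{if } t \le c, \\
-\tau_c(1-\tau)(\hat{F}^{-1}(\tau) - c)/(1-\tau_c) & \text{if } t > c,
\end{cases}
\end{aligned}
\]
where we used $-w\tau(\hat{F}^{-1}(\tau) - t) > w(1-\tau)(\hat{F}^{-1}(\tau) - t)$ when $\hat{F}^{-1}(\tau) < t < c$ and $w \ge 0$, and $-w\tau(\hat{F}^{-1}(\tau) - c) > w(1-\tau)(\hat{F}^{-1}(\tau) - c)$ when $c > \hat{F}^{-1}(\tau)$ and $w \ge 0$ for the inequality, and $w = (\tau - \tau_c)/(1 - \tau_c)$ when $\tau_c \le \tau$ for the last equality.

By Assumption 3.1, we have $\Pr(t \le c \mid C=c) = \Pr(t \le c) = \tau_c$ and $\Pr(t > c \mid C=c) = 1 - \tau_c$. Hence, we have
\[
\begin{aligned}
\mathbb{E}_{t \sim T \mid C=c}\big[S_{\mathrm{Portnoy}}(\hat{F}, (z,\delta); w, \tau)\big]
&\ge \Pr(t \le c \mid C=c)(1-\tau)\hat{F}^{-1}(\tau) - (1-\tau)\,\mathbb{E}_{t \sim T \mid C=c,\, t \le c}[t] \\
&\quad - \Pr(t > c \mid C=c)\,\tau_c(1-\tau)\hat{F}^{-1}(\tau)/(1-\tau_c) + \frac{\tau_c(1-\tau)}{1-\tau_c}\,c \\
&= -(1-\tau)\,\mathbb{E}_{t \sim T \mid C=c,\, t \le c}[t] + \frac{\tau_c(1-\tau)}{1-\tau_c}\,c.
\end{aligned}
\]
By using a similar argument, we have
\[
\mathbb{E}_{t \sim T \mid C=c}\big[S_{\mathrm{Portnoy}}(F, (z,\delta); w, \tau)\big] = -(1-\tau)\,\mathbb{E}_{t \sim T \mid C=c,\, t \le c}[t] + \frac{\tau_c(1-\tau)}{1-\tau_c}\,c.
\]
Note that this equation holds with equality. Hence, we have
\[
\mathbb{E}_{t \sim T \mid C=c}\big[S_{\mathrm{Portnoy}}(\hat{F}, (z,\delta); w, \tau)\big] \ge \mathbb{E}_{t \sim T \mid C=c}\big[S_{\mathrm{Portnoy}}(F, (z,\delta); w, \tau)\big].
\]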
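As a worked check of the last equality in Case 4, the coefficient of $\hat{F}^{-1}(\tau) - c$ in the censored branch can be simplified using only the stated weight $w = (\tau - \tau_c)/(1 - \tau_c)$. The steps below are added for illustration and introduce no new symbols:

\[
w(1-\tau) - (1-w)\tau = w - \tau = \frac{\tau - \tau_c}{1 - \tau_c} - \tau = \frac{\tau - \tau_c - \tau + \tau\tau_c}{1 - \tau_c} = -\frac{\tau_c(1-\tau)}{1 - \tau_c}.
\]

Multiplying this coefficient by $\Pr(t > c \mid C = c) = 1 - \tau_c$ gives $-\tau_c(1-\tau)$, which cancels the contribution $\Pr(t \le c \mid C = c)(1-\tau) = \tau_c(1-\tau)$ of the uncensored branch. This is why $\hat{F}^{-1}(\tau)$ does not appear in the final expression of the expected score in Case 4.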

A.2 VARIANT OF LOGARITHMIC SCORE

We show a proof of Theorem 4.6.

Proof. We consider a fixed $c \sim C$, and let $t$ be a sample obtained from $T$. Let $i$ be the index such that $\zeta_i \le c < \zeta_{i+1}$. Since Assumption 3.1 holds, we have
\[
\begin{aligned}
\Pr(\zeta_j < t \le \zeta_{j+1} \mid C=c) &= \Pr(\zeta_j < t \le \zeta_{j+1}) = F(\zeta_{j+1}) - F(\zeta_j) = f_j \quad \text{for any } j < i, \\
\Pr(\zeta_i < t \le c \mid C=c) &= F(c) - F(\zeta_i), \\
\Pr(c < t \mid C=c) &= \Pr(c < t) = 1 - F(c).
\end{aligned}
\]
Hence, we have
\[
\begin{aligned}
\mathbb{E}_{t \sim T \mid C=c}\big[S_{\text{Cen-log}}(\hat{F}, (z,\delta); w, \{\zeta_i\}_{i=0}^{B})\big]
&= -\sum_{j<i} \Pr(\zeta_j < t \le \zeta_{j+1} \mid C=c) \log \hat{f}_j - \Pr(\zeta_i < t \le c \mid C=c) \log \hat{f}_i \\
&\quad - \Pr(c < t \mid C=c)\big(w \log \hat{f}_i + (1-w)\log(1 - \hat{F}(\zeta_{i+1}))\big) \\
&= -\sum_{j<i} f_j \log \hat{f}_j - (F(c) - F(\zeta_i)) \log \hat{f}_i \\
&\quad - (1 - F(c))\big(w \log \hat{f}_i + (1-w)\log(1 - \hat{F}(\zeta_{i+1}))\big) \\
&= -\sum_{j \le i} f_j \log \hat{f}_j - (1 - F(\zeta_{i+1})) \log(1 - \hat{F}(\zeta_{i+1})),
\end{aligned}
\]
where we used $w = (F(\zeta_{i+1}) - F(c))/(1 - F(c))$ for the last equality. Hence, we have
\[
\begin{aligned}
&\mathbb{E}_{t \sim T \mid C=c}\big[S_{\text{Cen-log}}(\hat{F}, (z,\delta); w, \{\zeta_i\}_{i=0}^{B})\big] - \mathbb{E}_{t \sim T \mid C=c}\big[S_{\text{Cen-log}}(F, (z,\delta); w, \{\zeta_i\}_{i=0}^{B})\big] \\
&\quad = -\sum_{j \le i} f_j (\log \hat{f}_j - \log f_j) - (1 - F(\zeta_{i+1}))\big(\log(1 - \hat{F}(\zeta_{i+1})) - \log(1 - F(\zeta_{i+1}))\big) \ge 0,
\end{aligned} \tag{12}
\]
where we used the fact that the Kullback-Leibler divergence between two probability distributions is non-negative for the inequality. This means that the inequality $-\sum_k p_k (\log \hat{p}_k - \log p_k) \ge 0$ holds for any two probability distributions $p_k$ and $\hat{p}_k$, and the equality holds only if $p_k = \hat{p}_k$ for all $k$. Here, we use an $(i+2)$-dimensional vector $p = (p_0, p_1, \ldots, p_{i+1})$, and we set $p_k = f_k$ for all $k \le i$ and $p_{i+1} = 1 - F(\zeta_{i+1})$; the vector $\hat{p}$ is constructed from $\hat{f}_k$ and $\hat{F}(\zeta_{i+1})$ in the same way. Note that the vectors $p$ and $\hat{p}$ constructed in this way are probability distributions (i.e., $\sum_k p_k = 1$).

Since Inequality (12) holds for any $c \sim C$, we marginalize the inequality with respect to $C$, and we have
\[
\mathbb{E}_{t \sim T,\, c \sim C}\big[S_{\text{Cen-log}}(\hat{F}, (z,\delta); w, \{\zeta_i\}_{i=0}^{B})\big] \ge \mathbb{E}_{t \sim T,\, c \sim C}\big[S_{\text{Cen-log}}(F, (z,\delta); w, \{\zeta_i\}_{i=0}^{B})\big],
\]
which means that $S_{\text{Cen-log}}(\hat{F}, (z,\delta))$ is proper. Moreover, the equality holds only if $\hat{F} = F$, and therefore, $S_{\text{Cen-log}}(\hat{F}, (z,\delta))$ is strictly proper.
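The algebra in this proof can be checked numerically. The following is a minimal sketch, not the authors' code: it evaluates the expected censored logarithmic score for one fixed censoring time, compares it with the closed form derived above, and confirms that the gap to the true distribution is non-negative. The function names, the four-bin example, and the choice $F(c) = 0.2$ are assumptions made only for illustration.

```python
import math

def expected_cen_log(f_true, f_hat, i, F_c):
    """Expected censored log score for a fixed censoring time c lying in bin i.

    f_true, f_hat: true and estimated bin masses (each list sums to 1).
    F_c: true CDF value F(c), between F(zeta_i) and F(zeta_{i+1}).
    """
    F_lo = sum(f_true[:i])             # F(zeta_i)
    F_hi = F_lo + f_true[i]            # F(zeta_{i+1})
    Fhat_hi = sum(f_hat[:i + 1])       # estimated CDF at zeta_{i+1}
    w = (F_hi - F_c) / (1.0 - F_c)     # weight used in the proof
    score = -sum(f_true[j] * math.log(f_hat[j]) for j in range(i))
    score -= (F_c - F_lo) * math.log(f_hat[i])
    score -= (1.0 - F_c) * (w * math.log(f_hat[i])
                            + (1.0 - w) * math.log(1.0 - Fhat_hi))
    return score

def closed_form(f_true, f_hat, i):
    """Right-hand side of the last equality in the proof."""
    F_hi = sum(f_true[:i + 1])
    Fhat_hi = sum(f_hat[:i + 1])
    head = -sum(f_true[j] * math.log(f_hat[j]) for j in range(i + 1))
    return head - (1.0 - F_hi) * math.log(1.0 - Fhat_hi)

# Hypothetical example: four bins, censoring time inside bin 1.
f_true = [0.1, 0.2, 0.3, 0.4]
f_hat = [0.25, 0.25, 0.25, 0.25]
i, F_c = 1, 0.2

lhs = expected_cen_log(f_true, f_hat, i, F_c)
rhs = closed_form(f_true, f_hat, i)
gap = lhs - expected_cen_log(f_true, f_true, i, F_c)
print(lhs, rhs, gap)
```

The gap equals the Kullback-Leibler divergence between the collapsed distributions $p$ and $\hat{p}$ used in the proof, so it is zero only when the two agree.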

A.3 VARIANT OF BRIER SCORE

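We show a proof of Theorem 4.7.

Proof. We consider a fixed $c \sim C$, and let $t$ be a sample obtained from $T$. Let $i$ be the index such that $\zeta_i < c \le \zeta_{i+1}$. Assuming that Assumption 3.1 holds, we have
\[
\begin{aligned}
\Pr(\zeta_j < t \le \zeta_{j+1} \mid C=c) &= \Pr(\zeta_j < t \le \zeta_{j+1}) = F(\zeta_{j+1}) - F(\zeta_j) = f_j \quad \text{for any } j < i, \\
\Pr(\zeta_i < t \le c \mid C=c) &= F(c) - F(\zeta_i), \\
\Pr(c < t \mid C=c) &= \Pr(c < t) = 1 - F(c).
\end{aligned}
\]
Hence, we have
\[
\begin{aligned}
&\mathbb{E}_{t \sim T \mid C=c}\big[S_{\text{Cen-Brier}}(\hat{F}, (z,\delta); \{w_i\}_{i=0}^{B-1}, \{\zeta_i\}_{i=0}^{B})\big] \\
&\quad = \sum_{j<i} \Pr(\zeta_j < t \le \zeta_{j+1} \mid C=c)\Big((1 - \hat{f}_j)^2 + \sum_{k \ne j} \hat{f}_k^2\Big)
 + \Pr(\zeta_i < t \le c \mid C=c)\Big((1 - \hat{f}_i)^2 + \sum_{k \ne i} \hat{f}_k^2\Big) \\
&\qquad + \Pr(c < t \mid C=c)\Big(w_i (1 - \hat{f}_i)^2 + (1 - w_i)\hat{f}_i^2 + \sum_{j<i} \hat{f}_j^2 + \sum_{j>i}\big(w_j (1 - \hat{f}_j)^2 + (1 - w_j)\hat{f}_j^2\big)\Big) \\
&\quad = \sum_{j<i} f_j \Big((1 - \hat{f}_j)^2 + \sum_{k \ne j} \hat{f}_k^2\Big)
 + (F(c) - F(\zeta_i))\Big((1 - \hat{f}_i)^2 + \sum_{k \ne i} \hat{f}_k^2\Big) \\
&\qquad + (1 - F(c))\Big(w_i (1 - \hat{f}_i)^2 + (1 - w_i)\hat{f}_i^2 + \sum_{j<i} \hat{f}_j^2 + \sum_{j>i}\big(w_j (1 - \hat{f}_j)^2 + (1 - w_j)\hat{f}_j^2\big)\Big) \\
&\quad = \sum_j \big(\hat{f}_j^2 - 2 f_j \hat{f}_j\big) + 1,
\end{aligned}
\]
where we used
\[
w_i = \begin{cases}
0 & \text{if } \delta = 1 \text{ and } \zeta_{i+1} < z = t, \\
1 & \text{if } \delta = 1 \text{ and } \zeta_i < z = t \le \zeta_{i+1}, \\
0 & \text{if } z \le \zeta_i
\end{cases}
\]
for the first equality and
\[
w_i = \begin{cases}
(F(\zeta_{i+1}) - F(c))/(1 - F(c)) & \text{if } \delta = 0 \text{ and } \zeta_i < z = c \le \zeta_{i+1}, \\
f_i/(1 - F(c)) & \text{if } \delta = 0 \text{ and } \zeta_{i+1} < z = c
\end{cases}
\]
for the last equality, together with $\sum_j f_j = 1$. Hence, we have
\[
\mathbb{E}_{t \sim T \mid C=c}\big[S_{\text{Cen-Brier}}(\hat{F}, (z,\delta))\big] - \mathbb{E}_{t \sim T \mid C=c}\big[S_{\text{Cen-Brier}}(F, (z,\delta))\big]
= \sum_j \big(\hat{f}_j^2 - f_j^2 - 2 f_j (\hat{f}_j - f_j)\big) = \sum_j (\hat{f}_j - f_j)^2 \ge 0. \tag{13}
\]
Note that the equality holds only if $\hat{f}_j = f_j$ holds for all $j$. Since Inequality (13) holds for any $c \sim C$, we have
\[
\mathbb{E}_{t \sim T,\, c \sim C}\big[S_{\text{Cen-Brier}}(\hat{F}, (z,\delta))\big] \ge \mathbb{E}_{t \sim T,\, c \sim C}\big[S_{\text{Cen-Brier}}(F, (z,\delta))\big],
\]
which means that $S_{\text{Cen-Brier}}(\hat{F}, (z,\delta))$ is strictly proper.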
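The identity behind this proof can also be checked numerically. The sketch below is an illustration rather than the authors' implementation. It assumes, following the sums in the expectation above, that for a censored sample the weight is 0 for bins before the censored bin $i$, $(F(\zeta_{i+1}) - F(c))/(1 - F(c))$ for bin $i$, and $f_j/(1 - F(c))$ for bins $j > i$. All names and the four-bin example are hypothetical.

```python
def brier_uncensored(f_hat, j):
    """Brier score when the event is observed in bin j."""
    return sum((1.0 - p) ** 2 if k == j else p ** 2
               for k, p in enumerate(f_hat))

def expected_cen_brier(f_true, f_hat, i, F_c):
    """Expected censored Brier score for a fixed censoring time c in bin i."""
    F_lo = sum(f_true[:i])           # F(zeta_i)
    F_hi = F_lo + f_true[i]          # F(zeta_{i+1})
    # Events observed before the censoring time.
    total = sum(f_true[j] * brier_uncensored(f_hat, j) for j in range(i))
    total += (F_c - F_lo) * brier_uncensored(f_hat, i)
    # Censored samples: weighted mixture of "event in bin j" and "no event".
    censored = 0.0
    for j, p in enumerate(f_hat):
        if j < i:
            w = 0.0                              # assumed: bin already passed
        elif j == i:
            w = (F_hi - F_c) / (1.0 - F_c)
        else:
            w = f_true[j] / (1.0 - F_c)
        censored += w * (1.0 - p) ** 2 + (1.0 - w) * p ** 2
    return total + (1.0 - F_c) * censored

# Hypothetical example: four bins, censoring time inside bin 1.
f_true = [0.1, 0.2, 0.3, 0.4]
f_hat = [0.25, 0.25, 0.25, 0.25]
i, F_c = 1, 0.2

gap = (expected_cen_brier(f_true, f_hat, i, F_c)
       - expected_cen_brier(f_true, f_true, i, F_c))
squared_error = sum((q - p) ** 2 for p, q in zip(f_true, f_hat))
print(gap, squared_error)
```

In this example both printed values agree (up to floating-point error), matching the statement that the expected score gap equals $\sum_j (\hat{f}_j - f_j)^2$.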

B ADDITIONAL EXPERIMENTS

In this section, we report the results of our additional experiments. The source code used in our experiments is attached as supplementary material. In our experiments, we used the flchain, prostateSurvival, and support datasets, and we split the data points into training (60%), validation (20%), and test (20%) sets.

Next, we computed the prediction performance of several loss functions used in state-of-the-art neural network models. The loss function of DeepHit (Lee et al., 2018) consists of two terms. The first term is equal to the extension of the logarithmic score $S_{\text{Cen-log-simple}}$, and the second term is used to improve a ranking metric between patients. The parameter α controls the balance between these two terms: a larger α gives more weight to the second term. The loss function of DRSA (Ren et al., 2019) can also be seen as a variant of the logarithmic score, and we set its parameter to α = 0.25. S-CRPS (Avati et al., 2019) is a variant of the ranked probability score, but Rindt et al. (2022) showed that this scoring rule is not proper in the sense of the theory of scoring rules. We also implemented the IPCW BS(t) game model proposed by Han et al. (2021). Table 5 shows the results. The prediction performance of DeepHit degraded as α increased, which means that it is better to use $S_{\text{Cen-log-simple}}$ alone by setting α = 0. The other prediction models did not outperform $S_{\text{Cen-log}}$ and $S_{\text{Cen-Brier}}$.

Finally, we show an ablation study on training the models with and without the EM algorithm. Figure 3 shows the average survival functions for B = 32, that is, the average of the estimated survival function $1 - \hat{F}(t)$ over all patients in the test dataset. Without the EM algorithm, the parameter $w$ (or $\{w_i\}_{i=0}^{B-1}$) is included in the computation of the gradient during neural network training, whereas with the EM algorithm the parameter is handled as a constant. The actual survival functions were estimated by the Kaplan-Meier estimator. These results showed that the average predictions for the extension of the logarithmic score were close to the Kaplan-Meier curve regardless of the use of the EM algorithm for the three datasets. As for the other three estimators, the average predictions with the EM algorithm were closer to the Kaplan-Meier curves than those without it. These results indicate that the EM algorithm is needed for all the extensions except that of the logarithmic score.
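For readers unfamiliar with the reference curve used in this comparison, the following is a minimal sketch of the standard Kaplan-Meier product-limit estimator in plain Python. It is not taken from the paper's supplementary code. The function name and the toy data are assumptions for illustration, and ties are handled by counting censored samples at time $t$ as still at risk at $t$.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one event.

    times:  observed times z (event or censoring time).
    events: 1 if the event was observed (delta = 1), 0 if censored.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    idx = 0
    while idx < len(data):
        t = data[idx][0]
        n_events = 0
        n_removed = 0
        # Group all samples sharing the same observed time.
        while idx < len(data) and data[idx][0] == t:
            n_events += data[idx][1]
            n_removed += 1
            idx += 1
        if n_events > 0:
            survival *= 1.0 - n_events / n_at_risk
            curve.append((t, survival))
        n_at_risk -= n_removed
    return curve

# Toy data (hypothetical): five patients, two of them censored.
curve = kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0])
print(curve)
```

Averaging each patient's predicted $1 - \hat{F}(t)$ on a common time grid and plotting it against such a step curve gives the kind of comparison described above.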



Figure 1: Two types of discretization of probability distribution F (t) with B = 5

Definition 4.3. A scoring rule $S(\hat{F}, (z,\delta))$ is proper if $S(\hat{F}; T, C) \ge S(F; T, C)$ holds.

Definition 4.4. A scoring rule $S(\hat{F}, (z,\delta))$ is strictly proper if $S(\hat{F}; T, C) \ge S(F; T, C)$ holds and the equality holds only when $\hat{F} = F$.

Figure 2: Illustrations of computations of w

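\[
\mathbb{E}_{t \sim T \mid C=c}\big[S_{\mathrm{Portnoy}}(\hat{F}, (z,\delta); w, \tau)\big] \ge \mathbb{E}_{t \sim T \mid C=c}\big[S_{\mathrm{Portnoy}}(F, (z,\delta); w, \tau)\big].
\]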

Prediction performances (lower is better) of extended scoring rules with B = 32

Prediction performances (lower is better) with various loss functions for B = 32

Metric            Loss function       flchain           prostateSurvival  support
S_Cen-log-simple  DeepHit (α = 0.1)   1.5200 ± 0.0398   1.3644 ± 0.0293   1.8481 ± 0.0453
                  DeepHit (α = 1)     1.5858 ± 0.0495   1.3813 ± 0.0318   1.9996 ± 0.0525
                  DeepHit (α = 10)    2.0313 ± 0.1648   1.5688 ± 0.0823   2.3657 ± 0.0441
                  DRSA                1.6783 ± 0.0393   1.4631 ± 0.0273   2.0342 ± 0.0452
                  S-CRPS              2.0470 ± 0.1575   1.4589 ± 0.0442   2.1162 ± 0.1095
                  IPCW BS(t) game     1.9265 ± 0.1093   1.6413 ± 0.0743   2.3581 ± 0.1604
                  S_Cen-log           1.5054 ± 0.0508   1.3608 ± 0.0295   1.8307 ± 0.0452
                  S_Cen-Brier         1.5137 ± 0.0557   1.3680 ± 0.0291   1.8467 ± 0.0448
                  S_Cen-RPS           1.6737 ± 0.0821   1.4821 ± 0.0639   2.1036 ± 0.1012
                  S_Portnoy           1.6641 ± 0.0518   1.4352 ± 0.0420   2.0645 ± 0.0455
D-calibration     DeepHit (α = 0.1)   0.0005 ± 0.0002   0.0001 ± 0.0000   0.0056 ± 0.0009
                  DeepHit (α = 1)     0.0008 ± 0.0003   0.0003 ± 0.0001   0.0062 ± 0.0010
                  DeepHit (α = 10)    0.0138 ± 0.0046   0.0064 ± 0.0035   0.0179 ± 0.0053
                  DRSA                0.0043 ± 0.0011   0.0047 ± 0.0004   0.0057 ± 0.0006
                  S-CRPS              0.0032 ± 0.0005   0.0018 ± 0.0004   0.0072 ± 0.0011
                  IPCW BS(t) game     0.0022 ± 0.0006   0.0083 ± 0.0018   0.0060 ± 0.0008
                  S_Cen-log           0.0003 ± 0.0001   0.0001 ± 0.0000   0.0063 ± 0.0009
                  S_Cen-Brier         0.0004 ± 0.0002   0.0001 ± 0.0000   0.0071 ± 0.0009
                  S_Cen-RPS           0.0005 ± 0.0003   0.0010 ± 0.0005   0.0045 ± 0.0011
                  S_Portnoy           0.0071 ± 0.0031   0.0055 ± 0.0041   0.0237 ± 0.0037
KM-calibration    DeepHit (α = 0.1)   0.0264 ± 0.0071   0.0418 ± 0.0139   0.0249 ± 0.0067
                  DeepHit (α = 1)     0.0362 ± 0.0084   0.0599 ± 0.0341   0.0545 ± 0.0110
                  DeepHit (α = 10)    0.2077 ± 0.0543   0.4937 ± 0.1772   0.4273 ± 0.1188
                  DRSA                0.1929 ± 0.0135   0.1845 ± 0.0050   0.2103 ± 0.0162
                  S-CRPS              0.2759 ± 0.1279   0.6414 ± 0.3043   0.4090 ± 0.1499
                  IPCW BS(t) game     0.2770 ± 0.0789   0.4246 ± 0.0841   0.5325 ± 0.1342
                  S_Cen-log           0.0206 ± 0.0049   0.0312 ± 0.0084   0.0299 ± 0.0115
                  S_Cen-Brier         0.0268 ± 0.0071   0.0324 ± 0.0090   0.0492 ± 0.0125
                  S_Cen-RPS           0.1553 ± 0.0349   0.5931 ± 0.3846   0.2668 ± 0.1192
                  S_Portnoy           0.0434 ± 0.0067   0.1895 ± 0.1413   0.0809 ± 0.0381

Metric            Loss function       flchain           prostateSurvival  support
S_Cen-log-simple  S_Cen-log           6.4618 ± 0.1204   1.3460 ± 0.0476   1.5422 ± 0.0704
                  S_Cen-log-simple    6.4176 ± 0.1266   1.3447 ± 0.0451   1.5368 ± 0.0701
D-calibration     S_Cen-log           0.0045 ± 0.0004   0.0002 ± 0.0000   0.0370 ± 0.0032
                  S_Cen-log-simple    0.0127 ± 0.0013   0.0002 ± 0.0001   0.0349 ± 0.0024
KM-calibration    S_Cen-log           0.0048 ± 0.0026   0.0048 ± 0.0028   0.0057 ± 0.0027
                  S_Cen-log-simple    0.0614 ± 0.0081   0.0083 ± 0.0024   0.0061 ± 0.0033

Metric            Loss function       flchain           prostateSurvival  support
D-calibration     S_Cen-log           0.0003 ± 0.0001   0.0001 ± 0.0000   0.0063 ± 0.0009
                  S_Cen-log-simple    0.0003 ± 0.0001   0.0001 ± 0.0000   0.0062 ± 0.0012
KM-calibration    S_Cen-log           0.0206 ± 0.0049   0.0312 ± 0.0084   0.0299 ± 0.0115
                  S_Cen-log-simple    0.0213 ± 0.0049   0.0343 ± 0.0102   0.0288 ± 0.0127

For each dataset, we divided the time interval $[0, z_{\max} + \epsilon)$, where $\epsilon = 10^{-3}$, into $B - 1$ equal-length intervals to get the thresholds $\{\zeta_i\}_{i=0}^{B}$ for distribution regression-based survival analysis, and we divided the unit interval $[0, 1]$ into $B - 1$ equal-length intervals to get the quantile levels $\{\tau_i\}_{i=0}^{B}$ for quantile regression-based survival analysis.

All our experiments were conducted on a virtual machine with an Intel Xeon CPU (3.30 GHz), 64 GB of memory, and no GPU, running Red Hat Enterprise Linux Server 7.6. We used Python 3.7.4 and PyTorch 1.7.1 for the implementation.

We constructed a multi-layer perceptron (MLP) for our experiments. It consists of three hidden layers containing 128 neurons, and the number of outputs was $B$. The activation function after each hidden layer was the rectified linear unit (ReLU), and the activation function at the output node was softmax. Using softmax lets us satisfy the assumption that $\hat{F}(t)$ is a monotonically increasing continuous function. In distribution regression-based survival analysis, each output of the MLP estimates one of $\{\hat{f}_i\}_{i=0}^{B-1}$, and we can represent the function $\hat{F}(t)$ as a piecewise linear function connecting the values $\{\hat{F}(\zeta_i)\}_{i=0}^{B}$. Since $\hat{f}_i > 0$ holds for all $i$, $\hat{F}(t)$ estimated in this way is a monotonically increasing continuous function. We can estimate $\hat{F}$ for quantile regression-based survival analysis in a similar way.

First, we investigated the differences in prediction performance between $S_{\text{Cen-log}}$ (defined in Eq. (4)) and $S_{\text{Cen-log-simple}}$ (defined in Eq. (5)) by using $S_{\text{Cen-log-simple}}$, D-calibration, and KM-calibration as metrics. Tables 2, 3, and 4 show the results for B = 8, 16, and 32, respectively, where bold numbers emphasize the differences between the two scoring rules. These results showed that the prediction performance of the two scoring rules was similar for the prostateSurvival and support datasets even for B = 8. For the flchain dataset, however, the two scoring rules showed different prediction performance for B = 8 and B = 16, but the difference was negligible for B = 32. Therefore, we used B = 32 in the other experiments in this paper.
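The construction of $\hat{F}(t)$ from softmax outputs described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the PyTorch model used in the experiments. It assumes, for simplicity, that the time axis is split into exactly as many equal-length bins as there are outputs, which may differ from the paper's indexing of thresholds, and all function names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax; every output is strictly positive."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def make_thresholds(z_max, num_bins, eps=1e-3):
    """Equal-length thresholds covering [0, z_max + eps] (assumed layout)."""
    width = (z_max + eps) / num_bins
    return [k * width for k in range(num_bins + 1)]

def cdf_at(t, thresholds, masses):
    """Piecewise-linear estimated CDF through the cumulative bin masses."""
    if t <= thresholds[0]:
        return 0.0
    if t >= thresholds[-1]:
        return 1.0
    cumulative = 0.0
    for i, mass in enumerate(masses):
        lo, hi = thresholds[i], thresholds[i + 1]
        if t <= hi:
            # Linear interpolation inside bin i.
            return cumulative + mass * (t - lo) / (hi - lo)
        cumulative += mass
    return 1.0

# Hypothetical network outputs (logits) for one patient, with four bins.
masses = softmax([0.0, 1.0, 2.0, 0.5])
thresholds = make_thresholds(z_max=10.0, num_bins=4)
grid = [0.25 * k for k in range(41)]
values = [cdf_at(t, thresholds, masses) for t in grid]
print(values[:5])
```

Because every softmax output is strictly positive, each linear piece has positive slope, so the resulting curve is continuous and strictly increasing between the first and last thresholds.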

