NEURAL FRAILTY MACHINE: BEYOND PROPORTIONAL HAZARD ASSUMPTION IN NEURAL SURVIVAL REGRESSIONS

Abstract

We present the neural frailty machine (NFM), a powerful and flexible neural modeling framework for survival regressions. The NFM framework utilizes the classical idea of multiplicative frailty in survival analysis to capture unobserved heterogeneity among individuals, while leveraging the strong approximation power of neural architectures for handling nonlinear covariate dependence. Two concrete models are derived under the framework that extend neural proportional hazard models and nonparametric hazard regression models. Both models allow efficient training under the likelihood objective. Theoretically, for both proposed models, we establish statistical guarantees of neural function approximation with respect to nonparametric components by characterizing their rates of convergence. Empirically, we provide synthetic experiments that verify our theoretical statements. We also conduct experimental evaluations over 6 benchmark datasets of different scales, showing that the proposed NFM models outperform state-of-the-art survival models in terms of predictive performance.

1. INTRODUCTION

Regression analysis of time-to-event data (Kalbfleisch & Prentice, 2002) has been among the most important modeling tools for clinical studies and has witnessed a growing interest in areas like corporate finance (Duffie et al., 2009), recommendation systems (Jing & Smola, 2017), and computational advertising (Wu et al., 2015). The key feature that differentiates time-to-event data from other types of data is that they are often incompletely observed, with the most prevalent form of incompleteness being the right censoring mechanism (Kalbfleisch & Prentice, 2002). Under right censoring, the duration time of a sampled subject is (sometimes) only known to be larger than the observation time instead of being recorded precisely. It is well known in the survival analysis community that, even in the case of linear regression, naively discarding the censored observations produces estimation results that are statistically biased (Buckley & James, 1979) and also loses sample efficiency if the censoring proportion is high.

Cox's proportional hazard (CoxPH) model (Cox, 1972), using the convex objective of negative partial likelihood (Cox, 1975), is the de facto choice in modeling right censored time-to-event data (hereafter abbreviated as censored data). The model is semiparametric (Bickel et al., 1993) in the sense that the baseline hazard function needs no parametric assumptions. The original formulation of the CoxPH model assumes a linear form and therefore has limited flexibility, since the truth is not necessarily linear. Subsequent studies extended the CoxPH model to nonlinear variants using ideas from nonparametric regression (Huang, 1999; Cai et al., 2007; 2008), ensemble learning (Ishwaran et al., 2008), and neural networks (Faraggi & Simon, 1995; Katzman et al., 2018).
While such extensions allowed a more flexible nonlinear dependence structure with the covariates, the learning objectives were still derived under the proportional hazards (PH) assumption, which was shown to be inadequate in many real-world scenarios (Gray, 2000). The most notable case was the failure to model the phenomenon of crossing hazards (Stablein & Koutrouvelis, 1985). It is thus of significant interest to explore extensions of CoxPH that allow both nonlinear dependence on covariates and relaxations of the PH assumption.

Frailty models (Wienke, 2010; Duchateau & Janssen, 2007) are among the most important research topics in modern survival analysis, in that they provide a principled way of extending the CoxPH model by incorporating a multiplicative random effect to capture unobserved heterogeneity. The resulting parameterization contains many useful variants of CoxPH, like the proportional odds model (Bennett, 1983), under specific choices of frailty families. While the theory of frailty models is well established (Murphy, 1994; 1995; Parner, 1998; Kosorok et al., 2004), most of it focuses on the linear case. Recent developments on applying neural approaches to survival analysis (Katzman et al., 2018; Kvamme et al., 2019; Tang et al., 2022; Rindt et al., 2022) have shown promising results in terms of empirical predictive performance, but most of them lack theoretical discussion. Therefore, it is of significant interest to build more powerful frailty models by adopting techniques in modern deep learning (Goodfellow et al., 2016) with provable statistical guarantees.

In this paper, we present a general framework for neural extensions of frailty models called the neural frailty machine (NFM). Two concrete neural architectures are derived under the framework: The first one adopts the proportional frailty assumption, allowing an intuitive interpretation as a neural CoxPH model with a multiplicative random effect.
The second one further relaxes the proportional frailty assumption and could be viewed as an extension of nonparametric hazard regression (NHR) (Cox & O'Sullivan, 1990; Kooperberg et al., 1995), sometimes referred to as "fully neural" models in the context of neural survival analysis (Omi et al., 2019). We summarize our contributions as follows.

• We propose the neural frailty machine (NFM) framework as a principled way of incorporating unobserved heterogeneity into neural survival regression models. The framework includes many commonly used survival regression models as special cases.
• We derive two model architectures based on the NFM framework that extend neural CoxPH models and neural NHR models. Both models allow stochastic training and scale to large datasets.
• We show theoretical guarantees for the two proposed models by characterizing the rates of convergence of the proposed nonparametric function estimators. The proof technique is different from previous theoretical studies on neural survival analysis and is applicable to many other types of neural survival models.
• We conduct extensive studies on various benchmark datasets at different scales. Under standard performance metrics, both models are empirically shown to perform competitively, matching or outperforming state-of-the-art neural survival models.

2.1. NONLINEAR EXTENSIONS OF COXPH

Most nonlinear extensions of the CoxPH model stem from the equivalence of partial likelihood and semiparametric profile likelihood (Murphy & Van der Vaart, 2000) of the CoxPH model, resulting in nonlinear variants that essentially replace the linear term in the partial likelihood with a nonlinear one: Huang (1999) used smoothing splines, and Cai et al. (2007; 2008) used local polynomial regression (Fan & Gijbels, 1996). The empirical success of tree-based models inspired subsequent developments like Ishwaran et al. (2008) that equip tree-based models such as gradient boosting trees and random forests with losses in the form of negative log partial likelihood. Early developments of neural survival analysis (Faraggi & Simon, 1995) adopted similar extension strategies and obtained neural versions of partial likelihood. Later attempts (Katzman et al., 2018) suggest using the successful practice of stochastic training, which is believed to be at the heart of the empirical success of modern neural methods (Hardt et al., 2016). However, stochastic training under the partial likelihood objective is highly non-trivial, as mini-batch versions of the log partial likelihood (Katzman et al., 2018) no longer yield valid stochastic gradients of the full-sample log partial likelihood (Tang et al., 2022).

2.2. BEYOND COXPH IN SURVIVAL ANALYSIS

In linear survival modeling, there are standard alternatives to CoxPH such as the accelerated failure time (AFT) model (Buckley & James, 1979; Ying, 1993) , the extended hazard regression model (Etezadi-Amoli & Ciampi, 1987) , and the family of linear transformation models (Zeng & Lin, 2006) . While these models allow certain types of nonlinear extensions, the resulting form of (conditional) hazard function is still restricted to be of a specific form. The idea of nonparametric hazard regression (NHR) (Cox & O'Sullivan, 1990; Kooperberg et al., 1995; Strawderman & Tsiatis, 1996) further improves the flexibility of nonparametric survival analysis via directly modeling the conditional hazard function by nonparametric regression techniques such as spline approximation. Neural versions of NHR have been developed lately such as the CoxTime model Kvamme et al. (2019) . Rindt et al. (2022) used a neural network to approximate the conditional survival function and could be thus viewed as another trivial extension of NHR. Aside from developments in NHR, Lee et al. (2018) proposed a discrete-time model with its objective being a mix of the discrete likelihood and a rank-based score; Zhong et al. (2021a) proposed a neural version of the extended hazard model, unifying both neural CoxPH and neural AFT model; Tang et al. (2022) used an ODE approach to model the hazard and cumulative hazard functions.

2.3. THEORETICAL JUSTIFICATION OF NEURAL SURVIVAL MODELS

Despite the abundance of neural survival models, assessment of their theoretical properties remains nascent. Zhong et al. (2021b) developed minimax theories of the partially linear Cox model using neural networks as the functional approximator. Zhong et al. (2021a) provided convergence guarantees of neural estimates under the extended hazard model. The theoretical developments therein rely on specific forms of objective function (partial likelihood and kernel pseudo-likelihood) and are not directly applicable to the standard likelihood-based objective, which is frequently used in survival analysis.

3.1. THE NEURAL FRAILTY MACHINE FRAMEWORK

Let $\tilde{T} \geq 0$ be the event time of interest, with survival function denoted by $S(t) = P(\tilde{T} > t)$, associated with a feature (covariate) vector $Z \in \mathbb{R}^d$. Suppose that $\tilde{T}$ is a continuous random variable and let $f(t)$ be its density function. Then $\lambda(t) = f(t)/S(t)$ is the hazard function and $\Lambda(t) = \int_0^t \lambda(s)\,ds$ is the cumulative hazard function. Aside from the covariate $Z$, we use a positive scalar random variable $\omega \in \mathbb{R}_+$ to express the unobserved heterogeneity corresponding to individuals, or frailty. In this paper we assume the following generating scheme of $\tilde{T}$ via specifying its conditional hazard function:

$$\lambda(t \mid Z, \omega) = \omega\, \tilde{\nu}(t, Z). \tag{1}$$

Here $\tilde{\nu}$ is an unspecified non-negative function, and we let the distribution of $\omega$ be parameterized by a one-dimensional parameter $\theta \in \mathbb{R}$. The formulation (1) is quite general and contains several important models in both traditional and neural survival analysis:

1. When $\omega$ follows parametric distributional assumptions and $\tilde{\nu}(t, Z) = \lambda(t) e^{\beta^\top Z}$, (1) reduces to the standard proportional frailty model (Kosorok et al., 2004). A special case is when $\omega$ is degenerate, i.e., it has no randomness; the model then corresponds to the classic CoxPH model.
2. When $\omega$ is degenerate and $\tilde{\nu}$ is arbitrary, the model becomes equivalent to nonparametric hazard regression (NHR) (Cox & O'Sullivan, 1990; Kooperberg et al., 1995).

In NHR, the function parameter of interest is usually the logarithm of the (conditional) hazard function. In this paper we construct neural approximations to the logarithm of $\tilde{\nu}$, i.e., $\nu(t, Z) = \log \tilde{\nu}(t, Z)$. The resulting models are called Neural Frailty Machines (NFM). Depending on the prior knowledge of the function $\nu$, we propose two function approximation schemes.

The proportional frailty (PF) scheme assumes the dependence of $\nu$ on event time and covariates to be completely decoupled, i.e.,

$$\nu(t, Z) = h(t) + m(Z). \tag{2}$$

Proportional-style assumptions over hazard functions have been shown to be a useful inductive bias in survival analysis. We treat both $h$ and $m$ in (2) as function parameters, and devise two multi-layer perceptrons (MLPs) to approximate them separately.

The fully neural (FN) scheme imposes no a priori assumptions over $\nu$ and is the most general version of NFM. It is straightforward to see that the most commonly used survival models, such as CoxPH, AFT, EH, or PF models, are included in the proposed model space as special cases. We treat $\nu = \nu(t, Z)$ as the function parameter with input dimension $d + 1$ and use an MLP as the function approximator to $\nu$. Similar approximation schemes with respect to the hazard function have been proposed in some recent works (Omi et al., 2019; Rindt et al., 2022), referred to as "fully neural approaches", without theoretical characterizations.
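To make the difference between the two schemes concrete, the following is a minimal NumPy sketch, not the authors' implementation. The helper names (`init_mlp`, `mlp`, `nu_pf`, `nu_fn`), the layer widths, and the random weights are all hypothetical choices made for illustration; only the structure (two separate networks for the PF scheme, one network on the (d + 1)-dimensional input for the FN scheme) comes from the text.

```python
import numpy as np

def init_mlp(rng, sizes):
    """Random weights for an MLP with layer widths `sizes` (hypothetical helper)."""
    return [(rng.normal(scale=0.5, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp(params, x):
    """ReLU MLP with a scalar linear output layer."""
    for W, b in params[:-1]:
        x = np.maximum(x @ W + b, 0.0)
    W, b = params[-1]
    return (x @ W + b)[..., 0]

def nu_pf(h_params, m_params, t, Z):
    """PF scheme: nu(t, Z) = h(t) + m(Z), with one MLP per component."""
    return mlp(h_params, t[:, None]) + mlp(m_params, Z)

def nu_fn(nu_params, t, Z):
    """FN scheme: a single MLP on the (d + 1)-dimensional input (t, Z)."""
    return mlp(nu_params, np.concatenate([t[:, None], Z], axis=1))

rng = np.random.default_rng(0)
d = 5                                    # covariate dimension (assumed)
h_params = init_mlp(rng, [1, 16, 1])
m_params = init_mlp(rng, [d, 16, 1])
nu_params = init_mlp(rng, [d + 1, 16, 1])

t = rng.uniform(0.0, 1.0, size=8)
Z = rng.uniform(-1.0, 1.0, size=(8, d))
pf_vals = nu_pf(h_params, m_params, t, Z)   # shape (8,)
fn_vals = nu_fn(nu_params, t, Z)            # shape (8,)
```

Under the PF parameterization, the contrast nu(t1, Z) - nu(t2, Z) does not depend on Z, which is exactly the decoupling in (2); the FN parameterization carries no such restriction.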

The choice of frailty family

There are many commonly used families of frailty distributions (Kosorok et al., 2004; Duchateau & Janssen, 2007; Wienke, 2010) , among which the most popular one is the gamma frailty, where ω follows a gamma distribution with mean 1 and variance θ. We briefly introduce some other types of frailty families in appendix A.
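As a rough numerical illustration of the gamma frailty, the sketch below compares the closed-form frailty transform G_θ(x) = log(1 + θx)/θ (stated in appendix A) with a Monte Carlo estimate of the negative log Laplace transform of a gamma variable with mean 1 and variance θ. The function names `G_gamma` and `g_gamma`, the value θ = 0.5, and the sample size are assumptions made for illustration.

```python
import numpy as np

def G_gamma(x, theta):
    """Frailty transform of the gamma frailty with mean 1 and variance theta."""
    x = np.asarray(x, dtype=float)
    if theta == 0.0:
        return x                         # pointwise limit as theta -> 0 (no frailty)
    return np.log1p(theta * x) / theta

def g_gamma(x, theta):
    """Derivative of G_gamma with respect to x."""
    return 1.0 / (1.0 + theta * np.asarray(x, dtype=float))

rng = np.random.default_rng(0)
theta = 0.5                              # illustrative value
# A gamma variable with shape 1/theta and scale theta has mean 1 and variance theta.
omega = rng.gamma(shape=1.0 / theta, scale=theta, size=200_000)

x = np.array([0.1, 1.0, 3.0])
# Monte Carlo estimate of -log E[exp(-omega * x)].
G_mc = -np.log(np.exp(-omega[:, None] * x).mean(axis=0))
G_exact = G_gamma(x, theta)
```

The two arrays `G_mc` and `G_exact` should agree up to Monte Carlo error, and `G_gamma(x, theta)` approaches `x` as `theta` shrinks, which recovers the frailty-free case.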

3.2. PARAMETER LEARNING UNDER CENSORED OBSERVATIONS

In time-to-event modeling scenarios, event times are typically observed under right censoring. Let $C$ be the right censoring time, which is assumed to be conditionally independent of the event time $\tilde{T}$ given $Z$, i.e., $\tilde{T} \perp\!\!\!\perp C \mid Z$. In data collection, one observes the minimum of the survival time and the censoring time, that is, $T = \tilde{T} \wedge C$, as well as the censoring indicator $\delta = I(\tilde{T} \leq C)$, where $a \wedge b = \min(a, b)$ for constants $a$ and $b$, and $I(\cdot)$ stands for the indicator function. We assume $n$ independent and identically distributed (i.i.d.) copies of $(T, \delta, Z)$ are used as the training sample $(T_i, \delta_i, Z_i)$, $i \in [n]$, where we use $[n]$ to denote the set $\{1, 2, \ldots, n\}$. Additionally, we assume the unobserved frailties are independent and identically distributed, i.e., $\omega_i \overset{\text{i.i.d.}}{\sim} f_\theta(\omega)$, $i \in [n]$.

Next, we derive the learning procedure based on the observed log-likelihood (OLL) objective under both the PF and FN schemes. To obtain the observed likelihood, we first integrate the conditional survival function given the frailty over the frailty distribution:

$$S(t \mid Z) = E_{\omega \sim f_\theta}\left[ e^{-\omega \int_0^t e^{\nu(s, Z)}\,ds} \right] =: e^{-G_\theta\left( \int_0^t e^{\nu(s, Z)}\,ds \right)}. \tag{3}$$

Here the frailty transform $G_\theta(x) = -\log\left( E_{\omega \sim f_\theta}\left[ e^{-\omega x} \right] \right)$ is defined as the negative of the logarithm of the Laplace transform of the frailty distribution. The conditional cumulative hazard function is thus $\Lambda(t \mid Z) = G_\theta\left( \int_0^t e^{\nu(s, Z)}\,ds \right)$.

For the PF scheme of NFM, we use two MLPs $\hat{h} = \hat{h}(t; W^h, b^h)$ and $\hat{m} = \hat{m}(Z; W^m, b^m)$ as function approximators to $h$ and $m$, parameterized by $(W^h, b^h)$ and $(W^m, b^m)$, respectively. According to standard results on censored data likelihood (Kalbfleisch & Prentice, 2002), we write the learning objective under the PF scheme as:

$$L(W^h, b^h, W^m, b^m, \theta) = \frac{1}{n} \sum_{i \in [n]} \left[ \delta_i \log g_\theta\left( e^{\hat{m}(Z_i)} \int_0^{T_i} e^{\hat{h}(s)}\,ds \right) + \delta_i \hat{h}(T_i) + \delta_i \hat{m}(Z_i) - G_\theta\left( e^{\hat{m}(Z_i)} \int_0^{T_i} e^{\hat{h}(s)}\,ds \right) \right]. \tag{4}$$

Here we define $g_\theta(x) = \frac{\partial}{\partial x} G_\theta(x)$.
Let $(\hat{W}^h_n, \hat{b}^h_n, \hat{W}^m_n, \hat{b}^m_n, \hat{\theta}_n)$ be the maximizer of (4) and further denote $\hat{h}_n(t) = \hat{h}(t; \hat{W}^h_n, \hat{b}^h_n)$ and $\hat{m}_n(Z) = \hat{m}(Z; \hat{W}^m_n, \hat{b}^m_n)$. The resulting estimators for the conditional cumulative hazard and survival functions are:

$$\hat{\Lambda}_{\text{PF}}(t \mid Z) = G_{\hat{\theta}_n}\left( \int_0^t e^{\hat{h}_n(s) + \hat{m}_n(Z)}\,ds \right), \qquad \hat{S}_{\text{PF}}(t \mid Z) = e^{-\hat{\Lambda}_{\text{PF}}(t \mid Z)}. \tag{5}$$

For the FN scheme, we use $\hat{\nu} = \hat{\nu}(t, Z; W^\nu, b^\nu)$ to approximate $\nu(t, Z)$, parameterized by $(W^\nu, b^\nu)$. The OLL objective is written as:

$$L(W^\nu, b^\nu, \theta) = \frac{1}{n} \sum_{i \in [n]} \left[ \delta_i \log g_\theta\left( \int_0^{T_i} e^{\hat{\nu}(s, Z_i; W^\nu, b^\nu)}\,ds \right) + \delta_i \hat{\nu}(T_i, Z_i; W^\nu, b^\nu) - G_\theta\left( \int_0^{T_i} e^{\hat{\nu}(s, Z_i; W^\nu, b^\nu)}\,ds \right) \right]. \tag{6}$$

Let $(\hat{W}^\nu_n, \hat{b}^\nu_n, \hat{\theta}_n)$ be the maximizer of (6), and further denote $\hat{\nu}_n(t, Z) = \hat{\nu}(t, Z; \hat{W}^\nu_n, \hat{b}^\nu_n)$. The conditional cumulative hazard and survival functions are therefore estimated as:

$$\hat{\Lambda}_{\text{FN}}(t \mid Z) = G_{\hat{\theta}_n}\left( \int_0^t e^{\hat{\nu}_n(s, Z)}\,ds \right), \qquad \hat{S}_{\text{FN}}(t \mid Z) = e^{-\hat{\Lambda}_{\text{FN}}(t \mid Z)}. \tag{7}$$

The evaluation of objectives like (6) and their gradients requires computing a definite integral of an exponentially transformed MLP function. Instead of using exact computations, which are available only for a restricted class of activation functions and network structures, we use numerical integration for such evaluations, using the method of Clenshaw-Curtis quadrature (Boyd, 2001), which has shown competitive performance and efficiency in recent applications to monotonic neural networks (Wehenkel & Louppe, 2019).

Remark 1. The interpretation of frailty terms differs in the two schemes. In the PF scheme, introducing the frailty effect strictly increases the modeling capability (i.e., the capability of modeling crossing hazards) in comparison to CoxPH or neural variants of CoxPH (Kosorok et al., 2004). In the FN scheme, it is arguable that in the i.i.d. case, the marginal hazard function is a reparameterization of the hazard function in the context of NHR.
Therefore, we view the incorporation of the frailty effect as injecting a domain-specific inductive bias that has proven to be useful in survival analysis and time-to-event regression modeling, and we verify this claim empirically in section 5.2. Moreover, frailty becomes especially helpful when handling correlated or clustered data, where the frailty term is assumed to be shared among certain groups of individuals (Parner, 1998). Extending NFM to such scenarios is valuable and we leave it to future work.
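The PF objective of section 3.2 can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions rather than the authors' code: it fixes the gamma frailty transform, it substitutes Gauss-Legendre quadrature (via `numpy.polynomial.legendre.leggauss`) for the Clenshaw-Curtis rule used in the paper, and it takes `h` and `m` as arbitrary callables instead of MLPs. The names `integral_exp` and `oll_pf`, and the toy data, are hypothetical.

```python
import numpy as np

def G(x, theta):
    """Gamma frailty transform G_theta(x) = log(1 + theta x) / theta."""
    return np.log1p(theta * x) / theta

def log_g(x, theta):
    """log of g_theta(x) = dG/dx = 1 / (1 + theta x)."""
    return -np.log1p(theta * x)

def integral_exp(h, T, n_nodes=32):
    """Approximate int_0^{T_i} exp(h(s)) ds for every T_i (Gauss-Legendre)."""
    nodes, weights = np.polynomial.legendre.leggauss(n_nodes)   # rule on [-1, 1]
    s = 0.5 * T[:, None] * (nodes[None, :] + 1.0)               # map to [0, T_i]
    return 0.5 * T * (np.exp(h(s)) * weights).sum(axis=1)

def oll_pf(h, m, theta, T, delta, Z):
    """Observed log-likelihood of the PF scheme, averaged over the sample."""
    x = np.exp(m(Z)) * integral_exp(h, T)
    return np.mean(delta * (log_g(x, theta) + h(T) + m(Z)) - G(x, theta))

# Toy data; h and m are simple closed-form stand-ins for the two MLPs.
rng = np.random.default_rng(0)
n, d = 200, 5
beta = np.array([0.1, 0.2, 0.3, 0.4, 0.5])
Z = rng.uniform(0.0, 1.0, size=(n, d))
T = rng.uniform(0.1, 2.0, size=n)
delta = rng.integers(0, 2, size=n).astype(float)

h = lambda s: s
m = lambda z: z @ beta
value = oll_pf(h, m, 1.0, T, delta, Z)
```

With h(s) = s the inner integral has the closed form exp(T) - 1, which gives a convenient check on the quadrature. In a training loop the same expression would be written in an automatic differentiation framework so that gradients flow through the quadrature nodes.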

4. THEORETICAL RESULTS

In this section, we present theoretical properties of both NFM estimates by characterizing their rates of convergence when the underlying event data follow the corresponding model assumptions. The proof technique is based on the method of sieves (Shen & Wong, 1994; Shen, 1997; Chen, 2007), which views neural networks as a special kind of nonlinear sieve (Chen, 2007) that satisfies desirable approximation properties (Yarotsky, 2017). Since both models produce estimates of function parameters, we need to specify a suitable function space to work with. Here we choose the following Hölder ball, as was also used in previous works on nonparametric estimation using neural networks (Schmidt-Hieber, 2020; Farrell et al., 2021; Zhong et al., 2021b):

$$W^\beta_M(\mathcal{X}) = \left\{ f : \max_{\alpha : |\alpha| \leq \beta} \operatorname*{ess\,sup}_{x \in \mathcal{X}} \left| D^\alpha f(x) \right| \leq M \right\},$$

where the domain $\mathcal{X}$ is assumed to be a subset of $d$-dimensional Euclidean space, $\alpha = (\alpha_1, \ldots, \alpha_d)$ is a $d$-dimensional tuple of nonnegative integers with $|\alpha| = \alpha_1 + \cdots + \alpha_d$, and $D^\alpha f = \frac{\partial^{|\alpha|} f}{\partial x_1^{\alpha_1} \cdots \partial x_d^{\alpha_d}}$ is the weak derivative of $f$. Now assume that $M$ is a reasonably large constant, and let $\Theta$ be a closed interval over the real line. We make the following assumptions for the true parameters under both schemes:

Condition 1 (True parameter, PF scheme). The Euclidean parameter $\theta_0 \in \Theta \subset \mathbb{R}$, and the two function parameters satisfy $m_0 \in W^\beta_M([-1, 1]^d)$ and $h_0 \in W^\beta_M([0, \tau])$, where $\tau > 0$ is the ending time of the study duration, an assumption commonly adopted in theoretical studies of survival analysis (Van der Vaart, 2000).

Condition 2 (True parameter, FN scheme). The Euclidean parameter $\theta_0 \in \Theta \subset \mathbb{R}$, and the function parameter $\nu_0 \in W^\beta_M([0, \tau] \times [-1, 1]^d)$.

Next, we construct sieve spaces for function parameter approximation by restricting the complexity of the MLPs to "scale" with the sample size $n$.

Condition 3 (Sieve space, PF scheme). The sieve space $\mathcal{H}_n$ is constructed as a set of MLPs satisfying $\hat{h} \in W^\beta_{M_h}([0, \tau])$, with depth of order $O(\log n)$ and total number of parameters of order $O(n^{\frac{1}{\beta + d}} \log n)$. The sieve space $\mathcal{M}_n$ is constructed as a set of MLPs satisfying $\hat{m} \in W^\beta_{M_m}([-1, 1]^d)$, with depth of order $O(\log n)$ and total number of parameters of order $O(n^{\frac{d}{\beta + d}} \log n)$. Here $M_h$ and $M_m$ are sufficiently large constants such that every function in $W^\beta_M([0, \tau])$ and $W^\beta_M([-1, 1]^d)$ can be accurately approximated by functions inside $\mathcal{H}_n$ and $\mathcal{M}_n$, respectively, according to (Yarotsky, 2017, Theorem 1).

Condition 4 (Sieve space, FN scheme). The sieve space $\mathcal{V}_n$ is constructed as a set of MLPs satisfying $\hat{\nu} \in W^\beta_{M_\nu}([0, \tau] \times [-1, 1]^d)$, with depth of order $O(\log n)$ and total number of parameters of order $O(n^{\frac{d + 1}{\beta + d + 1}} \log n)$. Here $M_\nu$ is a sufficiently large constant such that $\mathcal{V}_n$ satisfies approximation properties analogous to those in condition 3.

For technical reasons, we assume the nonparametric function estimators are constrained to fall inside the corresponding sieve spaces, i.e., $\hat{h}_n \in \mathcal{H}_n$, $\hat{m}_n \in \mathcal{M}_n$, and $\hat{\nu}_n \in \mathcal{V}_n$. This does not affect the implementation of optimization routines, as was discussed in Farrell et al. (2021). Furthermore, we restrict the estimate $\hat{\theta}_n \in \Theta$ in both the PF and FN schemes. Additionally, we need the following regularity condition on the function $G_\theta(x)$:

Condition 5. $G_\theta(x)$ is viewed as a bivariate function $G : \Theta \times B \to \mathbb{R}$, where $B$ is a compact set on $\mathbb{R}$. The functions $G_\theta(x)$, $\frac{\partial}{\partial \theta} G_\theta(x)$, $\frac{\partial}{\partial x} G_\theta(x)$, $\log g_\theta(x)$, $\frac{\partial}{\partial \theta} \log g_\theta(x)$, and $\frac{\partial}{\partial x} \log g_\theta(x)$ are bounded on $\Theta \times B$.

We define two metrics that measure the convergence of parameter estimates. For the PF scheme, let $\phi_0 = (h_0, m_0, \theta_0)$ be the true parameters and $\hat{\phi}_n = (\hat{h}_n, \hat{m}_n, \hat{\theta}_n)$ be the estimates. We write $P_{\phi_0, Z = z}$ for the conditional probability distribution of $(T, \delta)$ given $Z = z$ under the true parameter, and $P_{\hat{\phi}_n, Z = z}$ for the conditional probability distribution of $(T, \delta)$ given $Z = z$ under the estimates.
Define the following metric

$$d_{\text{PF}}\left( \hat{\phi}_n, \phi_0 \right) = E_{z \sim P_Z}\left[ H^2\left( P_{\hat{\phi}_n, Z = z} \,\|\, P_{\phi_0, Z = z} \right) \right],$$

where $H^2(P \,\|\, Q) = \int \left( \sqrt{dP} - \sqrt{dQ} \right)^2$ is the squared Hellinger distance between probability distributions $P$ and $Q$. The case for the FN scheme is similar: Let $\psi_0 = (\nu_0, \theta_0)$ be the true parameters and $\hat{\psi}_n = (\hat{\nu}_n, \hat{\theta}_n)$ be the estimates. Analogous to the definitions above, we define $P_{\psi_0, Z = z}$ as the true conditional distribution given $Z = z$ and $P_{\hat{\psi}_n, Z = z}$ as the estimated conditional distribution, and we use the following metric in the FN scheme:

$$d_{\text{FN}}\left( \hat{\psi}_n, \psi_0 \right) = E_{z \sim P_Z}\left[ H^2\left( P_{\hat{\psi}_n, Z = z} \,\|\, P_{\psi_0, Z = z} \right) \right].$$

Now we state our main theorems. We denote by $P$ the data generating distribution and use $\tilde{O}$ to hide poly-logarithmic factors in the big-O notation.

Theorem 1 (Rate of convergence, PF scheme). In the PF scheme, under conditions 1, 3, and 5, we have that $d_{\text{PF}}\left( \hat{\phi}_n, \phi_0 \right) = \tilde{O}_P\left( n^{-\frac{\beta}{2\beta + 2d}} \right)$.

Theorem 2 (Rate of convergence, FN scheme). In the FN scheme, under conditions 2, 4, and 5, we have that $d_{\text{FN}}\left( \hat{\psi}_n, \psi_0 \right) = \tilde{O}_P\left( n^{-\frac{\beta}{2\beta + 2d + 2}} \right)$.

Remark 2. The idea of using Hellinger distance to measure the convergence rate of sieve MLEs was proposed in Wong & Shen (1995). Obtaining rates under a stronger topology such as $L_2$ is possible if the likelihood function satisfies certain conditions such as the curvature condition (Farrell et al., 2021). However, such conditions are in general too stringent for likelihood-based objectives; instead, we use Hellinger convergence, which has minimal requirements. Consequently, our proof strategy is applicable to many other survival models that rely on neural function approximation, such as Rindt et al. (2022), with some modification to the regularity conditions. For proper choices of metrics in sieve theory, see also the discussion in Chen (2007, Chapter 2).

Figure 1: The plots in the first row compare the empirical estimates of the nonparametric component ν(t, Z) against its true value evaluated on 100 hold-out points, under the PF scheme. The plots in the second row are obtained using the FN scheme, with analogous semantics to the first row.
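To get a feel for the rates in Theorems 1 and 2 and the sieve sizes in conditions 3 and 4, the short sketch below evaluates the polynomial parts of these expressions (ignoring constants and poly-logarithmic factors). The function names and the choice β = 2, d = 5 are assumptions for illustration only; the paper does not fix a smoothness level.

```python
import math

def pf_rate(n, beta, d):
    """Polynomial part n^{-beta / (2 beta + 2 d)} of the PF rate (Theorem 1)."""
    return n ** (-beta / (2 * beta + 2 * d))

def fn_rate(n, beta, d):
    """Polynomial part n^{-beta / (2 beta + 2 d + 2)} of the FN rate (Theorem 2)."""
    return n ** (-beta / (2 * beta + 2 * d + 2))

def sieve_size_m(n, beta, d):
    """Order of the parameter count of M_n in condition 3: n^{d / (beta + d)} log n."""
    return n ** (d / (beta + d)) * math.log(n)

def sieve_size_nu(n, beta, d):
    """Order of the parameter count of V_n in condition 4."""
    return n ** ((d + 1) / (beta + d + 1)) * math.log(n)

# Illustrative smoothness and dimension (assumed, not taken from the paper).
beta, d = 2, 5
for n in (1_000, 5_000, 10_000):
    print(n, pf_rate(n, beta, d), fn_rate(n, beta, d),
          sieve_size_m(n, beta, d), sieve_size_nu(n, beta, d))
```

The FN exponent is smaller in magnitude than the PF exponent because the FN scheme estimates a function of d + 1 arguments rather than two decoupled components; both exponents approach 1/2 as the smoothness β grows.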

5. EXPERIMENTS

In this section, we assess the empirical performance of NFM. We first conduct synthetic experiments for verifying the theoretical convergence guarantees developed in section 4. To further illustrate the empirical efficacy of NFM, we evaluate the predictive performance of NFM over 6 benchmark datasets ranging from small scale to large scale, against state-of-the-art baselines.

5.1. SYNTHETIC EXPERIMENTS

We conduct synthetic experiments to validate our proposed theory. The underlying data generating scheme is as follows: First, we generate a 5-dimensional feature Z whose coordinates are independently sampled from the uniform distribution over the interval [0, 1]. The (true) conditional hazard function of the event time takes the form of the proportional frailty model (2), with h(t) = t and m(Z) = sin(⟨Z, β⟩) + ⟨sin(Z), β⟩, where β = (0.1, 0.2, 0.3, 0.4, 0.5). The frailty ω is generated according to a gamma distribution with mean and variance equal to 1. We use this generating model to assess the recovery guarantee of both NFM modeling schemes by inspecting the empirical recovery of ν(t, Z). For the PF scheme, we have more underlying information about the generating model, and we present an additional assessment regarding the recovery of m(Z) in appendix D.1. We generate three training datasets of different scales, with n ∈ {1000, 5000, 10000}. A censoring mechanism is applied such that the censoring ratio is around 40% for each dataset. The assessment is made on a fixed test sample of 100 hold-out points that are independently drawn from the generating scheme of the event time. We report a more detailed description of the implementation of the data generating scheme and model architectures in appendix C.2. We present the results of our synthetic data experiments in figure 1. The evaluation results suggest that both NFM schemes are capable of approximating complicated nonlinear functions using a moderate amount of data, i.e., n ≥ 1000.

et al., 2016). For all the survival datasets, the event of interest is defined as the mortality after admission. In our experiments, we view METABRIC, RotGBSG, FLCHAIN, and SUPPORT as small-scale datasets and MIMIC-III as a moderate-scale dataset. We additionally use the KKBOX dataset (Kvamme et al., 2019) as a large-scale evaluation. In this dataset, an event time is observed if a customer churns from the KKBOX platform.
We summarize the basic statistics of all the datasets in table 3 . Baselines We compare NFM with 9 baselines. The first one is the linear CoxPH model (Cox, 1972) . Gradient Boosting Machine (GBM) (Friedman, 2001; Chen & Guestrin, 2016) and Random Survival Forests (RSF) (Ishwaran et al., 2008) are two tree-based nonparametric survival regression methods. DeepSurv (Katzman et al., 2018) and CoxTime (Kvamme et al., 2019) are two models that adopt neural variants of partial likelihood as objectives. SuMo-net (Rindt et al., 2022) is a neural variant of NHR. We additionally chose three latest state-of-the-art neural survival models: DeepHit (Lee et al., 2018) , DeepEH (Zhong et al., 2021a) , and SODEN (Tang et al., 2022) . Among the chosen baselines, DeepSurv and SuMo-net are viewed as implementations of neural CoxPH and neural NHR and are therefore of particular interest for the empirical verification of the efficacy of frailty. A more thorough performance comparison with a larger set of baselines is provided in appendix D.3.

Evaluation strategy

We use two standard metrics in survival prediction for evaluating model performance: integrated Brier score (IBS) and integrated negative binomial log-likelihood (INBLL). Both metrics are derived from the following:

$$S(\ell, t_1, t_2) = \int_{t_1}^{t_2} \frac{1}{n} \sum_{i=1}^{n} \left[ \frac{\ell\left( 0, \hat{S}(t \mid Z_i) \right) I(T_i \leq t, \delta_i = 1)}{\hat{S}_C(T_i)} + \frac{\ell\left( 1, \hat{S}(t \mid Z_i) \right) I(T_i > t)}{\hat{S}_C(t)} \right] dt,$$

where $\hat{S}_C(t)$ is an estimate of the survival function $S_C(t)$ of the censoring variable, obtained by the Kaplan-Meier estimate (Kaplan & Meier, 1958) of the censored observations on the test data, and $\ell : \{0, 1\} \times [0, 1] \to \mathbb{R}_+$ is some proper loss function for binary classification (Gneiting & Raftery, 2007). The IBS metric corresponds to $\ell$ being the square loss, and the INBLL metric corresponds to $\ell$ being the negative binomial (Bernoulli) log-likelihood (Graf et al., 1999). Both IBS and INBLL are proper scoring rules if the censoring times and survival times are independent. We additionally report the result of another widely used metric, the concordance index (C-index), in appendix D. Since none of the survival datasets has a standard train/test split, we follow previous practice (Zhong et al., 2021a) and use 5-fold cross-validation (CV): 1 fold is for testing, and 20% of the rest is held out for validation. In our experiments, we observed that a single random split into 5 folds does not produce stable results for most survival datasets. Therefore we perform 10 different CV runs for each survival dataset and report average metrics as well as their standard deviations. For the KKBOX dataset, we use the standard train/valid/test splits that are available via the pycox package (Kvamme et al., 2019) and report results based on 10 trial runs.

Experimental setup We follow standard preprocessing strategies (Katzman et al., 2018; Kvamme et al., 2019; Zhong et al., 2021a) that standardize continuous features to zero mean and unit variance, and apply one-hot encoding to all categorical features.
We adopt MLPs with ReLU activation for all function approximators, including h and m in the PF scheme and ν in the FN scheme, across all datasets, with the number of layers (depth) and the number of hidden units (width) within each layer being tunable. We tune the frailty transform over several standard choices detailed in appendix C.3. We find that the gamma frailty configuration performs reasonably well across all tasks and recommend it as the default choice. A more detailed description of the tuning procedure, as well as training configurations for baseline models, is reported in appendix C.3.

Results We report experimental results on small-scale datasets in table 1, and results on the two larger datasets in table 2. The proposed NFM framework achieves the best performance on 5 of the 6 datasets. The improvement over baselines is particularly evident on the METABRIC, SUPPORT, and MIMIC-III datasets.

Benefits of frailty To better understand the additional benefits of introducing the frailty formulation, we compute the (relative) performance gain of NFM-PF and NFM-FN against their non-frailty counterparts, namely DeepSurv (Katzman et al., 2018) and SuMo-net (Rindt et al., 2022). The evaluation is conducted for all three metrics mentioned in this paper. The results are shown in table 7. They suggest a solid improvement from incorporating frailty, as the relative increase in performance can exceed 10% for both NFM models. A more detailed discussion is presented in section D.5.
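The integrated metrics defined in the evaluation strategy above can be sketched as follows. This is a simplified illustration, not the evaluation code used in the experiments (which relies on existing packages): the Kaplan-Meier routine handles ties naively, the time integral uses a plain trapezoid rule over a user-supplied grid, and a small `eps` guards against division by zero. The names `km_survival`, `integrated_score`, and the interface `surv_fn(t)`, which is assumed to return the predicted survival probabilities of all test subjects at time t, are hypothetical.

```python
import numpy as np

def km_survival(times, events):
    """Kaplan-Meier estimate; returns a right-continuous step function t -> S(t)."""
    order = np.argsort(times)
    t_sorted, e_sorted = times[order], events[order]
    at_risk = len(times) - np.arange(len(times))
    steps = np.concatenate([[1.0], np.cumprod(1.0 - e_sorted / at_risk)])
    return lambda t: steps[np.searchsorted(t_sorted, t, side="right")]

def integrated_score(loss, surv_fn, T, delta, grid, eps=1e-12):
    """Censoring-weighted loss integrated over `grid` by the trapezoid rule."""
    S_C = km_survival(T, 1.0 - delta)          # censoring survival: censorings as events
    w_event = delta / np.maximum(S_C(T), eps)  # weights for observed events
    vals = []
    for t in grid:
        s = surv_fn(t)                         # predicted S(t | Z_i), shape (n,)
        dead = (T <= t) * w_event
        alive = (T > t) / max(S_C(t), eps)
        vals.append(np.mean(loss(0.0, s) * dead + loss(1.0, s) * alive))
    vals = np.asarray(vals)
    return float(np.sum(0.5 * (vals[1:] + vals[:-1]) * np.diff(grid)))

# Square loss gives IBS; Bernoulli negative log-likelihood gives INBLL.
brier = lambda y, s: (y - s) ** 2
nbll = lambda y, s: -(y * np.log(s) + (1.0 - y) * np.log1p(-s))

# Toy usage without censoring: a constant prediction of 0.5 for every subject.
T = np.array([1.0, 2.0, 3.0, 4.0])
delta = np.ones(4)
grid = np.linspace(0.0, 5.0, 51)
ibs_const = integrated_score(brier, lambda t: np.full(4, 0.5), T, delta, grid)
inbll_const = integrated_score(nbll, lambda t: np.full(4, 0.5), T, delta, grid)
```

Note that the sketch follows the displayed formula and does not divide by the length of the integration window; some software packages additionally normalize by t2 - t1.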

A EXAMPLES OF FRAILTY SPECIFICATIONS

We list several commonly used frailty models, and specify their corresponding characteristics via their frailty transform $G_\theta$:

Gamma frailty: Arguably the gamma frailty is the most widely used frailty model (Murphy, 1994; 1995; Parner, 1998; Wienke, 2010; Duchateau & Janssen, 2007), with

$$G_\theta(x) = \frac{1}{\theta} \log(1 + \theta x), \quad \theta \geq 0.$$

When $\theta = 0$, $G_0(x) = \lim_{\theta \to 0} G_\theta(x)$ is defined as the (pointwise) limit. A notable fact of the gamma frailty specification is that when the proportional frailty (PF) assumption (2) is met, the model degenerates to CoxPH if $\theta = 0$, and corresponds to the proportional odds (PO) model (Bennett, 1983) if $\theta = 1$.

Box-Cox transformation frailty: Under this specification, we have

$$G_\theta(x) = \frac{(1 + x)^\theta - 1}{\theta}, \quad \theta \geq 0.$$

The case of $\theta = 0$ is defined analogously to that of gamma frailty, and corresponds to the PO model under the PF assumption. When $\theta = 1$, the model reduces to CoxPH under the PF assumption.

IGG($\alpha$) frailty: This is an extension of gamma frailty (Kosorok et al., 2004) and includes other types of frailty specifications like the inverse Gaussian frailty (Hougaard, 1984), with

$$G_\theta(x) = \frac{1 - \alpha}{\alpha \theta} \left[ \left( 1 + \frac{\theta x}{1 - \alpha} \right)^{\alpha} - 1 \right], \quad \theta \geq 0,\ \alpha \in [0, 1).$$

In the one-dimensional parameter paradigm, the parameter $\alpha$ is assumed known instead of being learnable. When $\alpha \to 0$, the limit is $\frac{1}{\theta} \log(1 + \theta x)$, i.e., the gamma frailty model. When $\alpha = 1/2$, we obtain $\frac{1}{\theta}\left( \sqrt{1 + 2\theta x} - 1 \right)$, which corresponds to the inverse Gaussian frailty.

Before proving Theorems 1 and 2, we introduce some additional notation that will be useful throughout the proofs.

Satisfiability of regularity condition

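In the PF scheme, define

$$l(T, \delta, Z; h, m, \theta) = \delta \log g_\theta\left( e^{m(Z)} \int_0^T e^{h(s)}\,ds \right) + \delta h(T) + \delta m(Z) - G_\theta\left( e^{m(Z)} \int_0^T e^{h(s)}\,ds \right),$$

where we denote $g_\theta(x) = \frac{\partial}{\partial x} G_\theta(x)$. Under the definition of the sieve space stated in condition 3, we restate the parameter estimates as

$$\left( \hat{h}_n, \hat{m}_n, \hat{\theta}_n \right) = \operatorname*{argmax}_{h \in \mathcal{H}_n,\, m \in \mathcal{M}_n,\, \theta \in \Theta} \frac{1}{n} \sum_{i \in [n]} l(T_i, \delta_i, Z_i; h, m, \theta).$$

Similarly, in the FN scheme, we define

$$l(T, \delta, Z; \nu, \theta) = \delta \log g_\theta\left( \int_0^T e^{\nu(s, Z)}\,ds \right) + \delta \nu(T, Z) - G_\theta\left( \int_0^T e^{\nu(s, Z)}\,ds \right).$$

Under the definition of the sieve space stated in condition 4, we restate the parameter estimates as

$$\left( \hat{\nu}_n, \hat{\theta}_n \right) = \operatorname*{argmax}_{\nu \in \mathcal{V}_n,\, \theta \in \Theta} \frac{1}{n} \sum_{i \in [n]} l(T_i, \delta_i, Z_i; \nu, \theta).$$

We denote the conditional density function and survival function of the event time $\tilde{T}$ given $Z$ by $f_{\tilde{T}|Z}(t)$ and $S_{\tilde{T}|Z}(t)$, respectively. Similarly, we denote the conditional density function and survival function of the censoring time $C$ given $Z$ by $f_{C|Z}(t)$ and $S_{C|Z}(t)$. Under the assumption that $\tilde{T} \perp\!\!\!\perp C \mid Z$, the joint conditional density of the observed time $T$ and the censoring indicator $\delta$ given $Z$ can be expressed as follows:

$$p(T, \delta \mid Z) = f_{\tilde{T}|Z}(T)^{\delta}\, S_{\tilde{T}|Z}(T)^{1 - \delta}\, f_{C|Z}(T)^{1 - \delta}\, S_{C|Z}(T)^{\delta} = \lambda_{\tilde{T}|Z}(T)^{\delta}\, S_{\tilde{T}|Z}(T)\, f_{C|Z}(T)^{1 - \delta}\, S_{C|Z}(T)^{\delta},$$

where $\lambda_{\tilde{T}|Z}(T)$ is the conditional hazard function of the survival time $\tilde{T}$ given $Z$. Under the model assumption of the PF scheme, $p(T, \delta \mid Z)$ can be expressed as

$$p(T, \delta \mid Z; h, m, \theta) = \exp\left( l(T, \delta, Z; h, m, \theta) \right) f_{C|Z}(T)^{1 - \delta}\, S_{C|Z}(T)^{\delta}.$$

For $\phi_0 = (h_0, m_0, \theta_0)$ and an estimator $\hat{\phi} = (\hat{h}, \hat{m}, \hat{\theta})$, the distance $d_{\text{PF}}(\hat{\phi}, \phi_0)$ can be explicitly expressed as

$$d_{\text{PF}}\left( \hat{\phi}, \phi_0 \right) = E_Z\left[ \int \left( \sqrt{p(T, \delta \mid Z; \hat{h}, \hat{m}, \hat{\theta})} - \sqrt{p(T, \delta \mid Z; h_0, m_0, \theta_0)} \right)^2 \mu(dT \times d\delta) \right].$$

Here the dominating measure $\mu$ is defined such that for any (measurable) function $r(T, \delta)$,

$$\int r(T, \delta)\, \mu(dT \times d\delta) = \int_0^\tau r(T, \delta = 1)\,dT + \int_0^\tau r(T, \delta = 0)\,dT.$$

Under the model assumption of the FN scheme, $p(T, \delta \mid Z)$ can be expressed as

$$p(T, \delta \mid Z; \nu, \theta) = \exp\left( l(T, \delta, Z; \nu, \theta) \right) f_{C|Z}(T)^{1 - \delta}\, S_{C|Z}(T)^{\delta}.$$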
For ψ 0 = (ν 0 , θ 0 ) and an estimator ψ = ( ν, θ), the defined distance d FN ψ, ψ 0 can be explicitly expresses by d FN ψ, ψ 0 = E Z p(T, δ | Z; ν, θ) -p(T, δ | Z; ν 0 , θ 0 ) 2 µ(dT × dδ) .
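To make the PF log-likelihood concrete, the sketch below evaluates $l(T, \delta, Z; h, m, \theta)$ for a single observation. It is an illustration under stated assumptions rather than the paper's training code: the gamma frailty transform is used as the example $G_\theta$, the integral $\int_0^T e^{h(s)}\,ds$ is approximated by a plain trapezoidal rule, and all function names are hypothetical.

```python
import math


def gamma_G(x, theta):
    """Gamma frailty transform G_theta(x), with the theta -> 0 limit G_0(x) = x."""
    return x if theta == 0.0 else math.log1p(theta * x) / theta


def gamma_g(x, theta):
    """Derivative of the gamma frailty transform with respect to x."""
    return 1.0 / (1.0 + theta * x)


def integrate(f, upper, steps=2000):
    """Trapezoidal approximation of the integral of f over [0, upper]."""
    if upper <= 0.0:
        return 0.0
    width = upper / steps
    total = 0.5 * (f(0.0) + f(upper))
    for k in range(1, steps):
        total += f(k * width)
    return total * width


def pf_log_likelihood(T, delta, Z, h, m, theta):
    """delta * [log g(x) + h(T) + m(Z)] - G(x), with x = exp(m(Z)) * int_0^T exp(h(s)) ds."""
    x = math.exp(m(Z)) * integrate(lambda s: math.exp(h(s)), T)
    return delta * (math.log(gamma_g(x, theta)) + h(T) + m(Z)) - gamma_G(x, theta)
```

For instance, with $h \equiv 0$, $m \equiv \log 2$ and $\theta = 0$, the expression reduces to the exponential log-likelihood $\delta \log 2 - 2T$, which is a quick sanity check on the implementation.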

B.2 TECHNICAL LEMMAS

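The following lemmas are needed for the proofs of theorems 1 and 2. Hereafter, for notational convenience, we use $\hat h, \hat m$ for arbitrary elements in the corresponding sieve spaces listed in condition 3, $\hat\nu$ for an arbitrary element in the sieve space listed in condition 4, and $\hat\theta$ for an arbitrary element in $\Theta$.

Lemma 1. Under conditions 1, 3, 5, for $(T, \delta, Z) \in [0, \tau] \times \{0, 1\} \times [-1, 1]^d$, the following terms are bounded:
1. $l(T, \delta, Z; h_0, m_0, \theta_0)$ with true parameter $(h_0, m_0, \theta_0)$;
2. $l(T, \delta, Z; \hat h, \hat m, \hat\theta)$ with parameter estimates $(\hat h, \hat m, \hat\theta)$ in any sieve space listed in condition 3.

Lemma 2. Under conditions 2, 4, 5, for $(T, \delta, Z) \in [0, \tau] \times \{0, 1\} \times [-1, 1]^d$, the following terms are bounded:
1. $l(T, \delta, Z; \nu_0, \theta_0)$ with true parameter $(\nu_0, \theta_0)$;
2. $l(T, \delta, Z; \hat\nu, \hat\theta)$ with parameter estimates $(\hat\nu, \hat\theta)$ in any sieve space listed in condition 4.

Lemma 3. Under conditions 1, 3, 5, let $(\hat h, \hat m, \hat\theta)$, $(\hat h_1, \hat m_1, \hat\theta_1)$, and $(\hat h_2, \hat m_2, \hat\theta_2)$ be three arbitrary parameter triples inside the sieve space defined in condition 3. Then the following two inequalities hold:
$$\|l(T, \delta, Z; h_0, m_0, \theta_0) - l(T, \delta, Z; \hat h, \hat m, \hat\theta)\|_\infty \lesssim |\theta_0 - \hat\theta| + \|h_0 - \hat h\|_\infty + \|m_0 - \hat m\|_\infty,$$
$$\|l(T, \delta, Z; \hat h_1, \hat m_1, \hat\theta_1) - l(T, \delta, Z; \hat h_2, \hat m_2, \hat\theta_2)\|_\infty \lesssim |\hat\theta_1 - \hat\theta_2| + \|\hat h_1 - \hat h_2\|_\infty + \|\hat m_1 - \hat m_2\|_\infty.$$

Lemma 4. Under conditions 2, 4, 5, let $(\hat\nu, \hat\theta)$, $(\hat\nu_1, \hat\theta_1)$, and $(\hat\nu_2, \hat\theta_2)$ be three arbitrary parameter tuples inside the sieve space defined in condition 4. Then the following inequalities hold:
$$\|l(T, \delta, Z; \nu_0, \theta_0) - l(T, \delta, Z; \hat\nu, \hat\theta)\|_\infty \lesssim |\theta_0 - \hat\theta| + \|\nu_0 - \hat\nu\|_\infty,$$
$$\|l(T, \delta, Z; \hat\nu_1, \hat\theta_1) - l(T, \delta, Z; \hat\nu_2, \hat\theta_2)\|_\infty \lesssim |\hat\theta_1 - \hat\theta_2| + \|\hat\nu_1 - \hat\nu_2\|_\infty.$$

Lemma 5 (Approximating error of PF scheme). In the PF scheme, for any $n$, there exists an element in the corresponding sieve space, $\pi_n \phi_0 = (\pi_n h_0, \pi_n m_0, \pi_n \theta_0)$, satisfying $d_{\mathrm{PF}}(\pi_n \phi_0, \phi_0) = O\big(n^{-\frac{\beta}{\beta + d}}\big)$.

Lemma 6 (Approximating error of FN scheme). In the FN scheme, for any $n$, there exists an element in the corresponding sieve space, $\pi_n \psi_0 = (\pi_n \nu_0, \pi_n \theta_0)$, satisfying $d_{\mathrm{FN}}(\pi_n \psi_0, \psi_0) = O\big(n^{-\frac{\beta}{\beta + d + 1}}\big)$.

Lemma 7. Suppose $\mathcal{F}$ is a class of functions satisfying $N(\varepsilon, \mathcal{F}, \|\cdot\|) < \infty$ for all $\varepsilon > 0$. We define $\tilde N(\varepsilon, \mathcal{F}, \|\cdot\|)$ to be the minimal number of $\varepsilon$-balls $B(f, \varepsilon) = \{g : \|g - f\| < \varepsilon\}$ needed to cover $\mathcal{F}$ with radius $\varepsilon$, under the further constraint that $f \in \mathcal{F}$. Then we have
$$N(\varepsilon, \mathcal{F}, \|\cdot\|) \le \tilde N(\varepsilon, \mathcal{F}, \|\cdot\|) \le N\big(\tfrac{\varepsilon}{2}, \mathcal{F}, \|\cdot\|\big).$$

Lemma 8. Suppose $\mathcal{F}$ is a class of functions satisfying $N_{[]}(\varepsilon, \mathcal{F}, \|\cdot\|_\infty) < \infty$ for all $\varepsilon > 0$. We define $\tilde N_{[]}(\varepsilon, \mathcal{F}, \|\cdot\|_\infty)$ to be the minimal number of brackets $[l, u]$ needed to cover $\mathcal{F}$ with $\|l - u\|_\infty < \varepsilon$, under the further constraint that $f \in \mathcal{F}$, $l = f - \frac{\varepsilon}{2}$ and $u = f + \frac{\varepsilon}{2}$. Then we have
$$N_{[]}(\varepsilon, \mathcal{F}, \|\cdot\|_\infty) \le \tilde N_{[]}(\varepsilon, \mathcal{F}, \|\cdot\|_\infty) \le N_{[]}\big(\tfrac{\varepsilon}{2}, \mathcal{F}, \|\cdot\|_\infty\big).$$
Furthermore, we have $\tilde N_{[]}(\varepsilon, \mathcal{F}, \|\cdot\|_\infty) = N\big(\tfrac{\varepsilon}{2}, \mathcal{F}, \|\cdot\|_\infty\big)$.

Lemma 9 (Model capacity of PF scheme). Let $\mathcal{F}_n = \{l(T, \delta, Z; \hat h, \hat m, \hat\theta) : \hat h \in \mathcal{H}_n, \hat m \in \mathcal{M}_n, \hat\theta \in \Theta\}$. Under condition 5, with $s_h = \frac{2\beta}{2\beta + 1}$ and $s_m = \frac{2\beta}{2\beta + d}$, there exist constants $c_h, c_m > 0$ such that
$$N_{[]}(\varepsilon, \mathcal{F}_n, \|\cdot\|_\infty) \lesssim \frac{1}{\varepsilon}\, N\big(c_h \varepsilon^{1/s_h}, \mathcal{H}_n, \|\cdot\|_2\big) \times N\big(c_m \varepsilon^{1/s_m}, \mathcal{M}_n, \|\cdot\|_2\big).$$

Lemma 10 (Model capacity of FN scheme). Let $\mathcal{G}_n = \{l(T, \delta, Z; \hat\nu, \hat\theta) : \hat\nu \in \mathcal{V}_n, \hat\theta \in \Theta\}$. Under condition 5, with $s_\nu = \frac{2\beta}{2\beta + d + 1}$, there exists a constant $c_\nu > 0$ such that
$$N_{[]}(\varepsilon, \mathcal{G}_n, \|\cdot\|_\infty) \lesssim \frac{1}{\varepsilon}\, N\big(c_\nu \varepsilon^{1/s_\nu}, \mathcal{V}_n, \|\cdot\|_2\big).$$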

B.3 PROOFS OF THEOREM 1 AND 2

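Proof of theorem 1. The proof is divided into four steps.

Step 1. We denote $\phi_0 = (h_0, m_0, \theta_0)$ and $\hat\phi = (\hat h, \hat m, \hat\theta)$, where $\hat h \in \mathcal{H}_n$, $\hat m \in \mathcal{M}_n$ and $\hat\theta \in \Theta$. For arbitrarily small $\varepsilon > 0$, we have that
$$\begin{aligned}
&\inf_{d_{\mathrm{PF}}(\hat\phi, \phi_0) \ge \varepsilon} \mathbb{E}\big[l(T, \delta, Z; h_0, m_0, \theta_0) - l(T, \delta, Z; \hat h, \hat m, \hat\theta)\big] \\
&\quad = \inf_{d_{\mathrm{PF}}(\hat\phi, \phi_0) \ge \varepsilon} \mathbb{E}_Z\Big[\mathbb{E}_{T, \delta \mid Z}\big[\log p(T, \delta \mid Z; h_0, m_0, \theta_0) - \log p(T, \delta \mid Z; \hat h, \hat m, \hat\theta)\big]\Big] \\
&\quad = \inf_{d_{\mathrm{PF}}(\hat\phi, \phi_0) \ge \varepsilon} \mathbb{E}_Z\big[D_{\mathrm{KL}}\big(P_{\hat\phi, Z} \,\|\, P_{\phi_0, Z}\big)\big].
\end{aligned}$$
Using the fact that $D_{\mathrm{KL}}(P_{\hat\phi, Z} \,\|\, P_{\phi_0, Z}) \ge 2 H^2(P_{\hat\phi, Z} \,\|\, P_{\phi_0, Z})$, we further obtain that
$$\inf_{d_{\mathrm{PF}}(\hat\phi, \phi_0) \ge \varepsilon} \mathbb{E}\big[l(T, \delta, Z; h_0, m_0, \theta_0) - l(T, \delta, Z; \hat h, \hat m, \hat\theta)\big] \ge \inf_{d_{\mathrm{PF}}(\hat\phi, \phi_0) \ge \varepsilon} 2\, \mathbb{E}_Z\big[H^2(P_{\hat\phi, Z} \,\|\, P_{\phi_0, Z})\big] = 2 \inf_{d_{\mathrm{PF}}(\hat\phi, \phi_0) \ge \varepsilon} d_{\mathrm{PF}}^2\big(\hat\phi, \phi_0\big) \ge 2\varepsilon^2.$$

Step 2. Consider the following derivation:
$$\begin{aligned}
&\sup_{d_{\mathrm{PF}}(\hat\phi, \phi_0) \le \varepsilon} \operatorname{Var}\big[l(T, \delta, Z; h_0, m_0, \theta_0) - l(T, \delta, Z; \hat h, \hat m, \hat\theta)\big] \\
&\quad \le \sup_{d_{\mathrm{PF}}(\hat\phi, \phi_0) \le \varepsilon} \mathbb{E}\Big[\big(l(T, \delta, Z; h_0, m_0, \theta_0) - l(T, \delta, Z; \hat h, \hat m, \hat\theta)\big)^2\Big] \\
&\quad = \sup_{d_{\mathrm{PF}}(\hat\phi, \phi_0) \le \varepsilon} \mathbb{E}_Z\Big[\mathbb{E}_{T, \delta \mid Z}\big(\log p(T, \delta \mid Z; h_0, m_0, \theta_0) - \log p(T, \delta \mid Z; \hat h, \hat m, \hat\theta)\big)^2\Big] \\
&\quad = 4 \sup_{d_{\mathrm{PF}}(\hat\phi, \phi_0) \le \varepsilon} \mathbb{E}_Z\left[\int p(T, \delta \mid Z; h_0, m_0, \theta_0) \left(\log \sqrt{\frac{p(T, \delta \mid Z; h_0, m_0, \theta_0)}{p(T, \delta \mid Z; \hat h, \hat m, \hat\theta)}}\right)^2 \mu(dT \times d\delta)\right].
\end{aligned}$$
By Taylor's expansion on $\log x$, there exists $\xi(T, \delta, Z)$ between $p^{1/2}(T, \delta \mid Z; h_0, m_0, \theta_0)$ and $p^{1/2}(T, \delta \mid Z; \hat h, \hat m, \hat\theta)$ pointwise such that
$$\begin{aligned}
&p(T, \delta \mid Z; h_0, m_0, \theta_0) \left(\log \sqrt{\frac{p(T, \delta \mid Z; h_0, m_0, \theta_0)}{p(T, \delta \mid Z; \hat h, \hat m, \hat\theta)}}\right)^2 \\
&\quad = p(T, \delta \mid Z; h_0, m_0, \theta_0) \Big(\log p^{1/2}(T, \delta \mid Z; h_0, m_0, \theta_0) - \log p^{1/2}(T, \delta \mid Z; \hat h, \hat m, \hat\theta)\Big)^2 \\
&\quad = \frac{p(T, \delta \mid Z; h_0, m_0, \theta_0)}{\xi(T, \delta, Z)^2} \Big(p^{1/2}(T, \delta \mid Z; h_0, m_0, \theta_0) - p^{1/2}(T, \delta \mid Z; \hat h, \hat m, \hat\theta)\Big)^2.
\end{aligned}$$
Note that
$$\frac{p(T, \delta \mid Z; h_0, m_0, \theta_0)}{p(T, \delta \mid Z; \hat h, \hat m, \hat\theta)} = e^{l(T, \delta, Z; h_0, m_0, \theta_0) - l(T, \delta, Z; \hat h, \hat m, \hat\theta)},$$
and by lemma 1, $l(T, \delta, Z; h_0, m_0, \theta_0)$ and $l(T, \delta, Z; \hat h, \hat m, \hat\theta)$ are bounded on $[0, \tau] \times \{0, 1\} \times [-1, 1]^d$ uniformly over all $\hat\phi = (\hat h, \hat m, \hat\theta)$. Thus, there exist constants $C_1$ and $C_2$ such that $0 < C_1 \le p(T, \delta \mid Z; h_0, m_0, \theta_0) / p(T, \delta \mid Z; \hat h, \hat m, \hat\theta) \le C_2$. This leads to the fact that $p(T, \delta \mid Z; h_0, m_0, \theta_0)\, \xi(T, \delta, Z)^{-2}$ is bounded. We further obtain that
$$p(T, \delta \mid Z; h_0, m_0, \theta_0) \Big(\log p^{1/2}(T, \delta \mid Z; h_0, m_0, \theta_0) - \log p^{1/2}(T, \delta \mid Z; \hat h, \hat m, \hat\theta)\Big)^2 \lesssim \Big(p^{1/2}(T, \delta \mid Z; h_0, m_0, \theta_0) - p^{1/2}(T, \delta \mid Z; \hat h, \hat m, \hat\theta)\Big)^2.$$
Thus, we have that
$$\begin{aligned}
&\sup_{d_{\mathrm{PF}}(\hat\phi, \phi_0) \le \varepsilon} \operatorname{Var}\big[l(T, \delta, Z; h_0, m_0, \theta_0) - l(T, \delta, Z; \hat h, \hat m, \hat\theta)\big] \\
&\quad \lesssim \sup_{d_{\mathrm{PF}}(\hat\phi, \phi_0) \le \varepsilon} \mathbb{E}_Z\left[\int \Big(p^{1/2}(T, \delta \mid Z; h_0, m_0, \theta_0) - p^{1/2}(T, \delta \mid Z; \hat h, \hat m, \hat\theta)\Big)^2 \mu(dT \times d\delta)\right] = \sup_{d_{\mathrm{PF}}(\hat\phi, \phi_0) \le \varepsilon} d_{\mathrm{PF}}^2\big(\hat\phi, \phi_0\big) \le \varepsilon^2.
\end{aligned}$$

Step 3. We define $\tilde{\mathcal{F}}_n = \{l(T, \delta, Z; \hat h, \hat m, \hat\theta) - l(T, \delta, Z; \pi_n h_0, \pi_n m_0, \pi_n \theta_0) : \hat h \in \mathcal{H}_n, \hat m \in \mathcal{M}_n, \hat\theta \in \Theta\}$, where $(\pi_n h_0, \pi_n m_0, \pi_n \theta_0)$ is defined in lemma 5. Since $\tilde{\mathcal{F}}_n$ is a shift of $\mathcal{F}_n$ by a fixed function, we have $\log N_{[]}(\varepsilon, \tilde{\mathcal{F}}_n, \|\cdot\|_\infty) = \log N_{[]}(\varepsilon, \mathcal{F}_n, \|\cdot\|_\infty)$, where $\mathcal{F}_n$ is defined in lemma 9. By lemma 9, we further have that
$$\log N_{[]}(\varepsilon, \mathcal{F}_n, \|\cdot\|_\infty) \lesssim \log \frac{1}{\varepsilon} + \log N\big(c_h \varepsilon^{1/s_h}, \mathcal{H}_n, \|\cdot\|_2\big) + \log N\big(c_m \varepsilon^{1/s_m}, \mathcal{M}_n, \|\cdot\|_2\big).$$
According to Bartlett et al. (2019, Theorem 7), under condition 3, the VC-dimensions of $\mathcal{H}_n$ and $\mathcal{M}_n$ satisfy $\mathrm{VC}(\mathcal{H}_n) \lesssim n^{\frac{1}{\beta + d}} \log^3 n$ and $\mathrm{VC}(\mathcal{M}_n) \lesssim n^{\frac{d}{\beta + d}} \log^3 n$. Thus, we obtain that
$$\log N\big(c_h \varepsilon^{1/s_h}, \mathcal{H}_n, \|\cdot\|_2\big) \lesssim \frac{\mathrm{VC}(\mathcal{H}_n)}{s_h} \log \frac{1}{\varepsilon} \lesssim n^{\frac{1}{\beta + d}} \log^3 n \log \frac{1}{\varepsilon},$$
and
$$\log N\big(c_m \varepsilon^{1/s_m}, \mathcal{M}_n, \|\cdot\|_2\big) \lesssim \frac{\mathrm{VC}(\mathcal{M}_n)}{s_m} \log \frac{1}{\varepsilon} \lesssim n^{\frac{d}{\beta + d}} \log^3 n \log \frac{1}{\varepsilon}.$$
Thus, we obtain that $\log N_{[]}(\varepsilon, \tilde{\mathcal{F}}_n, \|\cdot\|_\infty) \lesssim n^{\frac{d}{\beta + d}} \log^3 n \log \frac{1}{\varepsilon}$.

Step 4. By the Cauchy-Schwarz inequality, we have that
$$\sqrt{\big|\mathbb{E}\big[l(T, \delta, Z; \pi_n h_0, \pi_n m_0, \pi_n \theta_0) - l(T, \delta, Z; h_0, m_0, \theta_0)\big]\big|} \le \Big(\mathbb{E}\big[\big(l(T, \delta, Z; \pi_n h_0, \pi_n m_0, \pi_n \theta_0) - l(T, \delta, Z; h_0, m_0, \theta_0)\big)^2\big]\Big)^{\frac{1}{4}}.$$
Similar to the second step and by lemma 5, we further have that
$$\sqrt{\big|\mathbb{E}\big[l(T, \delta, Z; \pi_n h_0, \pi_n m_0, \pi_n \theta_0) - l(T, \delta, Z; h_0, m_0, \theta_0)\big]\big|} \lesssim d_{\mathrm{PF}}^{1/2}(\pi_n \phi_0, \phi_0) \lesssim n^{-\frac{\beta}{2\beta + 2d}}.$$
Now let
$$\tau = \frac{\beta}{2\beta + 2d} - \frac{2 \log\log n}{\log n}.$$
By Steps 1, 2, 3 and Shen & Wong (1994, Theorem 1), we have
$$d_{\mathrm{PF}}\big(\hat\phi_n, \phi_0\big) = O\left(\max\left\{n^{-\tau},\ d_{\mathrm{PF}}(\pi_n \phi_0, \phi_0),\ \sqrt{\big|\mathbb{E}\big[l(T, \delta, Z; \pi_n h_0, \pi_n m_0, \pi_n \theta_0) - l(T, \delta, Z; h_0, m_0, \theta_0)\big]\big|}\right\}\right).$$
By lemma 5, $d_{\mathrm{PF}}(\pi_n \phi_0, \phi_0) = O\big(n^{-\frac{\beta}{\beta + d}}\big)$. By Step 4, the third term is $O\big(n^{-\frac{\beta}{2\beta + 2d}}\big)$. Thus, we have
$$d_{\mathrm{PF}}\big(\hat\phi_n, \phi_0\big) = O\big(n^{-\frac{\beta}{2\beta + 2d}} \log^2 n\big) = \tilde O\big(n^{-\frac{\beta}{2\beta + 2d}}\big).$$

Proof of theorem 2. The proof is divided into four steps.

Step 1. We denote $\psi_0 = (\nu_0, \theta_0)$ and $\hat\psi = (\hat\nu, \hat\theta)$, where $\hat\nu \in \mathcal{V}_n$ and $\hat\theta \in \Theta$. For arbitrary $0 < \varepsilon \le 1$, we have that
$$\begin{aligned}
&\inf_{d_{\mathrm{FN}}(\hat\psi, \psi_0) \ge \varepsilon} \mathbb{E}\big[l(T, \delta, Z; \nu_0, \theta_0) - l(T, \delta, Z; \hat\nu, \hat\theta)\big] \\
&\quad = \inf_{d_{\mathrm{FN}}(\hat\psi, \psi_0) \ge \varepsilon} \mathbb{E}_Z\Big[\mathbb{E}_{T, \delta \mid Z}\big[\log p(T, \delta \mid Z; \nu_0, \theta_0) - \log p(T, \delta \mid Z; \hat\nu, \hat\theta)\big]\Big] \\
&\quad = \inf_{d_{\mathrm{FN}}(\hat\psi, \psi_0) \ge \varepsilon} \mathbb{E}_Z\big[D_{\mathrm{KL}}\big(P_{\hat\psi, Z} \,\|\, P_{\psi_0, Z}\big)\big].
\end{aligned}$$
Using the fact that $D_{\mathrm{KL}}(P_{\hat\psi, Z} \,\|\, P_{\psi_0, Z}) \ge 2 H^2(P_{\hat\psi, Z} \,\|\, P_{\psi_0, Z})$, we further obtain that
$$\inf_{d_{\mathrm{FN}}(\hat\psi, \psi_0) \ge \varepsilon} \mathbb{E}\big[l(T, \delta, Z; \nu_0, \theta_0) - l(T, \delta, Z; \hat\nu, \hat\theta)\big] \ge \inf_{d_{\mathrm{FN}}(\hat\psi, \psi_0) \ge \varepsilon} 2\, \mathbb{E}_Z\big[H^2(P_{\hat\psi, Z} \,\|\, P_{\psi_0, Z})\big] = 2 \inf_{d_{\mathrm{FN}}(\hat\psi, \psi_0) \ge \varepsilon} d_{\mathrm{FN}}^2\big(\hat\psi, \psi_0\big) \ge 2\varepsilon^2.$$

Step 2. We consider the following derivation:
$$\begin{aligned}
&\sup_{d_{\mathrm{FN}}(\hat\psi, \psi_0) \le \varepsilon} \operatorname{Var}\big[l(T, \delta, Z; \nu_0, \theta_0) - l(T, \delta, Z; \hat\nu, \hat\theta)\big] \\
&\quad \le \sup_{d_{\mathrm{FN}}(\hat\psi, \psi_0) \le \varepsilon} \mathbb{E}\Big[\big(l(T, \delta, Z; \nu_0, \theta_0) - l(T, \delta, Z; \hat\nu, \hat\theta)\big)^2\Big] \\
&\quad = \sup_{d_{\mathrm{FN}}(\hat\psi, \psi_0) \le \varepsilon} \mathbb{E}_Z\Big[\mathbb{E}_{T, \delta \mid Z}\big(\log p(T, \delta \mid Z; \nu_0, \theta_0) - \log p(T, \delta \mid Z; \hat\nu, \hat\theta)\big)^2\Big] \\
&\quad = 4 \sup_{d_{\mathrm{FN}}(\hat\psi, \psi_0) \le \varepsilon} \mathbb{E}_Z\left[\int p(T, \delta \mid Z; \nu_0, \theta_0) \left(\log \sqrt{\frac{p(T, \delta \mid Z; \nu_0, \theta_0)}{p(T, \delta \mid Z; \hat\nu, \hat\theta)}}\right)^2 \mu(dT \times d\delta)\right].
\end{aligned}$$
By Taylor's expansion on $\log x$, there exists $\eta(T, \delta, Z)$ between $p^{1/2}(T, \delta \mid Z; \nu_0, \theta_0)$ and $p^{1/2}(T, \delta \mid Z; \hat\nu, \hat\theta)$ pointwise such that
$$\begin{aligned}
&p(T, \delta \mid Z; \nu_0, \theta_0) \left(\log \sqrt{\frac{p(T, \delta \mid Z; \nu_0, \theta_0)}{p(T, \delta \mid Z; \hat\nu, \hat\theta)}}\right)^2 \\
&\quad = p(T, \delta \mid Z; \nu_0, \theta_0) \Big(\log p^{1/2}(T, \delta \mid Z; \nu_0, \theta_0) - \log p^{1/2}(T, \delta \mid Z; \hat\nu, \hat\theta)\Big)^2 \\
&\quad = \frac{p(T, \delta \mid Z; \nu_0, \theta_0)}{\eta(T, \delta, Z)^2} \Big(p^{1/2}(T, \delta \mid Z; \nu_0, \theta_0) - p^{1/2}(T, \delta \mid Z; \hat\nu, \hat\theta)\Big)^2.
\end{aligned}$$
Note that $p(T, \delta \mid Z; \nu_0, \theta_0) / p(T, \delta \mid Z; \hat\nu, \hat\theta) = e^{l(T, \delta, Z; \nu_0, \theta_0) - l(T, \delta, Z; \hat\nu, \hat\theta)}$, and by lemma 2, $l(T, \delta, Z; \nu_0, \theta_0)$ and $l(T, \delta, Z; \hat\nu, \hat\theta)$ are bounded on $[0, \tau] \times \{0, 1\} \times [-1, 1]^d$ uniformly over all $\hat\psi = (\hat\nu, \hat\theta)$. Thus there exist constants $C_3$ and $C_4$ such that $0 < C_3 \le p(T, \delta \mid Z; \nu_0, \theta_0) / p(T, \delta \mid Z; \hat\nu, \hat\theta) \le C_4$. This leads to the fact that $p(T, \delta \mid Z; \nu_0, \theta_0)\, \eta(T, \delta, Z)^{-2}$ is bounded.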
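We further have that
$$p(T, \delta \mid Z; \nu_0, \theta_0) \Big(\log p^{1/2}(T, \delta \mid Z; \nu_0, \theta_0) - \log p^{1/2}(T, \delta \mid Z; \hat\nu, \hat\theta)\Big)^2 \lesssim \Big(p^{1/2}(T, \delta \mid Z; \nu_0, \theta_0) - p^{1/2}(T, \delta \mid Z; \hat\nu, \hat\theta)\Big)^2.$$
Thus, we have that
$$\begin{aligned}
&\sup_{d_{\mathrm{FN}}(\hat\psi, \psi_0) \le \varepsilon} \operatorname{Var}\big[l(T, \delta, Z; \nu_0, \theta_0) - l(T, \delta, Z; \hat\nu, \hat\theta)\big] \\
&\quad \lesssim \sup_{d_{\mathrm{FN}}(\hat\psi, \psi_0) \le \varepsilon} \mathbb{E}_Z\left[\int \Big(p^{1/2}(T, \delta \mid Z; \nu_0, \theta_0) - p^{1/2}(T, \delta \mid Z; \hat\nu, \hat\theta)\Big)^2 \mu(dT \times d\delta)\right] = \sup_{d_{\mathrm{FN}}(\hat\psi, \psi_0) \le \varepsilon} d_{\mathrm{FN}}^2\big(\hat\psi, \psi_0\big) \le \varepsilon^2.
\end{aligned}$$

Step 3. We define $\tilde{\mathcal{G}}_n = \{l(T, \delta, Z; \hat\nu, \hat\theta) - l(T, \delta, Z; \pi_n \nu_0, \pi_n \theta_0) : \hat\nu \in \mathcal{V}_n, \hat\theta \in \Theta\}$, where $(\pi_n \nu_0, \pi_n \theta_0)$ is defined in lemma 6. Since $\tilde{\mathcal{G}}_n$ is a shift of $\mathcal{G}_n$ by a fixed function, we have $\log N_{[]}(\varepsilon, \tilde{\mathcal{G}}_n, \|\cdot\|_\infty) = \log N_{[]}(\varepsilon, \mathcal{G}_n, \|\cdot\|_\infty)$, where $\mathcal{G}_n$ is defined in lemma 10. By lemma 10, we further obtain that
$$\log N_{[]}(\varepsilon, \mathcal{G}_n, \|\cdot\|_\infty) \lesssim \log \frac{1}{\varepsilon} + \log N\big(c_\nu \varepsilon^{1/s_\nu}, \mathcal{V}_n, \|\cdot\|_2\big).$$
According to Bartlett et al. (2019, Theorem 7), under condition 4, the VC-dimension of $\mathcal{V}_n$ satisfies $\mathrm{VC}(\mathcal{V}_n) \lesssim n^{\frac{d+1}{\beta + d + 1}} \log^3 n$. Thus, we obtain that
$$\log N\big(c_\nu \varepsilon^{1/s_\nu}, \mathcal{V}_n, \|\cdot\|_2\big) \lesssim \frac{\mathrm{VC}(\mathcal{V}_n)}{s_\nu} \log \frac{1}{\varepsilon} \lesssim n^{\frac{d+1}{\beta + d + 1}} \log^3 n \log \frac{1}{\varepsilon}.$$
Furthermore, we get that $\log N_{[]}(\varepsilon, \tilde{\mathcal{G}}_n, \|\cdot\|_\infty) \lesssim n^{\frac{d+1}{\beta + d + 1}} \log^3 n \log \frac{1}{\varepsilon}$.

Step 4. By the Cauchy-Schwarz inequality, we have that
$$\sqrt{\big|\mathbb{E}\big[l(T, \delta, Z; \pi_n \nu_0, \pi_n \theta_0) - l(T, \delta, Z; \nu_0, \theta_0)\big]\big|} \le \Big(\mathbb{E}\big[\big(l(T, \delta, Z; \pi_n \nu_0, \pi_n \theta_0) - l(T, \delta, Z; \nu_0, \theta_0)\big)^2\big]\Big)^{\frac{1}{4}}.$$
Similar to the second step and by lemma 6, we further obtain that
$$\sqrt{\big|\mathbb{E}\big[l(T, \delta, Z; \pi_n \nu_0, \pi_n \theta_0) - l(T, \delta, Z; \nu_0, \theta_0)\big]\big|} \lesssim d_{\mathrm{FN}}^{1/2}(\pi_n \psi_0, \psi_0) \lesssim n^{-\frac{\beta}{2\beta + 2d + 2}}.$$
Now let
$$\tau = \frac{\beta}{2\beta + 2d + 2} - \frac{2 \log\log n}{\log n}.$$
By Steps 1, 2, 3 and Shen & Wong (1994, Theorem 1), we have
$$d_{\mathrm{FN}}\big(\hat\psi_n, \psi_0\big) = O\left(\max\left\{n^{-\tau},\ d_{\mathrm{FN}}(\pi_n \psi_0, \psi_0),\ \sqrt{\big|\mathbb{E}\big[l(T, \delta, Z; \pi_n \nu_0, \pi_n \theta_0) - l(T, \delta, Z; \nu_0, \theta_0)\big]\big|}\right\}\right).$$
By lemma 6, $d_{\mathrm{FN}}(\pi_n \psi_0, \psi_0) = O\big(n^{-\frac{\beta}{\beta + d + 1}}\big)$. By Step 4, the third term is $O\big(n^{-\frac{\beta}{2\beta + 2d + 2}}\big)$. Thus, we have
$$d_{\mathrm{FN}}\big(\hat\psi_n, \psi_0\big) = O\big(n^{-\frac{\beta}{2\beta + 2d + 2}} \log^2 n\big) = \tilde O\big(n^{-\frac{\beta}{2\beta + 2d + 2}}\big).$$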
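The $\log^2 n$ factor in the last display comes directly from the choice of $\tau$. As a short check that uses only the definition of $\tau$ in Step 4 of the FN proof (the PF case is identical with $2\beta + 2d$ in place of $2\beta + 2d + 2$):

```latex
n^{-\tau}
  = n^{-\frac{\beta}{2\beta + 2d + 2}} \cdot n^{\frac{2 \log\log n}{\log n}}
  = n^{-\frac{\beta}{2\beta + 2d + 2}} \cdot \exp\!\big(2 \log\log n\big)
  = n^{-\frac{\beta}{2\beta + 2d + 2}} \, (\log n)^2 .
```

Since the other two terms in the maximum are of order at most $n^{-\frac{\beta}{2\beta + 2d + 2}}$, the term $n^{-\tau}$ dominates, and the polylogarithmic factor is the only gap between the two stated rates.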
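$$\begin{aligned}
&|l(T, \delta, Z; h_0, m_0, \theta_0) - l(T, \delta, Z; \hat h, \hat m, \hat\theta)| \\
&\quad \le \sup_{\theta \in \Theta, x \in B} \left|\frac{\partial \log g_\theta(x)}{\partial \theta}\right| \cdot |\theta_0 - \hat\theta| + \sup_{\theta \in \Theta, x \in B} \left|\frac{\partial \log g_\theta(x)}{\partial x}\right| \cdot \left|e^{m_0(Z)} \int_0^T e^{h_0(s)}\,ds - e^{\hat m(Z)} \int_0^T e^{\hat h(s)}\,ds\right| \\
&\qquad + |h_0(T) - \hat h(T)| + |m_0(Z) - \hat m(Z)| \\
&\qquad + \sup_{\theta \in \Theta, x \in B} \left|\frac{\partial G_\theta(x)}{\partial \theta}\right| \cdot |\theta_0 - \hat\theta| + \sup_{\theta \in \Theta, x \in B} \left|\frac{\partial G_\theta(x)}{\partial x}\right| \cdot \left|e^{m_0(Z)} \int_0^T e^{h_0(s)}\,ds - e^{\hat m(Z)} \int_0^T e^{\hat h(s)}\,ds\right|.
\end{aligned}$$
Again, by Taylor's expansion,
$$\left|e^{m_0(Z)} \int_0^T e^{h_0(s)}\,ds - e^{\hat m(Z)} \int_0^T e^{\hat h(s)}\,ds\right| \le e^{M} \cdot \tau e^{\max(M, M_h)} \|h_0 - \hat h\|_\infty + \tau e^{M_h} \cdot e^{\max(M, M_m)} \|m_0 - \hat m\|_\infty.$$
Finally, we obtain that
$$\begin{aligned}
&|l(T, \delta, Z; h_0, m_0, \theta_0) - l(T, \delta, Z; \hat h, \hat m, \hat\theta)| \\
&\quad \le \sup_{\theta \in \Theta, x \in B} \left|\frac{\partial \log g_\theta(x)}{\partial x}\right| \cdot \Big(e^{M} \cdot \tau e^{\max(M, M_h)} \|h_0 - \hat h\|_\infty + \tau e^{M_h} \cdot e^{\max(M, M_m)} \|m_0 - \hat m\|_\infty\Big) \\
&\qquad + \sup_{\theta \in \Theta, x \in B} \left|\frac{\partial \log g_\theta(x)}{\partial \theta}\right| \cdot |\theta_0 - \hat\theta| + |h_0(T) - \hat h(T)| + |m_0(Z) - \hat m(Z)| + \sup_{\theta \in \Theta, x \in B} \left|\frac{\partial G_\theta(x)}{\partial \theta}\right| \cdot |\theta_0 - \hat\theta| \\
&\qquad + \sup_{\theta \in \Theta, x \in B} \left|\frac{\partial G_\theta(x)}{\partial x}\right| \cdot \Big(e^{M} \cdot \tau e^{\max(M, M_h)} \|h_0 - \hat h\|_\infty + \tau e^{M_h} \cdot e^{\max(M, M_m)} \|m_0 - \hat m\|_\infty\Big).
\end{aligned}$$
Taking the supremum on both sides, we conclude that
$$\|l(T, \delta, Z; h_0, m_0, \theta_0) - l(T, \delta, Z; \hat h, \hat m, \hat\theta)\|_\infty \lesssim |\theta_0 - \hat\theta| + \|h_0 - \hat h\|_\infty + \|m_0 - \hat m\|_\infty.$$
The proof of the second inequality is similar.

$$\begin{aligned}
&|l(T, \delta, Z; \nu_0, \theta_0) - l(T, \delta, Z; \hat\nu, \hat\theta)| \\
&\quad \le \sup_{\theta \in \Theta, x \in B} \left|\frac{\partial \log g_\theta(x)}{\partial \theta}\right| \cdot |\theta_0 - \hat\theta| + \sup_{\theta \in \Theta, x \in B} \left|\frac{\partial \log g_\theta(x)}{\partial x}\right| \cdot \left|\int_0^T e^{\nu_0(s, Z)}\,ds - \int_0^T e^{\hat\nu(s, Z)}\,ds\right| + |\nu_0(T, Z) - \hat\nu(T, Z)| \\
&\qquad + \sup_{\theta \in \Theta, x \in B} \left|\frac{\partial G_\theta(x)}{\partial \theta}\right| \cdot |\theta_0 - \hat\theta| + \sup_{\theta \in \Theta, x \in B} \left|\frac{\partial G_\theta(x)}{\partial x}\right| \cdot \left|\int_0^T e^{\nu_0(s, Z)}\,ds - \int_0^T e^{\hat\nu(s, Z)}\,ds\right|.
\end{aligned}$$
Again, by Taylor's expansion,
$$\left|\int_0^T e^{\nu_0(s, Z)}\,ds - \int_0^T e^{\hat\nu(s, Z)}\,ds\right| \le \tau e^{\max(M, M_\nu)} \|\nu_0 - \hat\nu\|_\infty.$$
Finally, we obtain that
$$\begin{aligned}
&|l(T, \delta, Z; \nu_0, \theta_0) - l(T, \delta, Z; \hat\nu, \hat\theta)| \\
&\quad \le \sup_{\theta \in \Theta, x \in B} \left|\frac{\partial \log g_\theta(x)}{\partial \theta}\right| \cdot |\theta_0 - \hat\theta| + \sup_{\theta \in \Theta, x \in B} \left|\frac{\partial \log g_\theta(x)}{\partial x}\right| \cdot \tau e^{\max(M, M_\nu)} \|\nu_0 - \hat\nu\|_\infty + |\nu_0(T, Z) - \hat\nu(T, Z)| \\
&\qquad + \sup_{\theta \in \Theta, x \in B} \left|\frac{\partial G_\theta(x)}{\partial \theta}\right| \cdot |\theta_0 - \hat\theta| + \sup_{\theta \in \Theta, x \in B} \left|\frac{\partial G_\theta(x)}{\partial x}\right| \cdot \tau e^{\max(M, M_\nu)} \|\nu_0 - \hat\nu\|_\infty.
\end{aligned}$$
Taking the supremum on both sides, we conclude that
$$\|l(T, \delta, Z; \nu_0, \theta_0) - l(T, \delta, Z; \hat\nu, \hat\theta)\|_\infty \lesssim |\theta_0 - \hat\theta| + \|\nu_0 - \hat\nu\|_\infty.$$
The proof of the second inequality is similar.

Proof of lemma 5. According to Yarotsky (2017, Theorem 1), there exist approximating functions $h^*$ and $m^*$ such that $\|h^* - h_0\|_\infty = O\big(n^{-\frac{\beta}{\beta + d}}\big)$ and $\|m^* - m_0\|_\infty = O\big(n^{-\frac{\beta}{\beta + d}}\big)$. Let $\pi_n h_0 = h^*$, $\pi_n m_0 = m^*$, and $\pi_n \theta_0 = \theta_0$.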
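We have that
$$\begin{aligned}
d_{\mathrm{PF}}(\pi_n \phi_0, \phi_0) &= \sqrt{\mathbb{E}_Z\left[\int \Big|\sqrt{p(T, \delta \mid Z; \pi_n h_0, \pi_n m_0, \pi_n \theta_0)} - \sqrt{p(T, \delta \mid Z; h_0, m_0, \theta_0)}\Big|^2 \mu(dT \times d\delta)\right]} \\
&= \sqrt{\mathbb{E}_Z\left[\int \Big[e^{\frac{1}{2} l(T, \delta, Z; \pi_n h_0, \pi_n m_0, \pi_n \theta_0)} - e^{\frac{1}{2} l(T, \delta, Z; h_0, m_0, \theta_0)}\Big]^2 f_{C|Z}(T)^{1-\delta} S_{C|Z}(T)^{\delta}\, \mu(dT \times d\delta)\right]} \\
&\le \Big\|e^{\frac{1}{2} l(T, \delta, Z; \pi_n h_0, \pi_n m_0, \pi_n \theta_0)} - e^{\frac{1}{2} l(T, \delta, Z; h_0, m_0, \theta_0)}\Big\|_\infty \times \sqrt{\mathbb{E}_Z\left[\int f_{C|Z}(T)^{1-\delta} S_{C|Z}(T)^{\delta}\, \mu(dT \times d\delta)\right]}.
\end{aligned}$$
By lemmas 1 and 3, we have that
$$\Big\|e^{\frac{1}{2} l(T, \delta, Z; \pi_n h_0, \pi_n m_0, \pi_n \theta_0)} - e^{\frac{1}{2} l(T, \delta, Z; h_0, m_0, \theta_0)}\Big\|_\infty \lesssim |\pi_n \theta_0 - \theta_0| + \|\pi_n h_0 - h_0\|_\infty + \|\pi_n m_0 - m_0\|_\infty = O\big(n^{-\frac{\beta}{\beta + d}}\big).$$
Since $f_{C|Z}(T)^{1-\delta} \le f_{C|Z}(T) + 1$ and $S_{C|Z}(T)^{\delta} \le 1$, we also have that
$$\sqrt{\mathbb{E}_Z\left[\int f_{C|Z}(T)^{1-\delta} S_{C|Z}(T)^{\delta}\, \mu(dT \times d\delta)\right]} \le \sqrt{\mathbb{E}_Z\left[\int \big(1 + f_{C|Z}(T)\big)\, \mu(dT \times d\delta)\right]} \le \sqrt{2 + 2\tau}.$$
Thus, we obtain that $d_{\mathrm{PF}}(\pi_n \phi_0, \phi_0) = O\big(n^{-\frac{\beta}{\beta + d}}\big)$.

Proof of lemma 6. According to Yarotsky (2017, Theorem 1), there exists an approximating function $\nu^*$ such that $\|\nu^* - \nu_0\|_\infty = O\big(n^{-\frac{\beta}{\beta + d + 1}}\big)$. Let $\pi_n \nu_0 = \nu^*$ and $\pi_n \theta_0 = \theta_0$. We have that
$$\begin{aligned}
d_{\mathrm{FN}}(\pi_n \psi_0, \psi_0) &= \sqrt{\mathbb{E}_Z\left[\int \Big|\sqrt{p(T, \delta \mid Z; \pi_n \nu_0, \pi_n \theta_0)} - \sqrt{p(T, \delta \mid Z; \nu_0, \theta_0)}\Big|^2 \mu(dT \times d\delta)\right]} \\
&= \sqrt{\mathbb{E}_Z\left[\int \Big[e^{\frac{1}{2} l(T, \delta, Z; \pi_n \nu_0, \pi_n \theta_0)} - e^{\frac{1}{2} l(T, \delta, Z; \nu_0, \theta_0)}\Big]^2 f_{C|Z}(T)^{1-\delta} S_{C|Z}(T)^{\delta}\, \mu(dT \times d\delta)\right]} \\
&\le \Big\|e^{\frac{1}{2} l(T, \delta, Z; \pi_n \nu_0, \pi_n \theta_0)} - e^{\frac{1}{2} l(T, \delta, Z; \nu_0, \theta_0)}\Big\|_\infty \times \sqrt{\mathbb{E}_Z\left[\int f_{C|Z}(T)^{1-\delta} S_{C|Z}(T)^{\delta}\, \mu(dT \times d\delta)\right]}.
\end{aligned}$$
By lemmas 2 and 4, we have that
$$\Big\|e^{\frac{1}{2} l(T, \delta, Z; \pi_n \nu_0, \pi_n \theta_0)} - e^{\frac{1}{2} l(T, \delta, Z; \nu_0, \theta_0)}\Big\|_\infty \lesssim |\pi_n \theta_0 - \theta_0| + \|\pi_n \nu_0 - \nu_0\|_\infty = O\big(n^{-\frac{\beta}{\beta + d + 1}}\big).$$
Since $f_{C|Z}(T)^{1-\delta} \le f_{C|Z}(T) + 1$ and $S_{C|Z}(T)^{\delta} \le 1$, we also have that
$$\sqrt{\mathbb{E}_Z\left[\int f_{C|Z}(T)^{1-\delta} S_{C|Z}(T)^{\delta}\, \mu(dT \times d\delta)\right]} \le \sqrt{\mathbb{E}_Z\left[\int \big(1 + f_{C|Z}(T)\big)\, \mu(dT \times d\delta)\right]} \le \sqrt{2 + 2\tau}.$$
Thus, we obtain that $d_{\mathrm{FN}}(\pi_n \psi_0, \psi_0) = O\big(n^{-\frac{\beta}{\beta + d + 1}}\big)$.

Proof of lemma 7. The left inequality is trivial according to the definition of the covering number. We need to show the correctness of the right inequality.

Figure 3: The plots in the first row compare the empirical estimates of the survival function $\hat S(t \mid \bar Z)$ against its true value, with $\bar Z$ being the average of the features of the 100 hold-out points, under the PF scheme. The plots in the second row are obtained using the FN scheme, with analogous semantics to the first row.

feature $\bar Z$ as the sample mean of all the 100 hold-out test points, and plot $\hat S(t \mid \bar Z)$ against the ground truth $S(t \mid \bar Z)$ for both the PF and FN schemes. The results are shown in figure 3. The results suggest that both schemes provide accurate estimation of survival functions when the sample size is sufficiently large.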
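Under the PF scheme, the likelihood in section B.1 implies the survival function $S(t \mid Z) = \exp\big(-G_\theta\big(e^{m(Z)} \int_0^t e^{h(s)}\,ds\big)\big)$, which is the quantity compared in this experiment. The sketch below evaluates it at the mean hold-out feature and measures the largest gap between a fitted and a reference model on a time grid. It is only an illustration: the gamma frailty transform and the cumulative baseline $\int_0^t e^{s}\,ds = e^t - 1$ (i.e., $h(t) = t$) are assumptions made for the example, and all function names are hypothetical.

```python
import math


def mean_feature(rows):
    """Coordinate-wise sample mean of a list of equally sized feature tuples."""
    count = len(rows)
    return tuple(sum(column) / count for column in zip(*rows))


def survival_pf(t, z, m, theta, cum_baseline=math.expm1):
    """S(t | z) = exp(-G_theta(exp(m(z)) * H(t))) with the gamma frailty transform.

    cum_baseline is H(t); the default exp(t) - 1 assumes h(t) = t.
    """
    x = math.exp(m(z)) * cum_baseline(t)
    big_g = x if theta == 0.0 else math.log1p(theta * x) / theta
    return math.exp(-big_g)


def max_abs_gap(grid, z, m_hat, theta_hat, m_ref, theta_ref):
    """Largest absolute difference between two PF survival curves over a time grid."""
    return max(
        abs(survival_pf(t, z, m_hat, theta_hat) - survival_pf(t, z, m_ref, theta_ref))
        for t in grid
    )
```

A small `max_abs_gap` at $\bar Z$ corresponds to the two curves in a panel of figure 3 lying nearly on top of each other.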

D.3 PERFORMANCE EVALUATION UNDER IBS AND INBLL

In this subsection we augment the experimental results in section 5 with three recent state-of-the-art baseline models: SurvNode (Groha et al., 2020), DCM (Nagpal et al., 2021) and DeSurv (Danks & Yau, 2022). The results are reported in table 4 for the four small-scale datasets and in table 5 for the two larger datasets. The proposed NFM framework achieves the best performance on 5 of the 6 datasets, which is consistent with the findings in table 1.

In each column, the boldfaced score denotes the best result and the underlined score represents the second-best result. Two models are not reported, namely SODEN and DeepEH, as we found empirically that their computational/memory cost is significantly worse than the rest, and we failed to obtain reasonable performance over the two datasets for these two models.

on the MIMIC-III dataset evaluated under C-index. This phenomenon is also understandable: the DeepSurv model is trained with a variant of partial likelihood (PL), and previous work (Steck et al., 2007) pointed out that PL-type objectives are closely related to the ranking problem. As the C-index can be considered a type of ranking measure, it is possible that DeepSurv obtains better ranking performance than NFM-type models, which are trained using a scale-sensitive likelihood objective.



For example, in medical biology it was observed that genetically identical animals kept in as similar an environment as possible will typically not behave the same upon exposure to environmental carcinogens (Brennan, 2002).

The choice of a one-dimensional frailty family is mostly for simplicity and clarity of the theoretical derivations. Note that there exist multi-dimensional frailty families like the PVF family (Wienke, 2010). Generalizing our theoretical results to such families would require additional sets of regularity conditions, and is left to future exploration.

Here we adopt the conventional notation that $W$ is the collection of the weight matrices of the MLP in all layers, and $b$ corresponds to the collection of the bias vectors in all layers.

Otherwise, one may pose a covariate-dependent model on the censoring time and use $S_C(t \mid Z)$ instead of $S_C(t)$. We adopt the Kaplan-Meier approach since it is still the prevailing practice in evaluations of survival predictions.

CONCLUSION

In this paper, we make principled explorations on applying the idea of frailty models in modern survival analysis to neural survival regressions. A flexible and scalable framework called NFM is proposed that includes many useful survival models as special cases. Under the framework, we study two derived model architectures both theoretically and empirically. Theoretically, we obtain the rates of convergence of the nonparametric function estimators based on neural function approximation. Empirically, we demonstrate the superior predictive performance of the proposed models by evaluating them on several benchmark datasets.



Figure 1: Visualizations of synthetic data results under the NFM framework. The plots in the first row compare the empirical estimates of the nonparametric component $\hat\nu(t, Z)$ against its true value evaluated on 100 hold-out points, under the PF scheme. The plots in the second row are obtained using the FN scheme, with analogous semantics to the first row.

In Kosorok et al. (2004, Proposition 1), the authors verified the regularity condition of gamma and IGG($\alpha$) frailties. Using a similar argument, it is straightforward to verify the regularity of Box-Cox transformation frailty.

B PROOFS OF THEOREMS

B.1 PRELIMINARY

Additional definitions

The theory of empirical processes (van der Vaart et al., 1996) will be involved heavily in the proofs. Therefore we briefly introduce some common notation: for a function class $\mathcal{F}$, define $N(\epsilon, \mathcal{F}, \|\cdot\|)$ to be the covering number of $\mathcal{F}$ with respect to norm $\|\cdot\|$ under radius $\epsilon$, and define $N_{[]}(\epsilon, \mathcal{F}, \|\cdot\|)$ to be the bracketing number of $\mathcal{F}$ with respect to norm $\|\cdot\|$ under radius $\epsilon$. We use $\mathrm{VC}(\mathcal{F})$ to denote the VC-dimension of $\mathcal{F}$. Moreover, we use the notation $a \lesssim b$ to denote $a \le Cb$ for some positive constant $C$.

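PROOFS OF TECHNICAL LEMMAS

Proof of lemma 1. Since $h_0(T) \in W^\beta_M([0, \tau])$ and $m_0(Z) \in W^\beta_M([-1, 1]^d)$, we have that $h_0(T) \le M$, $m_0(Z) \le M$ and $e^{m_0(Z)} \int_0^T e^{h_0(s)}\,ds \le \tau e^{2M}$. Let $B = [0, \tau e^{2M}]$. We have that
$$|l(T, \delta, Z; h_0, m_0, \theta_0)| \le \left|\log g_{\theta_0}\Big(e^{m_0(Z)} \int_0^T e^{h_0(s)}\,ds\Big)\right| + |h_0(T)| + |m_0(Z)| + \left|G_{\theta_0}\Big(e^{m_0(Z)} \int_0^T e^{h_0(s)}\,ds\Big)\right| \le 2M + \sup_{x \in B} |\log g_{\theta_0}(x)| + \sup_{x \in B} |G_{\theta_0}(x)|.$$
By condition 5, $l(T, \delta, Z; h_0, m_0, \theta_0)$ is bounded for $(T, \delta, Z) \in [0, \tau] \times \{0, 1\} \times [-1, 1]^d$. The proof of the boundedness of $l(T, \delta, Z; \hat h, \hat m, \hat\theta)$ is similar.

Proof of lemma 2. Since $\nu_0(T, Z) \in W^\beta_M([0, \tau] \times [-1, 1]^d)$, we have $\nu_0(T, Z) \le M$ and $\int_0^T e^{\nu_0(s, Z)}\,ds \le \tau e^{M}$. Let $B = [0, \tau e^{M}]$. We have that
$$|l(T, \delta, Z; \nu_0, \theta_0)| \le \left|\log g_{\theta_0}\Big(\int_0^T e^{\nu_0(s, Z)}\,ds\Big)\right| + |\nu_0(T, Z)| + \left|G_{\theta_0}\Big(\int_0^T e^{\nu_0(s, Z)}\,ds\Big)\right| \le M + \sup_{x \in B} |\log g_{\theta_0}(x)| + \sup_{x \in B} |G_{\theta_0}(x)|.$$
By condition 5, $l(T, \delta, Z; \nu_0, \theta_0)$ is bounded for $(T, \delta, Z) \in [0, \tau] \times \{0, 1\} \times [-1, 1]^d$. The proof of the boundedness of $l(T, \delta, Z; \hat\nu, \hat\theta)$ is similar.

Proof of lemma 3. By definition, we have that
$$\begin{aligned}
|l(T, \delta, Z; h_0, m_0, \theta_0) - l(T, \delta, Z; \hat h, \hat m, \hat\theta)| &\le \left|\log g_{\theta_0}\Big(e^{m_0(Z)} \int_0^T e^{h_0(s)}\,ds\Big) - \log g_{\hat\theta}\Big(e^{\hat m(Z)} \int_0^T e^{\hat h(s)}\,ds\Big)\right| + |h_0(T) - \hat h(T)| + |m_0(Z) - \hat m(Z)| \\
&\quad + \left|G_{\theta_0}\Big(e^{m_0(Z)} \int_0^T e^{h_0(s)}\,ds\Big) - G_{\hat\theta}\Big(e^{\hat m(Z)} \int_0^T e^{\hat h(s)}\,ds\Big)\right|.
\end{aligned}$$
Let $B = [0, \tau \max(e^{2M}, e^{M_h + M_m})]$. By Taylor's expansion, each difference above can be further bounded, where the arguments of $g_\theta$ and $G_\theta$ are compared through the decomposition
$$e^{m_0(Z)} \int_0^T e^{h_0(s)}\,ds - e^{\hat m(Z)} \int_0^T e^{\hat h(s)}\,ds = e^{m_0(Z)} \int_0^T \big(e^{h_0(s)} - e^{\hat h(s)}\big)\,ds + \big(e^{m_0(Z)} - e^{\hat m(Z)}\big) \int_0^T e^{\hat h(s)}\,ds.$$

Proof of lemma 4. By definition, we have that
$$|l(T, \delta, Z; \nu_0, \theta_0) - l(T, \delta, Z; \hat\nu, \hat\theta)| \le \left|\log g_{\theta_0}\Big(\int_0^T e^{\nu_0(s, Z)}\,ds\Big) - \log g_{\hat\theta}\Big(\int_0^T e^{\hat\nu(s, Z)}\,ds\Big)\right| + |\nu_0(T, Z) - \hat\nu(T, Z)| + \left|G_{\theta_0}\Big(\int_0^T e^{\nu_0(s, Z)}\,ds\Big) - G_{\hat\theta}\Big(\int_0^T e^{\hat\nu(s, Z)}\,ds\Big)\right|.$$
Let $B = [0, \tau \max(e^{M}, e^{M_\nu})]$. By Taylor's expansion, each difference above can be further bounded in terms of $|\theta_0 - \hat\theta|$ and $\|\nu_0 - \hat\nu\|_\infty$.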

Figure 2: Visualizations of synthetic data results under the PF scheme of NFM framework, regarding empirical recovery of the m function in (2)

Survival prediction results measured in IBS and INBLL metric (%) on four small-scale survival datasets. In each column, the boldfaced score denotes the best result and the underlined score represents the second-best result.

±0.39 32.84 ±1.15 19.14 ±0.39 56.35 ±1.00
NFM-FN 16.11 ±0.81 48.21 ±2.04 17.66 ±0.52 52.41 ±1.22 10.05 ±0.39 33.11 ±1.10 18.97 ±0.60 55.87 ±1.50

5.2 REAL-WORLD DATA EXPERIMENTS

Datasets We use five survival datasets and one non-survival dataset for evaluation. The survival datasets include the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) (Curtis et al., 2012), the Rotterdam tumor bank and German Breast Cancer Study Group (RotGBSG) (Knaus et al., 1995), the Assay Of Serum Free Light Chain (FLCHAIN) (Dispenzieri et al., 2012), the Study to Understand Prognoses Preferences Outcomes and Risks of Treatment (SUPPORT) (Knaus et al., 1995), and the Medical Information Mart for Intensive Care (MIMIC-III) (Johnson

Survival prediction results measured in IBS and INBLL metric (%) on two larger datasets. In each column, the boldfaced score denotes the best result and the underlined score represents the second-best result. Two models are not reported, namely SODEN and DeepEH, as we found empirically that their computational/memory cost is significantly worse than the rest, and we fail to obtain reasonable performances over the two datasets for these two models.

Table 4: Survival prediction results measured in IBS and INBLL metric (%) on four small-scale survival datasets. In each column, the boldfaced score denotes the best result and the underlined score represents the second-best result. Each consecutive (IBS, INBLL) pair of columns corresponds to one dataset.

Model     IBS          INBLL        IBS          INBLL        IBS          INBLL        IBS          INBLL
DeepSurv  16.55 ±0.93  49.85 ±3.02  17.80 ±0.49  52.62 ±1.25  10.09 ±0.38  33.28 ±1.15  19.20 ±0.41  56.48 ±1.08
CoxTime   16.54 ±0.83  49.67 ±2.67  17.80 ±0.58  52.56 ±1.47  10.28 ±0.45  34.18 ±1.53  19.17 ±0.40  56.45 ±1.10
DeepHit   17.50 ±0.83  52.10 ±2.16  19.61 ±0.38  56.67 ±1.10  11.83 ±0.39  37.72 ±1.02  20.66 ±0.32  60.06 ±0.72
DeepEH    16.56 ±0.65  49.42 ±1.53  17.62 ±0.52  52.08 ±1.27  10.11 ±0.37  33.30 ±1.10  19.30 ±0.39  56.67 ±0.94
SuMo-net  16.49 ±0.83  49.74 ±2.21  17.77 ±0.47  52.62 ±1.11  10.07 ±0.40  33.20 ±1.10  19.40 ±0.38  56.87 ±0.96
SODEN     16.52 ±0.63  49.39 ±1.97  17.05 ±0.63  50.45 ±1.97  10.13 ±0.24  33.37 ±0.57  19.07 ±0.50  56.15 ±1.35
SurvNode  16.67 ±1.32  49.73 ±3.89  17.42 ±0.53  51.70 ±1.16  10.40 ±0.29  34.37 ±1.03  19.58 ±0.34  57.49 ±0.84
DCM       16.58 ±0.87  49.48 ±2.23  17.66 ±0.54  52.26 ±1.23  10.13 ±0.50  33.40 ±1.38  19.29 ±0.42  56.68 ±1.09
DeSurv    16.71 ±0.75  49.61 ±2.15  17.98 ±0.46  53.23 ±1.15  10.06 ±0.62  33.18 ±1.93  19.50 ±0.40  57.28 ±0.89
NFM-PF    16.33 ±0.75  49.07 ±1.96  17.60 ±0.55  52.12 ±1.34   9.96 ±0.39  32.84 ±1.15  19.14 ±0.39  56.35 ±1.00
NFM-FN    16.11 ±0.81  48.21 ±2.04  17.66 ±0.52  52.41 ±1.22  10.05 ±0.39  33.11 ±1.10  18.97 ±0.60  55.87 ±1.50

Survival prediction results measured in IBS and INBLL metric (%) on two larger datasets.


Suppose that we have $\{B(g_i, \frac{\varepsilon}{2})\}$, $i = 1, \ldots, N$, where $N = N(\frac{\varepsilon}{2}, \mathcal{F}, \|\cdot\|)$, a minimal collection of $\frac{\varepsilon}{2}$-balls that covers $\mathcal{F}$. Then for each $i$ there exists at least one $f_i \in \mathcal{F}$ such that $f_i \in B(g_i, \frac{\varepsilon}{2})$. Consider the following $\varepsilon$-balls $\{B(f_i, \varepsilon)\}$, $i = 1, \ldots, N$. For arbitrary $f \in \mathcal{F} \cap B(g_i, \frac{\varepsilon}{2})$, we have that $\|f - f_i\| \le \|f - g_i\| + \|f_i - g_i\| \le \varepsilon$. Thus $\{B(f_i, \varepsilon)\}$, $i = 1, \ldots, N$, forms an $\varepsilon$-covering of $\mathcal{F}$. By definition, we have that $\tilde N(\varepsilon, \mathcal{F}, \|\cdot\|) \le N(\frac{\varepsilon}{2}, \mathcal{F}, \|\cdot\|)$.

Proof of lemma 8. The proof of the first two inequalities follows exactly the same steps as lemma 7. Here we only need to address the rest of the statement. We first choose a set of $\frac{\varepsilon}{2}$-covering balls $\{B(f_i, \frac{\varepsilon}{2})\}$, $i = 1, \ldots, N_1$, where $N_1 = N(\frac{\varepsilon}{2}, \mathcal{F}, \|\cdot\|_\infty)$. Now we construct a set of brackets $\{[l_i, u_i]\}$, $i = 1, \ldots, N_1$, where $l_i = f_i - \frac{\varepsilon}{2}$ and $u_i = f_i + \frac{\varepsilon}{2}$. Noting that the bracket $[l_i, u_i]$ is exactly the same as $B(f_i, \frac{\varepsilon}{2})$, the set $\{[l_i, u_i]\}$, $i = 1, \ldots, N_1$, covers $\mathcal{F}$, which leads to [...]

Proof of lemma 9. By lemma 8, first we have that [...] which indicates that as long as [...] Since $\Theta$ is a compact set on $\mathbb{R}$, by lemma 8 and the traditional volume argument, we have that [...]. By lemma 7 we further have that [...] Similarly, there exists a constant $c_m > 0$ such that [...]. Thus, finally we can obtain that [...]

Proof of lemma 10. By lemma 8, first we have [...]. By lemma 4, there exists a constant $c_3 > 0$ such that for arbitrary $\hat\nu_1, \hat\nu_2 \in \mathcal{V}_n$ and $\hat\theta_1, \hat\theta_2 \in \Theta$, we have that [...] which indicates that as long as [...], we have that $\|l(T, \delta, Z; \hat\nu_1, \hat\theta_1) - l(T, \delta, Z; \hat\nu_2, \hat\theta_2)\|_\infty \le \varepsilon$. Thus, we have: [...] Since $\Theta$ is a compact set on $\mathbb{R}$, by lemma 8 and the traditional volume argument, we have that [...]. Thus, finally we can obtain that [...]

We report summaries of descriptive statistics of the 6 benchmark datasets used in section 5.2 in table 3.

C.2 DETAILS OF SYNTHETIC EXPERIMENTS

Since the true model is assumed to be of PF form, we generate event time according to the following transformed regression model (Dabrowska & Doksum, 1988):
$$\log H(\tilde T) = -m(Z) + \epsilon, \quad (15)$$
where $H(t) = \int_0^t e^{h(s)}\,ds$ with $h$ defined in (2). The error term $\epsilon$ is generated such that $e^{\epsilon}$ has cumulative hazard function $G_\theta$. The formulation (15) is equivalent to (2) (Dabrowska & Doksum, 1988; Cuzick, 1988; Kosorok et al., 2004). In our experiments, the covariates are of dimension 5, sampled independently from the uniform distribution over $[0, 1]$. We set $h(t) = t$ and hence $H(t) = e^t - 1$. The functional form of $m(Z)$ is set to be $m(Z) = \sin(\langle Z, \beta \rangle) + \langle \sin(Z), \beta \rangle$, where $\beta = (0.1, 0.2, 0.3, 0.4, 0.5)$. Then the censoring time $C$ is generated according to [...], which reuses covariate $Z$ and independently draws a noise vector $\epsilon_C$ such that the censoring ratio is controlled at around 40%. We generate three datasets with $n \in \{1000, 5000, 10000\}$ respectively.

Hyperparameter configurations We specify below the network architectures and optimization configurations used in all the tasks:

PF scheme: For both $m$ and $h$, we use 64 hidden units for $n = 1000$, 128 hidden units for $n = 5000$ and 256 hidden units for $n = 10000$. We train each model for 100 epochs with batch size 128, optimized using Adam with learning rate 0.0001 and no weight decay.

FN scheme: For $\nu$, we use 64 hidden units for $n = 1000$, 128 hidden units for $n = 5000$ and 256 hidden units for $n = 10000$. We train each model for 100 epochs with batch size 128, optimized using Adam with learning rate 0.0001 and no weight decay.
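The event-time part of this generation procedure can be sketched by inverse-transform sampling. The code below is a minimal illustration under explicit assumptions, not the authors' script: it uses the gamma frailty transform for $G_\theta$, takes $H(t) = \int_0^t e^{s}\,ds = e^t - 1$ for $h(t) = t$, and omits censoring entirely because the censoring model is not reproduced here. All function names are hypothetical.

```python
import math
import random

BETA = (0.1, 0.2, 0.3, 0.4, 0.5)


def m_true(z):
    """m(Z) = sin(<Z, beta>) + <sin(Z), beta>."""
    inner = sum(zi * bi for zi, bi in zip(z, BETA))
    return math.sin(inner) + sum(math.sin(zi) * bi for zi, bi in zip(z, BETA))


def event_time(z, theta, u):
    """Map a uniform draw u in (0, 1] to an event time.

    e^eps has survival function exp(-G_theta(x)), so e^eps = G_theta^{-1}(-log u);
    for the gamma transform, G_theta^{-1}(y) = (exp(theta * y) - 1) / theta.
    """
    y = -math.log(u)
    exp_eps = y if theta == 0.0 else math.expm1(theta * y) / theta
    big_h = math.exp(-m_true(z)) * exp_eps  # H(T) = exp(-m(Z)) * e^eps
    return math.log1p(big_h)  # inverse of H(t) = exp(t) - 1


def make_dataset(n, theta, seed=0):
    """Draw n pairs (Z, T) with Z uniform on [0, 1)^5; censoring is not simulated."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        z = tuple(rng.random() for _ in range(5))
        data.append((z, event_time(z, theta, 1.0 - rng.random())))
    return data
```

As a sanity check on the construction, when $m(Z) = 0$ and $\theta = 1$ the implied survival function is $1 / (1 + e^t - 1) = e^{-t}$, so `event_time` reduces to $-\log u$.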

C.3 DETAILS OF PUBLIC DATA EXPERIMENTS

Dataset preprocessing For the METABRIC, RotGBSG, FLCHAIN, SUPPORT and KKBOX datasets, we take the version provided in the pycox package (Kvamme et al., 2019). We standardize continuous features to zero mean and unit variance and one-hot encode all categorical features. For the MIMIC-III dataset, we follow the preprocessing routines in Purushotham et al. (2018), which extract 26 features. The event of interest is defined as mortality after admission, and the censoring time is defined as the last time of being discharged from the hospital. The definition is similar to that in Tang et al. (2022). However, since their processed dataset is not open-sourced, our implementation yields a dataset with a much higher censoring rate (90.2%, compared to 61.0% as reported in the SODEN paper (Tang et al., 2022)). Since the major purpose of this paper is the proposal of the NFM framework, we use our own version of the processed dataset to further verify the predictive performance of NFM.

Hyperparameter configurations We follow the general training template that uses MLPs as all nonparametric function approximators (i.e., $m$ and $h$ in the PF scheme, and $\nu$ in the FN scheme), and train for 100 epochs across all datasets using Adam as the optimizer. The tunable parameters and their respective tuning ranges are reported as follows:

Number of layers (network depth) We tune the network depth $L \in \{2, 3, 4\}$. Typically, the performance of two-layer MLPs is sufficiently satisfactory.

Number of hidden units in each layer (network width) We tune the network width [...]

Optional dropout We optionally apply dropout with probability $p \in \{0.1, 0.2, 0.3, 0.5, 0.7\}$.

Batch size We tune the batch size within the range $\{128, 256, 512\}$. On the KKBOX dataset, we also tested a larger batch size of 1024.

Learning rate and weight decay We tune both the learning rate and the weight decay coefficient of Adam within the range $\{0.01, 0.001, 0.0001\}$.

Frailty specification We tested gamma frailty, Box-Cox transformation frailty, and IGG($\alpha$) frailty with $\alpha \in \{0, 0.25, 0.75\}$. Here note that, under the IGG($\alpha$) transform given in the list of frailty specifications, the limit $\alpha \to 0$ recovers gamma frailty and $\alpha = 0.5$ gives the inverse Gaussian frailty. We also empirically tried to set $\alpha$ to be a learnable parameter and found that this additional flexibility provides little performance improvement on the datasets used for evaluation.
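The tuning ranges above can be collected into a single search-space object. The sketch below is only an illustration: the key names are hypothetical, the network width is left out because its range is not stated above, and exhaustive enumeration is an assumption made for the example (the text does not say how the space was searched).

```python
from itertools import product

# Tuning ranges as listed in the text; None means "no dropout".
SEARCH_SPACE = {
    "depth": [2, 3, 4],
    "dropout": [None, 0.1, 0.2, 0.3, 0.5, 0.7],
    "batch_size": [128, 256, 512],
    "learning_rate": [0.01, 0.001, 0.0001],
    "weight_decay": [0.01, 0.001, 0.0001],
    "frailty": ["gamma", "box_cox", "igg_0", "igg_0.25", "igg_0.75"],
}


def iter_configs(space):
    """Yield every combination of the listed values as a dict."""
    keys = list(space)
    for values in product(*(space[key] for key in keys)):
        yield dict(zip(keys, values))
```

Enumerating the full product quickly becomes expensive, so in practice one would likely sample from `iter_configs(SEARCH_SPACE)` or fix some entries per dataset.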

C.4 IMPLEMENTATIONS

We use PyTorch to implement NFM. The source code is provided in the supplementary material. For the baseline models:
• We use the implementations of CoxPH, GBM, and RSF from the sksurv package (Pölsterl, 2020). For the KKBOX dataset, we use the XGBoost library (Chen & Guestrin, 2016) to implement GBM and RSF, which might yield some performance degradation.
• We use the pycox package to implement the DeepSurv, CoxTime, and DeepHit models.
• We use the official code provided in the SODEN paper (Tang et al., 2022) to implement SODEN.
• We obtain results of SuMo and DeepEH based on our re-implementations.

D ADDITIONAL EXPERIMENTS

D.1 RECOVERY ASSESSMENT OF m(Z) IN PF SCHEME

We plot empirical recovery results targeting the $m$ function in (2) in figure 2. The result demonstrates satisfactory recovery with a moderate amount of data, i.e., $n \ge 1000$.

D.2 RECOVERY ASSESSMENT OF SURVIVAL FUNCTIONS

To assess the recovery performance of NFM with respect to survival functions, we consider the following setup: under the same data generation framework as in section C.2, we compute the test

Table 6: Survival prediction results measured in C-index (%) on all the 6 benchmark datasets. In each column, the boldfaced score denotes the best result and the underlined score represents the second-best result. The average rank of each model is reported in the rightmost column. We did not manage to obtain reasonable results for DeepEH and SODEN on the two larger datasets MIMIC-III and KKBOX, and we set the corresponding ranks to be the worst on those datasets.

The concordance index (C-index) (Antolini et al., 2005) is another evaluation metric that is commonly used in survival analysis. The C-index estimates the probability that, for a random pair of individuals, the predicted survival times of the two individuals have the same ordering as their true survival times. Formally, C-index is defined as

We report performance evaluations based on C-index over all the 6 benchmark datasets in table 6. Table 6 shows no clear winner regarding the C-index metric across the 6 selected datasets. We conjecture that this phenomenon is closely related to the loose correlation between the C-index and the likelihood-based learning objective, as was observed in Rindt et al. (2022). Therefore we compute the average rank of each model as an overall assessment of performance, as reported in the last column of table 6. The results suggest that the two NFM models perform better on average.
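The pairwise-ordering idea behind the C-index can be sketched in a few lines. The code below is a simplified, Harrell-style version written for illustration only: it uses a single scalar risk score per individual rather than the time-dependent formulation of Antolini et al. (2005) used in the evaluation, and the function name is hypothetical.

```python
def concordance_index(times, events, risk_scores):
    """Fraction of comparable pairs that are ordered correctly.

    A pair (i, j) is comparable when individual i has an observed event
    (events[i] is truthy) and times[i] < times[j]. The pair is concordant
    when i, who failed earlier, has the higher risk score; ties in the
    risk score count as one half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored individual cannot be the earlier failure
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

Because only the ordering of the scores matters, any monotone rescaling of the predictions leaves this quantity unchanged, which is consistent with the loose coupling to likelihood-based objectives noted above.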

D.5 BENEFITS OF FRAILTY

We compute the (relative) performance gain of NFM-PF and NFM-FN, against their non-frailty counterparts, namely DeepSurv (Katzman et al., 2018) and SuMo-net (Rindt et al., 2022) 

