DEEP BIOLOGICAL PATHWAY INFORMED PATHOLOGY-GENOMIC MULTIMODAL SURVIVAL PREDICTION

Abstract

The integration of multimodal data, such as pathological images and genomic data, is essential for understanding cancer heterogeneity and complexity, for personalizing treatments, and for enhancing survival prediction. Despite progress in integrating pathology and genomic data, most existing methods cannot thoroughly mine the complex inter-modality relations. Additionally, identifying explainable features from these models that govern preclinical discovery and clinical prediction is crucial for cancer diagnosis, prognosis, and therapeutic response studies. We propose PONET, a novel biological-pathway-informed pathology-genomic deep model that integrates pathological images and genomic data not only to improve survival prediction but also to identify the genes and pathways that drive different survival rates in patients. Empirical results on six datasets from The Cancer Genome Atlas (TCGA) show that our proposed method achieves superior predictive performance and reveals meaningful biological interpretations. The proposed method provides insight into how to train biologically informed deep networks on multimodal biomedical data, which has general applicability for understanding diseases and predicting response and resistance to treatment.

1. INTRODUCTION

Manual examination of haematoxylin and eosin (H&E)-stained slides of tumour tissue by pathologists is currently the state-of-the-art for cancer diagnosis (Chan, 2014) . The recent advancements in deep learning for digital pathology have enabled the use of whole-slide images (WSI) for computational image analysis tasks, such as cellular segmentation (Pan et al., 2017; Hou et al., 2020) , tissue classification and characterisation (Hou et al., 2016; Hekler et al., 2019; Iizuka et al., 2020) . While H&E slides are important and sufficient to establish a profound diagnosis, genomics data can provide a deep characterisation of the tumour on the molecular level potentially offering the chance for prognostic and predictive biomarker discovery. Cancer prognosis via survival outcome prediction is a standard method used for biomarker discovery, stratification of patients into distinct treatment groups, and therapeutic response prediction (Cheng et al., 2017; Ning et al., 2020) . WSIs exhibit enormous heterogeneity and can be as large as 150,000 × 150,000 pixels. Most approaches adopt a two-stage multiple instance learning-based (MIL) approach for representation learning of WSIs, in which: 1) instance-level feature representations are extracted from image patches in the WSI, and then 2) global aggregation schemes are applied to the bag of instances to obtain a WSI-level representation for subsequent supervision (Hou et al., 2016; Courtiol et al., 2019; Wulczyn et al., 2020; Lu et al., 2021) . Therefore, multimodal survival prediction faces an additional challenge due to the large data heterogeneity gap between WSIs and genomics, and many existing approaches use simple multimodal fusion mechanisms for feature integration, which prevents mining important multimodal interactions (Mobadersany et al., 2018; Chen et al., 2022b; a) . 
Incorporating biological pathway databases into a model leverages prior biological knowledge so that potential prognostic factors with well-known biological functionality can be identified (Hao et al., 2018). Moreover, encoding biological pathway information into neural networks has achieved superior predictive performance compared with established models (Elmarakeby et al., 2021). Motivated by the current challenges in multimodal fusion of pathology and genomics, and by the prognostic interpretations that pathway-based analysis can provide by linking pathways to clinical outcomes, we propose PONET, a novel biological-pathway-informed pathology-genomic deep model that uses H&E WSIs and genomic profile features for survival prediction. The proposed method makes four major contributions: 1) PONET formulates a biological-pathway-informed deep hierarchical multimodal integration framework for pathological images and genomic data; 2) PONET captures diverse and comprehensive modality-specific and cross-modality relations among different data sources based on a factorized bilinear model and a graph fusion network; 3) PONET reveals meaningful model interpretations at both the gene and pathway level for potential biomarker and therapeutic target discovery, and provides spatial visualization of the top genes/pathways, which has enormous potential for discovering novel prognostic morphological determinants; 4) we evaluate PONET on six public TCGA datasets, where it shows superior survival prediction compared to state-of-the-art methods. Fig. 1 shows our model framework.

2. RELATED WORK

Multimodal Fusion. Earlier work on multimodal fusion focused on early fusion and late fusion. Early fusion approaches fuse features by simple concatenation, which cannot fully explore intra-modality dynamics (Wöllmer et al., 2013; Poria et al., 2016; Zadeh et al., 2016). In contrast, late fusion combines modalities by weighted averaging, which fails to model cross-modal interactions (Nojavanasghari et al., 2016; Kampman et al., 2018). The exploitation of relations within each modality has been successfully introduced in cancer prognosis via bilinear models (Wang et al., 2021b) and graph-based models (Subramanian et al., 2021). Adversarial Representation Graph Fusion (ARGF) (Mai et al., 2020) interprets multimodal fusion as a hierarchical interaction learning procedure in which bimodal interactions are first generated from unimodal dynamics, and trimodal dynamics are then generated from bimodal and unimodal dynamics. We propose a new hierarchical fusion framework with modality-specific and cross-modality attentional factorized bilinear modules to mine comprehensive modality interactions. Our hierarchical fusion framework differs from ARGF in the following ways: 1) we take the sum of the weighted modality-specific representations as the unimodal representation, instead of the weighted average used in ARGF; 2) for higher-level fusion, ARGF takes the original embeddings of each modality as input, while we use the weighted modality-specific representations; 3) we argue that ARGF incorporates redundant information in its trimodal dynamics. Multimodal Survival Analysis. There have been exciting attempts at multimodal fusion of pathology and genomic data for cancer survival prediction (Mobadersany et al., 2018; Cheerla & Gevaert, 2019; Wang et al., 2020). However, these multimodal fusion methods fail to explicitly model the interactions among subsets of the modalities.
The Kronecker product considers pairwise interactions of two input feature vectors by producing a high-dimensional feature of quadratic expansion (Zadeh et al., 2017), and has shown its value in cancer survival prediction (Wang et al., 2021b; Chen et al., 2022b;a). Despite promising results, using the Kronecker product in multimodal fusion introduces a large number of parameters, which may lead to high computational cost and a risk of overfitting (Kim et al., 2017; Liu et al., 2021), limiting its applicability and performance. To overcome this drawback, hierarchical factorized bilinear fusion for cancer survival prediction (HFBSurv) (Li et al., 2022) uses a factorized bilinear model to fuse genomic and image features, which dramatically reduces computational complexity. PONET differs from HFBSurv in two ways: 1) PONET's multimodal framework has three levels of hierarchical fusion (unimodal, bimodal, and trimodal), while HFBSurv only considers within-modality and cross-modality fusion, which we argue is not adequate for mining comprehensive interactions; 2) PONET leverages a biological-pathway-informed network for better prediction and meaningful interpretation. Pathway-associated Sparse Neural Network. Pathway-based analysis is an approach that a number of studies have investigated to improve both predictive performance and biological interpretability (Jin et al., 2014; Cirillo et al., 2017; Hao et al., 2018; Elmarakeby et al., 2021). Moreover, pathway-based approaches have yielded more reproducible analysis results than gene expression data analysis alone (Li et al., 2015; Mallavarapu et al., 2017). However, these pathway-based deep neural networks can only model genomic data, which severely limits their applicability in current biomedical research.
Additionally, the existing pathway-associated sparse neural network structures are limited for investigating disease mechanisms: PASNet (Hao et al., 2018) contains only one pathway layer, which carries too little biological prior information to probe the hierarchical relationships among pathways and biological processes; P-NET (Elmarakeby et al., 2021) calculates the final prediction by averaging the outputs of all gene and pathway layers, which biases the learning process by putting more weight on some layers' outputs while underestimating others.

3.1. PROBLEM FORMULATION AND NOTATIONS

The model architecture of PONET is presented in Fig. 1, where three modalities are included as input: gene expression g ∈ R^{d_g}, pathological image p ∈ R^{d_p}, and copy number variation (CNV) + mutation (MUT) c ∈ R^{d_c}, with d_p being the dimensionality of p and so on. We define a hierarchical factorized bilinear fusion model for PONET. We build a sparse biological-pathway-informed embedding network for gene expression, and a fully connected (FC) embedding layer for both the preprocessed pathological image feature (f_p) and the copy number + mutation feature (f_c), to map the features into a similar embedding space and alleviate the differences in statistical properties between modalities; the three network architectures are detailed in Appendix C.1. We label the three modality embeddings as h_m, m ∈ {g, p, c}; the superscripts/subscripts u, b, and t denote unimodal, bimodal, and trimodal fusion, respectively. The embedding of each modality is first used as input to the unimodal fusion to generate the modality-specific representation h^u_m = ω_m ĥ_m, where ω_m is the modality-specific importance; the feature vector of the unimodal fusion is the sum of all modality-specific representations, h^u = Σ_m h^u_m. In the bimodal fusion, the modality-specific representations output by the unimodal fusion are fused to yield cross-modality representations h^b_{m1m2} = ω_{m1m2} ĥ_{m1m2}, m_1, m_2 ∈ {p, c, g}, m_1 ≠ m_2, where ω_{m1m2} is the corresponding cross-modality importance. Similarly, the feature vector of the bimodal fusion is h^b = Σ_{m1,m2} h^b_{m1m2}. We then build a trimodal fusion that takes each cross-modality representation output by the bimodal fusion to thoroughly mine the interactions. Analogously to the bimodal fusion, the trimodal fusion feature vector is h^t = Σ_{m1,m2,m3} ω_{m1m2m3} ĥ_{m1m2m3}, m_1, m_2, m_3 ∈ {p, c, g}, m_1 ≠ m_2 ≠ m_3, where ω_{m1m2m3} is the corresponding trimodal importance.
Finally, PONET concatenates h^u, h^b, and h^t to obtain the final comprehensive multimodal representation and passes it to a Cox proportional hazards model (Cox, 1972; Cheerla & Gevaert, 2019) for survival prediction. In the following sections we describe our hierarchical factorized bilinear fusion framework; l, o, and s denote the dimensionalities of h_m, z_m, and ĥ_{m1m2}, respectively.

3.2. SPARSE NETWORK

We design the sparse gene-pathway network to consist of one gene layer followed by three pathway layers. A patient sample of e gene expression values is formed as a column vector X = [x_1, x_2, ..., x_e], where each node represents one gene. The gene layer is restricted to connections reflecting the gene-pathway relationships curated in the Reactome pathway dataset (Fabregat et al., 2020). The connections are encoded by a binary matrix M ∈ R^{a×e}, where a is the number of pathways and e is the number of genes; an element m_ij of M is set to one if gene j belongs to pathway i. Connections that do not exist in the Reactome pathway dataset are zeroed out. For the subsequent pathway-pathway layers, a similar scheme controls the connections between consecutive layers to reflect the parent-child hierarchical relationships in the Reactome dataset. The output of each layer is calculated as

y = f[(M ∘ W)^T X + ε]

where f is the activation function, M is the binary mask matrix, W is the weight matrix, X is the input, ε is the bias vector, and ∘ is the Hadamard product. We use tanh as the activation of each node. Information flows through the biologically informed network from the first gene layer to the last pathway layer, and we label the final output embedding of the sparse network for gene expression as h_g.
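The masked layer above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the paper's implementation: the toy mask, the dimensions, and the function name are assumptions, and with W stored as (a, e) the transpose in the text is absorbed into the storage convention.

```python
import numpy as np

def sparse_pathway_layer(X, W, M, eps, f=np.tanh):
    """One biologically masked layer: y = f[(M ∘ W) X + ε].

    X   : (e,) gene-expression input vector
    W   : (a, e) learnable weight matrix
    M   : (a, e) binary gene-pathway membership mask (e.g. from Reactome)
    eps : (a,) bias vector
    The Hadamard product M ∘ W zeroes out every connection that is not
    present in the pathway database, so only curated edges carry signal.
    """
    return f((M * W) @ X + eps)
```

Because masked weights are multiplied by zero, changing them never affects the output (and their gradients vanish during training), which is how the network stays restricted to the curated gene-pathway edges.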

3.3. UNIMODAL FUSION

Bilinear models (Tenenbaum & Freeman, 2000) provide richer representations than linear models. Given two feature vectors from different modalities, e.g., visual features x ∈ R^{m×1} for an image and genomic features y ∈ R^{n×1} for a genomic profile, a bilinear model uses a quadratic expansion of a linear transformation that considers every pair of features: z_i = x^T W_i y, where W_i ∈ R^{m×n} is a projection matrix and z_i ∈ R is the output of the bilinear model. Bilinear models introduce a large number of parameters, which potentially leads to high computational cost and overfitting risk. To address these issues, Yu et al. (2017) developed Multi-modal Factorized Bilinear pooling (MFB), which enjoys the dual benefits of compact output features and robust expressive capacity. Inspired by MFB (Yu et al., 2017) and its application to pathology-genomic multimodal learning (Li et al., 2022), we propose a unimodal fusion that captures modality-specific representations and quantifies their importance. The unimodal fusion takes the embedding of each modality h_m as input and factorizes the projection matrix W_i above into two low-rank matrices:

z_i = h_m^T W_i h_m = Σ_{d=1}^{k} h_m^T u_{m,d} v_{m,d}^T h_m = 1^T (U_{m,i}^T h_m ∘ V_{m,i}^T h_m), m ∈ {p, c, g}

and we obtain the output feature z_m as

z_m = SumPooling(Ũ_m^T h_m ∘ Ṽ_m^T h_m, k), m ∈ {p, c, g}

where k is the latent dimensionality of the factorized matrices, SumPooling(x, k) performs sum pooling over x using a 1-D non-overlapping window of size k, and Ũ_m ∈ R^{l×ko} and Ṽ_m ∈ R^{l×ko} are 2-D matrices reshaped from U_m = [U_{m,1}, ..., U_{m,o}] ∈ R^{l×k×o} and V_m = [V_{m,1}, ..., V_{m,o}] ∈ R^{l×k×o}. Each modality-specific representation ĥ_m ∈ R^{l+o} is obtained as

ĥ_m = h_m © z_m, m ∈ {p, c, g}

where © denotes vector concatenation.
We also introduce a modality attention network Atten: R^{l+o} → R to determine the weight of each modality-specific representation and quantify its importance: ω_m = Atten(ĥ_m; Θ_Atten), m ∈ {p, c, g}, where ω_m is the weight of modality m. In practice, Atten consists of a sigmoid-activated dense layer parameterized by Θ_Atten. Therefore, the output of each modality in unimodal fusion is h^u_m = ω_m ĥ_m ∈ R^{l+o}, m ∈ {p, c, g}. Accordingly, the output of unimodal fusion, h^u, is the sum of the weighted modality-specific representations ω_m ĥ_m, which differs from ARGF (Mai et al., 2020), which uses the weighted average of the modalities as the unimodal fusion output.
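A minimal NumPy sketch of the factorized bilinear pooling and modality attention above; the dimensions and function names are illustrative assumptions, and the learnable parameters are passed in explicitly rather than wrapped in a framework module.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mfb_unimodal(h, U, V, k):
    """Factorized bilinear self-interaction z_m for one modality.

    h    : (l,) modality embedding
    U, V : (l, k*o) low-rank factors; columns i*k ... i*k+k-1 hold the k
           rank-1 factors of output unit i, so sum-pooling windows of
           size k recovers z_i = h^T W_i h with W_i = Σ_d u_d v_d^T.
    """
    joint = (U.T @ h) * (V.T @ h)            # (k*o,) element-wise product
    return joint.reshape(-1, k).sum(axis=1)  # sum-pool, window size k -> (o,)

def modality_specific(h, U, V, k, theta):
    """h_hat = h © z_m and its attention weight ω = sigmoid(θ · h_hat)."""
    z = mfb_unimodal(h, U, V, k)
    h_hat = np.concatenate([h, z])           # (l+o,) concatenation
    omega = sigmoid(theta @ h_hat)           # scalar modality importance
    return omega * h_hat, omega
```

Sum-pooling the element-wise product of the two low-rank projections reproduces the full bilinear form z_i = h^T W_i h with W_i = Σ_d u_{m,d} v_{m,d}^T, at a small fraction of the parameter count.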

3.4. BIMODAL AND TRIMODAL FUSION

The goal of bimodal fusion is to fuse diverse information from different modalities and quantify their respective importance. After receiving the modality-specific representations h^u_m from the unimodal fusion, we generate the cross-modality representation ĥ_{m1m2} ∈ R^s analogously to Eq. (4):

ĥ_{m1m2} = SumPooling(Ũ_{m1}^T h^u_{m1} ∘ Ṽ_{m2}^T h^u_{m2}, k), m_1, m_2 ∈ {p, c, g}, m_1 ≠ m_2

where Ũ_{m1} ∈ R^{(l+o)×ks} and Ṽ_{m2} ∈ R^{(l+o)×ks} are 2-D matrices reshaped from U_{m1} = [U_{m1,1}, ..., U_{m1,s}] ∈ R^{(l+o)×k×s} and V_{m2} = [V_{m2,1}, ..., V_{m2,s}] ∈ R^{(l+o)×k×s}. We leverage a bimodal attention network (Mai et al., 2020) to identify the importance of each cross-modality representation. The similarity S_{m1m2} ∈ R of h^u_{m1} and h^u_{m2} is first estimated as

S_{m1m2} = Σ_{i=1}^{l+o} [ e^{ω_{m1} h^u_{m1,i}} / Σ_{j=1}^{l+o} e^{ω_{m1} h^u_{m1,j}} ] · [ e^{ω_{m2} h^u_{m2,i}} / Σ_{j=1}^{l+o} e^{ω_{m2} h^u_{m2,j}} ]

i.e., the inner product of the two softmax-normalized weighted representations, so the computed similarity lies in the range 0 to 1. The cross-modality importance ω_{m1m2} is then obtained by

ω_{m1m2} = e^{ω̃_{m1m2}} / Σ_{m_i ≠ m_j} e^{ω̃_{m_i m_j}}, ω̃_{m1m2} = (ω_{m1} + ω_{m2})(S_{m1m2} + S_0)

where S_0 is a pre-defined term controlling the relative contribution of similarity and modality-specific importance, set here to 0.5. Therefore, the output of bimodal fusion, h^b, is the sum of the weighted cross-modality representations ω_{m1m2} ĥ_{m1m2}, m_1, m_2 ∈ {p, c, g}, m_1 ≠ m_2. In the trimodal fusion, each bimodal fusion output is fused with the unimodal fusion output that did not contribute to the formation of that bimodal representation, yielding the trimodal representation ĥ_{m1m2m3}. A trimodal attention is likewise applied to identify the importance ω_{m1m2m3} of each trimodal representation. The output of the trimodal fusion, h^t, is the sum of the weighted trimodal representations ω_{m1m2m3} ĥ_{m1m2m3}, m_1, m_2, m_3 ∈ {p, c, g}, m_1 ≠ m_2 ≠ m_3.
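Under our reading of the attention equations above (similarity as the inner product of softmax-normalized weighted representations, pair score ω̃_{m1m2} = (ω_{m1} + ω_{m2})(S_{m1m2} + S_0), normalized across pairs), the cross-modality weighting can be sketched as follows; names and dimensions are illustrative assumptions.

```python
import numpy as np

def bimodal_weights(h, omega, S0=0.5):
    """Attention weights for the three cross-modality pairs.

    h     : dict {modality: (d,) unimodal representation h^u_m}
    omega : dict {modality: scalar modality-specific importance ω_m}
    For each pair, similarity S is the inner product of the two
    softmax-normalized, ω-scaled representations; the raw pair score
    (ω_m1 + ω_m2)(S + S0) is softmax-normalized across all pairs.
    """
    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    pairs = [("p", "c"), ("p", "g"), ("c", "g")]
    raw = []
    for m1, m2 in pairs:
        s = softmax(omega[m1] * h[m1]) @ softmax(omega[m2] * h[m2])  # S in (0, 1)
        raw.append((omega[m1] + omega[m2]) * (s + S0))
    return dict(zip(pairs, softmax(np.array(raw))))
```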

3.5. SURVIVAL LOSS FUNCTION

We train the model through the Cox partial likelihood loss (Cheerla & Gevaert, 2019; Zadeh & Schmid, 2020) with l_1 regularization for survival prediction, defined as

ℓ(Θ) = - Σ_{i: E_i = 1} [ ĥ_Θ(x_i) - log Σ_{j: T_j ≥ T_i} exp(ĥ_Θ(x_j)) ] + λ ∥Θ∥_1

where E_i, T_i, and x_i denote the survival status, survival time, and features of each patient, respectively. E_i = 1 indicates an event, while E_i = 0 indicates censoring. ĥ_Θ is the neural network trained to predict survival risk, Θ are the network parameters, and λ is a regularization hyperparameter to avoid overfitting.
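The loss can be written directly from its definition. This NumPy sketch is for exposition only (in practice the model trains with an autograd framework); the helper name and the naive risk-set scan are assumptions.

```python
import numpy as np

def cox_ph_loss(risk, time, event, theta=None, lam=0.0):
    """Negative Cox partial log-likelihood with optional l1 penalty.

    risk  : (n,) predicted log-risk scores h_Theta(x_i)
    time  : (n,) observed survival times T_i
    event : (n,) event indicators E_i (1 = event, 0 = censored)
    """
    loss = 0.0
    for i in range(len(risk)):
        if event[i] == 1:
            # risk set for patient i: everyone still under observation at T_i
            at_risk = risk[time >= time[i]]
            loss -= risk[i] - np.log(np.exp(at_risk).sum())
    if theta is not None:
        loss += lam * np.abs(theta).sum()  # l1 regularization
    return loss
```

Only patients with an observed event contribute a term; censored patients still enter the loss through the risk sets of patients who failed earlier.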

4. EXPERIMENTS

Evaluation. For each cancer dataset, we used the cross-validated concordance index (C-index) (Appendix B.2) (Harrell et al., 1982) to measure how well the predicted patient risk scores rank with respect to overall survival.

4.2. RESULTS

Comparison with Baselines. By combining pathology images, genomics, and pathway networks, PONET outperforms CoxPH models, unimodal networks, and previous deep-learning-based approaches to pathology-genomic survival outcome prediction (Table 1). Deep-learning-based approaches generally exhibit better performance than the CoxPH model, and PONET achieves superior C-index values in all six cancer types. All versions of PONET outperform Pathomic Fusion by a large margin. Pathomic Fusion uses the Kronecker product to fuse the two modalities, which is also why more advanced fusion methods, such as GPDBN and HFBSurv, achieve better performance. Moreover, we argue that Pathomic Fusion's reliance on regions of interest of the pathology image for feature extraction might limit its view of the tumor microenvironment across the whole slide. HFBSurv performs better than GPDBN and Pathomic Fusion, consistent with its authors' findings, and these results further demonstrate the benefit of hierarchical factorized bilinear fusion. Additionally, PONET consistently outperforms PONET-O and PONET-OH, indicating the effectiveness of the biological-pathway-informed neural network and the contribution of the pathological image to overall survival prediction. Ablation Studies. To assess whether the hierarchical factorized bilinear fusion strategy is indeed effective, we compare PONET with three single-fusion methods: 1) Simple concatenation: concatenate the modality embeddings; 2) Element-wise addition: element-wise addition of the modality embeddings; 3) Tensor fusion (Zadeh et al., 2017): Kronecker product of the modality embeddings. Table 2 shows the C-index values of the different methods. PONET achieves the best performance and shows remarkable improvement over the single-fusion methods across the cancer type datasets.
For example, PONET outperforms simple concatenation by 8.4% (TCGA-BLCA), 27% (TCGA-KIRP), 15% (TCGA-LUAD), 8.0% (TCGA-LUSC), and 11.4% (TCGA-PAAD). Furthermore, we adopted different configurations of PONET to evaluate each hierarchical component of the proposed method: 1) Unimodal: the unimodal fusion output as the final feature representation; 2) Bimodal: the bimodal fusion output as the final feature representation; 3) Unimodal + Bimodal: hierarchical fusion including both unimodal and bimodal feature representations; 4) ARGF (Mai et al., 2020), which averages all the intermediate layers' outputs for the final prediction and, as our results suggest, cannot fully capture the prior information flow among the biological hierarchical structures. Model Interpretation. We discuss the model interpretation results for TCGA-KIRP here; results for the other cancer types are included in Appendix C.3. To understand the interactions among the genes, pathways, and biological processes that contributed to the predictive performance, and to study the paths of impact from input to outcome, we visualized the whole structure of PONET with its fully interpretable layers after training (Fig. 3a). To evaluate the relative importance of specific genes to the model prediction, we inspected the gene layer and used the Integrated Gradients attribution method (Sundararajan et al., 2017) to obtain a total importance score per gene; the modified ranking algorithm is detailed in Appendix B.5. Highly ranked genes included KRAS, PSMB6, RAC1, and CTNNB1, which have previously been identified as kidney cancer drivers (Yang et al., 2017; Shan et al., 2017; Al-Obaidy et al., 2020; Guo et al., 2022). GNB2, a member of the guanine nucleotide-binding protein family, was also highly ranked; it has been reported that decreasing its expression reduces tumor cell proliferation (Zhang et al., 2019).
A recent study identified a strong dependency on BCL2L1, which encodes the anti-apoptotic protein BCL-XL, in a subset of kidney cancer cells (Grubb et al., 2022). This biological interpretability reveals established and novel molecular features contributing to kidney cancer. In addition, PONET selected a hierarchy of pathways relevant to the model prediction, including downregulation of TGF-β receptor signaling, regulation of PTEN stability and activity, the NLRP1 inflammasome, and noncanonical activation of NOTCH3 by PSEN1, PSMB6, and BCL2L1. TGF-β signaling is increasingly recognized as a key driver in cancer; in progressive cancer tissues TGF-β promotes tumor formation, and its increased expression often correlates with cancer malignancy (Han et al., 2018). Noncanonical activation of NOTCH3 has been reported to limit tumour angiogenesis and plays vital roles in kidney disease (Lin et al., 2017). To further inspect the spatial association of pathways with the WSI, we adopted the co-attention survival method MCAT (Chen et al., 2021) between WSIs and genomic features on the top pathways of the second layer, visualized as a WSI-level attention heatmap for each pathway genomic embedding in Fig. 3b (algorithm details are in Appendix B.6). We used the gene lists from the top four pathways as the genomic features and trained MCAT on the TCGA-KIRP dataset for survival prediction. Overall, we observe that high attention in different pathways shows different spatial patterns of association with the slide. This heatmap can reflect genotype-phenotype relationships in cancer pathology: high-attention regions (red) for each pathway are positively associated with the predicted death risk, while low-attention regions (blue) are negatively associated with the predicted risk.
By further examining the cell types in high-attention patches, we can gain insight into prognostic morphological determinants and better understand the complex tumor microenvironment. Complexity Comparison. We compared PONET with Pathomic Fusion, GPDBN, and HFBSurv: Pathomic Fusion and GPDBN are based on the Kronecker product for fusing modalities, while GPDBN and HFBSurv model inter-modality and intra-modality relations, similar in spirit to our method. As illustrated in Table 3, PONET has 2.8M (M = million) trainable parameters, approximately 1.6%, 3.4%, and 900% of the parameter counts of Pathomic Fusion, GPDBN, and HFBSurv, respectively. To assess time complexity, we calculated the floating-point operations (FLOPs) of each method at test time. The results in Table 3 show that PONET needs 3.1G FLOPs during testing, compared with 168G, 91G, and 0.5G for Pathomic Fusion, GPDBN, and HFBSurv. The main reason for the smaller numbers of trainable parameters and FLOPs is that PONET and HFBSurv perform multimodal fusion with a factorized bilinear model, which significantly reduces computational complexity while achieving more favorable performance. PONET has one additional trimodal fusion, which explains why it has more trainable parameters than HFBSurv.

5. CONCLUSION

In this study, we propose a novel biological-pathway-informed hierarchical multimodal fusion model that integrates pathology image and genomic profile data for cancer prognosis. In comparison to previous work, PONET deeply mines the interactions in multimodal data by conducting unimodal, bimodal, and trimodal fusion step by step. Empirically, PONET demonstrates the effectiveness of its model architecture and pathway-informed network through superior predictive performance. More broadly, PONET provides insight into how to train biologically informed deep networks on multimodal biomedical data for biological discovery in clinical genomic contexts, which will be useful for other problems in medicine that seek to combine heterogeneous data streams for understanding diseases and predicting response and resistance to treatment.

A DATA

B.1 COX PROPORTIONAL HAZARD MODEL

In survival analysis, we are interested in modeling the continuous time T until some event of interest (i.e., the survival time). The survival function S(t_0) = P(T > t_0) = 1 - ∫_0^{t_0} f(s) ds is the probability of an individual surviving longer than time t_0, where f is the probability density function of survival times. The hazard function λ(t) is the probability that an event occurs in an infinitesimal interval after time t, given that it has not yet occurred at time t:

λ(t) = lim_{δ→0} P(t ≤ T < t + δ | T ≥ t) / δ

This yields the relationship S(t) = exp(-Λ(t)), where Λ(t) = ∫_0^t λ(s) ds is the cumulative hazard function. The most common semi-parametric approach for estimating the hazard function is the Cox proportional hazards model (Cox, 1972), which assumes the hazard function can be parameterized as

λ(t | X) = λ_0(t) exp(β^⊤ X)

where the baseline hazard function λ_0(t) describes how the risk of an event changes over time, and β are model parameters describing how the hazard varies with features X. The baseline hazard λ_0(t) is unspecified in the original model, making it difficult to estimate β directly; however, the Cox partial log-likelihood technique (Wong, 1986) estimates β by maximizing

L_n(β) = (1/n) Σ_{i=1}^n Δ_i [ β^⊤ X_i - log Σ_{j=1}^n Y_j(O_i) e^{β^⊤ X_j} ]

where n is the number of patients and Y_j(t) = 1{O_j ≥ t} is the at-risk indicator. For the i-th subject, T_i and C_i denote, respectively, the event time and the potential censoring time, and X_i ∈ R^p denotes the observed features. The observed data from a typical survival study thus contain independent observations D = {X_i, O_i, Δ_i}_{i=1}^n, where the observed time is O_i = min(T_i, C_i) and the event indicator is Δ_i = 1 if the observed time is T_i (i.e., T_i ≤ C_i); otherwise Δ_i = 0 and the subject is censored at C_i.
Once the parameter β has been estimated through the partial log-likelihood, the cumulative baseline hazard function Λ_0(t) = ∫_0^t λ_0(s) ds can be estimated with the Breslow estimator (Breslow, 1972).
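A minimal sketch of the Breslow estimator under the definitions above; the function name and the naive O(n²) loop are illustrative assumptions.

```python
import numpy as np

def breslow_cumulative_hazard(times, events, risk, eval_times):
    """Breslow estimator of the cumulative baseline hazard Λ0(t).

    times : (n,) observed times O_i
    events: (n,) event indicators Δ_i
    risk  : (n,) estimated linear predictors β^T X_i
    Λ0(t) = Σ_{i: Δ_i=1, O_i <= t} 1 / Σ_{j: O_j >= O_i} exp(β^T X_j)
    """
    exp_risk = np.exp(risk)
    out = []
    for t in eval_times:
        lam = 0.0
        for i in range(len(times)):
            if events[i] == 1 and times[i] <= t:
                lam += 1.0 / exp_risk[times >= times[i]].sum()
        out.append(lam)
    return np.array(out)
```

With all risk scores equal to zero, the estimator reduces to the Nelson-Aalen form, i.e. a sum of one over the number of subjects at risk at each event time.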

B.2 C-INDEX

Due to the presence of censoring in survival data, traditional performance measures such as mean squared error cannot be used to evaluate the accuracy of predictions. Instead, the concordance index (C-index) (Harrell et al., 1982) is one of the most widely used performance measures for survival models. It assesses a model by measuring the concordance between the rankings of the predicted event times and the true event times. Specifically, if the predicted event time of the i-th individual is T̂_i, the C-index is defined by

C = P(T̂_i < T̂_j | O_i < O_j, Δ_i = 1)

However, it is difficult to obtain a predicted event time from most survival models, so the following C-index, proposed in Antolini et al. (2005), is often used in practice:

C = P(Ŝ(O_i | X_i) < Ŝ(O_i | X_j) | O_i < O_j, Δ_i = 1)

If {X_i, O_i, Δ_i}_{i=1}^n and Ŝ(t | X_i) denote the observations and predicted conditional survival probabilities, respectively, this C-index can be estimated empirically by

Ĉ = [ Σ_{i=1}^n Σ_{j=1}^n Δ_i I(Ŝ(O_i | X_i) < Ŝ(O_i | X_j)) I(O_i < O_j) ] / [ Σ_{i=1}^n Σ_{j=1}^n Δ_i I(O_i < O_j) ]

The range of the C-index is [0, 1]; larger values indicate better performance, and a random guess yields a C-index of 0.5.
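The empirical estimator above can be sketched directly in terms of risk scores (a higher risk corresponds to a lower predicted survival probability). This toy implementation ignores tied times and tied predictions, and the function name is an assumption.

```python
import numpy as np

def c_index(risk, time, event):
    """Empirical concordance index from predicted risk scores.

    A pair (i, j) is comparable when O_i < O_j and Δ_i = 1; it is
    concordant when the earlier-event patient i has the higher risk.
    """
    num, den = 0.0, 0.0
    n = len(risk)
    for i in range(n):
        if event[i] != 1:
            continue
        for j in range(n):
            if time[i] < time[j]:
                den += 1
                num += float(risk[i] > risk[j])
    return num / den
```

Perfectly ordered risks give a C-index of 1, perfectly reversed risks give 0, and an uninformative model hovers around 0.5.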

B.3 WSI REPRESENTATION LEARNING

It has been shown that WSI visual representations extracted by self-supervised learning on histopathological images are more accurate and transferable than supervised baselines trained on domain-irrelevant datasets such as ImageNet. In this work, a Vision Transformer (ViT) model (Wang et al., 2021a) pre-trained on a large histopathological image dataset is used for tile feature extraction. The model is composed of two main neural networks that learn from each other, i.e., student and teacher networks. The parameters of the teacher model θ_t are updated from the student network parameters θ_s using the update rule

θ_t ← τ θ_t + (1 - τ) θ_s

Two different views u, v of a given input H&E image x, uniformly selected from the training set I, are generated using random augmentations. The student and teacher models then generate two visual representations as y_1 = f_{θ_s}(u) and ŷ_2 = f_{θ_t}(v), respectively, which are transformed into a latent space by linear projection as p_1 = g_{θ_s}(y_1) and ẑ_2 = g_{θ_t}(ŷ_2). Similarly, feeding v and u to the student and teacher networks gives y_2 = f_{θ_s}(v), ŷ_1 = f_{θ_t}(u), p_2 = g_{θ_s}(y_2), and ẑ_1 = g_{θ_t}(ŷ_1). Finally, the symmetric objective L_loss is optimized by minimizing the distance between the ℓ_2-normalized student and teacher outputs:

L_loss = (1/2) L(p_1, ẑ_2) + (1/2) L(p_2, ẑ_1), L(p, z) = - (p / ∥p∥_2) · (z / ∥z∥_2)

where ∥·∥_2 denotes the ℓ_2 norm.

Given the M tile embeddings of a patient, the aggregation model has three components: 1) the projection layer f_p; 2) the attention module f_attn; 3) the prediction layer f_pred. After ViT feature extraction, the patch-level embeddings of a WSI bag, H ∈ R^{M×2048}, are first mapped into a more compact, dataset-specific 512-dimensional feature space by the fully connected layer f_p with weights W_proj ∈ R^{512×2048} and bias b_proj ∈ R^{512}.

Subsequently, the attention module f_attn learns to score each patch for its perceived relevance to patient-level prognostic prediction. Patches receiving high attention scores contribute more to the patient-level feature representation than patches assigned low attention scores; all patches in one patient's WSIs are aggregated by attention pooling (Ilse et al., 2018). Specifically, f_attn has three fully connected layers with weights U_a ∈ R^{256×512}, V_a ∈ R^{256×512}, and W_a ∈ R^{1×256}. Given a patch embedding h_m ∈ R^512 (the m-th row of H), its attention score a_m is computed by

a_m = exp{W_a (tanh(V_a h_m^⊤) ⊙ sigm(U_a h_m^⊤))} / Σ_{m'=1}^{M} exp{W_a (tanh(V_a h_{m'}^⊤) ⊙ sigm(U_a h_{m'}^⊤))}

The patient-level representation h_patient ∈ R^512 is then computed by attention pooling of the patch-level feature representations, with the attention scores A ∈ R^M as weight coefficients:

h_patient = Attn-pool(A, H) = Σ_{m=1}^M a_m h_m

The last fully connected layer learns a representation h_WSI ∈ R^{1×32}, which is then used as input to our multimodal fusion.
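The gated attention pooling of Ilse et al. (2018) used above can be sketched as follows; the toy dimensions and names are illustrative, and the learnable matrices are passed in explicitly.

```python
import numpy as np

def gated_attention_pool(H, U, V, W):
    """Gated attention-based MIL pooling.

    H    : (M, d) patch embeddings (the bag)
    U, V : (r, d) gating and attention weight matrices
    W    : (1, r) scoring weights
    a_m ∝ exp{ W (tanh(V h_m) ⊙ sigmoid(U h_m)) }
    Returns the attention scores (M,) and the pooled embedding (d,).
    """
    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    # gated scores for every patch, then softmax over the bag
    logits = (W @ (np.tanh(V @ H.T) * sigmoid(U @ H.T))).ravel()  # (M,)
    e = np.exp(logits - logits.max())
    a = e / e.sum()
    return a, a @ H  # attention-weighted sum of patch embeddings
```

The sigmoid gate lets the module suppress patches whose tanh features are large but uninformative, which plain tanh attention cannot do.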

B.5 SPARSE NETWORK FEATURE INTERPRETATION

We use the Integrated Gradients attribution algorithm to rank the features in all layers. Inspired by P-NET (Elmarakeby et al., 2021), to reduce the bias introduced by over-annotation of certain nodes (nodes that are members of too many pathways), we adjust the Integrated Gradients scores using a graph-informed function f that considers the connectivity of each node. The importance score C_i^l of node i in layer l is divided by its node degree d_i^l whenever that degree exceeds the mean node degree μ plus five standard deviations σ:

d_i^l = fan-in_i^l + fan-out_i^l

adjusted C_i^l = f(C_i^l) = { C_i^l / d_i^l, if d_i^l > μ + 5σ;  C_i^l, otherwise. }

B.6 CO-ATTENTION BASED PATHWAY VISUALIZATION

After obtaining the ranking of top genes and pathways, we adopt the co-attention survival model MCAT (Chen et al., 2021) to produce spatial visualizations of the genomic features. We trained MCAT on all of our TCGA datasets; MCAT learns how WSI patches attend to genes when predicting patient survival. We denote the WSI patch representations and the pathway genomic features as H_bag and G_bag, respectively. The genomic features are the expression values of the gene lists from the top pathways of each TCGA dataset. The model uses G_bag ∈ R^{N×d_g} to guide the feature aggregation of H_bag ∈ R^{M×d_p} into a clustered set of gene-guided visual concepts Ĥ_bag ∈ R^{N×d_p}, where d_g and d_p denote the dimensions of the pathway features (the number of genes involved in the pathway) and the patch features, through the following mapping:

CoAttn_{G→H}(G, H) = softmax(QK^⊤ / √d_p) V = softmax(W_q G H^⊤ W_s^⊤ / √d_p) W_v H = A_coattn W_v H = Ĥ,

where W_q, W_s, W_v ∈ R^{d_p×d_p} are trainable weight matrices applied to the query G_bag and the key-value pair (H_bag, H_bag), and A_coattn ∈ R^{N×M} is the co-attention matrix for computing the weighted average of H_bag. Here, M represents the number of patches in one slide and N the number of pathways (we used the top four ranked pathways, so N = 4 in our study).
Interpretation: For a single genomic pathway embedding g_n ∈ G_bag, the co-attention module scores the pairwise similarity of g_n to every h_m ∈ H_bag, written as a row vector [a_n1, a_n2, ..., a_nM] ∈ A_coattn. These attention weights are then applied to H_bag, constructing a new WSI-level feature embedding ĥ_n ∈ R^{d_p} that reflects the biological function of g_n. For example, if g_n is a genomic embedding expressing the underlying biological pathways responsible for tumor formation, the co-attention layer would assign high attention to image patches containing tumor cells, so that ĥ_n aggregates into a WSI-level representation primarily capturing tumor cells. We describe the set of high-attention image patches that attend to a single genomic embedding g_n as a "gene-guided visual concept", in which patches that are close to g_n in feature space share similar phenotypic information. For N pathway embeddings in G_bag, the co-attention module captures up to N different pathway-guided visual concepts, which we visualize as attention heatmaps in Fig. 3b.
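The co-attention mapping above can be sketched in NumPy. This is a simplified sketch, not MCAT's implementation: it assumes, for illustration, that both modalities have already been projected to a shared dimension d, and the random weight matrices stand in for the trained W_q, W_s, W_v.

```python
import numpy as np

def co_attention(G, H, Wq, Ws, Wv):
    """Genomic-guided co-attention sketch.

    G: (N, d) pathway (query) embeddings; H: (M, d) patch embeddings.
    Returns N pathway-guided visual concepts (N, d) and the (N, M)
    co-attention matrix, whose rows are softmax-normalized over patches.
    """
    d = H.shape[1]
    S = (G @ Wq.T) @ (H @ Ws.T).T / np.sqrt(d)       # (N, M) similarity logits
    S = S - S.max(axis=1, keepdims=True)             # numerical stability
    A = np.exp(S) / np.exp(S).sum(axis=1, keepdims=True)
    return A @ (H @ Wv.T), A                         # Ĥ (N, d), A_coattn (N, M)

# Illustrative usage: 4 pathways attending over 9 patches in a 64-d space.
rng = np.random.default_rng(1)
G = rng.standard_normal((4, 64))
H = rng.standard_normal((9, 64))
Wq, Ws, Wv = (rng.standard_normal((64, 64)) * 0.1 for _ in range(3))
H_hat, A = co_attention(G, H, Wq, Ws, Wv)
```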

C EXPERIMENTS

C.1 NETWORK ARCHITECTURE

Sparse network for genes: The final gene expression embedding is h_g ∈ R^{1×50}.

Pathology network:

The slide-level image feature representation is passed through an image embedding layer, which encodes it as h_p ∈ R^{1×50}.

CNV + MUT network:

Similarly to the pathology network, the patient-level CNV + MUT feature representation is passed through an FC embedding layer, which encodes it as h_c ∈ R^{1×50}.

C.2 EXPERIMENTAL DETAILS

PONET. The latent dimensionality k of the factorized matrices is an important tuning parameter. We tune k ∈ {3, 5, 10, 20, 30, 50} based on the testing C-index (Appendix Fig. 5) and the training/testing loss curves (Appendix Fig. 6) for each dataset, choosing the k that maximizes the C-index while also giving stable convergence of both training and testing loss. For example, we choose k = 10 for TCGA-KIRP; as Appendix Fig. 6 shows, the testing loss is quite volatile when k is less than 10. Similarly, we choose k = 20, 10, 20, 20, 10 for TCGA-BLCA, TCGA-KIRC, TCGA-LUAD, TCGA-LUSC, and TCGA-PAAD, respectively. The learning rate and the regularization hyperparameter λ of the Cox partial likelihood loss are also tunable. The model is trained with the Adam optimizer. For each training/testing pair, we empirically preset the learning rate to 1.2e-4 as the starting point of a grid search; the optimal learning rate is determined by 5-fold cross-validation on the training set, with the C-index as the performance metric. The model is then trained on the full training set and evaluated on the testing set. We use λ = 2e-3 throughout the experiments.

CoxPH. We include only age and gender for survival prediction, using CoxPHFitter from lifelines.

DeepSurv. We concatenate the preprocessed pathological image features, gene expression, and copy number + mutation data into one vector to train the DeepSurv model. L2 regularization = 10.0, dropout = 0.4, hidden layer sizes = [25, 25], learning rate = 1e-05, learning rate decay = 0.001, momentum = 0.9.

Pathomic Fusion. We use the pathomicSurv model, which takes our preprocessed image features, gene expression, and copy number + mutation data as input. k = 20, learning rate 2e-3, weight decay 4e-4, batch size 16, 100 epochs, dropout rate 0.25.

GPDBN.
Learning rate is 2e-3, batch size is 16, weight decay is 1e-6, dropout rate is 0.3, epoch is 
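The concordance index (C-index) used as the performance metric throughout these experiments can be sketched as a minimal, unoptimized implementation of Harrell's estimator under right censoring; the papers' own evaluations presumably use library implementations (e.g. lifelines), so this is an illustrative sketch only.

```python
def concordance_index(times, risks, events):
    """Harrell's C-index: over all comparable pairs (the patient with the
    earlier observed time must have had an event, not a censoring), count
    how often the shorter-lived patient received the higher predicted risk;
    ties in predicted risk count as 0.5."""
    num = den = 0.0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # censored earlier times make the pair incomparable
        for j in range(n):
            if times[i] < times[j]:
                den += 1
                if risks[i] > risks[j]:
                    num += 1
                elif risks[i] == risks[j]:
                    num += 0.5
    return num / den

# Perfectly anti-ordered risks give C = 1.0; perfectly mis-ordered give 0.0.
c_good = concordance_index([1, 2, 3, 4], [4, 3, 2, 1], [1, 1, 1, 1])
c_bad = concordance_index([1, 2, 3, 4], [1, 2, 3, 4], [1, 1, 1, 1])
```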



Footnote URLs:
https://github.com/mahmoodlab/PORPOISE
https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga
https://github.com/CamDavidsonPilon/lifelines
https://github.com/czifan/DeepSurv.pytorch
https://github.com/mahmoodlab/PathomicFusion
https://github.com/isfj/GPDBN
https://github.com/Liruiqing-ustc/HFBSurv



Figure 1: Overview of PONET model.

Figure 2: Overall framework of the visual representation extraction using pre-trained self-supervised vision transformer.

Figure 3: Inspecting and interpreting PONET on TCGA-KIRP. a: Sankey diagram visualization of the inner layers of PONET, showing the estimated relative importance of different nodes in each layer. Nodes in the first layer represent genes; the next layers represent pathways; and the final layer represents the model outcome. Different layers are linked by weights. Nodes with darker colours are more important, while transparent nodes represent the residual importance of undisplayed nodes in each layer. H1 represents the gene layer and H2-H4 represent pathway layers. b: Co-attention visualization of the top 4 ranked pathways in one case of TCGA-KIRP.

Figure 4: Kaplan-Meier analysis of patient stratification into low- and high-risk groups via four variations of PONET on TCGA-KIRP. Low and high risk are defined by the median (50th percentile) of each model's hazard predictions. A log-rank test was used to assess the statistical significance of the difference in survival distributions between low- and high-risk patients.

The batch size is set to 16 and the number of epochs to 100. During training, we carefully monitor the training and testing losses for convergence (Fig. 6 in Appendix C.2). Experiments were run on an NVIDIA GeForce RTX 2080 Ti GPU.

Figure 5: C-Index value under K = 3, 5, 10, 20, 30, 50 for TCGA-KIRP. The mean value and standard deviation for 5-fold cross-validation are plotted.

Figure 6: Train and test loss for TCGA-KIRP under K = 3, 5, 10, 20, 50 for 5-fold cross-validation.

Figure 8: Co-attention visualization of top 4 ranked pathways in two cases of TCGA-BLCA.

Figure 9: Inspecting and interpreting PONET on TCGA-KIRC. Sankey diagram visualization of inner layers of PONET shows the estimated relative importance of different nodes in each layer. Nodes in the first layer represent genes; the next layers represent pathways; and the final layer represents the model outcome. Different layers are linked by weights. Nodes with darker colours are more important, while transparent nodes represent the residual importance of undisplayed nodes in each layer.

Figure 10: Co-attention visualization of top 4 ranked pathways in two cases of TCGA-KIRC.

Figure 11: Inspecting and interpreting PONET on TCGA-LUAD. Sankey diagram visualization of inner layers of PONET shows the estimated relative importance of different nodes in each layer. Nodes in the first layer represent genes; the next layers represent pathways; and the final layer represents the model outcome. Different layers are linked by weights. Nodes with darker colours are more important, while transparent nodes represent the residual importance of undisplayed nodes in each layer.

Figure 12: Co-attention visualization of top 4 ranked pathways in two cases of TCGA-LUAD.

Figure 13: Inspecting and interpreting PONET on TCGA-LUSC. Sankey diagram visualization of inner layers of PONET shows the estimated relative importance of different nodes in each layer. Nodes in the first layer represent genes; the next layers represent pathways; and the final layer represents the model outcome. Different layers are linked by weights. Nodes with darker colours are more important, while transparent nodes represent the residual importance of undisplayed nodes in each layer.

Figure 14: Co-attention visualization of top 4 ranked pathways in two cases of TCGA-LUSC.

Figure 15: Inspecting and interpreting PONET on TCGA-PAAD. Sankey diagram visualization of inner layers of PONET shows the estimated relative importance of different nodes in each layer. Nodes in the first layer represent genes; the next layers represent pathways; and the final layer represents the model outcome. Different layers are linked by weights. Nodes with darker colours are more important, while transparent nodes represent the residual importance of undisplayed nodes in each layer.

Figure 16: Co-attention visualization of top 4 ranked pathways in two cases of TCGA-PAAD.

labeled survival times and censorship statuses. The genomic profile features (mutation status, copy number variation, RNA-Seq expression) are preprocessed by Porpoise (Chen et al., 2022b). For this study, we used the following cancer types: Bladder Urothelial Carcinoma (BLCA) (n = 437), Kidney Renal Clear Cell Carcinoma (KIRC) (n = 350), Kidney Renal Papillary Cell Carcinoma (KIRP) (n = 284), Lung Adenocarcinoma (LUAD) (n = 515), Lung Squamous Cell Carcinoma (LUSC) (n = 484), and Pancreatic Adenocarcinoma (PAAD) (n = 180). We downloaded from the TCGA website the same diagnostic WSIs used in the Porpoise study, so as to match the paired genomic features and survival times. The feature alignment table for all cancer types is in Appendix A. For each WSI, automated tissue segmentation was performed. Following segmentation, image patches of size 224 × 224 were extracted without overlap at the 20× equivalent pyramid level from all identified tissue regions, excluding white background and keeping only patches with at least 50% tissue. Visual representations of these patches were then extracted with a vision transformer (ViT) (Wang et al., 2021a) pre-trained on the TCGA dataset through a self-supervised contrastive learning approach, such that each patch is represented as a 1 × 2048 vector. Fig. 2 shows the framework for visual representation extraction by the ViT. Since survival outcome information is available at the patient level, we aggregated the patch-level features into slide-level feature representations with an attention-based method (Lu et al., 2021; Ilse et al., 2018); see Appendix B.4 for the algorithmic details.
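The tissue-filtering step of the patching pipeline can be sketched as follows. This is a simplified stand-in, assuming a binary tissue mask at the target pyramid level is already available (the actual pipeline segments the raw WSI and operates on the 20× level); the function name and interface are illustrative.

```python
import numpy as np

def tissue_patch_coords(mask, patch=224, min_frac=0.5):
    """Enumerate non-overlapping patch coordinates over a binary tissue
    mask (1 = tissue, 0 = background), keeping only patches whose tissue
    fraction is at least min_frac (50% in the paper's pipeline)."""
    coords = []
    H, W = mask.shape
    for y in range(0, H - patch + 1, patch):        # non-overlapping grid
        for x in range(0, W - patch + 1, patch):
            if mask[y:y + patch, x:x + patch].mean() >= min_frac:
                coords.append((x, y))
    return coords

# Toy mask: a 448x448 slide whose left half is tissue.
mask = np.zeros((448, 448))
mask[:, :224] = 1
coords = tissue_patch_coords(mask)
```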

C-Index (mean ± standard deviation) of PONET and ablation experiments in TCGA survival prediction. Top two performers are highlighted in bold.

Evaluation of PONET on different fusion methods and pathway designs by C-index (mean ± standard deviation). Best performer is highlighted in bold.

Comparison of model complexity

TCGA Data Feature Alignment Summary (columns: WSI, CNV, MUT, RNA, WSI+CNV+MUT, WSI+MUT+RNA, ALL)

Appendix A shows the number of patients with matched data modalities: WSI (whole-slide image), CNV (copy number variation), MUT (mutation), and RNA (RNA-Seq gene expression). For each TCGA dataset and each patient, the preprocessed data dimensions are d_g ∈ R^{1×2000} (RNA), d_c ∈ R^{1×227} (CNV + MUT), and d_p ∈ R^{1×32} (WSI), which are used for our multimodal fusion.
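Given these per-modality vectors and the rank-k factorization tuned in Appendix C.2, a pairwise fusion step can be sketched generically as rank-k factorized bilinear pooling. This is a hedged illustration, not PONET's actual fusion operator (which is not restated in this appendix): the function and the random factors U, V are hypothetical, showing only how a k-dimensional elementwise product of projections approximates a full d1 × d2 bilinear interaction.

```python
import numpy as np

def factorized_bilinear_fusion(h1, h2, U, V):
    """Rank-k factorized bilinear fusion sketch.

    h1: (d1,) and h2: (d2,) modality embeddings; U: (d1, k), V: (d2, k).
    The elementwise product of the two k-d projections approximates the
    bilinear form h1^T W h2 for W = sum_r U[:, r] V[:, r]^T, avoiding the
    full d1*d2 parameter matrix."""
    return (h1 @ U) * (h2 @ V)   # (k,) fused interaction features

# Illustrative usage with the paper's modality dimensions and k = 10.
rng = np.random.default_rng(2)
h_g = rng.standard_normal(2000)              # RNA embedding
h_p = rng.standard_normal(32)                # WSI embedding
U = rng.standard_normal((2000, 10)) * 0.01
V = rng.standard_normal((32, 10)) * 0.01
z = factorized_bilinear_fusion(h_g, h_p, U, V)
```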

B.4 AGGREGATE PATCH-LEVEL FEATURES INTO SLIDE-LEVEL FEATURES

Since survival outcome information is available at the patient level rather than for individual slides, we use the attention-based strategy from Porpoise (Chen et al., 2022b), originally designed in CLAM (Lu et al., 2021), to aggregate patch-level features into a slide-level representation for model training. We treat all WSIs corresponding to a patient case as a single WSI bag during training and evaluation. If a patient has N WSIs with bag sizes M_1, ..., M_N, respectively, the WSI bag corresponding to the patient is formed by concatenating all N bags and has dimensions M × 2048, where M = Σ_{i=1}^{N} M_i.
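The bag-construction step above amounts to stacking each patient's per-slide feature matrices; a minimal sketch (the helper name is illustrative):

```python
import numpy as np

def build_patient_bag(wsi_bags):
    """Concatenate a patient's N per-slide bags of 2048-d patch features,
    shapes (M_1, 2048), ..., (M_N, 2048), into one patient-level bag of
    shape (M, 2048) with M = M_1 + ... + M_N."""
    return np.concatenate(wsi_bags, axis=0)

# A patient with two slides of 3 and 5 patches yields an 8-patch bag.
bags = [np.zeros((3, 2048)), np.ones((5, 2048))]
patient_bag = build_patient_bag(bags)
```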

