DEEP BIOLOGICAL PATHWAY INFORMED PATHOLOGY-GENOMIC MULTIMODAL SURVIVAL PREDICTION

Abstract

The integration of multimodal data, such as pathological images and genomic profiles, is essential for understanding cancer heterogeneity and complexity, for personalizing treatments, and for improving survival prediction. Despite progress in integrating pathology and genomic data, most existing methods cannot thoroughly mine the complex inter-modality relations. Moreover, identifying explainable features from these models to guide preclinical discovery and clinical prediction is crucial for cancer diagnosis, prognosis, and therapeutic response studies. We propose PONET, a novel biological pathway informed pathology-genomic deep model that integrates pathological images and genomic data not only to improve survival prediction but also to identify the genes and pathways that drive different survival rates in patients. Empirical results on six datasets from The Cancer Genome Atlas (TCGA) show that the proposed method achieves superior predictive performance and reveals meaningful biological interpretations. The proposed method provides insight into how to train biologically informed deep networks on multimodal biomedical data, with broad applicability for understanding disease and predicting response and resistance to treatment.

1. INTRODUCTION

Manual examination of haematoxylin and eosin (H&E)-stained slides of tumour tissue by pathologists is currently the state of the art for cancer diagnosis (Chan, 2014). Recent advances in deep learning for digital pathology have enabled the use of whole-slide images (WSIs) for computational image analysis tasks such as cellular segmentation (Pan et al., 2017; Hou et al., 2020) and tissue classification and characterisation (Hou et al., 2016; Hekler et al., 2019; Iizuka et al., 2020). While H&E slides are often sufficient to establish a diagnosis, genomic data can characterise the tumour at the molecular level, offering the potential for prognostic and predictive biomarker discovery. Cancer prognosis via survival outcome prediction is a standard approach for biomarker discovery, stratification of patients into distinct treatment groups, and therapeutic response prediction (Cheng et al., 2017; Ning et al., 2020). WSIs exhibit enormous heterogeneity and can be as large as 150,000 × 150,000 pixels. Most approaches therefore adopt a two-stage multiple instance learning (MIL) scheme for WSI representation learning: 1) instance-level feature representations are extracted from image patches in the WSI, and 2) a global aggregation scheme is applied to the bag of instances to obtain a WSI-level representation for subsequent supervision (Hou et al., 2016; Courtiol et al., 2019; Wulczyn et al., 2020; Lu et al., 2021). Multimodal survival prediction faces the additional challenge of a large heterogeneity gap between WSIs and genomic data, and many existing approaches rely on simple fusion mechanisms for feature integration, which prevents mining important multimodal interactions (Mobadersany et al., 2018; Chen et al., 2022b;a).
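The two-stage MIL scheme above can be sketched as follows. This is a minimal illustration of attention-based aggregation over a bag of patch embeddings, not the pipeline of any specific cited method; the weight matrices `W_attn` and `w_score`, the feature dimension, and the random patch features are all hypothetical stand-ins for learned quantities.

```python
import numpy as np

def attention_mil_pool(instance_feats, W_attn, w_score):
    """Stage 2 of a two-stage MIL pipeline: aggregate a bag of
    patch-level features into a single WSI-level embedding via
    attention pooling (hypothetical, untrained weights).

    instance_feats: (n_patches, d) array of patch embeddings
    W_attn: (d, h) attention projection; w_score: (h,) scoring vector
    """
    hidden = np.tanh(instance_feats @ W_attn)      # (n, h)
    scores = hidden @ w_score                      # one score per patch
    scores = scores - scores.max()                 # numerical stability
    attn = np.exp(scores) / np.exp(scores).sum()   # softmax over the bag
    # Attention-weighted sum yields the slide-level representation.
    return attn @ instance_feats                   # (d,)

rng = np.random.default_rng(0)
# Stage 1 stand-in: pretend these came from a patch-level feature extractor.
feats = rng.standard_normal((500, 64))
W, w = rng.standard_normal((64, 32)), rng.standard_normal(32)
slide_repr = attention_mil_pool(feats, W, w)
print(slide_repr.shape)  # (64,)
```

The bag-level output is permutation-invariant in the patches, which is what makes MIL suitable for gigapixel WSIs with no patch-level labels.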
Incorporating biological pathway databases into a model leverages prior biological knowledge, so that potential prognostic factors with well-known biological functionality can be identified (Hao et al., 2018). Moreover, encoding biological pathway information into neural networks has achieved superior predictive performance compared with established models (Elmarakeby et al., 2021). Motivated by the current challenges in multimodal fusion of pathology and genomics, and by the prognostic interpretability that pathway-based analysis offers in linking pathways to clinical outcomes, we propose PONET, a novel biological pathway informed pathology-genomic deep model that uses H&E WSIs and genomic profile features for survival prediction. Our work makes four major contributions: 1) PONET formulates a biological pathway informed deep hierarchical multimodal integration framework for pathological images and genomic data; 2) PONET captures diverse and comprehensive modality-specific and cross-modality relations among different data sources with a factorized bilinear model and a graph fusion network; 3) PONET reveals meaningful model interpretations at both the gene and pathway level for potential biomarker and therapeutic target discovery, and provides spatial visualization of the top genes/pathways, which has enormous potential for uncovering novel prognostic morphological determinants; 4) we evaluate PONET on six public TCGA datasets, on which it shows superior survival prediction compared with state-of-the-art methods. Fig. 1 shows our model framework.

2. RELATED WORK

Multimodal Fusion. Earlier work on multimodal fusion focused on early and late fusion. Early fusion approaches fuse features by simple concatenation, which cannot fully explore intra-modality dynamics (Wöllmer et al., 2013; Poria et al., 2016; Zadeh et al., 2016). In contrast, late fusion combines modalities by weighted averaging, which fails to model cross-modal interactions (Nojavanasghari et al., 2016; Kampman et al., 2018). Exploiting relations within each modality has been successfully introduced to cancer prognosis via bilinear models (Wang et al., 2021b) and graph-based models (Subramanian et al., 2021). Adversarial Representation Graph Fusion (ARGF) (Mai et al., 2020) interprets multimodal fusion as a hierarchical interaction learning procedure in which bimodal interactions are first generated from unimodal dynamics, and trimodal dynamics are then generated from bimodal and unimodal dynamics. We propose a new hierarchical fusion framework with modality-specific and cross-modality attentional factorized bilinear modules to mine comprehensive modality interactions. Our framework differs from ARGF in the following ways: 1) we take the sum of the weighted modality-specific representations as the unimodal representation, instead of their weighted average as in ARGF; 2) for higher-level fusion, ARGF takes the original embeddings of each modality as input, whereas we use the weighted modality-specific representations; 3) we argue that ARGF introduces redundant information in its trimodal dynamics. Multimodal Survival Analysis. There have been exciting attempts at multimodal fusion of pathology and genomic data for cancer survival prediction (Mobadersany et al., 2018; Cheerla & Gevaert, 2019; Wang et al., 2020). However, these fusion-based methods fail to explicitly model the interactions between subsets of the multiple modalities.
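The first difference from ARGF (weighted sum vs. weighted average of modality-specific representations) can be sketched as below. This is a simplified illustration under assumptions: the scalar gating mechanism, the gating vectors, and the feature dimension are hypothetical stand-ins for PONET's actual attention modules, and only the unimodal level is shown.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def unimodal_fusion_level(reprs, gate_vecs):
    """Sketch of the unimodal fusion level: each modality-specific
    representation r_m receives an attention weight a_m, and the
    unimodal representation is the SUM of the weighted representations
    a_m * r_m (ARGF instead averages them). The weighted
    representations, not the raw embeddings, are what a higher fusion
    level would consume. `gate_vecs` are hypothetical gating vectors
    used only to produce scalar scores in this sketch.
    """
    scores = np.array([r @ g for r, g in zip(reprs, gate_vecs)])
    weights = softmax(scores)                        # modality attention
    weighted = [a * r for a, r in zip(weights, reprs)]
    unimodal = np.sum(weighted, axis=0)              # weighted sum, not mean
    return unimodal, weighted, weights

rng = np.random.default_rng(2)
path_repr, gene_repr = rng.standard_normal(16), rng.standard_normal(16)
gates = [rng.standard_normal(16) for _ in range(2)]
uni, weighted, w = unimodal_fusion_level([path_repr, gene_repr], gates)
```

Passing the weighted representations (rather than the original embeddings, as in ARGF) to the next level lets the learned modality attention shape every subsequent interaction.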
The Kronecker product models pairwise interactions between two input feature vectors by producing a high-dimensional feature of quadratic expansion (Zadeh et al., 2017), and has shown its value in cancer survival prediction (Wang et al., 2021b; Chen et al., 2022b;a). Despite promising results, using the Kronecker product for multimodal fusion can introduce a large number of parameters, leading to high computational cost and a risk of overfitting (Kim et al., 2017; Liu et al., 2021), which limits its applicability and performance. To overcome this drawback, hierarchical factorized bilinear fusion for cancer survival prediction (HFBSurv) (Li et al., 2022) uses a factorized bilinear model to fuse genomic and image features, dramatically reducing computational complexity. PONET differs from HFBSurv in two ways: 1) PONET's multimodal framework has three levels of hierarchical fusion modules
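The complexity gap between Kronecker-product fusion and factorized bilinear fusion can be made concrete with a small numerical sketch. The dimensions and factorization rank `k` below are arbitrary illustrative choices, and the low-rank projections stand in for learned parameters; the sum-pooling form follows the general factorized bilinear pooling idea cited above, not PONET's exact module.

```python
import numpy as np

d1, d2, out, k = 128, 64, 32, 4   # input dims, output dim, rank (assumed)
rng = np.random.default_rng(1)
x, y = rng.standard_normal(d1), rng.standard_normal(d2)

# Full bilinear interaction via Kronecker product: the fused feature has
# d1*d2 entries, so a projection to `out` dims needs out*d1*d2 parameters.
kron = np.kron(x, y)                              # (8192,) quadratic expansion

# Factorized bilinear pooling: two low-rank projections followed by
# element-wise product and sum pooling over the rank dimension, needing
# only k*(d1 + d2)*out parameters instead of out*d1*d2.
U = rng.standard_normal((d1, k * out))
V = rng.standard_normal((d2, k * out))
z = ((x @ U) * (y @ V)).reshape(out, k).sum(axis=1)   # (32,)
print(kron.size, z.size)  # 8192 32
```

With these toy dimensions the parameter count drops from 32 × 128 × 64 ≈ 262k to 4 × (128 + 64) × 32 ≈ 25k, which is why factorized bilinear fusion scales to the high-dimensional pathology and genomic features used here.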



Figure 1: Overview of PONET model.

