BIOLOGICAL FACTOR REGULATORY NEURAL NETWORK

Abstract

Genes are fundamental to analyzing biological systems, and many recent works have proposed using gene expression in deep learning models for various biological tasks. Despite their promising performance, deep neural networks are black boxes, which makes it hard for them to provide biological insights to humans. Recently, some works have integrated biological knowledge into neural networks to improve the transparency and performance of their models. However, these methods can incorporate only partial biological knowledge, leading to suboptimal performance. In this paper, we propose the Biological Factor Regulatory Neural Network (BFReg-NN), a generic framework to model relations among biological factors in cell systems. BFReg-NN starts from gene expression data and can merge most existing biological knowledge into the model, including the regulatory relations among genes or proteins (e.g., gene regulatory networks (GRN) and protein-protein interaction (PPI) networks) and the hierarchical relations among genes, proteins, and pathways (e.g., several genes/proteins are contained in a pathway). Moreover, BFReg-NN can provide new biologically meaningful insights because of its white-box characteristics. Experimental results on different gene expression-based tasks verify the superiority of BFReg-NN over the baselines. Our case studies also show that the key insights found by BFReg-NN are consistent with the biological literature.

1. INTRODUCTION

Understanding how cells work is an essential problem in biology, and it is also very important in biomedicine because of its bearing on disease phenotypes and precision medicine. From a genome-scale view, the whole cell system is modeled level by level, starting from DNA, mRNA, and protein to metabolomics, and finally inferring the phenotype. We define these molecules and molecule sets as biological factors. At each level, biological factors of the same type interact with or regulate each other, which determines cell fate, driving cells to develop, differentiate, and perform other activities (Angione, 2019). Thanks to single-cell sequencing technologies, we can obtain gene expression data at the mRNA level, which is fundamental to analyzing the whole cell system. Currently, gene expression data is widely used to identify cell states during cell development, characterize specific tissues or organs, and analyze patient-specific drug responses (Paik et al., 2020).

Many deep learning methods have been proposed to use gene expression data for prediction, and they achieve extraordinary performance in different biological tasks. For instance, gene expression can be treated as an input feature to classify cell types, cluster cells, and even estimate patient survival time (Erfanian et al., 2021; Huang et al., 2020). Although most deep neural network (DNN) models can diagnose cancers with high precision, the original DNNs cannot tell us which detailed biological factors or processes cause cancers. For instance, the regulation between genes PFKL and HIF1A under the HEPG2 pathway has a high probability of causing liver cancer (Shoemaker, 2006; Garcia-Alonso et al., 2019).

Recently, some works have leveraged existing biological knowledge, represented as graphs of the relations among biological factors, in their prediction models, and significantly improved the prediction accuracy on specific tasks. For example, Rhee et al. (2018) and Chereda et al. (2021) mapped gene expression data onto the protein-protein interaction network and used graph neural networks to predict cancer. Elmarakeby et al. (2021) modeled the relations of gene-pathway and pathway-biological process as

