BIOLOGICAL FACTOR REGULATORY NEURAL NETWORK

Abstract

Genes are fundamental for analyzing biological systems, and many recent works apply deep learning models to gene expression data for various biological tasks. Despite their promising performance, deep neural networks can hardly provide biological insights to humans because of their black-box nature. Recently, some works have integrated biological knowledge with neural networks to improve the transparency and performance of their models. However, these methods can incorporate only partial biological knowledge, leading to suboptimal performance. In this paper, we propose the Biological Factor Regulatory Neural Network (BFReg-NN), a generic framework to model relations among biological factors in cell systems. BFReg-NN starts from gene expression data and is capable of merging most existing biological knowledge into the model, including the regulatory relations among genes or proteins (e.g., gene regulatory networks (GRN), protein-protein interaction networks (PPI)) and the hierarchical relations among genes, proteins and pathways (e.g., several genes/proteins are contained in a pathway). Moreover, BFReg-NN can provide new biologically meaningful insights because of its white-box characteristics. Experimental results on different gene expression-based tasks verify the superiority of BFReg-NN over the baselines. Our case studies also show that the key insights found by BFReg-NN are consistent with the biological literature.

1. INTRODUCTION

Understanding how cells work is an essential problem in biology, and it also matters in biomedicine, where it underpins the study of disease phenotypes and precision medicine. From a genome-scale view, the whole cell system is modeled level by level, starting from DNA, mRNA, and protein to metabolomics, and finally inferring the phenotype. We define these molecules and molecule sets as biological factors. At each level, biological factors of the same type interact with or regulate each other, which determines cell fate and drives cells to develop, differentiate, and carry out other activities (Angione, 2019). Thanks to single-cell sequencing technologies, we can obtain gene expression data at the mRNA level, which is fundamental to analyzing the whole cell system. Currently, gene expression data is widely used to identify cell states during cell development, characterize specific tissues or organs, and analyze patient-specific drug responses (Paik et al., 2020).

Many deep learning methods have been proposed to make predictions from gene expression data, and they achieve strong performance on different biological tasks. For instance, gene expression can be treated as an input feature to classify cell types, cluster cells, and even estimate patient survival time (Erfanian et al., 2021; Huang et al., 2020). Although most deep neural network (DNN) models can diagnose cancers with high precision, a plain DNN cannot tell us which biological factors or processes cause the cancer. For instance, the regulation between the genes PFKL and HIF1A under the HEPG2 pathway has a high probability of causing liver cancer (Shoemaker, 2006; Garcia-Alonso et al., 2019). Recently, some works have incorporated existing biological knowledge into prediction models as graphs that represent the relations among biological factors, significantly improving the prediction accuracy on specific tasks. For example, Rhee et al.
(2018) used the Gene Ontology (GO) knowledgebase to build the neural network architecture, but such architectures are too coarse to simulate gene or protein reactions in cells, which may lead to suboptimal performance. Although these methods mitigate the black-box issue, they use only partial biological knowledge and cannot discover new knowledge from gene expression data.

In this paper, we propose a generic framework, named biological factor regulatory neural network (BFReg-NN), whose goal is to simulate the complex biological processes in a cell system, understand the functions of genes or proteins, and ultimately give insights into the mechanisms of larger living systems. In particular, BFReg-NN is a neural network with the following characteristics. First, each neuron is mapped to a biological factor (e.g., a specific gene or protein), and neurons are arranged level by level according to the hierarchy of biological concepts, such as genes, proteins, pathways, and biological processes. Second, since biological factors regulate each other, edges between neurons (and hyperedges among neurons) are set to reflect the existence of these regulations. In this way, edges also carry biological meaning.

Moreover, two different operations are used to simulate the reactions inside and across layers. Since genes regulate each other and create feedback loops that form cyclic chains of dependencies in a gene regulatory network, graph neural network styled operations are suitable for modeling the "steady state" of genes. The same holds for proteins in a PPI network. The pathway layer is a hypergraph in which each hyperedge is a pathway containing multiple proteins. Accordingly, a hypergraph neural network is used to aggregate and balance the information of each protein. The operations across layers imitate material transformation (e.g., genes are translated into proteins), so we adopt deep neural network styled operations to map these relations.
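The two kinds of operations described above can be made concrete with a small sketch. It is a minimal illustration under stated assumptions, not the BFReg-NN implementation: the function names (`intra_level_step`, `cross_level_step`, `pathway_aggregate`), the tanh activation, the residual term, and the mean aggregation are all hypothetical choices made for this example. The only idea taken from the text is that known biological relations act as masks on learnable weights, so that every nonzero weight corresponds to a named edge.

```python
import numpy as np

def intra_level_step(h, adj, weight):
    """One GNN-style update inside a level (e.g. genes linked by a GRN).

    h:      (cells, n) activations, one column per biological factor
    adj:    (n, n) 0/1 mask, adj[i, j] = 1 if factor j regulates factor i
    weight: (n, n) learnable weights (hypothetical parameterisation)
    """
    masked = weight * adj              # only known regulations carry signal
    return np.tanh(h + h @ masked.T)   # residual term keeps each factor's own state

def cross_level_step(h, mapping, weight):
    """DNN-style update across levels (e.g. genes -> proteins).

    mapping: (n_lower, n_upper) 0/1 mask of the hierarchical relation
    weight:  (n_lower, n_upper) learnable weights
    """
    return np.tanh(h @ (weight * mapping))

def pathway_aggregate(h, incidence):
    """Hypergraph-style aggregation: mean over the proteins in each pathway.

    incidence: (n_proteins, n_pathways) 0/1 matrix, one column per hyperedge
    """
    sizes = np.maximum(incidence.sum(axis=0), 1.0)  # guard against empty pathways
    return (h @ incidence) / sizes

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    genes = rng.normal(size=(4, 3))                 # toy data: 4 cells, 3 genes
    grn = np.array([[0, 1, 0], [0, 0, 1], [1, 0, 0]], dtype=float)
    gene_to_protein = np.eye(3)                     # toy one-to-one mapping
    pathways = np.array([[1, 0], [1, 1], [0, 1]], dtype=float)

    w_grn = rng.normal(scale=0.1, size=(3, 3))
    g = genes
    for _ in range(3):                              # a few steps toward a steady state
        g = intra_level_step(g, grn, w_grn)
    proteins = cross_level_step(g, gene_to_protein, rng.normal(size=(3, 3)))
    print(pathway_aggregate(proteins, pathways).shape)  # (4, 2)
```

Repeating `intra_level_step` mimics the "steady state" idea for cyclic regulation. How many iterations are needed, and whether the iteration converges at all, depends on the weights and is not addressed by this sketch.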
Finally, BFReg-NN can explore new biological knowledge by adding edges that are important but absent from the current biological neural network. Figure 1 illustrates how BFReg-NN simulates a genome-scale cell system as an example.

The advantages of our proposed model include: (1) Compared with previous works, BFReg-NN incorporates the structural biological knowledge in cell systems, including hierarchical relations (e.g., gene-protein and protein-pathway mappings) and regulatory relations among certain factors, such as GRN and PPI. Therefore, it can imitate how different biological factors work inside a cell. (2) BFReg-NN is transparent and interpretable, as each neuron and edge has a corresponding biological meaning. Thus, the learned model weights give evidence of which biological parts are activated and which biological products are generated on the way to the final prediction. (3) By adding new edges between neurons, BFReg-NN not only achieves better performance in downstream tasks, but also has the potential to uncover undiscovered biological knowledge. Traditional knowledge completion methods for biological domains (e.g., link prediction by knowledge bases/graphs) suffer from imbalanced data problems (Bonner et al., 2022). BFReg-NN utilizes gene expression data, which reflects real cell states, and thus obtains more reliable results.

In the experiments, we show the effectiveness of BFReg-NN on several biological tasks with different output formats, including missing gene expression value prediction, cell classification, and future gene expression value forecasting. We also test the knowledge-completion ability of BFReg-NN by measuring the recall of existing biological knowledge. Further, we conduct case studies on newly discovered knowledge. The results demonstrate that BFReg-NN provides biologically meaningful insights.
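To make the recall-based test of knowledge completion concrete, the following sketch shows one plausible way to compute it. The protocol is an assumption, not taken from this section: some known edges are hidden from the model, every remaining candidate pair receives a score (here, hypothetically, the magnitude of a learned weight), and recall is the fraction of hidden edges that appear among the k highest-scoring candidates. The function name `recall_at_k` and its arguments are illustrative.

```python
import numpy as np

def recall_at_k(scores, held_out, excluded, k):
    """Fraction of held-out known edges found among the k top-scoring candidates.

    scores:   (n, n) edge scores, e.g. magnitudes of learned weights (assumption)
    held_out: (n, n) bool, known edges hidden from the model
    excluded: (n, n) bool, pairs that are not candidates
              (e.g. edges already present in the network)
    """
    cand = np.where(excluded, -np.inf, np.abs(scores))
    order = np.argsort(cand, axis=None)[::-1][:k]   # flat indices, best first
    top = np.zeros(scores.size, dtype=bool)
    top[order] = True
    top = top.reshape(scores.shape) & ~excluded     # never count excluded pairs
    n_held = held_out.sum()
    return float((top & held_out).sum() / n_held) if n_held else 0.0
```

A higher recall at small k would indicate that the edges the model ranks as most important tend to coincide with relations already documented in knowledge bases.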

2. RELATED WORK

Gene expression and its applications: RNA sequencing makes it easy to obtain gene expression, a value that represents the amount of gene transcripts produced from a DNA fragment (Eberwine et al., 2014). It has been used in a variety of biological applications, including single-cell analysis (Yu et al., 2022; Zhou et al., 2022), disease diagnosis (Xing et al., 2022) and drug discovery (Pham et al., 2021). However, most of these models lack transparency and ignore existing biological knowledge.

Knowledge graph enhanced downstream tasks: The emergence of knowledge bases/graphs has improved performance in many fields of computer science, such as computer vision and natural language processing (Ren et al., 2021; Hao et al., 2021; Liu et al., 2021). Similarly, knowledge graphs have also been widely used in recent years for specific biological tasks such as cancer diagnosis (Elmarakeby et al., 2021; Rhee et al., 2018). OntoProtein (Zhang et al., 2022) embedded the gene ontology knowledge in pre-training to improve the performance of several pro-



) and Chereda et al. (2021) mapped gene expression data into the protein-protein interaction network, and used graph neural networks to predict cancer. Elmarakeby et al. (2021) modeled the relations of gene-pathway and pathway-biological process as a network, and used a deep neural network to diagnose prostate cancer. Yu et al. (2016); Ma et al.

