PREDICTING CELLULAR RESPONSES WITH VARIATIONAL CAUSAL INFERENCE AND REFINED RELATIONAL INFORMATION

Abstract

Predicting the responses of a cell under perturbations may bring important benefits to drug discovery and personalized therapeutics. In this work, we propose a novel graph variational Bayesian causal inference framework to predict a cell's gene expressions under counterfactual perturbations (perturbations that this cell did not factually receive), leveraging biological knowledge in the form of gene regulatory networks (GRNs) to aid individualized cellular response predictions. Aiming at a data-adaptive GRN, we also developed an adjacency matrix updating technique for graph convolutional networks and used it to refine GRNs during pre-training, which generated more insights into gene relations and enhanced model performance. Additionally, we propose a robust estimator within our framework for the asymptotically efficient estimation of the marginal perturbation effect, which has not been carried out in previous works. With extensive experiments, we demonstrate the advantage of our approach over state-of-the-art deep learning models for individual response prediction.

1. INTRODUCTION

Studying a cell's response to genetic, chemical, and physical perturbations is fundamental to understanding various biological processes and can lead to important applications such as drug discovery and personalized therapies. Cells respond to exogenous perturbations at different levels, including epigenetic (DNA methylation and histone modifications), transcriptional (RNA expression), translational (protein expression), and post-translational (chemical modifications on proteins). The availability of single-cell RNA sequencing (scRNA-seq) datasets has led to the development of several methods for predicting single-cell transcriptional responses (Ji et al., 2021). These methods fall into two broad categories. The first category (Lotfollahi et al., 2019; 2020; Rampášek et al., 2019; Russkikh et al., 2020; Lotfollahi et al., 2021a) approaches the problem of predicting single-cell gene expression response without explicitly modeling the gene regulatory network (GRN), which is widely hypothesized to be the structural causal model governing transcriptional responses of cells (Emmert-Streib et al., 2014). Notable among those studies, CPA (Lotfollahi et al., 2021a) uses an adversarial autoencoder framework designed to decompose the cellular gene expression response into latent components representing perturbations, covariates and basal cellular states. CPA extends the classic idea of decomposing high-dimensional gene expression response into perturbation vectors (Clark et al., 2014; 2015), which can be used for finding connections among perturbations (Subramanian et al., 2017). However, while CPA's adversarial approach encourages latent independence, it does not have any supervision on the counterfactual outcome construction and thus does not explicitly ensure that the counterfactual outcomes would resemble the observed response distribution. Existing self-supervised counterfactual construction frameworks such as GANITE (Yoon et al., 2018) also suffer from this problem.
The second class of methods explicitly models the regulatory structure to leverage the wealth of regulatory relationships among genes contained in the GRNs (Kamimoto et al., 2020). By bringing the benefits of deep learning to graph data, graph neural networks (GNNs) offer a versatile and powerful framework to learn from complex graph data (Bronstein et al., 2017). GNNs are the de facto way of including relational information in many health-science applications including molecule/protein property prediction (Guo et al., 2022; Ioannidis et al., 2019; Strokach et al., 2020; Wu et al., 2022a; Wang et al., 2022), perturbation prediction (Roohani et al., 2022) and RNA-sequence analysis (Wang et al., 2021). In previous work, Cao & Gao (2022) developed GLUE, a framework leveraging a fine-grained GRN with nodes corresponding to features in multi-omics datasets to improve multimodal data integration and response prediction. GEARS (Roohani et al., 2022) uses GNNs to model the relationships among observed and perturbed genes to predict cellular response. These studies demonstrated that relation graphs are informative for predicting cellular responses. However, GLUE does not handle perturbation response prediction, and GEARS's approach of randomly mapping subjects from the control group to subjects in the treatment group is not designed for response prediction at an individual level (it cannot account for heterogeneity of cell states). GRNs can be derived from high-throughput experimental methods that map chromosome occupancy of transcription factors, such as chromatin immunoprecipitation sequencing (ChIP-seq) and the assay for transposase-accessible chromatin using sequencing (ATAC-seq). However, GRNs from these approaches are prone to false positives due to experimental inaccuracies and the fact that transcription factor occupancy does not necessarily translate to regulatory relationships (Spitz & Furlong, 2012).
Alternatively, GRNs can be inferred from gene expression data such as RNA-seq (Maetschke et al., 2014). It is well accepted that integrating both ChIP-seq and RNA-seq data can produce more accurate GRNs (Mokry et al., 2012; Jiang & Mortazavi, 2018; Angelini & Costa, 2014). GRNs are also highly context-specific: different cell types can have very distinctive GRNs, mostly due to their different epigenetic landscapes (Emerson, 2002; Davidson, 2010). Hence, a GRN derived from the most relevant biological system is necessary to accurately infer the expression of individual genes within that system.
In this work, we employed a novel variational Bayesian causal inference framework to construct the gene expressions of a cell under counterfactual perturbations by explicitly balancing individual features embedded in its factual outcome and marginal response distributions of its cell population. We integrated a gene relation graph into this framework, derived the corresponding variational lower bound and designed an innovative model architecture to rigorously incorporate relational information from GRNs in model optimization. Additionally, we propose an adjacency matrix updating technique for graph convolutional networks (GCNs) in order to impute and refine the initial relation graph generated from ATAC-seq data prior to training the framework. With this technique, we obtained updated GRNs that discovered more relevant gene relations (and discarded insignificant gene relations in this context) and enhanced model performance. Furthermore, we propose an asymptotically efficient estimator for estimating the average effect of perturbations under a given cell type within our framework. Such marginal inference is of great biological interest because scRNA-seq experimental results are typically averaged over many cells, yet robust estimations have not been carried out in previous works on predicting cellular responses. We tested our framework on three benchmark datasets: those from Srivatsan et al. (2020) and Schmidt et al. (2022), and a novel CROP-seq genetic knockout screen that we release with this paper. Our model achieved state-of-the-art results on out-of-distribution predictions on differentially expressed genes, a task commonly used in previous works on perturbation predictions. In addition, we carried out ablation studies to demonstrate the advantage of using refined relational information and to better understand the contributions of the framework components.

2. PROPOSED METHOD

In this section we describe our proposed model, Graph Variational Causal Inference (graphVCI), and a relation graph refinement technique. A list of all notation can be found in Appendix A.

