CAUSALBENCH: A LARGE-SCALE BENCHMARK FOR NETWORK INFERENCE FROM SINGLE-CELL PERTURBATION DATA

Abstract

Mapping biological mechanisms in cellular systems is a fundamental step in early-stage drug discovery that serves to generate hypotheses on which disease-relevant molecular targets may effectively be modulated by pharmacological interventions. With the advent of high-throughput methods for measuring single-cell gene expression under genetic perturbations, we now have effective means of generating evidence for causal gene-gene interactions at scale. However, inferring graphical networks of the size typically encountered in real-world gene-gene interaction networks is difficult, both in terms of achieving and of evaluating faithfulness to the true underlying causal graph. Moreover, standardised benchmarks for comparing methods for causal discovery in perturbational single-cell data do not yet exist. Here, we introduce CausalBench, a comprehensive benchmark suite for evaluating network inference methods on large-scale perturbational single-cell gene expression data. CausalBench introduces several biologically meaningful performance metrics and operates on two large, curated and openly available benchmark datasets for evaluating methods on the inference of gene regulatory networks from single-cell data generated under perturbations. With real-world datasets consisting of over 200,000 training samples under interventions, CausalBench could help facilitate advances in causal network inference by providing what is, to the best of our knowledge, the largest openly available test bed for causal discovery from real-world perturbation data to date.

1. INTRODUCTION

Studying causality in real-world environments is often challenging because uncovering causal relationships generally requires either the ability to intervene and observe outcomes under both interventional and control conditions, or a reliance on strong, untestable assumptions that cannot be verified from observational data alone (Stone, 1993; Pearl, 2009; Schwab et al., 2020; Peters et al., 2017). In biology, a domain characterised by the enormous complexity of the systems studied, establishing causality frequently involves experimentation under controlled in-vitro lab conditions using appropriate technologies to observe the response to intervention, such as high-content microscopy (Bray et al., 2016) and multivariate omics measurements (Bock et al., 2016). High-throughput single-cell methods for observing whole-transcriptome measurements in individual cells under genetic perturbations (Dixit et al., 2016; Datlinger et al., 2017; 2021) have recently emerged as a promising technology that could theoretically support performing causal inference in cellular systems at the scale of thousands of perturbations per experiment, and therefore hold enormous promise for enabling researchers to uncover the intricate wiring diagrams of cellular biology (Yu et al., 2004). In order to progress the causal machine learning field beyond reductionist (semi-)synthetic experiments towards potential utility in impactful real-world applications, it is imperative that the causal machine learning research community develop and maintain suitable benchmarks for objectively comparing methods that aim to advance the causal interpretation of real-world interventional datasets.
To facilitate the advancement of machine learning methods in this challenging domain, we introduce CausalBench, a comprehensive benchmark suite for evaluating network inference methods on perturbational single-cell RNA sequencing data that is, to the best of our knowledge, the largest openly available test bed for causal discovery from real-world perturbation data to date (Figure 1). CausalBench contains meaningful, biologically motivated performance metrics, a curated set of two large-scale perturbational single-cell RNA sequencing experiments with over 200,000 openly available interventional samples each, and integrates numerous baseline implementations of state-of-the-art methods for network inference from single-cell perturbational data. Similar to benchmarks in other domains, e.g. ImageNet in computer vision (Deng et al., 2009), we hope CausalBench can help accelerate progress on large-scale real-world causal graph inference, and that the methods developed against CausalBench could eventually lead to new therapeutics and a deeper understanding of human health by enabling the reconstruction of the functional gene-gene interactome. The source code for CausalBench is openly available at https://github.com/ananymous-43213123/causalbench.

Contributions. Our contributions are as follows:

• We introduce CausalBench, a comprehensive benchmark suite for evaluating network inference methods on perturbational single-cell RNA sequencing data, consisting of two curated, openly available benchmark datasets with over 200,000 interventional samples each.

• We introduce a set of meaningful benchmark metrics for evaluating performance, including a novel statistical metric that leverages single-cell perturbational data to measure performance against a larger set of putative gene regulatory relationships than would be possible using observational data alone.

• Using CausalBench, we conduct a comprehensive experimental evaluation of the performance of state-of-the-art network inference algorithms in recovering graphical relationships from mixed observational and interventional scRNA-seq data. We implement relevant state-of-the-art methods as baselines for network inference from observational and interventional single-cell data.

• In addition, we evaluate the performance and scaling characteristics of network inference methods under varying numbers of available training samples and intervention set sizes, to establish whether state-of-the-art network inference algorithms are able to make effective use of different scales of intervention and training sample set sizes.



Machine learning approaches to gene-gene network inference have been studied extensively (Chai et al., 2014; Akers & Murali, 2021; Hu et al., 2020). However, while the combination of single-cell perturbational experiments with machine learning holds great promise for causal discovery, making effective use of such datasets is to date still a challenging endeavour due to the general paucity of real-world data under interventions, and the difficulty of establishing causal ground-truth datasets with which to evaluate and compare graphical network inference methods (Neal et al., 2020; Shimoni et al., 2018; Parikh et al., 2022).

Figure 1: An overview of causal gene-gene network inference in mixed observational and perturbational single-cell data. The causal generative process in its unperturbed form is observed in the observational data (left; 10,000+ samples in CausalBench), while data under genetic interventions (e.g., CRISPR knockouts) are observed in the interventional data (right; 200,000+ samples in CausalBench). Either observational data alone, or interventional plus observational data, sampled from the true causal generative process (bottom distributions), can be used by network inference algorithms (bottom right) to infer a reconstructed causal graph (top right) that should recapitulate the original underlying functional gene-gene interactions as closely as possible.

