DISCOBAX: DISCOVERY OF OPTIMAL INTERVENTION SETS IN GENOMIC EXPERIMENT DESIGN

Abstract

The discovery of therapeutics to treat genetically-driven pathologies relies on identifying genes involved in the underlying disease mechanism. With billions of potential hypotheses to test, an exhaustive exploration of the entire space of potential interventions is impossible in practice. Sample-efficient methods based on active learning or Bayesian optimization bear the promise of identifying targets of interest using as few experiments as possible. However, genomic perturbation experiments typically rely on proxy outcomes measured in biological model systems that may not completely correlate with the results of interventions in humans. In practical experiment design, one aims to find a set of interventions that maximally move a target phenotype via a diverse set of mechanisms, in order to reduce the risk of failure in future stages of trials. To that end, we introduce DiscoBAX, a sample-efficient algorithm for genomic intervention discovery that maximizes the desired movement of a phenotype while covering a diverse set of underlying mechanisms. We provide theoretical guarantees on the optimality of the approach under standard assumptions, conduct extensive experiments in synthetic and real-world settings relevant to genomic discovery, and demonstrate that DiscoBAX outperforms state-of-the-art active learning and Bayesian optimization methods in this task. Better methods for selecting effective and diverse perturbations in biological systems could enable researchers to discover novel therapeutics for many genetically-driven diseases.

1. INTRODUCTION

Genomic experiments probing the function of genes under realistic cellular conditions are the cornerstone of modern early-stage drug target discovery and validation; moreover, they are used to identify effective modulators of one or more disease-relevant cellular processes. These experiments, for example using Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR) (Jehuda et al., 2018) perturbations, are both time- and resource-intensive (Dickson & Gagnon, 2004; 2009; DiMasi et al., 2016; Berdigaliyev & Aljofan, 2020). Therefore, an exhaustive search of the billions of potential experimental protocols covering all possible experimental conditions, cell states, cell types, and perturbations (Trapnell, 2015; Hasin et al., 2017; Worzfeld et al., 2017; Chappell et al., 2018; MacLean et al., 2018) is infeasible even for the world's largest biomedical research institutes. Furthermore, to mitigate the chances of failure in subsequent stages of the drug design pipeline, it is desirable for the subset of precursors selected in the target identification stage to operate on diverse underlying biological mechanisms (Nica et al., 2022). That way, if a promising candidate based on in-vitro experiments triggers unexpected issues when tested in-vivo (e.g., undesirable side effects), other lead precursors relying on different pathways might be suitable replacements that are not subject to the same issues. Mathematically, finding a diverse set of precursors corresponds to identifying and sampling from the different modes of the black-box objective function mapping intervention representations to the corresponding effects on the disease phenotype (§ 2). Existing machine learning methods for iterative experimental design (e.g., active learning, Bayesian optimization) have the potential to aid in efficiently exploring this vast biological intervention space.
However, to our knowledge, no existing method is geared toward identifying the modes of the underlying black-box objective function in order to find candidate interventions that are both effective and diverse (§ 6). To this end, we introduce DiscoBAX, a sample-efficient Bayesian Algorithm eXecution (BAX) method for discovering genomic intervention sets with both high expected change in the target phe-

