NEURAL DESIGN FOR GENETIC PERTURBATION EXPERIMENTS

Abstract

The problem of how to genetically modify cells in order to maximize a certain cellular phenotype has taken center stage in drug development over the last few years (with, for example, genetically edited CAR-T, CAR-NK, and CAR-NKT cells entering cancer clinical trials). Exhausting the search space for all possible genetic edits (perturbations) or combinations thereof is infeasible due to cost and experimental limitations. This work provides a theoretically sound framework for iteratively exploring the space of perturbations in pooled batches in order to maximize a target phenotype under an experimental budget. Inspired by this application domain, we study the problem of batch query bandit optimization and introduce the Optimistic Arm Elimination (OAE) principle designed to find an almost optimal arm under different functional relationships between the queries (arms) and the outputs (rewards). We analyze the convergence properties of OAE by relating it to the Eluder dimension of the algorithm's function class and validate that OAE outperforms other strategies in finding optimal actions in experiments on simulated problems, public datasets well-studied in bandit contexts, and in genetic perturbation datasets when the regression model is a deep neural network. OAE also outperforms the benchmark algorithms in 3 of 4 datasets in the GeneDisco experimental planning challenge.

1. INTRODUCTION

We are inspired by the problem of finding the genetic perturbations that maximize a given function of a cell (a particular biological pathway or mechanism, for example the proliferation or exhaustion of particular immune cells) while performing the least number of perturbations required. In particular, we are interested in prioritizing the set of genetic knockouts (via shRNA or CRISPR) to perform on cells that would optimize a particular scalar cellular phenotype. Since the space of possible perturbations is very large (with roughly 20K human protein-coding genes) and each knockout is expensive, we would like to order the perturbations strategically so that we find one that optimizes the particular phenotype of interest in fewer total perturbations than, say, just brute-force applying all possible knockouts. In this work we consider only single-gene knockout perturbations since they are the most common, but multi-gene perturbations are also possible (though considerably more technically complex to perform at scale). While a multi-gene perturbation may be trivially represented as a distinct (combined) perturbation in our framework, we leave for future work the more interesting extension of embedding, predicting, and planning these multi-gene perturbations using previously observed single-gene perturbations. With this objective in mind we propose a simple method for improving a cellular phenotype under a limited budget of genetic perturbation experiments. Although this work is inspired by this concrete biological problem, our results and algorithms are applicable in much more generality to the setting of experimental design with neural network models. We develop and evaluate a family of algorithms for the zero noise batch query bandit problem based on the Optimistic Arm Elimination principle (OAE). We focus on developing tractable versions of these algorithms compatible with neural network function approximation. 
During each time-step OAE fits a reward model on the observed responses seen so far while at the same time maximizing the reward on all the arms yet to be pulled. The algorithm then queries the batch of arms whose predicted reward is maximal among the arms that have not been tried out. We conduct a series of experiments on synthetic and public data from the UCI database Dua & Graff (2017) and show that OAE is able to find the optimal "arm" using fewer batch queries than other algorithms such as greedy and random sampling. Our experimental evaluation covers both neurally realizable and non-neurally-realizable function landscapes. The performance of OAE against benchmarks is comparable in both settings, demonstrating that although our presentation of the OAE algorithm assumes realizability for the sake of clarity, this assumption is not required in practice. In the setting where the function class is realizable (i.e., the function class F used by OAE contains the function generating the rewards) and the evaluation is noiseless, we show two query lower bounds for the classes of linear and 1-Lipschitz functions. We validate OAE on the public CMAP dataset Subramanian et al. (2017), which contains tens of thousands of genetic shRNA knockout perturbations, and show that it always outperforms a baseline and almost always outperforms a simpler greedy algorithm in both convergence speed to an optimal perturbation and the associated phenotypic rewards. These results illustrate how perturbational embeddings learned from one biological context can still be quite useful in a different biological context, even when the reward functions of these two contexts are different. Finally, we also benchmark our methods on the GeneDisco dataset and algorithm suite (see Mehrjou et al. (2021)) and show OAE to be competitive against benchmark algorithms in the task of maximizing the hit ratio.

2. RELATED WORK

Bayesian Optimization The field of Bayesian optimization has long studied the problem of optimizing functions under severe time or cost limitations Jones et al. (1998). For example, Srinivas et al. (2009) introduce the GP-UCB algorithm for optimizing unknown functions. Other approaches based on adaptive basis function regression have also been used to model the payoff function, as in Snoek et al. (2015). These algorithms have been used in the drug discovery context: Mueller et al. (2017) applied Bayesian optimization to the problem of optimizing biological phenotypes. Very recently, GeneDisco was released as a benchmark suite for evaluating active learning algorithms for experiment design in drug discovery Mehrjou et al. (2021). Perhaps the most relevant to our setting are the many works that study the batch acquisition setting in Bayesian active learning and optimization, such as Kirsch et al. (2019); Kathuria et al. (2016) and the GP-BUCB algorithm of Desautels et al. (2014). In this work we move beyond the typical parametric and Bayesian assumptions of these works towards algorithms that work in conjunction with neural network models. We provide guarantees for the no-noise setting we study based on the Eluder dimension Russo & Van Roy (2013). Parallel Bandits Despite its wide applicability in many scientific applications, batch learning has been studied relatively seldom in the bandit literature. Nevertheless, recent work (Chan et al., 2021) shows that in the setting of contextual linear bandits (Abbasi-Yadkori et al., 2011), the finite sample complexity of parallel learning matches that of sequential learning irrespective of the batch size, provided the number of batches is large enough. Unfortunately, this is rarely the regime that matters in many practical applications such as drug development, where the size of the experiment batch may be large but each experiment may be very time consuming, thus limiting their number.
In this work we specifically address this setting in our experimental evaluation in Section E. Structure Learning Prior work in experiment design tries to identify causal structures with a fixed budget of experiments Ghassami et al. (2018). Scherrer et al. (2021) propose a mechanism to select intervention targets to enable more efficient causal structure learning. Sussex et al. (2021) extend the amount of information contained in each experiment by simultaneously intervening on multiple variables. Causal matching, where an experimenter can perform a set of interventions aimed to transform the system to a desired state, is studied in Zhang et al. (2021). Neural Bandits Methods such as Neural UCB and Shallow Neural UCB Zhou et al. (2020); Xu et al. (2020) are designed to add to model predictions an optimistic bonus that can be computed analytically and is extremely reminiscent of the one used in linear bandits (Auer, 2002; Dani et al., 2008); their theoretical validity therefore depends on these 'linearizing' conditions holding. More recently, Pacchiano et al. (2021b) proposed the use of pseudo-label optimism for the Bank Loan problem, an algorithm that adds optimism to neural network predictions through the addition of fake data and is only analyzed in the classification setting. Our algorithms instead add optimism directly to their predictions. The latter is achieved via two methods: either by explicitly encouraging the learner to fit a model whose predictions are large on unseen data, or by computing uncertainties. Active Learning Active learning is a relatively well-studied problem Settles (2009); Dasgupta (2011); Hanneke et al. (2014), particularly in the context of supervised learning. See for example Balcan et al. (2009), Dasgupta et al. (2007), Settles (2009), Hanneke et al. (2014). There is a vast amount of research on active learning for classification (see for example Agarwal (2013), Dekel et al.
(2010), and Cesa-Bianchi et al. (2009)), where the objective is to learn a linearly parameterized response model P(y|x). Broadly speaking there are two main sample construction approaches: diversity sampling (Sener & Savarese (2017); Geifman & El-Yaniv (2017); Gissin & Shalev-Shwartz (2019)) and uncertainty sampling (Tong & Koller (2001); Schohn & Cohn (2000); Balcan et al. (2009); Settles et al. (2007)), successful in the large batch-size regime (Guo & Schuurmans (2007); Wang & Ye (2015); Chen & Krause (2013); Wei et al. (2015); Kirsch et al. (2019)) and the small batch-size regime respectively. Diversity sampling methods produce spread-out samples to better cover the space, while uncertainty-based methods estimate model uncertainty to select which points to label. Hybrid approaches are common as well. A common objective in the active learning literature is to collect enough samples to produce a model that minimizes the population loss over the data distribution. This is in contrast with the objective we study in this work, which is to find a point in the dataset with a large response. There is a rich literature dedicated to the development of active learning algorithms for deep learning applications, both in the batch and single-sample settings (Settles et al. (2007); Ducoffe & Precioso (2018); Beluch et al. (2018); Ash et al. (2021)).

3. PROBLEM DEFINITION

Let y⋆ : A → R be a response function over A ⊂ R^d. We assume access to a function class F ⊂ Fun(A, R), where Fun(A, R) denotes the set of functions from A to R. Following the typical online learning terminology we call A the set of arms. In this work we allow A to be infinite, although we only consider finite A in practice. In our setting the experiment designer (henceforth called the learner) interacts with y⋆ and A in a sequential manner. During the t-th round of this interaction, aided by F and historical query and response information, the learner is required to query a batch of b ∈ N arms {a_{t,i}}_{i=1}^b ⊂ A and observe noiseless responses {y_{t,i} = y⋆(a_{t,i})}_{i=1}^b, after which these response values are added to the historical dataset D_{t+1} = D_t ∪ {(a_{t,i}, y_{t,i})}_{i=1}^b. In this work we do not assume that y⋆ ∈ F. Instead we allow the learner access to a function class F to aid her in producing informative queries. This is a common situation in the setting of neural experiment design, where we may want to use a DNN model to fit the historical responses and generate new query points without prior knowledge of whether it accurately captures y⋆. Our objective is to develop a procedure that can recover an 'almost optimal' arm a ∈ A in the least number of arm pulls possible. We consider the following objective, τ-quantile optimality: find an arm a_τ ∈ A belonging to the top τ-quantile of {y⋆(a)}_{a∈A}. Although ϵ-optimality (find an arm a_ϵ ∈ A such that y⋆(a_ϵ) + ϵ ≥ max_{a∈A} y⋆(a) for ϵ ≥ 0) is the most common criterion considered in the optimization literature, for it to be meaningful it requires knowledge of the scale of max_{a∈A} y⋆(a). In some scenarios this may be hard to know in advance. Thus in our experiments we focus on the setting of τ-quantile optimality as a more practically relevant performance measure. This type of objective has been considered by many works in the bandit literature (see for example Szorenyi et al.
(2015); Zhang & Ong (2021)). Moreover, it is a measure of optimality better aligned with practical objectives used in experiment design evaluation, such as the hit ratio in the GeneDisco benchmark library Mehrjou et al. (2021). We show in Section E that our algorithms are successful at producing almost optimal arms under this criterion after a small number of queries. The main challenge we must overcome in this problem is designing a smart choice of batch queries {a_{t,i}}_{i=1}^b that balances the competing objectives of exploring new regions of the arm space and zooming into others that have shown promising rewards. In this work we focus on the case where the observed response values y⋆(a) of any arm a ∈ A are noiseless. In the setting of neural perturbation experiments the responses are the average of many expression values across a population of cells, and thus it is safe to assume the observed response is almost noiseless. In contrast with the noisy setting, when the response is noiseless, querying the same arm twice is never necessary. We leave the question of how to design algorithms for noisy responses in the function approximation regime for future work, although we note it can be reduced to our setting if we set the exploitation rounds per data point sufficiently large. Evaluation. After the queries ∪_{ℓ=1}^t {a_{ℓ,i}}_{i=1}^b, the learner outputs a candidate approximate optimal arm â_t among all the arms whose labels she has queried (all arms in D_{t+1}) by considering (â_t, ŷ_t) = argmax_{(a,y)∈D_{t+1}} y, the point with the maximal observed reward so far. Given a quantile value τ, we measure the performance of our algorithms by recording the first timestep t_τ^first at which a τ-quantile optimal point â_t was proposed.
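The evaluation criterion t_τ^first is straightforward to compute. A minimal sketch (the helper name `first_quantile_hit` is ours; arms are assumed to be integer indices into a response vector):

```python
import numpy as np

def first_quantile_hit(y_star, batches, tau=0.05):
    """Return the first round t at which some queried arm lies in the top
    tau-quantile of {y*(a)}, or None if no queried arm ever does."""
    y_star = np.asarray(y_star, dtype=float)
    threshold = np.quantile(y_star, 1.0 - tau)  # top-tau cutoff value
    best_so_far = -np.inf
    for t, batch in enumerate(batches, start=1):
        best_so_far = max(best_so_far, float(y_star[np.asarray(batch)].max()))
        if best_so_far >= threshold:
            return t
    return None
```

Since responses are noiseless, tracking the running maximum over queried arms suffices; no repeated queries are ever needed.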

4. OPTIMISTIC ARM ELIMINATION

With these objectives in mind, we introduce a family of algorithms based on the Optimistic Arm Elimination (OAE) principle. We denote by U_t the subset of arms yet to be queried by our algorithm. At time t any OAE algorithm produces a batch of query points of size b from U_t. Our algorithms start round t by fitting an appropriate response predictor f_t : U_t → R based on the historical query points D_t and their observed responses so far. Instead of only fitting the historical responses with a square loss to produce a prediction function f_t, we encourage the predictions of f_t to be optimistic on the yet-to-be-queried points of U_t. We propose two tractable ways of achieving this. First, by fitting a model (or an ensemble of models) f^o_t to the data in D_t and explicitly computing a measure of uncertainty ũ_t : U_t → R of its predictions on U_t; we then define the optimistic response predictor f_t(a) = f^o_t(a) + ũ_t(a). Second, by defining f_t to be the approximate solution of a constrained objective,

f_t ∈ argmax_{f∈F} A(f, U_t)   s.t.   Σ_{(a,y)∈D_t} (f(a) - y)² ≤ γ_t,     (1)

where γ_t is a possibly time-dependent parameter satisfying γ_t ≥ 0 and A(f, U) is an acquisition objective tailored to produce an informative arm (or batch of arms) from U_t. We consider the acquisition objectives

A_avg(f, U) = (1/|U|) Σ_{a∈U} f(a),   A_hinge(f, U) = (1/|U|) Σ_{a∈U} (max(0, f(a)))^p for some p > 0,
A_softmax(f, U) = log Σ_{a∈U} exp(f(a)),   and   A_sum(f, U) = Σ_{a∈U} f(a).

Acquisition functions of theoretical interest, although hard to optimize in practice, are A_max(f, U) = max_{a∈U} f(a) and its batch version A_{max,b}(f, U) = max_{B⊂U,|B|=b} Σ_{a∈B} f(a). Regardless of whether f_t was computed via Equation 1 or via an uncertainty-aware objective of the form f_t(a) = f^o_t(a) + ũ_t(a), our algorithm then produces a query batch B_t by solving

B_t ∈ argmax_{B⊂U_t,|B|=b} A_avg(f_t, B).     (2)
The principle of Optimism in the Face of Uncertainty (OFU) allows OAE algorithms to efficiently explore new regions of the space by acting greedily with respect to a model that fits the rewards of the arms in D_t as accurately as possible but induces large responses on the arms she has not tried. If y⋆ ∈ F, and f_t is computed by solving Equation 1, it can be shown that the optimistic model overestimates the true response values, i.e. Σ_{a∈B_t} f_t(a) ≥ Σ_{a∈B_{⋆,t}} y⋆(a), where B_{⋆,t} = argmax_{B⊂U_t,|B|=b} Σ_{a∈B} y⋆(a). Consult Appendix D.2 for a proof and an explanation of the relevance of this observation. Acting greedily based on an optimistic model means the learner tries out the arms that may achieve the highest reward according to the current model plausibility set. After pulling these arms, the learner can successfully update the model plausibility set and repeat this procedure.
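For a finite candidate set represented as a vector of model predictions, the acquisition objectives above and the batch selection step of Equation 2 reduce to a few lines. A minimal sketch (function names are ours):

```python
import numpy as np

# The tractable acquisition objectives, evaluated on a vector of model
# predictions over a candidate set U.
def acq_avg(preds):
    return preds.mean()

def acq_hinge(preds, p=4):
    return np.mean(np.maximum(0.0, preds) ** p)

def acq_softmax(preds):
    # log-sum-exp, a smooth surrogate for the maximum prediction
    return np.log(np.exp(preds).sum())

def acq_sum(preds):
    return preds.sum()

def select_batch(preds, b):
    """Equation 2: maximizing A_avg over size-b batches reduces to
    taking the b arms with the largest predictions."""
    return np.argsort(preds)[-b:]
```

Note that for a fixed batch size b, maximizing A_avg (or A_sum) over batches is solved exactly by a top-b selection; the optimization difficulty of OAE lies in fitting the optimistic f_t, not in this step.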

4.1. TRACTABLE IMPLEMENTATIONS OF OAE

In this section we go over the algorithmic details behind the approximations that we have used when implementing the different OAE methods we have introduced in Section 4.

4.1.1. OPTIMISTIC REGULARIZATION

In order to produce a tractable implementation of the constrained problem 1, we approximate it with the optimism-regularized objective

f_t ∈ argmin_{f∈F} (1/|D_t|) Σ_{(a,y)∈D_t} (f(a) - y)² − λ_reg A(f, U_t),     (3)

where the second term acts as an optimism regularizer, and define B_t following Equation 2. Problem 3 is compatible with DNN function approximation. In our experiments we set the acquisition function to A_avg, A_hinge with p = 4, and A_softmax. The resulting methods are MeanOpt, HingePNorm and MaxOpt. Throughout this work RandomBatch corresponds to uniform arm selection and Greedy to setting the optimism regularizer to 0.
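To make objective 3 concrete, the following sketch runs gradient descent on the MeanOpt objective (A = A_avg) for a linear model f(a) = a·w rather than a DNN; the function name and hyperparameters are ours, and λ_reg = 0 recovers Greedy:

```python
import numpy as np

def fit_mean_opt_linear(X_lab, y_lab, X_unl, lam_reg=0.0, lr=0.01, steps=2000):
    """Gradient descent on the optimism-regularized objective (3) with A = A_avg.
    The squared-error term pulls w toward the labeled data (X_lab, y_lab);
    the regularizer pushes predictions on the unqueried arms X_unl upward."""
    w = np.zeros(X_lab.shape[1])
    n = len(y_lab)
    grad_opt = X_unl.mean(axis=0)  # gradient of A_avg(f, U_t) w.r.t. w
    for _ in range(steps):
        grad_fit = (2.0 / n) * X_lab.T @ (X_lab @ w - y_lab)
        w -= lr * (grad_fit - lam_reg * grad_opt)
    return w
```

With a DNN the same recipe applies: add −λ_reg times the acquisition value on U_t to the training loss and backpropagate through both terms.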

4.1.2. ENSEMBLE METHODS

We consider two distinct methods (Ensemble and EnsembleNoiseY) to produce uncertainty estimations based on ensemble predictions. In both we fit M models {f^i_t}_{i=1}^M to D_t and define

f^o_t(a) = (max_{i=1,...,M} f^i_t(a) + min_{i=1,...,M} f^i_t(a)) / 2,   ũ_t(a) = max_{i=1,...,M} f^i_t(a) − f^o_t(a),

so that f_t = f^o_t + ũ_t. The two methods differ in the origin of the noise used to produce the M models {f^i_t}_{i=1}^M fit to D_t. Ensemble produces M models resulting from independent random initialization of their model parameters. The EnsembleNoiseY method injects 'label noise' into the dataset responses: for all i ∈ {1, ..., M} we build a dataset D^i_t = {(a, y + ξ) for (a, y) ∈ D_t}, where ξ ∼ N(0, 1) is an i.i.d. zero-mean Gaussian sample drawn independently for each pair. The functions {f^i_t}_{i=1}^M are defined as f^i_t ∈ argmin_{f∈F} Σ_{(a,y)∈D^i_t} (f(a) − y)². In this case the uncertainty of the ensemble predictions is the result of both the random parameter initialization of the {f^i_t}_{i=1}^M and the 'label noise'. This noise injection procedure draws its inspiration from methods such as RLSVI and NARL (Russo (2019); Pacchiano et al. (2021a)).
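Given a matrix of ensemble predictions, the midpoint/half-range decomposition and the EnsembleNoiseY label-noise datasets are a few lines each; a sketch (helper names are ours):

```python
import numpy as np

def ensemble_optimism(preds):
    """preds: (M, n) matrix of M ensemble members' predictions on n arms.
    Returns (f_o, u): the midpoint prediction and the half-range uncertainty.
    Note f_o + u equals the per-arm ensemble maximum, so the optimistic
    score used for selection is simply max_i f^i_t(a)."""
    f_hi = preds.max(axis=0)
    f_lo = preds.min(axis=0)
    f_o = (f_hi + f_lo) / 2.0
    u = f_hi - f_o
    return f_o, u

def noisy_label_copies(y, M, rng):
    """EnsembleNoiseY: M copies of the responses, each with fresh i.i.d.
    N(0, 1) label noise, to be fit by M separate models."""
    y = np.asarray(y, dtype=float)
    return [y + rng.standard_normal(y.shape) for _ in range(M)]
```

The decomposition into f^o_t and ũ_t matters when the two quantities are reported or used separately; for batch selection only their sum, the ensemble maximum, is needed.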

4.2. DIVERSITY SEEKING VERSIONS OF OAE

In the case b > 1, the explore/exploit trade-off is not the sole consideration in selecting the arms that make up B_t. We should also be concerned with selecting sufficiently diverse points within the batch B_t to maximize its information gathering ability. Below we show how to extend Algorithm 1 (henceforth referred to as vanilla OAE) to effectively induce query diversity. We introduce two versions, OAE-DvD and OAE-Seq, which we discuss in more detail below. A detailed description of tractable implementations of these algorithms can be found in Appendix C.

4.2.1. DIVERSITY VIA DETERMINANTS

Inspired by diversity-seeking methods in the Determinantal Point Processes (DPPs) literature Kulesza & Taskar (2012) and by the DvD algorithm Parker-Holder et al. (2020), we introduce the OAE-DvD algorithm, which augments the vanilla OAE objective with a diversity regularizer:

B_t ∈ argmax_{B⊆U_t,|B|=b} A_sum(f_t, B) + Div(f_t, B).     (4)

OAE-DvD's Div regularizer is inspired by the theory of Determinantal Point Processes and equals a weighted log-determinant objective. OAE-DvD has access to a kernel function K : R^d × R^d → R, and at the start of every time step t ∈ N it builds a kernel matrix K_t ∈ R^{|U_t|×|U_t|} with K_t[i, j] = K(a_i, a_j) for all i, j ∈ U_t. For any subset B ⊆ U_t we define the OAE-DvD diversity-aware score as

Div(f, B) = λ_div log(Det(K_t[B, B])),

where K_t[B, B] corresponds to the submatrix of K_t with columns (and rows) indexed by B and λ_div is a diversity regularization weight. Since the resulting optimization problem in Equation 4 may prove extremely hard to solve, we design a greedy maximization algorithm to produce a surrogate solution. The details can be found in Appendix B.1. OAE-DvD induces diversity by leveraging the geometry of the action space.

4.2.2. SEQUENTIAL BATCH SELECTION RULES

In this section we introduce OAE-Seq, a generalization of the OAE algorithm designed to produce in-batch diversity. OAE-Seq produces a query batch by solving a sequence of b optimization problems. The first element a_{t,1} of batch B_t is chosen as the arm in U_t achieving the most optimistic prediction over plausible models (following objective 1 and any of the tractable implementations defined in Section C, either optimistic regularization or ensemble methods). To produce the second point a_{t,2} in the batch (provided b > 1), we temporarily add the pair (a_{t,1}, ỹ_{t,1}) to the data buffer D_t, where ỹ_{t,1} is a virtual reward estimator for a_{t,1}. Using this 'fake label' augmented dataset we select a_{t,2} following the same optimistic selection method used for a_{t,1}. Although other choices are possible, in practice we set ỹ_{t,1} as a mid-point between an optimistic and a pessimistic prediction of the value of y⋆(a_{t,1}). The name OAE-Seq derives from the 'sequential' way in which the batch is produced. If OAE-Seq selected arm a to be in the batch, and this arm has a non-optimistic virtual reward ỹ(a) that is low relative to the optimistic values of other arms, then OAE-Seq will not select too many arms close to a in the same batch. OAE-Seq induces diversity not through the geometry of the arm space but in a way that is intimately related to the plausible arm values in the function class. A similar technique of adding hallucinated values to induce diversity has been proposed before, for example in the GP-BUCB algorithm of Desautels et al. (2014). To our knowledge, this is the first time this general idea has been tested in conjunction with scalable neural network function approximation algorithms. A detailed discussion of this algorithm can be found in Section B.1.1.

4.3. THE STATISTICAL COMPLEXITY OF ZERO NOISE BATCH LEARNING

In this section we present our main theoretical results regarding OAE with function approximation. In our results we use the Eluder dimension Russo & Van Roy (2013) to characterize the complexity of the function class F. This is appropriate because our algorithms make use of the optimism principle to produce their queries B_t. We show two novel results. First, we characterize the sample complexity of zero-noise parallel optimistic learning with Eluder classes of dimension d. Perhaps surprisingly, the regret of vanilla OAE with batch size b has the same regret profile as the case b = 1, up to a constant burn-in factor of order bd. Second, our results hold under model misspecification, that is when y⋆ ∉ F, at the cost of a linear dependence on the misspecification error. Although our results are for the noiseless setting (the subject of this work), we have laid the most important part of the groundwork to extend them to the case of noisy evaluation. We explain why in Appendix D.2. We relegate the formal definition of the Eluder dimension to Appendix D.2. In this section we measure the misspecification of y⋆ via the ∥·∥_∞ norm. We assume y⋆ satisfies min_{f∈F} ∥f − y⋆∥_∞ ≤ ω, where ∥f − y⋆∥_∞ = max_{a∈A} |f(a) − y⋆(a)|. Let ȳ⋆ = argmin_{f∈F} ∥y⋆ − f∥_∞ be the ∥·∥_∞ projection of y⋆ onto F. We analyze OAE where f_t is computed by solving 1 with γ_t ≥ (t − 1)bω² and acquisition objective A_{max,b}. We measure the performance of OAE via its regret, defined as

Reg_{F,b}(T) = Σ_{t=1}^T [ Σ_{a∈B_{⋆,t}} y⋆(a) − Σ_{a∈B_t} y⋆(a) ],

where B_{⋆,t} = argmax_{B⊂U_t,|B|=b} Σ_{a∈B} y⋆(a). The main result in this section is:

Theorem 4.1. The regret of OAE with the A_{max,b} acquisition function satisfies

Reg_{F,b}(T) = O( C·dim_E(F, α_T)·b + ω·√(dim_E(F, α_T)·T·b) ),

with α_t = max(1/t², inf{∥f₁ − f₂∥_∞ : f₁, f₂ ∈ F, f₁ ≠ f₂}) and C = max_{f∈F, a,a′∈A} |f(a) − f(a′)|. The proof of Theorem 4.1 can be found in Appendix D.2.
This result implies the regret is bounded by a quantity that grows linearly with ω, the amount of misspecification, but otherwise only with the scale of dim_E(F, α_T).
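The regret quantity bounded by Theorem 4.1 can be computed exactly on simulated runs, since in the noiseless setting the comparator batch B_{⋆,t} at each round is simply the b best unqueried arms. A minimal sketch (the helper name `batch_regret` is ours):

```python
import numpy as np

def batch_regret(y_star, batches, b):
    """Cumulative regret Reg_{F,b}(T): each round, compare the pulled batch
    against the best size-b batch B_{*,t} among the arms not yet queried."""
    y_star = np.asarray(y_star, dtype=float)
    unqueried = set(range(len(y_star)))
    reg = 0.0
    for batch in batches:
        remaining = np.fromiter(unqueried, dtype=int)
        best = np.sort(y_star[remaining])[-b:].sum()  # value of B_{*,t}
        reg += best - y_star[np.asarray(batch)].sum()
        unqueried -= set(batch)
    return reg
```

Because the comparator is restricted to U_t, an algorithm that pulls the top arms in a suboptimal order incurs regret only in the early rounds, consistent with the burn-in term in Theorem 4.1.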

5. TRANSFER LEARNING ACROSS GENETIC PERTURBATION DATASETS

Figure 1: Regression mean squared error loss for models that predict cell-line-specific phenotypic rewards from VCAP-derived perturbational features.

In order to show the effectiveness of OAE in the large batch, small number of iterations regime, we consider genetic perturbations from the CMAP dataset Subramanian et al. (2017), which contains a 978-gene bulk expression readout from thousands of single-gene shRNA knockout perturbations across a number of cell lines. We consider the setting in which we have observed the effect of knockouts in one biological context (i.e., cell line) and would like to use it to plan a series of knockout experiments in another. Related applications may involve different biological contexts arising from different cell types or experimental conditions. We use the level 5 CMAP observations, each of which consists of a 978-gene transcriptional readout from an shRNA knockout of a particular gene in a particular cell line. In our experiments, we choose to optimize a cellular proliferation phenotype, defined as a function on the 978-gene expression space. See Appendix F for details. We use the 4 cell lines with the largest number of genetic perturbations in common: VCAP (prostate cancer, n = 14602), HA1E (kidney epithelium, n = 10487), MCF7 (breast cancer, n = 6638), and A375 (melanoma, n = 10033). We first learn a 100-dimensional action (perturbation) embedding a_i for each perturbation in VCAP with an autoencoder. The autoencoder has a 100-dimensional bottleneck layer and two intermediate layers of 1500 and 300 ReLU units with dropout and batch normalization, and is trained using the Adam optimizer on mean squared reconstruction loss. We use these 100-dimensional perturbation embeddings as the features to train the f_t functions for each of the other cell types.
Following our OAE algorithm, we train a fresh feed-forward neural network with two intermediate layers (of 100 and 10 units) after observing the phenotypic rewards for each batch of 50 gene (knockout) perturbations. Figure 6 summarizes this approach. Figure 1 shows the mean squared error loss of models trained to predict the cell-line-specific phenotypic reward from the 100-dimensional VCAP-derived perturbational features. These models are trained on successive batches of perturbations sampled via RandomBatch and using the same NN 1500-300 hidden layer architecture as the decoder. Not surprisingly, the loss for the VCAP reward is one of the lowest, but the losses for two other cell lines (HA1E and MCF7) are also quite similar, showing the NN 1500-300 neural net function class is flexible enough to learn the reward function in one context from the perturbational embedding learned in another. In all of our experiments we consider either a linear or a neural network function class with ReLU activations, a batch size b = 50, and a number of batches N = 20. Figure 2 shows the convergence and reward results for the 4 cell lines when the neural network architecture equals NN 100-10. Since the perturbation action features were learned on the VCAP dataset (though agnostic to any phenotypic reward), the optimal VCAP perturbations are found quite quickly by all versions of MeanOpt including Greedy. Interestingly, MeanOpt still outperforms RandomBatch in the HA1E and MCF7 cell lines but not on A375. When the neural network architecture equals NN 10-5, MeanOpt is only competitive with RandomBatch on the VCAP and MCF7 datasets (see figure 37 in Appendix G.4). Moreover, when F is a class of linear functions, MeanOpt can beat RandomBatch only in VCAP. This can be explained by looking at the regression fit plots of figure 4.
The baseline loss value is the highest for A375 even for NN 100-10, indicating this function class is too far from the true response values for A375. The loss curves for NN 10-5 lie well above those for NN 100-10 for all datasets, explaining the degradation in performance when switching from NN 100-10 to the smaller-capacity NN 10-5. Finally, the linear fit achieves a very small loss for VCAP, explaining why MeanOpt still outperforms RandomBatch in VCAP with linear models. On all other datasets the linear fit is inferior to NN 100-10, explaining why MeanOpt with NN 100-10 works better than with linear models. We note that EnsembleOptimism is competitive with RandomBatch in both the NN 10-5 and NN 100-10 architectures in 7 out of the 8 experiments we conducted. In Appendix G.4 the reader can find results for MaxOpt and HingeOpt; both methods underperform in comparison with MeanOpt. In Appendix E.2.1 and E.2.2 the reader will find experiments using tractable versions of OAE-DvD and OAE-Seq. In Appendix E and G, we present extensive additional experiments and discussion of our findings over different network architectures (including linear), and over a variety of synthetic and public datasets from the UCI database Dua & Graff (2017). We also ran our methods on the GeneDisco benchmark (including the SarsCov2) datasets and tested for performance using the Hit Ratio metric after collecting 50 batches of size 16. This is defined as the ratio of arm pulls lying in the top .05 quantile of genes with the largest absolute value. Our results are in Table 5. OAE outperforms the other algorithms by a substantial margin in 3 out of the 4 datasets that we tested.


A TRANSFER LEARNING ACROSS GENETIC PERTURBATION DATASETS

In this section we have placed a diagrammatic version of the data pipeline described in Section 5. In the active learning setting the objective function used to build the batch is purely driven by the diversity objective. The method works as follows. At time t, OAE-DvD constructs a regression estimator f_t using the arms and responses in D_t (for example by solving problem 1). Instead of using Equation 2, our algorithm selects batch B_t by optimizing a diversity-aware objective of the form

B_t ∈ argmax_{B⊆U_t,|B|=b} A_sum(f_t, B) + Div(f_t, B).     (6)

OAE-DvD's Div regularizer is inspired by the theory of Determinantal Point Processes and equals a weighted log-determinant objective. OAE-DvD has access to a kernel function K : R^d × R^d → R, and at the start of every time step t ∈ N it builds a kernel matrix K_t ∈ R^{|U_t|×|U_t|} with K_t[i, j] = K(a_i, a_j) for all i, j ∈ U_t. For any subset B ⊆ U_t we define the OAE-DvD diversity-aware score as

Div(f, B) = λ_div log(Det(K_t[B, B])),     (7)

where K_t[B, B] corresponds to the submatrix of K_t with columns (and rows) indexed by B and λ_div is a diversity regularization weight. Since the resulting optimization problem in Equation 6 may prove extremely hard to solve, we design a greedy maximization algorithm to produce a surrogate solution. We build the batch B_t greedily. The first point a_{t,1} in the batch is selected to be the point in U_t that maximizes the response f_t. For all i ≥ 2, the point a_{t,i} is selected from U_t \ {a_{t,j}}_{j=1}^{i-1} such that

a_{t,i} = argmax_{a ∈ U_t \ {a_{t,j}}_{j=1}^{i-1}} A_sum(f_t, {a} ∪ {a_{t,j}}_{j=1}^{i-1}) + Div(f_t, {a} ∪ {a_{t,j}}_{j=1}^{i-1}).

Algorithm 2 Optimistic Arm Elimination - DvD (OAE-DvD)
  Input: action set A ⊂ R^d, number of batches N, batch size b, λ_div
  Initialize: unpulled arms U_1 = A; observed points and labels dataset D_1 = ∅
  for t = 1, ..., N do
    if t = 1 then
      Sample uniformly a size-b batch B_t ⊆ U_1.
else compute B_t using the greedy procedure described above.
Observe batch rewards Y_t = {y⋆(a) for a ∈ B_t}.
Update D_{t+1} = D_t ∪ {(B_t, Y_t)}. Update U_{t+1} = U_t \ B_t.

Equation 6 admits a compact reformulation. Define a reward-augmented kernel matrix K^aug_t(i, j) = K_t(i, j) exp((f̂_t(i) + f̂_t(j)) / (2λ_div)). This matrix satisfies K^aug_t[B, B] = diag(exp(f̂_t / (2λ_div)))[B, B] · K_t[B, B] · diag(exp(f̂_t / (2λ_div)))[B, B] for all B ⊆ U_t. Since the determinant of a product of matrices is the product of their determinants, it follows that Det(K^aug_t[B, B]) = exp(∑_{a∈B} f̂_t(a)/λ_div) · Det(K_t[B, B]). Thus for all B ⊆ U_t, Equation 6 with diversity score (7) can be rewritten as

B_t ∈ argmax_{B ⊆ U_t, |B| = b} log(Det(K^aug_t[B, B])),

because log(Det(K^aug_t[B, B])) = ∑_{a∈B} f̂_t(a)/λ_div + log(Det(K_t[B, B])) for all B ⊆ U_t, and multiplying the objective by λ_div > 0 does not change the maximizer. It is well known that the log-determinant set function of a positive semidefinite matrix is submodular (see for example Section 2.2 of Han et al. (2017)), and it has long been established that the greedy algorithm achieves an approximation ratio of (1 - 1/e) for monotone submodular maximization under a cardinality constraint (Nemhauser et al., 1978). This justifies the greedy algorithm we use to select B_t.
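The greedy surrogate for the determinant-based objective can be sketched as follows. The function name is ours, and the Gaussian kernel in the usage below is an illustrative choice; the paper's kernel is specified in Section E.2:

```python
import numpy as np

def greedy_dvd_batch(scores, K, b, lam_div=1.0):
    """Greedily build a batch maximizing the sum of predicted responses
    plus lam_div * log-det of the kernel submatrix (OAE-DvD surrogate).
    scores: predicted responses f_t over the n unpulled arms; K: (n, n) PSD kernel."""
    n = len(scores)
    remaining = set(range(n))
    batch = [int(np.argmax(scores))]       # first point: highest predicted response
    remaining.remove(batch[0])
    while len(batch) < b:
        best, best_val = None, -np.inf
        for a in remaining:
            idx = batch + [a]
            sign, logdet = np.linalg.slogdet(K[np.ix_(idx, idx)])
            val = scores[idx].sum() + lam_div * logdet
            if sign > 0 and val > best_val:
                best, best_val = a, val
        batch.append(best)
        remaining.remove(best)
    return batch
```

For instance, with a Gaussian kernel `K[i, j] = exp(-||a_i - a_j||^2)` over arm features, the log-determinant term penalizes picking near-duplicate arms within one batch.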

B.1.1 SEQUENTIAL BATCH SELECTION RULES

In this section we introduce OAE-Seq, a generalization of the OAE algorithm designed to produce in-batch diversity. OAE-Seq produces a query batch by solving a sequence of b optimization problems. OAE-Seq first uses D_{t,0} = D_t (the set of arms pulled so far) as well as U_{t,0} = U_t (the set of arms yet to be pulled) to produce a function f̂_{t,1} that determines the initial arm in the batch via the greedy choice a_{t,1} = argmax_{a∈U_t} f̂_{t,1}(a). The function f̂_{t,1} is computed using the same method as any vanilla OAE procedure. Having chosen this arm, in the case when b > 1, a virtual reward y_{t,1} (possibly different from f̂_{t,1}(a_{t,1})) is assigned to the query arm a_{t,1}, and the datasets D_{t,1} = D_t ∪ {(a_{t,1}, y_{t,1})} and U_{t,1} = U_t \ {a_{t,1}} are defined. The same optimization procedure that produced f̂_{t,1} is used to output f̂_{t,2}, now with D_{t,1} and U_{t,1} as inputs. Arm a_{t,2} is defined as the greedy choice a_{t,2} = argmax_{a∈U_{t,1}} f̂_{t,2}(a). The remaining batch elements (if any) are determined by successive repetition of this process, so that D_{t,i} = D_t ∪ {(a_{t,j}, y_{t,j})}_{j=1}^{i} and U_{t,i} = U_t \ {a_{t,j}}_{j=1}^{i}. The trace of this procedure leaves behind a sequence of functions and datasets {(f̂_{t,i}, U_{t,i})}_{i=1}^{b} such that a_{t,i} ∈ argmax_{a∈U_{t,i-1}} f̂_{t,i}(a).

Algorithm 3 Optimistic Arm Elimination -Batch Sequential (OAE -Seq)

Input: action set A ⊂ R^d, number of batches N, batch size b, optimism-pessimism balancing parameter α.
Initialize: unpulled arms U_1 = A; observed points and labels dataset D_1 = ∅.
for t = 1, ..., N do
if t = 1 then sample uniformly a size-b batch B_t ∼ U_1.
else solve for {(f̂_{t,i}, U_{t,i})}_{i=1}^{b} via the procedure described in the text and compute B_t = {a_{t,i} ∈ argmax_{a∈U_{t,i-1}} f̂_{t,i}(a)}_{i=1}^{b}.
Observe batch rewards Y_t = {y⋆(a) for a ∈ B_t}.
Update D_{t+1} = D_t ∪ {(B_t, Y_t)}. Update U_{t+1} = U_t \ B_t.

To determine the value of the virtual rewards y_{t,i}, we consider a variety of options. We start by discussing the case when the fake reward equals y_{t,i} = f̂_t(a_{t,i}) and the acquisition function equals A_sum(f, U). When γ_t = 0 and y⋆ ∈ F, it follows that f̂_{t,i} = f̂_t, independently of i ∈ [b], is a valid choice for the function ensemble {f̂_{t,i}}_{i=1}^{b}. In this case the query batch B_t can be computed by solving for f̂_t alone,

f̂_t ∈ argmax_{f∈F} ∑_{a∈U_t} f(a)  subject to  ∑_{(a,y)∈D_t} (f(a) - y)² = 0,

and taking B_t to be the b maximizers of f̂_t over U_t. More generally, we define optimistic and pessimistic predictors as the solutions of the constrained objectives

f̂^optimistic_{t,i} = argmax_{f∈F_{t,i}} A(f, U_t)  and  f̂^pessimistic_{t,i} = argmin_{f∈F_{t,i}} A(f, U_t),

where F_{t,i} = {f ∈ F s.t. ∑_{(a,y)∈D_{t,i-1}} (f(a) - y)² ≤ γ_t}. In both cases we define the fictitious rewards as a weighted average of the pessimistic and optimistic predictions, y_{t,i} = α f̂^optimistic_{t,i}(a_{t,i}) + (1 - α) f̂^pessimistic_{t,i}(a_{t,i}), where α ∈ [0, 1] is an optimism weighting parameter, while we keep the functions used to select the points in the batch as f̂_{t,i} = f̂^optimistic_{t,i}. The principle of adding hallucinated values to induce diversity has been proposed before, for example in the GP-BUCB algorithm of Desautels et al. (2014).
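The sequential hallucination loop of OAE-Seq can be sketched as follows, with the model-fitting routines abstracted as callables. This is a hypothetical API of our own; the α-mixture of optimistic and pessimistic predictions for the virtual reward follows the text:

```python
def oae_seq_batch(fit_opt, fit_pess, D, U, b, alpha=0.5):
    """Build one OAE-Seq batch of size b.
    fit_opt / fit_pess take the (augmented) dataset D and candidate set U and
    return a predictor arm -> score. D holds (arm, reward) pairs; U holds
    unpulled arms. Virtual rewards mix optimistic and pessimistic predictions."""
    D = list(D)                    # local copies: real observations + hallucinations
    U = list(U)
    batch = []
    for _ in range(b):
        f_opt = fit_opt(D, U)      # optimistic predictor f^optimistic_{t,i}
        f_pess = fit_pess(D, U)    # pessimistic predictor f^pessimistic_{t,i}
        a = max(U, key=f_opt)      # greedy optimistic choice a_{t,i}
        y_virtual = alpha * f_opt(a) + (1 - alpha) * f_pess(a)
        D.append((a, y_virtual))   # hallucinated reward drives in-batch diversity
        U.remove(a)
        batch.append(a)
    return batch
```

With a constant predictor the loop degenerates to picking the top-b arms; diversity only appears once the refit on hallucinated rewards changes the predictor between inner steps.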

C TRACTABLE IMPLEMENTATIONS OF OAE -DvD AND OAE -Seq

In this section we go over the algorithmic details behind the approximations that we have used when implementing the different OAE methods we have introduced in Section 4.

C.0.1 DIVERSITY VIA DETERMINANTS

The OAE-DvD algorithm differs from OAE only in the way the query batch B_t is computed: OAE-DvD uses Equation 6 instead of Equation 2. Solving for B_t is done via the greedy algorithm described in Section 4.2.1. The function f̂_t can be computed via a regularized optimization objective or using an ensemble. In our experimental evaluation we opt for defining f̂_t via the regularization route, as the result of solving Problem 3. More experimental details, including the type of kernel used, are explained in Section E.2. In our experimental evaluation we use the MeanOpt objective to produce f̂_t and refer to the resulting method as DetD.

C.0.2 SEQUENTIAL BATCH SELECTION RULES

We explore different ways of defining the functions {f̂_{t,i}}_{i=1}^{b}. Depending on the procedure used to optimize and produce these functions we obtain different versions of OAE-Seq. We use the name SeqB to denote the OAE-Seq method that fits the functions {f̂^optimistic_{t,i}}_{i=1}^{b} and {f̂^pessimistic_{t,i}}_{i=1}^{b} by solving the regularized objectives

f̂^optimistic_{t,i} ∈ argmin_{f∈F} (1/|D_{t,i-1}|) ∑_{(a,y)∈D_{t,i-1}} (f(a) - y)² - λ_reg A(f, U_{t,i-1}),
f̂^pessimistic_{t,i} ∈ argmin_{f∈F} (1/|D_{t,i-1}|) ∑_{(a,y)∈D_{t,i-1}} (f(a) - y)² + λ_reg A(f, U_{t,i-1}),

for some value of λ_reg > 0. In our experiments we set the acquisition function to A_avg. The functions {f̂_{t,i}}_{i=1}^{b} are defined as f̂_{t,i} = f̂^optimistic_{t,i} and the virtual rewards as y_{t,i} = α f̂^optimistic_{t,i}(a_{t,i}) + (1 - α) f̂^pessimistic_{t,i}(a_{t,i}), where α ∈ [0, 1] is an optimism-pessimism weighting parameter; in our experimental evaluation we use α = 1/2. More experimental details are presented in Section E.2. The functions {f̂^optimistic_{t,i}}_{i=1}^{b} and {f̂^pessimistic_{t,i}}_{i=1}^{b} can also be defined with the use of an ensemble. Borrowing the definitions of Section 4.1.2, f̂^optimistic_{t,i} = f̂^o_{t,i} + ũ_{t,i} and f̂^pessimistic_{t,i} = f̂^o_{t,i} - ũ_{t,i}, where f̂^o_{t,i} and ũ_{t,i} are computed by first fitting an ensemble of models {f̂^j_{t,i}}_{j=1}^{M} using dataset D_{t,i-1}. In our experimental evaluation we explore the use of the Ensemble and EnsembleNoiseY optimization styles to fit the models {f̂^j_{t,i}}_{j=1}^{M}, and we use the names Ensemble-SeqB and Ensemble-SeqB-NoiseY to denote the resulting sequential batch selection methods. More details of our implementation can be found in Section E.2.
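The ensemble route for the optimistic and pessimistic predictors admits a short sketch. Using the ensemble standard deviation as the uncertainty term ũ is our illustrative choice of spread; the paper's exact construction is in Section 4.1.2:

```python
import numpy as np

def ensemble_bounds(predictions):
    """Optimistic / pessimistic scores from an (M, n) array of per-model
    predictions over n candidate arms: mean prediction (f^o) plus or minus
    the ensemble spread (u), a sketch of f^optimistic = f^o + u and
    f^pessimistic = f^o - u."""
    mean = predictions.mean(axis=0)       # f^o: ensemble mean
    spread = predictions.std(axis=0)      # u: uncertainty proxy
    return mean + spread, mean - spread
```

The gap between the two outputs collapses where the ensemble members agree, so the α-weighted virtual reward reduces to the shared prediction on well-explored arms.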

D.1 QUANTIFYING THE QUERY COMPLEXITY OF F

Let ϵ ≥ 0 and define t^ϵ_opt(A, f) to be the first time-step at which an ϵ-optimal point â_t is proposed by a (possibly randomized) learner interacting with arm set A ⊂ R^d when the rewards are noiseless evaluations {f(a)}_{a∈A} with f ∈ F. We define the query complexity of the pair (A, F) as

T_ϵ(A, F) = min_{Alg} max_{f∈F} E[t^ϵ_opt(A, f)],

where the minimum ranges over all possible learning algorithms. We can lower bound the problem complexity for several simple problem classes.

Lemma D.1. Let A = {x ∈ R^d : ∥x∥ ≤ 1}. Then:
1. If F is the class of linear functions defined by vectors in the unit ball, F = {x ↦ θ⊤x : ∥θ∥ ≤ 1, θ ∈ R^d}, then T_ϵ(A, F) ≥ d when ϵ < 1/d, and T_ϵ(A, F) ≥ ⌈d - ϵd⌉ otherwise.
2. If F is the class of 1-Lipschitz functions then T_ϵ(A, F) ≥ (1/(4ϵ))^d.

Proof. As a consequence of Yao's principle, we can restrict ourselves to deterministic algorithms.

Indeed,

min_{Alg} max_{f∈F} E[t^ϵ_opt(A, f)] = max_{D_F} min_{DetAlg} E_{f∼D_F}[t^ϵ_opt(A, f)].

Thus, to prove the lower bound it is enough to exhibit a distribution D_F over instances f ∈ F and show a lower bound for the expected t^ϵ_opt(A, f), where the expectation is taken over D_F. With the objective of proving item 1, let D_F be the uniform distribution over the sphere, Unif_d(1). By symmetry it is easy to see that E_{θ∼Unif_d(r)}[θ_i] = 0 for all i ∈ [d] and all r ≥ 0. Thus Var_{θ∼Unif_d(r)}(θ_i) = E_{θ∼Unif_d(r)}[θ_i²] = r²/d, where the last equality follows because ∑_{i=1}^{d} E_{θ∼Unif_d(r)}[θ_i²] = E_{θ∼Unif_d(r)}[∥θ∥²] = r² and because, by symmetry, the second moments agree: E_{θ∼Unif_d(r)}[θ_i²] = E_{θ∼Unif_d(r)}[θ_j²] for all i, j ∈ [d]. Finally,

E_{θ∼Unif_d(r)}[|θ_i|] ≤ √(E_{θ∼Unif_d(r)}[θ_i²]) = r/√d.   (11)

Let DetAlg be the optimal deterministic algorithm for D_F and let a_1 be its first action. Since D_F is the uniform distribution over the sphere, by inequality 11 the expected scale of the reward experienced is upper bounded by 1/√d; furthermore, since ∥a_1∥ = 1, the expected second moment of the reward experienced (where expectations are taken over D_F) equals 1/d. We now employ a conditioning argument. Assume that up to time m algorithm DetAlg has played actions a_1, ..., a_m and received rewards r_1, ..., r_m. Given these outcomes, DetAlg can recover the component of θ lying in span(a_1, ..., a_m). Let a_{m+1} be DetAlg's action at time m+1. By assumption this is a deterministic function of a_1, ..., a_m and r_1, ..., r_m. Since θ is drawn from Unif_d(1), the component a^⊥_{m+1} = Proj(a_{m+1}, span(a_1, ..., a_m)^⊥) satisfies

E_{θ∼D_F | {a_i⊤θ = r_i}_{i=1}^{m}}[(θ⊤a^⊥_{m+1})²] = ((1 - ∥θ^0_m∥²)/(d - m)) (1 - ∥a^0_{m+1}∥²) ≤ (1 - ∥θ^0_m∥²)/(d - m),   (12)

where θ^0_m = Proj(θ, span(a_1, ..., a_m)).
The last inequality follows because the conditional distribution of Proj(θ, span(a_1, ..., a_m)^⊥) given a_1, ..., a_m and r_1, ..., r_m is a uniform distribution over the (d - m)-dimensional sphere of radius √(1 - ∥θ^0_m∥²), the scale of a^⊥_{m+1} is √(1 - ∥a^0_{m+1}∥²), and we have assumed ∥a^0_{m+1}∥ = 0. Thus the agreement of a^⊥_{m+1} with Proj(θ, span(a_1, ..., a_m)^⊥) satisfies Equation 10. We now consider the expected square norm of the recovered component of θ up to time m. This is the random variable ∥θ^0_m∥² = ∑_{t=1}^{m} (θ⊤a^⊥_t)², where a^⊥_t = Proj(a_t, span(a_1, ..., a_{t-1})^⊥). Thus,

E_{θ∼D_F}[∥θ^0_m∥²] = E_{θ∼D_F}[∑_{t=1}^{m} (θ⊤a^⊥_t)²] = E_{θ∼D_F}[∑_{t=1}^{m} E_{θ∼D_F | {a_i⊤θ = r_i}_{i=1}^{t-1}}[(θ⊤a^⊥_t)²]] (i)= E_{θ∼D_F}[∑_{t=1}^{m} (1 - ∥θ^0_{t-1}∥²)/(d - t + 1)],

where equality (i) holds because of 12. Recall that by Equation 10, E_{θ∼D_F}[∥θ^0_1∥²] = 1/d. Thus, by the above equalities, E_{θ∼D_F}[∥θ^0_2∥²] = 1/d + (1 - 1/d)/(d - 1) = 2/d. Unrolling these equalities further, we conclude that E_{θ∼D_F}[∥θ^0_m∥²] = m/d. This implies the expected square agreement between the learner's guess â_t and θ is upper bounded by m/d. Thus, when ϵ < 1/d, the expected number of queries required is at least d; otherwise the expected number of queries satisfies a lower bound of ⌈d - ϵd⌉.

We now shift our attention to Lipschitz functions. First we introduce a simple construction of a 1-Lipschitz function supported on a small ball, which we use throughout the proof. Let x ∈ R^d be an arbitrary vector, define B(x, 2ϵ) as the ball of radius 2ϵ centered at x under the ∥·∥₂ norm, and S(x, 2ϵ) as the sphere (the surface of B(x, 2ϵ)) centered at x. Define the function f^ϵ_x : R^d → R as

f^ϵ_x(z) = min_{z′∈S(x,2ϵ)} ∥z - z′∥₂ if z ∈ B(x, 2ϵ), and 0 otherwise.

It is easy to see that f^ϵ_x is 1-Lipschitz. We consider three different cases. 1. If z_1, z_2 ∈ B(x, 2ϵ)^c then |f^ϵ_x(z_1) - f^ϵ_x(z_2)| = 0 ≤ ∥z_1 - z_2∥. The result follows. 2. If z_1 ∈ B(x, 2ϵ) but z_2 ∈ B(x, 2ϵ)^c:
Let z_3 be the intersection point of the line going from z_1 to z_2 with S(x, 2ϵ). Then |f^ϵ_x(z_1) - f^ϵ_x(z_2)| = min_{z′∈S(x,2ϵ)} ∥z_1 - z′∥₂ ≤ ∥z_1 - z_3∥₂ ≤ ∥z_1 - z_2∥₂. 3. If z_1, z_2 ∈ B(x, 2ϵ), it is easy to see that |f^ϵ_x(z_1) - f^ϵ_x(z_2)| = |∥z_1 - x∥₂ - ∥z_2 - x∥₂|.

And therefore by the triangle inequality applied to

x, z_1, z_2, that |∥z_1 - x∥₂ - ∥z_2 - x∥₂| ≤ ∥z_1 - z_2∥₂. The result follows.

Let N(B(0, 1), 2ϵ) be a 2ϵ-packing of the unit ball. For simplicity we use the notation N_{2ϵ} = |N(B(0, 1), 2ϵ)|. Define the set of functions F_ϵ = {f^ϵ_x for all x ∈ N(B(0, 1), 2ϵ)} and define D_F as the uniform distribution over F_ϵ ⊂ F. Similarly to the case when F is the set of linear functions, we make use of Yao's principle. Let DetAlg be an optimal deterministic algorithm for D_F, let a_i be DetAlg's i-th query point and r_i the i-th reward it receives. If the ground truth is f^ϵ_x and the algorithm never samples a query point from inside B(x, 2ϵ), it receives only rewards of 0 and thus cannot have found an ϵ-optimal point. Thus t_opt(f^ϵ_x) is at least the first time an arm in B(x, 2ϵ) is pulled. As a consequence of this fact, E_{f^ϵ_x∼D_F}[1(a_1 ∈ B(x, 2ϵ))] ≤ 1/N_{2ϵ}, and hence E_{f^ϵ_x∼D_F}[1(a_1 ∉ B(x, 2ϵ))] ≥ (N_{2ϵ} - 1)/N_{2ϵ}. Therefore,

E[t^ϵ_opt] ≥ ∑_{ℓ=1}^{N_{2ϵ}} (N_{2ϵ} - ℓ)/(N_{2ϵ} - ℓ + 1) ≥ ∑_{ℓ=1}^{N_{2ϵ}/2} (N_{2ϵ} - ℓ)/(N_{2ϵ} - ℓ + 1) ≥ N_{2ϵ}/4.

Since N_{2ϵ} (i)≥ Covering(B(0, 1), 4ϵ) (ii)≥ (1/(4ϵ))^d, where inequality (i) is a consequence of Lemma 5.5 and inequality (ii) of Lemma 5.7 in Wainwright (2019), the result follows.

Translating to quantile optimality. The results of Lemma D.1 can be interpreted in the language of quantile optimality by imposing a uniform measure over the sphere. In this case ϵ-optimality is (approximately) equivalent to a 1 - 2^{-d(1-ϵ)²} quantile. The results of Lemma D.1 hold regardless of the batch size b. It is thus impossible to design an algorithm that can single out an ϵ-optimal arm in fewer than T_ϵ(A, F) queries for all problems defined by the pair (A, F) simultaneously.

D.2 OPTIMISM AND ITS PROPERTIES

The objective of this section is to prove Theorem 4.1, which we restate for the reader's convenience.

Theorem 4.1. The regret of OAE with the A_{max,b} acquisition function satisfies

Reg_{F,b}(T) = O( dim_E(F, α_T) b + ω √(dim_E(F, α_T)) T b ),

with α_t = max(1/t², inf{∥f_1 - f_2∥_∞ : f_1, f_2 ∈ F, f_1 ≠ f_2}) and C = max_{f∈F, a,a′∈A} |f(a) - f(a′)|.

Let's start by defining the ϵ-Eluder dimension, a complexity measure introduced by Russo & Van Roy (2013) to analyze optimistic algorithms. Throughout this section we use the notation ∥g∥²_A = ∑_{a∈A} g²(a) to denote the data norm of a function g : A → R.

Definition D.2. Let ϵ ≥ 0 and let Z = {a_i}_{i=1}^{n} ⊂ A be a sequence of arms.
1. An action a is ϵ-dependent on Z with respect to F if any f, f′ ∈ F satisfying ∑_{i=1}^{n} (f(a_i) - f′(a_i))² ≤ ϵ² also satisfies |f(a) - f′(a)| ≤ ϵ.
2. An action a is ϵ-independent of Z with respect to F if a is not ϵ-dependent on Z.
3. The ϵ-Eluder dimension dim_E(F, ϵ) of a function class F is the length of the longest sequence of elements in A such that, for some ϵ′ ≥ ϵ, every element is ϵ′-independent of its predecessors.

Lemma D.3. Suppose there exists ỹ⋆ ∈ F with ∥ỹ⋆ - y⋆∥_∞ ≤ ω. Then for all t, ∑_{a∈B_t} f̂_t(a) ≥ ∑_{a∈B_{⋆,t}} y⋆(a) - bω, where B_{⋆,t} ∈ argmax_{B⊂U_t, |B|=b} ∑_{a∈B} y⋆(a).

Proof. Let F_t be the subset of F with f ∈ F_t if ∑_{(x,y)∈D_t} (f(x) - y)² ≤ (t - 1)bω² ≤ γ_t. By definition ỹ⋆ ∈ F_t. Substituting the definition of A_{max,b} into Equation 1,

f̂_t ∈ argmax_{f∈F_t} max_{B⊂U_t, |B|=b} ∑_{a∈B} f(a).

Since ỹ⋆ ∈ F_t,

max_{f∈F_t, B⊂U_t, |B|=b} ∑_{a∈B} f(a) = ∑_{a∈B_t} f̂_t(a) ≥ ∑_{a∈B̃_{⋆,t}} ỹ⋆(a) ≥ ∑_{a∈B_{⋆,t}} ỹ⋆(a),

where B̃_{⋆,t} ∈ argmax_{B⊂U_t, |B|=b} ∑_{a∈B} ỹ⋆(a) and B_{⋆,t} ∈ argmax_{B⊂U_t, |B|=b} ∑_{a∈B} y⋆(a). Finally, for all a ∈ A we have ỹ⋆(a) + ω ≥ y⋆(a). This finalizes the proof.

Since ỹ⋆(a) + ω ≥ y⋆(a) for all a ∈ A, Lemma D.3 implies

∑_{ℓ=1}^{t} [ ∑_{a∈B_{⋆,ℓ}} y⋆(a) - ∑_{a∈B_ℓ} y⋆(a) ] ≤ 2btω + ∑_{ℓ=1}^{t} ∑_{a∈B_ℓ} (f̂_ℓ(a) - ỹ⋆(a)).   (13)

We define the width of a subset F̃ ⊆ F at an action a ∈ A by r_{F̃}(a) = sup_{f,f′∈F̃} |f(a) - f′(a)|, and use the shorthand notation r_t(a) to denote r_{F_t}(a) where F_t = {f ∈ F s.t.
∑_{(a,y)∈D_t} (f(a) - y)² ≤ γ_t}. Equation 13 implies

∑_{ℓ=1}^{t} [ ∑_{a∈B_{⋆,ℓ}} y⋆(a) - ∑_{a∈B_ℓ} y⋆(a) ] ≤ 2btω + ∑_{ℓ=1}^{t} ∑_{a∈B_ℓ} r_ℓ(a).   (14)

In order to bound the contribution of the sum ∑_{ℓ=1}^{t} ∑_{a∈B_ℓ} r_ℓ(a) we use a similar technique as in Russo & Van Roy (2013). First we prove a generalization of Proposition 3 of Russo & Van Roy (2013) to the case of parallel feedback.

Proposition D.4. If {γ_t ≥ 0 | t ∈ N} is a nondecreasing sequence and F_t = {f ∈ F : ∑_{(a,y)∈D_t} (f(a) - y)² ≤ γ_t}, then

∑_{t=1}^{T} ∑_{a∈B_t} 1(r_t(a) > ϵ) ≤ O( γ_T d/ϵ² + bd ),

where d = dim_E(F, ϵ).

Proof. We start by upper bounding the number of disjoint sequences in D_{t-1} on which an action a ∈ B_t with r_t(a) > ϵ can be ϵ-dependent. If r_t(a) > ϵ there exist f̄, f̲ ∈ F_t such that f̄(a) - f̲(a) > ϵ. By definition, if a ∈ B_t is ϵ-dependent on a sequence S ⊆ ∪_{ℓ=1}^{t-1} B_ℓ = D_t then ∥f̄ - f̲∥²_S > ϵ² (since otherwise ∥f̄ - f̲∥²_S ≤ ϵ² together with ϵ-dependence would force |f̄(a) - f̲(a)| ≤ ϵ). Thus, if a is ϵ-dependent on K disjoint sequences S_1, ..., S_K ⊆ D_t, then ∥f̄ - f̲∥²_{D_t} ≥ Kϵ². By the triangle inequality, ∥f̄ - f̲∥_{D_t} ≤ ∥f̄ - ỹ⋆∥_{D_t} + ∥f̲ - ỹ⋆∥_{D_t} ≤ 2√γ_t. Thus it follows that ϵ√K < 2√γ_t and therefore K ≤ 4γ_t/ϵ².

Next we prove a lower bound on K. In order to do so we prove a slightly more general statement. Consider a batched sequence of arms {ā_{ℓ,i}}_{i∈[b], ℓ∈[τ]}, where for the sake of the argument ā_{ℓ,i} is not necessarily a_{ℓ,i}. We use the notation B̄_ℓ = {ā_{ℓ,i}}_{i∈[b]} to denote the ℓ-th batch and D̄_t = ∪_{ℓ=1}^{t-1} B̄_ℓ. Let τ ∈ N and define K̄ as the largest integer such that K̄d + b ≤ τb. We show there is a batch number ℓ′ ≤ τ and an in-batch index i′ such that ā_{ℓ′,i′} is ϵ-dependent on at least K̄/2 of a set of K̄ disjoint sequences S̄_1, ..., S̄_{K̄} ⊆ D̄_{ℓ′-1}. We start building the S̄_i sequences by setting S̄_i to be the i-th element of {ā_{ℓ,i}}_{i∈[b], ℓ∈[τ]} in lexicographic order.
This involves elements of up to batch B̄_{⌈K̄/b⌉}. Since we will apply the same argument recursively, denote the 'current' batch index in the construction of S̄_1, ..., S̄_{K̄} by l; that is, we set S̄_1, ..., S̄_{K̄} ⊆ D̄_l. At the start of the argument l = ⌈K̄/b⌉. If there is an arm a ∈ B̄_{l+1} such that a is ϵ-dependent on at least K̄/2 of the sequences {S̄_i}_{i=1}^{K̄}, the result follows. Otherwise, it must be the case that for every a ∈ B̄_{l+1} there are at least K̄/2 sequences in {S̄_i}_{i=1}^{K̄} on which a is ϵ-independent. Consider a bipartite graph with node sets B̄_{l+1} and {S̄_i}_{i=1}^{K̄}, drawing an edge between a ∈ B̄_{l+1} and S̄_j if a is ϵ-independent of S̄_j. If for all a ∈ B̄_{l+1} there are at least K̄/2 sequences on which a is ϵ-independent, Lemma D.5 implies the existence of a matching of size at least min(K̄/8, b) between the elements of B̄_{l+1} and the sequences {S̄_i}_{i=1}^{K̄}. The case min(K̄/8, b) = K̄/8 can only occur when (τ-1)b/(8d) ≤ b, and therefore when τb ≤ 8bd + b. Consider now the case min(K̄/8, b) = b. If we reach l = τ, it must be the case that at least (τ-1)b points were accommodated into the K̄ sequences. By definition of K̄, each subsequence S̄_i satisfies |S̄_i| ≥ d; since each element of subsequence S̄_i is ϵ-independent of its predecessors, |S̄_i| = d. In this case we conclude there is an element of B̄_τ that is ϵ-dependent on at least K̄/2 ≥ (τ-1)b/(2d) subsequences. If τb ≥ 8γ_τ d/ϵ² + 2d + b then K̄/2 ≥ 4γ_τ/ϵ² + 1. Combining the results of the last two paragraphs, we conclude that if τb ≥ 8γ_τ d/ϵ² + 2d + 2b + 8bd then there is a batch index ℓ′ ∈ [τ] and an arm ā_{ℓ′,i} ∈ B̄_{ℓ′} that is ϵ-dependent on at least K̄/2 ≥ 4γ_τ/ϵ² + 1 disjoint sequences contained in D̄_{ℓ′}. We now apply this result to the sequence of arms a_{t,i} for i ∈ [b] and t such that r_t(a_{t,i}) > ϵ.
An immediate consequence of the previous results is that if ∑_{ℓ=1}^{t} ∑_{a∈B_ℓ} 1(r_ℓ(a) > ϵ) ≥ 8γ_t d/ϵ² + 2d + 2b + 8bd, there must exist an arm in B_t that is ϵ-dependent on at least 4γ_t/ϵ² + 1 disjoint sequences of D_t. By the first part of the proof this is impossible. Thus,

∑_{ℓ=1}^{t} ∑_{a∈B_ℓ} 1(r_ℓ(a) > ϵ) ≤ 8γ_t d/ϵ² + 2d + 2b + 8bd ≤ 8γ_t d/ϵ² + 12bd.

The result follows.

Finally, the RHS of Equation 14 can be upper bounded using a modified version of Lemma 2 of Russo & Van Roy (2013) (Lemma D.6 below), yielding

∑_{ℓ=1}^{T} [ ∑_{a∈B_{⋆,ℓ}} y⋆(a) - ∑_{a∈B_ℓ} y⋆(a) ] ≤ 1/T + O( min(dim_E(F, α_T) b, T) C + √(dim_E(F, α_T)) ω T ),

where α_t = max(1/t², inf{∥f_1 - f_2∥_∞ : f_1, f_2 ∈ F, f_1 ≠ f_2}) and C = max_{f∈F, a,a′∈A} |f(a) - f(a′)|. The quantity on the left is known as regret. This result implies the regret is bounded by a quantity that grows linearly with ω, the amount of misspecification, but otherwise only at the scale of dim_E(F, α_T) b. Our result is not equivalent to splitting the datapoints into b parts and adding b independent upper bounds: the resulting upper bound in the latter case would contain a term of the form O(b √(dim_E(F, α_T)) ωT), whereas in our analysis this term does not depend on b. When ω = 0, the regret is upper bounded by O(dim_E(F, α_T) b). For example, in the case of linear models, the authors of Russo & Van Roy (2013) show dim_E(F, α_T) = O(d log(1/ϵ)). This shows, for example, that sequential OAE with b = 1 achieves the lower bound of Lemma D.1 up to logarithmic factors. In the setting of linear models, the b dependence in the rate above is unimprovable by vanilla OAE without diversity-aware sample selection: an optimistic algorithm may choose to use all samples in each batch to explore a single unexplored one-dimensional direction. Theoretical analysis for OAE-DvD and OAE-Seq is left for future work.

Lemma D.5. Consider a bipartite graph with node sets A and B, |B| = b, where every node of B has at least K/2 neighbors in A. 1. If K/2 ≥ b there exists a matching saturating B. 2. If instead K/2 < b there exists a subset of K/8 nodes in A with a perfect matching to B.
Proof. The first item follows immediately from Hall's marriage theorem. Notice that in this case the neighboring set of any subset of nodes of B has cardinality at least K/2 and therefore at least the size of B; the conditions of Hall's theorem are satisfied, implying the existence of a matching saturating B.

We now prove Lemma D.6, our modified version of Lemma 2 of Russo & Van Roy (2013). Let r_{i_1} ≥ r_{i_2} ≥ ⋯ ≥ r_{i_{Tb}} denote the widths {r_t(a)}_{t∈[T], a∈B_t} arranged in decreasing order. By Proposition D.4, there is a constant c > 0 such that r_{i_j} ≤ O( √(γ_T d / (j - cbd)_+) ). For all j ≤ cbd, ∑_{j=1}^{cbd} r_{i_j} = O(bd), since the widths r_{i_j} are all of constant size. We conclude that

∑_{j=1}^{Tb} r_{i_j} ≤ O( bd + ∑_{j=cbd+1}^{Tb} √(γ_T d / (j - cbd)) ) ≤ O( bd + √(γ_T d T b) ).

Substituting γ_T = ω² T b, we conclude that ∑_{j=1}^{Tb} r_{i_j} = O( bd + ω √d T b ), thus finalizing the result. Combining the result of Lemma D.6 and Equation 14 finalizes the proof of Theorem 4.1.
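For small finite instances, the ϵ-Eluder dimension of Definition D.2 can be explored numerically. The sketch below is our own illustration, not an algorithm from the paper: it greedily grows a sequence of ϵ-independent actions, whose length lower bounds dim_E(F, ϵ):

```python
def eps_dependent(a, Z, F, eps):
    """a is eps-dependent on the sequence Z w.r.t. the finite class F
    (functions given as dicts arm -> value) if every pair f, f' that is
    close on Z (sum of squared gaps <= eps**2) is also close at a."""
    for f in F:
        for g in F:
            close_on_Z = sum((f[z] - g[z]) ** 2 for z in Z) <= eps ** 2
            if close_on_Z and abs(f[a] - g[a]) > eps:
                return False
    return True

def greedy_eluder_lower_bound(A, F, eps):
    """Greedily grow a sequence of eps-independent actions; its length
    lower bounds dim_E(F, eps) on finite instances."""
    seq = []
    for a in A:
        if not eps_dependent(a, seq, F, eps):
            seq.append(a)
    return len(seq)
```

For instance, with two arms and three functions that disagree on each arm separately, both arms are 0.5-independent of their predecessors, so the greedy sequence has length 2.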

D.3 NOISY RESPONSES

In this section we describe how our results imply improved bounds for the setting with noisy responses. In this case we assume that y_{t,i} = y⋆(a_{t,i}) + ξ_{t,i}, where ξ_{t,i} is conditionally zero-mean. We consider the following optimistic least squares algorithm: at every round t, solve for f̂_t and compute B_t ∈ argmax_{B⊂U_t, |B|=b} A(f̂_t, B); observe the (noisy) batch rewards Y_t = {y_t(a) for a ∈ B_t}; update D_{t+1} = D_t ∪ {(B_t, Y_t)} and U_{t+1} = U_t \ B_t.

E EXPERIMENTS

We demonstrate the effectiveness of our OAE algorithm in several problem settings across public and synthetic datasets, evaluating the algorithmic implementations described above.

Small vs. Large Batch Regimes. Oftentimes the large-batch, small-number-of-iterations regime is the most interesting scenario from a practical perspective (Hanna & Doench, 2020). In scientific settings like pooled genetic perturbation screens, each experiment may take a long time (many weeks or months) to conclude, but it is possible to conduct a batch of experiments in parallel. We study this regime in Section 5.

We compare these algorithms against each other and against the baseline method that selects B_t as a uniformly random batch of size b from the set U_t, henceforth referred to as RandomBatch. We conduct experiments on three kinds of datasets. First, in Section E.1.1 we capture the behavior of OAE on a set of synthetic one-dimensional datasets specifically designed to showcase different landscapes for y⋆, ranging from unimodal to multimodal with missing values. In Section E.1.2 we conduct similar experiments on public datasets from the UCI database (Dua & Graff, 2017). In both Sections E.1.1 and E.1.2, all of our experiments use a batch size of 3, a time horizon of N = 150, and two types of network architectures. In Section 5 we consider the setting in which we have observed the effect of knockouts in one biological context (i.e., cell line) and would like to use it to plan a series of knockout experiments in another. We test OAE in this context by showing the effectiveness of the MeanOpt, HingePNorm, MaxOpt, Ensemble and EnsembleNoiseY methods in successfully leveraging the learned features from a source cell line in the optimization of a particular cellular proliferation phenotype for several target cell lines.
All of our vanilla OAE methods show that better expressivity of the underlying model class F allows for better performance (as measured by the number of trials required to find a response within a particular response quantile of the optimum). Low-capacity models (in our experiments, ReLU neural networks with two hidden layers of sizes 10 and 5) have a harder time competing against RandomBatch than larger ones (ReLU neural networks with two hidden layers of sizes 100 and 10). We also present results for linear and very high capacity models (two hidden layers of sizes 300 and 100).

E.1.1 SYNTHETIC ONE DIMENSIONAL DATASETS

Figure 7 shows the different one-dimensional synthetic datasets used to validate our methods. The leftmost, the OneHillValleyOneDim dataset, consists of 1000 arms uniformly sampled from the interval [-10, 10]. The responses y are unimodal; the learner's goal is to find the arm with x-coordinate equal to 3, as it achieves the largest response. We use the dataset OneHillValleyOneDimHole to test OAE's ability to find the maximum when the surrounding points are not present in the dataset. In all cases, the high-optimism versions of MeanOpt perform substantially better than RandomBatch. In both multimodal datasets, Greedy underperforms relative to the versions of MeanOpt with λ_reg > 0; this points to the usefulness of optimism when facing multimodal optimization landscapes. We also note that NN 10-5 is the best-performing architecture for MeanOpt with λ_reg = 0.001, despite the regression loss of fitting MultiValleyOneDim's responses with a NN 10-5 architecture not reaching zero (see Figure 13). This indicates that the function class F need not contain y⋆ for MeanOpt to achieve good performance. Moreover, it also indicates that higher-capacity models, despite achieving a better regression fit, may not perform better than lower-capacity ones at the task of finding a good-performing arm. We leave the task of designing smart strategies to select the optimal network architecture for future work; it suffices to note that all of the architectures used in our experimental evaluation performed better than more naive strategies such as RandomBatch.

Figure 10: NN 300-100. τ-quantile batch convergence on the suite of synthetic datasets when the network architecture has two hidden layers of sizes 300 and 100.

E.1.2 PUBLIC DATASETS

We test our methods on public binary classification (Adult, Bank) and regression (BikeSharingDay, BikeSharingHour, BlogFeedback) datasets from the UCI repository (Dua & Graff, 2017).
In our implementation, the versions of the UCI datasets we use have the following characteristics. Due to our internal data processing, which splits the data into train, test and validation sets, the number of datapoints we consider may differ from the size of the public versions. Our code converts the datasets' categorical attributes into numerical ones using one-hot encodings, which explains the discrepancy between the number of attributes listed in the public descriptions of these datasets (see https://archive.ics.uci.edu/ml/index.php) and ours. The BikeSharingDay dataset consists of 658 datapoints, each with 13 attributes. The BikeSharingHour dataset consists of 15642 datapoints, each with 13 attributes. The BlogFeedback dataset consists of 52396 datapoints, each with 280 attributes. The Adult dataset consists of 32561 datapoints, each with 119 attributes. The Bank dataset (Moro et al., 2014) consists of 32950 datapoints, each with 60 attributes. To evaluate our algorithm we assume the response (regression target or binary classification label) values are noiseless. We consider each observation i in a dataset to represent a discrete action with features a_i and reward y⋆(a_i) given by the response. In all of our experiments we use a batch size of 3 and evaluate over 25 independent runs, each with 150 batches. We first use all 5 public datasets to test OAE in the setting where the true response function is known to belong to the function class F (in this case, a neural network) that OAE learns over the course of the batches. We train a neural network with two hidden layers under a simple mean squared error regression fit to the binary responses (for the binary classification datasets) or real-valued responses (for the regression datasets).
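As an illustration of this regression step, the following is a minimal numpy sketch of fitting a two-hidden-layer ReLU network (sizes 10 and 5, matching the NN 10-5 class) by full-batch gradient descent on a mean squared error objective. The training hyperparameters are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def fit_two_layer_relu(X, y, h1=10, h2=5, lr=1e-2, steps=500, seed=0):
    """Fit a ReLU network with hidden sizes (h1, h2) to responses y by
    full-batch gradient descent on the mean squared error; returns the
    per-step training losses."""
    rng = np.random.default_rng(seed)
    d, n = X.shape[1], len(X)
    W1 = rng.normal(0, 1 / np.sqrt(d), (d, h1)); b1 = np.zeros(h1)
    W2 = rng.normal(0, 1 / np.sqrt(h1), (h1, h2)); b2 = np.zeros(h2)
    W3 = rng.normal(0, 1 / np.sqrt(h2), (h2, 1)); b3 = np.zeros(1)
    losses = []
    for _ in range(steps):
        Z1 = X @ W1 + b1; A1 = np.maximum(Z1, 0)      # first hidden layer
        Z2 = A1 @ W2 + b2; A2 = np.maximum(Z2, 0)     # second hidden layer
        pred = (A2 @ W3 + b3).ravel()
        err = pred - y
        losses.append(float(np.mean(err ** 2)))
        # Backpropagation of the mean squared error.
        g = (2 * err / n)[:, None]
        gW3 = A2.T @ g; gb3 = g.sum(0)
        g2 = (g @ W3.T) * (Z2 > 0)
        gW2 = A1.T @ g2; gb2 = g2.sum(0)
        g1 = (g2 @ W2.T) * (Z1 > 0)
        gW1 = X.T @ g1; gb1 = g1.sum(0)
        W3 -= lr * gW3; b3 -= lr * gb3
        W2 -= lr * gW2; b2 -= lr * gb2
        W1 -= lr * gW1; b1 -= lr * gb1
    return losses
```

Tracking the loss trajectory mirrors the regression-fit curves discussed above, where a class too small to fit the responses plateaus at a nonzero loss.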
In Figure 15 we present results where the neural network layers have sizes 10 and 5 and the responses are fit to the Adult classification dataset and the BikeSharingDay regression dataset. In Appendix G.2 and Figure 32 we present results where we fit a two-hidden-layer neural network of sizes 100 and 10 to the responses of the Bank classification dataset and the BikeSharingHour regression dataset. In each case we train the fitted responses on the provided datasets using 5,000 batch gradient descent steps. We use the same experimental parameters and comparison algorithms as in the synthetic dataset experiments. Figure 15 shows the results on the binary classification Adult and regression BikeSharingDay datasets using these fitted responses. We observe that the MeanOpt algorithm handily outperforms RandomBatch on both datasets; Appendix G.2 shows similar results for BikeSharingHour and Bank. We also compare the performance of MeanOpt, Ensemble and EnsembleNoiseY on the Adult and BikeSharingDay datasets. In both cases, the ensemble methods achieve better performance than MeanOpt. It remains an exciting open question whether these observations translate into a general advantage for ensemble methods in the case when y⋆ ∈ F. Given OAE's strong performance when the true and learned reward functions are members of the same function class F, we next explore the performance when they are not necessarily in the same class by revisiting the problem on the regression datasets using their original, real-world responses. Figure 14 shows results for the BikeSharingDay, BikeSharingHour and BlogFeedback datasets. In this case Ensemble outperforms RandomBatch in τ-quantile convergence time. In all of these plots we observe that high-optimism approaches underperform compared with low-optimism ones; Ensemble and Greedy (the degenerate λ_reg = 0 version of MeanOpt) achieve the best performance across all three datasets.
We observe the same phenomenon even when F is a class of linear functions (see Figure 18). Just as in the suite of synthetic datasets, setting F to be a class of linear models lets MeanOpt achieve substantial performance gains over RandomBatch (see Figure 17), despite its regression fit loss never reaching zero (see Figure 16). In Appendix G.3, Figure 33, we compare the performance of MaxOpt and HingeOpt against RandomBatch when F is a class of neural networks with hidden layers of sizes 100 and 10. The performance of MaxOpt, although it beats RandomBatch, is suboptimal in comparison with MeanOpt; in contrast, HingeOpt performs similarly to MeanOpt.

E.2 EXPERIMENTS WITH DIVERSITY SEEKING OBJECTIVES

In this section we explore how diversity-inducing objectives can sometimes result in better-performing variants of OAE. We implement and test DetD, SeqB, Ensemble-SeqB and Ensemble-SeqB-NoiseY.

E.2.2 OAE-Seq

In this section we present our experimental evaluation of the three tractable implementations of OAE-Seq described in Section C.0.2. In our experiments we set the optimism-pessimism weighting parameter α = 1/2 and the acquisition function to A_avg. We are primarily concerned with answering whether 'augmenting' the MeanOpt, Ensemble and EnsembleNoiseY methods with a sequential in-batch selection mechanism leads to improved performance for OAE. We answer this question in the affirmative. Our experimental results show that, across datasets and neural network architectures, adding in-batch sequential optimism either improves or leads to no substantial degradation in the number of batches OAE requires to arrive at a good arm (see Figure 27). Finally, we observe that the performance of OAE-Seq either did not degrade or slightly improved upon that of MeanOpt, Ensemble and EnsembleNoiseY on the BlogFeedback and genetic perturbation datasets (see Figures 34, 25 and 38). We conclude that incorporating a sequential batch decision rule, although it may be computationally expensive, is a desirable strategy to adopt. It is interesting to note that, in contrast with DetD, the diversity induced by SeqB did not alleviate the subpar performance of MeanOpt on the genetic perturbation datasets. This can be explained by the fact that SeqB induces query diversity by fitting virtual responses and may therefore be limited by the expressiveness of F. DetD instead injects diversity by using the geometry of the arm space and may bypass the limitations of exploration strategies induced by F. Unfortunately, DetD may result in suboptimal performance when y⋆ ∈ F.

F CELLULAR PROLIFERATION PHENOTYPE

Let G be the list of genes present in CMAP that are also associated with the proliferation phenotype according to the Seurat cell cycle signature, and let x_{i,g} represent the level 5 gene expression of perturbation a_i for gene g ∈ G. We define the proliferation reward for perturbation a_i as the average expression of the genes in G,

$$f^{\text{prolif}}_*(a_i) = \frac{1}{|G|} \sum_{g \in G} x_{i,g}.$$

For convenience, G = {AURKB, BIRC5, CCNB2, CCNE2, CDC20, CDC45, CDK1, CENPE, GMNN, KIF2C, LBR, NCAPD2, NUSAP1, PCNA, PSRC1, RFC2, RPA2, SMC4, STMN1, TOP2A, UBE2C, UBR7, USP1}.
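The proliferation reward above is a simple column average over the signature genes. A minimal sketch (the expression matrix layout and gene-name list are illustrative assumptions, not the CMAP file format):

```python
import numpy as np

# The Seurat cell-cycle signature genes used to define the reward.
PROLIF_GENES = ["AURKB", "BIRC5", "CCNB2", "CCNE2", "CDC20", "CDC45",
                "CDK1", "CENPE", "GMNN", "KIF2C", "LBR", "NCAPD2",
                "NUSAP1", "PCNA", "PSRC1", "RFC2", "RPA2", "SMC4",
                "STMN1", "TOP2A", "UBE2C", "UBR7", "USP1"]

def proliferation_reward(expression, gene_names, signature=PROLIF_GENES):
    """Average expression over the proliferation signature genes.
    `expression` is an (n_perturbations, n_genes) array and `gene_names`
    gives the column order; signature genes absent from the data are skipped."""
    cols = [gene_names.index(g) for g in signature if g in gene_names]
    return expression[:, cols].mean(axis=1)
```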



Footnotes: In the case of an infinite set A, quantile optimality is defined with respect to a measure over A. The batch equals U_t when |U_t| ≤ b. The shRNA perturbations are just a subset of the 1M+ total perturbations across different perturbation classes. https://github.com/genedisco/genedisco-starter. As we have pointed out in Section 4.2.2, setting λ_reg = 0 reduces OAE-Seq to vanilla OAE.



dim_E(F, α_T) b. The misspecification part of the regret scales at the same rate as a sequential algorithm running Tb batches of size 1. When ω = 0, the regret is upper bounded by O(dim_E(F, α_T) b). For example, in the case of linear models, Russo & Van Roy (2013) show dim_E(F, ϵ) = O(d log(1/ϵ)). This shows that, for example, sequential OAE with b = 1 achieves the lower bound of Lemma D.1 up to logarithmic factors. In the setting of linear models, the b dependence in the rate above is unimprovable by vanilla OAE without diversity-aware sample selection. This is because an optimistic algorithm may choose to use all samples in each batch to explore a single unexplored one-dimensional direction. Theoretical analysis of OAE-DvD and OAE-Seq is left for future work. In Appendix D.1 we also show lower bounds on the query complexity for linear and Lipschitz classes.
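For concreteness, substituting the linear-class Eluder dimension into the noiseless (ω = 0) bound gives the following sketch of the resulting rate (constants and logarithmic factors follow the discussion above):

```latex
\underbrace{\mathrm{Regret}(T) \le O\big(\dim_E(\mathcal{F}, \alpha_T)\, b\big)}_{\text{noiseless bound}}
\quad\text{with}\quad
\dim_E(\mathcal{F}_{\mathrm{lin}}, \epsilon) = O\big(d \log(1/\epsilon)\big)
\;\Longrightarrow\;
\mathrm{Regret}(T) \le O\big(b\, d \log(1/\alpha_T)\big).
```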

Figure 3: Linear. τ -quantile batch convergence of MeanOpt on genetic perturbation datasets.

Figure 2: NN 100-10 τ -quantile batch convergence on genetic perturbation datasets of MeanOpt (left) and Ensemble, EnsembleNoiseY (right).

Figure 4: Training set regression fit curves evolution over training for the suite of gene perturbation datasets. Each training batch contains 10 datapoints.

Figure 6: Neural design for genetic perturbation experiments. (a) Learn a perturbation action embedding space by training an autoencoder on the gene expression resulting from a large set of observed genetic perturbations in a particular biological context (e.g., shRNA gene knockouts for a particular cell line in CMAP). (b) Select an initial batch of B perturbation actions to perform in parallel within a related (but different) biological context. Selection can be random (uniform) or influenced by prior information about the relationship between genes and the phenotype to be optimized. (c) Perform the current batch of experimental perturbations in vitro and observe their corresponding phenotypic rewards. (d) Concatenate the latest batch's features and observed rewards to those of previous batches to update the perturbation reward training set. (e) Train a new perturbation reward regression network (with some degree of pre-defined optimism) on the observed perturbation rewards. (f) Use this regressor to predict the optimistic rewards for the currently unobserved perturbations. (g) Select the next batch from these unobserved perturbations with the highest optimistic reward.

Let us assume y⋆ satisfies min_{f∈F} ∥y⋆ − f∥_∞ ≤ ω, where ∥g∥_∞ = max_{a∈Z} |g(a)|. Let ȳ = arg min_{f∈F} ∥y⋆ − f∥_∞ be the ∥·∥_∞ projection of y⋆ onto F. If f_t is computed by solving Equation 1 with γ_t ≥ (t−1)ω²b and acquisition objective A_{max,b}, the response predictions of f_t over the values B_t satisfy

$$\sum_{a \in B_{*,t}} y^\star(a) \;\le\; b\omega + \sum_{a \in B_{*,t}} \bar{y}(a) \;\le\; b\omega + \sum_{a \in B_t} f_t(a).$$

Let G be a bipartite graph with node set A ∪ B such that |A| = K and |B| = b. If for all nodes v ∈ B it holds that |N(v)| ≥ K/2, then: 1. If K/2 ≥ b, there exists a perfect matching between the nodes in B and the nodes in A.
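The matching guarantee above can be checked computationally with a standard augmenting-path algorithm (Kuhn's algorithm); the small example below is illustrative, not from the paper:

```python
def max_bipartite_matching(adj, n_left, n_right):
    """Kuhn's augmenting-path algorithm. `adj[u]` lists the right-side
    neighbors of left node u. Returns the size of a maximum matching."""
    match_right = [-1] * n_right  # match_right[v] = left node matched to v

    def try_augment(u, seen):
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                # v is free, or its partner can be re-routed elsewhere
                if match_right[v] == -1 or try_augment(match_right[v], seen):
                    match_right[v] = u
                    return True
        return False

    return sum(try_augment(u, set()) for u in range(n_left))

# Illustration: K = 8 arms, batch side b = 3; every node in B has
# K/2 = 4 neighbors in A, and K/2 >= b, so B matches perfectly into A.
K, b = 8, 3
adj = [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7]]  # neighbors in A of each v in B
```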

Noisy Batch Selection Principle (OAE)
Input: Action set A ⊂ R^d, number of batches N, batch size b
Initialize: Observed points and labels dataset D_1 = ∅
For t = 1, ..., N:
  If t = 1: sample uniformly a size-b batch B_t ∼ U_1.

Section 4.1, by setting the acquisition function to A_avg(f, U) and the batch selection rule as in Equation 2 for the vanilla OAE methods, and as in Equations 6 and 8 for OAE's diversity-inducing versions OAE-DvD and OAE-Seq respectively. All neural network architectures use ReLU activations. All ensemble methods use an ensemble size of M = 10 and Xavier parameter initialization.

Figure 7: (top row) Synthetic one-dimensional datasets. (bottom) Evolution of MeanOpt on the MultiValleyOneDimHole dataset using λ_reg = 0.001. From left to right: iterations 5, 15, 25, 45.

E.1 VANILLA OAE

We test the performance of MeanOpt, HingePNorm, MaxOpt, Ensemble and EnsembleNoiseY (see Section 4.1 for a detailed description of each of these algorithms) over different values of the regularization parameter λ_reg, including the 'greedy' choice λ_reg = 0, henceforth referred to as Greedy. We compare these algorithms against each other and to the baseline method that selects B_t as a uniformly random batch of size b from the set U_t, henceforth referred to as RandomBatch.
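The optimistic-regularization idea behind MeanOpt can be sketched with a linear model: fit the observed rewards while a regularizer pushes predictions on unexplored arms upward. This is a simplified stand-in for the neural-network objective of Problem 3 (the function name and gradient-descent setup below are illustrative assumptions):

```python
import numpy as np

def mean_opt_fit(X_obs, y_obs, X_unobs, lam_reg=0.1, lr=0.05, steps=2000):
    """Fit a linear reward model with an optimistic regularizer:
    minimize squared fit error on observed arms minus lam_reg times the
    mean predicted reward on unexplored arms. lam_reg = 0 recovers the
    greedy (pure least-squares) fit."""
    w = np.zeros(X_obs.shape[1])
    n = len(y_obs)
    for _ in range(steps):
        grad_fit = 2.0 / n * X_obs.T @ (X_obs @ w - y_obs)
        grad_opt = -lam_reg * X_unobs.mean(axis=0)  # pushes unexplored scores up
        w -= lr * (grad_fit + grad_opt)
    return w
```

With lam_reg > 0 the fitted model systematically over-predicts on the unexplored arms relative to the greedy fit, which is exactly the optimism the acquisition step exploits.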

Figure 8: NN 10-5. τ -quantile batch convergence (left) and corresponding rewards over batches (right) on the OneHillValleyOneDim (easier) and MultiValleyOneDim (harder), OneHillValleyOneD-imHole and MultiValleyOneDimHole synthetic datasets show how the OAE algorithm can achieve higher reward faster than RandomBatch or Greedy. The neural network architecture has two hidden layers of sizes 10 and 5.

Figure 9: NN 100-10. τ -quantile batch convergence (left) and corresponding rewards over batches (right) on the suite of synthetic datasets when the neural network architecture has two hidden layers of sizes 100 and 10.

Second, in Figures 11 and 12 we evaluate MeanOpt versus ensemble implementations of OAE across the two neural network architectures NN 10-5 and NN 100-10. We observe that Ensemble performs competitively with all other methods on the one-dimensional datasets and outperforms all of them on OneHillValleyOneDimHole. On the multi-dimensional datasets, MeanOpt performs better than Ensemble, EnsembleNoiseY and Greedy. In this case the most optimistic version of MeanOpt (λ_reg = 0.1) is the best performing of all. This may indicate that in multimodal environments the optimism injected by the random initialization of the ensemble models in Ensemble, or by the reward noise in EnsembleNoiseY, does not induce an exploration strategy as effective as the explicit optimistic fit of MeanOpt. On unimodal datasets the opposite is true: MeanOpt with a large regularizer (λ_reg = 0.1) underperforms Ensemble, EnsembleNoiseY and Greedy. In Appendix G.1 the reader may find similar results for MaxOpt and HingeOpt.
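The ensemble route to optimism can be sketched as follows: fit several models on independently noise-perturbed copies of the observed rewards and score each unexplored arm optimistically across the ensemble. This is a linear least-squares stand-in for EnsembleNoiseY (the paper's ensembles are neural networks, and the max-aggregation here is an assumption about the acquisition):

```python
import numpy as np

def ensemble_noise_y_scores(X_obs, y_obs, X_unobs, M=10, noise_std=0.1, seed=0):
    """Optimism via an ensemble: fit M least-squares models, each on a
    noise-perturbed copy of the observed rewards, then score each
    unexplored arm by the max prediction across the ensemble."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(M):
        y_noisy = y_obs + noise_std * rng.standard_normal(len(y_obs))
        w, *_ = np.linalg.lstsq(X_obs, y_noisy, rcond=None)
        preds.append(X_unobs @ w)
    return np.max(preds, axis=0)  # optimistic score per unexplored arm
```

Replacing the target noise with random parameter initializations (and no noise) gives the analogous sketch of the plain Ensemble variant.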

Figure 11: NN 10-5. τ -quantile batch convergence of MeanOptimism vs EnsembleOptimism vs EnsembleNoiseY.

Figure 12: NN 100-10. τ -quantile batch convergence of MeanOptimism vs EnsembleOptimism vs EnsembleNoiseY.

Figure 13: Training set regression fit curves evolution over training for the suite of synthetic datasets. Each training batch contains 10 datapoints.

Figure 15: NN 10-5. τ -quantile batch convergence on the Adult and BikeSharingDay datasets with regression-fitted response values.

Figure 16: Training set regression fit curves evolution over training for the UCI datasets. Each training batch contains 10 datapoints.

Figure 17: Linear. τ -quantile batch convergence on the BikeSharingDay, BikeSharingHour and BlogFeedback datasets.

Figure 21: NN 100-10. τ -quantile batch convergence performance of DetD vs RandomBatch in the suite of synthetic datasets.

Figure 22: NN 10-5. τ -quantile batch convergence performance of DetD vs RandomBatch in the suite of genetic perturbation datasets.

Figure 24: NN 100-10. τ -quantile batch convergence performance of (left) MeanOpt vs SeqB, (center) Ensemble vs Ensemble -SeqB and (right) Ensemble -NoiseY vs Ensemble -SeqB -NoiseY on the suite of synthetic datasets.

Figure 23: NN 100-10. τ -quantile batch convergence performance of DetD vs RandomBatch in the suite of genetic perturbation datasets.

Figure 26: NN 10-5. τ -quantile batch convergence performance of (left) MeanOpt vs SeqB, (center) Ensemble vs Ensemble -SeqB and (right) Ensemble -NoiseY vs Ensemble -SeqB -NoiseY on the suite of synthetic datasets.

Figure 25: NN 100-10. τ -quantile batch convergence performance of MeanOpt vs SeqB on the gene perturbation datasets.

Figure 27: NN 100-10. τ -quantile batch convergence performance of (left) MeanOpt vs SeqB, (center) Ensemble vs Ensemble -SeqB and (right) Ensemble -NoiseY vs Ensemble -SeqB -NoiseY on the BikeSharingDay and BikeSharingHour datasets.

Figure 30: NN 10-5. HingePNorm comparison vs RandomBatch over the suite of synthetic datasets.

Figure 31: NN 100-10. HingePNorm comparison vs RandomBatch over the suite of synthetic datasets.

Figure 34: NN 100-10. τ -quantile batch convergence performance of (left) MeanOpt vs SeqB, (center) Ensemble vs Ensemble -SeqB and (right) Ensemble -NoiseY vs Ensemble -SeqB -NoiseY on the BlogFeedback dataset.

Figure 36: NN 100-10. MaxOptimism and HingePNormOptimism comparison vs RandomBatch.

Figure 37: NN 10-5. τ -quantile batch convergence of MeanOpt (left), Ensemble and EnsembleNoiseY (right) on genetic perturbation datasets.

Figure 38: NN 10-5. τ -quantile batch convergence performance of (left) MeanOpt vs SeqB, (center) Ensemble vs Ensemble -SeqB and (right) Ensemble -NoiseY vs Ensemble -SeqB -NoiseY on the genetic perturbation datasets.

Algorithm 1 Optimistic Arm Elimination Principle (OAE)
Input: Action set A ⊂ R^d, number of batches N, batch size b
Initialize: Unpulled arms U_1 = A; observed points and labels dataset D_1 = ∅
For t = 1, ..., N:
  If t = 1: sample uniformly a size-b batch B_t ∼ U_1.
  Else: solve for f_t and compute B_t ∈ arg max_{B ⊂ U_t, |B| = b} A(f_t, B).
  Observe batch rewards Y_t = {y*(a) for a ∈ B_t}.
  Update D_{t+1} = D_t ∪ {(B_t, Y_t)} and U_{t+1} = U_t \ B_t.
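A minimal numpy sketch of Algorithm 1. The `fit` and `acquire` arguments are placeholders for the regression step and the acquisition objective A; here the batch maximizing the average acquisition score is simply the top-b arms by score:

```python
import numpy as np

def oae(actions, y_star, fit, acquire, num_batches, b, rng):
    """Optimistic Arm Elimination sketch following Algorithm 1.
    `fit(X, y)` returns a reward model; `acquire(model, A)` scores arms."""
    U = list(range(len(actions)))          # unpulled arm indices, U_1 = A
    X, y = [], []                          # observed dataset D_t
    for t in range(num_batches):
        if t == 0:                         # first batch: uniform sample
            batch = list(rng.choice(U, size=b, replace=False))
        else:                              # later batches: optimistic top-b
            model = fit(np.array(X), np.array(y))
            scores = acquire(model, actions[U])
            batch = [U[i] for i in np.argsort(scores)[-b:]]
        for a in batch:                    # observe rewards, update D and U
            X.append(actions[a])
            y.append(y_star(actions[a]))
            U.remove(a)
    return X, y
```

With a least-squares `fit` and a linear `acquire`, two batches on a small linear problem already recover the best arm once the first batch identifies the reward direction.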

Hit Ratio results after 50 batches of size 16. BNN model trained with Achilles treatment descriptors. The final Hit Ratio is the average of 5 independent runs. TopUncertain selects the 16 points with the largest uncertainty values and SoftUncertain samples them using a softmax distribution.

6 CONCLUSION

In this work we introduce a variety of algorithms inspired by the OAE principle for noiseless batch bandit optimization. We also show lower bounds on the query complexity for linear and Lipschitz classes as well as a novel regret upper bound in terms of the Eluder dimension of the query class.

The equivalence between this definition of B_t and the sequential batch selection rule follows by noting that the equality constraint from Equation 9 ensures that f_{t,i} = f_t is a valid solution for each of the intermediate problems defining the sequence {(f_{t,i}, U_{t,i})}_{i=1}^{B}. OAE-Seq is designed with a more general batch selection rule that may yield distinct intermediate arm selection functions {f_{t,i}}. In our experiments we compute the virtual reward y_{t,i} as an average of optimistic and pessimistic predictions.
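The sequential in-batch selection rule can be sketched as follows: pick arms one at a time, appending a virtual response for each pick (here an α-weighted average of an optimistic and a pessimistic prediction) and refitting before the next pick. This is a linear-model stand-in for OAE-Seq; the one-residual-std optimism/pessimism spread is an illustrative assumption, not the paper's construction:

```python
import numpy as np

def seq_batch_select(X_obs, y_obs, U_feats, b, alpha=0.5):
    """OAE-Seq style in-batch sequential selection (sketch). After each
    pick we append a virtual response and refit before the next pick."""
    X, y = list(X_obs), list(y_obs)
    chosen, remaining = [], list(range(len(U_feats)))
    for _ in range(b):
        w, *_ = np.linalg.lstsq(np.array(X), np.array(y), rcond=None)
        scores = U_feats[remaining] @ w
        i = remaining[int(np.argmax(scores))]
        chosen.append(i)
        remaining.remove(i)
        spread = np.std(np.array(y))  # crude optimism/pessimism width
        pred = U_feats[i] @ w
        virtual = alpha * (pred + spread) + (1 - alpha) * (pred - spread)
        X.append(U_feats[i])
        y.append(virtual)
    return chosen
```

Note that with α = 1/2 the optimistic and pessimistic terms cancel and the virtual response equals the model's own prediction, matching the experimental choice above.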

The remaining two datasets, MultiValleyOneDim and MultiValleyOneDimHole, are built with the problem of multimodal optimization in mind. Each of these datasets has 4 local maxima. We use MultiValleyOneDim to test OAE's ability to avoid getting stuck in local optima. The second dataset mimics the construction of the OneHillValleyOneDimHole dataset and, on top of testing the algorithm's ability to escape local optima, is also meant to test what happens when the global optimum's neighborhood is not present in the dataset. Since one of the algorithms we test is the greedy algorithm (corresponding to λ_reg = 0), the 'Hole' datasets are meant to present a challenging situation for this class of algorithms.
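A dataset in the spirit of MultiValleyOneDimHole can be sketched as a 1-D response with 4 local maxima from which the arms nearest the global optimum are removed. The particular functional form below is an illustrative assumption; the paper's exact construction may differ:

```python
import numpy as np

def multi_valley_one_dim_hole(n=500, hole_center=None, hole_width=0.05, seed=0):
    """Sketch of a MultiValleyOneDimHole-style dataset: a 1-D multimodal
    response with 4 local maxima, with the arms nearest the global
    optimum removed (the 'hole')."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, size=n)
    y = np.sin(8 * np.pi * x) * (0.5 + 0.5 * x)  # 4 peaks of growing height
    if hole_center is None:
        hole_center = x[np.argmax(y)]            # carve out the global optimum
    keep = np.abs(x - hole_center) > hole_width
    return x[keep], y[keep]
```

A greedy learner fit on such data tends to chase the visible (suboptimal) peaks, which is exactly the failure mode the 'Hole' datasets are designed to expose.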


In the second scenario, let us first prove there exists a subset A′ of A of size at least K/4 such that every element in A′ has at least b/4 neighbors in B. We prove this condition by way of contradiction. There are at least bK/2 edges in the graph. Suppose there were at most L vertices in A with degree greater than or equal to b/4. Since the maximum number of edges a vertex in A can have equals b, this value of L must satisfy the inequality

$$Lb + (K - L)\frac{b}{4} \ge \frac{bK}{2}.$$

Thus, L ≥ K/3 − 1. All nodes in L have degree at least b/4. If we restrict ourselves to a subset L̃ of L of size K/8, then since in this scenario K/8 < b/4, Hall's marriage theorem implies there is a perfect matching from L̃ into B. The result follows.

Lemma D.6 (Lemma 2 of Russo & Van Roy (2013)). Let F_t = {f : Σ_{(a,y)∈D_t} (f(a) − y)² ≤ γ_t}; then with probability 1, for all T ∈ N,

where

Proof. The proof of Lemma D.6 follows the same proof template as that of Lemma 2 in Russo & Van Roy (2013). We reproduce it here for completeness. For ease of notation we use d = dim_E(F, α_T). We first re-order the sequence

where inequality (i) holds by definition of α_T, noting that

Proposition D.4 (since d ≥ dim_E(F, ϵ) for all ϵ ≥ α_T) implies that for all i_j with r_{i_j} > α_T, we can bound j as

We implemented and tested the DetD algorithm described in Section C.0.1. In our experiments we set λ_div = 1 and set f_t to be the result of solving the MeanOpt objective (see Problem 3) for different values of λ_reg. We set K to satisfy

We see that across the suite of synthetic datasets and architectures (NN 10-5 and NN 100-10) the performance of MeanOpt degrades when diversity is enforced (see Figures 20 and 21, and compare with Figures 8 and 9). A similar phenomenon is observed in the suite of UCI datasets (see Figure 19 for DetD results on the BikeSharing dataset and Figure 14 for comparison). In contrast, we note that DetD beats the performance of RandomBatch on the A375, MCF7 and HA1E datasets and over the two neural architectures NN 10-5 and NN 100-10.
Nonetheless, DetD is not able to beat RandomBatch on the VCAP dataset. These results indicate that a diversity objective that relies only on the geometry of the arm space and does not take into account the response values may be beneficial when y⋆ ∉ F, but could lead to a deterioration in performance when y⋆ ∈ F. The algorithm designer should be careful when balancing diversity objectives and purely optimism-driven exploration strategies, since the optimal combination depends on the nature of the dataset. It remains an interesting avenue for future research to design strategies that diagnose in advance the appropriate diversity/optimism balance for OAE-DvD.

