ADALEAD: A SIMPLE AND ROBUST ADAPTIVE GREEDY SEARCH ALGORITHM FOR SEQUENCE DESIGN

Abstract

Efficient design of biological sequences will have a great impact across many industrial and healthcare domains. However, discovering improved sequences requires solving a difficult optimization problem. Traditionally, this challenge was approached by biologists through a model-free method known as "directed evolution", the iterative process of random mutation and selection. As the ability to build models that capture the sequence-to-function map improves, such models can be used as oracles to screen sequences before running experiments. In recent years, interest in better algorithms that effectively use such oracles to outperform model-free approaches has intensified. These range from approaches based on Bayesian optimization to regularized generative models and adaptations of reinforcement learning. In this work, we implement an open-source Fitness Landscape EXploration Sandbox (FLEXS) environment to test and evaluate these algorithms based on their optimality, consistency, and robustness. Using FLEXS, we develop an easy-to-implement, scalable, and robust evolutionary greedy algorithm (AdaLead). Despite its simplicity, we show that AdaLead is a remarkably strong benchmark that out-competes more complex state-of-the-art approaches in a variety of biologically motivated sequence design challenges.

1. INTRODUCTION

An important problem across many domains in biology is the challenge of finding DNA, RNA, or protein sequences which perform a function of interest at a desired level. This task is challenging for two reasons: (i) the map φ between sequences X = {x_1, ..., x_n} and their biological function y = {y_1, ..., y_n} is non-convex, and (ii) φ has sparse support in the space of possible sequences A^L, which grows exponentially with the sequence length L for alphabet A. This map φ is otherwise known as a "fitness landscape" (de Visser et al., 2018). Currently, the most widely used practical approach in sequence design is "directed evolution" (Arnold, 1998), where populations of biological entities are selected through an assay according to their function y, with each iteration becoming more stringent in the selection criteria. However, this model-free approach relies on evolutionary random walks through the sequence space, and most attempted optimization steps (mutations) are discarded due to their negative impact on y. Recent advances in DNA sequencing and synthesis technologies allow large assays which query y for specific sequences x with up to 10^5 physical samples per batch (Barrera et al., 2016). This development presents an opening for machine learning to contribute to building better surrogate models φ′ : X → y which approximate the ground-truth oracle φ. We may use these models φ′ as proxies of φ in order to generate and screen sequences in silico before they are sent for synthesis (Yang et al., 2019; Fox et al., 2007). While a large body of work has focused on building better local approximate models φ′ from already published data (Otwinowski et al., 2018; Alipanahi et al., 2015; Riesselman et al., 2017; Sinai et al., 2017), more recent work has focused on optimization in this setting (Biswas et al., 2018; Angermueller et al., 2020; Brookes & Listgarten, 2018; Brookes et al., 2019).
Although synthesizing many sequences within a batch is now possible, the labor-intensive nature of the process means that only a handful of iterations of learning can be performed. Hence data is often collected in serial batches b_i, comprising the dataset D_t = {b_0, ..., b_t}, and the problem of sequence design is generally cast as that of proposing batches so that we may find the optimal sequence x*_t = arg max_{x ∈ D_t} φ(x) over the course of these experiments. In this paper, we focus our attention on ML-augmented exploration algorithms which use (possibly non-differentiable) surrogate models φ′ to improve the process of sequence design. While the work is set in an active learning context, in which an algorithm may select samples to be labelled, with data arriving in batches b_i, our primary objective is black-box optimization rather than improving the accuracy of the surrogate model. We write E_θ(D, φ′) to denote an exploration algorithm with parameters θ, which relies on dataset D and surrogate model φ′. When the context is clear, we will simply use E as shorthand. In most contexts, the sequence space is large enough that even computational evaluation is limited to a very small subset of possible options. For this reason, we consider the optimization as sample-restricted, not only in the number of queries to the ground-truth oracle, but also in the number of queries to the surrogate model. (Among other reasons, this allows us to thoroughly study the algorithms on landscapes that can be brute-forced, simulating the common situation in which the sequence space is very large compared to available computation.) The algorithm E may perform v sequence evaluations in silico for every sequence proposed. For example, v × B samples may be evaluated by the model before B samples are proposed for measurement. Ideally, E should propose strong sequences even when v is small; that is, the algorithm should not need to evaluate many sequences to arrive at a strong one.
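The batched, sample-restricted setting described above can be sketched as a generic loop: refit the surrogate on all measured data, let the explorer spend its in-silico budget of v × B model queries, then measure the top B proposals with the ground-truth oracle. The interfaces below (`explorer`, `surrogate_fit`, `surrogate_predict`, and the toy fitness function) are illustrative assumptions for this sketch, not the FLEXS API:

```python
import random

def run_design_loop(explorer, surrogate_fit, surrogate_predict, oracle,
                    initial_batch, rounds, batch_size, v):
    """Generic model-guided batched design loop.

    Each round: refit the surrogate phi' on all measured data D_t, let the
    explorer generate at most v * batch_size candidates in silico, then
    measure the top batch_size proposals with the ground-truth oracle phi.
    """
    data = {x: oracle(x) for x in initial_batch}            # D_0
    for _ in range(rounds):
        model = surrogate_fit(data)                         # fit phi' on D_t
        candidates = explorer(data, v * batch_size)         # in-silico budget
        ranked = sorted(candidates,
                        key=lambda x: surrogate_predict(model, x),
                        reverse=True)
        for x in ranked[:batch_size]:                       # next batch b_{t+1}
            data[x] = oracle(x)                             # ground-truth query
    best = max(data, key=data.get)
    return best, data[best]

# Toy instantiation: maximize the number of 'A's in a length-8 sequence.
random.seed(0)
ALPHABET = "ACGT"
oracle = lambda s: s.count("A")                             # toy ground truth

def explorer(data, budget):
    # Propose random single mutants of the best measured sequences.
    seeds = sorted(data, key=data.get, reverse=True)[:3]
    proposals = []
    for _ in range(budget):
        s = list(random.choice(seeds))
        s[random.randrange(len(s))] = random.choice(ALPHABET)
        proposals.append("".join(s))
    return proposals

surrogate_fit = lambda data: dict(data)                     # memorize D_t
def surrogate_predict(model, x):
    # Naive phi': fitness of the nearest measured sequence (Hamming distance).
    nearest = min(model, key=lambda m: sum(a != b for a, b in zip(x, m)))
    return model[nearest]

best, fitness = run_design_loop(explorer, surrogate_fit, surrogate_predict,
                                oracle, ["GGGGGGGG", "CCCCCCCC"],
                                rounds=10, batch_size=5, v=4)
```

Even with this deliberately weak nearest-neighbor surrogate, the loop makes progress because each round's measured batch refreshes the seed pool; a stronger model simply makes the in-silico ranking more informative.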

2. CONTRIBUTIONS

In this study, we make three main contributions towards improving algorithms for sequence design:

1. To build on recent progress in biological sequence design, the research community needs good benchmarks and reference algorithm implementations against which to compare new methods. We implement an open-source simulation environment, FLEXS, that can emulate complex biological landscapes and can be readily used for training and evaluating sequence design algorithms. We hope that FLEXS will help ensure meaningful and reproducible research results and accelerate the process of algorithm development for ML-guided biological sequence design.

2. We introduce an abstracted oracle to allow the empirical study of exploration strategies, independent of the underlying models. This helps us understand relevant properties of the algorithms, such as robustness and consistency.

3. Inspired by evolutionary and Follow-the-Perturbed-Leader approaches in combinatorial optimization, we propose a simple model-guided greedy approach, termed Adapt-with-the-Leader (ADALEAD). ADALEAD is simple to implement and is competitive with previous state-of-the-art algorithms. We propose ADALEAD as a strong, accessible baseline for testing sequence design algorithms. We show that, in general, simple evolutionary algorithms are strong benchmarks to compete against and should be included in future analyses of new methods.
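The core ideas behind a model-guided greedy evolutionary explorer of this kind (selecting seeds near the current best, then mutating greedily while the surrogate predicts improvement) can be illustrated with a condensed, hypothetical sketch. The function names, the threshold parameter `kappa`, the mutation rate `mu`, and the rollout budget below are assumptions made for illustration, not the reference ADALEAD implementation in FLEXS:

```python
import random

def greedy_explorer_round(measured, predict, propose_size, kappa=0.05,
                          mu=1.0, alphabet="ACGT", rollout_budget=200):
    """One round of a simplified AdaLead-style greedy explorer (sketch).

    measured: dict mapping sequence -> ground-truth fitness (D_t).
    predict:  surrogate phi'(x) used to screen candidates in silico.
    Seeds within (1 - kappa) of the best measured fitness are mutated
    greedily; a child replaces its parent in the rollout only when the
    surrogate predicts an improvement.
    """
    best = max(measured.values())
    seeds = [x for x, y in measured.items() if y >= (1 - kappa) * best]

    def mutate(s):
        s = list(s)
        for i in range(len(s)):
            if random.random() < mu / len(s):   # ~mu expected substitutions
                s[i] = random.choice(alphabet)
        return "".join(s)

    candidates = {}
    budget = rollout_budget                      # in-silico query budget
    while budget > 0:
        node = random.choice(seeds)
        score = predict(node)
        while budget > 0:                        # greedy hill-climb on phi'
            child = mutate(node)
            budget -= 1
            child_score = predict(child)
            candidates[child] = child_score
            if child_score <= score:             # no predicted gain: stop
                break
            node, score = child, child_score
    top = sorted(candidates, key=candidates.get, reverse=True)
    return top[:propose_size]                    # batch to measure next

# Toy usage: surrogate = count of 'A's; one high-fitness seed survives
# the (1 - kappa) threshold and drives the rollouts.
random.seed(1)
measured = {"GGGG": 0.0, "AGGG": 1.0}
proposals = greedy_explorer_round(measured, predict=lambda s: s.count("A"),
                                  propose_size=4)
```

The adaptive element is the threshold: as the best measured fitness rises, the seed pool automatically tightens around the current leaders, without any landscape-specific tuning.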

3. EVALUATION

We evaluate the algorithms on a set of criteria designed to reflect both the biological applicability and the soundness of the algorithms considered (Purohit et al., 2018). We run the algorithms using FLEXS, where all of these algorithms and criteria evaluators are implemented.

• Optimization: We let maximization be the objective. Most optimization algorithms operate under the assumption that critical information, such as the best possible y* or the set of all local maxima M, is unknown. While it is reasonable to assume that the best sequence is the one with the highest fitness, this is not necessarily the case in reality. For instance, we might wish to bind a particular target, but binding it too strongly may be less desirable than binding it at a moderate level. As measurements of this criterion, we consider the maximum y = max_x φ(x) over all sequences considered, as well as the cardinality |S|, where S = {x_i | φ(x_i) > y_τ} and y_τ > 0 is a minimum threshold value. Notably, we often do not know whether any solutions with y > y_τ exist, so finding many such solutions is a sign of an algorithm's strength.

• Robustness: A major challenge for input design in model-guided algorithms is that optimizing directly on the surrogate φ′ can result in proposing a sequence x with large error, instead of approximating x* (e.g. if the proposed input x is far outside D). Additionally, while biological
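The two optimization measurements above (the best fitness found and the number of distinct solutions exceeding the threshold y_τ) can be computed directly from a set of measured sequences. The helper below is a hypothetical illustration of these criteria, not a FLEXS evaluator:

```python
def optimization_metrics(measured, y_tau):
    """Optimization criteria: best fitness observed, and |S| where
    S = {x_i | phi(x_i) > y_tau}, over measured sequence -> fitness pairs."""
    best = max(measured.values())
    num_solutions = sum(1 for y in measured.values() if y > y_tau)
    return best, num_solutions

# Toy usage: two of the three measured sequences clear the threshold.
result = optimization_metrics({"AAAA": 0.95, "AAAG": 0.70, "GGGG": 0.10},
                              y_tau=0.5)  # -> (0.95, 2)
```

Reporting |S| alongside the maximum distinguishes algorithms that find a single lucky optimum from those that consistently populate the high-fitness region.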

