ADALEAD: A SIMPLE AND ROBUST ADAPTIVE GREEDY SEARCH ALGORITHM FOR SEQUENCE DESIGN

Abstract

Efficient design of biological sequences will have a great impact across many industrial and healthcare domains. However, discovering improved sequences requires solving a difficult optimization problem. Traditionally, biologists have approached this challenge through a model-free method known as "directed evolution": the iterative process of random mutation and selection. As our ability to build models that capture the sequence-to-function map improves, such models can be used as oracles to screen sequences before running experiments. In recent years, interest has intensified in algorithms that use such oracles effectively to outperform model-free approaches. These approaches span Bayesian optimization, regularized generative models, and adaptations of reinforcement learning. In this work, we implement an open-source Fitness Landscape EXploration Sandbox (FLEXS) environment to test and evaluate these algorithms based on their optimality, consistency, and robustness. Using FLEXS, we develop an easy-to-implement, scalable, and robust evolutionary greedy algorithm (AdaLead). Despite its simplicity, we show that AdaLead is a remarkably strong benchmark that out-competes more complex state-of-the-art approaches in a variety of biologically motivated sequence design challenges.

1. INTRODUCTION

An important problem across many domains in biology is the challenge of finding DNA, RNA, or protein sequences that perform a function of interest at a desired level. This task is challenging for two reasons: (i) the map φ between sequences X = {x_1, …, x_n} and their biological function y = {y_1, …, y_n} is non-convex, and (ii) it has sparse support in the space of possible sequences A^L, which grows exponentially in the sequence length L for alphabet A. This map φ is otherwise known as a "fitness landscape" (de Visser et al., 2018). Currently, the most widely used practical approach in sequence design is "directed evolution" (Arnold, 1998), where populations of biological entities are selected through an assay according to their function y, with each iteration applying a more stringent selection criterion. However, this model-free approach relies on evolutionary random walks through sequence space, and most attempted optimization steps (mutations) are discarded due to their negative impact on y. Recent advances in DNA sequencing and synthesis technologies allow large assays that query y for specific sequences x, with up to 10^5 physical samples per batch (Barrera et al., 2016). This development presents an opening for machine learning to contribute by building better surrogate models φ′ : X → y which approximate the oracle φ that maps each sequence to its true function. We may use these models φ′ as proxies for φ in order to generate and screen sequences in silico before they are sent for synthesis (Yang et al., 2019; Fox et al., 2007). While a large body of work has focused on building better local approximate models φ′ on already published data (Otwinowski et al., 2018; Alipanahi et al., 2015; Riesselman et al., 2017; Sinai et al., 2017), more recent work addresses optimization in this setting (Biswas et al., 2018; Angermueller et al., 2020; Brookes & Listgarten, 2018; Brookes et al., 2019).
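The model-free directed-evolution loop described above (mutate, assay, select, repeat) can be sketched as a toy simulation. Everything concrete below is an illustrative assumption, not part of any published protocol: the nucleotide alphabet, the per-position mutation rate, the population size, the elitist selection scheme, and the match-counting fitness function standing in for a wet-lab assay.

```python
import random

ALPHABET = "ACGT"  # hypothetical alphabet for this toy example


def mutate(seq, rate=0.02):
    """Apply independent point mutations at a fixed per-position rate."""
    return "".join(
        random.choice(ALPHABET) if random.random() < rate else c for c in seq
    )


def directed_evolution(fitness, seed_seq, pop_size=100, rounds=10, top_frac=0.2):
    """Model-free directed evolution: mutate, assay, keep the fittest.

    Survivors are carried over unchanged (elitism), so the best observed
    fitness never decreases across rounds.
    """
    population = [seed_seq] * pop_size
    for _ in range(rounds):
        ranked = sorted(population, key=fitness, reverse=True)
        survivors = ranked[: max(1, int(top_frac * pop_size))]
        mutants = [
            mutate(random.choice(survivors))
            for _ in range(pop_size - len(survivors))
        ]
        population = survivors + mutants
    return max(population, key=fitness)


# Toy stand-in for an assay: count positions matching a hidden target.
random.seed(0)  # reproducibility of this sketch
target = "ACGTACGTACGT"
fitness = lambda s: sum(a == b for a, b in zip(s, target))
best = directed_evolution(fitness, seed_seq="A" * len(target))
```

Note that most mutants score worse than their parents and are discarded by the selection step, which is exactly the inefficiency of model-free search that motivates surrogate-guided approaches.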
Although synthesizing many sequences within a batch is now possible, the labor-intensive nature of the process means that only a handful of iterations of learning can be performed. Hence, data are often collected in serial batches b_i, comprising the dataset D_t = {b_0, …, b_t}, and the problem of sequence design is generally cast as that of proposing batches so that we may find the optimal sequence x*_t = arg max_{x ∈ D_t} φ(x) over the course of these experiments.
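The batch setting can be made concrete with a minimal sketch: each round, a surrogate screens randomly generated candidates in silico, the top-ranked batch b_i is "measured" by the oracle, and the incumbent x*_t is the argmax over everything measured so far, D_t. The function names (propose_batch, design_loop), the alphabet, and the toy oracle are all illustrative assumptions; for simplicity the surrogate here is the oracle itself, i.e. a perfect model.

```python
import random

ALPHABET = "ACGT"  # hypothetical alphabet for this toy example


def mutate(seq, rate=0.05):
    """Independent point mutations at a fixed per-position rate."""
    return "".join(
        random.choice(ALPHABET) if random.random() < rate else c for c in seq
    )


def propose_batch(measured, surrogate, batch_size, n_candidates=500):
    """In-silico screening: generate mutants of measured sequences and
    keep the batch_size candidates the surrogate ranks highest."""
    parents = list(measured)
    candidates = {mutate(random.choice(parents)) for _ in range(n_candidates)}
    candidates -= set(measured)  # propose only unmeasured sequences
    return sorted(candidates, key=surrogate, reverse=True)[:batch_size]


def design_loop(oracle, surrogate, seed_seq, rounds=3, batch_size=10):
    """Serial batches b_0, ..., b_t: each round proposes a batch with the
    surrogate, then measures it with the oracle. Returns the incumbent
    x*_t = arg max over the measured dataset D_t."""
    data = {seed_seq: oracle(seed_seq)}  # D_0
    for _ in range(rounds):
        for x in propose_batch(data, surrogate, batch_size):
            data[x] = oracle(x)  # stand-in for wet-lab measurement
        # a real pipeline would retrain the surrogate on D_t here
    return max(data, key=data.get), data


random.seed(1)  # reproducibility of this sketch
target = "ACGTACGTACGT"
oracle = lambda s: sum(a == b for a, b in zip(s, target))
best, data = design_loop(oracle, surrogate=oracle, seed_seq="A" * len(target))
```

The key constraint the sketch captures is that rounds, not total measurements, are the scarce resource: batch_size can be large, but the loop body runs only a handful of times.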

