BIDIRECTIONAL LEARNING FOR OFFLINE MODEL-BASED BIOLOGICAL SEQUENCE DESIGN

Anonymous

Abstract

Offline model-based optimization aims to maximize a black-box objective function with a static dataset of designs and their scores. In this paper, we focus on biological sequence design to maximize some sequence score. A recent approach employs bidirectional learning, combining a forward mapping for exploitation and a backward mapping for constraint, and it relies on the neural tangent kernel (NTK) of an infinitely wide network to build a proxy model. Though effective, the NTK cannot learn features because of its parametrization, and its use prevents the incorporation of powerful pre-trained Language Models (LMs) that can capture the rich biophysical information in millions of biological sequences. We adopt an alternative proxy model, adding a linear head to a pre-trained LM, and propose a linearization scheme. This yields a closed-form loss and also takes into account the biophysical information in the pre-trained LM. In addition, the forward mapping and the backward mapping play different roles and thus deserve different weights during sequence optimization. To achieve this, we train an auxiliary model and leverage its weak supervision signal via a bi-level optimization framework to effectively learn how to balance the two mappings. Further, by extending the framework, we develop Adaptive-η, the first learning-rate adaptation module, which is compatible with all gradient-based algorithms for offline model-based optimization. Experimental results on DNA/protein sequence design tasks verify the effectiveness of our algorithm. Our code is available here.
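The closed-form property mentioned above follows from the head being linear in frozen features: with a fixed featurizer, squared-error regression admits an exact ridge solution. The sketch below illustrates this under simplifying assumptions; `embed` is a hypothetical placeholder for a pre-trained LM encoder (here just base counts), and the tiny dataset is invented for illustration.

```python
import numpy as np

def embed(seqs):
    # Placeholder featurizer standing in for frozen pre-trained LM embeddings:
    # counts of the four DNA bases per sequence (illustrative assumption).
    vocab = "ACGT"
    return np.array([[s.count(c) for c in vocab] for s in seqs], dtype=float)

# A toy static offline dataset of sequences and scores.
seqs = ["ACGT", "AAGT", "CCGT", "ACGG"]
y = np.array([1.0, 0.5, 0.2, 0.8])

X = embed(seqs)
lam = 1e-2  # ridge regularization strength

# Because the head is linear in the frozen features, the squared loss has a
# closed-form minimizer: w = (X^T X + lam I)^{-1} X^T y.
w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
pred = X @ w  # proxy scores for the training sequences
```

Swapping in real LM embeddings changes only `embed`; the closed-form fit of the linear head is unchanged.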

1. INTRODUCTION

Offline model-based optimization aims to maximize a black-box objective function with a static dataset of designs and their scores. This offline setting is realistic since in many real-world scenarios we do not have interactive access to the ground-truth evaluation. The design tasks of interest include materials, aircraft, and biological sequences (Trabucco et al., 2021). In this paper, we focus on biological sequence design, including DNA and protein sequences, with the goal of maximizing a specified property of these sequences. A wide variety of methods have been proposed for biological sequence design, including evolutionary algorithms (Sinai et al., 2020; Ren et al., 2022), reinforcement learning methods (Angermueller et al., 2019), Bayesian optimization (Terayama et al., 2021), search/sampling using generative models (Brookes et al., 2019; Chan et al., 2021), and GFlowNets (Jain et al., 2022). Recently, gradient-based techniques have emerged as an effective alternative (Trabucco et al., 2021). These approaches first train a deep neural network (DNN) on the static dataset as a proxy and then obtain new designs by directly performing gradient ascent steps on existing designs. Such methods have been widely used in biological sequence design (Norn et al., 2021; Tischer et al., 2020; Linder & Seelig, 2020). One obstacle is the out-of-distribution issue, where the trained proxy model is inaccurate for newly generated sequences. To mitigate this issue, recent work proposes regularization of the model (Trabucco et al., 2021; Yu et al., 2021; Fu & Levine, 2021) or of the design itself (Chen et al., 2022). The first category focuses on training a better proxy by introducing inductive biases such as robustness (Yu et al., 2021). The second category introduces bidirectional learning (Chen et al., 2022), which consists of a forward mapping and a backward mapping, to optimize the design directly.
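The gradient-based recipe described above (train a proxy on the static dataset, then ascend existing designs under it) can be sketched minimally as follows. This is an illustrative toy, not the paper's method: a linear least-squares fit stands in for the DNN proxy, and designs are treated as continuous feature vectors.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8

# Synthetic static offline dataset: designs X with noisy linear scores y.
w_true = rng.normal(size=d)
X = rng.normal(size=(64, d))
y = X @ w_true + 0.01 * rng.normal(size=64)

# Step 1: train a proxy on the static data (least squares stands in for a DNN).
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

def proxy(x):
    return float(x @ w_hat)

# Step 2: gradient ascent on an existing design. For this linear proxy the
# gradient of the predicted score with respect to the design is simply w_hat.
x = X[0].copy()
eta = 0.1  # step size
scores = [proxy(x)]
for _ in range(10):
    x = x + eta * w_hat  # ascent step applied to the design itself
    scores.append(proxy(x))
```

The out-of-distribution issue discussed above arises exactly here: after enough ascent steps, `x` drifts away from the data the proxy was trained on, so its predicted score can no longer be trusted.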
Specifically, the backward mapping leverages the high-scoring design to predict the static dataset, and vice versa for the forward mapping.
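The interplay between the two mappings can be sketched as a weighted two-term objective, evaluated for a candidate high-scoring design. This is a hedged illustration, not the paper's implementation: both proxies are ridge-regression models, and the trade-off weight `gamma` and target score `y_target` are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 5

# Toy static offline dataset with noiseless linear scores.
X = rng.normal(size=(40, d))
y = X @ rng.normal(size=d)
y_target = y.max() + 1.0  # ambitious target score for the candidate design

lam = 1e-2

def ridge_fit(A, b):
    # Closed-form ridge solution: (A^T A + lam I)^{-1} A^T b.
    return np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ b)

w_fwd = ridge_fit(X, y)  # forward proxy, trained on the static dataset

def bidirectional_loss(x_h, gamma=0.5):
    # Forward term: the static-data proxy should assign x_h the target score.
    fwd = (x_h @ w_fwd - y_target) ** 2
    # Backward term: a proxy fit on the single pair (x_h, y_target) should
    # still explain the static dataset.
    w_bwd = ridge_fit(x_h[None, :], np.array([y_target]))
    bwd = np.mean((X @ w_bwd - y) ** 2)
    return gamma * fwd + (1.0 - gamma) * bwd

loss = bidirectional_loss(X[0].copy())
```

Minimizing such a loss over `x_h` pushes the design toward high predicted scores (forward term) while the backward term anchors it to remain consistent with the offline data.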

